Estimate Your English Vocabulary Using Python

October 16, 2016

We take our mother tongue for granted, a language we learn as young children without realizing the effort involved. It is only when as adults we try to pick up another language that we fully understand how much hard work surrounds each acquired word.

Depending on who you listen to, estimates vary as to the size of a typical native English speaker’s vocabulary. The ballpark figures seem to put most adults under 20 thousand words, while graduates achieve somewhere around 23 thousand words. It’s a subject [Alex Eames] became interested in after reading a BBC article on it, and he decided to write his own software to produce a personal estimate.

His Python script takes the Scrabble word list and presents the user with a series of words drawn from it, asking them to indicate whether they know each one. After a hundred words have been presented, it calculates an estimate of the size of the user’s vocabulary. [Alex] wrote it on and for the Raspberry Pi, but it should work quite happily on any platform with Python 3. It certainly had no problem with our Ubuntu-based PC.
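
The arithmetic behind an estimate like this can be sketched in a few lines. What follows is not [Alex]’s script, just a minimal illustration that assumes the words are sampled uniformly at random and that the known fraction is scaled up to the size of the full list. The function names, the 170,000-word list size and the 12-out-of-100 score are all hypothetical.

```python
import math
import random


def sample_words(words, sample_size, seed=None):
    """Pick sample_size distinct words uniformly at random from the list."""
    rng = random.Random(seed)
    return rng.sample(words, sample_size)


def estimate_vocabulary(known, sample_size, total_words):
    """Scale the fraction of sampled words the user knew up to the whole list."""
    return round(known * total_words / sample_size)


def margin_of_error(known, sample_size, total_words, z=1.96):
    """Rough 95% margin, using the normal approximation to a binomial proportion."""
    p = known / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    return round(z * standard_error * total_words)


if __name__ == "__main__":
    # Hypothetical inputs: a 170,000-word list and a user who knew 12 of 100 words.
    total_words = 170_000
    known, sample_size = 12, 100

    estimate = estimate_vocabulary(known, sample_size, total_words)
    margin = margin_of_error(known, sample_size, total_words)
    print(f"Estimated vocabulary: {estimate} words, give or take about {margin}")

    # Drawing a small sample from a toy word list (assumed: one word per entry).
    toy_list = ["cat", "ununited", "tsktsk", "python", "nandin", "set"]
    print(sample_words(toy_list, 3, seed=42))
```

Under these assumptions a 100-word run carries a margin of several thousand words, and doubling the sample to 200 words narrows it by a factor of about 1.4, at the cost of a longer test.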

There is plenty of opportunity for bragging over the size of one’s vocabulary with a script like this one, but it’s something of a statistical leveler in that if you are truthful in your responses it will almost certainly put you exactly where you might expect for your age or level of education. If you want to know the result this script returned for a Hackaday scribe, for example, the answer is 23554.

This subject is a slight departure into software from our usual hardware subject matter, but it’s one of those tests that becomes rather a consuming interest when performed competitively among a group of friends. How well will you fare?

Via [Recantha]

11 thoughts on “Estimate Your English Vocabulary Using Python”

Carl Smith says:

October 16, 2016 at 9:53 am

It works in Windows 10, but if you just double-click the vocab.py file the window will close when it finishes and you won’t have a chance to see the results. So you need to run it from a command prompt or from IDLE.

QuantumRand says:

October 16, 2016 at 11:28 am

Cool concept, but the results were extremely inconsistent for me. I gave it 4 runs and scored anywhere from 26k to as low as 11k. Most of the words it gives are just really unusual forms of regular words, for example “ununited” or “antibiotically.” There were also words that weren’t exactly words, like “tsktsk.” It’s essentially a garble of words you make in Scrabble by adding a few tiles where you can, which makes sense considering the source.

Perhaps using an abridged standard dictionary would yield better results.

1. Alex Eames says:
  
  October 16, 2016 at 2:20 pm
  
  I think you’d possibly benefit from increasing the value of “iterations” in the script. My results are fairly steady between 18000 and 20000. It’s a bit of a trade-off between “how long does this darn thing take?” and “how accurate can we be?” But at 200 iterations I think your results would be more consistent.
  
  I completely agree that a better dictionary source would be better. Ideally, one containing just the “stem” words rather than all the variants as well. Of course the script doesn’t care what file you feed it as long as it’s one word per line. It could even be in another language. :)
  
Yenrabbit says:

October 16, 2016 at 11:57 am

Interesting. Although I had a bit of trouble deciding what qualifies a word as something I recognize. First round I tried to give good definitions and got 19k, the next round I contented myself with ‘some brain disease or something’ and got 23k. Think the second is closer since I recognized the words and would have got more nuanced defs from context. This is kind of addictive though – now I need to find out what chaparajos, nandin, synalephas et al mean :)

1. Alex Eames says:
  
  October 16, 2016 at 2:16 pm
  
  Agreed. The original intent was “Your job is to hit “Enter” if you can define that word. I’m strict here – if you can define it, you know it. If you can’t define it, you don’t know it. If you can guess it, but have never come across it before – you still don’t know it.”
  
  But at the end of the day, as long as you are playing the same way as the people you are comparing with, it doesn’t really matter.
  
  It is kind of addictive. I’ve made another script for dictionary lookups of the unknown words and turned the results into a @PiWordoftheDay twitter bot that will tweet three definitions per day from the words I didn’t know. The aim here is to learn some new words. :) As long as people are enjoying the script, I’m happy. The scores don’t matter that much to me.
  
jimd says:

October 16, 2016 at 1:00 pm

I guess it is all in the definition of a word: this recent psych research determined the average American knows about 42000 words. http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01116/full

1. RW says:
  
  October 16, 2016 at 1:16 pm
  
  Someone plz order a calibrated standard average American from ANSI and try it against the script.
  
  While you’ve got one on hand, find out what it eats. I keep coming up against “Average American diet contains enough of this and that..” without specifying what it is that’s averagely eaten that contributes it.
  
  1. A says:
    
    October 17, 2016 at 6:39 pm
    
    To misquote someone “the average American has one testicle and one ovary”.
    
RW says:

October 16, 2016 at 2:33 pm

“Sanity check” results from manual test with Funk and Wagnalls standard desk dictionary… I got an average of 49.5 words a page, because there’s about 50 words a page in it, with 860 total pages. So scored 42,570 with that. So mostly seems to be defined by how many words a page there are. Also if I am only missing about 1% then when the dictionary claims “Over 100,000 words” then it should really come out somewhere near that.

Also, with physical dictionary, if you land on a word like “set” you can have definitions for that take up a whole page, just first one I could think of, anyway, that would skew your results a bit.

Dan#942164212 says:

October 16, 2016 at 6:50 pm

There are problems with the list if it contains systematic chemical names as one can decode them and know what a word is even if it is novel, that is the point of systems such as https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry

Perhaps the random choice of words should be biased by the word use frequency count (higher means selected more often), as this may have the effect of giving a result that is as accurate as a much larger fully random selection.

Also I don’t think that there is a direct correlation between the number of words you recognise and the number of words you use in spoken, and separately, in written language.

YellowBlock says:

October 16, 2016 at 8:00 pm

‘Corrected’ score of 18434 words and I am so glad that I do *not* know any of the words I said I didn’t know.

My vocabulary is plenty expressive enough without them.


Hackaday
