Estimate Your English Vocabulary Using Python

We take our mother tongue for granted, a language we learn as young children without realizing the effort involved. It is only when as adults we try to pick up another language that we fully understand how much hard work surrounds each acquired word.

Depending on who you listen to, estimates vary as to the size of a typical native English speaker’s vocabulary. The ballpark figures seem to put most adults under 20 thousand words, while graduates achieve somewhere around 23 thousand words. It’s a subject [Alex Eames] became interested in after reading a BBC article on it, and he decided to write his own software to produce a personal estimate.

His Python script takes the Scrabble word list and presents the user with a series of words, asking for each one whether they know it. After a hundred words have been presented it calculates an estimate of the size of the user’s vocabulary. [Alex] wrote it on and for the Raspberry Pi, but it should work quite happily on any platform with Python 3. It certainly had no problem with our Ubuntu-based PC.
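In outline, the method amounts to sampling words uniformly from the list and scaling the fraction you recognise up to the full list size. A minimal sketch of that idea in Python 3 (not [Alex]’s actual code; the function names, prompt wording, and file name are our own):

```python
import random

def estimate_vocabulary(word_list, knows_word, sample_size=100):
    """Estimate vocabulary by sampling: present sample_size random
    words, count how many the user recognises, and scale that hit
    rate up to the size of the whole list."""
    sample = random.sample(word_list, sample_size)
    known = sum(1 for word in sample if knows_word(word))
    return round(len(word_list) * known / sample_size)

def ask_user(word):
    """Interactive check: answer 'y' only if you can define the word."""
    return input(f"Do you know '{word}'? [y/N] ").strip().lower().startswith("y")

# Usage, assuming a word list with one word per line:
#   words = [w.strip() for w in open("wordlist.txt") if w.strip()]
#   print("Estimated vocabulary:", estimate_vocabulary(words, ask_user))
```

The honesty caveat below applies here too: the scaling is only as good as your answers to `knows_word`.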

There is plenty of opportunity for bragging over the size of one’s vocabulary with a script like this one, but it’s something of a statistical leveler in that if you are truthful in your responses it will almost certainly put you exactly where you might expect for your age or level of education. If you want to know the result this script returned for a Hackaday scribe, for example, the answer is 23554.

This subject is a slight departure into software from our usual hardware subject matter, but it’s one of those tests that becomes rather a consuming interest when performed competitively among a group of friends. How well will you fare?

Via [Recantha]

11 thoughts on “Estimate Your English Vocabulary Using Python”

  1. Cool concept, but the results were extremely inconsistent for me. I gave it 4 runs and scored anywhere from 26k down to as low as 11k. Most of the words it gives are just really unusual forms of regular words, for example “ununited” or “antibiotically.” There were also entries that weren’t exactly words, like “tsktsk.” It’s essentially a garble of words you make in Scrabble by adding a few tiles where you can, which makes sense considering the source.

    Perhaps using an abridged standard dictionary would yield better results.

    1. I think you’d possibly benefit from increasing the value of “iterations” in the script. My results are fairly steady between 18000 and 20000. It’s a bit of a trade-off between “how long does this darn thing take?” and “how accurate can we be?” But at 200 iterations I think your results would be more consistent.

      I completely agree that a better dictionary source would be better. Ideally, one containing just the “stem” words rather than all the variants as well. Of course the script doesn’t care what file you feed it as long as it’s one word per line. It could even be in another language. :)
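The iterations trade-off described above can be demonstrated in simulation: averaging more independent hundred-word samples tightens the spread of the final figure roughly with the square root of the count. A rough sketch (the simulated reader who knows 20% of a 100,000-word list is an assumption for illustration, not the script’s behaviour):

```python
import random
import statistics

def one_estimate(list_size, known_fraction, sample_size, rng):
    """One run: sample words, count hits, scale up to the list size."""
    known = sum(1 for _ in range(sample_size) if rng.random() < known_fraction)
    return list_size * known / sample_size

def spread(iterations, runs=200, rng=None):
    """Std-dev across runs, where each run averages `iterations`
    independent 100-word estimates against a 100,000-word list."""
    rng = rng or random.Random(1)
    estimates = [
        statistics.mean(one_estimate(100_000, 0.2, 100, rng)
                        for _ in range(iterations))
        for _ in range(runs)
    ]
    return statistics.stdev(estimates)

# Doubling iterations shrinks the spread by roughly sqrt(2),
# at the cost of the test taking proportionally longer.
```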

  2. Interesting. Although I had a bit of trouble deciding what qualifies a word as something I recognize. First round I tried to give good definitions and got 19k, the next round I contented myself with ‘some brain disease or something’ and got 23k. Think the second is closer since I recognized the words and would have got more nuanced defs from context. This is kind of addictive though – now I need to find out what chaparajos, nandin, synalephas et al mean :)

    1. Agreed. The original intent was “Your job is to hit “Enter” if you can define that word. I’m strict here – if you can define it, you know it. If you can’t define it, you don’t know it. If you can guess it, but have never come across it before – you still don’t know it.”

      But at the end of the day, as long as you are playing the same way as the people you are comparing with, it doesn’t really matter.

It is kind of addictive. I’ve made another script for dictionary lookups of the unknown words and turned the results into a @PiWordoftheDay twitter bot that will tweet three definitions per day from the words I didn’t know. The aim here is to learn some new words. :) As long as people are enjoying the script, I’m happy. The scores don’t matter that much to me.

    1. Someone plz order a calibrated standard average American from ANSI and try it against the script.

While you’ve got one on hand, find out what it eats. I keep coming up against “the average American diet contains enough of this and that…” without it specifying what it is that’s averagely eaten that contributes it.

  3. “Sanity check” results from a manual test with a Funk and Wagnalls standard desk dictionary: I got an average of 49.5 words a page, because there are about 50 words a page in it, with 860 total pages, so I scored 42,570 with that. The result mostly seems to be defined by how many words per page there are. Also, if I am only missing about 1%, then when the dictionary claims “Over 100,000 words” it should really come out somewhere near that.

Also, with a physical dictionary, if you land on a word like “set”, the definitions for it can take up a whole page (that’s just the first one I could think of), which would skew your results a bit.

  4. There are problems with the list if it contains systematic chemical names, as one can decode them and know what a word means even if it is novel; that is the whole point of such naming systems.

    Perhaps the random choice of words should be biased by the word use frequency count (higher means selected more often), as this may have the effect of giving a result that is as accurate as a much larger fully random selection.

    Also I don’t think that there is a direct correlation between the number of words you recognise and the number of words you use in spoken, and separately, in written language.
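The frequency-biased selection suggested above could be sketched with Python’s `random.choices`, which accepts per-item weights. The word counts here are made up for illustration, and a real implementation would also have to correct the final scaling for the bias this introduces (common words are more likely to be known, so a naive scale-up would overestimate):

```python
import random

def frequency_biased_sample(words, frequencies, k, rng=None):
    """Draw k words with probability proportional to corpus frequency,
    so common words are tested more often than obscure ones."""
    rng = rng or random.Random()
    return rng.choices(words, weights=frequencies, k=k)

# Hypothetical frequency counts; real ones would come from a corpus.
words = ["the", "set", "chaparajos"]
counts = [69_971, 4_052, 1]
# frequency_biased_sample(words, counts, 10) will return "the"
# far more often than "chaparajos".
```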

  5. ‘Corrected’ score of 18434 words and I am so glad that I do *not* know any of the words I said I didn’t know.

    My vocabulary is plenty expressive enough without them.
