English Language Word Analysis With Dot Net


Posted on December 13, 2018

English language word analysis with C#

This post is featured on The Second Annual C# Advent Calendar.

Introduction

I was recently staring at anagrams like this one:

Conundrum

(Answer at the bottom)

And it got me thinking:

  • How many nine letter words are there?
  • How many words are there in the English language?
  • What letter do most words start with?
  • What letter do most words end with?
  • Could you store all the English words in memory and process them using C#?

So I downloaded a list of all words in the English language* to see if I could answer some of these questions.

*Disclaimer: I have no idea if this is the entire list, but it is as definitive as I could find. This is not a serious research endeavour; it is only something quickly thrown together.

So I threw together a .NET C# Console Application to read this list, parse the words and generate some figures.

And according to that, I had 466,450 words to work with. That's 4,396,016 letters (if you combine all the words) by the way.
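The actual code lives in the GitHub repository and isn't reproduced here, so the following is only a minimal sketch of how the two totals above could be produced. It assumes the word list is a plain text file with one word per line; the `WordList` class and its method names are made up for illustration, and it counts every character in a word, which may differ slightly from how the original figures were calculated.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not the code from the repository.
public static class WordList
{
    // Reads one word per line, trimming whitespace and skipping blank lines.
    public static List<string> Load(string path) =>
        File.ReadLines(path)
            .Select(line => line.Trim())
            .Where(word => word.Length > 0)
            .ToList();

    // Adds up the length of every word (all characters, not just a-z).
    public static long TotalCharacters(IEnumerable<string> words) =>
        words.Sum(word => (long)word.Length);
}
```

With a file at a placeholder path such as `words.txt`, `WordList.Load("words.txt").Count` would give the word count and `WordList.TotalCharacters(...)` the combined length. Holding the whole list in a `List<string>` keeps the later analysis simple, at the cost of keeping every word in memory at once.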

What letter do most words start with?

So most words would seem to start with the letter s, closely followed by e or d.

Starting with

What letter do most words end with?

And most words would also seem to end with the letter s, followed, surprisingly, by p or c.

Ending with
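The first-letter and last-letter tallies from the last two sections can both be expressed as "group the words by one chosen character and count each group". The sketch below shows one way to do that with LINQ. The `LetterPositions` class and its methods are hypothetical names, and lower-casing the letters before grouping is an assumption about how the list should be treated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the code from the repository.
public static class LetterPositions
{
    // Groups words by the character that `pick` selects and returns
    // (letter, count) pairs, most frequent first, ties in alphabetical order.
    public static List<KeyValuePair<char, int>> Tally(
        IEnumerable<string> words, Func<string, char> pick) =>
        words.Where(word => word.Length > 0)
             .GroupBy(word => char.ToLowerInvariant(pick(word)))
             .Select(group => new KeyValuePair<char, int>(group.Key, group.Count()))
             .OrderByDescending(pair => pair.Value)
             .ThenBy(pair => pair.Key)
             .ToList();

    public static List<KeyValuePair<char, int>> ByFirstLetter(IEnumerable<string> words) =>
        Tally(words, word => word[0]);

    public static List<KeyValuePair<char, int>> ByLastLetter(IEnumerable<string> words) =>
        Tally(words, word => word[word.Length - 1]);
}
```

The first entry of each returned list is the most common starting or ending letter, and the rest of the list is what a bar chart would be drawn from.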

What is the most popular letter in all the words in the English language?

Across all the words in the English language, how often does each letter appear?

Letter Frequency
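Counting every letter across roughly 4.4 million characters needs nothing more than a 26-slot array. This is a sketch rather than the repository code: `LetterFrequency` is a made-up name, and ignoring anything outside a-z (digits, hyphens, apostrophes) is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not the code from the repository.
public static class LetterFrequency
{
    // Returns 26 counts: index 0 is 'a', index 25 is 'z'.
    // Characters outside a-z after lower-casing are ignored.
    public static long[] Count(IEnumerable<string> words)
    {
        var counts = new long[26];
        foreach (var word in words)
        {
            foreach (var c in word)
            {
                var lower = char.ToLowerInvariant(c);
                if (lower >= 'a' && lower <= 'z')
                {
                    counts[lower - 'a']++;
                }
            }
        }
        return counts;
    }
}
```

A fixed array avoids dictionary lookups in the inner loop, which is the part of the program that runs once per character.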

How long are the words?

So the most frequent length of word is nine characters, followed by eight, then ten.

Word Length
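A word-length chart like this comes from a simple histogram keyed on `Length`. The following is a rough sketch with hypothetical names (`WordLengths`, `Histogram`, `Longest`), not the code from the repository.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the code from the repository.
public static class WordLengths
{
    // Maps each word length to the number of words of that length.
    // SortedDictionary keeps the lengths in ascending order for charting.
    public static SortedDictionary<int, int> Histogram(IEnumerable<string> words)
    {
        var histogram = new SortedDictionary<int, int>();
        foreach (var word in words)
        {
            histogram.TryGetValue(word.Length, out var count); // count is 0 if absent
            histogram[word.Length] = count + 1;
        }
        return histogram;
    }

    // Returns the longest word, or null for an empty list.
    public static string Longest(IEnumerable<string> words) =>
        words.OrderByDescending(word => word.Length).FirstOrDefault();
}
```

Sorting the histogram entries by value gives the most common lengths, and `Longest` is a quick way to spot outliers at the far end of the chart.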

I noticed an unusual outlier in the form of the 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis, a deliberately long, made-up word for a lung disease caused by inhaling very fine ash and sand dust.

What is the frequency of double letters?

So the most frequent double letter is, surprisingly, ll, followed less surprisingly by ss.

Double Letter Frequency
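Double letters can be found by comparing each character with the one before it. The sketch below is one possible approach with a hypothetical `DoubleLetters` class; it counts overlapping pairs (so a run of three identical letters counts twice), which may not match how the chart above was produced.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not the code from the repository.
public static class DoubleLetters
{
    // Counts every adjacent pair of identical letters, keyed by the pair ("ll", "ss", ...).
    // A word can contribute more than once, e.g. "mississippi" adds 2 to "ss" and 1 to "pp".
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            var lower = word.ToLowerInvariant();
            for (var i = 1; i < lower.Length; i++)
            {
                if (lower[i] == lower[i - 1] && char.IsLetter(lower[i]))
                {
                    var pair = new string(lower[i], 2);
                    counts.TryGetValue(pair, out var count); // count is 0 if absent
                    counts[pair] = count + 1;
                }
            }
        }
        return counts;
    }
}
```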

The code

I've put my code up on GitHub, so you can play around with the word list as well.

What next?

There are other questions I have, such as:

  • How many words end with ING?
  • How many words are palindromes? (The same forward as back?)
  • Are there any words without any vowels?
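None of these have been answered yet, but each question boils down to a one-line predicate over a word. The following is a sketch only: the `FollowUps` class is hypothetical, single letters are not treated as palindromes, and "vowel" here means a, e, i, o, u only (y is not counted). Those are all assumptions that would change the totals.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not the code from the repository.
public static class FollowUps
{
    // True when the word ends in "ing", ignoring case.
    public static bool EndsWithIng(string word) =>
        word.EndsWith("ing", StringComparison.OrdinalIgnoreCase);

    // True when the word reads the same forwards and backwards, ignoring case.
    // Words shorter than two characters are excluded by assumption.
    public static bool IsPalindrome(string word)
    {
        var lower = word.ToLowerInvariant();
        for (int i = 0, j = lower.Length - 1; i < j; i++, j--)
        {
            if (lower[i] != lower[j]) return false;
        }
        return lower.Length > 1;
    }

    // True when the word contains none of a, e, i, o, u (y is not treated as a vowel).
    public static bool HasNoVowels(string word) =>
        word.Length > 0 && !word.Any(c => "aeiouAEIOU".IndexOf(c) >= 0);
}
```

Given the words in a list, something like `words.Count(FollowUps.IsPalindrome)` would give the total for each question.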

If I have enough time, I'll update this blog post with the answers.

Summary

To be honest, there are fewer words in the English language than I thought, and many of those will be variations of other words.

The simplest of computer programs seemed able to blast through all of them without breaking a sweat and highlighted some interesting things (or so I thought).

Answer to BARNSLIEU = INSURABLE or SUBLINEAR