When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…
One of the biggest pains of text mining and NLP is colloquialism: language that is appropriate only in casual conversation and not in formal speech or writing. Informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web, including acronyms (“LOL”, “WTF”) and smilies that add sentiment to text. There is also a less commonly used slang called leetspeak that replaces letters with numbers or deliberately misspells words (“n00b” rather than “noob”, “pwned” instead of “owned” and “pr0n” instead of “porn”).
There are also regionalisms, which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) or carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have both a formal and an informal meaning, such as “kid” (a young goat, or a child…same thing).
Linguists consider colloquialisms different from slang. Slang is informal language used by a specific group of people: Internet users, gamers, teenagers, college students, men/women, surfers, skaters, boarders, etc. These words can be used to put users into social groups, but that is beyond the scope of this post.
Text mining becomes a lot less overwhelming if we can filter out known English words and focus on mapping colloquialisms, slang and Internet jargon to known English words. By using a list of known English words, we can do just that. I got some great recommendations for lists of English words that go beyond the typical list of about 58,000 words. The list below may evolve over time, but it is what I have for now, and it has been very useful to me.
- English Wordlists by SIL International provides a list of nearly 110,000 English words from 1991. The words were originally compiled from lists obtained from the Interociter bulletin board, which in turn came from Public Brand Software. American and Australian lists of homophones are also available. Thanks to Greg Hirson (@greghirson) for this suggestion.
- Various Scrabble Wordlists: Adam Bozon has a fairly comprehensive list here with words of certain lengths, words with certain prefixes and suffixes, and other interesting characteristics. Steve Wolfberg provides a thorough digest of links (including OWL2) to Competitive Scrabble wordlists by length, brand names, and the Official Long Words List created by the NSA Dictionary Committee. Thanks to Daniel Levine (@daniel_levine) for this suggestion.
- Wiktionary entry titles (339,000+). Wikistuff is always a great resource, but my biggest complaint about using Wiktionary is that it contains many nonsense entries and is not restricted to English. Still, it proves useful for many analyses. Thanks to Ignacio Zendejas (@izendejas) for this suggestion.
- The Aspell Project is an open-source spell checker written in C++, and there is a list of English words in the source code. Thanks to Devjyoti Patra (@kprotocol) for this suggestion!
- Kevin Atkinson provides a comprehensive collection of English word lists here. It includes spell-checker-oriented word lists, an auto-generated inflection database, variant conversion info, a part-of-speech database, an unofficial jargon word list, and links to many other word lists in other languages.
- Of course there is also WordNet, a lexical database for English that is not just a list of words, but provides much, much more information.
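
To make the filtering idea concrete, here is a minimal sketch in Python of how one of these word lists might be used. Everything in it is an assumption for illustration: the tiny `KNOWN_WORDS` set stands in for a real word list, `COLLOQUIAL_MAP` is a hand-built mapping you would have to curate yourself, and `load_word_list` assumes a plain text file with one word per line (not every list above ships in that format).

```python
import re

# Hypothetical stand-in for a real word list; in practice, load one of the
# lists above with load_word_list().
KNOWN_WORDS = {"i", "am", "going", "to", "want", "you", "all", "is", "not"}

# Hypothetical hand-curated mapping from colloquialisms and slang to
# known English words.
COLLOQUIAL_MAP = {
    "gonna": "going to",
    "wanna": "want to",
    "y'all": "you all",
    "ain't": "is not",
    "n00b": "noob",
}

# Keep letters, digits and apostrophes so "y'all" and "n00b" stay intact.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def load_word_list(path):
    """Read a word list, assuming one word per line, into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def tokenize(text):
    """Lowercase the text and normalize curly apostrophes before splitting."""
    return TOKEN_RE.findall(text.lower().replace("\u2019", "'"))


def normalize(text, known_words=KNOWN_WORDS, mapping=COLLOQUIAL_MAP):
    """Return (normalized tokens, tokens that are neither known nor mapped)."""
    normalized, unknown = [], []
    for token in tokenize(text):
        if token in known_words:
            normalized.append(token)  # known English word: nothing to do
        elif token in mapping:
            normalized.extend(mapping[token].split())  # map to known words
        else:
            normalized.append(token)  # keep as-is, but flag for review
            unknown.append(token)
    return normalized, unknown


tokens, leftovers = normalize("I am gonna pwn y\u2019all")
print(tokens)     # ['i', 'am', 'going', 'to', 'pwn', 'you', 'all']
print(leftovers)  # ['pwn']
```

The leftovers are the interesting part: whatever is neither a known word nor already mapped is a candidate for new slang, a typo, or a name, and that is where the manual effort goes.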
What about you? What are your favorite word lists?