Single Letter Frequencies in English

February 15, 2009
(This article was first published on "R-bloggers" via Tal Galili in Google Reader, and kindly contributed to R-bloggers)

Every time that I read a paper that discusses the frequencies of single letters in English, I feel like I should sit down and calculate them for myself from a sample of English text. Today, I finally did. Here are the probabilities and negative log probabilities of the characters in English over the corpus of Shakespeare’s plays:

[Figure: Single Letter Probabilities]
[Figure: Single Letter Inverse Probabilities]

And, for those who care, here’s the code to generate the data from the plays, which I downloaded from Project Gutenberg:

def initialize_letter_counts(letter_counts)
  ('a'..'z').each do |chr|
    letter_counts[chr] = 0
  end
end
 
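# Count each ASCII letter (case-insensitively) in the named file.
def parse_file(filename, letter_counts)
  # Binary mode reads one byte at a time, so bytes outside ASCII cannot
  # raise encoding errors. The block form closes the file when done.
  File.open(filename, 'rb') do |f|
    f.each_char do |char|
      char = char.downcase
      letter_counts[char] += 1 if char.match(/[a-z]/)
    end
  end
  nil
end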
 
directory = '/Users/johnmyleswhite/Princeton/Research/Letter Frequency'
 
Dir.chdir(directory)
 
letter_counts = {}
 
initialize_letter_counts(letter_counts)
 
Dir.entries('Data').each do |entry|
  # Escape the dot and anchor at the end of the name so only *.txt files match.
  if entry.match(/\.txt\z/)
    entry = File.expand_path(entry, directory + '/Data')
    parse_file(entry, letter_counts)
  end
end
 
letter_counts.keys.sort.each do |key|
  puts "'#{key}',#{letter_counts[key]}"
end
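
The script above prints only raw counts. Here is a minimal sketch of how those counts could be turned into the probabilities and negative log probabilities shown in the plots. The helper name `letter_statistics`, the toy counts, and the choice of the natural logarithm are all assumptions for illustration, not taken from the original script.

```ruby
# Hypothetical helper (not part of the original script): convert raw letter
# counts into [letter, probability, negative log probability] rows.
def letter_statistics(letter_counts)
  total = letter_counts.values.sum.to_f
  return [] if total.zero?

  letter_counts.keys.sort.map do |letter|
    probability = letter_counts[letter] / total
    # Natural log is an assumption. A letter that never occurs has an
    # infinite negative log probability, so it is reported as nil.
    neg_log = probability > 0 ? -Math.log(probability) : nil
    [letter, probability, neg_log]
  end
end

# Toy counts for illustration only; these are not the Shakespeare results.
sample_counts = { 'a' => 2, 'b' => 1, 'c' => 1 }

letter_statistics(sample_counts).each do |letter, probability, neg_log|
  puts "'#{letter}',#{probability},#{neg_log}"
end
```

Passing the `letter_counts` hash built by the script in place of `sample_counts` would produce one row per letter in the same comma-separated style as the count output.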
