
is, I think, the one you can find at sacred-texts.com.

curl -O 'http://www.sacred-texts.com/dna/hgp011k.htm'  #get it

#BORING DATA JANITORSHIP
tail -n +15 hgp011k.htm > hgp011k   #remove the HTML head stuff .. up to <pre>
head -n -3 hgp011k | sponge hgp011k #remove the HTML tail

#the sponge nonsense is because command < file > file will just blank your file
#sponge holds the output in a temp/swap for a sec, then writes > to file
#you can also slow your shell down by wrapping command in this bit of nonsense:
echo "$(head -n -3 hgp011k)" > hgp011k

#now it's almost clean … just tattied with needless line endings
tr -d '\r' < hgp011k | sponge hgp011k
tr -d '\n' < hgp011k | sponge hgp011k

#ALL CLEAN!
less hgp011k
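If you don't have sponge handy (it ships in the moreutils package), a temporary file does the same job. Here is a minimal sketch of both the pitfall and the workaround; the scratch directory and the names demo.txt, demo.tmp and clobbered.txt are made up for illustration and are not part of the original pipeline.

```shell
# Scratch directory, so no real file is at risk (all file names here are hypothetical).
workdir=$(mktemp -d)
printf 'AC\r\nGT\r\n' > "$workdir/demo.txt"
cp "$workdir/demo.txt" "$workdir/clobbered.txt"

# Pitfall: the shell truncates the output file before tr ever reads it.
tr -d '\r\n' < "$workdir/clobbered.txt" > "$workdir/clobbered.txt"

# Safer: write to a temporary file, then rename it over the original.
tr -d '\r\n' < "$workdir/demo.txt" > "$workdir/demo.tmp" &&
  mv "$workdir/demo.tmp" "$workdir/demo.txt"

cleaned=$(cat "$workdir/demo.txt")
echo "cleaned: $cleaned"
[ -s "$workdir/clobbered.txt" ] || echo "clobbered.txt is now empty"
```

The rename only happens if tr succeeds, so a failed filter leaves the original file alone.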


So that’s a bit of unix 101 / datacleaning 101. Now open up an R terminal for the fun part:

craig.v <- scan(file='hgp011k',what='character')
table( strsplit( craig.v, '') )

A     C     G     T
14941 15080 15210 14769
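If you'd rather cross-check that tally without leaving the shell, something like the following should give the same kind of counts. It is only a sketch: the toy string GATTACA stands in for the cleaned file, and the variable names are mine.

```shell
# Toy sequence standing in for the cleaned file (not real data).
sequence='GATTACA'

# fold -w 1 puts each character on its own line; sort groups identical
# letters together; uniq -c counts each run; awk reorders to "letter count".
counts=$(printf '%s\n' "$sequence" | fold -w 1 | sort | uniq -c | awk '{print $2, $1}')
echo "$counts"
```

To run it on the real thing, feed the file to fold instead of the toy string, e.g. fold -w 1 < hgp011k.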



A good time was had by all.

Why don’t we do the same thing with π? Unlike Dr V’s DNA, I don’t have to get all wet and bloody acquiring as much of this data as I want. I do have to set some limits on how long to run bc, the Unix basic calculator, though.

echo "scale=22222; a(1)*4" | bc -l  > pi.22222 #a(1) = arctan(1) = π/4, so times 4 gives π
less pi.22222   #needs cleanup
echo "scale=22222; a(1)*4" | bc -l | tr -d '\n' | tr -d '\\'  > pi.22222
#one-liner! and it feels so good…
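The cleanup is needed because bc wraps long numbers, ending each wrapped line with a backslash. A small sketch of the un-wrapping step, using hand-typed stand-in text rather than real bc output (the variable names are hypothetical), shows that one tr call can delete both characters:

```shell
# Simulated bc output: a long number wrapped with backslash-newline.
# (These digits are just the familiar start of pi, typed by hand.)
wrapped=$(printf '3.14159\\\n26535\\\n89793\n')

# A single tr call deletes both the backslashes and the newlines.
joined=$(printf '%s\n' "$wrapped" | tr -d '\\\n')
echo "$joined"
```

Two tr calls in a row work just as well; this only saves a pipe.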


That was comparatively easier than scrolling through the HTML file to find the beginning of what we really wanted. R me the rock:

pi.2 <- scan(file='pi.22222', what='character')
pi.2 <- strsplit(pi.2, '')    #R has no problem with the update-my-thing syntax! rainbow bash could learn a thing or two
table(pi.2) #could have also done table( strsplit( ... ))
.    0    1    2    3    4    5    6    7    8    9
1 2186 2205 2179 2202 2259 2315 2254 2201 2194 2228

#those are a bit hard to read so …

table(pi.2) / median(table(pi.2))

.            0            1            2            3            4
0.0004541326 0.9927338783 1.0013623978 0.9895549500 1.0000000000 1.0258855586
5            6            7            8            9
1.0513169846 1.0236148955 0.9995458674 0.9963669391 1.0118074478

#still a bit inscrutable

round( table(pi.2) / median(table(pi.2)), 3)

.     0     1     2     3     4     5     6     7     8     9
0.000 0.993 1.001 0.990 1.000 1.026 1.051 1.024 1.000 0.996 1.012

#there we go. pretty even distribution of digits, and let's leave the analysis of the dispersion for another day!



Obviously this was just an excuse for me to show off some unix tools like tr, curl, bc, tail -n +num, head -n -num, and some R functions like table, scan, and strsplit. But it works much better with a story, doesn’t it?!

Anyway, Dr Venter, your epidermis is showing!