Craig Venter’s first chromosome
is, I think, the one you can find at sacred-texts.com.
    curl -O 'http://www.sacred-texts.com/dna/hgp011k.htm'   #get it

    #BORING DATA JANITORSHIP
    tail -n +15 hgp011k.htm > hgp011k     #remove the HTML head stuff .. up to <pre>
    head -n -3 hgp011k | sponge hgp011k   #remove the HTML tail
    #the `sponge` nonsense is because `command < file > file` will just blank your file
    #`sponge` holds the output in a temp/swap for a sec, then writes > to file
    #you can also slow your shell down by wrapping `command` in this bit of nonsense
    #(an alternative to the sponge line above; run one or the other, not both):
    #echo "`head -n -3 hgp011k`" > hgp011k

    #now it's almost clean … just tattied with needless line endings
    tr -d '\r' < hgp011k | sponge hgp011k
    tr -d '\n' < hgp011k | sponge hgp011k

    #ALL CLEAN!
    less hgp011k
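If `sponge` (from moreutils) isn't installed, the usual fallback is a temp file plus `mv`. A minimal sketch, assuming GNU `head` and a throwaway demo file rather than the real download (the file names here are made up for illustration):

```shell
# Hypothetical demo file, standing in for hgp011k
workdir=$(mktemp -d)
demo="$workdir/demo.txt"
printf 'keep 1\nkeep 2\ntail a\ntail b\ntail c\n' > "$demo"

# Write the result to a temp file first, then move it over the original.
# `head -n -3` ("everything but the last 3 lines") is a GNU coreutils extension.
tmp=$(mktemp "$workdir/tmp.XXXXXX")
head -n -3 "$demo" > "$tmp" && mv "$tmp" "$demo"

cat "$demo"
```

The `&&` matters: if `head` fails, the original file is left alone instead of being replaced by an empty temp file.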
So that’s a bit of Unix 101 / data cleaning 101. Now open up an R terminal for the fun part:
    craig.v <- scan(file='hgp011k', what='character')
    table( strsplit( craig.v, '') )

        A     C     G     T 
    14941 15080 15210 14769 
A good time was had by all.
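The same kind of tally can be roughed out without leaving the shell. This is only a sketch on a toy sequence (the `seq` string is a hypothetical stand-in, not Dr V's data); on the real thing you would feed the cleaned `hgp011k` file into the pipeline instead.

```shell
# Hypothetical toy sequence standing in for the cleaned hgp011k file
seq='GATTACAGATTACA'

# fold -w 1 puts one character per line; sort groups identical letters;
# uniq -c counts each group; awk swaps the columns to "letter count"
counts=$(printf '%s' "$seq" | fold -w 1 | sort | uniq -c | awk '{print $2, $1}')
echo "$counts"
```

It's slower than R's `table` on big inputs because of the sort, but it's handy for a quick sanity check.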
Why don’t we do the same thing with π? Unlike Dr V’s DNA, I don’t have to get all wet and bloody acquiring as much of this data as I want. I do have to set some limits on how long to run bc (the Unix basic calculator) though.
    echo "scale=22222; a(1)*4" | bc -l > pi.22222   #a(1) = arctan(1) = π/4
    less pi.22222   #needs cleanup: bc wraps long numbers with a backslash + newline
    echo "scale=22222; a(1)*4" | bc -l | tr -d '\n' | tr -d '\\' > pi.22222   #one-liner! and it feels so good…
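What needs cleaning up is bc's habit of wrapping long numbers, ending each wrapped line with a backslash. Here's a small sketch on simulated bc-style output (the short digit string is a made-up stand-in, not real bc output), so you can see the effect without waiting on 22222 digits:

```shell
# Simulated bc-style output: each wrapped line ends in a backslash + newline
wrapped=$(printf '3.1415\\\n9265\\\n3589\n')

# One tr call can delete both characters: the set '\\\n' is backslash + newline
clean=$(printf '%s' "$wrapped" | tr -d '\\\n')
echo "$clean"
```

Deleting both characters in a single `tr` set does the same job as chaining two `tr -d` calls, just with one fewer process.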
That was much easier than scrolling through the HTML file to find the beginning of what we really wanted. R me the rock:
    pi.2 <- scan(file='pi.22222', what='character')
    pi.2 <- strsplit(pi.2, '')   #R has no problem with the update-my-thing syntax! bash could learn a thing or two
    table(pi.2)   #could have also done table( strsplit( ... ))

       .    0    1    2    3    4    5    6    7    8    9 
       1 2186 2205 2179 2202 2259 2315 2254 2201 2194 2228 

    #those are a bit hard to read so …
    table(pi.2) / median(table(pi.2))

               .            0            1            2            3            4 
    0.0004541326 0.9927338783 1.0013623978 0.9895549500 1.0000000000 1.0258855586 
               5            6            7            8            9 
    1.0513169846 1.0236148955 0.9995458674 0.9963669391 1.0118074478 

    #still a bit inscrutable
    round( table(pi.2) / median(table(pi.2)), 3)

        .     0     1     2     3     4     5     6     7     8     9 
    0.000 0.993 1.001 0.990 1.000 1.026 1.051 1.024 1.000 0.996 1.012 
There we go. Pretty even distribution of digits of pi, and let’s leave the analysis of the dispersion for another day!
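For the shell-inclined, a similar "how far from even is each digit?" check can be sketched with awk. Note the swap: this version divides by the mean count (what you'd expect if the digits were perfectly uniform) rather than the median used in the R session, so the ratios come out slightly different. The counts are copied from the `table(pi.2)` output above.

```shell
# Digit counts for 0..9, copied from the table(pi.2) output in the R session
counts='2186 2205 2179 2202 2259 2315 2254 2201 2194 2228'

ratios=$(echo "$counts" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    mean = total / NF                 # expected count if digits were uniform
    for (i = 1; i <= NF; i++) printf "%d %.3f\n", i - 1, $i / mean
}')
echo "$ratios"
```

Each output line is a digit followed by its count relative to the mean, so values near 1.000 mean "about as common as expected".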
Obviously this was just an excuse for me to show off some unix tools like tr, curl, bc, tail -n +num, head -n -num, and some R functions like table, scan, and strsplit. But it works much better with a story, doesn’t it?!
Anyway, Dr Venter, your epidermis is showing!