Corpus Linguistics with R, Day 2

July 28, 2009
By

R Lesson 2
text
> gsub("second", "third", text) # SEARCH-REPLACE-SUBJECT
"This is a first example sentence." "And this is a third example sentence."
> gsub("n", "X", text)
"This is a first example seXteXce." "AXd this is a secoXd example seXteXce."
> gsub("is", "was", text)
"Thwas was a first example

Read more »
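The substitutions in the Lesson 2 excerpt can be reproduced with a short sketch. The assignment to `text` is not shown in the excerpt, so the vector below is an assumption reconstructed from the printed results; the result names (`third`, `with_x`, `with_was`) are illustrative only.

```r
# Assumed input: reconstructed from the output printed in the excerpt
text <- c("This is a first example sentence.",
          "And this is a second example sentence.")

# gsub(pattern, replacement, x) replaces every match in each element of x
third    <- gsub("second", "third", text)
with_x   <- gsub("n", "X", text)
with_was <- gsub("is", "was", text)

print(third)
print(with_x)
print(with_was)
```

Note that `gsub` matches substrings, not whole words, which is why "This" becomes "Thwas" in the last call.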

Corpus Linguistics with R, Day 1

July 28, 2009
By

(This post documents the first day of a class on R that I took at ESU C&T. It is posted here purely for my own use.)
R Lesson 1
> 2+3; 2/3; 2^3
5
0.6666667
8
---
Fundamentals - Functions
> log(x=1000, base=10)
3
---
(Formals describes the syntax of other

Read more »
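For readers who want to try the Lesson 1 calls, here is a minimal sketch. The variable names are illustrative, and the use of `formals()` on `sample` is an assumption about what the truncated last sentence of the excerpt goes on to describe.

```r
# Basic arithmetic, as in the excerpt
sum_result <- 2 + 3
ratio      <- 2 / 3
power      <- 2^3

# Calling a function with named arguments: logarithm of 1000 to base 10
log_result <- log(x = 1000, base = 10)

# formals() returns the argument list of a (non-primitive) function;
# sample() is used here purely as an illustration
arg_names <- names(formals(sample))

print(c(sum_result, ratio, power, log_result))
print(arg_names)
```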

Wilcoxon-Mann-Whitney rank sum test (or test U)

July 27, 2009
By

Comparison of the averages of two independent groups of samples for which we cannot assume a Gaussian distribution; it is also known as the Mann-Whitney U-test. You want to see if the mean number of goals conceded by two football teams over the years is the same. Below are the number of goals conceded by each team in 6 games...

Read more »
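A minimal sketch of how such a comparison might be run in R with `wilcox.test`. The goal counts below are hypothetical, since the excerpt elides the actual data, and the vector names are illustrative.

```r
# Hypothetical goals conceded in 6 games per team (not the post's data)
team_a <- c(6, 8, 2, 4, 4, 5)
team_b <- c(7, 10, 4, 3, 5, 6)

# exact = FALSE requests the normal approximation; with tied values
# an exact p-value cannot be computed anyway
result <- wilcox.test(team_a, team_b, exact = FALSE)
print(result)
```

A large p-value would mean the data give no evidence that the two teams differ in goals conceded.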

Beautiful Data

July 27, 2009
By

O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.

Read more »

R Snippet for Sampling from a Dataframe

July 27, 2009
By

It took me a while to figure this out, so I thought I'd share. I have a dataframe with millions of observations in it, and I want to estimate a density distribution, which is a memory intensive process. Running my kde2d function on the full dataframe throws an error -- R tries to allocate a vector that...

Read more »
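One way to sidestep the memory problem described above is to estimate the density on a random subset of rows. This is a sketch under stated assumptions, not the post's own snippet: `big_df`, its columns, and the sample size are hypothetical, and `kde2d` is assumed to be the one from the MASS package.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

set.seed(42)   # make the random sample reproducible

# Hypothetical stand-in for the large data frame
big_df <- data.frame(x = rnorm(100000), y = rnorm(100000))

# Draw row indices without replacement, then subset the data frame
n_sample <- 5000
idx      <- sample(nrow(big_df), n_sample)
small_df <- big_df[idx, ]

# Two-dimensional kernel density estimate on a 50 x 50 grid
dens <- kde2d(small_df$x, small_df$y, n = 50)
```

Sampling row indices rather than rows keeps the columns of each observation together.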

biomaRt

July 27, 2009
By

I use R and Bioconductor for most of my work. I am also increasingly replacing things I would have done before in Perl with R. One such example of this is the Bioconductor module biomaRt. As the name suggests, it allows for access to BioMart via R. BioMart is a method of accessing large online databases...

Read more »

Book now shipping from Amazon

July 27, 2009
By

Amazon now reports that the book is in stock! The current discount is 13%. Or, order from the publisher. If you are an ASA member, you can use the online discount code 634LH to obtain a 15% discount.

Read more »

Paired Student’s t-test

July 26, 2009
By

Comparison of the means of two sets of paired samples, taken from two populations with unknown variance. A school athletics team has taken on a new instructor and wants to test the effectiveness of the new type of training proposed by comparing the average time...

Read more »
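A paired comparison of this kind can be sketched with `t.test(..., paired = TRUE)`. The times below are hypothetical (the excerpt does not show the data), as are the vector names.

```r
# Hypothetical times in seconds for the same 10 athletes,
# measured before and after the new training (not the post's data)
before <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
after  <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

# paired = TRUE tests whether the mean of the per-athlete differences is zero
result <- t.test(before, after, paired = TRUE)
print(result)
```

Because each athlete serves as their own control, the test has n - 1 degrees of freedom, where n is the number of pairs.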

Select operations on R data frames

July 26, 2009
By

The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used somethin...

Read more »
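For readers coming from SQL, a rough sketch of how a simple SELECT maps onto base R data frame indexing. The data frame `df` and its columns are hypothetical and are not taken from the post.

```r
# Hypothetical data frame
df <- data.frame(name  = c("a", "b", "c", "d"),
                 group = c("x", "y", "x", "y"),
                 value = c(10, 25, 30, 5))

# Roughly: SELECT name, value FROM df WHERE value > 8 ORDER BY value DESC
sel <- df[df$value > 8, c("name", "value")]   # rows by condition, columns by name
sel <- sel[order(-sel$value), ]               # sort descending by value

# The same filter and projection with subset()
sel2 <- subset(df, value > 8, select = c(name, value))

print(sel)
```

The `df[rows, columns]` form is the core idiom: a logical vector picks rows, a character vector picks columns.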

Rosetta Code

July 26, 2009
By

Today I'd like to suggest the interesting Rosetta Code site: Rosetta Code is a programming chrestomathy site. The idea is to present solutions to the same task in as many different languages as possible, to demonstrate how languages are similar and diff...

Read more »

Two sample Student’s t-test #2

July 25, 2009
By

Comparison of the averages of two independent groups, drawn from two populations with unknown variance; the sample variances are not homogeneous. We want to compare the heights in inches of two groups of individuals. Here are the measurements: A: 175, 168, 168...

Read more »
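When the variances cannot be assumed equal, R's `t.test` applies the Welch correction by default. The sketch below uses hypothetical heights (the excerpt's data is truncated, so none of it is reused) and runs the pooled-variance version alongside for contrast.

```r
# Hypothetical heights for two independent groups (not the post's data)
group_a <- c(172, 169, 180, 165, 177, 174)
group_b <- c(185, 160, 192, 158, 171, 199, 166)

# var.equal = FALSE (the default) gives the Welch test for unequal variances
welch  <- t.test(group_a, group_b, var.equal = FALSE)

# var.equal = TRUE pools the variances (classic two-sample t-test)
pooled <- t.test(group_a, group_b, var.equal = TRUE)

print(welch)
print(pooled)
```

The Welch test reports fractional degrees of freedom that never exceed the pooled test's n1 + n2 - 2.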

Example 7.7: Tabulate binomial probabilities

July 25, 2009
By

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings. In SAS, we find the probability that X=x using differences in t...

Read more »
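On the R side, one plausible way to build such a table is with `dbinom`, which returns P(X = x) directly. This is a sketch and not necessarily the code used in the post; the names `prob_table` and `p_values` are illustrative.

```r
n        <- 10
p_values <- seq(0.81, 0.99, by = 0.03)   # .81, .84, ..., .99
x        <- 0:n

# One column per value of p, one row per value of x
prob_table <- sapply(p_values, function(prob) dbinom(x, size = n, prob = prob))
dimnames(prob_table) <- list(x = x, p = p_values)

print(round(prob_table, 3))
```

Each column is a full probability distribution, so its entries sum to 1.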

Two sample Student’s t-test #1

July 24, 2009
By

t-Test to compare the means of two groups under the assumption that both samples are random, independent, and come from normally distributed populations with unknown but equal variances. Here I will use the same data just seen in a previous post. The data ...

Read more »

One sample Student’s t-test

July 23, 2009
By

Comparison of the sample mean with a known value, when the variance of the population is not known. Consider the exercise we saw before. An intelligence test was given to 10 subjects, and here are the results obtained. The average result of ...

Read more »

Two sample Z-test

July 22, 2009
By

Comparison of the means of two independent groups of samples, taken from two populations with known variance. We are asked to compare the average heights of two groups. The first group (A) consists of individuals of Italian nationality (the variance of the ...

Read more »

Massively parallel database for analytics

July 22, 2009
By

This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based, backends to PostgreSQL...

Read more »

One sample Z-test

July 21, 2009
By

Comparison of the sample mean with a known population mean and standard deviation. Suppose that 10 volunteers have done an intelligence test; here are the results obtained. The mean on the same test for the entire population is 75. You want to c...

Read more »
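Base R has no built-in Z-test function, so a small helper is one way to sketch the calculation. The helper name `z_test_one_sample`, the scores, and the population standard deviation of 18 are all hypothetical; only the population mean of 75 comes from the excerpt.

```r
# Hypothetical helper: two-sided one-sample Z-test with known sigma
z_test_one_sample <- function(x, mu, sigma) {
  z <- (mean(x) - mu) / (sigma / sqrt(length(x)))  # standardized sample mean
  p <- 2 * pnorm(-abs(z))                          # two-sided p-value
  list(z = z, p.value = p)
}

# Hypothetical test scores for 10 volunteers (not the post's data)
scores <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)

res <- z_test_one_sample(scores, mu = 75, sigma = 18)
print(res)
```

The statistic is compared against the standard normal distribution because the population standard deviation is treated as known.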

RGG#155, 156 and 157

July 21, 2009
By

I pushed 3 more graphics from Biecek Przemyslaw to the graphics gallery: a list of popular names for colors from packages RColorBrewer, colorRamps, grDevices; a set of examples of a few low-level graphical parameters: lend, ljoin, xpd, adj, lege...

Read more »

Score with scoring rules

July 21, 2009
By

INCENTIVES TO STATE PROBABILITIES OF BELIEF TRUTHFULLY
We have all been there. You are running an experiment in which you would like participants to tell you what they believe. In particular, you’d like them to tell you what they believe to be the probability that an event will occur. Normally, you would ask them. But

Read more »

Geometric and harmonic means in R

July 20, 2009
By

Compute the geometric mean and harmonic mean in R of this sequence: 10, 2, 19, 24, 6, 23, 47, 24, 54, 77. These functions are not present in the base packages of R, although they are readily available in some add-on packages. However, it is easy to calculate the...

Read more »
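Both means can be computed from their definitions in one line each. This is a sketch of the standard formulas, not necessarily the post's own code; the variable names are illustrative.

```r
x <- c(10, 2, 19, 24, 6, 23, 47, 24, 54, 77)

# Geometric mean: the n-th root of the product, computed on the log scale
# to avoid overflow for long sequences
geometric_mean <- exp(mean(log(x)))

# Harmonic mean: the reciprocal of the mean of the reciprocals
harmonic_mean <- 1 / mean(1 / x)

print(c(geometric = geometric_mean, harmonic = harmonic_mean))
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.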

Adding a legend to a plot

July 20, 2009
By

It's pretty easy!
plot(c(1968,2010),c(0,10),type="n", # sets the x and y axes scales
xlab="Year",ylab="Expenditures/GDP (%)") # adds titles to the axes
lines(year,defense,col="red",lwd=2.5) # adds a line for defense expenditures
lines(year,health,col="...

Read more »
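The excerpt cuts off before the legend itself, so here is a hedged sketch of how the plot could be completed with `legend()`. The `defense` and `health` series are made-up placeholders, and the blue colour for the health line is an assumption because the original value is truncated.

```r
# Placeholder data (not the post's actual expenditure figures)
year    <- 1968:2010
defense <- seq(9, 4, length.out = length(year))
health  <- seq(1, 7, length.out = length(year))

# Empty plot that fixes the axis ranges and titles
plot(c(1968, 2010), c(0, 10), type = "n",
     xlab = "Year", ylab = "Expenditures/GDP (%)")
lines(year, defense, col = "red",  lwd = 2.5)
lines(year, health,  col = "blue", lwd = 2.5)  # colour is an assumption

# Legend entries must list labels and colours in matching order
leg <- legend("topright",
              legend = c("Defense", "Health"),
              col    = c("red", "blue"),
              lwd    = 2.5)
```

Passing the same `col` and `lwd` values to `legend()` as to `lines()` keeps the key consistent with the plotted lines.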

Example 7.6: Find Amazon sales rank for a book

July 20, 2009
By

In honor of Amazon's official release date for the book, we offer this blog entry. Both SAS and R can be used to find the Amazon Sales Rank for a book by downloading the desired web page and ferreting out the appropriate line. This code is likely to br...

Read more »

ggplot2: more wicked-cool plots in R

July 20, 2009
By

As far as I know there are 3 different systems for producing figures in R: (1) base graphics, included with R, (2) the lattice package, and (3) ggplot2, one of the newer plotting systems which is, according to the creator Hadley Wickham, "based on the grammar of graphics, which tries to take the good parts of base and lattice...

Read more »

Probability exercise: negative binomial distribution

July 19, 2009
By

What is the probability of getting the 4th tail (cross) before the 3rd head when flipping a coin? The mathematical formula for solving this exercise, which follows a negative binomial distribution, is: $$f(x)=P(X=x)=\begin{pmatrix} x+y-1\\ y-1 \end{pmatrix} \cdot p^x ...

Read more »
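The exercise can be checked numerically with R's negative binomial functions. This sketch assumes a fair coin (p = 0.5), which the excerpt does not state explicitly, and the variable names are illustrative. Getting the 4th tail before the 3rd head means seeing at most 2 heads before the 4th tail.

```r
p_tail <- 0.5   # assumption: fair coin

# pnbinom(q, size, prob) gives P(at most q failures before the size-th success);
# here a "success" is a tail and a "failure" is a head
prob_nb <- pnbinom(2, size = 4, prob = p_tail)

# Cross-check: the event is equivalent to at least 4 tails in the first 6 flips
prob_check <- sum(dbinom(4:6, size = 6, prob = p_tail))

print(prob_nb)
print(prob_check)
```

Both routes give 22/64, because the outcome is always decided within the first six flips.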