Simple visualization of a 11X5 table (for WordPress 2.9 Features Vote Results)

July 31, 2009 · Posted in R bloggers · Comments Off 

Simply Something Sophisicated - a Wordpress poster

I guess this is not the number one post I would like to start with on this blog, but I feel the time is right for it (community-wise).

I’ll move on to the subject matter in a moment, but first a short intro: This blog is written by Tal Galili. I am an aspiring statistician who also loves to use R for his work. At the same time I am also a WordPress blogger, writing mainly at www.TalGalili.com where I can use my native language (Hebrew) for self expression.

This combination of statistics and blogging will lead me to sometimes much less statistical, but more Web/Open-Source oriented posts like this one. So for the statisticians in the audience I extend my apologies and invite you to wait for future posts which will be more fully focused on Statistics and R.

And now for the topic at hand. . .

*         *         *         *         *
I have just noticed the nice article published on the wordpress development blog titled “2.9 Features Vote Results“. The post exemplifies a wonderful trend in the WordPress community (led by Jane Wells) having to do with connecting between the core team and the WordPress user community. The way Jane does this is by giving surveys to WordPress users,  which in turn offers the WordPress core team an opportunity to understand the community needs.

In the post “2.9 Features Vote Results“, Jane presented the results of such survey. The post had tables and barplots, but the barplots were only present for the one dimensional variables. In contrast, more elaborate data, such as that of question 2 (asking to rate each of 11 potential features on a scale of 1 to 5), was shown only with a table, such as this:

q2

The table gives the full information (although I would love it if it was easily downloadable, instead of having to type in the numbers) – but its main limitation is that from a quick look, one can not easily get (let alone understand) anything.

For the goal of understanding more of the results with a quick glance, I offer two simple and well-known visualizations for the results.

1) Parallel barplots (click for bigger image)

q2-option1

This plot can be easily implemented in Excel (although I did it in R) and can allow us to compare the different ranking each potential feature received.

For example, this shows us that most answers were usually given rank 4 (“would be nice”) for each feature.

2) Mosaic plot (click for bigger image)

q2-option2

I don’t know if this can be done in Excel, but with R it is just a simple line of code.

(mosaicplot((DataSet.table), las = 1, col = c(“gray”,”gray”,”blue”,3,”dark green”), main = “”))

The advantage of this plot is that it allows us to compare the different features easily, while not only comparing the top rank, but also combining different rankings for easy comparison (for example, comparing how many rank 4 or 5 each feature received).

So for example, the plot shows me that the most ranked with number 5 was the feature “easier embeds” but the most ranked “number 4 or 5″ was the feature “custom image sizes”. The feature “media album” came close to these two, but didn’t top either.

Conclusions from this post:

  1. It would be nice (if possible) to publish the full data of the surveys, not just the results.
  2. The second question of the survey gives different answer than the first question. But since the difference in percentage seems to be so small compared to the other options, I would guess that all of the top 4 features are more or less in the same level of interest to the community.

p.s. to Jane – why do none of the numbers in this table add up to 3406 (the number of respondants) ?

p.p.s.  to Jane and the Dev team – great work people!

Share with friends:Twitter Facebook del.icio.us Digg StumbleUpon

Simple visualization of a 11X5 table (for WordPress 2.9 Features Vote Results)

July 31, 2009 · Posted in R bloggers · Comments Off 

Simply Something Sophisicated - a Wordpress poster

I guess this is not the number one post I would like to start with on this blog, but I feel the time is right for it (community-wise).

I’ll move on to the subject matter in a moment, but first a short intro: This blog is written by Tal Galili. I am an aspiring statistician who also loves to use R for his work. At the same time I am also a WordPress blogger, writing mainly at www.TalGalili.com where I can use my native language (Hebrew) for self expression.

This combination of statistics and blogging will lead me to sometimes much less statistical, but more Web/Open-Source oriented posts like this one. So for the statisticians in the audience I extend my apologies and invite you to wait for future posts which will be more fully focused on Statistics and R.

And now for the topic at hand. . .

*         *         *         *         *
I have just noticed the nice article published on the wordpress development blog titled “2.9 Features Vote Results“. The post exemplifies a wonderful trend in the WordPress community (led by Jane Wells) having to do with connecting between the core team and the WordPress user community. The way Jane does this is by giving surveys to WordPress users,  which in turn offers the WordPress core team an opportunity to understand the community needs.

In the post “2.9 Features Vote Results“, Jane presented the results of such survey. The post had tables and barplots, but the barplots were only present for the one dimensional variables. In contrast, more elaborate data, such as that of question 2 (asking to rate each of 11 potential features on a scale of 1 to 5), was shown only with a table, such as this:

q2

The table gives the full information (although I would love it if it was easily downloadable, instead of having to type in the numbers) – but its main limitation is that from a quick look, one can not easily get (let alone understand) anything.

For the goal of understanding more of the results with a quick glance, I offer two simple and well-known visualizations for the results.

1) Parallel barplots (click for bigger image)

q2-option1

This plot can be easily implemented in Excel (although I did it in R) and can allow us to compare the different ranking each potential feature received.

For example, this shows us that most answers were usually given rank 4 (”would be nice”) for each feature.

2) Mosaic plot (click for bigger image)

q2-option2

I don’t know if this can be done in Excel, but with R it is just a simple line of code.

(mosaicplot((DataSet.table), las = 1, col = c(”gray”,”gray”,”blue”,3,”dark green”), main = “”))

The advantage of this plot is that it allows us to compare the different features easily, while not only comparing the top rank, but also combining different rankings for easy comparison (for example, comparing how many rank 4 or 5 each feature received).

So for example, the plot shows me that the most ranked with number 5 was the feature “easier embeds” but the most ranked “number 4 or 5″ was the feature “custom image sizes”. The feature “media album” came close to these two, but didn’t top either.

Conclusions from this post:

  1. It would be nice (if possible) to publish the full data of the surveys, not just the results.
  2. The second question of the survey gives different answer than the first question. But since the difference in percentage seems to be so small compared to the other options, I would guess that all of the top 4 features are more or less in the same level of interest to the community.

p.s. to Jane – why do none of the numbers in this table add up to 3406 (the number of respondants) ?

p.p.s.  to Jane and the Dev team – great work people!

Share with friends:Twitter Facebook del.icio.us Digg StumbleUpon

Kruskal-Wallis one-way analysis of variance

July 31, 2009 · Posted in R bloggers · Comments Off 
If you have to perform the comparison between multiple groups, but you can not run a ANOVA for multiple comparisons because the groups do not follow a normal distribution, you can use the Kruskal-Wallis test, which can be applied when you can not make the assumption that the groups follow a gaussian distribution.
This test is similar to the Wilcoxon test for 2 samples.

Suppose you want to see if the means of the following 4 sets of values are statistically similar:
Group A: 1, 5, 8, 17, 16
Group B: 2, 16, 5, 7, 4
Group C: 1, 1, 3, 7, 9
Group D: 2, 15, 2, 9, 7


To use the test of Kruskal-Wallis simply enter the data, and then organize them into a list:


a = c(1, 5, 8, 17, 16)
b = c(2, 16, 5, 7, 4)
c = c(1, 1, 3, 7, 9)
d = c(2, 15, 2, 9, 7)

dati = list(g1=a, g2=b, g3=c, g4=d)


Now we can apply the kruskal.test() function:


kruskal.test(dati)

Kruskal-Wallis rank sum test

data: dati
Kruskal-Wallis chi-squared = 1.9217, df = 3, p-value = 0.5888


The value of the test statistic is 1.9217. This value already contains the fix when there are ties (repetitions). The p-value is greater than 0.05; also the value of the test statistic is lower than the chi-square-tabulation:


qchisq(0.950, 3)
[1] 7.814728


The conclusion is therefore that I accept the null hypothesis H0: the means of the 4 groups are statistically equal.

Analysis of variance: ANOVA, for multiple comparisons

July 30, 2009 · Posted in R bloggers · Comments Off 
Analysis of variance: ANOVA, for multiple comparisons

The ANOVA model can be used to compare the mean of several groups with each other, using a parametric method (assuming that the groups follow a Gaussian distribution).
Proceed with the following example:

The manager of a supermarket chain wants to see if the consumption in kilowatts of 4 stores between them are equal. He collects data at the end of each month for 6 months. The results are:
Store A: 65, 48, 66, 75, 70, 55
Store B: 64, 44, 70, 70, 68, 59
Store C: 60, 50, 65, 69, 69, 57
Store D: 62, 46, 68, 72, 67, 56


To proceed with the verification ANOVA, we must first verify the homoskedasticity (ie test for homogeneity of variances). The software R provides two tests: the Bartlett test, and the Fligner-Killeen test.




We begin with the Bartlett test.

First we create the 4 vectors:


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)


Now we combine the 4 vectors in a single vector:


dati = c(a, b, c, d)


Now, on this vector in which are stored all the data, we create the 4 levels:


groups = factor(rep(letters[1:4], each = 6))


We can observe the contents of the vector groups simply by typing groups + [enter].

At this point we start the Bartlett test:


bartlett.test(dati, groups)

Bartlett test of homogeneity of variances

data: dati and groups
Bartlett's K-squared = 0.4822, df = 3, p-value = 0.9228


The function gave us the value of the statistical tests (K squared), and the p-value. Can be argued that the variances are homogeneous since p-value > 0.05. Alternatively, we can compare the Bartlett's K-squared with the value of chi-square-tables; we compute that value, assigning the value of alpha and degrees of freedom at the qchisq function:


qchisq(0.950, 3)
[1] 7.814728


Chi-squared > Bartlett's K-squared: we accept the null hypothesis H0 (variances homogeneity)




We try now to check the homoskedasticity, with the Fligner-Killeen test.
The syntax is quite similar, and then proceed quickly.


a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

dati = c(a, b, c, d)

groups = factor(rep(letters[1:4], each = 6))

fligner.test(dati, groups)

Fligner-Killeen test of homogeneity of variances

data: dati and groups
Fligner-Killeen:med chi-squared = 0.1316, df = 3, p-value = 0.9878


The conclusions are similar to those for the test of Bartlett.




Having verified the homoskedasticity of the 4 groups, we can proceed with the ANOVA model.

First organize the values, fitting the model:


fit = lm(formula = dati ~ groups)


Then we analyze the ANOVA model:


anova (fit)

Analysis of Variance Table

Response: dati
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 8.46 2.82 0.0327 0.9918
Residuals 20 1726.50 86.33


The output of the function is a classical ANOVA table with the following data:
Df = degree of freedom
Sum Sq = deviance (within groups, and residual)
Mean Sq = variance (within groups, and residual)
F value = the value of the Fisher statistic test, so computed (variance within groups) / (variance residual)
Pr(>F) = p-value

Since p-value > 0.05, we accept the null hypothesis H0: the four means are statistically equal. We can also compare the computed F-value with the tabulated F-value:

qf(0.950, 20, 3)
[1] 8.66019


Tabulated F-value > computed F-value: we accept the null hyptohesis.

Comparison of two proportions: parametric (Z-test) and non-parametric (chi-squared) methods

July 29, 2009 · Posted in R bloggers · Comments Off 
Consider for example the following problem.
The owner of a betting company wants to verify whether a customer is cheating or not. To do this want to compare the number of successes of one player with the number of successes of one of his employees, of which he is certain that he is not cheating. In a month's time, the player performs 74 bets and wins 30; the player in the same period of time making 103 bets, wins 65. Your client is a cheat or not?

A problem of this kind can be solved in two different ways: using a parametric and a non-parametric method.

* Solution with the parametric method: Z-test.

You can use a Z-test if you can do the following two assumptions: the probability of common success is approximate 0.5, and the number of games is very high (under these assumption, a binomial distribution is approximate a gaussian distribution). Suppose that this is the case. In R there is no function to calculate the value of Z, so we remember the mathematical formula, and we write our function:

$$Z=\frac{\frac{x_1}{n_1}-\frac{x_2}{n_s}}{\sqrt{\widehat{p}(1-\widehat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}$$


z.prop = function(x1,x2,n1,n2){
numerator = (x1/n1) - (x2/n2)
p.common = (x1+x2) / (n1+n2)
denominator = sqrt(p.common * (1-p.common) * (1/n1 + 1/n2))
z.prop.ris = numerator / denominator
return(z.prop.ris)
}


Z.prop function calculates the value of Z, receiving input the number of successes (x1 and x2), and the total number of games (n1 and n2). We apply the function just written with the data of our problem:


z.prop(30, 65, 74, 103)
[1] -2.969695


We obtained a value of z greater than the value of z-tabulated (1.96), which leads us to conclude that the player that the director was looking at is actually a cheat, since its probability of success is higher than a non-cheat user.

* Solution with the non-parametric method: Chi-squared test.


Suppose now that it can not make any assumption on the data of the problem, so that it can not approximate the binomial with a Gauss. We solve the problem with the test of chi-square applied to a 2x2 contingency table. In R there is the function prop.test.


prop.test(x = c(30, 65), n = c(74, 103), correct = FALSE)

2-sample test for equality of proportions without continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 8.8191, df = 1, p-value = 0.002981
alternative hypothesis: two.sided
95 percent confidence interval:
-0.37125315 -0.08007196
sample estimates:
prop 1 prop 2
0.4054054 0.6310680


Prop.test function calculates the value of chi-square, given the values of success (in the vector x) and total attempts (in the vector n). The vectors x and n can also be previously declared, and then be retrieved as usual: prop.test (x, n, correct = FALSE).

In the case of small samples (low value of n), you must specify correct = TRUE, so as to change the computation of chi-square based on the continuity of Yates:


prop.test(x = c(30, 65), n = c(74, 103), correct=TRUE)

2-sample test for equality of proportions with continuity correction

data: c(30, 65) out of c(74, 103)
X-squared = 7.9349, df = 1, p-value = 0.004849
alternative hypothesis: two.sided
95 percent confidence interval:
-0.38286428 -0.06846083
sample estimates:
prop 1 prop 2
0.4054054 0.6310680


In both cases, we obtained p-value less than 0.05, which leads us to reject the hypothesis of equal probability. In conclusion, the customer is a cheat. For confirmation we compare the value chi-square-value calculated with the chi-square-tabulation, which we calculate in this way:


qchisq(0.950, 1)
[1] 3.841459


qchisq function calculates the value of chi-square as a function of alpha and degrees of freedom. Since chi-square-calculated is greater than chi-square-tabulation, we conclude by rejecting the hypothesis H0 (as stated by the p-value, and the parametric test).

Wilcoxon signed rank test

July 29, 2009 · Posted in R bloggers · Comments Off 
Non-parametric statistical hypothesis test, for the comparison of the means between 2 paired samples

The mayor of a city wants to see if pollution levels are reduced by closing the streets to the car traffic. This is measured by the rate of pollution every 60 minutes (8am 22pm: total of 15 measurements) in a day when traffic is open, and in a day of closure to traffic. Here the values of air pollution:

With traffic: 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234
Without traffic: 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112


It is clear that the two groups are paired, because there is a bond between the readings, consisting in the fact that we are considering the same city (with its peculiarities weather, ventilation, etc.) albeit in two different days. Not being able to assume a Gaussian distribution for the values recorded, we must proceed with a non-parametric test, the Wilcoxon signed rank test.


a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)

wilcox.test(a,b, paired=TRUE)

Wilcoxon signed rank test

data: a and b
V = 80, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0



Since the p-value is greater than 0.05, we conclude that the means have remained essentially unchanged (we accept the null hypothesis H0), then blocking traffic for a single day did not lead to any improvement in terms of pollution of the city.
The value V = 80 corresponds to the sum of ranks assigned to the differences with positive sign. We can manually calculate the sum of ranks assigned to the differences with positive sign, and the sum of ranks assigned to the differences with negative sign, to compare this interval with the interval tabulated on the tables of Wilcoxon for paired samples, and confirm our statistic decision. Here's how to calculate the two sums.


a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)

diff <- c(a - b) #calculating the vector containing the differences
diff <- diff[ diff!=0 ] #delete all differences equal to zero
diff.rank <- rank(abs(diff)) #check the ranks of the differences, taken in absolute
diff.rank.sign <- diff.rank * sign(diff) #check the sign to the ranks, recalling the signs of the values of the differences
ranks.pos <- sum(diff.rank.sign[diff.rank.sign > 0]) #calculating the sum of ranks assigned to the differences as a positive, ie greater than zero
ranks.neg <- -sum(diff.rank.sign[diff.rank.sign < 0]) #calculating the sum of ranks assigned to the differences as a negative, ie less than zero
ranks.pos #it is the value V of the wilcoxon signed rank test
[1] 80
ranks.neg
[1] 40


The computed interval is (40, 80). The tabulated interval on Wilcoxon paired samples tables, with 15 differences, is (25, 95). Since the calculated interval is contained in the tabulated, we accept the null hypothesis H0 of equality of the means. As predicted by the p-value, closing roads to traffic did not bring any improvement in terms of rate of pollution.

I know it’s been so long…

July 28, 2009 · Posted in R bloggers · Comments Off 
Hey,

I know it's been so long since last time I posted something in here, but I was really busy with my thesis and some other stuff, but now that I have more time I promise I'll post some interesting stuff in here, by the way, I found such an Interesting site talking about R being used by Facebook and Google, it's by Michael E. Driscoll, here it is in case you want to check it:

http://dataspora.com/blog/predictive-analytics-using-r/
Have a nice day.

Corpus Linguistics with R, Day 2

July 28, 2009 · Posted in R bloggers · Comments Off 

R Lesson 2


text<-c("This is a first example sentence.", "And this is a second example sentence.")

# gsub replaces stuff in strings

> gsub ("second", "third", text)
SEARCH-REPLACE-SUBJECT
[1] "This is a first example sentence."
[2] "And this is a third example sentence."
> gsub ("n", "X", text)
[1] "This is a first example seXteXce."
[2] "AXd this is a secoXd example seXteXce."
> gsub ("is", "was", text)
[1] "Thwas was a first example sentence."
[2] "And thwas was a second example sentence."

---

Perl-style regex

^ beginning of str, e.g. "^x", ***OR*** NOT inside of []
$ end of str, e.g. "x$"
. any other char
\ escape char - TWO ("\\") needed
[] character classes, e.g. [aeiou] vowels, [a-h] is same as [abcdefgh]
{MIN,MAX} number of immediately preceding unit (chacter)

examples
lo+l

> grep("analy[sz]e", c("analyze", "analyse", "moo"), perl=T, value=T)
[1] "analyze" "analyse"

> grep("(first|second)", text, perl=T, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("(first|lalala)", text, perl=T, value=T)
[1] "This is a first example sentence."
>

> grep("ab{2}", z, perl=T, value=T)
[1] "aabbccdd"
> grep("(ab){2}", z, perl=T, value=T)
[1] "ababcdcd"
>
>
> gsub("a (first|second)", "another", text, perl=T)
[1] "This is another example sentence."
[2] "And this is another example sentence."
>
>
>
>
> gsub("[abcdefgh]", "X", text, perl=T)
[1] "TXis is X Xirst XxXmplX sXntXnXX."
[2] "AnX tXis is X sXXonX XxXmplX sXntXnXX."

> grep("forg[eo]t(s|ting|ten)?_v", a.corpus.file, perl=T, value=T)
all forms of forget

*? lazy matching e.g.
gregexpr("s.*?s", text[1], perl=T)

> gregexpr("s.*?s", text[1], perl=T)
[[1]]
[1] 4 14
attr(,"match.length")
[1] 4 12

# note: things that are matched are consumed and can then not be found again in the same passtext

> gsub("(19|20)[0-9]{2}", "YEAR", text)
[1] "They killed 250 people in YEAR." "No, it was in YEAR."
> #replaces only 19xx and 20xx

---

> textfile<-scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile<-tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> unlist(strsplit(textfile, "//W"))
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> text_split<-unlist(strsplit(textfile, "//W"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
>
> text_split<-unlist(strsplit(textfile, "//W"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> text_split<-unlist(strsplit(textfile, "\\W"))

> textfile<-scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile<-tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> unlist(strsplit(textfile, "//W"))
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."

> text_split<-unlist(strsplit(textfile, "//W+"))
> text_split
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> sort(table(text_split), decreasing=T)
text_split
to software the free and general
9 9 7 5 4 3 3
is it license public your by change
3 3 3 3 3 2 2
for foundation freedom gnu most other share
2 2 2 2 2 2 2
all any applies apply are authors away
1 1 1 1 1 1 1
can commit contrast covered designed guarantee instead
1 1 1 1 1 1 1
intended its library licenses make of program
1 1 1 1 1 1 1
programs s some sure take this too
1 1 1 1 1 1 1
users using whose you
1 1 1 1
>

> text_freqs
text_split
to software the free and general is
9 7 5 4 3 3 3
it license public your by change for
3 3 3 3 2 2 2
foundation freedom gnu most other share all
2 2 2 2 2 2 1
any applies apply are authors away can
1 1 1 1 1 1 1
commit contrast covered designed guarantee instead intended
1 1 1 1 1 1 1
its library licenses make of program programs
1 1 1 1 1 1 1
s some sure take this too users
1 1 1 1 1 1 1
using whose you
1 1 1
> text_freqs[text_freqs>1]
text_split
to software the free and general is
9 7 5 4 3 3 3
it license public your by change for
3 3 3 3 2 2 2
foundation freedom gnu most other share
2 2 2 2 2 2
>

> !(text_split %in% stop_list)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[37] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[61] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> text_stopremoved<-text_split[!(text_split %in% stop_list)]
> text_stopremoved
[1] "licenses" "for" "most" "software" "are"
[6] "designed" "to" "take" "away" "your"
[11] "freedom" "to" "share" "change" "it"
[16] "by" "contrast" "gnu" "general" "public"
[21] "license" "is" "intended" "to" "guarantee"
[26] "your" "freedom" "to" "share" "change"
[31] "free" "software" "to" "make" "sure"
[36] "software" "is" "free" "for" "all"
[41] "its" "users" "this" "general" "public"
[46] "license" "applies" "to" "most" "free"
[51] "software" "foundation" "s" "software" "to"
[56] "any" "other" "program" "whose" "authors"
[61] "commit" "to" "using" "it" "some"
[66] "other" "free" "software" "foundation" "software"
[71] "is" "covered" "by" "gnu" "library"
[76] "general" "public" "license" "instead" "you"
[81] "can" "apply" "it" "to" "your"
[86] "programs" "too"
>

# LOAD an R file
source("something.r")

Corpus Linguistics with R, Day 1

July 28, 2009 · Posted in R bloggers · Comments Off 

(This post documents the first day of a class on R that I took at ESU C&T. I is posted here purely for my own use.)


R Lesson 1

> 2+3; 2/3; 2^3
[1] 5
[1] 0.6666667
[1] 8

---

Fundamentals - Functions

> log(x=1000, base=10)
[1] 3

---

(Formals describes the syntax of other functions)

formals(sample)

---

Variables

( <- allows you to save something in a data structure (variable) )
> a<-2+3
> a
[1] 5

# is for comments

whitespace doesn't matter

---
# Pick files
file.choose()

# Get working dir
getwd()

# Set working dir
setwd("..")

# Save
> save(VARIABLE_NAME, file=file.choose())
Fehler in save(test, file = file.choose()) : Objekt ‘test’ nicht gefunden
> save.image("FILE_NAME")

---

> setwd("/home/cornelius/Code/samples/Brown_95perc")
> getwd()
[1] "/home/cornelius/Code/samples/Brown_95perc"
> dir()

> my_array <- c(1,2,3,4)
> my_array
[1] 1 2 3 4
> my_array <- c("lalala", "lululu", "bla")
> my_array2 <- c(1,2,3,4)
> c(my_array, my_array2)
[1] "lalala" "lululu" "bla" "1" "2" "3" "4"
>

# it is possible to add something to ALL values in a vector, i.e.
my_array2 + 10

# c (conc) makes a list
stuff1<-c(1,2,3,4,5)

---

# sequence starts at 1 (first arg), goes on for 5 (second arg), increments by 1 (third arg)
seq(1, 5, 1)

---

# put a file into a corpus vector
# what=real|char sep=seperator
> my_corpus<-scan(file=file.choose(), what="char", sep="\n")

# unique elements in my array
unique(array)

# count elements in an array
table(array)

# sort elements in an array
sort(table(array))

---
# this tells me the position of the elements in my text that aren't "this"
> values<-which(my_little_corpus!="this")
> values
[1] 2 3 4 5 6 7 8 9 11 12 13 14

# this will produce TRUE|FALSE for my condition (is this element "this")
> values<-my_little_corpus!="this"
> values
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[13] TRUE TRUE

# this will return the array without "this"
> values<-my_little_corpus[my_little_corpus!="this"]
> values
[1] "is" "just" "a" "little" "example" "bla" "bla"
[8] "bla" "is" "the" "third" "line"

...

> cc<-c("banana", "bagel")
> cc == "banana"; cc!="banana" #
[1] TRUE FALSE
[1] FALSE TRUE
> "banana" %in% cc
[1] TRUE
> c("bagel", "banana") %in% cc
[1] TRUE TRUE
> match ("banana", cc)
[1] 1
> match (c("bagel","banana"), cc)
[1] 2 1

# match looks for a list of tokens and returns their position in the datastructure

---
> cat(bb, sep="\n", file=scan(what="char"), append=F)
# write the contents of bb to a file, ask the user for file

moo<-scan(what="char")
# read something the user types into a var

# Clear Mem
> rm(list=ls(all=T))
>

---

# create vector1 (ordered)
vec1<-c("a","b","c","d","e","f,",g","h","i","j")

# oder
# > letters[1:10]
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# create vector2 (random)
# > vector2<-sample(vector1)

---

length()
# number of elements

nchar()
# number of characters

> aa<-"know"
> nchar(aa)
[1] 4
> aa<-c("I","do","not","know")
> nchar(aa)
[1] 1 2 3 4
> lala<-c("cat","gnu","hippopotamus")
> lala
[1] "cat" "gnu" "hippopotamus"
> nchar(lala)
[1] 3 3 12

> substr("hippopotamus", 0, 5)
[1] "hippo"
>

# like explode() / implode()
paste (string, sep="my_seperator", collapse="stuff to put in")

---

# percentages
x/sum(x)

barplot (1,2,3)

Read in corpus data and build a list of words frequencies
1) scan file
2) strsplit by " "
3) unlist to make vector
4) make a table with freqs
5) sort
6) output

#search for strings
grep("needle", haystack)

> grep("is", text, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("And", text, value=T)
[1] "And this is a second example sentence."
> grep("sentence", text, value=T)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
>

gregexpr
# alternative to grep, returns a list of vectors

> mat<-gregexpr("e", text)
> mat
[[1]]
[1] 17 23 26 29 32
attr(,"match.length")
[1] 1 1 1 1 1

[[2]]
[1] 16 22 28 31 34 37
attr(,"match.length")
[1] 1 1 1 1 1 1

> unlist(mat)
[1] 17 23 26 29 32 16 22 28 31 34 37
> mat<-gregexpr("sentence", text)
> sapply (mat, c)
[1] 25 30

Wilcoxon-Mann-Whitney rank sum test (or test U)

July 27, 2009 · Posted in R bloggers · Comments Off 
Comparison of the averages of two independent groups of samples, of which we can not assume a distribution of Gaussian type; is also known as Mann-Whitney U-test.

You want to see if the mean of goals suffered by two football teams over the years is the same. Are below the number of goals suffered by each team in 6 games for each year.
Team A: 6, 8, 2, 4, 4, 5
Team B: 7, 10, 4, 3, 5, 6


The Wilcoxon-Matt-Whitney test (or Wilcoxon rank sum test, or Mann-Whitney U-test) is used when is asked to compare the means of two groups that do not follow a normal distribution: it is a non-parametrical test. It is the equivalent of the t test, applied for independent samples.
Let's see how to solve the problem with R:


a = c(6, 8, 2, 4, 4, 5)
b = c(7, 10, 4, 3, 5, 6)

wilcox.test(a,b, correct=FALSE)

Wilcoxon rank sum test

data: a and b
W = 14, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0


The p-value is greater than 0.05, then we can accept the hypothesis H0 of statistical equality of the means of two groups.
If you run wilcox.test(b, a, correct = FALSE), the p-value would be logically the same:


a = c(6, 8, 2, 4, 4, 5)
b = c(7, 10, 4, 3, 5, 6)

wilcox.test(b,a, correct=FALSE)

Wilcoxon rank sum test

data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0


The value W is so computed:


sum.rank.a = sum(rank(c(a,b))[1:6]) #sum of ranks assigned to the group a
W = sum.rank.a – (length(a)*(length(a)+1)) / 2
W
[1] 14

sum.rank.b = sum(rank(c(a,b))[7:12]) #sum of ranks assigned to the group b
W = sum.rank.b – (length(b)*(length(b)+1)) / 2
W
[1] 22


We can finally compare the intervals tabulated on the tables of Wilcoxon for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52), while the interval of our samples is:


sum(rank(c(a,b))[1:6]) #sum of ranks assigned to the group a
[1] 35
sum(rank(c(a,b))[7:12]) #sum of ranks assigned to the group b
[1] 43


Since the computed interval (35, 43), is contained within the tabulated interval (26, 52), we conclude by accepting the hypothesis H0 of equality of means.

Next Page »

Diag| Memory: Current usage: 35710 KB
Diag| Memory: Peak usage: 36533 KB