Excel (and French people) are such a pain in the…
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A few days ago, I published a post entitled extracting datasets from excel files in a zipped folder, because I wanted to use datasets that were online, in some (zipped) excel format. The first difficult part was the folder with a non-standard character (the French é). Because next week I should be using those dataset in a crash course in Gabon (in Africa), I wanted to make sure that everthing will go fine when we will run the code. And discussing @3wen‘s trick on Day 1 was maybe not the best way to explain that R is a very simple tool that should be used for data analysis…
To make thing easier, I did upload the xlsx files on my webpage. I wanted to use the xlsx R package. Unfortunately, on my linux laptop, I have troubles installing that package.
> install.packages("xlsx")
Installing package into '/home/arthur/R/x86_64-pc-linux-gnu-library/3.1'
(as 'lib' is unspecified)
also installing the dependencies 'rJava', 'xlsxjars'
trying URL 'http://cran.parentingamerica.com/src/contrib/rJava_0.9-6.tar.gz'
Content type 'application/x-gzip' length 567515 bytes (554 Kb)
opened URL
==================================================
downloaded 554 Kb
trying URL 'http://cran.parentingamerica.com/src/contrib/xlsxjars_0.6.1.tar.gz'
Content type 'application/x-gzip' length 9477071 bytes (9.0 Mb)
opened URL
==================================================
downloaded 9.0 Mb
trying URL 'http://cran.parentingamerica.com/src/contrib/xlsx_0.5.7.tar.gz'
Content type 'application/x-gzip' length 312839 bytes (305 Kb)
opened URL
==================================================
downloaded 305 Kb
ERROR: configuration failed for package 'rJava'
Warning messages:
1: In install.packages("xlsx") :
installation of package 'rJava' had non-zero exit status
2: In install.packages("xlsx") :
installation of package 'xlsxjars' had non-zero exit status
3: In install.packages("xlsx") :
installation of package 'xlsx' had non-zero exit status
I did experience similar problems with other packages,
> library(gdata)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "fr_CA:fr",
LC_ALL = (unset),
LANG = "fr_CA.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Just to make sure that everything will work just fine on Monday morning, I thought that it might be good to save the dataset in a – standard – csv format.
Here we go,
> CIMAH<-read.table( + "http://freakonometrics.free.fr/CIMAH.csv", + header=TRUE,sep=";") > head(CIMAH) Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H. 1 0 1,000,000 0,5368% 5,368 2 1 994,632 0,0730% 726 3 2 993,906 0,0559% 555 4 3 993,351 0,0476% 473 5 4 992,878 0,0407% 404 6 5 992,474 0,0371% 368
Great, I can read the file. But I cannot use it. Commas are used here as decimal and thousands separators (yes, both).I knew about the French decimal separator (which is usually the comma), but usually, French people use a space as a thousands separator (not a comma).
The first step was to drop the comma used as a thousand separator (and convert also that factor as a number)
> dropcomma=Vectorize(function(x){
+ as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="")) })
The second step was to play with the percentage, to drop the percentage sign, and convert the comma as a decimal separator (the standard dot). I did that in two steps
> commatodot=function(x){
+ return(as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="."))) }
> droppercent=Vectorize(function(x){
+ comatodot(paste(unlist(strsplit(
+ as.character(x),"%")),collapse=""))/100 })
Then, we should use those functions on the three columns,
> CIMAH[,2]=dropcomma(CIMAH[,2]) > CIMAH[,3]=droppercent(CIMAH[,3]) > CIMAH[,4]=dropcomma(CIMAH[,4])
and indeed, it worked,
> head(CIMAH) Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H. 1 0 1000000 0.005368 5368 2 1 994632 0.000730 726 3 2 993906 0.000559 555 4 3 993351 0.000476 473 5 4 992878 0.000407 404 6 5 992474 0.000371 368
Numbers are actual numbers, so I can plot some graphs, such that the mortality rate, per age,
> plot(CIMAH$Age,CIMAH$qx..CIMA.H.,log="y")

I could not imagine that it would be that long to read this excel spreadsheet and to get a (simple) graph…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.