Excel (and French people) are such a pain in the…

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few days ago, I published a post entitled extracting datasets from excel files in a zipped folder, because I wanted to use datasets that were online, in some (zipped) excel format. The first difficult part was the folder with a non-standard character (the French é). Because next week I should be using those dataset in a crash course in Gabon (in Africa), I wanted to make sure that everthing will go fine when we will run the code. And discussing @3wen‘s trick on Day 1 was maybe not the best way to explain that R is a very simple tool that should be used for data analysis…

To make thing easier, I did upload the xlsx files on my webpage. I wanted to use the xlsx R package. Unfortunately, on my linux laptop, I have troubles installing that package.

> install.packages("xlsx")
Installing package into '/home/arthur/R/x86_64-pc-linux-gnu-library/3.1'
(as 'lib' is unspecified)
also installing the dependencies 'rJava', 'xlsxjars'

trying URL 'http://cran.parentingamerica.com/src/contrib/rJava_0.9-6.tar.gz'
Content type 'application/x-gzip' length 567515 bytes (554 Kb)
opened URL
==================================================
downloaded 554 Kb

trying URL 'http://cran.parentingamerica.com/src/contrib/xlsxjars_0.6.1.tar.gz'
Content type 'application/x-gzip' length 9477071 bytes (9.0 Mb)
opened URL
==================================================
downloaded 9.0 Mb

trying URL 'http://cran.parentingamerica.com/src/contrib/xlsx_0.5.7.tar.gz'
Content type 'application/x-gzip' length 312839 bytes (305 Kb)
opened URL
==================================================
downloaded 305 Kb

ERROR: configuration failed for package 'rJava'

Warning messages:
1: In install.packages("xlsx") :
  installation of package 'rJava' had non-zero exit status
2: In install.packages("xlsx") :
  installation of package 'xlsxjars' had non-zero exit status
3: In install.packages("xlsx") :
  installation of package 'xlsx' had non-zero exit status

I did experience similar problems with other packages,

> library(gdata)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "fr_CA:fr",
	LC_ALL = (unset),
	LANG = "fr_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

Just to make sure that everything will work just fine on Monday morning, I thought that it might be good to save the dataset in a – standard – csv format.

Here we go,

> CIMAH<-read.table(
+ "http://freakonometrics.free.fr/CIMAH.csv",
+ header=TRUE,sep=";")
> head(CIMAH)
  Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H.
1   0   1,000,000     0,5368%       5,368
2   1     994,632     0,0730%         726
3   2     993,906     0,0559%         555
4   3     993,351     0,0476%         473
5   4     992,878     0,0407%         404
6   5     992,474     0,0371%         368

Great, I can read the file. But I cannot use it. Commas are used here as decimal and thousands separators (yes, both).I knew about the French decimal separator (which is usually the comma), but usually, French people use a space as a thousands separator (not a comma).

The first step was to drop the comma used as a thousand separator (and convert also that factor as a number)

> dropcomma=Vectorize(function(x){
+ as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="")) })

The second step was to play with the percentage, to drop the percentage sign, and convert the comma as a decimal separator (the standard dot). I did that in two steps

> commatodot=function(x){
+ return(as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="."))) }
> droppercent=Vectorize(function(x){
+ comatodot(paste(unlist(strsplit(
+ as.character(x),"%")),collapse=""))/100 })

Then, we should use those functions on the three columns,

> CIMAH[,2]=dropcomma(CIMAH[,2])
> CIMAH[,3]=droppercent(CIMAH[,3])
> CIMAH[,4]=dropcomma(CIMAH[,4])

and indeed, it worked,

> head(CIMAH)
  Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H.
1   0     1000000    0.005368        5368
2   1      994632    0.000730         726
3   2      993906    0.000559         555
4   3      993351    0.000476         473
5   4      992878    0.000407         404
6   5      992474    0.000371         368

Numbers are actual numbers, so I can plot some graphs, such that the mortality rate, per age,

> plot(CIMAH$Age,CIMAH$qx..CIMA.H.,log="y")

I could not imagine that it would be that long to read this excel spreadsheet and to get a (simple) graph…

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)