Excel (and French people) are such a pain in the…

November 6, 2014

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

A few days ago, I published a post entitled extracting datasets from excel files in a zipped folder, because I wanted to use datasets that were online, in some (zipped) excel format. The first difficult part was the folder with a non-standard character (the French é). Because next week I should be using those dataset in a crash course in Gabon (in Africa), I wanted to make sure that everthing will go fine when we will run the code. And discussing @3wen‘s trick on Day 1 was maybe not the best way to explain that R is a very simple tool that should be used for data analysis…

To make thing easier, I did upload the xlsx files on my webpage. I wanted to use the xlsx R package. Unfortunately, on my linux laptop, I have troubles installing that package.

> install.packages("xlsx")
Installing package into '/home/arthur/R/x86_64-pc-linux-gnu-library/3.1'
(as 'lib' is unspecified)
also installing the dependencies 'rJava', 'xlsxjars'

trying URL 'http://cran.parentingamerica.com/src/contrib/rJava_0.9-6.tar.gz'
Content type 'application/x-gzip' length 567515 bytes (554 Kb)
opened URL
downloaded 554 Kb

trying URL 'http://cran.parentingamerica.com/src/contrib/xlsxjars_0.6.1.tar.gz'
Content type 'application/x-gzip' length 9477071 bytes (9.0 Mb)
opened URL
downloaded 9.0 Mb

trying URL 'http://cran.parentingamerica.com/src/contrib/xlsx_0.5.7.tar.gz'
Content type 'application/x-gzip' length 312839 bytes (305 Kb)
opened URL
downloaded 305 Kb

ERROR: configuration failed for package 'rJava'

Warning messages:
1: In install.packages("xlsx") :
  installation of package 'rJava' had non-zero exit status
2: In install.packages("xlsx") :
  installation of package 'xlsxjars' had non-zero exit status
3: In install.packages("xlsx") :
  installation of package 'xlsx' had non-zero exit status

I did experience similar problems with other packages,

> library(gdata)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "fr_CA:fr",
	LC_ALL = (unset),
	LANG = "fr_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

Just to make sure that everything will work just fine on Monday morning, I thought that it might be good to save the dataset in a – standard – csv format.

Here we go,

> CIMAH<-read.table(
+ "http://freakonometrics.free.fr/CIMAH.csv",
+ header=TRUE,sep=";")
> head(CIMAH)
  Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H.
1   0   1,000,000     0,5368%       5,368
2   1     994,632     0,0730%         726
3   2     993,906     0,0559%         555
4   3     993,351     0,0476%         473
5   4     992,878     0,0407%         404
6   5     992,474     0,0371%         368

Great, I can read the file. But I cannot use it. Commas are used here as decimal and thousands separators (yes, both).I knew about the French decimal separator (which is usually the comma), but usually, French people use a space as a thousands separator (not a comma).

The first step was to drop the comma used as a thousand separator (and convert also that factor as a number)

> dropcomma=Vectorize(function(x){
+ as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="")) })

The second step was to play with the percentage, to drop the percentage sign, and convert the comma as a decimal separator (the standard dot). I did that in two steps

> commatodot=function(x){
+ return(as.numeric(paste(unlist(strsplit(
+ as.character(x),",")),collapse="."))) }
> droppercent=Vectorize(function(x){
+ comatodot(paste(unlist(strsplit(
+ as.character(x),"%")),collapse=""))/100 })

Then, we should use those functions on the three columns,

> CIMAH[,2]=dropcomma(CIMAH[,2])
> CIMAH[,3]=droppercent(CIMAH[,3])
> CIMAH[,4]=dropcomma(CIMAH[,4])

and indeed, it worked,

> head(CIMAH)
  Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H.
1   0     1000000    0.005368        5368
2   1      994632    0.000730         726
3   2      993906    0.000559         555
4   3      993351    0.000476         473
5   4      992878    0.000407         404
6   5      992474    0.000371         368

Numbers are actual numbers, so I can plot some graphs, such that the mortality rate, per age,

> plot(CIMAH$Age,CIMAH$qx..CIMA.H.,log="y")

I could not imagine that it would be that long to read this excel spreadsheet and to get a (simple) graph…

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)