(This article was first published on

**meet Saptarsi**, and kindly contributed to R-bloggers)

The price-earnings ratio (P/E) is one of the most popular ratios reported for stocks. Put simply, it is the current market price divided by the earnings per share. An operational definition of earnings per share is total profit divided by the number of shares. I will redirect interested readers for further reading to

In this post, I would just like to show how we can grab P/E data from the web and create some visualizations from it. My focus right now is Indian stocks, and I intend to use the website below

So my first step is gearing up for the data extraction, which is the most non-trivial task. As shown in the figure below, there are separate pages for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.

Here is something I did outside R: I created a CSV file with the sector names, using delimiters while importing the text and then Paste Special with Transpose; here is how my CSV file would look. I would never discourage using multiple tools, as this is often required to solve real-world problems.

So now I can import this into a dataset, read one row at a time, and go to the necessary URLs. But God has different plans :) and it's not that straightforward.

__Case 1: Single-word sector names__

We have the sector ‘Banks’, and the sector link is as below

This one is a no-brainer: we can pick up the base URL, append the sector name after a forward slash, and then append the string ‘-Sector’. This is true for most single-word sector names like ‘FMCG’, ‘Tyres’, ‘Heathcare’, etc.

__Case 2: Multiple words without ‘-’, ‘&’ and ‘/’__

We have the sector ‘Tobacco Products’, and the sector link is as below

This is also not that difficult: apart from adding ‘-Sector’, we need to replace the spaces with a ‘-’.

__Case 3: Multiple words with a ‘-’__

We have the sector name ‘IT-Software’, where we have to remove any spaces around the hyphen if they exist. There can be several other cases, but for the sake of discussion I will limit myself here.

__Case 4: Multiple words with a ‘/’__

We have the sector name ‘Stock/ Commodity Brokers’, so the ‘/’ needs to be removed.
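The four cases can be folded into one small helper. This is only a sketch: `sector_slug` is a hypothetical name that does not appear in the original code, and the results follow the rules described above rather than URLs verified against the website.

```r
# Hypothetical helper: turn a sector name into the last part of the sector
# URL, following the four cases described in the text.
sector_slug <- function(sector) {
  slug <- gsub("/", "", sector, fixed = TRUE)  # Case 4: drop any '/'
  slug <- gsub(" *- *", "-", slug)             # Case 3: no spaces around '-'
  slug <- gsub(" +", "-", slug)                # Case 2: spaces become '-'
  paste0(slug, "-Sector")                      # Case 1: append the suffix
}

sectors <- c("Banks", "Tobacco Products", "IT - Software",
             "Stock/ Commodity Brokers")
slugs <- sector_slug(sectors)
print(slugs)
# "Banks-Sector" "Tobacco-Products-Sector" "IT-Software-Sector"
# "Stock-Commodity-Brokers-Sector"
```

Because `gsub` and `paste0` are vectorised, the helper works on a whole column of sector names at once, so no explicit loop is needed for this step.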

    # RCurl provides url.exists(); XML provides readHTMLTable()
    library(RCurl)
    library(XML)

    # Reading in the dataset
    sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")

    # Converting to a matrix; this is a practice I generally follow
    sectorvm <- as.matrix(sectorsv1)

We can access individual sectors with sectorvm[rowno, colno].

    pe <- c()
    cname <- c()
    cnt <- 0
    baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'

    for (i in 1:nrow(sectorvm)) {
      securl <- sectorvm[i, 1]

      # fixed = TRUE means the string is matched as is, not as a regular expression.
      # The substitutions cover the cases explained above. Note the use of gsub
      # instead of sub; otherwise only the first instance would be replaced.
      if (length(grep(' ', securl, fixed = TRUE)) != 1) {
        # Single-word sector name: just append the suffix
        securl <- paste(securl, '-Sector', sep = "")
      } else {
        securl <- gsub(' ', '-', securl, fixed = TRUE)
        # 'IT - Software' has become 'IT---Software' at this point
        securl <- gsub('---', '-', securl, fixed = TRUE)
        securl <- gsub('&', 'and', securl, fixed = TRUE)
        securl <- gsub('/', '', securl, fixed = TRUE)
        securl <- gsub(',', '', securl, fixed = TRUE)
        securl <- paste(securl, '-Sector', sep = "")
      }

      fullurl <- paste(baseurl, securl, sep = "")
      print(fullurl)

      if (url.exists(fullurl)) {
        petbls <- readHTMLTable(fullurl)
        # Exploring the tables, we found the relevant information in table 2.
        # The data is stored as a factor, so as.numeric alone will not suffice:
        # we need as.character first and then as.numeric.
        pe <- c(pe, as.numeric(as.character(petbls[[2]]$PE)))
        cname <- c(cname, as.character(petbls[[2]]$Company))
        cnt <- cnt + 1
      }
    }

The different functions that we have used are explained below.

readHTMLTable (from the XML package) -> Given a URL, this function retrieves the contents of the <table> tags from the HTML page as a list of tables. We need to pick the appropriate table by its number; in this case we have used table no. 2.

grep, paste and gsub are ordinary string functions: grep finds occurrences of a string in another, paste concatenates, and gsub does the replacing.

as.numeric(as.character()) left a lasting impression on my mind, as an innocuous and intuitive as.numeric would have left me with only the ranks, that is, the internal integer codes of the factor levels.

url.exists (from the RCurl package) -> It is a good idea to check the existence of the URL, given that we are forming the URLs dynamically.
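To make the as.numeric(as.character()) point concrete, here is a minimal sketch with a made-up factor (the values are illustrative, not taken from the scraped tables):

```r
# A factor of numeric-looking strings, as a scraped table column might arrive
pe_factor <- factor(c("20.09", "587.5", "3.2"))

# as.numeric() alone returns the internal level codes, not the values
codes <- as.numeric(pe_factor)

# Going through as.character() first recovers the actual numbers
values <- as.numeric(as.character(pe_factor))

print(codes)   # 1 3 2
print(values)  # 20.09 587.50 3.20
```

The codes come out as 1, 3, 2 because the levels are sorted as strings ("20.09", "3.2", "587.5"), which is why they look like ranks but carry no numeric meaning.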

Now playing with summary statistics:

We use the describe function from the psych package on the pe vector:

| n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1797 | 59.71 | 76.92 | 20.09 | 46.64 | 29.79 | 0 | 587.5 | 587.5 | 2.15 | 7.25 | 1.81 |
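Several of these columns can be reproduced with base R, which helps in reading the table. The sketch below uses a small made-up vector (`pe_demo` is hypothetical, not the scraped data) and assumes a 10% trimmed mean, which I believe is the default in describe. The se column is simply sd / sqrt(n); in the table above, 76.92 / sqrt(1797) is about 1.81.

```r
# Hypothetical P/E values, for illustration only
pe_demo <- c(8.5, 12.1, 15.3, 20.1, 22.4, 35.0, 60.2, 110.7, 250.0, 587.5)

basic_stats <- c(
  n       = length(pe_demo),
  mean    = mean(pe_demo),
  sd      = sd(pe_demo),
  median  = median(pe_demo),
  trimmed = mean(pe_demo, trim = 0.1),  # assumed 10% trimmed mean
  mad     = mad(pe_demo),               # scaled median absolute deviation
  min     = min(pe_demo),
  max     = max(pe_demo),
  range   = max(pe_demo) - min(pe_demo),
  se      = sd(pe_demo) / sqrt(length(pe_demo))  # standard error of the mean
)
print(round(basic_stats, 2))
```

A mean well above the median, as in both this toy vector and the real table, is a quick sign of a right-skewed distribution.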

`hist(pe, col='blue', main='P/E Distribution')`

We get the histogram below for the P/E ratio, which shows that it is nowhere near a normal distribution; its peakedness and skew are confirmed by the summary statistics as well.

We will nevertheless run a normality test.

`shapiro.test(pe)`

` Shapiro-Wilk normality test`

`data: pe `

`W = 0.7496, p-value < 2.2e-16`

Basically, the null hypothesis is that the values come from a normal distribution. The p-value is extremely small (below 2.2e-16), so the result is highly significant and we can easily reject the null.
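As a rough illustration of how to read this test, the sketch below runs shapiro.test on two simulated samples; the data and parameters are assumptions chosen only to mimic a right-skewed variable and a bell-shaped one, not the actual P/E data.

```r
set.seed(42)

# Simulated samples, for illustration only
skewed_sample <- rexp(500, rate = 1 / 60)          # strongly right-skewed
normal_sample <- rnorm(500, mean = 60, sd = 20)    # drawn from a normal

res_skewed <- shapiro.test(skewed_sample)
res_normal <- shapiro.test(normal_sample)

# A small p-value means we reject the null hypothesis of normality
reject_skewed <- res_skewed$p.value < 0.05

print(res_skewed)
print(res_normal)
print(reject_skewed)
```

For the skewed sample the W statistic falls well below 1 and the p-value is tiny, the same pattern as in the P/E output above, whereas a sample that really is normal will usually give W close to 1.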

Drawing a box plot of the P/E ratios

`boxplot(pe, col='blue')`

__Finding the outliers__

`boxplot.stats(pe)$out`

`484.33 327.91 587.50`

`cname[which(pe %in% boxplot.stats(pe)$out)]`

`[1] "Bajaj Electrical" "BF Utilities" "Ruchi Infrastr." `

Of course, no prizes for guessing that we should stay out of these stocks.
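For readers who want to see the outlier lookup in isolation, here is a small sketch with made-up values and placeholder company names (both hypothetical). By default, boxplot.stats flags points lying more than 1.5 times the box length beyond the hinges.

```r
# Hypothetical P/E values and placeholder company names
pe_demo    <- c(10, 12, 11, 13, 12, 14, 300)
cname_demo <- c("A", "B", "C", "D", "E", "F", "G")

# Values beyond the whiskers of the box plot
out_vals <- boxplot.stats(pe_demo)$out

# Map the outlying values back to the company names at the same positions
out_names <- cname_demo[pe_demo %in% out_vals]

print(out_vals)   # 300
print(out_names)  # "G"
```

This only works because the value vector and the name vector are kept in the same order, which is why the scraping loop appends to pe and cname together.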

So, to summarize, this is a kind of exploratory data analysis of the P/E ratios of Indian stocks:

· We saw that we can get content out of URLs and HTML tables

· We collected the P/E values and company names in R vectors

· We looked at summary statistics and a histogram, and ran a normality test

· We plotted a box plot and found the outliers
