(This article was first published on meet Saptarsi, and kindly contributed to R-bloggers)
The price-earnings ratio (P/E) is one of the most popular ratios reported for stocks. Put simply, it is Current Market Price / Earnings per Share. An operational definition of Earnings per Share is total profit divided by the number of shares. I will redirect interested readers for further reading to
In this post, I would just like to show how we can grab P/E data from the web and create some visualizations of it. My focus right now is Indian stocks, and I intend to use the website below
So my first step is gearing up for the data extraction, and that is essentially the most nontrivial task. As shown in the figure below, there are separate pages for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.
Here is something I did outside R: I created a CSV file with the sector names, using delimiters while importing the text and Paste Special with Transpose. Here is how my CSV file looks. I would never discourage using multiple tools, as this is often required to solve real-world issues.
So now I can import this into a dataset, read one row at a time, and go to the necessary URLs. But God had different plans :) and it's not that straightforward.
Case 1: Single-word sector names
We have the sector 'Banks' and the sector link is as below
Again, it is a no-brainer: we can pick up the base URL, append the sector name after a forward slash, and then append the string 'Sector'. This is true for most single-word sector names such as 'FMCG', 'Tyres', 'Healthcare', etc.
Case 2: Multiple words without '-', '&' and '/'
We have the sector 'Tobacco Products' and the sector link is as below
This is also not that difficult: apart from adding 'Sector', we need to replace the spaces with a '-'.
Case 3: Multiple words with a '-'
We have the sector name 'IT-Software', where we have to remove any other spaces if they exist. There can be several other cases, but for the sake of discussion I will limit myself to these.
Case 4: Multiple words with a '/'
We have the sector name 'Stock/ Commodity Brokers', so the '/' needs to be removed.
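The four cases above can be sketched as one small helper. This is only an illustration, not the code used in this post: the name sector_slug is hypothetical, the separator is assumed to be a hyphen, and 'IT - Software' is an assumed spelling with spaces around the hyphen. Whether the resulting URL actually exists on the site still has to be checked.

```r
# Hypothetical helper (not from the original post): build the URL fragment
# for a sector name following Cases 1-4, assuming '-' is the separator.
sector_slug <- function(sector) {
  slug <- trimws(sector)
  slug <- gsub("/", "", slug, fixed = TRUE)     # Case 4: drop '/'
  slug <- gsub("&", "and", slug, fixed = TRUE)  # '&' becomes 'and'
  slug <- gsub(",", "", slug, fixed = TRUE)     # drop commas
  slug <- gsub("[ -]+", "-", slug)              # Cases 2 and 3: runs of spaces/hyphens -> one '-'
  paste0(slug, "Sector")                        # Case 1: append 'Sector'
}

sector_slug("Banks")
sector_slug("Tobacco Products")
sector_slug("IT - Software")
sector_slug("Stock/ Commodity Brokers")
```

Collapsing every run of spaces and hyphens in one regular expression avoids the separate '---' clean-up step, at the cost of no longer using fixed string matching for that one substitution.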
# Packages that provide readHTMLTable() and url.exists()
library(XML)
library(RCurl)
# Reading in dataset
sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")
# Converting to a matrix, a practice I generally follow
sectorvm <- as.matrix(sectorsv1)
# We can access individual sectors with sectorvm[rowno, colno]
pe <- c()
cname <- c()
cnt <- 0
baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PERatios/'
for (i in 1:nrow(sectorvm))
{
  securl <- sectorvm[i, 1]
  # fixed=TRUE means the pattern is matched as is and not as a regular expression
  # Substitutions for the different cases explained above; note that we use gsub
  # instead of sub, else only the first instance would be replaced
  if (length(grep(' ', securl, fixed=TRUE)) != 1)
  {
    # Case 1: single-word sector name
    securl <- paste(securl, 'Sector', sep="")
  }
  else
  {
    # Case 2: replace spaces with hyphens
    securl <- gsub(' ', '-', securl, ignore.case=FALSE, fixed=TRUE)
    # Case 3: ' - ' has now become '---', so collapse it to a single hyphen
    if (length(grep('---', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('---', '-', securl, ignore.case=FALSE, fixed=TRUE)
    }
    if (length(grep('&', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('&', 'and', securl, ignore.case=FALSE, fixed=TRUE)
    }
    # Case 4: remove '/'
    if (length(grep('/', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('/', '', securl, ignore.case=FALSE, fixed=TRUE)
    }
    if (length(grep(',', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub(',', '', securl, ignore.case=FALSE, fixed=TRUE)
    }
    securl <- paste(securl, 'Sector', sep="")
  }
  fullurl <- paste(baseurl, securl, sep="")
  print(fullurl)
  if (url.exists(fullurl))
  {
    petbls <- readHTMLTable(fullurl)
    # Exploring the tables, we found the relevant information in table 2
    # Also, the data is stored as a factor, so just doing an as.numeric will not suffice;
    # we need to do an as.character and then an as.numeric
    pe <- c(pe, as.numeric(as.character(petbls[[2]]$PE)))
    cname <- c(cname, as.character(petbls[[2]]$Company))
    cnt = cnt + 1
  }
}
The different functions that we have used are explained below
readHTMLTable -> Given a URL, this function retrieves the contents of the <table> tags of an HTML page. We need to pick the appropriate table number; in this case we used table no. 2.
grep, paste and gsub are ordinary string functions: grep finds occurrences of one string in another, paste concatenates, and gsub does the replacing.
as.numeric(as.character()) left a lasting impression on my mind, as an innocuous and intuitive as.numeric would have left me with only the ranks (the factor's internal level codes) instead of the values.
url.exists -> It is a good idea to check the existence of the URL, given that we are forming the URLs dynamically.
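To make the as.numeric(as.character()) point concrete, here is a minimal sketch on made-up values (not the scraped data). A factor stores integer level codes, and a bare as.numeric returns those codes rather than the numbers printed on screen.

```r
# Made-up P/E values held as a factor, the way the scraped table columns were
pe_factor <- factor(c("20.5", "7.1", "113.0", "7.1"))

levels(pe_factor)                               # levels are sorted as strings
codes  <- as.numeric(pe_factor)                 # internal level codes, not the values
values <- as.numeric(as.character(pe_factor))   # the actual numbers

codes
values
```

The codes depend only on the alphabetical order of the level labels, which is why they look like ranks and bear no relation to the magnitude of the P/E ratios.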
Now playing with summary statistics:
We use the describe function from the psych package, which gives the following output for pe:
n     mean   sd     median  trimmed  mad    min  max    range  skew  kurtosis  se
1797  59.71  76.92  20.09   46.64    29.79  0    587.5  587.5  2.15  7.25      1.81

hist(pe,col='blue',main='P/E Distribution')
We get the histogram below for the P/E ratio, which shows that it is nowhere near a normal distribution, with its peakedness and skew, as confirmed by the summary statistics as well.
We will nevertheless do a normality test
shapiro.test(pe)
Shapiro-Wilk normality test
data: pe
W = 0.7496, p-value < 2.2e-16
Basically, the null hypothesis is that the values come from a normal distribution. The p-value is extremely small (far below the usual 0.05 threshold), and hence we can easily reject the null.
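For readers who want to try the decision rule without scraping anything, here is a small sketch on simulated right-skewed data. The exponential sample and the 0.05 cutoff are assumptions made for illustration; these are not the P/E values from this post.

```r
set.seed(1)
skewed <- rexp(200)            # simulated right-skewed sample, not real P/E data

res <- shapiro.test(skewed)
res$statistic                  # the W statistic
res$p.value                    # probability of such a W under normality

# Reject the null hypothesis of normality when the p-value is below 0.05
reject_normality <- res$p.value < 0.05
reject_normality
```

Note that shapiro.test accepts between 3 and 5000 non-missing values, so the 1797 ratios used here fit comfortably.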
Drawing a box plot on the P/E ratios
boxplot(pe,col='blue')
Finding the outliers
boxplot.stats(pe)$out
484.33 327.91 587.50
cname[which(pe %in% boxplot.stats(pe)$out)]
[1] "Bajaj Electrical" "BF Utilities" "Ruchi Infrastr."
Of course, no prizes for guessing that we should stay out of these stocks.
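As a side note on how these outliers are picked: with its default coef = 1.5, boxplot.stats flags any point lying more than 1.5 times the box length beyond the hinges. A minimal sketch on made-up numbers and labels (not the scraped data) shows the same lookup pattern used above.

```r
# Made-up values and labels, for illustration only
x     <- c(10, 12, 14, 15, 18, 20, 22, 300)
label <- c("A", "B", "C", "D", "E", "F", "G", "H")

stats <- boxplot.stats(x)
stats$stats                    # lower whisker, lower hinge, median, upper hinge, upper whisker
out <- stats$out               # points beyond 1.5 times the box length from the hinges
out

label[x %in% out]              # names of the flagged observations
```

Here the hinges are 13 and 21, so anything above 21 + 1.5 * 8 = 33 is flagged, which singles out the value 300.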
So, to summarize, this is a kind of exploratory data analysis of the P/E ratios of Indian stocks:
· We saw that we can get content out of URLs and HTML tables
· We collected the values in R vectors
· We looked at summary statistics and a histogram, and did a normality test
· We plotted a box plot and found the outliers