Coimbatore Weather and Questioning Amma!
A week ago, Amma was telling me the weather was getting hot in Coimbatore. I told her it was going to get worse in the next two months. She shot back that March is the hottest month, while April and May are less hot in Coimbatore. Growing up in India you are taught that your mother knows best and that she is (almost) always right. Well, I could not resist putting the thesis to the test, and the internet came to my help. So here goes the Perl and R code to get the temperature data and explore it. The metric of choice is (arbitrarily) the average temperature. In the R plots in this post the average is plotted as a big black dot (strictly, bwplot places that dot at the median rather than the mean). The month with the highest average temperature will be adjudged the hottest month. There are a lot of metrics one could use, but this is the simplest and the most intuitive.
(1) Perl code to scrape the temperature data from the web. The wunderground website has data in CSV format. I did not do checks like limiting days in June to 30; we will fix that in the cleanup script, which builds a single CSV file with data from the years 2005 to 2008. You need the CPAN package LWP. You should be able to google how to install it.
Save this file as [dir]/src/get_temperature_files.pl
To run
% cd [dir]/src
% perl get_temperature_files.pl
You will have the data in the directory [dir]/data
The data is for four years from 2005 to 2008.
#------------------------------------------------------------------
use warnings;
use strict;
#------------------------------------------------------------------
use LWP::UserAgent;
#------------------------------------------------------------------
@ARGV == 0 or die "Sorry. The correct usage is:\n  perl get_temperature_files.pl\n";
#------------------------------------------------------------------
my $datadir = "../data/";
mkdir($datadir, 0755) unless -d $datadir;

# VOCB - Coimbatore
my $base_url = "http://www.wunderground.com/history/airport/VOCB/";
my $suffix = "/DailyHistory.html?format=1";

for (my $year = 2005; $year < 2009; ++$year) {
  for (my $month = 1; $month <= 12; ++$month) {
    # days run to 31 for every month; build_csv.pl drops the invalid ones
    for (my $day = 1; $day <= 31; ++$day) {
      my $webfile = $year."/".$month."/".$day;
      print "Getting: $webfile\n";
      my $url = $base_url.$webfile.$suffix;
      my $webPage = getWebPage($url);
      my $outfile = $year."_".$month."_".$day.".csv";
      $outfile = $datadir.$outfile;
      open(OUTFILE, ">$outfile") or die "Cannot write $outfile: $!";
      print OUTFILE "$webPage";
      close(OUTFILE);
      # let us be patient and decent
      sleep(1);
    }
  }
}
#------------------------------------------------------------------
# subroutines
#------------------------------------------------------------------
sub getWebPage {
  my ($url) = @_;
  my $browser = LWP::UserAgent->new();
  my $response = $browser->get($url);
  # error checks: stop if the request failed
  die "Could not fetch $url -- ", $response->status_line()
    unless $response->is_success();
  my $webPage = $response->content();
  return($webPage);
}
#------------------------------------------------------------------
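The scraper above asks for days 1 to 31 in every month and leaves the cleanup to the next script. As a rough sketch of the alternative (in R rather than Perl, and only building the URL strings, not fetching anything), one could enumerate real calendar dates up front. The `base_url` and `suffix` values are copied from the Perl script; the variable names `dates`, `ymd` and `urls` are just illustrative.

```r
# Sketch: enumerate only real calendar dates, so Feb 30 or Jun 31 never appear.
# base_url and suffix are copied from the Perl script; nothing is downloaded here.
base_url <- "http://www.wunderground.com/history/airport/VOCB/"
suffix   <- "/DailyHistory.html?format=1"

# seq() on Date objects steps through valid days only, leap years included
dates <- seq(as.Date("2005-01-01"), as.Date("2008-12-31"), by = "day")

# Build year/month/day without zero padding, matching the Perl loop's format
ymd <- paste(as.integer(format(dates, "%Y")),
             as.integer(format(dates, "%m")),
             as.integer(format(dates, "%d")),
             sep = "/")
urls <- paste0(base_url, ymd, suffix)

length(urls)   # 1461 days: three ordinary years plus leap year 2008
head(urls, 2)
```

With this approach the days-in-month check in the cleanup step would no longer be needed, at the cost of moving the date logic into the download step.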
(2) Clean up the data and combine it into a single file.
Save this file as [dir]/src/build_csv.pl
To run
% cd [dir]/src
% perl build_csv.pl ../data/ > cbe.csv
#------------------------------------------------------------------
use warnings;
use strict;
#------------------------------------------------------------------
@ARGV == 1 or die "Sorry. The correct usage is:\n  perl build_csv.pl dir_containing_csv_files\nExample:\n  perl build_csv.pl ../data/\n";
#------------------------------------------------------------------
my $datadir = $ARGV[0];
# make sure exactly one / is present after $datadir
$datadir =~ s/[\/]+$//;
$datadir .= "/";

# days in a month
my @days_in_month = (31,28,31,30,31,30,31,31,30,31,30,31);
my @month_names = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

opendir(DIR, $datadir) or die "Cannot open $datadir: $!";
my @files = grep { /\.csv$/ } readdir(DIR);
closedir(DIR);

my $header_flag = 0;
foreach (@files) {
  my $file = $_;
  # get the time information
  # which we need to add to the csv file
  my ($time, $suffix) = split(/\./, $file);
  my ($year, $month, $day) = split(/\_/, $time);
  # handle leap year (the simple rule is enough for 2005 to 2008)
  if (0 == $year % 4) {
    $days_in_month[1] = 29;
  } else {
    $days_in_month[1] = 28;
  }
  if ($day <= $days_in_month[$month-1]) {
    # read the raw csv file and clean it up
    my $csvfile = $datadir.$file;
    open(TIMEFILE, "<$csvfile") or die "Cannot read $csvfile: $!";
    while (<TIMEFILE>) {
      # remove everything between and including < >
      s/\<.*\>//;
      # skip blank lines
      next if /^(\s)*$/;
      # skip the header after printing it
      # for the first time
      if(!$header_flag) {
        print "year, month, day, $_";
        $header_flag = 1;
        next;
      } else {
        next if /^[a-z]+.*$/i;
      }
      chomp;
      print "$year, $month_names[$month-1], $day, $_\n";
    }
    close(TIMEFILE);
  }
}
#------------------------------------------------------------------
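The leap-year test in the script (`$year % 4`) is fine for 2005 to 2008, but it would misjudge century years such as 1900. For anyone extending the date range, here is a small sketch of the full Gregorian rule, written in R. The helper names `is_leap` and `days_in_month` are my own, not part of the scripts above.

```r
# Sketch of the full Gregorian leap-year rule (helper names are illustrative).
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

# Number of days in a given month, February adjusted for leap years
days_in_month <- function(year, month) {
  days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  ifelse(month == 2 & is_leap(year), 29, days[month])
}

days_in_month(2008, 2)  # 29
days_in_month(2005, 2)  # 28
days_in_month(1900, 2)  # 28: divisible by 100 but not by 400
days_in_month(2006, 6)  # 30
```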
(3) Explore the data using R
library(lattice)
# read the raw data
filename <- "cbe.csv";
x <- read.csv(file = filename, header = TRUE, as.is = TRUE);
# factor hack to get the plots in xyplot() in correct order
# (unique() keeps the order of first appearance; factor levels must not repeat)
x$month = factor(x$month, levels=unique(x$month))
x$year = factor(x$year, levels=unique(x$year))
x$TimeIST = factor(x$TimeIST, levels=unique(x$TimeIST))
# convert Fahrenheit to Celsius
x$TemperatureC = (x$TemperatureF - 32)*(5.0/9.0);
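The factor step relies on the rows arriving in a sensible order, which depends on the order in which the cleanup script happened to read the files. A more explicit variant, sketched here on a made-up toy data frame, pins the levels to the calendar using R's built-in `month.abb`. It assumes the month strings carry a leading space, since build_csv.pl prints ", " between fields; `trimws()` removes it. The toy values are for illustration only and are not real readings.

```r
# Toy rows in the shape build_csv.pl writes: note the space before each month.
# The values are made up for illustration, not real readings.
toy <- data.frame(month = c(" Mar", " Jan", " Mar", " Feb"),
                  TemperatureF = c(95, 77, 86, 32),
                  stringsAsFactors = FALSE)

# Strip the stray whitespace, then order the levels by calendar month
toy$month <- factor(trimws(toy$month), levels = month.abb)

# Fahrenheit to Celsius, same formula as above
toy$TemperatureC <- (toy$TemperatureF - 32) * (5 / 9)

levels(droplevels(toy$month))  # months present, in calendar order
toy$TemperatureC
```

Passing `strip.white = TRUE` to `read.csv()` would be another way to handle the whitespace at read time.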
Now for answering the question at the top of the post.
hmonth <- bwplot(TemperatureC ~ month, data=x, ylab = "Temperature (C)");
plot(hmonth);
Looks as if Amma's thesis is probably rejected! Four years' worth of data shows that the highest average temperature is in April. Let us split by year and see if there is a year in which March's average temperature was the highest. It also looks like we need to do a little cleanup of the data: there is a zero in October. Definitely not possible in Coimbatore!
hyear <- bwplot(TemperatureC ~ month | year, data=x, ylab = "Temperature (C)");
plot(hyear);
Well, it looks like at least in 2005 and 2007 March's average temperature is the highest, although April matches March in both those years. I am trying to find something to salvage for my Amma!
Here is another plot, which splits by the hour of the day.
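Reading the hottest month off a box plot is a bit of an eyeball exercise, and the dot that bwplot draws by default marks the median rather than the mean. As a cross-check, here is a minimal sketch that computes the mean per year and month and picks the winner for each year. It runs on a small made-up data frame so the logic is easy to follow; the column names match the post, but the numbers are invented and say nothing about the real result.

```r
# Made-up readings for illustration only (not the Coimbatore data)
toy <- data.frame(
  year  = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006),
  month = c("Mar", "Mar", "Apr", "Apr", "Mar", "Mar", "Apr", "Apr"),
  TemperatureC = c(30, 32, 29, 31, 28, 30, 31, 33),
  stringsAsFactors = FALSE
)

# Mean temperature for every year/month combination
means <- aggregate(TemperatureC ~ year + month, data = toy, FUN = mean)

# For each year, keep the row with the highest monthly mean
hottest <- do.call(rbind, lapply(split(means, means$year), function(d) {
  d[which.max(d$TemperatureC), ]
}))

hottest  # one row per year: Mar for 2005, Apr for 2006 in this toy data
```

To try it on the real file, replace `toy` with the data frame `x` read from cbe.csv.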
hhour <- bwplot(TemperatureC ~ TimeIST, data=x, ylab = "Temperature (C)");
plot(hhour);
The one surprising thing was that the lowest temperatures occur between 2:30 AM and 5:30 AM, not around midnight. The highest temperatures are around 2:30 PM, not noon. And the zero is definitely an error, since it shows up at 11:30 AM.
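The coolest and warmest times of day can also be pulled out numerically instead of being read off the plot. A minimal sketch, again on invented toy values, assuming the time column is called `TimeIST` as in the post and holds strings such as "2:30 AM":

```r
# Made-up readings for illustration only (not the Coimbatore data)
toy <- data.frame(
  TimeIST = c("2:30 AM", "2:30 PM", "11:30 PM",
              "2:30 AM", "2:30 PM", "11:30 PM"),
  TemperatureC = c(21, 33, 24, 22, 35, 25),
  stringsAsFactors = FALSE
)

# Mean temperature for each time of day
hourly <- tapply(toy$TemperatureC, toy$TimeIST, mean)

# Times with the lowest and highest mean temperature
coolest <- names(hourly)[which.min(hourly)]
warmest <- names(hourly)[which.max(hourly)]

coolest  # "2:30 AM" in this toy data
warmest  # "2:30 PM" in this toy data
```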
Update 1 (March 16 2009):
You can download the big CSV file [~700 KB, rename it to cbe.csv], which contains the weather data for Coimbatore for the years 2005 to 2008. Now you can skip to Step (3) and use R to analyze the data.