I was looking for a data set with some temporal bias, and thought of checking out the data from the iNaturalist project IndianMoths, which aims to document the moths of India. The project was initiated in July 2012 but really picked up steam in January 2013, with members contributing regularly, a minimum of 100 records per month. Since the project has not yet completed a full year of this regular activity, I thought it might show some bias from the missing months. Another source of bias could be that moths are not seen in the same numbers throughout the year.
To explore the data, I first downloaded it as a .csv file and loaded it into R.
The data summary looked like this:
Total no of records = 2958
Bounding box of records Inf , Inf - -Inf , -Inf
Taxonomic summary...
No of Families : 0
No of Genus : 0
No of Species : 0
This tells us that the package has read the data but has not understood its format: the bounding box is infinite and no date range is reported. We will have to transform the data before the package can work with it, so let us use the function fixstr to get the data into (somewhat) the required format.
imoth=fixstr(imoth,Latitude="latitude", Longitude="longitude", DateCollected="observed_on")
Now let us check the summary again:
Total no of records = 2958
Date range of the records from 0208-07-26 to 2013-08-07
Bounding box of records 6.660428 , 72.8776559 - 32.5648529099 , 96.2124788761
Taxonomic summary...
No of Families : 0
No of Genus : 0
No of Species : 0
Now we have the dates and the latitude-longitude values in a form that our package can understand. A quick glance at this summary shows that there is a problem with the dates: the data set has one record from the year 208, which must be a typo for 2008. Judging by the bounding box, the records all come from in and around India.
We still need to get the taxonomy in place, but we will leave that for later and start working with this data. Let us create temporal plots of the data at three timescales: daily, weekly and monthly.
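A quick way to catch dates like this one is to look at the year part of each date string. What follows is only a minimal sketch in base R, not something the package does for us: the sample values stand in for the observed_on column, and the cutoff year of 1900 is my own assumption about what counts as implausible.

```r
# Hypothetical sample standing in for the observed_on column of the download
observed_on <- c("0208-07-26", "2013-01-15", "2013-04-20", "2013-08-07")

# Take the first four characters as the year; "0208" becomes 208
years <- as.integer(substr(observed_on, 1, 4))

# Assumed cutoff: anything before 1900 is treated as a likely typo
earliest_plausible_year <- 1900
suspicious <- years < earliest_plausible_year

# Show the flagged records so they can be checked and corrected by hand
print(observed_on[suspicious])
```

I would only flag such records rather than correct them automatically, since the intended year (2008 here) is a guess until it is checked against the original observation.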
tempolar(imoth,title="Daily Records")
tempolar(imoth,title="Weekly Records",timescale="w")
tempolar(imoth,title="Monthly Records",timescale="m")
These calls produce the following three plots.
The daily plot shows records per calendar day, and we see that 2-3 days in April have a very high number of records compared to other dates. This could be due to a targeted survey during that time. The plot also shows that we do not have many records from September until April.
The weekly aggregation of the same records confirms that April has a spike in numbers, while otherwise the number of records seems to be fairly uniform.
The monthly plot shows that April has more than 800 records, whereas no other month has more than 500.
This could be due to several reasons, but mainly because of the activity of this particular project.
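The monthly totals behind a plot like this can also be checked directly. The sketch below uses base R only and a handful of made-up dates in place of the real observed_on column, so the counts it prints are not the project's counts; it only illustrates grouping records by calendar month regardless of year.

```r
# Hypothetical observation dates; the real ones are in the observed_on column
observed_on <- c("2013-01-15", "2013-04-20", "2013-04-21",
                 "2013-04-21", "2013-08-07")
dates <- as.Date(observed_on)

# Month number (1-12) of each record, ignoring the year
month_no <- as.integer(format(dates, "%m"))

# Use a factor with all 12 levels so months without records show up as zero
monthly <- table(factor(month_no, levels = 1:12, labels = month.abb))
print(monthly)

# Name of the month with the most records
busiest <- names(monthly)[which.max(monthly)]
print(busiest)
```

Keeping the empty months in the table matters here, because the months with no records are exactly the gaps we are trying to spot.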