Since F-Secure was #spiffy enough to provide us with GeoIP data for mapping the scope of the ZeroAccess botnet, I thought that some aspiring infosec data scientists might want to see how to use something besides Google Maps & Google Earth to view the data.
If you look at the CSV file, it’s formatted as such (this is a small portion…the file is ~140K lines):
    CL,"-34.9833","-71.2333"
    PT,"38.679","-9.1569"
    US,"42.4163","-70.9969"
    BR,"-21.8667","-51.8333"
While that’s useful, we don’t need quotes and a header would be nice (esp for some of the tools I’ll be showing), so a quick cleanup in vi gives us:
    Code,Latitude,Longitude
    CL,-34.9833,-71.2333
    PT,38.679,-9.1569
    US,42.4163,-70.9969
    BR,-21.8667,-51.8333
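If you'd rather skip the vi step, the same cleanup can be sketched in R itself. This is only a minimal example built on the four sample rows above; the column names are the ones from the cleaned-up header, and the output file name is made up for illustration. The `na.strings = ""` setting is my own precaution: by default `read.csv` treats the string `NA` as missing, which would swallow the country code for Namibia if it appears in the data.

```r
# The four sample rows shown above, in their original quoted, header-less form
raw <- 'CL,"-34.9833","-71.2333"
PT,"38.679","-9.1569"
US,"42.4163","-70.9969"
BR,"-21.8667","-51.8333"'

# Read the rows, supplying the header ourselves
bots <- read.csv(
  text = raw,                 # for the real file, use file = "..." instead
  header = FALSE,
  col.names = c("Code", "Latitude", "Longitude"),
  na.strings = "",            # keep a literal "NA" country code as text
  stringsAsFactors = FALSE
)

# Make sure the coordinates are numeric, whatever the quoting did
bots$Latitude  <- as.numeric(bots$Latitude)
bots$Longitude <- as.numeric(bots$Longitude)

# Write an unquoted copy with a header (hypothetical file name, temp directory)
out <- file.path(tempdir(), "ZeroAccessGeoIPs-clean.csv")
write.csv(bots, out, row.names = FALSE, quote = FALSE)
```

Pointing `read.csv` at the full download instead of `text = raw` should give the same three columns for all ~140K lines.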
With just this information, we can see how much of the United States is covered in ZeroAccess with just a few lines of R:
    # read in the csv file
    bots = read.csv("ZeroAccessGeoIPs.csv")

    # load the maps library
    library(maps)

    # draw the US outline in black and state boundaries in gray
    map("state", interior = FALSE)
    map("state", boundary = FALSE, col="gray", add = TRUE)

    # plot the latitude & longitudes with a small dot
    points(x=bots$Longitude,y=bots$Latitude,col='red',cex=0.25)

If you want to see how bad your state is, it’s just as simple. Using my state (Maine) it’s just a matter of swapping out the map statements with more specific data:
    bots = read.csv("ZeroAccessGeoIPs.csv")
    library(maps)

    # draw Maine state boundary in black and counties in gray
    map("state","maine",interior=FALSE)
    map("county","maine",boundary=FALSE,col="gray",add=TRUE)

    points(x=bots$Longitude,y=bots$Latitude,col='red',cex=0.25)

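Because map() sizes the plot window to Maine's bounding box rather than clipping to the state border, and points() draws every coordinate that falls inside that window, some points show up outside the actual map boundaries.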
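If the stray points bother you, one option is to keep only the coordinates that actually fall inside the state before plotting. The sketch below uses `map.where()` from the same maps library, which returns the name of the polygon containing each point (or `NA` if there isn't one). The three rows are stand-ins: the first two are coordinates I made up (roughly Augusta, ME and Concord, NH) and the third is the Portugal row from the sample above. With the real data you would start from `bots = read.csv(...)` as before.

```r
library(maps)

# Stand-in data: two made-up US points plus the PT row from the sample
bots <- data.frame(
  Code      = c("US", "US", "PT"),
  Latitude  = c(44.3106, 43.2081, 38.679),
  Longitude = c(-69.7795, -71.5376, -9.1569)
)

# Ask the "state" database which polygon each point falls in
region <- map.where("state", x = bots$Longitude, y = bots$Latitude)

# Keep only the points located inside Maine (NA means "not in any state")
maine_bots <- bots[grepl("^maine", region), ]

# Same map as before, but with the filtered points
map("state", "maine", interior = FALSE)
map("county", "maine", boundary = FALSE, col = "gray", add = TRUE)
points(x = maine_bots$Longitude, y = maine_bots$Latitude, col = "red", cex = 0.25)
```

The maps polygons are fairly coarse, so points very close to the border or coastline may still be classified one way or the other.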
You can even get a quick and dirty geo-heatmap without too much trouble:
    bots = read.csv("ZeroAccessGeoIPs.csv")

    # load the ggplot2 library
    library(ggplot2)

    # create a plot object for the heatmap
    zeroheat <- qplot(xlab="Longitude",ylab="Latitude",main="ZeroAccess Botnet",
                      geom="blank",x=bots$Longitude,y=bots$Latitude,data=bots) +
      stat_bin2d(bins=300,aes(fill=log1p(..count..)))

    # display the heatmap
    zeroheat

Try playing around with the bins value to see how it affects the plot: stat_bin2d(…) divides the “map” into a grid of “buckets” (or bins), and the number of points that land in each bin determines how that bin is color coded.
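To make the binning idea concrete, here is a rough base-R sketch of the kind of counting that sits behind a 2D-binned heatmap. It is not how ggplot2 implements `stat_bin2d` internally; it uses randomly generated coordinates (not the ZeroAccess data), a coarse 10 x 10 grid, and fixed world-wide breaks, whereas ggplot2 works from the range of the data.

```r
set.seed(42)

# Made-up coordinates standing in for bots$Longitude / bots$Latitude
lon <- runif(1000, -180, 180)
lat <- runif(1000, -90, 90)

bins <- 10  # grid resolution; the heatmap above uses 300

# Assign every point to a longitude bucket and a latitude bucket
lon_bin <- cut(lon, breaks = seq(-180, 180, length.out = bins + 1),
               include.lowest = TRUE)
lat_bin <- cut(lat, breaks = seq(-90, 90, length.out = bins + 1),
               include.lowest = TRUE)

# Count the points in each grid cell
counts <- table(lat_bin, lon_bin)

# The fill value mirrors the log1p(..count..) aesthetic used above
fill <- log1p(counts)
```

Fewer bins means bigger cells and a blockier picture; more bins means finer detail but many cells with only one or two points in them.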
If you were to pre-process the data a bit, or craft some ugly R code, a more traditional choropleth can easily be created as well. The interesting part about using a non-boundaried plot is that the ZeroAccess infections alone almost trace the outline of every continent for us (which is kinda scary).
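The pre-processing for a choropleth mostly boils down to counting bots per country code. A minimal sketch of that step, using a handful of made-up rows in place of the real file (the `Bots` column name is just my label for the count):

```r
# Made-up rows standing in for the Code column of the real data
bots <- data.frame(
  Code = c("US", "US", "BR", "PT", "US", "BR"),
  stringsAsFactors = FALSE
)

# Tally the rows per country code and turn the result into a data frame
counts <- as.data.frame(table(Code = bots$Code),
                        responseName = "Bots",
                        stringsAsFactors = FALSE)

# Sort so the most infected countries come first
counts <- counts[order(-counts$Bots), ]
```

From there, each count would still need to be joined to country polygons to shade them, which is where the uglier R code comes in.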
That’s just a taste of what you can do with just a few, simple lines of R. If I have some time, I’ll toss up some examples in Python as well. Definitely drop a note in the comments if you put together some #spiffy visualizations with the data they provided.
