Initial Work on a Post Not Yet Completed

Posted on January 12, 2011 by -- in R bloggers | 0 Comments

[This article was first published on Brock's Data Adventure » R, and kindly contributed to R-bloggers].

It’s no secret I have been learning R for some time now, and one of the best resources out there is the rstats hashtag on Twitter (#rstats). There is a tremendous community of active users who are always willing to help, and you also get a first-hand view of some of the cutting-edge analytic applications being developed. A few users have shown examples of using social network analysis on a variety of web-based datasets (see @DrewConway for a post on real-time Twitter analytics).

While it has been around for some time now, social network analysis has caught my eye, and I am mildly obsessed with its potential applications in higher ed. By no means am I an expert, but I do understand just enough to get me in trouble.

Recently, I had an idea to use data from a popular college search site. It is the application season after all! Put simply, for each school in their directory, they provide a list of 5 other institutions that were also viewed. Essentially, it’s the top 5 schools, in rank order, that overlap with a given school based on what visitors click on during a visit to the website. I don’t think this is the best dataset to identify market share or determine one’s competitor set, but it’s not bad. Also, I am not trying to do hardcore network analysis with this post (mostly trying to learn R), but I do think it has some potential applications in enrollment management and student recruitment.
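As a rough illustration of the data described above, each school and its ranked list of 5 overlap schools can be flattened into an edge list, with one row per pair. This is only a minimal sketch in base R, not the script used for the crawl; the school names and the `also_viewed`, `from`, `to`, and `rank` names are placeholders made up for the example.

```r
# Hypothetical crawl result: each school maps to its ranked "also viewed" list.
# These names are placeholders, not real data from the search site.
also_viewed <- list(
  "School A" = c("School B", "School C", "School D", "School E", "School F"),
  "School B" = c("School A", "School C", "School F", "School G", "School H")
)

# Flatten into one row per (school, overlap school) pair, keeping the rank.
edges <- do.call(rbind, lapply(names(also_viewed), function(school) {
  data.frame(
    from = school,
    to   = also_viewed[[school]],
    rank = seq_along(also_viewed[[school]]),
    stringsAsFactors = FALSE
  )
}))

print(edges)
```

An edge list in this shape is a common starting point for network packages in R, and keeping the rank leaves room to weight the ties later.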

Last night I wrote an R script to crawl the site and collect the top 5 overlap schools for each institution. As the title of this post implies, there is a lot more that I want to do with this, but I did want to post some graphs to show you where this is headed. There are over 3,700 schools shown below, and pockets of schools are clearly visible.

The image above is my first plot of the network. I attempted to clean the graph up a little bit and I think it is a decent first view of how these colleges are connected on the website.

The plot above relates betweenness and eigenvector centrality for each actor in the network. Obviously we expect some schools to be more popular (i.e., greater brand awareness?) and to overlap more often than others, but this plot reveals some interesting facts. I know it is probably hard to see, but look at the top right of the plot. One institution is clearly an outlier. Hmmmmmm…
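For readers who want to try the two measures themselves, here is a minimal sketch using the igraph package on a toy network. It is not the code behind the plot: the vertex names are invented, the function names are those of current igraph releases, and treating the ties as undirected for eigenvector centrality is an assumption made to keep the example simple.

```r
library(igraph)

# Toy directed edge list (hypothetical schools): "from" lists "to" in its top 5.
edges <- data.frame(
  from = c("A", "B", "C", "A", "HUB"),
  to   = c("HUB", "HUB", "HUB", "B", "D"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = TRUE)

centrality <- data.frame(
  school      = V(g)$name,
  # Betweenness: how often a school sits on shortest paths between others.
  betweenness = betweenness(g, directed = TRUE),
  # Eigenvector centrality: being connected to well-connected schools.
  # Edge direction is ignored here (an assumption for this toy example).
  eigenvector = eigen_centrality(g, directed = FALSE)$vector,
  stringsAsFactors = FALSE
)

# Schools ordered from highest to lowest betweenness.
print(centrality[order(-centrality$betweenness), ])
```

Plotting `betweenness` against `eigenvector` from a table like this gives the kind of scatterplot described above, where a school that scores high on both lands in the top right.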

Looking at the raw data, this school is listed as “similar” 1,636 times. What does this mean? This institution is apparently viewed alongside 43.9% of the entire database (1,636 / 3,727 schools). Given that my dataset spans a large range of institution types (for-profit, 2- and 4-year schools, public, private, etc.), I find this very strange, and almost impossible. I checked one school, a community college in Arizona, and voila, there it was, listed in its top 5.
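The share quoted above is simple arithmetic, and the count behind it is just the number of times a school shows up as a target in the edge list. A minimal sketch in base R follows; the two totals come from the text, while the small `edges` table and its school labels are made up for illustration.

```r
# Figures quoted in the post.
times_listed  <- 1636
total_schools <- 3727

share <- times_listed / total_schools
cat(sprintf("%.1f%%\n", 100 * share))  # prints 43.9%

# The same kind of count from an edge list: how often each school
# appears in someone else's top 5 (toy data, hypothetical labels).
edges <- data.frame(
  from = c("A", "B", "C", "D"),
  to   = c("X", "X", "Y", "X"),
  stringsAsFactors = FALSE
)

in_counts <- sort(table(edges$to), decreasing = TRUE)
print(in_counts)
```

Sorting the counts in decreasing order puts any school that appears suspiciously often at the top of the table.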

What do I take away from this? One of 3 things:

1. The website isn’t tracking that data correctly. Being in the top 5 for 44% of the database seems impossible to me.
2. My crawler didn’t parse the data correctly. Entirely possible, but I did manually check some of the data points that I thought were strange.
3. This institution gets so much traffic volume on the site that it biases how the overlap is calculated for each school.

The other idea I had, though it wouldn’t make any sense from an ROI perspective, was that the school had an agreement with the search site. I highly doubt that is the case.

I will definitely expand on this later on, but hopefully you think this is as interesting as I do.

Filed under: College Admissions, Higher Education, R Tagged: College Admissions, College Search, Enrollment Management, Higher Education, R, Student Recruitment