(This article was first published on **Data Twirling » R**, and kindly contributed to R-bloggers)

I wanted to build upon my previous post and dive a little deeper into the sorts of questions we can answer using the FAFSA data supplied to us by applicants.

As a quick overview, students completing the FAFSA for student aid can list up to ten institutions on the form. I consider this the student’s consideration set. When aggregating these data, we can start to get a sense of the most frequently listed schools and how these institutions may be related.

With these data, you can manipulate the structure to answer a wide range of questions. One approach is to coerce the data into a network. For this task, I am going to use the statistical programming language R and the igraph library. The resulting network includes all schools listed (excluding the host institution), with weighted edges representing the number of co-occurrences, that is, how many forms listed both schools.
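As a rough sketch of that coercion step, the snippet below builds a weighted, undirected igraph object from a handful of made-up FAFSA lists. The school names, the `fafsa_lists` object, and the pair-counting approach are assumptions for illustration and not the code used for the real data; it also uses the newer igraph function name `graph_from_data_frame()`.

```r
library(igraph)

# Hypothetical toy data: each element is one applicant's list of other schools
# (names are invented; the host institution is already excluded).
fafsa_lists <- list(
  c("School A", "School B", "School C"),
  c("School A", "School B"),
  c("School B", "School D"),
  c("School A", "School C")
)

# Every unordered pair of schools on the same form counts as one co-occurrence.
pairs <- do.call(rbind, lapply(fafsa_lists, function(schools) {
  schools <- sort(unique(schools))
  if (length(schools) < 2) return(NULL)  # a single school has no pairs
  t(combn(schools, 2))
}))

# Sum the co-occurrences per pair; the total becomes the edge weight.
edges <- aggregate(
  weight ~ from + to,
  data = data.frame(from = pairs[, 1], to = pairs[, 2], weight = 1),
  FUN = sum
)

# Extra columns in the edge list (here, weight) become edge attributes.
g <- graph_from_data_frame(edges, directed = FALSE)
```

Sorting each list before calling `combn()` keeps every pair in a consistent order, so "School A / School B" and "School B / School A" are counted as the same edge.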

Listed below are some quick stats on my undirected network from the last few years:

- Graph density: 0.05108093
- Diameter: 5
- Average Path Length: 2.418751
- Transitivity (clustering coefficient): 0.3390529

Graph density is the ratio of the number of edges present to the total number of possible edges. For context, an edge is a connection between two schools. If you think of Facebook, you and your friends are connected by an edge. Diameter is the number of steps (edges) on the shortest path connecting the two farthest nodes in the network. The Average Path Length is the mean number of steps on the shortest path between each pair of schools. The clustering coefficient measures how strongly the nodes tend to cluster together: if two schools each co-occur with a third school, how often do they also co-occur with each other (that is, appear on the same FAFSA form)?
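To make these definitions concrete, here is a small sketch on a hypothetical four-school network that is simple enough to check by hand. The graph is invented for illustration, and the calls use the newer igraph names (`edge_density()`, `mean_distance()`), which correspond to the older dotted names such as `graph.density()`.

```r
library(igraph)

# Hypothetical network: a triangle (A, B, C) plus school D attached only to C.
toy <- graph_from_literal(A - B, B - C, A - C, C - D)

dens  <- edge_density(toy)                     # 4 edges out of choose(4, 2) = 6 possible
diam  <- diameter(toy, directed = FALSE)       # longest shortest path (A to D, or B to D): 2
apl   <- mean_distance(toy, directed = FALSE)  # (1 + 1 + 1 + 1 + 2 + 2) / 6 pairs
trans <- transitivity(toy)                     # 3 * 1 triangle / 5 connected triples = 0.6
```

If a graph carries a `weight` edge attribute, igraph's path-based functions may treat it as a distance, so it is worth checking whether that is the intended reading for co-occurrence counts.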

Shown below is a plot of the graph, with each school sized by its PageRank score (computed with a function included in igraph).
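A minimal sketch of sizing nodes by PageRank, again on a hypothetical toy graph rather than the real FAFSA network. It uses the newer function name `page_rank()`, and the scaling factor of 100 is an arbitrary choice for a graph this small.

```r
library(igraph)

# Hypothetical network: a triangle (A, B, C) plus school D attached only to C.
toy <- graph_from_literal(A - B, B - C, A - C, C - D)

# page_rank() returns a list; the scores are in $vector and sum to 1.
pr <- page_rank(toy)$vector

# Scale the scores up so the differences are visible as vertex sizes.
plot(toy, vertex.size = pr * 100, vertex.label = V(toy)$name)
```

School C, which is connected to every other school, receives the highest score and is therefore drawn largest.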

It’s easy to see that there are a few key players in the FAFSA network; I consider these “core” competitors. More interesting to me, however, are the schools at the outer edge, as they are listed less often and say more about the choice set of an individual applicant.

In summary, this post was intended to be a quick overview of how one might employ network analysis to study the schools commonly listed on the FAFSA form for your institution. In the future, I will take the same data and use association rules to find common patterns of school listings.

EDIT: Here are the code snippets that I used to generate the data and plot above:

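    ## basic stats:
    ## density (graph.density)
    graph.density(g)

    ## diameter
    diameter(g, directed=FALSE)

    ## average path length (shortest.paths)
    average.path.length(g, directed=FALSE)

    ## transitivity (clustering coefficient)
    transitivity(g)

    ## radius
    radius(g)

    ## degree distribution
    plot(1-degree.distribution(g, cumulative=TRUE), type="l",
         xlab="degree", ylab="Cume Distribution", main="FAFSA Network")

    ## g$layout <- ...
    ## (the right-hand side of this assignment was lost in the original post;
    ##  the plot call below sets the layout explicitly)

    ## PageRank scores used to size the vertices
    ## (the right-hand side was also lost; page.rank(g)$vector is an assumed reconstruction)
    pagerank <- page.rank(g)$vector

    plot(g,
         vertex.size=pagerank*150,
         vertex.label=NA,
         vertex.color="red",
         vertex.frame.color="black",
         edge.arrow.size=0,
         edge.color=colors()[239],
         edge.width=.5,
         edge.curved=TRUE,
         layout=layout.auto(g))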
