Programming Language Popularity: StackOverflow and Ohloh

August 17, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)



In the following example, programming language popularity is measured using two data sets. The first is the number of contributors associated with a language on ohloh.net. The second is tag usage at stackoverflow.com.


SQL with no DDL
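I admit it... in an age of NoSQL... I like SQL. I agree that fixed table schemas can be a real pain though... who wants the overhead of defining database tables for a quick comparison of two data sets? The sqldf package sidesteps this by running SQL directly against R data frames, with no table definitions required.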


Joining on the language name provides a simple, intuitive way to correlate the result sets. Of course there are limitations to this approach - after all, SQL was designed for relational databases. Since the "join" is done on language and tag name, some languages may be underrepresented. For instance:

  • jQuery - a JavaScript library - is a leading tag.
  • Objective-C questions might appear under the iPhone tag.
  • C# questions might appear under .NET, ASP.NET or other Microsoft tags.
Another problem with this type of comparison is that two records can share a key without referring to the same thing. For instance, the C programming language might be matched with a record pertaining to the third letter of the alphabet. This is not a problem in the current example because the domain of both data sets is programming languages.
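The underrepresentation problem can be made concrete with a small sketch. The alias table below is purely hypothetical (it is not part of the original analysis, and the choice of which tag belongs to which language is an assumption); it only shows how related tags could be folded into a language name before joining.

```r
# Hypothetical alias table (an assumption, not part of the original analysis):
# each related tag is mapped to the language it would be counted under.
tag_aliases <- c(
  "jquery"  = "javascript",
  "iphone"  = "objective-c",
  ".net"    = "c#",
  "asp.net" = "c#"
)

# Return the language name a tag should be joined on.
# Tags without an alias are only lower-cased.
canonical_language <- function(tag) {
  key <- tolower(tag)
  mapped <- unname(tag_aliases[key])
  ifelse(is.na(mapped), key, mapped)
}

canonical_language(c("jQuery", "iPhone", "ASP.NET", "Python"))
```

The mapped names could be stored as an extra column and summed per language before the join. Note that this swaps one bias for another: counting every iPhone or .NET question toward a single language would overstate that language instead.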


Top Languages
In this particular analysis, I am really interested in outliers - not the vast majority of the languages that appear in the data set. So each point is labeled with the name of its language. For less popular languages, this chart is impossible to read and madly cluttered... but it is great for focusing on the most popular languages. So rather than being a publication-quality graphic, the chart above provides a "quick-and-dirty" perspective that can lead to helpful discussions among people familiar with the programming language domain.


In the previous post, Ruby ranked at the top. This demonstrates the Ruby-centric nature of GitHub, which was initially directed towards the Ruby community. Similar trends affect the results in the current post (where Ruby ranks 12th in tag count and 16th in the number of contributors). R is 18th in tag count and 33rd in number of contributors.


The data was extracted over the last few days and is available on GitHub in ohlo_2010-08-16.txt and stackoverflow.txt (warning: 400MB file... all tags from stackoverflow are listed in it). The files were analyzed with the following R code.



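library(ggplot2)
library(sqldf)

# Both files are semicolon-delimited with a header row.
SODF=read.csv('stackoverflow.txt',header=TRUE, sep=';')
OHLODF=read.csv('ohlo_2010-08-16.txt',header=TRUE, sep=';')

# Inspect the first few rows of each data frame.
head(OHLODF)
head(SODF)

# Case-insensitive inner join of ohloh languages to stackoverflow tags.
df = sqldf('select Name name, Count tag_count, Contributors contributors 
from OHLODF o 
join SODF s on LOWER(s.Tag) = LOWER(o.Name) order by 1')

# Scatter plot with each point labeled by its language name.
ggplot(data=df, 
       aes(x=tag_count, y=contributors, color=name)) + 
  geom_point() + 
  geom_text(aes(label = name))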

The resulting chart is displayed above. The top 10 languages, first by number of contributors and then by tag count, are listed below:


> head(df[order(df$contributors, decreasing=TRUE),],10)
         name tag_count contributors
57        XML     12374       133183
24       HTML     21936       106012
28       Java     62386        78098
9           C     17256        78023
13        CSS     16429        72060
11        C++     38691        61831
29 JavaScript     46608        60677
33       Make       537        50328
44     Python     31852        38691
39        PHP     53884        36952


> head(df[order(df$tag_count, decreasing=TRUE),],10)
          name tag_count contributors
10          C#    101811        22198
28        Java     62386        78098
39         PHP     53884        36952
29  JavaScript     46608        60677
11         C++     38691        61831
44      Python     31852        38691
48         SQL     25316        28069
24        HTML     21936       106012
9            C     17256        78023
37 Objective-C     17250         6555


All other things being equal, one might expect the relationship between contributors to projects and tag counts to be roughly linear. As it stands, that is not the case at all.
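One way to put a number on this is sketched below. It uses only the 13 distinct languages that appear in the two listings above, not the full joined data frame, so the result is biased towards the extremes of each measure and should be read as a rough illustration. The variable names (top, pearson, spearman, fit) are my own.

```r
# Rows copied from the two top-10 listings above (13 distinct languages).
top <- data.frame(
  name = c("XML", "HTML", "Java", "C", "CSS", "C++", "JavaScript",
           "Make", "Python", "PHP", "C#", "SQL", "Objective-C"),
  tag_count = c(12374, 21936, 62386, 17256, 16429, 38691, 46608,
                537, 31852, 53884, 101811, 25316, 17250),
  contributors = c(133183, 106012, 78098, 78023, 72060, 61831, 60677,
                   50328, 38691, 36952, 22198, 28069, 6555),
  stringsAsFactors = FALSE
)

# Pearson measures linear association; Spearman only compares rank order.
pearson  <- cor(top$tag_count, top$contributors, method = "pearson")
spearman <- cor(top$tag_count, top$contributors, method = "spearman")

# Share of the variance in contributors explained by a straight-line fit.
fit <- lm(contributors ~ tag_count, data = top)
r_squared <- summary(fit)$r.squared

print(c(pearson = pearson, spearman = spearman, r_squared = r_squared))
```

With the full data available, the same three lines can be run against df instead of top.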

Web Oriented Languages
The languages represented lean heavily towards web application technologies. HTML, CSS, JavaScript and PHP are used almost exclusively for such development, and Ruby, Python, Perl, Java, C# and SQL are also heavily used for web applications (though not exclusively). C, C++, Objective-C and Make are geared less towards web development.

Microsoft 
According to Wikipedia, StackOverflow is a Microsoft partner, and StackOverflow itself was developed on the Microsoft platform. This might provide some explanation for the high representation of C#.

Simple Languages = Fewer Questions
XML and HTML are markup languages with relatively simple syntax, hence their small tag counts relative to their contributor counts. CSS and Make are also relatively small languages with specific uses rather than general-purpose programming languages. The fact that C++ was developed as an enhancement to the C programming language explains why there are more questions (and a larger tag count) for C++ than C. A more speculative suggestion is that Perl's relatively low tag count indicates that the "more than one way to do it" philosophy leads to fewer questions. An obvious alternative is that Perl users simply ask questions in other venues.
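A rough way to compare these languages is the number of tags per contributor. The sketch below uses the rows from the listings above for the languages named in this section (Perl is left out because its counts are not listed); the tags_per_contributor column is my own illustrative measure, not something from the original analysis.

```r
# Rows copied from the listings above for languages discussed in this section.
langs <- data.frame(
  name = c("XML", "HTML", "CSS", "Make", "C", "C++"),
  tag_count = c(12374, 21936, 16429, 537, 17256, 38691),
  contributors = c(133183, 106012, 72060, 50328, 78023, 61831),
  stringsAsFactors = FALSE
)

# Illustrative measure: questions tagged per ohloh contributor.
langs$tags_per_contributor <- langs$tag_count / langs$contributors

# Fewest questions per contributor first.
ranked <- langs[order(langs$tags_per_contributor), ]
print(ranked, row.names = FALSE)
```

On these rows Make and XML come first and C++ last, which is consistent with the pattern described above, though it does not by itself show that simpler syntax is the cause.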

Conclusion
All measures of programming language popularity have their limitations. Correlating various sets of data can provide some additional insights into their prevalence and usage. R and sqldf provide a convenient means of making such comparisons. And ggplot2 provides a great way of charting results.





Update

A log scale (as suggested by Tal in the comments) provides better insight into the majority of languages that appear clustered in the lower-left corner of the chart. However, though this site might be considered R-rated, the **** was added through later image editing to make it fit for all audiences.

ggplot(data=df, 
       aes(x=log(tag_count), y=log(contributors), color=name)) + 
  geom_point() + 
  geom_text(aes(label = name))
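As an alternative sketch, the log transform can be applied by the axis scales rather than to the data, so the tick labels stay in original counts. This uses ggplot2's scale_x_log10() and scale_y_log10(), which are base 10 where log() above is the natural log; the layout of the points is the same up to a constant factor. The wrapper function plot_log_scales is a hypothetical name of my own and expects the columns produced by the sqldf query.

```r
library(ggplot2)

# Same scatter plot as above, but the axes are log10-scaled instead of
# transforming the data, so ticks are labeled with actual counts.
# Expects columns: name, tag_count, contributors.
plot_log_scales <- function(df) {
  ggplot(data = df,
         aes(x = tag_count, y = contributors, color = name)) +
    geom_point() +
    geom_text(aes(label = name)) +
    scale_x_log10() +
    scale_y_log10()
}
```

Calling plot_log_scales(df) with the joined data frame from earlier draws the chart. Rows with a zero count would need to be filtered out first, since they have no finite logarithm.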

