While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Exactly by how much is a fraught question, and accurate statistics are hard to come by. Another interesting question is whether the gender disparity varies by language, and how to define a “typical programmer” for a given language.
Jeff Allen from Trestle Tech recently took an interesting approach using R to gather data on gender ratios for programmers: get a list of the top coders for each programming language, and then count the number of men and women in each list. Neither task is trivial. For a list of coders, Jeff scraped GitHub's list of trending repositories over the past month by programming language, and then extracted the avatars for the listed contributors. Then, he used the Microsoft Cognitive Services Face API on each avatar to determine the apparent gender of each contributor, and tallied up the results. You can find the R code he used on GitHub.
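The tallying step can be sketched in a few lines. This is a minimal Python illustration rather than Jeff's R code: the function names `tally_by_language` and `percent_female` are hypothetical, the input is assumed to be a list of (language, label) pairs already produced by a face-analysis step, and a label of `None` is assumed to mean that no face was detected in the avatar.

```python
from collections import Counter, defaultdict


def tally_by_language(records):
    """Count apparent-gender labels per language.

    records: iterable of (language, label) pairs, where label is
    'male', 'female', or None (assumed here to mean "no face found").
    """
    counts = defaultdict(Counter)
    for language, label in records:
        if label is None:
            # Avatars without a detectable face are skipped, not counted.
            continue
        counts[language][label] += 1
    return counts


def percent_female(counts):
    """Return the percentage of counted contributors labeled 'female'."""
    result = {}
    for language, c in counts.items():
        total = c["male"] + c["female"]
        result[language] = 100.0 * c["female"] / total if total else None
    return result


# Made-up records for illustration only; these are not the study's data.
sample = [
    ("R", "male"), ("R", "female"), ("R", "male"), ("R", None),
    ("C++", "male"), ("C++", "male"),
]
print(percent_female(tally_by_language(sample)))
```

Skipping unlabeled avatars mirrors the description above, but it also means the denominator for each language is the number of detected faces, not the number of contributors.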
According to this analysis, none of the contributors to top C++ projects on GitHub are female; by contrast, almost 10% of contributors to R projects are female.
Now, these data need to be taken with a grain of salt. The main issue is sample size: fewer than 100 programmers per language are identified as “top programmers” via this method, and sometimes significantly fewer (just 45 top C++ contributors were identified). Part of the reason for this is that not all programmers use their face as an avatar; those who used a symbol, logo or cartoon were not counted. Furthermore, it's reasonable to assume that there's a disparity in the rate at which women use their own face as an avatar compared to men, which would add bias to the above results in addition to the variability from the small numbers. Finally, the gender determination is based on an algorithm, and isn't guaranteed to match the gender identity of the programmer (or their avatar).
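To get a feel for how much variability such small samples carry, one option is to put a confidence interval around an observed proportion. The sketch below uses the Wilson score interval, which is my choice for illustration and not part of the original analysis; the function name `wilson_interval` is hypothetical. It is applied to a category that appears 0 times in a sample of 45, the size of the C++ list.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    successes: number of observations in the category of interest
    n: total number of observations (must be positive)
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(0, 45)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the upper bound comes out at roughly 8%, so an observed rate of zero in a list this small is still compatible with a true rate of several percent. The interval only reflects sampling variability; it does nothing to correct for the avatar-selection bias or the classifier errors described above.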
Nonetheless, it's an interesting example of using social network data in conjunction with cognitive APIs to conduct demographic studies. You can find examples of using other data from the facial analysis, including apparent happiness by language, at the link below.
(Update June 15: re-ran the analysis and updated the chart above to actually display percentages, not ratios, on the y-axis. The numbers changed slightly as the GitHub data changed. The old chart is here.)
Trestle Tech: EigenCoder: Programming Stereotypes