Two years as a Data Scientist at Stack Overflow

June 22, 2017
(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection.

After another year, I’d like to revisit the topic. While my first post focused mostly on the transition from my PhD to an industry position, here I’ll be sharing what has changed for me in my job in the last year, and what I hope the next year will bring.

Hiring a Second Data Scientist

In last year’s blog post, I noted how difficult it could be to be the only data scientist on a team:

Most of my current statistical education has to be self-driven, and I need to be very cautious about my work: if I use an inappropriate statistical assumption in a report, it’s unlikely anyone else will point it out.

This continued to be a challenge, and fortunately in December we hired our second data scientist, Julia Silge.

We started hiring for the position in September, and I got to meet a lot of terrific candidates during the application and review process. But I was particularly excited to welcome Julia to the team because we’d been working together during the course of the year, ever since we met and created the tidytext package at the 2016 rOpenSci unconference.

Julia, like me, works on analysis and visualization rather than building and productionizing features, and having a second person in that role has made our team much more productive. This is not just because Julia is an exceptional colleague, but because the two of us can now collaborate on statistical analyses or split them up to give each more focus. I did enjoy being the first data scientist at the company, but I’m glad I’m no longer the only one. Julia’s also a skilled writer and communicator, which was essential in achieving the next goal.

Company blog posts

In last year’s post, I shared some of the work that I’d done to explore the landscape of software developers, and set a goal for the following year (emphasis is new):

I’m also just intrinsically pretty interested in learning about and visualizing this kind of information; it’s one of the things that makes this a fun job. One plan for my second year here is to share more of these analyses publicly. In a previous post I looked at which technologies were the most polarizing, and I’m looking forward to sharing more posts like that soon.

I’m happy to say that we’ve made this a priority in the last six months. Since December I’ve gotten the opportunity to write a number of posts for the Stack Overflow company blog:

Other members of the team have written data-driven blog posts as well, including:

I’ve really enjoyed sharing these snapshots of the software developer world, and I’m looking forward to sharing a lot more on the blog this next year.

Teaching R at Stack Overflow

Last year I mentioned that part of my work has been developing data science architecture, and trying to spread the use of R at the company.

This also has involved building R tutorials and writing “onboarding” materials… My hope is that as the data team grows and as more engineers learn R, this ecosystem of packages and guides can grow into a true internal data science platform.

At the time, R was used mostly by three of us on the data team (Jason Punyon, Nick Larsen, and me). I’m excited to say it’s grown since then, and not just because of my evangelism.

Every Friday since last September, I’ve met with a group of developers to run internal “R sessions”, in which we analyze some of our data to develop insights and models. Together we’ve made discoveries that have led to real projects and features, for both the Data Team and other parts of the engineering department.

There are about half a dozen developers who regularly take part, and they all do great work. But I especially appreciate Ian Allen and Jisoo Shin for coming up with the idea of these sessions back in September, and for following through in the months since. Ian and Jisoo joined the company last summer, and were interested in learning R to complement their development of product features. Their curiosity, and that of others in the team, has helped prove that data analysis can be a part of every engineer’s workflow.

Writing production code

My relationship to production code (the C# that runs the actual Stack Overflow website) has also changed. In my first year I wrote much more R code than C#, but in the second I’ve stopped writing C# entirely. (My last commit to production was more than a year ago, and I often go weeks without touching my Windows partition). This wasn’t really a conscious decision; it came from a gradual shift in my role on the engineering team. I’d usually rather be analyzing data than shipping features, and focusing entirely on R rather than splitting attention across languages has been helpful for my productivity.

Instead, I work with engineers to implement product changes based on analyses and push models into production. One skill I’ve had to work on is writing technical specifications, both for data sources that I need to query and for models that I’m proposing for production. One developer I’d like to acknowledge specifically is Nick Larsen, who works with me on the Data Team. Many of the blog posts I mention above answer questions like “What tags are visited in New York vs San Francisco”, or “What tags are visited at what hour of the day”, and these wouldn’t have been possible without Nick. Until recently, this kind of traffic data was very hard to extract and analyze, but he developed processes that extract and transform the data into more readily queryable tables. This has made many important analyses possible besides the blog posts, and I can’t appreciate this work enough.

(Nick also recently wrote an awesome post, How to talk about yourself in a developer interview, that’s worth checking out).

Working with other teams

Last year I mentioned that one of my projects was developing targeting algorithms for Job Ads, which match Stack Overflow visitors with jobs they may be interested in (for example, matching people who visit Python and JavaScript questions with Python web developer jobs). These are an important part of our business and still make up part of my data science work. But I learned in the last year about a lot of components of the business that data could help more with.

One team that I’ve worked with that I hadn’t in the first year is Display Ads. Display Ads are separate from job ads, and are purchased by companies with developer-focused products and services.

For example, I’ve been excited to work more closely with Steve Feldman on the Display Ad Operations team. If you’re wondering why I’m not ashamed to work on ads, please read Steve’s blog post on how we sell display ads at Stack Overflow; he explains it better than I could. We’ve worked on several new methods for display ad targeting and evaluation, and I think there’s a lot of potential for data to have a positive impact for the company.

Changes in the rest of my career

There’ve been other changes in my second year out of academia. In my first year, I attended only one conference (NYR 2016) but I’ve since had more of a chance to travel, including to useR and JSM 2017, PLOTCON, rstudio::conf 2017, and NYR 2017. I spoke at a few of these, about my broom package, about gganimate and about the history of R as seen by Stack Overflow.

Julia and I wrote and published an O’Reilly book, Text Mining with R (now available on Amazon and free online here). I also self-published an e-book, Introduction to Empirical Bayes: Examples from Baseball Statistics, based on a series of blog posts. I really enjoyed the experience of turning blog posts into a larger narrative, and I’d like to continue doing so this next year.

There are some goals I didn’t achieve. I’ve had a longstanding interest in getting R into production (and we’ve idly investigated some approaches like Microsoft R Server), but as of now we’re still productionizing models by rewriting them in C#. And there are many teams at Stack Overflow that I’d like to give better support to; prioritizing the Data Team’s time has been a challenge, though having a second data scientist has helped greatly. But I’m still happy with how my work has gone, and excited about the future.

In any case, this made the whole year worthwhile:
