Imagine you just started a job at a new company. You watched World War Z recently, so you’re in a skeptical mood, and given that your last two startups failed from what you believe to be a lack of data, you’re giving everything an extra critical eye.
You start by thinking about the impact of the sales team. How much extra revenue are they generating for the company? The sales folks you’ve met say that over 90% of the leads they’ve talked to end up buying the company’s product – but, you wonder, how many of those leads would have converted anyways?
You take a look at the logs, and notice something interesting: last week was hack week, and half the salesforce took time off from making calls to make Marauder’s Maps instead, yet the rate of converted leads remained the same…*
Suddenly, one of your teammates drops by your desk. He’s making a batch of Soylent, and he wants you to take a sip. It looks nasty, so you ask what the benefits are, and he responds that his friends who’ve been drinking it for the past few months just ran a marathon! Oh, did they just start running? Nope, they ran the marathon last year too!…
*Inspired by a true story.
Causality is incredibly important, yet often extremely difficult to establish.
Do patients who self-select into taking a new drug get better because the drug works, or would they have gotten better anyways? Is your salesforce actually effective, or are they simply talking to the customers who already plan to convert? Is Soylent (or your company’s million-dollar ad campaign) truly worth your time?
In an ideal world, we’d be able to run experiments – the gold standard for measuring causality – whenever we wish. In the real world, however, we can’t. There are ethical qualms with giving certain patients placebos, or dangerous and untested drugs. Management may be unwilling to take a potential short-term revenue hit by assigning sales to random customers, and a team earning commission-based bonuses may rebel against the thought as well.
How can we understand causal lifts in the absence of an A/B test? This is where propensity modeling, or other techniques of causal inference, comes into play.
So suppose we want to model the effect of drinking Soylent using a propensity model technique. To explain the idea, let’s start with a thought experiment.
Imagine Brad Pitt has a twin brother, indistinguishable in every way: Brad 1 and Brad 2 wake up at the same time, they eat the same foods, they exercise the same amount, and so on. One day, Brad 1 happens to receive the last batch of Soylent from a marketer on the street, while Brad 2 does not, and so only Brad 1 begins to incorporate Soylent into his diet. In this scenario, any subsequent difference in behavior between the twins is precisely the drink’s effect.
Taking this scenario into the real world, one way to estimate the Soylent’s effect on health would be as follows:
- For every Soylent drinker, find a Soylent abstainer who’s as close a match as possible. For example, we might match a Soylent-drinking Jay-Z with a non-Soylent Kanye, a Soylent-drinking Natalie Portman with a non-Soylent Keira Knightley, and a Soylent-drinking JK Rowling with a non-Soylent Stephenie Meyer.
- We measure Soylent’s effect as the difference between each twin pair.
However, finding closely matching twins is extremely difficult in practice. Is Jay-Z really a close match with Kanye, if Jay-Z sleeps one hour more on average? What about the Jonas Brothers and One Direction?
Propensity modeling, then, is a simplification of this twin matching procedure. Instead of matching pairs of people based on all the variables we have, we simply match all users based on a single number, the likelihood (“propensity”) that they’ll start to drink Soylent.
In more detail, here’s how to build a propensity model.
- First, select which variables to use as features. (e.g., what foods people eat, when they sleep, where they live, etc.)
- Next, build a probabilistic model (say, a logistic regression) based on these variables to predict whether a user will start drinking Soylent or not. For example, our training set might consist of a set of people, some of whom ordered Soylent in the first week of March 2014, and we would train the classifier to model which users become the Soylent users.
- The model’s probabilistic estimate that a user will start drinking Soylent is called a propensity score.
- Form some number of buckets, say 10 buckets in total (one bucket covers users with a 0.0 – 0.1 propensity to take the drink, a second bucket covers users with a 0.1 – 0.2 propensity, and so on), and place people into each one.
- Finally, compare the drinkers and non-drinkers within each bucket (say, by measuring their subsequent physical activity, weight, or whatever measure of health) to estimate Soylent’s causal effect.
For example, here’s a hypothetical distribution of Soylent and non-Soylent ages. We see that drinkers tend to be quite a bit older, and this confounding fact is one reason we can’t simply run a correlational analysis.
After training a model to estimate Soylent propensity and group users into propensity buckets, this might be a graph of the effect that Soylent has on a person’s weekly running mileage.
In the above (hypothetical) graph, each row represents a propensity bucket of people, and the exposure week denotes the first week of March, when the treatment group received their Soylent shipment. We see that prior to that week, both groups of people track quite well. After the treatment group (the Soylent drinkers) start their plan, however, their weekly running mileage ramps up, which forms our estimate of the drink’s causal effect.
Other Methods of Causal Inference
Of course, there are many other methods of causal inference on observational data. I’ll run through two of my favorites. (I originally wrote this post in response to a question on Quora, which is why I take my examples from there.)
Quora recently started displaying badges of status on the profiles of its Top Writers, so suppose we want to understand the effect of this feature. (Assume that it’s impossible to run an A/B test now that the feature has been launched.) Specifically, does the badge itself cause users to gain more followers?
For simplicity, let’s assume that the badge was given to all users who received at least 5000 upvotes in 2013. The idea behind a regression discontinuity design is that the difference between those users who just barely receive a Top Writer badge (i.e., receive 5000 upvotes) and those who just barely don’t (i.e., receive 4999 upvotes) is more or less random chance, so we can use this threshold to estimate a causal effect.
For example, in the imaginary graph below, the discontinuity at 5000 upvotes suggests that a Top Writer badge leads to around 100 more followers on average.
Understanding the effect of a Top Writer badge is a fairly uninteresting question, though. (It just makes an easy example.) A deeper, more fundamental question could be to ask: what happens when a user discovers a new writer that they love? Does the writer inspire them to write some of their own content, to explore more of the same topics, and through curation lead them to engage with the site even more? How important, in other words, is the connection to a great user as opposed to the reading of random, great posts?
I studied an analogous question when I was at Google, so instead of making up an imaginary Quora case study, I’ll describe some of that work here.
So let’s suppose we want to understand what would happen if we were able to match users to the perfect YouTube channel. How much is the ultimate recommendation worth?
- Does falling in love with a new channel lead to engagement above and beyond activity on the channel itself, perhaps because users return to YouTube specifically for the new channel and stay to watch more? (a multiplicative effect) In the TV world, for example, perhaps many people stay at home on Sunday nights specifically to catch the latest episode of Real Housewives, and channel surf for even more entertainment once it’s over.
- Does falling in love with a new channel simply increase activity on the channel alone? (an additive effect)
- Does a new channel replace existing engagement on YouTube? After all, maybe users only have a limited amount of time they can spend on the site. (a neutral effect)
- Does the perfect channel actually cause users to spend less time overall on the site, since maybe they spend less time idly browsing and channel surfing once they have concrete channels they know how to turn to? (a negative effect)
As always, an A/B test would be ideal, but it’s impossible in this case to run: we can’t force users to fall in love with a channel (we can recommend them channels, but there’s no guarantee they’ll actually like them), and we can’t forcibly block them from certain channels either.
One approach is to use a natural experiment (a scenario in which the universe itself somehow generates a random-like assignment) to study this effect. Here’s the idea.
Consider a user who uploads a new video every Wednesday. One month, he lets his subscribers know that he won’t be uploading any new videos for a few weeks, while he goes on vacation.
How do his subscribers respond? Do they stop watching YouTube on Wednesdays, since his channel was the sole reason for their visits? Or is their activity relatively unaffected, since they only watch his content when it appears on the front page?
Imagine, instead, that the channel starts uploading a new video every Friday. Do his subscribers start to visit then as well? And now that they’re on YouTube, do they merely stay for the new video, or does their visit lead to a sinkhole of searches and related content too?
As it turns out, these scenarios do happen. For example, here’s a calendar of when one popular channel uploads videos. You can see that in 2011, it tended to upload videos on Tuesdays and Fridays, but it shifted to uploads on Wednesday and Saturday at the end of the year.
By using this shift as natural experiment that “quasi-randomly” removes a well-loved channel on certain days and introduces it on others, we can try to understand the effect of successfully making the perfect recommendation.
(This is probably a somewhat convoluted example of a natural experiment. For an example that perhaps illustrates the idea more clearly, suppose we want to understand the effect of income on mental health. We can’t force some people to become poor or rich, and a correlational study is clearly flawed. This NY Times article describes a natural experiment when a group of Cherokee Indians distributed casino profits to its members, thereby “randomly” lifting some of them out of poverty.
Another example, assuming there’s nothing special about the period in which hack week occurs, is the use of hack week as an instrument that quasi-randomly “prevents” the sales team from doing their job, as in the scenario I described above.)
Discovering Drivers of Growth
Let’s go back to propensity modeling.
Imagine that we’re on our company’s Growth team, and we’re tasked with figuring out how to turn casual users of the site into users that return every day. What do we do?
The propensity modeling approach might be the following. We could take a list of features (installing the mobile app, logging in, signing up for a newsletter, following certain users, etc.), and build a propensity model for each one. We could then rank each feature by its estimated causal effect on engagement, and use the ordered list of features to prioritize our next sprint. (Or we could use these numbers in order to convince the exec team that we need more resources.) This is a slightly more sophisticated version of the idea of building an engagement regression model (or a churn regression model), and examining the weights on each feature.
Despite writing this post, though, I admit I’m generally not a fan of propensity modeling for many applications in the tech world. (I haven’t worked in the medical field, so I don’t have a strong opinion on its usefulness there, though I think it’s a little more necessary there.) I’ll save more of my reasons for another time, but after all, causal inference is extremely difficult, and we’re never going to be able to control for all the hidden influencers that can bias a treatment. Also, the mere fact that we have to choose which features to include in our model (and remember: building features is very time-consuming and difficult) means that we already have a strong prior belief on the usefulness of each feature, whereas what we’d really like to do is to discover hidden motivations of engagement that we’ve never thought of.
So what can we do instead?
If we’re trying to understand what drives people to become heavy users of the site, why don’t we simply ask them?
In more detail, let’s do the following:
- First, we’ll run a survey on a couple hundred of users.
- In the survey, we’ll ask them whether their engagement on the site has increased, decreased, or remained about the same over the past year. We’ll also ask them to explain possible reasons for their change in activity, and to describe how they use the site currently. We can also ask for supplemental details, like their demographic information.
- Finally, we can filter all responses for those users who heavily increased their engagement over the past year (or who heavily decreased it, if we’re trying to understand churn), and analyse their responses for the reasons.
For example, here’s one interesting response I got when I ran this study at YouTube.
“I have always been a big music fan, but recently I took up playing the guitar. Because of my new found passion (playing the guitar) my desire to watch concerts has increased. I started watching a whole lot of music festivals and concerts that are posted on Youtube and other music videos. I have spent a lot of time also watching guitar lessons on Youtube (from www.justinguitar.com).”
This response was representative of a general theme the survey uncovered: one big driver of engagement seemed to come from people discovering a new offline hobby, and using YouTube to increase their appreciation of it. People who wanted to start cooking at home would turn to YouTube for recipe videos, people who started playing tennis or some other sport would go to YouTube for lessons or other great shots, college students would look for channels like Khan Academy to supplement their lectures, and so on. In other words, offline activities were driving online growth, and instead of trying to figure out what kinds of online content people were interested in (which articles did they like on Facebook, who did they follow on Twitter, what did they read on Reddit), perhaps we should have been focusing on bringing their physical hobbies into the digital world.
This “offline hobby” idea certainly wouldn’t have been a feature I would have thrown into any engagement model, even if only because it’s a very difficult feature to create. (How do we know which videos are related to real-world behavior?) But now that we suspect it’s a potentially big driver of growth (“potentially” because, of course, surveys aren’t necessarily representative), it’s something we can spend a lot more time studying in the logs.
To summarize: propensity modeling is a powerful technique for measuring causal effects in the absence of a randomized experiment.
Purely correlational analyses on top of observational studies can be very dangerous, after all. To take my favorite example: if we find that cities with more policemen tend to have more crime, does this mean that we should try to reduce the size of our police forces in order to reduce the nation’s amount of crime?
That said, remember that (as always) a model is only as good as the data that you feed it. It’s super difficult to account for all the hidden variables that might matter, and what you think might be a well-designed causal model might well in fact be missing many hidden factors. (I actually remember hearing that a propensity model on the nurses study generated a flawed conclusion, though I can’t find any references to this at the moment.) So consider whether there are other approaches you can take, whether it’s an easier-to-understand causal technique or even just asking your users, and even if a randomized experiment seems too difficult to run now, the effort may be well worth the trouble in the end.