Cassie Kozyrkov discusses decision making and decision intelligence!

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers.]

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Cassie Kozyrkov, Chief Decision Scientist at Google.

Here is the podcast link.

Introducing Cassie Kozyrkov

Hugo: Hi there, Cassie, and welcome to DataFramed.

Cassie: Hi. Thanks, Hugo. Honored to be here.

Hugo: It’s a great pleasure to have you on the show. And I’m really excited to have you here to talk about data science in general, decision making, decision science, and decision intelligence. But before that, I’d like to find out a bit about you. First I want to know what your colleagues would say that you do.

Cassie: Oh, goodness. Well, it depends on the colleague, I think. But I think the consensus would be they’d say that I have some expertise in applied data science especially, and that I help Google teams and our cloud customers apply machine learning effectively.

Applied Data Science

Hugo: Great. Could you tell me a bit about what applied data science means to you?

Cassie: Yeah, so when it comes to data science, well, let’s start with what data science means, and then we’ll take it deeper. So data science, to me, is that umbrella discipline that has underneath it statistical inference, machine learning, and analytics or data mining. And the difference between these three, for me, boils down not to the algorithms that are used in them, because if you’re smart, you can use any algorithm for any of them. It also doesn’t boil down to the tools, but really boils down to the number of decisions under uncertainty that you want to make with them. With data mining, you really wanna get inspired. There are no specific decisions that you want to make yet, but you wanna see what your data inspires you to start thinking and dreaming about. Statistical inference is where you have a few really important decisions under uncertainty, and then machine learning and AI, they boil down to a recipe for repeated decision making. So many, many decisions under uncertainty.

Cassie: I actually see data science as a discipline of decision making, turning information into action. Now, where does this applied versus research side of things come in? Researchers focus more on enabling fundamental tools that other people will use to solve business problems. Whereas applied folks will go and find, among the available tools, what they need to solve those problems. So I’m not focused on how do I develop a new neural network architecture. I’m focused more on, there seems to be this germ of an idea in a business leader. How do we bring that out, make it real, build the team that’s going to end up working on that, and then ensure that the entire process from start to finish is well thought out and executed, and that at the end, there is a safe, reliable result.

What do you do?

Hugo: And I think framing data science, and applied data science as a subdiscipline, as decision making is something we’re gonna unpack more and more in this conversation. So that really sets the scene very nicely. So you said that your colleagues would view you as an expert on applied data science and thinking about how to use machine learning effectively. Now, what do you actually do? Are they pretty on track with what you do?

Cassie: They’re close. But I think at the heart of what I care about is this notion of type three error in statistics. For those of you who don’t remember your error types, let’s have a quick reminder. Type one is incorrectly rejecting a null hypothesis. Type two is incorrectly failing to reject a null hypothesis, and type three is correctly rejecting the wrong null hypothesis. Or, if you prefer a Bayesian statement on the same thing, it’s all the right math to solve entirely the wrong problem.

Hugo: Great. So could you give me an example of a type three error?

Cassie: Yeah. So it’s any rabbit hole that a data scientist goes down meticulously, carefully answering a question that just didn’t need to be answered. So maybe this’ll be a familiar thing to some of the data scientists listening, and I hope you don’t suffer from those too much, ’cause that’s a bit of a newbie gotcha, but it kind of goes like this: There you are, finished with most of your work for the week. It’s maybe 4:00 p.m. on a Friday, and you’re excited for a nice, free weekend, because that’s the whole point of being in industry and not being in academia anymore. Right? Just kidding. Anyways-

Hugo: You’re not kidding at all.

Cassie: I’m … Okay. I’m kidding some.

Hugo: Sure. A bit.

Cassie: Yeah. All right.

Hugo: Great. Go on, go on.

Cassie: Fine. I’m not kidding at all. So there you are, and you’re just ready to go home, and say a product manager comes up to you. And with this sense of urgency in their voice, wants to get a specific measurement from you. Or a specific question answered. And you think to yourself, "My goodness. But that is difficult. That’s gonna take me at least all weekend. And well into the nights as well. I’m gonna first have to figure out getting the data, then I have to sync up with data engineers. I’m gonna have to look up all these methods in the textbook. This is gonna be a difficult thing. But look, I am a great data scientist, and I can do this. And I can do this correctly and I can make sure that all the statistical assumptions are met. And come Monday morning, I’m going to deliver this thing perfectly." And so on Monday morning, you come on bloody knees, you lift this result up to that product manager. And they kind of poke their head and look at you, and go, "Oh. I didn’t even realize that that was what I was asking for."

Cassie: So there you were, meticulously, very correctly solving this problem, but rather uselessly as well. And it goes nowhere, the product manager doesn’t use it for anything, it just gets abandoned behind the sofa of lost results. So that’s a type three error.

Communication and Process

Hugo: And so how do you stop that happening? Presumably, on the Friday afternoon. It presumably involves communication. Right?

Cassie: Communication is one of those things, but process as well. So the data science team should know what the other stakeholders that they depend on are responsible for, and what it looks like for those pieces of work to be completed correctly. So I like to talk about this wide versus deep approach to data science. So a rigorous approach versus a more shallow, inspiration-gathering approach. And the second one is always good as long as you don’t end up wasting your data on it, and hopefully you’ll prod me about that shortly. But as long as you have data allocated to inspiration, having a light, gentle look at it is always a good idea. Putting your eyes on that data and seeing what it inspires you to think about. It helps you frame your ideas.

Hugo: And so in that case, we’re thinking about some sort of rapid prototyping in order to-

Cassie: We’re thinking about something even more basic. We’re thinking about just plotting the thing. And that is separate from a very careful pursuit, rigorously, of a specific and important goal. So first, separating these two and saying the former, that broad, wide, shallow approach, that is always … that’s indicated for every patient. Let’s just put it like that. Doctor prescribes that always. As long as you have the data to spare for it, do it. But don’t take your results too seriously, and don’t do anything too meticulous.

Cassie: On the other hand, this more rigorous approach, this takes a lot of effort. And it takes a lot of effort not just from the data science team. And the context for that, how the question is being asked, what assumptions are palatable and so forth, that is actually the responsibility of the decision maker, the business leader. And they have to have done their part properly in order for that meticulous work to make sense. So if you’re going to go and do rigorous things, you need to make sure that that work was properly framed for you.

Hugo: Right. So in this case of the data scientist going and spending their weekend working on this problem, essentially doing work that the product manager didn’t realize they were asking for, a way to solve that would’ve been doing some sort of rapid datavis, exploratory data analysis, and then having a conversation with the product manager about what they really wanted.

Cassie: I would say, actually, the other way around. Have a conversation with the product manager about what they really wanted first. And if what they want is something emotional, to get a feel for something, that needs to be unpacked a little more, and perhaps what they’re looking for is possible, perhaps it isn’t. Perhaps taking a look at the data generates the inspiration that is required for what they want. Perhaps they’re hoping that the data scientist is a genie in a magic lamp that grants wishes that can’t be granted. So talking to them and figuring out what they want should actually be the first step. But better even than that, would be an organization where there isn’t this adversarial relationship that assumes that the product manager doesn’t know their part. Better to staff the project with trained, skilled decision makers who know how to do their part, and the data scientist will simply check the requests coming in. And if the request has a certain characteristic, they will tend to go for no work or light work, and if the request has a different kind of characteristic, they will go and do things carefully, rigorously and meticulously at the level of rigor and complexity requested by that skilled business leader.

Hugo: I love it. So then we’re actually talking about specifically, with clarity, defining roles and defining a process around the work—

Cassie: Absolutely. So this can get pretty big and interesting, of how you arrange these teams and how you arrange these processes. In its lightest form, it can be a matter of who talks to whom in what order, but it can be much bigger than that.

Wasting Data

Hugo: Great. And this is something we’re gonna delve into later in the conversation: common organizational models for this type of work. But before that, I want to prod you a bit, and something you mentioned earlier was the idea of wasting your data. And maybe you could tell me what you mean by that.

Cassie: Sure. Well, we all learn something rather straightforward and obvious-sounding in statistics and data science class. And then unfortunately, we end up forgetting it a little bit. And it’s something that we really shouldn’t forget. And that is that a data point can be used for inspiration, or rigor, but not both if you’re dealing with uncertainty, if you wanna go beyond your data. Because when you’re assessing whether or not your opinion actually holds up in reality and in general, then you need to make sure that you check that opinion on something that you didn’t use to form the opinion. Because we, humans, are the sorts of creatures that find Elvis’s face in a piece of toast. And if we use the same piece of toast to get inspired to wonder whether toast looks like Elvis, and then to also answer whether toast does, in general, look like Elvis, we have a problem. You’re going to need to go to a different piece of toast.

Cassie: And so you can use data for inspiration or rigor, but not both. And so if you use all the data that you have for getting inspired, for figuring out what questions you even wanna ask, then you have no data left over to rigorously answer them.
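To make that concrete, here is a minimal sketch of one way to act on it in R, assuming a hypothetical `sales` data frame and an arbitrary 50/50 split: hold back part of the data before you explore, and touch the held-back part only once, for the confirmatory test.

```r
# Minimal sketch: one half of the data for inspiration, one half for rigor.
# The `sales` data frame, its columns, and the 50/50 split are illustrative assumptions.
set.seed(42)                                   # make the split reproducible
n           <- nrow(sales)
explore_idx <- sample(n, size = floor(n / 2))  # random half for exploration

inspiration <- sales[explore_idx, ]            # plot it, mine it, dream freely
rigor       <- sales[-explore_idx, ]           # don't look until the question is fixed

# Explore the inspiration half as much as you like...
plot(inspiration$ad_spend, inspiration$revenue)

# ...then test the hypothesis it suggested, once, on the untouched half.
cor.test(rigor$ad_spend, rigor$revenue, alternative = "greater")
```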

Hugo: And I think similar comparisons can be made to null hypothesis significance testing. Right? So for example, you’ll do exploratory data analysis, start to notice something, and then do a test there; because you were inspired in your null hypothesis and alternative hypothesis by the original data, you may actually be overfitting your model of the world to that dataset.

Cassie: Yeah, and I think that that kind of thing actually happens in practice in the real world, because of the way in which students are taught in class. So it makes sense in class for you to get a look at the conditions of a toy dataset, see what sorts of assumptions might or might not hold in that dataset, and then see what it looks like when you apply a particular method to that dataset. And that poor toy dataset would’ve been torn to shreds with respect to what you could actually reasonably learn from it, by all the thousands of times that the student and professor are torturing this poor little dataset. But that’s okay, because all you’re supposed to do in class is see how the math interacts with the data. But you get used to this idea that you’re first allowed to look and examine this dataset, and then you’re allowed to apply the algorithm or statistical test to it.

Cassie: In real life, though, you end up running into exactly this problem, where you invalidate your own conclusions by going through that process. You really should not use the same dataset for both purposes. You shouldn’t pick your statistical hypothesis and test it right then and there. I mean, think about it like this: Here you are, with x variable and y variable, nice scatter plot. And you take this little dataset, and you plot it and you see sort of the ghost of an upward, up-and-to-the-right lift in this little point cloud that you’ve just plotted. Well, you’ve just seen this, and so you ask yourself, “Well maybe I can put a straight line through it and see whether I statistically significantly have a positive correlation there.” Congratulations, you are going to get the result that, yes, you do statistically significantly have a positive correlation, on account of having been inspired to ask the question in the first place by how these particular points fell onto your scatterplot. The conclusion you make might be entirely unrelated to reality. If you’re inspired to do that by this dataset, go get another dataset from the same process in physical reality, and make sure that your inspiration holds up there.

Cassie: We humans, we do see patterns that are convenient, interesting to whatever we’re interested in, and that might touch reality at no point at all.
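A toy simulation makes the inflation concrete: even when x and y are truly unrelated, only running the test on datasets whose point clouds already drift upward roughly doubles the false-positive rate, from the nominal 5% to about 10%.

```r
# Toy simulation: data-inspired testing under a true null (x and y are unrelated).
# We only run the one-sided test when the sample correlation already looks positive,
# mimicking "I plotted it, it drifts up and to the right, let me test that."
set.seed(1)

p_vals <- replicate(10000, {
  x <- rnorm(30)
  y <- rnorm(30)                              # no real relationship at all
  if (cor(x, y) > 0) {
    cor.test(x, y, alternative = "greater")$p.value
  } else {
    NA                                        # we'd never have bothered testing this one
  }
})

mean(p_vals < 0.05, na.rm = TRUE)   # roughly 0.10 among the tested datasets: double the nominal 5%
```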

Hugo: There are several ways that we think about battling this endemic issue. One that you mentioned, of course, is after noticing things in exploratory analysis of your dataset and coming up with hypotheses, then going and gathering more data, generated by the same processes. Another, of course, is preregistration, committing to your techniques before looking at any data. I was wondering if there are any other ways that you’ve thought about, or that you think may be worth discussing, to help with this challenge.

Cassie: The problem, really, is the psychological element of data analysis. How you’re looking for things. How your mind tricks you as you look for things. And the mathematical techniques that are supposed to help you do things like validate under extreme circumstances with cross validation, those are really easy to break. They don’t actually protect you from going about things the wrong way, psychologically.

Cassie: So what I suggest people do when they start thinking about this is to ask: if you were pitted against a data scientist who absolutely wants to lead you astray and trick you into all kinds of things, when you give them certain constraints on their process, can they still give you a bad result? Can they still mess with you? Can they still trick you? And most of those methods out there, in fact I can’t think of one off the top of my head that wouldn’t be, are susceptible to that kind of mucking about. And unfortunately, as a good data scientist, you are likely to trick yourself in that same manner.

Hugo: That’s very interesting, because I think it hints at the fact that, due to a lot of our cognitive and psychological biases, we don’t necessarily have good techniques. We need to develop processes, but we don’t necessarily have good techniques to deal with this yet.

Cassie: When you talk about preregistration of a study, that is less of a technique and more of a statement that in these data, you’re not going to go and adjust your perspective and question. So you’re saying wherever your hypothesis comes from in advance of gathering and processing these data, it is now fixed. That’s kind of how it should be, anyway. So even when you ask the question and separate the two, you’re actually talking about two aspects of the same thing. If you want to form a hypothesis, go and explore data, but nuke that dataset from orbit if you’re going to go and do some rigorous process where you have the intention of taking yourself seriously. You should have your entire question, all the assumptions, all the code, even, submitted ideally before the data’s collected, but certainly before the data hits that code.
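In code terms, one lightweight way to honor that might look like the sketch below: write and freeze the analysis before the data exist, then let new data hit it exactly once. The function, columns, and file name are all hypothetical.

```r
# Sketch of a pre-specified analysis: question, direction, and alpha are fixed here,
# before any data are collected. All names below are hypothetical.
prereg_analysis <- function(df) {
  # Pre-registered hypothesis: ad_spend is positively correlated with revenue.
  cor.test(df$ad_spend, df$revenue, alternative = "greater", conf.level = 0.95)
}

# Later, when the data actually arrive, they meet the frozen code exactly once:
# new_data <- read.csv("q3_campaign.csv")
# prereg_analysis(new_data)
```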

What is decision intelligence?

Hugo: So I’d like to move on now and talk about decision intelligence. You yourself are Chief Decision Scientist at Google Cloud, and you work in decision intelligence. I was wondering if you could frame for us what decision intelligence actually is, and how it differs from data science as a whole?

Cassie: So I like to think of decision intelligence as data science plus plus, augmented with the social and managerial sciences. And with a focus on solving real business problems and turning information into action. So it’s oriented toward decision making. If we had to start again with designing a discipline like that, we would wanna ask every science out there: what does that science have to say about how we turn information into action? How do we actually do it, given the sort of animal that we are? And if we want to build a reliable system or reliable result for a specific goal, how do we do that in a way that actually reaches that goal rather than takes a nasty detour along the way?

Cassie: So it’s very process-oriented. It’s very decision-oriented. But of course, a large chunk of that is applied data science.

Why data science plus plus?

Hugo: So can you tell me why data science plus plus? Why do we have two pluses there?

Cassie: Ah, that plus plus is as in upgraded to the next thing, I suppose. In the same way that, in programming syntax, you would have i++. That’s just some cuteness there, I suppose. But think of the upgrade like this: a data scientist is taught how to analyze survey data and how to think through a lot of the careful, mathematical stuff, how to deal with what happens if their data are continuous, if they’re categorical. What if the person was using some sliding scale, et cetera? How many questions? How do we correct for this many questions? That sort of thing. But what they’re not taught, not directly in their training, is how do you construct that survey? How do you make sure that that survey minimizes, say, response bias, which is where the user or participant simply lies to you and gives you the wrong answer to your question? And how do you think about what the original purpose of that survey is? Why are we doing this in the first place? And was a survey the right way to go about it? And how did we decide what was worth measuring? Those things are not typically taught to data scientists.
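For the “how do we correct for this many questions?” piece specifically, a standard adjustment in base R might look like the following; the p-values below are made up purely for illustration.

```r
# Illustration: adjusting for many survey questions tested at once.
# One (made-up) p-value per question; p.adjust() is base R.
p_values <- c(0.003, 0.04, 0.049, 0.20, 0.72)

p.adjust(p_values, method = "bonferroni")   # conservative family-wise error control
p.adjust(p_values, method = "BH")           # Benjamini-Hochberg false discovery rate
```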

Cassie: So if data scientists want their work to be useful, then someone, whether it’s themselves or a teammate, who has the skills to think through that stuff, has to be participating.

Hugo: Right. And is it important, or is the best-case scenario, for the data scientist to be involved in every step of the process, from data collection, experimental design, and question design through to actual decision making?

Cassie: That depends on what your budget is. Right? If you have infinite money, perhaps you might be able to hire one of the very, very rare unicorns who’ve actually thought about all of it, and who are skilled in all of it. There are not that many of those people. And if you intend to hire them, you are going to have to pay them. So intending to staff your projects in that manner, well, no wonder you’re going to be complaining about a talent shortage. So the reality of it is that you’re going to have to work with interdisciplinary teams. And also, even if you have someone who gets all of it, in a large scale project, there’s still more work than someone can do with the hours that there are in the day. And so why do you really need all these identical copies of the perfectly knowledgeable worker, if they’re going to have to work on different parts of the process, in any case? So the data scientist upskilling to perfection and then owning all of it, it’s a nice dream, but it doesn’t sound very practical.

Cassie: Instead, I imagine that they would be best suited to the part that they have spent the most time learning. And what they should really worry about more is how the baton is passed from colleagues who are responsible for other parts of the process, and having the skills to check that that part was done well enough for their own work to be worthwhile. Because unfortunately, data science is right in the middle of the process, and it relies on the bookends. If the bookends, like that decision making side and that product leadership side and that social science side, weren’t done correctly, or if downstream you have no way to put this into production reliably, then even if the prototype has beautiful math, it will be too much of a mess to actually use in practice, and there is no point to the data scientist’s work. It all becomes a type three error.

Cassie: So they’re gonna be working with a very interdisciplinary team, probably. And they should focus on the parts where they can have the best impact.

Organizational Models

Hugo: Great. So in terms of decision making, I wanna know about these teams. I love that your response to my previous question was, "The reality of the situation is … " I wanna know more about reality, and I wanna know more about the practical nature of how data scientists and their work are included or embedded in decision making processes. So could you tell me a bit about the most common organizational models for how data scientists are included in this?

Cassie: Yeah, sure. An obvious way to do it is to collect a whole lot of data scientists and put them together into a centralized data science team, and that team tends to be guarded jealously by their data science director, who buffers them from the most egregious type three error requests, and makes sure that the rest of the organization uses them to a good purpose, or at least to the most impactful business purpose. And the junior data scientists in that structure, they don’t need to navigate the politics.

Cassie: There’s another model, which is simply embedding a data scientist in a large engineering team, and telling them to be useful.

Cassie: And there’s the decision support model. This is where you append the data scientist to a leader, and the data scientist helps that leader make decisions.

Cassie: And then, of course, there is the data scientist owning most of the process, especially the decision making. So here, data science is responsible for framing decision context, figuring out which questions are even worth asking, and then also taking ownership of answering them.

Hugo: So we have the pure data science team, the embedded in engineering, decision support, and data scientists as decision maker. And I think-

Cassie: The fifth will be the decision intelligence option, which is none of these.

Hugo: I look forward to discussing that. And it seems like, generally, this order goes from less decision making to more decision making on the part of the data scientist. Is that fair to say?

Cassie: Ah, fair enough.

Hugo: And so what are the pros and cons of being at different points along this spectrum?

Cassie: With a super centralized one, an obvious con is that if you are a small, scrappy organization, forget it. You are not going to be able to have this large data science org. Another con is that they tend to be put towards what the business already knows is worth doing properly and carefully. So in some sense, this is a pro. They’re going to be associated with the most delicate or high-value questions in the business. The con is that there’s less flexibility to help out the broader organization or to seize unusual opportunities, because there is this sort of single point through which all the requests come. And that tends to homogenize the requests a little bit. And that also means that individual data scientists will have very low contact with the decision function. That might be a pro for them. Maybe that’s a stressful thing for a junior data scientist. But it’s very hard for their work and their contribution to have visibility in this way.

Cassie: And all of this is really at the mercy of data science leadership. So if their data science director does not know what they’re doing, we’ll have a problem. And the industry is really suffering from a shortage of data science leaders. There are people who call themselves data science leaders or analytics managers, but these folks might not really know how to play the organizational politics. They might not have a good business sense. Or maybe they are primarily leaders, they have all those … that nose for impact, but they don’t understand how to make a data science team effective. So there can be some problems with that.

Cassie: The embedded-in-engineering model: the pro is that you get to influence engineering. However, you end up doing a variety of tasks, which may or may not have anything to do with data science. Quite often, the engineering team doesn’t really know what sort of animal you are, doesn’t really know what you’re for, doesn’t know if you’re useful. They think of you as a sort of not-so-good programmer. “What’s wrong with you? And what is this stuff that you’re constantly fussing about on whiteboards?” You might not be seen as very useful, and you might find yourself taking on product management tasks that you might not want to do, that you didn’t think you were going to have to do, and that you didn’t train for. So you end up with non-specialist tasks, and there’s no buffer against politics for you, there.

Hugo: And is this something that also happens as we move more towards data scientists who work in decision support and as decision makers, themselves?

Cassie: There’s some element of this, as well. With decision support, the leader, a good leader quickly figures out how to make you useful. So you don’t spend an awful lot of time sort of wafting about, figuring out how to even contribute in the first place. Now, it might be that your best contribution is nothing to do with complex methodology that you spent so many years in grad school studying, and your data science tasks might end up getting diluted with a whole host of other things you might be working on. But your value does tend to get better protected under this setup.

Hugo: And how about for a data scientist as an actual decision maker?

Cassie: So of course, the pro there is that you don’t have this loss in moving between the data science, engineering, and decision functions, because the data scientist owns all of those things. The con is that in order to do it, you need to really get several black belts. And if you don’t have them, you might think you’re being useful, but you might be doing more damage than good. So maybe you think you’re good at understanding business impact, but really, what you’re much better at is doing the math. And you end up pushing the organization down rabbit holes, bad rabbit holes at a much worse rate than they would’ve had without you. So you really do need these multiple black belts, and you need to understand that you have to train for these things separately. Because a standard training program just does not prepare you to be a two-in-one or three-in-one worker.

Cassie: So in practice, this is a rare animal.

Data Scientist as Decision Intelligence Operative

Hugo: And then, of course the fifth model that you mentioned, in passing, that I’d like to focus on now, is data scientist as decision intelligence operative. What happens here?

Cassie: So there will be some allocation of time and human resources toward the analytics or data mining side of data science. And so there will be an ongoing pulse check for the company. So there will just be this broad, light-touch analytics going on all the time, and whoever is best at that model of working under data science will be doing that, and will be partially driven by what leadership wants, but will also be driven by the explore-over-exploit attitude.

Cassie: Then, if something else is going to be requested, there will be certain stages in the project lifecycle that have to be completed in order to get that work. So it’s sort of like a combination of those two models, where you get embedded in engineering, or you get embedded with decision making, but that match happens out of a centralized pool of labor, and it happens based on the project being framed in the required way. So for example, you might have the decision support framing, where you need statistical assistance on a project. In order for that to happen, there have to be certain steps: if you’re going to go the frequentist way, selection of the default action, what the decision maker actually wants to do by default; an understanding of what it takes to convince them; what their metrics are. That sort of thing passes through a social science function there. What population they’re thinking about. What assumptions they’re willing to deal with. That will be someone from social science or from data science working with the decision maker to help them frame their decision context.

Cassie: And then once that is all ready, then you staff folks who can actually do the heavy lifting, the calculations, the data stuff to the project. And of course, you need to staff data engineering to that project as well. So when everyone comes together, they know what they are there for.

Hugo: And this actually speaks to kind of a broader challenge. I mean, we’ve discussed this previously, but this idea that a lot of people want to hire data scientists or do machine learning or state of the art deep learning or AI before they even know what questions they want to answer. Right?

Cassie: Yeah. So what you should do … Here’s my advice for everybody. If you don’t know what you want, think of your data as a big old pile of photographs in the attic. And think of analytics or data mining as the person or function that is going to go to that attic, and their opportunity to actually go to the attic and look at the data is going to be supported by data engineering. They’re gonna go to that attic, they’re going to upend that big box of photographs on the floor. They’re going to look at them, then they are going to summarize what they see there to the people waiting patiently in the house, and ask those folks whether they are considering doing anything more with it. That kind of approach always makes sense. You never know what’s in this pile of photographs. And you never know whether it’s worth doing anything serious with it. But also, because it’s a pile of photographs and you don’t know who took it, and for what purpose, you should never learn anything beyond what is there.

Cassie: So we, as citizens, we already know how to think about a pile of photographs, or a photo you find on the side of the road. The only thing that you can reasonably say about it is, "Hey, this is what is here." Does that inspire me? Does it make me dream? Does it make me want to ask other questions about the world? Sure. Perhaps. But do I take any of that seriously? No, of course not. It’s some photograph, and data science is essentially Photoshop, as we all know, and we don’t know much about how that photo was taken or why. And we can’t really make serious decisions based on it. But taking a look always makes sense. As long as you continue to think about it reasonably, the same way as you would think about those photos. So that’s always good for every project. And if any team, any organization says, "I’d like to know a little more about my data. I’d like to get into mining my data, looking at my data, finding out what’s in there," that is always a good thing.

Cassie: But now, if you don’t actually control the quality of that data, you might end up doing very careful, rigorous things with it. And the photographs were all, I don’t know, blank. Right? There’s no point to it. Or maybe they were all taken in a way that’s entirely unreliable for the question that you want to answer, since you didn’t actually plan the data collection. So if you look at the photographs that I take in my travels, you’ll notice there are all these super touristy landmarks. And yet somehow, I’m the only person pictured in the photograph at that landmark. You can’t conclude anything about how many people go to these landmarks based on my pile of photographs. But you can still take a look, as long as you don’t take them too seriously, and then you might start thinking about the sorts of things you might wanna do with them. And when you start figuring out what you might like to do, then you start planning the entire process as directed towards that goal. And then it makes sense to start thinking about hiring people who can do that extra stuff.

Why do so many organizations fail at using data science properly?

Hugo: So Cassie, given the variety of different models for embedding data scientists in the decision making process, I’m wondering why so many organizations fail at using data science to inform decision making properly and robustly.

Cassie: Well, this comes down to a problem of turning information into action and how decision makers are organized and trained to do that. So it may be a case of the decision maker actually doesn’t know what their own role in the process is, and they don’t know how to properly frame a decision context for a data science project that isn’t simply data mining and analytics, this wide and shallow approach. Without the decision maker taking control of the process, what always makes sense is a nice, shallow, wide data mining approach. Mine everything for inspiration, don’t take yourself too seriously. Don’t spend a whole lot of effort. And if you just stick to this, and you really, truly don’t take yourself more seriously than you’re supposed to, the biggest danger is overspending on personnel. Maybe you’ve ended up hiring a bunch of professors, and now you’ve used them for tasks that they consider to be far too easy given their intense training.

Cassie: But, what tends to happen is that decision makers don’t end up owning the dive into the careful, rigorous stuff properly. So maybe they just hire a bunch of data scientists and they leave them in a room, all on their own. They don’t give them any instructions, and then they are surprised when the only thing that comes out of that room is research white papers. Maybe there’s a case of all those folks pursuing research and rigor for its own sake, because that’s the most comfortable thing, their comfort mode from their research training, and those folks aren’t really qualified to diagnose what is useful for the business, and the decision function just leaves them alone.

Cassie: It might be a case of the entire organization not understanding that there’s a difference between inspiration and rigor, and how to use data for these things, and how much effort each one takes. So another failure is where you get the opposite. You end up using data for inspiration, and then you think that you’ve done something rigorous there, where you really haven’t. And you start taking those results much more seriously than you ought to. And you become overconfident and you run headlong into a wall.

Cassie: Another problem that organizations have is that it’s very convenient to use the outputs of data science work as a way to bludgeon your fellow decision makers in a meeting. So everyone wants to argue and put forth their personal opinion about a thing that cannot be solved with data, really. It has to do maybe with the strategy of the organization, and instead of sticking with humility, owning what you don’t know, and using argument to discuss with your fellow decision makers what should be done next, you bring some inscrutable report that’s covered in equations and you say, "Because my magical data scientists have said, this is the truth." But, you know about statistical inference, you know that the question matters more than the answer does, almost. And if all you do is you bring an answer, well, it may or may not be an answer to the question that is being asked or assumed by everybody else. It’s like that Douglas Adams thing, where you just bring 42 to the meeting, and you say look at all these equations that have gotten us to 42. And because it says 42, I’m correct. Actually, it doesn’t make much sense. Takes a lot of effort. And it wastes a lot of time.

Cassie: And then there’s also an element of misguided and misdirected delegation of decision responsibilities. That’s where you have someone who wants to have decision responsibility and they want decision making to be done rigorously, but they want that for more decisions than they actually have time to deal with. And so they sort of fool themselves into thinking that they can be in that decision maker role without spending the time to actually frame decision context, think through assumptions, work with the data science teams and so forth. And so what ends up happening is that people junior to them end up usurping those roles and make the decision however they make it. Maybe they make it rigorously, maybe they don’t, and then spend all of the data science team’s effort in persuading or convincing this pretend decision maker that it’s actually their idea. Now, there’s an element of fuss there, which could just be avoided if decision responsibility were delegated appropriately. There’s no need for this usurping thing. If you don’t have the time to put in the effort that it takes, then hand over that decision to someone who does have that time, if they’re going to pursue it carefully, rigorously, and in this intense statistical way. Or, say, "We’re going to base it on inspiration. It’s going to be a light case of analytics and plotting, but we’re not going to allow ourselves to become more confident than our methods deserve."

Cassie: So really, most of that disconnect has to do either with the people hired, or with the decision makers themselves not knowing what their own role is, since they are the ones who kick off the whole process. It really does matter that they have the skills to do it.

Data Literacy

Hugo: And is there another disconnect with respect to how many people in an org can speak data, in the sense that data literacy and data fluency aren’t necessarily spread or distributed across organizations? I suppose my question is: In organizations you have seen, how is data literacy spread through the org, and how would you like to see this change?

Cassie: So I’m not gonna speak specifically to Google, here. I’m gonna speak much more generally, about all of us at once.

Cassie: In this world, data literacy is in a sorry state. At least from my perspective, I really wish that we were better at this stuff. And we are surprisingly good at thinking through photograph data. And we’re fairly reasonable, fairly reasonable … We still might do some silly things. But we’re fairly reasonable about that. And we’re fairly reasonable about laughing and saying, "Oh, ha-ha, just because it’s in a book doesn’t mean it’s true." But somehow, when it involves math and data, we start to pronounce data with a capital D, like it is some source of objective truth that’s entirely disconnected from the humans that decided to collect it in the first place, and made decisions over how they were gonna collect it and why. So data literacy is in a sorry state. And what I keep seeing in the world at large is that we lack the humility to say, "Well, if we had no one on the team who could have played this role, who had the skills to take on the decision maker’s part, then we shouldn’t take ourselves too seriously."

Cassie: Instead, what one sees out there in the wild, is that there are these teams that are staffed with very meticulous mathematical minds and unskilled decision makers and that whole team, that whole … So what I see missing in the world at large is teams with the humility to say, "Taking ourselves seriously actually takes work, and it takes skills. And if we lack those skills, we’re not going to be able to do it. The best that we can get from this is the same kind of thing that we get from looking at a pile of photos." And that’s actually still something. It’s amazing that we have the ability to take an SD card, which means nothing to you when it’s laying in the palm of your hand, and you plug it into your computer, and you use some visualization software, I don’t know, Microsoft Paint, or something, and now you can get inspired and see what’s there. That’s an incredibly powerful thing. That’s good for everybody. Everyone should be doing more of that on more data types.

Cassie: But not to assume that just any old data plus very complex mathematics can create something out of nothing. Certainty out of uncertainty, for example. A good decision process where the fundamental initial skills were lacking. I like to say inspiration is cheap, but rigor is expensive. And if you’re not willing to pay up, don’t expect that there is some magical formula that’s going to give it to you. Without that data literacy, please don’t be trying to do very complicated things.

Future of Data Science and Decision Science

Hugo: Right, so this is the present state of data science, decision making, decision intelligence, and data literacy. What does the future of data science and decision science, and their intersection, look like to you?

Cassie: As we start to do more with data, I hope to see the quality of the decision skills that kick off and also guide those projects growing. We can’t really afford to be automating things with data at scale, and have that all based on bad decision making skills. That’s going to be a disaster for the company that’s doing it. So we’ll have to move towards taking those skills more seriously, and not treating it just as something that you have a flair or a talent for. But, even though I imagine that, whether we learn it now or we learn it the hard way later, those skills are going to get better, they don’t have to be carried entirely by the people with decision responsibility currently delegated to them. There is another option.

Cassie: And that other option is to hire a helper who can do that rigorous thinking for you. The part of decision making that is a science can be done by a scientist, helping the decision maker who is owning the part that has to do with intuition and politics and so forth. So you can hire a helper to upgrade your skills if you don’t want to go and learn the stuff yourself. But I do think that on the whole, the future involves us taking that first bit much, much more seriously.

Call to Action

Hugo: So towards that future, my final question is: Do you have a final call to action for our listeners out there?

Cassie: Yeah, two. One is, it’s time to start shifting our focus away from only research, and more to a choice of whether you want to be doing research, or you want to be doing applied stuff. These are both equally valuable, important approaches. One of them is tremendously understaffed right now. I could argue both are, and it’s a really exciting time if you want to get into that sphere, because that’s going to become more and more important as those general purpose techniques that the researchers make become more readily available for application. An analogy that I have for this is, researchers might be folks who build microwaves, new and better microwaves. Whereas applied folks think about innovating in the kitchen and recipes at scale. And I wanna point out that if you want to, say, create McDonald’s, just because you don’t have to wire your own microwave doesn’t mean it’s easy. So this is an exciting time for a new area of investigation and a new discipline.

Cassie: And the other thing I want to leave you with is, the world is generating more and more data. We really owe it to ourselves to make that data useful. Wasting all of our time and resources on type three error after type three error is a very sad state of affairs. So it’s really time for us to take this seriously, because we just have so much of it. Let’s do good, beneficial things with it.

Hugo: I love that, because it really brings this conversation full circle, in terms of, you stated at the start of our conversation that a big part of what you do is to help teams avoid or lower the rate of type three errors in data science. We’ve come full circle, essentially. And that’s one of the calls to action, here. Right? That we all work together and use the data and our modeling techniques and question-asking capabilities to lower type three errors more and more.

Cassie: Yeah, and as I think back on our conversation, I think that a disservice that I did to decision intelligence as a whole, is that I really spoke to you a lot about data scientists. I spoke a lot about decision makers. I vaguely mentioned social scientists. But it’s a much, much more diverse game. And I really left out all the other people that should be involved. The engineers, the reliability folks, the ethicists, the designers. There is a lot of important work to be done in this space by a great variety of people. And I want to ask everyone who’s thinking about sneaking off right now because this doesn’t apply to them, to reconsider. Decision making is important for all of us. And if we are going to do this seriously and at scale, then there is a role for everyone to play if you have anything to say about turning information into action.

Hugo: I couldn’t agree more. And Cassie, it’s been such a pleasure having you on the show.

Cassie: Thank you so much.
