Data Science and Insurance (Transcript)

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers].

Here is a link to the podcast.

Introducing JD Long

Hugo: Hi there JD and welcome to DataFramed.

JD: Hey, Hugo.

Hugo: It’s great to have you on the show. Really excited to have you here to talk about data science, insurance, reinsurance, your work in the R community, the role of empathy in data science, which we’ve had great conversations about before. But before we get into all of that, I’d like to know a bit about you and maybe you can start off by telling us what you’re known for in the data community.

JD: Yeah, it’s interesting Hugo. I’m never completely sure when I meet new people where they may have run into me or run into something that I wrote. I think most common are maybe asking R questions on Stack Overflow, possibly a presentation at a conference, or maybe starting the Chicago R user group or maybe some foolishness on Twitter. It’s really hard to guess.

Hugo: And of course, your role in asking questions on Stack Overflow was quite early on there, right?

JD: Yes. This goes back to when Stack Overflow was really first starting. The story there is kind of interesting. Notoriously in the R community, the R-help email list at the time didn’t suffer fools well, or newbies for that matter. And there was a lot of encouragement to RTFM and that sort of thing. It wasn’t completely newbie friendly and this was about the time I was learning R and I had observed that.

JD: Mike Driscoll was toying with the idea of a beginners’ R mailing list and I contacted Mike. I had been watching how Stack Overflow was being developed. And I said, "Mike, I’m not sure. Maybe what we should do is try to get new users using Stack Overflow," because it looked innovative to me, which now that Stack Overflow has eaten the world, it’s kind of quaint to think about.

JD: But, Stack Overflow had some social science thinking in their design. It had rewards, incentives, thinking about the effect of design or nudges, if you would, that we think about in behavioral economics. How do you nudge people towards good behavior? It seemed like a good environment for newbie type questions. Mike Driscoll and I, and a handful of other people, we got from the Rseek website a whole bunch of questions people had typed into the search engine. This was a website dedicated to R information.

JD: And we tried to figure out, what do we think people were asking? So we created, I don’t know, 100 questions and answers or something and we did a flash mob at the birds of a feather session at OSCON 2009. I was not there, I was living in Chicago at the time. But I participated virtually and we seeded Stack Overflow with a bunch of questions and answers. That kick-started our discussion there and later, another one was done and I continued to be active asking questions.

JD: As of, I haven’t looked recently, but as of a few years ago I was still one of the top question askers on Stack Overflow for the tag of R, for the R programming language.

Hugo: In terms of this initial inspiration of R help didn’t really suffer fools or newbies well, how do you think that has played out over the past several years and where we are now?

JD: It’s a good question. If we look at Stack Overflow, they’ve had their own challenges and growth issues. At first, it was just getting to mass and then they have clearly become the dominant monopoly for information on programming questions and it’s a really good resource. You alluded to this and we’ll talk about it a bit later. There’s this issue of a need for lots of empathy both in question askers and in question answerers. That’s proven to be a challenge in new ways.

JD: I think where we are now is Stack Overflow is a fantastic place to get information, there’s so much information there, most new beginners are able to find their question already asked and not have to ask a question or venture into that. If they do run into one, there’s the opportunity to get people to answer it.

What are you currently doing?

Hugo: I’m really excited to get back to this idea of empathy in data science, and that’s a little teaser for something coming up a bit later. But, tell us a bit about what you’re up to now and what you do currently, JD.

JD: Sure. What pays the bills is I’m VP of risk management for Renaissance Reinsurance, although lawyers prefer that I state that everything I’m sharing here with you is my personal views, of course. I’m not representing the company I work for here. But, I’ve worked for RenaissanceRe for over 9 years, now, and been in different insurance companies and reinsurance companies most of my career.

What is insurance and reinsurance?

Hugo: And can you remind us what insurance is and tell us what reinsurance is?

JD: Sure. Insurance, I think most people are familiar with it because of their house or car, or property insurance that they have. It’s a company that makes a payment on policies when adverse outcomes happen. What reinsurance is, is something that most people don’t face or interact with. And that is that individual insurance companies will buy protection from events or losses that are bigger than that insurance company could handle.

JD: An example of that would be a homeowners’ insurance company in Florida that may have more exposure to hurricanes than they have capital to pay out claims. They would need to buy reinsurance to help make sure that they can make good on their promise to pay future claims.

Hugo: You’re insuring the insurance?

JD: We’re insuring the insurance.

Hugo: You must know what my next question is. Is it insurance all the way down?

JD: It is insurance all the way down. Let me tell you a language story here, real quick. I had done some work with the World Bank and a number of years ago I was in Mongolia. And we were discussing insurance, so we ask … I’m trying to learn what the word was, and they say, "Well, the word is daatgal." So, we’re like, "Oh, okay. Well, what’s reinsurance?" And they’re like, "Well, that’s daatgal-daatgal." We’re like, "Okay. We do this thing called retro where reinsurance companies trade with other reinsurance companies. Is that daatgal, daatgal, daatgal?" And they assured us that it was not, but it seemed intuitive to me.

Hugo: Yeah.

JD: Yes, it is insurance all the way down.

Hugo: I suppose that’s like asking us is it re-reinsurance, whereas you call it retro, right?

JD: Yeah, we call it retro, but that would be re-reinsurance. And we stopped counting the re’s after we start trading it around between the reinsurers.

How did you get into data science?

Hugo: Yeah. Fantastic, I’m really excited about talking about insurance and reinsurance particularly framed by the new emergence of data science because insurance and actuarial sciences have been around for a lot longer than data science. So, I’m interested in this. But, before we get there, I just want to hear a bit about your story and how you got into data and data science originally.

JD: Yeah, like most insurance data scientists, I’m an agricultural economist. That obviously is not intuitive at all. But, I came into agricultural economics in the 90s and when I graduated with an undergrad in, I guess it was about ’96, I was starting graduate school. And I remember talking to my major professor about where the PhD graduates that year were going and one of the PhD graduates was going to American Express.

JD: I remember being baffled, she’s a PhD in agricultural economics, what’s she going to American Express for? He explained that American Express recruited explicitly agricultural economists because we have a very applied background, not pure theory, actually had experience working with data, they tended to have coding experience. Now, this was ’96, so that’s mostly SAS in the university I went to. Let’s put this in perspective: CRAN, the R network started in 1997, so this was the year before CRAN even existed. And Python was not available on DOS and Windows until ’94, so this was just a couple of years after Python was available on Windows platforms.

JD: We were in SAS, we were using mainframes and UNIX machines. And American Express was hiring and recruiting agricultural economists because they’d had some experience coding with this messy, real-world data. And within agricultural economics, I got exposure to crop insurance, model building, lots of regression analysis, we call it econometrics. And building those models at what, at the time, seemed like a degree of scale, it seems trivial in retrospect, but that’s how I got in.

JD: And I like to tell the story under the pretext of agricultural economics is the OG of data science because we’ve been doing these sort of combining programming, and domain expertise, and statistics for a long time. And later, the data science name caught up, but I’ve been doing that same sort of thing for a number of years.

Hugo: And I suppose that also working with serious real data sets, as well, right? And messy data?

JD: Yeah, absolutely. We were working with actual field data, literally field data, and sometimes long historical sets. And it would require cleaning of outliers and a lot of the same sort of things that we talk about now: taking trend out, looking at analysis of time series and cyclicality, and removing that before you start building a model to explain other things.

JD: A lot of these methodologies, we’ve been using in agricultural economics for a number of years. My experience with applying that to agricultural insurance was my gateway to entering into financial risk and specifically insurance and reinsurance.

What challenges can data science help solve in insurance and reinsurance?

Hugo: Cool. What are the biggest challenges in insurance and reinsurance that you think data science can have a huge impact on or is currently having a huge impact on?

JD: Yeah, my view here is a little bit skewed because when we think about the problem space, there’s a bunch of things going on in marketing, the marketing of insurance where you see ads online, in the claims process of how claim payments are made and how quickly those can be made by using data analysis, and operations inside of companies. There’s big gains being made on all those.

JD: I don’t work particularly in those three areas, I work more in what we would call underwriting and risk. The distinction there, underwriting is the decision about taking an individual risk. Now, at an insurance level that might be whether or not a company writes a given policy to a person or a company. In reinsurance, it’s more understanding the risk of a deal that may have hundreds or thousands of policies underneath it.

JD: And then risk, or risk management, is broadly thinking about how do all of those risks aggregate up inside of a reinsurance company. Some will be correlated, some will be idiosyncratic, some may be anti-correlated. And then how do you think about rolling up that risk inside of the reinsurance company and being confident that you have the right amount of capital to hold behind that, but not too much capital.

Hugo: Right, and I suppose essentially that’s a huge task to, as you say, roll them up and aggregate it to make one final decision based on all the data and all the modeling coming in, right?

JD: Yeah, exactly. A lot of little decisions get made and the way that, that feeds back and shapes the portfolio, at least in the companies I’ve worked at, is some feedback mechanism for a risk adjusted return on capital. When an individual deal is looked at, it’s evaluated relative to the portfolio as a whole and there’s some capital charge. That deal needs to be profitable in excess of the capital that’s required to hold behind the deal. So, that’s how we think about feeding back from corporate risk to the deal making side.
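
The feedback loop JD describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the 99th-percentile tail measure, the 10% hurdle rate, the simulated losses, and the names `tail_loss` and `evaluate_deal` are all hypothetical and do not come from any real risk system.

```python
import random

def tail_loss(losses, q=0.99):
    """Empirical q-quantile of simulated losses, a stand-in for required capital."""
    ordered = sorted(losses)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def evaluate_deal(portfolio_losses, deal_losses, deal_premium, hurdle_rate=0.10):
    """Compare a deal's expected profit with a charge on the extra capital it needs."""
    # Add the deal to the book scenario by scenario, then see how the tail moves.
    combined = [p + d for p, d in zip(portfolio_losses, deal_losses)]
    marginal_capital = tail_loss(combined) - tail_loss(portfolio_losses)
    expected_profit = deal_premium - sum(deal_losses) / len(deal_losses)
    capital_charge = hurdle_rate * marginal_capital  # assumed cost of holding capital
    return {
        "marginal_capital": marginal_capital,
        "expected_profit": expected_profit,
        "capital_charge": capital_charge,
        "clears_hurdle": expected_profit > capital_charge,
    }

rng = random.Random(1)
n_sims = 20_000
# Made-up simulated annual losses for the existing book.
portfolio = [rng.lognormvariate(4.0, 0.5) for _ in range(n_sims)]
# A hypothetical deal whose losses move in lockstep with the existing book.
correlated_deal = [0.05 * loss for loss in portfolio]

result = evaluate_deal(portfolio, correlated_deal, deal_premium=5.0)
print(result)
```

Because the deal is measured against the whole book rather than on its own, the same contract can look cheap or expensive in capital terms depending on how it lines up with what the portfolio already holds.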

What industries do you insure?

Hugo: Great. What industries do you insure or work in?

JD: By the time you’re aggregating at the reinsurance level, it’s very global. And it’s every industry because we’re trying to spread risk across, really, the whole globe and all industries so that we aren’t concentrated in one specific area. Now, if you think about the space for data science in insurance and reinsurance, marketing, ops, claims that I mentioned earlier are not … Well, maybe claims is, but definitely marketing and ops are not super reinsurance insurance specific. Those are very similar in lots of other transactional companies.

JD: But, the risk in underwriting is fairly domain knowledge intense. The domain knowledge there is really more about the deal understanding, the type of risk, how those risks fit together into a portfolio. For me, I work both in the micro and macro, so the micro would be looking at individual deals and then the macro is this corporate risk management component. I have an unusual job in that I do a little bit of both.

Crop Insurance Modeling

Hugo: So, JD, why don’t you tell me a bit about the micro scale and then we can move on to the macro. For example, maybe you can tell me a bit about how the crop insurance modeling works.

JD: Yeah, sure, Hugo. If we look at crop insurance in the US, which is one of the most mature crop insurance markets, the current products that dominate that market have only been around since 1996. The historic record isn’t very long for that product, and so we have to say, "Well, what data do we have about crop insurance?" And what we have is a history of agricultural yields that goes back in a long time series. We have a history of agricultural commodity prices, and we have a history of weather.

JD: One of the more data science-y type activities that I’ve engaged in is trying to take the data we do have and say, "Okay, how might the current portfolio of crop insurance have behaved all these years in the past for which we do have data," right? So, this is a classic modeling exercise where we’re taking something we know and we’re trying to project that into something we don’t know, and build up a historical understanding.

JD: Once we do that, we can do things like, "Well, let’s stochastically generate a whole bunch of different yield and price outcomes, and see if we can build up a model of a full stochastic distribution of how this crop insurance industry in a given country might work." That was one of my more interesting jobs for a number of years was building that model. That’s where we move from data analytics into something more data science-y. We’re building models to understand something we couldn’t understand otherwise.

Hugo: That’s really interesting. I’m going to stop you there for a second because you used a couple of terms that I’m very interested in. You talked about stochastically generating and then you talked about a distribution. I’m going to try to tease that apart and let me know where I’m getting this incorrect.

Hugo: Let’s say we’re trying to predict something concerning a market. You can stochastically generate and what that essentially means, to my understanding, is you can simulate the behavior and stochastic means there’s some sort of variation, right? Each time you simulate it, you’ll get a slightly different result. And what you actually get in the end is a lot of different results and you may get a thousand or 10,000, or 100,000 that give you some idea of the distribution of the possibilities of the market. Is that what you’re talking about?

JD: That’s exactly right, Hugo. We do very little predicting of what I think next year is going to happen. What we try to do is say, "What is the distribution of the potential outcomes for next year?" And, "What’s the shape of that distribution." We might ask questions like, "What’s the one in 1000 worst case scenario?" So, it doesn’t mean we’re thinking 1000 years into the future at all, it means this is about next year. But, it’s the improbable way, but still possible, that next year might turn out.
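
As a rough illustration of the exchange above, the sketch below simulates many possible versions of next year and reads tail outcomes off the resulting distribution. The lognormal loss model, its parameters, and the function names are assumptions chosen only for the example; they are not taken from JD's models.

```python
import random
import statistics

def simulate_annual_losses(n_sims, mu=2.0, sigma=0.8, seed=42):
    """Draw n_sims hypothetical outcomes for next year's loss.

    The lognormal shape and the mu/sigma values are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_sims)]

def return_period_loss(losses, years):
    """Loss that is exceeded in roughly 1 out of `years` simulated outcomes."""
    ordered = sorted(losses)
    # Position of the (1 - 1/years) empirical quantile.
    idx = min(len(ordered) - 1, int(len(ordered) * (1 - 1 / years)))
    return ordered[idx]

losses = simulate_annual_losses(100_000)
mean_loss = statistics.fmean(losses)
one_in_100 = return_period_loss(losses, 100)
one_in_1000 = return_period_loss(losses, 1000)

print(f"mean outcome:      {mean_loss:.1f}")
print(f"1-in-100 outcome:  {one_in_100:.1f}")
print(f"1-in-1000 outcome: {one_in_1000:.1f}")
```

Every simulated value is a candidate for the same single year, so the 1-in-1000 figure is a statement about next year's tail, not about a year a millennium away.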

Hugo: This is awesome and I actually think a lot of industries and verticals, and basic science research, that are adopting data science and data science techniques as methodologies, could learn a lot from this conversation because there are still a lot of managers who want point estimates, right? They want the average and then make a business decision based around that, maybe some error bars.

Hugo: But, the fact that you’re doing these mass simulations and getting out entire distributions of predictions, I think, is a very robust technique. As you say, you can actually see 1% of the time, we see something crazy that we actually do not want to happen at all.

JD: Yeah, that’s exactly right. There’s a really good introduction and I’ll make sure you have it to put in the show notes, Hugo. There’s a book called ‘How to Measure Anything’ that-

Hugo: That’s a great name, by the way.

JD: Isn’t it a great name? They have an introduction to this and they take it from the idea of, "Well, initially you’re estimating what do we think, next year, is going to happen." Then you start to say, "Well, okay, what’s next year," and then a high estimate and a low estimate. So, you’re beginning to think about a range around next year’s outcome. And from there we can start thinking, "Okay, let’s increase the resolution. What’s an extreme event that could still happen?" And you could begin to think about creating some sort of error bars around your estimate. And then ultimately move on to this idea of a full stochastic simulation where you have thousands of possible outcomes.

Hugo: I want to tease apart something, now, that we’ve been using the word risk, which we all have an intuition of what risk means. But, there’s something … There’s an idea that’s coupled to this and I want to try to decouple it in some sense, which is uncertainty in the sense that once you do these predictive simulations and get out the distribution, you may not know what will actually happen. I’m wondering, is that uncertainty or risk? And how do you think about this in insurance?

JD: Yeah, that’s a really good point. I would generally think about the outcome of these models that I’m talking about as a risk and then uncertainty as a separate thing. Let me tease those apart, because these get used in the vernacular interchangeably. In 1921, in a book called ‘Risk, Uncertainty and Profit,’ the economist Frank Knight, who’s of the Chicago School, a University of Chicago economist, presented this idea of risk versus uncertainty and the way he defined it is risk is when you understand the underlying distribution, but you don’t know what outcome you’re going to get.

JD: It’s like the classical urn full of marbles, of white and black marbles, you don’t know which one you’re going to draw out, but maybe you’ve been told ahead of time what’s the ratio of white marbles to black marbles. That would be-

Hugo: Yeah, and also another example is if you flip a coin 10 times, you can literally write the probability of seeing 10 heads, or seeing 9 heads, or seeing 8 heads, or seeing 7. So, you know the entire distribution of possibilities, right?

JD: That’s exactly right. And then we have other processes where we know the underlying distribution is a Gaussian distribution, so the outcome’s going to follow the curve.
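
Hugo's coin example is a case of risk in Knight's sense because the whole distribution can be written down in advance. A small sketch, assuming a fair coin (the helper name `binomial_pmf` is just for illustration):

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 10
distribution = {k: binomial_pmf(k, n_flips) for k in range(n_flips + 1)}

for k, prob in distribution.items():
    print(f"{k:2d} heads: {prob:.4f}")
```

The probabilities sum to one and nothing is left unquantified. That is exactly what stops being true once there is doubt about whether the coin is fair.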

Hugo: And in the real world, do you have risk as opposed to uncertainty? Because these are toy examples.

JD: We have both. Let me just define uncertainty real quick. Uncertainty is the piece where we don’t know. We know it’s not deterministic, we know it can have wild outcomes or some other outcome than what we know about, but we can’t put a distribution around it. That’s uncertainty.

JD: Let’s go back to the real world, if we’re doing things like flipping a coin, there is some uncertainty that maybe we have a loaded coin. We don’t know how to … What’s the probability this coin being loaded given no other information, just we have it in our hand. What we don’t know, it’s an uncertainty, but we don’t know what the probability is, it’s probably pretty low, but you don’t know.

JD: A better example, like from the insurance world, might be auto insurance is a pretty good example of a scenario or a type of product where it’s mostly risk and less uncertainty. When a product has been around for a long time, people behave in relatively predictable patterns. And so, most of that activity follows a well-behaved historic distribution, there’s a little bit of uncertainty, some wild things happen and tail events happen that weren’t in your model distribution, but it’s pretty well-behaved.

JD: Now, on the flip side would be, say, terrorism insurance or just think of terror events. The underlying distribution, we don’t really know what it is. We know what the historic distribution of terror events looks like, we can make a catalog of those. But, there’s no reason to believe that world events are such that the next 12 months is a random draw from a historically stable distribution. We expect the distribution is probably not stable, it’s probably a function of changing geopolitics around the world, and a reaction to events that are going on real time. There’s a component of risk, but there’s also a much larger component of uncertainty. Does that help?

Hugo: That makes perfect sense. And it has led me down a variety of rabbit holes. My first question is, do governments or corporations take out terrorism insurance?

JD: They do, they do. There are a number of just property policies that would cover in the event of terrorism, and there are, of course, policies that explicitly exclude acts of terrorism. If I recall, I believe in certain countries it’s normal for crop insurance policies to exclude terrorism, for example.

Hugo: So, JD, we were led along this path talking about the micro level you work in, in terms of crop insurance modeling and risk representation of single deals. Can you tell us a bit about the macro levels that you work at in thinking about insurance and reinsurance?

JD: Sure, Hugo. So, we think about a reinsurance company that has a number of risks in many different lines of insurance. I mentioned earlier that some of those risks are correlated and the correlation can be caused from underlying physical relationships. All of the homeowners’ insurance in New York City should be correlated in their outcome because if a large event like a Hurricane Sandy hits New York, the impact is going to affect all of the insurance companies that write business in New York. That’s a physical process that causes correlation.

JD: Or, maybe on a casualty program, there’s an underlying risk that multiple companies have insurance for and when that turns out to be a problem and there is a casualty claim, it impacts multiple companies. Other times they have connection because, maybe there’s a risk like changing legal framework causes all claims to increase 15% on property claims. There’s these relationships between the policies that we have to understand as we aggregate the risk together and think about combined risk inside of a reinsurance company. Sometimes that involves building the physical models like the hurricane and earthquake models, where the policies are analyzed based on spatially where on the map the risk is. And then understanding the exposure across different programs for risk in a specific geographical location.

JD: And other times, it may be introduced with more traditional modeling methods where the correlation is added after the modeling through something like a copula method. Two distributions can be brought together and a joint relationship be added using a copula. Now, it’s always important to keep in mind that we add correlations at the end, sometimes, in our modeling. But, correlation is always and everywhere an artifact of some other process and when we do something like a copula, we’re just trying to make sure our model data reflects what should be there already, but we don’t have any other method for putting it in place.
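
To make the copula idea concrete, here is a minimal sketch that joins two separately modeled deals with a Gaussian copula. The choice of a Gaussian copula, the exponential and Pareto marginals, the parameter values, and every function name are assumptions for illustration only; the conversation does not say which copula or marginals are used in practice.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_pairs(n, rho, seed=0):
    """Return n pairs of uniforms whose dependence comes from a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Second standard normal with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def exponential_inv_cdf(u, mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0)
    return -mean * math.log(1.0 - u)

def pareto_inv_cdf(u, scale, shape):
    """Inverse CDF of a (heavy-tailed) Pareto distribution."""
    u = min(u, 1.0 - 1e-12)  # guard against division by zero
    return scale / (1.0 - u) ** (1.0 / shape)

def simulate_portfolio(n, rho, seed=0):
    """Total loss for two hypothetical deals joined by a Gaussian copula."""
    totals = []
    for u1, u2 in gaussian_copula_pairs(n, rho, seed):
        deal_a = exponential_inv_cdf(u1, mean=10.0)
        deal_b = pareto_inv_cdf(u2, scale=5.0, shape=3.0)
        totals.append(deal_a + deal_b)
    return totals

def tail_quantile(values, q):
    """Empirical q-quantile of a list of simulated values."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

independent = simulate_portfolio(50_000, rho=0.0)
correlated = simulate_portfolio(50_000, rho=0.7)

print(f"99th percentile, rho=0.0: {tail_quantile(independent, 0.99):.1f}")
print(f"99th percentile, rho=0.7: {tail_quantile(correlated, 0.99):.1f}")
```

Each deal keeps its own marginal distribution in both runs, so the average total is about the same. Only the joint behavior changes, and that shows up in the tail of the aggregate.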

What techniques and tools do you use?

Hugo: Okay, great. You’ve given me some insight as to the types of tools and techniques that you use, but maybe you could speak a bit more to what data science looks like in insurance and reinsurance. And what I mean by that is, in tech, we know that most of our data will be in our SQL database, so we’ll query our SQL database and then use R or Python to do a bunch of exploratory data analysis and visualization dashboards. If you want to do productionized machine learning, we’ll do that in Python.

Hugo: I’m just wondering what the techniques and tools that you use on a daily basis are when doing this type of modeling and data science?

JD: Sure, at the initial deal level in a reinsurance company, a bunch of the analysis looks like the historical data science-y analysis you just described. Only, the person doing the analysis may self-identify as a catastrophe analyst, a CAT analyst, or they may identify as an actuary. But, what they’re doing is analyzing data that they receive from someone, maybe combining it with industry data, trying to understand trends that are in the data in order to create this stochastic representation of a single deal.

JD: That may follow a similar pattern to other data science-y modeling with the idea of what’s coming out the other end is a mean expectation, but also a distribution around it for the outcome of a deal. They’ll, then, put that into a risk system and, I think, most companies use a system of some kind, that then is a framework where the whole book can be rolled up and understood in a meaningful way.

JD: And there’s a million different approaches for doing that. I’ve traditionally worked with an in-house tool and it handles making sure that deals that are connected because of spatial exposure get connected that way in the final modeling, and that deals that are not are at least correctly correlated with the other deals in their business class, so that these relationships are tied together and reflected so we can get an aggregate distribution that’s a reasonable view of these individual marginal distributions. Marginal, here, meaning individual deals in a portfolio, so that we can roll those up into one aggregate deal and understand its characteristics.

Hugo: Fantastic. I’d like to step back a bit, now, and think about where insurance has come from, the actuarial sciences and, now, the impact of data science on the discipline as a whole. So, could you give us a brief history of all of these disciplines and how they intertwine?

JD: You bet, Hugo. Let’s go back to 3000 B.C-

Hugo: I’d love to.

JD: The Babylonians … This was the earliest record, I could find, of a disaster contingency event. The Babylonians developed a system of loans where a person could get a loan for building a ship and they might not have to repay that loan if a certain type of loss event happened because of certain types of accidents. That’s kind of like insurance, right? Kind of like a builders loan.

JD: The idea has long been around. Now, one of the things I find interesting is Edmond Halley of Halley’s comet fame, created one of the first modern style mortality tables and that was in 1693. Around about the same time, but completely disconnected from that, the Lloyd’s coffee house, which was a place for sailors to hang out and ship owners to talk about what’s coming into London on ships, the Lloyd’s coffee house emerged as a place to drink coffee and get shipping news. And also to buy shipping insurance, and that later became Lloyd’s of London, which we’ve all heard of. It may not be well understood outside of the insurance community, but Lloyd’s is not an actual company that takes risks, it’s more of a marketplace. It’s like the Chicago Mercantile Exchange of risk. Lots of individual companies including the one I work for take risk at Lloyd’s of London.

JD: That was the late 1600s and then computational tools and statistical methodologies developed alongside the actuarial process and became part of that process. But an interesting thing happened in 1992: Hurricane Andrew ripped across Florida and then recharged in the Gulf of Mexico and plowed into Louisiana and Alabama. It was a huge catastrophe for the global reinsurance market because prior to ’92, hurricane reinsurance was kind of a gentleman’s game and it wasn’t really a quantitative, well-understood risk business.

JD: And Andrew caused many reinsurance bankruptcies and it was a big contraction of the market and there just wasn’t a lot of capacity for reinsurance because of that event. That was filled by the crop of reinsurers that sprouted up on the island of Bermuda. That market became a much more quantitative analysis market that looked more like the quantitative finance world. That has driven the way reinsurance around the globe has been modeled and approached. That was really the turning point of reinsurance becoming much more quantitative and also how I ended up living on Bermuda for 4 years.

Hugo: That’s incredible. Firstly, why Bermuda?

JD: Well, the history there is it’s got reasonable proximity to the United States. But, it’s a favorable tax jurisdiction for endeavors requiring lots of capital and not a lot of people. So, for the reinsurance companies based there, it’s not a tax loophole type of jurisdiction, it’s been a place where there’s no corporate income tax. But, it’s also well regulated, so it ends up being regulated at a level that’s consistent with mainland Europe. But, with not very heavy corporate tax structures.

JD: So, activities like reinsurance, which has periods of high returns followed by a year or two with negative returns, it’s pretty tax efficient to do those in Bermuda. That’s why it cropped up in 1993 as a jurisdiction for global reinsurance and especially US catastrophe reinsurance.

Hugo: Something we’ve mentioned several times is this idea of building models. And you’ve said that building models is really key to your work. Can you just say a bit about what model building actually means to you and what it entails?

JD: Sure, Hugo. When I think about model building in the context of insurance and reinsurance, what I’m really always thinking about is this process we’ve discussed a few times, where it’s coming up with a distribution of outcomes that reflects the possible outcomes for a given financial contract. That’s the simplest way I can think to describe it. We might use dozens and dozens of different methods, there’s different approaches to try to get our arms around the risk and uncertainty of a financial deal. And depending on what data is available, we might use complicated regression analysis, we might use a Bayesian method, we might even use a machine learning deep neural network of some kind.

JD: But, ultimately, what we’re trying to say is we have a potential contract we may enter and we’re trying to understand all the possible outcomes to make sure that the reinsurance company is being compensated for the risk that they’re taking on as part of this contract. So, the "model" could be lots of things that possibly are very complicated or it may be there’s very little data and we’re going to look at the past 15 years of experience, and we’re going to fit a distribution to that because that’s all the information that we have. And then we’re going to put a premium on there, a little extra load for this uncertainty because we can’t fully quantify the risk.

JD: That’s what I mean when I think about modeling in this context.
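
The simplest case JD mentions, fitting a distribution to about 15 years of experience and then adding a load for uncertainty, might look something like the sketch below. The loss ratios are made-up numbers, and the lognormal fit and the 15% multiplicative load are illustrative assumptions, not a description of how any company actually prices.

```python
import math
from statistics import NormalDist, fmean, stdev

# Fifteen years of hypothetical loss ratios (losses / premium); made-up data.
loss_ratios = [0.42, 0.55, 0.38, 0.61, 0.47, 1.35, 0.52, 0.44,
               0.58, 0.40, 0.73, 0.49, 0.95, 0.46, 0.51]

# Fit a lognormal by taking the mean and standard deviation of the logs.
logs = [math.log(x) for x in loss_ratios]
mu, sigma = fmean(logs), stdev(logs)

# Summaries of the fitted distribution.
expected_loss_ratio = math.exp(mu + sigma**2 / 2)            # lognormal mean
one_in_100 = math.exp(NormalDist(mu, sigma).inv_cdf(0.99))   # 99th percentile

# Assumed loading for what 15 data points cannot tell us.
uncertainty_load = 0.15
loaded_loss_ratio = expected_loss_ratio * (1 + uncertainty_load)

print(f"fitted mean loss ratio:  {expected_loss_ratio:.3f}")
print(f"fitted 1-in-100 ratio:   {one_in_100:.3f}")
print(f"loaded loss ratio:       {loaded_loss_ratio:.3f}")
```

The fitted curve covers the part that can be treated as risk. The load is a judgment call standing in for the uncertainty that so short a record cannot quantify.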

Hugo: Okay, great. I want to find out a bit more about how data science has impacted the insurance and reinsurance world. And I actually … The avenue that I want to approach it from is, there’s a great quote by Robin Wigglesworth from the Financial Times, who said, "Traders used to be first class citizens of the financial world, but that’s not true anymore. Technologists are the priority, now." I would actually … That was in 2015, I would say that data scientists are first class citizens of the financial world.

Hugo: In terms of insurance and reinsurance, actuaries have always been the first class citizens of the insurance world. How is this relationship now, with the emergence of data science working there?

JD: Well, you know Hugo, there’s been a little fluke historically in actuarial science. The historical fluke that resulted in me really ending up in this industry is that the catastrophic events, the catastrophe modeling, did not exactly fit in the historic actuarial methods very well because sometimes in catastrophe insurance, we’re pricing and modeling risk that we’ve never observed historically. So, maybe we’re looking at a reinsurance deal that would be impacted by a worse hurricane than we have ever experienced or a hurricane season with more hurricanes than we’ve ever experienced.

JD: And if you look at an actuarial method that’s based on looking at historical data and making corrections for sample size and evaluating that using heuristics that expect large sample size, it doesn’t work very effectively for these extreme tail events. My work in crop insurance, it was really around catastrophe work and, similarly, property CAT work, whether it’s hurricanes or earthquakes, often deals with these risks that are so far out in the tail, we haven’t experienced them.

JD: So, it gave a lot of opportunity for those of us with a quantitative background, maybe systems modeling, and historically engineers who do engineering modeling, to work in the space alongside actuaries. And, what we’re seeing is a very fruitful environment, in my opinion, in insurance and reinsurance where there’s, hopefully, a collaborative work between actuaries who have a tremendous set of tools, experience, and knowledge that’s specific to insurance. But, keep in mind a lot of it is heuristics that make certain assumptions. And then we’ve got data scientists, and financial engineers, and systems modelers who are used to modeling slightly different things, making different assumptions, often with different constraints. And if we can get those two groups working together, we can make even more effective models.

JD: And my experience, relatively recently, I was just … Last month, I spoke at an actuarial conference and one of the sessions I sat in after I presented, I was really impressed because one of the actuaries shared an actuarial methodology. And then after he shared it, he said, "Now, here’s a more data science-y way of doing this the way our data scientist friends might approach it."

JD: And he shared the exact same example, but worked through using some type of GLM. And he showed how the answers were similar, but where they might differ. And I thought, "That’s the future inside of insurance companies." If we can get the actuaries and the data scientists talking together about what are the strengths and weaknesses of our different methodologies and get the deep business understanding from the actuaries, and maybe some of the methodologies experience of the data scientists deployed at the same problems, I think that would be tremendously powerful.

JD: That falls apart only if one side or the other isn’t in a very collaborative place, so I’m a huge proponent of collaborative data science.

Hugo: That’s fantastic, and I think it actually provides a wonderful segue into what we’ve promised the eager listener previously because a key component of these types of collaborations, particularly with such strong minded communities such as actuaries and data scientists, a key component of that collaboration, a requirement, a necessity, in fact, is empathy.

JD: Yeah. It sure is, Hugo.

Hugo: You gave a wonderful talk that I saw when we first met IRL. We had corresponded before that, but when we met at rstudio::conf in San Diego earlier this year, you gave a wonderful talk called ‘Empathy in data science.’ And I’d just love to hear your take, once again, on what the role of empathy in data science is, at the moment, in your mind.

JD: Yeah, Hugo. I feel like … I don’t think empathy is a panacea for all of our problems. However, I do observe on a very regular basis situations that really need empathy in order to bridge two people who are talking past each other, or a person who’s making what is obvious to other people, but not to them, kind of a boneheaded mistake because they aren’t thinking about who’s consuming what they’re producing.

JD: My example I alluded to earlier was on stack overflow. I watch people ask questions on a regular basis and they clearly are not thinking about the person who’s receiving the question, who’s going to answer their question, and making it easy for the question answerer. Because if the asker was making it easy for the question answerer, they would make an example that had code that the answerer could copy and paste into their environment, execute it, and observe what the question asker is seeing, and would immediately be able to help.

JD: But, instead, the asker may put incomplete code or maybe not even syntactically correct code and the question is, "I’m trying to do something and it doesn’t work. What’s wrong?" And the answerer has no way to know. If we can bridge that by helping an asker in that environment think to themselves, "What’s it like to be on the other side of this question? What’s it like to be the other person and how can I make their life easier and basically help them help me?" They’ll find they’re much more successful at what they’re after.

JD: It’s the same inside of our workplace, right? If we’re doing analysis, I have to ask myself. Maybe I’m doing analysis that’s going to equip an underwriter to negotiate a deal. I have to think, "What information does that underwriter need to be well equipped to negotiate this deal?" And that’s going to drive my thinking of how I serve that person with my analysis.

Role of empathy in data science

Hugo: So, JD, tell me about the role of empathy in data science.

JD: Sure, Hugo. I think I’ve just observed so many situations over the years where I felt there were two parties engaged in a conversation who were talking past each other and didn’t quite appreciate where the other person’s understanding was or what they were concerned about. I’m not so pollyannaish as to assume that empathy is the solution to all our problems, but we have a lot of business problems and data problems that could be greatly helped by a dose of empathy.

JD: And a good example is one I alluded to with observing questions and answers on stack overflow, I observed any number of situations where the question asker clearly has not thought about the situation the answerer is going to be in. Because if the asker had, they might have put an example that could be copied and pasted by the answerer into their environment, executed, and then the answerer could see exactly what the problem is and answer the question.

JD: But, instead, we get, often, conceptual ideas, "I’m trying to do this thing, here’s a little piece of code, you can’t actually run it because you don’t have my data, but I’m not getting the answer I would expect. Help me fix it." And that’s really hard for an answerer to answer. This got me thinking about empathizing with the other person. And early on as stack overflow grew, at first, I felt like askers needed more empathy and, at times, I feel, now, like the answerers could use some empathy, as well.

JD: But, the same is true in our business environment. If I’m working with an underwriter to do the analysis for a deal, I need to be thinking about what does this person need when they go to negotiate this deal. What analysis do I need to have done that they can have in front of them to make them more effective, right? This isn’t about me and my understanding, I’m not doing this as a science fair exercise so that I’m smarter about risk. I’m doing it towards a business purpose of providing insight for a negotiation.

JD: So, that’s a useful mindset and I feel like it’s one where we need to explicitly teach. A lot of people it will resonate with immediately. And others, it may need some more work to help them build this empathy muscle, if you will, of learning to think about who’s reading my analysis, what are they doing with it, maybe who’s my user. I think there’s lots of ways we can build that and it’s an important part of data science in my opinion.

Hugo: Yeah, I couldn’t agree more. I will say, though, that to approximate some sort of truly empathic behavior or mind-frame, that can be really energetically consuming. Are there any ways we can approximate it or hack empathy?

JD: Absolutely. My favorite example of this, actually, comes from the agile development methodology, which is more of a computer programming thing than a specific data science-y thing. In agile, they do this method where they do user stories, so it’s "Hugo is a data scientist who’s trying to understand X. He needs this tool to do Y so that he can understand X." What’s so great about that, in my opinion, is it forces the developer who’s reading it, or the data scientist who’s reading it, to think about what it’s like to be Hugo. It’s an empathy hack.

JD: Now, none of the agile methodologies that I’ve ever seen use the word empathy, it’s just not mentioned. But, that’s what we do with user stories and I’ve had situations inside my company where a developer would be developing something and I’m like, "That’s a great idea. But, I know your user personally. I have lunch with him and they’re not going to think that’s near as great because you’re building the tool that you want, not the tool they want." Think about your end user, or if you’re a data scientist producing an analytic or a model outcome, think about who’s consuming it.

JD: We can give lots of little nudges whether it’s something explicitly like an empathy hack from agile, the user story, or sometimes it’s just reminding someone, "Hey, remember, your person consuming this has a name. It’s Bob and we know that Bob doesn’t think that way."

Hugo: Great. And we have learner profiles at DataCamp, which is similar with respect to what our learners’ backgrounds will be, along with how advanced they are as aspiring data scientists. And whenever we build courses we very much think about which one of our learner profiles, or set of learner profiles, these courses are aimed at.

JD: That’s super, Hugo. The podcast 99% Invisible had a great episode on designing for average and how, basically, if you design for average, you design for no one. We’ll make sure that’s in the show notes. But, I think it’s such a fantastic idea to actually give your target audience a name so that the people working on products for them can relate to them. That’s a super idea.

Hugo: That’s great. And this is actually … We had a segment on the podcast with Mike Betancourt, who is a core developer and maintainer of Stan, the probabilistic programming language, and he was talking about what’s commonly referred to as the tyranny of the mean.

JD: Gosh, so true.

Hugo: Which is in a couple of dimensions, you’re fine. But, as soon as you get in multidimensional space, if you’re thinking about measuring someone’s height, someone’s leg length, perimeter of thighs and calves, and that type of stuff, suddenly if you have designed something for the mean there, you’re absolutely lost because nobody really is around that mean at all.

JD: Yeah, not in all dimensions. If I remember, the 99 PI article had a statistic, and I’ll probably be wrong, but the gist and my take away was, if you have three dimensions of human body dimensions, like leg length, arm length, head circumference, hand size, any three in a small margin of error, only 6% of your population is going to be near that mean.

Hugo: Yeah.

JD: Because everybody’s off a little bit in some dimension.
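
The arithmetic behind this point can be sketched with a small simulation. It is a simplified illustration under stated assumptions, not a reproduction of the statistic JD recalls: each body dimension is modeled as an independent standard normal (real measurements are correlated), "near the mean" is arbitrarily taken as within half a standard deviation, and the function name is hypothetical.

```python
import random


def share_near_mean(n_dims, tolerance=0.5, n_people=50_000, seed=0):
    """Fraction of simulated people within `tolerance` standard deviations
    of the mean on every one of `n_dims` independent dimensions."""
    rng = random.Random(seed)
    near = 0
    for _ in range(n_people):
        # A person counts only if they are close to average on all dimensions.
        if all(abs(rng.gauss(0.0, 1.0)) <= tolerance for _ in range(n_dims)):
            near += 1
    return near / n_people


for dims in (1, 3, 10):
    print(f"{dims} dimension(s): {share_near_mean(dims):.4f} near the mean")
```

Under these assumptions the chance of being near average on a single dimension is roughly 0.38, and because the dimensions are independent the chances multiply: three dimensions gives about 0.38 cubed, or roughly 0.056. Each added dimension shrinks the group further, which is the sense in which designing for the average can mean designing for almost no one.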

What does the future of data science in insurance and reinsurance look like?

Hugo: It’s incredible. Okay, we’ve talked a lot about data science, insurance, reinsurance, empathy in data science, where it’s led to now. What does the future of data science in insurance, reinsurance, and otherwise look like to you?

JD: Well, I really suspect we will see the term data science wane some, and I think that’s fine. It was a very, very helpful term for a number of years to help us think about bringing in technology, computer science-y type terms, along with business acumen and statistics. It will fade, I think, because it has become so obvious. The data analyst of the future is going to be much more data science-y than a data analyst of 5 or 10 years ago. I’m confident of that.

JD: And I was just explaining, having a conversation at coffee today, after lunch, with a friend and we were discussing this idea of where’s the market opportunity. He works in the talent acquisition, the head hunter space. And I was telling him, "Well, it seems like deep learning and a lot of these very complicated artificial intelligence type methodologies get a huge amount of ink spilled because they’re interesting. And they do have the potential to make some revolutionary changes." And that’s great and there needs to be work there, and there will be. But, I think about the other tail of the distribution and I think about your former guest, Jenny Bryan, and her work of trying to get people out of Excel. She’s like, "It’s a widely spread need and you’ve got nobody else crowding the space." And a huge amount of the work in data science-y sort of things is going to be getting manual workflows out of Excel. That is going to be, I think, the future "out of paper," right?

JD: In the past we’ve moved out of paper and getting it into computers. Well, we’re in computers, now, except we’ve got things in Excel that belong in a database or they belong in a programming language. And Excel doesn’t need to go away in my opinion, very controversial opinion, it’s a great tool. It’s the equivalent of a piece of paper. If we didn’t have blank paper to write on today, we would need to invent that because it is so useful and spreadsheets are the same way. If we didn’t have them, we would invent them because they’re so useful. But, we need to also be wise about what pieces do we put in there. For example, the spreadsheet should probably never be a system of record for an organization of more than two people.

JD: I think the future is going to be building a lot more structured process and structured tools around so many things that aren’t the sexy, deep AI, blockchain based gee-whizzery. It’s going to be a lot of more mundane things that are going to fundamentally change how efficient organizations are.

Call to Action

Hugo: Great. I’ve got time for one more question and what I really want to know is do you have a final call to action for our listeners out there?

JD: Yes. One of the things that I realized in the organization that I work in, one of the cultural norms that’s been very valuable to me, is there’s a cultural norm here of asking the question, "Does it change the answer?" Or another way would be, "What’s the next best, simpler alternative?" The idea is if we don’t ever ask ourselves, "Does our analysis change the outcome, the answer, what we’re actually trying to study?" we can do infinite analysis because there’s an infinite number of things we don’t know. And we can keep entire teams busy inside of organizations doing infinite analysis that may just end up as appendix pages in the back of our PowerPoint presentation and may never drive our organization.

JD: I would like to encourage leaders within organizations to have candid conversations with their analytical teams about does the research or the analysis that we’re doing now have potential to change the answer of the decisions we make. If the answer is probably not, ask yourself why you’re throwing resources at it. Yeah, I’ve watched organizations do analysis just because the leader was concerned they would be standing in front of their board and be asked a question that they might not be able to give an answer to. When the answer might be, "That’s not relative to our business," or not relevant, sorry. "That is not relevant to our business." We need to ask these questions so that we don’t spend our precious analytical resources on solving not very important problems.

JD: Similarly, as an economist, I think about having an impact on the margin. So, we should ask ourselves, "What’s the next best, simpler alternative?" We should never compare our analysis or methodology against doing nothing because doing nothing is rarely the alternative. Usually, it’s something that’s a little simpler. So, if we’re going to implement a very complicated model, we should be comparing it not to no model at all, but to our old forecasting method, or a simpler, easier, cheaper, faster forecasting method. And then ask ourselves: is the sophistication of that new method worth the added complexity?

JD: I think that’s where so many rich and important conversations in data science teams will happen in the future.

Hugo: Yeah, I love that. And actually, whenever I teach machine learning, for example, I actually get the learners to establish a baseline model not using machine learning: I get them to do 20 minutes of exploratory data analysis, look at some of the features, and make a prediction themselves in a classification challenge not using machine learning. And that will be a baseline model against which I get them to test any other machine learning model they use later on.
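
The baseline comparison Hugo and JD describe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hugo's actual classroom exercise: the data is synthetic, a majority-class guess stands in for the hand-made baseline, a single learned threshold stands in for the "more sophisticated" model, and all function names are hypothetical.

```python
import random
from collections import Counter


def make_data(n, seed):
    """Synthetic (feature, label) pairs: the label depends on x plus noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if x + rng.gauss(0.0, 0.5) > 0.3 else 0
        rows.append((x, y))
    return rows


def accuracy(predict, rows):
    """Share of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_threshold(rows):
    """Pick the cut point on x that maximizes accuracy on the training rows."""
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=lambda t: accuracy(lambda x: int(x > t), rows))


train = make_data(500, seed=1)
holdout = make_data(500, seed=2)

# Simpler alternative: always predict the most common training label.
majority_label = Counter(y for _, y in train).most_common(1)[0][0]
baseline_acc = accuracy(lambda x: majority_label, holdout)

# Candidate model: a threshold learned from the training data.
threshold = fit_threshold(train)
model_acc = accuracy(lambda x: int(x > threshold), holdout)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
print(f"gain over the simpler alternative: {model_acc - baseline_acc:.3f}")
```

The number to look at is the gain over the simpler alternative, measured on held-out data, rather than the model's accuracy in isolation. If that gain is small, the added complexity may not be worth it.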

JD: That’s such a good idea, Hugo. I see this done with public policy often. There will be some policy proposal, and the benchmarks that are given of the effects of this policy are relative to doing nothing. And it’s like, "That’s not the good alternative." I love that you’re doing that with a class, and I also like that you mentioned plotting the data first. I think somebody already gave this as the call to action in one of your interviews, but "plot your damn data" could be a very good mantra for all of us.

Hugo: I love it. I’m actually going to put it up on my wall this evening.

JD: Fantastic, I’m going to get bumper stickers made up.

Hugo: Fantastic. JD, you rock. It’s been such a pleasure having you on the show.

JD: Thank you, Hugo. I appreciate the opportunity. I look forward to seeing you soon.

