As someone who has spent a significant amount of time ‘data wrangling’ outside the field of data science, I found this recent New York Times article both highly interesting and nothing new – I’ve seen the same issue discussed in a variety of fields, with similar automation solutions presented each time. Granted, there are fields where this janitorial drudgery is absolutely necessary, such as molecular biology. As a scientist researching mRNA expression, you have to maintain RNase-free conditions, meaning you have to ‘clean up’ your bench in far more detail than you’d ever imagine before you can run your first PCR cycle. This includes specific protocols for washing your dishes several times with special water before you even start, because if you don’t, you’re not likely to get any bands in your electrophoresis gel and you’ll lose several days (or weeks!) of work.
There are other fields, like cooking, where you can get by with less prep if you so choose, but doing so makes the later process less enjoyable or more risky. A chef needs to prep her ingredients; however, whether everything is mise en place before she turns on the stove is a decision left to the individual. Between these extremes there are many places to look for insight, and there are two types of wrangling that parallel the problems facing data scientists today: the struggles of the qualitative researcher and the trials of the knitter. I offer them both here as food for thought for the data science community, from the novice starting to explore data science to the seasoned expert who uses their R toolbox daily.
I’ve taught educational research to a variety of students, from psychology undergraduates looking for a semester project to doctoral-level graduate students using RStudio in my research methods course. Many times these willing students would walk into my office with a preliminary idea for a quantitative project and leave excited instead about a qualitative (or mixed-methods) endeavor. Why? Because qualitative research teaches them how to plan, gather, and interact with a large volume of data in a way free of their research preconceptions, rather than helping them pseudo-engage in backfilling a quantitative research question with simple answers. These students wanted to learn how to conduct their own research in preparation for a thesis or dissertation – and I was prepared to teach them on-the-job skills. But isn’t qualitative research just hearsay and conjecture? Isn’t it a lesser form of research? (For reference, it’s just as methodical as quantitative research and certainly has its place in the sun.)
There is often only one researcher involved in this kind of data collection and analysis, so how much can we really learn from the endeavor of qualitative research? I argue that qualitative research teaches us more about ourselves as human researchers, how we interact with data, and how we bias our results. We can then use this information to improve our quantitative research processes and get even more out of later data than we thought possible.
There are software products to address these data ‘problems’ in qualitative research, from transcription to analysis, and there have been for many years. Unsavory tasks ahead? Automate the process of moving from audiotape to transcript so you can get to the ‘real stuff’ quicker. Not sure whether there are great connections to support your theory-building? Try some analysis software to see what pops out. Each time, however, I’d caution my students against using these ‘time-savers’ because of what you lose in between the lines. You may gain time and lower your frustration level, but you lose the thorough interaction with a set of data that increases your insight and intuition and helps you predict and solve other issues down the road. You miss the building of excitement that can happen while you’re waiting to get to the patterns and find the ‘answers’. You get 70% (at most) of what the data has to say with this approach. When you invest this time rather than automate it away, you get that other 30% of information back, which can make all the difference when you’re presenting the results to others and trying to stand out in a sea of researchers. Struggling through the mire is very useful in qualitative research, and teaches the researcher the time-tested ways of thoroughly knowing the data, correcting for human bias through procedures, and finding emergent patterns in a large volume of information. You can’t buy better experiences than those in the long term.
The second bellwether comes from a different field entirely, and is a hobby for most who engage in it. I’m a longtime knitter and crocheter, and I’ve taught a few hundred people to knit or crochet over the years. Each time I teach the yarn arts, I ask the student to bring a skein of yarn and a crochet hook to our session. Our first task isn’t to chain or learn a beginning stitch, though; it’s to make a ball of yarn from the skein. It’s part of the planning process, I say. We have to get to know our yarn first, to predict problems down the line and find out if this specific yarn is the best choice for what we’re making. We need to feel the yarn in our fingers and familiarize ourselves with the weight, drape, and shape of the strand. How does it split? Is that a feature or a problem? Are there any unexpected knots or frays hiding inside that skein? We should know this before we start our time-intensive alchemy from a mess of yarn to a finished scarf. Sure, we can make a scarf without this metadata… but it most likely won’t be as smooth a working-up process or result in as polished a finished product. Yet the balling of yarn is seen by many as the janitor-work of the crafting world. There are many ways of automating the process with a ball winder or yarn caker – all of which omit the touching of the yarn by human fingers to gauge its usefulness and character while informing the rest of the process. I find this choice a mistake and, again, short-sighted.
I hope these examples help you think more deeply about your own data cleaning issues and opportunities, and consider what you may lose or gain from automating these processes in your work. Happy Wrangling, whether it happens with thorny CSV files or soft bamboo yarns!