How to choose a project to practice data science

[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers].

Here at Sharp Sight, I’ve derided the “jump in and build something” method of learning data science for quite some time.

Learning data science by “jumping in” and starting a big project is highly inefficient.



However, projects can be extremely useful for practicing data science and refining your skillset, if you know how to select the right project.

Before I give you some pointers on how to select a good project, let’s first talk about why “jump in and build” is not the best method of learning data science.

Jump in and Build is bad for learning

As I mentioned above, Jump in and Build Something™ is a method of learning based on the idea that the best way to learn a new skill is to select a large project and just build, even if you don’t know most of the requisite skills.

You see this quite a bit in programming. A few years ago, you used to hear guys say “I’m going to learn PHP by building an online social network” (essentially, building a Facebook copy).

Jump in and Build is extremely inefficient

While I will admit that it is possible to learn a new skill by jumping into a new project, you have to understand that it’s extremely inefficient. I also tend to think that for beginners, the “knowledge gained” decreases dramatically as the size and complexity of a project increases. That’s another way of saying that if a beginner selects a project that’s too big, they’re likely to learn very little (although large projects can be very useful for advanced practitioners).

The reason for this is that if you choose a project that’s too big, and you don’t know most of the skills, you get bogged down just trying to learn everything before you can move on to getting things done. If you “jump in” to a very complicated project, but you don’t know the requisite skills, you’re going to spend 99% of your time just looking things up. If you’re a beginner and you don’t know much, you might even have trouble figuring out where to start.

Essentially, if you try to work on a project that’s too large or too complicated, you’ll spend all of your time trying to learn dozens of small things that you should have learned before starting the project.

To help clarify this point, I’ll give you two analogies.

Examples: when “jumping in” is a bad idea

I can think of dozens of examples in other arenas where “jumping in” can get you in over your head, but here are two that some of you might be familiar with: learning an instrument and lifting weights.

Trying to learn guitar with something way too hard

At some point in their life, most people have a desire to learn to play an instrument. For many people (and guys in particular) learning guitar is a goal.

If you want to learn to play guitar, are you going to jump in and try to learn to do this right away?



There are some people who are foolish enough to try.

The fact is, learning to play guitar like this would take most people years. More importantly, it would take years of preparation by learning thousands of little skills before you’d be at a level to perform like this. You’d have to learn a thousand little things: how to position your fingers on the fretboard. How to pick. How to play little “phrases” and also how to play fast. Etcetera.

Moreover, it’s not just the nuts-and-bolts techniques that make it hard. It’s also a matter of style. To play guitar like this, you need to learn how to be expressive with the guitar. That’s a completely separate skill that also takes years.

So if you want to learn to play guitar, could you do it by jumping in and learning the guitar solo in the video? Is it possible to learn guitar by trying to learn this complicated guitar solo, one note at a time? Would you be able to do this without knowing any foundational guitar skills beforehand? Maybe. But it would be a long, frustrating effort. My guess is that such a task would induce most people to quit.

For beginning guitarists, it’s much, much more effective and efficient to start with the absolute foundational guitar skills, master the foundational skills, and progressively move on to skills of increasing difficulty.

It’s much more effective to put together a systematic plan with a skilled teacher that puts you on the path to your goal in a structured way.

Data science is exactly like this. The most efficient and effective way to learn data science is to be highly systematic. You need to have a plan. You need to learn the right things in the right order. The optimal strategy for learning data science is almost the opposite of “jump in and build something.”

Trying to get strong by lifting too much weight

Here’s another example.

If you want to get fit and strong, it’s a terrible idea to jump in and try to lift very heavy weight. If you “jump in” and try to lift an amount of weight that’s far beyond your strength level, you’re likely to fail.

Like this guy.



Wow. Too much, man. Take some weight off the bar.

In weightlifting, if you try to lift too much, you’re likely to fail and you might even get hurt.

In data science, you won’t have a risk of injuring yourself physically, but you might incur a different sort of damage: you might injure your ego. You might attempt a project that’s too hard and subsequently fail. Your failure might cause you to believe that you’re “not smart enough” to learn data science, and you might give up altogether. I hear it all the time. People try something that’s too hard, fail, and then give up. It’s a very real risk.

There’s actually a much better way to become a strong data scientist and it’s a lot like trying to get strong in the gym. In the gym, the best way to get strong is to start with light weights, and learn the basic motions safely with those low weights. Then, add a little weight each week. Five pounds. Maybe ten. That doesn’t sound like a lot, but over the course of only a few months, if you continue to add weight to the bar each week, you will get stronger.

Similarly, in data science, instead of jumping into a project with a high difficulty level, you should start with something small and do-able with your current skill level, then increase the size and complexity of your projects as you learn more over time. It’s remarkably similar to weightlifting. Start small, then increase complexity. Over time, you will become a strong, highly skilled data scientist.

When to use projects to practice data science

At this point, I want to clarify something, to make sure that you don’t get the wrong idea. Projects are great, but not for learning.

At a high level, projects are not very good for learning skills.

However, projects are excellent for 2 things:

  1. Integrating skills that you’ve already learned
  2. Identifying skill gaps

Projects help you integrate skills you’ve already learned

As you develop as a data scientist, projects are best for integrating the things that you already know.

Here’s what I mean:

Many of the skills that you need to learn in order to become a data scientist are highly modular.

This is particularly true if you’re using the tidyverse in R. For the most part, the tidyverse was designed such that each function does one thing, and does it well.

Each of these small tools (i.e., each function) is a small unit that you should learn and practice on a very small scale before starting a project. You should find very, very simple examples and practice those examples repeatedly over time.

This is just like a guitar player: a guitar player might practice a guitar scale every single day for a few weeks (or years). He might have a set of 3 chords and practice simple transitions between those guitar chords.

Similarly, you should have small, learnable units that you practice regularly. As a beginner, you should practice just making a bar chart. You should practice how to use dplyr::mutate() to add a new variable to a dataset. You should learn these skills on very simple examples, and practice them repeatedly until you can write that code “with your eyes closed.”
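To make that concrete, here is a minimal sketch of what two such practice drills might look like. It assumes the built-in mtcars dataset and that dplyr and ggplot2 are installed; the object names (cars_with_lbs, weight_lbs, cyl_bar_chart) are just illustrative choices, not part of any library.

```r
library(dplyr)
library(ggplot2)

# Drill 1: add a new variable with dplyr::mutate().
# In mtcars, wt is recorded in units of 1000 lbs, so multiply to get pounds.
cars_with_lbs <- mtcars %>%
  mutate(weight_lbs = wt * 1000)

# Drill 2: make a plain bar chart counting the cars at each cylinder count.
cyl_bar_chart <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Printing the object draws the chart.
cyl_bar_chart
```

Each drill is only a few lines long on purpose. The point is that something this small can be retyped from memory every day until it feels automatic.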

Then, when you start working on a small project, the project will help you integrate those skills. For example, you’ll often need to use dplyr::filter() in combination with ggplot2 to subset your data and create a new plot. Working on a project gives you an opportunity to put these two tools together. It allows you to take ggplot() and filter() – which you should have practiced separately – and integrate them in a way that produces something new and more complex.
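Here is a rough sketch of that kind of integration, again assuming the built-in mtcars dataset. The choice to subset to four-cylinder cars and the names four_cyl_cars and four_cyl_plot are purely illustrative.

```r
library(dplyr)
library(ggplot2)

# Step 1: subset the data with dplyr::filter().
four_cyl_cars <- mtcars %>%
  filter(cyl == 4)

# Step 2: hand the subset to ggplot() and plot weight against fuel economy.
four_cyl_plot <- ggplot(four_cyl_cars, aes(x = wt, y = mpg)) +
  geom_point()

# Printing the object draws the scatterplot.
four_cyl_plot
```

Neither step is new on its own. What the project adds is the practice of chaining them, so that the output of one tool becomes the input of the next.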

This is what projects are great for: they help you put the pieces together. Projects help you integrate skills that you’ve already learned into a more cohesive whole.

Projects help you identify skill gaps

The second use for projects is to help you identify skill gaps.

When you start a new project, I recommend that you know most of the tools and techniques that you need to complete the project. So if the project requires bar charts, histograms, data sorting, adding new variables, etc., you should already know those skills. You should have learned them with small, simple examples, and practiced them for a while so that you’re “smooth” at executing them.

However, even if you’ve learned and practiced the required tools, when you dive into your project, you’ll begin to find little gaps. You’ll find things that you don’t know quite as well as you thought you did. You’ll discover that maybe you don’t know a particular function that well. Or you’ve forgotten a critical piece of syntax.

This is gold.

When you work on a project, these “missing pieces” tell you what you need to work on in order to get you to the next level.

Let me give you an example: when you’re starting out with ggplot2, I recommend that you learn 5 critical data visualizations: the bar, the line, the scatter, the histogram, and the small multiple. These comprise what I sometimes call “the big 5” data visualizations. These are the essentials.
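As a rough sketch, the five chart types might look like this in ggplot2. The datasets (mtcars, which ships with R, and economics, which ships with ggplot2) and the variable choices are my assumptions for illustration; any small dataset would do.

```r
library(ggplot2)

# Bar chart: number of cars at each cylinder count.
bar_chart <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Line chart: US unemployment over time (economics ships with ggplot2).
line_chart <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Scatterplot: car weight against fuel economy.
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Histogram: distribution of fuel economy.
histogram <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Small multiple: the same scatterplot, one panel per cylinder count.
small_multiple <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)
```

Printing any of these objects (for example, typing small_multiple at the console) draws the chart.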

After learning these, let’s say that you decide to work on a project. You decide to analyze a small dataset that you obtained online, and you plan to use the “essential visualizations.”

But after creating the basic visualizations to analyze the data, you decide that you want to make them look a little more polished by modifying the plot themes. If, at that point, you haven’t learned ggplot2::theme() and all of the element functions (like element_line(), element_rect(), etc.) then you’ll have a hard time formatting your plots and making them look more professional. In this case, you will have identified a “skill gap.” These are the next skills to work on. You’d know that to get to the next level, you need to learn (and practice!) the theme() function and the accompanying functions & parameters of the ggplot2 theme system.
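For a sense of what that gap looks like in code, here is a small sketch of theme() applied to a basic scatterplot. The specific colors and settings are arbitrary choices for illustration, and base_plot and polished_plot are hypothetical names.

```r
library(ggplot2)

# A basic scatterplot, the kind you would already know how to make.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The new skill: restyle individual plot elements through theme().
polished_plot <- base_plot +
  theme(
    panel.background = element_rect(fill = "white"),    # plain white panel
    panel.grid.major = element_line(color = "grey85"),  # light major gridlines
    panel.grid.minor = element_blank(),                 # remove minor gridlines
    axis.title = element_text(face = "bold")            # bold axis titles
  )

polished_plot
```

Each argument to theme() names one plot element, and each element function (element_rect(), element_line(), element_text(), element_blank()) controls how that element is drawn. That is a lot of new vocabulary, which is exactly why it is worth practicing separately once a project exposes it.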

Projects are excellent for identifying your skill weaknesses. That will help you refine your learning plan as you move forward.

How to choose a good project to practice data science

To get the benefits from project work, the critical factor is selecting a project that’s at the right skill level: not too hard, but not too easy. If you’ve selected well, then you’ll have a small and manageable list of “things to learn right now” in order to finish the project. If you’ve selected a project at an appropriate skill level, then your “skill gap” will be small, you’ll be able to learn those new skills on the fly, and you’ll be able to complete the project.

Afterwards, you’ll be able to add these “new skills” to your practice routine so you can remember them over time.

Choosing such a project is more of an art than a science, but here are a few pointers:

Choose something that’s mostly within your current skill level

Ultimately, you want something that’s within your skill level, but will push you just a little bit.

Having said this, when you consider a new project, you should just ask a few simple questions:

  1. What skills do I think I’ll need?
  2. Do I know those skills?

Here’s an example: about a year ago, I did an analysis of a car dataset that I obtained online.

Before starting this project, I had a good idea of the tools that I’d need:

There were a few other tools and techniques, but that’s the short list.

Before I even started the project, I had a rough idea that those were the skills that I needed to know.

If you wanted to execute a similar project, you should make a similar list, and ask yourself, do I know most of these skills already?

You should know how to do about 90% of the work

After identifying the tools and techniques you’ll need for a project, here’s a good rule of thumb: you should already know about 90 percent of the tools and techniques.

For example, if you’re working on a project that requires about 20 primary tools or techniques, you should be able to execute roughly 18 of those techniques.

That leaves about 2 techniques (a few more at most) that you don’t know. Such a project would be a decent stretch. For the 18 techniques that you do know, it will be good practice. You’ll get to repeat those techniques (repetition is essential for long-term memory) and perhaps combine them in new or interesting ways.

What about the techniques that you don’t know? You’ll have to learn them on the fly and integrate them into the project. This is actually hard to do, because learning a new technique will slow you down. Learning a new technique while you’re working on your project will dramatically reduce your effectiveness and slow down the project’s progress. That’s why I recommend that you mostly learn and practice techniques outside the context of project work. To rapidly learn and master your tools, you should be learning and practicing your toolkit regularly and separately from your projects.

But again, if you begin a project and realize that there are a few necessary techniques that you don’t know, that’s fine. In fact, it’s good. It tells you what your next steps are for your learning plan.

This invites a question though: What counts as a technique?

I actually think that the tidyverse’s modular structure gives us a good way of breaking things down. Individual tidyverse functions are a good way to dissect the project into different tools. In this scheme, I’d consider dplyr::mutate() to be one tool. dplyr::arrange() would be another. Among the ggplot2 techniques, you could consider geom_line() to be a single technique. Some of the intermediate tools like scale_fill_gradient() could also be considered separate techniques. Again, the tidyverse is highly modular, in that each function is a little functional module that does one thing. That being the case, you can treat these little, modular functions as units that you either know or don’t know when you evaluate a potential project.

So to restate, here’s a good rule of thumb: when you start a project, you should already know about 90% of the techniques (and the remaining 10% will force you to stretch your skill).
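If it helps, the rule of thumb can be turned into a quick checklist. The sketch below is only an illustration: the list of techniques and the TRUE/FALSE answers are made up, and you would swap in the functions your own project needs.

```r
# Hypothetical checklist for a small project.
# TRUE means "I can already do this without looking it up."
project_techniques <- c(
  "dplyr::mutate()"           = TRUE,
  "dplyr::filter()"           = TRUE,
  "dplyr::arrange()"          = TRUE,
  "dplyr::group_by()"         = TRUE,
  "dplyr::summarise()"        = TRUE,
  "ggplot2::geom_bar()"       = TRUE,
  "ggplot2::geom_line()"      = TRUE,
  "ggplot2::geom_point()"     = TRUE,
  "ggplot2::geom_histogram()" = TRUE,
  "ggplot2::theme()"          = FALSE
)

# Share of techniques already known (TRUE counts as 1, FALSE as 0).
share_known <- sum(project_techniques) / length(project_techniques)

# The techniques still to learn: these are the skill gaps.
skill_gaps <- names(project_techniques)[!project_techniques]

share_known
skill_gaps
```

With this made-up list, 9 of the 10 techniques are known, so the project sits right at the 90% mark, and the single gap (the theme system) becomes the next thing to practice.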

If it feels too easy, choose something harder

Having said that, if you evaluate a project, and it seems too easy, then try to find something harder. You want to push yourself just a little.

For example, if you’ve been a data scientist for a year or two, and you’ve made a few hundred bar charts and line charts, then choosing a project that uses only the basic tools might be a little too easy for you. If that’s the case, try to find something that is just a little out of your comfort zone.

Again, it’s like weightlifting: you need to add a little weight to the bar every week in order to get strong. If the weight on the bar is so easy that you can do a couple dozen repetitions, it’s too light. You need something more difficult.

If you look at a potential project, and you know you’ve done something very similar many times before, choose something more difficult.

Projects are part of a larger process of systematic learning

If you use projects the right way, then they are a critical part of a much larger scheme of highly systematic learning.

In this post, I dropped some hints, but here I’ll be more explicit: to rapidly learn and master data science, you need to be systematic. You need to be systematic in what you learn, when you learn it, and how you practice. High performers of all stripes know that relentless, systematic practice is the most effective way to learn a new skill.

Having said that, as I mentioned above, projects are an important part of a systematic learning plan because they help you integrate what you’ve already learned, they help you identify skill gaps, and they can push you beyond your comfort zone.

But whatever you do, don’t fall into the “jump in and build something” trap by trying to learn data science without a plan.


The post How to choose a project to practice data science appeared first on SHARP SIGHT LABS.
