The job market for data analysts is large and highly competitive. Many companies, including ones not traditionally classified as “tech” or “coding” companies, are looking to hire people with analytical coding experience. Yet the number of applicants seems to be rising even faster. It’s a competitive market, and you want your application to stand out.
Many job descriptions include lines like “Executes and advises on reporting needs and works cross-functionally to analyze data and make actionable recommendations at all levels,” “Utilizes advanced analytical and/or statistical ability to evaluate data and make judgments and recommendations,” or “Experience in at least one computer programming language or analytical programming language (R, Python, SAS, etc.)” (emphasis added).
Notice that these job postings share two common themes: (1) experience analyzing data and (2) experience providing recommendations. Your goal as an aspiring analyst is to demonstrate experience in both domains. But how can you do that when you are applying for your first job in the field? Easy: spend some time building a core portfolio that shows off the skills recruiters want.
This article covers four projects that can form the core of your application portfolio. We recommend completing these prior to applying for jobs so that you can have demonstrable experience to include on your resume and discuss in your interviews. Be sure to create a GitHub repo for each project and link prominently to your GitHub profile on your resume.
I expect that building this portfolio will take at least one month of focused work:
Week 1: Exploratory data analysis
Week 2: Interactive Shiny dashboard
Week 3: Natural Language Processing
Week 4: Machine Learning
As you work through the projects, keep in mind that your goal is not just to gain experience analyzing data but also to practice providing insightful recommendations. Even though this is practice, structure your output to include conclusions like “If I wanted to improve my exercise habits, the data show that…”. Prove to future recruiters that you have the skills they want to see.
Core portfolio projects
Exploratory data analysis
Project goal: Load a messy dataset into R, clean the data, create 4-5 charts or tables with summary statistics about your data, and create 4-5 charts or tables that provide analytic insight. Your output should be an HTML R Markdown document.
Questions to answer: How big is my dataset? What do the top 10 rows look like? What is the distribution of my variables of interest (mean/median, skew, outliers)? How does the data change over time? What is the relationship between Variable A and Variable B? Why does Variable A behave the way it does?
Resume skills practiced: R, data cleaning, data visualization
Recommended packages: dplyr, tidyr, ggplot2, kableExtra, others as needed
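To make the workflow concrete, here is a minimal sketch of the summary-stats step, using the built-in airquality dataset as a stand-in for whatever messy dataset you pick:

```r
# Minimal EDA sketch: airquality is a built-in dataset used here as a
# stand-in for your own (messier) data.
library(dplyr)
library(ggplot2)

# How big is my dataset, and what do the top rows look like?
dim(airquality)
head(airquality, 10)

# Summary statistics for a variable of interest, dropping missing values
ozone_summary <- airquality %>%
  filter(!is.na(Ozone)) %>%
  summarise(
    n      = n(),
    mean   = mean(Ozone),
    median = median(Ozone),
    sd     = sd(Ozone)
  )
ozone_summary

# Distribution of Ozone: skew and outliers show up clearly in a histogram
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(title = "Distribution of daily ozone readings")
```

In a real project you would repeat this pattern for each variable of interest and render the whole script with rmarkdown, so the charts and tables appear in one HTML document.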
Interactive Shiny dashboard
Project goal: As in the exploratory data analysis project, load, clean, and analyze a dataset. Focus, however, on visualizing the data in a Shiny dashboard. Your output should be either a Shiny dashboard with “server.R” and “ui.R” files, a flexdashboard with “runtime: shiny”, or an HTML R Markdown document with “runtime: shiny”. Host your dashboard on shinyapps.io.
Questions to answer: Ask the same basic questions as the Exploratory Data Analysis project, but build a dashboard that allows users to generate the answers themselves. For example, instead of showing a histogram of Variable A, create a histogram chart that allows the user to select which variable to plot.
Resume skills practiced: R, Shiny, data visualization
Recommended packages: shiny, DT, others as needed
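As a starting point, here is a minimal single-file sketch of the “let the user pick the variable” idea, again using the built-in airquality dataset as a placeholder:

```r
# Minimal Shiny sketch: instead of hard-coding one histogram, let the
# user select which airquality variable to plot.
library(shiny)

ui <- fluidPage(
  titlePanel("Explore airquality"),
  selectInput("var", "Variable to plot:", choices = names(airquality)),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(airquality[[input$var]],
         main = paste("Distribution of", input$var),
         xlab = input$var)
  })
}

app <- shinyApp(ui = ui, server = server)
# Run locally with runApp(app); deploy with rsconnect::deployApp()
```

The same reactive pattern (an input widget feeding a render function) scales to every chart in your dashboard.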
Natural Language Processing (NLP) with R
Project goal: Replicate the analysis completed in this comprehensive example in order to draw analytical insight from a large body of unstructured text.
Questions to answer: What are the ten most frequent words in my corpus (excluding “stop words”)? What is the sentiment score of my corpus? How does sentiment change across sections of my corpus (e.g., chapters of my book)? What are the most common positive and negative words? What can tf-idf tell us about words unique to one part of my corpus (e.g., which words are most distinctive to books A, B, and C)? Can I implement latent Dirichlet allocation (LDA) to separate my corpus into topics?
Resume skills practiced: R, NLP, Text Mining, Machine Learning
Recommended packages: tidytext, topicmodels
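The word-frequency question above reduces to a short tidytext pipeline. This sketch uses a tiny hand-made two-document corpus purely for illustration; in practice you would load a real body of text:

```r
# Tiny NLP sketch with tidytext: tokenize, drop stop words, count words.
# `docs` is a toy stand-in corpus, not real data.
library(dplyr)
library(tidytext)

docs <- tibble(
  doc  = c("a", "b"),
  text = c("Data analysis is fun and data cleaning is rewarding",
           "Dashboards turn data analysis into conversation")
)

word_counts <- docs %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>% # remove "is", "and", "into", etc.
  count(word, sort = TRUE)

head(word_counts, 10)
```

Sentiment scoring and tf-idf follow the same shape: tokenize once, then join against a lexicon (`get_sentiments()`) or weight with `bind_tf_idf()`.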
Machine learning with R
Project goal: Load a dataset, train a machine learning algorithm on part of the dataset, and use the rest of the dataset to test it. Create summary stats to evaluate the performance of your model.
Questions to answer: The types of questions you ask will vary tremendously based on the machine learning algorithm you implement. Try implementing one unsupervised algorithm and a handful of supervised ones. If you implement k-means (unsupervised), run your model with different seed values to find the lowest total within-cluster sum of squares (SS), and plot the total within-cluster SS against various numbers of clusters (k) to determine the optimal cluster count. Try training several different supervised models (random forest, kNN, etc.) on the same dataset, and compare their results to pick the best one. Example here.
Resume skills practiced: R, Machine Learning
Recommended packages: caret
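The k-means “elbow” procedure described above needs only base R, which makes it a good first step before reaching for caret’s supervised wrappers. A sketch on the iris measurements (used here as a convenient built-in example):

```r
# Unsupervised sketch: k-means elbow plot on the iris measurements.
# Base R only; for supervised models, caret's train() wraps the same
# split-train-evaluate workflow.
features <- iris[, 1:4]

set.seed(42)  # k-means is seed-sensitive; fix one for reproducibility
total_within_ss <- sapply(1:6, function(k) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best,
  # approximating the "try different seeds" advice above
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Total within-cluster SS always falls as k grows; the "elbow" where the
# drop levels off (around k = 3 for iris) suggests the cluster count.
plot(1:6, total_within_ss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")
```

For the supervised half of the project, caret’s `createDataPartition()`, `train()`, and `confusionMatrix()` cover the split, fit, and evaluation steps with a consistent interface across model types.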