Top 5 Best Practices for Data Collection and Storage in Microsoft Excel

[This article was first published on r-bloggers – Telethon Kids Institute, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As a biostatistician, my preferred way to receive data from a client is a csv extract from a well-designed database that included validation with the data-entry process. However, 95% of the data I see comes as an Excel file.

I have just sent an email to a client with some tips for using Excel to collect and store their research data. These 5 steps are my recommendations that save me a lot of headache when importing data from .xlsx files into R for analysis:

  1. (basic) Only enter a single piece of information into a cell. For example, if the column is “number”, then 365 is the only thing in the cell, do not do this: “365 (children only)”. If you need to leave a comment, then put it in another column.
  2. (basic) Keep column names simple, my preference is up to a maximum of 3 words, separated by underscores, all lower caps. If you can, keep column names meaningful to a human reader. Also, only use Row 1 for column names/information. Extra information about the data contained within a specific column should be placed in the data dictionary, which is in a different worksheet.
  3. (basic) Include a data dictionary that should have at least 5 columns: 1| column name (point 2), 2| Description, 3| data type (e.g. dichotomous (yes/no, true/false), categorical (list the categories), numeric, text), 4| valid ranges of data (for example, valid entries for age could be between 0 and 110 years, if 200 was entered then I know it’s likely an error), and 5| if the value is numeric, provide the unit of measurement (e.g. years, days, cm, kg, …).
  4. (advanced) Use Excel’s data validation functions – especially to make drop-down lists for categorical responses.
  5. (intermediate) If you have any date columns, change the cell data type to “Text” (the same for phone numbers).

Note: Always, any information that can be used to identify an individual should be removed from the file before it gets sent to a statistician.

If everyone followed these 5 steps, then 80% of my work (and time) would be freed up for analysis and communicating results. I know there must be other simple pieces of advice. What have I missed? What other simple advice do you give your colleagues/clients about data entry into Microsoft Excel?

Photo by Scott Graham on Unsplash

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Telethon Kids Institute.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)