Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Imagine a folder A whose contents are to be copied to a folder B. A has five subfolders with 1, 2, 3, 4, and 5 files, respectively. For simplicity, assume every file is the same size. When copying the files from A to B, how would you count the progress?
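The total number of files is 1 + 2 + 3 + 4 + 5 = 15. One way is to measure progress in steps of 1/15, one step per file. Alternatively, each of the five folders can count equally, contributing 1/5 of the total, so that a file in a folder with n files is worth 1/(5n): 1/5 for the single file in the first folder, 1/10 for each file in the second, 1/15 for each file in the third, and so on.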
When the first folder is done, the progress would be 20%, then 30%, 40%, about 46.7%, 53.3%, 60%, 65%, and so on. The latter would be incredibly frustrating to watch. But it is possible.
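To make the two ways of counting concrete, here is a minimal Python sketch. The function names are my own, chosen for illustration. It computes the progress after each copied file under both schemes, assuming equal-sized files and folders copied in order from the smallest to the largest.

```python
from fractions import Fraction

# Files per subfolder, as in the folder example.
files_per_folder = [1, 2, 3, 4, 5]

def progress_by_file(counts):
    """Every file is worth the same share of the total."""
    total = sum(counts)
    return [Fraction(done, total) for done in range(1, total + 1)]

def progress_by_folder(counts):
    """Every folder is worth the same share; files split their folder's share."""
    folder_share = Fraction(1, len(counts))
    done = Fraction(0)
    steps = []
    for n in counts:
        for _ in range(n):
            done += folder_share / n
            steps.append(done)
    return steps

def as_percent(steps):
    """Convert exact fractions to percentages rounded to one decimal."""
    return [round(float(s) * 100, 1) for s in steps]

by_file = as_percent(progress_by_file(files_per_folder))
by_folder = as_percent(progress_by_folder(files_per_folder))

print(by_file[:4])    # [6.7, 13.3, 20.0, 26.7]
print(by_folder[:7])  # [20.0, 30.0, 40.0, 46.7, 53.3, 60.0, 65.0]
```

Both sequences end at 100%, but the per-folder bar jumps by 20 points on the first file and crawls by 4 points per file in the last folder.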
This example might sound trivial and unreal, but it is incredible how often it turns up in real practical problems — including cooking and supply chain statistics.
My roommate Tagg has a unique way of making ramen noodles. He brings the water to a boil, pours the ramen in, leaves it for less than a minute, and puts many seasonings on top. My other roommate Jack pours most of the water out and then lets the noodles sit with the seasoning to soak in the spices. Tommy and Jake boil it with the herbs and cook off every drop of water. They like raw ramen noodles.
Which one’s better? I can’t say definitely. (Although I like Tagg’s method, this NY Times recipe is the best.)
I am working on a research project with a hygiene products company based in North Carolina. It's facing returns, sometimes up to 15% of its sales. Prof. Sean and I were trying to find out why. We found opportunities to streamline distribution using their data on sales, transportation, and claims.
But the problem of choosing how to calculate a metric turned up in something I thought was super simple. The company gave us sales, transportation, and product datasets. See the following examples.
These datasets have random values and aren’t real. But they give you a taste of what the company provided us.
Deciding on the metrics is way more complicated than I initially thought. Suppose you want to estimate how many complete pallet orders were shipped from a location. Where do you start? Well, each item was in a carton, which was in a pallet.
We want to estimate the proportion of a location's orders shipped in full pallets. There are at least two methods to find it.
First, find the number of full pallets for every row, since each row (in the Sales sheet) is an order-item combination. Group all the entries by order number; then you can find what proportion of each order's cases were sent in full pallets. But that is a figure for every order, and we wanted metrics by location. So, aggregate the results again by (City, State) and calculate the average proportion of full-pallet cases.
The other method is to group by (City, State) without first grouping by order number. This disregards which items were part of which order, breaking the 1-to-1 matches, so higher-volume orders weigh more than smaller ones. There's no reason the two methods should give the same number unless all orders were alike.
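The difference between the two methods is easiest to see on a toy example. The sketch below uses made-up rows; the column layout, order IDs, and city are hypothetical and are not the company's data. The first function averages per-order proportions within a location, and the second pools all rows in a location and ignores order boundaries.

```python
from collections import defaultdict

# Hypothetical rows: (order_id, city, state, cases_in_full_pallets, total_cases).
rows = [
    ("A1", "Raleigh", "NC", 40, 40),
    ("A1", "Raleigh", "NC", 0, 10),
    ("A2", "Raleigh", "NC", 0, 5),
    ("A3", "Raleigh", "NC", 200, 200),
]

def mean_of_order_proportions(rows):
    """Method 1: proportion per order, then average the orders in each location."""
    per_order = defaultdict(lambda: [0, 0])
    for order_id, city, state, full, total in rows:
        acc = per_order[(city, state, order_id)]
        acc[0] += full
        acc[1] += total
    per_location = defaultdict(list)
    for (city, state, _), (full, total) in per_order.items():
        per_location[(city, state)].append(full / total)
    return {loc: sum(p) / len(p) for loc, p in per_location.items()}

def pooled_proportion(rows):
    """Method 2: pool all rows in a location, ignoring order boundaries."""
    per_location = defaultdict(lambda: [0, 0])
    for _, city, state, full, total in rows:
        acc = per_location[(city, state)]
        acc[0] += full
        acc[1] += total
    return {loc: full / total for loc, (full, total) in per_location.items()}

loc = ("Raleigh", "NC")
print(round(mean_of_order_proportions(rows)[loc], 3))  # 0.6
print(round(pooled_proportion(rows)[loc], 3))          # 0.941
```

Here the per-order proportions are 0.8, 0, and 1, which average to 0.6, while pooling gives 240 of 255 cases, about 0.94. The large order A3 dominates the pooled figure but counts as just one order in the average.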
This situation of defining the right metric turns up in so many different ways. How we aggregate things matters because the end product depends not only on the raw materials but on the method as well. Simple things aren't as intuitive as one might think. Ultimately, we have to use the metric that the company likes to use.
A general note on metrics.
- If people have to perform calculations on your metrics to generate insights, they’re not good metrics.
- Specify what your metric represents and what it doesn’t represent. You’ll avoid situations where your metric is misused.
- Always consider what the company or client thinks about your metrics and their businesses. If they disagree with your formulation, your metric will be just another number.
- There is more variability than what a statistical model can capture. Listen to managers; they’ve more knowledge about their businesses than you’d ever have.
