Have you ever heard a team member or colleague report that they increased an important result, ON AVERAGE, by a significant amount, say 50% or more?
I’ve seen it often enough.
That could mislead you.
Here is an example, using the graph below and an imaginary situation.
Imagine you were the leader of an important team tasked with improving productivity. You prepared your team and carried out many actions. You tracked the performance “before” vs. “after” the improvement. Then you presented the graph to your boss and all the executives in your company.
You showed them that “before” the improvement, your productivity was 10 pcs/day on AVERAGE. “After” the improvement, the AVERAGE became 15 pcs/day. It is a 50% improvement! (Refer to the graph.)
You were so excited to report this, your boss could not hide his happiness, and you became the star of the month. Everybody was happy and you got the award. But something looks wrong.
Now, when you look at the graph closely, you can see that something is wrong with the 50% improvement. (In this example, it’s easy to spot with the naked eye; in the real world, with thousands of data points, a misleading conclusion is not so easily caught.)
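- Before choosing the average (mean), the median, or another summary, you need to know the shape of the distribution. The average is a fair summary only when the distribution is roughly symmetric, as a normal distribution is; with skew or outliers it can sit far from the typical value.
- In addition to the average, you also need to look at the spread: the variance, usually reported as the standard deviation.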
- From the graph, we can see that the “after” performance has some issues that make its AVERAGE misleading:
  - it is not stable (high variance)
  - it has a decreasing TREND: it rises only in the first couple of days, then falls, eventually to a point lower than before the improvement
- Statistically, we need to check whether the data are normally distributed and whether the “after” performance is really different from the “before” performance.
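To make these checks concrete, here is a minimal sketch using only Python's standard library. The numbers are hypothetical, invented to mimic the situation described (a “before” average of 10 pcs/day and an “after” average of 15 pcs/day); they are not the data behind the graph. The helper `trend_slope` is also just an illustrative name for an ordinary least-squares slope against the day index.

```python
from statistics import mean, median, stdev

# Hypothetical daily output (pcs/day), invented for illustration only.
before = [10, 9, 11, 10, 10, 9, 11, 10, 10, 10]
after = [14, 22, 26, 22, 18, 14, 12, 9, 7, 6]


def trend_slope(values):
    """Least-squares slope of values against the day index (pcs/day per day)."""
    days = range(len(values))
    day_mean = mean(days)
    value_mean = mean(values)
    sxy = sum((d - day_mean) * (v - value_mean) for d, v in zip(days, values))
    sxx = sum((d - day_mean) ** 2 for d in days)
    return sxy / sxx


# Report the average together with the spread, the trend, and the last value.
for label, data in (("before", before), ("after", after)):
    print(
        f"{label:>6}: mean={mean(data):.1f}  median={median(data):.1f}  "
        f"stdev={stdev(data):.2f}  slope={trend_slope(data):+.2f}  "
        f"last={data[-1]}"
    )
```

With made-up data like this, the two averages differ by 50%, yet the “after” series has a much larger standard deviation, a negative slope, and a final value below every “before” value, which is exactly the kind of detail the AVERAGE alone hides.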
In a nutshell, to represent a set of data you have to review at least two other things in addition to the average (or median): the shape of the distribution and its variance.
I deliberately put this at the end:
- a box plot can help us show several sets of data side by side graphically
- several statistical tools can help us test whether two (or more) sets of data are different: a t-test (for a pair of distributions) or analysis of variance (ANOVA) for multiple sets of data
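As a rough sketch of both ideas, the snippet below computes the five numbers a box plot draws (minimum, quartiles, maximum) and a t statistic for two sets of data, using only Python's standard library. The data are hypothetical, invented for illustration, and the function names `five_number_summary` and `welch_t` are my own. I use Welch's version of the t-test here, which is an assumption on my part: it does not require the two sets to have equal variance.

```python
from math import sqrt
from statistics import mean, quantiles, variance

# Hypothetical daily output (pcs/day), invented for illustration only.
before = [10, 9, 11, 10, 10, 9, 11, 10, 10, 10]
after = [14, 22, 26, 22, 18, 14, 12, 9, 7, 6]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(b) - mean(a)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


print("before:", five_number_summary(before))
print("after: ", five_number_summary(after))

t, df = welch_t(before, after)
print(f"Welch t = {t:.2f} with about {df:.1f} degrees of freedom")
```

To turn the t statistic into a p-value you would compare it against a t table at the reported degrees of freedom, or use a statistics library (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False`). Keep in mind that a t-test assumes independent observations with no trend, so for a series that is drifting downward the result should be read together with the trend check, not instead of it.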