In the previous post, I described how we make mistakes in testing when we draw an incorrect conclusion from the results of a sample. I showed how sample size and sampling variability can team up to provide misleading evidence. The example illustrated the alpha risk: the risk of rejecting the null hypothesis when the null is actually true.
The other type of risk is the beta risk: the risk of accepting the null hypothesis when it is actually false. In our example, this would mean concluding that the process has not been improved when in fact it has.
Here is a depiction of the process the company was working to improve. It shows a mean of 39.70 and a standard deviation of 0.02.
The team had originally planned a quick run of five items; if the mean was larger than 39.71, that would suffice to conclude that the process improvements had worked. Note that the current process mean was 39.70. On reflection, we realized that this carried too high a risk. After running simulations to better understand the situation, we decided to run a sample of 20 units. This brought our risk of a false positive down from over 10% to just under 1%.
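To make that trade-off concrete, here is a minimal sketch in Python. It uses the normal sampling distribution of the mean rather than the simulations described above, so the printed values are approximations and will not match the simulation figures exactly; the cutoff of 39.71 and the process parameters are taken from the example.

```python
from scipy.stats import norm

mu_current, sigma, cutoff = 39.70, 0.02, 39.71

# Alpha risk: the chance that the sample mean lands above the cutoff
# even though the process has NOT improved (true mean still 39.70).
for n in (5, 20):
    se = sigma / n ** 0.5                                # standard error of the mean
    alpha_risk = 1 - norm.cdf(cutoff, loc=mu_current, scale=se)
    print(f"n = {n:2d}: alpha risk ~ {alpha_risk:.1%}")
```

The false-positive probability drops sharply as the sample size grows, because the sampling distribution of the mean tightens around 39.70.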
Beta Risk
Now, let's look at the other risk: concluding that the process has not been improved when it has been. Our decision line was set to 39.71. That is, if the mean of the sample was above 39.71, we would conclude that the process had been improved. Obviously, we do not know the true state of the new process, but for the purpose of teaching the beta risk, I will look at the situation where the new process mean is 39.715. The standard deviation remains the same as before at 0.02.
Here is a density plot of the two processes.
This depicts the new process with a mean of 39.715. This is just a simulation. In the real world, we need to take a sample and draw a conclusion. Let’s look at the results from a sample size of 20, as was decided on to minimize the alpha risk in the previous post.
This first plot serves as a reminder from the previous post that if the process was not improved, the distribution of averages from samples of size n = 20 would look like this. The alpha risk was kept to a low value of 1%.
The alpha risk was something we were able to easily control. We know something about our current process and can decide on a cutoff point where this risk is minimal.
The beta risk is not as easily managed. We really don't know the parameters of the new process; it could be a little better, a lot better, or not better at all. The depiction below shows the distribution of means from samples of size n = 20, assuming that the new process runs with a mean of 39.715. This is better than the old process, but we can only learn about it through sampling. The plot tells us that 13% of the time, the sample mean will be below 39.71. If our sample is part of that 13%, we will conclude that the process did not improve.
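That 13% figure can be reproduced directly from the normal sampling distribution. A minimal sketch, assuming the same cutoff of 39.71 and an assumed new-process mean of 39.715:

```python
from scipy.stats import norm

mu_new, sigma, n, cutoff = 39.715, 0.02, 20, 39.71
se = sigma / n ** 0.5                                    # standard error of the mean

# Beta risk: the chance that the sample mean lands below the cutoff
# even though the process HAS improved (true mean assumed to be 39.715).
beta = norm.cdf(cutoff, loc=mu_new, scale=se)
print(f"beta risk ~ {beta:.0%}")                         # roughly 13%
```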
With beta risks, you need to assume different levels of improvement (e.g. different means for the new process) and then consider the risks alongside the magnitude of the improvement. If the new process had a mean of 40, well above the current state, you could learn that with a sample size of one: the difference in means is so large that you can be nearly certain. But most improvements are not that dramatic. Many of our improvements are incremental and may bump the mean by less than one standard deviation. This is why it is important to run simulations and understand these risks. Smaller increments require larger sample sizes, as the sketch below illustrates.
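Here is one way to explore that in Python: repeat the beta-risk calculation at n = 20 for a range of hypothetical new-process means. The candidate means below are illustrative choices, not values from the example.

```python
from scipy.stats import norm

sigma, n, cutoff = 0.02, 20, 39.71
se = sigma / n ** 0.5

# Beta risk at n = 20 for several hypothetical new-process means.
for mu_new in (39.705, 39.710, 39.715, 39.720, 39.740):
    beta = norm.cdf(cutoff, loc=mu_new, scale=se)
    print(f"assumed new mean {mu_new:.3f}: beta ~ {beta:.1%}")
```

Small shifts leave a large beta risk at n = 20, which is why modest improvements call for larger samples.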
Power
Now that we understand the alpha and beta risks, here is a new term to consider:
Power = 1 − β
Since beta is the risk of accepting the null when it is false (a mistake), power is the probability of rejecting the null when it is false (a correct decision).
The power of a test is the ability to detect an effect, if the effect truly exists. We want to consider power when deciding on the sample size and the cutoff for a test.
Sticking with our sample of 20 items and the cutoff of 39.71, let's look at the power of the test. We know that β = 0.13, so power is 1 − β = 0.87.
It's very useful to draw up a few power curves when planning your tests so you can plan more intelligently and avoid costly mistakes.
There are four variables you need to consider for charting power curves.
Effect size: the size of the effect you are trying to detect. A large effect is easy to detect, so a small sample size will suffice. Small effects are not so easy to detect and require larger samples.
The effect size is standardized. In our test, we were looking at a change of means from 39.70 to 39.715, a difference of 0.015. If we divide this difference by the standard deviation, we get 0.015 / 0.020 = 0.75. This standardized effect is termed Cohen's d.
Sample size: the number of items to be sampled.
Alpha: also called the significance level. This is the risk of rejecting the null hypothesis when it is true. Often set to 5% or 0.05.
Power: the probability of detecting an effect when the effect truly exists.
When you know, or decide on, three of these variables, you can calculate the fourth.
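For a one-sided, one-sample z-test there is a closed-form relationship among the four, so solving for the missing one is straightforward. A minimal sketch, assuming you want the sample size needed to reach a target power for a given standardized effect size:

```python
from math import ceil
from scipy.stats import norm

def sample_size(effect_size, alpha=0.01, power=0.80):
    """Smallest n for a one-sided, one-sample z-test to reach the target power."""
    z_alpha = norm.ppf(1 - alpha)   # critical value tied to the alpha risk
    z_power = norm.ppf(power)       # quantile matching the desired power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

print(sample_size(0.75))   # Cohen's d from our example -> a modest n
print(sample_size(0.20))   # a small effect -> a much larger n
```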
We will plot two of these variables on the chart: sample size on the x-axis and power on the y-axis.
To incorporate different effect sizes, we can use different coloured lines. The alpha risk will be 0.01.
To generate power curves for our situation, I will use the following:
sample sizes from 5 to 25
effect sizes 0.2, 0.5, 0.75 and 1.0
alpha = 0.01
The red horizontal line in the plot is set to 0.80 (80% Power) on the y-axis as a reference line. Many texts use 80% as a reasonable value for power.
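Here is a minimal sketch in Python (numpy, scipy, matplotlib) of how such curves could be drawn, assuming a one-sided, one-sample z-test with the settings listed above; a simulation-based approach would give very similar curves.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

alpha = 0.01
sample_sizes = np.arange(5, 26)                 # 5 to 25
effect_sizes = [0.2, 0.5, 0.75, 1.0]            # Cohen's d values
z_alpha = norm.ppf(1 - alpha)                   # one-sided critical value

for d in effect_sizes:
    # Power of a one-sided, one-sample z-test at each sample size.
    power = norm.cdf(d * np.sqrt(sample_sizes) - z_alpha)
    plt.plot(sample_sizes, power, label=f"effect size = {d}")

plt.axhline(0.80, color="red", linestyle="--")  # 80% power reference line
plt.xlabel("Sample size (n)")
plt.ylabel("Power")
plt.title("Power curves, alpha = 0.01")
plt.legend()
plt.show()
```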
These line plots help you to understand the relationship among sample size, effect size and the alpha and beta risks that are always a part of testing.
Thanks for reading.
Feel free to tag your colleagues who may be interested and to re-post this to your network.
Chris Butterworth, MBB
Industrial Problem Solving Course Developer
Belfield Academy