Problem in Science
In virtually all areas of science, researchers use p-values to quantify the significance of their results. However, not everyone understands what a p-value actually means, and that misunderstanding can lead to misleading claims in research. Fortunately, the idea really isn't that difficult.
Let's say you're trying to check whether a coin is fair. The problem is that there's no way to prove it is fair; we can only gather evidence against its fairness. So the only thing we can do is collect data: we flip the coin 100 times and find that 51 flips are heads. A fair coin could easily produce data like that. If we had gotten 80/100 heads, on the other hand, it's almost impossible for a fair coin to produce that result.
Aside: Some Terminology
Null hypothesis = What you're trying to disprove. In this problem, "the coin is fair"
Alternative hypothesis = Conclusion if you reject the null hypothesis. In this problem, "the coin is unfair."
Below is the distribution of the number of heads in 100-coin-flip samples. If you flipped a fair coin 100 times, counted the heads, and repeated that for thousands of samples, the counts would follow the distribution pictured: sometimes more heads, sometimes fewer.
Just by random variation, a fair coin will give you a result at least as extreme as this one (>= 51/100 heads or <= 49/100 heads) about 84% of the time.
In this problem, the p-value is 0.84: assuming the coin is fair, there's an 84% chance of seeing a result at least this extreme. That's very high, so we have no reason to doubt the fairness of the coin.
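That 84% figure can be reproduced with a short sketch. This assumes the distribution above is the normal approximation to the binomial (which matches the 0.84 figure); the helper name `two_sided_p` is my own, not from the post:

```python
from math import erf, sqrt

def two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation to Binomial(n, p0)."""
    sd = sqrt(n * p0 * (1 - p0))          # standard deviation of the head count
    z = abs(heads - n * p0) / sd          # how many SDs from the expected count
    phi = 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF at z
    return 2 * (1 - phi)                  # chance of a result at least this extreme

print(round(two_sided_p(51, 100), 2))  # 0.84
```

An exact binomial test would give a somewhat different number; the normal approximation is used here only to match the figure in the text.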
Let's say you had flipped 60/100 heads. Now, you're less convinced the coin is fair. Once again, if we want to show that the coin isn't fair, we have to show that this scenario would be extremely unlikely for a fair coin.
Here is the distribution of % heads with 100 flips, just like before. Now, we want the chance of getting a result at least as extreme as 60 heads (>= 60 or <= 40 heads). That chance is 4.5%. So, it reasonably could happen, but it's unlikely.
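The 4.5% figure can be checked the same way. This sketch again assumes the normal approximation to the binomial; `two_sided_p` is a hypothetical helper name:

```python
from math import erf, sqrt

def two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation to Binomial(n, p0)."""
    sd = sqrt(n * p0 * (1 - p0))
    z = abs(heads - n * p0) / sd
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# 60 heads is 2 standard deviations from the expected 50
print(two_sided_p(60, 100))  # ≈ 0.0455
```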
Alpha Level/Significance Level
We see that 4.5% is low, but where do we draw the line? That is to say, when do we declare an event so rare that the coin probably isn't fair? Often, people draw the line at 5%. Another way of putting this: if there's a less than 5% chance of an event happening under the assumption of a fair coin, we conclude the coin probably isn't fair.
However, a fair coin still produces results that extreme 5% of the time, so declaring such an event "too rare to reasonably happen" will sometimes mean incorrectly rejecting the null hypothesis (recall, null hypothesis = the coin is fair). This is called Type I error.
Type I Error
This means rejecting the null hypothesis when it is in fact true. This can also be called a "false positive," because you reject when you shouldn't. In this problem, if you flipped 60/100 heads with a fair government-issued coin, you'd reject fairness, even though a fair coin can produce that result.
The chance of a Type I error equals the significance level. With a significance level of 5%, every result that has less than a 5% chance of occurring under the null hypothesis leads to rejection, which means 5% of samples drawn from a true null hypothesis will be wrongly rejected.
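This can be checked by simulation: flip a truly fair coin 100 times, compute the p-value, and count how often we reject at the 5% level. A minimal sketch, assuming the normal-approximation p-value from before (because coin-flip counts are discrete, the simulated rate lands near 5% rather than exactly on it):

```python
import random
from math import erf, sqrt

def two_sided_p(heads, n, p0=0.5):
    sd = sqrt(n * p0 * (1 - p0))
    z = abs(heads - n * p0) / sd
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(0)
trials = 10_000
rejections = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(100))  # a genuinely fair coin
    if two_sided_p(heads, 100) < 0.05:                      # we reject anyway
        rejections += 1

print(rejections / trials)  # close to 0.05: the Type I error rate
```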
When is this bad?
Let's say you're a company that makes medicine, and you want to check if it's actually effective. So, you do some analysis, and find that it makes a change with p < 0.05.
In this problem, the null hypothesis is that it's ineffectual, so you conclude that the medicine works. However, if that turns out to be a false positive, that could literally cost someone their life.
How do you fix it?
Lower the significance level below p = 0.05. As stated above, the percent chance of a type I error is equal to the significance level.
Type II Error
This means failing to reject the null hypothesis when it is false. Look at the same problem from before: we flipped 100 coins and got 51 heads. Our analysis said the coin isn't rigged, but what if the coin was actually rigged so that heads comes up 51% of the time? We messed up.
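How often would we miss a coin rigged this slightly? A quick Monte Carlo sketch (my own illustration, not the calculation method the post promises to cover), again using the normal-approximation p-value:

```python
import random
from math import erf, sqrt

def two_sided_p(heads, n, p0=0.5):
    sd = sqrt(n * p0 * (1 - p0))
    z = abs(heads - n * p0) / sd
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(0)
trials = 10_000
misses = 0
for _ in range(trials):
    heads = sum(random.random() < 0.51 for _ in range(100))  # rigged: 51% heads
    if two_sided_p(heads, 100) >= 0.05:   # fail to reject "the coin is fair"
        misses += 1

print(misses / trials)  # roughly 0.94: with 100 flips we almost always miss this bias
```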
I'll have another post on how to calculate it.
When is this bad?
Let's say you're testing for cancer, and you get a p-value of 0.06. At a 5% significance level you'd fail to reject the null hypothesis and conclude the patient doesn't have cancer; if they really do, that mistake can be deadly.
How do you fix it?
Increase sample size. The basic idea is that seeing 51/100 heads (p ≈ .84) is much less conclusive than 510/1000 heads (p ≈ .53) or 5100/10000 heads (p ≈ .0455).
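The same 51% heads rate gives a shrinking p-value as the sample grows. A sketch using the normal-approximation helper from earlier:

```python
from math import erf, sqrt

def two_sided_p(heads, n, p0=0.5):
    sd = sqrt(n * p0 * (1 - p0))
    z = abs(heads - n * p0) / sd
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# The same 51% heads rate at three different sample sizes
for n in (100, 1_000, 10_000):
    print(n, round(two_sided_p(int(0.51 * n), n), 4))
```

With a fixed bias, the z-score grows like the square root of the sample size, so eventually even a tiny bias becomes detectable.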
Multiple Comparisons Problem
Let's say you're a granola bar company, and you want to advertise your product as something that's good for you. You know your bar does nothing, so you exploit science to your advantage: you test for everything under the sun: cholesterol, weight, vitamins A through K, blood pressure. Run enough tests at the 5% level and some are bound to come up "significant" by chance alone; with 20 independent tests, you expect about one false positive.
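The "at least one false positive" chance is easy to compute and simulate. A sketch, assuming 20 independent tests on a product with no real effect (so each p-value is uniform on [0, 1]):

```python
import random

random.seed(1)
alpha, tests, trials = 0.05, 20, 10_000

# Analytic chance of at least one false positive among 20 null tests
analytic = 1 - (1 - alpha) ** tests   # ≈ 0.64

# Simulation: under the null, each test's p-value is uniform on [0, 1]
hits = sum(
    any(random.random() < alpha for _ in range(tests))
    for _ in range(trials)
)
print(analytic, hits / trials)
```

So a company running 20 health tests on an inert granola bar has roughly a 64% chance of finding something to advertise.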
There's an interesting 538 article about p-hacking in nutrition.