P-Value: Definition, Calculation, Interpretation, and Examples
Key takeaways
* A p-value measures the probability of obtaining data at least as extreme as observed, assuming the null hypothesis is true.
* Lower p-values provide stronger evidence against the null hypothesis; a common threshold for “statistical significance” is 0.05, but this is arbitrary.
* P-values do not measure the probability that the null hypothesis is true, nor do they indicate effect size or practical importance.
* Report p-values alongside effect sizes and confidence intervals, and interpret them in context.
What is a p-value?
A p-value (probability value) quantifies how compatible observed data are with a specified null hypothesis. Formally, it is the probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming the null hypothesis is true.
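As an illustration of this definition, the Python sketch below approximates a p-value by simulation: it draws many samples under a fully specified null hypothesis and counts how often the simulated statistic is at least as extreme as the observed one. The sample size, the null distribution, and the "observed" data are assumptions chosen purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed sample; the null hypothesis says the true mean is 0
# and the data are normal with known standard deviation 1.
observed = rng.normal(loc=0.3, scale=1.0, size=30)
obs_stat = abs(observed.mean())  # two-sided statistic: distance of the sample mean from 0

# Simulate many samples under the null and record the same statistic.
null_stats = np.abs(rng.normal(loc=0.0, scale=1.0, size=(100_000, 30)).mean(axis=1))

# Approximate p-value: fraction of null statistics at least as extreme as the observed one.
p_value = (null_stats >= obs_stat).mean()
print(f"simulated two-sided p-value ≈ {p_value:.4f}")
```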
What is it used for?
P-values are used in hypothesis testing to assess evidence against a null hypothesis. They:
* Help determine whether observed patterns could plausibly arise by chance.
* Allow readers to judge significance without relying on a single preselected significance level.
* Are widely used in scientific research, government reports, and applied statistics to support or question findings.
How is a p-value calculated (overview)?
P-values are computed from the sampling distribution of a chosen test statistic (e.g., t, z, chi-square). The general steps:
1. Specify the null hypothesis and the appropriate test statistic.
2. Determine the sampling distribution of that statistic under the null (shape depends on sample size and degrees of freedom).
3. Compute the observed test statistic from the data.
4. Calculate the probability (area under the distribution curve) of values at least as extreme as the observed statistic.
This area, typically obtained from statistical software or tables, is the p-value.
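For concreteness, the minimal sketch below walks through these steps for a one-sample t-test with SciPy; the data and the hypothesized mean of 5.0 are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=5.3, scale=1.2, size=25)  # hypothetical sample

# Steps 1-3: null hypothesis "population mean = 5.0"; the test statistic is the t statistic.
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)  # two-tailed by default

# Step 4 by hand: area under the t distribution (df = n - 1) beyond |t| in both tails.
df = len(data) - 1
p_manual = 2 * stats.t.sf(abs(t_stat), df)

print(f"t = {t_stat:.3f}, p (SciPy) = {p_value:.4f}, p (manual) = {p_manual:.4f}")
```

The manual tail-area calculation and the library result agree, which shows that the p-value is nothing more than an area under the null sampling distribution.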
Types of tests
* One-tailed (lower or upper): measures extremeness in one direction.
* Two-tailed: measures extremeness in both directions (common when any deviation from the null matters).
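The brief sketch below shows how the tail choice changes the p-value computed from the same z statistic; the value z = 1.8 is arbitrary.

```python
from scipy import stats

z = 1.8  # hypothetical observed z statistic

p_upper = stats.norm.sf(z)           # upper-tailed: P(Z >= z)
p_lower = stats.norm.cdf(z)          # lower-tailed: P(Z <= z)
p_two = 2 * stats.norm.sf(abs(z))    # two-tailed: P(|Z| >= |z|)

print(f"upper-tailed p = {p_upper:.4f}")
print(f"lower-tailed p = {p_lower:.4f}")
print(f"two-tailed p   = {p_two:.4f}")
```

For a symmetric null distribution, the two-tailed p-value is simply twice the smaller of the two one-tailed values.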
Interpretation and common pitfalls
* A small p-value (e.g., 0.01) indicates that the observed data are unlikely under the null hypothesis and provides evidence against it.
* A p-value does not give the probability that the null hypothesis is true or false.
* Statistical significance (e.g., p < 0.05) is not the same as practical importance; consider effect sizes and confidence intervals.
* P-values depend on sample size — very large samples can produce small p-values for trivial effects; small samples may fail to detect meaningful effects.
* Multiple testing inflates the chance of false positives; apply correction methods (e.g., Bonferroni, false discovery rate) when performing many tests, as in the sketch after this list.
* Replication and study design quality matter: a low p-value from a single study is not definitive proof.
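To make the multiple-testing point concrete, the sketch below applies Bonferroni and Benjamini-Hochberg (false discovery rate) corrections to a hypothetical set of p-values using statsmodels; the p-values are invented for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from ten independent tests.
p_values = [0.001, 0.008, 0.020, 0.041, 0.049, 0.100, 0.250, 0.450, 0.700, 0.900]

# Bonferroni controls the family-wise error rate (most conservative).
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate (less conservative).
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, rb, rh in zip(p_values, reject_bonf, reject_bh):
    print(f"p = {p:.3f}  Bonferroni reject: {rb}  BH reject: {rh}")
```

Note how several raw p-values below 0.05 no longer count as significant once the corrections are applied.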
Practical example
Suppose an investor tests whether a portfolio’s returns equal the S&P 500’s returns using a two-tailed test.
* Null hypothesis: portfolio returns = S&P 500 returns.
* Alternative: portfolio returns ≠ S&P 500 returns.
If the computed p-value is 0.001, then under the null hypothesis the chance of observing returns as different as those seen is 0.1% — strong evidence against the null. If another portfolio yields p = 0.10, evidence against the null is weak. Comparing p-values across portfolios lets an investor judge relative evidence, but should be combined with effect size and practical considerations.
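A minimal sketch of how such a test might be coded is shown below, using simulated monthly returns and a two-tailed paired t-test on the matched return series; all numbers are illustrative, not real market data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical monthly returns for the S&P 500 and a portfolio over five years.
sp500 = rng.normal(loc=0.007, scale=0.04, size=60)
portfolio = sp500 + rng.normal(loc=0.002, scale=0.02, size=60)

# Null hypothesis: the mean difference between the matched returns is zero.
t_stat, p_value = stats.ttest_rel(portfolio, sp500)

print(f"t = {t_stat:.3f}, two-tailed p-value = {p_value:.4f}")
```

A small p-value here would suggest the portfolio's returns differ from the index's, but the size of that difference (the effect) still needs to be judged separately.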
Common questions
Is p = 0.05 significant?
* A p-value below 0.05 is traditionally considered statistically significant, but the threshold is arbitrary; a value of exactly 0.05 sits on that borderline and, under the strict p < 0.05 convention, would not be called significant. Report the exact p-value so readers can apply their own criteria.
What does p = 0.001 mean?
* If the null is true, there is a 0.1% chance of observing data at least as extreme as what was observed. This provides strong evidence against the null under the test assumptions.
How to compare p-values from different tests?
* Smaller p-values indicate stronger evidence against their respective null hypotheses. However, comparison is meaningful only when tests, models, and sample sizes are comparable; also consider effect sizes and confidence intervals.
Best practices
* Report exact p-values (not just “significant” or “not significant”).
* Always include effect sizes and confidence intervals.
* Predefine hypotheses and analysis plans when possible to reduce bias.
* Account for multiple comparisons when applicable.
* Interpret p-values alongside study design, data quality, and subject-matter context.
Conclusion
P-values are a useful tool for assessing how surprising observed data are under a null hypothesis, but they are not a standalone verdict. Use p-values with effect sizes, confidence intervals, transparent methods, and proper corrections for multiple testing. Replication and careful study design remain essential for reliable scientific conclusions.