Unit 8: Inference for Categorical Data: Chi-Square
Chi-square procedures analyze categorical data summarized as counts in categories. They compare observed counts to expected counts and ask whether the differences are larger than random variation would usually produce.
The Chi-Square Distribution
The chi-square distribution is a family of right-skewed distributions indexed by degrees of freedom. Chi-square values are always nonnegative because they are built from squared differences.
As degrees of freedom increase, the distribution becomes less skewed. For a chi-square distribution,
\[\mu = df\]
and
\[\sigma = \sqrt{2df}.\]
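As a quick check of the two formulas above, here is a minimal Python sketch. The function name `chi_square_mean_sd` is made up for illustration and is not part of any library.

```python
import math

def chi_square_mean_sd(df):
    """Return (mean, standard deviation) for a chi-square distribution."""
    if df <= 0:
        raise ValueError("df must be positive")
    # mean = df, standard deviation = sqrt(2 * df)
    return df, math.sqrt(2 * df)

mean, sd = chi_square_mean_sd(8)
print(mean, sd)  # 8 4.0
```

With 8 degrees of freedom, a chi-square statistic near 8 is typical, and values more than a couple of standard deviations above 8 are unusual.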
The Chi-Square Statistic
All AP chi-square tests use the same general statistic:
\[\chi^2 = \sum \frac{(O-E)^2}{E}.\]
Here \(O\) is an observed count and \(E\) is an expected count. Large values of \(\chi^2\) indicate that observed counts are far from expected counts.
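The statistic can be computed directly from two lists of counts. The sketch below uses made-up counts purely for illustration; the function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories or cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative counts only: three categories, 60 observations in total.
observed = [30, 20, 10]
expected = [20, 20, 20]

# (30-20)^2/20 + (20-20)^2/20 + (10-20)^2/20 = 5 + 0 + 5
print(chi_square_statistic(observed, expected))  # 10.0
```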
Chi-square tests are right-tailed: the p-value is the probability of getting a chi-square statistic at least as large as the observed one.
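The right-tail area can be approximated without a calculator. The following is a rough sketch that uses only the Python standard library and a series for the incomplete gamma function. The function name `chi_square_right_tail` is invented here, and the approach is only one of several; if SciPy is available, `scipy.stats.chi2.sf(x, df)` gives the same tail probability.

```python
import math

def chi_square_right_tail(x, df, max_terms=1000):
    """Approximate P(chi-square with df degrees of freedom >= x)."""
    if df <= 0:
        raise ValueError("df must be positive")
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    # Series for the lower incomplete gamma function:
    # sum over n of z^n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    # Scale by z^a * e^(-z) / Gamma(a) to get the left-tail probability.
    lower_tail = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower_tail))

# Illustrative values: chi-square = 10 with df = 2.
print(round(chi_square_right_tail(10.0, 2), 4))  # 0.0067
```

Because the p-value is computed as one minus the left-tail area, this sketch loses precision for extremely small p-values, which rarely matters at the usual significance levels.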
Conditions For Chi-Square Tests
Common conditions:
- Counts come from a random sample, random assignment, or randomized process.
- Observations are independent. If sampling without replacement, check the 10% Condition: the sample is no more than 10% of the population.
- Expected counts are large enough. The usual AP Statistics rule is that every expected count is at least 5.
Use counts, not proportions or percentages, in the chi-square statistic.
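The two numerical conditions can be checked mechanically. This is a small sketch with hypothetical helper names (`large_counts_ok`, `ten_percent_ok`); the random-sampling condition still has to be judged from the study description. The 10% check here assumes the usual statement that the sample is at most 10% of the population.

```python
def large_counts_ok(expected, minimum=5):
    """True when every expected count is at least the minimum (5 by default)."""
    return all(e >= minimum for e in expected)

def ten_percent_ok(sample_size, population_size):
    """True when the sample is no more than 10% of the population."""
    return 10 * sample_size <= population_size

# Illustrative numbers only.
print(large_counts_ok([12.5, 7.0, 5.0]))  # True
print(large_counts_ok([12.5, 4.9]))       # False
print(ten_percent_ok(80, 1000))           # True
```

Note that the check is applied to expected counts, not observed counts.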
Goodness-Of-Fit Test
A chi-square goodness-of-fit test checks whether one categorical variable follows a claimed distribution.
Hypotheses:
- \(H_0\): The population distribution matches the claimed proportions.
- \(H_a\): The population distribution does not match the claimed proportions.
Expected counts are
\[E_i = n p_i,\]
where \(n\) is the sample size and \(p_i\) is the claimed proportion for category \(i\).
Degrees of freedom:
\[df = k-1,\]
where \(k\) is the number of categories.
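A goodness-of-fit calculation might look like the sketch below. The claimed proportions and observed counts are invented for illustration, and `goodness_of_fit` is a hypothetical helper, not a library function.

```python
def goodness_of_fit(observed, claimed_proportions):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    if len(observed) != len(claimed_proportions):
        raise ValueError("need one claimed proportion per category")
    n = sum(observed)
    expected = [n * p for p in claimed_proportions]  # E_i = n * p_i
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                           # df = k - 1
    return expected, statistic, df

# Illustrative data: 200 observations, claimed split 50% / 30% / 20%.
observed = [90, 70, 40]
expected, statistic, df = goodness_of_fit(observed, [0.5, 0.3, 0.2])

print([round(e, 1) for e in expected])  # [100.0, 60.0, 40.0]
print(round(statistic, 3), df)          # 2.667 2
```

Here the statistic is 100/100 + 100/60 + 0, and every expected count is at least 5, so the large-counts condition holds for this made-up example.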

Test Of Independence
A chi-square test of independence checks whether two categorical variables are associated in one population.
Hypotheses:
- \(H_0\): The two variables are independent in the population.
- \(H_a\): The two variables are associated in the population.
Expected count for each cell:
\[E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}.\]
Degrees of freedom:
\[df = (r-1)(c-1),\]
where \(r\) is the number of rows and \(c\) is the number of columns.
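For a two-way table, the expected counts and degrees of freedom follow directly from the row and column totals. The sketch below uses a made-up 2 × 3 table stored as a list of rows; the function names are hypothetical.

```python
def expected_counts(table):
    """Expected count per cell: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def two_way_df(table):
    """Degrees of freedom: (rows - 1)(columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Illustrative observed counts: 2 rows, 3 columns, 120 individuals.
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

print(expected_counts(observed))  # [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]
print(two_way_df(observed))       # 2
```

The totals row and column are not part of the table passed in; only the observed cell counts are.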
Test For Homogeneity
A chi-square test for homogeneity compares the distribution of one categorical variable across two or more populations or treatments.
Hypotheses:
- \(H_0\): The category distribution is the same for all populations/treatments.
- \(H_a\): At least one population/treatment has a different distribution.
The expected count formula is the same as for independence:
\[E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}.\]
Degrees of freedom:
\[df = (r-1)(c-1).\]
Independence Versus Homogeneity
The calculations for independence and homogeneity are identical, but the study design and conclusion are different.
| Test | Data source | Question |
|---|---|---|
| Independence | One random sample, classify each individual by two variables | Are the variables associated? |
| Homogeneity | Separate random samples or treatments, classify one variable | Are the distributions the same across groups? |
If the problem has one sample and two categorical variables, think independence. If the problem has multiple samples or treatment groups and one categorical outcome, think homogeneity.
Interpreting Contributions
Each cell’s contribution is
\[\frac{(O-E)^2}{E}.\]
Cells with large contributions explain most of the chi-square statistic. After rejecting a null hypothesis, inspect which cells have observed counts much larger or smaller than expected to describe the direction of the association or difference.
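One way to see which cells matter is to list each contribution next to the sign of \(O-E\). The counts below are invented for illustration, and `cell_contributions` is a hypothetical helper.

```python
def cell_contributions(observed, expected):
    """Return (O - E)^2 / E for every cell of a two-way table."""
    return [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]

# Illustrative 2 x 3 table with its expected counts.
observed = [[10, 20, 30], [20, 20, 20]]
expected = [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]

contributions = cell_contributions(observed, expected)
for i, row in enumerate(contributions):
    for j, value in enumerate(row):
        difference = observed[i][j] - expected[i][j]  # sign gives the direction
        print(f"cell ({i}, {j}): O - E = {difference:+.1f}, contribution = {value:.3f}")

# The contributions add up to the chi-square statistic.
print(round(sum(sum(row) for row in contributions), 3))  # 5.333
```

In this made-up table, the first column contributes the most: the first row has fewer observations than expected there and the second row has more.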
Calculator Notes
Common calculator tools:
- χ²GOF-Test: goodness-of-fit test.
- χ²-Test: test of independence or homogeneity using a matrix of observed counts.
For two-way tables, store observed counts in a matrix, run the test, and inspect the expected-count matrix to check conditions.
Working Checklist
- Identify the test: goodness-of-fit, independence, or homogeneity.
- State hypotheses in context.
- Calculate expected counts and check conditions.
- Compute \(\chi^2\) and degrees of freedom.
- Find the right-tail p-value.
- Conclude in context.
- If significant, describe which categories/cells drive the result.
Key Equations
| Idea | Equation |
|---|---|
| Chi-square statistic | \(\chi^2=\sum \frac{(O-E)^2}{E}\) |
| GOF expected count | \(E_i=np_i\) |
| Two-way expected count | \(E=\frac{(\text{row total})(\text{column total})}{\text{grand total}}\) |
| GOF degrees of freedom | \(df=k-1\) |
| Two-way table degrees of freedom | \(df=(r-1)(c-1)\) |