The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.
This test is also known as:
This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories.
Several tests go by the name "chi-square test" in addition to the Chi-Square Test of Independence. Use the data and the research question as context clues to determine which form of the chi-square test is being used.
The Chi-Square Test of Independence is commonly used to test the following:
The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables, and cannot provide any inferences about causation.
If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate. This is because the assumption of the independence of observations is violated. In this situation, McNemar's Test is appropriate.
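For paired data of this kind, PROC FREQ can produce McNemar's Test through the AGREE option when the table is 2x2. The sketch below is illustrative only: the dataset `work.prepost` and the variables `PreTest` and `PostTest` are hypothetical names, not part of the sample dataset.

```sas
/* Hypothetical paired data: one row per subject, with the same
   two-category outcome recorded before and after. */
PROC FREQ DATA=work.prepost;
    /* AGREE requests McNemar's test when the table is 2x2 */
    TABLES PreTest*PostTest / AGREE;
RUN;
```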
Your data must meet the following requirements:
The null hypothesis (H_{0}) and alternative hypothesis (H_{1}) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:
H_{0}: "[Variable 1] is independent of [Variable 2]"
H_{1}: "[Variable 1] is not independent of [Variable 2]"
OR
H_{0}: "[Variable 1] is not associated with [Variable 2]"
H_{1}: "[Variable 1] is associated with [Variable 2]"
Your dataset should have the following structure:
The test statistic for the Chi-Square Test of Independence is denoted Χ^{2}, and is computed as:
$$ \chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}} $$
where
\(o_{ij}\) is the observed cell count in the i^{th} row and j^{th} column of the table
\(e_{ij}\) is the expected cell count in the i^{th} row and j^{th} column of the table, computed as
$$ e_{ij} = \frac{\textrm{row } i \textrm{ total} \times \textrm{col } j \textrm{ total}}{\textrm{grand total}} $$
The quantity (o_{ij} - e_{ij}) is sometimes referred to as the residual of cell (i, j), denoted \(r_{ij}\).
The calculated Χ^{2} value is then compared to the critical value from the Χ^{2} distribution with degrees of freedom df = (R - 1)(C - 1) at the chosen significance level α. If the calculated Χ^{2} value is greater than the critical Χ^{2} value, then we reject the null hypothesis.
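The decision rule can also be checked without a printed distribution table. The DATA step below is a minimal sketch that uses the SAS functions CINV (chi-square quantile) and PROBCHI (chi-square cumulative probability). The test statistic `chisq = 4.50` is a made-up value for illustration, and the variable names are arbitrary.

```sas
DATA _NULL_;
    alpha = 0.05;                       /* chosen significance level */
    R = 2;                              /* number of row categories */
    C = 2;                              /* number of column categories */
    df = (R - 1) * (C - 1);             /* degrees of freedom */
    chisq = 4.50;                       /* hypothetical test statistic */
    critical = CINV(1 - alpha, df);     /* critical value of the chi-square distribution */
    p_value = 1 - PROBCHI(chisq, df);   /* upper-tail probability */
    reject = (chisq > critical);        /* 1 = reject H0, 0 = fail to reject */
    PUT df= critical= 8.4 p_value= 8.4 reject=;
RUN;
```

For a 2x2 table (df = 1) and α = 0.05 the critical value is approximately 3.84, so any test statistic above that leads to rejecting the null hypothesis. Comparing the p-value to α gives the same decision.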
In SAS, the general form of the syntax is
PROC FREQ DATA=dataset-name;
    TABLES rowVar*colVar / CHISQ;
RUN;
The CHISQ option is added to the TABLES statement after the slash (/) character.
Many of PROC FREQ's most useful options have been covered in the tutorials on Frequency Tables and Crosstabs, but there are several additional options that can be useful when conducting a chi-square test of independence:
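EXPECTED: displays the expected cell count under the null hypothesis of independence.
DEVIATION: displays the deviation of each cell, i.e., the observed count minus the expected count.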
Let's continue the row and column percentage example from the Crosstabs tutorial, which described the relationship between the variables RankUpperUnder (upperclassman/underclassman) and LiveOnCampus (lives on campus/lives off-campus). Recall that the column percentages of the crosstab appeared to indicate that upperclassmen were less likely than underclassmen to live on campus.
Suppose that we want to test the association between class rank and living on campus using a Chi-Square Test of Independence (using α = 0.05).
PROC FREQ DATA=work.sample;
    TABLES RankUpperUnder*LiveOnCampus / CHISQ EXPECTED DEVIATION NOROW NOCOL NOPERCENT;
RUN;
The first table in the output is the crosstabulation. If you included the EXPECTED and DEVIATION options in your syntax, you should see the following:
With the Expected Count values shown, we can confirm that all cells have an expected value greater than 5.
| | Off-Campus | On-Campus | Total |
|---|---|---|---|
| Underclassman | Row 1, column 1: \(o_{11} = 79\); \(e_{11} = \frac{227 \times 231}{388} = 135.147\); \(r_{11} = 79 - 135.147 = -56.147\) | Row 1, column 2: \(o_{12} = 148\); \(e_{12} = \frac{227 \times 157}{388} = 91.853\); \(r_{12} = 148 - 91.853 = 56.147\) | row 1 total = 227 |
| Upperclassman | Row 2, column 1: \(o_{21} = 152\); \(e_{21} = \frac{161 \times 231}{388} = 95.853\); \(r_{21} = 152 - 95.853 = 56.147\) | Row 2, column 2: \(o_{22} = 9\); \(e_{22} = \frac{161 \times 157}{388} = 65.147\); \(r_{22} = 9 - 65.147 = -56.147\) | row 2 total = 161 |
| Total | col 1 total = 231 | col 2 total = 157 | grand total = 388 |
These numbers can be plugged into the chi-square test statistic formula:
$$ \chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}} = \frac{(-56.147)^{2}}{135.147} + \frac{(56.147)^{2}}{91.853} + \frac{(56.147)^{2}}{95.853} + \frac{(-56.147)^{2}}{65.147} = 138.926 $$
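The same arithmetic can be reproduced in a short DATA step, which avoids the rounding that creeps in when the intermediate values are copied by hand. This is only a sketch of the hand calculation, not what PROC FREQ runs internally. The variable names (`o11`, `e11`, and so on) are chosen here to mirror the notation above.

```sas
DATA _NULL_;
    /* Observed counts from the crosstab */
    o11 = 79;   o12 = 148;
    o21 = 152;  o22 = 9;

    /* Marginal totals */
    row1 = o11 + o12;   row2 = o21 + o22;
    col1 = o11 + o21;   col2 = o12 + o22;
    n = row1 + row2;

    /* Expected counts: (row total * column total) / grand total */
    e11 = row1 * col1 / n;   e12 = row1 * col2 / n;
    e21 = row2 * col1 / n;   e22 = row2 * col2 / n;

    /* Sum of squared residuals divided by expected counts */
    chisq = (o11 - e11)**2 / e11 + (o12 - e12)**2 / e12
          + (o21 - e21)**2 / e21 + (o22 - e22)**2 / e22;

    PUT e11= 8.3 e12= 8.3 e21= 8.3 e22= 8.3;
    PUT chisq= 8.3;
RUN;
```

The values written to the log should agree with the expected counts in the table and with the test statistic computed above.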
We can confirm this computation with the results in the table labeled Statistics for Table of RankUpperUnder by LiveOnCampus:
The row of interest here is Chi-Square.
Since the p-value is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that there is an association between class rank and whether or not students live on-campus.
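If only the four cell counts are available rather than the case-level data, the test can be rerun from summarized data by supplying the counts through a WEIGHT statement. In the sketch below, the dataset name `work.rank_counts`, the variable `Count`, and the character coding of the two categorical variables are assumptions made for illustration. The sample dataset itself may store these variables differently.

```sas
/* Summarized data: one row per cell of the crosstab */
DATA work.rank_counts;
    LENGTH RankUpperUnder $ 13 LiveOnCampus $ 10;  /* avoid truncating category labels */
    INPUT RankUpperUnder $ LiveOnCampus $ Count;
    DATALINES;
Underclassman Off-Campus 79
Underclassman On-Campus 148
Upperclassman Off-Campus 152
Upperclassman On-Campus 9
;
RUN;

PROC FREQ DATA=work.rank_counts;
    WEIGHT Count;   /* treat each row as Count cases */
    TABLES RankUpperUnder*LiveOnCampus / CHISQ EXPECTED DEVIATION NOROW NOCOL NOPERCENT;
RUN;
```

Because the cell counts are identical to those in the crosstab above, the Chi-Square row of the output should match the result from the case-level data.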
Based on the results, we can state the following: