LibGuides: SAS Tutorials: Chi-Square Test of Independence

Chi-Square Test of Independence

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.

This test is also known as the Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories.

There are several tests that go by the name "chi-square test" in addition to the Chi-Square Test of Independence. Look for context clues in the data and research question to determine which form of the chi-square test is being used.

Common Uses

The Chi-Square Test of Independence is commonly used to test the following:

Statistical independence or association between two categorical variables.

The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables, and cannot support any inferences about causation.

If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate. This is because the assumption of the independence of observations is violated. In this situation, McNemar's Test is appropriate.
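For paired data of this kind, PROC FREQ can produce McNemar's Test through the AGREE option on the TABLES statement. The following is a minimal sketch only: the dataset work.paired, the variables PreTest and PostTest, and the cell counts are all hypothetical and exist purely to make the example runnable.

```sas
/* Hypothetical paired data: each row is one cell of the pre/post table,
   and Count is the number of subjects in that cell (made-up values). */
DATA work.paired;
    INPUT PreTest $ PostTest $ Count;
    DATALINES;
Pass Pass 20
Pass Fail 5
Fail Pass 15
Fail Fail 10
;
RUN;

/* AGREE requests agreement statistics; for a 2x2 table this
   includes McNemar's Test. WEIGHT tells PROC FREQ that each
   row stands for Count subjects rather than one. */
PROC FREQ DATA=work.paired;
    WEIGHT Count;
    TABLES PreTest*PostTest / AGREE;
RUN;
```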

Data Requirements

Your data must meet the following requirements:

Two categorical variables.
Two or more categories (groups) for each variable.
Independence of observations.
- There is no relationship between the subjects in each group.
- The categorical variables are not "paired" in any way (e.g. pre-test/post-test observations).
Relatively large sample size.
- Expected frequencies for each cell are at least 1.
- Expected frequencies should be at least 5 for the majority (80%) of the cells.

Hypotheses

The null hypothesis (H₀) and alternative hypothesis (H₁) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:

H₀: "[Variable 1] is independent of [Variable 2]"
H₁: "[Variable 1] is not independent of [Variable 2]"

H₀: "[Variable 1] is not associated with [Variable 2]"
H₁: "[Variable 1] is associated with [Variable 2]"

Data Set-Up

Your dataset should have the following structure:

Each row represents one subject, and each subject appears exactly once in the dataset. That is, each row is an observation from a unique subject.
The dataset contains at least two nominal categorical variables (string or numeric). The categorical variables used in the test must have two or more categories; they should not have so many categories that the expected counts in individual cells become very small.
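As an illustration of this layout, the sketch below builds a tiny dataset with one row per subject and two categorical variables, then cross-tabulates them. The dataset name work.demo, the variable names, and the values are all made up for illustration; the sample is far too small for a chi-square test, so the CHISQ option is deliberately left out.

```sas
/* Hypothetical data: one row per subject, each subject appears once. */
DATA work.demo;
    INPUT ID Smoker $ Gender $;
    DATALINES;
1 Yes Male
2 No Female
3 No Male
4 Yes Female
5 No Female
;
RUN;

/* Cross-tabulate the two categorical variables to inspect the layout. */
PROC FREQ DATA=work.demo;
    TABLES Smoker*Gender;
RUN;
```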

Test Statistic

The test statistic for the Chi-Square Test of Independence is denoted Χ², and is computed as:

$$ \chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}} $$

where

$o_{ij}$ is the observed cell count in the $i$-th row and $j$-th column of the table

$e_{ij}$ is the expected cell count in the $i$-th row and $j$-th column of the table, computed as

$$ e_{ij} = \frac{(\textrm{row } i \textrm{ total}) \times (\textrm{col } j \textrm{ total})}{\textrm{grand total}} $$

The quantity $(o_{ij} - e_{ij})$ is sometimes referred to as the residual of cell $(i, j)$, denoted $r_{ij}$.

The calculated Χ² value is then compared to the critical value from the Χ² distribution with degrees of freedom df = (R - 1)(C - 1), where R and C are the numbers of rows and columns in the table, at the chosen significance level α. If the calculated Χ² value exceeds the critical Χ² value, then we reject the null hypothesis.
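Instead of reading the critical value from a printed table, it can be computed in a DATA step. The sketch below uses the QUANTILE function with the 'CHISQ' distribution; the values of R, C, and alpha are illustrative choices, not values taken from any dataset.

```sas
/* Sketch: upper-tail critical value of the chi-square distribution.
   R, C, and alpha are illustrative assumptions. */
DATA _NULL_;
    R = 2;                        /* number of rows in the table    */
    C = 2;                        /* number of columns in the table */
    alpha = 0.05;                 /* chosen significance level      */
    df = (R - 1) * (C - 1);       /* degrees of freedom             */
    critical = QUANTILE('CHISQ', 1 - alpha, df);
    PUT df= critical= 8.3;        /* written to the SAS log         */
RUN;
```

For a 2x2 table at α = 0.05, the critical value is about 3.84, so any test statistic larger than that leads to rejecting the null hypothesis.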

Run a Chi-Square Test of Independence with PROC FREQ

The general form is

PROC FREQ DATA=dataset-name;
    TABLES rowVar*colVar / CHISQ;
RUN;

The CHISQ option is added to the TABLES statement after the slash (/) character.

Many of PROC FREQ's most useful options have been covered in the tutorials on Frequency Tables and Crosstabs, but there are several additional options that can be useful when conducting a chi-square test of independence:

EXPECTED
Adds expected cell counts to the cells of the crosstab table.
DEVIATION
Adds deviation values (i.e., observed minus expected values) to the cells of the crosstab table.

Example: Chi-square Test for 2x2 Table

Problem Statement

Let's continue the row and column percentage example from the Crosstabs tutorial, which described the relationship between the variables RankUpperUnder (upperclassman/underclassman) and LiveOnCampus (lives on campus/lives off-campus). Recall that the column percentages of the crosstab appeared to indicate that upperclassmen were less likely than underclassmen to live on campus:

The proportion of underclassmen who live off campus is 34.8%, or 79/227.
The proportion of underclassmen who live on campus is 65.2%, or 148/227.
The proportion of upperclassmen who live off campus is 94.4%, or 152/161.
The proportion of upperclassmen who live on campus is 5.6%, or 9/161.

Suppose that we want to test the association between class rank and living on campus using a Chi-Square Test of Independence (using α = 0.05).

Syntax

PROC FREQ DATA=work.sample;
    TABLES RankUpperUnder*LiveOnCampus / CHISQ EXPECTED DEVIATION NOROW NOCOL NOPERCENT;
RUN;
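If the sample data file is not at hand, the same analysis can be approximated from the four cell counts given in the problem statement. The sketch below is an assumption-laden stand-in for the real data: the dataset name work.cell_counts, the Count variable, and the character value labels are hypothetical, and the actual sample file may code these variables differently.

```sas
/* Hypothetical summarized data: one row per cell of the 2x2 table.
   The counts come from the problem statement; the value labels
   and the dataset name are assumptions. */
DATA work.cell_counts;
    LENGTH RankUpperUnder $ 13 LiveOnCampus $ 10;
    INPUT RankUpperUnder $ LiveOnCampus $ Count;
    DATALINES;
Underclassman Off-campus 79
Underclassman On-campus 148
Upperclassman Off-campus 152
Upperclassman On-campus 9
;
RUN;

/* WEIGHT makes each row count as Count subjects instead of one. */
PROC FREQ DATA=work.cell_counts;
    WEIGHT Count;
    TABLES RankUpperUnder*LiveOnCampus / CHISQ EXPECTED DEVIATION NOROW NOCOL NOPERCENT;
RUN;
```

The WEIGHT statement is what allows pre-tabulated counts to be used; without it, PROC FREQ would treat the dataset as having only four subjects.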

Output

The first table in the output is the crosstabulation. Because the syntax includes the EXPECTED and DEVIATION options, each cell shows the observed frequency, the expected count, and the deviation (observed minus expected).

With the Expected Count values shown, we can confirm that all cells have an expected value greater than 5.

Computation of the expected cell counts and residuals (observed minus expected) for the crosstabulation of class rank by living on campus.
	Off-Campus	On-Campus	Total
Underclassman	Row 1, column 1: $o_{11} = 79$; $e_{11} = \frac{227 \times 231}{388} = 135.147$; $r_{11} = 79 - 135.147 = -56.147$	Row 1, column 2: $o_{12} = 148$; $e_{12} = \frac{227 \times 157}{388} = 91.853$; $r_{12} = 148 - 91.853 = 56.147$	row 1 total = 227
Upperclassman	Row 2, column 1: $o_{21} = 152$; $e_{21} = \frac{161 \times 231}{388} = 95.853$; $r_{21} = 152 - 95.853 = 56.147$	Row 2, column 2: $o_{22} = 9$; $e_{22} = \frac{161 \times 157}{388} = 65.147$; $r_{22} = 9 - 65.147 = -56.147$	row 2 total = 161
Total	col 1 total = 231	col 2 total = 157	grand total = 388

These numbers can be plugged into the chi-square test statistic formula:

$$ \chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}} = \frac{(-56.147)^{2}}{135.147} + \frac{(56.147)^{2}}{91.853} + \frac{(56.147)^{2}}{95.853} + \frac{(-56.147)^{2}}{65.147} \approx 138.926 $$
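The hand computation can also be checked in a DATA step. The sketch below is one possible way to do it, not part of the PROC FREQ output: it stores the four observed counts in a temporary array, derives the row, column, and grand totals, and accumulates the test statistic cell by cell. The array and variable names are illustrative.

```sas
/* Sketch: recompute the chi-square statistic from the observed counts. */
DATA _NULL_;
    /* Observed counts, filled row by row: o[1,1]=79, o[1,2]=148,
       o[2,1]=152, o[2,2]=9. */
    ARRAY o[2,2] _TEMPORARY_ (79 148 152 9);
    ARRAY rowtot[2] _TEMPORARY_ (0 0);
    ARRAY coltot[2] _TEMPORARY_ (0 0);
    total = 0;

    /* Row totals, column totals, and grand total. */
    DO i = 1 TO 2;
        DO j = 1 TO 2;
            rowtot[i] = rowtot[i] + o[i,j];
            coltot[j] = coltot[j] + o[i,j];
            total = total + o[i,j];
        END;
    END;

    /* Sum (observed - expected)^2 / expected over all cells. */
    chisq = 0;
    DO i = 1 TO 2;
        DO j = 1 TO 2;
            e = rowtot[i] * coltot[j] / total;
            chisq = chisq + (o[i,j] - e)**2 / e;
        END;
    END;

    df = (2 - 1) * (2 - 1);
    p = SDF('CHISQ', chisq, df);   /* upper-tail probability (p-value) */
    PUT chisq= 8.3 df= p=;         /* written to the SAS log */
RUN;
```

Because the expected counts here are not rounded before use, the result should agree with the value of about 138.926 obtained above.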

We can confirm this computation with the results in the table labeled Statistics for Table of RankUpperUnder by LiveOnCampus:

The row of interest here is Chi-Square.

The value of the test statistic is 138.926.
Because the crosstabulation is a 2x2 table, the degrees of freedom (df) for the test statistic is $df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$.
The corresponding p-value of the test statistic is so small that it is presented as p < 0.001.

Decision and Conclusions

Since the p-value is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that there is an association between class rank and whether or not students live on-campus.

Based on the results, we can state the following:

There was a significant association between class rank and living on campus (Χ²(1) = 138.9, p < .001).
