# SAS Tutorials: Chi-Square Test of Independence

The Chi-Square Test of Independence is used to test if two categorical variables are associated. This tutorial shows how to use PROC FREQ in SAS to run this test.

## Chi-Square Test of Independence

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a nonparametric test.

This test is also known as:

• Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories.

Several other tests also go by the name "chi-square test" in addition to the Chi-Square Test of Independence. Use context clues in the data and the research question to determine which form of the chi-square test is being used.

## Common Uses

The Chi-Square Test of Independence is commonly used to test the following:

• Statistical independence or association between two categorical variables.

The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables, and cannot provide any inferences about causation.

If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate. This is because the assumption of the independence of observations is violated. In this situation, McNemar's Test is appropriate.

## Data Requirements

Your data must meet the following requirements:

1. Two categorical variables.
2. Two or more categories (groups) for each variable.
3. Independence of observations.
   • There is no relationship between the subjects in each group.
   • The categorical variables are not "paired" in any way (e.g., pre-test/post-test observations).
4. Relatively large sample size.
   • Expected frequencies for each cell are at least 1.
   • Expected frequencies are at least 5 for at least 80% of the cells.

## Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:

H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"

OR

H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]"

## Data Set-Up

Your dataset should include the two categorical variables to be tested, each stored in its own column.
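As an illustration, one possible layout is sketched below. It assumes the cell counts have already been tabulated (one row per combination of categories), in which case a WEIGHT statement tells PROC FREQ to treat `Count` as the number of cases in each row. The dataset name `work.rank_counts`, the variable `Count`, and the character codings are hypothetical; the counts themselves are those used in the example later in this tutorial.

```sas
/* Hypothetical aggregated dataset: one row per combination of categories. */
/* Counts are taken from the 2x2 example later in this tutorial.           */
DATA work.rank_counts;
  LENGTH RankUpperUnder $ 13 LiveOnCampus $ 10;
  INPUT RankUpperUnder $ LiveOnCampus $ Count;
  DATALINES;
Underclassman Off-campus 79
Underclassman On-campus 148
Upperclassman Off-campus 152
Upperclassman On-campus 9
;
RUN;

/* WEIGHT makes each row count as Count cases instead of one case. */
PROC FREQ DATA=work.rank_counts;
  TABLES RankUpperUnder*LiveOnCampus / CHISQ;
  WEIGHT Count;
RUN;
```

If the data instead has one row per subject, the WEIGHT statement is not needed.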

## Test Statistic

The test statistic for the Chi-Square Test of Independence is denoted $$\chi^{2}$$, and is computed as:

$$\chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}}$$

where

$$o_{ij}$$ is the observed cell count in the ith row and jth column of the table

$$e_{ij}$$ is the expected cell count in the ith row and jth column of the table, computed as

$$e_{ij} = \frac{\mathrm{ \textrm{row } \mathit{i}} \textrm{ total} * \mathrm{\textrm{col } \mathit{j}} \textrm{ total}}{\textrm{grand total}}$$

The quantity $$(o_{ij} - e_{ij})$$ is sometimes referred to as the residual of cell (i, j), denoted $$r_{ij}$$.

The calculated $$\chi^{2}$$ value is then compared to the critical value from the $$\chi^{2}$$ distribution with degrees of freedom df = (R - 1)(C - 1) at the chosen significance level α. If the calculated $$\chi^{2}$$ value is greater than the critical value, we reject the null hypothesis.
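The critical value does not have to be read from a printed table. A minimal sketch, assuming α = 0.05 and a 2x2 table, uses the SAS `QUANTILE` function; the dataset name `work.chisq_crit` and the variable names are illustrative only.

```sas
/* Critical value of the chi-square distribution for a given alpha and df. */
/* Assumed inputs: alpha = 0.05 and a 2x2 table (R = 2 rows, C = 2 cols).  */
DATA work.chisq_crit;
  alpha = 0.05;
  R = 2;
  C = 2;
  df = (R - 1) * (C - 1);
  crit = QUANTILE('CHISQ', 1 - alpha, df);  /* upper-tail critical value */
RUN;

PROC PRINT DATA=work.chisq_crit NOOBS;
RUN;
```

For df = 1 and α = 0.05 the critical value is approximately 3.84, so a test statistic larger than that leads to rejecting the null hypothesis.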

## Run a Chi-Square Test of Independence with PROC FREQ

The general form is:

```sas
PROC FREQ DATA=dataset-name;
  TABLES rowVar*colVar / CHISQ;
RUN;
```


The CHISQ option is added to the TABLES statement after the slash (/) character.

Many of PROC FREQ's most useful options have been covered in the tutorials on Frequency Tables and Crosstabs, but there are several additional options that can be useful when conducting a chi-square test of independence:

• EXPECTED
Adds expected cell counts to the cells of the crosstab table.
• DEVIATION
Adds deviation values (i.e., observed minus expected values) to the cells of the crosstab table.

## Example: Chi-square Test for 2x2 Table

### Problem Statement

Let's continue the row and column percentage example from the Crosstabs tutorial, which described the relationship between the variables RankUpperUnder (upperclassman/underclassman) and LiveOnCampus (lives on campus/lives off-campus). Recall that the percentages within each class rank appeared to indicate that upperclassmen were less likely than underclassmen to live on campus:

• The proportion of underclassmen who live off campus is 34.8%, or 79/227.
• The proportion of underclassmen who live on campus is 65.2%, or 148/227.
• The proportion of upperclassmen who live off campus is 94.4%, or 152/161.
• The proportion of upperclassmen who live on campus is 5.6%, or 9/161.

Suppose that we want to test the association between class rank and living on campus using a Chi-Square Test of Independence (using α = 0.05).

### Syntax

```sas
PROC FREQ DATA=work.sample;
  TABLES RankUpperUnder*LiveOnCampus / CHISQ EXPECTED DEVIATION NOROW NOCOL NOPERCENT;
RUN;
```

### Output

The first table in the output is the crosstabulation. If you included the EXPECTED and DEVIATION options in your syntax, each cell shows the observed frequency, the expected frequency, and the deviation (observed minus expected).

With the Expected Count values shown, we can confirm that all cells have an expected value greater than 5.

Computation of the expected cell counts and residuals (observed minus expected) for the crosstabulation of class rank by living on campus:

| | Off-Campus | On-Campus | Total |
|---|---|---|---|
| Underclassman | $$o_{11} = 79$$<br>$$e_{11} = \frac{227*231}{388} = 135.147$$<br>$$r_{11} = 79 - 135.147 = -56.147$$ | $$o_{12} = 148$$<br>$$e_{12} = \frac{227*157}{388} = 91.853$$<br>$$r_{12} = 148 - 91.853 = 56.147$$ | row 1 total = 227 |
| Upperclassman | $$o_{21} = 152$$<br>$$e_{21} = \frac{161*231}{388} = 95.853$$<br>$$r_{21} = 152 - 95.853 = 56.147$$ | $$o_{22} = 9$$<br>$$e_{22} = \frac{161*157}{388} = 65.147$$<br>$$r_{22} = 9 - 65.147 = -56.147$$ | row 2 total = 161 |
| Total | col 1 total = 231 | col 2 total = 157 | grand total = 388 |

These numbers can be plugged into the chi-square test statistic formula:

$$\chi^{2} = \sum_{i=1}^{R}{\sum_{j=1}^{C}{\frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}}} = \frac{(-56.147)^{2}}{135.147} + \frac{(56.147)^{2}}{91.853} + \frac{(56.147)^{2}}{95.853} + \frac{(-56.147)^{2}}{65.147} = 138.926$$
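The same arithmetic can be reproduced in a DATA step as a cross-check. This is only a sketch: the dataset name `work.chisq_by_hand` and the variable names are illustrative, and the observed counts are typed in directly rather than read from `work.sample`.

```sas
/* Recompute the chi-square statistic from the four observed cell counts. */
DATA work.chisq_by_hand;
  /* Observed counts, filled row by row: o[1,1], o[1,2], o[2,1], o[2,2] */
  ARRAY o[2,2] _TEMPORARY_ (79 148 152 9);
  ARRAY rowtot[2] _TEMPORARY_ (0 0);
  ARRAY coltot[2] _TEMPORARY_ (0 0);
  grand = 0;

  /* Accumulate row totals, column totals, and the grand total */
  DO i = 1 TO 2;
    DO j = 1 TO 2;
      rowtot[i] = rowtot[i] + o[i,j];
      coltot[j] = coltot[j] + o[i,j];
      grand = grand + o[i,j];
    END;
  END;

  /* Sum (observed - expected)^2 / expected over all cells */
  chisq = 0;
  DO i = 1 TO 2;
    DO j = 1 TO 2;
      e = rowtot[i] * coltot[j] / grand;
      chisq = chisq + (o[i,j] - e)**2 / e;
    END;
  END;

  df = (2 - 1) * (2 - 1);
  p = SDF('CHISQ', chisq, df);  /* upper-tail probability (p-value) */
  KEEP chisq df p;
RUN;

PROC PRINT DATA=work.chisq_by_hand NOOBS;
RUN;
```

The resulting `chisq` value should agree with the hand computation up to rounding.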

We can confirm this computation with the results in the output table labeled Statistics for Table of RankUpperUnder by LiveOnCampus.

The row of interest here is Chi-Square.

• The value of the test statistic is 138.926.
• Because the crosstabulation is a 2x2 table, the degrees of freedom (df) for the test statistic is $$df = (R - 1)*(C - 1) = (2 - 1)*(2 - 1) = 1$$.
• The corresponding p-value of the test statistic is so small that SAS reports it as <.0001.

### Decision and Conclusions

Since the p-value is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that there is an association between class rank and whether or not students live on-campus.

Based on the results, we can state the following:

• There was a significant association between class rank and living on campus ($$\chi^{2}(1) = 138.9$$, p < .001).