Exploratory Data Analysis: Categorical Data

STAT 200 - Lecture 2

Exploratory Data Analysis (EDA)

  • One of the most critical steps when studying a new problem and dataset.

  • EDA helps us to:

    • better understand the data;
    • find new and often unexpected relationships between variables;
    • raise interesting questions about the study.

Disclaimer

  • You are in the driver’s seat of your EDA.
    • There’s no recipe;
    • Little thought yields small findings.
  • In general, EDA findings are preliminary and require more investigation.

Categorical Variables

Data


  • Which of the variables in the dataset above are categorical?

Frequency Tables

School Category         n
Independent School     54
StrongStart BC         19
Public School         121
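
A frequency table like this one comes from counting how often each category occurs. The following is a minimal Python sketch of that idea. The list `school_categories` is a hypothetical stand-in rebuilt from the counts shown above, since the raw dataset is not reproduced in these notes.

```python
from collections import Counter

# Hypothetical raw data, rebuilt from the counts in the frequency table
# (the original dataset is not reproduced in these notes).
school_categories = (
    ["Independent School"] * 54
    + ["StrongStart BC"] * 19
    + ["Public School"] * 121
)

# Count how many times each category appears.
freq = Counter(school_categories)
total = sum(freq.values())

# Print each category with its count and relative frequency,
# from the most common category to the least common.
for category, n in freq.most_common():
    print(f"{category:<20} {n:>4} {n / total:>7.1%}")
```

Reporting the relative frequency next to the raw count is optional, but it makes categories easier to compare across datasets of different sizes.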

Bar Chart
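
A bar chart draws one bar per category, with the bar length proportional to the category count. As a rough sketch that needs no plotting library, the counts from the frequency table can be drawn as a text bar chart. The scale of one `#` per 5 observations is an arbitrary choice made for this illustration.

```python
# Counts taken from the frequency table of school categories.
counts = {"Independent School": 54, "StrongStart BC": 19, "Public School": 121}

# Assumed scale for this illustration: one '#' per 5 observations.
scale = 5

# Bar length is proportional to the count (rounded to whole characters).
bars = {category: "#" * round(n / scale) for category, n in counts.items()}

for category, bar in bars.items():
    print(f"{category:<20} {bar} ({counts[category]})")
```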

Pie Chart (Avoid!)

  • Avoid using pie charts;

  • When there are many slices, or the slices are roughly the same size, a pie chart becomes very hard to read.

  • In general, bar charts are much easier to read than pie charts. So stick with them.

Two Categorical Variables

Contingency tables

  • Appropriate for summarizing the counts of two categorical variables;

  • Facilitate the analysis of the relationship between the two variables.

Case Study: Aspirin and Heart Attack

  • A controlled, randomized, double-blind study on the effects of aspirin was conducted in the 1980s.

Table 1: Data from the Aspirin Study
Drug      Heart Attack   No Heart Attack    Total
Aspirin            104            10,933   11,037
Placebo            189            10,845   11,034
Total              293            21,778   22,071
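
The row and column totals of a contingency table follow directly from the four cell counts. The sketch below stores the cells of Table 1 in a nested dictionary and recomputes the margins. The names `table`, `row_totals`, and `col_totals` are illustrative choices, not part of the study.

```python
# Cell counts from Table 1, indexed as table[drug][outcome].
table = {
    "Aspirin": {"Heart Attack": 104, "No Heart Attack": 10933},
    "Placebo": {"Heart Attack": 189, "No Heart Attack": 10845},
}
outcomes = ["Heart Attack", "No Heart Attack"]

# Row totals: number of participants in each drug group.
row_totals = {drug: sum(cells.values()) for drug, cells in table.items()}

# Column totals: number of participants with each outcome.
col_totals = {o: sum(table[drug][o] for drug in table) for o in outcomes}

# Grand total: everyone in the study.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```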

The study was reported on the front page of the New York Times on January 27, 1988.

A few questions of interest

  • What proportion of individuals did not have a heart attack?

  • What proportion of individuals received a placebo?

  • What proportion of individuals who had a heart attack were on aspirin?

  • What proportion of individuals who were on aspirin had a heart attack?

  • What proportion of individuals were on placebo and had a heart attack?

Marginal Distribution

  • What is the proportion of individuals who did not have a heart attack?

Table 2: Marginal distribution of the occurrence of heart attack
Heart Attack    No Heart Attack
293 (1.33%)     21,778 (98.67%)

Answer: 98.67%

Marginal Distribution

  • What proportion of individuals received placebo?

Table 3: Marginal distribution of drug type
Aspirin         Placebo
11,037 (50%)    11,034 (50%)

Answer: 50%
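
A marginal distribution uses only the totals of one variable, divided by the grand total. The sketch below recomputes both marginal distributions from the totals in Table 1. The helper `distribution` is a hypothetical name introduced for this illustration.

```python
# Totals taken from the margins of Table 1.
row_totals = {"Aspirin": 11037, "Placebo": 11034}
col_totals = {"Heart Attack": 293, "No Heart Attack": 21778}
grand_total = 22071


def distribution(totals, n):
    """Turn a dictionary of counts into a dictionary of proportions."""
    return {label: count / n for label, count in totals.items()}


# Marginal distribution of the outcome (ignores the drug taken).
outcome_marginal = distribution(col_totals, grand_total)

# Marginal distribution of the drug type (ignores the outcome).
drug_marginal = distribution(row_totals, grand_total)

for label, p in {**outcome_marginal, **drug_marginal}.items():
    print(f"{label:<16} {p:.2%}")
```

Each marginal distribution sums to 1, because every participant falls into exactly one category of each variable.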

Conditional Distribution

  • What proportion of individuals who had a heart attack were on aspirin?

Table 4: Conditional distribution of drug type among people who had a heart attack
Aspirin         Placebo
104 (35.49%)    189 (64.51%)

Answer: 35.49%

Note that we fix the condition of occurrence of heart attack and look at the distribution of drug type.

Conditional Distribution

  • What proportion of individuals who were on aspirin had a heart attack?

Table 5: Conditional distribution of the occurrence of heart attack among people who were on aspirin
Heart Attack    No Heart Attack
104 (0.94%)     10,933 (99.06%)

Answer: 0.94%

Note that we fix the condition of drug type and look at the distribution of occurrence of heart attack.
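
The two conditional distributions above use the same cell count (104) but different denominators, depending on which variable is held fixed. A minimal sketch of both calculations, using the cell counts from Table 1 (variable names are illustrative):

```python
# Cell counts from Table 1, indexed as table[drug][outcome].
table = {
    "Aspirin": {"Heart Attack": 104, "No Heart Attack": 10933},
    "Placebo": {"Heart Attack": 189, "No Heart Attack": 10845},
}

# Fix the outcome: among people who had a heart attack,
# what share was in each drug group? Denominator: 293.
heart_attack_total = sum(table[drug]["Heart Attack"] for drug in table)
drug_given_heart_attack = {
    drug: table[drug]["Heart Attack"] / heart_attack_total for drug in table
}

# Fix the drug: among people on aspirin,
# what share had each outcome? Denominator: 11,037.
aspirin_total = sum(table["Aspirin"].values())
outcome_given_aspirin = {
    outcome: n / aspirin_total for outcome, n in table["Aspirin"].items()
}

print({k: f"{v:.2%}" for k, v in drug_given_heart_attack.items()})
print({k: f"{v:.2%}" for k, v in outcome_given_aspirin.items()})
```

Swapping the condition changes the answer from about 35% to under 1%, so it is worth stating explicitly which group forms the denominator.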

Intersection

  • What proportion of individuals were on placebo and had a heart attack?

Answer: \(\frac{189}{22071} = 0.86\%\)
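
For an intersection, the denominator is the whole study, not one group. The short sketch below contrasts this joint proportion with the conditional proportion of heart attacks within the placebo group, using counts from Table 1. The variable names are illustrative.

```python
# Counts from Table 1.
placebo_and_heart_attack = 189
placebo_total = 11034
grand_total = 22071

# Intersection: on placebo AND had a heart attack, out of everyone.
joint = placebo_and_heart_attack / grand_total

# Conditional: had a heart attack, out of those on placebo only.
conditional = placebo_and_heart_attack / placebo_total

print(f"Joint proportion:       {joint:.2%}")
print(f"Conditional proportion: {conditional:.2%}")
```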

Association

  • We can use the contingency table to explore the relationship between the variables.
  • What would you expect to see if there were no association between having a heart attack and the drug taken?
    • roughly the same proportion of heart attacks in the placebo and aspirin groups;
    • the proportions in the conditional distributions should match those in the marginal distribution.

Association

  • Proportion of people who had a heart attack: \(\frac{293}{22071}=\) 1.33%
  • Proportion of people on aspirin who had a heart attack: \(\frac{104}{11037}=\) 0.94%
  • Proportion of people on placebo who had a heart attack: \(\frac{189}{11034}=\) 1.71%
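
The comparison above can be reproduced in a few lines. As an extra illustration that is not part of the original slides, the sketch also computes how many heart attacks each group would have shown if both groups shared the overall rate, which is what "no association" would look like in the counts.

```python
# Counts from the aspirin study table.
heart_attacks = {"Aspirin": 104, "Placebo": 189}
group_sizes = {"Aspirin": 11037, "Placebo": 11034}

# Marginal proportion of heart attacks, ignoring the drug taken.
overall_rate = sum(heart_attacks.values()) / sum(group_sizes.values())

# Conditional proportion of heart attacks within each drug group.
group_rates = {
    drug: heart_attacks[drug] / group_sizes[drug] for drug in group_sizes
}

# Heart attacks each group would show if it shared the overall rate,
# that is, if there were no association at all.
expected = {drug: group_sizes[drug] * overall_rate for drug in group_sizes}

print(f"Overall: {overall_rate:.2%}")
for drug in group_sizes:
    print(
        f"{drug}: {group_rates[drug]:.2%} "
        f"(observed {heart_attacks[drug]}, expected about {expected[drug]:.1f})"
    )
```

The aspirin group falls below the overall rate and the placebo group above it, which suggests an association. In keeping with the disclaimer about EDA, this is a preliminary finding and not a formal test.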

Case Study: Race and Death Penalty in Florida

  • Does race affect the chance of receiving a death penalty sentence in Florida?

  • Radelet (1981) examined data on homicide indictments in 20 Florida counties between 1976 and 1977.

Table 6: Death Penalty sentences from homicide indictments between 1976 and 1977 in Florida (Radelet 1981)
Defendant’s race   Death Penalty: Yes   Death Penalty: No
White                              19                 141
Black                              17                 149
Total                              36                 290

Case Study: Questions of interest

  • Question 1: What proportion of indictments resulted in a death sentence?
  • Question 2: Among those who were sentenced to death, what is the proportion of white defendants?
  • Question 3: Compare the proportion of white defendants sentenced to death against that of black defendants.

Case Study: Diving Deeper

Death Penalty sentences grouped per victims’ races (Radelet 1981):

Table 7: White victims
Defendant’s race   Death Penalty: Yes   Death Penalty: No
White                              19                 132
Black                              11                  52
Total                              30                 184

Table 8: Black victims
Defendant’s race   Death Penalty: Yes   Death Penalty: No
White                               0                   9
Black                               6                  97
Total                               6                 106

Case Study: Questions of interest

  • Question 4: Compare the proportions of indictments that resulted in a death sentence for cases with black victims and cases with white victims.
  • Question 5: For black victims, compare the proportion of white defendants who were sentenced to death against that of black defendants.
  • Question 6: For white victims, compare the proportion of white defendants who were sentenced to death against that of black defendants.

Simpson’s Paradox

  • Question 7: Compare your results in Questions 5 and 6 with the result you obtained in Question 3. Is it surprising?
  • This reversal in the direction of association when accounting for a third variable is called Simpson’s Paradox;
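
The reversal can be checked numerically. The sketch below takes the counts from Tables 7 and 8, adds them cell by cell to recover the aggregated Table 6, and computes the proportion of defendants sentenced to death at both levels. The function and variable names are illustrative choices, not taken from Radelet (1981).

```python
# (death sentence, no death sentence) counts by defendant's race,
# split by the victim's race (Tables 7 and 8).
by_victim = {
    "White victims": {"White": (19, 132), "Black": (11, 52)},
    "Black victims": {"White": (0, 9), "Black": (6, 97)},
}


def death_rates(stratum):
    """Proportion of defendants sentenced to death, per defendant's race."""
    return {race: yes / (yes + no) for race, (yes, no) in stratum.items()}


# Adding the two strata cell by cell recovers the aggregated Table 6.
combined = {
    race: tuple(sum(cells) for cells in zip(*(s[race] for s in by_victim.values())))
    for race in ("White", "Black")
}

overall = death_rates(combined)
within = {name: death_rates(stratum) for name, stratum in by_victim.items()}

print("All cases:", {race: f"{p:.1%}" for race, p in overall.items()})
for name, rates in within.items():
    print(f"{name}:", {race: f"{p:.1%}" for race, p in rates.items()})
```

In the aggregated table, white defendants have the higher proportion. Within each victim group, black defendants do. The direction of the comparison flips once the victim's race is taken into account.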

Association is not Causation

  • Although we found that, once the victim’s race is taken into account, black defendants had a higher chance of being sentenced to death, this does not necessarily mean that race is the reason.

  • Race could be the root cause, a contributing factor, or unrelated to the outcome altogether;

  • The cause could be a different underlying factor that affects black offenders more;

    • This does not seem to be the case (see the paper for more details).

References & Attributions

References

Efron, Bradley, and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. https://doi.org/10.1201/9780429246593.
Radelet, Michael L. 1981. “Racial Characteristics and the Imposition of the Death Penalty.” American Sociological Review 46 (6): 918–27. http://www.jstor.org/stable/2095088.