Welcome to STAT 200

STAT 200 - Lecture 1

Randomness and Uncertainty

Is smoking bad for your health?

  • How did the medical community conclude that smoking is harmful?

  • Why is this not a trivial conclusion?

  • In fact, this conclusion was disputed for years before it was widely accepted.

Is smoking bad for your health?

  • In 1951, Richard Doll and Austin Bradford Hill launched a large-scale prospective observational study of over 40,000 doctors in the United Kingdom.

  • This study provided the first widely accepted evidence that smoking is associated with many diseases.


New drugs

  • More than 100 candidate vaccines for COVID-19 were being tested.

  • But how do we determine that a vaccine actually works?

    • Some vaccinated people get sick; others don’t.
  • Is the vaccine safe?

    • Some vaccinated people develop serious side effects; others don’t.

How do we deal with uncertainties?

  • We need data!

  • From the data, we can study the variability and identify patterns or trends in the data.

  • But we need to be careful!
  • How we collect the data affects how we analyze the data and the type of conclusions we can make.

This is where statistics comes in

  • Statistics is the science that studies variability.

  • It provides us with techniques to:

    1. design studies;
    2. collect, summarize, and analyze data;
    3. create and interpret models to draw conclusions.

Confirmation bias

  • Data are interpreted by people.

  • The strong convictions some people have may affect how they collect and interpret the data.

  • For example, people tend to focus on evidence that supports their beliefs and disregard other pieces of evidence.

  • This phenomenon is called confirmation bias.

Data subtleties

Statistics for all!

  • Understanding statistical concepts will enable you to assess and question people’s analyses and conclusions critically.

  • Do not just accept someone’s findings. Think!

    • What evidence did they present?
    • Is there any evidence they missed (or buried 😈)?
    • Do the data and their analysis actually allow them to reach such conclusions?

You are awesome!

  • You are very capable. THINK! Draw your own conclusions!

  • Don’t feel intimidated by titles. Skepticism and detailed questioning are part of science!

Datasets and variables

Datasets

  • Statistics is all about using data to analyze the variability and uncertainties involved in a study.

  • Typically, you can think of a dataset as a table where:

    • every row corresponds to an individual/object.

    • every column corresponds to a variable.

Example: Palmer Penguins Dataset

Artwork by @allison_horst

Dr. Kristen Gorman collected data on 344 penguins from three islands in the Palmer Archipelago, Antarctica.



Multiple variables were recorded for each penguin, including island, species, bill depth, bill length, body mass, and sex.

Example: Palmer Penguins Dataset

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Biscoe     42.0            19.5           200                4050         male    2008
Adelie     Torgersen  38.5            17.9           190                3325         female  2009
Adelie     Torgersen  38.8            17.6           191                3275         female  2009
Adelie     Biscoe     40.5            17.9           187                3200         female  2007
Chinstrap  Dream      51.5            18.7           187                3250         male    2009
Adelie     Dream      40.8            18.9           208                4300         male    2008
Adelie     Biscoe     41.4            18.6           191                3700         male    2008
Chinstrap  Dream      46.5            17.9           192                3500         female  2007
Gentoo     Biscoe     46.1            13.2           211                4500         female  2007
Chinstrap  Dream      50.1            17.9           190                3400         female  2009
Gentoo     Biscoe     47.6            14.5           215                5400         male    2007
Chinstrap  Dream      51.7            20.3           194                3775         male    2007
Gentoo     Biscoe     48.7            14.1           210                4450         female  2007
Adelie     Biscoe     39.7            18.9           184                3550         male    2009
Gentoo     Biscoe     51.3            14.2           218                5300         male    2009
Adelie     Biscoe     35.9            19.2           189                3800         female  2007
Gentoo     Biscoe     44.0            13.6           208                4350         female  2008
Adelie     Dream      39.8            19.1           184                4650         male    2007
Gentoo     Biscoe     46.4            15.0           216                4700         female  2008
Adelie     Biscoe     35.5            16.2           195                3350         female  2008
Gentoo     Biscoe     45.5            13.9           210                4200         female  2008
Adelie     Dream      37.3            17.8           191                3350         female  2008
Gentoo     Biscoe     51.5            16.3           230                5500         male    2009
Gentoo     Biscoe     50.5            15.2           216                5000         female  2009
Gentoo     Biscoe     49.8            16.8           230                5700         male    2008
Gentoo     Biscoe     44.5            14.7           214                4850         female  2009
Adelie     Biscoe     36.4            17.1           184                2850         female  2008
Chinstrap  Dream      48.5            17.5           191                3400         male    2007
Gentoo     Biscoe     47.3            13.8           216                4725         NA      2009
Gentoo     Biscoe     48.4            14.6           213                5850         male    2007
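
The "rows are individuals, columns are variables" layout can be sketched in a few lines of Python. This is only an illustration using the standard library: the three records are copied from the first rows of the table above, and the helper name `column` is hypothetical, not part of any package.

```python
# Each dict is one row (one penguin); each key is one column (one variable).
# The three records are copied from the first rows of the penguins table.
penguins = [
    {"species": "Adelie", "island": "Biscoe", "bill_length_mm": 42.0,
     "bill_depth_mm": 19.5, "flipper_length_mm": 200, "body_mass_g": 4050,
     "sex": "male", "year": 2008},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 38.5,
     "bill_depth_mm": 17.9, "flipper_length_mm": 190, "body_mass_g": 3325,
     "sex": "female", "year": 2009},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 38.8,
     "bill_depth_mm": 17.6, "flipper_length_mm": 191, "body_mass_g": 3275,
     "sex": "female", "year": 2009},
]


def column(rows, name):
    """Return the values of one variable across all individuals (hypothetical helper)."""
    return [row[name] for row in rows]


first_penguin = penguins[0]                     # one row: one individual
body_masses = column(penguins, "body_mass_g")   # one column: one variable

print(first_penguin["species"], first_penguin["island"])
print(body_masses)
```

Reading across a record answers "what do we know about this penguin?", while reading down a column answers "how does this variable vary between penguins?".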

Variables

  • A variable is a characteristic of each individual (or object) of interest.

  • Variables can be classified as Categorical or Quantitative.

Categorical Variables

  • Categorical variables are variables whose values are categories (e.g., hair color, marital status).

  • If the categories have an intrinsic order, we say the variable is Ordinal. Examples:

    • Pain level: mild, moderate, severe.

    • Your rank in League of Legends: Iron, Silver, Gold, Diamond.
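
One way to see what "intrinsic order" buys us is to sketch the pain-level example in Python. This is a minimal illustration, not a standard recipe: the list of responses is made up, and the helper `pain_rank` is a hypothetical name.

```python
# Pain levels listed from lowest to highest; the position in the list encodes the order.
PAIN_LEVELS = ["mild", "moderate", "severe"]


def pain_rank(level):
    """Return the position of a pain level in the ordered list (0 is the lowest)."""
    return PAIN_LEVELS.index(level)


# Hypothetical patient responses, made up for illustration.
responses = ["severe", "mild", "moderate", "mild"]

# Because the categories are ordered, comparing and sorting are meaningful.
is_worse = pain_rank("severe") > pain_rank("moderate")
ordered = sorted(responses, key=pain_rank)
worst = max(responses, key=pain_rank)

print(is_worse)
print(ordered)
print(worst)
```

For a categorical variable without an order, such as hair color, comparisons like "greater than" or "the worst" would have no meaning. Only counting how often each category occurs would make sense.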

Quantitative Variables

  • A variable is quantitative if it is measured on a numerical scale (e.g., income, age, height).

  • Note that the units of measurement must be provided.
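
A short sketch of why units matter, assuming we look only at the body masses of the first three penguins in the table shown earlier. The variable names are chosen for illustration.

```python
from statistics import fmean

# Body masses in grams, copied from the first three rows of the penguins table.
body_mass_g = [4050, 3325, 3275]

mean_g = fmean(body_mass_g)   # average mass in grams
mean_kg = mean_g / 1000       # the same quantity expressed in kilograms

print(f"mean body mass: {mean_g:.0f} g = {mean_kg:.2f} kg")
```

The two numbers describe the same average, but without the unit a reader could not tell which scale is being used.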

Caution

A variable is not necessarily quantitative just because its values are numbers. Sometimes numbers are used to represent categories: a jersey number, for example, identifies a player but does not measure anything.
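
The caution can be made concrete with a small Python sketch. The coding scheme below (1 = Biscoe, 2 = Dream, 3 = Torgersen) is hypothetical and made up for illustration. It is applied to the islands of the first five penguins in the table shown earlier.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding scheme, made up for illustration.
ISLAND_CODES = {1: "Biscoe", 2: "Dream", 3: "Torgersen"}

# Islands of the first five penguins in the table, written as codes:
# Biscoe, Torgersen, Torgersen, Biscoe, Dream.
island_code = [1, 3, 3, 1, 2]

# The arithmetic runs without error, but the result means nothing:
# the islands are labels, so there is no such thing as an "average island".
meaningless_average = mean(island_code)

# Counting how often each category occurs is the sensible summary.
counts = Counter(ISLAND_CODES[code] for code in island_code)

print(meaningless_average)
print(counts)
```

The computer will happily average the codes. Deciding whether that average means anything is up to us.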

Activity

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Biscoe     36.4            17.1           184                2850         female  2008
Chinstrap  Dream      50.0            19.5           196                3900         male    2007
Gentoo     Biscoe     45.4            14.6           211                4800         female  2007
Adelie     Dream      37.0            16.5           185                3400         female  2009
Chinstrap  Dream      52.8            20.0           205                4550         male    2008
Gentoo     Biscoe     46.1            13.2           211                4500         female  2007
Gentoo     Biscoe     53.4            15.8           219                5500         male    2009
Adelie     Torgersen  36.6            17.8           185                3700         female  2007
Chinstrap  Dream      45.5            17.0           196                3500         female  2008
Adelie     Torgersen  39.6            17.2           196                3550         female  2008
Gentoo     Biscoe     45.1            14.4           210                4400         female  2008
Adelie     Biscoe     40.6            18.8           193                3800         male    2008
Adelie     Dream      39.8            19.1           184                4650         male    2007
Adelie     Dream      37.3            16.8           192                3000         female  2009
Chinstrap  Dream      47.0            17.3           185                3700         female  2007
Gentoo     Biscoe     49.5            16.2           229                5800         male    2008
Adelie     Dream      38.9            18.8           190                3600         female  2008
Adelie     Torgersen  34.4            18.4           184                3325         female  2007
Chinstrap  Dream      51.5            18.7           187                3250         male    2009
Chinstrap  Dream      46.9            16.6           192                2700         female  2008
Gentoo     Biscoe     46.2            14.4           214                4650         NA      2008
Gentoo     Biscoe     48.7            15.1           222                5350         male    2007
Gentoo     Biscoe     51.1            16.3           220                6000         male    2008
Gentoo     Biscoe     50.5            15.9           222                5550         male    2008
Gentoo     Biscoe     48.1            15.1           209                5500         male    2009
Gentoo     Biscoe     49.2            15.2           221                6300         male    2007
Gentoo     Biscoe     50.8            17.3           228                5600         male    2009
Adelie     Torgersen  34.1            18.1           193                3475         NA      2007
Adelie     Dream      39.5            16.7           178                3250         female  2007
Chinstrap  Dream      50.8            19.0           210                4100         male    2009


  • Categorize each variable in the penguins data frame as categorical or quantitative.

References & Attributions

Image Attributions

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B. Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.