Welcome to STAT 200

STAT 200 - Lecture 1

Randomness and Uncertainty

Is smoking bad for your health?

  • How did the medical community conclude that smoking is harmful?

  • Why is this not a trivial conclusion?

  • In fact, this conclusion was disputed for years before it was widely accepted.

Is smoking bad for your health?

  • In 1951, Richard Doll and Austin Bradford Hill launched a large-scale prospective observational study of over 40,000 doctors in the United Kingdom.

  • This study provided the first widely accepted evidence that smoking is associated with many diseases.

New drugs

  • There were over 100 different types of vaccines being tested for COVID-19.

  • But how do we determine that a vaccine actually works?

    • Some vaccinated people get sick; others don’t.
  • Is the vaccine safe?

    • Some vaccinated people develop serious side effects; others don’t.

How do we deal with uncertainties?

  • We need data!

  • From the data, we can study variability and identify patterns or trends.

  • But we need to be careful!
  • How we collect the data affects how we can analyze them and the types of conclusions we can draw.

This is where statistics comes in

  • Statistics is the science that studies variability.

  • It provides us with techniques to:

    1. design studies;
    2. collect, summarize, and analyze data; and
    3. create and interpret models to draw conclusions.

Confirmation bias

  • Data are interpreted by people.

  • The strong convictions some people have may affect how they collect and interpret the data.

  • For example, people tend to focus on evidence that supports their beliefs and disregard other pieces of evidence.

  • This phenomenon is called confirmation bias.

Data subtleties

Statistics for all!

  • Understanding statistical concepts will enable you to assess and question people’s analyses and conclusions critically.

  • Do not just accept someone’s findings. Think!

    • What evidence did they present?
    • Is there any evidence they missed (or buried 😈)?
    • Do the data and their analysis actually allow them to reach such conclusions?

You are awesome!

  • You are very capable. THINK! Draw your own conclusions!

  • Don’t feel intimidated by titles. Skepticism and detailed questioning are part of science!

Datasets and variables

Datasets

  • Statistics is all about using data to analyze the variability and uncertainties involved in a study.

  • Typically, you can think of a dataset as a table where:

    • every row corresponds to an individual (or object);

    • every column corresponds to a variable.
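
As a rough illustration of this row-and-column layout, the sketch below stores three rows of the Palmer Penguins sample used in this lecture as a list of Python dictionaries. The names `penguins`, `n_individuals`, and `variables` are chosen here for illustration only; they are not part of any dataset or package.

```python
# Each dictionary is one row (one penguin); each key is one column (one variable).
# The three rows are copied from the Palmer Penguins sample in this lecture.
penguins = [
    {"species": "Adelie", "island": "Dream", "bill_length_mm": 40.6,
     "bill_depth_mm": 17.2, "flipper_length_mm": 187, "body_mass_g": 3475,
     "sex": "male", "year": 2009},
    {"species": "Chinstrap", "island": "Dream", "bill_length_mm": 47.0,
     "bill_depth_mm": 17.3, "flipper_length_mm": 185, "body_mass_g": 3700,
     "sex": "female", "year": 2007},
    {"species": "Gentoo", "island": "Biscoe", "bill_length_mm": 50.1,
     "bill_depth_mm": 15.0, "flipper_length_mm": 225, "body_mass_g": 5000,
     "sex": "male", "year": 2008},
]

n_individuals = len(penguins)          # number of rows
variables = list(penguins[0].keys())   # column names

print(n_individuals, "individuals")
print(len(variables), "variables:", variables)

# One cell of the table: the value of one variable for one individual.
print(penguins[2]["species"], penguins[2]["body_mass_g"])
```

Reading across a dictionary gives everything recorded about one penguin; reading the same key in every dictionary gives one variable across all penguins.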

Example: Palmer Penguins Dataset

Artwork by @allison_horst

Dr. Kristen Gorman collected data on 344 penguins from three islands in the Palmer Archipelago, Antarctica.


Several variables were recorded for each penguin, including island, species, bill depth, bill length, body mass, and sex.

Example: Palmer Penguins Dataset

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Torgersen  NA              NA             NA                 NA           NA      2007
Adelie     Dream      40.6            17.2           187                3475         male    2009
Adelie     Dream      36.2            17.3           187                3300         female  2008
Adelie     Torgersen  42.1            19.1           195                4000         male    2008
Adelie     Torgersen  36.2            16.1           187                3550         female  2008
Chinstrap  Dream      47.0            17.3           185                3700         female  2007
Adelie     Torgersen  38.6            21.2           191                3800         male    2007
Gentoo     Biscoe     50.1            15.0           225                5000         male    2008
Gentoo     Biscoe     43.8            13.9           208                4300         female  2008
Chinstrap  Dream      51.5            18.7           187                3250         male    2009
Gentoo     Biscoe     49.1            14.5           212                4625         female  2009
Gentoo     Biscoe     51.1            16.5           225                5250         male    2009
Gentoo     Biscoe     48.5            15.0           219                4850         female  2009
Adelie     Torgersen  40.9            16.8           191                3700         female  2008
Adelie     Dream      39.6            18.1           186                4450         male    2008
Adelie     Dream      40.2            20.1           200                3975         male    2009
Gentoo     Biscoe     51.3            14.2           218                5300         male    2009
Adelie     Torgersen  42.5            20.7           197                4500         male    2007
Chinstrap  Dream      50.2            18.7           198                3775         female  2009
Adelie     Biscoe     38.1            16.5           198                3825         female  2009
Gentoo     Biscoe     41.7            14.7           210                4700         female  2009
Gentoo     Biscoe     50.2            14.3           218                5700         male    2007
Adelie     Torgersen  45.8            18.9           197                4150         male    2008
Chinstrap  Dream      46.1            18.2           178                3250         female  2007
Gentoo     Biscoe     45.2            13.8           215                4750         female  2008
Adelie     Biscoe     35.3            18.9           187                3800         female  2007
Gentoo     Biscoe     50.0            15.9           224                5350         male    2009
Gentoo     Biscoe     51.5            16.3           230                5500         male    2009
Chinstrap  Dream      47.5            16.8           199                3900         female  2008
Gentoo     Biscoe     48.2            15.6           221                5100         male    2008

Variables

  • A variable is a characteristic of each individual (or object) of interest.

  • Variables can be classified as Categorical or Quantitative.

Categorical Variables

  • Categorical variables are variables whose values are categories (e.g., hair color, marital status).

  • If the categories have an intrinsic order, we say the variable is Ordinal. For example:

    • Pain level: mild, moderate, severe.

    • Your rank in League of Legends: Iron, Silver, Gold, Diamond.
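
One way to see what "intrinsic order" means is to sort ordinal values. The sketch below uses the four ranks listed above; the list `players` is made up for illustration, and `RANK_ORDER` and `by_rank` are names chosen here, not part of any library. Sorting the labels as plain text puts them in alphabetical order, which ignores the ranking, so the order has to be stated explicitly.

```python
# Categories listed from lowest to highest, as in the example above.
RANK_ORDER = ["Iron", "Silver", "Gold", "Diamond"]

# Hypothetical ranks of five players (made up for illustration).
players = ["Gold", "Iron", "Diamond", "Silver", "Iron"]

# Plain sorting treats the labels as text and orders them alphabetically.
alphabetical = sorted(players)

# Sorting by position in RANK_ORDER respects the intrinsic order.
by_rank = sorted(players, key=RANK_ORDER.index)

print(alphabetical)  # ['Diamond', 'Gold', 'Iron', 'Iron', 'Silver']
print(by_rank)       # ['Iron', 'Iron', 'Silver', 'Gold', 'Diamond']

# Comparisons such as "higher than" make sense for an ordinal variable.
print(RANK_ORDER.index("Gold") > RANK_ORDER.index("Silver"))  # True
```

A categorical variable without an intrinsic order, such as hair color, has no comparable list that everyone would agree on.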

Quantitative Variables

  • A variable is quantitative if it is measured on a numerical scale (e.g., income, age, height).

  • Note that the units of measurement must be provided.
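
To see why the units matter, the short sketch below takes three body masses from the penguin sample in this lecture and reports their mean in grams and in kilograms. The variable names are chosen for illustration only.

```python
# Body masses in grams, copied from three rows of the penguin sample.
body_mass_g = [3475, 3300, 4000]

# An average is meaningful because body mass is measured on a numerical scale.
mean_g = sum(body_mass_g) / len(body_mass_g)

# The same quantity expressed in kilograms: the number changes, the mass does not.
mean_kg = mean_g / 1000

print(f"mean body mass: {mean_g:.1f} g")
print(f"mean body mass: {mean_kg:.3f} kg")
```

The two printed numbers differ by a factor of 1000 even though they describe the same penguins, so a value reported without its unit cannot be interpreted.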

Caution

A variable is not necessarily quantitative just because its values are numbers. Sometimes numbers are used to represent categories.
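
The sketch below illustrates this caution with a hypothetical coding of the three islands as the numbers 1, 2, and 3. Both the coding and the six observations are made up for this example; the penguin data store islands by name. Arithmetic on the codes runs without error but produces a number with no interpretation, whereas counting how often each category occurs is a sensible summary.

```python
from collections import Counter

# Hypothetical coding, made up for this example: the numbers are only labels.
ISLAND_CODES = {1: "Biscoe", 2: "Dream", 3: "Torgersen"}

# Island codes for six penguins (hypothetical observations).
island_code = [1, 1, 2, 3, 2, 1]

# Python will compute an average of the codes, but the result is not an island.
mean_code = sum(island_code) / len(island_code)
print("mean of the codes:", round(mean_code, 2))

# Counting observations per category is meaningful for a categorical variable.
counts = Counter(ISLAND_CODES[code] for code in island_code)
for island, n in counts.items():
    print(island, n)
```

Swapping the codes (say, Torgersen as 1 and Biscoe as 3) would change the average but not the counts, which is a quick check that the numbers carry no numerical meaning.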

Activity

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Chinstrap  Dream      50.3            20.0           197                3300         male    2007
Chinstrap  Dream      52.8            20.0           205                4550         male    2008
Adelie     Dream      36.2            17.3           187                3300         female  2008
Adelie     Torgersen  39.7            18.4           190                3900         male    2008
Chinstrap  Dream      48.5            17.5           191                3400         male    2007
Adelie     Biscoe     38.1            17.0           181                3175         female  2009
Adelie     Dream      39.2            18.6           190                4250         male    2009
Adelie     Biscoe     39.6            17.7           186                3500         female  2008
Gentoo     Biscoe     48.7            14.1           210                4450         female  2007
Adelie     Torgersen  34.6            21.1           198                4400         male    2007
Adelie     Dream      41.1            17.5           190                3900         male    2009
Gentoo     Biscoe     49.6            16.0           225                5700         male    2008
Gentoo     Biscoe     49.6            15.0           216                4750         male    2008
Gentoo     Biscoe     47.3            13.8           216                4725         NA      2009
Adelie     Biscoe     42.2            19.5           197                4275         male    2009
Gentoo     Biscoe     45.5            15.0           220                5000         male    2008
Gentoo     Biscoe     44.5            14.3           216                4100         NA      2007
Chinstrap  Dream      50.8            19.0           210                4100         male    2009
Gentoo     Biscoe     44.5            15.7           217                4875         NA      2009
Adelie     Biscoe     38.2            18.1           185                3950         male    2007
Adelie     Dream      39.7            17.9           193                4250         male    2009
Adelie     Biscoe     41.6            18.0           192                3950         male    2008
Adelie     Torgersen  34.1            18.1           193                3475         NA      2007
Chinstrap  Dream      50.2            18.7           198                3775         female  2009
Gentoo     Biscoe     45.7            13.9           214                4400         female  2008
Chinstrap  Dream      49.0            19.6           212                4300         male    2009
Chinstrap  Dream      42.5            16.7           187                3350         female  2008
Adelie     Dream      37.0            16.5           185                3400         female  2009
Chinstrap  Dream      52.2            18.8           197                3450         male    2009
Gentoo     Biscoe     45.2            16.4           223                5950         male    2008


  • Categorize each variable in the penguins data frame as categorical or quantitative.

References & Attributions

Image Attributions

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.