Introduction to Data

OpenIntro Statistics · Ch. 1 · openintro.org/book/os

Statistics begins with data. Every dataset is a collection of observations (rows) described by variables (columns). The type of variable and the way data was collected determine which methods apply.

Observations and variables

An observation (or case) is a single unit in the dataset: one person, one transaction, one experiment. A variable is a characteristic measured on each observation. Variables are either numerical (quantitative) or categorical (qualitative).
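
As a minimal sketch of the row-and-column picture, the snippet below stores a tiny dataset as a list of records and labels each variable as numerical or categorical. The dataset, the field names, and the helper `variable_type` are all hypothetical, made up for illustration. The type check is a rough heuristic, not a general rule.

```python
# Hypothetical inline dataset: each dict is one observation (row),
# each key is one variable (column).
observations = [
    {"name": "Ana", "age": 34, "height_cm": 162.5, "smoker": "no"},
    {"name": "Ben", "age": 51, "height_cm": 178.0, "smoker": "yes"},
    {"name": "Caro", "age": 27, "height_cm": 170.2, "smoker": "no"},
    {"name": "Dev", "age": 45, "height_cm": 181.3, "smoker": "no"},
]


def variable_type(rows, variable):
    """Label a variable 'numerical' if every value is an int or float,
    otherwise 'categorical'.

    Rough heuristic only: numeric codes (zip codes, ID numbers) would be
    labeled numerical even though they behave as categories.
    """
    values = [row[variable] for row in rows]
    all_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if all_numbers else "categorical"


# Classify every variable (column) found in the first observation.
variable_types = {v: variable_type(observations, v) for v in observations[0]}

for variable, kind in variable_types.items():
    print(f"{variable}: {kind}")
```

Here `age` and `height_cm` come out numerical, while `name` and `smoker` come out categorical.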

Figure: scatterplot of Variable Y against Variable X. Each dot is one observation; both variables are numerical.

Study design: observational vs. experimental

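An observational study records data without intervening. An experiment assigns treatments to subjects, ideally at random. In general, only a randomized experiment can establish causation; an observational study can show only association. The distinction matters: if you observe that coffee drinkers live longer, that does not mean coffee causes longevity, because a confounding variable (for example, income or overall health) could influence both coffee drinking and lifespan.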
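
What separates an experiment from an observational study is that the researcher decides who gets the treatment. One way that step might look is sketched below, assuming a simple two-group design with random assignment. The subject IDs, the function `assign_treatments`, and the fixed seed are hypothetical choices for illustration.

```python
import random


def assign_treatments(subjects, seed=0):
    """Randomly split subjects into treatment and control groups.

    Groups are equal in size when the number of subjects is even;
    otherwise the control group gets the extra subject.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    shuffled = list(subjects)  # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical subject IDs.
subjects = [f"subject_{i}" for i in range(1, 11)]
groups = assign_treatments(subjects)

print("treatment:", groups["treatment"])
print("control:  ", groups["control"])
```

Because chance alone decides each subject's group, confounders such as income or health tend to balance out between the groups. An observational study cannot guarantee that balance.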

Sampling strategies

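The population is the full set of cases you want to study; the sample is the subset you actually measure. Random sampling guards against bias. Three common methods are simple random sampling (every case has an equal chance of being selected), stratified sampling (split the population into groups, called strata, then sample randomly within each), and cluster sampling (split the population into groups, called clusters, then randomly select entire clusters). Convenience samples are cheap but unreliable, because the cases that are easiest to reach may not represent the population.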
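
The three random sampling methods can be sketched in a few lines each. This is an illustrative sketch under stated assumptions, not a production sampler. The population of 12 students in 3 schools, the function names, and the seed are all hypothetical. A seeded generator keeps the output reproducible.

```python
import random

# Hypothetical population: 12 students spread evenly over 3 schools.
population = [
    {"id": i, "school": school}
    for i, school in enumerate(["north"] * 4 + ["south"] * 4 + ["east"] * 4, start=1)
]


def simple_random_sample(pop, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(pop, n)


def stratified_sample(pop, key, n_per_stratum, rng):
    """Group units by `key`, then draw n_per_stratum units from every group."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(pop, key, n_clusters, rng):
    """Randomly pick n_clusters groups and keep every unit inside them."""
    clusters = sorted({unit[key] for unit in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in pop if unit[key] in chosen]


rng = random.Random(42)  # fixed seed so the example is reproducible
srs = simple_random_sample(population, 6, rng)
stratified = stratified_sample(population, "school", 2, rng)
clustered = cluster_sample(population, "school", 2, rng)

print("simple random:", sorted(u["id"] for u in srs))
print("stratified:   ", sorted(u["id"] for u in stratified))
print("cluster:      ", sorted(u["id"] for u in clustered))
```

All three calls return samples of different shapes: the stratified sample always contains two students from every school, while the cluster sample contains all students from two schools and none from the third.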

Neighbors

Related chapters

  • 🎰 Grinstead Ch.1 — discrete probability spaces, the mathematical foundation beneath sampling

Translation notes

OpenIntro introduces data with the county and email datasets. We use small inline datasets instead to keep examples self-contained. The original covers data matrices, scatterplots, and data collection pitfalls in more detail. The sampling code here is deterministic for reproducibility; real sampling uses proper random number generators.

Want the full treatment? Read OpenIntro Statistics, Ch. 1.