Statistics begins with data. Every dataset is a collection of observations (rows) described by variables (columns). The type of variable and the way data was collected determine which methods apply.
Observations and variables
An observation (or case) is a single unit in the dataset: one person, one transaction, one experiment. A variable is a characteristic measured on each observation. Variables are either numerical (quantitative) or categorical (qualitative).
Scheme
; Variable types
; Numerical: discrete (counts) or continuous (measurements)
; Categorical: nominal (no order) or ordinal (ordered)
(define dataset
  '(("Alice" 28 "Engineer" 72000)
    ("Bob" 35 "Teacher" 55000)
    ("Carol" 42 "Engineer" 88000)
    ("Dave" 31 "Designer" 61000)))
; Each row is an observation
(display "Observations: ")
(display (length dataset)) (newline)
; Variables: name (categorical), age (numerical),
; job (categorical), salary (numerical)
(display "Variables: name, age, job, salary") (newline)
; Numerical variable: compute mean age (age is the second field of each row)
(define ages (map cadr dataset))
(define (mean lst)
  (/ (apply + lst) (length lst)))
(display "Mean age: ")
(display (exact->inexact (mean ages)))
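A mean only makes sense for a numerical variable. For a categorical variable such as job, the natural summary is a count per category. The following is a minimal sketch of that idea, assuming the same four rows as above; the helper names `count-category` and `distinct` are invented for this illustration.

```scheme
; Same four observations as above, repeated so this block runs on its own
(define dataset
  '(("Alice" 28 "Engineer" 72000)
    ("Bob" 35 "Teacher" 55000)
    ("Carol" 42 "Engineer" 88000)
    ("Dave" 31 "Designer" 61000)))

; Job is the third field of each row
(define jobs (map caddr dataset))

; count-category (hypothetical helper): how many items equal the category
(define (count-category category items)
  (let loop ((rest items) (n 0))
    (cond ((null? rest) n)
          ((equal? (car rest) category) (loop (cdr rest) (+ n 1)))
          (else (loop (cdr rest) n)))))

; distinct (hypothetical helper): unique values in first-seen order
(define (distinct items)
  (let loop ((rest items) (seen '()))
    (cond ((null? rest) (reverse seen))
          ((member (car rest) seen) (loop (cdr rest) seen))
          (else (loop (cdr rest) (cons (car rest) seen))))))

; Frequency table: one (category . count) pair per distinct job
(define job-counts
  (map (lambda (job) (cons job (count-category job jobs)))
       (distinct jobs)))

(for-each
  (lambda (pair)
    (display (car pair))
    (display ": ")
    (display (cdr pair))
    (newline))
  job-counts)
```

Averaging the job column would be meaningless, which is why the variable type has to be identified before a method is chosen.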
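An observational study records data without intervening. An experiment assigns treatments to subjects. Only experiments with random assignment can establish causation; observational studies can show only association. The distinction matters: if you observe that coffee drinkers live longer, that does not mean coffee causes longevity, because coffee drinkers may differ from non-drinkers in other ways (income, diet, exercise) that also affect lifespan.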
Scheme
; Observational: measure what exists
; Experimental: assign treatment, measure outcome
; Simulate an experiment: randomize to treatment/control
(define subjects '(1 2 3 4 5 6 7 8 9 10))
; Deterministic stand-in for simple random assignment (reproducible, not truly random)
(define (assign-group id)
  (if (< (modulo (* id 7) 10) 5) "treatment" "control"))
(for-each
  (lambda (id)
    (display "Subject ")
    (display id)
    (display ": ")
    (display (assign-group id))
    (newline))
  subjects)
; Key question: was treatment randomly assigned?
; Yes -> experiment -> can infer causation
; No -> observational -> association only
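The block above uses a fixed arithmetic rule so that its output is reproducible. As a rough sketch of what seeded pseudo-random assignment could look like, the code below shuffles the subjects with a small linear congruential generator and splits them into two equal groups. The generator constants, the seed 42, and all helper names (`lcg-next`, `attach-keys`, `shuffle-by-key`, `take-n`, `drop-n`) are assumptions made for this illustration; a real study would rely on a vetted random number generator.

```scheme
(define subjects '(1 2 3 4 5 6 7 8 9 10))

; One step of a linear congruential generator (illustrative constants only)
(define (lcg-next s)
  (modulo (+ (* 1103515245 s) 12345) 2147483648))

; Pair each subject with a pseudo-random key: ((key . id) ...)
(define (attach-keys ids seed)
  (let loop ((rest ids) (s (lcg-next seed)) (acc '()))
    (if (null? rest)
        (reverse acc)
        (loop (cdr rest) (lcg-next s) (cons (cons s (car rest)) acc)))))

; Insert one (key . id) pair into a list already sorted by key
(define (insert-by-key item sorted)
  (cond ((null? sorted) (list item))
        ((< (car item) (car (car sorted))) (cons item sorted))
        (else (cons (car sorted) (insert-by-key item (cdr sorted))))))

; Insertion sort on the keys
(define (sort-by-key pairs)
  (let loop ((rest pairs) (sorted '()))
    (if (null? rest)
        sorted
        (loop (cdr rest) (insert-by-key (car rest) sorted)))))

; Shuffle: sort by pseudo-random key, then drop the keys
(define (shuffle-by-key ids seed)
  (map cdr (sort-by-key (attach-keys ids seed))))

; First n elements, and everything after the first n elements
(define (take-n lst n)
  (if (or (= n 0) (null? lst))
      '()
      (cons (car lst) (take-n (cdr lst) (- n 1)))))
(define (drop-n lst n)
  (if (or (= n 0) (null? lst))
      lst
      (drop-n (cdr lst) (- n 1))))

; First half of the shuffled list is treatment, the rest is control
(define shuffled (shuffle-by-key subjects 42))
(define half (quotient (length subjects) 2))
(define treatment (take-n shuffled half))
(define control (drop-n shuffled half))

(display "Treatment: ") (display treatment) (newline)
(display "Control: ") (display control) (newline)
```

Shuffling and then splitting guarantees equal group sizes, which assigning each subject independently does not. Changing the seed changes who lands in which group but never the group sizes.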
Sampling strategies
The population is everyone you want to study. The sample is who you actually measure. Good sampling avoids bias. Three common methods are simple random sampling (every member is equally likely to be chosen), stratified sampling (split the population into groups first, then sample within each group), and cluster sampling (randomly select entire groups). Convenience samples are cheap but unreliable.
Scheme
; Simple random sample: every member equally likely
(define population '(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20))
; Pseudo-random selection using modular arithmetic
; (a deterministic stand-in for a real random number generator)
(define (simple-random-sample pop size seed)
  (let loop ((remaining pop) (selected '()) (s seed) (count 0))
    (if (or (= count size) (null? remaining))
        (reverse selected)
        (if (< (modulo (* s 13) 17) 8)
            (loop (cdr remaining) (cons (car remaining) selected) (+ s 3) (+ count 1))
            (loop (cdr remaining) selected (+ s 7) count)))))
(display "Population: ")
(display population) (newline)
; Arguments: population, sample size 5, seed 3
(define sample (simple-random-sample population 5 3))
(display "Random sample of 5: ")
(display sample) (newline)
(display "Sample size: ")
(display (length sample))
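The code above covers only simple random sampling. The sketch below illustrates the other two strategies named in this section, under stated assumptions: the three group labels, the generator constants, the seed 7, and the helpers `lcg-next`, `extract`, and `draw` are all invented for this example. Stratified sampling draws a few members from every group, while cluster sampling draws whole groups and keeps every member.

```scheme
; Hypothetical population of 20 ids, grouped under invented labels
(define groups
  '((north 1 2 3 4 5 6 7 8 9 10)
    (south 11 12 13 14 15 16)
    (east 17 18 19 20)))

; One step of a linear congruential generator (illustrative constants only)
(define (lcg-next s)
  (modulo (+ (* 1103515245 s) 12345) 2147483648))

; Remove the element at index k; returns (element . remaining-list)
(define (extract lst k)
  (if (= k 0)
      (cons (car lst) (cdr lst))
      (let ((r (extract (cdr lst) (- k 1))))
        (cons (car r) (cons (car lst) (cdr r))))))

; Draw n items without replacement; returns (picked-list . next-seed)
(define (draw items n seed)
  (let loop ((pool items) (picked '()) (s seed) (left n))
    (if (or (= left 0) (null? pool))
        (cons (reverse picked) s)
        (let* ((s2 (lcg-next s))
               ; use the higher bits, which vary more than the low bits
               (k (modulo (quotient s2 65536) (length pool)))
               (r (extract pool k)))
          (loop (cdr r) (cons (car r) picked) s2 (- left 1))))))

; Stratified: draw per-group members from every group
(define (stratified-sample groups per-group seed)
  (let loop ((rest groups) (s seed) (acc '()))
    (if (null? rest)
        (reverse acc)
        (let* ((label (car (car rest)))
               (d (draw (cdr (car rest)) per-group s)))
          (loop (cdr rest) (cdr d) (cons (cons label (car d)) acc))))))

; Cluster: draw n whole groups and keep every member of each
(define (cluster-sample groups n seed)
  (apply append (map cdr (car (draw groups n seed)))))

(define stratified (stratified-sample groups 2 7))
(define clustered (cluster-sample groups 1 7))

(display "Stratified (2 per group): ") (display stratified) (newline)
(display "Cluster (1 whole group): ") (display clustered) (newline)
```

The stratified result always contains members of all three groups, whereas the cluster result contains members of only the selected group. That trade-off is why cluster sampling tends to be cheaper to carry out but more sensitive to differences between groups.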
Neighbors
Related chapters
🎰 Grinstead Ch.1 — discrete probability spaces, the mathematical foundation beneath sampling
OpenIntro introduces data with the county and email datasets. We use small inline datasets instead to keep examples self-contained. The original covers data matrices, scatterplots, and data collection pitfalls in more detail. The sampling code here is deterministic for reproducibility; real sampling uses proper random number generators.