Introduction to Data

OpenIntro Statistics · Ch. 1 · openintro.org/book/os

Statistics begins with data. Every dataset is a collection of observations (rows) described by variables (columns). The type of variable and the way data was collected determine which methods apply.

Observations and variables

An observation (or case) is a single unit in the dataset: one person, one transaction, one experiment. A variable is a characteristic measured on each observation. Variables are either numerical (quantitative) or categorical (qualitative).

Scheme

; Variable types
; Numerical: discrete (counts) or continuous (measurements)
; Categorical: nominal (no order) or ordinal (ordered)

(define dataset
  '(("Alice" 28 "Engineer" 72000)
    ("Bob"   35 "Teacher"  55000)
    ("Carol" 42 "Engineer" 88000)
    ("Dave"  31 "Designer" 61000)))

; Each row is an observation
(display "Observations: ")
(display (length dataset)) (newline)

; Variables: name (categorical), age (numerical),
;            job (categorical), salary (numerical)
(display "Variables: name, age, job, salary") (newline)

; Numerical variable: compute mean age
(define ages (map cadr dataset))
(define (mean lst)
  (/ (apply + lst) (length lst)))
(display "Mean age: ")
(display (exact->inexact (mean ages)))
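
A mean only makes sense for a numerical variable. For a categorical variable such as job, the natural summary is a count per category. The following is a minimal sketch of that idea; the helpers `increment` and `tally` are hypothetical names introduced here for illustration, and the dataset is repeated so the block runs on its own.

```scheme
; Same small dataset as above, repeated so this block is self-contained
(define dataset
  '(("Alice" 28 "Engineer" 72000)
    ("Bob"   35 "Teacher"  55000)
    ("Carol" 42 "Engineer" 88000)
    ("Dave"  31 "Designer" 61000)))

; Hypothetical helper: return counts with the entry for key increased
; by one, appending (key . 1) if key has not been seen yet
(define (increment key counts)
  (cond ((null? counts) (list (cons key 1)))
        ((equal? (caar counts) key)
         (cons (cons key (+ 1 (cdar counts))) (cdr counts)))
        (else (cons (car counts) (increment key (cdr counts))))))

; Hypothetical helper: build an association list of (value . count),
; in order of first appearance
(define (tally values)
  (let loop ((rest values) (counts '()))
    (if (null? rest)
        counts
        (loop (cdr rest) (increment (car rest) counts)))))

; Categorical variable: count observations per job
(define jobs (map caddr dataset))
(for-each
  (lambda (entry)
    (display (car entry))
    (display ": ")
    (display (cdr entry))
    (newline))
  (tally jobs))
```

Counts like these are the starting point for the tables and bar plots used to summarize categorical data.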

Study design: observational vs. experimental

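An observational study records data without intervening, so it can show that two variables are associated but not that one causes the other. An experiment assigns treatments to subjects; when the assignment is random, differences in outcomes can be attributed to the treatment. The distinction matters: if you observe that coffee drinkers live longer, that does not mean coffee causes longevity. A confounding variable such as income or overall health could influence both coffee drinking and lifespan.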

Scheme

; Observational: measure what exists
; Experimental: assign treatment, measure outcome

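; Simulate an experiment: assign subjects to treatment/control
(define subjects '(1 2 3 4 5 6 7 8 9 10))

; Deterministic stand-in for random assignment (reproducible output);
; a real experiment would use a random number generator
(define (assign-group id)
  (if (< (modulo (* id 7) 10) 5) "treatment" "control"))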

(for-each
  (lambda (id)
    (display "Subject ")
    (display id)
    (display ": ")
    (display (assign-group id))
    (newline))
  subjects)

; Key question: was treatment randomly assigned?
; Yes -> experiment -> can infer causation
; No  -> observational -> association only
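
One common way to randomize is to shuffle the subjects and split the shuffled list in half, which also keeps the two groups the same size. The sketch below assumes a hand-rolled linear congruential generator so that it runs in plain Scheme without a library; the names `next-state`, `extract`, `take-first`, `shuffle-list`, and `randomize` are hypothetical, and the generator is for illustration only, not for a real study.

```scheme
; Illustrative linear congruential generator (constants are the classic
; ANSI C example values); a stand-in for a proper random number generator
(define (next-state s)
  (modulo (+ (* 1103515245 s) 12345) 2147483648))

; Pull out the element at index i: returns (element . remaining-list)
(define (extract lst i)
  (if (= i 0)
      (cons (car lst) (cdr lst))
      (let ((r (extract (cdr lst) (- i 1))))
        (cons (car r) (cons (car lst) (cdr r))))))

; First n elements of lst
(define (take-first lst n)
  (if (= n 0)
      '()
      (cons (car lst) (take-first (cdr lst) (- n 1)))))

; Shuffle by repeatedly drawing a pseudo-random remaining element
(define (shuffle-list lst seed)
  (let loop ((remaining lst) (s (next-state seed)) (out '()))
    (if (null? remaining)
        (reverse out)
        (let* ((i (modulo (quotient s 65536) (length remaining)))
               (r (extract remaining i)))
          (loop (cdr r) (next-state s) (cons (car r) out))))))

; First half of the shuffled list is treatment, the rest is control
(define (randomize subjects seed)
  (let* ((shuffled (shuffle-list subjects seed))
         (half (quotient (length shuffled) 2)))
    (list (take-first shuffled half) (list-tail shuffled half))))

(define subjects '(1 2 3 4 5 6 7 8 9 10))
(define groups (randomize subjects 42))
(display "Treatment: ") (display (car groups)) (newline)
(display "Control:   ") (display (cadr groups)) (newline)
```

The seed makes the assignment reproducible: the same seed always yields the same groups, while a different seed gives a different split.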

Sampling strategies

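The population is everyone you want to study. The sample is the subset you actually measure. Good sampling avoids bias by using a random mechanism. In simple random sampling, every member of the population is equally likely to be chosen. In stratified sampling, the population is first split into groups (strata) of similar members, and a random sample is drawn within each stratum. In cluster sampling, the population is split into clusters and entire clusters are selected at random. Convenience samples, which take whoever is easiest to reach, are cheap but unreliable.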

Scheme

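; Simple random sample: every member equally likely to be chosen
(define population '(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20))

; Deterministic stand-in using modular arithmetic (reproducible output);
; it is not truly random and may return fewer than `size` elements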
(define (simple-random-sample pop size seed)
  (let loop ((remaining pop) (selected '()) (s seed) (count 0))
    (if (or (= count size) (null? remaining))
      (reverse selected)
      (if (< (modulo (* s 13) 17) 8)
        (loop (cdr remaining) (cons (car remaining) selected) (+ s 3) (+ count 1))
        (loop (cdr remaining) selected (+ s 7) count)))))

(display "Population: ")
(display population) (newline)

(define sample (simple-random-sample population 5 3))
(display "Random sample of 5: ")
(display sample) (newline)
(display "Sample size: ")
(display (length sample))
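
Stratified sampling can be sketched in the same style: split the population by stratum, then draw a fixed number of members from each. The population below, its "urban" and "rural" labels, and the helper names are all hypothetical, and the small linear congruential generator is again only a stand-in for a proper random number generator.

```scheme
; Hypothetical population of (id . stratum) pairs
(define people
  '((1 . "urban") (2 . "urban") (3 . "urban") (4 . "urban")
    (5 . "urban") (6 . "urban") (7 . "urban") (8 . "urban")
    (9 . "rural") (10 . "rural") (11 . "rural") (12 . "rural")))

; Illustrative linear congruential generator, not a production RNG
(define (next-state s)
  (modulo (+ (* 1103515245 s) 12345) 2147483648))

; Keep only the members whose stratum label matches
(define (stratum-members pop label)
  (cond ((null? pop) '())
        ((equal? (cdar pop) label)
         (cons (car pop) (stratum-members (cdr pop) label)))
        (else (stratum-members (cdr pop) label))))

; Remove the element at index i from lst
(define (remove-at lst i)
  (if (= i 0)
      (cdr lst)
      (cons (car lst) (remove-at (cdr lst) (- i 1)))))

; Draw k members without replacement; returns (picked . final-state)
(define (draw members k s)
  (let loop ((remaining members) (k k) (s s) (picked '()))
    (if (or (= k 0) (null? remaining))
        (cons (reverse picked) s)
        (let* ((s2 (next-state s))
               (i (modulo (quotient s2 65536) (length remaining))))
          (loop (remove-at remaining i) (- k 1) s2
                (cons (list-ref remaining i) picked))))))

; Stratified sample: draw k members from every stratum in labels
(define (stratified-sample pop labels k seed)
  (let loop ((labels labels) (s seed) (result '()))
    (if (null? labels)
        result
        (let ((d (draw (stratum-members pop (car labels)) k s)))
          (loop (cdr labels) (cdr d) (append result (car d)))))))

(define strat (stratified-sample people '("urban" "rural") 2 7))
(display "Stratified sample: ")
(display strat) (newline)
```

Drawing the same number from each stratum guarantees that the smaller rural group is represented, but it also makes rural members half of the sample although they are only a third of this population. Drawing in proportion to stratum size is the usual alternative when the sample should mirror the population.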

Neighbors

Related chapters

🎰 Grinstead Ch.1 — discrete probability spaces, the mathematical foundation beneath sampling

Translation notes

OpenIntro introduces data with the county and email datasets. We use small inline datasets instead to keep examples self-contained. The original covers data matrices, scatterplots, and data collection pitfalls in more detail. The sampling code here is deterministic for reproducibility; real sampling uses proper random number generators.

Want the full treatment? Read OpenIntro Statistics, Ch. 1.

by june.kim