Summarizing Data

OpenIntro Statistics · Ch. 2 · openintro.org/book/os

Raw data is noise until you summarize it. Center (mean, median), spread (standard deviation, IQR), and shape (skew, modality) compress a dataset into a few numbers. Visualizations like histograms and box plots make the structure visible.

Mean, median, mode

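The mean is the arithmetic average. The median is the middle value of the sorted data, or the average of the two middle values when the count is even. The mode is the most frequent value, and a dataset can have more than one. The mean is sensitive to outliers; the median is robust.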

Scheme

; Mean, median, mode
(define data '(2 3 5 5 7 8 12 15 22))

; Mean: sum / count
(define (mean lst)
  (exact->inexact (/ (apply + lst) (length lst))))

(display "Data: ") (display data) (newline)
(display "Mean: ") (display (mean data)) (newline)

; Median: middle value of sorted list
(define (median lst)
  (let* ((sorted (sort lst <))
         (n (length sorted))
         (mid (quotient n 2)))
    (if (odd? n)
      (list-ref sorted mid)
      (exact->inexact
        (/ (+ (list-ref sorted (- mid 1))
              (list-ref sorted mid)) 2)))))

(display "Median: ") (display (median data)) (newline)

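; Mode: most frequent value (the first one found wins a tie)
(define (count-of x lst)
  (apply + (map (lambda (y) (if (= x y) 1 0)) lst)))

(define (mode lst)
  (let loop ((rest lst) (best (car lst)))
    (cond ((null? rest) best)
          ((> (count-of (car rest) lst) (count-of best lst))
           (loop (cdr rest) (car rest)))
          (else (loop (cdr rest) best)))))

(display "Mode: ") (display (mode data)) (newline)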
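
To see the claim about outliers concretely, here is a minimal sketch that swaps the largest value for a much larger one and recomputes both summaries. The list with-outlier and the helper median-sorted-odd are made up for this illustration; they are not from the textbook.

```scheme
; Sketch: one extreme value moves the mean but leaves the median alone.
(define data '(2 3 5 5 7 8 12 15 22))
(define with-outlier '(2 3 5 5 7 8 12 15 220))  ; hypothetical: 22 replaced by 220

(define (mean lst)
  (exact->inexact (/ (apply + lst) (length lst))))

; Assumes an already sorted list of odd length, which holds for both lists here
(define (median-sorted-odd lst)
  (list-ref lst (quotient (length lst) 2)))

(display "Mean:   ") (display (mean data))
(display " -> ") (display (mean with-outlier)) (newline)
(display "Median: ") (display (median-sorted-odd data))
(display " -> ") (display (median-sorted-odd with-outlier)) (newline)
```

The mean jumps from about 8.8 to about 30.8, while the median stays at 7.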

Standard deviation and IQR

Standard deviation measures the typical distance of observations from the mean; it is the square root of the variance. IQR (interquartile range) is Q3 minus Q1, covering the middle 50% of the data. IQR is robust to outliers; standard deviation is not.

Scheme

; Standard deviation and IQR
(define data '(2 3 5 5 7 8 12 15 22))

(define (mean lst)
  (/ (apply + lst) (length lst)))

; Sample variance: sum of squared deviations from the mean, divided by n - 1
(define (variance lst)
  (let ((m (mean lst)))
    (/ (apply + (map (lambda (x) (* (- x m) (- x m))) lst))
       (- (length lst) 1))))  ; sample variance uses n-1

; Standard deviation: square root of variance
(define (std lst)
  (sqrt (exact->inexact (variance lst))))

(display "Std dev: ")
(display (std data)) (newline)

; IQR: Q3 - Q1
(define sorted (sort data <))
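; With n = 9 sorted values, the 25th and 75th percentiles fall at
; zero-based positions 0.25 * (n - 1) = 2 and 0.75 * (n - 1) = 6.
; Other quartile conventions can give slightly different values.
(define q1 (list-ref sorted 2))  ; 25th percentile
(define q3 (list-ref sorted 6))  ; 75th percentile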
(display "Q1: ") (display q1) (newline)
(display "Q3: ") (display q3) (newline)
(display "IQR: ") (display (- q3 q1))
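
The quartile positions above only work for nine values. A more general version might look like the sketch below, which assumes one common convention: take the fractional position p * (n - 1) in the sorted list and interpolate linearly between neighbors. The names quantile, iqr, and with-outlier are introduced here for illustration. The same sketch also checks the robustness claim by replacing 22 with 220.

```scheme
; Sketch: quantile by linear interpolation between sorted values.
; The position rule p * (n - 1) is an assumed convention; others exist.
(define (quantile sorted p)
  (let* ((pos (* p (- (length sorted) 1)))     ; fractional zero-based index
         (lo (inexact->exact (floor pos)))     ; index of the value just below
         (frac (- pos lo)))
    (if (= frac 0)
        (list-ref sorted lo)
        (+ (list-ref sorted lo)
           (* frac (- (list-ref sorted (+ lo 1)) (list-ref sorted lo)))))))

(define (iqr sorted)
  (- (quantile sorted 0.75) (quantile sorted 0.25)))

; Sample standard deviation (n - 1 denominator)
(define (std lst)
  (let* ((n (length lst))
         (m (/ (apply + lst) n)))
    (sqrt (exact->inexact
            (/ (apply + (map (lambda (x) (* (- x m) (- x m))) lst))
               (- n 1))))))

(define data '(2 3 5 5 7 8 12 15 22))           ; already sorted
(define with-outlier '(2 3 5 5 7 8 12 15 220))  ; hypothetical: 22 replaced by 220

(display "IQR:     ") (display (iqr data))
(display " -> ") (display (iqr with-outlier)) (newline)
(display "Std dev: ") (display (std data))
(display " -> ") (display (std with-outlier)) (newline)
```

Under this convention the IQR is 7 for both lists, while the standard deviation grows from roughly 6.5 to roughly 71.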

Contingency tables

For two categorical variables, a contingency table (cross-tabulation) counts how often each combination occurs. Row and column proportions reveal the relationship between the variables.

Scheme

; Contingency table: treatment outcome by group
; Rows: treatment/control. Columns: improved/no change

(define treatment-improved 28)
(define treatment-no-change 12)
(define control-improved 18)
(define control-no-change 22)

(define treatment-total (+ treatment-improved treatment-no-change))
(define control-total (+ control-improved control-no-change))

(display "             Improved  No Change  Total") (newline)
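(display "Treatment    ") (display treatment-improved)
(display "        ") (display treatment-no-change)
(display "         ") (display treatment-total) (newline)
(display "Control      ") (display control-improved)
(display "        ") (display control-no-change)
(display "         ") (display control-total) (newline)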
(newline)

; Row proportions
(display "Treatment improvement rate: ")
(display (exact->inexact (/ treatment-improved treatment-total)))
(newline)
(display "Control improvement rate:   ")
(display (exact->inexact (/ control-improved control-total)))
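
Column proportions answer a different question than row proportions: among the cases with a given outcome, what share came from each group? A minimal sketch using the same counts follows; the helper proportion and the column-total names are introduced here for illustration.

```scheme
; Sketch: column proportions for the same 2x2 table.
(define treatment-improved 28)
(define treatment-no-change 12)
(define control-improved 18)
(define control-no-change 22)

; Column totals: all cases that improved, all cases that did not
(define improved-total (+ treatment-improved control-improved))
(define no-change-total (+ treatment-no-change control-no-change))

(define (proportion part whole)
  (exact->inexact (/ part whole)))

(display "Share of improved cases from treatment:  ")
(display (proportion treatment-improved improved-total)) (newline)
(display "Share of no-change cases from treatment: ")
(display (proportion treatment-no-change no-change-total)) (newline)
```

About 61% of the improved cases come from the treatment group, against about 35% of the no-change cases.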

Translation notes

OpenIntro covers histograms, dot plots, and intensity maps with the county dataset. We focus on the core numerical summaries. The original also introduces the concept of robust statistics (median vs. mean) with income distribution examples. Variance here uses Bessel's correction (n-1 denominator) for sample variance, matching the textbook convention.

Want the full treatment? Read OpenIntro Statistics, Ch. 2.

by june.kim