Raw data is noise until you summarize it. Center (mean, median), spread (standard deviation, IQR), and shape (skew, modality) compress a dataset into a few numbers. Visualizations like histograms and box plots make the structure visible.
Mean, median, mode
The mean is the arithmetic average. The median is the middle value when sorted. The mode is the most frequent value. The mean is sensitive to outliers; the median is robust.
# Mean, median, mode in Python
data = [2, 3, 5, 5, 7, 8, 12, 15, 22]
mean = sum(data) / len(data)
sorted_data = sorted(data)
n = len(sorted_data)
median = sorted_data[n // 2] if n % 2 else (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2
from collections import Counter
mode = Counter(data).most_common(1)[0][0]
print(f"Data: {data}")
print(f"Mean: {mean:.1f}")
print(f"Median: {median}")
print(f"Mode: {mode}")
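To make the outlier sensitivity described above concrete, here is a small sketch using Python's standard `statistics` module. The value 500 is a made-up outlier added purely for illustration; it is not part of the original dataset.

```python
import statistics

data = [2, 3, 5, 5, 7, 8, 12, 15, 22]
# Hypothetical outlier, added only to illustrate the effect
with_outlier = data + [500]

# How far does each measure of center move when the outlier is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)

print(f"Mean shift:   {mean_shift:.1f}")
print(f"Median shift: {median_shift:.1f}")
```

With this one extra value the mean jumps by roughly 49, while the median moves by only 0.5.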
Standard deviation and IQR
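Standard deviation measures the typical distance of values from the mean: it is the square root of the variance, which is the average squared deviation from the mean. IQR (interquartile range) is Q3 minus Q1, covering the middle 50% of the data. IQR is robust to outliers; standard deviation is not, because squaring the deviations gives extreme values extra weight.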
Scheme
; Standard deviation and IQR
(define data '(2 3 5 5 7 8 12 15 22))
(define (mean lst)
  (/ (apply + lst) (length lst)))
; Variance: average squared deviation from mean
(define (variance lst)
  (let ((m (mean lst)))
    (/ (apply + (map (lambda (x) (* (- x m) (- x m))) lst))
       (- (length lst) 1)))) ; sample variance uses n-1
; Standard deviation: square root of variance
(define (std lst)
  (sqrt (exact->inexact (variance lst))))
(display "Std dev: ")
(display (std data)) (newline)
; IQR: Q3 - Q1
(define sorted (sort data <))
(define q1 (list-ref sorted 2)) ; 25th percentile: index 0.25 * (n - 1) = 2 for n = 9
(define q3 (list-ref sorted 6)) ; 75th percentile: index 0.75 * (n - 1) = 6 for n = 9
(display "Q1: ") (display q1) (newline)
(display "Q3: ") (display q3) (newline)
(display "IQR: ") (display (- q3 q1))
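For comparison, the same quantities can be sketched in Python with the standard `statistics` module. `statistics.stdev` uses the n-1 denominator. The choice of `method="inclusive"` is an assumption made here so that the quartiles line up with the hard-coded indices in the Scheme version; other quartile conventions can give slightly different values.

```python
import statistics

data = [2, 3, 5, 5, 7, 8, 12, 15, 22]

# Sample standard deviation (n-1 denominator)
std = statistics.stdev(data)

# quantiles(n=4) returns the three cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"Std dev: {std:.2f}")
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
```

Because the IQR depends only on the two quartiles, replacing the largest value with something far more extreme would leave it unchanged here, whereas the standard deviation would grow.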
Contingency tables
For two categorical variables, a contingency table (cross-tabulation) counts how often each combination occurs. Row and column proportions reveal the relationship between the variables.
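A minimal sketch of a contingency table in plain Python follows. The records and the category labels (a smoker flag paired with an exercise level) are invented for illustration and do not come from the source.

```python
from collections import Counter

# Hypothetical records: (smoker, exercise level) pairs, invented for illustration
records = [
    ("yes", "low"), ("yes", "low"), ("yes", "high"),
    ("no", "low"), ("no", "high"), ("no", "high"), ("no", "high"),
]

# Count how often each (row, column) combination occurs
counts = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Row proportions: each cell divided by its row total
row_props = {}
for r in rows:
    row_total = sum(counts[(r, c)] for c in cols)
    row_props[r] = {c: counts[(r, c)] / row_total for c in cols}

# Print each row as "count (row proportion)" per column
for r in rows:
    cells = "  ".join(f"{c}={counts[(r, c)]} ({row_props[r][c]:.2f})" for c in cols)
    print(f"{r:>3}: {cells}")
```

In this toy data the share of "low" differs between the two rows (2/3 versus 1/4), which is the kind of difference that row proportions make visible.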
OpenIntro covers histograms, dot plots, and intensity maps with the county dataset. We focus on the core numerical summaries. The original also introduces the concept of robust statistics (median vs. mean) with income distribution examples. Variance here uses Bessel's correction (n-1 denominator) for sample variance, matching the textbook convention.