# A Gentle Introduction to Confidence Intervals

![Confidence Intervals](/static/img/confidence-intervals-1.png)

Many job applicants lack an understanding of fundamental statistical concepts. In business contexts, data professionals gather information, compute summary metrics, and communicate findings to decision-makers.

## Are You Confident Enough in Your Summary?

A critical question emerges: after calculating an average of 10 minutes of user engagement from 500 sessions, how certain can you be of that figure? Potential concerns include:

- Non-representative sampling by coincidence
- Temporal variations (weekends vs. weekdays, day vs. night)
- Insufficient data collection duration
- Unknown thresholds for adequate data volume

Two primary factors influence confidence: "the range of variation in our collected data, and the amount of data collected."

## The Data Variance and Standard Deviation

Standard deviation measures data spread. "A low standard deviation indicates that the data points tend to be close to the mean, and vice versa."

```python
import numpy as np

# Two samples of 500 points with the same mean (10) but different spread
low_spread = np.random.normal(10, 1, 500)   # standard deviation of 1
high_spread = np.random.normal(10, 3, 500)  # standard deviation of 3
```

![Standard Deviation Comparison](/static/img/confidence-intervals-2.png)

Lower-variance datasets warrant greater confidence in their mean estimates than higher-variance ones.

## The Size of the Data (N)

Sample size dramatically affects the stability of the mean. Small samples produce highly variable means, while the means of large samples converge toward the population mean.

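As a rough illustration of the sample-size effect, the sketch below draws many samples of two sizes from the same distribution and compares how much their means wander. The seed, the number of repetitions, and the helper name `spread_of_means` are assumptions made for this example, not part of the original analysis.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed itself is arbitrary).
rng = np.random.default_rng(42)


def spread_of_means(sample_size, repetitions=1000, mean=10, std=3):
    """Hypothetical helper: draw `repetitions` samples of `sample_size`
    points each and return the standard deviation of their means."""
    samples = rng.normal(mean, std, size=(repetitions, sample_size))
    return samples.mean(axis=1).std()


small_spread = spread_of_means(5)
large_spread = spread_of_means(500)

print(f"Spread of sample means, n=5:   {small_spread:.2f}")
print(f"Spread of sample means, n=500: {large_spread:.2f}")
```

With these settings, the means of the five-point samples should scatter roughly ten times more widely than those of the 500-point samples, because the spread shrinks with the square root of the sample size.
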
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One shared axis so the ten density curves overlap (kind='kde' needs SciPy)
fig, ax = plt.subplots()
for _ in range(10):
    pd.Series(np.random.normal(10, 3, 5)).plot(
        kind='kde', bw_method=1, ax=ax
    )
plt.show()
```

![Small Sample Variation](/static/img/confidence-intervals-3.png)

![Large Sample Stability](/static/img/confidence-intervals-4.png)

## Standard Error

Standard error combines the effects of variance and sample size:

![Standard Error Formula](/static/img/confidence-intervals-5.png)

The formula: SE = Standard Deviation / sqrt(Sample Size)

```python
# Same mean (10); the standard deviation (1 or 9) and sample size (5 or 50) vary
sample1 = pd.Series(np.random.normal(10, 1, 5))
sample2 = pd.Series(np.random.normal(10, 1, 50))
sample3 = pd.Series(np.random.normal(10, 9, 5))
sample4 = pd.Series(np.random.normal(10, 9, 50))
```

![Standard Error Comparison](/static/img/confidence-intervals-6.png)

```python
# df is assumed to hold one row per sample, with these two columns
df['Sample Mean'].plot(kind='bar', yerr=df['STD. Error'])
```

![Error Bars](/static/img/confidence-intervals-7.png)

Wider error bars signal reduced confidence in the estimated values.

## Confidence Intervals

Rather than arbitrarily selecting a multiplier for the standard error, confidence intervals rely on statistical principles. Normal distributions exhibit predictable patterns: approximately 68% of the data falls within one standard deviation of the mean, and roughly 95% within two standard deviations.

Z-Scores establish confidence thresholds:

- Z = 1: ~68% confidence
- Z = 1.96: 95% confidence
- Z = 2.58: 99% confidence

The 95% confidence interval spans: Mean +/- (1.96 x Standard Error)

T-Scores serve as alternatives to Z-Scores when the sample is small and the standard deviation has to be estimated from the sample itself.

## What About Non-Normal Data?

**Key takeaway:** "Don't worry if your data does not come from a normal distribution, just treat it as if it does and you will be fine!"

The Central Limit Theorem establishes that sample means approximately follow a normal distribution regardless of the shape of the underlying data distribution, provided the samples are sufficiently large. However, this principle applies less reliably to medians and quantiles.

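To make the claim about non-normal data concrete, here is a minimal sketch that draws clearly skewed (exponential) data, builds a 95% interval as mean +/- 1.96 x standard error for each sample, and counts how often the interval contains the true mean. The exponential distribution, its mean of 10, the sample size of 200, the seed, and the helper name `confidence_interval` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

TRUE_MEAN = 10      # assumed mean of the skewed data
SAMPLE_SIZE = 200   # assumed number of observations per sample
REPETITIONS = 2000  # assumed number of simulated samples


def confidence_interval(sample, z=1.96):
    """Hypothetical helper: return (low, high) as mean +/- z * standard error."""
    mean = sample.mean()
    # ddof=1 gives the sample (rather than population) standard deviation.
    std_error = sample.std(ddof=1) / np.sqrt(len(sample))
    return mean - z * std_error, mean + z * std_error


hits = 0
for _ in range(REPETITIONS):
    # Exponential data is strongly right-skewed, so it is far from normal.
    sample = rng.exponential(scale=TRUE_MEAN, size=SAMPLE_SIZE)
    low, high = confidence_interval(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / REPETITIONS
print(f"Share of intervals containing the true mean: {coverage:.1%}")
```

If the Central Limit Theorem is doing its job, the printed share should land close to 95% even though the individual data points are heavily skewed. Shrinking `SAMPLE_SIZE` to a handful of points would be expected to pull the coverage below that mark.
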
## Conclusion

Practitioners should prioritize calculating standard error and depicting error bars around summary statistics. While scientific publications regularly employ this practice, business contexts lag behind. Data professionals must either educate stakeholders about interpreting error bars or personally ensure confidence assessments inform their recommendations.

![Summary Statistics with Error Bars](/static/img/confidence-intervals-8.png)

---

Tarek Amr, April 5, 2021