EPID 594
Spatial Epidemiology
University of Michigan School of Public Health
Jon Zelner
[email protected]
epibayes.io
Making the most of multi-level data using hierarchical models
Brief recap of the threefold path of hierarchical modeling
Radon ☢️ lab 🧪 !
Preview of next week
Simulated study where we sample 1000 individuals (\(i\)) from 20 neighborhoods (\(j\)) and measure:
\(y_{ij}\) is continuous systolic blood pressure (SBP) for individual \(i\) in location \(j\).
\(x_i \in \{0,1\}\) is a binary exposure indicating whether the individual gets regular physical exercise.
\(\beta\) is the change in \(y_{ij}\) associated with the exposure.
\(\alpha\) is mean SBP in the absence of exposure.
You have three choices: Which 🚪 will you choose?
Pool data across all units, i.e. ignore clustering.
i.e. fit model \(y_{ij} = \alpha + \beta x_i + \epsilon_i\)
Is this typically a good idea?
Complete pooling ignores potential sources of observed and unobserved unit-level confounding.
A fully pooled model:
\[ y_i = \alpha + \beta x_i + \epsilon_i \]
Assumes \(y_i\) is a combination of systematic variation (\(\alpha + \beta x_i\)) and uncorrelated random noise (\(\epsilon_i\)) where:
\[ \epsilon_i \overset{\text{i.i.d.}}{\sim} \text{Normal}(0, \sigma^2) \]
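A minimal sketch of what complete pooling looks like in practice, assuming a simulated version of the study above. All parameter values (baseline SBP of 120, exercise effect of -5, the two standard deviations, 50 people per neighborhood) are made up for illustration and are not from the original study description. The data are generated with neighborhood effects, then fit with a model that ignores them.

```r
# Hypothetical simulation: every numeric value below is an assumption
set.seed(594)
n_neighborhoods <- 20
n_per <- 50                       # 20 * 50 = 1000 individuals
alpha <- 120                      # assumed mean SBP without exposure
beta <- -5                        # assumed change in SBP with regular exercise
sigma_j <- 8                      # assumed SD of neighborhood-level effects
sigma_i <- 10                     # assumed SD of individual-level noise

neighborhood <- rep(seq_len(n_neighborhoods), each = n_per)
eps_j <- rnorm(n_neighborhoods, mean = 0, sd = sigma_j)
x <- rbinom(length(neighborhood), size = 1, prob = 0.5)
y <- alpha + beta * x + eps_j[neighborhood] +
  rnorm(length(neighborhood), mean = 0, sd = sigma_i)
sbp <- data.frame(y = y, x = x, neighborhood = factor(neighborhood))

# Complete pooling: one intercept and one slope, neighborhood ignored
pooled_fit <- lm(y ~ x, data = sbp)
coef(pooled_fit)

# If the i.i.d. assumption held, mean residuals would be near zero in
# every neighborhood; systematic departures point to clustering
round(tapply(resid(pooled_fit), sbp$neighborhood, mean), 1)
```

Neighborhood-level mean residuals that sit well away from zero suggest the noise term is not uncorrelated across individuals in the same place.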
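library(ggplot2)
# Intraclass correlation: share of total variance that lies between clusters
icc <- 0.9
total_var <- 1
cluster_sigma <- sqrt(icc * total_var)    # between-cluster SD
ind_sigma <- sqrt((1 - icc) * total_var)  # within-cluster SD
ind_cluster <- 100                        # individuals per cluster
ncluster <- 10
cluster_ids <- rep(1:ncluster, each = ind_cluster)
# Each cluster gets its own mean; individuals vary around their cluster mean
cluster_means <- rnorm(ncluster, sd = cluster_sigma)
ind_vals <- rnorm(n = length(cluster_ids), mean = cluster_means[cluster_ids], sd = ind_sigma)
df <- data.frame(x = ind_vals, cluster = cluster_ids)
# Histogram of the clustered values against the Normal(0, total_var) density
# we would expect if the observations were i.i.d.
g <- ggplot(df, aes(x = x)) +
  geom_histogram(binwidth = 0.05, aes(y = after_stat(density))) +
  xlab("Distance from mean") +
  ylab("Density") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(total_var))) +
  theme_bw()
plot(g)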
Unpooled approach: Fit a separate model to each unit (\(j\)), assuming outcomes in each unit are independent:
Model looks like: \(y_{ij} = \alpha_j + \beta_j x_i + \epsilon_{ij}\)
Where: \(\epsilon_{ij} \sim N(0, \sigma_{j}^2)\)
Totally unpooled models run the risk of overfitting the data, particularly in small samples.
Some places may have few observations, making unpooled models impractical.
We may want the effect of an exposure to be consistent across locations, which separate models cannot enforce.
Unpooled models have nothing to say about data from a new location.
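The drawbacks listed above can be sketched with a small simulation. This is an illustration under assumed values, not the course's own analysis: the unbalanced neighborhood sizes (5 to 100), the true parameters, and the data frame name `sbp` are all hypothetical. The true exposure effect is the same in every neighborhood, so any spread in the estimated \(\beta_j\) is noise.

```r
# Hypothetical unbalanced design: 20 neighborhoods, 1000 individuals total
set.seed(594)
n_j <- rep(c(5, 20, 75, 100), times = 5)
neighborhood <- rep(seq_along(n_j), times = n_j)
eps_j <- rnorm(length(n_j), mean = 0, sd = 8)     # assumed neighborhood effects
x <- rbinom(sum(n_j), size = 1, prob = 0.5)
# Assumed truth: alpha = 120, common beta = -5, individual noise SD = 10
y <- 120 - 5 * x + eps_j[neighborhood] + rnorm(sum(n_j), mean = 0, sd = 10)
sbp <- data.frame(y = y, x = x, neighborhood = neighborhood)

# No pooling: a separate regression for each neighborhood
fits <- lapply(split(sbp, sbp$neighborhood), function(d) lm(y ~ x, data = d))
unpooled <- data.frame(
  n_j = n_j,
  alpha_j = sapply(fits, function(f) coef(f)[["(Intercept)"]]),
  beta_j = sapply(fits, function(f) coef(f)[["x"]])
)

# Estimates of beta_j, ordered by neighborhood size. A slope is NA when
# everyone sampled in a small neighborhood has the same exposure value.
unpooled[order(unpooled$n_j), ]
```

In runs like this, the smallest neighborhoods tend to give the most erratic \(\beta_j\) estimates, and none of the 20 fitted models offers a prediction for a 21st neighborhood.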
Encode the assumption that places are similar unless data tell us otherwise.
Be flexible enough to reflect information in new data without overfitting.
Give answers equivalent to the fully pooled and unpooled approaches if that is what the data actually suggest.
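One way to see how a single model can meet all three requirements is the standard approximation for a partially pooled unit mean in a varying-intercept model with no predictors. The notation here is assumed for illustration: \(n_j\) is the number of observations in unit \(j\), \(\bar{y}_j\) is the unit's sample mean, \(\bar{y}_{\text{all}}\) is the overall mean, \(\sigma_y^2\) is the within-unit variance, and \(\sigma_\alpha^2\) is the between-unit variance.

\[ \hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_\alpha^2}\,\bar{y}_{\text{all}}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}} \]

As \(\sigma_\alpha^2 \to 0\), the estimate approaches the fully pooled mean \(\bar{y}_{\text{all}}\). As \(\sigma_\alpha^2 \to \infty\), it approaches the unpooled mean \(\bar{y}_j\). Units with small \(n_j\) are pulled more strongly toward the overall mean.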
Allow effects to vary across clusters, but constrain them to come from the same distribution:
Model looks like: \(y_{ij} = \alpha + \beta x_i + \epsilon_{i} + \epsilon_{j}\)
Where: \(\epsilon_{i} \sim N(0, \sigma_{i}^2)\)
And: \(\epsilon_{j} \sim N(0, \sigma_{j}^2)\)
This approach accommodates variation across units without assuming they have no similarity.
Allows us to include covariates both about individuals and their spatial context.
More likely to make accurate out-of-sample predictions than the fully pooled or unpooled models.
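A minimal sketch of a partially pooled (random-intercept) fit, assuming the `nlme` package, which ships with standard R distributions. The simulated data, the parameter values, and the name `sbp` are hypothetical. Other tools such as `lme4::lmer` or a fully Bayesian model would be equally reasonable choices; nothing here is meant as the course's prescribed method.

```r
library(nlme)

# Hypothetical data: 20 neighborhoods of unequal size, 1000 individuals total
set.seed(594)
n_j <- rep(c(5, 20, 75, 100), times = 5)
neighborhood <- rep(seq_along(n_j), times = n_j)
eps_j <- rnorm(length(n_j), mean = 0, sd = 8)     # assumed neighborhood effects
x <- rbinom(sum(n_j), size = 1, prob = 0.5)
# Assumed truth: alpha = 120, beta = -5, individual noise SD = 10
y <- 120 - 5 * x + eps_j[neighborhood] + rnorm(sum(n_j), mean = 0, sd = 10)
sbp <- data.frame(y = y, x = x, neighborhood = factor(neighborhood))

# Partial pooling: shared alpha and beta, plus a neighborhood-level
# intercept shift drawn from a common Normal distribution
partial_fit <- lme(y ~ x, random = ~ 1 | neighborhood, data = sbp)

fixef(partial_fit)        # estimates of alpha and beta
VarCorr(partial_fit)      # between-neighborhood and residual variance
head(ranef(partial_fit))  # estimated neighborhood effects (epsilon_j)
```

The two variance components reported by `VarCorr` correspond to the variances of \(\epsilon_{j}\) and \(\epsilon_{i}\) in the model above, and the estimated neighborhood effects are shrunk toward zero relative to what separate per-neighborhood fits would give.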
My very own radon mitigation system!
Sell more radon systems in Minnesota?
Reduce the burden of inequality in radon-associated health risks in Minnesota?
Branch out into other markets, like Michigan, where we have good measurements of soil uranium but don’t have access to household-level radon measurements?
Light dotted line = full pooling; Light solid line = no pooling; Dark line = partial pooling