EPID 684
Spatial Epidemiology
University of Michigan School of Public Health
Jon Zelner
[email protected]
epibayes.io
How do hierarchical models make the most use of multi-level data?
Quick overview of prep for Tuesday’s Destination presentations
Housekeeping for postcards 📮 presentations.
Walking through the three approaches to multi-level modeling discussed in (1).
Hierarchical model hands-on
Aim for 8-10 minutes each, including time for Q&A.
2-3 slides, which should be primarily visual in nature
Pose specific questions for feedback from the class.
Send your slides to me before class so I can load them up at the beginning of class.
Before spring break: Please send me 3 candidate readings you think could be good to assign for your class session.
Next week: I will circulate a sign-up for slots to run discussion during our Postcards class sessions.
If you have a conflict that necessitates presenting on a specific day, please let me know ASAP so that I can block it out for you.
Simulated study where we sample 1000 individuals (\(i\)) from 20 neighborhoods (\(j\)) and measure:
\(y_{ij}\) is continuous systolic blood pressure (SBP) for individual \(i\) in location \(j\).
\(x_i \in \{0,1\}\) is a binary exposure indicating whether the individual gets regular physical exercise.
\(\beta\) is the change in \(y_{ij}\) associated with the exposure
\(\alpha\) is mean SBP in the absence of exposure
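A minimal sketch of how a dataset like this could be simulated in base R. The sample sizes come from the setup above; every numeric value (the seed, equal sampling of 50 people per neighborhood, \(\alpha = 130\), \(\beta = -5\), and both standard deviations) is an illustrative assumption, not a value from the study design.

```r
set.seed(684)  # arbitrary seed, for reproducibility only

n_neighborhoods <- 20
n_per <- 50                # assumed: equal sampling, 20 * 50 = 1000 individuals
alpha <- 130               # assumed mean SBP (mmHg) in the absence of exposure
beta <- -5                 # assumed change in SBP associated with exercise
sigma_neighborhood <- 8    # assumed SD of neighborhood-level deviations
sigma_individual <- 12     # assumed SD of individual-level noise

# Neighborhood label j for each individual i
neighborhood <- rep(seq_len(n_neighborhoods), each = n_per)

# Binary exposure, assigned independently of neighborhood
x <- rbinom(length(neighborhood), size = 1, prob = 0.5)

# One shared deviation per neighborhood, plus individual noise
neighborhood_effect <- rnorm(n_neighborhoods, mean = 0, sd = sigma_neighborhood)
y <- alpha + beta * x + neighborhood_effect[neighborhood] +
  rnorm(length(neighborhood), mean = 0, sd = sigma_individual)

sbp <- data.frame(y = y, x = x, neighborhood = factor(neighborhood))
str(sbp)
```

Because the neighborhood deviations are shared by everyone in a neighborhood, outcomes within a neighborhood are correlated, which is the feature the three modeling choices treat differently.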
You have three choices: Which 🚪 will you choose?
Pool data across all units, i.e. ignore clustering.
i.e. fit model \(y_{ij} = \alpha + \beta x_i + \epsilon_i\)
Is this typically a good idea?
Complete pooling ignores potential sources of observed and unobserved unit-level confounding.
A fully pooled model:
\[ y_i = \alpha + \beta x_i + \epsilon_i \]
Assumes \(y_i\) is a combination of systematic variation (\(\alpha + \beta x_i\)) and uncorrelated random noise (\(\epsilon_i\)) where:
\[ \epsilon_i \overset{\text{i.i.d.}}{\sim} \text{Normal}(0, \sigma^2) \]
library(ggplot2)

# Share of total variance that lies between clusters (intraclass correlation)
icc <- 0.9
total_var <- 1
cluster_sigma <- sqrt(icc * total_var)    # between-cluster SD
ind_sigma <- sqrt((1 - icc) * total_var)  # within-cluster SD
ind_cluster <- 100                        # individuals per cluster
ncluster <- 10
cluster_ids <- rep(1:ncluster, each = ind_cluster)
# Draw cluster means, then individuals scattered around their own cluster mean
cluster_means <- rnorm(ncluster, sd = cluster_sigma)
ind_vals <- rnorm(n = length(cluster_ids), mean = cluster_means[cluster_ids], sd = ind_sigma)
df <- data.frame(x = ind_vals, cluster = cluster_ids)
# Histogram of the clustered draws against the i.i.d. Normal density the pooled model assumes
g <- ggplot(df, aes(x = x)) +
  geom_histogram(binwidth = 0.05, aes(y = after_stat(density))) +
  xlab("Distance from mean") +
  ylab("Density") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(total_var))) +
  theme_bw()
plot(g)
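The same problem shows up in a regression. A minimal base R sketch, assuming illustrative values that are not from the lecture (50 people in each of 20 neighborhoods, \(\alpha = 130\), \(\beta = -5\), neighborhood SD of 8, individual SD of 12): fit the fully pooled model and check whether the residuals really look like uncorrelated noise.

```r
set.seed(684)  # arbitrary seed

# Simulated clustered data (all numeric values are assumptions)
neighborhood <- rep(1:20, each = 50)
x <- rbinom(1000, size = 1, prob = 0.5)
u <- rnorm(20, sd = 8)  # shared neighborhood-level deviations
y <- 130 - 5 * x + u[neighborhood] + rnorm(1000, sd = 12)

# Complete pooling: one intercept and one slope, clustering ignored
pooled <- lm(y ~ x)
coef(pooled)

# If the errors were i.i.d., mean residuals by neighborhood would all sit near 0
resid_by_neighborhood <- tapply(resid(pooled), neighborhood, mean)
round(sort(resid_by_neighborhood), 1)
```

Mean residuals that differ systematically by neighborhood indicate that the pooled model's noise term is absorbing neighborhood-level structure.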
Unpooled approach: Fit a separate model to each unit (\(j\)), assuming outcomes in each unit are independent:
Model looks like: \(y_{ij} = \alpha_j + \beta_j x_i + \epsilon_{ij}\)
Where: \(\epsilon_{ij} \sim N(0, \sigma_{j}^2)\)
Totally unpooled models run the risk of overfitting the data, particularly in small samples.
Some places may have few observations, making unpooled models impractical.
We may want the effect of an exposure to be consistent across locations, which separate models cannot enforce.
Unpooled models have nothing to say about data from a new location.
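A minimal base R sketch of the unpooled approach, using the same kind of simulated data (all numeric values are illustrative assumptions, and the true exposure effect is set to be identical in every neighborhood). Each neighborhood gets its own `lm()` fit.

```r
set.seed(684)  # arbitrary seed

# Simulated data: the true slope is -5 everywhere (assumed values)
sbp <- data.frame(neighborhood = factor(rep(1:20, each = 50)),
                  x = rbinom(1000, size = 1, prob = 0.5))
u <- rnorm(20, sd = 8)
sbp$y <- 130 - 5 * sbp$x + u[as.integer(sbp$neighborhood)] + rnorm(1000, sd = 12)

# No pooling: a separate regression for each neighborhood j
fits <- lapply(split(sbp, sbp$neighborhood), function(d) lm(y ~ x, data = d))

# One row per neighborhood: estimated alpha_j and beta_j
unpooled <- t(sapply(fits, coef))
head(unpooled)

# Spread of the 20 slope estimates, even though the true slope never varies
range(unpooled[, "x"])
sd(unpooled[, "x"])
```

The slope estimates vary from place to place purely because each is based on only 50 observations, and none of the 20 fits offers a prediction for a 21st neighborhood.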
Encode the assumption that places are similar unless data tell us otherwise.
Be flexible enough to reflect information in new data without overfitting.
Give answers equivalent to the fully pooled and unpooled approaches if that is what the data actually suggest.
Allow effects to vary across clusters, but constrain them to come from the same distribution:
Model looks like: \(y_{ij} = \alpha + \beta x_i + \epsilon_{i} + \epsilon_{j}\)
Where: \(\epsilon_{i} \sim N(0, \sigma_{i}^2)\)
And: \(\epsilon_{j} \sim N(0, \sigma_{j}^2)\)
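A worked special case may help show what the shared distribution does. Assume there is no exposure term and treat both variances as known. Write \(n_j\) for the number of individuals sampled in neighborhood \(j\), \(\bar{y}_j\) for that neighborhood's sample mean, and \(\bar{y}\) for the overall mean (notation introduced here for illustration). The partially pooled estimate of the neighborhood mean is then approximately a precision-weighted average:

\[ \hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma_i^2}\,\bar{y}_j + \frac{1}{\sigma_j^2}\,\bar{y}}{\frac{n_j}{\sigma_i^2} + \frac{1}{\sigma_j^2}} \]

When \(n_j\) is large or \(\sigma_j^2\) is large, the weight on \(\bar{y}_j\) dominates and the estimate approaches the unpooled answer. When \(n_j\) is small or \(\sigma_j^2 \to 0\), the estimate is pulled toward \(\bar{y}\), the fully pooled answer.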
This approach accommodates variation across units without assuming they have no similarity.
Allows us to include covariates both about individuals and their spatial context.
More likely to make accurate out-of-sample predictions than the fully pooled or unpooled approaches.
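A minimal sketch of a partially pooled fit, using `lme()` from the `nlme` package (one of several options; it is used here only because it ships with standard R installations, not because the lecture prescribes it). The simulated data and all numeric values are illustrative assumptions.

```r
library(nlme)  # recommended package bundled with standard R installations

set.seed(684)  # arbitrary seed

# Simulated clustered data (all numeric values are assumptions)
sbp <- data.frame(neighborhood = factor(rep(1:20, each = 50)),
                  x = rbinom(1000, size = 1, prob = 0.5))
u <- rnorm(20, sd = 8)
sbp$y <- 130 - 5 * sbp$x + u[as.integer(sbp$neighborhood)] + rnorm(1000, sd = 12)

# Partial pooling: shared alpha and beta, plus a neighborhood-level random intercept
partial <- lme(y ~ x, random = ~ 1 | neighborhood, data = sbp)
fixef(partial)    # estimated alpha and beta
VarCorr(partial)  # estimated between- and within-neighborhood variation

# Compare neighborhood intercepts with and without pooling
no_pool <- coef(lm(y ~ 0 + neighborhood + x, data = sbp))[1:20]
partial_intercepts <- coef(partial)[, "(Intercept)"]
c(no_pooling = sd(no_pool), partial_pooling = sd(partial_intercepts))

# Prediction for a neighborhood not in the data: fall back on the population-level terms
new_place <- fixef(partial)[["(Intercept)"]] + fixef(partial)[["x"]] * c(0, 1)
new_place
```

The last step is the practical payoff: because the neighborhood effects are modeled as draws from a common distribution, the fitted model still has a defensible prediction for an unobserved location.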
My very own radon mitigation system!
In groups:
Sell more radon systems in Minnesota?
Reduce the burden of inequality in radon-associated health risks in Minnesota?
Branch out into other markets, like Michigan, where we have good measurements of soil uranium but don’t have access to household-level radon measurements?
Light dotted line = full pooling; Light solid line = no pooling; Dark line = partial pooling
Destination presentations