What we do
luma's community health worker (CHW) interactions generate primary data: real questions, real referrals, real protocol matches. That data is sparse on its own. To produce useful district- and country-level estimates, we combine it with published priors (UNAIDS/WHO/PEPFAR estimates for Lesotho) using a Bayesian update.
The result is a posterior estimate that:
- Uses the prior when primary data is scarce
- Updates toward the observed signal as primary data accumulates
- Reports a 95% credible interval (CI) whose width reflects the actual sample size
The math (Beta-Binomial)
For a proportion (e.g., HIV prevalence among adults), we use a Beta-Binomial conjugate update. This is the standard Bayesian textbook approach.
Prior
The published Lesotho HIV prevalence is 22.8% (UNAIDS 2024). We turn this into
a Beta(α, β) prior with a chosen "effective sample size" that
controls how strongly the prior anchors the posterior:
α_prior = prior_mean × effective_n
β_prior = (1 - prior_mean) × effective_n
For HIV we use effective_n = 200 — a strong prior, since UNAIDS
estimates are based on census-scale household surveys. We don't want one CHW
conversation to swing a country-level estimate.
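As a minimal sketch of the prior construction above, the two formulas can be written as a small helper. The function name `betaPriorFromMean` and the input checks are illustrative assumptions, not necessarily what src/projections.js contains.

```javascript
// Hypothetical helper: turn a published mean and an effective sample size
// into Beta(alpha, beta) parameters, following the two formulas above.
function betaPriorFromMean(priorMean, effectiveN) {
  if (!(priorMean > 0 && priorMean < 1)) {
    throw new RangeError("priorMean must lie strictly between 0 and 1");
  }
  if (!(effectiveN > 0)) {
    throw new RangeError("effectiveN must be positive");
  }
  return {
    alpha: priorMean * effectiveN,
    beta: (1 - priorMean) * effectiveN,
  };
}

// HIV prior from the text: mean 22.8%, effective_n = 200.
const hivPrior = betaPriorFromMean(0.228, 200);
console.log(hivPrior); // alpha ≈ 45.6, beta ≈ 154.4 (up to floating-point rounding)
```

A larger effective_n keeps the same prior mean but makes the prior harder to move, which is the anchoring behaviour described above.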
Likelihood
Each CHW interaction is tagged with a topic (HIV, TB, MNCH, etc.) by an LLM-based extractor. We treat the count of HIV-related interactions (k) out of total interactions (n) as a Binomial sample. This treats interactions as independent draws and takes the extractor's tags at face value.
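The counting step might look like the sketch below. The interaction shape (`{ id, topic }`) and the function name `countTopic` are assumptions made for illustration; the real extractor output may carry more fields.

```javascript
// Hypothetical interaction records: `topic` is the tag assigned by the extractor.
function countTopic(interactions, topic) {
  const n = interactions.length; // total interactions
  const k = interactions.filter((item) => item.topic === topic).length; // matches
  return { k, n };
}

const sampleInteractions = [
  { id: 1, topic: "HIV" },
  { id: 2, topic: "TB" },
  { id: 3, topic: "MNCH" },
  { id: 4, topic: "HIV" },
];

console.log(countTopic(sampleInteractions, "HIV")); // { k: 2, n: 4 }
```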
Posterior
α_post = α_prior + k
β_post = β_prior + (n - k)
posterior_mean = α_post / (α_post + β_post)
variance = (α_post × β_post) / ((α_post + β_post)² × (α_post + β_post + 1))
sd = √variance
95% CI ≈ posterior_mean ± 1.96 × sd
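We use a normal approximation to the Beta credible interval, which is reasonable when α_post and β_post are both well above 1 and the posterior mean is not close to 0 or 1. Near those boundaries the approximate interval can extend outside [0, 1]. For very small posteriors we widen the interval intentionally.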
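The update and interval formulas above can be sketched as one function. The name `betaBinomialUpdate`, the input validation, and the clamping of the interval to [0, 1] are assumptions of this sketch rather than documented behaviour, and the counts k = 30, n = 100 are made up purely to show the arithmetic.

```javascript
// Sketch of the conjugate update with a normal-approximation 95% interval.
function betaBinomialUpdate(prior, k, n) {
  if (!Number.isInteger(k) || !Number.isInteger(n) || k < 0 || k > n) {
    throw new RangeError("k and n must be integers with 0 <= k <= n");
  }
  const alpha = prior.alpha + k;
  const beta = prior.beta + (n - k);
  const total = alpha + beta;
  const mean = alpha / total;
  const variance = (alpha * beta) / (total * total * (total + 1));
  const sd = Math.sqrt(variance);
  // Assumption of this sketch: clamp to [0, 1], since a proportion cannot leave that range.
  const lower = Math.max(0, mean - 1.96 * sd);
  const upper = Math.min(1, mean + 1.96 * sd);
  return { alpha, beta, mean, sd, ci95: [lower, upper] };
}

// Prior from the text: mean 0.228, effective_n = 200.
const prior = { alpha: 0.228 * 200, beta: (1 - 0.228) * 200 };

// Illustrative counts only, not real luma data.
const posterior = betaBinomialUpdate(prior, 30, 100);
console.log(posterior.mean.toFixed(3)); // 0.252
console.log(posterior.ci95.map((x) => x.toFixed(3))); // roughly [ '0.203', '0.301' ]
```

With 100 observations against an effective prior size of 200, the posterior mean moves only about a third of the way from 0.228 toward the observed 0.30.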
For incidence rates (e.g., TB)
Rates per 100,000 don't fit cleanly into a Beta-Binomial. We use a simpler weighted average:
posterior_rate = (1 - λ) × prior_rate + λ × observed_rate
λ = weight on the observed rate, intended to be a function of sample size (small n → λ near 0)
Currently we fix λ = 0.3 regardless of sample size. The plan is to make it sample-size-adaptive
as primary data accumulates from active districts.
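A minimal sketch of the weighted average follows. `blendRate` mirrors the formula above with the fixed λ = 0.3 as its default. `adaptiveLambda` is only one possible sample-size-dependent form (λ = n / (n + halfWeightN)), shown as a hypothetical; the text does not commit to any particular function, and the observed rate of 500 is an invented example value.

```javascript
// Weighted average of prior and observed rates (same units, e.g. per 100,000).
function blendRate(priorRate, observedRate, lambda = 0.3) {
  if (!(lambda >= 0 && lambda <= 1)) {
    throw new RangeError("lambda must be between 0 and 1");
  }
  return (1 - lambda) * priorRate + lambda * observedRate;
}

// Hypothetical adaptive weight: 0 with no data, 0.5 when n equals halfWeightN,
// approaching 1 as n grows. Not the documented method.
function adaptiveLambda(n, halfWeightN) {
  if (!(n >= 0) || !(halfWeightN > 0)) {
    throw new RangeError("n must be >= 0 and halfWeightN must be > 0");
  }
  return n / (n + halfWeightN);
}

// TB prior from the table (650 per 100,000) with an invented observed rate of 500.
console.log(blendRate(650, 500)); // ≈ 605 with the fixed λ = 0.3
console.log(blendRate(650, 500, adaptiveLambda(50, 500))); // stays close to the prior for small n
```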
Priors used
Source years are documented in src/projections.js. Replace these values as new data is published.
| Topic | Prior | Source |
|---|---|---|
| HIV adult prevalence | 22.8% | UNAIDS 2024 |
| TB incidence | 650 per 100,000 per year | WHO Global TB Report 2024 |
| Maternal mortality | 540 per 100,000 live births | WHO GHO 2024 |
| Under-5 mortality | 91 per 1,000 live births | WHO GHO 2024 |
| Stunting (under-5) | 32% | UNICEF JME 2024 |
| Wasting (under-5) | 2.7% | UNICEF JME 2024 |
| Modern contraceptive use | 62% | UN Population Division 2024 |
What the current build CAN and CANNOT support
Can
- Show the framework end-to-end: ingestion → tagging → aggregation → projection → customer view
- Produce posteriors that update in real time as new CHW interactions arrive
- Distinguish prior-dominated estimates from data-dominated ones via the confidence indicator
- Provide an API endpoint that pharma real-world evidence (RWE) teams and WHO can consume programmatically
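One way to separate prior-dominated from data-dominated estimates is sketched below. In a Beta-Binomial update the posterior mean is a weighted average of the prior mean and the observed proportion k/n, with weight n / (n + effective_n) on the data. How the actual confidence indicator is computed is not specified here, so the function names and the 0.5 cut-off are hypothetical.

```javascript
// Share of the posterior's total pseudo-count that comes from observed data.
function dataWeight(n, effectiveN) {
  if (!(n >= 0) || !(effectiveN > 0)) {
    throw new RangeError("n must be >= 0 and effectiveN must be > 0");
  }
  return n / (n + effectiveN);
}

// The 0.5 cut-off is an illustrative assumption, not a documented threshold.
function dominance(n, effectiveN, cutoff = 0.5) {
  return dataWeight(n, effectiveN) >= cutoff ? "data-dominated" : "prior-dominated";
}

console.log(dataWeight(100, 200).toFixed(3)); // 0.333
console.log(dominance(100, 200)); // prior-dominated
console.log(dominance(1000, 200)); // data-dominated
```

With effective_n = 200, an estimate only becomes data-dominated under this cut-off once at least 200 interactions have been observed.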
Cannot (yet)
- Produce statistically meaningful district-level estimates from a few conversations
- Detect outbreak clusters — needs historical baseline data luma doesn't have yet
- Replace household survey methodology — DHS-style data remains complementary, not redundant
- Produce demographic stratification beyond what the extractor tags
Where this goes (production scale)
Once luma reaches full operational scale (national CHW coverage across multiple districts), the same framework produces:
- District-level prevalence posteriors for HIV, TB, and severe acute malnutrition (SAM) with meaningful CIs
- Real-time incidence trend lines for outbreak detection
- Treatment cascade indicators (initiation, retention, viral suppression) for pharma RWE
- Patient population sizing for clinical trial site selection — luma's commercial wedge
None of those rely on the current build's small dataset. They rely on the framework being correct.