luma — methodology

What we do

luma's CHW interactions generate primary data: real questions, real referrals, real protocol matches. That data is sparse on its own. To produce useful district- and country-level estimates, we combine it with published priors — UNAIDS/WHO/PEPFAR estimates for Lesotho — using a Bayesian update.

The result is a posterior estimate that:

Uses the prior when primary data is scarce
Updates toward the observed signal as primary data accumulates
Reports a 95% credible interval that reflects the actual sample size

Honest caveat. At current scale — early Lesotho deployment, a modest number of weekly CHW interactions — the posteriors are dominated by the priors. The customer dashboards display this transparently via the "confidence" indicator on every projection card. The framework is real; the indicators become operationally tight as the CHW network expands toward national coverage.
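
One way to quantify "dominated by the priors": under the Beta-Binomial update described in the math section, the posterior mean is a weighted average of the prior mean and the observed proportion, and the weight on the observed data is n / (n + effective_n). The JavaScript sketch below computes that weight. The name `dataWeight` is illustrative and is not taken from luma's codebase, and the dashboard's confidence indicator may be computed differently.

```javascript
// Share of the posterior mean that comes from observed data rather than the prior.
// Hypothetical helper for illustration; not luma's actual implementation.
function dataWeight(n, effectiveN) {
  if (n < 0 || effectiveN <= 0) {
    throw new RangeError('n must be >= 0 and effectiveN must be > 0');
  }
  return n / (n + effectiveN);
}

// Using the effective sample size of 200 that the Prior section assigns to HIV:
console.log(dataWeight(50, 200));   // 0.2, so the prior carries 80% of the weight
console.log(dataWeight(1800, 200)); // 0.9, so the data carries 90% of the weight
```

With 50 tagged interactions the prior still contributes four fifths of the estimate, which is why current posteriors sit close to the published figures.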

The math (Beta-Binomial)

For a proportion (e.g., HIV prevalence among adults), we use a Beta-Binomial conjugate update. This is the standard Bayesian textbook approach.

Prior

The published Lesotho HIV prevalence is 22.8% (UNAIDS 2024). We turn this into a Beta(α, β) prior with a chosen "effective sample size" that controls how strongly the prior anchors the posterior:

α_prior = prior_mean × effective_n
β_prior = (1 - prior_mean) × effective_n

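For HIV we use effective_n = 200, which gives α_prior = 45.6 and β_prior = 154.4. The prior therefore carries the weight of 200 observations, so one CHW conversation cannot swing a country-level estimate. We chose a strong prior because UNAIDS estimates are based on census-scale household surveys.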
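
The prior construction can be sketched in a few lines of JavaScript. This is a minimal illustration of the two formulas above; the function name `betaPrior` and the input checks are assumptions, not necessarily what src/projections.js contains.

```javascript
// Convert a published mean and a chosen effective sample size into Beta(alpha, beta).
// `betaPrior` is an illustrative name, not taken from luma's source.
function betaPrior(priorMean, effectiveN) {
  if (!(priorMean > 0 && priorMean < 1)) {
    throw new RangeError('priorMean must be strictly between 0 and 1');
  }
  if (!(effectiveN > 0)) {
    throw new RangeError('effectiveN must be positive');
  }
  return {
    alpha: priorMean * effectiveN,
    beta: (1 - priorMean) * effectiveN,
  };
}

// HIV example from the text: mean 22.8%, effective_n = 200.
const hivPrior = betaPrior(0.228, 200);
console.log(hivPrior); // alpha is about 45.6, beta is about 154.4
```

Raising effective_n makes the prior harder to move; lowering it lets the CHW data pull the estimate sooner.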

Likelihood

Each CHW interaction gets tagged with a topic (HIV, TB, MNCH, etc.) by an LLM-based extractor. We treat the count of HIV-related interactions (k) over total interactions (n) as a Binomial sample.
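
A sketch of how k and n might be derived from tagged interactions follows. The record shape is a guess for illustration: it assumes each interaction carries a `topics` array written by the extractor, which the text does not specify. `countTopic` is likewise a hypothetical name.

```javascript
// Count Binomial successes (k) and trials (n) for one topic.
// Assumes each interaction has a `topics` array; that shape is hypothetical.
function countTopic(interactions, topic) {
  let k = 0;
  for (const interaction of interactions) {
    if (Array.isArray(interaction.topics) && interaction.topics.includes(topic)) {
      k += 1;
    }
  }
  return { k, n: interactions.length };
}

// Made-up sample data, for illustration only.
const sample = [
  { id: 1, topics: ['HIV'] },
  { id: 2, topics: ['TB', 'HIV'] },
  { id: 3, topics: ['MNCH'] },
  { id: 4, topics: [] },
];
console.log(countTopic(sample, 'HIV')); // { k: 2, n: 4 }
```

Interactions with no tags still count toward n here, so untagged conversations lower the observed proportion rather than being dropped.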

Posterior

α_post = α_prior + k
β_post = β_prior + (n - k)
posterior_mean = α_post / (α_post + β_post)
variance = (α_post × β_post) / ((α_post + β_post)² × (α_post + β_post + 1))
sd = √variance
95% CI ≈ posterior_mean ± 1.96 × sd

We use a normal approximation to the Beta credible interval, which is accurate when α_post and β_post are both well above 1. For very small samples we widen the interval intentionally.

For incidence rates (e.g., TB)

Rates per 100,000 don't fit cleanly into a Beta-Binomial. We use a simpler weighted average:

posterior_rate = (1 - λ) × prior_rate + λ × observed_rate
λ = function of sample size (small n → λ near 0)

Currently λ is fixed at 0.3 regardless of sample size. The plan is to make it sample-size-adaptive as primary data accumulates from active districts.
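
A minimal sketch of the weighted average, with the fixed λ = 0.3 as the default. The second function shows one possible sample-size-adaptive form, λ = n / (n + n0), which tends to 0 for small n as described above. That form and the constant `n0` are assumptions for illustration only; the text does not say which function luma will adopt. All names are hypothetical.

```javascript
// Weighted blend of a prior rate and an observed rate (same units, e.g. per 100,000).
function blendRate(priorRate, observedRate, lambda = 0.3) {
  if (lambda < 0 || lambda > 1) {
    throw new RangeError('lambda must be between 0 and 1');
  }
  return (1 - lambda) * priorRate + lambda * observedRate;
}

// One possible adaptive weight: small n gives lambda near 0, large n approaches 1.
// Both the formula and n0 are assumptions, not luma's documented choice.
function adaptiveLambda(n, n0) {
  if (n < 0 || n0 <= 0) {
    throw new RangeError('n must be >= 0 and n0 must be > 0');
  }
  return n / (n + n0);
}

// TB prior of 650 per 100k from the priors table; the observed 500 is made-up data.
console.log(blendRate(650, 500)); // about 605 with the fixed lambda of 0.3
console.log(blendRate(650, 500, adaptiveLambda(20, 200))); // stays close to 650
```

Note that a fixed λ applies the same 30% weight to the observed rate whether it comes from 5 interactions or 5,000, which is the limitation the adaptive form is meant to remove.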

Priors used

Source years documented in src/projections.js. Replace with current values as new data is published.

Topic	Prior	Source
HIV adult prevalence	22.8%	UNAIDS 2024
TB incidence	650 / 100k / yr	WHO Global TB Report 2024
Maternal mortality	540 / 100k live births	WHO GHO 2024
Under-5 mortality	91 / 1,000 live births	WHO GHO 2024
Stunting (under-5)	32%	UNICEF JME 2024
Wasting (under-5)	2.7%	UNICEF JME 2024
Modern contraceptive use	62%	UN Population Division 2024

What the current build CAN and CANNOT support

Can

Show the framework end-to-end: ingestion → tagging → aggregation → projection → customer view
Produce posteriors that update in real time as new CHW interactions arrive
Distinguish prior-dominated estimates from data-dominated ones via the confidence indicator
Provide an API endpoint that pharma real-world evidence (RWE) teams and WHO can consume programmatically

Cannot (yet)

Produce statistically meaningful district-level estimates from a few conversations
Detect outbreak clusters — needs historical baseline data luma doesn't have yet
Replace household survey methodology — DHS-style data remains complementary, not redundant
Produce demographic stratification beyond what the extractor tags

Where this goes (production scale)

Once luma reaches full operational scale (national CHW coverage across multiple districts), the same framework produces:

District-level prevalence posteriors for HIV, TB, and severe acute malnutrition (SAM) with meaningful credible intervals
Real-time incidence trend lines for outbreak detection
Treatment cascade indicators (initiation, retention, viral suppression) for pharma RWE
Patient population sizing for clinical trial site selection — luma's commercial wedge

None of those rely on the current build's small dataset. They rely on the framework being correct.