Method

MICA fills in missing correlations in a partial matrix using a Bayesian model that respects positive definiteness, anchors each missing cell to what the surrounding matrix implies, and tells you cell by cell which imputed values you can trust.

The setup

You've done a meta-analysis. Each cell of your matrix holds one correlation, pooled across the studies that measured that pair of variables: some cells have many studies behind them, some few, and some zero because no primary study measured both variables.

You want a complete matrix to feed into a Stage-2 SEM. Cells with zero studies are missing. Cells with very few studies are observed but unreliable. MICA fills the missing cells and tells you how confident to be about each filled-in value.

That's the whole problem. The phases below describe how MICA solves it.

Quick Bayesian refresher

Bayesian inference is a formal way of updating beliefs in the face of evidence. Three pieces:

PRIOR: your belief about the unknown before seeing the data. Wide = "I don't know much"; narrow = "I'm fairly confident."

LIKELIHOOD: how the data relates to the unknown. Given a specific candidate value, how probable is the data you observed?

POSTERIOR: your belief after seeing the data. The prior multiplied by the likelihood, then normalized: probability concentrates on whatever values are both plausible under your prior AND consistent with the observed data.

CREDIBLE INTERVAL: a 95% credible interval is the range covering 95% of the posterior probability. We say a method is "calibrated" when its 95% intervals cover the truth roughly 95% of the time across repeated applications.
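The update mechanics can be shown with a toy conjugate-normal example (the numbers and variable names are illustrative, not MICA's):

```python
import numpy as np

# Prior belief about an unknown quantity (treated as unbounded for simplicity):
prior_mean, prior_sd = 0.0, 0.5   # wide prior: "I don't know much"

# Observed estimate and its standard error (the likelihood's inputs):
obs, obs_se = 0.42, 0.08

# Conjugate normal-normal update: posterior precision = sum of precisions;
# posterior mean = precision-weighted average of prior mean and observation.
post_prec = 1 / prior_sd**2 + 1 / obs_se**2
post_sd = post_prec ** -0.5
post_mean = (prior_mean / prior_sd**2 + obs / obs_se**2) / post_prec

# 95% credible interval: the central 95% of the posterior.
lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(round(post_mean, 3), round(lo, 3), round(hi, 3))
```

Note how the posterior mean lands between the prior mean and the observation, weighted by their precisions, and how the interval tightens relative to the prior.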

Phase 1 — Looking at what the matrix already tells us

Before any imputation, MICA studies the structure of what you've already observed. The key question is: for each missing cell, how much do the observed cells around it constrain what the missing value could plausibly be?

The intuition: if both endpoints of a missing cell correlate strongly with the same other variables, those variables are anchors that constrain the missing correlation. MICA runs a regression: the missing correlation is predicted from the observed correlations involving the same two variables. The result is two numbers — a predicted value (r-hat) and an R² (how much of the variation in the observed cells the anchors explain).

This gives a per-cell flag:

• R² > 0.7 → data-dominant: the matrix essentially determines the value. The interval will be tight and trustworthy.

• R² < 0.3 → prior-dominated: the matrix tells you almost nothing. Any honest imputation should produce a wide interval.

• R² in [0.3, 0.7] → tension regime: anchors are partially informative but not decisive. We know empirically this is where calibrated intervals are hardest to produce reliably.
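A simplified sketch of such an anchor regression, assuming a single scalar feature (the mean product of correlations through shared anchors) and ordinary least squares; MICA's actual regression may use a richer feature set:

```python
import numpy as np

def anchor_feature(R, i, j):
    """Mean product r_ik * r_kj over anchors k observed in both rows."""
    d = R.shape[0]
    prods = [R[i, k] * R[k, j] for k in range(d)
             if k not in (i, j) and np.isfinite(R[i, k]) and np.isfinite(R[k, j])]
    return float(np.mean(prods)) if prods else np.nan

def anchor_regression(R, i, j):
    """Predict missing cell (i, j) from the anchor features of observed cells.

    Returns (r_hat, R2). A one-feature OLS sketch of the idea in the text,
    not MICA's actual regression."""
    d = R.shape[0]
    xs, ys = [], []
    for a in range(d):
        for b in range(a + 1, d):
            if (a, b) != (i, j) and np.isfinite(R[a, b]):
                x = anchor_feature(R, a, b)
                if np.isfinite(x):
                    xs.append(x)
                    ys.append(R[a, b])
    slope, intercept = np.polyfit(xs, ys, 1)
    resid = np.array(ys) - (intercept + slope * np.array(xs))
    r2 = 1 - resid.var() / np.array(ys).var()
    return intercept + slope * anchor_feature(R, i, j), r2

# Demo: one-factor truth r_ij = l_i * l_j, with cell (0, 3) masked.
l = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(l, l)
np.fill_diagonal(R, 1.0)
R[0, 3] = R[3, 0] = np.nan
r_hat, r2 = anchor_regression(R, 0, 3)   # r_hat lands near the true 0.40
```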

Phase 1 also computes width-PD: a geometric bound in which the requirement that the completed matrix be positive definite already narrows the range of values the cell could take, before any imputation.
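The width-PD idea can be illustrated by sweeping candidate values for a single missing cell and keeping those that leave the matrix positive definite (a brute-force illustration, not MICA's computation):

```python
import numpy as np

def pd_width(R, i, j, grid=np.linspace(-0.999, 0.999, 2001)):
    """Range of values for cell (i, j) that keep the matrix positive
    definite, assuming (i, j) is the only missing cell."""
    feasible = []
    for v in grid:
        M = R.copy()
        M[i, j] = M[j, i] = v
        if np.linalg.eigvalsh(M).min() > 0:   # smallest eigenvalue > 0 <=> PD
            feasible.append(v)
    return min(feasible), max(feasible)

# Two strong observed correlations already pin down much of the missing cell:
R = np.array([[1.0,    0.8, np.nan],
              [0.8,    1.0, 0.8],
              [np.nan, 0.8, 1.0]])
lo, hi = pd_width(R, 0, 2)
print(round(lo, 3), round(hi, 3))   # the cell cannot dip much below ~0.28
```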

Phase 2 — The chordal baseline

MICA computes a quick deterministic imputation called a chordal max-determinant completion. It fills the missing cells with the values that maximize the determinant of the completed matrix, i.e. the valid completion that imposes the least structure beyond what the observed cells already imply.

Why bother? Two reasons. First, this baseline is something MICA compares itself against, so we know whether the Bayesian step is adding value. Second, the chordal completion is a good starting point for the Bayesian sampler — it gives a sensible neighborhood to begin exploring.

This is what corBoundary (Ahn & Abbamonte, 2020) and related methods produce. MICA includes it as a stage to benchmark against, not as the final answer.
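For a 3×3 matrix with a single missing cell, the max-determinant completion has a closed form, which a brute-force sweep confirms (illustrative only; real chordal completions handle much larger sparsity patterns):

```python
import numpy as np

# With r13 missing, the determinant-maximizing value is the one that zeroes
# the partial correlation of variables 1 and 3 given variable 2: r13 = r12 * r23.
r12, r23 = 0.6, 0.5

def det_with(r13):
    M = np.array([[1.0, r12, r13],
                  [r12, 1.0, r23],
                  [r13, r23, 1.0]])
    return np.linalg.det(M)

grid = np.linspace(-0.99, 0.99, 1999)
best = grid[np.argmax([det_with(v) for v in grid])]
print(round(best, 3))   # numeric maximizer sits at r12 * r23 = 0.30
```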

Phase 3 — The Bayesian core

Three pieces interlock here.

THE CHOLESKY FACTORIZATION. A correlation matrix can't be just any collection of numbers — it has to be positive definite (a mathematical way of saying "this matrix could have come from real data"). If you fill in random numbers between -1 and +1, almost none of the resulting matrices will be valid. Cholesky factorization is a clever workaround: every valid correlation matrix can be written as the product of a lower-triangular matrix with its own transpose. MICA samples in that triangular space, where every proposal is automatically valid. Imagine landing a plane in a snowstorm — the Cholesky parameterization is a runway that points you only at valid landing positions.
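A minimal sketch of the parameterization (not MICA's sampler): take any lower-triangular factor with unit-norm rows; its product with its own transpose is always a valid correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Any lower-triangular L with unit-norm rows and a nonzero diagonal
# yields a valid correlation matrix R = L @ L.T.
L = np.tril(rng.normal(size=(d, d)))
np.fill_diagonal(L, np.abs(np.diag(L)))          # positive diagonal (conventional)
L /= np.linalg.norm(L, axis=1, keepdims=True)    # unit-norm rows -> unit diagonal of R

R = L @ L.T
print(np.allclose(np.diag(R), 1.0), np.linalg.eigvalsh(R).min() > 0)
```

Off-diagonal entries land in [-1, 1] automatically (Cauchy–Schwarz), so every proposal in L-space is a legal correlation matrix.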

THE REGRESSION-BOUND PRIOR. For each missing cell, MICA constructs a Bayesian prior using the Phase 1 anchor regression. The prior says: "Before looking at the observed data, my best guess is r-hat, and my uncertainty depends on how informative the anchors were (the R²)." High R² → tight prior; low R² → wide prior that admits many plausible values. There's also a small empirical-Bayes shift toward the matrix's central tendency that reduces a known systematic bias by about 26%.
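One way such a prior could be assembled; the shrinkage weight and the SD endpoints below are invented for illustration, not MICA's constants:

```python
def regression_bound_prior(r_hat, r2, matrix_mean, shrink=0.1,
                           sd_lo=0.05, sd_hi=0.40):
    """Prior for one missing cell (hedged sketch).

    Mean: the anchor prediction, nudged toward the matrix's central
    tendency (the empirical-Bayes shift). SD: interpolates from wide
    (low R2) to tight (high R2)."""
    mean = (1 - shrink) * r_hat + shrink * matrix_mean
    sd = sd_hi - (sd_hi - sd_lo) * r2
    return mean, sd

# Informative anchors -> tight prior; uninformative anchors -> wide prior.
m_hi, s_hi = regression_bound_prior(0.45, r2=0.9, matrix_mean=0.30)
m_lo, s_lo = regression_bound_prior(0.45, r2=0.1, matrix_mean=0.30)
```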

THE HIERARCHICAL HETEROGENEITY MODEL. Each observed cell has k studies and a between-study heterogeneity τ². MICA estimates one τ² across the matrix, allowing individual cells to deviate. This borrows strength across cells while respecting cell-specific signal. A non-centered parameterization keeps the sampler well-behaved when heterogeneity is small.
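The non-centered trick, in miniature: sample standard-normal offsets and scale them by τ, rather than sampling the cell-level effects directly, so the sampler's geometry stays well-conditioned as τ shrinks toward zero (values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

mu, tau = 0.35, 0.05          # matrix-level mean and between-study SD
z = rng.normal(size=8)        # one standard-normal offset per cell
cell_effects = mu + tau * z   # implied cell-level values

# As tau -> 0, z keeps its fixed standard-normal scale while the implied
# effects collapse smoothly onto mu — no funnel for the sampler to navigate.
```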

The end result is a posterior distribution over the entire correlation matrix. Every missing cell now has a full distribution of plausible values, not just a single number.

How the likelihood works without knowing the truth

Common confusion: "we don't know the true matrix, so how can the likelihood compare to it?" The resolution: the likelihood does not compare candidates to the truth. It compares candidates to your observed values.

The sampler proposes thousands of candidate matrices. For each candidate, the likelihood asks: "If this candidate were the truth, how probable is it that meta-analytic pooling would have produced the observed values we collected?"

Candidates whose observed-cell positions sit near your actual observed values are highly compatible. Candidates that sit far from them are poorly compatible. We never need the truth — we only need our evidence and the sampling theory that connects truth to data.

For the missing cells: the likelihood doesn't directly score them (there's no observed value at those positions). But candidates have to be valid correlation matrices, so the values at every cell are mathematically linked. Candidates that score well on observed cells automatically have specific kinds of values at the missing cells. Run the sampler long enough and the missing-cell positions accumulate a distribution that reflects all the constraints simultaneously — observed-cell likelihood, geometric validity, and the regression-bound prior.
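A toy version of that scoring, using a normal approximation to the sampling distribution of the pooled estimates (all values invented; MICA's likelihood also folds in the heterogeneity model):

```python
import numpy as np

def log_lik(candidate_cells, observed, se):
    """If `candidate_cells` were the truth, how probable are the observed
    pooled values? Gaussian log-likelihood, summed over observed cells."""
    z = (observed - candidate_cells) / se
    return float(np.sum(-0.5 * z**2 - np.log(se * np.sqrt(2 * np.pi))))

observed = np.array([0.40, 0.25, 0.55])   # pooled correlations at observed cells
se = np.array([0.05, 0.08, 0.04])         # their standard errors

near = np.array([0.41, 0.27, 0.53])       # candidate close to the evidence
far = np.array([0.10, 0.60, 0.20])        # candidate far from it
print(log_lik(near, observed, se) > log_lik(far, observed, se))   # True
```

No reference to the true matrix appears anywhere: candidates are scored only against the observed values and their sampling error.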

Courtroom analogy: the observed values are evidence. Candidate matrices are suspects. The likelihood is the prosecutor's argument: "If suspect X were the perpetrator, here's how probable the evidence is." We never see the actual perpetrator; we just reason about which suspects are most consistent with what we have.

Phase 4 — Per-cell triage (the headline)

The posterior gives a credible interval for every missing cell. But our empirical work showed that not all intervals can be trusted equally. MICA uses the Phase 1 flag to decide what to report:

• DATA-DOMINANT cells (R² > 0.7): tight interval, trustworthy. Anchors strongly determined the answer. 95% credible interval covers the truth ≈95% of the time on real data.

• PRIOR-DOMINATED cells (R² < 0.3): wide interval, trustworthy. Anchors couldn't say much, so MICA honestly reports a wide range. Coverage is at or above 95%, often well above, because we refuse to claim more precision than the data supports.

• TENSION REGIME cells (R² ∈ [0.3, 0.7]): we report point estimate only. This is the regime where, on some real matrices, credible intervals systematically miss the truth. The framework flags these cells and explicitly recommends sensitivity analysis or avoiding them in Stage-2 SEM paths.

This per-cell triage is what makes MICA distinctive. Other methods either report intervals on every cell (over-promising on some) or report no intervals at all. MICA reports intervals where they can be trusted and withholds them where they can't.
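The reporting rule reduces to a threshold map on the Phase 1 R² (thresholds as quoted above):

```python
def triage(r2):
    """Map a cell's anchor R² to its flag and reporting decision."""
    if r2 > 0.7:
        return "data-dominant", "report tight credible interval"
    if r2 < 0.3:
        return "prior-dominated", "report wide credible interval"
    return "tension", "report point estimate only; recommend sensitivity analysis"
```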

Phase 5 — Forwarding to Stage-2 SEM

Three options for using MICA's output in Stage-2 SEM:

(1) Posterior mean as point estimate, proceed as if the matrix were directly observed. Simplest but understates uncertainty for imputed cells. MICA's contribution here: the posterior mean is more accurate than chordal completion or pairwise mean.

(2) Multiple imputation (recommended). Draw 50 imputed matrices from the posterior, fit the Stage-2 SEM on each, pool the results using Rubin's (1987) rules. Propagates imputation uncertainty into Stage-2 path coefficients honestly. Cost: SEM runs 50× instead of once.

(3) Cell-effective N. Use the per-cell posterior variance to compute an effective sample size for each cell, entering Stage-2 with appropriately downweighted contributions for imputed cells.
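Option (2)'s pooling step is standard Rubin (1987) combining; a minimal sketch for a single Stage-2 parameter (the example numbers are made up):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool one parameter across m imputed fits via Rubin's rules.

    estimates: per-imputation point estimates; variances: their squared SEs.
    Returns the pooled estimate and pooled SE."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()
    within = u.mean()                        # average within-imputation variance
    between = q.var(ddof=1)                  # between-imputation variance
    total = within + (1 + 1 / m) * between   # Rubin's total variance
    return qbar, np.sqrt(total)

# A path coefficient fitted on 5 imputed matrices (50 in practice):
qbar, pooled_se = rubin_pool([0.31, 0.29, 0.33, 0.30, 0.32], [0.01] * 5)
```

The pooled SE exceeds the average within-imputation SE, which is exactly how imputation uncertainty propagates into the Stage-2 coefficients.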

The per-cell flags from Phase 4 tell you which Stage-2 path coefficients are well-supported and which depend on flagged cells.

When MICA does NOT help

Refuses outright when:

• The matrix is disconnected (variables in different components share no information; joint imputation would fabricate relationships).

• A variable has fewer than 15% of its cells with the rest of the matrix observed; any imputation involving it is essentially prior-only.

Recommends pairwise mean instead when:

• The coefficient of variation CV = sd(r) / median|r| is below 1.0. The matrix is uniformly correlated; anchors don't add signal beyond the matrix mean. Empirically validated on the mindfulness matrix (CV = 0.81), where MICA loses across all mask fractions.

Warns the user when:

• Missingness exceeds the CV-derived breakeven (~30% at CV = 1.3, ~55% at CV = 1.9); MICA may underperform pairwise mean here.

• A high fraction of missing cells have no shared anchors.
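These gates can be sketched as a small recommender; the linear interpolation between the two quoted breakeven points is my assumption, not MICA's formula:

```python
import numpy as np

def recommend(observed_r, missing_frac):
    """Apply the CV gate and the missingness-breakeven warning.

    observed_r: pooled correlations from the observed cells.
    missing_frac: fraction of cells that are missing."""
    r = np.asarray(observed_r, dtype=float)
    cv = r.std() / np.median(np.abs(r))
    if cv < 1.0:
        return "use pairwise mean (matrix too uniform for anchors to help)"
    # Breakeven: linear interpolation through (CV=1.3, 30%) and (CV=1.9, 55%).
    breakeven = 0.30 + (cv - 1.3) * (0.55 - 0.30) / (1.9 - 1.3)
    if missing_frac > breakeven:
        return "warn: missingness beyond the CV-derived breakeven"
    return "use MICA"
```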

Validation

Validation:

• Simulation: 1,695 fits across d ∈ {15, 25, 50, 100, 150} with three realistic-truth families (flat, asymmetric, bimodal). Coverage 95–100% across all conditions.

• Real matrices: five (mindfulness, IE, WE, organizational justice, procrastination). MICA wins on RMSE on 4 of 5; the single loss (mindfulness, CV = 0.81) is consistent with the recommender's CV gate.

• Per-cell coverage on real data: the data-dominant and prior-dominated regimes calibrate at 95–99%. The tension regime calibrates on bimodal matrices (procrastination, justice) but not on homogeneous-positive matrices (mindfulness: coverage drops to 53.5% on the [0.5, 0.7] band). This matrix-specific behavior is exactly what the per-cell triage flag detects.