Interpretable Discovery of Hidden Dynamics in BK Virus Infection

The challenge

We have a mathematical model that tracks six biological compartments: susceptible cells, infected cells, BK virus, two T-cell populations, and serum creatinine. The model works well for most compartments, but the creatinine equation has a gap. A function \(h(t;\theta)\) on the right-hand side captures unknown dynamics we don't have a mechanistic theory for. The question is: can we learn what \(h\) should look like directly from patient data, without guessing its form in advance?

Clinical context

We work with electronic health records from Duke University Hospitals covering more than 900 renal transplant patients, 493 of them with BKV dynamics. The data include viral load measurements, creatinine trajectories, vitals, demographics, and immunosuppressive dosing.

Why interpretability matters

A black-box neural network might fit the data well, but a nephrologist can't act on it. By extracting compact, human-readable ODE terms through symbolic regression, we give clinicians equations they can inspect, challenge, and ultimately trust.

What we found

Causal inference revealed that \(h\) depends primarily on susceptible cells and creatinine itself. Symbolic regression then recovered a parsimonious closed-form expression that closely matches the PINN-learned dynamics.

Three-Stage Pipeline

1. PINNs

Train a neural network that respects the known ODE structure while learning the unknown function \(h\). The physics loss penalizes any solution that violates the compartment model, which keeps the learned representation biologically consistent even with noisy, sparse data.

2. Causal inference

Not everything that correlates with creatinine actually drives it. We use causal analysis to separate genuine mechanistic drivers from spurious correlations, narrowing down which variables should appear in the final equation for \(h\).

3. Symbolic regression

Turn the PINN's black-box learned function into a compact, human-readable mathematical expression. The result is an equation a clinician can write on a whiteboard, not a neural network with thousands of weights.
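To make the complexity-versus-error idea concrete, here is a minimal sketch that uses a simple stand-in for symbolic regression: exhaustive least-squares fits over subsets of a small candidate library. This is not the tool or the search procedure used in the study. The library terms, the synthetic target \(h = 0.8\,H_S - 0.5\,C\), and the normalized sample grid are all assumptions chosen for illustration only.

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit(cols, y):
    """Least-squares fit via the normal equations; returns (coefficients, MSE)."""
    k = len(cols)
    A = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
         for i in range(k)]
    rhs = [sum(a * t for a, t in zip(cols[i], y)) for i in range(k)]
    coef = solve(A, rhs)
    preds = [sum(c * col[n] for c, col in zip(coef, cols)) for n in range(len(y))]
    mse = sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
    return coef, mse


# Synthetic samples of (H_S, C) in normalized units (hypothetical values).
grid = [(0.2 * i, 0.8 + 0.3 * j) for i in range(1, 6) for j in range(5)]
# Hypothetical "learned" h; in practice this would be the PINN output.
h_true = [0.8 * hs - 0.5 * c for hs, c in grid]

# Candidate terms built from the variables flagged by causal inference.
library = {
    "1": [1.0 for _ in grid],
    "C": [c for _, c in grid],
    "H_S": [hs for hs, _ in grid],
    "H_S*C": [hs * c for hs, c in grid],
}
names = list(library)

# For each complexity (number of terms), keep the lowest-error expression.
best = {}
for k in range(1, len(names) + 1):
    for subset in combinations(names, k):
        coef, mse = fit([library[n] for n in subset], h_true)
        if k not in best or mse < best[k][2]:
            best[k] = (subset, coef, mse)

for k, (subset, coef, mse) in best.items():
    print(k, subset, [round(c, 3) for c in coef], f"{mse:.2e}")
```

In this toy setting the error collapses once the two terms that generate the target are both available, which is the kind of elbow a complexity-error tradeoff is meant to expose. Genuine symbolic regression also searches over nonlinear operators and expression structure, which this sketch does not attempt.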

Framework schematic. Clinical data (C, V) feeds into an ODE model. PINNs learn the time-dependent RHS, causal inference identifies variables that affect h(t), and symbolic regression extracts the final interpretable form.

Results

PINN trajectory fitting on real patient data. Magenta dots show observed clinical data; black dashed curves show PINN-learned dynamics for all six compartments: susceptible cells (H_S), infected cells (H_I), creatinine (C), BK-specific T-cells (E_V), allospecific T-cells (E_K), and BK virus (V).

Causal inference network showing h regulated by Hs and C

Causal inference network. Combined network reveals that the hidden function h is primarily regulated by susceptible cells (H_S) and creatinine (C).

ODE simulation using learned symbolic form vs real patient data

Final validation. ODE simulation using the learned symbolic form (patient-specific and combined) fitted against real patient creatinine data.

Symbolic regression: true h vs PINNs learned vs SR predicted, and complexity-error tradeoff

Symbolic regression (synthetic validation). Left: on a benchmark where the true h is known by construction, the PINN-learned h (black dashed) and SR-predicted h (purple) both recover the ground truth (red) closely. Right: complexity vs. error Pareto front showing error drops rapidly with moderate complexity.

Compartment Model & Equations

Compartment model diagram and governing equations

Compartment model. Six state variables govern BKV infection dynamics: susceptible cells (H_S), infected cells (H_I), BK virus (V), BK-specific T-cells (E_V), allospecific T-cells (E_K), and serum creatinine (C). A seventh term, the unknown forcing function h(t), appears in the creatinine equation to capture uncharacterized dynamics. Its mathematical form is what the PINN learns from data.

PINN loss components

\[ \mathcal{L}_{\text{data}} = \frac{1}{N_d}\sum_{i=1}^{N_d}\bigl\|\hat{y}(t_i) - y_i^{\text{obs}}\bigr\|^2 \]

\[ \mathcal{L}_{\text{physics}} = \frac{1}{N_r}\sum_{j=1}^{N_r}\bigl\|\dot{\hat{C}}(t_j) - \lambda_C + \delta_{C0}\tfrac{\hat{H}_S(t_j)}{\hat{H}_S(t_j)+\kappa_{CH}} - \hat{h}(t_j)\bigr\|^2 \]

\[ \mathcal{L}_{\mathrm{PINN}} = \mathcal{L}_{\mathrm{data}} + \lambda_{\mathrm{phys}}\,\mathcal{L}_{\mathrm{physics}} \]

The data loss fits the measured trajectories. The physics loss, written above for the creatinine equation, penalizes violation of the compartment ODEs at collocation points, enforcing biological consistency.
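The loss terms can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code from the study: the network outputs are replaced by simple closed-form surrogates (`C_hat`, `dC_hat`, `HS_hat`, `h_hat`), the data loss is shown for creatinine only, and every parameter value is hypothetical. In an actual PINN the time derivative would come from automatic differentiation of the network rather than a hand-written function.

```python
def data_loss(pred, obs):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def physics_loss(ts, dC_hat, HS_hat, h_hat, lam_C, delta_C0, kappa_CH):
    """Mean squared residual of the creatinine ODE at collocation points ts."""
    total = 0.0
    for t in ts:
        hs = HS_hat(t)
        residual = (dC_hat(t) - lam_C
                    + delta_C0 * hs / (hs + kappa_CH)
                    - h_hat(t))
        total += residual ** 2
    return total / len(ts)


def pinn_loss(l_data, l_phys, lam_phys):
    """Weighted sum of the data and physics losses."""
    return l_data + lam_phys * l_phys


# Hypothetical parameter values, chosen only to make the example run.
lam_C, delta_C0, kappa_CH = 0.1, 0.05, 100.0
HS0, h0, C0 = 300.0, 0.02, 1.0

# With constant H_S and constant h the ODE has a straight-line solution,
# so these surrogates satisfy the creatinine equation exactly.
slope = lam_C - delta_C0 * HS0 / (HS0 + kappa_CH) + h0
C_hat = lambda t: C0 + slope * t
dC_hat = lambda t: slope
HS_hat = lambda t: HS0
h_hat = lambda t: h0

t_obs = [0.0, 5.0, 10.0]               # observation times
c_obs = [C_hat(t) for t in t_obs]      # noise-free synthetic "measurements"
t_col = [0.5 * j for j in range(21)]   # collocation points

l_data = data_loss([C_hat(t) for t in t_obs], c_obs)
l_phys = physics_loss(t_col, dC_hat, HS_hat, h_hat, lam_C, delta_C0, kappa_CH)
# Dropping h from the model leaves an unexplained residual of size h0.
l_phys_wrong = physics_loss(t_col, dC_hat, HS_hat, lambda t: 0.0,
                            lam_C, delta_C0, kappa_CH)
total = pinn_loss(l_data, l_phys, lam_phys=1.0)
print(l_data, l_phys, l_phys_wrong, total)
```

The comparison between `l_phys` and `l_phys_wrong` shows why the physics term is informative about \(h\): a surrogate that fits the data but omits the hidden forcing still incurs a nonzero ODE residual.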

Creatinine equation with unknown RHS

\[ \dot{C}(t) = \lambda_C - \delta_{C0}\,\frac{H_S}{H_S + \kappa_{CH}} + h(t;\theta) \]

The function \(h(t;\theta)\) captures the unknown mechanism governing creatinine change. Causal inference identified that \(h\) depends on \(H_S\) and \(C\), and symbolic regression extracted an interpretable closed-form expression.
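Once a closed form for \(h\) is available, the creatinine equation can be integrated forward and compared with patient data. The sketch below shows that step with a classical fourth-order Runge-Kutta integrator. The expression used for `h_placeholder` is a made-up stand-in that merely depends on \(H_S\) and \(C\); it is not the expression recovered in the study. Holding \(H_S\) constant and all numeric values are likewise assumptions, since the other compartment equations are not reproduced here.

```python
def creatinine_rhs(C, HS, h, lam_C, delta_C0, kappa_CH):
    """Right-hand side of the creatinine equation with a pluggable h(H_S, C)."""
    return lam_C - delta_C0 * HS / (HS + kappa_CH) + h(HS, C)


def simulate(C0, HS_of_t, h, params, t_end, dt):
    """Integrate dC/dt with classical RK4; returns a list of (t, C) pairs."""
    def f(t, C):
        return creatinine_rhs(C, HS_of_t(t), h, *params)

    n_steps = int(round(t_end / dt))
    t, C = 0.0, C0
    trajectory = [(t, C)]
    for _ in range(n_steps):
        k1 = f(t, C)
        k2 = f(t + dt / 2, C + dt * k1 / 2)
        k3 = f(t + dt / 2, C + dt * k2 / 2)
        k4 = f(t + dt, C + dt * k3)
        C += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        trajectory.append((t, C))
    return trajectory


# Hypothetical parameters (lam_C, delta_C0, kappa_CH).
params = (0.1, 0.05, 100.0)


def h_placeholder(HS, C):
    """Illustrative h(H_S, C) only: a saturating H_S term minus linear decay in C."""
    return 0.01 * HS / (HS + 100.0) - 0.05 * C


# Assumption: H_S held constant, standing in for the full compartment model.
traj = simulate(C0=2.0, HS_of_t=lambda t: 300.0, h=h_placeholder,
                params=params, t_end=400.0, dt=0.1)
print(traj[0], traj[-1])
```

With these placeholder choices the equation reduces to linear relaxation toward a steady state, which makes the integrator easy to sanity-check before substituting a learned expression and time-varying \(H_S\).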

Dissertation

Developing Gaussian Process and Theory-Informed Neural Network Models for Clinical Decision Making

T. Zhoroev · Ph.D. Dissertation, North Carolina State University, 2025

This dissertation provides the broader methodological context for the BK virus hidden-dynamics study and situates the work within clinically grounded, interpretable machine learning.

Code availability: Repository is private as it contains research-stage material. A public methods repository will be released upon publication.