NC State University · PhD Research

Interpretable Discovery of Hidden Dynamics in BK Virus Infection

🎓 Ph.D. Dissertation Research. Journal manuscript in preparation.

After a kidney transplant, doctors face a delicate balancing act: suppress the immune system enough to prevent rejection, but not so much that viruses like BK can take hold. When BKV reactivates, serum creatinine rises, but the exact mechanism connecting viral dynamics to kidney function has been poorly understood. I built a three-stage pipeline (physics-informed neural networks, causal inference, and symbolic regression) to learn the missing piece of the mathematical model directly from clinical data. The output: interpretable equations that clinicians can reason about.

EHR from Duke Hospitals · 900+ patients · 493 with BKV dynamics
PINNs + Causal Inference + Symbolic Regression pipeline
Goal: model serum creatinine to predict immunosuppression dosage

The challenge

We have a mathematical model that tracks six biological compartments: susceptible cells, infected cells, BK virus, two T-cell populations, and serum creatinine. The model works well for most compartments, but the creatinine equation has a gap. A function \(h(t;\theta)\) on the right-hand side captures unknown dynamics we don't have a mechanistic theory for. The question is: can we learn what \(h\) should look like directly from patient data, without guessing its form in advance?

Three-Stage Pipeline

1. PINNs

Train a neural network that respects the known ODE structure while learning the unknown function \(h\). The physics loss penalizes any solution that violates the compartment model, which keeps the learned representation biologically consistent even with noisy, sparse data.

2. Causal inference

Not everything that correlates with creatinine actually drives it. We use causal analysis to separate genuine mechanistic drivers from spurious correlations, narrowing down which variables should appear in the final equation for \(h\).

3. Symbolic regression

Turn the PINN's black-box learned function into a compact, human-readable mathematical expression. The result is an equation a clinician can write on a whiteboard, not a neural network with thousands of weights.
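To make the idea concrete, here is a deliberately tiny sketch of symbolic regression as a search over candidate expressions. It is not the tool or search strategy used in this work: the candidate library, the single fitted scale coefficient, and the synthetic samples are all assumptions chosen for illustration. Practical symbolic regression searches a far larger expression space.

```python
# Toy symbolic regression: score a small, hand-written library of candidate
# forms against samples of a learned function h(HS, C) and keep the best fit.
# Every candidate below is a hypothetical placeholder, not a form from the study.
CANDIDATES = {
    "HS": lambda hs, c: hs,
    "C": lambda hs, c: c,
    "HS*C": lambda hs, c: hs * c,
    "C/(1+HS)": lambda hs, c: c / (1.0 + hs),
}


def fit_scale(basis, samples):
    """Least-squares coefficient a for the model h ~ a * basis(hs, c)."""
    num = sum(basis(hs, c) * h for hs, c, h in samples)
    den = sum(basis(hs, c) ** 2 for hs, c, _ in samples)
    return num / den if den > 0 else 0.0


def mse(basis, a, samples):
    """Mean squared error of a * basis(hs, c) against the sampled h values."""
    return sum((a * basis(hs, c) - h) ** 2 for hs, c, h in samples) / len(samples)


def best_expression(samples):
    """Return (name, coefficient, error) of the lowest-error candidate."""
    scored = []
    for name, basis in CANDIDATES.items():
        a = fit_scale(basis, samples)
        scored.append((name, a, mse(basis, a, samples)))
    return min(scored, key=lambda item: item[2])


# Synthetic (HS, C, h) samples where h = 2 * HS * C by construction, standing
# in for evaluations of the PINN-learned function on a grid.
samples = [(hs, c, 2.0 * hs * c) for hs in (0.5, 1.0, 2.0) for c in (0.8, 1.2, 1.6)]
name, coeff, err = best_expression(samples)
print(name, round(coeff, 6), err)
```

Restricting the inputs to \(H_S\) and \(C\) mirrors the role of the causal-inference stage, which narrows the set of variables the search is allowed to use.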

Framework schematic: Clinical Data → ODE Model → PINNs → Causal Inference → Symbolic Regression
Framework schematic. Clinical data (C, V) feeds into an ODE model. PINNs learn the time-dependent RHS, causal inference identifies variables that affect h(t), and symbolic regression extracts the final interpretable form.

Results

PINNs results: exact vs learned dynamics for all compartments
PINNs trajectory fitting on real patient data. Magenta dots show observed clinical data; black dashed curves show PINNs-learned dynamics for all six compartments: susceptible cells (HS), infected cells (HI), creatinine (C), BK-specific T-cells (EV), allospecific T-cells (EK), and BK virus (V).
Causal inference network showing h regulated by Hs and C
Causal inference network. The combined network shows that the hidden function h is primarily regulated by susceptible cells (HS) and creatinine (C).
ODE simulation using learned symbolic form vs real patient data
Final validation. ODE simulation using the learned symbolic form (patient-specific and combined) fitted against real patient creatinine data.
Symbolic regression: true h vs PINNs learned vs SR predicted, and complexity-error tradeoff
Symbolic regression (synthetic validation). Left: on a benchmark where the true h is known by construction, the PINN-learned h (black dashed) and SR-predicted h (purple) both recover the ground truth (red) closely. Right: complexity vs. error Pareto front showing error drops rapidly with moderate complexity.
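The complexity vs. error tradeoff in the right panel can be sketched in a few lines. The candidate labels, complexity scores, and error values below are made up for illustration and are not results from this study. The function itself is a standard non-dominated filter.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates, ordered by increasing complexity.

    candidates: list of (label, complexity, error) tuples. A candidate is kept
    only if its error is strictly lower than that of every candidate that is
    at least as simple.
    """
    front = []
    best_error = float("inf")
    # Sort by complexity (ties broken by error), then sweep once.
    for label, complexity, error in sorted(candidates, key=lambda c: (c[1], c[2])):
        if error < best_error:
            front.append((label, complexity, error))
            best_error = error
    return front


# Hypothetical expressions with illustrative complexity and error values.
candidates = [
    ("expr_a", 1, 0.90),
    ("expr_b", 3, 0.40),
    ("expr_c", 4, 0.45),  # dominated by expr_b: more complex and less accurate
    ("expr_d", 5, 0.05),
    ("expr_e", 9, 0.04),
    ("expr_f", 9, 0.20),  # dominated by expr_e at the same complexity
]
front = pareto_front(candidates)
front_labels = [label for label, _, _ in front]
print(front_labels)
```

On such a front, the expression usually chosen sits at the "knee", where extra complexity stops buying a meaningful reduction in error.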

Compartment Model & Equations

Compartment model diagram and governing equations
Compartment model. Six state variables govern BKV infection dynamics: susceptible cells (HS), infected cells (HI), BK virus (V), BK-specific T-cells (EV), allospecific T-cells (EK), and serum creatinine (C). A seventh term, the unknown forcing function h(t), appears in the creatinine equation to capture uncharacterized dynamics; its mathematical form is what the PINN learns from data.

PINN loss components

\[ \mathcal{L}_{\text{data}} = \frac{1}{N_d}\sum_{i=1}^{N_d}\bigl\|\hat{y}(t_i) - y_i^{\text{obs}}\bigr\|^2 \]
\[ \mathcal{L}_{\text{physics}} = \frac{1}{N_r}\sum_{j=1}^{N_r}\bigl\|\dot{\hat{C}}(t_j) - \lambda_C + \delta_{C0}\tfrac{\hat{H}_S(t_j)}{\hat{H}_S(t_j)+\kappa_{CH}} - \hat{h}(t_j)\bigr\|^2 \]
\[ \mathcal{L}_{\mathrm{PINN}} = \mathcal{L}_{\mathrm{data}} + \lambda_{\mathrm{phys}}\,\mathcal{L}_{\mathrm{physics}} \]

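The data loss fits the measured trajectories at the \(N_d\) observation times. The physics loss penalizes violations of the compartment ODE system at the \(N_r\) collocation points, which enforces biological consistency; the residual shown above is the one for the creatinine equation. The weight \(\lambda_{\mathrm{phys}}\) sets the balance between the two terms.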
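The sketch below evaluates these three quantities on a small grid, as a rough illustration of how the terms combine. It is not the training code for this project. It approximates \(\dot{\hat{C}}\) with central differences, whereas a PINN would differentiate the network output by automatic differentiation. All parameter values and trajectories are invented so that the residual can be checked by hand.

```python
def data_loss(pred, obs):
    """Mean squared error between predictions and observations (L_data)."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def physics_loss(t, c_hat, hs_hat, h_hat, lam_c, delta_c0, kappa_ch):
    """Mean squared creatinine-ODE residual at interior grid points (L_physics).

    dC/dt is approximated by central differences as a stand-in for the
    automatic differentiation a PINN would use.
    """
    residuals = []
    for j in range(1, len(t) - 1):
        dc_dt = (c_hat[j + 1] - c_hat[j - 1]) / (t[j + 1] - t[j - 1])
        saturation = hs_hat[j] / (hs_hat[j] + kappa_ch)
        residuals.append(dc_dt - lam_c + delta_c0 * saturation - h_hat[j])
    return sum(r ** 2 for r in residuals) / len(residuals)


def pinn_loss(l_data, l_physics, lam_phys):
    """Weighted sum of the data and physics terms (L_PINN)."""
    return l_data + lam_phys * l_physics


# Illustrative values only (not fitted parameters from the study).
lam_c, delta_c0, kappa_ch = 1.0, 0.4, 1.0
t = [0.0, 0.5, 1.0, 1.5, 2.0]
hs_hat = [1.0] * len(t)            # constant H_S, so the saturation term is 0.5
c_hat = [1.0 + ti for ti in t]     # dC/dt = 1 everywhere
h_consistent = [0.2] * len(t)      # 1 = 1.0 - 0.4 * 0.5 + 0.2 satisfies the ODE
h_wrong = [0.0] * len(t)           # leaves a residual of 0.2 at every point

c_obs = [1.1, 1.4, 2.1, 2.4, 3.1]  # made-up noisy creatinine observations
l_data = data_loss(c_hat, c_obs)
l_phys_consistent = physics_loss(t, c_hat, hs_hat, h_consistent, lam_c, delta_c0, kappa_ch)
l_phys_wrong = physics_loss(t, c_hat, hs_hat, h_wrong, lam_c, delta_c0, kappa_ch)
total = pinn_loss(l_data, l_phys_consistent, lam_phys=10.0)
print(l_data, l_phys_consistent, l_phys_wrong, total)
```

With the consistent \(\hat{h}\) the physics term vanishes and only the data misfit remains. With the wrong \(\hat{h}\) every collocation point contributes the same squared residual, which is the signal that pushes the learned function toward the missing dynamics.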

Creatinine equation with unknown RHS

\[ \dot{C}(t) = \lambda_C - \delta_{C0}\,\frac{H_S}{H_S + \kappa_{CH}} + h(t;\theta) \]

The function \(h(t;\theta)\) captures the unknown mechanism governing creatinine change. Causal inference identified that \(h\) depends on \(H_S\) and \(C\), and symbolic regression extracted an interpretable closed-form expression.
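Once a closed form for \(h\) in terms of \(H_S\) and \(C\) is available, the creatinine equation can be integrated forward and compared with patient data. The sketch below shows that step under strong simplifications. It treats \(H_S(t)\) as a known input rather than a coupled state, and uses a hypothetical linear form for \(h\), not the expression found in this work. The parameter values are invented, and the integrator is classical fourth-order Runge-Kutta.

```python
import math


def creatinine_rhs(c, hs, h, lam_c, delta_c0, kappa_ch):
    """Right-hand side of the creatinine equation with closure h(hs, c)."""
    return lam_c - delta_c0 * hs / (hs + kappa_ch) + h(hs, c)


def simulate_creatinine(c0, hs_of_t, h, params, t_end, n_steps):
    """Integrate dC/dt with classical RK4, treating H_S(t) as a given input."""
    dt = t_end / n_steps

    def f(tau, y):
        return creatinine_rhs(y, hs_of_t(tau), h, **params)

    t, c = 0.0, c0
    trajectory = [(t, c)]
    for _ in range(n_steps):
        k1 = f(t, c)
        k2 = f(t + dt / 2, c + dt * k1 / 2)
        k3 = f(t + dt / 2, c + dt * k2 / 2)
        k4 = f(t + dt, c + dt * k3)
        c += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        trajectory.append((t, c))
    return trajectory


# Illustrative parameters and a hypothetical closure (assumptions, not results).
params = {"lam_c": 1.0, "delta_c0": 0.4, "kappa_ch": 1.0}


def hs_of_t(t):
    return 1.0  # constant susceptible-cell level, for simplicity


def h(hs, c):
    return -0.5 * (c - 1.0)  # hypothetical linear feedback on creatinine


trajectory = simulate_creatinine(c0=1.0, hs_of_t=hs_of_t, h=h,
                                 params=params, t_end=10.0, n_steps=100)
final_c = trajectory[-1][1]

# With these choices dC/dt = 1.3 - 0.5 * C, which has the closed-form solution
# C(t) = 2.6 - 1.6 * exp(-0.5 * t); it serves as a check on the integrator.
expected = 2.6 - 1.6 * math.exp(-5.0)
print(final_c, expected)
```

In the full model \(H_S\) evolves together with the other compartments, so a faithful simulation would integrate all six equations at once; the single-equation version here only isolates the role of \(h\).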

Dissertation

Developing Gaussian Process and Theory-Informed Neural Network Models for Clinical Decision Making

T. Zhoroev · Ph.D. Dissertation, North Carolina State University, 2025

This dissertation provides the broader methodological context for the BK virus hidden-dynamics study and situates the work within clinically grounded, interpretable machine learning.

Code availability: The repository is private because it contains research-stage material. A public methods repository will be released upon publication.