Multi-modal Learning
Overview
Multi-modal learning addresses one of the fundamental challenges in medical AI: clinical decisions are rarely made from a single data source. A clinician diagnosing diabetic macular oedema consults fundus photographs, OCT B-scans, fluorescein angiography, and the patient’s longitudinal record simultaneously. My research develops deep learning architectures that can fuse these heterogeneous modalities into a coherent representation.
Key research directions
Cross-modal feature alignment
Standard concatenation of modality-specific features often fails because different modalities live in incompatible representation spaces. I explore contrastive objectives and cross-attention mechanisms that align representations across modalities without requiring paired data at every follow-up visit.
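The contrastive-alignment idea can be sketched as a symmetric InfoNCE-style objective over paired modality embeddings, pulling matched fundus/OCT pairs together and pushing mismatched pairs apart. The following is a minimal NumPy illustration, not the actual training code; the function name, the fundus/OCT pairing, and the temperature value are assumptions for the sake of the example.

```python
import numpy as np

def info_nce_alignment(z_fundus, z_oct, temperature=0.1):
    """Symmetric InfoNCE loss aligning paired fundus/OCT embeddings.

    z_fundus, z_oct: (batch, dim) modality-specific feature arrays.
    Positive pairs share a row index; all other rows act as negatives.
    """
    # L2-normalise so the dot product is cosine similarity
    za = z_fundus / np.linalg.norm(z_fundus, axis=1, keepdims=True)
    zb = z_oct / np.linalg.norm(z_oct, axis=1, keepdims=True)
    logits = za @ zb.T / temperature        # (batch, batch) similarity matrix
    n = len(za)                             # diagonal entries are positives

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the two directions: fundus -> OCT and OCT -> fundus
    return 0.5 * (xent(logits) + xent(logits.T))
```

Because the loss only needs embeddings that co-occur within a batch, it does not require every modality to be present at every visit, only some paired examples somewhere in the training set.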
Missing-modality robustness
In real clinical settings, not every modality is available for every patient at every time point. My models are trained to degrade gracefully when one or more modalities are absent at inference, using masking strategies during training and conditional generation as priors over the missing inputs.
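One simple way to realise the masking strategy is modality dropout: randomly zeroing out entire modalities during training so the fusion model learns not to over-rely on any single input. The sketch below is illustrative only; the dict-based interface and the drop probability are assumptions, not the real pipeline.

```python
import numpy as np

def mask_modalities(features, p_drop=0.3, rng=None):
    """Randomly zero whole modalities so the model learns to cope with
    absent inputs at inference time.

    features: dict mapping modality name -> (batch, dim) array.
    Returns masked features and a per-modality presence flag,
    guaranteeing at least one modality survives.
    """
    rng = rng or np.random.default_rng()
    names = list(features)
    keep = rng.random(len(names)) >= p_drop
    if not keep.any():                      # never drop everything
        keep[rng.integers(len(names))] = True
    masked, present = {}, {}
    for name, k in zip(names, keep):
        present[name] = bool(k)
        masked[name] = features[name] if k else np.zeros_like(features[name])
    return masked, present
```

At inference, genuinely missing modalities are handled the same way the model saw them in training: zeroed features plus a presence flag the downstream fusion layers can condition on.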
Ophthalmology-specific fusion
The primary application domain is ophthalmology, where fundus images, OCT volumes, and angiography capture complementary aspects of retinal pathology. Combining these sources yields significantly better prediction of progression to advanced AMD and sight-threatening diabetic retinopathy than any single modality alone.
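One common way to combine a variable set of modality features is attention pooling with a learned query, which naturally skips absent modalities via masking. The NumPy sketch below is a hedged stand-in for whatever fusion head the actual models use; the function and the random query are purely illustrative.

```python
import numpy as np

def fuse_modalities(feats, present, w_query):
    """Attention-pool a variable set of modality embeddings into one vector.

    feats: (n_modalities, dim) stacked modality features.
    present: boolean array marking which modalities are available.
    w_query: (dim,) query vector (learned in practice; given here).
    """
    d = feats.shape[1]
    scores = feats @ w_query / np.sqrt(d)          # (n_modalities,)
    scores = np.where(present, scores, -np.inf)    # mask absent modalities
    weights = np.exp(scores - scores[present].max())
    weights = weights / weights.sum()
    return weights @ feats                         # (dim,) fused representation
```

The fused vector can then feed a standard prediction head for progression risk; when only one modality is present, the pooling reduces to passing that modality through unchanged.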
Representative publications
- Multi-modal longitudinal learning for AMD progression — MICCAI 2024
- LatiM: continuous-time multi-modal representation learning — MICCAI 2024