Research implementation that fuses video, audio, and text embeddings to predict fMRI responses across 1,000+ cortical parcels. Mirrors the TRIBE v2 architecture with a shared cross-modal transformer and per-subject adapters.
- Inputs: video frames, audio waveform, transcripts
- Modality encoders: ViT / CLIP (video), Whisper (audio), text LLM (transcripts)
- Fusion: shared cross-modal transformer
- Output: fMRI parcel predictions (via per-subject adapters)
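The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the repo's actual implementation: the class name `CrossModalFusion`, the dimensions, and the mean-pooling readout are all assumptions; the real model's encoders, token layout, and adapter design may differ.

```python
# Hypothetical sketch of the fusion stage. Assumes the three modality
# encoders have already projected their features to a common d_model.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse per-modality embeddings with a shared transformer,
    then map to fMRI parcel predictions via a per-subject adapter."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 n_parcels=1000, n_subjects=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # one lightweight linear adapter per subject (illustrative)
        self.adapters = nn.ModuleList(
            nn.Linear(d_model, n_parcels) for _ in range(n_subjects))

    def forward(self, video, audio, text, subject: int):
        # each input: (batch, time, d_model), already encoder-projected
        tokens = torch.cat([video, audio, text], dim=1)
        fused = self.fusion(tokens)            # (batch, 3*time, d_model)
        pooled = fused.mean(dim=1)             # (batch, d_model)
        return self.adapters[subject](pooled)  # (batch, n_parcels)


model = CrossModalFusion()
b, t, d = 2, 10, 256
pred = model(torch.randn(b, t, d), torch.randn(b, t, d),
             torch.randn(b, t, d), subject=0)
print(tuple(pred.shape))  # (2, 1000)
```

Concatenating modality tokens along the time axis lets every attention layer mix video, audio, and text jointly; keeping the adapters subject-specific while the transformer is shared matches the cross-subject weight sharing described above.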
