fmri-dascoli_2026-tribe_v2
Model Summary
| Modality | fMRI |
|---|---|
| Training Dataset | d’Ascoli et al. (2026) (CNeuroMod, BoldMoments, Lebel2023, Wen2017) |
| Species | Human |
| Stimuli | Video, Audio, Text |
| Model Type | Multimodal Transformer (TRIBE v2) |
| Creator | Stéphane d’Ascoli (FAIR at Meta) |
Description
Setup. This model requires Python >= 3.11, FFmpeg, and HuggingFace authentication.
Install FFmpeg:
- Linux: sudo apt install ffmpeg
- macOS: brew install ffmpeg
- Conda: conda install -c conda-forge ffmpeg
Install the HuggingFace CLI:
pip install huggingface_hub
Request access to LLaMA-3.2-3B at https://huggingface.co/meta-llama/Llama-3.2-3B
Authenticate:
hf auth login
Enter your HuggingFace username and access token when prompted.
GPU with >= 16 GB VRAM is recommended. CPU inference is supported but very slow.
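Before loading the model, it can help to confirm the local prerequisites. The following is a minimal sketch, not part of BERG or TRIBE v2: the helper name `check_setup` is hypothetical, and it only checks the Python version and whether the `ffmpeg` and `hf` executables are on the PATH. It does not verify that access to LLaMA-3.2-3B has been granted.

```python
import shutil
import sys


def check_setup(min_python=(3, 11)):
    """Report whether the local prerequisites appear to be met (hypothetical helper)."""
    return {
        # The model requires Python >= 3.11.
        "python_ok": sys.version_info >= min_python,
        # FFmpeg must be discoverable on PATH.
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        # The `hf` command is the HuggingFace CLI used for `hf auth login`.
        "hf_cli_found": shutil.which("hf") is not None,
    }


for name, ok in check_setup().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```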
What is TRIBE v2? TRIBE v2 is a tri-modal (video, audio, and language) foundation model for predicting human fMRI brain activity. It uses frozen pretrained feature extractors — V-JEPA2-Giant (video), Wav2Vec-BERT-2.0 (audio), and LLaMA-3.2-3B (text) — whose embeddings are fed into a trainable 8-layer, 8-head Transformer encoder that maps multimodal representations onto the cortical surface (fsaverage5, 20,484 vertices).
Architecture. Stimulus features are extracted at 2 Hz from each modality, projected to a shared 384-dimensional space per modality (1,152 total), and processed by an 8-layer, 8-head Transformer with 100-second context windows. A subject-conditioned final layer maps latent representations to cortical vertices. For inference, the model uses a special “unseen subject” layer trained via subject dropout, producing group-average-like predictions without requiring subject-specific data. No subject parameter is needed because the model always runs in this mode.
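The dimensions above can be checked with a small shape sketch. It uses random arrays in place of real features and assumes that the three 384-dimensional projections are concatenated along the feature axis, which is what the stated total of 1,152 suggests; the actual fusion code may differ.

```python
import numpy as np

FEATURE_RATE_HZ = 2      # features per second, per modality
CONTEXT_SECONDS = 100    # Transformer context window
PROJ_DIM = 384           # projection size per modality
MODALITIES = ("video", "audio", "text")

# Number of feature time steps in one context window: 2 Hz * 100 s = 200.
n_steps = FEATURE_RATE_HZ * CONTEXT_SECONDS

# Random stand-ins for the projected per-modality features.
rng = np.random.default_rng(0)
projected = {m: rng.standard_normal((n_steps, PROJ_DIM)) for m in MODALITIES}

# ASSUMPTION: modalities are concatenated feature-wise (3 * 384 = 1152).
tokens = np.concatenate([projected[m] for m in MODALITIES], axis=-1)
print(tokens.shape)  # (200, 1152)
```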
Training data. The model was trained on over 450 hours of fMRI across 25 subjects from four naturalistic datasets: Courtois NeuroMod (4 subjects, 269h — movies with speech), BoldMoments (10 subjects, 62h — short video clips), Lebel2023 (8 subjects, 86h — podcast listening), and Wen2017 (3 subjects, 35h — silent videos).
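As a quick arithmetic check, the per-dataset figures listed above add up to the stated totals. The sketch below only sums the numbers given in this section.

```python
# Per-dataset subject counts and recording hours, as listed above.
datasets = {
    "Courtois NeuroMod": {"subjects": 4, "hours": 269},
    "BoldMoments": {"subjects": 10, "hours": 62},
    "Lebel2023": {"subjects": 8, "hours": 86},
    "Wen2017": {"subjects": 3, "hours": 35},
}

total_subjects = sum(d["subjects"] for d in datasets.values())
total_hours = sum(d["hours"] for d in datasets.values())

print(total_subjects)  # 25
print(total_hours)     # 452
```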
Feature extraction pipeline. When given a video, the model automatically (1) extracts audio from the video track, (2) transcribes speech with WhisperX to get word-level timings, (3) extracts visual features from V-JEPA2-Giant (64 frames spanning 4 seconds per time bin), audio features from Wav2Vec-BERT-2.0, and text features from LLaMA-3.2-3B with 1,024 tokens of preceding context. When given text only, it first synthesizes speech via gTTS and then runs the same audio+text pipeline. In practice, you provide a single file to the model, and a video file triggers the full pipeline.
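The visual sampling figures imply a frame rate and a window overlap that are easy to work out. The sketch below derives them from the numbers above. The placement of each 4-second window relative to its time bin is not specified here, so `frame_times` assumes a window that ends at the bin time; this is an illustrative assumption, and the function name is hypothetical.

```python
import numpy as np

FRAMES_PER_BIN = 64     # frames fed to the video encoder per time bin
WINDOW_SECONDS = 4.0    # temporal span of those frames
FEATURE_RATE_HZ = 2.0   # one time bin every 0.5 s

implied_fps = FRAMES_PER_BIN / WINDOW_SECONDS   # 16.0 frames per second
bin_step = 1.0 / FEATURE_RATE_HZ                # 0.5 s between bins
overlap = WINDOW_SECONDS - bin_step             # 3.5 s shared by adjacent windows


def frame_times(bin_time):
    """Timestamps of the frames for one bin.

    ASSUMPTION: the window covers the 4 s preceding bin_time.
    """
    return np.linspace(bin_time - WINDOW_SECONDS, bin_time,
                       FRAMES_PER_BIN, endpoint=False)


print(implied_fps, overlap)  # 16.0 3.5
```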
Output. Predictions are time-resolved fMRI activity at 1 Hz (1 TR = 1 second) across 20,484 cortical vertices on the fsaverage5 surface mesh. ROI selection is available via the Glasser HCP-MMP1.0 parcellation (180 bilateral cortical regions).
Performance. An earlier iteration of TRIBE v2 achieved first place in the Algonauts 2025 brain prediction competition (263 teams, mean score 0.2146). The current model, trained on over 1,000 hours of fMRI across 720 subjects, significantly outperforms linear encoding baselines across all training datasets and generalizes zero-shot to unseen subjects and tasks, including non-naturalistic experimental paradigms such as visual and language functional localizers.
Metadata
fmri
subject_id : str - Subject identifier (‘average’)
n_vertices : int - Total cortical vertices (20484)
n_vertices_lh : int - Left hemisphere vertices (10242)
n_vertices_rh : int - Right hemisphere vertices (10242)
surface_mesh : str - Surface mesh name (‘fsaverage5’)
output_frequency_hz : float - Temporal resolution of predictions (1.0 Hz)
roi
parcellation : str - Parcellation name (‘Glasser_HCP-MMP1.0’)
roi_labels : (180,) - Bilateral ROI names (e.g., ‘V1’, ‘V2’, ‘FFC’)
roi_assignments : (20484,) - ROI index per vertex (-1 = medial wall)
roi_index : dict - Mapping from ROI name to integer index in roi_assignments
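The ROI metadata fields can be combined to pull a region's time course out of a full-surface prediction. The sketch below uses small synthetic stand-ins for `roi_labels`, `roi_index`, `roi_assignments`, and the response array, so it runs without the model. How these fields are nested inside the returned metadata object is not shown here, and the helper `roi_mask` is hypothetical.

```python
import numpy as np

N_VERTICES = 20484

# Synthetic stand-ins for the metadata fields described above.
roi_labels = np.array(["V1", "V2", "FFC"])
roi_index = {name: i for i, name in enumerate(roi_labels)}
rng = np.random.default_rng(0)
# One ROI index per vertex; -1 marks the medial wall.
roi_assignments = rng.integers(-1, len(roi_labels), size=N_VERTICES)


def roi_mask(names):
    """Boolean mask over vertices belonging to any of the named ROIs."""
    ids = [roi_index[name] for name in names]
    return np.isin(roi_assignments, ids)


# Synthetic full-surface prediction: 30 time steps x 20,484 vertices.
responses = rng.standard_normal((30, N_VERTICES))

# Mean predicted time course across all V1 vertices.
v1_timecourse = responses[:, roi_mask(["V1"])].mean(axis=1)
print(v1_timecourse.shape)  # (30,)
```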
Input
Type |
|
|---|---|
Description |
The input is a single file path (string) to a video, audio, or text file of any duration.
The file type is auto-detected from the extension, and determines which
modalities are activated:
• Video input → visual + audio + text features (full multimodal)
• Audio input → audio + text features (speech is transcribed)
• Text input → audio + text features (text is first synthesized to speech)
Exactly one file path must be provided per call. For video input, audio and text are
extracted automatically from the video file.
Stimuli are processed as temporal sequences. Features are extracted at 2 Hz and
predictions are returned at 1 Hz, producing one predicted fMRI sample per second
of stimulus duration.
|
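The relationship between stimulus duration and array sizes can be sketched as follows. The helper `expected_shapes` is hypothetical, and the rounding of fractional durations (here, rounding up) is an assumption; the model's actual handling of partial seconds may differ.

```python
import math

FEATURE_RATE_HZ = 2   # feature extraction rate
OUTPUT_RATE_HZ = 1    # prediction rate (1 TR = 1 second)


def expected_shapes(duration_s, n_vertices=20484):
    """Expected feature-step count and response shape for a stimulus.

    ASSUMPTION: fractional durations are rounded up.
    """
    n_feature_steps = math.ceil(duration_s * FEATURE_RATE_HZ)
    n_timesteps = math.ceil(duration_s * OUTPUT_RATE_HZ)
    return {"feature_steps": n_feature_steps,
            "responses": (n_timesteps, n_vertices)}


print(expected_shapes(60))
# {'feature_steps': 120, 'responses': (60, 20484)}
```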
Output
Type |
|
|---|---|
Shape |
|
Description |
The output is a 2D array containing predicted fMRI activity on the fsaverage5
cortical surface. Shape is (n_timesteps, n_vertices), where n_timesteps depends
on stimulus duration (1 TR = 1 second) and n_vertices depends on ROI selection. However, these predictions
should not be interpreted as a direct one-to-one mapping from a single stimulus second to the same fMRI second,
because fMRI responses are delayed and temporally blurred by the hemodynamic response.
TRIBE v2 also uses temporal context, so each prediction can depend on surrounding/preceding stimulus information, not only the current second.
|
Dimensions |
n_timesteps: Number of seconds of predicted brain activity (1 per second, depends on stimulus duration)
n_vertices: Number of cortical vertices in the selection (up to 20,484)
|
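When no selection is applied, the 20,484 columns cover both hemispheres (10,242 vertices each). The sketch below splits a full-surface array into two halves. It assumes that left-hemisphere vertices come first, as the "left + right" ordering used elsewhere in this page suggests; verify the ordering before relying on it. The function name is hypothetical.

```python
import numpy as np

N_LH = 10242  # left hemisphere vertices
N_RH = 10242  # right hemisphere vertices


def split_hemispheres(responses):
    """Split a full-surface (n_timesteps, 20484) array into (lh, rh).

    ASSUMPTION: columns are ordered left hemisphere first, then right.
    """
    if responses.shape[1] != N_LH + N_RH:
        raise ValueError("expected a full-surface array with 20,484 vertices")
    return responses[:, :N_LH], responses[:, N_LH:]


responses = np.zeros((12, N_LH + N_RH))  # synthetic stand-in
lh, rh = split_hemispheres(responses)
print(lh.shape, rh.shape)  # (12, 10242) (12, 10242)
```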
Parameters
Parameters used in get_encoding_model
This function loads the encoding model.
model_id |
Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: fmri-multi_study-tribe_v2
Example: “fmri-multi_study-tribe_v2”
|
selection |
Type: dict
Required: No
Description: Specifies which cortical vertices to include in the model output.
Can include ROI names and/or a binary vertex mask. If both are provided,
their union (logical OR) is used. If not provided, all 20,484 vertices
are returned.
Properties:
roi
Type: list[str]
Description: List of Glasser HCP-MMP1.0 ROI names to include.
Selects vertices from both hemispheres for each named region.
Valid values: “V1”, “MST”, “V6”, “V2”, “V3”, “V4”, “V8”, “4”, “3b”, “FEF”, “PEF”, “55b”, “V3A”, “RSC”, “POS2”, “V7”, “IPS1”, “FFC”, “V3B”, “LO1”, “LO2”, “PIT”, “MT”, “A1”, “PSL”, “SFL”, “PCV”, “STV”, “7Pm”, “7m”, “POS1”, “23d”, “v23ab”, “d23ab”, “31pv”, “5m”, “5mv”, “23c”, “5L”, “24dd”, “24dv”, “7AL”, “SCEF”, “6ma”, “7Am”, “7PL”, “7PC”, “LIPv”, “VIP”, “MIP”, “1”, “2”, “3a”, “6d”, “6mp”, “6v”, “p24pr”, “33pr”, “a24pr”, “p32pr”, “a24”, “d32”, “8BM”, “p32”, “10r”, “47m”, “8Av”, “8Ad”, “9m”, “8BL”, “9p”, “10d”, “8C”, “44”, “45”, “47l”, “a47r”, “6r”, “IFJa”, “IFJp”, “IFSp”, “IFSa”, “p9-46v”, “46”, “a9-46v”, “9-46d”, “9a”, “10v”, “a10p”, “10pp”, “11l”, “13l”, “OFC”, “47s”, “LIPd”, “6a”, “i6-8”, “s6-8”, “43”, “OP4”, “OP1”, “OP2-3”, “52”, “RI”, “PFcm”, “PoI2”, “TA2”, “FOP4”, “MI”, “Pir”, “AVI”, “AAIC”, “FOP1”, “FOP3”, “FOP2”, “PFt”, “AIP”, “EC”, “PreS”, “H”, “ProS”, “PeEc”, “STGa”, “PBelt”, “A5”, “PHA1”, “PHA3”, “STSda”, “STSdp”, “STSvp”, “TGd”, “TE1a”, “TE1p”, “TE2a”, “TF”, “TE2p”, “PHT”, “PH”, “TPOJ1”, “TPOJ2”, “TPOJ3”, “DVT”, “PGp”, “IP2”, “IP1”, “IP0”, “PFop”, “PF”, “PFm”, “PGi”, “PGs”, “V6A”, “VMV1”, “VMV3”, “PHA2”, “V4t”, “FST”, “V3CD”, “LO3”, “VMV2”, “31pd”, “31a”, “VVC”, “25”, “s32”, “pOFC”, “PoI1”, “Ig”, “FOP5”, “p10p”, “p47r”, “TGv”, “MBelt”, “LBelt”, “A4”, “STSva”, “TE1m”, “PI”, “a32pr”, “p24”
Example: [‘V1’, ‘V2’, ‘FFC’]
vertices
Type: numpy.ndarray
Description: Binary one-hot encoded vector indicating which vertices to include.
Must have exactly 20,484 elements (10,242 left + 10,242 right hemisphere).
Each position set to 1 indicates that vertex should be included.
Example: [0, 0, …, 1, 1, 0]
|
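The union rule for `selection` can be illustrated with a toy example. The sketch below builds a synthetic `roi_assignments` array in which V1 occupies the first 150 vertices, then combines it with a vertex mask using logical OR. The real assignment array comes from the model metadata; the values here are invented purely for illustration.

```python
import numpy as np

N_VERTICES = 20484

# Synthetic ROI assignment: vertices 0-149 belong to V1, all others to no ROI.
roi_index = {"V1": 0}
roi_assignments = np.full(N_VERTICES, -1)
roi_assignments[:150] = roi_index["V1"]

# Part of the selection coming from ROI names.
roi_part = np.isin(roi_assignments, [roi_index["V1"]])

# Part of the selection coming from a binary vertex mask.
vertex_mask = np.zeros(N_VERTICES, dtype=int)
vertex_mask[100:200] = 1

# Union (logical OR): overlapping vertices are counted once.
selected = roi_part | vertex_mask.astype(bool)
n_selected = int(selected.sum())
print(n_selected)  # 200 (150 ROI vertices + 100 mask vertices - 50 shared)
```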
device |
Type: str
Required: No
Description: Device to run the model on. ‘auto’ will use CUDA if available, otherwise CPU.
GPU with >= 16 GB VRAM is strongly recommended. CPU inference is supported but very slow.
Valid Values: “cpu”, “cuda”, “auto”
Example: “auto”
|
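The behaviour of the 'auto' option can be approximated as below. This is a sketch of the described rule, not the library's implementation: the function name `resolve_device` is hypothetical, and it assumes PyTorch is the backend used to detect CUDA. If PyTorch is not installed, the sketch falls back to CPU.

```python
def resolve_device(device="auto"):
    """Map 'auto' to 'cuda' when available, otherwise 'cpu' (hypothetical helper)."""
    if device not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"unsupported device: {device!r}")
    if device != "auto":
        return device
    try:
        # ASSUMPTION: CUDA availability is detected through PyTorch.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("auto"))
```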
Parameters used in encode
This function generates in silico neural responses using the encoding model previously loaded.
model |
Type: BaseModelInterface
Required: Yes
Description: An instantiated and loaded encoding model.
|
stimulus |
Type: str
Required: Yes
Description: File path (string) to the stimulus. Exactly one file must be provided.
The file type is auto-detected from the extension:
• Video: .mp4, .avi, .mkv, .mov, .webm → activates video + audio + text features
• Audio: .wav, .mp3, .flac, .ogg → activates audio + text features
• Text: .txt → text is synthesized to speech, then activates audio + text features
Example: “/path/to/video.mp4”
|
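The extension-based detection described above can be sketched as a simple lookup. The extension lists and activated modalities are taken from this section; the function name `detect_modalities` is hypothetical, and treating extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Supported extensions per input type, as listed above.
EXTENSIONS = {
    "video": {".mp4", ".avi", ".mkv", ".mov", ".webm"},
    "audio": {".wav", ".mp3", ".flac", ".ogg"},
    "text": {".txt"},
}

# Feature streams activated by each input type.
ACTIVE_FEATURES = {
    "video": ("video", "audio", "text"),
    "audio": ("audio", "text"),
    "text": ("audio", "text"),  # text is first synthesized to speech
}


def detect_modalities(path):
    """Return (input_type, active_features) for a stimulus file path."""
    # ASSUMPTION: extensions are matched case-insensitively.
    suffix = Path(path).suffix.lower()
    for input_type, extensions in EXTENSIONS.items():
        if suffix in extensions:
            return input_type, ACTIVE_FEATURES[input_type]
    raise ValueError(f"unsupported stimulus extension: {suffix!r}")


print(detect_modalities("/path/to/video.mp4"))
# ('video', ('video', 'audio', 'text'))
```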
return_metadata |
Type: bool
Required: No
Description: Whether to return the encoding model’s metadata together with the in silico neural responses.
Example: True
|
show_progress |
Type: bool
Required: No
Description: Whether to show a progress bar during encoding.
Example: True
|
Parameters used in get_model_metadata
This function loads the encoding model’s metadata without having to load the model itself.
model_id |
Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: fmri-multi_study-tribe_v2
Example: “fmri-multi_study-tribe_v2”
|
Performance
Accuracy Plots (AWS directory):
brain-encoding-response-generator/encoding_models/modality-fmri/train_dataset-multi_study/model-tribe_v2/encoding_models_accuracy
Example Usage
import numpy as np
from berg import BERG
# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")
# Create optional vertex mask
vertex_mask = np.zeros(20484, dtype=int)
vertex_mask[100:200] = 1
# Load the model with ROI and/or vertex selection
model = berg.get_encoding_model(
    "fmri-dascoli_2026-tribe_v2",
    selection={
        "roi": ["V1", "V2", "FFC"],
        "vertices": vertex_mask,
    },
)
# Prepare the stimulus (file path to video, audio, or text)
# Video: audio is extracted and speech transcribed automatically
stimulus = "/path/to/video.mp4"
# Generate in silico neural responses
responses = berg.encode(model, stimulus, show_progress=True)
# responses shape: [n_timesteps, n_vertices]
# - n_timesteps: one per second of stimulus duration
# - n_vertices: cortical vertices in the selection (up to 20,484)
# Generate responses with metadata
responses, metadata = berg.encode(model, stimulus, return_metadata=True)
# Load metadata without loading the model
metadata = berg.get_model_metadata("fmri-multi_study-tribe_v2")
References
TRIBE v2 paper (d’Ascoli et al., 2026): https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/
TRIBE v2 code: https://github.com/facebookresearch/tribev2
TRIBE v2 weights: https://huggingface.co/facebook/tribev2
TRIBE v2 demo: https://aidemos.atmeta.com/tribev2/
Glasser parcellation (Glasser et al., 2016): https://doi.org/10.1038/nature18933
Algonauts 2025 challenge (Gifford et al., 2024): https://arxiv.org/abs/2501.00504