fmri-dascoli_2026-tribe_v2

Model Summary

Modality	fMRI
Training Dataset	d’Ascoli et al. (2026) (CNeuroMod, BoldMoments, Lebel2023, Wen2017)
Species	Human
Stimuli	Video, Audio, Text
Model Type	Multimodal Transformer (TRIBE v2)
Creator	Stéphane d’Ascoli (FAIR at Meta)

Description

Setup. This model requires Python >= 3.11, FFmpeg, and HuggingFace authentication.

Install FFmpeg:
• Linux: sudo apt install ffmpeg
• macOS: brew install ffmpeg
• Conda: conda install -c conda-forge ffmpeg
Install the HuggingFace CLI: pip install huggingface_hub
Request access to LLaMA-3.2-3B at https://huggingface.co/meta-llama/Llama-3.2-3B
Authenticate: hf auth login
Enter your HuggingFace access token when prompted.

A GPU with >= 16 GB of VRAM is recommended. CPU inference is supported but very slow.
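The setup requirements above can be checked before loading the model. The following is a minimal sketch using only the Python standard library; the function name check_prerequisites is hypothetical and not part of BERG or TRIBE v2, and the check only looks for the ffmpeg and hf executables on the PATH (it does not verify that you are logged in or have access to LLaMA-3.2-3B).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Return a dict mapping each prerequisite name to True if it is satisfied.

    Hypothetical helper for illustration; it is not part of BERG or TRIBE v2.
    """
    return {
        # Python >= 3.11 is required by the model.
        "python": sys.version_info >= min_python,
        # FFmpeg must be discoverable on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # The HuggingFace CLI entry point installed with huggingface_hub.
        "hf_cli": shutil.which("hf") is not None,
    }


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```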

What is TRIBE v2? TRIBE v2 is a tri-modal (video, audio, and language) foundation model for predicting human fMRI brain activity. It uses frozen pretrained feature extractors — V-JEPA2-Giant (video), Wav2Vec-BERT-2.0 (audio), and LLaMA-3.2-3B (text) — whose embeddings are fed into a trainable 8-layer, 8-head Transformer encoder that maps multimodal representations onto the cortical surface (fsaverage5, 20,484 vertices).

Architecture. Stimulus features are extracted at 2 Hz from each modality. Each modality is projected to a 384-dimensional space (1,152 dimensions in total across the three modalities), and the result is processed by an 8-layer, 8-head Transformer with 100-second context windows. A subject-conditioned final layer maps latent representations to cortical vertices. For inference, the model uses a special “unseen subject” layer trained via subject dropout, producing group-average-like predictions without requiring subject-specific data. No subject parameter is needed: the model always runs in this mode.
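The tensor shapes implied by this description can be sketched as follows. This is only an illustration of the dimension bookkeeping, not the TRIBE v2 implementation: the raw embedding sizes in RAW_DIMS are hypothetical placeholders, random matrices stand in for the learned projections, and concatenation along the feature axis is assumed from the "1,152 total" figure.

```python
import numpy as np

FEATURE_RATE_HZ = 2      # feature sampling rate from the description
CONTEXT_SECONDS = 100    # Transformer context window from the description
PROJ_DIM = 384           # per-modality projection size from the description

# Hypothetical raw embedding sizes, chosen only for illustration.
RAW_DIMS = {"video": 1408, "audio": 1024, "text": 3072}


def fuse_modalities(features, rng):
    """Project each modality to PROJ_DIM and concatenate along the feature axis."""
    projected = []
    for name in ("video", "audio", "text"):
        x = features[name]                               # (T, raw_dim)
        w = rng.standard_normal((x.shape[1], PROJ_DIM))  # stand-in for a learned projection
        projected.append(x @ w)                          # (T, 384)
    return np.concatenate(projected, axis=1)             # (T, 3 * 384)


rng = np.random.default_rng(0)
n_steps = FEATURE_RATE_HZ * CONTEXT_SECONDS              # 200 time steps per window
features = {k: rng.standard_normal((n_steps, d)) for k, d in RAW_DIMS.items()}
fused = fuse_modalities(features, rng)
print(fused.shape)  # (200, 1152)
```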

Training data. The model was trained on over 450 hours of fMRI across 25 subjects from four naturalistic datasets: Courtois NeuroMod (4 subjects, 269h — movies with speech), BoldMoments (10 subjects, 62h — short video clips), Lebel2023 (8 subjects, 86h — podcast listening), and Wen2017 (3 subjects, 35h — silent videos).
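As a quick consistency check, the per-dataset figures listed above can be summed. The dictionary below simply restates the numbers from this paragraph; the variable names are illustrative.

```python
# Per-dataset subject counts and fMRI hours, as listed in the paragraph above.
TRAINING_DATASETS = {
    "Courtois NeuroMod": {"subjects": 4, "hours": 269},
    "BoldMoments": {"subjects": 10, "hours": 62},
    "Lebel2023": {"subjects": 8, "hours": 86},
    "Wen2017": {"subjects": 3, "hours": 35},
}

total_subjects = sum(d["subjects"] for d in TRAINING_DATASETS.values())
total_hours = sum(d["hours"] for d in TRAINING_DATASETS.values())
print(total_subjects, total_hours)  # 25 452
```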

Feature extraction pipeline. When given a video, the model automatically (1) extracts audio from the video track, (2) transcribes speech with WhisperX to get word-level timings, and (3) extracts visual features from V-JEPA2-Giant (64 frames spanning 4 seconds per time bin), audio features from Wav2Vec-BERT-2.0, and text features from LLaMA-3.2-3B with 1,024 tokens of preceding context. When given text only, it first synthesizes speech via gTTS and then runs the same audio+text pipeline. In other words, you provide a single file, and a video file triggers the full pipeline.
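The order of operations described above can be summarized as a small planning function. This is a sketch of the described flow only: pipeline_steps is a hypothetical name, no model is actually run, and the assumption that synthesized speech from text input is also passed through transcription (to obtain word timings) is an inference from "the same audio+text pipeline".

```python
FRAMES_PER_BIN = 64
SECONDS_PER_BIN = 4
# 64 frames spanning 4 seconds corresponds to 16 frames per second.
FRAMES_PER_SECOND = FRAMES_PER_BIN / SECONDS_PER_BIN


def pipeline_steps(input_kind):
    """Return the ordered processing steps for 'video', 'audio', or 'text' input."""
    if input_kind not in ("video", "audio", "text"):
        raise ValueError(f"unknown input kind: {input_kind!r}")
    steps = []
    if input_kind == "video":
        steps.append("extract audio track from video")
    elif input_kind == "text":
        steps.append("synthesize speech from text (gTTS)")
    # Assumption: all three input kinds go through transcription for word timings.
    steps.append("transcribe speech with word-level timings (WhisperX)")
    if input_kind == "video":
        steps.append("extract video features (V-JEPA2-Giant)")
    steps.append("extract audio features (Wav2Vec-BERT-2.0)")
    steps.append("extract text features (LLaMA-3.2-3B)")
    return steps


for kind in ("video", "audio", "text"):
    print(kind, "->", len(pipeline_steps(kind)), "steps")
```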

Output. Predictions are time-resolved fMRI activity at 1 Hz (1 TR = 1 second) across 20,484 cortical vertices on the fsaverage5 surface mesh. ROI selection is available via the Glasser HCP-MMP1.0 parcellation (180 bilateral cortical regions).

Performance. An earlier iteration of TRIBE v2 achieved first place in the Algonauts 2025 brain prediction competition (263 teams, mean score 0.2146). The current model, trained on over 1,000 hours of fMRI across 720 subjects, significantly outperforms linear encoding baselines across all training datasets and generalizes zero-shot to unseen subjects and tasks, including non-naturalistic experimental paradigms such as visual and language functional localizers.

Metadata

fmri

subject_id : str - Subject identifier (‘average’)

n_vertices : int - Total cortical vertices (20484)

n_vertices_lh : int - Left hemisphere vertices (10242)

n_vertices_rh : int - Right hemisphere vertices (10242)

surface_mesh : str - Surface mesh name (‘fsaverage5’)

output_frequency_hz : float - Temporal resolution of predictions (1.0 Hz)

roi

parcellation : str - Parcellation name (‘Glasser_HCP-MMP1.0’)

roi_labels : (180,) - Bilateral ROI names (e.g., ‘V1’, ‘V2’, ‘FFC’)

roi_assignments : (20484,) - ROI index per vertex (-1 = medial wall)

roi_index : dict - Mapping from ROI name to integer index in roi_assignments
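The roi fields can be combined to pull a single region out of a full-surface prediction. The sketch below uses a toy stand-in for the roi metadata (3 ROIs over 10 vertices instead of 180 ROIs over 20,484 vertices) and assumes that the integers stored in roi_index are the values that appear in roi_assignments; the helper names are hypothetical.

```python
import numpy as np

# Toy stand-in for metadata["roi"]; the real arrays have 180 labels and 20,484 vertices.
roi_meta = {
    "roi_labels": np.array(["V1", "V2", "FFC"]),
    "roi_assignments": np.array([0, 0, 1, -1, 2, 2, 1, -1, 0, 2]),  # -1 = medial wall
    "roi_index": {"V1": 0, "V2": 1, "FFC": 2},
}


def roi_vertex_mask(roi_meta, roi_name):
    """Boolean mask over vertices that belong to the named ROI."""
    idx = roi_meta["roi_index"][roi_name]
    return roi_meta["roi_assignments"] == idx


def roi_mean_timeseries(responses, roi_meta, roi_name):
    """Average a (n_timesteps, n_vertices) array over the vertices of one ROI."""
    return responses[:, roi_vertex_mask(roi_meta, roi_name)].mean(axis=1)


# Toy responses covering all vertices: shape (n_timesteps=3, n_vertices=10).
responses = np.arange(30, dtype=float).reshape(3, 10)
v1_mask = roi_vertex_mask(roi_meta, "V1")
v1_mean = roi_mean_timeseries(responses, roi_meta, "V1")
print(int(v1_mask.sum()))  # 3
print(v1_mean)             # [ 3. 13. 23.]
```

This only applies when the responses were generated without a selection, so that columns line up with the full vertex ordering.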

Input

Type

str (file path)

Description

The input is a single file path (string) to a video, audio, or text file of any duration.
The file type is auto-detected from the extension, and determines which
modalities are activated:

• Video input → visual + audio + text features (full multimodal)
• Audio input → audio + text features (speech is transcribed)
• Text input  → audio + text features (text is first synthesized to speech)

Exactly one file path must be provided per call. For video input, the audio track and
the speech transcript are extracted automatically from the video file.

Stimuli are processed as temporal sequences. Features are extracted at 2 Hz and
predictions are returned at 1 Hz, producing one predicted fMRI sample per second
of stimulus duration.
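The relationship between stimulus duration and the number of samples can be sketched as below. The two rates come from the text above; rounding down for partial seconds is an assumption, and the actual model may handle the final partial second differently.

```python
import math

FEATURE_RATE_HZ = 2.0  # feature extraction rate
OUTPUT_RATE_HZ = 1.0   # prediction rate (1 TR = 1 second)


def expected_samples(duration_s):
    """Return (n_feature_samples, n_timesteps) for a stimulus of duration_s seconds.

    Assumes partial samples at the end of the stimulus are dropped.
    """
    n_features = math.floor(duration_s * FEATURE_RATE_HZ)
    n_timesteps = math.floor(duration_s * OUTPUT_RATE_HZ)
    return n_features, n_timesteps


print(expected_samples(90.0))  # (180, 90)
```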

Output

Type	`numpy.ndarray`
Shape	`[n_timesteps, n_vertices]`
Description	The output is a 2D array containing predicted fMRI activity on the fsaverage5 cortical surface. Shape is (n_timesteps, n_vertices), where n_timesteps depends on stimulus duration (1 TR = 1 second) and n_vertices depends on ROI selection. However, these predictions should not be interpreted as a direct one-to-one mapping from a single stimulus second to the same fMRI second, because fMRI responses are delayed and temporally blurred by the hemodynamic response. TRIBE v2 also uses temporal context, so each prediction can depend on surrounding/preceding stimulus information, not only the current second.
Dimensions	n_timesteps: Number of seconds of predicted brain activity (1 per second, depends on stimulus duration); n_vertices: Number of cortical vertices in the selection (up to 20,484)
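To illustrate why a prediction at a given second should not be read as the response to that same stimulus second, the sketch below convolves a one-second impulse with a generic double-gamma hemodynamic response function. This is a textbook-style approximation with commonly used parameter values, included only to visualize the delay and blurring; it is not part of TRIBE v2 and the model does not necessarily use it.

```python
import math

import numpy as np


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, undershoot_ratio=1 / 6):
    """Generic double-gamma HRF sampled at times t (seconds); illustrative only."""
    t = np.asarray(t, dtype=float)
    peak = t ** (peak_shape - 1) * np.exp(-t) / math.gamma(peak_shape)
    undershoot = t ** (undershoot_shape - 1) * np.exp(-t) / math.gamma(undershoot_shape)
    return peak - undershoot_ratio * undershoot


t = np.arange(0, 30, 1.0)          # 1 Hz sampling, matching 1 TR = 1 second
hrf = double_gamma_hrf(t)

stimulus = np.zeros(60)
onset = 10
stimulus[onset] = 1.0              # a single one-second event at t = 10 s

bold = np.convolve(stimulus, hrf)[: len(stimulus)]
peak_delay = int(np.argmax(bold)) - onset
print(peak_delay)  # 5 with these parameter values
```

The response to an event peaks several seconds after its onset and spreads over many samples, which is the delay and temporal blurring referred to in the output description.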

Parameters

Parameters used in `get_encoding_model`

This function loads the encoding model.

model_id	Type: str Required: Yes Description: Unique identifier of the model to load. Valid Values: fmri-multi_study-tribe_v2 Example: “fmri-multi_study-tribe_v2”
selection	Type: dict Required: No Description: Specifies which cortical vertices to include in the model output. Can include ROI names and/or a binary vertex mask. If both are provided, their union (logical OR) is used. If not provided, all 20,484 vertices are returned. Properties: • roi Type: list[str] Description: List of Glasser HCP-MMP1.0 ROI names to include. Selects vertices from both hemispheres for each named region. Valid values: “V1”, “MST”, “V6”, “V2”, “V3”, “V4”, “V8”, “4”, “3b”, “FEF”, “PEF”, “55b”, “V3A”, “RSC”, “POS2”, “V7”, “IPS1”, “FFC”, “V3B”, “LO1”, “LO2”, “PIT”, “MT”, “A1”, “PSL”, “SFL”, “PCV”, “STV”, “7Pm”, “7m”, “POS1”, “23d”, “v23ab”, “d23ab”, “31pv”, “5m”, “5mv”, “23c”, “5L”, “24dd”, “24dv”, “7AL”, “SCEF”, “6ma”, “7Am”, “7PL”, “7PC”, “LIPv”, “VIP”, “MIP”, “1”, “2”, “3a”, “6d”, “6mp”, “6v”, “p24pr”, “33pr”, “a24pr”, “p32pr”, “a24”, “d32”, “8BM”, “p32”, “10r”, “47m”, “8Av”, “8Ad”, “9m”, “8BL”, “9p”, “10d”, “8C”, “44”, “45”, “47l”, “a47r”, “6r”, “IFJa”, “IFJp”, “IFSp”, “IFSa”, “p9-46v”, “46”, “a9-46v”, “9-46d”, “9a”, “10v”, “a10p”, “10pp”, “11l”, “13l”, “OFC”, “47s”, “LIPd”, “6a”, “i6-8”, “s6-8”, “43”, “OP4”, “OP1”, “OP2-3”, “52”, “RI”, “PFcm”, “PoI2”, “TA2”, “FOP4”, “MI”, “Pir”, “AVI”, “AAIC”, “FOP1”, “FOP3”, “FOP2”, “PFt”, “AIP”, “EC”, “PreS”, “H”, “ProS”, “PeEc”, “STGa”, “PBelt”, “A5”, “PHA1”, “PHA3”, “STSda”, “STSdp”, “STSvp”, “TGd”, “TE1a”, “TE1p”, “TE2a”, “TF”, “TE2p”, “PHT”, “PH”, “TPOJ1”, “TPOJ2”, “TPOJ3”, “DVT”, “PGp”, “IP2”, “IP1”, “IP0”, “PFop”, “PF”, “PFm”, “PGi”, “PGs”, “V6A”, “VMV1”, “VMV3”, “PHA2”, “V4t”, “FST”, “V3CD”, “LO3”, “VMV2”, “31pd”, “31a”, “VVC”, “25”, “s32”, “pOFC”, “PoI1”, “Ig”, “FOP5”, “p10p”, “p47r”, “TGv”, “MBelt”, “LBelt”, “A4”, “STSva”, “TE1m”, “PI”, “a32pr”, “p24” Example: [‘V1’, ‘V2’, ‘FFC’] • vertices Type: numpy.ndarray Description: Binary mask vector indicating which vertices to include. Must have exactly 20,484 elements (10,242 left + 10,242 right hemisphere). Each position set to 1 marks a vertex to include. Example: [0, 0, …, 1, 1, 0]
device	Type: str Required: No Description: Device to run the model on. ‘auto’ will use CUDA if available, otherwise CPU. GPU with >= 16 GB VRAM is strongly recommended. CPU inference is supported but very slow. Valid Values: “cpu”, “cuda”, “auto” Example: “auto”
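The union rule for the selection parameter can be sketched with plain NumPy. The function combine_selection is a hypothetical helper that mirrors the documented behaviour (logical OR of the ROI footprint and the vertex mask, all vertices when nothing is given); it is not BERG's internal code, and the ROI footprint used here is made up for illustration.

```python
import numpy as np

N_VERTICES = 20484  # 10,242 left + 10,242 right hemisphere


def combine_selection(roi_mask=None, vertex_mask=None, n_vertices=N_VERTICES):
    """Return a boolean mask: union of the given masks, or all vertices if none given."""
    if roi_mask is None and vertex_mask is None:
        return np.ones(n_vertices, dtype=bool)
    combined = np.zeros(n_vertices, dtype=bool)
    for mask in (roi_mask, vertex_mask):
        if mask is None:
            continue
        mask = np.asarray(mask)
        if mask.shape != (n_vertices,):
            raise ValueError(f"mask must have exactly {n_vertices} elements")
        combined |= mask.astype(bool)  # logical OR
    return combined


# Hypothetical ROI footprint and an explicit vertex mask that partly overlap.
roi_mask = np.zeros(N_VERTICES, dtype=int)
roi_mask[150:300] = 1
vertex_mask = np.zeros(N_VERTICES, dtype=int)
vertex_mask[100:200] = 1

selected = combine_selection(roi_mask, vertex_mask)
print(int(selected.sum()))  # 200
```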

Parameters used in `encode`

This function generates in silico neural responses using the encoding model previously loaded.

model	Type: BaseModelInterface Required: Yes Description: An instantiated and loaded encoding model.
stimulus	Type: str Required: Yes Description: File path (string) to the stimulus. Exactly one file must be provided. The file type is auto-detected from the extension: • Video: .mp4, .avi, .mkv, .mov, .webm → activates video + audio + text features • Audio: .wav, .mp3, .flac, .ogg → activates audio + text features • Text: .txt → text is synthesized to speech, then activates audio + text features Example: “/path/to/video.mp4”
return_metadata	Type: bool Required: No Description: Whether to return the encoding model’s metadata together with the in silico neural responses. Example: True
show_progress	Type: bool Required: No Description: Whether to show a progress bar during encoding. Example: True
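The extension-based detection described for the stimulus parameter can be approximated as follows. The extension lists are taken from the table above; the function name, the case-insensitive matching, and the error raised for unknown extensions are assumptions for illustration and may differ from what BERG does internally.

```python
from pathlib import Path

# Extensions as documented for the stimulus parameter.
EXTENSIONS = {
    "video": {".mp4", ".avi", ".mkv", ".mov", ".webm"},
    "audio": {".wav", ".mp3", ".flac", ".ogg"},
    "text": {".txt"},
}

# Modalities activated by each input kind.
ACTIVE_MODALITIES = {
    "video": ("video", "audio", "text"),
    "audio": ("audio", "text"),
    "text": ("audio", "text"),
}


def detect_stimulus_kind(path):
    """Return 'video', 'audio', or 'text' based on the file extension."""
    suffix = Path(path).suffix.lower()  # assumption: matching is case-insensitive
    for kind, extensions in EXTENSIONS.items():
        if suffix in extensions:
            return kind
    raise ValueError(f"unsupported stimulus extension: {suffix!r}")


kind = detect_stimulus_kind("/path/to/video.mp4")
print(kind, ACTIVE_MODALITIES[kind])  # video ('video', 'audio', 'text')
```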

Parameters used in `get_model_metadata`

This function loads the encoding model’s metadata without having to load the model itself.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: fmri-multi_study-tribe_v2
Example: “fmri-multi_study-tribe_v2”

Performance

Accuracy Plots (AWS directory):

brain-encoding-response-generator/encoding_models/modality-fmri/train_dataset-multi_study/model-tribe_v2/encoding_models_accuracy

Example Usage

import numpy as np
from berg import BERG

# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Create optional vertex mask
vertex_mask = np.zeros(20484, dtype=int)
vertex_mask[100:200] = 1

# Load the model with ROI and/or vertex selection
model = berg.get_encoding_model(
    "fmri-multi_study-tribe_v2",
    selection={
        "roi": ["V1", "V2", "FFC"],
        "vertices": vertex_mask,
    },
)

# Prepare the stimulus (file path to video, audio, or text)
# Video: audio is extracted and speech transcribed automatically
stimulus = "/path/to/video.mp4"

# Generate in silico neural responses
responses = berg.encode(model, stimulus, show_progress=True)

# responses shape: [n_timesteps, n_vertices]
# - n_timesteps: one per second of stimulus duration
# - n_vertices: cortical vertices in the selection (up to 20,484)

# Generate responses with metadata
responses, metadata = berg.encode(model, stimulus, return_metadata=True)

# Load metadata without loading the model
metadata = berg.get_model_metadata("fmri-multi_study-tribe_v2")

References

TRIBE v2 paper (d’Ascoli et al., 2026): https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/
TRIBE v2 code: https://github.com/facebookresearch/tribev2
TRIBE v2 weights: https://huggingface.co/facebook/tribev2
TRIBE v2 demo: https://aidemos.atmeta.com/tribev2/
Glasser parcellation (Glasser et al., 2016): https://doi.org/10.1038/nature18933
Algonauts 2025 challenge (Gifford et al., 2024): https://arxiv.org/abs/2501.00504