fmri-dascoli_2026-tribe_v2

Model Summary

Modality

fMRI

Training Dataset

d’Ascoli et al. (2026) (CNeuroMod, BoldMoments, Lebel2023, Wen2017)

Species

Human

Stimuli

Video, Audio, Text

Model Type

Multimodal Transformer (TRIBE v2)

Creator

Stéphane d’Ascoli (FAIR at Meta)

Description

Setup. This model requires Python >= 3.11, FFmpeg, and HuggingFace authentication.

  1. Install FFmpeg:
     - Linux: sudo apt install ffmpeg
     - macOS: brew install ffmpeg
     - Conda: conda install -c conda-forge ffmpeg

  2. Install the HuggingFace CLI: pip install huggingface_hub

  3. Request access to LLaMA-3.2-3B at https://huggingface.co/meta-llama/Llama-3.2-3B

  4. Authenticate: hf auth login

  5. Paste your HuggingFace access token when prompted.

GPU with >= 16 GB VRAM is recommended. CPU inference is supported but very slow.
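
The prerequisites above can be checked before loading the model. The following is a minimal sketch using only the Python standard library; check_environment is a hypothetical helper, not part of BERG or TRIBE v2, and it does not verify HuggingFace authentication or GPU memory.

```python
import shutil
import sys


def check_environment(min_python=(3, 11)):
    """Return a dict of prerequisite checks; True means the check passed.

    Hypothetical helper for illustration only (not a BERG function).
    """
    return {
        # The setup notes require Python >= 3.11.
        "python": sys.version_info[:2] >= min_python,
        # FFmpeg must be discoverable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```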

What is TRIBE v2? TRIBE v2 is a tri-modal (video, audio, and language) foundation model for predicting human fMRI brain activity. It uses frozen pretrained feature extractors — V-JEPA2-Giant (video), Wav2Vec-BERT-2.0 (audio), and LLaMA-3.2-3B (text) — whose embeddings are fed into a trainable 8-layer, 8-head Transformer encoder that maps multimodal representations onto the cortical surface (fsaverage5, 20,484 vertices).

Architecture. Stimulus features are extracted at 2 Hz from each modality, projected to a shared 384-dimensional space per modality (1,152 total), and processed by an 8-layer, 8-head Transformer with 100-second context windows. A subject-conditioned final layer maps latent representations to cortical vertices. For inference, the model uses a special “unseen subject” layer trained via subject dropout, producing group-average-like predictions without requiring subject-specific data. No subject parameter is needed: the model always runs in this mode.
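
The tensor sizes implied by this description can be worked out with placeholder arrays. This is a shape-only sketch: it assumes the three projected modalities are concatenated along the feature axis (the text states only the 1,152 total), and it does not reproduce the actual TRIBE v2 layers.

```python
import numpy as np

FEATURE_RATE_HZ = 2       # features per second, per modality
CONTEXT_SECONDS = 100     # length of one context window
DIM_PER_MODALITY = 384    # projected width of each modality
MODALITIES = ("video", "audio", "text")

# Number of time steps the Transformer sees in one context window.
n_tokens = FEATURE_RATE_HZ * CONTEXT_SECONDS

# Placeholder (all-zero) projected features, one array per modality.
projected = {
    name: np.zeros((n_tokens, DIM_PER_MODALITY), dtype=np.float32)
    for name in MODALITIES
}

# Assumption: modalities are joined by concatenation on the feature axis.
fused = np.concatenate([projected[name] for name in MODALITIES], axis=-1)
print(fused.shape)  # (200, 1152)
```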

Training data. The model was trained on over 450 hours of fMRI across 25 subjects from four naturalistic datasets: Courtois NeuroMod (4 subjects, 269h — movies with speech), BoldMoments (10 subjects, 62h — short video clips), Lebel2023 (8 subjects, 86h — podcast listening), and Wen2017 (3 subjects, 35h — silent videos).
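
As a quick check, the per-dataset figures quoted above add up to the stated totals. The numbers below are copied from the paragraph; nothing else is assumed.

```python
# Per-dataset figures as listed in the training data description.
datasets = {
    "Courtois NeuroMod": {"subjects": 4, "hours": 269},
    "BoldMoments": {"subjects": 10, "hours": 62},
    "Lebel2023": {"subjects": 8, "hours": 86},
    "Wen2017": {"subjects": 3, "hours": 35},
}

total_subjects = sum(d["subjects"] for d in datasets.values())
total_hours = sum(d["hours"] for d in datasets.values())

print(total_subjects)  # 25
print(total_hours)     # 452
```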

Feature extraction pipeline. When given a video, the model automatically (1) extracts audio from the video track, (2) transcribes speech with WhisperX to get word-level timings, and (3) extracts visual features from V-JEPA2-Giant (64 frames spanning 4 seconds per time bin), audio features from Wav2Vec-BERT-2.0, and text features from LLaMA-3.2-3B with 1,024 tokens of preceding context. When given text only, it first synthesizes speech via gTTS and then runs the same audio+text pipeline. In practice, you provide a single file to the model, and a video file triggers the full pipeline.

Output. Predictions are time-resolved fMRI activity at 1 Hz (1 TR = 1 second) across 20,484 cortical vertices on the fsaverage5 surface mesh. ROI selection is available via the Glasser HCP-MMP1.0 parcellation (180 bilateral cortical regions).

Performance. An earlier iteration of TRIBE v2 achieved first place in the Algonauts 2025 brain prediction competition (263 teams, mean score 0.2146). The current model, trained on over 1,000 hours of fMRI across 720 subjects, significantly outperforms linear encoding baselines across all training datasets and generalizes zero-shot to unseen subjects and tasks, including non-naturalistic experimental paradigms such as visual and language functional localizers.

Metadata

fmri

subject_id : str - Subject identifier (‘average’)

n_vertices : int - Total cortical vertices (20484)

n_vertices_lh : int - Left hemisphere vertices (10242)

n_vertices_rh : int - Right hemisphere vertices (10242)

surface_mesh : str - Surface mesh name (‘fsaverage5’)

output_frequency_hz : float - Temporal resolution of predictions (1.0 Hz)
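
When all 20,484 vertices are returned, the hemisphere counts above can be used to split the output. The sketch below assumes the left hemisphere occupies the first 10,242 columns, as the "left + right" ordering in this document suggests; split_hemispheres is a hypothetical helper and the ordering should be confirmed against the metadata before relying on it.

```python
import numpy as np

N_VERTICES_LH = 10242
N_VERTICES_RH = 10242


def split_hemispheres(responses):
    """Split a full-cortex (n_timesteps, 20484) array into (lh, rh).

    Assumes left-hemisphere vertices come first (not confirmed by the source).
    """
    expected = N_VERTICES_LH + N_VERTICES_RH
    if responses.ndim != 2 or responses.shape[1] != expected:
        raise ValueError(f"expected shape (n_timesteps, {expected})")
    return responses[:, :N_VERTICES_LH], responses[:, N_VERTICES_LH:]


# Placeholder array standing in for 30 seconds of full-cortex predictions.
demo = np.zeros((30, 20484), dtype=np.float32)
lh, rh = split_hemispheres(demo)
print(lh.shape, rh.shape)  # (30, 10242) (30, 10242)
```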

roi

parcellation : str - Parcellation name (‘Glasser_HCP-MMP1.0’)

roi_labels : (180,) - Bilateral ROI names (e.g., ‘V1’, ‘V2’, ‘FFC’)

roi_assignments : (20484,) - ROI index per vertex (-1 = medial wall)

roi_index : dict - Mapping from ROI name to integer index in roi_assignments
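
The roi_index and roi_assignments fields can be combined to pull one region out of a full-cortex prediction. The sketch below uses toy arrays in place of real metadata; roi_mean_timeseries is a hypothetical helper, and it assumes responses cover every vertex in the same order as roi_assignments.

```python
import numpy as np


def roi_mean_timeseries(responses, roi_assignments, roi_index, roi_name):
    """Average predicted activity over the vertices of one named ROI.

    Hypothetical helper: `responses` is (n_timesteps, n_vertices) and
    `roi_assignments` is (n_vertices,) with -1 marking the medial wall.
    """
    mask = roi_assignments == roi_index[roi_name]
    if not mask.any():
        raise ValueError(f"no vertices assigned to ROI {roi_name!r}")
    return responses[:, mask].mean(axis=1)


# Toy example: 6 vertices, 2 time points, two ROIs, one medial-wall vertex.
roi_index = {"V1": 0, "V2": 1}
roi_assignments = np.array([0, 0, 1, -1, 1, 0])
responses = np.arange(12, dtype=float).reshape(2, 6)

v1 = roi_mean_timeseries(responses, roi_assignments, roi_index, "V1")
v2 = roi_mean_timeseries(responses, roi_assignments, roi_index, "V2")
print(v1)  # [2. 8.]
print(v2)  # [3. 9.]
```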

Input

Type

str (file path)

Description

The input is a single file path (string) to a video, audio, or text file of any duration.
The file type is auto-detected from the extension, and determines which
modalities are activated:

• Video input → visual + audio + text features (full multimodal)
• Audio input → audio + text features (speech is transcribed)
• Text input → audio + text features (text is first synthesized to speech)

Exactly one file path must be provided per call. For video input, audio and text are
extracted automatically from the video file, so no separate audio or transcript file is needed.

Stimuli are processed as temporal sequences. Features are extracted at 2 Hz and
predictions are returned at 1 Hz, producing one predicted fMRI sample per second
of stimulus duration.
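
The two rates above determine how many feature steps and predicted samples to expect for a given stimulus. This is a rough sketch: expected_lengths is a hypothetical helper, and rounding partial seconds down is an assumption, since the source does not say how trailing fractions of a second are handled.

```python
import math

FEATURE_RATE_HZ = 2     # feature extraction rate
PREDICTION_RATE_HZ = 1  # rate of returned predictions


def expected_lengths(duration_s):
    """Return (n_feature_steps, n_timesteps) for a stimulus duration in seconds.

    Assumption: partial seconds are rounded down.
    """
    n_feature_steps = math.floor(duration_s * FEATURE_RATE_HZ)
    n_timesteps = math.floor(duration_s * PREDICTION_RATE_HZ)
    return n_feature_steps, n_timesteps


print(expected_lengths(90))    # (180, 90)
print(expected_lengths(12.7))  # (25, 12)
```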

Output

Type

numpy.ndarray

Shape

[n_timesteps, n_vertices]

Description

The output is a 2D array containing predicted fMRI activity on the fsaverage5
cortical surface. Its shape is (n_timesteps, n_vertices), where n_timesteps depends
on stimulus duration (1 TR = 1 second) and n_vertices depends on the vertex selection. These predictions
should not be read as a one-to-one mapping from a given stimulus second to the same fMRI second,
because fMRI responses are delayed and temporally blurred by the hemodynamic response.
TRIBE v2 also uses temporal context, so each prediction can depend on preceding and surrounding stimulus information, not only the current second.
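
To illustrate the delay and blurring mentioned above, the sketch below convolves a one-second stimulus impulse with a generic double-gamma hemodynamic response function. The kernel and its parameters (shapes 6 and 16, undershoot ratio 1/6) are a common textbook choice used here as an assumption; they are not taken from TRIBE v2, which learns its own stimulus-to-response mapping.

```python
import math

import numpy as np


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0,
                     undershoot_ratio=1.0 / 6.0):
    """Generic double-gamma HRF evaluated at times t (seconds).

    Illustrative parameters only; not part of TRIBE v2.
    """
    t = np.asarray(t, dtype=float)
    peak = t ** (peak_shape - 1) * np.exp(-t) / math.gamma(peak_shape)
    undershoot = (t ** (undershoot_shape - 1) * np.exp(-t)
                  / math.gamma(undershoot_shape))
    return peak - undershoot_ratio * undershoot


# A single stimulus event at second 0, sampled at 1 Hz for 30 seconds.
stimulus = np.zeros(30)
stimulus[0] = 1.0

# Convolve with the kernel sampled at the same 1 Hz rate.
kernel = double_gamma_hrf(np.arange(0.0, 30.0, 1.0))
bold_like = np.convolve(stimulus, kernel)[:30]

# The response peaks several seconds after the event, not at second 0.
peak_second = int(np.argmax(bold_like))
print(peak_second)
```

With these parameters the simulated response peaks several seconds after the event, which is why a predicted sample should not be attributed solely to the stimulus content of the same second.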

Dimensions

n_timesteps: Number of seconds of predicted brain activity (1 per second, depends on stimulus duration)
n_vertices: Number of cortical vertices in the selection (up to 20,484)

Parameters

Parameters used in get_encoding_model

This function loads the encoding model.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: fmri-multi_study-tribe_v2
Example: “fmri-multi_study-tribe_v2”

selection

Type: dict
Required: No
Description: Specifies which cortical vertices to include in the model output.
Can include ROI names and/or a binary vertex mask. If both are provided,
their union (logical OR) is used. If not provided, all 20,484 vertices
are returned.

Properties:

roi
Type: list[str]
Description: List of Glasser HCP-MMP1.0 ROI names to include.
Selects vertices from both hemispheres for each named region.
Valid values: “V1”, “MST”, “V6”, “V2”, “V3”, “V4”, “V8”, “4”, “3b”, “FEF”, “PEF”, “55b”, “V3A”, “RSC”, “POS2”, “V7”, “IPS1”, “FFC”, “V3B”, “LO1”, “LO2”, “PIT”, “MT”, “A1”, “PSL”, “SFL”, “PCV”, “STV”, “7Pm”, “7m”, “POS1”, “23d”, “v23ab”, “d23ab”, “31pv”, “5m”, “5mv”, “23c”, “5L”, “24dd”, “24dv”, “7AL”, “SCEF”, “6ma”, “7Am”, “7PL”, “7PC”, “LIPv”, “VIP”, “MIP”, “1”, “2”, “3a”, “6d”, “6mp”, “6v”, “p24pr”, “33pr”, “a24pr”, “p32pr”, “a24”, “d32”, “8BM”, “p32”, “10r”, “47m”, “8Av”, “8Ad”, “9m”, “8BL”, “9p”, “10d”, “8C”, “44”, “45”, “47l”, “a47r”, “6r”, “IFJa”, “IFJp”, “IFSp”, “IFSa”, “p9-46v”, “46”, “a9-46v”, “9-46d”, “9a”, “10v”, “a10p”, “10pp”, “11l”, “13l”, “OFC”, “47s”, “LIPd”, “6a”, “i6-8”, “s6-8”, “43”, “OP4”, “OP1”, “OP2-3”, “52”, “RI”, “PFcm”, “PoI2”, “TA2”, “FOP4”, “MI”, “Pir”, “AVI”, “AAIC”, “FOP1”, “FOP3”, “FOP2”, “PFt”, “AIP”, “EC”, “PreS”, “H”, “ProS”, “PeEc”, “STGa”, “PBelt”, “A5”, “PHA1”, “PHA3”, “STSda”, “STSdp”, “STSvp”, “TGd”, “TE1a”, “TE1p”, “TE2a”, “TF”, “TE2p”, “PHT”, “PH”, “TPOJ1”, “TPOJ2”, “TPOJ3”, “DVT”, “PGp”, “IP2”, “IP1”, “IP0”, “PFop”, “PF”, “PFm”, “PGi”, “PGs”, “V6A”, “VMV1”, “VMV3”, “PHA2”, “V4t”, “FST”, “V3CD”, “LO3”, “VMV2”, “31pd”, “31a”, “VVC”, “25”, “s32”, “pOFC”, “PoI1”, “Ig”, “FOP5”, “p10p”, “p47r”, “TGv”, “MBelt”, “LBelt”, “A4”, “STSva”, “TE1m”, “PI”, “a32pr”, “p24”
Example: [‘V1’, ‘V2’, ‘FFC’]

vertices
Type: numpy.ndarray
Description: Binary mask vector indicating which vertices to include.
Must have exactly 20,484 elements (10,242 left + 10,242 right hemisphere).
Each position set to 1 indicates that vertex should be included.
Example: [0, 0, …, 1, 1, 0]
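
The union rule described for selection can be sketched as follows, which is useful for predicting n_vertices before running the model. build_selection_mask is a hypothetical helper operating on toy metadata; BERG applies the selection internally, and this sketch only mirrors the logical OR described above.

```python
import numpy as np

N_VERTICES = 20484


def build_selection_mask(roi_names, vertex_mask, roi_assignments, roi_index):
    """Combine ROI names and an optional binary vertex mask with logical OR.

    Hypothetical helper for illustration; not a BERG function.
    """
    mask = np.zeros(N_VERTICES, dtype=bool)
    for name in roi_names:
        mask |= roi_assignments == roi_index[name]
    if vertex_mask is not None:
        vertex_mask = np.asarray(vertex_mask)
        if vertex_mask.shape != (N_VERTICES,):
            raise ValueError(f"vertex mask must have {N_VERTICES} elements")
        mask |= vertex_mask.astype(bool)
    return mask


# Toy metadata: pretend the first 50 vertices belong to V1.
roi_index = {"V1": 0}
roi_assignments = np.full(N_VERTICES, -1)
roi_assignments[:50] = 0

# Binary mask overlapping the ROI on vertices 40..49.
vertex_mask = np.zeros(N_VERTICES, dtype=int)
vertex_mask[40:60] = 1

selection_mask = build_selection_mask(["V1"], vertex_mask,
                                      roi_assignments, roi_index)
print(int(selection_mask.sum()))  # 60
```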

device

Type: str
Required: No
Description: Device to run the model on. ‘auto’ will use CUDA if available, otherwise CPU.
GPU with >= 16 GB VRAM is strongly recommended. CPU inference is supported but very slow.
Valid Values: “cpu”, “cuda”, “auto”
Example: “auto”
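
The "auto" behaviour described above amounts to a simple fallback. The sketch below is an assumption about the logic, not BERG's actual code; resolve_device is a hypothetical helper, and CUDA availability is passed in as a flag so the example runs without a deep learning framework installed.

```python
VALID_DEVICES = ("cpu", "cuda", "auto")


def resolve_device(requested="auto", cuda_available=False):
    """Map a requested device string to the device that would be used.

    Hypothetical helper: 'auto' picks CUDA when available, otherwise CPU.
    """
    if requested not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}")
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested


print(resolve_device("auto", cuda_available=True))   # cuda
print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("cpu", cuda_available=True))    # cpu
```

If PyTorch is installed, the flag could be supplied by torch.cuda.is_available().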

Parameters used in encode

This function generates in silico neural responses using the encoding model previously loaded.

model

Type: BaseModelInterface
Required: Yes
Description: An instantiated and loaded encoding model.

stimulus

Type: str
Required: Yes
Description: File path (string) to the stimulus. Exactly one file must be provided.
The file type is auto-detected from the extension:

• Video: .mp4, .avi, .mkv, .mov, .webm → activates video + audio + text features
• Audio: .wav, .mp3, .flac, .ogg → activates audio + text features
• Text: .txt → text is synthesized to speech, then activates audio + text features
Example: “/path/to/video.mp4”
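
The extension rules above can be expressed as a small lookup, which is handy for validating a path before calling encode. detect_stimulus_type is a hypothetical helper; treating extensions case-insensitively is an assumption not stated in the source.

```python
from pathlib import Path

# Extensions and activated modalities as listed for the stimulus parameter.
EXTENSIONS = {
    "video": (".mp4", ".avi", ".mkv", ".mov", ".webm"),
    "audio": (".wav", ".mp3", ".flac", ".ogg"),
    "text": (".txt",),
}
MODALITIES = {
    "video": ("video", "audio", "text"),
    "audio": ("audio", "text"),
    "text": ("audio", "text"),  # text is first synthesized to speech
}


def detect_stimulus_type(path):
    """Return (stimulus_type, activated_modalities) for a file path.

    Hypothetical helper; lower-casing the extension is an assumption.
    """
    suffix = Path(path).suffix.lower()
    for stimulus_type, suffixes in EXTENSIONS.items():
        if suffix in suffixes:
            return stimulus_type, MODALITIES[stimulus_type]
    raise ValueError(f"unsupported stimulus extension: {suffix!r}")


print(detect_stimulus_type("/path/to/video.mp4"))
# ('video', ('video', 'audio', 'text'))
```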

return_metadata

Type: bool
Required: No
Description: Whether to return the encoding model’s metadata together with the in silico neural responses.
Example: True

show_progress

Type: bool
Required: No
Description: Whether to show a progress bar during encoding.
Example: True

Parameters used in get_model_metadata

This function loads the encoding model’s metadata without having to load the model itself.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model whose metadata should be loaded.
Valid Values: fmri-multi_study-tribe_v2
Example: “fmri-multi_study-tribe_v2”

Performance

Accuracy Plots (AWS directory):

  • brain-encoding-response-generator/encoding_models/modality-fmri/train_dataset-multi_study/model-tribe_v2/encoding_models_accuracy

Example Usage

import numpy as np
from berg import BERG

# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Create optional vertex mask
vertex_mask = np.zeros(20484, dtype=int)
vertex_mask[100:200] = 1

# Load the model with ROI and/or vertex selection
model = berg.get_encoding_model(
    "fmri-multi_study-tribe_v2",
    selection={
        "roi": ["V1", "V2", "FFC"],
        "vertices": vertex_mask,
    },
)

# Prepare the stimulus (file path to video, audio, or text)
# Video: audio is extracted and speech transcribed automatically
stimulus = "/path/to/video.mp4"

# Generate in silico neural responses
responses = berg.encode(model, stimulus, show_progress=True)

# responses shape: [n_timesteps, n_vertices]
# - n_timesteps: one per second of stimulus duration
# - n_vertices: cortical vertices in the selection (up to 20,484)

# Generate responses with metadata
responses, metadata = berg.encode(model, stimulus, return_metadata=True)

# Load metadata without loading the model
metadata = berg.get_model_metadata("fmri-multi_study-tribe_v2")

References