fmri-cneuromod_algo2025-vibe

Model Summary

Modality	fMRI
Training Dataset	CNeuroMod (Algonauts 2025 challenge preparation)
Species	Human
Stimuli	Video + Audio + Text
Model Type	Transformers
Creator	Shrey Dixit, Daniel Carlström Schad, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski

Description

VIBE (Video-Input Brain Encoder) is a multimodal fMRI encoding model trained on CNeuroMod movie data. It combines per-TR language transcripts, movie audio, and video features to predict whole-brain fMRI activity in Schaefer-1000 parcel space.

Architecture overview: VIBE uses a two-stage Transformer architecture. In the first stage, a modality-fusion transformer performs cross-attention across modalities independently at each time point (TR). Each feature stream (text, audio, video) is linearly projected to a shared 256-dimensional space together with a learned subject embedding, and fused via a single-layer Transformer encoder. The fused per-TR representations are concatenated and passed to the second stage: a prediction transformer (2 layers) that models temporal dependencies across TRs using Rotary Positional Embeddings (RoPE). A final feed-forward layer maps to the 1000-parcel Schaefer output space. The model is trained with a combined Pearson-correlation + MSE loss and ensembled across multiple seeds. For full details see Schad, Dixit, Keck et al. (2025), arXiv:2507.17958.

These BERG-integrated models are modified from the original to use fewer feature extractors for faster inference and lower memory usage.

Temporal resolution: The model was trained with a TR of 1.49 s, which is also the prediction resolution. The number of transcript strings passed as stimulus must exactly match the number of TRs derived from the video (i.e., floor(video_duration / 1.49)). A mismatch will raise an error.
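The TR arithmetic above can be checked before calling the model. The following is a minimal sketch, not part of BERG: the helper names `expected_num_trs` and `check_transcript_length` are hypothetical, and the video duration in seconds is assumed to be known already.

```python
import math

TR_SECONDS = 1.49  # repetition time stated in this model card


def expected_num_trs(video_duration_s: float, tr: float = TR_SECONDS) -> int:
    """Number of TRs covered by a video: floor(video_duration / TR)."""
    if video_duration_s < 0:
        raise ValueError("video_duration_s must be non-negative")
    return math.floor(video_duration_s / tr)


def check_transcript_length(transcripts: list[str], video_duration_s: float) -> None:
    """Raise ValueError if the transcript count differs from the TR count."""
    n_expected = expected_num_trs(video_duration_s)
    if len(transcripts) != n_expected:
        raise ValueError(
            f"Got {len(transcripts)} transcripts but the video spans {n_expected} TRs"
        )


# Hypothetical 4.6 s clip: 4.6 / 1.49 is about 3.09, so 3 TRs are expected.
check_transcript_length(["Hello, are you", "awake? Yes,", "I just woke up."], 4.6)
```

Running such a check up front surfaces a length mismatch before any feature extraction starts.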

The best model (when ensembled) reaches a mean parcel-wise Pearson correlation of 0.3193 on in-distribution data (Friends S07) and 0.2122 on out-of-distribution data (6 films).

Pretrained variants are available from the Hugging Face collection `ShreyDixit/vibe`. You can list the variants with berg.get_model_variants() and load a specific one by passing model_variant=… to get_encoding_model().

Metadata

Note

Atlas files for glass brain visualization (Schaefer 1000-parcel MNI coordinates) are provided separately in the BERG directory and are not part of the per-subject metadata files.

roi_masks

Cont : (1000,) - Binary mask for Control/Frontoparietal network parcels

Default : (1000,) - Binary mask for Default Mode network parcels

DorsAttn : (1000,) - Binary mask for Dorsal Attention network parcels

Limbic : (1000,) - Binary mask for Limbic network parcels

SalVentAttn : (1000,) - Binary mask for Salience/Ventral Attention network parcels

SomMot : (1000,) - Binary mask for Somatomotor network parcels

Vis : (1000,) - Binary mask for Visual network parcels
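As an illustration of how these binary masks can be applied, the sketch below averages predicted activity over the parcels of one network. It uses synthetic data throughout: the `vis_mask` here is a made-up stand-in, not the real Vis mask from the metadata, and `network_mean_timecourse` is a hypothetical helper rather than a BERG function.

```python
import numpy as np


def network_mean_timecourse(responses, roi_mask):
    """Average activity over the parcels flagged in a binary mask.

    responses: array of shape (num_timepoints, num_parcels)
    roi_mask:  array of shape (num_parcels,) containing 0/1 values
    Returns an array of shape (num_timepoints,).
    """
    responses = np.asarray(responses)
    mask = np.asarray(roi_mask).astype(bool)
    if responses.ndim != 2 or mask.shape != (responses.shape[1],):
        raise ValueError("roi_mask must have one entry per parcel column")
    if not mask.any():
        raise ValueError("roi_mask selects no parcels")
    return responses[:, mask].mean(axis=1)


# Synthetic stand-ins for model output and for a network mask.
rng = np.random.default_rng(0)
responses = rng.standard_normal((5, 1000))  # 5 TRs, 1000 parcels
vis_mask = np.zeros(1000, dtype=int)
vis_mask[:100] = 1  # hypothetical parcel assignment, for illustration only
vis_tc = network_mean_timecourse(responses, vis_mask)
```

Since encode() returns a torch.Tensor, predictions held on a GPU would first need to be converted, for example with `responses.cpu().numpy()`.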

Input

Type	`list[str]` (stimulus), `str` (video_path)
Description	Two inputs are required: (1) stimulus: a list of per-TR transcripts, one string per TR (TR = 1.49 s), whose length must match the number of TRs derived from the video; (2) video_path: the path to the source video used for audio and video feature extraction.
Example	`stimulus = ["Hello, are you", "awake? Yes,"]`; `video_path = "/path/to/movie.mp4"`
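One way to obtain per-TR transcripts is to bin word-level timestamps into 1.49 s windows. The sketch below is an assumption about preprocessing, not the authors' pipeline: it assigns each word to the TR in which it starts and leaves silent TRs as empty strings. Whether the model expects empty strings for silent TRs is not stated in this card. The helper name `words_to_tr_transcripts` and the example timestamps are hypothetical.

```python
import math

TR_SECONDS = 1.49  # repetition time stated in this model card


def words_to_tr_transcripts(words, num_trs, tr=TR_SECONDS):
    """Group (onset_seconds, word) pairs into one transcript string per TR.

    A word goes to the TR in which it starts (an assumption of this sketch).
    TRs without speech get an empty string; words that start at or after
    num_trs * tr are dropped.
    """
    bins = [[] for _ in range(num_trs)]
    for onset, word in sorted(words):
        idx = math.floor(onset / tr)
        if 0 <= idx < num_trs:
            bins[idx].append(word)
    return [" ".join(b) for b in bins]


# Made-up word onsets in seconds, for illustration only.
words = [(0.2, "Hello,"), (0.7, "are"), (1.1, "you"), (1.6, "awake?"), (2.4, "Yes,")]
transcripts = words_to_tr_transcripts(words, num_trs=3)
# transcripts == ["Hello, are you", "awake? Yes,", ""]
```

Passing `num_trs` explicitly keeps the output length tied to the video duration rather than to the last spoken word.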

Output

Type	`torch.Tensor`
Shape	`['num_timepoints', 'num_parcels']`
Description	Predicted fMRI activity for each TR.
Dimensions	num_timepoints: Number of predicted TRs. num_parcels: Number of parcels (up to 1000 Schaefer parcels, or selected subset).

Parameters

Parameters used in `get_encoding_model`

This function loads the encoding model.

model_id	Type: str Required: Yes Description: Unique identifier of the model to load. Valid Values: fmri-cneuromod_algo2025-vibe Example: "fmri-cneuromod_algo2025-vibe"
subject	Type: int Required: No Description: Subject ID for subject-conditioned prediction. Uses Algonauts-style IDs [1,2,3,5]. If omitted here, pass subject to encode(…, subject=…) instead. Valid Values: 1, 2, 3, 5 Example: 1
device	Type: str Required: No Description: The computing device to use for inference. Valid Values: "cpu", "cuda", "auto" Example: "auto"
model_variant	Type: str Required: No Description: Hugging Face repository ID of a specific pretrained VIBE variant to load. If provided, its associated config is used and the config argument is ignored. To list the available variants, use model.get_pretrained_variants() or berg.get_model_variants(model_id). Example: "ShreyDixit/VIBE-Qwen2.5-14B"
low_mem_use	Type: bool Required: No Description: If True, unloads heavy components between calls to reduce memory usage (slower, but with a lower VRAM footprint). Example: True
selection	Type: dict Required: No Description: Optional output filtering by network label and/or parcel index mask. If both are provided, they are combined with OR. Properties: roi Type: list[str] Description: Schaefer 2018 (7-network) labels to keep. Valid values: "Vis", "SomMot", "DorsAttn", "SalVentAttn", "Limbic", "Cont", "Default" Example: ['Vis'] parcel_index Type: numpy.ndarray Description: Binary mask vector selecting parcels. Must have length 1000 and contain at least one 1. Example: [0, 0, '...', 1, 1, 0]
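The `parcel_index` vector and the documented OR combination can be sketched with NumPy as follows. This only illustrates the semantics described above and is not BERG's implementation. The helpers `make_parcel_index` and `union_selection` are hypothetical, parcel indices are assumed to be 0-based, and the toy `roi_masks` dictionary stands in for the real network masks.

```python
import numpy as np

NUM_PARCELS = 1000  # Schaefer-1000 parcel space


def make_parcel_index(parcel_ids, num_parcels=NUM_PARCELS):
    """Binary vector with a 1 at each requested parcel (0-based, assumed)."""
    ids = np.asarray(list(parcel_ids), dtype=int)
    if ids.size == 0:
        raise ValueError("select at least one parcel")
    if ids.min() < 0 or ids.max() >= num_parcels:
        raise ValueError("parcel id out of range")
    mask = np.zeros(num_parcels, dtype=int)
    mask[ids] = 1
    return mask


def union_selection(roi_masks, roi, parcel_index):
    """OR-combine the network masks named in `roi` with `parcel_index`."""
    combined = np.asarray(parcel_index).astype(bool).copy()
    for label in roi:
        combined |= np.asarray(roi_masks[label]).astype(bool)
    return combined.astype(int)


# Toy network mask, for illustration only (not the real Vis parcels).
roi_masks = {"Vis": make_parcel_index(range(0, 100))}
parcel_index = make_parcel_index([997, 998])
combined = union_selection(roi_masks, ["Vis"], parcel_index)
# 100 toy "Vis" parcels plus 2 explicitly indexed parcels are kept.
```

A vector built this way has the required length of 1000 and contains at least one 1, so it matches the constraints listed for `parcel_index`.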

Parameters used in `encode`

This function generates in silico neural responses using the encoding model previously loaded.

subject	Type: int Required: No Description: Subject ID for subject-conditioned prediction. Uses Algonauts-style IDs [1,2,3,5]. Pass it here if it was omitted in get_encoding_model(). Valid Values: 1, 2, 3, 5 Example: 1
model	Type: BaseModelInterface Required: Yes Description: An instantiated and loaded encoding model.
stimulus	Type: list[str] Required: Yes Description: A list of transcript strings, one per TR (TR = 1.49 s). The length of this list must exactly match the number of TRs derived from the video duration (floor(video_duration / 1.49)). Example: ["Hello, are you", "awake? Yes,"]
video_path	Type: str Required: Yes Description: Path to the video stimulus file. Example: "/path/to/movie.mp4"
return_metadata	Type: bool Required: No Description: Whether to return model metadata together with responses. Example: True
show_progress	Type: bool Required: No Description: Whether to show a progress bar during encoding. Example: True

Parameters used in `get_model_metadata`

This function loads the encoding model’s metadata without having to load the model itself.

model_id	Type: str Required: Yes Description: Unique identifier of the model to load. Valid Values: fmri-cneuromod_algo2025-vibe Example: "fmri-cneuromod_algo2025-vibe"
subject	Type: int Required: No Description: Subject ID for subject-conditioned prediction. Uses Algonauts-style IDs [1,2,3,5]. If omitted in get_encoding_model(), pass subject to encode(…, subject=…). Valid Values: 1, 2, 3, 5 Example: 1

Model-specific utility methods

`get_model_variants()`

Retrieve available pretrained variants for this model without instantiating it.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model to load.

variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

`generate_glass_brain_animation()`

Generates and saves an animated glass brain GIF from predicted responses. Called directly on the loaded model instance.

responses	Type: `torch.Tensor` Required: Yes Description: Model predictions generated by encode().
out_path	Type: `str` Required: No Default: brain_activation.gif Description: Path for the generated GIF.

model.generate_glass_brain_animation(responses, out_path="activation.gif")

Performance

Metrics:

Mean parcel-wise Pearson correlation: ID Friends S07: 0.3193; OOD (6 films): 0.2122
Model variants: Available in Hugging Face Collection: ShreyDixit/vibe
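For readers who want to compute the same kind of score on their own data, the following sketch shows one way to calculate a mean parcel-wise Pearson correlation with NumPy. It is not the challenge's evaluation code: how the reported numbers are averaged across subjects and runs is not specified here, and `mean_parcelwise_pearson` is a hypothetical helper.

```python
import numpy as np


def mean_parcelwise_pearson(pred, true):
    """Pearson r over time for each parcel, then the mean across parcels.

    pred, true: arrays of shape (num_timepoints, num_parcels).
    A parcel with zero variance in either array yields NaN for that parcel.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    if pred.ndim != 2 or pred.shape != true.shape:
        raise ValueError("pred and true must be 2-D arrays of equal shape")
    pred_c = pred - pred.mean(axis=0)  # center each parcel's time series
    true_c = true - true.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return float((num / den).mean())


# Synthetic check: a positive affine transform of the target correlates perfectly.
rng = np.random.default_rng(0)
true = rng.standard_normal((20, 4))
score = mean_parcelwise_pearson(2.0 * true + 3.0, true)
```

Because Pearson correlation ignores scale and offset, this metric rewards matching the shape of each parcel's time course rather than its absolute amplitude.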

Example Usage

from berg import BERG

# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Discover all model variants
variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

# Load the model
model = berg.get_encoding_model(
    "fmri-cneuromod_algo2025-vibe",
    subject=1,
    device="auto",
    model_variant="ShreyDixit/VIBE-Qwen2.5-14B",
    low_mem_use=True,
    selection={
        "roi": ["Vis"],
        # Placeholder: pass a length-1000 binary numpy.ndarray here
        "parcel_index": [0, 0, '...', 1, 1, 0]
    }
)

# Prepare stimulus: one transcript string per TR, matching video duration
transcripts = ["Hello, are you", "awake? Yes,", "I just woke up."]
video_path = "/path/to/movie.mp4"

# Generates the in silico neural responses using the encoding model previously loaded
responses = berg.encode(
    model,
    transcripts,
    video_path=video_path
)

# The in silico fMRI responses will be a torch.Tensor of shape:
# ['num_timepoints', 'num_parcels']
# where:
# - num_timepoints: Number of predicted TRs.
# - num_parcels: Number of parcels (up to 1000 Schaefer parcels, or selected subset).

# Generate in silico neural responses with metadata
responses, metadata = berg.encode(
    model,
    transcripts,
    video_path=video_path,
    return_metadata=True
)

# Load the encoding model's metadata without having to load the model itself
metadata = berg.get_model_metadata(
    "fmri-cneuromod_algo2025-vibe",
)

# Generate a gif out of the responses
gif_path = model.generate_glass_brain_animation(
    responses=responses,
    out_path="brain_activation.gif"
)

References

Schad, Daniel Carlström; Dixit, Shrey; Keck, Janis; Studenyak, Viktor; Shpilevoi, Aleksandr; Bicanski, Andrej. VIBE: Video-Input Brain Encoder for fMRI Response Modeling. arXiv:2507.17958 (2025).
Algonauts 2025 challenge dataset: https://github.com/courtois-neuromod/algonauts_2025.competitors