============================
fmri-cneuromod_algo2025-vibe
============================

Model Summary
-------------

.. list-table::
   :widths: 30 70
   :stub-columns: 1

   * - Modality
     - fMRI
   * - Training Dataset
     - CNeuroMod (Algonauts 2025 challenge preparation)
   * - Species
     - Human
   * - Stimuli
     - Video + Audio + Text
   * - Model Type
     - Transformers
   * - Creator
     - Shrey Dixit, Daniel Carlström Schad, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski

Description
-----------

VIBE (Video-Input Brain Encoder) is a multimodal fMRI encoding model trained on
CNeuroMod movie data. It combines per-TR language transcripts, movie audio, and
video features to predict whole-brain fMRI activity in Schaefer-1000 parcel space.

Architecture overview:
VIBE uses a two-stage Transformer architecture. In the first stage, a
modality-fusion transformer performs cross-attention across modalities
independently at each time point (TR). Each feature stream (text, audio, video)
is linearly projected to a shared 256-dimensional space together with a learned
subject embedding, and fused via a single-layer Transformer encoder. The fused
per-TR representations are concatenated and passed to the second stage: a
prediction transformer (2 layers) that models temporal dependencies across TRs
using Rotary Positional Embeddings (RoPE). A final feed-forward layer maps to the
1000-parcel Schaefer output space. The model is trained with a combined
Pearson-correlation + MSE loss and ensembled across multiple seeds.

For full details see Schad, Dixit, Keck et al. (2025), arXiv:2507.17958.
These BERG-integrated models are modified from the original to use fewer feature
extractors for faster inference and lower memory usage.

Temporal resolution:
The model was trained with a TR of 1.49 s, which is also the prediction
resolution. The number of transcript strings passed as ``stimulus`` must exactly
match the number of TRs derived from the video
(i.e., ``floor(video_duration / 1.49)``). A mismatch will raise an error.

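
The TR-count rule above can be checked before calling the model. The following
is a minimal sketch, not part of BERG: ``expected_num_trs`` and
``check_transcripts`` are hypothetical helper names, the video duration is
assumed to be already known in seconds, and the way BERG itself measures the
duration of a video file may differ.

.. code-block:: python

    import math

    TR_SECONDS = 1.49  # TR stated in this document


    def expected_num_trs(video_duration_s):
        """Return floor(video_duration / TR): the required number of transcripts."""
        return math.floor(video_duration_s / TR_SECONDS)


    def check_transcripts(transcripts, video_duration_s):
        """Hypothetical pre-flight check; raises ValueError on a length mismatch."""
        expected = expected_num_trs(video_duration_s)
        if len(transcripts) != expected:
            raise ValueError(
                f"Got {len(transcripts)} transcripts, expected {expected} "
                f"for a {video_duration_s} s video"
            )
        return expected


    # A 10-second clip spans floor(10 / 1.49) = 6 TRs, so six strings are needed.
    # Using empty strings for TRs without speech is an assumption of this sketch.
    transcripts = ["Hello, are you", "awake? Yes,", "I just woke up.", "", "", ""]
    num_trs = check_transcripts(transcripts, video_duration_s=10.0)

Running a check of this kind first gives an early, readable error instead of a
failure inside the encoding call.
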

The best model (when ensembled) reaches a mean parcel-wise Pearson correlation of
0.3193 on in-distribution data (Friends S07) and 0.2122 on out-of-distribution
data (6 films).

Pretrained variants are available from the Hugging Face collection
'ShreyDixit/vibe'. You can inspect variants via ``berg.get_model_variants()`` and
load a specific variant using ``model_variant=...`` in ``get_encoding_model()``.

Metadata
--------

.. note::

   Atlas files for glass brain visualization (Schaefer 1000-parcel MNI
   coordinates) are provided separately in the BERG directory and are not part
   of the per-subject metadata files.

**roi_masks**

* **Cont** : ``(1000,)`` - Binary mask for Control/Frontoparietal network parcels
* **Default** : ``(1000,)`` - Binary mask for Default Mode network parcels
* **DorsAttn** : ``(1000,)`` - Binary mask for Dorsal Attention network parcels
* **Limbic** : ``(1000,)`` - Binary mask for Limbic network parcels
* **SalVentAttn** : ``(1000,)`` - Binary mask for Salience/Ventral Attention network parcels
* **SomMot** : ``(1000,)`` - Binary mask for Somatomotor network parcels
* **Vis** : ``(1000,)`` - Binary mask for Visual network parcels

Input
-----

.. list-table::
   :widths: 20 80
   :stub-columns: 1

   * - Type
     - ``list[str], str``
   * - Description
     - | Two inputs are required:
       | 1. ``stimulus``: A list of per-TR transcripts (one string per TR, where TR = 1.49 s).
       | The length must match the number of TRs derived from the video.
       | 2. ``video_path``: Path to the source video used for audio/video feature extraction.
   * - Example
     - | ``stimulus = ["Hello, are you", "awake? Yes,"]``
       | ``video_path = "/path/to/movie.mp4"``

Output
------

.. list-table::
   :widths: 20 80
   :stub-columns: 1

   * - Type
     - ``torch.Tensor``
   * - Shape
     - ``(num_timepoints, num_parcels)``
   * - Description
     - Predicted fMRI activity for each TR.
   * - Dimensions
     - | **num_timepoints**: Number of predicted TRs.
       | **num_parcels**: Number of parcels (up to 1000 Schaefer parcels, or selected subset).

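
To illustrate how the ``(1000,)`` network masks listed under Metadata relate to
the ``(num_timepoints, num_parcels)`` output, here is a minimal sketch using
synthetic NumPy arrays. Everything in it is an assumption made for illustration:
the responses are random numbers rather than model output, the mask positions
are invented, and a real ``torch.Tensor`` is assumed to have been converted to
NumPy beforehand. It also assumes the full 1000-parcel output, i.e. no
``selection`` was applied when loading the model.

.. code-block:: python

    import numpy as np

    NUM_PARCELS = 1000  # Schaefer-1000 parcel space

    # Synthetic stand-ins (assumptions): real values would come from encode()
    # and from the roi_masks entry of the model metadata.
    rng = np.random.default_rng(0)
    responses = rng.standard_normal((5, NUM_PARCELS))  # (num_timepoints, num_parcels)
    vis_mask = np.zeros(NUM_PARCELS, dtype=int)
    vis_mask[:150] = 1  # hypothetical parcel positions, for illustration only

    # Boolean indexing keeps only the columns (parcels) where the mask is 1.
    vis_responses = responses[:, vis_mask.astype(bool)]

    # Mean predicted activity across the selected parcels, one value per TR.
    vis_timecourse = vis_responses.mean(axis=1)

The same indexing pattern applies to any of the seven network masks, since each
is a binary vector with one entry per parcel.
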

Parameters
----------

Parameters used in ``get_encoding_model``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This function loads the encoding model.

.. list-table::
   :widths: 20 80
   :header-rows: 0

   * - **model_id**
     - | **Type:** str
       | **Required:** Yes
       | **Description:** Unique identifier of the model to load.
       | **Valid Values:** fmri-cneuromod_algo2025-vibe
       | **Example:** "fmri-cneuromod_algo2025-vibe"
   * - **subject**
     - | **Type:** int
       | **Required:** No
       | **Description:** Subject ID for subject-conditioned prediction.
       | Uses Algonauts-style IDs [1,2,3,5].
       | If omitted in ``get_encoding_model()``, pass subject to ``encode(..., subject=...)``.
       | **Valid Values:** 1, 2, 3, 5
       | **Example:** 1
   * - **device**
     - | **Type:** str
       | **Required:** No
       | **Description:** The computing device to use for inference.
       | **Valid Values:** "cpu", "cuda", "auto"
       | **Example:** "auto"
   * - **model_variant**
     - | **Type:** str
       | **Required:** No
       | **Description:** Hugging Face repository ID of a specific pretrained VIBE variant to load.
       | If provided, its associated config is used and the ``config`` argument is ignored.
       | List the available variants with ``model.get_pretrained_variants()`` or ``berg.get_model_variants(model_id)``.
       | **Example:** "ShreyDixit/VIBE-Qwen2.5-14B"
   * - **low_mem_use**
     - | **Type:** bool
       | **Required:** No
       | **Description:** If True, unloads heavy components between calls to reduce memory usage
       | (slower but lower VRAM footprint).
       | **Example:** True
   * - **selection**
     - | **Type:** dict
       | **Required:** No
       | **Description:** Optional output filtering by network label and/or parcel index mask.
       | If both are provided, they are combined with OR.
       |
       | **Properties:**
       |
       | **roi**
       | **Type:** list[str]
       | **Description:** Schaefer 2018 (7-network) labels to keep.
       | **Valid values:** "Vis", "SomMot", "DorsAttn", "SalVentAttn", "Limbic", "Cont", "Default"
       | **Example:** ['Vis']
       |
       | **parcel_index**
       | **Type:** numpy.ndarray
       | **Description:** Binary mask vector selecting parcels (1 = keep, 0 = drop).
       | Must have length 1000 and contain at least one 1.
       | **Example:** [0, 0, '...', 1, 1, 0]

Parameters used in ``encode``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This function generates in silico neural responses using the encoding model previously loaded.

.. list-table::
   :widths: 20 80
   :header-rows: 0

   * - **subject**
     - | **Type:** int
       | **Required:** No
       | **Description:** Subject ID for subject-conditioned prediction.
       | Uses Algonauts-style IDs [1,2,3,5].
       | If omitted in ``get_encoding_model()``, pass subject to ``encode(..., subject=...)``.
       | **Valid Values:** 1, 2, 3, 5
       | **Example:** 1
   * - **model**
     - | **Type:** BaseModelInterface
       | **Required:** Yes
       | **Description:** An instantiated and loaded encoding model.
   * - **stimulus**
     - | **Type:** list[str]
       | **Required:** Yes
       | **Description:** A list of transcript strings, one per TR (TR = 1.49 s).
       | The length of this list must exactly match the number of TRs
       | derived from the video duration (``floor(video_duration / 1.49)``).
       | **Example:**
       | ["Hello, are you", "awake? Yes,"]
   * - **video_path**
     - | **Type:** str
       | **Required:** Yes
       | **Description:** Path to the video stimulus file.
       | **Example:** "/path/to/movie.mp4"
   * - **return_metadata**
     - | **Type:** bool
       | **Required:** No
       | **Description:** Whether to return model metadata together with responses.
       | **Example:** True
   * - **show_progress**
     - | **Type:** bool
       | **Required:** No
       | **Description:** Whether to show a progress bar during encoding.
       | **Example:** True

Parameters used in ``get_model_metadata``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This function loads the encoding model's metadata without having to load the model itself.

.. list-table::
   :widths: 20 80
   :header-rows: 0

   * - **model_id**
     - | **Type:** str
       | **Required:** Yes
       | **Description:** Unique identifier of the model to load.
       | **Valid Values:** fmri-cneuromod_algo2025-vibe
       | **Example:** "fmri-cneuromod_algo2025-vibe"
   * - **subject**
     - | **Type:** int
       | **Required:** No
       | **Description:** Subject ID for subject-conditioned prediction.
       | Uses Algonauts-style IDs [1,2,3,5].
       | If omitted in ``get_encoding_model()``, pass subject to ``encode(..., subject=...)``.
       | **Valid Values:** 1, 2, 3, 5
       | **Example:** 1

Model-specific utility methods
------------------------------

``get_model_variants()``
~~~~~~~~~~~~~~~~~~~~~~~~

Retrieve available pretrained variants for this model without instantiating it.

.. list-table::
   :widths: 20 80
   :header-rows: 0

   * - **model_id**
     - | **Type:** ``str``
       | **Required:** Yes
       | **Description:** Unique identifier of the model to load.

.. code-block:: python

    variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

----

``generate_glass_brain_animation()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generates and saves an animated glass brain GIF from predicted responses.
Called directly on the loaded model instance.

.. list-table::
   :widths: 20 80
   :header-rows: 0

   * - **responses**
     - | **Type:** ``torch.Tensor``
       | **Required:** Yes
       | **Description:** Model predictions generated by ``encode()``.
   * - **out_path**
     - | **Type:** ``str``
       | **Required:** No
       | **Default:** brain_activation.gif
       | **Description:** Path for the generated GIF.

.. code-block:: python

    model.generate_glass_brain_animation(responses, out_path="activation.gif")

Performance
-----------

**Metrics:**

* **Mean parcel-wise Pearson correlation**: in-distribution (Friends S07): 0.3193; out-of-distribution (6 films): 0.2122
* **Model variants**: Available in Hugging Face Collection: ShreyDixit/vibe

Example Usage
-------------

.. code-block:: python

    import numpy as np

    from berg import BERG

    # Initialize BERG
    berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

    # Discover all model variants
    variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

    # Example parcel mask: a length-1000 binary vector with at least one 1
    # (the chosen positions are illustrative only)
    parcel_index = np.zeros(1000, dtype=int)
    parcel_index[[997, 998]] = 1

    # Load the model
    model = berg.get_encoding_model(
        "fmri-cneuromod_algo2025-vibe",
        subject=1,
        device="auto",
        model_variant="ShreyDixit/VIBE-Qwen2.5-14B",
        low_mem_use=True,
        selection={
            "roi": ["Vis"],
            "parcel_index": parcel_index,
        },
    )

    # Prepare stimulus: one transcript string per TR; the list length must
    # equal floor(video_duration / 1.49) for the video at video_path
    transcripts = ["Hello, are you", "awake? Yes,", "I just woke up."]
    video_path = "/path/to/movie.mp4"

    # Generate the in silico neural responses using the loaded encoding model
    responses = berg.encode(
        model,
        transcripts,
        video_path=video_path,
    )

    # The in silico fMRI responses will be a torch.Tensor of shape:
    # (num_timepoints, num_parcels)
    # where:
    # - num_timepoints: Number of predicted TRs.
    # - num_parcels: Number of parcels (up to 1000 Schaefer parcels, or selected subset).

    # Generate in silico neural responses with metadata
    responses, metadata = berg.encode(
        model,
        transcripts,
        video_path=video_path,
        return_metadata=True,
    )

    # Load the encoding model's metadata without having to load the model itself
    metadata = berg.get_model_metadata(
        "fmri-cneuromod_algo2025-vibe",
    )

    # Generate a GIF from the responses
    gif_path = model.generate_glass_brain_animation(
        responses=responses,
        out_path="brain_activation.gif",
    )

References
----------

* Schad, Daniel Carlström; Dixit, Shrey; Keck, Janis; Studenyak, Viktor; Shpilevoi, Aleksandr; Bicanski, Andrej. VIBE: Video-Input Brain Encoder for fMRI Response Modeling. arXiv:2507.17958 (2025).
* Algonauts 2025 challenge dataset: https://github.com/courtois-neuromod/algonauts_2025.competitors