utah_array-tvsd-vit_b_32

Model Summary

Modality: Utah arrays

Training Dataset: THINGS Ventral Stream Spiking Dataset (TVSD)

Species: Macaque

Stimuli: Images

Model Type: Vision transformer (ViT-B/32)

Creator: Domenic Bersch

Description

This encoding model is a linear mapping, fitted by linear regression, from the image features of a vision transformer (Dosovitskiy et al., 2020) onto intracortical spiking activity. Features are extracted from all 12 transformer layers of ViT-B/32, using all 50 tokens per layer. Before mapping onto neural responses, the image features are reduced to 250 principal components using principal component analysis (PCA). The encoding models were trained on the THINGS Ventral Stream Spiking Dataset (TVSD) (Papale et al., Neuron 2025), which consists of simultaneous intracortical recordings from 1,024 electrodes across macaque ventral stream areas (V1, V4, IT) in response to natural images from the THINGS database (Hebart et al., 2019). The encoding models are trained either on the full training data or on four independent random splits of the training data.
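The pipeline described above (features, then PCA, then linear regression) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the released training code: the arrays are random stand-ins for ViT-B/32 features and neural responses, the sizes are shrunk for speed (the real models use 250 components), and the use of an intercept term is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: the real features come from ViT-B/32 and are far larger.
n_train, n_test, n_features, n_components, n_targets = 200, 20, 512, 25, 8
X_train = rng.standard_normal((n_train, n_features))
X_test = rng.standard_normal((n_test, n_features))
Y_train = rng.standard_normal((n_train, n_targets))  # e.g. electrode responses

# PCA fitted on the training features only: center, then SVD.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:n_components]               # (n_components, n_features)
Z_train = (X_train - mean) @ components.T    # (n_train, n_components)
Z_test = (X_test - mean) @ components.T      # (n_test, n_components)

# Ordinary least squares with an intercept column (assumed here).
A_train = np.column_stack([Z_train, np.ones(n_train)])
W, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)

# Predict responses for unseen images.
A_test = np.column_stack([Z_test, np.ones(n_test)])
Y_pred = A_test @ W                          # (n_test, n_targets)
```

Fitting the PCA on training features only, and reusing its mean and components for test images, keeps the test partition out of the feature transform.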

Neural data. Encoding models were trained on the preprocessed data provided with the TVSD. Raw broadband signals (30 kHz) were band-pass filtered to extract high-frequency spiking activity, and multi-unit activity (MUA) was obtained using threshold-based spike detection and smoothing, following the official TVSD pipeline. Responses were baseline-corrected and normalized per session, with area-specific time windows aligned to peak latencies (V1: 25–125 ms, V4: 50–150 ms, IT: 75–175 ms). The data were epoched from -100 ms to +199 ms relative to stimulus onset, resulting in 300 time points. More detailed preprocessing steps are described in the TVSD paper.
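To make the time axis concrete, the sketch below builds the 300-sample axis (-100 ms to +199 ms) and indexes the area-specific windows listed above. The 1 ms sample spacing follows from 300 samples spanning that range; the inclusive window edges and the pre-stimulus mean used as a baseline are assumptions for illustration, and the TVSD paper remains the reference for the actual procedure.

```python
import numpy as np

# 300 samples from -100 ms to +199 ms, one sample per millisecond.
times = np.arange(-100, 200)

# Area-specific windows from the text, in ms.
windows_ms = {"V1": (25, 125), "V4": (50, 150), "IT": (75, 175)}

def window_mask(times, start_ms, stop_ms):
    """Boolean mask for samples with start_ms <= t <= stop_ms (inclusive edges assumed)."""
    return (times >= start_ms) & (times <= stop_ms)

# Toy response of shape (n_electrodes, n_timepoints).
rng = np.random.default_rng(0)
response = rng.standard_normal((4, times.size))

# Illustrative baseline correction: subtract the mean pre-stimulus activity.
baseline = response[:, times < 0].mean(axis=1, keepdims=True)
corrected = response - baseline

# Mean response of each electrode inside the V1 window.
v1_mean = corrected[:, window_mask(times, *windows_ms["V1"])].mean(axis=1)
```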

Model training partition. Single-trial spiking responses to 22,248 unique images from the THINGS database, each presented once during passive fixation, were used for training. One set of encoding models is trained on the full training data. Another set is trained on four independent random splits of the training data (5,562 trials each), yielding four different in silico spiking response predictions (i.e., repetitions) per image. A unique PCA random seed is derived for each combination of monkey and training split, ensuring independent PCA bases across encoding models.
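The split sizes are consistent with partitioning the 22,248 training trials into four non-overlapping parts (4 × 5,562 = 22,248). A minimal sketch of such a partition follows. The function names, the seed values, and the seed-derivation rule are hypothetical; the source states only that a unique seed exists per monkey and split, not how it is derived.

```python
import numpy as np

n_images, n_splits = 22248, 4

def make_splits(n_images, n_splits, seed):
    """Shuffle trial indices and cut them into equally sized, disjoint splits."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_images), n_splits)

def pca_seed(monkey, split, base_seed=0):
    """Hypothetical rule giving a distinct seed per (monkey, split) pair."""
    monkey_index = {"N": 0, "F": 1}[monkey]
    return base_seed + 10 * monkey_index + split

splits = make_splits(n_images, n_splits, seed=0)
seeds = {(m, s): pca_seed(m, s) for m in ("N", "F") for s in range(1, n_splits + 1)}
```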

Model testing partition. Spiking responses to 100 unique images, each repeated 30 times.

Training procedure. Independent encoding models were trained for each monkey (monkeyN and monkeyF).

Noise ceiling. The noise ceiling was computed from the 30 repeated presentations of each test image, following the analytical procedure described in the Natural Scenes Dataset (NSD) paper (Allen et al., 2022).
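For orientation, the analytical noise ceiling from the NSD paper can be sketched as follows for a single electrode and time point. Responses are z-scored, the noise variance is the trial-to-trial variance averaged over images, the signal variance is the remainder, and the ceiling is 100 · ncsnr² / (ncsnr² + 1/n), with n the number of trials averaged. The function name and the choice of n = 30 are assumptions here, not details confirmed by this model card.

```python
import numpy as np

def nsd_noise_ceiling(responses, n_avg=None):
    """responses: (n_images, n_repeats) for one electrode and time point.

    Returns (ncsnr, noise_ceiling_percent).
    """
    n_images, n_repeats = responses.shape
    # z-score over all trials so that the total variance is 1.
    z = (responses - responses.mean()) / responses.std()
    # Noise variance: variance across repeats, averaged over images.
    noise_var = z.var(axis=1, ddof=1).mean()
    # Signal variance: what remains, clipped at zero.
    signal_var = max(1.0 - noise_var, 0.0)
    ncsnr = np.sqrt(signal_var) / np.sqrt(noise_var)
    # n is the number of trials averaged (assumed: all repeats).
    n = n_repeats if n_avg is None else n_avg
    noise_ceiling = 100.0 * ncsnr**2 / (ncsnr**2 + 1.0 / n)
    return ncsnr, noise_ceiling

# Toy data: 100 images x 30 repeats, equal signal and noise variance.
rng = np.random.default_rng(0)
signal = rng.standard_normal((100, 1))
toy = signal + rng.standard_normal((100, 30))
ncsnr, nc = nsd_noise_ceiling(toy)
```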

Output. Each encoding model predicts time-resolved spiking responses for all 1,024 electrodes (or a user-specified subset) across 300 time points for each input image.

Metadata

utah_array

times : (300,) - Time points (-100ms to 199ms)

electrode_order : (1024,) - Electrode mapping order (0-based)

monkey_id : str - Monkey identifier

n_electrodes : int - Number of electrodes (1024)

roi

roi_assignments : (1024,) - ROI assignment per electrode (0=V1, 1=V4, 2=IT)

roi_labels : (3,) - ROI label names ['V1', 'V4', 'IT']
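The two ROI fields can be combined to find the electrodes belonging to one area. The sketch below uses a synthetic roi_assignments array built from the Monkey F electrode counts given in this card; the contiguous ordering of areas is an assumption made only to construct toy data, and electrodes_in_roi is a hypothetical helper.

```python
import numpy as np

# Toy metadata mirroring the documented fields (Monkey F counts, order assumed).
roi_labels = ["V1", "V4", "IT"]
roi_assignments = np.repeat([0, 1, 2], [512, 192, 320])  # shape (1024,)

def electrodes_in_roi(roi_assignments, roi_labels, roi):
    """Return the indices of electrodes assigned to the named ROI."""
    return np.flatnonzero(roi_assignments == roi_labels.index(roi))

it_idx = electrodes_in_roi(roi_assignments, roi_labels, "IT")
counts = {label: int((roi_assignments == i).sum()) for i, label in enumerate(roi_labels)}
```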

encoding_model

all_training_splits: Training data and encoding accuracy results for encoding models trained on all training splits

train_img_ids : (22248,) - Training stimulus IDs

train_stimuli : (22248,) - Training image filenames

train_concepts : (22248,) - Training object categories

train_days : (22248,) - Recording days for training

train_sequence_pos : (22248,) - Position in 4-image sequence

correlation_results : (1024, 300) - Prediction accuracy (Pearson’s r)

percent_noise_ceiling : (1024, 300) - Noise ceiling normalized prediction accuracy (% of noise ceiling)

single_training_split_{N}: Training data and encoding accuracy results for encoding models trained on training split N (N=1,2,3,4)

train_img_ids : (5562,) - Training stimulus IDs

train_stimuli : (5562,) - Training image filenames

train_concepts : (5562,) - Training object categories

train_days : (5562,) - Recording days for training

train_sequence_pos : (5562,) - Position in 4-image sequence

correlation_results : (1024, 300) - Prediction accuracy (Pearson’s r)

percent_noise_ceiling : (1024, 300) - Noise ceiling normalized prediction accuracy (% of noise ceiling)

test_img_ids : (3000,) - Test stimulus IDs (individual trials)

test_stimuli : (3000,) - Test image filenames (individual)

test_concepts : (3000,) - Test object categories (individual)

test_days : (3000,) - Recording days for test

test_sequence_pos : (3000,) - Position in sequence for test

SNR : (4, 1024) - Signal-to-noise ratio per day per electrode

SNR_max : (1024,) - Best SNR across all days per electrode

ncsnr : (1024, 300) - Noise ceiling signal-to-noise ratio per electrode/timepoint

noise_ceiling : (1024, 300) - Noise ceiling per electrode/timepoint

Input

Type

numpy.ndarray

Shape

[batch_size, 3, height, width]

Description

The input should be a batch of RGB images.

Constraints

  • Image values should be integers in range [0, 255].

  • Image dimensions (height, width) should be equal (square).

  • Minimum recommended image size: 224×224 pixels.
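The constraints above can be checked before calling the model. The helper below is a hypothetical convenience, not part of BERG: it validates the layout, integer type, value range, and square shape, and the example converts a channels-last batch into the expected channels-first layout.

```python
import numpy as np

def check_stimulus(stimulus):
    """Validate a batch against the documented input constraints."""
    stimulus = np.asarray(stimulus)
    if stimulus.ndim != 4 or stimulus.shape[1] != 3:
        raise ValueError("expected shape [batch_size, 3, height, width]")
    if not np.issubdtype(stimulus.dtype, np.integer):
        raise ValueError("expected integer pixel values")
    if stimulus.min() < 0 or stimulus.max() > 255:
        raise ValueError("pixel values must lie in [0, 255]")
    if stimulus.shape[2] != stimulus.shape[3]:
        raise ValueError("images must be square")
    return stimulus

# Images are often stored channels-last; move the channel axis to position 1.
rng = np.random.default_rng(0)
images_hwc = rng.integers(0, 256, size=(2, 224, 224, 3), dtype=np.uint8)
stimulus = check_stimulus(images_hwc.transpose(0, 3, 1, 2))
```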

Output

Type

numpy.ndarray

Shape

[batch_size, n_electrodes, n_timepoints] or [batch_size, repeats, n_electrodes, n_timepoints]

Description

The output is a 3D or 4D array containing in silico Utah array responses.
The shape depends on the train_splits parameter:
- When train_splits="all": shape is [batch_size, n_electrodes, n_timepoints]
- When train_splits="single": shape is [batch_size, repeats, n_electrodes, n_timepoints]

The n_electrodes dimension corresponds to the number of electrodes in the selection,
which varies by ROI and monkey.
The last dimension corresponds to the timepoints (300 when no timepoint selection is applied).

Monkey N electrode count:
- V1: 448
- V4: 256
- IT: 256

Monkey F electrode count:
- V1: 512
- V4: 192
- IT: 320

Dimensions

batch_size: Number of stimuli in the batch
repeats: Number of simulated repetitions of the same stimulus (always 4; only applies when using the encoding models trained on single training data splits)
n_electrodes: Number of electrodes in the selection
n_timepoints: Number of timepoints in the selection (300 when all timepoints are included)
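Downstream code may need to handle both output layouts. The sketch below shows one way to do so, with a hypothetical helper that averages over the repeats axis when it is present. The arrays are zero-filled stand-ins with the documented shapes, not real model output.

```python
import numpy as np

def average_repeats(responses):
    """Return [batch_size, n_electrodes, n_timepoints] for either output layout."""
    if responses.ndim == 4:
        # train_splits="single": average over the repeats axis (axis 1).
        return responses.mean(axis=1)
    if responses.ndim == 3:
        # train_splits="all": already in the target layout.
        return responses
    raise ValueError("expected a 3D or 4D response array")

# Stand-ins with the documented shapes (256 electrodes, 300 timepoints).
single_out = np.zeros((2, 4, 256, 300))
all_out = np.zeros((2, 256, 300))

mean_single = average_repeats(single_out)
mean_all = average_repeats(all_out)
```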

Parameters

Parameters used in get_encoding_model

This function loads the encoding model.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: utah_array-tvsd-vit_b_32
Example: “utah_array-tvsd-vit_b_32”

subject

Type: str
Required: Yes
Description: Monkey ID
Valid Values: “N”, “F”
Example: “N”

train_splits

Type: str
Required: No
Description: Specifies the training data split on which the encoding model is trained.
- “all”: Use an encoding model trained on all training data splits.
- “single”: Use encoding models trained on four independent training data random splits, therefore generating four different in silico spiking response predictions (i.e., repetitions) per image.
Valid Values: “all”, “single”
Example: “single”

selection

Type: dict
Required: No
Description: Specifies which outputs to include in the model responses.
Can restrict the output to specific ROIs, electrodes, and/or timepoints. If not provided,
Utah array responses are generated for all electrodes and time points.

Properties:

roi
Type: list[str]
Description: List of ROIs to include in the output
Valid values: “V1”, “V4”, “IT”
Example: [‘V1’, ‘IT’]

electrodes
Type: numpy.ndarray
Description: Binary vector indicating which electrodes to include.
Must have exactly the same length as the number of available electrodes (1024).
Each position set to 1 indicates that electrode should be included.
Example: [0, 0, ‘…’, 1, 1, 0]

timepoints
Type: numpy.ndarray
Description: Binary vector indicating which timepoints to include.
Must have exactly the same length as the number of available timepoints (300).
Each position set to 1 indicates that timepoint should be included.
Example: [0, 0, ‘…’, 1, 1, 0]
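A minimal sketch of building full-length binary vectors for the electrodes and timepoints properties follows. The particular choices (the first 64 electrode positions, and the 0 to 150 ms window) are arbitrary illustrations, and the time axis assumes the 300 samples run from -100 ms to +199 ms at 1 ms spacing as described in the metadata.

```python
import numpy as np

n_electrodes, n_timepoints = 1024, 300
times = np.arange(-100, 200)  # ms, matches the documented time axis

# Electrode vector: 1 marks an electrode to include (arbitrary example choice).
electrode_mask = np.zeros(n_electrodes, dtype=int)
electrode_mask[:64] = 1

# Timepoint vector: keep samples between 0 ms and 150 ms inclusive.
timepoint_mask = ((times >= 0) & (times <= 150)).astype(int)

selection = {
    "electrodes": electrode_mask,
    "timepoints": timepoint_mask,
}
```

The resulting dictionary has the form expected by the selection parameter; with these vectors the output would cover 64 electrodes and 151 timepoints.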

device

Type: str
Required: No
Description: Device to run the model on. ‘auto’ will use CUDA if available, otherwise CPU.
Valid Values: “cpu”, “cuda”, “auto”
Example: “auto”

Parameters used in encode

This function generates in silico neural responses using the encoding model previously loaded.

model

Type: BaseModelInterface
Required: Yes
Description: An instantiated and loaded encoding model.

stimulus

Type: numpy.ndarray
Required: Yes
Description: A batch of RGB images to be encoded. Images should be in integer format with values in the range [0, 255], and square dimensions (e.g. 224×224).
Example: “An array of shape [100, 3, 224, 224] representing 100 RGB images.”

return_metadata

Type: bool
Required: No
Description: Whether to return the encoding model’s metadata together with the in silico neural responses.
Example: True

show_progress

Type: bool
Required: No
Description: Whether to show a progress bar during encoding (for large batches).
Example: True

Parameters used in get_model_metadata

This function loads the encoding model’s metadata without having to load the model itself.

model_id

Type: str
Required: Yes
Description: Unique identifier of the model to load.
Valid Values: utah_array-tvsd-vit_b_32
Example: “utah_array-tvsd-vit_b_32”

subject

Type: str
Required: Yes
Description: Monkey ID
Valid Values: “N”, “F”
Example: “N”

Performance

Accuracy Plots (AWS directory):

  • brain-encoding-response-generator/encoding_models/modality-utah_array/train_dataset-tvsd/model-vit_b_32/encoding_models_accuracy

Example Usage

import numpy as np
from berg import BERG

# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Load the model
model = berg.get_encoding_model(
    "utah_array-tvsd-vit_b_32",
    subject="N",
    train_splits="single",
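    selection={
        "roi": ["V1", "IT"],
        # '...' is a placeholder: pass full-length binary vectors
        # (1024 entries for electrodes, 300 for timepoints)
        "electrodes": [0, 0, '...', 1, 1, 0],
        "timepoints": [0, 0, '...', 1, 1, 0]
    }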
)

# Prepare the stimulus images
# Image shape should be [batch_size, 3 RGB channels, height, width]
# The upper bound of np.random.randint is exclusive, so 256 yields values in [0, 255]
stimulus = np.random.randint(0, 256, (100, 3, 256, 256))

# Generates the in silico neural responses using the encoding model previously loaded
responses = berg.encode(
    model,
    stimulus,
    show_progress=True
)

# The in silico Utah array responses will be a numpy.ndarray of shape:
# [batch_size, n_electrodes, n_timepoints] or [batch_size, repeats, n_electrodes, n_timepoints]
# where:
# - repeats: Number of simulated repetitions of the same stimulus (always 4; only applies when using the encoding models trained on single training data splits)
# - n_electrodes: Number of electrodes in the selection
# - n_timepoints: Number of timepoints in the selection

# Generate in silico neural responses with metadata
responses, metadata = berg.encode(
    model,
    stimulus,
    return_metadata=True
)

# Load the encoding model's metadata without having to load the model itself
metadata = berg.get_model_metadata(
    "utah_array-tvsd-vit_b_32",
    subject="N"
)

References