utah_array-tvsd-vit_b_32

Model Summary

Modality	Utah arrays
Training Dataset	THINGS Ventral Stream Spiking Dataset (TVSD)
Species	Macaque
Stimuli	Images
Model Type	Vision transformer (ViT-B/32)
Creator	Domenic Bersch

Description

This encoding model uses linear regression to map image features from a vision transformer (Dosovitskiy et al., 2020) onto intracortical spiking activity. Features are extracted from all 12 transformer layers of the ViT-B/32 model, using all 50 tokens per layer (the class token plus 49 patch tokens). Before being mapped onto neural responses, the image features are reduced to 250 principal components using principal component analysis (PCA). The encoding models were trained on the THINGS Ventral Stream Spiking Dataset (TVSD) (Papale et al., Neuron 2025), which consists of simultaneous intracortical recordings from 1,024 electrodes across macaque ventral stream areas (V1, V4, IT) in response to natural images from the THINGS database (Hebart et al., 2019). The encoding models are trained either on the full training data or on four independent random splits of the training data.
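The pipeline described above (flattened ViT features, PCA to 250 components, linear regression onto neural responses) can be sketched as follows. This is a minimal NumPy illustration on random data, not the BERG training code: the array sizes are shrunk for speed, plain ordinary least squares is assumed, and the helper names `fit_pca`, `fit_linear`, and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed sizes): 400 images, 1,000 flattened ViT features,
# 16 electrodes x 30 time points flattened into 480 regression targets.
n_images, n_features, n_targets, n_components = 400, 1000, 480, 250
features = rng.standard_normal((n_images, n_features))
responses = rng.standard_normal((n_images, n_targets))

def fit_pca(x, k):
    """Return the feature mean and the top-k principal axes of x."""
    mean = x.mean(axis=0)
    # Rows of vt are principal axes, sorted by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]

def fit_linear(x, y):
    """Ordinary least squares with an intercept column."""
    x1 = np.hstack([x, np.ones((x.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(x1, y, rcond=None)
    return weights

def predict(x, mean, axes, weights):
    """Project features onto the PCA axes, then apply the linear map."""
    z = (x - mean) @ axes.T
    return np.hstack([z, np.ones((z.shape[0], 1))]) @ weights

mean, axes = fit_pca(features, n_components)
weights = fit_linear((features - mean) @ axes.T, responses)
predicted = predict(features[:5], mean, axes, weights)
print(predicted.shape)  # (5, 480)
```

In the real models the targets would cover 1,024 electrodes and 300 time points per monkey, and the PCA would be fit on the training images only.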

Neural data. Encoding models were trained on the preprocessed data provided with the TVSD. Raw broadband signals (30 kHz) were band-pass filtered to extract high-frequency spiking activity, and multi-unit activity (MUA) was obtained using threshold-based spike detection and smoothing, following the official TVSD pipeline. Responses were baseline-corrected and normalized per session, with area-specific time windows aligned to peak latencies (V1: 25–125 ms, V4: 50–150 ms, IT: 75–175 ms). The data were epoched from -100 ms to +199 ms relative to stimulus onset, resulting in 300 time points. The preprocessing steps are described in more detail in the TVSD paper.
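As a small illustration of the epoch and the area-specific windows mentioned above, the sketch below builds the 300-sample time axis and a boolean mask per area. The 1 ms sample spacing follows from the stated range and count; treating each window as half-open (start included, end excluded) is an assumption, and `window_mask` is a hypothetical helper.

```python
import numpy as np

# 300 samples from -100 ms to +199 ms relative to stimulus onset.
times_ms = np.arange(-100, 200)

# Area-specific response windows from the text (ms after stimulus onset).
windows_ms = {"V1": (25, 125), "V4": (50, 150), "IT": (75, 175)}

def window_mask(times, start, stop):
    """Boolean mask of samples with start <= t < stop (half-open by assumption)."""
    return (times >= start) & (times < stop)

masks = {roi: window_mask(times_ms, *w) for roi, w in windows_ms.items()}
for roi, mask in masks.items():
    # Report the window length and the index of its first sample.
    print(roi, int(mask.sum()), "samples, first index", int(np.argmax(mask)))
```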

Model training partition. Single-trial spiking responses to 22,248 unique images from the THINGS database, each presented once during passive fixation, were used for training. One set of encoding models is trained on the full training data. Another set is trained on four independent random splits of the training data (5,562 trials each), therefore generating four different in silico spiking response predictions (i.e., repetitions) per image. A unique PCA random seed is derived for each combination of monkey and training split, ensuring independent PCA bases across encoding models.
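The partitioning arithmetic above can be checked with a short sketch: 22,248 trials divide evenly into four splits of 5,562. The shuffling seed and the `split_seed` rule below are hypothetical; the text only states that each monkey and split combination gets its own PCA seed, not how that seed is derived.

```python
import numpy as np

n_train, n_splits = 22_248, 4

def split_seed(monkey, split):
    """Hypothetical seed rule: a distinct integer per (monkey, split) pair."""
    return {"N": 0, "F": 1}[monkey] * 100 + split

# One shuffled partition of the training trials into four disjoint splits.
order = np.random.default_rng(seed=0).permutation(n_train)
splits = np.array_split(order, n_splits)
print([len(s) for s in splits])  # [5562, 5562, 5562, 5562]

# Each (monkey, split) model gets its own seed, e.g. for a randomized PCA solver.
seeds = {(m, s): split_seed(m, s) for m in ("N", "F") for s in range(1, n_splits + 1)}
```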

Model testing partition. Spiking responses to 100 unique images, each repeated 30 times.
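Prediction accuracy on this partition is reported as Pearson's r per electrode and time point (see `correlation_results` in the metadata). The sketch below shows one way such a correlation could be computed with NumPy on synthetic data. Averaging the 30 repeats before correlating is an assumption, and `pearson_r` is a hypothetical helper, not a BERG function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy test set: 100 images x 30 repeats x 8 electrodes x 5 time points
# (the real data would have 1,024 electrodes and 300 time points).
recorded = rng.standard_normal((100, 30, 8, 5))
predicted = recorded.mean(axis=1) + 0.5 * rng.standard_normal((100, 8, 5))

def pearson_r(a, b, axis=0):
    """Pearson correlation between a and b along the given axis."""
    a = a - a.mean(axis=axis, keepdims=True)
    b = b - b.mean(axis=axis, keepdims=True)
    num = (a * b).sum(axis=axis)
    den = np.sqrt((a ** 2).sum(axis=axis) * (b ** 2).sum(axis=axis))
    return num / den

# Average over repeats (assumption), then correlate across the 100 images.
accuracy = pearson_r(predicted, recorded.mean(axis=1))
print(accuracy.shape)  # (8, 5)
```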

Training procedure. Independent encoding models were trained for each monkey (monkeyN and monkeyF).

Noise ceiling. The noise ceiling was computed from the 30 repeated presentations of each test image, following the analytical procedure described in the Natural Scenes Dataset (NSD) paper (Allen et al., 2022).
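A rough sketch of an NSD-style noise ceiling is given below, under stated assumptions: responses are z-scored per channel, noise variance is the across-repeat variance averaged over images, signal variance is the remainder (clipped at zero), and the ceiling is expressed for responses averaged over all repeats. The function name `noise_ceiling` and the synthetic data are illustrative; the exact BERG computation may differ.

```python
import numpy as np

def noise_ceiling(trials):
    """NSD-style noise ceiling (in % of explainable variance).

    trials: array of shape [n_images, n_repeats, ...].
    """
    n_repeats = trials.shape[1]
    # z-score each channel across all trials so total variance is 1.
    flat = trials.reshape(-1, *trials.shape[2:])
    z = (trials - flat.mean(axis=0)) / flat.std(axis=0)
    # Noise variance: across-repeat variance, averaged over images.
    noise_var = z.var(axis=1, ddof=1).mean(axis=0)
    # Signal variance is the remainder, clipped at zero.
    signal_var = np.clip(1.0 - noise_var, 0.0, None)
    ncsnr = np.sqrt(signal_var) / np.sqrt(noise_var)
    # Ceiling for responses averaged over n_repeats trials.
    nc = 100.0 * ncsnr**2 / (ncsnr**2 + 1.0 / n_repeats)
    return ncsnr, nc

rng = np.random.default_rng(0)
signal = rng.standard_normal((100, 1, 6))   # image-specific signal
noise = rng.standard_normal((100, 30, 6))   # trial-specific noise
ncsnr, nc = noise_ceiling(signal + noise)
print(ncsnr.shape, nc.shape)  # (6,) (6,)
```

With equal signal and noise variance, as in this toy example, the ncsnr is close to 1 and averaging 30 repeats pushes the ceiling close to 100%.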

Output. Each encoding model predicts time-resolved spiking responses for all 1,024 electrodes (or a user-specified subset) across 300 time points for each input image.

Metadata

utah_array

times : (300,) - Time points (-100 ms to 199 ms)

electrode_order : (1024,) - Electrode mapping order (0-based)

monkey_id : str - Monkey identifier

n_electrodes : int - Number of electrodes (1024)

roi

roi_assignments : (1024,) - ROI assignment per electrode (0=V1, 1=V4, 2=IT)

roi_labels : (3,) - ROI label names ['V1', 'V4', 'IT']

encoding_model

all_training_splits: Training data and encoding accuracy results for encoding models trained on all training splits

train_img_ids : (22248,) - Training stimulus IDs

train_stimuli : (22248,) - Training image filenames

train_concepts : (22248,) - Training object categories

train_days : (22248,) - Recording days for training

train_sequence_pos : (22248,) - Position in 4-image sequence

correlation_results : (1024, 300) - Prediction accuracy (Pearson’s r)

percent_noise_ceiling : (1024, 300) - Noise-ceiling-normalized prediction accuracy (% of noise ceiling)

single_training_split_{N}: Training data and encoding accuracy results for encoding models trained on training split N (N=1,2,3,4)

train_img_ids : (5562,) - Training stimulus IDs

train_stimuli : (5562,) - Training image filenames

train_concepts : (5562,) - Training object categories

train_days : (5562,) - Recording days for training

train_sequence_pos : (5562,) - Position in 4-image sequence

correlation_results : (1024, 300) - Prediction accuracy (Pearson’s r)

percent_noise_ceiling : (1024, 300) - Noise-ceiling-normalized prediction accuracy (% of noise ceiling)

test_img_ids : (3000,) - Test stimulus IDs (individual trials)

test_stimuli : (3000,) - Test image filenames (individual)

test_concepts : (3000,) - Test object categories (individual)

test_days : (3000,) - Recording days for test

test_sequence_pos : (3000,) - Position in sequence for test

SNR : (4, 1024) - Signal-to-noise ratio per day per electrode

SNR_max : (1024,) - Best SNR across all days per electrode

ncsnr : (1024, 300) - Neural cross-validated signal-to-noise ratio per electrode/timepoint

noise_ceiling : (1024, 300) - Noise ceiling per electrode/timepoint
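The ROI fields can be used to group electrodes by area. The sketch below works on a synthetic stand-in for the metadata; the nested-dict layout (keys mirroring the groups listed above) is an assumption about how the returned object is organized, and the contiguous electrode ordering is purely illustrative. The counts follow the monkey F figures given in the Output section.

```python
import numpy as np

# Synthetic stand-in for the metadata (assumed layout: a nested dict whose
# keys mirror the groups listed above).
metadata = {
    "roi": {
        "roi_assignments": np.repeat([0, 1, 2], [512, 192, 320]),
        "roi_labels": np.array(["V1", "V4", "IT"]),
    },
    "utah_array": {"times": np.arange(-100, 200), "n_electrodes": 1024},
}

roi = metadata["roi"]
# Count electrodes per ROI and collect the indices belonging to each area.
counts = np.bincount(roi["roi_assignments"], minlength=len(roi["roi_labels"]))
indices = {
    str(label): np.flatnonzero(roi["roi_assignments"] == code)
    for code, label in enumerate(roi["roi_labels"])
}
print(dict(zip(roi["roi_labels"].tolist(), counts.tolist())))
# {'V1': 512, 'V4': 192, 'IT': 320}
```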

Input

Type	`numpy.ndarray`
Shape	`[batch_size, 3, height, width]`
Description	The input should be a batch of RGB images.
Constraints	Image values should be integers in range [0, 255]. Image dimensions (height, width) should be equal (square). Minimum recommended image size: 224×224 pixels.
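Images are often loaded in height × width × channel order, so they need to be rearranged to match the shape above. The following sketch center-crops each image to a square and stacks the batch in `[batch_size, 3, height, width]` layout. `to_model_input` is a hypothetical helper, not part of BERG, and the use of `uint8` as the integer type is an assumption consistent with the [0, 255] range.

```python
import numpy as np

def to_model_input(images):
    """Stack HWC uint8 images into a [batch, 3, height, width] array.

    Each image is center-cropped to a square first. All images are assumed
    to share the same size after cropping.
    """
    batch = []
    for img in images:
        if img.ndim != 3 or img.shape[2] != 3:
            raise ValueError("expected an RGB image of shape [height, width, 3]")
        h, w = img.shape[:2]
        side = min(h, w)
        top, left = (h - side) // 2, (w - side) // 2
        square = img[top:top + side, left:left + side]
        batch.append(np.transpose(square, (2, 0, 1)))  # HWC -> CHW
    return np.stack(batch).astype(np.uint8)

rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(4)]
stimulus = to_model_input(raw)
print(stimulus.shape, stimulus.dtype)  # (4, 3, 240, 240) uint8
```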

Output

Type	`numpy.ndarray`
Shape	`[batch_size, n_electrodes, n_timepoints]` or `[batch_size, repeats, n_electrodes, n_timepoints]`
Description	The output is a 3D or 4D array containing in silico Utah array responses. The number of dimensions depends on the train_splits parameter:
	- When train_splits="all": shape is [batch_size, n_electrodes, n_timepoints]
	- When train_splits="single": shape is [batch_size, repeats, n_electrodes, n_timepoints]
	The n_electrodes dimension corresponds to the number of electrodes in the selected ROI, which varies by ROI and monkey. The last dimension corresponds to the timepoints (300).
	Monkey N electrode count:
	- V1: 448
	- V4: 256
	- IT: 256
	Monkey F electrode count:
	- V1: 512
	- V4: 192
	- IT: 320
Dimensions	batch_size: Number of stimuli in the batch
	repeats: Number of simulated repetitions of the same stimulus (always 4; only applies when using the encoding models trained on single training data splits)
	n_electrodes: Number of electrodes in the selection
	n_timepoints: Number of timepoints in the selection
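Because the output can be 3D or 4D, downstream code may want to handle both layouts uniformly. A minimal sketch on synthetic arrays follows; `average_repeats` is a hypothetical helper, and averaging the four repeats is just one possible way to collapse them.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, repeats, n_electrodes, n_timepoints = 10, 4, 256, 300

# Synthetic stand-ins for the two output layouts.
out_all = rng.standard_normal((batch_size, n_electrodes, n_timepoints))
out_single = rng.standard_normal((batch_size, repeats, n_electrodes, n_timepoints))

def average_repeats(responses):
    """Return a 3D array, averaging over the repeats axis when present."""
    if responses.ndim == 4:
        return responses.mean(axis=1)
    if responses.ndim == 3:
        return responses
    raise ValueError(f"unexpected number of dimensions: {responses.ndim}")

print(average_repeats(out_all).shape)     # (10, 256, 300)
print(average_repeats(out_single).shape)  # (10, 256, 300)
```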

Parameters

Parameters used in `get_encoding_model`

This function loads the encoding model.

model_id	Type: str
	Required: Yes
	Description: Unique identifier of the model to load.
	Valid Values: "utah_array-tvsd-vit_b_32"
	Example: "utah_array-tvsd-vit_b_32"
subject	Type: str
	Required: Yes
	Description: Monkey ID.
	Valid Values: "N", "F"
	Example: "N"
train_splits	Type: str
	Required: No
	Description: Specifies the training data split on which the encoding model is trained.
	- "all": Use an encoding model trained on all training data splits.
	- "single": Use encoding models trained on four independent random splits of the training data, therefore generating four different in silico spiking response predictions (i.e., repetitions) per image.
	Valid Values: "all", "single"
	Example: "single"
selection	Type: dict
	Required: No
	Description: Specifies which outputs to include in the model responses. Can include specific ROIs, electrodes, and/or timepoints. If not provided, Utah array responses are generated for all electrodes and timepoints.
	Properties:
	- roi
	  Type: list[str]
	  Description: List of ROIs to include in the output.
	  Valid Values: "V1", "V4", "IT"
	  Example: ['V1', 'IT']
	- electrodes
	  Type: numpy.ndarray
	  Description: Binary vector indicating which electrodes to include. Must have exactly the same length as the number of available electrodes (1024). Each position set to 1 indicates that the corresponding electrode should be included.
	  Example: [0, 0, '…', 1, 1, 0]
	- timepoints
	  Type: numpy.ndarray
	  Description: Binary vector indicating which timepoints to include. Must have exactly the same length as the number of available timepoints (300). Each position set to 1 indicates that the corresponding timepoint should be included.
	  Example: [0, 0, '…', 1, 1, 0]
device	Type: str
	Required: No
	Description: Device to run the model on. "auto" will use CUDA if available, otherwise CPU.
	Valid Values: "cpu", "cuda", "auto"
	Example: "auto"
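The `electrodes` and `timepoints` entries of `selection` are full-length binary vectors, which are easiest to build programmatically. The sketch below constructs both with NumPy. The chosen electrodes and the 75 to 175 ms window are arbitrary illustrations, the 1 ms time axis is assumed to match the `times` metadata field, and whether BERG expects an integer or boolean dtype is not confirmed here.

```python
import numpy as np

n_electrodes, n_timepoints = 1024, 300
times_ms = np.arange(-100, 200)  # assumed 1 ms spacing, matching `times`

# Binary vector over electrodes: here the first 64 entries (arbitrary choice).
electrodes = np.zeros(n_electrodes, dtype=int)
electrodes[:64] = 1

# Binary vector over timepoints: keep 75 ms to 175 ms after stimulus onset.
timepoints = ((times_ms >= 75) & (times_ms < 175)).astype(int)

selection = {"electrodes": electrodes, "timepoints": timepoints}
print(int(electrodes.sum()), int(timepoints.sum()))  # 64 100
```

The resulting dictionary has the form expected by the `selection` argument of `get_encoding_model`.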

Parameters used in `encode`

This function generates in silico neural responses using the encoding model previously loaded.

model	Type: BaseModelInterface
	Required: Yes
	Description: An instantiated and loaded encoding model.
stimulus	Type: numpy.ndarray
	Required: Yes
	Description: A batch of RGB images to be encoded. Images should be in integer format with values in the range [0, 255], and have square dimensions (e.g., 224×224).
	Example: An array of shape [100, 3, 224, 224] representing 100 RGB images.
return_metadata	Type: bool
	Required: No
	Description: Whether to return the encoding model's metadata together with the in silico neural responses.
	Example: True
show_progress	Type: bool
	Required: No
	Description: Whether to show a progress bar during encoding (for large batches).
	Example: True

Parameters used in `get_model_metadata`

This function loads the encoding model’s metadata without having to load the model itself.

model_id	Type: str
	Required: Yes
	Description: Unique identifier of the model whose metadata to load.
	Valid Values: "utah_array-tvsd-vit_b_32"
	Example: "utah_array-tvsd-vit_b_32"
subject	Type: str
	Required: Yes
	Description: Monkey ID.
	Valid Values: "N", "F"
	Example: "N"

Performance

Accuracy Plots (AWS directory):

brain-encoding-response-generator/encoding_models/modality-utah_array/train_dataset-tvsd/model-vit_b_32/encoding_models_accuracy

Example Usage

import numpy as np
from berg import BERG

# Initialize BERG
berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Load the model
model = berg.get_encoding_model(
    "utah_array-tvsd-vit_b_32",
    subject="N",
    train_splits="single",
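    selection={
        "roi": ["V1", "IT"],
        # Placeholder: binary vector of length 1024 (1 = include electrode)
        "electrodes": [0, 0, '...', 1, 1, 0],
        # Placeholder: binary vector of length 300 (1 = include timepoint)
        "timepoints": [0, 0, '...', 1, 1, 0]
    }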
)

# Prepare the stimulus images
# Image shape should be [batch_size, 3 RGB channels, height, width]
# np.random.randint excludes the upper bound, so 256 yields values in [0, 255]
stimulus = np.random.randint(0, 256, (100, 3, 256, 256))

# Generates the in silico neural responses using the encoding model previously loaded
responses = berg.encode(
    model,
    stimulus,
    show_progress=True
)

# The in silico spiking responses will be a numpy.ndarray of shape:
# [batch_size, n_electrodes, n_timepoints] (train_splits="all") or
# [batch_size, repeats, n_electrodes, n_timepoints] (train_splits="single")
# where:
# - repeats: Number of simulated repetitions of the same stimulus (always 4; only applies when using the encoding models trained on single training data splits)
# - n_electrodes: Number of electrodes in the selection
# - n_timepoints: Number of timepoints in the selection

# Generate in silico neural responses with metadata
responses, metadata = berg.encode(
    model,
    stimulus,
    return_metadata=True
)

# Load the encoding model's metadata without having to load the model itself
metadata = berg.get_model_metadata(
    "utah_array-tvsd-vit_b_32",
    subject="N"
)

References

Model building code: https://github.com/gifale95/BERG/tree/main/berg_creation_code/02_train_encoding_models/train_dataset-tvsd/model-vit_b_32
TVSD Paper (Papale et al., 2025): https://www.sciencedirect.com/science/article/pii/S089662732400881X
TVSD Data (Papale et al., 2025): https://gin.g-node.org/paolo_papale/TVSD
ViT-B/32 (Dosovitskiy et al., 2020): https://arxiv.org/abs/2010.11929