
Group-based Splitting for Benchmarks + Batch-layer Evaluation + GPU Support #356

Open
BKHMSI wants to merge 4 commits into brain-score:main from epflneuroailab:main

Conversation

@BKHMSI
Contributor

BKHMSI commented Feb 10, 2026

No description provided.

@KartikP
Contributor

KartikP commented Feb 10, 2026

Hi @BKHMSI did you intend to remove Pereira2018.243sentences-linear?

@BKHMSI
Contributor Author

BKHMSI commented Feb 10, 2026

Hi @BKHMSI did you intend to remove Pereira2018.243sentences-linear?

Yes, I did intend to change all benchmarks to use ridge regression instead of linear.
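
For concreteness, this is roughly the kind of change in question: swapping the mapping regressor from ordinary least squares to cross-validated ridge. A minimal sketch with illustrative function and parameter names, not the PR's actual code:

import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

def fit_mapping(model_activations, neural_data, use_ridge=True):
    # Fit a (stimuli x features) -> (stimuli x voxels) mapping.
    if use_ridge:
        # RidgeCV picks the regularization strength by internal cross-validation,
        # which tends to be more stable than unregularized regression when the
        # number of model features is large relative to the number of stimuli.
        regressor = RidgeCV(alphas=np.logspace(-3, 3, 7))
    else:
        regressor = LinearRegression()
    regressor.fit(model_activations, neural_data)
    return regressor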

@mschrimpf
Member

Let's keep the original ones for reference? (At least in code; they don't have to be displayed on the website.)
I agree we should use ridge going forward.

@BKHMSI
Contributor Author

BKHMSI commented Feb 10, 2026

Re-added the linear metrics for all benchmarks and fixed the ceiling for ridge.

Note that for Pereira2018, we need to cache ceilings for the new metrics.
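
A minimal sketch of what caching a ceiling might look like, assuming the result_caching @store decorator that comes up later in this thread; the function names and the stand-in computation below are purely illustrative:

from result_caching import store

def compute_extrapolated_ceiling(benchmark_identifier):
    # Stand-in for the real bootstrap/extrapolation code in ceiling_packaging.py.
    return 0.5

@store()
def cached_ceiling(benchmark_identifier):
    # @store writes the return value to disk keyed on the function arguments,
    # so the expensive ceiling computation only runs once per identifier.
    return compute_extrapolated_ceiling(benchmark_identifier)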

@mike-ferguson
Member

mike-ferguson commented Feb 10, 2026

Hi all,

Please note a few things about the state of the language repo:

  1. Automerging is currently disabled for benchmark-only PRs. Merging will require manual approval by a Brain-Score admin (most likely Kartik, me, or Martin)
  2. Scoring is disabled as well until language scoring infra comes online (~ end of next week)

KartikP added a commit that referenced this pull request Feb 11, 2026
@KartikP
Contributor

KartikP commented Feb 12, 2026

Hi @BKHMSI thanks for the PR. I had a chance to take a look at it and I noticed a few things that require attention:

  1. Please provide a description. A simple explanation of what you've done and why would suffice.

  2. In the benchmark factory, you pass the groupkfold to the linear benchmarks via CV_kwargs, which breaks backwards compatibility with the linear variants of the benchmarks. Given that the intention is to hide linear on the leaderboard and use RidgeCV moving forward, could you just not pass any kwargs? Otherwise, the tests should also reflect these changes.

  3. Missing numpy import in blank2014/benchmark.py, fedorenko2016/benchmark.py, tuckute2024/benchmark.py

  4. Benchmarks return a dict instead of a Score object. This breaks the way the score object is parsed to populate the DB -> leaderboard.
    My recommendation:

        score = Score(np.mean(list(layer_scores.values())))
        score.attrs['layer_scores'] = layer_scores
  5. You've added a substantial amount of code (RidgeGCV, Ridge benchmark variants, etc.) yet no tests. To ensure that your additions continue to operate as expected, please consider adding some.

I've attempted to address all of these issues in #361. The most significant differences are:

  1. Return a Score object with layer_scores, raw, and ceiling as attributes. This was necessary because the dict was breaking the downstream benchmark API.
  2. Default.kfold was set to False instead of "group" to ensure backwards compatibility for cross-validation (see the GroupKFold sketch after this comment). This was the main data integrity risk.
  3. Added the missing imports (numpy and scipy.linalg).
  4. Added the missing coords (Blank2014 never added a story coord and Fedorenko2016 never added a sentence_id coord).
  5. Removed CV_kwargs from the linear benchmarks.

If #361 looks good to you, please let me know; otherwise, I hope it can be of benefit to you.
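
For reference, a minimal sketch of the group-based splitting in question, using scikit-learn's GroupKFold; the helper and names below are illustrative, not the PR's code. Stimuli sharing a group label (e.g. the same story or passage) never end up on both sides of a split:

import numpy as np
from sklearn.model_selection import GroupKFold, KFold

def make_splits(num_stimuli, groups=None, n_splits=5):
    X = np.arange(num_stimuli).reshape(-1, 1)
    if groups is not None:
        # group-aware splitting: no group appears in both train and test
        return list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
    # plain k-fold: the previous default behavior (the backwards-compatible path)
    return list(KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X))

# e.g. 10 stimuli drawn from 5 stories, two stimuli per story
splits = make_splits(10, groups=np.repeat(np.arange(5), 2), n_splits=5)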

@KartikP
Contributor

KartikP commented Feb 17, 2026

@BKHMSI I tried running ceiling_packaging.py and ran into an error. The collect phase ran successfully and was stored by @store, but it failed in the extrapolation phase.

I think it was because curve_fit received NaN values, which should be filtered out before calling curve_fit.

Did you ever update the ceiling_packaging.py to resolve this issue?

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

Hi @KartikP, thanks for looking into the PR.

Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.

@mschrimpf any ideas on this?

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

also, @mschrimpf do you recommend taking the mean of evaluated layers as the default final score?

@KartikP changed it to the following:

score = Score(np.mean(list(layer_scores.values())))

I think we can stay with the argmax for now until we adopt the new layer selection strategy?
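
To make the two options concrete (hypothetical layer scores, not real results):

import numpy as np

layer_scores = {'layer_1': 0.31, 'layer_2': 0.45, 'layer_3': 0.38}  # hypothetical values

mean_score = float(np.mean(list(layer_scores.values())))                  # KartikP's suggestion: mean over layers
best_layer, best_score = max(layer_scores.items(), key=lambda kv: kv[1])  # argmax over layers, the current behavior
print(f"mean={mean_score:.3f}, argmax={best_layer} ({best_score:.3f})")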

@KartikP
Contributor

KartikP commented Feb 18, 2026

Hi @KartikP, thanks for looking into the PR.

Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.

@mschrimpf any ideas on this?

The way I handled it locally was essentially to filter NaN/Inf before passing to curve_fit:

def fit(self, subject_subsamples, bootstrapped_scores):
    subject_subsamples = np.array(subject_subsamples)
    bootstrapped_scores = np.array(bootstrapped_scores)
    valid = ~np.isnan(bootstrapped_scores) & np.isfinite(bootstrapped_scores)
    if sum(valid) < 1:
        raise RuntimeError("No valid scores in sample")
    params, pcov = curve_fit(v, subject_subsamples[valid], bootstrapped_scores[valid],
                             bounds=([0, -np.inf], [1, np.inf]))
    return params

and then also adding a catch for ValueError alongside RuntimeError on line 222 (extrapolate_neuroid()) so that a ValueError from curve_fit doesn't propagate up and kill the ceiling computation.
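
In other words, something along these lines around the per-neuroid fit (a simplified, hypothetical wrapper; the actual change would live in extrapolate_neuroid()):

import numpy as np

def safe_extrapolate(fit_fn, subject_subsamples, bootstrapped_scores):
    # curve_fit raises RuntimeError when the optimization fails and ValueError
    # when it is handed NaN/Inf data; treat both the same so a single bad
    # neuroid does not abort the whole ceiling computation.
    try:
        return fit_fn(subject_subsamples, bootstrapped_scores)
    except (RuntimeError, ValueError):
        return np.array([np.nan, np.nan])  # skip this neuroid, keep going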

@mschrimpf
Member

also, @mschrimpf do you recommend taking the mean of evaluated layers as the default final score?

@KartikP changed it to the following:

score = Score(np.mean(list(layer_scores.values())))

I think we can stay with the argmax for now until we adopt the new layer selection strategy?

I maintain that the benchmark should not know anything about layers. The benchmark's job is to compare existing data with data from a new subject (which can be a model); there should be zero insight into the exact model implementation. So I really don't like the benchmark iterating over predictions['layer'].

Could we interpret this as regions instead?

  1. the existing target data has voxels in several regions
  2. for data from a new source subject, these regions might be organized differently, so we search which source regions best map onto which target voxels, as sketched after this list (note that this is somewhat inconsistent between regions and voxels, though)
  3. the model can present its layers as different regions. But consequently, we will also have to allow a search over regions in the ceiling (right now I think the model has an unfair advantage by cherry-picking layers on the test data)
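
A rough sketch of that search, with made-up names, just to illustrate scoring every candidate source region (for a model, possibly a layer) against each target region and keeping the best match:

def assign_source_regions(source_regions, target_voxels_by_region, score_fn):
    # source_regions: name -> (stimuli x features); target_voxels_by_region: name -> (stimuli x voxels)
    assignment = {}
    for target_region, target_voxels in target_voxels_by_region.items():
        scores = {name: score_fn(source, target_voxels) for name, source in source_regions.items()}
        assignment[target_region] = max(scores, key=scores.get)  # best-matching source region
    return assignment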

@mschrimpf
Member

Hi @KartikP, thanks for looking into the PR.
Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.
@mschrimpf any ideas on this?

The way I handled it locally was essentially to filter NaN/Inf before passing to curve_fit

def fit(self, subject_subsamples, bootstrapped_scores):
    subject_subsamples = np.array(subject_subsamples)
    bootstrapped_scores = np.array(bootstrapped_scores)
    valid = ~np.isnan(bootstrapped_scores) & np.isfinite(bootstrapped_scores)
    if sum(valid) < 1:
        raise RuntimeError("No valid scores in sample")
    params, pcov = curve_fit(v, subject_subsamples[valid], bootstrapped_scores[valid],
                             bounds=([0, -np.inf], [1, np.inf]))
    return params

and then also adding a catch for ValueError alongside RuntimeError on line 222 (extrapolate_neuroid()) so that a ValueError from curve_fit doesn't propagate up and kill the ceiling computation.

Doesn't the original implementation already filter NaN values as well?

# the sub_subjects dimension creates nans, get rid of those
num_scores = num_scores.dropna(f'sub_{self.subject_column}')

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

@mschrimpf the idea was to evaluate multiple user-selected layers at once instead of doing a separate forward pass / evaluation for each layer a user wants to evaluate. So I implemented it from an efficiency perspective.

Maybe it could instead return a list of Scores, one per layer, effectively acting like multiple submissions with different layer commitments?
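
For concreteness, a minimal sketch of that batching idea (illustrative names, not the PR's code): one forward pass over the stimuli, then one score per requested layer.

def score_layers_in_one_pass(extract_activations, layers, stimuli, target_data, score_fn):
    # extract_activations(stimuli, layers) -> {layer name: (stimuli x features)}, computed in a single forward pass
    activations_per_layer = extract_activations(stimuli, layers)
    return {layer: score_fn(activations_per_layer[layer], target_data) for layer in layers}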

