
Group-based Splitting for Benchmarks + Batch-layer Evaluation + GPU Support #356

Open
BKHMSI wants to merge 4 commits into brain-score:main from epflneuroailab:main

Conversation

@BKHMSI
Contributor

BKHMSI commented Feb 10, 2026

No description provided.

@KartikP
Contributor

KartikP commented Feb 10, 2026

Hi @BKHMSI did you intend to remove Pereira2018.243sentences-linear?

@BKHMSI
Contributor Author

BKHMSI commented Feb 10, 2026

Hi @BKHMSI did you intend to remove Pereira2018.243sentences-linear?

Yes, I did intend to change all benchmarks to use ridge regression instead of linear.
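
For concreteness, this is roughly the kind of change in question: swapping the mapping regressor from ordinary least squares to cross-validated ridge. A minimal sketch with illustrative function and parameter names, not the PR's actual code:

import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

def fit_mapping(model_activations, neural_data, use_ridge=True):
    # Fit a (stimuli x features) -> (stimuli x voxels) mapping.
    if use_ridge:
        # RidgeCV picks the regularization strength by internal cross-validation,
        # which tends to be more stable than unregularized regression when the
        # number of model features is large relative to the number of stimuli.
        regressor = RidgeCV(alphas=np.logspace(-3, 3, 7))
    else:
        regressor = LinearRegression()
    regressor.fit(model_activations, neural_data)
    return regressor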

@mschrimpf
Member

Let's keep the original ones for reference? (At least in code; they don't have to be displayed on the website.)
I agree we should use ridge going forward.

@BKHMSI
Contributor Author

BKHMSI commented Feb 10, 2026

Re-added the linear metrics for all benchmarks and fixed the ceiling for ridge.

Note that for Pereira2018, we need to cache ceilings for the new metrics.
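
A minimal sketch of what caching a ceiling might look like, assuming the result_caching @store decorator that comes up later in this thread; the function names and the stand-in computation below are purely illustrative:

from result_caching import store

def compute_extrapolated_ceiling(benchmark_identifier):
    # Stand-in for the real bootstrap/extrapolation code in ceiling_packaging.py.
    return 0.5

@store()
def cached_ceiling(benchmark_identifier):
    # @store writes the return value to disk keyed on the function arguments,
    # so the expensive ceiling computation only runs once per identifier.
    return compute_extrapolated_ceiling(benchmark_identifier)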

@mike-ferguson
Member

mike-ferguson commented Feb 10, 2026

Hi all,

Please note a few things about the state of the language repo:

  1. Automerging is currently disabled for benchmark-only PRs. Merging will require manual approval by a Brain-Score admin (most likely Kartik, me, or Martin)
  2. Scoring is disabled as well until language scoring infra comes online (~ end of next week)

KartikP added a commit that referenced this pull request Feb 11, 2026
@KartikP
Contributor

KartikP commented Feb 12, 2026

Hi @BKHMSI thanks for the PR. I had a chance to take a look at it and I noticed a few things that require attention:

  1. Please provide a description. A simple explanation of what you've done and why would suffice.

  2. In the benchmark factory, you pass the groupkfold to the linear benchmarks via CV_kwargs, which breaks backwards compatibility with the linear variants of the benchmarks. Given that the intention is to hide linear on the leaderboard and use RidgeCV moving forward, could you just not pass any kwargs? Otherwise, the tests should also reflect these changes.

  3. Missing numpy import in blank2014/benchmark.py, fedorenko2016/benchmark.py, tuckute2024/benchmark.py

  4. Benchmarks return a dict instead of a Score object. This breaks the way the score object is parsed to populate the DB -> leaderboard.
    My recommendation:

        score = Score(np.mean(list(layer_scores.values())))
        score.attrs['layer_scores'] = layer_scores
  5. You've added a substantial amount of code (RidgeGCV, Ridge benchmark variants, etc.) yet no tests. To ensure that your additions continue to operate as expected, please consider adding some.

I've attempted to address all of these issues in #361. The most significant differences are:

  1. Return a Score object with layer_scores, raw, and ceiling as attributes. This was necessary because the dict was breaking the downstream benchmark API.
  2. Default.kfold was set to False instead of "group" to ensure backwards compatibility for cross-validation (see the GroupKFold sketch after this comment). This was the main data integrity risk.
  3. Added the missing imports (numpy and scipy.linalg).
  4. Added the missing coords (Blank2014 never added a story coord and Fedorenko2016 never added a sentence_id coord).
  5. Removed CV_kwargs from the linear benchmarks.

If #361 looks good to you, please let me know; otherwise, I hope it can be of benefit to you.
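
For reference, a minimal sketch of the group-based splitting in question, using scikit-learn's GroupKFold; the helper and names below are illustrative, not the PR's code. Stimuli sharing a group label (e.g. the same story or passage) never end up on both sides of a split:

import numpy as np
from sklearn.model_selection import GroupKFold, KFold

def make_splits(num_stimuli, groups=None, n_splits=5):
    X = np.arange(num_stimuli).reshape(-1, 1)
    if groups is not None:
        # group-aware splitting: no group appears in both train and test
        return list(GroupKFold(n_splits=n_splits).split(X, groups=groups))
    # plain k-fold: the previous default behavior (the backwards-compatible path)
    return list(KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X))

# e.g. 10 stimuli drawn from 5 stories, two stimuli per story
splits = make_splits(10, groups=np.repeat(np.arange(5), 2), n_splits=5)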

@KartikP
Contributor

KartikP commented Feb 17, 2026

@BKHMSI I tried running ceiling_packaging.py and ran into an error. The collect phase ran successfully and was stored by @store, but it failed in the extrapolation phase.

I think it was because curve_fit received NaN values, which should be filtered out before calling curve_fit.

Did you ever update the ceiling_packaging.py to resolve this issue?

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

Hi @KartikP, thanks for looking into the PR.

Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.

@mschrimpf any ideas on this?

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

also, @mschrimpf do you recommend taking the mean of evaluated layers as the default final score?

@KartikP changed it to the following:

score = Score(np.mean(list(layer_scores.values())))

I think we can stay with the argmax for now until we adopt the new layer selection strategy?
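
To make the two options concrete (hypothetical layer scores, not real results):

import numpy as np

layer_scores = {'layer_1': 0.31, 'layer_2': 0.45, 'layer_3': 0.38}  # hypothetical values

mean_score = float(np.mean(list(layer_scores.values())))                  # KartikP's suggestion: mean over layers
best_layer, best_score = max(layer_scores.items(), key=lambda kv: kv[1])  # argmax over layers, the current behavior
print(f"mean={mean_score:.3f}, argmax={best_layer} ({best_score:.3f})")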

@KartikP
Contributor

KartikP commented Feb 18, 2026

Hi @KartikP, thanks for looking into the PR.

Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.

@mschrimpf any ideas on this?

The way I handled it locally was essentially to filter NaN/Inf before passing to curve_fit:

def fit(self, subject_subsamples, bootstrapped_scores):
    subject_subsamples = np.array(subject_subsamples)
    bootstrapped_scores = np.array(bootstrapped_scores)
    valid = ~np.isnan(bootstrapped_scores) & np.isfinite(bootstrapped_scores)
    if sum(valid) < 1:
        raise RuntimeError("No valid scores in sample")
    params, pcov = curve_fit(v, subject_subsamples[valid], bootstrapped_scores[valid],
                             bounds=([0, -np.inf], [1, np.inf]))
    return params

and then also adding a catch for ValueError alongside RuntimeError on line 222 (extrapolate_neuroid()) so that a ValueError from curve_fit doesn't propagate up and kill the ceiling computation.
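
In other words, something along these lines around the per-neuroid fit (a simplified, hypothetical wrapper; the actual change would live in extrapolate_neuroid()):

import numpy as np

def safe_extrapolate(fit_fn, subject_subsamples, bootstrapped_scores):
    # curve_fit raises RuntimeError when the optimization fails and ValueError
    # when it is handed NaN/Inf data; treat both the same so a single bad
    # neuroid does not abort the whole ceiling computation.
    try:
        return fit_fn(subject_subsamples, bootstrapped_scores)
    except (RuntimeError, ValueError):
        return np.array([np.nan, np.nan])  # skip this neuroid, keep going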

@mschrimpf
Member

also, @mschrimpf do you recommend taking the mean of evaluated layers as the default final score?

@KartikP changed it to the following:

score = Score(np.mean(list(layer_scores.values())))

I think we can stay with the argmax for now until we adopt the new layer selection strategy?

I maintain that the benchmark should not know anything about layers. The benchmark's job is to compare existing data with data from a new subject (which can be a model); there should be zero insight into the exact model implementation. So I really don't like the benchmark iterating over predictions['layer'].

Could we interpret this as regions instead?

  1. the existing target data has voxels in several regions
  2. for data from a new source subject, these regions might be organized differently, so we search which source regions best map onto which target voxels, as sketched after this list (note that this is somewhat inconsistent between regions and voxels, though)
  3. the model can present its layers as different regions. But consequently, we will also have to allow a search over regions in the ceiling (right now I think the model has an unfair advantage by cherry-picking layers on the test data)
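
A rough sketch of that search, with made-up names, just to illustrate scoring every candidate source region (for a model, possibly a layer) against each target region and keeping the best match:

def assign_source_regions(source_regions, target_voxels_by_region, score_fn):
    # source_regions: name -> (stimuli x features); target_voxels_by_region: name -> (stimuli x voxels)
    assignment = {}
    for target_region, target_voxels in target_voxels_by_region.items():
        scores = {name: score_fn(source, target_voxels) for name, source in source_regions.items()}
        assignment[target_region] = max(scores, key=scores.get)  # best-matching source region
    return assignment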

@mschrimpf
Member

Hi @KartikP, thanks for looking into the PR.
Yes, I had the same issue with Pereira2018.384 in the extrapolation phase, and it is indeed because of the NaN values. I am still not sure how to resolve it.
@mschrimpf any ideas on this?

The way I handled it locally was essentially to filter NaN/Inf before passing to curve_fit

def fit(self, subject_subsamples, bootstrapped_scores):
    subject_subsamples = np.array(subject_subsamples)
    bootstrapped_scores = np.array(bootstrapped_scores)
    valid = ~np.isnan(bootstrapped_scores) & np.isfinite(bootstrapped_scores)
    if sum(valid) < 1:
        raise RuntimeError("No valid scores in sample")
    params, pcov = curve_fit(v, subject_subsamples[valid], bootstrapped_scores[valid],
                             bounds=([0, -np.inf], [1, np.inf]))
    return params

and then also adding a catch for ValueError alongside RuntimeError on line 222 (extrapolate_neuroid()) so that a ValueError from curve_fit doesn't propagate up and kill the ceiling computation.

Doesn't the original implementation already filter NaN values as well?

# the sub_subjects dimension creates nans, get rid of those
num_scores = num_scores.dropna(f'sub_{self.subject_column}')

@BKHMSI
Contributor Author

BKHMSI commented Feb 18, 2026

@mschrimpf the idea was to evaluate multiple user-selected layers at once instead of doing a separate forward pass / evaluation for each layer a user wants to evaluate. So I implemented it from an efficiency perspective.

Maybe it could instead return a list of Scores, one per layer, effectively acting like multiple submissions with different layer commitments?
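
For concreteness, a minimal sketch of that batching idea (illustrative names, not the PR's code): one forward pass over the stimuli, then one score per requested layer.

def score_layers_in_one_pass(extract_activations, layers, stimuli, target_data, score_fn):
    # extract_activations(stimuli, layers) -> {layer name: (stimuli x features)}, computed in a single forward pass
    activations_per_layer = extract_activations(stimuli, layers)
    return {layer: score_fn(activations_per_layer[layer], target_data) for layer in layers}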

