Skip to content

Fix weights#184

Open
tonywu1999 wants to merge 2 commits intodevelfrom
fix-weights
Open

Fix weights#184
tonywu1999 wants to merge 2 commits intodevelfrom
fix-weights

Conversation

@tonywu1999
Copy link
Contributor

@tonywu1999 tonywu1999 commented Feb 19, 2026

TBD to Merge

Motivation and Context

In group comparison models, when observations have associated variance estimates, inverse-variance weighting can be applied to give more weight to observations with lower variance (higher precision). Previously, weights were passed directly as inverse-variance values, which could inadvertently affect the scale of variance components. This PR normalizes the weights to sum to the sample size so that sample size calculation between MSstats and MSstats+ are comparable.

Solution

The weighting logic in .fitModelForGroupComparison has been updated to normalize inverse-variance weights. When a Variance column is present and contains non-NA values, the weights are now computed as: weight_input = (1/Variance) / mean(1/Variance), ensuring weights sum to the sample size

Detailed Changes

  • R/utils_groupcomparison.R (lines 142-147):
    • Modified weight calculation logic in .fitModelForGroupComparison function
    • Changed from passing weights as-is to computing normalized inverse-variance weights internally
    • Introduced intermediate variable unnormalized_weight_input = 1/input$Variance for clarity
    • Applied normalization: weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
    • Preserves existing behavior for cases without variance information (weight_input = NULL)

Unit Tests

  • inst/tinytest/test_utils_groupComparison_model.R:
    • Test verifies that weights are applied to the fitted model when Variance column is provided
    • Currently expects unnormalized weights 1/Variance, requiring update to verify normalized weights calculation

Coding Guidelines Violations

The existing test in test_utils_groupComparison_model.R (lines 16-17) does not account for the weight normalization introduced in this PR. The test assertion expect_equal(result$full_fit$weights, expected_weights) where expected_weights = 1/test_input_with_variance$Variance will fail because the actual weights are normalized by their mean. The test should be updated to check for expected_weights / mean(expected_weights) to properly validate the normalized weighting behavior.

@coderabbitai
Copy link

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Walkthrough

This change modifies the weighting logic in .fitModelForGroupComparison when a Variance column is present. Instead of simple inverse-variance weighting, it now applies normalized inverse-variance weighting by dividing inverse variances by their mean, ensuring weights average to 1.

Changes

Cohort / File(s) Summary
Internal weighting logic modification
R/utils_groupcomparison.R
Updated inverse-variance weighting to normalize weights by their mean. Replaces single-step calculation with two-step approach to produce weights that average to 1.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A hop through variance weights so fair,
Normalized now with balanced care,
No more wild swings from here to there,
Mean of one keeps all affairs square,
The bunny approves this weighting affair! 🌟

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is empty and does not follow the required template. All mandatory sections (Motivation and Context, Changes, Testing, and Checklist) are missing. Add a complete pull request description following the template: include motivation/context for the weight normalization change, a detailed bullet-point list of modifications, description of testing performed, and complete the checklist items.
Title check ❓ Inconclusive The title 'Fix weights' is vague and generic; it refers to a real change (weight normalization) but lacks specificity about what was fixed or how. Use a more descriptive title such as 'Normalize inverse-variance weights to have mean of 1' to clearly convey the specific change made.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-weights

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

Failed to generate code suggestions for PR

@tonywu1999 tonywu1999 marked this pull request as ready for review February 19, 2026 22:41
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/utils_groupcomparison.R`:
- Around line 142-147: The current normalization of unnormalized_weight_input
uses mean(unnormalized_weight_input) which returns NA if any Variance is NA;
change the normalization to mean(unnormalized_weight_input, na.rm = TRUE) and
then guard against remaining NA entries by replacing NA weights with 0 (e.g.
after computing weight_input, do weight_input[is.na(weight_input)] <- 0) so
downstream lm()/lmer() calls don't receive all-NA weights; update the code that
references "Variance", "unnormalized_weight_input", and "weight_input"
accordingly and add/adjust tests to cover the partial-NA Variance case.

Comment on lines 142 to 147
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
weight_input = 1/input$Variance
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
} else {
weight_input = NULL
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n R/utils_groupcomparison.R | head -200

Repository: Vitek-Lab/MSstats

Length of output: 8797


🏁 Script executed:

# Search for where Variance column is created/assigned
rg -n "Variance\s*=" --type r -B 3 -A 1 | head -100

Repository: Vitek-Lab/MSstats

Length of output: 1999


🏁 Script executed:

# Read the calculateProteinVariance function
cat -n R/utils_feature_selection.R | sed -n '246,300p'

Repository: Vitek-Lab/MSstats

Length of output: 2304


🏁 Script executed:

# Search for where Variance is assigned to ProteinLevelData or similar
rg -n "ProteinLevelData.*Variance|Variance.*ProteinLevel" --type r -B 2 -A 2

Repository: Vitek-Lab/MSstats

Length of output: 1616


🏁 Script executed:

# Search more broadly for Variance creation in dataProcess pipeline
rg -n "var_resid|Variance" --type r R/ | grep -E "(var_resid|Variance\s*=|Variance\$)" | head -50

Repository: Vitek-Lab/MSstats

Length of output: 745


🏁 Script executed:

# Search for where var_resid is renamed to Variance
rg -n "var_resid.*Variance|Variance.*var_resid|setnames.*[Vv]ariance" --type r -B 1 -A 1

Repository: Vitek-Lab/MSstats

Length of output: 309


🏁 Script executed:

# Look for where var_resid might be renamed to Variance by checking column selection/renaming
rg -n "Variance|var_resid" --type r R/dataProcess.R | head -50

Repository: Vitek-Lab/MSstats

Length of output: 43


🏁 Script executed:

# Find all R files and see what's the main entry point
fd "\.R$" R/ | head -20

Repository: Vitek-Lab/MSstats

Length of output: 515


🏁 Script executed:

# Check dataProcess.R for Variance column handling
cat -n R/dataProcess.R | grep -i "variance\|var_resid" -B 2 -A 2

Repository: Vitek-Lab/MSstats

Length of output: 3315


🏁 Script executed:

# Look for the actual output structure - search for ProteinLevelData creation
rg -n "ProteinLevelData\s*=" --type r R/dataProcess.R -A 5 | head -80

Repository: Vitek-Lab/MSstats

Length of output: 43


🏁 Script executed:

# Let's check the groupComparison.R file to see how the data flows into the function
cat -n R/groupComparison.R | head -150

Repository: Vitek-Lab/MSstats

Length of output: 10509


🏁 Script executed:

# Test R behavior with mean() and partial NAs
python3 << 'EOF'
import subprocess
result = subprocess.run(['Rscript', '-e', '''
# Test mean with partial NAs
x = c(1, 2, NA, 4, 5)
print("Vector with partial NAs:")
print(x)
print("mean(x) with na.rm=FALSE (default):")
print(mean(x))
print("mean(x) with na.rm=TRUE:")
print(mean(x, na.rm=TRUE))

# Test division by NA result
y = 1/c(0.1, 0.2, NA, 0.3)
print("1/c(0.1, 0.2, NA, 0.3):")
print(y)
print("mean of that:")
print(mean(y))
print("Dividing y by mean(y):")
print(y / mean(y))
'''], capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
EOF

Repository: Vitek-Lab/MSstats

Length of output: 616


🏁 Script executed:

# Search for test cases that might show Variance with NA values
rg -n "Variance.*NA|NA.*Variance" --type r -B 2 -A 2 | head -80

Repository: Vitek-Lab/MSstats

Length of output: 577


🏁 Script executed:

# Search for tests that explicitly test Variance with NAs
rg -n "all.*is.na.*Variance|is.na.*Variance" --type r -B 3 -A 3

Repository: Vitek-Lab/MSstats

Length of output: 634


🏁 Script executed:

# Look for test files that might test the weight calculation logic
cat -n inst/tinytest/test_utils_groupComparison_model.R

Repository: Vitek-Lab/MSstats

Length of output: 960


🏁 Script executed:

# Check if the test at line 17 is actually passing or if there's a mismatch
# The test expects unwormalized weights but the code normalizes them
cat -n inst/tinytest/test_utils_groupComparison_model.R | sed -n '14,18p'

Repository: Vitek-Lab/MSstats

Length of output: 392


Missing na.rm = TRUE in mean() silently zeros out all weights when any Variance is NA

The guard at line 142 only rejects the case where all variances are NA. When some (but not all) rows have NA variance, the branch is entered, unnormalized_weight_input contains some NA elements, and mean(unnormalized_weight_input) returns NA (R's default is na.rm = FALSE). Every element of weight_input then becomes NA, and the all-NA weight vector is forwarded to every lm() / lmer() call below — effectively dropping all observations and producing a failed or degenerate model.

🐛 Proposed fix
     if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
         unnormalized_weight_input = 1/input$Variance
-        weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
+        weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE)
     } else {
         weight_input = NULL
     }

Note: The existing test (inst/tinytest/test_utils_groupComparison_model.R:17) expects unnormalized weights but does not cover the partial-NA case, which is where this bug surfaces. The test should be updated to verify normalization and to include a case with partial NAs in the Variance column.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
weight_input = 1/input$Variance
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
} else {
weight_input = NULL
}
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
unnormalized_weight_input = 1/input$Variance
weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE)
} else {
weight_input = NULL
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/utils_groupcomparison.R` around lines 142 - 147, The current normalization
of unnormalized_weight_input uses mean(unnormalized_weight_input) which returns
NA if any Variance is NA; change the normalization to
mean(unnormalized_weight_input, na.rm = TRUE) and then guard against remaining
NA entries by replacing NA weights with 0 (e.g. after computing weight_input, do
weight_input[is.na(weight_input)] <- 0) so downstream lm()/lmer() calls don't
receive all-NA weights; update the code that references "Variance",
"unnormalized_weight_input", and "weight_input" accordingly and add/adjust tests
to cover the partial-NA Variance case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant