Conversation
📝 WalkthroughWalkthroughThis change modifies the weighting logic in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Failed to generate code suggestions for PR |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/utils_groupcomparison.R`:
- Around line 142-147: The current normalization of unnormalized_weight_input
uses mean(unnormalized_weight_input) which returns NA if any Variance is NA;
change the normalization to mean(unnormalized_weight_input, na.rm = TRUE) and
then guard against remaining NA entries by replacing NA weights with 0 (e.g.
after computing weight_input, do weight_input[is.na(weight_input)] <- 0) so
downstream lm()/lmer() calls don't receive all-NA weights; update the code that
references "Variance", "unnormalized_weight_input", and "weight_input"
accordingly and add/adjust tests to cover the partial-NA Variance case.
| if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){ | ||
| weight_input = 1/input$Variance | ||
| unnormalized_weight_input = 1/input$Variance | ||
| weight_input = unnormalized_weight_input / mean(unnormalized_weight_input) | ||
| } else { | ||
| weight_input = NULL | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n R/utils_groupcomparison.R | head -200Repository: Vitek-Lab/MSstats
Length of output: 8797
🏁 Script executed:
# Search for where Variance column is created/assigned
rg -n "Variance\s*=" --type r -B 3 -A 1 | head -100Repository: Vitek-Lab/MSstats
Length of output: 1999
🏁 Script executed:
# Read the calculateProteinVariance function
cat -n R/utils_feature_selection.R | sed -n '246,300p'Repository: Vitek-Lab/MSstats
Length of output: 2304
🏁 Script executed:
# Search for where Variance is assigned to ProteinLevelData or similar
rg -n "ProteinLevelData.*Variance|Variance.*ProteinLevel" --type r -B 2 -A 2Repository: Vitek-Lab/MSstats
Length of output: 1616
🏁 Script executed:
# Search more broadly for Variance creation in dataProcess pipeline
rg -n "var_resid|Variance" --type r R/ | grep -E "(var_resid|Variance\s*=|Variance\$)" | head -50Repository: Vitek-Lab/MSstats
Length of output: 745
🏁 Script executed:
# Search for where var_resid is renamed to Variance
rg -n "var_resid.*Variance|Variance.*var_resid|setnames.*[Vv]ariance" --type r -B 1 -A 1Repository: Vitek-Lab/MSstats
Length of output: 309
🏁 Script executed:
# Look for where var_resid might be renamed to Variance by checking column selection/renaming
rg -n "Variance|var_resid" --type r R/dataProcess.R | head -50Repository: Vitek-Lab/MSstats
Length of output: 43
🏁 Script executed:
# Find all R files and see what's the main entry point
fd "\.R$" R/ | head -20Repository: Vitek-Lab/MSstats
Length of output: 515
🏁 Script executed:
# Check dataProcess.R for Variance column handling
cat -n R/dataProcess.R | grep -i "variance\|var_resid" -B 2 -A 2Repository: Vitek-Lab/MSstats
Length of output: 3315
🏁 Script executed:
# Look for the actual output structure - search for ProteinLevelData creation
rg -n "ProteinLevelData\s*=" --type r R/dataProcess.R -A 5 | head -80Repository: Vitek-Lab/MSstats
Length of output: 43
🏁 Script executed:
# Let's check the groupComparison.R file to see how the data flows into the function
cat -n R/groupComparison.R | head -150Repository: Vitek-Lab/MSstats
Length of output: 10509
🏁 Script executed:
# Test R behavior with mean() and partial NAs
python3 << 'EOF'
import subprocess
result = subprocess.run(['Rscript', '-e', '''
# Test mean with partial NAs
x = c(1, 2, NA, 4, 5)
print("Vector with partial NAs:")
print(x)
print("mean(x) with na.rm=FALSE (default):")
print(mean(x))
print("mean(x) with na.rm=TRUE:")
print(mean(x, na.rm=TRUE))
# Test division by NA result
y = 1/c(0.1, 0.2, NA, 0.3)
print("1/c(0.1, 0.2, NA, 0.3):")
print(y)
print("mean of that:")
print(mean(y))
print("Dividing y by mean(y):")
print(y / mean(y))
'''], capture_output=True, text=True)
print(result.stdout)
if result.stderr:
print("STDERR:", result.stderr)
EOFRepository: Vitek-Lab/MSstats
Length of output: 616
🏁 Script executed:
# Search for test cases that might show Variance with NA values
rg -n "Variance.*NA|NA.*Variance" --type r -B 2 -A 2 | head -80Repository: Vitek-Lab/MSstats
Length of output: 577
🏁 Script executed:
# Search for tests that explicitly test Variance with NAs
rg -n "all.*is.na.*Variance|is.na.*Variance" --type r -B 3 -A 3Repository: Vitek-Lab/MSstats
Length of output: 634
🏁 Script executed:
# Look for test files that might test the weight calculation logic
cat -n inst/tinytest/test_utils_groupComparison_model.RRepository: Vitek-Lab/MSstats
Length of output: 960
🏁 Script executed:
# Check if the test at line 17 is actually passing or if there's a mismatch
# The test expects unwormalized weights but the code normalizes them
cat -n inst/tinytest/test_utils_groupComparison_model.R | sed -n '14,18p'Repository: Vitek-Lab/MSstats
Length of output: 392
Missing na.rm = TRUE in mean() silently zeros out all weights when any Variance is NA
The guard at line 142 only rejects the case where all variances are NA. When some (but not all) rows have NA variance, the branch is entered, unnormalized_weight_input contains some NA elements, and mean(unnormalized_weight_input) returns NA (R's default is na.rm = FALSE). Every element of weight_input then becomes NA, and the all-NA weight vector is forwarded to every lm() / lmer() call below — effectively dropping all observations and producing a failed or degenerate model.
🐛 Proposed fix
if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){
unnormalized_weight_input = 1/input$Variance
- weight_input = unnormalized_weight_input / mean(unnormalized_weight_input)
+ weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE)
} else {
weight_input = NULL
}Note: The existing test (inst/tinytest/test_utils_groupComparison_model.R:17) expects unnormalized weights but does not cover the partial-NA case, which is where this bug surfaces. The test should be updated to verify normalization and to include a case with partial NAs in the Variance column.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){ | |
| weight_input = 1/input$Variance | |
| unnormalized_weight_input = 1/input$Variance | |
| weight_input = unnormalized_weight_input / mean(unnormalized_weight_input) | |
| } else { | |
| weight_input = NULL | |
| } | |
| if ("Variance" %in% colnames(input) & !all(is.na(input$Variance))){ | |
| unnormalized_weight_input = 1/input$Variance | |
| weight_input = unnormalized_weight_input / mean(unnormalized_weight_input, na.rm = TRUE) | |
| } else { | |
| weight_input = NULL | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/utils_groupcomparison.R` around lines 142 - 147, The current normalization
of unnormalized_weight_input uses mean(unnormalized_weight_input) which returns
NA if any Variance is NA; change the normalization to
mean(unnormalized_weight_input, na.rm = TRUE) and then guard against remaining
NA entries by replacing NA weights with 0 (e.g. after computing weight_input, do
weight_input[is.na(weight_input)] <- 0) so downstream lm()/lmer() calls don't
receive all-NA weights; update the code that references "Variance",
"unnormalized_weight_input", and "weight_input" accordingly and add/adjust tests
to cover the partial-NA Variance case.
TBD to Merge
Motivation and Context
In group comparison models, when observations have associated variance estimates, inverse-variance weighting can be applied to give more weight to observations with lower variance (higher precision). Previously, weights were passed directly as inverse-variance values, which could inadvertently affect the scale of variance components. This PR normalizes the weights to sum to the sample size so that sample size calculation between MSstats and MSstats+ are comparable.
Solution
The weighting logic in
.fitModelForGroupComparisonhas been updated to normalize inverse-variance weights. When a Variance column is present and contains non-NA values, the weights are now computed as:weight_input = (1/Variance) / mean(1/Variance), ensuring weights sum to the sample sizeDetailed Changes
.fitModelForGroupComparisonfunctionunnormalized_weight_input = 1/input$Variancefor clarityweight_input = unnormalized_weight_input / mean(unnormalized_weight_input)Unit Tests
1/Variance, requiring update to verify normalized weights calculationCoding Guidelines Violations
The existing test in
test_utils_groupComparison_model.R(lines 16-17) does not account for the weight normalization introduced in this PR. The test assertionexpect_equal(result$full_fit$weights, expected_weights)whereexpected_weights = 1/test_input_with_variance$Variancewill fail because the actual weights are normalized by their mean. The test should be updated to check forexpected_weights / mean(expected_weights)to properly validate the normalized weighting behavior.