Skip to content

Feature diann anomaly#119

Merged
devonjkohler merged 12 commits intodevelfrom
feature-diann-anomaly
Mar 2, 2026
Merged

Feature diann anomaly#119
devonjkohler merged 12 commits intodevelfrom
feature-diann-anomaly

Conversation

@devonjkohler
Copy link
Contributor

@devonjkohler devonjkohler commented Mar 2, 2026

Motivation and Context

Updated DIA-NN converter to work with MSstats+ and some minor bug fixes to anomaly detection model.

Changes

  • Added quality metric columns to DIA-NN converter
  • Made sure anomaly scores were calculated on unique rows
  • Removed clipping because MSstats will reweight anomaly scores in dataProcess now

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Motivation and Context

This PR updates the DIA-NN -> MSstats conversion pipeline to integrate anomaly-detection support for MSstats+ and fixes bugs in the anomaly scoring pipeline. Goals: add quality-metric columns to the DIANN converter, ensure anomaly scores are computed on unique (aggregated) rows, and remove clipping of anomaly scores so MSstats' dataProcess can reweight scores.

Short solution summary

  • Add anomaly-related parameters and propagate anomaly metric columns through cleaning, preprocessing, and balanced-design steps.
  • Compute anomaly scores on unique model-input rows and merge scores back to original rows to preserve uniqueness.
  • Replace prior CuSum-like builders with stateful, NA-aware implementations and remove clipping of final anomaly scores.
  • Ensure anomaly metrics are included in outputs only when anomaly scoring is enabled.

Detailed changes

  • R/converters_DIANNtoMSstatsFormat.R

    • Signature extended with new anomaly parameters: calculateAnomalyScores (FALSE), anomalyModelFeatures (c()), anomalyModelFeatureTemporal (c()), removeMissingFeatures (0.5), anomalyModelFeatureCount (100), runOrder (NULL), n_trees (100), max_depth ("auto"), numberOfCores (1), plus propagation of these through the pipeline.
    • Standardize anomalyModelFeatures early (.standardizeColnames) and pass anomaly_metrics into MSstatsClean, MSstatsPreprocess, MSstatsBalancedDesign.
    • Change filtering behavior: entries failing qvalue/PGQ thresholds are zeroed (Intensity <- 0 or NA adjustments) rather than subset-removed.
    • Conditional call to MSstatsAnomalyScores when calculateAnomalyScores = TRUE.
    • Minor leftover debugging comment: browser() call(s) remain (see coding guideline note).
  • R/clean_DIANN.R

    • .cleanRawDIANN() and .cleanDIANNSelectRequiredColumns() accept calculateAnomalyScores and anomalyModelFeatures.
    • When anomaly scoring enabled, anomalyModelFeatures appended to required columns (qual_cols) so quality metric columns are retained.
  • R/utils_anomaly_score.R

    • Reimplemented mean_increase, mean_decrease, dispersion_increase as stateful, NA-aware builders with proper initialization and propagation of NA.
    • .prepareSpectronautAnomalyInput now builds unique model_input (unique combinations of split column and quality metrics) for anomaly-model training.
    • .runAnomalyModel computes scores on model_input, assigns AnomalyScores there, then merges scores back into input_data by split_column + quality metrics to ensure unique-row scoring.
    • Removed clipping of anomaly scores in this flow (scores are left as produced by the forest).
  • R/MSstatsConvert_core_functions.R

    • Expanded Roxygen/documentation for MSstatsAnomalyScores and related parameters; documentation only, no signature changes.
  • man/*.Rd updates

    • man/DIANNtoMSstatsFormat.Rd: updated to document the new anomaly-related public parameters and default values.
    • man/MSstatsAnomalyScores.Rd: expanded description of the isolation-forest approach, missing-data handling (median imputation fallback), and parallel support.
    • man/MSstatsClean.Rd and man/dot-cleanRawDIANN.Rd: documented the new arguments for DIANN cleaning.
  • inst/tinytest/test_utils_anomaly_score.R

    • Added comprehensive tests (369 lines) covering property-based checks, extreme-value/outlier detection, reproducibility/consistency, edge cases (identical/minimal/all-zero), scale invariance, regression properties (non-negative/finite), builders (mean_increase/mean_decrease/dispersion_increase), and temporal/missing-run parameters.
  • src/isolation_forest.cpp

    • Large block of commented-out alternative/previous Isolation Forest implementation and notes added; active interface/behavior unchanged.
  • Repository metadata

    • .gitignore: adjusted patterns (re-added dll ignore pattern and added .lintr and .vscode).
    • DESCRIPTION: RoxygenNote bumped 7.3.2 -> 7.3.3.

Unit tests added or modified

  • inst/tinytest/test_utils_anomaly_score.R (new/expanded)
    • Tests include:
      • Monotonicity/property tests (progressive CuSum-like behavior).
      • Outlier detection checks (obvious outliers get highest scores).
      • Reproducibility/correlation checks across runs (>0.95 expected).
      • Top-k overlap checks (top-5 anomalies overlap).
      • Edge cases: identical values, minimal datasets, all-zero data.
      • Scale invariance (relative rankings preserved after scaling).
      • Builder function unit tests for mean_increase, mean_decrease, dispersion_increase with expected numeric outputs.
      • Tests for .prepareSpectronautAnomalyInput temporal/feature-count/missing-run behavior.
    • Existing DIANN converter/cleaning tests remain focused on structure/required columns.

Coding guidelines / issues identified

  • Debugging artifacts: one or more browser() calls remain commented/uncommented in R/converters_DIANNtoMSstatsFormat.R (observed at least one "browser()" near end of file). These should be removed before merging.
  • Large commented blocks in src/isolation_forest.cpp increase file size and may reduce readability; consider moving historical code to a separate notes file if retention is required.
  • No public API signature removals; several public signatures expanded—ensure downstream callers are aware of new parameters and defaults.

@devonjkohler devonjkohler requested a review from Copilot March 2, 2026 15:02
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 2, 2026

Warning

Rate limit exceeded

@devonjkohler has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 18 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 692546b and 004766b.

📒 Files selected for processing (5)
  • R/clean_DIANN.R
  • R/converters_DIANNtoMSstatsFormat.R
  • man/DIANNtoMSstatsFormat.Rd
  • man/MSstatsClean.Rd
  • man/dot-cleanRawDIANN.Rd
📝 Walkthrough

Walkthrough

Adds anomaly-detection support across the DIANN → MSstats conversion pipeline: new anomaly-related parameters and docs, NA-aware stateful anomaly score computations, propagation of anomaly metrics through preprocessing/balanced design, and minor build/docs ignore updates.

Changes

Cohort / File(s) Summary
Build & Metadata
\.gitignore, DESCRIPTION
Re-added/updated ignore patterns (.lintr, .vscode); bumped RoxygenNote from 7.3.2 → 7.3.3.
Core docs & API surface
R/MSstatsConvert_core_functions.R, man/MSstatsAnomalyScores.Rd, man/MSstatsClean.Rd, man/dot-cleanRawDIANN.Rd
Expanded anomaly documentation; added calculateAnomalyScores and anomalyModelFeatures to S4/method signatures and .cleanRawDIANN docs.
DIANN cleaning
R/clean_DIANN.R
Added calculateAnomalyScores and anomalyModelFeatures params and propagated them into required-column selection logic.
DIANN converter
R/converters_DIANNtoMSstatsFormat.R, man/DIANNtoMSstatsFormat.Rd
Extended DIANNtoMSstatsFormat signature with multiple anomaly parameters; standardize feature names; propagate anomaly_metrics through MSstatsClean/MSstatsPreprocess/MSstatsBalancedDesign; changed filtering to zero Intensity for failed thresholds; conditional call to MSstatsAnomalyScores when enabled.
Anomaly engine (R)
R/utils_anomaly_score.R
Rewrote CuSum-like builders to stateful, NA-aware implementations; rebuilt runAnomalyModel to compute scores on a model_input summary then merge results back into original data.
Anomaly engine (C++)
src/isolation_forest.cpp
Added large commented reference/alternative Isolation Forest implementation and notes; active interfaces unchanged.
Man pages
man/DIANNtoMSstatsFormat.Rd
Updated Rd to reflect new public parameters and reorganized parameter descriptions.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DIANNConverter as DIANNtoMSstatsFormat
    participant Cleaner as MSstatsClean
    participant Preprocessor as MSstatsPreprocess
    participant BalancedDesign as MSstatsBalancedDesign
    participant AnomalyDetector as MSstatsAnomalyScores
    participant Output

    User->>DIANNConverter: input + anomaly params
    activate DIANNConverter

    DIANNConverter->>Cleaner: call (calculateAnomalyScores, anomalyModelFeatures)
    activate Cleaner
    Cleaner->>Cleaner: filter or zero Intensity per thresholds
    Cleaner->>DIANNConverter: cleaned data
    deactivate Cleaner

    DIANNConverter->>Preprocessor: call (anomaly_metrics)
    activate Preprocessor
    Preprocessor->>DIANNConverter: preprocessed data
    deactivate Preprocessor

    DIANNConverter->>BalancedDesign: call (anomaly_metrics)
    activate BalancedDesign
    BalancedDesign->>DIANNConverter: balanced data
    deactivate BalancedDesign

    alt calculateAnomalyScores == TRUE
        DIANNConverter->>AnomalyDetector: request scores (features, temporal, params)
        activate AnomalyDetector
        AnomalyDetector->>AnomalyDetector: compute stateful metrics (mean_increase, etc.)
        AnomalyDetector->>AnomalyDetector: run isolation forest per feature-group
        AnomalyDetector->>DIANNConverter: merged AnomalyScores
        deactivate AnomalyDetector
    end

    DIANNConverter->>User: final output
    deactivate DIANNConverter
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • mstaniak

Poem

🐰 I hop through rows of intensity and name,
I count the trees where outliers came.
With NA-aware paws and stateful cheer,
I mark the strange so truth appears.
Anomaly scores — a carrot bright and clear! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Feature diann anomaly' is vague and generic, using non-descriptive terms that don't clearly convey the specific changes or main point of the changeset. Consider a more descriptive title such as 'Add anomaly detection support to DIANN converter' or 'Implement quality metrics and anomaly scoring for DIANN' that better reflects the main changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description provides motivation, lists key changes, and completes the required checklist items. However, the Testing section is incomplete with placeholder text requesting details rather than actual test information.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature-diann-anomaly

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
man/dot-cleanRawDIANN.Rd (1)

10-22: ⚠️ Potential issue | 🟡 Minor

Add missing @param documentation for new parameters in the source file.

The function signature in R/clean_DIANN.R includes calculateAnomalyScores and anomalyModelFeatures, but the roxygen2 @param tags for these parameters are missing. Add documentation like:

#' `@param` calculateAnomalyScores Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature.
#' `@param` anomalyModelFeatures character vector of quality metric column names to be used as features in the anomaly detection model.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@man/dot-cleanRawDIANN.Rd` around lines 10 - 22, Add roxygen `@param` entries
for the two missing parameters in the function documented by
man/dot-cleanRawDIANN.Rd (source in R/clean_DIANN.R): add a line documenting
calculateAnomalyScores (e.g., Default FALSE; when TRUE runs the anomaly
detection model and computes anomaly scores per feature) and a line documenting
anomalyModelFeatures (e.g., character vector of quality metric column names used
as features for the anomaly detection model), placing them with the other `@param`
tags so they are emitted into the Rd file for the function (cleanRawDIANN / the
function in R/clean_DIANN.R).
src/isolation_forest.cpp (1)

204-506: ⚠️ Potential issue | 🔴 Critical

Merge improvements from commented code or remove the entire commented block—it contains critical bug fixes to path length traversal.

The ~300-line commented block (lines 204-505) is not just alternative code; it fixes correctness issues in the active implementation:

  1. Path length traversal bug: The active path_length() function (lines 151-165) fails to respect the is_missing_split flag, instead unconditionally checking it->second < node->value. The commented version explicitly marks this as "IMPORTANT FIX" and properly handles missing splits separately from numeric splits, preventing incorrect anomaly score calculations.

  2. Suboptimal missing split probability: The active code uses a hardcoded 0.5 probability (with a TODO acknowledging this). The commented version implements calibrated probability based on missing data frequency (p_miss = min(0.5, missing_fraction)), which aligns with the algorithm's design principle.

  3. Maintenance burden: The commented code creates ambiguity about which implementation is correct, especially given that the commented version explicitly labels a fix for the active code.

Action: Either merge the improvements into the active code (recommended) or delete the commented block entirely. Do not leave both versions in the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/isolation_forest.cpp` around lines 204 - 506, The large commented block
contains critical fixes that must be applied or removed: merge the corrected
traversal logic from the commented path_length into the active path_length so it
respects IsolationTreeNode::is_missing_split (handle missing split vs numeric
split separately and use node->size when hitting a leaf), adopt the calibrated
missing-split probability p_miss = std::min(0.5, missing_fraction) and
missing-handling logic from the commented isolation_tree (use is_missing_split,
split_value generation, and left/right assignment), then delete the redundant
commented 300-line block to avoid ambiguity; update references to
isolation_tree, isolation_forest, and path_length accordingly.
R/MSstatsConvert_core_functions.R (1)

550-556: ⚠️ Potential issue | 🟡 Minor

Improve clarity of temporal feature condition check.

The comparison temporal_direction[i] != FALSE is semantically unclear for a character vector. While it happens to work (character values are not equal to logical FALSE), the intent is better expressed as checking whether the value is a recognized temporal direction or explicitly checking for non-empty/non-NA values.

Consider using:

if (!is.na(temporal_direction[i]) && nzchar(temporal_direction[i]))

or more explicitly:

if (temporal_direction[i] %in% c("mean_decrease", "mean_increase", "dispersion_increase"))

Note: The mutation of quality_metrics during the loop is safe because seq_along() creates a fixed sequence at loop entry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/MSstatsConvert_core_functions.R` around lines 550 - 556, Replace the
unclear comparison `temporal_direction[i] != FALSE` inside the loop that
iterates with `seq_along(quality_metrics)` by an explicit check for a
non-missing, non-empty temporal direction (e.g., use
`!is.na(temporal_direction[i]) && nzchar(temporal_direction[i])`) so the intent
is clear when appending to `quality_metrics`; keep the existing concatenation
logic that uses `paste0(quality_metrics[i], ".", temporal_direction[i])`.
R/utils_anomaly_score.R (1)

208-251: ⚠️ Potential issue | 🟠 Major

Close the parallel cluster on all exit paths.

Line 208 creates a cluster with parallel::makeCluster(cores), but there is no corresponding stopCluster before the return at line 251 or on error, which leaks worker processes on repeated calls.

Proposed fix
 .runAnomalyModel = function(input_data, 
                             n_trees, 
                             max_depth, 
                             cores,
                             split_column,
                             quality_metrics){
     
     function_environment = environment()
     cl = parallel::makeCluster(cores)
+    on.exit(parallel::stopCluster(cl), add = TRUE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/utils_anomaly_score.R` around lines 208 - 251, The cluster created with
parallel::makeCluster(cores) (variable cl) is never stopped; wrap cluster
lifecycle so stopCluster(cl) always runs (including on errors) by calling
on.exit(parallel::stopCluster(cl)) immediately after cl is created (or use
tryCatch/finally around parallel::parLapply) so the cluster is stopped on normal
return and on error; ensure this is added near the makeCluster call and before
any early returns, and keep parLapply, clusterExport, and subsequent merge logic
unchanged.
🧹 Nitpick comments (3)
R/clean_DIANN.R (1)

95-102: Consider validating that anomalyModelFeatures columns exist in the input.

When calculateAnomalyScores = TRUE, the code assumes all columns in anomalyModelFeatures exist in dn_input. If any are missing, line 102 will throw an error that may not clearly indicate which column is missing.

💡 Optional: Add validation with informative error message
     qual_cols <- if (calculateAnomalyScores) {
+        missing_cols <- setdiff(anomalyModelFeatures, names(dn_input))
+        if (length(missing_cols) > 0) {
+            stop("Anomaly model features not found in input: ", 
+                 paste(missing_cols, collapse = ", "))
+        }
         anomalyModelFeatures
     } else {
         c()
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/clean_DIANN.R` around lines 95 - 102, When calculateAnomalyScores is TRUE,
validate that every name in anomalyModelFeatures exists in dn_input before
building req_cols: compute missing <- setdiff(anomalyModelFeatures,
names(dn_input)) and if length(missing)>0 call stop() (or use a clear error via
the project's logging) with a message that lists the missing column names and
context (e.g., "Missing anomalyModelFeatures in dn_input: ..."); only proceed to
set qual_cols and return dn_input[, req_cols, with = FALSE] if no missing
columns are found.
R/converters_DIANNtoMSstatsFormat.R (1)

89-89: Remove leftover debug comments (# browser()).

These lines are debug artifacts and add noise in production code.

Also applies to: 108-108, 141-141, 159-159, 166-166, 173-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/converters_DIANNtoMSstatsFormat.R` at line 89, Remove all leftover debug
statements ("# browser()") from the R source to avoid noisy debug artifacts in
production; search the file converters_DIANNtoMSstatsFormat.R for any
occurrences (including the instances around the noted positions) and delete
those comment lines so functions like those in this file no longer contain
inline debug breakpoints.
R/utils_anomaly_score.R (1)

225-250: Avoid positional score assignment; map scores with an explicit key.

Line 247 uses unlist(model_results) and writes scores by position. This is brittle because it depends on implicit row ordering of model_input. Prefer returning a key (e.g., row id) from each worker and joining by that key.

Safer pattern
-    model_input = unique(input_data[, c(split_column, quality_metrics), with = FALSE])
+    model_input = unique(input_data[, c(split_column, quality_metrics), with = FALSE])
+    model_input[, row_id := .I]
@@
-            single_psm = model_input[model_input[[split_column]] == psm_list[[i]], 
-                                    ..quality_metrics]
+            single_psm = model_input[model_input[[split_column]] == psm_list[[i]], 
+                                     c("row_id", quality_metrics), with = FALSE]
@@
-            forest$anomaly_score
+            data.table::data.table(
+                row_id = single_psm$row_id,
+                AnomalyScores = forest$anomaly_score
+            )
         }
     )
     
-    model_input$AnomalyScores = unlist(model_results)
+    score_dt = data.table::rbindlist(model_results)
+    model_input = merge(model_input, score_dt, by = "row_id", all.x = TRUE)
 
-    input_data = merge(input_data, model_input, by = c(split_column, quality_metrics), all.x = TRUE)
+    input_data = merge(
+        input_data,
+        model_input[, c(split_column, quality_metrics, "AnomalyScores"), with = FALSE],
+        by = c(split_column, quality_metrics),
+        all.x = TRUE
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/utils_anomaly_score.R` around lines 225 - 250, The current
parallel::parLapply worker returns scores only and assigns them to model_input
by position (unlist(model_results)), which is brittle; update the worker (the
anonymous function passed to parLapply that references psm_list, split_column,
and calculate_anomaly_score) to return a named result containing the split key
(e.g., the psm identifier from psm_list[[i]] or a unique row id from
model_input) and the computed anomaly_score, then after parLapply build a result
table/data.frame mapping that key to anomaly_score and merge it back into
model_input/input_data using an explicit join by the split key (instead of
relying on positional unlist and model_input$AnomalyScores), ensuring you update
the merge call that currently uses merge(input_data, model_input, ...) to use
the new keyed result for a safe join.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 31: Update the documentation for the runOrder parameter in
converters_DIANNtoMSstatsFormat.R to reference DIA-NN instead of Spectronaut:
keep the description that runOrder is a two-column data.table with columns `Run`
and `Order` (where `Run` matches the run name output by DIA-NN and `Order` is an
integer) and that it is used to engineer temporal features in
`anomalyModelFeatureTemporal`.

In `@R/utils_anomaly_score.R`:
- Around line 82-92: The function .add_mean_increase (and the other helper
functions that write to element [1]) must early-return for empty inputs to avoid
creating length-1 outputs from length-0 inputs: at the top of .add_mean_increase
check if n <- length(quality_vector) is 0 and if so return numeric(0) (or
double(0)) immediately; apply the same guard to the other functions that assign
to [1] so an empty quality_vector yields an empty result rather than a length-1
vector.

---

Outside diff comments:
In `@man/dot-cleanRawDIANN.Rd`:
- Around line 10-22: Add roxygen `@param` entries for the two missing parameters
in the function documented by man/dot-cleanRawDIANN.Rd (source in
R/clean_DIANN.R): add a line documenting calculateAnomalyScores (e.g., Default
FALSE; when TRUE runs the anomaly detection model and computes anomaly scores
per feature) and a line documenting anomalyModelFeatures (e.g., character vector
of quality metric column names used as features for the anomaly detection
model), placing them with the other `@param` tags so they are emitted into the Rd
file for the function (cleanRawDIANN / the function in R/clean_DIANN.R).

In `@R/MSstatsConvert_core_functions.R`:
- Around line 550-556: Replace the unclear comparison `temporal_direction[i] !=
FALSE` inside the loop that iterates with `seq_along(quality_metrics)` by an
explicit check for a non-missing, non-empty temporal direction (e.g., use
`!is.na(temporal_direction[i]) && nzchar(temporal_direction[i])`) so the intent
is clear when appending to `quality_metrics`; keep the existing concatenation
logic that uses `paste0(quality_metrics[i], ".", temporal_direction[i])`.

In `@R/utils_anomaly_score.R`:
- Around line 208-251: The cluster created with parallel::makeCluster(cores)
(variable cl) is never stopped; wrap cluster lifecycle so stopCluster(cl) always
runs (including on errors) by calling on.exit(parallel::stopCluster(cl))
immediately after cl is created (or use tryCatch/finally around
parallel::parLapply) so the cluster is stopped on normal return and on error;
ensure this is added near the makeCluster call and before any early returns, and
keep parLapply, clusterExport, and subsequent merge logic unchanged.

In `@src/isolation_forest.cpp`:
- Around line 204-506: The large commented block contains critical fixes that
must be applied or removed: merge the corrected traversal logic from the
commented path_length into the active path_length so it respects
IsolationTreeNode::is_missing_split (handle missing split vs numeric split
separately and use node->size when hitting a leaf), adopt the calibrated
missing-split probability p_miss = std::min(0.5, missing_fraction) and
missing-handling logic from the commented isolation_tree (use is_missing_split,
split_value generation, and left/right assignment), then delete the redundant
commented 300-line block to avoid ambiguity; update references to
isolation_tree, isolation_forest, and path_length accordingly.

---

Nitpick comments:
In `@R/clean_DIANN.R`:
- Around line 95-102: When calculateAnomalyScores is TRUE, validate that every
name in anomalyModelFeatures exists in dn_input before building req_cols:
compute missing <- setdiff(anomalyModelFeatures, names(dn_input)) and if
length(missing)>0 call stop() (or use a clear error via the project's logging)
with a message that lists the missing column names and context (e.g., "Missing
anomalyModelFeatures in dn_input: ..."); only proceed to set qual_cols and
return dn_input[, req_cols, with = FALSE] if no missing columns are found.

In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 89: Remove all leftover debug statements ("# browser()") from the R
source to avoid noisy debug artifacts in production; search the file
converters_DIANNtoMSstatsFormat.R for any occurrences (including the instances
around the noted positions) and delete those comment lines so functions like
those in this file no longer contain inline debug breakpoints.

In `@R/utils_anomaly_score.R`:
- Around line 225-250: The current parallel::parLapply worker returns scores
only and assigns them to model_input by position (unlist(model_results)), which
is brittle; update the worker (the anonymous function passed to parLapply that
references psm_list, split_column, and calculate_anomaly_score) to return a
named result containing the split key (e.g., the psm identifier from
psm_list[[i]] or a unique row id from model_input) and the computed
anomaly_score, then after parLapply build a result table/data.frame mapping that
key to anomaly_score and merge it back into model_input/input_data using an
explicit join by the split key (instead of relying on positional unlist and
model_input$AnomalyScores), ensuring you update the merge call that currently
uses merge(input_data, model_input, ...) to use the new keyed result for a safe
join.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e347aea and 67a5090.

📒 Files selected for processing (11)
  • .gitignore
  • DESCRIPTION
  • R/MSstatsConvert_core_functions.R
  • R/clean_DIANN.R
  • R/converters_DIANNtoMSstatsFormat.R
  • R/utils_anomaly_score.R
  • man/DIANNtoMSstatsFormat.Rd
  • man/MSstatsAnomalyScores.Rd
  • man/MSstatsClean.Rd
  • man/dot-cleanRawDIANN.Rd
  • src/isolation_forest.cpp

Comment on lines +82 to +92
.add_mean_increase = function(quality_vector) {

n = length(quality_vector)
mean_increase = rep(NA_real_, n)

d = 0.5

for(k in 2:length(quality_vector)) {
# 5 is reference (3 sigma)
if (mean_increase[k] > 5){
mean_increase[k] = max(0,(quality_vector[k] - d), na.rm = TRUE)
} else {
mean_increase[k] = max(0,
(quality_vector[k] - d + mean_increase[k-1]),
na.rm = TRUE) # positive CuSum
h = 5

state = 0
mean_increase[1] = if (is.na(quality_vector[1])) NA_real_ else 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard empty input vectors before writing index 1.

At Line 91, Line 125, and Line 163, these functions write to [1] even when length(quality_vector) == 0, which can produce a length-1 output from an empty input.

Proposed fix
 .add_mean_increase = function(quality_vector) {
     n = length(quality_vector)
+    if (n == 0) return(numeric(0))
     mean_increase = rep(NA_real_, n)
@@
 .add_mean_decrease = function(quality_vector) {
     n = length(quality_vector)
+    if (n == 0) return(numeric(0))
     mean_decrease = rep(NA_real_, n)
@@
 .add_dispersion_increase = function(quality_vector) {
     n = length(quality_vector)
+    if (n == 0) return(numeric(0))
     dispersion_increase = rep(NA_real_, n)

Also applies to: 116-126, 151-169

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/utils_anomaly_score.R` around lines 82 - 92, The function
.add_mean_increase (and the other helper functions that write to element [1])
must early-return for empty inputs to avoid creating length-1 outputs from
length-0 inputs: at the top of .add_mean_increase check if n <-
length(quality_vector) is 0 and if so return numeric(0) (or double(0))
immediately; apply the same guard to the other functions that assign to [1] so
an empty quality_vector yields an empty result rather than a length-1 vector.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the DIA-NN import/conversion workflow to better support MSstats+ by carrying quality-metric columns through the pipeline and optionally generating anomaly scores, alongside several anomaly-model behavior adjustments and documentation updates.

Changes:

  • Extend DIANNtoMSstatsFormat() / DIANN cleaning to accept anomaly-scoring parameters and propagate quality-metric columns into preprocessing/balanced design and (optionally) MSstatsAnomalyScores().
  • Adjust anomaly feature engineering (CUSUM-like temporal features) and change anomaly scoring to operate on “unique rows” and merge scores back (removing prior clipping).
  • Update generated Rd docs and repo metadata/ignores (roxygen note bump, .lintr / .vscode ignored).

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/isolation_forest.cpp Adds a large commented-out alternate implementation (including a “missing split” traversal fix reference).
R/utils_anomaly_score.R Updates temporal feature functions and changes anomaly score calculation to de-duplicate rows and merge results back.
R/converters_DIANNtoMSstatsFormat.R Adds anomaly-scoring parameters to the DIA-NN converter and wires them into preprocessing/balanced design and optional scoring.
R/clean_DIANN.R Extends DIANN cleaning to optionally select additional quality-metric columns needed for anomaly scoring.
R/MSstatsConvert_core_functions.R Expands roxygen description for MSstatsAnomalyScores().
man/DIANNtoMSstatsFormat.Rd Documents new DIA-NN converter parameters related to anomaly scoring.
man/MSstatsClean.Rd Updates DIANN MSstatsClean() method usage to include anomaly-related arguments.
man/dot-cleanRawDIANN.Rd Updates .cleanRawDIANN() usage to include anomaly-related arguments.
man/MSstatsAnomalyScores.Rd Updates MSstatsAnomalyScores() description text.
DESCRIPTION Bumps RoxygenNote.
.gitignore Ignores .lintr and .vscode.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +204 to +231
// #include <vector>
// #include <string>
// #include <random>
// #include <memory>
// #include <unordered_map>
// #include <cmath>
// #include <Rcpp.h>

// using namespace Rcpp;

// const double EM_CONSTANT = 0.5772156649; // Euler-Mascheroni constant
// std::mt19937 gen(42); // Fixed random seed for reproducibility

// struct IsolationTreeNode {
// bool is_leaf;
// int size;
// std::string feature;
// double value;
// bool is_missing_split;
// std::unique_ptr<IsolationTreeNode> left;
// std::unique_ptr<IsolationTreeNode> right;

// IsolationTreeNode(int s)
// : is_leaf(true),
// size(s),
// feature(""),
// value(0.0),
// is_missing_split(false),
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added large commented-out duplicate implementation (including an alternate IsolationTreeNode / path_length) significantly increases noise and maintenance burden. If this code is meant to be kept, it should be moved behind a compile-time flag or into a separate reference document; otherwise please remove it from the source file.

Suggested change
// #include <vector>
// #include <string>
// #include <random>
// #include <memory>
// #include <unordered_map>
// #include <cmath>
// #include <Rcpp.h>
// using namespace Rcpp;
// const double EM_CONSTANT = 0.5772156649; // Euler-Mascheroni constant
// std::mt19937 gen(42); // Fixed random seed for reproducibility
// struct IsolationTreeNode {
// bool is_leaf;
// int size;
// std::string feature;
// double value;
// bool is_missing_split;
// std::unique_ptr<IsolationTreeNode> left;
// std::unique_ptr<IsolationTreeNode> right;
// IsolationTreeNode(int s)
// : is_leaf(true),
// size(s),
// feature(""),
// value(0.0),
// is_missing_split(false),

Copilot uses AI. Check for mistakes.
Comment on lines +396 to +427
// if (node->is_leaf) {
// return depth + node->size;
// }

// auto it = obs.find(node->feature);
// if (it == obs.end()) {
// // Feature absent from obs map. Fall back: stop at current depth.
// return depth;
// }

// double x = it->second;

// if (node->is_missing_split) {
// if (std::isnan(x)) {
// return path_length(node->left.get(), obs, depth + 1);
// }
// return path_length(node->right.get(), obs, depth + 1);
// }

// if (std::isnan(x)) {
// // Numeric split but x missing: choose a side deterministically.
// // Here: treat missing as going left.
// return path_length(node->left.get(), obs, depth + 1);
// }

// if (x < node->value) {
// return path_length(node->left.get(), obs, depth + 1);
// }

// return path_length(node->right.get(), obs, depth + 1);
// }

Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The active path_length() implementation ignores node->is_missing_split and doesn’t handle NaN traversal. Since isolation_tree() can create missing-value splits, this will route NaN observations incorrectly and produce wrong anomaly scores. Please implement missing-split-aware traversal in the live code (not commented out), or remove missing-split creation entirely.

Suggested change
// if (node->is_leaf) {
// return depth + node->size;
// }
// auto it = obs.find(node->feature);
// if (it == obs.end()) {
// // Feature absent from obs map. Fall back: stop at current depth.
// return depth;
// }
// double x = it->second;
// if (node->is_missing_split) {
// if (std::isnan(x)) {
// return path_length(node->left.get(), obs, depth + 1);
// }
// return path_length(node->right.get(), obs, depth + 1);
// }
// if (std::isnan(x)) {
// // Numeric split but x missing: choose a side deterministically.
// // Here: treat missing as going left.
// return path_length(node->left.get(), obs, depth + 1);
// }
// if (x < node->value) {
// return path_length(node->left.get(), obs, depth + 1);
// }
// return path_length(node->right.get(), obs, depth + 1);
// }
double path_length(const IsolationTreeNode* node,
const std::unordered_map<std::string, double>& obs,
int depth) {
if (node == nullptr) {
return depth;
}
if (node->is_leaf) {
return depth + node->size;
}
auto it = obs.find(node->feature);
if (it == obs.end()) {
// Feature absent from obs map. Fall back: stop at current depth.
return depth;
}
double x = it->second;
if (node->is_missing_split) {
if (std::isnan(x)) {
return path_length(node->left.get(), obs, depth + 1);
}
return path_length(node->right.get(), obs, depth + 1);
}
if (std::isnan(x)) {
// Numeric split but x missing: choose a side deterministically.
// Here: treat missing as going left.
return path_length(node->left.get(), obs, depth + 1);
}
if (x < node->value) {
return path_length(node->left.get(), obs, depth + 1);
}
return path_length(node->right.get(), obs, depth + 1);
}

Copilot uses AI. Check for mistakes.
#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).
#' @param removeMissingFeatures Remove features with missing values in more than this fraction of runs. Default is 0.5. Only used if calculateAnomalyScores=TRUE.
#' @param anomalyModelFeatureCount Feature selection for anomaly model. Anomaly detection works on the precursor-level and can be much slower if all features used. We will by default filter to the top-100 highest intensity features. This can be adjusted as necessary. To turn feature-selection off, set this value to a high number (e.g. 10000). Only used if calculateAnomalyScores=TRUE.
#' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal.
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for runOrder mentions that Run matches the run name output by Spectronaut, but this is the DIA-NN converter. Please update the wording to refer to DIA-NN run names to avoid confusing users.

Suggested change
#' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal.
#' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by DIA-NN (i.e. the value used in the Run column of the DIA-NN output) and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to +13
#' Clean raw Diann files
#' @param msstats_object an object of class `MSstatsDIANNFiles`.
#' @param MBR True if analysis was done with match between runs
#' @param quantificationColumn Use 'FragmentQuantCorrected'(default) column for quantified intensities for DIANN 1.8.x.
#' Use 'FragmentQuantRaw' for quantified intensities for DIANN 1.9.x.
#' Use 'auto' for quantified intensities for DIANN 2.x where each fragment intensity is a separate column, e.g. Fr0Quantity.
#' @return data.table
#' @importFrom stats na.omit
#' @keywords internal
.cleanRawDIANN <- function(msstats_object, MBR = TRUE,
quantificationColumn = "FragmentQuantCorrected") {
quantificationColumn = "FragmentQuantCorrected",
calculateAnomalyScores = FALSE,
anomalyModelFeatures = c()) {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roxygen docs for .cleanRawDIANN were not updated for the new arguments (calculateAnomalyScores, anomalyModelFeatures), which leaves them undocumented in the generated Rd. Please add @param entries describing these new parameters so the man page stays in sync with the function signature.

Copilot uses AI. Check for mistakes.
Comment on lines 520 to +526
#' Run Anomaly Model
#'
#' Detects anomalous measurements in mass spectrometry data using an isolation forest algorithm.
#' This function identifies unusual precursor measurements based on quality metrics and their
#' temporal patterns. For features with insufficient quality metric data, it assigns anomaly
#' scores based on the median score of similar features (same peptide and charge combination).
#' The model supports parallel processing for improved performance on large datasets.
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new MSstatsAnomalyScores roxygen description claims median-based fallback for insufficient quality metrics and parallel-processing support, but neither behavior is visible in the current implementation of MSstatsAnomalyScores / .prepareSpectronautAnomalyInput / .runAnomalyModel. Please either implement these behaviors or adjust the documentation to match what the function actually does.

Copilot uses AI. Check for mistakes.
Comment on lines 225 to +249
@@ -192,9 +244,9 @@
}
)

model_results = unlist(model_results)
# Clip anomaly scores to stop them from exploding
input_data$AnomalyScores = pmax(model_results, .001)

model_input$AnomalyScores = unlist(model_results)

input_data = merge(input_data, model_input, by = c(split_column, quality_metrics), all.x = TRUE)
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_input$AnomalyScores = unlist(model_results) assumes model_input rows are grouped/ordered identically to psm_list and the per-PSM subsets, which is not guaranteed (unique() can interleave PSMs). This can misassign scores to the wrong rows. Consider returning a data.table keyed by split_column (and an explicit row id within model_input) from each worker and rbind/join, or ordering model_input deterministically and assigning by index.

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +93
# browser()
input = MSstatsConvert::MSstatsClean(
input, MBR = MBR, quantificationColumn = quantificationColumn,
calculateAnomalyScores = calculateAnomalyScores,
anomalyModelFeatures = anomalyModelFeatures)
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple commented-out debugging statements (e.g., "# browser()") left in the converter. These should be removed before merging to avoid accidental debugging traps and reduce noise.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +64 to +81
DIANNtoMSstatsFormat = function(
input, annotation = NULL,
global_qvalue_cutoff = 0.01,
qvalue_cutoff = 0.01,
pg_qvalue_cutoff = 0.01,
useUniquePeptide = TRUE,
removeFewMeasurements = TRUE,
removeOxidationMpeptides = TRUE,
removeProtein_with1Feature = TRUE,
MBR = TRUE,
quantificationColumn = "FragmentQuantCorrected",
calculateAnomalyScores=FALSE, anomalyModelFeatures=c(),
anomalyModelFeatureTemporal=c(), removeMissingFeatures=.5,
anomalyModelFeatureCount=100,
runOrder=NULL, n_trees=100, max_depth="auto", numberOfCores=1,
use_log_file = TRUE, append = FALSE,
verbose = TRUE, log_file_path = NULL,
...) {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This converter now exposes anomaly-model parameters (calculateAnomalyScores, anomalyModelFeatures, temporal settings, etc.) but doesn’t validate them (unlike SpectronauttoMSstatsFormat, which calls .validateMSstatsConverterParameters). Please add the same validation here to catch cases like calculateAnomalyScores=TRUE with empty/length-mismatched feature vectors early with a clear error.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

#' Use 'auto' for quantified intensities for DIANN 2.x where each fragment intensity is a separate column, e.g. Fr0Quantity.
#' @param calculateAnomalyScores Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature. Used downstream to weigh measurements in differential analysis.
#' @param anomalyModelFeatures character vector of quality metric column names to be used as features in the anomaly detection model. List must not be empty if calculateAnomalyScores=TRUE.
#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The anomalyModelFeatureTemporal docs say values can be NULL, but NULL cannot appear as an element in a character vector (it gets dropped). Consider documenting a character sentinel (e.g., "none") or NA_character_ instead, and validate accordingly.

Suggested change
#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).
#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or a sentinel indicating no temporal feature engineering (e.g. `none` or `NA_character_`). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all indicate no temporal feature engineering).

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +172
@@ -105,8 +125,8 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL,
msg = '** MBR was not used to analyze the data. Now setting names and filtering'
msg_1 = paste0('-- Filtering on GlobalPGQValue < ', pg_qvalue_cutoff)
msg_2 = paste0('-- Filtering on GlobalQValue < ', qvalue_cutoff)
input = input[GlobalPGQValue < pg_qvalue_cutoff, ]
input = input[GlobalQValue < qvalue_cutoff, ]
input = input[GlobalPGQValue >= pg_qvalue_cutoff, Intensity := 0]
input = input[GlobalQValue >= qvalue_cutoff, Intensity := 0]
getOption("MSstatsLog")("INFO", msg)
getOption("MSstatsMsg")("INFO", msg)
getOption("MSstatsLog")("INFO", msg_1)
@@ -118,6 +138,7 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL,

feature_columns = c("PeptideSequence", "PrecursorCharge",
"FragmentIon", "ProductCharge")
# browser()
input = MSstatsConvert::MSstatsPreprocess(
input,
annotation,
@@ -132,15 +153,24 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL,
remove_features_with_few_measurements = removeFewMeasurements,
summarize_multiple_psms = max),
columns_to_fill = list(Fraction = 1,
IsotopeLabelType = "Light"))
IsotopeLabelType = "Light"),
anomaly_metrics = anomalyModelFeatures)
input[, Intensity := ifelse(Intensity == 0, NA, Intensity)]

# browser()
input = MSstatsConvert::MSstatsBalancedDesign(input, feature_columns,
fill_incomplete = FALSE,
fill_incomplete = TRUE,
handle_fractions = FALSE,
remove_few = removeFewMeasurements
remove_few = removeFewMeasurements,
anomaly_metrics = anomalyModelFeatures
)

# browser()
if (calculateAnomalyScores){
input = MSstatsConvert::MSstatsAnomalyScores(
input, anomalyModelFeatures, anomalyModelFeatureTemporal,
removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees,
max_depth, numberOfCores)
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DIANNtoMSstatsFormat test expectations (e.g., fixed nrow/ncol) will likely change with the updated filtering behavior and fill_incomplete=TRUE balanced design. Please update/add tinytest coverage for the new behavior, including a test case that exercises calculateAnomalyScores=TRUE and verifies the AnomalyScores column is added and aligned row-wise.

Copilot uses AI. Check for mistakes.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

Copilot AI commented Mar 2, 2026

@devonjkohler I've opened a new pull request, #120, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Mar 2, 2026

@devonjkohler I've opened a new pull request, #121, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
R/converters_DIANNtoMSstatsFormat.R (1)

89-89: Remove committed debug breakpoints.

Lines 89, 108, 141, 159, 166, and 173 still include # browser() artifacts. Please drop them before merge.

Also applies to: 108-108, 141-141, 159-159, 166-166, 173-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/converters_DIANNtoMSstatsFormat.R` at line 89, Remove all committed
debugging breakpoints by deleting the commented `# browser()` lines left in
converters_DIANNtoMSstatsFormat.R (the occurrences around the blocks that
include the functions/conversion logic in this file). Search for and remove
every `# browser()` instance (including those at the previously noted spots) so
no debug breakpoints remain; ensure the functions that call internal conversion
routines no longer contain any `# browser()` artifacts before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Around line 75-78: When calculateAnomalyScores is TRUE, add upfront validation
that anomalyModelFeatures is non-empty and that length(anomalyModelFeatures)
matches length(anomalyModelFeatureTemporal) (or that anomalyModelFeatureTemporal
is length 1 or same length) before any downstream processing; in practice,
inside the top-level function that accepts args (check the parameter block
containing calculateAnomalyScores, anomalyModelFeatures,
anomalyModelFeatureTemporal, e.g., the converter function in
R/converters_DIANNtoMSstatsFormat.R) add checks that throw informative errors
(stop(...)) if anomalyModelFeatures is NULL/length 0 or if the two feature
vectors are misaligned, and run these validations immediately after parsing
arguments (before any use around lines ~167-171 and ~168) so failures are early
and actionable.
- Line 31: The roxygen param tag for runOrder is written with backticks
(`@param`) so it isn't recognized; update the roxygen block to use a proper
`@param` tag (i.e., replace `` `@param` runOrder ...`` with `@param runOrder ...`)
for the runOrder parameter description (which references the runOrder argument
and its role in engineering temporal features via anomalyModelFeatureTemporal).

---

Nitpick comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 89: Remove all committed debugging breakpoints by deleting the commented
`# browser()` lines left in converters_DIANNtoMSstatsFormat.R (the occurrences
around the blocks that include the functions/conversion logic in this file).
Search for and remove every `# browser()` instance (including those at the
previously noted spots) so no debug breakpoints remain; ensure the functions
that call internal conversion routines no longer contain any `# browser()`
artifacts before merging.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67a5090 and 692546b.

📒 Files selected for processing (1)
  • R/converters_DIANNtoMSstatsFormat.R

Comment on lines +75 to +78
calculateAnomalyScores=FALSE, anomalyModelFeatures=c(),
anomalyModelFeatureTemporal=c(), removeMissingFeatures=.5,
anomalyModelFeatureCount=100,
runOrder=NULL, n_trees=100, max_depth="auto", numberOfCores=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add upfront validation for anomaly-model arguments.

When calculateAnomalyScores=TRUE, there is no early check that anomalyModelFeatures is non-empty and aligned with anomalyModelFeatureTemporal. This can fail late in Line 168+ with less actionable errors.

Proposed fix
     anomalyModelFeatures = .standardizeColnames(anomalyModelFeatures)
+    if (calculateAnomalyScores) {
+        if (length(anomalyModelFeatures) == 0) {
+            stop("anomalyModelFeatures must be non-empty when calculateAnomalyScores=TRUE.")
+        }
+
+        allowed_temporal = c("mean_decrease", "mean_increase", "dispersion_increase", NA_character_)
+        if (length(anomalyModelFeatureTemporal) > 0) {
+            if (length(anomalyModelFeatureTemporal) != length(anomalyModelFeatures)) {
+                stop("anomalyModelFeatureTemporal must have the same length as anomalyModelFeatures.")
+            }
+            if (!all(anomalyModelFeatureTemporal %in% allowed_temporal)) {
+                stop("anomalyModelFeatureTemporal values must be one of mean_decrease, mean_increase, dispersion_increase, or NA/NULL.")
+            }
+        }
+    }

Also applies to: 85-86, 167-171

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/converters_DIANNtoMSstatsFormat.R` around lines 75 - 78, When
calculateAnomalyScores is TRUE, add upfront validation that anomalyModelFeatures
is non-empty and that length(anomalyModelFeatures) matches
length(anomalyModelFeatureTemporal) (or that anomalyModelFeatureTemporal is
length 1 or same length) before any downstream processing; in practice, inside
the top-level function that accepts args (check the parameter block containing
calculateAnomalyScores, anomalyModelFeatures, anomalyModelFeatureTemporal, e.g.,
the converter function in R/converters_DIANNtoMSstatsFormat.R) add checks that
throw informative errors (stop(...)) if anomalyModelFeatures is NULL/length 0 or
if the two feature vectors are misaligned, and run these validations immediately
after parsing arguments (before any use around lines ~167-171 and ~168) so
failures are early and actionable.

@devonjkohler devonjkohler merged commit 01dd914 into devel Mar 2, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants