@JosephLalli

This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score.

Changes:

  • Add vg surject --reference-bonus N (default: 0) CLI flag and pass the value into the surjector.
  • Store reference_bonus on Surjector and apply it when choosing the primary strand, biasing toward PathSense::REFERENCE paths.
  • Update help text to document the new flag.

Behavior:

  • Adds N to the effective alignment score of any path with PathSense::REFERENCE before comparing it to the current best path.
    • --reference-bonus 1: Biases strictly tied scores towards the reference.
    • --reference-bonus >1: Requires alternate paths to strictly outperform the reference by at least this amount to be selected.

Files Changed:

  • src/subcommand/surject_main.cpp: Add option, help text, parsing, and pass-through to Surjector.
  • src/surjector.hpp: Add reference_bonus member and apply it in choose_primary_strand scoring.

This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score. The specific use case I am thinking of is when someone wants to surject to a reference genome and a personalized pangenome path to view insertion coverage and call insertion variants. You want basically all reads to align to the reference unless it is very difficult to do so; this allows us to begin to fine tune exactly what "very difficult" means.

Change overview

  • Add vg surject --reference-bonus N (default 0)
  • Plumb reference_bonus into Surjector
  • Apply the bonus during path selection only by biasing toward PathSense::REFERENCE paths

Behavior

When comparing candidate paths for primary placement:

  • If a candidate path has PathSense::REFERENCE, compare using effective_score = score + reference_bonus
  • Otherwise compare using effective_score = score
  • --reference-bonus 0 should preserve existing behavior
  • --reference-bonus 1 should effectively serve as a tie breaker biasing towards reference surjection
  • --reference-bonus >1 requires alternates to beat reference by at least N to be selected
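The comparison rule in the list above can be sketched as follows. This is a minimal illustration and not the actual code in choose_primary_strand: the `Candidate` struct, its `is_reference` field (standing in for a PathSense::REFERENCE check), and `pick_primary` are hypothetical names invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a candidate surjection; not a vg type.
struct Candidate {
    std::string path_name;
    int32_t score;      // true alignment score, never modified here
    bool is_reference;  // assumed to mean "path has PathSense::REFERENCE"
};

// Returns the index of the winning candidate, or -1 if there are none.
// The bonus only enters the comparison. The strict '>' means that the
// earliest candidate wins an exact tie, so ties stay order-dependent.
inline int pick_primary(const std::vector<Candidate>& candidates,
                        int32_t reference_bonus) {
    int best = -1;
    int64_t best_effective = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        int64_t effective = candidates[i].score;
        if (candidates[i].is_reference) {
            effective += reference_bonus;
        }
        if (best == -1 || effective > best_effective) {
            best = static_cast<int>(i);
            best_effective = effective;
        }
    }
    return best;
}
```

With a haplotype candidate listed before a reference candidate and both scoring 50, `pick_primary` returns the haplotype at bonus 0 and the reference at bonus 1, while the stored scores stay at 50 in both cases.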

Implementation notes

  • The bonus is used only for selection; it is not supposed to modify the alignment score itself, only the “which path wins” comparison.
  • The change is localized to path choice logic (choose_primary_strand).

Questions for you all

  • Is this pull request even formatted properly?
  • Is PathSense::REFERENCE the right signal here, or is there a more appropriate notion of “reference-ness” in surject?
  • Is applying the adjustment in choose_primary_strand the correct level, or should it live in a different selection stage?
  • Any objections to the semantics of “bonus added only for selection, not reported score”?
  • Does choose_primary_strand cover the actual problematic case? (Tie/near-tie between multiple overlapping paths.)
  • Score interpretation: ensure we’re only biasing selection and not accidentally affecting MAPQ, AS tags, or anything downstream that assumes the score is “true.”
  • PathSense availability: confirm all candidate paths have reliable PathSense classification in the relevant code path.
  • Determinism: bonus reduces tie ambiguity for ref-vs-alt, but ties among multiple reference paths may still be order-dependent. Is that acceptable?
  • Input validation: should we explicitly check for basic input sanity (positive value, <255, etc.)?

* Add reference bonus biasing to surjection

Authored-by: JosephLalli <[email protected]>
@JosephLalli JosephLalli changed the title from "Add --reference-bonus to bias surjection towards reference paths (#1)" to "Add --reference-bonus to bias surjection towards reference paths" on Jan 5, 2026
@adamnovak
Member

This PR seems to be formatted about right, but we like to have a proposed changelog entry bullet point, which is not here.

The description of the changes could be more succinct. Some of it is redundant with the code or the GitHub metadata about what code is changed, or even with itself (there are two "Behavior" sections, as well as a "Changes", a "Change Overview", and a "Files Changed"). Some of it seems to be just tracing through implications of what's already described (and in a way that isn't quite true; see my later comment about creating new ties). Reading it, I get the sense that some of the text may have come off of a text generator without being edited for communicativeness.

The code changes look OK.

It looks to me like the score which gets set here to the new adjusted score is still local within that function and isn't propagated elsewhere, so I don't think this is introducing any problems with unexpected effects on quality computation or reported scores.

This feature might let us have a winning alignment with a lower reported score than candidates it beats, which will interact excitingly with @Sagorikanag's upcoming diploid surjection work and various new quality scores. (Actually, Sagorika, you should probably look over this PR to make sure it isn't going to conflict in an important way with your new architecture there. Do you happen to know if choose_primary_strand() is responsible for choosing the winning placement, or if it will stay responsible for that in your new design? The comment for the function suggests it would be.)

While we're here, it might be worth changing the tie-breaking logic to be deterministically randomized, rather than depend solely on order. We have a sort_shuffling_ties() function we use for this in vg giraffe. But it might also make sense to skip that for now and put that in Sagorika's diploid surjection work instead.
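To illustrate what "deterministically randomized" could mean here, below is a rough sketch that picks among tied candidates using a hash of the read name, so the choice no longer depends on candidate order alone but is still reproducible from run to run. It does not use or imitate sort_shuffling_ties(); `fnv1a` and `pick_breaking_ties_by_read` are hypothetical helpers written only for this example, and using the read name as the seed is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a hash, written out so the result is the same on every
// platform (std::hash makes no such promise).
inline uint64_t fnv1a(const std::string& s) {
    uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Returns the index of a best-scoring candidate, or -1 if there are none.
// When several candidates share the best score, the read name decides
// which one wins, so the same read always gets the same answer.
inline int pick_breaking_ties_by_read(const std::vector<int32_t>& scores,
                                      const std::string& read_name) {
    if (scores.empty()) {
        return -1;
    }
    int32_t best = scores[0];
    for (int32_t s : scores) {
        if (s > best) {
            best = s;
        }
    }
    std::vector<int> tied;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] == best) {
            tied.push_back(static_cast<int>(i));
        }
    }
    return tied[fnv1a(read_name) % tied.size()];
}
```

A unique best score always wins regardless of the read name; the hash only matters when two or more candidates tie.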

I think REFERENCE sense for paths makes sense as the correct signal for whether a path is a reference; it should work fine in the use case you mention of dealing with insertions when surjecting to both a reference and a sample's haplotypes. But, you might want to consider the case of GRCh38 and its alt loci, where the reference assembly (GRCh38) has both normal contigs (like chr19) and "ALT contigs" (like chr19_KI270866v1_alt). A user might want to be able to prioritize mapping to the one over the other when surjecting, but vg would give both of them REFERENCE sense. (And I'm not sure there's a good way to get a GRCh38 graph with the alt contigs threaded into it anyway.)

I don't think there's a good reason to restrict to nonnegative bonuses, or to bonuses less than 255; the alignment score isn't an 8-bit number.

@JosephLalli, can you say a bit more about the use case for this? If what you want to do is break ties in favor of the reference placement, even using a 1-point bonus will not quite do that, because it can create new ties where the reference placement was previously 1 point worse. Those might end up breaking in favor of the reference or the haplotype path, depending on ordering, so at any nonzero bonus value, sometimes you will have reads mapping to the reference path when really the haplotype path gave a higher score.
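A small worked example of the new-tie problem, as a sketch only: `Outcome` and `compare_with_bonus` are hypothetical names, and the function simply classifies one reference-versus-haplotype comparison after the bonus is added to the reference score.

```cpp
#include <cstdint>

// Result of comparing one reference placement against one haplotype
// placement once the bonus has been added to the reference score.
enum class Outcome { ReferenceWins, AlternateWins, OrderDependentTie };

inline Outcome compare_with_bonus(int32_t ref_score, int32_t alt_score,
                                  int32_t bonus) {
    // Widen before adding so a large bonus cannot overflow.
    int64_t ref_effective = static_cast<int64_t>(ref_score) + bonus;
    if (ref_effective > alt_score) {
        return Outcome::ReferenceWins;
    }
    if (ref_effective < alt_score) {
        return Outcome::AlternateWins;
    }
    return Outcome::OrderDependentTie;
}
```

With scores of 50 and 50, a bonus of 1 turns the tie into a reference win, which is the intended effect. With a reference score of 49 and a haplotype score of 50, the same bonus turns a clear haplotype win into a tie (49 + 1 = 50) that is again resolved by iteration order.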

We also might want to have some tests in test/t/15_vg_surject.t for this new feature, ideally showing that some ties which didn't break the right way at bonus 0 start breaking in favor of the reference at bonus 1.

@faithokamoto
Contributor

Another option that might get you close enough to what you want would be to introduce a flag (or just make it the default behavior) so that tiebreaks always go to the reference if that is an option. That could be implemented a few ways:

  • figure out the best reference & non-reference separately, and then choose the winner with reference winning ties
  • add 0.5 to scores for the reference
  • have a flag for whether the current best is a reference or not, and if you run into a tie where the flag is false but this new one is a reference, have it overwrite the best

etc.
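One way to sketch the tie-break-only idea is to compare candidates lexicographically on (score, is reference), which behaves like adding 0.5 to reference scores without leaving integer arithmetic. `Placement` and `pick_with_reference_tiebreak` are hypothetical names for illustration and are not taken from the vg codebase.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a candidate placement; not a vg type.
struct Placement {
    int32_t score;
    bool is_reference;  // assumed to mean "path has PathSense::REFERENCE"
};

// Returns the index of the winner, or -1 if there are none. The score
// decides first; reference-ness is consulted only when scores are equal
// (false < true), so a strictly better haplotype placement never loses.
inline int pick_with_reference_tiebreak(
        const std::vector<Placement>& placements) {
    int best = -1;
    for (std::size_t i = 0; i < placements.size(); ++i) {
        if (best == -1) {
            best = static_cast<int>(i);
            continue;
        }
        const Placement& cur = placements[i];
        const Placement& top = placements[best];
        if (std::tie(cur.score, cur.is_reference) >
            std::tie(top.score, top.is_reference)) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Unlike an integer bonus, this cannot create new ties between a reference and a non-reference placement: a haplotype placement that scores even 1 point higher still wins. Ties among several reference paths remain order-dependent.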

@JosephLalli
Author

JosephLalli commented Jan 17, 2026 via email
