Add `--reference-bonus` to bias surjection towards reference paths #4786

JosephLalli · 2026-01-05T05:59:47Z

This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score.

Changes:

Add vg surject --reference-bonus N (default: 0) CLI flag and pass the value into the surjector.
Store reference_bonus on Surjector and apply it when choosing the primary strand, biasing toward PathSense::REFERENCE paths.
Update help text to document the new flag.

Behavior:

Adds N to the effective alignment score of any path with PathSense::REFERENCE before comparing it to the current best path.
- --reference-bonus 1: Biases strictly tied scores towards the reference.
- --reference-bonus >1: Requires alternate paths to strictly outperform the reference by at least this amount to be selected.

Files Changed:

src/subcommand/surject_main.cpp: Add option, help text, parsing, and pass-through to Surjector.
src/surjector.hpp: Add reference_bonus member and apply it in choose_primary_strand scoring.

This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score. The specific use case I am thinking of is when someone wants to surject to a reference genome and a personalized pangenome path to view insertion coverage and call insertion variants. You want basically all reads to align to the reference unless it is very difficult to do so; this allows us to begin to fine-tune exactly what "very difficult" means.

Change overview

- Add vg surject --reference-bonus N (default 0)
- Plumb reference_bonus into Surjector
- Apply the bonus during path selection only by biasing toward PathSense::REFERENCE paths

Behavior

When comparing candidate paths for primary placement:

- If a candidate path has PathSense::REFERENCE, compare using effective_score = score + reference_bonus
- Otherwise compare using effective_score = score
- --reference-bonus 0 should preserve existing behavior
- --reference-bonus 1 should effectively serve as a tie breaker biasing towards reference surjection
- --reference-bonus >1 requires alternates to beat reference by at least N to be selected
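The comparison rule above can be sketched as follows. This is a minimal illustration, not the code in the PR: `Candidate`, the local `PathSense` enum, and `choose_with_reference_bonus` are hypothetical stand-ins, and the first-best-wins loop is an assumption about how ties fall out of iteration order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not vg's actual types.
enum class PathSense { REFERENCE, GENERIC, HAPLOTYPE };

struct Candidate {
    std::string path_name;
    PathSense sense;
    int32_t score;  // true alignment score; never modified by selection
};

// Returns the index of the winning candidate, or -1 if there are none.
// The bonus only enters the comparison; Candidate::score is left untouched.
int choose_with_reference_bonus(const std::vector<Candidate>& candidates,
                                int32_t reference_bonus) {
    int best = -1;
    int64_t best_effective = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        int64_t effective = candidates[i].score;
        if (candidates[i].sense == PathSense::REFERENCE) {
            effective += reference_bonus;
        }
        // Strict comparison (assumed): on an effective tie, the earlier
        // candidate in iteration order keeps the win.
        if (best == -1 || effective > best_effective) {
            best = static_cast<int>(i);
            best_effective = effective;
        }
    }
    return best;
}
```

With a haplotype candidate listed before a reference candidate at the same score, a bonus of 0 in this sketch keeps the haplotype and a bonus of 1 switches to the reference.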

Implementation notes

The bonus is used only for selection; it is not supposed to modify the alignment score itself, only the “which path wins” comparison.
The change is localized to path choice logic (choose_primary_strand).

Questions for you all

Is this pull request even formatted properly?
Is PathSense::REFERENCE the right signal here, or is there a more appropriate notion of “reference-ness” in surject?
Is applying the adjustment in choose_primary_strand the correct level, or should it live in a different selection stage?
Any objections to the semantics of “bonus added only for selection, not reported score”?
Does choose_primary_strand cover the actual problematic case? (Tie/near-tie between multiple overlapping paths.)
Score interpretation: ensure we’re only biasing selection and not accidentally affecting MAPQ, AS tags, or anything downstream that assumes the score is “true.”
PathSense availability: confirm all candidate paths have reliable PathSense classification in the relevant code path.
Determinism: bonus reduces tie ambiguity for ref-vs-alt, but ties among multiple reference paths may still be order-dependent. Is that acceptable?
Input validation: should we explicitly check for basic input sanity (positive value, <255, etc.)?

* Add reference bonus biasing to surjection Authored-by: JosephLalli <[email protected]>

adamnovak · 2026-01-05T21:53:46Z

This PR seems to be formatted about right, but we like to have a proposed changelog entry bullet point, which is not here.

The description of the changes could be more succinct. Some of it is redundant with the code or the GitHub metadata about what code is changed, or even with itself (there are two "Behavior" sections, as well as a "Changes", a "Change Overview", and a "Files Changed"). Some of it seems to be just tracing through implications of what's already described (and in a way that isn't quite true; see my later comment about creating new ties). Reading it, I get the sense that some of the text may have come off of a text generator without being edited for communicativeness.

The code changes look OK.

It looks to me like the score which gets set here to the new adjusted score is still local within that function and isn't propagated elsewhere, so I don't think this is introducing any problems with unexpected effects on quality computation or reported scores.

This feature might let us have a winning alignment with a lower reported score than candidates it beats, which will interact excitingly with @Sagorikanag's upcoming diploid surjection work and various new quality scores. (Actually, Sagorika, you should probably look over this PR to make sure it isn't going to conflict in an important way with your new architecture there. Do you happen to know if choose_primary_strand() is responsible for choosing the winning placement, or if it will stay responsible for that in your new design? The comment for the function suggests it would be.)

While we're here, it might be worth changing the tie-breaking logic to be deterministically randomized, rather than depend solely on order. We have a sort_shuffling_ties() function we use for this in vg giraffe. But it might also make sense to skip that for now and put that in Sagorika's diploid surjection work instead.
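To illustrate what "deterministically randomized" tie-breaking could look like, here is a minimal sketch. It does not use or reproduce vg's sort_shuffling_ties(); the helper names and the choice of seeding from a hash of the read name are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a hash, used here so the result is the same on every platform
// (std::hash gives no such guarantee).
uint64_t fnv1a_64(const std::string& text) {
    uint64_t hash = 14695981039346656037ULL;
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ULL;
    }
    return hash;
}

// Hypothetical helper: returns the index of a best-scoring candidate.
// Ties are broken by a hash of the read name, so the choice does not depend
// on iteration order alone but is reproducible for a given read.
// Precondition: scores is non-empty.
std::size_t pick_breaking_ties_by_read(const std::vector<int>& scores,
                                       const std::string& read_name) {
    int best_score = scores[0];
    for (int score : scores) {
        if (score > best_score) {
            best_score = score;
        }
    }
    std::vector<std::size_t> tied;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] == best_score) {
            tied.push_back(i);
        }
    }
    return tied[fnv1a_64(read_name) % tied.size()];
}
```

Different reads can land on different tied candidates, while rerunning the same read always gives the same answer.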

I think REFERENCE sense for paths makes sense as the correct signal for whether a path is a reference; it should work fine in the use case you mention of dealing with insertions when surjecting to both a reference and a sample's haplotypes. But, you might want to consider the case of GRCh38 and its alt loci, where the reference assembly (GRCh38) has both normal contigs (like chr19) and "ALT contigs" (like chr19_KI270866v1_alt). A user might want to be able to prioritize mapping to the one over the other when surjecting, but vg would give both of them REFERENCE sense. (And I'm not sure there's a good way to get a GRCh38 graph with the alt contigs threaded into it anyway.)

I don't think there's a good reason to restrict to nonnegative bonuses, or to bonuses less than 255; the alignment score isn't an 8-bit number.

@JosephLalli, can you say a bit more about the use case for this? If what you want to do is break ties in favor of the reference placement, even using a 1-point bonus will not quite do that, because it can create new ties where the reference placement was previously 1 point worse. Those might end up breaking in favor of the reference or the haplotype path, depending on ordering, so at any nonzero bonus value, sometimes you will have reads mapping to the reference path when really the haplotype path gave a higher score.
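The new-tie problem can be shown with a small worked example. This is a sketch under assumptions: `Placement` and `pick_first_best` are hypothetical names, and the first-best-wins loop stands in for whatever ordering the real selection uses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate placement for illustration.
struct Placement {
    bool is_reference;
    int score;  // true alignment score
};

// First-best-wins selection with a reference bonus (assumed behavior).
// Precondition: placements is non-empty.
std::size_t pick_first_best(const std::vector<Placement>& placements, int bonus) {
    std::size_t best = 0;
    int best_effective = placements[0].score + (placements[0].is_reference ? bonus : 0);
    for (std::size_t i = 1; i < placements.size(); ++i) {
        int effective = placements[i].score + (placements[i].is_reference ? bonus : 0);
        if (effective > best_effective) {
            best = i;
            best_effective = effective;
        }
    }
    return best;
}

// Reference scores 50, haplotype scores 51. With bonus 1 both have an
// effective score of 51, so the winner depends only on the order.
const std::vector<Placement> reference_first = {{true, 50}, {false, 51}};
const std::vector<Placement> haplotype_first = {{false, 51}, {true, 50}};
```

In this sketch, `pick_first_best(reference_first, 1)` selects the reference even though its true score is lower, while `pick_first_best(haplotype_first, 1)` selects the haplotype.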

We also might want to have some tests in test/t/15_vg_surject.t for this new feature, ideally showing that some ties which didn't break the right way at bonus 0 start breaking in favor of the reference at bonus 1.

faithokamoto · 2026-01-16T23:04:42Z

Another option that might get you close enough to what you want would be to introduce a flag (or just make it the default behavior) so that tiebreaks always go to the reference if that is an option. That could be implemented a few ways:

- figure out the best reference & non-reference separately, and then choose the winner with reference winning ties
- add 0.5 to scores for the reference
- have a flag for whether the current best is a reference or not, and if you run into a tie where the flag is false but this new one is a reference, have it overwrite the best

etc.
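The third option above might look roughly like this. It is a minimal sketch with hypothetical names (`ScoredPath`, `pick_reference_on_ties`), not a patch against vg; it changes only who wins an exact tie and never lets a lower-scoring reference beat a higher-scoring alternate.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate for illustration.
struct ScoredPath {
    bool is_reference;
    int score;
};

// Picks the highest score; on an exact tie, a reference candidate overwrites
// a non-reference current best. No bonus is added to any score.
// Precondition: paths is non-empty.
std::size_t pick_reference_on_ties(const std::vector<ScoredPath>& paths) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < paths.size(); ++i) {
        bool strictly_better = paths[i].score > paths[best].score;
        bool reference_wins_tie = paths[i].score == paths[best].score &&
                                  paths[i].is_reference &&
                                  !paths[best].is_reference;
        if (strictly_better || reference_wins_tie) {
            best = i;
        }
    }
    return best;
}
```

Unlike a 1-point bonus, this cannot create new ties: a haplotype path that scores one point higher than the reference still wins regardless of order.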

JosephLalli · 2026-01-17T00:07:19Z

Yeah, sorry I haven’t gotten back around to this; too many irons in the fire at the moment.

The central idea (and I am likely prematurely optimizing this) is that I’d like to be able to surject to non-reference paths (i.e., those sampled by vg haplotypes). However, I also want to stick to the rivers and the lakes that I’m used to (a reference path) unless doing so results in a real loss of information. If a read maps better to the non-reference path because of a single SNP, the intended behavior would be to have that read still surjected to the reference path. Exactly how much worse should the alignment to the reference be before we think about surjecting to the non-reference path? I don’t know, hence the tunable penalty: I can empirically tune it using a few test cases I have in mind.

I can think of a few other heuristics (surject to the non-reference path if: the read would otherwise be dropped, the read would require soft clipping, ref surjection would result in a large insert size…) but I thought the penalty was a relatively straightforward starting heuristic.

Thanks for bringing this back to my attention @faithokamoto! I hope what I said makes some amount of sense. Who knows, you all have likely already considered this and come up with a solution!

Thanks,
Joe

Add --reference-bonus to bias surjection towards reference paths (#1)

840e1bc

* Add reference bonus biasing to surjection Authored-by: JosephLalli <[email protected]>

JosephLalli changed the title ~~Add --reference-bonus to bias surjection towards reference paths (#1)~~ Add --reference-bonus to bias surjection towards reference paths Jan 5, 2026
