Add --reference-bonus to bias surjection towards reference paths
#4786
base: master
* Add reference bonus biasing to surjection (Authored-by: JosephLalli)
--reference-bonus to bias surjection towards reference paths (#1)
|
This PR seems to be formatted about right, but we like to have a proposed changelog entry bullet point, which is not here.

The description of the changes could be more succinct. Some of it is redundant with the code or the GitHub metadata about what code is changed, or even with itself (there are two "Behavior" sections, as well as a "Changes", a "Change Overview", and a "Files Changed"). Some of it seems to be just tracing through implications of what's already described (and in a way that isn't quite true; see my later comment about creating new ties). Reading it, I get the sense that some of the text may have come off of a text generator without being edited for communicativeness.

The code changes look OK. It looks to me like the […]

This feature might let us have a winning alignment with a lower reported score than candidates it beats, which will interact excitingly with @Sagorikanag's upcoming diploid surjection work and various new quality scores. (Actually, Sagorika, you should probably look over this PR to make sure it isn't going to conflict in an important way with your new architecture there. Do you happen to know if […]

While we're here, it might be worth changing the tie-breaking logic to be deterministically randomized, rather than depend solely on order. We have a […] I think […]

I don't think there's a good reason to restrict to nonnegative bonuses, or to bonuses less than 255; the alignment score isn't an 8-bit number.

@JosephLalli, can you say a bit more about the use case for this? If what you want to do is break ties in favor of the reference placement, even using a 1-point bonus will not quite do that, because it can create new ties where the reference placement was previously 1 point worse. Those might end up breaking in favor of the reference or the haplotype path, depending on ordering, so at any nonzero bonus value, sometimes you will have reads mapping to the reference path when really the haplotype path gave a higher score.
We also might want to have some tests in |
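To make the tie-breaking suggestion above concrete, here is a minimal sketch of a deterministic, order-independent tie-break. It is not vg's code: the `Candidate` struct, `fnv1a`, and `choose_with_hashed_ties` are hypothetical names invented for illustration, and using an FNV-1a hash of the read name plus path name is just one assumed way to get a "deterministically randomized" choice. Scores are taken to already include any bonus, so a reference placement at 49 with a 1-point bonus and a haplotype placement at 50 arrive here as a 50/50 tie.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a surjection candidate; not a vg type.
struct Candidate {
    std::string path_name;
    int32_t effective_score;  // score after any bonus has been applied
};

// 64-bit FNV-1a, written out so the result does not depend on the
// standard library's std::hash implementation.
inline uint64_t fnv1a(const std::string& s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char ch : s) {
        h ^= ch;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Pick the highest-scoring candidate. Ties go to the candidate with the
// smallest hash of (read name, path name), so the outcome is reproducible
// and does not depend on the order the candidates are visited in.
inline std::size_t choose_with_hashed_ties(const std::string& read_name,
                                           const std::vector<Candidate>& candidates) {
    assert(!candidates.empty());
    std::size_t best = 0;
    uint64_t best_key = fnv1a(read_name + '\t' + candidates[0].path_name);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        uint64_t key = fnv1a(read_name + '\t' + candidates[i].path_name);
        int32_t score = candidates[i].effective_score;
        int32_t best_score = candidates[best].effective_score;
        if (score > best_score || (score == best_score && key < best_key)) {
            best = i;
            best_key = key;
        }
    }
    return best;
}
```

With this kind of rule, a new tie created by the bonus is resolved the same way on every run, but it can still go to either path, so it does not by itself guarantee that ties favor the reference.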
|
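It seems another option that might get you close enough to what you want would be to introduce a flag (or just make it the default behavior) so that tiebreaks always go to the reference if that is an option. That could be implemented a few ways:
* figure out the best reference & non-reference separately, and then choose the winner with reference winning ties
* add 0.5 to scores for the reference
* have a flag for whether the current best is a reference or not, and if you run into a tie where the flag is false but this new one is a reference, have it overwrite the best
etc.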
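As a rough illustration of the flag-based variant (tracking whether the current best is a reference path and letting a reference candidate overwrite it on an exact tie), here is a minimal sketch. The `Candidate` struct and `choose_reference_on_ties` are hypothetical names, and `is_reference` is an assumed stand-in for whatever reference check the surjector would actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a surjection candidate; not a vg type.
struct Candidate {
    std::string path_name;
    bool is_reference;  // assumed stand-in for a reference-path check
    int32_t score;      // unmodified alignment score
};

// Pick the highest-scoring candidate. On an exact tie, a reference
// candidate replaces a non-reference best; scores themselves are never
// changed, so no new ties are created.
inline std::size_t choose_reference_on_ties(const std::vector<Candidate>& candidates) {
    assert(!candidates.empty());
    std::size_t best = 0;
    bool best_is_reference = candidates[0].is_reference;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const Candidate& c = candidates[i];
        bool strictly_better = c.score > candidates[best].score;
        bool tie_goes_to_reference = c.score == candidates[best].score
                                     && c.is_reference && !best_is_reference;
        if (strictly_better || tie_goes_to_reference) {
            best = i;
            best_is_reference = c.is_reference;
        }
    }
    return best;
}
```

Comparing `2 * score + (is_reference ? 1 : 0)` would be an integer form of the "add 0.5" option and should give the same winner between a reference and a non-reference candidate.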
|
Yeah, sorry I haven’t gotten back around to this - too many irons in the fire at the moment. The central idea (and I am likely prematurely optimizing this) is that I’d like to be able to surject to non-reference paths (i.e., those sampled by vg haplotypes). However, I also want to stick to the rivers and the lakes that I’m used to (a reference path) unless doing so results in a real loss of information. If a read maps better to the non-reference path because of a single SNP, the intended behavior would be to have that read still surjected to the reference path.
Exactly how much worse should the alignment to the reference be before we think about surjecting to the non-reference path? I don’t know, hence the tunable penalty - I can empirically tune it using a few test cases I have in mind. I can think of a few other heuristics (surject to non-reference path if: read would otherwise be dropped, read would require soft clipping, ref surjection would result in large insert size…) but I thought the penalty was a relatively straightforward starting heuristic.
Thanks for bringing this back to my attention @faithokamoto! I hope what I said makes some amount of sense. Who knows, you all have likely already considered this and come up with a solution!
Thanks,
Joe
On Jan 16, 2026, at 5:05 PM, Faith Okamoto wrote:
faithokamoto left a comment (vgteam/vg#4786)
It seems another option that might get you close enough to what you want, would be to introduce a flag (or just make it the default behavior) that tiebreaks always go to the reference if that is an option. And that could be implemented a few ways:
* figure out the best reference & non-reference separately, and then choose the winner with reference winning ties
* add 0.5 to scores for the reference
* have a flag for whether the current best is a reference or not, and if you run into a tie where the flag is false but this new one is a reference, have it overwrite the best
etc.
|
This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score.

Changes:
* […] vg surject --reference-bonus N (default: 0) CLI flag and pass the value into the surjector.
* […] reference_bonus on Surjector and apply it when choosing the primary strand, biasing toward PathSense::REFERENCE paths.

Behavior:
* […] N to the effective alignment score of any path with PathSense::REFERENCE before comparing it to the current best path.
* --reference-bonus 1: Biases strictly tied scores towards the reference.
* --reference-bonus >1: Requires alternate paths to strictly outperform the reference by at least this amount to be selected.

Files Changed:
* src/subcommand/surject_main.cpp: Add option, help text, parsing, and pass-through to Surjector.
* src/surjector.hpp: Add reference_bonus member and apply it in choose_primary_strand scoring.

This PR introduces a scoring bonus for reference paths during the surjection path selection process. Currently, when a read aligns equally well to multiple overlapping paths (e.g., a reference path and an alternate haplotype path), vg surject selects one arbitrarily based on iteration order. This change allows users to ensure reads are placed on the reference path in the event of a tie or close score. The specific use case I am thinking of is when someone wants to surject to a reference genome and a personalized pangenome path to view insertion coverage and call insertion variants. You want basically all reads to align to the reference unless it is very difficult to do so; this allows us to begin to fine tune exactly what "very difficult" means.

Change overview
* […] vg surject --reference-bonus N (default 0)
* […] reference_bonus into Surjector
* […] PathSense::REFERENCE paths

Behavior
When comparing candidate paths for primary placement:
* […] PathSense::REFERENCE, compare using effective_score = score + reference_bonus
* […] effective_score = score
* --reference-bonus 0 should preserve existing behavior
* --reference-bonus 1 should effectively serve as a tie breaker biasing towards reference surjection
* --reference-bonus >1 requires alternates to beat reference by at least N to be selected

Implementation notes
* […] choose_primary_strand).

Questions for you all
* […] PathSense::REFERENCE the right signal here, or is there a more appropriate notion of “reference-ness” in surject?
* […] choose_primary_strand the correct level, or should it live in a different selection stage?
* […] choose_primary_strand cover the actual problematic case? (Tie/near-tie between multiple overlapping paths.)
* […] PathSense classification in the relevant code path.
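The comparison rule described above (effective_score = score + reference_bonus for reference paths, effective_score = score otherwise) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Candidate`, `effective_score`, and `pick_best_candidate` are hypothetical names, `is_reference` stands in for a PathSense::REFERENCE check, and the first-seen-wins tie handling is an assumption meant to mimic order-dependent selection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a surjection candidate; not a vg type.
struct Candidate {
    std::string path_name;
    bool is_reference;  // assumed stand-in for a PathSense::REFERENCE check
    int32_t score;      // reported alignment score, left unchanged
};

// Score used only for comparison between candidates.
inline int32_t effective_score(const Candidate& c, int32_t reference_bonus) {
    return c.is_reference ? c.score + reference_bonus : c.score;
}

// Return the index of the candidate with the highest effective score.
// The strict '>' keeps the first-seen candidate on ties, so ties still
// depend on iteration order in this sketch.
inline std::size_t pick_best_candidate(const std::vector<Candidate>& candidates,
                                       int32_t reference_bonus) {
    assert(!candidates.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (effective_score(candidates[i], reference_bonus) >
            effective_score(candidates[best], reference_bonus)) {
            best = i;
        }
    }
    return best;
}
```

For example, with a haplotype candidate at 53 listed before a reference candidate at 50, a bonus of 2 leaves the haplotype ahead, a bonus of 3 produces a 53/53 tie that this sketch resolves by order, and a bonus of 4 selects the reference even though its reported score is lower.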