[cudamapper] Filtering for self mappings, identical overlaps, and highly-similar overlaps#563
[cudamapper] Filtering for self mappings, identical overlaps, and highly-similar overlaps#563edawson wants to merge 5 commits into
Conversation
Adds a post-processing filter for full-length self mappings. This is needed as self mappings cause miniasm to produce empty assemblies.
…erlaps in postprocessing. Adds a CLI option and functions for filtering overlaps that either have identical query starts/ends or which have a percent reciprocal identity greater than a user-defined threshold.
| {"preserve-self-mappings", no_argument, 0, 'S'}, | ||
| {"max-reciprocal", required_argument, 0, 'Z'}, |
There was a problem hiding this comment.
These parameters should also be added to ApplicationParameters::help()
| if (application_parameters.all_to_all && application_parameters.filter_self_mappings) | ||
| { | ||
| GW_NVTX_RANGE(profiler, "main::postprocess_and_write_thread::remove_self_mappings"); | ||
| ::claraparabricks::genomeworks::cudamapper::details::overlapper::filter_self_mappings(overlaps, *application_parameters.query_parser, *application_parameters.target_parser, 0.9); |
There was a problem hiding this comment.
Is it necessary to qualify the function this day? Also, is function really a detail if it is used in a completely different (top level) file
| /// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector | ||
| /// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output. | ||
| static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false); | ||
| /// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps. |
There was a problem hiding this comment.
I find 0.0 a bit confusing. If I understand it correctly passing 0.75 would mean that any pair that has more than 75% overlap will get filtered out (i.e. one of those two overlaps). Wouldn't it than make sense to say that specifying 1.0 means only identical (= 100% overlapped) overlaps are filtered out and >1.0 that nothing gets filtered out as two overlaps cannot overlap more than 100%?
| /// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector | ||
| /// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output. | ||
| static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false); | ||
| /// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps. |
There was a problem hiding this comment.
Not sure how complicated would it be to specify which overlap gets filtered out (e.g. the one with smaller starting position, longer/shorter one or something)
| int32_t min_overlap_len = 250; // l, recommended range: 100 - 1000 | ||
| int32_t min_bases_per_residue = 1000; // b | ||
| float min_overlap_fraction = 0.8; // z | ||
| float max_reciprocal = -1.0; // Z, < 0 : no filtering. 0.0: only identical filtering. > 1.0: filter by percent reciprocity |
There was a problem hiding this comment.
For -F we use 0.0 - 1.0 range, not 1.0 - 100.0 (which should actually probably be 0.0 - 100.0)
| { | ||
| return 0.0; | ||
| } | ||
| int32_t query_overlap = std::min(a.query_end_position_in_read_, b.query_end_position_in_read_) - std::max(a.query_start_position_in_read_, b.query_start_position_in_read_); |
There was a problem hiding this comment.
This and similar should be position_in_read_t or even better number_of_basepairs_t
This PR brings in filtering strategies that mimic those of minimap2.
-S).