Revert "feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser." #1946

LinZhihao-723 · 2026-02-02T01:03:52Z

Reverts #1273

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved timestamp parsing and handling in log conversion and archive writing.
Refactor
- Restructured internal log parsing and dictionary processing for improved maintainability and performance.
Chores
- Updated dependencies and removed unused library references.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…chema ru…" This reverts commit 4246ada.

coderabbitai · 2026-02-02T01:04:15Z

Walkthrough

This refactoring updates token handling and dictionary processing across the parser and archive writer, switching from optional/getter-based access patterns to pointer-based patterns for timestamps and type IDs. Changes include removing static text and escape methods from the public API, rewriting the archive message encoding logic with new token iteration and switch-based handling, and refactoring test infrastructure.

Changes

Cohort / File(s)	Summary
Schema and Configuration `components/core/config/schemas.txt`, `components/core/tests/test_schema_files/multiple_capture_groups.txt`, `components/core/tests/test_schema_files/single_capture_group.txt`, `components/core/tests/test_schema_files/search_schema.txt`	Removed support for multiple capture groups in schemas; adjusted regex patterns and deleted test schema files exercising multi-capture and legacy configurations.
Core Token and Type ID Access `components/core/src/clp/GrepCore.cpp`, `components/core/src/clp/Utils.cpp`	Replaced getter-based type ID access with direct pointer dereference (`m_type_ids_ptr->at(0)`); replaced `cTokenHeader` with `cTokenFirstTimestamp` and `cTokenNewlineTimestamp`; removed validation requiring at most one capture group per regex.
Dictionary Entry Interface (clp) `components/core/src/clp/LogTypeDictionaryEntry.hpp`, `components/core/src/clp/LogTypeDictionaryEntry.cpp`, `components/core/src/clp/LogTypeDictionaryEntryReq.hpp`	Removed public `add_static_text()` method; updated `parse_next_var()` signature from trailing `auto...->bool` to explicit `bool`; enhanced escape handling via lambda within `parse_next_var()` to track placeholder positions and increment escape counters; removed concept requirements for `add_escape()` and `add_static_text()`.
Dictionary Entry Interface (clp_s) `components/core/src/clp_s/DictionaryEntry.hpp`, `components/core/src/clp_s/DictionaryEntry.cpp`	Mirrored changes to clp: removed `add_static_text()` method; changed `parse_next_var()` signature and enhanced internal escape handling with explicit lambda-based placeholder tracking; updated include paths to relative references.
Log Conversion `components/core/src/clp_s/log_converter/LogConverter.cpp`	Changed timestamp handling from optional-based pattern (`has_value()`, `value()`) to null-pointer checks; timestamp now accessed via direct dereference of pointer rather than optional accessor methods.
Archive Writing Refactor `components/core/src/clp/streaming_archive/writer/Archive.cpp`, `components/core/src/clp/streaming_archive/writer/Archive.hpp`	Rewrote `write_msg_using_schema()` to iterate directly over log output buffer tokens using switch-based type dispatch (Newline, UncaughtString, Int, Float); replaced `add_token_to_dicts()` delegation with integrated encoding logic; changed parameter from `log_surgeon::LogEventView const& event` to `LogEventView const& log_view`; removed `add_token_to_dicts()` method; updated timestamp pattern handling to use pointer-based approach instead of `const_cast`.
Test Infrastructure `components/core/tests/test-ParserWithUserSchema.cpp`, `components/core/tests/test_log_files/log_with_capture.txt`	Refactored test includes and added new utility functions (`compress()`, `decompress()`); replaced legacy scaffolding with direct `FileReader` and `LogSurgeonReader` usage; updated token member access patterns; removed tests exercising multiple capture groups; adjusted directory resolution logic.
Dependency Configuration `taskfiles/deps/main.yaml`	Removed public variable `G_YSTDLIB_LIB_NAME`; updated ystdlib `LIB_NAME` from variable reference to literal; removed ystdlib as explicit dependency in log-surgeon section; updated log-surgeon tarball URL and SHA256.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes a revert of a previous feature, directly corresponding to the changeset's main objective of undoing capture group support.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch revert-1273-capture-support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 392-399: Replace the negated boolean checks that use the "!"
operator with the repository style "false == <expression>" form in the
Archive.cpp block where you call
EncodedVariableInterpreter::convert_string_to_representable_integer_var(token.to_string(),
encoded_var) and related conditional checks (also the similar checks around
lines 409-416 and 438-439); e.g., change `if
(!EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
to `if (false ==
EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
and apply the same pattern for other negations in this function that involve
encode/encode_var_dict_id/m_var_dict.add_entry logic to keep conditionals
consistent with project style.
- Around line 390-422: When TokenInt or TokenFloat falls back to the variable
dictionary (inside the branches for
static_cast<int>(log_surgeon::SymbolId::TokenInt) and TokenFloat) you currently
call m_var_dict.add_entry(...) and set encoded_var but you never record the
dictionary id in m_var_ids, so update_segment_indices misses these entries;
after m_var_dict.add_entry(token.to_string(), id) and after setting encoded_var
= EncodedVariableInterpreter::encode_var_dict_id(id), append the id to m_var_ids
(e.g. m_var_ids.push_back(id)) in both branches so the ids are tracked for
update_segment_indices.

In `@components/core/tests/test-ParserWithUserSchema.cpp`:
- Around line 159-181: The commented-out tests (the TEST_CASE blocks referencing
generate_log_parser and compress and the TODO notes) must be either tracked with
an issue or removed; open a tracking ticket and replace the commented blocks
with a single short TODO comment that includes the issue ID and the original
test names ("Test error for creating log file with delimiter in regex pattern" /
"Test error for missing log file") plus the referenced symbols
generate_log_parser and compress, or delete the commented tests entirely and add
a test-placeholder comment referencing the new issue so coverage gaps are
traceable and the intent to re-enable behavior is preserved.

coderabbitai · 2026-02-02T01:11:28Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

+            case static_cast<int>(log_surgeon::SymbolId::TokenInt): {
+                encoded_variable_t encoded_var;
+                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                            token.to_string(),
+                            encoded_var
+                    ))
+                {
+                    variable_dictionary_id_t id;
+                    m_var_dict.add_entry(token.to_string(), id);
+                    encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
+                    m_logtype_dict_entry.add_dictionary_var();
+                } else {
+                    m_logtype_dict_entry.add_int_var();
+                }
+                m_encoded_vars.push_back(encoded_var);
+                break;
+            }
+            case static_cast<int>(log_surgeon::SymbolId::TokenFloat): {
+                encoded_variable_t encoded_var;
+                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                            token.to_string(),
+                            encoded_var
+                    ))
+                {
+                    variable_dictionary_id_t id;
+                    m_var_dict.add_entry(token.to_string(), id);
+                    encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
+                    m_logtype_dict_entry.add_dictionary_var();
+                } else {
+                    m_logtype_dict_entry.add_float_var();
+                }
+                m_encoded_vars.push_back(encoded_var);
+                break;


⚠️ Potential issue | 🟠 Major

Dictionary IDs aren’t tracked for int/float fallbacks.
When an int/float token falls back to the variable dictionary, the ID isn’t added to m_var_ids, so update_segment_indices won’t record it. This can leave segment indices incomplete for those values.

✅ Suggested fix

if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_int_var(); } @@ if (!EncodedVariableInterpreter::convert_string_to_representable_float_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_float_var(); }

🤖 Prompt for AI Agents

In `@components/core/src/clp/streaming_archive/writer/Archive.cpp` around lines 390 - 422, When TokenInt or TokenFloat falls back to the variable dictionary (inside the branches for static_cast<int>(log_surgeon::SymbolId::TokenInt) and TokenFloat) you currently call m_var_dict.add_entry(...) and set encoded_var but you never record the dictionary id in m_var_ids, so update_segment_indices misses these entries; after m_var_dict.add_entry(token.to_string(), id) and after setting encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id), append the id to m_var_ids (e.g. m_var_ids.push_back(id)) in both branches so the ids are tracked for update_segment_indices.

coderabbitai · 2026-02-02T01:11:28Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

+                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                            token.to_string(),
+                            encoded_var
+                    ))
+                {
+                    variable_dictionary_id_t id;
+                    m_var_dict.add_entry(token.to_string(), id);
+                    encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);


🧹 Nitpick | 🔵 Trivial

Use false == instead of ! for these checks.
This aligns with the repo’s style rule and keeps conditionals consistent.

♻️ Suggested refactor

- if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var( + if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var( token.to_string(), encoded_var )) @@ - if (!EncodedVariableInterpreter::convert_string_to_representable_float_var( + if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var( token.to_string(), encoded_var )) @@ - if (!m_logtype_dict_entry.get_value().empty()) { + if (false == m_logtype_dict_entry.get_value().empty()) {

As per coding guidelines: Prefer false == <expression> rather than !<expression>.

Also applies to: 409-416, 438-439

🤖 Prompt for AI Agents

In `@components/core/src/clp/streaming_archive/writer/Archive.cpp` around lines 392 - 399, Replace the negated boolean checks that use the "!" operator with the repository style "false == <expression>" form in the Archive.cpp block where you call EncodedVariableInterpreter::convert_string_to_representable_integer_var(token.to_string(), encoded_var) and related conditional checks (also the similar checks around lines 409-416 and 438-439); e.g., change `if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))` to `if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))` and apply the same pattern for other negations in this function that involve encode/encode_var_dict_id/m_var_dict.add_entry logic to keep conditionals consistent with project style.

coderabbitai · 2026-02-02T01:11:28Z

components/core/tests/test-ParserWithUserSchema.cpp

+// TODO: This test doesn't currently work because delimiters are allowed in
+// schema files, and there is no option to disable this yet
+// TEST_CASE("Test error for creating log file with delimiter in regex pattern",
+//          "[LALR1Parser]SchemaParser]") {
+//    std::string file_path = "../tests/test_schema_files/schema_with_delimiter_in_regex_error.txt";
+//    std::string file_name = boost::filesystem::canonical(file_path).string();
+//    REQUIRE_THROWS_WITH(generate_log_parser(file_path),
+//                        file_name +
+//                        ":2: error: 'equals' has regex pattern which contains delimiter '='.\n"
+//                        + "          equals:.*=.*\n"
+//                        + "                 ^^^^^\n");
+//}
+
+// TODO: This error check is performed correctly by CLP, but it is handled by
+// something different now so this test will fail as is
+// TEST_CASE("Test error for missing log file", "[LALR1Parser][LogParser]") {
+//    std::string file_name = "../tests/test_log_files/missing_log.txt";
+//    std::string file_path = boost::filesystem::weakly_canonical(file_name).string();
+//    REQUIRE_THROWS(compress("../tests/test_archives", file_name,
+//                            "../tests/test_schema_files/schema_that_does_not_exist.txt"),
+//                   "Specified schema file does not exist.");
+//}
+


⚠️ Potential issue | 🟡 Minor

Track or remove the disabled tests.
Leaving tests commented out can create permanent coverage gaps; please link these TODOs to an issue or re-enable when the dependency is ready.

🧰 Tools

🪛 Cppcheck (2.19.0)

[style] 161-161: The function 'push_back_all' is never used.

(unusedFunction)

[style] 160-160: The function 'set_is_split' is never used.

(unusedFunction)

[style] 166-166: The function 'get_orig_path' is never used.

(unusedFunction)

[style] 168-168: The function 'get_orig_file_id' is never used.

(unusedFunction)

[style] 170-170: The function 'get_orig_file_id_as_string' is never used.

(unusedFunction)

[style] 176-176: The function 'get_id_as_string' is never used.

(unusedFunction)

[style] 178-178: The function 'get_begin_ts' is never used.

(unusedFunction)

[style] 180-180: The function 'get_end_ts' is never used.

(unusedFunction)

🤖 Prompt for AI Agents

In `@components/core/tests/test-ParserWithUserSchema.cpp` around lines 159 - 181, The commented-out tests (the TEST_CASE blocks referencing generate_log_parser and compress and the TODO notes) must be either tracked with an issue or removed; open a tracking ticket and replace the commented blocks with a single short TODO comment that includes the issue ID and the original test names ("Test error for creating log file with delimiter in regex pattern" / "Test error for missing log file") plus the referenced symbols generate_log_parser and compress, or delete the commented tests entirely and add a test-placeholder comment referencing the new issue so coverage gaps are traceable and the intent to re-enable behavior is preserved.

Revert "feat(log-surgeon)!: Support using a single capture group in s…

055738c

…chema ru…" This reverts commit 4246ada.

LinZhihao-723 requested review from a team and gibber9809 as code owners February 2, 2026 01:03

LinZhihao-723 marked this pull request as draft February 2, 2026 01:04

coderabbitai bot reviewed Feb 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser." #1946

Revert "feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser." #1946

Uh oh!

LinZhihao-723 commented Feb 2, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 2, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 2, 2026

Uh oh!

coderabbitai bot Feb 2, 2026

Uh oh!

coderabbitai bot Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Revert "feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser." #1946

Are you sure you want to change the base?

Revert "feat(log-surgeon)!: Support using a single capture group in schema rules to achieve parity with the heuristic parser." #1946

Uh oh!

Conversation

LinZhihao-723 commented Feb 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LinZhihao-723 commented Feb 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 2, 2026 •

edited

Loading