Skip to content

Conversation

@LinZhihao-723
Copy link
Member

@LinZhihao-723 LinZhihao-723 commented Feb 2, 2026

Reverts #1273

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved timestamp parsing and handling in log conversion and archive writing.
  • Refactor

    • Restructured internal log parsing and dictionary processing for improved maintainability and performance.
  • Chores

    • Updated dependencies and removed unused library references.

✏️ Tip: You can customize this high-level summary in your review settings.

@LinZhihao-723 LinZhihao-723 requested review from a team and gibber9809 as code owners February 2, 2026 01:03
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 2, 2026

Walkthrough

This refactoring updates token handling and dictionary processing across the parser and archive writer, switching from optional/getter-based access patterns to pointer-based patterns for timestamps and type IDs. Changes include removing static text and escape methods from the public API, rewriting the archive message encoding logic with new token iteration and switch-based handling, and refactoring test infrastructure.

Changes

Cohort / File(s) Summary
Schema and Configuration
components/core/config/schemas.txt, components/core/tests/test_schema_files/multiple_capture_groups.txt, components/core/tests/test_schema_files/single_capture_group.txt, components/core/tests/test_schema_files/search_schema.txt
Removed support for multiple capture groups in schemas; adjusted regex patterns and deleted test schema files exercising multi-capture and legacy configurations.
Core Token and Type ID Access
components/core/src/clp/GrepCore.cpp, components/core/src/clp/Utils.cpp
Replaced getter-based type ID access with direct pointer dereference (m_type_ids_ptr->at(0)); replaced cTokenHeader with cTokenFirstTimestamp and cTokenNewlineTimestamp; removed validation requiring at most one capture group per regex.
Dictionary Entry Interface (clp)
components/core/src/clp/LogTypeDictionaryEntry.hpp, components/core/src/clp/LogTypeDictionaryEntry.cpp, components/core/src/clp/LogTypeDictionaryEntryReq.hpp
Removed public add_static_text() method; updated parse_next_var() signature from trailing auto...->bool to explicit bool; enhanced escape handling via lambda within parse_next_var() to track placeholder positions and increment escape counters; removed concept requirements for add_escape() and add_static_text().
Dictionary Entry Interface (clp_s)
components/core/src/clp_s/DictionaryEntry.hpp, components/core/src/clp_s/DictionaryEntry.cpp
Mirrored changes to clp: removed add_static_text() method; changed parse_next_var() signature and enhanced internal escape handling with explicit lambda-based placeholder tracking; updated include paths to relative references.
Log Conversion
components/core/src/clp_s/log_converter/LogConverter.cpp
Changed timestamp handling from optional-based pattern (has_value(), value()) to null-pointer checks; timestamp now accessed via direct dereference of pointer rather than optional accessor methods.
Archive Writing Refactor
components/core/src/clp/streaming_archive/writer/Archive.cpp, components/core/src/clp/streaming_archive/writer/Archive.hpp
Rewrote write_msg_using_schema() to iterate directly over log output buffer tokens using switch-based type dispatch (Newline, UncaughtString, Int, Float); replaced add_token_to_dicts() delegation with integrated encoding logic; changed parameter from log_surgeon::LogEventView const& event to LogEventView const& log_view; removed add_token_to_dicts() method; updated timestamp pattern handling to use pointer-based approach instead of const_cast.
Test Infrastructure
components/core/tests/test-ParserWithUserSchema.cpp, components/core/tests/test_log_files/log_with_capture.txt
Refactored test includes and added new utility functions (compress(), decompress()); replaced legacy scaffolding with direct FileReader and LogSurgeonReader usage; updated token member access patterns; removed tests exercising multiple capture groups; adjusted directory resolution logic.
Dependency Configuration
taskfiles/deps/main.yaml
Removed public variable G_YSTDLIB_LIB_NAME; updated ystdlib LIB_NAME from variable reference to literal; removed ystdlib as explicit dependency in log-surgeon section; updated log-surgeon tarball URL and SHA256.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes a revert of a previous feature, directly corresponding to the changeset's main objective of undoing capture group support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch revert-1273-capture-support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LinZhihao-723 LinZhihao-723 marked this pull request as draft February 2, 2026 01:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp`:
- Around line 392-399: Replace the negated boolean checks that use the "!"
operator with the repository style "false == <expression>" form in the
Archive.cpp block where you call
EncodedVariableInterpreter::convert_string_to_representable_integer_var(token.to_string(),
encoded_var) and related conditional checks (also the similar checks around
lines 409-416 and 438-439); e.g., change `if
(!EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
to `if (false ==
EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
and apply the same pattern for other negations in this function that involve
encode/encode_var_dict_id/m_var_dict.add_entry logic to keep conditionals
consistent with project style.
- Around line 390-422: When TokenInt or TokenFloat falls back to the variable
dictionary (inside the branches for
static_cast<int>(log_surgeon::SymbolId::TokenInt) and TokenFloat) you currently
call m_var_dict.add_entry(...) and set encoded_var but you never record the
dictionary id in m_var_ids, so update_segment_indices misses these entries;
after m_var_dict.add_entry(token.to_string(), id) and after setting encoded_var
= EncodedVariableInterpreter::encode_var_dict_id(id), append the id to m_var_ids
(e.g. m_var_ids.push_back(id)) in both branches so the ids are tracked for
update_segment_indices.

In `@components/core/tests/test-ParserWithUserSchema.cpp`:
- Around line 159-181: The commented-out tests (the TEST_CASE blocks referencing
generate_log_parser and compress and the TODO notes) must be either tracked with
an issue or removed; open a tracking ticket and replace the commented blocks
with a single short TODO comment that includes the issue ID and the original
test names ("Test error for creating log file with delimiter in regex pattern" /
"Test error for missing log file") plus the referenced symbols
generate_log_parser and compress, or delete the commented tests entirely and add
a test-placeholder comment referencing the new issue so coverage gaps are
traceable and the intent to re-enable behavior is preserved.

Comment on lines +390 to +422
case static_cast<int>(log_surgeon::SymbolId::TokenInt): {
encoded_variable_t encoded_var;
if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
token.to_string(),
encoded_var
))
{
variable_dictionary_id_t id;
m_var_dict.add_entry(token.to_string(), id);
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
m_logtype_dict_entry.add_dictionary_var();
} else {
m_logtype_dict_entry.add_int_var();
}
m_encoded_vars.push_back(encoded_var);
break;
}
case static_cast<int>(log_surgeon::SymbolId::TokenFloat): {
encoded_variable_t encoded_var;
if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
token.to_string(),
encoded_var
))
{
variable_dictionary_id_t id;
m_var_dict.add_entry(token.to_string(), id);
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
m_logtype_dict_entry.add_dictionary_var();
} else {
m_logtype_dict_entry.add_float_var();
}
m_encoded_vars.push_back(encoded_var);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Dictionary IDs aren’t tracked for int/float fallbacks.
When an int/float token falls back to the variable dictionary, the ID isn’t added to m_var_ids, so update_segment_indices won’t record it. This can leave segment indices incomplete for those values.

✅ Suggested fix
                 if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id;
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_int_var();
                 }
@@
                 if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id;
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_float_var();
                 }
🤖 Prompt for AI Agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp` around lines
390 - 422, When TokenInt or TokenFloat falls back to the variable dictionary
(inside the branches for static_cast<int>(log_surgeon::SymbolId::TokenInt) and
TokenFloat) you currently call m_var_dict.add_entry(...) and set encoded_var but
you never record the dictionary id in m_var_ids, so update_segment_indices
misses these entries; after m_var_dict.add_entry(token.to_string(), id) and
after setting encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id),
append the id to m_var_ids (e.g. m_var_ids.push_back(id)) in both branches so
the ids are tracked for update_segment_indices.

Comment on lines +392 to +399
if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
token.to_string(),
encoded_var
))
{
variable_dictionary_id_t id;
m_var_dict.add_entry(token.to_string(), id);
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use false == instead of ! for these checks.
This aligns with the repo’s style rule and keeps conditionals consistent.

♻️ Suggested refactor
-                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
@@
-                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
@@
-    if (!m_logtype_dict_entry.get_value().empty()) {
+    if (false == m_logtype_dict_entry.get_value().empty()) {

As per coding guidelines: Prefer false == <expression> rather than !<expression>.

Also applies to: 409-416, 438-439

🤖 Prompt for AI Agents
In `@components/core/src/clp/streaming_archive/writer/Archive.cpp` around lines
392 - 399, Replace the negated boolean checks that use the "!" operator with the
repository style "false == <expression>" form in the Archive.cpp block where you
call
EncodedVariableInterpreter::convert_string_to_representable_integer_var(token.to_string(),
encoded_var) and related conditional checks (also the similar checks around
lines 409-416 and 438-439); e.g., change `if
(!EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
to `if (false ==
EncodedVariableInterpreter::convert_string_to_representable_integer_var(...))`
and apply the same pattern for other negations in this function that involve
encode/encode_var_dict_id/m_var_dict.add_entry logic to keep conditionals
consistent with project style.

Comment on lines +159 to +181
// TODO: This test doesn't currently work because delimiters are allowed in
// schema files, and there is no option to disable this yet
// TEST_CASE("Test error for creating log file with delimiter in regex pattern",
// "[LALR1Parser]SchemaParser]") {
// std::string file_path = "../tests/test_schema_files/schema_with_delimiter_in_regex_error.txt";
// std::string file_name = boost::filesystem::canonical(file_path).string();
// REQUIRE_THROWS_WITH(generate_log_parser(file_path),
// file_name +
// ":2: error: 'equals' has regex pattern which contains delimiter '='.\n"
// + " equals:.*=.*\n"
// + " ^^^^^\n");
//}

// TODO: This error check is performed correctly by CLP, but it is handled by
// something different now so this test will fail as is
// TEST_CASE("Test error for missing log file", "[LALR1Parser][LogParser]") {
// std::string file_name = "../tests/test_log_files/missing_log.txt";
// std::string file_path = boost::filesystem::weakly_canonical(file_name).string();
// REQUIRE_THROWS(compress("../tests/test_archives", file_name,
// "../tests/test_schema_files/schema_that_does_not_exist.txt"),
// "Specified schema file does not exist.");
//}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Track or remove the disabled tests.
Leaving tests commented out can create permanent coverage gaps; please link these TODOs to an issue or re-enable when the dependency is ready.

🧰 Tools
🪛 Cppcheck (2.19.0)

[style] 161-161: The function 'push_back_all' is never used.

(unusedFunction)


[style] 160-160: The function 'set_is_split' is never used.

(unusedFunction)


[style] 166-166: The function 'get_orig_path' is never used.

(unusedFunction)


[style] 168-168: The function 'get_orig_file_id' is never used.

(unusedFunction)


[style] 170-170: The function 'get_orig_file_id_as_string' is never used.

(unusedFunction)


[style] 176-176: The function 'get_id_as_string' is never used.

(unusedFunction)


[style] 178-178: The function 'get_begin_ts' is never used.

(unusedFunction)


[style] 180-180: The function 'get_end_ts' is never used.

(unusedFunction)

🤖 Prompt for AI Agents
In `@components/core/tests/test-ParserWithUserSchema.cpp` around lines 159 - 181,
The commented-out tests (the TEST_CASE blocks referencing generate_log_parser
and compress and the TODO notes) must be either tracked with an issue or
removed; open a tracking ticket and replace the commented blocks with a single
short TODO comment that includes the issue ID and the original test names ("Test
error for creating log file with delimiter in regex pattern" / "Test error for
missing log file") plus the referenced symbols generate_log_parser and compress,
or delete the commented tests entirely and add a test-placeholder comment
referencing the new issue so coverage gaps are traceable and the intent to
re-enable behavior is preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant