-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-s): Record log-order at compression time. #584
Conversation
WalkthroughThe pull request introduces various enhancements across multiple files, focusing on improving error handling, internal state management, and performance optimizations. Key changes include the addition of new member variables to track log event indices and message sizes, updates to method signatures for improved efficiency, and refinements in error messages for command-line argument parsing. Additionally, several methods have been modified to enhance functionality related to schema reading and writing, while maintaining existing logic and control flow. Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (30)
components/core/src/clp_s/archive_constants.hpp (1)
21-25
: Add documentation for schema tree constantsWhile the constants support the new log-order recording functionality, the empty strings used for
cRootNodeName
andcInternalSubtreeName
would benefit from documentation explaining their purpose in the MPT implementation.Add documentation like this:
// Schema tree constants +// Empty string constants are used to identify special nodes in the Merkle Patricia Tree: +// - Root node: The top-level node of the tree +// - Internal subtree: Contains implementation-specific fields that are ignored during decompression constexpr char cRootNodeName[] = ""; constexpr int32_t cRootNodeId = -1; constexpr char cInternalSubtreeName[] = ""; constexpr char cLogEventIdxName[] = "log_event_idx";components/core/src/clp_s/search/SearchUtils.cpp (1)
Line range hint
43-56
: Critical: Missing break statements in switch cases causes incorrect behaviourThe switch cases for LT/GTE and GT/LTE are missing break statements, causing fall-through to the default case. This means the ceil/floor operations are being overwritten by the default case's static_cast, leading to incorrect integer conversions.
Here's the corrected implementation:
bool double_as_int(double in, FilterOperation op, int64_t& out) { switch (op) { case FilterOperation::EQ: out = static_cast<int64_t>(in); return in == static_cast<double>(out); case FilterOperation::LT: case FilterOperation::GTE: out = std::ceil(in); + break; case FilterOperation::GT: case FilterOperation::LTE: out = std::floor(in); + break; default: out = static_cast<int64_t>(in); } return true; }components/core/src/clp_s/SchemaTree.cpp (2)
37-51
: Consider optimizing internal field lookupThe implementation is correct but uses a linear search through children. For better performance with many internal fields, consider using a hash map to cache field name to ID mappings.
+ private: + std::unordered_map<std::string, int32_t> m_internal_field_cache;This would allow O(1) lookups instead of O(n).
Line range hint
8-69
: Consider documenting the Internal subtree contractGiven that this implementation introduces a new Internal subtree for storing metadata like log order, it would be valuable to document:
- The purpose and scope of the Internal subtree
- Rules for what can be stored in it
- How it interacts with search and compression operations
This documentation would help maintain the separation between regular and internal fields as the system evolves.
components/core/src/clp_s/search/Projection.cpp (1)
55-55
: LGTM! Consider documenting the subtree choice.The change from
get_root_node_id()
toget_object_subtree_node_id()
correctly ensures that projections cannot resolve internal fields, aligning with the PR's objective of keeping internal fields inaccessible during searches.Consider adding a comment explaining why we start from the object subtree instead of the root node, such as:
+ // Start from object subtree to prevent resolving internal fields auto cur_node_id = tree->get_object_subtree_node_id();
components/core/src/clp_s/JsonParser.hpp (1)
98-104
: Consider enhancing method documentation and type safetyThe method documentation could be improved in several ways:
- Document the return value meaning and valid ranges
- Clarify what happens if the field already exists
- Explain the implications of not following the ordering convention
Additionally, consider using a strongly-typed identifier instead of raw int32_t to prevent mixing with other ID types.
Here's a suggested improvement:
/** * Gets the node ID for an internal field, and adds it to the schema tree if it does not exist. * + * @param field_name The name of the internal field to get/create + * @param type The type of node to create if the field doesn't exist + * @return The node ID for the internal field (>= 0) * * Note: this method should be called before parsing a record so that internal fields come first * in each table. This isn't strictly necessary, but it is a nice convention. + * + * @warning Calling this method after parsing has begun may result in non-optimal field ordering + * in the schema tree. */ - int32_t get_internal_field_id(std::string_view const field_name, NodeType type); + using InternalFieldId = int32_t; // Consider moving this to a types.hpp + InternalFieldId get_internal_field_id(std::string_view const field_name, NodeType type);components/core/src/clp_s/JsonSerializer.hpp (2)
120-123
: Consider optimizing string concatenationWhile the change to std::string_view is good, we could further optimize by using string_view's efficient substring operations and reserve/append operations on m_json_string.
Here's a potential optimization:
void append_value(std::string_view const value) { - m_json_string += value; - m_json_string += ","; + m_json_string.reserve(m_json_string.size() + value.size() + 1); + m_json_string.append(value); + m_json_string.push_back(','); }
Line range hint
1-144
: Consider future optimizations for JSON serializationGiven this class's role in handling log-order serialization, consider these future optimizations:
- Pre-calculate and reserve exact buffer sizes based on operation types
- Use a string buffer pool to reduce allocations
- Implement streaming serialization for large logs
components/core/src/clp_s/ReaderUtils.cpp (1)
21-21
: Excellent performance optimization!Moving the string declaration outside the loop eliminates repeated memory allocations, improving performance by reusing the same string buffer across iterations.
Consider adding a brief comment to document this intentional optimization:
+ // Declare key outside the loop to reuse the string buffer across iterations std::string key;
components/core/src/clp_s/ArchiveWriter.hpp (1)
101-104
: Enhance method documentationWhile the method implementation is correct, consider enhancing the documentation to explain:
- The purpose of log event IDs in the context of log-order compression
- How these IDs relate to the decompression order
- Whether the IDs are guaranteed to be sequential
/** - * @return the Id the next log event should receive when appended to the archive. + * @return The ID that will be assigned to the next log event when appended to the archive. + * These IDs are used to maintain the original log order during decompression when + * the --ordered flag is used. IDs are guaranteed to be sequential starting from 0. */components/core/src/clp_s/search/OutputHandler.cpp (2)
34-40
: LGTM! Consider const for placeholder.The addition of
log_event_idx
to the message packing is well implemented. However, consider making the placeholder constant static:- static constexpr string_view cOrigFilePathPlaceholder{""}; + static constexpr string_view const cOrigFilePathPlaceholder{""};
128-143
: Consider extracting QueryResult creation to reduce duplication.The QueryResult construction is duplicated at lines 128-134 and 137-143. Consider extracting this into a helper method:
+ private: + std::unique_ptr<QueryResult> create_query_result( + string_view message, + epochtime_t timestamp, + string_view archive_id, + int64_t log_event_idx) { + return std::make_unique<QueryResult>( + string_view{}, + message, + timestamp, + archive_id, + log_event_idx + ); + } void write(...) { if (m_latest_results.size() < m_max_num_results) { - m_latest_results.emplace(std::make_unique<QueryResult>(...)); + m_latest_results.emplace(create_query_result( + message, timestamp, archive_id, log_event_idx)); } else if (m_latest_results.top()->timestamp < timestamp) { m_latest_results.pop(); - m_latest_results.emplace(std::make_unique<QueryResult>(...)); + m_latest_results.emplace(create_query_result( + message, timestamp, archive_id, log_event_idx)); } }components/core/src/clp_s/JsonConstructor.cpp (2)
78-79
: LGTM: Index tracking variables properly definedThe transition from timestamp to index tracking is implemented correctly with appropriate type selection (int64_t).
Consider adding a brief comment explaining the purpose of these indices:
+ // Track the range of log event indices in the current chunk int64_t first_idx{0}; int64_t last_idx{0};
101-102
: Consider using fmt for string formattingWhile the current implementation works, using fmt would improve readability and maintainability.
Consider this improvement:
- std::string new_file_name = src_path.string() + "_" + std::to_string(first_idx) + "_" - + std::to_string(last_idx) + ".jsonl"; + std::string new_file_name = fmt::format("{}_{}_{}.jsonl", + src_path.string(), + first_idx, + last_idx);components/core/src/clp_s/ArchiveReader.hpp (1)
217-217
: Add documentation for the new member variable.Please add a comment explaining the purpose of
m_log_event_idx_column_id
and its role in supporting log-order decompression.Example:
+ /// Column ID for the internal field storing log event indices. Used to support log-order decompression. + /// A value of -1 indicates the field is not set. int32_t m_log_event_idx_column_id{-1};components/core/src/clp_s/search/ColumnDescriptor.hpp (1)
26-26
: Consider optimizing string storage.The change to use
std::string_view
is a good modernization that improves performance. However, since we're storing the token as a string member, consider usingstd::string
's constructor that takes a string_view directly:- explicit DescriptorToken(std::string_view const token) - : m_token(token), + explicit DescriptorToken(std::string_view const token) + : m_token(token.data(), token.size()),This small optimization ensures we're using the most efficient constructor overload.
components/core/src/clp_s/search/OutputHandler.hpp (1)
92-98
: Consider adding a separator between timestamp and messageThe current output format
archive_id: log_event_idx: timestamp message
might be hard to parse when the message starts with a number. Consider adding a separator:- std::cout << archive_id << ": " << log_event_idx << ": " << timestamp << " " << message; + std::cout << archive_id << ": " << log_event_idx << ": " << timestamp << " | " << message;components/core/src/clp_s/SchemaReader.hpp (2)
163-173
: Consider grouping related parametersThe method signature and documentation are well-structured. However, consider grouping related metadata parameters (timestamp and log_event_idx) using a struct to improve maintainability if more metadata fields are added in the future.
Example structure:
struct MessageMetadata { epochtime_t timestamp; int64_t log_event_idx; };
189-194
: Documentation could be more detailedThe method declarations follow consistent patterns and are well-typed. However, the documentation for mark_column_as_log_event_idx could be enhanced.
Consider adding more detailed documentation:
/** * Marks a column as the log_event_idx column. * @param column_reader The column reader containing log event indices * @note Only one column can be marked as log_event_idx */Also applies to: 210-214
components/core/src/clp_s/ArchiveWriter.cpp (1)
90-90
: Sequential log event ID assignment looks goodThe implementation correctly maintains message order by incrementing the counter after successful message encoding. If this class is intended for concurrent use in the future, consider documenting the single-writer assumption.
components/core/src/clp_s/search/Output.hpp (2)
92-92
: Add documentation for the new member variablePlease add a documentation comment explaining the purpose and usage of
m_internal_columns
. This helps maintain consistency with other well-documented members in the class and improves maintainability.Add this documentation above the member:
+ /** + * Set of column IDs that represent internal fields which should be ignored + * during dynamic wildcard expansion. + */ std::unordered_set<int32_t> m_internal_columns;
347-350
: Enhance method documentationThe current documentation could be more comprehensive to help maintainers understand the method's role in the search functionality.
Consider expanding the documentation:
/** - * Populates the set of internal columns that get ignored during dynamic wildcard expansion. + * Populates the set of internal columns (m_internal_columns) with column IDs + * that represent implementation-specific fields (e.g., log_event_idx). + * These columns are ignored during dynamic wildcard expansion to prevent + * internal fields from appearing in search results. + * + * This method should be called before processing wildcard searches to ensure + * internal fields are properly excluded from the results. */ void populate_internal_columns();components/core/src/clp_s/ArchiveReader.cpp (1)
315-320
: Consider improving indentation for better readability.The logic for marking the log event index column is correct, with proper type checking. However, the indentation could be more consistent with the codebase style.
Consider this formatting:
- if (column_id == m_log_event_idx_column_id - && nullptr != dynamic_cast<Int64ColumnReader*>(column_reader)) - { - reader.mark_column_as_log_event_idx(static_cast<Int64ColumnReader*>(column_reader)); - } + if (column_id == m_log_event_idx_column_id + && nullptr != dynamic_cast<Int64ColumnReader*>(column_reader)) { + reader.mark_column_as_log_event_idx(static_cast<Int64ColumnReader*>(column_reader)); + }components/core/src/clp_s/JsonParser.cpp (1)
533-540
: Add documentation for the new method.While the implementation is clean and efficient, consider adding documentation to explain:
- The purpose of internal fields
- The hierarchical structure created
- The meaning of the returned field ID
Add this documentation above the method:
+/** + * Creates or retrieves an internal field ID within the Internal subtree. + * + * @param field_name The name of the internal field + * @param type The type of the internal field + * @return The ID of the internal field node + */ int32_t JsonParser::get_internal_field_id(std::string_view const field_name, NodeType type) {components/core/src/clp_s/SchemaReader.cpp (2)
Line range hint
225-252
: Consider implementing the TODO optimization.The TODO comment suggests an optimization opportunity for skipping messages with timestamps less than the smallest timestamp in the priority queue.
Would you like me to help implement this optimization or create a GitHub issue to track this enhancement?
96-96
: Consider performance implications of string handling.The change from
auto const&
toauto
forname
variables might cause unnecessary string copies. While this change simplifies the code, it could impact performance when handling large strings.Consider keeping the const reference for better performance:
-auto name = m_global_schema_tree->get_node(column->get_id()).get_key_name(); +auto const& name = m_global_schema_tree->get_node(column->get_id()).get_key_name();Also applies to: 112-112, 128-128, 146-146
components/core/src/clp_s/CommandLineArguments.cpp (1)
299-305
: Add deprecation notice for timestamp orderingThe help text has been updated to reflect the switch from timestamp-ordered to log-ordered decompression. However, since this is a breaking change that deprecates timestamp ordering, it would be helpful to add a deprecation notice to guide users who might be relying on the old behaviour.
"ordered", po::bool_switch(&m_ordered_decompression), - "Enable decompression in log order for this archive" + "Enable decompression in log order for this archive (Note: timestamp-ordered decompression has been deprecated)"components/core/src/clp_s/SchemaTree.hpp (2)
68-68
: Update documentation to reflect change in return typeThe
get_key_name
method now returns astd::string_view
. Ensure that all documentation and comments reflect this change to prevent confusion.
116-135
: Assess thread safety of new methodsThe methods
get_object_subtree_node_id
,get_internal_field_id
, andget_internal_subtree_node_id
may not be thread-safe ifSchemaTree
is accessed by multiple threads concurrently. Since they access member variables that can be modified, consider the implications for thread safety.Evaluate whether these methods require synchronization mechanisms or if
SchemaTree
should be documented as not thread-safe. This will prevent potential race conditions in multithreaded environments.components/core/src/clp_s/search/SchemaMatch.cpp (1)
114-116
: Address the TODO regarding support for mixed log typesThe TODO comment indicates a future need to handle multiple log types by matching against more than just the object subtree. Consider planning for this enhancement to ensure scalability and maintainability when this capability is introduced.
Would you like assistance in outlining a plan or creating a GitHub issue to track this enhancement?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (24)
components/core/src/clp_s/ArchiveReader.cpp
(3 hunks)components/core/src/clp_s/ArchiveReader.hpp
(1 hunks)components/core/src/clp_s/ArchiveWriter.cpp
(2 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(3 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/JsonConstructor.cpp
(4 hunks)components/core/src/clp_s/JsonConstructor.hpp
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/JsonSerializer.hpp
(3 hunks)components/core/src/clp_s/ReaderUtils.cpp
(1 hunks)components/core/src/clp_s/SchemaReader.cpp
(9 hunks)components/core/src/clp_s/SchemaReader.hpp
(5 hunks)components/core/src/clp_s/SchemaTree.cpp
(2 hunks)components/core/src/clp_s/SchemaTree.hpp
(7 hunks)components/core/src/clp_s/archive_constants.hpp
(3 hunks)components/core/src/clp_s/search/ColumnDescriptor.hpp
(2 hunks)components/core/src/clp_s/search/Output.cpp
(5 hunks)components/core/src/clp_s/search/Output.hpp
(3 hunks)components/core/src/clp_s/search/OutputHandler.cpp
(4 hunks)components/core/src/clp_s/search/OutputHandler.hpp
(7 hunks)components/core/src/clp_s/search/Projection.cpp
(1 hunks)components/core/src/clp_s/search/SchemaMatch.cpp
(2 hunks)components/core/src/clp_s/search/SearchUtils.cpp
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- components/core/src/clp_s/JsonConstructor.hpp
🔇 Additional comments (37)
components/core/src/clp_s/archive_constants.hpp (2)
4-4
: LGTM: Appropriate header inclusion for fixed-width types
The addition of is necessary for the int32_t type used in cRootNodeId.
35-41
: LGTM: Well-organized search result field constants
The new constants in the search namespace are well-structured and align with the expected MongoDB search results mentioned in the PR objectives.
Let's verify consistent usage of these field names across the codebase:
✅ Verification successful
Based on the search results, I can see that the field names defined in archive_constants.hpp
are consistently used across the codebase:
orig_file_path
is used extensively in output handlers, file operations and archive managementlog_event_ix
is used in message handling and output formattingtimestamp
is used throughout for temporal operations and search functionalitymessage
is used for log content handlingarchive_id
is used for archive identification and management
The constants are well-organized and their usage aligns with the expected functionality in search results handling. The field names are used consistently without any conflicting definitions.
Field names are consistently used across the codebase
The field names defined in the search namespace match their usage patterns in the implementation, providing a reliable interface for search results handling.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistent usage of the search result field names
# Expected: These field names should be used in search result handling code
# Search for usage of the field names
echo "Checking field name usage:"
for field in "orig_file_path" "log_event_ix" "timestamp" "message" "archive_id"; do
echo "=== $field ==="
rg --type cpp "$field" --glob '!archive_constants.hpp'
done
Length of output: 356020
components/core/src/clp_s/search/SearchUtils.cpp (1)
Line range hint 37-41
: LGTM: Internal node type handling aligns with requirements
The mapping of NodeType::Internal to LiteralType::UnknownT correctly implements the requirement that internal nodes should not be resolvable during searches.
Let's verify that no search operations attempt to handle Internal nodes differently:
✅ Verification successful
Verified: Internal node type handling is consistent throughout the codebase
The verification shows that NodeType::Internal is handled appropriately across the codebase:
- Used only for internal subtree identification in SchemaTree.cpp
- Created during JSON parsing in JsonParser.cpp
- Correctly mapped to UnknownT in search operations
The implementation maintains consistency with the requirement that internal nodes should not be searchable.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any special handling of Internal node types in search operations
rg -l "NodeType::Internal" | xargs -I {} rg -A 5 "case\s+NodeType::Internal" {}
Length of output: 215
Script:
#!/bin/bash
# Let's search more broadly for NodeType::Internal usage
rg -B 2 -A 2 "NodeType::Internal" --type cpp
# Also check for any switch statements handling NodeTypes
ast-grep --pattern 'switch ($expr) { $$$ case NodeType::$_: $$$ }'
Length of output: 1300
components/core/src/clp_s/SchemaTree.cpp (3)
8-8
: Excellent optimization using std::string_view!
The change from std::string const&
to std::string_view const
is a good optimization for the parameter, while still properly creating a persistent copy when storing the key in the node.
Also applies to: 17-17
67-69
: Good optimization of string serialization
The direct buffer writing approach is more efficient than using a helper method. The implementation properly handles the string size and content.
19-25
: Verify Internal subtree initialization
The logic for setting up the Internal subtree looks good. However, let's verify that the Internal subtree is properly initialized during normal operation.
components/core/src/clp_s/JsonParser.hpp (1)
6-6
: LGTM: Appropriate use of string_view
The addition of string_view header is well-aligned with modern C++ practices for efficient string handling.
components/core/src/clp_s/JsonSerializer.hpp (3)
5-5
: LGTM: Required include for std::string_view
The addition of this header is necessary to support the transition to std::string_view parameters.
114-118
: LGTM: Efficient string handling with string_view
The use of std::string_view is appropriate here as the key is immediately copied into m_json_string.
70-70
: Consider string_view lifetime implications
The change to std::string_view and emplace_back improves performance by avoiding unnecessary copies. However, ensure that the string_view parameter remains valid until it's converted to a string in m_special_keys.
Let's verify the lifetime of string_view parameters:
✅ Verification successful
Let me check the source of the string_view parameters by examining the get_key_name()
method and the lifetime of the key
variables:
The string_view parameter usage is safe
The string_view parameter in add_special_key
is safely stored because:
- The key names come from
SchemaNode::get_key_name()
which returns a stable string_view backed by a unique_ptr buffer (m_key_buf
) that lives for the lifetime of the SchemaNode - The
m_special_keys
vector in JsonSerializer creates its own string copy throughemplace_back
from the string_view - All calls to
add_special_key
use either:node.get_key_name()
from stable SchemaNode instances- String literals or long-lived strings stored in the schema tree
The change to string_view is a safe optimization that avoids unnecessary string copies while maintaining proper lifetime semantics.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for calls to add_special_key to verify string lifetime
rg -A 3 "add_special_key" --type cpp
Length of output: 2926
Script:
#!/bin/bash
# Search for get_key_name implementation
ast-grep --pattern 'get_key_name() { $$$ }'
# Search for key variable declarations in SchemaReader
rg -B 3 "std::string.*key" components/core/src/clp_s/SchemaReader.cpp
Length of output: 117
Script:
#!/bin/bash
# Try different pattern to find get_key_name implementation
ast-grep --pattern 'get_key_name'
# Look for Node class definition and member variables
rg -A 5 "class.*Node" --type hpp
# Check SchemaReader.hpp for member variables and types
cat components/core/src/clp_s/SchemaReader.hpp
Length of output: 10503
Script:
#!/bin/bash
# Look for Node class and get_key_name in SchemaTree files
rg -A 5 "class.*Node" components/core/src/clp_s/SchemaTree.hpp
cat components/core/src/clp_s/SchemaTree.hpp
# Check the implementation of generate_json_template where add_special_key is called
rg -B 3 "generate_json_template" --type cpp
Length of output: 6743
components/core/src/clp_s/ArchiveWriter.hpp (2)
4-4
: LGTM: Appropriate header inclusion
The addition of <string_view> header is correct and follows the standard include ordering.
97-97
: LGTM: Efficient parameter type change
The change to std::string_view const
from std::string const&
is a good optimization. This avoids potential string copies and is more efficient for string parameters that are only viewed, not modified.
components/core/src/clp_s/search/OutputHandler.cpp (2)
77-96
: Well-structured MongoDB document creation with constants.
Good use of constants from archive_constants.hpp instead of string literals. The document structure is clean and consistent with the addition of the log event index field.
Line range hint 34-143
: Verify performance impact of log event index tracking.
While the changes look good, let's verify that adding the log event index field doesn't significantly impact performance, especially for the MongoDB operations.
✅ Verification successful
Let me search for MongoDB collection initialization and index setup specific to the results cache collection.
Let me try a different search pattern to find MongoDB-related performance configurations.
Let me check for any MongoDB performance-related configurations in the constants or configuration files.
The log event index addition has minimal performance impact.
The implementation shows good performance characteristics:
- Uses batch inserts with
insert_many
for efficient bulk operations - Maintains pre-allocated vector with
reserve
for results - Employs the same MongoDB document structure, just adding one more field
- No additional indexes are required as the log event index is not used for querying
The performance overhead of adding the log event index field is negligible since:
- It's just one additional integer field in the document
- The batch processing logic remains unchanged
- No extra database operations are introduced
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for MongoDB index definitions to ensure optimal query performance
rg -g '*.cpp' -g '*.hpp' 'createIndex|create_index' --multiline
Length of output: 2427
Script:
#!/bin/bash
# Search for MongoDB collection initialization and index setup
rg -g '*.cpp' -g '*.hpp' 'results_cache.*collection' -B 5 -A 5
Length of output: 64
Script:
#!/bin/bash
# Search for MongoDB collection configurations and indexes in the results cache context
rg -g '*.cpp' -g '*.hpp' -B 5 -A 5 'mongocxx::collection|mongocxx::index|mongocxx::options'
Length of output: 4328
Script:
#!/bin/bash
# Search for MongoDB-related constants and configurations
rg -g '*.cpp' -g '*.hpp' -g '*.h' 'results_cache.*search|mongodb.*options|write_concern|bulk_write' -B 3 -A 3
Length of output: 6724
components/core/src/clp_s/JsonConstructor.cpp (3)
143-145
: LGTM: Index tracking in main loop properly implemented
The logic for tracking first and last indices in the processing loop is correct and maintains proper chunk boundaries.
122-126
: LGTM: MongoDB document fields correctly updated for log-order tracking
The transition from timestamps to indices in MongoDB documents is implemented correctly and maintains consistency with the new log-order approach.
Let's verify the MongoDB field names are used consistently:
✅ Verification successful
MongoDB field names are consistently used across the codebase
The verification confirms that the MongoDB field names cBeginMsgIx
and cEndMsgIx
are properly defined in archive_constants.hpp
and correctly used in JsonConstructor.cpp
. The field names are consistently implemented for the log-order tracking approach.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of these MongoDB field names across the codebase
rg "cBeginMsgIx|cEndMsgIx" --type cpp
Length of output: 477
71-71
: LGTM: Comparison function correctly implements log-order processing
The change from get_next_timestamp()
to get_next_log_event_idx()
properly implements the transition to log-order processing as intended.
Let's verify the implementation of get_next_log_event_idx()
:
✅ Verification successful
Implementation of get_next_log_event_idx() is correct and properly used
The verification confirms that get_next_log_event_idx()
is correctly implemented in SchemaReader
class and returns an int64_t
value representing the log event index. The implementation properly handles both cases:
- Returns the extracted index value when the column exists
- Returns 0 as a fallback when the column is null
The comparison in JsonConstructor.cpp
correctly uses this method to establish ordering based on log event indices, which aligns with the PR's objective of transitioning to log-order processing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for the implementation of get_next_log_event_idx
ast-grep --pattern 'get_next_log_event_idx() {
$$$
}'
Length of output: 57
Script:
#!/bin/bash
# Let's try different approaches to find the implementation
# 1. Search for the method declaration with potential variations
rg -A 5 "get_next_log_event_idx" --type cpp
# 2. Search for the header file that might contain the declaration
fd -e hpp -e h | xargs rg "get_next_log_event_idx"
# 3. Search for the class that might contain this method
rg -B 5 -A 10 "class.*JsonConstructor" --type cpp
Length of output: 3777
components/core/src/clp_s/ArchiveReader.hpp (1)
217-217
: Verify the usage of log event index column ID.
The implementation looks good, but let's verify that the cpp file correctly handles this field during schema reading and writing.
✅ Verification successful
Let me gather more context about the schema initialization to fully verify the usage.
Let me check one final aspect about the Int64ColumnReader usage to complete the verification.
Log event index column ID implementation is properly handled
The field is correctly managed throughout the codebase:
- Initialized to -1 and set via
get_internal_field_id
using the constantcLogEventIdxName
- Properly type-checked as
Int64ColumnReader
during schema reading - Correctly marked in the schema reader using
mark_column_as_log_event_idx
- Properly reset in cleanup
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of m_log_event_idx_column_id in the cpp file
# Check how the field is set and used
rg -A 5 "m_log_event_idx_column_id" --type cpp
# Check for related schema handling code
ast-grep --pattern 'initialize_schema_reader($$$)'
Length of output: 2560
Script:
#!/bin/bash
# Check the implementation of initialize_schema_reader
ast-grep --pattern 'void initialize_schema_reader($$$) {
$$$
}'
# Check for any constants related to log event index
rg "cLogEventIdxName" --type cpp -A 2
Length of output: 977
Script:
#!/bin/bash
# Check Int64ColumnReader and log event index related code
ast-grep --pattern 'mark_column_as_log_event_idx($$$)'
# Check for any other relevant column reader code
rg "Int64ColumnReader" --type cpp -A 3
Length of output: 4129
components/core/src/clp_s/search/ColumnDescriptor.hpp (1)
7-7
: LGTM! Explicit inclusion of string_view header.
The addition of the string_view header follows proper include ordering and makes the dependency explicit.
components/core/src/clp_s/search/OutputHandler.hpp (3)
239-244
: LGTM: Empty implementation is appropriate
The empty implementation is correct as the count aggregation doesn't need the log event index.
273-278
: LGTM: Implementation correctly focuses on timestamp
The implementation properly ignores the log event index as it's not relevant for time-based bucketing.
46-53
: LGTM: Interface change is well-documented
The addition of the log_event_idx
parameter to the pure virtual write
method is properly documented and uses an appropriate type for indexing.
Let's verify all implementations of this interface include the new parameter:
components/core/src/clp_s/SchemaReader.hpp (2)
102-102
: LGTM: Proper initialization of log event index column
The initialization of m_log_event_idx_column to nullptr in the reset() method is correctly placed and follows the class's initialization pattern.
307-307
: LGTM: Proper member variable declaration
The m_log_event_idx_column member variable is appropriately declared and initialized, following the class's organization pattern.
components/core/src/clp_s/ArchiveWriter.cpp (1)
71-71
: Proper cleanup of log event counter
Good practice to reset the counter alongside other member variables during cleanup.
components/core/src/clp_s/search/Output.hpp (1)
8-8
: LGTM: Header include is appropriate
The addition of <unordered_map>
is necessary for the new internal columns functionality and follows the standard include ordering.
components/core/src/clp_s/ArchiveReader.cpp (2)
357-357
: LGTM! Proper cleanup of member variable.
The reset of m_log_event_idx_column_id
to -1 is consistent with initialization and follows the cleanup pattern in the close
method.
30-31
: LGTM! Consider verifying schema tree initialization.
The initialization of m_log_event_idx_column_id
looks correct. The field ID is fetched from the schema tree using the constant name.
Let's verify the schema tree initialization:
✅ Verification successful
The code shows that get_internal_field_id
in SchemaTree
properly handles the case when the internal field is not found by returning -1, and the initialization in ArchiveReader
is used correctly in conjunction with other components like JsonParser
that actually create these internal fields during compression.
Schema tree initialization and internal field handling verified successfully
The implementation is correct because:
SchemaTree::get_internal_field_id
safely handles missing fields by returning -1JsonParser
creates the internal field during compression usingconstants::cLogEventIdxName
- The field name constant is properly defined in
archive_constants.hpp
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify schema tree initialization and internal field handling
# Look for schema tree initialization and internal field handling
rg -A 5 "get_internal_field_id|cLogEventIdxName" --type cpp
Length of output: 3552
components/core/src/clp_s/JsonParser.cpp (2)
Line range hint 478-486
: Robust error handling for JSON parsing failures.
The try-catch block effectively handles latent simdjson errors with proper cleanup and detailed error reporting, including the file path and number of bytes parsed.
465-472
: Implementation aligns with log order tracking requirements.
The addition of the internal log event index field is well-placed before parsing the actual log entry, ensuring proper order tracking. The use of constants for field names and ordered schema insertion maintains consistency.
Let's verify the constant definitions:
✅ Verification successful
Implementation of log event index tracking is properly integrated across the codebase
The verification confirms that:
- The constant
cLogEventIdxName
is properly defined inarchive_constants.hpp
- The log event index field is consistently used throughout the codebase:
- Properly read in
ArchiveReader
- Correctly handled in schema operations via
SchemaReader
- Appropriately used in search output handling
- Maintained in order during JSON construction
- Properly read in
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the constant definitions for internal field names
# Expected: Find definitions of cLogEventIdxName and cInternalSubtreeName
rg -A 1 'cLogEventIdxName|cInternalSubtreeName' --type cpp
Length of output: 855
Script:
#!/bin/bash
# Let's verify the usage of log_event_idx in the codebase to ensure consistency
rg -A 3 'log_event_idx' --type cpp
Length of output: 14218
components/core/src/clp_s/SchemaReader.cpp (2)
40-45
: LGTM! Clean implementation of log event index retrieval.
The implementation is straightforward and handles the null case appropriately.
573-573
: LGTM! Minor readability improvements.
The changes to variable naming and code organization improve readability without affecting functionality.
Also applies to: 584-584
components/core/src/clp_s/search/Output.cpp (2)
141-145
: Consistent handling of internal columns filtering.
The implementation correctly filters out internal columns in both the initialization and wildcard column population phases. This ensures that internal fields are consistently ignored during searches.
#!/bin/bash
# Description: Verify that internal columns are consistently filtered
# throughout the codebase by checking for similar patterns.
rg -A 2 'm_internal_columns'
Also applies to: 994-996
965-976
: Document the assumption about non-nested structures.
The comment on line 971 indicates an important implementation assumption that the internal subtree contains no nested structures. This assumption should be validated or enforced to prevent potential issues if the structure changes in the future.
Consider:
- Adding runtime validation to detect nested structures
- Documenting this assumption in the class documentation
- Adding a test case to verify this assumption holds
components/core/src/clp_s/SchemaTree.hpp (2)
104-104
: Ensure consistency of add_node
method signature
The add_node
function now accepts std::string_view const key
. Verify that all calls to this method throughout the codebase are updated to match the new signature to avoid compilation errors.
Run the following script to locate and review all usages:
#!/bin/bash
# Description: Find all invocations of add_node and check parameter consistency.
# Search for add_node function calls in source files
rg 'add_node\s*\(' -g '*.*pp' -A 1
# Optionally, check for any declarations or definitions that may need updating
rg 'int32_t\s+add_node\s*\(' -g '*.*pp' -A 1
25-25
: Ensure the addition of Internal
does not impact existing enum usages
Adding Internal
to the NodeType
enum is acceptable. However, verify that this change does not affect serialization, deserialization, or any switch-case statements that rely on the enum's ordering or values. Unexpected behaviour might occur if the enum values are persisted or used without a default case.
Run the following script to identify potential issues:
✅ Verification successful
Addition of Internal
to NodeType
enum is safe to proceed
Based on the comprehensive analysis of the codebase:
- The
NodeType
enum has a definedUnknown
value as the last entry with a specific bit pattern - All switch statements in
ArchiveReader.cpp
andArchiveWriter.cpp
handle specific cases without relying on enum ordering - The enum is not used in array indexing or arithmetic operations
- Type comparisons are done using explicit values (==, !=) rather than ordering
- The serialization code in
SchemaTree::store
doesn't depend on enum values - The casting operation in
ReaderUtils.cpp
is safe as it's a controlled conversion
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find usages of NodeType that may be impacted by the new enum value.
# Search for switch statements using NodeType without a default case
rg -A 3 'switch\s*\(\s*NodeType' | rg -B 3 -A 3 -v 'default:'
# Search for serialization/deserialization code involving NodeType
rg 'serialize|deserialize' -g '*.*pp' -A 2 | rg 'NodeType'
# Search for places where NodeType values are assumed to be continuous or have specific values
rg 'NodeType::\w+\s*[\+\-\*/=]' -g '*.*pp'
Length of output: 7563
Script:
#!/bin/bash
# Let's search for switch statements and enum usages more comprehensively
# Search for switch statements with NodeType
rg -p 'switch.*\{.*\}' -U -g '*.*pp' -A 10 | rg -B 3 -A 10 'NodeType'
# Search for serialization/deserialization related code
rg -p 'class.*\{.*\}' -U -g '*.*pp' | rg -A 20 -B 3 'serialize|deserialize'
# Search for the NodeType enum definition to understand its current usage
rg -p 'enum.*\{.*\}' -U -g '*.*pp' | rg -B 3 -A 10 'NodeType'
# Search for any direct comparisons or assignments with NodeType
rg '(==|!=|\+=|-=|\*=|/=|=)\s*NodeType::' -g '*.*pp'
Length of output: 1266
Script:
#!/bin/bash
# Let's check for any value-based operations or array indexing with NodeType
# Search for the SchemaTree.hpp file to understand the full enum definition
cat components/core/src/clp_s/SchemaTree.hpp
# Search for any array indexing or value-based operations
rg '\[.*NodeType' -g '*.*pp'
rg 'static_cast<\w+>\(.*NodeType' -g '*.*pp'
rg '\(NodeType\)' -g '*.*pp'
# Search for any persistence or storage of NodeType values
rg 'write|store|save|load|read' -g '*.*pp' -A 2 | rg 'NodeType'
Length of output: 11429
components/core/src/clp_s/search/SchemaMatch.cpp (1)
79-82
:
Verify the change to get_object_subtree_node_id()
in loop condition
The update from get_root_node_id()
to get_object_subtree_node_id()
in the loop condition of populate_column_mapping
may affect the tree traversal logic. Please ensure that this change correctly reflects the intended behaviour and that no necessary nodes are skipped during the descriptor path construction.
Run the following script to check for other instances where get_root_node_id()
is used and may require similar updates:
✅ Verification successful
Change to get_object_subtree_node_id()
is correct and intentional
The change from get_root_node_id()
to get_object_subtree_node_id()
is verified to be correct based on the codebase analysis:
- The method is consistently used across the search components to specifically work with object subtrees
- A TODO comment in SchemaMatch.cpp explicitly states the intention to work with object subtrees: "Once we start supporting mixing different types of logs we will have to match against more than just the object subtree"
- Other components like Projection.cpp also use
get_object_subtree_node_id()
for similar tree traversal operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for usages of `get_root_node_id()` in the codebase.
# Search for the method `get_root_node_id` in source files.
rg 'get_root_node_id\('
Length of output: 25
Script:
#!/bin/bash
# Let's try a broader search to understand the context better
# Search for any mentions of root node or object subtree
rg -i "root_node|object_subtree" components/core/src/clp_s/search/
# Also search for the class/struct definition containing these methods
ast-grep --pattern 'class $_ {
$$$
get_object_subtree_node_id($$$) {
$$$
}
$$$
}'
# And check the specific file for context around the change
rg -A 5 -B 5 "get_object_subtree_node_id" components/core/src/clp_s/search/SchemaMatch.cpp
Length of output: 2105
Can you report data for the outliers as well? (Or a table with all results?). We tend to come back to PRs more and more nowadays, so they serve as good docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
532-539
: Add input validation for field_name parameterThe method should validate that the field_name is not empty and follows any naming conventions for internal fields.
Consider adding validation:
int32_t JsonParser::add_internal_field(std::string_view const field_name, NodeType type) { + if (field_name.empty()) { + throw std::invalid_argument("Internal field name cannot be empty"); + } auto internal_subtree_id = m_archive_writer->add_node( constants::cRootNodeId, NodeType::Internal, constants::cInternalSubtreeName ); return m_archive_writer->add_node(internal_subtree_id, type, field_name); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)
🔇 Additional comments (3)
components/core/src/clp_s/JsonParser.hpp (2)
6-6
: LGTM! Appropriate header inclusion.
The addition of <string_view> header is correctly placed and necessary for the new method parameter type.
98-104
: Enhance method documentation and verify implementation.
While the documentation indicates the timing requirement, it would be beneficial to add:
- @param descriptions for field_name and type
- @return description explaining the meaning of the returned ID
- @throw documentation if any exceptions can be thrown
Also, let's verify the implementation to ensure proper handling of internal fields.
✅ Verification successful
Implementation Verified: add_internal_field
is properly handled before parsing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the implementation of add_internal_field and its usage
# Look for the implementation and usage patterns
# Find the implementation
echo "Looking for implementation:"
rg -A 10 "add_internal_field.*NodeType.*\{" --type cpp
# Find usage patterns to verify it's called before parsing
echo -e "\nVerifying usage before parsing:"
rg -B 5 "add_internal_field" --type cpp
Length of output: 2553
components/core/src/clp_s/JsonParser.cpp (1)
477-478
: LGTM! Robust error handling implementation
The error handling implementation is well-designed:
- Uses specific exception type (
simdjson_error
) - Provides detailed error context including file path and bytes consumed
- Properly cleans up resources by closing the archive writer
Sure. Ran it on all public datasets, and actually it seems to be a little worse on public datasets than the general case.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/JsonConstructor.cpp (1)
100-104
: Consider using a more robust file naming schemeWhile the current implementation correctly uses indices for file naming and MongoDB documents, consider using a format string to make the file naming more maintainable and less error-prone.
Here's a suggested improvement:
- std::string new_file_name = src_path.string() + "_" + std::to_string(first_idx) + "_" - + std::to_string(last_idx) + ".jsonl"; + std::string new_file_name = fmt::format("{}_{:010d}_{:010d}.jsonl", + src_path.string(), first_idx, last_idx);This change would:
- Use the existing
fmt
library- Ensure consistent width for indices
- Make the format easier to modify if needed
Also applies to: 124-128
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
components/core/src/clp_s/JsonConstructor.cpp
(4 hunks)
🔇 Additional comments (3)
components/core/src/clp_s/JsonConstructor.cpp (3)
71-71
: LGTM: Comparison logic updated for log-order processing
The change from timestamp to log event index comparison aligns with the PR's objective to support log-order decompression.
78-79
: LGTM: Index tracking variables properly initialized
The transition from timestamp to index variables is implemented correctly with appropriate types and initialization.
145-147
: Verify index ordering consistency
The index tracking logic appears correct, but let's verify that indices are monotonically increasing.
✅ Verification successful
Further verification needed on index usage
Re-run verification scripts with corrected syntax
Index ordering confirmed as consistent
The log event indices are monotonically increasing with no detected decrements.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the log event indices are used consistently across the codebase
# and verify that they are always increasing in order.
# Search for all get_next_log_event_idx calls and their context
rg -A 3 "get_next_log_event_idx" --type cpp
# Look for any potential comparisons or assertions about index ordering
ast-grep --pattern 'if ($x->get_next_log_event_idx() $op $y->get_next_log_event_idx()) { $$$ }'
Length of output: 1870
Script:
#!/bin/bash
# Description: Ensure that log event indices are consistently increasing and not being altered unexpectedly.
# Search for any assignments to last_idx or first_idx outside their initialization
rg -E "^\s*(last_idx|first_idx)\s*=" --type cpp
# Check for any decrements or non-incremental changes to get_next_log_event_idx
rg -E "get_next_log_event_idx\(\)\s*(<|<=)" --type cpp
Length of output: 900
Script:
#!/bin/bash
# Description: Ensure that log event indices are consistently increasing and not being altered unexpectedly.
# Search for any assignments to last_idx or first_idx outside their initialization
rg --type cpp -e "^\s*(last_idx|first_idx)\s*=" .
# Check for any decrements or non-incremental changes to get_next_log_event_idx
rg --type cpp -e "get_next_log_event_idx\(\)\s*(<|<=)" .
Length of output: 291
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
components/core/src/clp_s/JsonParser.cpp
(2 hunks)
🔇 Additional comments (2)
components/core/src/clp_s/JsonParser.cpp (2)
470-476
: Previous review comment about error handling is still applicable.
The code correctly implements log-order recording, but the error handling concern raised in the previous review remains valid.
482-482
: LGTM! Good use of constants.
The change improves maintainability by using centralized constants instead of magic values.
int32_t JsonParser::add_internal_field(std::string_view const field_name, NodeType type) { | ||
auto internal_subtree_id = m_archive_writer->add_node( | ||
constants::cRootNodeId, | ||
NodeType::Internal, | ||
constants::cInternalSubtreeName | ||
); | ||
return m_archive_writer->add_node(internal_subtree_id, type, field_name); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling and documentation to the new method.
The method implementation is correct but could be more robust with error handling and documentation.
Consider applying these improvements:
+/**
+ * Creates an internal field under the Internal subtree.
+ * @param field_name Name of the internal field
+ * @param type Type of the internal field
+ * @return Node ID of the created field, or -1 if creation failed
+ */
int32_t JsonParser::add_internal_field(std::string_view const field_name, NodeType type) {
auto internal_subtree_id = m_archive_writer->add_node(
constants::cRootNodeId,
NodeType::Internal,
constants::cInternalSubtreeName
);
+ if (internal_subtree_id < 0) {
+ SPDLOG_ERROR("Failed to create internal subtree node");
+ return -1;
+ }
- return m_archive_writer->add_node(internal_subtree_id, type, field_name);
+ auto field_id = m_archive_writer->add_node(internal_subtree_id, type, field_name);
+ if (field_id < 0) {
+ SPDLOG_ERROR("Failed to create internal field node");
+ }
+ return field_id;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
int32_t JsonParser::add_internal_field(std::string_view const field_name, NodeType type) { | |
auto internal_subtree_id = m_archive_writer->add_node( | |
constants::cRootNodeId, | |
NodeType::Internal, | |
constants::cInternalSubtreeName | |
); | |
return m_archive_writer->add_node(internal_subtree_id, type, field_name); | |
} | |
/** | |
* Creates an internal field under the Internal subtree. | |
* @param field_name Name of the internal field | |
* @param type Type of the internal field | |
* @return Node ID of the created field, or -1 if creation failed | |
*/ | |
int32_t JsonParser::add_internal_field(std::string_view const field_name, NodeType type) { | |
auto internal_subtree_id = m_archive_writer->add_node( | |
constants::cRootNodeId, | |
NodeType::Internal, | |
constants::cInternalSubtreeName | |
); | |
if (internal_subtree_id < 0) { | |
SPDLOG_ERROR("Failed to create internal subtree node"); | |
return -1; | |
} | |
auto field_id = m_archive_writer->add_node(internal_subtree_id, type, field_name); | |
if (field_id < 0) { | |
SPDLOG_ERROR("Failed to create internal field node"); | |
} | |
return field_id; | |
} |
As the PR currently does, I would prefer to let clp-s jsonl's metadata us orig_file_id instead of archive_id. so it is easier to use a common query logic for both json and ir stream file. (so no change is required) If we decided to rename it from orig_file_id to archive_id, I would prefer to let IR and JsonL both use "target_id", or "source_id". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few reviews about style changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! One thing I'm curious is whystd::string_view
offers better performance thanconst std::string &
since we don't use string literals.
// We use a buffer so that references to this key name are stable after this SchemaNode is move | ||
// constructed | ||
std::unique_ptr<char[]> m_key_buf; | ||
std::string_view m_key_name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to change it to std::string_view
and have two members? Or can you point me to the place where move happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The move happens whenever the array that contains the nodes is resized. The reason I care about making the buffer stable is because I wanted m_node_map to be able to reference the same string buffer -- if we use a regular std::string a view into the string's contents isn't stable under move because of the small string optimization.
Having m_node_map use a string_view into the node is partially to avoid extra allocations, but the main reason I did it is to make it simpler to do lookups in m_node_map using a string_view to a node name. It's still possible to do lookups using a string_view if we store an std::string in m_node_map, but it forces us to write our own hash and comparison functions that work with both std::string and std::string_view.
@@ -467,11 +467,19 @@ bool JsonParser::parse() { | |||
return false; | |||
} | |||
|
|||
// Add internal log_event_idx field to record | |||
auto log_event_idx = add_internal_field(constants::cLogEventIdxName, NodeType::Integer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the order column optional since it may affect the compression ratio a lot for some datasets? And if we want to record log order, we can call add_internal_field
only once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make it optional I'd like to make it the default + force its use it in the package. I talked to @kirkrodrigues about it and the general consensus we reached is that allowing people to generate archives that don't record order now will cause problems for us down the line.
We could also do some brief experiments to see if we can reduce the overhead (e.g.by trying delta encoding or using 32 bit field size) if that would help convince you. If those kinds of optimizations work we can consider adding them in a later PR.
For add_internal_field
I'm calling it that we so that we don't need to tie it to every different place where we flush an archive. If you want though I could put it in a lambda and call it in every one of those places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to set it as the default in the package.
For each archive, we can add the internal subtree and the node once, storing the node ID. After that, we just call add_value. Could you clarify what different places you're referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly line 506 where split_archive
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change to call add_metadata_field
once before the ingestion loop and after every invocation of split_archive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added a --disable-log-order
option for the compression flow.
@@ -21,21 +22,37 @@ enum class NodeType : uint8_t { | |||
NullValue, | |||
DateString, | |||
StructuredArray, | |||
Internal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a description somewhere to explain what an internal node is. And do you think metadata node sounds more intuitive because internal nodes also refer to nodes that have children (the opposite of leaf node)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. We can rename it to metadata node. I guess we could generally document usage of this enum in a comment and mention what the Metadata node type is for there.
@@ -93,7 +113,26 @@ class SchemaTree { | |||
return m_nodes[id]; | |||
} | |||
|
|||
int32_t get_root_node_id() const { return m_nodes[0].get_id(); } | |||
/** | |||
* @return the Id of the root of the Object sub-tree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And can we also explain it somewhere that the object sub-tree stores log data (while the internal sub-tree stores metadata)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't agree with this one. Metadata side should be clear enough after renaming it from Internal -> Metadata. The object sub-tree stores object-like data which may or may not be log data. Also, since we plan to store heterogeneous logs in the future I don't want to imply in the language that it's the main class of log data or that it is even guaranteed to exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. When might object-like data not count as log data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. if someone dumps an sqlite database as JSON and ingests that.
I can add something like "which is expected to contain JSON log data" to the docstring though, since that reflects the current use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. You are saying json data that may not be logs. Yeah we can say "semi-structured data" or "JSON data".
!= (m_wildcard_type_mask | ||
& node_to_literal_type(m_schema_tree->get_node(column_id).get_type()))) | ||
|| m_match.schema_searches_against_column(schema_id, column_id)) | ||
if (((0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we write it in a similar way like line 994?
Co-authored-by: wraymo <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: wraymo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/SchemaTree.hpp (2)
25-25
: Add documentation for the Internal node typeThe addition of
NodeType::Internal
would benefit from documentation explaining its purpose and relationship with the object subtree. Consider adding a comment block that clarifies:
- The purpose of Internal nodes
- The distinction between Internal and Object subtrees
- The types of metadata stored in Internal nodes
116-134
: Enhance method documentationWhile the methods are well-structured, consider enhancing the documentation:
- Add
@throws
section for potential exceptions- Provide examples of typical usage
- Explain the relationship between Object and Internal subtrees
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
components/core/src/clp_s/ArchiveWriter.hpp
(3 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/SchemaTree.hpp
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- components/core/src/clp_s/ArchiveWriter.hpp
- components/core/src/clp_s/JsonParser.hpp
🧰 Additional context used
📓 Learnings (1)
components/core/src/clp_s/SchemaTree.hpp (3)
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:171-171
Timestamp: 2024-11-12T18:46:20.933Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, it's acceptable to use `std::string_view` as keys in `m_node_map` because `SchemaNode`'s `m_key_name` remains valid even after move operations or reallocations, preventing dangling references.
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:40-55
Timestamp: 2024-11-12T18:56:31.068Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, within the `SchemaNode` class, the use of `std::string_view` for `m_key_name` referencing `m_key_buf` is intentional to ensure that references to the key name remain valid even after move construction.
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:91-94
Timestamp: 2024-11-12T18:47:03.828Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, the `SchemaNode` class uses `std::unique_ptr<char[]> m_key_buf` and `std::string_view m_key_name` to ensure that references to `m_key_name` remain valid even after `SchemaNode` is move-constructed.
🔇 Additional comments (5)
components/core/src/clp_s/SchemaTree.hpp (5)
40-55
: LGTM: Efficient string handling with stable references
The constructor implementation correctly handles string_view to minimize copying while ensuring reference stability after move operations.
68-68
: LGTM: Consistent use of string_view
The change to return string_view
aligns with the constructor changes and maintains efficiency.
91-94
: LGTM: Clear documentation and naming
The renamed buffer member and documentation clearly explain the move stability guarantee.
104-104
: LGTM: Consistent parameter type
The change to accept string_view
maintains consistency with other string handling optimizations.
170-172
: LGTM: Efficient map keys and clear subtree tracking
The use of string_view
in the map is safe given the implementation context, and the subtree ID tracking is clear and well-structured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (9)
components/core/src/clp_s/archive_constants.hpp (2)
21-25
: Consider adding documentation for special constants.These constants support the new "Internal" subtree for tracking log order, but their special values (empty strings and -1) would benefit from documentation explaining their significance and usage restrictions.
Consider adding comments like:
// Schema tree constants +// Special node identifiers for the Internal subtree implementation +// Note: Empty string names and negative IDs are reserved for internal use constexpr char cRootNodeName[] = ""; constexpr int32_t cRootNodeId = -1; constexpr char cMetadataSubtreeName[] = ""; constexpr char cLogEventIdxName[] = "log_event_idx";
35-41
: Consider renaming cArchiveId based on PR discussion.As discussed in the PR comments, consider using
orig_file_id
instead ofarchive_id
for consistency with IR stream files.namespace results_cache::search { constexpr char cOrigFilePath[]{"orig_file_path"}; constexpr char cLogEventIx[]{"log_event_ix"}; constexpr char cTimestamp[]{"timestamp"}; constexpr char cMessage[]{"message"}; -constexpr char cArchiveId[]{"archive_id"}; +constexpr char cOrigFileId[]{"orig_file_id"}; } // namespace results_cache::searchcomponents/core/src/clp_s/search/SearchUtils.cpp (1)
Line range hint
44-58
: Critical: Missing break statements in switch cases causing unintended fallthroughThe switch cases for handling different
FilterOperation
values are missingbreak
statements, which causes unintended fallthrough behaviour. This will lead to incorrect results as:
- LT/GTE cases will set ceiling then fall through to floor
- GT/LTE cases will set floor then fall through to static_cast
Here's the corrected implementation:
bool double_as_int(double in, FilterOperation op, int64_t& out) { switch (op) { case FilterOperation::EQ: out = static_cast<int64_t>(in); return in == static_cast<double>(out); case FilterOperation::LT: case FilterOperation::GTE: out = std::ceil(in); + break; case FilterOperation::GT: case FilterOperation::LTE: out = std::floor(in); + break; default: out = static_cast<int64_t>(in); } return true; }components/core/src/clp_s/SchemaTree.cpp (2)
8-8
: Consider optimizing string construction in node creationThe change to
std::string_view
for the parameter is good for performance, but thestd::string{key}
construction in the node creation still creates a copy. Consider moving the string construction to avoid temporary copies.- auto& node = m_nodes.emplace_back(parent_node_id, node_id, std::string{key}, type, 0); + auto& node = m_nodes.emplace_back(parent_node_id, node_id, key, type, 0);Also applies to: 17-17
37-51
: Consider optimizing metadata field lookup for large datasetsThe current implementation uses a linear search through all children. For better performance with large metadata sets, consider using a hash map to cache field name to ID mappings.
// Add as class member std::unordered_map<std::string_view, int32_t> m_metadata_field_cache; // Update method to use cache if (auto it = m_metadata_field_cache.find(field_name); it != m_metadata_field_cache.end()) { return it->second; }components/core/src/clp_s/JsonParser.hpp (1)
98-104
: Documentation could be more descriptiveWhile the documentation provides basic usage information, it could be enhanced to better explain:
- What constitutes an "internal field" in the MPT context
- Why field order matters in the schema
- The meaning and usage of the returned field ID
- The relationship with log-order recording (as mentioned in the PR objectives)
Consider expanding the documentation like this:
/** - * Adds an internal field to the MPT and get its Id. + * Adds an internal field to the Merkle Patricia Tree (MPT) and returns its unique identifier. + * Internal fields are metadata fields that are used for implementation details (like log ordering) + * and are not part of the original JSON structure. * * Note: this method should be called before parsing a record so that internal fields come first - * in each table. This isn't strictly necessary, but it is a nice convention. + * in each table's schema. This ordering helps maintain consistent schema layouts and + * simplifies internal field access patterns. + * + * @param field_name The name of the internal field to add + * @param type The NodeType specifying the field's data type + * @return The unique identifier assigned to this field in the MPT */components/core/src/clp_s/SchemaTree.hpp (1)
129-147
: LGTM! Clear and consistent interface documentationThe new metadata subtree methods are well-documented with clear return value specifications. Consider adding
@throws
documentation for potential exceptions in these methods, particularly for invalid access scenarios.components/core/src/clp_s/ArchiveReader.cpp (1)
315-320
: LGTM! Consider extracting type check to a helper methodThe logic for identifying and marking log event index columns is sound. The dynamic_cast provides safe type checking.
Consider extracting the type check to improve readability:
+ bool is_int64_column(BaseColumnReader* reader) { + return nullptr != dynamic_cast<Int64ColumnReader*>(reader); + } + - if (column_id == m_log_event_idx_column_id - && nullptr != dynamic_cast<Int64ColumnReader*>(column_reader)) + if (column_id == m_log_event_idx_column_id && is_int64_column(column_reader))components/core/src/clp_s/search/Output.cpp (1)
967-978
: Consider adding error handling and validation.While the implementation is correct, there are a few potential improvements:
- The assumption about no nested structures in the metadata subtree should be validated.
- Consider adding error handling for invalid metadata node structure.
- The method could benefit from logging when metadata subtree is not found.
void Output::populate_internal_columns() { int32_t metadata_subtree_root_node_id = m_schema_tree->get_metadata_subtree_node_id(); if (-1 == metadata_subtree_root_node_id) { + SPDLOG_DEBUG("No metadata subtree found in schema tree"); return; } // This code assumes that the metadata subtree contains no nested structures auto& metadata_node = m_schema_tree->get_node(metadata_subtree_root_node_id); + if (!metadata_node.is_valid()) { + SPDLOG_ERROR("Invalid metadata node in schema tree"); + return; + } for (auto child_id : metadata_node.get_children_ids()) { + if (m_schema_tree->get_node(child_id).has_children()) { + SPDLOG_WARN("Nested structure detected in metadata subtree"); + } m_metadata_columns.insert(child_id); } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
components/core/src/clp_s/ArchiveReader.cpp
(3 hunks)components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/SchemaTree.cpp
(2 hunks)components/core/src/clp_s/SchemaTree.hpp
(7 hunks)components/core/src/clp_s/archive_constants.hpp
(3 hunks)components/core/src/clp_s/search/Output.cpp
(5 hunks)components/core/src/clp_s/search/Output.hpp
(3 hunks)components/core/src/clp_s/search/SearchUtils.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/search/Output.hpp
🧰 Additional context used
📓 Learnings (1)
components/core/src/clp_s/SchemaTree.hpp (2)
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:171-171
Timestamp: 2024-11-12T18:46:20.933Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, it's acceptable to use `std::string_view` as keys in `m_node_map` because `SchemaNode`'s `m_key_name` remains valid even after move operations or reallocations, preventing dangling references.
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:40-55
Timestamp: 2024-11-12T18:56:31.068Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, within the `SchemaNode` class, the use of `std::string_view` for `m_key_name` referencing `m_key_buf` is intentional to ensure that references to the key name remain valid even after move construction.
🔇 Additional comments (18)
components/core/src/clp_s/archive_constants.hpp (2)
4-4
: LGTM! Good practice using fixed-width integer types.
The addition of <cstdint>
ensures consistent integer sizes across different platforms, which is crucial for serialization and network protocols.
35-41
: Verify search functionality with these new fields.
Let's verify that these new fields are properly integrated with MongoDB search results.
✅ Verification successful
Let me analyze the search functionality further by checking how these fields are used in the search implementation.
Based on the code analysis, I can now provide a final response:
The new search fields are properly integrated with MongoDB search results.
The code shows that these fields are correctly used in the MongoDB search functionality:
- The fields are properly defined as constants in
archive_constants.hpp
- They are correctly used in
OutputHandler.cpp
when constructing MongoDB documents:bsoncxx::builder::basic::kvp(constants::results_cache::search::cOrigFilePath, ...), bsoncxx::builder::basic::kvp(constants::results_cache::search::cMessage, ...), bsoncxx::builder::basic::kvp(constants::results_cache::search::cTimestamp, ...), bsoncxx::builder::basic::kvp(constants::results_cache::search::cArchiveId, ...), bsoncxx::builder::basic::kvp(constants::results_cache::search::cLogEventIx, ...)
- The search implementation properly handles these fields through the query processing and result construction pipeline
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if these field names are consistently used in search-related code
# Search for usage of these fields in search-related code
echo "Searching for field usage in search functionality..."
rg --type cpp "orig_file_path|log_event_ix|timestamp|message|archive_id" -g "**/*search*.{cpp,hpp}"
# Check for any potential naming inconsistencies
echo "Checking for potential naming inconsistencies..."
rg --type cpp "orig.*file.*id|archive.*id" -g "**/*.{cpp,hpp}"
Length of output: 38007
Script:
#!/bin/bash
# Check search implementation details
echo "Checking search implementation..."
ast-grep --pattern 'class $_ {
$$$
search($$$) {
$$$
}
$$$
}'
# Check field usage in MongoDB queries
echo "Checking MongoDB query construction..."
rg -A 10 "mongo|bson|query" components/core/src/clp_s/search/
Length of output: 79760
components/core/src/clp_s/search/SearchUtils.cpp (1)
Line range hint 37-41
: LGTM: Appropriate handling of metadata nodes
The addition of NodeType::Metadata
case mapping to LiteralType::UnknownT
aligns well with the PR's objective of preventing internal fields from being resolved during searches.
components/core/src/clp_s/SchemaTree.cpp (3)
19-25
: LGTM: Clear subtree ID assignment logic
The explicit checks for root node and type make the subtree ID assignment logic clearer and more maintainable.
67-69
: LGTM: Efficient string writing implementation
The direct buffer writing approach is more efficient than using write_string, and the size prefixing ensures safe reading.
Line range hint 8-69
: Verify compression ratio impact
Given the PR objectives mention compression ratio impacts of 2-6%, let's verify the changes don't introduce unexpected overhead.
components/core/src/clp_s/JsonParser.hpp (1)
6-6
: LGTM: Appropriate header inclusion
The addition of <string_view> header is correct and follows the codebase's header organization pattern. This supports the PR's goal of transitioning to std::string_view for better performance.
components/core/src/clp_s/SchemaTree.hpp (3)
8-8
: LGTM! Well-documented enum changes
The addition of the Metadata node type and its comprehensive documentation clearly explains its purpose for implementation-specific data like log order tracking. The documentation also properly addresses the enum ordering constraints.
Also applies to: 14-38
42-68
: LGTM! Efficient string handling with move safety
The implementation uses std::string_view
with a stable buffer to optimize string handling while ensuring references remain valid after move operations. The approach is well-documented and has been validated through previous discussions.
Also applies to: 81-81, 104-107
183-185
: LGTM! Verify subtree ID initialization
The member variables and map optimization look good. Let's verify that the subtree IDs are properly initialized when the corresponding subtrees are created.
✅ Verification successful
Let me gather more information about the initialization by checking the implementation file.
Subtree IDs are properly initialized - no issues found
The implementation correctly initializes the subtree IDs when nodes are added at the root level. The code shows that:
m_object_subtree_id
is set when an Object node is added as a root childm_metadata_subtree_id
is set when a Metadata node is added as a root child- The -1 initialization is properly used for validation, as seen in the
get_metadata_field_id
method
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify proper initialization of subtree IDs
# Look for places where m_object_subtree_id and m_metadata_subtree_id are set
# Search for subtree ID assignments
rg -l "m_(object|metadata)_subtree_id\s*=" --type cpp
# Search for add_node calls that might create these subtrees
ast-grep --pattern 'add_node($_, NodeType::Object, $_)'
ast-grep --pattern 'add_node($_, NodeType::Metadata, $_)'
Length of output: 214
Script:
#!/bin/bash
# Check the implementation of SchemaTree.cpp for subtree ID initialization
rg -A 5 -B 5 "m_(object|metadata)_subtree_id" components/core/src/clp_s/SchemaTree.cpp
# Look for constructor and initialization
ast-grep --pattern 'SchemaTree::SchemaTree() {
$$$
}'
# Look for add_node implementation
ast-grep --pattern 'add_node($$$) {
$$$
}'
Length of output: 1144
components/core/src/clp_s/ArchiveReader.cpp (2)
357-357
: LGTM! Proper cleanup of log event index state
The reset of m_log_event_idx_column_id to -1 in the close() method ensures proper cleanup of the reader's state.
30-31
: 🛠️ Refactor suggestion
Consider adding error handling for missing log event index field
The code assumes the log event index field exists in the schema tree. Consider handling the case where the field is not present to prevent potential issues downstream.
Let's verify if there are any existing error handling patterns for missing metadata fields:
Consider adding error handling:
- m_log_event_idx_column_id = m_schema_tree->get_metadata_field_id(constants::cLogEventIdxName);
+ try {
+ m_log_event_idx_column_id = m_schema_tree->get_metadata_field_id(constants::cLogEventIdxName);
+ } catch (const std::exception&) {
+ // Field not found, default to -1
+ m_log_event_idx_column_id = -1;
+ }
components/core/src/clp_s/JsonParser.cpp (2)
482-482
: LGTM! Good use of constants.
Using predefined constants instead of hardcoded values improves maintainability.
470-476
: Verify the impact on compression ratio.
The implementation for recording log order looks good. However, given the previous discussions about compression ratio impact, we should verify the overhead.
components/core/src/clp_s/search/Output.cpp (4)
65-65
: LGTM: Appropriate initialization of internal columns.
The placement of populate_internal_columns()
call ensures internal columns are properly identified before string queries are populated and filtering begins.
146-148
: LGTM: Proper handling of metadata columns.
The check correctly prevents metadata columns from being processed as regular columns during initialization, maintaining proper separation between data and metadata.
996-998
: LGTM: Consistent metadata column handling.
The check properly excludes metadata columns from wildcard column population, maintaining consistency with the overall metadata handling approach.
96-99
: Verify method signature changes across the codebase.
The change from get_next_message_with_timestamp
to get_next_message_with_metadata
and addition of log_event_idx
parameter aligns with the PR objectives. Let's verify the changes are consistent across the codebase.
✅ Verification successful
Let me gather more information about the write method signatures to complete the verification.
Let me search for more specific information about the OutputHandler's write method signature.
Method signature changes are consistent across the codebase
The verification shows:
- No references to the old
get_next_message_with_timestamp
method remain in the codebase - The new
get_next_message_with_metadata
method is properly defined in SchemaReader with the correct signature - The OutputHandler's
write
method consistently accepts thelog_event_idx
parameter across all implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify method signature changes are consistent.
# Test: Search for old and new method names
rg -A 2 "get_next_message_with_timestamp|get_next_message_with_metadata"
# Test: Verify write method signatures
ast-grep --pattern 'write(std::string $_, epochtime_t $_, $_)'
Length of output: 962
Script:
#!/bin/bash
# Search for write method declarations and definitions
ast-grep --pattern 'write($$$)'
# Search for any remaining references to get_next_message_with_timestamp
rg "get_next_message_with_timestamp" -l
# Look for OutputHandler interface or class definition
fd -e hpp -e cpp -x rg -A 5 "class.*OutputHandler"
Length of output: 3766
Script:
#!/bin/bash
# Search for write method declarations in OutputHandler
rg -A 3 "virtual.*write.*\(" components/core/src/clp_s/search/OutputHandler.hpp
# Search for write implementations in derived classes
rg -A 3 "void.*write.*\(" components/core/src/clp_s/search/OutputHandler.hpp
# Look for any remaining old method signatures
rg -A 3 "get_next_message_with_timestamp" components/core/src/clp_s/
Length of output: 1953
A lot of the changes to string_view aren't immediately a performance improvement, but they open us up to using string_view in more places in the codebase going forward instead of creating copies in std::string. E.g. there are many places in the current implementation of JsonParser where we turn a string_view into an std::string just so we can pass it somewhere else as an std::string const&. For the places I changed around marshalling I mostly started using std::string_view because as a result of other changes some code paths were passing std::string_view to some function and others were passing std::string to the same function, so accepting std::string_view as argument allows those functions to handle both (whereas only accepting std::string const& would force an unnecessary copy ont the std::string_view path). For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (9)
components/core/src/clp_s/JsonParser.hpp (2)
34-40
: Consider adding documentation for the record_log_order optionWhile the default initialization of members is a good practice, the new
record_log_order
option would benefit from documentation explaining its purpose and impact on compression behaviour.Add documentation above the struct:
struct JsonParserOption { + /** + * Configuration options for JSON parsing and compression + * + * @param record_log_order When true, preserves the original log event order + * during compression for ordered decompression (default: true) + */ std::vector<std::string> file_paths;
99-105
: Enhance method documentationThe documentation should include:
- Return value description
- Error handling behaviour
- Parameter descriptions
Improve the documentation:
/** * Adds an internal field to the MPT and get its Id. * + * @param field_name The name of the internal field to add + * @param type The type of the node to create + * @return The ID of the newly added field, or -1 if addition fails * * Note: this method should be called before parsing a record so that internal fields come first * in each table. This isn't strictly necessary, but it is a nice convention. + * + * @throws OperationFailed if the field cannot be added to the MPT */components/core/src/clp_s/SchemaTree.hpp (4)
21-25
: Enhance metadata documentation to clarify log-order trackingThe documentation could be more specific about the metadata subtree's primary purpose in this implementation - tracking log event order for compression and decompression.
Consider adding:
* the "Metadata" node type. The "Metadata" type is a special type used by the implementation to * demarcate data needed by the implementation that is not part of the log record. In particular, * the implementation may create a special subtree of the MPT which contains fields used to record - * things like original log order. + * implementation-specific data. The primary use is to record the original log event order during + * compression, enabling log-order decompression when the --ordered flag is used.
42-47
: Enhance documentation about string storage strategyWhile the move-safety is documented, it would be helpful to explain the string storage strategy more explicitly.
Consider adding:
* Note: the result of get_key_name is valid even if the original SchemaNode is later * move-constructed. + * Implementation: Key names are stored in a unique_ptr-managed buffer with string_view + * references, ensuring stable memory addresses even after container reallocations. */
135-141
: Add parameter documentation for get_metadata_field_idThe method documentation should specify the parameter's purpose and expected format.
Consider enhancing the documentation:
/** * Get the field Id for a specified field within the Metadata subtree. - * @param field_name + * @param field_name The name of the metadata field to look up (e.g., "log_event_idx") * * @return the field Id if the field exists within the Metadata sub-tree, -1 otherwise. */
184-185
: Document sentinel values for subtree IDsAdd a comment explaining the significance of the -1 sentinel value for subtree IDs.
+ // -1 indicates that the subtree hasn't been created yet int32_t m_object_subtree_id{-1}; int32_t m_metadata_subtree_id{-1};
components/core/src/clp_s/CommandLineArguments.hpp (1)
115-115
: Consider positive naming for better clarityThe double negative naming pattern (
get_record_log_order()
returning!m_no_record_log_order
) creates unnecessary cognitive overhead. Consider renaming for improved readability:- bool get_record_log_order() const { return false == m_no_record_log_order; } + bool is_log_order_recording_enabled() const { return m_record_log_order; } - bool m_no_record_log_order{false}; + bool m_record_log_order{true};Also applies to: 183-183
components/core/src/clp_s/clp-s.cpp (1)
Line range hint
1-450
: Consider documenting the deprecation of timestamp-ordered decompressionThe addition of log-order recording represents a significant architectural change. While the implementation is clean and maintains backwards compatibility, it would be beneficial to:
- Add code comments explaining that the
ordered
flag now enables log-order decompression- Document the 2-6% compression overhead in the code comments
- Consider adding a deprecation warning when timestamp-ordered decompression is used
Consider adding these comments near the relevant code sections:
+ // Note: As of <version>, the --ordered flag enables log-order decompression. + // Timestamp-ordered decompression is deprecated. + // The log-order recording adds a 2-6% compression overhead.components/core/src/clp_s/JsonParser.cpp (1)
451-460
: Consider moving lambda definition outside the parse method.The
add_log_event_idx_node
lambda is only used in two places and could be moved to a private method for better maintainability and reusability.- auto add_log_event_idx_node = [&]() { - if (m_record_log_order) { - log_event_idx_node_id - = add_metadata_field(constants::cLogEventIdxName, NodeType::Integer); - } - }; + void JsonParser::add_log_event_idx_node(int32_t& node_id) { + if (m_record_log_order) { + node_id = add_metadata_field(constants::cLogEventIdxName, NodeType::Integer); + } + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
components/core/src/clp_s/CommandLineArguments.cpp
(2 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/src/clp_s/JsonParser.cpp
(5 hunks)components/core/src/clp_s/JsonParser.hpp
(4 hunks)components/core/src/clp_s/SchemaTree.hpp
(7 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/CommandLineArguments.cpp
🧰 Additional context used
📓 Learnings (1)
components/core/src/clp_s/SchemaTree.hpp (2)
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:171-171
Timestamp: 2024-11-12T18:46:20.933Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, it's acceptable to use `std::string_view` as keys in `m_node_map` because `SchemaNode`'s `m_key_name` remains valid even after move operations or reallocations, preventing dangling references.
Learnt from: gibber9809
PR: y-scope/clp#584
File: components/core/src/clp_s/SchemaTree.hpp:40-55
Timestamp: 2024-11-12T18:56:31.068Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, within the `SchemaNode` class, the use of `std::string_view` for `m_key_name` referencing `m_key_buf` is intentional to ensure that references to the key name remain valid even after move construction.
🔇 Additional comments (9)
components/core/src/clp_s/JsonParser.hpp (3)
6-6
: LGTM: Header inclusion is appropriate
The addition of the string_view header is necessary for the new method parameter type.
122-122
: LGTM: Member variable declaration is consistent
The new member variable follows the class's naming convention and initialization pattern.
40-40
: Verify impact on existing compression workflows
Since this change introduces log-order recording by default, we should verify that existing compression workflows aren't adversely affected.
Also applies to: 122-122
components/core/src/clp_s/CommandLineArguments.hpp (1)
115-115
: Verify CLI flag implementation
The getter appears to be related to the new "--disable-log-order" CLI flag mentioned in the PR objectives. Let's verify its implementation:
✅ Verification successful
CLI flag implementation is correctly implemented
The verification shows that the --disable-log-order
flag is properly implemented:
- Flag is defined in
CommandLineArguments.cpp
and correctly setsm_no_record_log_order
usingbool_switch
- The getter
get_record_log_order()
correctly inverts the flag's value - The flag is used in
clp-s.cpp
to setoption.record_log_order
in the program options
The implementation follows a consistent pattern where the flag disables the feature (sets m_no_record_log_order
), and the getter inverts it to provide the positive state (record_log_order
).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the CLI flag is properly implemented
# Expected: Find the flag definition and its connection to m_no_record_log_order
# Search for the CLI flag definition
echo "Searching for CLI flag definition:"
rg --type cpp '"disable-log-order"' -A 5
# Search for places where the getter is used
echo -e "\nSearching for getter usage:"
rg --type cpp "get_record_log_order" -A 5
Length of output: 1820
components/core/src/clp_s/clp-s.cpp (1)
99-99
: Integration of log order recording looks good, suggest verifying related functionality
The addition of record_log_order
option aligns with the PR's objective to support log-order decompression. The implementation correctly integrates with the existing option initialization pattern.
Let's verify the integration with related functionality:
components/core/src/clp_s/JsonParser.cpp (4)
18-19
: LGTM: Constructor initialization list updated correctly.
The new member variable m_record_log_order
is properly initialized from the options.
494-494
: LGTM: Improved error handling for JSON parsing.
The code now properly catches and handles simdjson_error
exceptions during parsing.
519-519
: LGTM: Log event index node properly recreated after archive split.
The code correctly maintains the log event index tracking across archive splits.
550-557
: 🛠️ Refactor suggestion
Add error handling to add_metadata_field method.
The method should handle potential errors from add_node
operations.
int32_t JsonParser::add_metadata_field(std::string_view const field_name, NodeType type) {
- auto metadata_subtree_id = m_archive_writer->add_node(
+ int32_t metadata_subtree_id = m_archive_writer->add_node(
constants::cRootNodeId,
NodeType::Metadata,
constants::cMetadataSubtreeName
);
+ if (metadata_subtree_id < 0) {
+ SPDLOG_ERROR("Failed to create metadata subtree node");
+ return -1;
+ }
- return m_archive_writer->add_node(metadata_subtree_id, type, field_name);
+ auto field_id = m_archive_writer->add_node(metadata_subtree_id, type, field_name);
+ if (field_id < 0) {
+ SPDLOG_ERROR("Failed to create metadata field node");
+ }
+ return field_id;
}
Likely invalid or redundant comment.
@@ -189,11 +189,15 @@ CommandLineArguments::parse_arguments(int argc, char const** argv) { | |||
)( | |||
"print-archive-stats", | |||
po::bool_switch(&m_print_archive_stats), | |||
"Print statistics (json) about the archive after it's compressed." | |||
"Print statistics (json) about the archixve after it's compressed." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Print statistics (json) about the archixve after it's compressed." | |
"Print statistics (json) about the archive after it's compressed." |
)( | ||
"structurize-arrays", | ||
po::bool_switch(&m_structurize_arrays), | ||
"Structurize arrays instead of compressing them as clp strings." | ||
)( | ||
"disable-log-order", | ||
po::bool_switch(&m_no_record_log_order), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can just use m_disable_log_order
?
if (0 != m_metadata_columns.count(column_id)) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move these three lines out of the if branch?
If we disable log order, do you think we should throw errors for ordered decompression? |
Hmm, maybe just log a warning. I'm probably going to use this flag in a script to migrate older archive versions to newer ones and making this error out and fail would make me have to write a lot more error handling. |
…nd address review comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/JsonConstructor.cpp (1)
Line range hint
51-153
: Consider adding monitoring for index-based ordering transition, eh!While the error handling is solid, consider adding monitoring to track:
- Frequency of fallback to unordered decompression
- Index distribution statistics
- Performance metrics comparing ordered vs unordered decompression
This would help validate the impact of the transition and identify any potential issues early.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
components/core/src/clp_s/ArchiveReader.hpp
(2 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(2 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/src/clp_s/JsonConstructor.cpp
(5 hunks)components/core/src/clp_s/search/Output.cpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/core/src/clp_s/ArchiveReader.hpp
- components/core/src/clp_s/CommandLineArguments.cpp
- components/core/src/clp_s/CommandLineArguments.hpp
🔇 Additional comments (8)
components/core/src/clp_s/JsonConstructor.cpp (4)
51-55
: Clear and informative warning message implementation, eh!
The warning message effectively communicates both the issue (missing ordering information) and the fallback behaviour (out of order decompression). This aligns well with the transition to log-order decompression.
57-57
: Solid conditional logic for ordered decompression, eh!
The condition appropriately combines both requirements (ordered flag and log order presence) before proceeding with ordered decompression.
84-85
: Consistent implementation of index-based tracking, eh!
The transition from timestamp to index-based tracking is well implemented across:
- Variable declarations
- File naming convention
- MongoDB document structure
- Index assignment and increment logic
The comment about matching clp's behaviour with the exclusive end index is particularly helpful.
Also applies to: 106-110, 130-134, 151-151, 153-153
77-77
: Verify consistent index-based ordering implementation across the codebase, eh!
The comparison now uses log event indices instead of timestamps. Let's verify this transition is consistent across related components.
✅ Verification successful
Based on the search results, I can now generate the final response:
The transition to index-based ordering is consistently implemented, eh!
The codebase shows a clean implementation where:
get_next_timestamp()
is only used in test files and as an accessor inSchemaReader.hpp
- Index-based ordering is properly implemented in
JsonConstructor.cpp
and consistently used for record ordering
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining timestamp-based comparisons that might need updating
# Search for timestamp-based comparisons
rg -A 2 "get_next_timestamp|next_timestamp" --type cpp
# Search for potential missed index-based comparisons
rg -A 2 "get_next_log_event_idx|next_log_event_idx" --type cpp
Length of output: 2852
components/core/src/clp_s/search/Output.cpp (4)
141-143
: LGTM: Metadata column filtering in init.
The implementation correctly skips processing of metadata columns, maintaining separation between metadata and searchable content.
997-999
: LGTM: Consistent metadata handling in wildcard searches.
The implementation correctly excludes metadata columns from wildcard searches, maintaining consistency with the overall metadata handling approach.
96-99
: LGTM: Message retrieval with log event index.
The implementation correctly handles the retrieval and output of log event indices as part of message metadata. This aligns with the PR objective of recording log order during compression.
968-979
: Review assumption about metadata subtree structure.
The code assumes that the metadata subtree contains no nested structures (line 974). This assumption should be validated or enforced to prevent potential issues if the metadata structure changes in the future.
Consider adding validation to handle nested structures gracefully or document this limitation in the codebase.
✅ Verification successful
Let me gather more information about the schema tree structure and metadata handling.
Let me search for more specific information about the schema node structure and metadata handling.
Based on the evidence gathered, I can now provide the final response:
The assumption about non-nested metadata structures is valid and safe
The codebase implementation confirms that metadata nodes are handled appropriately:
- The
NodeType
enum shows metadata is treated as a distinct type, separate from structured types likeObject
andStructuredArray
- The metadata subtree is created directly under the root node with flat children, as seen in the
JsonParser.cpp
implementation - The schema tree implementation enforces this structure through its node creation and type handling
The current assumption in the code is actually a reflection of the enforced design rather than a limitation that needs validation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify if metadata subtree contains nested structures
# Test: Search for nested structures in metadata subtree definitions
# Search for metadata subtree definitions
ast-grep --pattern 'metadata_subtree$_'
# Search for potential nested structure definitions
rg -A 5 'metadata_subtree.*\{.*\}'
Length of output: 857
Script:
#!/bin/bash
# Search for SchemaNode definition and metadata-related implementations
ast-grep --pattern 'class SchemaNode {
$$$
}'
# Search for metadata subtree creation and population
rg -A 10 'metadata_subtree' components/core/src/clp_s/
# Look for any documentation about metadata structure
rg -l 'metadata' components/core/src/clp_s/README.md components/core/src/clp_s/**/*.md
Length of output: 6079
Script:
#!/bin/bash
# Search for NodeType enum definition to understand possible node types
rg -A 10 'enum.*NodeType'
# Search for metadata node creation and handling
ast-grep --pattern 'NodeType::Metadata'
# Look for any code that might create nested structures in metadata
rg -B 2 -A 5 'add_node.*Metadata.*'
Length of output: 997
Co-authored-by: wraymo <[email protected]>
Description
This PR adds support for recording log order at compression time and leveraging that information at decompression time to achieve log-order decompression. The
--ordered
flag now performs log-order decompression and timestamp-ordered decompression is no longer supported.Technically recording log order could be made optional, but per side-discussion it will save trouble down the line to force recording log order now.
Log order is recorded in a new "Internal" subtree in the MPT. Specifically, we add a new "Internal" node type and create a subtree off of the root node which can contain fields internal to the clp-s implementation. These fields are ignored during decompression (i.e. they do not get marshalled), and search (they can not be resolved to). The log event index is recorded in an integer field called "log_event_idx" in this subtree.
This PR also cleans up some code surrounding its changes. In particular the code to insert nodes into the MPT has been improved, and many instances of
std::string const&
have been replaced withstd::string_view const
throughout the codebase.Validation performed
Summary by CodeRabbit
Release Notes
New Features
ArchiveReader
andArchiveWriter
.SchemaReader
.JsonParser
andSchemaTree
for better data structure handling.Output
class to improve filtering capabilities.Improvements
Output
class to better manage internal columns.std::string_view
for method parameters.Bug Fixes
These changes collectively enhance the application's functionality, performance, and user experience.