Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clp-s): json to irv2 #657

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AVMatthews
Copy link
Contributor

@AVMatthews AVMatthews commented Jan 9, 2025

Description

This PR:

  • Exposes the JSON to IRV2 parsing to the user through the command line
  • Enables users to write the IRV2 format to a file.

Validation performed

Generated IR V2 format for all 5 JSON public datasets
ex) ./clp-s r elasticsearch_ir elasticsearch/

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a new command-line option for converting JSON files to Intermediate Representation (IR) format
    • Introduced advanced configuration options for JSON to IR conversion, including:
      • Output directory specification
      • Compression settings
      • Encoding type configuration
  • Improvements

    • Enhanced command-line interface with additional parsing capabilities
    • Improved error handling for new JSON conversion functionality

Copy link
Contributor

coderabbitai bot commented Jan 9, 2025

Walkthrough

The pull request introduces a new command-line option JsonToIr in the command-line parsing system. This feature enables converting JSON files to an Intermediate Representation (IR) format. The implementation includes new command-line options for specifying input and output paths, configuring compression settings, and handling encoding types. The changes span multiple files in the components/core/src/clp_s/ directory, adding new methods, enumerations, and functions to support the JSON to IR conversion process.

Changes

File Change Summary
CommandLineArguments.cpp - Added new command handling for JsonToIr
- Implemented new command-line options parsing
- Added help message function for new command
CommandLineArguments.hpp - Added JsonToIr to Command enum
- Introduced new getter methods for IR buffer size and encoding type
- Added private member variables for encoding and buffer configuration
JsonParser.hpp - Created new JsonToIrParserOption structure with parsing configuration options
clp-s.cpp - Added template functions for serialization and buffer management
- Implemented generate_ir function for JSON to IR conversion
- Modified main function to support new JsonToIr command

Sequence Diagram

sequenceDiagram
    participant CLI as Command Line Interface
    participant Parser as CommandLineArguments
    participant Generator as JSON to IR Generator
    participant Serializer as Serializer

    CLI->>Parser: Parse JsonToIr command
    Parser->>Generator: Validate and prepare options
    Generator->>Serializer: Initialize serialization
    Serializer->>Generator: Process JSON files
    Generator-->>CLI: Return conversion status
Loading

Possibly Related PRs

Finishing Touches

  • 📝 Generate Docstrings

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@AVMatthews AVMatthews changed the title Feat(clp s): json to irv2 feat(clp-s): json to irv2 Jan 9, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
components/core/src/clp_s/CommandLineArguments.cpp (2)

758-764: Fix typographical errors in option descriptions.

There are minor typos in the option descriptions:

  • Line 758: "before ir generation fails" should be "before IR generation fails".
  • Line 764: "befroe" should be "before".

Apply this diff to correct the typos:

-                        "Maximum allowed size (B) for a single document before ir generation fails."
+                        "Maximum allowed size (B) for a single document before IR generation fails."
...
-                        "Maximum allowed size (B) for an in memory IR buffer befroe being written to file."
+                        "Maximum allowed size (B) for an in-memory IR buffer before being written to file."

747-747: Consider renaming "Compression options" to "JSON to IR options".

For clarity, rename the header of the options from "Compression options" to "JSON to IR options", as these options are specific to the JsonToIr command.

Apply this diff to rename the options group:

-                po::options_description compression_options("Compression options");
+                po::options_description compression_options("JSON to IR options");
components/core/src/clp_s/CommandLineArguments.hpp (2)

29-30: Consider documenting the command character choice

The character 'r' for JsonToIr might not be immediately intuitive to users. Consider adding a comment explaining the rationale for this choice, or consider a more descriptive character if available.


202-203: Use named constants for magic numbers

Consider replacing the magic numbers with named constants to improve code readability and maintainability:

  • The encoding type value of 8 should be a named constant with documentation explaining its significance
  • The buffer size of 512MB could use a named constant similar to other size constants in the file
+    // Default encoding type for IR conversion
+    static constexpr int DEFAULT_ENCODING_TYPE = 8;
+    // Default maximum buffer size for IR conversion (512MB)
+    static constexpr size_t DEFAULT_MAX_IR_BUFFER_SIZE = 512ULL * 1024 * 1024;
-    int m_encoding_type{8};
-    size_t m_max_ir_buffer_size{512ULL * 1024 * 1024};
+    int m_encoding_type{DEFAULT_ENCODING_TYPE};
+    size_t m_max_ir_buffer_size{DEFAULT_MAX_IR_BUFFER_SIZE};
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b671 and 38229b5.

📒 Files selected for processing (4)
  • components/core/src/clp_s/CommandLineArguments.cpp (4 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (4 hunks)
  • components/core/src/clp_s/JsonParser.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🪛 cppcheck (2.10-2)
components/core/src/clp_s/clp-s.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (4)
components/core/src/clp_s/clp-s.cpp (1)

191-191: Verify exception safety in functions to prevent throwing exceptions from noexcept functions.

Static analysis has flagged that an exception may be thrown in a function that should not throw exceptions. Please verify that serialize_msgpack_map does not throw exceptions, or ensure it is not declared with noexcept.

Run the following script to check if serialize_msgpack_map is declared noexcept:

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

components/core/src/clp_s/JsonParser.hpp (1)

55-62: Struct JsonToIrParserOption added successfully.

The new structure JsonToIrParserOption is well-defined and follows appropriate coding standards.

components/core/src/clp_s/CommandLineArguments.hpp (2)

69-72: LGTM! Modern C++ practices well applied

The getter methods follow modern C++ best practices with [[nodiscard]] attribute and trailing return types. The const-qualification ensures thread safety.


178-179: LGTM! Consistent with existing usage methods

The method declaration follows the established pattern of other usage printing methods in the class.

Comment on lines +209 to +211
in_file.open(path, std::ifstream::in);

std::filesystem::path input_path{path};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling for input file opening failure.

The code does not check if the input file fails to open, which may lead to unexpected behaviour when in_file.open fails.

Apply this diff to handle file opening errors:

     in_file.open(path, std::ifstream::in);
+    if (false == in_file.is_open()) {
+        SPDLOG_ERROR("Failed to open input file: {}", path);
+        return false;
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
in_file.open(path, std::ifstream::in);
std::filesystem::path input_path{path};
in_file.open(path, std::ifstream::in);
if (false == in_file.is_open()) {
SPDLOG_ERROR("Failed to open input file: {}", path);
return false;
}
std::filesystem::path input_path{path};

Comment on lines +762 to +765
po::value<size_t>(&m_max_ir_buffer_size)->value_name("BUFFER_SIZE")->
default_value(m_max_ir_buffer_size),
"Maximum allowed size (B) for an in memory IR buffer befroe being written to file."
)(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add input validation for maximum IR buffer size.

The max-ir-buffer-size should be validated to ensure it is greater than zero to prevent potential division by zero or other unexpected behaviours.

Apply this diff to add input validation:

+                if (0 == m_max_ir_buffer_size) {
+                    throw std::invalid_argument("max-ir-buffer-size must be greater than zero.");
+                }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +768 to +769
default_value(m_encoding_type),
"4 (four byte encoding) or 8 (eight byte encoding)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add input validation for encoding type.

The encoding-type parameter should be validated to ensure that it is either 4 or 8, preventing unexpected behaviour or errors when invalid values are provided.

Apply this diff to add input validation:

+                if (m_encoding_type != 4 && m_encoding_type != 8) {
+                    throw std::invalid_argument("encoding-type must be either 4 or 8.");
+                }

Committable suggestion skipped: line range outside the PR's diff.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments about using IR v2 APIs

for (auto& path : all_file_paths) {
bool success;
if (option.encoding == 4) {
success = run_serializer<int32_t>(option, path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we should only generate 4-byte encoded IR stream (despite we do support 8-byte encoding). Correct me if I'm wrong @kirkrodrigues

}
auto& serializer{result.value()};
std::vector<int8_t> ir_buf;
flush_and_clear_serializer_buffer(serializer, ir_buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper is supposed to be used for unit test only. The IR serializer already has an internal IR buffer; using an extra ir_buf here will introduce unnecessary memory copy. We should use the serializer's internal IR buffer directly.

Comment on lines +171 to +173
auto const view{serializer.get_ir_buf_view()};
byte_buf.insert(byte_buf.cend(), view.begin(), view.end());
serializer.clear_ir_buf();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how to access the serializer's internal IR buffer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants