feat: Add worker client that submit tasks and get next task from schedulers #35

sitaowang1998 · 2024-12-12T06:54:04Z

Description

Add get normal schedulers in metadata storage. Add logic to update outputs, update dependent tasks inputs and states in metadata storage task_finish. Add worker client that submit tasks and get next task from scheduler.

Validation performed

GitHub workflows pass
Unit tests for metadata storage in devcontainer

Summary by CodeRabbit

New Features
- Introduced Driver and Scheduler classes for enhanced task and driver management.
- Added functionality to manage task completion and retrieval of active schedulers.
- New test case for task completion handling.
Bug Fixes
- Updated method signatures for adding drivers and schedulers to improve encapsulation.
Tests
- Refactored test cases to utilize Driver and Scheduler objects.
- Added new test case for verifying task completion functionality.

…inish

coderabbitai · 2024-12-12T06:54:11Z

Walkthrough

The pull request introduces modifications to the CMakeLists.txt file and various source files within the Spider project. Key changes include the addition of new header and source files for core, worker, and client modules, expanding their functionalities. The Driver and Scheduler classes are introduced, along with updates to the MetadataStorage class methods to accept these new classes. Additionally, test cases have been revised to reflect these changes, ensuring a consistent object-oriented approach throughout the project.

Changes

File Path	Change Summary
src/spider/CMakeLists.txt	Added new header and source files: `core/Driver.hpp`, `worker/WorkerClient.hpp`, `worker/WorkerClient.cpp`, `client/Driver.hpp`, `client/Job.hpp`.
src/spider/core/Driver.hpp	Introduced classes `Driver` and `Scheduler` with respective constructors and accessor methods.
src/spider/storage/MetadataStorage.hpp	Updated `add_driver` to accept a `Driver` object; added `add_scheduler` and `get_active_scheduler` methods; modified `task_finish` method signature.
src/spider/storage/MysqlStorage.cpp	Refactored `add_driver` and `add_scheduler` methods to accept `Driver` and `Scheduler` objects; added `get_active_scheduler` method; modified `task_finish`.
src/spider/storage/MysqlStorage.hpp	Updated method signatures for `add_driver`, `add_scheduler`, and `task_finish` to reflect the new object-oriented design.
src/spider/worker/WorkerClient.cpp	Implemented `WorkerClient` class with methods for task completion and retrieval of the next task.
src/spider/worker/WorkerClient.hpp	Introduced `WorkerClient` class with methods for task management; deleted copy and move constructors to ensure unique ownership.
tests/scheduler/test-SchedulerPolicy.cpp	Updated test cases to add drivers using `Driver` objects instead of separate parameters.
tests/storage/test-DataStorage.cpp	Modified test cases to utilize `Driver` objects for adding drivers to metadata storage.
tests/storage/test-MetadataStorage.cpp	Updated test cases to use `Driver` and `Scheduler` objects; added a new test case for task completion handling.

Possibly related PRs

feat: Add MySql support for storage backend #20: The main PR introduces new methods in MetadataStorage for managing drivers and tasks, which aligns with the changes in MysqlStorage that also enhance driver and task management functionalities.
test: Add tests for data storage #21: This PR adds tests for data storage, which is directly related to the changes in the main PR that modify how data is managed in MetadataStorage and MysqlStorage.
test: Add unit tests for metadata storage and fix bugs in MySQL storage backend #23: This PR enhances the testing framework for metadata storage, which is relevant to the changes in the main PR that introduce new methods for job and task management in MetadataStorage.
feat: Add scheduler scheduling #31: The introduction of a scheduler policy in this PR complements the main PR's changes that enhance task scheduling and management.
feat: Add scheduler policy and data stroage garbage collection #34: The addition of a scheduler policy and garbage collection in this PR is related to the main PR's focus on improving task management and data handling in the storage system.

Suggested reviewers

kirkrodrigues

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)

src/spider/storage/MysqlStorage.cpp (1)
984-1052: Consider refactoring to eliminate code duplication in task_finish

The loops for updating task outputs and inputs share similar structures. Refactoring these into a helper function could reduce duplication and improve maintainability.

Apply this refactoring suggestion:
// Create a helper function to update outputs and inputs
+void update_task_data(std::unique_ptr<sql::PreparedStatement>& statement, const std::vector<TaskOutput>& outputs, const sql::bytes& task_id_bytes) {
+    for (size_t i = 0; i < outputs.size(); ++i) {
+        const TaskOutput& output = outputs[i];
+        const std::optional<std::string>& value = output.get_value();
+        if (value.has_value()) {
+            statement->setString(1, value.value());
+        } else {
+            statement->setNull(1, sql::DataType::VARCHAR);
+        }
+        const std::optional<boost::uuids::uuid>& data_id = output.get_data_id();
+        if (data_id.has_value()) {
+            sql::bytes data_id_bytes = uuid_get_bytes(data_id.value());
+            statement->setBytes(2, &data_id_bytes);
+        } else {
+            statement->setNull(2, sql::DataType::BINARY);
+        }
+        statement->setBytes(3, &task_id_bytes);
+        statement->setUInt(4, i);
+        statement->executeUpdate();
+    }
+}

// Use the helper function for outputs
 update_task_data(output_statement, outputs, task_id_bytes);

// Use the helper function for inputs
 update_task_data(input_statement, outputs, task_id_bytes);
src/spider/core/Driver.hpp (1)
31-34: Optimise Scheduler accessors by returning const references

Returning const references for get_addr() and possibly get_id() improves efficiency by preventing copies.

Apply these diffs:
-    [[nodiscard]] auto get_addr() const -> std::string { return m_addr; }
+    [[nodiscard]] auto get_addr() const -> const std::string& { return m_addr; }

-    [[nodiscard]] auto get_id() const -> boost::uuids::uuid { return m_id; }
+    [[nodiscard]] auto get_id() const -> const boost::uuids::uuid& { return m_id; }
src/spider/storage/MetadataStorage.hpp (1)

51-52: Consider adding documentation for the outputs parameter.

The addition of TaskOutput vector parameter is well-designed, but documenting the expected format and handling of outputs would improve maintainability.

tests/storage/test-MetadataStorage.cpp (1)

228-284: Review the new task finish test case implementation

The test case effectively validates:

Task state transitions

Parent-child task relationships

Output propagation

However, consider adding test coverage for edge cases:

Task finish with invalid outputs

Concurrent task completions

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4e988b and b691e37.

📒 Files selected for processing (10)

src/spider/CMakeLists.txt (2 hunks)
src/spider/core/Driver.hpp (1 hunks)
src/spider/storage/MetadataStorage.hpp (3 hunks)
src/spider/storage/MysqlStorage.cpp (5 hunks)
src/spider/storage/MysqlStorage.hpp (3 hunks)
src/spider/worker/WorkerClient.cpp (1 hunks)
src/spider/worker/WorkerClient.hpp (1 hunks)
tests/scheduler/test-SchedulerPolicy.cpp (3 hunks)
tests/storage/test-DataStorage.cpp (4 hunks)
tests/storage/test-MetadataStorage.cpp (5 hunks)

🧰 Additional context used

🪛 cppcheck (2.10-2)

src/spider/worker/WorkerClient.cpp

[performance] 88-88: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.

(useInitializationList)

🔇 Additional comments (19)

src/spider/worker/WorkerClient.cpp (2)

26-35: Constructor implementation looks correct and efficient

The member variables are properly initialized using the initializer list, and std::move is used where appropriate.

37-44: task_finish method is well-implemented

The method correctly updates task status and retrieves the next task.

src/spider/storage/MysqlStorage.cpp (2)

Line range hint 295-309: Refactored add_driver method enhances encapsulation

Accepting a Driver object instead of separate parameters improves code clarity and maintainability.

Line range hint 310-328: add_scheduler method updated effectively

The change to accept a Scheduler object aligns with the updated design and promotes better encapsulation.

src/spider/worker/WorkerClient.hpp (1)

17-48: WorkerClient class design is sound and follows best practices

The class correctly manages resources and prevents copying and moving, ensuring unique ownership.

src/spider/storage/MetadataStorage.hpp (2)

9-9: LGTM! Include addition is appropriate.

The addition of Driver.hpp is necessary for the new method signatures and follows the existing include organization pattern.

29-32: Well-structured API changes for driver and scheduler management.

The changes improve encapsulation by using proper object-oriented design:

Using Driver object instead of separate parameters reduces coupling
New scheduler methods align with the PR's objective to enhance scheduler management

src/spider/CMakeLists.txt (2)

13-13: LGTM! Core header addition is appropriate.

The Driver.hpp header is correctly placed in SPIDER_CORE_HEADERS, maintaining the project's organization.

52-53: Verify WorkerClient implementation completeness.

The addition of WorkerClient files aligns with the PR objectives. Let's verify the implementation includes all necessary components.

✅ Verification successful

WorkerClient implementation is complete and well-structured

The implementation includes all essential components:

Proper task lifecycle management through task_finish method
Task scheduling via get_next_task method
Robust error handling and scheduler communication
Appropriate storage interfaces for both data and metadata
Thread-safe design with deleted copy/move operations

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify WorkerClient implementation completeness

# Check for essential methods in WorkerClient
ast-grep --pattern 'class WorkerClient {
  $$$
  submit_task($_) {
    $$$
  }
  get_next_task($_) {
    $$$
  }
  $$$
}'

Length of output: 123

Script:

#!/bin/bash
# Let's check the actual content of WorkerClient files
echo "=== WorkerClient.hpp ==="
cat src/spider/worker/WorkerClient.hpp

echo -e "\n=== WorkerClient.cpp ==="
cat src/spider/worker/WorkerClient.cpp

Length of output: 4968

src/spider/storage/MysqlStorage.hpp (3)

14-14: LGTM! Include addition is appropriate.

The Driver.hpp include is necessary and follows the existing include organization.

55-56: LGTM! Method signature properly overrides base class.

The task_finish method correctly implements the interface change to handle task outputs.

35-38: Verify MySQL implementation completeness.

The method declarations properly override the base class. Let's verify the implementation handles all cases correctly.

✅ Verification successful

Let me gather more information about the base class to ensure the implementations match the expected behavior.

Let me try a different approach to verify the base class and implementation details.

Implementation appears complete and well-structured

The MySQL implementation properly handles all required operations:

add_driver: Correctly inserts driver data with proper error handling for duplicates
add_scheduler: Handles both driver and scheduler table insertions in a single transaction
get_active_scheduler: Implements proper JOIN query to fetch active schedulers with their details

The implementations include:

Proper transaction handling
Error handling with appropriate StorageErr types
Consistent data type conversions
Complete CRUD operations matching the interface requirements

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify MySQL implementation completeness

# Check for complete implementation of new methods
rg -A 5 "MySqlMetadataStorage::add_driver|MySqlMetadataStorage::add_scheduler|MySqlMetadataStorage::get_active_scheduler" --type cpp

Length of output: 1875

Script:

#!/bin/bash
# Check base class method signatures and any relevant documentation
ast-grep --pattern 'class Storage {
  $$$
  virtual auto add_driver(Driver const& driver) -> StorageErr = 0;
  virtual auto add_scheduler(Scheduler const& scheduler) -> StorageErr = 0;
  virtual auto get_active_scheduler(std::vector<Scheduler>* schedulers) -> StorageErr = 0;
  $$$
}'

# Check for any test cases that might reveal expected behavior
rg -l "TEST.*MySqlMetadataStorage" --type cpp

Length of output: 334

Script:

#!/bin/bash
# Check Storage base class definition
rg -A 10 "class Storage" --type hpp

# Check complete implementation of these methods
rg -A 15 "MySqlMetadataStorage::(add_driver|add_scheduler|get_active_scheduler)" --type cpp

# Look for any related test files
fd "mysql.*test" --type f

Length of output: 4404

tests/scheduler/test-SchedulerPolicy.cpp (2)

16-16: LGTM: Core Driver header inclusion

The addition of the Driver header aligns with the encapsulation of driver-related functionality.

93-93: LGTM: Driver object encapsulation

The update to use the Driver class for driver creation improves code organization and type safety.

Also applies to: 138-138

tests/storage/test-DataStorage.cpp (2)

10-10: LGTM: Driver header inclusion

The addition of the Driver header supports the new driver management approach.

28-28: LGTM: Consistent Driver object usage

The update to use the Driver class across all test cases ensures consistency in driver management.

Also applies to: 174-175

tests/storage/test-MetadataStorage.cpp (3)

15-15: LGTM: Driver header inclusion

The addition of the Driver header supports the encapsulation of driver-related functionality.

185-186: Verify the removal of time-based assertions

The commented-out time assertions suggest potential timing issues. Consider either:

Implementing more reliable time-based testing
Using a different approach to validate job creation timing

✅ Verification successful

Let me gather more context about the test file and related implementations.

Time-based assertions were appropriately removed

The commented-out assertions were attempting to verify job creation timestamps with a 1-second delta tolerance, which is inherently flaky. The removal is justified as:

These assertions could lead to intermittent test failures
The remaining assertions adequately verify the critical aspects (job ID, client ID, and task graph equality)
No other time-based assertions are present in the test suite

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for other time-based assertions in test files
rg "time_delta|creation_time" tests/

Length of output: 474

Script:

#!/bin/bash
# Check the full test context and implementation
rg -B 5 -A 5 "job_creation_time" tests/storage/test-MetadataStorage.cpp

# Check for other timing-related tests or assertions
ast-grep --pattern 'REQUIRE($$$time$$$)'

Length of output: 1213

258-270: Verify task state transition handling

The test validates that completing one parent task doesn't prematurely update child task states. Consider adding assertions to verify:

Parent task's final state
Task output persistence

coderabbitai · 2024-12-12T06:58:23Z

src/spider/worker/WorkerClient.cpp

+auto WorkerClient::get_next_task() -> std::optional<boost::uuids::uuid> {
+    // Get schedulers
+    std::vector<core::Scheduler> schedulers;
+    if (!m_metadata_store->get_active_scheduler(&schedulers).success()) {
+        return std::nullopt;
+    }
+    std::random_device random_device;
+    std::default_random_engine rng{random_device()};
+    std::ranges::shuffle(schedulers, rng);
+
+    std::vector<boost::asio::ip::tcp::endpoint> endpoints;
+    std::ranges::transform(
+            schedulers,
+            std::back_inserter(endpoints),
+            [](core::Scheduler const& scheduler) {
+                return boost::asio::ip::tcp::endpoint{
+                        boost::asio::ip::make_address(scheduler.get_addr()),
+                        static_cast<unsigned short>(scheduler.get_port())
+                };
+            }
+    );
+    try {
+        // Create socket to scheduler
+        boost::asio::ip::tcp::socket socket(m_context);
+        boost::asio::connect(socket, endpoints);
+
+        scheduler::ScheduleTaskRequest const request{m_worker_id, m_worker_addr};
+        msgpack::sbuffer request_buffer;
+        msgpack::pack(request_buffer, request);
+
+        core::send_message(socket, request_buffer);
+
+        // Receive response
+        std::optional<msgpack::sbuffer> const optional_response_buffer
+                = core::receive_message(socket);
+        if (!optional_response_buffer.has_value()) {
+            return std::nullopt;
+        }
+        msgpack::sbuffer const& response_buffer = optional_response_buffer.value();
+
+        scheduler::ScheduleTaskResponse response;
+        msgpack::object_handle const response_handle
+                = msgpack::unpack(response_buffer.data(), response_buffer.size());
+        response_handle.get().convert(response);
+
+        return response.get_task_id();
+    } catch (boost::system::system_error const& e) {
+        return std::nullopt;
+    } catch (std::runtime_error const& e) {
+        return std::nullopt;
+    }
+}


💡 Codebase verification

Empty scheduler list is properly handled, but connection attempt could be optimized

The code already handles the case of no active schedulers through two mechanisms:

Early return with std::nullopt if get_active_scheduler fails

Exception handling that returns std::nullopt if the connection attempt fails

However, it would be more efficient to add an explicit check before the connection attempt:

if (endpoints.empty()) { return std::nullopt; }

🔗 Analysis chain

Ensure robust handling when no schedulers are available

If there are no active schedulers, the method will attempt to connect with an empty endpoints list. Confirm that this behaviour is acceptable and consider handling this case explicitly.

Run the following script to check for potential issues when no schedulers are available:

Let me gather more information about the handling of empty lists in boost::asio::connect.

Let me check for tests in the correct directory structure and look for any error handling patterns.

Let me check the test files specifically for WorkerClient tests and empty scheduler scenarios.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Verify behaviour when no schedulers are returned. # Test: Check the handling of empty schedulers list in `get_next_task` rg -A 5 'get_active_scheduler' src/spider/worker/WorkerClient.cpp

Length of output: 308

Script:

#!/bin/bash # Check if there are any tests covering this scenario rg -l "get_next_task.*empty" test/ rg -l "boost::asio::connect.*empty" test/ # Look for connect documentation or similar usage ast-grep --pattern 'boost::asio::connect($_, $_)'

Length of output: 224

Script:

#!/bin/bash # Find test directories fd -t d "test" # Look for any boost::asio connect usage patterns rg -A 5 "boost::asio::connect" # Check for any error handling patterns with empty containers rg "endpoints\.empty"

Length of output: 2310

Script:

#!/bin/bash # Look for WorkerClient tests fd "test.*WorkerClient" tests/ # Check test content if found rg -A 10 "get_next_task" tests/worker/ # Look at metadata store interface rg -A 5 "get_active_scheduler" src/

Length of output: 2407

🧰 Tools

🪛 cppcheck (2.10-2)

[performance] 88-88: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.

(useInitializationList)

coderabbitai · 2024-12-12T06:58:23Z

src/spider/core/Driver.hpp

+    [[nodiscard]] auto get_addr() const -> std::string { return m_addr; }
+


🛠️ Refactor suggestion

Optimise get_addr() by returning a const reference

Returning a const std::string& avoids unnecessary copying of the address string.

Apply this diff:

- [[nodiscard]] auto get_addr() const -> std::string { return m_addr; } + [[nodiscard]] auto get_addr() const -> const std::string& { return m_addr; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

[[nodiscard]] auto get_addr() const -> std::string { return m_addr; }

[[nodiscard]] auto get_addr() const -> const std::string& { return m_addr; }

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)

src/spider/core/Driver.hpp (2)
11-13: Add class documentation and improve constructor parameter semantics.

Consider adding documentation to explain the class's purpose and responsibility. Also, make the move semantics more explicit in the constructor.
+/**
+ * Represents a worker node in the Spider system.
+ * Encapsulates the identity and network location of a worker.
+ */
 class Driver {
 public:
-    Driver(boost::uuids::uuid const id, std::string addr) : m_id{id}, m_addr{std::move(addr)} {}
+    /**
+     * Constructs a Driver instance.
+     * @param id Unique identifier for the driver
+     * @param addr Network address (ownership transferred)
+     */
+    explicit Driver(boost::uuids::uuid const id, std::string&& addr) : m_id{id}, m_addr{std::move(addr)} {}
11-41: Consider extracting a common interface for network entities.

Both Driver and Scheduler share common network entity characteristics (ID and address). Consider introducing an abstract base class or interface to represent this common behaviour, which could benefit future extensions and maintain consistency.

Example interface:
class NetworkEntity {
public:
    virtual ~NetworkEntity() = default;
    [[nodiscard]] virtual auto get_id() const -> boost::uuids::uuid const& = 0;
    [[nodiscard]] virtual auto get_addr() const -> std::string const& = 0;
};

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b691e37 and 97d46a2.

📒 Files selected for processing (2)

src/spider/core/Driver.hpp (1 hunks)
src/spider/worker/WorkerClient.cpp (1 hunks)

🧰 Additional context used

🪛 cppcheck (2.10-2)

src/spider/worker/WorkerClient.cpp

[performance] 88-88: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.

(useInitializationList)

🔇 Additional comments (1)

src/spider/core/Driver.hpp (1)

1-9: LGTM! Header structure and includes are well-organized.

The file structure follows best practices with proper header guards and minimal necessary includes.

coderabbitai · 2024-12-12T07:05:36Z

src/spider/core/Driver.hpp

+
+class Driver {
+public:
+    Driver(boost::uuids::uuid const id, std::string addr) : m_id{id}, m_addr{std::move(addr)} {}


🛠️ Refactor suggestion

Add address validation in constructor.

Consider validating that the address isn't empty to ensure valid driver instances.

- Driver(boost::uuids::uuid const id, std::string addr) : m_id{id}, m_addr{std::move(addr)} {} + Driver(boost::uuids::uuid const id, std::string addr) { + if (addr.empty()) { + throw std::invalid_argument("Driver address cannot be empty"); + } + m_id = id; + m_addr = std::move(addr); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Driver(boost::uuids::uuid const id, std::string addr) : m_id{id}, m_addr{std::move(addr)} {}

Driver(boost::uuids::uuid const id, std::string addr) {

if (addr.empty()) {

throw std::invalid_argument("Driver address cannot be empty");

}

m_id = id;

m_addr = std::move(addr);

}

coderabbitai · 2024-12-12T07:05:37Z

src/spider/core/Driver.hpp

+class Scheduler {
+public:
+    Scheduler(boost::uuids::uuid const id, std::string addr, int port)
+            : m_id{id},
+              m_addr{std::move(addr)},
+              m_port{port} {}
+


🛠️ Refactor suggestion

Add documentation and parameter validation for Scheduler class.

The Scheduler class needs documentation and parameter validation.

+/** + * Represents a scheduler node in the Spider system. + * Manages task distribution and scheduling operations. + */ class Scheduler { public: + /** + * Constructs a Scheduler instance. + * @param id Unique identifier for the scheduler + * @param addr Network address (ownership transferred) + * @param port Network port number (must be between 1 and 65535) + * @throws std::invalid_argument if port is invalid or address is empty + */ - Scheduler(boost::uuids::uuid const id, std::string addr, int port) - : m_id{id}, - m_addr{std::move(addr)}, - m_port{port} {} + Scheduler(boost::uuids::uuid const id, std::string&& addr, int port) { + if (addr.empty()) { + throw std::invalid_argument("Scheduler address cannot be empty"); + } + if (port <= 0 || port > 65535) { + throw std::invalid_argument("Invalid port number"); + } + m_id = id; + m_addr = std::move(addr); + m_port = port; + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

class Scheduler {

public:

Scheduler(boost::uuids::uuid const id, std::string addr, int port)

: m_id{id},

m_addr{std::move(addr)},

m_port{port} {}

/**

* Represents a scheduler node in the Spider system.

* Manages task distribution and scheduling operations.

*/

class Scheduler {

public:

/**

* Constructs a Scheduler instance.

* @param id Unique identifier for the scheduler

* @param addr Network address (ownership transferred)

* @param port Network port number (must be between 1 and 65535)

* @throws std::invalid_argument if port is invalid or address is empty

*/

Scheduler(boost::uuids::uuid const id, std::string&& addr, int port) {

if (addr.empty()) {

throw std::invalid_argument("Scheduler address cannot be empty");

}

if (port <= 0 || port > 65535) {

throw std::invalid_argument("Invalid port number");

}

m_id = id;

m_addr = std::move(addr);

m_port = port;

}

coderabbitai · 2024-12-12T07:05:37Z

src/spider/worker/WorkerClient.cpp

+WorkerClient::WorkerClient(
+        boost::uuids::uuid const worker_id,
+        std::string worker_addr,
+        std::shared_ptr<core::DataStorage> data_store,
+        std::shared_ptr<core::MetadataStorage> metadata_store
+)
+        : m_worker_id{worker_id},
+          m_worker_addr{std::move(worker_addr)},
+          m_data_store(std::move(data_store)),
+          m_metadata_store(std::move(metadata_store)) {}


⚠️ Potential issue

Declare and initialize m_context member variable

The member variable m_context is used in the get_next_task method but is neither declared nor initialized in the WorkerClient class. This omission will lead to compilation errors due to the undefined reference to m_context. Please declare m_context as a member variable of type boost::asio::io_context.

Add the following code to WorkerClient.hpp inside the WorkerClient class:

private: boost::asio::io_context m_context;

coderabbitai · 2024-12-12T07:05:37Z

src/spider/worker/WorkerClient.cpp

+            [](core::Scheduler const& scheduler) {
+                return boost::asio::ip::tcp::endpoint{
+                        boost::asio::ip::make_address(scheduler.get_addr()),
+                        static_cast<unsigned short>(scheduler.get_port())
+                };
+            }
+    );


🛠️ Refactor suggestion

Validate port numbers to prevent potential truncation

Casting scheduler.get_port() to unsigned short without validation may result in truncation or unexpected behaviour if the port number exceeds the valid range of 0 to 65,535. To ensure robust network communication, validate that the port number is within the acceptable range before casting.

Apply this diff to add port number validation:

return boost::asio::ip::tcp::endpoint{ boost::asio::ip::make_address(scheduler.get_addr()), - static_cast<unsigned short>(scheduler.get_port()) + static_cast<unsigned short>( + (scheduler.get_port() >= 0 && scheduler.get_port() <= 65535) + ? scheduler.get_port() + : throw std::out_of_range("Port number out of valid range") + ) };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

[](core::Scheduler const& scheduler) {

return boost::asio::ip::tcp::endpoint{

boost::asio::ip::make_address(scheduler.get_addr()),

static_cast<unsigned short>(scheduler.get_port())

};

}

);

[](core::Scheduler const& scheduler) {

return boost::asio::ip::tcp::endpoint{

boost::asio::ip::make_address(scheduler.get_addr()),

static_cast<unsigned short>(

(scheduler.get_port() >= 0 && scheduler.get_port() <= 65535)

? scheduler.get_port()

: throw std::out_of_range("Port number out of valid range")

)

};

}

);

sitaowang1998 and others added 7 commits December 10, 2024 21:18

Add basic worker client interface

932dd82

Add get active scheduler in metadata storage

28e3960

Fix clang tidy

b340477

Add update task output, update dependent task input & state to task f…

9ffb1d4

…inish

Add fetch task from scheduler in task finish

0cff600

Split get next task into separate function

29ca2e2

Fix clang-tidy

b691e37

coderabbitai bot reviewed Dec 12, 2024

View reviewed changes

Improve code based on code rabbit suggestion

97d46a2

coderabbitai bot reviewed Dec 12, 2024

View reviewed changes

sitaowang1998 merged commit 38297fc into y-scope:main Dec 12, 2024
4 checks passed

sitaowang1998 deleted the worker-client branch December 12, 2024 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add worker client that submit tasks and get next task from schedulers #35

feat: Add worker client that submit tasks and get next task from schedulers #35

sitaowang1998 commented Dec 12, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 12, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Dec 12, 2024

coderabbitai bot Dec 12, 2024

coderabbitai bot left a comment

coderabbitai bot Dec 12, 2024

coderabbitai bot Dec 12, 2024

coderabbitai bot Dec 12, 2024

coderabbitai bot Dec 12, 2024

		[[nodiscard]] auto get_addr() const -> std::string { return m_addr; }

	[[nodiscard]] auto get_addr() const -> std::string { return m_addr; }
	[[nodiscard]] auto get_addr() const -> const std::string& { return m_addr; }

-    Driver(boost::uuids::uuid const id, std::string addr) : m_id{id}, m_addr{std::move(addr)} {}
+    Driver(boost::uuids::uuid const id, std::string addr) {
+        if (addr.empty()) {
+            throw std::invalid_argument("Driver address cannot be empty");
+        }
+        m_id = id;
+        m_addr = std::move(addr);
+    }

-class Scheduler {
-public:
-    Scheduler(boost::uuids::uuid const id, std::string addr, int port)
-            : m_id{id},
-              m_addr{std::move(addr)},
-              m_port{port} {}
+/**
+ * Represents a scheduler node in the Spider system.
+ * Manages task distribution and scheduling operations.
+ */
+class Scheduler {
+public:
+    /**
+     * Constructs a Scheduler instance.
+     * @param id Unique identifier for the scheduler
+     * @param addr Network address (ownership transferred)
+     * @param port Network port number (must be between 1 and 65535)
+     * @throws std::invalid_argument if port is invalid or address is empty
+     */
+    Scheduler(boost::uuids::uuid const id, std::string&& addr, int port) {
+        if (addr.empty()) {
+            throw std::invalid_argument("Scheduler address cannot be empty");
+        }
+        if (port <= 0 || port > 65535) {
+            throw std::invalid_argument("Invalid port number");
+        }
+        m_id = id;
+        m_addr = std::move(addr);
+        m_port = port;
+    }

feat: Add worker client that submit tasks and get next task from schedulers #35

feat: Add worker client that submit tasks and get next task from schedulers #35

Conversation

sitaowang1998 commented Dec 12, 2024 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

coderabbitai bot commented Dec 12, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 12, 2024

Choose a reason for hiding this comment

sitaowang1998 commented Dec 12, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 12, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)