-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add MySql support for storage backend #20
Conversation
This reverts commit 4b52fb4.
Warning Rate limit exceeded@sitaowang1998 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 3 minutes and 3 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThis pull request includes several changes across multiple files. The GitHub Actions workflow for code linting has been updated to only support the "ubuntu-latest" operating system. The Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 18
🧹 Outside diff range and nitpick comments (15)
src/spider/core/Error.hpp (1)
15-16
: Consider replacingOtherErr
with specific error typesWhile
ConstraintViolationErr
is well-defined, using a catch-allOtherErr
type might obscure specific error conditions. Consider identifying and adding explicit error types for known MySQL-specific errors that could occur..github/workflows/code-linting-checks.yaml (1)
Line range hint
39-41
: Remove redundant macOS-specific step.Since macOS has been removed from the matrix configuration, this conditional step for installing coreutils on macOS should be removed as it will never be executed.
Apply this diff to remove the redundant step:
- - if: "matrix.os == 'macos-latest'" - name: "Install coreutils (for md5sum)" - run: "brew install coreutils"src/spider/core/Data.hpp (1)
35-35
: Consider a more descriptive name for the boolean flagWhile the implementation is correct, the name
is_hard_locality
could be more descriptive about its purpose in the context of MySQL storage.Consider renaming to something more specific like
is_locality_strict
orhas_strict_locality_constraints
.src/spider/storage/MetadataStorage.hpp (2)
21-27
: Consider consolidating driver addition methodsThe two
add_driver
methods could be simplified by providing a default port value in the three-parameter version, eliminating the need for two separate methods.-virtual auto add_driver(boost::uuids::uuid id, std::string const& addr) -> StorageErr = 0; -virtual auto add_driver(boost::uuids::uuid id, std::string const& addr, int port) -> StorageErr = 0; +virtual auto add_driver(boost::uuids::uuid id, std::string const& addr, int port = 3306) -> StorageErr = 0;
45-49
: Document scheduler state managementThe new scheduler state management methods introduce important orchestration capabilities. Please add documentation comments to describe:
- Expected state format/values
- State persistence guarantees
- Error handling expectations
+ /** + * Retrieves the current state of a scheduler + * @param id The UUID of the scheduler + * @param state Output parameter for the scheduler state + * @return StorageErr indicating success or specific error condition + */ virtual auto get_scheduler_state(boost::uuids::uuid id, std::string* state) -> StorageErr = 0; + /** + * Updates the state of a scheduler + * @param id The UUID of the scheduler + * @param state The new state to set + * @return StorageErr indicating success or specific error condition + */ virtual auto set_scheduler_state(boost::uuids::uuid id, std::string const& state) -> StorageErr = 0;src/spider/storage/MysqlStorage.hpp (3)
22-46
: Standardize class naming conventionThe class name
MySqlMetadataStorage
uses inconsistent capitalization (MySql). Consider renaming toMySQLMetadataStorage
to match common MySQL naming conventions.Also, the pointer-based interface changes look good and provide better null handling capabilities.
Line range hint
55-72
: Align MysqlDataStorage implementation with metadata storage patterns
- Class naming should be consistent with MySQL conventions (consider
MySQLDataStorage
)- Connection member should use smart pointer:
- sql::Connection* m_conn = nullptr; + std::unique_ptr<sql::Connection> m_conn{nullptr};The pointer-based interface changes are well-designed and consistent with the metadata storage implementation.
Thread safety concerns are valid and need to be addressed
The codebase analysis reveals that the MySQL storage classes maintain connection state (
m_conn
) that's shared across multiple methods, but lack any synchronization mechanisms. The implementation shows:
- Direct member access to
m_conn
across multiple methods- Connection state modifications (connect/close operations)
- No mutex or other synchronization primitives
- SQL operations that could be interleaved if called concurrently
Key areas requiring synchronization:
- Connection management (
connect
/close
methods)- All database operations using
m_conn
- State-changing operations (add/remove/update methods)
Recommendation: Add a mutex member to protect all operations that access or modify the connection state and database operations.
🔗 Analysis chain
Line range hint
22-72
: Verify thread safety requirementsThe storage classes handle shared state through database connections. Please verify if concurrent access is a requirement and consider adding appropriate synchronization if needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for concurrent access patterns in the codebase rg -l "std::thread|std::async|std::future" | while read -r file; do echo "=== $file ===" rg -A 3 "MysqlStorage|MySqlMetadataStorage" "$file" doneLength of output: 59
Script:
#!/bin/bash # Let's check for any threading or async patterns in header files rg -t hpp "thread|async|future|mutex|concurrent|parallel" # Also check if there are any comments or documentation about threading rg -g "*.{hpp,cpp,md}" -i "thread.*safe|concurrent.*access" # Check if the storage is used in any scheduler or worker contexts rg -A 3 "class.*Storage.*\{"Length of output: 1529
Script:
#!/bin/bash # Let's check how these storage classes are being used rg -g "*.{cpp,hpp}" -B 3 -A 3 "MySqlMetadataStorage|MysqlDataStorage" # Check for any scheduler or worker implementations rg -g "*.{cpp,hpp}" "class.*(Scheduler|Worker|Engine)" # Look for any initialization or usage patterns rg -g "*.{cpp,hpp}" "new MySqlMetadataStorage|new MysqlDataStorage|make_.*Storage"Length of output: 17369
src/spider/core/TaskGraph.hpp (3)
18-23
: Consider optimizing UUID generator creation.The random generator is created each time a TaskGraph is constructed. Consider making it static to avoid repeated initialization.
- TaskGraph() { - boost::uuids::random_generator gen; - m_id = gen(); - } + TaskGraph() { + static boost::uuids::random_generator gen; + m_id = gen(); + }
43-51
: Enhance method documentation.While the method is well-implemented, the comment could be more descriptive about user responsibilities regarding dependency management.
- // User is responsible to add the dependencies + /** + * Adds a task to the graph without any dependencies. + * Note: Users must explicitly add dependencies using add_dependencies() if needed. + * @param task The task to add + * @return true if task was added, false if task ID already exists + */
92-101
: Consider caching head tasks for better performance.The current implementation recalculates head tasks on every call. For large graphs with frequent access, consider maintaining a cached set of head tasks that's updated when dependencies change.
This would require:
- Adding a member variable for cached head tasks
- Updating it in add_task and add_dependencies methods
- Adding a flag to track if cache is dirty
Would you like me to provide a detailed implementation?
cmake/Modules/FindMariaDBClientCpp.cmake (2)
108-109
: Consider validating the existence of include subdirectoriesThe addition of these include directories is necessary for the MariaDB C++ connector headers. However, to make the module more robust, consider validating that these subdirectories exist before adding them to the target properties.
Here's a suggested improvement:
set_target_properties( MariaDBClientCpp::MariaDBClientCpp PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${MariaDBClientCpp_INCLUDE_DIR}" - "${MariaDBClientCpp_INCLUDE_DIR}/conncpp" - "${MariaDBClientCpp_INCLUDE_DIR}/conncpp/compat" + $<$<BOOL:$<AND: + $<BOOL:EXISTS:${MariaDBClientCpp_INCLUDE_DIR}/conncpp>, + $<BOOL:EXISTS:${MariaDBClientCpp_INCLUDE_DIR}/conncpp/compat>>>: + ${MariaDBClientCpp_INCLUDE_DIR}/conncpp;${MariaDBClientCpp_INCLUDE_DIR}/conncpp/compat> )
108-109
: Fix indentation inconsistencyThe indentation of the include paths is inconsistent. The second path has an extra level of indentation.
Apply this formatting fix:
set_target_properties( MariaDBClientCpp::MariaDBClientCpp PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${MariaDBClientCpp_INCLUDE_DIR}" "${MariaDBClientCpp_INCLUDE_DIR}/conncpp" - "${MariaDBClientCpp_INCLUDE_DIR}/conncpp/compat" + "${MariaDBClientCpp_INCLUDE_DIR}/conncpp/compat" )src/spider/core/Task.hpp (2)
40-42
: Consider adding validation and thread safety mechanismsThe new setter methods could benefit from:
- Input validation to ensure data consistency
- Thread safety mechanisms if the TaskInput objects are shared between threads
- Documentation explaining the thread safety guarantees
Line range hint
53-74
: Consider extracting common functionalityThe setter methods and member variables in TaskInput and TaskOutput are nearly identical. Consider:
- Creating a common base class or trait
- Using templates to reduce code duplication
- Documenting why these are separate classes if the functionality must remain duplicated
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (12)
.github/workflows/code-linting-checks.yaml
(1 hunks)cmake/Modules/FindMariaDBClient.cmake
(0 hunks)cmake/Modules/FindMariaDBClientCpp.cmake
(1 hunks)src/spider/CMakeLists.txt
(1 hunks)src/spider/core/Data.hpp
(2 hunks)src/spider/core/Error.hpp
(2 hunks)src/spider/core/Task.hpp
(4 hunks)src/spider/core/TaskGraph.hpp
(4 hunks)src/spider/storage/DataStorage.hpp
(1 hunks)src/spider/storage/MetadataStorage.hpp
(1 hunks)src/spider/storage/MysqlStorage.cpp
(1 hunks)src/spider/storage/MysqlStorage.hpp
(3 hunks)
💤 Files with no reviewable changes (1)
- cmake/Modules/FindMariaDBClient.cmake
🧰 Additional context used
🪛 cppcheck
src/spider/storage/MysqlStorage.cpp
[performance] 87-87: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 114-114: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 20-20: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
🔇 Additional comments (13)
src/spider/core/Error.hpp (1)
29-29
: LGTM! Well-implemented boolean conversion operator
The explicit bool operator is a good addition that enables convenient error checking while preventing implicit conversions. The implementation correctly evaluates the error state.
src/spider/CMakeLists.txt (2)
28-28
: LGTM: Appropriate visibility for fmt dependency
The fmt library is correctly set as PRIVATE since it's likely used only for implementation details and not exposed through public headers.
23-26
: Review library visibility changes carefully
The change from PRIVATE to PUBLIC visibility for these dependencies means that any project linking against spider_core will also need to link against these libraries. This has several implications:
- Boost, abseil, and MariaDBClientCpp become part of spider_core's public API
- Version conflicts could arise in consumer projects
- ABI compatibility becomes more complex to maintain
Consider keeping these as PRIVATE unless they are explicitly needed in public headers.
Let's verify if these dependencies are used in public headers:
✅ Verification successful
PUBLIC visibility is justified for these dependencies
The verification shows that these libraries are indeed used in public header files:
- Boost is used in multiple public headers including TaskGraph.hpp, Task.hpp, Data.hpp, and storage headers
- abseil (absl) is used in TaskGraph.hpp
- MariaDB/MySQL is used in MysqlStorage.hpp
Since these dependencies are part of the public interface through header files, marking them as PUBLIC in CMakeLists.txt is the correct approach. This ensures that projects consuming the spider_core library will have access to the required dependencies.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if these dependencies are used in public headers
# Expected: Headers that include these dependencies
# Search for Boost includes
echo "Checking Boost usage in headers:"
rg -l "include.*boost" --type hpp
# Search for abseil includes
echo "Checking abseil usage in headers:"
rg -l "include.*absl" --type hpp
# Search for MariaDB includes
echo "Checking MariaDB usage in headers:"
rg -l "include.*mariadb" --type hpp
Length of output: 423
Script:
#!/bin/bash
# Description: Check if these dependencies are used in public headers
# Expected: Headers that include these dependencies
# First, find header files
echo "Finding header files:"
fd -e h -e hpp
# Then search for includes in those files
echo -e "\nChecking Boost usage in headers:"
rg -l "include.*boost"
echo -e "\nChecking abseil usage in headers:"
rg -l "include.*absl"
echo -e "\nChecking MariaDB usage in headers:"
rg -l "include.*mariadb"
# Also check for public headers specifically
echo -e "\nListing potential public headers:"
fd -e h -e hpp . src/spider/include/
Length of output: 1203
src/spider/storage/DataStorage.hpp (2)
24-24
: Consider implications of using pointer parameter in get_data()
While changing from reference to pointer parameter allows for null checks, it introduces potential null pointer risks. Please ensure:
- All implementations handle null pointer cases
- Documentation clearly states if null is a valid input
- Callers are updated to handle potential null cases
#!/bin/bash
# Search for get_data implementations to verify null handling
ast-grep --pattern 'get_data(boost::uuids::uuid $_, Data* $_) {
$$$
}'
Consider using std::optional instead of a raw pointer to make the nullable semantics more explicit and safer.
19-19
: Verify impact of removing UUID from connect() method
The simplified signature improves the interface, but we should ensure this change doesn't break existing connection tracking functionality.
✅ Verification successful
Let me search for more specific patterns to understand the impact of this change.
Removing UUID from connect() method is safe and improves the interface
The verification shows that the simplified connect() method signature is already consistently implemented across the codebase:
- MysqlDataStorage and MySqlMetadataStorage implementations use only the URL parameter
- No existing code relies on UUID for connection tracking
- Connection errors are properly handled through StorageErr return type
The change aligns with the current implementation and usage patterns.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for existing connect() method calls to verify impact
rg -A 2 "->connect\(" --type cpp
Length of output: 74
Script:
#!/bin/bash
# Search for connect method implementations and usages with broader patterns
rg "connect\(" --type cpp -A 3
# Search for storage class implementations
fd -e hpp -e cpp -e h . | xargs rg "class.*Storage.*:" -A 3
# Search for interface implementations
ast-grep --pattern 'class $_ : public DataStorage {
$$$
}'
Length of output: 3681
src/spider/core/Data.hpp (2)
20-23
: Consider adding input validation to the new constructor
The constructor accepts external IDs and strings without validation. This could lead to inconsistent state if invalid UUIDs or empty strings are provided.
Consider adding validation:
Data(boost::uuids::uuid id, std::string key, std::string value)
- : m_id(id),
- m_key(std::move(key)),
- m_value(std::move(value)) {}
+ : m_id(id),
+ m_key(std::move(key)),
+ m_value(std::move(value)) {
+ if (boost::uuids::uuid{} == id) {
+ throw std::invalid_argument("UUID cannot be null");
+ }
+ if (value.empty()) {
+ throw std::invalid_argument("Value cannot be empty");
+ }
+}
31-33
: LGTM! Well-implemented getter method
The implementation follows best practices with const reference return type and [[nodiscard]] attribute.
src/spider/storage/MetadataStorage.hpp (1)
34-41
: Verify TaskInstance compatibility
The change from Task
to TaskInstance
in get_task_timeout
suggests a more detailed tracking mechanism. Ensure all implementations properly handle the additional instance-specific data.
#!/bin/bash
# Search for TaskInstance usage patterns
rg -A 5 "TaskInstance.*get_task_timeout"
src/spider/storage/MysqlStorage.hpp (1)
5-7
: LGTM: Appropriate use of MariaDB C++ connector
The choice of MariaDB's C++ connector (conncpp) is well-suited for this C++ codebase, providing type-safe database operations.
src/spider/core/TaskGraph.hpp (4)
5-5
: LGTM! Required includes are properly added.
The new includes support the added functionality for UUID generation and hash set operations.
Also applies to: 7-7
26-26
: LGTM! Good optimization.
Storing the task ID locally avoids multiple getter calls and improves readability.
Also applies to: 32-32
57-57
: LGTM! Proper use of [[nodiscard]].
The method correctly prevents accidental ignoring of the return value.
109-109
: LGTM! Proper member variable declaration.
The m_id member is correctly declared as private and follows the class's naming convention.
@@ -20,7 +20,7 @@ jobs: | |||
lint: | |||
strategy: | |||
matrix: | |||
os: ["macos-latest", "ubuntu-latest"] | |||
os: ["ubuntu-latest"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Dropping macOS CI checks requires attention to development dependencies
The codebase shows active macOS support requirements:
- Library installation tasks specifically for macOS in
dep-tasks.yaml
- Platform-specific build configurations in CMake files
- Dedicated macOS scripts directory at
tools/scripts/macOS
Removing macOS from the CI matrix while maintaining macOS-specific development dependencies could lead to undetected issues for macOS developers. Consider either:
- Restoring macOS CI checks to ensure cross-platform compatibility
- Or documenting the decision to not verify macOS builds if that's intentional
🔗 Analysis chain
Verify if dropping macOS CI checks aligns with development needs.
The simplification to use only Ubuntu for CI is reasonable for reducing resource usage. However, please confirm that this change won't impact developers working on macOS systems.
Let's verify if there are any macOS-specific development requirements:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for macOS-specific code or configurations
# Test: Search for macOS-specific patterns in the codebase
rg -i "darwin|macos|apple|osx" --type yaml --type cmake
Length of output: 373
Script:
#!/bin/bash
# Check for any platform-specific build or development setup
cat dep-tasks.yaml
Length of output: 436
Script:
#!/bin/bash
# Check if there are any active development branches or recent commits mentioning macOS
git log --grep="macos\|darwin\|osx" --since="6 months ago" --pretty=format:"%h %s"
Length of output: 84
Script:
#!/bin/bash
# Look for any platform-specific development documentation
fd -e md -e txt | xargs rg -i "macos|darwin|development.*environment|prerequisites"
Length of output: 161
src/spider/core/Data.hpp
Outdated
void set_locality(std::vector<std::string>& locality) { m_locality = locality; } | ||
|
||
void set_hard_locality(bool hard) { m_hard_locality = hard; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve setter methods for better safety and clarity
The current implementation has several potential improvements:
set_locality
takes a non-const reference which is unusual for setters- No validation of the locality vector
- Missing documentation about the purpose of these settings in the MySQL storage context
Consider this improvement:
- void set_locality(std::vector<std::string>& locality) { m_locality = locality; }
+ void set_locality(std::vector<std::string> locality) { m_locality = std::move(locality); }
Also, please add documentation comments explaining:
- What locality means in this context
- When to use hard locality vs soft locality
- Any constraints on the locality vector contents
Committable suggestion skipped: line range outside the PR's diff.
std::vector<std::string> m_locality; | ||
bool m_hard_locality = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Add documentation for locality-related member variables
These members are used in MySQL storage implementation for managing data locality:
m_locality
: Stores a list of locality identifiers retrieved from MySQL storagem_hard_locality
: Controls locality behaviour (likely for enforcement settings)
Please add documentation above these member variables to explain:
- The purpose and format of locality identifiers stored in the vector
- The specific impact of the hard locality flag on data storage/retrieval
- Any constraints or assumptions about these values
🔗 Analysis chain
Add documentation for new member variables
The new member variables need documentation explaining their role in MySQL storage implementation.
Let's verify how these members are used:
Let me gather more context about how these members are used.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of locality in the codebase
rg -l "m_locality" --type cpp
rg -l "set_locality" --type cpp
Length of output: 147
Script:
#!/bin/bash
# Check the implementation details
rg "m_locality" -A 5 -B 5 --type cpp
rg "set_locality" -A 5 -B 5 --type cpp
Length of output: 2797
virtual auto get_task_graph(boost::uuids::uuid id, TaskGraph* task_graph) -> StorageErr = 0; | ||
virtual auto get_task_graphs(std::vector<boost::uuids::uuid>* task_graphs) -> StorageErr = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Null pointer checks needed in MySqlMetadataStorage implementation
The implementation of get_task_graph
and get_task_graphs
in MySqlMetadataStorage does not validate the pointer parameters before use. This could lead to undefined behaviour if null pointers are passed. Add null checks at the start of these methods:
src/spider/storage/MysqlStorage.cpp
: Add null check fortask_graph
parameter inget_task_graph
src/spider/storage/MysqlStorage.cpp
: Add null check fortask_graphs
parameter inget_task_graphs
Example pattern to follow:
if (task_graph == nullptr) {
return StorageErr::InvalidArgument;
}
🔗 Analysis chain
Verify null pointer handling in implementations
The change from references to pointers for output parameters is good for explicit ownership, but ensure all implementations include proper null checks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for implementations to verify null checks
ast-grep --pattern 'if ($pointer == nullptr) {
$$$
}'
Length of output: 195
Script:
#!/bin/bash
# Search for concrete implementations of MetadataStorage
rg -l "class.*:.*public.*MetadataStorage"
# Then search for get_task_graph implementations
ast-grep --pattern 'auto get_task_graph(boost::uuids::uuid id, TaskGraph* task_graph)'
# And get_task_graphs implementations
ast-grep --pattern 'auto get_task_graphs(std::vector<boost::uuids::uuid>* task_graphs)'
Length of output: 258
Script:
#!/bin/bash
# Check implementation in MysqlStorage
rg -A 10 "get_task_graph.*TaskGraph\*" src/spider/storage/MysqlStorage.hpp
rg -A 10 "get_task_graph.*TaskGraph\*" src/spider/storage/MysqlStorage.cpp
rg -A 10 "get_task_graphs.*vector.*uuid" src/spider/storage/MysqlStorage.hpp
rg -A 10 "get_task_graphs.*vector.*uuid" src/spider/storage/MysqlStorage.cpp
Length of output: 3205
set_scheduler_state(boost::uuids::uuid id, std::string const& state) -> StorageErr override; | ||
|
||
private: | ||
sql::Connection* m_conn = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider using smart pointer for database connection
The raw pointer m_conn
could lead to memory leaks. Consider using std::unique_ptr
for automatic resource management.
- sql::Connection* m_conn = nullptr;
+ std::unique_ptr<sql::Connection> m_conn{nullptr};
Committable suggestion skipped: line range outside the PR's diff.
src/spider/storage/MysqlStorage.cpp
Outdated
m_conn->prepareStatement("UPDATE divers SET (heartbeat) VALUES " | ||
"(CURRENT_TIMESTAMP()) WHERE id = ?") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in Table Name and Incorrect SQL UPDATE Syntax
The SQL statement references divers
, but the correct table name is drivers
. Additionally, the UPDATE
syntax is incorrect.
Please correct the table name and fix the SQL syntax:
-m_conn->prepareStatement("UPDATE divers SET (heartbeat) VALUES (CURRENT_TIMESTAMP()) WHERE id = ?")
+m_conn->prepareStatement("UPDATE drivers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
m_conn->prepareStatement("UPDATE divers SET (heartbeat) VALUES " | |
"(CURRENT_TIMESTAMP()) WHERE id = ?") | |
m_conn->prepareStatement("UPDATE drivers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?") |
src/spider/storage/MysqlStorage.cpp
Outdated
m_conn->prepareStatement("SELECT id, key, value, hard_locality " | ||
"FROM schedulers WHERE id = ?") | ||
); | ||
sql::bytes id_bytes = uuid_get_bytes(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect Table Name in 'get_data' Method
The SELECT
query is retrieving data from the schedulers
table instead of the data
table.
Please correct the table name:
-m_conn->prepareStatement("SELECT id, key, value, hard_locality FROM schedulers WHERE id = ?")
+m_conn->prepareStatement("SELECT id, key, value, hard_locality FROM data WHERE id = ?")
Committable suggestion skipped: line range outside the PR's diff.
src/spider/storage/MysqlStorage.cpp
Outdated
m_conn->prepareStatement("INSERT INTO data_ref_driver (id, " | ||
"task_id) VALUES(?, ?)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mismatch in Column Name for 'data_ref_driver' Table
The INSERT
statement uses task_id
instead of driver_id
for the data_ref_driver
table.
Please correct the column name:
-m_conn->prepareStatement("INSERT INTO data_ref_driver (id, task_id) VALUES(?, ?)")
+m_conn->prepareStatement("INSERT INTO data_ref_driver (id, driver_id) VALUES(?, ?)")
Committable suggestion skipped: line range outside the PR's diff.
src/spider/storage/MysqlStorage.cpp
Outdated
id BINARY(16) NOT NULL, | ||
task_id BINARY(16) NOT NULL, | ||
start_time TIMESTAMP NOT NULL, | ||
CONSTRAINT instance_task_id FOREIGN KEY (task_id) REFERENCES task (id) ON UPDATE NO ACTION ON DELETE CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect Foreign Key Reference to Non-Existent Table 'task'
The foreign key constraint references a non-existent table task
. The correct table name should be tasks
.
Please update the foreign key reference:
-CONSTRAINT instance_task_id FOREIGN KEY (task_id) REFERENCES task (id) ON UPDATE NO ACTION ON DELETE CASCADE,
+CONSTRAINT instance_task_id FOREIGN KEY (task_id) REFERENCES tasks (id) ON UPDATE NO ACTION ON DELETE CASCADE,
Committable suggestion skipped: line range outside the PR's diff.
src/spider/storage/MysqlStorage.cpp
Outdated
m_conn->prepareStatement("DELETE FROM data_ref_driver " | ||
"WHERE id = ? AND task_id = ?") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect Column Name in 'remove_driver_reference' Method
The DELETE
statement incorrectly references task_id
instead of driver_id
in the WHERE
clause.
Please amend the column name:
-m_conn->prepareStatement("DELETE FROM data_ref_driver WHERE id = ? AND task_id = ?")
+m_conn->prepareStatement("DELETE FROM data_ref_driver WHERE id = ? AND driver_id = ?")
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (4)
src/spider/core/Data.hpp (2)
20-23
: Add documentation and parameter validationWhile the constructor implementation is correct, consider:
- Adding documentation explaining when to use this constructor vs others
- Adding validation for the UUID parameter (e.g., check for nil UUID)
+ /** + * Constructs a Data object with a pre-existing UUID + * @param id The UUID to use (must not be nil) + * @param key The key associated with this data + * @param value The value to store + * @throws std::invalid_argument if id is nil + */ Data(boost::uuids::uuid id, std::string key, std::string value) - : m_id(id), + : m_id(id.is_nil() ? throw std::invalid_argument("UUID cannot be nil") : id), m_key(std::move(key)), m_value(std::move(value)) {}
37-37
: Optimize set_locality with move semanticsThe current implementation copies the locality vector. Consider using move semantics for better performance with large vectors.
- void set_locality(std::vector<std::string> const& locality) { m_locality = locality; } + void set_locality(std::vector<std::string> locality) { m_locality = std::move(locality); }src/spider/storage/MysqlStorage.cpp (2)
168-172
: Consider using reinterpret_cast instead of C-style cast.While the C-style cast works, using
reinterpret_cast
would be more idiomatic in C++ and makes the casting intention clearer.- return {(char*)id.data(), id.size()}; + return {reinterpret_cast<char*>(id.data()), id.size()};
752-753
: Replace magic number with named constant.The value
0.0001
is used as a threshold for timeout comparison. This should be defined as a named constant for better maintainability.+ constexpr float cMinTimeoutThreshold = 0.0001f; // ... - "tasks.id WHERE tasks.timeout > 0.0001 AND TIMESTAMPDIFF(MICROSECOND, " + "tasks.id WHERE tasks.timeout > cMinTimeoutThreshold AND TIMESTAMPDIFF(MICROSECOND, "
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
src/spider/core/Data.hpp
(2 hunks)src/spider/storage/MysqlStorage.cpp
(1 hunks)
🧰 Additional context used
🪛 cppcheck
src/spider/storage/MysqlStorage.cpp
[performance] 87-87: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 114-114: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 20-20: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
🔇 Additional comments (4)
src/spider/core/Data.hpp (2)
9-9
: LGTM! Necessary include for std::vector
The vector include is properly placed and required for the new locality feature.
45-46
: Verify MySQL storage implementation details
Let's verify the locality implementation matches MySQL storage requirements.
✅ Verification successful
The MySQL implementation shows proper handling of the locality fields, with:
- Appropriate schema definition including
hard_locality
boolean column and separatedata_locality
table - Complete CRUD operations for both
m_hard_locality
andm_locality
vector - Foreign key constraints ensuring referential integrity
The locality implementation matches MySQL storage requirements
The verification confirms:
hard_locality
is stored as a BOOL column in the main data tablem_locality
values are stored in a separatedata_locality
table with proper foreign key relationship- Storage operations handle both fields correctly during insert and select operations
- Schema design follows database normalization principles
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for MySQL-specific locality handling
echo "Checking MySQL storage implementation:"
rg -A 5 "locality" --type cpp | grep -i "mysql"
echo -e "\nChecking storage interface requirements:"
ast-grep --pattern 'class $storage : public Storage {
$$$
locality($$$) {
$$$
}
$$$
}'
Length of output: 5280
src/spider/storage/MysqlStorage.cpp (2)
32-166
: LGTM! Well-structured table definitions and error codes.
The SQL table definitions are well-organized with appropriate constraints, indexes, and data types. The MariadbErr enum provides good coverage of common error scenarios.
🧰 Tools
🪛 cppcheck
[performance] 87-87: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 114-114: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
383-383
:
Fix incorrect iteration over heads set.
The code attempts to iterate over heads
before adding any elements to it, which will result in no tasks being processed.
- for (boost::uuids::uuid const task_id : heads) {
+ for (boost::uuids::uuid const& task_id : task_graph.get_head_tasks()) {
Likely invalid or redundant comment.
src/spider/storage/MysqlStorage.cpp
Outdated
auto MySqlMetadataStorage::update_heartbeat(boost::uuids::uuid id) -> StorageErr { | ||
try { | ||
std::unique_ptr<sql::PreparedStatement> statement(m_conn->prepareStatement( | ||
"UPDATE divers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo in table name.
The table name 'divers' should be 'drivers'.
- "UPDATE divers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?"
+ "UPDATE drivers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"UPDATE divers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?" | |
"UPDATE drivers SET heartbeat = CURRENT_TIMESTAMP() WHERE id = ?" |
src/spider/storage/MysqlStorage.cpp
Outdated
while (res->next()) { | ||
locality.emplace_back(res->getString(1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix incorrect ResultSet usage in locality fetching.
The code uses res
instead of locality_res
when fetching locality data, which will result in incorrect data being retrieved.
- while (res->next()) {
- locality.emplace_back(res->getString(1));
+ while (locality_res->next()) {
+ locality.emplace_back(locality_res->getString(1));
Committable suggestion skipped: line range outside the PR's diff.
auto string_to_task_state(std::string const& state) -> spider::core::TaskState { | ||
if (state == "pending") { | ||
return spider::core::TaskState::Pending; | ||
} | ||
if (state == "ready") { | ||
return spider::core::TaskState::Ready; | ||
} | ||
if (state == "running") { | ||
return spider::core::TaskState::Running; | ||
} | ||
if (state == "success") { | ||
return spider::core::TaskState::Succeed; | ||
} | ||
if (state == "fail") { | ||
return spider::core::TaskState::Failed; | ||
} | ||
if (state == "cancel") { | ||
return spider::core::TaskState::Canceled; | ||
} | ||
return spider::core::TaskState::Pending; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add logging for unhandled task state.
The default case in string_to_task_state
silently returns TaskState::Pending
without any warning. This could mask potential issues if an invalid state string is provided.
if (state == "cancel") {
return spider::core::TaskState::Canceled;
}
+ // Log warning for unhandled state
+ std::cerr << "Warning: Unknown task state '" << state << "', defaulting to Pending" << std::endl;
return spider::core::TaskState::Pending;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto string_to_task_state(std::string const& state) -> spider::core::TaskState { | |
if (state == "pending") { | |
return spider::core::TaskState::Pending; | |
} | |
if (state == "ready") { | |
return spider::core::TaskState::Ready; | |
} | |
if (state == "running") { | |
return spider::core::TaskState::Running; | |
} | |
if (state == "success") { | |
return spider::core::TaskState::Succeed; | |
} | |
if (state == "fail") { | |
return spider::core::TaskState::Failed; | |
} | |
if (state == "cancel") { | |
return spider::core::TaskState::Canceled; | |
} | |
return spider::core::TaskState::Pending; | |
} | |
auto string_to_task_state(std::string const& state) -> spider::core::TaskState { | |
if (state == "pending") { | |
return spider::core::TaskState::Pending; | |
} | |
if (state == "ready") { | |
return spider::core::TaskState::Ready; | |
} | |
if (state == "running") { | |
return spider::core::TaskState::Running; | |
} | |
if (state == "success") { | |
return spider::core::TaskState::Succeed; | |
} | |
if (state == "fail") { | |
return spider::core::TaskState::Failed; | |
} | |
if (state == "cancel") { | |
return spider::core::TaskState::Canceled; | |
} | |
// Log warning for unhandled state | |
std::cerr << "Warning: Unknown task state '" << state << "', defaulting to Pending" << std::endl; | |
return spider::core::TaskState::Pending; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
src/spider/storage/MysqlStorage.cpp (2)
47-166
: Consider adding indexes for frequently queried columns.While the schema design is solid, consider adding indexes for columns frequently used in WHERE clauses or JOIN conditions, such as:
tasks.state
for querying ready taskstask_instances.start_time
for timeout queriesdata.key
for key-based lookups (already indexed)🧰 Tools
🪛 cppcheck
[performance] 87-87: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 114-114: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
168-172
: Consider usingreinterpret_cast
instead of C-style cast.While the NOLINT directive suppresses the warning, using
reinterpret_cast
would be more idiomatic in C++ and makes the casting intention clearer.- return {(char*)id.data(), id.size()}; + return {reinterpret_cast<char*>(id.data()), id.size()};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/spider/storage/MysqlStorage.cpp
(1 hunks)
🧰 Additional context used
🪛 cppcheck
src/spider/storage/MysqlStorage.cpp
[performance] 87-87: Variable 'id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 114-114: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
[performance] 20-20: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
🔇 Additional comments (1)
src/spider/storage/MysqlStorage.cpp (1)
3-42
: LGTM! Well-organized includes and error code definitions.
The includes are logically grouped, and the MariadbErr
enum provides clear error code definitions for MariaDB-specific errors.
🧰 Tools
🪛 cppcheck
[performance] 20-20: Variable 'm_id' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
src/spider/storage/MysqlStorage.cpp
Outdated
// Tasks must be added in graph order to avoid the dangling reference. | ||
absl::flat_hash_set<boost::uuids::uuid> heads; | ||
std::deque<boost::uuids::uuid> queue; | ||
// First go ver all heads | ||
for (boost::uuids::uuid const task_id : heads) { | ||
std::optional<Task> const task_option = task_graph.get_task(task_id); | ||
if (!task_option.has_value()) { | ||
m_conn->rollback(); | ||
return StorageErr{StorageErrType::KeyNotFoundErr, "Task graph inconsistent"}; | ||
} | ||
Task const& task = task_option.value(); | ||
this->add_task(job_id_bytes, task); | ||
for (boost::uuids::uuid const id : task_graph.get_child_tasks(task_id)) { | ||
queue.push_back(id); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix empty heads set in task graph traversal.
The heads
set is empty when used in the loop at line 383, which means no tasks will be processed initially. This could lead to incomplete task graph processing.
- absl::flat_hash_set<boost::uuids::uuid> heads;
+ absl::flat_hash_set<boost::uuids::uuid> heads = task_graph.get_head_tasks();
std::deque<boost::uuids::uuid> queue;
- // First go ver all heads
+ // First go over all heads
for (boost::uuids::uuid const task_id : heads) {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Tasks must be added in graph order to avoid the dangling reference. | |
absl::flat_hash_set<boost::uuids::uuid> heads; | |
std::deque<boost::uuids::uuid> queue; | |
// First go ver all heads | |
for (boost::uuids::uuid const task_id : heads) { | |
std::optional<Task> const task_option = task_graph.get_task(task_id); | |
if (!task_option.has_value()) { | |
m_conn->rollback(); | |
return StorageErr{StorageErrType::KeyNotFoundErr, "Task graph inconsistent"}; | |
} | |
Task const& task = task_option.value(); | |
this->add_task(job_id_bytes, task); | |
for (boost::uuids::uuid const id : task_graph.get_child_tasks(task_id)) { | |
queue.push_back(id); | |
} | |
} | |
// Tasks must be added in graph order to avoid the dangling reference. | |
absl::flat_hash_set<boost::uuids::uuid> heads = task_graph.get_head_tasks(); | |
std::deque<boost::uuids::uuid> queue; | |
// First go over all heads | |
for (boost::uuids::uuid const task_id : heads) { | |
std::optional<Task> const task_option = task_graph.get_task(task_id); | |
if (!task_option.has_value()) { | |
m_conn->rollback(); | |
return StorageErr{StorageErrType::KeyNotFoundErr, "Task graph inconsistent"}; | |
} | |
Task const& task = task_option.value(); | |
this->add_task(job_id_bytes, task); | |
for (boost::uuids::uuid const id : task_graph.get_child_tasks(task_id)) { | |
queue.push_back(id); | |
} | |
} |
Description
Add MySql implementation for data and metadata storage.
Validation performed
Summary by CodeRabbit
New Features
TaskGraph
andDataStorage
.Bug Fixes
Refactor
Documentation