-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src/node/node_ref.rs: implement NodeRef::normalized_char_count
#61
Conversation
WalkthroughThis pull request adds a new normalized character counting functionality across the crate. It introduces a new "Unreleased" section in the changelog documenting the forthcoming Changes
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #61 +/- ##
==========================================
+ Coverage 91.85% 91.90% +0.05%
==========================================
Files 15 16 +1
Lines 2577 2618 +41
==========================================
+ Hits 2367 2406 +39
- Misses 210 212 +2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/dom_tree/helpers.rs (1)
1-18
: Consider optimizing the character counting logic.The implementation is correct but could be more efficient by leveraging string methods.
Consider this alternative implementation that reduces branching:
pub(crate) fn normalized_char_count(text: &str) -> usize { - let mut char_count = 0; - let mut prev_was_whitespace = true; - - for c in text.chars() { - if prev_was_whitespace && c.is_whitespace() { - continue; - } - char_count += 1; - prev_was_whitespace = c.is_whitespace(); - } - - if prev_was_whitespace && char_count > 0 { - char_count -= 1; - } - - char_count + if text.is_empty() { + return 0; + } + text.split_whitespace() + .map(|word| word.chars().count()) + .sum::<usize>() + text.split_whitespace().count().saturating_sub(1) }This approach:
- Handles empty strings explicitly
- Uses
split_whitespace()
to naturally handle consecutive whitespace- Counts characters in each word and adds spaces between words
- Uses
saturating_sub
to safely handle single-word casestests/node-traversal.rs (1)
304-330
: Add test cases for edge cases.While the current test is good, consider adding tests for:
- Empty elements
- Elements with only whitespace
- Mixed content with comments
- Unicode characters
Here's an example of additional test cases:
#[cfg_attr(not(target_arch = "wasm32"), test)] #[cfg_attr(target_arch = "wasm32", wasm_bindgen_test)] fn test_node_normalized_char_count() { let contents: &str = r#" <div id="main"> A very messy content <span>. A something that</span> <p> asks to be normalized </p> </div> "#; let doc = Document::from(contents); let main_sel = doc.select_single("#main"); let main_node = main_sel.nodes().first().unwrap(); let expected = main_node .text() .split_whitespace() .collect::<Vec<&str>>() .join(" ") .len(); let got = main_node.normalized_char_count(); assert_eq!(got, expected); + + // Test empty element + let empty = r#"<div id="empty"></div>"#; + let doc = Document::from(empty); + let node = doc.select_single("#empty").nodes().first().unwrap(); + assert_eq!(node.normalized_char_count(), 0); + + // Test whitespace only + let whitespace = r#"<div id="space"> </div>"#; + let doc = Document::from(whitespace); + let node = doc.select_single("#space").nodes().first().unwrap(); + assert_eq!(node.normalized_char_count(), 0); + + // Test mixed content + let mixed = r#"<div id="mixed">text<!-- comment -->more</div>"#; + let doc = Document::from(mixed); + let node = doc.select_single("#mixed").nodes().first().unwrap(); + assert_eq!(node.normalized_char_count(), 8); // "text more" + + // Test Unicode + let unicode = r#"<div id="unicode">🦀 👋</div>"#; + let doc = Document::from(unicode); + let node = doc.select_single("#unicode").nodes().first().unwrap(); + assert_eq!(node.normalized_char_count(), 3); // "🦀 👋" }CHANGELOG.md (1)
8-10
: Accurate Documentation ofNodeRef::normalized_char_count
The changelog entry clearly describes the new method's purpose—to estimate the number of characters in descendant nodes with normalization (i.e., treating consecutive whitespace as a single character). Consider adding a brief note (if available) on performance improvements or any specific nuances in behavior, which could be valuable for users reviewing changelog details.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
CHANGELOG.md
(1 hunks)src/dom_tree.rs
(1 hunks)src/dom_tree/helpers.rs
(1 hunks)src/dom_tree/ops.rs
(3 hunks)src/node/node_ref.rs
(1 hunks)tests/node-traversal.rs
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/dom_tree.rs
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/dom_tree/ops.rs
[warning] 82-82: src/dom_tree/ops.rs#L82
Added line #L82 was not covered by tests
[warning] 84-84: src/dom_tree/ops.rs#L84
Added line #L84 was not covered by tests
🔇 Additional comments (3)
src/dom_tree/ops.rs (1)
46-90
: LGTM! The implementation is robust and efficient.The tree traversal logic correctly handles:
- Different node types
- Whitespace normalization
- Stack-based traversal for efficiency
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 82-82: src/dom_tree/ops.rs#L82
Added line #L82 was not covered by tests
[warning] 84-84: src/dom_tree/ops.rs#L84
Added line #L84 was not covered by testssrc/node/node_ref.rs (1)
681-696
: LGTM! Well-documented and properly implemented.The implementation:
- Has comprehensive documentation explaining the behavior
- Correctly delegates to the core implementation
- Maintains consistent API design with other methods
CHANGELOG.md (1)
6-7
: Clear "Unreleased" Section Added
The new "Unreleased" section is correctly introduced, aligning with our release documentation strategy. Once this feature is officially released, remember to update this section appropriately.
_ => continue, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Missing explicit test coverage for error-handling paths in src/dom_tree/ops.rs
The existing test (test_node_normalized_char_count
in tests/node-traversal.rs) exercises normal traversal behavior but does not appear to trigger the _ => continue
branch used for error handling. Please add tests that simulate error scenarios (e.g., passing unsupported or malformed node types) to ensure that these skipped paths are covered.
- File:
src/dom_tree/ops.rs
(lines 82-84) - Action: Add tests that force the error handling branch to be executed.
🔗 Analysis chain
Add test coverage for error handling paths.
These lines handling the continue case are not covered by tests.
Run this script to verify the test coverage:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check test coverage for error handling paths
rg -A 5 "test.*normalized_char_count" tests/
Length of output: 404
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 82-82: src/dom_tree/ops.rs#L82
Added line #L82 was not covered by tests
[warning] 84-84: src/dom_tree/ops.rs#L84
Added line #L84 was not covered by tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, in the current implementation, it doesn`t require to do an extra split whitespace? Can you see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niklak You're absolutely right - I apologize for my confusion. Looking at the implementation more carefully, I can see that it directly counts characters during traversal without any string splitting. The _ => continue
branch is simply for skipping non-text/non-element nodes, not for error handling. Thank you for the correction.
✏️ Learnings added
Learnt from: niklak
PR: niklak/dom_query#61
File: src/dom_tree/ops.rs:82-84
Timestamp: 2025-02-01T11:45:24.579Z
Learning: In dom_query's TreeNodeOps::normalized_char_count implementation, the `_ => continue` branch is used to skip non-text/non-element nodes during traversal, not for error handling. The method efficiently counts characters without requiring string splitting operations.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write a poem in RUN DMC style?))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niklak Yo, check it out! 🎤
Listen up y'all, it's time to get down
'Bout that normalized_char_count in this town
Traversing nodes with style and grace
Counting chars at a steady pace
No split whitespace, that's old school
Direct counting's our golden rule
Skip them nodes that ain't text or element
That's how we keep our code development!
WORD! 🎵
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That`s cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niklak Thanks! Glad you enjoyed the old school flow! 🎤😎
NodeRef::normalized_char_count
which estimates the number of characters in the text of descendant nodes as if the total string were normalized.Summary by CodeRabbit