src/node/node_ref.rs: implement `NodeRef::normalized_char_count` #61

niklak · 2025-02-01T11:34:18Z

Implemented `NodeRef::normalized_char_count`, which estimates the number of characters in the text of descendant nodes as if the combined string were whitespace-normalized.

Summary by CodeRabbit

New Features
- Introduced a normalized character counting capability to improve text processing accuracy by treating consecutive whitespace as a single character.
Documentation
- Updated release records to reflect the enhancements in text handling.
Tests
- Added test cases to ensure that the normalized character count behaves as expected.

coderabbitai · 2025-02-01T11:34:25Z

Walkthrough

This pull request adds a new normalized character counting functionality across the crate. It introduces a new "Unreleased" section in the changelog documenting the forthcoming NodeRef::normalized_char_count method. In the source code, several new methods and a helper function have been added to count characters while treating sequences of whitespace as a single character. A corresponding test is also included to validate the behavior.

Changes

| File(s) | Change Summary |
| --- | --- |
| `CHANGELOG.md` | Added an "Unreleased" section with an entry under "Added" for the new `NodeRef::normalized_char_count` method. |
| `src/dom_tree.rs`, `src/dom_tree/helpers.rs`, `src/dom_tree/ops.rs` | Introduced a new `helpers` module with a helper function and added a `normalized_char_count` method to the `TreeNodeOps` struct, including detailed documentation and usage of the helper. |
| `src/node/node_ref.rs` | Added a public `normalized_char_count` method to the `NodeRef` struct for counting normalized text characters by traversing descendant nodes. |
| `tests/node-traversal.rs` | Added a new test function `test_node_normalized_char_count` to verify the correctness of the normalized character count implementation. |

Possibly related PRs

add missing tests for Document #43: Introduces changes in the NodeRef struct similar to the normalized character count additions in this PR.
implemented NodeRef::insert_after #38: Enhances the NodeRef functionality by adding methods such as insert_after alongside other NodeRef enhancements.
add more tests for NodeRef #42: Focuses on refactoring within the NodeRef struct, directly affecting method implementations like normalized character counting.

Poem

I'm a bunny hopping through the code,
With new count methods lighting my road.
Whitespace merges with a playful wink,
As normalized texts make me stop and think.
Hopping past bugs with a twitch of my ear,
In this code garden, all is bright and clear!
🐇✨

codecov · 2025-02-01T11:35:18Z

Codecov Report

Attention: Patch coverage is 95.12195% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.90%. Comparing base (473fc5a) to head (83d5a03).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/dom_tree/ops.rs | 91.30% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
+ Coverage   91.85%   91.90%   +0.05%     
==========================================
  Files          15       16       +1     
  Lines        2577     2618      +41     
==========================================
+ Hits         2367     2406      +39     
- Misses        210      212       +2
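
The percentages in the report follow from the hit and line counts in the table. A minimal sketch of that arithmetic, assuming the 41 added lines are the patch lines and the 2 missed lines are the only uncovered ones; the helper name `pct` is illustrative and not part of Codecov:

```rust
/// Coverage as a percentage: hits / lines * 100.
fn pct(hits: u32, lines: u32) -> f64 {
    f64::from(hits) / f64::from(lines) * 100.0
}

/// Project coverage on the base commit, from the "Hits" and "Lines" rows.
fn base_coverage() -> f64 {
    pct(2367, 2577)
}

/// Project coverage on the PR head, from the same rows.
fn head_coverage() -> f64 {
    pct(2406, 2618)
}

/// Patch coverage: 41 added lines, 2 of them not hit by any test.
fn patch_coverage() -> f64 {
    pct(41 - 2, 41)
}
```

Rounded to two decimals, `base_coverage()` and `head_coverage()` give the 91.85% and 91.90% shown in the diff, and `patch_coverage()` gives the 95.12195% quoted at the top of the report.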

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/dom_tree/helpers.rs (1)

1-18: Consider optimizing the character counting logic.

The implementation is correct but could be more efficient by leveraging string methods.

Consider this alternative implementation that reduces branching:
 pub(crate) fn normalized_char_count(text: &str) -> usize {
-    let mut char_count = 0;
-    let mut prev_was_whitespace = true;
-
-    for c in text.chars() {
-        if prev_was_whitespace && c.is_whitespace() {
-            continue;
-        }
-        char_count += 1;
-        prev_was_whitespace = c.is_whitespace();
-    }
-
-    if prev_was_whitespace && char_count > 0 {
-        char_count -= 1;
-    }
-
-    char_count
+    if text.is_empty() {
+        return 0;
+    }
+    text.split_whitespace()
+        .map(|word| word.chars().count())
+        .sum::<usize>() + text.split_whitespace().count().saturating_sub(1)
 }
This approach:
- Handles empty strings explicitly
- Uses split_whitespace() to naturally handle consecutive whitespace
- Counts characters in each word and adds spaces between words
- Uses saturating_sub to safely handle single-word cases
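
Whether the suggested rewrite is a drop-in replacement can be checked by running both versions on the same inputs. The sketch below copies the two bodies from the diff under the hypothetical names `count_loop` and `count_split` (in the review both are called `normalized_char_count`); the explicit empty-string check is dropped because the arithmetic already yields 0 for it.

```rust
/// Loop version (the removed lines of the diff): a single pass, no allocation.
fn count_loop(text: &str) -> usize {
    let mut char_count = 0;
    // Starting as `true` means leading whitespace is skipped.
    let mut prev_was_whitespace = true;

    for c in text.chars() {
        if prev_was_whitespace && c.is_whitespace() {
            continue;
        }
        char_count += 1;
        prev_was_whitespace = c.is_whitespace();
    }

    // A trailing whitespace run was counted once; take it back.
    if prev_was_whitespace && char_count > 0 {
        char_count -= 1;
    }
    char_count
}

/// `split_whitespace` version (the added lines): walks the text twice.
fn count_split(text: &str) -> usize {
    let words = text.split_whitespace().count();
    let chars: usize = text.split_whitespace().map(|w| w.chars().count()).sum();
    // One separator between each pair of adjacent words.
    chars + words.saturating_sub(1)
}
```

Both functions agree on empty, whitespace-only, and multi-word input. Note that the `split_whitespace` version iterates over the text twice, so it is not obviously faster than the single-pass loop.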

tests/node-traversal.rs (1)

304-330: Add test cases for edge cases.

While the current test is good, consider adding tests for:

Empty elements
Elements with only whitespace
Mixed content with comments
Unicode characters

Here's an example of additional test cases:

 #[cfg_attr(not(target_arch = "wasm32"), test)]
 #[cfg_attr(target_arch = "wasm32", wasm_bindgen_test)]
 fn test_node_normalized_char_count() {
     let contents: &str = r#"
         <div id="main">
         A           very 
                                 messy content
             <span>. A something       that</span>
             <p>
             asks to be     normalized     </p>


         </div>
     "#;

     let doc = Document::from(contents);
     let main_sel = doc.select_single("#main");
     let main_node = main_sel.nodes().first().unwrap();
     let expected = main_node
         .text()
         .split_whitespace()
         .collect::<Vec<&str>>()
         .join(" ")
         .len();
     let got = main_node.normalized_char_count();
     assert_eq!(got, expected);
+
+    // Test empty element
+    let empty = r#"<div id="empty"></div>"#;
+    let doc = Document::from(empty);
+    // Bind the selection first so the node reference does not outlive a temporary.
+    let sel = doc.select_single("#empty");
+    let node = sel.nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 0);
+
+    // Test whitespace only
+    let whitespace = r#"<div id="space">     </div>"#;
+    let doc = Document::from(whitespace);
+    let sel = doc.select_single("#space");
+    let node = sel.nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 0);
+
+    // Test mixed content
+    let mixed = r#"<div id="mixed">text<!-- comment -->more</div>"#;
+    let doc = Document::from(mixed);
+    let sel = doc.select_single("#mixed");
+    let node = sel.nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 8); // "textmore": the comment adds no whitespace
+
+    // Test Unicode
+    let unicode = r#"<div id="unicode">🦀    👋</div>"#;
+    let doc = Document::from(unicode);
+    let sel = doc.select_single("#unicode");
+    let node = sel.nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 3); // "🦀 👋"
 }
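
One thing to watch when extending this test to non-ASCII input: `expected` above is computed with `.len()`, which is the byte length of the joined string and equals the character count only for ASCII text. A small sketch of the difference, using the hypothetical helper names `normalize`, `expected_bytes`, and `expected_chars`:

```rust
/// Collapse whitespace runs to single spaces and trim both ends.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<&str>>().join(" ")
}

/// What the existing test computes: the length in UTF-8 bytes.
fn expected_bytes(text: &str) -> usize {
    normalize(text).len()
}

/// The length in `char`s (Unicode scalar values), which is what a
/// character count based on `str::chars` produces.
fn expected_chars(text: &str) -> usize {
    normalize(text).chars().count()
}
```

For the ASCII fixture the two agree, so the current assertion is fine as written. A Unicode fixture would need the `chars().count()` form for the comparison to hold.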

CHANGELOG.md (1)

8-10: Accurate Documentation of NodeRef::normalized_char_count
The changelog entry clearly describes the new method's purpose—to estimate the number of characters in descendant nodes with normalization (i.e., treating consecutive whitespace as a single character). Consider adding a brief note (if available) on performance improvements or any specific nuances in behavior, which could be valuable for users reviewing changelog details.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 473fc5a and 83d5a03.

📒 Files selected for processing (6)

CHANGELOG.md (1 hunks)
src/dom_tree.rs (1 hunks)
src/dom_tree/helpers.rs (1 hunks)
src/dom_tree/ops.rs (3 hunks)
src/node/node_ref.rs (1 hunks)
tests/node-traversal.rs (1 hunks)

✅ Files skipped from review due to trivial changes (1)

src/dom_tree.rs

🧰 Additional context used

🪛 GitHub Check: codecov/patch

src/dom_tree/ops.rs

[warning] 82-82: src/dom_tree/ops.rs#L82
Added line #L82 was not covered by tests

[warning] 84-84: src/dom_tree/ops.rs#L84
Added line #L84 was not covered by tests

🔇 Additional comments (3)

src/dom_tree/ops.rs (1)

46-90: LGTM! The implementation is robust and efficient.

The tree traversal logic correctly handles:
- Different node types
- Whitespace normalization
- Stack-based traversal for efficiency
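
The points above can be illustrated on a toy tree. This is a sketch under stated assumptions, not the crate's code: the `Node` enum and this free function `normalized_char_count` are hypothetical stand-ins for dom_query's internal types, and sharing the whitespace flag across text nodes is one plausible design rather than a confirmed detail of the implementation.

```rust
/// Hypothetical, simplified node type (not dom_query's real representation).
enum Node {
    Element(Vec<Node>),
    Text(String),
    Comment(String),
}

/// Count characters under `root`, collapsing each whitespace run to one
/// character and ignoring leading and trailing whitespace of the combined text.
fn normalized_char_count(root: &Node) -> usize {
    let mut count = 0;
    // Shared across text nodes so a run spanning two nodes collapses too.
    let mut prev_was_whitespace = true;
    // Explicit stack instead of recursion.
    let mut stack = vec![root];

    while let Some(node) = stack.pop() {
        match node {
            // Push children reversed so they pop in document order.
            Node::Element(children) => stack.extend(children.iter().rev()),
            Node::Text(text) => {
                for c in text.chars() {
                    if prev_was_whitespace && c.is_whitespace() {
                        continue;
                    }
                    count += 1;
                    prev_was_whitespace = c.is_whitespace();
                }
            }
            // Comments and any other node kinds contribute no text.
            _ => continue,
        }
    }

    // Drop the single counted character of a trailing whitespace run.
    if prev_was_whitespace && count > 0 {
        count -= 1;
    }
    count
}
```

In this sketch the catch-all arm only skips nodes that carry no text; it is not an error path.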

src/node/node_ref.rs (1)

681-696: LGTM! Well-documented and properly implemented.

The implementation:
- Has comprehensive documentation explaining the behavior
- Correctly delegates to the core implementation
- Maintains consistent API design with other methods

CHANGELOG.md (1)

6-7: Clear "Unreleased" Section Added
The new "Unreleased" section is correctly introduced, aligning with our release documentation strategy. Once this feature is officially released, remember to update this section appropriately.

coderabbitai · 2025-02-01T11:37:05Z

src/dom_tree/ops.rs

+                    _ => continue,
+                }
+            }


💡 Codebase verification

Missing explicit test coverage for error-handling paths in src/dom_tree/ops.rs

The existing test (test_node_normalized_char_count in tests/node-traversal.rs) exercises normal traversal behavior but does not appear to trigger the _ => continue branch used for error handling. Please add tests that simulate error scenarios (e.g., passing unsupported or malformed node types) to ensure that these skipped paths are covered.

File: src/dom_tree/ops.rs (lines 82-84)

Action: Add tests that force the error handling branch to be executed.

🔗 Analysis chain

Add test coverage for error handling paths.

These lines handling the continue case are not covered by tests.

Run this script to verify the test coverage:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for error handling paths
rg -A 5 "test.*normalized_char_count" tests/

Length of output: 404

Nope, the current implementation doesn't need an extra whitespace split. Can you see that?

@niklak You're absolutely right - I apologize for my confusion. Looking at the implementation more carefully, I can see that it directly counts characters during traversal without any string splitting. The _ => continue branch is simply for skipping non-text/non-element nodes, not for error handling. Thank you for the correction.

✏️ Learnings added

Learnt from: niklak
PR: niklak/dom_query#61
File: src/dom_tree/ops.rs:82-84
Timestamp: 2025-02-01T11:45:24.579Z
Learning: In dom_query's TreeNodeOps::normalized_char_count implementation, the `_ => continue` branch is used to skip non-text/non-element nodes during traversal, not for error handling. The method efficiently counts characters without requiring string splitting operations.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Can you write a poem in RUN DMC style?))

@niklak Yo, check it out! 🎤

Listen up y'all, it's time to get down
'Bout that normalized_char_count in this town
Traversing nodes with style and grace
Counting chars at a steady pace
No split whitespace, that's old school
Direct counting's our golden rule
Skip them nodes that ain't text or element
That's how we keep our code development!
WORD! 🎵

That's cool!

@niklak Thanks! Glad you enjoyed the old school flow! 🎤😎

src/node/node_ref.rs: implement NodeRef::normalized_char_count

83d5a03

coderabbitai bot reviewed Feb 1, 2025

niklak merged commit b598faf into main Feb 1, 2025
11 checks passed

niklak deleted the feature/node-normalized-char-count branch February 1, 2025 11:56

This was referenced Feb 2, 2025

src/node/serializing.rs: implement format_text #62

Merged

src/dom_tree/ops.rs: fixed TreeNodeOps::normalized_char_count #63

Merged
