src/node/node_ref.rs: implement NodeRef::normalized_char_count #61

Merged
niklak merged 1 commit into main from feature/node-normalized-char-count
Feb 1, 2025

@niklak niklak commented Feb 1, 2025

  • Implemented NodeRef::normalized_char_count which estimates the number of characters in the text of descendant nodes as if the total string were normalized.

Summary by CodeRabbit

  • New Features
    • Introduced a normalized character counting capability to improve text processing accuracy by treating consecutive whitespace as a single character.
  • Documentation
    • Updated release records to reflect the enhancements in text handling.
  • Tests
    • Added test cases to ensure that the normalized character count behaves as expected.


coderabbitai bot commented Feb 1, 2025

Walkthrough

This pull request adds a new normalized character counting functionality across the crate. It introduces a new "Unreleased" section in the changelog documenting the forthcoming NodeRef::normalized_char_count method. In the source code, several new methods and a helper function have been added to count characters while treating sequences of whitespace as a single character. A corresponding test is also included to validate the behavior.

Changes

File(s) | Change Summary
--- | ---
CHANGELOG.md | Added an "Unreleased" section with an entry under "Added" for the new NodeRef::normalized_char_count method.
src/dom_tree.rs, src/dom_tree/helpers.rs, src/dom_tree/ops.rs | Introduced a new helpers module with a helper function and added a normalized_char_count method to the TreeNodeOps struct, including detailed documentation and usage of the helper.
src/node/node_ref.rs | Added a public normalized_char_count method to the NodeRef struct for counting normalized text characters by traversing descendant nodes.
tests/node-traversal.rs | Added a new test function test_node_normalized_char_count to verify the correctness of the normalized character count implementation.

Poem

I'm a bunny hopping through the code,
With new count methods lighting my road.
Whitespace merges with a playful wink,
As normalized texts make me stop and think.
Hopping past bugs with a twitch of my ear,
In this code garden, all is bright and clear!
🐇✨



codecov bot commented Feb 1, 2025

Codecov Report

Attention: Patch coverage is 95.12195% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.90%. Comparing base (473fc5a) to head (83d5a03).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
--- | --- | ---
src/dom_tree/ops.rs | 91.30% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
+ Coverage   91.85%   91.90%   +0.05%     
==========================================
  Files          15       16       +1     
  Lines        2577     2618      +41     
==========================================
+ Hits         2367     2406      +39     
- Misses        210      212       +2     
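The percentages in the report can be reproduced from the hit and line counts in the table. The sketch below is only a worked check of that arithmetic; `coverage_percent` is a hypothetical helper, and treating the patch figure as the ratio of the two deltas (+39 hits over +41 lines) is an assumption that happens to match the reported 95.12195% here.

```rust
/// Percentage of covered lines, given hit lines and total lines.
/// Hypothetical helper for illustration; not part of Codecov or dom_query.
fn coverage_percent(hits: u32, lines: u32) -> f64 {
    f64::from(hits) / f64::from(lines) * 100.0
}

fn main() {
    // Base (473fc5a) and head (83d5a03) figures from the coverage diff.
    let base = coverage_percent(2367, 2577);
    let head = coverage_percent(2406, 2618);
    // Assumption: the patch figure equals added hits over added lines.
    let patch = coverage_percent(2406 - 2367, 2618 - 2577);

    println!("base {base:.2}%  head {head:.2}%  patch {patch:.5}%");
}
```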


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/dom_tree/helpers.rs (1)

1-18: Consider optimizing the character counting logic.

The implementation is correct but could be more efficient by leveraging string methods.

Consider this alternative implementation that reduces branching:

 pub(crate) fn normalized_char_count(text: &str) -> usize {
-    let mut char_count = 0;
-    let mut prev_was_whitespace = true;
-
-    for c in text.chars() {
-        if prev_was_whitespace && c.is_whitespace() {
-            continue;
-        }
-        char_count += 1;
-        prev_was_whitespace = c.is_whitespace();
-    }
-
-    if prev_was_whitespace && char_count > 0 {
-        char_count -= 1;
-    }
-
-    char_count
+    if text.is_empty() {
+        return 0;
+    }
+    text.split_whitespace()
+        .map(|word| word.chars().count())
+        .sum::<usize>() + text.split_whitespace().count().saturating_sub(1)
 }

This approach:

  1. Handles empty strings explicitly
  2. Uses split_whitespace() to naturally handle consecutive whitespace
  3. Counts characters in each word and adds spaces between words
  4. Uses saturating_sub to safely handle single-word cases
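One way to gain confidence in a swap like this is to run both variants side by side on a few inputs. The sketch below copies the two bodies from the diff into standalone functions (the names `count_with_loop` and `count_with_split` are made up for this comparison) and checks that they agree. It only covers a single string passed in one call; whether that is how the crate invokes the helper during traversal is not shown here.

```rust
/// Loop-based counter, copied from the lines removed in the diff.
fn count_with_loop(text: &str) -> usize {
    let mut char_count = 0;
    // Starting as `true` means leading whitespace is never counted.
    let mut prev_was_whitespace = true;

    for c in text.chars() {
        if prev_was_whitespace && c.is_whitespace() {
            continue;
        }
        char_count += 1;
        prev_was_whitespace = c.is_whitespace();
    }

    // A trailing whitespace run was counted once; remove it again.
    if prev_was_whitespace && char_count > 0 {
        char_count -= 1;
    }
    char_count
}

/// `split_whitespace`-based counter, copied from the lines added in the diff.
fn count_with_split(text: &str) -> usize {
    if text.is_empty() {
        return 0;
    }
    text.split_whitespace()
        .map(|word| word.chars().count())
        .sum::<usize>()
        + text.split_whitespace().count().saturating_sub(1)
}

fn main() {
    let samples = ["", "   ", "a", "  a  b  ", "A   very \n messy\tcontent ", "🦀    👋"];
    for s in samples {
        assert_eq!(count_with_loop(s), count_with_split(s), "mismatch for {s:?}");
    }
}
```

Note that the `split_whitespace` variant walks the input twice, so fewer branches does not automatically mean less work.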
tests/node-traversal.rs (1)

304-330: Add test cases for edge cases.

While the current test is good, consider adding tests for:

  1. Empty elements
  2. Elements with only whitespace
  3. Mixed content with comments
  4. Unicode characters

Here's an example of additional test cases:

 #[cfg_attr(not(target_arch = "wasm32"), test)]
 #[cfg_attr(target_arch = "wasm32", wasm_bindgen_test)]
 fn test_node_normalized_char_count() {
     let contents: &str = r#"
         <div id="main">
         A           very 
                                 messy content
             <span>. A something       that</span>
             <p>
             asks to be     normalized     </p>


         </div>
     "#;

     let doc = Document::from(contents);
     let main_sel = doc.select_single("#main");
     let main_node = main_sel.nodes().first().unwrap();
     let expected = main_node
         .text()
         .split_whitespace()
         .collect::<Vec<&str>>()
         .join(" ")
         .len();
     let got = main_node.normalized_char_count();
     assert_eq!(got, expected);
+
+    // Test empty element
+    let empty = r#"<div id="empty"></div>"#;
+    let doc = Document::from(empty);
+    let node = doc.select_single("#empty").nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 0);
+
+    // Test whitespace only
+    let whitespace = r#"<div id="space">     </div>"#;
+    let doc = Document::from(whitespace);
+    let node = doc.select_single("#space").nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 0);
+
+    // Test mixed content
+    let mixed = r#"<div id="mixed">text<!-- comment -->more</div>"#;
+    let doc = Document::from(mixed);
+    let node = doc.select_single("#mixed").nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 8); // "textmore": the comment node adds no characters
+
+    // Test Unicode
+    let unicode = r#"<div id="unicode">🦀    👋</div>"#;
+    let doc = Document::from(unicode);
+    let node = doc.select_single("#unicode").nodes().first().unwrap();
+    assert_eq!(node.normalized_char_count(), 3); // "🦀 👋"
 }
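A detail worth keeping in mind when adding the Unicode case: the `expected` value in the existing test is built with `.len()`, which returns a byte length, while the method is described as counting characters. For ASCII input the two are the same, but they diverge for multi-byte text. The standalone sketch below illustrates the difference with two hypothetical helper functions; it does not call dom_query.

```rust
/// Reference value as computed in the test: normalize, then take `len()` (bytes).
fn normalized_byte_len(text: &str) -> usize {
    text.split_whitespace().collect::<Vec<&str>>().join(" ").len()
}

/// Same normalization, but counting Unicode scalar values (`char`s) instead.
fn normalized_scalar_count(text: &str) -> usize {
    text.split_whitespace()
        .collect::<Vec<&str>>()
        .join(" ")
        .chars()
        .count()
}

fn main() {
    // ASCII only: bytes and chars agree, so `.len()` is a valid reference.
    let ascii = "A   very \n messy content";
    assert_eq!(normalized_byte_len(ascii), normalized_scalar_count(ascii));

    // Each emoji is one `char` but four bytes in UTF-8.
    let emoji = "🦀    👋";
    assert_eq!(normalized_scalar_count(emoji), 3);
    assert_eq!(normalized_byte_len(emoji), 9);
}
```

If a non-ASCII fixture is ever compared against a computed reference rather than a literal, `.chars().count()` would be the matching choice.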
CHANGELOG.md (1)

8-10: Accurate Documentation of NodeRef::normalized_char_count
The changelog entry clearly describes the new method's purpose—to estimate the number of characters in descendant nodes with normalization (i.e., treating consecutive whitespace as a single character). Consider adding a brief note (if available) on performance improvements or any specific nuances in behavior, which could be valuable for users reviewing changelog details.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 473fc5a and 83d5a03.

📒 Files selected for processing (6)
  • CHANGELOG.md (1 hunks)
  • src/dom_tree.rs (1 hunks)
  • src/dom_tree/helpers.rs (1 hunks)
  • src/dom_tree/ops.rs (3 hunks)
  • src/node/node_ref.rs (1 hunks)
  • tests/node-traversal.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/dom_tree.rs
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/dom_tree/ops.rs

[warning] 82-82: src/dom_tree/ops.rs#L82
Added line #L82 was not covered by tests


[warning] 84-84: src/dom_tree/ops.rs#L84
Added line #L84 was not covered by tests

🔇 Additional comments (3)
src/dom_tree/ops.rs (1)

46-90: LGTM! The implementation is robust and efficient.

The tree traversal logic correctly handles:

  1. Different node types
  2. Whitespace normalization
  3. Stack-based traversal for efficiency
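The three points above can be illustrated with a toy model. The sketch below is not the code from src/dom_tree/ops.rs: `ToyNode` is a hypothetical, simplified node type (the real tree is not shown in this thread), and carrying the whitespace flag across text nodes is an assumption, chosen so the result matches normalizing the concatenated text as the test expects.

```rust
/// Hypothetical, simplified node type used only for this sketch.
enum ToyNode {
    Element(Vec<ToyNode>),
    Text(String),
    Comment(String),
}

/// Counts the characters of all descendant text as if the concatenated
/// text had been whitespace-normalized, without building that string.
fn normalized_char_count(root: &ToyNode) -> usize {
    let mut count = 0;
    // Starting as `true` drops leading whitespace.
    let mut prev_was_whitespace = true;
    // Explicit stack instead of recursion.
    let mut stack = vec![root];

    while let Some(node) = stack.pop() {
        match node {
            // Push children in reverse so they are popped in document order.
            ToyNode::Element(children) => stack.extend(children.iter().rev()),
            ToyNode::Text(text) => {
                for c in text.chars() {
                    if prev_was_whitespace && c.is_whitespace() {
                        continue;
                    }
                    count += 1;
                    prev_was_whitespace = c.is_whitespace();
                }
            }
            // Any other node kind (here: comments) contributes nothing.
            _ => continue,
        }
    }

    // A trailing whitespace run was counted once; remove it again.
    if prev_was_whitespace && count > 0 {
        count -= 1;
    }
    count
}

fn main() {
    let tree = ToyNode::Element(vec![
        ToyNode::Text("  A   very ".into()),
        ToyNode::Comment("ignored".into()),
        ToyNode::Element(vec![ToyNode::Text(" messy  ".into())]),
        ToyNode::Text("\n content \n".into()),
    ]);
    // Same result as normalizing the concatenated text to "A very messy content".
    assert_eq!(normalized_char_count(&tree), "A very messy content".len());
}
```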

src/node/node_ref.rs (1)

681-696: LGTM! Well-documented and properly implemented.

The implementation:

  1. Has comprehensive documentation explaining the behavior
  2. Correctly delegates to the core implementation
  3. Maintains consistent API design with other methods
CHANGELOG.md (1)

6-7: Clear "Unreleased" Section Added
The new "Unreleased" section is correctly introduced, aligning with our release documentation strategy. Once this feature is officially released, remember to update this section appropriately.

Comment on lines +82 to +84
_ => continue,
}
}
@coderabbitai coderabbitai bot Feb 1, 2025

💡 Codebase verification

Missing explicit test coverage for error-handling paths in src/dom_tree/ops.rs

The existing test (test_node_normalized_char_count in tests/node-traversal.rs) exercises normal traversal behavior but does not appear to trigger the _ => continue branch used for error handling. Please add tests that simulate error scenarios (e.g., passing unsupported or malformed node types) to ensure that these skipped paths are covered.

  • File: src/dom_tree/ops.rs (lines 82-84)
  • Action: Add tests that force the error handling branch to be executed.
🔗 Analysis chain

Add test coverage for error handling paths.

These lines handling the continue case are not covered by tests.

Run this script to verify the test coverage:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for error handling paths
rg -A 5 "test.*normalized_char_count" tests/

Length of output: 404

niklak (Owner, Author) replied:

Nope, the current implementation doesn't need an extra split_whitespace pass. Can you see that?

coderabbitai bot replied:

@niklak You're absolutely right - I apologize for my confusion. Looking at the implementation more carefully, I can see that it directly counts characters during traversal without any string splitting. The _ => continue branch is simply for skipping non-text/non-element nodes, not for error handling. Thank you for the correction.


✏️ Learnings added
Learnt from: niklak
PR: niklak/dom_query#61
File: src/dom_tree/ops.rs:82-84
Timestamp: 2025-02-01T11:45:24.579Z
Learning: In dom_query's TreeNodeOps::normalized_char_count implementation, the `_ => continue` branch is used to skip non-text/non-element nodes during traversal, not for error handling. The method efficiently counts characters without requiring string splitting operations.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

niklak (Owner, Author) replied:

Can you write a poem in RUN DMC style?))

coderabbitai bot replied:

@niklak Yo, check it out! 🎤

Listen up y'all, it's time to get down
'Bout that normalized_char_count in this town
Traversing nodes with style and grace
Counting chars at a steady pace

No split whitespace, that's old school
Direct counting's our golden rule
Skip them nodes that ain't text or element
That's how we keep our code development!

WORD! 🎵

niklak (Owner, Author) replied:

That's cool!

coderabbitai bot replied:

@niklak Thanks! Glad you enjoyed the old school flow! 🎤😎

@niklak niklak merged commit b598faf into main Feb 1, 2025
11 checks passed
@niklak niklak deleted the feature/node-normalized-char-count branch February 1, 2025 11:56