Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add required methods to NullBufferBuilder #7013

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Chen-Yuan-Lai
Copy link

@Chen-Yuan-Lai Chen-Yuan-Lai commented Jan 24, 2025

Which issue does this PR close?

Closes #7002 .

Rationale for this change

As #7002 says, some required methods are implemented due to the need to replace BooleanBufferBuilder with NullBufferBuilder in DataFsusion.

What changes are included in this PR?

  • Implemented methods below:
    • capacity()
    • is_valid() (as same as get_bit() in BooleanBufferBuilder)
    • truncate()
  • Add related unit tests

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 24, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up, left some comments 👍

if let Some(buf) = self.bitmap_builder.as_ref() {
buf.capacity()
} else {
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think need to reference self.capacity here instead of 0?

Copy link
Author

@Chen-Yuan-Lai Chen-Yuan-Lai Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure when bitmap_builder hasn't been initialized, we should return the actual capacity of it (that is 0 because no builder existed) or self.capacity.

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion it would feel weird if I initialize a NullBufferBuilder with a capacity:

let nbb = NullBufferBuilder::new(10)

But then checking the capacity right after would result in it saying 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is a bit strange, but I think the PR as written makes the most sense.

Specifically, downstream in DataFusion (and other places) we use the capacity as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.

I think we can make this less confusing with some comments. I left some suggestions

Another thing that might help might be to rename it to fn allocated_capacity() to make the difference more explicit 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn allocated_capacity() would make sense, but it seems we already have that:

/// Return the allocated size of this builder, in bytes, useful for memory accounting.
pub fn allocated_size(&self) -> usize {
self.bitmap_builder
.as_ref()
.map(|b| b.capacity())
.unwrap_or(0)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rename capacity field to initial_capacity or initial_bits to clearly indicate this is for initialization only?

pub fn new(capacity: usize) -> Self {
      Self {
          bitmap_builder: None,
          len: 0,
          initial_capacity,
      }
 }

arrow-buffer/src/builder/null.rs Outdated Show resolved Hide resolved
Comment on lines +161 to +163
if len > self.len {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we materialize, is self.len accurate anymore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.

This behavior seems to be consistent with Vec::truncate so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate

Copy link
Contributor

@Jefffrey Jefffrey Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this test case:

    #[test]
    fn test123() {
        let mut builder = NullBufferBuilder::new(0);
        assert_eq!(builder.len(), 0);
        builder.append_n_nulls(2);
        assert_eq!(builder.len(), 2);
        builder.truncate(1);
        assert_eq!(builder.len(), 1); // fails here
    }

It would fail at the last assertion because it was materialized after appending two nulls, but then truncating down to 1 is a noop since the internal self.len stays 0 (not updated after materialization)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are (of course) correct. This is a great test and find

I took the liberty of pushing a commit to this branch that includes this test case (and will fail CI until it is fixed so block this PR from merging)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Chen-Yuan-Lai and @Jefffrey -- I think this PR looks really nice.

The comments / suggestions from @Jefffrey are worth looking at I think but in my opinion we could also merge this PR as is.

I will wait for @Jefffrey 's comments before doing so

if let Some(buf) = self.bitmap_builder.as_ref() {
buf.capacity()
} else {
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is a bit strange, but I think the PR as written makes the most sense.

Specifically, downstream in DataFusion (and other places) we use the capacity as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.

I think we can make this less confusing with some comments. I left some suggestions

Another thing that might help might be to rename it to fn allocated_capacity() to make the difference more explicit 🤔

@@ -134,6 +133,42 @@ impl NullBufferBuilder {
}
}

/// Returns the capacity of the buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the capacity of the buffer
/// Returns the allocated capacity of the buffer, if any
///
/// Note: if no nulls have been added, no buffer has been materialized
/// and this function will return 0, regardless of the capacity passed to `Self::new`

Comment on lines +161 to +163
if len > self.len {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.

This behavior seems to be consistent with Vec::truncate so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add required methods to access inner builder for NullBufferBuilder
3 participants