-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add required methods to NullBufferBuilder
#7013
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up, left some comments 👍
if let Some(buf) = self.bitmap_builder.as_ref() { | ||
buf.capacity() | ||
} else { | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think need to reference self.capacity
here instead of 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not sure when bitmap_builder
hasn't been initialized, we should return the actual capacity of it (that is 0 because no builder existed) or self.capacity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion it would feel weird if I initialize a NullBufferBuilder
with a capacity:
let nbb = NullBufferBuilder::new(10)
But then checking the capacity right after would result in it saying 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is a bit strange, but I think the PR as written makes the most sense.
Specifically, downstream in DataFusion (and other places) we use the capacity
as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.
I think we can make this less confusing with some comments. I left some suggestions
Another thing that might help might be to rename it to fn allocated_capacity()
to make the difference more explicit 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I also made Minor: Clarify NullBufferBuilder::new capacity parameter #7016 to try and clarify the behavior of
capacity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn allocated_capacity()
would make sense, but it seems we already have that:
arrow-rs/arrow-buffer/src/builder/null.rs
Lines 188 to 194 in 0c07ec7
/// Return the allocated size of this builder, in bytes, useful for memory accounting. | |
pub fn allocated_size(&self) -> usize { | |
self.bitmap_builder | |
.as_ref() | |
.map(|b| b.capacity()) | |
.unwrap_or(0) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rename capacity
field to initial_capacity
or initial_bits
to clearly indicate this is for initialization only?
pub fn new(capacity: usize) -> Self {
Self {
bitmap_builder: None,
len: 0,
initial_capacity,
}
}
if len > self.len { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we materialize, is self.len
accurate anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.
This behavior seems to be consistent with Vec::truncate
so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider this test case:
#[test]
fn test123() {
let mut builder = NullBufferBuilder::new(0);
assert_eq!(builder.len(), 0);
builder.append_n_nulls(2);
assert_eq!(builder.len(), 2);
builder.truncate(1);
assert_eq!(builder.len(), 1); // fails here
}
It would fail at the last assertion because it was materialized after appending two nulls, but then truncating down to 1 is a noop since the internal self.len
stays 0 (not updated after materialization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are (of course) correct. This is a great test and find
I took the liberty of pushing a commit to this branch that includes this test case (and will fail CI until it is fixed so block this PR from merging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Chen-Yuan-Lai and @Jefffrey -- I think this PR looks really nice.
The comments / suggestions from @Jefffrey are worth looking at I think but in my opinion we could also merge this PR as is.
I will wait for @Jefffrey 's comments before doing so
if let Some(buf) = self.bitmap_builder.as_ref() { | ||
buf.capacity() | ||
} else { | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is a bit strange, but I think the PR as written makes the most sense.
Specifically, downstream in DataFusion (and other places) we use the capacity
as a way to calculate how much memory has been allocated -- if there is no bitmap_builder there is no memory allocated.
I think we can make this less confusing with some comments. I left some suggestions
Another thing that might help might be to rename it to fn allocated_capacity()
to make the difference more explicit 🤔
@@ -134,6 +133,42 @@ impl NullBufferBuilder { | |||
} | |||
} | |||
|
|||
/// Returns the capacity of the buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns the capacity of the buffer | |
/// Returns the allocated capacity of the buffer, if any | |
/// | |
/// Note: if no nulls have been added, no buffer has been materialized | |
/// and this function will return 0, regardless of the capacity passed to `Self::new` |
if len > self.len { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean -- in this case the user requested to truncate to a value larger than the current length which has no effect.
This behavior seems to be consistent with Vec::truncate
so it makes sense to me: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.truncate
Which issue does this PR close?
Closes #7002 .
Rationale for this change
As #7002 says, some required methods are implemented due to the need to replace
BooleanBufferBuilder
withNullBufferBuilder
in DataFsusion.What changes are included in this PR?
capacity()
is_valid()
(as same asget_bit()
inBooleanBufferBuilder
)truncate()
Are there any user-facing changes?