More iobuf cross-shard checks, better doc #23263

Closed
128 changes: 110 additions & 18 deletions src/v/bytes/iobuf.h
@@ -40,15 +40,56 @@
*
* General sharing-mutation caveat:
*
* iobuf has more complicated mutation and cross-shard sharing rules, compared
* to most other types such as int or std::string. The underlying cause in both
* cases is that two different iobuf objects may share one or more underlying
* buffers, and hence operations on one iobuf may be visible to the other.
*
* Operations such as share(), copy() and appending an iobuf or other compatible
* buffer type to an iobuf may be zero-copy, in the sense that some or all of
* the payload bytes may be shared between multiple iobufs (or between an iobuf
* and a compatible buffer type like ss::temporary_buffer<>). The sharing occurs
* at the fragment level.
*
* Be careful when any zero-copy operations are used, as iobuf
* does not perform copy-on-write; therefore byte-level changes will be visible
* to all iobufs that share the backing fragments.
* We say that two or more iobuf objects which share fragments have "internal
* sharing" and between such iobufs the following restrictions apply:
*
* BYTE MUTATION CAVEAT
*
* You should not write into the bytes held by an iobuf if it is internally
* shared with another buffer, since the updates will potentially be seen
* by both iobufs.
*
* On the other hand, two iobufs that have internal sharing will behave
* independently with respect to "structural updates", which are all mutations
* except for writing into the buffer itself. For example, if one iobuf is
* created as a copy of another via the share() method, they will have full
* internal sharing, but appending to one buffer will not be seen by the other.
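*
* For example (an illustrative sketch, not code from this file):
*
*   iobuf a;
*   a.append("hello", 5);
*   iobuf b = a.share(0, a.size_bytes()); // full internal sharing with 'a'
*   b.append(" world", 6);                // structural update: 'a' unchanged
*   // Writing into the payload bytes of 'b' (e.g. through a pointer obtained
*   // from one of its fragments) would also be visible through 'a'.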
*
* CROSS-SHARD SHARING CAVEAT
*
* Two iobufs which have internal sharing should not be accessed concurrently on
* different shards. Note that this is a much stronger condition than the usual
* thread-safety requirements for C++ objects since this applies to different
* objects with (potentially hidden) internal sharing, while the usual rules
* apply only to sharing of the _same_ object.
*
* More formally and slightly stricter than the above: every iobuf has an
* "origin" shard which cannot be changed and it must only be accessed on that
* shard: access from another shard is an error which may or may not be
* detected. An iobuf's origin shard is set at construction, as documented
* in the method doc (for example, the default constructor sets the origin
* shard to the current one, while the move constructor inherits the origin
* shard from the source and so on).
*
* The only safe way to get the contents of an iobuf from one shard to another
* is to pass the iobuf to the other shard and then call copy() on it, which is
Member

This doesn't seem like a safe interface to me?

It means the submit_to caller always has to wait until the submit_to future resolves to do anything with its iobuf and hope the submit_to side is not keeping the original iobuf around somehow? It seems as unsafe to me as the more idiomatic way of just doing copy/share first and then moving the result to the other shard. It further pessimizes the cases where no copy is needed and the iobuf is simply EOL on the source shard.
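
(A sketch of the copy/share-then-move pattern referred to here, for concreteness; 'buf' and 'target' are placeholder names, and as discussed further down the moved-to iobuf would still carry the source origin shard under the current oncore rules:)

    // copy on the source shard first, then move the result across
    iobuf for_target = buf.copy();
    co_await ss::smp::submit_to(target, [b = std::move(for_target)]() mutable {
        // use 'b' on the target shard
    });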

I think the oncore stuff is just broken and doesn't really work as is. I tried to extend it in the past as well.

This really needs language support, maybe there is some thread_local hackery that might work?

Member Author
@travisdowns travisdowns Sep 11, 2024

Good feedback, thanks.

It means the submit_to caller always has to wait until the submit_to future resolves to do anything with its iobuf and hope the submit_to side is not keeping the original iobuf around somehow?

Right, though waiting until the future resolves is almost always the existing use case for submit_to and related calls?

That said, I wrote this as a codification of the existing oncore restrictions of the class, and the existing behavior of the share/copy/move ctor methods: the thing that creates the new iobuf (here, copy()) needs to happen on the target shard since it grabs its origin shard ID implicitly from the current shard.

As far as I can tell, iobuf does not currently allow any reasonable way to move an iobuf from one shard to another: the origin ID is preserved on move, so you can never get a moved-to iobuf on another shard with the right origin ID. It may only happen to work in some cases because the verify check wasn't uniformly added everywhere, so if you avoid the methods where it does exist you at least won't get an assert (but you may crash).

I don't think it's a slam dunk though that copy()-on-source is definitely better, at least if we have any concept of origin shard: consider the case where you do an invoke_on (all shards); you really want to create N copies (at least with the current restriction), one for each shard. In that case copy() on the target just works: it makes the right number of copies of the iobuf with the right shard ID.
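
(A sketch of that invoke_on-all-shards case, with illustrative names; ss::smp::invoke_on_all is the seastar call assumed here:)

    // 'buf' lives on the calling shard; each shard makes its own copy
    co_await ss::smp::invoke_on_all([&buf] {
        iobuf local = buf.copy(); // origin shard == the shard running this lambda
        // ... use 'local' on this shard ...
    });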

It further pessimizes the cases where no copy is needed and the iobuf is simply EOL on the source shard.

Yes, but as above this is "already the case" and I don't think there's any way to move safely currently.

I think the oncore stuff is just broken and doesn't really work as is.

Well, the restriction is "too strict", but we already have it, so I'm trying to lean on that for safety at the cost of reinforcing the existing pessimization. So it is at least partly workable in the sense that it has been like this since iobuf's introduction.

That said, totally open to better approaches here too, either for the doc or the implementation. Just relying on 100% adherence to complex and mostly undocumented lifetime rules isn't going to cut it, I think: especially because one of the main failure modes occurs due to a non-atomic increment race, which is going to be very hard to pick up with testing and likely to slip through to production.

I tried to extend it in the past as well (#14024).

Ah, I wish I had seen that. Yeah, it's in the same vein. There's also more discussion in https://redpandadata.atlassian.net/browse/CORE-7061 if you hadn't seen it.

This really needs language support, maybe there is some thread_local hackery that might work?

Sure, we should be open to anything here. Maybe, like @ballard26 said, it's easier just to make the refcounting atomic, which would make the semantics more sane and hence easier to document :). That does look like a potentially invasive change performance-wise though, since I think it means temporary_buffer needs to use atomic refcounting everywhere in seastar (i.e., not opt-in, e.g., via a template parameter), since we need buffers created and consumed in seastar to be like that from the start: plus the work of combing through all the deleter implementations to see if they are truly compatible with cross-shard use beyond the refcounting, since they can do arbitrary things at deletion time.

Member
@StephanDollberg StephanDollberg Sep 11, 2024

As far as I can tell, iobuf does not currently allow any reasonable way to move an iobuf from one shard to another: the origin ID is preserved on move, so you can never get a moved-to iobuf on another shard with the right origin ID. It may only happen to work in some cases because the verify check wasn't uniformly added everywhere, so if you avoid the methods where it does exist you at least won't get an assert (but you may crash).

Yes, but as above this is "already the case" and I don't think there's any way to move safely currently.

Right, yes; I guess I should have clarified that I was thinking more about the release-build use case or with the oncore check removed, as I think it doesn't make much sense in its current form as described otherwise. I do very much suspect it wasn't added to all methods because it would break some common current usage patterns that are fine from a thread-safety perspective.

I don't think it's a slam dunk though that copy()-on-source is definitely better, at least if we have any concept of origin shard: consider the case where you do an invoke_on (all shards); you really want to create N copies (at least with the current restriction), one for each shard. In that case copy() on the target just works: it makes the right number of copies of the iobuf with the right shard ID.

In that example, is a copy even needed at all? If it's safe to copy/read from N other threads in parallel then it would also be safe to just read directly without copying for the duration of the invoke_on_all call? It would require const == threadsafe, not sure whether that is given at least today.

said it's easier just to make the refcounting atomic ... potentially invasive change performance-wise though

I do think that is the best way forward. Back when I was looking into this because of the same bug/misuse in a different part of the code this was my conclusion as well. Note I think this could be a performance gain as it would mean that we can drop the extra copies on the produce and fetch path which are likely a lot more expensive.

For the checks, I am wondering whether we could add a method with a signature along the lines of:

iobuf&& move_to_shard(shard_id target_shard_id)

That invalidates the current object and creates a new one with the source shard id adapted. Then we could possibly add the checks in more places like share() (outside of the constructors, which are probably still needed implicitly as seastar might move the lambda under the hood). Haven't really thought this through.
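
(A hypothetical usage sketch of that proposed interface; move_to_shard does not exist in the current code and 'buf'/'target' are placeholders:)

    // on the source shard: consume 'buf', producing an iobuf bound to 'target'
    iobuf moved = std::move(buf).move_to_shard(target);
    co_await ss::smp::submit_to(target, [b = std::move(moved)]() mutable {
        // 'b' would now be usable on 'target' as its origin shard
    });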

Member
@StephanDollberg StephanDollberg Sep 11, 2024

But no, move_to_shard is never really safe because, at least in its current form, the iobuf doesn't know whether it uniquely owns the temporary_buffer (in #13447 (comment) the issue was definitely that we could have already gotten a shared buffer out of ss::input_stream::read_up_to).

So I guess this really already pessimizes this case:

It further pessimizes the cases where no copy is needed and the iobuf is simply EOL on the source shard.

as it can never be correct, unfortunately. And in that sense a share with the oncore check failing anywhere is a bug. As are the failures in this PR? Well, maybe not, if you can guarantee that the temporary_buf is not shared, possibly because you constructed it yourself somewhere.

So possibly copy_to_shard(shard_id target_shard_id) might be a good interface, but in practice just doing the copy in the submit_to callback is easier.

Member Author
@travisdowns travisdowns Sep 11, 2024

Right yes, I guess I should have clarified that I was thinking more about the release build usecase or with the oncore check removed as I think it doesn't make much sense in its current form as described otherwise.

Well, I feel like we shouldn't be doing anything in the release build that would assert in the debug build, right? Both "in principle" but also in practice it should be unlikely, since we run (some subset of) our tests in debug and those don't assert. So I would say that the methods that have the oncore check are off-limits for cross-origin calls regardless of the build mode.

That aside, I basically agree with you that the oncore check seems like it's not really doing the right thing or at least pessimizes things unnecessarily. So I'll consider this on hold for now.

Edit: To add, this was responding to your comment two up; I hadn't seen the one immediately above, so I'll have to consider that further.

Member Author

as it can never be correct unfortunately.

Right. Though some correct cases could be detected conservatively, it would require more tracking. E.g., every operation either potentially causes shared buffers to be added to the iobuf or it doesn't: the constructors that just take a char*/size which is copied into the iobuf don't result in sharing, something like the move ctor should inherit the sharing state, clear() removes all sharing, etc. So you track this boolean "not shared vs. maybe shared" state and only assert if it's in the maybe-shared state, or apply some other kind of logic. E.g., moving across shards is safe for not-shared stuff.
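
(A rough sketch of that tracking idea, purely illustrative; no such member exists today:)

    class iobuf {
        // ...
        bool _maybe_shared{false}; // set by share(), zero-copy appends of shared
                                   // buffers, etc.; reset by clear()
    public:
        bool maybe_shared() const { return _maybe_shared; }
        // cross-shard moves could then be allowed only when maybe_shared()
        // is false
    };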

As are the failures in this PR?

I did look at two cases and they both looked safe in the sense above: on one shard an iobuf is created "unshared" (e.g., as the result of parsing some json), then it is moved to another shard (produce flow in this case).

* specifically excepted from the above prohibition on access from another
* shard. This will return a deep copy of the buffer with its origin shard set
* to the shard the copy was performed on.
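*
* An illustrative sketch of that pattern (target_shard and buf are placeholder
* names; ss::smp::submit_to is the usual seastar cross-shard call assumed here):
*
*   // on the origin shard of 'buf':
*   co_await ss::smp::submit_to(target_shard, [&buf] {
*       iobuf local = buf.copy(); // deep copy; origin shard is target_shard
*       // ... use 'local' freely on target_shard ...
*   });
*   // back on the origin shard, 'buf' can now be mutated or destroyed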
*/
class iobuf {
// Not a lightweight object.
@@ -88,17 +129,35 @@ class iobuf {
// noexcept
}
~iobuf() noexcept;

/**
* @brief Construct a new iobuf object by moving the source iobuf into it
*
* This leaves the source iobuf empty. Note that the origin shard for the
* newly constructed iobuf is the same as the source, so there is no viable
* way to move an iobuf from one shard to another: the target of the move
* will always have the same origin shard as the source, no matter where the
* moves happen, and so will not be accessible on the target shard (see the
* sharing-mutation caveat in the class comment for details on this
* restriction).
*
* Instead, to "move" an iobuf across shards you must copy() it on the
* target shard and then clear or destroy the source buffer on the source
* shard.
*/
iobuf(iobuf&& x) noexcept
: _frags(std::move(x._frags))
, _size(x._size)
#ifndef NDEBUG
, _verify_shard(x._verify_shard)
#endif
{
x.mutating_method_called();
x._frags = container{};
x._size = 0;
}
iobuf& operator=(iobuf&& x) noexcept {
mutating_method_called();
if (this != &x) {
this->~iobuf();
new (this) iobuf(std::move(x));
@@ -132,8 +191,11 @@ class iobuf {
* copy will be the same as this iobuf, but callers should not rely on the
* precise details.
*
* Since this call performs zero-copy operations, the sharing-mutation
* caveat in the class comment applies.
* Like almost all methods, this method must only be called on the origin
* shard of this iobuf. The returned iobuf will have the same origin, and so
* this method cannot be used to safely share iobufs across shards (see the
* sharing-mutation caveat in the class comment for details). Use copy() to
* move iobuf content from one shard to another.
*/
iobuf share(size_t pos, size_t len);

@@ -143,8 +205,17 @@
* mutations to the payload bytes of this iobuf do not affect the returned
* value or vice-versa.
*
* Copying an iobuf is optimized for cases where the size of the resulting
* iobuf will not be increased (e.g. via iobuf::append).
* The returned iobuf is linearized, and is optimized for cases
* where the size of the resulting iobuf will not be increased (e.g. via
* iobuf::append). That is, the last fragment is sized relatively tightly to
* the size of the data, rather than having a lot of padding as it might if the
* same sequence of bytes were appended to an empty iobuf.
*
* Unlike most methods which create a new iobuf based on an existing one,
* this method sets the origin shard of the iobuf to the current shard, so
* it is safe to send an iobuf to another shard, then call copy on it and
* then access the copy on the other shard. See the sharing-mutation caveat in
* the class comment for further details.
*/
iobuf copy() const;

@@ -257,13 +328,20 @@ class iobuf {
void create_new_fragment(size_t);
size_t last_allocation_size() const;

/**
* Should be called before every mutating method in order to perform any
* consistency checks associated with mutating methods.
*/
void mutating_method_called() const;

container _frags;
size_t _size{0};
expression_in_debug_mode(oncore _verify_shard);
friend std::ostream& operator<<(std::ostream&, const iobuf&);
};

inline void iobuf::clear() {
mutating_method_called();
_frags.clear_and_dispose(&details::dispose_io_fragment);
_size = 0;
}
@@ -296,7 +374,18 @@ inline size_t iobuf::last_allocation_size() const {
return _frags.empty() ? details::io_allocation_size::default_chunk_size
: _frags.back().capacity();
}

inline void iobuf::mutating_method_called() const {
// It is a bug to access an iobuf on any shard other than its "origin shard"
// (which may be different than the shard it was constructed on), so check
// that we aren't doing this in debug mode. This check should also apply to
// const methods, but currently we mostly only check this on mutating
// methods.
oncore_debug_verify(_verify_shard);
}

inline void iobuf::append(std::unique_ptr<fragment> f) {
mutating_method_called();
if (!_frags.empty()) {
_frags.back().trim();
}
@@ -305,20 +394,21 @@ inline void iobuf::append(std::unique_ptr<fragment> f) {
_frags.push_back(*f.release());
}
inline void iobuf::prepend(std::unique_ptr<fragment> f) {
mutating_method_called();
_size += f->size();
_frags.push_front(*f.release());
}

inline void iobuf::create_new_fragment(size_t sz) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
auto chunk_max = std::max(sz, last_allocation_size());
auto asz = details::io_allocation_size::next_allocation_size(chunk_max);
append(std::make_unique<fragment>(asz));
}
/// only ensures that a segment of at least reservation is available
/// as an empty details::io_fragment
inline void iobuf::reserve_memory(size_t reservation) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
if (auto b = available_bytes(); b < reservation) {
if (b > 0) {
_frags.back().trim();
@@ -329,13 +419,14 @@

[[gnu::always_inline]] void inline iobuf::prepend(
ss::temporary_buffer<char> b) {
mutating_method_called();
if (unlikely(!b.size())) {
return;
}
prepend(std::make_unique<fragment>(std::move(b)));
}
[[gnu::always_inline]] void inline iobuf::prepend(iobuf b) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
while (!b._frags.empty()) {
b._frags.pop_back_and_dispose([this](fragment* f) {
prepend(f->share());
@@ -346,12 +437,13 @@ inline void iobuf::reserve_memory(size_t reservation) {
/// append src + len into storage
[[gnu::always_inline]] void inline iobuf::append(
const uint8_t* src, size_t len) {
mutating_method_called();
// NOLINTNEXTLINE
append(reinterpret_cast<const char*>(src), len);
}

[[gnu::always_inline]] void inline iobuf::append(const char* ptr, size_t size) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
if (unlikely(size == 0)) {
return;
}
@@ -374,10 +466,10 @@ inline void iobuf::reserve_memory(size_t reservation) {

/// appends the contents of buffer; might pack values into existing space
[[gnu::always_inline]] inline void iobuf::append(ss::temporary_buffer<char> b) {
mutating_method_called();
if (unlikely(!b.size())) {
return;
}
oncore_debug_verify(_verify_shard);
const size_t last_asz = last_allocation_size();
// The following is a heuristic to decide between copying and zero-copy
// append of the source buffer. The rule we apply is if the buffer we are
@@ -405,7 +497,7 @@ inline void iobuf::reserve_memory(size_t reservation) {
}
/// appends the contents of buffer; might pack values into existing space
inline void iobuf::append(iobuf o) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
while (!o._frags.empty()) {
o._frags.pop_front_and_dispose([this](fragment* f) {
append(f->share());
@@ -415,7 +507,7 @@ inline void iobuf::append(iobuf o) {
}

inline void iobuf::append_fragments(iobuf o) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
while (!o._frags.empty()) {
o._frags.pop_front_and_dispose([this](fragment* f) {
append(std::make_unique<fragment>(f->share()));
@@ -425,17 +517,17 @@ inline void iobuf::append_fragments(iobuf o) {
}
/// used for iostreams
inline void iobuf::pop_front() {
oncore_debug_verify(_verify_shard);
mutating_method_called();
_size -= _frags.front().size();
_frags.pop_front_and_dispose(&details::dispose_io_fragment);
}
inline void iobuf::pop_back() {
oncore_debug_verify(_verify_shard);
mutating_method_called();
_size -= _frags.back().size();
_frags.pop_back_and_dispose(&details::dispose_io_fragment);
}
inline void iobuf::trim_front(size_t n) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
while (!_frags.empty()) {
auto& f = _frags.front();
if (f.size() > n) {
@@ -448,7 +540,7 @@
}
}
inline void iobuf::trim_back(size_t n) {
oncore_debug_verify(_verify_shard);
mutating_method_called();
while (!_frags.empty()) {
auto& f = _frags.back();
if (f.size() > n) {
7 changes: 2 additions & 5 deletions tests/rptest/services/cluster.py
@@ -118,12 +118,9 @@ def wrapped(self: HasRedpanda, *args: Any, **kwargs: Any):
f"Test failed, doing failure checks on {redpanda.who_am_i()}..."
)

# Disabled to avoid addr2line hangs
# (https://github.com/redpanda-data/redpanda/issues/5004)
# self.redpanda.decode_backtraces()

if isinstance(redpanda, RedpandaServiceBase):
if isinstance(redpanda, RedpandaService):
redpanda.cloud_storage_diagnostics()
redpanda.decode_backtraces()
if isinstance(redpanda,
RedpandaService | RedpandaServiceCloud):
redpanda.raise_on_crash(log_allow_list=log_allow_list)
6 changes: 5 additions & 1 deletion tests/rptest/tests/redpanda_test.py
@@ -36,7 +36,11 @@ def __init__(self, test_context: TestContext):
self.scale = Scale(test_context)

def setUp(self):
self.__redpanda.start()
try:
self.__redpanda.start()
except:
self.__redpanda.decode_backtraces()
raise
self._create_initial_topics()

@abstractmethod