add test that shows how lambda tokenization is broken #765
Conversation
Unless there are other options I am not seeing, 3) would be the most permissive option from the users' perspective. Since it is also the one with the most uncertainty, I'll probably invest some time and whip up a prototype outlining what I'm thinking. Short term we may just postpone scheduler integration, since I don't think this is critical for a release, but I'd like to know what our options are. If we were forced to go with option 1) it would be good to know before the release |
There is also the option of making cloudpickle actually deterministic. There has been a little progress on that front: cloudpipe/cloudpickle#453, cloudpipe/cloudpickle#524 |
Can someone help me understand the cost of non-deterministic hashing in dask-expr? Historically hashing was a nice way to avoid work, but if it didn't work then 🤷‍♂️ we just did a bit more work. I'm guessing that this comes down to the fact that we recreate the same object repeatedly in the optimization process, and need to see if the object is the same or not. Is that right?
|
The assumption that the name attribute of an expression is the same if the expression has exactly the same operands is deeply baked into our optimization logic |
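For illustration, a hedged sketch of how that assumption typically surfaces; the `dedupe` helper below is hypothetical, not dask-expr code, but it shows how optimization logic can identify "the same" subexpression purely by its name:

```python
# Hypothetical helper, not the actual dask-expr code: common-subexpression
# detection keyed purely on the expression name. If the name is not a pure
# function of the operands, identical subexpressions are never merged, and
# re-tokenizing the same expression later can yield a different key entirely.
seen: dict = {}

def dedupe(expr):
    # Two expressions with equal names are treated as interchangeable.
    return seen.setdefault(expr._name, expr)
```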
Apart from the optimization logic itself, the way distributed is currently built is roughly as follows:
- Client side
- Scheduler side
Changing this procedure is what I meant by 3), but this is a bit of work. I was hoping not to have to do this, particularly not while still supporting HLGs |
One dumb approach (maybe not the approach that we want to take) is that we optionally store a finalized key on the expression before we serialize it. So on the client side we do all of our optimizations, get a nice expr that we like, finalize the key on the expression, then serialize it up to the scheduler. This seems like it would be easy short term. It would stop us from doing scheduler-side optimization (which I think we're all excited about medium-term) but might give us the fast client-scheduler communications we want in the meantime. |
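A minimal sketch of what such a finalized key could look like; the attribute and method names here (`_finalized_name`, `finalize_name`) are hypothetical and not the actual dask-expr API:

```python
# Hypothetical sketch only -- not the actual dask-expr implementation.
from dask.base import tokenize

class Expr:
    _finalized_name = None  # set once the client is done optimizing

    def __init__(self, *operands):
        self.operands = operands

    @property
    def _name(self):
        # After finalization, always return the stored key instead of
        # re-tokenizing, so the scheduler sees exactly the key the client chose.
        if self._finalized_name is not None:
            return self._finalized_name
        return f"{type(self).__name__.lower()}-{tokenize(*self.operands)}"

    def finalize_name(self):
        # Called on the client right before pickling the optimized expression.
        self._finalized_name = self._name
        return self
```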
I haven't looked carefully through this code or other linked work yet. However, just to check my rough understanding: it seems like there are two challenges here:
|
I think I'd be fine with a finalized name after optimization |
Reposting here: the most recent commit on cloudpickle's main branch (not yet released) likely solves the lambda problem, but there are still cases that are not handled well. It's easy for us to verify on the scheduler side whether the tokenization was successful, so we can abort with a clear exception. Maybe the new cloudpickle plus that scheduler-side verification is good enough for us? |
Status update: The PRs linked from dask/dask#10905 make lambda tokenization deterministic, with the currently released version of cloudpickle, on the same Python interpreter. Nothing should be expected across processes or hosts. So we'll either need tokens to be generated on the client and then shipped to the scheduler, or the scheduler to generate them first and ship them back to the client (the latter is more involved because as of today we use batched comms, which is one-directional, but the client would need to be informed of what keys it just gained in who_wants). It would be trivial to add a single line to the dask tokenize function that bakes in a unique interpreter ID, so that we can see things fail hard and fast in trivial use cases instead of only on the edge cases of cloudpickle. It of course remains possible to have some obscure third-party object (callable or not) that doesn't cloudpickle deterministically AND comes with neither |
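A hedged sketch of the "bake in a unique interpreter ID" idea; the helper name and the wiring below are illustrative only, not the actual dask change:

```python
# Illustrative only -- not the actual dask implementation.
import uuid
from dask.base import tokenize

# Regenerated on every interpreter start, so no two processes ever share it.
_INTERPRETER_ID = uuid.uuid4().hex

def interpreter_local_token(obj) -> str:
    # Mixing the per-process ID into the token means any code that compares
    # such tokens across processes fails hard and fast, instead of matching
    # on the easy cases and only breaking on cloudpickle's edge cases.
    return tokenize(obj, _INTERPRETER_ID)
```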
Superseded by dask/dask#10883 |
#223 had good intentions and introduced a utility function `_tokenize_deterministic` that sets the `tokenize.ensure-deterministic` flag such that we can ensure that the tokens we're generating are indeed deterministic.

Determinism in this context is mildly ambiguous and there are subtly different consistency guarantees we need if we want to pickle `Expr` objects in a way that is deterministic across Python processes/interpreters.

Essentially everything we'd need to fall back to cloudpickle to serialize is also susceptible to not being deterministically tokenized (therefore this loosely relates to #331 (comment) even though those are separate issues), since the tokenization is using cloudpickle behind the scenes for those cases. In a way, that defeats the entire purpose of the tokenization engine (see point 3 below).
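A sketch of what such a helper can look like, assuming the `tokenize.ensure-deterministic` config flag described above; the wrapper itself is illustrative and not necessarily the exact dask-expr code:

```python
# Illustrative wrapper; dask-expr's actual helper may differ in details.
import dask
from dask.base import tokenize

def _tokenize_deterministic(*args, **kwargs) -> str:
    # With the flag set, dask raises instead of silently falling back to a
    # random/unstable token when it cannot tokenize an object deterministically.
    with dask.config.set({"tokenize.ensure-deterministic": True}):
        return tokenize(*args, **kwargs)
```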
#223 also implemented a "fix" for `LambdaType` functions which uses the `str` representation of those functions, which looks like `<function func.<locals>.<lambda> at 0x107363be0>`. This is unique inside of an interpreter session, since it includes the memory address where the lambda is stored, but it changes once we move to another interpreter. This effectively hard blocks #14.
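A quick illustration of why that string is not stable across processes (the exact address will differ on your machine and on every run):

```python
# The repr of a lambda embeds its memory address: stable within one
# interpreter session, but different in every other process.
def func():
    return lambda x: x + 1

f = func()
print(str(f))  # e.g. "<function func.<locals>.<lambda> at 0x107363be0>"
print(str(f))  # identical within this session; a new process prints a new address
```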
There are a couple of options (I'm not really happy about either but here we go...):

1. […] `assert_eq` #331, but this would expand to all user APIs. (We could wrap user-defined objects with some magical class, but this would feel a little awkward I think.)
2. […] `__hash__` we could just use its `id` for a unique identifier and hash (or just stick with the current engine, although it feels slightly over-engineered then).
3. From what I understand, the entire tokenization is merely for the distributed scheduler, but I see a way to avoid that. This would require us to rewrite how client and scheduler are communicating in `dask/distributed`, but I believe this is doable with a reasonable amount of effort. The scheduler would de facto be the source of truth and be responsible for key generation, and not the other way round. Locally generated keys would no longer map to remote keys. That would be fine, I guess?

cc @rjzamora @mrocklin @phofl
Note: This problem has been around forever. HLGs were not as much affected by this since their names are actually not unique (e.g. dask/dask#9888). `Expr` objects calculate names/tokens recursively, which is objectively much better for key uniqueness, but it also means that if there is any object inside the expr-tree that cannot be tokenized deterministically, this affects the entire tree.
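A minimal sketch of that recursive name calculation, using a simplified stand-in class rather than the real `Expr`: one non-deterministic leaf token changes the token of every ancestor.

```python
# Simplified stand-in for Expr -- illustrative only.
from dask.base import tokenize

class Node:
    def __init__(self, *operands):
        self.operands = operands

    @property
    def _name(self):
        # Recursive: this node's token is derived from its operands' tokens,
        # so any unstable leaf token propagates all the way up the tree.
        parts = [op._name if isinstance(op, Node) else tokenize(op)
                 for op in self.operands]
        return f"{type(self).__name__.lower()}-{tokenize(*parts)}"

# If `bad` tokenizes differently in another interpreter (e.g. a lambda),
# then Node(Node(1, bad), 2)._name differs there as well.
```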