Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a deterministic constructor for RandomState #523

Open
bkragl opened this issue Jan 19, 2025 · 8 comments
Open

Add a deterministic constructor for RandomState #523

bkragl opened this issue Jan 19, 2025 · 8 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@bkragl
Copy link

bkragl commented Jan 19, 2025

Proposal

Problem statement

The default hasher for std::collections::{HashMap, HashSet} (via the S type parameter) is RandomState, whose constructor generates fresh random keys for every instance. That's a good default for security, but problematic when hash collections must behave deterministically. Although in principle it is possible to use other deterministic hashers, there are use cases that effectively require RandomState. The problem is that today there is no way to deterministically construct RandomState values.

Motivating examples or use cases

My main interest is testing complex concurrent systems using tools like Loom and Shuttle. In these use cases it is crucial for the programs under test to be deterministic, or, rather, that all nondeterminism (e.g., scheduling choices, rand calls, etc.) is controlled by the tool. Otherwise test failures are not reproducible, which is frustrating and significantly diminishes the utility these testing tools can offer.

One unexpected source of nondeterminism are the standard library hash collections with their default hasher RandomState. This manifests in different iteration orders in every execution. In 1P code it is possible to replace all hash collections with a deterministic hasher (potentially using a feature flag for testing, to keep the default behavior in production). However, since this results in a different type, we run into a type incompatibility with 3P libraries that have hash collections with RandomState hardcoded in their public interface. Sadly that's a reality, see for example aws_sdk_dynamodb (and this related discussion in the smithy-rs code generator).

Solution sketch

The proposal is to introduce a new deterministic constructor for RandomState. My concrete proposal in rust-lang/rust#135578 is to introduce a parameter-less constructor

pub fn deterministic() -> RandomState {
    RandomState { k0: 0, k1: 0 }
}

and keep the internal representation (currently a pair of u64s) private. However, if experts think that the internal representation can be exposed, that's fine too.

Alternatives

Today there is no safe way to construct deterministic RandomState values. Folks might take the shameful path of transmuting (u64, u64) into RandomState.

It would be even nicer if hash collections (and possibly other parts of the standard library) could just be made deterministic via a compiler option, so that one can just write HashMap::new() instead of having to write HashMap::with_hasher(RandomState::deterministic()) with the solution described above. I'm happy to entertain a discussion in that direction, but I imagine that this will be a much larger effort.

@bkragl bkragl added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jan 19, 2025
@the8472
Copy link
Member

the8472 commented Jan 19, 2025

In these use cases it is crucial for the programs under test to be deterministic, or, rather, that all nondeterminism (e.g., scheduling choices, rand calls, etc.) is controlled by the tool.

If you already control all those things then can't you inject the same seeds in getrandom syscalls, instantiate the hashmaps in the same order etc?

@bkragl
Copy link
Author

bkragl commented Jan 19, 2025

In these use cases it is crucial for the programs under test to be deterministic, or, rather, that all nondeterminism (e.g., scheduling choices, rand calls, etc.) is controlled by the tool.

If you already control all those things then can't you inject the same seeds in getrandom syscalls, instantiate the hashmaps in the same order etc?

Oh, I gave the wrong impression here. Loom and Shuttle focus on controlling concurrency by providing drop-in replacements for synchronization primitives. For example, instead of std::sync::Mutex you'd import shuttle::sync::Mutex for testing. There is also a built-in rand replacement that records random choices, since rand is so common. For other nondeterminism, like with hash collections, the goal is to eliminate rather than control.

There's a different world where this functionality could be provided with a more heavyweight approach like binary rewriting or replacing syscalls, but that comes with different trade-offs.

@pitaj
Copy link

pitaj commented Jan 19, 2025

In that case, why not provide a drop-in HashMap replacement which defaults to a deterministic hasher? Or why not provide a RandomState replacement which is deterministic?

@bkragl
Copy link
Author

bkragl commented Jan 19, 2025

In that case, why not provide a drop-in HashMap replacement which defaults to a deterministic hasher? Or why not provide a RandomState replacement which is deterministic?

That only works when I own all the relevant code. The type will be HashMap<K, V, SomethingSomething>. However, I cannot pass a value of that type to code I don't own expecting HashMap<K, V, RandomState>.

@pitaj
Copy link

pitaj commented Jan 20, 2025

I'm a little confused then. From your explanation of the Mutex / rand stuff it sounds like that's the normal approach. Does the same not apply to those?

@bkragl
Copy link
Author

bkragl commented Jan 20, 2025

I'm a little confused then. From your explanation of the Mutex / rand stuff it sounds like that's the normal approach. Does the same not apply to those?

By "the same" I assume you mean the possibility to run into a type mismatch? Well, in theory yes, but practically no. I've not encountered synchronization primitives in the API of a 3p library. For rand I could imagine it is possible, but (a) only the necessary RNG types are replaced, while other things (e.g., the entire rand::distributions module) are just a re-exported, and (b) if a library takes an RNG parameter, then I think it is very likely generic. Hash collections are kind of a special case, since they are the most common embodiment of the fundamental set and map types. And while hash collections could be generic over the hasher, there are real-world libraries where the authors decided to forego that. I kind of understand that decision, since that type parameter would be extremely invasive.

@tbu-
Copy link

tbu- commented Jan 20, 2025

The most comprehensive way to do this might be to compile your own Rust standard library which seeds hash maps deterministically. Then you won't have to hunt down hashmaps in dependencies because all are covered already.

@the8472
Copy link
Member

the8472 commented Jan 20, 2025

I've not encountered synchronization primitives in the API of a 3p library.

How is the API relevant? To get the determinism you would also have to replace hash maps, rand and synchronization in their implementations, not just the public interface, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

4 participants