Make Line and Grid Qubit hashes faster for common set ops #6886

Open
daxfohl opened this issue Dec 27, 2024 · 6 comments · May be fixed by #6908
Labels
area/performance kind/design-issue A conversation around design triage/needs-feasibility [Feature requests] Needs design work to prove feasibility before accepting

Comments

daxfohl (Collaborator) commented Dec 27, 2024

(Note: the original issue was about the __eq__ function. Upon further analysis, captured in the comments below, the hash key turns out to be the primary factor, so the issue title and description have been changed accordingly.)

Set operations are common with qubits, and Python's set operations are faster when hash values are cheap to compute and produce few bucket collisions. Given that most LineQubits are defined on small contiguous integers, the fewest collisions occur when hash(q[i]) == i. A similar approach could be used for GridQubits.

The open question is what to do with dimension. We could ignore it, assuming qubits of the same index but different dimensions in the same set would be extremely rare, but it would also be extremely slow if they did occur: not only would there be quadratic behavior due to bucket collisions (which is actually not so bad in the set source for reasonably sized sets, since it's in C), but it would also require frequent __eq__ calls, which jump into Python and carry a huge constant multiplier for set ops. Another option would be to use a hash of i * 101 + dim and hope 101 is enough. Or we could ignore the issue and leave the hash code alone. (This also opens the question of whether just the qubit keys should be part of the operation, rather than the full qubit, in which case it's not even possible to represent a circuit that has two qubits of different dimensions sharing an index, but that's a much larger design question and change.)
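To make the options concrete, here is a rough sketch using a stand-in class (not Cirq's actual implementation; the class and method names here are hypothetical):

```python
class FakeLineQid:
    """Stand-in for a line qid, to illustrate the two hash options above."""

    def __init__(self, x: int, dimension: int = 2):
        self._x = x
        self._dimension = dimension

    def __eq__(self, other) -> bool:
        if isinstance(other, FakeLineQid):
            return self._x == other._x and self._dimension == other._dimension
        return NotImplemented

    # Option 1: ignore dimension. Sequential qubits get sequential hashes,
    # so set buckets rarely collide, but qids differing only in dimension
    # always land in the same bucket and fall back to __eq__.
    def __hash__(self) -> int:
        return self._x

    # Option 2: fold in the dimension with a small multiplier. Hashes stay
    # sequential for a fixed dimension, and qids with equal index but
    # different dimension (dim < 101) still hash differently.
    def hash_with_dim(self) -> int:
        return self._x * 101 + self._dimension
```

Since Python's hash of a small int is the int itself, either option keeps the hash values as cheap and as sequential as the indices themselves.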

Experimentation shows set ops are around 50% faster. In particular this speeds up construction of wide circuits, where checking for qubit overlap can take half of the construction time.

@daxfohl daxfohl added the kind/design-issue A conversation around design label Dec 27, 2024
maffoo (Contributor) commented Dec 27, 2024

We do have a custom `__eq__` method for LineQubit and other "standard" qubit types (it's implemented on `_BaseLineQid` so it works for LineQid as well):

```python
def __eq__(self, other) -> bool:
    # Explicitly implemented for performance (vs delegating to Qid).
    if isinstance(other, _BaseLineQid):
        return self is other or (self._x == other._x and self._dimension == other._dimension)
    return NotImplemented
```

We've already improved performance quite a bit with these methods, but it's certainly possible there's more room for improvement.

I'm not sure what kind of operations you are using for circuit construction, but almost everything we do internally uses Moment.from_ops and Circuit.from_moments since these perform significantly better by bypassing all the insertion strategy stuff and we almost always want precise control over the moment structure when running on hardware.

daxfohl (Collaborator, author) commented Dec 28, 2024

The implementation I made of `LineQubit.__eq__` was

```python
def __eq__(self, other) -> bool:
    return isinstance(other, LineQubit) and self._x == other._x
```

This was a little more than twice as fast. I hadn't seen the return NotImplemented for __eq__ before. Looks like that causes it to try the reverse y == x check. Anyway that likely accounts for the "a little more than twice as fast". Of course the existing implementation would catch LineQid(1, dim=2) == LineQubit(1), whereas this one wouldn't.
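The reflected-comparison fallback can be demonstrated with a small self-contained example (toy classes, not Cirq code):

```python
class A:
    def __eq__(self, other):
        # Returning NotImplemented tells Python to try the reflected
        # comparison, other.__eq__(self), before giving up.
        return NotImplemented

class B:
    def __eq__(self, other):
        # The reflected side handles comparisons against A.
        return isinstance(other, A)

# A() == B() first calls A.__eq__ (NotImplemented), then falls back to
# B.__eq__(a), which returns True. That second dispatch is the extra
# cost that "return NotImplemented" buys correctness with.
assert A() == B()

# If both sides return NotImplemented, Python falls back to identity,
# so two distinct A instances compare unequal.
assert A() != A()
```

This is why the `isinstance`-and-done version is faster: it never triggers the second `__eq__` dispatch, at the cost of missing cross-type equalities like `LineQid(1, dimension=2) == LineQubit(1)`.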

The test was basically something like:

```python
circuit = cirq.Circuit()
for q in cirq.LineQubit.range(10000):
    circuit.append(cirq.X(q))
```

Almost half the time was spent on this line https://github.com/quantumlib/Cirq/blob/main/cirq-core/cirq/circuits/moment.py#L195. So, twice the speed on the op that was almost half the time, for 25% improvement overall. Of course it's faster to create the circuit with cirq.Circuit([cirq.X(cirq.LineQubit(i)) for i in range(10000)]), since it doesn't create 9999 intermediate moments. But from sample code I see users posting, most users build circuits by appending operations one-by-one.

Another option that could improve this use case for LineQubit explicitly: we could create a FrozenQubitSet wrapper class, that contains a frozenset for all non-LineQubit qubits, but keeps LineQubits as a bitarray. The bitarray should be way faster to copy and update than a frozenset. Then Moment._qubits could be a FrozenQubitSet and line 195 would be a lot faster if it's all LineQubits. Plausibly FrozenQubitSet could contain a separate bitarray for GridQubits, using diagonal indexing. I can't think of a good way to do NamedQubits, or for multi-dimensional qudits (since we need to store the dimension somewhere).
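A minimal sketch of that idea, using plain ints as stand-ins for line-qubit indices (the class name and API here are hypothetical, and it assumes indices are small and non-negative):

```python
from typing import Iterable

class FrozenQubitSet:
    """Hypothetical sketch: line-qubit indices live in an int bitmask,
    everything else falls back to a frozenset."""

    def __init__(self, line_indices: Iterable[int] = (), others: Iterable = ()):
        self._bits = 0
        for i in line_indices:
            self._bits |= 1 << i  # assumes small non-negative indices
        self._others = frozenset(others)

    def union(self, other: "FrozenQubitSet") -> "FrozenQubitSet":
        result = FrozenQubitSet()
        # Bitwise OR on one big int is far cheaper than rehashing every
        # element into a new frozenset.
        result._bits = self._bits | other._bits
        result._others = self._others | other._others
        return result

    def overlaps(self, other: "FrozenQubitSet") -> bool:
        # Qubit-overlap check (the hot path in Moment construction)
        # becomes a single AND plus a frozenset disjointness test.
        return bool(self._bits & other._bits) or not self._others.isdisjoint(other._others)

a = FrozenQubitSet(line_indices=range(0, 100))
b = FrozenQubitSet(line_indices=range(100, 200))
assert not a.overlaps(b)
assert a.union(b).overlaps(a)
```

Copy, union, and overlap all become word-parallel bit operations, which is where the hoped-for speedup over frozenset would come from.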

daxfohl (Collaborator, author) commented Dec 29, 2024

Actually, it turns out we're duplicating work; there's already a _qubit_to_ops dict that is more performant (I guess explicitly freezing the set must be a slow op). PR above gives a 3x improvement and futzing with LineQubit.__eq__ after that change no longer affects perf at all (apparently frozenset and dict do equality checking differently(?)).

daxfohl (Collaborator, author) commented Dec 30, 2024

And it also turns out that the issue isn't __eq__, but __hash__. Creating the tuple and hashing it is expensive. Hashing an int is a lot faster. Should we consider changing to something like hash(self._x * 100 + self._dimension)?

```python
def __hash__(self) -> int:
    if self._hash is None:
        self._hash = hash((self._x, self._dimension))
    return self._hash
```

daxfohl (Collaborator, author) commented Dec 30, 2024

Actually it's not even the hashing. It's something weird in Python itself. The more random a hash key is, the more trouble Python has unioning a set — which is the opposite of what's supposed to happen.

```python
import random
import time

def test_speed():
    # Note: in Python, hash(i) == i for 0 <= i < 2**61 - 1
    s = set()
    k = 8
    t = time.perf_counter()
    for i in range(10000):
        q = i * k                               # sequential keys
        q = i * k + random.randint(0, k - 1)    # randomized keys (last assignment wins)
        s |= s.union({q})
    duration = time.perf_counter() - t
    assert duration < 0  # fails intentionally so pytest prints the measured duration
```

Here, the smaller I make k, the faster it goes. Also if I swap the order of setting q, to eliminate the random part, then it goes faster. But setting k=7 doesn't slow it down, so it's nothing to do with the bottom 3 bits.

I also made a test with dummy objects and also implemented __eq__ to show whether there were hash collisions, and none of this caused hash collisions, so that's not what's slowing it down. Totally befuddled, and it's starting to go off the rails of this issue, but nerd sniped on this one.

daxfohl (Collaborator, author) commented Dec 30, 2024

I see. Not exactly, but close enough. https://github.com/python/cpython/blob/main/Objects/setobject.c says something something Algorithm D, something something linear probing, something something cache locality.

It looks like the algorithm doesn't do the linked-list-per-bucket thing, but rather keeps a single item per bucket: start at bucket hash & mask and, if it's empty, use it; otherwise iterate until the next available bucket, with some pseudorandom jumping to avoid hot spots, but not too much, so as to retain cache locality. Linear hashes then work better than random ones, because each hash generally goes to a different bucket, whereas random ones collide. The offset and multiplier don't matter either, so long as they're not a power of two. Confirmed: k = 2**20 in the test above is horrible, twice as bad as random, but k = 2**20 - 1 is great. Also, a slight lie: a very small multiplier (e.g. 1) is great due to cache locality, which shows up in testing. And the __eq__ note above is irrelevant, because a bucket collision is not a hash collision: if the object ids are the same, it doesn't have to check __eq__. It only has to do that when the hashes are the same but the object ids are different.
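A rough simulation makes the effect visible. This is a simplified model of the probe sequence (5*j + 1 + perturb), not CPython's exact code — the real setobject.c also does a short run of linear probes for cache locality, which this omits:

```python
import random

def probes_needed(hashes, table_size):
    """Insert each hash into an open-addressed table (one item per slot)
    and count how many occupied slots we had to step past in total."""
    mask = table_size - 1  # table_size must be a power of two
    slots = [None] * table_size
    total = 0
    for h in hashes:
        j, perturb = h & mask, h
        while slots[j] is not None:
            total += 1
            # Simplified CPython-style probing: mostly pseudorandom
            # jumps, gradually mixing in the high bits of the hash.
            perturb >>= 5
            j = (5 * j + 1 + perturb) & mask
        slots[j] = h
    return total

random.seed(0)
n, size = 1000, 4096
sequential = probes_needed(range(n), size)
scattered = probes_needed([random.getrandbits(60) for _ in range(n)], size)
assert sequential == 0  # distinct small ints each claim their own slot
assert scattered > 0    # random hashes birthday-collide into occupied slots
```

Sequential small-int hashes map to distinct buckets by construction, so insertion never probes; random hashes of the same count collide by the birthday bound even at 25% load.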

Anyway, I guess all that is why Python's hash for an int is the int itself; it's faster for sets of small numbers, which are quite common.

So, all that said...should we change the hash function here from hash((x, dim)) to something like x*100 + dim? Or plausibly even just x, since it's unlikely to have a set with different dim values for the same x value? Granted, this is fairly edge case, building a circuit one-by-one with 10000 X(q[i]) gates. But it's a simple change and 30% improvement for that case, and really any case that uses sets of qubits, and it shouldn't have any negative side effects.

@daxfohl daxfohl changed the title Custom equality checkers for qubits Make Line and Grid Qubit hashes faster for common set ops Dec 31, 2024
@daxfohl daxfohl linked a pull request Jan 1, 2025 that will close this issue
@dstrain115 dstrain115 added the triage/discuss Needs decision / discussion, bring these up during Cirq Cynque label Jan 5, 2025
@mhucka mhucka added triage/needs-feasibility [Feature requests] Needs design work to prove feasibility before accepting area/performance and removed triage/discuss Needs decision / discussion, bring these up during Cirq Cynque labels Jan 8, 2025