Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotate JFR leak profiler with BasedOnJDKFile #10206

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

roberttoyonaga
Copy link
Collaborator

This is just a minor change that adds BasedOnJDKFile to the OldObjectSampler code.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Dec 2, 2024
@roberttoyonaga roberttoyonaga self-assigned this Dec 2, 2024
@christianhaeubl
Copy link
Member

Thanks.

Did you compare the current implementation with the HotSpot sources that you mentioned in the @BasedOnJDKFile annotations? I had a brief look and I think that some parts probably need to be updated on the GraalVM side (e.g., HotSpot seems to do some special handling for virtual threads, that seems to be missing on the GraalVM side).

@roberttoyonaga
Copy link
Collaborator Author

@christianhaeubl and @galderz

Thanks for catching that Christian. While looking into this, I actually discovered a bug that runs a lot deeper. What's happening in Hotspot is that the thread and stacktrace constant data are being cached and only written out when the OldObjectSample events get emitted. OldObjectSample events are special because they are only emitted at the end of a recording, not every chunk rotation. However, in SVM we are serializing the thread and stacktrace data normally like we do for any other event. This is bad because the constant data may get written out too early in chunk1 while the recording later finishes in chunk2. Thus separating the constant and event data. We probably didn't catch this before in tests because test recordings are generally only 1 chunk long.

The fix for this will likely not be trivial. I am happy to work on it but it might be best to do it in a separate PR. I can open a bug report for it too. What do you think?

@christianhaeubl
Copy link
Member

Very good catch, thanks for checking! Yes, please open a bug report so that this problem is documented.

It would be great if you could look into that. Regarding PRs, please do the fix and the @BasedOnJDKFile annotations in a single PR (code shouldn't have @BasedOnJDKFile annotations as long as it is out-of-sync with HotSpot).

@roberttoyonaga roberttoyonaga marked this pull request as draft December 5, 2024 14:14
@roberttoyonaga
Copy link
Collaborator Author

Very good catch, thanks for checking! Yes, please open a bug report so that this problem is documented.

It would be great if you could look into that. Regarding PRs, please do the fix and the @BasedOnJDKFile annotations in a single PR (code shouldn't have @BasedOnJDKFile annotations as long as it is out-of-sync with HotSpot).

Ok I'll add all the changes to this one PR. In the meantime, I've set this PR to "draft" and I'll open a bug report.

@roberttoyonaga roberttoyonaga force-pushed the oldobject-BasedOnJDKFile branch from c8cdc26 to e5bcf34 Compare January 22, 2025 16:21
@roberttoyonaga roberttoyonaga force-pushed the oldobject-BasedOnJDKFile branch from e5bcf34 to 692a0be Compare January 22, 2025 16:25
@roberttoyonaga roberttoyonaga marked this pull request as ready for review January 22, 2025 17:03
@roberttoyonaga
Copy link
Collaborator Author

roberttoyonaga commented Jan 22, 2025

Hi @christianhaeubl. I have updated this PR with a fix for the bug we talked about. I've tried to reuse as much of the existing SubstrateVM JFR infrastructure as possible. My general approach is to cache the thread and stacktrace data within the sample instances themselves so that they can later be serialized at the time OldObjectSample events are emitted.

I've updated the existing tests and added a new test to catch this type of problem in the future.

Hotspot handles stacktrace and thread data slightly differently. Stacktraces are deduplicated at the time of sampling using a second stacktrace repo dedicated to the leak profiler. Then at the time of chunk rotation, the stacktraces are retrieved, serialized, and cached in the old object samples. Multiple samples may reference the same cached stacktrace so a reference count is kept. If all the samples referencing a serialized stacktrace are evicted from the queue, the stacktrace is freed. Finally, at the time of event emission, the serialized stacktraces are retrieved from the surviving samples and written to the chunk. The goal is to (1) provide deduplication and (2) only keep stacktraces that correspond to surviving samples. I think my implementation accomplishes both goals while still allowing us to reuse most of the existing SubstrateVM JFR code. We can avoid introducing a second stacktrace repo that functions differently and avoid creating a caching/referencing system. The main practical difference of my implementation is that we delay deduplication and serialization until the time of event emission. I think this is okay, since we are bounded by the number of elements in the queue. The JFR CPU sampler also delays deduplication too, so this is not so different.

Hotspot handles thread data similarly to stacktraces. It serializes it upfront and caches it in the sample. I decided it was better to instead delay serialization until the epoch that the events get emitted. This lets us reuse the JfrThreadRepository without any changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
native-image native-image-jfr OCA Verified All contributors have signed the Oracle Contributor Agreement. redhat-interest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants