[pull] dev from jemalloc:dev #103
Open: pull wants to merge 176 commits into jamlee-t:dev from jemalloc:dev
Fix or suppress the remaining warnings generated by static analysis. This is a necessary step before we can incorporate static analysis into CI. Where possible, I've preferred to modify the code itself instead of just disabling the warning with a magic comment, so that if we decide to use different static analysis tools in the future we will be covered against them raising similar warnings.
Now that all of the various issues that static analysis uncovered have been fixed (#2431, #2432, #2433, #2436, #2437, #2446), I've added a GitHub action which will run static analysis for every PR going forward. When static analysis detects issues with your code, the GitHub action provides a link to download its findings in a form tailored for human consumption. Take a look at [this demonstration of what it looks like when static analysis issues are found](https://github.com/Svetlitski/jemalloc/actions/runs/5010245602) on my fork for an example (make sure to follow the instructions in the error message to download and inspect the results).
Additionally, added a GitHub Action to ensure trailing whitespace does not creep back in. I'm excluding Markdown files from this check, since trailing whitespace is significant there, and also excluding `build-aux/install-sh` because there is significant trailing whitespace on the line that sets `defaultIFS`.
It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.
We have observed new workload patterns (namely ML training) that cycle through oversized allocations frequently, because 1) the dataset might be sparse, which is faster to go through, and 2) the workload is GPU-accelerated. As a result, the eager purging from the oversize arena becomes a bottleneck. To offer an easy solution, allow normal purging of the oversized extents when background threads are enabled.
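For context, a minimal sketch of enabling background threads at runtime via the existing `background_thread` mallctl, which is the condition under which oversized extents now receive normal (deferred) purging rather than eager purging:
```c
#include <jemalloc/jemalloc.h>
#include <stdbool.h>

int
main(void) {
	bool enable = true;
	/* "background_thread" is an existing writable mallctl; turning it on
	 * starts the background threads that perform deferred purging. */
	if (mallctl("background_thread", NULL, NULL, &enable,
	    sizeof(enable)) != 0) {
		return 1;
	}
	return 0;
}
```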
Previously, small allocations which were sampled as part of heap profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory usage becomes problematic when the page size is increased, as noted in #2358. Small allocations are now rounded up to the nearest multiple of `PAGE` instead, reducing the memory overhead by a factor of 4 in the most extreme cases.
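A minimal sketch of the new rounding, assuming a 4 KiB page for illustration (in jemalloc, `PAGE` comes from the configured page size; this is not the literal patch):
```c
#include <stddef.h>

#define PAGE ((size_t)4096)	/* assumed 4 KiB page for illustration */

/* Usable size charged to a sampled small allocation. */
static size_t
sampled_usize(size_t usize) {
	/* Old behavior: always promote to SC_LARGE_MINCLASS.
	 * New behavior: round up to the next page boundary. */
	return (usize + PAGE - 1) & ~(PAGE - 1);
}
```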
Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.
For better or worse, Jemalloc has a significant number of global variables. Making all eligible global variables `static` and/or `const` at least makes it slightly easier to reason about them, as these qualifications communicate to the programmer restrictions on their use without having to `grep` the whole codebase.
Adding `-Wstrict-prototypes` to the default `CFLAGS` in PR #2473 had the non-obvious side-effect of breaking configure-time feature detection, because the [test-program `autoconf` generates for feature detection](https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Generating-Sources.html#:~:text=main%20()) defines `main` as:
```c
int main()
```
which causes all feature checks to fail, since this triggers `-Wstrict-prototypes` and the feature checks use `-Werror`. Resolved by only adding `-Wstrict-prototypes` to `EXTRA_{CFLAGS,CXXFLAGS}` in CI, since these flags are not used during feature detection and we control which compiler is used.
For the sake of consistency, function definitions and their corresponding declarations should use the same names for parameters. I've enabled this check in static analysis to prevent this issue from occurring again in the future.
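A hypothetical example of the kind of mismatch this check flags (the names below are invented for illustration):
```c
#include <stddef.h>

/* Declaration (e.g. in a header): */
void buf_fill(char *buf, size_t buf_len);

/* Definition: flagged, because the parameter names differ from the
 * declaration above. */
void
buf_fill(char *out, size_t n) {
	for (size_t i = 0; i < n; i++) {
		out[i] = 0;
	}
}
```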
When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.
This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.
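A minimal sketch of the approach, assuming a simple `isatty`/`$TERM` check and ANSI escape codes (not the exact test-harness code):
```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
print_test_error(const char *msg) {
	const char *term = getenv("TERM");
	bool color = isatty(STDERR_FILENO) && term != NULL &&
	    strcmp(term, "dumb") != 0;
	if (color) {
		/* Wrap the message in ANSI red, then reset. */
		fprintf(stderr, "\033[31m%s\033[0m\n", msg);
	} else {
		fprintf(stderr, "%s\n", msg);
	}
}
```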
At least for LLVM, [casting from an integer to a pointer hides provenance information](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) and inhibits optimizations. Here's a [Godbolt link](https://godbolt.org/z/5bYPcKoWT) showing how this change removes a couple unnecessary branches in `phn_merge_siblings`, which is a very hot function. Canary profiles show only minor improvements (since most of the cost of this function is in cache misses), but there's no reason we shouldn't take it.
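An illustrative sketch of the kind of rewrite involved (hypothetical helpers, not the actual pairing-heap code): clear a tag bit by adjusting the original pointer rather than round-tripping through an integer, so the compiler retains provenance information.
```c
#include <stdint.h>

/* Before: the result is derived from an integer cast, hiding provenance. */
static inline void *
untag_int_to_ptr(void *ptr) {
	return (void *)((uintptr_t)ptr & ~(uintptr_t)1);
}

/* After: the result is pointer arithmetic on the original pointer. */
static inline void *
untag_keep_provenance(void *ptr) {
	return (void *)((char *)ptr - ((uintptr_t)ptr & 1));
}
```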
This is a prerequisite to achieving self-contained headers. Previously, the various tsd implementation headers (`tsd_generic.h`, `tsd_tls.h`, `tsd_malloc_thread_cleanup.h`, and `tsd_win.h`) relied implicitly on being included in `tsd.h` after a variety of dependencies had been defined above them. This commit instead makes these dependencies explicit by splitting them out into a separate file, `tsd_internals.h`, which each of the tsd implementation headers includes directly.
Header files are now self-contained, which makes the relationships between the files clearer, and crucially allows LSP tools like `clangd` to function correctly in all of our header files. I have verified that the headers are self-contained (aside from the various Windows shims) by compiling them as if they were C files – in a follow-up commit I plan to add this to CI to ensure we don't regress on this front.
[N2699 - Sized Memory Deallocation](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm) introduced two new functions which were incorporated into the C23 standard, `free_sized` and `free_aligned_sized`. Both already have analogues in Jemalloc; all we are doing here is adding the appropriate wrappers.
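A minimal sketch of how such wrappers can map onto jemalloc's existing sized-deallocation entry point, `sdallocx()` (illustrative only, not necessarily the exact code added here):
```c
#include <jemalloc/jemalloc.h>
#include <stddef.h>

/* C23 free_sized: size must match the requested allocation size. */
void
free_sized(void *ptr, size_t size) {
	sdallocx(ptr, size, 0);
}

/* C23 free_aligned_sized: additionally encodes the original alignment. */
void
free_aligned_sized(void *ptr, size_t alignment, size_t size) {
	sdallocx(ptr, size, MALLOCX_ALIGN(alignment));
}
```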
These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.
This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.
Following from PR #2481, we replace all integer-to-pointer casts [which hide pointer provenance information (and thus inhibit optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) with equivalent operations that preserve this information. I have enabled the corresponding clang-tidy check in our static analysis CI so that we do not get bitten by this again in the future.
In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.
While this function isn't particularly hot (accounting for just 0.27% of time spent inside the allocator on average across the fleet), looking at the generated assembly and performance profiles does show we're dispatching to multiple different `memset`s when we could instead be just tail-calling `memset` once, reducing code size and marginally improving performance.
…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.
Previously, the option causing trouble would not be printed unless it was found in the option key:value pair format.
Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().
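For reference, a small macOS-only sketch of the substitution (the helper name is an assumption for illustration):
```c
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. Unlike mach_absolute_time(), no
 * mach_timebase_info() scaling step is needed: the result is already
 * expressed in nanoseconds. */
static uint64_t
monotonic_ns(void) {
	return clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
}
```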
Milliseconds are used a lot in hpa, so it is convenient to have an `nstime_ms_since` function instead of dividing by `MILLION` constantly. For consistency, renamed `nstime_msec` to `nstime_ms`, as the `ms` abbreviation is used much more commonly across the codebase than `msec`.
```
$ grep -Rn '_msec' include src | wc -l
2
$ grep -RPn '_ms( |,|:)' include src | wc -l
72
```
The `nstime_msec` function wasn't used anywhere in the code yet.
Split `stats_arena_hpa_shard_print` into multiple functions for readability and to ease future changes.
Linux 6.1 introduced the `MADV_COLLAPSE` flag to perform a best-effort synchronous collapse of the native pages mapped by the memory range into transparent huge pages. Synchronous hugification might be beneficial for at least two reasons: we no longer rely on khugepaged, and we get instant feedback if the range wasn't hugified. If the `hpa_hugify_sync` option is on, we'll try to perform a synchronous collapse, and if it isn't successful, we'll fall back to the asynchronous behaviour.
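A minimal sketch of that strategy (hypothetical helper, assuming Linux 6.1+ headers define `MADV_COLLAPSE`):
```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

static bool
hugify(void *addr, size_t size, bool try_sync) {
#ifdef MADV_COLLAPSE
	if (try_sync && madvise(addr, size, MADV_COLLAPSE) == 0) {
		/* Collapse succeeded; the range is huge-mapped right now. */
		return true;
	}
#endif
	/* Asynchronous fallback: khugepaged may collapse the range later. */
	return madvise(addr, size, MADV_HUGEPAGE) == 0;
}
```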
Config validation was introduced at 3aae792 with the main intention of fixing an infinite purging loop, but it didn't actually fix the underlying problem, just masked it. Later 47d69b4 was merged to address the same problem. The options `hpa_dirty_mult` and `hpa_hugification_threshold` have different application dimensions: `hpa_dirty_mult` is applied to active memory on the shard, while `hpa_hugification_threshold` is a threshold for a single pageslab (hugepage). It doesn't make much sense to sum them up. While it is true that a too-high value of `hpa_dirty_mult` combined with a too-low value of `hpa_hugification_threshold` can lead to pathological behaviour, the same is true for other options as well. Poor configurations can lead to suboptimal and sometimes completely unacceptable behaviour, and that's OK; that is exactly why they are called poor. Other mechanisms exist to prevent extreme behaviour where we hugify and then immediately purge a page; see the `hpa_hugify_blocked_by_ndirty` function, which exists to prevent exactly this case. Lastly, the `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is too tight and prevents a lot of valid configurations.
When evaluating changes in HPA logic, it is useful to know the internal `hpa_shard` state. A great deal of this state is the `psset`. Some of the `psset` stats were available, but in disaggregated form, which is not very convenient. This commit exposes `psset` counters to `mallctl` and malloc stats dumps. Here is an example of what the malloc stats dump looks like after the change:
```
HPA shard stats:
  Pageslabs: 14899 (4354 huge, 10545 nonhuge)
  Active pages: 6708166 (2228917 huge, 4479249 nonhuge)
  Dirty pages: 233816 (331 huge, 233485 nonhuge)
  Retained pages: 686306
  Purge passes: 8730 (10 / sec)
  Purges: 127501 (146 / sec)
  Hugeifies: 4358 (5 / sec)
  Dehugifies: 4 (0 / sec)
```
Pageslabs, active pages, dirty pages and retained pages are the rows added by this change.
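For illustration, one way to read such counters programmatically through `mallctl` (the exact stat name below is an assumption, not necessarily one of the names this commit adds):
```c
#include <jemalloc/jemalloc.h>
#include <stdint.h>
#include <stdio.h>

int
main(void) {
	/* Refresh cached stats before reading them. */
	uint64_t epoch = 1;
	size_t esz = sizeof(epoch);
	mallctl("epoch", &epoch, &esz, &epoch, esz);

	size_t npageslabs;
	size_t sz = sizeof(npageslabs);
	/* Hypothetical stat name, for illustration only. */
	if (mallctl("stats.arenas.0.hpa_shard.full_slabs.npageslabs_huge",
	    &npageslabs, &sz, NULL, 0) == 0) {
		printf("huge pageslabs (full slabs): %zu\n", npageslabs);
	}
	return 0;
}
```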
We are trying to create `ncpus * 2` threads for this test and place them into a `VARIABLE_ARRAY`, but a `VARIABLE_ARRAY` cannot be more than `VARIABLE_ARRAY_SIZE_MAX` bytes. When there are a lot of threads on the box, the test always fails.
```
$ nproc
176
$ make -j`nproc` tests_unit && ./test/unit/retained
<jemalloc>: ../test/unit/retained.c:123: Failed assertion: "sizeof(thd_t) * (nthreads) <= VARIABLE_ARRAY_SIZE_MAX"
Aborted (core dumped)
```
There is no need for high concurrency in this test, as we are only checking stats there and its behaviour is quite stable with respect to the number of allocating threads. Limited the number of threads to 16 to save compute resources (on CI, for example) and reduce test running time. Before the change (`nproc` is 80 on this box):
```
$ make -j`nproc` tests_unit && time ./test/unit/retained
<...>
real    0m0.372s
user    0m14.236s
sys     0m12.338s
```
After the change (same box):
```
$ make -j`nproc` tests_unit && time ./test/unit/retained
<...>
real    0m0.018s
user    0m0.108s
sys     0m0.068s
```
Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.
The gettid() function is available on Linux in glibc only since version 2.30. There are supported distributions that still use older glibc versions. Thus, add a configure check for whether the gettid() function is available, and extend the check in src/prof_stack_range.c so it's also skipped when gettid() isn't available. Fixes: #2740
`final[3]` is `uint8_t`. The integer conversion rank of `uint8_t` is lower than the integer conversion rank of `int`, so `uint8_t` gets promoted to `int`, which is a signed integer type. Shifting the `final[3]` value left by 24 when the leftmost bit is set overflows `int`, which is undefined behaviour. Before this change, Undefined Behaviour Sanitizer was unhappy about it with the following message:
```
../test/unit/hash.c:119:25: runtime error: left shift of 176 by 24 places cannot be represented in type 'int'
```
After this commit the problem is gone.
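A minimal illustration of the pattern and the fix (simplified, not the literal test code):
```c
#include <stdint.h>

static uint32_t
pack_high_byte(uint8_t b) {
	/* UB when b >= 0x80: b is promoted to (signed) int, and the shift
	 * by 24 overflows it.
	 *
	 *     return b << 24;
	 *
	 * Casting to uint32_t first makes the shift well-defined for all
	 * byte values. */
	return (uint32_t)b << 24;
}
```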
We tried to load `g` from `bitmap[i]` before checking that it is actually a valid load. Tweaked the loop a bit to `break` early when we are done scanning for bits. Before this commit, the undefined behaviour sanitizer from GCC 14+ was unhappy with the `test/unit/bitmap` test, with the following error:
```
../include/jemalloc/internal/bitmap.h:293:5: runtime error: load of address 0x7bb1c2e08008 with insufficient space for an object of type 'const bitmap_t'
<...>
    #0 0x62671a149954 in bitmap_ffu ../include/jemalloc/internal/bitmap.h:293
    #1 0x62671a149954 in test_bitmap_xfu_body ../test/unit/bitmap.c:275
    #2 0x62671a14b767 in test_bitmap_xfu ../test/unit/bitmap.c:323
    #3 0x62671a376ad1 in p_test_impl ../test/src/test.c:149
    #4 0x62671a377135 in p_test ../test/src/test.c:200
    #5 0x62671a13da06 in main ../test/unit/bitmap.c:336
    <...>
```
…ve, `prof_threshold` is intended to be an always-supported allocation callback with much lower overhead. The threshold allows performance-critical callers to change program execution based on the callback: e.g., dropping caches when memory usage becomes high, or predicting ahead of time that the program is about to OOM by using peak memory watermarks.
This commit allows enabling sanitizers with autoconf options, instead of modifying `CFLAGS`, `CXXFLAGS` and `LDFLAGS` directly.
* `--enable-tsan` option to enable Thread Sanitizer.
* `--enable-ubsan` option to enable Undefined Behaviour Sanitizer.
The end goal is to speed up development by finding problems quickly, early, and more easily. Eventually, when all current issues are fixed, we can enable sanitizers in CI. Fortunately, there are not a lot of problems we need to fix. Address Sanitizer is a bit controversial, because it replaces the memory allocator, so we decided to leave it out for now. Below are a couple of examples of what tests look like under different sanitizers at the moment.
```
$ ../configure --enable-tsan --enable-debug
<...>
asan : 0
tsan : 1
ubsan : 0
$ make -j`nproc` check
<...>
  Thread T13 (tid=332043, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x61748)
    #1 thd_create ../test/src/thd.c:25 (bin_batching+0x5631ca)
    #2 stress_run ../test/unit/bin_batching.c:148 (bin_batching+0x40364c)
    #3 test_races ../test/unit/bin_batching.c:249 (bin_batching+0x403d79)
    #4 p_test_impl ../test/src/test.c:149 (bin_batching+0x562811)
    #5 p_test_no_reentrancy ../test/src/test.c:213 (bin_batching+0x562d35)
    #6 main ../test/unit/bin_batching.c:268 (bin_batching+0x40417e)
SUMMARY: ThreadSanitizer: data race ../include/jemalloc/internal/edata.h:498 in edata_nfree_inc
```
```
$ ../configure --enable-ubsan --enable-debug
<...>
asan : 0
tsan : 0
ubsan : 1
$ make -j`nproc` check
<...>
=== test/unit/hash ===
../test/unit/hash.c:119:16: runtime error: left shift of 176 by 24 places cannot be represented in type 'int'
<...>
```
This adds a new autoconf flag, --disable-user-config, which disables reading the configuration from /etc/malloc.conf or the MALLOC_CONF environment variable. This can be useful when integrating jemalloc in a binary that internally handles all aspects of the configuration and shouldn't be impacted by ambient change in the environment.
Before this commit, we had two age counters: one global in HPA central and one local in each HPA shard. We used the HPA shard counter when reusing an empty pageslab, and the HPA central counter everywhere else. They are supposed to be comparable, because we use them for allocation placement decisions, but in reality they are not: there are no ordering guarantees between them. At the moment, there is no way for a pageslab to migrate between HPA shards, so we don't actually need the HPA central age counter.
Arena 0 has a dedicated initialization path which differs from the initialization path of other arenas. The main difference for the purpose of this change is that we initialize arena 0 before we initialize background threads. HPA shard options have a `deferral_allowed` flag which should be equal to the `background_thread_enabled()` return value, but that wasn't the case before this change, because for arena 0 the `background_thread_enabled()` value was only initialized correctly after the arena 0 initialization phase had already ended. Below is the initialization sequence for arena 0 after this commit, to illustrate that everything is still initialized correctly.
* `hpa_central_init` initializes HPA Central, before we initialize every HPA shard (including arena 0's).
* `background_thread_boot1` initializes the `background_thread_enabled()` return value.
* `pa_shard_enable_hpa` initializes arena 0's HPA shard.
```
                   malloc_init_hard
                  /       |        \
                 /        |         \
malloc_init_hard_a0_locked   background_thread_boot1   pa_shard_enable_hpa
        |                            |                         |
    arena_boot        background_thread_enabled_set      hpa_shard_init
        |
  pa_central_init
        |
  hpa_central_init
```
See Commits and Changes for more details.
Created by
pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )