[pull] dev from jemalloc:dev #103

pull · 2023-06-23T19:10:56Z

See Commits and Changes for more details.

Can you help keep this open source service alive? 💖 Please sponsor : )

Fix or suppress the remaining warnings generated by static analysis. This is a necessary step before we can incorporate static analysis into CI. Where possible, I've preferred to modify the code itself instead of just disabling the warning with a magic comment, so that if we decide to use different static analysis tools in the future we will be covered against them raising similar warnings.

Now that all of the various issues that static analysis uncovered have been fixed (#2431, #2432, #2433, #2436, #2437, #2446), I've added a GitHub action which will run static analysis for every PR going forward. When static analysis detects issues with your code, the GitHub action provides a link to download its findings in a form tailored for human consumption. Take a look at [this demonstration of what it looks like when static analysis issues are found](https://github.com/Svetlitski/jemalloc/actions/runs/5010245602) on my fork for an example (make sure to follow the instructions in the error message to download and inspect the results).

Additionally, added a GitHub Action to ensure no more trailing whitespace will creep in again in the future. I'm excluding Markdown files from this check, since trailing whitespace is significant there, and also excluding `build-aux/install-sh` because there is significant trailing whitespace on the line that sets `defaultIFS`.

It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.

We have observed new workload patterns (namely ML training type) that cycle through oversized allocations frequently, because 1) the dataset might be sparse which is faster to go through, and 2) GPU accelerated. As a result, the eager purging from the oversize arena becomes a bottleneck. To offer an easy solution, allow normal purging of the oversized extents when background threads are enabled.

Previously, small allocations which were sampled as part of heap profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory usage becomes problematic when the page size is increased, as noted in #2358. Small allocations are now rounded up to the nearest multiple of `PAGE` instead, reducing the memory overhead by a factor of 4 in the most extreme cases.

Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.

For better or worse, Jemalloc has a significant number of global variables. Making all eligible global variables `static` and/or `const` at least makes it slightly easier to reason about them, as these qualifications communicate to the programmer restrictions on their use without having to `grep` the whole codebase.

Adding `-Wstrict-prototypes` to the default `CFLAGS` in PR #2473 had the non-obvious side-effect of breaking configure-time feature detection, because the [test-program `autoconf` generates for feature detection](https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Generating-Sources.html#:~:text=main%20()) defines `main` as: ```c int main() ``` Which causes all feature checks to fail, since this triggers `-Wstrict-prototypes` and the feature checks use `-Werror`. Resolved by only adding `-Wstrict-prototypes` to `EXTRA_{CFLAGS,CXXFLAGS}` in CI, since these flags are not used during feature detection and we control which compiler is used.

For the sake of consistency, function definitions and their corresponding declarations should use the same names for parameters. I've enabled this check in static analysis to prevent this issue from occurring again in the future.

When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.

This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.

At least for LLVM, [casting from an integer to a pointer hides provenance information](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) and inhibits optimizations. Here's a [Godbolt link](https://godbolt.org/z/5bYPcKoWT) showing how this change removes a couple unnecessary branches in `phn_merge_siblings`, which is a very hot function. Canary profiles show only minor improvements (since most of the cost of this function is in cache misses), but there's no reason we shouldn't take it.

This is a prerequisite to achieving self-contained headers. Previously, the various tsd implementation headers (`tsd_generic.h`, `tsd_tls.h`, `tsd_malloc_thread_cleanup.h`, and `tsd_win.h`) relied implicitly on being included in `tsd.h` after a variety of dependencies had been defined above them. This commit instead makes these dependencies explicit by splitting them out into a separate file, `tsd_internals.h`, which each of the tsd implementation headers includes directly.

Header files are now self-contained, which makes the relationships between the files clearer, and crucially allows LSP tools like `clangd` to function correctly in all of our header files. I have verified that the headers are self-contained (aside from the various Windows shims) by compiling them as if they were C files – in a follow-up commit I plan to add this to CI to ensure we don't regress on this front.

[N2699 - Sized Memory Deallocation](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm) introduced two new functions which were incorporated into the C23 standard, `free_sized` and `free_aligned_sized`. Both already have analogues in Jemalloc, all we are doing here is adding the appropriate wrappers.

These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.

This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.

Following from PR #2481, we replace all integer-to-pointer casts [which hide pointer provenance information (and thus inhibit optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html) with equivalent operations that preserve this information. I have enabled the corresponding clang-tidy check in our static analysis CI so that we do not get bitten by this again in the future.

In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.

While this function isn't particularly hot, (accounting for just 0.27% of time spent inside the allocator on average across the fleet), looking at the generated assembly and performance profiles does show we're dispatching to multiple different `memset`s when we could instead be just tail-calling `memset` once, reducing code size and marginally improving performance.

…l_overrides.h`

…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.

Previously the option causing trouble will not be printed, unless the option key:value pair format is found.

Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().

Milliseconds are used a lot in hpa, so it is convenient to have `nstime_ms_since` function instead of dividing to `MILLION` constantly. For consistency renamed `nstime_msec` to `nstime_ms` as `ms` abbreviation is used much more commonly across codebase than `msec`. ``` $ grep -Rn '_msec' include src | wc -l 2 $ grep -RPn '_ms( |,|:)' include src | wc -l 72 ``` Function `nstime_msec` wasn't used anywhere in the code yet.

Make multiple functions from `stats_arena_hpa_shard_print` for readability and ease of change in the future.

Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort synchronous collapse of the native pages mapped by the memory range into transparent huge pages. Synchronous hugification might be beneficial for at least two reasons: we are not relying on khugepaged anymore and get an instant feedback if range wasn't hugified. If `hpa_hugify_sync` option is on, we'll try to perform synchronously collapse and if it wasn't successful, we'll fallback to asynchronous behaviour.

Config validation was introduced at 3aae792 with main intention to fix infinite purging loop, but it didn't actually fix the underlying problem, just masked it. Later 47d69b4 was merged to address the same problem. Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different application dimensions: `hpa_dirty_mult` applied to active memory on the shard, but `hpa_hugification_threshold` is a threshold for single pageslab (hugepage). It doesn't make much sense to sum them up together. While it is true that too high value of `hpa_dirty_mult` and too low value of `hpa_hugification_threshold` can lead to pathological behaviour, it is true for other options as well. Poor configurations might lead to suboptimal and sometimes completely unacceptable behaviour and that's OK, that is exactly the reason why they are called poor. There are other mechanism exist to prevent extreme behaviour, when we hugified and then immediately purged page, see `hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly this case. Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is too tight and prevents a lot of valid configurations.

When evaluating changes in HPA logic, it is useful to know internal `hpa_shard` state. Great deal of this state is `psset`. Some of the `psset` stats was available, but in disaggregated form, which is not very convenient. This commit exposed `psset` counters to `mallctl` and malloc stats dumps. Example of how malloc stats dump will look like after the change. HPA shard stats: Pageslabs: 14899 (4354 huge, 10545 nonhuge) Active pages: 6708166 (2228917 huge, 4479249 nonhuge) Dirty pages: 233816 (331 huge, 233485 nonhuge) Retained pages: 686306 Purge passes: 8730 (10 / sec) Purges: 127501 (146 / sec) Hugeifies: 4358 (5 / sec) Dehugifies: 4 (0 / sec) Pageslabs, active pages, dirty pages and retained pages are rows added by this change.

We are trying to create `ncpus * 2` threads for this test and place them into `VARIABLE_ARRAY`, but `VARIABLE_ARRAY` can not be more than `VARIABLE_ARRAY_SIZE_MAX` bytes. When there are a lot of threads on the box test always fails. ``` $ nproc 176 $ make -j`nproc` tests_unit && ./test/unit/retained <jemalloc>: ../test/unit/retained.c:123: Failed assertion: "sizeof(thd_t) * (nthreads) <= VARIABLE_ARRAY_SIZE_MAX" Aborted (core dumped) ``` There is no need for high concurrency for this test as we are only checking stats there and it's behaviour is quite stable regarding number of allocating threads. Limited number of threads to 16 to save compute resources (on CI for example) and reduce tests running time. Before the change (`nproc` is 80 on this box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.372s user 0m14.236s sys 0m12.338s ``` After the change (same box). ``` $ make -j`nproc` tests_unit && time ./test/unit/retained <...> real 0m0.018s user 0m0.108s sys 0m0.068s ```

Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.

The gettid() function is available on Linux in glibc only since version 2.30. There are supported distributions that still use older glibc version. Thus add a configure check if the gettid() function is available and extend the check in src/prof_stack_range.c so it's skipped also when gettid() isn't available. Fixes: #2740

`final[3]` is `uint8_t`. Integer conversion rank of `uint8_t` is lower than integer conversion rank of `int`, so `uint8_t` got promoted to `int`, which is signed integer type. Shift `final[3]` value left on 24, when leftmost bit is set overflows `int` and it is undefined behaviour. Before this change Undefined Behaviour Sanitizer was unhappy about it with the following message. ``` ../test/unit/hash.c:119:25: runtime error: left shift of 176 by 24 places cannot be represented in type 'int' ``` After this commit problem is gone.

@nc7s

Reported by @nc7s.

We tried to load `g` from `bitmap[i]` before checking it is actually a valid load. Tweaked a loop a bit to `break` early, when we are done scanning for bits. Before this commit undefined behaviour sanitizer from GCC 14+ was unhappy at `test/unit/bitmap` test with following error. ``` ../include/jemalloc/internal/bitmap.h:293:5: runtime error: load of address 0x7bb1c2e08008 with insufficient space for an object of type 'const bitmap_t' <...> #0 0x62671a149954 in bitmap_ffu ../include/jemalloc/internal/bitmap.h:293 #1 0x62671a149954 in test_bitmap_xfu_body ../test/unit/bitmap.c:275 #2 0x62671a14b767 in test_bitmap_xfu ../test/unit/bitmap.c:323 #3 0x62671a376ad1 in p_test_impl ../test/src/test.c:149 #4 0x62671a377135 in p_test ../test/src/test.c:200 #5 0x62671a13da06 in main ../test/unit/bitmap.c:336 <...> ```

…ve, `prof_threshold` is intended to be an always-supported allocation callback with much less overhead. The usage of the threshold allows performance critical callers to change program execution based on the callback: e.g. drop caches when memory becomes high or to predict the program is about to OOM ahead of time using peak memory watermarks.

This commit allows to enable sanitizers with autoconf options, instead of modifying `CFLAGS`, `CXXFLAGS` and `LDFLAGS` directly. * `--enable-tsan` option to enable Thread Sanitizer. * `--enable-ubsan` option to enable Undefined Behaviour Sanitizer. End goal is to speedup development by finding problems quickly, early and easier. Eventually, when all current issues will be fixed, we can enable sanitizers in CI. Fortunately, there are not a lot of problems we need to fix. Address Sanitizer is a bit controversial, because it replaces memory allocator, so we decided to left it out for a while. Below are couple of examples of how tests look like under different sanitizers at the moment. ``` $ ../configure --enable-tsan --enable-debug <...> asan : 0 tsan : 1 ubsan : 0 $ make -j`nproc` check <...> Thread T13 (tid=332043, running) created by main thread at: #0 pthread_create <null> (libtsan.so.0+0x61748) #1 thd_create ../test/src/thd.c:25 (bin_batching+0x5631ca) #2 stress_run ../test/unit/bin_batching.c:148 (bin_batching+0x40364c) #3 test_races ../test/unit/bin_batching.c:249 (bin_batching+0x403d79) #4 p_test_impl ../test/src/test.c:149 (bin_batching+0x562811) #5 p_test_no_reentrancy ../test/src/test.c:213 (bin_batching+0x562d35) #6 main ../test/unit/bin_batching.c:268 (bin_batching+0x40417e) SUMMARY: ThreadSanitizer: data race ../include/jemalloc/internal/edata.h:498 in edata_nfree_inc ``` ``` $ ../configure --enable-ubsan --enable-debug <...> asan : 0 tsan : 0 ubsan : 1 $ make -j`nproc` check <...> === test/unit/hash === ../test/unit/hash.c:119:16: runtime error: left shift of 176 by 24 places cannot be represented in type 'int' <...> ```

This adds a new autoconf flag, --disable-user-config, which disables reading the configuration from /etc/malloc.conf or the MALLOC_CONF environment variable. This can be useful when integrating jemalloc in a binary that internally handles all aspects of the configuration and shouldn't be impacted by ambient change in the environment.

Before this commit we had two age counters: one global in HPA central and one local in each HPA shard. We used HPA shard counter, when we are reused empty pageslab and HPA central counter anywhere else. They suppose to be comparable, because we use them for allocation placement decisions, but in reality they are not, there is no ordering guarantees between them. At the moment, there is no way for pageslab to migrate between HPA shards, so we don't actually need HPA central age counter.

Arena 0 have a dedicated initialization path, which differs from initialization path of other arenas. The main difference for the purpose of this change is that we initialize arena 0 before we initialize background threads. HPA shard options have `deferral_allowed` flag which should be equal to `background_thread_enabled()` return value, but it wasn't the case before this change, because for arena 0 `background_thread_enabled()` was initialized correctly after arena 0 initialization phase already ended. Below is initialization sequence for arena 0 after this commit to illustrate everything still should be initialized correctly. * `hpa_central_init` initializes HPA Central, before we initialize every HPA shard (including arena's 0). * `background_thread_boot1` initializes `background_thread_enabled()` return value. * `pa_shard_enable_hpa` initializes arena 0 HPA shard. ``` malloc_init_hard ------------- / / \ / / \ / / \ malloc_init_hard_a0_locked background_thread_boot1 pa_shard_enable_hpa / / \ / / \ / / \ arena_boot background_thread_enabled_seta hpa_shard_init | | pa_central_init | | hpa_central_init ```

Svetlitski added 3 commits June 23, 2023 11:50

pull bot added the ⤵️ pull label Jun 23, 2023

Svetlitski and others added 26 commits June 23, 2023 14:30

Fix downloading LLVM in GitHub Action

46e464a

It turns out LLVM does not include a build for every platform in the assets for every release, just some of them. As such, I've pinned us to the latest release version with a corresponding build.

Address compiler warnings in the unit tests

e133870

Add a test-case for small profiled allocations

ebd7e99

Validate that small allocations (i.e. those with `size <= SC_SMALL_MAXCLASS`) which are sampled for profiling maintain the expected invariants even though they now take up less space.

Enabled -Wstrict-prototypes and fixed warnings.

602edd7

Remove unreachable code.

e249d1a

Print test error messages in color when stderr is a terminal

65d3b59

When stderr is a terminal and supports color, print error messages from tests in red to make them stand out from the surrounding output.

Print the failed assertion before aborting in test cases

314c073

This makes it faster and easier to debug, so that you don't need to fire up a debugger just to see which assertion triggered in a failing test.

Suppress verbose frame address warnings

c49c17f

These warnings are not useful, and make the output of some CI jobs enormous and difficult to read, so let's suppress them.

Define PROF_TCTX_SENTINEL instead of using magic numbers

7e54dd1

This makes the code more readable on its own, and also sets the stage for more cleanly handling the pointer provenance lints in a following commit.

Define SBRK_INVALID instead of using a magic number

1431153

Remove vestigial TCACHE_STATE_* macros

4827bb1

Remove errant #includes in public jemalloc.h header

8ff7e7d

In an attempt to make all headers self-contained, I inadvertently added `#include`s which refer to intermediate, generated headers that aren't included in the final install. Closes #2489.

Add an override for the compile-time malloc_conf to `jemalloc_interna…

b01d496

…l_overrides.h`

Ensured sampled allocations are properly deallocated during `arena_re…

62648c8

…set` Sampled allocations were not being demoted before being deallocated during an `arena_reset` operation.

Include the unrecognized malloc conf option in the error message.

6816b23

Previously the option causing trouble will not be printed, unless the option key:value pair format is found.

interwq and others added 30 commits October 10, 2024 16:41

Update the configure cache file example in INSTALL.md

02251c0

Updated jeprof with more symbols to filter.

397827a

Add support for clock_gettime_nsec_np()

6d625d5

Prefer clock_gettime_nsec_np(CLOCK_UPTIME_RAW) to mach_absolute_time().

Fix the sized-dealloc safety check abort msg.

2a693b8

Split stats_arena_hpa_shard_print function

b82333f

Make multiple functions from `stats_arena_hpa_shard_print` for readability and ease of change in the future.

Move je_cv_thp logic closer to definition

a361e88

Fix ehooks assertion for arena creation

6786934

Enable large hugepage tests for arm64 on Travis

a17385a

Disable psset test when hugepage size is too large.

587676f

Remove unreachable() macro as c23 already defines it.

d8486b2

Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478 This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.

Conditionally remove unreachable for C23+

4b88bdd

Fix the gettid() detection caught by @mrluanma .

20cc983

Check for 0 input when setting max_background_thread through mallctl.

607b866

Reported by @nc7s.

Fix test/unit/prof_threshold when !config_stats

1abeae9

Avoid implicit conversion in test/unit/prof_threshold

3bc89cf

Remove unsupported Cirrus CI config

f55e0c3

Fix profiling sample metadata lookup during xallocx

ac279d7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] dev from jemalloc:dev #103

[pull] dev from jemalloc:dev #103

pull bot commented Jun 23, 2023 •

edited

Loading

[pull] dev from jemalloc:dev #103

Are you sure you want to change the base?

[pull] dev from jemalloc:dev #103

Conversation

pull bot commented Jun 23, 2023 • edited Loading

pull bot commented Jun 23, 2023 •

edited

Loading