diff --git a/.bazelrc b/.bazelrc
new file mode 100644
index 000000000..2b72b3bd9
--- /dev/null
+++ b/.bazelrc
@@ -0,0 +1,15 @@
+# Copyright 2019 The TCMalloc Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+build --cxxopt='-std=c++17'
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 000000000..d10cc0d08
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,74 @@
+# How to Contribute to TCMalloc
+
+We'd love to accept your patches and contributions to this project. There are
+just a few small guidelines you need to follow.
+
+NOTE: If you are new to GitHub, please start by reading [Pull Request
+howto](https://help.github.com/articles/about-pull-requests/)
+
+## Contributor License Agreement
+
+Contributions to this project must be accompanied by a Contributor License
+Agreement. You (or your employer) retain the copyright to your contribution;
+this simply gives us permission to use and redistribute your contributions as
+part of the project. Head over to <https://cla.developers.google.com/> to see
+your current agreements on file or to sign a new one.
+
+You generally only need to submit a CLA once, so if you've already submitted one
+(even if it was for a different project), you probably don't need to do it
+again.
+
+## Guidelines for Pull Requests
+
+* All submissions, including submissions by project members, require review.
+ We use GitHub pull requests for this purpose. Consult
+ [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+ information on using pull requests.
+
+* If you are a Googler, it is preferable to first create an internal CL and
+ have it reviewed and submitted. The code propagation process will deliver
+ the change to GitHub.
+
+* Create **small PRs** that are narrowly focused on **addressing a single concern**.
+ When PRs try to fix several things at a time, if only one fix is considered
+ acceptable, nothing gets merged, and both the author's and the reviewer's time is wasted.
+ Create more PRs to address different concerns and everyone will be happy.
+
+* Provide a good **PR description** as a record of **what** change is being
+ made and **why** it was made. Link to a GitHub issue if it exists.
+
+* Don't fix code style and formatting unless you are already changing that line
+ to address an issue. Formatting of modified lines may be done using
+ `git clang-format`. PRs with irrelevant changes won't be merged. If you do
+ want to fix formatting or style, do that in a separate PR.
+
+* Unless your PR is trivial, you should expect there will be reviewer comments
+ that you'll need to address before merging. We expect you to be reasonably
+ responsive to those comments, otherwise the PR will be closed after 2-3 weeks
+ of inactivity.
+
+* Maintain **clean commit history** and use **meaningful commit messages**.
+ PRs with messy commit history are difficult to review and won't be merged.
+ Use `rebase -i upstream/master` to curate your commit history and/or to
+ bring in latest changes from master (but avoid rebasing in the middle of a
+ code review).
+
+* Keep your PR up to date with upstream/master (if there are merge conflicts,
+ we can't really merge your change).
+
+* **All tests need to be passing** before your change can be merged. We
+ recommend you **run tests locally** (see below).
+
+* Exceptions to the rules can be made if there's a compelling reason for doing
+ so. That is - the rules are here to serve us, not the other way around, and
+ the rules need to be serving their intended purpose to be valuable.
+
+## TCMalloc Committers
+
+The current members of the TCMalloc engineering team are the only committers at
+present.
+
+## Community Guidelines
+
+This project follows
+[Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 000000000..62589edd1
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ https://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/README.md b/README.md
new file mode 100644
index 000000000..a8b0467f2
--- /dev/null
+++ b/README.md
@@ -0,0 +1,38 @@
+# TCMalloc
+
+This repository contains the TCMalloc C++ code.
+
+TCMalloc is Google's customized implementation of C's `malloc()` and C++'s
+`operator new` used for memory allocation within our C and C++ code. TCMalloc is
+a fast, multi-threaded malloc implementation.
+
+## Building TCMalloc
+
+[Bazel](https://bazel.build) is the official build system for TCMalloc.
+
+The [TCMalloc Platforms Guide](docs/platforms) contains information on platform
+support for TCMalloc.
+
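+A typical invocation, mirroring the scripts under `ci/` in this repository
+(shown here as a sketch; the exact flags and targets may vary for your
+environment), is:
+
+```shell
+# From the repository root: build and run all tests.
+bazel test ... --test_output=errors
+```
+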
+## Documentation
+
+All users of TCMalloc should consult the following documentation resources:
+
+* The [TCMalloc Overview](docs/overview) covers the basic architecture of
+ TCMalloc, and how that may affect configuration choices.
+* The [TCMalloc Reference](docs/reference) covers the C and C++ TCMalloc API
+ endpoints.
+
+More advanced users of TCMalloc may find the following documentation useful:
+
+* The [TCMalloc Tuning Guide](docs/tuning) covers the configuration choices in
+ more depth, and also illustrates other ways to customize TCMalloc.
+* The [TCMalloc Design Doc](docs/design) covers how TCMalloc works underneath
+ the hood, and why certain design choices were made. Most developers will not
+ need this level of implementation detail.
+
+## License
+
+The TCMalloc library is licensed under the terms of the Apache 2.0
+license. See LICENSE for more information.
+
+Disclaimer: This is not an officially supported Google product.
diff --git a/WORKSPACE b/WORKSPACE
new file mode 100644
index 000000000..3044365d4
--- /dev/null
+++ b/WORKSPACE
@@ -0,0 +1,51 @@
+# Copyright 2019 The TCMalloc Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+workspace(name = "com_google_tcmalloc")
+load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
+
+# Abseil
+http_archive(
+ name = "com_google_absl",
+ urls = ["https://github.com/abseil/abseil-cpp/archive/564001ae506a17c51fa1223684a78f05f91d3d91.zip"],
+ strip_prefix = "abseil-cpp-564001ae506a17c51fa1223684a78f05f91d3d91",
+ sha256 = "766ac184540dd24afc1542c30b8739e1490327e80738b5241bffb70b1005405c",
+)
+
+# GoogleTest/GoogleMock framework. Used by most unit-tests.
+http_archive(
+ name = "com_google_googletest",
+ urls = ["https://github.com/google/googletest/archive/d854bd6acc47f7f6e168007d58b5f509e4981b36.zip"],
+ strip_prefix = "googletest-d854bd6acc47f7f6e168007d58b5f509e4981b36",
+ sha256 = "5a3de3cb2141335255a850cc82be488aabefebca7d16abe15381bd93b6c48f9b",
+)
+
+# Google benchmark.
+http_archive(
+ name = "com_github_google_benchmark",
+ urls = ["https://github.com/google/benchmark/archive/16703ff83c1ae6d53e5155df3bb3ab0bc96083be.zip"],
+ strip_prefix = "benchmark-16703ff83c1ae6d53e5155df3bb3ab0bc96083be",
+ sha256 = "59f918c8ccd4d74b6ac43484467b500f1d64b40cc1010daa055375b322a43ba3",
+)
+
+# C++ rules for Bazel.
+http_archive(
+ name = "rules_cc",
+ urls = [
+ "https://mirror.bazel.build/github.com/bazelbuild/rules_cc/archive/7e650b11fe6d49f70f2ca7a1c4cb8bcc4a1fe239.zip",
+ "https://github.com/bazelbuild/rules_cc/archive/7e650b11fe6d49f70f2ca7a1c4cb8bcc4a1fe239.zip",
+ ],
+ strip_prefix = "rules_cc-7e650b11fe6d49f70f2ca7a1c4cb8bcc4a1fe239",
+ sha256 = "682a0ce1ccdac678d07df56a5f8cf0880fd7d9e08302b8f677b92db22e72052e",
+)
diff --git a/ci/linux_clang-latest_libcxx_bazel.sh b/ci/linux_clang-latest_libcxx_bazel.sh
new file mode 100755
index 000000000..e6ae1dfa5
--- /dev/null
+++ b/ci/linux_clang-latest_libcxx_bazel.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+#
+# Copyright 2019 The TCMalloc Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script can be invoked to test tcmalloc in a hermetic environment
+# using a Docker image on Linux. You must have Docker installed to use this
+# script.
+
+set -euox pipefail
+
+if [ -z ${TCMALLOC_ROOT:-} ]; then
+ TCMALLOC_ROOT="$(realpath $(dirname ${0})/..)"
+fi
+
+if [ -z ${STD:-} ]; then
+ STD="c++17"
+fi
+
+if [ -z ${COMPILATION_MODE:-} ]; then
+ COMPILATION_MODE="fastbuild opt"
+fi
+
+if [ -z ${EXCEPTIONS_MODE:-} ]; then
+ EXCEPTIONS_MODE="-fno-exceptions -fexceptions"
+fi
+
+readonly DOCKER_CONTAINER="gcr.io/google.com/absl-177019/linux_clang-latest:20191018"
+
+for std in ${STD}; do
+ for compilation_mode in ${COMPILATION_MODE}; do
+ for exceptions_mode in ${EXCEPTIONS_MODE}; do
+ echo "--------------------------------------------------------------------"
+ time docker run \
+ --volume="${TCMALLOC_ROOT}:/tcmalloc:ro" \
+ --workdir=/tcmalloc \
+ --cap-add=SYS_PTRACE \
+ --rm \
+ -e CC="/opt/llvm/clang/bin/clang" \
+ -e BAZEL_COMPILER="llvm" \
+ -e BAZEL_CXXOPTS="-std=${std}:-nostdinc++" \
+ -e BAZEL_LINKOPTS="-L/opt/llvm/libcxx/lib:-lc++:-lc++abi:-lm:-Wl,-rpath=/opt/llvm/libcxx/lib" \
+ -e CPLUS_INCLUDE_PATH="/opt/llvm/libcxx/include/c++/v1" \
+ ${DOCKER_EXTRA_ARGS:-} \
+ ${DOCKER_CONTAINER} \
+ /usr/local/bin/bazel test ... \
+ --compilation_mode="${compilation_mode}" \
+ --copt="${exceptions_mode}" \
+ --copt=-Werror \
+ --define="absl=1" \
+ --keep_going \
+ --show_timestamps \
+ --test_env="GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1" \
+ --test_output=errors \
+ --test_tag_filters=-benchmark \
+ ${BAZEL_EXTRA_ARGS:-}
+ done
+ done
+done
diff --git a/ci/linux_clang-latest_libstdcxx_bazel.sh b/ci/linux_clang-latest_libstdcxx_bazel.sh
new file mode 100755
index 000000000..b89df9409
--- /dev/null
+++ b/ci/linux_clang-latest_libstdcxx_bazel.sh
@@ -0,0 +1,82 @@
+#!/bin/bash
+#
+# Copyright 2019 The Abseil Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script can be invoked to test abseil-cpp in a hermetic environment
+# using a Docker image on Linux. You must have Docker installed to use this
+# script.
+
+set -euox pipefail
+
+if [ -z ${ABSEIL_ROOT:-} ]; then
+ ABSEIL_ROOT="$(realpath $(dirname ${0})/..)"
+fi
+
+if [ -z ${STD:-} ]; then
+ STD="c++17"
+fi
+
+if [ -z ${COMPILATION_MODE:-} ]; then
+ COMPILATION_MODE="fastbuild opt"
+fi
+
+if [ -z ${EXCEPTIONS_MODE:-} ]; then
+ EXCEPTIONS_MODE="-fno-exceptions -fexceptions"
+fi
+
+readonly DOCKER_CONTAINER="gcr.io/google.com/absl-177019/linux_clang-latest:20191018"
+
+# USE_BAZEL_CACHE=1 only works on Kokoro.
+# Without access to the credentials this won't work.
+if [ ${USE_BAZEL_CACHE:-0} -ne 0 ]; then
+ DOCKER_EXTRA_ARGS="--volume=${KOKORO_KEYSTORE_DIR}:/keystore:ro ${DOCKER_EXTRA_ARGS:-}"
+ # Bazel doesn't track changes to tools outside of the workspace
+ # (e.g. /usr/bin/gcc), so by appending the docker container to the
+ # remote_http_cache url, we make changes to the container part of
+ # the cache key. Hashing the key is to make it shorter and url-safe.
+ container_key=$(echo ${DOCKER_CONTAINER} | sha256sum | head -c 16)
+ BAZEL_EXTRA_ARGS="--remote_http_cache=https://storage.googleapis.com/absl-bazel-remote-cache/${container_key} --google_credentials=/keystore/73103_absl-bazel-remote-cache ${BAZEL_EXTRA_ARGS:-}"
+fi
+
+for std in ${STD}; do
+ for compilation_mode in ${COMPILATION_MODE}; do
+ for exceptions_mode in ${EXCEPTIONS_MODE}; do
+ echo "--------------------------------------------------------------------"
+ time docker run \
+ --volume="${ABSEIL_ROOT}:/abseil-cpp:ro" \
+ --workdir=/abseil-cpp \
+ --cap-add=SYS_PTRACE \
+ --rm \
+ -e CC="/opt/llvm/clang/bin/clang" \
+ -e BAZEL_COMPILER="llvm" \
+ -e BAZEL_CXXOPTS="-std=${std}" \
+ -e CPLUS_INCLUDE_PATH="/usr/include/c++/6" \
+ ${DOCKER_EXTRA_ARGS:-} \
+ ${DOCKER_CONTAINER} \
+ /usr/local/bin/bazel test ... \
+ --compilation_mode="${compilation_mode}" \
+ --copt="${exceptions_mode}" \
+ --copt=-Werror \
+ --define="absl=1" \
+ --keep_going \
+ --show_timestamps \
+ --test_env="GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1" \
+ --test_env="TZDIR=/abseil-cpp/absl/time/internal/cctz/testdata/zoneinfo" \
+ --test_output=errors \
+ --test_tag_filters=-benchmark \
+ ${BAZEL_EXTRA_ARGS:-}
+ done
+ done
+done
diff --git a/ci/linux_gcc-latest_libstdcxx_bazel.sh b/ci/linux_gcc-latest_libstdcxx_bazel.sh
new file mode 100755
index 000000000..27b709716
--- /dev/null
+++ b/ci/linux_gcc-latest_libstdcxx_bazel.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+#
+# Copyright 2019 The TCMalloc Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script can be invoked to test tcmalloc in a hermetic environment
+# using a Docker image on Linux. You must have Docker installed to use this
+# script.
+
+set -euox pipefail
+
+if [ -z ${TCMALLOC_ROOT:-} ]; then
+ TCMALLOC_ROOT="$(realpath $(dirname ${0})/..)"
+fi
+
+if [ -z ${STD:-} ]; then
+ STD="c++17"
+fi
+
+if [ -z ${COMPILATION_MODE:-} ]; then
+ COMPILATION_MODE="fastbuild opt"
+fi
+
+if [ -z ${EXCEPTIONS_MODE:-} ]; then
+ EXCEPTIONS_MODE="-fno-exceptions -fexceptions"
+fi
+
+readonly DOCKER_CONTAINER="gcr.io/google.com/absl-177019/linux_gcc-latest:20200106"
+
+for std in ${STD}; do
+ for compilation_mode in ${COMPILATION_MODE}; do
+ for exceptions_mode in ${EXCEPTIONS_MODE}; do
+ echo "--------------------------------------------------------------------"
+ time docker run \
+ --volume="${TCMALLOC_ROOT}:/tcmalloc:ro" \
+ --workdir=/tcmalloc \
+ --cap-add=SYS_PTRACE \
+ --rm \
+ -e CC="/usr/local/bin/gcc" \
+ -e BAZEL_CXXOPTS="-std=${std}" \
+ ${DOCKER_EXTRA_ARGS:-} \
+ ${DOCKER_CONTAINER} \
+ /usr/local/bin/bazel test ... \
+ --compilation_mode="${compilation_mode}" \
+ --copt="${exceptions_mode}" \
+ --copt=-Werror \
+ --define="absl=1" \
+ --keep_going \
+ --show_timestamps \
+ --test_env="GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1" \
+ --test_output=errors \
+ --test_tag_filters=-benchmark \
+ ${BAZEL_EXTRA_ARGS:-}
+ done
+ done
+done
diff --git a/docs/design.md b/docs/design.md
new file mode 100644
index 000000000..c49e6ed7c
--- /dev/null
+++ b/docs/design.md
@@ -0,0 +1,472 @@
+# TCMalloc: Thread-Caching Malloc
+
+
+
+## Motivation
+
+TCMalloc is a memory allocator designed as an alternative to the system default
+allocator that has the following characteristics:
+
+* Fast, uncontended allocation and deallocation for most objects. Objects are
+ cached, depending on mode, either per-thread or per-logical-CPU. Most
+ allocations do not need to take locks, so there is low contention and good
+ scaling for multi-threaded applications.
+* Flexible use of memory, so freed memory can be reused for different object
+ sizes, or returned to the OS.
+* Low per-object memory overhead by allocating "pages" of same-sized objects,
+ leading to a space-efficient representation of small objects.
+* Low-overhead sampling, enabling detailed insight into an application's memory
+ usage.
+
+## Usage
+
+You use TCMalloc by specifying it as the `malloc` attribute on your binary rules
+in Bazel.
+
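+For example, a sketch of a `cc_binary` rule that links in TCMalloc (the target
+label is an assumption based on the `WORKSPACE` name used by this repository;
+adjust it for how TCMalloc is mapped into your own workspace):
+
+```
+cc_binary(
+    name = "hello_main",
+    srcs = ["hello_main.cc"],
+    malloc = "@com_google_tcmalloc//tcmalloc",
+)
+```
+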
+## Overview
+
+The following block diagram shows the rough internal structure of TCMalloc:
+
+![Diagram of TCMalloc internal structure](images/tcmalloc_internals.png){.center}
+
+We can break TCMalloc into three components: the front-end, the middle-end, and
+the back-end. We will discuss these in more detail in the following sections. A
+rough breakdown of responsibilities is:
+
+* The front-end is a cache that provides fast allocation and deallocation of
+ memory to the application.
+* The middle-end is responsible for refilling the front-end cache.
+* The back-end handles fetching memory from the OS.
+
+Note that the front-end can be run in either per-CPU or legacy per-thread mode,
+and the back-end can support either the hugepage aware pageheap or the legacy
+pageheap.
+
+## The TCMalloc Front-end
+
+The front-end handles a request for memory of a particular size. The front-end
+has a cache of memory that it can use for allocation or to hold free memory.
+This cache is only accessible by a single thread at a time, so it does not
+require any locks, hence most allocations and deallocations are fast.
+
+The front-end will satisfy any request if it has cached memory of the
+appropriate size. If the cache for that particular size is empty, the front-end
+will request a batch of memory from the middle-end to refill the cache. The
+middle-end comprises the CentralFreeList and the TransferCache.
+
+If the middle-end is exhausted, or if the requested size is greater than the
+maximum size that the front-end caches, a request will go to the back-end to
+either satisfy the large allocation, or to refill the caches in the middle-end.
+The back-end is also referred to as the PageHeap.
+
+There are two implementations of the TCMalloc front-end:
+
+* Originally it supported per-thread caches of objects (hence the name Thread
+ Caching Malloc). However, this resulted in memory footprints that scaled
+ with the number of threads. Modern applications can have large thread
+ counts, which result in either large amounts of aggregate per-thread memory,
+ or many threads having minuscule per-thread caches.
+* More recently TCMalloc has supported per-CPU mode. In this mode each logical
+ CPU in the system has its own cache from which to allocate memory. Note: On
+ x86 a logical CPU is equivalent to a hyperthread.
+
+The differences between per-thread and per-CPU modes are entirely confined to
+the implementations of malloc/new and free/delete.
+
+## Small and Large Object Allocation
+
+Allocations of "small" objects are mapped onto one of
+[60-80 allocatable size-classes](https://github.com/google/tcmalloc/blob/master/tcmalloc/size_classes.cc).
+For example, an allocation of 12 bytes will get rounded up to the 16 byte
+size-class. The size-classes are designed to minimize the amount of memory that
+is wasted when rounding to the next largest size class.
+
+When compiled with `__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`, we use a set of
+sizes aligned to 8 bytes for raw storage allocated with `::operator new`. This
+smaller alignment minimizes wasted memory for many common allocation sizes (24,
+40, etc.) which are otherwise rounded up to a multiple of 16 bytes. On many
+compilers, this behavior is controlled by the `-fnew-alignment=...` flag.
+When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not
+specified (or is larger than 8 bytes), we use standard 16 byte alignments for
+`::operator new`. However, for allocations under 16 bytes, we may return an
+object with a lower alignment, as no object with a larger alignment requirement
+can be allocated in the space.
+
+When an object of a given size is requested, that request is
+[mapped to a request of a particular class-size](https://github.com/google/tcmalloc/blob/master/tcmalloc/common.h),
+and the returned memory is from that size-class. This means that the returned
+memory is at least as large as the requested size. These class-sized allocations
+are handled by the front-end.
+
+Objects of size greater than the limit defined by
+[`kMaxSize`](https://github.com/google/tcmalloc/blob/master/tcmalloc/common.h)
+are allocated directly from the [backend](#pageheap). As such they are not
+cached in either the front or middle ends. Allocation requests for large object
+sizes are rounded up to the [TCMalloc page size](#pagesizes).
+
+## Deallocation
+
+When an object is deallocated, the compiler will provide the size of the object
+if it is known at compile time. If the size is not known, it will be looked up
+in the [pagemap](#pagemap). If the object is small it will be put back into the
+front-end cache. If the object is larger than kMaxSize it is returned directly
+to the pageheap.
+
+### Per-CPU Mode
+
+In per-CPU mode a single large block of memory is allocated. The following
+diagram shows how this slab of memory is divided between CPUs and how each CPU
+uses a part of the slab to hold metadata as well as pointers to available
+objects.
+
+![Memory layout of the per-CPU slab](images/per-cpu-cache-internals.png){.center}
+
+Each logical CPU is assigned a section of this memory to hold metadata and
+pointers to available objects of particular size-classes. The metadata comprises
+one *header* block per size-class. The header has a pointer to the start of the
+per-size-class array of pointers to objects, as well as a pointer to the
+current, dynamic, maximum capacity and the current position within that array
+segment. The static maximum capacity of each per-size-class array of pointers is
+[determined at start time](https://github.com/google/tcmalloc/blob/master/tcmalloc/percpu_tcmalloc.h)
+by the difference between the start of the array for this size-class and the
+start of the array for the next size-class.
+
+At runtime the maximum number of items of a particular class-size that can be
+stored in the per-cpu block will vary, but it can never exceed the statically
+determined maximum capacity assigned at start up.
+
+When an object of a particular class-size is requested it is removed from this
+array, when the object is freed it is added to the array. If the array is
+[exhausted](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.h)
+the array is refilled using a batch of objects from the middle-end. If the array
+would
+[overflow](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.h),
+a batch of objects is removed from the array and returned to the middle-end.
+
+The amount of memory that can be cached is limited per-cpu by the parameter
+`MallocExtension::SetMaxPerCpuCacheSize`. This means that the total amount of
+cached memory depends on the number of active per-cpu caches. Consequently
+machines with higher CPU counts can cache more memory.
+
+To avoid holding memory on CPUs where the application no longer runs,
+`MallocExtension::ReleaseCpuMemory` frees objects held in a specified CPU's
+caches.
+
+Within a CPU, the distribution of memory is managed across all the size classes
+so as to keep the maximum amount of cached memory below the limit. Notice that
+it is managing the maximum amount that can be cached, and not the amount that is
+currently cached. On average the amount actually cached should be about half the
+limit.
+
+The maximum capacity is increased when a size-class
+[runs out of objects](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.cc):
+as well as fetching more objects, it considers
+[increasing the capacity](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.cc)
+of the size-class. It can increase the capacity of the size-class up until the
+total memory (for all class sizes) that the cache could hold reaches the per-cpu
+limit or until the capacity of that size class reaches the hard-coded size limit
+for that size-class. If the size-class has not reached the hard-coded limit,
+then in order to increase the capacity it can
+[steal](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.cc)
+capacity from another size class on the same CPU.
+
+### Restartable Sequences and Per-CPU TCMalloc
+
+To work correctly, per-CPU mode relies on restartable sequences (man rseq(2)). A
+restartable sequence is just a block of (assembly language) instructions,
+largely like a typical function. A restriction of restartable sequences is that
+they cannot write partial state to memory; the final instruction must be a
+single write of the updated state. The idea of restartable sequences is that if
+a thread is removed from a CPU (e.g. context switched) while it is executing a
+restartable sequence, the sequence will be restarted from the top. Hence the
+sequence will either complete without interruption, or be repeatedly restarted
+until it completes without interruption. This is achieved without using any
+locking or atomic instructions, thereby avoiding any contention in the sequence
+itself.
+
+The practical implication of this for TCMalloc is that the code can use a
+restartable sequence like
+[TcmallocSlab_Push](https://github.com/google/tcmalloc/blob/master/tcmalloc/percpu_rseq_x86_64.S)
+to fetch from or return an element to a per-CPU array without needing locking.
+The restartable sequence ensures that either the array is updated without the
+thread being interrupted, or the sequence is restarted if the thread was
+interrupted (for example, by a context switch that enables a different thread to
+run on that CPU).
+
+### Legacy Per-Thread mode
+
+In per-thread mode, TCMalloc assigns each thread a thread-local cache. Small
+allocations are satisfied from this thread-local cache. Objects are moved
+between the middle-end into and out of the thread-local cache as needed.
+
+A thread cache contains one singly linked list of free objects per size-class
+(so if there are N class-sizes, there will be N corresponding linked lists), as
+shown in the following diagram.
+
+![Structure of the per-thread cache](images/per-thread-structure.png){.center}
+
+On allocation an object is removed from the appropriate size-class of the
+per-thread caches. On deallocation, the object is prepended to the appropriate
+size-class. Underflow and overflow are handled by accessing the middle-end to
+either fetch more objects, or to return some objects.
+
+The maximum capacity of the per-thread caches is set by the parameter
+`MallocExtension::SetMaxTotalThreadCacheBytes`.
+However, it is possible for the
+total size to exceed that limit as each per-thread cache has a minimum size
+[kMinThreadCacheSize](https://github.com/google/tcmalloc/blob/master/tcmalloc/common.h)
+which is usually 512KiB. In the event that a thread wishes to increase its
+capacity, it needs to
+[scavenge](https://github.com/google/tcmalloc/blob/master/tcmalloc/thread_cache.cc)
+capacity from other threads.
+
+When threads exit, their cached memory is
+[returned](https://github.com/google/tcmalloc/blob/master/tcmalloc/thread_cache.cc)
+to the middle-end.
+
+### Runtime Sizing of Front-end Caches
+
+It is important for the size of the front-end cache free lists to adjust
+optimally. If the free list is too small, we'll need to go to the central free
+list too often. If the free list is too big, we'll waste memory as objects sit
+idle in there.
+
+Note that the caches are just as important for deallocation as they are for
+allocation. Without a cache, each deallocation would require moving the memory
+to the central free list.
+
+Per-CPU and per-thread modes have different implementations of a dynamic cache
+sizing algorithm.
+
+* In per-thread mode the maximum number of objects that can be stored is
+ [increased](https://github.com/google/tcmalloc/blob/master/tcmalloc/thread_cache.cc)
+ up to a limit whenever more objects need to be fetched from the middle-end.
+ Similarly the capacity is
+ [decreased](https://github.com/google/tcmalloc/blob/master/tcmalloc/thread_cache.cc)
+ when we find that we have cached too many objects. The size of the cache is
+ also
+ [reduced](https://github.com/google/tcmalloc/blob/master/tcmalloc/thread_cache.cc)
+ should the total size of the cached objects exceed the per-thread limit.
+* In per-CPU mode the
+ [capacity](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.cc)
+ of the free list is increased depending on whether we are alternating between
+ underflows and overflows (indicating that a larger cache might stop this
+ alternation). The capacity is
+ [reduced](https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.cc)
+ when it has not been grown for a time and may therefore be over capacity.
+
+## TCMalloc Middle-end
+
+The middle-end is responsible for providing memory to the front-end and
+returning memory to the back-end. The middle-end comprises the Transfer cache
+and the Central free list. Although these are often referred to as singular,
+there is one transfer cache and one central free list per class-size. These
+caches are each protected by a mutex lock - so there is a serialization cost to
+accessing them.
+
+### Transfer Cache
+
+When the front-end requests memory, or returns memory, it will reach out to the
+transfer cache.
+
+The transfer cache holds an array of pointers to free memory, and it is quick to
+move objects into this array, or fetch objects from this array on behalf of the
+front-end.
+
+The transfer cache gets its name from situations where one thread is allocating
+memory that is deallocated by another thread. The transfer cache allows memory
+to rapidly flow between two different threads.
+
+If the transfer cache is unable to satisfy the memory request, or has
+insufficient space to hold the returned objects, it will access the central free
+list.
+
+### Central Free List
+
+The central free list manages memory in "[spans](#spans)"; a span is a
+collection of one or more "[TCMalloc pages](#pagesizes)" of memory. These terms
+will be explained in the next couple of sections.
+
+A request for one or more objects is satisfied by the central free list by
+[extracting](https://github.com/google/tcmalloc/blob/master/tcmalloc/central_freelist.cc)
+objects from spans until the request is satisfied. If there are insufficient
+available objects in the spans, more spans are requested from the back-end.
+
+When objects are
+[returned to the central free list](https://github.com/google/tcmalloc/blob/master/tcmalloc/central_freelist.cc),
+each object is mapped to the span to which it belongs (using the
+[pagemap](#pagemap)) and then released into that span. If all the objects that
+reside in a particular span are returned to it, the entire span gets returned to
+the back-end.
+
+### Pagemap and Spans
+
+The heap managed by TCMalloc is divided into [pages](#pagesizes) of a
+compile-time determined size. A run of contiguous pages is represented by a
+`Span` object. A span can be used to manage a large object that has been handed
+off to the application, or a run of pages that have been split up into a
+sequence of small objects. If the span manages small objects, the size-class of
+the objects is recorded in the span.
+
+The pagemap is used to look up the span to which an object belongs, or to
+identify the class-size for a given object.
+
+TCMalloc uses a 2-level or 3-level
+[radix tree](https://github.com/google/tcmalloc/blob/master/tcmalloc/pagemap.h)
+in order to map all possible memory locations onto spans.
+
+The following diagram shows how a radix-2 pagemap is used to map the address of
+objects onto the spans that control the pages where the objects reside. In the
+diagram **span A** covers two pages, and **span B** covers 3 pages.
+
+![The pagemap maps object addresses to spans](images/pagemap.png){.center}
+
+Spans are used in the middle-end to determine where to place returned objects,
+and in the back-end to manage the handling of page ranges.
+
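+As an illustration, the sketch below shows how a two-level pagemap splits the
+page number of an address into a root index and a leaf index. The constants
+and class names are assumptions chosen for the example; this is not TCMalloc's
+actual pagemap code.
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+#include <memory>
+
+struct Span {
+  int size_class = 0;  // In TCMalloc a Span also records the page run itself.
+};
+
+// Assumed layout for this sketch: 48-bit addresses, 8 KiB TCMalloc pages, and
+// a 20-bit root / 15-bit leaf split of the 35-bit page number.
+constexpr int kAddressBits = 48;
+constexpr int kPageShift = 13;
+constexpr int kLeafBits = 15;
+constexpr int kRootBits = kAddressBits - kPageShift - kLeafBits;
+
+class TwoLevelPageMap {
+ public:
+  void Set(std::uintptr_t addr, Span* span) {
+    std::uintptr_t page = addr >> kPageShift;
+    std::uintptr_t i1 = page >> kLeafBits;
+    std::uintptr_t i2 = page & ((std::uintptr_t{1} << kLeafBits) - 1);
+    if (!root_[i1]) root_[i1] = std::make_unique<Leaf>();
+    root_[i1]->spans[i2] = span;
+  }
+
+  Span* Get(std::uintptr_t addr) const {
+    std::uintptr_t page = addr >> kPageShift;
+    std::uintptr_t i1 = page >> kLeafBits;
+    std::uintptr_t i2 = page & ((std::uintptr_t{1} << kLeafBits) - 1);
+    return root_[i1] ? root_[i1]->spans[i2] : nullptr;
+  }
+
+ private:
+  struct Leaf {
+    Span* spans[1 << kLeafBits] = {};
+  };
+  // The root array has 2^20 entries (~8 MiB of pointers), so a real instance
+  // lives in static storage rather than on the stack.
+  std::unique_ptr<Leaf> root_[std::size_t{1} << kRootBits];
+};
+
+int main() {
+  static TwoLevelPageMap pagemap;
+  Span span;
+  pagemap.Set(0x7f1234567000, &span);
+  std::printf("found: %d\n", pagemap.Get(0x7f1234567000) == &span);
+}
+```
+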
+### Storing Small Objects in Spans
+
+A span contains a pointer to the base of the TCMalloc pages that the span
+controls. For small objects, those pages are divided into at most 2^16
+objects. This value is selected so that within the span we can refer to objects
+by a two-byte index.
+
+This means that we can use an
+[unrolled linked list](https://en.wikipedia.org/wiki/Unrolled_linked_list) to
+hold the objects. For example, if we have eight-byte objects we can store the
+indexes of three ready-to-use objects, and use the fourth slot to store the index
+of the next object in the chain. This data structure reduces cache misses over a
+fully linked list.
+
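+A minimal sketch of the two-byte index encoding follows (the constants and
+helper names are assumptions for illustration, not TCMalloc's span code):
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+
+constexpr std::size_t kObjectSize = 8;  // assumed size-class for this sketch
+
+// Convert between object pointers and two-byte indexes relative to the span
+// base. A span never holds more than 2^16 objects, so the index always fits.
+std::uint16_t PtrToIndex(const char* span_base, const void* object) {
+  std::size_t offset = static_cast<const char*>(object) - span_base;
+  assert(offset % kObjectSize == 0 && offset / kObjectSize < (1u << 16));
+  return static_cast<std::uint16_t>(offset / kObjectSize);
+}
+
+void* IndexToPtr(char* span_base, std::uint16_t index) {
+  return span_base + static_cast<std::size_t>(index) * kObjectSize;
+}
+
+int main() {
+  alignas(8) static char span[8192];  // pretend this is the span's memory
+  void* obj = span + 3 * kObjectSize;
+  std::uint16_t idx = PtrToIndex(span, obj);
+  std::printf("index = %u\n", static_cast<unsigned>(idx));  // prints 3
+  assert(IndexToPtr(span, idx) == obj);
+}
+```
+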
+The other advantage of using two-byte indexes is that we're able to use spare
+capacity in the span itself to
+[cache four objects](https://github.com/google/tcmalloc/blob/master/tcmalloc/span.h).
+
+When we have
+[no available objects](https://github.com/google/tcmalloc/blob/master/tcmalloc/central_freelist.cc)
+for a class-size we need to fetch a new span from the pageheap and
+[populate](https://github.com/google/tcmalloc/blob/master/tcmalloc/central_freelist.cc)
+it.
+
+## TCMalloc Page Sizes {#pagesizes}
+
+TCMalloc can be built with various
+["page sizes"](https://github.com/google/tcmalloc/blob/master/tcmalloc/common.h)
+. Note that these do not correspond to the page size used in the TLB of the
+underlying hardware. These TCMalloc page sizes are currently 4KiB, 8KiB, 32KiB,
+and 256KiB.
+
+A TCMalloc page either holds multiple objects of a particular size, or is used
+as part of a group to hold an object of size greater than a single page. If an
+entire page becomes free it will be returned to the back-end (the pageheap) and
+can later be repurposed to hold objects of a different size (or returned to the
+OS).
+
+Small pages are better able to handle the memory requirements of the application
+with less overhead. For example, a half-used 4KiB page will have 2KiB left over
+versus a 32KiB page which would have 16KiB. Small pages are also more likely to
+become free. For example, a 4KiB page can hold eight 512-byte objects versus 64
+objects on a 32KiB page; and there is much less chance of 64 objects being free
+at the same time than there is of eight becoming free.
+
+Large pages result in less need to fetch and return memory from the back-end. A
+single 32KiB page can hold eight times the objects of a 4KiB page, and this can
+result in the costs of managing the larger pages being smaller. It also takes
+fewer large pages to map the entire virtual address space. TCMalloc has a
+[pagemap](https://github.com/google/tcmalloc/blob/master/tcmalloc/pagemap.h)
+which maps a virtual address onto the structures that manage the objects in that
+address range. Larger pages mean that the pagemap needs fewer entries and is
+therefore smaller.
+
+Consequently, it makes sense for applications with small memory footprints, or
+that are sensitive to memory footprint size to use smaller TCMalloc page sizes.
+Applications with large memory footprints are likely to benefit from larger
+TCMalloc page sizes.
+
+## TCMalloc Back-end {#pageheap}
+
+The back-end of TCMalloc has three jobs:
+
+* It manages large chunks of unused memory.
+* It is responsible for fetching memory from the OS when there is no suitably
+ sized memory available to fulfill an allocation request.
+* It is responsible for returning unneeded memory back to the OS.
+
+There are two backends for TCMalloc:
+
+* The Legacy pageheap which manages memory in TCMalloc page sized chunks.
+* The hugepage aware pageheap which manages memory in chunks of hugepage
+ sizes. Managing memory in hugepage chunks enables the allocator to improve
+ application performance by reducing TLB misses.
+
+### Legacy Pageheap
+
+The legacy pageheap is an array of free lists for particular lengths of
+contiguous pages of available memory. For `k < 256`, the `k`th entry is a free
+list of runs that consist of `k` TCMalloc pages. The `256`th entry is a free
+list of runs that have length `>= 256` pages:
+
+![Layout of the legacy pageheap free lists](images/legacy_pageheap.png){.center}
+
+An allocation for `k` pages is satisfied by looking in the `k`th free list. If
+that free list is empty, we look in the next free list, and so forth.
+Eventually, we look in the last free list if necessary. If that fails, we fetch
+memory from the system using `mmap`.
+
+If an allocation for `k` pages is satisfied by a run of pages of length `> k`,
+the remainder of the run is re-inserted back into the appropriate free list in
+the pageheap.
+
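+The sketch below illustrates that lookup-and-split logic. It is a
+simplification with assumed types; coalescing, locking, and the `mmap`
+fallback are omitted.
+
+```cpp
+#include <cstddef>
+#include <cstdio>
+#include <list>
+#include <optional>
+
+constexpr std::size_t kMaxPages = 256;
+
+struct PageRun {
+  std::size_t start_page;
+  std::size_t length;  // in TCMalloc pages
+};
+
+class LegacyPageHeap {
+ public:
+  // Satisfy a request for k pages from the k-th free list, falling back to
+  // longer runs and re-inserting the unused remainder.
+  std::optional<PageRun> Allocate(std::size_t k) {
+    for (std::size_t i = k; i <= kMaxPages; ++i) {
+      std::list<PageRun>& fl = free_lists_[i];
+      if (fl.empty()) continue;
+      PageRun run = fl.front();
+      fl.pop_front();
+      if (run.length > k) {
+        // Return the tail of the run to the appropriate free list.
+        Insert({run.start_page + k, run.length - k});
+        run.length = k;
+      }
+      return run;
+    }
+    return std::nullopt;  // Real TCMalloc would grow the heap via mmap here.
+  }
+
+  void Insert(PageRun run) {
+    std::size_t idx = run.length < kMaxPages ? run.length : kMaxPages;
+    free_lists_[idx].push_back(run);
+  }
+
+ private:
+  // free_lists_[k] holds runs of exactly k pages; the last entry holds all
+  // runs of >= kMaxPages pages.
+  std::list<PageRun> free_lists_[kMaxPages + 1];
+};
+
+int main() {
+  LegacyPageHeap heap;
+  heap.Insert({0, 10});         // a free run of 10 pages
+  auto run = heap.Allocate(3);  // takes 3 pages, re-inserts the remaining 7
+  std::printf("%zu %zu\n", run->start_page, run->length);  // prints "0 3"
+}
+```
+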
+When a range of pages is returned to the pageheap, the adjacent pages are
+checked to determine whether they now form a contiguous region. If so, the
+pages are concatenated and placed into the appropriate free list.
+
+### Hugepage Aware Pageheap
+
+The objective of the hugepage aware allocator is to hold memory in hugepage-sized
+chunks. On x86 a hugepage is 2MiB in size. To do this, the back-end has three
+different caches:
+
+* The filler cache holds hugepages which have had some memory allocated from
+ them. This can be considered to be similar to the legacy pageheap in that it
+ holds linked lists of memory of a particular number of TCMalloc pages.
+ Allocation requests for sizes of less than a hugepage in size are
+ (typically) returned from the filler cache. If the filler cache does not
+ have sufficient available memory it will request additional hugepages from
+ which to allocate.
+* The region cache, which handles allocations greater than a hugepage. This
+ cache allows allocations to straddle multiple hugepages, and packs multiple
+ such allocations into a contiguous region. This is particularly useful for
+ allocations that slightly exceed the size of a hugepage (for example, 2.1
+ MiB).
+* The hugepage cache handles large allocations of at least a hugepage. There
+ is overlap in usage with the region cache, but the region cache is only
+ enabled when it is determined (at runtime) that the allocation pattern would
+ benefit from it.
+
+## Caveats {#caveats}
+
+TCMalloc will reserve some memory for metadata at start up. The amount of
+metadata will grow as the heap grows. In particular the pagemap will grow with
+the virtual address range that TCMalloc uses, and the spans will grow as the
+number of active pages of memory grows. In per-CPU mode, TCMalloc will reserve a
+slab of memory per-CPU (typically 256 KiB), which, on systems with large numbers
+of logical CPUs, can lead to a multi-megabyte footprint.
+
+It is worth noting that TCMalloc requests memory from the OS in large chunks
+(typically 1 GiB regions). The address space is reserved, but not backed by
+physical memory until it is used. Because of this approach the VSS of the
+application can be substantially larger than the RSS. A side effect of this is
+that trying to limit an application's memory use by restricting VSS will fail
+long before the application has used that much physical memory.
+
+Don't try to load TCMalloc into a running binary (e.g., using JNI in Java
+programs). The binary will have allocated some objects using the system malloc,
+and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able
+to handle such objects.
diff --git a/docs/images/legacy_pageheap.png b/docs/images/legacy_pageheap.png
new file mode 100644
index 000000000..f93c4dc3e
Binary files /dev/null and b/docs/images/legacy_pageheap.png differ
diff --git a/docs/images/pagemap.png b/docs/images/pagemap.png
new file mode 100644
index 000000000..4a712c15b
Binary files /dev/null and b/docs/images/pagemap.png differ
diff --git a/docs/images/per-cpu-cache-internals.png b/docs/images/per-cpu-cache-internals.png
new file mode 100644
index 000000000..7e10a1aef
Binary files /dev/null and b/docs/images/per-cpu-cache-internals.png differ
diff --git a/docs/images/per-thread-structure.png b/docs/images/per-thread-structure.png
new file mode 100644
index 000000000..596289d25
Binary files /dev/null and b/docs/images/per-thread-structure.png differ
diff --git a/docs/images/spanmap.gif b/docs/images/spanmap.gif
new file mode 100644
index 000000000..a0627f6a7
Binary files /dev/null and b/docs/images/spanmap.gif differ
diff --git a/docs/images/tcmalloc_internals.png b/docs/images/tcmalloc_internals.png
new file mode 100644
index 000000000..5eb0e59f2
Binary files /dev/null and b/docs/images/tcmalloc_internals.png differ
diff --git a/docs/overview.md b/docs/overview.md
new file mode 100644
index 000000000..ea2dd874a
--- /dev/null
+++ b/docs/overview.md
@@ -0,0 +1,98 @@
+# TCMalloc Overview
+
+TCMalloc is Google's customized implementation of C's `malloc()` and C++'s
+`operator new` used for memory allocation within our C and C++ code. This custom
+memory allocation framework is an alternative to the one provided by the C
+standard library (on Linux usually through `glibc`) and C++ standard library.
+TCMalloc is designed to be more efficient at scale than other implementations.
+
+Specifically, TCMalloc provides the following benefits:
+
+* Performance scales with highly parallel applications.
+* Optimizations brought about with recent C++14 and C++17 standard enhancements,
+ and by diverging slightly from the standard where performance benefits
+ warrant. (These are noted within the [TCMalloc Reference](reference).)
+* Extensions to allow performance improvements under certain architectures, and
+ additional behavior such as metric gathering.
+
+## TCMalloc Cache Operation Mode
+
+TCMalloc may operate in one of two fashions:
+
+* (default) per-CPU caching, where TCMalloc maintains memory caches local to
+ individual logical cores. Per-CPU caching is enabled when running TCMalloc on
+ any Linux kernel that utilizes restartable sequences (RSEQ). Support for RSEQ
+ was merged in Linux 4.18.
+* per-thread caching, where TCMalloc maintains memory caches local to
+ each application thread. If RSEQ is unavailable, TCMalloc reverts to using
+ this legacy behavior.
+
+NOTE: the "TC" in TCMalloc refers to Thread Caching, which was originally a
+distinguishing feature of TCMalloc; the name remains as a legacy.
+
+In both cases, these cache implementations allow TCMalloc to avoid requiring
+locks for most memory allocations and deallocations.
+
+## TCMalloc Features
+
+TCMalloc provides APIs for dynamic memory allocation: `malloc()` using the C
+API, and `::operator new` using the C++ API. TCMalloc, like most allocation
+frameworks, manages this memory better than raw memory requests (such as through
+`mmap()`) by providing several optimizations:
+
+* Performing allocations from the operating system by managing
+ specifically-sized chunks of memory (called "pages"). Having all of these
+ chunks of memory the same size allows TCMalloc to simplify bookkeeping.
+* Devoting separate pages (or runs of pages called "Spans" in TCMalloc) to
+ specific object sizes. For example, all 16-byte objects are placed within
+ a "Span" specifically allocated for objects of that size. Operations to get or
+ release memory in such cases are much simpler.
+* Holding memory in *caches* to speed up access of commonly-used objects.
+ Holding such caches even after deallocation also helps avoid costly system
+ calls if such memory is later re-allocated.
+
+The cache size can also affect performance. The larger the cache, the less
+likely any given cache is to overflow or be exhausted, and therefore to require a lock to get
+more memory. TCMalloc extensions allow you to modify this cache size, though the
+default behavior should be preferred in most cases. For more information,
+consult the [TCMalloc Tuning Guide](tuning).
+
+Additionally, TCMalloc exposes telemetry about the state of the application's
+heap via `MallocExtension`. This can be used for gathering profiles of the live
+heap, as well as a snapshot taken near the heap's high-water mark size (a peak
+heap profile).
+
+## The TCMalloc API
+
+TCMalloc implements the C and C++ dynamic memory API endpoints from the C11,
+C++11, C++14, and C++17 standards.
+
+From C++, this includes
+
+* The basic `::operator new`, `::operator delete`, and array variant
+ functions.
+* C++14's sized `::operator delete`
+* C++17's overaligned `::operator new` and `::operator delete` functions.
+
+Unlike in the standard implementations, TCMalloc does not throw an exception
+when allocations fail, but instead crashes directly. Such behavior can be used
+as a performance optimization for move constructors not currently marked
+`noexcept`; such move operations can be allowed to fail directly due to
+allocation failures. In [Abseil](https://abseil.io/docs/cpp/guides/base), these
+are enabled with `-DABSL_ALLOCATOR_NOTHROW`.
+
+From C, this includes `malloc`, `calloc`, `realloc`, and `free`.
+
+The TCMalloc API obeys the behavior of C90 DR075 and
+[DR445](http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_445)
+which states:
+
+ The alignment requirement still applies even if the size is too small for
+ any object requiring the given alignment.
+
+In other words, `malloc(1)` returns an `alignof(std::max_align_t)`-aligned pointer.
+Based on the progress of
+[N2293](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2293.htm), we may relax
+this alignment in the future.
+
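+As a small illustration (not part of the TCMalloc API, just a portable check
+of the guarantee described above):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+
+int main() {
+  void* p = std::malloc(1);
+  // Even a 1-byte allocation observes the alignof(std::max_align_t)
+  // requirement (typically 16 bytes on x86-64).
+  bool aligned =
+      reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
+  std::printf("aligned: %s\n", aligned ? "yes" : "no");
+  std::free(p);
+}
+```
+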
+For more complete information, consult the [TCMalloc Reference](reference).
diff --git a/docs/platforms.md b/docs/platforms.md
new file mode 100644
index 000000000..4457bef9a
--- /dev/null
+++ b/docs/platforms.md
@@ -0,0 +1,52 @@
+# TCMalloc Platforms
+
+The TCMalloc code is supported on the following platforms. By "platforms",
+we mean the union of operating system, architecture (e.g. little-endian vs.
+big-endian), compiler, and standard library.
+
+## Language Requirements
+
+TCMalloc requires a code base that supports C++17, and our code is
+C++17-compliant. C code is required to be compliant with C11.
+
+We guarantee that our code will compile under the following compilation flags:
+
+Linux:
+
+* gcc, clang 5.0+: `-std=c++17`
+
+(TL;DR: All code at this time must be built under C++17. We will update this
+list if circumstances change.)
+
+## Supported Platforms
+
+The document below lists each platform, broken down by Operating System,
+Architecture, Specific Compiler, and Standard Library implementation.
+
+### Linux
+
+**Supported**
+
+| Operating System | Endianness/Word Size  | Processor Architectures | Compilers*           | Standard Libraries |
+| ---------------- | --------------------- | ----------------------- | -------------------- | ------------------ |
+| Linux            | little-endian, 64-bit | x86, PPC                | gcc 9.2+, clang 5.0+ | libstdc++, libc++  |
+
+\* We test on gcc 9.2, though gcc versions (which support C++17) prior to that
+release should also work.
diff --git a/docs/reference.md b/docs/reference.md
new file mode 100644
index 000000000..c57d9bddb
--- /dev/null
+++ b/docs/reference.md
@@ -0,0 +1,244 @@
+# TCMalloc Basic Reference
+
+TCMalloc provides implementations for C and C++ library memory management
+routines (`malloc()`, etc.) provided within the C and C++ standard libraries.
+
+Currently, TCMalloc requires code that conforms to the C11 C standard library
+and the C++11, C++14, or C++17 C++ standard library.
+
+NOTE: although the C API in this document is specific to the C language, the
+entire TCMalloc API itself is designed to be callable directly within C++ code
+(and we expect most usage to be from C++). The documentation in this section
+assumes C constructs (e.g. `size_t`) though invocations using equivalent C++
+constructs of aliased types (e.g. `std::size_t`) are intrinsically supported.
+
+## C++ API
+
+We implement the variants of `operator new` and `operator delete` from the
+C++11, C++14, C++17 standards exposed within the `<new>` header file. This
+includes:
+
+* The basic `::operator new()`, `::operator delete()`, and array variant
+ functions.
+* C++14's sized `::operator delete()`
+* C++17's overaligned `::operator new()` and `::operator delete()` functions.
+ As required by the C++ standard, memory allocated using an aligned `operator
+ new` function must be deallocated with an aligned `operator delete`.
+
+### `::operator new` / `::operator new[]`
+
+```
+void* operator new(std::size_t count);
+void* operator new(std::size_t count, const std::nothrow_t& tag) noexcept;
+void* operator new(std::size_t count, std::align_val_t al); // C++17
+void* operator new(std::size_t count,
+ std::align_val_t al, const std::nothrow_t&) noexcept; // C++17
+
+void* operator new[](std::size_t count);
+void* operator new[](std::size_t count, const std::nothrow_t& tag) noexcept;
+void* operator new[](std::size_t count, std::align_val_t al); // C++17
+void* operator new[](std::size_t count,
+ std::align_val_t al, const std::nothrow_t&) noexcept; // C++17
+```
+
+`operator new`/`operator new[]` allocate `count` bytes. They may be invoked
+directly but are more commonly invoked as part of a *new*-expression.
+
+When `__STDCPP_DEFAULT_NEW_ALIGNMENT__` is not specified (or is larger than 8
+bytes), we use standard 16 byte alignments for `::operator new` without a
+`std::align_val_t` argument. However, for allocations under 16 bytes, we may
+return an object with a lower alignment, as no object with a larger alignment
+requirement can be allocated in the space. When compiled with
+`__STDCPP_DEFAULT_NEW_ALIGNMENT__ <= 8`, we use a set of sizes aligned to 8
+bytes for raw storage allocated with `::operator new`.
+
+NOTE: On many platforms, the value of `__STDCPP_DEFAULT_NEW_ALIGNMENT__` can be
+configured by the `-fnew-alignment=...` flag.
+
+The `std::align_val_t` variants provide storage suitably aligned to the
+requested alignment.
+
+If the allocation is unsuccessful, a failure terminates the program.
+
+NOTE: unlike in the C++ standard, we do not throw an exception in case of
+allocation failure, or invoke `std::get_new_handler()` repeatedly in an
+attempt to successfully allocate, but instead crash directly. Such behavior can
+be used as a performance optimization for move constructors not currently marked
+`noexcept`; such move operations can be allowed to fail directly due to
+allocation failures. Within Abseil code, these direct allocation failures are
+enabled with the Abseil build-time configuration macro
+[`ABSL_ALLOCATOR_NOTHROW`](https://abseil.io/docs/cpp/guides/base#abseil-exception-policy).
+
+If the `std::nothrow_t` variant is utilized, upon failure, `::operator new`
+will return `nullptr` instead.
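+
+For example, a caller that wants to handle allocation failure itself can use
+the `std::nothrow_t` overload and check for `nullptr`; a minimal sketch:
+
+```
+#include <new>
+
+void UseNothrowNew() {
+  // Request 1 MiB; the nothrow overload returns nullptr on failure instead of
+  // terminating the program.
+  void* buffer = ::operator new(1 << 20, std::nothrow);
+  if (buffer == nullptr) {
+    return;  // Handle the failure (e.g. degrade gracefully).
+  }
+  // ... use buffer ...
+  ::operator delete(buffer);
+}
+```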
+
+### `::operator delete` / `::operator delete[]`
+
+```
+void operator delete(void* ptr) noexcept;
+void operator delete(void* ptr, std::size_t sz) noexcept;            // C++14
+void operator delete(void* ptr, std::align_val_t al) noexcept;       // C++17
+void operator delete(void* ptr, std::size_t sz,
+                     std::align_val_t al) noexcept;                  // C++17
+
+void operator delete[](void* ptr) noexcept;
+void operator delete[](void* ptr, std::size_t sz) noexcept; // C++14
+void operator delete[](void* ptr, std::align_val_t al) noexcept; // C++17
+void operator delete[](void* ptr, std::size_t sz,
+ std::align_val_t al) noexcept; // C++17
+```
+
+`::operator delete`/`::operator delete[]` deallocate memory previously allocated
+by a corresponding `::operator new`/`::operator new[]` call respectively. They
+are commonly invoked as part of a *delete*-expression.
+
+Sized delete is used as a critical performance optimization, eliminating the
+need to perform a costly pointer-to-size lookup.
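+
+When these operators are called directly (rather than through a
+*delete*-expression, where the compiler supplies the size), passing the
+original size lets TCMalloc skip that lookup. A minimal sketch:
+
+```
+#include <cstddef>
+#include <new>
+
+void SizedDeleteExample() {
+  constexpr std::size_t kSize = 64;
+  void* p = ::operator new(kSize);
+  // ... use p ...
+  // Passing the original allocation size avoids a pointer-to-size lookup.
+  ::operator delete(p, kSize);
+}
+```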
+
+### Extensions
+
+We also expose a prototype of
+[P0901](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0901r5.html) in
+https://github.com/google/tcmalloc/blob/master/tcmalloc/malloc_extension.h with
+`tcmalloc_size_returning_operator_new()`. This returns both memory and the size
+of the allocation in bytes. It can be freed with `::operator delete`.
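+
+A rough usage sketch follows. NOTE: the return type and its member names (a
+pointer `p` and an actual size `n`) are assumptions here; consult
+`malloc_extension.h` for the exact interface.
+
+```
+#include "tcmalloc/malloc_extension.h"
+
+void SizeReturningNewExample() {
+  // Request at least 40 bytes; the reported size may be larger (the full size
+  // class backing the allocation). Member names are illustrative.
+  auto res = tcmalloc_size_returning_operator_new(40);
+  // ... use res.p, knowing that res.n bytes are actually usable ...
+  ::operator delete(res.p);
+}
+```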
+
+## C API
+
+The C standard library specifies the API for dynamic memory management within
+the `<stdlib.h>` header file. Implementations require C11 or greater.
+
+TCMalloc provides implementations of the following C API functions:
+
+* `malloc()`
+* `calloc()`
+* `realloc()`
+* `free()`
+* `aligned_alloc()`
+
+For `malloc`, `calloc`, and `realloc`, we obey the behavior of C90 DR075 and
+[DR445](http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_445)
+which states:
+
+ The alignment requirement still applies even if the size is too small for
+ any object requiring the given alignment.
+
+In other words, `malloc(1)` returns an `alignof(std::max_align_t)`-aligned pointer.
+Based on the progress of
+[N2293](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2293.htm), we may relax
+this alignment in the future.
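+
+As an illustration, even a one-byte allocation is suitably aligned for any
+fundamental type; a small sketch:
+
+```
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+
+int main() {
+  void* p = malloc(1);
+  // Even a 1-byte request comes back alignof(std::max_align_t)-aligned.
+  const bool aligned =
+      reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
+  printf("aligned: %s\n", aligned ? "true" : "false");
+  free(p);
+  return 0;
+}
+```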
+
+Additionally, TCMalloc provides an implementation for the following POSIX
+standard library function, available within glibc:
+
+* `posix_memalign()`
+
+TCMalloc also provides implementations for the following obsolete functions
+typically provided within libc implementations:
+
+* `cfree()`
+* `memalign()`
+* `valloc()`
+* `pvalloc()`
+
+Documentation is not provided for these obsolete functions. The implementations
+are provided only for compatibility purposes.
+
+### `malloc()`
+
+```
+void* malloc(size_t size);
+```
+
+`malloc` allocates `size` bytes of memory and returns a `void *` pointer to the
+start of that memory.
+
+`malloc(0)` returns a non-NULL zero-sized pointer. (Attempting to access memory
+at this location is undefined.) If `malloc()` fails for some reason, it returns
+NULL.
+
+### `calloc()`
+
+```
+void* calloc(size_t num, size_t size);
+```
+
+`calloc()` allocates memory for an array of objects, zero-initializes all bytes
+in allocated storage, and if allocation succeeds, returns a pointer to the first
+byte in the allocated memory block.
+
+`calloc(num, 0)` or `calloc(0, size)` returns a non-NULL zero-sized pointer.
+(Attempting to access memory at this location is undefined.) If `calloc()` fails
+for some reason, it returns NULL.
+
+### `realloc()`
+
+```
+void* realloc(void *ptr, size_t new_size);
+```
+
+`realloc()` re-allocates memory for an existing region of memory by either
+expanding or contracting the memory based on the passed `new_size` in bytes,
+returning a `void*` pointer to the start of that memory (which may or may not
+be the same as `ptr`);
+it does not perform any initialization of new areas of memory.
+
+`realloc(OBJ*, 0)` returns a NULL pointer. If `realloc()` fails for some reason,
+it also returns NULL.
+
+### `aligned_alloc()`
+
+```
+void* aligned_alloc(size_t alignment, size_t size);
+```
+
+`aligned_alloc()` allocates `size` bytes of memory with alignment of size
+`alignment` and returns a `void *` pointer to the start of that memory; it does
+not perform any initialization.
+
+The `size` parameter must be an integral multiple of `alignment` and `alignment`
+must be a power of two. If either of these cases is not satisfied,
+`aligned_alloc()` will fail and return a NULL pointer.
+
+`aligned_alloc` with `size=0` returns a non-NULL zero-sized pointer.
+(Attempting to access memory at this location is undefined.)
+
+### `posix_memalign()`
+
+```
+int posix_memalign(void **memptr, size_t alignment, size_t size);
+```
+
+`posix_memalign()`, like `aligned_alloc()`, allocates `size` bytes of memory
+with alignment of size `alignment`, storing a pointer to that memory in
+`*memptr`; it does not perform any initialization. This pointer can be cast to
+the desired type of data pointer in order to be dereferenceable. If the aligned
+allocation succeeds, `posix_memalign()` returns `0`; otherwise it returns an
+error value.
+
+`posix_memalign` is similar to `aligned_alloc()`, but `alignment` must be a
+power-of-two multiple of `sizeof(void *)`. If the constraints are not satisfied,
+`posix_memalign()` will fail.
+
+`posix_memalign` with `size=0` returns a non-NULL zero-sized pointer.
+(Attempting to access memory at this location is undefined.)
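+
+A minimal usage sketch, requesting 1KiB aligned to a 64-byte boundary:
+
+```
+#include <cstdlib>
+
+void PosixMemalignExample() {
+  void* p = nullptr;
+  // The alignment (64) is a power-of-two multiple of sizeof(void*).
+  if (posix_memalign(&p, 64, 1024) != 0) {
+    return;  // Allocation failed; `p` was not set.
+  }
+  // ... use p ...
+  free(p);
+}
+```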
+
+### `free()`
+
+```
+void free(void* ptr);
+```
+
+`free()` deallocates memory previously allocated by `malloc()`, `calloc()`,
+`aligned_alloc()`, `posix_memalign()`, or `realloc()`. If `free()` is passed a
+null pointer, the function does nothing.
+
+### Extensions
+
+These are contained in
+https://github.com/google/tcmalloc/blob/master/tcmalloc/malloc_extension.h.
+
+* `nallocx(size_t size, int flags)` - Returns the number of bytes that would
+ be allocated by `malloc(size)`, subject to the alignment specified in
+ `flags`.
+* `sdallocx(void* ptr, size_t size, int flags)` - Deallocates memory allocated
+ by `malloc` or `memalign`. It takes a size parameter to pass the original
+ allocation size, improving deallocation performance.
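+
+A sketch of using the two together: `nallocx` to learn the real cost of a
+request, and `sdallocx` to hand the size back at deallocation time:
+
+```
+#include <cstdlib>
+
+#include "tcmalloc/malloc_extension.h"
+
+void SdallocxExample() {
+  constexpr size_t kRequest = 37;
+  // nallocx reports how many bytes malloc(kRequest) would really consume
+  // (useful, e.g., for memory accounting in containers).
+  const size_t rounded = nallocx(kRequest, 0);
+  (void)rounded;
+  void* p = malloc(kRequest);
+  if (p == nullptr) return;
+  // ... use p ...
+  // Passing the original request size avoids a pointer-to-size lookup.
+  sdallocx(p, kRequest, 0);
+}
+```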
diff --git a/docs/sampling.md b/docs/sampling.md
new file mode 100644
index 000000000..e46414879
--- /dev/null
+++ b/docs/sampling.md
@@ -0,0 +1,57 @@
+# How sampling in TCMalloc works.
+
+## Introduction
+
+TCMalloc uses sampling to get representative data on memory usage and
+allocation. How this works is not well documented. This doc attempts to at least
+partially fix this.
+
+## Sampling
+
+We chose to sample an allocation every N bytes where N is a
+[random value](https://github.com/google/tcmalloc/blob/master/tcmalloc/sampler.cc)
+with a mean set by the
+[profile sample rate](https://github.com/google/tcmalloc/blob/master/tcmalloc/malloc_extension.h).
+By default this is every
+[2MiB](https://github.com/google/tcmalloc/blob/master/tcmalloc/common.h).
+
+## How We Sample Allocations
+
+When we
+[pick an allocation](https://github.com/google/tcmalloc/blob/master/tcmalloc/sampler.cc)
+to sample we do some
+[additional processing around that allocation](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc) -
+recording stack, alignment, request size, and allocation size. Then we go
+[through all the active samplers](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc)
+and tell them about the allocation. We also tell the
+[span that we're sampling it](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc) -
+we can do this because we do sampling at tcmalloc page sizes, so each sample
+corresponds to a particular page in the pagemap.
+
+## How We Free Sampled Objects
+
+Each sampled allocation is tagged. So we can quickly
+[test whether a particular allocation might be a sample](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc).
+
+When we are done with the sampled span
+[we release it](https://github.com/google/tcmalloc/blob/master/tcmalloc/span.cc).
+
+## How Do We Handle Heap and Fragmentation Profiling
+
+To handle
+[heap](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc)
+and
+[fragmentation](https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc)
+profiling we just need to traverse the list of sampled objects and compute
+either their degree of fragmentation, or the amount of heap they consume.
+
+## How Do We Handle Allocation Profiling
+
+Allocation profiling reports a list of sampled allocations during a length of
+time. We start an
+[allocation profile](https://github.com/google/tcmalloc/blob/master/tcmalloc/malloc_extension.h),
+then wait until time has elapsed, then call `Stop` on the token and report the
+profile.
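+
+A rough sketch of driving this from `MallocExtension` (assuming the
+`StartAllocationProfiling()` / `Stop()` token interface declared in
+`malloc_extension.h`):
+
+```
+#include <utility>
+
+#include "absl/time/clock.h"
+#include "absl/time/time.h"
+#include "tcmalloc/malloc_extension.h"
+
+tcmalloc::Profile CollectAllocationProfile() {
+  // Begin sampling; the token accumulates sampled allocations while active.
+  auto token = tcmalloc::MallocExtension::StartAllocationProfiling();
+  absl::SleepFor(absl::Seconds(10));  // Let the workload run.
+  // Stop returns the profile of allocations sampled during the interval.
+  return std::move(token).Stop();
+}
+```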
+
+While the allocation sampler is active it is added to the list of samplers for
+allocations and removed from the list when it is claimed.
diff --git a/docs/stats.md b/docs/stats.md
new file mode 100644
index 000000000..db77dc72b
--- /dev/null
+++ b/docs/stats.md
@@ -0,0 +1,720 @@
+# Understanding Malloc Stats
+
+## Getting Malloc Stats
+
+Human-readable statistics can be obtained by calling
+`tcmalloc::MallocExtension::GetStats()`.
+
+## Understanding Malloc Stats Output
+
+### It's A Lot Of Information
+
+The output contains a lot of information. Much of it can be considered debug
+info that's interesting to folks who are passingly familiar with the internals
+of TCMalloc, but potentially not that useful for most people.
+
+### Summary Section
+
+The most generally useful section is the first few lines:
+
+```
+------------------------------------------------
+MALLOC: 16709337136 (15935.3 MiB) Bytes in use by application
+MALLOC: + 503480320 ( 480.2 MiB) Bytes in page heap freelist
+MALLOC: + 363974808 ( 347.1 MiB) Bytes in central cache freelist
+MALLOC: + 120122560 ( 114.6 MiB) Bytes in per-CPU cache freelist
+MALLOC: + 415232 ( 0.4 MiB) Bytes in transfer cache freelist
+MALLOC: + 76920 ( 0.1 MiB) Bytes in thread cache freelists
+MALLOC: + 52258953 ( 49.8 MiB) Bytes in malloc metadata
+MALLOC: ------------
+MALLOC: = 17749665929 (16927.4 MiB) Actual memory used (physical + swap)
+MALLOC: + 333905920 ( 318.4 MiB) Bytes released to OS (aka unmapped)
+MALLOC: ------------
+MALLOC: = 18083571849 (17245.8 MiB) Virtual address space used
+```
+
+* **Bytes in use by application:** Number of bytes that the application is
+ actively using to hold data. This is computed by the bytes requested from
+ the OS minus any bytes that are held in caches and other internal data
+ structures.
+* **Bytes in page heap freelist:** The pageheap is a structure that holds
+ memory ready for TCMalloc to use. This memory is not actively being used,
+ and could be returned to the OS. [See TCMalloc tuning](tuning.md)
+* **Bytes in central cache freelist:** This is the amount of memory currently
+ held in the central freelist. This is a structure that holds partially used
+ "[spans](#more-detail-on-metadata)" of memory. The spans are partially used
+ because some memory has been allocated from them, but they still have some
+ free memory available.
+* **Bytes in per-CPU cache freelist:** In per-cpu mode (which is the default)
+ each CPU holds some memory ready to quickly hand to the application. The
+ maximum size of this per-cpu cache is tunable.
+ [See TCMalloc tuning](tuning.md)
+* **Bytes in transfer cache freelist:** The transfer cache can be
+ considered another part of the central freelist. It holds memory that is
+ ready to be provided to the application for use.
+* **Bytes in thread cache freelists:** The TC in TCMalloc stands for thread
+ cache. Originally each thread held its own cache of memory to provide to the
+ application. Since the change of default the thread caches are used by very
+ few applications. However, TCMalloc starts in per-thread mode, so there may
+ be some memory left in per-thread caches from before it switches into
+ per-cpu mode.
+* **Bytes in malloc metadata:** the size of the data structures used for
+ tracking memory allocation. This will grow as the amount of memory used
+ grows.
+
+There are a couple of summary lines:
+
+* **Actual memory used:** This is the total amount of memory that TCMalloc
+ thinks it is using in the various categories. This is computed from the size
+ of the various areas; the actual contribution to RSS may be larger or
+ smaller than this value. The true RSS may be less if memory is not mapped
+ in. In some cases RSS can be larger if small regions end up being mapped
+ with huge pages. This does not count memory that TCMalloc is not aware of
+ (e.g. memory-mapped files, text segments, etc.).
+* **Bytes released to OS:** TCMalloc can release memory back to the OS (see
+ [tcmalloc tuning](tuning.md)), and this is the upper bound on the amount of
+ released memory. However, it is up to the OS as to whether the act of
+ releasing the memory actually reduces the RSS of the application. The code
+ uses MADV_DONTNEED which tells the OS that the memory is no longer needed,
+ but does not actually cause it to be physically removed.
+* **Virtual address space used:** This is the amount of virtual address space
+ that TCMalloc believes it is using. This should match the later section on
+ requested memory. There are other ways that an application can increase its
+ virtual address space, and this statistic does not capture them.
+
+### More Detail On Metadata
+
+The next section gives some insight into the amount of metadata that TCMalloc is
+using. This is really debug information, and not very actionable.
+
+```
+MALLOC: 236176 Spans in use
+MALLOC: 238709 ( 10.9 MiB) Spans created
+MALLOC: 8 Thread heaps in use
+MALLOC: 46 ( 0.0 MiB) Thread heaps created
+MALLOC: 13517 Stack traces in use
+MALLOC: 13742 ( 7.2 MiB) Stack traces created
+MALLOC: 0 Table buckets in use
+MALLOC: 2808 ( 0.0 MiB) Table buckets created
+MALLOC: 11665416 ( 11.1 MiB) Pagemap bytes used
+MALLOC: 4067336 ( 3.9 MiB) Pagemap root resident bytes
+```
+
+* **Spans:** structures that hold multiple
+ [pages](#page-sizes) of allocatable
+ objects.
+* **Thread heaps:** These are the per-thread structures used in per-thread
+ mode.
+* **Stack traces:** These hold metadata for each sampled object.
+* **Table buckets:** These hold data for stack traces for sampled events.
+* **Pagemap:** This data structure supports the mapping of object addresses to
+ information about the objects held on the page. The pagemap root is a
+ potentially large array, and it is useful to know how much is actually
+ memory resident.
+
+### Page Sizes
+
+There are three relevant "page" sizes for systems and TCMalloc. It's important
+to be able to disambiguate them.
+
+* **System default page size:** this is not reported by TCMalloc. This is 4KiB
+ on x86. It's not referred to in TCMalloc, but it's important to know that it
+ is different from the sizes of pages used in TCMalloc.
+* **TCMalloc page size:** This is the basic unit of memory management for
+ TCMalloc. Objects on the same page are the same number of bytes in size.
+ Internally TCMalloc manages memory in chunks of this size. TCMalloc supports
+ 4 sizes: 4KiB (small-but-slow), 8KiB (the default), 32KiB, and 256KiB. There
+ are trade-offs around the page sizes:
+ * Smaller page sizes are more memory efficient because we have less
+ fragmentation (ie left over space) when trying to provide the requested
+ amount of memory using 4KiB chunks. It's also more likely that all the
+ objects on a 4KiB page will be freed allowing the page to be returned
+ and used for a different size of data.
+ * Larger pages result in fewer fetches from the page heap to provide a
+ given amount of memory. They also keep memory of the same size in closer
+ proximity.
+* **TCMalloc hugepage size:** This is the size of a hugepage on the system,
+ for x86 this is 2MiB. This size is used as a unit of management by
+ Temeraire, but not used by the pre-Temeraire pageheap.
+
+```
+MALLOC: 32768 Tcmalloc page size
+MALLOC: 2097152 Tcmalloc hugepage size
+```
+
+### Experiments
+
+There is an experiment framework embedded into TCMalloc.
+The enabled experiments are reported as part of the statistics.
+
+```
+MALLOC EXPERIMENTS: TCMALLOC_TEMERAIRE=0 TCMALLOC_TEMERAIRE_WITH_SUBRELEASE_V3=0
+```
+
+### Actual Memory Footprint
+
+The output also reports the memory size information recorded by the OS:
+
+* Bytes resident is the amount of physical memory in use by the application
+ (RSS). This includes things like program text which is excluded from the
+ information that TCMalloc presents.
+* Bytes mapped is the size of the virtual address space in use by the
+ application (VSS). This can be substantially larger than the virtual memory
+ reported by TCMalloc, as applications can increase VSS in other ways. It's
+ also not that useful as a metric, since the VSS is only an upper bound on the
+ RSS and is not directly related to the amount of physical memory that the
+ application uses.
+
+```
+Total process stats (inclusive of non-malloc sources):
+TOTAL: 86880677888 (82855.9 MiB) Bytes resident (physical memory used)
+TOTAL: 89124790272 (84996.0 MiB) Bytes mapped (virtual memory used)
+```
+
+### Per Class Size Information
+
+Requests for memory are rounded to convenient sizes. For example a request for
+15 bytes could be rounded to 16 bytes. These sizes are referred to as class
+sizes. There are various caches in TCMalloc where memory gets held, and the per
+size class section reports how much memory is being used by cached objects of
+each size. The columns reported for each class size are:
+
+* The size class number.
+* The size of each object in that class size.
+* The number of objects of that size currently held in the per-cpu,
+ per-thread, transfer, and central caches.
+* The total size of those objects in MiB - ie size of each object multiplied
+ by the number of objects.
+* The cumulative size of that class size plus all smaller class sizes.
+
+```
+Total size of freelists for per-thread and per-CPU caches,
+transfer cache, and central cache, by size class
+------------------------------------------------
+class 1 [ 8 bytes ] : 413460 objs; 3.2 MiB; 3.2 cum MiB
+class 2 [ 16 bytes ] : 103410 objs; 1.6 MiB; 4.7 cum MiB
+class 3 [ 24 bytes ] : 525973 objs; 12.0 MiB; 16.8 cum MiB
+class 4 [ 32 bytes ] : 275250 objs; 8.4 MiB; 25.2 cum MiB
+class 5 [ 40 bytes ] : 1047443 objs; 40.0 MiB; 65.1 cum MiB
+...
+```
+
+### Per-CPU Information
+
+If the per-cpu cache is enabled then we get a report of the memory currently
+being cached on each CPU.
+
+The first number reported is the maximum size of the per-cpu cache on each CPU.
+This corresponds to the parameter `MallocExtension::GetMaxPerCpuCacheSize()`,
+which defaults to 3MiB. [See tuning](tuning.md)
+
+The following columns are reported for each CPU:
+
+* The cpu ID
+* The total size of the objects held in the CPU's cache in bytes.
+* The total size of the objects held in the CPU's cache in MiB.
+* The total number of unallocated bytes.
+
+The concept of unallocated bytes needs to be explained because the definition is
+not obvious.
+
+The per-cpu cache is an array of pointers to available memory. Each class size
+has a number of entries that it can use in the array. These entries can be used
+to hold memory, or be empty.
+
+To control the maximum memory that the per-cpu cache can use we sum up the
+number of slots that can be used by a size class multiplied by the size of
+objects in that size class. This gives us the total memory that could be held in
+the cache. This is not what is reported by unallocated memory.
+
+Unallocated memory is the amount of memory left over from the per cpu limit
+after we have subtracted the total memory that could be held in the cache.
+
+The in use memory is calculated from the sum of the number of populated entries
+in the per-cpu array multiplied by the size of the objects held in those
+entries.
+
+To summarise, the per-cpu limit (which is reported before the per-cpu data) is
+equal to the number of bytes in use (which is reported in the second column)
+plus the number of bytes that could be used (which is not reported) plus the
+unallocated "spare" bytes (which is reported as the last column).
+
+```
+Bytes in per-CPU caches (per cpu limit: 3145728 bytes)
+------------------------------------------------
+cpu 0: 2168200 bytes ( 2.1 MiB) with 52536 bytes unallocated active
+cpu 1: 1734880 bytes ( 1.7 MiB) with 258944 bytes unallocated active
+cpu 2: 1779352 bytes ( 1.7 MiB) with 8384 bytes unallocated active
+cpu 3: 1414224 bytes ( 1.3 MiB) with 112432 bytes unallocated active
+cpu 4: 1260016 bytes ( 1.2 MiB) with 179800 bytes unallocated
+...
+```
+
+Some CPU caches may be marked `active`, indicating that the process is currently
+runnable on that CPU.
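+
+As a worked example using cpu 0 above: the per-cpu limit of 3145728 bytes is
+made up of 2168200 bytes in use, 52536 unallocated bytes, and the remaining
+3145728 - 2168200 - 52536 = 924992 bytes of slot capacity that could still be
+filled (which is not reported directly).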
+
+### Pageheap Information
+
+The pageheap holds pages of memory that are not currently being used either by
+the application or by TCMalloc's internal caches. These pages are grouped into
+spans - which are ranges of contiguous pages, and these spans can be either
+mapped (backed by physical memory) or unmapped (not necessarily backed by
+physical memory).
+
+Memory from the pageheap is used either to replenish the per-thread or per-cpu
+caches, or to directly satisfy requests that are larger than the sizes supported
+by the per-thread or per-cpu caches.
+
+**Note:** TCMalloc cannot tell whether a span of memory is actually backed by
+physical memory, but it uses _unmapped_ to indicate that it has told the OS that
+the span is not used and does not need the associated physical memory. For this
+reason the physical memory of an application may be larger than the amount that
+TCMalloc reports.
+
+The pageheap section contains the following information:
+
+* The first line reports the number of sizes of spans, the total memory that
+ these spans cover, and the total amount of that memory that is unmapped.
+* The size of the span in number of pages.
+* The number of spans of that size.
+* The total memory consumed by those spans in MiB.
+* The cumulative total memory held in spans of that size and fewer pages.
+* The amount of that memory that has been unmapped.
+* The cumulative amount of unmapped memory for spans of that size and smaller.
+
+```
+PageHeap: 30 sizes; 480.1 MiB free; 318.4 MiB unmapped
+------------------------------------------------
+ 1 pages * 341 spans ~ 10.7 MiB; 10.7 MiB cum; unmapped: 1.9 MiB; 1.9 MiB cum
+ 2 pages * 469 spans ~ 29.3 MiB; 40.0 MiB cum; unmapped: 0.0 MiB; 1.9 MiB cum
+ 3 pages * 462 spans ~ 43.3 MiB; 83.3 MiB cum; unmapped: 3.3 MiB; 5.2 MiB cum
+ 4 pages * 119 spans ~ 14.9 MiB; 98.2 MiB cum; unmapped: 0.1 MiB; 5.3 MiB cum
+...
+```
+
+### Pageheap Cache Age
+
+The next section gives some indication of the age of the various spans in the
+pageheap. Live (ie backed by physical memory) and unmapped spans are reported
+separately.
+
+The columns indicate roughly how long the span has been in the pageheap, ranging
+from less than a second to more than 8 hours.
+
+```
+------------------------------------------------
+PageHeap cache entry age (count of pages in spans of a given size that have been idle for up to the given period of time)
+------------------------------------------------
+ mean <1s 1s 30s 1m 30m 1h 8+h
+Live span TOTAL PAGES: 9.1 533 13322 26 1483 0 0 0
+Live span, 1 pages: 7.4 0 256 0 24 0 0 0
+Live span, 2 pages: 1.6 38 900 0 0 0 0 0
+…
+Unmapped span TOTAL PAGES: 153.9 153 2245 1801 5991 0 0 0
+Unmapped span, 1 pages: 34.6 0 35 15 11 0 0 0
+Unmapped span, 3 pages: 28.4 0 60 42 3 0 0 0
+...
+```
+
+### Pageheap Allocation Summary
+
+This reports some stats on the number of pages allocated.
+
+* The number of live (i.e. not on the page heap) pages that were "small"
+ allocations. Small allocations are ones that are tracked in the pageheap by
+ size (e.g. a region of two pages in size). Larger allocations are just kept in
+ an array that has to be scanned linearly.
+* The pages of slack result from situations where allocation is rounded up to
+ hugepages, and this leaves some spare pages.
+* The largest seen allocation is self-explanatory.
+
+```
+PageHeap: stats on allocation sizes
+PageHeap: 344420 pages live small allocation
+PageHeap: 12982 pages of slack on large allocations
+PageHeap: largest seen allocation 29184 pages
+```
+
+### Pageheap Per Number Of Pages In Range
+
+This starts off reporting the activity for small ranges of pages, but at the end
+of the list starts aggregating information for groups of page ranges.
+
+* The first column contains the number of pages (or the range of pages if the
+ bucket is wider than a single page).
+* The second and third columns are the number of allocated and freed pages we
+ have seen of this size.
+* The fourth column is the number of live allocations of this size.
+* The fifth column is the size of those live allocations in MiB.
+* The sixth column is the allocation rate in pages per second since the start
+ of the application.
+* The seventh column is the allocation rate in MiB per second since the start
+ of the application.
+
+```
+PageHeap: per-size information:
+PageHeap: 1 page info: 23978897 / 23762891 a/f, 216006 (6750.2 MiB) live, 2.43e+03 allocs/s ( 76.1 MiB/s)
+PageHeap: 2 page info: 21442844 / 21436331 a/f, 6513 ( 407.1 MiB) live, 2.18e+03 allocs/s (136.0 MiB/s)
+PageHeap: 3 page info: 2333686 / 2329225 a/f, 4461 ( 418.2 MiB) live, 237 allocs/s ( 22.2 MiB/s)
+PageHeap: 4 page info: 21509168 / 21508751 a/f, 417 ( 52.1 MiB) live, 2.18e+03 allocs/s (272.9 MiB/s)
+PageHeap: 5 page info: 3356076 / 3354188 a/f, 1888 ( 295.0 MiB) live, 341 allocs/s ( 53.2 MiB/s)
+PageHeap: 6 page info: 1718534 / 1718486 a/f, 48 ( 9.0 MiB) live, 174 allocs/s ( 32.7 MiB/s)
+...
+```
+
+### GWP-ASan Status
+
+The GWP-ASan section displays information about allocations guarded by GWP-ASan.
+
+* The number of successful and failed GWP-ASan allocations. If there are 0
+ successful and 0 failed allocations, GWP-ASan is probably disabled on your
+ binary. If there are a large number of failed allocations, it probably means
+ your sampling rate is too high, causing the guarded slots to be exhausted.
+* The number of "slots" currently allocated and quarantined. An allocated slot
+ contains an allocation that is still active (i.e. not freed) while a
+ quarantined slot has either not been used yet or contains an allocation that
+ was freed.
+* The maximum number of slots that have been allocated at the same time. This
+ number is printed along with the allocated slot limit. If the maximum slots
+ allocated matches the limit, you may want to reduce your sampling rate to
+ avoid failed GWP-ASan allocations.
+
+```
+------------------------------------------------
+GWP-ASan Status
+------------------------------------------------
+Successful Allocations: 1823
+Failed Allocations: 0
+Slots Currently Allocated: 33
+Slots Currently Quarantined: 95
+Maximum Slots Allocated: 51 / 64
+```
+
+### Memory Requested From The OS
+
+The stats also report the amount of memory requested from the OS by mmap.
+
+Memory that has been requested may not actually be backed by physical memory,
+so these stats should resemble the VSS of the application, not the RSS.
+
+```
+Low-level allocator stats:
+MmapSysAllocator: 18083741696 bytes (17246.0 MiB) allocated
+```
+
+## Temeraire
+
+### Introduction
+
+Temeraire (or Huge Page Aware Allocator) is a new page heap for TCMalloc that is
+hugepage aware. It is designed to better handle memory backed by hugepages -
+avoiding breaking them up. Since it is more elaborate code, it reports
+additional information.
+
+### Summary Statistics
+
+The initial set of statistics from the Huge Page Aware Allocator are similar to
+the old page heap, and show a summary of the number of instances of each range
+of contiguous pages.
+
+```
+------------------------------------------------
+HugePageAware: 75 sizes; 938.8 MiB free; 1154.0 MiB unmapped
+------------------------------------------------
+ 1 pages * 86655 spans ~ 677.0 MiB; 677.0 MiB cum; unmapped: 0.0 MiB; 0.0 MiB cum
+ 2 pages * 3632 spans ~ 56.8 MiB; 733.7 MiB cum; unmapped: 0.0 MiB; 0.0 MiB cum
+ 3 pages * 288 spans ~ 6.8 MiB; 740.5 MiB cum; unmapped: 0.0 MiB; 0.0 MiB cum
+ 4 pages * 250 spans ~ 7.8 MiB; 748.3 MiB cum; unmapped: 0.0 MiB; 0.0 MiB cum
+...
+```
+
+The first line indicates the number of different sizes of ranges, the total MiB
+available, and the total MiB of unmapped ranges. The next lines are per number
+of continuous pages:
+
+* The number of contiguous pages
+* The number of spans of that number of pages
+* The total number of MiB of that span size that are mapped.
+* The cumulative total of the mapped pages.
+* The total number of MiB of that span size that are unmapped.
+* The cumulative total of the unmapped pages.
+
+### Per Component Information
+
+The Huge Page Aware Allocator has multiple places where pages of memory are
+held. More details of its workings can be found in this document. There are four
+caches where pages of memory can be located:
+
+* The filler, used for allocating ranges of a few TCMalloc pages in size.
+* The region cache, used for allocating ranges of multiple pages.
+* The huge cache which contains huge pages that are backed with memory.
+* The huge page allocator which contains huge pages that are not backed by
+ memory.
+
+We get some summary information for the various caches, before we report
+detailed information for each of the caches.
+
+```
+Huge page aware allocator components:
+------------------------------------------------
+HugePageAware: breakdown of free / unmapped / used space:
+HugePageAware: filler 38825.2 MiB used, 938.8 MiB free, 0.0 MiB unmapped
+HugePageAware: region 0.0 MiB used, 0.0 MiB free, 0.0 MiB unmapped
+HugePageAware: cache 908.0 MiB used, 0.0 MiB free, 0.0 MiB unmapped
+HugePageAware: alloc 0.0 MiB used, 0.0 MiB free, 1154.0 MiB unmapped
+```
+
+The summary information tells us:
+
+* The first column shows how much memory has been allocated from each of the
+ caches
+* The second column indicates how much backed memory is available in each
+ cache.
+* The third column indicates how much unmapped memory is available in each
+ cache.
+
+### Filler Cache
+
+The filler cache contains TCMalloc-sized pages from within a single hugepage. So
+if we want a single TCMalloc page we will look for it in the filler.
+
+There are two sections of stats around the filler cache. The first section gives
+an indication of the number and state of the hugepages in the filler cache.
+
+```
+HugePageFiller: densely pack small requests into hugepages
+HugePageFiller: 19882 total, 3870 full, 16012 partial, 0 released, 0 quarantined
+HugePageFiller: 120168 pages free in 19882 hugepages, 0.0236 free
+HugePageFiller: among non-fulls, 0.0293 free
+HugePageFiller: 0 hugepages partially released, nan released
+HugePageFiller: 1.0000 of used pages hugepageable
+```
+
+The summary stats are as follows:
+
+* Total pages is the number of hugepages in the filler cache.
+* Full is the number of hugepages that have multiple in-use allocations.
+* Partial is the remaining number of hugepages that have a single in-use
+ allocation.
+* Released is the number of hugepages that are released - ie partially
+ unmapped.
+* Quarantined is a feature that has been disabled, so the result is currently zero.
+
+The second section gives an indication of the number of pages in various states
+in the filler cache.
+
+```
+HugePageFiller: fullness histograms
+...
+```
+
+The histograms bucket the hugepages tracked by the filler by their number of
+free pages, their longest contiguous range of free pages, and their number of
+allocations, shown separately for each tracker type. For example, one line may
+report how many regular hugepages have >= 64 and < 80 free pages, that 6
+regular hugepages have a longest contiguous free range of exactly 1 page, or
+that 2 regular hugepages hold between 81 and 96 allocations.
+
+The three tracker types are "regular," "donated," and "released." "Regular" is
+by far the most common, and indicates regular memory in the filler.
+
+"Donated" is hugepages that have been donated to the filler from the tail of
+large (multi-hugepage) allocations, so that the leftover space can be packed
+with smaller allocations. But we prefer to use up all usable regular hugepages
+before touching the donated ones, which devolve to "regular" type once they are
+used. Because of this last property, donated hugepages always have only one
+allocation and their longest range equals their free space, so those histograms
+aren't shown.
+
+"Released" is partially released hugepages. Normally the entirety of a hugepage
+is backed by real RAM, but in partially released hugepages most of it has been
+returned to the OS. Because this defeats the primary goal of the hugepage-aware
+allocator, this is done rarely, and we only reuse partially-released hugepages
+for new allocations as a last resort.
+
+### Region Cache
+
+The region cache holds a chunk of memory from which spans of multiple TCMalloc
+pages can be allocated. The region cache may not be populated, and it can
+contain multiple regions.
+
+```
+HugeRegionSet: 1 MiB+ allocations best-fit into 1024 MiB slabs
+HugeRegionSet: 0 total regions
+HugeRegionSet: 0 hugepages backed out of 0 total
+HugeRegionSet: 0 pages free in backed region, nan free
+```
+
+The lines of output indicate:
+
+* The size of each region in MiB - this is currently 1GiB.
+* The total number of regions in the region cache, in the example above there
+ are no regions in the cache.
+* The number of backed hugepages in the cache out of the total number of
+ hugepages in the region cache.
+* The number of free TCMalloc pages in the regions, and as a ratio of the
+ number of backed pages.
+
+### Huge Cache
+
+The huge cache contains backed hugepages. It grows and shrinks in size depending
+on runtime conditions, attempting to hold onto backed memory that is ready to be
+provided to the application.
+
+```
+HugeCache: contains unused, backed hugepage(s)
+HugeCache: 0 / 10 hugepages cached / cache limit (0.053 hit rate, 0.436 overflow rate)
+HugeCache: 88880 MiB fast unbacked, 6814 MiB periodic
+HugeCache: 1234 MiB*s cached since startup
+HugeCache: recent usage range: 40672 min - 40672 curr - 40672 max MiB
+HugeCache: recent offpeak range: 0 min - 0 curr - 0 max MiB
+HugeCache: recent cache range: 0 min - 0 curr - 0 max MiB
+```
+
+The output shows the following information:
+
+* The number of hugepages out of the maximum number of hugepages we will hold
+ in the huge cache. The hit rate is how often we get pages from the huge
+ cache vs getting them from the huge allocator. The overflow rate is the
+ number of times we added something to the huge cache causing it to exceed
+ its size limit.
+* The fast unbacked count is the cumulative amount of memory unbacked due to
+ size limitations; the periodic count is the cumulative amount of memory unbacked
+ by periodic calls to release unused memory.
+* The amount of cumulative memory stored in HugeCache since the startup of the
+ process. In other words, the area under the cached-memory-vs-time curve.
+* The usage range is the range minimum, current, maximum in MiB of memory
+ obtained from the huge cache.
+* The off-peak range is the minimum, current, maximum cache size in MiB
+ compared to the peak cache size.
+* The recent range is the minimum, current, maximum size of memory in MiB in
+ the huge cache.
+
+### Huge Allocator
+
+The huge allocator holds unmapped memory ranges. We allocate from here if we are
+unable to allocate from any of the caches.
+
+```
+HugeAllocator: contiguous, unbacked hugepage(s)
+HugeAddressMap: treap 5 / 10 nodes used / created
+HugeAddressMap: 256 contiguous hugepages available
+HugeAllocator: 20913 requested - 20336 in use = 577 hugepages free
+```
+
+The information reported here is:
+
+* The number of nodes used and created to handle regions of memory.
+* The size of the longest contiguous region of available hugepages.
+* The number of hugepages requested from the system, the number of hugepages
+ in use, and the number of hugepages available in the cache.
+
+### Pageheap Summary Information
+
+The new pageheap reports some summary information:
+
+```
+HugePageAware: stats on allocation sizes
+HugePageAware: 4969003 pages live small allocation
+HugePageAware: 659 pages of slack on large allocations
+HugePageAware: largest seen allocation 45839 pages
+```
+
+These are:
+
+* The number of live "small" TCMalloc pages allocated (these less than 2MiB in
+ size).
+ [Note: the 2MiB size distinction is separate from the size of hugepages]
+* The number of TCMalloc pages which are left over from "large" allocations.
+ These allocations are larger than 2MiB in size, and are rounded to a
+ hugepage - the slack being the amount left over after rounding.
+* The largest seen allocation request in TCMalloc pages.
+
+### Per Size Range Info:
+
+The per size range info is the same format as the old pageheap:
+
+* The first column contains the number of pages (or the range of pages if the
+ bucket is wider than a single page).
+* The second and third columns are the number of allocated and freed pages we
+ have seen of this size.
+* The fourth column is the number of live allocations of this size.
+* The fifth column is the size of those live allocations in MiB.
+* The sixth column is the allocation rate in pages per second since the start
+ of the application.
+* The seventh column is the allocation rate in MiB per second since the start
+ of the application.
+
+```
+HugePageAware: per-size information:
+HugePageAware: 1 page info: 5817510 / 3863506 a/f, 1954004 (15265.7 MiB) live, 16 allocs/s ( 0.1 MiB/s)
+HugePageAware: 2 page info: 1828473 / 1254096 a/f, 574377 ( 8974.6 MiB) live, 5.03 allocs/s ( 0.1 MiB/s)
+HugePageAware: 3 page info: 1464568 / 1227253 a/f, 237315 ( 5562.1 MiB) live, 4.03 allocs/s ( 0.1 MiB/s)
+...
+```
+
+### Pageheap Age Information:
+
+The new pageheap allocator also reports information on the age of the various
+page ranges. In this example you can see that there was a large number of
+unmapped pages in the last minute.
+
+```
+------------------------------------------------
+HugePageAware cache entry age (count of pages in spans of a given size that have been idle for up to the given period of time)
+------------------------------------------------
+ mean <1s 1s 30s 1m 30m 1h 8+h
+Live span TOTAL PAGES: 29317.6 145 549 1775 13059 13561 58622 32457
+Live span, 1 pages: 35933.7 0 55 685 6354 8111 43853 27597
+...
+Unmapped span TOTAL PAGES: 51.3 0 0 131072 16640 0 0 0
+Unmapped span, >=64 pages: 51.3 0 0 131072 16640 0 0 0
+...
+```
+
diff --git a/docs/tuning.md b/docs/tuning.md
new file mode 100644
index 000000000..c7f72d573
--- /dev/null
+++ b/docs/tuning.md
@@ -0,0 +1,131 @@
+# Performance Tuning TCMalloc
+
+There are three user accessible controls that we can use to performance tune
+TCMalloc:
+
+* The logical page size for TCMalloc (4KiB, 8KiB, 32KiB, 256KiB)
+* The per-thread or per-cpu cache sizes
+* The rate at which memory is released to the OS
+
+None of these tuning parameters are clear wins, otherwise they would be the
+default. We'll discuss the advantages and disadvantages of changing them.
+
+## The Logical Page Size for TCMalloc
+
+This is determined at compile time by linking in the appropriate version of
+TCMalloc. The page size indicates the unit in which TCMalloc manages memory. The
+default is 8KiB chunks; there are larger options of 32KiB and 256KiB. There
+is also the 4KiB page size used by the small-but-slow allocator.
+
+A smaller page size allows TCMalloc to provide memory to an application with
+less waste. Waste comes about through two issues:
+
+* Left-over memory when rounding larger requests to the page size (e.g. a
+ request for 62 KiB might get rounded to 64 KiB).
+* Pages of memory that are stuck because they have a single in use allocation
+ on the page, and therefore cannot be repurposed to hold a different size of
+ allocation.
+
+The second of these points is worth elucidating. For small allocations TCMalloc
+will fit multiple objects onto a single page.
+
+So if you request 512 bytes, then an entire page will be devoted to 512 byte
+objects. If the size of that page is 4KiB we get 8 objects, if the size of that
+page is 256KiB we get 512 objects. That page can only be used for 512 byte
+objects until all the objects on the page have been freed.
+
+If you have 8 objects on a page, there's a reasonable chance that all 8 will
+become free at the same time, and we can repurpose the page for objects of a
+different size. If there's 512 objects on that page, then it is very unlikely
+that all the objects will become freed at the same time, so that page will
+probably never become entirely free and will probably hang around, potentially
+containing only a few in-use objects.
+
+The consequence of this is that large pages tend to lead to a larger memory
+footprint. There's also the issue that if you want one object of a size, you
+need to allocate a whole page.
+
+The advantages of managing objects using larger page sizes are:
+
+* Objects of the same size are better clustered in memory. If you need 512 KiB
+ of 8 byte objects, then that's two 256 KiB pages, or 128 x 4 KiB pages. If
+ memory is largely backed by hugepages, then with large pages in the worst
+ case we can map the entire demand with two large pages, whereas small pages
+ could take up to 128 entries in the TLB.
+* There's a structure called the `PageMap` which enables TCMalloc to look up
+ information about any allocated memory. If we use large pages the pagemap
+ needs fewer entries and can be much smaller. This makes it more likely that
+ it is cache resident. However, sized delete substantially reduced the number
+ of times that we need to consult the pagemap, so the benefit from larger
+ pages is reduced.
+
+**Suggestion:** The default of 8KiB page sizes is probably good enough for most
+applications. However, if an application has a heap measured in GiB it may be
+worth looking at using large page sizes.
+
+**Suggestion:** Consider small-but-slow if minimising memory footprint is more
+important than performance.
+
+**Note:** Class sizes are determined on a per-page-size basis. So changing the
+page size will implicitly change the class sizes used. Class sizes are selected
+to be memory-efficient for the applications using that page size. If an
+application changes page size, there may be a performance or memory impact from
+the different selection of class sizes.
+
+## Per-thread/per-cpu Cache Sizes
+
+The default is for TCMalloc to run in per-cpu mode as this is faster; however,
+there are a few applications which have not yet transitioned. The plan is to move
+these across at some point soon.
+
+Increasing the size of the cache is an obvious way to improve performance. The
+larger the cache the less frequently memory needs to be fetched from the central
+caches. Returning memory from the cache is substantially faster than fetching
+from the central cache.
+
+The size of the per-cpu caches is controlled by
+`tcmalloc::MallocExtension::SetMaxPerCpuCacheSize`. This controls the limit for
+each CPU, so the total amount of memory for the application could be much larger
+than this. Memory on CPUs where the application is no longer able to run can be
+freed by calling `tcmalloc::MallocExtension::ReleaseCpuMemory`.
+
+In contrast `tcmalloc::MallocExtension::SetMaxTotalThreadCacheBytes` controls
+the _total_ size of all thread caches in the application.
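+
+For example, an application that spends a lot of time in TCMalloc could
+experimentally raise the per-cpu limit; a sketch (the value is illustrative,
+not a recommendation):
+
+```
+#include "tcmalloc/malloc_extension.h"
+
+void TunePerCpuCaches() {
+  // Allow each CPU cache to grow to 6 MiB instead of the 3 MiB default.
+  tcmalloc::MallocExtension::SetMaxPerCpuCacheSize(6 << 20);
+  // Caches of CPUs the process can no longer run on can be drained with
+  // tcmalloc::MallocExtension::ReleaseCpuMemory(cpu).
+}
+```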
+
+**Suggestion:** The default cache size is typically sufficient, but cache size
+can be increased (or decreased) depending on the amount of time spent in
+TCMalloc code, and depending on the overall size of the application (a larger
+application can afford to cache more memory without noticeably increasing its
+overall size).
+
+## Memory Releasing
+
+`tcmalloc::MallocExtension::ReleaseMemoryToSystem` makes a request to TCMalloc
+to release `n` bytes of memory back to the OS. This can keep the memory
+footprint of the application down to a minimal amount; however, it should be
+considered that this just reduces the application down from its peak memory
+footprint, and does not make that peak memory footprint smaller.
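+
+For instance, after a memory-intensive phase completes an application might
+hand some of its now-idle memory back; a sketch (the amount is illustrative):
+
+```
+#include "tcmalloc/malloc_extension.h"
+
+void ReleaseIdleMemory() {
+  // Ask TCMalloc to return up to 256 MiB of free pages to the OS.
+  tcmalloc::MallocExtension::ReleaseMemoryToSystem(256 << 20);
+}
+```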
+
+There are two disadvantages of releasing memory aggressively:
+
+* Memory that is unmapped may be immediately needed, and there is a cost to
+ faulting unmapped memory back into the application.
+* Memory that is unmapped at small granularity will break up hugepages, and
+ this will cause some performance loss due to increased TLB misses.
+
+**Note:** Release rate is not a panacea for memory usage. Jobs should be
+provisioned for peak memory usage to avoid OOM errors. Setting a release rate
+may enable an application to exceed the memory limit for short periods of
+time without triggering an OOM. Setting a release rate is also good-citizen
+behavior, as it enables the system to use spare memory capacity for applications
+which are under-provisioned. However, it is not a substitute for setting
+appropriate memory requirements for the job.
+
+**Note:** Memory is released from the `PageHeap` and stranded per-cpu caches.
+It is not possible to release memory from other internal structures, like
+the `CentralFreeList`.
+
+**Suggestion:** The default release rate is probably appropriate for most
+applications. In situations where it is tempting to set a faster rate it is
+worth considering why there are memory spikes, since those spikes are likely to
+cause an OOM at some point.
diff --git a/tcmalloc/BUILD b/tcmalloc/BUILD
new file mode 100644
index 000000000..12aca9a2f
--- /dev/null
+++ b/tcmalloc/BUILD
@@ -0,0 +1,995 @@
+# Copyright 2019 The TCMalloc Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Description:
+#
+# tcmalloc is a fast malloc implementation.
+
+load("//tcmalloc:copts.bzl", "TCMALLOC_DEFAULT_COPTS")
+
+package(default_visibility = ["//visibility:private"])
+
+licenses(["notice"])
+
+config_setting(
+ name = "llvm",
+ flag_values = {
+ "@bazel_tools//tools/cpp:compiler": "llvm",
+ },
+ visibility = ["//visibility:private"],
+)
+
+NO_BUILTIN_MALLOC = [
+ "-fno-builtin-malloc",
+ "-fno-builtin-free",
+]
+
+overlay_deps = [
+]
+
+cc_library(
+ name = "experiment",
+ srcs = ["experiment.cc"],
+ hdrs = [
+ "experiment.h",
+ "experiment_config.h",
+ ],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":malloc_extension",
+ "//tcmalloc/internal:logging",
+ "//tcmalloc/internal:util",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/types:optional",
+ ],
+)
+
+cc_library(
+ name = "percpu_tcmalloc",
+ hdrs = ["percpu_tcmalloc.h"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ "//tcmalloc/internal:mincore",
+ "//tcmalloc/internal:percpu",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:dynamic_annotations",
+ ],
+)
+
+# Dependencies required by :tcmalloc and its variants. Since :common is built
+# several different ways, it should not be included on this list.
+tcmalloc_deps = [
+ ":experiment",
+ ":malloc_extension",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:config",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/base:dynamic_annotations",
+ "@com_google_absl//absl/debugging:leak_check",
+ "@com_google_absl//absl/debugging:stacktrace",
+ "@com_google_absl//absl/debugging:symbolize",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/strings",
+ "//tcmalloc/internal:declarations",
+ "//tcmalloc/internal:linked_list",
+ "//tcmalloc/internal:logging",
+ "//tcmalloc/internal:memory_stats",
+ "//tcmalloc/internal:percpu",
+]
+
+# This library provides tcmalloc always
+cc_library(
+ name = "tcmalloc",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "sampler.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//visibility:public"],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common",
+ ],
+ alwayslink = 1,
+)
+
+# Provides tcmalloc always; use per-thread mode.
+#
+cc_library(
+ name = "tcmalloc_deprecated_perthread",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = ["-DTCMALLOC_DEPRECATED_PERTHREAD"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = [
+ "//tcmalloc/testing:__pkg__",
+ ],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common_deprecated_perthread",
+ ],
+ alwayslink = 1,
+)
+
+# An opt tcmalloc build with ASSERTs forced on (by turning off
+# NDEBUG). Useful for tracking down crashes in production binaries.
+# To use add malloc = "//tcmalloc:opt_with_assertions" in your
+# target's build rule.
+cc_library(
+ name = "opt_with_assertions",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = [
+ "-O2",
+ "-UNDEBUG",
+ ] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//visibility:public"],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "size_class_info",
+ hdrs = ["size_class_info.h"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ "//tcmalloc/internal:logging",
+ ],
+)
+
+# List of common source files used by the various tcmalloc libraries.
+common_srcs = [
+ "arena.cc",
+ "arena.h",
+ "central_freelist.cc",
+ "central_freelist.h",
+ "common.cc",
+ "common.h",
+ "cpu_cache.cc",
+ "cpu_cache.h",
+ "experimental_size_classes.cc",
+ "guarded_page_allocator.h",
+ "guarded_page_allocator.cc",
+ "huge_address_map.cc",
+ "huge_allocator.cc",
+ "huge_allocator.h",
+ "huge_cache.cc",
+ "huge_cache.h",
+ "huge_region.h",
+ "huge_page_aware_allocator.cc",
+ "huge_page_aware_allocator.h",
+ "huge_page_filler.h",
+ "huge_pages.h",
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "libc_override_redefine.h",
+ "page_allocator.cc",
+ "page_allocator.h",
+ "page_allocator_interface.cc",
+ "page_allocator_interface.h",
+ "page_heap.cc",
+ "page_heap.h",
+ "page_heap_allocator.h",
+ "pagemap.cc",
+ "pagemap.h",
+ "parameters.cc",
+ "peak_heap_tracker.cc",
+ "sampler.cc",
+ "sampler.h",
+ "size_classes.cc",
+ "span.cc",
+ "span.h",
+ "stack_trace_table.cc",
+ "stack_trace_table.h",
+ "static_vars.cc",
+ "static_vars.h",
+ "stats.cc",
+ "system-alloc.cc",
+ "system-alloc.h",
+ "tcmalloc.h",
+ "thread_cache.cc",
+ "thread_cache.h",
+ "tracking.h",
+ "transfer_cache.cc",
+ "transfer_cache.h",
+]
+
+common_hdrs = [
+ "arena.h",
+ "central_freelist.h",
+ "common.h",
+ "cpu_cache.h",
+ "guarded_page_allocator.h",
+ "huge_address_map.h",
+ "huge_allocator.h",
+ "tcmalloc_policy.h",
+ "huge_cache.h",
+ "huge_page_filler.h",
+ "huge_pages.h",
+ "huge_region.h",
+ "huge_page_aware_allocator.h",
+ "libc_override.h",
+ "page_allocator.h",
+ "page_allocator_interface.h",
+ "page_heap.h",
+ "page_heap_allocator.h",
+ "pagemap.h",
+ "parameters.h",
+ "peak_heap_tracker.h",
+ "sampler.h",
+ "span.h",
+ "stack_trace_table.h",
+ "stats.h",
+ "static_vars.h",
+ "system-alloc.h",
+ "tcmalloc.h",
+ "thread_cache.h",
+ "tracking.h",
+ "transfer_cache.h",
+]
+
+common_deps = [
+ ":experiment",
+ ":malloc_extension",
+ ":noruntime_size_classes",
+ ":percpu_tcmalloc",
+ ":size_class_info",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:config",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/base:dynamic_annotations",
+ "@com_google_absl//absl/debugging:debugging_internal",
+ "@com_google_absl//absl/debugging:stacktrace",
+ "@com_google_absl//absl/debugging:symbolize",
+ "@com_google_absl//absl/hash:hash",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/strings:str_format",
+ "@com_google_absl//absl/time",
+ "@com_google_absl//absl/types:optional",
+ "@com_google_absl//absl/types:span",
+ "//tcmalloc/internal:atomic_stats_counter",
+ "//tcmalloc/internal:bits",
+ "//tcmalloc/internal:declarations",
+ "//tcmalloc/internal:linked_list",
+ "//tcmalloc/internal:logging",
+ "//tcmalloc/internal:mincore",
+ "//tcmalloc/internal:parameter_accessors",
+ "//tcmalloc/internal:percpu",
+ "//tcmalloc/internal:range_tracker",
+ "//tcmalloc/internal:util",
+]
+
+cc_library(
+ name = "common",
+ srcs = common_srcs,
+ hdrs = common_hdrs,
+ copts = TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//tcmalloc:tcmalloc_tests"],
+ deps = common_deps,
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "common_deprecated_perthread",
+ srcs = common_srcs,
+ hdrs = common_hdrs,
+ copts = ["-DTCMALLOC_DEPRECATED_PERTHREAD"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = common_deps,
+ alwayslink = 1,
+)
+
+# TEMPORARY. WILL BE REMOVED.
+# Add a dep to this if you want your binary to use hugepage-aware
+# allocator.
+cc_library(
+ name = "want_hpaa",
+ srcs = ["want_hpaa.cc"],
+ copts = ["-g0"] + TCMALLOC_DEFAULT_COPTS,
+ visibility = ["//visibility:public"],
+ deps = [
+ "@com_google_absl//absl/base:core_headers",
+ ],
+ alwayslink = 1,
+)
+
+# TEMPORARY. WILL BE REMOVED.
+# Add a dep to this if you want your binary to use hugepage-aware
+# allocator with hpaa_subrelease=true.
+cc_library(
+ name = "want_hpaa_subrelease",
+ srcs = ["want_hpaa_subrelease.cc"],
+ copts = ["-g0"] + TCMALLOC_DEFAULT_COPTS,
+ visibility = ["//visibility:public"],
+ deps = [
+ "@com_google_absl//absl/base:core_headers",
+ ],
+ alwayslink = 1,
+)
+
+# TEMPORARY. WILL BE REMOVED.
+# Add a dep to this if you want your binary to not use hugepage-aware
+# allocator.
+cc_library(
+ name = "want_no_hpaa",
+ srcs = ["want_no_hpaa.cc"],
+ copts = ["-g0"] + TCMALLOC_DEFAULT_COPTS,
+ visibility = [
+ "//tcmalloc/testing:__pkg__",
+ ],
+ deps = [
+ "@com_google_absl//absl/base:core_headers",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "runtime_size_classes",
+ srcs = ["runtime_size_classes.cc"],
+ hdrs = ["runtime_size_classes.h"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ visibility = [
+ "//visibility:private",
+ ],
+ deps = [
+ ":size_class_info",
+ "//tcmalloc/internal:logging",
+ "//tcmalloc/internal:util",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/strings",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "noruntime_size_classes",
+ srcs = ["noruntime_size_classes.cc"],
+ hdrs = ["runtime_size_classes.h"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":size_class_info",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/strings",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "tcmalloc_large_pages",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = ["-DTCMALLOC_LARGE_PAGES"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//visibility:public"],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common_large_pages",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "common_large_pages",
+ srcs = common_srcs,
+ hdrs = common_hdrs,
+ copts = ["-DTCMALLOC_LARGE_PAGES"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = common_deps,
+ alwayslink = 1,
+)
+
+# This is another large page configuration (256k)
+cc_library(
+ name = "tcmalloc_256k_pages",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = ["-DTCMALLOC_256K_PAGES"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//visibility:public"],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common_256k_pages",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "common_256k_pages",
+ srcs = common_srcs,
+ hdrs = common_hdrs,
+ copts = ["-DTCMALLOC_256K_PAGES"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = common_deps,
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "tcmalloc_small_but_slow",
+ srcs = [
+ "libc_override.h",
+ "libc_override_gcc_and_weak.h",
+ "libc_override_glibc.h",
+ "tcmalloc.cc",
+ "tcmalloc.h",
+ ],
+ copts = ["-DTCMALLOC_SMALL_BUT_SLOW"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ visibility = ["//visibility:public"],
+ deps = overlay_deps + tcmalloc_deps + [
+ ":common_small_but_slow",
+ ],
+ alwayslink = 1,
+)
+
+cc_library(
+ name = "common_small_but_slow",
+ srcs = common_srcs,
+ hdrs = common_hdrs,
+ copts = ["-DTCMALLOC_SMALL_BUT_SLOW"] + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = common_deps,
+ alwayslink = 1,
+)
+
+# Export some header files to tcmalloc/testing/...
+package_group(
+ name = "tcmalloc_tests",
+ packages = [
+ "//tcmalloc/testing/...",
+ ],
+)
+
+cc_library(
+ name = "headers_for_tests",
+ srcs = [
+ "arena.h",
+ "central_freelist.h",
+ "guarded_page_allocator.h",
+ "huge_address_map.h",
+ "huge_allocator.h",
+ "huge_cache.h",
+ "huge_page_aware_allocator.h",
+ "huge_page_filler.h",
+ "huge_pages.h",
+ "huge_region.h",
+ "page_allocator.h",
+ "page_allocator_interface.h",
+ "page_heap.h",
+ "page_heap_allocator.h",
+ "pagemap.h",
+ "parameters.h",
+ "peak_heap_tracker.h",
+ "stack_trace_table.h",
+ "transfer_cache.h",
+ ],
+ hdrs = [
+ "common.h",
+ "sampler.h",
+ "size_class_info.h",
+ "span.h",
+ "static_vars.h",
+ "stats.h",
+ "system-alloc.h",
+ ],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ visibility = ["//tcmalloc:tcmalloc_tests"],
+ deps = common_deps,
+)
+
+cc_library(
+ name = "page_allocator_test_util",
+ testonly = 1,
+ srcs = [
+ "page_allocator_test_util.h",
+ ],
+ hdrs = ["page_allocator_test_util.h"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ ":malloc_extension",
+ ],
+)
+
+cc_test(
+ name = "page_heap_test",
+ srcs = ["page_heap_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/memory",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "huge_cache_test",
+ srcs = ["huge_cache_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/random",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "huge_allocator_test",
+ srcs = ["huge_allocator_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/random",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "huge_page_filler_test",
+ srcs = ["huge_page_filler_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/algorithm:container",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/container:flat_hash_map",
+ "@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/flags:flag",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/random",
+ "@com_google_absl//absl/random:distributions",
+ "@com_google_absl//absl/synchronization",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "huge_region_test",
+ srcs = ["huge_region_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/random",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "guarded_page_allocator_test",
+ srcs = ["guarded_page_allocator_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "pagemap_unittest",
+ srcs = ["pagemap_unittest.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "realloc_unittest",
+ srcs = ["realloc_unittest.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/random",
+ "@com_google_absl//absl/random:distributions",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "stack_trace_table_test",
+ srcs = ["stack_trace_table_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/debugging:stacktrace",
+ "@com_google_absl//absl/flags:flag",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/strings:str_format",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "system-alloc_unittest",
+ srcs = ["system-alloc_unittest.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ ":malloc_extension",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+# This test has been named "large" since before tests were s/m/l.
+# The "large" refers to large allocation sizes.
+cc_test(
+ name = "tcmalloc_large_unittest",
+ size = "small",
+ timeout = "moderate",
+ srcs = ["tcmalloc_large_unittest.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ ":malloc_extension",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/container:node_hash_set",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+# There are more unittests in the tools subdirectory! (Mostly, those
+# tests that depend on more than just //base and //tcmalloc).
+
+cc_test(
+ name = "malloc_extension_system_malloc_test",
+ srcs = ["malloc_extension_system_malloc_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc/internal:system_malloc",
+ deps = [
+ ":malloc_extension",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "malloc_extension_test",
+ srcs = ["malloc_extension_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":malloc_extension",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "page_allocator_test",
+ srcs = ["page_allocator_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ deps = [
+ ":common",
+ ":malloc_extension",
+ ":page_allocator_test_util",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "profile_test",
+ size = "medium",
+ srcs = ["profile_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ malloc = "//tcmalloc",
+ shard_count = 2,
+ deps = [
+ ":common",
+ ":malloc_extension",
+ "//tcmalloc/internal:declarations",
+ "//tcmalloc/internal:linked_list",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/container:flat_hash_map",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "size_classes_test",
+ srcs = ["size_classes_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ ":size_class_info",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "size_classes_test_large_pages",
+ srcs = ["size_classes_test.cc"],
+ copts = ["-DTCMALLOC_LARGE_PAGES"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_large_pages",
+ deps = [
+ ":common_large_pages",
+ ":size_class_info",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "size_classes_test_256k_pages",
+ srcs = ["size_classes_test.cc"],
+ copts = ["-DTCMALLOC_256K_PAGES"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_256k_pages",
+ deps = [
+ ":common_256k_pages",
+ ":size_class_info",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "size_classes_test_small_but_slow",
+ srcs = ["size_classes_test.cc"],
+ copts = ["-DTCMALLOC_SMALL_BUT_SLOW"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_small_but_slow",
+ deps = [
+ ":common_small_but_slow",
+ ":size_class_info",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "size_classes_test_with_runtime_size_classes",
+ srcs = ["size_classes_with_runtime_size_classes_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ ":runtime_size_classes",
+ ":size_class_info",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/strings:str_format",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "heap_profiling_test",
+ srcs = ["heap_profiling_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":common",
+ ":malloc_extension",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "runtime_size_classes_test",
+ srcs = ["runtime_size_classes_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ linkstatic = 1,
+ malloc = "//tcmalloc",
+ deps = [
+ ":runtime_size_classes",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "span_test",
+ srcs = ["span_test.cc"],
+ copts = NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":headers_for_tests",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "span_test_small_but_slow",
+ srcs = ["span_test.cc"],
+ copts = ["-DTCMALLOC_SMALL_BUT_SLOW"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_small_but_slow",
+ deps = [
+ ":headers_for_tests",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "span_test_large_pages",
+ srcs = ["span_test.cc"],
+ copts = ["-DTCMALLOC_LARGE_PAGES"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_large_pages",
+ deps = [
+ ":headers_for_tests",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "span_test_256k_pages",
+ srcs = ["span_test.cc"],
+ copts = ["-DTCMALLOC_256K_PAGES"] + NO_BUILTIN_MALLOC + TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc:tcmalloc_256k_pages",
+ deps = [
+ ":headers_for_tests",
+ "//tcmalloc/internal:logging",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/container:flat_hash_set",
+ "@com_google_absl//absl/random",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "stats_test",
+ srcs = ["stats_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ malloc = "//tcmalloc",
+ deps = [
+ ":headers_for_tests",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_absl//absl/base",
+ "@com_google_absl//absl/time",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_test(
+ name = "huge_address_map_test",
+ srcs = ["huge_address_map_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":common",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
+
+cc_library(
+ name = "malloc_extension",
+ srcs = ["malloc_extension.cc"],
+ hdrs = [
+ "internal_malloc_extension.h",
+ "malloc_extension.h",
+ ],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ visibility = [
+ "//visibility:public",
+ ],
+ deps = [
+ "@com_google_absl//absl/base:core_headers",
+ "@com_google_absl//absl/base:dynamic_annotations",
+ "@com_google_absl//absl/base:malloc_internal",
+ "@com_google_absl//absl/functional:function_ref",
+ "@com_google_absl//absl/memory",
+ "@com_google_absl//absl/strings",
+ "@com_google_absl//absl/types:optional",
+ "@com_google_absl//absl/types:span",
+ ],
+)
+
+cc_test(
+ name = "experiment_config_test",
+ srcs = ["experiment_config_test.cc"],
+ copts = TCMALLOC_DEFAULT_COPTS,
+ deps = [
+ ":experiment",
+ "@com_github_google_benchmark//:benchmark",
+ "@com_google_googletest//:gtest_main",
+ ],
+)
diff --git a/tcmalloc/arena.cc b/tcmalloc/arena.cc
new file mode 100644
index 000000000..975f842c0
--- /dev/null
+++ b/tcmalloc/arena.cc
@@ -0,0 +1,48 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "tcmalloc/arena.h"
+
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/system-alloc.h"
+
+namespace tcmalloc {
+
+void* Arena::Alloc(size_t bytes) {
+ char* result;
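+ // Round the request up to a multiple of kAlignment so the returned block
+ // and the remaining free area stay aligned.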
+ bytes = ((bytes + kAlignment - 1) / kAlignment) * kAlignment;
+ if (free_avail_ < bytes) {
+ size_t ask = bytes > kAllocIncrement ? bytes : kAllocIncrement;
+ size_t actual_size;
+ free_area_ = reinterpret_cast<char*>(
+ SystemAlloc(ask, &actual_size, kPageSize, /*tagged=*/false));
+ if (ABSL_PREDICT_FALSE(free_area_ == nullptr)) {
+ Log(kCrash, __FILE__, __LINE__,
+ "FATAL ERROR: Out of memory trying to allocate internal tcmalloc "
+ "data (bytes, object-size)",
+ kAllocIncrement, bytes);
+ }
+ SystemBack(free_area_, actual_size);
+ free_avail_ = actual_size;
+ }
+
+ ASSERT(reinterpret_cast<uintptr_t>(free_area_) % kAlignment == 0);
+ result = free_area_;
+ free_area_ += bytes;
+ free_avail_ -= bytes;
+ bytes_allocated_ += bytes;
+ return reinterpret_cast<void*>(result);
+}
+
+} // namespace tcmalloc
diff --git a/tcmalloc/arena.h b/tcmalloc/arena.h
new file mode 100644
index 000000000..59a727572
--- /dev/null
+++ b/tcmalloc/arena.h
@@ -0,0 +1,69 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef TCMALLOC_ARENA_H_
+#define TCMALLOC_ARENA_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include "absl/base/thread_annotations.h"
+#include "tcmalloc/common.h"
+
+namespace tcmalloc {
+
+// Arena allocation; designed for use by tcmalloc internal data structures like
+// spans, profiles, etc. Always expands.
+class Arena {
+ public:
+ Arena() {
+ }
+
+ // We use an explicit Init function because these variables are statically
+ // allocated and their constructors might not have run by the time some other
+ // static variable tries to allocate memory.
+ void Init() EXCLUSIVE_LOCKS_REQUIRED(pageheap_lock) {
+ free_area_ = nullptr;
+ free_avail_ = 0;
+ bytes_allocated_ = 0;
+ }
+
+ // Return a properly aligned byte array of length "bytes". Crashes if
+ // allocation fails. Requires pageheap_lock is held.
+ void* Alloc(size_t bytes) EXCLUSIVE_LOCKS_REQUIRED(pageheap_lock);
+
+ // Returns the total number of bytes allocated from this arena. Requires
+ // pageheap_lock is held.
+ uint64_t bytes_allocated() const EXCLUSIVE_LOCKS_REQUIRED(pageheap_lock) {
+ return bytes_allocated_;
+ }
+
+ private:
+ // How much to allocate from system at a time
+ static const int kAllocIncrement = 128 << 10;
+
+ // Free area from which to carve new objects
+ char* free_area_ GUARDED_BY(pageheap_lock);
+ size_t free_avail_ GUARDED_BY(pageheap_lock);
+
+ // Total number of bytes allocated from this arena
+ uint64_t bytes_allocated_ GUARDED_BY(pageheap_lock);
+
+ Arena(const Arena&) = delete;
+ Arena& operator=(const Arena&) = delete;
+};
+
+} // namespace tcmalloc
+
+#endif // TCMALLOC_ARENA_H_
diff --git a/tcmalloc/central_freelist.cc b/tcmalloc/central_freelist.cc
new file mode 100644
index 000000000..a67759260
--- /dev/null
+++ b/tcmalloc/central_freelist.cc
@@ -0,0 +1,156 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "tcmalloc/central_freelist.h"
+
+#include
+
+#include "tcmalloc/internal/linked_list.h"
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/page_heap.h"
+#include "tcmalloc/pagemap.h"
+#include "tcmalloc/static_vars.h"
+
+namespace tcmalloc {
+
+// This acts like a constructor, so we disable thread safety analysis.
+void CentralFreeList::Init(size_t cl) NO_THREAD_SAFETY_ANALYSIS {
+ size_class_ = cl;
+ object_size_ = Static::sizemap()->class_to_size(cl);
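+ // Number of objects that fit in one span of this class; for example, a
+ // single 8 KiB page holding 80-byte objects would yield 102 objects per span.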
+ objects_per_span_ = Static::sizemap()->class_to_pages(cl) * kPageSize /
+ (cl ? object_size_ : 1);
+ nonempty_.Init();
+ num_spans_.Clear();
+ counter_.Clear();
+}
+
+static Span* MapObjectToSpan(void* object) {
+ const PageID p = reinterpret_cast<uintptr_t>(object) >> kPageShift;
+ Span* span = Static::pagemap()->GetExistingDescriptor(p);
+ return span;
+}
+
+Span* CentralFreeList::ReleaseToSpans(void* object, Span* span) {
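+ // If the freelist was empty, this span was not linked into nonempty_; it is
+ // about to gain a free object, so link it back in first.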
+ if (span->FreelistEmpty()) {
+ nonempty_.prepend(span);
+ }
+
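+ // FreelistPush() returns false when the last outstanding object has been
+ // returned, i.e. the span is now completely free; the caller hands such
+ // spans back to the page heap.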
+ if (span->FreelistPush(object, object_size_)) {
+ return nullptr;
+ }
+
+ counter_.LossyAdd(-objects_per_span_);
+ num_spans_.LossyAdd(-1);
+ span->RemoveFromList(); // from nonempty_
+ return span;
+}
+
+void CentralFreeList::InsertRange(void** batch, int N) {
+ CHECK_CONDITION(N > 0 && N <= kMaxObjectsToMove);
+ Span* spans[kMaxObjectsToMove];
+ // Safe to store free spans into freed up space in span array.
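+ // (In the loop below, free_count never exceeds the current index i, so
+ // free_spans[free_count] only overwrites entries that were already read.)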
+ Span** free_spans = spans;
+ int free_count = 0;
+
+ // Prefetch Span objects to reduce cache misses.
+ for (int i = 0; i < N; ++i) {
+ Span* span = MapObjectToSpan(batch[i]);
+ ASSERT(span != nullptr);
+#if defined(__GNUC__)
+ __builtin_prefetch(span, 0, 3);
+#endif
+ spans[i] = span;
+ }
+
+ // First, release all individual objects into spans under our mutex
+ // and collect spans that become completely free.
+ {
+ absl::base_internal::SpinLockHolder h(&lock_);
+ for (int i = 0; i < N; ++i) {
+ Span* span = ReleaseToSpans(batch[i], spans[i]);
+ if (span) {
+ free_spans[free_count] = span;
+ free_count++;
+ }
+ }
+ counter_.LossyAdd(N);
+ }
+
+ // Then, release all free spans into page heap under its mutex.
+ if (free_count) {
+ absl::base_internal::SpinLockHolder h(&pageheap_lock);
+ for (int i = 0; i < free_count; ++i) {
+ ASSERT(!IsTaggedMemory(free_spans[i]->start_address()));
+ Static::pagemap()->UnregisterSizeClass(free_spans[i]);
+ Static::page_allocator()->Delete(free_spans[i], /*tagged=*/false);
+ }
+ }
+}
+
+int CentralFreeList::RemoveRange(void** batch, int N) {
+ ASSERT(N > 0);
+ absl::base_internal::SpinLockHolder h(&lock_);
+ if (nonempty_.empty()) {
+ Populate();
+ }
+
+ int result = 0;
+ while (result < N && !nonempty_.empty()) {
+ Span* span = nonempty_.first();
+ int here = span->FreelistPopBatch(batch + result, N - result, object_size_);
+ ASSERT(here > 0);
+ if (span->FreelistEmpty()) {
+ span->RemoveFromList(); // from nonempty_
+ }
+ result += here;
+ }
+ counter_.LossyAdd(-result);
+ return result;
+}
+
+// Fetch memory from the system and add to the central cache freelist.
+void CentralFreeList::Populate() NO_THREAD_SAFETY_ANALYSIS {
+ // Release central list lock while operating on pageheap
+ lock_.Unlock();
+ const size_t npages = Static::sizemap()->class_to_pages(size_class_);
+
+ Span* span = Static::page_allocator()->New(npages, /*tagged=*/false);
+ if (span == nullptr) {
+ Log(kLog, __FILE__, __LINE__,
+ "tcmalloc: allocation failed", npages << kPageShift);
+ lock_.Lock();
+ return;
+ }
+ ASSERT(span->num_pages() == npages);
+
+ Static::pagemap()->RegisterSizeClass(span, size_class_);
+ span->BuildFreelist(object_size_, objects_per_span_);
+
+ // Add span to list of non-empty spans
+ lock_.Lock();
+ nonempty_.prepend(span);
+ num_spans_.LossyAdd(1);
+ counter_.LossyAdd(objects_per_span_);
+}
+
+size_t CentralFreeList::OverheadBytes() {
+ if (size_class_ == 0) { // 0 holds the 0-sized allocations
+ return 0;
+ }
+ const size_t pages_per_span = Static::sizemap()->class_to_pages(size_class_);
+ const size_t overhead_per_span = (pages_per_span * kPageSize) % object_size_;
+ return static_cast<size_t>(num_spans_.value()) * overhead_per_span;
+}
+
+} // namespace tcmalloc
diff --git a/tcmalloc/central_freelist.h b/tcmalloc/central_freelist.h
new file mode 100644
index 000000000..53d9fd013
--- /dev/null
+++ b/tcmalloc/central_freelist.h
@@ -0,0 +1,98 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef TCMALLOC_CENTRAL_FREELIST_H_
+#define TCMALLOC_CENTRAL_FREELIST_H_
+
+#include <stddef.h>
+
+#include "absl/base/internal/spinlock.h"
+#include "absl/base/macros.h"
+#include "absl/base/thread_annotations.h"
+#include "tcmalloc/internal/atomic_stats_counter.h"
+#include "tcmalloc/span.h"
+
+namespace tcmalloc {
+
+// Data kept per size-class in central cache.
+class CentralFreeList {
+ public:
+ // A CentralFreeList may be used before its constructor runs, so we prevent
+ // lock_'s constructor from doing anything to the lock_ state.
+ CentralFreeList()
+ : lock_(absl::base_internal::kLinkerInitialized),
+ counter_(absl::base_internal::kLinkerInitialized),
+ num_spans_(absl::base_internal::kLinkerInitialized) {}
+
+ void Init(size_t cl) LOCKS_EXCLUDED(lock_);
+
+ // These methods all do internal locking.
+
+ // Insert batch[0..N-1] into the central freelist.
+ // REQUIRES: N > 0 && N <= kMaxObjectsToMove.
+ void InsertRange(void **batch, int N) LOCKS_EXCLUDED(lock_);
+
+ // Fill a prefix of batch[0..N-1] with up to N elements removed from central
+ // freelist. Return the number of elements removed.
+ int RemoveRange(void **batch, int N) LOCKS_EXCLUDED(lock_);
+
+ // Returns the number of free objects in cache.
+ size_t length() { return static_cast<size_t>(counter_.value()); }
+
+ // Returns the memory overhead (internal fragmentation) attributable
+ // to the freelist. This is memory lost when the size of elements
+ // in a freelist doesn't exactly divide the page-size (an 8192-byte
+ // page full of 5-byte objects would have 2 bytes memory overhead).
+ size_t OverheadBytes();
+
+ // My size class.
+ size_t size_class() const {
+ return size_class_;
+ }
+
+ private:
+ // Release an object to spans.
+ // Returns the object's span if it becomes completely free.
+ Span* ReleaseToSpans(void* object, Span* span)
+ EXCLUSIVE_LOCKS_REQUIRED(lock_);
+
+ // Populate cache by fetching from the page heap.
+ // May temporarily release lock_.
+ void Populate() EXCLUSIVE_LOCKS_REQUIRED(lock_);
+
+ // This lock protects all the mutable data members.
+ absl::base_internal::SpinLock lock_;
+
+ size_t size_class_; // My size class (immutable after Init())
+ size_t object_size_;
+ size_t objects_per_span_;
+
+ // The following are kept as StatsCounters so that they can be read without
+ // acquiring a lock. Updates are guarded by lock_, so writes use LossyAdd for
+ // speed; the lock still guarantees accuracy.
+
+ // Num free objects in cache entry
+ tcmalloc_internal::StatsCounter counter_;
+ // Num spans owned by this freelist (including spans whose freelist is
+ // currently empty and therefore not linked into nonempty_)
+ tcmalloc_internal::StatsCounter num_spans_;
+
+ SpanList nonempty_ GUARDED_BY(lock_); // Dummy header for non-empty spans
+
+ CentralFreeList(const CentralFreeList&) = delete;
+ CentralFreeList& operator=(const CentralFreeList&) = delete;
+};
+
+} // namespace tcmalloc
+
+#endif // TCMALLOC_CENTRAL_FREELIST_H_
diff --git a/tcmalloc/common.cc b/tcmalloc/common.cc
new file mode 100644
index 000000000..b42de5843
--- /dev/null
+++ b/tcmalloc/common.cc
@@ -0,0 +1,162 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "tcmalloc/common.h"
+
+#include "tcmalloc/experiment.h"
+#include "tcmalloc/runtime_size_classes.h"
+#include "tcmalloc/sampler.h"
+
+namespace tcmalloc {
+
+// Loads size classes from the environment variable, if present and valid,
+// and returns true. Returns false if the variable is absent or invalid.
+bool SizeMap::MaybeRunTimeSizeClasses() {
+ SizeClassInfo parsed[kNumClasses];
+ int num_classes = MaybeSizeClassesFromEnv(kMaxSize, kNumClasses, parsed);
+ if (!ValidSizeClasses(num_classes, parsed)) {
+ return false;
+ }
+
+ if (num_classes != kNumClasses) {
+ // TODO(b/122839049) - Add tests for num_classes < kNumClasses before
+ // allowing that case.
+ Log(kLog, __FILE__, __LINE__, "Can't change the number of size classes",
+ num_classes, kNumClasses);
+ return false;
+ }
+
+ SetSizeClasses(num_classes, parsed);
+ Log(kLog, __FILE__, __LINE__, "Loaded valid Runtime Size classes");
+ return true;
+}
+
+void SizeMap::SetSizeClasses(int num_classes, const SizeClassInfo* parsed) {
+ class_to_size_[0] = 0;
+ class_to_pages_[0] = 0;
+ num_objects_to_move_[0] = 0;
+
+ for (int c = 1; c < num_classes; c++) {
+ class_to_size_[c] = parsed[c].size;
+ class_to_pages_[c] = parsed[c].pages;
+ num_objects_to_move_[c] = parsed[c].num_to_move;
+ }
+
+ // Fill any unspecified size classes with the largest size
+ // from the static definitions.
+ for (int x = num_classes; x < kNumClasses; x++) {
+ class_to_size_[x] = kSizeClasses[kNumClasses - 1].size;
+ class_to_pages_[x] = kSizeClasses[kNumClasses - 1].pages;
+ auto num_to_move = kSizeClasses[kNumClasses - 1].num_to_move;
+ if (IsExperimentActive(Experiment::TCMALLOC_LARGE_NUM_TO_MOVE)) {
+ num_to_move = std::min(kMaxObjectsToMove, 4 * num_to_move);
+ }
+ num_objects_to_move_[x] = num_to_move;
+ }
+}
+
+// Return true if all size classes meet the requirements for alignment
+// ordering and min and max values.
+bool SizeMap::ValidSizeClasses(int num_classes, const SizeClassInfo* parsed) {
+ if (num_classes <= 0) {
+ return false;
+ }
+ for (int c = 1; c < num_classes; c++) {
+ size_t class_size = parsed[c].size;
+ size_t pages = parsed[c].pages;
+ size_t num_objects_to_move = parsed[c].num_to_move;
+ // Each size class must be larger than the previous size class.
+ if (class_size <= parsed[c - 1].size) {
+ Log(kLog, __FILE__, __LINE__, "Non-increasing size class", c,
+ parsed[c - 1].size, class_size);
+ return false;
+ }
+ if (class_size > kMaxSize) {
+ Log(kLog, __FILE__, __LINE__, "size class too big", c, class_size,
+ kMaxSize);
+ return false;
+ }
+ // Check required alignment
+ size_t alignment = 128;
+ if (class_size <= kMultiPageSize) {
+ alignment = kAlignment;
+ } else if (class_size <= SizeMap::kMaxSmallSize) {
+ alignment = kMultiPageAlignment;
+ }
+ if ((class_size & (alignment - 1)) != 0) {
+ Log(kLog, __FILE__, __LINE__, "Not aligned properly", c, class_size,
+ alignment);
+ return false;
+ }
+ if (class_size <= kMultiPageSize && pages != 1) {
+ Log(kLog, __FILE__, __LINE__, "Multiple pages not allowed", class_size,
+ pages, kMultiPageSize);
+ return false;
+ }
+ if (pages >= 256) {
+ Log(kLog, __FILE__, __LINE__, "pages limited to 255", pages);
+ return false;
+ }
+ if (num_objects_to_move > kMaxObjectsToMove) {
+ Log(kLog, __FILE__, __LINE__, "num objects to move too large",
+ num_objects_to_move, kMaxObjectsToMove);
+ return false;
+ }
+ }
+ // Last size class must be able to hold kMaxSize.
+ if (parsed[num_classes - 1].size < kMaxSize) {
+ Log(kLog, __FILE__, __LINE__, "last class doesn't cover kMaxSize",
+ num_classes - 1, parsed[num_classes - 1].size, kMaxSize);
+ return false;
+ }
+ return true;
+}
+
+// Initialize the mapping arrays
+void SizeMap::Init() {
+ // Do some sanity checking on add_amount[]/shift_amount[]/class_array[]
+ if (ClassIndex(0) != 0) {
+ Log(kCrash, __FILE__, __LINE__,
+ "Invalid class index for size 0", ClassIndex(0));
+ }
+ if (ClassIndex(kMaxSize) >= sizeof(class_array_)) {
+ Log(kCrash, __FILE__, __LINE__,
+ "Invalid class index for kMaxSize", ClassIndex(kMaxSize));
+ }
+
+ static_assert(kAlignment <= 16, "kAlignment is too large");
+
+ if (IsExperimentActive(Experiment::TCMALLOC_SANS_56_SIZECLASS)) {
+ SetSizeClasses(kNumClasses, kExperimentalSizeClasses);
+ } else {
+ SetSizeClasses(kNumClasses, kSizeClasses);
+ }
+ MaybeRunTimeSizeClasses();
+
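+ // Build class_array_: every kAlignment-granular size from just above the
+ // previous class's upper bound through class_to_size_[c] maps to class c.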
+ int next_size = 0;
+ for (int c = 1; c < kNumClasses; c++) {
+ const int max_size_in_class = class_to_size_[c];
+
+ for (int s = next_size; s <= max_size_in_class; s += kAlignment) {
+ class_array_[ClassIndex(s)] = c;
+ }
+ next_size = max_size_in_class + kAlignment;
+ if (next_size > kMaxSize) {
+ break;
+ }
+ }
+}
+
+} // namespace tcmalloc
diff --git a/tcmalloc/common.h b/tcmalloc/common.h
new file mode 100644
index 000000000..81344b305
--- /dev/null
+++ b/tcmalloc/common.h
@@ -0,0 +1,455 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+// Common definitions for tcmalloc code.
+
+#ifndef TCMALLOC_COMMON_H_
+#define TCMALLOC_COMMON_H_
+
+#include <stddef.h>
+#include <stdint.h>
+#include <algorithm>
+
+#include "absl/base/attributes.h"
+#include "absl/base/internal/spinlock.h"
+#include "absl/base/optimization.h"
+#include "tcmalloc/internal/bits.h"
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/size_class_info.h"
+
+// Type that can hold a page number
+typedef uintptr_t PageID;
+
+// Type that can hold the length of a run of pages
+typedef uintptr_t Length;
+
+//-------------------------------------------------------------------
+// Configuration
+//-------------------------------------------------------------------
+
+// There are four different models for tcmalloc which are created by defining a
+// set of constant variables differently:
+//
+// DEFAULT:
+// The default configuration strives for good performance while trying to
+// minimize fragmentation. It uses a smaller page size to reduce
+// fragmentation, but allocates per-thread and per-cpu capacities similar to
+// TCMALLOC_LARGE_PAGES / TCMALLOC_256K_PAGES.
+//
+// TCMALLOC_LARGE_PAGES:
+// Larger page sizes increase the bookkeeping granularity used by TCMalloc for
+// its allocations. This can reduce PageMap size and traffic to the
+// innermost cache (the page heap), but can increase memory footprints. As
+// TCMalloc will not reuse a page for a different allocation size until the
+// entire page is deallocated, this can be a source of increased memory
+// fragmentation.
+//
+// Historically, larger page sizes improved lookup performance for the
+// pointer-to-size lookup in the PageMap that was part of the critical path.
+// With most deallocations leveraging C++14's sized delete feature
+// (https://isocpp.org/files/papers/n3778.html), this optimization is less
+// significant.
+//
+// TCMALLOC_256K_PAGES
+// This configuration uses an even larger page size (256KB) as the unit of
+// accounting granularity.
+//
+// TCMALLOC_SMALL_BUT_SLOW:
+// Used for situations where minimizing the memory footprint is the most
+// desirable attribute, even at the cost of performance.
+//
+// The constants that vary between models are:
+//
+// kPageShift - Shift amount used to compute the page size.
+// kNumClasses - Number of size classes serviced by bucket allocators
+// kMaxSize - Maximum size serviced by bucket allocators (thread/cpu/central)
+// kMinThreadCacheSize - The minimum size in bytes of each ThreadCache.
+// kMaxThreadCacheSize - The maximum size in bytes of each ThreadCache.
+// kDefaultOverallThreadCacheSize - The maximum combined size in bytes of all
+// ThreadCaches for an executable.
+// kStealAmount - The number of bytes one ThreadCache will steal from another
+// when the first ThreadCache is forced to Scavenge(), delaying the next
+// call to Scavenge for this thread.
+
+// Older configurations had their own customized macros. Convert them into
+// a page-shift parameter that is checked below.
+
+#ifndef TCMALLOC_PAGE_SHIFT
+#ifdef TCMALLOC_SMALL_BUT_SLOW
+#define TCMALLOC_PAGE_SHIFT 12
+#define TCMALLOC_USE_PAGEMAP3
+#elif defined(TCMALLOC_256K_PAGES)
+#define TCMALLOC_PAGE_SHIFT 18
+#elif defined(TCMALLOC_LARGE_PAGES)
+#define TCMALLOC_PAGE_SHIFT 15
+#else
+#define TCMALLOC_PAGE_SHIFT 13
+#endif
+#else
+#error "TCMALLOC_PAGE_SHIFT is an internal macro!"
+#endif
+
+#if TCMALLOC_PAGE_SHIFT == 12
+static const size_t kPageShift = 12;
+static const size_t kNumClasses = 46;
+static const size_t kMaxSize = 8 << 10;
+static const size_t kMinThreadCacheSize = 4 * 1024;
+static const size_t kMaxThreadCacheSize = 64 * 1024;
+static const size_t kMaxCpuCacheSize = 20 * 1024;
+static const size_t kDefaultOverallThreadCacheSize = kMaxThreadCacheSize;
+static const size_t kStealAmount = kMinThreadCacheSize;
+static const size_t kDefaultProfileSamplingRate = 1 << 19;
+static const size_t kMinPages = 2;
+#elif TCMALLOC_PAGE_SHIFT == 15
+static const size_t kPageShift = 15;
+static const size_t kNumClasses = 78;
+static const size_t kMaxSize = 256 * 1024;
+static const size_t kMinThreadCacheSize = kMaxSize * 2;
+static const size_t kMaxThreadCacheSize = 4 << 20;
+static const size_t kMaxCpuCacheSize = 3 * 1024 * 1024;
+static const size_t kDefaultOverallThreadCacheSize = 8u * kMaxThreadCacheSize;
+static const size_t kStealAmount = 1 << 16;
+static const size_t kDefaultProfileSamplingRate = 1 << 21;
+static const size_t kMinPages = 8;
+#elif TCMALLOC_PAGE_SHIFT == 18
+static const size_t kPageShift = 18;
+static const size_t kNumClasses = 89;
+static const size_t kMaxSize = 256 * 1024;
+static const size_t kMinThreadCacheSize = kMaxSize * 2;
+static const size_t kMaxThreadCacheSize = 4 << 20;
+static const size_t kMaxCpuCacheSize = 3 * 1024 * 1024;
+static const size_t kDefaultOverallThreadCacheSize = 8u * kMaxThreadCacheSize;
+static const size_t kStealAmount = 1 << 16;
+static const size_t kDefaultProfileSamplingRate = 1 << 21;
+static const size_t kMinPages = 8;
+#elif TCMALLOC_PAGE_SHIFT == 13
+static const size_t kPageShift = 13;
+static const size_t kNumClasses = 86;
+static const size_t kMaxSize = 256 * 1024;
+static const size_t kMinThreadCacheSize = kMaxSize * 2;
+static const size_t kMaxThreadCacheSize = 4 << 20;
+static const size_t kMaxCpuCacheSize = 3 * 1024 * 1024;
+static const size_t kDefaultOverallThreadCacheSize = 8u * kMaxThreadCacheSize;
+static const size_t kStealAmount = 1 << 16;
+static const size_t kDefaultProfileSamplingRate = 1 << 21;
+static const size_t kMinPages = 8;
+#else
+#error "Unsupported TCMALLOC_PAGE_SHIFT value!"
+#endif
+
+// Minimum/maximum number of batches in TransferCache per size class.
+// Actual numbers depend on a number of factors; see TransferCache::Init
+// for details.
+static const size_t kMinObjectsToMove = 2;
+static const size_t kMaxObjectsToMove = 128;
+
+static const size_t kPageSize = 1 << kPageShift;
+// Verify that the page size used is at least 8x smaller than the maximum
+// element size in the thread cache. This guarantees at most 12.5% internal
+// fragmentation (1/8). When page size is 256k (kPageShift == 18), the benefit
+// of increasing kMaxSize to be a multiple of kPageSize is unclear. Object size
+// profile data indicates that the number of simultaneously live objects (of
+// size >= 256k) tends to be very small. Keeping those objects as 'large'
+// objects won't cause too much memory waste, while heap memory reuse can be
+// improved. Increasing kMaxSize to be too large has another bad side effect --
+// the thread cache pressure is increased, which will in turn increase traffic
+// between central cache and thread cache, leading to performance degradation.
+static_assert((kMaxSize / kPageSize) >= kMinPages || kPageShift >= 18,
+ "Ratio of kMaxSize / kPageSize is too small");
+
+static const size_t kAlignment = 8;
+// log2 (kAlignment)
+static const size_t kAlignmentShift =
+ tcmalloc::tcmalloc_internal::Bits::Log2Ceiling(kAlignment);
+// For all span-lengths < kMaxPages we keep an exact-size list.
+static const size_t kMaxPages = 1 << (20 - kPageShift);
+
+// The number of times that a deallocation can cause a freelist to
+// go over its max_length() before shrinking max_length().
+static const int kMaxOverages = 3;
+
+// Maximum length we allow a per-thread free-list to have before we
+// move objects from it into the corresponding central free-list. We
+// want this big to avoid locking the central free-list too often. It
+// should not hurt to make this list somewhat big because the
+// scavenging code will shrink it down when its contents are not in use.
+static const int kMaxDynamicFreeListLength = 8192;
+
+static const Length kMaxValidPages = (~static_cast<Length>(0)) >> kPageShift;
+
+#if defined __x86_64__
+// All current and planned x86_64 processors only look at the lower 48 bits
+// in virtual to physical address translation. The top 16 are thus unused.
+// TODO(b/134686025): Under what operating systems can we increase it safely to
+// 17? This lets us use smaller page maps. On first allocation, a 36-bit page
+// map uses only 96 KB instead of the 4.5 MB used by a 52-bit page map.
+static const int kAddressBits = (sizeof(void*) < 8 ? (8 * sizeof(void*)) : 48);
+#elif defined __powerpc64__ && defined __linux__
+// Linux(4.12 and above) on powerpc64 supports 128TB user virtual address space
+// by default, and up to 512TB if user space opts in by specifying a hint in mmap.
+// See comments in arch/powerpc/include/asm/processor.h
+// and arch/powerpc/mm/mmap.c.
+static const int kAddressBits = (sizeof(void*) < 8 ? (8 * sizeof(void*)) : 49);
+#elif defined __aarch64__ && defined __linux__
+// According to Documentation/arm64/memory.txt of kernel 3.16,
+// AARCH64 kernel supports 48-bit virtual addresses for both user and kernel.
+static const int kAddressBits = (sizeof(void*) < 8 ? (8 * sizeof(void*)) : 48);
+#else
+static const int kAddressBits = 8 * sizeof(void*);
+#endif
+
+namespace tcmalloc {
+#if defined(__x86_64__)
+// x86 has 2 MiB huge pages
+static const size_t kHugePageShift = 21;
+#elif defined(__PPC64__)
+static const size_t kHugePageShift = 24;
+#elif defined __aarch64__ && defined __linux__
+static const size_t kHugePageShift = 21;
+#else
+// ...whatever, guess something big-ish
+static const size_t kHugePageShift = 21;
+#endif
+
+static const size_t kHugePageSize = static_cast<size_t>(1) << kHugePageShift;
+static const size_t kPagesPerHugePage = static_cast<size_t>(1)
+ << (kHugePageShift - kPageShift);
+static constexpr uintptr_t kTagMask = uintptr_t{1}
+ << std::min(kAddressBits - 4, 42);
+
+#if !defined(TCMALLOC_SMALL_BUT_SLOW) && __WORDSIZE != 32
+// Always allocate at least a huge page
+static const size_t kMinSystemAlloc = kHugePageSize;
+static const size_t kMinMmapAlloc = 1 << 30; // mmap() in 1GiB ranges.
+#else
+// Allocate in units of 2MiB. This is the size of a huge page for x86, but
+// not for Power.
+static const size_t kMinSystemAlloc = 2 << 20;
+// mmap() in units of 32MiB. This is a multiple of huge page size for
+// both x86 (2MiB) and Power (16MiB)
+static const size_t kMinMmapAlloc = 32 << 20;
+#endif
+
+static_assert(kMinMmapAlloc % kMinSystemAlloc == 0,
+ "Minimum mmap allocation size is not a multiple of"
+ " minimum system allocation size");
+
+// Convert byte size into pages. This won't overflow, but may return
+// an unreasonably large value if bytes is huge enough.
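+// For example, with the default 8 KiB pages (kPageShift == 13), pages(1) == 1
+// and pages(8193) == 2.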
+inline Length pages(size_t bytes) {
+ return (bytes >> kPageShift) +
+ ((bytes & (kPageSize - 1)) > 0 ? 1 : 0);
+}
+
+// Returns true if ptr is tagged.
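+// Note that the kTagMask bit is clear for tagged memory, so the comparison
+// against zero below is intentional.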
+inline bool IsTaggedMemory(const void* ptr) {
+ return (reinterpret_cast<uintptr_t>(ptr) & kTagMask) == 0;
+}
+
+// Size-class information + mapping
+class SizeMap {
+ public:
+ // All size classes <= 512 in all configs always have 1 page spans.
+ static const size_t kMultiPageSize = 512;
+ // Min alignment for all size classes > kMultiPageSize in all configs.
+ static const size_t kMultiPageAlignment = 64;
+ // log2 (kMultiPageAlignment)
+ static const size_t kMultiPageAlignmentShift =
+ tcmalloc::tcmalloc_internal::Bits::Log2Ceiling(kMultiPageAlignment);
+
+ private:
+ //-------------------------------------------------------------------
+ // Mapping from size to size_class and vice versa
+ //-------------------------------------------------------------------
+
+ // Sizes <= 1024 have an alignment >= 8. So for such sizes we have an
+ // array indexed by ceil(size/8). Sizes > 1024 have an alignment >= 128.
+ // So for these larger sizes we have an array indexed by ceil(size/128).
+ //
+ // We flatten both logical arrays into one physical array and use
+ // arithmetic to compute an appropriate index. The constants used by
+ // ClassIndex() were selected to make the flattening work.
+ //
+ // Examples:
+ // Size Expression Index
+ // -------------------------------------------------------
+ // 0 (0 + 7) / 8 0
+ // 1 (1 + 7) / 8 1
+ // ...
+ // 1024 (1024 + 7) / 8 128
+ // 1025 (1025 + 127 + (120<<7)) / 128 129
+ // ...
+ // 32768 (32768 + 127 + (120<<7)) / 128 376
+ static const int kMaxSmallSize = 1024;
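+ // In the 256 KiB kMaxSize configurations this works out to
+ // ((262144 + 127 + (120 << 7)) >> 7) + 1 = 2169 one-byte entries.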
+ static const size_t kClassArraySize =
+ ((kMaxSize + 127 + (120 << 7)) >> 7) + 1;
+
+ // Batch size is the number of objects to move at once.
+ typedef unsigned char BatchSize;
+
+ // class_array_ is accessed on every malloc, so is very hot. We make it the
+ // first member so that it inherits the overall alignment of a SizeMap
+ // instance. In particular, if we create a SizeMap instance that's cache-line
+ // aligned, this member is also aligned to the width of a cache line.
+ unsigned char class_array_[kClassArraySize];
+
+ // Number of objects to move between a per-thread list and a central
+ // list in one shot. We want this to be not too small so we can
+ // amortize the lock overhead for accessing the central list. Making
+ // it too big may temporarily cause unnecessary memory wastage in the
+ // per-thread free list until the scavenger cleans up the list.
+ BatchSize num_objects_to_move_[kNumClasses];
+
+ // If size is no more than kMaxSize, compute index of the
+ // class_array[] entry for it, putting the class index in output
+ // parameter idx and returning true. Otherwise return false.
+ static inline bool ABSL_ATTRIBUTE_ALWAYS_INLINE ClassIndexMaybe(size_t s,
+ uint32_t* idx) {
+ if (ABSL_PREDICT_TRUE(s <= kMaxSmallSize)) {
+ *idx = (static_cast<uint32_t>(s) + 7) >> 3;
+ return true;
+ } else if (s <= kMaxSize) {
+ *idx = (static_cast<uint32_t>(s) + 127 + (120 << 7)) >> 7;
+ return true;
+ }
+ return false;
+ }
+
+ static inline size_t ClassIndex(size_t s) {
+ uint32_t ret;
+ CHECK_CONDITION(ClassIndexMaybe(s, &ret));
+ return ret;
+ }
+
+ // Mapping from size class to number of pages to allocate at a time
+ unsigned char class_to_pages_[kNumClasses];
+
+ // Mapping from size class to max size storable in that class
+ uint32_t class_to_size_[kNumClasses];
+
+ // If environment variable defined, use it to override sizes classes.
+ // Returns true if all classes defined correctly.
+ bool MaybeRunTimeSizeClasses();
+
+ protected:
+ // Set the given size classes to be used by TCMalloc.
+ void SetSizeClasses(int num_classes, const SizeClassInfo* parsed);
+
+ // Check that the size classes meet all requirements.
+ bool ValidSizeClasses(int num_classes, const SizeClassInfo* parsed);
+
+ // Definition of size class that is set in size_classes.cc
+ static const SizeClassInfo kSizeClasses[kNumClasses];
+
+ // Definition of size class that is set in size_classes.cc
+ static const SizeClassInfo kExperimentalSizeClasses[kNumClasses];
+
+ public:
+ // Constructor should do nothing since we rely on explicit Init()
+ // call, which may or may not be called before the constructor runs.
+ SizeMap() { }
+
+ // Initialize the mapping arrays
+ void Init();
+
+ // Returns the non-zero matching size class for the provided `size`.
+ // Returns true on success; returns false if `size` exceeds the maximum size
+ // class value `kMaxSize`.
+ // Important: this function may return true with *cl == 0 if this
+ // SizeMap instance has not (yet) been initialized.
+ inline bool ABSL_ATTRIBUTE_ALWAYS_INLINE GetSizeClass(size_t size,
+ uint32_t* cl) {
+ uint32_t idx;
+ if (ABSL_PREDICT_TRUE(ClassIndexMaybe(size, &idx))) {
+ *cl = class_array_[idx];
+ return true;
+ }
+ return false;
+ }
+
+ // Returns the size class for size `size` aligned at `align`
+ // Returns true on success. Returns false if either:
+ // - the size exceeds the maximum size class size.
+ // - the alignment is greater than or equal to the default page size
+ // - no matching properly aligned size class is available
+ //
+ // Requires that align is a non-zero power of 2.
+ //
+ // Specifying align = 1 will result in this method using the default
+ // alignment of the size table. Calling this method with a constexpr
+ // value of align = 1 will be optimized by the compiler, resulting in
+ // inlined code identical to calling `GetSizeClass(size, cl)`.
+ inline bool ABSL_ATTRIBUTE_ALWAYS_INLINE GetSizeClass(size_t size,
+ size_t align,
+ uint32_t* cl) {
+ ASSERT(align > 0);
+ ASSERT((align & (align - 1)) == 0);
+
+ if (ABSL_PREDICT_FALSE(align >= kPageSize)) {
+ return false;
+ }
+ if (ABSL_PREDICT_FALSE(!GetSizeClass(size, cl))) {
+ return false;
+ }
+
+ // Predict that size aligned allocs most often directly map to a proper
+ // size class, i.e., multiples of 32, 64, etc, matching our class sizes.
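+ // For example, with size = 40 and align = 64 the scan starts at the size
+ // class for 40 bytes and advances until it finds a class whose size is a
+ // multiple of 64 (the exact classes depend on the configuration).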
+ const size_t mask = (align - 1);
+ do {
+ if (ABSL_PREDICT_TRUE((class_to_size(*cl) & mask) == 0)) {
+ return true;
+ }
+ } while (++*cl < kNumClasses);
+
+ return false;
+ }
+
+ // Returns size class for given size, or 0 if this instance has not been
+ // initialized yet. REQUIRES: size <= kMaxSize.
+ inline size_t ABSL_ATTRIBUTE_ALWAYS_INLINE SizeClass(size_t size) {
+ ASSERT(size <= kMaxSize);
+ uint32_t ret = 0;
+ GetSizeClass(size, &ret);
+ return ret;
+ }
+
+ // Get the byte-size for a specified class. REQUIRES: cl <= kNumClasses.
+ inline size_t ABSL_ATTRIBUTE_ALWAYS_INLINE class_to_size(size_t cl) {
+ ASSERT(cl < kNumClasses);
+ return class_to_size_[cl];
+ }
+
+ // Mapping from size class to number of pages to allocate at a time
+ inline size_t class_to_pages(size_t cl) {
+ ASSERT(cl < kNumClasses);
+ return class_to_pages_[cl];
+ }
+
+ // Number of objects to move between a per-thread list and a central
+ // list in one shot. We want this to be not too small so we can
+ // amortize the lock overhead for accessing the central list. Making
+ // it too big may temporarily cause unnecessary memory wastage in the
+ // per-thread free list until the scavenger cleans up the list.
+ inline SizeMap::BatchSize num_objects_to_move(size_t cl) {
+ ASSERT(cl < kNumClasses);
+ return num_objects_to_move_[cl];
+ }
+};
+
+// Linker initialized, so this lock can be accessed at any time.
+extern absl::base_internal::SpinLock pageheap_lock;
+
+} // namespace tcmalloc
+
+#endif // TCMALLOC_COMMON_H_
diff --git a/tcmalloc/copts.bzl b/tcmalloc/copts.bzl
new file mode 100644
index 000000000..b04cc1236
--- /dev/null
+++ b/tcmalloc/copts.bzl
@@ -0,0 +1,38 @@
+# Copyright 2019 The TCMalloc Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""This package provides default compiler warning flags for the OSS release"""
+
+TCMALLOC_LLVM_FLAGS = [
+ "-Wno-implicit-int-float-conversion",
+ "-Wno-sign-compare",
+ "-Wno-uninitialized",
+ "-Wno-unused-function",
+ "-Wno-unused-variable",
+]
+
+TCMALLOC_GCC_FLAGS = [
+ "-Wno-attribute-alias",
+ "-Wno-sign-compare",
+ "-Wno-uninitialized",
+ "-Wno-unused-function",
+ # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66425
+ "-Wno-unused-result",
+ "-Wno-unused-variable",
+]
+
+TCMALLOC_DEFAULT_COPTS = select({
+ "//tcmalloc:llvm": TCMALLOC_LLVM_FLAGS,
+ "//conditions:default": TCMALLOC_GCC_FLAGS,
+})
diff --git a/tcmalloc/cpu_cache.cc b/tcmalloc/cpu_cache.cc
new file mode 100644
index 000000000..1552fcdb3
--- /dev/null
+++ b/tcmalloc/cpu_cache.cc
@@ -0,0 +1,579 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "tcmalloc/cpu_cache.h"
+
+#include <stdlib.h>
+#include <string.h>
+
+#include <algorithm>
+#include <atomic>
+
+#include "absl/base/dynamic_annotations.h"
+#include "absl/base/internal/spinlock.h"
+#include "absl/base/internal/sysinfo.h"
+#include "absl/base/macros.h"
+#include "absl/base/thread_annotations.h"
+#include "tcmalloc/arena.h"
+#include "tcmalloc/common.h"
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/internal_malloc_extension.h"
+#include "tcmalloc/parameters.h"
+#include "tcmalloc/static_vars.h"
+#include "tcmalloc/transfer_cache.h"
+
+namespace tcmalloc {
+
+using subtle::percpu::GetCurrentCpuUnsafe;
+
+// MaxCapacity() determines how we distribute memory in the per-cpu cache
+// to the various class sizes.
+static size_t MaxCapacity(size_t cl) {
+ // The number of size classes that are commonly used and thus should be
+ // allocated more slots in the per-cpu cache.
+ static constexpr size_t kNumSmall = 10;
+ // The remaining size classes, excluding size class 0.
+ static constexpr size_t kNumLarge = kNumClasses - 1 - kNumSmall;
+ // The memory used for each per-CPU slab is the sum of:
+ // sizeof(std::atomic<size_t>) * kNumClasses
+ // sizeof(void*) * (kSmallObjectDepth + 1) * kNumSmall
+ // sizeof(void*) * (kLargeObjectDepth + 1) * kNumLarge
+ //
+ // Class size 0 has MaxCapacity() == 0, which is the reason for using
+ // kNumClasses - 1 above instead of kNumClasses.
+ //
+ // Each size class region in the slab is preceded by one padding pointer that
+ // points to itself, because prefetch instructions of invalid pointers are
+ // slow. That is accounted for by the +1 for object depths.
+#if defined(TCMALLOC_SMALL_BUT_SLOW)
+ // With SMALL_BUT_SLOW we have 4KiB of per-cpu slab and 46 class sizes we
+ // allocate:
+ // == 8 * 46 + 8 * ((16 + 1) * 10 + (6 + 1) * 35) = 4038 bytes of 4096
+ static const size_t kSmallObjectDepth = 16;
+ static const size_t kLargeObjectDepth = 6;
+#else
+ // We allocate 256KiB per-cpu for pointers to cached per-cpu memory.
+ // Each 256KiB is a subtle::percpu::TcmallocSlab::Slabs
+ // Max(kNumClasses) is 89, so the maximum footprint per CPU is:
+ // 89 * 8 + 8 * ((2048 + 1) * 10 + (152 + 1) * 78 + 88) = 254 KiB
+ static const size_t kSmallObjectDepth = 2048;
+ static const size_t kLargeObjectDepth = 152;
+#endif
+ static_assert(sizeof(std::atomic<size_t>) * kNumClasses +
+ sizeof(void *) * (kSmallObjectDepth + 1) * kNumSmall +
+ sizeof(void *) * (kLargeObjectDepth + 1) * kNumLarge <=
+ (1 << CPUCache::kPerCpuShift),
+ "per-CPU memory exceeded");
+ if (cl == 0 || cl >= kNumClasses) return 0;
+ if (cl <= kNumSmall) {
+ // Small object sizes are very heavily used and need very deep caches for
+ // good performance (well over 90% of malloc calls are for cl <= 10.)
+ return kSmallObjectDepth;
+ }
+
+ return kLargeObjectDepth;
+}
+
+static void *SlabAlloc(size_t size) EXCLUSIVE_LOCKS_REQUIRED(pageheap_lock) {
+ return Static::arena()->Alloc(size);
+}
+
+void CPUCache::Activate() {
+ ASSERT(Static::IsInited());
+ int num_cpus = absl::base_internal::NumCPUs();
+
+ absl::base_internal::SpinLockHolder h(&pageheap_lock);
+
+ resize_ = reinterpret_cast<ResizeInfo *>(
+ Static::arena()->Alloc(sizeof(ResizeInfo) * num_cpus));
+ lazy_slabs_ = Parameters::lazy_per_cpu_caches();
+
+ auto max_cache_size = Parameters::max_per_cpu_cache_size();
+
+ for (int cpu = 0; cpu < num_cpus; ++cpu) {
+ for (int cl = 1; cl < kNumClasses; ++cl) {
+ resize_[cpu].per_class[cl].Init();
+ }
+ resize_[cpu].available.store(max_cache_size, std::memory_order_relaxed);
+ resize_[cpu].last_steal.store(1, std::memory_order_relaxed);
+ }
+
+ freelist_.Init(SlabAlloc, MaxCapacity, lazy_slabs_);
+ Static::ActivateCPUCache();
+}
+
+// Fetch more items from the central cache, refill our local cache,
+// and try to grow it if necessary.
+//
+// This is complicated by the fact that we can only tweak the cache on
+// our current CPU and we might get migrated whenever (in fact, we
+// might already have been migrated since failing to get memory...)
+//
+// So make sure only to make changes to one CPU's cache; at all times,
+// it must be safe to find ourselves migrated (at which point we atomically
+// return memory to the correct CPU.)
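+//
+// A sketch of the pattern used below (assuming, as the code does, that
+// GetCurrentCpuUnsafe() is cheap enough to re-check on every iteration):
+//   do {
+//     ... move up to one batch between the transfer cache and the slab ...
+//   } while (more work left && cpu == GetCurrentCpuUnsafe());
+// If we notice we have migrated, we simply stop early; objects already pushed
+// were pushed with per-CPU operations, so they landed on the CPU that owns
+// that slab.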
+void *CPUCache::Refill(int cpu, size_t cl) {
+ const size_t batch_length = Static::sizemap()->num_objects_to_move(cl);
+
+ // UpdateCapacity can evict objects from other size classes as it tries to
+ // increase capacity of this size class. The objects are returned in
+ // to_return, we insert them into transfer cache at the end of function
+ // (to increase possibility that we stay on the current CPU as we are
+ // refilling the list).
+ size_t returned = 0;
+ ObjectClass to_return[kNumClasses];
+ const size_t target =
+ UpdateCapacity(cpu, cl, batch_length, false, to_return, &returned);
+
+ // Refill target objects in batch_length batches.
+ size_t total = 0;
+ size_t got;
+ size_t i;
+ void *result = nullptr;
+ void *batch[kMaxObjectsToMove];
+ do {
+ const size_t want = std::min(batch_length, target - total);
+ got = Static::transfer_cache()[cl].RemoveRange(batch, want);
+ if (got == 0) {
+ break;
+ }
+ total += got;
+ i = got;
+ if (result == nullptr) {
+ i--;
+ result = batch[i];
+ }
+ if (i) {
+ i -= freelist_.PushBatch(cl, batch, i);
+ if (i != 0) {
+ static_assert(ABSL_ARRAYSIZE(batch) >= kMaxObjectsToMove,
+ "not enough space in batch");
+ Static::transfer_cache()[cl].InsertRange(absl::Span<void *>(batch), i);
+ }
+ }
+ } while (got == batch_length && i == 0 && total < target &&
+ cpu == GetCurrentCpuUnsafe());
+
+ for (size_t i = 0; i < returned; ++i) {
+ ObjectClass *ret = &to_return[i];
+ Static::transfer_cache()[ret->cl].InsertRange(
+ absl::Span<void *>(&ret->obj, 1), 1);
+ }
+
+ return result;
+}
+
+size_t CPUCache::UpdateCapacity(int cpu, size_t cl, size_t batch_length,
+ bool overflow, ObjectClass *to_return,
+ size_t *returned) {
+ // Freelist size balancing strategy:
+ // - We grow a size class only on overflow/underflow.
+ // - We shrink size classes in Steal as it scans all size classes.
+ // - If overflows/underflows happen on a size class, we want to grow its
+ // capacity to at least 2 * batch_length. It enables usage of the
+ // transfer cache and leaves the list half-full after we insert/remove
+ // a batch from the transfer cache.
+ // - We increase capacity beyond 2 * batch_length only when an overflow is
+ // followed by an underflow. That's the only case when we could benefit
+ // from larger capacity -- the overflow and the underflow would collapse.
+ //
+ // Note: we can't tell when we have a perfectly-sized list, because a
+ // perfectly-sized list doesn't hit any slow paths, which makes it look
+ // the same as an inactive list. Eventually we will shrink a perfectly-sized
+ // list a bit and
+ // then it will grow back. This won't happen very frequently for the most
+ // important small sizes, because we will need several ticks before we shrink
+ // it again. Also we will shrink it by 1, but grow by a batch. So we should
+ // have lots of time until we need to grow it again.
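+ //
+ // Example (illustrative numbers): with batch_length == 32, an underflow on
+ // a list with capacity < 32 grows it straight to one full batch; otherwise
+ // the list grows by a single slot until it reaches 2 * 32 == 64. Only an
+ // overflow followed by an underflow grows it by a whole batch beyond that,
+ // up to MaxCapacity().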
+
+ const size_t max_capacity = MaxCapacity(cl);
+ size_t capacity = freelist_.Capacity(cpu, cl);
+ // We assert that the return value, target, is non-zero, so starting from an
+ // initial capacity of zero means we may be populating this core for the
+ // first time.
+ absl::base_internal::LowLevelCallOnce(
+ &resize_[cpu].initialized,
+ [](CPUCache *cache, int cpu) {
+ if (cache->lazy_slabs_) {
+ absl::base_internal::SpinLockHolder h(&cache->resize_[cpu].lock);
+ cache->freelist_.InitCPU(cpu, MaxCapacity);
+ }
+
+ // While we could unconditionally store, a lazy slab population
+ // implementation will require evaluating a branch.
+ cache->resize_[cpu].populated.store(true, std::memory_order_relaxed);
+ },
+ this, cpu);
+ const bool grow_by_one = capacity < 2 * batch_length;
+ uint32_t successive = 0;
+ bool grow_by_batch =
+ resize_[cpu].per_class[cl].Update(overflow, grow_by_one, &successive);
+ if ((grow_by_one || grow_by_batch) && capacity != max_capacity) {
+ size_t increase = 1;
+ if (grow_by_batch) {
+ increase = std::min(batch_length, max_capacity - capacity);
+ } else if (!overflow && capacity < batch_length) {
+ // On underflow we want to grow to at least batch size, because that's
+ // what we want to request from transfer cache.
+ increase = batch_length - capacity;
+ }
+ Grow(cpu, cl, increase, to_return, returned);
+ capacity = freelist_.Capacity(cpu, cl);
+ }
+ // Calculate number of objects to return/request from transfer cache.
+ // Generally we prefer to transfer a single batch, because transfer cache
+ // handles it efficiently. Except for 2 special cases:
+ size_t target = batch_length;
+ // "capacity + 1" because on overflow we already have one object from caller,
+ // so we can return a whole batch even if capacity is one less. Similarly,
+ // on underflow we need to return one object to caller, so we can request
+ // a whole batch even if capacity is one less.
+ if ((capacity + 1) < batch_length) {
+ // If we don't have a full batch, return/request just half. We can't use the
+ // transfer cache anyway, and the cost of insertion into the central
+ // freelist is ~O(number of objects).
+ target = std::max<size_t>(1, (capacity + 1) / 2);
+ } else if (successive > 0 && capacity >= 3 * batch_length) {
+ // If the freelist is large and we are hitting series of overflows or
+ // underflows, return/request several batches at once. On the first overflow
+ // we return 1 batch, on the second -- 2, on the third -- 4 and so on up to
+ // half of the batches we have. We do this to save on the cost of hitting
+ // malloc/free slow path, reduce instruction cache pollution, avoid cache
+ // misses when accessing transfer/central caches, etc.
+ size_t num_batches =
+ std::min<size_t>(1 << std::min<uint32_t>(successive, 10),
+ ((capacity / batch_length) + 1) / 2);
+ target = num_batches * batch_length;
+ }
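+ // For instance (illustrative): with capacity == 6 * batch_length and
+ // successive == 3, this moves min(1 << 3, (6 + 1) / 2) == 3 batches at once.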
+ ASSERT(target != 0);
+ return target;
+}
+
+void CPUCache::Grow(int cpu, size_t cl, size_t desired_increase,
+ ObjectClass *to_return, size_t *returned) {
+ const size_t size = Static::sizemap()->class_to_size(cl);
+ const size_t desired_bytes = desired_increase * size;
+ size_t acquired_bytes;
+
+ // First, there might be unreserved slack. Take what we can.
+ size_t before, after;
+ do {
+ before = resize_[cpu].available.load(std::memory_order_relaxed);
+ acquired_bytes = std::min(before, desired_bytes);
+ after = before - acquired_bytes;
+ } while (!resize_[cpu].available.compare_exchange_strong(
+ before, after, std::memory_order_relaxed, std::memory_order_relaxed));
+
+ if (acquired_bytes < desired_bytes) {
+ acquired_bytes +=
+ Steal(cpu, cl, desired_bytes - acquired_bytes, to_return, returned);
+ }
+
+ // We have all the memory we could reserve. Time to actually do the growth.
+
+ // We might have gotten more than we wanted (stealing from larger sizeclasses)
+ // so don't grow _too_ much.
+ size_t actual_increase = acquired_bytes / size;
+ actual_increase = std::min(actual_increase, desired_increase);
+ // Remember, Grow may not give us all we ask for.
+ size_t increase = freelist_.Grow(cpu, cl, actual_increase, MaxCapacity(cl));
+ size_t increased_bytes = increase * size;
+ if (increased_bytes < acquired_bytes) {
+ // return whatever we didn't use to the slack.
+ size_t unused = acquired_bytes - increased_bytes;
+ resize_[cpu].available.fetch_add(unused, std::memory_order_relaxed);
+ }
+}
+
+// There are rather a lot of policy knobs we could tweak here.
+size_t CPUCache::Steal(int cpu, size_t dest_cl, size_t bytes,
+ ObjectClass *to_return, size_t *returned) {
+ // Steal from other sizeclasses. Try to go in a nice circle.
+ // Complicated by sizeclasses actually being 1-indexed.
+ size_t acquired = 0;
+ size_t start = resize_[cpu].last_steal.load(std::memory_order_relaxed);
+ ASSERT(start < kNumClasses);
+ ASSERT(0 < start);
+ size_t source_cl = start;
+ for (size_t offset = 1; offset < kNumClasses; ++offset) {
+ source_cl = start + offset;
+ if (source_cl >= kNumClasses) {
+ source_cl -= kNumClasses - 1;
+ }
+ ASSERT(0 < source_cl);
+ ASSERT(source_cl < kNumClasses);
+ // Decide if we want to steal source_cl.
+ if (source_cl == dest_cl) {
+ // First, no sense in picking your own pocket.
+ continue;
+ }
+ const size_t capacity = freelist_.Capacity(cpu, source_cl);
+ if (capacity == 0) {
+ // Nothing to steal.
+ continue;
+ }
+ const size_t length = freelist_.Length(cpu, source_cl);
+ const size_t batch_length =
+ Static::sizemap()->num_objects_to_move(source_cl);
+ size_t size = Static::sizemap()->class_to_size(source_cl);
+
+ // Clock-like algorithm to prioritize size classes for shrinking.
+ //
+ // Each size class has a quiescent-ticks counter, which is incremented as we
+ // pass it; the counter is reset to 0 in UpdateCapacity on grow.
+ // If the counter value is 0, then we've just tried to grow the size class,
+ // so it makes little sense to shrink it back. The higher counter value
+ // the longer ago we grew the list and the more probable it is that
+ // the full capacity is unused.
+ //
+ // Then, we calculate "shrinking score", the higher the score the less we
+ // we want to shrink this size class. The score is considerably skewed
+ // towards larger size classes: smaller classes are usually used more
+ // actively and we also benefit less from shrinking smaller classes (steal
+ // less capacity). Then, we also avoid shrinking full freelists as we will
+ // need to evict an object and then go to the central freelist to return it.
+ // Then, we also avoid shrinking freelists that are just above batch size,
+ // because shrinking them will disable transfer cache.
+ //
+ // Finally, we shrink if the ticks counter is >= the score.
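+ //
+ // Example (illustrative): a 64-byte class whose list is full (with capacity
+ // of at least two batches) scores 2 + 1 = 3, so Steal must pass over it at
+ // least three times after its last grow before shrinking it; an 8 KiB class
+ // scores at most 1 and gives up capacity much sooner.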
+ uint32_t qticks = resize_[cpu].per_class[source_cl].Tick();
+ uint32_t score = 0;
+ // Note: the following numbers are based solely on intuition, common sense
+ // and benchmarking results.
+ if (size <= 144) {
+ score = 2 + (length >= capacity) +
+ (length >= batch_length && length < 2 * batch_length);
+ } else if (size <= 1024) {
+ score = 1 + (length >= capacity) +
+ (length >= batch_length && length < 2 * batch_length);
+ } else if (size <= (64 << 10)) {
+ score = (length >= capacity);
+ }
+ if (score > qticks) {
+ continue;
+ }
+
+ if (length >= capacity) {
+ // The list is full, need to evict an object to shrink it.
+ if (to_return == nullptr) {
+ continue;
+ }
+ void *obj = freelist_.Pop(source_cl, NoopUnderflow);
+ if (obj) {
+ ObjectClass *ret = &to_return[*returned];
+ ++(*returned);
+ ret->cl = source_cl;
+ ret->obj = obj;
+ }
+ }
+
+ // Finally, try to shrink (can fail if we were migrated).
+ // We always shrink by 1 object. The idea is that inactive lists will be
+ // shrunk to zero eventually anyway (or they just would not grow in the
+ // first place), but for active lists it does not make sense to aggressively
+ // shuffle capacity all the time.
+ if (freelist_.Shrink(cpu, source_cl, 1) == 1) {
+ acquired += size;
+ }
+
+ if (cpu != GetCurrentCpuUnsafe() || acquired >= bytes) {
+ // can't steal any more or don't need to
+ break;
+ }
+ }
+ // update the hint
+ resize_[cpu].last_steal.store(source_cl, std::memory_order_relaxed);
+ return acquired;
+}
+
+int CPUCache::Overflow(void *ptr, size_t cl, int cpu) {
+ const size_t batch_length = Static::sizemap()->num_objects_to_move(cl);
+ const size_t target =
+ UpdateCapacity(cpu, cl, batch_length, true, nullptr, nullptr);
+ // Return target objects in batch_length batches.
+ size_t total = 0;
+ size_t count = 1;
+ void *batch[kMaxObjectsToMove];
+ batch[0] = ptr;
+ do {
+ size_t want = std::min(batch_length, target - total);
+ if (count < want) {
+ count += freelist_.PopBatch(cl, batch + count, want - count);
+ }
+ if (!count) break;
+
+ total += count;
+ static_assert(ABSL_ARRAYSIZE(batch) >= kMaxObjectsToMove,
+ "not enough space in batch");
+ Static::transfer_cache()[cl].InsertRange(absl::Span<void *>(batch), count);
+ if (count != batch_length) break;
+ count = 0;
+ } while (total < target && cpu == GetCurrentCpuUnsafe());
+ tracking::Report(kFreeTruncations, cl, 1);
+ return 1;
+}
+
+uint64_t CPUCache::UsedBytes(int target_cpu) const {
+ ASSERT(target_cpu >= 0);
+ uint64_t total = 0;
+ for (int cl = 1; cl < kNumClasses; cl++) {
+ int size = Static::sizemap()->class_to_size(cl);
+ total += size * freelist_.Length(target_cpu, cl);
+ }
+ return total;
+}
+
+bool CPUCache::HasPopulated(int target_cpu) const {
+ ASSERT(target_cpu >= 0);
+ return resize_[target_cpu].populated.load(std::memory_order_relaxed);
+}
+
+PerCPUMetadataState CPUCache::MetadataMemoryUsage() const {
+ return freelist_.MetadataMemoryUsage();
+}
+
+uint64_t CPUCache::TotalUsedBytes() const {
+ uint64_t total = 0;
+ for (int cpu = 0, num_cpus = absl::base_internal::NumCPUs(); cpu < num_cpus;
+ ++cpu) {
+ total += UsedBytes(cpu);
+ }
+ return total;
+}
+
+uint64_t CPUCache::TotalObjectsOfClass(size_t cl) const {
+ ASSERT(cl < kNumClasses);
+ uint64_t total_objects = 0;
+ if (cl > 0) {
+ for (int cpu = 0; cpu < absl::base_internal::NumCPUs(); cpu++) {
+ total_objects += freelist_.Length(cpu, cl);
+ }
+ }
+ return total_objects;
+}
+
+uint64_t CPUCache::Unallocated(int cpu) const {
+ return resize_[cpu].available.load(std::memory_order_relaxed);
+}
+
+uint64_t CPUCache::CacheLimit() const {
+ return Parameters::max_per_cpu_cache_size();
+}
+
+struct DrainContext {
+ std::atomic<size_t> *available;
+ uint64_t bytes;
+};
+
+static void DrainHandler(void *arg, size_t cl, void **batch, size_t count,
+ size_t cap) {
+ DrainContext *ctx = static_cast<DrainContext *>(arg);
+ const size_t size = Static::sizemap()->class_to_size(cl);
+ const size_t batch_length = Static::sizemap()->num_objects_to_move(cl);
+ ctx->bytes += count * size;
+ // Drain resets capacity to 0, so return the allocated capacity to that
+ // CPU's slack.
+ ctx->available->fetch_add(cap * size, std::memory_order_relaxed);
+ for (size_t i = 0; i < count; i += batch_length) {
+ size_t n = std::min(batch_length, count - i);
+ Static::transfer_cache()[cl].InsertRange(absl::Span<void *>(batch + i, n),
+ n);
+ }
+}
+
+uint64_t CPUCache::Reclaim(int cpu) {
+ absl::base_internal::SpinLockHolder h(&resize_[cpu].lock);
+
+ // If we haven't populated this core, freelist_.Drain() will touch the memory
+ // (for writing) as part of its locking process. Avoid faulting new pages as
+ // part of a release process.
+ if (!resize_[cpu].populated.load(std::memory_order_relaxed)) {
+ return 0;
+ }
+
+ DrainContext ctx{&resize_[cpu].available, 0};
+ freelist_.Drain(cpu, &ctx, DrainHandler);
+ return ctx.bytes;
+}
+
+void CPUCache::PerClassResizeInfo::Init() {
+ state_.store(0, std::memory_order_relaxed);
+}
+
+bool CPUCache::PerClassResizeInfo::Update(bool overflow, bool grow,
+ uint32_t *successive) {
+ int32_t raw = state_.load(std::memory_order_relaxed);
+ State state;
+ memcpy(&state, &raw, sizeof(state));
+ const bool overflow_then_underflow = !overflow && state.overflow;
+ grow |= overflow_then_underflow;
+ // Reset quiescent ticks for Steal clock algorithm if we are going to grow.
+ State new_state;
+ new_state.overflow = overflow;
+ new_state.quiescent_ticks = grow ? 0 : state.quiescent_ticks;
+ new_state.successive = overflow == state.overflow ? state.successive + 1 : 0;
+ memcpy(&raw, &new_state, sizeof(raw));
+ state_.store(raw, std::memory_order_relaxed);
+ *successive = new_state.successive;
+ return overflow_then_underflow;
+}
+
+uint32_t CPUCache::PerClassResizeInfo::Tick() {
+ int32_t raw = state_.load(std::memory_order_relaxed);
+ State state;
+ memcpy(&state, &raw, sizeof(state));
+ state.quiescent_ticks++;
+ memcpy(&raw, &state, sizeof(raw));
+ state_.store(raw, std::memory_order_relaxed);
+ return state.quiescent_ticks - 1;
+}
+
+static void ActivatePerCPUCaches() {
+ // RunningOnValgrind is a proxy for "is something intercepting malloc."
+ //
+ // If Valgrind et al. are in use, TCMalloc isn't in use and we shouldn't
+ // activate our per-CPU caches.
+ if (RunningOnValgrind()) {
+ return;
+ }
+ if (Parameters::per_cpu_caches() && subtle::percpu::IsFast()) {
+ Static::InitIfNecessary();
+ Static::cpu_cache()->Activate();
+ // no need for this thread cache anymore, I guess.
+ ThreadCache::BecomeIdle();
+ // If there's a problem with this code, let's notice it right away:
+ ::operator delete(::operator new(1));
+ }
+}
+
+class PerCPUInitializer {
+ public:
+ PerCPUInitializer() {
+ ActivatePerCPUCaches();
+ }
+};
+static PerCPUInitializer module_enter_exit;
+
+} // namespace tcmalloc
+
+extern "C" bool MallocExtension_Internal_GetPerCpuCachesActive() {
+ return tcmalloc::Static::CPUCacheActive();
+}
+
+extern "C" int32_t MallocExtension_Internal_GetMaxPerCpuCacheSize() {
+ return tcmalloc::Parameters::max_per_cpu_cache_size();
+}
+
+extern "C" void MallocExtension_Internal_SetMaxPerCpuCacheSize(int32_t value) {
+ tcmalloc::Parameters::set_max_per_cpu_cache_size(value);
+}
diff --git a/tcmalloc/cpu_cache.h b/tcmalloc/cpu_cache.h
new file mode 100644
index 000000000..0bce6aa9c
--- /dev/null
+++ b/tcmalloc/cpu_cache.h
@@ -0,0 +1,237 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef TCMALLOC_CPU_CACHE_H_
+#define TCMALLOC_CPU_CACHE_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <atomic>
+
+#include "absl/base/attributes.h"
+#include "absl/base/call_once.h"
+#include "absl/base/internal/spinlock.h"
+#include "absl/base/optimization.h"
+#include "tcmalloc/common.h"
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/internal/percpu.h"
+#include "tcmalloc/percpu_tcmalloc.h"
+#include "tcmalloc/static_vars.h"
+#include "tcmalloc/thread_cache.h"
+#include "tcmalloc/tracking.h"
+
+namespace tcmalloc {
+
+
+class CPUCache {
+ public:
+ // tcmalloc explicitly initializes its global state (to be safe for
+ // use in global constructors) so our constructor must be trivial;
+ // do all initialization here instead.
+ void Activate();
+
+ // Allocate an object of the given size class. When allocation fails
+ // (from this cache and after running Refill), OOMHandler(size) is
+ // called and its return value is returned from
+ // Allocate. OOMHandler is used to parameterize out-of-memory
+ // handling (raising an exception, returning nullptr, calling
+ // new_handler or anything else). "Passing" OOMHandler in this way
+ // allows Allocate to be used in tail-call position on the fast path,
+ // letting Allocate jump (tail-call) to the slow-path code.
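+ //
+ // For example (illustrative, not a handler defined in this file), a caller
+ // could instantiate Allocate with a handler that simply returns nullptr:
+ //   static void *ReturnNullOnOOM(size_t) { return nullptr; }
+ //   void *p = Static::cpu_cache()->Allocate<ReturnNullOnOOM>(cl);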
+ template <void *OOMHandler(size_t)>
+ void *Allocate(size_t cl);
+
+ // Free an object of the given class.
+ void Deallocate(void *ptr, size_t cl);
+
+ // Give the number of bytes in <cpu>'s cache
+ uint64_t UsedBytes(int cpu) const;
+
+ // Whether <cpu>'s cache has ever been populated with objects
+ bool HasPopulated(int cpu) const;
+
+ PerCPUMetadataState MetadataMemoryUsage() const;
+
+ // Give the number of bytes used in all cpu caches.
+ uint64_t TotalUsedBytes() const;
+
+ // Give the number of objects of a given class in all cpu caches.
+ uint64_t TotalObjectsOfClass(size_t cl) const;
+
+ // Give the number of bytes unallocated to any sizeclass in <cpu>'s cache.
+ uint64_t Unallocated(int cpu) const;
+
+ // Give the per-cpu limit of cache size.
+ uint64_t CacheLimit() const;
+
+ // Empty out the cache on <cpu>; move all objects to the central
+ // cache. (If other threads run concurrently on that cpu, we can't
+ // guarantee it will be fully empty on return, but if the cpu is
+ // unused, this will eliminate stranded memory.) Returns the number
+ // of bytes we sent back. This function is thread safe.
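+ //
+ // For example (illustrative), a caller releasing memory from idle CPUs
+ // might loop over all of them:
+ //   for (int cpu = 0; cpu < absl::base_internal::NumCPUs(); ++cpu) {
+ //     released_bytes += Static::cpu_cache()->Reclaim(cpu);
+ //   }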
+ uint64_t Reclaim(int cpu);
+
+ // Determine the number of bits we should use for allocating per-cpu cache.
+ // The amount of per-cpu cache is 2 ^ kPerCpuShift.
+#if defined(TCMALLOC_SMALL_BUT_SLOW)
+ static const size_t kPerCpuShift = 12;
+#else
+ static const size_t kPerCpuShift = 18;
+#endif
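+ // That is, 2^12 = 4 KiB of slab per CPU for TCMALLOC_SMALL_BUT_SLOW and
+ // 2^18 = 256 KiB otherwise (see the accounting in cpu_cache.cc).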
+
+ private:
+ // Per-size-class freelist resizing info.
+ class PerClassResizeInfo {
+ public:
+ void Init();
+ // Updates info on overflow/underflow.
+ // <overflow> says whether it's an overflow or an underflow.
+ // <grow> is the caller's approximation of whether we want to grow capacity.
+ // <successive> will contain the number of successive overflows/underflows.
+ // Returns whether capacity needs to be grown aggressively (i.e. by batch
+ // size).
+ bool Update(bool overflow, bool grow, uint32_t *successive);
+ uint32_t Tick();
+
+ private:
+ std::atomic<int32_t> state_;
+ // state_ layout:
+ struct State {
+ // last overflow/underflow?
+ uint32_t overflow : 1;
+ // number of times Steal checked this class since the last grow
+ uint32_t quiescent_ticks : 15;
+ // number of successive overflows/underflows
+ uint32_t successive : 16;
+ };
+ static_assert(sizeof(State) == sizeof(std::atomic<int32_t>),
+ "size mismatch");
+ };
+
+ subtle::percpu::TcmallocSlab<kPerCpuShift, kNumClasses> freelist_;
+
+ struct ResizeInfoUnpadded {
+ // cache space on this CPU we're not using. Modify atomically;
+ // we don't want to lose space.
+ std::atomic<size_t> available;
+ // this is just a hint
+ std::atomic<size_t> last_steal;
+ // Track whether we have initialized this CPU.
+ absl::once_flag initialized;
+ // Track whether we have ever populated this CPU.
+ std::atomic<bool> populated;
+ // For cross-cpu operations.
+ absl::base_internal::SpinLock lock;
+ PerClassResizeInfo per_class[kNumClasses];
+ };
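+ // Pad each CPU's ResizeInfo out to a whole number of cache lines so that
+ // entries for adjacent CPUs do not share a cache line.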
+ struct ResizeInfo : ResizeInfoUnpadded {
+ char pad[ABSL_CACHELINE_SIZE -
+ sizeof(ResizeInfoUnpadded) % ABSL_CACHELINE_SIZE];
+ };
+ // Tracking data for each CPU's cache resizing efforts.
+ ResizeInfo *resize_;
+ // Track whether we are lazily initializing slabs. We cannot use the latest
+ // value in Parameters, as it can change after initialization.
+ bool lazy_slabs_;
+
+ struct ObjectClass {
+ size_t cl;
+ void *obj;
+ };
+
+ void *Refill(int cpu, size_t cl);
+
+ // This is called after finding a full freelist when attempting to push
+ // on the freelist for sizeclass . The last arg should indicate which
+ // CPU's list was full. Returns 1.
+ int Overflow(void *ptr, size_t cl, int cpu);
+
+ // Called on freelist overflow/underflow on <cpu> to balance cache
+ // capacity between size classes. Returns number of objects to return/request
+ // from transfer cache. <to_return>[0...*returned) will contain objects that
+ // need to be freed.
+ size_t UpdateCapacity(int cpu, size_t cl, size_t batch_length, bool overflow,
+ ObjectClass *to_return, size_t *returned);
+
+ // Tries to obtain up to <desired_increase> objects' worth of freelist space
+ // on <cpu> for <cl>, stealing from other size classes if needed.
+ // <to_return>[0...*returned) will contain objects that need to be freed.
+ void Grow(int cpu, size_t cl, size_t desired_increase, ObjectClass *to_return,
+ size_t *returned);
+
+ // Tries to steal <bytes> for <cl> on <cpu> from other size classes on that
+ // CPU. Returns acquired bytes. <to_return>[0...*returned) will contain
+ // objects that need to be freed.
+ size_t Steal(int cpu, size_t cl, size_t bytes, ObjectClass *to_return,
+ size_t *returned);
+
+ static void *NoopUnderflow(int cpu, size_t cl) { return nullptr; }
+ static int NoopOverflow(int cpu, size_t cl, void *item) { return -1; }
+};
+
+template <void *OOMHandler(size_t)>
+inline void *ABSL_ATTRIBUTE_ALWAYS_INLINE CPUCache::Allocate(size_t cl) {
+ ASSERT(cl > 0);
+
+ tracking::Report(kMallocHit, cl, 1);
+ struct Helper {
+ static void *Underflow(int cpu, size_t cl) {
+ // We've optimistically reported a hit in Allocate; let's undo it and
+ // report a miss instead.
+ tracking::Report(kMallocHit, cl, -1);
+ tracking::Report(kMallocMiss, cl, 1);
+ void *ret = Static::cpu_cache()->Refill(cpu, cl);
+ if (ABSL_PREDICT_FALSE(ret == nullptr)) {
+ size_t size = Static::sizemap()->class_to_size(cl);
+ return OOMHandler(size);
+ }
+ return ret;
+ }
+ };
+ return freelist_.Pop(cl, &Helper::Underflow);
+}
+
+inline void ABSL_ATTRIBUTE_ALWAYS_INLINE CPUCache::Deallocate(void *ptr,
+ size_t cl) {
+ ASSERT(cl > 0);
+ tracking::Report(kFreeHit, cl, 1); // Be optimistic; correct later if needed.
+
+ struct Helper {
+ static int Overflow(int cpu, size_t cl, void *ptr) {
+ // When we reach here we've already optimistically bumped FreeHits.
+ // Fix that.
+ tracking::Report(kFreeHit, cl, -1);
+ tracking::Report(kFreeMiss, cl, 1);
+ return Static::cpu_cache()->Overflow(ptr, cl, cpu);
+ }
+ };
+ freelist_.Push(cl, ptr, Helper::Overflow);
+}
+
+inline bool UsePerCpuCache() {
+ return (Static::CPUCacheActive() &&
+ // We call IsFast() on every non-fastpath'd malloc or free since
+ // IsFast() has the side-effect of initializing the per-thread state
+ // needed for "unsafe" per-cpu operations in case this is the first
+ // time a new thread is calling into tcmalloc.
+ //
+ // If the per-CPU cache for a thread is not initialized, we push
+ // ourselves onto the slow path (if
+ // !defined(TCMALLOC_DEPRECATED_PERTHREAD)) until this occurs. See
+ // fast_alloc's use of TryRecordAllocationFast.
+ subtle::percpu::IsFast());
+}
+
+} // namespace tcmalloc
+#endif // TCMALLOC_CPU_CACHE_H_
diff --git a/tcmalloc/experiment.cc b/tcmalloc/experiment.cc
new file mode 100644
index 000000000..8fe0edb6c
--- /dev/null
+++ b/tcmalloc/experiment.cc
@@ -0,0 +1,157 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "tcmalloc/experiment.h"
+
+#include <string.h>
+
+#include "absl/base/macros.h"
+#include "absl/strings/str_cat.h"
+#include "tcmalloc/internal/logging.h"
+#include "tcmalloc/internal/util.h"
+
+using tcmalloc::internal::kNumExperiments;
+using tcmalloc::tcmalloc_internal::thread_safe_getenv;
+
+namespace tcmalloc {
+namespace {
+
+const char kDelimiter = ',';
+const char kExperiments[] = "BORG_EXPERIMENTS";
+const char kDisableExperiments[] = "BORG_DISABLE_EXPERIMENTS";
+const char kDisableAll[] = "all";
+
+bool LookupExperimentID(absl::string_view label, Experiment* exp) {
+ for (auto config : experiments) {
+ if (config.name == label) {
+ *exp = config.id;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+const bool* GetSelectedExperiments() {
+ static bool by_id[kNumExperiments];
+
+ static const char* active_experiments = thread_safe_getenv(kExperiments);
+ static const char* disabled_experiments =
+ thread_safe_getenv(kDisableExperiments);
+ static const bool* status = internal::SelectExperiments(
+ by_id, active_experiments ? active_experiments : "",
+ disabled_experiments ? disabled_experiments : "");
+ return status;
+}
+
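+// Splits <labels> on commas and invokes f on each token. For example
+// (illustrative), ParseExperiments("FOO,BAR", f) calls f("FOO") and then
+// f("BAR"); an empty string results in a single call with an empty token.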
+template <typename F>
+void ParseExperiments(absl::string_view labels, F f) {
+ absl::string_view::size_type pos = 0;
+ do {
+ absl::string_view token;
+ auto end = labels.find(kDelimiter, pos);
+ if (end == absl::string_view::npos) {
+ token = labels.substr(pos);
+ pos = end;
+ } else {
+ token = labels.substr(pos, end - pos);
+ pos = end + 1;
+ }
+
+ f(token);
+ } while (pos != absl::string_view::npos);
+}
+
+} // namespace
+
+namespace internal {
+
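+// For example (illustrative experiment names):
+//   SelectExperiments(buf, "FOO,BAR", "BAR") leaves only FOO enabled, and
+//   SelectExperiments(buf, "FOO,BAR", "all") clears every entry.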
+const bool* SelectExperiments(bool* buffer, absl::string_view active,
+ absl::string_view disabled) {
+ memset(buffer, 0, sizeof(*buffer) * kNumExperiments);
+
+ ParseExperiments(active, [buffer](absl::string_view token) {
+ Experiment id;
+ if (LookupExperimentID(token, &id)) {
+ buffer[static_cast<int>(id)] = true;
+ }
+ });
+
+ if (disabled == kDisableAll) {
+ memset(buffer, 0, sizeof(*buffer) * kNumExperiments);
+ }
+
+ ParseExperiments(disabled, [buffer](absl::string_view token) {
+ Experiment id;
+ if (LookupExperimentID(token, &id)) {
+ buffer[static_cast<int>(id)] = false;
+ }
+ });
+
+ return buffer;
+}
+
+} // namespace internal
+
+bool IsExperimentActive(Experiment exp) {
+ ASSERT(static_cast<int>(exp) >= 0);
+ ASSERT(exp < Experiment::kMaxExperimentID);
+
+ return GetSelectedExperiments()[static_cast<int>(exp)];
+}
+
+void FillExperimentProperties(
+ std::map<std::string, MallocExtension::Property>* result) {
+ for (const auto& config : experiments) {
+ (*result)[absl::StrCat("tcmalloc.experiment.", config.name)].value =
+ IsExperimentActive(config.id) ? 1 : 0;
+ }
+}
+
+absl::optional<Experiment> FindExperimentByName(absl::string_view name) {
+ for (const auto& config : experiments) {
+ if (name == config.name) {
+ return config.id;
+ }
+ }
+
+ return absl::nullopt;
+}
+
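+// Writes one line of the form (experiment names illustrative):
+//   MALLOC EXPERIMENTS: FOO=1 BAR=0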
+void PrintExperiments(TCMalloc_Printer* printer) {
+ // Index experiments by their positions in the experiments array, rather than
+ // by experiment ID.
+ static bool active[ABSL_ARRAYSIZE(experiments)];
+ static const bool* status = []() {
+ memset(active, 0, sizeof(active));
+ const bool* by_id = GetSelectedExperiments();
+
+ for (int i = 0; i < ABSL_ARRAYSIZE(experiments); i++) {
+ const auto& config = experiments[i];
+ active[i] = by_id[static_cast<int>(config.id)];
+ }
+
+ return active;
+ }();
+
+ printer->printf("MALLOC EXPERIMENTS:");
+ for (int i = 0; i < ABSL_ARRAYSIZE(experiments); i++) {
+ const char* value = status[i] ? "1" : "0";
+ printer->printf(" %s=%s", experiments[i].name, value);
+ }
+
+ printer->printf("\n");
+}
+
+} // namespace tcmalloc
diff --git a/tcmalloc/experiment.h b/tcmalloc/experiment.h
new file mode 100644
index 000000000..f4d5f2a23
--- /dev/null
+++ b/tcmalloc/experiment.h
@@ -0,0 +1,69 @@
+// Copyright 2019 The TCMalloc Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef TCMALLOC_EXPERIMENT_H_
+#define TCMALLOC_EXPERIMENT_H_
+
+#include <stddef.h>
+
+#include <map>