Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-34062 Implement innodb_log_file_mmap on 64-bit systems #3282

Merged
merged 1 commit into from
Sep 27, 2024

Conversation

dr-m
Copy link
Contributor

@dr-m dr-m commented May 23, 2024

  • The Jira issue number for this PR is: MDEV-34062

Description

In MariaDB Server 10.11 (but not 10.6), copying the InnoDB write-ahead log (innodb_logfile0) is taking an extremely long time, copying at most 3 MiB/s from a RAM disk.

The problem appears to be connected to the rather small default value innodb_log_buffer_size=2m (log_sys.buf_size). If I specify --innodb-log-buffer-size=1g, then xtrabackup_copy_logfile() will process much more data per a system call, the backup would finish in reasonable time during my Sysbench workload, and the top of perf report would look more reasonable as well.

We can do better read the log file via memory-mapped I/O when available. A fallback for file based I/O will remain.

Release Notes

On 64-bit systems, we will introduce a read-only Boolean parameter innodb_log_file_mmap that that will be OFF by default on most platforms. On Linux and FreeBSD the default is innodb_log_file_mmap=ON. When enabled, the InnoDB crash recovery as well as mariadb-backup will use memory-mapped I/O for reading the ib_logfile0, instead of regular file system I/O.

How can this PR be tested?

The added parameter is lightly covered by the test innodb.log_file_size_online.

The code is covered by existing tests in ./mtr --suite=mariabackup. With this fix, the backup during the following is expected to complete in a few seconds:

sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket="$MYSQL_SOCK" --mysql-user="$MYSQL_USER" --mysql-db=test --tables=64 --table_size=100000 &
mariadb-backup -u"$MYSQL_USER" --socket="$MYSQL_SOCK" --backup --target-dir=/dev/shm/backup

In MDEV-34062 you can find results for testing with all 8 combinations of innodb_log_file_mmap (backup and server) and innodb_log_file_buffering (server) when the log resides on a SATA HDD, as well as results for testing the server with each setting.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m requested a review from vaintroub May 23, 2024 13:16
@dr-m dr-m self-assigned this May 23, 2024
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 2 times, most recently from 02aac42 to 617bac1 Compare May 29, 2024 11:07
Comment on lines 1918 to 1930
#if defined HAVE_INNODB_MMAP && defined FALLOC_FL_ZERO_RANGE
if (is_mmap())
{
const size_t ps{my_system_page_size};
ut_ad(buf_free == calc_lsn_offset(get_lsn()));
size_t offset{buf_free & ~(ps - 1)};
const size_t start_offset{calc_lsn_offset(checkpoint_lsn) & ~(ps - 1)};
if (offset == start_offset);
else if (offset < start_offset)
{
madvise(buf + offset, start_offset - offset, MADV_DONTNEED);
fallocate(log.m_file, FALLOC_FL_ZERO_RANGE, offset,
start_offset - offset);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my tests on a HDD (see MDEV-34062), it seems that the fallocate, which aims to prevent a read-on-write problem for a subsequent round of writing the circular ib_logfile0, is making things worse.

The MADV_DONTNEED, which is available on POSIX, seems to help in the innodb_log_file_buffering=ON (no O_DIRECT) case when the entire ib_logfile0 apparently fits in the Linux file system or block cache. So, I would remove the references to FALLOC_FL_ZERO_RANGE and enable this code on all POSIX.

I agree that for many setups, enabling memory-mapped log writes may be a bad idea because page faults during writes would lead to reads of the log file. I think that we should offer this option nevertheless. Possibly, madvise(MADV_SEQUENTIAL) could help, but I can imagine that it could also request the entire ib_logfile0 to be read into cache even when it is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3d2ec55 removes the references to FALLOC_FL_ZERO_RANGE.

In my tests, the impact of this read-before-write does not seem too bad. Besides, the combination of memory-mapped writes in mariadbd and memory-mapped reads in mariadb-backup --backup can give a significant performance improvement, depending on the workload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaintroub says that the impact of read-before-write is bad on Microsoft Windows; 043745b includes some tweaks from his testing. So, we might want to limit the scope to reading the log, in mariadb-backup only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ea87498 addresses the issue on Microsoft Windows by not enabling memory-mapped log access in mariadb-backup by default. On other 64-bit platforms than Windows, it will be enabled by default. On Linux, if one specified MAP_POPULATE, there should be synchronous read-ahead. But we of course do not do that; we prefer on-demand paging.

I think that innodb_log_file_mmap=ON can make sense on any platform when the ib_logfile0 is relatively small (users prefer fast recovery times, do not care about write performance, or there are not too many writes).

@dr-m dr-m marked this pull request as ready for review May 30, 2024 09:55
@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 3 times, most recently from e71cf38 to 043745b Compare June 19, 2024 14:03
Comment on lines +3442 to +3454
if (metadata_to_lsn)
{
if (metadata_to_lsn <= recv_sys.lsn)
return false;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #3370, the metadata_to_lsn will have to be replaced with metadata_last_lsn.

@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 2 times, most recently from a42116b to db576be Compare July 31, 2024 11:33
@dr-m
Copy link
Contributor Author

dr-m commented Jul 31, 2024

The top commit db576be is only for better testing this on our CI. It disables the PMEM interface, so that the new mmap logic will be exercised on Linux, and it enables the mmap logic by default on all platforms. (Except in mariadb-backup on Windows.)

@dr-m dr-m changed the title MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0 MDEV-34062 Implement innodb_log_file_mmap on 64-bit systems Jul 31, 2024
storage/innobase/trx/trx0trx.cc Outdated Show resolved Hide resolved
@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 2 times, most recently from 0ae4544 to 1b03db0 Compare August 5, 2024 09:42
@dr-m
Copy link
Contributor Author

dr-m commented Aug 5, 2024

As requested by @vaintroub, I am removing the logic to allow the regular ib_logfile0 to be written via a memory mapping. For future reference, I created 0ae4544 in case someone might want to enable that logic later.

@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 2 times, most recently from 7526c5a to b77fd30 Compare September 20, 2024 10:41
storage/innobase/log/log0log.cc Show resolved Hide resolved
sql/sql_class.cc Outdated Show resolved Hide resolved
storage/innobase/log/log0recv.cc Show resolved Hide resolved
storage/innobase/trx/trx0trx.cc Outdated Show resolved Hide resolved
storage/innobase/log/log0log.cc Show resolved Hide resolved
storage/innobase/log/log0log.cc Outdated Show resolved Hide resolved
@dr-m dr-m force-pushed the 10.11-MDEV-34062 branch 2 times, most recently from d71200c to 4b49835 Compare September 26, 2024 15:37
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.

We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").

We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.

Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.

The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().

Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.

It should be noted that ever since
commit 685d958 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.

::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.

xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
@dr-m dr-m merged commit 6acada7 into 10.11 Sep 27, 2024
15 of 16 checks passed
@dr-m dr-m deleted the 10.11-MDEV-34062 branch September 27, 2024 04:38
@dr-m
Copy link
Contributor Author

dr-m commented Jan 7, 2025

On popular demand, #3732 enables this also for 32-bit systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants