MDEV-34062 Implement innodb_log_file_mmap on 64-bit systems #3282
02aac42
to
617bac1
Compare
storage/innobase/buf/buf0flu.cc
Outdated
#if defined HAVE_INNODB_MMAP && defined FALLOC_FL_ZERO_RANGE
  if (is_mmap())
  {
    const size_t ps{my_system_page_size};
    ut_ad(buf_free == calc_lsn_offset(get_lsn()));
    size_t offset{buf_free & ~(ps - 1)};
    const size_t start_offset{calc_lsn_offset(checkpoint_lsn) & ~(ps - 1)};
    if (offset == start_offset);
    else if (offset < start_offset)
    {
      madvise(buf + offset, start_offset - offset, MADV_DONTNEED);
      fallocate(log.m_file, FALLOC_FL_ZERO_RANGE, offset,
                start_offset - offset);
Based on my tests on a HDD (see MDEV-34062), it seems that the fallocate, which aims to prevent a read-on-write problem for a subsequent round of writing the circular ib_logfile0, is making things worse.
The MADV_DONTNEED, which is available on POSIX, seems to help in the innodb_log_file_buffering=ON (no O_DIRECT) case when the entire ib_logfile0 apparently fits in the Linux file system or block cache. So, I would remove the references to FALLOC_FL_ZERO_RANGE and enable this code on all POSIX.
I agree that for many setups, enabling memory-mapped log writes may be a bad idea because page faults during writes would lead to reads of the log file. I think that we should offer this option nevertheless. Possibly, madvise(MADV_SEQUENTIAL) could help, but I can imagine that it could also request the entire ib_logfile0 to be read into cache even when it is not needed.
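For readers following the offset arithmetic in the snippet above, here is a minimal standalone sketch of rounding offsets down to a page boundary and discarding a page-aligned range with madvise(MADV_DONTNEED). The helper names page_floor() and discard_range() are hypothetical and are not part of InnoDB; the sketch assumes a POSIX-like system where madvise() and MADV_DONTNEED are available.

```cpp
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// Round an offset down to the start of its page.
// page_size must be a nonzero power of two.
inline size_t page_floor(size_t offset, size_t page_size)
{
  assert(page_size && !(page_size & (page_size - 1)));
  return offset & ~(page_size - 1);
}

// Hypothetical helper (not an InnoDB function): advise the kernel that
// the pages covering [from, to) of the mapping at base are not needed.
// Both bounds are rounded down to page boundaries first, as in the
// reviewed snippet. Returns 0 on success or for an empty range,
// and -1 (with errno set) if madvise() fails.
inline int discard_range(char *base, size_t from, size_t to,
                         size_t page_size)
{
  const size_t start = page_floor(from, page_size);
  const size_t end = page_floor(to, page_size);
  if (start >= end)
    return 0; // nothing to discard
  return madvise(base + start, end - start, MADV_DONTNEED);
}
```

In the reviewed code, the range would run from the page of the current write position to the page of the latest checkpoint, that is, the part of the circular file that is free to be overwritten.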
3d2ec55 removes the references to FALLOC_FL_ZERO_RANGE.
In my tests, the impact of this read-before-write does not seem too bad. Besides, the combination of memory-mapped writes in mariadbd and memory-mapped reads in mariadb-backup --backup can give a significant performance improvement, depending on the workload.
@vaintroub says that the impact of read-before-write is bad on Microsoft Windows; 043745b includes some tweaks from his testing. So, we might want to limit the scope to reading the log, in mariadb-backup only.
ea87498 addresses the issue on Microsoft Windows by not enabling memory-mapped log access in mariadb-backup by default. On 64-bit platforms other than Windows, it will be enabled by default. On Linux, if one specified MAP_POPULATE, there should be synchronous read-ahead. But we of course do not do that; we prefer on-demand paging.
I think that innodb_log_file_mmap=ON can make sense on any platform when the ib_logfile0 is relatively small (users prefer fast recovery times, do not care about write performance, or there are not too many writes).
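To illustrate the difference between on-demand paging and MAP_POPULATE mentioned above, here is a minimal sketch of mapping a whole file read-only. The type ro_mapping and the function map_read_only() are hypothetical names for this example, not InnoDB or mariadb-backup code. MAP_POPULATE is Linux-specific, so the sketch only uses it where the macro is defined.

```cpp
#include <assert.h>
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical result type: data is nullptr if mapping failed.
struct ro_mapping
{
  const char *data;
  size_t size;
};

// Hypothetical helper: map an entire file read-only.
// With populate=false the pages are faulted in on demand.
// With populate=true (and where MAP_POPULATE exists) the kernel is
// asked to read the file ahead while mmap() executes.
inline ro_mapping map_read_only(const char *path, bool populate)
{
  assert(path);
  ro_mapping m{nullptr, 0};
  const int fd = open(path, O_RDONLY);
  if (fd < 0)
    return m;
  struct stat st;
  if (fstat(fd, &st) == 0 && st.st_size > 0)
  {
    int flags = MAP_SHARED;
#ifdef MAP_POPULATE
    if (populate)
      flags |= MAP_POPULATE;
#else
    (void) populate; // no equivalent flag assumed on this platform
#endif
    void *p = mmap(nullptr, size_t(st.st_size), PROT_READ, flags, fd, 0);
    if (p != MAP_FAILED)
    {
      m.data = static_cast<const char*>(p);
      m.size = size_t(st.st_size);
    }
  }
  close(fd); // the mapping remains valid after the descriptor is closed
  return m;
}
```

The caller is expected to release the mapping with munmap(). Passing populate=false corresponds to the on-demand paging that the comment above prefers.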
e71cf38
to
043745b
Compare
  if (metadata_to_lsn)
  {
    if (metadata_to_lsn <= recv_sys.lsn)
      return false;
  }
Related to #3370, the metadata_to_lsn will have to be replaced with metadata_last_lsn.
a42116b
to
db576be
Compare
The top commit db576be is only for better testing this on our CI. It disables the PMEM interface, so that the new
0ae4544
to
1b03db0
Compare
As requested by @vaintroub, I am removing the logic to allow the regular
b955ee7
to
a57c641
Compare
7526c5a
to
b77fd30
Compare
b77fd30
to
8ee30b7
Compare
d71200c
to
4b49835
Compare
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup would spend a lot of time re-reading and re-parsing the log. For reads, it would be beneficial to memory-map the entire ib_logfile0 to the address space (typically 48 bits or 256 TiB) and read it from there, both during --backup and --prepare.

We will introduce the Boolean read-only parameter innodb_log_file_mmap that will be OFF by default on most platforms, to avoid aggressive read-ahead of the entire ib_logfile0 when only a tiny portion would be accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON, because those platforms define a specific mmap(2) option for enabling such read-ahead and therefore it can be assumed that the default would be on-demand paging.

This parameter will only have impact on the initial InnoDB startup and recovery. Any writes to the log will use regular I/O, except when the ib_logfile0 is stored in a specially configured file system that is backed by persistent memory (Linux "mount -o dax").

We also experimented with allowing writes of the ib_logfile0 via a memory mapping and decided against it. A fundamental problem would be unnecessary read-before-write in case of a major page fault, that is, when a new, not yet cached, virtual memory page in the circular ib_logfile0 is being written to. There appears to be no way to tell the operating system that we do not care about the previous contents of the page, or that the page fault handler should just zero it out.

Many references to HAVE_PMEM have been replaced with references to HAVE_INNODB_MMAP.

The predicate log_sys.is_pmem() has been replaced with log_sys.is_mmap() && !log_sys.is_opened().

Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in that an open file handle to ib_logfile0 will be retained. In both code paths, log_sys.is_mmap() will hold. Holding a file handle open will allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since commit 685d958 (MDEV-14425) most 64-bit Linux platforms on our CI (s390x a.k.a. IBM System Z being a notable exception) read and write /dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is persistent memory (mount -o dax). So, the memory mapping based log parsing that this change is enabling by default on Linux and FreeBSD has already been extensively tested on Linux.

::log_mmap(): If a log cannot be opened as PMEM and the desired access is read-only, try to open a read-only memory mapping.

xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile(): Copy the InnoDB log in mariadb-backup --backup from a memory mapped file.
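The following is a rough sketch of how a range of a circular log could be copied out of a memory mapping, including the wrap-around at the end of the file. It is not the actual xtrabackup_copy_mmap_snippet() and does not reproduce the real ib_logfile0 format: the struct circular_log, its fields, and the assumed layout (a fixed-size header followed by a circular payload area, with offsets derived from the LSN modulo the payload capacity) are all assumptions made for illustration.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Assumed layout, for illustration only: header_size bytes of header,
// followed by a circular payload area of capacity bytes.
struct circular_log
{
  const char *base;    // start of the (memory-mapped) file
  size_t header_size;  // bytes before the circular area
  size_t capacity;     // size of the circular area in bytes
  uint64_t first_lsn;  // LSN that corresponds to offset header_size

  // Translate an LSN to a byte offset within the file.
  size_t offset_of(uint64_t lsn) const
  {
    assert(capacity && lsn >= first_lsn);
    return header_size + size_t((lsn - first_lsn) % capacity);
  }

  // Copy the bytes for the LSN range [from, to) into out.
  // The range must not be longer than the circular area.
  void copy(uint64_t from, uint64_t to, char *out) const
  {
    assert(from <= to && to - from <= capacity);
    const size_t len = size_t(to - from);
    const size_t start = offset_of(from);
    // Bytes available before the end of the circular area.
    const size_t until_end = header_size + capacity - start;
    const size_t first = len < until_end ? len : until_end;
    memcpy(out, base + start, first);
    // If the range wraps, the remainder starts right after the header.
    memcpy(out + first, base + header_size, len - first);
  }
};
```

With a mapping, both memcpy() calls read directly from the page cache, so no intermediate buffer of innodb_log_buffer_size is involved in locating the data.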
4b49835
to
6acada7
Compare
On popular demand, #3732 enables this also for 32-bit systems.
Description
In MariaDB Server 10.11 (but not 10.6), copying the InnoDB write-ahead log (ib_logfile0) is taking an extremely long time, copying at most 3 MiB/s from a RAM disk.
The problem appears to be connected to the rather small default value innodb_log_buffer_size=2m (log_sys.buf_size). If I specify --innodb-log-buffer-size=1g, then xtrabackup_copy_logfile() will process much more data per system call, the backup finishes in reasonable time during my Sysbench workload, and the top of perf report looks more reasonable as well.
We can do better by reading the log file via memory-mapped I/O when available. A fallback to file-based I/O will remain.
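As a rough illustration of the idea (memory-mapped reads with a fallback to buffered file I/O), the sketch below scans a file either through a mapping or through pread() calls of a configurable buffer size, and counts how many read calls were needed. The names scan_result, add_bytes() and scan_file() are hypothetical, and the byte sum merely stands in for log parsing; this is not how mariadb-backup is implemented.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

struct scan_result
{
  uint64_t sum;    // sum of all bytes (a stand-in for real parsing)
  size_t reads;    // number of pread() calls that returned data
  bool used_mmap;  // whether the mapping path was taken
};

inline uint64_t add_bytes(uint64_t sum, const unsigned char *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    sum += p[i];
  return sum;
}

// Hypothetical helper: process size bytes of the file behind fd.
// If try_mmap is set and mmap() succeeds, no read calls are issued.
// Otherwise fall back to pread() in chunks of buf_size bytes.
inline scan_result scan_file(int fd, size_t size, size_t buf_size,
                             bool try_mmap)
{
  assert(buf_size > 0);
  scan_result r{0, 0, false};
  if (try_mmap && size)
  {
    void *p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED)
    {
      r.sum = add_bytes(0, static_cast<const unsigned char*>(p), size);
      r.used_mmap = true;
      munmap(p, size);
      return r;
    }
  }
  std::vector<unsigned char> buf(buf_size);
  for (size_t off = 0; off < size; )
  {
    const ssize_t n = pread(fd, buf.data(), buf.size(), off_t(off));
    if (n <= 0)
      break; // error or unexpected end of file
    r.reads++;
    r.sum = add_bytes(r.sum, buf.data(), size_t(n));
    off += size_t(n);
  }
  return r;
}
```

In the fallback path, the number of read calls grows as the buffer shrinks relative to the file, which is consistent with the observation above that a larger innodb_log_buffer_size processes more data per system call.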
Release Notes
On 64-bit systems, we will introduce a read-only Boolean parameter innodb_log_file_mmap that will be OFF by default on most platforms. On Linux and FreeBSD the default is innodb_log_file_mmap=ON. When enabled, InnoDB crash recovery as well as mariadb-backup will use memory-mapped I/O for reading the ib_logfile0, instead of regular file system I/O.
How can this PR be tested?
The added parameter is lightly covered by the test innodb.log_file_size_online. The code is covered by existing tests in ./mtr --suite=mariabackup. With this fix, the backup during the following is expected to complete in a few seconds:
In MDEV-34062 you can find results for testing with all 8 combinations of innodb_log_file_mmap (backup and server) and innodb_log_file_buffering (server) when the log resides on a SATA HDD, as well as results for testing the server with each setting.
Basing the PR against the correct MariaDB version
PR quality check