-
Notifications
You must be signed in to change notification settings - Fork 713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mysql-5.6][PR] Range Locking support, MyRocks part, updated #1430
base: fb-mysql-8.0.23
Are you sure you want to change the base?
[mysql-5.6][PR] Range Locking support, MyRocks part, updated #1430
Conversation
Summary: (Range Locking pull request,filed against fb-mysql-8.0.23) This adds a my.cnf parameter, rocksdb_use_range_locking. When it is ON, MyRocks will: - initialize RocksDB to use range-locking lock manager - for all DML operations (including SELECT .. FOR UPDATE) will lock the scanned range before reading/modifying rows. - In range locking mode, there is no snapshot checking (cannot do that for ranges). Instead, MyRocks will read and modify latest committed data, just like InnoDB does (in the code, grep for (start|end)_ignore_ snapshot) - Queries that do not have a finite range to scan, like UPDATE t1 .... ORDER BY t1.key LIMIT n will use a "Locking iterator" which will read rows, lock the range, and re-read the rows. See class LockingIterator. Pull Request resolved: facebook#1185 GitHub Author: Sergei Petrunia <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding Manuel's comments.
@@ -266,6 +269,8 @@ class ha_rocksdb : public my_core::handler, public blob_buffer { | |||
/* Type of locking to apply to rows */ | |||
Rdb_lock_type m_lock_rows; | |||
|
|||
bool m_use_range_locking; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me like this belongs more on the Rdb_transaction
object than on ha_rocksdb
(whereas m_lock_rows
may be different per table in the same transaction).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me like this belongs more on the
Rdb_transaction
object than onha_rocksdb
(whereasm_lock_rows
may be different per table in the same transaction).
My current best understanding is 50-50 re. the best home for this field. In its current form, its value is decided by ha_rocksdb::m_lock_rows
too. If it's changed to be independent (as it already half-way is in the checks), then it can be stored in Rdb_transaction
, but then some of the code paths currently avoid looking at the transaction object at all, so they would get extra get_or_create_tx
calls.
Note that in a multi-statement transaction, the snapshot may have been | ||
allocated by another statement. | ||
*/ | ||
bool m_stmt_ignores_snapshot = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this works well with TTL tables, since it relies on rocksdb tracking oldest snapshot timestamp for us so that old transactions won't see rows expiring while they are still in progress. We won't get this if we don't create rocksdb snapshots though.
rocksdb::Status lock_singlepoint_range(rocksdb::ColumnFamilyHandle *const cf, | ||
const rocksdb::Slice &point) { | ||
// Normally, one needs to "flip" the endpoint type for reverse-ordered CFs. | ||
// But here we are locking just one point so this is not necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this comment is necessarily true. For rev cfs, the locking iterator will always lock adjacent keys using inf_suffix=true.
However, locking using inf_suffix=false
here would mean that the locking iterator would not block modifications from happening.
For example, if the locking iterator iterates from 6->8 and then it locks the range (8, inf_suffix=true)
to (6, inf_suffix=true)
. A second query could delete 8 since it would use the point lock (8, inf_suffix=false)
which falls outside of the locked range.
The following shows this is possible:
CONNECTION 1:
mysql> show create table t;
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t | CREATE TABLE `t` (
`i` int NOT NULL,
`j` int NOT NULL,
PRIMARY KEY (`i`) COMMENT 'rev:cf',
KEY `j` (`j`) COMMENT 'cf'
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> begin; select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
Query OK, 0 rows affected (0.00 sec)
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+
3 rows in set (2.60 sec)
mysql> select * From information_schema.rocksdb_locks;
+------------------+-----------------+---------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY | MODE |
+------------------+-----------------+---------------------------------------+------+
| 3 | 140534119312896 | 0000010780000008:1-0000010780000006:1 | X |
+------------------+-----------------+---------------------------------------+------+
1 row in set (0.00 sec)
CONNECTION 2:
mysql> delete from t where i = 8;
Query OK, 1 row affected (6.58 sec)
CONNECTION 1:
mysql> select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 9 | 9 |
+---+---+
3 rows in set (3.21 sec)
Incidentally, RangeLockManagerBase::TryLock
also has the same behaviour of always setting inf_suffix=false
and it's possible that also introduces similar bugs (This is called when we call get_for_update
, I wonder if we should call get_for_update
with nullptr as value instead of adding a lock_singlepoint_range
function here).
@@ -5516,7 +5686,7 @@ class Rdb_snapshot_status : public Rdb_tx_list_walker { | |||
|
|||
/* Calculate the duration the snapshot has existed */ | |||
int64_t snapshot_timestamp = tx->m_snapshot_timestamp; | |||
if (snapshot_timestamp != 0) { | |||
if (snapshot_timestamp != 0 && tx->has_snapshot()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this information still seems useful even if there's no actual snapshot on the transaction. It might be useful to also indicate whether this transaction has an active snapshot too.
I guess some of this information is in rocksdb_trx, but dogpiles only has show engine status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this information still seems useful even if there's no actual snapshot on the transaction.
Are there any constraints on the output should look like in the case of no snapshot, if tooling parses it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline
@@ -6837,6 +7050,28 @@ static int rocksdb_init_internal(void *const p) { | |||
tx_db_options.write_policy = | |||
static_cast<rocksdb::TxnDBWritePolicy>(rocksdb_write_policy); | |||
|
|||
if (rocksdb_use_range_locking && rocksdb_use_range_lock_manager_as_point) { | |||
// rdb_log_status_error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it useful to keep this commented out?
|
||
iter.set_use_locking() // sets m_iter_should_use_locking=true | ||
iter.seek() // this will re-create m_scan_it to be the right kind | ||
// of iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me whether this workflow is ever useful where we change locking mode between seeks. It seems like the code would be simpler if we just specified if we want a locking iterator or not at construction time, and we wouldn't have to track whether we need to recreate the iterator or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this would result in cleaner iterator code, but it would require postponing the creation of iterator from ha_rocksdb::index_init
right to the point of attempting to take the range lock, and at that point the iterator is already in use. I will look some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks possible. It might result in a separate iterator reorg PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimplemented this without much refactoring, essentially merged reset and set_use_locking.
Rdb_transaction *tx = get_tx_from_thd(m_thd); | ||
return rdb_tx_set_status_error(tx, s, *kd, m_tbl_def); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have most of our error handling code for iterators in is_valid_iterator
. Would be nice to keep it in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions are a bit different:
is_valid_iterator
: works on a RDB-level iterator, has no THD or MyRocks transactioniter_status_to_retval
: works on a MyRocks-level iterator, has THD, transaction, key definition.
Almost all uses follow the pattern of calling is_valid_iterator
first and then iter_status_to_retval
. But there is also calculate_cardinality_table_scan
, which works with an RDB-level iterator directly.
Unless we find a different way to address that last use, the best I can think of is add a call to is_valid_iterator to iter_status_to_retval
, perhaps with a rename, so that it's more natural to use in iteration loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lock the range between [target, (current m_iter position)] and position | ||
the iterator on the first record in it. | ||
|
||
@param call_next false means current iterator position is achieved by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have ha_rocksdb::index_next_with_direction_intern
already with skip_next
that does something similar so I wonder if we should stick to one naming convention.
DEBUG_SYNC(my_core::thd_get_current_thd(), "rocksdb.locking_iter_scan"); | ||
|
||
if (my_core::thd_killed(current_thd)) { | ||
m_status = rocksdb::Status::Aborted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little strange that we end up returning HA_ERR_ROCKSDB_STATUS_ABORTED
instead of the usual HA_ERR_QUERY_INTERRUPTED
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a straightforward way to do the regular THD kill flag handling here, will need to revisit.
- lock the range [-inf; $first_key] | ||
- return, the iterator is positioned on $first_key | ||
|
||
The problem here is that we cannot have "-infinity" bound. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the comparator functions, I would assume that "-infinity" would be Endpoint(slice="", inf_suffix=false)
, though I agree that this function is basically unused for us. Might be worth clarifying what the comment means.
Continued in #1449 |
Summary:
(Range Locking pull request,filed against fb-mysql-8.0.23)
This adds a my.cnf parameter, rocksdb_use_range_locking.
When it is ON, MyRocks will:
Pull Request resolved: #1185
GitHub Author: Sergei Petrunia [email protected]