Skip to content

Commit

Permalink
Merge pull request #14626 from MinaProtocol/sventimir/limit-zkapp-cmd…
Browse files Browse the repository at this point in the history
…s-per-block

RFC: Limit the number of zkApp commends fitting into a block.
  • Loading branch information
Sventimir authored Dec 6, 2023
2 parents 089e1af + 7d03400 commit 79c6253
Showing 1 changed file with 191 additions and 0 deletions.
191 changes: 191 additions & 0 deletions rfcs/0054-limit-zkapp-cmds-per-block.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
## Summary

During the ITN stress testing it was noticed that daemon's memory
consumption tends to increase dramatically after a block containing a
large number of zkApp commands. Before appropriate optimizations can
be developed, we need a temporary solution to prevent nodes crashing
due to insufficient memory. The idea is to limit the number of zkApp
commands that can be included in any single block.

## Motivation

By limiting the number of zkApp commands going into blocks we avoid
the aforementioned issue until a proper solution can be devised and
implemented. The root cause of the issue is that proofs contained
within these commands are stored in the scan state and tend to occupy
a lot of space. Fixing these storage issues won't affect the
protocol, so ideally we want a workaround that doesn't affect the
protocol either, so that at the convenient time we can turn it off
without making a fork.

## Detailed design

Since the solution should not affect the protocol, it should be
implemented at the mempool/block producer boundary. In the mempool
there is `transactions` function, which returns a sequence of
transactions from the mempool in the order of decreasing transaction
fees. The `create_diff` function in `Staged_ledger` then takes that
sequence and tries to apply as many transactions from it as can fit
into the block. In the latter function it is possible to simply
count successfully applied zkApp commands and filter out any
transactions which:
- would violate the set zkApp command limit
- or depend on any previously filtered transactions because of
a nonce increase.

The exact number of zkApps allowed in each block should be set
dynamically, so that we can adjust it without redeploying nodes.
Therefore we are going to provide an authorised GraphQL mutation
to alter the setting at runtime. A sensible default will be compiled
into the binary as well.

The setting can be stored in the Mina_lib configuration and
initialized when the mempool is being created at startup.
The limit will also be controllable through an authenticated GraphQL
mutation, which will update the setting in the configuration at
runtime.

## Drawbacks

Any non-protocol-level solution to this issue has a drawback that a
malicious node operator could modify their node to turn off the
safeguard. However, because the safeguard only affects block
production, it doesn't really matter unless the malicious agent is
going to produce blocks. If so, their chance of conducting a
successful DoS attack against the network is proportional to their
stake, but their incentive to do so is **inversely** proportional
to their stake, which means the more capable one is to conduct the
attack, the more they are going to lose in case of success.

With the safeguard turned on, if the zkApps are coming in faster than
they can be processed, they will stack up in nodes' mempools.
Mempools **will** eventually overflow, which means that either some of
these zkApp commands or some regular user commands will start to
drop. This will likely inflate transaction fees as users will attempt
to get their transactions into increasingly crowded mempools. Also a
lot of transactions will be lost in the process due to mempool
overflow.

Some payments and delegations may wait a long time for inclusion or
even get dropped if they are created by the same fee payer as a
zkApp command waiting for inclusion due to the limit. This cannot
be helped, unfortunately.

Another risk arises when we decide to turn of the limitation, because
the underlying issue is fixed. In order to safely turn the limit
off, a node needs to be updated with the fix. Because this will be
a non-breaking change, nodes may be slow to adopt it. According to
rough estimates, if 16% of the stake upgrades and turns the limit
off, they're capable of taking the non-upgraded nodes down with
memory over-consumption and taking over the network. To prevent this
we have to ensure that at least the majority of the stakeholder
upgrades as quickly as possible.

Finally, the limit introduces an attack vector, where a malicious
party can submit `limit + 1` zkApp commands and arbitrarily many more
commands depending on them, so that they are guaranteed not to be
included. They can set up arbitrarily high fees on these commands
which won't be included in order to kick out other users' transactions
from the mempool and increase the overall fees on the network. An
attacker would have to pay the fees for all their included zkApp
commands, but not for the skipped ones Then they can use another
account to kick out their expensive transactions form the mempool. So
conducting such an attack will still be costly, but not as costly as
it should be.

## Rationale and alternatives

This is a temporary solution until the scan state storage can be
optimised to accommodate storing proofs more efficiently. Therefore
it is more important that it's simple and easy to implement than
to solve the problem in a robust manner. Because the issue endangers
the whole network, some smaller drawbacks are acceptable as long as
the main issue is prevented from happening.

An alternative would be to assign more precise measurement of memory
occupied to each command and limit the amount of the total memory
occupied by commands within a block. Better still, we could compute
the difference in memory occupied by the scan state before and after
each block and make sure it does not go above certain limit. This
would, however, complicate the solution and require more time to
develop it, while it still wouldn't properly solve the problem.
Therefore we should strive for a quick solution which already improves
the situation and wait for the proper fix to come.

## Prior art

The problem of blockchain networks being unable to process incoming
transactions fast enough is a well-known one and there are several
techniques of dealing with it.

One solution is to limit the block size (and hence indirectly the
number of transactions fitting in a single block). The most notable
example here is Bitcoin, which has a hard block size limit of 1MB.
This is often criticized for limiting the network's throughput
severely, but the restriction remains in place nonetheless, because
the consequences of lifting it would be even worse.

Mina also has its own block size limit, however, the problem we are
dealing with here is different in that we've got two distinct
categories of commands, only one of which is affected. Unfortunately,
unless we move zkApp commands to a separate mempool, any limit set on
zkApp commands throughput will also affect user commands by occupying
mempool space (see Drawbacks above).

Another solution is more related to execution time, especially that of
smart contracts, which can - in principle - run indefinitely without
termination and there is no easy way of preventing this without
hindering expressiveness of a smart contract language significantly
(due to insolvability of the halting problem). Major blockchains like
Ethereum or Tezos, instead of limiting block size directly, restrict
the number of computational steps (defined by some VM model) necessary
to replay a block. A block which cannot be replayed in the specified
number of steps is automatically considered invalid.

The operation's execution time is modelled with gas. Each atomic
computation is assigned a gas cost roughly proportional to the time
the VM takes to execute that computation. Simultaneously, a block
is given a hard gas limit and the total gas required by all the
transactions within the block must be below that limit.

Translating this solution to the discussed problem would involve
modelling memory occupied by each operation in the scan state with
some measure (analogous to gas) and then limiting the maximum value
of operations (expressed in that measure) fitting in a block. This
is a more complex solution than the one proposed here and probably
requires significant time to devise the right model. It wouldn't
also remove the problem of zkApp commands stacking in the mempool,
although it might make it less severe by setting a more fine-grained
limit. However, considering that it would still be a temporary
solution, it's probably not worth the effort.

## Unresolved questions

Are the drawbacks described in this document an acceptable trade-off
for preventing crashes due to out-of-memory issues? Is the
alternative, more fine-grained solution viable?

## Testing

The part of the code responsible for applying transactions from the
mempool to the ledger is not properly isolated from the surrounding
code, but it can be isolated and then unit-tested relatively easily.
This is being done in a loop, which gives us an opportunity to test
either any single step of that loop or the loop as a whole (ideally
both). In such tests the most important properties to check would
include:
- if the limit is disabled, zkApp commands are applied normally.
- no zkApp command is applied when the limit is reached.
- no transaction depending on a skipped zkApp command is ever applied.
- list of applied transactions contains at most the limit of zkApp
commands.
- if there's less than limit of zkApp commands in the mempool, more
signed commands can be applied instead.

Additionally an integration test checking inclusion of transactions
from the mempool in a newly created block could be written. Such a
test should in particular ensure that the limit does not affect block
validation. Note that the limit can be changed dynamically, so we
can initialise a network with all nodes having the same settings and
then change it for some of them, thus examining different
configurations.

0 comments on commit 79c6253

Please sign in to comment.