
Reusable module cache #4621

Open: graydon wants to merge 4 commits into master from reusable-module-cache

Conversation

@graydon (Contributor) commented Jan 11, 2025

This is the stellar-core side of a soroban change to surface the module cache for reuse.

On the core side we:

  1. Add a new mechanism to scan the live BL snapshot for contracts alone (which is a small part of the BL); see the startup sketch after this list
  2. Add a new wrapper type SorobanModuleCache to the core-side Rust code, which holds a soroban_env_host::ModuleCache for each host protocol version we support caching modules for (there is currently only one but there will be more in the future)
  3. Also add a CoreCompilationContext type to contract.rs which carries a Budget and logs errors to the core console logging system. This is sufficient to allow operating the soroban_env_host::ModuleCache from outside the Host.
  4. Pass a SorobanModuleCache into the host function invocation path that core calls during transactions.
  5. Store a long-lived SorobanModuleCache in the LedgerManagerImpl, so that it spans ledgers.
  6. Add a utility class SharedModuleCacheCompiler that does a multithreaded load-all-contracts / populate-the-module-cache, and call this on startup when the LedgerManagerImpl restores its LCL.
  7. Add a couple of command-line helpers that list and forcibly compile all contracts (we might choose to remove these; they were helpful during debugging but are perhaps redundant at this point)
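To make the shape of items 1, 2 and 6 concrete, here is a minimal sketch of the startup path. Everything apart from scanForContractCode and the SharedModuleCacheCompiler class name is an assumption; the snapshot accessor, pushContract and waitForCompletion are illustrative names, not the actual members in this PR.

```cpp
// Illustrative sketch only: the snapshot accessor, pushContract and
// waitForCompletion are hypothetical names, not the PR's actual API.
void
compileAllLiveContractsAtStartup(Application& app)
{
    // Item 6: the compiler owns the long-lived SorobanModuleCache plus a
    // pool of compilation threads; a single producer thread feeds it.
    auto compiler = std::make_shared<SharedModuleCacheCompiler>(app);

    // Item 1: scan only the CONTRACT_CODE entries in the live BucketList
    // snapshot (a small fraction of the BL).
    auto snapshot = app.getBucketManager().getSearchableLiveSnapshot();
    snapshot->scanForContractCode([&](LedgerEntry const& le) {
        // Hand each Wasm blob to the compiler threads, which populate the
        // per-protocol caches inside the SorobanModuleCache (item 2).
        compiler->pushContract(le.data.contractCode().code);
        return Loop::INCOMPLETE; // keep scanning to the end
    });

    compiler->waitForCompletion();
}
```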

The main things left to do here are:

  • Maybe move all this to the soon-to-be-added p23 soroban submodule
  • Decide how we're going to handle growing the cache with new uploads and wire that up
  • Similarly decide how we're going to handle expiry and restoration events as discussed in CAP0062 and wire that up too
  • Write some tests if I can think of any more specific than just "everything existing still works"
  • Write a CAP describing this

I think that's .. kinda it? The reusable module cache is just "not passed in" on p22 and "passed in" on p23, so it should just start working at the p23 boundary.
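A rough picture of that gating at the call site; the function and member names below are assumptions, only meant to show the shape, not the exact code in this PR.

```cpp
// Hypothetical call-site sketch (names assumed): before protocol 23 no shared
// cache is passed and the host behaves as today; from protocol 23 on, the
// long-lived cache owned by LedgerManagerImpl is handed to the invocation.
if (protocolVersionStartsFrom(ledgerVersion, ProtocolVersion::V_23))
{
    invokeHostFunctionWithCache(tx, mSorobanModuleCache);
}
else
{
    invokeHostFunction(tx); // p22 and earlier: host builds its own cache
}
```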

@dmkozh (Contributor) left a comment

IIUC this still needs to be rebased on top of the p23 submodule and have the host change merged, so I haven't looked too closely at the Rust parts (IIUC currently we just don't pass the module cache to the host at all).

Resolved review threads: src/ledger/SharedModuleCacheCompiler.h (×5, some outdated), src/ledger/SharedModuleCacheCompiler.cpp (outdated), src/main/CommandLine.cpp (outdated), src/rust/src/contract.rs
@SirTyson (Contributor) left a comment

I think this generally looks pretty good; the producer/compiler thread divide is a good idea! I think the producer thread needs some changes, though.

To maintain the cache, I think we need to

  1. Compile all live contract modules on startup.
  2. Compile new contract modules during upload/restore.
  3. Evict entries from the cache when the underlying WASM is evicted via State Archival eviction scans (see the sketch after this list).
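For item 3, a minimal sketch of what an eviction-coupled cache GC could look like, assuming a hook that receives the keys evicted by the ledger's archival scan. All names below are placeholders, not existing stellar-core APIs.

```cpp
// Placeholder sketch: drop modules from the reusable cache in the same place
// where the State Archival eviction scan deletes the underlying Wasm, so the
// cache can never hold a module whose CONTRACT_CODE entry is no longer live.
void
onEvictionScanComplete(std::vector<LedgerKey> const& evictedKeys,
                       SorobanModuleCache& moduleCache)
{
    for (auto const& key : evictedKeys)
    {
        if (key.type() == CONTRACT_CODE)
        {
            // Modules are keyed by the hash of the Wasm blob, which is the
            // CONTRACT_CODE ledger key's hash field.
            moduleCache.evictModule(key.contractCode().hash); // assumed method
        }
    }
}
```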

I think decoupling cache GC from eviction events is going to be expensive. If you have some background task that checks whether a given module is live or not, it will have to read from many levels of the BL to determine whether whatever BucketEntry it's looking at is the most up to date. Evicting from the cache when we do State Archival evictions removes this additional read amplification (since the archival scan has to do this multi-level search already), and it also makes it simpler to maintain cache validity.

The drawback to this is that initial cache generation is a little more expensive, as we're limited to a single producer thread that has to iterate through the BL in order and keep track of seen keys. If we don't have a proper GC, we can't add any modules that have already been evicted, since they would cause a memory leak.

Looking at startup as a whole, we have a bunch of tasks that are dominated by BL disk reads, namely Bucket Apply, Bucket Index, and p23's upcoming Soroban state cache. Bucket Index can process all Buckets in parallel, but Bucket Apply, the Soroban state cache, and the module cache all require a single thread iterating the BL in order due to the outdated-keys issue (in the future we could do this in parallel, where each level marks its "last seen key" and lower levels can't make progress beyond their parents' last seen keys, but that's too involved for v1).

Given that we're adding a bunch of work on the startup path and Horizon/RPC have indicated a need for faster startup times in the past, I think it makes sense to condense Bucket Apply, Soroban state cache population, and the module cache producer thread into a single Work that makes a one-shot pass over the BucketList. Especially in a captive-core instance, which we still run in our infra on EBS last I checked, I assume we're going to be disk-bound even with the compilation step, so if we do compilation in the same pass as Bucket Apply we might just get it for free.

I don't think this needs to be in 23.0 (other than the memory leak issue), but if we have this in mind and make the initial version a little more friendly with the other Work tasks that happen on startup, it'll be easier to optimize this later.

@@ -0,0 +1,59 @@
#pragma once
// Copyright 2024 Stellar Development Foundation and contributors. Licensed
Contributor:

Nit: 2025, happy new year!

Contributor Author (graydon):

Fixed

@@ -0,0 +1,161 @@
#include "ledger/SharedModuleCacheCompiler.h"
Contributor:

Nit: copyright

Contributor Author (graydon):

Fixed

Resolved review threads: src/ledger/SharedModuleCacheCompiler.h, src/ledger/SharedModuleCacheCompiler.cpp (outdated)

// Scans contract code entries in the bucket.
Loop
scanForContractCode(std::function<Loop(LedgerEntry const&)> callback) const;
Contributor:

Why does the lambda return a Loop status when it should always be Loop::INCOMPLETE anyway?

Contributor Author (graydon):

In case the lambda wants to stop early (e.g. it's looking for a specific contract, or an error occurred, or whatever).
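For example, a caller that only wants one specific contract can bail out as soon as it sees it. This assumes Loop::COMPLETE is the early-exit value and uses an illustrative snapshot type name.

```cpp
// Example caller that stops the scan early. "SearchableSnapshot" stands in
// for whatever object exposes scanForContractCode; Loop::COMPLETE as the
// early-exit value is assumed here.
bool
bucketListHasContract(SearchableSnapshot const& snap, Hash const& wantedHash)
{
    bool found = false;
    snap.scanForContractCode([&](LedgerEntry const& le) {
        if (sha256(le.data.contractCode().code) == wantedHash)
        {
            found = true;
            return Loop::COMPLETE;   // stop early: we found what we wanted
        }
        return Loop::INCOMPLETE;     // keep scanning
    });
    return found;
}
```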

Contributor Author (graydon):

Oh, and re: thread contention, again I don't think it's a huge deal because it's just not that long. We could cut the thread count down if you want a margin of safety, but when I test it, it just scrolls past instantaneously, like you don't even see it happening.

LOG_INFO(DEFAULT_LOG,
"Launching 1 loading and {} compiling background threads",
mNumThreads);
mApp.postOnBackgroundThread(
Contributor:

There's a memory leak here caused by the BucketList iteration strategy. Currently, we compile each CONTRACT_CODE entry on a per-bucket basis, regardless of whether there is a newer version of that key higher up in the BucketList. This is a problem, as eviction deletes state from the live BucketList by writing a DEADENTRY just like any other BucketList deletion. The loop will ignore the newer DEADENTRY and compile the outdated LIVEENTRY in a lower Bucket. We won't be able to garbage-collect the module either, since the entry has already been evicted, so there won't be an eviction event to get rid of the module. Note that in p23 it's also possible for several copies of the same WASM INITENTRY to exist on different levels of the BucketList, if you restore an entry after it's been evicted but before the DEADENTRY can annihilate the INITENTRY. In this case the same WASM would currently be compiled twice, which might break an assumption or something in the cache.

Contributor Author (graydon):

Oh, huh. So .. I guess I should:

  • Pass the full bucketentry to the callback
  • Keep track, in the compilation loop, of whether I've seen the hash before (live or dead)
  • Only compile live entries whose hash I haven't already seen (live or dead)

I think that should do it?
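Something like this sketch of the producer loop, assuming the callback is changed to take a BucketEntry; the helper and type names here are illustrative, not the actual code.

```cpp
// Illustrative sketch of the dedup pass (helper and type names assumed).
// The newest BucketEntry for each code hash, live or dead, is seen first,
// so older copies on lower levels are skipped and nothing already evicted
// gets compiled.
void
loadAndDedupContracts(SearchableSnapshot const& snapshot,
                      SharedModuleCacheCompiler& compiler)
{
    std::set<Hash> seenCodeHashes;
    snapshot.scanForContractCode([&](BucketEntry const& be) {
        // Both live and dead CONTRACT_CODE entries carry the code hash in
        // their ledger key.
        LedgerKey key = getBucketLedgerKey(be); // assumed helper
        Hash const& codeHash = key.contractCode().hash;
        if (!seenCodeHashes.insert(codeHash).second)
        {
            // A newer record for this Wasm was already seen on a higher
            // level; skip this shadowed (possibly evicted) copy.
            return Loop::INCOMPLETE;
        }
        if (be.type() == LIVEENTRY || be.type() == INITENTRY)
        {
            compiler.pushContract(be.liveEntry().data.contractCode().code);
        }
        // DEADENTRY: recording the hash above is enough; compile nothing.
        return Loop::INCOMPLETE;
    });
}
```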

Contributor Author (graydon):

Ok, did this. Seems to work!

Resolved review thread: src/main/ApplicationUtils.cpp (outdated)
@graydon force-pushed the reusable-module-cache branch from c1ffdf7 to 89d85c2 on January 15, 2025 05:42
@graydon force-pushed the reusable-module-cache branch from 89d85c2 to dd5b7eb on January 23, 2025 06:00
@graydon (Contributor Author) commented Jan 23, 2025

Updated with the following:

  • Rebased
  • Adapted to simplified module cache in recently-merged upstream
  • Moved soroban changes to p23 submodule
  • Incremental expiry and addition of new contracts after ledger close (rough sketch after this list)
  • Less-chatty logging
  • Added support for populating module cache for side-by-side testing with SOROBAN_TEST_EXTRA_PROTOCOL=23
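For reference, the incremental update is roughly this shape. This is a sketch only; the hook point and method names are assumptions, not the code that landed.

```cpp
// Hypothetical post-ledger-close hook: compile Wasm uploaded or restored in
// the ledger that just closed, and drop modules whose Wasm was evicted, so
// the long-lived cache keeps tracking the live BucketList.
void
updateModuleCacheAfterLedgerClose(
    std::vector<LedgerEntry> const& uploadedOrRestoredCode,
    std::vector<LedgerKey> const& evictedCodeKeys,
    SorobanModuleCache& cache)
{
    for (auto const& le : uploadedOrRestoredCode)
    {
        cache.compileAndCache(le.data.contractCode().code); // assumed method
    }
    for (auto const& key : evictedCodeKeys)
    {
        cache.evictModule(key.contractCode().hash); // assumed method
    }
}
```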

@graydon (Contributor Author) commented Jan 24, 2025

Updated with a fix for a test failure when running with background close, as well as all review comments addressed (I think!).

@dmkozh (Contributor) left a comment

LGTM, but CI is still failing
