Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically clean up cached bundles directory #5976

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

martintomazic
Copy link
Contributor

@martintomazic martintomazic commented Dec 17, 2024

What

Why

Save on disk usage/ease the maintenance.

How

  1. Regular and detached exploded bundles no longer present in the config, are removed during discovery startup. This way we are not blocking initialization -> Done here go/runtime/bundle: Cleanup bundles on startup #6003
  2. Regular bundles with version lower then active are removed by watching new epochs and checking if bundle registry has bundles lower than the active version for the current epoch.

How to test

e2e

.buildkite/scripts/test_e2e.sh --scenario e2e.runtime.runtime-upgrade

Copy link

netlify bot commented Dec 17, 2024

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 3b4eb79
🔍 Latest deploy log https://app.netlify.com/sites/oasisprotocol-oasis-core/deploys/678a83cff3d9450008593c07

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from 8682e72 to 6e5f668 Compare January 7, 2025 04:52
@martintomazic martintomazic changed the title Martin/feature/cached bundles clean up Periodically clean up cached bundles directory Jan 7, 2025
martintomazic

This comment was marked as outdated.

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch 2 times, most recently from 9ef3941 to 0869adc Compare January 10, 2025 03:27
martintomazic

This comment was marked as outdated.

@kostko kostko linked an issue Jan 10, 2025 that may be closed by this pull request
@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch 3 times, most recently from c8ded6d to f3f52e3 Compare January 10, 2025 19:20
@martintomazic martintomazic marked this pull request as ready for review January 10, 2025 19:27
Copy link
Member

@ptrus ptrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't dive too deeply into the overall logic, but it looks good based on an initial look! I left a couple of minor comments on the code.

go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from f3f52e3 to 4ed1a32 Compare January 11, 2025 22:05
@martintomazic
Copy link
Contributor Author

martintomazic commented Jan 13, 2025

// and cached bundles (that are guaranteed to be exploded) to the registry.
func (d *Discovery) Init() error {
	// Consolidate all bundles in one place, which could be useful
	// if we implement P2P sharing in the future.
	if err := d.copyBundles(); err != nil {
		return err
	}

	// Add copied and cached bundles (that are guaranteed to be exploded)
	// to the registry.
	if err := d.Discover(); err != nil {
		return err
	}

I think will will either have 1. stop copying bundles configured via legacy path or 2. block at init time for the cleanup.

With current design, even if you remove the bundle from the config (bundle path), it was previousy copied as part of d.copyBundle. Meaning any subsequent reboot would call Discovered (even if no longer configured) i.e. add it to the registry. This is done at initialization but is blocking.

Unless we do cleanup before that (we don't as we would block further with cleanup?), you cannot know after that (GetConfiguredRuntimeIDs function) which one is stale and which one is not.

Update this actually has a further implication:

I have confirmed rn the master has a bug. Concretely, whenever your run a node configured via deprecated runtime/paths or use new runtime.Runtimes, the bundle is copied to the bundle dir. Now, no matter if you remove the bundle from the configuration, Discover will still find it and thus adding the runtime to the registry, which will be then also displayed by oasis-node control status.

Update of update
Fixed here: #6003

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from dbddfa5 to 8bfe9f6 Compare January 14, 2025 11:15
go/runtime/bundle/manifest.go Outdated Show resolved Hide resolved
go/runtime/bundle/helper_test.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery_test.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery_test.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery_test.go Outdated Show resolved Hide resolved
@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch 2 times, most recently from 64a7aaa to 622d0bd Compare January 14, 2025 11:57
@martintomazic
Copy link
Contributor Author

martintomazic commented Jan 14, 2025

One last think do we want e2e test for both my bug (nil runtimeIDs ) and the one currently in the master.

While the latter doesn't bother me that much, I find it a bit scary that tests pass even when #5976 (comment).

I can confirm thought that e.g. if I delete bundles too early (when I receive new runtime descriptor) as was the case here, the runtime was actually suspended so test failed. This is good.

Update:
The test is now failing as it should. It was false positive before due to discovery bug interleaving (fixed here 562ab5f)

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from 9cc236e to 3d4b25d Compare January 14, 2025 17:34
Copy link

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 66.22517% with 51 lines in your changes missing coverage. Please review.

Project coverage is 64.80%. Comparing base (d563a80) to head (cba543b).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
go/runtime/bundle/discovery.go 57.44% 17 Missing and 3 partials ⚠️
go/runtime/bundle/registry.go 67.24% 15 Missing and 4 partials ⚠️
go/runtime/registry/config.go 46.66% 6 Missing and 2 partials ⚠️
go/runtime/registry/registry.go 85.71% 1 Missing and 2 partials ⚠️
go/runtime/bundle/bundle.go 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #5976    +/-   ##
========================================
  Coverage   64.80%   64.80%            
========================================
  Files         632      632            
  Lines       64864    64995   +131     
========================================
+ Hits        42033    42123    +90     
- Misses      17870    17907    +37     
- Partials     4961     4965     +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from 3d4b25d to cba543b Compare January 14, 2025 23:04
go/runtime/registry/config.go Outdated Show resolved Hide resolved
go/runtime/registry/config.go Outdated Show resolved Hide resolved
go/runtime/registry/config.go Outdated Show resolved Hide resolved
.changelog/5737.feature.md Outdated Show resolved Hide resolved
go/runtime/bundle/bundle.go Outdated Show resolved Hide resolved
r.RLock()
rt := r.activeDescriptor
r.RUnlock()
if rt != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy path first, so flatten the code.

go/runtime/registry/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
return
default:
if v.Less(active) {
r.logger.Info("Removing bundle with version lower then active",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would first log that we are removing all that have version less than active, and then in the for loop log for every bundle that is removed. This way, you see that we tried to removed bundles, but nothing was needed to be done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is desirable since this function is called everytime an epoch changes. I would prefer logging if we do an actual removal? Anyways let's see how this changes once rebase on top of manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epoch are every 1h. You can log as Info almost whatever you want. This way you can be sure that background tasks are triggered, even though they do nothing as nothing is upgraded.

}

if err := os.RemoveAll(explDir); err != nil {
r.logger.Error("failed to remove stale exploded (regular) bundle dir",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters are strange: runtime_version, ExplodedDataDir. The first one is not consistent with version from above, the other has not the same format.

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from cba543b to 5197ed3 Compare January 16, 2025 12:30
@peternose
Copy link
Contributor

if I delete bundles too early

If you fetch the current epoch, read active version for that epoch, and delete all previous versions, the bundles should not be deleted too early.

This function is needed so that we don't have to expose,
`bunle.manifestName` constant (internal detail), for the
e2e test.
@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from 5197ed3 to c405417 Compare January 17, 2025 15:55
Comment on lines +488 to +490
// TODO should you also remove dirs for manifests successfully removed,
// even if error?
dirs, err := m.store.RemoveManifests(runtimeID, active)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

Comment on lines 500 to 502
for _, dir := range dirs {
m.removeBundle(dir)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is potential for race condition here inside cleanStaleBundles:

Given that bundle.Registry is accessible inside runtime.Registry this means there can be another thread calling AddManifests. We don't do it rn but the code allows it so someone may later extend it?

In case of unfortunate interleaving you may**:

  • Explode bundles first
  • Delete manifest hashes (m.store.RemoveManifests), part of cleanStaleBundles so discovery thread.
  • Add manifest hashes (AddManifest), e.g. call inside runtime registry thread.
  • Remove bundles via m.removeBundleDir, part of cleanStaleBundles so discovery thread.

Effectively you deleted bundles your registry believes it has.

** (note this is not scenario at init)

  1. In theory this should not happen as you should not be adding bundles with version lower than active?
  2. Maybe we want to add a comment or even make store more defensive to reject bundles lower then active?
  3. What are alternatives, we could lock/make this atomic but then this has to become part of bundle.registry meaning this would break abstraction we wanted:
    • discovery is responsible for exploded directories (on disk representation) + fetching and cleaning, whilst the registry has in memory parts...

@martintomazic martintomazic force-pushed the martin/feature/cached_bundles_clean-up branch from c405417 to 3b4eb79 Compare January 17, 2025 16:22
@martintomazic
Copy link
Contributor Author

If you fetch the current epoch, read active version for that epoch, and delete all previous versions, the bundles should not be deleted too early.

Correct. I was just confirming I made sure my e2e test is failing when clean-up was not implemented as you write above. This was happening here: #5976 (comment) when I initially mis-understood the registry updates, thus deleting things to early. :)

@martintomazic
Copy link
Contributor Author

Freshly rebased on top of #6003. Should be ready for a second round of reviews. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Periodically clean up cached bundles directory
3 participants