
Persistent cache with auto-cleanup #3176

Open · leventov opened this issue Dec 15, 2024 · 37 comments
Labels: enhancement (New feature or request)


leventov commented Dec 15, 2024

Description

Objective

Enable #3054 (option to save all cell results by default) in a scalable way.

Assumptions

  1. Stable cell_ids across marimo app instances (both on the same and different machines) and branches/commits. See more details here on why without stable cell_ids, the qualities listed below are not realisable. (I've created a separate issue for this: Optional stable cell IDs #3177)

The qualities that I think persistent cache should obtain

  1. Fast and scalable (to thousands of cells) pruning of outdated entries from each cell_id's list and addition of new entries, which requires a fast lookup from cell_id to the list of tuples (module_hash, created_at, content_blob).
  2. Pruning shouldn't base decisions on the native file system's ctime or mtime attributes, as they are not guaranteed to be preserved in some sync/distribution setups, primarily Git (Git doesn't store file creation times).
  3. Protection from content blob corruption in case of sudden machine power outages, marimo app crashes, etc.
  4. Agnostic towards the way cache is shared between machines and/or backed up: Git/Github, git-annex, CI systems cache directories (such as Github Actions's cache action), Marimo Cloud, Dropbox, rsync, backup systems, etc.
  5. Legible to the user/developer and other software outside of the marimo app: prefer text over binary formats.
  6. Basic deduplication, that is, if module-coinciding-with-cell has the same results (despite having different inputs) as some previous version, we don't store the results multiple times.

Non-goals

  1. Deep/fine-grained deduplication or deduplication of results across cells: may be added later if strongly needed by someone, but I doubt this, because primarily the cache space usage should be addressed by auto-cleanup. For top-level functions (when they are supported), I think for most cases using XTDB to store the results per "input columns" will achieve much better space compression than a generic cache (apart from various other benefits).
  2. A 100% guarantee against cached content loss. This is still a cache, not a replacement for a database.

Limitation of the current design of persistent cache

The current design of persistent cache doesn't obtain qualities 1-3 and 6 above, and I don't see how it could be fixed "on the margin".

Suggested solution

Cache dir structure:

cache/
 - 00/ (bucketing by the first two characters of cell_ids (UUIDs))
   - 00...[uuid -- cell_id]/
     - .table
     - [content-hash-sha1].pkl (first blob)
     - [content-hash-sha1].pkl (second blob)
     - [...]
   - 00...[uuid -- cell_id]-cell/ (note the -cell suffix of the dir name; it indicates "cached module coinciding with the cell")
     - .table
     - [content-hash-sha1].pkl (first blob)
     - [content-hash-sha1].pkl (second blob)
     - [...]

.table file format is as follows:

[cache type] [module hash] [created time] [content hash]
[cache type] [module hash] [created time] [content hash]
...
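
For illustration, here is a minimal sketch of the write path under this layout; the helper name, cache root, and the choice of pickle are assumptions, not an existing marimo API:

import hashlib
import pickle
import time
from pathlib import Path

CACHE_ROOT = Path("__marimo__/cache")  # assumed location of the cache dir

def save_result(cell_id: str, cache_type: str, module_hash: str, result) -> None:
    """Store a content-addressed blob and append an entry to the per-cell .table."""
    blob = pickle.dumps(result)
    content_hash = hashlib.sha1(blob).hexdigest()
    # Two-character bucket, then the per-cell directory.
    cell_dir = CACHE_ROOT / cell_id[:2] / cell_id
    cell_dir.mkdir(parents=True, exist_ok=True)
    blob_path = cell_dir / f"{content_hash}.pkl"
    if not blob_path.exists():
        # Basic deduplication: identical results map to the same blob file.
        blob_path.write_bytes(blob)
    with open(cell_dir / ".table", "a", encoding="utf-8") as table:
        table.write(f"{cache_type} {module_hash} {int(time.time())} {content_hash}\n")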

Design considerations

Why bucketing per first two characters of cell_ids: a future-proofing measure that helps Git not break down when working with monorepos with thousands of cells and more. It also aids Marimo itself: when it caches results for a new cell, it needs to create a directory for that cell_id, and on some file systems this may become slow when there are already thousands of entries (i.e., other directories) in the cache directory. Bucketing by the first two characters keeps the cache dir to at most 256 entries. This future-proofing is very cheap (new cells are not created often), so I think it makes sense to add it already. (This idea is borrowed from Git itself, see ls -l .git/objects.)

Why directory-per-cell: keeps .table files very small (almost always less than 4KB, one page), which makes loading, updating, and writing them back easier. Also, importantly, when tables are per-cell, Git merge conflicts are excluded unless the same cell was updated in both branches. Even then, having __marimo__/cache/*/*/.table merge=union in .gitattributes would help (Marimo may take up pruning obsolete lines every time it overwrites the .table file anyway).

Note: [cell_id]/ and [cell_id]-cell/ are two different directories that can co-exist for the same cell_id. The [cell_id]-cell/ directory contains only the module hashes for the cell itself, whereas [cell_id]/ contains module hashes for persistent_cache() blocks within the cell.

For different auto-cleanup algorithms for cached-modules-coinciding-with-cells and modules within cells, see comments in #3055.

Note the departure from the current persistent cache design, where the file name is the module hash. Naming content blobs with their own hashes makes it very simple and robust for two module_hashes in the table to point to the same content blob. It will not lead to extra read latency most of the time, because after the first access the marimo app can easily keep all hashes in memory. It helps that for large workspaces a single app will not access most cells, so the overhead of this in-memory table caching is small, thanks to the directory-per-cell design.

Alternative

This proposal supersedes #3055, which is the opposite of the current persistent cache design: it obtains qualities 1-3 and 6, but fails for 4 (Git merge conflicts!) and 5 (not super legible).

@leventov leventov added the enhancement New feature or request label Dec 15, 2024

leventov commented Dec 16, 2024

Note how the proposed solution above is not 100% data loss free, but this is probably fine (see non-goal (2) in the top message in this issue).

Marimo app instances should obtain fcntl() write locks on the .table files before updating them, or, better, on dedicated per-app IPC locks instead, because cache updates will tend to be clustered: when a single upstream cell within a notebook is updated and triggers cascading updates to many downstream cells' results, acquiring fcntl() locks on all of these cells' .table files would be wasteful.

However, this is not the case for Git, Dropbox, and a lot of other data sync solutions: they are not cooperative. So the following data race is possible:

  1. Git reads the contents of .table for merging it with another commit's version of this .table.
  2. Marimo app updates the .table, adding a new entry.
  3. Git writes out the merged results without the entry added on step 2, creating an "orphaned" [content-hash].pkl file (to be cleaned up by the Marimo app as part of background bookkeeping).

The probability of this is probably small enough not to bother at all, but just in case we want to bother, we can still add the module hash to the .pkl file names, so that entries can be "restored" rather than the orphaned files forced to be removed. This will require changing the table format to the following:

[cache type] [module hash] [created time] [content blob file name (includes content hash and module hash)]
[cache type] [module hash] [created time] [content blob file name (includes content hash and module hash)]
...
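
A hedged sketch of such a "restore" bookkeeping pass (the file layout and helper name are assumptions; recovering the cache type for restored entries is glossed over and passed in here):

import time
from pathlib import Path

def restore_orphaned_entries(cell_dir: Path, cache_type: str) -> None:
    """Re-append .table entries for blobs whose lines were lost to a non-cooperative
    sync; blob names are assumed to be '<content_hash>_<module_hash>.pkl' as proposed above."""
    table_path = cell_dir / ".table"
    lines = table_path.read_text(encoding="utf-8").splitlines() if table_path.exists() else []
    referenced = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:
            referenced.add(fields[3])  # the content blob file name column
    with open(table_path, "a", encoding="utf-8") as table:
        for blob in cell_dir.glob("*.pkl"):
            if blob.name in referenced or "_" not in blob.stem:
                continue
            content_hash, module_hash = blob.stem.split("_", 1)
            table.write(f"{cache_type} {module_hash} {int(time.time())} {blob.name}\n")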


dmadisetti commented Dec 16, 2024

I think a cache cleaning mechanism makes a ton of sense.

Having some sort of metadata to do this also makes sense (modification stamps are unreliable)
Thoughts on implementation:

  1. As opposed to .table, use a standard metadata format (e.g. json, toml, yaml, csv). I think sqlite is also a good idea here, as it's a Python builtin
  2. Start with an option to manually clean the cache.
    • Allow feedback for what sane defaults and use-cases might be
    • Empower user control over cache content on a per notebook basis

I think that really the gold standard for caching would be some sort of remote caching, where the user doesn't have to worry about what's on their disk and just needs a static file server to host a cache (something like S3, or even Google Drive, etc.). Gathering feedback before moving on to defaults could inform how often this caching mechanism is even used.


Renaming the issue, since, as I mentioned in #3177, "stable IDs" aren't really needed because:

  1. Persistent cache blocks are required to have a name
  2. Cells can leverage the execution path for producing a hash without touching the "ID" mechanism

But glad to see you're excited about this!

@dmadisetti dmadisetti changed the title Persistent cache with auto-cleanup (requires stable cell_ids) Persistent cache with auto-cleanup Dec 16, 2024

leventov commented Dec 16, 2024

.table can be seen as CSV with a space delimiter and opened like in the example here: https://docs.python.org/3/library/csv.html#csv.reader. But on the one hand it's so simple that we don't really need a "csv reader", and on the other hand, to make it extra bulletproof, we need to read it manually and filter out Git merge conflict markers (those <<<<<< lines), just in case the user didn't add the merge=union directive to .gitattributes. We would need to do the same if it was comma-separated; comma vs. whitespace makes zero difference. You can replace spaces with ,, though I don't see any point really. FWIW, I borrowed this "format" from go.sum in Golang. But it's also not a point for contention; if you prefer ,, so be it. A sketch of such a reader is below.
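
A minimal sketch of such a reader (the function name is hypothetical):

def read_table(path: str) -> list[tuple[str, str, int, str]]:
    """Parse the whitespace-delimited .table format, skipping Git conflict
    markers in case merge=union was not configured in .gitattributes."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith(("<<<<<<<", "=======", ">>>>>>>")):
                continue
            cache_type, module_hash, created_at, content_hash = line.split()
            entries.append((cache_type, module_hash, int(created_at), content_hash))
    return entries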

SQLite for a dozen lines doesn't make sense. A more "global" SQLite, as I envisioned before, would suffer from merge conflicts.

Even JSON would be worse: it would not play as smoothly with merge=union, because absent/present trailing commas turn lines into "changed lines" rather than "added lines" for Git. ndjson would work, but there's no real reason to use it over CSV.

@leventov (Author)

Does "execution path hashing" basically mean AST hashing of the cell's code plus all of its dependencies' code, without "runtime" content hashing?

If these execution path hashes can be computed in parallel with "full" module hashes before cache lookup, we can do it. Then the "short history lookup" that I've talked about would be based simply on the fact that the cache cleanup wouldn't be THAT proactive, and would only happen with an explicit separate call to marimo cache tidy, which would do a complete walkthrough of all cells in the workspace and clean from the cache any module_code_hashes that no longer appear (at which point "histories" would be truncated). The user doesn't need to call it manually though; they can make marimo cache tidy part of a git pre-commit hook, or of a make deploy routine, whatever.

Deduplication with previous content: this would still work at the Marimo app runtime level, since the Marimo runtime can record which module_code_hash most recently corresponded to the cell, and then compare the new content blob with that file. However, it would not work in the following scenario:

  1. User edits the code for the cell with marimo app not running.
  2. When the user finally runs the cell through marimo, it cannot know previous module_code_hash

Unless Marimo wants to engage with Git and tries to compute the module_code_hash for the most recently staged or committed version of the cell to recover a "potential candidate for deduplication"... but that seems far-fetched.

With everything that is described above, the overall directory structure would still be the same, except that instead of [cell_id as uuid] as the second-level dir, these dir names are module_code_hashes. There are still potentially different versions of module_hash and content blobs within a module_code_hash dir, due to random number generation, current-time inputs, etc. Except, as dedup now may "look across the dirs", the .table extends even further:

[cache type] [module_hash] [created time] [module_code_hash]/[content_blob_hash]_[module_hash].pkl


leventov commented Dec 16, 2024

Deduplication with previous content: would still work on Marimo app runtime level, since Marimo runtime can record what module_code_hash has corresponded to the cell most recently, and then compare the new content blob with that file. However, it would not work in the following scenario:

  1. User edits the code for the cell with marimo app not running.
  2. When the user finally runs the cell through marimo, it cannot know previous module_code_hash

Unless Marimo want to engage with Git, and try to compute module_code_hash for most recent staged or committed version of cell to try to recover "potential candidate for deduplication"... but seems far fetched.

It doesn't seem to me the above would provide great cache robustness.

Note an important thing: whereas you probably mostly think about two scenarios:

  1. Editing cells through the Marimo web UI (and the VS Code extension provides the same, or is it closer to scenario two?), where the JS side can "remember" the cell_id the user is editing to ensure "short history" continuity
  2. Editing cells as code in simple text editors, by a human, where the UUID proposal is ugly and makes for a somewhat bumpy user experience, with the need to generate UUIDs by hand or figure out how to do it in the IDE,

The main scenario as I see using Marimo is a third one: the majority of cells are created and edited by AIs, e.g. within Co-STORM. The wrapper code can add UUIDs trivially, so the clunkiness of (2) above is not an issue. But unless that wrapping logic is also knee-deep in Marimo internals (calls to module_code_hash directly before it instructs the AI to re-write the cell, for instance, and then manually tells Marimo somehow to look up potentially duplicate results there), the recent-history deduplication wouldn't work.

The ability to provide explicit cell_ids (which my wrapper logic can generate easily) would enable much better separation of concerns between AI workflow wrappers and the Marimo layer. Also, those UUIDs needed to "correlate cells across Git branches" and to move cells from one notebook to another without "losing the context" are directly motivated by those AI-workflow-ish scenarios.


leventov commented Dec 16, 2024

FWIW, in the previous comment I don't question the module_code_hash approach (as I understand it) altogether; it seems to me now that it would definitely work in many use cases.

Rather, I demonstrate a strong demand for "user-generated cell_ids", and caching based on them, too.

But these two approaches can seamlessly coexist, just on the same level and mix together:

cache/
- 00/
  - [module_code_hash]/ [...]
  - [cell_id as uuid]/ [...]
  - [cell_id as uuid]-cell/ [...]

Need to think a little more about the edge cases, but I'm sure it's fully workable.


leventov commented Dec 20, 2024

I now think that the entire cache could be stored as a git submodule which itself would be a bare repo, where all actual cached files are stored in the .git/modules/cache dir (note: "cache" doesn't have a special meaning for git here; it would merely mean "a git submodule called cache").

This git submodule would not represent any "directory of files" in a meaningful sense: in fact, it would not have any commits! It would only contain blob objects, tree objects, and tags. The reason is that commits add unnecessary indirection to blob files. Fortunately, git tags can point to git blob and tree objects directly as well.

Simple cell results would be blob objects, stored without compression by default (corresponding to core.compression=0 in git; this is very atypical for "normal" git repos, but for our bare pseudo-repo, where these objects are the primary cached content blobs, this probably makes sense as a default to reduce latency, since cache size is not an issue as long as auto-cleanup of outdated cached entries works).

Once these are written, a weak reference identity dict (such as the one in SQLAlchemy) can be updated with mappings from the ids of the cell's local objects (which have just been pickled) to tuples (local_name, sha, size). Note: this is the sha of the whole locals dict as it has been pickled. Hence, these SHAs will be the same. Size is also just the size of the whole blob.

The pickler for the cell results and "scoped refs" in BlockHasher should have a custom persistent_id() which checks whether the object is in that weak reference map and, if yes (and the total size of that blob is big enough to be worth deduplication), returns (local_name, sha) as the persistent id. Note: the sizes of individual locals cannot be reliably gauged because, within the locals dict, pickle does internal-to-stream deduplication, so the first local may appear huge on the pickle stream and the rest small, but if any of those rest were pickled again they could also be huge.
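
A minimal sketch of that hook, using the standard pickle persistent_id()/persistent_load() mechanism; the registry shape and the "marimo-blob" tag are assumptions, not marimo code:

import pickle

class DedupPickler(pickle.Pickler):
    """Replaces objects already stored as blobs with a reference to that blob."""

    def __init__(self, file, known_blobs, size_threshold=1 << 20):
        super().__init__(file, protocol=pickle.HIGHEST_PROTOCOL)
        # known_blobs stands in for the weak-reference identity dict described
        # above, simplified here to {id(obj): (local_name, sha, size)}.
        self.known_blobs = known_blobs
        self.size_threshold = size_threshold

    def persistent_id(self, obj):
        entry = self.known_blobs.get(id(obj))
        if entry is not None:
            local_name, sha, size = entry
            if size >= self.size_threshold:
                return ("marimo-blob", local_name, sha)  # emit a reference
        return None  # pickle the object normally

class DedupUnpickler(pickle.Unpickler):
    def __init__(self, file, load_blob):
        super().__init__(file)
        self.load_blob = load_blob  # hypothetical callback: sha -> deserialised object

    def persistent_load(self, pid):
        tag, _local_name, sha = pid
        assert tag == "marimo-blob"
        return self.load_blob(sha)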

If pickling cell results itself has used this deduplication mechanism, in addition to the blob object, it would also write a git tree object that would refer to those "upstream cell" blobs. This is needed so that git gc doesn't delete the upstream object blobs.

Pickling of large out-of-band buffers (such as ndarrays, dataframes) could be de-duplicated similarly. If a buffer larger than some threshold (e.g., 1 MB; via memoryview.nbytes) is passed into buffer_callback in persistent caching routine, it will write the buffer into a separate blob object and will put the identity of the buffer's source object (from inside buffer_callback, it could be obtained via memoryview(buf).obj) into the weak reference identity dict to the SHA of the blob. If any such external writes happen, a separate git tree object is also needed to prevent git gc from deleting these buffer blobs.
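
A sketch of that out-of-band buffer handling with pickle protocol 5; the 1 MB threshold follows the text above, and write_blob is a hypothetical helper:

import pickle

LARGE_BUFFER_THRESHOLD = 1 << 20  # 1 MB, as suggested above

def pickle_with_external_buffers(obj, write_blob):
    """write_blob: bytes -> sha of the stored blob object (hypothetical)."""
    external = []  # (sha, size) of buffers stored as separate blobs

    def buffer_callback(buf: pickle.PickleBuffer):
        view = memoryview(buf)
        if view.nbytes >= LARGE_BUFFER_THRESHOLD:
            external.append((write_blob(view.tobytes()), view.nbytes))
            return False  # serialize out-of-band: excluded from the pickle stream
        return True  # small buffers stay in-band

    stream = pickle.dumps(obj, protocol=5, buffer_callback=buffer_callback)
    # To load later, the same buffers must be supplied in order:
    #   pickle.loads(stream, buffers=[...bytes of each external blob...])
    return stream, external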

Finally, the mapping from the cache key, i.e., the module hash, is encoded in the tag name. If a uuid-like cell_id is present, then the tag name could be cell_id/module_hash. These tags point straight to tree or blob objects, not commits.

When a tree is created anyway for the cache entry (that is, for recording "upstream" and buffer dependencies, as described above), the tag could be a lightweight tag, and the "created at" time of the cache entry can be recorded as a dummy "file" in the tree object, named created.<created_unix_time>, such as created.1734718851. This "file" could "point" anywhere, such as the SHA of an empty byte sequence (da39a3ee5e6b4b0d3255bfef95601890afd80709). The point of these gymnastics is to avoid creating an "annotated" git tag, which would be a separate git object.

When the cache entry points to a simple blob object, an annotated tag would need to be created (a separate git object), just to record "created at" time for the cache entry. Fortunately, this doesn't mean obligatory two-hop indirection when cache is to be accessed on disk because reftable stores "peeled" tag targets inline (see here, value_type = 0x2).

This whole construction wouldn't scale well to thousands of module_hashes without the new reftable format for storing references in Git, which plays the role of cache lookup here (instead of, implicitly, the file system itself, as was proposed above, where the module_hashes or cell_ids are directory names). The reftable API is not yet in libgit2 (see libgit2/libgit2#5462), but I hope it will be there soon.

For now (for initial testing and small-scale purposes), the classic "filesystem" format for these references would also work.


leventov commented Dec 20, 2024

The main advantages of this design are the following:

  • Most importantly: if the cache is actually stored in Git, my proposal above (with files in __marimo__/cache) would grow the history forever; with the "bare Git submodule" design, the entire "repository" is always bounded.
  • In addition to the previous point, smaller disk footprint: with the previous proposal, the same files would need to be stored twice in Git, but only once in the "bare repo" design.
  • Leverage Git's native deduplication (without regard for content blob source), deltification (probably should be off by default, but maybe there would be use-cases), garbage-collection, command-line inspection, and other functions and tools instead of writing our own.

@leventov (Author)

If a non-deterministic cell produces two different results that are both persistently cached, these could be stored in separate tags, such as prefix/module_hash, prefix/module_hash-1, prefix/module_hash-2, etc., where the prefix is either an explicit stable cell ID (#3177) or module_code_hash, as discussed above. This way, "alternative histories" can be stored and inspected, if needed. All these alternative histories (except the last one) can be optionally cleaned upon marimo cache tidy, along with outdated module_hashes within the prefix/ subspace of tags, as was already envisioned before.

@leventov (Author)

@dmadisetti @mscolnick

I'm updating the latest design proposal (described three comments above; I'll call it the "Tags proposal" below due to its reliance on Git tags) to reflect the input from our discussion and my latest thinking.

I think this persistence subsystem could be a "core" implementation for all persistent execution, persistent cache, and remote persistent cache facades, hence I first suggest a shift to neutral terminology (and will use it below):

  • The former "cache key" becomes (module) execution key.
  • The former "cache" or "cached content" become (module) content.
    • Note: not "module results" because this "content" may even include the snapshot of the module code, more on this below.

The interface and the data model

Scoping with notebook keys

A significant departure from all my previous proposals is that I suggest to scope execution keys per notebook. This means that the notebook save key also becomes a part of the execution key. I think these keys can be computed in the following way:

  1. When need to save or get module contents, get the app to which the module belongs.
  2. If the app has non-empty save_key property, use it.
  3. Find the Python file in which that app is first defined as top-level variable, to address corner cases where the app is used across multiple Python files.
  4. Find the path of that file relative to --cache-dir or --persist-dir (or another similar facade-dependent config) passed to the marimo CLI or server, or configured for the app per Option to configure persistent caching for all cells within app or directory #3054. Or, if not specified, relative to the dir from which the marimo CLI or server was run.
  5. Use the MD5 hash of that relative path as the notebook save key (a minimal sketch follows this list). Note: these keys will be the same for two apps if they are defined in the same Python file (unless marimo already prohibits this, which I'm not sure about). But this is fine, they don't need to be unique.
  6. Optionally, edit the file where the app is defined to add mo.App(save_key=...) to prevent invalidation when/if the file is moved or renamed in the future.
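
A minimal sketch of step 5 (the helper name is hypothetical; the root dir is whatever is resolved in step 4):

import hashlib
from pathlib import Path

def notebook_save_key(app_file: Path, root_dir: Path) -> str:
    """MD5 of the app file's path relative to the configured root dir."""
    rel = app_file.resolve().relative_to(root_dir.resolve()).as_posix()
    return hashlib.md5(rel.encode("utf-8")).hexdigest()

# usage: notebook_save_key(Path("notebooks/report.py"), Path("."))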

Why: if persistent execution or cache is enabled by default (#3054), we should from the beginning plan for the persistence subsystem to scale to 10k module contents stored within one "workspace" (not an official term yet, but I mean whatever collection of notebooks is run from a single root dir and shares the same __marimo__ dir). This is what motivated various tree-like structures in earlier proposals:

  • In the first SQLite proposal, implicitly it is the primary key index (B-tree) for the cache_results table.
  • In the "second proposal" (see the top message in this issue), it is the Git-like bucketing strategy for "stable cell IDs" (if set) or module code hashes otherwise.
  • In the third, "Tags" proposal it's the actual Git objects/ database for module contents (Git uses two-level bucketing, not one-level as in the "second proposal") and also a tree-like reftable for execution key -> content mapping.

Scoping with notebook keys is also a tree-like structure which helps with scaling. But it has extra advantages over the previous proposals:

The "scalability benefit" that this scoping alone brings will probably be sufficient for a lot of use cases, so execution key -> content mapping implementation may remain simple, readable, and mergeable initially.

This is an improvement over the SQLite proposal, where the mapping is stored in a binary, not easily mergeable format (the SQLite database format), and is monolithic, which would cause permanent write and sync amplification for sync systems that don't compute deltas within files before sync (some do, but many don't). The reftable (in the "Tags" proposal) was supposed to be an improvement in merge-ability over SQLite, but not quite (the operations with remote refs in Git are obviously not perfectly optimised for this use case), and at a significant tradeoff with inspectability/observability of the format: compared to SQLite, ~no one knows about reftable, there are no public tools for working with it besides Git itself, and it hasn't even been added to the libgit2 and hence PyGit2 libraries yet.

The "second proposal" shared this benefit with "scoping with notebook keys" over SQLite and Tags proposals. However, it had the following downsides:

  • Stable cell IDs had to be UUIDs or other strings with randomness in the beginning, otherwise the distribution within the Git-like bucketing folders would be poor. But UUIDs in the code are ugly clutter. Scoping with notebook keys lets us confidently use cell names as "stable cell IDs" for persistence subsystem purposes. This means that Optional stable cell IDs #3177 can be closed, nothing should be done there.
  • In the course of executing cells through the notebook and persisting the contents along the way, many bucket folders (00/, 01/, etc.) would be updated in the "second proposal". Naturally, scoping with notebook keys permits better locality: not just for the write path, but also for the read path and sync mechanisms, thus reducing write, read, or sync amplification. This also permits very simplistic, ad hoc "sync mechanisms" such as "zip the folder for the notebook key and send it to a colleague via WhatsApp".

Module execution keys

Module execution keys logically include:

  1. Notebook save key: see above.
  2. Cell name if present and if the persisted module is the entire cell, not a persistent_cache() block within a cell. Otherwise, module code hash.
  3. Module hash.
  4. Version.
  5. The created_at timestamp.

Parts 1-4 of the execution key must be specified both on the read and write paths. The 5th part, the created_at timestamp, could be omitted on reads: the module contents with the highest created_at for parts 1-4 are then returned, if any, along with this timestamp.

The version was already mentioned in the previous comment. Version is an alphanumeric string that can be used in multiple ways: either to enumerate non-deterministic results or to represent bitemporality, in which case the version would actually be a timestamp in addition to the created_at timestamp for stored module contents, and thus module contents would be keyed bitemporally: by version and by created_at timestamps.

Side note: bitemporality primarily makes sense in persistent execution use cases. The simplest use case: a cell computes average of some metric stored in some dynamic data source over the current calendar day. It specifies version=start_of_day. But this cell may be executed many times within the day and even after the day is over, to account for late-arriving data.

At the Marimo Python API level, the version could be set via optional parameters to mo.persistent_cache(version="foo") and @app.cell(save_version="foo"). If not set, the version for the module defaults to 0. For dynamism, a version_func parameter may be passed: it is given the times when the module execution entered (i.e., when the execution key lookup would begin) and ended (on the write path, after the cell has been (re)executed and we have new cell contents), has access to the already existing versions for the module, and returns the new version. A sketch of these proposed semantics is below.
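
A sketch of the proposed (not yet existing) version_func for the start-of-day example above; the signature is my reading of the proposal, not an existing marimo API:

import datetime

def start_of_day(entered_at: datetime.datetime,
                 ended_at: datetime.datetime,
                 existing_versions: list[str]) -> str:
    # Key the module contents by the calendar day in which execution started,
    # so re-runs within the same day (or for late-arriving data) share a version.
    return entered_at.date().isoformat()

# Proposed usage (hypothetical parameter):
#   with mo.persistent_cache("daily_metric", version_func=start_of_day):
#       ...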

The persistence subsystem doesn't need to constrain and care about the semantics of the usage of version within a notebook or specific cell. It only needs to provide flexible ways to auto-cleanup contents along all relevant "dimensions": module hash, version, and created_at.

Module contents

Module contents logically include:

There could also be more optional parts of module contents. Perhaps shouldn't be implemented at first, but it's good to think about making interfaces and storage implementation somewhat future-proof for adding such extra parts of module contents.

  • Full execution keys for the direct parent modules for this module. This would help with exact reproducibility and debuggability in some persistent execution use cases, especially when not all module contents are always synced and distributed: see the next section.
  • Snapshot of the code used to produce this module contents. For full reproducibility, this should be the whole source tree that is necessary to compute this module's code hash including the uv.lock (thus addressing this TODO in is_pure_function()). Motivation: even if the user tracks the notebook code in Git, they may not capture intermediate code versions of some named cells that were used in some executions.
    • Impl note: of course, it would be preposterously wasteful to actually store a separate snapshot of the code for every module content. Practical implementations would be much closer to something like Marimo-managed Git branches.

Write, before_sync, and on_exit callbacks

The current version of the Loader abstract class, as well as all previous proposals that I made, assumes a single write interface (called save_cache() in Loader) which should leave the storage in a perfectly sync-ready state.

This constraint either exacerbates write and sync amplification or drives the storage implementation to stick to the "file per module content" approach, which can quickly generate lots of files with very small contents (lots of cells whose content is a few variables with primitive or string values).

In fact, the "Tags" proposal was an attempt to address this tradeoff: although Git would still create a separate file per content in that proposal, if Git itself was used for sync, it would employ thin packs to avoid making lots and lots of separate network transfers.

However, this constraint is absolutely unnecessary. The storage interface should be extended with two callbacks:

  • before_sync: called by Marimo server before doing sync to remote persistent cache storage. There could be various triggers: a period of inactivity, loss of connection to the browser, a configured timeout (e.g., 1 hour), whatever happens first. Can also be triggered externally via sending a request to the Marimo server if the sync mechanism is external to Marimo (such as, rsync in cron).
  • on_exit: called before Marimo server or command line runtime exit. This enables external sync mechanisms whenever Marimo server is not running.

Apart from consolidating small module contents into "packs" that are more convenient for sync, these callbacks may also do proactive cleanup of contents, such as removing everything except the latest (by created_at) content for each version of the module. Alternatively, these contents could be kept locally, and even indicated in the notebook mapping file (see section below) as existing on this host, but are not included in the "pack" file for sync.

Implementation

Below I sketch a possible implementation that could be extended and made more configurable in the future, yet may serve "as is" a vast majority of potential future use cases.

A notebook folder is created per notebook key. These notebook folders are bucketed like in Git and like in the proposal in the first message in this issue: __marimo__/persist/00/00...notebook_key/.

Write

On write (aka save_cache), module content files (Pickled or JSON-serialised exported variables, stdout, stderr) are stored in a single, shared Git-style object database within the notebook folder (that is, in .../notebook_key/objects/), otherwise in the same way as described in the "Tags" proposal above.

Very large content files could also skip the objects database and go to another database more suitable for these files, such as a workspace- or repo-level git annex's objects database, if so configured. Same for watched file contents (#3271): as their contents can be readily shared across notebooks (and probably will be: that's the point), it's probably better to store them in a shared DB with a much lower size threshold than regular module content files.

At this point, the mapping from the execution key to the content file hashes (which are needed to find the contents in the objects database) is stored only in the Marimo process memory, not yet persisted. Optionally, this information could be persisted in an append-only transaction.log within the notebook folder.

before_sync and on_exit

On before_sync or on_exit, several things are done:

(1) Notebook mapping file is prepared. I didn't think too much about the format, but I think it could be a TOML file like:

# .../notebook_key/map.toml

[cell_name_or_module_code_hash]

[cell_name_or_module_code_hash.module_hash]

[cell_name_or_module_code_hash.module_hash.version]
created_at_1 = {exported_vars = "object_id", stdout = "object_id", ...}
created_at_2 = ...

# Other versions, module_hashes, modules...

(2) All the smallest file contents mentioned in the notebook mapping file are packed such that the total size of files selected for this pack doesn't exceed some threshold, such as 250 MB. This pack is stored not in the object database, but in the notebook folder: .../notebook_key/objects/pack/pack-pack_hash.pack and the corresponding pack-pack_hash.idx.

Bonus: when the pack is already prepared for the notebook, it will automatically speed up content loading when the notebook is cold reloaded.

(3) All content files that didn't fit the pack (or were too large to store in the object database to begin with) are "prepared for sync" in some way that is specific to the sync mechanism.

For non-interfering sync mechanisms, and when the total sync volume will not be an issue, this "preparation for sync" can be a no-op.

For Git annex, this preparation would be calling git annex add __marimo__/persist/*/*/objects/[a-z0-9]{2}/ .

If sync volume can be an issue, and the sync workflow is not user-controlled as with git annex but automatic as with Dropbox, this "preparation" may consist of creating a mirror directory .../notebook_key/sync-objects/ for objects/ that hardlinks everything in objects/ recursively, and then telling the sync mechanism (such as Dropbox) to sync folders matching the pattern __marimo__/persist/*/*/, excluding __marimo__/persist/*/*/objects/ and __marimo__/persist/*/*/transaction.log. This way, Dropbox will not be "triggered" immediately upon any transitionary changes.
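
A sketch of that hardlink mirror (directory names follow the proposal; the function name is hypothetical):

import os
from pathlib import Path

def mirror_with_hardlinks(objects_dir: Path, sync_objects_dir: Path) -> None:
    """Hardlink every file under objects/ into sync-objects/ so that the sync
    tool only ever sees settled content, not transitionary writes."""
    for root, _dirs, files in os.walk(objects_dir):
        rel = Path(root).relative_to(objects_dir)
        target_dir = sync_objects_dir / rel
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in files:
            dst = target_dir / name
            if not dst.exists():
                os.link(Path(root) / name, dst)  # hardlink: shared content, no copy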

@leventov (Author)

Discussing the "Implementation" above:

On a finer level details may vary, e.g., actually re-surfacing SQLite for the content mapping files (separate for different notebook keys), instead of transaction.log + map.toml as I described in the sketch above.

Making the content mapping file format configurable and swappable should be fairly easy.

These two approaches (readable toml and binary SQLite) would suit different sync mechanisms better. If the user wants to commit the mapping files to their main Git alongside Python notebooks and sync the content files (objects and/or packs) with Dropbox, GDrive, git annex, some popular backup mechanism, etc., then text-based map.toml is better.

OTOH, if the user wants to sync everything in __marimo__/persist through something like Dropbox only, including the mapping files, it might be better to keep them binary (such as SQLite) and add unique (random) suffixes to new versions of these files to avoid file conflicts (otherwise Dropbox inevitably creates conflicts: it doesn't know how to merge files, neither binary nor even text ones). Marimo's logic would then open all .../notebook_key/map-*.sqlite files and merge them into its in-memory mapping; when it's time to write a new file, it writes it under a new name and then removes all the old files.

Either way (or maybe yet other strategies), I think it should be fairly forgiving once notebook key scoping is in place, and not lead to terrible write or sync amplification.

@mscolnick (Contributor)

Not sure I'll be able to provide much valuable feedback, but high-level:

1. Module execution keys

Yes, I agree these keys make sense, and a notebook key does as well. Version I am on the fence about. I would hate to upgrade marimo and lose all my cache. I can see us making breaking changes to this, so internally we could have our own version for the cache that would bust it.

2. Module contents

This would be quite a useful artifact. It would be great if it could be modular too and work with only some of the artifacts. For example, you may want to commit the outputs to git, but not the variables and values (which could be a large dataframe). So putting them in separate outputs (so they can be easily gitignored) might be a good design.

@dmadisetti (Collaborator)

It took me a while to read past the implementation suggestions of your most recent comment to fully get the core idea of why you would want this.
If it's OK I'm going to rephrase - let me know if I am on the right track.

Essentially, you want a 'hot tub time machine' on a per-cell basis, where each run is snapshotted with exact code + output, designed to scale and be easily shared/cleaned up.

I do think this is decoupled from the auto-cleanup method, since you could feasibly have one without the other.

Regarding the idea, I think it has potential. You could potentially directly leverage git instead of manually trying to fold in the directory structure. But I think the implementation might be hard for wide adoption, since proper usage would maybe still have to tie directly into the source tree and source control methods. However, I do think as a git wrapper, or in a managed environment like https://marimo.io/dashboard, something like this might be useful and an improvement over the current snapshot method.

@leventov (Author)

Version i am on the fence. I would hate to upgrade marimo and lose all my cache. I can see us making breaking changes to this, so internally we could have our own version for the cache that would bust it.

I see it like this:

  • The existing persistence design is a "legacy backend". It will work only for persistent_cache() blocks, not auto-persisted cells (Option to configure persistent caching for all cells within app or directory #3054), to motivate people to upgrade the backend (more on this "upgrade" below). Also, persistent_cache(version="x") would fail with the legacy backend, outputting a message that an upgrade to the new persist backend is needed to use this feature. For persistent_cache() that doesn't set the version, and thus whose version defaults to 0, this version is passed into the legacy backend and just ignored there, as are the notebook key and the cell name. Since neither the version parameter to persistent_cache() nor Option to configure persistent caching for all cells within app or directory #3054 yet exists, there is no compatibility breaking.

Upgrade to new backend: a command line action that takes a notebook, computes all module hashes and module code hashes for all persistent_cache() blocks, and stores them in a new backend. After that, optionally cleans the moved content files from __marimo__/cache.

When the user converts all notebooks within a workspace that have used persistent_cache() to the new backend (maybe not at once), they can remove __marimo__/cache altogether.


leventov commented Jan 22, 2025

Essentially, you want a 'hot tub time machine' on a per-cell basis, where each run is snapshotted with exact code + output, designed to scale and be easily shared/cleaned up.

Yes, this is a good rephrasing.

If you are interested, there is a detailed flesh-out of the system concept that I previously hinted at as "Co-STORM with Marimo backend" here: doc. It would actively leverage the persistence subsystem described above.

I do think this is decoupled from the autoclean up method, since you could feasibly have one without the other.

The main "time machine persistent execution" add-ons are the "optional parts of module contents": full parent module execution keys and code snapshots. Other skeleton parts of the proposal above (notebook key scoping, cell name or module code hash as part of the execution key) seem important for efficient auto-cleanup either way. Cf. this comment.

Regarding the idea, I think has potential. You could potentially directly leverage git instead of manually trying to fold in the directory structure.

I thought a lot about how to do this. The main reason for the delay with the follow-up after our call is that I spent a lot of time fleshing out that path, where I tried to "directly leverage Git" by constructing Git trees for notebooks, cells/module code hash, module hash, version, and created_at.

But ultimately I scrapped it because it felt "off" for me:

  • Too complex/"clever"
  • Leaning into Git too much
  • Creates many small files (now for Git trees) that feel unnecessary especially if that clever structure is going to be synced not with Git in the end, but just with Dropbox, restic/borg, etc.
  • Less observable/inspectable than I would like.

The reasons why I ditched the "Tags" proposal above are approximately the same.

Another way to leverage Git would be to make the backend impl use arbitrary (configurable) git object database for content files, such as the user's own workspace/.git/objects. However, I suggested to use notebook key-scoped object databases instead:

On write (aka save_cache), module content files (Pickled or JSON-serialised exported variables, stdout, stderr) are stored in a single, shared Git-style object database within the notebook folder (that is, in .../notebook_key/objects/), otherwise in the same way as described in the "Tags" proposal above.

I didn't suggest using a "globally shared" objects database because:

  • Benefits are probably very limited: we don't expect cells across different notebooks to produce byte-identical large pickled results en masse.
  • When different notebooks depend on large upstream blobs (such as dataframes), they would probably be stored in watched files, and these files can be stored in a shared objects db, such as main repo's object db, git LFS, git-annex's object db, or dvc.
  • On the other hand, when the content file is stored in a shared object db, it's not safe to simply remove the file from that db when the corresponding module contents are auto-cleaned, because of content addressing/deduplication and that some other content in another notebook may point to the same file. So, auto-cleanup suddenly becomes much harder! git gc must be the only cleanup interface to the shared db. And to prevent git gc from removing the files we actually need to keep, dummy references should be created.
    • In comparison, with notebook key-scoped object databases, Marimo runtime itself can directly remove objects from these DBs when these objects are no longer referenced from map.toml, after a clean-up.

In summary, it becomes a hassle for little to no actual benefit. As for @mscolnick's suggestion that some people may want to have cell outputs in Git, that can be done by configuring different "databases" for them within notebook folders: .../notebook_key/output-objects/. Then these sub-dirs can be added to Git, while .../notebook_key/objects/ and sync-objects/ are ignored. Alternatively, a different parent folder: __marimo__/persist/outputs/00/00..notebook_key/objects/.

@leventov (Author)

@dmadisetti @mscolnick it seems to me we don't have substantial disagreements, do we?

I will move forward with making a draft PR for this framework with some format of content mapping files (toml or sqlite; I'm not sure what I will choose to implement first yet, but the framework could anyway be extended later to support both in a configurable way, or another format), unless you add more comments or raise concerns about the design by the end of the week.

@dmadisetti (Collaborator)

I'm wary of the time machine part; it still seems potentially a little over-engineered and a premature optimization. It seems like cached execution should come first with the current naive structure. Previously, tinkering around with it, there were a couple of details that were not immediately obvious (e.g. how to best "save" modules) that might further shape our thinking on it. Additionally, there are some use cases that may be difficult with a more involved commitment to cached execution.

For instance, you could potentially trim parts of the notebook graph that are intermediate compute steps, allowing complex notebooks to be exported, verified, interactive, and WASM-embedded with a cache. I'm not saying the proposal will inhibit this, but I do think that this and other cases should be considered from the beginning.

As for toml or sqlite to manage metadata and cleanup, I absolutely agree this is needed in all cases. Implementation with the current setup is still a good start.


leventov commented Jan 23, 2025

It seems like the cached execution should come first with the current naive structure.
[...]
Implementation with the current setup is still a good start.

I don't see how this is even possible. I already argued that in the very first message in this discussion:

Objective
Enable #3054 (option to save all cell results by default) in a scalable way.

The qualities that I think persistent cache should obtain

  1. Fast and scalable (to thousands of cells) pruning of cell_id's list from outdated entries and addition of new entries, which requires a fast lookup from cell_id to the list of tuples (module_hash, created_at, content_blob).
  2. Pruning shouldn't base decisions on native file system's file's ctime or mtime attributes as they are not guaranteed to be preserved in some sync/distribution setups, primarily, in Git (Git doesn't store file creation times).
  3. Protection from content blob corruption in case of sudden machine power outages, marimo app crashes, etc.
  6. Basic deduplication, that is, if module-coinciding-with-cell has the same results (despite having different inputs) as some previous version, we don't store the results multiple times.

Limitation of the current design of persistent cache
The current design of persistent cache doesn't obtain qualities 1-3 and 6 above and I don't see how it could be fixed "on the margin".

Even if you forget the "time machine" part, a significant departure from the current naive structure is needed to obtain the desirable qualities listed above. The latest proposal would obtain these properties. The first proposal in this thread (with .table files; dubbed the "second proposal") would also obtain them, but the latest proposal is just better, again, without even considering the "time machine" use case.

The elements in the latest proposal that are not necessary just for implementing #3054 and the name-plate goal of this issue ("auto-cleanup"), and were added for the "time machine", are:

  • Version as a part of the execution key. Indeed, it may be seen as entirely unnecessary for many "scalable cache" and "remote cache" use cases without "time machine" aspirations. In these use cases, nobody ever specifies the version, and it defaults to 0 for all module contents. I'd argue this overhead (the extra ".0" substrings in the group keys in the map.toml files) is a fine trade-off for its flexibility and future-proofing.
  • "Optional parts of module contents" (parent execution keys, source snapshot) - since they are part of module contents only, they don't affect the structural design of the subsystem. They are already labelled as optional, and I don't plan to implement them yet.

Previously tinkering around with it, there were a couple details that were not immediately obvious (e.g. how to best "save" modules), that might further shape our thinking on it.

If by "how to best save modules" you mean the sync and store mechanisms (S3? Marimo Cloud? Datalad/git-annex? dvc? etc.), then the relevant way in which I see it can weigh on the design above is the capabilities that could be supported by the Loader interface (which I also called "backend" and "implementation" above).

For example, "time machine" and non-deterministic use cases may require something like listing of module contents by partial execution key.

However, note that I didn't mention that in the proposal above. I was thinking about the following shape of the Loader interface:

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional, Tuple

@dataclass
class ExecutionKey:
    notebook_key: str
    cell_name_or_module_code_hash: str
    module_hash: str
    version: str = "0"
    created_at: Optional[datetime] = None

class Loader:
    # execution_key.created_at is optional.
    #
    # get_spec specifies what parts of ModuleContent should be pulled and returned.
    # E.g., the caller may only be interested in 'outputs', skipping variables, parent keys, etc.
    #
    # **kwargs will include cache_type when the Marimo runtime calls loader.get() - it needs to be
    # passed to the legacy __marimo__/cache loader because cache_type is included in
    # the file names.
    def get(self, execution_key: ExecutionKey, get_spec=None, **kwargs) -> "ModuleContent": ...

    # execution_key.created_at is also optional: if not specified, the Loader itself chooses it and
    # returns it, such as the _server time_ of the successful insertion operation.
    #
    # **kwargs will include cache_type when the Marimo runtime itself calls put(), for the legacy
    # __marimo__/cache loader.
    def put(self, execution_key: ExecutionKey, module_content: "ModuleContent", **kwargs) -> datetime: ...

    # Returns whether sync preparation was successful, and any extra payload. If before_sync
    # was triggered externally via an HTTP request to the Marimo server, this result and
    # the payload will be returned in the HTTP response.
    def before_sync(self) -> Tuple[bool, Any]: ...
    def on_exit(self) -> None: ...

Some Loader impls may even ignore created_at in both put() and get() altogether: put() just overwrites the ModuleContent for the rest of the execution_key parts; get() called with execution_key.created_at == None returns the latest written version; and get() called with execution_key.created_at != None raises an "unsupported operation" error. In fact, this is exactly what the "compat" implementation of Loader that uses the current cache/ folder will do. Other Loader implementations, such as those using S3 or Marimo Cloud as the backend, may also do this.

Marimo will never call get() with non-None execution_key.created_at itself. Only "time machine" extensions would need it, but they can call the Loader that supports such calls directly.

@leventov (Author)

For instance, you could potentially trim parts of the notebook graph that are intermediate compute steps allowing for complex notebooks to be exported, verified, interactive, and WASM embedded with a cache. I'm not saying the proposal will inhibit this, but I do think that this an other cases should be considered from the beginning.

I don't quite understand what you mean here:

  • Cells are trimmed, but their module contents aren't, permitting recursive computation of execution keys for the downstream cells and "result verification" of some sort?
  • Both cells and their contents are trimmed but the graph with execution keys is retained (serialised and accessible in WASM runtime) to permit the computation of execution keys for the downstream cells?
  • The key point is trimming the module contents (both across cells, and "within" cells/modules: leaving the latest version only?) and storing them separately in some "export dedicated location"?

At least for these use cases, I don't see how it weighs on the persistence subsystem design and the Loader interface. It seems that any operation of this sort should be done by the runtime: walking over the notebook graph, (re)computing execution keys (if needed), pulling the contents (whatever parts are needed), and (re)storing them somewhere else (if needed). Trying to bestow this operation on Loader, a la prepare_export(notebook_key), I think would not work because there are too many possible variants and requirements for this operation.

It's possible to imagine that some Loaders may work towards this more proactively, such as by proactively storing contents to be exported later in a different "database".

@dmadisetti (Collaborator)

Fast and scalable (to thousands of cells) pruning of cell_id's list from outdated entries and addition of new entries, which requires a fast lookup from cell_id to the list of tuples (module_hash, created_at, content_blob).

Lookup meta should contain hash, hash_mechanism, blob_location, created_at, last_accessed, total_accesses. I think additional metadata can be stored in the blob itself for now (as is the case right now). Thousands of cells is a little premature, but also still reasonable with the flat structure.
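
A minimal sketch of what such a meta.db could look like, using the columns named here (the table name and file location are assumptions):

import sqlite3

conn = sqlite3.connect("__marimo__/cache/meta.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cache_meta (
        hash            TEXT PRIMARY KEY,
        hash_mechanism  TEXT NOT NULL,
        blob_location   TEXT NOT NULL,
        created_at      INTEGER NOT NULL,
        last_accessed   INTEGER NOT NULL,
        total_accesses  INTEGER NOT NULL DEFAULT 0
    )
""")
conn.commit()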

Pruning shouldn't base decisions on native file system's file's ctime or mtime attributes as they are not guaranteed to be preserved in some sync/distribution setups, primarily, in Git (Git doesn't store file creation times).

agreed, but the meta file takes care of this

Protection from content blob corruption in case of sudden machine power outages, marimo app crashes, etc.

An invalid cache blob should just force a recompute, as is. But ideally/eventually, a signature over the results would be good for export/sharing.

Basic deduplication, that is, if module-coinciding-with-cell has the same results (despite having different inputs) as some previous version, we don't store the results multiple times.

This requires a hash over results; it's more expensive, but it's less error-prone to just store. Maybe best done at cleanup time.


I still don't understand why the first iteration needs the folder structure? What's wrong with just (for now)

__marimo__/cache/
├── meta.db
├── my_expensive_block
│   ├── <hash>.<ext>
│   └── <hash>.<ext>
└── my_other_expensive_block
      └── <hash>.<ext>
  • cache_hit is rewritten on abstract Loader level to check meta.db (MemoryLoader still overloads this) and updates metadata
  • save_cache writes an entry to metadata (probably requires a new abstract function called dump_cache)

No need to tie to source yet, since source changes just mean cache misses and miss simple cases like notebook copies

@dmadisetti (Collaborator)

I think would not work because there are too many possible variants and requirements for this operation.

Realizing that I see cache as different from a run export to be shared.

__marimo__/cache/ is quick and dirty for local or team use. In my mind marimo clean just uses meta.db to get rid of unwanted/ old cruft. This is fine for thrashing source and quickly messing around, and clean just frees up space.

whereas something like __marimo__/result/<notebook name> could have each cache object signed, have a computed closure of values leading to outputs, a parameter sweep over the ui.elements, and be tied to source hash. This would be produced by a github action, other CI, or intentional marimo export --sign-cache <key>. More like a build artifact than just a cache, and more likely tied to a commit than unstaged / working changes. Clean doesn't matter here, since __marimo__/results/<notebook name> should just be overwritten if the notebook changes with CI handling artifacts for the run. marimo verify x could then prove that the signing party did in fact compute the notebook as is (barring marimo manipulation, but maybe mitigated with something like trusted publishers)


Maybe for full time machine these start to come together, in which case yes, cache invalidation becomes a lot more involved, and source history becomes important. In that case I think it's probably worth just dropping clean, or just using LRU and evicting based on disk, since space is relatively cheap for a managed service (doing something like this locally seems less important when you can intentionally clean, commit, and export). I think the gain vs. complexity for invalidating cache based on project source is not there. Even if it is, the dumb method is a good step.

@leventov
Author

leventov commented Jan 26, 2025

lookup meta should contain hash, hash_mechanism, blob_location, created_at, last_accessed, total_accesses. I think additional metadata can be stored in the blob itself for now (as is the case rn).

last_accessed, total_accesses definitely cannot/shouldn't be stored right in the blob: it's not compatible with atomic storages, and leads to write amplification (writes even on cache reads because access metrics are updated). Also, I don't understand how they are stored in the blob "right now"?

If the objective of this is to enable LRU or LFU cache clean-up, these metrics can stay internal to the cache database. They don't need to be exposed in the Loader interface because some implementations don't have a good way to record these metrics.

Thousands of cells is a little premature, but also still reasonable with the flat structure

For per-cell auto-cleanup, the problem is not the flat structure as such (at least at "single thousands" of module contents -- cleanup is maybe a rare enough operation that even listing and going through all entries in this directory is tolerable). But I think it's not possible at all because currently the content file {cache_type}{module_hash}.pickle doesn't record in any way which cell (in which notebook) it belongs to. It might be from a notebook that the subject Marimo server runtime (or the cli run) doesn't "know" about at all!

So, to enable auto-cleanup, even with the current flat structure, we would need to introduce extra files (or DBs, which you suggest below anyway) that record cells' historic module hashes (so we can clean them up later in a targeted way).

Thus, this anyway means introducing a transitional "extension" cache store format (the format for storing these histories), which adds to the variety of storage "variants" that later need to be supported.

invalid cache blob should just force recompute as is. But ideally/ eventually, a signature over the results be good for export/ sharing

Invalid storage is not guaranteed to raise a PicklingError: say a long blob is being written and the FS has already updated the .pickle file's size but not flushed all the bytes. Pickle can successfully de-serialise garbage bytes.
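A standard mitigation (just a sketch of the usual pattern, not marimo's current behaviour) is to write blobs through a temporary file plus an atomic rename, and to verify a recorded content hash before unpickling, so a torn write degrades into a recompute rather than deserialised garbage:

import hashlib
import os
import pickle
import tempfile
from pathlib import Path

def write_blob_atomically(path: Path, obj: object) -> str:
    # Readers never observe a half-written file; the returned sha256 can be
    # recorded in the meta file and checked on read.
    data = pickle.dumps(obj)
    digest = hashlib.sha256(data).hexdigest()
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
    return digest

def read_blob_checked(path: Path, expected_digest: str) -> object:
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise ValueError(f"corrupt cache blob: {path}")  # caller forces a recompute
    return pickle.loads(data)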

I still don't understand why the first iteration needs the folder structure? What's wrong with just (for now)

__marimo__/cache/
├── meta.db
├── my_expensive_block
│   ├── <hash>.<ext>
│   └── <hash>.<ext>
└── my_other_expensive_block
    └── <hash>.<ext>

This looks like my first SQLite proposal? I noted why I discarded it there: a binary database format is not friendly to Git.

But more generally, I don't understand why we would design and develop this entire storage structure as the "first iteration". It can be a "learning iteration" with respect to read/write scaling (and may well prove to be scalable enough), but not with respect to write amplification and the ergonomics/possibility of Git-based metadata sync: it's already clear that these properties won't be great for this structure. Thus, it seems destined to be changed or augmented with another, notebook key-scoped structure.

And if so, what's the problem with creating the notebook key-scoped structure right away? To minimise the diversity of legacy on-disk structures to be supported.


In your description of simple cache vs marimo export --sign-cache and marimo verify, you tie the structure to the use case. I don't see any need for this. I think both (and more) use cases can be served by the very same "persistence subsystem", and the same Loader interface (or whatever it's renamed to).

Pseudocode of export would then look like:

cacheLoader = cache_loader(root / "__marimo__/cache/", ...)
exportLoader = export_loader(root / "__marimo__/results/", ...)

for cell in app:
    key = exec_key(cell)
    content = cacheLoader.get(key, ...)
    ...  # extra preparations of the content for export
    exportLoader.put(key, content)

The exact structure of cache/ and results/ may differ (e.g., TOML vs. SQLite content mapping file; Git's vs some other "database" for file contents; etc.), or it may be the same (there is nothing that would prevent them from being the same in principle). But they are accessed and written to using the same interface: Loader.

Maybe another important point to be made here is that the "time machine" use case is neither the "quick and dirty cache", nor is it close to, or well served by, marimo export (unless marimo export is called after every single code update, which seems stupid). However, if all these use cases are served by the same Loader interface, these differences don't matter: whatever I want to do with "time machine" use cases may not need to be part of the marimo codebase at all. The single interface provides the greatest flexibility in shaping cache- and persist-adjacent use cases.


__marimo__/cache/ is quick and dirty for local or team use. In my mind marimo clean just uses meta.db to get rid of unwanted/ old cruft. This is fine for thrashing source and quickly messing around, and clean just frees up space.

What specific operations (either plumbing- or use case-level) are easier (either "easier for implementation" or "easier in runtime") with the __marimo__/cache/meta.db approach vs. notebook key-scoped approach? See my own advocacy for SQLite for reference.

I can think of only one such operation: use case-level operation "share the entire cache with a colleague manually, via file share"; and it would only be easier if there is a single file, meta.db, that is, no larger module contents that spilled over into a separate expensive_block/ folder. This operation would be easier because it would not require zipping the cache/ dir manually before upload/share (if there are no expensive_block/ folders). With notebook key-scoped storage, zipping would be obligatory.

It's a valid use case, but not of overwhelming singular importance. In fact, for notebook key-scoped structure, there is another, similar use case (manually sharing cache for a single notebook, rather than the whole repo/workspace) that becomes much easier than with the __marimo__/cache/meta.db approach. I mentioned it above, search "WhatsApp" on this page.

If there are no significant advantages to the __marimo__/cache/meta.db approach for cache storage (I don't see any; on the opposite, I see significant disadvantages; but I already mentioned them many times above), I would say that the "quick and dirty" cache/ directory should follow the same notebook key-scoped structure and use the same Loader interface. An additional benefit of this is easing the learning curve for people who will try to understand why cache/ and results/ or whatever have very different structures. I think it would be less astonishing to the users if they had the same structure.


Maybe for full time machine, these start to come together, in which case yes- cache invalidation becomes a lot more involved, and source history becomes important. In which case I think it's probably worth just dropping clean or just using LRU and evict based on disk, since space is relatively cheap for a managed service (doing something like this locally seems less important when you can intentionally clean, commit and export). I think the gain vs complexity for invalidating cache based on project source is not there. Even if it is, the dumb method is a good step

What exactly do you mean by "invalidating cache based on project source"?
(1) Notebook key scoping, as such? (Since notebook keys are derived "from source".)
(2) The "cell name or module code hash" part of the module execution key that I push for above?
(3) Full inclusion of source snapshot in module contents?

Also, what do you mean by "complexity" of adding these parts to the structure? Is it coding/maintainability complexity, or computational complexity in runtime? Or, "perceived complexity" for users, if/when it spills to cli commands or user-facing parts of the structure?

For (1) and (2) above, I don't think there is any extra complexity because all the complexity (coding, maintainability, and computational) is already incurred by Hasher. It just happens that Hasher returns a single module_hash at the moment, but Hasher could easily be changed to return (notebook_key, module_code_hash_or_cell_name, module_hash), adding essentially nothing to its cyclomatic complexity or to the computational cost of producing these hashes, relative to a single module_hash. This is because module_hash already "includes" module_code_hash; it's just not made explicit.

(1) adds user-facing complexity when the user just randomly wants to poke into the cache/ dir and, instead of meta.db, sees a wall of dirs whose names are 40-character random-looking alphanumeric strings (notebook keys). I agree this is less than maximally "simple and friendly". But... c'est la vie? It would be nice to just mirror the source structure of the main repo, or even put these dirs next to foo.py notebooks, named foo.persist/, but that would not work well with non-marimo-mediated notebook file renames and moves, which are expected to be more than common. After all, this is not even a "practical" use case, it's an "educational" one, and it would not be too hard for users to learn the meaning of these 40-character dir names and how to obtain the name for a specific notebook if they spend 5 minutes in the docs rather than diving directly into the cache/ dir.

(3) probably adds considerable implementation complexity. But it's definitely not needed for cache invalidation. Nor am I sure at all that I will even need (3) in my use case. As I mentioned before, in my use case AIs will edit the sources most of the time, so these AIs may be tasked to run git commit as frequently as needed. I discussed (3) above as a means for full reproducibility when humans edit the sources and are too "lazy" to make technical commits after each of their smallest changes. So, I discussed (3) above as a potentiality to think through the limits of the proposed design, not necessarily as something that should be on the development agenda.

@dmadisetti
Collaborator

last_accessed, total_accesses definitely cannot/shouldn't be stored right in the blob: it's not compatible with atomic storages, and leads to write amplification (writes even on cache reads because access metrics are updated).

Agreed

Also, I don't understand how they are stored in the blob "right now"?

@dataclass
class Cache:
    defs: dict[Name, Any]
    hash: str
    stateful_refs: set[str]
    cache_type: CacheType
    hit: bool

Some data is in there

way to which cell (in which notebook) it belongs. It might be from a notebook that the subject Marimo server runtime (or the cli run) doesn't "know" at all!

It doesn't need to know. I've successfully leveraged the same cache across 2 notebooks using an imported cell. I don't think there's anything wrong with that.
I've also reused the cache when merging 2 notebooks. Having a dumb system has benefits.
Scoping can be as specific or broad as one would like by adding a name specifier.

This looks like my first #3055? I noted why I discarded it #3055 (comment): a binary database format is not friendly to Git.

meta.db doesn't have to be SQLite; I think a readable flat file also makes sense, no objections to meta.toml.

What exactly do you mean by "invalidating cache based on project source"?
(1) Notebook key scoping, as such? (Since notebook keys are derived "from source".)
(2) The "cell name or module code hash" part of the module execution key that I push for above?
(3) Full inclusion of source snapshot in module contents?

I think my point is that it's brittle. Lots of added points for cache invalidation, lots of tying inputs to source where it's not fully needed. I'm not saying that closer coupling with source isn't nice (see my thoughts on result above).

@leventov
Author

leventov commented Jan 28, 2025

It doesn't need to know. I've successfully leveraged the same cache across 2 notebooks using an imported cell. I don't think there's anything wrong with that.
I've also used reused cache when merging 2 notebooks. Having a dumb system has benefits.

To retrieve a cache for a module (or cell), module_hash for it should be computed using Hasher. While computing that module_hash, Hasher already uses the sources (hash_raw_module). So, changing the return type of cache_attempt_from_hash and similar methods (currently the outputs are already packaged into a dataclass called Cache; that is, add new fields to it) from a single hash (Cache.hash) to the triple: notebook key, "code hash", and "module hash" is a technical change. It doesn't involve any substantial changes to the algorithms of Hasher and doesn't require Hasher to have access to any new data that it hasn't used already, only how the outputs are packaged and are propagated through the recursion tree.
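A hedged sketch of what that repackaging could look like; Cache, Hasher, and cache_attempt_from_hash are names from this thread, while ExecutionKey and the field names are invented here purely for illustration:

# Illustrative only: the real Cache dataclass has more fields; the point is
# "return a structured key instead of a single hash string".
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionKey:
    notebook_key: str      # stable per-notebook scope (e.g. from App(save_key=...))
    module_code_hash: str  # hash of the module's own code only
    module_hash: str       # the full hash that is used for lookups today

@dataclass
class Cache:
    key: ExecutionKey      # was: hash: str
    # ... existing fields (defs, stateful_refs, cache_type, hit) stay as-is

def legacy_blob_name(key: ExecutionKey) -> str:
    # A Loader that ignores scoping can keep using module_hash alone, which
    # is what makes the change backward-compatible.
    return f"{key.module_hash}.pickle"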

So, when you want to access a cache for a cell that is imported from another notebook, notebook key scoping would make no difference: Hasher would return a triple, and then "you" (Marimo save subsystem) would "mindlessly" pass that triple on into Loader.get().

If you don't want this scoping in your implementation of Loader, so be it: the Loader can ignore notebook_key and module_code_hash altogether. Exactly how the "legacy" Loader (implementing the support of existing __marimo__/cache folders) will do.

I think my point is that it's brittle. Lots of added points for cache invalidation, lots of tying inputs to source where it's not fully needed.

Honestly, I don't understand what you are talking about. What do you mean by "added points for cache invalidation"? Regarding "tying inputs to source": as I noted above, the "cache keys" (aka execution keys) are already tied to the source (do you disagree, or do you mean something different?). Any change in the source of the module (cell) or its upstream dependencies already leads to changes in module_hashes, and thus the cache is invalidated. What I suggested in my proposals above:

  1. To make this tie more explicit
  2. Indeed, to partially break this tie, for named cells. I didn't discuss this anywhere above because my messages above were already way too large, but caching might be configured to treat version as a higher-level part of the execution key than module_hash. In practice, this means that for all those cells that don't explicitly configure the version and it defaults to 0 for all stored module contents, the "selection of the latest module content" happens not per the provided module_hash, but only for the provided notebook_key, cell_name_or_module_code_hash, and version. For named cells, this will lead to retrieval of the latest stored contents for the cell regardless of the code changes, including in their upstream dependencies, as long as the cell name is unchanged. Maybe this approach shouldn't be the default because it has different semantics than for module_hash-keyed retrievals, but it seemed to me as a possible one, and may be useful in some use cases.

@dmadisetti
Collaborator

To retrieve a cache for a module (or cell), module_hash for it should be computed using Hasher. While computing that module_hash, Hasher already uses the sources (hash_raw_module). So, changing the return type of cache_attempt_from_hash and similar methods (currently the outputs are already packaged into a dataclass called Cache; that is, add new fields to it) from a single hash (Cache.hash) to the triple: notebook key, "code hash", and "module hash" is a technical change.

This is incorrect, module hash only is utilized for Execution Path hashes

@leventov
Author

Interesting, I realised that:

  • "Module code only" (without the code of the dependencies) is always a part of the module hash, regardless of the "cache type":
    self.hash_alg.update(hash_raw_module(module, hash_type))
    .
  • "Module code tree hash" (recursive) is not. Thanks for pointing this out! However, the optional "source snapshot" that is discussed above is pretty much equivalent to it: a practical impl of that "inclusion of source tree snapshot" would be a reference to the Git object id (SHA hash) of the needed source tree. So, that reference is already a "source tree hash".

Whenever I referred to the "module code hash" above, I used to think about the "module code tree hash". This was a mistake. But it also seems superfluous to me now (as a non-optional part of the execution key; as an optional addition to the module content, the calculus didn't change, although as I pointed out above I may not be that interested in it for my use case).

If we start thinking about all the mentions of "module code hashes" above as the "module code only" hashes, my arguments above (in particular in this previous message) become valid again?

This also reminded me of the discussion in #3270: the current impl falls back to hashing upstream module/cell dependency sources too often, such as when the dependency variable is any dataclass object, whereas using hashes of serialised exported vars would be less prone to unnecessary cache invalidation (if the code has changed but the resulting exported vars didn't) and "free" when all cells' contents are persisted by default anyway (#3054). However, this seems orthogonal to the discussion above. The fact that content retrieval for the dependency cells would require the "module code only hash" (as it does already; nothing changes here) doesn't hamper the "less prone to unnecessary cache invalidation" part, because this path would not require retrieving the dependency cells' contents: regardless of how the exported vars for these dependency cells are obtained this time (deserialised or re-computed), and whether that was from different source code, if only the hashes of the serialised exported vars are used in the calculation of this module's module hash, it won't be invalidated (as long as its own source is not changed, but that's by design, of course).

(The previous paragraph is me reasoning out loud. It doesn't modify the conclusions above.)

@dmadisetti
Collaborator

I'll bring it a bit back on topic with the auto-clean up, and we can start a discussion if we want to talk through this a bit more.

Whether it's tied to source or not, I think the base functionality of reading an external file and purging based on some set of criteria is the same, regardless of how the cache is structured, right?

Your proposal adds logic on when and how to invalidate, such that heuristics like LRU and LFU can be replaced with retention policies that confidently flush or retain based on the hash of source files, plus the requisite cache structure to enable this.


But here are my pros for creating another issue for the cache structure and getting this in naively first:

  • This is still a first step, much of the infra will be shared with a more complicated caching structure

My pros for reconsidering the cache structure all together:

  • I think disk space is relatively cheap, and the heuristics are sufficient.
  • The simpler structure requires less dev maintenance and has fewer hidden cases (even with our current method, there are some insidious edge cases that pop up)
  • The coupled cache/source structure is inherently tied to VCS, but I think marimo should not dictate those decisions for the user; moreover, binary or large blobs in source are less than ideal.

Here are your pros as I understand them:

  • This layout is supposedly super scalable (but I think this would need benchmarking)
  • The dependency lends itself to eventual "execution caching" and the time machine (I think there are other options, see below)
  • The earlier we commit to this structure the easier it will be going forward (Sure, if this indeed the correct way forward- but also wouldn't be too terrible to put it in later)

But- more on hot tub time machine. In a managed environment, I think using jujutsu (git wrapper that covers these needs) on cell execution and a mirrored closure of symlinks (to keep the binaries out of source) could solve much of the issue.

@leventov
Author

Your proposal adds logic on when and how to invalidate, such that heuristics like LRU and LFU can be replaced with retention policies that confidently flush or retain based on the hash of source files, plus the requisite cache structure to enable this.

Yes, except it's not based on "hash of source files". The latest version of my proposal involved two hashes in the execution key apart from the module hash:

(1) hash(relpath(notebook_file_where_app_var_defined, __marimo__dir.parent())) unless App(save_key=...) is provided explicitly. More detailed description above. Marimo runtime/server should be proactive in "making this key explicit/stable" by editing the app = App() line whenever the key is first computed.

Now, as I revisit that element of the design, I think maybe default_notebook_key = relpath(notebook_file_where_app_var_defined, __marimo__dir.parent()).split(".")[0] is better. The dir structure within __marimo__/persist would then need to mirror the workspace dir structure in which the notebooks are nested. Still, marimo should proactively make that default_notebook_key explicit via App(save_key=...), to prevent immediate invalidation/loss of the persisted data upon notebook file renames. A housekeeping command a la marimo cache tidy --fix-notebook-renames could move the persist folders and edit the save_key params in the notebook code.

(2) hash_raw_module() of the module. It hashes the Python AST of the module that Marimo runtime is already working with, not the "source file"; it doesn't read the file "just for computing the hash". (Maybe this is exactly what you meant by "hash of source files", but I wanted to make this point maximally unambiguous.)
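To make (1) and (2) concrete, here is a rough sketch under the assumptions above; default_notebook_key, save_key, and hash_raw_module are names from this thread, while the stdlib-based stand-ins below are mine and only approximate marimo's own hashing:

# Sketch only: approximates the two proposed key components with stdlib tools.
import ast
import hashlib
from pathlib import Path

def default_notebook_key(notebook_file: Path, marimo_dir: Path) -> str:
    # (1) Key derived from the notebook path relative to the workspace root,
    # used only until an explicit App(save_key=...) is written back.
    rel = notebook_file.relative_to(marimo_dir.parent)
    return rel.with_suffix("").as_posix()  # e.g. "nb/foo" for nb/foo.py

def module_code_only_hash(module_source: str) -> str:
    # (2) Hash over the module's own AST; marimo's hash_raw_module works on
    # the AST the runtime already has in memory, not on the file.
    tree = ast.parse(module_source)
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()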

I think disk space is relatively cheap, and the heuristics are sufficient.

Unless we misunderstand each other and keep talking past each other, I argued above that there are no such heuristics. Quoting myself:

For per-cell auto-cleanup, the problem is not the flat structure as such (at least at "single thousands" of module contents -- cleanup is maybe a rare enough operation that even listing and going through all entries in this directory is tolerable). But I think it's not possible at all because currently the content file {cache_type}{module_hash}.pickle doesn't record in any way which cell (in which notebook) it belongs to. It might be from a notebook that the subject Marimo server runtime (or the cli run) doesn't "know" about at all!

Indeed, you have suggested adding a per-workspace "registry" of the info needed for granular clean-up, namely meta.db.

I agree that your proposal (meta.db) would be sufficient for implementing the auto-cleanup functionality. The rest is the argument about other considerations beyond pure functionality: (1) scalability, (2) "Git ergonomics", and (3) the future maintenance implications of introducing an on-disk structure (meta.db) that (in my opinion) has a high chance of being soon abandoned in favour of something that is already relatively well understood (even if "in theory") -- hence my argument for starting with that "structure v2" right away. (And sure, maybe some practical feedback, learning, or even the course of actual coding of that "structure v2" will make us reconsider and propose yet another, incompatible "structure v3". This eventuality cannot be ruled out in principle, although I tried quite hard to avoid that by thinking through "structure v2".)

The simpler structure requires less dev maintenance

The entire difference between my and your proposal is that my proposal requires extra logic/code for computation, saving, and bookkeeping of "notebook keys". But that logic is isolated and doesn't mingle with the rest of the persistence subsystem logic (meta.db reading; writing; cleanup; lookup operations). So, the rest of the logic will be almost line-for-line identical.

That "rest of the logic" might be quite non-trivial (for example, handling concurrency of updates via locks on meta.db). But that's shared by our proposals; a version that doesn't have a meta.db at all seems like a non-starter (see above).

So, although I agree the handling of notebook keys is extra logic/code (and as such needs to be maintained and tested, so it adds to "dev maintenance"), I think it's a small difference: I think meta.db handling, + the choice of TOML/SQLite, + the "module content database" impl (+ a choice between a flat structure a la the current __marimo__/cache, Git's object db proper, an entirely remote option a la S3, etc.), + plug-in integrations with specific sync mechanisms like git-annex will quickly dwarf notebook key handling in total complexity and dev maintenance. Also, since notebook key handling is isolated, its contribution to the dev maintenance burden of the entire persistence subsystem is a "constant addition" rather than a "multiplicative factor".

hidden cases (even with our current method, there are some insidious edge cases that pop up)

To discuss this productively we need to look at concrete hidden cases and edge cases. It's not given that the more deeply scoped/nested structure generates more of these cases; it could just as well be the reverse, that the more deeply scoped/nested structure automatically eliminates edge cases inherent to the flat, shared structure.

What are these edge cases that have popped up already?

The coupled cache/source structure is inherently tied to VCS, but I think marimo should not dictate those decisions for the user; moreover, binary or large blobs in source are less than ideal.

Hm, I didn't suggest such coupling anywhere? Perhaps we also significantly misunderstand each other?

For ref, I discussed the "object database" in the section "Write" in this comment above, and later elaborated the reasoning behind that design significantly in this comment.

I suggested using Git object db as the "module content database" per notebook key (scope). These Git object dbs are not parts of any Git repos (neither "per notebook key", nor shared): they are used just as content-addressing DBs, nothing more.

I suggested using Git object db format as the "module content database" for a vague set of reasons:

  • Existing, very well-optimised methods written in C for packing small objects and merging two packs (via git repack).
  • Minimal possible storage overhead per object in packs: since Git packs are immutable, they simply sort object IDs, write them densely, and use binary search on access; the index of SQLite should remain a mutable B-tree, hence a somewhat larger overhead.
  • An option to wrangle with these objects without marimo cli, using git cli.
  • An option to point to the Git object db of the main repo, which may be useful if all cached results are exclusively text (generated by LLMs).

However, this comes with a somewhat nasty drawback: a Git object's header (which includes the object's size) must be known before the object is written, which would force us to chunk pickled blobs beyond a certain size (let's say 100MB) into separate Git objects if we don't want to crash the Marimo server/cli with an out-of-memory error, and if we want to compute the hash of the blob in memory before writing it out to exclude any possibility of transient blob corruption (short of main memory corruption, of course).
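For reference on why the size has to be known up front: a Git blob id is the SHA-1 of a header that embeds the content length, followed by the content itself, so the full payload (or at least its exact length) must be in hand before anything can be hashed or written:

# Computes the same id as `git hash-object` does for a blob.
import hashlib

def git_blob_id(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))  # matches `git hash-object` on the same bytes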

I think I will flip on this one and say that I now prefer the "module content database" to be SQLite + the file system for too-large contents, essentially coinciding with the earlier SQLite proposal. SQLite in this case would play exactly the same role as a "pack" in the "Git object db" version. These SQLite dbs could still be "immutable", and named with random suffixes (like packs), to aid merge-ability with some sync mechanisms (as was discussed above). If meta.db is a TOML file (I called it map.toml above), these SQLites would be separate; if meta/map.db is also SQLite, these two could use the same db file (with two different tables), but that should remain configurable: it could be undesirable for some sync arrangements to mingle "meta" (execution key mapping to contents) and contents in the same file.

The sole reason why I'm flipping is that the advantages and potential future optionalities offered by Git object db don't seem appealing to me now. So, that single aforementioned drawback of Git object db is enough of the reason for me to favour SQLite for "module content database" as the default/first implementation.

  • This is still a first step, much of the infra will be shared with a more complicated caching structure

  • The earlier we commit to this structure the easier it will be going forward (Sure, if this indeed the correct way forward- but also wouldn't be too terrible to put it in later)

I agree about the transferability of infra/logic/impl. I would not raise an issue at all if we weren't talking about writing a structure to users' disks. That racks up a maintainability burden/commitment to graceful upgrades much more than pure source code manipulations do.

@leventov
Author

leventov commented Jan 31, 2025

But- more on hot tub time machine. In a managed environment, I think using jujutsu (git wrapper that covers these needs) on cell execution and a mirrored closure of symlinks (to keep the binaries out of source) could solve much of the issue.

Thanks for pointing to this project, I didn't see it before.

It's curious to learn how the design elements they use to enable robust concurrency, including in the face of sync mechanisms like Dropbox or rsync, such as the operation log and "first-class conflicts in the commits", are related to the solutions that I've come up with above: the transaction log, the random suffixing of content mapping file names, and the random suffixing of SQLite dbs that store small contents (I didn't mention that in the previous comment, but it should be there, exactly mirroring Git object db's random suffixing of pack files. It's also for conflict resolution/prevention purposes).

Perhaps we can learn more from jujutsu on this front, I will need to ponder more on their design docs and discussions in this area.

Another interesting aspect of jujutsu is the separation of backends. This could have been an interesting leverage for Marimo's persistence subsystem if jujutsu was already a mature project and supported a lot of backends. However, currently, they only have the Git backend itself and proprietary backends to Google's monorepo. So, there is practically no upshot in using jujutsu in Marimo's persistence subsystem right now, only a big added dependency (and the need to understand their Rust codebase, which would be a challenge for me).

Currently, the fastest path for using many storage backends is:

  1. Store files locally as we discussed up to now
  2. Module content database as we discussed above: let's say in SQLite files + contents bigger than 100MB directly in the file system
  3. Use sync mechanisms with a wide range of supported backends, of which restic and git-annex stand out in terms of the range of supported backends.
  4. Optionally, do the "hardlinking maneuver" in before_sync() and on_exit() to prevent "sync amplification", as discussed in the very end of this comment.

This wouldn't be as optimal as creating custom Loaders for backends, but it would already work.

Finally, I didn't see anything in jujutsu that makes it particularly well-suited or optimised for "time machine", even over the persistence structure that I proposed above and that would already be needed for "mere" auto-cleanup per module/cell. And even if there was something (although I don't see anything), the upshot of sharing the structure between "non-time machine" and "time machine" use cases, with the possibility for users to switch seamlessly and mix and match these use cases even within the same notebook (indeed, the "default version = 0" design implicitly makes all module contents "timed") would still be more important, I think.

FWIW, one VCS function that would be particularly handy for "time machine" use cases is running some computations against a certain commit/revision without juggling the working tree. It's proposed for jujutsu as jj run but is not implemented yet. However, it is already implemented for plain old vanilla Git, via the git-branchless extensions, as git test run. git test run can be used with the proposed Marimo persistence subsystem if the content mapping files are stored as TOML files in Git, as we discussed above already.

@leventov
Author

leventov commented Feb 1, 2025

Another benefit of notebook scoping that I hadn't thought about before is that it offers, "for free", a way to control how the module contents are synced for different notebooks. It can be as simple as telling Dropbox to sync these folders and not others. Those others just stay on the developer's machine, unsynced.

With a more centralised module content db, this can only be achieved by specifying a different folder for the App, a la app = App(cache_cells_by_default=True, cache_loader=DefaultLoader(content_dir=mo.__marimo__dir / "local-cache")). But then, if I want to change this decision, I need to move the existing cache and run some marimo commands for that. Whereas with notebook scoping, it's a completely decoupled decision: the Dropbox sync strategy can be changed at any time without marimo knowing about it at all.

@leventov
Author

leventov commented Feb 6, 2025

Ok, I will proceed with a PR including what we have agreed on (I hope); stalling this doesn't make sense.

This PR will include:

(1) On the Loader abstraction level, the execution key includes

This does not include:

  • "notebook key", assuming that this version of the Loader just ignores it and stores everything in a single place per workspace. This would be considered a legitimate choice even after notebook key scoping is introduced, so this is forward-compatible.
  • "version". Forward compatibility here could be ensured later when the future code of this Loader will interpret TOML's array table names with fewer sections as implicitly having version 0. As the Python interface to specify the version other than 0 for the cell (or the cached block) is not here yet also, any other version couldn't be specified yet.

(2) "meta"/"module content mapping" DB abstraction per workspace, starting with the TOML format

  • It's possible to add SQLite format for this DB later, although it makes less sense in the per-workspace structure. That's because the per-workspace structure that would be implemented in this PR itself mostly makes sense as long as the workspace is small, i.e., only a few notebooks. For this scale, TOML scalability is less of an issue. OTOH, SQLite as meta/"module content mapping" DB becomes "aggressively" Git-unfriendly already on this scale.
  • Concurrency control: this impl will use the TOML file as its own "transaction log", that is, appending new cache entries to the end of the file, possibly unsorted. It will use https://github.com/tox-dev/filelock to prevent multi-process races with other Marimo processes; see the sketch after this list.
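A minimal sketch of that append-under-lock path; the entry layout and table naming below are illustrative, not the final format:

# Sketch only: appends one entry to the TOML "meta"/transaction-log file under
# a cross-process lock; organisation/compaction happens later, in on_exit().
from datetime import datetime, timezone
from pathlib import Path

from filelock import FileLock  # https://github.com/tox-dev/filelock

def append_meta_entry(
    meta_path: Path,
    module_code_only_hash: str,
    module_hash: str,
    content_hash: str,
) -> None:
    created_at = datetime.now(timezone.utc).isoformat()
    entry = (
        f'[["{module_code_only_hash}"."{module_hash}"]]\n'
        f'created_at = "{created_at}"\n'
        f'exported_vars = "{content_hash}"\n\n'
    )
    with FileLock(str(meta_path) + ".lock"):  # serialize writers across processes
        with open(meta_path, "a", encoding="utf-8") as f:
            f.write(entry)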

(3) on_exit() performs clean-up (configurable, e.g., "leave only the last created_at entry for each module_code_only_hash") and organises the TOML file.

This is the "auto-cleanup" functionality that justifies the discussion of the PR in this issue, #3176.

"Organisation" means sorting tables by name (that is, by leading section of the execution key: module_code_only_hash, then by module_hash, then by created_at) and compactifying (like, section [module_code_only_hash]) when possible.

Does not include before_sync() and the corresponding HTTP endpoint for Marimo server yet.
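As a rough illustration of the kind of configurable retention policy mentioned in (3) (the entry shape here is the hypothetical one from the sketch above, not a final schema):

# Sketch of one possible policy: keep only the newest entry per
# module_code_only_hash; real policies should be pluggable/configurable.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Entry:
    module_code_only_hash: str
    module_hash: str
    created_at: datetime
    content_hash: str

def keep_latest_per_code_hash(entries: list[Entry]) -> list[Entry]:
    latest: dict[str, Entry] = {}
    for e in entries:
        current = latest.get(e.module_code_only_hash)
        if current is None or e.created_at > current.created_at:
            latest[e.module_code_only_hash] = e
    return list(latest.values())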

(4) module content db
As per the latest proposal, it's SQLite, with contents larger than some threshold (100MB by default) stored on the file system as files named content_hash.

SQLite's table has just three columns: content_hash (TEXT), size (INTEGER), and content (BLOB). No "absent entries": if there is no entry, it's assumed that the content is stored in the FS and should be looked up there.

As in Git object db, the extension of these files is not recorded in the file name (thus, not content_hash.pickle or content_hash.html for module outputs).
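A hedged sketch of this put/get split; the table schema matches the three columns described above, while the function names and threshold handling are just illustrative:

# Sketch only: small blobs go into SQLite, large ones become plain files named
# by their content hash (no extension), mirroring the scheme described above.
import hashlib
import sqlite3
from pathlib import Path

THRESHOLD = 100 * 1024 * 1024  # 100MB default, per the proposal

def open_content_db(path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "content_hash TEXT PRIMARY KEY, size INTEGER, content BLOB)"
    )
    return conn

def put_content(conn: sqlite3.Connection, fs_dir: Path, blob: bytes) -> str:
    content_hash = hashlib.sha256(blob).hexdigest()
    if len(blob) > THRESHOLD:
        (fs_dir / content_hash).write_bytes(blob)  # spill large blobs to the FS
    else:
        conn.execute(
            "INSERT OR IGNORE INTO objects VALUES (?, ?, ?)",
            (content_hash, len(blob), blob),
        )
        conn.commit()
    return content_hash

def get_content(conn: sqlite3.Connection, fs_dir: Path, content_hash: str) -> bytes:
    row = conn.execute(
        "SELECT content FROM objects WHERE content_hash = ?", (content_hash,)
    ).fetchone()
    if row is not None:
        return row[0]
    # No SQLite entry => the content is assumed to live on the file system.
    return (fs_dir / content_hash).read_bytes()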

  • When/if the need arises for formats other than Pickle for exported module variables and HTML for outputs, it will be specified on the "meta" db level, as extra keys in the entry object (recorded as a TOML table, or an SQLite db row when/if an SQLite "meta" db impl is added), for example: {created_at = ...datetime, exported_vars = ...content_hash, stdout = ...content_hash, stderr = ...content_hash, exported_vars_format = "json", stdout_format = "txt"}. In the first implementation, there will be no exported_vars_format, stdout_format, and stderr_format keys, the defaults being assumed to be Pickle, HTML, and txt respectively.

Other module content db optimisations discussed here can be added later. Notably, they may require adding a key to the entry object above like referenced_contents = [content_hash1, content_hash2, ...] also on the "meta" db level, not the "module content" db level. This is forward-compatible.

@dmadisetti
Collaborator

@leventov

"leave only last created_at entry for each module_code_only_hash"

I still don't like this as an automatic feature because code thrashing is a pattern I go through a bit; having a lingering cache is nice if I am iterating while coding.

I get that what you are trying to maintain is a closure of required cache objects for execution, with auto-cleanup for that. Rigorous cleanup around caching is not especially desirable because code changes might push us back to the other case (this is a common dev pattern I see in myself).

I do think having a closure of cache objects for execution makes sense, but I don't think this should be automatic unless it just cleans up above a specific disk threshold.


With 1 I think you'll find that writing information to the meta file should make those attributes redundant (but fine and non-breaking if you put them in loader.meta).

Absolutely on board with 2

Let's make 3 a manual trigger or just tie it to a threshold

and 4 needs benchmarking and investigation (let's open a new issue? You also seem to have gone back and forth on this. I have few blobs under 100MB, and this seems like an added step, making the PR much bigger than it needs to be)

@leventov
Author

leventov commented Feb 6, 2025

I still don't like this as an automatic feature because code thrashing is a pattern I go through a bit; having a lingering cache is nice if I am iterating while coding.

The cleanup strategy is definitely configurable; that's why I wrote "e.g.". In fact, it should be maximally configurable and use-case-agnostic, as I indicated well above:

The persistence subsystem doesn't need to constrain and care about the semantics of the usage of version within a notebook or specific cell. It only needs to provide flexible ways to auto-cleanup contents along all relevant "dimensions": [module code only hash -- I've added this just now -- Roman,] module hash, version, and created_at.

So, I'm completely indifferent to the default cleanup strategy (if there should be a "default" one at all! maybe the user should always specify it when they want a clean-up!) as long as the clean-up strategy needed for my use case can be configured.

Rigorous cleanup around caching is not especially desirable because code changes might push us back to the other case (this is a common dev pattern I see in myself)

I didn't get what you mean here.

With 1 I think you'll find that writing information to the meta file should make those attributes redundant (but fine and non-breaking if you put them in loader.meta).

Didn't get what you mean here.

Let's make 3 a manual trigger or just tie it to a threshold

before_sync() was supposed to be the manual trigger. You can also find the discussion above about whether before_sync() and on_exit() should be two separate methods or collapsed into a single one. I will give it one more thought during the implementation, but anyway, this is the kind of thing that is easy to switch either way.

I've also mentioned timeout as another possible trigger above. Currently, as #3054 is not yet implemented, the "added entry count threshold" (that you seem to refer to by "just to threshold") doesn't yet make sense because the total number of entries in this file will not be huge anyway. But in the future, the added entry count threshold makes sense too, sure.

and 4 needs benchmarking and investigation (let's open a new issue? You also seem to have gone back and forth on this. I have few blobs under 100MB, and this seems like an added step, making the PR much bigger than it needs to be)

Before #3054 it's superfluous indeed, as the total number of entries is small. So "No-sqlite, just FS" can be considered a "module content db" option. Indeed, in this planned PR, I can do just that, and delay the "SQLite+large files" module content db for #3054.

After #3054, there should be some kind of centralisation (rather than "file for each separate module variable pickle and output HTML") for myriads of cells, the variables and outputs of most of which are tiny. Storing them all in separate files would be wasteful for footprint and synchronisation mechanisms. This was part of the thesis all along, starting with #3055. I didn't go back and forth on it. I did go back and forth on the "centralisation mechanism": from SQLite to Git's packfiles and back to SQLite. Ok, well, the so-called "second proposal" which is laid out in the original post of this issue above didn't have such centralisation at all, but it seems like an omission to me now. Anyways, "FS only" will remain a "module content db" option for those who don't have a lot of cells.

@dmadisetti
Collaborator

by "disk threshold" I mean something like: if disk space used > allowed disk space for cache: run clean-up

The check for clean-up would probably be best placed in post-execution hooks as opposed to the Loader level

@leventov
Author

leventov commented Feb 7, 2025

by "disk threshold" I mean something like: if disk space used > allowed disk space for cache: run clean-up

That's where an SQLite-less and "shared" cache (per workspace) already becomes... troubling? There is no way to estimate the total size other than to crawl the directory, a la the "Disk Inventory X" app. With SQLite, in comparison, it would be a simple select sum(size) from objects; query. But fine, fine :D for relatively small cache directories it probably doesn't make that much of a difference (although... if you want to call the estimator in the post-execution hook of every cached module or cell, it does become problematic even for a quite small number of objects, like a hundred?). I will stick to the "pure FS" module content db at first.
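For concreteness, the two size estimates contrasted here look roughly like this (sketch only; the objects table is the one from the content-db sketch above):

# Directory crawl vs. a single SQL aggregate for estimating cache size.
import sqlite3
from pathlib import Path

def cache_size_by_crawling(cache_dir: Path) -> int:
    # The "pure FS" option: walk every file and sum the sizes.
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())

def cache_size_from_sqlite(conn: sqlite3.Connection) -> int:
    # The SQLite option: one aggregate query over the objects table.
    (total,) = conn.execute("SELECT COALESCE(SUM(size), 0) FROM objects").fetchone()
    return total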

The check for clean-up would probably be best placed in post-execution hooks as opposed to the Loader level

Ok, but then the post-execution hook should just delegate to the Loader. The Loader is in the best position to do the necessary estimations such as the total size, as evidenced above. Also, it should probably be the Loader who owns the configuration of the said triggers and thresholds (size, timeout, etc.).

@leventov
Author

leventov commented Feb 7, 2025

That's where an SQLite-less and "shared" cache (per workspace) already becomes... troubling? There is no way to estimate the total size other than to crawl the directory, a la the "Disk Inventory X" app. With SQLite, in comparison, it would be a simple select sum(size) from objects; query. But fine, fine :D for relatively small cache directories it probably doesn't make that much of a difference (although... if you want to call the estimator in the post-execution hook of every cached module or cell, it does become problematic even for a quite small number of objects, like a hundred?). I will stick to the "pure FS" module content db at first.

The workaround may be to store the sizes in the "meta" db. But this would become wasteful for "SQLite+FS for large files" arrangements, and in general, it's not the best idea to bloat the "meta" db as long as it is implemented as TOML, which is not that scalable (every time the file is loaded, all these sizes have to be parsed throughout the file...).

Even though this choice (whether to store sizes in the "meta" db or not) can be a config option, you may see here how catering for all these structures starts to drive up the implementation complexity. But not so much as to be a deal breaker, so I'll do it in this instance.

@dmadisetti
Collaborator

dmadisetti commented Feb 7, 2025

as long as it is implemented as TOML

JSON, SQLite, or a flat file are all reasonable. I think you initially sold me on SQLite here; I don't think meta needs to be readable (I think blob signatures are probably another useful entry at some point).

Ok, but then the post-execution hook should just delegate to the Loader. The Loader is in the best position to do the necessary estimations such as the total size, as evidenced above. Also, it should probably be the Loader who owns the configuration of the said triggers and thresholds (size, timeout, etc.).

Loader objects are dead at this point; you'd have to invoke another Loader, and I'm not sure that makes sense. Why not just create a separate namespace? It might be worth making an Adapter class (PickleLoader prepares a pickle blob from a Cache and loads a pickle blob into a Cache object, but the Adapter writes the blob to disk, or upserts an SQL entry, or, in this case, deletes it), but I think this is a follow-up (although a quick one), since with a naive FS it's as simple as os.remove in the hook. Having a separate Adapter class could also distinguish your caching structure from the current naive one (while also enabling things like Redis). I'd actually be really happy with all of your proposals, since they would be opt-in if structured with an Adapter class.
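A sketch of how that Loader/Adapter split could look; Loader, PickleLoader, and Cache are names from the thread, while the Adapter protocol and its method names below are invented for illustration:

# Illustrative split: the loader handles Cache <-> blob (de)serialisation,
# while an Adapter owns where blobs live (FS, SQLite, Redis, ...).
import pickle
from pathlib import Path
from typing import Protocol

class Adapter(Protocol):
    def write(self, key: str, blob: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...

class FSAdapter:
    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, key: str, blob: bytes) -> None:
        (self.root / key).write_bytes(blob)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def delete(self, key: str) -> None:
        (self.root / key).unlink(missing_ok=True)  # the "os.remove in the hook" case

class PickleLoader:
    # Only knows how to turn a cache object into bytes and back.
    def __init__(self, adapter: Adapter) -> None:
        self.adapter = adapter

    def save(self, key: str, cache: object) -> None:
        self.adapter.write(key, pickle.dumps(cache))

    def load(self, key: str) -> object:
        return pickle.loads(self.adapter.read(key))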


Hmm, interesting reading your response. Yeah disk space seems important to me. I think in most cases:

  • a cache clean up is primarily to clean up disk space
  • having a closure of compute artifacts to speed up or prove computations is still best done from the ground up (i.e. a fresh run that can be verified all the way through)

I think my viewpoints on complexity are pretty consistent from those starting premises, but I also understand yours (wanting to clean up the cache closure to only relevant compute artifacts that reflect the notebooks).
