Persistent cache with auto-cleanup #3176
Note how the proposed solution above is not 100% data-loss free, but this is probably fine (see non-goal (2) in the top message in this issue). Marimo app instances should obtain

However, this is not the case for Git, Dropbox, and a lot of other data sync solutions: they are not cooperative. So, the following data race is possible:

The probability of this is probably small enough not to bother at all, but just in case we do want to bother, we can still add the module hash to the
|
I think a cache-cleaning mechanism makes a ton of sense. Having some sort of metadata to do this also makes sense (modification stamps are unreliable)
I think that really the gold standard for caching would be some sort of remote caching, where the user doesn't have to worry about what's on their disk, and just needs a static file server to host a cache (something like S3, or even Google Drive, etc.). Gathering feedback before moving on to defaults could inform how often this caching mechanism is even used.

Renaming the issue, since as I mentioned in #3177, "Stable IDs" aren't really needed since:
But glad to see you're excited about this! |
.table can be seen as csv with a space delimiter and opened like in the example here: https://docs.python.org/3/library/csv.html#csv.reader. But on the one hand it's so simple that we don't really need a "csv reader", and on the other hand, to make it extra bulletproof we need to read manually and filter out git merge conflicts (those

SQLite for a dozen lines doesn't make sense. More "global" SQLite, as I envisioned before, would suffer from merge conflicts. Even JSON would be worse, it would not play exactly as smoothly with |
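For illustration, a minimal sketch of such a reader, assuming a space-delimited "module_hash blob_hash" layout per line (an assumption based on the proposal in this issue) and simply skipping Git merge-conflict markers rather than choking on them:

```python
import csv

# Assumed .table layout: "<module_hash> <blob_hash>" per line.
CONFLICT_PREFIXES = ("<<<<<<<", "=======", ">>>>>>>")

def read_table(path):
    entries = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=" "):
            if not row or row[0].startswith(CONFLICT_PREFIXES):
                continue  # skip empty lines and Git merge-conflict markers
            module_hash, blob_hash = row[0], row[1]
            entries[module_hash] = blob_hash
    return entries
```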
Does "execution path hashing" mean basically AST hashing of the cell's code + all of its dependencies' code, without "runtime" content hashing? If these execution path hashes can be computed in parallel with "full" module hashes before cache lookup, we can do it. Then the "short history lookup" that I've talked about would be based simply on the fact that the cache cleanup wouldn't be THAT proactive, and would only happen with an explicit separate call to

Deduplication with previous content: it would still work at the Marimo app runtime level, since the Marimo runtime can record which module_code_hash corresponded to the cell most recently, and then compare the new content blob with that file. However, it would not work in the following scenario:

Unless Marimo wants to engage with Git, and try to compute module_code_hash for the most recent staged or committed version of the cell to try to recover a "potential candidate for deduplication"... but that seems far-fetched.

With everything that is described above, the overall directory structure would still be the same, except that instead of
|
It doesn't seem to me the above would provide great cache robustness. Note an important thing: whereas you probably mostly think about two scenarios:

The main scenario in which I see myself using Marimo is a third one: the majority of cells are created and edited by AIs, e.g. within Co-STORM. The wrapper code can add UUIDs trivially, so the clunkiness of (2) above is not an issue. But unless that wrapping logic is also knee-deep in Marimo internals (calling module_code_hash directly before it instructs the AI to re-write the cell, for instance, and then manually telling Marimo somehow to look up potentially duplicate results there), the recent-history deduplication wouldn't work. The ability to provide explicit cell_ids (which my wrapper logic can generate easily) would enable much better separation of concerns between AI workflow wrappers and the Marimo layer. And also, those UUIDs needed to "correlate cells across Git branches" and to move cells from one notebook to another without "losing the context" are also directly motivated by those AI-workflow-ish scenarios. |
FWIW, in the previous comment, I don't question the module_code_hash approach (as I see it) altogether; it seems to me now that it would definitely work in many use cases. Rather, I demonstrate a strong demand for "user-generated cell_ids", and caching based on them, too. But these two approaches can seamlessly coexist, just on the same level, and mix together:
Need to think a little more about the edge cases, but I'm sure it's fully workable. |
I now think that the entire cache could be stored as a git submodule which itself would be a bare repo, where all actual cached files are stored in the

This git module would not represent any "directory of files" in a meaningful sense: in fact, it would not have any commits! It would only contain blob objects, tree objects, and tags. The reason is that commits add unnecessary indirection to blob files. Fortunately, git tags can point to git blob and tree objects directly, as well.

Simple cell results would be blob objects, stored without compression by default (corresponding to core.compression=0 in git; this is very atypical for "normal" git repos, but for our bare pseudo-repo, where these objects would be accessed for primary cached content blobs, this probably makes sense by default to reduce latency, when the cache size is not an issue, as long as auto-cleanup of outdated cache entries works).

Once these are written, a weak reference identity dict (such as the one in SQLAlchemy) can be updated with mappings from the ids of the cell's local objects (which have just been pickled) to tuple

The Pickler for the cell results and "scoped refs" in BlockHasher should have a custom

If pickling the cell results itself has used this deduplication mechanism, then in addition to the blob object, it would also write a git tree object that refers to those "upstream cell" blobs. This is needed so that

Pickling of large out-of-band buffers (such as ndarrays, dataframes) could be de-duplicated similarly. If a buffer larger than some threshold (e.g., 1 MB; via memoryview.nbytes) is passed into

Finally, the mapping from the cache key, i.e., the module hash, is the name of the tag. If a uuid-like cell_id is present, then the tag name could be

When a tree is created anyway for the cache entry (that is, for recording "upstream" and buffer dependencies, as described above), the tag could be a lightweight tag, and the "created at" time of the cache entry can be recorded as a dummy "file" in the tree object, named

When the cache entry points to a simple blob object, an annotated tag would need to be created (a separate git object), just to record the "created at" time for the cache entry. Fortunately, this doesn't mean obligatory two-hop indirection when the cache is accessed on disk, because reftable stores "peeled" tag targets inline (see here, value_type = 0x2).

This whole construction wouldn't scale well to thousands of module_hashes without the new reftable format for storing references in Git, which plays the role of cache lookup here (instead of, implicitly, the file system itself, as was proposed above, where the module_hashes or cell_ids are directory names). The reftable API is not yet in libgit2 (see libgit2/libgit2#5462), but I hope it will be there soon. For now (for initial testing and small-scale purposes), the classic "filesystem" format for these references would also work. |
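A rough sketch of the blob-plus-lightweight-tag mechanics with pygit2 (the repo path and the tag-naming scheme here are assumptions, not a settled format):

```python
import pygit2

# Bare pseudo-repo used purely as a content-addressed store: blobs + tags, no commits.
repo = pygit2.init_repository("__marimo__/cache.git", bare=True)

def put(module_hash: str, payload: bytes) -> None:
    blob_oid = repo.create_blob(payload)  # write the cached content as a blob
    # Lightweight tag pointing directly at the blob; the tag name encodes the cache key.
    repo.references.create(f"refs/tags/{module_hash}", blob_oid, force=True)

def get(module_hash: str) -> bytes | None:
    ref = repo.references.get(f"refs/tags/{module_hash}")
    return repo[ref.target].data if ref is not None else None
```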
The main advantages of this design are the following:
|
If a non-deterministic cell produces two different results that are both persistently cached, these could be stored in separate tags, such as |
I'm updating the latest design proposal (described three comments above; I'll call it the "Tags proposal" below due to its reliance on Git tags) to reflect the input from our discussion and my latest thinking. I think this persistence subsystem could be a "core" implementation for all persistent execution, persistent cache, and remote persistent cache facades, hence I first suggest a shift to neutral terminology (and will use it below):
The interface and the data model

Scoping with notebook keys

A significant departure from all my previous proposals is that I suggest scoping execution keys per notebook. This means that the notebook save key also becomes a part of the execution key. I think these keys can be computed in the following way:
Why: if persistent execution or cache is enabled by default (#3054), we should from the beginning plan for the persistence subsystem to scale to 10k module contents stored within one "workspace" (not an official term yet, but I mean whatever collection of notebooks is run from a single root dir and shares the same
Scoping with notebook keys is also a tree-like structure which helps with scaling. But it has extra advantages over the previous proposals:

The "scalability benefit" that this scoping alone brings will probably be sufficient for a lot of use cases, so the execution key -> content mapping implementation may remain simple, readable, and mergeable initially. This is an improvement over the SQLite proposal, where the mapping is stored in a binary, not easily mergeable format (the SQLite database format), and is monolithic, which would cause permanent write and sync amplification for sync systems that don't compute deltas within files before sync (some do, but many don't).

The reftable (in the "Tags proposal") was supposed to be an improvement in merge-ability over SQLite, but not quite (the operations with remote refs in Git are obviously not perfectly optimised for this use case), and at a significant tradeoff with inspectability/observability of the format: compared to SQLite, ~no one knows about reftable, there are no public tools for working with it besides Git itself, and it hasn't even been added to the libgit2 and hence PyGit2 libraries yet.

The "second proposal" shared this benefit with "scoping with notebook keys" over the SQLite and Tags proposals. However, it had the following downsides:
Module execution keys

Module execution keys logically include:
Parts 1-4 of the execution key must be specified both on the read and write paths. The 5th part, the

The version was already mentioned in the previous comment. Version is an alphanumeric string that can be used in multiple ways: either to enumerate non-deterministic results or to represent bitemporality, in which case the version would actually be a timestamp, in addition to the
At the Marimo Python API level, the version could be set via optional parameters to
The persistence subsystem doesn't need to constrain or care about the semantics of the usage of version within a notebook or a specific cell. It only needs to provide flexible ways to auto-cleanup contents along all relevant "dimensions": module hash, version, and

Module contents

Module contents logically include:
There could also be more optional parts of module contents. Perhaps they shouldn't be implemented at first, but it's good to think about making the interfaces and storage implementation somewhat future-proof for adding such extra parts of module contents.
Write, before_sync, and on_exit callbacks

The current version of

This constraint either exacerbates write and sync amplification or drives the storage implementation to stick to the "file per module content" approach, which can quickly generate lots of files with very small contents (lots of cells whose content is a few variables with primitive or string values). In fact, the "Tags" proposal was an attempt to address this tradeoff: although Git would still create a separate file per content in that proposal, if Git itself was used for sync, it would employ thin packs to avoid making lots and lots of separate network transfers.

However, this constraint is absolutely unnecessary. The storage interface should be extended with two callbacks:
Apart from consolidating small module contents into "packs" that are more convenient for sync, these callbacks may also do proactive cleanup of contents, such as removing everything except the latest (by

Implementation

Below I sketch a possible implementation that could be extended and made more configurable in the future, yet may serve "as is" for a vast majority of potential future use cases.

A notebook folder is created per notebook key. These notebook folders are bucketed like in Git and like in the proposal in the first message in this issue:

Write

On write (aka

Very large content files could also skip the objects database and go to another database more suitable for these files, such as a workspace- or repo-level git annex objects database, if so configured. Same for watched file contents

At this point, the mapping from the execution key to the content file hashes (which are needed to find the contents in the objects database) is stored only in the Marimo process memory, not yet persisted. Optionally, this information could be persisted in an append-only
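A small sketch of the Git-style bucketing of notebook folders mentioned above (the cache root and key values are placeholders):

```python
from pathlib import Path

def notebook_folder(cache_root: Path, notebook_key: str) -> Path:
    # Bucket by the first two hex characters, Git-style, so the top-level
    # directory never grows beyond 256 entries.
    return cache_root / notebook_key[:2] / notebook_key[2:]

# e.g. notebook_folder(Path("__marimo__/cache"), "ab12cd...") -> __marimo__/cache/ab/12cd...
```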
|
Discussing the "Implementation" above: on a finer level, details may vary, e.g., actually re-surfacing SQLite for the content mapping files (separate for different notebook keys), instead of transaction.log + map.toml as I described in the sketch above. Making the content mapping file format configurable and swappable should be fairly easy. These two approaches (readable toml and binary SQLite) would suit different sync mechanisms better. If the user wants to commit the mapping files to their main Git alongside Python notebooks and sync the content files (objects and/or packs) with Dropbox, GDrive, git annex, some popular backup mechanism, etc., then text-based map.toml is better. OTOH, if the user wants to sync everything in

Either way (or maybe yet other strategies), I think it should be fairly forgiving once notebook key scoping is in place, and not lead to terrible write or sync amplification. |
Not sure I'll be able to provide much valuable feedback, but high-level:

1. Module execution keys: Yes, I agree these keys make sense, and a notebook key does as well. On version, I am on the fence. I would hate to upgrade marimo and lose all my cache. I can see us making breaking changes to this, so internally we could have our own version for the cache that would bust it.

2. Module contents: This would be quite a useful artifact. It would be great if it could be modular too and work with only some of the artifacts. For example, you may want to commit the outputs to git, but not the variables and values (which could be a large dataframe). So putting them in separate outputs (so they can be easily gitignored) might be a good design. |
It took me a while to read past the implementation suggestions of your most recent comment to fully get the core idea of why you would want this. Essentially, you want a 'hot tub time machine' on a per-cell basis, where each run is snapshotted with exact code + output, designed to scale and be easily shared/cleaned up. I do think this is decoupled from the auto-cleanup method, since you could feasibly have one without the other.

Regarding the idea, I think it has potential. You could potentially directly leverage git instead of manually trying to fold in the directory structure. But I think that the implementation might be hard for wide adoption, since proper usage would maybe still have to tie directly into the source tree and source control methods. However, I do think as a git wrapper, or in a managed environment like https://marimo.io/dashboard, something like this might be useful and an improvement over the current snapshot method. |
I see it like this:
Upgrade to the new backend: a command-line action that takes a notebook, computes all module hashes and module code hashes for all

When the user converts all notebooks within a workspace that have used
Yes, this is a good rephrasing. If you are interested, there is a detailed flesh-out of the system concept that I previously hinted at as "Co-STORM with Marimo backend" here: doc. It would actively leverage the persistence subsystem described above.
The main "time machine persistent execution" add-ons are the "optional parts of module contents": full parent module execution keys and code snapshots. Other skeleton parts of the proposal above (notebook key scoping, cell name or module code hash as part of the execution key) seem important for efficient auto-cleanup either way. Cf. this comment.
I thought a lot about how to do this. The main reason for the delay with the follow-up after our call is that I spent a lot of time fleshing out that path, where I tried to "directly leverage Git" by constructing Git trees for notebooks, cells/module code hash, module hash, version, and created_at. But ultimately I scrapped it because it felt "off" to me:
The reasons why I ditched the "Tags" proposal above are approximately the same. Another way to leverage Git would be to make the backend impl use an arbitrary (configurable) git object database for content files, such as the user's own
I didn't suggest using a "globally shared" objects database because:
In summary, it becomes a hassle for little to no actual benefit. As for @mscolnick's suggestion that some people may want to have cell outputs in Git, that can be done by configuring different "databases" for them within notebook folders: |
@dmadisetti @mscolnick it seems to me we don't have substantial disagreements, do we? I will move forward with making a draft PR for this framework with some format of content mapping files (toml or sqlite, I'm not sure which I will choose to implement first yet, but anyway the framework could be extended to support both, or another format, later in a configurable way), unless you add more comments or raise concerns about the design by the end of the week. |
I'm wary of the time machine part; it still seems potentially a little over-engineered and a premature optimization. It seems like the

For instance, you could potentially trim parts of the notebook graph that are intermediate compute steps, allowing complex notebooks to be exported, verified, interactive, and WASM embedded with a cache. I'm not saying the proposal will inhibit this, but I do think that this and other cases should be considered from the beginning.

As for toml or sqlite to manage metadata and cleanup, I absolutely agree this is needed in all cases. Implementation with the current setup is still a good start. |
I don't see how this is even possible. I already argued that in the very first message in this discussion:
Even if you forget the "time machine" part, a significant departure from the current native structure is needed to obtain the desirable qualities listed above. The latest proposal would obtain these properties. The first proposal in this thread (with

The elements in the latest proposal that are not necessary just for implementing #3054 and the name-plate goal of this issue ("auto-cleanup"), and were added for "time machine":
If by "how to best save modules" you mean the sync and store mechanisms (S3? Marimo Cloud? Datalad/git-annex? dvc? etc.), then the relevant way in which I see it weighing on the design above is the set of capabilities that could be supported by the

For example, "time machine" and non-deterministic use cases may require something like listing of module contents by partial execution key. However, note that I didn't mention that in the proposal above. I was thinking about the following shape of

```python
@dataclass
class ExecutionKey:
    notebook_key: string
    cell_name_or_module_code_hash: string
    module_hash: string
    version: string = "0"
    created_at: datetime.datetime = None


class Loader:
    # execution_key.created_at is optional.
    #
    # get_spec specifies what parts of ModuleContent should be pulled and returned.
    # E.g., the caller may only be interested in 'outputs', skipping variables, parent keys, etc.
    #
    # **kwargs will include cache_type when Marimo runtime calls loader.get() - it needs to be
    # passed to the legacy __marimo__/cache loader because cache_type is included in
    # the file names.
    def get(execution_key, get_spec=None, **kwargs) -> ModuleContent: pass

    # execution_key.created_at is also optional: if not specified, the Loader itself chooses it and
    # returns it, such as the _server time_ of the successful insertion operation.
    #
    # **kwargs will include cache_type when Marimo runtime itself calls put(), for the legacy
    # __marimo__/cache loader.
    def put(execution_key, module_content, **kwargs) -> created_at: pass

    # Returns whether sync preparation was successful, and any extra payload. If before_sync
    # was triggered externally via an HTTP request to the Marimo server, this result and
    # the payload will be returned in the HTTP response.
    def before_sync() -> (bool, Any): pass

    def on_exit(): pass
```

Some

Marimo will never call |
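For concreteness, a toy in-memory Loader matching the shape of the interface sketched above (ModuleContent is simplified to a plain dict here; everything in this snippet is illustrative, not the proposed implementation):

```python
import datetime

class InMemoryLoader:
    def __init__(self):
        # maps (notebook_key, cell, module_hash, version) -> (created_at, content)
        self._store = {}

    def put(self, execution_key, module_content, **kwargs):
        created_at = execution_key.created_at or datetime.datetime.now(datetime.timezone.utc)
        key = (execution_key.notebook_key, execution_key.cell_name_or_module_code_hash,
               execution_key.module_hash, execution_key.version)
        self._store[key] = (created_at, module_content)
        return created_at

    def get(self, execution_key, get_spec=None, **kwargs):
        key = (execution_key.notebook_key, execution_key.cell_name_or_module_code_hash,
               execution_key.module_hash, execution_key.version)
        entry = self._store.get(key)
        if entry is None:
            return None
        _, content = entry
        # get_spec narrows which parts of the content are returned (e.g. only 'outputs').
        return {k: v for k, v in content.items() if k in get_spec} if get_spec else content

    def before_sync(self):
        return True, None  # nothing to consolidate for an in-memory store

    def on_exit(self):
        self._store.clear()
```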
I don't quite understand what you mean here:
At least for these use cases, I don't see how it weighs on the persistence subsystem design and the

It's possible to imagine that some Loaders may work towards this more proactively, such as by proactively storing contents to be exported later in a different "database". |
lookup meta should contain
agreed, but the meta file takes care of this
invalid cache blob should just force recompute as is. But ideally/eventually, a signature over the results would be good for export/sharing
this requires a hash over results; it's more expensive but less error-prone to just store. Maybe best done at cleanup time.

I still don't understand why the first iteration needs the folder structure? What's wrong with just (for now):

```
__marimo__/cache/
├── meta.db
├── my_expensive_block
│   ├── <hash>.<ext>
│   └── <hash>.<ext>
└── my_other_expensive_block
    └── <hash>.<ext>
```

No need to tie to source yet, since source changes just mean cache misses, and miss simple cases like notebook copies |
Realizing I see
whereas something like

Maybe for full time machine, these start to come together, in which case yes, cache invalidation becomes a lot more involved, and source history becomes important. In which case I think it's probably worth just dropping clean, or just using LRU and evicting based on disk, since space is relatively cheap for a managed service (doing something like this locally seems less important when you can intentionally clean, commit and export). I think the gain vs complexity for invalidating cache based on project source is not there. Even if it is, the dumb method is a good step |
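For reference, the "dumb method" of evicting by recency until the cache fits under a disk threshold can be very small (a sketch, not tied to any particular cache layout; the eviction-by-access-time policy is the only assumption):

```python
from pathlib import Path

def evict_lru(cache_dir: Path, max_bytes: int) -> None:
    # Collect (access time, size, path) for every cached file and evict the
    # least recently used files until the total size fits under max_bytes.
    files = [(p.stat().st_atime, p.stat().st_size, p)
             for p in cache_dir.rglob("*") if p.is_file()]
    total = sum(size for _, size, _ in files)
    for _, size, path in sorted(files, key=lambda t: t[0]):  # oldest access first
        if total <= max_bytes:
            break
        path.unlink(missing_ok=True)
        total -= size
```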
last_accessed, total_accesses definitely cannot/shouldn't be stored right in the blob: it's not compatible with atomic storages, and leads to write amplification (writes even on cache reads because access metrics are updated). Also, I don't understand how they are stored in the blob "right now"? If the objective of this is to enable LRU or LFU cache clean-up, these metrics can stay internal to the cache database. They don't need to be exposed in the
For per-cell auto-cleanup, the problem is not the flat structure as such (at least at "single thousands" of module contents -- cleanup is maybe a rare enough operation that even listing and going through all entries in this directory is tolerable). But I think it's not possible at all because currently, the content file

So, enabling auto-cleanup, even with the current flat structure, would need the introduction of extra files (or DBs, which you suggest below anyway) that record cells' historic module hashes (so we can clean them up later in a targeted way). Thus, this anyway means the introduction of a transitional "extension" cache store format (the format for storing these histories) - which adds to the variety of storage "variants" that later need to be supported.
Invalid storage is not guaranteed to raise a
This looks like my first SQLite proposal? I noted why I discarded it there: a binary database format is not friendly to Git.

But more generally, I don't understand why we would design and develop this entire storage structure as the "first iteration". It can be a "learning iteration" wrt. read/write scaling (and may well prove to be scalable enough), but not wrt. write amplification and the ergonomics/possibility of Git-based metadata sync: it's already clear that these properties won't be great for this structure. Thus, it seems destined to be changed or augmented with another, notebook key-scoped structure. And if so, what's the problem with creating the notebook key-scoped structure right away? To minimise the diversity of legacy on-disk structures to be supported.

In your description of simple cache vs

Pseudocode of

```python
cacheLoader = cache_loader(root / "__marimo__/cache/", ...)
exportLoader = export_loader(root / "__marimo__/results/", ...)

for cell in app:
    exec_key = exec_key(cell)
    content = cacheLoader.get(exec_key, ...)
    ...  # extra preparations of the content for export
    exportLoader.put(exec_key, content)
```

The exact structure of

Maybe another important point to be made here is that the "time machine" use case is neither the "quick and dirty cache", nor is it close to or would-be-served-well by
What specific operations (either plumbing- or use-case-level) are easier (either "easier to implement" or "easier at runtime") with the

I can think of only one such operation: the use-case-level operation "share the entire cache with a colleague manually, via file share"; and it would only be easier if there is a single file,

It's a valid use case, but not of overwhelming singular importance. In fact, for the notebook key-scoped structure, there is another, similar use case (manually sharing the cache for a single notebook, rather than the whole repo/workspace) that becomes much easier than with the

If there are no significant advantages to the
What exactly do you mean by "invalidating cache based on project source"? Also, what do you mean by the "complexity" of adding these parts to the structure? Is it coding/maintainability complexity, or computational complexity at runtime? Or "perceived complexity" for users, if/when it spills into cli commands or user-facing parts of the structure?

For (1) and (2) above, I don't think there is any extra complexity because all the complexity (coding, maintainability, and computational complexity) is already incurred by

(1) adds user-facing complexity when the user just randomly wants to look into

(3) probably adds considerable implementation complexity. But it's definitely not needed for cache invalidation. Nor am I sure at all that I will even need (3) in my use case. As I mentioned before, in my use case, AIs will edit the sources most of the time, so these AIs may be tasked to run |
Agreed
```python
@dataclass
class Cache:
    defs: dict[Name, Any]
    hash: str
    stateful_refs: set[str]
    cache_type: CacheType
    hit: bool
```

Some data is in there
It doesn't need to know. I've successfully leveraged the same cache across 2 notebooks using an imported cell. I don't think there's anything wrong with that.
I think my point is that it's brittle. Lots of added points for cache invalidation, lots of tying inputs to source where it's not fully needed. I'm not saying that closer coupling with source isn't nice (see my thoughts on result)
To retrieve a cache for a module (or cell),

So, when you want to access a cache for a cell that is imported from another notebook, notebook key scoping would make no difference:

If you don't want this scoping in your implementation of
Honestly, I don't understand what you are talking about. What do you mean by "added points for cache invalidation"? Regarding "tying inputs to source": as I noted above, the "cache keys" (aka execution keys) are already tied to the source (do you disagree? or do you mean something different?). Any change in the source of the module (cell) or its upstream dependencies already leads to changes in module_hashes, and thus the cache is invalidated. What I suggested in my proposals above:
|
This is incorrect, module hash only is utilized for Execution Path hashes |
Interesting, I realised that:
Whenever I referred to the "module code hash" above, I was thinking of the "module code tree hash". This was a mistake. But it also seems superfluous to me now (as a non-optional part of the execution key; as an optional addition to the module content, the calculus didn't change, although as I pointed out above I may not be that interested in it for my use case). If we start thinking about all the mentions of "module code hashes" above as "module code only" hashes, my arguments above (in particular in this previous message) become valid again?

This also reminded me of the discussion in #3270 that the current impl would fall back to hashing upstream module/cell dependency sources too often now, such as if the dependency variable is any dataclass object, whereas using hashes of serialised exported vars would be less prone to unnecessary cache invalidation (if the code has changed but the resulting exported vars didn't) and "free" when all cells' contents are persisted anyway by default (#3054). However, this seems to be orthogonal to the discussion above, and the fact that content retrieval of the dependency cells would require the "module code only" hash (as it does already; nothing changes here) doesn't hamper that "less prone to unnecessary cache invalidation" part, because this path would not require retrieval of the dependency cells' contents: regardless of how the exported vars for these dependency cells are obtained (deserialised or re-computed) this time, and whether that was from a different source code, if only the hashes of serialised exported vars are used in the calculation of this module's module hash, it won't be invalidated (as long as its own source is not changed, but that's by design, of course).

(The previous paragraph is me reasoning out loud. It doesn't modify the conclusions above.) |
I'll bring it a bit back on topic with the auto-cleanup, and we can start a discussion if we want to talk through this a bit more. Whether it's tied to source or not, I think the base functionality of reading an external file and purging based on some set criteria is the same, regardless of how the cache is structured, right? Your proposal adds logic on when and how to invalidate, such that heuristics like LRU and LFU can be replaced with retention policies that confidently flush or retain based on the hash of source files, plus the requisite cache structure to enable this.

But here are my pros for creating another issue for the cache structure and getting this in naively first:
My pros for reconsidering the cache structure altogether:
Here are your pros as I understand them:
But- more on hot tub time machine. In a managed environment, I think using |
Yes, except it's not based on the "hash of source files". The latest version of my proposal involved two hashes in the execution key apart from the module hash: (1)

Now, as I revisit that element of the design, I think maybe (2)
Unless we misunderstand each other and keep talking past each other, I argued above that there are no such heuristics. Quoting myself:
Indeed, you have suggested adding a per-workspace "registry" of the info needed for granular clean-up, namely

I agree that your proposal (
The entire difference between my proposal and yours is that my proposal requires extra logic/code for the computation, saving, and bookkeeping of "notebook keys". But that logic is isolated and doesn't mingle with the rest of the persistence subsystem logic (

That "rest of the logic" might be quite non-trivial (for example, handling concurrency of updates via locks on

So, although I agree the handling of notebook keys is extra logic/code (and as such needs to be maintained and tested, so adds to "dev maintenance"), I think it's a small difference: I think
To discuss this productively we need to look at concrete hidden cases and edge cases. It's not given that the more deeply scoped/nested structure generates more of these cases; it could just as well be the reverse, that the more deeply scoped/nested structure automatically eliminates edge cases inherent to the flat, shared structure. What are these edge cases that have popped up already?
Hm, I didn't suggest such coupling anywhere? Perhaps we also significantly misunderstand each other? For ref, I discussed the "object database" in the section "Write" in this comment above, and later elaborated the reasoning behind that design significantly in this comment. I suggested using Git object db as the "module content database" per notebook key (scope). These Git object dbs are not parts of any Git repos (neither "per notebook key", nor shared): they are used just as content-addressing DBs, nothing more. I suggested using Git object db format as the "module content database" for a vague set of reasons:
However, this comes with a somewhat nasty drawback: Git object headers must be known before the object is written, which would force us to chunk pickled blobs beyond a certain size (let's say 100MB) as separate Git objects if we don't want to crash the Marimo server/cli with an out-of-memory error and want to compute the hash of the blob in memory before writing it out, to exclude any possibility of transient blob corruption (short of main memory corruption, of course).

I think I will flip on this one and would say that now I would prefer the "module content database" to be SQLite + the file system for too-large contents, essentially coinciding with the earlier SQLite proposal. SQLite in this case would play exactly the same role as a "pack" in the "Git object db" version. These SQLite dbs could still be "immutable", and named with random suffixes (like packs), to aid merge-ability with some sync mechanisms (as was discussed above). If

The sole reason why I'm flipping is that the advantages and potential future optionalities offered by the Git object db don't seem appealing to me now. So, that single aforementioned drawback of the Git object db is enough of a reason for me to favour SQLite for the "module content database" as the default/first implementation.
I agree about the transferability of infra/logic/impl. I would not raise an issue at all if we weren't talking about writing a structure to users' disks. That racks up maintainability burden/commitment/graceful-upgrade obligations much more than pure source code manipulations. |
Thanks for pointing to this project, I hadn't seen it before. It's curious to see how the design elements they use to enable robust concurrency, including in the face of sync mechanisms like Dropbox or rsync, such as the operation log and "first-class conflicts in the commits", are related to the solutions that I've come up with above: the transaction log, the random suffixing of content mapping file names, and the random suffixing of the SQLite dbs that store small contents (I didn't mention that in the previous comment, but it should be there, exactly mirroring the Git object db's random suffixing of pack files; it's also for conflict resolution/prevention purposes). Perhaps we can learn more from jujutsu on this front, I will need to ponder more on their design docs and discussions in this area.

Another interesting aspect of jujutsu is the separation of backends. This could have been an interesting leverage for Marimo's persistence subsystem if jujutsu were already a mature project and supported a lot of backends. However, currently, they only have the Git backend itself and proprietary backends to Google's monorepo. So, there is practically no upshot in using jujutsu in Marimo's persistence subsystem right now, only a big added dependency (and the need to understand their Rust codebase, which would be a challenge for me). Currently, the fastest path to using many storage backends is:
This wouldn't be as optimal as creating custom Loaders for backends, but it would already work.

Finally, I didn't see anything in jujutsu that makes it particularly well-suited or optimised for "time machine", even over the persistence structure that I proposed above and that would already be needed for "mere" auto-cleanup per module/cell. And even if there were something (although I don't see anything), the upshot of sharing the structure between "non-time machine" and "time machine" use cases, with the possibility for users to switch seamlessly and mix and match these use cases even within the same notebook (indeed, the "default version =

FWIW, one VCS function that would be particularly handy for "time machine" use cases is doing some computations against a certain commit/revision without juggling the working tree. It's proposed for jujutsu as |
Another benefit of notebook scoping that I hadn't thought about before is that it offers "for free" a way to control how the module contents are synced for different notebooks. It can be as simple as telling Dropbox to sync these folders and not others. Those others just stay on the developer's machine, unsynced. With a more centralised module content db, this can only be achieved by specifying a different folder for the App, a la |
Ok, I will proceed with a PR including what we have agreed on (I hope); stalling this doesn't make sense. This PR will include:

(1) On the
This does not include:
(2) "meta"/"module content mapping" DB abstraction per workspace, starting with the TOML format
(3) This is the "auto-cleanup" functionality that justifies the discussion of this PR in this issue, #3176. "Organisation" means sorting tables by name (that is, by the leading section of the execution key:

Does not include

(4) The module content db SQLite table has just three columns: content_hash (TEXT), size (INTEGER), and content (BLOB). No "absent entries": if there is no entry, it's assumed that the content is stored in the FS and should be checked there. As in the Git object db, the extension of these files is not recorded in the file name (thus, not
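For concreteness, a minimal sqlite3 sketch of that three-column table (file and table names, and the size threshold for routing large contents to the FS, are placeholders); it also shows the cheap total-size query that comes up later in the thread:

```python
import sqlite3
from pathlib import Path

SMALL_CONTENT_LIMIT = 1 << 20  # assumed threshold: larger blobs go to plain files
db = sqlite3.connect("contents-0a1b2c.sqlite")  # random-suffixed db file (placeholder name)
db.execute("CREATE TABLE IF NOT EXISTS contents ("
           "content_hash TEXT PRIMARY KEY, size INTEGER NOT NULL, content BLOB NOT NULL)")

def put(content_hash: str, blob: bytes, fs_dir: Path) -> None:
    if len(blob) > SMALL_CONTENT_LIMIT:  # "too large" contents live on the file system
        (fs_dir / content_hash).write_bytes(blob)
    else:
        db.execute("INSERT OR IGNORE INTO contents VALUES (?, ?, ?)",
                   (content_hash, len(blob), blob))
        db.commit()

def get(content_hash: str, fs_dir: Path) -> bytes | None:
    row = db.execute("SELECT content FROM contents WHERE content_hash = ?",
                     (content_hash,)).fetchone()
    if row is not None:
        return row[0]
    path = fs_dir / content_hash  # no entry -> assume the content is stored in the FS
    return path.read_bytes() if path.exists() else None

def total_size() -> int:  # cheap size estimate for cleanup triggers
    return db.execute("SELECT COALESCE(SUM(size), 0) FROM contents").fetchone()[0]
```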
Other module content db optimisations discussed here can be added later. Notably, they may require adding a key to the entry object above like |
I still don't like this as an automatic feature because code thrashing is a pattern I go through a bit; having a lingering cache is nice if I am iterating while coding. I get that what you are trying to maintain is a closure of required cache objects for execution, with auto-cleanup for that. Rigorous cleanup around caching is not especially desirable because code changes might push us back to the other case (this is a common dev pattern I see in myself).

I do think having a closure of cache objects for execution does make sense, but I don't think this should be automatic unless it just cleans up over a specific disk threshold.

With (1), I think you'll find that writing information to the meta file should make those attributes redundant (but fine and non-breaking if you put them in loader.meta). Absolutely on board with (2). Let's make (3) a manual trigger or just threshold-based, and (4) needs benchmarking and investigation (let's open a new issue? You also seem to have gone back and forth on this. I have few blobs under 100mb and this seems like an added step, making the PR much bigger than it needs to be)
The cleanup strategy is definitely configurable, that's why I typed "e.g.". In fact, it should be maximally configurable and use case-agnostic, as I indicated well above:
So, I'm completely indifferent to the default cleanup strategy (if there should be a "default" one at all! maybe the user should always specify it when they want clean up!) as long as the clean-up strategy needed for my use case could be configured.
I didn't get what you mean here.
Didn't get what you mean here.
I've also mentioned timeout as another possible trigger above. Currently, as #3054 is not yet implemented, the "added entry count threshold" (that you seem to refer to by "just to threshold") doesn't yet make sense because the total number of entries in this file will not be huge anyway. But in the future, the added entry count threshold makes sense too, sure.
Before #3054 it's superfluous indeed, as the total number of entries is small. So "No-sqlite, just FS" can be considered a "module content db" option. Indeed, in this planned PR, I can do just that, and delay the "SQLite+large files" module content db for #3054. After #3054, there should be some kind of centralisation (rather than "file for each separate module variable pickle and output HTML") for myriads of cells, the variables and outputs of most of which are tiny. Storing them all in separate files would be wasteful for footprint and synchronisation mechanisms. This was part of the thesis all along, starting with #3055. I didn't go back and forth on it. I did go back and forth on the "centralisation mechanism": from SQLite to Git's packfiles and back to SQLite. Ok, well, the so-called "second proposal" which is laid out in the original post of this issue above didn't have such centralisation at all, but it seems like an omission to me now. Anyways, "FS only" will remain a "module content db" option for those who don't have a lot of cells. |
by "disk threshold" I mean something like:

The check for cleanup would probably be best in post-execution hooks as opposed to at the loader level |
That's when the SQLite-less and "shared" (per-workspace) cache already becomes... troubling? There is no other way to estimate the total size than to crawl the directory, a la the "Disk Inventory X" app. With SQLite, in comparison, it would be a simple
Ok, but then the post-execution hook should just delegate to the Loader. The Loader is in the best position to do the necessary estimations such as the total size, as evidenced above. Also, it should probably be the Loader who owns the configuration of the said triggers and thresholds (size, timeout, etc.). |
The workaround may be to store the sizes in the "meta" db. But this will become wasteful for "SQLite + FS for large files" arrangements, and in general, it's not the best idea to bloat the "meta" db as long as it is implemented as TOML, which is not that scalable (every time the file is loaded, all these sizes have to be parsed throughout the file...). Even though this fact itself (whether to store sizes in the "meta" db or not) can be a config option, you may see here how catering for all these structures itself starts to drive up the implementation complexity. But not so much as to be a deal breaker, so I'll do it in this instance. |
JSON, SQLite or a flat file are all reasonable. I think you initially sold me on SQLite here; I don't think meta needs to be readable (I think blob signatures are probably another useful entry at some point)
Loader objects are dead at this point; you'd have to invoke another Loader, and I'm not sure that makes sense. Why not just create a separate namespace? It might be worth making an Adapter class (PickleLoader prepares a pickle blob from a Cache and loads a pickle blob into a Cache object, but the Adapter writes the blob to disk or upserts a SQL entry, or in this case, deletes it), but I think this is a followup (a quick one) since with naive FS it's as simple as

Hmm, interesting reading your response. Yeah, disk space seems important to me. I think in most cases:
I think my viewpoints on complexity are pretty consistent from those starting premises, but I also understand yours (wanting to clean up the cache closure to only relevant compute artifacts that reflect the notebooks). |
Description
Objective
Enable #3054 (option to save all cell results by default) in a scalable way.
Assumptions
The qualities that I think persistent cache should obtain
Non-goals
Limitation of the current design of persistent cache
The current design of persistent cache doesn't obtain qualities 1-3 and 6 above, and I don't see how it can be fixed "on the margin".
Suggested solution
Cache dir structure:
.table file format is as follows:

Design considerations
Why bucketing per first two characters of cell_ids: a future-proofing thing that will help Git not to break down when working with monorepos with thousands of cells and more. Also, this will aid Marimo itself: when it caches results for a new cell, it will need to create the directory for the cell_id, and in some file systems that may become slow when there are already thousands of entries (i.e., other directories) in the cache directory. Bucketing by first two characters makes the cache dir no bigger than 256 entries. This future-proofing is very cheap (new cells are not created often), so I think it makes sense to add it already. (This idea is borrowed from Git itself, see ls -l .git/objects.)

Why directory-per-cell: keeps .table files very small (almost always less than 4KB, one page), which makes loading, updating, and writing them down easier. Also, importantly, when tables are per-cell, this will exclude Git merge conflicts unless the same cell was updated in both branches. Even then, having __marimo__/cache/*/*/.table merge=union in .gitattributes would help (Marimo may take up pruning obsolete lines every time it overwrites the .table file anyway).

Note: [cell_id]/ and [cell_id]-cell/ are two different directories that can co-exist for the same cell_id. The [cell_id]-cell/ directory contains only the module hashes for the cell itself, whereas [cell_id]/ contains module hashes for with persistent_cache() blocks within the cell.

For different auto-cleanup algorithms for cached-modules-coinciding-with-cells and modules within cells, see comments in #3055.
Note the departure from the current persistent cache design where the file name is the module hash. Naming content blobs with their own hashes makes it very simple and robust for two module_hashes in the table to point to the same content blob. It will not lead to extra read latency most of the time because after the first access, the marimo app can easily store all hashes in memory. It helps that for large workspaces, a single app will not access most cells, so the overhead of this in-memory table caching is small, thanks to the directory-per-cell design.
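To make the blob-naming point concrete, here is a sketch of writing a content-addressed blob and recording the module_hash -> blob_hash mapping in the cell's .table file (the hash choice and the two-column layout are assumptions):

```python
import hashlib
from pathlib import Path

def store(cell_dir: Path, module_hash: str, payload: bytes) -> str:
    cell_dir.mkdir(parents=True, exist_ok=True)
    blob_hash = hashlib.sha256(payload).hexdigest()  # the blob is named by its own hash
    blob_path = cell_dir / blob_hash
    if not blob_path.exists():                       # identical results are stored once
        blob_path.write_bytes(payload)
    with open(cell_dir / ".table", "a") as table:    # append-only, space-delimited mapping
        table.write(f"{module_hash} {blob_hash}\n")
    return blob_hash
```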
Alternative
This proposal supersedes #3055. It is the opposite of the current persistent cache design: it obtains qualities 1-3 and 6, but fails on 4 (Git merge conflicts!!) and 5 (not super legible).