docs: zebra::database::default module

emmyoh · Jan 2, 2025 · 572ddfb
1 parent c06017c
commit 572ddfb

Showing 2 changed files with 4 additions and 4 deletions.
diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
@@ -19,7 +19,7 @@ jobs:
       - name: Checkout codebase
         uses: actions/checkout@v4
       - name: Generate documentation
-        run: time cargo doc --no-deps -Zrustdoc-map --release --quiet
+        run: time cargo doc --features="default_db" --no-deps -Zrustdoc-map --release --quiet
       - name: Fix permissions
         run: |
           chmod -c -R +rX "target/doc/" | while read line; do

diff --git a/README.md b/README.md
@@ -23,12 +23,12 @@ Oku enables distributed storage & distribution of mutable user-generated data; t
 * Capable of scaling in dataset size **without excessively impacting memory usage for an individual node**
 * Capable of acceptable recall with multiple distance metrics
 
-Despite the common need---a scalable CRUD database---existing solutions often fell short.
+Despite the common need—a scalable CRUD database—existing solutions often fell short.
 
 #### Distribution & CRUD
 Many vector databases utilise the [hierarchical navigable small world (HNSW)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) algorithm to construct their database indices, as it (a) achieves high recall on high-dimensional data regardless of distance metric, and (b) performs fast queries regardless of dataset size. However, despite its attractiveness on benchmarks, it can be impractical to use in many production contexts as (a) it's difficult to distribute as you cannot shard the index across multiple nodes, (b) the entire index must be loaded into memory to perform operations, making memory a bottleneck in addition to storage, and (c) deleting vectors essentially requires rebuilding the entire index from scratch and re-inserting every vector except the deleted ones; using redundant indices and tombstoning is the only way to keep the database online.
 
-The need for a scalable & mutable vector database is not new, however, and the problem has apparently been solved to an acceptable degree before---content recommendation systems based on embedding vectors have been in production for many years, and they've often used some variation of [locality sensitive hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to build a vector database index. An LSH index is not graph-based, but instead breaks up an *f*-dimensional space into regions of similar vectors. Consequently, it can be (a) sharded, (b) accessed in parallel, and (c) accessed from storage because, unlike a graph such as HNSW, it is not '[object soup](https://jacko.io/object_soup.html)' and avoids issues with cache locality and synchronisation in multithreaded contexts. LSH's advantages in performance and resource usage does come with an implication: while HNSW approximates neighbours, LSH approximates similarities. The recall of LSH is lesser as it's less concerned with finding *the nearest* neighbours, and more concerned with just finding what *is near*. For fine-grained searches, LSH is less helpful, but for large & varied datasets where it's important to find records that are 'close enough', it has significant advantages.
+The need for a scalable & mutable vector database is not new, however, and the problem has apparently been solved to an acceptable degree before—content recommendation systems based on embedding vectors have been in production for many years, and they've often used some variation of [locality sensitive hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to build a vector database index. An LSH index is not graph-based, but instead breaks up an *f*-dimensional space into regions of similar vectors. Consequently, it can be (a) sharded, (b) accessed in parallel, and (c) accessed from storage because, unlike a graph such as HNSW, it is not '[object soup](https://jacko.io/object_soup.html)' and avoids issues with cache locality and synchronisation in multithreaded contexts. LSH's advantages in performance and resource usage does come with an implication: while HNSW approximates neighbours, LSH approximates similarities. The recall of LSH is lesser as it's less concerned with finding *the nearest* neighbours, and more concerned with just finding what *is near*. For fine-grained searches, LSH is less helpful, but for large & varied datasets where it's important to find records that are 'close enough', it has significant advantages.
 
 #### Integrity & Safety
 To avoid excessive memory usage, some have saved indexes to storage and performed operations directly on the index files as if they were in memory, taking advantage of a technique called [memory mapping (`mmap`)](https://en.wikipedia.org/wiki/Memory-mapped_file). Spotify boasts of [its LSH index](https://github.com/spotify/annoy):
@@ -41,7 +41,7 @@ Cloudflare [makes similar bold claims](https://blog.cloudflare.com/scalable-mach
 >
 > In the wake of our redesign, we've constructed a powerful and efficient system that truly embodies the essence of 'bliss'. Harnessing the advantages of memory-mapped files, wait-free synchronization, allocation-free operations, and zero-copy deserialization, we've established a robust infrastructure that maintains peak performance while achieving remarkable reductions in latency.
 
-Multiprocess concurrency with `mmap` is arguably *impossible*---countless DBMSes have learned the same lesson after many years. There is [a paper on this subject](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf) that covers the pitfalls of `mmap` in detail; suffice it to say, a memory-mapped database index has not demonstrably achieved the data integrity & memory-safety guarantees necessary for a production database.
+Multiprocess concurrency with `mmap` is arguably *impossible*—countless DBMSes have learned the same lesson after many years. There is [a paper on this subject](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf) that covers the pitfalls of `mmap` in detail; suffice it to say, a memory-mapped database index has not demonstrably achieved the data integrity & memory-safety guarantees necessary for a production database.
 
 #### Potential Improvements
 This software is free & open-source (FOSS), and code contributions are welcome.
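
Reviewer note on the "Distribution & CRUD" paragraph in the README hunk above: the claim that an LSH index "breaks up an *f*-dimensional space into regions of similar vectors", and that this makes deletion and sharding cheap, can be illustrated with a minimal random-hyperplane sketch. This is not the crate's implementation. The names `LshIndex`, `bucket_key`, and `candidates` are hypothetical, the hyperplanes are passed in by the caller rather than sampled randomly, and only a single hash table is used, all as simplifying assumptions.

```rust
use std::collections::HashMap;

/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hash a vector to a bucket key: bit `i` is set when the vector lies on the
/// non-negative side of hyperplane `i`. Vectors separated by a small angle
/// tend to agree on most bits and therefore share a bucket.
fn bucket_key(vector: &[f32], hyperplanes: &[Vec<f32>]) -> u64 {
    hyperplanes
        .iter()
        .enumerate()
        .fold(0u64, |key, (i, plane)| {
            if dot(vector, plane) >= 0.0 {
                key | (1u64 << i)
            } else {
                key
            }
        })
}

/// Hypothetical single-table LSH index mapping bucket keys to record IDs.
struct LshIndex {
    hyperplanes: Vec<Vec<f32>>,
    buckets: HashMap<u64, Vec<usize>>,
}

impl LshIndex {
    /// Assumption: at most 64 hyperplanes, so a key fits in a `u64`.
    fn new(hyperplanes: Vec<Vec<f32>>) -> Self {
        assert!(hyperplanes.len() <= 64);
        Self { hyperplanes, buckets: HashMap::new() }
    }

    /// Insertion touches exactly one bucket.
    fn insert(&mut self, id: usize, vector: &[f32]) {
        let key = bucket_key(vector, &self.hyperplanes);
        self.buckets.entry(key).or_default().push(id);
    }

    /// Deletion also touches exactly one bucket; nothing is rebuilt.
    /// Returns `true` if the ID was present in the vector's bucket.
    fn remove(&mut self, id: usize, vector: &[f32]) -> bool {
        let key = bucket_key(vector, &self.hyperplanes);
        match self.buckets.get_mut(&key) {
            Some(ids) => {
                let before = ids.len();
                ids.retain(|&existing| existing != id);
                ids.len() != before
            }
            None => false,
        }
    }

    /// IDs sharing the query's bucket: "what is near", not a ranked
    /// list of nearest neighbours.
    fn candidates(&self, query: &[f32]) -> &[usize] {
        self.buckets
            .get(&bucket_key(query, &self.hyperplanes))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Because each bucket is independent of the others, buckets could in principle be stored under separate keys in a key-value store, split across nodes, or read in parallel, which is the property the paragraph contrasts with a graph index such as HNSW. The trade-off it describes is also visible here: a query only sees records in its own bucket, so a true nearest neighbour that falls just across a hyperplane is missed unless several tables or neighbouring buckets are probed.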