duckdb blocking #179

jbothma · 2024-11-01T15:55:16Z

No description provided.

pudo · 2024-11-03T12:45:38Z

nomenklatura/index/duckdb_index.py

+        query = """
+            SELECT field, token, id, frequency
+            FROM frequencies
+            ORDER by field, token


This is a bit weird, but: is there a world where we want to do ORDER BY frequency DESC OFFSET 1000 (or OFFSET 1%) here to basically take out stopword tokens?

I like this, but are you happy for me to park the idea for a while and come back to it in the future? It feels like a basic xref improvement, some other things are higher priority, and this might need some experimentation.
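For when this idea gets picked up again, here is a minimal sketch of the stopword cut-off discussed above. It makes several assumptions: `rows_without_stopwords`, `demo_connection` and the `skip_top` parameter are hypothetical names, "frequent" is taken to mean "occurs in the most entities", and the standard-library `sqlite3` module stands in for DuckDB so the snippet runs without extra dependencies. It ranks tokens in a window function rather than using the literal `ORDER BY frequency DESC OFFSET 1000`, so that all rows of a dropped token are removed together. The percentage variant (`OFFSET 1%`) is not covered.

```python
import sqlite3

# Hypothetical query, not taken from the PR: rank (field, token) pairs by the
# number of entities they occur in, then keep only rows whose token falls
# outside the `skip_top` most common ones.
STOPWORD_QUERY = """
WITH token_counts AS (
    SELECT field, token, COUNT(DISTINCT id) AS entities
    FROM frequencies
    GROUP BY field, token
),
ranked AS (
    SELECT field, token,
           ROW_NUMBER() OVER (ORDER BY entities DESC, field, token) AS rnk
    FROM token_counts
)
SELECT f.field, f.token, f.id, f.frequency
FROM frequencies AS f
JOIN ranked AS r ON r.field = f.field AND r.token = f.token
WHERE r.rnk > ?
ORDER BY f.field, f.token, f.id
"""


def rows_without_stopwords(conn, skip_top):
    """Return frequency rows, minus the `skip_top` most widespread tokens."""
    return conn.execute(STOPWORD_QUERY, (skip_top,)).fetchall()


def demo_connection():
    """Tiny in-memory table shaped like the `frequencies` table in the diff."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frequencies (field TEXT, token TEXT, id TEXT, frequency REAL)"
    )
    conn.executemany(
        "INSERT INTO frequencies VALUES (?, ?, ?, ?)",
        [
            ("name", "the", "e1", 0.25),
            ("name", "the", "e2", 0.25),
            ("name", "the", "e3", 0.25),
            ("name", "ltd", "e1", 0.5),
            ("name", "ltd", "e2", 0.5),
            ("name", "acme", "e1", 0.5),
        ],
    )
    return conn


if __name__ == "__main__":
    # With skip_top=1 the most widespread token ("the") is dropped.
    for row in rows_without_stopwords(demo_connection(), skip_top=1):
        print(row)
```

Whether a fixed count or a percentage is the better cut-off would need the experimentation mentioned above; the sketch only shows the query shape.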

jbothma added 9 commits October 31, 2024 16:41

memory-index-docs

0118c9f

Start adding duckdb again

af42ebe

Lots of code, not quite working

41454b6

Some sanity tests

69adca5

Break up frequency calculation

2fc7cb3

Skip tokens occurring in more than 100 entities

eabb22c

Largely working

b410319

pairing on country is too expensive

b45aa32

Tidy

c2fbc29

pudo reviewed Nov 3, 2024

nomenklatura/index/duckdb_index.py (outdated)

jbothma added 4 commits November 6, 2024 16:50

Move more into the db

88e4398

Move even more into the db

684c5c0

Unit test subqueries, typecheck

84d47e3

Add duckdb as dep

136a064

jbothma commented Nov 7, 2024

nomenklatura/index/duckdb_index.py (outdated)

Use provided index directory

655241f

jbothma commented Nov 7, 2024

nomenklatura/index/duckdb_index.py (outdated)

jbothma commented Nov 7, 2024

nomenklatura/index/duckdb_index.py

jbothma commented Nov 7, 2024

nomenklatura/index/duckdb_index.py (outdated)

jbothma added 10 commits November 11, 2024 15:01

Add basic matching, but it's 3 times slower than tantivy

0ed8da8

WIP horrid interface

ed1e282

Fixy

d5f4d17

WIP

287d70a

Reduce memory consumption by letting it materialise intermediate results more explicitly instead of doing multiple joins concurrently

178a941

It's already megabytes

4d5abe2

Split enricher types for different interfaces

4c682de

Handle split enricher interfaces in nomenklatura

fe1af3a

Merge pull request #180 from opensanctions/duckdb-bulk-enricher

52fb824

Duckdb bulk enricher

Merge branch 'main' into duckdb-2

abc702f

Fix import

ad6ebdc
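The "Reduce memory consumption" commit above describes materialising intermediate results instead of running several joins in one statement. The PR's actual queries are not shown in this conversation, so the following is only a sketch of that query structure under stated assumptions: the `rare_tokens` and `postings` tables, the `candidate_pairs` function and the product-of-frequencies score are all hypothetical, the `max_entities` default echoes the "Skip tokens occurring in more than 100 entities" commit, and `sqlite3` from the standard library stands in for DuckDB so the snippet runs as-is. It demonstrates the staging, not the memory savings, which depend on DuckDB's query planner.

```python
import sqlite3


def demo_connection():
    """Tiny in-memory table shaped like the `frequencies` table in the diff."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frequencies (field TEXT, token TEXT, id TEXT, frequency REAL)"
    )
    conn.executemany(
        "INSERT INTO frequencies VALUES (?, ?, ?, ?)",
        [
            ("name", "the", "e1", 0.25),
            ("name", "the", "e2", 0.25),
            ("name", "the", "e3", 0.25),
            ("name", "ltd", "e1", 0.5),
            ("name", "ltd", "e2", 0.5),
            ("name", "acme", "e1", 0.5),
        ],
    )
    return conn


def candidate_pairs(conn, max_entities=100):
    """Block entity pairs in explicit stages rather than one multi-join query."""
    # Stage 1: materialise the tokens that occur in few enough entities.
    conn.execute("DROP TABLE IF EXISTS rare_tokens")
    conn.execute("CREATE TEMPORARY TABLE rare_tokens (field TEXT, token TEXT)")
    conn.execute(
        """
        INSERT INTO rare_tokens
        SELECT field, token
        FROM frequencies
        GROUP BY field, token
        HAVING COUNT(DISTINCT id) <= ?
        """,
        (max_entities,),
    )

    # Stage 2: materialise only the frequency rows for those tokens.
    conn.execute("DROP TABLE IF EXISTS postings")
    conn.execute(
        "CREATE TEMPORARY TABLE postings "
        "(field TEXT, token TEXT, id TEXT, frequency REAL)"
    )
    conn.execute(
        """
        INSERT INTO postings
        SELECT f.field, f.token, f.id, f.frequency
        FROM frequencies AS f
        JOIN rare_tokens AS r ON r.field = f.field AND r.token = f.token
        """
    )

    # Stage 3: a single self-join over the reduced table yields scored pairs.
    # The score (sum of frequency products) is an illustrative assumption.
    return conn.execute(
        """
        SELECT l.id, r.id, SUM(l.frequency * r.frequency) AS score
        FROM postings AS l
        JOIN postings AS r
          ON l.field = r.field AND l.token = r.token AND l.id < r.id
        GROUP BY l.id, r.id
        ORDER BY score DESC, l.id, r.id
        """
    ).fetchall()


if __name__ == "__main__":
    for pair in candidate_pairs(demo_connection(), max_entities=2):
        print(pair)
```

Each stage only ever joins two tables, and the final self-join runs over the already-filtered `postings` table, which is the part that would otherwise grow quickest.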

jbothma closed this Nov 22, 2024
