Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ORDER BY BM25 #1434

Merged
merged 67 commits into from
Feb 27, 2025
Merged

Implement ORDER BY BM25 #1434

merged 67 commits into from
Feb 27, 2025

Conversation

jbellis
Copy link

@jbellis jbellis commented Nov 20, 2024

What is the issue

https://github.com/riptano/cndb/issues/11725

What does this PR fix and why was it fixed

...

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@jbellis
Copy link
Author

jbellis commented Nov 22, 2024

(force push is identical code that CI ran previously, just cleaned up the history)

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I'm really happy with how much this cleaned up the index based ordering logic in several classes.

Left a handful of minor comments/questions.

…onsible for ANN index queries. Other global orderings will be represented by a SingleColumnComparator with clustered=true instead.
@jbellis jbellis changed the title Send similarity score from writer to coordinator for faster sorting Implement ORDER BY BM25 Dec 6, 2024
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I left several minor comments and a few larger questions.

…ction in SingleColumnRelation.newEQRestriction. This eliminates the need for skipMerge and special cases in doMergeWith, and moves the issuing of warnings next to the place where the transformation occurs instead of doing it much later in RowFilterValidator (which is no longer needed)
@jbellis jbellis force-pushed the scoreordering-6 branch 2 times, most recently from 2babc86 to 50c4a57 Compare December 13, 2024 14:36
… approach than ignoring it when serialization fails later
@jbellis
Copy link
Author

jbellis commented Feb 14, 2025

I think the GenericOrderByTest failure is real :(

@michaeljmarshall
Copy link
Member

michaeljmarshall commented Feb 19, 2025

@jbellis looks like we're duplicating the sort value so that the rows have 4 columns instead of 3. When I update the order by test to use k, c, v instead of * to select the right columns, it passes some of the assertions and fails with a new error:

java.lang.AssertionError: Expected: [[3, 3, 9],[1, 2, 8],[3, 1, 7],[2, 1, 6],[2, 2, 5],[2, 3, 4]]
Actual: [[3, 3, 9],[1, 2, 8],[1, 2, 8],[3, 1, 7],[2, 1, 6],[2, 2, 5]]

That error looks like we're not deduping rows properly.

@michaeljmarshall
Copy link
Member

IndexMetricsTest::testQueriesCount fails locally on my machine

@jbellis
Copy link
Author

jbellis commented Feb 25, 2025

Something wonky with Butler?

  1. BinLogTest: looks flaky, nothing new here
  2. testOneToManyCompaction: blank entry in the Butler report
  3. rangeRestrictedTest: no entry at all in the report

@jbellis
Copy link
Author

jbellis commented Feb 25, 2025

VectorLocalTest and VectorCompactionTest both pass locally

// by synthetic +score column.
boolean cqlReversed = ordering.direction == Ordering.Direction.DESC;
if (def.position() == ColumnMetadata.NO_POSITION)
return ordering.expression.isScored() || cqlReversed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this hurts anything as is, but is the "isScored()" ever going to return true?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ann and BM25 Ordering subclasses override isScored() to true. Maybe I'm not understanding your point correctly.

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work :) just waiting on CI now

@jbellis jbellis merged commit b0cdc37 into main Feb 27, 2025
168 of 408 checks passed
@jbellis jbellis deleted the scoreordering-6 branch February 27, 2025 00:44
@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1434 approved by Butler


Approved by Butler
See build details here

@michaeljmarshall michaeljmarshall restored the scoreordering-6 branch February 27, 2025 17:41
@michaeljmarshall michaeljmarshall deleted the scoreordering-6 branch February 27, 2025 17:42
michaeljmarshall added a commit that referenced this pull request Feb 27, 2025
### What is the issue
The `NON_DAEMON` IndexRegistry returns `true` for all calls to
`supportsExpression`. This leads to an incorrect result from the
`getEqBehavior` method where we get `MATCH` instead of `EQ` because the
index indicates that it can handle `ANALYZER_MATCHES` expressions and
`EQ` expressions.

It is an odd edge case because the javadoc for the `NON_DAEMON` object
is:

```java
    /**
     * An {@code IndexRegistry} intended for use when Cassandra is initialized in client or tool mode.
     * Contains a single stub {@code Index} which possesses no actual indexing or searching capabilities
     * but enables query validation and preparation to succeed. Useful for tools which need to prepare
     * CQL statements without instantiating the whole ColumnFamilyStore infrastructure.
     */
```

This presents a problem for the eq/match logic where we want to find a
nuanced solution to the different solutions. My proposal here is to just
make it use the EQ behavior, but that might have adverse side effects in
untested code.

### What does this PR fix and why was it fixed
The original fix used in #1434 was just to avoid the NPE we hit, but it
allowed for the wrong result in eq behavior. This fix is to say that we
should just return `EQ` in that case. It's possible this fix has
negative consequences that we haven't seen yet, but at the very least,
the CNDB tests pass with it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants