[RFC] Distributed Correctness Testing Framework #3220
Comments
Hi @Swiddis , thanks for putting this together. I know some of the parts are still work in progress, so just wanna leave some general thoughts after reading the current version as my first round of review:
I saw that you also mentioned in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I have a similar question: will backward compatibility testing be a focus of the proposed framework? Are there plans to define specific test cases or scenarios to validate compatibility with older versions, and how will this be managed over time as new features are introduced?
Personally, I need to do more homework on the scenario for large datasets. However, at first glance, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact. (This may not be a P0.)
I noticed the mention of the above. Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework? |
Also, for this open question:
Here are some general ideas in my mind: Coverage Metrics
Accuracy and Reliability Metrics
Efficiency Metrics
Outcome-Based Metrics
Integration and Usability Metrics
|
I set up some crude benchmarking code to estimate how much test throughput we could get, using versions of a Locust script that cover two scenarios. Left: many parallel tests that each create, update, and delete their own index. Right: many tests sharing a small number of indices. In general, insert and query requests are very fast while index create/delete is slow. This gives some data to back up that we should have batches of tests run on batch-scoped indices: it's a throughput difference of 25x on equal hardware. It's also more reliable -- with a small number of test-scoped index workers there were timeout failures with even just 50 tests in parallel, but batched workers can handle 1000 tests in parallel without any failures (the cluster generally just slows down and settles at around 1800 requests/second on my machine). For a real soak-test we probably want to run at least O(a million) tests total, which my dev machine can do for this benchmark in ~10 minutes. |
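As a rough sanity check on the figures above: if each test costs about one request (an assumption for illustration, not something measured in the benchmark), the ~10 minute estimate follows from the ~1800 requests/second rate. The function name and the `requests_per_test` parameter are hypothetical.

```python
# Back-of-the-envelope estimate; requests_per_test=1.0 is an assumed value.
def soak_minutes(total_tests: int,
                 requests_per_second: float,
                 requests_per_test: float = 1.0) -> float:
    """Estimated wall-clock minutes to run total_tests at a steady request rate."""
    seconds = total_tests * requests_per_test / requests_per_second
    return seconds / 60


estimate = soak_minutes(1_000_000, 1800)
print(f"{estimate:.1f} minutes")
```

If a test needs several requests (for example an insert followed by a query), the estimate scales linearly with `requests_per_test`.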
I think we can work backwards from specific past bug reports (such as those linked in the overview) to the features the query generator needs to support, then see if we reproduce them. If the process is able to find specific known issues, we can have some confidence in its ability to find more unknown issues as we add more features.
I think we should leave it at first. Since we don't typically update older versions, running tests there would mostly be data collection. I do know there have been some bugs involving specific upgrade flows (example), but I'm not convinced the extra complexity would be paid back.
We can probably extend the testing to do some sort of benchmarking, but let's not step on the toes of opensearch-benchmark where we don't have to. The focus for the moment is just correctness. Since we'll be running many tests per index (see above comment), we can probably afford to make each index much larger than if we were doing 1 test per index.
I think it's a separately running job, though I don't dislike having it scheduled in pipelines. It will probably be reported by some sort of email mailing list, or a dashboard that's periodically checked. We shouldn't automatically publish issues to GH (both to avoid noise and in case any issues are security-relevant). That said, I think my original idea of running this 24/7 and just tracking failures is also complicated; it would require a whole live notification system. For simplicity, let's make the suite run a configurable finite number of tests with an existing framework, like how Hypothesis does |
Thanks @Swiddis for putting this proposal together. I have a few general comments:
If this is the main concern with using existing frameworks out of the box, can we explore the option of providing those write functions, even if only for testing purposes?
How do we make sure we keep the datasets and queries up to date with new development on the SQL/PPL languages, since those will most probably live in a separate repository? |
It's possible, but I'm not sure it's efficient. One way that could work is to have these frameworks create a SQL table locally, and we write some logic that transforms the SQL table into an OS index before starting the read queries. Getting all the datatypes to work right would be tricky, and we also would still face the other issues with PPL and OpenSearch-specific semantics. To clarify: I think these frameworks do have valuable lessons to teach, and I think we would benefit a lot from copying select parts of their code where we can. I just think that building something from scratch will give us flexibility that we can't really have otherwise.
So far the best idea I have is making it a reviewer step: reviewers should enforce whether we need to add test features for the new functionality. This is already done for OSD with the functional test repository.
|
I updated the document with an implementation strategy and diagrams, and left answers for several of the open questions I had (and removed a few that felt like they weren't relevant anymore). Going to call this ready for review, a lot of the remaining grungy specifics of the implementation can be done when we actually start writing. In particular I'd like to go into more specific detail on some of the properties I have in mind, but there's already a lot of literature on this so the value is limited. See the linked papers in the doc, or I particularly recommend this talk by Manuel Rigger, one of the SQLancer authors. |
@Swiddis It's a great review and document! My comments regarding the framework's
|
@Swiddis Great document. Couple comments
Q1, I assume we aim to build new frameworks to address scalability issues. Since OpenSearch core integration tests currently take 2-3 hours to run, have we considered improving the existing OpenSearch integration test framework as well? Q2, could you elaborate more on the Query Generator and the Data Generator? |
As a goal I think that configuration of generated query features needs to be taken into account relatively early, so there will be some config that lets one enable/disable certain features. I'll be adding more details on query generation soon that should cover this. But what are the practical differences between standard Spark and EMR Spark that the test suite needs to check? As a sketch: We can represent query syntax trees as a sort-of definite clause grammar (DCG), and separately implement ways to serialize those trees, maybe based on serde serialization which is known to pretty heavily generalize to different serialization formats. To make sure generated queries make semantic sense, the generator is also provided with information about the test indices (columns and data types). "Toggling features" can be as simple as enabling/disabling certain branches of the DCG. It's something I've done before while working with parsing in Prolog, and I've experimented with it for query generation before to decent success as well, but I'm not sure how well it scales for SQL and PPL specifically. (At least SQLancer seems to follow a similar approach, splitting up generator rules by semantic function and generating trees from those functions with a visitor.)
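A minimal sketch of the idea above, assuming a toy grammar: query trees are generated from schema-aware rules, features are toggled by enabling or disabling grammar branches, and serialization to SQL or PPL is kept separate from generation. The schema, feature names, and all function names are hypothetical, and the PPL output is only an illustration of a second serializer, not a claim about full PPL coverage.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GeneratorConfig:
    # Hypothetical feature toggles; each name guards one grammar branch.
    enabled: set = field(default_factory=lambda: {"where", "order_by", "limit"})


# Hypothetical test index description: column name -> data type.
SCHEMA = {"age": "integer", "name": "keyword", "balance": "double"}


def gen_literal(rng, dtype):
    """Produce a literal that matches the column's data type."""
    if dtype == "keyword":
        return "'" + rng.choice(["a", "b", "c"]) + "'"
    if dtype == "double":
        return str(round(rng.uniform(0, 1000), 2))
    return str(rng.randint(0, 100))


def gen_predicate(rng, schema):
    """Comparison node; keyword columns only get equality checks."""
    col = rng.choice(sorted(schema))
    dtype = schema[col]
    op = "=" if dtype == "keyword" else rng.choice(["=", "<", ">"])
    return ("cmp", op, col, gen_literal(rng, dtype))


def gen_query(rng, index, schema, config):
    """Build a query tree, skipping branches whose feature is disabled."""
    cols = rng.sample(sorted(schema), k=rng.randint(1, len(schema)))
    tree = {"select": cols, "from": index}
    if "where" in config.enabled and rng.random() < 0.7:
        tree["where"] = gen_predicate(rng, schema)
    if "order_by" in config.enabled and rng.random() < 0.5:
        tree["order_by"] = rng.choice(cols)
    if "limit" in config.enabled and rng.random() < 0.5:
        tree["limit"] = rng.randint(1, 50)
    return tree


def to_sql(tree):
    """Serialize a query tree as SQL text."""
    parts = [f"SELECT {', '.join(tree['select'])} FROM {tree['from']}"]
    if "where" in tree:
        _, op, col, lit = tree["where"]
        parts.append(f"WHERE {col} {op} {lit}")
    if "order_by" in tree:
        parts.append(f"ORDER BY {tree['order_by']}")
    if "limit" in tree:
        parts.append(f"LIMIT {tree['limit']}")
    return " ".join(parts)


def to_ppl(tree):
    """Serialize the same tree as a PPL-style pipeline."""
    stages = [f"source={tree['from']}"]
    if "where" in tree:
        _, op, col, lit = tree["where"]
        stages.append(f"where {col} {op} {lit}")
    stages.append(f"fields {', '.join(tree['select'])}")
    if "order_by" in tree:
        stages.append(f"sort {tree['order_by']}")
    if "limit" in tree:
        stages.append(f"head {tree['limit']}")
    return " | ".join(stages)


rng = random.Random(42)
tree = gen_query(rng, "test_index", SCHEMA, GeneratorConfig())
print(to_sql(tree))
print(to_ppl(tree))
```

Seeding the random generator makes a failing query reproducible from its seed, and disabling a feature is just removing its name from `enabled`.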
This is what I was trying to look at with the benchmarking earlier -- in our current integration tests, we have many suites that each have a setup step of creating indices, run some number of tests on the indices, and tear down the indices. This limits the speed of our tests considerably:
To work within these constraints, a "batch-scoped index" means that for each batch of N queries, we are going to create a new index with a randomized (probably UUID) name.
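A minimal sketch of a batch-scoped index, assuming an in-memory stand-in for the cluster: one index with a UUID-based name is created per batch, every query in the batch runs against it, and it is deleted once at the end. `FakeCluster` and `run_batch` are hypothetical names, not part of any OpenSearch client.

```python
import uuid


class FakeCluster:
    """In-memory stand-in for a cluster client (illustrative only)."""

    def __init__(self):
        self.indices = {}
        self.creates = 0
        self.deletes = 0

    def create_index(self, name, docs):
        self.indices[name] = list(docs)
        self.creates += 1

    def delete_index(self, name):
        del self.indices[name]
        self.deletes += 1

    def count(self, name, predicate):
        """Read-only 'query': count documents matching a predicate."""
        return sum(1 for doc in self.indices[name] if predicate(doc))


def run_batch(cluster, docs, queries):
    """Run all queries of one batch against a single freshly created index."""
    name = f"batch-{uuid.uuid4().hex}"  # randomized name avoids collisions
    cluster.create_index(name, docs)
    try:
        return [cluster.count(name, query) for query in queries]
    finally:
        cluster.delete_index(name)  # one teardown per batch, not per test


cluster = FakeCluster()
docs = [{"age": n} for n in range(100)]
queries = [lambda doc, k=k: doc["age"] < k for k in range(10)]
results = run_batch(cluster, docs, queries)
print(results, cluster.creates, cluster.deletes)
```

With this shape, the slow create/delete cost is paid once per batch of N queries rather than once per test.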
We could also consider doing something like this for the current integration testing. We need to split tests into read-only tests and tests that involve index writes, and then we can just set up all the test indices once to run every read-only suite before tearing it down. I'm not sure how much work it would take to convert all of them like that. One route that shouldn't be too hard to implement:
|
I'm working on getting a proof-of-concept starting point spun up. V0.1 of the POC will do 6 things:
The steps here shouldn't clash with any alternatives as long as we agree on the core idea of having the test suite be a standalone project that can be run next to some cluster. I think having the POC will give a lot more ground to flesh out answers to current open questions. ETA end of week to have decent-enough ground on all of these. |
[RFC] Distributed Correctness Testing Framework
Date: Dec 27, 2024
Status: In Review
Overview
While the current integration testing framework is effective for development, there have been several bug reports around query correctness due to testing blind-spots. Some common sources for issues include:
JOINs or wildcard indices.
LIMIT queries or aggregations.
In addition to the obvious challenges with predicting these edge cases ahead of time, we have the additional issue that the SQL plugin has multiple runtime contexts:
The current integration testing process doesn't scale sufficiently to detect edge cases under all of these scenarios. Historically the method has been "fix the bugs as we go", but given the scale of the project nowadays and in-progress refactoring initiatives, a testing suite that can keep up with the scale is needed.
Inspired by the likes of database testing initiatives in other projects, especially SQLancer, Google's OSS-Fuzz, and the PostgreSQL BuildFarm, this RFC proposes a project to implement a distributed random testing process, which will be able to generate a large amount of test scenarios to validate correct behavior. With such a suite, the plugin can be "soak tested" by running a large number of randomized tests and reporting errors.
Glossary
Current State
Our current testing infrastructure consists of several components that serve different purposes, but none of them clearly address the gaps identified here.
We may consider reusing elements of the comparison testing framework for the new suite. In particular, both this framework and the proposed solution connect to OpenSearch via JDBC. The main concern is whether we can parallelize the workload easily.
While these tools provide valuable testing capabilities, they fall short in several key areas:
Analysis of Domain
The most important question to ask is, "why can't we use an existing system?" SQL testing in particular has a lot of existing similar initiatives. A cursory look at these initiatives (ref: SQLancer, Sqllogictest, SQuaLity) reveals a few issues:
INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would again be similar to writing the system from scratch.
Compared to trying to adapt these solutions, it seems like the most reliable method for a long-term solution is to write a custom system. In particular, these methods are well-documented, so we likely can make something that can get a similar degree of effectiveness with less effort than the migration cost. This will also give us a lot of flexibility to toggle specific features under test, such as the assortment of unimplemented features in V2 (#1889, #1718, #1487). The flexibility of supporting OpenSearch-specific semantics and options will also open routes for implementing similar testing on core OpenSearch.
Despite these limitations, all of the linked projects have extensive research behind them that can be used to guide our own implementation. Referencing them and taking what we can is likely valuable.
Goals
Looking at the limitations of our current test cases, I propose the following goals:
Implementation Strategy
The goal is to create a separate project that can connect to OpenSearch SQL via a dynamically chosen connector (e.g. JDBC, or REST) -- similar to how the comparison testing suite currently uses JDBC. For the first viable version, only the REST connector will be supported, for simplicity.
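A minimal sketch of what a swappable connector could look like, assuming the runner only needs to turn a query string into a request. The `Connector` protocol and `RestConnector` class are hypothetical names; the sketch targets the SQL plugin's `_plugins/_sql` and `_plugins/_ppl` REST endpoints and only builds the request, it does not send it.

```python
import json
from typing import Protocol
from urllib.request import Request


class Connector(Protocol):
    """Hypothetical interface that a JDBC or REST connector would satisfy."""

    def build_request(self, query: str) -> Request: ...


class RestConnector:
    """REST-only connector for the first version (illustrative sketch)."""

    def __init__(self, base_url: str, language: str = "sql"):
        if language not in ("sql", "ppl"):
            raise ValueError(f"unsupported query language: {language}")
        self.url = f"{base_url.rstrip('/')}/_plugins/_{language}"

    def build_request(self, query: str) -> Request:
        # The plugin accepts a JSON body with the query text under "query".
        body = json.dumps({"query": query}).encode("utf-8")
        return Request(
            self.url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )


connector = RestConnector("http://localhost:9200")
request = connector.build_request("SELECT 1")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without a cluster; a JDBC connector could implement the same protocol later without changing the runner.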
The role of the test suite as a system is shown in this System Context diagram. It will run independently of our existing per-PR CI on a specified schedule, informing developers of issues found. The suite will have some config that specifies how many batches it should run for during each invocation, similar to how Hypothesis does max_examples.
The testing framework itself needs two main components: the actual test runner, and a tool that can create clusters to test with various configs. For example, toggling cluster settings, whether to use Spark, and other cluster-level options. Making the cluster manager a separate container will help encapsulate a lot of grungy infrastructure code in a dedicated tool. That said, for the initial version, we can just have the pipeline spin up a single test cluster that the runner uses, with a single default configuration, meaning its box can be replaced by the CI pipeline for now.
One unit of work we introduce at this level is test batches, based on the throughput benchmarking below. The idea is that both creating clusters and creating indices are expensive compared to running read-only queries or incremental updates on indices, so we want to reuse created indices wherever possible (aiming for ≥ 100 tests/index). To have a high test throughput, we run tests in cluster-level batches, where each cluster-level batch is made up of several index-level batches. The exact number of tests per batch will be configurable.
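A minimal sketch of how the nested batches could be configured and enumerated, assuming three knobs. The `BatchConfig` field names and default values are hypothetical; only the ≥ 100 tests/index target comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    # All names and defaults here are illustrative assumptions.
    cluster_batches: int = 1             # clusters created per invocation
    index_batches_per_cluster: int = 10  # indices created per cluster
    tests_per_index_batch: int = 100     # tests sharing one index

    def total_tests(self) -> int:
        return (self.cluster_batches
                * self.index_batches_per_cluster
                * self.tests_per_index_batch)


def plan(config: BatchConfig):
    """Yield (cluster_batch, index_batch, test) ids in execution order."""
    for c in range(config.cluster_batches):
        for i in range(config.index_batches_per_cluster):
            for t in range(config.tests_per_index_batch):
                yield c, i, t


config = BatchConfig()
print(config.total_tests())
```

Because the run is finite and fully determined by the config, one invocation has a bounded cost: here one cluster and ten index creations serve 1000 tests.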
Zooming into the test runtime, we can look more at its core components. The bulk of the work for "supporting features" happens in the Query Generator and the Data Generator; the rest of the work is mostly plumbing.
For properties we can test, the following high-level strategies are available:
Implementation Roadmap
The goal is to get a minimum viable suite running by the end of January, which is able to be extended to support more query features along with current feature development.
Bootstrapping
By the end of this process, all major components in the diagram above should be stubbed and ready to be incrementally added to. The fact that the suite finds known existing bugs can give confidence for identifying bugs in new features that violate the same properties.
Importantly, for bootstrapping we skip writing a full cluster configuration manager, and stick to a minimal adaptor and index configuration.
Extending
Once we can find known bugs, the main steps for identifying unknown bugs are similar:
Open Questions
Structurizr DSL for the diagrams