Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle exceptions gracefully when delete non-existent resources during integ test resource clean up #1154

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

weijia-aws
Copy link

@weijia-aws weijia-aws commented Jan 30, 2025

Description

When integ test runs, the test will create model, index, pipeline, etc., then perform some assertions, finally clean up the resources.

public void processor_integrationTest() {
    try {
        // load model
        // create pipeline/index
        // do tests/asserts
    } finally {
        // cleanup pipeline, index, model
    }
}

If somehow resource creation fail and receive an exception, the test will enter the finally block and try to delete the resources. Since these resources don't exist, exception will be thrown when try to delete them, see issue 1091. I was able to reproduce the exception by following #1093 (comment). So in this case, we should gracefully handle such NOT FOUND exceptions

Related Issues

Resolves #1098

Related issue: #1091 and #1093

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@heemin32
Copy link
Collaborator

Can we expand the scope of this PR to shift the responsibility for resource cleanup from individual tests to the framework? This way, each test wouldn’t need to handle resource cleanup individually.

@vibrantvarun
Copy link
Member

vibrantvarun commented Jan 31, 2025

Can we expand the scope of this PR to shift the responsibility for resource cleanup from individual tests to the framework? This way, each test wouldn’t need to handle resource cleanup individually.

@heemin32 I would suggest not to do that. As Integ tests are meant to be run in isolation mode. With resource cleanup at framework level, it will create issues where test will start sharing resources with each other and can cause critical issues to be skipped at integ test level because the resource might be created by some other method and used by some other method.

@vibrantvarun
Copy link
Member

LGTM

@minalsha
Copy link
Collaborator

#1154 (comment)

+1 to Varun's comment here.

@heemin32
Copy link
Collaborator

heemin32 commented Jan 31, 2025

@heemin32 I would suggest not to do that. As Integ tests are meant to be run in isolation mode. With resource cleanup at framework level, it will create issues where test will start sharing resources with each other and can cause critical issues to be skipped at integ test level because the resource might be created by some other method and used by some other method.

Have you seen any such issue in k-nn repo where the resource clean up is handling in framework level? The resource sharing issue can happen when clean up is happening in individual test level and developer missed to clean up the resource in a test.

@vibrantvarun vibrantvarun self-requested a review January 31, 2025 22:59
* @return get pipeline response as a map object
*/
@SneakyThrows(ParseException.class)
protected Map<String, Object> retrievePipelines(final String pipelineType, final String pipelineName) throws IOException {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, I'm curious if there's any increased latency for the integ tests by adding extra get requests per test? Is the runtime for all the integ tests impacted?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latency should be minimal and it is generally okay to have some slowness for testing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I ran the integ tests with and without this commit. Total runtime is:

Previous: 8m3.59s
Now: 8m2.34s

The latency is really minimal

Copy link

codecov bot commented Feb 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.69%. Comparing base (e8ed3a4) to head (2f601fb).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1154      +/-   ##
============================================
- Coverage     81.72%   81.69%   -0.03%     
  Complexity     2494     2494              
============================================
  Files           186      186              
  Lines          8426     8426              
  Branches       1428     1428              
============================================
- Hits           6886     6884       -2     
- Misses         1000     1002       +2     
  Partials        540      540              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Weijia Zhao <[email protected]>
Copy link
Collaborator

@heemin32 heemin32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@heemin32 heemin32 added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Feb 6, 2025
@heemin32
Copy link
Collaborator

heemin32 commented Feb 6, 2025

Test is failing. Please take a look.

@weijia-aws
Copy link
Author

Test is failing. Please take a look.

qa tests are failing because: testAgainstNewCluster depends on resources that created in testAgainstOldCluster, however when testAgainstOldCluster finishes, all resources are cleaned up, resulting in testAgainstNewCluster tests fail with java.lang.NullPointerException or Resource not found exception. In order to fix this, we will need to re-create resources in testAgainstNewCluster tasks

@weijia-aws
Copy link
Author

In order to fix this, we will need to re-create resources in testAgainstNewCluster tasks

We can also modify the cleanUp method to not delete resources if running against old cluster?

@weijia-aws
Copy link
Author

We can also modify the cleanUp method to not delete resources if running against old cluster?

I think we should do this, as this is how the current behavior is, and requires less code changes comparing to re-create resources

@heemin32
Copy link
Collaborator

heemin32 commented Feb 6, 2025

We can also modify the cleanUp method to not delete resources if running against old cluster?

I think we should do this, as this is how the current behavior is, and requires less code changes comparing to re-create resources

Recreation of resource break the purpose of bwc test where we are validating resource created in old cluster still works in new cluster.

How about exposing a method which tell if sub class want to clean up resource or not? Then, for bwc test, we can skip deletion of resources and delete the resources manually as we do today.

Signed-off-by: Weijia Zhao <[email protected]>
String modelId = uploadTextEmbeddingModel();
loadModel(modelId);
createPipelineProcessor(modelId, pipelineName);
super.modelId = uploadTextEmbeddingModel();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest not making any changes to the BWC tests, as this approach may not work. For instance, there are two tests: testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow and testNormalizationProcessor_whenIndexWithSingleShard_E2EFlow. Each test uploads a model, but the model ID from the previous test is overwritten by the next, resulting in a failure to delete the all resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch good first issue Good for newcomers Infrastructure skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Infrastructure] Wipe integration test resources and surface exceptions gracefully
6 participants