Increase the amount of request in the stress test, using multiple threads #90
Conversation
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
@ahandan if this PR is ready for a review from me, please explicitly set me as the reviewer; otherwise I won't be touching it. That said, I was curious and opened it up in ReviewNB (https://app.reviewnb.com/Ouranosinc/PAVICS-e2e-workflow-tests/pull/90/) and I ended up with this error: "Invalid notebook file. Please make sure both before and after version of the file contains valid notebook JSON."
First pass review, I'll have to come back to it again. I haven't wrapped my head around the full changes yet. Thanks for the fix.
see comments
Functionality-wise, all looks good.
I only have a few cosmetic comments regarding the output format, not blocking by themselves.
- Move the print(f'\n Execution # {i+1}') statement within run_threads, with an input parameter that provides the "run_index".
- Remove the extra spaces before " Timestamps" (this is hackish, not how it is intended to be used). The field should be w=24 (4 spaces before + 20 from the timestamp length itself) for right alignment, or w=14 for left alignment (4 spaces before + len("Timestamp") == 10).
- Execution #<index> (to align on the left of the output)
- Thread(T2) Stress Test ... (to align on the left of the output)
@fmigneault, I agree for the …
You have to look at the fields this way: in the case of the timestamp values themselves, they are 20 characters. This is > 16, so the field overflows to the left.
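For illustration, here is a minimal sketch (assuming Python's f-string width specifiers, which the notebook output discussed above appears to use; the exact widths shown are the reviewers' suggestions, not the notebook's actual code) of how a field pads or overflows:

```python
# Minimal sketch of f-string field widths (illustrative only).
timestamp = "2021-11-17 18:56:51"  # roughly 20 characters

print(f"|{timestamp:>16}|")    # width 16 is smaller than the value: the field simply overflows
print(f"|{timestamp:>24}|")    # width 24: right-aligned, padded with leading spaces
print(f"|{'Timestamp':<14}|")  # width 14: left-aligned header, padded with trailing spaces
```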
I'm fine with the way results are reported.
@fmigneault as for the weird outputs, these (ESC[0;31m) are ANSI color escape codes used to color the outputs. Looks like it's a Jenkins problem.
It is possible to actually show the error, like previously in comment #90 (comment).
The current error displayed in Jenkins does not have enough information (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/multi_threading_stress_test/46/console):
18:57:14 _____________________ notebooks/stress-tests.ipynb::Cell 4 _____________________
18:57:14 Notebook cell execution failed
18:57:14 Cell 4: Cell execution caused an exception
18:57:14
18:57:14 Input:
18:57:14 # NBVAL_IGNORE_OUTPUT
18:57:14
18:57:14 # Executing THREDDS test on multiple threads
18:57:14 thredds_url = f"{TWITCHER_URL}/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc"
18:57:14
18:57:14 run_threads(target_test = stress_test,
18:57:14 url = thredds_url,
18:57:14 n_threads = TEST_N_THREADS_THREDDS,
18:57:14 runs_per_threads = TEST_RUNS_THREDDS)
18:57:14
18:57:14 Traceback:
18:57:14
18:57:14 ---------------------------------------------------------------------------
18:57:14 AssertionError Traceback (most recent call last)
18:57:14 /tmp/ipykernel_1129/1504455324.py in <module>
18:57:14 7 url = thredds_url,
18:57:14 8 n_threads = TEST_N_THREADS_THREDDS,
18:57:14 ----> 9 runs_per_threads = TEST_RUNS_THREDDS)
18:57:14
18:57:14 /tmp/ipykernel_1129/4243812779.py in run_threads(target_test, url, execution_id, n_threads, runs_per_threads)
18:57:14 29 thread.join()
18:57:14 30
18:57:14 ---> 31 assert stressTestProgression.failed_count == 0, stressTestProgression.fail_report()
18:57:14 32 print(f"All threads passed {target_test.__name__} function.")
18:57:14 33
18:57:14
18:57:14 AssertionError: This test failed due to bad code received from the requests.
18:57:14 Number of failed thread(s) : 2/ 5
18:57:14 Number of failed requests : 3/1000
18:57:14 First failed request timestamp : 2021-11-17 18:56:51
We do not know which of the WPS services or Thredds is failing, what bad HTTP code was returned, or how long it took.
You spent a lot of time formatting the output, so why not simply show it? Only for the runs with errors of course, not all the runs.
In case it's not clear:
For the Jenkins use case, the result displayed when the build fails is what matters. All the nice-looking cell outputs do not make it to Jenkins, and there is no way to retrieve them after the fact. In the previous implementation, the failed outputs were saved and passed to the assert error message. That is how we displayed only the failed cell outputs when a failure occurred during a Jenkins run.
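To make the pattern concrete, here is a hypothetical sketch of the reporting approach described above (class and method names such as record() are illustrative; only failed_count and fail_report() appear in the traceback, and the notebook's real code differs):

```python
# Hypothetical sketch: keep per-request failure details and surface them through the
# assertion message, so they appear in the Jenkins console when the cell fails.
from dataclasses import dataclass, field
from typing import List


@dataclass
class StressTestProgression:
    total_requests: int = 0
    failures: List[str] = field(default_factory=list)

    def record(self, url: str, code: int, reason: str, timestamp: str, duration: float) -> None:
        """Count every request; keep a one-line summary for each failed one."""
        self.total_requests += 1
        if code != 200:
            self.failures.append(
                f"{timestamp}  {url}  code={code}  reason={reason}  took={duration:.2f}s"
            )

    @property
    def failed_count(self) -> int:
        return len(self.failures)

    def fail_report(self) -> str:
        header = f"{self.failed_count}/{self.total_requests} requests failed:"
        return "\n".join([header, *self.failures])


# At the end of run_threads(), the detailed report becomes the assertion message,
# so it is printed in the Jenkins console on failure:
# assert progression.failed_count == 0, progression.fail_report()
```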
I agree with @tlvu that, for convenience, it would be great to have the failing subset of threads/URLs reported directly in the raised message. Do you have an ETA to implement this?
It's not just a "convenience". A Jenkins build failure is useless if we are unable to get the failing info needed to diagnose it!
If it helps remove pressure and unblock things, I can accept merging this PR as-is, as long as a follow-up PR with proper error reporting when Jenkins fails is done promptly right after.
Everything is available in the notebook outputs. It's just more trouble to load them in Jupyter than to have them already in the logs.
Ok, no problem for changing the outputs. I changed the assert printout to print only the failed runs, using the same output format as the results. @tlvu let me know what you would rather have in the Exceptions area.
Here is an example of the output from the test:
The actual notebook outputs are discarded! The outputs that are being saved are the result of another run, and in the case of an intermittent problem like this caching issue, there is no guarantee we will get the same result. Also, another run means all the timestamps do not match. See PAVICS-e2e-workflow-tests/Jenkinsfile, lines 83 to 85, in 9ad91bd.
Can we combine both, so we have the bad code and the actual error matching that code? This will be particularly useful in the case where we have multiple failing codes. For example, we have seen in the past that both 500 and 408 can happen in the same run. Having the actual error matching the code will speed up the diagnostic.
Oh, this is great! So I see that combining both the status code and the actual error for the exception will look like:
@tlvu @fmigneault
The new error reporting in Jenkins is very good. I just have comments about what seems to me like an accidental way of hiding wrong behavior from the server (Magpie, Twitcher, or the service behind it).
I'd rather have all the errors exposed clearly so we can fix them, but I do not know enough about the design decisions to make a call.
notebooks/stress-tests.ipynb (Outdated)
```diff
@@ -330,7 +330,7 @@
 " resp = requests.request(method, url, timeout=abort_timeout, **req_kwargs)\n",
 " resp.raise_for_status()\n",
 " except requests.exceptions.Timeout as errT:\n",
-" result.codes.append(code) # read timeout\n",
+" result.codes.append(408) # read timeout\n",
```
What was the wrong code for the timeout? Why not expose it instead of hardcoding 408? Was hardcoding the previous behavior that you are trying to maintain?
In general, tests are meant to expose errors, so masking the error is usually the wrong thing to do. My 2 cents, because I do not really know for sure what the design decision of Magpie is.
I'll answer both here.
I might be wrong on this, please correct me if that is the case.
Because the request times out after a specified time (timeout=abort_timeout), it means you don't get any answer from the server, which results in not knowing the response, only the request that was made. So we can't have the exact status code from the server, and we force a 408 code after the timeout limit. And because there is no answer from the server, I force the "Request Timeout" message and I print the requests.exceptions.Timeout (request) as the exception_text message.
Also, the thing with this test is that a 408 error happens randomly, so trying to debug it is like flipping a coin: sometimes it takes 1 minute, sometimes an entire day. So I can try to optimise it on my side, but it would be important that @fmigneault can be unblocked.
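For reference, a hypothetical sketch of the handling being described (the result.reasons and exception_text containers are illustrative assumptions; only result.codes.append(408) appears in the diff above, and the notebook's real code differs):

```python
# Hypothetical sketch of the client-side timeout handling discussed above.
import requests
from types import SimpleNamespace


def do_request(result, method, url, abort_timeout, **req_kwargs):
    try:
        resp = requests.request(method, url, timeout=abort_timeout, **req_kwargs)
        resp.raise_for_status()
        result.codes.append(resp.status_code)
    except requests.exceptions.Timeout as errT:
        # The client gave up after `abort_timeout` seconds: there is no response and
        # therefore no real status code, so 408 is forced and the exception text kept.
        result.codes.append(408)
        result.reasons.append("Request Timeout")
        result.exception_text.append(repr(errT))
    except requests.exceptions.HTTPError as errH:
        # A response was received but flagged by raise_for_status(): keep the real code.
        result.codes.append(errH.response.status_code)
        result.exception_text.append(repr(errH))


# Illustrative usage with a simple result container (placeholder URL):
result = SimpleNamespace(codes=[], reasons=[], exception_text=[])
do_request(result, "GET", "https://example.com/", abort_timeout=10)
```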
OHH, I misunderstood: so it's the test that is timing out, not the actual response from the server. In that case, the error reporting should not make it look like a 408 was received from the server!
Maybe just put "xxx" as the code, and the reason would be "test timeout after ### seconds"?
That way we can differentiate a real timeout returned by Magpie (because the server behind Magpie takes too long to respond) from the test timing out because we enforce an arbitrary abort_timeout.
But I'd suggest you wait to hear from @fmigneault before changing the code. He knows best how to test Magpie.
@ahandan's interpretation is correct.
We manually set a limit we want to wait for, and if no response from the server is received by then, the client (us) aborts.
So there is no real code from the server.
Code 408 is picked because that's the code for "Client Timeout", in contrast to "Server Timeout", which should be 504 (or some other 5xx).
Just to be clear, there is no confusion to be had here.
When the request is initiated, our client establishes a connection with a limit of "wait for x seconds".
The server accepts that connection and keeps it open for that amount of time to receive the request.
If either the client fails to produce the request or the server fails to receive it / read the data in time, a 408 happens.
In other words, the request did not reach the server (completely or at all).
If the server fails to produce a response in time (e.g. it did not itself receive a result in time from the service it was supposed to redirect to), a 504 happens instead. This is based on the timeout value configured in nginx.
OK, it's clear for me now. Just so it's clear for others as well, can we simply add "(client side)" to the reason, i.e. "Request TimeOut (client side)"? Thanks.
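A small hypothetical sketch of the distinction settled on above (names and values are illustrative, not the notebook's code): a client-side abort raises requests.exceptions.Timeout and never yields a status code, whereas a gateway timeout comes back as a real response with status 504.

```python
# Hypothetical sketch: client-side abort (forced 408) vs. real server-side 504.
import requests


def classify(url: str, abort_timeout: float = 60):
    try:
        resp = requests.get(url, timeout=abort_timeout)
        if resp.status_code == 504:
            # A real response from the proxy (e.g. nginx): server-side timeout.
            return 504, "Gateway Timeout (server side)"
        return resp.status_code, resp.reason
    except requests.exceptions.Timeout as err:
        # No response at all: the client aborted after `abort_timeout` seconds.
        return 408, f"Request TimeOut (client side) after {abort_timeout}s: {err!r}"
```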
Magpie integrate new feats

## Overview
Update to the latest Magpie version 3.19.0. Tested on a local birdhouse setup.

## Changes
- Adds new environment variables to handle new Magpie features, which rely on email usage. These features include user registration and approvals, plus user assignment to groups with terms and conditions.

The following PR will have to be merged prior to merging this current PR: [pavics-e2e #90](Ouranosinc/PAVICS-e2e-workflow-tests#90). Also, the content of PR [birdhouse #197](#197) has been transferred to this PR.
Overview
Enhancement for the stress_test.
Increase the amount of requests by using multiple threads, to maximise the chance of making requests in parallel.
The increased amount of requests also increases the chance of forcing the cache to run a refresh cycle: by filling it up and making requests that overlap the cache refresh cycle, the test can trigger a known issue (NoTransaction, DetachedInstance).
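For context, a minimal sketch of the multi-threaded pattern this PR introduces, reconstructed from the call visible in the Jenkins traceback above (function bodies, defaults, and constants here are assumptions, not the notebook's actual code):

```python
# Hypothetical sketch of the multi-threaded stress-test pattern described above.
import threading
import requests


def stress_test(url, runs_per_threads, results):
    """Hit `url` repeatedly and record the HTTP status code of each response."""
    for _ in range(runs_per_threads):
        try:
            resp = requests.get(url, timeout=30)
            results.append(resp.status_code)
        except requests.exceptions.Timeout:
            results.append(408)  # client-side abort, see the 408 discussion above


def run_threads(target_test, url, n_threads=5, runs_per_threads=200):
    """Start several threads running `target_test` so that requests overlap."""
    results = []  # list.append is atomic in CPython, so the threads can share it
    threads = [
        threading.Thread(target=target_test, args=(url, runs_per_threads, results))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    failed = [code for code in results if code != 200]
    assert not failed, f"{len(failed)}/{len(results)} requests failed with codes {sorted(set(failed))}"
    print(f"All threads passed {target_test.__name__} function.")


# Example call, mirroring the notebook cell shown in the traceback (constants are placeholders):
# run_threads(target_test=stress_test, url=thredds_url,
#             n_threads=TEST_N_THREADS_THREDDS, runs_per_threads=TEST_RUNS_THREDDS)
```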
Changes
Related Issue / Discussion
This enhancement of the stress test references:
Magpie issue: More caching/session re-connect and logging #473 (Ouranosinc/Magpie#473)