file descriptors surge #603
sorry, no clues from me. Stefano
…On 08/05/2019 15:49, Yuyi Guo wrote:
@vkuznet <https://github.com/vkuznet> @h4d4 <https://github.com/h4d4> @bbockelm <https://github.com/bbockelm> @belforte <https://github.com/belforte> @amaltaro <https://github.com/amaltaro> @h4d4 <https://github.com/h4d4>
The most recent DBS instability (May 7-8) was different from the past. From the monitoring plots, I did not see high memory or CPU usage. In addition, I checked /var/log/messages and did not see the OOM-killers there as we had before. However, the fd numbers were really high, as shown on the monitoring plots; sometimes they were 15K+. I checked the APIs called while the fds were high; the APIs spanned a wide range.
I am going to look at these APIs to check for the DB connection leakage.
Any suggestions on how to attack the fds problem?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#603>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAOAVWIYN4SE4OT6H6Z2VDDPUM4GFANCNFSM4HLVDZEQ>.
Where are you getting this file descriptors monitoring from? Was it 15k file descriptors for DBS in general in one backend, or 15k for one single process in a backend? If one looks at /proc/PID/fd/ in the backend, you can list all the file descriptors your process/PID has opened. Looking at vocms0136 right now, I see around 40 of them, where a bunch of them are accessing the same tnsnames.ora file (probably unwanted), then most of the rest are network socket descriptors. I assume each query to oracle requires one socket. It would be great to correlate queries/DAOs with the number of file descriptors. I don't know much about these details, but I'm surprised to see 15k file descriptors for a process that is handling 25(?) user requests.
Alan, the fds we get from the /proc/PID/fd area; you're right, we can look for them over
there. But it should be done in real time, since they may be closed: when the alarm
is gone it is too late to look there, since they will have disappeared.
Therefore, the only way I see to look them up is to add logic to the exporter code,
e.g. if the fds go above a threshold, dump the content of /proc/PID/fd somewhere so
that later (when we are notified by the alarm) we can look at them.
When I have time I can implement this in the exporter.
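As a rough illustration of that exporter logic, here is a minimal sketch (the function name and output path are made up; on a Linux node, /proc/&lt;pid&gt;/fd holds one symlink per open descriptor):

```python
import os

def dump_fds(pid, threshold, out_path):
    """Snapshot /proc/<pid>/fd when the open-fd count exceeds `threshold`,
    so the descriptors can still be inspected after the alarm has fired."""
    fd_dir = '/proc/%d/fd' % pid
    fds = os.listdir(fd_dir)
    if len(fds) <= threshold:
        return None
    lines = []
    for fd in fds:
        try:
            # each entry is a symlink to a file, pipe or socket
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        lines.append('%s -> %s' % (fd, target))
    with open(out_path, 'w') as handle:
        handle.write('\n'.join(lines))
    return lines
```

Run periodically from the exporter, this leaves a snapshot on disk that can be examined once the alarm fires, even after the descriptors themselves are gone.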
…On 0, Alan Malta Rodrigues ***@***.***> wrote:
Here is the relevant ticket: dmwm/cmsweb-exporters#1
As I wrote, I can write some code for it once I have time.
…On 0, Yuyi Guo ***@***.***> wrote:
Yuyi,
meanwhile, until I add this info to the exporters, you/Lina can simply look up
the fd area on a node; e.g., here is an example of some process and its fds:
```
ls -al /proc/28149/fd/
total 0
dr-x------. 2 cmspopdb users 0 May 7 13:57 .
dr-xr-xr-x. 9 cmspopdb users 0 Feb 25 21:03 ..
lrwx------. 1 cmspopdb users 64 May 9 14:31 0 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May 9 14:31 1 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May 7 13:57 2 -> /dev/pts/1
lrwx------. 1 cmspopdb users 64 May 9 14:31 255 -> /dev/pts/1
lr-x------. 1 cmspopdb users 64 May 9 14:31 3 -> /var/lib/sss/mc/passwd
lrwx------. 1 cmspopdb users 64 May 9 14:31 4 -> socket:[134636807]
```
So if you look at the DBS pid area at the time when the fd number is high, you should see
what those fds are.
At the moment I see from the monitoring plots that DBS experiences a high number of fds
every half hour. The number of fds is above 1K.
…On 0, Yuyi Guo ***@***.***> wrote:
Thanks Alan, Valentin,
total 0
You are going to want to use “lsof -p $PID” instead; that will show the IP addresses of all those sockets listed.
OK, thanks
IIUC socket means network connection, so they had better be from the FrontEnd.
Do we have a counter of FE connections to compare with?
IIUC nobody but the FE is allowed to talk to the BE servers. Why did the
FE nodes not "die first"?
…On 09/05/2019 08:39, Yuyi Guo wrote:
@vkuznet <https://github.com/vkuznet> @amaltaro <https://github.com/amaltaro>
above is on vocms0163. How do I know what were they?
Sockets can also be outgoing - to Oracle!
This is a file from the command “lsof -p $PID” on 0136 when there were about 600+ fds.
Of course it could be zillions of external requests that the FE happily
passes on, but... thousands of them? Or is this some FE failure?
Do we (Lina?) have some monitoring on the FE side?
Got this from stackoverflow. How do we close the connection to the FE? I cannot remember doing the closing explicitly in the DBS code. Can someone remind me?
Yuyi, a simple google search leads to an open ticket which provides some recipes to handle this issue, e.g. increasing the number of threads, or setting the response.headers.connection config to always return a value of close. In the short run, I suggest you study this ticket and apply the stuff people are talking about, while in the long run I really suggest that we start seriously considering replacing the CherryPy stack in our Python-based web apps, probably best with Flask+wsgi.
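For reference, the two knobs being suggested could look roughly like this. This is a hedged sketch using the option names quoted in the thread; the values are illustrative and the actual DBS service configuration layout is not reproduced here:

```python
# Sketch of the two settings named in this thread (values illustrative):
dbs_cherrypy_config = {
    'server.thread_pool': 200,               # raise the CherryPy worker-thread count
    'response.headers.connection': 'close',  # always answer with "Connection: close"
}

# In a CherryPy application this would typically be applied via
# cherrypy.config.update(dbs_cherrypy_config); not executed here.
```

Forcing `Connection: close` trades keep-alive efficiency for a guarantee that clients cannot pin sockets open, which is the failure mode under discussion.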
Yet... why all these problems now? Did we change cherrypy? Did the FE change?
Is there some external request storm? I see some signs of FE issues in
CRAB TW to CRAB REST connections. Did we change something relevant in the last update?
…On 09/05/2019 09:53, Valentin Kuznetsov wrote:
This is really odd. We already had fd monitoring before the release, and we did not see fds surge to 10k+ before this release.
Stefano,
I'm not sure we can claim it happens "now"; we had lemon metrics outages a long
time ago and they were affecting different services, e.g. DBS.
The point is that now we have better monitoring tools to reveal our problems,
and we have identified some "hidden" patterns which lead to DBS instabilities.
But we may not have known all possible combinations.
I saw fds alarms in the previous release too, but we "let it go".
…On 0, Stefano Belforte ***@***.***> wrote:
```
[root@vocms0136 tmp]# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
```
Valentin, yes! But since yesterday DBS is being restarted every few hours. This is new!
This is what is in cherrypy/cherrypy#1304, which is a very long and old open ticket (opened March 2014).
@h4d4
Lina,
can we try increasing the threads to 200 and setting the response.headers.connection config to always return a value of 'close'?
We can start with one of the instances.
Yuyi,
I suggest taking a time window when DBS had large fds and identifying all clients
during this period (sort them and find the top-N). It may be that we have a client
which repeatedly initiates a request and drops it without waiting for the response
(possibly impatient clients).
Using this dashboard
https://monit-grafana.cern.ch/d/_U6nmxCmk/dbs-global-reader?refresh=1m&orgId=11&from=now-6h&to=now
I see node/periods:
vocms0136: 6:30 - 7:00, 8-8:12, 9:46 - 10:00, 11:17 - 11:30
etc.
We can also use CMSWEB timber dashboard to identify top-N in these periods, e.g.
https://monit-grafana.cern.ch/d/QADAkezZk/cmsweb-timber?orgId=11&from=now-12h&to=now&var-system=dbs&var-method=All&var-api=All&var-code=All&var-metadataType=All&var-Filters=data.code%7C>%7C99&var-dnFilter=.*
It shows that DBS had datasetlist, fileArray calls around these spots, e.g.
7:00, 10:00, etc.
V.
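To make the top-N idea concrete, here is a minimal sketch. The sample lines and DNs below are fabricated placeholders in the spirit of the frontend access log, where the quoted "/C=..." certificate DN identifies the client:

```python
from collections import Counter
import re

# Fabricated sample lines; real entries carry the full request and timing fields.
LOG = [
    '"GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 400 "/C=DE/O=Grid/CN=User A"',
    '"GET /dbs/prod/global/DBSReader/datasets HTTP/1.1" 200 "/C=DE/O=Grid/CN=User A"',
    '"GET /dbs/prod/global/DBSReader/filesummaries HTTP/1.1" 200 "/C=CH/O=CERN/CN=User B"',
]

def top_clients(lines, n=10):
    # pull the certificate DN out of each line and tally the calls per client
    dns = re.findall(r'"(/C=[^"]+)"', '\n'.join(lines))
    return Counter(dns).most_common(n)
```

Running `top_clients` over the log slice for one of the high-fd windows would show at a glance whether a single DN dominates the traffic.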
…On 0, Yuyi Guo ***@***.***> wrote:
```
[root@vocms0136 tmp]# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 Foreign
      4 FIN_WAIT2
     38 LISTEN
    472 ESTABLISHED
    952 CLOSE_WAIT
   1560 TIME_WAIT
```
Yuyi,
it is not under Lina's responsibility; these are DBS (CherryPy) configuration parameters.
Find out how you can set them in DBS and make an appropriate PR, either to the DBS
configuration or to the code (if they are not exposed as external configuration).
V.
…On 0, Yuyi Guo ***@***.***> wrote:
BTW, updating the number of threads is a DBS configuration parameter (note there are different config files for the different DBS instances). About the header response, you'd need to make changes to the DBS source code, I believe. We also need to be careful: if we were sometimes hitting 9GB of RAM footprint with 15 threads, with 200 the chances of blowing up the process/node are much higher.
While I was trying to answer Valentin's questions, I found the entry below in the FE log. He has sent 29038 such queries and is continuing to do so. I don't know of any CMS logical_file_name called "_NA_". How did he get this name?
It must be part of a script which fails to find the name and yet plows on and bugs you.
If that's really harmful, the best I can do is to remove him from VOMS; then at the next CMSWEB synchronization round (a few hours) he will not be authorized by the FE anymore. Let me know.
Odd, it seems to come via dasgoclient! @vkuznet could this be your bug?
I just checked the entire dasgo code: it has _NA_ in the query parser, which means that
whatever the client provides as the input query is not parsed and _NA_ is substituted.
We need to contact him and ask which DAS queries he is using and how he automates
this, since it seems to me that he never checks the input and just invokes dasgoclient
with it.
So it is not a bug but a feature. DAS needs to parse input from the
end-user; if this input is not parse-able, DAS should throw an error. But I
think the end-user does not check for it and retries all the time.
…On 0, Stefano Belforte ***@***.***> wrote:
No, I haven't written him about this yet. This is the same person who scans whole datasets every 20 minutes no matter what. I wrote to ask him to scan every few hours, because there is no new info for a dataset, but I haven't heard anything back for a week or so.
If no reply, then ask Stefano to remove him from VOMS. Once he is turned down
the hard way he will write back.
…On 0, Yuyi Guo ***@***.***> wrote:
No, that's another Benedikt! (Benedikt Mayer at MIT, running Dynamo and trying to keep the Dynamo registry in sync with PhEDEx and DBS, and somehow overdoing it, IMHO).
OTOH if this Benedikt Vormwald is in Germany (likely), we may not hear until tomorrow. I am writing him now with you in CC so that Valentin can find out about the dasgoclient use. Valentin, if a user sends you junk, maybe you can avoid passing it along to DBS?
Valentin, log-2019-May-14-16-30.txt
Stefano, I'm not passing this to DBS; I think he is doing some scripting where
he gets input and invokes dasgoclient.
…On 0, Stefano Belforte ***@***.***> wrote:
Let me give you a hypothetical example:
```
# pseudo-code made concrete; the dataset name is purely illustrative
import subprocess
def das(query):  # invoke dasgoclient and return its output lines
    return subprocess.check_output(['dasgoclient', '-query=' + query]).decode().split('\n')
results = das('file dataset=/some/data/set')  # get file list
for file_name in results:
    das('child file=' + file_name)  # file_name is passed along unvalidated
```
So if the user gets garbage in results and then iterates over it, there is nothing wrong
with dasgoclient: it takes the input `_NA_` and passes it along.
For instance, I can do:
```
dasgoclient -query="child file=_NA_" -verbose=1
...
DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ 634.356484ms
ERRO[0000] DBS unable to unmarshal the data Api=filechildren Error="json: cannot unmarshal object into Go value of type []mongo.DASRecord" data="{"exception": 400, "message": "Invalid Input Data _NA_...: Not Match Required Format", "type": "HTTPError"}"
```
But in this case the user should get a 400 HTTP message, while in the example Yuyi found
it is a 502 error code [1], which indicates that the server, while acting as a
gateway or proxy, received an invalid response from the upstream server.
But I agree that 30K of such queries may cause harm to DBS/frontend.
What is unclear is whether this user submits this payload from grid jobs (which seems to
be the case). And, once again, throttling on the FE/BE may help to protect us
from such scenarios.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502
…On 0, Yuyi Guo ***@***.***> wrote:
While I was trying to answer Valentin's questions, I found below in the FE log. He has sent 29038 below queries and is continuing to do so. I don't know any of CMS logical_file_name is call "_NA_" ? How did he got this name?
```
[14/May/2019:16:59:59 +0200] cmsweb.cern.ch 131.225.189.65 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6036 in 15985 out 485 body 94537097 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=505277547" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]
```
Are those in chronological order? If so, here is a pattern of how DBS is screwed:
```
200 "GET /dbs/prod/global/DBSReader/fileparents/?logical_file_name=%2Fstore%2Fmc%2FRunIISummer16DR80Premix%2FSMS-T1qqqq-LLChipm_ctau-200_mLSP-200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8%2FAODSIM%2FPUMoriond17_longlived_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1%2F70000%2F267616AB-A35E-E911-83D5-141877640173.root
200 "GET /dbs/prod/global/DBSReader/filesummaries/?dataset=%2FRelValZEE_13%2FCMSSW_10_6_0_pre4-PU25ns_106X_upgrade2021_realistic_v4-v1%2FMINIAODSIM&validFileOnly=1
200 "GET /dbs/prod/global/DBSReader/releaseversions?dataset=%2FPseudo_MonoZ_NLO_Mphi-400_Mchi-50_gSM-1p0_gDM-1p0_13TeV-madgraph%2FRunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1%2FMINIAODSIM
200 "GET /dbs/prod/global/DBSReader/serverinfo
200 "POST /dbs/prod/global/DBSReader/fileArray
200 "POST /dbs/prod/global/DBSWriter/bulkblocks
200 "POST /dbs/prod/global/DBSWriter/bulkblocks
400 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_   (repeated 19 times)
502 "GET /dbs/prod/global/DBSReader/blockdump?block_name=%2FHIMinimumBias14%2FHIRun2018A-v1%2FRAW%23d927b336-6fcb-44b1-a1fb-65a28abcedd1
```
So, we have successful (HTTP 200) requests for filesummaries, followed by others and bulkblocks.
Then we have filechildren with _NA_ input (which I guess some script provides);
all of those are 400, as my test with dasgoclient shows, but then we have a
blockdump with 502 which blocks DBS. Everything after this is blocked, e.g.
```
502 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FDY%2A%2F%2A%2FALCARECO&dataset_access_type=VALID&detail=True   (x2)
502 "GET /dbs/prod/global/DBSReader/datasets?dataset=%2FGluGluToAToZhToLLBB_M-1000_13TeV-madgraphMLM-pythia8%2FRunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1%2FMINIAODSIM&dataset_access_type=%2A&detail=True
502 "GET /dbs/prod/global/DBSReader/datasets?dataset=%2FgluinoGMSB_M1400_ctau1000p0_TuneCP2_13TeV_pythia8%2FRunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1%2FAODSIM&dataset_access_type=%2A&detail=True
502 "GET /dbs/prod/global/DBSReader/datasets?dataset=%2FHeavyCompositeMajoranaNeutrino_L11000_M1000_mumujj_CalcHep%2FRunIISummer15GS-MCRUN2_71_V1-v1%2FGEN-SIM&dataset_access_type=%2A&detail=True
502 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FRelVal%2ASingleMu%2A%2FCMSSW_10_4_0%2Aupgrade2023%2AD21%2A%2FGEN-SIM-DIGI-RAW&dataset_access_type=VALID&detail=True   (x4)
502 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FRelValSingleMuPt1p5to8_pythia8%2FCMSSW_10_4_0_pre2-103X_upgrade2023_realistic_v2_2023D21noPU-v1%2FGEN-SIM-DIGI-RAW&dataset_access_type=%2A&detail=True
502 "GET /dbs/prod/global/DBSReader/datasets/?dataset=%2FRelValTTbar%2A%2F%2A10_5_0%2A%2FGEN-SIM-DIGI-RAW&dataset_access_type=VALID&detail=True   (x3)
```
So what is blockdump doing, and why does it time out? Is it possible that it brings
the full DBS down, and that all the other queries which come afterwards just compound the
problem?
…On 0, Yuyi Guo ***@***.***> wrote:
Valentin,
To answer your question about what the calls were while the fds were really high:
the files below were taken from the FE for May 14 16:30-17:00 CERN time, for all the DBS accesses.
The first column is the error code, the second the method, and the third the API.
[log-2019-May-14-16-30.txt](https://github.com/dmwm/DBS/files/3179938/log-2019-May-14-16-30.txt)
[log-2019-May-14-16-40.txt](https://github.com/dmwm/DBS/files/3179939/log-2019-May-14-16-40.txt)
[log-2019-May-14-16-50.txt](https://github.com/dmwm/DBS/files/3179940/log-2019-May-14-16-50.txt)
@vkuznet see also #605 (comment) and following. Something is fishy with blockdump; someone needs to look at it from inside the server code.
Shall I stop the CRAB DBS Publisher? No more migration requests, no more blockdump calls!
OK. I went for a gentler approach and disabled calls for dataset migrations in the ASO Publisher (they were all failing anyhow!). So CRAB will keep publishing user MC, user processing, or already-migrated datasets.
No, it is not in chronological order. I ordered it by error code.
Something is really wrong. Look at the two queries below: they are identical. DBS does input validation, so the first one returned HTTP 400, which is the correct behavior. However, the second one got HTTP 502. The validation is very simple. How could it take 300 s? What is the server doing? There is no database involved here.
The point is, it's not a single user putting such queries with `_NA_` as logical_file_name; I reported it at least a couple of times already in one of those many threads about DBS instabilities.
> Just checked entire dasgo code, it has _NA_ in query parser which means that whatever client provides as input query is not parsed and _NA_ is substituted.

I don't get your comment. If the das client can't parse the query, why would it replace whatever by `_NA_`? Can you please clarify what part of the query is parsed and which one is substituted?
Yuyi,
the example you gave is not appropriate for two reasons:
- the timestamp of the 502 error is earlier than the timestamp of the 400
- these are two different DBS servers
It means that vocms0760 had a problem at the time of the query, and this problem may
be caused by another source.
The 400 error comes when the DBS server is in good health, which was the case for
vocms0158 at the time of the query.
V
…On 0, Yuyi Guo ***@***.***> wrote:
Something is really wrong. Look the two queries below, these are two identical queries. DBS does input validation. So the 1st one returned http 400 that is the correct behave. However, the second one got http 502. The validation is very [simple](https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/utils/DBSInputValidation.py#L26). How could it take 300 s? What the server is doing? There is no database involved here.
```
vocms0158/frontend/access_log_20190514.txt:[14/May/2019:23:57:37 +0200] cmsweb.cern.ch 129.59.197.22 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 400 [data: 6037 in 15819 out 107 body 325763 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1620332613" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]
vocms0760/frontend/access_log_20190514.txt:[14/May/2019:21:15:38 +0200] cmsweb.cern.ch 193.146.75.180 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6037 in 15985 out 485 body 300127279 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1796822753" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]
```
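A quick check of the two durations in those entries, assuming the trailing number before `us` in the `[data: ...]` field is the request duration in microseconds (an assumption, but one consistent with the ~300 s question raised here):

```python
# Convert the logged microsecond durations to seconds for comparison.
def us_to_s(us):
    return us / 1e6

healthy = us_to_s(325763)     # the 400 response: well under a second
stuck = us_to_s(300127279)    # the 502 response: about 300 s
```

The 502 entry sitting at almost exactly 300 s is what suggests a fixed gateway/proxy timeout rather than a slow validation path.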
Alan,
the parser needs a unique token when passing through the query. All alphabetical
characters, [a-z], slashes, brackets, and operators are allowed. I used `_NA_` as a
string which should not appear in an input query; I could use any other string
in this manner.
The point is that it is not meant as a final assignment of values when DAS
translates user query fields/specs used in the underlying services.
The issue is that the user code does not check its own input. Here is the snippet of
code we received from the end-user:
```
def get_miniaod_filenames():
    miniaod_filenames = []
    # first check children for miniAODs:
    status, child_filenames = run_cmd('dasgoclient -query="child file=/%s"' % options.infile.split("//")[-1])
    child_filenames = child_filenames.split("\n")
    for child in child_filenames:
        if dataset_is_correct_miniAOD(child):
            miniaod_filenames.append(child)
    # if no matches have been found, check children of parent files for matching miniAODs:
    if len(miniaod_filenames) == 0:
        print "Checking children of parent files for matching miniAODs..."
        status, parent_filenames = run_cmd('dasgoclient -query="parent file=/%s"' % options.infile.split("//")[-1])
        for parent_filename in parent_filenames.split("\n"):
            status, cousins_filenames = run_cmd('dasgoclient -query="child file=%s"' % parent_filename)
            relatives_filenames = cousins_filenames.split("\n")
            for relative in relatives_filenames:
                if dataset_is_correct_miniAOD(relative):
                    miniaod_filenames.append(relative)
    return miniaod_filenames
```
As I expected, the user invokes a for loop (see the last part of the function) where it
parses parent_filenames as a string without checking what this string is.
I also don't know what is passed to options.
I'll try to get the input from him and reproduce the parents list.
Bottom line: the user does not validate input in scripts.
…On 0, Alan Malta Rodrigues ***@***.***> wrote:
Which means the user passed something that didn't match the basic dasclient validation ("All alphabetical characters, [a-z], slashes, brackets, operators are allowed") and the dasclient replaced whatever value the user provided by `_NA_`, right? If so, isn't it better to raise an error at the das-client itself? Given that it already has a basic validation in place, and we all know that whatever query with `_NA_` will fail anyway, don't even let it go to CMSWEB. If you prefer not to return an error, just return an empty list.
> Bottom line, user does not validate input in scripts.

Yes, I agree. However, if the client can run a simple pre-validation of the arguments, we won't have this same problem in the coming week/month/year.
Alan,
one more time: dasgoclient uses _NA_ internally and it should not appear in
inputs. Let's not speculate; let's wait for the user's input so that we can try it out
and see what it produces.
V.
…On 0, Alan Malta Rodrigues ***@***.***> wrote:
Alright! Given that you can't give me a straight answer and that your statements are contradictory, I tested it myself (also with an empty space as user input), and now it is clear: dasgoclient replaces whatever user input it cannot understand by `_NA_`.
Then let me say it again: if the DAS client knows this request has no chance to succeed (otherwise it wouldn't have replaced the user input by `_NA_`), it should not send it to the data-services at all. It should be a simple change to the DAS client, and it would protect a bunch of data-services from mal-formed HTTP requests. If you still don't get what I'm saying, we can quickly discuss it over vidyo.
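A minimal sketch of such a client-side pre-validation. Only the rejection of empty input and the `_NA_` placeholder comes from this thread; the function name and the exact character pattern for a valid LFN are assumptions:

```python
import re

def valid_lfn(lfn):
    """Reject input before any HTTP request is made: empty values, the
    internal _NA_ placeholder, and names outside an assumed LFN pattern."""
    if not lfn or lfn.strip() in ('', '_NA_'):
        return False
    return re.match(r'^/[A-Za-z0-9_.\-/*\[\]]+$', lfn) is not None
```

A check like this at the client would have turned the 29038 doomed `_NA_` requests into immediate local failures.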
@vkuznet I will not argue code with you, but there is no such thing as "the user only sends correct queries".
I added basic validations to the DAS code; the associated commits are dmwm/das2go@ebe54d8 and dmwm/dasgoclient@ba6f5c6. Here is how it will work: if the user misses or provides inappropriate input, DAS performs the validation and returns immediately, without sending the request to the backend services. The code is in master and I'll try to schedule it in the next cmsweb upgrade cycle.
Thanks, Valentin!
Now the new code is awaiting inclusion in CMSSW, see cms-sw/cmsdist#4981. Usually it takes 1-2 weeks, and then the new dasgoclient with validation will be in place on cvmfs/CMSSW.