
file descriptors surge #603

Closed
yuyiguo opened this issue May 8, 2019 · 117 comments

@yuyiguo
Member

yuyiguo commented May 8, 2019

@vkuznet @h4d4 @bbockelm @belforte @amaltaro @h4d4
The most recent DBS instability (May 7-8) was different from the past ones. From the monitoring plots, I did not see high memory or CPU usage. In addition, I checked /var/log/messages and did not see the OOM-killer entries we had before. However, the file descriptor (fd) counts were really high, as shown on the monitoring plots, sometimes 15k+. I checked the APIs called while the fd counts were high; they covered a wide range.

I am going to look at these APIs to check for DB connection leaks.

Any suggestions on how to attack the fd problem?

@belforte
Member

belforte commented May 9, 2019 via email

@amaltaro
Contributor

amaltaro commented May 9, 2019

Where are you getting this file descriptors monitoring from? Was it 15k file descriptors for DBS in general in one backend or 15k for one single process in a backend?

If you look at /proc/PID/fd/ on the backend, you can list all the file descriptors your process/PID has open. Looking at vocms0136 right now, I see around 40 of them; a bunch of them are accessing the same tnsnames.ora file (probably unwanted), and most of the rest are network socket descriptors.

I assume each query to Oracle requires one socket. It would be great to correlate queries/DAOs with the number of file descriptors. I don't know these details well, but I'm surprised to see 15k file descriptors for a process that is handling 25(?) user requests.
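To make that correlation easier, the /proc listing can be tallied programmatically. Below is a small sketch (assumes a Linux host, like the DBS backends) that counts a process's open descriptors by type:

```python
import os

def count_fds(pid):
    """Count open file descriptors of a process via /proc/<pid>/fd (Linux only)."""
    fd_dir = "/proc/{}/fd".format(pid)
    counts = {"socket": 0, "pipe": 0, "other": 0}
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        else:
            counts["other"] += 1
    return counts

# e.g. inspect the current process:
print(count_fds(os.getpid()))
```

Run periodically against the DBS PID, this would show whether the surge is sockets (network/Oracle) or regular files like tnsnames.ora.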

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

Thanks Alan, Valentin,
I am watching ...

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

total 0
dr-x------. 2 _dbs _dbs 0 May 9 14:08 .
dr-xr-xr-x. 9 _dbs _dbs 0 May 9 14:08 ..
lr-x------. 1 _dbs _dbs 64 May 9 14:08 0 -> /dev/null
l-wx------. 1 _dbs _dbs 64 May 9 14:08 1 -> pipe:[24875653]
lrwx------. 1 _dbs _dbs 64 May 9 14:21 10 -> socket:[25211083]

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@vkuznet @amaltaro
The listing above is from vocms0163. How do I find out what these sockets are?

@bbockelm
Contributor

bbockelm commented May 9, 2019

You will want to use “lsof -p $PID” instead; that will show the IP addresses of all those sockets.

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

OK, thanks

@belforte
Member

belforte commented May 9, 2019 via email

@bbockelm
Contributor

bbockelm commented May 9, 2019

Sockets can also be outgoing - to Oracle!

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is a file from the command “lsof -p $PID” on vocms0136 when there were about 600+ fds.
fd-0848-0136-600.txt

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@belforte @amaltaro @vkuznet @bbockelm @h4d4
vocms0158, vocms0760, vocms0162 and vocms0164
are FEs (frontends). There are a lot of "CLOSE_WAIT" and "ESTABLISHED" connections to them.

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.

The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.

Once the local program closes the socket, the OS can send the FIN to the remote end which transitions you to LAST_ACK while you wait for the ACK of the FIN. Once that is received, the 
connection is finished and drops from the connection table (if your end is in CLOSE_WAIT you do not end up in the TIME_WAIT state).

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

Got this from Stack Overflow. How do we close the connection to the FE? I don't remember doing the closing explicitly in the DBS code. Can someone remind me?

@vkuznet
Contributor

vkuznet commented May 9, 2019

Yuyi, a simple Google search leads to this open ticket:
cherrypy/cherrypy#1304

which provides some recipes for handling this issue, e.g. increasing the number of threads, or setting the response.headers.connection config to always return a value of 'close'.

In the short run, I suggest you study this ticket and apply the things people are talking about, while in the long run I really suggest that we start seriously considering replacing the CherryPy stack in our Python-based web apps, probably with Flask+wsgi.
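For reference, the two knobs from that ticket map onto CherryPy's global config roughly as below. This is only a sketch of the recipe; the exact values (and how the config is wired into our DBS deployment) would need to be checked against the actual setup:

```python
# Sketch of the cherrypy/cherrypy#1304 recipe: a bigger worker-thread pool,
# plus a blanket "Connection: close" response header so clients do not hold
# keep-alive sockets open against the server.
cherrypy_config = {
    "server.thread_pool": 200,             # up from the small default pool
    "server.socket_timeout": 10,           # drop idle sockets sooner
    "tools.response_headers.on": True,
    "tools.response_headers.headers": [("Connection", "close")],
}
```

With CherryPy, this dict would be applied via cherrypy.config.update(cherrypy_config). The trade-off: "Connection: close" forces a fresh TCP connection per request, which trades keep-alive pile-up for extra handshake overhead.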

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is really odd. We already had fd monitoring before this release, and we did not see fds surge to 10k+ in the past.

@h4d4

h4d4 commented May 9, 2019

@belforte @yuyiguo Nothing was changed in the latest deployment for the frontends.

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

[root@vocms0136 tmp]# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
4 FIN_WAIT2
38 LISTEN
472 ESTABLISHED
952 CLOSE_WAIT
1560 TIME_WAIT
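The same tally can be done in a few lines of Python over netstat output, which makes it easy to log the counts periodically. A sketch, assuming the plain `netstat -ant` column layout shown above (state in the last column):

```python
from collections import Counter

def tally_states(netstat_output):
    """Count TCP connection states from `netstat -ant` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # skip the two header lines; real entries start with tcp/tcp6
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states

sample = """\
tcp 0 0 10.0.0.1:443 10.0.0.2:5432 ESTABLISHED
tcp 0 0 10.0.0.1:443 10.0.0.3:6021 CLOSE_WAIT
tcp 0 0 10.0.0.1:443 10.0.0.4:7233 CLOSE_WAIT
"""
print(tally_states(sample))  # Counter({'CLOSE_WAIT': 2, 'ESTABLISHED': 1})
```

Feeding it the live output of `netstat -ant` every minute would give a time series of CLOSE_WAIT counts to correlate with the fd monitoring plots.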

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is from cherrypy/cherrypy#1304, which is a very long and old open ticket (opened March 2014):
"We experienced nearly the same scenario a while back. The issue is that CherryPy does not have a good way to handle HTTP Keep-Alive connections and so they begin to pile up with a default timeout time specified by the server.socket_timeout parameter. We resolved this problem by increasing the number of threads to 200 (probably a bit much), and setting the response.headers.connection config to always return a value of 'close'. This asks the browser to open a new TCP connection for each new request and tear it down after getting the response. We are currently experimenting with gunicorn and uwsgi, both of which appear to handle Keep-Alives in a better way than CherryPy."

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@h4d4
Lina,
Can we try increasing the threads to 200 and setting the response.headers.connection config to always return a value of 'close'?
We can start with one of the instances.

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@amaltaro
Contributor

amaltaro commented May 9, 2019

BTW, the number of threads is a DBS configuration parameter (note there are different config files for the different DBS instances):
https://github.com/dmwm/deployment/blob/master/dbs/DBSGlobalReader.py#L35

About the header response, you'd need to make changes to the DBS source code, I believe.

We also need to be careful: if we were sometimes hitting a 9GB RAM footprint with 15 threads, with 200 the chances of blowing up the process/node are going to be much higher.
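Alan's caution can be put in numbers with a naive linear extrapolation (an assumption; per-thread memory will not scale exactly linearly, but it gives the order of magnitude):

```python
# Observed worst case from earlier incidents: ~9 GB RSS with 15 threads.
observed_rss_gb = 9.0
observed_threads = 15
per_thread_gb = observed_rss_gb / observed_threads  # 0.6 GB per thread

# Proposed setting from the cherrypy#1304 recipe:
proposed_threads = 200
projected_rss_gb = per_thread_gb * proposed_threads
print(projected_rss_gb)  # 120.0 GB -- far beyond what a backend node has
```

Even if only a fraction of the per-thread footprint is truly per-thread, going from 15 to 200 threads without capping per-request memory is risky.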

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

While I was trying to answer Valentin's questions, I found the entry below in the FE log. This user has sent 29038 of the queries below and is continuing to do so. I don't know of any CMS logical_file_name called "_NA_". How did he get this name?

[14/May/2019:16:59:59 +0200] cmsweb.cern.ch 131.225.189.65 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6036 in 15985 out 485 body 94537097 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=505277547" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

@belforte
Member

It must be part of a script which fails to find the name and yet plows on and bugs you.
Have you written to this Benedikt already?

@belforte
Member

[email protected]

@belforte
Member

If that's really harmful, the best I can do is to remove him from VOMS; then at the next CMSWEB synchronization round (a few hours) he will no longer be authorized by the FE. Let me know.

@belforte
Member

Odd, it seems to come via dasgoclient! @vkuznet could this be your bug?

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

No, I haven't written to him about this yet. This is the same person who scans whole datasets every 20 minutes no matter what. I wrote to ask him to scan every few hours instead, since there is no new info for a dataset, but I haven't heard anything back for a week or so.

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@belforte
Member

No, that's another Benedikt! (Benedikt Mayer at MIT, running Dynamo and trying to keep the Dynamo registry in sync with PhEDEx and DBS, and somehow overdoing it, IMHO.)

@belforte
Member

OTOH, if this Benedikt Vormwald is in Germany (likely), we may not hear back until tomorrow. I am writing to him now with you in CC so that Valentin can find out about the dasgoclient use. Valentin, if a user sends you junk, maybe you can avoid passing it along to DBS?

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

Valentin,
To answer your question about which calls were made while the fds were really high:
the files below were obtained from the FE for May 14 16:30-17:00 CERN time, covering all DBS accesses.
The first column is the error code, the second is the method, and the third is the API.

log-2019-May-14-16-30.txt
log-2019-May-14-16-40.txt
log-2019-May-14-16-50.txt

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@belforte
Member

@vkuznet see also #605 (comment) and following. Something is fishy with blockdump; someone needs to look at it from inside the server code.

@belforte
Member

Shall I stop the CRAB DBS Publisher? No more migration requests, no more blockdump calls!

@belforte
Member

belforte commented May 14, 2019

OK. I went for a gentler approach and disabled calls for dataset migrations in ASO Publisher (they were all failing anyhow !). So CRAB will keep publishing user MC or user processing or already migrated datasets.
Let's see if this makes a difference in DBS load !

@yuyiguo
Member Author

yuyiguo commented May 15, 2019

No, it is not in chronological order. I ordered it by error code.

@yuyiguo
Member Author

yuyiguo commented May 15, 2019

Something is really wrong. Look at the two queries below: they are identical. DBS does input validation, so the first one returned HTTP 400, which is the correct behavior. However, the second one got HTTP 502. The validation is very simple; how could it take 300 s? What is the server doing? There is no database access involved here.

vocms0158/frontend/access_log_20190514.txt:[14/May/2019:23:57:37 +0200] cmsweb.cern.ch 129.59.197.22 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 400 [data: 6037 in 15819 out 107 body 325763 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1620332613" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

vocms0760/frontend/access_log_20190514.txt:[14/May/2019:21:15:38 +0200] cmsweb.cern.ch 193.146.75.180 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6037 in 15985 out 485 body 300127279 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1796822753" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

@amaltaro
Contributor

Point is, it's not a single user submitting such queries with _NA_ as logical_file_name. I reported it at least a couple of times already in one of those many threads about DBS instabilities.

Just checked entire dasgo code, it has NA in query parser which means that whatever client provides as input query is not parsed and NA is substituted.

I don't get your comment. If the das client can't parse the query, why would it replace the input with _NA_? Can you please clarify what part of the query is parsed and which part is substituted?

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@amaltaro
Contributor

Which means the user passed something that didn't match the basic dasclient validation:
"All alphabetical characters, [a-z], slashes, brackets, operators are allowed."

and the dasclient replaced whatever value the user provided with _NA_, right?

If so, isn't it better to raise an error at the das-client itself? Given that it already has basic validation in place, and we all know that any query with _NA_ will fail anyway, don't even let it go to CMSWEB. If you prefer not to return an error, just return an empty list.

Bottom line, user does not validate input in scripts.

Yes, I agree. However, if the client can run a simple pre-validation of the arguments, we won't have this same problem in the coming week/month/year.
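The kind of pre-validation being suggested is cheap. Here is an illustrative Python sketch; the regex below is a hypothetical placeholder, not the actual DBS lexicon rules, and relies only on the fact that real CMS LFNs start with /store/:

```python
import re

# Hypothetical LFN pattern for illustration; real DBS uses its own lexicon.
LFN_RE = re.compile(r"^/store/[a-zA-Z0-9_.\-/]+\.root$")

def validate_lfn(lfn):
    """Return True if lfn looks like a plausible CMS logical file name."""
    return bool(LFN_RE.match(lfn))

# A placeholder like "_NA_" is rejected before any HTTP request is sent:
print(validate_lfn("_NA_"))                                      # False
print(validate_lfn("/store/data/Run2018A/x/RECO/f1/abc.root"))   # True
```

A check like this in the client short-circuits the request locally, so a malformed query never reaches the frontends or DBS at all.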

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@amaltaro
Contributor

Alright! Given that you can't give me a straight answer and that your statements are contradictory, I tested it myself and here is the answer:

amaltaro@aiadm94:~/workarea/garbage/CMSSW_10_2_0/src $ dasgoclient -query="child file=" -verbose=1
child file=

### unique true
### selected services [dbs3:filechildren] [child.name]
### selected urls map[https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_:]
### selected localApis []
DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ 78.717825ms
ERRO[0000] DBS unable to unmarshal the data              Api=filechildren Error="json: cannot unmarshal object into Go value of type []mongo.DASRecord" data="{\"exception\": 400, \"message\": \"Invalid Input Data _NA_...: Not Match Required Format\", \"type\": \"HTTPError\"}"
Received 0 records

or another test with an empty space as user input:

amaltaro@aiadm94:~/workarea/garbage/CMSSW_10_2_0/src $ dasgoclient -query="child file= " -verbose=1
child file= 

### unique true
### selected services [dbs3:filechildren] [child.name]
### selected urls map[https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_:]
### selected localApis []
DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ 251.353106ms
ERRO[0000] DBS unable to unmarshal the data              Api=filechildren Error="json: cannot unmarshal object into Go value of type []mongo.DASRecord" data="{\"exception\": 400, \"message\": \"Invalid Input Data _NA_...: Not Match Required Format\", \"type\": \"HTTPError\"}"
Received 0 records

And now it is clear: dasgoclient replaces whatever user input it cannot understand with _NA_ and lets the request go through and hit all those CMSWEB data-services.

Then let me say it again: if the DAS client knows this request has no chance to succeed (otherwise it wouldn't have replaced the user input with _NA_), why does it even allow the request to proceed?!

It should be a simple change to the DAS client and would protect a bunch of data-services from malformed HTTP requests. If you still don't get what I'm saying, we can quickly discuss it over Vidyo.

@belforte
Member

@vkuznet I will not argue code with you, but there is no such thing as "users only send correct queries".
You can educate this one, but new ones come continuously and do all sorts of things. It is only a matter of where it is easier/cheaper to catch this. As you point out, eventually the upstream server does; let's ignore the case where upstream times out, which may be accidental here. The only question is whether it is preferable to catch user mistakes in the client or in the server.

@vkuznet
Contributor

vkuznet commented May 16, 2019

I added basic validations to the DAS code; the associated commits are: dmwm/das2go@ebe54d8 and dmwm/dasgoclient@ba6f5c6

Here is how it will work:

shell $ ./dasgoclient -query="child file="
Validation error: unmatched file pattern

shell $ echo $?
18

shell $ ./dasgoclient -exitCodes
DAS exit codes:
1 DAS error
2 DBS upstream error
3 PhEDEx upstream error
4 Rucio upstream error
5 Dynamo upstream error
6 ReqMgr upstream error
7 RunRegistry upstream error
8 McM upstream error
9 Dashboard upstream error
10 SiteDB upstream error
11 CRIC upstream error
12 CondDB upstream error
13 Combined error
14 MongoDB error
15 DAS proxy error
16 DAS query error
17 DAS parser error
18 DAS validation error

So, if the user omits input or provides inappropriate input, DAS will perform the validation and return immediately without sending the request to backend services. The code is in master and I'll try to schedule it for the next cmsweb upgrade cycle.

@amaltaro
Contributor

Thanks, Valentin!

@vkuznet
Contributor

vkuznet commented May 17, 2019

Now the new code is awaiting inclusion in CMSSW, see cms-sw/cmsdist#4981. It usually takes 1-2 weeks, and then the new dasgoclient with validation will be in place on cvmfs/CMSSW.

yuyiguo closed this as completed Apr 30, 2020