
file descriptors surge #603

Closed
yuyiguo opened this issue May 8, 2019 · 117 comments

@yuyiguo
Member

yuyiguo commented May 8, 2019

@vkuznet @h4d4 @bbockelm @belforte @amaltaro @h4d4
The most recent DBS instability (May 7-8) was different from the past ones. From the monitoring plots, I did not see high memory or CPU usage. In addition, I checked /var/log/messages and did not see the OOM-killer entries we had before. However, the file descriptor (fd) counts were really high, as shown on the monitoring plots, sometimes 15k+. I checked the APIs called while the fd counts were high; they covered a wide range.

I am going to look at these APIs to check for DB connection leaks.

Any suggestions on how to attack the fd problem?

@belforte
Member

belforte commented May 9, 2019 via email

@amaltaro
Contributor

amaltaro commented May 9, 2019

Where are you getting this file descriptors monitoring from? Was it 15k file descriptors for DBS in general in one backend or 15k for one single process in a backend?

If you look at /proc/PID/fd/ on the backend, you can list all the file descriptors your process/PID has open. Looking at vocms0136 right now, I see around 40 of them; a bunch of them are accessing the same tnsnames.ora file (probably unwanted), and most of the rest are network socket descriptors.

I assume each query to Oracle requires one socket. It would be great to correlate queries/DAOs with the number of file descriptors. I don't know these details well, but I'm surprised to see 15k file descriptors for a process that is handling 25(?) user requests.
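To make that correlation easier, the /proc listing can be tallied programmatically. Below is a small sketch (assumes a Linux host, like the DBS backends) that counts a process's open descriptors by type:

```python
import os

def count_fds(pid):
    """Count open file descriptors of a process via /proc/<pid>/fd (Linux only)."""
    fd_dir = "/proc/{}/fd".format(pid)
    counts = {"socket": 0, "pipe": 0, "other": 0}
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        else:
            counts["other"] += 1
    return counts

# e.g. inspect the current process:
print(count_fds(os.getpid()))
```

Run periodically against the DBS PID, this would show whether the surge is sockets (network/Oracle) or regular files like tnsnames.ora.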

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

Thanks Alan, Valentin,
I am watching ...

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

total 0
dr-x------. 2 _dbs _dbs 0 May 9 14:08 .
dr-xr-xr-x. 9 _dbs _dbs 0 May 9 14:08 ..
lr-x------. 1 _dbs _dbs 64 May 9 14:08 0 -> /dev/null
l-wx------. 1 _dbs _dbs 64 May 9 14:08 1 -> pipe:[24875653]
lrwx------. 1 _dbs _dbs 64 May 9 14:21 10 -> socket:[25211083]

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@vkuznet @amaltaro
The listing above is from vocms0163. How do I find out what these sockets are?

@bbockelm
Contributor

bbockelm commented May 9, 2019

You will want to use “lsof -p $PID” instead; that will show the IP addresses of all those sockets.

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

OK, thanks

@belforte
Member

belforte commented May 9, 2019 via email

@bbockelm
Contributor

bbockelm commented May 9, 2019

Sockets can also be outgoing - to Oracle!

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is a file from the command “lsof -p $PID” on vocms0136 when there were about 600+ fds.
fd-0848-0136-600.txt

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@belforte @amaltaro @vkuznet @bbockelm @h4d4
vocms0158, vocms0760, vocms0162 and vocms0164
are FEs (frontends). There are a lot of "CLOSE_WAIT" and "ESTABLISHED" connections to them.

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.

The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.

Once the local program closes the socket, the OS can send the FIN to the remote end which transitions you to LAST_ACK while you wait for the ACK of the FIN. Once that is received, the 
connection is finished and drops from the connection table (if your end is in CLOSE_WAIT you do not end up in the TIME_WAIT state).

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

Got this from Stack Overflow. How do we close the connection to the FE? I don't remember doing the closing explicitly in the DBS code. Can someone remind me?

@vkuznet
Contributor

vkuznet commented May 9, 2019

Yuyi, a simple Google search leads to this open ticket:
cherrypy/cherrypy#1304

which provides some recipes for handling this issue, e.g. increasing the number of threads, or setting the response.headers.connection config to always return a value of 'close'.

In the short run, I suggest you study this ticket and apply the things people are talking about, while in the long run I really suggest that we start seriously considering replacing the CherryPy stack in our Python-based web apps, probably with Flask+wsgi.
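For reference, the two knobs from that ticket map onto CherryPy's global config roughly as below. This is only a sketch of the recipe; the exact values (and how the config is wired into our DBS deployment) would need to be checked against the actual setup:

```python
# Sketch of the cherrypy/cherrypy#1304 recipe: a bigger worker-thread pool,
# plus a blanket "Connection: close" response header so clients do not hold
# keep-alive sockets open against the server.
cherrypy_config = {
    "server.thread_pool": 200,             # up from the small default pool
    "server.socket_timeout": 10,           # drop idle sockets sooner
    "tools.response_headers.on": True,
    "tools.response_headers.headers": [("Connection", "close")],
}
```

With CherryPy, this dict would be applied via cherrypy.config.update(cherrypy_config). The trade-off: "Connection: close" forces a fresh TCP connection per request, which trades keep-alive pile-up for extra handshake overhead.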

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is really odd. We already had fd monitoring before this release, and we did not see fds surge to 10k+ in the past.

@h4d4

h4d4 commented May 9, 2019

@belforte @yuyiguo Nothing was changed in the latest deployment for the frontends.

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

[root@vocms0136 tmp]# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
4 FIN_WAIT2
38 LISTEN
472 ESTABLISHED
952 CLOSE_WAIT
1560 TIME_WAIT
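The same tally can be done in a few lines of Python over netstat output, which makes it easy to log the counts periodically. A sketch, assuming the plain `netstat -ant` column layout shown above (state in the last column):

```python
from collections import Counter

def tally_states(netstat_output):
    """Count TCP connection states from `netstat -ant` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # skip the two header lines; real entries start with tcp/tcp6
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states

sample = """\
tcp 0 0 10.0.0.1:443 10.0.0.2:5432 ESTABLISHED
tcp 0 0 10.0.0.1:443 10.0.0.3:6021 CLOSE_WAIT
tcp 0 0 10.0.0.1:443 10.0.0.4:7233 CLOSE_WAIT
"""
print(tally_states(sample))  # Counter({'CLOSE_WAIT': 2, 'ESTABLISHED': 1})
```

Feeding it the live output of `netstat -ant` every minute would give a time series of CLOSE_WAIT counts to correlate with the fd monitoring plots.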

@belforte
Member

belforte commented May 9, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

This is from cherrypy/cherrypy#1304, which is a very long and old open ticket (opened March 2014):
"We experienced nearly the same scenario a while back. The issue is that CherryPy does not have a good way to handle HTTP Keep-Alive connections and so they begin to pile up with a default timeout time specified by the server.socket_timeout parameter. We resolved this problem by increasing the number of threads to 200 (probably a bit much), and setting the response.headers.connection config to always return a value of 'close'. This asks the browser to open a new TCP connection for each new request and tear it down after getting the response. We are currently experimenting with gunicorn and uwsgi, both of which appear to handle Keep-Alives in a better way than CherryPy."

@yuyiguo
Member Author

yuyiguo commented May 9, 2019

@h4d4
Lina,
Can we try increasing the threads to 200 and setting the response.headers.connection config to always return a value of 'close'?
We can start with one of the instances.

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@vkuznet
Contributor

vkuznet commented May 9, 2019 via email

@amaltaro
Contributor

amaltaro commented May 9, 2019

BTW, the number of threads is a DBS configuration parameter (note there are different config files for the different DBS instances):
https://github.com/dmwm/deployment/blob/master/dbs/DBSGlobalReader.py#L35

About the header response, you'd need to make changes to the DBS source code, I believe.

We also need to be careful: if we were sometimes hitting a 9GB RAM footprint with 15 threads, with 200 the chances of blowing up the process/node are going to be much higher.
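Alan's caution can be put in numbers with a naive linear extrapolation (an assumption; per-thread memory will not scale exactly linearly, but it gives the order of magnitude):

```python
# Observed worst case from earlier incidents: ~9 GB RSS with 15 threads.
observed_rss_gb = 9.0
observed_threads = 15
per_thread_gb = observed_rss_gb / observed_threads  # 0.6 GB per thread

# Proposed setting from the cherrypy#1304 recipe:
proposed_threads = 200
projected_rss_gb = per_thread_gb * proposed_threads
print(projected_rss_gb)  # 120.0 GB -- far beyond what a backend node has
```

Even if only a fraction of the per-thread footprint is truly per-thread, going from 15 to 200 threads without capping per-request memory is risky.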

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

While I was trying to answer Valentin's questions, I found the entry below in the FE log. This user has sent 29038 of the queries below and is continuing to do so. I don't know of any CMS logical_file_name called "_NA_". How did he get this name?

[14/May/2019:16:59:59 +0200] cmsweb.cern.ch 131.225.189.65 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6036 in 15985 out 485 body 94537097 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=505277547" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

@belforte
Member

It must be part of a script which fails to find the name and yet plows on and bugs you.
Have you written to this Benedikt already?

@belforte
Member

[email protected]

@belforte
Member

If that's really harmful, the best I can do is to remove him from VOMS; then at the next CMSWEB synchronization round (a few hours) he will no longer be authorized by the FE. Let me know.

@belforte
Member

Odd, it seems to come via dasgoclient! @vkuznet could this be your bug?

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

No, I haven't written to him about this yet. This is the same person who scans whole datasets every 20 minutes no matter what. I wrote to ask him to scan every few hours instead, since there is no new info for a dataset, but I haven't heard anything back for a week or so.

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@belforte
Member

No, that's another Benedikt! (Benedikt Mayer at MIT, running Dynamo and trying to keep the Dynamo registry in sync with PhEDEx and DBS, and somehow overdoing it, IMHO.)

@belforte
Member

OTOH, if this Benedikt Vormwald is in Germany (likely), we may not hear back until tomorrow. I am writing to him now with you in CC so that Valentin can find out about the dasgoclient use. Valentin, if a user sends you junk, maybe you can avoid passing it along to DBS?

@yuyiguo
Member Author

yuyiguo commented May 14, 2019

Valentin,
To answer your question about which calls were made while the fds were really high:
the files below were obtained from the FE for May 14 16:30-17:00 CERN time, covering all DBS accesses.
The first column is the error code, the second is the method, and the third is the API.

log-2019-May-14-16-30.txt
log-2019-May-14-16-40.txt
log-2019-May-14-16-50.txt

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@vkuznet
Contributor

vkuznet commented May 14, 2019 via email

@belforte
Member

@vkuznet see also #605 (comment) and following. Something is fishy with blockdump; someone needs to look at it from inside the server code.

@belforte
Member

Shall I stop the CRAB DBS Publisher? No more migration requests, no more blockdump calls!

@belforte
Member

belforte commented May 14, 2019

OK. I went for a gentler approach and disabled calls for dataset migrations in ASO Publisher (they were all failing anyhow !). So CRAB will keep publishing user MC or user processing or already migrated datasets.
Let's see if this makes a difference in DBS load !

@yuyiguo
Member Author

yuyiguo commented May 15, 2019

No, it is not in chronological order. I ordered it by error code.

@yuyiguo
Member Author

yuyiguo commented May 15, 2019

Something is really wrong. Look at the two queries below: they are identical. DBS does input validation, so the first one returned HTTP 400, which is the correct behavior. However, the second one got HTTP 502. The validation is very simple; how could it take 300 s? What is the server doing? There is no database access involved here.

vocms0158/frontend/access_log_20190514.txt:[14/May/2019:23:57:37 +0200] cmsweb.cern.ch 129.59.197.22 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 400 [data: 6037 in 15819 out 107 body 325763 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1620332613" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

vocms0760/frontend/access_log_20190514.txt:[14/May/2019:21:15:38 +0200] cmsweb.cern.ch 193.146.75.180 "GET /dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ HTTP/1.1" 502 [data: 6037 in 15985 out 485 body 300127279 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/C=DE/O=GermanGrid/OU=uni-hamburg/CN=Benedikt Vormwald/CN=3533979439/CN=1796822753" "-" ] [ref: "-" "dasgoclient/v02.01.01" ]

@amaltaro
Contributor

Point is, it's not a single user submitting such queries with _NA_ as logical_file_name. I reported it at least a couple of times already in one of those many threads about DBS instabilities.

Just checked entire dasgo code, it has NA in query parser which means that whatever client provides as input query is not parsed and NA is substituted.

I don't get your comment. If the das client can't parse the query, why would it replace the input with _NA_? Can you please clarify what part of the query is parsed and which part is substituted?

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@amaltaro
Contributor

Which means the user passed something that didn't match the basic dasclient validation:
"All alphabetical characters, [a-z], slashes, brackets, operators are allowed."

and the dasclient replaced whatever value the user provided with _NA_, right?

If so, isn't it better to raise an error at the das-client itself? Given that it already has basic validation in place, and we all know that any query with _NA_ will fail anyway, don't even let it go to CMSWEB. If you prefer not to return an error, just return an empty list.

Bottom line, user does not validate input in scripts.

Yes, I agree. However, if the client can run a simple pre-validation of the arguments, we won't have this same problem in the coming week/month/year.
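The kind of pre-validation being suggested is cheap. Here is an illustrative Python sketch; the regex below is a hypothetical placeholder, not the actual DBS lexicon rules, and relies only on the fact that real CMS LFNs start with /store/:

```python
import re

# Hypothetical LFN pattern for illustration; real DBS uses its own lexicon.
LFN_RE = re.compile(r"^/store/[a-zA-Z0-9_.\-/]+\.root$")

def validate_lfn(lfn):
    """Return True if lfn looks like a plausible CMS logical file name."""
    return bool(LFN_RE.match(lfn))

# A placeholder like "_NA_" is rejected before any HTTP request is sent:
print(validate_lfn("_NA_"))                                      # False
print(validate_lfn("/store/data/Run2018A/x/RECO/f1/abc.root"))   # True
```

A check like this in the client short-circuits the request locally, so a malformed query never reaches the frontends or DBS at all.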

@vkuznet
Contributor

vkuznet commented May 15, 2019 via email

@amaltaro
Contributor

Alright! Given that you can't give me a straight answer and that your statements are contradictory, I tested it myself and here is the answer:

amaltaro@aiadm94:~/workarea/garbage/CMSSW_10_2_0/src $ dasgoclient -query="child file=" -verbose=1
child file=

### unique true
### selected services [dbs3:filechildren] [child.name]
### selected urls map[https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_:]
### selected localApis []
DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ 78.717825ms
ERRO[0000] DBS unable to unmarshal the data              Api=filechildren Error="json: cannot unmarshal object into Go value of type []mongo.DASRecord" data="{\"exception\": 400, \"message\": \"Invalid Input Data _NA_...: Not Match Required Format\", \"type\": \"HTTPError\"}"
Received 0 records

or another test with an empty space as user input:

amaltaro@aiadm94:~/workarea/garbage/CMSSW_10_2_0/src $ dasgoclient -query="child file= " -verbose=1
child file= 

### unique true
### selected services [dbs3:filechildren] [child.name]
### selected urls map[https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_:]
### selected localApis []
DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filechildren/?logical_file_name=_NA_ 251.353106ms
ERRO[0000] DBS unable to unmarshal the data              Api=filechildren Error="json: cannot unmarshal object into Go value of type []mongo.DASRecord" data="{\"exception\": 400, \"message\": \"Invalid Input Data _NA_...: Not Match Required Format\", \"type\": \"HTTPError\"}"
Received 0 records

And now it is clear: dasgoclient replaces whatever user input it cannot understand with _NA_ and lets the request go through and hit all those CMSWEB data-services.

Then let me say it again: if the DAS client knows this request has no chance to succeed (otherwise it wouldn't have replaced the user input with _NA_), why does it even allow the request to proceed?!

It should be a simple change to the DAS client and would protect a bunch of data-services from malformed HTTP requests. If you still don't get what I'm saying, we can quickly discuss it over Vidyo.

@belforte
Member

@vkuznet I will not argue code with you, but there is no such thing as "users only send correct queries".
You can educate this one, but new ones come continuously and do all sorts of things. It is only a matter of where it is easier/cheaper to catch this. As you point out, eventually the upstream server does; let's ignore the case where upstream times out, which may be accidental here. The only question is whether it is preferable to catch user mistakes in the client or in the server.

@vkuznet
Contributor

vkuznet commented May 16, 2019

I added basic validations to the DAS code; the associated commits are: dmwm/das2go@ebe54d8 and dmwm/dasgoclient@ba6f5c6

Here is how it will work:

shell $ ./dasgoclient -query="child file="
Validation error: unmatched file pattern

shell $ echo $?
18

shell $ ./dasgoclient -exitCodes
DAS exit codes:
1 DAS error
2 DBS upstream error
3 PhEDEx upstream error
4 Rucio upstream error
5 Dynamo upstream error
6 ReqMgr upstream error
7 RunRegistry upstream error
8 McM upstream error
9 Dashboard upstream error
10 SiteDB upstream error
11 CRIC upstream error
12 CondDB upstream error
13 Combined error
14 MongoDB error
15 DAS proxy error
16 DAS query error
17 DAS parser error
18 DAS validation error

So, if the user omits input or provides inappropriate input, DAS will perform the validation and return immediately without sending the request to backend services. The code is in master and I'll try to schedule it for the next cmsweb upgrade cycle.

@amaltaro
Contributor

Thanks, Valentin!

@vkuznet
Contributor

vkuznet commented May 17, 2019

Now the new code is awaiting inclusion in CMSSW, see cms-sw/cmsdist#4981. It usually takes 1-2 weeks, and then the new dasgoclient with validation will be in place on cvmfs/CMSSW.

yuyiguo closed this as completed Apr 30, 2020