*** bauzas_ is now known as bauzas | 09:58 | |
*** bauzas_ is now known as bauzas | 13:17 | |
lajoskatona | Hi, I have a question regarding one of the Neutron jobs; perhaps you have a view that can help us. | 15:03 |
lajoskatona | Neutron-functional jobs started to fail recently with "OSError: [Errno 24] Too many open files" (i.e.: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_212/928953/12/check/neutron-functional-with-uwsgi-10/212c1ad/testr_results.html ) | 15:03 |
clarkb | lajoskatona: my hunch is that you are hitting a ulimit | 15:05 |
lajoskatona | we started to increase the ulimit: https://review.opendev.org/c/openstack/neutron/+/928759 but the above error is coming even with ulimit 4096, which is strange, so do you perhaps have an idea what other direction we should look at? | 15:05 |
clarkb | no it very likely is you're hitting that 4k ulimit on open files. Is it possible you are leaking file descriptors? | 15:05 |
lajoskatona | clarkb: the issue started a few weeks ago, and strangely on older branches also, | 15:07 |
clarkb | though looking at that change I'm not sure you're applying the ulimit properly against the devstack user | 15:07 |
clarkb | so maybe you're still hitting the 2k limit | 15:07 |
lajoskatona | and as it is functional (lots of namespaces and things are opened in parallel) | 15:08 |
clarkb | wouldn't be strange if you backported the bug (assuming it is a bug leaking fds) | 15:08 |
lajoskatona | clarkb: yes that can be | 15:08 |
lajoskatona | I tried to set it in local.conf, like ULIMIT_NOFILE: xy, but as the functional tests for Neutron use their own scripts and stack.sh is only partially executed, that didn't work | 15:09 |
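(For reference, a minimal sketch of how the per-process open-file limit could be inspected and raised from Python using the standard resource module. Whether this is a usable hook for the Neutron functional test runner, rather than the devstack ULIMIT_NOFILE setting discussed above, is an assumption.)

```python
# Sketch: inspect and raise the per-process open-file limit (RLIMIT_NOFILE).
# Raising the soft limit above the hard limit needs privileges, so in practice
# the hard limit usually has to be raised for the devstack/test user itself
# (limits.conf, systemd, or the job definition), not from inside the tests.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current nofile limits: soft={soft} hard={hard}")

# Bump the soft limit up to the current hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```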
clarkb | can you trap for the error then do a fd table dump? | 15:09 |
clarkb | you should be able to get paths and ownership and get a clue as to where the issue is stemming from | 15:10 |
clarkb | a more brute force approach would be to overload open() and dump a table each time you open things but that might explode your logging | 15:10 |
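(A rough sketch of the two ideas above, assuming Linux and CPython: dump the fd table from /proc/self/fd when EMFILE is hit, and wrap builtins.open to attribute opens. Note this only sees Python-level open() calls, not sockets or files opened by C libraries, and the stack logging can be very noisy, as mentioned.)

```python
import builtins
import errno
import os
import traceback

def dump_fd_table():
    """Print every open fd of this process and the path it points to."""
    fd_dir = "/proc/self/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            target = "<already closed>"
        print(f"fd {fd} -> {target}")

_real_open = builtins.open

def traced_open(*args, **kwargs):
    """Log each open() with a short stack so leaked fds can be attributed."""
    caller = "".join(traceback.format_stack(limit=3)[:-1])
    print(f"open({args!r}, {kwargs!r}) from:\n{caller}")
    return _real_open(*args, **kwargs)

builtins.open = traced_open

try:
    with open("/etc/hostname") as f:   # stand-in for any code path that opens files
        f.read()
except OSError as exc:
    if exc.errno == errno.EMFILE:      # "Too many open files"
        dump_fd_table()
    raise
```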
fungi | or you can try to list them all with lsof | 15:10 |
clarkb | fungi: ya I imagine part of the issue with that is the race involved in things crashing and cleaning up vs being visible to lsof, but maybe if run often enough we can win that race | 15:11 |
fungi | yeah, could be | 15:11 |
lajoskatona | file tracker is on, example: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_212/928953/12/check/neutron-functional-with-uwsgi-10/212c1ad/controller/logs/screen-file_tracker.txt | 15:12 |
clarkb | lajoskatona: that lacks paths | 15:12 |
clarkb | but you can pretty clearly see it balloon to 15.5k | 15:13 |
clarkb | if you had paths and processes you'd hopefully be able to narrow down the source of the issue. Maybe file tracker if > 15k does an lsof dump into another file | 15:13 |
clarkb | then you'd get that info | 15:13 |
lajoskatona | clarkb: yes, but we assumed it's not only the test user (you know, trying to look on the bright side and hoping somebody else is the troublemaker anyway :)) | 15:14 |
clarkb | lajoskatona: right, I'm just saying we already know the fd count is too high based on the error. So that doesn't give us any real new info other than we can see it climb slowly. But it might provide a good hook point to do an lsof dump when you exceed a threshold like 15k | 15:15 |
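(One possible shape for the threshold-triggered lsof dump described here, as a standalone watcher running alongside the job; the 15000 threshold, poll interval, and output path are illustrative placeholders, not something the devstack file tracker actually does.)

```python
# Sketch: poll the system-wide open-file count and, once it crosses a
# threshold, capture a single lsof snapshot so paths and owning processes
# end up in the job logs.
import subprocess
import time

THRESHOLD = 15000                          # placeholder, matches the ~15k seen above
OUTPUT = "/opt/stack/logs/lsof-dump.txt"   # placeholder path

def open_file_count() -> int:
    """System-wide count of allocated file handles (Linux)."""
    with open("/proc/sys/fs/file-nr") as f:
        return int(f.read().split()[0])

while True:
    if open_file_count() > THRESHOLD:
        with open(OUTPUT, "w") as out:
            subprocess.run(["lsof", "-n"], stdout=out, check=False)
        break
    time.sleep(10)
```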
lajoskatona | clarkb: thanks, I'll check this file tracker; as this is just the devstack service, perhaps I can add something similar to the functional job to get such extra info | 15:16 |
clarkb | lajoskatona: you can update devstack and then Depends-On that in your change | 15:20 |
lajoskatona | clarkb: thanks, good idea, I'll check that direction first | 15:31 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!