mattoliverau | morning | 00:03 |
---|---|---|
*** jv__ has joined #openstack-swift | 00:52 | |
*** jv__ has quit IRC | 01:08 | |
*** xiaolin has joined #openstack-swift | 01:31 | |
*** baojg has joined #openstack-swift | 01:47 | |
*** baojg has quit IRC | 01:50 | |
*** gyee has quit IRC | 01:58 | |
*** rcernin has quit IRC | 02:56 | |
*** rcernin has joined #openstack-swift | 03:04 | |
*** mahatic has quit IRC | 03:20 | |
*** baojg has joined #openstack-swift | 03:31 | |
*** baojg has quit IRC | 03:35 | |
*** rcernin has quit IRC | 03:35 | |
*** rcernin has joined #openstack-swift | 03:39 | |
*** psachin has joined #openstack-swift | 03:51 | |
*** rcernin has quit IRC | 03:54 | |
*** rcernin has joined #openstack-swift | 04:08 | |
*** rcernin has quit IRC | 04:18 | |
*** rcernin has joined #openstack-swift | 04:19 | |
*** evrardjp has quit IRC | 04:33 | |
*** evrardjp has joined #openstack-swift | 04:33 | |
*** xiaolin has quit IRC | 05:27 | |
*** mahatic has joined #openstack-swift | 11:01 | |
*** ChanServ sets mode: +v mahatic | 11:01 | |
*** jv__ has joined #openstack-swift | 11:22 | |
*** dsariel has joined #openstack-swift | 11:27 | |
*** jv__ has quit IRC | 12:38 | |
*** rcernin has quit IRC | 12:56 | |
*** jv__ has joined #openstack-swift | 13:09 | |
*** jv__ has quit IRC | 13:48 | |
*** djhankb has joined #openstack-swift | 15:23 | |
timburke | good morning | 15:24 |
*** manuvakery has joined #openstack-swift | 16:11 | |
*** psachin has quit IRC | 16:24 | |
openstackgerrit | Tim Burke proposed openstack/swift master: wsgi: stop closing listen sockets when workers die https://review.opendev.org/748721 | 16:48 |
timburke | clayg, i'm kinda tempted to squash ^^^ and its two parents into one patch -- it wasn't until i started digging into the graceful exit for workers that i could really see the strategy i wanted for socket-per-worker | 16:49 |
timburke | so a decent bit of that last patch feels like it's winding back changes from the first one :-/ | 16:50 |
clayg | idk man, we don't really know what's going on with these workers when we need them to shut down | 16:51 |
clayg | I was looking at the rss killer and thinking about HUP/TERM - and I'm not sure we won't need a "hard stop after timeout" sort of situation | 16:51 |
clayg | having options is ideal | 16:52 |
clayg | it looks like that change might also be adding workers sharing sockets again? i probably can't make an honest assessment about doing a squash w/o spending more time with it | 16:54 |
timburke | clayg, so fwiw, i've been testing with killing workers via HUP/USR1 for a graceful exit, TERM for a harder stop, and KILL for a "right now, i *mean it*" and the parent's been good about bringing back a fresh worker in its place | 17:40 |
timburke | workers will share sockets only in so far as one worker replaces another | 17:41 |
timburke | so if you've got workers=4, we bind four sockets in do_bind_ports, spin up four workers, and if one of those workers dies, we move its socket over to the "orphan" column so we can spin up a fresh worker to start accepting on it again | 17:43 |
*** manuvakery has quit IRC | 19:20 | |
openstackgerrit | Clay Gerrard proposed openstack/swift master: add swift-manage-shard-ranges shrink command https://review.opendev.org/741721 | 19:39 |
*** ormandj has quit IRC | 19:39 | |
clayg | timburke: is there a signal you can send to a worker that closes its socket as well? maybe useful to distinguish HUP/USR1 in this regard | 19:40 |
*** ormandj has joined #openstack-swift | 19:41 | |
timburke | not at the moment. got a preference on which one should do the close? | 19:41 |
timburke | or rather, the shutdown... | 19:41 |
timburke | it's gonna complicate the parent a bit since it'll need to check whether the socket it's got in hand is shutdown or not, but the flexibility does seem useful | 19:42 |
clayg | I'm almost positive the reason we're killing workers is to get the socket to close | 20:01 |
clayg | and i've also become skeptical that our current rss killer can "just" use HUP - I think it ends up sending a TERM after close doesn't work | 20:02 |
openstackgerrit | Tim Burke proposed openstack/swift master: Client should retry when there's just one 404 and a bunch of errors https://review.opendev.org/744942 | 20:11 |
timburke | clayg, which socket, though? the listen socket or the client connection socket? if the rss killer is working *today*, without a listen socket per-worker, it sure seems like if anything, it *must* be the client connection socket that needs to get nuked | 20:16 |
timburke | i'm still concerned by the stack xrays we've seen recently for orphans following a USR1 that seem to show workers in a deadlock down in logging. if those start piling up, and a bunch of them are loading some large-ish SLO manifest in their head before locking up, that seems likely to cause ballooning memory... | 20:18 |
timburke | of course, if those two things *are* related, the graceful stop won't actually stop -- but by sending a "stop accepting new connections" signal and then waiting 0.5-5 mins, we'll have more confidence that any connections still associated with that worker were *never* going to receive a response, so a TERM is "safe" | 20:22 |
timburke | i think it'll absolutely be worth us getting the new code running on a canary in prod then manually watching for when rss gets "too high" and fixing it. if a simple HUP is insufficient, we should be ready to run an xray and look for whether we've still got the main thread in the accept loop or not. if it is, that'd indicate the HUP was ineffective and we should leave the rss-killer going straight to TERM | 20:28 |
*** djhankb has quit IRC | 22:06 | |
*** djhankb has joined #openstack-swift | 22:07 | |
-openstackstatus- NOTICE: A zuul server ended up with read only filesystems which caused many jobs to hit retry_limit. The server has been rebooted and appears happy. Jobs can be rechecked. | 22:14 | |
*** djhankb has quit IRC | 22:37 | |
*** djhankb has joined #openstack-swift | 22:37 | |
timburke | :-/ *is* there a way to check whether a listen socket has been shut down, short of trying to accept and catching the EINVAL if it has? i still want the parent to do the binding, but only the children should do any accepting... | 22:57 |
DHE | what would happen if you polled it? | 23:01 |
*** rcernin has joined #openstack-swift | 23:10 | |
*** rcernin has quit IRC | 23:15 | |
*** djhankb has quit IRC | 23:30 | |
*** djhankb has joined #openstack-swift | 23:31 | |
timburke | DHE, good call! looks like i can check for POLLHUP flags being set. and here i was just about ready to go parsing the result of `lsof -a -p {os.getpid()} -d {sock.fileno()} -FtT`... | 23:48 |
DHE | hooray I'm useful! :) | 23:58 |
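
The poll-based check discussed at 23:01 and 23:48 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch under review: the helper name `listen_socket_is_shutdown` is hypothetical, and it assumes Linux, where a TCP listen socket that has been shut down reports `POLLHUP`. Other platforms may behave differently.

```python
import select
import socket


def listen_socket_is_shutdown(sock):
    """Return True if poll() reports POLLHUP for ``sock``.

    Hypothetical helper; relies on Linux reporting POLLHUP for a
    listen socket after shutdown().
    """
    poller = select.poll()
    # POLLHUP is reported whenever it is set, so no event mask is requested.
    poller.register(sock.fileno(), 0)
    events = poller.poll(0)  # timeout of 0 ms: return immediately
    return any(flags & select.POLLHUP for _fd, flags in events)


if __name__ == "__main__":
    # The parent binds and listens, but never calls accept() itself.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    print("before shutdown:", listen_socket_is_shutdown(listener))

    # Stand-in for a worker that shuts the socket down on a graceful exit.
    listener.shutdown(socket.SHUT_RDWR)
    print("after shutdown:", listen_socket_is_shutdown(listener))
    listener.close()
```

A check like this would let the parent inspect the state of a socket without ever calling `accept()` on it, which keeps accepting strictly in the children.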
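
The socket-per-worker bookkeeping described at 17:41 to 17:43 (bind one socket per worker, and move a dead worker's socket to an "orphan" column until a replacement is started) might look something like the sketch below. The class and method names are hypothetical, and plain strings stand in for the bound sockets; it only illustrates the bookkeeping, not forking or signal handling.

```python
class WorkerSocketBook:
    """Track which worker pid owns which pre-bound listen socket.

    Hypothetical sketch: every socket starts out orphaned, is assigned to
    a worker when one is started, and returns to the orphan list when
    that worker exits.
    """

    def __init__(self, sockets):
        self.orphans = list(sockets)  # bound sockets with no live worker
        self.pid_to_sock = {}         # worker pid -> its listen socket

    def worker_started(self, sock, pid):
        # The socket now has exactly one worker accepting on it.
        self.orphans.remove(sock)
        self.pid_to_sock[pid] = sock

    def worker_exited(self, pid):
        # Move the dead worker's socket back to the orphan column so a
        # fresh worker can start accepting on it again.
        sock = self.pid_to_sock.pop(pid, None)
        if sock is not None:
            self.orphans.append(sock)
        return sock


if __name__ == "__main__":
    # workers=4: four sockets bound up front (strings as placeholders).
    book = WorkerSocketBook(["sock-0", "sock-1", "sock-2", "sock-3"])
    for pid, sock in zip((101, 102, 103, 104), list(book.orphans)):
        book.worker_started(sock, pid)

    book.worker_exited(102)          # one worker dies
    print(book.orphans)              # its socket is waiting for a replacement
    book.worker_started(book.orphans[0], 105)
    print(book.pid_to_sock)
```

With this arrangement two workers only ever "share" a socket in the sense that one replaces the other; at any moment each socket has at most one worker accepting on it.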