opendevreview | Merged openstack/project-config master: project-config-grafana: add opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847866 | 00:53 |
fungi | ugh, seeing infra-prod-service-zuul failures in deploy. will take a closer look in a moment | 00:54 |
opendevreview | Ian Wienand proposed openstack/project-config master: Revert "project-config-grafana: add opendev-buildset-registry" https://review.opendev.org/c/openstack/project-config/+/847868 | 00:54 |
ianw | fyi ^ is abandoned, reverted in the wrong repo | 00:56 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: filter opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847870 | 01:02 |
fungi | RUNNING HANDLER [zuul-scheduler : Reload Zuul Scheduler] | 01:31 |
fungi | fatal: [zuul01.opendev.org]: FAILED! | 01:32 |
fungi | ConnectionRefusedError: [Errno 111] Connection refused | 01:32 |
fungi | https://zuul.opendev.org/components says both schedulers are running | 01:33 |
fungi | looking at https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul&skip=0 the hourly jobs are succeeding but the deploy jobs are failing | 01:37 |
ianw | fungi: would that be something to do with running as root and docker-compose? | 01:38 |
ianw | seems like an error maybe from not being able to talk to the docker socket to stop the container | 01:39 |
fungi | both those failures raised the same ConnectionRefusedError exception | 01:41 |
fungi | do we run it differently in deploy than in hourly? | 01:41 |
fungi | it's a cmd task running `docker-compose exec -T scheduler zuul-scheduler smart-reconfigure` | 01:42 |
ianw | not really i don't think, it's an odd one | 02:10 |
opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/847872 | 02:28 |
Clark[m] | fungi: ianw: it's a local socket connection iirc | 02:35 |
Clark[m] | Oh unless the error is in docker compose itself | 02:36 |
Clark[m] | But both are local sockets | 02:36 |
Clark[m] | Hourly and deploy should run the same playbook | 02:39 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Fix BLS entries for /boot partitions https://review.opendev.org/c/openstack/diskimage-builder/+/846838 | 02:41 |
*** rlandy is now known as rlandy|out | 02:43 | |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 05:12 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 05:41 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 06:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 06:28 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 07:03 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 07:29 |
*** jpena|off is now known as jpena | 07:42 | |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite https://review.opendev.org/c/opendev/system-config/+/847876 | 07:50 |
*** undefined_ is now known as Guest3501 | 07:59 | |
opendevreview | Ian Wienand proposed opendev/system-config master: graphite: fix xFilesFactor https://review.opendev.org/c/opendev/system-config/+/847876 | 08:47 |
ianw | frickler: ^ I am pretty sure this is the reason for the "missing" stats on *some* of the dib builds on that status page. it only affects .wsp files created since I migrated graphite to ansible, which is why the older builds work. basically infrequent datapoints get nulled out as the data ages | 08:49 |
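(Editor's note: a toy sketch of the whisper roll-up behaviour ianw describes above; the slot counts and values are made up for illustration, not taken from the actual graphite config.)

```python
# Toy model of Whisper archive roll-up: the coarser archive keeps an
# aggregate only if at least xFilesFactor of the source slots are non-null.
def rollup(points, xff, method=max):
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < xff:
        return None  # sparse data gets nulled out as it ages
    return method(known)

hour_of_slots = [None] * 59 + [1]      # e.g. one dib build stat in an hour
print(rollup(hour_of_slots, xff=0.5))  # None -> "missing" stats (graphite default)
print(rollup(hour_of_slots, xff=0.0))  # 1    -> the datapoint survives aggregation
```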
frickler | ianw: interesting, will review later | 09:09 |
*** rlandy|out is now known as rlandy | 09:40 | |
*** dviroel|out is now known as dviroel | 11:37 | |
fungi | dstufft is working on a spec for a new pypi upload workflow: https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879 | 11:56 |
fungi | of particular interest is the ability to upload archives but not commit the release until they're in place and even perhaps verified/tested | 11:56 |
*** Guest3501 is now known as rcastillo | 12:59 | |
*** dasm|off is now known as dasm | 13:04 | |
jrosser_ | is there anything more specific than SUCCESS/FAILURE that gets into graphite for job results? | 13:15 |
jrosser_ | i was thinking to add these POST_FAILURES to the osa grafana dashboard but i'm not seeing that represented in graphite | 13:16 |
opendevreview | Jonathan Rosser proposed openstack/project-config master: Update openstack-ansible grafana dashboard https://review.opendev.org/c/openstack/project-config/+/847973 | 13:31 |
rlandy | jrosser_: POST_FAILURES are looking much better on our side | 13:42 |
jrosser_ | rlandy: yes they are for me too - did you merge anything to reduce the quantity of log upload? | 13:43 |
rlandy | jrosser_: we did ... | 13:44 |
rlandy | but assumed that some underlying change was pushed to help as well? | 13:44 |
jrosser_ | no i don't think so - i believe it is still unclear what the root cause is | 13:44 |
fungi | since we don't know all the contributing factors, it's entirely possible there is also some transient variable involved, like performance degradation in a swift provider or network problems between our executors and the swift endpoints or load on executors themselves or... | 13:45 |
rlandy | right - what changed all of a sudden is not clear to us - but what does show is ... | 13:47 |
rlandy | there is some volume of logs that sends us into the danger zone | 13:47 |
rlandy | the test hit the most was the multinode updates test | 13:47 |
rlandy | which makes sense | 13:48 |
rlandy | multinodes, multiple installs | 13:48 |
fungi | right, the fact that almost all of these cases are for tripleo and openstack-ansible changes means that other projects are doing something different which causes them not to be hit as hard (current theory is that the impact is influenced by the volume of logs being uploaded, either count or size) | 13:48 |
*** dviroel is now known as dviroel|biab | 13:52 | |
jrosser_ | the screenshots in the grafyaml update jobs are awesome :) | 13:54 |
*** dviroel|biab is now known as dviroel | 14:21 | |
clarkb | jrosser_: that is all credit to ianw | 14:45 |
jrosser_ | it's really cool | 14:45 |
jrosser_ | and the job runs really quick too | 14:45 |
clarkb | as for statuses other than success or failure I thought zuul reported them all, but maybe it doesn't | 14:46 |
jrosser_ | i guess it's not hooked up to the actual data source though | 14:46 |
jrosser_ | yeah, i was looking at my grafana dashboard and that seems set up to deal with TIMEOUT | 14:47 |
jrosser_ | but that wasn't obvious in graphite either | 14:47 |
jrosser_ | oh well i think i am seeing things | 14:51 |
jrosser_ | this is definitely there: stats.zuul.tenant.openstack.pipeline.check.project.opendev_org.openstack_openstack-ansible.master.job.openstack-ansible-deploy-aio_lxc-centos-8-stream.POST_FAILURE | 14:52 |
*** dviroel is now known as dviroel|afk|lunch | 14:52 | |
clarkb | if build.result in ['SUCCESS', 'FAILURE'] and build.start_time: | 14:56 |
clarkb | that condition is specifically for job timing which is what your existing graphs look at | 14:56 |
clarkb | but ya the counter (rather than timer) doesn't seem to check that so you should have the counters available | 14:57 |
clarkb | I've got meetings off and on through the day today. But after this next one I can look at adding POST_FAILURE into that condition so that we get POST_FAILURE timing too. If you want to look at doing that, the thing to check is that we know POST_FAILURE at that point in time (I'm unsure of that right now) | 14:58 |
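(Editor's note: a minimal sketch of the change clarkb describes, assuming the reporter looks roughly like the quoted condition; the statsd client calls and names here are hypothetical stand-ins, not Zuul's actual reporter code.)

```python
# Hypothetical reporter fragment: counters go out for every result (which is
# why the stats_counts.*.POST_FAILURE series already exists), but the timer
# is gated on the result list, so widening that list adds POST_FAILURE timing.
REPORTED_TIMINGS = ['SUCCESS', 'FAILURE', 'POST_FAILURE']

def report_build(statsd, key, build):
    statsd.incr(key + '.' + build.result)
    if build.result in REPORTED_TIMINGS and build.start_time and build.end_time:
        elapsed_ms = int((build.end_time - build.start_time) * 1000)
        statsd.timing(key + '.' + build.result, elapsed_ms)
```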
jrosser_ | clarkb: what i need might already be there, the lower part of my grafana dashboard is using stats_counts.<....> | 15:21 |
opendevreview | Jonathan Rosser proposed openstack/project-config master: Update openstack-ansible grafana dashboard job status rates https://review.opendev.org/c/openstack/project-config/+/847988 | 15:33 |
*** rlandy is now known as rlandy|afk | 15:59 | |
*** marios is now known as marios|out | 16:00 | |
*** dviroel|afk|lunch is now known as dviroel | 16:23 | |
*** rlandy|afk is now known as rlandy | 17:11 | |
TheJulia | clarkb: you can reclaim that node, I've been able to reproduce the issue as of this morning and have a possible fix for the issue we're encountering | 17:13 |
clarkb | TheJulia: thanks! I'll get to it after this meeting | 17:13 |
*** jpena is now known as jpena|off | 17:28 | |
*** rlandy is now known as rlandy|afk | 18:00 | |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Use internal dhcp client for centos 9-stream and beyond https://review.opendev.org/c/openstack/diskimage-builder/+/848017 | 18:08 |
clarkb | I need to take a break before my next meeting, but wanted to make sure I didn't forget to ask if we followed up on those zuul deployment errors from yesterday. Is that something that needs further debugging? | 18:08 |
*** akahat|ruck is now known as akahat|out | 18:11 | |
fungi | i haven't looked deeper yet, no. as far as i got was that running the job in the deploy pipeline was consistently returning a connection refused when trying to run smart-reconfigure, while running in hourly did not. though notable difference is that deploy is happening after a configuration change merges, so maybe that's causing the scheduler's command socket to refuse connections for some | 18:14 |
fungi | period of time? | 18:14 |
clarkb | ya I think the thing to try and determine is if the connection error is docker-compose/docker -> dockerd socket or zuul-admin to zuuld socket | 18:15 |
clarkb | and take it from there. | 18:16 |
fungi | clarkb: i think the latter, since the traceback raised is inside zuul.cmd.sendCommand() when it calls s.connect(command_socket) | 18:21 |
fungi | maybe i should look at the zuul scheduler log from around that time | 18:27 |
fungi | could another smart-reconfigure shortly before cause the scheduler to temporarily refuse connections on that socket? | 18:29 |
clarkb | that would surprise me. It is just a unix socket iirc and it should just always work? corvus ^ fyi | 18:42 |
clarkb | fungi: was there ever a paste of the traceback? | 18:57 |
fungi | no, but i can make one | 19:00 |
fungi | clarkb: https://paste.opendev.org/show/bDEibgHh5cplEMAk9y8S/ | 19:03 |
corvus | it shouldn't ever stop listening | 19:04 |
corvus | and it queues commands | 19:04 |
clarkb | I wonder if the path is wrong for some reason | 19:04 |
fungi | yeah, maybe the difference is in the hourly builds it doesn't bother to call smart-reconfigure at all | 19:06 |
clarkb | I do think that we only need the one scheduler to have the command run against it | 19:07 |
clarkb | so that deploy actually did work I think | 19:07 |
fungi | also odd that it's failing on 01 but succeeding on 02 | 19:07 |
fungi | i would expect both to use the same paths | 19:08 |
corvus | oh it only fails on 01 | 19:08 |
fungi | yeah | 19:08 |
corvus | maybe something about the socket inode? | 19:09 |
fungi | also the failures for deploy are recent and not 100%: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul&pipeline=deploy&skip=0 | 19:09 |
corvus | or possibly the handler thread is stuck; maybe a thread dump would help there | 19:12 |
corvus | `zgrep "CommandSocket" /var/log/zuul/debug.log.*` says nothing interesting since this start. i think sigusr2 is the next step, but i'll leave that to someone else due to current commitments | 19:16 |
fungi | yeah, i can try to find time to do it later today when not in meetings | 19:20 |
corvus | i forgot to mention: that grep showed that commands have run on previous runs of the scheduler on that host; so it's less likely to be a system issue, and more of an issue with the current scheduler process | 19:26 |
fungi | makes sense, yes | 19:27 |
*** dasm is now known as dasm|afk | 19:34 | |
ianw | jrosser_ / clarkb: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/playbooks/grafana/files/screenshot.py#L44 is where we wait before taking the screenshot | 19:35 |
ianw | what would be perfect is something we could poll that says "I'm done" but I don't think there is such a thing with the asynchronous/endless-scrolling magic react app that grafana is | 19:36 |
ianw | one thing i had in an earlier revision was a dict graph-name -> config options, where we could set things like the height of the page to snapshot and a longer (or even shorter) timeout | 19:44 |
ianw | i removed that for general simplicity. it could be inline, a config file, etc. etc. if we require it though | 19:44 |
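(Editor's note: a hypothetical sketch of the per-dashboard override dict ianw mentions, not the actual screenshot.py; the selenium usage, the dashboard name, and the default height/delay values are assumptions.)

```python
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Per-dashboard overrides; anything not listed falls back to the defaults.
OVERRIDES = {
    'nodepool-dib-status': {'height': 4000, 'delay': 20},  # hypothetical entry
}

def screenshot(name, url, out_path):
    cfg = {'height': 2000, 'delay': 10, **OVERRIDES.get(name, {})}
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Firefox(options=options)
    try:
        driver.set_window_size(1920, cfg['height'])
        driver.get(url)
        # No reliable "rendering finished" signal to poll in the Grafana UI,
        # so wait a fixed delay before capturing.
        time.sleep(cfg['delay'])
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
```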
corvus | jrosser_: clarkb comments on https://review.opendev.org/847780 | 20:03 |
clarkb | corvus: yup I was going to leave similar comments one thought is to use ovh_bhs in the selection then when we lookup the value do a string join with _opendev_cloud_ | 20:04 |
clarkb | I'll leave a note about that once the meeting is done | 20:04 |
fungi | i need to work on dinner, and then get to some yardwork i've been avoiding, but can start digging deeper on the zuul01 socket situation after all that's knocked out | 20:06 |
corvus | clarkb: ++ | 20:07 |
jrosser_ | i prototyped all of that before pushing the patch | 20:08 |
jrosser_ | it is only dealing in strings which happen to be the variable names, until the point the lookup converts it to the contents of that variable | 20:09 |
clarkb | jrosser_: the issue is that it isn't super obvious due to the ways that ansible sometimes automatically magically wraps things in {{ }}. Considering the importance of those secrets I think we should be defensive here | 20:09 |
clarkb | even if that behavior is true for ansible 2.9 or 5 it may not be true when 6 happens | 20:10 |
jrosser_ | i was considering putting them in single quotes to make it more obvious, then they would definitely be strings | 20:10 |
clarkb | I would just avoid using the var name entirely. Use a subset of the var name that is still uniquely identifiable and then construct the var name from that when we need it | 20:10 |
clarkb | anyway I posted all that on the change. | 20:10 |
clarkb | I need to go eat lunch now. Back in a bit | 20:11 |
clarkb | jrosser_: also not sure if you followed our meeting but what we realized is your change can help record where we are uploading to while we continue to do a single upload pass | 20:11 |
ianw | clarkb: https://review.opendev.org/c/openstack/project-config/+/847870 was another quick one from the grafyaml job layout yesterday too, to only run the buildset-registry where required | 20:11 |
clarkb | just having that additional info would be useful even if we never use the change to do multiple upload passes | 20:11 |
jrosser_ | clarkb: i'm not sure where the meetings happen | 20:13 |
corvus | jrosser_: clarkb i updated my comments, thanks. i admit, i was fooled there. i don't know if quotes would have helped. i'm ambivalent about whether we should use it as-is, or go with clarkb's idea. | 20:13 |
corvus | so +2 in spirit from me, whether we go with it as-is or clarkb's approach. but regardless, it should be a base-test change first | 20:14 |
fungi | jrosser_: 19:00 utc tuesdays in #opendev-meeting | 20:14 |
jrosser_ | corvus: we can certainly take opendev_cloud_ off the front of all those strings-that-look-like-vars, then it would be more obvious what's going on? | 20:14 |
fungi | which in theory also doubles as our dedicated channel for service maintenance activities, though we've rarely used it for that | 20:15 |
corvus | https://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/jobs.yaml#L5-L23 is the info about base-test (and the procedure applies to roles too; there's a base-test/post-logs.yaml for that) | 20:15 |
corvus | jrosser_: probably, but also, sometimes people (me) are just wrong and there's nothing you can do except correct them (me). so if folks like that idea, i'm fine with it. but i don't personally want to push the point. i came with extra baggage because i know how the current system works :) | 20:18 |
Clark[m] | The reason I'm wary is because Ansible magically adds {{ }} in places and then var names become var contents and it is rarely clear to me when it does that | 20:19 |
jrosser_ | i'll adjust it to make it clearer, that's always a better result | 20:19 |
jrosser_ | just a mo.... | 20:19 |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 20:20 |
ianw | 'opendev_cloud_' ~ _swift_provider_name -- is ~ better than + ? i don't think i've ever seen that before | 20:36 |
ianw | ~ : Converts all operands into strings and concatenates them. | 20:41 |
ianw | huh, TIL | 20:41 |
jrosser_ | ianw: https://witchychant.com/jinja-faq-concatenation/ | 20:45 |
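(Editor's note: a small standalone illustration of the `~` operator discussed above, run through the jinja2 library directly rather than via Ansible.)

```python
import jinja2

env = jinja2.Environment()

# '~' coerces both operands to strings before concatenating...
print(env.from_string("{{ 'opendev_cloud_' ~ name }}").render(name='ovh_bhs'))
# -> opendev_cloud_ovh_bhs

# ...whereas '+' follows Python semantics, so mixing a string and an int fails.
print(env.from_string("{{ 'build_' ~ num }}").render(num=42))   # -> build_42
# env.from_string("{{ 'build_' + num }}").render(num=42)        # TypeError
```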
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task for base-test https://review.opendev.org/c/opendev/base-jobs/+/848027 | 20:47 |
jrosser_ | right - i am out for the day.... thanks for the help everyone | 20:48 |
clarkb | thank you! | 20:55 |
opendevreview | Merged opendev/grafyaml master: Test with project-config graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 20:55 |
clarkb | ianw: looking at the xfilesfactor change, did min become lower and max upper? | 21:04 |
clarkb | judging by the rest of the content there, that appears to be the case. | 21:04 |
ianw | clarkb: i've taken that from https://github.com/statsd/statsd/blob/master/docs/graphite.md#storage-aggregation | 21:04 |
clarkb | I've gone ahead and approved the change | 21:04 |
clarkb | thanks | 21:05 |
ianw | thanks, i can run a script to convert the on-disk after we have that merged and applied | 21:06 |
clarkb | ianw: for https://review.opendev.org/c/opendev/system-config/+/847700/6/playbooks/zuul/run-base.yaml is that git config update happening early enough for the other usages? I think it is happening just before testinfra runs which is well after we try to install ansible from source | 21:07 |
clarkb | ianw: I think the error is still present on the devel job with that change too | 21:08 |
ianw | doh i think you're right | 21:09 |
clarkb | I think that task needs to be moved above the run base.yaml play. I'll leave a note | 21:09 |
ianw | i had it working in ps5 and seem to have unfixed it with ps6 - https://zuul.opendev.org/t/openstack/build/41d0e4cd94774fb9b7806f4cfac3c109 | 21:09 |
ianw | yeah, i "simplified" it | 21:10 |
ianw | thanks, yeah need to go back over that one | 21:10 |
ianw | since we're generally in favour i might push on the venv angle a bit | 21:11 |
clarkb | this is unexpected: x/vmware-nsx* are actually active repos. salvorlando appears to be maintaining them | 21:14 |
clarkb | maybe I need to send email to them directly about the queue thing | 21:15 |
*** dviroel is now known as dviroel|out | 21:19 | |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove windmill from zuul tenant config https://review.opendev.org/c/openstack/project-config/+/848033 | 21:22 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove x/neutron-classifier from Zuul tenant config https://review.opendev.org/c/openstack/project-config/+/848034 | 21:22 |
clarkb | That is the first set of zuul tenant config cleanups. I'll work on emailing salvorlando next | 21:22 |
opendevreview | Merged opendev/system-config master: graphite: fix xFilesFactor https://review.opendev.org/c/opendev/system-config/+/847876 | 21:29 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/847035 and children are the gerrit image and CI cleanups post upgrade if you have time for them | 21:38 |
fungi | might be a good diversion as i catch my breath on yardwork breaks, sure | 21:46 |
clarkb | thanks! | 21:48 |
clarkb | The gitea 1.17.0 release is looking close https://github.com/go-gitea/gitea/milestone/105 I might need to go and properly read the release notes and get the 1.17.0 change into shape (though it passes CI so getting into shape may just be declaring it ready) | 22:04 |
ianw | oh hrm, so to reset the xFilesFactor on a .wsp file you also have to set the retentions | 22:09 |
*** rlandy|afk is now known as rlandy|out | 22:09 | |
ianw | that makes it a bit harder as we have different retentions | 22:09 |
clarkb | fungi: corvus: I've taken a quick look at zuul01 (but not run sigusr2) and things I notice are that the socket is present on the fs and owned by zuul:zuul which the smart reconfigure command should also run as. lsof also shows that the socket is opened by the scheduler process | 22:20 |
corvus | clarkb: yeah i did that too, sorry if i didn't mention that | 22:22 |
corvus | clarkb: but i agree with your findings! :) | 22:22 |
clarkb | testing locally if I try to open a socket which doesn't exist I get errno 2 No such file or directory | 22:23 |
clarkb | that implies to me that socket was present but nothing was attached to the other end? | 22:23 |
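(Editor's note: a quick reproduction of the distinction clarkb is drawing, using throwaway paths; errno 2 means the socket path is missing, errno 111 means the path exists but nothing is accepting on it.)

```python
import os
import socket
import tempfile

# Path does not exist at all -> errno 2 (ENOENT)
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s.connect('/tmp/no-such-scheduler.socket')
except OSError as e:
    print(e.errno)  # 2

# Socket file exists on disk but its listener is gone -> errno 111 (ECONNREFUSED)
path = os.path.join(tempfile.mkdtemp(), 'stale.socket')
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(path)   # creates the file on the filesystem
listener.close()      # the file stays behind with nothing attached to it
c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    c.connect(path)
except ConnectionRefusedError as e:
    print(e.errno)  # 111
```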
opendevreview | Merged openstack/diskimage-builder master: Fix BLS entries for /boot partitions https://review.opendev.org/c/openstack/diskimage-builder/+/846838 | 22:30 |
ianw | https://paste.opendev.org/show/b0VXcikOvBQOYvUGPad5/ is what i plan to use to update the .wsp files to the new xFilesFactor if anyone has comments | 22:33 |
clarkb | ianw: file in this case is a dir path and I guess some other program will read the SKIP and FIX inputs and apply the info? | 22:35 |
clarkb | if I've read that correctly then I think that looks sane | 22:35 |
ianw | oh sorry yeah the "FIX" path will call the actual whisper update with the args | 22:35 |
ianw | i'm just doing a no-op test run to graphite02:/tmp/fix.txt to make sure it looks sane and has the right retentions for the right things | 22:36 |
clarkb | ++ | 22:36 |
fungi | clarkb: corvus: some sort of version command for the command socket would be nice, really anything to act as a proper no-op. but the norepl command seems to be sufficient to reproduce | 22:43 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul-scheduler norepl | 22:43 |
fungi | on zuul02 that produces no output and exits 0 | 22:43 |
corvus | fungi: ++ | 22:43 |
fungi | on zuul01 that exits 1 and emits: | 22:43 |
fungi | ConnectionRefusedError: [Errno 111] Connection refused | 22:43 |
fungi | (also the same traceback observed in the ansible log) | 22:44 |
corvus | fwiw, i have also tried running commands manually. sorry if that wasn't clear too. | 22:44 |
fungi | ahh, good. at least i'm on the right track | 22:44 |
fungi | wanted to make sure things were still in the same "odd" state before trying to get a thread dump | 22:45 |
fungi | from what i can see, the zuul-scheduler service was last fully restarted at 2022-06-17 19:00 utc | 22:46 |
fungi | on zuul01 | 22:46 |
fungi | that's before the deploy errors started | 22:46 |
fungi | one up-side to the gearmanectomy is there's no longer a second scheduler process fork, so it's no longer complicated to identify the scheduler's pid | 22:48 |
ianw | oh doh there's actually a whisper-set-xfilesfactor | 23:16 |
ianw | that makes it easy, and no need to worry about resetting things with whisper-resize | 23:17 |
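(Editor's note: a rough sketch of the kind of bulk update being discussed, using the whisper Python module instead of the CLI script; the storage path and target value are assumptions, and it presumes a whisper release new enough to ship setXFilesFactor.)

```python
import os
import whisper  # assumes a whisper release that provides setXFilesFactor

WSP_ROOT = '/opt/graphite/storage/whisper'   # hypothetical storage path
NEW_XFF = 0.0                                # hypothetical target value

for dirpath, _dirs, files in os.walk(WSP_ROOT):
    for name in files:
        if not name.endswith('.wsp'):
            continue
        path = os.path.join(dirpath, name)
        old = whisper.info(path)['xFilesFactor']
        if old != NEW_XFF:
            # Rewrites only the file header; retentions and datapoints are
            # left alone, so no whisper-resize is needed.
            whisper.setXFilesFactor(path, NEW_XFF)
            print('%s: %s -> %s' % (path, old, NEW_XFF))
```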
fungi | that was the x-files and x-factor cross-over episode? (nevermind, those series did not overlap at all timewise) | 23:18 |
ianw | it's running now. i believe it flock()'s itself. logging to /tmp/fix.txt and will save that somewhere later | 23:25 |
fungi | okay, two stack dumps plus yappi stats captured to zuul01.opendev.org:/home/fungi/stack_dumps_2022-06-28 | 23:27 |
ianw | the x-files did occasionally like to go do experimental episodes | 23:27 |
fungi | occasionally? | 23:27 |
fungi | and yeah, decades later i'm finally re-watching it all start to finish, near the end of season 7 at this point | 23:27 |
fungi | christine is starting to regret agreeing to watch it with me | 23:28 |
ianw | yeah i'm watching it with kids; the great job they've done making it 4k makes the difference | 23:30 |
ianw | like the b&w monster of the week episode, etc. | 23:30 |
ianw | apparently it was all done on film making the hd remaster so good. except weirdly for the stock footage bits where they set the scene with a picture of a building or road or whatever. apparently they couldn't get the rights or something to rework those bits | 23:32 |
fungi | i don't have a 4k tv, but i did end up buying copies of it all on blu-ray | 23:32 |
fungi | does definitely look nice | 23:32 |
fungi | looking back, it does seem like the writers decided to stop taking things nearly so seriously after the first movie | 23:34 |
ianw | in the great tradition of american tv shows, it definitely went on a bit long | 23:34 |
fungi | switched from creepy bizarre to creepy silly | 23:34 |
fungi | so the two threads appearing in both dumps which have anything related to commandsocket.py are "Thread: 140554885134080 command d: True" and "Thread: 140555438757632 Thread-17 d: True" | 23:38 |
fungi | not entirely sure what i'm looking for | 23:38 |
fungi | s/entirely/at all/ | 23:38 |
fungi | s/at all/even remotely/ | 23:39 |
corvus | 2022-06-28 22:53:12,317 DEBUG zuul.stack_dump: File "/usr/local/lib/python3.8/site-packages/zuul/lib/commandsocket.py", line 108, in _socketListener | 23:40 |
corvus | 2022-06-28 22:53:12,317 DEBUG zuul.stack_dump: s, addr = self.socket.accept() | 23:40 |
corvus | that means it's waiting for connections as it should be | 23:41 |
corvus | so it's not stuck processing an existing connection; and it hasn't crashed and exited | 23:41 |
corvus | that takes out most of my theories of how it could be borked | 23:41 |
corvus | fungi: don't forget to sigusr2 again to turn off the profiler if you haven't already (friendly reminder) | 23:43 |
corvus | (i think you did because you said 2 stack dumps; but just over-verbalizing) | 23:43 |
fungi | yep, i did, but thanks for the reminder! | 23:45 |
fungi | that file should contain the output from both stack dumps plus the yappi stats. i trimmed out everything else | 23:46 |
fungi | corvus: one thing worth noting, the last modified time on /var/lib/zuul/scheduler.socket is about 25 hours after the most recent server restart. is it possible something replaced that fifo? | 23:50 |
fungi | wondering if the open fd for the process is a completely different inode than we're trying to connect to | 23:51 |
corvus | fungi: aha! yes! now that you mention that, i think i may have clobbered it in some testing i did after the last restart | 23:51 |
fungi | oh neat! | 23:51 |
fungi | so if our weekly restart playbook were working over the weekend, we'd probably never have noticed | 23:52 |
corvus | i had a second scheduler running, and it's entirely plausible that doing that clobbered the socket. i didn't realize that was possible, otherwise i would have restarted to clean up. i'm sorry! | 23:52 |
fungi | makes sense. no need to apologize! this just became a lot less baffling | 23:52 |
corvus | i can't say for sure that's what happened, but i think that's the new occam's razor | 23:52 |
fungi | i should have statted the fifo sooner | 23:53 |
fungi | anyway, it's moved this into the category of "things i'll be more worried about if it happens again" | 23:54 |
fungi | i was really just concerned by the lack of plausible explanation | 23:55 |
corvus | ++ | 23:57 |