Tuesday, 2022-06-28

opendevreviewMerged openstack/project-config master: project-config-grafana: add opendev-buildset-registry  https://review.opendev.org/c/openstack/project-config/+/84786600:53
fungiugh, seeing infra-prod-service-zuul failures in deploy. will take a closer look in a moment00:54
opendevreviewIan Wienand proposed openstack/project-config master: Revert "project-config-grafana: add opendev-buildset-registry"  https://review.opendev.org/c/openstack/project-config/+/84786800:54
ianwfyi ^ is abandoned, reverted in the wrong repo00:56
opendevreviewIan Wienand proposed openstack/project-config master: project-config-grafana: filter opendev-buildset-registry  https://review.opendev.org/c/openstack/project-config/+/84787001:02
fungiRUNNING HANDLER [zuul-scheduler : Reload Zuul Scheduler]01:31
fungifatal: [zuul01.opendev.org]: FAILED!01:32
fungiConnectionRefusedError: [Errno 111] Connection refused01:32
fungihttps://zuul.opendev.org/components says both schedulers are running01:33
fungilooking at https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul&skip=0 the hourly jobs are succeeding but the deploy jobs are failing01:37
ianwfungi: would that be something to do with running as root and docker-compose01:38
ianw? seems like an error maybe from not being able to talk to the docker socket to stop the container01:39
fungiboth those failures raised the same ConnectionRefusedError exception01:41
fungido we run it differently in deploy than in hourly?01:41
fungiit's a cmd task running `docker-compose exec -T scheduler zuul-scheduler smart-reconfigure`01:42
ianwnot really i don't think, it's an odd one02:10
opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/84787202:28
Clark[m]fungi: ianw: it's a local socket connection iirc02:35
Clark[m]Oh unless the error is in docker compose itself02:36
Clark[m]But both are local sockets02:36
Clark[m]Hourly and deploy should run the same playbook02:39
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Fix BLS entries for /boot partitions  https://review.opendev.org/c/openstack/diskimage-builder/+/84683802:41
*** rlandy is now known as rlandy|out02:43
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787605:12
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787605:41
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787606:04
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787606:28
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787607:03
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787607:29
*** jpena|off is now known as jpena07:42
opendevreviewIan Wienand proposed opendev/system-config master: [wip] fix xfilesfactor in graphite  https://review.opendev.org/c/opendev/system-config/+/84787607:50
*** undefined_ is now known as Guest350107:59
opendevreviewIan Wienand proposed opendev/system-config master: graphite: fix xFilesFactor  https://review.opendev.org/c/opendev/system-config/+/84787608:47
ianwfrickler: ^ I am pretty sure this is the reason for the "missing" stats on *some* of the dib builds on that status page.  it only affects .wsp files created since I migrated graphite to ansible, which is why the older builds work.  basically infrequent datapoints get nulled out as the data ages08:49
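[editor's note] ianw's explanation of why infrequent datapoints get nulled out can be sketched with whisper's downsampling rule: when a retention bucket is aggregated, the result becomes null if fewer than xFilesFactor of the higher-resolution slots held data. This is a minimal illustrative sketch (using sum as the aggregation method), not whisper's actual code:

```python
def aggregate(points, x_files_factor, method=sum):
    """Downsample one retention bucket the way whisper does: if fewer
    than x_files_factor of the higher-resolution slots are non-null,
    the aggregated point becomes null (None)."""
    known = [p for p in points if p is not None]
    if not known:
        return None
    if len(known) / len(points) < x_files_factor:
        return None
    return method(known)

# an infrequent stat: 1 of 6 slots filled in the aggregation window
sparse = [5, None, None, None, None, None]
print(aggregate(sparse, 0.5))  # None -- nulled out as the data ages
print(aggregate(sparse, 0.0))  # 5    -- survives with xFilesFactor 0
```

With the default factor of 0.5, sparse dib-build stats vanish once they age past the first retention level, which matches the "missing" graphs described above.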
fricklerianw: interesting, will review later09:09
*** rlandy|out is now known as rlandy09:40
*** dviroel|out is now known as dviroel11:37
fungidstufft is working on a spec for a new pypi upload workflow: https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/1687911:56
fungiof particular interest is the ability to upload archives but not commit the release until they're in place and even perhaps verified/tested11:56
*** Guest3501 is now known as rcastillo12:59
*** dasm|off is now known as dasm13:04
jrosser_is there anything more specific than SUCCESS/FAILURE that gets into graphite for job results?13:15
jrosser_i was thinking to add these POST_FAILURES to the osa grafana dashboard but i'm not seeing that represented in graphite13:16
opendevreviewJonathan Rosser proposed openstack/project-config master: Update openstack-ansible grafana dashboard  https://review.opendev.org/c/openstack/project-config/+/84797313:31
rlandyjrosser_: POST_FAILURES are looking much better on our side13:42
jrosser_rlandy: yes they are for me too - did you merge anything to reduce the quantity of log upload?13:43
rlandyjrosser_: we did ...13:44
rlandybut assumed that some underlying change was pushed to help as well?13:44
jrosser_no i don't think so - i believe it is still unclear what the root cause is13:44
fungisince we don't know all the contributing factors, it's entirely possible there is also some transient variable involved, like performance degradation in a swift provider or network problems between our executors and the swift endpoints or load on executors themselves or...13:45
rlandyright - what changed all of a sudden is not clear to us - but what does show is ...13:47
rlandythere is some volume of logs that sends us into the danger zone13:47
rlandythe test hit the most was the multinode updates test13:47
rlandywhich makes sense13:48
rlandymultinodes, multiple installs13:48
fungiright, the fact that almost all of these cases are for tripleo and openstack-ansible changes means that other projects are doing something different which causes them not to be hit as hard (current theory is that the impact is influenced by the volume of logs being uploaded, either count or size)13:48
*** dviroel is now known as dviroel|biab13:52
jrosser_screenshots in the grafyaml update jobs is awesome :)13:54
*** dviroel|biab is now known as dviroel14:21
clarkbjrosser_: that is all credit to ianw14:45
jrosser_it's really cool14:45
jrosser_and the job runs really quick too14:45
clarkbas for statuses other than success or failure I thought zuul reported them all, but maybe it doesn't14:46
jrosser_i guess it's not hooked up to the actual data source though14:46
jrosser_yeah, i was looking at my grafana dashboard and that seems set up to deal with TIMEOUT14:47
jrosser_but that wasnt obvious in graphite either14:47
jrosser_oh well i think i am seeing things14:51
jrosser_this is definitely there stats.zuul.tenant.openstack.pipeline.check.project.opendev_org.openstack_openstack-ansible.master.job.openstack-ansible-deploy-aio_lxc-centos-8-stream.POST_FAILURE14:52
*** dviroel is now known as dviroel|afk|lunch14:52
clarkbif build.result in ['SUCCESS', 'FAILURE'] and build.start_time:14:56
clarkbthat condition is specifically for job timing which is what your existing graphs look at14:56
clarkbbut ya the counter (rather than timer) doesn't seem to check that so you should have the counters available14:57
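[editor's note] The condition clarkb quotes can be illustrated with a simplified sketch (not Zuul's actual reporter code; the stat key format loosely mimics the path jrosser pasted): counters are bumped for every build result, while timers are gated on SUCCESS/FAILURE plus a start time.

```python
# Simplified sketch of the reporting logic discussed above: counters
# fire for every result, so POST_FAILURE shows up under stats_counts,
# but timing data is only emitted for SUCCESS/FAILURE builds.
def stats_for_build(result, start_time, end_time,
                    key="zuul.tenant.x.pipeline.check.job.y"):
    emitted = []
    # counter: always incremented, regardless of result
    emitted.append(("counter", f"{key}.{result}"))
    # timer: gated exactly like the condition quoted above
    if result in ["SUCCESS", "FAILURE"] and start_time:
        emitted.append(("timer", f"{key}.{result}", end_time - start_time))
    return emitted
```

A POST_FAILURE build yields only the counter entry, matching the observation that the stats_counts key exists in graphite while a timing series does not.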
clarkbI've got meetings off and on through the day today. But after this next one I can look at adding POST_FAILURE into that condition so that we get POST_FAILURE timing too. If you want to look at doing that the thing to check is that we know POST_FAILURE at that point in time (I'm unsure of that right now)14:58
jrosser_clarkb: what i need might already be there, the lower part of my grafana dashboard is using stats_counts.<....>15:21
opendevreviewJonathan Rosser proposed openstack/project-config master: Update openstack-ansible grafana dashboard job status rates  https://review.opendev.org/c/openstack/project-config/+/84798815:33
*** rlandy is now known as rlandy|afk15:59
*** marios is now known as marios|out16:00
*** dviroel|afk|lunch is now known as dviroel16:23
*** rlandy|afk is now known as rlandy17:11
TheJuliaclarkb: you can reclaim that node, I've been able to reproduce the issue as of the morning and a possible fix for the issue we're encountering17:13
clarkbTheJulia: thanks! I'll get to it after this meeting17:13
*** jpena is now known as jpena|off17:28
*** rlandy is now known as rlandy|afk18:00
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: Use internal dhcp client for centos 9-stream and beyond  https://review.opendev.org/c/openstack/diskimage-builder/+/84801718:08
clarkbI need to take a break before my next meeting, but wanted to make sure I didn't forget to ask if we followed up on those zuul deployment errors from yesterday. Is that something that needs further debugging?18:08
*** akahat|ruck is now known as akahat|out18:11
fungii haven't looked deeper yet, no. as far as i got was that running the job in the deploy pipeline was consistently returning a connection refused when trying to run smart-reconfigure, while running in hourly did not. though notable difference is that deploy is happening after a configuration change merges, so maybe that's causing the scheduler's command socket to refuse connections for some18:14
fungiperiod of time?18:14
clarkbya I think the thing to try and determine is if the connection error is docker-compose/docker -> dockerd socket or zuul-admin to zuuld socket18:15
clarkband take it from there.18:16
fungiclarkb: i think the latter, since the traceback raised is inside zuul.cmd.sendCommand() when it calls s.connect(command_socket)18:21
fungimaybe i should look at the zuul scheduler log from around that time18:27
fungicould another smart-reconfigure shortly before cause the scheduler to temporarily refuse connections on that socket?18:29
clarkbthat would surprise me. It is just a unix socket iirc and it should just always work? corvus ^ fyi18:42
clarkbfungi: was there ever a paste of the traceback?18:57
fungino, but i can make one19:00
fungiclarkb: https://paste.opendev.org/show/bDEibgHh5cplEMAk9y8S/19:03
corvusit shouldn't ever stop listening19:04
corvusand it queues commands19:04
clarkbI wonder if the path is wrong for some reason19:04
fungiyeah, maybe the difference is in the hourly builds it doesn't bother to call smart-reconfigure at all19:06
clarkbI do think that we only need the one scheduler to have the command run against it19:07
clarkbso that deploy actually did work I think19:07
fungialso odd that it's failing on 01 but succeeding on 0219:07
fungii would expect both to use the same paths19:08
corvusoh it only fails on 0119:08
corvusmaybe something about the socket inode?19:09
fungialso the failures for deploy are recent and not 100%: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul&pipeline=deploy&skip=019:09
corvusor possibly the handler thread is stuck; maybe a thread dump would help there19:12
corvus`zgrep "CommandSocket" /var/log/zuul/debug.log.*` says nothing interesting since this start.  i think sigusr2 is the next step, but i'll leave that to someone else due to current commitments19:16
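[editor's note] Zuul installs its own SIGUSR2 handler that writes thread stacks to its debug log (and toggles the yappi profiler, as corvus notes later). A generic stdlib approximation of that kind of hook, using faulthandler, looks like this; it is a sketch, not Zuul's implementation:

```python
import faulthandler
import signal
import sys

# Generic approximation of a SIGUSR2 thread-dump hook.  Zuul's real
# handler writes to its debug log under the zuul.stack_dump logger
# and also toggles yappi profiling; this just dumps all thread stacks.
def install_stack_dump(fileobj=sys.stderr):
    def _dump(signum, frame):
        faulthandler.dump_traceback(file=fileobj, all_threads=True)
    signal.signal(signal.SIGUSR2, _dump)
```

With this installed, `kill -USR2 <pid>` makes the process print every thread's current stack, which is how the "waiting in socket.accept()" observation below was obtained.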
fungiyeah, i can try to find time to do it later today when not in meetings19:20
corvusi forgot to mention: that grep showed that commands have run on previous runs of the scheduler on that host; so it's less likely to be a system issue, and more of an issue with the current scheduler process19:26
fungimakes sense, yes19:27
*** dasm is now known as dasm|afk19:34
ianwjrosser_ / clarkb: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/playbooks/grafana/files/screenshot.py#L44 is where we wait before taking the screenshot19:35
ianwwhat would be perfect is something we could poll that says "I'm done" but I don't think there is such a thing with the asynchronous/endless-scrolling magic react app that grafana is19:36
ianwone thing i had in an earlier revision was a dict graph-name -> config options, where we could set things like the height of the page to snapshot and a longer (even shorter) timeout19:44
ianwi removed that for general simplicity.  it could be inline, a config file, etc. etc. if we require it though19:44
corvusjrosser_: clarkb comments on https://review.opendev.org/84778020:03
clarkbcorvus: yup I was going to leave similar comments one thought is to use ovh_bhs in the selection then when we lookup the value do a string join with _opendev_cloud_20:04
clarkbI'll leave a note about that once the meeting is done20:04
fungii need to work on dinner, and then get to some yardwork i've been avoiding, but can start digging deeper on the zuul01 socket situation after all that's knocked out20:06
corvusclarkb: ++20:07
jrosser_i prototyped all of that before pushing the patch20:08
jrosser_it is only dealing in strings which happen to be the variable names, until the point the lookup converts it to the contents of that variable20:09
clarkbjrosser_: the issue is that isn't super obvious due to the ways that ansible sometimes automatically magically wraps things in {{ }}. Considering the importance of those secrets I think we should be defensive here20:09
clarkbeven if that behavior is true for ansible 2.9 or 5 it may not be true when 6 happens20:10
jrosser_i was considering putting them in single quotes to make it more obvious, then they would definitely be strings20:10
clarkbI would just avoid using the var name entirely. Use a subset of the var name that is still uniquely identifiable and then construct the var name from that when we need it20:10
clarkbanyway I posted all that on the change.20:10
clarkbI need to go eat lunch now. Back in a bit20:11
clarkbjrosser_: also not sure if you followed our meeting but what we realized is your change can help record where we are uploading to while we continue to do a single upload pass20:11
ianwclarkb: https://review.opendev.org/c/openstack/project-config/+/847870 was another quick one from the grafyaml job layout yesterday too, to only run the buildset-registry where required20:11
clarkbjust having that additional info would be useful even if we never use the change to do multiple upload passes20:11
jrosser_clarkb: i'm not sure where the meetings happen20:13
corvusjrosser_: clarkb i updated my comments, thanks.  i admit, i was fooled there.  i don't know if quotes would have helped.  i'm ambivalent about whether we should use it as-is, or go with clarkb's idea.20:13
corvusso +2 in spirit from me, whether we go with it as-is or clark.  but regardless, it should be a base-test change first20:14
fungijrosser_: 19:00 utc tuesdays in #opendev-meeting20:14
jrosser_corvus: we can certainly take opendev_cloud_ off the front of all those strings-that-look-like-vars, then it would be more obvious what's going on?20:14
fungiwhich in theory also doubles as our dedicated channel for service maintenance activities, though we've rarely used it for that20:15
corvushttps://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/jobs.yaml#L5-L23 is the info about base-test (and the procedure applies to roles too; there's a base-test/post-logs.yaml for that)20:15
corvusjrosser_: probably, but also, sometimes people (me) are just wrong and there's nothing you can do except correct them (me).  so if folks like that idea, i'm fine with it.  but i don't personally want to push the point.  i came with extra baggage because i know how the current system works :)20:18
Clark[m]The reason I'm wary is because Ansible magically adds {{ }} in places and then var names become var contents and it is rarely clear to me when it does that20:19
jrosser_i'll adjust it to make it clearer, that's always a better result20:19
jrosser_just a mo....20:19
opendevreviewJonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task  https://review.opendev.org/c/opendev/base-jobs/+/84778020:20
ianw'opendev_cloud_' ~ _swift_provider_name -- is ~ better than + ?  i don't think i've ever seen that before20:36
ianw~ : Converts all operands into strings and concatenates them.20:41
ianwhuh, TIL20:41
jrosser_ianw: https://witchychant.com/jinja-faq-concatenation/20:45
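[editor's note] Jinja's `~` operator, which surprised ianw above, stringifies both operands before concatenating, so it is safer than `+` when one side might not be a string. A quick demonstration, assuming the jinja2 library is available:

```python
from jinja2 import Template

# ~ converts both operands to strings and concatenates them,
# which is how the base-jobs change builds the variable name
print(Template("{{ 'opendev_cloud_' ~ name }}").render(name="ovh_bhs"))
# opendev_cloud_ovh_bhs

# unlike +, it coerces non-string operands instead of raising
print(Template("{{ 'attempt_' ~ n }}").render(n=2))
# attempt_2
```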
opendevreviewJonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task for base-test  https://review.opendev.org/c/opendev/base-jobs/+/84802720:47
jrosser_right - i am out for the day.... thanks for the help everyone20:48
clarkbthank you!20:55
opendevreviewMerged opendev/grafyaml master: Test with project-config graphs  https://review.opendev.org/c/opendev/grafyaml/+/84742120:55
clarkbianw: looking at the xfilesfactor change, did min become lower and max upper?21:04
clarkbjudging on the rest of the content there that appears to be the case.21:04
ianwclarkb: i've taken that from https://github.com/statsd/statsd/blob/master/docs/graphite.md#storage-aggregation21:04
clarkbI've gone ahead and approved the change21:04
ianwthanks, i can run a script to convert the on-disk after we have that merged and applied21:06
clarkbianw: for https://review.opendev.org/c/opendev/system-config/+/847700/6/playbooks/zuul/run-base.yaml is that git config update happening early enough for the other usages? I think it is happening just before testinfra runs which is well after we try to install ansible from source21:07
clarkbianw: I think the error is still present on the devel job with that change too21:08
ianwdoh i think you're right21:09
clarkbI think that task needs to be moved above the run base.yaml play I'll leave a notes21:09
ianwi had it working in ps5 and seem to have unfixed it with ps6 - https://zuul.opendev.org/t/openstack/build/41d0e4cd94774fb9b7806f4cfac3c10921:09
ianwyeah, i "simplified" it21:10
ianwthanks, yeah need to go back over that one21:10
ianwsince we're generally in favour i might push on the venv angle a bit21:11
clarkbthis is unexpected: x/vmware-nsx* are actually active repos. salvorlando appears to be maintaining them21:14
clarkbmaybe I need to send email to them directly about the queue thing21:15
*** dviroel is now known as dviroel|out21:19
opendevreviewClark Boylan proposed openstack/project-config master: Remove windmill from zuul tenant config  https://review.opendev.org/c/openstack/project-config/+/84803321:22
opendevreviewClark Boylan proposed openstack/project-config master: Remove x/neutron-classifier from Zuul tenant config  https://review.opendev.org/c/openstack/project-config/+/84803421:22
clarkbThat is the first set of zuul tenant config cleanups. I'll work on emailing salvorlando next21:22
opendevreviewMerged opendev/system-config master: graphite: fix xFilesFactor  https://review.opendev.org/c/opendev/system-config/+/84787621:29
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/847035 and children are the gerrit image and CI cleanups post upgrade if you have time for them21:38
fungimight be a good diversion as i catch my breath on yardwork breaks, sure21:46
clarkbThe gitea 1.17.0 release is looking close https://github.com/go-gitea/gitea/milestone/105 I might need to go and properly read the release notes and get the 1.17.0 change into shape (though it passes CI so getting into shape may just be declaring it ready)22:04
ianwoh hrm, so to reset the xFilesFactor on a .wsp file you also have to set the retentions22:09
*** rlandy|afk is now known as rlandy|out22:09
ianwthat makes it a bit harder as we have different retentions22:09
clarkbfungi: corvus: I've taken a quick look at zuul01 (but not run sigusr2) and things I notice are that the socket is present on the fs and owned by zuul:zuul which the smart reconfigure command should also run as. lsof also shows that the socket is opened by the scheduler process22:20
corvusclarkb: yeah i did that too, sorry if i didn't mention that22:22
corvusclarkb: but i agree with your findings! :)22:22
clarkbtesting locally if I try to open a socket which doesn't exist I get errno 2 No such file or directory22:23
clarkbthat implies to me the socket was present but nothing was attached to the other end?22:23
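[editor's note] clarkb's observation is easy to confirm with the stdlib: connecting to a missing unix socket path raises ENOENT (Errno 2), while a path whose socket file exists but has no live listener raises ECONNREFUSED (Errno 111), the error seen in the deploy failures.

```python
import errno
import os
import socket
import tempfile

def connect_errno(path):
    """Try to connect to a unix socket path; return the errno on failure."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return 0
    except OSError as e:
        return e.errno
    finally:
        s.close()

d = tempfile.mkdtemp()
missing = os.path.join(d, "nope.sock")
stale = os.path.join(d, "stale.sock")

# a socket file whose listener has gone away
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(stale)
srv.close()  # the file remains on disk, but nothing accepts connections

print(connect_errno(missing) == errno.ENOENT)      # True  (Errno 2)
print(connect_errno(stale) == errno.ECONNREFUSED)  # True  (Errno 111)
```

So the Errno 111 in the traceback means the socket file was found; the listener on the other end was the missing piece.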
opendevreviewMerged openstack/diskimage-builder master: Fix BLS entries for /boot partitions  https://review.opendev.org/c/openstack/diskimage-builder/+/84683822:30
ianwhttps://paste.opendev.org/show/b0VXcikOvBQOYvUGPad5/ is what i plan to use to update the .wsp files to the new xFilesFactor if anyone has comments22:33
clarkbianw: file in this case is a dir path and I guess some other program will read the SKIP and FIX inputs and apply the info?22:35
clarkbif I've read that correctly then I think that looks sane22:35
ianwoh sorry yeah the "FIX" path will call the actual whisper update with the args22:35
ianwi'm just doing a no-op test run to graphite02:/tmp/fix.txt to make sure it looks sane and has the right retentions for the right things22:36
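[editor's note] A sketch of the dry-run approach ianw describes: walk the whisper tree and emit a SKIP/FIX line per `.wsp` file, writing the plan to a file for review before applying it. The function name, the 0.0 target factor, and the `current_factor` callback are illustrative assumptions; the real script would read each file's factor via the whisper library and feed the FIX lines to whisper's update tooling.

```python
import os

# Dry-run sketch: walk the whisper tree and emit a SKIP/FIX line per
# .wsp file so the plan can be sanity-checked (e.g. in /tmp/fix.txt)
# before anything is modified.
def plan_xfilesfactor_fix(root, current_factor, target=0.0):
    lines = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            factor = current_factor(path)  # e.g. read via the whisper library
            if factor == target:
                lines.append(f"SKIP {path}")
            else:
                lines.append(f"FIX {path} {target}")
    return lines
```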
fungiclarkb: corvus: some sort of version command for the command socket would be nice, really anything to act as a proper no-op. but the norepl command seems to be sufficient to reproduce22:43
fungisudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul-scheduler norepl22:43
fungion zuul02 that produces no output and exits 022:43
corvusfungi: ++22:43
fungion zuul01 that exits 1 and emits:22:43
fungiConnectionRefusedError: [Errno 111] Connection refused22:43
fungi(also the same traceback observed in the ansible log)22:44
corvusfwiw, i have also tried running commands manually.  sorry if that wasn't clear too.22:44
fungiahh, good. at least i'm on the right track22:44
fungiwanted to make sure things were still in the same "odd" state before trying to get a thread dump22:45
fungifrom what i can see, the zuul-scheduler service was last fully restarted at 2022-06-17 19:00 utc22:46
fungion zuul0122:46
fungithat's before the deploy errors started22:46
fungione up-side to the gearmanectomy is there's no longer a second scheduler process fork, so it's no longer complicated to identify the scheduler's pid22:48
ianwoh doh there's actually a whisper-set-xfilesfactor23:16
ianwthat makes it easy, and no having to worry about resetting things with whisper-resize23:17
fungithat was the x-files and x-factor cross-over episode? (nevermind, those series did not overlap at all timewise)23:18
ianwit's running now.  i believe it flock()'s itself.  logging to /tmp/fix.txt and will save that somewhere later23:25
fungiokay, two stack dumps plus yappi stats captured to zuul01.opendev.org:/home/fungi/stack_dumps_2022-06-2823:27
ianwthe x-files did occasionally like to go do experimental episodes23:27
fungiand yeah, decades later i'm finally re-watching it all start to finish, near the end of season 7 at this point23:27
fungichristine is starting to regret agreeing to watch it with me23:28
ianwyeah i'm watching it with kids; the great job they've done making it 4k makes the difference23:30
ianwlike the b&w monster of the week episode, etc.23:30
ianwapparently it was all done on film making the hd remaster so good.  except weirdly for the stock footage bits where they set the scene with a picture of a building or road or whatever.  apparently they couldn't get the rights or something to rework those bits23:32
fungii don't have a 4k tv, but i did end up buying copies of it all on blu-ray23:32
fungidoes definitely look nice23:32
fungilooking back, it does seem like the writers decided to stop taking things nearly so seriously after the first movie23:34
ianwin the great tradition of american tv shows, it definitely went on a bit long23:34
fungiswitched from creepy bizarre to creepy silly23:34
fungiso the two threads appearing in both dumps which have anything related to commandsocket.py are "Thread: 140554885134080 command d: True" and "Thread: 140555438757632 Thread-17 d: True"23:38
funginot entirely sure what i'm looking for23:38
fungis/entirely/at all/23:38
fungis/at all/even remotely/23:39
corvus2022-06-28 22:53:12,317 DEBUG zuul.stack_dump:     File "/usr/local/lib/python3.8/site-packages/zuul/lib/commandsocket.py", line 108, in _socketListener23:40
corvus2022-06-28 22:53:12,317 DEBUG zuul.stack_dump:       s, addr = self.socket.accept()23:40
corvusthat means it's waiting for connections as it should be23:41
corvusso it's not stuck processing an existing connection; and it hasn't crashed and exited23:41
corvusthat takes out most of my theories of how it could be borked23:41
corvusfungi: don't forget to sigusr2 again to turn off the profiler if you haven't already (friendly reminder)23:43
corvus(i think you did because you said 2 stack dumps; but just over-verbalizing)23:43
fungiyep, i did, but thanks for the reminder!23:45
fungithat file should contain the output from both stack dumps plus the yappi stats. i trimmed out everything else23:46
fungicorvus: one thing worth noting, the last modified time on /var/lib/zuul/scheduler.socket is about 25 hours after the most recent server restart. is it possible something replaced that fifo?23:50
fungiwondering if the open fd for the process is a completely different inode than we're trying to connect to23:51
corvusfungi: aha! yes!  now that you mention that, i think i may have clobbered it in some testing i did after the last restart23:51
fungioh neat!23:51
fungiso if our weekly restart playbook were working over the weekend, we'd probably never have noticed23:52
corvusi had a second scheduler running, and it's entirely plausible that doing that clobbered the socket.  i didn't realize that was possible, otherwise i would have restarted to clean up.  i'm sorry!23:52
fungimakes sense. no need to apologize! this just became a lot less baffling23:52
corvusi can't say for sure that's what happened, but i think that's the new occam's razor23:52
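[editor's note] The clobbered-socket theory reproduces with nothing but the stdlib: binding a second socket to the same path requires unlinking the original file, which detaches the first listener's inode. Once the second process exits, its socket file remains but refuses connections, even though the original listener is still alive on the unlinked inode; this matches both the observed ECONNREFUSED and the socket file's newer mtime.

```python
import os
import socket
import tempfile

def demonstrate_clobber():
    """Reproduce the clobbered-socket failure mode with the stdlib."""
    path = os.path.join(tempfile.mkdtemp(), "scheduler.socket")

    # original scheduler: binds and listens on the path
    original = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    original.bind(path)
    original.listen(1)

    # a second test process clobbers the path: it has to unlink first,
    # which detaches the original listener's inode from the filesystem
    os.unlink(path)
    second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    second.bind(path)
    second.listen(1)
    second.close()  # second process exits; its socket file stays behind

    # the original is still listening, but on an unlinked inode, so
    # clients connecting via the path get ECONNREFUSED
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return "connected"
    except ConnectionRefusedError:
        return "refused"
    finally:
        client.close()
        original.close()

print(demonstrate_clobber())  # refused
```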
fungii should have statted the fifo sooner23:53
fungianyway, it's moved this into the category of "things i'll be more worried about if it happens again"23:54
fungii was really just concerned by the lack of plausible explanation23:55

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!