Tuesday, 2021-04-27

openstackgerritMerged opendev/glean master: Allow disabling DHCP fallback  https://review.opendev.org/c/opendev/glean/+/78150001:21
brinzhang0I cannot submit a patch to gerrit, and when I try to get the HTTP Credentials from self Group it reports a 500 error. Is there anyone who maintains gerrit?01:28
ianwbrinzhang0: yes we can take a look, is this a new account?01:34
brinzhang0it's my account, and I always use it01:35
ianwbrinzhang0: so you are submitting via git review?  could you paste some command output to http://paste.openstack.org/ so we can determine what exactly is failing?01:38
brinzhang0ianw: yeah, I will paste some steps01:38
ianwi don't see your username in any of the recent cleanups so i don't think that's related01:40
brinzhang0I will send it to you in a private channel01:41
ianwi'm seeing errors in the logs consistent with https://bugs.chromium.org/p/gerrit/issues/detail?id=1372601:46
ianwthis IndexWriter is closed etc. etc.01:47
ianwbrinzhang0: so this is a bit of a known problem with gerrit.  we've found restarting it generally helps, i can do that01:48
brinzhang0ianw: ok, if there is need to restart the gerrit, I can wait01:49
ianw#status log restarted gerrit due to inability of user to update account settings, logs consistent with lock errors detailed in https://bugs.chromium.org/p/gerrit/issues/detail?id=1372601:50
openstackstatusianw: finished logging01:50
ianwbrinzhang0: try again now, see how we go01:51
brinzhang0ianw: I remember hitting the same issue last year, and after a gerrit restart I could use it again01:51
brinzhang0ianw: it's ok now ^^01:51
brinzhang0let me try to submit the modified patch01:52
ianwinteresting, i wonder if your account is just lucky, or somehow different01:54
brinzhang0ianw: ah, interesting01:54
brinzhang0now I can submit the patch too01:54
ianwthere's an intel zuul account in a constant failure loop authenticating too02:10
ianwtrying to find a contact02:10
ianwhttps://wiki.openstack.org/wiki/ThirdPartySystems/Intel_OpenStack_CI i guess; i'll send a mail02:11
openstackgerritMerged opendev/glean master: Fix a typo in a log message  https://review.opendev.org/c/opendev/glean/+/78271102:40
openstackgerritMerged opendev/glean master: Do not require external mock on Python 3  https://review.opendev.org/c/opendev/glean/+/78229402:40
openstackgerritIan Wienand proposed opendev/glean master: Fix Gentoo "is" comparisons  https://review.opendev.org/c/opendev/glean/+/78812604:10
openstackgerritIan Wienand proposed opendev/glean master: Update hacking and fix pep8 violations  https://review.opendev.org/c/opendev/glean/+/78229604:12
openstackgerritIan Wienand proposed opendev/glean master: Move to Zuul standard hacking rules  https://review.opendev.org/c/opendev/glean/+/78812704:23
openstackgerritIan Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh  https://review.opendev.org/c/opendev/glean/+/78201004:34
openstackgerritIan Wienand proposed opendev/glean master: Create "legacy" script path  https://review.opendev.org/c/opendev/glean/+/78201604:34
openstackgerritIan Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive  https://review.opendev.org/c/opendev/glean/+/78201704:34
openstackgerritIan Wienand proposed opendev/glean master: Cleanup glean.sh variable names  https://review.opendev.org/c/opendev/glean/+/78235504:34
*** vishalmanchanda has joined #opendev04:50
*** ysandeep|away is now known as ysandeep06:06
*** ysandeep is now known as ysandeep|lunch07:28
openstackgerritIan Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh  https://review.opendev.org/c/opendev/glean/+/78201007:57
openstackgerritIan Wienand proposed opendev/glean master: Create "legacy" script path  https://review.opendev.org/c/opendev/glean/+/78201607:57
openstackgerritIan Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive  https://review.opendev.org/c/opendev/glean/+/78201707:57
openstackgerritIan Wienand proposed opendev/glean master: Cleanup glean.sh variable names  https://review.opendev.org/c/opendev/glean/+/78235507:57
openstackgerritSorin Sbârnea proposed zuul/zuul-jobs master: Remove ansible-lint path exclusions  https://review.opendev.org/c/zuul/zuul-jobs/+/73147108:22
*** ykarel is now known as ykarel|lunch09:13
hrw"E: Failed to fetch https://mirror-int.dfw.rax.opendev.org/debian/dists/n/a-backports/main/source/Sources  404  Not Found [IP: 443]" hm.10:12
hrwlooks like zuul-jobs, when run on Debian 'bullseye', sees 'n/a' in the ansible_distribution_release variable10:12
hrwwhich feels weird10:15
hrwroot@bae0d5701b38:~# ansible localhost -m ansible.builtin.setup|grep ansible_distribution_release10:15
hrw        "ansible_distribution_release": "bullseye",10:15
hrwhttps://b8a1a57faefd5b72979c-bcc0089d8984128efe3d45fe2388916e.ssl.cf1.rackcdn.com/788231/1/check-arm64/kolla-build-debian-source-aarch64/a10782b/zuul-info/host-info.primary.yaml shows n/a instead10:25
*** ysandeep|lunch is now known as ysandeep10:56
fricklerhrw: ianw has analyzed this in quite some detail http://lists.opendev.org/pipermail/service-discuss/2021-April/000222.html11:09
openstackgerritPranali Deore proposed openstack/project-config master: Add glance-tempest-plugin to publish-to-pypi job  https://review.opendev.org/c/openstack/project-config/+/78825011:39
*** auristor has joined #opendev11:50
openstackgerritMerged zuul/zuul-jobs master: Remove ansible-lint path exclusions  https://review.opendev.org/c/zuul/zuul-jobs/+/73147112:06
openstackgerritMerged opendev/glean master: Only force DNS handling if there is DNS data  https://review.opendev.org/c/opendev/glean/+/78172812:32
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/786487 should be just about ready at this point. I've double checked and zk02 is still the leader so we are removing a follower (03) which is what we wanted. https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 option A is the plan I'm following. Will edit it if cluster state changes as we move along (it's possible 01 or 0414:51
clarkbbecome leader after this change for example)14:51
clarkbI have one meeting to get through this morning then I'll approve that and put zk03 in the emergency file and stop its zk container if no one objects. I may also manually run service-zookeeper.yaml after that change lands if it will be a while for zuul to get through the set of playbooks14:51
fungicool, i'm mostly around as well in case it doesn't go to plan15:01
clarkbalright meeting is over, last chance to object to https://review.opendev.org/c/opendev/system-config/+/786487 ?15:48
clarkbI'm approving it now. Please -W or -2 it while it is in the gate if there is a problem15:48
clarkbzk03.openstack.org is in the emergency file now too.15:52
clarkbI'll proceed with stopping zk container on it when the change above is closer to merging15:52
clarkbunder 10 minutes to merging. I'll go ahead and stop zk on 03 now16:26
clarkbmntr on 02 (the leader) shows one synced follower now which is what I expected16:27
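The `mntr` check above can be scripted. A minimal sketch, assuming the usual tab-separated key/value output of ZooKeeper's `mntr` four-letter command; the sample text is hypothetical and not captured from zk02:

```python
def parse_mntr(text):
    """Parse `mntr` output (one tab-separated key/value pair per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def leader_has_synced_followers(stats, expected):
    """True if this server is the leader and reports `expected` synced followers."""
    return (
        stats.get("zk_server_state") == "leader"
        and int(stats.get("zk_synced_followers", -1)) == expected
    )


# Hypothetical sample, shaped like what a leader might report with one peer stopped.
SAMPLE = "zk_server_state\tleader\nzk_followers\t1\nzk_synced_followers\t1\n"

stats = parse_mntr(SAMPLE)
print(leader_has_synced_followers(stats, expected=1))
```

Followers do not report `zk_synced_followers`, so the helper falls back to -1 and returns False for them.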
openstackgerritMerged opendev/system-config master: Add zk04.opendev.org  https://review.opendev.org/c/opendev/system-config/+/78648716:33
*** marios|out has quit IRC16:36
clarkbyup service-zookeeper won't run for a bit so I'll run it by hand once the correct checkout happens16:37
clarkbI guess I should wait for base to run since that runs early and base will ensure the 04 server is up to date16:39
fungii'm semi paying attention, but also cooking lunch16:40
clarkbbase has completed which was good as it also updated iptables on the cluster. It would've been updated by service-zookeeper anyway but this helps to show everything is happy so far17:06
clarkbI'm going to run service-zookeeper by hand now17:06
clarkbthe playbook ran successfully17:11
clarkbzk04 failed to start because its id is not in the peer list17:13
clarkblooking at the peer lists we appear to have set server.3 to the 04's host17:13
clarkbserver.{{ loop.index }}={{ (hostvars[host].public_v4) }}:2888:3888 <- I think the use of loop.index there is the issue17:13
openstackgerritClark Boylan proposed opendev/system-config master: Fix the zk peer listing to match myid values  https://review.opendev.org/c/opendev/system-config/+/78833017:17
clarkbinfra-root ^ I think that will fix it, if it looks ok to you I can pull that down and rerun the playbook manually17:17
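The `loop.index` problem described above can be reproduced outside the template. This is only a sketch: the addresses are documentation-range placeholders, and deriving the id from the digits in the hostname is an assumption for illustration, not necessarily what change 788330 does.

```python
import re

# Hypothetical inventory after zk03 was swapped for zk04.
HOSTS = [
    ("zk01.openstack.org", "203.0.113.1"),
    ("zk02.openstack.org", "203.0.113.2"),
    ("zk04.opendev.org", "203.0.113.4"),
]


def peers_by_position(hosts):
    """Mimic `server.{{ loop.index }}`: ids follow the 1-based list position."""
    return [f"server.{i}={ip}:2888:3888" for i, (_, ip) in enumerate(hosts, start=1)]


def peers_by_hostname(hosts):
    """Assumed alternative: take the id from the digits in the hostname."""
    lines = []
    for name, ip in hosts:
        match = re.match(r"zk(\d+)\.", name)
        if match is None:
            raise ValueError(f"cannot derive a server id from {name!r}")
        lines.append(f"server.{int(match.group(1))}={ip}:2888:3888")
    return lines


print(peers_by_position(HOSTS)[-1])
print(peers_by_hostname(HOSTS)[-1])
```

With a gap in the numbering, the positional version labels zk04's address as `server.3`, so a server whose myid is 4 would not find itself in the peer list.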
elodclarkb fungi : just a heads up, that according to my mail in ML, I'll test the delete of an ocata-eol tagged branch *tomorrow* (my) afternoon (CEST timezone)17:20
clarkbelod: thanks17:20
elodhopefully everything will go fine and I can upload the rest of the ACL changing patches17:21
clarkbI think I'll proceed with that change just to keep moving forward17:22
* clarkb runs service-zookeeper by hand again17:22
*** mlavalle has joined #opendev17:27
clarkbLooks like 04 may have successfully been added as a follower and synced17:28
clarkbI'm going to put zk01 and zk02 in the emergency file now so that the broken service-zookeeper doesn't run against them and cause problems. They can be removed from the emergency file once https://review.opendev.org/c/opendev/system-config/+/788330 merges17:29
clarkbI'll also proceed with doing the planned restarts just to be sure that rolling restarts are happy17:29
clarkbok thats all done. 04 has become the leader17:34
openstackgerritClark Boylan proposed opendev/system-config master: Add zk05.opendev.org to the zk cluster  https://review.opendev.org/c/opendev/system-config/+/78833617:38
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/788330 is ready to land if you are happy with it. I'll approve https://review.opendev.org/c/opendev/system-config/+/788336 when ready if you can review that one too17:40
clarkbfor "when ready" I'm mostly thinking I may restart nodepool launchers and builders since they are lower cost restarts and that way they'll see the new servers as we go17:41
clarkbcorvus: for zuul, does anything but the scheduler talk to zk right now? (just thinking a quick down up on the scheduler might be easiest since I need to edit the config by hand later to lie and show zuul all three zk servers before all three are available)17:42
corvusclarkb: web17:44
clarkbcorvus: cool, I may just restart web and scheduler when I get to that point to simplify things17:44
corvusclarkb: ok.  i'm pretty sure that's right; however, it's really simple to restart the whole thing17:44
corvusclarkb: the zuul_restart playbook does that, and it's not really any slower than restarting the scheduler17:45
clarkbcorvus: ya I'm not worried about the restart I'm worried about the manual config edit on ~20 something hosts17:45
corvusoh i see17:45
clarkbI guess I can do a line in file play maybe17:45
clarkbI'll look into that17:45
clarkbcorvus: because I want to lie to zuul and tell it all three of the new servers are present when only 2 are17:45
corvusclarkb: and you don't want to put that into git17:46
clarkbcorvus: ya the current way it is generated is via our group membership in ansible so I figured once we add the third zk server it will be in sync17:47
clarkbI could do another change to hardcode the addresses though, then revert that once all three are up and running17:47
corvusclarkb: you'll include nodepool as well, right?17:47
clarkbcorvus: yes17:48
clarkbthe more I think about it the more I like just doing the hardcode change that gets reverted17:49
clarkbsince it is super clear what is going on17:49
fungii'm happy to prioritize reviewing it17:50
clarkboh except that will fail testing :)17:53
clarkbbecause the hardcoded values will not line up with the test setup17:53
clarkbnot sure how to get around that in a reasonable way17:53
clarkbmaybe by hand is easiest17:53
clarkbfungi: re https://review.opendev.org/c/opendev/system-config/+/788330 if I didn't want to try and get through as much of this as possible today I would fix that :)17:55
clarkbbut it's a minor typo in the commit message so I think I'll go ahead and approve it now17:55
clarkbsince the impact of restarting nodepool launchers and builders is much smaller than that of zuul I'm going to go ahead and restart the builders to have them see the new zk0417:57
clarkbthen maybe do launchers after as well17:57
fungiclarkb: yeah, i was just calling it out in case you ended up needing a second commit. i did +2 it17:59
clarkbthe builders should accurately see the full state of the cluster now. 04 got two extra connections out of it so that looks good. I think I'll proceed with the launchers18:01
clarkblaunchers are done now too18:09
clarkbeverything is looking good to me. Once service-zookeeper for the zk04 change has run I'll remove zk04 and zk01 from the emergency file and approve the addition of zk0518:10
clarkbthen when I'm closer to being ready to run service-zookeeper to include zk05 I'll manually stop zk0218:11
clarkbthen I think I'm back to manually updating the scheduler's config and restarting it and web. Considering how easily the launcher and builder restarts went I don't think I'll even update their config to lie to them. I'll just restart them after each addition18:12
clarkbit's just zuul we need to lie to, since its restarts are much more costly18:12
fungibut not for too much longer18:19
clarkbI do notice that connection counts are in the range that suggests the executors and mergers are connecting to zk. Are those just idle connections?18:21
fungii'm not sure but i assumed zk relied on persistent connections18:36
openstackgerritClark Boylan proposed opendev/system-config master: Small playbook to set zuul zk hosts config  https://review.opendev.org/c/opendev/system-config/+/78834218:40
clarkbfungi: ya it does its own ping ponging. By idle I mean does zuul executor or merger do anything with them18:40
clarkbbut it's all moot now because I think that playbook above (we don't need to merge that change, I can pull it and run it) should update all the mergers and executors and schedulers and webs18:40
clarkbthat means I can run it, then do a full cluster restart and be fine18:41
zigoThe Bullseye image still does the same n/a thing: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_11a/786772/3/check/puppet-openstack-integration-6-scenario001-tempest-debian-bullseye/11a7143/job-output.txt18:41
clarkbzigo: yes, the dib changes haven't landed yet18:41
zigoclarkb: We need a new release?18:41
clarkbif you'd like to help us debug why nova returns 500 errors when testing dib that would help18:41
zigoOutch !18:41
clarkbzigo: no its a testing issue, when we do the end to end functional test with openstack it breaks often enough because nova returns errors18:41
clarkbI don't think it is a debian issue, but it is hitting the jobs often enough that trying to land the debian changes has become difficult18:42
clarkbI added some changes to gather more log info but haven't caught one yet last I checked (and am busy with zookeeper cluster upgrades today)18:42
openstackgerritMerged opendev/system-config master: Fix the zk peer listing to match myid values  https://review.opendev.org/c/opendev/system-config/+/78833018:44
clarkbI've approved https://review.opendev.org/c/opendev/system-config/+/788336 will eat lunch while waiting on that18:45
clarkbonly zk03 is in the emergency file right now as the above change has landed and I want to triple check it is producing the configs we want on 01, 02, and 0418:46
clarkbI'll add 02 to the emergency file after lunch and we can proceed with swapping 05 in18:46
openstackgerritMerged opendev/system-config master: Add zk05.opendev.org to the zk cluster  https://review.opendev.org/c/opendev/system-config/+/78833619:01
clarkbzk02 is in the emergency file now that ^ has merged19:03
openstackgerritIan Wienand proposed opendev/glean master: Fix Gentoo "is" comparisons  https://review.opendev.org/c/opendev/glean/+/78812619:03
openstackgerritIan Wienand proposed opendev/glean master: Update hacking and fix pep8 violations  https://review.opendev.org/c/opendev/glean/+/78229619:03
openstackgerritIan Wienand proposed opendev/glean master: Move to Zuul standard hacking rules  https://review.opendev.org/c/opendev/glean/+/78812719:03
openstackgerritIan Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh  https://review.opendev.org/c/opendev/glean/+/78201019:03
openstackgerritIan Wienand proposed opendev/glean master: Create "legacy" script path  https://review.opendev.org/c/opendev/glean/+/78201619:03
openstackgerritIan Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive  https://review.opendev.org/c/opendev/glean/+/78201719:03
openstackgerritIan Wienand proposed opendev/glean master: Cleanup glean.sh variable names  https://review.opendev.org/c/opendev/glean/+/78235519:03
clarkboh it's meeting time, I completely spaced on that between lunch and zk stuff19:04
fungilooking back over the git review --no-thin option, it turns out that users also need to have git>=1.8.5 as well (though that's probably likely for most users these days)19:35
fungiuntil then, there was a bug causing push --no-thin to noop19:35
fungiworth keeping in mind19:36
clarkbI've just stopped zk on zk02 because the service-zookeeper run for the id fix is applying the changes for the change after it19:39
clarkb(it sees zk05.opendev.org)19:39
clarkband zk05 is up now and seems to have joined the cluster19:41
clarkbI'll restart launchers and builders again, then do rolling restarts of 01 and 04, then we are to the point where I need to update zuul configs and restart zuul19:42
clarkboh I can also push up the change to swap 01 and 0619:42
clarkboh I can't restart launchers and builders until they update. I'll do the rolling restarts first then19:46
fungia bit more reading indicates that the git provided on centos 7 is 1.8.3, so if we have someone who needs --no-thin and they're on centos 7 they need to also work out upgrading git, just fyi19:46
fungipretty much every other distro has plenty new enough git not to have to worry19:48
fungigit review --no-thin won't error on centos 7's git, it just also won't do any good19:49
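Since the old git fails silently here, a client-side version check may be useful. A rough sketch, assuming the 1.8.5 threshold mentioned above and the usual `git version X.Y.Z` output format; the function names are hypothetical:

```python
import re

# Minimum git version for `push --no-thin` to take effect, per the discussion above.
MIN_NO_THIN = (1, 8, 5)


def parse_git_version(output):
    """Extract the numeric version tuple from `git --version` output."""
    match = re.search(r"git version (\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"unrecognised git version string: {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def no_thin_is_effective(output):
    """True if this git is new enough that --no-thin is not silently a no-op."""
    return parse_git_version(output) >= MIN_NO_THIN


print(no_thin_is_effective("git version 1.8.3.1"))
print(no_thin_is_effective("git version 2.31.1"))
```

The input string would normally come from running `git --version`; it is passed in here so the check can be exercised without a particular git installed.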
openstackgerritClark Boylan proposed opendev/system-config master: Add zk06.opendev.org to the zk cluster  https://review.opendev.org/c/opendev/system-config/+/78835519:49
clarkbI'm nowhere near ready for ^ to land but figured having it up early was fine19:50
ianwoh just calling out the restart of gerrit yesterday for the lock issues.  seems it was causing a few people issues unfortunately (intel-ci, etc)19:53
clarkbya looks like the same issue as before (same index lock etc)19:53
clarkbI'm going to do rolling restarts of zk01 and zk04 now19:53
clarkbzuul currently only knows about zk01. I think this is ok as the restart should be quick19:53
ianwyep, unfortunately nobody else has popped up on that bug19:53
fungiany tips on command-line connecting with ansible to a raw ipv6 address?19:55
fungiit seems to think the bit after the last : is a port, and wrapping the address in [] doesn't seem to work either19:55
clarkbrolling restarts done, cluster looks good to me19:56
fungiahh, the problem was i was passing the remote username separated by an @ from the address, i think, rather than with -u19:57
fungiyay! core dump reproduced. now to make the machine be willing to actually save it19:57
fungi/bin/sh: line 1:  5206 Aborted                 (core dumped) /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1619553398.6933997-894584-203663865195816/AnsiballZ_setup.py\19:57
clarkbprod-base should run soon then I'll run service-nodepool manually to update those configs, I'll restart nodepool services at that point. Then when that is done I'm ready to run my update zuul.conf playbook and do a full restart of zuul19:59
clarkbThen when that is done we can land https://review.opendev.org/c/opendev/system-config/+/78835519:59
fungithis is infinitely fun. apparently systemd handles core dumps by default, and the sysctl for kernel.core_pattern is a pipe to that helper rather than a path to save them in20:03
clarkblooks like `/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org > queues.sh` is how we save queues these days, then just rerun queues.sh20:03
fungihah, according to the journal, ansible/python isn't the only thing dumping core. rsyslog also isn't starting on these nodes20:05
fungiProcess 1163 (rsyslogd) of user 0 dumped core.20:05
clarkbis this on arm?20:05
clarkbis it possible we're pulling the wrong arch?20:05
fungiyeah, arm20:05
clarkbthough wrong arch should be an elf issue not a core dump?20:06
fungi/usr/sbin/rsyslogd: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=870e42df5144bb08f84e39aea7981a9f2012038c, stripped20:06
fungidoesn't look like the wrong arch20:06
clarkbinfra-root if you can review https://review.opendev.org/c/opendev/system-config/+/788342 in the next little bit that would be great. I plan to run that playbook to do the zuul.conf update I need when restarting zuul20:06
clarkbI don't need to land the change as I can just pull that down20:06
ianwrsyslog-8.1911.0-6.el8.aarch64 marked as user installed.20:07
ianwso hopefully that's right20:07
fungianyway, according to journalctl...20:08
fungiansible-setup[5206]: Invoked with gather_subset=['all'] gather_timeout=10 filter=* fact_path=/etc/ansible/facts.d20:08
fungiProcess 5206 (platform-python) of user 0 dumped core.20:08
fungithere's a stacktrace two calls deep, but with no symbols all i can see is that it's landing somewhere in libc.so.620:08
clarkbnow try it with specific fact subset and see if it is universal or specific to a group?20:08
fungiis that something i can pass from the remote caller?20:09
clarkbyes they are setup module options20:09
clarkbfungi: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/setup_module.html#parameter-gather_subset20:09
ianwfungi: if you run gdb on the command, iirc it tells you what pkgs to install for symbols20:09
fungiianw: i guess that would be platform-python in this case?20:10
ianw /usr/libexec/platform-python3.6 i'd say20:12
fungiansible '2604:1380:4111:3e56:f816:3eff:fe85:edd9' -i '2604:1380:4111:3e56:f816:3eff:fe85:edd9', -m setup -u root20:12
fungihow do i add gather_subset to that?20:12
ianwfungi: i'm just installing debug info on that host20:14
fungiahh, cool20:14
fungiwanted to try gather_subset min20:14
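One way to pass `gather_subset` to the ad-hoc call is through the module arguments flag `-a`. A small sketch that only builds the command line; the helper name is hypothetical and nothing is executed against the host:

```python
import shlex


def setup_command(host, subset="min", user="root"):
    """Build an ad-hoc `ansible` argv that runs the setup module with gather_subset."""
    return [
        "ansible", host,
        "-i", f"{host},",  # trailing comma makes this an inline inventory list
        "-m", "setup",
        "-a", f"gather_subset={subset}",
        "-u", user,
    ]


cmd = setup_command("2604:1380:4111:3e56:f816:3eff:fe85:edd9")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` as is, which avoids any shell quoting problems with the raw IPv6 address.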
ianwfungi: maybe try the regular thing now, platform-python & glibc debug symbols installed20:16
fungihuh, now instead of a backtrace i get "Resource limits disable core dumping for process 27856 (platform-python)."20:17
fungiguess i need to finish solving that, it must want to actually write it out20:17
ianwi can see a dump now20:18
ianwit's lz4 compressed20:18
fungistrange that it was happy to embed a symbol-less backtrace in the journal but not after the symbols were installed20:18
ianwhahaha it crashes gdb20:19
ianwsomething is definitely not happy20:19
fungiwhere did it write the dump? /var/crash still dne20:19
fungibut yeah, gdb crashing seems double plus ungood20:20
fungii suppose gdb can read cross-arch cores, but copying the right symbols would be tricky20:21
ianw# gdb /usr/libexec/platform-python /var/lib/systemd/coredump/core.platform-python.0.3a0917ef2e4e434fb1320cfd30219033.5206.161955340200000020:21
openstackbugzilla.redhat.com bug 1946948 in gdb "gdb crashes with: ../../gdb/dwarf2-frame.c:1061: internal-error: Unknown CFA rule." [Unspecified,Assigned] - Assigned to keiths20:22
ianwsame issue, and unfortunately not much traction on that :/20:22
clarkbok base has finished, manually running service-nodepool now then will restart those services20:24
fungiaccording to build history, these jobs started failing after 2021-04-16 20:06 (that was the last success result for kolla-build-centos8s-source-aarch64 anyway, and hrw reported a failure for this which ran 2021-04-21 11:53)20:27
fungiso we have a ~5-day window where this would have started to occur20:27
ianw"Anyway if this bug only happens on aarch64 then it might20:27
ianwbe caused by the recent ARM v8.6 changes (bug 1875912) which caused20:27
openstackbug 1875912 in pulseaudio (Ubuntu) "Selected audio output always USB device as default, but no control" [Undecided,Expired] https://launchpad.net/bugs/187591220:27
ianwlots of brokenness.20:27
fungii'll see if i can narrow it20:27
ianwunfortunately, i can't see that bug, even logged in @redhat account20:27
fungitimeframe on 1946948 looks coincident with when we started having this problem with ansible at least20:29
ianwi guess that is referring to something like20:29
ianw* Tue Mar 23 2021 Nick Clifton  <nickc@redhat.com> - 2.30-9720:29
ianw  - Enable support for ARM v8.6 ISA.  (#1875912)20:29
fungiwhen would that have landed in a centos point release?20:30
ianwthat's from binutils-2.30-98.el820:30
fungimaybe we were also delayed building images20:30
fungii guess that had to trickle from rhel to centos20:31
ianwyeah a lot of variables20:31
ianw# rpm -qa | grep binutils20:33
fungiyeah, so we're at least new enough to include that change20:33
fungii think what this means is that centos (possibly also rhel) is not a suitable platform for arm64 testing currently20:34
clarkbnodepool services have been restarted20:34
ianw"I don't think -99 is good either.  The binaries don't crash immediately,20:35
ianwbut anything using threads will crash eventually.  See20:35
openstackbugzilla.redhat.com bug 1946977 in binutils "pthread_join segfaults in stack unwinding" [Unspecified,Modified] - Assigned to nickc20:35
openstackbugzilla.redhat.com bug 1946518 in binutils "binutils-2.30-98 are causing go binaries to crash due to segmentation fault on aarch64" [Unspecified,Modified] - Assigned to nickc20:35
clarkbI'm going to test https://review.opendev.org/c/opendev/system-config/+/788342 by limiting it to a merger and then will restart that merger20:35
clarkbif that looks good I'm going to proceed with running that playbook against all zuul servers and will do a full restart20:35
ianwhttps://bugzilla.redhat.com/show_bug.cgi?id=1946977 is actually almost exactly what we see20:35
clarkbcorvus: ^ fyi20:35
fungiianw: "Fixed in binutils-2.30-100.el8" though?20:37
ianwfungi: yeah, i wonder though if things need a rebuild against it?20:38
fungioh maybe20:38
fungiyeah so like the python interpreter and gdb are still built against older binutils?20:39
clarkbok that all looked good on zm01 and I'm now double checking its zk connectivity20:39
clarkbit connected to one of the new servers, excellent20:39
clarkbinfra-root last chance to weigh in on my playbook to update zuul configs20:39
fungiianw: i take it there's something similar to debian's nmu process where the binutils maintainers can trigger rebuilds of all their reverse-(build-)dependencies20:40
ianwclarkb: i think i noted that we will switch them back to ipv6 addresses after?   i don't think it matters20:40
fungimaybe that's still in progress, or is done but hasn't trickled down from rhel to centos yet20:40
clarkbianw: zuul uses ipv4 from what I can see20:41
clarkbianw: nodepool is ipv6 and those aren't getting modified20:41
ianwohhhh, ok i guess that rewriting trick only happens in the nodepool configs20:41
clarkbyup and I'm doing those separately20:41
ianwok, carry on :)20:41
clarkbI've run it and the results lgtm. I made a backup of the config on scheduler and ze01 and compared results20:42
clarkbI guess I can save queues now then run zuul_restart.yaml?20:43
clarkbcorvus: ^ anything else you think I should do before doing that?20:43
ianwfungi: yeah, honestly not sure :/20:43
clarkbok I'm proceeding to save queues and will run zuul_restart.yaml afterwards20:44
clarkbI don't see any release jobs for the record and I warned the openstack release team20:44
fungiseems like a fine time, yep20:45
clarkbI'm going to remove the hourly deploy and the deploy changes that were queued as they will only serve to slow down https://review.opendev.org/c/opendev/system-config/+/788355 when it lands20:46
clarkbzuul is on its way back up20:47
ianw* Thu Apr 22 2021 Florian Weimer <fweimer@redhat.com> - 2.28-15820:51
ianw- Rebuild with new binutils (#1946518)20:51
clarkbI think zuul is up now20:52
ianwwe have glibc-2.28-155.el8.aarch6420:52
clarkbI will start restoring queues20:52
clarkbqueues have been restored20:56
clarkbinfra-root I'm ready for https://review.opendev.org/c/opendev/system-config/+/788355 if that one looks good to you20:57
ianwfungi: trying to establish in centos-devel when the new packages might appear21:03
clarkbI'm going to stop zk on zk01 now, the change for 06 should be landing in the next few minutes, then once again after base runs I'll manually run service-zookeeper, do rolling restarts of the cluster, then manually run service-nodepool and restart those services21:09
*** whoami-rajat has quit IRC21:09
openstackgerritMerged opendev/system-config master: Add zk06.opendev.org to the zk cluster  https://review.opendev.org/c/opendev/system-config/+/78835521:13
ianwfungi: it's probably worth confirming with the packages from koji that it fixes the issue21:18
ianwthe answer seems to be that we need to wait maybe a week or so for a push, but not quite clear21:19
fungiianw: i'm not super familiar with koji, i assume it provides urls to grab the rpms from?21:19
fungiif so, happy to wget a few and rpm -i them21:19
clarkbok base has completed. I'm going to run service-zookeeper now21:52
ianwfungi: yeah, i guess that would be the process.  i can as i know you're trying to be ! computering :)21:53
fungiianw: thanks, i'm clearly not succeeding at that21:56
clarkb06 is up, the leader (05) reports it has 2 synced followers. I'm going to do rolling restarts of the zk containers now21:56
clarkbthat is done, 06 is the leader now and has 2 synced followers.21:57
clarkbI'm going to run service-nodepool now then restart those services then we should be done21:58
clarkbwell other than cleanup.21:59
ianwfungi: ok, rsyslog now starts; if you have the ansible command in history could you try?22:02
fungiianw: yep, works, great detectorizing!22:04
clarkbso the tldr is they made a bad binutils package and it has since been fixed?22:05
ianwyeah, i haven't looked fully but it must be some of the crt stuff, so glibc needed a rebuild22:06
ianwnow, we could either a) hack nb03 to install from koji and put it in emergency, b) push a change through dib release (not good idea, koji build will disappear), c) wait until centos-8 stream picks up package22:07
clarkb#status log Upgraded zuul zk cluster to focal. zk01-03.openstack.org have been replaced with zk04-06.opendev.org22:12
openstackstatusclarkb: finished logging22:12
clarkbI need to write at least one cleanup change. I'll leave the old servers up until tomorrow.22:13
clarkbactually I don't need the cleanup change22:14
clarkbI do need to update the grafana dashboard though22:14
openstackgerritIan Wienand proposed opendev/glean master: Update hacking and fix pep8 violations  https://review.opendev.org/c/opendev/glean/+/78229622:18
openstackgerritIan Wienand proposed opendev/glean master: Move to Zuul standard hacking rules  https://review.opendev.org/c/opendev/glean/+/78812722:18
openstackgerritIan Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh  https://review.opendev.org/c/opendev/glean/+/78201022:18
openstackgerritIan Wienand proposed opendev/glean master: Create "legacy" script path  https://review.opendev.org/c/opendev/glean/+/78201622:18
openstackgerritIan Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive  https://review.opendev.org/c/opendev/glean/+/78201722:18
openstackgerritIan Wienand proposed opendev/glean master: Cleanup glean.sh variable names  https://review.opendev.org/c/opendev/glean/+/78235522:18
ianwclarkb: i'm sure you're a bit over it, but ^ stack should be ready for review (just fixed a pep8 issue)22:19
openstackgerritClark Boylan proposed openstack/project-config master: Limit grafana for zk to zk04-06  https://review.opendev.org/c/openstack/project-config/+/78837422:20
clarkbianw: I can probably page that back in.22:20
clarkbinfra-root I think https://review.opendev.org/c/openstack/project-config/+/788374 may be the only code side cleanup we need. Then tomorrow I'll cleanup the old servers if we're happy with the new ones22:20
corvusclarkb: normally i'd say let's have an overlap period, but i'm pretty sure we won't be changing any ZK stuff at least until friday which is long enough to establish a baseline i think, so i think we can go ahead and merge that now.22:22
corvus(and if we're wrong, we can always get the data out of graphite)22:22
clarkbcorvus: ya the data isn't going away, and reverting my change is easy too22:22
clarkbcorvus: you might also want to double check zk things look correct to you, but I've been trying to monitor as I went and I haven't noticed anything weird22:24
corvusclarkb: the zk followers graph looks like a bad game of tetris but other than that lgtm ;)22:26
corvusclarkb: statsd may need a restart in order to drop stale gauges22:26
clarkbha, it actually shows when I would take out the next in line to be replaced22:26
clarkbcorvus: stale meaning we sent data but then it moved?22:26
clarkbthe followers graph looks correct to me right now. 06 has two followers22:27
corvusclarkb: any other graph with a gauge has stale data from 1-322:30
corvusi think statsd will eventually drop them on its own? (i think we set a config for that) but i can't remember how long it has to go without data for that22:30
clarkboh that was why I pushed the change up. I don't think it will drop the data off entirely? or at least didn't expect it to22:31
corvus(basically statsd keeps repeating gauges in the data it sends to carbon-cache until a timeout happens)22:31
clarkbgot it22:32
corvusthe graphite dbs will stick around regardless (they will delete data points subject to their rollup policy, but the files will stick around until we delete them) but statsd will continue to tell carbon to add data to them until restarted or that timeout hits22:33
clarkblooks like graphite has a graphite-docker_graphite_1 container running on graphiteapp/graphite-statsd image. I guess I can just down && up -d the container with docker-compose?22:34
corvusclarkb: apparently for us that timeout is "never" because we don't set deleteGauges: https://github.com/statsd/statsd/blob/master/exampleConfig.js#L6422:34
clarkbI'll down && up -d that container now22:35
corvusclarkb: or you can 'docker restart $containername'22:35
clarkbah the down && up -d is already done :)22:35
clarkbI see 01-03 data cutting off in the graphs now22:37
clarkbI think that did it22:37
corvusthis will be way less confusing if we do have to consult the data :)22:39
openstackgerritMerged openstack/project-config master: Limit grafana for zk to zk04-06  https://review.opendev.org/c/openstack/project-config/+/78837422:42
clarkbianw: thinking about https://review.opendev.org/c/opendev/glean/+/782010 again, do we need to worry about backward compatibility? like maybe we should install glean.sh to /usr/local/bin/glean.sh if not going into a venv as well as in the package dir? that said I highly doubt that people are using glean without simple-init dib22:50
clarkbso maybe we're good as long as simple-init dib is good22:50
clarkboh and glean vendors the install tooling too so ya this is fine22:51
clarkbif it breaks people they can use the install tooling which is all that dib will do22:51
clarkbianw: left some comments on https://review.opendev.org/c/opendev/glean/+/782017 but nothing worth a respin. +2'd it23:00
ianwclarkb: yeah, i mean i'm not sure we've ever really considered static installs, but i don't think we've ever guaranteed that you could update glean code without running install tool again23:06
clarkbya I think glean shipping an install tool is a big indication you need to use it :)23:06
*** tosky has quit IRC23:08
