*** sshnaidm_ has joined #opendev | 00:00 | |
*** mlavalle has quit IRC | 00:01 | |
*** sshnaidm|afk has quit IRC | 00:01 | |
*** sshnaidm_ has quit IRC | 00:07 | |
*** sshnaidm_ has joined #opendev | 00:08 | |
*** DSpider has quit IRC | 01:14 | |
openstackgerrit | Merged opendev/glean master: Allow disabling DHCP fallback https://review.opendev.org/c/opendev/glean/+/781500 | 01:21 |
---|---|---|
brinzhang0 | I cannot submit a patch to gerrit, and when I get the HTTP Credentials from self Group, it reports a 500 error. Is there anyone who maintains the gerrit? | 01:28 |
ianw | brinzhang0: yes we can take a look, is this a new account? | 01:34 |
brinzhang0 | no | 01:35 |
brinzhang0 | it's my account, and I always use it | 01:35 |
*** whoami-rajat has joined #opendev | 01:36 | |
*** hamalq has quit IRC | 01:37 | |
*** ricolin has quit IRC | 01:37 | |
ianw | brinzhang0: so you are submitting via git review? could you paste some command output to http://paste.openstack.org/ so we can determine what exactly is failing? | 01:38 |
brinzhang0 | ianw: yeah, I will paste some steps | 01:38 |
ianw | i don't see your username in any of the recent cleanups so i don't think that's related | 01:40 |
brinzhang0 | I will send it to you in a private channel | 01:41 |
ianw | i'm seeing errors in the logs consistent with https://bugs.chromium.org/p/gerrit/issues/detail?id=13726 | 01:46 |
ianw | this IndexWriter is closed etc. etc. | 01:47 |
ianw | brinzhang0: so this is a bit of a known problem with gerrit. we've found restarting it generally helps, i can do that | 01:48 |
brinzhang0 | ianw: ok, if there is need to restart the gerrit, I can wait | 01:49 |
ianw | #status log restarted gerrit due to inability of user to update account settings, logs consistent with lock errors detailed in https://bugs.chromium.org/p/gerrit/issues/detail?id=13726 | 01:50 |
openstackstatus | ianw: finished logging | 01:50 |
ianw | brinzhang0: try again now, see how we go | 01:51 |
brinzhang0 | ianw: I remember hitting the same issue last year, and after a gerrit restart I could use it again | 01:51 |
brinzhang0 | ianw: it's ok now ^^ | 01:51 |
brinzhang0 | let me try to submit the modified patch | 01:52 |
ianw | interesting, i wonder if your account is just lucky, or somehow different | 01:54 |
brinzhang0 | ianw: ah, interesting | 01:54 |
brinzhang0 | now I can submit the patch too | 01:54 |
brinzhang0 | thanks | 01:54 |
*** brinzhang_ has joined #opendev | 02:09 | |
ianw | there's an intel zuul account in a constant failure loop authenticating too | 02:10 |
ianw | trying to find a contact | 02:10 |
ianw | https://wiki.openstack.org/wiki/ThirdPartySystems/Intel_OpenStack_CI i guess; i'll send a mail | 02:11 |
*** brinzhang0 has quit IRC | 02:12 | |
openstackgerrit | Merged opendev/glean master: Fix a typo in a log message https://review.opendev.org/c/opendev/glean/+/782711 | 02:40 |
openstackgerrit | Merged opendev/glean master: Do not require external mock on Python 3 https://review.opendev.org/c/opendev/glean/+/782294 | 02:40 |
*** hemanth_n has joined #opendev | 02:57 | |
openstackgerrit | Ian Wienand proposed opendev/glean master: Fix Gentoo "is" comparisons https://review.opendev.org/c/opendev/glean/+/788126 | 04:10 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Update hacking and fix pep8 violations https://review.opendev.org/c/opendev/glean/+/782296 | 04:12 |
*** fressi has joined #opendev | 04:19 | |
openstackgerrit | Ian Wienand proposed opendev/glean master: Move to Zuul standard hacking rules https://review.opendev.org/c/opendev/glean/+/788127 | 04:23 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh https://review.opendev.org/c/opendev/glean/+/782010 | 04:34 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Create "legacy" script path https://review.opendev.org/c/opendev/glean/+/782016 | 04:34 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive https://review.opendev.org/c/opendev/glean/+/782017 | 04:34 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Cleanup glean.sh variable names https://review.opendev.org/c/opendev/glean/+/782355 | 04:34 |
*** ykarel has joined #opendev | 04:35 | |
*** vishalmanchanda has joined #opendev | 04:50 | |
*** slaweq has joined #opendev | 05:52 | |
*** sboyron has joined #opendev | 05:53 | |
*** amoralej has joined #opendev | 06:04 | |
*** ralonsoh has joined #opendev | 06:04 | |
*** ysandeep|away is now known as ysandeep | 06:06 | |
*** lpetrut has joined #opendev | 06:11 | |
*** marios has joined #opendev | 06:13 | |
*** DSpider has joined #opendev | 06:41 | |
*** eolivare has joined #opendev | 06:59 | |
*** andrewbonney has joined #opendev | 07:15 | |
*** sboyron_ has joined #opendev | 07:16 | |
*** sboyron has quit IRC | 07:18 | |
*** hashar has joined #opendev | 07:26 | |
*** ysandeep is now known as ysandeep|lunch | 07:28 | |
*** tosky has joined #opendev | 07:41 | |
*** rpittau|afk is now known as rpittau | 07:45 | |
*** jpena|off has joined #opendev | 07:54 | |
*** jpena|off is now known as jpena | 07:54 | |
openstackgerrit | Ian Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh https://review.opendev.org/c/opendev/glean/+/782010 | 07:57 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Create "legacy" script path https://review.opendev.org/c/opendev/glean/+/782016 | 07:57 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive https://review.opendev.org/c/opendev/glean/+/782017 | 07:57 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Cleanup glean.sh variable names https://review.opendev.org/c/opendev/glean/+/782355 | 07:57 |
*** gothicserpent has quit IRC | 08:15 | |
*** gothicserpent has joined #opendev | 08:20 | |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul-jobs master: Remove ansible-lint path exclusions https://review.opendev.org/c/zuul/zuul-jobs/+/731471 | 08:22 |
*** dtantsur|afk is now known as dtantsur | 08:30 | |
*** hashar has quit IRC | 08:43 | |
*** ykarel is now known as ykarel|lunch | 09:13 | |
*** snapdeal has joined #opendev | 09:14 | |
*** yoctozepto4 has joined #opendev | 09:25 | |
*** yoctozepto has quit IRC | 09:26 | |
*** yoctozepto4 is now known as yoctozepto | 09:26 | |
hrw | morning | 09:30 |
*** sshnaidm_ is now known as sshnaidm | 10:00 | |
hrw | "E: Failed to fetch https://mirror-int.dfw.rax.opendev.org/debian/dists/n/a-backports/main/source/Sources 404 Not Found [IP: 10.209.161.66 443]" hm. | 10:12 |
hrw | looks like zuul-jobs when run on Debian 'bullseye' see 'n/a' in ansible_distribution_release variable | 10:12 |
*** ykarel|lunch is now known as ykarel | 10:13 | |
hrw | which feels weird | 10:15 |
hrw | root@bae0d5701b38:~# ansible localhost -m ansible.builtin.setup|grep ansible_distribution_release | 10:15 |
hrw | "ansible_distribution_release": "bullseye", | 10:15 |
hrw | https://b8a1a57faefd5b72979c-bcc0089d8984128efe3d45fe2388916e.ssl.cf1.rackcdn.com/788231/1/check-arm64/kolla-build-debian-source-aarch64/a10782b/zuul-info/host-info.primary.yaml shows n/a instead | 10:25 |
*** ysandeep|lunch is now known as ysandeep | 10:56 | |
*** dtantsur is now known as dtantsur|brb | 11:03 | |
frickler | hrw: ianw has analyzed this in quite some detail http://lists.opendev.org/pipermail/service-discuss/2021-April/000222.html | 11:09 |
hrw | thx | 11:10 |
*** eolivare_ has joined #opendev | 11:27 | |
*** eolivare has quit IRC | 11:29 | |
*** jpena is now known as jpena|lunch | 11:34 | |
openstackgerrit | Pranali Deore proposed openstack/project-config master: Add glance-tempest-plugin to publish-to-pypi job https://review.opendev.org/c/openstack/project-config/+/788250 | 11:39 |
*** auristor has quit IRC | 11:40 | |
*** eolivare_ has quit IRC | 11:49 | |
*** auristor has joined #opendev | 11:50 | |
*** hashar has joined #opendev | 11:54 | |
openstackgerrit | Merged zuul/zuul-jobs master: Remove ansible-lint path exclusions https://review.opendev.org/c/zuul/zuul-jobs/+/731471 | 12:06 |
*** ysandeep is now known as ysandeep|brb | 12:09 | |
*** hemanth_n has quit IRC | 12:12 | |
*** tkajinam has quit IRC | 12:18 | |
*** tkajinam has joined #opendev | 12:18 | |
*** tkajinam has quit IRC | 12:24 | |
*** tkajinam has joined #opendev | 12:25 | |
*** eolivare_ has joined #opendev | 12:28 | |
*** ysandeep|brb is now known as ysandeep | 12:28 | |
openstackgerrit | Merged opendev/glean master: Only force DNS handling if there is DNS data https://review.opendev.org/c/opendev/glean/+/781728 | 12:32 |
*** jpena|lunch is now known as jpena | 12:33 | |
*** amoralej is now known as amoralej|lunch | 12:34 | |
*** snapdeal has quit IRC | 12:43 | |
*** dtantsur|brb is now known as dtantsur | 12:54 | |
*** ifest has joined #opendev | 13:16 | |
*** amoralej|lunch is now known as amoralej | 13:18 | |
*** tkajinam has quit IRC | 13:27 | |
*** tkajinam has joined #opendev | 13:27 | |
*** lpetrut has quit IRC | 14:00 | |
*** fressi has quit IRC | 14:31 | |
*** sshnaidm has quit IRC | 14:38 | |
*** sshnaidm has joined #opendev | 14:40 | |
*** d34dh0r53 has quit IRC | 14:47 | |
*** d34dh0r53 has joined #opendev | 14:47 | |
*** ifest has quit IRC | 14:50 | |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/786487 should be just about ready at this point. I've double checked and zk02 is still the leader so we are removing a follower (03) which is what we wanted. https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 option A is the plan I'm following. Will edit it if cluster state changes as we move along (it's possible 01 or 04 | 14:51 |
clarkb | become leader after this change for example) | 14:51 |
clarkb | I have one meeting to get through this morning then I'll approve that and put zk03 in the emergency file and stop its zk container if no one objects. I may also manually run service-zookeeper.yaml after that change lands if it will be a while for zuul to get through the set of playbooks | 14:51 |
*** ykarel has quit IRC | 14:57 | |
fungi | cool, i'm mostly around as well in case it doesn't go to plan | 15:01 |
*** mlavalle has joined #opendev | 15:08 | |
*** amoralej is now known as amoralej|off | 15:18 | |
*** smcginni1 is now known as smcginnis | 15:24 | |
*** ysandeep is now known as ysandeep|away | 15:24 | |
*** hashar is now known as hasharAway | 15:46 | |
clarkb | alright meeting is over, last chance to object to https://review.opendev.org/c/opendev/system-config/+/786487 ? | 15:48 |
clarkb | I'm approving it now. Please -W or -2 it while it is in the gate if there is a problem | 15:48 |
*** ykarel has joined #opendev | 15:51 | |
clarkb | zk03.openstack.org is in the emergency file now too. | 15:52 |
clarkb | I'll proceed with stopping zk container on it when the change above is closer to merging | 15:52 |
*** mlavalle has quit IRC | 15:53 | |
*** ykarel has quit IRC | 16:00 | |
*** marios is now known as marios|out | 16:13 | |
clarkb | under 10 minutes to merging. I'll go ahead and stop zk on 03 now | 16:26 |
clarkb | mntr on 02 (the leader) shows one synced follower now which is what I expected | 16:27 |
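The `mntr` check clarkb mentions can be scripted. The following is a minimal sketch, assuming the server accepts the `mntr` four-letter command on a plain-text client port (2181 is only the conventional default, and newer ZooKeeper releases need the command allowed in the server config). The function names and the sample reply are illustrative and do not come from the log.

```python
import socket


def parse_mntr(text):
    """Turn tab-separated `mntr` output into a dict of stat name -> value."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip anything that is not a key<TAB>value pair
            stats[key.strip()] = value.strip()
    return stats


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send `mntr` to a ZooKeeper server and return the raw reply.

    The port is an assumption; a TLS-only deployment would need a
    different transport than this plain socket.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"mntr")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


# Hypothetical leader reply, trimmed to the fields of interest.
sample = "zk_server_state\tleader\nzk_followers\t1\nzk_synced_followers\t1\n"
stats = parse_mntr(sample)
print(stats["zk_server_state"], stats["zk_synced_followers"])
```

With one follower stopped in a three-node cluster, a leader reporting a single synced follower is the expected state, which matches what is described at 16:27.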
openstackgerrit | Merged opendev/system-config master: Add zk04.opendev.org https://review.opendev.org/c/opendev/system-config/+/786487 | 16:33 |
*** rpittau is now known as rpittau|afk | 16:33 | |
*** marios|out has quit IRC | 16:36 | |
clarkb | yup service-zookeeper won't run for a bit so I'll run it by hand once the correct checkout happens | 16:37 |
clarkb | I guess I should wait for base to run since that runs early and base will ensure the 04 server is up to date | 16:39 |
fungi | i'm semi paying attention, but also cooking lunch | 16:40 |
*** jpena is now known as jpena|off | 17:01 | |
*** dtantsur is now known as dtantsur|afk | 17:05 | |
*** andrewbonney has quit IRC | 17:05 | |
clarkb | base has completed which was good as it also updated iptables on the cluster. It would've been updated by service-zookeeper anyway but this helps to show everything is happy so far | 17:06 |
clarkb | I'm going to run service-zookeeper by hand now | 17:06 |
*** eolivare_ has quit IRC | 17:07 | |
clarkb | the playbook ran successfully | 17:11 |
clarkb | zk04 failed to start because its id is not in the peer list | 17:13 |
clarkb | looking at the peer lists we appear to have set server.3 to the 04's host | 17:13 |
clarkb | server.{{ loop.index }}={{ (hostvars[host].public_v4) }}:2888:3888 <- I think the use of loop.index there is the issue | 17:13 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix the zk peer listing to match myid values https://review.opendev.org/c/opendev/system-config/+/788330 | 17:17 |
clarkb | infra-root ^ I think that will fix it, if it looks ok to you I can pull that down and rerun the playbook manually | 17:17 |
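The mismatch clarkb points at (17:13) can be reproduced outside of Jinja. The sketch below is illustrative only: it assumes each server's `myid` equals the number in its hostname, and it uses placeholder hostnames where the real template uses `public_v4` addresses. The actual fix in change 788330 is not shown in this log and may differ.

```python
import re


def peers_by_position(hosts):
    """Number peers by list position, like Jinja's 1-based loop.index."""
    return [f"server.{i}={h}:2888:3888" for i, h in enumerate(hosts, start=1)]


def peers_by_hostname(hosts):
    """Number peers by the digits in the hostname (assumed to match myid)."""
    lines = []
    for h in hosts:
        match = re.match(r"zk(\d+)\.", h)
        if match is None:
            raise ValueError(f"cannot derive a server id from {h!r}")
        lines.append(f"server.{int(match.group(1))}={h}:2888:3888")
    return lines


# zk03 has left the group and zk04 has been added (placeholder names).
hosts = ["zk01.example.org", "zk02.example.org", "zk04.example.org"]
print(peers_by_position(hosts)[-1])  # server.3=zk04.example.org:2888:3888
print(peers_by_hostname(hosts)[-1])  # server.4=zk04.example.org:2888:3888
```

Position-based numbering only lines up with `myid` while the group has no gaps; as soon as a member is removed from the middle, every later peer shifts down by one.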
elod | clarkb fungi : just a heads up, that according to my mail in ML, I'll test the delete of an ocata-eol tagged branch *tomorrow* (my) afternoon (CEST timezone) | 17:20 |
clarkb | elod: thanks | 17:20 |
elod | hopefully everything will go fine and I can upload the rest of the ACL changing patches | 17:21 |
clarkb | I think I'll proceed with that change just to keep moving forward | 17:22 |
* clarkb runs service-zookeeper by hand again | 17:22 | |
*** mlavalle has joined #opendev | 17:27 | |
clarkb | Looks like 04 may have successfully been added as a follower and synced | 17:28 |
clarkb | I'm going to put zk01 and zk02 in the emergency file now so that the broken service-zookeeper doesn't run against them and cause problems. They can be removed from the emergency file once https://review.opendev.org/c/opendev/system-config/+/788330 merges | 17:29 |
clarkb | I'll also proceed with doing the planned restarts just to be sure that rolling restarts are happy | 17:29 |
clarkb | ok thats all done. 04 has become the leader | 17:34 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add zk05.opendev.org to the zk cluster https://review.opendev.org/c/opendev/system-config/+/788336 | 17:38 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/788330 is ready to land if you are happy with it. I'll approve https://review.opendev.org/c/opendev/system-config/+/788336 when ready if you can review that one too | 17:40 |
clarkb | for "when ready" I'm mostly thinking I may restart nodepool launchers and builders since they are lower cost restarts and that way they'll see the new servers as we go | 17:41 |
clarkb | corvus: for zuul, does anything but the scheduler talk to zk right now? (just thinking a quick down up on the scheduler might be easiest since I need to edit the config by hand later to lie and show zuul all three zk servers before all three are available) | 17:42 |
corvus | clarkb: web | 17:44 |
fungi | looking | 17:44 |
clarkb | corvus: cool, I may just restart web and scheduler when I get to that point to simplify things | 17:44 |
corvus | clarkb: ok. i'm pretty sure that's right; however, it's really simple to restart the whole thing | 17:44 |
corvus | clarkb: the zuul_restart playbook does that, and it's not really any slower than restarting the scheduler | 17:45 |
clarkb | corvus: ya I'm not worried about the restart I'm worried about the manual config edit on ~20 something hosts | 17:45 |
corvus | oh i see | 17:45 |
clarkb | I guess I can do a line in file play maybe | 17:45 |
clarkb | I'll look into that | 17:45 |
clarkb | corvus: because I want to lie to zuul and tell it all three of the new servers are present when only 2 are | 17:45 |
corvus | clarkb: and you don't want to put that into git | 17:46 |
clarkb | corvus: ya the current way it is generated is via our group membership in ansible so I figured once we add the third zk server it will be in sync | 17:47 |
clarkb | I could do another change to hardcode the addresses though, then revert that once all three are up and running | 17:47 |
corvus | clarkb: you'll include nodepool as well, right? | 17:47 |
clarkb | corvus: yes | 17:48 |
clarkb | the more I think about it the more I like just doing the hardcode change that gets reverted | 17:49 |
clarkb | since it is super clear what is going on | 17:49 |
corvus | ++ | 17:49 |
fungi | i'm happy to prioritize reviewing it | 17:50 |
clarkb | oh except that will fail testing :) | 17:53 |
clarkb | because the hardcoded values will not line up with the test setup | 17:53 |
clarkb | not sure how to get around that in a reasonable way | 17:53 |
*** hamalq has joined #opendev | 17:53 | |
clarkb | maybe by hand is easiest | 17:53 |
*** hamalq has quit IRC | 17:54 | |
*** hamalq has joined #opendev | 17:54 | |
clarkb | fungi: re https://review.opendev.org/c/opendev/system-config/+/788330 if I didn't want to try and get through as much of this as possible today I would fix that :) | 17:55 |
clarkb | but it's a minor typo in the commit message so I think I'll go ahead and approve it now | 17:55 |
clarkb | since the impact of restarting nodepool launchers and builders is much smaller than that of zuul I'm going to go ahead and restart the builders to have them see the new zk04 | 17:57 |
clarkb | then maybe do launchers after as well | 17:57 |
fungi | clarkb: yeah, i was just calling it out in case you ended up needing a second commit. i did +2 it | 17:59 |
clarkb | the builders should accurately see the full state of the cluster now. 04 got two extra connections out of it so that looks good. I think I'll proceed with the launchers | 18:01 |
fungi | awesome | 18:01 |
clarkb | launchers are done now too | 18:09 |
clarkb | everything is looking good to me. Once service-zookeeper for the zk04 change has run I'll remove zk04 and zk01 from the emergency file and approve the addition of zk05 | 18:10 |
clarkb | then when I'm closer to being ready to run service-zookeeper to include zk05 I'll manually stop zk02 | 18:11 |
clarkb | then I think I'm back to manually updating the scheduler's config and restarting it and web. Considering how easily the launcher and builder restarts went I don't think I'll even update their config to lie to them. I'll just restart them after each addition | 18:12 |
clarkb | it's just zuul we need to lie to, since its restarts are much more costly | 18:12 |
*** vishalmanchanda has quit IRC | 18:14 | |
fungi | but not for too much longer | 18:19 |
clarkb | I do notice that connection counts are in the range that suggests the executors and mergers are connecting to zk. Are those just idle connections? | 18:21 |
*** valleedelisle has joined #opendev | 18:22 | |
*** dtantsur has joined #opendev | 18:25 | |
*** dtantsur|afk has quit IRC | 18:28 | |
*** _dvd has quit IRC | 18:29 | |
*** openstackstatus has quit IRC | 18:31 | |
*** openstackstatus has joined #opendev | 18:32 | |
*** ChanServ sets mode: +v openstackstatus | 18:32 | |
fungi | i'm not sure but i assumed zk relied on persistent connections | 18:36 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Small playbook to set zuul zk hosts config https://review.opendev.org/c/opendev/system-config/+/788342 | 18:40 |
clarkb | fungi: ya it does its own ping ponging. By idle I mean does zuul executor or merger do anything with them | 18:40 |
clarkb | but it's all moot now because I think that playbook above (we don't need to merge that change, I can pull it and run it) should update all the mergers and executors and schedulers and webs | 18:40 |
zigo | Hi. | 18:40 |
clarkb | that means I can run it, then do a full cluster restart and be fine | 18:41 |
zigo | The Bullseye image still does the same n/a thing: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_11a/786772/3/check/puppet-openstack-integration-6-scenario001-tempest-debian-bullseye/11a7143/job-output.txt | 18:41 |
clarkb | zigo: yes, the dib changes haven't landed yet | 18:41 |
zigo | Oh... | 18:41 |
zigo | clarkb: We need a new release? | 18:41 |
clarkb | if you'd like to help us debug why nova returns 500 errors when testing dib that would help | 18:41 |
zigo | Outch ! | 18:41 |
clarkb | zigo: no its a testing issue, when we do the end to end functional test with openstack it breaks often enough because nova returns errors | 18:41 |
clarkb | I don't think it is a debian issue, but it is hitting the jobs often enough that trying to land the debian changes has become difficult | 18:42 |
clarkb | I added some changes to gather more log info but haven't caught one yet last I checked (and am busy with zookeeper cluster upgrades today) | 18:42 |
openstackgerrit | Merged opendev/system-config master: Fix the zk peer listing to match myid values https://review.opendev.org/c/opendev/system-config/+/788330 | 18:44 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/788336 will eat lunch while waiting on that | 18:45 |
clarkb | only zk03 is in the emergency file right now as the above change has landed and I want to triple check it is producing the configs we want on 01, 02, and 04 | 18:46 |
clarkb | I'll add 02 to the emergency file after lunch and we can proceed with swapping 05 in | 18:46 |
openstackgerrit | Merged opendev/system-config master: Add zk05.opendev.org to the zk cluster https://review.opendev.org/c/opendev/system-config/+/788336 | 19:01 |
clarkb | zk02 is in the emergency file now that ^ has merged | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Fix Gentoo "is" comparisons https://review.opendev.org/c/opendev/glean/+/788126 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Update hacking and fix pep8 violations https://review.opendev.org/c/opendev/glean/+/782296 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Move to Zuul standard hacking rules https://review.opendev.org/c/opendev/glean/+/788127 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh https://review.opendev.org/c/opendev/glean/+/782010 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Create "legacy" script path https://review.opendev.org/c/opendev/glean/+/782016 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive https://review.opendev.org/c/opendev/glean/+/782017 | 19:03 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Cleanup glean.sh variable names https://review.opendev.org/c/opendev/glean/+/782355 | 19:03 |
clarkb | oh it's meeting time, I completely spaced on that between lunch and zk stuff | 19:04 |
*** hasharAway is now known as hashar | 19:05 | |
fungi | ayup | 19:06 |
*** lbragstad has quit IRC | 19:17 | |
*** lbragstad has joined #opendev | 19:30 | |
fungi | looking back over the git review --no-thin option, it turns out that users also need to have git>=1.8.5 as well (though that's probably likely for most users these days) | 19:35 |
fungi | until then, there was a bug causing push --no-thin to noop | 19:35 |
fungi | worth keeping in mind | 19:36 |
*** ralonsoh has quit IRC | 19:37 | |
clarkb | I've just stopped zk on zk02 because the service-zookeeper run for the id fix is applying the changes for the change after it | 19:39 |
clarkb | (it sees zk05.opendev.org) | 19:39 |
clarkb | and zk05 is up now and seems to have joined the cluster | 19:41 |
clarkb | I'll restart launchers and builders again, then do rolling restarts of 01 and 04, then we are to the point where I need to update zuul configs and restart zuul | 19:42 |
clarkb | oh I can also push up the change to swap 01 and 06 | 19:42 |
clarkb | oh I can't restart launchers and builders until they update. I'll do the rolling restarts first then | 19:46 |
fungi | a bit more reading indicates that the git provided on centos 7 is 1.8.3, so if we have someone who needs --no-thin and they're on centos 7 they need to also work out upgrading git, just fyi | 19:46 |
clarkb | thanks | 19:48 |
fungi | pretty much every other distro has plenty new enough git not to have to worry | 19:48 |
fungi | git review --no-thin won't error on centos 7's git, it just also won't do any good | 19:49 |
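Since `--no-thin` is silently ignored on older git, a quick version check saves a confusing debugging session. This is a minimal sketch that takes the 1.8.5 threshold from fungi's messages above; the helper names and the sample version strings are illustrative.

```python
import re
import subprocess

# Threshold quoted in the discussion above, not independently verified here.
MIN_GIT = (1, 8, 5)


def parse_git_version(output):
    """Extract (major, minor, patch) from `git --version` output."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def no_thin_is_effective(version):
    """True when this git is new enough for push --no-thin to take effect."""
    return version >= MIN_GIT


def installed_git_version():
    """Ask the local git binary for its version."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)


# Illustrative strings: an old 1.8.3-series git and a recent one.
print(no_thin_is_effective(parse_git_version("git version 1.8.3.1")))  # False
print(no_thin_is_effective(parse_git_version("git version 2.31.1")))   # True
```

Only the first three components are compared, so a vendor suffix such as the trailing `.1` in the first sample is ignored.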
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add zk06.opendev.org to the zk cluster https://review.opendev.org/c/opendev/system-config/+/788355 | 19:49 |
clarkb | I'm nowhere near ready for ^ to land but figured having it up early was fine | 19:50 |
ianw | oh just calling out the restart of gerrit yesterday for the lock issues. seems it was causing a few people issues unfortunately (intel-ci, etc) | 19:53 |
clarkb | ya looks like the same issue as before (same index lock etc) | 19:53 |
clarkb | I'm going to do rolling restarts of zk01 and zk04 now | 19:53 |
clarkb | zuul currently only knows about zk01. I think this is ok as the restart should be quick | 19:53 |
ianw | yep, unfortunately nobody else has popped up on that bug | 19:53 |
fungi | any tips on command-line connecting with ansible to a raw ipv6 address? | 19:55 |
fungi | it seems to think the bit after the last : is a port, and wrapping the address in [] doesn't seem to work either | 19:55 |
clarkb | rolling restarts done, cluster looks good to me | 19:56 |
fungi | ahh, the problem was i was passing the remote username separated by an @ from the address, i think, rather than with -u | 19:57 |
fungi | yay! core dump reproduced. now to make the machine be willing to actually save it | 19:57 |
fungi | /bin/sh: line 1: 5206 Aborted (core dumped) /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1619553398.6933997-894584-203663865195816/AnsiballZ_setup.py\ | 19:57 |
clarkb | prod-base should run soon then I'll run service-nodepool manually to update those configs, I'll restart nodepool services at that point. Then when that is done I'm ready to run my update zuul.conf playbook and do a full restart of zuul | 19:59 |
clarkb | Then when that is done we can land https://review.opendev.org/c/opendev/system-config/+/788355 | 19:59 |
fungi | this is infinitely fun. apparently systemd handles core dumps by default, and the sysctl for kernel.core_pattern is a pipe to that helper rather than a path to save them in | 20:03 |
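The behaviour fungi ran into follows from the format of `kernel.core_pattern`: a value starting with `|` names a helper program that receives the dump on stdin, anything else is a file name template. A small sketch of that distinction follows; the function names and the sample pattern are illustrative, since the node's actual value is not shown in the log.

```python
from pathlib import Path


def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value as a piped helper or a file path."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        # The first token after the pipe is the helper; the rest are its args.
        return ("pipe", parts[0] if parts else "")
    return ("file", pattern)


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the live setting on a Linux host."""
    return Path(path).read_text()


# Illustrative value of the kind a systemd-based distro may ship.
sample = "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"
print(describe_core_pattern(sample))
print(describe_core_pattern("core"))
```

When the result is a pipe to systemd's helper, dumps end up under systemd's own storage rather than a directory like `/var/crash`, which is consistent with the path ianw uses with gdb later in the log.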
clarkb | looks like `/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org > queues.sh` is how we save queues these days, then just rerun queues.sh | 20:03 |
fungi | yep | 20:03 |
fungi | hah, according to the journal, ansible/python isn't the only thing dumping core. rsyslog also isn't starting on these nodes | 20:05 |
fungi | Process 1163 (rsyslogd) of user 0 dumped core. | 20:05 |
clarkb | is this on arm? | 20:05 |
clarkb | is it possible we're pulling the wrong arch? | 20:05 |
fungi | yeah, arm | 20:05 |
clarkb | though wrong arch should be an ELF issue not a core dump? | 20:06 |
fungi | /usr/sbin/rsyslogd: ELF 64-bit LSB shared object, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=870e42df5144bb08f84e39aea7981a9f2012038c, stripped | 20:06 |
fungi | doesn't look like the wrong arch | 20:06 |
clarkb | infra-root if you can review https://review.opendev.org/c/opendev/system-config/+/788342 in the next little bit that would be great. I plan to run that playbook to do the zuul.conf update I need when restarting zuul | 20:06 |
clarkb | I don't need to land the change as I can just pull that down | 20:06 |
ianw | rsyslog-8.1911.0-6.el8.aarch64 marked as user installed. | 20:07 |
ianw | so hopefully that's right | 20:07 |
fungi | anyway, according to journalctl... | 20:08 |
fungi | ansible-setup[5206]: Invoked with gather_subset=['all'] gather_timeout=10 filter=* fact_path=/etc/ansible/facts.d | 20:08 |
fungi | Process 5206 (platform-python) of user 0 dumped core. | 20:08 |
fungi | there's a stacktrace two calls deep, but with no symbols all i can see is that it's landing somewhere in libc.so.6 | 20:08 |
clarkb | now try it with specific fact subset and see if it is universal or specific to a grop? | 20:08 |
clarkb | *group | 20:08 |
fungi | is that something i can pass from the remote caller? | 20:09 |
clarkb | yes they are setup module options | 20:09 |
clarkb | fungi: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/setup_module.html#parameter-gather_subset | 20:09 |
ianw | fungi: if you run gdb on the command, iirc it tells you what pkgs to install for symbols | 20:09 |
fungi | ianw: i guess that would be platform-python in this case? | 20:10 |
ianw | /usr/libexec/platform-python3.6 i'd say | 20:12 |
fungi | ansible '2604:1380:4111:3e56:f816:3eff:fe85:edd9' -i '2604:1380:4111:3e56:f816:3eff:fe85:edd9', -m setup -u root | 20:12 |
fungi | how do i add gather_subset to that? | 20:12 |
ianw | fungi: i'm just installing debug info on that host | 20:14 |
fungi | ahh, cool | 20:14 |
fungi | wanted to try gather_subset min | 20:14 |
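One way to answer the question at 20:12: options of the `setup` module are passed to an ad-hoc `ansible` run through `-a`, as `key=value` pairs. The sketch below only builds the argument list, modelled on fungi's command; the helper name is made up and the address is a documentation placeholder, not the host from the log.

```python
import shlex


def setup_command(address, user="root", gather_subset=None):
    """Build an ad-hoc `ansible -m setup` command for a single raw address.

    The trailing comma on -i makes ansible treat the value as an inline
    host list, and -u carries the user (instead of user@host, which is
    what tripped up the IPv6 parsing at 19:57).
    """
    argv = ["ansible", address, "-i", address + ",", "-m", "setup", "-u", user]
    if gather_subset:
        # Module arguments for an ad-hoc run go through -a.
        argv += ["-a", "gather_subset=" + gather_subset]
    return argv


cmd = setup_command("2001:db8::1", gather_subset="min")
print(shlex.join(cmd))  # shell-quoted form, requires Python 3.8+
```

Running the result with `gather_subset=min` versus a specific subset such as `hardware` or `network` would show whether the crash is universal or tied to one group of facts, which is what clarkb suggests at 20:08.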
*** sboyron_ has quit IRC | 20:16 | |
ianw | fungi: maybe try the regular thing now, platform-python & glibc debug symbols installed | 20:16 |
fungi | retrying | 20:16 |
fungi | huh, now instead of a backtrace i get "Resource limits disable core dumping for process 27856 (platform-python)." | 20:17 |
fungi | guess i need to finish solving that, it must want to actually write it out | 20:17 |
ianw | i can see a dump now | 20:18 |
ianw | it's lz4 compressed | 20:18 |
fungi | strange that it was happy to embed a symbol-less backtrace in the journal but not after the symbols were installed | 20:18 |
fungi | ahh | 20:18 |
ianw | hahaha it crashes gdb | 20:19 |
ianw | something is definitely not happy | 20:19 |
fungi | where did it write the dump? /var/crash still dne | 20:19 |
fungi | but yeah, gdb crashing seems double plus ungood | 20:20 |
fungi | i suppose gdb can read cross-arch cores, but copying the right symbols would be tricky | 20:21 |
ianw | # gdb /usr/libexec/platform-python /var/lib/systemd/coredump/core.platform-python.0.3a0917ef2e4e434fb1320cfd30219033.5206.1619553402000000 | 20:21 |
ianw | https://bugzilla.redhat.com/show_bug.cgi?id=1946948 | 20:22 |
openstack | bugzilla.redhat.com bug 1946948 in gdb "gdb crashes with: ../../gdb/dwarf2-frame.c:1061: internal-error: Unknown CFA rule." [Unspecified,Assigned] - Assigned to keiths | 20:22 |
ianw | same issue, and unfortunately not much traction on that :/ | 20:22 |
clarkb | ok base has finished, manually running service-nodepool now then will restart those services | 20:24 |
fungi | according to build history, these jobs started failing after 2021-04-16 20:06 (that was the last success result for kolla-build-centos8s-source-aarch64 anyway, and hrw reported a failure for this which ran 2021-04-21 11:53) | 20:27 |
fungi | so we have a ~5-day window where this would have started to occur | 20:27 |
ianw | "Anyway if this bug only happens on aarch64 then it might | 20:27 |
ianw | be caused by the recent ARM v8.6 changes (bug 1875912) which caused | 20:27 |
openstack | bug 1875912 in pulseaudio (Ubuntu) "Selected audio output always USB device as default, but no control" [Undecided,Expired] https://launchpad.net/bugs/1875912 | 20:27 |
ianw | lots of brokenness. | 20:27 |
ianw | " | 20:27 |
fungi | i'll see if i can narrow it | 20:27 |
ianw | unfortunately, i can't see that bug, even logged in @redhat account | 20:27 |
fungi | timeframe on 1946948 looks coincident with when we started having this problem with ansible at least | 20:29 |
ianw | i guess that is referring to something like | 20:29 |
ianw | * Tue Mar 23 2021 Nick Clifton <nickc@redhat.com> - 2.30-97 | 20:29 |
ianw | - Enable support for ARM v8.6 ISA. (#1875912) | 20:29 |
fungi | when would that have landed in a centos point release? | 20:30 |
ianw | that's from binutils-2.30-98.el8 | 20:30 |
fungi | maybe we were also delayed building images | 20:30 |
fungi | i guess that had to trickle from rhel to centos | 20:31 |
ianw | yeah a lot of variables | 20:31 |
ianw | # rpm -qa | grep binutils | 20:33 |
ianw | binutils-2.30-101.el8.aarch64 | 20:33 |
fungi | yeah, so we're at least new enough to include that change | 20:33 |
fungi | i think what this means is that centos (possibly also rhel) is not a suitable platform for arm64 testing currently | 20:34 |
ianw | ohhh | 20:34 |
clarkb | nodepool services have been restarted | 20:34 |
ianw | "I don't think -99 is good either. The binaries don't crash immediately, | 20:35 |
ianw | but anything using threads will crash eventually. See | 20:35 |
ianw | https://bugzilla.redhat.com/show_bug.cgi?id=1946977" | 20:35 |
openstack | bugzilla.redhat.com bug 1946977 in binutils "pthread_join segfaults in stack unwinding" [Unspecified,Modified] - Assigned to nickc | 20:35 |
ianw | https://bugzilla.redhat.com/show_bug.cgi?id=1946518#c9 | 20:35 |
openstack | bugzilla.redhat.com bug 1946518 in binutils "binutils-2.30-98 are causing go binaries to crash due to segmentation fault on aarch64" [Unspecified,Modified] - Assigned to nickc | 20:35 |
clarkb | I'm going to test https://review.opendev.org/c/opendev/system-config/+/788342 by limiting it to a merger and then will restart that merger | 20:35 |
clarkb | if that looks good I'm going to proceed with running that playbook against all zuul servers and will do a full restart | 20:35 |
ianw | https://bugzilla.redhat.com/show_bug.cgi?id=1946977 is actually almost exactly what we see | 20:35 |
clarkb | corvus: ^ fyi | 20:35 |
fungi | ianw: "Fixed in binutils-2.30-100.el8" though? | 20:37 |
ianw | fungi: yeah, i wonder though if things need a rebuild against it? | 20:38 |
fungi | oh maybe | 20:38 |
fungi | yeah so like the python interpreter and gdb are still built against older binutils? | 20:39 |
fungi | s/against/with/ | 20:39 |
clarkb | ok that all looked good on zm01 and I'm now double checking its zk connectivity | 20:39 |
clarkb | it connected to one of the new servers, excellent | 20:39 |
clarkb | infra-root last chance to weigh in on my playbook to update zuul configs | 20:39 |
fungi | ianw: i take it there's something similar to debian's nmu process where the binutils maintainers can trigger rebuilds of all their reverse-(build-)dependencies | 20:40 |
ianw | clarkb: i think i noted that we will switch them back to ipv6 addresses after? i don't think it matters | 20:40 |
fungi | maybe that's still in progress, or is done but hasn't trickled down from rhel to centos yet | 20:40 |
clarkb | ianw: zuul uses ipv4 from what I can see | 20:41 |
clarkb | ianw: nodepool is ipv6 and those aren't getting modified | 20:41 |
ianw | ohhhh, ok i guess that rewriting trick only happens in the nodepool configs | 20:41 |
clarkb | yup and I'm doing those separately | 20:41 |
ianw | ok, carry on :) | 20:41 |
corvus | lgtm | 20:42 |
clarkb | I've run it and the results lgtm. I made a backup of the config on scheduler and ze01 and compared results | 20:42 |
clarkb | I guess I can save queues now then run zuul_restart.yaml? | 20:43 |
clarkb | corvus: ^ anything else you think I should do before doing that? | 20:43 |
ianw | fungi: yeah, honestly not sure :/ | 20:43 |
clarkb | ok I'm proceeding to save queues and will run zuul_restart.yaml afterwards | 20:44 |
clarkb | I don't see any release jobs for the record and I warned the openstack release team | 20:44 |
fungi | seems like a fine time, yep | 20:45 |
clarkb | I'm going to remove the hourly deploy and the deploy changes that were queued as they will only serve to slow down https://review.opendev.org/c/opendev/system-config/+/788355 when it lands | 20:46 |
clarkb | zuul is on its way back up | 20:47 |
ianw | fungi: | 20:51 |
ianw | * Thu Apr 22 2021 Florian Weimer <fweimer@redhat.com> - 2.28-158 | 20:51 |
ianw | - Rebuild with new binutils (#1946518) | 20:51 |
ianw | glibc-2.28-158.el8 | 20:52 |
clarkb | I think zuul is up now | 20:52 |
ianw | we have glibc-2.28-155.el8.aarch64 | 20:52 |
clarkb | I will start restoring queues | 20:52 |
ianw | https://koji.mbox.centos.org/koji/buildinfo?buildID=17346 | 20:52 |
clarkb | queues have been restored | 20:56 |
clarkb | infra-root I'm ready for https://review.opendev.org/c/opendev/system-config/+/788355 if that one looks good to you | 20:57 |
fungi | lgtm | 21:02 |
ianw | fungi: trying to establish in centos-devel when the new packages might appear | 21:03 |
fungi | thanks! | 21:04 |
clarkb | I'm going to stop zk on zk01 now, the change for 06 should be landing in the next few minutes, then once again after base runs I'll manually run service-zookeeper, do rolling restarts of the cluster, then manually run service-nodepool and restart those services | 21:09 |
*** whoami-rajat has quit IRC | 21:09 | |
openstackgerrit | Merged opendev/system-config master: Add zk06.opendev.org to the zk cluster https://review.opendev.org/c/opendev/system-config/+/788355 | 21:13 |
ianw | fungi: it's probably worth confirming with the packages from koji that it fixes the issue | 21:18 |
ianw | the answer seems to be that we need to wait maybe a week or so for a push, but not quite clear | 21:19 |
fungi | ianw: i'm not super familiar with koji, i assume it provides urls to grab the rpms from? | 21:19 |
fungi | if so, happy to wget a few and rpm -i them | 21:19 |
*** slaweq has quit IRC | 21:22 | |
*** cloudnull has quit IRC | 21:24 | |
*** cloudnull has joined #opendev | 21:25 | |
*** hashar has quit IRC | 21:40 | |
*** lbragstad has quit IRC | 21:46 | |
clarkb | ok base has completed. I'm going to run service-zookeeper now | 21:52 |
ianw | fungi: yeah, i guess that would be the process. i can as i know you're trying to be ! computering :) | 21:53 |
fungi | ianw: thanks, i'm clearly not succeeding at that | 21:56 |
clarkb | 06 is up, the leader (05) reports it has 2 synced followers. I'm going to do rolling restarts of the zk containers now | 21:56 |
clarkb | that is done, 06 is the leader now and has 2 synced followers. | 21:57 |
clarkb | I'm going to run service-nodepool now then restart those services then we should be done | 21:58 |
clarkb | well other than cleanup. | 21:59 |
fungi | https://frinkiac.com/img/S09E13/841990.jpg | 22:00 |
ianw | fungi: ok, rsyslog now starts; if you have the ansible command in history could you try? | 22:02 |
fungi | trying! | 22:03 |
fungi | ianw: yep, works, great detectorizing! | 22:04 |
clarkb | so the tldr is they made a bad binutils package and it has since been fixed? | 22:05 |
*** DSpider has quit IRC | 22:06 | |
ianw | yeah, i haven't looked fully but it must be some of the crt stuff, so glibc needed a rebuild | 22:06 |
ianw | now, we could either a) hack nb03 to install from koji and put it in emergency, b) push a change through dib release (not good idea, koji build will disappear), c) wait until centos-8 stream picks up package | 22:07 |
clarkb | #status log Upgraded zuul zk cluster to focal. zk01-03.openstack.org have been replaced with zk04-06.opendev.org | 22:12 |
openstackstatus | clarkb: finished logging | 22:12 |
clarkb | I need to write at least one cleanup change. I'll leave the old servers up until tomorrow. | 22:13 |
clarkb | actually I don't need the cleanup change | 22:14 |
clarkb | I do need to update the grafana dashboard though | 22:14 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Update hacking and fix pep8 violations https://review.opendev.org/c/opendev/glean/+/782296 | 22:18 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Move to Zuul standard hacking rules https://review.opendev.org/c/opendev/glean/+/788127 | 22:18 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Stop requiring /usr/local/bin links for glean.sh https://review.opendev.org/c/opendev/glean/+/782010 | 22:18 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Create "legacy" script path https://review.opendev.org/c/opendev/glean/+/782016 | 22:18 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive https://review.opendev.org/c/opendev/glean/+/782017 | 22:18 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Cleanup glean.sh variable names https://review.opendev.org/c/opendev/glean/+/782355 | 22:18 |
ianw | clarkb: i'm sure you're a bit over it, but ^ stack should be ready for review (just fixed a pep8 issue) | 22:19 |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Limit grafana for zk to zk04-06 https://review.opendev.org/c/openstack/project-config/+/788374 | 22:20 |
clarkb | ianw: I can probably page that back in. | 22:20 |
clarkb | infra-root I think https://review.opendev.org/c/openstack/project-config/+/788374 may be the only code side cleanup we need. Then tomorrow I'll cleanup the old servers if we're happy with the new ones | 22:20 |
corvus | clarkb: normally i'd say let's have an overlap period, but i'm pretty sure we won't be changing any ZK stuff at least until friday which is long enough to establish a baseline i think, so i think we can go ahead and merge that now. | 22:22 |
corvus | (and if we're wrong, we can always get the data out of graphite) | 22:22 |
clarkb | corvus: ya the data isn't going away, and reverting my change is easy too | 22:22 |
clarkb | corvus: you might also want to double check zk things look correct to you, but I've been trying to monitor as I went and I haven't noticed anything weird | 22:24 |
corvus | clarkb: the zk followers graph looks like a bad game of tetris but other than that lgtm ;) | 22:26 |
corvus | clarkb: statsd may need a restart in order to drop stale gauges | 22:26 |
clarkb | ha, it actually shows when I would take out the next in line to be replaced | 22:26 |
clarkb | corvus: stale meaning we sent data but then it moved? | 22:26 |
clarkb | the followers graph looks correct to me right now. 06 has two followers | 22:27 |
corvus | clarkb: any other graph with a gauge has stale data from 1-3 | 22:30 |
corvus | i think statsd will eventually drop them on its own? (i think we set a config for that) but i can't remember how long it has to go without data for that | 22:30 |
clarkb | oh that was why I pushed the change up. I don't think it will drop the data off entirely? or at least didn't expect it to | 22:31 |
corvus | (basically statsd keeps repeating gauges in the data it sends to carbon-cache until a timeout happens) | 22:31 |
clarkb | got it | 22:32 |
corvus | the graphite dbs will stick around regardless (they will delete data points subject to their rollup policy, but the files will stick around until we delete them) but statsd will continue to tell carbon to add data to them until restarted or that timeout hits | 22:33 |
clarkb | til | 22:33 |
clarkb | looks like graphite has a graphite-docker_graphite_1 container running on graphiteapp/graphite-statsd image. I guess I can just down && up -d the container with docker-compose? | 22:34 |
corvus | clarkb: apparently for us that timeout is "never" because we don't set deleteGauges: https://github.com/statsd/statsd/blob/master/exampleConfig.js#L64 | 22:34 |
clarkb | I'll down && up -d that container now | 22:35 |
corvus | ++ | 22:35 |
corvus | clarkb: or you can 'docker restart $containername' | 22:35 |
clarkb | ah the down && up -d is already done :) | 22:35 |
clarkb | I see 01-03 data cutting off in the graphs now | 22:37 |
clarkb | I think that did it | 22:37 |
*** lbragstad has joined #opendev | 22:39 | |
corvus | this will be way less confusing if we do have to consult the data :) | 22:39 |
openstackgerrit | Merged openstack/project-config master: Limit grafana for zk to zk04-06 https://review.opendev.org/c/openstack/project-config/+/788374 | 22:42 |
clarkb | ianw: thinking about https://review.opendev.org/c/opendev/glean/+/782010 again, do we need to worry about backward compatibility? like maybe we should install glean.sh to /usr/local/bin/glean.sh if not going into a venv as well as in the package dir? that said I highly doubt that people are using glean without simple-init dib | 22:50 |
clarkb | so maybe we're good as long as simple-init dib is good | 22:50 |
clarkb | oh and glean vendors the install tooling too so ya this is fine | 22:51 |
clarkb | if it breaks people they can use the install tooling which is all that dib will do | 22:51 |
*** hrw has quit IRC | 22:56 | |
*** hrw has joined #opendev | 22:57 | |
clarkb | ianw: left some comments on https://review.opendev.org/c/opendev/glean/+/782017 but nothing worth a respin. +2'd it | 23:00 |
ianw | clarkb: yeah, i mean i'm not sure we've ever really considered static installs, but i don't think we've ever guaranteed that you could update glean code without running install tool again | 23:06 |
clarkb | ya I think glean shipping an install tool is a big indication you need to use it :) | 23:06 |
*** tosky has quit IRC | 23:08 | |
*** hrw has quit IRC | 23:31 | |
*** hrw has joined #opendev | 23:33 | |
*** mlavalle has quit IRC | 23:51 |