ianw | it's just weird that one happened for https://review.opendev.org/c/opendev/system-config/+/880579 and then https://review.opendev.org/c/opendev/system-config/+/880710 | 00:44 |
ianw | one added the jammy servers and the other removed them | 00:45 |
ianw | removed the old ones | 00:45 |
ianw | ok: [codesearch01.opendev.org] | 00:45 |
ianw | it gathered facts ok | 00:45 |
ianw | i dunno; happened well before anything relating to nameservers happened. might be a big coincidence | 00:50 |
ianw | https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-base&pipeline=deploy&skip=0 it is perhaps semi-common i guess | 00:53 |
fungi | might be a misbehaving middlebox somewhere in that cloud region | 01:02 |
ianw | fungi/clarkb: not for now ... but i wasn't sure where we got to on AAAA glue records for opendev.org. If we want to add them, we probably need to ask (my preference) but if we're ok with not having them, we can cross it off https://etherpad.opendev.org/p/2023-opendev-dns | 01:16 |
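As a side note (not from the discussion above), one quick illustrative way to see whether AAAA glue is currently published is to ask a .org TLD server directly; the server name below is assumed to still be a valid .org authoritative:

```shell
# Illustrative check only: query a .org TLD server for opendev.org's delegation.
# AAAA records in the ADDITIONAL section would be the IPv6 glue discussed above.
dig +norecurse @a0.org.afilias-nst.info opendev.org NS
```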
fungi | ianw: related, we may want to ask vexxhost to add ipv6 reverse dns for ns04 | 01:19 |
ianw | yeah, i don't think irc is effective for that | 01:20 |
ianw | ok i logged a low priority ticket | 01:27 |
ianw | #status log shutdown ns1.opendev.org, ns2.opendev.org and adns1.opendev.org that have been replaced with ns03.opendev.org, ns04.opendev.org and adns02.opendev.org | 02:13 |
opendevstatus | ianw: finished logging | 02:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: filter opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847870 | 03:44 |
opendevreview | Merged opendev/system-config master: Add logging During Statup for haproxy-statsd https://review.opendev.org/c/opendev/system-config/+/881901 | 04:26 |
clarkb | it is so quiet today | 15:14 |
clarkb | I'm going to get a gitea 1.19.2 change up after local system updates. They didn't fix the header issue but that has been a long standing problem so I think we can proceed with 1.19.2 | 15:21 |
clarkb | I've just spot checked zuul and nodepool services and believe that we are running quay images at this point. Restarts over the weekend appear successful. | 15:35 |
clarkb | One thing it looks like we will need to do is manually prune out the old docker hub images since our regular pruning hangs onto them | 15:35 |
clarkb | cc corvus not sure if that is worth warning zuul users about | 15:36 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2 https://review.opendev.org/c/opendev/system-config/+/877541 | 15:46 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 15:46 |
clarkb | cleaned up my old hold and put another in place for ^ but I anticipate we can upgrade soon | 15:48 |
clarkb | the centos 9 stream mirror is broken. repomd.xml points to files that don't exist. This problem originates in our upstream mirror | 16:26 |
clarkb | (throwing that out there so that when everyone is back to work tomorrow they can short circuit the debugging) | 16:26 |
clarkb | I'm getting my quay.io change for zookeeper-statsd together and will be regenerating the robot account's token just to ensure we are starting fresh. Nothing should be using it yet anyway but good extra step for safety. | 16:49 |
clarkb | s/token/docker cli passwd/ | 16:49 |
clarkb | corvus: ^ for that I am having to press ^D twice before it emits output when the password is entered on the command line. Is this expected? I'm worried the first control character may end up in the input somehow. I'll use echo -n 'value' | zuul-client encrypt instead I guess | 16:54 |
corvus | clarkb: re pruning -- i don't think that's something we need to warn people about | 16:55 |
corvus | clarkb: yes, 2 ctrl-d's is expected when not immediately following a newline | 16:55 |
corvus | that's a shell thing | 16:56 |
clarkb | TIL. fwiw echo -n '' seems to work. Just prefix it with a space to prevent it from going into history | 16:57 |
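For reference, a minimal sketch of the pipe described above; the secret value, tenant, and project are placeholders, and the leading space only keeps the command out of history if the shell's HISTCONTROL includes ignorespace:

```shell
# Sketch only: secret value, tenant, and project names are placeholders, not the real ones.
# The leading space keeps the command out of bash history (requires HISTCONTROL=ignorespace).
 echo -n 'new-robot-password' | zuul-client encrypt --tenant openstack --project opendev/system-config
```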
corvus | yep it's nice to see what you're doing :) | 16:59 |
corvus | clarkb: i went to go check exactly which image is running for zuul... and i see this: | 17:06 |
corvus | 4a9793f4f9fa 0021610b5ea6 "/usr/bin/dumb-init …" 2 days ago Up 2 days zuul-scheduler_scheduler_1 | 17:06 |
corvus | so then i run docker inspect 0021610b5ea6 | 17:06 |
corvus | and i see: "org.zuul-ci.change_url": "https://review.opendev.org/873012" | 17:06 |
corvus | and that does not look right to me | 17:07 |
opendevreview | Clark Boylan proposed opendev/system-config master: Base jobs for quay.io image publishing https://review.opendev.org/c/opendev/system-config/+/881285 | 17:08 |
corvus | do you think something about how we're building images is broken and not attaching those labels correctly now? maybe that's the most recent layer that has a label? | 17:08 |
corvus | i get that with docker inspect quay.io/zuul-ci/zuul-scheduler locally too | 17:09 |
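If it helps, the labels can be pulled out on their own with a standard docker inspect format string (image ID taken from the container listing above):

```shell
# Show only the image labels (standard docker CLI; ID from the example above)
docker inspect --format '{{ json .Config.Labels }}' 0021610b5ea6
```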
clarkb | corvus: I think the component reported version looks fairly accurate | 17:09 |
clarkb | so I suspect this has to do with metadata and not pushing stale content | 17:09 |
clarkb | corvus: does the most recent docker hub image look better? | 17:10 |
corvus | yeah, the build date looks correct too | 17:10 |
corvus | yep | 17:10 |
corvus | points to https://review.opendev.org/880658 | 17:11 |
clarkb | infra-root I think https://review.opendev.org/c/opendev/system-config/+/881285 is ready for review now. Should be safe to land whenever we are ready to debug it. zookeeper-statsd has not had any new images since I synced docker hub to quay.io so we don't need to sync that before we switch either | 17:11 |
corvus | clarkb: i think i see the issue | 17:12 |
clarkb | ok I'm still trying to figure out where we set the value. Must be in the jobs somewhere | 17:12 |
corvus | working on a change | 17:12 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Add labels to build-container-image https://review.opendev.org/c/zuul/zuul-jobs/+/881919 | 17:14 |
corvus | clarkb: ^ | 17:15 |
corvus | looks like another case of the build-container-image bitrotting between when we made it originally and when we finally started using it. | 17:15 |
clarkb | corvus: heh ya the buildx tasks have them and they were copied over more recently | 17:16 |
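Roughly, the fix boils down to passing the zuul metadata as --label arguments at image build time; a sketch of the docker-level equivalent (not the actual zuul-jobs task), reusing the label key and image name seen above:

```shell
# Docker-level sketch of what the role should emit; the real implementation is the
# build-container-image role in zuul-jobs, this just illustrates the label mechanism.
docker build \
    --label "org.zuul-ci.change_url=https://review.opendev.org/881919" \
    -t quay.io/zuul-ci/zuul-scheduler:latest .
```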
fungi | i guess 881285 is going to need 881919 too? | 17:18 |
clarkb | fungi: not strictly necessary but very nice to have yes | 17:18 |
clarkb | probably worth waiting on then we can be sure the label fix works too | 17:18 |
corvus | yeah, it's super hard to map images back to what they're running without it, so i'd support waiting for 919 before doing any more builds :) | 17:18 |
corvus | i'd like to restart zuul again to catch the changes that merged over the weekend; any objections? | 17:20 |
corvus | i'll just run the zuul_reboot playbook | 17:20 |
fungi | sounds good to me, thanks | 17:20 |
clarkb | corvus: it's quiet today (holidays elsewhere in the world) and the zookeeper content should make it even less of an impact. I'm good with this | 17:21 |
clarkb | oh ya zuul_reboot is the graceful one. Should go quickly anyway | 17:21 |
corvus | running now in screen on bridge | 17:23 |
clarkb | corvus: there is a periodic job that may need to be dealt with in openstack now that I look | 17:23 |
clarkb | it is queued though which means it isn't on an executor yet so maybe it is fine | 17:23 |
corvus | yeah, probably how the last reboot made it through | 17:24 |
corvus | i wish there were a way to copy the tooltip to get the node request id | 17:24 |
clarkb | ++ | 17:24 |
opendevreview | Merged zuul/zuul-jobs master: Add labels to build-container-image https://review.opendev.org/c/zuul/zuul-jobs/+/881919 | 17:27 |
corvus | clarkb: found a nodepool bug causing that stuck request | 17:33 |
clarkb | ack | 17:34 |
corvus | change linked in #zuul:opendev.org | 17:37 |
corvus | i believe a restart of nl01 will correct the immediate problem; maybe we should just land that change and let the subsequent automatic restart handle that. | 17:38 |
clarkb | yup I've approved the change and the auto hourly deployment should handle it automatically | 17:40 |
clarkb | I'm going to pop out for lunch soon so won't approve the quay.io change in system-config yet. Happy for someone else to if they can watch it otherwise I'll aim to +A it when I get back | 18:01 |
fungi | clarkb: 881285 won't actually upload a new image though, right? we need another change to merge for that? | 18:03 |
clarkb | fungi: I think it will because those jobs are added/modified so they should run? | 18:04 |
fungi | oh, yeah i guess the upload happens in gate | 18:04 |
fungi | okay, i'll check it once it merges | 18:04 |
clarkb | oh but ya the post may not. we can push up a noop dockerfile change if we need it to trigger more stuff | 18:05 |
clarkb | a number of our dockerfiles have a timestamp comment for this purpose now | 18:05 |
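As an illustration only (the comment text, date, and file path below are assumptions, not copied from system-config), the "noop dockerfile change" is just bumping a dated comment so the image build job sees a diff:

```shell
# Illustration only: comment format, date, and path are assumed, not the real contents.
sed -i 's|^# Rebuild trigger:.*|# Rebuild trigger: 2023-05-01|' docker/zookeeper-statsd/Dockerfile
git commit -am "Force zookeeper-statsd rebuild"
```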
fungi | right, that's what i was assuming we'd need to test, but we can do that if necessary after you finish your lunch | 18:07 |
clarkb | ++ | 18:07 |
rlandy | fungi: hi ... has anyone reported anything wrt centos9 mirrors? Looks like all jobs (not just tripleo related) are failing with "error: Status code: 404 for https://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz" or some similar error. Example log: https://zuul.opendev.org/t/openstack/build/683d1d11236441d48c2b181b7ce193e8 | 18:30 |
rlandy | another example: https://zuul.opendev.org/t/openstack/build/371a7f56326b4eb5877aafd600ed0a85 | 18:31 |
fungi | rlandy: first i've heard of it, but we mirror from other mirrors so i guess it's worth checking those to see if they're stale | 18:35 |
fungi | https://mirror.bhs1.ovh.opendev.org/centos-stream/timestamp.txt indicates it was updated at the top of the hour | 18:37 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/centos-stream-mirror-update#L44 says we're pulling from mirror.rackspace.com | 18:38 |
fungi | and i don't see a 3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz at https://mirror.rackspace.com/centos-stream/9-stream/BaseOS/x86_64/os/repodata/ | 18:39 |
fungi | so my guess is that their mirror is behind | 18:39 |
fungi | or somehow got rolled back | 18:40 |
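A quick way to confirm the problem really is upstream is to HEAD the same repodata file on our mirror and on the rackspace source it syncs from (URLs and file name taken from the error above):

```shell
# Both should return 404 if the file is genuinely missing upstream as well.
curl -sI https://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz | head -1
curl -sI https://mirror.rackspace.com/centos-stream/9-stream/BaseOS/x86_64/os/repodata/3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz | head -1
```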
rlandy | ok - so we're back to that mirror caching fun | 18:40 |
fungi | yes, our mirror is only ever as reliable as the mirrors we copy from | 18:40 |
fungi | and apparently the mirror network for centos is not all that reliable | 18:41 |
rlandy | thank you | 18:41 |
fungi | rlandy: https://review.opendev.org/868392 switched us from mirror.facebook.net to mirror.rackspace.com in december because facebook's mirrors stopped updating, according to the commit message | 18:43 |
rlandy | yep - I remember we switched a few times last year | 18:45 |
rlandy | going to give it a few hours to see if the mirrors sync up | 18:45 |
opendevreview | Merged opendev/system-config master: Base jobs for quay.io image publishing https://review.opendev.org/c/opendev/system-config/+/881285 | 19:04 |
Clark[m] | I posted about the mirror thing earlier today. I confirmed our upstream mirror has the same issue | 19:18 |
fungi | ahh, i missed that, thanks | 19:22 |
clarkb | fungi: corvus: the quay thing failed in deploy on the zuul job. Likely due to the ongoing zuul restart? I didn't think about that interaction. It did push a change tag but did not update latest. I think because we do need an image change to trigger the promote job | 19:41 |
clarkb | I'm working on that change now | 19:41 |
clarkb | it did restart the zk statsd service on zk04. Image is identical to the one running before so that was just a docker bookkeeping change | 19:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Force zookeeper-statsd rebuild https://review.opendev.org/c/opendev/system-config/+/881924 | 19:43 |
clarkb | nl01's launcher restarted ~36 minutes ago | 19:44 |
fungi | yeah, gate did run system-config-upload-image-zookeeper-statsd successfully, promote doesn't seem to do any tagging | 19:44 |
clarkb | and the stuck job is running | 19:44 |
clarkb | fungi: yup and if you visit the image location on quay you'll see the gate change tag but latest is still old | 19:44 |
clarkb | nothing unexpected that I see so far. just what we anticipated might be an issue which is good | 19:45 |
clarkb | hrm my gitea 1.19.2 change failed presumably on lack of authentication. Implying that authentication is required? | 19:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2 https://review.opendev.org/c/opendev/system-config/+/877541 | 19:50 |
clarkb | no_log removed (which we can do because this request shouldn't need authentication) to see more info | 19:51 |
clarkb | the infra-prod-run-zuul job failed due to zm02 failing to copy project config. We no_log that task so I don't know what happened to make it fail (plenty of disk space) | 19:59 |
clarkb | I guess keep an eye on it for recurrences and we can dig in if necessary | 19:59 |
clarkb | fungi: corvus want to review (and hopefully approve) https://review.opendev.org/c/opendev/system-config/+/881924 so that we can see zookeeper-statsd go end to end with container publishing | 20:46 |
clarkb | oh heh the gitea thing I know what it is. pebkac | 21:07 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2 https://review.opendev.org/c/opendev/system-config/+/877541 | 21:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 21:09 |
clarkb | the good news is that having that issue made me realize we don't need to no_log that request since it isn't privileged. I expect this to work now and have put a hold in place | 21:11 |
corvus | fyi, due to the low load, i paused a handful of executors ahead of the reboot script to reduce the overall upgrade time | 21:39 |
corvus | that seems to be working as expected so far | 21:39 |
clarkb | I saw that. A couple of executors are done too | 21:41 |
opendevreview | Merged opendev/system-config master: Force zookeeper-statsd rebuild https://review.opendev.org/c/opendev/system-config/+/881924 | 21:43 |
corvus | yeah, i'm continuing my rolling window of keeping "about half" paused ahead of the script | 21:43 |
corvus | incidentally, the zookeeper persistent recursive watches change has had a noticeable impact on the zk latency and outstanding requests metrics. | 21:45 |
corvus | that merged on april 11 (and if there's any doubt, you can see it in the zk watches graph) | 21:46 |
corvus | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-90d&to=now | 21:46 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy https://review.opendev.org/c/openstack/project-config/+/881930 | 21:48 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy https://review.opendev.org/c/openstack/project-config/+/881930 | 21:49 |
clarkb | https://quay.io/repository/opendevorg/zookeeper-statsd?tab=tags boom! our first promoted image on quay.io the hourly zuul jobs should deploy that for us (we didn't trigger the zuul job after the image build) | 21:50 |
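By hand, promotion is roughly equivalent to retagging the gate-built image as latest; the promote job handles this for us, and the gate change tag below is a placeholder rather than the real tag name:

```shell
# Rough manual equivalent of promotion (sketch only; <gate-change-tag> is a placeholder).
docker pull quay.io/opendevorg/zookeeper-statsd:<gate-change-tag>
docker tag quay.io/opendevorg/zookeeper-statsd:<gate-change-tag> quay.io/opendevorg/zookeeper-statsd:latest
docker push quay.io/opendevorg/zookeeper-statsd:latest
```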
fungi | w00t | 21:51 |
clarkb | I'll work on getting a few more of those queued up | 21:51 |
clarkb | now to find some easy to swap images that haven't updated on dockerhub since I did the sync | 21:55 |
clarkb | this should reduce the amount of syncing we/I end up needing to do | 21:55 |
clarkb | as I look at this I'm realizing that there is going to be a bit to do to get things moved. Stuff like our base python images create dependency issues. I think for now I'm going to ignore that though. If we get the leaf images moved then we can rebuild once the base images move too | 21:59 |
corvus | clarkb: why not start with base? | 22:08 |
clarkb | corvus: I guess I can. My main concern with doing that is that if we need to update base for some reason urgently we may not be ready to consume it from its new location everywhere | 22:09 |
clarkb | the risk of that is low though and might be good motivation :) | 22:09 |
corvus | ok fair. no strong opinion here | 22:09 |
clarkb | I think doing it leaf image first is probably more effort but also "safer" from that perspective | 22:09 |
corvus | btw, friendly reminder https://zuul.opendev.org/t/openstack/project/opendev.org/opendev/system-config?branch=master&pipeline=check exists in case it's helpful :) | 22:10 |
clarkb | I'm working on two changes at the moment. One to update the base jobs and one to update ircbot | 22:17 |
clarkb | We can use review to decide which approach we prefer | 22:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move ircbot to quay.io https://review.opendev.org/c/opendev/system-config/+/881931 | 22:22 |
ianw | nice! | 22:25 |
ianw | gitea change looks good. i don't think we use any authenticated endpoints now? | 22:25 |
clarkb | ianw: we do to create repos and orgs and stuff | 22:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move python builder/base images to quay.io https://review.opendev.org/c/opendev/system-config/+/881932 | 22:56 |
clarkb | I'll work on a third change that updates our Dockerfiles to consume ^ | 22:56 |
clarkb | my concern with that is a lot of images will update all at once... we can hash that out if we want to split things up in review | 22:56 |
opendevreview | Clark Boylan proposed opendev/system-config master: Consume python base images from quay.io https://review.opendev.org/c/opendev/system-config/+/881933 | 23:03 |
clarkb | There are a lot of moving pieces here. I think we can pause here since the general thing has been shown to work. Think about the approach we want to take / discuss it in the meeting tomorrow. Write down a plan/todo list and then get it done | 23:04 |
clarkb | I'm going to shift gears here and check up on the gitea upgrade then get a meeting agenda sent out | 23:05 |
clarkb | at first glance gitea 1.19.2 seems to be working https://158.69.65.228:3081/opendev/system-config | 23:07 |
clarkb | thinking a bit about the quay.io work. It might make sense to try and do a "sprint" for that. Pick a couple of days in the near future and just focus on getting as much of that done as possible. Then we ideally don't end up with stale images for very long and can have people around to double check services are happy with their new images | 23:08 |
clarkb | Meeting agenda has been updated. Probably with too much detail. Please let me know if there is anything else to add/change/edit | 23:22 |
ianw | clarkb: i think the gerrit acl indent, etc. all got merged | 23:33 |
clarkb | oh neat I'll double check and clean that up | 23:34 |
ianw | and the renames at the bottom are done, right? | 23:34 |
clarkb | oh yup. Thanks | 23:35 |
ianw | nameserver status is accurate; i might just remove the old servers later today as there's been no problem after i shut them down yesterday morning (my time) | 23:35 |
clarkb | the acl updates did merge. Any idea if we applied them specially to ensure they all got updated? | 23:36 |
ianw | that's a good point, i'll go back and check the deploy | 23:36 |
opendevreview | Ian Wienand proposed opendev/zone-opendev.org master: Remove old DNS servers https://review.opendev.org/c/opendev/zone-opendev.org/+/881935 | 23:40 |
ianw | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_95b/879906/7/deploy/infra-prod-manage-projects/95b29cf/manage-projects.yaml.log | 23:42 |
ianw | hrm | 23:43 |
ianw | To ssh://review.opendev.org:29418/x/gearman-plugin | 23:43 |
ianw | ! [remote rejected] HEAD -> refs/meta/config (prohibited by Gerrit: project state does not permit write) | 23:43 |
clarkb | that is probably a read only project | 23:43 |
ianw | i didn't think of that | 23:43 |
clarkb | I think that is fine. If we ever make it not read only we'll sync a current good config | 23:43 |
ianw | yeah, the r/o projects all failed like that | 23:43 |
ianw | all the errors were the "does not permit write" ones | 23:45 |
clarkb | how long did it take (that could be good info) | 23:46 |
ianw | 55 minutes | 23:47 |
clarkb | agenda sent | 23:47 |
clarkb | ianw: maybe we should increase the timeout of that job (assuming it is 60 minutes) and then we can just merge changes and not worry about manual runs | 23:47 |
ianw | name: infra-prod-manage-projects | 23:52 |
ianw | parent: infra-prod-playbook | 23:52 |
ianw | timeout: 4800 | 23:52 |
ianw | probably enough headroom | 23:53 |
clarkb | perfect | 23:53 |