openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Fix DISTRO_NAME in Fedora elements https://review.opendev.org/c/openstack/diskimage-builder/+/791627 | 00:02 |
---|---|---|
openstackgerrit | Ian Wienand proposed opendev/system-config master: zuul job : collect some more logs https://review.opendev.org/c/opendev/system-config/+/791055 | 00:08 |
*** janders has quit IRC | 00:17 | |
*** Dmitrii-Sh has quit IRC | 00:17 | |
*** yoctozepto has quit IRC | 00:17 | |
*** zbr has quit IRC | 00:17 | |
*** Dmitrii-Sh has joined #opendev | 00:17 | |
*** zbr has joined #opendev | 00:17 | |
*** janders has joined #opendev | 00:17 | |
*** yoctozepto has joined #opendev | 00:17 | |
openstackgerrit | Merged opendev/system-config master: Double the default number of ansible forks https://review.opendev.org/c/opendev/system-config/+/791528 | 00:26 |
openstackgerrit | Merged openstack/diskimage-builder master: Add fedora-containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/790365 | 00:52 |
*** ricolin has joined #opendev | 03:12 | |
*** ricolin has quit IRC | 03:12 | |
*** ricolin has joined #opendev | 03:14 | |
*** ykarel_ has joined #opendev | 03:48 | |
*** ykarel_ has quit IRC | 03:51 | |
*** ykarel_ has joined #opendev | 03:51 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 03:57 |
*** ykarel_ is now known as ykarel | 03:59 | |
*** akahat is now known as akahat|ruck | 04:26 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Run haproxy as root user https://review.opendev.org/c/opendev/system-config/+/791634 | 04:30 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 04:32 |
*** ralonsoh has joined #opendev | 04:38 | |
*** vishalmanchanda has joined #opendev | 04:43 | |
*** marios has joined #opendev | 04:53 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 04:59 |
*** ysandeep|away is now known as ysandeep | 05:10 | |
*** sboyron has joined #opendev | 05:51 | |
*** darshna has joined #opendev | 05:51 | |
*** sboyron has quit IRC | 05:52 | |
*** sboyron has joined #opendev | 05:55 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 06:04 |
*** logan- has quit IRC | 06:12 | |
*** logan- has joined #opendev | 06:15 | |
*** slaweq has joined #opendev | 06:35 | |
*** mkowalski has quit IRC | 06:43 | |
*** mkowalski has joined #opendev | 06:43 | |
*** brinzhang has joined #opendev | 06:43 | |
*** amoralej|off is now known as amoralej | 06:44 | |
*** gibi has quit IRC | 06:54 | |
*** fressi has joined #opendev | 06:59 | |
*** iurygregory has quit IRC | 07:12 | |
*** hashar has joined #opendev | 07:17 | |
*** iurygregory has joined #opendev | 07:21 | |
*** andrewbonney has joined #opendev | 07:21 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 07:30 |
*** tosky has joined #opendev | 07:32 | |
*** ysandeep is now known as ysandeep|lunch | 07:49 | |
*** lucasagomes has joined #opendev | 07:58 | |
*** jpena|off is now known as jpena | 07:58 | |
*** whoami-rajat has joined #opendev | 08:09 | |
*** dtantsur|afk is now known as dtantsur | 08:09 | |
*** ykarel is now known as ykarel|lunch | 08:13 | |
*** gibi has joined #opendev | 08:21 | |
frickler | mnaser: any update about the IPv6 situation yet? this is still affecting my daily work by forcing me to explicitly require accessing opendev.org via v4 only | 08:22 |
*** brinzhang_ has joined #opendev | 08:55 | |
*** brinzhang has quit IRC | 08:58 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 09:03 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 09:05 |
*** ysandeep|lunch is now known as ysandeep | 09:11 | |
*** ykarel|lunch is now known as ykarel | 09:27 | |
*** hrw has joined #opendev | 09:32 | |
hrw | morning | 09:32 |
hrw | can https://storage.gra.cloud.ovh.net be configured to show logs? or zuul configured to not store logs there? | 09:32 |
hrw | "Network Error (Unable to fetch URL, check your network connectivity, browser plugins, ad-blockers, or try to refresh this page) https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_16b/777062/2/check-arm64/openstack-tox-py39-arm64/16b46d9/" | 09:33 |
*** ralonsoh has quit IRC | 09:54 | |
*** ralonsoh has joined #opendev | 09:57 | |
*** hashar is now known as hasharAway | 09:58 | |
frickler | hrw: the logs expire after some time, I think 4 weeks, that job seems to be a bit older, so you'd need to rerun it in order to get new logs | 10:24 |
hrw | ah. thanks | 10:25 |
hrw | too many tabs in monday morning patch check and looked in wrong place for job age | 10:26 |
*** gibi has quit IRC | 10:42 | |
*** gibi has joined #opendev | 10:43 | |
*** jpena is now known as jpena|off | 10:58 | |
*** jpena|off is now known as jpena | 11:00 | |
*** fressi has quit IRC | 11:14 | |
*** fressi has joined #opendev | 11:15 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633 | 11:21 |
ianw | clarkb / fungi: ^ in short, haproxy switched to running as a user. the simple thing to do is to just run as root | 11:22 |
ianw | however, i've been exploring the new option that uses ip6tables to make ipv6 much more workable for us. it's still experimental in docker, but i think it's worth fleshing it out just so we understand the option | 11:23 |
ianw | it is incremental steps. it would mean we could expose 80/443 to containers on ipv4 and ipv6 without having to give them capabilities to bind to low ports, or fiddle other settings | 11:25 |
*** jpena is now known as jpena|lunch | 11:30 | |
*** fressi has quit IRC | 11:36 | |
*** fressi has joined #opendev | 11:47 | |
fungi | ianw: some of the permissions errors were about opening files for write though too, i guess ones we bindmount into the container? | 11:55 |
*** yoctozepto has quit IRC | 11:55 | |
*** ykarel has quit IRC | 11:55 | |
*** ykarel has joined #opendev | 12:03 | |
*** hasharAway has quit IRC | 12:09 | |
openstackgerrit | Hitesh Kumar proposed openstack/diskimage-builder master: Migrate from testr to stestr https://review.opendev.org/c/openstack/diskimage-builder/+/789246 | 12:10 |
*** hashar has joined #opendev | 12:10 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix DISTRO_NAME in Fedora elements https://review.opendev.org/c/openstack/diskimage-builder/+/791627 | 12:15 |
*** jpena|lunch is now known as jpena | 12:25 | |
*** amoralej is now known as amoralej|lunch | 12:25 | |
*** yoctozepto has joined #opendev | 12:31 | |
openstackgerrit | chandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745 | 12:38 |
*** marios is now known as marios|call | 13:02 | |
openstackgerrit | chandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745 | 13:11 |
*** amoralej|lunch is now known as amoralej | 13:13 | |
*** marios|call is now known as marios | 13:46 | |
kopecmartin | fungi: hi, can you have a look when you have a moment please https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/789481 | 13:54 |
*** artom has joined #opendev | 14:00 | |
*** ysandeep is now known as ysandeep|afk | 14:00 | |
*** chandankumar is now known as raukadah | 14:16 | |
fungi | infra-root: per discussion from friday, i've forced a logrotate on zuul02 just now, and am in the process of copying /var/log/zuul to /var/log/zuul.tmp (which is on the rootfs, there's plenty of space there for now) | 14:23 |
clarkb | fungi: thanks! I'm just starting to sit down to start my day, though I've got to visit the optometrist in a bit | 14:25 |
fungi | i'll write up the rest of the plan for it in https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17 | 14:25 |
fungi | clarkb: no sweat. i'm happy to wait on the scheduler restart until you're back and have increased your terminal font size a bunch ;) | 14:25 |
*** hashar is now known as hasharAway | 14:25 | |
corvus | fungi: etherpad link is empty for me | 14:26 |
fungi | yeah, for me to; i haven't actually pulled it up and started writing just yet ;) | 14:27 |
corvus | oh ho | 14:27 |
fungi | er, me too | 14:27 |
corvus | fungi: why do we need a scheduler restart? | 14:27 |
clarkb | corvus: to fix the swap partition on the zuul02 (we need to repartition xvde to do that and /var/log/zuul shares the same device) | 14:28 |
corvus | gotcha, that's what swapfs means :) | 14:29 |
clarkb | corvus: https://review.opendev.org/c/opendev/system-config/+/791554 is the underlying issue that was fixed. It affects zuul02, zk04-06, two mirrors and review02 | 14:29 |
clarkb | I was going to do a more extensive double check today to make sure I didn't miss any on friday | 14:29 |
clarkb | looks like ianw has fixed review02 already | 14:29 |
fungi | yeah, something (i forget what?) changed such that our launch-node script began passing megabytes assuming they were gigabytes | 14:30 |
fungi | so we have servers with 8mb swap devices | 14:30 |
clarkb | fungi: ianw wrote a change when spinning up review02 to limit swap space to 8GB because review02 has ~120GB of memory and that is far too large of a swapfile/partition. Unfortunately this change got the scale wrong and caused things to be limited to 8MB not 8GB | 14:31 |
*** fressi has quit IRC | 14:31 | |
fungi | aha, thanks, makes sense | 14:32 |
fungi | oh, right, fixed in 791554 as you said | 14:33 |
clarkb | the two mirror hosts use swapfiles and should be easy fixes. I'll try to pick those up after the optometrist | 14:34 |
clarkb | zk04-06 use partitions but nothing seems to be using /opt there (what the swap partition shares a device with) so those are also likely easy. Means zuul02 is the only complicated fixup | 14:35 |
clarkb | side note: looks like logrotate was working properly since the fix for zuul02 logrotation landed | 14:36 |
fungi | yep, i can confirm that seemed to work fine | 14:42 |
clarkb | fwiw my list of servers that need swap fixes is based on changes to our inventory file since the bug landed. I think that is a reasonably complete list but will try and do an ansible check against all the things today | 14:43 |
*** ysandeep|afk is now known as ysandeep | 14:44 | |
*** vishalmanchanda has quit IRC | 14:54 | |
clarkb | I have fixed up mirror.iad3.inmotion.opendev.org's swap situation. Doing the osuosl mirror next | 15:03 |
clarkb | osuosl mirror is done now too | 15:10 |
clarkb | I'll hold off on zookeeper servers until after my errand today to avoid needing to leave it in a half state | 15:13 |
clarkb | fungi: did you see https://review.opendev.org/c/opendev/system-config/+/791634/1/playbooks/roles/haproxy/files/docker/docker-compose.yaml should address the haproxy problems you worked around over the weekend? | 15:14 |
clarkb | passes testing which did hit the problem previously right? | 15:14 |
clarkb | frickler: hrw: correct we set the swift metadata on our log uploads to clean them up after ~1 month | 15:17 |
clarkb | the volume and size of the logs is quite large which pushes us to do that | 15:18 |
*** mlavalle has joined #opendev | 15:34 | |
fungi | clarkb: hadn't reviewed it yet, thanks for the reminder | 15:35 |
fungi | infra-root: i think the plan at https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17 is reasonably complete, but update/correct it if you spot obvious problems in there | 15:35 |
*** ykarel has quit IRC | 15:39 | |
fungi | the longest parts of the outage will likely be moving the saved logs back into the recreated filesystem, and waiting for the scheduler to finish initializing... any feeling for whether it'll be long enough that we should do a status notice in there? | 15:44 |
clarkb | fungi: we probably should as even the fastest restarts often get noticed | 15:46 |
fungi | i'll add it as a step | 15:46 |
fungi | it's in there before stopping the container now | 15:51 |
*** amoralej is now known as amoralej|off | 15:52 | |
fungi | clarkb: i need to switch to kneading pizza dough for a bit, but let me know when you're back from your appointment and i'll get started working through the steps outlined there | 15:53 |
clarkb | fungi: will do! | 15:55 |
*** ysandeep is now known as ysandeep|away | 15:59 | |
*** marios is now known as marios|out | 16:06 | |
*** artom has quit IRC | 16:07 | |
*** artom has joined #opendev | 16:07 | |
*** lucasagomes has quit IRC | 16:12 | |
*** DSpider has joined #opendev | 16:20 | |
*** DSpider has quit IRC | 16:26 | |
*** marios|out has quit IRC | 16:29 | |
*** dtantsur is now known as dtantsur|afk | 16:36 | |
openstackgerrit | Merged opendev/system-config master: Run haproxy as root user https://review.opendev.org/c/opendev/system-config/+/791634 | 16:56 |
*** jpena is now known as jpena|off | 17:01 | |
*** artom has quit IRC | 17:04 | |
*** ralonsoh has quit IRC | 17:22 | |
*** hamalq has joined #opendev | 17:25 | |
clarkb | fungi: I'm back at this point | 17:30 |
fungi | can you still see? | 17:32 |
clarkb | I can, they only numbed my eyeballs and did not dilate them | 17:32 |
clarkb | I'm pulling up the etherpad now to review the steps | 17:32 |
clarkb | fungi: the etherpad lgtm. Do we also want to stop ansible from running on zuul02 while we do that work? (it won't restart services but it might modify /var/log/zuul contents?) | 17:36 |
fungi | oh, because ansible may set ownership on that path or something? | 17:37 |
clarkb | also we need to do similar with the zk servers. Except instead of /var/log/zuul it is /opt. | 17:37 |
clarkb | fungi: yup | 17:37 |
fungi | i can stick zuul02 in the emergency disable list now, just a sec | 17:37 |
fungi | okay, it's in there | 17:38 |
fungi | i guess we should give it a few minutes in case it's running a playbook which hasn't noticed that | 17:38 |
clarkb | ++ | 17:39 |
clarkb | infra-root once swap fixups are done the next thing I want to do is delete zuul01, please check if you want to preserve anything on that host | 17:40 |
clarkb | fungi: question about this fs stuff: do we actually want to preserve lost+found across fses? I don't think so? | 17:44 |
clarkb | maybe we do? your copy on zuul02 did copy it fwiw | 17:44 |
clarkb | I guess it is empty | 17:44 |
fungi | yeah, i tend to just ignore it | 17:44 |
fungi | i can delete it after the rsync | 17:45 |
clarkb | fungi: oh I would keep it around I just thought its content was fs specific | 17:45 |
clarkb | and so copying it from one fs to another may not make sense | 17:45 |
fungi | i mean i can delete the copy in the temporary location | 17:45 |
clarkb | got it | 17:45 |
*** andrewbonney has quit IRC | 17:46 | |
fungi | so that we don't overwrite the one for the new fs | 17:46 |
clarkb | ++ | 17:46 |
clarkb | fwiw the process you have on the etherpad looks very similar to what the zk's need so I may just stick a copy and edit it at the bottom of that etherpad too? | 17:46 |
clarkb | will help me not miss anything when doing the zk's | 17:46 |
fungi | okay, added lost+found cleanup and emergency disable list stuff to the plan | 17:49 |
fungi | so i don't forget | 17:49 |
fungi | clarkb: yeah, feel free to plagiarize it for the zk servers where it makes sense, continuing to use that pad seems fine too | 17:49 |
clarkb | fungi: I think I'm ready to do a zk server if you think we should wait a bit on zuul otherwise I'll wait for zuul to finish first | 17:55 |
fungi | here's an idea... since the zk servers are redundant, we can use one to test particularly the parted command syntax | 17:56 |
fungi | in case i got something subtly wrong | 17:56 |
clarkb | fungi: sounds like a good idea. Do you want to drive that or should I? zk04 is a follower so a good candidate | 17:58 |
clarkb | (we can do the leader last just in case) | 17:58 |
fungi | clarkb: your zk plan doesn't include stopping/starting zk. that's needed, right? | 17:58 |
clarkb | fungi: stopping zk shouldn't be needed because /opt isn't used by zk | 17:58 |
fungi | ohh | 17:58 |
clarkb | the only thing in /opt is a mostly empty /opt/containerd | 17:58 |
fungi | okay cool | 17:59 |
clarkb | (there are two subdirs of that dir and no files) | 17:59 |
fungi | yeah plan there lgtm then | 17:59 |
fungi | none of them would be outages anyway i guess? | 17:59 |
clarkb | ya there shouldn't be any outages unless we do something very wrong | 17:59 |
corvus | clarkb: did the zuul02 swap happen yet? i'm wondering if we can squeeze the encrypt change into that sequence | 18:01 |
fungi | corvus: it hasn't happened yet, let's add it | 18:01 |
clarkb | ya we're going to do at least one zk server first to make sure the parted commands are happy | 18:01 |
fungi | corvus: can you plug the commands you want into the outline in https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17 ? | 18:01 |
clarkb | fungi: I've started a root screen on zk04 | 18:02 |
clarkb | and I'm editing the emergency.yaml now to put all the zks in it | 18:02 |
fungi | joined | 18:02 |
corvus | fungi: sorry i just mean if we merge 791765 first, then we'll restart into the new decrypt-on-executor code | 18:02 |
clarkb | fungi: note ^ that you need to do an image pull and a full restart for that | 18:03 |
corvus | though if we do that, we'll need to run the zuul stop/start playbooks | 18:03 |
clarkb | corvus: and ya I can review those changes after zks are cleaned up | 18:03 |
fungi | corvus: got it, yeah the current plan was just to down and up the container. if we need executors restarted too than can rework the plan to include that | 18:04 |
fungi | s/than/then/ | 18:04 |
clarkb | fungi: I'm proceeding with zk04 now | 18:05 |
fungi | lgtm so far | 18:06 |
clarkb | fungi: I edited the parted command for zk to use 4096 MB instead of 8192 to match memory | 18:06 |
fungi | yup | 18:06 |
fungi | that looks right | 18:07 |
clarkb | fungi: you ready for me to run the parted command now? | 18:07 |
fungi | yes | 18:07 |
fungi | perfecto | 18:08 |
clarkb | fungi: I think that went well | 18:11 |
clarkb | I'll do zk05 and zk06 now, do you want me to do those in a root screen too? | 18:11 |
*** hasharAway is now known as hashar | 18:12 | |
fungi | not necessary, that was straightforward and as you said nothing's actually using it | 18:12 |
clarkb | fungi: zk05 is done now if you want to double check it | 18:16 |
clarkb | doing zk06 next | 18:16 |
fungi | looking | 18:17 |
fungi | 05 lgtm | 18:18 |
fungi | for zuul restarts, what's the playbook we normally run from bridge? | 18:18 |
fungi | i guess the process would be to stop all the containers on the executor as stated in the maintenance plan and do the filesystem work, then instead of just upping those run the pull and restart playbook(s)? | 18:19 |
clarkb | fungi: there is a system-config/playbooks/zuul_pull.yaml and a zuul_start.yaml and a zuul_stop.yaml | 18:19 |
clarkb | I think you want to do a zuul_pull.yaml then a zuul_stop.yaml then a zuul_start.yaml | 18:19 |
fungi | since it's not a single restart playbook, i suppose i can replace the `docker-compose down` on zuul02 with the full stop playbook on the bridge instead | 18:20 |
clarkb | yes | 18:21 |
clarkb | zk06 is done now too if you can take a quick look then I'll remove the zks from emergency | 18:22 |
clarkb | fungi: oh also the queue dumping and restoring is a bit different on zuul02 now | 18:24 |
fungi | we need the change merged and image updated before we pull, right? | 18:24 |
clarkb | fungi: yes changes need to be merged and images promoted first | 18:24 |
fungi | clarkb: i copied the queue dump/restore from root's command history on zuul02, are there missing steps? | 18:25 |
clarkb | fungi: re queue dumping you need to run it out of a checkout on zuul02 since there isn't one in /opt anymore. Also you need to edit the commands to reenqueue to use docker exec | 18:25 |
clarkb | fungi: if corvus set that up in roots homedir then it should work let me check | 18:25 |
clarkb | yup looks like corvus did that for us, thank you corvus | 18:25 |
clarkb | fungi: ^ you should be good | 18:26 |
clarkb | I'm going to go and review changes for zuul now | 18:26 |
fungi | thanks! | 18:26 |
clarkb | fungi: oh you want to remove zuul02 from the emergency file so that those playbooks function | 18:38 |
clarkb | fungi: instead of using the emergency file we should use disable-ansible to prevent automated ansible from doing things while we do the human controlled ansible | 18:38 |
clarkb | I'm going to remove zk04-06 from the emergency file now | 18:38 |
clarkb | fungi: ^ I left zuul02 in the emergency file but I think you can go ahead and remove it | 18:39 |
clarkb | I'm going to grab lunch while we wait for zuul to CI those changes | 18:40 |
*** artom has joined #opendev | 18:48 | |
fungi | oh, yep, removing zuul02 from the emergency disable list | 18:48 |
fungi | that would get in the way of us running those playbooks, for sure | 18:48 |
fungi | where/when should we use disable-ansible in that sequence? now, i guess? | 18:50 |
clarkb | fungi: maybe closer to when we are ready to run it | 18:50 |
clarkb | since that puts a big roadblock on the zuul jobs and they can pile up | 18:50 |
fungi | yeah | 18:50 |
fungi | okay, i've got the continuous deployment disable/resume added to the plan | 18:54 |
corvus | clarkb, fungi: the zuul decrypt patches are approved, if all goes well they should land in ~1 hour. if you need to restart zuul02 before then, that's fine. if it happens after, i can update the plan with the extra commands. | 19:09 |
clarkb | I think we can wait. All of the other hosts have been sorted as far as swap goes. zuul02 is the last one on my list (though I still need to do a wider check) | 19:11 |
fungi | corvus: nah, no rush, and i think i got the commands right if you can just double-check | 19:11 |
clarkb | currently plenty of memory free on zuul02 | 19:11 |
corvus | oh you already changed, cool i'll check | 19:11 |
corvus | fungi: looks correct to me | 19:12 |
fungi | awesome, thanks! | 19:12 |
clarkb | once we're done with that I think zuul01 will be up for deletion too | 19:13 |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Stop requiring registered nicks for IRC https://review.opendev.org/c/openstack/project-config/+/791818 | 19:28 |
clarkb | I said I would push ^ up last meeting (I'm putting tomorrow's agenda together) | 19:28 |
fungi | probably worth keeping an extra close eye on the results of that once it's in place, given the recent drams | 19:29 |
fungi | drama | 19:29 |
clarkb | ++ | 19:31 |
clarkb | we also don't need to land it just yet since we've got a few other things in the fire | 19:32 |
*** hashar has quit IRC | 19:50 | |
clarkb | looking ahead at my week I'm thinking wednesday may be a good day to try the mailman ansible stuff | 19:54 |
clarkb | corvus: fungi: looks like the zuul changes hit a problem in the gate (the upload image job timed out) | 19:56 |
clarkb | do we want to dequeue then enqueue to speed things up? or just wait for unittest to finish and reapprove? | 19:56 |
fungi | i'm in no rush | 19:58 |
corvus | i'll rejigger it | 19:59 |
corvus | i ran zuul promote --tenant zuul --pipeline gate --changes 791514,2 | 20:01 |
clarkb | things look queued the way we want them | 20:01 |
clarkb | corvus: and you had to docker exec that? | 20:02 |
corvus | ya | 20:02 |
fungi | cool | 20:03 |
clarkb | If you have anything to add to the meeting agenda do it soon. I think all my edits are in now. Just need to mail it out | 20:07 |
corvus | i'm out to run an errand; biab. | 20:34 |
*** sboyron has quit IRC | 21:01 | |
*** gothicserpent has quit IRC | 21:05 | |
*** whoami-rajat has quit IRC | 21:11 | |
*** slaweq has quit IRC | 21:11 | |
fungi | 791514 has merged and its zuul-promote-image build succeeded. is that all we were waiting for to be able to pull? | 21:18 |
clarkb | fungi: I think we want the end of that stack to merge | 21:19 |
clarkb | there are 3 changes total | 21:19 |
fungi | or i guess we wanted the other two in as well, yeah | 21:19 |
fungi | looks like they're merging now | 21:19 |
corvus | and merged; let's check the promote job | 21:20 |
fungi | we want to see 791775 succeed its zuul-promote-image build i think | 21:20 |
clarkb | ya | 21:20 |
clarkb | the promote job just succeeded | 21:22 |
clarkb | corvus: do you also want to double check the info on docker hub? You did that last time but I think that was because the job failed? | 21:23 |
corvus | the job succeeded, so i think we're good | 21:24 |
clarkb | fungi: you ready? | 21:25 |
fungi | okay, yep, moving forward | 21:25 |
clarkb | I guess let me know what I can do to help. I am around | 21:25 |
fungi | i have root screen sessions on both bridge and zuul02 if anyone wants to follow along | 21:26 |
fungi | starting with disabling ansible | 21:26 |
clarkb | I've attached to both of them | 21:26 |
fungi | and pulling images | 21:27 |
fungi | looks like it worked | 21:28 |
clarkb | The about an hour ago image update looks right to me | 21:29 |
fungi | though `docker image ls` on zuul02 shows the most recent image is from "About an hour ago" | 21:29 |
fungi | but yeah, i guess that's when the gate job to build the image completed | 21:29 |
clarkb | yup because the image timestamp is when it was built which happened in the gate job | 21:29 |
fungi | scheduler image id is b6c06442196d | 21:29 |
fungi | i'll dump queues and send the status notice next | 21:30 |
clarkb | ++ | 21:30 |
corvus | yeah i believe that timestamp interpretation is correct | 21:30 |
fungi | #status notice The Zuul service at zuul.opendev.org will be offline for a few minutes (starting now) in order for us to make some needed filesystem changes; if the outage lasts longer than anticipated we'll issue further notices | 21:31 |
openstackstatus | fungi: sending notice | 21:31 |
-openstackstatus- NOTICE: The Zuul service at zuul.opendev.org will be offline for a few minutes (starting now) in order for us to make some needed filesystem changes; if the outage lasts longer than anticipated we'll issue further notices | 21:31 | |
fungi | stopping services now | 21:31 |
fungi | says it completed | 21:32 |
fungi | working on the fs changes to zuul02 next | 21:32 |
clarkb | I double checked 02 and it indeed has no containers running | 21:32 |
openstackstatus | fungi: finished sending notice | 21:34 |
fungi | interestingly the debug logs never updated after i called logrotate, but the non-debug logs have | 21:34 |
fungi | anyone want to double-check me on that before i umount the original fs? | 21:35 |
clarkb | I'm not sure I understand what you mean by that | 21:35 |
clarkb | -rw-r--r-- 1 zuuld zuuld 3225855785 May 17 14:18 debug.log.1 exists | 21:36 |
clarkb | which is from when you rotated earlier today | 21:36 |
fungi | last modified timestamp on /var/log/zuul/debug.log and /var/log/zuul.tmp/debug.log are 14:23 | 21:36 |
clarkb | -rw-r--r-- 1 zuuld zuuld 1625733762 May 17 21:31 debug.log is what I see | 21:36 |
fungi | similar for web-debug.log | 21:36 |
fungi | okay that's super weird | 21:37 |
clarkb | I think you are looking at the log.1 files? | 21:37 |
clarkb | those are from earlier today when you rotated by hand | 21:37 |
fungi | -rw-r--r-- 1 zuuld zuuld 28966363 May 17 14:23 debug.log | 21:38 |
clarkb | but the current log files all seem to have current timestamps for me | 21:38 |
fungi | if i ls -l the directory that's what it shows | 21:38 |
clarkb | that isn't getting truncated? | 21:38 |
fungi | nope line after it is this | 21:38 |
fungi | -rw-r--r-- 1 zuuld zuuld 3225855785 May 17 14:18 debug.log.1 | 21:38 |
fungi | if i ls -l the file directly it shows a different timestamp | 21:38 |
clarkb | the file size isn't what I see either | 21:38 |
fungi | it's like the output is cached/stale or something | 21:39 |
clarkb | I am not able to reproduce that | 21:39 |
fungi | nevermind | 21:40 |
corvus | i see clarkb's | 21:40 |
fungi | i was scrolling back my tmux window which had an old ls -l in it :/ | 21:40 |
fungi | not scrolling back the screen buffer | 21:40 |
fungi | okay, moving ahead! | 21:40 |
clarkb | it's me | 21:40 |
clarkb | I'm out of the dir now | 21:41 |
fungi | aha, thanks | 21:41 |
fungi | it complains the partition is not aligned, do we care? | 21:42 |
clarkb | I think make_swap.sh does log_2 math to avoid that (however 8192 should be log_2 aligned) | 21:42 |
clarkb | I didn't get similar when doing 4096 on zks | 21:43 |
clarkb | oh it's sector alignments? | 21:43 |
fungi | looks that way | 21:43 |
fungi | the "s" suffix | 21:43 |
clarkb | you could tell it to do 2048 sectors for the first partition | 21:44 |
fungi | i don't know where/how it's inferring those sector numbers | 21:44 |
fungi | like that? | 21:45 |
clarkb | presumably? | 21:46 |
fungi | nope, still not aligned, plus new errors | 21:46 |
fungi | i don't know where it's getting the 1953 sector start | 21:47 |
clarkb | heh its still the same error for swap. Is the issue the 1 in 1 8192 ? | 21:47 |
fungi | i can try 0 | 21:48 |
clarkb | fungi: I think we want to start at sector 2048 | 21:48 |
clarkb | not at 0 | 21:48 |
fungi | ahh | 21:48 |
fungi | so shift the values by +2048 like that? | 21:49 |
fungi | or start at 2049 instead of 1? | 21:49 |
clarkb | no because that is still bytes | 21:49 |
clarkb | https://askubuntu.com/questions/201164/proper-alignment-of-partitions-on-an-advanced-format-hdd-using-parted says any multiple of 8 is probably fine so maybe we're ok with the original if we shift it by 8MB ? | 21:50 |
fungi | so like that? i'm honestly not quite sure what you're suggesting, nor why we didn't see the same on other servers | 21:50 |
clarkb | fungi: well I'm not sure what the command needs to be. But we want to express start at sector 2048 and end 8GB later. Then start the next partition from that point forward | 21:51 |
clarkb | it seems that it zero indexes so you don't need to do the +1 | 21:51 |
fungi | there we go | 21:51 |
fungi | parted /dev/xvde --script -- mklabel msdos mkpart primary linux-swap 8 8200 mkpart primary ext2 8200 -1 | 21:51 |
fungi | that did not error on me | 21:51 |
clarkb | yup that lgtm (fungi did the shift by 8MB thing) | 21:51 |
fungi | (8192+8=8200) | 21:51 |
clarkb | I think that is good and we can proceed | 21:52 |
fungi | lsblk says xvde1 202:65 0 7.6G 0 part | 21:53 |
fungi | close enough i guess | 21:53 |
clarkb | ya | 21:54 |
fungi | logfiles are moving back to the new partition now | 21:55 |
fungi | looks like it finished quickly | 21:56 |
clarkb | we are ready to run the start playbook now? | 21:57 |
fungi | contents of the new /var/log/zuul look correct to me | 21:57 |
clarkb | corvus: ^ fyi | 21:57 |
fungi | yeah, switching over to the bridge screen to run that if everyone's ready | 21:58 |
clarkb | I'm ready | 21:58 |
fungi | in theory the excitement won't begin until it tries to run some jobs anyway | 21:58 |
fungi | starting it now | 21:58 |
fungi | and that's completed | 21:59 |
corvus | logs lgtm | 21:59 |
corvus | scheduler starting | 22:00 |
clarkb | corvus: does the saving of keys double check that the file isn't already there or is it unconditional? | 22:00 |
clarkb | (since they should already be there?) | 22:00 |
corvus | unconditional | 22:00 |
fungi | once it's clear of the cat jobs, i'll start reenqueuing | 22:00 |
corvus | and it's not the slow part; reading them from zk is | 22:01 |
corvus | so i don't think adding a condition would speed that up | 22:01 |
corvus | (though, there might be a bit of extra computation happening to write them out) | 22:01 |
corvus | anyway, we're going to drop the filesystem stuff soon anyway, so i don't think that part is worth digging into | 22:01 |
clarkb | ok | 22:02 |
fungi | yeah, i figured that was transitional | 22:02 |
corvus | i think as soon as we write an export utility, we can drop it | 22:02 |
fungi | looks like we're through the cat jobs now? | 22:05 |
clarkb | fungi: yup but it isn't done parsing yet I don't think | 22:05 |
fungi | ahh, no not yet | 22:05 |
fungi | i still see a few cats flashing by | 22:06 |
corvus | more tenants | 22:06 |
clarkb | I think it is up now. The tenant list loads as does openstack status | 22:06 |
corvus | yep | 22:06 |
fungi | okay, starting to reenqueue | 22:07 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Better swap alignment https://review.opendev.org/c/opendev/system-config/+/791832 | 22:07 |
fungi | some builds are already running | 22:08 |
clarkb | they have console logs too | 22:08 |
fungi | yup | 22:08 |
clarkb | previously when we had trouble with the yaml it failed before it got that far | 22:08 |
fungi | so i think the revert of the revert is good now | 22:09 |
corvus | \o/ that's definitely more than last time :) | 22:09 |
fungi | that was ~half an hour downtime, not terrible | 22:09 |
*** iurygregory has quit IRC | 22:10 | |
clarkb | corvus: fungi: any objections to me deleting zuul01 and its dns records once we're happy with zuul02s restart? | 22:10 |
corvus | clarkb: no objection | 22:11 |
fungi | clarkb: no objection | 22:11 |
fungi | i checked my homedir on it earlier | 22:11 |
clarkb | cool I'll do that as soon as fungi gives the all clear on 02 | 22:11 |
fungi | not that i have a habit of keeping things of any value on random servers | 22:11 |
fungi | reenqueue finished, doing cleanup now | 22:12 |
clarkb | I'm out of both root screens now too fwiw (I think you can close those up when you are happy with them fungi) | 22:12 |
fungi | all finished | 22:13 |
fungi | #status log Updated swap and log filesystem sizes on zuul02, and restarted all Zuul services on cdc99a3 | 22:14 |
openstackstatus | fungi: finished logging | 22:14 |
fungi | some builds have already succeeded | 22:16 |
fungi | i think it's good | 22:17 |
clarkb | infra-root I will delete zuul01.openstack.org with id ef3deb18-e494-46eb-97a2-90fb8198b5d3, does that look correct to you? | 22:17 |
clarkb | fungi: the failures I see appear to be actual failures which is another good sign | 22:18 |
*** iurygregory has joined #opendev | 22:19 | |
fungi | clarkb: that uuid looks like what openstack server show gives me for zuul01 | 22:19 |
clarkb | cool I'm going to issue the delete command now, thank you for double checking | 22:19 |
clarkb | the deletion is done. Doing dns cleanup then will status log it | 22:22 |
clarkb | also I didn't delete the zk01-03 dns records so will do that next | 22:23 |
clarkb | #status log Deleted zuul01.openstack.org (ef3deb18-e494-46eb-97a2-90fb8198b5d3) and its DNS records as zuul02.opendev.org has replaced it. | 22:25 |
openstackstatus | clarkb: finished logging | 22:25 |
ianw | clarkb: urgh, sorry about the missing *1024, what a mess | 22:26 |
clarkb | ianw: I reviewed the change too :) no worries | 22:27 |
clarkb | ianw: I think we are all done now as far as cleanup goes, but I want to see if I can figure out having ansible check all the hosts before I declare victory | 22:27 |
clarkb | zk01-03.openstack.org A and AAAA records are now cleaned up too | 22:27 |
ianw | fungi: yeah, on the permissions issue with haproxy, we do paper over a lot running as root | 22:29 |
ianw | i believe the way to do it most securely is with the user namespace stuff; so in the haproxy model | 22:30 |
ianw | https://review.opendev.org/c/opendev/system-config/+/791633/8/playbooks/roles/haproxy/tasks/main.yaml | 22:30 |
ianw | it creates a haproxy user as UID 99, so making /var/haproxy/* owned by 100099 on disk running with the "zuul" subuid makes things work | 22:30 |
ianw | although, the exact location the zuul subuid is created is still a bit of a mystery to me. i'm assuming we do it in ansible, somewhere? | 22:31 |
clarkb | infra-root I'm going to run bridge:~clarkb/playbooks/swap-inspector.yaml and see what that tells me to double check things | 22:35 |
clarkb | it relies on ansible fact gathering and debug module to emit info when a host has less than 128MB of swap total | 22:36 |
clarkb | fun that actually still shows zk04 as having too little swap because we cache facts | 22:37 |
clarkb | I guess I run it once, then double check my list is what we already fixed, then rm the cached facts for those hosts and rerun | 22:37 |
clarkb | there are actually a few servers that have no swap | 22:40 |
clarkb | that was not caused by the issue we have had with make_swap.sh so I'll ignore those for now (but maybe we want to swapfile them) | 22:40 |
clarkb | fungi: corvus check out the periodic jobs in openstack tenant. They are all listed as error | 22:42 |
clarkb | except for one | 22:42 |
clarkb | I'm deleting cache entries from /var/cache/ansible/facts for the hosts we just fixed swap on | 22:43 |
clarkb | swap lgtm for the hosts that had the tiny swap problem based on that playbook | 22:45 |
corvus | clarkb: ack | 22:46 |
clarkb | we should have the hourly opendev deploy jobs starting in about 11 minutes and we can cross check against that I guess | 22:49 |
clarkb | but there aren't a whole lot of pointers to why there were errors in the dashboard that I see | 22:50 |
clarkb | the hourly opendev deploy jobs seem to be running just fine (which gives more weight to corvus' explanation in #zuul) | 23:01 |
*** tosky has quit IRC | 23:43 | |
*** hamalq has quit IRC | 23:56 | |
*** hamalq has joined #opendev | 23:57 |
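
A note on the numbers in the parted exchange at 21:51: the sketch below works through why `mkpart primary linux-swap 8 8200` ends up reported as 7.6G by lsblk. It assumes parted read the unitless values as decimal megabytes (10^6 bytes), which is an assumption about the parted defaults in use and is not confirmed in the log. The helper names are hypothetical.

```python
# Arithmetic sketch only; it does not call parted.
# Assumption: unitless parted offsets were read as decimal megabytes.
MB = 10**6    # decimal megabyte
MIB = 2**20   # binary mebibyte
GIB = 2**30   # binary gibibyte, the unit lsblk uses for its "G" suffix


def partition_size_bytes(start, end, unit=MB):
    """Size in bytes of a partition running from start to end (in `unit`)."""
    return (end - start) * unit


def is_aligned(offset_bytes, alignment_bytes=MIB):
    """True when the byte offset falls exactly on an alignment boundary."""
    return offset_bytes % alignment_bytes == 0


# The swap partition from the log: start 8, end 8200 (8192 + 8 = 8200).
swap_bytes = partition_size_bytes(8, 8200)
swap_gib = swap_bytes / GIB
print(f"swap partition: {swap_gib:.1f} GiB")  # prints "swap partition: 7.6 GiB"

# A nominal 8 MB offset is not an exact 1 MiB multiple, while 8 MiB is.
print(is_aligned(8 * MB))   # False
print(is_aligned(8 * MIB))  # True
```

Under that assumption, 8192 decimal megabytes is about 7.6 GiB, which matches the lsblk output fungi called "close enough".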
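
The swap check clarkb describes at 22:35 to 22:43 (flag hosts whose gathered facts show less than 128MB of swap) can be sketched roughly as follows. This is not the contents of swap-inspector.yaml, which the log does not show. It assumes one JSON file per host in the fact cache directory and the fact key `ansible_swaptotal_mb`; both are assumptions about the cache layout, and `low_swap_hosts` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumed fact key and threshold; the real playbook is not shown in the log.
SWAP_FACT = "ansible_swaptotal_mb"
THRESHOLD_MB = 128


def low_swap_hosts(cache_dir, threshold_mb=THRESHOLD_MB):
    """Map hostname -> swap total (MB) for cached hosts below the threshold.

    Assumes each file in cache_dir is a JSON object of facts for one host,
    named after that host. Unreadable or non-JSON files are skipped.
    """
    low = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            facts = json.loads(path.read_text())
        except (ValueError, OSError):
            continue
        if not isinstance(facts, dict):
            continue
        swap_mb = facts.get(SWAP_FACT)
        # Hosts with no swap fact recorded are left out rather than guessed at.
        if swap_mb is not None and swap_mb < threshold_mb:
            low[path.name] = swap_mb
    return low
```

Calling `low_swap_hosts("/var/cache/ansible/facts")` would only report what was recorded at the last fact gather. As noted in the log, stale cache entries for hosts that were already fixed have to be removed and the facts gathered again before the result can be trusted.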