opendevreview | Clark Boylan proposed openstack/diskimage-builder master: Add Rockylinux 9 build configuration and update jobs for 8 and 9 https://review.opendev.org/c/openstack/diskimage-builder/+/848901 | 00:03 |
clarkb | fungi: I'm going to need to pop out for evening things momentarily. Happy to help with the cinder stuff in the morning. I think we'll be fine until then | 00:04 |
fungi | yeah, i'm contemplating adding the new volume this evening or waiting for friday | 00:09 |
Clark[m] | Tomorrow works for me | 00:11 |
fungi | yeah, that's probably better anyway | 00:14 |
*** rlandy|bbl is now known as rlandy|out | 01:10 | |
*** dviroel|rover|afk is now known as dviroel|rover | 01:17 | |
*** ysandeep|out is now known as ysandeep | 01:37 | |
*** join_subline is now known as \join_subline | 01:47 | |
opendevreview | Merged opendev/system-config master: production-playbook logs : don't use ansible_date_time https://review.opendev.org/c/opendev/system-config/+/849784 | 01:54 |
opendevreview | Merged opendev/system-config master: production-playbook logs : move to post-run step https://review.opendev.org/c/opendev/system-config/+/849785 | 01:54 |
opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/849909 | 02:25 |
ianw | i don't think run-production-playbook-post.yaml is working; but it also doesn't error | 02:33 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook-post: add bridge for playbook https://review.opendev.org/c/opendev/system-config/+/849919 | 02:48 |
*** ysandeep is now known as ysandeep|afk | 03:45 | |
opendevreview | Merged opendev/system-config master: run-production-playbook-post: add bridge for playbook https://review.opendev.org/c/opendev/system-config/+/849919 | 03:48 |
ianw | hrm, i guess the periodic jobs that are running checked out the repo at a state prior to ^ | 05:20 |
ianw | i'll merge the limnoria change and watch that. it's quiet now, no meetings | 05:21 |
*** ysandeep|afk is now known as ysandeep | 06:16 | |
ianw | oh interesting, system-config-run-eavesdrop timed_out in gate | 06:21 |
ianw | perhaps it was actually during log upload? | 06:24 |
ianw | it looks like it ran ok | 06:24 |
ianw | https://zuul.opendev.org/t/openstack/build/bcc13270b9f14dd28af70882949b6579/logs | 06:24 |
ianw | one data point ... this has an ara report too. the one we looked at the other day did as well | 06:27 |
ianw | that's a lot of small files ... | 06:27 |
ianw | that wouldn't really describe the infra-prod timeouts though; they don't have ara reports | 06:41 |
ianw | (only the system-config-run ones do) | 06:41 |
*** chandankumar is now known as chkumar|rover | 06:59 | |
opendevreview | Merged opendev/system-config master: Install Limnoria from upstream https://review.opendev.org/c/opendev/system-config/+/821331 | 07:41 |
opendevreview | lixuehai proposed openstack/diskimage-builder master: Fix build rockylinux baremetal image error https://review.opendev.org/c/openstack/diskimage-builder/+/849947 | 07:46 |
*** rlandy|out is now known as rlandy | 10:31 | |
opendevreview | Tobias Henkel proposed zuul/zuul-jobs master: Allow overriding the buildset registry image https://review.opendev.org/c/zuul/zuul-jobs/+/849989 | 11:04 |
opendevreview | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/849909 | 11:21 |
fungi | ianw: if they were timing out during log collection/upload, the build result would have been post_timeout | 11:25 |
*** dviroel|rover is now known as dviroel | 11:28 | |
*** ysandeep is now known as ysandeep|afk | 11:56 | |
ianw | fungi: yeah, this is true. it seems to be something else going on with the ansible on the bastion host just ... stopping | 12:04 |
opendevreview | James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996 | 12:18 |
opendevreview | James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996 | 12:19 |
opendevreview | James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996 | 12:21 |
fungi | trying to dig the reason for the post_failure on https://zuul.opendev.org/t/openstack/build/69dc9e01301848a9a8a3f45fd952a24b out of the debug log on ze10 and not having much luck | 13:27 |
fungi | no ansible tasks are returning failed in the post phase | 13:28 |
Clark[m] | fungi: very likely related to splitting the log management into post-run for those jobs. But I'm not sure why it would report failure if no tasks fail | 13:31 |
Clark[m] | Perhaps related to the inventory fix here: https://review.opendev.org/c/opendev/system-config/+/849919 though as noted there if no nodes match inventory you just skip successfully | 13:31 |
*** dasm|off is now known as dasm|ruck | 13:33 | |
fungi | oh. perhaps. i'll look for that signature | 14:02 |
fungi | 2022-07-15 12:22:17,931 DEBUG zuul.AnsibleJob.output: [e: 3ae9a107a3364e86bbfc5d0c7e59c499] [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible output: b'changed: [localhost] => {"add_host": {"groups": [], "host_name": "bridge.openstack.org", "host_vars": {"ansible_host": "bridge.openstack.org", "ansible_port": 22, "ansible_python_interpreter": "python3", "ansible_user": "zuul"}}, "changed": | 14:04 |
fungi | true}' | 14:04 |
fungi | seems like it ran (that was the output for "TASK [Add bridge.o.o to inventory for playbook name=bridge.openstack.org, ansible_python_interpreter=python3, ansible_user=zuul, ansible_host=bridge.openstack.org, ansible_port=22]") | 14:04 |
Clark[m] | And did the subsequent tasks in that playbook manage to run with the updated inventory? | 14:08 |
fungi | the play recap only mentions localhost | 14:15 |
fungi | maybe you can't update a playbook's inventory from within the playbook itself? | 14:16 |
*** ysandeep|afk is now known as ysandeep | 14:16 | |
fungi | though that would be silly since it would mean there's no point in ansible's add_host task | 14:17 |
fungi | there are some "no hosts matched" warnings toward the end of the log | 14:20 |
Clark[m] | That should be how the run playbook functions | 14:21 |
Clark[m] | Maybe something about the new add host is different and missing bits? | 14:21 |
fungi | after "TASK [Register quick-download link]" is recapped showing it applied to localhost, it starts to run the cleanup playbook and says "[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'" | 14:23 |
fungi | i think the failure must be prior to that though | 14:25 |
fungi | i just can't seem to find it | 14:26 |
fungi | as soon as the openstack release meeting wraps up in a few minutes, i need to run a couple of quick errands (probably 30 minutes) and then i'll get started adding the new cinder volume to review.o.o and migrating onto it | 14:38 |
Clark[m] | Sounds good. I can look at that post failure more closely after I've fully booted my morning | 14:38 |
fungi | it's probably not urgent, the job seems to have done the deployment things it was supposed to and collected/published logs, just ended with an as-of-yet inexplicable post_failure | 14:41 |
fungi | popping out for quick errands now, bbiab | 14:45 |
clarkb | I wonder if these timeouts are related to the problem albin is debugging with ansible and glibc | 14:49 |
clarkb | https://github.com/ansible/ansible/issues/78270 is the upstream bug report albin filed | 14:49 |
clarkb | we wouldn't have seen them before with ansible 2.9 I think but now we default to ansible 5 | 14:49 |
corvus | clarkb: i think Albin Vass said it was happening in older versions too? (iirc, an earlier disproved hypothesis was that ansible 5 might fix it) | 15:09 |
corvus | clarkb: but the recent container bump to 3.10 almost certainly changed our glibc, right? | 15:10 |
clarkb | corvus: no the 3.10 image and the 3.9 image are both based on debian bullseye and should have the same glibc | 15:11 |
corvus | ah | 15:12 |
*** dviroel is now known as dviroel|lunch | 15:16 | |
clarkb | fungi: [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible timeout exceeded: 1800 | 15:19 |
clarkb | it's the timeout issue that ianw was looking at. In post-run we run every playbook I think, so the failure happened before you were looking. It happened in the encrypt steps. Maybe a lack of entropy? | 15:20 |
clarkb | I think one possibility is the glibc problem albin discovered (maybe ansible 5 trips it far more often?) or something related to the leaking zuul console log files in /tmp | 15:24 |
clarkb | corvus: ^ re the leaking console log files is that a side effect of not running the console log streamer daemon? | 15:24 |
clarkb | corvus: we should be able to safely delete those files I think, but also i wonder if we can have a zuul option to not write them in the first place? | 15:24 |
clarkb | just doing a random sampling of those files, they all appear to be ansible task text that would end up in the console log. I'm pretty confident we can just delete the lot of them | 15:26 |
*** ysandeep is now known as ysandeep|out | 15:28 | |
AlbinVass[m] | clarkb: we had two different issues, one in Ansible 2.9 getting stuck in the pwd library (i believe), and Ansible 5 getting stuck in grantpt() with glibc 2.31, which is installed in the official zuul images (v6.1) | 15:35 |
clarkb | AlbinVass[m]: ok, that is good to know. I'm not sure that this problem is related to what you've seen but the symptoms seem similar. The job just seems to stop and then eventually timeout | 15:36 |
clarkb | AlbinVass[m]: I'd be willing to land your updated glibc change and see if it helps :) | 15:37 |
clarkb | let me go and review it now while it is fresh | 15:38 |
AlbinVass[m] | The only reason we detected it was because we had jobs hanging forever in cleanup :) | 15:38 |
clarkb | ya that is similar to what we are seeing now | 15:39 |
fungi | clarkb: thanks! i always forget to search for task timeouts, which should be the first thing i think of when i can't find any failed tasks (because the timed out task never gets a recap) | 15:39 |
*** rlandy is now known as rlandy|brb | 15:42 | |
clarkb | AlbinVass[m]: ok left some comments on that change. I think it is basically there though and worth a try | 15:46 |
clarkb | Otherwise we may want to undo the ansible 5 default update? or override it for system-config jobs at least | 15:46 |
clarkb | it looks like just about every infra-prod job is hitting those post failures now | 15:48 |
clarkb | I don't think that is the end of the world while we work through the gerrit thing. But be aware of that if landing changes to prod as they may not deploy as expected | 15:48 |
clarkb | I think we should finish up the gerrit thread then look at cleaning up /tmp on bridge to see if that helps then possibly do AlbinVass[m]'s zuul image update or force ansible 2.9 on the infra-prod jobs. I'll push up a change for infra-prod jobs now so that it is ready if we wish to do that | 15:49 |
clarkb | actually nevermind lets try the /tmp cleanup first | 15:49 |
fungi | clarkb: you noted the timeout was after the file encryption tasks, maybe we're consistently starving that machine for entropy when repeatedly running those? | 15:51 |
clarkb | fungi: yes that was another thought | 15:51 |
fungi | could be entirely unrelated to our ansible version | 15:51 |
clarkb | is there a way to check that? | 15:51 |
fungi | we could report the size of the entropy pool before tasks we think are likely to try to use it | 15:52 |
fungi | or we could check it with cacti or something | 15:52 |
clarkb | `find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30` for a listing of what we might want to delete | 15:53 |
clarkb | /proc/sys/kernel/random/entropy_avail appears to report good numbers currently | 15:54 |
fungi | fungi@bridge:~$ cat /proc/sys/kernel/random/entropy_avail | 15:54 |
fungi | 3413 | 15:54 |
fungi | yeah, not bad for now | 15:54 |
fungi | bigger question is whether we're doing more key generation and similar tasks on there during jobs recently | 15:55 |
fungi | the server seems to have a reasonable amount of available entropy, but the jobs could have started demanding an unreasonable amount | 15:56 |
clarkb | I don't think the total has changed. We only shifted where it ran (that was in response to this problem too so the issue was preexisting) | 15:57 |
*** marios is now known as marios|out | 15:57 | |
clarkb | that is the main reason why I suspected ansible 5 + glibc since it was a recent change | 15:57 |
clarkb | basically if we look at what changed ansible 5 is it and AlbinVass[m] has evidence of a very similar problem happening for them | 15:57 |
fungi | yeah, i agree that's a stronger theory | 15:58 |
clarkb | I think we can try something like `find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` if others agree those files look safe to prune. But don't expect that to help | 15:59 |
clarkb | (good cleanup either way) | 16:00 |
clarkb | But also removing the old index backup on review and expanding available disk there might be the higher priority? There are too many things and I don't want to try and do too many things at once :) | 16:00 |
clarkb | actually you know what, why don't we just flip infra-prod to ansible 2.9 again. it's a dirt cheap thing to change and we can toggle it back too | 16:03 |
clarkb | but that would give us a really good idea if it is related to ansible 5. change incoming | 16:03 |
fungi | sure | 16:03 |
fungi | sounds good to me | 16:03 |
*** rlandy|brb is now known as rlandy | 16:05 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Force ansible 2.9 on infra-prod jobs https://review.opendev.org/c/opendev/system-config/+/850027 | 16:07 |
fungi | i count 172663 console logs matching the glob pattern suggested above | 16:08 |
clarkb | ya there are just over 200k if you drop the mtime filter | 16:08 |
fungi | all the glob matches also match the regex ^/tmp/console-[^/]*-bridgeopenstackorg.log$ | 16:09 |
fungi | so i agree that looks safe | 16:09 |
clarkb | fungi: do you agree the content of those files isn't worth keeping after 30 days? | 16:10 |
clarkb | I'm not even sure it is worth keeping 30 days of them but figure we can start there | 16:10 |
clarkb | I guess the other thing we can do is limit the depth on the search so that subdirs that happen to have the same filename pattern don't get files deleted but that also seems unlikely | 16:12 |
fungi | yeah, spot-checking some at random, the contents don't really seem that useful to begin with | 16:12 |
clarkb | `find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` ? | 16:13 |
fungi | yes, that still works | 16:16 |
fungi | and is slightly safer, i suppose | 16:16 |
clarkb | ya as there are some subdirs. I guess I'll spin up a screen and get that ready to go? | 16:16 |
clarkb | fungi: screen started if you want to join and help double check | 16:17 |
clarkb | should I add a -print? | 16:17 |
clarkb | I think I've decided that is going to be too much noise with -print | 16:18 |
AlbinVass[m] | mind that we had other issues with ansible 2.9, I'm not sure why I haven't seen those before because the root cause for the deadlocks is the same (multithreaded process using fork) | 16:26 |
fungi | clarkb: sorry, stepped away for a moment, but i'm in the screen session now and that looks great | 16:28 |
AlbinVass[m] | * seen those when running zuul with ansible 2.9 before because | 16:28 |
fungi | clarkb: and i agree, -print is worthless for this unless you're just trying to see if it's hung | 16:28 |
clarkb | fungi: ok I'll run that now | 16:30 |
fungi | thanks | 16:31 |
clarkb | it's done. Still 38k entries for the last 30 days but much cleaner | 16:31 |
clarkb | and ya I don't expect that to help. But maybe it will | 16:31 |
fungi | yeah, that looks reasonable | 16:31 |
*** dviroel|lunch is now known as dviroel | 16:32 | |
clarkb | AlbinVass[m]: ya at this point I think it's fine for us to do some process of elimination. We ran under 2.9 for a while and it was fine there for us iirc | 16:33 |
clarkb | AlbinVass[m]: could be a timing issue or one environment being more likely to trip specific issues than others | 16:33 |
corvus | clarkb: nothing deletes the zuul console log files; so if we're talking bridge, we should have some kind of tmp cleaner | 16:59 |
fungi | corvus: bridge, correct | 17:00 |
fungi | these are in /tmp on bridge.o.o | 17:00 |
fungi | also /tmp was nowhere near filling up from a block or inode standpoint | 17:01 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept https://review.opendev.org/c/opendev/git-review/+/850054 | 17:28 |
clarkb | corvus: thank you for confirming I'll work on putting something together for that I guess | 17:35 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add cronjobs to cleanup Zuul console log files https://review.opendev.org/c/opendev/system-config/+/850059 | 17:48 |
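For reference, a minimal sketch of the kind of cron entry such a cleanup change might install, reusing the find command settled on above; the schedule and the exact form used in 850059 are assumptions here:

    # hypothetical crontab entry on bridge: prune Zuul console logs older than 30 days
    0 4 * * * find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete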
clarkb | fungi: I'm going to pop out soon for lunch, but let me know if I can be helpful re gerrit and I'll adjust timing as necessary | 18:20 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept https://review.opendev.org/c/opendev/git-review/+/850054 | 18:26 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061 | 18:26 |
fungi | clarkb: thanks, i'm switching to that now, since i'm done with the git-review distraction for the moment | 18:27 |
fungi | [Fri Jul 15 18:42:54 2022] virtio_blk virtio5: [vdc] 2147483648 512-byte logical blocks (1.10 TB/1.00 TiB) | 18:46 |
fungi | /dev/vdc1 lvm2 --- <1024.00g <1024.00g | 18:47 |
fungi | i have `pvmove /dev/vdb1 /dev/vdc1` running in a root screen session on review.o.o now | 18:48 |
fungi | that's after creating and attaching the cinder volume, creating a partition table on it with one large partition, writing a pv signature to the partition, and adding that pv to the existing "main" vg on the server | 18:50 |
fungi | i did make sure it was set as --type=nvme too | 18:50 |
fungi | (openstack volume show confirms it's the case too) | 18:51 |
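A hedged sketch of the volume setup fungi describes above; the new volume's name and the choice of partitioning tool are assumptions, while the --type nvme setting, the vdc device, and the "main" vg are taken from this log:

    # create and attach the new 1TiB cinder volume (the name here is illustrative)
    openstack volume create --size 1024 --type nvme review02.opendev.org/gerrit02
    openstack server add volume review02.opendev.org review02.opendev.org/gerrit02
    # single large partition on the new device, then add it to the existing vg
    parted -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%
    pvcreate /dev/vdc1
    vgextend main /dev/vdc1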
fungi | block migration is already 13% done, so it's going quickly | 18:51 |
fungi | hopefully the additional i/o isn't impacting gerrit performance too badly | 18:51 |
clarkb | cool. I think where I always get confused is the relationship between physical volumes, volume groups, and logical volumes | 18:51 |
clarkb | physical volumes are aggregated into volume groups to provide logical volumes that may be larger than any one physical volume. Is that basically it? | 18:52 |
fungi | [lvx][--lvy--][-lvz] | 18:52 |
fungi | [-------vg0--------] | 18:53 |
fungi | [---pva---][--pvb--] | 18:53 |
clarkb | ok ya that helps | 18:53 |
fungi | er, except for my indentation on that last line | 18:53 |
fungi | but yeah, a vg is just an aggregation of whatever block devices the kernel knows about, then you create logical volumes within a vg | 18:54 |
clarkb | and in this case we do the pvmove so that we can remove the other physical volume later. But then we have to expand the lv to stretch it out to match the larger size. Then after that the fs itself | 18:54 |
fungi | optional but yes | 18:55 |
fungi | the process is basically 1. add a new pv to the vg, 2. move the extents for the lv from the old pv to the new pv, 3. remove the old pv from the vg | 18:55 |
fungi | then later you can also 4. increase the size of the lv, 5. resize the fs on that lv, 6. profit | 18:56 |
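Pulling the steps enumerated above together, this is roughly the command sequence as it plays out later in this log (the device, vg, and lv names are the ones used on review02; step 1, extending the vg with the new pv, was sketched earlier):

    pvmove /dev/vdb1 /dev/vdc1                  # 2. move the lv's extents onto the new pv
    vgreduce main /dev/vdb1                     # 3a. drop the old pv from the vg
    pvremove /dev/vdb1                          # 3b. wipe its pv signature
    lvextend --extents=+100%FREE main/gerrit    # 4. grow the lv into the freed space
    resize2fs /dev/main/gerrit                  # 5. grow the ext4 filesystem to match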
clarkb | fungi: are vdb and vdc pvs or lvs? | 18:56 |
clarkb | (they are pvs I think?) | 18:57 |
fungi | we could also instead have done 1. add a new pv to the vg, 2. increase the size of the lv, 3. resize the fs | 18:57 |
fungi | the pvs (so-called "physical volumes") are the kernel block devices for the disks, so vdb and vdc in this case | 18:57 |
clarkb | but then we would have to keep both cinder volumes attached (which I think is less ideal long term) | 18:57 |
fungi | lv is the "logical volume" which spans one or more physical volumes within the volume group | 18:58 |
clarkb | and they are located under the /dev/mapper paths | 18:59 |
fungi | and yes, the second option i described would have been more if we just wanted to incrementally add to the existing physical volume set rather than migrating data from one to another. in the real world where you may be plugging another actual hunk of spinning rust into a server that's more typical, but with the virtual block devices in a cloud it usually makes more sense to migrate data to a | 19:00 |
fungi | larger pv where possible | 19:00 |
fungi | also some people simply don't know about pvmove and so don't realize it's possible to do this | 19:00 |
clarkb | fungi: last question, why do you need to add it to the volume group if we're essentially cloning an existing member of the vg onto the new pv? I guess just a bookkeeping activity? | 19:02 |
fungi | it's just how lvm is designed. the extents of an lv can be mapped to any pv in their vg. when you do a pvmove it's taking advantage of the copy-on-write/snapshot capabilities in lvm to update the address for an extent to a new location and then cleaning up the old one, so at any arbitrary point during the pvmove that lv is spread across the old and new pvs | 19:04 |
clarkb | got it | 19:05 |
fungi | since an lv can already span multiple pvs, it's almost a side effect of the other features | 19:05 |
clarkb | right | 19:05 |
fungi | but to have an lv spread across multiple pvs, they have to be in its vg, hence the initial step of adding the new pv to the vg | 19:06 |
fungi | technically we're "extending" the vg to include the new pv, and then later "reducing" the vg off of the old pv once there are no longer any lv extents in use on it | 19:07 |
clarkb | makes sense | 19:07 |
fungi | but since it's all happening underneath the lv abstraction, it's essentially invisible to the filesystem activity | 19:08 |
fungi | just (possibly lots of) added i/o activity for those devices while in progress | 19:09 |
clarkb | once the pvmove is done we could theoretically expand to ~1.25TB on the lv since the other pv will still be in the vg right? So you need to be careful with maths or remove the other pv first? | 19:10 |
fungi | yeah, that's why i suggested removing the old pv first before resizing the lv | 19:10 |
fungi | at the moment we have .25tb in use and 1.25tb free in the vg, so as long as i don't tell the lv to use more space before i remove the old pv, we'll be down to .25tb used and .75tb free out of 1tb in the vg, then i'll lvextend the lv to use the additional space and resize the fs after that | 19:12 |
fungi | pvmove is done now, and if you look at the pvs output /dev/vdb1 has 250.00g of 250.00g free, while /dev/vdc1 has 774.00g of 1024.00g free | 19:13 |
clarkb | I'll trust you on that (as I'm juggling lunch right now) | 19:13 |
fungi | so this is the point where i vgreduce off of /dev/vdb1 | 19:13 |
clarkb | but ya that all makes sense. The hard part is retaining this info for the future | 19:13 |
fungi | we have most of the commands documented at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management | 19:14 |
fungi | but also the lvm manpage is helpful | 19:14 |
clarkb | When we do the fs resize does that add more inodes (in theory we need them?) | 19:15 |
clarkb | iirc it will add them at the same ratio of blocks to inodes that the existing fs has and we do get new ones but we can't change the ratio? | 19:15 |
fungi | i think that depends on the fs, but ext4 should increase the inode count | 19:15 |
fungi | if memory serves, ext3 did not increase inode count relative to block count | 19:15 |
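A quick way to confirm whether a resize grew the inode table, assuming the ext4 device and mount point from this migration (the log only shows df -i being run afterwards):

    tune2fs -l /dev/main/gerrit | grep -i 'inode count'   # filesystem-wide inode total
    df -i /home/gerrit2                                   # inode usage on the mounted fs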
fungi | now i've done `vgreduce main /dev/vdb1` to take the old pv out of our "main" vg | 19:16 |
fungi | and `pvremove /dev/vdb1` to wipe the pv signature from the partition | 19:16 |
fungi | so `pvs` currently reports only one pv on the server, which is a member of the main vg, with 774.00g free | 19:17 |
fungi | and `vgs` similarly reports the vg has 774.00g available | 19:18 |
fungi | now to detach the old volume, which is always the iffy part of this process | 19:18 |
*** tobias-urdin is now known as tobias-urdin_pto | 19:19 | |
fungi | i used `openstack server remove volume "review02.opendev.org" "review02.opendev.org/gerrit01"` and after a minute it returned. now `openstack volume list` reports review02.opendev.org/gerrit01 is available rather than in-use | 19:21 |
clarkb | and /dev on review02 probably doesn't show that device anymore? | 19:21 |
fungi | correct, we have vda and vdc but nothing between | 19:22 |
fungi | interestingly, nothing gets logged in dmesg during that hot unplug action | 19:22 |
fungi | and now i've done `openstack volume delete review02.opendev.org/gerrit01` so we're cleaned up on the cloud side of things | 19:23 |
fungi | next is the lv and fs resizing | 19:23 |
fungi | i've run `lvextend --extents=100%FREE main/gerrit` which tells it to grow the gerrit lv into the remaining available extents of the main vg | 19:25 |
fungi | "Size of logical volume main/gerrit changed from <250.00 GiB (63999 extents) to 774.00 GiB (198144 extents)." | 19:26 |
clarkb | shouldn't it be 1TB instead of 774? | 19:26 |
fungi | d'oh, yep! | 19:27 |
fungi | i missed a + | 19:27 |
fungi | thanks for spotting that | 19:27 |
fungi | lvextend --extents=+100%FREE main/gerrit | 19:27 |
fungi | Size of logical volume main/gerrit changed from 774.00 GiB (198144 extents) to <1024.00 GiB (262143 extents). | 19:27 |
clarkb | out of curiosity would it have errored if the 100%FREE was less than the current consumed size? | 19:28 |
clarkb | (or did we just get really lucky?) | 19:28 |
fungi | it would have errored, yes. lvextend can only increase. you have to use lvreduce to decrease | 19:28 |
clarkb | the command is extend so ya | 19:28 |
clarkb | makes sense | 19:28 |
fungi | now pvs and vgs show 0 free blocks, and lvs shows the gerrit lv is 1tb | 19:28 |
corvus | glad they retired lvbork | 19:29 |
fungi | the swedish chef was bummed though | 19:29 |
fungi | resize2fs /dev/main/gerrit | 19:30 |
fungi | The filesystem on /dev/main/gerrit is now 268434432 (4k) blocks long. | 19:30 |
fungi | Filesystem Size Used Avail Use% Mounted on | 19:30 |
fungi | /dev/mapper/main-gerrit 1007G 220G 788G 22% /home/gerrit2 | 19:30 |
clarkb | any idea if the inode count changed with that expansion? | 19:31 |
fungi | #status log Replaced the 250GB volume for Gerrit data on review.opendev.org with a 1TB volume, to handle future cache growth | 19:31 |
opendevstatus | fungi: finished logging | 19:32 |
clarkb | (I don't really expect inodes to be a problem considering why we expanded. Mostly just curious) | 19:32 |
fungi | df -i says we have 1% of inodes in use on that fs | 19:32 |
clarkb | cool. I feel like I have learned things | 19:32 |
fungi | so while i think it did increase available inodes x4, it's unlikely to matter | 19:33 |
fungi | now i just need to do two more of these on the ord backup server and one on the ord afs server before the end of the month | 19:33 |
fungi | (for upcoming announced storage maintenance in that region) | 19:34 |
clarkb | cacti reflects the changes too fwiw | 19:34 |
fungi | awesome | 19:34 |
fungi | i've closed out my root screen session on review.o.o now | 19:38 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061 | 19:39 |
clarkb | and now we monitor the cache disk usage | 19:39 |
clarkb | I marked the change to disable the caches as WIP so we know that is a fallback we aren't ready for yet | 19:39 |
fungi | thanks | 19:43 |
*** dviroel is now known as dviroel|biab | 20:09 | |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061 | 20:23 |
*** dviroel|biab is now known as dviroel | 21:08 | |
*** dasm|ruck is now known as dasm|off | 21:37 | |
*** dviroel is now known as dviroel|out | 21:37 | |
clarkb | fungi: dpawlik1 can we abandon https://review.opendev.org/c/opendev/system-config/+/833264 ? | 21:40 |
fungi | yeah, there's info at https://governance.openstack.org/sigs/tact-sig.html#opensearch and https://docs.opendev.org/opendev/infra-manual/latest/developers.html#automated-testing has been cleaned up. https://docs.openstack.org/project-team-guide/testing.html#automatic-test-failure-identification probably does need updating though | 21:45 |
fungi | i've abandoned it, thanks for the reminder | 21:47 |
clarkb | thanks | 21:47 |
fungi | sure thing | 21:50 |
ianw | clarkb: thanks for looking into that. the 2.9 check would probably be good | 22:24 |
clarkb | ianw: if 2.9 is better then I strongly suspect AlbinVass[m] did all the work :) | 22:25 |
ianw | i've approved it and will check in on it after the weekend | 22:25 |
ianw | deadlock sounds like it. i poked at all the processes i could see, and nothing seemed to be doing or waiting for anything obvious | 22:26 |
clarkb | One thing I wonder about is whether or not we should revert the changes to the run playbook. But I think this is still strictly an improvement to separate the logging from the actions | 22:26 |
ianw | i guess that was a response to the timeouts i had already randomly seen -- so it was happening before that | 22:27 |
clarkb | yes I think it was happening before, just more reliably now | 22:27 |
ianw | i'm not sure the encryption jobs really require entropy, as they're using pre-generated keys. but it's a thread to pull | 22:28 |
ianw | i'd also believe something about gpg agents, sockets, <insert hand wavy actions here>... | 22:29 |
clarkb | ianw: my suspicion is that the new order of operations just happens to trip the deadlock more reliably | 22:29 |
clarkb | it was happening before less reliably | 22:30 |
clarkb | but now we've managed to align timing/whatever to make it happen 100% of the time | 22:30 |
opendevreview | Merged opendev/system-config master: Force ansible 2.9 on infra-prod jobs https://review.opendev.org/c/opendev/system-config/+/850027 | 22:30 |
ianw | oh annoying, the POST failure is different | 22:32 |
corvus | clarkb: to be clear -- 850027 is temporary for data collection? | 22:32 |
clarkb | corvus: yes | 22:32 |
ianw | https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&skip=0 | 22:32 |
clarkb | corvus: if that makes it better then I think we should strongly consider AlbinVass[m]'s docker image update | 22:33 |
corvus | when will we have data? | 22:33 |
clarkb | corvus: jobs will be triggered in 27 minutes in the deploy pipeline | 22:33 |
corvus | and one buildset is sufficient? | 22:34 |
clarkb | corvus: yes I think so. we had a 100% failure on the last hour's buildset | 22:34 |
clarkb | er sorry it's the opendev-prod-hourly pipeline not deploy | 22:34 |
clarkb | but ya runs hourly. Last hour was 100% failure. In ~26 minutes the first job that indicates whether or not this is happier should run | 22:35 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars https://review.opendev.org/c/opendev/system-config/+/850077 | 22:36 |
clarkb | ianw: oh does that need to land before we get consistent results? | 22:36 |
clarkb | and/or should we just revert back to the way it was for simplicity then work from there? | 22:37 |
ianw | i dunno, reverting back means we get no logs, which is also unhelpful | 22:37 |
clarkb | ianw: well if 2.9 fixes it we would get logs | 22:37 |
clarkb | ianw: I'm not sure your fix will fix it since this is a separate playbook run | 22:38 |
clarkb | is _log_timestamp cached? | 22:38 |
ianw | now i've sent that i'm wondering if the way we dynamically add bridge might make that not work | 22:38 |
clarkb | ya I'm not sure that will work | 22:38 |
ianw | it would work across playbooks, but i'm not sure with the dynamic adding of the host | 22:39 |
ianw | hrm | 22:39 |
clarkb | I think you can just get a new timestamp value in the post-run playbook | 22:41 |
clarkb | since the two uses in the run playbook and the post-run playbook are detached from each other (though them lining up is a bit nice it isn't necessary) | 22:42 |
ianw | yeah i can do that for now, and try passing the variable around later when it's all working | 22:42 |
clarkb | I think if the jobs fail because that var is undefined that is a good indication that ansible 2.9 fixes the other problem at least | 22:42 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars https://review.opendev.org/c/opendev/system-config/+/850077 | 22:45 |
ianw | right, because they all got far enough to try renaming the file, rather than hanging | 22:45 |
ianw | well actually, all those ran under 2.12? | 22:46 |
clarkb | the example I looked at earlier today didn't get that far. It failed due to a timeout right around encrypting things and it ran under 2.12 | 22:47 |
clarkb | I guess I haven't looked at the others; we may have 100% failure due to either one of the things | 22:47 |
clarkb | corvus: ^ related we seem to report POST_FAILURE when we time out in post-run | 22:47 |
clarkb | ianw: ah yup seems at least some of the failures are due to the missing var. corvus I was mistaken I assumed 100% of the failures were due to the timeouts but that isn't the case so we won't know as quickly as I hoped | 22:48 |
clarkb | (sorry I was juggling like 4 different issues this morning...) | 22:49 |
ianw | right, i think we a) fix the missing var and b) i'll check in on it on monday with fresh eyes and see if we're still seeing timeouts | 22:49 |
ianw | https://zuul.openstack.org/build/304cacd67c2e459cbb69e0af5b24963f was the job i was looking at last night that looked stuck (at the time, and eventually timed out) | 22:49 |
ianw | for reference, https://paste.opendev.org/show/bvelHlomXJwkNpCmKO6r/ is the console log which i had open on it | 22:51 |
ianw | i agree the last thing before the timeout was | 22:51 |
ianw | 2022-07-15 09:36:45.327346 | TASK [encrypt-file : Check for existing key] | 22:51 |
ianw | this means it actually ran the whole playbook and has done everything up to setting up logs for storage | 22:52 |
ianw | that is not a very exciting task -> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/encrypt-file/tasks/import-key.yaml#L3 ; just calls gpg --list-keys in a command: | block | 22:53 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook-post: generate timestamp https://review.opendev.org/c/opendev/system-config/+/850077 | 22:54 |
ianw | clarkb: ^ fixed that change typo to avoid confusion | 22:54 |
clarkb | +2 thanks | 22:55 |
clarkb | and ya once that one lands we should get much clearer info | 22:55 |
ianw | i've run gpg --list-keys in a loop on bridge as "zuul" and haven't seen anything hang, so doesn't seem as obvious as gpg being silly | 22:56 |
ianw | and last night, when i was poking, i did not see a gpg process attached to any of the ansible; all that was running was an "ansiballz" python glob thing afaics | 22:57 |
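A sketch of the sort of loop test described above; the exact invocation is an assumption, and the timeout is added here so a hung gpg call would surface instead of blocking the loop:

    sudo -u zuul bash -c 'for i in $(seq 1 500); do
      timeout 30 gpg --list-keys >/dev/null || echo "gpg hung or failed on iteration $i"
    done'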
corvus | clarkb: i'm not sure reporting post_failure when a post playbook times out is wrong? | 23:11 |
corvus | the idea of post_failure is that it's saying "the main thing worked, but the post-run stuff did not" | 23:12 |
corvus | ie, "your change is not broken, but your job cleanup is" | 23:13 |
clarkb | corvus: would POST_TIMEOUT make sense though? | 23:13 |
clarkb | similar to how we have TIMEOUT and POST_FAILURE | 23:13 |
corvus | i guess if that's useful? | 23:13 |
clarkb | I think for me its mostly to reduce confusion over an actual failure or a timeout in post since we have that in run | 23:13 |
corvus | hrm | 23:13 |
clarkb | in this case it would have distinguished between the two different failures we are currently seeing | 23:14 |
corvus | i sort of disagree -- a post-run script should not time out | 23:14 |
corvus | yeah, but i mean that's just a happenstance of these 2 failures | 23:14 |
corvus | a test run timeout tells you maybe your unit tests run too long | 23:14 |
clarkb | ya I'll have to think about it a bit more. Current thought is that it is mostly surprising considering the differentiation elsewhere | 23:15 |
clarkb | I think it may have tripped fungi up too when trying to determine why the job we debugged earlier today had failed | 23:16 |
corvus | it does have some signal to it -- it tells you something like "your job broke in post run because your system is completely hosed" vs "your job broke in post run because it didn't produce a file that was expected" | 23:16 |
corvus | but i personally am not convinced it's enough signal to make it a result so that you can scan the build reports for it | 23:17 |
fungi | yeah, we've had a few different chronic situations recently which caused some task to timeout in post-run. i should just be more diligent about mining executor logs for "timeout" in addition to failed=[^0] | 23:34 |
fungi | i think to check for task failures, but then forget that if there aren't any it could be because a task timed out and so we didn't get output from it | 23:35 |
fungi | or even a play recap | 23:35 |