Tuesday, 2022-09-13

fricklerianw: well cirros uses the ubuntu kernel, so that would likely be a match. I just don't understand why the same thing doesn't seem to be happening with the cloud image that neutron uses in their tests. but I can easily set up a test with the added cmdline parameter for cirros04:41
ianwfrickler: hrm, does that use nested virt?  probably doesn't happen on binary translation i guess05:17
fricklerianw: yes, nested virt. https://opendev.org/openstack/neutron-tempest-plugin/src/branch/master/zuul.d/base-nested-switch.yaml05:27
fricklerif I switch back to qemu, the issue indeed doesn't happen05:28
fricklerbut booting a full ubuntu image under qemu takes ages, it would be fine for the cirros based tests though05:29
fricklerhttps://review.opendev.org/c/openstack/neutron-tempest-plugin/+/854910 is the root of my test stack. note that even jobs that aren't affected are failing (due to cirros not supporting network setup via configdrive plus some multicast issue)05:31
ianwfrickler: if you have a sec to just double-check https://review.opendev.org/c/opendev/system-config/+/857241 that is for the translate db backup failure you noted, thanks06:09
fricklerinfra-root: seems there is a tmux session on mirror-update since May 22 locking debian?07:09
fricklerkolla is seeing failures due to what looks like an outdated debian mirror, not 100% sure if related, but looks strange to me07:10
zigoHi. Is it expected that https://docs.openstack.org/releasenotes/ceilometer/unreleased.html is empty?09:13
zigo(rc1 just got released...)09:13
fricklerlooking closer, the tmux session doesn't cause issues other than confusing me, the flock has been exited10:14
fricklerhowever, debian seems to have had a large update on friday that may have timed out the reprepro run; there is a stale lockfile from that date10:15
fricklerI'm tempted to just remove it and let things rerun, but maybe taking the lock manually might be needed, waiting for others to jump in10:15
zigofrickler: Maybe that's related to the point release of this week-end, where both buster and bullseye got a new point release?10:24
fricklerthat would likely explain the rush of updates, yes10:25
zigoFYI, this is the last Buster point release, as it is now handled by the LTS team rather than all DDs.10:27
zigoThis time, OpenStack (rocky) will be part of the supported list of software.10:27
zigoSo we get 5 years of support for OpenStack in Debian from now on, which I'm very happy about.10:28
fungizigo: about the release notes, i think the way reno handles that is it sorts them under the release candidate once it exists, so they moved to https://docs.openstack.org/releasenotes/ceilometer/zed.html when rc1 was tagged (because that's the branch point for stable/zed and all master branch development now is targeting 2023.1/antelope)11:26
fricklerwhen I checked earlier, the Zed renos weren't there yet, either. IIRC it needs a patch in ceilometer to be merged after branching. not sure whether in master or zed or both11:37
zigoI guessed it was updated right after I wrote here ! :)11:54
fungiyeah, there's likely a period between branching and merging the initial autogenerated change to the stable branch where those release notes are in limbo11:58
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Fix update_target definition for service-types-authority  https://review.opendev.org/c/openstack/project-config/+/85743513:38
fricklergtema: dtantsur: ^^ seems that's an easy fix13:39
fricklerianw: seems that was two typos in one patch. the other was found in 2020, this one made it longer ;) https://opendev.org/openstack/project-config/commit/e1d227bf5b70f531ca608202d47d5797536c67ad13:43
fricklerfungi: clarkb: do you have time for a quick look at 857435?13:51
fungifrickler: done14:21
opendevreviewMerged openstack/project-config master: Fix update_target definition for service-types-authority  https://review.opendev.org/c/openstack/project-config/+/85743514:28
fricklerdtantsur: would you consider merging the venus patch to verify ^^? or should I rather rerun the failed post job?14:33
dtantsurfrickler: so, merge https://review.opendev.org/c/openstack/service-types-authority/+/857080 ?15:27
fricklerdtantsur: if you will, yes. I created the followup15:28
clarkbfrickler: did the mirror-update situation for debian get resolved?15:52
clarkbfrickler: I'm looking at the server and there are no flock processes currently. The files exist on disk but that doesn't mean they are locked iirc. You need an active process to hold open the lock I think15:55
fricklerclarkb: no, there still is this file: -rw------- 1 10004 root 0 Sep 10 14:10 /afs/.openstack.org/mirror/debian/db/lockfile16:02
fricklerand /var/log/reprepro/debian.log says it is not doing anything because of that16:03
frickleron mirror-update16:04
clarkboh a reprepro lock file not our crontab flock locks.16:13
clarkbThe man page indicates that this can happen if reprepro is interrupted inappropriately (and indicates that sort of thing can also lead to db corruption :/)16:13
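The distinction drawn above (an flock needs a live process holding the file open, whereas the reprepro lockfile blocks runs merely by existing) can be illustrated with a minimal Python sketch. The helper name `is_flocked` is hypothetical, and the sketch assumes a Linux host with ordinary local-filesystem flock semantics; behaviour over AFS may differ.

```python
import fcntl
import os


def is_flocked(path: str) -> bool:
    """Return True if some other open file description holds an flock on path.

    A file that merely exists on disk is not locked; only a live holder
    of the file (typically a running process) makes this return True.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # Non-blocking attempt: fails immediately if someone holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # We got the lock ourselves, so nobody else had it; release it again.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

A leftover file from an interrupted cron job would report `False` here, while the stale `db/lockfile` case is different: there the presence of the file itself appears to be what stops reprepro, so a check like this would not detect it.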
clarkbafs: Warning: We are having trouble keeping the AFS stat cache trimmed down under the configured limit (current -stat setting: 15000, current vcache usage: 91072). afs: If AFS access seems slow, consider raising the -stat setting for afsd.16:14
clarkbnoticed that when checking dmesg for OOMs16:14
clarkbprobably unrelated, but maybe something we should update as well16:14
clarkbmy hunch here is that we'll have to manually remove that lockfile as no reprepro process for debian is running16:15
clarkband then see if the db needs fixing (clearing and starting over?)16:15
clarkbfungi: ^ you may know?16:15
fungii think we can just delete the lockfile and let it try to sync again. if it complains about a corrupt db then we may have to force it to rebuild16:18
clarkband that will require aklog because it is in the afs fs16:22
* clarkb does this16:22
clarkbit will next run in about an hour and 45 minutes16:25
clarkbThe forest fire smoke has largely cleared out of here today. I'm going to get on the bike momentarily. Should be back around when that starts and for the meeting16:25
opendevreviewFelipe Reyes proposed openstack/project-config master: Add Keystone OpenID Connect charm to OpenStack charms  https://review.opendev.org/c/openstack/project-config/+/85749216:28
*** jpena is now known as jpena|off16:35
opendevreviewFelipe Reyes proposed openstack/project-config master: Add Keystone OpenID Connect charm to OpenStack charms  https://review.opendev.org/c/openstack/project-config/+/85749216:35
*** dviroel|lunch is now known as dviroel16:57
clarkblooks like reprepro is running for debian now18:15
opendevreviewClark Boylan proposed opendev/system-config master: Update python builder and base image  https://review.opendev.org/c/opendev/system-config/+/85653718:40
clarkbthe base images updated about an hour ago and appear to include the debian updates so we don't need to manually install libc updates18:41
fungidb_close(contents.cache.db, compressedfilelists): BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery18:52
fungii guess that's for after the meeting, i can take a closer look then18:52
clarkbfungi: that was from reprepro?18:57
ianwinteresting that https://grafana.opendev.org/d/9871b26303/afs?orgId=1 is showing releases as of 3 days ago19:03
ianwoh right; -rw------- 1 10004 root 0 Sep 10 14:10 /afs/.openstack.org/mirror/debian/db/lockfile would match that19:04
fungiyep, precisely19:05
ianwSecure Connection Failed19:23
ianwAn error occurred during a connection to lists.opendev.org. Peer reports it experienced an internal error.19:23
ianw^ this is a weird thing i see in my work vm.  but not the browser outside that 19:23
clarkbianw: this is to https not smtp + ssl?19:31
clarkbjust want to make sure we're looking at the right server with ssl19:31
ianwyeah, sorry https.  it must be a local thing, i can't replicate19:32
fungidifferent operating systems?19:34
clarkbianw: I think it is fair to ask that grafana follow typical norms for tagging images on docker hub20:03
clarkbianw: it is highly unexpected that beta releases end up in :latest20:03
clarkbsimilarly it was highly unexpected for jitsi meet to stop tagging :latest entirely and left us behind :/20:03
fungithough they do have a stable tag we've switched to20:05
ianwhttps://github.com/grafana/grafana/discussions/47177 is from when this came up before20:07
clarkbthe current latest-ubuntu points to the 9.1.5-ubuntu tag so that seems to be an actual release20:08
clarkbbut latest doesn't seem to point at 9.1.520:08
clarkboh wait it does I opened bad tabs?20:08
clarkbwow for some reason docker hub shows me armv7 by default for latest and amd64 for 9.1.520:09
fungion the other hand, zuul's latest tag follows tip of master20:09
clarkbfungi: zuul is also expected to be functional on every commit and doesn't do beta releases20:10
clarkbthough I guess it does sometimes have required upgrade steps20:10
fungiand makes releases, which some people do use20:11
clarkbconsidering :latest is currently 9.1.5 I'm ok with trying it again. This was the first time we ran into problems? ianw maybe add a comment to the docker-compose yaml file suggesting a rollback to the current latest release version tag should something go wrong, just as a hint that latest here can include betas?20:12
clarkband now I must find lunch20:12
ianwgrafana has had 5 releases in two weeks  (9.0.8 -> 9.1.5) ... we're not watching closely enough to keep up with that and manual bumps20:13
ianwa proposal bot update might be something, but i don't think it's worth the effort20:13
ianwi think we're better to assume things will work, and if they don't pin and investigate.  anyway, that's what i said in the review, so now in a loop :)  i'll go with the crowd decision20:14
fungimakes sense20:17
clarkbthinking about making the multi node known hosts role quicker, does anyone know if simply appending our keys to the known hosts file may result in errors? for example what happens if we add a different rsa host key for an existing entry? I'd like to avoid rewriting too much of the wheel here since ansible already has a known_hosts module. If I can do a simple append and call it a day20:55
clarkbthat might be worthwhile20:55
clarkbI guess another more involved option is to fork the known_hosts module and update it to take a list of entries so that it can add them all at once rather than one at a time20:55
fungiyou can have duplicates, afaik21:03
fungias long as they aren't different keys of the same type for the same host21:04
clarkbya so maybe having a small module that simply appends all keys at once to the file is a good improvement21:05
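The "append everything at once" idea can be sketched in Python as below. This is only an illustration of the rule of thumb stated above (duplicates are fine, a different key of the same type for the same host is treated as a conflict); OpenSSH itself may be more permissive. The function names are hypothetical, this is not the Ansible known_hosts module, and hashed hostnames and marker lines such as `@cert-authority` are not handled.

```python
from typing import Dict, Iterable, List, Tuple


def parse_entry(line: str) -> Tuple[str, str, str]:
    """Split a plain 'hosts keytype key [comment]' known_hosts line."""
    hosts, keytype, key = line.split()[:3]
    return hosts, keytype, key


def merge_known_hosts(existing: Iterable[str], new: Iterable[str]) -> List[str]:
    """Return the lines from `new` that should be appended to the file.

    Exact duplicates are skipped; a different key of the same type for a
    host that is already present raises ValueError instead of appending.
    """
    seen: Dict[Tuple[str, str], str] = {}
    for line in existing:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hosts, keytype, key = parse_entry(line)
        for host in hosts.split(","):
            seen[(host, keytype)] = key

    to_append: List[str] = []
    for line in new:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hosts, keytype, key = parse_entry(line)
        adds_something = False
        for host in hosts.split(","):
            known = seen.get((host, keytype))
            if known is None:
                adds_something = True
            elif known != key:
                raise ValueError(f"conflicting {keytype} key for {host}")
        if adds_something:
            for host in hosts.split(","):
                seen[(host, keytype)] = key
            to_append.append(line)
    return to_append
```

The returned list can then be written to the file in a single append, which is the speed-up being discussed compared with adding one entry per task.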
ianwin case you were wondering i must have been testing lists.opendev.org and had a hosts override, and that ip is now live but some other site :)21:06
ianwre the SSL error previously21:06
fungiianw: oh, you were probably helping with the mm3 testing21:11
fungiand yeah, those held nodes are now gone21:11
ianwi've got a root screen on mirror-update and will poke at this debian reprepro21:14
ianwi get the feeling "Warning: We are having trouble keeping the AFS stat cache trimmed down under the configured limit" seems more likely to mean "i can't talk properly and too much is buffering" more than "i'm too busy"21:18
ianwnone of the afs servers seem to have any errors21:21
fungiahh, thanks. it looks like there's a debian update in progress, but once the repos which aren't in trouble are done, it's probably best to just remove /afs/.openstack.org/mirror/debian/db21:26
fungi"recovering" the berkeleydb it uses is unlikely to be all that helpful, and would be better to tell reprepro to generate a new db (however long that takes)21:27
ianwfungi: yeah, i have the lock right now so that's probably what is being seen21:28
fungioh, that's you21:28
ianwjust double checking, i'm not seeing anything in the openafs logs indicating an error21:29
fungiyeah, now i see the flock is parented to a screen session. i forgot to give ps the f flag21:29
ianwubuntu.log:Starting ForwardMulti from 536870950 to 536870950 on afs02.dfw.openstack.org (as of Tue Sep 13 18:45:06 2022).21:30
ianw[Tue Sep 13 18:45:46 2022] afs: Warning: We are having trouble keeping the AFS stat cache trimmed down under the configured limit (current -stat setting: 15000, current vcache usage: 639641).21:30
ianwthey must line up21:30
fungiindeed, they do21:30
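To compare these kernel warnings over time, the two numbers in the message can be pulled out with a small parser. This is a rough Python sketch: the class and function names are hypothetical, and the regular expression is based only on the message wording quoted in this log.

```python
import re
from typing import NamedTuple, Optional

_WARNING_RE = re.compile(
    r"current -stat setting: (\d+), current vcache usage: (\d+)"
)


class StatCacheWarning(NamedTuple):
    stat_setting: int
    vcache_usage: int

    @property
    def overshoot(self) -> float:
        """How many times larger the vcache usage is than the -stat limit."""
        return self.vcache_usage / self.stat_setting


def parse_stat_warning(line: str) -> Optional[StatCacheWarning]:
    """Extract the -stat setting and vcache usage from one dmesg line."""
    match = _WARNING_RE.search(line)
    if match is None:
        return None
    return StatCacheWarning(int(match.group(1)), int(match.group(2)))
```

Applied to the two warnings quoted in this log, the usage figures are 91072 and 639641 against the same configured limit of 15000.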
fungii wonder if something was going on in rackspace21:31
ianwwhich isn't debian ... so there's no smoking gun there21:31
ianwwe've been seeing that since ... [Fri May 13 17:48:02 2022]21:32
fungiuptime on all the servers involved is quite lengthy, so nothing seems to have spontaneously rebooted at least21:32
ianwi think it might be worth a reboot of mirror-update; its uptime is such that a new kernel update won't hurt21:35
ianwi'm going to quickly do that now, because it's briefly not doing anything21:36
fungiyeah, good call21:38
fungiclarkb: which job for 855292 were you holding before?21:41
fungioh, looks like it would have been system-config-run-lists3. also does zuul-client still need to filter via --ref=refs/changes/92/855292/7 or has that been improved? looks like it's been a while since i set an autohold, judging from my command history21:43
ianwit's contents.cache.db that seems corrupt21:43
fungiyeah, that's the one it was complaining about in the logs21:44
fungithough i'm unsure what reprepro will do if you remove just one db file21:44
ianwhttps://docs.opendev.org/opendev/system-config/latest/reprepro.html does not discuss recreating that one :/21:44
ianw"new format for contents.cache.db. Only needs half of the disk space and runtime21:46
ianw  to generate Contents files, but you need to run translatefilelists to translate21:46
ianw  the cached items (or delete your contents.cache.db and let reprepro reread21:46
ianw  all your .deb files)"21:46
ianwthat suggests that deleting the contents.cache.db might just have reprepro rebuild it21:46
ianw(from a changelog note)21:46
fungiworth a try. will take a while of course since it's reading them all back over afs21:47
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM force mm3 failure to hold the node  https://review.opendev.org/c/opendev/system-config/+/85529221:47
ianwi guess i'll move it out of the way and re-run it21:47
fungiand then probably find something else to do for a very looooong while21:48
ianw"reprepro update" has not complained (so far) and there is a new contents.cache.db21:52
ianwhrm, it also seemed to exit and not do anything21:53
fungimaybe it doesn't read the file contents, just size/date et cetera?21:55
fungiin which case it might rebuild quickly21:55
ianwit might be that it's in sync21:55
fungiit was pulling the files down, but failing to update the cache db21:56
ianwi'm just letting the full script run now21:57
clarkbfungi: system-config-run-lists3 sorry just got back from school run22:01
clarkblooks like you found it /me double checks a listing22:02
clarkbfungi: if you look at the listing you can see how to do it for arbitrary patchsets using a regex but ya that looks right22:02
ianwthe contents.cache.db is growing now -- https://review.opendev.org/c/zuul/zuul/+/857517 seems to be writing it22:03
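The ref in `--ref=refs/changes/92/855292/7` follows Gerrit's change ref layout: the last two digits of the change number, then the change number, then the patchset. A small Python sketch of building it is below. The helper names are hypothetical, and the "any patchset" pattern is an assumption, since the exact regex from the autohold listing is not shown in this log.

```python
import re


def change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref for one patchset of a change."""
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"


def any_patchset_regex(change: int) -> str:
    """Build a regex matching every patchset ref of a change (assumed form)."""
    return f"^refs/changes/{change % 100:02d}/{change}/.*$"


# Reproduces the ref used in the autohold command above.
print(change_ref(855292, 7))
```

The single-patchset form pins the hold to one revision, so it has to be reissued after each new upload; the regex form trades that precision for convenience.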
ianwbah, paste error22:04
ianwreprepro --confdir /etc/reprepro/debian export22:04
ianwunfortunately i've run that under timeout :/22:05
clarkblooking at the afs -stat thing /etc/openafs/afs.conf has an OPTIONS value currently set to OPTIONS=AUTOMATIC which means that openafs can automatically determine what -stat should be (among a bunch of other options)22:08
clarkbIt isn't clear to me if we override and set only -stat if all the other options will end up with different chosen defaults22:08
clarkblooking at the init script I think changing that value to something other than AUTOMATIC won't cause different values for the unset flags22:10
clarkbafs.conf.client is sourced before OPTIONS is set in afs.conf so we can't set that value in the config file we already manage :/22:13
clarkbif we want to change that value I think we need to add afs.conf to the ansible role and change the value. I'll make a change for that and we can discuss further on that22:16
opendevreviewIan Wienand proposed opendev/system-config master: mirror-update: make jobs interactive by default  https://review.opendev.org/c/opendev/system-config/+/85751922:24
ianw^ i feel like we've discussed that before22:27
clarkbianw: ya https://review.opendev.org/c/opendev/system-config/+/84021422:28
opendevreviewClark Boylan proposed opendev/system-config master: Up openafs client -stat value  https://review.opendev.org/c/opendev/system-config/+/85752022:29
ianwoh weird, that's actually a sourced shell script22:32
clarkbugh either the job finishes in ~10 minutes or it times out https://review.opendev.org/c/opendev/system-config/+/85653722:32
clarkbI wish I understood the emulation better22:32
ianwclarkb: this does run on other distros (for the wheel build) and that is maybe deb specific?  have to look closer22:33
ianwurgh.  that definitely feels like "i'm building something" one time, and not the other22:34
clarkbya, but the last time I looked at it I couldn't reproduce it22:36
clarkbhowever, I only tested on amd64 to see what it did. Maybe the arm images are different?22:36
clarkbbut also it seems to change over time22:36
clarkbthe previous run everything timed out22:37
clarkbthis time jobs for https://review.opendev.org/c/opendev/system-config/+/856537 succeeded23:24
clarkbfungi: ^ if you've got time to review that23:24
opendevreviewMerged opendev/system-config master: Update python builder and base image  https://review.opendev.org/c/opendev/system-config/+/85653723:59

