Monday, 2021-11-15

00:36 <ianw> # grep 'Could not access submodule' * | wc -l
00:36 <ianw> this is hitting a lot on the builders updating the cache
00:38 <ianw> # grep -B1 'Could not access submodule' *  | grep 'Updating cache' | awk '{print $7}' | uniq -c
<ianw>      31
00:38 <ianw> it really looks like this repo is the major cause
00:44 <ianw> i'm going to have to try to extract the script and run it separately.  what we are doing is fairly convoluted
01:32 <fungi> i expect i'm partly to blame, i vaguely recall adding bits to force caching of all the branches which was... ugly
04:18 *** ricolin_ is now known as ricolin
04:27 <ianw> is what's happening distilled
04:56 *** pojadhav|afk is now known as pojadhav
<opendevreview> Merged opendev/system-config master: Enable mirroring of centos stream 9 contents
05:24 *** ysandeep|out is now known as ysandeep
<opendevreview> Ian Wienand proposed opendev/system-config master: Revert "Enable mirroring of centos stream 9 contents"
<opendevreview> Ian Wienand proposed opendev/system-config master: Enable mirroring of 9-stream
06:09 <ianw> fungi: i totally missed that uses /mirror/centos-stream which needs to be a new volume mount.  i think it makes sense to do it like that, which 817900 enables.
<opendevreview> Ian Wienand proposed opendev/system-config master: Enable mirroring of 9-stream
<opendevreview> Merged opendev/system-config master: Revert "Enable mirroring of centos stream 9 contents"
08:09 <ianw> fungi: ^ i've setup the users/permissions/volumes/quota (200g) etc. for 817900 now
08:15 <ianw> ahh, i replicated the no such module issue
08:15 <ianw> $ git --version
08:15 <ianw> git version 2.33.1
08:15 <ianw> so this happens at least on my f35 version and git version 2.30.2 on the builder
<ianw> the final script that replicated this was ->
08:17 <ianw> infra-root: ^ might be good if someone else can confirm this.  i see suspects in either the gerrit submodule update plugin thing, replication to gitea, gitea serving itself, or git client issues ...  i.e. i have no idea :)
08:18 <ianw> running it, leaving it for a while for a few things to merge, then running again is required.  i can't get a deterministic replication, yet
08:19 *** ykarel__ is now known as ykarel
08:20 <ianw> i'm out of time for now, but tracing through a request to see which backend it ends up at, and seeing if anything in gitea logs will be my first thought
<opendevreview> Merged opendev/irc-meetings master: Update QA meeting info
09:02 *** giblet is now known as gibi
09:21 *** pojadhav is now known as pojadhav|lunch
09:42 *** ykarel is now known as ykarel|lunch
09:44 *** ysandeep is now known as ysandeep|afk
10:04 *** pojadhav|lunch is now known as pojadhav
10:43 *** ykarel|lunch is now known as ykarel
10:57 *** ysandeep|afk is now known as ysandeep
10:59 *** jpena|off is now known as jpena
12:43 <fungi> ianw: in which log files were you finding that error?
<opendevreview> Merged opendev/system-config master: Enable mirroring of 9-stream
13:11 <fungi> i'll check in on that ^ once it's deployed
<frickler> fungi: those errors seem to be in the build logs, e.g.
14:09 <fungi> ahh, thanks, i checked some build logs but just didn't get lucky
14:11 <frickler> seems one has a good chance when looking at those that are much shorter than the others, but not too short like the c-9-s builds
14:30 <fungi> is it the clone or the fetch triggering that? i guess it's the fetch operation?
14:48 <noonedeadpunk> hey folks! can I ask to review and ?
14:52 <fungi> sure, lookin
15:03 *** ykarel is now known as ykarel|away
<opendevreview> Merged openstack/project-config master: Add Vault role to Zuul jobs
15:15 <fungi> that centos stream 9 mirror change failed to deploy, checking the log on bridge.o.o now
15:16 <fungi> broke on "mirror-update : Copy keytab files in place"
15:16 <fungi> but unfortunately, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"
15:17 <fungi> maybe the private hiera is missing the value or mistyped
15:19 <fungi> i see a mirror_update_keytab_centos-stream in host_vars/ and it looks properly base64 encoded
15:27 <fungi> aha! they needed to be added to host_vars/ instead
15:29 <fungi> i've copied it into there now
15:38 <Clark[m]> re the openstack/openstack repo I wonder if we are replicating the openstack/openstack submodule update prior to replicating the ref the submodule was updated to point at. This is possible because we run multiple replication threads and submodule updates may be cheap but updates to other repos less so?
15:39 <Clark[m]> And if we try to fetch/clone in that period of time we lose the race. Now this should only be an issue if actually updating or checking the submodules, which we don't need to do in the git cache? Maybe we can force git to be more naive there
15:48 <fungi> yeah, i have a feeling at some point git fetch changed its behavior to also update or validate submodules by default
15:49 <fungi> or maybe the default behavior is expressly overridden on specific distros?
16:08 *** jpena is now known as jpena|off
16:09 <fungi> perhaps we need to explicitly pass --recurse-submodules=false?
16:11 <fungi> "Use on-demand to only recurse into a populated submodule when the superproject retrieves a commit that updates the submodule’s reference to a commit that isn’t already in the local submodule clone. By default, on-demand is used..."
16:11 <fungi> i think that explains why we're seeing it try
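For the cache-update case being discussed here, the recursion can be switched off per invocation. A minimal, self-contained sketch with throwaway repos (not the real builder cache paths; note that git-fetch documents the values `yes`, `on-demand`, and `no` for this option):

```shell
set -e
tmp=$(mktemp -d)
# stand-in "upstream" repo; in the real case this is a gitea-hosted mirror
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
git clone -q "$tmp/upstream" "$tmp/cache"
# same refspec style the builder uses, but with submodule recursion disabled
git -C "$tmp/cache" fetch --recurse-submodules=no --prune --update-head-ok \
    origin '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
```

With `--recurse-submodules=no`, git never attempts the on-demand submodule walk, so the "Could not access submodule" path is not reachable at all during the fetch.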
16:12 <fungi> and yes, i agree a race between openstack/openstack and some other project it treats as a submodule getting replicated in the "wrong" order could explain it
16:34 *** ysandeep is now known as ysandeep|out
17:21 <clarkb> fungi: it might be nice to get and/or in today if we are feeling confident in them?
17:21 <clarkb> (it's early in the week etc etc)
17:21 <fungi> i'll take a look
17:22 <clarkb> I'm happy to pick one or the other so that we can monitor if we think that is prudent. Mostly didn't want that work to stall out because I got sick last week
17:22 <clarkb> I do have a meeting here in the next few minutes. Some people are interested in hearing how we do arm64 ci stuff
17:22 <clarkb> But other than that I expect to be around all day and able to help
17:23 <clarkb> also I need to do a system update that is removing xorg-video-mach64 which I don't think is used by my current gpu, but on the off chance it is I'll be falling back to the laptop I guess
17:27 <clarkb> when this meeting is done I'll send email to linaro about the expiring cert and what appear to be a number of leaked instances as well
17:30 <fungi> okay, so if i were to approve 816771 now, that's cool with you?
17:31 <clarkb> fungi: I think so, assuming you are able to help watch it as my meeting just started
17:34 *** jgwentworth is now known as melwitt
17:37 <fungi> yeah, i can. as for your point about the admin user possibly being used for some service, while i doubt we would have done that, it's not easy to search our configuration for it since that string appears all over
<opendevreview> Merged opendev/system-config master: Cleanup users might have used
17:59 <fungi> watching that deploy ^
18:00 <clarkb> fungi: ya I doubt we have too. Ok meeting over. Turns out that the hardware we are on for the linaro cloud is old and needs to go away, so we need to work with equinix and linaro to shift onto new hardware. I'm working on an email to get kevinz_ looped into that conversation next
18:00 <clarkb> then also an email about the ssl cert and the leaked nodes :)
18:00 <fungi> yeah, it started complaining about the 30-day expiration over the weekend
18:02 <fungi> should hopefully see /var/log/ansible/base.yaml.log start updating once the infra-prod-base job is underway
18:05 <fungi> looks like it's working on the account updates now
18:06 <fungi> occasional stderr about the ubuntu account having no mail spool to remove
18:06 <clarkb> fungi: anything unexpected or concerning?
18:07 <fungi> not that i could see, though it spammed past very quickly so i need to take a slower look through the log
18:07 <fungi> i'll wait to make sure the build result comes back
18:14 <clarkb> ok first email sent
18:20 <clarkb> and now I've also asked kevinz_ to look at the leaked nodes and the ssl cert. I feel so productive today :)
18:21 <clarkb> fungi: looks like the base job finished but failed?
18:22 <fungi> yeah, i'm looking now to see what host failed
18:22 <clarkb> fungi: it was lists.o.o, a dpkg lock was being held so it couldn't update the cache
18:22 <clarkb> I think that is the only error so we are probably good
18:22 <fungi> E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
18:23 <fungi> i'm putting together a list of hosts now which seem to have gotten accounts removed in that build
18:24 <clarkb> thanks. I spot checked a couple to make sure the removal actually happened and it seems to have
18:24 <clarkb> I'm going to do that local update and hope it doesn't break my xorg (will fallback to laptop if it does)
18:25 <clarkb> Gerrit User Summit has been announced for December 2&3
18:26 <clarkb> I'll probably try to attend that at least. But also propose a talk about how we deploy Gerrit these days as I think the automation we've got is probably interesting to other gerrit users
18:26 <fungi> these reported 4 changed tasks instead of the usual 3 (or occasional 2): afs01.ord afs02.dfw afsdb02 cacti02 ethercalc02 translate01
18:27 <clarkb> fungi: but those 4 tasks weren't anything unusual?
18:27 <clarkb> I guess only odd in that they were different?
18:27 <fungi> i think it represents the ones which had accounts removed, but i'm checking now to be sure
18:30 <clarkb> fungi: note the task count is for the entire playbook not just the user stuff
18:30 <clarkb> spot checking I think it only did one change for the user
18:31 <fungi> for translate01, the tasks which reported changed status were the ubuntu account removal, running `ua status`, adding the puppet remote ssh key to the root account, and running `apt-get autoremove -y`
18:31 <clarkb> ya if you filter by changed.* you can see what they are. That one lgtm
18:31 <fungi> i bet the list with 4 changed tasks is the ones with ua configured
18:31 <clarkb> yup similar for (no ua status)
18:32 <fungi> okay, so it'll probably be the ones with only 2 changed tasks which had no accounts to remove
18:33 <clarkb> ya that seems likely
18:33 <fungi> gitea* and mirror02.mtl01.inap
18:33 <fungi> that does indeed make far more sense
18:34 <clarkb> alright this seems happy. I'm going to go ahead with my local reboot now. Back in a bit (I hope!)
18:35 <fungi> the two changed tasks for mirror02.mtl01.inap were adding the puppet remote ssh key to the root account, and running `apt-get autoremove -y`
18:35 <fungi> no account removals
18:35 <fungi> so that does seem reasonable
18:37 <fungi> okay, i've approved 816869 now
18:39 <clarkb> ok back again.
18:39 <clarkb> is the other one, but I'm still a bit worried we might end up breaking functionality for those instances if we do this
18:39 <clarkb> cloudnull: ^ may know?
18:40 * cloudnull looking
18:40 <clarkb> cloudnull: mostly I'm not sure if we need to keep that because of Xen or similar.
18:40 <clarkb> like will console access stop working or live migration or etc
18:41 <fungi> yeah, in short, what do we lose by uninstalling nova-agent on rackspace server instances?
18:41 <clarkb> ha ok. I wonder if the source code is readable on disk. We might be able to decipher it that way
18:41 <cloudnull> the nova agent used to query the xen meta data and setup networking. but IDK if that's still a thing
18:42 <clarkb> that is a good hint that it does important tasks though
18:42 <clarkb> and we should be careful :)
18:42 <cloudnull> ++
18:42 <cloudnull> IDK if the rax cloud still relies on that ?
18:42 <clarkb> ya maybe what we should do is boot a throwaway instance and do some testing
18:42 <fungi> on bridge.o.o, dpkg says we used to have rax-nova-agent installed but no longer do (and do not have python3-nova-agent installed either)
18:43 <clarkb> fungi: interesting I thought we were uninstalling it at one point but then couldn't find evidence of that. Maybe because the package name changed
18:43 <fungi> we do have qemu-guest-agent installed though
18:45 <fungi> amusingly, /usr/bin/nova-agent (supplied by the python3-nova-agent package) is an executable zipball
18:45 <fungi> /usr/bin/nova-agent: Zip archive data, made by v?[0x314], extract using at least v2.0, last modified Fri Sep 13 04:00:00 2013, uncompressed size 87176, method=store, K\003\004\024
18:46 <fungi> it seems to bundle netifaces, pyyaml, pyxs, distro, and novaagent
18:49 <fungi> the supplied in the root of the bundle looks like a sort of entrypoint wrapper calling novaagent.novaagent.main()
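The layout described here matches what the Python stdlib `zipapp` module produces: a zip archive with a `__main__.py` entry point, optionally prefixed with a shebang line. A toy reconstruction of the format (none of this is nova-agent's actual code; the printed message is a stand-in):

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/app"
# stand-in entrypoint; zipapp's convention is a __main__.py in the archive root
printf 'print("agent main()")\n' > "$tmp/app/__main__.py"
# bundle it as an executable zipball with a shebang, like /usr/bin/nova-agent
python3 -m zipapp "$tmp/app" -o "$tmp/agent.pyz" -p '/usr/bin/env python3'
python3 "$tmp/agent.pyz"   # prints: agent main()
```

Because a zip's central directory lives at the end of the file, the interpreter line (or any prefix) at the front doesn't break extraction, which is why `file` still identifies it as "Zip archive data".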
18:50 <fungi> looks like is probably the current source
18:50 <fungi> and has a readme explaining what it's for
18:50 <clarkb> nice find
18:52 <fungi> and we clearly have at least some server instances in rackspace without any version of that agent running
18:52 <clarkb> the network changes are probably what we would need to dig into more. Is it possible for those to happen without our input (I really doubt it as that would be an outage for the users)
18:52 <clarkb> and ya I did think we removed it at one time, but then couldn't find where we had done that
18:54 <fungi> yes, in the past we've seen the agent change "backend" (eth1) interface routes when they added or removed rfc-1918 networks in a particular region, and also change our dns resolver configuration
18:54 <fungi> which is what i seem to recall prompted us to start uninstalling it the first time
19:17 <clarkb> fungi: any idea where we uninstalled it before? I swear I looked when I wrote this change and couldn't find it. But if we can find those logs we may find additional useful info
19:36 <fungi> looking to see if i can find archeological evidence of it
19:39 <fungi> clarkb: merged last year and seems related, cleaned up the old service removal code in 2014 though
19:40 <fungi> oh, no not 84543 that was in a job config
19:43 <fungi> i have a feeling if we did take the agent off our servers at some point, we did so manually
<opendevreview> Merged opendev/system-config master: Lower UID/GID range max to make way for containers
20:22 <clarkb> sorry grabbed lunch. Ya I guess it could've been manual
20:23 <clarkb> wow the deploy jobs for the first change are still running and ^ is queued up behind that
20:28 <clarkb> corvus: I'm updating tomorrow's meeting agenda. For zuul we are running two mostly up to date schedulers and zuul-web talks to zk directly now too so no more weird flapping?
20:28 <clarkb> corvus: is gearman completely removed at this point or is it still around for the cli commands? What is the plan there?
20:47 <clarkb> also generally seems like queues are working as we expect them to today. I don't see anything that looks odd in zuul status right now
20:47 <clarkb> corvus: we might want to consider a zuul tag this week if that holds up?
20:48 <fungi> yeah, it's been looking good since the upgrade and even the hitless rolling scheduler restart turned up no obvious issues
20:53 <corvus> clarkb: 1: correct.  2: gearman is still around vestigially; we need to write more changes to finish removing it.  a zuul tag sounds potentially useful, but i think we may want to merge a few more bugfixes.
20:54 <clarkb> cool thanks for the update
20:55 <clarkb> I'll go do some hashtag:sos reviews now
21:09 <ianw> fungi: thanks for fixing up the 9-stream error.  the initial deploy might need to be done under lock?
21:09 <ianw> we should delete that old hiera file; also the instructions for using hiera edit didn't work either, i'll look at both
21:30 <ianw> the 9-stream update did timeout.  i'm running it now under no_timeout lock in a root screen
21:33 <ianw> (on mirror-update)
21:42 <fungi> oh, thanks i didn't check back to see if it deployed properly after i fixed it
21:58 <clarkb> fungi: the most recent run of infra-prod-base for the uid and gid shift also failed on lists.o.o due to the dpkg lock. I suspect we may have a stale lock there?
22:11 <clarkb> fungi: I think there is an apt-get autoremove from november 11 that is causing that to happen on lists.o.o
22:13 <clarkb> fungi: also looking in /boot I see two vmlinuz files newer than the one that I seem to have extracted so that xen could boot it. I don't think our package pins are working there
22:14 <clarkb> fungi: if you get a chance can you take a look? We can probably go ahead and extract the two newer files and put them in place too?
22:16 <clarkb> oh wait we end up using menu.lst I think
22:17 <clarkb> oh except that chain loads grub2
22:18 <clarkb> ya Ok I've paged enough of this back in again. Basically we rely on the menu.lst which xen reads to chainload grub2
22:18 <clarkb> grub2 will boot a new kernel by default which is compressed with the wrong compression algorithm for xen. This means our kernel pinning isn't working and we need to extract newer kernels before rebooting
22:19 <clarkb> I wonder if the autoremove is broken due to our pinning and we've confused it somehow?
<opendevreview> Ian Wienand proposed opendev/system-config master: mirror: Add centos-stream link
22:27 <clarkb> and child are a couple of what should be safe cleanups to our base jobs
22:28 <clarkb> basically we can stop scanning for the growroot info because we've not had problems with that recently
22:34 <ianw> ok, my "git --git-dir=openstack_179b61797588a5983c2f97c6533dca570c8f887d/.git fetch --prune --update-head-ok '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'"
22:35 <ianw> just got a bunch of "Could not access submodule 'ceilometer'" messages
22:35 <ianw> i'm going to try tracing what backend i went to
22:37 <ianw> ...45c8:37818 [15/Nov/2021:22:34:16.037] balance_git_https []:53688 balance_git_https/ 1/0/2280 79208 -- 27/26/25/1/0 0/0 must have been me
22:38 <clarkb> ianw: did you see my theory that openstack/openstack is getting replicated before the ceilometer repo in this case?
22:38 <clarkb> ianw: we should be able to check the timestamps in the gerrit replication log to check that now that we have a concrete case
22:39 <ianw> Nov 15 22:09:20 gitea08 docker-gitea[846]: 2021/11/15 22:09:20 Completed GET /openstack/openstack/commit/bf679fc618d360a7d1f3b329bef50f67d3d40fa3 500 Internal Server Error in 19.713562ms
22:42 <fungi> ianw: and that's coming from the fetch, right? any chance you could try to recreate with --recurse-submodules=false? i have a feeling it's the default of on-demand which is causing it
22:43 <fungi> i feel like there's probably no point in recursing submodules in our git cache anyway
22:43 <clarkb> ya the jobs update all the repos when they run
22:44 <clarkb> we just want enough object data in place that those updates are not super slow
22:44 <clarkb> fungi: please see notes about lists.o.o above too
<ianw> here is the error from above ->
22:45 <fungi> clarkb: yeah, i'm trying to make sure i remember what the specifics are with the initrd there
22:46 <clarkb> fungi: I don't think it was initrd. It is vmlinuz since that is compressed by some newer algorithm on ubuntu now that xen doesn't understand. In my homedir on lists under kernel-stuff is a script from the linux kernel repo that will extract an uncompressed kernel
22:46 <fungi> oh, yep right you are
22:47 <clarkb> Our boot sequence is something like xen finds grub1 menu.lst and finds our pvchain loader boot thing which knows how to read the grub 2 config and get the kernel from that for xen
22:47 <clarkb> xen then tries to uncompress it and fails or it finds a pre decompressed kernel and is fine
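The extraction step described above presumably looks something like the following sketch. This is illustrative only: `extract-vmlinux` lives in the Linux source tree under scripts/, and the paths and kernel version here are assumptions, not a tested procedure for these hosts.

```
# Illustrative sketch: keep the compressed original around, then replace the
# boot kernel with an uncompressed image that Xen's loader can handle.
cp /boot/vmlinuz-5.4.0-90-generic ~/kernel-stuff/vmlinuz-5.4.0-90-generic.orig
./extract-vmlinux ~/kernel-stuff/vmlinuz-5.4.0-90-generic.orig \
    > /boot/vmlinuz-5.4.0-90-generic
```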
22:47 <fungi> i'm going to kill the hung autoremove and try running it manually without the -y
22:48 <clarkb> we thought we had pinned the kernel on that server to prevent my old uncompressed file from getting replaced but there are two newer kernels in grub now
22:48 <clarkb> fungi: ok
22:48 <ianw> was the update and the list of submodules that "failed"
22:48 <fungi> The following packages will be REMOVED: linux-image-5.4.0-88-generic
22:48 <clarkb> ianw: hrm the number of submodules complaining there would imply my theory isn't very valid (just too many for that race to be a problem)
22:49 <fungi> we're booted on 5.4.0-84-generic
22:49 <fungi> there's a vmlinuz-5.4.0-90-generic installed
22:49 <clarkb> fungi: yes 84-generic is the one I uncompressed (you'll see the original and a copy of the uncompressed version in my homedir)
22:49 <fungi> and a vmlinuz-5.4.0-89-generic
22:49 <clarkb> fungi: we can uncompress -90, but then when -91 happens we'll be in the same boat
22:49 <clarkb> mostly I am worried about getting our pin working then we can uncompress whatever is current
22:50 <fungi> but also the autoremove isn't trying to remove the one we've booted from
22:50 <clarkb> got it
22:53 <fungi> and yeah, looks like it's probably set to boot vmlinuz-5.4.0-90-generic by default
22:53 <ianw> ok, here's my hit on the LB, and the subsequent hits on the gitea08 backend ->
22:56 <ianw> there are these "Unsupported cached value type: <nil>" errors periodically, but in this case it's a red herring
22:58 <ianw> from the actual error ( i do not see any backend hits for charm-* -- i.e. i do not think my client is making any requests related to these "missing" submodules
22:59 <ianw> none of these submodules are populated -- i don't think it is doing --recurse-submodules
23:01 <clarkb> those submodules are relative to the openstack/openstack repo
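For context on "relative": a superproject like openstack/openstack records each submodule in its .gitmodules file, and a relative URL there resolves against the superproject's own remote, so the submodule gets fetched from the same gitea host. An illustrative entry (assumed shape, not copied from the repo):

```
[submodule "ceilometer"]
	path = ceilometer
	url = ../ceilometer
```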
23:03 <fungi> ianw: --recurse-submodules=on-demand is the default on recent git versions
23:03 <fungi> which is why i suspected it might be related
23:03 <ianw> is where the error comes from
23:04 <clarkb> it is complaining the dir isn't empty?
23:05 <ianw> feels like it must be hitting that !is_empty_dir
23:05 <fungi> what would cause it to populate the directory though?
23:06 <clarkb> doing a clone of openstack/openstack on my desktop using git 2.33.1 I get all empty dirs for the submodules which seems to line up with not hitting that error
23:06 <ianw> fungi: what does the "on-demand" do in the fetch case?
23:06 <clarkb> I wonder if some subsequent action is populating some of those submodules, then the next time we try to update it complains?
23:06 <fungi> the manpage explains, but basically it recurses if the commit updates the reference for a submodule
23:07 <fungi> which is basically all the commits in openstack/openstack ever do
23:07 <fungi> so may as well be --recurse-submodules=true where that repo is concerned
23:09 <ianw> it says "populated submodules"
23:09 <clarkb> fungi: if you have a moment for we can land that one and I can run a quick test then we can land the child
23:09 <ianw> since i don't think this has populated the submodules at all, i don't think it's trying to clone those bits at all
23:11 <ianw> is what introduces this check
23:12 <fungi> yeah, and the git clone manpage doesn't state what its default for --recurse-submodules is unfortunately, so it seems like it defaults to off
23:12 <fungi> so do any of the submodules in the cache look like they've got a populated worktree?
23:13 <fungi> i guess we should expect to have a set for those if so
23:13 <clarkb> or at least not empty
23:13 <ianw> it's definitely racy though. isn't a consistent list of what fails, it always changes
23:14 <ianw> i guess instrumenting git to print why it thinks the directory isn't empty might help
23:15 <clarkb> fungi: re lists. Let me know if you want me to extract the -90 kernel or I can walk you through it. But I'll defer to you on package pinning as that always seems like apt magic to me
23:23 <fungi> yeah, linux-image-5.4.0-84-generic is in held state, which explains why it's not trying to remove that
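That "held state" is presumably the effect of an apt-mark hold, which pins a package so upgrades and autoremove leave it alone. A sketch (requires root; package name taken from the log above):

```
apt-mark hold linux-image-5.4.0-84-generic   # pin the known-good kernel
apt-mark showhold                            # lists held packages to verify
```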
23:24 <fungi> linux-image-5.4.0-88-generic was listed as being in an error state, but when i manually re-ran `apt-get autoremove` and agreed to remove the package, it seems to have done so without error
23:24 <fungi> now 89 is showing removed, 84 is still installed, and rerunning autoremove no longer thinks it needs to do anything
23:25 <fungi> clarkb: i'll try to get up with you tomorrow about decompressing 90 and testing a reboot on that
23:26 <fungi> er, 88 is showing removed i meant to say, 89 and 90 are still installed as the two most recent kernels, as is 84 because we held it
23:26 <fungi> it seems to have previously cleaned up 85, 86 and 87 without problem, so not sure what caused it to choke on 88, i wasn't able to reproduce the problem
23:28 <clarkb> sounds good. The process is straightforward to use the extraction tool and it came out of the linux kernel git repo. Would be good to have someone else that knows how to use it though :)

Generated by 2.17.2 by Marius Gedminas - find it at!