Monday, 2023-02-06

<opendevreview> Merged openstack/project-config master: nodepool: infra-package-needs; cleanup python
<opendevreview> Merged openstack/project-config master: nodepool: infra-package-needs; remove lvm2
02:32 <tonyb> I'm still seeing really slow responses from gitea08
02:33 <ianw> load average: 78.44, 85.12, 99.48
02:34 <ianw> it is .. unhappy
02:36 <ianw> nothing completely obvious
02:36 <ianw> Feb  6 02:31:24 gitea08 docker-gitea[847]: 2023/02/06 02:31:23 ...ules/context/repo.go:469:RepoAssignment() [E] [63e06679-3] GetUserByName: context canceled
02:37 <ianw> seems frequent
02:40 <ianw> those messages go back as far as we have logs though
02:44 <ianw> there's lots of oom kills
02:45 <ianw> i've restarted the container anyway
03:49 <fungi> oom kills on the gitea servers are usually a sign that some network behind a common nat is repeatedly cloning large repos like openstack/nova
03:50 <fungi> we saw that behavior when people had openstack-ansible deployments acting up and all their servers tried to independently clone all of openstack rather than caching a central copy in their deployment
03:51 <fungi> apparently the clone operation results in whole copies of the repository being temporarily stored in memory
03:51 <fungi> so it doesn't take many to exhaust one of the backends
04:00 *** yadnesh|away is now known as yadnesh
04:28 *** bhagyashris_ is now known as bhagyashris
05:18 *** ysandeep is now known as ysandeep|ruck
06:05 *** ysandeep|ruck is now known as ysandeep|ruck|afk
06:47 *** ysandeep|ruck|afk is now known as ysandeep|ruck
06:58 <jrosser> fungi ianw I did a bunch of work to make OSA use an identifiable user agent if you believe that is the cause
08:42 <ianw> jrosser: ++ i don't have time to look right now but definitely an angle.
08:42 <ianw> i wonder if we could somehow work that into some sort of static report like we do for the other services
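The static report ianw has in mind would just need to tally user agents out of the gitea access logs. A minimal sketch of that tally (the function name and log format assumption — Apache combined format with the user agent as the final quoted field — are mine, not an existing opendev tool):

```python
from collections import Counter

def top_user_agents(log_lines, n=10):
    """Tally user agents from combined-format access log lines.
    The agent is the final quoted field on each line, so an
    identifiable UA (like the one OSA was given) stands out."""
    counts = Counter()
    for line in log_lines:
        line = line.rstrip()
        # Combined log format ends with "referer" "user-agent".
        if line.endswith('"'):
            agent = line[:-1].rsplit('"', 1)[-1]
            counts[agent] += 1
    return counts.most_common(n)
```

Feeding a day's worth of backend logs through this and publishing the top entries would make a repeat mass-cloner visible without anyone grepping by hand.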
08:42 *** jpena|off is now known as jpena
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: ensure-skopeo: fixup some typos
08:47 *** gibi_pto is now known as gibi
<opendevreview> Merged zuul/zuul-jobs master: ensure-skopeo: add install from upstream option
<opendevreview> Merged zuul/zuul-jobs master: zuul-jobs-test-registry-docker-* : update to jammy nodes
10:43 *** ysandeep|ruck is now known as ysandeep|ruck|break
<opendevreview> Ade Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role
11:31 *** ysandeep|ruck|break is now known as ysandeep|ruck
13:38 *** yadnesh is now known as yadnesh|away
13:52 *** dasm|off is now known as dasm|rover
<opendevreview> Scott Little proposed openstack/project-config master: Create a git for the storage of public keys and certificates
15:16 <gthiemonge> FYI I see a lot of issues with ubuntu mirrors in the octavia-grenade job (I don't know why this particular job is so impacted)
15:21 <gthiemonge> I see similar issues in opensearch
15:25 <fungi> gthiemonge: any idea why those jobs don't use our package mirrors?
15:26 <tweining> once I saw in the log that it was trying an ipv6 address, not sure if that has something to do with it or not though.
15:28 <fungi> some of our providers have ipv6 access, some do not. if it was trying to reach an ipv6 address from a system which didn't have any v6 routes that could indicate an issue
15:28 <fungi> the zuul host info logged with the job should show the routing table the node had when the build started
15:29 <fungi> but also some tools "fall back" to trying ipv6 when v4 connections to something time out, and then report misleading errors
15:30 <fungi> so the error message ends up implying that a v4-only host tried to reach something over v6, but really the problem is that the v4 connection it correctly attempted first failed for some reason
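fungi's misleading-error pattern is easy to reproduce: a naive client walks the resolved address list in order and, when everything fails, only the last address tried ends up in the error it reports. A sketch of that behavior (the helper and its names are illustrative, not any specific tool's code):

```python
import socket

def connect_any(addrinfos, connect):
    """Try each resolved address in order; if all fail, raise an
    error naming only the *last* address tried.  When AAAA records
    sort after A records, a v4 outage therefore gets reported as a
    v6 failure -- the misleading-error pattern described above."""
    last_error = None
    for family, _type, _proto, _canon, sockaddr in addrinfos:
        try:
            return connect(family, sockaddr)
        except OSError as exc:
            last_error = OSError(f"{sockaddr[0]}: {exc}")
    raise last_error or OSError("no addresses to try")
```

So an error mentioning a v6 address is not proof the job ever had working v6; the v4 attempt may simply have failed first and silently.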
<opendevreview> Scott Little proposed openstack/project-config master: Create a git for the storage of public keys and certificates
15:33 <fungi> anyway, i'm able to manually download one of the packages that build said it couldn't, so the problem is likely either intermittent or location-dependent
15:34 <fungi> the addresses i see for it in dns appear to be hosted directly by canonical, so probably no cdn involved at least
15:36 <Clark[m]> fungi: it's likely not using our mirrors because it is within the dib image builds chroot. Dib has support for using our package mirrors though iirc. Maybe that's just part of dibs test suite though
15:40 <fungi> yeah, that would make sense
15:41 <fungi> also i tested downloading over ipv4 as well as ipv6, fwiw, both worked for me
15:41 <fungi> though i didn't try all 3 v4 and 3 v6 addresses in the round-robin
15:42 <fungi> could be one of the servers they list is having trouble
16:03 <gthiemonge> fungi: Clark[m]: thanks, I'll check how we can use our mirrors
<slittle1_> Review please...
16:30 <fungi> you bet, i was just about to pull it up, i was delayed by some local software updates which have just completed
16:30 <clarkb> hey I'm doing local software updates too
16:30 <clarkb> monday morning routine
16:31 <fungi> indeed, though i was way behind in recompiling all my python interpreters since the recent tags
16:32 <fungi> and then rebuilding all my venvs, including the one for my gertty
16:32 <clarkb> they should only need rebuilding when you change major versions?
16:32 <fungi> well, any time your interpreter's path changes, i think
16:33 <fungi> which in my case is the case even for new patch releases because i use separate directories for them
16:35 <fungi> yeah, pyvenv.cfg embeds the real path to the interpreter in its "executable" key
16:35 <fungi> so mine just updated from /home/fungi/lib/cpython/3.11.0/bin/python3.11 to /home/fungi/lib/cpython/3.11.1/bin/python3.11
16:37 <fungi> useful when you want to, say, easily compare behaviors between 3.11.0 and 3.11.1 since you can have them installed side by side and create different venvs referencing each
16:38 <fungi> i simply update a symlink in ~/bin to point to the new version when i want it to be the default build
16:41 <fungi> i've got my venv rebuilds scripted anyway, so it's just a matter of starting the script and waiting for that to complete (or error)
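A rebuild script like fungi's can decide whether a venv is stale by reading the interpreter path out of pyvenv.cfg, as he describes. A small sketch (the function name is mine; the `executable` key is present on recent CPython, with `home` as the older fallback):

```python
import pathlib

def venv_interpreter(venv_dir):
    """Return the interpreter path a venv was created against,
    read from its pyvenv.cfg: the 'executable' key on recent
    CPython releases, falling back to the 'home' directory key."""
    values = {}
    cfg = pathlib.Path(venv_dir) / "pyvenv.cfg"
    for line in cfg.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values.get("executable") or values.get("home")
```

Comparing that value against the current default interpreter is enough to know which venvs need rebuilding after a patch-release bump like 3.11.0 to 3.11.1.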
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Feature our cloud donors on
<opendevreview> Merged openstack/project-config master: Create a git for the storage of public keys and certificates
17:36 *** jpena is now known as jpena|off
18:11 <clarkb> mtreinish: super minor thing I've noticed cleaning up warnings in Zuul. stestr's subunit_runner opens an fd returning a python file object in SubunitTestRunner._list() and ends up returning that back up again to users of the TestRunner so that status results can be recorded. Python complains that this file object is never closed and raises a ResourceWarning
18:12 <clarkb> mtreinish: a quick fix wasn't super obvious to me otherwise I'd write a PR because the status object which uses that file object is passed back up and stats things are called against it
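The kind of ResourceWarning clarkb is chasing is easy to reproduce in isolation: CPython warns when a file object is garbage-collected without being closed. This minimal reproduction shows the general pattern, not stestr's actual code:

```python
import gc
import os
import warnings

def unclosed_file_warnings():
    """Open a file object and drop the reference without close(),
    then report any ResourceWarning CPython emits when the object
    is finalized -- the same complaint raised for the file object
    returned out of SubunitTestRunner._list()."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ResourceWarning)
        handle = open(os.devnull)
        del handle      # dropped without handle.close()
        gc.collect()    # make sure the object is collected now
    return [w for w in caught if issubclass(w.category, ResourceWarning)]
```

Running the test suite with `python -W error::ResourceWarning` is a common way to turn these into hard failures so they can't hide among other warnings.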
<fungi> infra-root: i think our recent changes to jeepyb may have broken manage-projects:
18:33 <fungi> possible chicken-and-egg problem? is it trying to fetch a ref from a project which doesn't exist yet?
18:34 <clarkb> fungi: that would be from the change that made the git errors an error rather than log and continue
18:34 <clarkb> and ya that hunch sounds correct.
18:34 <fungi> yeah, that's the change i was expecting it to be
18:34 <clarkb> We should be able to "safely" revert that jeepyb change that raised errors in that situation
18:34 <clarkb> (we'll just reintroduce the old behavior which was problematic but less problematic probably)
18:34 <clarkb> an alternative would be to treat the fetch of the refs as special. If it fails it's ok and continue and we'll try to push what we have anyway
18:35 <clarkb> that is probably a reasonably correct fix
18:35 <clarkb> the issue before was treating pushes as fail acceptable, here it's a fetch
18:35 <fungi> though... has that change actually merged yet?
<clarkb> hrm nope
18:36 <fungi> right, so this must be something else
18:37 <fungi> maybe there was an intermittent connectivity failure
18:37 <clarkb> it's all to localhost I think
18:37 <clarkb> that would be highly unlikely but possible if the mina sshd ran out of threads maybe
18:37 <clarkb> fungi: the other change is the change of the base image
18:37 <fungi> gerrit says the project got created
18:38 <fungi> so maybe jeepyb raced something trying to access the config ref from it too soon
18:38 <fungi> the repo got prepopulated and synced to gitea too
18:39 <clarkb> fungi: if you look at fetch_config in manage_projects it has a loop for 20ish seconds waiting for the meta config to be available
18:39 <clarkb> perhaps 20 seconds is not long enough?
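The loop clarkb describes amounts to a bounded retry: keep attempting the fetch until it succeeds or a deadline passes. A generic sketch of that shape, with the timeout as a parameter so "maybe 20 seconds isn't long enough" becomes a one-line change (this is not jeepyb's actual fetch_config code):

```python
import time

def wait_for(operation, timeout=20, interval=2):
    """Retry operation() until it succeeds or the timeout expires,
    mirroring the ~20 second loop manage_projects uses while it
    waits for refs/meta/config to become fetchable after project
    creation.  Raises TimeoutError from the last failure."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return operation()
        except Exception as exc:
            if time.monotonic() + interval >= deadline:
                raise TimeoutError(
                    f"still failing after {timeout}s") from exc
            time.sleep(interval)
```

Chaining the last exception via `from exc` keeps the underlying git error visible in the log instead of a bare timeout, which matters for debugging sessions exactly like this one.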
18:40 <fungi> mmm, the repo actually didn't get prepopulated or synced, it was just created on gitea empty
18:40 <fungi> but it exists in gerrit and gitea at least
18:40 <clarkb> fungi: if you look in the manage projects log you should see it looping too as it seems to log each pass through that loop
18:41 <clarkb> oh but only at debug level?
18:42 <clarkb> what if "public-keys" is the problem
18:42 <clarkb> and we're tripping over some gerrit user public keys api path
18:43 <fungi> if i `git fetch ssh:// +refs/meta/config:refs/remotes/gerrit-meta/config` i'm told "fatal: couldn't find remote ref refs/meta/config"
18:43 <clarkb> fungi: you have to do that as your admin account iirc
18:43 <clarkb> possibly in bootstrappers
18:43 <fungi> oh, right
18:43 <fungi> that worked
18:43 <fungi> git fetch ssh:// +refs/meta/config:refs/remotes/gerrit/config
18:44 <fungi> * [new ref]         refs/meta/config -> gerrit/config
18:44 <fungi> so it seems to exist now, at least
18:44 <fungi> might have just been a race
18:45 <clarkb> ya maybe that 20 second time period isn't long enough depending on how busy gerrit is or how busy its disks are?
18:45 <fungi> clarkb: should i try manually rerunning manage-projects and see if it succeeds?
18:46 <clarkb> fungi: I guess so? maybe with debug enabled so that you can see it loop through things. The only other thought I've got is maybe it has something to do with git in the new image or the git repos in the jeepyb cache on the new image
18:46 <clarkb> fungi: but we directly manage the gerrit uid already and that didn't change in the base image swap so that would surprise me I think
18:46 <clarkb> and the git versions were basically equivalent
18:47 <clarkb> (conversion from our security patched version to debian's)
18:56 <clarkb> unrelated: Our CI jobs for fungi's gitea change are failing on apparmor for docker 23 now
18:56 <fungi> i saw that the build failed, but hadn't found time to see why yet
18:56 <fungi> noticed it about the same time as the manage-projects failure
18:57 <clarkb> our prod servers already have apparmor installed based on a quick sampling so I think I'll just push a change to add apparmor to our install docker role
18:58 <clarkb> fungi: re manage-projects I can't really come up with anything except for git versions/permissions issues due to the base image change, or simply a timeout with our loop not being long enough
18:58 <clarkb> fungi: I double checked group membership and project creator appears to have the correct perms
18:58 <clarkb> in gerrit I mean.
<opendevreview> Clark Boylan proposed opendev/system-config master: Install apparmor when we install docker-ce from upstream
<opendevreview> Clark Boylan proposed opendev/system-config master: Feature our cloud donors on
19:01 <clarkb> fungi: ^ rebased as that's a good check it fixes the issue
19:04 <clarkb> fungi: another variable that may have impacted refs/meta/config is if it overlapped with backups and that was eating up iops
19:04 <clarkb> so ya I'm thinking the best next step is to rerun with debug on against that project specifically and see if it's happy now. If so our 20 second retry loop may simply be too short
19:07 <clarkb> I guess the jdk changed too and maybe it's slower at doing that bootstrapping?
19:16 <clarkb> fungi: I'm going to go back to zuul warning cleanup while I've got it paged in but ping me if I can help further
19:31 <fungi> i'm trying to reverse-engineer the manage-projects playbook since just running it directly seems to have failed (probably in the same spot but it doesn't log to a file, just to stdout)
<fungi> what does this tasks_from do?
19:32 <clarkb> fungi: it runs the tasks from the manage-projects file in the gerrit role
19:33 <fungi> yep, thanks found it
19:34 <fungi> so i guess i can just run manage-projects on the gerrit server
19:34 <fungi> which seems to be a docker run wrapper
19:35 <clarkb> yes because we run jeepyb on the image with all the various dirs bind mounted in
19:35 <clarkb> doing that by hand would be annoying so we have the wrapper
19:38 <fungi> running with -v, i don't see any debug log entries
19:39 <fungi> 2023-02-06 19:37:40,534: manage_projects - ERROR - Failed to fetch refs/meta/config for project: starlingx/public-keys
19:39 <fungi> so whatever it's trying is still not working
19:39 <clarkb> that method is the only place we have log.debug() calls. I wonder if we didn't add an ability to actually record those
19:40 <fungi> i find it extra interesting that i can fetch that with a gerrit admin account
19:43 <fungi> it does seem to take at least 20 seconds before i get any output, which would suggest the retry loop is actually happening at least
19:44 <clarkb> fungi: I think it isn't creating the blank repo to fetch the config into
19:45 <clarkb> fungi: the jeepyb cache dir is at /opt/lib/jeepyb:/opt/lib/jeepyb so the same dir path in both host and container. /opt/lib/jeepyb/starlingx does not have any entries, but the public-keys dir should be there to fetch the config into
19:47 <fungi> /opt/lib/jeepyb/project.cache has an entry for it with project-created and pushed-to-gerrit both true but no acl-sha, which seems to match what we're observing at least
19:49 <clarkb> what I'm confused about is jeepyb's make_local_copy should error if it isn't able to git init I think
19:50 <clarkb> oh, except we don't raise there; we could just be running several git commands that all just fail
19:51 <fungi> yeah, it looks like run_command would log.debug the output from those
19:51 <fungi> but maybe that doesn't go to stdout/stderr on a normal invocation
19:51 <fungi> manage-projects has a -l option to specify a log path
19:52 <fungi> we map /var/log into the container too but doesn't look like anything is writing a jeepyb or manage-projects log by default
19:53 <clarkb> ya I think because we use default logging which is stdout
19:54 <clarkb> oh wait we remove the dir in the cache
19:54 <clarkb> ok that explains some very confusing behavior
19:55 <clarkb> and the timestamps for that dir do show it was updated roughly when you ran it by hand, ok that's making a bit more sense now
19:57 <fungi> i'm trying its -l option
19:57 <fungi> which doesn't appear to do anything
19:58 <clarkb> fungi: I think the flag for debug is -d
19:58 <fungi> aha, the -- i was including was to blame
19:58 <clarkb> see setup_logging_arguments
19:58 <fungi> --help says it's -v
19:58 <clarkb> -v is verbose, -d is debug
19:59 <fungi> oh! yes okay i see it now
19:59 <clarkb> and in this case we've set verbose at INFO and higher and debug at DEBUG and higher
19:59 <clarkb> however, that will just log the mostly useless message 10 times over 20 seconds since we already know it is taking roughly that long
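The -v/-d/-l semantics clarkb spells out map onto Python's logging levels in an obvious way. A sketch of that mapping (function names are mine, illustrating the described behavior rather than jeepyb's actual setup_logging_arguments):

```python
import logging

def pick_level(verbose=False, debug=False):
    """-d enables DEBUG and higher, -v enables INFO and higher,
    and the default stays quieter, matching the flag behavior
    discussed above."""
    if debug:
        return logging.DEBUG
    if verbose:
        return logging.INFO
    return logging.WARNING

def setup_logging(verbose=False, debug=False, logpath=None):
    """Send output to the -l path when given, otherwise to the
    default stream handler (stdout/stderr)."""
    logging.basicConfig(filename=logpath,
                        level=pick_level(verbose, debug))
```

With this shape, "running with -v, i don't see any debug log entries" is expected: the log.debug() calls in fetch_config only appear once -d raises the level to DEBUG.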
20:00 <fungi> okay, i have a more useful log file in /var/log now
20:00 <fungi> here we go...
20:01 <clarkb> fatal: ssh variant 'simple' does not support setting port
20:01 <fungi> jeepyb.utils - DEBUG - Command said: fatal: not a git repository: '/opt/lib/jeepyb/starlingx/public-keys/.git'
20:01 <clarkb> yup and if you scroll up a bit "ssh variant 'simple' does not support setting port" seems to be why ^ that isn't a repository
20:01 <fungi> ahh, yeah that's even earlier
20:02 <fungi> looks like GIT_SSH_VARIANT=ssh is a workaround or `git config --global ssh.variant ssh`
20:02 <clarkb> ya and this is likely a side effect of our image change then I guess
20:03 <fungi> maybe different ssh client?
20:04 <clarkb> fungi: we also set GIT_SSH to a wrapper script in order to set ssh flags for the key path and the username etc
20:06 <fungi> or it could be that the git command there has a built-in ssh client implementation now
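If the base-image fix hadn't panned out, fungi's GIT_SSH_VARIANT=ssh workaround could be forced in wherever jeepyb shells out to git: with variant "ssh", git passes `-p <port>` to the GIT_SSH wrapper instead of assuming the "simple" variant that cannot set a port. A sketch of such a wrapper (hypothetical helpers, not jeepyb's actual code):

```python
import os
import subprocess

def git_env(base=None):
    """Build the environment for invoking git, forcing
    GIT_SSH_VARIANT=ssh so git will pass '-p <port>' to the
    GIT_SSH wrapper rather than failing with the exact
    "ssh variant 'simple' does not support setting port" error
    seen in the manage-projects log."""
    env = dict(os.environ if base is None else base)
    env["GIT_SSH_VARIANT"] = "ssh"
    return env

def run_git(args, cwd=None):
    """Run a git command with the fixed-up environment."""
    return subprocess.run(["git"] + list(args), cwd=cwd,
                          env=git_env(), capture_output=True, text=True)
```

Since git infers the variant from the basename of the GIT_SSH command, installing a real openssh-client (as the eventual fix did) makes the explicit override unnecessary.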
<opendevreview> Clark Boylan proposed opendev/system-config master: Install openssh-client in our Gerrit docker image
20:21 <clarkb> ok I think ^ will address it. Just a missing dep in the base image swap
20:21 <clarkb> note this is based on the apparmor change so that it can gate
20:21 <clarkb> the apparmor change should be considered carefully however, as I mentioned I think it's a noop for our prod hosts
20:24 <fungi> system-config-run-gitea is still underway for 869091,10 as our confirmation on that one
20:24 <fungi> hopefully we'll see that green shortly
20:28 <clarkb> with that largely sorted out (for now anyway) I'm going to eat lunch
20:49 <mtreinish> clarkb: yeah, it's been on my backlog to try and figure out how to handle that. There was an issue opened a while ago about all the resource warnings that get raised: and masayukig fixed some of them but there are definitely still more
20:50 <mtreinish> some of them will definitely be tricky to fix, because it's all in weird inherited usage from subunit and unittest (mostly because I have to remind myself how that all works)
20:53 <clarkb> mtreinish: I can commiserate with that. See also the jeepyb debugging above :)
21:13 <ianw> my docker 23 issue was ultimately that i had an old devicemapper based container and the docker daemon wouldn't start
21:13 <ianw> it might have been able to with various flags, but it was easier to just start again
21:14 <ianw> i think this was from when linode (my host) was a Xen-based vm. at some point they migrated everything to kvm, but iirc at the time something about being xen made it use devicemapper
21:15 <clarkb> some linux archeaolgy
21:15 <ianw> we're testing with docker 23 now, but i don't think it will get pulled in anywhere in prod unless we explicitly update
21:15 <clarkb> (also I can't type)
21:15 <clarkb> ianw: correct because updating docker implies restarting containers and we try to control that
21:16 <ianw> i wonder if it's worth just making a list and doing it manually, starting with lower-impact hosts?
21:16 <clarkb> fungi: heh the latest donor change made the header and text align properly but now the donor logos are stacked on top of each other. I think I prefer this even if it is more scrolling though
21:16 <clarkb> ianw: not a bad idea
21:17 <clarkb> fungi: but I'm terrible at css and layout...
21:17 <ianw> i can start an etherpad and do that. it's probably not a bad idea to do a reboot anyway on some of these hosts
<ianw> tracking at
21:23 <clarkb> ianw: in the past what I've tried to do is stop service containers, upgrade docker, optionally reboot, start service containers again. I think the packaging will attempt to restart containers for you but I like doing it myself for most things
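clarkb's manual sequence (stop the service containers, upgrade docker, optionally reboot, start the containers again) can be written down as an explicit per-host plan so nothing runs in the wrong order. A rough sketch of such a plan generator; the compose file paths and apt invocation are illustrative assumptions, not opendev's actual layout:

```python
def upgrade_plan(services):
    """Return the command sequence clarkb describes: bring each
    service's containers down, upgrade docker, then bring them
    back up (an optional reboot step is omitted here).  The
    /etc/<svc>-docker paths are hypothetical."""
    plan = [["docker-compose", "-f",
             f"/etc/{svc}-docker/docker-compose.yaml", "down"]
            for svc in services]
    plan.append(["apt-get", "install", "-y", "docker-ce"])
    plan += [["docker-compose", "-f",
              f"/etc/{svc}-docker/docker-compose.yaml", "up", "-d"]
             for svc in services]
    return plan
```

Generating and reviewing the plan before executing it fits the "do it myself rather than let the packaging restart containers" preference, and an etherpad checklist per host covers the human side.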
21:28 <mtreinish> clarkb: tbh, looking at the code in detail now I think I can just drop the fdopen call. I don't think it's really relevant. IIRC, I just ported that from subunit and/or unittest when I rewrote the runner to be based on unittest's run instead of testtools, but the stestr context is more limited and we're almost always just passing stdout as the result stream and won't ever need to open a new descriptor in that code
21:28 <mtreinish> I'm just going to simplify that logic (famous last words)
<opendevreview> Merged zuul/zuul-jobs master: ansible-lint: fix a bunch of command-instead-of-shell errors
<opendevreview> Merged zuul/zuul-jobs master: ansible-lint: add names to blocks/includes, etc.
<opendevreview> Merged zuul/zuul-jobs master: ansible-lint: ignore use of mkdir
21:39 <mtreinish> clarkb: it passed tests locally
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Feature our cloud donors on
21:48 <clarkb> mtreinish: thanks! I was mostly motivated by the sqlalchemy 2.0 update and needing to filter out all the noise warnings from the useful warnings.
21:48 <fungi> clarkb: ^ looking at the other logos at the top of the page, i think i just incorrectly nested them
21:51 *** dmitriis9 is now known as dmitriis
21:51 *** Tengu8 is now known as Tengu
21:51 *** mtreinish_ is now known as mtreinish
21:51 *** dtantsur_ is now known as dtantsur
21:51 *** noonedeadpunk_ is now known as noonedeadpunk
<opendevreview> Merged opendev/system-config master: Install apparmor when we install docker-ce from upstream
<opendevreview> Merged zuul/zuul-jobs master: ansible-lint: use pipefail
<opendevreview> Merged zuul/zuul-jobs master: ansible-lint: ignore latest git pull
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path
<ianw> To ssh://
23:16 <ianw>  ! [remote rejected]     HEAD -> refs/for/master%topic=docker-apt-key (n/a (unpacker error))
23:17 <ianw> is this my fault or gerrit's fault??
23:18 <ianw> Caused by: Unpack error on project "opendev/system-config":
23:18 <ianw> in gerrit logs
<opendevreview> Ian Wienand proposed opendev/system-config master: install-docker: switch from deprecated apt-key
<opendevreview> Ian Wienand proposed opendev/system-config master: install-docker: remove apt-key cleanup
23:32 <ianw> $ zgrep 'Unpack error, check server log' * | wc -l
23:32 <ianw> so it's not unique, but also not that frequent. maybe it was my client dropping packets or something
23:37 <JayF> is something upside down?
23:37 <JayF> ianw: I'm seeing exactly that
23:37 <JayF> fetch-pack: unexpected disconnect while reading sideband packet
23:38 <JayF> more like these, it errors in different places depending on when it times out
23:38 <JayF> looks like generically slow-remote-server stuff? but I know little about what goes on behind the covers here
23:38 <ianw> JayF: what was the operation you were doing?
23:39 <JayF> Trying to push a fresh patch. It died in the git remote update gerrit step
23:39 <JayF> and I can make that fail outside of `git review`
23:41 <ianw> JayF: hrm, i'm not seeing anything lining up in the gerrit logs, can you paste more context where it popped up?
23:41 <JayF> let me get a fresh reproduction then I'll paste it
23:42 <JayF> ianw:
23:42 <JayF> web UI works as I'd expect, if a bit slow, so I think it's not connectivity
23:43 <ianw> ahh, ok, i see in logs now
23:43 <ianw> SshChannelNotFoundException: Received SSH_MSG_CHANNEL_WINDOW_ADJUST on unassigned channel 0 (last assigned=null)
23:44 <ianw> always great to see a new weird ssh error, it's been too long since the last one :)
23:44 <JayF> I'm running 9.1_p1-r3
23:44 <JayF> on gentoo
23:44 <ianw> the last one almost turned clarkb into a java developer
23:44 <JayF> if it's possible the error is caused by shiny new openssh, it's likely I'm running the shiny new lol
23:44 <JayF> although there is a 9.2 in the repo too...
<opendevreview> Merged opendev/system-config master: Install openssh-client in our Gerrit docker image
23:47 <ianw> there's references to this in a few places
23:48 <JayF> that's an ominous merge in time with this bug LOL
23:48 <ianw> ; an old wikimedia commit seems to have enabled the workaround ->
23:49 <JayF> looks like from what I'm seeing, most reports are when networking is slow or high latency
23:49 <JayF> makes me wonder if it's possible there's a network issue underlying this failure mode
23:49 <ianw> you're not the only user to have this error in the logs
23:50 <JayF> I'm talking more generally than just me; mainly based off a feeling (not quantitative data) that the Web UI is exhibiting some slowness too
23:54 <ianw> JayF: hrm, did you just upgrade or something?
23:55 <JayF> I don't think so; but I run updates on this thing very frequently
23:55 <ianw> there's a few users seeing this in a bit of a regular pattern
23:55 <ianw>     133 exceptionCaught(ServerSessionImpl[proliantci@
23:55 <ianw> e.g. seems proliantci is experiencing it
23:55 <JayF> those are ironic third party CI :(
23:55 <ianw>     33 exceptionCaught(ServerSessionImpl[cisco-cinder-ci
23:56 <ianw>      19 exceptionCaught(ServerSessionImpl[hp-storage-blr-ci
23:56 <JayF> FWIW, looks like I've been running the same openssh client version for a couple weeks minimum
23:56 <ianw> that's the bot accounts, but there's user accounts too
23:56 <JayF> honestly, and I'm far from an expert in java ops (and even if I was, that info would be dusty)
23:56 <JayF> but this is the sort of thing I'd reboot first and ask questions later LOL
23:57 <ianw> i suppose it's possible all three of those are on the same distro with the same openssh
23:57 <JayF> honestly, I'd be amazed if anyone has modified config on HP third party CI in months
23:58 <ianw> JayF: so are you basically blocked from pushing changes with this atm?
23:58 <JayF> yes, but my day ends in 2 minutes
23:59 <JayF> so i'm happy to just close the laptop and re-run `git review` tomorrow lol
23:59 <JayF> but I can hang out and help w/testing if that's useful
23:59 * JayF tries again for good measure
