Monday, 2023-05-08

*** amoralej|off is now known as amoralej06:08
opendevreviewMerged openstack/project-config master: Prevent recreate EOL'd branch
opendevreviewMerged openstack/project-config master: project-config-grafana: filter opendev-buildset-registry
opendevreviewwaleed mousa proposed openstack/diskimage-builder master: Add nm-dhcp-ib-interfaces element
*** iurygregory_ is now known as iurygregory11:18
tweiningHi. There are a lot of node failures recently. Is this something that you are aware of?
tweiningwell, is probably better11:53
*** amoralej is now known as amoralej|lunch11:53
fungiyou're probably the first to notice, but i'll run down a few examples and see if they have anything in common11:59
fungitweining: "a lot" is relative. looks like 7 so far today and all of those are for octavia, so almost certainly expected outcome of being down to only a single cloud provider who can supply nested-virt labeled nodes. none yesterday. there were two for neutron on saturday. a bunch for cinder-tempest-plugin on friday though12:03
fungithe node failures for neutron on friday were also trying to get nested-virt nodes12:08
fungier, on saturday12:08
fungii'm looking into friday now12:08
fungiinterestingly, these seem to pretty much all be stable branch jobs12:10
fungiall 20 of those are showing up as cancelled within a minute of one another in the scheduler's debug log so was likely related to an event around 17:00z that day12:14
fungiwhich only impacted a single job for a single project, so likely not a system-wide event. also they seem to all be from a series of do-not-merge changes which were being used to try to recreate errors12:20
fungitweining: and those are misleading, they never reported to gerrit because they were the result of build cancellations from a mass rebase. i suspect zuul should have reported those as cancelled instead of node_failure12:23
fungitweining: let's approach this another way... is there a specific build you want investigated?12:24
tweininglet me see...12:24
fungilike, what brought it to your attention?12:25
tweiningmaybe that one
fungiokay, that's for octavia so i can pretty much guarantee that ovh failed to boot a nested-virt node for it and because we have no other providers now who can supply those, nodepool gave up. i'll double-check that though12:26
tweiningI noticed it because I saw multiple Octavia CI builds fail today because of it (and because of timeout, which are probably unrelated)12:26
fungiso that build got node request 300-0021159460 for a nested-virt-ubuntu-jammy label12:30 accepted that node request at 06:38:17 on behalf of the ovh-gra1 region12:34
fungiit booted node 0033974049 and then at 06:48:18 logged that it timed out waiting for the instance to be created (so basically 10 minutes after the nova boot api call the server was still not in an active state)12:37
fungias a result it declined the node request on behalf of ovh-gra1 and then immediately accepted the request on behalf of ovh-bhs112:39
tweiningon Saturday this patch was merged that makes the CI use jammy nodes instead of focal: (except for stable branches, but that doesn't seem to work because I saw a build for stable/yoga was running on Jammy as well)12:41
tweiningI'm only talking about Octavia btw.12:41
fungiit booted 0033974292 for the new attempt and then at 06:58:20 similarly gave up waiting for it to become active12:42
fungiat that point it had exhausted all cloud providers capable of supplying nested-virt nodes, so the scheduler decided that node request 300-0021159460 was unsatisfiable and returned a node_failure result for that build12:44
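The sequence fungi walks through above (one provider accepts, times out after roughly ten minutes, declines, the next provider accepts, and so on until none are left) can be sketched roughly as below. This is only an illustration, not nodepool or zuul code: the `Provider` class, the `boot` hook, the `fulfill` function and the 600 second timeout are all assumptions inferred from the timestamps in the log.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumption: about 10 minutes, inferred from 06:38:17 -> 06:48:18 in the log.
BOOT_TIMEOUT_SECONDS = 600


@dataclass
class Provider:
    name: str
    # Hypothetical hook: returns the seconds it took for the instance to
    # become active, or None if it never became active.
    boot: Callable[[str], Optional[float]]


def fulfill(label: str, providers: List[Provider]) -> str:
    """Try each provider in turn; return its name, or NODE_FAILURE if all decline."""
    for provider in providers:
        elapsed = provider.boot(label)
        if elapsed is not None and elapsed <= BOOT_TIMEOUT_SECONDS:
            return provider.name
        # Boot timed out: this provider declines and the next one may accept.
    return "NODE_FAILURE"


# Both regions fail to reach an active state in time, as in the build above.
stuck = [
    Provider("ovh-gra1", lambda label: None),
    Provider("ovh-bhs1", lambda label: None),
]
result = fulfill("nested-virt-ubuntu-jammy", stuck)
```

With only two regions able to supply the label, two slow boots in a row are enough to exhaust the list, which is why adding more nested-virt providers reduces how often this happens.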
tweiningok, so your theory was right.12:45
fungitweining: my recommendation for reducing the incidence of situations like this is to work with our other donor providers to get reliable nested-virt capability in more places besides just ovh, or work with ovh to help them figure out why it can take >10 minutes for a server instance to go to active state, or work on making octavia's jobs no longer dependent on nested virt acceleration12:46
tweiningthanks. I will bring this topic up in our next team meeting on Wednesday12:47
fungiit's worth noting that we did have nested-virt labels in vexxhost's ca-ymq-1 region, but merged a change a little over a week ago to disable them there due to apparent network instability (possibly limited to nested-virt-ubuntu-jammy nodes):
tweiningyeah, I remember that12:51
fungiclarkb: what are the odds that the inmotion hardware is capable of nested virt acceleration? it might be interesting to test since we have admin access to the hypervisors there12:53
funginot that we have all that much capacity there, but it may put us in a position to get help from our community in diagnosing some of the issues with nested kvm in order to get better data back to the kernel developers12:56
ianwremoving vexxhost was mostly about the weird network dropout right?12:59
*** amoralej|lunch is now known as amoralej13:00
fungiyeah, though the change asserts it was nested-virt nodes which were impacted by it13:06
Clark[m]Because vexxhost doesn't supply normal nodes. And yes I think nested virt is enabled in inmotion. We didn't add it to the special labels because debugging kernel panics for it if it goes wrong is not something we have a ton of time for.13:15
opendevreviewChing Kuo proposed opendev/system-config master: Build jinja-init with Python 3.11 Base Images
opendevreviewChing Kuo proposed opendev/system-config master: Build eavesdrop with Python 3.11 Base Images
ianwyeah i only had a quick look but it really did seem like it could do an apt-get update and then couldn't pull packages13:28
ianwyou know what, that did make me think13:44
ianwfrom my logs, is a failed job logs13:44
ianw2023-04-28 10:58:51.169 | Err:1 focal/main amd64 bsdmainutils amd64 11.1.2ubuntu313:45
ianwis the first failure13:45
ianwApr 28 10:59:23 np0033882354 sudo[7951]:    stack : TTY=unknown ; PWD=/opt/stack/devstack ; USER=root ; COMMAND=/sbin/iptables --line-numbers -L -nv -t filter13:46
ianwahh i guess it doesn't line up.  something starts poking at iptables at 10:5913:47
ianwoh, you know what it is, it's the worlddump after it fails13:48
ianwfalse alarm13:48
ianwit would probably not be a bad idea for worlddump to also dump a few hundred of the last lines of dmesg in there13:50
ianwbut then again, what sort of oops takes out one apt operation, but leaves the host up and talking to collect the logs etc.  the mystery remains13:51
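ianw's idea of having worlddump also capture the last few hundred lines of dmesg could look something like the sketch below. This is not the actual worlddump code: the function names and the default of 300 lines are made up for illustration, and reading `dmesg` may need elevated privileges on some hosts.

```python
import subprocess
from typing import List


def tail_lines(text: str, count: int = 300) -> List[str]:
    """Return the last `count` lines of `text` (fewer if the text is shorter)."""
    if count <= 0:
        # Guard needed because a [-0:] slice would return every line.
        return []
    return text.splitlines()[-count:]


def dump_dmesg_tail(count: int = 300) -> List[str]:
    """Run `dmesg` and keep only the most recent kernel log lines."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return tail_lines(result.stdout, count)
```

Keeping the line trimming separate from the `dmesg` call makes the trimming easy to check without needing access to the kernel log.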
fungithe current theory is that there is some intermittent layer 2 connectivity issue between some hypervisor hosts there14:30
fungithe list of host_ids impacted was smallish, like ~714:31
clarkbinfra-root the changes from line 17 to 25 in should be good to go at this point assuming I didn't make any mistakes. I'm going to do local system updates but then plan to sync images that need syncing and hopefully we can keep merging things in that todo list15:09
clarkbinfra-root image syncs have been done for haproxy-statsd and accessbot to catch them up after they updated due to the base image update. I think we are good to go through line 63 of that etherpad15:32
clarkblanding in particular would be nice as the rest of the system-config changes are fairly independent so don't need to be stacked after that one lands15:34
clarkbI'm going to write the announcement email draft now15:37
clarkbhow does this look
opendevreviewClark Boylan proposed opendev/system-config master: Move pull external IRC bot images from
opendevreviewClark Boylan proposed opendev/system-config master: Pull grafyaml from
opendevreviewClark Boylan proposed openstack/project-config master: Pull grafyaml from
clarkbI'm going to sort out lodgeit changes next, but then I think I'll hold there because I'd like to merge some changes to keep this from becoming too unwieldy and give us a chance to double check everything is looking good without more inflight stuff to worry about15:56
*** amoralej is now known as amoralej|off16:04
opendevreviewClark Boylan proposed opendev/lodgeit master: Move lodgeit image publication to
opendevreviewClark Boylan proposed opendev/system-config master: Pull lodgeit from
fungiclarkb: announcement lgtm, thanks for putting it together!16:12
clarkbok I'll send that out now16:12
clarkbI left a note in the etherpad to show where the break point is. I think everything up to that point is ready for review and potential landing/migration now16:16
clarkbthats 21 open changes for this right now so a good spot to pause and double check things :) thank you for approving the assets image update16:17
clarkbfungi: openstack-zuul-jobs appears to need the ansible-compat pin do you know if a change for that exists yet?16:18
clarkbif not I can push it16:18
clarkbI pushed a change for that as I couldn't find one digging around in gerrit16:21
opendevreviewMerged opendev/system-config master: Move assets image to
opendevreviewClark Boylan proposed opendev/lodgeit master: Move lodgeit image publication to
opendevreviewClark Boylan proposed opendev/lodgeit master: Pin SQLAlchemy less than 2.0.0
clarkbdoing two changes like that for lodgeit means we will need to sync it after the first one lands. Thats fine. I'll make note of it on the etherpad16:27
clarkbcorvus: ianw: remind me what is the change tag cleanup process expected to be for these images?16:30
clarkbdo we still need to update our jobs to do that?16:30
* clarkb makes a note on the etherpad about that16:30
fungiclarkb: for now is the latest change i made around that16:39
fungii was hoping ansible-lint would have tagged a fixed release by now16:40
clarkbdoes not look like it. I think the update to ozj made things happy though I'm seeing a lot more green on my dashboard for those ~21 changes16:43
clarkbinfra-root note that the items are in a list on the etherpad but that doesn't necessarily imply a strict ordering. There is some ordering and that is captured by git parents/depends-on instead16:47
clarkbfor example can totally land now and has nothing to do with updating the zuul images to pull python-builder/python-base from quay.io16:48
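The point about the etherpad list not implying a strict order, with the real constraints living in git parents and depends-on, can be illustrated with a small dependency graph. The change names below are placeholders, not actual changes from this migration, and `landing_order` is a helper invented for this sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Placeholder change names; each maps to the set of changes it depends on.
depends_on = {
    "change-a": set(),
    "change-b": {"change-a"},
    "change-c": {"change-b"},
    "independent-change": set(),
}


def landing_order(deps):
    """Return one valid order in which every change lands after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = landing_order(depends_on)
```

Only the a, b, c chain is constrained; the independent change can land at any point, which matches the example given above of a change that can land right away.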
opendevreviewClark Boylan proposed opendev/grafyaml master: Migrate grafyaml container images to
clarkbI think that is the last of the -1s other than zuul-operator16:56
opendevreviewClark Boylan proposed opendev/grafyaml master: Migrate grafyaml container images to
clarkbok now I think they should all be green17:34
clarkb and are good next steps in this process17:35
clarkbreminder to add any missing items to the meeting agenda or let me know what they are and I can add them18:16
*** dmellado5 is now known as dmellado19:00
opendevreviewMerged opendev/gerritbot master: Fix gerritbot CI
*** dmellado2 is now known as dmellado19:30
*** blarnath is now known as d34dh0r5319:35
clarkbfungi: thank you for the reviews. We don't need to wait for the zuul stuff to land before moving to the next changes for opendev. I just had them listed early because they are small changes and not deeply tied into the opendev order of operations so easy to get out of the way early19:38
*** dmellado9 is now known as dmellado19:43
fungiyeah, i was just going through them in sequence19:46
fungithey were also the faster ones to review since they just changed the origins19:47
clarkbthat was also why I started there when writing changes :)19:49
clarkbI just used the new gerrit web ui bulk actions to set a bunch of change topics to opendev-quay20:02
clarkbreally great feature20:02
fungilike it's caught up with a feature gertty had since (5?) years20:02
clarkb now for ease of reviewing20:02
fungiooh, look at all those new changes i hadn't reviewed yet20:04
clarkbfungi: they are all listed on the etherpad too :)20:04
fungioh, i'm aware20:04
clarkbbut ya I realized you had reviewed only the changes with that one topic so decided to fix them having different topics20:04
fungii had just been focusing on the topic:opendev-quay changes first20:04
fungiand suddenly that query view in gertty spotted many new changes20:05
fungii just happened to have it active in my terminal when you retopiced20:05
clarkbfungi: if you review and and they look good I think you can approve them20:06
clarkbin particular the irc bots change updates the limnoria bot and there are no meetings for the rest of today so would be good to sneak that in if possible20:07
fungii'm in the middle of a board meeting for spi at the moment but will pick those back up once i'm free20:07
clarkbwe have until 0300 UTC tomorrow which is 7 hours away. should be plenty of time to get that in20:07
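The "7 hours away" estimate is easy to check, assuming the timestamps in this log are UTC (the log itself does not say so). The helper name `time_until_hour` is made up for this sketch.

```python
from datetime import datetime, timedelta, timezone


def time_until_hour(now: datetime, hour: int) -> timedelta:
    """Time from `now` until the next occurrence of `hour`:00 in the same timezone."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # The hour has already passed today, so use tomorrow's.
        target += timedelta(days=1)
    return target - now


# Assumption: the 20:07 timestamp on the message above is UTC.
said_at = datetime(2023, 5, 8, 20, 7, tzinfo=timezone.utc)
remaining = time_until_hour(said_at, 3)  # 6 hours 53 minutes
```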
clarkbfungi: for something completely different in adding a tox target seems reasonable since that is an entrypoint to executing things we expect people to be able to use for python around here20:33
fungisure, i can add one in a followup change20:35
opendevreviewMerged opendev/lodgeit master: Pin SQLAlchemy less than 2.0.0
clarkbI'll work on resyncing gerritbot and lodgeit from docker hub to momentarily20:37
clarkbthat is done20:41
clarkbI went ahead and single core approved since that has very minimal user impact should anything go wrong20:43
clarkbfungi: do you think I should single core approve the irc bot change too? It has more potential for user visible impacts20:43
clarkbI'm going to have to delete the wrongly named zookeeker-statsd image from quay again. I should've waited for the change above to merge first before deleting it the first time20:44
clarkb(it got recreated when we landed the change to update the base images because that triggered rebuilds of that image)20:44
fungioh, oops20:45
fungithanks for the cleanup20:45
clarkbwell I wrote the typo the first time around too :)20:45
fungii'm not opposed to the irc bot change being single-core approved20:45
fungithe changes are announced, and we don't merge changes to them often anyway20:46
clarkbok I'll go ahead and do that20:46
clarkbonce this batch gets through we can probably move the three external irc bots. There is a fourth change to have system-config deploy them from the new location but that should be pretty unimpactful (statusbot, ptgbot, gerritbot)21:43
opendevreviewMerged opendev/system-config master: Migrate statsd sidecar container images to
opendevreviewMerged opendev/system-config master: Move system-config irc bots into
clarkbmy meeting agenda edits are in.21:50
clarkbit is joining one channel each second22:01
clarkbso this may take a few minutes to get everywhere I guess22:02
clarkbit is back here now though22:02
clarkbI think this is looking well. I'm going to go ahead and approve image moves for the other bots now22:02
clarkbalso haproxy-statsd updated on gitea-lb02 and as far as I can tell is fine22:02
clarkbstarting with as there is no ptg currently so the impact will be minimal22:04
clarkbzookeeker-statsd has been deleted. The typo was fixed in 882478 so we shouldn't see it get created again22:17
clarkbptgbot looks good. Doing gerritbot and then statusbot22:19
clarkber I have that backwards. statusbot then gerritbot I guess22:21
ianwclarkb: tags should be removed automatically, via the api key, when it sees it in there, iirc.  that key should come from the credentials same as the creation role22:22
clarkbianw: right that was the role you added but did it get added to our base jobs?22:24
clarkb still has it for example22:24
clarkbI guess that is a downside to not adding the api token to the external repos.22:24
clarkbI'll have to think about how that will work since I don't think we can scope the tokens to specific projects like we can with the robot user22:25
opendevreviewMerged opendev/statusbot master: Move statusbot to
clarkbfwiw I don't think it is urgent which is why I put it at the end of my todo list22:25
ianw... hrm, i thought we were using the upload-from-intermediate-registry approach (as opposed to the promote from already uploaded), that i guess shouldn't make the temporary tags?22:30
clarkboh! yes I mean the changes are all configured that way but maybe improperly?22:32
clarkbyou can see
clarkbcorvus: ianw: if anyone else is able to look into that I think that would be great as I've still got tons of changes in flight and trying to keep everything else in order has been fun enough. But if you can't let me know and I'll try to take a look22:34
clarkbalso I can stop approving things after gerritbot (and its application change in system-config) to minimize cleanup. I was going to pause about here anyway just due to lack of daylight22:35
clarkbI think zuul may have the same issue so not specific to what I've been doing at least22:37
opendevreviewMerged opendev/gerritbot master: Move gerritbot to
clarkbI've approved and will pause here since I'm not sure I can monitor updates afterwards22:40
opendevreviewMerged opendev/system-config master: Move pull external IRC bot images from
clarkball three bots appear to be updated on eavesdrop23:20
clarkbI've updated the etherpad to reflect what was completed today and left notes about cleaning up change tags23:22
