*** amoralej|off is now known as amoralej | 06:08 | |
opendevreview | Merged openstack/project-config master: Prevent recreate EOL'd branch https://review.opendev.org/c/openstack/project-config/+/881731 | 07:52 |
opendevreview | Merged openstack/project-config master: project-config-grafana: filter opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847870 | 07:54 |
opendevreview | waleed mousa proposed openstack/diskimage-builder master: Add nm-dhcp-ib-interfaces element https://review.opendev.org/c/openstack/diskimage-builder/+/882507 | 08:40 |
*** iurygregory_ is now known as iurygregory | 11:18 | |
tweining | Hi. There are a lot of node failures recently. Is this something that you are aware of? https://zuul.opendev.org/t/openstack/builds?result=NODE_FAILURE&skip=0 | 11:52 |
tweining | well, https://zuul.opendev.org/t/openstack/builds?pipeline=check&result=NODE_FAILURE&skip=0 is probably better | 11:53 |
*** amoralej is now known as amoralej|lunch | 11:53 | |
fungi | you're probably the first to notice, but i'll run down a few examples and see if they have anything in common | 11:59 |
fungi | tweining: "a lot" is relative. looks like 7 so far today and all of those are for octavia, so almost certainly expected outcome of being down to only a single cloud provider who can supply nested-virt labeled nodes. none yesterday. there were two for neutron on saturday. a bunch for cinder-tempest-plugin on friday though | 12:03 |
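The counts above can also be pulled out of the Zuul REST API that backs those dashboard URLs; a minimal sketch (the field names are assumptions based on that API's build listing, not anything fungi actually ran):

```python
# Count recent NODE_FAILURE builds per project via the Zuul REST API
# (the same data behind the dashboard URLs above).
import collections
import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
resp = requests.get(API, params={"result": "NODE_FAILURE", "limit": 50}, timeout=30)
resp.raise_for_status()

counts = collections.Counter(build["project"] for build in resp.json())
for project, count in counts.most_common():
    print(f"{count:3d}  {project}")
```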
fungi | the node failures for neutron on friday were also trying to get nested-virt nodes | 12:08 |
fungi | er, on saturday | 12:08 |
fungi | i'm looking into friday now | 12:08 |
fungi | interestingly, these seem to pretty much all be stable branch jobs | 12:10 |
fungi | all 20 of those are showing up as cancelled within a minute of one another in the scheduler's debug log so it was likely related to an event around 17:00z that day | 12:14
fungi | which only impacted a single job for a single project, so likely not a system-wide event. also they seem to all be from a series of do-not-merge changes which were being used to try to recreate errors | 12:20 |
fungi | tweining: and those are misleading, they never reported to gerrit because they were the result of build cancellations from a mass rebase. i suspect zuul should have reported those as cancelled instead of node_failure | 12:23 |
fungi | tweining: let's approach this another way... is there a specific build you want investigated? | 12:24 |
tweining | let me see... | 12:24 |
fungi | like, what brought it to your attention? | 12:25 |
tweining | maybe that one https://zuul.opendev.org/t/openstack/build/637eddd7d1924851bb175492578008d1 | 12:25 |
fungi | okay, that's for octavia so i can pretty much guarantee that ovh failed to boot a nested-virt node for it and because we have no other providers now who can supply those, nodepool gave up. i'll double-check that though | 12:26 |
tweining | I noticed it because I saw multiple Octavia CI builds fail today because of it (and because of timeout, which are probably unrelated) | 12:26 |
fungi | so that build got node request 300-0021159460 for a nested-virt-ubuntu-jammy label | 12:30 |
fungi | nl04.opendev.org accepted that node request at 06:38:17 on behalf of the ovh-gra1 region | 12:34 |
fungi | it booted node 0033974049 and then at 06:48:18 logged that it timed out waiting for the instance to be created (so basically 10 minutes after the nova boot api call the server was still not in an active state) | 12:37 |
fungi | as a result it declined the node request on behalf of ovh-gra1 and then immediately accepted the request on behalf of ovh-bhs1 | 12:39 |
tweining | on Saturday this patch was merged that makes the CI use jammy nodes instead of focal: https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/861369/12/zuul.d/jobs.yaml (except for stable branches, but that doesn't seem to work because I saw a build for stable/yoga was running on Jammy as well) | 12:41 |
tweining | I'm only talking about Octavia btw. | 12:41 |
fungi | it booted 0033974292 for the new attempt and then at 06:58:20 similarly gave up waiting for it to become active | 12:42 |
fungi | at that point it had exhausted all cloud providers capable of supplying nested-virt nodes, so the scheduler decided that node request 300-0021159460 was unsatisfiable and returned a node_failure result for that build | 12:44 |
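The flow fungi just traced — a provider accepts the request, boots a node, declines after roughly ten minutes waiting for ACTIVE, the next capable provider tries, and the request fails once everyone has declined — amounts to roughly the following. This is a simplified sketch of the behavior described above, not nodepool's actual code; the provider interface is hypothetical.

```python
# Simplified sketch of the behavior described above: each provider that can
# supply the label gets a turn; the request only fails once all have declined.
# The "provider" objects and their methods here are hypothetical.
import time

BOOT_TIMEOUT = 600  # ~10 minutes, matching the timeouts in the log above


def wait_for_active(provider, node_id, timeout=BOOT_TIMEOUT):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if provider.get_status(node_id) == "ACTIVE":
            return True
        time.sleep(10)
    return False


def handle_request(request, providers):
    for provider in providers:                  # e.g. ovh-gra1, then ovh-bhs1
        node_id = provider.boot(request.label)  # the nova boot API call
        if wait_for_active(provider, node_id):
            provider.fulfill(request, node_id)
            return
        provider.cleanup(node_id)
        provider.decline(request)               # "declined the node request"
    request.fail()                              # surfaces as NODE_FAILURE in Zuul
```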
tweining | ok, so your theory was right. | 12:45 |
fungi | tweining: my recommendation for reducing the incidence of situations like this is to work with our other donor providers to get reliable nested-virt capability in more places besides just ovh, or work with ovh to help them figure out why it can take >10 minutes for a server instance to go to active state, or work on making octavia's jobs no longer dependent on nested virt acceleration | 12:46 |
tweining | thanks. I will bring this topic up in our next team meeting on Wednesday | 12:47 |
fungi | it's worth noting that we did have nested-virt labels in vexxhost's ca-ymq-1 region, but merged a change a little over a week ago to disable them there due to apparent network instability (possibly limited to nested-virt-ubuntu-jammy nodes): https://review.opendev.org/881810 | 12:50 |
tweining | yeah, I remember that | 12:51 |
fungi | clarkb: what are the odds that the inmotion hardware is capable of nested virt acceleration? it might be interesting to test since we have admin access to the hypervisors there | 12:53 |
fungi | not that we have all that much capacity there, but it may put us in a position to get help from our community in diagnosing some of the issues with nested kvm in order to get better data back to the kernel developers | 12:56 |
ianw | removing vexxhost was mostly about the weird network dropout right? | 12:59 |
*** amoralej|lunch is now known as amoralej | 13:00 | |
fungi | yeah, though the change asserts it was nested-virt nodes which were impacted by it | 13:06 |
Clark[m] | Because vexxhost doesn't supply normal nodes. And yes I think nested virt is enabled in inmotion. We didn't add it to the special labels because debugging kernel panics for it if it goes wrong is not something we have a ton of time for. | 13:15 |
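For anyone checking whether a given hypervisor or test node actually has nested KVM turned on, the kernel exposes it as a module parameter; a small sketch:

```python
# Check whether nested KVM is enabled by reading the standard kernel module
# parameters ("Y" or "1" means nested virtualization is on).
from pathlib import Path


def nested_virt_enabled():
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(f"/sys/module/{module}/parameters/nested")
        if param.exists():
            return param.read_text().strip() in ("Y", "1")
    return False


print("nested virt enabled:", nested_virt_enabled())
```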
opendevreview | Ching Kuo proposed opendev/system-config master: Build jinja-init with Python 3.11 Base Images https://review.opendev.org/c/opendev/system-config/+/882563 | 13:22 |
opendevreview | Ching Kuo proposed opendev/system-config master: Build eavesdrop with Python 3.11 Base Images https://review.opendev.org/c/opendev/system-config/+/882564 | 13:26 |
ianw | yeah i only had a quick look but it really did seem like it could do an apt-get update and then couldn't pull packages | 13:28 |
ianw | you know what, that did make me think | 13:44 |
ianw | from my logs, https://04313aea8028a6223239-76770e8376bdcb5a12e0ef605f8b8d22.ssl.cf5.rackcdn.com/881742/3/check/neutron-tempest-plugin-designate-scenario/6f701ec/controller/logs/index.html is a failed job's logs | 13:44
ianw | 2023-04-28 10:58:51.169 | Err:1 https://mirror.ca-ymq-1.vexxhost.opendev.org/ubuntu focal/main amd64 bsdmainutils amd64 11.1.2ubuntu3 | 13:45 |
ianw | is the first failure | 13:45 |
ianw | Apr 28 10:59:23 np0033882354 sudo[7951]: stack : TTY=unknown ; PWD=/opt/stack/devstack ; USER=root ; COMMAND=/sbin/iptables --line-numbers -L -nv -t filter | 13:46 |
ianw | ahh i guess it doesn't line up. something starts poking at iptables at 10:59 | 13:47 |
ianw | oh, you know what it is, it's the worlddump after it fails | 13:48 |
ianw | false alarm | 13:48 |
ianw | it would probably not be a bad idea for worlddump to also dump a few hundred of the last lines of dmesg in there | 13:50 |
ianw | but then again, what sort of oops takes out one apt operation, but leaves the host up and talking to collect the logs etc. the mystery remains | 13:51 |
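If someone does pick up ianw's worlddump suggestion, the addition is small; a hypothetical sketch (the output path is illustrative, and this is not a patch against devstack's actual worlddump.py):

```python
# Hypothetical helper for the suggestion above: write the tail of dmesg
# alongside the other worlddump output. Not devstack's actual worlddump.py;
# the output path is illustrative.
import subprocess


def dump_dmesg_tail(output_path, lines=300):
    dmesg = subprocess.run(
        ["dmesg", "--time-format", "iso"],
        capture_output=True, text=True, check=False,
    ).stdout.splitlines()
    with open(output_path, "w") as f:
        f.write("\n".join(dmesg[-lines:]) + "\n")


dump_dmesg_tail("/opt/stack/logs/worlddump-dmesg.txt")
```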
fungi | the current theory is that there is some intermittent layer 2 connectivity issue between some hypervisor hosts there | 14:30 |
fungi | the list of host_ids impacted was smallish, like ~7 | 14:31 |
clarkb | infra-root the changes from line 17 to 25 in https://etherpad.opendev.org/p/opendev-quay-migration-2023 should be good to go at this point assuming I didn't make any mistakes. I'm going to do local system updates but then plan to sync images that need syncing and hopefully we can keep merging things in that todo list | 15:09 |
clarkb | infra-root image syncs have been done for haproxy-statsd and accessbot to catch them up after they updated due to the base image update. I think we are good to go through line 63 of that etherpad | 15:32
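For reference, a sync like that can be done with skopeo without pulling through a local docker daemon; a hedged sketch (the docker.io source path is an assumption, and this may not match exactly how the opendev sync was run):

```python
# Example of copying an image from Docker Hub to quay.io with skopeo copy.
# The docker.io source path is an assumption; this may not match the exact
# commands used for the opendev sync.
import subprocess


def sync_image(name, tag="latest"):
    src = f"docker://docker.io/opendevorg/{name}:{tag}"
    dst = f"docker://quay.io/opendevorg/{name}:{tag}"
    subprocess.run(["skopeo", "copy", "--all", src, dst], check=True)


for image in ("haproxy-statsd", "accessbot"):
    sync_image(image)
```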
clarkb | landing https://review.opendev.org/c/opendev/system-config/+/882467 in particular would be nice as the rest of the system-config changes are fairly independent so don't need to be stacked after that one lands | 15:34
clarkb | I'm going to write the announcement email draft now | 15:37 |
clarkb | how does this look https://etherpad.opendev.org/p/giIxa4d_Q18aDmfhKNut | 15:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move pull external IRC bot images from quay.io https://review.opendev.org/c/opendev/system-config/+/882571 | 15:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: Pull grafyaml from quay.io https://review.opendev.org/c/opendev/system-config/+/882573 | 15:52 |
opendevreview | Clark Boylan proposed openstack/project-config master: Pull grafyaml from quay.io https://review.opendev.org/c/openstack/project-config/+/882576 | 15:55 |
clarkb | I'm going to sort out lodgeit changes next, but then I think I'll hold there because I'd like to merge some changes to keep this from becoming too unwieldy and give us a chance to double check everything is looking good without more inflight stuff to worry about | 15:56 |
*** amoralej is now known as amoralej|off | 16:04 | |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Move lodgeit image publication to quay.io https://review.opendev.org/c/opendev/lodgeit/+/882590 | 16:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Pull lodgeit from quay.io https://review.opendev.org/c/opendev/system-config/+/882593 | 16:12 |
fungi | clarkb: announcement lgtm, thanks for putting it together! | 16:12 |
clarkb | ok I'll send that out now | 16:12 |
clarkb | I left a note in the etherpad to show where the break point is. I think everything up to that point is ready for review and potential landing/migration now | 16:16 |
clarkb | that's 21 open changes for this right now so a good spot to pause and double check things :) thank you for approving the assets image update | 16:17
clarkb | fungi: openstack-zuul-jobs appears to need the ansible-compat pin. do you know if a change for that exists yet? | 16:18
clarkb | if not I can push it | 16:18 |
clarkb | I pushed a change for that as I couldn't find one digging around in gerrit | 16:21
opendevreview | Merged opendev/system-config master: Move assets image to quay.io https://review.opendev.org/c/opendev/system-config/+/882467 | 16:25 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Move lodgeit image publication to quay.io https://review.opendev.org/c/opendev/lodgeit/+/882590 | 16:26 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Pin SQLAlchemy less than 2.0.0 https://review.opendev.org/c/opendev/lodgeit/+/882603 | 16:26 |
clarkb | doing two changes like that for lodgeit means we will need to sync it after the first one lands. That's fine. I'll make note of it on the etherpad | 16:27
clarkb | corvus: ianw: remind me what the change tag cleanup process is expected to be for these images? | 16:30
clarkb | do we still need to update our jobs to do that? | 16:30 |
* clarkb makes a note on the etherpad about that | 16:30 | |
fungi | clarkb: for now https://review.opendev.org/882482 is the latest change i made around that | 16:39 |
fungi | i was hoping ansible-lint would have tagged a fixed release by now | 16:40 |
clarkb | does not look like it. I think the update to ozj made things happy though; I'm seeing a lot more green on my dashboard for those ~21 changes | 16:43
clarkb | infra-root note that the items are in a list on the etherpad but that doesn't necessarily imply a strict ordering. There is some ordering and that is captured by git parents/depends-on instead | 16:47
clarkb | for example https://review.opendev.org/c/opendev/system-config/+/882478 can totally land now and has nothing to do with updating the zuul images to pull python-builder/python-base from quay.io | 16:48 |
opendevreview | Clark Boylan proposed opendev/grafyaml master: Migrate grafyaml container images to quay.io https://review.opendev.org/c/opendev/grafyaml/+/882493 | 16:56 |
clarkb | I think that is the last of the -1s other than zuul-operator | 16:56 |
opendevreview | Clark Boylan proposed opendev/grafyaml master: Migrate grafyaml container images to quay.io https://review.opendev.org/c/opendev/grafyaml/+/882493 | 17:34 |
clarkb | ok now I think they should all be green | 17:34 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/882478 and https://review.opendev.org/c/opendev/system-config/+/882486 are good next steps in this process | 17:35 |
clarkb | reminder to add any missing items to the meeting agenda or let me know what they are and I can add them | 18:16 |
*** dmellado5 is now known as dmellado | 19:00 | |
opendevreview | Merged opendev/gerritbot master: Fix gerritbot CI https://review.opendev.org/c/opendev/gerritbot/+/882490 | 19:05 |
*** dmellado2 is now known as dmellado | 19:30 | |
*** blarnath is now known as d34dh0r53 | 19:35 | |
clarkb | fungi: thank you for the reviews. We don't need to wait for the zuul stuff to land before moving to the next changes for opendev. I just had them listed early because they are small changes and not deeply tied into the opendev order of operations so easy to get out of the way early | 19:38 |
*** dmellado9 is now known as dmellado | 19:43 | |
fungi | yeah, i was just going through them in sequence | 19:46 |
clarkb | gotcha | 19:46 |
fungi | they were also the faster ones to review since they just changed the origins | 19:47 |
clarkb | that was also why I started there when writing changes :) | 19:49 |
clarkb | I just used the new gerrit web ui bulk actions to set a bunch of change topics to opendev-quay | 20:02 |
clarkb | really great feature | 20:02 |
fungi | like it's caught up with a feature gertty has had for (5?) years | 20:02
clarkb | yup | 20:02 |
clarkb | https://review.opendev.org/q/topic:opendev-quay now for ease of reviewing | 20:02 |
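The same topic query is scriptable against the Gerrit REST API if anyone wants a list outside the web UI; a small sketch (note Gerrit prefixes its JSON responses with a )]}' guard line that has to be stripped):

```python
# List open changes with a given topic via the Gerrit REST API. Gerrit
# prepends )]}' to JSON responses, so strip it before parsing.
import json
import requests

resp = requests.get(
    "https://review.opendev.org/changes/",
    params={"q": "topic:opendev-quay status:open"},
    timeout=30,
)
resp.raise_for_status()
changes = json.loads(resp.text.split("\n", 1)[1])
for change in changes:
    print(change["_number"], change["subject"])
```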
fungi | ooh, look at all those new changes i hadn't reviewed yet | 20:04 |
clarkb | fungi: they are all listed on the etherpad too :) | 20:04 |
fungi | oh, i'm aware | 20:04 |
clarkb | but ya I realized you had reviewed only the changes with that one topic so I decided to fix them having different topics | 20:04
fungi | i had just been focusing on the topic:opendev-quay changes first | 20:04 |
fungi | and suddenly that query view in gertty spotted many new changes | 20:05 |
fungi | i just happened to have it active in my terminal when you retopicked | 20:05
clarkb | fungi: if you review https://review.opendev.org/c/opendev/system-config/+/882478?usp=dashboard and https://review.opendev.org/c/opendev/system-config/+/882486?usp=dashboard and they look good I think you can approve them | 20:06 |
clarkb | in particular the irc bots change updates the limnoria bot and there are no meetings for the rest of today so would be good to sneak that in if possible | 20:07 |
fungi | yeah | 20:07 |
fungi | i'm in the middle of a board meeting for spi at the moment but will pick those back up once i'm free | 20:07
clarkb | we have until 0300 UTC tomorrow which is 7 hours away. should be plenty of time to get that in | 20:07 |
clarkb | thanks! | 20:07 |
clarkb | fungi: for something completely different in https://review.opendev.org/c/openstack/project-config/+/882075 adding a tox target seems reasonable since that is an entrypoint to executing things we expect people to be able to use for python around here | 20:33 |
fungi | sure, i can add one in a followup change | 20:35 |
opendevreview | Merged opendev/lodgeit master: Pin SQLAlchemy less than 2.0.0 https://review.opendev.org/c/opendev/lodgeit/+/882603 | 20:35 |
clarkb | I'll work on resyncing gerritbot and lodgeit from docker hub to quay.io momentarily | 20:37 |
clarkb | that is done | 20:41 |
clarkb | I went ahead and single core approved https://review.opendev.org/c/opendev/system-config/+/882478 since that has very minimal user impact should anything go wrong | 20:43 |
clarkb | fungi: do you think I should single core approve the irc bot change too? It has more potential for user visible impacts | 20:43 |
clarkb | I'm going to have to delete the wrongly named zookeeker-statsd image from quay again. I should've waited for the change above to merge first before deleting it the first time | 20:44
clarkb | (it got recreated when we landed the change to update the base images because that triggered rebuilds of that image) | 20:44 |
fungi | oh, oops | 20:45 |
fungi | thanks for the cleanup | 20:45 |
clarkb | well I wrote the typo the first time around too :) | 20:45 |
fungi | i'm not opposed to the irc bot change being single-core approved | 20:45 |
fungi | the changes are announced, and we don't merge changes to them often anyway | 20:46 |
clarkb | ok I'll go ahead and do that | 20:46 |
clarkb | once this batch gets through we can probably move the three external irc bots. There is a fourth change to have system-config deploy them from the new location but that should be pretty unimpactful (statusbot, ptgbot, gerritbot) | 21:43 |
opendevreview | Merged opendev/system-config master: Migrate statsd sidecar container images to quay.io https://review.opendev.org/c/opendev/system-config/+/882478 | 21:49 |
opendevreview | Merged opendev/system-config master: Move system-config irc bots into quay.io https://review.opendev.org/c/opendev/system-config/+/882486 | 21:49 |
clarkb | my meeting agenda edits are in. | 21:50 |
clarkb | it is joining one channel each second | 22:01 |
clarkb | so this may take a few minutes to get everywhere I guess | 22:02 |
clarkb | it is back here now though | 22:02 |
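That pacing is just join flood protection; conceptually it is nothing more than the following (a sketch, not limnoria's actual scheduling code), which is why a bot in many channels takes a few minutes to show up everywhere:

```python
# Conceptual sketch of the join throttling described above (one channel per
# second); not limnoria's actual implementation.
import time


def join_channels(send_line, channels, delay=1.0):
    for channel in channels:
        send_line(f"JOIN {channel}")
        time.sleep(delay)  # ~1s apart, so 100+ channels take a couple of minutes
```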
clarkb | I think this is looking good. I'm going to go ahead and approve image moves for the other bots now | 22:02
clarkb | also haproxy-statsd updated on gitea-lb02 and as far as I can tell is fine | 22:02 |
clarkb | starting with https://review.opendev.org/c/openstack/ptgbot/+/882489 as there is no ptg currently so the impact will be minimal | 22:04 |
clarkb | zookeeker-statsd has been deleted. The typo was fixed in 882478 so we shouldn't see it get created again | 22:17
clarkb | ptgbot looks good. Doing gerritbot and then statusbot | 22:19 |
clarkb | er I have that backwards. statusbot then gerritbot I guess | 22:21
ianw | clarkb: tags should be removed automatically, via the api key, when it sees quay.io in there, iirc. that key should come from the credentials, same as the creation role | 22:22
clarkb | ianw: right that was the role you added but did it get added to our base jobs? | 22:24
clarkb | https://quay.io/repository/opendevorg/haproxy-statsd?tab=tags still has it for example | 22:24
clarkb | I guess that is a downside to not adding the api token to the external repos. | 22:24 |
clarkb | I'll have to think about how that will work since I don't think we can scope the tokens to specific projects like we can with the robot user | 22:25 |
opendevreview | Merged opendev/statusbot master: Move statusbot to quay.io https://review.opendev.org/c/opendev/statusbot/+/882488 | 22:25 |
clarkb | fwiw I don't think it is urgent which is why I put it at the end of my todo list | 22:25 |
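For the eventual manual cleanup, quay.io's REST API does allow deleting an individual tag with a bearer token; a sketch (the repository name matches the link above, but the tag name and token handling here are placeholders):

```python
# Sketch of deleting a leftover temporary tag via the quay.io REST API with an
# OAuth bearer token; the tag name below is a placeholder.
import os
import requests

QUAY_API = "https://quay.io/api/v1"
token = os.environ["QUAY_API_TOKEN"]


def delete_tag(repository, tag):
    resp = requests.delete(
        f"{QUAY_API}/repository/{repository}/tag/{tag}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()


delete_tag("opendevorg/haproxy-statsd", "some_temporary_change_tag")
```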
ianw | ... hrm, i thought we were using the upload-from-intermediate-registry approach (as opposed to promoting from already uploaded), which i guess shouldn't make the temporary tags? | 22:30
clarkb | oh! yes I mean the changes are all configured that way but maybe improperly? | 22:32 |
clarkb | you can see https://review.opendev.org/c/opendev/gerritbot/+/882487/5/.zuul.yaml#31 | 22:32 |
clarkb | corvus: ianw: if anyone else is able to look into that I think that would be great as I've still got tons of changes in flight and trying to keep everything else in order has been fun enough. But if you can't let me know and I'll try to take a look | 22:34 |
clarkb | also I can stop approving things after gerritbot (and its application change in system-config) to minimize cleanup. I was going to pause about here anyway just due to lack of daylight | 22:35 |
clarkb | I think zuul may have the same issue so not specific to what I've been doing at least | 22:37
opendevreview | Merged opendev/gerritbot master: Move gerritbot to quay.io https://review.opendev.org/c/opendev/gerritbot/+/882487 | 22:37 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/882571 and will pause here since I'm not sure I can monitor updates afterwards | 22:40 |
opendevreview | Merged opendev/system-config master: Move pull external IRC bot images from quay.io https://review.opendev.org/c/opendev/system-config/+/882571 | 23:08 |
clarkb | all three bots appear to be updated on eavesdrop | 23:20 |
fungi | awesome | 23:21 |
clarkb | I've updated the etherpad to reflect what was completed today and left notes about cleaning up change tags | 23:22 |