opendevreview | Merged openstack/project-config master: Add charms-stable-maint group for charms projects https://review.opendev.org/c/openstack/project-config/+/872406 | 00:28 |
opendevreview | Merged openstack/project-config master: Retire puppet-tacker - Step 1: End project Gating https://review.opendev.org/c/openstack/project-config/+/874539 | 00:29 |
fungi | clarkb: ^ you were wanting to see successive project-config changes and how checkouts impacted things at deployment | 00:30 |
fungi | i approved a few | 00:30 |
opendevreview | Merged openstack/project-config master: Periodically update Puppetfile_unit https://review.opendev.org/c/openstack/project-config/+/875302 | 00:31 |
clarkb | thanks. I think there is a weird interaction where the job fails successfully or something in the rename case though? it's definitely something I want to dig into to understand better and probably document | 00:32 |
opendevreview | Merged openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 00:32 |
opendevreview | Merged openstack/project-config master: Add Ironic Dashboard charm to OpenStack charms https://review.opendev.org/c/openstack/project-config/+/876205 | 00:32 |
clarkb | Tomorrow I'll look at finishing up the gitea05-07 deletions | 00:36 |
clarkb | I don't expect anyone has stashed anything they need on those servers but infra-root consider this your warning | 00:37 |
fungi | i definitely haven't | 00:37 |
opendevreview | Merged openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 01:43 |
fungi | yoctozepto: ^ that's deployed | 01:57 |
fungi | https://zuul.opendev.org/t/nebulous/jobs | 01:58 |
fungi | just inherited stuff for now, but it's there and ready for next steps | 01:59 |
opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/877057 | 02:15 |
ianw | i took the liberty of adding yoctozepto to nebulous-core | 02:41 |
fungi | oh, good thinking! | 02:41 |
fungi | ianw: nebulous-project-config-core was made as a separate group too | 02:43 |
ianw | ok i added to that too :) luckily i still have the admin console up from pushing changes yesterday | 02:44 |
fungi | thanks! | 02:45 |
opendevreview | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/877057 | 02:54 |
WhoisJMH | hello, I have a question. In an openstack environment built using devstack on Ubuntu 20.04, the existing instances were created and are operating well without problems. But when I try to create a new instance, the message "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance" appears and the creation fails. Although it is a single-node environment, the server has enough resources for cpu, ram, | 07:12 |
WhoisJMH | Which part should I check first to solve this problem? | 07:12 |
yoctozepto | WhoisJMH: hi, the nova-compute logs will be the best place to look for the reason for the rejection; just note this channel is not devoted to openstack support, please go to #openstack for further queries | 07:38 |
yoctozepto | ianw, fungi, clarkb, frickler: thanks for all your feedback on the new project+tenant and for merging that; I will proceed with setting up the tenant today and let you know how it goes | 07:43 |
*** jpena|off is now known as jpena | 07:46 | |
yoctozepto | just one last request for now - please also add me to nebulous-release ;D | 07:57 |
ianw | yoctozepto: done :) | 08:22 |
yoctozepto | ianw: many thanks | 08:23 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 09:00 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 09:10 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 09:31 |
bbezak | Hi, I'm having quite a lot of network connectivity issues - only involves 'provider: inmotion-iad3' ones. | 10:52 |
bbezak | Interestingly it happens towards tarballs.openstack.org. On both ubuntu (focal, jammy) and centos stream 8 jobs. More often than not, I'm afraid (but I saw good runs too on occasion on iad3, just less often). I haven't seen those issues on the 'rax' provider for instance - https://paste.opendev.org/raw/btENz9poC0tQ0p3t7Hny/ | 10:52 |
bbezak | by the look of it, it started yesterday | 10:52 |
fungi | bbezak: looks like it's not just tarballs.o.o, the first failure i pulled up was complaining about reaching the releases site: https://zuul.opendev.org/t/openstack/build/3a9c0f69727f47ba8e7747eba3f2d678/log/primary/ansible/tenks-deploy#2030-2036 | 12:36 |
fungi | but still from a node in inmotion-iad3 | 12:37 |
bbezak | I've seen issues with releases as well. But not in last several runs, so I didn't mention it | 12:37 |
fungi | well, it helps to know that there's more than one site the jobs are having trouble reaching from there | 12:40 |
fungi | and the nodes in that region are ipv4-only so we can rule out ipv6-related issues | 12:40 |
bbezak | however those are resolving to the same static01.opendev.org fungi | 12:43 |
bbezak | (at least from my end) | 12:44 |
fungi | oh, yes that's a good point, they're different sites on the same vm | 12:47 |
fungi | anyway, i'm checking for connectivity issues between that provider region and those sites | 12:48 |
bbezak | thx fungi | 12:48 |
fungi | not seeing any packet loss at the moment | 12:50 |
bbezak | it just failed on 173.231.253.119 fungi | 13:16 |
bbezak | it got 200 on https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-xena.vmlinuz.sha256, but got Network is unreachable on https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-xena.vmlinuz | 13:18 |
fungi | yeah, whatever it is, it's clearly intermittent | 13:19 |
bbezak | yeah, "the best" kind | 13:20 |
fungi | could be an overloaded router in that provider's core network and only some flows are getting balanced through it, for example | 13:20 |
fungi | i'm still trying to reproduce connectivity errors with lower-level tools | 13:20 |
fungi | could also be farther out on the internet in some backbone provider | 13:22 |
fungi | the route between those environments is, unfortunately, asymmetric, so it will be harder to track down if so | 13:22 |
fungi | looks like from inmotion to rackspace (where the static server resides) both providers peer with zayo, while in the other direction they both peer with level3 | 13:24 |
fungi | going through zayo it transits their atl facility to get from iad to dfw, though the level3 hop between dfw and iad is not identifying itself currently | 13:27 |
fungi | mtr from rackspace to inmotion is recording around 0.2-0.3% packet loss at the moment | 13:28 |
fungi | not seeing any in the other direction, which is strange, but maybe just not a statistically significant volume of samples yet | 13:29 |
bbezak | ok | 13:31 |
fungi | bbezak: one thing to keep in mind, jobs shouldn't normally need to fetch urls like https://releases.openstack.org/constraints/upper/yoga since they can access the same constraints file from the openstack/requirements repository checkout provided on the test node | 13:32 |
fungi | and we could look into baking the tinyipa kernels into our node images in order to reduce further traffic across the internet, or add the tarballs site to our mirrors in all providers (they're both backed by data in afs, so it would just be a matter of adding a path in the apache vhost to expose that to clients) | 13:34 |
fungi | making connections across the internet in a job should be avoided whenever possible (though we perform some brief internet connectivity tests in pre-run for all jobs in order to weed out test nodes with obviously bad internet connections) | 13:38 |
bbezak | yeah, that makes sense, we have the var for requirements_src_dir already in the job, so it shouldn't be difficult to override it for the CI only | 13:40 |
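For illustration, a minimal sketch of that kind of override, assuming the job already consumes a requirements_src_dir variable as bbezak mentions (the job name and the constraints variable name here are hypothetical, not taken from the actual role):

```yaml
# Hypothetical job override: read upper-constraints from the
# openstack/requirements checkout Zuul provides on the test node instead of
# fetching https://releases.openstack.org/constraints/upper/yoga remotely.
- job:
    name: example-deploy-job
    required-projects:
      - openstack/requirements
    vars:
      requirements_src_dir: "{{ ansible_user_dir }}/src/opendev.org/openstack/requirements"
      # hypothetical variable the deploy role would pass to pip as a constraints file
      upper_constraints_file: "{{ requirements_src_dir }}/upper-constraints.txt"
```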
fungi | setting up some mtr runs from montreal and san jose to static.o.o as well for a baseline | 13:40 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 13:57 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 14:03 |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide ensure-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 14:48 |
fungi | mtr wasn't turning up any packet loss to static.o.o from other providers either, and the 0.3% loss i initially saw from there to inmotion dropped to 0.2%, then to 0.1% and eventually 0.0%, so it seems there may have been a very brief blip early in the mtr run but that's it | 15:02 |
fungi | i'm currently downloading tinyipa-stable-xena.vmlinuz on a machine in inmotion-iad3 in a loop with a 1-second delay, trying to get it to fail | 15:04 |
fungi | over 2k downloads so far with no failures | 16:02 |
fungi | whatever the issue, i don't think it's steady, must come and go in small bursts | 16:03 |
clarkb | as expected the gitea13 and 14 replication continues this morning | 16:15 |
clarkb | it's a bit more than halfway done | 16:15 |
clarkb | fungi: sounds like typical internet behavior | 16:17 |
fungi | yeah, close to 3k successful downloads and no failures. i'm going to stop the loop before i waste any more bandwidth | 16:17 |
fungi | though i do think exposing the tarballs afs volume on our mirrors might be useful for some stuff like the ipa kernel downloads | 16:18 |
clarkb | as far as adding the tinyipa image to test nodes, the main struggle there is you end up with a bunch of versions and no one knows when it is safe to remove them. If we do that I think we should explicitly state we can do latest and latest-1, and then older versions which are used less often can continue to be fetched remotely. This is basically what we're moving towards with cirros | 16:18 |
clarkb | oh ya simply making use of our afs caches isn't a bad idea | 16:18 |
fungi | speaking of cirros, should i go ahead and self-approve 873735? it's been about a month with no objections | 16:19 |
clarkb | I've got no objections though I worry it may disrupt the openstack release somehow (the latest 6.1 version isn't used anywhere because it changes dhcp clients and tempest doesn't know how to interact with it or something to check dhcp things are working) | 16:20 |
clarkb | but I think 5.2 is used and not 5.1? | 16:21 |
fungi | yeah, i'll keep it on the back burner until post-release | 16:22 |
fungi | good call | 16:23 |
yoctozepto | infra-root: I think I need your help with merging this initial change: https://review.opendev.org/c/nebulous/project-config/+/877107 | 16:25 |
yoctozepto | or could you help me set it up to allow me to merge things on demand from gerrit? | 16:25 |
fungi | if it's what i think it is (haven't looked yet), yes there's a bootstrapping step where manual merging is needed to add a base job | 16:26 |
yoctozepto | (in case we break this base config in the future) | 16:26 |
yoctozepto | fungi: yeah, it's adding the noop job to the nebulous/project-config repo as well as pipelines | 16:26 |
yoctozepto | based on opendev/project-config | 16:26 |
clarkb | I think what you've got is correct. Add pipelines and a noop job | 16:27 |
clarkb | then you can land changes from there | 16:27 |
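For reference, such a bootstrap change typically boils down to something like the following sketch (pipeline set trimmed to one example; this is illustrative, not the literal content of 877107):

```yaml
# A minimal check pipeline copied in spirit from opendev/project-config
# (the gate pipeline and others are omitted here for brevity).
- pipeline:
    name: check
    manager: independent
    trigger:
      gerrit:
        - event: patchset-created
    success:
      gerrit:
        Verified: 1
    failure:
      gerrit:
        Verified: -1

# Gate the config-project itself with only the built-in noop job.
- project:
    name: nebulous/project-config
    check:
      jobs:
        - noop
    gate:
      jobs:
        - noop
```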
fungi | adding verified+2 and submit perms in the project-config repo for a special group might make sense... infra-root: ^ opinions? | 16:27 |
clarkb | and ya that would need a gerrit admin to apply a verified +2 and hit submit | 16:27 |
yoctozepto | btw, the CI/CD side will be Apache 2.0 licensed; the project itself will be MPL 2.0 because that is what we have in the grant agreement | 16:28 |
fungi | definitely a line to walk between risk for the user and needing to involve our gerrit admins more often | 16:28 |
clarkb | fungi: I seem to recall there was some consideration for that in the past. | 16:28 |
clarkb | I want to say it implies a higher level of trust than what is limited to the tenant for some reason but I may be misremembering | 16:28 |
yoctozepto | for one, not many people will be allowed to approve anything in that repo | 16:29 |
yoctozepto | likely just me and some other person that we have not found yet | 16:29 |
clarkb | but ya ultimately if they can land changes normally then allowing them to bypass ci is not much extra | 16:30 |
clarkb | I'm willing to give it a go | 16:30 |
yoctozepto | thanks | 16:30 |
yoctozepto | what should I do? | 16:30 |
clarkb | yoctozepto: it will require an acl update to give some group verified -2/+2 perms and allow the submit button | 16:30 |
yoctozepto | ook | 16:31 |
clarkb | then instructing that group to do their best to avoid relying on those perms and only perform the actions when you can't get around zuul being stuck due to the config you are trying to update | 16:31 |
yoctozepto | so verified +2 I think I know how to do | 16:31 |
clarkb | this situation is an example of that | 16:31 |
yoctozepto | but the submit button | 16:31 |
yoctozepto | :-) | 16:31 |
clarkb | yoctozepto: once the necessary votes are applied the button shows up in the top left panel of the change | 16:32 |
clarkb | next to rebase/abandon/edit | 16:32 |
clarkb | you apply the required votes, then click the button | 16:32 |
yoctozepto | oook, I see you mean top-right | 16:32 |
yoctozepto | then I will reconfigure the group to allow V+2 | 16:33 |
clarkb | oh sorry yes | 16:33 |
fungi | yoctozepto: one thing you might consider is having separate admin accounts you add to your administrative group, it's what we (opendev sysadmins) do for our gerrit admin access in order to minimize risk of accidentally doing something we didn't mean to over the course of normal use of the system or unnecessarily exposing the more privileged account to compromise | 16:36 |
fungi | also lp/uo sso 2fa is highly recommended | 16:36 |
yoctozepto | fungi: thanks for the hints! I think we are less impactful so a separate account is overkill, but it surely would be handy to disallow non-2FA logins going forward | 16:45 |
clarkb | unfortunately I don't think we're able to control that via gerrit | 16:46 |
fungi | right, well what you can do is make sure you have 2fa set up for the account(s) you use | 16:47 |
clarkb | yup and you can configure your account to require 2fa, but I don't know that we can enforce it on the service side with the way things are currently implemented | 16:48 |
fungi | and you can always separate your roles into multiple accounts later pretty easily since you control the group membership anyway, so nothing you need to decide right now | 16:48 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Allow nebulous-project-config-core to add V+2 https://review.opendev.org/c/openstack/project-config/+/877108 | 16:52 |
yoctozepto | clarkb, fungi: yeah, I meant for other project members to also be good citizens with 2FA :D but thankfully nothing makes it obligatory for us so it's good as it is atm | 16:53 |
yoctozepto | anyways, the change is up ^ | 16:53 |
clarkb | yoctozepto: yup just reviewed | 16:54 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Allow nebulous-project-config-core to add V+2 https://review.opendev.org/c/openstack/project-config/+/877108 | 16:56 |
yoctozepto | clarkb: fixed&replied | 16:56 |
yoctozepto | q if you are sure about the "submit =" part | 16:57 |
yoctozepto | because nothing else has it | 16:57 |
yoctozepto | happy to oblige otherwise | 16:57 |
clarkb | yoctozepto: I'm pretty sure. Yesterday when we were testing things and adding all the +2 votes the submit button would show up but was greyed out because you also need explicit submit perms. If you look in system-config/doc/source/gerrit.rst you'll see where we document that only zuul and the project creation tool have it by default | 16:57 |
clarkb | fungi can double check | 16:58 |
yoctozepto | ok, you are right, I feel convinced | 16:58 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Allow nebulous-project-config-core to add V+2 https://review.opendev.org/c/openstack/project-config/+/877108 | 16:58 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Allow nebulous-project-config-core to submit changes https://review.opendev.org/c/openstack/project-config/+/877108 | 16:59 |
yoctozepto | clarkb, fungi: all done ^ | 16:59 |
yoctozepto | even adapted the commit message | 17:00 |
yoctozepto | fingers crossed | 17:00 |
*** jpena is now known as jpena|off | 17:02 | |
corvus | yoctozepto: clarkb fungi i think we can remove these permissions. this is a one-time event | 17:04 |
yoctozepto | corvus: unless I manage to break the zuul config there and then come here begging for help ;D | 17:05 |
corvus | currently both opendev and zuul tenants have only noop jobs configured for their config-projects, so it's exceedingly unlikely that further involvement from infra-root would be needed | 17:05 |
fungi | well, it's a one-time event until someone accidentally merges breakage to the base job and then gerrit admins need to step in again | 17:05 |
corvus | yoctozepto: well, that's part of it... | 17:05 |
fungi | ahh, good point with noop | 17:05 |
clarkb | ya I explicitly noted that you can stick with noop and non voting in my +1 review | 17:05 |
corvus | you can't break the tenant if the config project is gated with noop. but you can break it if you have submit perms | 17:06 |
clarkb | basically the change does what fungi suggested earlier, but I wanted more feedback and gave alternatives | 17:06 |
yoctozepto | what if I break the pipelines? | 17:06 |
yoctozepto | :D | 17:06 |
yoctozepto | I mean | 17:06 |
yoctozepto | I can only really ever break pipelines there | 17:06 |
yoctozepto | as it will, like, 99.9% stay "tested" with noop | 17:06 |
corvus | it would be exceedingly hard to break the pipelines if the config-project is gated | 17:07 |
corvus | it is easy to break them if it is not | 17:07 |
fungi | note i wasn't necessarily suggesting this, but asking what others thought about the tradeoffs | 17:07 |
clarkb | I think pipelines are unlikely to change often, and considering that other project configs haven't resorted to this I think I'm coming around to corvus' reasoning and we can try it with noop for now | 17:07 |
clarkb | fungi: ack | 17:07 |
corvus | okay sorry i saw a flurry of changes and messages and am not 100% sure what the current status is | 17:07 |
yoctozepto | ok, so someone just merge this for me and I abandon the extra perms | 17:07 |
yoctozepto | I don't mind either way for now :D | 17:07 |
corvus | so we're at "consider adding perms", not "we just added perms"; that's cool, then i'm jumping into the conversation about evaluating what to do :) | 17:08 |
yoctozepto | :D | 17:08 |
clarkb | corvus: yup the change has not merged yet. Just at the point where a change that does that has been proposed and is in early review | 17:08 |
corvus | yoctozepto: anyway, not trying to make things hard, and if it becomes a problem, i'm not opposed to more perms in principle. i think that not having perms is sufficient and the most safe, and should not actually block you | 17:08 |
yoctozepto | I agree, I don't like exceptions in my stuff either | 17:09 |
clarkb | the main trick would be avoiding gate jobs that vote, or making sure that if they vote they always return success | 17:09 |
yoctozepto | if I never break the change and gate pipelines, then I should be fine, right? | 17:09 |
fungi | i'm also not against helping bypass zuul to merge things on rare occasions where there are no alternative solutions, mainly just want to avoid it becoming a frequent activity | 17:09 |
yoctozepto | as in | 17:09 |
yoctozepto | if I misconfigure some other pipeline | 17:09 |
corvus | yeah, and we've never seen fit to add anything other than noop to the opendev or zuul tenants, so i would expect the same for nebulous too | 17:09 |
yoctozepto | ++ | 17:09 |
corvus | yoctozepto: yes -- and if you follow the pattern in the opendev or zuul tenants, hopefully only gate would matter. | 17:10 |
clarkb | eg no clean check requirement | 17:10 |
yoctozepto | ah, right | 17:10 |
yoctozepto | that's true | 17:10 |
yoctozepto | so I can even break the check, sweet | 17:11 |
yoctozepto | :D | 17:11 |
yoctozepto | let there be havoc | 17:11 |
corvus | yoctozepto: consider carefully which tenants to base your pipelines on. opendev and zuul tenants do not have a clean check requirement, which means only gate is needed to work (and merging changes can be much faster); openstack has a clean check requirement (because people don't always follow best practices) | 17:11 |
yoctozepto | corvus: yeah, I went for the quicker way for now and will see how our partners behave | 17:12 |
yoctozepto | in the worst case, it will be the openstack way | 17:12 |
yoctozepto | which is not bad | 17:12 |
yoctozepto | just slower for some stuff | 17:12 |
corvus | yoctozepto: i think it's great to start with no clean check and add only if needed | 17:13 |
clarkb | ++ | 17:13 |
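For context, a clean check requirement is implemented as a gate pipeline requirement on a prior Verified vote from Zuul; a rough sketch of that shape, based on the openstack tenant's approach (details may differ from the actual pipeline config):

```yaml
- pipeline:
    name: gate
    manager: dependent
    require:
      gerrit:
        open: True
        current-patchset: True
        approval:
          # the "clean check" part: only enqueue changes that already carry a
          # passing Verified vote from Zuul's check run
          - Verified: [1, 2]
            username: zuul
          - Workflow: 2
    # trigger and reporter configuration omitted for brevity
```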
yoctozepto | happy to hear that I am making the blessed choices :-) | 17:13 |
yoctozepto | soo | 17:14 |
yoctozepto | I think we have a consensus | 17:14 |
yoctozepto | let me abandon the extra perms | 17:14 |
yoctozepto | and some of you merge me that nebulous/project-config change | 17:14 |
yoctozepto | https://review.opendev.org/c/nebulous/project-config/+/877107 | 17:15 |
corvus | infra-root: i can do the force-merge -- i think that's probably a non-controversial action that i could do immediately? | 17:19 |
fungi | corvus: thanks! i have no objection | 17:20 |
clarkb | corvus: yup as long as the pipeline config doesn't look broken I guess. But if it is functional enough to land a followup that isn't a big deal | 17:20 |
corvus | #status log force-merged https://review.opendev.org/877107 to bootstrap nebulous tenant | 17:22 |
opendevstatus | corvus: finished logging | 17:22 |
corvus | yoctozepto: https://zuul.opendev.org/t/nebulous/status | 17:23 |
yoctozepto | thanks, corvus | 17:23 |
yoctozepto | it's a verbatim copy of opendev/project-config now | 17:24 |
yoctozepto | I think I made it explicit in the commit message | 17:24 |
opendevreview | John L. Villalovos proposed openstack/diskimage-builder master: chore: support building Fedora on arm64 AKA aarch64 https://review.opendev.org/c/openstack/diskimage-builder/+/877112 | 17:32 |
fungi | clarkb: what do you think about further increases to the launch timeout in rax-ord? it looks like whenever we have a spike in node requests, we end up with lots of launch timeouts there even with the timeout increased to 15 minutes, but the instances do seem to eventually boot after nodepool gives up waiting | 17:38 |
fungi | my concern is that the longer we make the timeout, the longer some jobs will spend waiting for node requests to be filled | 17:38 |
clarkb | fungi: I suspect that reducing the max-servers count may result in better throughput? | 17:39 |
fungi | if we had some way to limit the number of nodes booting in parallel that might help, since the cloud does appear to be capable of handling a large number of nodes once they've booted | 17:39 |
clarkb | that will reduce the size of potential rushes there and may keep us booting nodes in a reasonable time frame | 17:39 |
clarkb | fungi: ooh that's a good idea, but I'm not sure we have support for that yet? | 17:40 |
fungi | it's the region in rax where we have the largest quota, but we've already reduced max-servers there by half (so it's now ~2/3 of the other two rax regions) | 17:40 |
clarkb | one problem with increasing the timeout is that we retry 3 times too | 17:41 |
clarkb | so in the worst case a job may wait 3 * timeout | 17:41 |
fungi | it did represent 40% of our theoretical rax capacity, now it's 25% | 17:41 |
clarkb | maybe what we should do is raise the timeout and not allow retries in that region | 17:41 |
clarkb | that should also help with the thundering herd problem since we won't retry so much | 17:42 |
clarkb | (maybe, if we are at capacity chances are another request will show up soon enough though) | 17:42 |
clarkb | I think that is what I would do. don't allow retries and increase timeout a bit | 17:42 |
fungi | we're able to control retries independently per provider? i'll give that a shot | 17:43 |
clarkb | I think we are | 17:43 |
fungi | launch-retries is per provider, yep | 17:44 |
fungi | we currently don't set it for any provider and take the default from nodepool | 17:44 |
fungi | https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].launch-retries | 17:45 |
fungi | slightly misleadingly named/documented, since it's the number of times to try, not the number of times to retry | 17:45 |
fungi | so we want it set to 1 for a single try (i.e. no retries) | 17:45 |
clarkb | fungi: and also maybe we want to increase the api rate limit | 17:48 |
clarkb | but that impact is unlikely to matter much | 17:49 |
fungi | you mean the delay between api calls? already did by an order of magnitude in an earlier change but can do it some more | 17:49 |
clarkb | (it would slow down boots when a thundering herd happens, just not much compared to the timeouts) | 17:49 |
clarkb | ya it's still 100 requests a second | 17:49 |
clarkb | we may want 1 a second? I dunno | 17:49 |
fungi | worth a try i guess | 17:49 |
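For illustration, those knobs all live in the nodepool provider configuration; a rough sketch of the direction being discussed, with example values rather than the ones in the actual 877113 change:

```yaml
providers:
  - name: rax-ord
    rate: 1.0             # seconds to wait between API calls, i.e. roughly 1 per second
    launch-timeout: 1800  # example: give slow boots longer before giving up
    launch-retries: 1     # a single attempt, i.e. no retries
```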
opendevreview | Jeremy Stanley proposed openstack/project-config master: Further increase rax-ord launch timeout no retries https://review.opendev.org/c/openstack/project-config/+/877113 | 17:53 |
fungi | if it ends up helping, maybe we can try turning the max-servers there back up some | 17:54 |
clarkb | corvus: ^ that may interest you from a general nodepool functionality perspective | 17:55 |
fungi | it's a fairly pathological case though, not sure how many knobs for dealing with a situation like that make sense | 17:59 |
corvus | oh oh | 18:00 |
clarkb | I think being able to control the parallelism of inflight node creations is worth considering though | 18:00 |
corvus | what about https://zuul-ci.org/docs/nodepool/latest/configuration.html#attr-providers.max-concurrency ? | 18:00 |
clarkb | oh do we have that TIL | 18:00 |
clarkb | ++ that is exactly what we need | 18:01 |
fungi | whoa. mind *blown* | 18:01 |
clarkb | I think we still drop the retries to avoid stalling builds out | 18:01 |
clarkb | but maybe we keep the old api rate and set parallelism to something the grafana graphs suggest it can handle | 18:01 |
fungi | i'll start it off at 10 | 18:02 |
corvus | (the docs need updating because it's not actually threads anymore, but the statemachine framework does honor that -- it still controls the concurrency for new requests) | 18:02 |
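And the knob corvus points at is another per-provider attribute; a sketch with the starting value fungi mentions:

```yaml
providers:
  - name: rax-ord
    max-concurrency: 10   # cap how many node launches are in flight at once
```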
fungi | if this helps, we can probably back the launch-timeout down to something smaller as a safety, and then possibly re-add retries? | 18:03 |
clarkb | fungi: I suspect that we're more likely to hit timeouts than valid failures needing a retry? | 18:05 |
clarkb | basically in a cloud suffering these issues it is probably better in all cases to have a longer timeout and not keep trying if you fail | 18:05 |
fungi | yeah, maybe | 18:07 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Limit rax-ord launch concurrency and don't retry https://review.opendev.org/c/openstack/project-config/+/877113 | 18:09 |
fungi | corvus: clarkb: ^ | 18:09 |
corvus | ++ | 18:10 |
fungi | you can sort of see the misbehavior by comparing the three test node history graphs at https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1&from=now-24h&to=now | 18:11 |
fungi | though the error node launch attempts graph is across all three providers, from the logs it seems to be mainly rax-ord | 18:12 |
fungi | and you can clearly see the variability for that region in the time to ready graph | 18:12 |
clarkb | it does seem to be able to handle ~30 booting nodes. I guess we monitor things and increase the concurrency if it holds up | 18:12 |
fungi | well, that's potentially misleading. it "handles" accepting that many boot requests in parallel, but definitely does look like things get a lot worse when we ask for all its capacity at once | 18:14 |
clarkb | right you can see where it requests far more than 30 and has a sad. But there are a couple instances where it requests 30ish and seems to do well with that | 18:15 |
clarkb | and an instance of about 38 where it falls over | 18:15 |
clarkb | I suspect the tipping point is around 30 for this reason | 18:16 |
fungi | potentially, but i also wouldn't rule out external factors since we're not the only user of that cloud | 18:16 |
fungi | does anybody know what the multiple stats for the providers are in the api server operations graphs? | 18:19 |
fungi | not all regions have the same number of them either | 18:19 |
fungi | for example in the post server graph there are 5 lines for dfw, 3 each for iad and ord | 18:20 |
fungi | i guess i could look at the yaml for that | 18:21 |
clarkb | if you hit the graph menu there is an inspect option | 18:21 |
clarkb | it looks like only one of them actually has data | 18:22 |
fungi | looks like the same thing i found in git, aliasByNode(stats.timers.nodepool.task.$region.compute.POST.servers.*.mean, 4) | 18:22 |
clarkb | somehow $region is ending up with multiple identical results? And i think that variable list can be retrieved by querying graphite? | 18:23 |
fungi | yeah, i'm fiddling with graphite | 18:23 |
corvus | one for each response code? | 18:24 |
clarkb | oh maybe | 18:25 |
fungi | that seems to be it, yep | 18:25 |
corvus | there are 5 for dfw and 3 for ord | 18:25 |
fungi | dfw has returned 500 and 503, in addition to the 202, 400 and 403 returned by the other regions | 18:25 |
fungi | i wonder if the intent was to aggregate those | 18:26 |
corvus | should probably either aggregate or adjust the alias so it's rax-ord 200; either is possible with a different graphite function | 18:27 |
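For illustration, one way to express that aggregation in the grafyaml dashboard definition; the function arguments below are inferred from the metric path quoted above and are untested:

```yaml
targets:
  # sum away the per-response-code wildcard (node 8) so each region renders
  # as a single line, still labeled by region (node 4)
  - target: aliasByNode(sumSeriesWithWildcards(stats.timers.nodepool.task.$region.compute.POST.servers.*.mean, 8), 4)
  # or, to label by region and status code instead (e.g. "rax-ord 202"):
  # - target: aliasByNode(stats.timers.nodepool.task.$region.compute.POST.servers.*.mean, 4, 8)
```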
opendevreview | Merged openstack/project-config master: Limit rax-ord launch concurrency and don't retry https://review.opendev.org/c/openstack/project-config/+/877113 | 18:29 |
clarkb | alright I'm going to work on deleting gitea05-07 now | 18:30 |
clarkb | if they are anything like gitea08 I will need to delete their backing boot volume separately | 18:30 |
clarkb | #status log Deleted gitea05-07 and their associated boot volumes as they have been replaced with gitea10-12. | 18:41 |
opendevstatus | clarkb: finished logging | 18:41 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove gitea05-07 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/877117 | 18:43 |
clarkb | under 1k replication tasks for gitea13 and 14 now too | 18:43 |
clarkb | the cryptography min requirement discussion for openstack prompted me to look and cryptography is actually requiring a pretty reasonable rust compiler version | 19:10 |
clarkb | A lot of projects (firefox famously) require a prerelease compiler, but cryptography wants 1.45.0 or newer and ubuntu has 1.65.0 (in universe though) | 19:10 |
fungi | yeah, however jammy ships python3-cryptography 3.4.8 because it was contemporary at the time, and would need to backport a rust-built version of the package with numerous new build-dependencies to update to a newer version | 19:25 |
fungi | compare https://packages.ubuntu.com/source/jammy/python-cryptography with https://packages.ubuntu.com/source/lunar/python-cryptography | 19:25 |
fungi | that's the sort of complex change a stable distro is generally going to avoid | 19:26 |
clarkb | for sure. My point is more that I don't think the issue is rust-specific so much as that stable distros typically don't wholesale backport new releases | 19:27 |
clarkb | and that is made worse when the old software they ship has been partly replaced with something completely different, so any updates may require a lot of effort | 19:29 |
fungi | but also it may not just be rust itself in this case, if cryptography needs newer versions of other toolchain components (cargo, setuptools-rust extension, various rust libs) | 19:30 |
clarkb | right, I just think it's distracting to bring up rust in this case. Ubuntu basically never replaces a stable library version with one that's 2 years newer, regardless of the compiler | 19:31 |
fungi | looks like the python3-cryptography in lunar has 9 different rust-based libs it also depends on | 19:31 |
clarkb | this is more about "the stable distro has a stable library version, we need to continue to do our best to support that". It's no different than not requiring a newer libvirt | 19:32 |
fungi | sure, the discussion has fixated on rust because it's what coreycb brought up as the challenge for that particular package, but i tried to point out earlier that it's really just one example | 19:32 |
fungi | the rax-ord graph looks considerably better | 20:49 |
fungi | though it'll be hard to know for sure until monday | 20:50 |
fungi | possibly just wishful thinking on my part | 20:51 |
clarkb | one replication event is still processing. Once that is done I'll trigger global replication for the replication config reload and then we should be good to land the change to add gitea13 and 14 to haproxy | 21:08 |
clarkb | full replication started | 21:23 |
clarkb | fungi: the boot times don't seem to have dropped significantly though | 21:26 |
clarkb | we may still need to bump the timeout up as a result | 21:27 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/877047 as replication is complete | 21:47 |
clarkb | fungi: if you are still around, https://review.opendev.org/c/opendev/zone-opendev.org/+/877117 should be an easy one | 21:47 |
clarkb | (cleans up gitea05-07 dns records) | 21:47 |
clarkb | thank you! | 21:53 |
fungi | np, just cooking dinner and reviewing changes | 21:56 |
opendevreview | Merged opendev/zone-opendev.org master: Remove gitea05-07 from DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/877117 | 21:58 |
opendevreview | Merged opendev/system-config master: Add gitea13 and gitea14 to the haproxy load balancer https://review.opendev.org/c/opendev/system-config/+/877047 | 23:00 |