corvus | #status log restarted zuul-launchers with latest fixes | 00:22 |
opendevstatus | corvus: finished logging | 00:23 |
corvus | anyone remember what the rackspace classic network caps are? cause looking at the cacti graphs, i'm suspecting we're hitting a 100Mbit outbound cap on the launchers. that's on a 2GB ram node. | 00:29 |
corvus | that's nbd for node operations, but it'll slow down image uploads | 00:29 |
corvus | the previous version of zl01, which was an 8GB vm, was able to push at least 200Mbps, possibly more (hard to tell in our degraded cacti due to rollup artifacts) | 00:31 |
corvus | we're generating more images now than normal, so it's possible this may be reduced once things settle down, but it looks like we're spending around 20 hours each of the last 2 days uploading images. so getting pretty close to the point where this could be a problem; maybe past that point now with the new images. | 00:33 |
corvus | so we might need to add another launcher just to handle the image load, or upsize the vms to get more bandwidth. which is a shame, because they're nowhere near using their allocated cpu or ram, which is minuscule. | 00:34 |
corvus | i think if we decide to upsize, we should go back to 8GB; i think a 4GB node might still limit our bandwidth. | 00:45 |
corvus | that's the way i'm leaning. i can start on that tomorrow if folks agree. | 00:49 |
Clark[m] | corvus I think flavor show shows some network cap info | 00:51 |
Clark[m] | It may be ratios between the flavors and not raw bit rate and ya bigger flavors get more bit rate so maybe that is a workaround | 00:52 |
corvus | yeah, that matches the docs i found. the number is 400 for 2GB, 800 for 4GB, and 1600 for 8GB. based on those ratios, i guess we should be able to push 400Mbps with an 8GB node. | 01:22 |
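For reference, a back-of-the-envelope version of that scaling argument as a minimal Python sketch; the ~100 Mbps observation and the 400/800/1600 weights come from the discussion above, everything else is illustrative:

```python
# Rough estimate of outbound bandwidth per Rackspace classic flavor,
# assuming the published network weights scale linearly and the 2GB
# flavor is observed topping out around 100 Mbps (per the cacti graphs).
observed_mbps_2gb = 100
weights = {"2GB": 400, "4GB": 800, "8GB": 1600}

for flavor, weight in weights.items():
    estimate = observed_mbps_2gb * weight / weights["2GB"]
    print(f"{flavor}: ~{estimate:.0f} Mbps outbound")
# Expected output: 2GB ~100, 4GB ~200, 8GB ~400 Mbps
```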
opendevreview | Clark Boylan proposed opendev/system-config master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/system-config/+/954382 | 01:25 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/base-jobs/+/954383 | 01:29 |
noonedeadpunk | so translations were proposed at least for us: https://review.opendev.org/c/openstack/openstack-ansible/+/954400 | 07:07 |
mnasiadka | noonedeadpunk: for a lot of other projects as well | 07:12 |
noonedeadpunk | for a lot it failed :) | 07:13 |
noonedeadpunk | https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&pipeline=periodic&skip=0 | 07:13 |
noonedeadpunk | as yesterday some were passing as well, just because they didn't have updates for translations | 07:17 |
noonedeadpunk | like this seems to use old whitelist_externals in tox: https://zuul.opendev.org/t/openstack/build/b9c83152e96f42a58a91af2afbdc00e2/log/job-output.txt | 07:18 |
noonedeadpunk | which I think is broken with nodeset switch | 07:19 |
noonedeadpunk | but they seem to fail on unrelated things I'd say... One fails to build Pillow, Ironic has conflicting constraint to requirements. | 07:27 |
frickler | noonedeadpunk: I saw https://review.opendev.org/c/openstack/designate-dashboard/+/954385 which tries to delete all releasenote translations, which doesn't seem right to me, either | 07:45 |
frickler | oh, although this is for stable branches, maybe the reno translations only apply on the master branch, so this change could be correct? some neutron-* changes look similar. might be good to confirm with I18N people though | 07:47 |
noonedeadpunk | am I right that designate dashboard is not functional anyway? | 08:09 |
noonedeadpunk | ah, ok, sorry, mixed it up with barbican | 08:10 |
noonedeadpunk | yeah, I agree that it would be good to confirm that | 08:11 |
noonedeadpunk | as it's not super clear if we fixed or bricked things | 08:11 |
noonedeadpunk | like - there's a horizon proposal as well, which I've heard should not have happened | 08:11 |
noonedeadpunk | but also only for stable branches | 08:12 |
stephenfin | clarkb: fungi: I think you reached the same conclusion, but yes, deprecation and deprecation for removal are separate things and I'm mostly pursuing the former | 12:38 |
stephenfin | The only exceptions are things like the `test` and `build_doc` commands, which are not core to building a package and rely on other packages | 12:39 |
stephenfin | I've indicated this on a few of the reviews, but I'm open to adding verbiage to indicate this. I want to drive new users towards the upstream variants of options, but don't want to break existing workflows | 12:40 |
opendevreview | Merged opendev/base-jobs master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/base-jobs/+/954383 | 12:51 |
ykarel | fungi, clarkb still seeing that multinode issue, mixing cloud providers | 12:56 |
ykarel | https://zuul.openstack.org/build/83acbc4e29ca4117bb2d7825d4504d30 | 12:56 |
ykarel | https://zuul.openstack.org/build/91af210331494b35bcf21d9b7cbdb0d5 | 12:56 |
ykarel | from today ^ | 12:56 |
fungi | ykarel: thanks! looks like those are both periodic jobs, i wonder if the burst when those all get enqueued at once is overwhelming some providers and causing repeated boot failures/timeouts that eventually force it over to trying get the node from another provider rather than failing the job outright | 13:00 |
ykarel | yes seen in periodic ones | 13:06 |
opendevreview | Merged opendev/system-config master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/system-config/+/954382 | 13:21 |
frickler | corvus: ^^ so the question would be whether we still consider that an issue in zuul-launcher or whether we need to somehow handle this in devstack, either by a pre-check that rejects such a nodeset or by changing the handling of internal vs. public addresses | 13:29 |
fungi | one of the suggestions was some means to express at the job level when nodes should be allocated from a single provider or otherwise immediately failed when that's not possible, rather than trying to proceed with mixed providers | 13:32 |
corvus | yeah, i just want to make sure that zuul-launcher is operating as intended before we start talking about changing it. let me investigate that before we decide how to proceed. | 13:34 |
corvus | okay, after looking at the first instance, it exceeded its three retries and did failover to another provider for one of the nodes. so that part seems to be working better. but what's interesting is why it failed 3 times | 14:41 |
corvus | 1) launch timeout | 14:42 |
corvus | 2) nova scheduling failure | 14:42 |
corvus | 3) quota exceeded (cores) | 14:42 |
fungi | it's going for a full bingo card | 14:42 |
fungi | was that in rackspace classic? | 14:43 |
corvus | the third one should not have resulted in a permanent failure (it should not count against the 3 retries). i think ovh issues a slightly different error message than it did when we wrote the code originally (we have to string match). so i wrote https://review.opendev.org/c/zuul/zuul/+/954505 to update it | 14:43 |
corvus | no, ovh | 14:43 |
fungi | oh, even more surprising | 14:43 |
corvus | failure #1, the timeout, is also a new thing for us, since we just (re-)implemented launch timeouts yesterday. we would not have seen those in yesterday's rush of periodic jobs. | 14:44 |
corvus | so i wonder if we should increase the launch-timeout on ovh? | 14:44 |
fungi | or maybe not surprising, from what i understand our tenant in ovh has a dedicated host aggregate with separate overcommit tuning, mainly to avoid impacting other customers as a noisy neighbor | 14:45 |
corvus | i'm not sure why we would get a nova error instead of a quota error for #2... | 14:45 |
corvus | this is the error it got: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 81889faa-c359-4f23-8b17-77d91877406f. | 14:46 |
corvus | should zuul treat that as a temporary failure? or should we still consider that a real error? is there something else we should do to avoid it? | 14:47 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Increase launch timeout for ovh https://review.opendev.org/c/opendev/zuul-providers/+/954508 | 14:48 |
clarkb | corvus: I think that is nova indicating there was no hypervisor to assign the node. I hesitate to treat that as a temporary error because in the past we've seen errors like that require human intervention due to stale/broken placement data | 14:54 |
corvus | that sounds like we should treat it as a non-temporary error. basically, if we mark the node as "tempfailed" then we will not count the failure against the retry counter; if we mark it as "failed" we will. so errors that result in "failed" will try 3 times before falling back on another provider. "tempfailed" could loop forever (literally) in the provider, so we only want to do that for cases where we actually expect the provider to recover on | 14:57 |
corvus | its own (eg, quota) | 14:57 |
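For readers following along, the tempfailed/failed split described above can be sketched roughly like this; this is not the actual zuul-launcher code, and the marker strings and helper name are illustrative stand-ins for the string matching corvus mentions:

```python
# Sketch: classify a provider launch error as "tempfailed" (retried
# indefinitely within the provider, e.g. quota) or "failed" (counts
# against the 3 retries before falling back to another provider).
QUOTA_MARKERS = (
    "quota exceeded",            # wording seen in the failure above
    "exceeds available quota",   # hypothetical variant; clouds differ
)

def classify_launch_error(message: str) -> str:
    msg = message.lower()
    if any(marker in msg for marker in QUOTA_MARKERS):
        return "tempfailed"   # provider is expected to recover on its own
    return "failed"           # counts toward this provider's retry limit

print(classify_launch_error("Quota exceeded for cores"))  # tempfailed
print(classify_launch_error("No valid host was found."))  # failed
```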
clarkb | ya quota makes sense for doing that | 14:58 |
corvus | so what do you think? leave "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance" as a "permanent" error? or should we treat it as quota error? | 15:04 |
clarkb | I almost wonder if we need a third state which is a "temporary" temporary failure | 15:05 |
clarkb | basically if it persists past some limit then we treat it as permanent | 15:05 |
frickler | sadly nova doesn't bubble up more details about why scheduling failed, so keeping it permanent is the safer option IMO. how about increasing the number of retries? | 15:06 |
corvus | yeah, we can increase retries. but this also didn't happen as often. most of the errors were regular quota errors. so if we fix those (in progress) we might be able to leave the current value | 15:09 |
johnsom | noonedeadpunk Designate dashboard should be fully functional. Are you seeing otherwise? | 15:18 |
noonedeadpunk | johnsom: nah, as I said, I mixed up with barbican | 15:24 |
johnsom | ack | 15:25 |
*** dhill is now known as Guest21627 | 15:33 |
clarkb | stephenfin: I think there is a small bug in https://review.opendev.org/c/openstack/pbr/+/954514/ but otherwise the stack up to about https://review.opendev.org/c/openstack/pbr/+/954047/ looks mergeable to me | 16:04 |
clarkb | stephenfin: fungi ^ maybe we focus on getting that first half in to avoid too much rebase churn when fixing small bugs like the one in 954514? Then tackle the second half of the stack after? | 16:05 |
clarkb | maybe that doesn't help much I dunno | 16:05 |
fungi | makes sense, sure | 16:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack https://review.opendev.org/c/zuul/zuul-jobs/+/954528 | 16:38 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph https://review.opendev.org/c/zuul/zuul-jobs/+/954529 | 16:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-python https://review.opendev.org/c/zuul/zuul-jobs/+/954530 | 16:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo https://review.opendev.org/c/zuul/zuul-jobs/+/954531 | 16:41 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-npm https://review.opendev.org/c/zuul/zuul-jobs/+/954532 | 16:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test https://review.opendev.org/c/zuul/zuul-jobs/+/954533 | 16:43 |
corvus | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc | 16:50 |
corvus | today was a little anomalous -- a big bunch of node requests at 15:00 | 16:51 |
corvus | but i think the launchers did what they were supposed to do | 16:51 |
corvus | infra-root: i do think we should recreate the launchers as 8gb nodes in order to get the extra bandwidth. i didn't see any feedback on that overnight... so i'm... going to take that as consensus? :) | 16:52 |
corvus | last chance to object or suggest an alternative | 16:53 |
fungi | i concur | 16:54 |
fungi | sad but necessary, i guess | 16:54 |
fungi | heading out on some quick errands, should only be a few minutes | 16:55 |
clarkb | corvus: ya bigger nodes seems reasonable to me | 16:59 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace zl01 https://review.opendev.org/c/opendev/zone-opendev.org/+/954536 | 17:20 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace zl02 https://review.opendev.org/c/opendev/zone-opendev.org/+/954537 | 17:20 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace zl01 https://review.opendev.org/c/opendev/system-config/+/954538 | 17:20 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace zl02 https://review.opendev.org/c/opendev/system-config/+/954539 | 17:20 |
corvus | infra-root: if you want to +2 those, i'll +w and replace one at a time. | 17:21 |
clarkb | done | 17:28 |
opendevreview | Merged opendev/zone-opendev.org master: Replace zl01 https://review.opendev.org/c/opendev/zone-opendev.org/+/954536 | 17:40 |
corvus | looking at the status page, i'm seeing some node_failures for arm nodes -- it looks like it's due to launch-timeouts. | 17:41 |
corvus | do we need a larger value for osuosl as well? | 17:41 |
ianychoi | <frickler> "noonedeadpunk: I saw https://..." <- I have just replied to the patch - releasenotes translation is only managed on master branch, so deleting on stable branches is right. Thank you for helping I18n team | 17:43 |
clarkb | the current values appear to have been ported over from nodepool? Thinking out loud here I wonder if the more regular image builds are impacting nova's caching of images | 17:43 |
clarkb | corvus: a little while back we reduced the total number of image builds in nodepool to ease back on that and update less common images less often. That's my best guess as to what is happening. Increasing the timeouts would help if that is the problem | 17:44 |
corvus | oh... it looks like we're counting the "paused" time in the launch timeout... which is probably fine for other providers, but not osuosl... | 17:44 |
noonedeadpunk | ianychoi: aha, cool, thanks for having a look at that, as I was really unsure if I broke something with it or not | 17:44 |
corvus | clarkb: let me take a look at tweaking this first | 17:44 |
clarkb | corvus: basically when you boot a node on a hypervisor that hasn't used that image yet it has to be copied from glance if not using ceph | 17:44 |
clarkb | corvus: ack | 17:44 |
noonedeadpunk | then this project is kinda terribly broken overall I believe: https://review.opendev.org/c/openstack/training-guides/+/954432 | 17:44 |
noonedeadpunk | but it's not I18n concern... | 17:45 |
corvus | yeah, that may be happening too, but for the example i'm looking at in the logs, i think it's the launch-timeout issue | 17:45 |
ianychoi | noonedeadpunk: Thank you too for helping Ubuntu node upgrade!! | 17:45 |
noonedeadpunk | ianychoi: ah, no worries, I was just pinged that expected translations didn't land so I had to check anyway | 17:45 |
clarkb | the weather is cooperating today and I don't have a morning full of meetings. I'm going to pop out for a bike ride while I can. | 17:48 |
clarkb | back in a bit | 17:48 |
fungi | you definitely should | 17:54 |
corvus | okay, this should address the node_errors for arm64 nodes: https://review.opendev.org/954540 | 17:54 |
corvus | i have stopped zl01 in preparation for its replacement; zl02 is handling all the node requests now | 18:01 |
opendevreview | Merged opendev/system-config master: Replace zl01 https://review.opendev.org/c/opendev/system-config/+/954538 | 18:17 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack https://review.opendev.org/c/zuul/zuul-jobs/+/954528 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph https://review.opendev.org/c/zuul/zuul-jobs/+/954529 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-python https://review.opendev.org/c/zuul/zuul-jobs/+/954530 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo https://review.opendev.org/c/zuul/zuul-jobs/+/954531 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for upload-npm https://review.opendev.org/c/zuul/zuul-jobs/+/954532 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test https://review.opendev.org/c/zuul/zuul-jobs/+/954533 | 18:20 |
corvus | i am in favor of increasing the system-config mutex :) | 18:57 |
corvus | #status log replaced zl01 with an 8GB server | 20:05 |
opendevstatus | corvus: finished logging | 20:06 |
corvus | i've stopped zl02, so all requests are handled by the new zl01 now | 20:07 |
corvus | clarkb: fungi would you mind taking a look at https://review.opendev.org/954246 -- that should make the zuul status graphs a bit easier to read | 20:10 |
opendevreview | Merged opendev/zone-opendev.org master: Replace zl02 https://review.opendev.org/c/opendev/zone-opendev.org/+/954537 | 20:12 |
fungi | i see that infra-prod-service-zuul-preview failed in deploy for 954538, service-zuul-preview.yaml.log on bridge says it was a "HTTP status: 504 Gateway Time-out" when pulling the zuul-preview image with docker-compose | 20:12 |
fungi | probably not anything to worry about, and just wait for the next deploy | 20:12 |
corvus | yeah, we're not changing anything about that, so it doesn't affect anything | 20:12 |
fungi | corvus: one review comment on 954246 | 20:14 |
corvus | replied | 20:16 |
fungi | thanks! lgtm | 20:16 |
clarkb | reviewing that now | 20:27 |
clarkb | corvus: is there a semaphore count increase change yet? I'm happy to do that and can write the change if not | 20:27 |
corvus | nope, was just sharing thoughts to gauge opinions :) | 20:31 |
clarkb | cool, I'll work on that after I review that other change | 20:32 |
corvus | unlike some other clouds, when a server in rax-flex is in error state, the 'fault' does not show up in the result we get back from listing /servers/detail. but it does show up with "openstack server show". is that a change in the openstack api? is there something we can add to get it returned in the list? | 20:34 |
clarkb | corvus: why drop the deleted nodes graphing? | 20:35 |
clarkb | corvus: chances are that is a change to nova since rax-flex is a relatively new version aiui | 20:35 |
corvus | there is no "deleting" node state | 20:35 |
clarkb | corvus: aha | 20:35 |
clarkb | melwitt: ^ do you know the answer to corvus' nova server api question there? | 20:35 |
opendevreview | Merged zuul/zuul-jobs master: Remove no_log for image upload tasks https://review.opendev.org/c/zuul/zuul-jobs/+/953983 | 20:36 |
corvus | i don't see anything in https://docs.openstack.org/api-ref/compute/ that would explain the behavior; in fact, it just says that "fault" is "Only displayed when the server status is ERROR or DELETED and a fault occurred." | 20:38 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6 https://review.opendev.org/c/opendev/system-config/+/954545 | 20:39 |
corvus | this is a great time to investigate this, because every node we're building in flex is getting: {'code': 500, 'created': '2025-07-09T20:15:49Z', 'message': 'No valid host was found. There are not enough hosts available.'} | 20:39 |
corvus | (but zuul can't see that, it just sees "ERROR") | 20:40 |
fungi | i've sent the service-announce post about removing cla enforcement from remaining projects in our gerrit | 20:41 |
clarkb | fungi: I've received the email | 20:42 |
fungi | the zl02 replacement inventory change failed in the gate | 20:42 |
clarkb | probably on the same image fetch 504 problem. I'm guessing quay is having a sad | 20:43 |
fungi | yeah, i haven't dug in | 20:44 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/954545 is an easy review if you think that is safe enough | 20:46 |
corvus | https://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#25700 | 20:46 |
corvus | it failed something something dstat deb on noble? | 20:46 |
clarkb | oh ya that is a known issue | 20:47 |
corvus | oh? | 20:47 |
clarkb | dstat is deprecated so some other tool picked it up (something copilot) and provides a shim for it. The packaging for that new tool has been flaky at times | 20:47 |
corvus | how is that non-deterministic? | 20:48 |
fungi | clarkb: approved. now going to try to tackle mowing the lawn for a bit before the rains return | 20:48 |
corvus | i guess the deb installs a python post-install script, and it's non-deterministic python? | 20:48 |
clarkb | corvus: I think it's a race amongst the related copilot tools starting up | 20:49 |
clarkb | corvus: they have dependencies and don't properly express that in their units | 20:49 |
clarkb | though the line you linked to seems to be something different | 20:50 |
corvus | is anything using dstat-logger? | 20:51 |
clarkb | corvus: I don't think anything uses it from a production standpoint. Its there for info gathering in CI | 20:51 |
clarkb | oh wait | 20:52 |
clarkb | corvus: the line you linked is for focal I think | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#25640 is the actual error | 20:52 |
clarkb | corvus: and I think rc -13 is what happens when ansible attempts to use the controlpersist connection as ssh is shutting down that connection | 20:53 |
clarkb | we increased the controlpersist timeout to make this happen less often but some tasks can still trip over it if they run long enough | 20:53 |
clarkb | in this case a recheck is maybe sufficient. If we see rc -13 occurring more often again then we might tweak the controlpersist timeout again | 20:53 |
corvus | or maybe set up a keepalive | 20:54 |
corvus | zuul does -o ControlPersist=60s -o ServerAliveInterval=60 | 20:54 |
corvus | i have re-enqueued the zl02 change | 20:55 |
clarkb | let me see if we set serveraliveinterval | 20:55 |
opendevreview | Merged opendev/zuul-providers master: Increase launch timeout for ovh https://review.opendev.org/c/opendev/zuul-providers/+/954508 | 20:56 |
opendevreview | Merged openstack/project-config master: Update Zuul status node graphs https://review.opendev.org/c/openstack/project-config/+/954246 | 20:56 |
opendevreview | Merged opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6 https://review.opendev.org/c/opendev/system-config/+/954545 | 20:57 |
clarkb | corvus: man 5 ssh_config implies that ControlPersist relies on client connections to determine when to shut down the control master | 20:59 |
clarkb | that said I'm willing to add serveraliveinterval to see if it helps. Maybe the internal ping pong for the serveraliveinterval counts as a client connection | 20:59 |
melwitt | clarkb: I thought it should be returned when the GET /servers/detail API is called https://docs.openstack.org/api-ref/compute/#list-servers-detailed I will look into if there is some reason it would not | 21:00 |
clarkb | melwitt: thank you | 21:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Ansible config to set ssh ServerAliveInterval https://review.opendev.org/c/opendev/system-config/+/954547 | 21:02 |
clarkb | melwitt: as mentioned we've got a cloud where we can pretty consistently reproduce from the client side and dan_with may be willing to help us debug further on the server side if there are things you think we should check | 21:04 |
corvus | clarkb: yes, i think that should help. can probably ramp the timeout down too, but it's not important, and not worth doing now. | 21:05 |
corvus | clarkb: do you mind giving https://review.opendev.org/951018 a re-review? it needed a rebase | 21:05 |
clarkb | on it. Should I recheck it as well? | 21:06 |
clarkb | I went ahead and rechecked it | 21:07 |
clarkb | looks like the only reason my vote was removed was the new depends on which has merged so basically a noop now | 21:07 |
melwitt | clarkb: so far I'm not seeing anything in https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/views/servers.py#L504 that would prevent it from showing the fault. do you know what is being used to query the API? I'm guessing not OSC if you are wanting more attributes from the detail list | 21:08 |
clarkb | corvus: probably knows how that is being requested | 21:10 |
melwitt | I have a devstack handy so actually I can try this myself to see | 21:10 |
clarkb | melwitt: I think its doing a direct request with python requests | 21:11 |
clarkb | let me link the code | 21:11 |
corvus | https://paste.opendev.org/show/bOK5kTOJ66AJfTR3F1tQ/ | 21:11 |
clarkb | https://opendev.org/zuul/zuul/src/commit/9554f9593df757fcd3e5fbb8e6cc86ea55fd9c51/zuul/driver/openstack/openstackendpoint.py#L764-L804 | 21:12 |
corvus | there's the test script i'm using (for simplicity) | 21:12 |
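The paste above holds the actual script; a minimal stand-in for that kind of check could look like the following, assuming a keystone token and compute endpoint are already in hand (the URL and token here are placeholders):

```python
import requests

COMPUTE_URL = "https://compute.example.com/v2.1"  # placeholder endpoint
TOKEN = "gAAAA..."                                # placeholder keystone token
headers = {"X-Auth-Token": TOKEN}

# Compare the list-with-details response against the per-server GET:
# 'fault' is documented for servers in ERROR (or DELETED) state, but on
# the cloud in question it only appears in the single-server response.
listed = requests.get(f"{COMPUTE_URL}/servers/detail", headers=headers).json()
for server in listed["servers"]:
    if server["status"] != "ERROR":
        continue
    shown = requests.get(
        f"{COMPUTE_URL}/servers/{server['id']}", headers=headers
    ).json()["server"]
    print(server["id"], "fault in list:", "fault" in server,
          "fault in show:", "fault" in shown)
```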
melwitt | thx | 21:12 |
opendevreview | Merged opendev/system-config master: Replace zl02 https://review.opendev.org/c/opendev/system-config/+/954539 | 21:12 |
clarkb | corvus: ^ we should see how much quicker that one deploys compared to zl01 | 21:13 |
corvus | good point | 21:13 |
melwitt | I'm getting the same behavior ... no fault field in the list. I will need to debug what is going on here. as far as I know it's not intentional but I could be wrong. the only hint I find that it might be intentional is this sentence in the microversion history https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id77 "API returns a details parameter for each failed event with a fault message, similar to the server | 21:17 |
melwitt | fault.message parameter in GET /servers/{server_id} for a server with status ERROR" that is, it specifically mentions /servers/{server_id} only with regard to the fault field | 21:17 |
melwitt | gmaan might know actually, in the meantime | 21:18 |
corvus | hopefully that's just a harmless omission (ie, the fault field shows up in servers/{server_id} as well as servers/detail, but they omitted the last part because it doesn't change the meaning) | 21:20 |
corvus | melwitt: thanks for looking into it | 21:21 |
clarkb | looks like the letsencrypt job failed on the zl02 deployment | 21:33 |
clarkb | rc -13 on gitea13 checking the acme.sh script | 21:34 |
clarkb | nothing else has merged since. I'll just reenqueue that change to deploy | 21:34 |
clarkb | ok reenqueued | 21:36 |
corvus | thx | 21:36 |
melwitt | the more I looked at the code the more it looked like a bug and sure enough https://bugs.launchpad.net/nova/+bug/1856329 | 22:08 |
clarkb | hrm does that imply it is a cells configuration thing then? So maybe less a new openstack problem (that bug is fairly old) and more how the cells are laid out? | 22:10 |
melwitt | I think it's just a bug in the nova code, unfortunately | 22:12 |
corvus | the bug mentions a change which claims to fix it: https://review.opendev.org/c/openstack/nova/+/699176 | 22:15 |
clarkb | gitea13 is under high load and that is why the manage-projects job is taking a while I think | 22:16 |
clarkb | load seems to be slowly falling now so maybe manage-projects will complete successfully without intervention? | 22:20 |
corvus | The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly. | 22:20 |
melwitt | corvus: I see. off the top of my head I'm not sure if that's the best way to fix it but will review | 22:21 |
melwitt | the instance_list itself has code in it to fan out across cells so it feels like there's something wrong at a lower layer but I have to look into it more | 22:23 |
clarkb | corvus: if the manage-projects run times out or fails then the hourly runs should run zuul without depending on the manage-projects job | 22:26 |
corvus | oh good | 22:26 |
clarkb | for that reason I don't think we should bother re-enqueuing again or trying to manually run the playbook. A bit lazy but I think it will get things done | 22:26 |
corvus | yay it did not fail | 22:36 |
clarkb | woot | 22:36 |
corvus | #status log replaced zl02 with an 8gb server | 22:45 |
opendevstatus | corvus: finished logging | 22:46 |
corvus | i've restarted zl01 on the newest zuul commit as well; transferred the volumes to the new servers, and deleted the old servers | 22:46 |
clarkb | did zl02 deploy the latest code due to timing? | 22:47 |
corvus | outbound traffic on zl01 peaked around 800Mbps | 22:47 |
corvus | yes it did | 22:47 |
corvus | at some point we may want to look at what's capping the inbound traffic at 100Mbps, but that's not as important right now. | 22:48 |
clarkb | I wonder if that is limited on the source side | 22:49 |
corvus | yeah i wonder -- but we have 5 threads downloading in parallel, so if it's per-connection it's 20Mbps each; it might be more sophisticated traffic shaping based on endpoints and flows. | 22:53 |
corvus | clarkb: fungi do you mind taking a look at https://review.opendev.org/931824 and https://review.opendev.org/953820 ? that addresses something we saw last week, and i'd like to start exercising that fix | 22:57 |
clarkb | corvus: the checksum that is being validated is the one between intermediate upload storage and the launcher right? not launcher and glance? | 23:07 |
corvus | right, that's the checksum that we generate locally when we upload to intermediate storage, and we store that checksum in zuul's db | 23:09 |
corvus | locally == on the image build worker node | 23:09 |
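For context, that build-side checksum can be computed while the image is streamed to intermediate storage; a minimal sketch follows. Only the "hash on upload, record it, verify later" flow comes from the discussion; the sha256 choice, chunk size, and function name are assumptions:

```python
import hashlib

def upload_with_checksum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Read an image file in chunks, hashing as we go.

    The returned digest is what would be recorded (e.g. in Zuul's DB) and
    later compared against the copy fetched from intermediate storage.
    """
    digest = hashlib.sha256()  # algorithm is an assumption, not confirmed
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            # ... in the real flow each chunk would also be sent to the
            # intermediate storage upload here ...
    return digest.hexdigest()

# Usage (path is illustrative):
# print(upload_with_checksum("/tmp/build/image.qcow2"))
```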
clarkb | thanks, I think I found a bug in that one. Posted as comments | 23:10 |
corvus | clarkb: thanks fixed | 23:22 |
corvus | we're going to burn a lot of cycles trying to spin up failed servers in flex... do we want to drop the limits, or leave it in place and see how the system performs under very adverse conditions? | 23:24 |
corvus | (i think it will be okay) | 23:24 |
corvus | (just maybe a little slower than it would be if we turned flex off) | 23:25 |
corvus | https://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all | 23:26 |
corvus | you can see the problem's already been happening for most of the day | 23:26 |
clarkb | corvus: maybe dan_with has an opinion? | 23:26 |
clarkb | james denton isn't in here | 23:26 |
gmaan | melwitt: clarkb corvus fault field should be present in both GET server and list server details. only thing is it is included in response if server is in fault status (error or deleted) otherwise nova does not add it | 23:48 |
clarkb | gmaan: ya except there is a bug apparently | 23:48 |
gmaan | so there are chances that some server response might not have it if they are not in error or deleted | 23:48 |
clarkb | so it isn't properly included even when in an error state | 23:49 |
melwitt | thanks gmaan. I later found there is already a bug open about it also https://bugs.launchpad.net/nova/+bug/1856329 I have pinged about it in #openstack-nova for awareness | 23:49 |
gmaan | clarkb: in both API responses you mean, right? or is there a diff between show and list detail? | 23:49 |
corvus | gmaan: thanks. that's my understanding of the intended behavior. we are seeing that fault is not present in /servers/detail on a very new cloud even on servers in ERROR (and the fault does appear with "server show") | 23:49 |
gmaan | ohk, got it. yeah just saw that | 23:49 |
clarkb | corvus: I've approved the checksum checking change despite fungi's comment since fungi +2'd it I figured its fine for now | 23:50 |
clarkb | we can adjust based on real world data if it is too slow as is | 23:50 |
corvus | ack, thx | 23:52 |