Wednesday, 2025-07-09

corvus#status log restarted zuul-launchers with latest fixes00:22
opendevstatuscorvus: finished logging00:23
corvusanyone remember what the rackspace classic network caps are?  cause looking at the cacti graphs, i'm suspecting we're hitting a 100Mbit outbound cap on the launchers.  that's on a 2GB ram node.00:29
corvusthat's nbd for node operations, but it'll slow down image uploads00:29
corvusthe previous version of zl01, which was an 8GB vm, was able to push at least 200Mbps, possibly more (hard to tell in our degraded cacti due to rollup artifacts)00:31
corvuswe're generating more images now than normal, so it's possible this may be reduced once things settle down, but it looks like we're spending around 20 hours each of the last 2 days uploading images.  so getting pretty close to the point where this could be a problem; maybe past that point now with the new images.00:33
corvusso we might need to add another launcher just to handle the image load, or upsize the vms to get more bandwidth.  which is a shame, because they're nowhere near using their allocated cpu or ram; their usage is minuscule.00:34
corvusi think if we decide to upsize, we should go back to 8GB; i think a 4GB node might still limit our bandwidth.00:45
corvusthat's the way i'm leaning.  i can start on that tomorrow if folks agree.00:49
Clark[m]corvus I think flavor show shows some network cap info00:51
Clark[m]It may be ratios between the flavors rather than a raw bit rate, and yeah, bigger flavors get more bit rate, so maybe that is a workaround00:52
corvusyeah, that matches the docs i found.  the number is 400 for 2GB, 800 for 4GB, and 1600 for 8GB.  based on those ratios, i guess we should be able to push 400Mbps with an 8GB node.01:22
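For reference, that kind of cap can be inspected from the client side along the lines Clark[m] mentions; a hedged example with an assumed Rackspace classic flavor name, noting that which field carries the bandwidth figure varies by cloud (it may be a flavor property/extra spec, or only the relative rxtx ratio):

    openstack flavor show performance1-2
    # the network cap, if exposed, appears among the flavor's properties;
    # on some clouds only a relative rxtx_factor is shown rather than an absolute rate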
opendevreviewClark Boylan proposed opendev/system-config master: Disable recursive git clones with ansible  https://review.opendev.org/c/opendev/system-config/+/95438201:25
opendevreviewClark Boylan proposed opendev/base-jobs master: Disable recursive git clones with ansible  https://review.opendev.org/c/opendev/base-jobs/+/95438301:29
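For context, a minimal sketch of the kind of task these changes adjust, assuming the ansible.builtin.git module (whose recursive option defaults to true and pulls in submodules); the repo, dest, and task name here are illustrative, not the actual system-config content:

    - name: Clone repository without pulling in submodules
      ansible.builtin.git:
        repo: https://opendev.org/example/project
        dest: /opt/project
        recursive: false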
noonedeadpunkso translations were proposed at least for us: https://review.opendev.org/c/openstack/openstack-ansible/+/95440007:07
mnasiadkanoonedeadpunk: for a lot of other projects as well07:12
noonedeadpunkfor a lot it failed :)07:13
noonedeadpunkhttps://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&pipeline=periodic&skip=007:13
noonedeadpunkas yesterday some were passing as well, just because they didn't have updates for translations07:17
noonedeadpunklike this seems to use old whitelist_externals in tox: https://zuul.opendev.org/t/openstack/build/b9c83152e96f42a58a91af2afbdc00e2/log/job-output.txt07:18
noonedeadpunkwhich I think is broken with nodeset switch07:19
noonedeadpunkbut they seem to fail on unrelated things I'd say... One fails to build Pillow, Ironic has conflicting constraint to requirements.07:27
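For context, tox renamed whitelist_externals to allowlist_externals some releases ago, and the newer tox shipped on the updated nodesets no longer honors the old spelling; a minimal sketch of the fix in a hypothetical tox.ini:

    [testenv:venv]
    # old spelling, ignored by newer tox:
    # whitelist_externals = bash
    allowlist_externals =
        bash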
fricklernoonedeadpunk: I saw https://review.opendev.org/c/openstack/designate-dashboard/+/954385 which tries to delete all releasenote translations, which doesn't seem right to me, either07:45
frickleroh, although this is for stable branches, maybe the reno translations only apply on the master branch, so this change could be correct? some neutron-* changes look similar. might be good to confirm with I18N people though07:47
noonedeadpunkam I right that designate dashboard is not functional anyway?08:09
noonedeadpunkah, ok, sorry, mixed it up with barbican08:10
noonedeadpunkyeah, I agree that it would be good to confirm that08:11
noonedeadpunkas it's not super clear if we fixed or bricked things08:11
noonedeadpunklike - there's a horizon proposal as well, which I've heard should not have happened08:11
noonedeadpunkbut also only for stable branches08:12
stephenfinclarkb: fungi: I think you reached the same conclusion, but yes, deprecation and deprecation for removal are separate things and I'm mostly pursuing the former12:38
stephenfinThe only exceptions are things like the `test` and `build_doc` commands, which are not core to building a package and rely on other packages12:39
stephenfinI've indicated this on a few of the reviews, but I'm open to adding verbiage to indicate this. I want to drive new users towards the upstream variants of options, but don't want to break existing workflows12:40
opendevreviewMerged opendev/base-jobs master: Disable recursive git clones with ansible  https://review.opendev.org/c/opendev/base-jobs/+/95438312:51
ykarelfungi, clarkb still seeing that multinode issue, mixing cloud providers12:56
ykarelhttps://zuul.openstack.org/build/83acbc4e29ca4117bb2d7825d4504d3012:56
ykarelhttps://zuul.openstack.org/build/91af210331494b35bcf21d9b7cbdb0d512:56
ykarelfrom today ^12:56
fungiykarel: thanks! looks like those are both periodic jobs, i wonder if the burst when those all get enqueued at once is overwhelming some providers and causing repeated boot failures/timeouts that eventually force it over to trying get the node from another provider rather than failing the job outright13:00
ykarelyes seen in periodic ones13:06
opendevreviewMerged opendev/system-config master: Disable recursive git clones with ansible  https://review.opendev.org/c/opendev/system-config/+/95438213:21
fricklercorvus: ^^ so the question would be whether we still consider that an issue in zuul-launcher or whether we need to somehow handle this in devstack, either by a pre-check that rejects such a nodeset or by changing the handling of internal vs. public addresses13:29
fungione of the suggestions was some means to express at the job level when nodes should be allocated from a single provider or otherwise immediately failed when that's not possible, rather than trying to proceed with mixed providers13:32
corvusyeah, i just want to make sure that zuul-launcher is operating as intended before we start talking about changing it.  let me investigate that before we decide how to proceed.13:34
corvusokay, after looking at the first instance, it exceeded its three retries and did failover to another provider for one of the nodes.  so that part seems to be working better.  but what's interesting is why it failed 3 times14:41
corvus1) launch timeout14:42
corvus2) nova scheduling failure14:42
corvus3) quota exceeded (cores)14:42
fungiit's going for a full bingo card14:42
fungiwas that in rackspace classic?14:43
corvusthe third one should not have resulted in a permanent failure (it should not count against the 3 retries).  i think ovh now issues a slightly different error message than when we wrote the code originally (we have to string match).  so i wrote https://review.opendev.org/c/zuul/zuul/+/954505 to update it14:43
corvusno, ovh14:43
fungioh, even more surprising14:43
corvusfailure #1, the timeout, is also a new thing for us, since we just (re-)implemented launch timeouts yesterday.  we would not have seen those in yesterday's rush of periodic jobs.14:44
corvusso i wonder if we should increase the launch-timeout on ovh?14:44
fungior maybe not surprising, from what i understand our tenant in ovh has a dedicated host aggregate with separate overcommit tuning, mainly to avoid impacting other customers as a noisy neighbor14:45
corvusi'm not sure why we would get a nova error instead of a quota error for #2...14:45
corvusthis is the error it got: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 81889faa-c359-4f23-8b17-77d91877406f.14:46
corvusshould zuul treat that as a temporary failure?  or should we still consider that a real error?  is there something else we should do to avoid it?14:47
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Increase launch timeout for ovh  https://review.opendev.org/c/opendev/zuul-providers/+/95450814:48
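For reference, a rough sketch of what a launch-timeout bump in opendev/zuul-providers might look like; the attribute name, value, and where it attaches (section vs. provider) are assumptions based on the change subject, not a copy of the actual patch:

    - section:
        name: ovh
        launch-timeout: 900  # seconds; allow more time for slow boots before counting a failure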
clarkbcorvus: I think that is nova indicating there was no hypervisor to assign the node. I hesitate to treat that as a permanent error because in the past we've seen errors like that require human intervention due to stale/broken placement data14:54
corvusthat sounds like we should treat it as a non-temporary error.  basically, if we mark the node as "tempfailed" then we will not count the failure against the retry counter; if we mark it as "failed" we will.  so errors that result in "failed" will try 3 times before falling back on another provider.  "tempfailed" could loop forever (literally) in the provider, so we only want to do that for cases where we actually expect the provider to recover on14:57
corvusits own (eg, quota)14:57
clarkbya quota makes sense for doing that14:58
corvusso what do you think?  leave "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance" as a "permanent" error?  or should we treat it as quota error?15:04
clarkbI almost wonder if we need a third state which is a temporary temporary failure15:05
clarkbbasically if it persists past some limit then we treat it as permanent15:05
fricklersadly nova doesn't bubble up more details about why scheduling failed, so keeping it permanent is the safer option IMO. how about increasing the number of retries?15:06
corvusyeah, we can increase retries.  but this also didn't happen as often.  most of the errors were regular quota errors.  so if we fix those (in progress) we might be able to leave the current value15:09
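A minimal sketch of the string-matching classification corvus describes, with illustrative patterns and names (the real logic lives in zuul's openstack driver and is what 954505 updates):

    # Illustrative only: map a provider fault message to how zuul should treat it.
    QUOTA_FAULT_PATTERNS = (
        "quota exceeded",
        "exceeds available quota",
    )

    def classify_fault(message: str) -> str:
        """Return 'temporary' for faults the provider is expected to clear on its
        own (e.g. quota), which do not count against the retry limit, and
        'permanent' for faults that consume one of the 3 retries before zuul
        fails over to another provider."""
        lowered = message.lower()
        if any(pattern in lowered for pattern in QUOTA_FAULT_PATTERNS):
            return "temporary"
        return "permanent"

    print(classify_fault("Quota exceeded for cores: Requested 8, but already used 400 of 400 cores"))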
johnsomnoonedeadpunk Designate dashboard should be fully functional. Are you seeing otherwise?15:18
noonedeadpunkjohnsom: nah, as I said, I mixed up with barbican15:24
johnsomack15:25
*** dhill is now known as Guest2162715:33
clarkbstephenfin: I think there is a small bug in https://review.opendev.org/c/openstack/pbr/+/954514/ but otherwise the stack up to about https://review.opendev.org/c/openstack/pbr/+/954047/ looks mergeable to me16:04
clarkbstephenfin: fungi ^ maybe we focus on getting that first half in to avoid too much rebase churn when fixing small bugs like the one in 954514? Then tackle the second half of the stack after?16:05
clarkbmaybe that doesn't help much I dunno16:05
fungimakes sense, sure16:05
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack  https://review.opendev.org/c/zuul/zuul-jobs/+/95452816:38
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph  https://review.opendev.org/c/zuul/zuul-jobs/+/95452916:40
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-python  https://review.opendev.org/c/zuul/zuul-jobs/+/95453016:40
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo  https://review.opendev.org/c/zuul/zuul-jobs/+/95453116:41
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-npm  https://review.opendev.org/c/zuul/zuul-jobs/+/95453216:42
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test  https://review.opendev.org/c/zuul/zuul-jobs/+/95453316:43
corvushttps://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc16:50
corvustoday was a little anomalous -- a big bunch of node requests at 15:0016:51
corvusbut i think the launchers did what they were supposed to do16:51
corvusinfra-root: i do think we should recreate the launchers as 8gb nodes in order to get the extra bandwidth.  i didn't see any feedback on that overnight... so i'm... going to take that as consensus? :)16:52
corvuslast chance to object or suggest an alternative16:53
fungii concur16:54
fungisad but necessary, i guess16:54
fungiheading out on some quick errands, should only be a few minutes16:55
clarkbcorvus: ya bigger nodes seems reasonable to me16:59
opendevreviewJames E. Blair proposed opendev/zone-opendev.org master: Replace zl01  https://review.opendev.org/c/opendev/zone-opendev.org/+/95453617:20
opendevreviewJames E. Blair proposed opendev/zone-opendev.org master: Replace zl02  https://review.opendev.org/c/opendev/zone-opendev.org/+/95453717:20
opendevreviewJames E. Blair proposed opendev/system-config master: Replace zl01  https://review.opendev.org/c/opendev/system-config/+/95453817:20
opendevreviewJames E. Blair proposed opendev/system-config master: Replace zl02  https://review.opendev.org/c/opendev/system-config/+/95453917:20
corvusinfra-root: if you want to +2 those, i'll +w and replace one at a time.17:21
clarkbdone17:28
opendevreviewMerged opendev/zone-opendev.org master: Replace zl01  https://review.opendev.org/c/opendev/zone-opendev.org/+/95453617:40
corvuslooking at the status page, i'm seeing some node_failures for arm nodes -- it looks like it's due to launch-timeouts.17:41
corvusdo we need a larger value for osuosl as well?17:41
ianychoi<frickler> "noonedeadpunk: I saw https://..." <- I have just replied to the patch - releasenotes translation is only managed on master branch, so deleting on stable branches is right. Thank you for helping I18n team17:43
clarkbthe current values appear to have been ported over from nodepool? Thinking out loud here, I wonder if the more regular image builds are impacting nova's caching of images17:43
clarkbcorvus: a little while back we reduced the total number of image builds in nodepool to ease back on that and update less common images less often. That's my best guess as to what is happening. Increasing the timeouts would help if that is the problem17:44
corvusoh... it looks like we're counting the "paused" time in the launch timeout... which is probably fine for other providers, but not osuosl...17:44
noonedeadpunkianychoi: aha, cool, thanks for having a look at that, as I was really unsure if I broke something with it or not17:44
corvusclarkb: let me take a look at tweaking this first17:44
clarkbcorvus: basically when you boot a node on a hypervisor that hasn't used that image yet it has to be copied from glance if not using ceph17:44
clarkbcorvus: ack17:44
noonedeadpunkthen this project is kinda terribly broken overall I believe: https://review.opendev.org/c/openstack/training-guides/+/95443217:44
noonedeadpunkbut it's not I18n concern...17:45
corvusyeah, that may be happening too, but for the example i'm looking at in the logs, i think it's the launch-timeout issue17:45
ianychoinoonedeadpunk: Thank you too for helping Ubuntu node upgrade!!17:45
noonedeadpunkianychoi: ah, no worries, I was just pinged that expected translations didn't land so I had to check anyway17:45
clarkbthe weather is cooperating today and I don't have a morning full of meetings. I'm going to pop out for a bike ride while I can.17:48
clarkbback in a bit17:48
fungiyou definitely should17:54
corvusokay, this should address the node_errors for arm64 nodes: https://review.opendev.org/95454017:54
corvusi have stopped zl01 in preparation for its replacement; zl02 is handling all the node requests now18:01
opendevreviewMerged opendev/system-config master: Replace zl01  https://review.opendev.org/c/opendev/system-config/+/95453818:17
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack  https://review.opendev.org/c/zuul/zuul-jobs/+/95452818:20
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph  https://review.opendev.org/c/zuul/zuul-jobs/+/95452918:20
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for ensure-python  https://review.opendev.org/c/zuul/zuul-jobs/+/95453018:20
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo  https://review.opendev.org/c/zuul/zuul-jobs/+/95453118:20
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for upload-npm  https://review.opendev.org/c/zuul/zuul-jobs/+/95453218:20
opendevreviewMerged zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test  https://review.opendev.org/c/zuul/zuul-jobs/+/95453318:20
corvusi am in favor of increasing the system-config mutex :)18:57
corvus#status log replaced zl01 with an 8GB server20:05
opendevstatuscorvus: finished logging20:06
corvusi've stopped zl02, so all requests are handled by the new zl01 now20:07
corvusclarkb: fungi would you mind taking a look at https://review.opendev.org/954246 -- that should make the zuul status graphs a bit easier to read20:10
opendevreviewMerged opendev/zone-opendev.org master: Replace zl02  https://review.opendev.org/c/opendev/zone-opendev.org/+/95453720:12
fungii see that infra-prod-service-zuul-preview failed in deploy for 954538, service-zuul-preview.yaml.log on bridge says it was a "HTTP status: 504 Gateway Time-out" when pulling the zuul-preview image with docker-compose20:12
fungiprobably not anything to worry about, and just wait for the next deploy20:12
corvusyeah, we're not changing anything about that, so it doesn't affect anything20:12
fungicorvus: one review comment on 95424620:14
corvusreplied20:16
fungithanks! lgtm20:16
clarkbreviewing that now20:27
clarkbcorvus: is there a semaphore count increase change yet? I'm happy to do that and can write the change if not20:27
corvusnope, was just sharing thoughts to gauge opinions :)20:31
clarkbcool, I'll work on that after I review that other change20:32
corvusunlike some other clouds, when a server in rax-flex is in error state, the 'fault' does not show up in the result we get back from listing /servers/detail.  but it does show up with "openstack server show".  is that a change in the openstack api?  is there something we can add to get it returned in the list?20:34
clarkbcorvus: why drop the deleted nodes graphing?20:35
clarkbcorvus: chances are that is a change in nova since rax-flex runs a relatively new version aiui20:35
corvusthere is no "deleting" node state20:35
clarkbcorvus: aha20:35
clarkbmelwitt: ^ do you know the answer to corvus' nova server api question there?20:35
opendevreviewMerged zuul/zuul-jobs master: Remove no_log for image upload tasks  https://review.opendev.org/c/zuul/zuul-jobs/+/95398320:36
corvusi don't see anything in https://docs.openstack.org/api-ref/compute/ that would explain the behavior; in fact, it just says that "fault" is "Only displayed when the server status is ERROR or DELETED and a fault occurred."20:38
opendevreviewClark Boylan proposed opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6  https://review.opendev.org/c/opendev/system-config/+/95454520:39
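For reference, a Zuul semaphore definition of the sort 954545 bumps looks roughly like this; the name matches the change subject, the rest is a sketch rather than the actual system-config content:

    - semaphore:
        name: infra-prod-playbook-limit
        max: 6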
corvusthis is a great time to investigate this, because every node we're building in flex is getting: {'code': 500, 'created': '2025-07-09T20:15:49Z', 'message': 'No valid host was found. There are not enough hosts available.'}20:39
corvus(but zuul can't see that, it just sees "ERROR")20:40
fungii've sent the service-announce post about removing cla enforcement from remaining projects in our gerrit20:41
clarkbfungi: I've received the email20:42
fungithe zl02 replacement inventory change failed in the gate20:42
clarkbprobably on the same image fetch 504 problem. I'm guessing quay is having a sad20:43
fungiyeah, i haven't dug in20:44
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/954545 is an easy review if you think that is safe enough20:46
corvushttps://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#2570020:46
corvusit failed something something dstat deb on noble?20:46
clarkboh ya that is a known issue20:47
corvusoh?20:47
clarkbdstat is deprecated so some other tool picked it up (something copilot) and provides a shim for it. The packaging for that new tool has been flaky at times20:47
corvushow is that non-deterministic?20:48
fungiclarkb: approved. now going to try to tackle mowing the lawn for a bit before the rains return20:48
corvusi guess the db installs a python post-install script, and it's not-deterministic python?20:48
clarkbcorvus: I think it's a race amongst the related copilot tools starting up20:49
clarkbcorvus: they have dependencies and don't properly express that in their units20:49
clarkbthough the line you linked to seems to be something different20:50
corvusis anything using dstat-logger?20:51
clarkbcorvus: I don't think anything uses it from a production standpoint. Its there for info gathering in CI20:51
clarkboh wait20:52
clarkbcorvus: the line you linked is for focal I think20:52
clarkbhttps://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#25640 is the actual error20:52
clarkbcorvus: and I think rc -13 is what happens when ansible attempts to use the controlpersist connection as ssh is shutting down that connection20:53
clarkbwe increased the controlpersist timeout to make this happen less often but some tasks can still trip over it if they run long enough20:53
clarkbin this case a recheck is maybe sufficient. If we see rc -13 occurring more often then we might tweak the controlpersist timeout again20:53
corvusor maybe set up a keepalive20:54
corvuszuul does -o ControlPersist=60s -o ServerAliveInterval=6020:54
corvusi have re-enqueued the zl02 change20:55
clarkblet me see if we set serveraliveinterval20:55
opendevreviewMerged opendev/zuul-providers master: Increase launch timeout for ovh  https://review.opendev.org/c/opendev/zuul-providers/+/95450820:56
opendevreviewMerged openstack/project-config master: Update Zuul status node graphs  https://review.opendev.org/c/openstack/project-config/+/95424620:56
opendevreviewMerged opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6  https://review.opendev.org/c/opendev/system-config/+/95454520:57
clarkbcorvus: man 5 ssh_config implies that ControlPersist relies on client connections to determine when to shut down the control master20:59
clarkbthat said I'm willing to add serveraliveinterval to see if it helps. Maybe the internal ping pong for the serveraliveinterval counts as a client connection20:59
melwittclarkb: I thought it should be returned when the GET /servers/detail API is called https://docs.openstack.org/api-ref/compute/#list-servers-detailed I will look into if there is some reason it would not21:00
clarkbmelwitt: thank you21:00
opendevreviewClark Boylan proposed opendev/system-config master: Update Ansible config to set ssh ServerAliveInterval  https://review.opendev.org/c/opendev/system-config/+/95454721:02
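For context, a hedged sketch of the kind of ansible.cfg tweak 954547 makes; the ControlPersist value shown is a placeholder (OpenDev's is longer per the earlier discussion), and ServerAliveInterval is the addition:

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=60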
clarkbmelwitt: as mentioned we've got a cloud where we can pretty consistently reproduce from the client side and dan_with may be willing to help us debug further on the server side if there are things you think we should check21:04
corvusclarkb: yes, i think that should help.  can probably ramp the timeout down too, but it's not important, and not worth doing now.21:05
corvusclarkb: do you mind giving https://review.opendev.org/951018 a re-review?  it needed a rebase21:05
clarkbon it. Should I recheck it as well?21:06
clarkbI went ahead and rechecked it21:07
clarkblooks like the only reason my vote was removed was the new depends on which has merged so basically a noop now21:07
melwittclarkb: so far I'm not seeing anything in https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/views/servers.py#L504 that would prevent it from showing the fault. do you know what is being used to query the API? I'm guessing not OSC if you are wanting more attributes from the detail list21:08
clarkbcorvus: probably knows how that is being requested21:10
melwittI have a devstack handy so actually I can try this myself to see 21:10
clarkbmelwitt: I think its doing a direct request with python requests21:11
clarkblet me link the code21:11
corvushttps://paste.opendev.org/show/bOK5kTOJ66AJfTR3F1tQ/21:11
clarkbhttps://opendev.org/zuul/zuul/src/commit/9554f9593df757fcd3e5fbb8e6cc86ea55fd9c51/zuul/driver/openstack/openstackendpoint.py#L764-L80421:12
corvusthere's the test script i'm using (for simplicity)21:12
melwittthx21:12
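A hedged stand-in for the kind of test script behind that paste (not a reproduction of it); the endpoint URL and token are placeholders, and the point is just comparing the list response against the documented behavior for servers in ERROR:

    import requests

    COMPUTE_ENDPOINT = "https://compute.example.org/v2.1"  # placeholder
    TOKEN = "..."  # keystone token, obtained separately

    resp = requests.get(f"{COMPUTE_ENDPOINT}/servers/detail",
                        headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    for server in resp.json()["servers"]:
        if server["status"] == "ERROR":
            # per the API reference a 'fault' key should appear here, but on the
            # cloud in question it only shows up in GET /servers/{server_id}
            print(server["id"], server.get("fault", "<no fault field in list response>"))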
opendevreviewMerged opendev/system-config master: Replace zl02  https://review.opendev.org/c/opendev/system-config/+/95453921:12
clarkbcorvus: ^ we should see how much quicker that one deploys compared to zl0121:13
corvusgood point21:13
melwittI'm getting the same behavior ... no fault field in the list. I will need to debug what is going on here. as far as I know it's not intentional but I could be wrong. the only hint I find that it might be intentional is this sentence in the microversion history https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id77 "API returns a details parameter for each failed event with a fault message, similar to the server21:17
melwitt fault.message parameter in GET /servers/{server_id} for a server with status ERROR" that is, it specifically mentions /servers/{server_id} only with regard to the fault field21:17
melwittgmaan might know actually, in the meantime21:18
corvushopefully that's just a harmless omission (ie, the fault field shows up in servers/{server_id} as well as servers/detail, but they omitted the last part because it doesn't change the meaning)21:20
corvusmelwitt: thanks for looking into it21:21
clarkblooks like the letsencrypt job failed on the zl02 deployment21:33
clarkbrc -13 on gitea13 checking the acme.sh script21:34
clarkbnothing else has merged since. I'll just reenqueue that change to deploy21:34
clarkbok reenqueued21:36
corvusthx21:36
melwittthe more I looked at the code the more it looked like a bug and sure enough https://bugs.launchpad.net/nova/+bug/185632922:08
clarkbhrm does that imply it is a cells configuration thing then? So maybe less a new openstack problem (that bug is fairly old) and more how the cells are laid out?22:10
melwittI think it's just a bug in the nova code, unfortunately22:12
corvusthe bug mentions a change which claims to fix it: https://review.opendev.org/c/openstack/nova/+/69917622:15
clarkbgitea13 is under high load and that is why the manage-projects job is taking a while I think22:16
clarkbload seems to be slowly falling now so maybe manage-projects will complete successfully without intervention?22:20
corvusThe Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly.22:20
melwittcorvus: I see. off the top of my head I'm not sure if that's the best way to fix it but will review22:21
melwittthe instance_list itself has code in it to fan out across cells so it feels like there's something wrong at a lower layer but I have to look into it more22:23
clarkbcorvus: if the manage-projects run times out or fails then the hourly runs should run zuul without it depending on the manage-projects job22:26
corvusoh good22:26
clarkbfor that reason I don't think we should bother re-enqueuing again or trying to manually run the playbook. A bit lazy but I think it will get things done22:26
corvusyay it did not fail22:36
clarkbwoot22:36
corvus#status log replaced zl02 with an 8gb server22:45
opendevstatuscorvus: finished logging22:46
corvusi've restarted zl01 on the newest zuul commit as well; transferred the volumes to the new servers, and deleted the old servers22:46
clarkbdid zl02 deploy the latest code due to timing?22:47
corvusoutbound traffic on zl01 peaked around 800Mbps22:47
corvusyes it did22:47
corvusat some point we may want to look at what's capping the inbound traffic at 100Mbps, but that's not as important right now.22:48
clarkbI wonder if that is limited on the source side22:49
corvusyeah i wonder -- but we have 5 threads downloading in parallel, so if it's per-connection it's 25Mbps each; it might be more sophisticated traffic shaping based on endpoints and flows.22:53
corvusclarkb: fungi do you mind taking a look at https://review.opendev.org/931824 and https://review.opendev.org/953820 ?  that addresses something we saw last week, and i'd like to start exercising that fix22:57
clarkbcorvus: the checksum that is being validated is the one between intermediate upload storage and the launcher right? not launcher and glance?23:07
corvusright, that's the checksum that we generate locally when we upload to intermediate storage, and we store that checksum in zuul's db23:09
corvuslocally == on the image build worker node23:09
clarkbthanks, I think I found a bug in that one. Posted as comments23:10
corvusclarkb: thanks fixed23:22
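For context, the idea under review is to hash the image while it is uploaded to intermediate storage and record that digest in Zuul's DB so the launcher can verify its download later; a minimal sketch of the hashing side, with an illustrative helper:

    import hashlib

    def sha256_of_file(path: str, chunk_size: int = 64 * 1024) -> str:
        """Stream the file and return its hex sha256, suitable for storing
        alongside the upload and comparing against the downloaded copy."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                digest.update(chunk)
        return digest.hexdigest()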
corvuswe're going to burn a lot of cycles trying to spin up failed servers in flex... do we want to drop the limits, or leave them in place and see how the system performs under very adverse conditions?23:24
corvus(i think it will be okay)23:24
corvus(just maybe a little slower than it would be if we turned flex off)23:25
corvushttps://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all23:26
corvusyou can see the problem's already been happening for most of the day23:26
clarkbcorvus: maybe dan_with has an opinion?23:26
clarkbjames denton isn't in here23:26
gmaanmelwitt: clarkb corvus fault field should be present in both GET server and list server details. the only thing is that it is only included in the response if the server is in a fault status (error or deleted); otherwise nova does not add it23:48
clarkbgmaan: ya except there is a bug apparently23:48
gmaanso there are chances that some server response might not have it if they are not in error or deleted 23:48
clarkbso it isn't properly included even when in an error state23:49
melwittthanks gmaan. I later found there is already a bug open about it also https://bugs.launchpad.net/nova/+bug/1856329 I have pinged about it in #openstack-nova for awareness23:49
gmaanclarkb: in both API responses you mean, right? or is there a diff between show and list detail?23:49
corvusgmaan: thanks.  that's my understanding of the intended behavior.  we are seeing that fault is not present in /servers/detail on a very new cloud even on servers in ERROR (and the fault does appear with "server show")23:49
gmaanohk, got it. yeah just saw that23:49
clarkbcorvus: I've approved the checksum checking change despite fungi's comment since fungi +2'd it I figured its fine for now23:50
clarkbwe can adjust based on real world data if it is too slow as is23:50
corvusack, thx23:52
