*** jpena|off is now known as jpena | 08:06 | |
dpawlik | dansmith: after 4 days of getting logs from TIMED_OUT, it looks like... | 10:25 |
dpawlik | 1. rax-ORD - 21 times | 10:25 |
dpawlik | 2. rax-IAD - 13 times | 10:25 |
dpawlik | 3. rax-DFW - 12 times | 10:25 |
dpawlik | 4. ovh-GRA1 - 9 times | 10:25 |
dpawlik | 5. ovh-BHS1 - 6 times | 10:25 |
fungi | that looks like it could approximately correlate to the proportions of quota we have in each region, implying a relatively even distribution and not implicating any particular provider/region | 12:40 |
fungi | would need to look at the effective peak utilization on the graphs for each region to be absolutely certain, since the max-servers value set for them in nodepool can be a lie, but regardless those numbers all look to be within an order of magnitude of each other so we're probably nowhere near statistical significance with the number of samples there | 12:47 |
opendevreview | Elod Illes proposed openstack/openstack-zuul-jobs master: Add stable/2023.1 to periodic-stable templates https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/876568 | 12:53 |
fungi | just going by max-servers though: rax-ord=195, rax-iad=145, rax-dfw=140, ovh-gra1=79, ovh-bhs1=120 | 12:57 |
fungi | so if we ignore the lack of statistical significance, it seems like maybe ovh-bhs1 performs twice as well as all the others | 12:58 |
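A quick way to reproduce the normalization fungi is doing here, using the counts dpawlik and fungi quote above. The chi-square goodness-of-fit test at the end is only an illustrative sketch of the "statistical significance" point; nothing like this was actually run in-channel.

```python
# Back-of-the-envelope check on the numbers quoted above: TIMED_OUT counts
# per region (dpawlik) normalized by nodepool max-servers (fungi), plus a
# chi-square goodness-of-fit test against a purely quota-proportional
# distribution. Illustrative only.
from scipy.stats import chisquare

timeouts = {"rax-ord": 21, "rax-iad": 13, "rax-dfw": 12,
            "ovh-gra1": 9, "ovh-bhs1": 6}
max_servers = {"rax-ord": 195, "rax-iad": 145, "rax-dfw": 140,
               "ovh-gra1": 79, "ovh-bhs1": 120}

for region, count in timeouts.items():
    # Timeouts per configured server slot over the four-day window.
    print(f"{region}: {count / max_servers[region]:.3f} timeouts per slot")

total_timeouts = sum(timeouts.values())
total_slots = sum(max_servers.values())
observed = list(timeouts.values())
expected = [total_timeouts * max_servers[r] / total_slots for r in timeouts]

# Does the observed per-region split deviate from what quota proportions
# alone would predict? A large p-value means "no detectable deviation".
stat, pvalue = chisquare(observed, expected)
print(f"chi-square = {stat:.2f}, p = {pvalue:.2f}")
```

With these figures the per-slot rates land around 0.09-0.11 everywhere except ovh-bhs1 (roughly 0.05), and the test comes out around p ≈ 0.5, which lines up with fungi's read that the sample is too small to single out any one provider.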
fungi | however, looking at the graphs, something's wrong in rax-ord and we're not actually booting anywhere near our max-servers value there, seems most of it either sits unused or is taken up by nodes in a "building" state, suggesting maybe we have serious boot issues there. as a result, the used count is relatively far lower than for other regions, making its comparatively high timeout count | 13:00 |
fungi | a bit more significant of an indicator | 13:01 |
fungi | strangely, we have enough server quota there for 204 instances, enough ram quota for 200 instances, unlimited core quota... though totalInstancesUsed is reporting 10x more than what's currently on nodepool's graph leading me to wonder if there's a ghost fleet hiding there | 13:12 |
fungi | yeah, there are a ton of nodes there in an active state which nodepool's own count doesn't reflect | 13:15 |
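A sketch of the kind of cross-check being described: ask nova for the region's absolute limits and its ACTIVE servers, then compare against what nodepool believes it has there. The commands are standard python-openstackclient ones, but the JSON shapes can vary between client releases, and the nodepool count below is a hand-entered placeholder, so treat this as illustrative only.

```python
# Compare what nova reports for a region (absolute quota usage and ACTIVE
# servers) with what nodepool thinks it has there. Assumes
# python-openstackclient is installed and OS_CLOUD / credentials point at
# the region in question; output shapes may differ slightly between client
# releases.
import json
import subprocess

def openstack_json(*args):
    """Run an openstack CLI command with the JSON formatter and parse it."""
    out = subprocess.check_output(["openstack", *args, "-f", "json"])
    return json.loads(out)

# Absolute limits include maxTotalInstances / totalInstancesUsed, the
# fields quoted in the discussion above.
limits = {row["Name"]: row["Value"]
          for row in openstack_json("limits", "show", "--absolute")}
print("maxTotalInstances: ", limits.get("maxTotalInstances"))
print("totalInstancesUsed:", limits.get("totalInstancesUsed"))

# Servers nova actually reports as ACTIVE in the region.
active = openstack_json("server", "list", "--status", "ACTIVE")
print("ACTIVE servers per nova:", len(active))

# Whatever nodepool's graphs (or `nodepool list`) show for this provider,
# entered manually for comparison -- a hypothetical figure.
nodepool_count = 20
print("Unaccounted-for servers:", len(active) - nodepool_count)
```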
fungi | moving to #opendev with further investigation of what's going on in that provider | 13:48 |
dansmith | dpawlik: ack, but also two of those days were mostly weekend | 14:42 |
dansmith | dpawlik: but yeah, that's the sort of data I want to be able to gather, so thanks very much for getting that fixed! | 14:42 |
fungi | anyway, there do seem to be enough examples outside rax-ord to suggest that the underlying problems aren't exclusive to a specific provider nor all a result of whatever is going on in rax-ord that we need to sort out | 14:47 |
fungi | also we still have inmotion-iad3 offline until we can track down what's been causing nova to turn off the mirror server in there | 14:48 |
fungi | so if it was contributing to timeouts previously, we won't have numbers to say until it's returned to the pool | 14:49 |
dpawlik | dansmith you're welcome | 15:09 |
ganso | hi! I'm not sure this is the place to ask this, but is there anything blocking this patch? https://review.opendev.org/c/openstack/grenade/+/871946 | 16:09 |
dansmith | gmann: kopecmartin I can +W this ^ but since it's been sitting a while I just want to make sure you weren't waiting for the release or something | 16:12 |
ganso | I've seen this tempest issue blocking the CI on heat stable/yoga branch. Has anyone seen the fix for it that could be backported? "AttributeError: type object 'Draft4Validator' has no attribute 'FORMAT_CHECKER'" | 16:15 |
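For context on the traceback quoted above: `FORMAT_CHECKER` only exists as a class attribute on jsonschema validator classes in newer jsonschema releases, so code referencing `Draft4Validator.FORMAT_CHECKER` breaks against the older jsonschema pinned on stable branches. A minimal illustration of the two spellings follows; note that the actual tempest remedy discussed below is a revert, not a compatibility shim like this.

```python
# Illustration of the AttributeError quoted above: Draft4Validator only grew
# a FORMAT_CHECKER class attribute in newer jsonschema releases, so stable
# branches pinned to an older jsonschema fail on that lookup.
import jsonschema

try:
    format_checker = jsonschema.Draft4Validator.FORMAT_CHECKER  # newer jsonschema
except AttributeError:
    format_checker = jsonschema.draft4_format_checker  # older releases

print(type(format_checker))
```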
kopecmartin | dansmith: feel free to +W, not waiting for anything | 16:16 |
dansmith | done | 16:16 |
kopecmartin | ganso: we have reverted patches which caused that, all should be good now | 16:16 |
kopecmartin | dansmith: thanks | 16:16 |
kopecmartin | ganso: oh, almost merged the revert :/ | 16:17 |
kopecmartin | https://review.opendev.org/c/openstack/tempest/+/876218 | 16:17 |
kopecmartin | rechecking | 16:17 |
kopecmartin | it's been rechecked, so we're waiting now | 16:17 |
ganso | kopecmartin: thank you so much! =) | 16:17 |
fungi | ganso: as far as where to ask such questions in the future, grenade is a deliverable of the qa team, so #openstack-qa is your best bet (but still reasonably on topic for here as well) | 16:41 |
ganso | fungi: thanks! | 16:41 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 16:58 |
*** jpena is now known as jpena|off | 17:13 | |
opendevreview | Merged openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 17:38 |
gmann | ganso: yes, we reverted those changes. let me check if they are merged | 18:31 |
gmann | ganso: this one, it's in the gate; once it merges, feel free to recheck https://review.opendev.org/c/openstack/tempest/+/876218 | 18:32 |
gmann | just saw kopecmartin already mentioned these | 18:33 |
gmann | kopecmartin: ganso sent an email on openstack-discuss as well. I thought the revert was merged | 18:37 |