mnaser | https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/833476/15/.zuul.yaml | 01:53 |
mnaser | somehow.. zuul isn't responding to this change at all | 01:53 |
mnaser | not an error, but also not a job running | 01:53 |
mnaser | hmm, seeing buildset sitting in queue for >10 minutes with no jobs appearing | 02:18 |
fungi | mnaser: so it was working as of patchset 14 but not from 15 onwards? | 02:28 |
mnaser | fungi: yeah.. and now it's just not responding, even though the diff from patchset 13 to the current one should work (two extra files changed) | 02:30 |
fungi | and gerrit says nothing changed in .zuul.yaml between those | 02:30 |
mnaser | https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/833476/13..23 | 02:30 |
mnaser | my thought process said "let me go back to what i know works" | 02:30 |
fungi | we had an issue where one of the schedulers lost contact with gerrit earlier, i wonder if it's continuing to occur | 02:31 |
mnaser | on recheck it shows up on zuul status and then disappears with no report | 02:32 |
fungi | looking in the debug log on the scheduler, a recheck of that change has been waiting for a merger to return the layout for some time | 02:40 |
fungi | there was a large spike in the merger queue at 02:00 when openstack's periodic jobs started | 02:41 |
fungi | executors are all chewing through those right now | 02:41 |
fungi | the executor queue is ~600 | 02:42 |
fungi | and there's a node request backlog of around 500 at the moment | 02:42 |
fungi | 2022-03-12 02:31:27,374 DEBUG zuul.Pipeline.vexxhost.check: [e: 5ee2a58c5c16417ebbf6ebefc2836469] Scheduling merge for item <QueueItem da28318bde374e6b907d3b0881809f46 for <Change 0x7f93f688fb80 vexxhost/ansible-collection-atmosphere 833476,23> in check> (files: ['zuul.yaml', '.zuul.yaml'], dirs: ['zuul.d', '.zuul.d']) | 02:47 |
fungi | that was 18 minutes ago | 02:47 |
fungi | hasn't returned yet | 02:47 |
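What fungi is doing here amounts to following a single Zuul event id across services. A minimal sketch of that kind of trace, assuming the conventional /var/log/zuul/ log locations on the scheduler and merger hosts (the exact paths are an assumption and may differ on other deployments):

```console
# follow one event id in the scheduler's debug log (where the merge was scheduled)...
grep '5ee2a58c5c16417ebbf6ebefc2836469' /var/log/zuul/debug.log

# ...and on each merger host, to see which one claimed the cat/merge request
# and whether it reported completion
grep '5ee2a58c5c16417ebbf6ebefc2836469' /var/log/zuul/merger-debug.log
```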
fungi | i wonder if we're running lots more periodic jobs because of all the stable/yoga branches getting added | 02:54 |
fungi | the executor impact seems much more prolonged than in prior windows | 02:55 |
fungi | i misread the executor queue graph, we're nearly caught up on the backlog there finally | 02:57 |
mnaser | fungi: i wondered if maybe there was a timeout where a job didn't start fast enough | 03:04 |
fungi | i don't think so, it's still waiting on the config merge results | 03:10 |
mnaser | fungi: so.. everything operating as expected, just gotta be more patient? | 03:11 |
fungi | well, i don't see any sign that it's anything other than still being overloaded from the 02:00 openstack periodic job burst | 03:12 |
fungi | though the executors have finally caught back up | 03:13 |
mnaser | so should i kick off a recheck again? | 03:13 |
mnaser | (i also don't wanna make the problem worse by rechecking) | 03:13 |
fungi | so it's not clear to me why the merge request hasn't been handled yet | 03:14 |
fungi | it's been waiting for about 45 minutes already | 03:15 |
mnaser | i mean i saw it show up in the queue and then disappear after a while (no jobs were spawned) | 03:17 |
fungi | zm08 claims to have handled it at 02:31:30 | 03:18 |
fungi | 2022-03-12 02:31:30,965 DEBUG zuul.MergerApi: [e: 5ee2a58c5c16417ebbf6ebefc2836469] Removing request <MergeRequest b8b108cd446c452fa30402ceb9dbf4b6, job_type=merge, state=completed, path=/zuul/merger/requests/b8b108cd446c452fa30402ceb9dbf4b6> | 03:19 |
mnaser | hmm | 03:21 |
mnaser | recheck'd, let's see | 03:21 |
mnaser | welp | 03:21 |
mnaser | showed up in status, disappeared, and no response from zuul | 03:21 |
fungi | oh! i found it i think | 03:22 |
fungi | i assumed it was zuul01 because that's where the merge request originated, but zuul02 is what handled the enqueuing | 03:22 |
fungi | 2022-03-12 02:31:48,712 DEBUG zuul.Pipeline.vexxhost.check: [e: 5ee2a58c5c16417ebbf6ebefc2836469] <QueueItem da28318bde374e6b907d3b0881809f46 for <Change 0x7ff71a575760 vexxhost/ansible-collection-atmosphere 833476,23> in check> is a failing item because ['it has an invalid configuration'] | 03:22 |
mnaser | welp, would be nice to actually get that notification :p | 03:24 |
mnaser | now why is it an invalid configuration... | 03:25 |
mnaser | and why isn't zuul telling me that =( | 03:25 |
mnaser | and the same config worked .. earlier? | 03:25 |
mnaser | i wonder if this is a weird cache-busting thing, i'll try renaming the job and try again? | 03:25 |
fungi | right, i'm not sure what's going on there, the debug log on the scheduler doesn't actually seem to mention what's invalid about the config either | 03:26 |
mnaser | nope.. nothing | 03:26 |
mnaser | https://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L1456-L1457 | 03:27 |
mnaser | https://opendev.org/zuul/zuul/src/branch/master/zuul/manager/__init__.py#L1519-L1537 i guess that's how it ends up here? | 03:29 |
fungi | my clone of your change also wasn't tested for the same reason: | 03:32 |
fungi | 2022-03-12 03:29:26,335 DEBUG zuul.Pipeline.vexxhost.check: [e: 2898824303784a5d92e3b4369f1ce586] <QueueItem f7233b166890434bbb6acb2c88715fb9 for <Change 0x7f93d47f7c40 vexxhost/ansible-collection-atmosphere 833492,1> in check> is a failing item because ['it has an invalid configuration'] | 03:32 |
mnaser | i'm so confused | 03:35 |
mnaser | but whats invalid about it? | 03:36 |
mnaser | unless that change was like | 03:36 |
mnaser | like the report was not for patchset 13 when it passed | 03:36 |
mnaser | like the comment @ patchset 13 was maybe for patchset 12 or 11? | 03:37 |
mnaser | let me look at zuul builds | 03:37 |
mnaser | nope, that was 833476,13 .. wth | 03:38 |
fungi | mnaser: you have a .zuul.yaml and a zuul.d in the tree | 03:39 |
fungi | is that intentional? | 03:40 |
mnaser | oh man | 03:40 |
mnaser | it's not, but i thought they would all get parsed | 03:40 |
mnaser | that must be whyyyyy | 03:40 |
fungi | i think zuul expects the yaml files inside zuul.d/playbooks to be zuul configs not ansible | 03:41 |
mnaser | but | 03:41 |
fungi | so it tries to parse them and then fails | 03:41 |
mnaser | i saw openstack-ansible repo does the same | 03:41 |
mnaser | https://opendev.org/openstack/openstack-ansible/src/branch/master/zuul.d/playbooks | 03:43 |
mnaser | but i think you must have either a zuul.d or a yaml file | 03:43 |
mnaser | not both | 03:43 |
fungi | mnaser: ahh, yeah they just have a zuul.d/ | 03:43 |
fungi | no .zuul.yaml | 03:43 |
mnaser | that might just be it.. | 03:44 |
fungi | so maybe git mv .zuul.yaml zuul.d/project.yaml | 03:44 |
fungi | or something like that | 03:44 |
fungi | https://zuul-ci.org/docs/zuul/latest/project-config.html#configuration-loading | 03:45 |
fungi | "Zuul looks first for a file named zuul.yaml or a directory named zuul.d, and if they are not found, .zuul.yaml or .zuul.d (with a leading dot)." | 03:45 |
fungi | so it found your zuul.d and expected it to contain zuul configuration. it did not even bother to read your .zuul.yaml | 03:46 |
mnaser | yeah i think that's what probably happened | 03:47 |
fungi | it's clearly too late at night for me to adequately troubleshoot such things, i should have looked at your entire change and spotted that straight away, sorry! | 03:48 |
mnaser | ahhhhh | 03:48 |
mnaser | also i think you're right about the files thing | 03:48 |
mnaser | in OSA they use .yml for playbooks | 03:48 |
mnaser | but .yaml for zuul files | 03:48 |
fungi | aha | 03:49 |
fungi | that's sneaky | 03:49 |
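A rough sketch of the resulting fix, following the pattern discussed above: keep all Zuul configuration under zuul.d/ (which wins over .zuul.yaml per the config-loading docs fungi quoted), and keep Ansible playbooks on the .yml extension so Zuul's config loader leaves them alone. The playbook filename below is hypothetical; project.yaml is just the conventional name for the relocated .zuul.yaml content.

```console
# move the zuul config under zuul.d/ so there is only one config location
git mv .zuul.yaml zuul.d/project.yaml

# rename ansible playbooks from .yaml to .yml so zuul does not try to parse them
# (hypothetical playbook name, for illustration only)
git mv zuul.d/playbooks/deploy.yaml zuul.d/playbooks/deploy.yml
```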
fungi | well, anyway, sounds like you're probably on the right track now. i think i'm going to put down the computer and watch some cartoons before i pass out. have a great night/weekend! | 03:50 |
mnaser | no worries, take care | 03:51 |
frickler | o.k., since I haven't seen any further feedback on the neutron yoga issue, and the recheck still didn't start any jobs, I will now run another set of full-reconfigure, first on zuul02, then on 01 | 08:44 |
frickler | hmm, in the list of cat jobs that the executor is submitting, neutron stable/yoga is also missing, which doesn't look very promising to me | 08:52 |
frickler | so I'm going to delete the branch in gerrit now and recreate it (sha is 452a3093f62b314d0508bc92eee3e7912f12ecf1 for reference) | 09:04 |
frickler | this is looking better: 2022-03-12 09:11:33,251 INFO zuul.Scheduler: Tenant reconfiguration beginning for openstack due to projects {('opendev.org/openstack/neutron', 'stable/yoga')} | 09:13 |
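The delete-and-recreate approach frickler describes can be expressed as plain git pushes against Gerrit, assuming Delete Reference and Create Reference permissions on the project and a remote named gerrit; as comes up later, any open changes on the branch have to be abandoned first and restored afterwards. A sketch, not necessarily the exact commands used:

```console
# drop the branch zuul never learned about...
git push gerrit :refs/heads/stable/yoga

# ...and recreate it at the recorded sha, so zuul sees a branch-created event
# and triggers the tenant reconfiguration logged above
git push gerrit 452a3093f62b314d0508bc92eee3e7912f12ecf1:refs/heads/stable/yoga
```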
frickler | #status log recreated neutron:stable/yoga branch @452a3093f62b314d0508bc92eee3e7912f12ecf1 in order to have zuul learn about this branch | 09:56 |
opendevstatus | frickler: finished logging | 09:56 |
fungi | frickler: thanks! and good idea. did you have to abandon and restore the open changes on it? | 12:25 |
fungi | never mind, i see that you did | 12:27 |
mnaser | infra-root: i think nodepool might be borked. no available nodes, a bunch of deleting ones, only 11 changes in queue, and a change of mine has been queued for 24 minutes now | 21:54 |
fungi | looking | 21:55 |
fungi | 2022-03-12 22:01:00,368 INFO nodepool.NodeLauncher: [e: 88a1bf654b794b97bbdc4820b1c00d8f] [node_request: 300-0017541887] [node: 0028811545] Creating server with hostname ubuntu-focal-rax-dfw-0028811545 in rax-dfw from image ubuntu-focal | 22:01 |
fungi | 2022-03-12 22:01:00,997 DEBUG nodepool.NodeLauncher: [e: 88a1bf654b794b97bbdc4820b1c00d8f] [node_request: 300-0017541887] [node: 0028811545] Waiting for server c10d466c-9a54-492b-b6c6-13734f2edbf3 | 22:01 |
fungi | that was on nl01, maybe another one timed out, looking harder | 22:02 |
fungi | no, not an earlier one unless the node request id changed | 22:04 |
fungi | mnaser: looks like there was a delay in handling that node request, but it did end up getting one | 22:10 |
mnaser | fungi: yup indeed.. strange | 22:12 |
fungi | oh! i forgot we have an nl04 ;) | 22:15 |
fungi | it first locked the request at 21:30:39 | 22:15 |
fungi | 2022-03-12 21:30:42,358 DEBUG nodepool.NodeLauncher: [e: 88a1bf654b794b97bbdc4820b1c00d8f] [node_request: 300-0017541887] [node: 0028811529] Waiting for server 59a78014-cbe8-4dbc-bf5b-da23e8c45f61 | 22:16 |
fungi | that was in ovh-bhs1 | 22:16 |
fungi | 21:40:42 Launch attempt 1/3 failed | 22:16 |
fungi | 21:50:44 Launch attempt 2/3 failed | 22:16 |
fungi | 22:00:47 Launch attempt 3/3 failed | 22:17 |
fungi | 2022-03-12 22:00:57,214 DEBUG nodepool.driver.NodeRequestHandler[nl04.opendev.org-PoolWorker.ovh-bhs1-main-72cd222540804b9ea0137d5a7d41ec59]: [e: 88a1bf654b794b97bbdc4820b1c00d8f] [node_request: 300-0017541887] Declining node request because nodes failed | 22:17 |
fungi | so that explains it | 22:17 |
fungi | nl04 had three failed attempts in ovh-bhs1 waiting 10 minutes for each, then after that wasted half hour nl01 took the request and booted one successfully in rax-dfw | 22:18 |
fungi | openstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up. | 22:19 |
fungi | is all nodepool got back from the api... not all that helpful | 22:19 |
fungi | but yeah, looks like ovh-bhs1 may be having a bad day | 22:20 |
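For reference, the retry sequence fungi reconstructed (three roughly ten-minute launch attempts on nl04 in ovh-bhs1, a decline, then a successful rax-dfw boot on nl01) can be pieced together by grepping each launcher for the node request id. The launcher-debug.log path is an assumption based on the usual layout:

```console
# run on each launcher (nl01..nl04); the request id is the one quoted above
grep '300-0017541887' /var/log/nodepool/launcher-debug.log

# narrow to the launch attempts and the decline to see the retry pattern
grep '300-0017541887' /var/log/nodepool/launcher-debug.log | grep -E 'Launch attempt|Declining'
```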
fungi | we got some maintenance notices in french the other day, which maybe said they were doing something this weekend. none of our sysadmins is fluent, but apparently when someone at the fr headquarters approves the account, it gets locked to receiving all communications in français | 22:21 |
fungi | i'll see if i get some time to paste them into google translate | 22:21 |
mnaser | fungi: feel free to paste it here | 22:31 |
mnaser | a lot of people forget i speak french fluently =P | 22:32 |
fungi | touché ;) | 23:55 |
fungi | eh, i can work this one out without the translator, i think. api-impacting maintenance tuesday 19:00-01:00 utc for gra11 (gra1 to us probably?) | 23:58 |
fungi | [translated from french] a maintenance operation is scheduled for the GRA11 region on 15 march 2022, between 19:00 and 01:00 (UTC). no disruption to your instances is expected. some APIs will remain unreachable for the duration of the operation. | 23:59 |
fungi | the other two announcements were similar. they're all for lateish tuesday for the same region, looks like | 23:59 |