tristanC | matburt: looking now | 00:30 |
---|---|---|
tristanC | for some reason, nodepool unregistered the static nodes | 00:32 |
tristanC | ok, commenting the nodepool provider section and uncommenting triggered the re-registering, jobs are scheduled now | 00:38 |
pabelanger | tristanC: is it possible the provider died in nodepool-launcher during rdocloud outage? | 00:40 |
matburt | we did notice what looked like an rdocloud outage | 00:46 |
matburt | it lasted, maybe 10 or 15m? | 00:47 |
pabelanger | or maybe OOM? I've seen a few times nodepool provider die but not restart. I cannot remember the last one | 00:47 |
pabelanger | haven't really used the static node driver, maybe some edge case where it stops working | 00:48 |
tristanC | matburt: certainly related to the rdocloud outage, we usually need to clean services state after | 00:49 |
tristanC | pabelanger: in this case the provider didn't die, but the static node was no longer registered and the provider kept the request in queue because of missing quota | 00:49 |
pabelanger | tristanC: any thoughts how we can have nodepool re-register the provider? | 00:50 |
tristanC | pabelanger: we should discuss having a "check" periodic method in nodepool, so that the driver can make sure zookeeper state matches configuration and reality | 00:52 |
tristanC | in nodepool driver api* | 00:52 |
pabelanger | tristanC: with a static node, does min-ready do anything? | 00:52 |
tristanC | pabelanger: i don't think so, the driver works differently, the nodes are created by the provider | 00:53 |
tristanC | and they are re-used | 00:53 |
pabelanger | yah, not really sure how it works right now | 00:54 |
pabelanger | I think tobiash is using it at bmw, but not sure who else | 00:54 |
tristanC | pabelanger: iiuc this bit: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/static/provider.py#n96 | 00:56 |
tristanC | pabelanger: then the driver simply deleted the nodes during the outage because they were unreachable | 00:56 |
tristanC | pabelanger: it seems like the driver is missing code to re-add them automatically | 00:56 |
pabelanger | tristanC: yes, that is not surprising to me TBH. | 01:01 |
tristanC | hum, and ansible jobs are not starting, here is the tb: https://ansible.softwarefactory-project.io/paste/show/21/ | 01:09 |
matburt | that's strange... it's mentioning a tag that we removed recently | 01:11 |
pabelanger | oh, if you delete the tag, that might make zuul unhappy | 01:12 |
matburt | we weren't happy to have to do it... it leaked over from our private tower repo | 01:12 |
matburt | what sort of resolution do we have for this? | 01:12 |
pabelanger | I would guess reclone awx repos in zuul | 01:12 |
pabelanger | but only way I know that today, is to rm -rf the repo | 01:13 |
tristanC | matburt: pabelanger: yep, i can state: absent /var/lib/zuul/executor/github.com/ansible/ | 01:13 |
tristanC | pabelanger: rm, gross :-) | 01:13 |
tristanC | matburt: could it be possible the tag was removed when zuul was d/c from github? | 01:13 |
tristanC | e.g. during the outage yesterday? | 01:13 |
pabelanger | I'm not actually sure what zuul will do if you delete a tag it has | 01:15 |
tristanC | pabelanger: removing the executor cache doesn't seem to be enough... | 01:18 |
pabelanger | I think you would also need to do merger | 01:19 |
pabelanger | or, if you dequeue / enqueue job, does that cause zuul to refer git refs | 01:20 |
tristanC | pabelanger: if only we could dequeue/enqueue github change... | 01:20 |
pabelanger | refresh* | 01:20 |
pabelanger | tristanC: close / reopen the PR | 01:21 |
pabelanger | should do it | 01:21 |
matburt | hmm | 01:24 |
matburt | so these 4 PRs need to be closed and then opened again? | 01:24 |
tristanC | pabelanger: it seems like executor and merger need to be restarted, they are holding an internal cache of refs according to https://git.zuul-ci.org/cgit/zuul/tree/zuul/merger/merger.py#n548 | 01:25 |
pabelanger | yah, deleting tags is bad :) | 01:26 |
pabelanger | I say, try to cycle the PRs, see if that helps | 01:26 |
pabelanger | otherwise, we maybe confirm in #zuul | 01:26 |
pabelanger | and restart zuul-mergers and zuul-executors | 01:26 |
pabelanger | jeblair likely knows the right play here | 01:26 |
matburt | okay just did it on 2391... lets see what happens | 01:27 |
matburt | okay done on all of them | 01:28 |
matburt | looks like things are getting happy again? Is this something we should report to the zuul devs or is it known? | 01:29 |
tristanC | matburt: oh indeed, it fixed the issue. thanks! | 01:30 |
tristanC | matburt: i left a message on #zuul, surely something to be fixed in merger logic | 01:30 |
pabelanger | I would have expected the jobs with failing tag to fail | 01:31 |
pabelanger | then report back to PR | 01:31 |
pabelanger | but looks like that didn't happen? | 01:31 |
tristanC | "deleting tag when jobs are queued but not running" is quite an edge case :) | 01:31 |
tristanC | pabelanger: no, i think it failed too early to be considered a job failure | 01:31 |
pabelanger | yah, we should be able to write a unit test for it | 01:31 |
matburt | I've spun up a few more static nodes that I'd love to add to the nodepool… I can't remember if I set up that PR before or if you did tristanC? | 01:32 |
matburt | Another thing that I wanted to talk to yall about | 01:32 |
pabelanger | matburt: because I don't know, where are the static nodes now? | 01:32 |
tristanC | matburt: i did, the change was: https://softwarefactory-project.io/r/#/c/13747/ | 01:33 |
matburt | they are static nodes in GCE also... identical to the ones we have now | 01:33 |
pabelanger | matburt: is there something specific to GCE here or just that you have capacity there right now | 01:33 |
matburt | We like dealing with GCE, it's also where we host our k8s clusters and registries | 01:34 |
matburt | So from zuul's perspective… they are just other static nodes, the fact that they are in GCE isn't super relevant from a technical perspective | 01:34 |
*** tristanC has quit IRC | 01:37 | |
*** logan- has quit IRC | 01:37 | |
*** sbadia has quit IRC | 01:37 | |
*** gundalow has quit IRC | 01:37 | |
*** sshnaidm|afk has quit IRC | 01:37 | |
*** jpena|off has quit IRC | 01:37 | |
*** trishnag has quit IRC | 01:37 | |
*** mattclay has quit IRC | 01:37 | |
*** pabelanger has quit IRC | 01:37 | |
*** ganeshrn has quit IRC | 01:37 | |
*** matburt has quit IRC | 01:37 | |
*** zoli has quit IRC | 01:37 | |
*** jruzicka has quit IRC | 01:37 | |
*** mnaser has quit IRC | 01:37 | |
*** mordred has quit IRC | 01:37 | |
*** fbo has quit IRC | 01:37 | |
*** rcarrillocruz has quit IRC | 01:37 | |
*** nhicher has quit IRC | 01:37 | |
*** jangutter has quit IRC | 01:37 | |
*** shanemcd has quit IRC | 01:37 | |
*** dmsimard has quit IRC | 01:37 | |
*** mhu has quit IRC | 01:37 | |
*** chandankumar has quit IRC | 01:37 | |
*** fc__ has quit IRC | 01:37 | |
*** spredzy has quit IRC | 01:37 | |
*** ChanServ has quit IRC | 01:37 | |
*** chandankumar has joined #softwarefactory | 01:43 | |
*** logan- has joined #softwarefactory | 01:43 | |
*** mhu has joined #softwarefactory | 01:43 | |
*** dmsimard has joined #softwarefactory | 01:43 | |
*** shanemcd has joined #softwarefactory | 01:43 | |
*** jangutter has joined #softwarefactory | 01:43 | |
*** nhicher has joined #softwarefactory | 01:43 | |
*** rcarrillocruz has joined #softwarefactory | 01:43 | |
*** ganeshrn has joined #softwarefactory | 01:44 | |
*** matburt has joined #softwarefactory | 01:44 | |
*** zoli has joined #softwarefactory | 01:44 | |
*** fc__ has joined #softwarefactory | 01:45 | |
*** spredzy has joined #softwarefactory | 01:45 | |
*** sshnaidm|afk has joined #softwarefactory | 01:45 | |
*** jpena|off has joined #softwarefactory | 01:45 | |
*** trishnag has joined #softwarefactory | 01:45 | |
*** mattclay has joined #softwarefactory | 01:45 | |
*** sbadia has joined #softwarefactory | 01:45 | |
*** card.freenode.net sets mode: +o sbadia | 01:45 | |
*** gundalow has joined #softwarefactory | 01:45 | |
*** jruzicka has joined #softwarefactory | 01:46 | |
*** mnaser has joined #softwarefactory | 01:46 | |
*** mordred has joined #softwarefactory | 01:46 | |
*** pabelanger has joined #softwarefactory | 01:46 | |
*** fbo has joined #softwarefactory | 01:46 | |
dmsimard | it's just a server to ssh into :p | 01:46 |
matburt | that was a super unfortunate netsplit… I'm not sure what the last message yall got was | 01:47 |
pabelanger | 01:34:45 matburt | So from zuul's | 01:47 |
pabelanger | was my last message | 01:47 |
pabelanger | then I replied | 01:47 |
pabelanger | okay, cool. would be great to show awx jobs running in other places than GCE, in case there are issues with that provider for some reason. | 01:47 |
matburt | obviously... once shanemcd has some time we'd like to start ramping up to test out the kubernetes integration | 01:47 |
*** ChanServ has joined #softwarefactory | 01:47 | |
*** card.freenode.net sets mode: +o ChanServ | 01:47 | |
*** tristanC has joined #softwarefactory | 01:48 | |
tristanC | matburt: that's correct | 01:49 |
matburt | there's another thing I'd like to talk to yall about but this might not be the best place to do it | 01:49 |
tristanC | matburt: i can do bluejean now if you prefer | 01:50 |
dmsimard | oh I can do bluejeans too | 01:51 |
matburt | yep lets do that | 01:52 |
matburt | https://bluejeans.com/5442700661 | 01:53 |
dmsimard | matburt: https://pagure.io/standard-test-roles/ is the fedora thing that spawns a vm to run tests on based on ansible | 02:10 |
matburt | Iiinteresting | 02:16 |
*** nilashishc has joined #softwarefactory | 03:59 | |
*** logan- has quit IRC | 07:09 | |
*** logan- has joined #softwarefactory | 07:11 | |
spredzy | tristanC: yo | 07:55 |
spredzy | Could I ask for your approval on https://github.com/ansible/zuul-jobs/pull/23 | 07:56 |
spredzy | I'll remove the become: yes from ansible/zuul-config | 07:59 |
tristanC | spredzy: done | 08:02 |
spredzy | thanks for the comment I missed that | 08:03 |
* spredzy blesses reviews | 08:03 | |
*** sshnaidm|afk is now known as sshnaidm | 08:22 | |
*** logan- has quit IRC | 08:23 | |
*** logan- has joined #softwarefactory | 08:27 | |
*** rcarrillocruz has quit IRC | 08:30 | |
*** jangutter has quit IRC | 08:32 | |
*** jangutter has joined #softwarefactory | 08:33 | |
*** jangutter has quit IRC | 08:37 | |
*** jangutter has joined #softwarefactory | 08:37 | |
*** nilashishc has quit IRC | 08:45 | |
*** zoli is now known as zoli|lunch | 09:31 | |
*** zoli|lunch is now known as zoli | 09:31 | |
*** jpena|off has quit IRC | 10:58 | |
*** sshnaidm is now known as sshnaidm|afk | 11:04 | |
*** sshnaidm|afk is now known as sshnaidm | 11:24 | |
*** zoli is now known as zoli|wfh | 12:01 | |
*** zoli|wfh is now known as zoli|afk | 12:01 | |
*** zoli|afk is now known as zoli | 12:01 | |
*** zoli is now known as zoli|brb | 14:15 | |
*** sfbender has joined #softwarefactory | 14:19 | |
sfbender | Fabien Boucher created software-factory/sf-config master: Make the welcome page resources connections aware https://softwarefactory-project.io/r/13892 | 14:19 |
*** zoli|brb is now known as zoli | 14:25 | |
*** zoli is now known as zoli|wfh | 14:25 | |
*** zoli|wfh is now known as zoli | 14:25 | |
*** chandankumar is now known as chkumar|off | 14:55 | |
*** sshnaidm has quit IRC | 15:58 | |
*** sshnaidm has joined #softwarefactory | 15:59 | |
*** zoli is now known as zoli|gone | 17:25 | |
*** zoli|gone is now known as zoli | 17:25 | |
*** sshnaidm is now known as sshnaidm|afk | 22:33 |
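
The periodic "check" idea tristanC raised at 00:52 (a driver hook that makes sure ZooKeeper state matches configuration and reality) can be sketched roughly as below. This is only an illustration of the reconciliation loop, not nodepool's actual driver API: `StaticNodeRegistry`, `check`, and the `is_reachable` callback are hypothetical names, and an in-memory set stands in for the node records kept in ZooKeeper. It covers only the missing piece discussed in the log, re-adding configured nodes that are no longer registered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class StaticNodeRegistry:
    """Hypothetical stand-in for the node records a driver keeps in ZooKeeper."""
    registered: Set[str] = field(default_factory=set)

    def register(self, hostname: str) -> None:
        self.registered.add(hostname)

    def deregister(self, hostname: str) -> None:
        self.registered.discard(hostname)


def check(
    configured: Iterable[str],
    registry: StaticNodeRegistry,
    is_reachable: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """One pass of a periodic check (hypothetical, not a nodepool method).

    - Nodes registered but no longer in the configuration are dropped.
    - Nodes configured but not registered are re-registered if reachable.
    - Configured nodes that are still unreachable are only reported.
    """
    configured_set = set(configured)
    actions: Dict[str, List[str]] = {
        "registered": [],
        "deregistered": [],
        "unreachable": [],
    }

    # Drop records that no longer match the configuration.
    for host in sorted(registry.registered - configured_set):
        registry.deregister(host)
        actions["deregistered"].append(host)

    # Re-add configured nodes that went missing, e.g. after an outage.
    for host in sorted(configured_set - registry.registered):
        if is_reachable(host):
            registry.register(host)
            actions["registered"].append(host)
        else:
            actions["unreachable"].append(host)

    return actions


if __name__ == "__main__":
    # Assumed scenario: node-b was dropped during an outage and is back,
    # node-c is still down.
    registry = StaticNodeRegistry(registered={"node-a"})
    result = check(
        ["node-a", "node-b", "node-c"],
        registry,
        is_reachable=lambda host: host != "node-c",
    )
    print(result)
    print(sorted(registry.registered))
```

Running such a pass on a timer would have re-registered the static nodes once they became reachable again, instead of requiring the provider section to be commented out and back in by hand.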