SpamapS | Well | 00:20 |
---|---|---|
SpamapS | you could also have the app for the app things, and then users for permission things. | 00:20 |
SpamapS | One reason that sucks is if you're using GHE or private repos, you pay per seat. | 00:20 |
SpamapS | But it's just one user. ;) | 00:20 |
goern | yes, that's what we do.... the user is shared by zuul and other bots | 00:21 |
goern | and the app is zuul's baby | 00:21 |
SpamapS | The whole point of apps though, is so that you get all of that in one package. | 00:23 |
SpamapS | So it's pretty frustrating that they can't just be listed in branch protections. | 00:23 |
SpamapS | Yeah I think given GitHub's limitations, the thing to do is make it so GitHub can "click merge" with a real user and not the app user. | 00:28 |
SpamapS | err | 00:28 |
SpamapS | Yeah I think given GitHub's limitations, the thing to do is make it so *Zuul* can "click merge" with a real user and not the app user. | 00:28 |
corvus | tristanC: 00:31 < openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider https://review.openstack.org/620756 | 00:31 |
tristanC | corvus: excellent! | 00:34 |
corvus | tristanC: i'd love it if you could look over that one and the depends-on and see if i did it right -- you should be able to see everything about the config except the key. | 00:35 |
SpamapS | yuuummy | 00:35 |
tristanC | corvus: what about the roadmap, without https://review.openstack.org/570668 it won't be usable... | 00:35 |
corvus | at least, i think the thing i removed is the key :) | 00:35 |
clarkb | does the k8s driver not dynamically create namespaces and pods like nodepool does openstack instances? | 00:36 |
corvus | tristanC: can you drop the depends-on to the openshift change? | 00:36 |
clarkb | or is that specificity more label attribute and it will create many of those things | 00:36 |
SpamapS | While we're looking at providers... https://review.openstack.org/535558 is all green and wants some feedback (and has been running in production at Good Money for about 3 months now) | 00:36 |
corvus | clarkb: it does both of those things | 00:36 |
SpamapS | It does have a bug that causes NODE_FAILURE that I haven't had time to look at, but I think it has something to do with quota management. | 00:37 |
corvus | SpamapS: can we find some way to test it? it doesn't have to be perfect, but some kind of fake or mock or something that at least lets us have python evaluate some of the code (so that we can refactor, etc) would be great. | 00:38 |
clarkb | ya ok its label attributes then it creates many of those things | 00:38 |
tristanC | corvus: about https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 , best is to try using it before, e.g "kubectl get pods" shouldn't return error | 00:38 |
corvus | tristanC: yep, i ran that locally | 00:39 |
tristanC | then 620756 should be good to go | 00:39 |
tristanC | corvus: yes we can remove the depends-on openshift change, but can it be reviewed too? | 00:41 |
SpamapS | corvus: definitely.. betamax should be possible. | 00:41 |
SpamapS | Or maybe this https://github.com/spulec/moto | 00:42 |
tristanC | clarkb: the k8s driver creates a new namespace for each node request, e.g. https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/kubernetes/provider.py#n255 | 00:42 |
corvus | tristanC: yes, though i'd like to complete the k8s work before landing the openshift work | 00:42 |
tristanC | corvus: can you define complete? :) | 00:43 |
corvus | tristanC: all the required parts of the spec landed? | 00:44 |
corvus | tristanC: probably just that zuul change and any bugfixes | 00:44 |
clarkb | corvus: I'm reading the kube config file and while I have no reason to believe it is wrong I also don't understand how the admin user at the bottom of the file is associated with the vexxhost cluster. If we added a second cluster would that cluster have to have a different username than admin because the existing cluster already uses the username admin? if so I wonder if k8s would consider that a bug | 00:45 |
SpamapS | (funny enough, I may not even want the ec2 driver anymore, I might just kubernetes it up... we'll see) | 00:45 |
clarkb | seems like you should be able to use the same username in any cluster | 00:45 |
* SpamapS pokes at moto'ing | 00:45 | |
clarkb | SpamapS: tristanC ^ you may know the answer to that question | 00:45 |
corvus | clarkb: i do not know the answer; the config is mostly what 'openstack coe config' spat out | 00:46 |
tristanC | clarkb: you can use a single username in any cluster or any namespace yes | 00:47 |
clarkb | tristanC: how do you express that in the config file? | 00:48 |
clarkb | tristanC: as is the config file doesn't map the cluster to the user other than by name | 00:48 |
clarkb | tristanC: so it would have to be unique or use the same key and cert? | 00:48 |
clarkb | tristanC: https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 for example | 00:48 |
SpamapS | clarkb: guessing contexts? | 00:49 |
tristanC | clarkb: oh right, the 'name' entries in the kubectl config are just keys, not actual usernames | 00:49 |
tristanC | clarkb: the username is defined by the auth-data sent to the cluster | 00:49 |
clarkb | tristanC: gotcha so that is part of the key? | 00:49 |
clarkb | so we'd have a vexxhost-admin and just update those name keys but key-data stays the same? | 00:50 |
tristanC | clarkb: maybe? i don't know how k8s works under the hood, but in openshift you can login using just a token, no username required | 00:50 |
clarkb | we can make that update if/when there is more than one cluster | 00:50 |
SpamapS | There are various auth providers | 00:50 |
SpamapS | AWS has a token based one | 00:50 |
SpamapS | GKE has something similar | 00:50 |
clarkb | ah so this may even come down to implementation details. yay | 00:51 |
SpamapS | http://paste.openstack.org/show/736344/ | 00:51 |
* corvus eods | 00:51 | |
SpamapS | that's my user section for one of my k8s clusters | 00:51 |
SpamapS | the username is not actually there, because it's defined by my AWS credentials that aws-iam-authenticator uses. | 00:52 |
SpamapS | So it's just a name to refer to in context in the config file as "unique way to get access" | 00:52 |
clarkb | SpamapS: ya then each user: block would determine the actual details | 00:53 |
clarkb | makes sense. I mostly wanted to make sure we didn't need to manage more than one file and doesn't seem like we do. We'd have to update the existing file if we add another k8s but it should work fine (and we have to do that anyway) | 00:53 |
SpamapS | correct | 00:53 |
SpamapS | yeah most systems that I've used have this exec method that grabs credentials. | 00:54 |
SpamapS | that way we can all just have one big happy kube config | 00:54 |
SpamapS | similar to clouds.yaml with secure.yaml having the actual creds | 00:54 |
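What the discussion above describes is a single kubeconfig whose `users:` entries are just labels that the `contexts:` section ties to clusters, with the real identity coming from the credentials themselves. A hedged illustration (all names and endpoints here are made up for the example, not anyone's actual config) of mixing an exec-based authenticator with cert/key data:

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: cluster-a
    cluster:
      server: https://cluster-a.example.com
  - name: cluster-b
    cluster:
      server: https://cluster-b.example.com
users:
  # "admin-a" is only a label; the actual identity is whatever the
  # exec plugin's credentials resolve to (e.g. aws-iam-authenticator).
  - name: admin-a
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws-iam-authenticator
        args: ["token", "-i", "cluster-a"]
  # A second user entry can use a completely different auth method.
  - name: admin-b
    user:
      client-certificate-data: <base64 cert>
      client-key-data: <base64 key>
contexts:
  - name: cluster-a
    context: {cluster: cluster-a, user: admin-a}
  - name: cluster-b
    context: {cluster: cluster-b, user: admin-b}
```

This is the "one big happy kube config" shape: adding a cluster means adding a cluster, a user label, and a context, without renaming anything existing.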
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource https://review.openstack.org/570668 | 01:07 |
SpamapS | this moto thing is perfect actually | 01:19 |
*** bhavikdbavishi has joined #zuul | 02:42 | |
*** rlandy|bbl is now known as rlandy | 03:18 | |
openstackgerrit | Merged openstack-infra/zuul master: Remove uneeded if statement https://review.openstack.org/617984 | 04:02 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver https://review.openstack.org/535558 | 04:29 |
SpamapS | corvus: ^ now with a unit test! | 04:29 |
*** bhavikdbavishi has quit IRC | 05:18 | |
*** bhavikdbavishi has joined #zuul | 05:18 | |
*** bhavikdbavishi has quit IRC | 05:30 | |
*** bhavikdbavishi has joined #zuul | 06:12 | |
*** bjackman has joined #zuul | 06:20 | |
SpamapS | hrm | 06:23 |
SpamapS | upgraded to 3.3.1 and I'm getting no consoles :( | 06:23 |
SpamapS | http://paste.openstack.org/show/736353/ | 06:23 |
SpamapS | gah | 06:28 |
SpamapS | n/m | 06:28 |
SpamapS | another backward incompatible change. :-p | 06:28 |
tristanC | SpamapS: "Header Upgrade is not defined"... what rewrite rules are you using? | 06:28 |
SpamapS | No the problem is that somewhere between 3.2.x and 3.3.0 (which is in the release notes) the requirements changed for static offload | 06:28 |
SpamapS | Honestly.. this is getting pretty frustrating. I don't feel like retooling my deployment every time I do a minor upgrade. :-/ | 06:29 |
SpamapS | I mean, I love the new UI | 06:29 |
SpamapS | so grateful for all the new stuff | 06:29 |
SpamapS | but.. can.. I just change my version pointer.. just once.. and not have to spend 2 hours mopping up? :-P | 06:30 |
* SpamapS will stop venting | 06:30 | |
tristanC | SpamapS: static offload had to change to support the new html5 urls... and since it's outside zuul's control, there isn't much we can do about it | 06:38 |
tristanC | perhaps don't offload, that works out of the box | 06:39 |
tobiash | SpamapS: you do static offload? | 06:39 |
*** bhavikdbavishi has quit IRC | 07:01 | |
*** quiquell|off is now known as quiquell | 07:07 | |
*** chkumar|away has quit IRC | 07:13 | |
*** chandan_kumar has joined #zuul | 07:15 | |
quiquell | Good morning | 07:27 |
quiquell | What is the cccp.yaml ? | 07:27 |
tobiash | quiquell: morning | 07:55 |
tobiash | ? | 07:55 |
*** gtema has joined #zuul | 08:01 | |
quiquell | tobiash: Nah, it's not here https://github.com/openstack-infra/zuul/tree/master/doc/source/admin/examples | 08:02 |
quiquell | tobiash: I think it gets generated after building images | 08:02 |
tobiash | quiquell: what file are you referring to? cccp.yaml doesn't mean anything to me ;) | 08:03 |
quiquell | tobiash: Me neither, will just remove it | 08:04 |
quiquell | tobiash: thanks anyways | 08:04 |
tobiash | k | 08:04 |
*** bjackman has quit IRC | 08:10 | |
*** quiquell is now known as quiquell|brb | 08:13 | |
*** bjackman has joined #zuul | 08:13 | |
*** bhavikdbavishi has joined #zuul | 08:18 | |
bjackman | Just noticed from reading the semaphore thread on zuul-discuss that acquisition/release is tied to job begin/start. In my ultimate use-case I'd like to be able to acquire/release them in the middle of jobs (specifically, I don't really want to lock test HW until I've made sure the code compiles) | 08:20 |
bjackman | Just a thought | 08:20 |
bjackman | (In this case test HW does not mean the node provided by nodepool, it's a separate resource) | 08:21 |
*** hashar has joined #zuul | 08:24 | |
*** jpena|off is now known as jpena | 08:31 | |
*** bjackman has quit IRC | 08:34 | |
*** themroc has joined #zuul | 08:36 | |
*** bjackman has joined #zuul | 08:37 | |
*** quiquell|brb is now known as quiquell | 08:37 | |
*** bhavikdbavishi1 has joined #zuul | 08:37 | |
*** bhavikdbavishi has quit IRC | 08:38 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 08:38 | |
bjackman | Anyway Jim Blair said in an email on zuul-discuss that there's been some discussion about what he calls the "bisecting pipeline manager", does anyone know where that discussion took place / if there's a record of it? | 08:39 |
bjackman | (Mail I'm referring to http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-November/000636.html) | 08:39 |
*** pcaruana has joined #zuul | 08:48 | |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 09:25 |
*** ssbarnea has quit IRC | 09:45 | |
*** ssbarnea|rover has joined #zuul | 09:45 | |
openstackgerrit | Merged openstack-infra/nodepool master: Update node request during locking https://review.openstack.org/618807 | 09:50 |
*** bhavikdbavishi has quit IRC | 09:55 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache of nodes https://review.openstack.org/619025 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache to node requests https://review.openstack.org/619069 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Only setup zNode caches in launcher https://review.openstack.org/619440 | 09:59 |
openstackgerrit | BenoƮt Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs https://review.openstack.org/613261 | 10:23 |
*** bjackman has quit IRC | 10:30 | |
*** AJaeger_ has quit IRC | 10:32 | |
*** electrofelix has joined #zuul | 11:04 | |
*** AJaeger_ has joined #zuul | 11:13 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 11:57 |
*** jpena is now known as jpena|lunch | 11:59 | |
*** panda|pto is now known as panda | 12:41 | |
*** gtema has quit IRC | 12:45 | |
*** sshnaidm|afk is now known as sshnaidm | 12:51 | |
quiquell | tobiash: Do you have an example of a log config for the scheduler? I feel lazy today :-/ | 13:06 |
quiquell | tobiash: I mean log_config the yaml files with filters and all | 13:06 |
tobiash | quiquell: we're still using the ini style, but not at computer atm | 13:07 |
tobiash | quiquell: look in system-config repo, there should be a log config too | 13:08 |
quiquell | tobiash: system-config that was the name I was looking for thanks! | 13:08 |
*** gtema has joined #zuul | 13:10 | |
*** rlandy has joined #zuul | 13:11 | |
*** jpena|lunch is now known as jpena | 13:31 | |
*** j^2 has joined #zuul | 14:04 | |
*** themroc has quit IRC | 14:23 | |
*** themroc has joined #zuul | 14:25 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 14:36 |
Shrews | mordred: s/metadata/attributes/ ^^^ | 14:37 |
mordred | Shrews: brilliant!@ | 14:37 |
*** bhavikdbavishi has joined #zuul | 14:56 | |
*** themroc has quit IRC | 15:16 | |
*** themroc has joined #zuul | 15:16 | |
quiquell | Is there any way to exclude specific jobs from configuration in the tenant yaml config ? | 15:23 |
tobiash | quiquell: only all jobs | 15:32 |
quiquell | tobiash: ack | 15:32 |
quiquell | tobiash: Maybe I can shadow the ones I don't want ? | 15:32 |
*** panda is now known as panda|pto | 15:34 | |
*** hashar has quit IRC | 15:37 | |
tobiash | quiquell: that is possible, shadowing allows you to override jobs | 15:38 |
quiquell | tobiash: ok, thanks | 15:40 |
corvus | quiquell: though you might consider moving the jobs you don't want into a new repo | 15:40 |
corvus | tobiash: why did you add the event.clear in 619589 | 15:41 |
quiquell | corvus: ack | 15:41 |
tobiash | corvus: because otherwise it just updated every second regardless of cache-triggered changes | 15:41 |
tobiash | just noticed that today and then read about threading events. Without the clear any subsequent wait effectively doesn't wait at all because the event is still set | 15:43 |
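The `threading.Event` pitfall tobiash describes is easy to demonstrate with the stdlib: once an event has been set, every subsequent `wait()` returns immediately until `clear()` is called, so a "wait for next signal" loop that never clears effectively busy-loops:

```python
import threading
import time

event = threading.Event()
event.set()  # e.g. a cache callback signalled us

# wait() returns True immediately while the event is still set...
assert event.wait(timeout=5.0) is True

# ...and keeps doing so: without clear(), subsequent waits don't wait at all.
assert event.wait(timeout=5.0) is True

event.clear()
start = time.monotonic()
event.wait(timeout=0.2)  # now it genuinely blocks (until timeout or the next set())
assert time.monotonic() - start >= 0.19
```

Hence the fix in 619589: clear the event after waking so the updater only runs when something actually changed.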
corvus | tobiash: left a couple comments | 15:43 |
tobiash | corvus: good points | 15:45 |
tobiash | corvus: shall we signal on stop and still keep the timeout as a safety net? | 15:48 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 15:50 |
corvus | tobiash: i like dropping the timeout; less wasted work | 15:51 |
tobiash | Ok, timeout dropped | 15:54 |
*** quiquell is now known as quiquell|off | 15:55 | |
*** bjackman has joined #zuul | 15:58 | |
*** gtema has quit IRC | 16:23 | |
*** themroc has quit IRC | 16:23 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 16:37 |
corvus | tobiash, Shrews: ^ that's the nodepool half of that. i'll finish up the zuul change now. | 16:38 |
tobiash | Cool, looking after dinner :) | 16:40 |
pabelanger | Oooh, I'm going to review that too! Totally interested in the code | 16:40 |
*** bjackman has quit IRC | 16:55 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 17:06 |
corvus | tobiash, Shrews, pabelanger: ^ that should be ready now too | 17:06 |
corvus | clarkb: ^ | 17:06 |
pabelanger | looks | 17:08 |
clarkb | corvus great. I'm listening to an openci meeting noe but will review after | 17:09 |
corvus | pabelanger: can you look at https://review.openstack.org/620691 and make sure it will work for zones? | 17:10 |
pabelanger | sure | 17:10 |
corvus | pabelanger: feel free to +W if it works for you | 17:12 |
*** dkehn_ has joined #zuul | 17:15 | |
*** bhavikdbavishi has quit IRC | 17:15 | |
*** dkehn has quit IRC | 17:15 | |
*** dkehn_ is now known as dkehn | 17:15 | |
*** bhavikdbavishi has joined #zuul | 17:16 | |
pabelanger | +A | 17:17 |
pabelanger | Thanks! | 17:17 |
pabelanger | I'll update zuul patch now | 17:17 |
tobiash | corvus: re nodepool relative prios, we have the case that we have many providers that decline because they don't support the label (static providers). In this case there will be much lock contention leading to skipped node requests in the loop. So I think we should make sure that the loop is frequently restarted from the beginning with a fresh sorted list. | 17:44 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment https://review.openstack.org/620969 | 17:44 |
tobiash | corvus: but I think in combination with 610029 that might be good enough | 17:45 |
corvus | tobiash: does relative_priority alter that behavior? -- is that something we need to address in the relative_priority change, or can that be a followup? | 17:46 |
corvus | mordred: based on the feedback from SpamapS earlier, i worked on 620969, but i noticed that we suggest static offloading for white-labeling because white-label without static offload is said to be complex. you were talking about dropping static-offload for openstack -- were you thinking of doing that while retaining white-label? should we document that case? | 17:48 |
tobiash | corvus: no, I just wanted to mention this as our lock contention will circumvent the relative priority without 610029 | 17:48 |
tobiash | corvus: +2 from me | 17:50 |
corvus | tobiash: ah yes. i agree they will be better together and we should merge 029 shortly. can you explain what the 'yield' does there? | 17:50 |
tobiash | corvus: I did that based on Simon's first comment | 17:52 |
*** jpena is now known as jpena|off | 17:53 | |
corvus | tobiash: that makes sense now | 17:54 |
corvus | i'll leave the approval of https://review.openstack.org/610029 to Shrews | 17:55 |
tobiash | corvus: but I think the yield is not good anymore with the relative priority as it just continues the loop while we now actually want to restart the loop | 17:55 |
corvus | tobiash: agreed that would be better, but not a fatal flaw. do you want to do a followup after they both merge, or rebase 029 on relative_priority? | 17:57 |
tobiash | corvus: I think I'll rebase on relative priority | 17:57 |
corvus | tobiash: ok. better wip it then, it has a lot of +2s | 17:58 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Add support for zones in executors https://review.openstack.org/549197 | 18:05 |
pabelanger | tobiash: corvus: clarkb: mordred: ^now updated for node-attributes logic | 18:06 |
*** bhavikdbavishi has quit IRC | 18:13 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:13 |
*** bhavikdbavishi1 has joined #zuul | 18:13 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:15 |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 18:15 | |
tobiash | corvus: that should do it ^ | 18:16 |
tobiash | pabelanger: while you're at it, do you want to address my doc comment of PS18? | 18:18 |
tobiash | pabelanger: it's not provider anymore ;) | 18:18 |
*** bhavikdbavishi has quit IRC | 18:19 | |
pabelanger | +1 | 18:22 |
sshnaidm | where can I see possible zuul statuses? like SUCCESS, FAILURE, TIMED_OUT...? | 18:29 |
corvus | sshnaidm: i don't think the whole list is enumerated anywhere | 18:32 |
pabelanger | tobiash: provider pool? is that better? | 18:38 |
tobiash | pabelanger: maybe just zone as this is what you actually configure? | 18:38 |
pabelanger | sure | 18:38 |
tobiash | pabelanger: '... that have nodes with the specified executor-zone attribute...' | 18:39 |
tobiash | pabelanger: how about this ^ ? | 18:39 |
pabelanger | sure | 18:40 |
tobiash | corvus: I've commented on 615356, what do you think? | 18:44 |
corvus | tobiash: yep | 18:45 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation https://review.openstack.org/620989 | 18:51 |
pabelanger | tobiash: ^ | 18:51 |
tobiash | pabelanger: thanks | 18:52 |
tobiash | +2 on both | 18:52 |
pabelanger | thank you! | 18:56 |
Shrews | tobiash: i don't think i'm clear on the need for the kazoo election recipe (though docs are horrible). seems like only a single thread is involved in the election, so what purpose is that serving? | 18:57 |
tobiash | Shrews: there may be multiple launcher processes involved in the election | 18:57 |
Shrews | tobiash: oh right. external scaling. what happens if the leader is stopped? will another take it's place? | 18:58 |
Shrews | s/it's/its | 18:58 |
tobiash | Shrews: yes, that's how the leader election works | 18:58 |
tobiash | as soon as the leader finishes its function or loses its session a new leader is elected | 00:59 |
Shrews | tobiash: ok, so election.run() basically blocks forever while it is not the leader | 19:00 |
tobiash | yes | 19:00 |
Shrews | tobiash: gotcha. thx | 19:00 |
tobiash | :) | 19:00 |
Shrews | it wasn't clear to me if it returned when it LOST the election, but that makes sense | 19:00 |
tobiash | yeah I like that concept | 19:01 |
*** hashar has joined #zuul | 19:06 | |
Shrews | tobiash: what if the leader gets an exception and the thread stops, but the zk session (which is shared among threads) remains active? | 19:08 |
clarkb | Shrews: if the thread stops the connection should break allowing another process to become leader? | 19:09 |
clarkb | though I guess if its a shared session maybe that doesn't happen | 19:09 |
tobiash | Shrews: if the leader returns from the function that is triggered by the election process a new leader is elected | 19:09 |
clarkb | might want to do connection/session specific to elections | 19:09 |
Shrews | how does the connection break? it's a shared connection | 19:09 |
clarkb | Shrews: well it could be some logical action over the top of the connection too | 19:10 |
clarkb | tcp might still be happy though | 19:10 |
tobiash | Shrews: so throwing an exception would lead to a new leader election as well | 19:10 |
Shrews | tobiash: *nod* | 19:11 |
*** electrofelix has quit IRC | 19:12 | |
tobiash | Shrews: that leader election is much simpler than I thought: https://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/election.html#Election.run | 19:16 |
tobiash | It just gets a lock and runs the leader function while having the lock, as soon as this returns (no matter how) the lock is released and the next one in the election gets the lock | 19:17 |
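The recipe tobiash links boils down to "leadership is a lock". A toy stdlib equivalent of that shape (illustrative only: this is not the kazoo API, and it has none of ZooKeeper's session/ephemeral-node handling that makes the real recipe robust across processes):

```python
import threading


class ToyElection:
    """Toy stand-in for kazoo's Election.run(): whoever holds the lock is
    leader; when the leader function returns (or raises), the lock is
    released and the next blocked contender takes over."""

    def __init__(self):
        self._lock = threading.Lock()

    def run(self, leader_func, *args):
        with self._lock:  # blocks here until this contender wins
            return leader_func(*args)


election = ToyElection()
log = []


def leader(name):
    # The "leader function": runs only while we hold leadership.
    log.append(name)


threads = [threading.Thread(target=election.run, args=(leader, n))
           for n in ("launcher-1", "launcher-2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each contender leads in turn, which matches Shrews's observation that `run()` blocks forever while you are not the leader and only returns once your own leader function finishes.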
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 19:38 |
*** sshnaidm is now known as sshnaidm|afk | 19:56 | |
SpamapS | tobiash: yeah I do static offload because when I set up my zuul cherrypy was totally screwed up and not serving static content. | 20:12 |
SpamapS | (I think actually the problem was the yarn hooks or something.. anyway.. it's fixed now) | 20:12 |
SpamapS | And also because of what corvus mentioned.. it was suggested as the simpler path (which it is). | 20:13 |
SpamapS | I am sure I'll find a multi-tenant case at some point, but thus far, I've always been single tenant. | 20:13 |
openstackgerrit | Merged openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 20:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 21:00 |
openstackgerrit | Merged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 21:00 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides https://review.openstack.org/621008 | 21:07 |
corvus | tobiash, Shrews: ^ just noticed that wee typo. | 21:07 |
Shrews | oops | 21:12 |
pabelanger | +3 | 21:13 |
pabelanger | corvus: tobiash: Shrews: I have a provider in nodepool that seems to be having some quota issues. Right now my tenant, vexxhost-mtl1, is full of nodes in ERROR state, but this does not seem to be reflected in the openstack quota from the provider. I wonder if we could somehow account for the error instances in the unmanagedQuotaUsed() call: | 21:35 |
pabelanger | http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/openstack/provider.py#n176 | 21:35 |
pabelanger | because right now, I have jobs that are wedged waiting for nodes to free up in the provider | 21:36 |
pabelanger | and nodepool won't try another provider it seems | 21:36 |
clarkb | pabelanger: the issue is those nodes actually count against your quota | 21:36 |
clarkb | and you will likely be charged for them so I don't think nodepool should ignore them | 21:36 |
pabelanger | but for some reason, openstack api says there is quota available | 21:37 |
pabelanger | but really, there isn't | 21:37 |
corvus | pabelanger: why won't nodepool try another provider? | 21:37 |
pabelanger | corvus: I don't know, still trying to figure that out | 21:37 |
clarkb | pabelanger: they may not count against cpu/memory/disk etc quota if they never started, but they should count against instance quota. I have definitely seen error instances count against instance quota on clouds | 21:38 |
*** rlandy is now known as rlandy|biab | 21:44 | |
dkehn | clarkb: FYI the lastest update did fix the .ssh/known_host issue | 21:52 |
clarkb | dkehn: great] | 21:53 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 21:57 |
pabelanger | corvus: clarkb: looking at the log, I can see new node requests being created each time vexxhost-mtl fails, then a few pools decline the new node request, because they don't have the right flavor. But of the 4 provider pools that do have the flavor, vexxhost-mtl is always the first to fulfill the node request, and I don't know why that is | 21:57 |
pabelanger | so, the 3 behind vexxhost-mtl, never get a chance to try and fulfill it | 21:58 |
pabelanger | leading to stuck jobs in zuul | 21:59 |
clarkb | pabelanger: the other three should get it after mtl1 fails 3 times? | 21:59 |
pabelanger | clarkb: no, it looks like we delete the node request and create it anew | 22:00 |
clarkb | mtl1 should decline the request after three errors, that gives the other regions opportunity to pick it up | 22:00 |
clarkb | we being nodepool? | 22:00 |
clarkb | that seems like a bug | 22:00 |
pabelanger | 1 sec, let me get a log | 22:00 |
pabelanger | yah, nodepool | 22:00 |
corvus | nodepool can delete min-ready node requests, but otherwise doesn't delete requests. | 22:03 |
dmsimard | clarkb, SpamapS, corvus: Has there been any traction on reproducing jobs outside Zuul ? I wasn't in Berlin so I may have missed out. | 22:11 |
dmsimard | I asked about this briefly before Berlin: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-11-06.log.html#t2018-11-06T17:39:05 | 22:13 |
*** manjeets has quit IRC | 22:15 | |
pabelanger | corvus: clarkb: okay, I am looking at the right node requests now: | 22:16 |
pabelanger | 2018-11-29 20:58:35,195 DEBUG nodepool.PoolWorker.vexxhost-ansible-network-mtl1-v2-highcpu-1: Active requests: ['300-0000331819', '300-0000331925'] | 22:16 |
corvus | dmsimard: in the conversation you linked, i noted who was working on it; maybe ask them? | 22:16 |
pabelanger | corvus: clarkb: it seems vexxhost-mtl is retrying indefinitely because we hit a quota exception | 22:17 |
pabelanger | trying to see why these are not failing over the other providers now | 22:18 |
dmsimard | corvus: not a terrible idea :/ | 22:19 |
*** rlandy|biab is now known as rlandy | 22:21 | |
corvus | pabelanger, Shrews, tobiash, clarkb: https://review.openstack.org/536930 is the change which introduced the behavior which is vexing pabelanger now | 22:22 |
corvus | pabelanger: however, it shouldn't retry indefinitely, it should pause | 22:23 |
corvus | pabelanger: (which means you should end up with one stuck request, and others continue) | 22:23 |
pabelanger | looking | 22:23 |
corvus | pabelanger: wait, you're not getting a quota exception though, are you? the way you described it, it sounded like it was nodepool hitting quota limits and not trying to boot nodes... | 22:24 |
corvus | i'm no longer certain 536930 is relevant | 22:24 |
pabelanger | corvus: no, I am. I was tracing the wrong node request | 22:24 |
corvus | pabelanger: maybe you could paste the logs? | 22:24 |
pabelanger | yup | 22:24 |
pabelanger | 1 min | 22:24 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 22:26 |
pabelanger | corvus: https://pabelanger.fedorapeople.org/launcher.log | 22:27 |
pabelanger | I was tracing 300-0000331925 | 22:27 |
pabelanger | which leads to an aborted node for 0000318660 | 22:28 |
*** manjeets has joined #zuul | 22:28 | |
*** manjeets has quit IRC | 22:29 | |
*** manjeets has joined #zuul | 22:29 | |
corvus | pabelanger: okay, so 536930 does describe the behavior you're seeing, and nodepool is not pausing because nova is incorrectly reporting quota usage (it's not reporting the quota is used, but if you try to launch a node, it fails with a quota error) | 22:33 |
pabelanger | do you know why it is not trying quota on another provider pool? for example vexxhost-sjc1 should have space | 22:34 |
corvus | to be honest, i don't see a way to resolve this other than going back to the original behavior (where unexpected quota errors counted against the three strikes which eventually caused a node-request decline). but in tobiash's environment, an unexpected quota error is common enough for that to be a problem. | 22:34 |
corvus | pabelanger: the only way for another provider to try the request is for this provider to decline it. that was the old behavior. tobiash wanted it changed so that quota errors weren't fatal. | 22:35 |
corvus | (mind you, the underlying problem is openstack lying; fix that and there's no issue. the issue we're grappling with is how to work around that.) | 22:36 |
*** sshnaidm|afk is now known as sshnaidm|off | 22:37 | |
pabelanger | corvus: okay, I understand now | 22:37 |
pabelanger | unless we start tracking quota retries, and decline after x attempts | 22:38 |
corvus | pabelanger: that was the old behavior | 22:38 |
pabelanger | yah | 22:38 |
pabelanger | okay, will wait until tobiash is back online to better understand why that broke him | 22:38 |
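The "old behavior" being discussed amounts to a per-provider failure counter. A hedged sketch of that three-strikes idea (class and method names here are illustrative, not the actual nodepool handler code):

```python
class LaunchTracker:
    """Sketch of the 'three strikes' decline behavior: each failed launch
    attempt counts toward a limit, and once the limit is reached the
    provider declines the request so another provider gets a chance."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = 0

    def record_failure(self):
        self.failures += 1

    def should_decline(self):
        return self.failures >= self.max_attempts


tracker = LaunchTracker()
for _ in range(3):  # e.g. three quota errors in a row from the cloud
    tracker.record_failure()
```

Under 536930's change, quota errors stopped counting toward this limit (the provider pauses instead of declining), which is why a provider whose cloud misreports quota can hold a request indefinitely.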
clarkb | my understanding is that nova at least only periodically updates the quota usage for what you see via the api | 22:39 |
pabelanger | but yes, in this case openstack is at fault | 22:39 |
clarkb | however on every boot it seems to know an up-to-date number? seems like it could give you a valid number on every api request too | 22:39 |
corvus | pabelanger: yeah, let's discuss with tobiash. i think we need to do something -- the possibility for nodepool to wedge is not acceptable. maybe we can come up with something other than reverting to the original algorithm. | 22:41 |
corvus | clarkb: it sounds like this has been going on for some time for pabelanger -- so maybe it's not updating the quota at all? | 22:42 |
clarkb | or it has gotten out of sync somehow (we've also seen that) | 22:42 |
pabelanger | corvus: wfm, for now I can disable the provider, so no rush. | 22:43 |
corvus | pabelanger: do those error nodes show up in openstack server list? | 22:46 |
corvus | it actually looks like we ignore the quota used values from openstack; instead we only use the max values, and then perform our own subtraction. so if nodepool doesn't see those servers (or doesn't have them associated with a flavor), it won't see them as impacting the quota... | 22:47 |
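[Editor's note: an illustrative sketch of the quota accounting corvus describes above. All names here are stand-ins, not nodepool's actual API: only the cloud's max quota values are consulted, and "used" is recomputed from nodepool's own tracked records, so any server nodepool has no record for is invisible to the estimate.]

```python
# Sketch (assumed names, not nodepool's real code): quota available is
# computed as cloud max minus nodepool's *own* tally of tracked nodes.
# A leaked ERROR server consumes real nova quota but is absent from
# tracked_nodes, so the estimate stays optimistic and launches keep
# failing with quota errors.

def estimate_available(cloud_limits, tracked_nodes):
    """cloud_limits: max quota values reported by the cloud.
    tracked_nodes: nodes nodepool knows about (has ZK records for)."""
    used_cores = sum(n["cores"] for n in tracked_nodes)
    used_instances = len(tracked_nodes)
    return {
        "cores": cloud_limits["max_cores"] - used_cores,
        "instances": cloud_limits["max_instances"] - used_instances,
    }

limits = {"max_cores": 8, "max_instances": 4}
tracked = [{"cores": 2}, {"cores": 2}]  # leaked servers never appear here
print(estimate_available(limits, tracked))  # {'cores': 4, 'instances': 2}
```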
pabelanger | corvus: yes, they show up in ERROR | 22:48 |
corvus | pabelanger: are there zk node entries for them? | 22:48 |
pabelanger | corvus: I'd have to ask tristanC or nhicher to look for that, I don't have access myself | 22:49 |
corvus | pabelanger: you... have logs? | 22:49 |
pabelanger | I'll ask for them | 22:49 |
pabelanger | I don't have access to the production servers, just a user | 22:49 |
corvus | pabelanger: can you get the output of 'nodepool list'. the /node-list endpoint of the nodepool web app would have it | 22:52 |
pabelanger | let me see | 22:54 |
mordred | https://softwarefactory-project.io/zuul/t/ansible-network/nodes <-- I'm only seeing 2 nodes in mtl1 | 22:55 |
pabelanger | yah, I don't believe nodepool-web is exposed | 22:56 |
pabelanger | dmsimard: around to run the commands corvus asked for? | 22:57 |
dmsimard | o/ | 22:58 |
dmsimard | let me see | 22:58 |
corvus | pabelanger, mordred, dmsimard: we'll need the nodepool ids of the error nodes (should be in the nova metadata), and then the nodepool logs for those ids | 22:58 |
pabelanger | http://paste.openstack.org/show/736454/ | 22:58 |
pabelanger | is openstack server show of an ERROR node | 22:59 |
corvus | pabelanger, dmsimard: great, so nodepool logs of 0000318690 would be helpful | 22:59 |
pabelanger | let me check if that is in the copy I have, I pruned the posted one | 23:00 |
dmsimard | corvus: logs http://paste.openstack.org/show/736455/ | 23:00 |
corvus | it's starting to look like openstack is providing the right quota info, but nodepool isn't subtracting out those error nodes because it doesn't have zk records for them | 23:00 |
corvus | dmsimard: can you confirm there aren't any other interesting log lines between the last two? | 23:01 |
corvus | (ie, an exception or something) | 23:01 |
pabelanger | 0000318690 shows up in https://pabelanger.fedorapeople.org/launcher.log | 23:01 |
corvus | it's only a few microseconds, so shouldn't be much there | 23:01 |
clarkb | thats curious, nodepool should keep records for error'd nodes until they get deleted | 23:01 |
clarkb | (we even show the instance after delete to make sure it actually deleted iirc) | 23:02 |
pabelanger | http://paste.openstack.org/show/736456/ | 23:02 |
*** hashar has quit IRC | 23:02 | |
pabelanger | that is the full list of ERROR nodes, FYI | 23:02 |
corvus | oh... look closer at the history for 318690 -- it reused the node id after the initial error | 23:03 |
dmsimard | corvus: there's a few traces: http://paste.openstack.org/show/736457/ | 23:04 |
corvus | it didn't think a server was created, so it just launched it again, and overwrote the zk record. but a server really was created in attempt #1. | 23:04 |
dmsimard | It's hard to correlate that last trace with something in specific | 23:04 |
clarkb | corvus: ah so that might be the openstack bug | 23:05 |
clarkb | (incorrect return to create request?) | 23:05 |
dmsimard | "keystoneauth1.exceptions.http.Unauthorized" appears 1384 times in the logs but I'm not sure for which cloud or provider it is. | 23:05 |
corvus | mordred: maybe you can help here -- look at the first exception in http://paste.openstack.org/show/736457/ | 23:06 |
pabelanger | dmsimard: that looks to be limestone, I need to see why | 23:06 |
corvus | mordred, clarkb: nodepool is getting an exception from openstacksdk and thinks there's no server, but there really is a server in nova -- the error message even gives us the id | 23:06 |
corvus | (that uuid matches up with pabelanger's paste in http://paste.openstack.org/show/736456/ ) | 23:07 |
dmsimard | I need to step away for dinner, I'll be back later | 23:07 |
pabelanger | dmsimard: thanks for help | 23:07 |
corvus | mordred: the issue is that http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n125 raises an exception; we only save the external id of the server if it succeeds (see line 142). | 23:14 |
corvus | so basically, every server create that hits an error on the initial call ends up leaked | 23:15 |
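[Editor's note: a minimal sketch of the leak pattern corvus identifies in handler.py (lines ~125 and ~142, heavily simplified; all names are illustrative). The server's external id is recorded only after create returns successfully, so an exception on the initial call loses the id even though nova may have actually built the server.]

```python
# Illustrative stand-in for the SDK's create exception; the real one
# discussed below is openstack.cloud.exc.OpenStackCloudCreateException.
class FakeCreateError(Exception):
    def __init__(self, msg, resource_id=None):
        super().__init__(msg)
        self.resource_id = resource_id  # nova's id for the half-built server

def launch(node, cloud_create):
    server = cloud_create()             # may raise with a real server behind it
    node["external_id"] = server["id"]  # only reached on success -> leak on error
    return node

node = {"external_id": None}

def failing_create():
    raise FakeCreateError("quota exceeded mid-create", resource_id="abc-123")

try:
    launch(node, failing_create)
except FakeCreateError:
    pass
print(node["external_id"])  # None -- the "abc-123" server is now leaked
```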
pabelanger | ah, nice debugging | 23:17 |
pabelanger | corvus: do we still need to look in zookeeper or does https://softwarefactory-project.io/zuul/t/ansible-network/nodes give the proper information? | 23:21 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota https://review.openstack.org/621040 | 23:22 |
corvus | pabelanger: that's all the info we need. you should never need to talk to zookeeper directly. | 23:23 |
corvus | pabelanger, tobiash, clarkb: ^ 621040 may help mitigate this situation | 23:23 |
clarkb | corvus: that fix should work, but would it be better to record the leaked nodes so they can be deleted? | 23:23 |
corvus | clarkb: yes. i do not know how to do that. i was hoping mordred could help. | 23:24 |
pabelanger | corvus: ack, thanks | 23:24 |
corvus | pabelanger: do you know which version of openstacksdk you're using? | 23:25 |
pabelanger | okay, cool. that kinda lines up with my initial thought, updating unmanagedQuotaUsed to account for the leak :) | 23:25 |
pabelanger | corvus: I don't think so, let me check but tristanC will know | 23:25 |
pabelanger | I think that is pulled in via RPM | 23:26 |
corvus | line 6831 in openstackcloud.py is the middle of a comment on master; so it's a bit hard to line up that traceback | 23:26 |
corvus | clarkb: oh nice -- the error string in that exception is generated/formatted by openstacksdk. the server id is an attribute on openstack.cloud.exc.OpenStackCloudCreateException | 23:28 |
corvus | so we should be able to catch that and snag the server id and keep the zk record around | 23:29 |
clarkb | corvus: yup I'm writing that change now | 23:29 |
pabelanger | corvus: educated guess for openstacksdk is 0.19.0, that is just looking at the spec file however. Not sure if that is the version running | 23:29 |
pabelanger | 0.17.2 was the version before that | 23:29 |
corvus | clarkb: oh so was i. how many lines have you written? :) | 23:30 |
clarkb | corvus: uh the try except with self.node.state = zk.FAILED and set the uuid and zk.storeNode? oh and reraise | 23:31 |
clarkb | mine is sort of handwavy though with placeholders for actual values (like I don't know what the actual server id attribute is yet) | 23:31 |
clarkb | if you've got specifics sorted out then you should go for it | 23:31 |
corvus | clarkb: i'll do it; you review :) | 23:33 |
clarkb | ok | 23:33 |
clarkb | I'll continue sorting out mine because then I'll be able to review yours easily :) | 23:33 |
clarkb | http://paste.openstack.org/show/736458/ I think that is functional | 23:35 |
clarkb | (but totally untested) | 23:36 |
clarkb | oh the if e.uuid should be if e.resource_id | 23:36 |
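[Editor's note: a hedged sketch of the fix clarkb and corvus describe above: catch the SDK's create exception, record the server id on the ZK node before failing, so the cleanup worker can later delete the real nova server. The exception class and `resource_id` attribute follow the discussion; `FakeZK`, `storeNode`, and the state constant are simplified stand-ins, not the merged nodepool code.]

```python
# Stand-in for openstack.cloud.exc.OpenStackCloudCreateException.
class OpenStackCloudCreateException(Exception):
    def __init__(self, msg, resource_id=None):
        super().__init__(msg)
        self.resource_id = resource_id

FAILED = "failed"  # stand-in for zk.FAILED

class FakeZK:
    """Minimal stand-in for the ZooKeeper node store."""
    def __init__(self):
        self.stored = []
    def storeNode(self, node):
        self.stored.append(dict(node))

def launch(node, zk, cloud_create):
    try:
        server = cloud_create()
    except OpenStackCloudCreateException as e:
        if e.resource_id:
            node["external_id"] = e.resource_id
            node["state"] = FAILED
            zk.storeNode(node)  # keep a record so cleanup can delete the server
        raise
    node["external_id"] = server["id"]
    return node

zk = FakeZK()
node = {"external_id": None, "state": "building"}

def bad_create():
    raise OpenStackCloudCreateException("quota exceeded", resource_id="abc-123")

try:
    launch(node, zk, bad_create)
except OpenStackCloudCreateException:
    pass
print(zk.stored[0])  # record now carries external_id 'abc-123', state 'failed'
```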
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes https://review.openstack.org/621043 | 23:38 |
clarkb | ah ok yours is a bit cleaner by relying on the callee to store the failed node | 23:39 |
corvus | clarkb: yeah-- and actually i think we want to do that after a closer reading of http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n244 | 23:39 |
clarkb | corvus: one thing I did was move image and hostname up so that we'd have that data in the record we store (which might help a human debug later) | 23:39 |
corvus | clarkb: it actually generates a *new* zk node | 23:39 |
corvus | clarkb: i almost did that too until i realized what line 244+ was doing | 23:40 |
clarkb | oh ya huh | 23:40 |
clarkb | I mean with the uuid you should be able to figure it out | 23:40 |
corvus | (so the leaked instance gets a new node, and the original node gets reused) | 23:40 |
clarkb | so probably not a major concern | 23:40 |
corvus | tbh, i think we may want to revisit that -- the metadata of the leaked node will refer to a different node id. | 23:41 |
corvus | but that's going to be a bigger change | 23:42 |
corvus | er, leaked isn't the right word -- error node i guess | 23:42 |
pabelanger | cool, I've +2 both too. in case mordred wants to review | 23:43 |
pabelanger | corvus: thanks for helping debug this today | 23:43 |
clarkb | corvus: ya I would expect that we'd create a new node for the retry not for tracking the delete | 23:43 |
corvus | pabelanger: thank you for bringing it up. and dmsimard for pitching in :) | 23:43 |
clarkb | in any case I much prefer this fix to the first workaround :) | 23:43 |
corvus | clarkb: oh, i think we can have both | 23:44 |
corvus | the second will make the first one mostly a noop | 23:44 |
corvus | but it's still safe to have in case there are other leaks | 23:44 |
clarkb | ya the first is more belt and suspenders. I just didn't want to only rely on that first one | 23:44 |
corvus | ++ | 23:44 |
pabelanger | wfm | 23:45 |
*** manjeets has quit IRC | 23:46 | |
*** manjeets has joined #zuul | 23:46 |