openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107 | 00:08 |
pabelanger | corvus: clarkb, mordred: Shrews: ^ first 1/2 to collect host_id for openstack providers, this is to better help openstack-infra collect information for the current jobs timing out in a cloud. Interested in your thoughts, feedback. | 00:09 |
clarkb | pabelanger: well it's not a feature in any of the clouds right? | 00:10 |
clarkb | or does it pull from the api instead? | 00:10 |
clarkb | ah ya it is in the api interesting | 00:10 |
pabelanger | yah, looking at openstacksdk, we should get it | 00:11 |
pabelanger | nodepool dsvm test should help confirm | 00:11 |
clarkb | pabelanger: a common ish thing for me when debugging is to take the test node id and grep for that in the launcher debug log | 00:11 |
clarkb | that gets me lines like 2018-12-05 17:08:28,545 DEBUG nodepool.NodeLauncher-0000956882: Waiting for server 0b056afb-88e9-4d0f-8b3c-13f8363d7af2 for node id: 0000956882 and 2018-12-05 17:08:58,596 DEBUG nodepool.NodeLauncher-0000956882: Node 0000956882 is running [region: BHS1, az: nova, ip: 158.69.66.132 ipv4: 158.69.66.132, ipv6: 2607:5300:201:2000::576] | 00:12 |
clarkb | if we added the uuid and the host_id to the second line there, that would be a major win for me I think | 00:12 |
pabelanger | clarkb: kk, current patch doesn't log host_id, but should add it | 00:15 |
pabelanger | will do that in ps2 | 00:15 |
clarkb | pabelanger: if you do expose it on the zuul side too adding in the instance uuid to the zuul side would be helpful too I think | 00:16 |
clarkb | not sure if that is already there | 00:16 |
jhesketh | panda: perhaps long term there can be enough automation to actually run through the playbooks, but for now I was planning on preparing all the playbooks and tasks locally and spitting out the ansible-playbook commands that the user would need to run. The user can then modify the playbooks and set up an itinerary to match their local environment. | 00:17 |
jhesketh | To do it fully automatically we'd have to build in extra flags to point to hosts etc, and/or build in cloud launching functionality. Which is something I'd like to see, but as a part 2 or separate tool even. eg, you give the tool your cloud credentials and it does the rest. But it'd need to know a lot more about the image building | 00:18 |
pabelanger | clarkb: k, I'll look at uuid also | 00:23 |
clarkb | pabelanger: the nice thing about those two messages is I get commonly needed info (uuid, ip addrs, etc) | 00:26 |
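A hypothetical sketch of the kind of log line clarkb is asking for here — the format mirrors the "Node ... is running" message pasted above, but the helper and the node attribute names (external_id, host_id, etc.) are assumptions, not the actual nodepool NodeLauncher code:

```python
import logging

log = logging.getLogger("nodepool.NodeLauncher-0000956882")

def log_node_running(node):
    # Hypothetical helper: log the cloud-side uuid (external_id) and host_id
    # alongside the fields already shown above, so one grep on the node id
    # yields everything needed to correlate with a cloud-side incident.
    log.debug(
        "Node %s is running [region: %s, az: %s, ip: %s ipv4: %s, ipv6: %s, "
        "uuid: %s, host_id: %s]",
        node.id, node.region, node.az, node.interface_ip,
        node.public_ipv4, node.public_ipv6, node.external_id, node.host_id)
```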
*** manjeets_ has joined #zuul | 01:49 | |
*** manjeets has quit IRC | 01:51 | |
*** bhavikdbavishi has joined #zuul | 02:41 | |
*** bhavikdbavishi1 has joined #zuul | 02:44 | |
*** bhavikdbavishi has quit IRC | 02:45 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 02:45 | |
*** rlandy|bbl is now known as rlandy | 03:09 | |
*** rlandy has quit IRC | 03:10 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107 | 03:12 |
*** bjackman has joined #zuul | 04:28 | |
bjackman | Is there a way to get your config-project changes tested pre-merge in a post-review pipeline? I tried but it didn't work; not sure if this is because of a config error on my part or just the way Zuul is | 05:59 |
bjackman | Ah OK, I think the real answer to my question is that where I have shared config that I want to be tested pre-merge, that should go in a shared untrusted project (equivalent to the zuul-jobs one) | 06:35 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size https://review.openstack.org/622010 | 06:43 |
*** goern has quit IRC | 06:58 | |
*** goern has joined #zuul | 07:08 | |
*** bhavikdbavishi has quit IRC | 07:13 | |
*** gtema has joined #zuul | 07:32 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306 | 07:33 |
*** pcaruana has joined #zuul | 07:58 | |
*** pcaruana is now known as muttley | 07:58 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor jobs page to use a reducer https://review.openstack.org/621396 | 08:06 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor job page to use a reducer https://review.openstack.org/623156 | 08:06 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: refactor tenants page to use a reducer https://review.openstack.org/623157 | 08:06 |
*** themroc has joined #zuul | 08:48 | |
*** AJaeger has quit IRC | 08:49 | |
*** AJaeger has joined #zuul | 08:51 | |
*** bhavikdbavishi has joined #zuul | 08:55 | |
*** sshnaidm|afk has quit IRC | 09:45 | |
*** sshnaidm|afk has joined #zuul | 09:46 | |
*** bhavikdbavishi has quit IRC | 09:49 | |
*** electrofelix has joined #zuul | 10:04 | |
*** dkehn has quit IRC | 10:05 | |
*** sshnaidm|afk is now known as sshnaidm | 10:12 | |
*** sshnaidm has quit IRC | 10:33 | |
*** sshnaidm has joined #zuul | 10:34 | |
*** jesusaur has quit IRC | 11:27 | |
*** jesusaur has joined #zuul | 11:31 | |
*** bhavikdbavishi has joined #zuul | 11:48 | |
*** sshnaidm is now known as sshnaidm|bbl | 12:08 | |
*** dkehn has joined #zuul | 12:39 | |
*** bjackman has quit IRC | 12:42 | |
*** gtema has quit IRC | 12:46 | |
*** bjackman has joined #zuul | 12:47 | |
*** rlandy has joined #zuul | 12:58 | |
*** muttley has quit IRC | 13:08 | |
*** bjackman has quit IRC | 13:09 | |
*** muttley has joined #zuul | 13:21 | |
*** muttley has quit IRC | 13:25 | |
*** muttley has joined #zuul | 13:26 | |
*** muttley has quit IRC | 13:29 | |
*** pcaruana has joined #zuul | 13:34 | |
*** pcaruana has quit IRC | 13:39 | |
*** rfolco has quit IRC | 13:41 | |
*** rfolco has joined #zuul | 13:41 | |
*** gtema has joined #zuul | 13:42 | |
*** pcaruana has joined #zuul | 13:43 | |
*** pcaruana has quit IRC | 13:47 | |
*** bhavikdbavishi has quit IRC | 13:53 | |
*** gtema has quit IRC | 13:53 | |
*** smyers_ has joined #zuul | 13:57 | |
*** smyers has quit IRC | 13:57 | |
*** smyers_ is now known as smyers | 13:57 | |
Shrews | corvus: tobiash: fwiw, i don't think https://review.openstack.org/622403 made much impact. I'm still seeing lots of empty nodes being left around (but thankfully cleaned up now) | 14:06 |
tobiash | Shrews: ok, so maybe we should consider switching to sibling locks | 14:07 |
tobiash | but that would be a harder transition and might require a complete synchronized zuul + nodepool upgrade and shutdown | 14:08 |
Shrews | yes, a bit more involved to do that | 14:08 |
Shrews | but at least not urgent now | 14:08 |
Shrews | at least we've learned something new about using zookeeper! :) | 14:10 |
Shrews | child locks + znode deletion == bad news | 14:10 |
tobiash | yepp :) | 14:13 |
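A minimal kazoo sketch of the hazard Shrews names above (the znode paths are hypothetical and this is not the actual nodepool code): when the lock lives as a child of the node's znode, a recursive delete of that znode destroys the lock out from under its holder.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

node_path = "/nodepool/nodes/0000956882"           # hypothetical node znode
lock = zk.Lock(node_path + "/lock", "launcher-a")  # lock is a *child* of the node

with lock:
    # Another worker that decides the node is gone can do this at any time:
    #   zk.delete(node_path, recursive=True)
    # The recursive delete removes the lock contender znodes too, so the
    # "held" lock silently evaporates and a second worker can acquire it
    # while we still think we own the node.
    pass
```

Sibling locks (a lock znode next to, rather than under, the node znode) avoid this, which is the transition tobiash mentions next.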
*** gtema has joined #zuul | 14:26 | |
*** smyers has quit IRC | 14:32 | |
*** smyers has joined #zuul | 14:32 | |
*** gtema has quit IRC | 14:44 | |
*** sshnaidm|bbl is now known as sshnaidm | 14:51 | |
mordred | tobiash: that reducers stack is really nice | 14:55 |
mordred | gah | 14:55 |
mordred | tristanC: ^^ | 14:55 |
mordred | t <tab> is a fail :) | 14:55 |
tobiash | :) | 14:55 |
mordred | tobiash: I approved the stack except for the last 2 | 14:56 |
tobiash | mordred: k, I'll check that out latest | 14:57 |
tobiash | later | 14:57 |
mordred | ++ | 14:57 |
tobiash | mordred: lgtm but I'm not feeling competent enough to +a it. | 15:02 |
*** njohnston_ is now known as njohnston | 15:10 | |
mordred | tobiash: yeah. these javascripts are just about at the edge of my brain abilities | 15:17 |
ssbarnea|rover | can we do something to avoid zuul spam with "Waiting on logger"? as in http://logs.openstack.org/30/621930/2/gate/tripleo-ci-centos-7-standalone/4fb356a/job-output.txt.gz | 15:18 |
mordred | ssbarnea|rover: we should probably instead figure out what broke the log streamer - do y'all reboot any of the VMs? | 15:27 |
mordred | or, alternately, if you're doing iptables on the vms those could be blocking access to the log streamer daemon | 15:28 |
rlandy | hello - I am testing out zuul static driver for use with some ready provisioned vms. I followed the nodepool.yaml configuration per https://zuul-ci.org/docs/zuul/admin/nodepool_static.html. The playbook setting up the multinode bridge fails - I think due to the fact that the private_ipv4 is set to null. The public_ipv4 value is populated with the 'name' ip. How can I get the static driver to set a private_ipv4? | 15:28 |
tobiash | rlandy: the static driver only knows one ip address so you need to set the private_ipv4 in your job if you're depending on it (or maybe do a fallback when setting up the multinode bridge) | 15:30 |
rlandy | tobiash: ok - set_fact on hostvars[groups['switch'][0]]['nodepool']['private_ipv4']? | 15:32 |
rlandy | setting a fallback would mean editing this role: https://github.com/openstack-infra/zuul-jobs/blob/master/roles/multi-node-bridge/tasks/peer.yaml#L16 | 15:33 |
tobiash | rlandy: you need the hostvars[]... cruft only for setting facts for a different machine | 15:33 |
rlandy | and I am not sure other users will be open to my editing that for the static driver case | 15:33 |
rlandy | tobiash: yep - ack - thanks for your help | 15:33 |
*** jhesketh has quit IRC | 15:34 | |
tobiash | rlandy: yes, I'm not familiar with this role so someone else (mordred, AJaeger ?) might be of help with the multi-node-bridge role | 15:34 |
*** jhesketh has joined #zuul | 15:35 | |
mordred | rlandy: that role should work on nodes that don't have a private ip though - we have clouds that give us vms with no private ip | 15:37 |
mordred | we should check with clarkb when he wakes up | 15:37 |
rlandy | mordred: looking at the inventory, I saw the private_ipv4 set to null and I assumed that was the cause of the error. I could be wrong. I am testing it out again with private_ipv4 set | 15:39 |
mordred | kk | 15:39 |
mordred | it also might not be terrible to allow setting a private_ipv4 in the static driver since it's a value we provide for the dynamic nodes too | 15:40 |
rlandy | or default to the public_ipv4 if private_ipv4 is null | 15:50 |
ssbarnea|rover | mordred: i didn't do anything myself and I see the "[primary] Waiting on logger" error on multiple jobs during the last 7 days. i don't know how to do a group-by in logstash to identify a pattern. | 15:55 |
Shrews | i wonder why that role is using private_ipv4 and not interface_ip | 16:00 |
Shrews | that's available in the inventory | 16:01 |
Shrews | mordred: do you know? ^^ | 16:02 |
mordred | Shrews: the role uses private_ipv4 if it's there to establish the network bridge between nodes - that lets us always have a consistent network jobs can use regardless of provider differences | 16:04 |
Shrews | that makes sense | 16:04 |
*** nilashishc has joined #zuul | 16:05 | |
clarkb | private ip is set to public ip if there is no private ip | 16:18 |
clarkb | the reason to use private over public when you have both is that vxlan/gre were not reliable over nat | 16:18 |
clarkb | rather than try and debug that I decided it was easier to just avoid the issue entirely | 16:19 |
clarkb | we can either make that ip behavior consistent in nodepool drivers or update the role to make that assumption instead | 16:19 |
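A hedged sketch of the driver-side option clarkb mentions (make the behavior consistent by falling back to the public address when a provider gives no private one). The attribute names mirror the node fields discussed in this log, but this is not the actual static-driver code:

```python
def set_node_addresses(node, public_ipv4, private_ipv4=None):
    # Assumption: mimic what the OpenStack driver effectively does -- if the
    # provider has no private address, reuse the public one so roles like
    # multi-node-bridge always find private_ipv4 populated.
    node.public_ipv4 = public_ipv4
    node.private_ipv4 = private_ipv4 or public_ipv4
    node.interface_ip = public_ipv4
```

The alternative discussed is the same fallback expressed inside the zuul-jobs role itself, which is what rlandy's WIP change (623294) pursues.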
*** themroc has quit IRC | 16:26 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245 | 16:30 |
pabelanger | mordred: Shrews: clarkb: in https://review.openstack.org/623107/ I'm trying to collect the host_id from wait_for_server in nodepool, but it seems to be empty: http://logs.openstack.org/07/623107/2/check/nodepool-functional-py35-src/46b9fb9/controller/logs/screen-nodepool-launcher.txt.gz#_Dec_06_04_00_07_442925 but 2 lines up I can in fact see hostId. Any ideas why that would be? | 16:41 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin https://review.openstack.org/623256 | 16:47 |
mordred | pabelanger: looking | 16:49 |
mordred | pabelanger: no - that makes no sense | 16:51 |
mordred | pabelanger: I'm landing so can't dig too deep for a few minutes - but I'm gonna put money on a bug :( | 16:52 |
pabelanger | mordred: okay, that is what I figured also. I can start to dig into it more locally today too | 16:52 |
mordred | pabelanger: cool. I'm guessing something in the conn.compute.servers() -> to_dict() -> normalize_server() sequence | 16:54 |
mordred | pabelanger: which is new and is the first step in making the shade layer consume the underlying sdk objects | 16:54 |
mordred | although looking at it it seems like all the things are in place properly to make sure you'd end up with a host_id | 16:55 |
clarkb | ssbarnea|rover: mordred: those test nodes are very memory constrained; I wonder if OOMKiller is targeting that process if it gets invoked | 17:01 |
clarkb | ssbarnea|rover: do those jobs capture syslog? we should be able to check for OOMKiller there | 17:01 |
rlandy | clarkb: wrt private_ipv4 for drivers that only define a public_ipv4, I am happy to put in a review to default the private_ipv4 value in the role but if making the behavior consistent is possible, I think that would be better | 17:12 |
pabelanger | mordred: ack, thanks for the pointers | 17:12 |
clarkb | rlandy: we may want to do both things now that I think about it more. Consistent driver behavior from nodepool is desirable as much as possible, but the roles should manage when they aren't consistent (maybe someone is running older nodepool) | 17:13 |
Shrews | tobiash: i think i found the race in test_handler_poll_session_expired. running for a bit locally before i push up the fix | 17:14 |
tobiash | Yay :) | 17:14 |
rlandy | clarkb: understood. I'll put in the role change for my own testing at least. Currently I am hacking up the job definition which is not a good way to go | 17:15 |
corvus | Shrews: if the deleted state didn't help, should we revert that patch? (but also, any idea why it didn't work?) | 17:27 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269 | 17:39 |
Shrews | corvus: i have no idea why it didn't work. as for removing it, the only upside to doing so is an easier upgrade path for operators (the code itself doesn't hurt anything afaict) | 17:40 |
corvus | tobiash: heads up on https://review.openstack.org/620285 | 17:41 |
Shrews | corvus: downgrading now might be tricky (we'd have to make sure there are no DELETED node states before we restarted) | 17:41 |
Shrews | an unlikely scenario, but we'd still have to check | 17:41 |
corvus | Shrews: hrm, maybe we keep DELETED in there for a little while? | 17:41 |
corvus | maybe until after the next release... | 17:41 |
corvus | sorry let me clarify | 17:41 |
corvus | maybe we should keep DELETED as an acceptable state, but remove the code which sets it | 17:42 |
corvus | then after the next release, also remove the state | 17:42 |
Shrews | corvus: it's possible that change is at least a little helpful, too. but hard to quantify | 17:42 |
corvus | i think one of two plans makes sense: 1) debug the DELETED patch and figure out how to make it work; or 2) agree that we should switch to sibling locks, remove the deleted state, and rely on the cleanup worker until we make the switch. | 17:43 |
Shrews | i think 2 is the real solution, but much harder to get there | 17:44 |
SpamapS | looks like /build/xxx doesn't know how to 404. | 17:45 |
corvus | yeah, we'll need coordination between zuul and nodepool for that | 17:45 |
SpamapS | it just... waits | 17:45 |
corvus | SpamapS: :( | 17:45 |
SpamapS | Ya.. throw it on the bug pile? ;-) | 17:45 |
corvus | SpamapS: there's sevear lines of code about returning a 404 in there | 17:45 |
corvus | wow | 17:45 |
corvus | several | 17:45 |
corvus | SpamapS: it does take a while, but http://zuul.openstack.org/api/build/foo returns 404 | 17:46 |
corvus | a while=3.7s | 17:46 |
clarkb | ssbarnea|rover: mordred: Looking at logs for cases with Waiting on logger. This happens when the run playbook seems to die with "[Zuul] Log Stream did not terminate". Then the post run playbook has the Waiting on logger errors, presumably because port 19885 is still held by the existing log stream daemon? | 17:47 |
tobiash | corvus: thanks, looking | 17:47 |
clarkb | ssbarnea|rover: mordred: It also seems that when this happens we have incomplete log stream for the run playbook, but ara shows that things kept going behind the scenes | 17:48 |
clarkb | http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/job-output.txt.gz#_2018-12-06_17_21_48_790889 is an example. This particular job failed trying to run delorean | 17:48 |
SpamapS | hm I may not have waited 3.7s | 17:48 |
clarkb | I don't find any OOMs so it is possibly a bug in zuul (with the cleanup of the streamer in run failing hence talking about it here and not in -infra) | 17:48 |
corvus | SpamapS: i don't see an index on uuid; that's probably why the response is slow | 17:48 |
corvus | so for us, it's 3.7 seconds for a table scan i guess | 17:49 |
SpamapS | def worth an index, but the UI still doesn't show the 404 | 17:49 |
SpamapS | Loading... forever | 17:49 |
corvus | (yeah, the table scan part of it takes 2.39s for us) | 17:50 |
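For the slow /api/build/<uuid> lookup corvus diagnoses here, the remedy would be an index on the uuid column. A hypothetical Alembic sketch — the table name, index name, and revision identifiers are assumptions, not zuul's actual migration:

```python
"""Hypothetical Alembic migration: index the build uuid column."""
from alembic import op

# revision identifiers (made up for this sketch)
revision = 'add_build_uuid_index'
down_revision = None

def upgrade():
    # Lets /api/build/<uuid> use an index lookup instead of the full-table
    # scan corvus measured at roughly 2.4 seconds.
    op.create_index('ix_build_uuid', 'zuul_build', ['uuid'])

def downgrade():
    op.drop_index('ix_build_uuid', table_name='zuul_build')
```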
SpamapS | In fact /api/build/foo show Loading... too.. hmm | 17:50 |
SpamapS | Why is that an HTML page and not a json response? | 17:50 |
corvus | SpamapS: oh, maybe you need to restart zuul-web? | 17:50 |
tobiash | Shrews, corvus: while reading that stats rework, I've a side question. How can zuul cancel a node request that is currently locked by a provider trying to fulfill it? | 17:51 |
SpamapS | I haven't upgraded recently. | 17:51 |
corvus | SpamapS: i've seen zuul-web get grumpy after a scheduler restart | 17:51 |
SpamapS | oh, it seems to switch on Accept | 17:52 |
SpamapS | Because my browser is sending Accept html, it's sending me HTML | 17:52 |
Shrews | tobiash: not currently possible | 17:53 |
corvus | SpamapS: erm, we don't have anything like that. you can hit the api in your browser | 17:53 |
corvus | tobiash, Shrews: well... um, it looks like it actually just deletes the request out from under nodepool. | 17:53 |
SpamapS | hm, no it's not Accept. | 17:53 |
SpamapS | I cannot hit the api in *my* browser | 17:53 |
Shrews | corvus: well, i mean, if we want to talk about out of bounds methods... | 17:53 |
SpamapS | -> zuul.gdmny.co | 17:53 |
corvus | tobiash, Shrews: without any consideration of the lock. | 17:53 |
SpamapS | (it's still not auth walled) | 17:54 |
tobiash | ah ok, that just works ;) | 17:54 |
SpamapS | I probably messed something up in the translation from mod_rewrite to Nginx. | 17:54 |
Shrews | corvus: tobiash: is this something we are actively seeing then? i thought it was a hypothetical question | 17:54 |
SpamapS | Curl'ing my api works | 17:54 |
SpamapS | but browsering it just shows Loading... | 17:54 |
corvus | SpamapS: if i shift-reload i get an api response. | 17:54 |
tobiash | Shrews: it was a hypothetical question | 17:54 |
SpamapS | corvus: *weird* | 17:54 |
corvus | Shrews: i'm sure this must happen in openstack-infra | 17:55 |
SpamapS | And of course the javascript fetches are getting json | 17:55 |
corvus | SpamapS: there may be something weird about the javascript service worker | 17:56 |
tobiash | Shrews, corvus: I'm also thinking how we would design this for the scheduler-executor interface | 17:56 |
SpamapS | I'm guessing there's a header combination that gets you HTML. | 17:56 |
tobiash | maybe with two locks, a modify-znode-lock and a processing-lock | 17:56 |
* SpamapS is out of time to investigate though | 17:56 | |
corvus | tobiash: okay you're getting way ahead of me here. what's this have to do with the scheduler-executor interface? | 17:57 |
tobiash | or to rephrase, locks for modifying the object, and a further ephemeral node that is held during processing | 17:57 |
tobiash | corvus: I'm thinking about the scale out scheduler | 17:58 |
corvus | tobiash: i know | 17:58 |
corvus | tobiash: oh, you're thinking we need a distinct lock for "i'm running the job" and a separate lock for modifying the job information | 17:58 |
tobiash | so I thought, if the executor holds the lock during processing, how can we cancel a build? | 17:58 |
tobiash | exactly | 17:58 |
tobiash | actually that's the same with the node-requests that are now just deleted | 17:59 |
corvus | tobiash: i agree, the situations are similar. | 17:59 |
corvus | we could probably do either thing: 2 locks, or, accept that "delete node out from under the lock" is a valid API :) | 18:00 |
corvus | i'm not sure if the current situation with requests is on-purpose or accidental. i'm not sure what nodepool will do at this point if the request disappears from under it. especially with the cache changes. | 18:01 |
tobiash | corvus: yes, for node-requests, but for jobs on the executor we might want not to delete it but leave it within the pipeline (if we follow your suggestion that the executors take their jobs directly from the pipeline data in zk) | 18:02 |
tobiash | corvus: with the cache changes the object is removed from the list, but if some other code path currently processes it it probably just gets errors when locking or saving the node | 18:03 |
Shrews | corvus: tobiash: well, nodepool explicitly looks for node requests to disappear during handling as an assumed error condition. we could just pretend that's a proper stop-now-please api | 18:03 |
corvus | tobiash: if we wanted to, i think we could get rid of canceled build records faster than we do now. if we wanted to, we could just delete the build record and have the executor detect that and abort. i'm not saying let's do it that way, but i do think it's an option. | 18:03 |
clarkb | ssbarnea|rover: mordred http://logs.openstack.org/25/620625/2/gate/tripleo-ci-centos-7-standalone/70949b6/logs/undercloud/var/log/journal.txt.gz#_Dec_06_16_09_23 I think maybe that is the issue. Running out of disk space? The log streaming reads off of disk and I could see where maybe the reads and the writes get sad if we run out of disk? | 18:04 |
corvus | Shrews: retro-engineering! | 18:04 |
corvus | Shrews: the fact that we're only now thinking about it probably means it's working okay :) | 18:04 |
Shrews | http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/__init__.py#n680 | 18:04 |
corvus | Shrews: that almost looks purposeful | 18:04 |
Shrews | right? | 18:05 |
corvus | Shrews: i'm feeling generous; i'm going to assume we engineered it that way and forgot :) | 18:05 |
Shrews | totes | 18:05 |
Shrews | corvus: that's how i (now) remember it | 18:05 |
corvus | ++ | 18:05 |
clarkb | ssbarnea|rover: mordred what is odd there is the job seems to start in that state, so this may be unrelated; however that could also possibly explain why delorean failed | 18:05 |
tobiash | corvus: ok, I think the delete will work in the executor case too | 18:07 |
corvus | tobiash: i think as part of this, we probably want to have the executor hold the lock on nodes in the future. so scheduler deletes build record; executor detects that and aborts job and releases node locks. | 18:08 |
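A small kazoo sketch of the "delete the record and let the holder notice" pattern corvus describes here (the paths, callback, and abort helper are hypothetical, not zuul code):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

build_path = "/zuul/builds/abcd1234"  # hypothetical build record znode

def abort_running_build():
    # Placeholder for the executor's "abort the job, release node locks" step.
    pass

def on_build_event(event):
    # One-shot watch callback: fires when the scheduler deletes (or changes)
    # the build record; deletion is the cancel signal discussed above.
    if event.type == "DELETED":
        abort_running_build()

# Register the watch; the executor does not need a lock on the record
# itself to notice that it went away.
zk.exists(build_path, watch=on_build_event)
```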
Shrews | corvus: tobiash: actually, we may want to move that nodepool check up a bit in the code. it will only reach that point if it's done launching all requested nodes | 18:09 |
corvus | Shrews: ++ should save some time | 18:09 |
Shrews | the 'if not launchesComplete(): return' is above that | 18:09 |
corvus | Shrews: though... that may be complex | 18:10 |
Shrews | yeah | 18:10 |
Shrews | just thinking of the consequences... | 18:10 |
corvus | Shrews: it's okay if we complete the request and then end up with some extra ready nodes. but if we want to abort mid-launch it'll get messy. | 18:10 |
tobiash | corvus: probably makes sense, I'll think about it | 18:10 |
SpamapS | corvus: btw, it's possible that my API being behind CloudFlare could cause weirdness. | 18:11 |
corvus | tobiash: i think the main driver there is -- it's really the executor using the nodes. a distributed scheduler may get restarted at any time and should have no consequence to running jobs. only if the executor running the job is restarted should the nodes be returned. | 18:11 |
SpamapS | I already can't use encrypt_secret.py through it. | 18:11 |
SpamapS | (CloudFlare blocks unknown user agents and you have to pay to whitelist things, something we'll do.. but.. not today ;) | 18:12 |
corvus | SpamapS: you may want to ask tobiash about building the web dashboard without support for service workers and see if that fixes any weirdness. | 18:12 |
tobiash | corvus: totally correct, I'm just thinking about who should request the nodes. I think this should still be the pipeline processor | 18:12 |
corvus | tobiash: yes. the hand-off to an executor will be a neat trick. :) | 18:12 |
tobiash | and the executor holding the lock on the nodes is absolutely the right thing | 18:13 |
tobiash | yeah, so the scheduler requests it, but the executor that got the job accepts it and locks the nodes | 18:14 |
clarkb | mordred: ssbarnea|rover ok where the run streaming stops in that job there is a nested ansible run which ara reports was interrupted; data will be inconsistent | 18:16 |
clarkb | and from that point forward we stop getting streaming. So something is happening there that affects more than just zuul | 18:16 |
openstackgerrit | Merged openstack-infra/zuul master: web: break the reducers module into logical units https://review.openstack.org/621385 | 18:20 |
*** electrofelix has quit IRC | 18:26 | |
*** cristoph_ has joined #zuul | 18:30 | |
SpamapS | So... I'm about to submit a slack notifier role for Zuul... wondering if we should stand up a slack (they're free) just for running test jobs. | 18:31 |
SpamapS | Also.. Ansible 2.8 has added a threading mechanism to the slack module that would be super useful for threading based on buildset.... wondering how we're doing on catching up to Ansible any time soon. | 18:32 |
Shrews | it feels like we just caught up to ansible, like, last week | 18:34 |
Shrews | we need to slow their momentum :) | 18:34 |
* Shrews plots an inside attack | 18:34 | |
tobiash | corvus: I'm +2 on 620285 | 18:38 |
tobiash | corvus: do we need to announce such a change on the mailing list? | 18:39 |
SpamapS | <2.6 .. wasn't 2.6 like.. over a year ago? | 18:40 |
tobiash | SpamapS: nope, we switched to 2.5 just one week before 2.6 has been released. And that was this year ;) | 18:41 |
tobiash | that has been merged in june... (https://review.openstack.org/562668) | 18:43 |
corvus | we need to find a volunteer for the support multiple ansible versions work | 18:43 |
clarkb | SpamapS: openstack infra's testing of ansible 2.8 shows that handlers are all going to break unless things are changed in 2.8 before release | 18:44 |
tobiash | I think I could at least support | 18:44 |
clarkb | so ya ++ to multi ansible instead | 18:44 |
SpamapS | Oh right ok 2.6 was July | 18:52 |
SpamapS | Seems like multi-ansible is a virtualenv+syntax challenge, yeah? | 18:53 |
tobiash | SpamapS: plus possibility to pre-install | 18:54 |
tobiash | (my zuul doesn't really have access to the internet) | 18:55 |
clarkb | if we used the venv module we should be able to document steps or supply a script to preinstall venv virtualenvs for zuul just as it would on demand | 18:55 |
SpamapS | There are some interesting things to think through, like: do we have 1 executor : many ansibles, or just 1:1 executor:ansible and make the ansible version a thing executors subscribe to (like "hey I can do 2.6") | 18:55 |
tobiash | SpamapS: also we need to inject different versions of the command module into different versions of ansible | 18:56 |
clarkb | tobiash: that is the biggest challenge I think | 18:56 |
SpamapS | Is there no way we can write one that works for all supported versions? | 18:56 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269 | 18:56 |
SpamapS | and just always inject that in to the module path.. | 18:56 |
tobiash | SpamapS: it could be that the latest one works with all of them by accident, but we don't know | 18:57 |
SpamapS | Anyway, yeah, would be great to have multi-version support, especially with how fast Ansible seems to be moving/breaking. | 18:57 |
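A rough sketch of the pre-install idea clarkb and tobiash raise above: build one virtualenv per supported Ansible release ahead of time, so deployments without internet access don't need to install on demand. The install root and version pins are assumptions for illustration only:

```python
import subprocess
import sys
from pathlib import Path

ANSIBLE_VERSIONS = ["2.5.15", "2.6.11", "2.7.5"]  # assumed version pins
BASE = Path("/var/lib/zuul/ansible")               # assumed install root

for version in ANSIBLE_VERSIONS:
    env = BASE / version
    # Create an isolated venv per Ansible release...
    subprocess.run([sys.executable, "-m", "venv", str(env)], check=True)
    # ...and pre-install the pinned Ansible into it (pip could be pointed at
    # a local mirror for air-gapped installs).
    subprocess.run([str(env / "bin" / "pip"), "install",
                    f"ansible=={version}"], check=True)
```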
*** nilashishc has quit IRC | 19:04 | |
openstackgerrit | Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294 | 19:28 |
Shrews | wow, i don't think this nodepool test ever worked properly :( | 19:29 |
Shrews | anyone know of a way to have a mock.side_effect both execute code AND raise an exception? seems it's either one or the other | 19:34 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor info and tenant reducers action https://review.openstack.org/621386 | 19:35 |
clarkb | Shrews: have it call a fake? | 19:36 |
clarkb | then have that raise itself | 19:36 |
Shrews | clarkb: that doesn't work | 19:36 |
Shrews | it can either call the fake, or raise an Exc, but not both it would seem | 19:37 |
clarkb | Shrews: the fake does the raise | 19:37 |
Shrews | clarkb: that didn't work in my test | 19:38 |
Shrews | the raise is ignored | 19:38 |
Shrews | oh, there is something else wrong here. maybe that will work if i fix that | 19:41 |
clarkb | Shrews: http://paste.openstack.org/show/736784/ it works here | 19:41 |
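The pattern clarkb is pointing at (his paste is not reproduced here): give side_effect a fake callable that does the work and then raises itself. A minimal self-contained example:

```python
from unittest import mock

class SessionExpired(Exception):
    pass

real_calls = []

def fake_poll(*args, **kwargs):
    # Execute some real/fake logic first...
    real_calls.append((args, kwargs))
    # ...then raise, so the caller observes both the side effect and the error.
    raise SessionExpired("simulated ZooKeeper session loss")

handler = mock.Mock(side_effect=fake_poll)

try:
    handler("request-1")
except SessionExpired:
    pass

assert real_calls == [(("request-1",), {})]
```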
*** sshnaidm is now known as sshnaidm|afk | 19:43 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269 | 19:50 |
pabelanger | mordred: it seems we might already be passing in normalized data for server at: http://git.openstack.org/cgit/openstack/openstacksdk/tree/openstack/cloud/openstackcloud.py#n2144 because I can see host_id and has_config_drive data before we attempt to normalize again, which results in loss of data | 20:30 |
pabelanger | mordred: I am not familiar enough with code to figure out how to properly fix | 20:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Add spec for scale out scheduler https://review.openstack.org/621479 | 20:35 |
mordred | pabelanger: hrm. we shouldn't be double-normalizing :( | 20:35 |
mordred | pabelanger: OH | 20:36 |
pabelanger | Oh, HAHA | 20:37 |
pabelanger | http://logs.openstack.org/07/623107/2/check/nodepool-functional-py35/e49051c/controller/logs/screen-nodepool-launcher.txt.gz#_Dec_06_03_59_24_677871 | 20:37 |
pabelanger | this actually works with 0.20.0 | 20:37 |
pabelanger | but is a bug in master | 20:37 |
mordred | remote: https://review.openstack.org/623308 Deal with double-normalization of host_id | 20:38 |
mordred | pabelanger: ^^ | 20:38 |
pabelanger | mordred: https://review.openstack.org/621585/ I think that is what broke it | 20:38 |
mordred | pabelanger: I believe it's because what we're now starting from is an openstack.compute.v2.server.Server Resource object that we then run to_dict() on. the Resource object already coerces hostId into host_id - and the normalize function was only doing host_id = server.pop('hostId', None) - but there isn't a hostId in the incoming - only a host_id | 20:41 |
pabelanger | mordred: yes, exactly | 20:42 |
pabelanger | possible there are others, but haven't checked | 20:42 |
mordred | pabelanger: so I Think that patch above will fix this specific thign - the next step is actually to make that normalize function go away completely | 20:42 |
mordred | pabelanger: but I figure that's going to need slightly more care than a quick fix | 20:42 |
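From mordred's description above, the normalize step only popped the camelCase key, so the value the Resource layer had already coerced to snake_case was dropped. A hedged sketch of the shape of the fix, not necessarily the exact code in 623308:

```python
def _normalize_host_id(server):
    # The SDK Resource layer may already have renamed hostId -> host_id, so
    # accept either spelling instead of assuming the raw API form.
    host_id = server.pop('hostId', None)
    if host_id is None:
        host_id = server.pop('host_id', None)
    server['host_id'] = host_id
    return server
```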
openstackgerrit | Merged openstack-infra/zuul master: web: add error reducer and info toast notification https://review.openstack.org/621387 | 20:42 |
pabelanger | mordred: patch is missing trailing ) but worked | 20:44 |
pabelanger | mordred: clarkb: corvus: Shrews: okay, so https://review.openstack.org/623107/ is in fact working, if you'd like to review and confirm format of patch is something we want to actually do | 20:48 |
mordred | pabelanger: yay! I have updated the patch to add the appropriate number of )s | 20:51 |
pabelanger | mordred: +2 | 20:51 |
mordred | \o/ | 20:52 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245 | 20:59 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Add appending yaml log plugin https://review.openstack.org/623256 | 20:59 |
fungi | looks like that busy cycle for the executors lasted ~17 minutes | 21:05 |
fungi | er, wrong channel (sort of) | 21:05 |
SpamapS | mordred: there are ways to make json append-only too you know. | 21:10 |
fungi | i assumed the point of 623256 was more expressing a preference for yaml instead of json | 21:13 |
fungi | i do find it sort of disjoint that ansible takes yaml input and returns json output | 21:14 |
clarkb | fungi: well, yaml can be written to without first parsing the file | 21:16 |
clarkb | so its memory overhead is better | 21:16 |
fungi | ahh, yeah, i didn't consider that angle | 21:16 |
corvus | SpamapS: can you elaborate on your json thoughts? | 21:19 |
SpamapS | corvus: so there are some parsers that can handle this string as a "json stream": '{"field":1}\n{"field":2}\n' | 21:31 |
SpamapS | Which allows you to have append-only json | 21:31 |
SpamapS | But not all parsers do it | 21:31 |
corvus | SpamapS: i think that's the crux -- that we want the output to be valid normal json, not special zuul json | 21:32 |
corvus | (because we want this to be valid if the job dies at any point) | 21:33 |
SpamapS | Yeah, I could have sworn there was a standard for doing it but I can't find it, so I probably dreamed it. | 21:35 |
mordred | SpamapS: yah - I originally looked for a standard ... everything I could find with streaming json was just people doing really weird stuff | 21:42 |
mordred | but I figure - other than python's weird obsession with not including yaml support in the core language - everybody else seems to be able to parse it easily | 21:43 |
SpamapS | mordred: so in yaml to make it appendable you just have to indent everything by one and start with a "- " and, all good... +1 | 21:46 |
mordred | SpamapS: heh | 21:49 |
mordred | SpamapS: no - actually just separate sections with --- ... to make it a multi-document file | 21:49 |
mordred | SpamapS: it's actually k8s yaml files that gave me the idea | 21:50 |
SpamapS | Oh docs.. hm | 21:54 |
SpamapS | You'd be surprised how many yaml parsers do not support multi doc | 21:54 |
SpamapS | Mostly because they're short-sighted. | 21:54 |
SpamapS | "make maps into {my language's version of dict} and lists into {my language version of list} and done" | 21:55 |
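A small sketch of the two append-only approaches discussed — newline-delimited JSON objects versus a multi-document YAML stream separated by `---`. The sample data is made up; PyYAML's safe_load_all does handle multi-document input, which is the parser-support caveat SpamapS raises:

```python
import json
import yaml

# Append-only "json stream": one object per line, parsed line by line.
json_log = '{"field": 1}\n{"field": 2}\n'
json_docs = [json.loads(line) for line in json_log.splitlines() if line]

# Append-only YAML: separate each appended section with a '---' marker,
# making the file a multi-document stream that stays valid after every write.
yaml_log = "---\nfield: 1\n---\nfield: 2\n"
yaml_docs = list(yaml.safe_load_all(yaml_log))

assert json_docs == yaml_docs == [{"field": 1}, {"field": 2}]
```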
openstackgerrit | Merged openstack-infra/zuul master: Read old json data right before writing new data https://review.openstack.org/623245 | 21:55 |
mordred | SpamapS: at this point in my life, nothing surprises me | 22:03 |
mordred | SpamapS: well, except for bojack getting zero golden globe nominations | 22:03 |
SpamapS | I'm sure he has a long face. | 22:03 |
openstackgerrit | Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294 | 22:04 |
openstackgerrit | Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294 | 22:20 |
*** manjeets_ is now known as manjeets | 22:28 | |
*** dkehn has quit IRC | 22:52 | |
openstackgerrit | Ronelle Landy proposed openstack-infra/zuul-jobs master: WIP: Default private_ipv4 to use public_ipv4 address when null https://review.openstack.org/623294 | 22:53 |
SpamapS | Hey, I'm setting up a Slack just to test the slack notifier role I've built to submit to zuul-roles. Who would like to be added to that slack? Anybody? | 22:57 |
*** cristoph_ has quit IRC | 22:58 | |
*** dkehn has joined #zuul | 23:01 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107 | 23:49 |
clarkb | pabelanger: does ^ depend on a fix in the sdk lib? | 23:50 |
pabelanger | clarkb: no, that was a failure with an unreleased version of openstacksdk. The ones on PyPI work | 23:52 |
pabelanger | we can add depends-on if we want however | 23:52 |