22:07:51 <jeblair> #startmeeting zuul
22:07:52 <openstack> Meeting started Mon Nov 20 22:07:51 2017 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:07:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:07:56 <openstack> The meeting name has been set to 'zuul'
22:08:12 <jlk> The meeting should be stored in UTC, and then display is adjusted by local display zone info
22:08:15 <jeblair> #link agenda: https://wiki.openstack.org/wiki/Meetings/Zuul
22:08:22 <jlk> but people fail at thinking like that, so some events are stored actually in the local timezone, so that the "time" doesn't change when the offset does
22:08:24 <jeblair> #link last meeting: http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-11-13-22.10.log.html
22:08:39 <jeblair> Thanks to Shrews for chairing the last meeting!
22:08:46 <jeblair> it was nice and short, so i just linked to the transcript
22:08:57 <fungi> i found it helpful, even if short
22:09:36 <jeblair> i will only add that the infra+zuul team did not have rotten vegetables thrown at us so i consider it a success (in fact, many nice things were said about v3)
22:09:51 <jeblair> it=summit
22:10:04 <pabelanger> yay
22:10:11 <fungi> yes, none of the vegetables they threw were rotten
22:10:19 <pabelanger> a lot of excitement around zuulv3 at summit
22:10:20 <jlk> woo
22:10:41 <jeblair> #topic Add support for shared ansible_host in inventory (pabelanger)
22:11:18 <pabelanger> so, this is something I found when trying to convert a zuulv2 job to native zuulv3
22:11:32 <pabelanger> it is more an optimization I think on CI resources and hopefully something we want to consider
22:11:38 <jeblair> #link https://review.openstack.org/521324/
22:12:10 <pabelanger> right now, if I use our current nodeset stanza, I'd have to request 6 nodes from nodepool, when in fact, the way the playbooks are written, I really only need 1.
22:12:20 <jeblair> pabelanger: i take it the zuulv2 job was running ansible within the job, and you're trying to move that ansible up to top-level zuulv3 internal ansible?
22:12:29 <pabelanger> also, host groups don't work in this case, because of the way variable scoping is handled
22:12:36 <pabelanger> jeblair: correct
22:13:12 <dmsimard> I'm not sure I'm following, have a bad case of mondays
22:13:28 <jeblair> pabelanger: cool. so this isn't a regression from v2, more of an impedance mismatch with zuulv3 internal ansible and native ansible. which is cool -- we want to make it as transparent as possible.
22:13:30 * dmsimard reads review
22:13:39 <mordred> if there are differences in how variable scoping works, I could see that being something other folks would run into should they attempt to do what pabelanger is trying to do too
22:13:49 <mordred> yah - what jeblair said
22:13:53 <dmsimard> pabelanger: oh, different hosts which lead back to the same host
22:14:07 <pabelanger> jeblair: Yup, in fact, requesting the 5 nodes from nodepool works fine. I just didn't want to land it and eat up a bunch of nodes for each patch
22:14:31 <pabelanger> dmsimard: right
22:14:32 <mordred> "I want 3 different ansible hosts, but I only need one node"
22:14:34 <clarkb> wouldn't you put the single node in different groups with zuulv3
22:14:39 <clarkb> then have your playbooks operate on the various groups?
22:14:46 <clarkb> but one node could be in say 6 groups
22:14:52 <mordred> clarkb: yah - that apparently behaves differently in some ways
22:14:57 <dmsimard> clarkb: yes and no, you can do both
22:15:11 <mordred> clarkb: (that was my first suggestion as well)
22:15:15 <fungi> pabelanger: will it run ~5x as fast if scheduled across 5 nodes? if so, the larger node size doesn't sound too terrible
22:15:16 <pabelanger> clarkb: yes, that is possible but it would require a rewrite in my case
22:15:28 <dmsimard> clarkb: like, technically, there's nothing that prevents "keystone.domain.tld" and "nova.domain.tld" from ultimately resolving to the same IP address while also being different "hosts" in ansible
22:15:56 <dmsimard> the problem here is that we use IP addresses, not hostnames
22:16:10 <jeblair> pabelanger: what's the deficiency with using groups?
22:16:30 <jeblair> i know "something about var scoping" but is there something more specific i can reference?
22:16:30 <dmsimard> pabelanger: I think we might break some assumptions in roles if we do this, especially multinode roles
22:16:46 <dmsimard> jeblair: play host filtering is an example
22:17:03 <mordred> so if we want to support the story, as much as possible, of "run your existing ansible as part of your testing" - then if there is a semantic distinction between 2 hosts with the same IP and a single host in two groups, then I think we need to allow a user to express which they want
22:17:09 <pabelanger> jeblair: it is likely better if I work up a simple playbook example, because I'm likely not going to explain it very well.
22:17:27 <dmsimard> jeblair: if you want to do an "all in one" but your playbooks/roles are built to target different hosts
22:17:27 <jeblair> dmsimard: isn't that an anti-pattern? (ie, shouldn't you filter plays based on groups or roles anyway?)
22:17:30 <mordred> (also, yah, having a little clarity on the things that are different between the two would likely be helpful for all of our deeper understanding)
22:17:43 <dmsimard> I have a (bad) example, one sec
22:18:06 <dmsimard> https://github.com/rdo-infra/rdo-container-registry/blob/master/hosts
22:18:32 <dmsimard> it so happens that I'm installing everything on the same host, but the playbooks are made to target specific groups to install specific things
22:18:45 <dmsimard> pabelanger: ^ does that make sense?
22:19:35 <jeblair> dmsimard: iiuc, that case should be handled currently with our group support (aside from the openshift_node_labels, but that's only because we don't do any host-specific inventory vars right now. we could, but that's a different change)
22:19:40 <clarkb> dmsimard: right but we have different groups ability so your thing should work right?
22:19:42 <dmsimard> and then there's an example of var scoping in that example, if you look at the nodes group
22:19:54 <pabelanger> right, but more specifically ansible_host seems to create a new SSH connection (which resets variable scope) where using groups doesn't. It will just run everything from start to finish using 1 connection. Based on my local testing
22:20:19 <mordred> pabelanger: yah - that's, I think, the most important distinction
22:20:21 <dmsimard> pabelanger: that strangely rings a bell
22:20:29 <jeblair> pabelanger: multiple ssh connections make sense, how that's connected to variable scoping i don't understand
22:20:32 <pabelanger> but, I'll set up a simple playbook / inventory to demonstrate the issue
22:20:47 <mordred> pabelanger: ++
22:21:06 <dmsimard> jeblair: hostvars can arguably be worked around by supplying a host_vars directory so it's not a big deal I think
22:21:15 <dmsimard> (or group_vars)
22:21:28 <dmsimard> I don't think we need to support providing them in the nodesets (unless we really want to)
22:21:53 <jeblair> dmsimard: either way, based on my current understanding, i think it's orthogonal to this question, so we can set it aside for now
22:22:00 <mordred> yah
22:22:01 * dmsimard nods
22:22:18 <dmsimard> a reproducer would indeed help
22:22:35 <pabelanger> okay, will get that done for tomorrow and we can discuss more
22:23:14 <jeblair> okay, my personal summary here is: i'm not opposed to this on principle, but before we proceed, i'd like to understand it a bit more; pabelanger will help by supplying more examples and details. and if we do need to proceed, we should evaluate dmsimard's concern about assumptions in multinode jobs.
22:23:31 <jeblair> that jive for folks?
22:23:33 <pabelanger> ++
22:23:35 <mordred> I agree
22:23:38 <mordred> from a philosophical point of view, I'd prefer to minimize the number of times we have to say to someone "to use your existing ansible in zuul, you need to rewrite it"
22:23:39 <dmsimard> yeah.
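A toy Python model may make the distinction debated above concrete: several inventory hosts that share one ansible_host (the idea behind 521324) versus one host placed in several groups (clarkb's suggestion). It assumes the variables involved are group_vars, as pabelanger says at 22:33:57, and reproduces only three documented Ansible behaviors: play patterns expand to inventory host names, same-depth group_vars merge alphabetically with later groups winning, and host vars override group vars. It does not run Ansible; the host names, group names, and service_port values are hypothetical.

```python
"""Toy model of two inventory layouts: aliased hosts vs. one host in many groups.

This does NOT call Ansible. It mimics three documented Ansible behaviors only:
  * a play's `hosts:` pattern (here 'all' or group names joined with ':')
    expands to inventory host names, so aliases of one machine are separate targets;
  * group_vars for groups at the same depth are merged alphabetically, with
    later groups overriding earlier ones;
  * host variables override group variables.
Group nesting and ansible_group_priority are deliberately ignored.
"""

# Hypothetical group_vars; names and ports are illustrative only.
GROUP_VARS = {
    "keystone": {"service_port": 5000},
    "nova": {"service_port": 8774},
}

# Layout A (idea behind 521324): one inventory host per service, all
# pointing at the same machine through ansible_host.
ALIASED = {
    "hosts": {
        "keystone-host": {"ansible_host": "203.0.113.10"},
        "nova-host": {"ansible_host": "203.0.113.10"},
    },
    "groups": {"keystone": ["keystone-host"], "nova": ["nova-host"]},
}

# Layout B (clarkb's suggestion): a single host that belongs to every group.
GROUPED = {
    "hosts": {"node": {"ansible_host": "203.0.113.10"}},
    "groups": {"keystone": ["node"], "nova": ["node"]},
}


def expand(inventory, pattern):
    """Inventory hosts a play with `hosts: <pattern>` would run against."""
    targets = []
    for part in pattern.split(":"):
        if part == "all":
            names = list(inventory["hosts"])
        else:
            names = inventory["groups"].get(part, [])
        for name in names:
            if name not in targets:  # a host is targeted at most once per play
                targets.append(name)
    return targets


def effective_vars(inventory, host):
    """Variables `host` would see: group_vars in sorted order, then host vars."""
    merged = {}
    member_of = sorted(g for g, hosts in inventory["groups"].items() if host in hosts)
    for group in member_of:
        merged.update(GROUP_VARS.get(group, {}))
    merged.update(inventory["hosts"][host])
    return merged


if __name__ == "__main__":
    for label, inv in (("aliased", ALIASED), ("grouped", GROUPED)):
        print(label, expand(inv, "keystone:nova"))
        for host in inv["hosts"]:
            print("  ", host, effective_vars(inv, host)["service_port"])
```

In the aliased layout a play aimed at keystone:nova runs twice against the same machine and each alias keeps its own service_port; in the grouped layout it runs once and the single host sees nova's port for both roles, because nova sorts after keystone. That second effect is one plausible candidate for the behavior pabelanger hit, pending his reproducer, and the first is an instance of the play host filtering dmsimard raises.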
22:23:47 <mordred> there will be some cases in which that is unavoidable, of course
22:24:15 <fungi> sgtm
22:24:19 <jeblair> mordred: ++
22:24:23 <pabelanger> yah, this was the closest way I could reproduce an existing inventory file I was testing with in v2
22:24:26 <jeblair> #agreed pabelanger will help by supplying more examples and details before we proceed with this. if we do need to proceed, we should evaluate dmsimard's concern about assumptions in multinode jobs.
22:25:31 <jeblair> i -1d the change with a quick summary too, so we don't lose it
22:25:40 <pabelanger> ack
22:25:50 <jeblair> #topic Allow run to be list of playbooks (pabelanger)
22:26:27 <pabelanger> So, this was actually the first way I solved the above, but kept it alive because it might be useful. Having a job run multiple playbooks
22:26:31 <jeblair> this seems to touch on similar issues...
22:26:33 <jeblair> ah :)
22:26:36 <pabelanger> yup
22:26:38 <jeblair> that was my question
22:26:41 <jeblair> #link https://review.openstack.org/519596
22:26:53 <jeblair> so this is a semi-alternate to the other change
22:27:25 <pabelanger> Yah, gave the option to run back to back playbooks with specific hosts
22:27:34 <pabelanger> kinda like we do on puppetmaster today with ansible
22:27:51 <pabelanger> so, not sure if we want to consider supporting it or leave until later
22:27:55 <clarkb> considering it is something we already do elsewhere it seems to make sense as a feature
22:28:51 <jeblair> actually, why do we do that on puppetmaster?
22:29:35 <pabelanger> I know we wrap each ansible-playbook with timeout, did we break it out due to memory issues?
22:30:12 <jeblair> is it something about parallelism? or exiting on failure?
22:30:26 <mordred> jeblair: I think it's actually just historical
22:30:53 <mordred> we had a run_puppet.sh - and we started using ansible to run it by modifying that script one bit at a time
22:31:39 <clarkb> jeblair: mordred the big reason for it today is decoupling infracloud from everything else
22:31:42 <jeblair> pabelanger: how would this have solved your problem? even if zuul ran multiple playbooks in sequence, it would still have the same vars?
22:31:54 <clarkb> because infracloud is more likely to fail and adds a significant time to the round robin
22:31:59 <mordred> jeblair: vars set by tasks in the plays get reset across playbook invocations
22:32:03 <jeblair> clarkb: right, but those just operate completely in parallel, right?
22:32:12 <clarkb> jeblair: yes
22:32:14 <jeblair> mordred: oh i see
22:32:20 <clarkb> so ya I guess in the context of a job you'd just have two jobs for that
22:32:23 <mordred> yah
22:32:31 <pabelanger> jeblair: it would be same vars, but multiple ssh connection attempts. That seems to be the key to resetting them to how I expect them in the playbooks
22:32:53 <pabelanger> again, I think a more detailed example playbook might help here
22:33:01 <pabelanger> and happy to write up
22:33:05 <jeblair> i'm still questioning the connection between ssh connections and variables
22:33:16 <jeblair> i'm pretty sure those are two independent concepts
22:33:57 <pabelanger> these are group_vars if that helps
22:34:21 <jeblair> does zuul set group vars?
22:34:50 <dmsimard> I don't think so
22:34:54 <pabelanger> it doesn't, ansible is loading them based on the playbooks/group_vars folder
22:35:11 <dmsimard> there's either inventory-wide vars or extra-vars which both apply to everything
22:35:29 <jeblair> pabelanger: so you're getting different variables because you're running playbooks in different paths?
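mordred's point that variables set by tasks are reset between playbook invocations can be sketched the same way. This toy model assumes, without checking Zuul's code, that each entry of a run: list would become its own ansible-playbook process and that no fact cache is configured; the set_fact and record_seen helpers and the db_ready fact are hypothetical stand-ins. Within one process, as with a single site.yaml that imports the other playbooks, a fact set in one play is still visible in a later one; a fresh process starts again from the static inventory and group_vars.

```python
"""Toy model of fact lifetime: one playbook that imports others vs. a list
of separately invoked playbooks.

Assumptions (not taken from Zuul's code): each entry in a `run:` list would be
executed as its own ansible-playbook process, and no fact cache is configured.
"""


def set_fact(key, value):
    """Return a task that records a fact, loosely like Ansible's set_fact."""
    def task(facts):
        facts[key] = value
    return task


def record_seen(key, log):
    """Return a task that notes whether `key` is visible to a later play."""
    def task(facts):
        log.append(key in facts)
    return task


def run_single_process(playbooks, static_vars):
    """site.yaml style: one invocation that imports every playbook in order."""
    facts = dict(static_vars)
    for playbook in playbooks:
        for task in playbook:
            task(facts)  # facts accumulate across plays in the same process
    return facts


def run_separate_processes(playbooks, static_vars):
    """`run:` list style: every playbook starts from a fresh copy of static vars."""
    facts = {}
    for playbook in playbooks:
        facts = dict(static_vars)  # facts set by the previous playbook are gone
        for task in playbook:
            task(facts)
    return facts


if __name__ == "__main__":
    for runner in (run_single_process, run_separate_processes):
        seen = []
        playbooks = [[set_fact("db_ready", True)], [record_seen("db_ready", seen)]]
        runner(playbooks, {"service_port": 5000})
        print(runner.__name__, "second playbook sees db_ready:", seen[0])
```

If that model holds, the reset pabelanger observes would come from separate processes rather than from SSH connections as such, which fits jeblair's view that the two are independent. Ansible's fact caching (for example set_fact with cacheable enabled plus a persistent cache plugin) would blur the difference.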
22:36:22 <pabelanger> jeblair: I get different variables if I switch to groups in my inventory file
22:36:32 <pabelanger> well
22:36:38 <pabelanger> groups of groups
22:36:56 <dmsimard> I haven't tested group_vars and host_vars, it'd be interesting to test actually.. typically you'd have host_vars and group_vars inside {{ playbook_dir }}, but in our case those paths aren't "merged". However, I believe you can set group_vars and host_vars inside roles, and that would be more likely to work.
22:37:11 <jeblair> pabelanger: er, i'm trying to understand how this change solves your variable problem from earlier
22:37:13 <pabelanger> for example: http://git.openstack.org/cgit/openstack/windmill/tree/playbooks/inventory is my working v2 inventory file
22:38:00 <pabelanger> jeblair: basically, it allows me to stop doing http://git.openstack.org/cgit/openstack/windmill/tree/playbooks/site.yaml
22:38:17 <pabelanger> and create a run: statement for each playbook
22:38:28 <jeblair> pabelanger: why is that preferable?
22:38:41 <jeblair> pabelanger: don't you feel like you're moving too much logic *into* zuul?
22:38:49 <dmsimard> seems equivalent to me as well
22:39:25 <dmsimard> FWIW that's exactly what we're doing with the base and multinode integration jobs, we're running "one" playbook that includes multiple playbooks
22:39:33 <dmsimard> I don't currently see that as a hindrance
22:39:35 <pabelanger> jeblair: yes, this is a work around, because I created 521324, which I'd much rather have
22:39:42 <pabelanger> before*
22:41:14 <pabelanger> I'll have to run here in about 5 minutes, but don't want to leave people hanging. I'm happy if we want to continue this topic in #zuul too
22:41:39 <jeblair> pabelanger: to be clear, i'm, again, not permanently opposed to 519596, but before we merge changes like that, i'd like to have a really clear idea of why they are necessary, or what problem they solve, or what situation they improve. so far we've got the "include list" as one thing, but that seems like an anti-pattern and a mild argument against merging 519596
22:41:47 <jeblair> if there's a variable aspect to this, i still don't understand it
22:42:35 <pabelanger> sure, I'll get some working examples that better show the issues I ran into converting v2 job to native v3. These were both my attempts to address some issues I was having
22:42:44 <jeblair> pabelanger: okay, thanks
22:43:06 <jeblair> #agreed pabelanger will provide more examples and explanation for this
22:43:25 <pabelanger> I have to run now, will catch up on minutes when I return
22:43:31 <jeblair> #topic open discussion
22:44:11 <jlk> I have a topic...
22:44:18 <dmsimard> For open discussion, I just wanted to point out that we formally started looking at what it means to run a Zuul v3 that is not the one in OpenStack
22:44:18 <clarkb> I too have one, but go for it jlk
22:44:35 <dmsimard> jlk won first :P
22:44:38 <jlk> dmsimard: who's "we" ?
22:44:45 <jlk> IIRC there is one at BMW is there not?
22:45:06 <dmsimard> We is RDO Infra (analogous to openstack-infra) and Software Factory developers
22:45:08 <jeblair> there are a few in fact
22:45:10 <jlk> (and for a hot minute, there was Bonny. Sigh.)
22:45:23 <jlk> dmsimard: neat!
22:45:43 <dmsimard> Software Factory had arguably been running Zuul v3 for a while
22:46:25 <dmsimard> But there's some interesting questions and design challenges in thinking how we want to share configuration between zuuls (zuul-jobs, openstack-zuul-jobs, project-config, and specific project)
22:47:09 <jeblair> i think SpamapS was also trying out zuul-jobs sharing
22:47:28 <jlk> jamielennox had some thoughts in this space as well
22:47:43 <dmsimard> I started a thread about it in the context of TripleO http://lists.openstack.org/pipermail/openstack-dev/2017-November/124733.html and we also started hunting down issues we come across in zuul-jobs here: https://etherpad.openstack.org/p/downstream-zuul-jobs-issues
22:47:46 <clarkb> we probably want to focus on sharing zuul-jobs first and not the others right (they aren't generally supposed to be reconsumable so figuring it out for zuul-jobs where it is is a good start)
22:47:48 <jeblair> sharing between instances is definitely a design goal for zuul-jobs.
22:48:34 <dmsimard> clarkb: it's funny that you mention that, because one of the ideas that has been floating around is to centralize the playbooks/roles/jobs/etc for TripleO in tripleo-ci and then use that across all Zuuls
22:48:36 <jeblair> i think openstack-zuul-jobs and individual openstack projects may be useful for openstack third-party-ci, but that's less of an explicit goal, and i think large amounts of 'shadow' and 'include'/'exclude' may be needed.
22:48:41 <mordred> yes. I could also imagine that in-repo jobs and possibly openstack-zuul-jobs might be things that OpenStack Third Party Zuuls will want to consume
22:48:50 <dmsimard> jeblair: yes, in the context of third party CI and such.
22:49:02 <mordred> jeblair: yup
22:49:37 <jeblair> project-config definitely isn't meant to be shareable -- *however* -- we do want to have at least a stub/example base job. that should end up either in zuul-jobs or zuul-base-jobs at some point soon.
22:49:44 <mordred> ++
22:49:51 <mordred> starting by figuring out sharing of zuul-jobs and getting it right will go a long way
22:50:17 <jeblair> ++
22:50:32 <clarkb> mordred: ya that's what I'm thinking. Those are the bits that should be reconsumable so let's start there and learn what we learn
22:50:46 <jeblair> dmsimard: so thanks for diving in and thanks in advance for your patience :)
22:51:19 <jeblair> (cause we just *might* have gotten some things wrong in the first pass)
22:51:25 <dmsimard> no stress
22:51:26 <clarkb> last week we merged a couple changes that broke Zuul's config and zuul didn't catch them upfront. The one I remember off the top of my head is parenting to a final job. I know that pabelanger ran into something else when he had to restart zuul due to OOM as well.
22:51:31 <clarkb> ^^ is my item
22:51:53 <clarkb> it would be good if we could address those config issues pre merge
22:51:53 <dmsimard> a couple?
22:51:56 <jeblair> clarkb: by broke zuul's config... what do you mean?
22:52:01 <dmsimard> how did more than one breaking change merge ?
22:52:08 <mordred> jlk: was that your topic? or did we talk over you?
22:52:10 <jeblair> how did *one* breaking change merge?
22:52:11 <clarkb> jeblair: in pabelanger's restart zuul case zuul would not start up again
22:52:18 <jlk> mordred: I can wait through clarkb's :)
22:52:37 <jeblair> is there a bug report?
22:52:41 <clarkb> jeblair: in the job parented to final job zuul kept running but the new jobs that merged would not run
22:52:50 <clarkb> jeblair: I believe both were added to the zuulv3 issues etherpad
22:53:14 <jeblair> clarkb: the second sounds like an expected run-time error; the first sounds like a bug.
22:53:30 <clarkb> ok we can followup on them after meeting
22:53:35 * clarkb gets out of jlk's way
22:53:35 <jeblair> ack
22:53:51 * dmsimard has an easy item after jlk
22:54:04 * jeblair has a quick item after dmsimard
22:54:21 <jlk> hi there! So tristanC landed a nodepool driver for k8s, and I'm asking if the group collective has the appetite to discuss/debate k8s (and container in general) driver approaches for Zuul/Nodepool.
22:54:30 <jlk> "is now the time" or should we table this for later?
22:54:43 <jlk> and by now, I don't mean in this channel now, but on list and in #zuul and whatnot
22:55:07 <mordred> s/landed/proposed/
22:55:14 <jlk> sorry that's what I meant :(
22:55:16 <jlk> silly fingers
22:55:37 <mordred> jlk: yah - just wanted to make sure nobody else was confused :)
22:55:38 <dmsimard> it's worth discussing not just because it's important but because it'll be a precedent from which the design of other "drivers" will be built from IMO
22:55:57 <mordred> jlk: I was actually just writing an email to the list on this meta topic actually
22:55:59 <jeblair> i'd really like to get a v3 release out before we dive into this, because i think it will help get and retain other folks into our community. also, it's just embarrassing not to release.
22:56:25 <dmsimard> +1
22:56:30 <clarkb> considering the lack of such features isn't a regression I'm on board with that too
22:56:45 * mordred agrees with jeblair - but does think there is at least one facet of the discussion that is worth having pre-release to make sure we don't back ourselves into a corner when we release
22:56:49 <jlk> same, a release would be good, so long as we don't have to wait for zuul4 to add container support
22:56:51 <jeblair> my preferred approach would be to get the release out the door quickly, then start on the next dev cycle. there are things on the roadmap for release that others can pitch in on (some of them have my name next to them, but they don't have to be me and that would speed things up)
22:57:38 <dmsimard> ok mordred you will start a topic on the ML ?
22:57:44 <jeblair> i would be really surprised if we are unable to release by february, and think earlier is likely (though i caution, there really aren't that many work weeks left in this year)
22:57:49 <mordred> the thing I think is worth sanity-checking ourselves on pre-release is making sure we aren't doing anything that would fundamentally block the addition of a nodepool driver that produces build resources that do not use ssh
22:57:53 <mordred> dmsimard: yes, I shall
22:58:02 <dmsimard> let's pick up the discussion there
22:58:16 <jlk> works for me, thanks!
22:58:56 <jeblair> this dovetails into my topic, which is in next week's meeting, i'd like to check in on the release roadmap items
22:58:56 <dmsimard> There's only 1:30 left.. but Ansible 2.4.1 is out and addresses most/all of the regressions/issues introduced in 2.4. How do we upgrade without breaking people that have written <2.4 things ?
22:59:25 <jeblair> i'll add that to the agenda for next, and i expect we'll have the roadmap items in storyboard soon
22:59:33 <dmsimard> Alternatively, how do we potentially allow different versions of Ansible ? Because some things work in one, some in others, etc.
22:59:35 <dmsimard> Ack
22:59:44 <mordred> jeblair: ++
23:00:11 <jlk> I really hope that we don't get too deep in "what version of ansible" land for Zuul users.
23:00:13 <jlk> at least not "choose your own"
23:00:19 <mordred> dmsimard: most of the 2.3 to 2.4 breakages were in the python layer for us, right?
23:00:32 <dmsimard> mordred: for ara and zuul themselves, perhaps
23:00:40 <dmsimard> mordred: there's still some "bug fixes" that broke things
23:00:58 <dmsimard> some behavior changes in includes/imports
23:01:05 <dmsimard> also variable scopes
23:01:16 <dmsimard> they don't really follow semver :(
23:01:27 <mordred> jlk: I actually have some thoughts on how we might consider doing that without death - but as of right now I agree with that sentiment
23:01:31 <jeblair> jlk, dmsimard: some folks have expressed a use case for multiple ansible support, and it may be in keeping with our philosophy of trying to be as transparent as possible with ansible. for instance, kolla tests all their stuff with all current supported versions of ansible.
23:02:11 <jlk> sure, but I thought at that point, you write a first job that installs the ansible you want to test with, then....
23:02:12 <dmsimard> this is a cool topic which I'll continue in #zuul :)
23:02:13 <mordred> dmsimard: I'm not sure there is a great story yet today for how to upgrade zuul's ansible from 2.3 to 2.4 and verify that people's job ansible doesn't break
23:02:18 <jeblair> so it may not be out of the question for zuul to maintain compatibility with current supported versions. but i think that needs a design proposal. :)
23:02:26 <jeblair> i think the quick answer to dmsimard's question is --
23:02:36 <jeblair> we try using 2.4.1 and fix what's broken :)
23:02:39 <mordred> yah
23:02:43 <dmsimard> yuck
23:02:53 <dmsimard> anyway, we're over time :)
23:03:01 <jeblair> Shrews has done much work on that already, we should be in pretty good shape
23:03:10 <jeblair> thanks all!
23:03:13 <jeblair> #endmeeting