22:03:04 <jeblair> #startmeeting zuul
22:03:05 <openstack> Meeting started Mon Feb  6 22:03:04 2017 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:03:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:03:08 <openstack> The meeting name has been set to 'zuul'
22:03:13 <jeblair> #link agenda https://wiki.openstack.org/wiki/Meetings/Zuul
22:03:22 <jeblair> #link previous meeting http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-01-30-22.00.html
22:03:36 <SpamapS> o/
22:04:06 <jeblair> i'd like to reserve at least the last 20 minutes for talk about the ptg
22:04:11 <jeblair> so with that
22:04:20 <jeblair> #topic Status updates: Nodepool Zookeeper work
22:04:35 <jhesketh> o/
22:05:20 <jamielennox> o/
22:06:04 <jeblair> Shrews is continuing work on having nodepool actually return nodes
22:06:53 <Shrews> 428428 makes our integration job pass
22:06:57 <fungi> that would be useful
22:07:09 <pabelanger> Shrews: nice work
22:07:16 <jeblair> Shrews: now might be a good time for someone to jump in and update the nodepool cli commands to use zookeeper?  if you think so, and no one else does that soon, i may...
22:07:46 <morgan> :)
22:08:24 <Shrews> jeblair: i think someone could begin poking at that. i won't get to it anytime soon
22:08:42 <pabelanger> If we are ready to start using more zookeeper in nodepool, I don't mind poking into that again. It went quite well last time
22:08:59 <SpamapS> jeblair: is there a story for that yet?
22:09:01 * SpamapS can make one
22:09:16 <jeblair> SpamapS: don't think so, and thanks
22:09:31 * SpamapS makes it
22:10:36 <jeblair> #topic Status updates: Devstack-gate roles refactoring
22:11:04 <jeblair> rcarrillocruz, clarkb: were you talking about that earlier?
22:11:04 <SpamapS> FYI: https://storyboard.openstack.org/#!/story/2000856
22:11:22 <clarkb> jeblair: yes, I think the current patchset on the first change is good to go
22:11:41 <rcarrillocruz> yeah, passing tempest tests now in zuul
22:11:44 <rcarrillocruz> just needs a +A
22:11:46 <jeblair> #link nodepool zk cli work can begin https://storyboard.openstack.org/#!/story/2000856
22:11:48 <clarkb> the second still has a -1 from previous reviews that will need addressing (I think rcarrillocruz may be trying to reduce number of iterations and focus on one at a time)
22:11:56 <rcarrillocruz> y
22:12:33 <jeblair> rcarrillocruz, clarkb: so 403732 is ready?
22:12:58 <clarkb> yes I think so
22:13:01 <rcarrillocruz> imho yeah
22:13:02 <jeblair> #link devstack-gate roles change ready for approval https://review.openstack.org/403732
22:13:48 <jeblair> #link next devstack-gate roles change https://review.openstack.org/404243
22:14:56 <jeblair> #topic Status updates: Zuul test enablement
22:15:17 <jeblair> a bunch of those just showed up recently!  :)
22:15:25 <adam_g> i've started to pick up some low hangers again between doing other things..
22:15:52 <adam_g> i noticed test_dependent_behind_dequeue, which was recently reenabled, doesn't seem to be too stable. i had to recheck against it a few times and noticed others had to as well
22:15:58 <jeblair> also -- reminder that we merged a change that requires a playbook for every test job now.  there's a make_playbooks.py script in zuul/tests to help automate that.
22:16:03 <pabelanger> I've fixed my conflicts today, and started on the conflict project tests today
22:16:39 <jeblair> adam_g: yeah, i recently made it more stable by extending the timeouts (it's a very busy test), but there have now been a few failures of it since then, so there's still something going on
22:17:07 <adam_g> ah
22:17:20 <jeblair> we also just merged a change which attaches full debug logs on test failures, so as long as it doesn't manifest as a timeout (which this one, unfortunately, often does) we can actually fix them.
22:17:31 <pabelanger> https://review.openstack.org/#/c/393887/ is particularly easy :)
22:18:16 <jeblair> (the fact that it times out now is not likely because it's slow, but rather an error that just manifests as never reaching the stable condition)
22:18:42 <SpamapS> jeblair: does our test zookeeper make use of tmpfs? That might help.
22:19:00 <SpamapS> oh that
22:19:04 <jeblair> SpamapS: good point; i don't think so.
22:19:35 <jeblair> SpamapS: oh, but you know what, zk is usually pretty fast on tests in our cloud providers....
22:19:40 <SpamapS> yeah
22:19:51 <SpamapS> With so much RAM
22:20:06 <SpamapS> I'd expect it to mostly just buffer. Though ZK can be sync-happy
22:20:18 <SpamapS> because journals
22:20:31 <jeblair> yeah... maybe our clouds either have battery backed caches or just turn on data-eating.
22:20:41 <SpamapS> probably former for most.
22:20:50 <morgan> hah
22:20:53 <jeblair> and the latter for infra-cloud, iirc...
22:20:53 <morgan> yeah
22:20:59 <SpamapS> either way, we could look at io wait if we were concerned
22:21:10 <morgan> set value eat_data
22:21:12 <morgan> true
22:21:21 <SpamapS> eatmydata is a thing you know :)
22:21:30 <mordred> world's best LD_PRELOAD library
22:21:43 <jeblair> we may want to collect the zk logs from tests...
22:21:48 <SpamapS> I like to load it with libhostile and let them fight it out
22:21:49 <mordred> jeblair: ++
22:22:09 * fungi smells a new theme show in the making
22:22:09 <jeblair> #topic Status updates: Zuul Ansible running
22:22:44 <jeblair> my patch series to enable pre and post playbooks is making its way in (i have some random test failures to debug -- see earlier topic :)
22:23:08 <jeblair> mordred has a change built on that to start securing the insecure playbooks
22:23:21 <jeblair> #link playbook security https://review.openstack.org/428798
22:23:23 <mordred> yes. and then we found a whole new set of ways in which playbooks can be insecure
22:23:29 * mordred glares at roles
22:23:38 <pabelanger> funtimes
22:23:41 <mordred> yah
22:23:47 <jeblair> but also sketched out some solutions for that, yeah?
22:23:50 <mordred> yah
22:23:56 <SpamapS> ohmy
22:24:10 <clarkb> mordred: just reading the commit message on that seems like you'll have to audit and patch after every ansible release?
22:24:18 <SpamapS> Is this where we ask how this happened and somebody goes <cough>tower</cough>
22:24:31 <mordred> jeblair: it may be worth mentioning that due to security lockdown, we may also want to develop a stdlib role that knows how to run ansible on the remote host as if it was the local host job content
22:24:56 <mordred> SpamapS: actually - not really, it's more that the idea of running untrusted ansible code isn't a use case they really focus on
22:25:08 <SpamapS> Oh, joy, this also means Zuul gets to be partially GPLv3
22:25:25 <mordred> yup. this, of course, causes me to have a warm and fuzzy feeling
22:25:26 <clarkb> mordred: I'd worry about missing things and further complicating the "ansible has pushed a security update, hurry and fix/upgrade" scramble
22:25:53 <pabelanger> is it possible we could have the idea of secure / insecure zuul-launchers?  I know that doesn't scale well
22:25:54 <SpamapS> mordred: would a simpler thing be to just run it in a throw-away container?
22:25:56 <jeblair> mordred: you mean like push the inventory over and run something?  that sounds helpful.
22:26:14 <mordred> so - yes, I agree with clarkb, although from ansible core we really only need to worry about new action plugins (not very likely) or entirely new types of plugins (also not very common)
22:26:19 <mordred> we don't have to look at every patch
22:26:40 <mordred> SpamapS: ALSO looking at using some container tech here - but no, I do not think container == security yet
22:26:40 <pabelanger> ( I guess now, since our zuul-launchers are in the control plane)
22:26:41 <jeblair> pabelanger: that's not really the problem here as much as the fact that jobs need to run some secure things and some insecure things.
22:26:44 <pabelanger> not*
22:26:50 <pabelanger> jeblair: ya
22:27:05 <mordred> SpamapS: I think container + careful code can together be better than either one in isolation
22:27:17 <jeblair> actual defense in depth :)
22:27:43 <mordred> so specifically looking at giftwrap which allows for construction and execution of unprivileged containers - so that we don't have to escalate zuul-launcher to root before adding in the containment :)
22:27:58 <jeblair> mordred: bubble wrap?
22:28:11 <mordred> gah. bubblewrap. yes.
22:28:14 <mordred> https://github.com/projectatomic/bubblewrap
22:28:42 <mordred> it needs a fairly new kernel though - so the support for it will need to be opt-in for operators I think
22:28:49 * mordred needs to write up some thoughts on this for folks
22:28:53 <pabelanger> oh, new things to look at
22:29:33 <mordred> jeblair: and yes to "push the inventory over and run something?" ... jlk was asking about using zuul to test ansible that relies on plugins that we don't allow people to run with
22:29:45 <pabelanger> mordred: that is a neat idea
22:29:48 <mordred> jeblair: and that's _totally_ possible by writing a playbook that does a shell call to ansible
22:29:57 <mordred> but ... you know ... we can likely make that experience a little better :)
22:29:57 <clarkb> mordred: mostly I don't want to replace one security issue with another via upgrade of ansible by ops that don't understand caveats here
22:30:32 <mordred> clarkb: yup. it's definitely an area where we need WAY more prose about what's going on for all of us, and then make sure that we're happy with how we're covering it
22:30:43 <clarkb> if the class of objects that are an issue is small maybe we can do terrible nasty python to intercept all dispatches to them and sanitize appropriately
22:31:03 <clarkb> rather than having hard coded sanitization for known issues today
22:31:21 <mordred> yah - so - there are 2 prongs we need to deal with
22:31:35 <jeblair> clarkb: well, we don't use ansible as a library, so the solution has to be in ansible configuration...
22:31:37 <mordred> one is ansible in-tree action-plugin based modules - these do exceptional things like the copy module
22:31:47 <mordred> and execute code on purpose on the calling host
22:31:56 <mordred> but there is a fixed set of them and it's easy to vet those
22:32:18 <mordred> the _other_ is that roles can ship with plugins (action plugins, filter plugins, etc) that will run python code on the calling machine
22:32:35 <jeblair> (i'm going to call time on this at 22:35, btw)
22:32:39 <mordred> in that case, the approach we've discussed so far is to scrub roles when we fetch them for plugin directories (known set of names)
22:32:47 <mordred> and if a role has a plugin dir with content, just fail hard
22:32:57 <mordred> so doing those two things AND adding in containment
22:33:34 <SpamapS> could also neuter ansible's plugin loading
22:33:36 <mordred> should hopefully get us fairly decent coverage ... we could also potentially talk to our friends at ansible and request they warn us if they're going to release new local-execution action plugins
22:33:47 <mordred> SpamapS: yah - bcoca talked a bit about that
22:33:50 <clarkb> mordred: right my concern is ansible adds new actionmodule or changes one arbitrarily
22:33:58 <SpamapS> Just have a plugin that literally overrides the plugin loader with a pass.
22:34:02 <jeblair> yeah, so we're going to be as general as we can be (eg the plugins in roles), but we don't have a good general way to stop the in-tree plugins atm.
22:34:04 <mordred> SpamapS: and also mused about the possibility of adding a neuter plugins option to ansible itself
22:34:08 <clarkb> mordred: and then next zuul update and now you are vulnerable (and that would be a much larger target if/when people are using zuul with ansible)
22:34:16 <SpamapS> Or an "splodey splode, no plugins allowed"
22:34:26 <jeblair> clarkb: yes, i agree with your concern
22:34:40 <mordred> clarkb: yah - that's one where we're going to need to connect with ansible release management in addition to doing defensive coding on our part
22:34:47 <clarkb> mordred: having contributed to one of those action modules recently I don't think it's terribly hard to change the behavior of them in such ways (as people don't seem to grok how they work very well)
22:34:49 <mordred> clarkb: definitely a concern
22:34:49 <fungi> i definitely hadn't thought of that, but i can see it as a possibility
22:35:11 <jeblair> but we're hoping that's a small load due to the rarity in adding such new modules.
22:35:16 <SpamapS> clarkb: I share your concern, and think that one has to also wrap it up in a system level protection of some kind.
22:35:20 <mordred> (but yeah, this is why I think adding container wrapping to the mix will give us buffer too)
22:35:23 <mordred> yup
22:35:25 <mordred> SpamapS: ++
22:35:25 * SpamapS will look at bubblewrap
22:35:37 <jeblair> great segue
22:35:41 <mordred> SpamapS: it needs yakkety on ubuntu, fwiw - needs new kernel
22:35:56 <mordred> SpamapS: or, needs that to be able to run without sudo stuff
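The role-scrubbing approach described in this topic (check each fetched role for plugin directories from a known set of names, and fail hard if one has content) can be sketched roughly as below. This is only a minimal sketch, not the actual Zuul change: the directory-name set is an assumed, non-exhaustive list, and RolePluginError, find_plugin_content and check_role are hypothetical names made up for illustration.

```python
import os

# Assumption: an illustrative, NOT exhaustive, set of directory names that
# Ansible may load local-execution plugins from when they appear in a role.
PLUGIN_DIR_NAMES = frozenset({
    "action_plugins",
    "callback_plugins",
    "connection_plugins",
    "filter_plugins",
    "lookup_plugins",
    "library",
    "strategy_plugins",
    "test_plugins",
    "vars_plugins",
})


class RolePluginError(Exception):
    """Hypothetical error raised when a fetched role ships plugin content."""


def find_plugin_content(role_path):
    """Return the sorted plugin directories under role_path that are not empty."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(role_path):
        if dirpath == role_path:
            # Only directories inside the role count, not the role root itself.
            continue
        name = os.path.basename(dirpath)
        if name in PLUGIN_DIR_NAMES and (dirnames or filenames):
            hits.append(dirpath)
    return sorted(hits)


def check_role(role_path):
    """Fail hard if the role contains any non-empty plugin directory."""
    hits = find_plugin_content(role_path)
    if hits:
        raise RolePluginError(
            "role %s ships plugin content in: %s" % (role_path, ", ".join(hits))
        )
```

A check like this only covers plugins shipped inside roles. It does nothing about the in-tree action plugins or about containment, which the discussion treats as separate layers.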
22:35:56 <jeblair> #topic Progress summary
22:36:21 * jeblair hands link baton to SpamapS
22:40:09 <SpamapS> ahoy
22:40:24 <SpamapS> sorry I got alt-tabbed and tried to refresh and fell off the earth
22:40:33 <SpamapS> #link https://storyboard.openstack.org/#!/board/41
22:40:33 <jeblair> let's come back to this if we have time at the end
22:40:37 <fungi> flat earth will do that to you
22:40:42 <SpamapS> Not much to say anyway
22:40:45 <SpamapS> Progress continues.
22:40:51 <jeblair> that works out then :)
22:40:56 <jeblair> #topic PTG prep (jeblair)
22:40:59 * fungi is a fan of progress
22:41:08 <jeblair> so we've got a thing coming up soon
22:41:14 <fungi> like, real soon
22:41:23 <jeblair> 2 weeks?
22:41:29 <fungi> right at, yes
22:41:31 <pabelanger> yay travel
22:41:43 <SpamapS> crikey
22:41:49 <SpamapS> so much travel :_P
22:41:57 <jeblair> i think at this point, we probably have a good idea what's feasible
22:42:11 <mordred> ++
22:42:20 <jeblair> we should hopefully have nodepool at least able to hand out some nodes, even if it still doesn't do a lot of things
22:42:39 <jeblair> and we should have zuul able to run some jobs, even if it doesn't do a lot of things
22:43:05 <jeblair> so i think it's well within the realm of possibility that we can set up a v3 nodepool and zuul, and have them run some hello world jobs
22:43:12 <SpamapS> I'd also like to have all tests re-enabled/refactored/done by the time I fly out Thursday night.
22:43:30 <fungi> are our current puppet-zuul/puppet-nodepool modules up to the task of deploying what's in the feature branches yet?
22:43:34 <jeblair> to that end, there are probably some things we can do to prepare for that
22:43:42 <SpamapS> (while we focus on the pragmatic thing first, I want to make a real push while I have your brains in view)
22:43:52 <jeblair> fungi: probably close, but probably not.
22:43:56 <fungi> just curious if hello world is going to involve a lot of manual deployment
22:44:14 <jeblair> SpamapS: i would support that as a very worthy secondary goal :)
22:44:22 <pabelanger> what is left to do for nodepool zookeeper production? I am assuming mordred shim? (CLI commands?)
22:44:27 <fungi> or if we should try to work out the adjustments to puppet necessary to hello world it as part of the task
22:44:27 <SpamapS> It's a stretch goal for sure.
22:44:37 <jeblair> pabelanger: i don't think we need the shim for this
22:44:44 <pabelanger> ack
22:45:02 <jeblair> (the shim is for zuul v2 -> nodepool v3)
22:45:12 <pabelanger> got it
22:45:25 <Shrews> pabelanger: actual node launches (not just record entries in ZK) needs to be completed
22:45:44 <fungi> basically set up demo environment by hand and taking notes, vs deploying demo env using (patched) puppet modules so we can more directly translate that to the changes we'll need to make
22:45:55 <pabelanger> Shrews: thanks for the info
22:45:55 <jeblair> (fortunately, there's a body of code that does launches, so we're not starting from zero)
22:46:16 <Shrews> right
22:46:23 <jeblair> fungi: we may well end up doing some manual deployment, but otoh, maybe in the intervening 2 weeks, we could do some puppet work and have at least some of that codified
22:47:10 <jeblair> let's start an etherpad: https://etherpad.openstack.org/p/pike-ptg-zuul
22:47:17 <fungi> jeblair: thanks, just wondering if anyone has a feel for where we can strike that balance of effort vs expediency
22:47:48 <pabelanger> we didn't need to change puppet-nodepool too much for zookeeper things first time. But agree, we should try to land patches at the same time
22:48:10 <fungi> i do want to make sure we can have something viable we can at least feel good about by the end of tuesday, so if that means config management changes get mostly punted to later i'm cool with that
22:48:55 <fungi> would be awesome to say "zuul v3 ran a job"
22:48:59 <mordred> ++
22:49:00 <jeblair> yep
22:49:17 <pabelanger> jeblair: the plan is to have nl01.o.o eventually? (nodepool-launcher)
22:49:25 <jeblair> pabelanger: sounds reasonable
22:49:26 <Shrews> maybe we can begin making notes for any documentation that may need written
22:49:31 <pabelanger> k
22:49:33 <mordred> Shrews: ++
22:50:15 <jeblair> okay, take a look at that etherpad and let me know if there is anything else we should prep beforehand to increase our chances of success
22:50:31 <jeblair> obviously the first two are very important
22:51:13 <jeblair> the next few about deployment and setting up a server are things that would be really good to do ahead of time so we don't spend 2 days watching someone boot a server
22:51:26 <pabelanger> ++
22:51:35 <jeblair> i'd love it if someone would volunteer to take the lead on prepping a platform for us to work from at the ptg
22:51:39 <pabelanger> I can start doing some prep tomorrow for that
22:51:53 <jeblair> i think pabelanger just volunteered for that :)  thanks
22:51:58 <fungi> "platform" meaning server instances?
22:52:02 <jeblair> yeah
22:52:22 <fungi> oh, and i guess a tenant/namespace/whatever for the test nodes
22:52:23 <jeblair> so, new servers so that we don't touch any of the current system
22:52:40 <pabelanger> ++
22:52:49 <jeblair> fungi: i think at our scale, we can just steal some quota from our current nodepool tenants
22:53:10 <jeblair> (maybe bump the production quota down a little bit on one of them?)
22:53:17 <fungi> wfm. we do have unique identifiers implemented for nodepool's alien cleanup instance metadata right?
22:53:54 <fungi> i know we discussed having that so two could coexist on the same tenant was preferred but can't remember if it ever got implemented
22:54:22 <clarkb> fungi: ya nodepool should only delete leaks that it booted
22:54:31 <clarkb> (and if it doesn't we should fix that too)
22:54:39 <fungi> just want to make sure bringing up a demon nodepool pointed at one of our production tenants won't start sniping the production nodepool nodes
22:54:40 <jeblair> let's check on that
22:54:51 <Shrews> yep. won't delete an image unless the DIB is local
22:54:53 <fungi> s/demon/demo/ (fun typo though)
22:54:58 <jeblair> fungi: oh, that probably won't happen because we probably won't have cleanup in v3 implemented
22:55:03 <jeblair> fungi: i think the other direction is a possibility
22:55:23 <fungi> ah, so worst case nodepool v0.x production might blow away our demon nodes before they run anything
22:55:28 <clarkb> it's certainly the intent of the leak cleanup code to only delete things that it once booted and knew about
22:56:05 <fungi> i thought we had talked about adding a config option where you could put a unique string for each nodepool scheduler so it could differentiate its own node metadata from someone else's
22:56:07 <jeblair> the last item i put on the list is something i'll volunteer for -- to write up what we all need to know about the current and future state of both pieces of software in order to productively work on a hello-world job at the ptg
22:56:14 <Shrews> grr, yeah nodes, not images. doubtful cleanup will be implemented by then
22:56:47 <fungi> wow, i keep typing demon instead of demo. what is up with that finger memory?
22:57:10 <Shrews> fungi: i'm living proof brains break after 5pm
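The ownership question raised above (a nodepool should only delete leaked nodes that it booted itself, told apart by a unique string in the instance metadata) can be sketched roughly as below. This is a sketch under stated assumptions, not nodepool's actual cleanup code: the metadata keys nodepool_launcher_id and nodepool_node_id, the dict-shaped server records, and the function name find_leaked are all hypothetical.

```python
def find_leaked(servers, known_node_ids, launcher_id):
    """Return ids of servers this launcher booted but no longer tracks.

    servers: iterable of dicts such as {"id": ..., "metadata": {...}}
             (hypothetical shape, standing in for a cloud server listing).
    known_node_ids: set of node ids this launcher still has records for.
    launcher_id: the unique string configured for this launcher.
    """
    leaked = []
    for server in servers:
        meta = server.get("metadata") or {}
        # Hypothetical key: skip anything another launcher (or a human) booted.
        if meta.get("nodepool_launcher_id") != launcher_id:
            continue
        # Hypothetical key: skip nodes we still know about.
        if meta.get("nodepool_node_id") in known_node_ids:
            continue
        leaked.append(server["id"])
    return leaked
```

With two launchers sharing one tenant, each would pass its own launcher_id, so neither would pick up the other's nodes as leaks.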
22:57:13 <jeblair> mordred: it would probably be good if we have a handle on some of the security stuff by then, otherwise we may not be able to publish logs for our hello world job
22:57:22 <mordred> jeblair: ++
22:57:54 <fungi> "we ran a job, but its logs were too insecure to fit in this margin"
22:57:58 <jeblair> basically
22:58:20 <jeblair> i put some names on the etherpad, if you would like names added or removed, let me know
22:58:52 <jeblair> also, if you think of anything else we need to do before then so we're not sitting on our thumbs at the ptg, add it / let me know
22:59:19 <jeblair> fungi: think this is probably worth a mention at the infra meeting tomorrow?
22:59:36 <fungi> i think it's definitely worth mentioning, yes
22:59:41 <jeblair> #link actions to prepare for pike ptg https://etherpad.openstack.org/p/pike-ptg-zuul
22:59:44 <jeblair> will do
22:59:56 <fungi> given the timing, we should spend a good chunk of tomorrow on ptg topics
23:00:01 <fungi> thanks!
23:00:10 <jeblair> thanks everyone!
23:00:12 <jeblair> #endmeeting