12:00:20 <jaosorior> #startmeeting TripleO Security Squad
12:00:21 <openstack> Meeting started Wed Jun 20 12:00:20 2018 UTC and is due to finish in 60 minutes. The chair is jaosorior. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:24 <openstack> The meeting name has been set to 'tripleo_security_squad'
12:00:28 <moguimar> #link https://etherpad.openstack.org/p/tripleo-security-squad
12:00:31 <jaosorior> Lets wait some minutes for more folks to log in
12:03:15 <jaosorior> Alright, I guess it's fine now
12:03:32 <jaosorior> #topic oslo pluggable secrets backend discussion
12:04:11 <jaosorior> raildo, moguimar: wanna take it from here?
12:04:32 <raildo> yeah, I'll try to sync everything here :)
12:05:05 <openstackgerrit> Alex Schultz proposed openstack/instack-undercloud stable/queens: Fall back to puppet-ntp defaults https://review.openstack.org/576450
12:05:43 <raildo> so we're starting to discuss about the castellan driver, that will probably be supported by Tripleo, in a meaning of having a secure and automated way to handle the secrets on configuration files
12:06:37 <raildo> so we were discussing about that in yesterday meeting and dhellmann had a good point about understand more the tripleo needs for this feature, so we can guarantee that we're covering those points in that driver
12:07:26 <openstackgerrit> Martin André proposed openstack/tripleo-quickstart master: Install packages from centos-release-openshift-origin39 https://review.openstack.org/576832
12:07:35 <dhellmann> right, I would hate for us to design the driver to work in a way that doesn't fit with tripleo
12:07:36 <jaosorior> right
12:08:14 <jaosorior> So, the way we do things at the moment, is that we write everything to hiera and after that eventually it gets persisted to the configuration files
12:08:30 <chem> ccamacho: do you know if that py35 issue is generalized or a recheck will do
12:08:50 <jaosorior> It would be possible, however, to, instead of writing sensitive info to hiera, to write it to a secure backend (Vault?)
12:09:12 <chem> ccamacho: by the way waiting on the memcached one that ci passes (or should I just merge the backport now ?)
12:09:29 <jaosorior> chem: w'ere in the middle of a meeting
12:09:32 <jaosorior> *we're
12:09:42 * chem ->[ ]
12:09:59 <jaosorior> Having these secrets in the secure backend we could then write "constants" to the config files, that point to the relevant entry in the backend, and a reference on how to get to that backend
12:10:14 <ooolpbot> URGENT TRIPLEO TASKS NEED ATTENTION
12:10:14 <ooolpbot> https://bugs.launchpad.net/tripleo/+bug/1777759
12:10:16 <ooolpbot> https://bugs.launchpad.net/tripleo/+bug/1777762
12:10:16 <openstack> Launchpad bug 1777759 in tripleo "pike, volume failed to build in error status. list index out of range in cinder" [Critical,Triaged] - Assigned to Quique Llorente (quiquell)
12:10:18 <openstack> Launchpad bug 1777762 in tripleo "pike: nova scheduler, Failed to update inventory for resource provider" [Critical,Triaged] - Assigned to Quique Llorente (quiquell)
12:10:18 <dhellmann> it looks like castellan supports barbican and vault today, but I don't know how complete either of those drivers is
12:10:35 <jaosorior> dhellmann: barbican would be the most complete implementation I would say
12:10:47 <dhellmann> jaosorior : yeah, for each secret you would need the "id" string that the service gives you, I think
12:10:55 <jaosorior> the problem is that we can't use Barbican, because... how would keystone AND barbican would get their own secrets?
12:11:07 <dhellmann> so instead of storing the actual secret in the config, you store the id of the secret in a separate file the driver will read
12:11:09 <jaosorior> dhellmann: either an ID or just a unique tag
12:11:16 <raildo> dhellmann, well, we are looking for to use the vault backend, since it wont require keystone auth tokens and also because we will be able to store the keystone and babican secrets as well
12:11:29 <dhellmann> raildo : that makes sense
12:11:39 <dhellmann> jaosorior : it sounds like you probably know more about this than I do :-)
12:11:59 <jaosorior> so... we can't use the barbican backend for castellan
12:12:11 <dhellmann> so someone needs to evaluate that vault driver and figure out if it is complete enough
12:12:20 <raildo> dhellmann, also, the way that the vault backend is implemented, it's just pointing for an external vault server passing vault key, user...
12:12:25 <openstackgerrit> Martin André proposed openstack/tripleo-heat-templates master: Update for openshift 3.9 https://review.openstack.org/574233
12:12:25 <openstackgerrit> Martin André proposed openstack/tripleo-heat-templates master: Add ability to set openshift container images https://review.openstack.org/576441
12:12:26 <dhellmann> assuming we don't want to use a completely different backend, of course
12:12:27 <jaosorior> we also don't want the overcloud to depends on the undercloud. So ideally it has to be a service that's deployed by TripleO
12:12:57 <jaosorior> redrobot has been taking a look at the Vault driver... not sure if he got to any conclusions
12:13:00 * redrobot sneaks into the back of the room
12:13:02 <jaosorior> hopefully he'll be online soon
12:13:09 <redrobot> o/
12:13:11 <jaosorior> aaah there you go! Hi! redrobot
12:13:13 <dhellmann> nice timing
12:13:17 <raildo> ++
12:14:11 <redrobot> haha, sorry guys. Give me a sec to read the scrollback
12:15:04 <ccamacho> chem, we are still waiting for the BZ flags
12:15:14 <ccamacho> so we can wait for the patch to merge
12:15:54 <dhellmann> just for my own clarification, are we at a stage where we need to deep dive into questions like "does the driver work?" or are we still working out higher level issues like what parts of the system are responsible for different actions?
12:16:01 <redrobot> Re: Vault driver. Still evaluating but, yeah I'm concerned about the way the castellan-context is used
12:16:04 <redrobot> or not used rather
12:16:33 <dhellmann> context?
12:16:35 <redrobot> the idea was that the castellan-context (which is a terrible name, should have been castellan-auth) would be used to abstract away auth from the backend
12:17:33 <dhellmann> does that mean the driver doesn't work? it's not secure? or it's not using the preferred implementation pattern?
12:17:49 <redrobot> and so http://git.openstack.org/cgit/openstack/castellan/tree/castellan/common/utils.py#n95 was supposed to be called to get a castellan-context
12:18:04 <redrobot> but the Vault plugin sidesteps all that and just reads the token from config
12:18:29 <redrobot> also, the whole context naming has people passing oslo contexts into the Castellan API
12:18:29 <dhellmann> ah
12:18:31 <raildo> dhellmann, my concern from a tripleo perspective, if we choose to go with the vault backend, will be how we gonna to ship/build vault to use it on tripleo? or we gonna just ask for an external vault server, something like what castellan did on that driver
12:19:02 <dhellmann> raildo : good question
12:19:54 <dhellmann> redrobot : changing the name of the public classes and arguments may be a challenge at this point, but fixing up the driver to use them seems like a good idea
12:20:27 <dhellmann> I'm not sure why a centralized set of options and a function to access them is needed. It seems like each driver is just going to have 1 auth method, right?
12:20:54 <dhellmann> but if that's the preferred way, it seems like the driver can just be fixed to use it
12:21:02 <jaosorior> Seems that Vault is the only choice at the moment. We could try to use the Barbican driver, but it would need to be a barbican instance that uses a context-middleware that's not the keystone one, and we would need to write up proper auth and permissions for that one. That just seems like too much work when we could try to fix up the vault driver.
12:21:13 <openstackgerrit> Sagi Shnaidman proposed openstack-infra/tripleo-ci master: WIP: DNM: try to remove things from toci_* scripts https://review.openstack.org/576834
12:21:44 <redrobot> dhellmann, yeah... the more I think about it, the more I think the credential_factory is not needed, and maybe the way the Vault backend gets its credentials may be the better pattern.
12:21:50 <dhellmann> so what's involved in getting tripleo to deploy vault in a way that it can be used by the services to access their secrets?
12:22:17 <redrobot> a prod-ready vault needs a HA backend
12:22:22 <redrobot> so etcd or consul
12:22:31 <dhellmann> redrobot : I'd be happy to deep-dive into that with you at some point if you want to talk about it separately
12:23:01 <dhellmann> are we deploying either of those yet?
12:23:18 * dhellmann knows embarrassingly little about tripleo today
12:23:28 <redrobot> heh, you and me both dhellmann
12:23:43 <dhellmann> for a PoC, would we need etcd or consul?
12:23:44 <redrobot> I hear jaosorior is the TripleO expert...
12:24:04 <redrobot> No, a non-prod vault server doesn't really have any dependencies
12:24:21 <dhellmann> ok, so maybe we could do it in stages, if we need to
12:24:38 <dhellmann> we should start #info-ing these things
12:24:51 <dhellmann> or #action-ing
12:25:06 <dhellmann> who's going to look at the vault driver? redrobot, is that you?
12:25:29 <redrobot> yep, been deep diving into it, and I think I'm the Vault expert
12:25:43 <redrobot> and by expert I mean I probably read more docs than anyone else, but still don't know much, haha
12:25:52 <dhellmann> #action redrobot investigate completeness of the vault driver in castellan and identify any shortcomings that need to be resolved
12:26:29 <jaosorior> What's required for an HA Vault deployment?
12:26:38 <jaosorior> just etcd or consul?
12:26:45 <redrobot> depends ...
12:26:54 <redrobot> etcd or consul gives your storage HA
12:27:02 <redrobot> but Vault runs in single instance mode
12:27:14 <redrobot> unless you have a boatload of cash to dump on Hashicorp
12:27:30 <redrobot> only the Enterprise version has failover IIRC
12:27:53 <jaosorior> uhm...
12:27:57 <dhellmann> maybe the next question to answer is which backend we actually want to use
12:28:08 * redrobot needs to revisit the Vault open-source vs Enterprise feature set
12:28:28 <jaosorior> Seems to me that without HA, the cluster startup is gonna be quite prone to failure
12:28:43 <dhellmann> do we have any info from users about which service they have experience running?
12:28:56 <jaosorior> basically, when OpenStack services start, they will pull the secrets from Vault
12:28:58 <dhellmann> AT&T has some thing I can never remember the name of
12:29:14 <dhellmann> s/has/likes/ maybe
12:29:19 <jaosorior> when we're deploying (or updating) a cloud, this means there will be a LOT of traffic coming to Vault at that one point
12:29:38 <jaosorior> and then for most of the rest of the cloud's lifetime, there won't be any traffic...
12:29:52 <dhellmann> that's an interesting point
12:30:10 <jaosorior> most of what we need is for Vault to be HA, but for read operations
12:30:21 <jaosorior> write operations can happen quite serialized
12:30:35 <redrobot> Also, open-source Vault does not support HSMs
12:30:41 <shardy> TripleO can deploy etcd, but it's not enabled by default
12:30:46 <dhellmann> yeah, we also need to talk about what we have to do to update a cloud when secrets are changed
12:30:56 <shardy> AFAICS only the neutron vpp ml2 plugin requires it
12:30:57 <jaosorior> and will only happen when we first deploy the cloud and when we update secrets (password rotation for instance)(
12:31:20 <jaosorior> shardy: sure, if we want to enable the secure backend we could just deploy etcd for that setup
12:31:39 <jaosorior> redrobot: HSMs are not a requirement (yet)
12:34:00 <jaosorior> redrobot: so, given that our main concern aren't really write operations... will open-source Vault be alright? or does everything still relly on one node?
12:34:06 <jaosorior> * isn't
12:35:26 <redrobot> Vault is a single node, but presumably etcd or consul won't be. Supposedly even though Vault seems like a bottleneck, because it's written in Go it can keep up with large loads and the performance limit is the speed of the backend you're using.
12:35:51 <redrobot> this is all theoretical btw. I need to set up a for realsies Vault and actually put some numbers together
12:36:02 <jaosorior> that would be nice
12:36:14 <openstackgerrit> Martin André proposed openstack/tripleo-common master: Use upstream etcd container image for openshift https://review.openstack.org/576497
12:36:15 <jaosorior> redrobot: also, for failover, we could potentially write a pacemaker resource agent for Vault
12:36:16 <dhellmann> that sounds like a good way to test the castellan driver, too :-)
12:36:19 <moguimar> afaik the backend is the bottleneck, not vault itself
12:36:53 <dhellmann> so we need to figure out which backend to use, as well
12:37:10 <jaosorior> well, seems etcd is the best bet we have right now, given that tripleo can deploy it
12:37:15 <moguimar> vault only encrypt/decrypt stuff
12:37:23 <moguimar> the IO is done in the backend
12:37:28 <openstackgerrit> Martin André proposed openstack/tripleo-quickstart-extras master: Add openshift etcd image to image prepare params https://review.openstack.org/576837
12:38:03 <dhellmann> so it sounds like we need etcd regardless of whether we're worried about HA?
12:38:12 <moguimar> yep
12:38:15 <moguimar> etcd or consul
12:38:30 <moguimar> consul is also from hashicorp
12:38:46 <jaosorior> what are the other backend alternatives?
12:38:49 <raildo> btw, I'm just collecting some of this discussion on: https://etherpad.openstack.org/p/oslo-config-plaintext-secrets so we can come back later, in a future
12:39:25 <moguimar> the in memory vault backend should never be used in production
12:40:03 <moguimar> we can start with it to test the castellan integration then move on to connect vault to a real backend
12:40:44 <raildo> jaosorior, looks like they have a lot of plugin options for backend: https://www.vaultproject.io/docs/configuration/storage/index.html
12:41:22 <jaosorior> Well
12:41:22 <redrobot> Yeah, last time I looked only etcd and Consul were considered "HA"
12:41:27 <jaosorior> there is a mysql backend
12:41:32 <jaosorior> that we do deploy by default
12:41:32 <redrobot> but that may have changed, it's been a while
12:41:34 <jaosorior> why not use that?
12:41:47 <redrobot> > No High Availability
12:41:57 <redrobot> > the MySQL storage backend does not support high availability.
12:42:05 <jaosorior> What does that mean? :D
12:42:16 <dhellmann> that seems odd
12:43:37 <jaosorior> Vault will merely go to either it's local mysql instance, or the VIP (if it's not colocated), replication is handled elsewhere, so it's nothing vault has to worry about
12:43:44 <openstackgerrit> Martin Mágr proposed openstack/puppet-tripleo master: Collectd QDR connection https://review.openstack.org/571152
12:43:44 <openstackgerrit> Martin Mágr proposed openstack/tripleo-heat-templates master: Enable collectd to connect to metrics QDR https://review.openstack.org/576057
12:45:18 <redrobot> I want to say Vault itself has an "HA" option that can't be turned on when configured to use MySQL
12:45:33 <redrobot> but I cant recall off the top of my head what that actually implies
12:45:43 <dhellmann> ok, this feels like something that needs more investigation but that we're not going to answer here today
12:45:47 <jaosorior> redrobot: can you investigate on that?
12:45:51 <redrobot> yessir
12:45:54 <jaosorior> cause that would then be the easiest option
12:47:23 <gfidente> myoung|off I am trying to understand why scenarios 001/004 are failing https://review.openstack.org/#/c/564285/
12:47:27 <jaosorior> note that I'm actually not taking into account Hashicorp's enterprise HA offering...
12:47:51 <redrobot> jaosorior, noted
12:48:28 <jaosorior> alright
12:48:51 <jaosorior> redrobot: seems it all lies on you now :D
12:48:52 <myoung> gfidente: o/ good morning.
12:49:20 <weshay|ruck> quiquell|rover, http://logs.openstack.org/85/564285/20/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/ef7b33c/logs/df.txt.gz
12:49:30 <dhellmann> raildo : did we cover the topics you were hoping for? we have a few minutes left...
12:49:42 <myoung> gfidente: regarding the gate check jobs and scenario 001/004, I haven't sync'd up with realtime yet this morning, weshay|ruck or quiquell|rover should have current status/details
12:49:48 <raildo> dhellmann, I believe that are good for now, jaosorior, thanks for taking this time today for that discussion :)
12:49:58 <jaosorior> thanks for bringing it up!
12:50:03 <weshay|ruck> gfidente, what's up
12:50:12 <jaosorior> quite eager to see the result of the Vault research :D
12:50:21 <myoung> weshay|ruck: see above, he was asking about https://review.openstack.org/#/c/564285, scenario 1/4 fails
12:51:14 <weshay|ruck> myoung, gfidente see the alerts guys
12:51:22 <weshay|ruck> it's scen001/4
12:51:23 <jaosorior> #topic Any other business
12:51:28 <jaosorior> Anything else that folks want to bring up to the meeting
12:51:30 <jaosorior> ?
12:51:45 <raildo> nothing from me
12:52:14 <jaosorior> Arlight
12:52:15 <jaosorior> well
12:52:18 <jaosorior> thanks for joining everyone!
12:52:18 <dhellmann> nothign from me
12:52:23 <jaosorior> very interesting stuff!
12:52:27 <dhellmann> thanks, jaosorior , redrobot , & raildo
12:52:31 <moguimar> o/
12:52:31 <quiquell|rover> myoung, gfidente: one of them is RBD reporting 0 GB os disk free space
12:52:36 <jaosorior> #endmeeting