19:01:13 #startmeeting infra
19:01:14 Meeting started Tue Feb 12 19:01:13 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:15 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:18 The meeting name has been set to 'infra'
19:01:24 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:45 #topic Announcements
19:01:54 I have no announcements
19:02:35 #topic Actions from last meeting
19:02:43 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-01-29-19.01.txt minutes from last meeting
19:02:44 * diablo_rojo stands at the back of the room and makes a cup of tea
19:02:57 corvus: did update dns docs for us.
19:03:02 #link https://review.openstack.org/#/c/633569/ Update DNS docs
19:03:14 fungi: ^ that change in particular was one you noticed we need. Should be a quick review if you have a moment
19:03:27 awesome, thanks!
19:04:03 #topic Specs approval
19:04:12 #link https://review.openstack.org/#/c/587283/ LetsEncrypt Spec
19:04:26 I'm not sure this spec is ready for approval. But wanted to have some time to talk about it so that we can get it there
19:04:48 yeah, given the opendev developments, dns based approach might be better
19:04:52 As we do more opendevy things not needing to run out and pay $8.88USD for new certs for every service would be great
19:05:08 I also wanted to check in if our changes to DNS and config management have changed anyones thoughts on how we should approach this
19:05:08 since we run our own for that
19:05:29 ianw, yup I noticed that if we wanted to use the same cert with .openstack.org altnames it is still a problem but we may be able to avoid that in most cases
19:05:35 s/noticed/noted/
19:05:54 i'm unclear on how we would do that other than pushing new dns zone changes for review and hurriedly approving them
19:05:54 in particular we could maintain separate vhosts for openstack.org that redirect to opendev.org and continue to use the old school by hand management of those certs
19:06:25 fungi: I think that is likely how it would work and if we have a renewal period with say 10 days of leeway we should be able to approve those quickly enough
19:06:35 fungi: we could also potentially have it bypass review
19:06:43 how long is the challenge/response window for le acme servers, generally?
19:07:45 i've only ever used the tool with a live apache where it's seconds
19:08:15 but surely there's enough time for dns updates built in
19:08:39 i will investigate how it can work with our new opendev setup and update the spec
19:09:29 maybe it's worth a day or two's coding to do something hacky and bespoke to integrate with rackdns
19:09:57 we're leaving rackdns
19:10:06 on the thought that it's "temporary" ... or at least won't be the primary dns option
19:10:24 the only dns that matters in the future is opendev dns, which is managed in git
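(For context on the dns-based approach discussed above: answering an ACME DNS-01 challenge from a git-managed zone amounts to landing a TXT record and waiting for it to be served. The zone file path, record value, and hostname below are illustrative guesses, not the actual repo layout.)

```bash
# Sketch only: add the challenge record the ACME server hands back to the
# git-managed zone, then confirm it is being served before asking the CA
# to validate.
cat >> zones/opendev.org/zone.db <<'EOF'
_acme-challenge.etherpad  300  IN  TXT  "example-token-from-the-acme-server"
EOF
# bump the SOA serial, propose/approve the change, then check it resolves:
dig +short TXT _acme-challenge.etherpad.opendev.org
```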
19:10:43 (i mean, we can manage it some other way if we need, it's managed by us. but right now it's managed in git)
19:10:59 corvus: the one place it might matter is where we want to verify ownership of openstack.org domains for cert altnames
19:11:08 sure, maybe i weigh the status quo a bit more heavily than corvus, but definitely the spec wants updating now opendev is formalised
19:11:22 but as mentioned earlier we can probably avoid that and manage those certs in a more old school fashion then eventually stop doing openstack.org -> opendev.org redirects
19:11:22 so i'll do that
19:11:48 well, for the redirect to work we'd still need the original in the san list
19:11:59 i grant there will be a transition period, but i feel fairly certain that things are moving quickly enough that investment in rackdns would be misplaced.
19:11:59 ianw: the other piece I'm trying to wrap my head around is whether or not the centralized config management push out of the cert data is still desirable with ansible/docker
19:12:32 i don't know that we should be thinking about openstack/opendev altnames
19:12:40 having a redirect from https://etherpad.openstack.org/ to https://etherpad.opendev.org/ doesn't help if the certificate served by that server doesn't include etherpad.openstack.org as an altname
19:12:59 fungi: other option is multiple vhosts with distinct certs
19:13:00 i think it may be much easier to put all of our redirects on a static server.
19:13:04 unless we maintain separate certs for those two i guess
19:13:29 yeah, putting the redirects somewhere other than the servers to which they redirect could be one way out
19:13:32 yeah, i'd much rather us throw a bunch of vhosts on files.o.o serving htaccess redirects so our new opendev servers don't have a bunch of complicated vhost baggage
19:13:41 (or a bunch of complicated cert baggage)
19:14:11 long story short we should be able to make a dns story work now that we have better direct control of our dns
19:14:18 so ++ to investigating that further and updating the spec
19:14:34 clarkb: to my mind, the certs are secrets and having it managed like all the other secrets is the easiest thing
19:14:47 but, i'm sure opinions vary
19:15:11 i'd be happy to treat certs as being more ephemeral so that bringing up a new service could be completely automatic.
19:15:58 the main constraint we are trying to accommodate for here is the first bring up of the webserver right? but if we switch to dns then maybe it could be entirely ephemeral without that constraint to accommodate?
19:17:03 if the certs weren't handled centrally, they could be treated as ephemeral along with their keys
19:17:17 maybe we can add an alternative to the spec that roughly sketches out what ^ looks like given the dns verification?
19:17:25 have each server create a key, a csr, and initiate dialogue via acme to fetch a cert, then serve that
19:17:26 then we'll be better able to evaluate the two major options?
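(A minimal sketch of the per-server flow just described, assuming certbot as the ACME client and made-up hostnames and paths; a real deployment would script this rather than run it by hand.)

```bash
# Generate a key and CSR locally so the private key never leaves the server,
# then ask the ACME client to fetch a cert using a DNS challenge.
openssl genrsa -out /etc/ssl/private/etherpad01.key 2048
openssl req -new -key /etc/ssl/private/etherpad01.key \
  -subj "/CN=etherpad.opendev.org" -out /tmp/etherpad01.csr
certbot certonly --csr /tmp/etherpad01.csr --manual --preferred-challenges dns
```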
19:17:35 i guess if the challenge response is already in dns, and the acme update happens before the webserver starts, it could work
19:17:51 that also has the benefit that the keys never leave the servers which use them, so less risk of compromise shipping them around unnecessarily
19:18:03 ianw: ya I'm thinking we may be able to sufficiently simply the config management side that it's simpler overall to treat them as ephemeral
19:18:10 s/simply/simplify/
19:18:14 my typing is particularly bad today
19:18:26 an alternative to using dns there is to set the webserver up to use a self-signed cert and then have certbot or whatever replace that once the challenge/response is concluded
19:19:07 fungi: yes, i mean mostly the idea was to have to avoid re-writing as much legacy puppet as possible but still have things work
19:19:30 the same key could be reused for both the snakeoil and the csr, so you only need to generate one key
19:20:12 ianw: i expect both those models could be used in parallel. the central broker as a stop-gap, with local per-server acme as the eventual goal to work toward?
19:21:03 and as we add services like gitea without that legacy we might be able to use the preferable setup whatever that may be
19:21:06 that way we still eventually get to stop tracking keys and certs in hiera
19:21:31 though I'm not sure how that will tie into octavia
19:21:38 corvus: mordred ^ do you know what that looks like?
19:21:51 does octavia need layer-7 visibility?
19:21:58 fungi: no
19:22:06 fungi: true; the DNS approach is certainly simpler. i know jhesketh's comments/preference was to just use that on rackdns too
19:22:08 fungi: I'm not sure if our k8s setup does though
19:22:15 if it can work at layer-4 and just pass the sockets through, then not terminating https on the lb ought to be fine
19:22:42 yeah, we're doing l4 with octavia
19:22:46 clarkb: I do not know the answer to that question - but I think we're just l4
19:22:51 corvus beat me to it
19:22:54 we will terminate ssl in k8s
19:23:04 cool thanks for confirming
19:23:04 we haven't done that yet
19:23:34 thanks for reminding me we still need to do that :)
19:24:25 ok it sounds like the definite next step is reevaluate dns as verification method. And then maybe given that new context we can reevaluate centralized control as the desired end goal? It may end up being a path to the end goal
19:24:38 did I capture the discussion well enough in ^
19:25:13 clarkb: ++ yes, i'll go back at it with all this in mind now
19:25:15 seems sane
19:25:25 great, thanks everyone.
19:25:35 #topic Priority Efforts
19:25:38 #storyboard
19:25:41 er
19:25:45 #topic Storyboard
19:26:01 fungi friday still looking like a good day for upgrading production storyboard?
19:26:23 and has the local db move aided with debuggability? I think we decided it didn't drastically change performance
19:27:03 yes, i need to send an announcement about brief 5-minute friday downtime
19:27:53 moving to a local db on storyboard-dev didn't change performance but did allow us to see that the db worker process spikes up to consume an entire cpu for 10 seconds or so whenever you try to load a tag-query-based board in your browser
19:28:09 oh good!
19:28:25 fungi: ok so valuable after all
19:28:53 so suspicions around highly inefficient queries mostly confirmed there, and we at least have more options to debug deeper and profile things now, i guess
19:29:11 great.
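(An example of the kind of digging a local database makes easier; the slow-log settings and file path here are illustrative, not the actual storyboard-dev configuration.)

```bash
# Turn on the slow query log, reproduce by loading a tag-query-based board,
# then summarize what ran longest.
sudo mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"
sudo mysql -e "SHOW FULL PROCESSLIST\G"
sudo mysqldumpslow -s t /var/lib/mysql/mysql-slow.log | head -n 20
```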
19:29:24 in other news, SotK has pushed up a bunch of changes to implement story attachments
19:29:34 Are there any changes we need to get in place before we can do the production upgrade? or is it mostly just a matter of doing it at this point
19:29:39 oh nice
19:30:36 the only blocker we found is already fixed now. un-capping sqlalchemy last november resulted in a much newer version installed on the new servers which broke adding worklists to boards because of the way inferred nulls were treated in insert queries
19:30:57 #link https://review.openstack.org/#/q/topic:story-attachments
19:31:02 non-backward-compatible change between sqlalchemy 0.9.10 and 1.0.0
19:31:44 thankfully SotK was able to track down the offending code and whip up a several-line hotfix there
19:31:46 and the fix was a roll forward fix right?
19:31:53 correct
19:31:54 (eg we didn't reinstall old sqlalchemy)
19:31:56 Correct
19:32:13 Sounds good. Any other storyboard topics?
19:32:30 I think fungi covered the big stuff
19:32:59 oh, and SotK has a dockerfile he's using to do test deployments
19:33:17 it's in his github repo at the moment but he's going to push it up for review soon it sounds like
19:33:43 ++
19:33:51 so that might make for easier onboarding of new contributors too
19:34:00 we're pretty good at testing dockerfile builds these days if you want to add a job to verify it works
19:34:02 and that's all the news i've got
19:34:11 corvus: yep, i suggested that as well
19:34:34 that all sounds excellent
19:34:36 we could even start publishing images if we wanted
19:34:47 Not to jump in but we have a few more topics so lets keep moving
19:34:49 #topic Update Config Management
19:34:52 diablo_rojo, SotK: ^ lemme know if you want help with that
19:35:05 clarkb: ++
19:35:13 cmurphy and I got etherpad-dev01 running puppet4 at fosdem
19:35:26 there was one minor error/warning in the logs that I got fixed yesterday
19:35:41 oh hi
19:35:44 corvus, that would be awesome
19:35:52 I think the next steps there are to puppet futureparser the afs and kerberos servers, test puppet-4 on a centos machine, then we can puppet-4 all the things
19:36:00 when SotK gets the patch up we can chat about that
19:36:25 corvus: mordred is there an especially good time to take the puppet futureparser plunge on the afs and kerberos machinery? if not I'll probably try to push that along tomorrow/thursday
19:36:48 For the zuul as CD engine stuff https://review.openstack.org/#/c/604925/ was rebased to fix a merge conflict and updated to address a comment from pabelanger
19:37:00 that change adds a zuulcd user to bridge that zuul jobs can use to ssh in and run ansible
19:37:10 corvus: mordred ^ I think that is potentially interesting with the gitea work
19:37:17 (since the gitea work is so far mostly self contained)
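(A rough, purely illustrative sketch of what the zuul-as-CD-engine change enables: a job ssh-ing to bridge as the zuulcd user and running a playbook there. The playbook path and ansible options are assumptions.)

```bash
# From a Zuul job's playbook (or a shell task in it): connect as the
# restricted CD user and run the production playbook on bridge.
ssh zuulcd@bridge.openstack.org \
  'ansible-playbook -f 20 /opt/system-config/playbooks/service-gitea.yaml'
```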
19:37:28 clarkb: i can't think of a good time. but mind quorums and it mostly should be fine.
19:37:34 corvus: ok
19:37:44 clarkb: yeah, i think we can move to zuulcd on that fairly quickly
19:38:07 that is what I had for config management updates
19:38:12 cmurphy: anything else to call out re puppet-4?
19:38:22 i don't think so
19:38:28 it's all in topic:puppet-4
19:38:51 cool. Continuing to move on due to hour meeting and unexpectedly full agenda :)
19:38:56 #topic OpenDev
19:39:27 we're very close to having the opendev gitea stuff running via ansible
19:39:40 I left a -1 comment on the change that cleans some of that up. I think there was maybe a missing git add?
19:39:43 mordred: ^ did you see that?
19:39:47 it would be great if folks could keep on top of the gerrit review topic
19:39:58 we try to run an ansible playbook that doesn't seem to exist
19:40:02 clarkb: he updated
19:40:24 clarkb: I just pushed an update a few minutes ago
19:40:25 ah cool I missed it with meeting prep
19:40:27 i believe i've mostly got the cgit redirects written as well
19:40:34 #link https://review.openstack.org/#/q/topic:opendev-gerrit+status:open is the change listing for related changes
19:40:53 also - I got a patch upstreamed into gitea that handles the cgit style line number lines
19:41:00 mordred: \o/
19:41:03 so next release we should be able to drop our local hack
19:41:06 nice
19:41:06 so i think within a few days (realistically, next week due to absences this week), we're going to be in a good place to really start pushing forward on this
19:41:13 ++
19:41:23 i think what i'd like to do now is break out a slightly more detailed task list than what's in the spec
19:41:39 and try to get people to put their names on things
19:41:56 does that sound like something we should do next week?
19:42:19 corvus: that sounds great to me. I think it will help find places for people to slot in (also encourage interaction with the new system(s))
19:42:32 the other thing i think maybe it's time to think about is whether we should start sending emails about the move
19:42:49 ++ as someone watching, some chunks that can happen more standalone would help to contribute
19:43:06 corvus: it would probably be good to have a permanentish home we can point people to. So they can interact with it?
19:43:20 I'm not sure if we feel we are there yet (https maybe? I guess not strictly required)
19:43:32 but ya we should communicate it as soon as we are happy to have people look at it
19:43:54 specifically -- is it time to send a heads up to the top level projects saying we don't have a date for the move yet, but we will soon. in the meantime, please start thinking about what project renames you would like to do as part of the move.
19:44:49 clarkb: yeah, i think maybe we should not communicate before gitea is served from opendev.org, that way there will be a nice thing for people to look at and hopefully be happier. :)
19:45:24 ++ once that is in place I think we should communicate that this is the tool we're moving towards and have projects communicate any other input like the renaming stuff
19:46:32 okay. i'll put that on the task list too. :)
19:46:58 i think that's it for opendev-gerrit
19:47:13 ok we've got ~13 minutes and two more topics so lets keep moving
19:47:23 #topic General Topics
19:47:27 i wonder whether we shouldn't rename something first to have a slightly more concrete example to point at
19:47:46 but we can discuss later
19:47:59 fungi: rename what?
19:48:20 corvus: openstack-infra/zuul -> zuul-ci/zuul (or zuul/zuul) maybe?
19:48:25 corvus: I think a project being hosted
19:48:38 we can't rename it before we switch the hostname
19:49:25 could be confusing to do so before
19:49:39 but we can followup on that. Lets go to next topics before we run out of time
19:49:40 right, i guess it's a question of do we want to start talking about renames before we've already renamed something? but i guess communicating early and then having to alter plans later when we discover we didn't take something into account is fine too
19:49:42 for example, git.openstack.org/openstack-infra/zuul -> opendev.org/zuul/zuul will happen at the same time.
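(Illustrative only: what the planned redirect would do for the example above. The actual rules live in the changes under the opendev-gerrit topic; the RedirectMatch line is just a guess at their shape.)

```bash
# e.g. an Apache rule roughly like:
#   RedirectMatch permanent "^/openstack-infra/zuul(/.*)?$" "https://opendev.org/zuul/zuul$1"
# and a quick check once it is live:
curl -sI https://git.openstack.org/openstack-infra/zuul | grep -i '^location:'
# expected: location: https://opendev.org/zuul/zuul
```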
19:49:54 this is important
19:49:58 i don't think we're talking about the same thing
19:50:34 i'll have to reread. thought we were talking about repo/namespace renames
19:50:49 when we switch domain names, we will be renaming all of the repositories because we will change their hostnames
19:51:07 grr
19:51:11 we need better words to talk about this
19:51:27 we will also be renaming some git repos as well
19:51:44 i see. we have to perform the server dns name changes at the same time as the repository namespace changes
19:52:03 so we need people to tell us what namespaces they plan to use in advance, before we try to do that
19:52:09 i think maybe we should give projects as much time as possible to decide to opt in to a namespace change along with the hostname change
19:52:22 ya that will reduce renaming pain
19:52:23 because it's a very high cost to a project to rename a repo, and a low cost for us to do so at that time
19:52:26 one transition instead of two
19:52:27 clarkb: exactly
19:52:55 #link https://etherpad.openstack.org/2019-denver-ptg-infra-planning 2019 Denver PTG Planning
19:53:07 and every project is going to incur some amount of cost with the hostname rename, so we may as well solicit ways to minimize additional cost.
19:53:23 This is mostly a heads up that ^this etherpad exists and I plan to start filling in ideas this afternoon
19:53:24 corvus: ++
19:53:28 okay, i gather we're all on the same page now :)
19:53:36 *and* we have better words
19:53:46 i'm pretty sure for a lot of projects, they won't notice they can have new namespaces until after we perform the dns change and some namespace changes along with that, and then they'll all start queuing up with new namespace requests. but i'm not really sure how to solve that bit of human nature and resistance to paying attention to announcements
19:53:49 if you have PTG ideas please add them to the etherpad too
19:54:14 fungi: we might directly reach out to individuals we expect to be interested
19:54:34 ildikov and hogepodge and corvus for that set of foundation level projects for example
19:54:42 we might suggest to the tc if they want to evict non-official projects, this could be an appropriate time.
19:54:50 corvus: all yours to talk about zuul-preview when ready
19:54:54 corvus: ++
19:54:57 and yeah, we might make sure that starlingx knows
19:55:42 o/
19:56:00 in pitching in on the netlify work, we realized that a lot of apps out there expect to be hosted at the root of a domain
19:56:33 so we wrote a thing called zuul-preview which is basically a very simple http proxy for our log servers
19:56:45 it will work with both the current static and future swift log servers with no changes
19:56:56 and it runs as a small docker container
19:57:01 corvus: I assume the deployment method in mind is to build the software in a container and deploy it that way?
19:57:13 (so we don't have to worry about packaging or build toolchain in config management)
19:57:18 yeah, in almost exactly the same way that we now have the docker registry deployed
19:57:32 (btw, we're running a docker registry -- it's our first production docker service)
19:57:52 so as soon as the zuul-preview source repo lands, i'd like to deploy that in opendev
19:57:58 my only other thought is that small stateless services like this would be perfect for a multi function k8s
19:58:15 that way we don't have to run dedicated "hardware" for each service, get some redundancy and packing efficiency
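(A hedged sketch of deploying zuul-preview "the same way as the registry": a single small stateless container. The image name, published port, and environment variable are guesses for illustration, not the real deployment.)

```bash
# Run the proxy as a stateless container; it proxies requests through to
# wherever the logs actually live, so changing log storage later should not
# require changing the container.
docker run -d --restart=always --name zuul-preview \
  -p 80:80 \
  -e ZUUL_API_URL=https://zuul.openstack.org/api \
  zuul/zuul-preview
```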
19:58:19 when you say "will work with [...] future swift log servers" you presumably mean by running zuul-preview on a separate dedicated server/container and having it fetch/serve the data zuul archives to swift
19:58:24 but probably too much of a requirement to couple new k8s cluster with that
19:58:36 but it's a new service, and it's a slight deviation from my "lets stop running proxies" mantra (which is now "lets stop running complicated proxies"), so i thought it worth some discussion. :)
19:58:56 but we do have a number of other almost stateless small services we could collocate like that maybe
19:58:57 fungi: it proxies to any hosting url, it doesn't care what backs it.
19:59:00 (something to keep in mind I guess)
19:59:18 corvus: right, but point being it won't somehow get served by swift, it'll be running somewhere else
19:59:27 clarkb: yeah, i agree, though i think that's a better future thing after we k8s better
19:59:39 fungi: the swift there is location of the logs not the service
19:59:48 fungi: since we are trying to eventually put the logs in swift too
20:00:17 fungi: the thing i mean to say here is that the docker container we'd be running for this won't have to change when we change our log storage. because it neither knows nor cares.
20:00:24 as i understand it the site preview builds would get archived to swift just like job logs or similar artifacts
20:00:30 fungi: correct
20:00:37 corvus: no real objections from me. We may want to write down our thoughts on proxies though (since some observers may be unhappy about the loganalyze changes? maybe? I dunno)
20:00:49 but zuul-preview will be a webserver we run somewhere
20:01:11 fungi: yup it will have to run on a "server"
20:01:15 clarkb: well, i'm still hoping we managed to land some mitigating patches in zuul before we drop osla
20:01:22 corvus: cool
20:01:32 so at least in this case still somewhat like how osloganalyze works on the existing logs.o.o if it were backed with data from somewhere else not local
20:01:49 we are at time. I'll let current discussion finish before ending the meeting. Thank you everyone
20:02:00 yes, and we're one line of apache config instead of a mod_wsgi script :)
20:02:03 thanks clarkb!
20:02:04 yeah. I think writing down proxy thoughts is likely a good idea ... one of the differences I see here is that zuul-preview is project/content agnostic, whereas osla is a specific proxy filter that applies to some of the content produced by some of the projects hosted
20:02:25 corvus: well with a bunch of code running behind apache still
20:02:39 (still far less than osla)
20:02:43 clarkb: none of the code is involved in processing or serving the data
20:02:51 ya
20:02:57 one line of apache config and a (c++/rust/python) cli vs a few lines of apache config and a wsgi script, but sure
20:03:30 i see you disbelieve me, but i think i'm trying to make a valid point
20:03:39 corvus: ++ control-plane custom code vs. data-plane custom code
20:03:41 worth calling the lack of processing data out though as that seems to be the distinction we are making with the proxies we'd want to run
20:03:48 there was a patch several years ago to add swift backend support to osla
20:03:53 ya it does greatly simplify the serving of the data
20:03:57 it never landed because it exploded the complexity of the program
20:04:06 corvus: we actually did land it
20:04:07 *that* is the distinction i'm trying to make here
20:04:13 and reverted it?
20:04:13 i don't disbelieve, just trying to relate the level of server maintenance to something we currently have, for context
20:04:20 then sdague made us revert it because it broke some user functionality he wanted
20:04:51 corvus: ya the big thing for sdague was sorting of periodic jobs, swift doesn't do file mod times like posix fs. We now get that functionality out of the zuul builds listings
20:05:02 so anyway, that event directly related to this
20:05:05 but at the time there wasn't a good way to look at today/yesterday's periodic job
20:05:14 i don't want to manage a giant proxy that does all things for everyone
20:05:24 and, i don't think anyone else here does either :)
20:05:28 ++
20:05:36 i don't even think sdague did when he was around :)
20:05:42 yup, the constrained scope of the proxy being basically a redirect lookup is far easier to manage
20:05:51 ya, that's what i'm trying to say :)
20:06:05 i do agree lots of little proxies with their own simple tasks are easier to reason about than one big proxy doing everything
20:06:23 i hope we don't grow a lot of little ones either, but i admit it's a possibility
20:06:56 i think most of what we get from osla can actually move into zuul javascript. i think that's still the plan, and we're making small steps toward that.
20:07:03 and the close relationship of this to a particular use case for a general class of zuul jobs helps too
20:07:24 since it's less likely to be another one-off codebase we're maintaining just for our own deployment
20:07:34 yeah. I think that's a real key here - it's a general purpose thing that is applicable to a class of zuul jobs - and is applicable to non-openstack deployments too
20:08:21 zuul-preview seems far more reusable by others outside the context of our particular ci deployment than osloganalyze is (or likely could ever be)
20:09:28 alright I'll end the meeting now. Find us in #openstack-infra or openstack-infra@lists.openstack.org if you need to get in touch before next week's meeting
20:09:31 #endmeeting