19:01:08 <clarkb> #startmeeting infra
19:01:09 <openstack> Meeting started Tue Jun 26 19:01:08 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 <openstack> The meeting name has been set to 'infra'
19:01:22 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:23 <ianw> o/
19:02:00 <clarkb> The agenda hasn't been updated for today, as I'm somewhat underprepared chasing a bunch of stuff this morning as well as watching WC. That said it's not too terribly wrong
19:02:16 <clarkb> #topic Announcements
19:02:56 <clarkb> As mentioned world cup is happening. Big game right now :)
19:03:09 <clarkb> Other than that don't forget to take the openstack user survey if you run or use an openstack cloud
19:03:25 <clarkb> provides valuable feedback to openstack on what is important and what can be improved
19:03:38 * mordred can't take the survey - too much football on tv
19:04:15 <clarkb> #topic Specs Approval
19:04:44 <clarkb> Monty's spec for future config mgmt is up and no longer WIP. I don't think it is ready for approval but we should all review that this week if we can find time
19:04:46 <clarkb> #link https://review.openstack.org/#/c/565550/ config mgmt and containers
19:04:47 <patchbot> patch 565550 - openstack-infra/infra-specs - Update config management for the Infra Control Plane
19:05:08 <fungi> thanks mordred!
19:05:44 <corvus> notmyname: ^ can you clear out patchbot please?
19:05:53 <clarkb> If possible I'd like for next week's meeting to be able to have a discussion about it in detail if necessary and we can work on merging that spec the week after (if not next week if everyone agrees with what monty has already written)
19:05:56 <mordred> \o/
19:05:56 <clarkb> tl;dr please review :)
19:06:33 <clarkb> though fungi and I will be traveling the week after
19:06:34 <mordred> I tried to list all the things - but to leave some bits open for impl - definitely feedback welcome if I missed something or am just dumb
19:06:51 <fungi> yup
19:07:17 * fungi is disappearing a lot next month
19:08:09 <clarkb> as for working on this at the PTG I've heard rumor we'll have ~3 days of room space as well as a day or two of general help room
19:08:11 * mordred throws paint on fungi to try to defeat the invisibility
19:08:18 <clarkb> I think that will work well for digging into this in denver
19:08:40 <mordred> yah. if people are happy with the direction - there are several tasks I think I should be able to knock out by denver
19:08:59 <corvus> i don't expect us to have any dedicated zuul time at this ptg, so those of us with multiple hats should be able to focus on this too
19:09:04 <clarkb> I should get an etherpad for the PTG going so that we can start coordinating what happens there vs what happens prior
19:09:11 * clarkb makes a note to get that going today
19:09:36 <clarkb> GOL! (sorry)
19:10:01 <fungi> note that there will, again, be a cross-project "helproom" for a couple days where we can still help people with questions about zuul job configuration and the like
19:10:03 <corvus> (though getting this done unlocks a bunch of neato zuul stuff, like running zuul from containers and more cd of zuul)
19:10:07 <corvus> fungi: ++
19:10:07 <mordred> clarkb: I think we're watching different games
19:10:53 <fungi> the futbol is multi-threaded
19:11:01 <clarkb> mordred: possibly, I was leaving that info out for people that might be avoiding spoilers
19:11:21 <clarkb> corvus: yup I think this opens a few exciting followup threads once we get things rolling
19:11:26 <fungi> highly parallel gol processing
19:11:51 <clarkb> #topic Priority Efforts
19:12:03 <clarkb> Exciting updates have happened on the storyboard database query side of things
19:12:14 <clarkb> you might notice that storyboard boards are much quicker now thanks to dhellmann
19:12:19 <fungi> mostly thanks to dhellmann
19:12:24 <fungi> thanks again dhellmann!
19:12:43 <fungi> it's amazing how much adding an index on a column can speed stuff up ;)
19:12:51 <dhellmann> knowing we were missing an index made that pretty easy to figure out
19:13:03 * mordred hands dhellmann a fluffy bunny rabbit
19:13:12 * dhellmann gets out the stew pot
19:13:21 <mordred> mmm. rabbit stew
19:13:24 <dhellmann> mmm
19:13:44 <clarkb> fungi: any other storyboard related items worth bringing up?
19:13:56 <fungi> we're still struggling a bit on the name-based project urls... something's not quite right with decoding url-escaped slashes in project names
19:14:25 <clarkb> fungi: does it work without apache in front of it? apache likes to mangle those
19:14:27 <fungi> see discussion in #storyboard for current status
19:14:40 <fungi> yeah, the pecan-based dev server works fine
19:14:52 <fungi> but when running the api server through apache it's a problem
19:15:38 * mordred is very excited about the future of name-based urls
19:15:46 <fungi> i spent a good chunk of last night experimenting but couldn't come up with a workaround, though i also don't fully grasp the api routing in sb
19:16:25 <fungi> so adding debugging wasn't easy
19:16:51 <fungi> in other news, vitrage migrated all their deliverables from lp to sb on friday and that seems to be going well
19:17:32 <fungi> oh, also important, ianw noticed that the occasional rash of 500 internal server error from write operations may be related to rabbitmq disconnects causing the socket to it to get blocked
19:17:50 <clarkb> fungi: rabbitmq runs on the same instance though right? odd for there to be disconnects
19:18:00 <ianw> at first glance at the code, it looked like it was trying to handle that
19:18:02 <fungi> yeah, unless it gets restarted or something i suppose
19:18:18 <ianw> but something in pika seemed to get itself stuck, if i had to guess
19:19:04 * ianw pats myself on the back for such an excellent, helpful bug report :)
19:19:43 <fungi> it's more detail than we've gathered on the situation to date
19:19:58 <fungi> so thanks for the insightful observation
19:20:56 <clarkb> anything else re storyboard?
19:20:58 <fungi> no other storyboard news afaik
19:21:35 <clarkb> Ok and to follow up on config management proposed changes please go review https://review.openstack.org/#/c/565550/ and we'll catch up on that in more detail next week.
19:21:42 <clarkb> #topic General Topics
19:21:58 <clarkb> Why don't we start with a packethost cloud update. We've set max servers to 0 due to the mtu problem.
19:22:17 <clarkb> I've got a couple changes up to address that in zuul-jobs and devstack-gate for our network overlay setup
19:22:31 <fungi> where "the mtu problem" is described simply as the interface mtu on instances there is 1450
19:22:45 <clarkb> right it is smaller than we assume (1500) in a few places
19:22:56 <fungi> (same as in our linaro arm64 cloud, too)
19:23:08 <clarkb> #link https://review.openstack.org/578146 handle small mtu in devstack-gate
19:23:24 <clarkb> #link https://review.openstack.org/#/c/578153/ handle small mtu in zuul-jobs
19:23:43 <clarkb> if we can get those in and osa isn't strongly opposed I'd like to turn packethost back on and see if we get more reliable results
19:25:36 <mordred> ++
19:26:15 <clarkb> I should also write an email to the dev list explaining we can't make these assumptions anymore
19:26:23 <corvus> that looks like a good solution and backwards compat, so merging into zuul-jobs should be fine
19:26:32 <corvus> at least, backwards compat for working setups :)
19:26:36 <clarkb> corvus: yup should be backward compat if you already had a working setup, yes to that
19:27:29 <clarkb> kolla and tripleo are the other two I'm semi worried about but I think tripleo has done a decent job reconsuming our overlay tooling
19:27:37 <clarkb> I expect they will get a working setup if we update our tools
19:28:39 <clarkb> The other packethost issue is a bit more osa specific according to logstash and has to do with tcp checksum errors. Possibly related to mtus but osa also explicitly sets some iptables checksum options
19:29:22 <corvus> this might be worth a quick email then, just to make sure kolla/tripleo/other folks see the changes?
19:29:41 <clarkb> yup, I'll write one up explaining we have more than one cloud with smaller MTUs now and we have to stop assuming 1500
19:30:31 <clarkb> hopefully it isn't too controversial as it is a problem we've made for ourselves via neutron overlay networking :)
19:30:34 <fungi> and we could stand to update the testing environment document accordingly
19:30:43 <clarkb> fungi: I have a patch up for that
19:31:00 <clarkb> looks like it must've merged, it's not in my open queue anymore
19:30:53 <fungi> ahh, see what i miss when i go to get lunch?
19:31:24 <fungi> i'm sure i was +2 on that in spirit
19:31:25 <clarkb> https://review.openstack.org/578159
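
(Illustration: the zuul-jobs and devstack-gate changes above boil down to deriving the overlay MTU from the instance's real interface MTU rather than assuming 1500. A minimal Ansible sketch of that idea follows; the bridge name, the 50-byte VXLAN allowance, and the use of the default-interface fact are assumptions for the example, not the actual content of 578146 or 578153.)

    # Sketch only: size the overlay bridge below the detected interface MTU
    # instead of assuming 1500. bridge_name and vxlan_overhead are
    # illustrative values, not real role variables.
    - name: Size the overlay bridge below the instance MTU
      hosts: all
      become: true
      vars:
        bridge_name: br-infra      # hypothetical bridge name
        vxlan_overhead: 50         # room for VXLAN encapsulation
      tasks:
        - name: Compute the overlay MTU from the default interface
          set_fact:
            overlay_mtu: "{{ ansible_default_ipv4.mtu | default(1500) | int - vxlan_overhead }}"

        - name: Apply the MTU to the overlay bridge
          command: "ip link set dev {{ bridge_name }} mtu {{ overlay_mtu }}"
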
19:33:00 <clarkb> On winterstack naming I followed up with jbryce last week really quickly and the plan he has proposed is that he wants to double check a couple foundation people have had a chance to look it over then he will request we start +1'ing or similar to whittle our list down. He and a bunch of the foundation staff are traveling in APAC right now so that may not happen this week
19:33:11 <clarkb> feel free to continue adding suggestions if you have new ideas
19:35:16 <clarkb> SSL certs for just about everything were updated late last week. Rooters are no longer getting all those emails every day :)
19:36:00 <clarkb> One thing we/I discovered in that process is that we can no longer rely on email based verification for the signing requests
19:36:12 <corvus> oh?
19:36:32 <clarkb> yup, GDPR fallout is that our registrar for .orgs is not publishing contact info in whois anymore
19:36:50 <corvus> is there no forwarder?
19:37:10 <clarkb> and despite asking them fairly directly to publish that data they don't seem interested. This wasn't a problem for openstack.org as namecheap lets you use hostmaster@openstack.org but for openstackid.org it was more problematic
19:37:16 <clarkb> corvus: not that I could determine
19:37:28 <fungi> it appears that the .org tld registry is special in this case
19:37:29 <clarkb> The solution I used was to use DNS record based verification
19:37:45 <clarkb> I created a random string CNAME to a comodo name and then after 20 minutes they checked and found it there and signed the cert
19:38:05 <corvus> is one cname sufficient for all certs, or did you have to do that for each?
19:38:07 <fungi> .com and .net (for example) are still publishing technical, anuse and administrative contact e-mail addresses and mailing addresses in the public 43/tcp whois
19:38:14 <clarkb> corvus: one for each name in a cert
19:38:16 <fungi> s/anuse/abuse/
19:38:21 <corvus> blech
19:38:43 <corvus> informally, what are our latest thoughts about exploring letsencrypt?
19:39:03 <corvus> any questions/problems i can help answer/research?
19:39:11 <clarkb> fungi has apparently been poking at it. The process for verifying certs with them is basically the same if using DNS
19:39:33 <fungi> i've gotten over my organizational concerns with the isrg members and am trying it out on some of my personal systems. at this point i'm more concerned about the bootstrapping problem
19:39:34 <clarkb> you use a TXT record instead of a cname though iirc
19:40:39 <clarkb> most of the tooling that seems to exist today assumes root access which bugs me (but is necessary if listening on a privileged port for verification)
19:40:41 <fungi> yeah, the acme rfc draft has details, but basically it's a specially-formatted txt record in your domain, _or_ serving a string from a webserver already present at the dns name in question
19:40:41 <corvus> i've been using apache
19:41:13 <corvus> any reason not to use http instead of dns?
19:41:30 <clarkb> corvus: mostly my concern with it needing root to perform config management tasks out of band of config management
19:41:33 <fungi> the dns solution presumes orchestrating dns, and the www mechanism assumes you'll do multi-stage deployment initially with no ssl cert
19:42:04 <clarkb> separately I also think the cert limit per domain could bite us with letsencrypt
19:42:11 <corvus> fungi: true, it is difficult to convince apache to start with no cert... but maybe we could have config management handle that? if no cert, don't add the ssl hosts?
19:42:13 <fungi> the www solution can also be worked around with a proxy sort of setup, but then boils back down to dns orchestration
19:42:17 <clarkb> a wildcard is an option but that reduces segmentation
19:42:55 <fungi> though unrelated, we noticed that *.openstack.org is already in use for a cert with the caching proxy service the foundation is contracting for www.openstack.org
19:43:20 <corvus> we could put a handler in all of our vhosts for /.well-known/acme-challenge/
19:43:22 <fungi> somewhat worrisome, and makes me eager to be on our own domain for important things
19:43:31 <corvus> fungi: yeah
19:43:48 <clarkb> corvus: ya we can negotiate the verification exchange ourselves rather than use the published tools
19:43:51 <corvus> i mean, we put in a lot of work to avoid doing that and then poof
19:44:07 <fungi> the ubiquitous acme handler solution is compelling, though still means eventually-consistent since the first pass of configuration management won't bring up working https
19:44:33 <corvus> clarkb: oh, i just meant that all we need from apache is a non-ssl vhost which supports serving static files in /.well-known/acme-challenge/. then we can use certbot
19:44:59 <clarkb> oh will certbot operate without opening the port itself?
19:45:13 <clarkb> last I looked that was a requirement iirc and jamielennox had written a tool to avoid that
19:45:16 <fungi> it operates over https (80/tcp)
19:45:19 <corvus> fungi: right, so it's mostly config-management complication. i haven't thought about what that would mean in the new ansible-container space
19:45:20 <fungi> er, over http
19:45:34 <fungi> i suppose we could temporarily copy snakeoil into the certbot cert/key path until it updates those files
19:45:49 <corvus> fungi: that's an option too
19:45:52 <fungi> so allowing apache to listen on 443/tcp from the start
19:46:21 <corvus> clarkb: aiui, it will write files to, eg, /var/www/www.example.com/.well-known/acme-challenge/ poke its server, get a cert, then write out the key
19:46:21 <fungi> and just make sure any generic 80->443 redirect we configure has a carve-out for .well-known/acme-challenge
19:46:42 <corvus> er, write out the signed cert
19:46:49 <corvus> fungi: ya
19:47:03 <corvus> clarkb: "certbot-auto certonly --webroot -w /var/www/www.example.com/ -d www.example.com -d example.com" makes it do that
19:47:30 <clarkb> corvus: cool, good to know
19:47:42 <fungi> though on my newer systems the updated apache certbot module is working fine and basically provides the same
19:47:49 <corvus> (this is how i use it without giving it any special access, other than write to that directory and /etc/letsencrypt)
19:48:00 <corvus> fungi: oh neat, didn't know about that
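
(Illustration: a minimal sketch of the bootstrap sequence discussed above: start apache with the distro snakeoil cert so it can listen on 443 on the first configuration pass, keep the generic 80->443 redirect carved out for /.well-known/acme-challenge/, then let certbot obtain the real cert via the webroot challenge. The hostnames, the webroot path, and the assumption that certbot-auto is already present on the host are illustrative, not our actual configuration.)

    # Sketch only: webroot-based bootstrap on an assumed Ubuntu-style host.
    - name: Bootstrap a Let's Encrypt certificate via the webroot challenge
      hosts: webservers
      become: true
      tasks:
        - name: Install the distro snakeoil cert so apache can listen on 443 from the start
          apt:
            name: ssl-cert
            state: present

        - name: Request the certificate over plain http (webroot challenge)
          command: >
            certbot-auto certonly --webroot
            -w /var/www/www.example.com/
            -d www.example.com -d example.com
          args:
            creates: /etc/letsencrypt/live/www.example.com/fullchain.pem

        - name: Reload apache so the ssl vhost picks up the issued certificate
          service:
            name: apache2
            state: reloaded

The creates guard keeps the certbot step from re-requesting a certificate on every configuration management run, which matters given the rate limits discussed below.
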
19:48:15 <clarkb> other than that my only other real concern is rate limits and quota limits
19:48:41 <clarkb> I know when they first started it was quite limited but I think they are a lot less limited now
19:49:39 <clarkb> looks like limit is 20 certs per week now and each cert can have up to 100 names
19:49:41 <corvus> 20 certs/week now according to https://letsencrypt.org/docs/rate-limits/
19:50:08 <fungi> it's more making sure that since we're in the 20 certs range, we don't try to renew them all in the same week
19:50:19 <corvus> apparently there's an exemption for renewing :)
19:50:45 <clarkb> those limits should be fine given what I just refreshed
19:50:55 <clarkb> I did 17 certs with one having 8 names and the rest had 1 name
19:51:10 <clarkb> and we have a handful of certs that weren't expiring soon enough to worry about
19:51:37 <fungi> le didn't allow san at one time, not sure if they've started to do so
19:51:52 <clarkb> fungi: the link above implies they do now up to 100 per cert
19:51:54 <fungi> but could imply we need 24 at this point if all distinct certs?
19:51:57 <fungi> ahh, nice
19:52:00 <fungi> i missed that
19:52:59 <corvus> if renewing doesn't count (which is how i read it), we're effectively unlimited. it's a growth limit, not a size limit.
19:53:10 <corvus> er, growth rate
19:53:29 <clarkb> Before we run out of time I wanted to give a quick kata + zuul update. We've got a fedora 28 job running now that sometimes crashes (presumably due to nested virt problems). I'm also sorting through what they feel are requirements from their end after a bug update last night
19:53:32 <fungi> oh, neat
19:53:59 <clarkb> from my perspective we haven't run into any deal breakers, things that are broken were either already broken or never running in jenkins in the first place
19:54:05 <corvus> clarkb: where's the f28 job running?
19:54:18 <clarkb> corvus: against kata-containers/proxy on vexxhost VM
19:54:37 <corvus> so it still sometimes crashes even in vexxhost?
19:54:46 <corvus> (i thought vexxhost had the right settings or something)
19:54:52 <clarkb> corvus: yes, apparently they have issues with their Jenkins running ubuntu 17.10 as well
19:54:59 <clarkb> mnaser thinks it is related to kernel versions in the guest
19:55:07 <clarkb> (newer kernels less stable)
19:55:26 <mnaser> we're running the newest centos kernel on the hypervisor
19:55:27 <clarkb> he is working with them to debug that since it affects jenkins just as much as zuul
19:55:40 <mnaser> and noticed 16.04 never crashed, but 17.10+ was crashing
19:55:47 <mnaser> salvador from kata came up with a reproducer too
19:56:47 <mnaser> i gave all the tracebacks and the info necessary but this isn't as much of a zuul/infra issue as it is a kernel issue, but i'm open to providing whatever they need to solve things :)
19:57:05 <corvus> downgrading kernel versions is an option too
19:57:13 <corvus> (i mean, in the job as a pre-playbook)
19:57:26 <corvus> since zuul *can* handle in-job test-node reboots
19:57:33 <clarkb> really quickly before we run out of time, is there anything else anyone had to bring up?
19:57:42 <clarkb> #topic Open Discussion
19:57:49 <clarkb> (happy to follow up on kata in -infra)
19:57:53 <clarkb> or in kata-dev
19:58:54 <corvus> i gave a talk on zuul at openinfradays china. it was well attended.
19:59:26 <fungi> thanks for doing that!
19:59:39 <corvus> my pleasure :)
20:00:23 <clarkb> We are out of time. Thank you everyone. I am going to grab lunch then will follow up to the mailing lists with the mtu topic and infra ptg item
20:00:26 <clarkb> #endmeeting