19:01:05 <jeblair> #startmeeting infra
19:01:05 <openstack> Meeting started Tue Feb 10 19:01:05 2015 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <mrmartin> o/
19:01:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <openstack> The meeting name has been set to 'infra'
19:01:13 <anteaya> present
19:01:17 <gema> o/
19:01:19 <jhesketh> Howdy
19:01:20 <mrmartin> o/
19:01:29 <jeblair> #link agenda https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:30 <krtaylor> o/
19:01:30 <neillc> o/
19:01:38 <ianw> o/
19:01:40 <pleia2> o/
19:01:40 <jeblair> #link previous meeting http://eavesdrop.openstack.org/meetings/infra/2015/infra.2015-02-03-19.01.html
19:01:42 <tchaypo> O/
19:01:49 <jeblair> #topic Actions from last meeting
19:01:56 <jeblair> pleia2 draft summary email about virtual sprints
19:01:57 <nibalizer> o/
19:02:00 <jeblair> and she sent it too!
19:02:05 <anteaya> I saw it
19:02:07 <fungi> it was a great read
19:02:08 <anteaya> it was very good
19:02:20 <jeblair> pleia2: thank you!
19:02:21 <jhesketh> +1
19:02:31 <pleia2> ttx's respons was good too, these sprints are good for well-defined tasks
19:02:42 <anteaya> yes
19:02:55 <fungi> openstack needs more well-defined tasks
19:02:55 <pleia2> which I think some teams struggle with, I think the friday hack day at summit really helped us solidify some things
19:03:29 <anteaya> some teams struggle with them because it isn't until they all get in the same room they finally realize they agree
19:03:39 <anteaya> when for 4 months they were convinced they didn't
19:04:20 <jeblair> #topic New infra-core team member
19:04:41 <fungi> so much suspense
19:04:41 <jeblair> pleia2 will find that she has some extra buttons in gerrit now :)
19:04:47 <anteaya> woooo
19:04:53 <anteaya> congratulations pleia2
19:04:53 <pleia2> thanks everyone!
19:04:56 <krtaylor> congrats!
19:04:57 <nibalizer> gratz!
19:04:59 <pleia2> I'll try not to break openstack
19:05:06 <jeblair> pleia2: that part's covered
19:05:07 <fungi> just fix whatever you break
19:05:21 <fungi> it's a good day if i don't fix more than i break
19:05:26 <fungi> er, if i do
19:05:29 <fungi> something
19:05:30 <pleia2> fungi: hehe, noted
19:05:31 <mrmartin> :)
19:05:37 <jhesketh> pleia2: congrats :-)
19:05:51 <jeblair> pleia2: so now, you get to propose your own addition to infra-root by changing some stuff in puppet
19:05:58 <jeblair> pleia2: and to be honest, i don't even know where that lives anymore
19:06:00 <jeblair> pleia2: so good luck! :)
19:06:07 <anteaya> hahaha
19:06:07 <fungi> left as an exercise for the reader
19:06:07 <pleia2> fungi pointed me in the right direction earlier
19:06:40 <pleia2> so I'll take care of that soon, thanks
19:06:47 <jeblair> thank you!
19:06:55 <mordred> o/
19:06:57 <jeblair> #topic Priority Efforts (Swift logs)
19:06:59 <mordred> (sorry late)
19:07:09 <timrc> o/
19:07:23 <jhesketh> So logs are ticking along.. I think our current challenge is looking into why they take ~5-10min for devstack logs
19:07:34 <jhesketh> it's most likely bandwidth
19:07:41 <jhesketh> possibly when coming from hpcloud
19:07:57 <fungi> might be good to compare some samples between providers
19:08:12 <jhesketh> but other than that, I think we can start to move some other jobs across
19:08:18 <jhesketh> especially ones with smaller log sets
19:08:27 <anteaya> do we have any of those?
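The per-provider comparison fungi suggests above could be as simple as timing the same log upload from a node in each provider. Below is a minimal sketch of that kind of measurement, not anything the team agreed to use: it assumes python-swiftclient's swift CLI is installed and authenticated via OS_* environment variables, and the container name and log path are hypothetical.

    # Sketch: time a Swift upload of a sample log set; run the same script
    # on an hpcloud node and a rax node and compare the elapsed times.
    # Assumes the python-swiftclient "swift" CLI with OS_* credentials set.
    import subprocess
    import time

    LOG_DIR = "/opt/stack/logs"        # hypothetical devstack log location
    CONTAINER = "logs-timing-test"     # hypothetical scratch container

    def time_swift_upload(container, path):
        """Upload a log directory and return elapsed wall-clock seconds."""
        start = time.monotonic()
        subprocess.check_call(["swift", "upload", container, path])
        return time.monotonic() - start

    if __name__ == "__main__":
        print("upload took %.1f seconds" % time_swift_upload(CONTAINER, LOG_DIR))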
19:08:40 <jeblair> but even when using scp, the data goes from node(hpcloud) -> jenkins master(rax) -> static.o.o(rax)
19:08:46 <jhesketh> well smaller than devstack isn't hard as you don't have all the various service logs
19:08:57 <anteaya> fair enough
19:09:11 <yolanda> hi
19:09:14 <jhesketh> jeblair: to be honest, I haven't compared how long the scp takes
19:09:21 <jhesketh> probably something worth poking at
19:09:25 <jeblair> yeah
19:09:46 <zaro> o/
19:09:57 <jhesketh> so a couple of actions for me there (compare times + move more jobs over)
19:10:06 <jeblair> #action jhesketh look into log copying times
19:10:11 <jeblair> #action jhesketh move more jobs over
19:10:17 <jhesketh> cheers :-)
19:10:29 <jeblair> jhesketh: are still doing scp + swift?
19:10:41 <jeblair> if so, are we ready to remove scp from any?
19:10:55 <jhesketh> jeblair: for devstack, yes. I think it's only turned off for a set of project-config jobs
19:11:31 <jeblair> okay, so we can probably keep doing that for a little bit until we're happy with the timing
19:11:39 <jhesketh> jeblair: I suspect so, but maybe we need to get somebody who works closely with devstack logs to do some user acceptability testing?
19:11:54 <jhesketh> (eg sdague or jogo)
19:12:02 <jeblair> jhesketh: aren't we doing swift-first already?
19:12:05 * sdague pops up?
19:12:13 <jeblair> jhesketh: so, in other words, acceptance testing is already in progress? :)
19:12:15 <jhesketh> jeblair: nope, disk-first
19:12:18 <jeblair> oooh
19:12:20 <jeblair> ok
19:12:30 <jhesketh> which is dictated by apache serving its indexes
19:12:54 <jeblair> jhesketh: can you dig up a log set for sdague to look at?
19:13:05 <sdague> yep, happy to
19:13:43 <jeblair> #action sdague look at devstack swift logs for usability
19:14:21 <jeblair> jhesketh, sdague: thanks
19:14:26 <jeblair> #topic Priority Efforts (Nodepool DIB)
19:14:48 <mordred> stabstabstab
19:14:53 <mordred> SO
19:14:59 <mordred> I now have ubuntu working
19:15:31 <mordred> am battling centos - not because we need centos - but because it's a thing we have in servers that uses systemd and I figure we should solve systemd before declaring victory
19:15:32 <jhesketh> will do, thanks sdague
19:16:16 <mordred> although it turns out that centos7 has a) old version of systemd and b) not consistent systemd
19:16:19 * mordred cries
19:16:58 * greghaynes hands mordred a fedora
19:17:31 <ianw> mordred: if you want to sync me up with some of the details later, i can help out
19:17:33 <mordred> anywho - I'm expecting to have that all sorted today so that I can go back to making the nodepool patch
19:17:37 <mordred> ianw: oooh
19:17:52 <mordred> ianw: I will do that
19:17:58 <mordred> ianw: I'm assuming you grok all the systemds
19:18:44 <fungi> i haven't set aside time to get very far with collapsing bare and devstack image types together nor job run-time database configuration so we can stop needing to have a dib solution for that part. hopefully later this week will be better than last was
19:19:38 <jeblair> #action mordred fix the systemd problem
19:19:38 <anteaya> optimist
19:19:41 <jeblair> (ha!)
19:19:59 <jeblair> #action fungi collapse image types
19:20:11 <fungi> optimism all around!
19:20:18 <jeblair> :)
19:20:27 <jeblair> anything else nodepool dibby?
19:20:46 <mordred> uhm ... things that don't do DHCP are bonghits?
19:20:51 <fungi> clarkb may have things, but he's occupied
19:21:05 <clarkb> just my bugfix for image update change
19:21:07 <jeblair> any reviews need attention?
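Returning briefly to the "disk-first" point jhesketh made in the Swift logs discussion above: serving disk-first versus swift-first amounts to a lookup order when a log URL is requested. The sketch below is conceptual only (it is not the actual os-loganalyze middleware); the swift_client is assumed to be a python-swiftclient Connection and the disk root is a placeholder.

    # Conceptual sketch of the "disk-first" serving order: prefer the copy
    # scp'd to the static log server, fall back to the copy in Swift.
    import os

    def serve_log(path, swift_client, container, disk_root="/srv/static/logs"):
        """Return log contents, preferring the local filesystem copy."""
        local = os.path.join(disk_root, path.lstrip("/"))
        if os.path.exists(local):          # disk-first: apache's copy wins
            with open(local, "rb") as f:
                return f.read()
        # Otherwise fall back to the object uploaded to Swift.
        _headers, body = swift_client.get_object(container, path.lstrip("/"))
        return body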
19:21:15 <clarkb> accomodates jeblairs image build fix
19:22:02 <ianw> there are still outstanding reviews for f21 image builds
19:22:14 <ianw> https://review.openstack.org/140901
19:22:25 <ianw> https://review.openstack.org/138250
19:22:35 <ianw> both were going in but hit merge conficts
19:22:45 <ianw> conflicts
19:23:05 <jeblair> clarkb: i'm not sure which you're talking about?
19:23:16 <jeblair> oh
19:23:18 <jeblair> #link https://review.openstack.org/#/c/151749/
19:23:19 <jeblair> that one?
19:23:32 <jeblair> "Better image checking in update_image command"
19:23:41 <clarkb> yes
19:23:43 <clarkb> tha ks
19:24:03 <jeblair> asselin_: had a comment on that
19:24:10 <jeblair> but yeah, we should take a look at that one
19:24:31 <asselin_> o/
19:24:57 <jeblair> #topic Priority Efforts ( Migration to Zanata )
19:25:04 * anteaya admires asselin_'s useful reviews
19:25:28 <asselin_> anteaya, thanks
19:25:55 <pleia2> so mrmartin has been helping me get my module into shape
19:26:03 <pleia2> https://review.openstack.org/#/c/147947/
19:26:25 <pleia2> helping dependencies make more sense (depending on services vs files for installation) and doing tests in vagrant
19:26:26 <mrmartin> needs some work on zanata puppet modules, I set this up in vagrant, and had some dep problems that pleia2 solved
19:27:10 <anteaya> pleia2: doing local testing in vagrant?
19:27:19 <mrmartin> I allocate some time this week and try to find out why the wildfly zanata app deplyoment fails
19:27:32 <pleia2> anteaya: mrmartin is, I'm using some snapshotted VMs
19:27:51 <anteaya> why do you need vagrant if you are using vms?
19:28:02 <pleia2> anteaya: we're both testing in our own ways
19:28:05 <anteaya> sorry if this was discussed before and I missed it in backscroll
19:28:27 <pleia2> he's using vagrant, I'm using VMs
19:28:33 <anteaya> oh sorry
19:28:34 <mrmartin> double-check
19:28:36 <mrmartin> :)
19:28:56 <mrmartin> vagrant launching vm(s) anyway.
19:29:14 <jeblair> #link https://review.openstack.org/#/c/147947/
19:29:23 <pleia2> so progress is being made, not as fast as I'd like, but java is clunky
19:29:54 <jeblair> pleia2, mrmartin: groovy, thanks!
19:30:06 <jeblair> #topic Upgrading Gerrit (zaro)
19:30:46 <zaro> ok. i think i have managed to fix the testing stack of review-dev, zuul-dev, and jenkins-dev
19:31:00 <zaro> all things working now, so will be easier to test
19:31:16 <zaro> review-dev.o.o is on trusty and on Gerrit 2.9.4
19:31:26 <clarkb> db is still a problem?
19:31:30 <zaro> So if anybody wants to test anything do it there.
19:31:55 <zaro> clarkb: yes, the issue about db disconnect is still a aproblem. but it's also in prod
19:32:02 <jeblair> zaro: did you find that we need 2.10 for wip plugin?
19:32:18 <jeblair> well, at least, we can't prove that it isn't in prod
19:32:32 <jeblair> and when we run all the same versions of things in dev, it happens
19:32:41 <jeblair> but of course the trove db server is actually different
19:32:49 <zaro> jeblair: no, wip plugin will be a ways out. it needs fixes from master which won't show up unil 2.11
19:32:54 <fungi> right, zaro was able to reproduce the problem with the ubuntu and gerrit versions we're running in prod
19:32:54 <zaro> unil/until
19:33:01 <jeblair> #info WIP requires >= 2.11
19:33:06 <jeblair> #undo
19:33:07 <openstack> Removing item from minutes: <ircmeeting.items.Info object at 0x99d7590>
19:33:13 <jeblair> #info WIP plugin requires >= 2.11
19:33:32 <fungi> clarkb: though it does seem suspiciously similar to the db problem we were seeing with paste.o.o
19:33:37 <clarkb> fungi ya
19:33:50 <clarkb> I think the trove dbs are partially to blame
19:34:00 * anteaya clicks the review-dev storyboard link in the commit message
19:34:25 <jeblair> i verified that the mysql timeout values are the default on the review-dev trove instance
19:34:32 <jeblair> so it's at least not that kind of misconfiguration
19:34:50 <zaro> so i just finished validating zuul pipelines and the launchpad integration.
19:35:30 <zaro> will probably see if there's anything to check in Gerrit ACLs next.
19:35:37 <anteaya> zaro I don't see the ability to change the topic in the gui
19:36:09 <jeblair> zaro: would you be willing to spend some time with the trove folks and see if there's something they can do?
19:36:27 <zaro> jeblair: yes, most definatlye
19:36:33 <fungi> anteaya: i thought we discovered that only change owners and gerrit admins could do in-ui topic edits?
19:36:37 <zaro> i'll ask them to help debug
19:36:44 <jeblair> zaro: cool, thanks
19:36:44 <anteaya> fungi: perhaps I don't have the permissions then
19:36:47 <mordred> iccha works on trove at rax now - might be a good contact too
19:37:02 <zaro> anteaya: i don't think that's availabe in old screen UI, not even on review.o.o
19:37:05 <anteaya> mordred: not any more
19:37:05 <jeblair> #action zaro to chat with trove folks about review-dev db problems
19:37:12 <mordred> anteaya: oh1 well, don't listen to me
19:37:20 <anteaya> mordred: rax decided they can only work on trove in their own time
19:37:30 <anteaya> mordred: so only when she has time after work now
19:38:16 <zaro> i'm going to test prod db migration next.
19:39:00 <zaro> So about moving review.o.o to trusty?
19:39:19 <zaro> anybody against that? if not should we schedule something?
19:39:20 <anteaya> zaro: we need to do that to upgrade?
19:39:25 <zaro> yes
19:39:40 <clarkb> no opposition from me on that
19:39:48 <jhesketh> sounds good to me
19:39:49 <anteaya> I'm for scheduling something the week before summit like we did last year
19:40:04 <anteaya> or will this be less involved?
19:40:04 <zaro> anteaya: this is why #link https://review.openstack.org/#/c/151368/
19:40:07 <jeblair> zaro: so you want to do os upgrade first, then gerrit upgrade? or both together?
19:40:34 <anteaya> bouncy castle again
19:40:36 <zaro> best to do OS upgrade first
19:41:22 <jeblair> right, so we didn't get that far with this last week, but let's try again
19:41:39 <jeblair> nothing before feb 28
19:41:53 <clarkb> I know one problem is every time we change IP corps need to update firewall rules
19:42:02 <clarkb> we still cant floating ip in rax right?
19:42:41 <jeblair> how is feb 28, mar 7, mar 21?
19:42:48 <jeblair> also, see: https://wiki.openstack.org/wiki/Kilo_Release_Schedule
19:43:24 <anteaya> can I vote for may 7th?
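On the trove check jeblair mentions earlier in this topic (verifying that the mysql timeout values on the review-dev instance are the defaults), the inspection could look something like the sketch below. The host, account, and database names are placeholders and the pymysql driver is an assumption; this is illustrative, not the command that was actually run.

    # Sketch: list the timeout-related variables on a trove MySQL instance.
    import pymysql

    conn = pymysql.connect(host="review-dev-db.example.org",  # placeholder
                           user="gerrit", password="REDACTED",
                           database="reviewdb")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW VARIABLES LIKE '%timeout%'")
            for name, value in cur.fetchall():
                # MySQL ships with wait_timeout and interactive_timeout
                # defaulting to 28800 seconds (8 hours).
                print(name, value)
    finally:
        conn.close()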
19:43:37 <clarkb> jeblair all should worj for me
19:43:37 <pleia2> I'm around all those days
19:44:13 <fungi> clarkb: that sounds to me like a reason to do this ~monthly
19:44:17 <mordred> fungi: ++
19:44:32 <mordred> maybe they'd learn that outbound blocking is crazypants
19:44:33 <fungi> eventually they'll get tired of having to maintain special egress rules for that port
19:44:39 <clarkb> ha
19:44:54 <clarkb> and https does work fwiw
19:44:55 <zaro> all are good with me as well. i'm partial to feb 28.
19:45:03 <jeblair> to anteaya's point. do we want to wait until after the release?
19:45:10 <clarkb> anteaya it should be very low impact
19:45:24 <anteaya> should and is can be miles apart
19:45:24 <clarkb> spin up new node side by side, easy switch, easy fallback
19:45:47 <anteaya> if there is a compleling reason to do it before may I'm all ears
19:45:48 <clarkb> anteaya well I have done this before and it was easy (lucid to precisr)
19:46:03 <anteaya> no blocking from corp firewalls?
19:46:15 <anteaya> no contributors unable to work?
19:46:25 <anteaya> happy to be wrong
19:46:26 <jeblair> anteaya: we will announce it well in advance, with the new ip.
19:46:30 <clarkb> there will likely be blocking on port 29418 they can use httpa
19:46:33 <clarkb> *https
19:46:43 <anteaya> okay if I am in the minority so be it
19:46:49 <fungi> the biggest issue is probably going to be finding a slow enough week at this point in the cycle that having several hours of downtime won't severely impact development momentum so we can do the maintenance carefully and if necessary roll back
19:47:01 <anteaya> fungi: yes
19:47:44 * krtaylor wonders if it will impact some third party ci systems
19:47:48 <jeblair> so this feb 28 is the saturday before feature proposal freeze. the week following is likely to be busy
19:48:04 <asselin_> clarkb, zuul doesn't https
19:48:10 <jeblair> i'm not certain that's a reason not to do it.
19:48:13 <fungi> krtaylor: some may need a restart to reconnect to redo dns resolution and reconnect to the new ip address, yes
19:48:40 * fungi had redundant words in that last sentence
19:48:48 <asselin_> we'll need to have firewalls updated for thirdparty ci
19:49:06 <krtaylor> possibly, yes
19:49:20 <krtaylor> so we should spread the event happening far and wide
19:49:24 <clarkb> asselin_ or use a proxy
19:49:32 <krtaylor> and there will still be questions :)
19:49:35 <jeblair> krtaylor: we always do :)
19:49:40 <zaro> ohh, i forgot. the Toggle CI button doesn't work. is anyone willing to take a look at that? i've already took a quick look but i don't know js so it's not apparent to me how it even works.
19:50:01 <mordred> jeblair: for sake of information - I believe last time we swapped it took between 1 and 2 months to get the egress rules changed at HP
19:50:11 <asselin_> clarkb, not sure how. last time nothing worked for zuul. this was last summer.
19:50:14 <jeblair> mordred: i hope it goes faster this time.
19:50:20 <mordred> jeblair: I do not believe that's possible
19:50:32 <asselin_> mordred, and they fat-fingered the rule for my site, so it took even longer
19:50:46 <fungi> asselin_: you'll only need firewalls updated if your firewalls are for some reason configured to block _outbound_ connections from your systems
19:50:48 <mordred> jeblair: it's a change to global security rules which goes through an IT process
19:50:49 <jeblair> to be clear, egress filtering is a bad idea. it's particularly bad for systems that rely on connecting to a system that runs in _a public cloud_
19:50:55 <mordred> yes. this is all true
19:51:06 <asselin_> fungi, yes, we're blocked on outbound :(
19:51:13 <jeblair> so, i think what we can do is try to disseminate the information as soon as possible
19:51:14 <mordred> I'm merely reporting on the state of the world for at least one of our constituencies
19:51:19 <jeblair> mordred: thank you
19:51:19 <mrmartin> keep the old instance and ip and redirect the traffic with haproxy to the new one
19:51:31 <mrmartin> so don't need to change the ip
19:51:47 <mrmartin> or you can keep it as a backup
19:51:58 <jeblair> but we can not let this be a blocker
19:52:00 <clarkb> asselin_: you would likely need to do a port fowrad through a SOCKS proxy
19:52:05 <clarkb> asselin_: it should just work once you get it set up
19:52:19 <krtaylor> jeblair, agreed, speaking for my system anyway, anytime is as bad as any other
19:52:21 <jeblair> mrmartin: that may cause its own problems and greatly increase the complexity
19:52:27 <clarkb> mrmartin: no then we still have an old precise box around
19:52:36 <tchaypo> this is where we all talk about some kind of ha proxy as a service thingy
19:52:37 <clarkb> and it increases the number of places where things can break as jeblair points out
19:52:46 <fungi> mrmartin: if we do that, we'll either end up maintaining it indefinitely or ~50% of the people who are going to be impacted simply won't find out until we eventually take down the proxy
19:52:53 <clarkb> tchaypo: if only such proxy services were configurable in ways that made them useful :)
19:53:14 <mrmartin> ok, but you can give a 2 month grace period, and everybody can migrate
19:53:57 <mrmartin> poor-man's floating ip
19:54:20 <fungi> on the up side, it only increases places where things can break for people who are stuck behind egress filters managed by people who need far too long to update them
19:54:23 <jeblair> so who's around on feb 28?
19:54:29 <anteaya> I can be
19:54:29 <clarkb> jeblair: me
19:54:32 <mordred> I kinda think we should go the "wait longer" route so that we can spin up the new box and get the new IP info out to our various 3rd party testing folks and the corporations with idiotic network policies
19:54:48 <zaro> jeblair: me
19:54:51 <jeblair> mordred: i believe that we can do that by the end of this week and provide 2 weeks of notice.
19:55:10 <mordred> ok. I think that we know fora fact that will break a large portion of our user base
19:55:13 <tchaypo> how do we get the new ip?
19:55:21 <fungi> i'm out of town from the 25th to the 6th but don't let my absence stop you
19:55:23 <jeblair> tchaypo: we will send an email announcement
19:55:32 <tchaypo> rephrase
19:55:34 <jeblair> mordred: how much notice do you want to provide?
19:55:47 <tchaypo> how does the infra team find out the new ip to put in the email?
19:55:55 <mordred> I think 2 months is probably the amount of time HP and IBM and Cisco are all liekly to need
19:55:57 <jeblair> tchaypo: we spin up the server
19:56:02 <mrmartin> tchaypo: by starting the new server.
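For the proxy approach clarkb suggests above (a port forward through a SOCKS proxy so a CI system behind an egress filter can still reach the Gerrit event stream on 29418), one possible shape is sketched below. This is only an illustration of the idea, not a recommended or tested setup: it assumes the PySocks and paramiko libraries, and the proxy address, account name, and key path are all placeholders.

    # Sketch: reach gerrit stream-events through a SOCKS proxy that is
    # permitted outbound, instead of connecting to 29418 directly.
    import socks      # PySocks
    import paramiko

    GERRIT_HOST, GERRIT_PORT = "review.openstack.org", 29418
    PROXY_HOST, PROXY_PORT = "proxy.example.com", 1080   # placeholder proxy

    sock = socks.socksocket()
    sock.set_proxy(socks.SOCKS5, PROXY_HOST, PROXY_PORT)
    sock.connect((GERRIT_HOST, GERRIT_PORT))

    transport = paramiko.Transport(sock)
    transport.connect(username="third-party-ci",          # placeholder account
                      pkey=paramiko.RSAKey.from_private_key_file("/path/to/key"))
    channel = transport.open_session()
    channel.exec_command("gerrit stream-events")
    for line in channel.makefile():
        print(line.rstrip())                              # one JSON event per line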
19:56:05 <jeblair> mordred: that seems excessive
19:56:05 <mordred> which is sad
19:56:08 <mordred> and they should be ashamed
19:56:40 <krtaylor> not sure we'd need 2 months, but 2 weeks is tight also
19:56:50 <mordred> but given the number of corporate contributors we have AND the number of 3rd party testing rigs taht exist - even though this is almost as broken as rackspace's not-dhcp - it is what it is
19:56:55 <fungi> for third-party ci specifically i guess, since as clarkb points out everything besides stream-events should work fine via https api
19:57:17 <mordred> fungi: yah - but 3rd party without streamevents is going to be kinda tough
19:57:22 <fungi> yep
19:57:34 <anteaya> time check
19:57:41 <mrmartin> ask.o.o
19:57:43 <sdague> krtaylor: you need an egress change to connect to zuul? I was pretty sure the IBM corp network should just allow that
19:57:46 <jeblair> that puts us at mordred: that puts us at march 21.
19:58:01 <clarkb> fungi: ya https is how people are getting around china's great firewall now
19:58:03 <jeblair> anteaya, mrmartin: i think we're going to have to defer
19:58:06 <clarkb> fungi: so we know it works for the most part
19:58:07 <asselin_> I remember looking at the zuul code, and we may be about to update the socket it opents with a proxy configuration
19:58:11 <anteaya> jeblair: looks like it
19:58:27 <clarkb> asselin_: you just configure localhost:poxy_port as the gerrit location
19:58:31 <mordred> jeblair: I'm around and available both days, fwiw
19:58:37 <krtaylor> I agree with fungi 's assessment though, ~50 arent paying attention, 2 weeks should be enough to get the word out
19:58:38 <mordred> and will help either day we choose
19:58:59 <krtaylor> sdague, yes, no egress needed for us
19:59:32 <clarkb> we could also put gerrit ssh on port 22
19:59:37 <fungi> i am happy to spin up the new server as soon as this meeting ends if someone wants to work on drafting an announcement i can plug the ip addresses into
19:59:43 <clarkb> but Ithink we should only do that if we can listen on 29418 as well
19:59:51 <jeblair> asselin_: have you worked through this before? how long did it take?
20:00:01 <krtaylor> fungi, will announce in third-party meetings
20:00:22 <jeblair> okay, we'll continue this in the infra channel
20:00:24 <jeblair> thanks all
20:00:26 <jeblair> #endmeeting
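One practical takeaway from the access discussion at the end of the meeting: once the new server's address is announced, a third-party CI operator can quickly confirm which of the two paths discussed (ssh on 29418 versus the https API on 443) their network actually allows. A minimal sketch of that check follows; the hostname is the production Gerrit, everything else is generic.

    # Sketch: check outbound reachability of the Gerrit ssh and https ports.
    import socket

    def can_reach(host, port, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for port in (29418, 443):
        status = "reachable" if can_reach("review.openstack.org", port) else "blocked"
        print("review.openstack.org:%d is %s" % (port, status))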