14:00:22 <EmilienM> #startmeeting tripleo
14:00:23 <openstack> Meeting started Tue Aug 15 14:00:22 2017 UTC and is due to finish in 60 minutes.  The chair is EmilienM. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:26 <openstack> The meeting name has been set to 'tripleo'
14:00:27 <EmilienM> #topic agenda
14:00:34 <EmilienM> * review past action items
14:00:36 <EmilienM> * one off agenda items
14:00:38 <EmilienM> * bugs
14:00:40 <EmilienM> * Projects releases or stable backports
14:00:42 <EmilienM> * CI
14:00:44 <EmilienM> * Specs
14:00:46 <EmilienM> * open discussion
14:00:48 <EmilienM> Anyone can use the #link, #action and #info commands, not just the moderator!
14:00:50 <EmilienM> Hi everyone! who is around today?
14:00:54 <fultonj> o/
14:00:56 <shardy> o/
14:00:58 <abishop> o/
14:01:02 <owalsh> o/
14:01:03 <jpich> o/
14:01:03 <marios> o/
14:01:06 <slagle> hi
14:01:11 <mwhahaha> hi2u
14:01:23 <EmilienM> #topic review past action items
14:01:27 <adarazs> o/
14:01:37 <jtomasek> o/
14:01:38 <myoung> o/
14:01:44 <EmilienM> EmilienM to switch master to run new upgrade jobs and not old ones anymore (done)
14:01:54 <trown> o/
14:02:06 <EmilienM> jaosorior and abishop to talk together about plans for queens relating to barbican backends (prepare ptg session if needed + discuss about migration tool) (postponed) - not sure about the status
14:02:29 <abishop> to clarify, my involvement is making sure existing deployments using legacy encryption key manager work
14:02:29 <abishop> and that includes future migration from legacy key manager to barbican
14:02:29 <abishop> cinder guy (eharney) is hoping key manager migration can be accomplished within cinder
14:02:29 <abishop> so, no immediate OOO action required (i.e. for Denver PTG)
14:02:29 <abishop> I'll continue to monitor, and will re-raise issue if OOO changes are needed
14:02:52 <EmilienM> ok good to know
14:02:57 <EmilienM> abishop: thanks
14:03:02 <EmilienM> gfidente to send an ML note about moving ceph rgw from scenario004 to 001
14:03:05 <jrist> o/
14:03:16 <EmilienM> not sure Guilio is around, we can postpone this topic unless someone has thoughts
14:03:35 <jaosorior> EmilienM: so yeah, action was taken :D
14:03:36 <florianf> o/
14:03:42 <beagles> o/
14:03:46 <sshnaidm> o\
14:03:49 <jrist> EmilienM: I don't think he is around but maybe he'll see this later
14:03:53 <EmilienM> #topic one off agenda items
14:03:58 <EmilienM> #link https://etherpad.openstack.org/p/tripleo-meeting-items
14:04:09 <EmilienM> sshnaidm: floor is yours
14:04:14 <sshnaidm> yeah
14:04:15 <openstackgerrit> Honza Pokorny proposed openstack/tripleo-ui master: Download logs interface  https://review.openstack.org/473933
14:04:47 <sshnaidm> so clarkb and fungi suggest we manage a whitelist of /etc configurations to collect on the logs server
14:05:01 <sshnaidm> "strongly recommend" I would say
14:05:03 <openstackgerrit> Andy Smith proposed openstack/tripleo-heat-templates master: WIP OpenStack containerized qpid-dispatch-router service  https://review.openstack.org/479049
14:05:11 <adarazs> will that really save much? I think we're already filtering out the bigger items
14:05:14 <sshnaidm> I did some calculations
14:05:15 <EmilienM> yes, for months, I think
14:05:25 <sshnaidm> In general a multinode job's /etc folders take about 8MB out of 33MB of total logs. Of that 8MB we actually need about 5.5MB and don't need about 2.5MB (2.5MB of 33MB is roughly 7% of the logs).
14:05:59 <sshnaidm> So we can save about 7% of the space; I'm not really sure it's worth the work...
14:06:19 <sshnaidm> Therefore I'd like to bring it to discussion here
14:06:20 <adarazs> doesn't sound like it. but we can definitely exclude some more files if it's straightforward.
14:06:38 <mwhahaha> 7% is a lot if you think about the number of jobs we actually run
14:06:42 <mwhahaha> that's not trivial
14:06:51 <EmilienM> on the other hand, infra provides us free resources and gently asks us to help save them
14:06:57 <sshnaidm> adarazs, there was an argument that a new release of centos could pull in big files and break the logs server, like it did in centos 7.3 with java
14:06:57 <pabelanger> Ya, at our scale, 1% is worth it
14:07:03 <mwhahaha> i don't know if a whitelist is the best way to do it, but we do need to be better about excluding more
14:07:07 <mwhahaha> we collect way too much
14:07:41 <pabelanger> keep in mind, we are also re-writing devstack-gate for zuulv3 in ansible, so I'm pretty sure we'll likely write a generic role for jobs to use to collect whitelisted files like this
14:07:46 <EmilienM> if a whitelist is too much work, then improve the exclude list
14:07:52 <sshnaidm> so we have 2 options right now : whitelist and bigger exclude list
14:08:16 <sshnaidm> I'm against a whitelist because it will require maintenance
14:08:23 * adarazs is just a bit wary of a maintained whitelist and the constant "why don't we collect X" requests :/
14:08:25 <sshnaidm> we'd need to add all new services to it
14:08:27 <openstackgerrit> Marios Andreou proposed openstack/tripleo-heat-templates master: Adds PostUpgradeConfigStepsDeployment to drive post config ansible  https://review.openstack.org/493878
14:08:39 <sshnaidm> manually
14:08:40 <adarazs> sshnaidm: yep, exactly.
14:09:05 <mwhahaha> for context, here is an example of a fully loaded etc dir that we log http://logs.openstack.org/28/493728/3/check/gate-tripleo-ci-centos-7-containers-multinode/d49033b/logs/undercloud/etc/
14:09:14 <adarazs> more aggressive exclusions I'm okay with.
14:09:30 <mwhahaha> do we really need the skel, udev, rc* dirs, etc?
14:09:44 <sshnaidm> mwhahaha, no, but those only add up to about 7%
14:09:59 <sshnaidm> So I'd suggest starting with a big exclude list and seeing if that's enough
14:10:04 <sshnaidm> wdyt?
14:10:17 <ooolpbot> URGENT TRIPLEO TASKS NEED ATTENTION
14:10:17 <ooolpbot> https://bugs.launchpad.net/tripleo/+bug/1709327
14:10:18 <openstack> Launchpad bug 1709327 in tripleo "CI: extremely long times of overcloud deploy in multinode jobs" [Critical,Triaged]
14:10:18 <ooolpbot> https://bugs.launchpad.net/tripleo/+bug/1710533
14:10:19 <ooolpbot> https://bugs.launchpad.net/tripleo/+bug/1710773
14:10:19 <openstack> Launchpad bug 1710533 in tripleo "docker client failed to download container from docker.io" [Critical,In progress] - Assigned to wes hayutin (weshayutin)
14:10:20 <openstack> Launchpad bug 1710773 in tripleo "scenario001 and 004 fails when Glance with rbd backend is containerized but not Ceph" [Critical,Triaged]
14:10:22 <mwhahaha> yes i think we need to at least do that
14:10:27 <pabelanger> I'm not sure I agree that a whitelist is more work, we do that for devstack-gate / logstash.o.o today. It's not like we are inundated with requests every day to add more things. Sure, it will take a bit to build up the whitelist, just like it will take a while to exclude things
14:10:28 <EmilienM> if we can keep reducing the size of logs, let's try that
14:10:29 <sshnaidm> pabelanger, ?
14:10:38 <pabelanger> and, whitelisting is much nicer to logs.o.o
14:10:58 <sshnaidm> pabelanger, it's much more work than a big exclude list
14:11:14 <pabelanger> why is it much more work?
14:11:17 <adarazs> the thing is we're not like a usual project in openstack, we don't have a well defined set of config files to collect, or rather it's a very big set that's constantly changing.
14:11:19 <EmilienM> we're talking about /etc only right now right?
14:11:26 <sshnaidm> pabelanger, because it requires manual maintenance
14:11:33 <sshnaidm> EmilienM, yes
14:11:47 <pabelanger> sshnaidm: look at d-g, and how we handle /etc today
14:11:50 <pabelanger> you would do the same
14:11:58 <sshnaidm> pabelanger, we are different
14:12:02 <EmilienM> I think a whitelist for /etc isn't too bad - we know what services we deploy (or plan to deploy)
14:12:14 <EmilienM> but I might miss something
14:12:42 <sshnaidm> EmilienM, the only difference is manual or automatic maintenance
14:12:51 <mwhahaha> so i think this is pointing out the use of CI for debugging
14:12:59 <mwhahaha> which if you need something, you should spin up an env locally
14:13:10 <mwhahaha> and add it to the whitelist later
14:13:33 <mwhahaha> either way we're capturing too much
14:13:41 <mwhahaha> and it's been asked to be fixed for a while
14:13:54 <mwhahaha> so to start we can do a bigger exclude list
14:14:05 <EmilienM> see how it works
14:14:05 <mwhahaha> but whitelist probably makes sense longer term
14:14:10 <pabelanger> yes
14:14:19 <mwhahaha> if we can't get it down with an exclude list we must switch to a white list
14:14:23 <mwhahaha> manual work or not
14:14:27 <openstackgerrit> wes hayutin proposed openstack/tripleo-quickstart-extras master: Use AFS mirrors to download containers instead of docker.io  https://review.openstack.org/493728
14:14:33 <EmilienM> mwhahaha: +1
14:14:44 <sshnaidm> mwhahaha, if we want a whitelist, there's no need to make an exclude list then..
14:14:46 <mwhahaha> so can we get a larger exclude list for next week?
14:14:56 <adarazs> I'm fine with an exclude list, just not with an explicit whitelist.
14:15:02 <sshnaidm> mwhahaha, let's choose one way
14:15:08 <mwhahaha> it's about making incremental progress, right now we're not doing anything but arguing
14:15:18 <mwhahaha> infra asked for a whitelist
14:15:28 <mwhahaha> if we don't want to do that, then PoC an exclude list and lets go
14:15:42 <mwhahaha> but progress needs to be made like now
14:15:50 <mwhahaha> this has been a topic for far too long
14:15:53 <adarazs> mwhahaha: as far as I understand, the topic is infra wanting an explicit whitelist and sshnaidm doesn't think it's a good approach.
14:16:20 <sshnaidm> ok, I'll prepare both and let's see who wins
14:16:25 <adarazs> :)
14:16:35 * sshnaidm done
14:16:45 <openstackgerrit> Dmitry Tantsur proposed openstack/instack-undercloud master: [WIP] Switch to scheduling based on resource classes  https://review.openstack.org/490851
14:16:54 <mwhahaha> k, can you have something by next week maybe?
14:17:00 <EmilienM> I don't think we need to spend time on both now, we probably have other things to do as well
14:17:08 <sshnaidm> mwhahaha, even today
14:17:20 * EmilienM thinks sshnaidm is a machine
14:17:37 * sshnaidm not sure
14:17:52 <mwhahaha> #action sshnaidm to prepare log exclusion/whitelist patches for review
14:17:53 <akrivoka> honza: dumb question, what's the difference between registering and enrolling nodes?
14:18:10 <mwhahaha> moving on :D
14:18:12 <weshay> sshnaidm, make the patch specific to the upstream env
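For illustration, here is a minimal sketch of the exclude-list approach discussed above: copy /etc into the job's log directory while skipping subtrees like skel, udev and rc*.d that were called out as unneeded. The pattern list and function names are hypothetical, not the actual tripleo-ci collection code; the whitelist variant pabelanger describes would simply invert the match and copy only listed paths.

```python
#!/usr/bin/env python3
"""Sketch of exclude-based /etc collection for CI logs (illustrative only)."""
import fnmatch
import os
import shutil

# Hypothetical exclude patterns based on the directories mentioned above;
# fnmatch's '*' also matches '/', so each pattern covers a whole subtree.
EXCLUDES = ["skel*", "udev*", "rc*.d*", "rc.local", "terminfo*"]


def excluded(rel_path):
    """Return True if a path relative to /etc matches any exclude pattern."""
    return any(fnmatch.fnmatch(rel_path, pattern) for pattern in EXCLUDES)


def collect_etc(src="/etc", dest="logs/etc"):
    """Copy src into dest, skipping excluded files and directories."""
    for root, dirs, files in os.walk(src):
        rel_root = os.path.relpath(root, src)
        # Prune excluded directories so we never descend into them.
        dirs[:] = [d for d in dirs
                   if not excluded(os.path.normpath(os.path.join(rel_root, d)))]
        for name in files:
            rel_path = os.path.normpath(os.path.join(rel_root, name))
            if excluded(rel_path):
                continue
            target = os.path.join(dest, rel_path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            try:
                shutil.copy2(os.path.join(root, name), target)
            except OSError:
                pass  # unreadable or special files are simply skipped


if __name__ == "__main__":
    collect_etc()
```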
14:18:12 <EmilienM> anything else for off items?
14:18:20 <honza> akrivoka: none, afaik
14:18:32 <EmilienM> #topic bugs
14:18:43 <EmilienM> #link https://launchpad.net/tripleo/+milestone/pike-rc1
14:19:10 <EmilienM> beside CI issues that we're already working on, do we have outstanding bugs that we need to get fixed in Pike RC1?
14:19:30 <akrivoka> honza: is there any reason to introduce new terminology (enroll) when we have existing (register)? (https://review.openstack.org/#/c/488526/)
14:20:02 <EmilienM> if I don't hear anything from anyone, I'll propose TripleO Pike RC1 by Friday morning.
14:20:16 <marios> EmilienM: there are some upgrades related things, i am looking at 2 personally https://bugs.launchpad.net/tripleo/+bug/1706951 and https://bugs.launchpad.net/tripleo/+bug/1708115 not sure we'll get everything but we'll try
14:20:17 <openstack> Launchpad bug 1706951 in tripleo "Ocata to Pike upgrade fails when cinder-volume runs on host because cinder-manage db sync runs when galera is unavailable" [Critical,In progress] - Assigned to Marios Andreou (marios-b)
14:20:18 <openstack> Launchpad bug 1708115 in tripleo "Ensure non-controller are usable after upgrade and before converge." [Critical,Triaged]
14:20:26 <slagle> EmilienM: i'm investigating a swift issue currently
14:20:26 <honza> akrivoka: it's an ironic thing, i guess https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L1034
14:20:29 <shardy> EmilienM: due to the gate issues, I suspect some of the FFE things will slip into an RC2
14:20:32 <shardy> but sounds good
14:20:41 <EmilienM> marios: ok, upgrade patches are backportable in any case, don't worry
14:20:42 <slagle> EmilienM: https://bugs.launchpad.net/tripleo/+bug/1710606. but it's not limited to upgrades afaict
14:20:43 <openstack> Launchpad bug 1710606 in tripleo "O -> P - Upgrade: swift_object_expirer, swift_container_replicator, swift_object_replicator, swift_rsync, swift_account_replicator, swift_proxy containers are restarting after upgrade" [Critical,In progress] - Assigned to Carlos Camacho (ccamacho)
14:21:00 <slagle> but it's already aligned against rc1
14:21:01 <EmilienM> slagle: oh this one :( ok
14:21:01 <honza> akrivoka: i was reusing the terminology from tripleo-common
14:21:02 <marios> EmilienM: ack thanks
14:21:11 <EmilienM> shardy: yes I think RC2 will happen
14:21:12 <slagle> EmilienM: yes. on a new deploy, i'm seeing the same thing
14:21:13 <honza> akrivoka: but i'm open to changing that!
14:21:38 <EmilienM> shardy: do we automatically move all FFEs to RC2? or just some of them?
14:22:37 <EmilienM> I'll look at the remaining ffes end of this week
14:23:07 <EmilienM> slagle: it's weird we don't hit that in the CI, or do we?
14:23:45 <shardy> EmilienM: maybe we should review the status and decide if any should be deferred to queens?
14:23:57 <slagle> EmilienM: i don't know. do we have verification of swift in the overcloud?
14:23:59 <shardy> same with bugs, we probably need to start reducing the number of things we're tracking?
14:24:05 <EmilienM> shardy: yeah, probably...
14:24:15 <dtantsur> sorry for appearing out of the blue, but I'm solving an ironic-related upgrade complication. just want you to be aware of it.
14:24:16 <shardy> EmilienM: but given the gate issues we probably should be flexible if patches are posted
14:24:21 <EmilienM> slagle: yes, I guess, with the pingtest, it uploads an image to glance with swift backend
14:24:33 <dtantsur> this is https://bugs.launchpad.net/tripleo/+bug/1708653
14:24:34 <openstack> Launchpad bug 1708653 in tripleo "Need to set resource_class on Ironic nodes after upgrade to Pike" [High,In progress] - Assigned to Dmitry Tantsur (divius)
14:24:39 <florianf> This is a regression that should probably get merged in rc1: https://review.openstack.org/#/c/482979/
14:24:49 <florianf> (tripleo-validations)
14:24:55 <EmilienM> florianf: no bug report?
14:25:04 <EmilienM> but ok
14:25:09 <florianf> EmilienM: Let me create one
14:25:11 <lvdombrkr> hello guys, as I understand, by default 1 compute and 1 control node will be deployed; how do I deploy the compute and controller node all in one?
14:25:16 <EmilienM> ok moving on
14:25:22 <EmilienM> #topic projects releases or stable backports
14:25:28 <shardy> EmilienM: also we need a bug to track the remaining pieces that enable minor updates with containers
14:25:46 <EmilienM> shardy: I haven't seen a blueprint for that :(
14:25:48 <shardy> there's a couple of update related bugs targetted to rc1 already, so I'll re-title one
14:25:55 <EmilienM> it's part of the Container support blueprint, I guess
14:26:05 <shardy> EmilienM: well it's a bug, minor updates without downtime
14:26:09 <EmilienM> ok
14:26:23 * shardy thinks there's one for that already, just not specific to containers
14:26:34 <EmilienM> shardy: no problem for this one
14:26:42 <EmilienM> so we'll see how it goes but
14:26:58 <EmilienM> #action EmilienM to prepare tripleo pike rc1 by friday if things go right
14:27:15 <jaosorior> stable/ocata upgrade jobs seem to be timing out a lot :/
14:27:23 <mwhahaha> it's been that way for months now
14:27:27 <EmilienM> if things don't go right, we'll probably defer to next week
14:27:35 <EmilienM> jaosorior: yes it's not new
14:27:37 <mwhahaha> stable/ocata is effectively blocked on the upgrade jobs
14:27:49 <jaosorior> ah
14:27:52 <jaosorior> well crap
14:28:05 <EmilienM> they used to work ~ fine
14:28:16 <EmilienM> but indeed since ~2 months (I think) they timeout a lot
14:28:17 <EmilienM> #topic CI
14:28:21 <florianf> jaosorior, marios: Thanks! ;-)
14:28:48 <EmilienM> the last time I checked, it was upgrade tasks taking time, which makes the job time out on some infra clouds
14:28:48 <marios> np :)
14:29:20 <EmilienM> I posted https://bugs.launchpad.net/tripleo/+bug/1702955
14:29:21 <openstack> Launchpad bug 1702955 in tripleo "tripleo upgrade jobs timeout when running in RAX cloud" [Critical,Triaged]
14:29:38 <shardy> it'd be good to figure out which upgrade_tasks, chances are it's stuck downloading the new packages?
14:29:48 <shardy> that's where most time goes, particularly without a local mirror
14:30:04 <mwhahaha> well it should be using the local mirror now
14:30:11 <EmilienM> I added an alert on the bug and hopefully it'll get some attention
14:30:17 <shardy> yeah I just wonder if that's working as expected
14:30:17 <mwhahaha> but it just requires someone go look into it in depth
14:30:34 <EmilienM> shardy: the local mirror works fine, afaik, but I can double check
14:30:56 <EmilienM> I'll look at it if no one has time
14:31:12 <EmilienM> do we have anything about CI?
14:31:18 <sshnaidm> I didn't see upgrade jobs ever passing..
14:31:28 <EmilienM> sshnaidm: on stable/ocata, they do pass
14:31:43 <EmilienM> weshay: did you do CI squad meeting last week?
14:31:45 <mwhahaha> we need to merge the docker proxy today
14:31:51 <mwhahaha> if possible
14:31:53 <sshnaidm> EmilienM, yep, only there
14:32:02 <weshay> yes.. I need to send the notes
14:32:06 <mwhahaha> otherwise tomorrow we're going to end up with a 24hour+ gate
14:32:10 <jaosorior> mwhahaha: what's the docker proxy review?
14:32:12 <weshay> for that and the rdo mtg
14:32:20 <EmilienM> #action CI / URGENT: review https://review.openstack.org/#/c/493728 and https://review.openstack.org/#/c/493726/
14:32:29 <mwhahaha> -^
14:32:47 <EmilienM> I think mandre isn't around but his -1 can be ignored
14:32:56 <lvdombrkr> hello guys, as I understand, by default 1 compute and 1 control node will be deployed, how can I deploy the compute and controller node all in one?
14:33:06 <sshnaidm> mwhahaha, did you see my patch? I wonder if it will be enough https://review.openstack.org/#/c/491923/
14:33:07 <weshay> does the undercloud configure a proxy for docker?
14:33:14 <EmilienM> lvdombrkr: hey, we're in weekly meeting, and we're almost done
14:33:50 <lvdombrkr> EmilienM: sorry guys ))
14:34:00 <mwhahaha> sshnaidm: possibly, so we need to figure out between weshay's patches and yours
14:34:33 <EmilienM> sshnaidm: are you sure you can get NODEPOOL_DOCKER_REGISTRY_PROXY without sourcing the env on the image?
14:34:38 <openstackgerrit> Merged openstack/tripleo-heat-templates master: Fix Heat condition for RHEL registration yum update  https://review.openstack.org/492632
14:34:41 <weshay> sshnaidm, patch worked as well http://logs.openstack.org/23/491923/2/check/gate-tripleo-ci-centos-7-scenario002-multinode-oooq-container/891461e/logs/undercloud/etc/docker/daemon.json.txt.gz
14:35:00 <sshnaidm> EmilienM, not sure I understand, which image..?
14:35:29 <sshnaidm> ok, let's talk after mtg maybe
14:35:33 <EmilienM> yeah
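For context on the docker proxy patches discussed above, here is a minimal sketch of pointing dockerd at a pull-through registry mirror via /etc/docker/daemon.json. The NODEPOOL_DOCKER_REGISTRY_PROXY variable name comes from the conversation; everything else is illustrative and the linked reviews (493728, 493726, 491923) may implement this differently.

```python
#!/usr/bin/env python3
"""Sketch of configuring dockerd to pull through a CI registry mirror."""
import json
import os

DAEMON_JSON = "/etc/docker/daemon.json"


def write_registry_mirror(config_path=DAEMON_JSON):
    # NODEPOOL_DOCKER_REGISTRY_PROXY is the variable named in the discussion;
    # on CI nodes it would point at the per-cloud registry proxy/mirror.
    mirror = os.environ.get("NODEPOOL_DOCKER_REGISTRY_PROXY")
    if not mirror:
        return  # no mirror available, leave the docker configuration alone

    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)

    # "registry-mirrors" is dockerd's standard option for pull-through caches.
    mirrors = config.setdefault("registry-mirrors", [])
    if mirror not in mirrors:
        mirrors.append(mirror)

    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=2)
    # dockerd must be restarted (e.g. systemctl restart docker) to pick this up.


if __name__ == "__main__":
    write_registry_mirror()
```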
14:35:44 <EmilienM> #topic specs
14:35:49 <EmilienM> do we have anything specs related this week?
14:35:55 <EmilienM> #link https://review.openstack.org/#/q/project:openstack/tripleo-specs+status:open
14:36:20 <EmilienM> #topic open discussion
14:36:30 <EmilienM> quick reminder about the PTG, next month
14:36:32 <EmilienM> #link https://etherpad.openstack.org/p/tripleo-ptg-queens
14:36:40 <EmilienM> feel free to propose topics
14:37:05 <EmilienM> we'll work on the agenda in the following weeks
14:37:17 <EmilienM> does anyone have anything before we close the meeting and go back to normal work?
14:37:32 <EmilienM> thanks folks
14:37:34 <EmilienM> #endmeeting