16:00:39 <xarses> #startmeeting fuel
16:00:39 <xarses> #chair xarses
16:00:40 <xarses> Today's Agenda:
16:00:40 <xarses> #link https://etherpad.openstack.org/p/fuel-weekly-meeting-agenda
16:00:40 <xarses> Who's here?
16:00:40 <openstack> Meeting started Thu Sep  3 16:00:39 2015 UTC and is due to finish in 60 minutes.  The chair is xarses. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:41 <maximov> hi
16:00:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 <openstack> The meeting name has been set to 'fuel'
16:00:45 <mwhahaha> hi
16:00:46 <rvyalov> hi
16:00:46 <openstack> Current chairs: xarses
16:00:53 <mihgen> hi
16:01:02 <ashtokolov> \o/
16:01:16 <kozhukalov> hi
16:01:24 <dpyzhov> hi
16:02:25 <xarses> ok, let's get going
16:02:27 <xarses> #topic librarian update (mwhahaha)
16:02:35 <mwhahaha> As reported last week we are starting the work to prepare versions of the modules on fuel-infra.
16:02:36 <mwhahaha> #link https://review.fuel-infra.org/#/q/status:open+topic:bug/1490469
16:02:36 <mwhahaha> The overall work is also being tracked on a spreadsheet prepared by degorenko.
16:02:37 <mwhahaha> #link https://docs.google.com/spreadsheets/d/1P8xJbYyHXnb0W7fVme3jkOUYj4OTCfbplZEYLX7SC0E/edit#gid=1195205959
16:02:39 <mwhahaha> As you can see from the spreadsheet, mos-puppet and the openstack teams have been working hard to identify and propose the fuel-specific changes required. That being said, we have some modules for which the fuel-library team should probably take the lead in managing upstream changes and preparing a fuel-infra version for the 8.0 cycle.
16:02:41 <mwhahaha> In the spreadsheet we have identified the corosync, haproxy, mysql, openssl, rabbitmq and rsyslog modules as ones that the fuel-library team should take the responsibility of figuring out upstream changes and adapting.
16:02:43 <mwhahaha> As I mentioned last week, many of these modules have diverged from the upstream and may require significant work to flesh out a path forward. These modules may continue to live within the fuel-library code base for the 8.0 cycle, but I will be spending time trying to evaluate how much effort will be involved in moving these to an upstream version.
16:02:45 <mwhahaha> Questions?
16:04:25 <xarses> can you update the sharing rights on the doc?
16:04:32 <mwhahaha> i don't own it
16:04:35 <mwhahaha> but i'll reach out
16:05:18 <mihgen> thanks mwhahaha
16:05:37 <mihgen> who else will work on converging these modules to the upstream ones?
16:06:18 <mwhahaha> Other than the openstack teams, i'm not aware of anyone else currently working on these
16:06:32 <mihgen> there is a status column, there are many questions on fuel-library side
16:07:02 <mihgen> looks like we'd need to split it between people
16:07:25 <mihgen> so to work in parallel ?
16:07:37 <mwhahaha> Ideally yes, we should identify some people to work on these
16:08:04 <mwhahaha> i know some people have expressed a desire to work on some of these, as they represent tech debt that some are already aware of
16:09:02 <mihgen> those in fuel-library who are freeing up from bugfixing may take those
16:09:54 <mihgen> I was thinking about how we execute such things
16:10:29 <mihgen> we have like ten places with tasks
16:10:29 <mihgen> looks like we'd need to decide whether to use trello or just a plain etherpad with a list of things
16:10:35 <mihgen> and make sure to sort it in priority order
16:10:48 <mihgen> so that those who get free from bugs can check against one single place
16:11:07 <mwhahaha> we can also create bugs for the modules
16:11:11 <mwhahaha> to assign ownership
16:11:27 <mwhahaha> I did create a few bugs previously as part of the initial migration
16:11:38 <mihgen> possibly. Make sure to have the "feature" tag on those
16:12:22 <mwhahaha> The permissions should be fixed in the spreadsheet
16:12:30 <mwhahaha> so we can leverage this if that works for everyone
16:12:33 <mihgen> ok guys - let's sync over email on this, let's move on now
16:12:37 <xarses> moving on?
16:12:42 <xarses> =)
16:12:56 <xarses> #topic HCF bugs review
16:13:19 <xarses> Ok, so we're going to go through some bugs blocking HCF and discuss status and the like
16:13:37 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1490523 (idv1985)
16:13:39 <openstack> Launchpad bug 1490523 in Fuel for OpenStack "OpenStack client hangs forever after simultaneous add/delete controller" [Critical,In progress] - Assigned to Dmitry Ilyin (idv1985)
16:14:26 <dilyin> this bug is already fixed, i've added retries and timeout to the openstack provider
16:14:29 <sgolovatiuk> xarses: as far as I know fix is ready
16:14:41 <dilyin> now i'm going to post the changes upstream
16:14:57 <mihgen> can you provide a link to patch pls?
16:14:57 <xarses> good to hear
16:15:37 <ashtokolov> https://review.openstack.org/#/c/219668/
16:16:01 <mihgen> so it's not yet fixed since it's still not merged ;)
16:16:19 <mihgen> so it's just more retries?
16:16:39 <aglarendil_> yep, apache restart led to wsgi accepting but not processing requests
16:16:50 <aglarendil_> and openstack client was hanging forever
16:17:05 <aglarendil_> so we added timeout + retries there
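The fix discussed above lives in the fuel-library openstack Puppet provider (review 219668). Purely as an illustration, here is a minimal Python sketch of the same timeout-plus-retries pattern; the command wrapper, timeout, and retry counts below are assumptions, not the values in the actual patch.

```python
# Minimal sketch of the timeout-plus-retries pattern described above.
# The real fix is in the fuel-library openstack Puppet provider (Ruby);
# run_openstack_command, TIMEOUT, RETRIES, DELAY are illustrative only.
import subprocess
import time

TIMEOUT = 40   # seconds to wait for a single CLI invocation (assumed)
RETRIES = 5    # attempts before giving up (assumed)
DELAY = 10     # pause between attempts (assumed)

def run_openstack_command(args):
    """Run an openstack CLI command, retrying if it hangs or fails.

    Without the timeout, a request accepted (but never processed) by a
    wsgi worker that is being restarted would block the caller forever.
    """
    for attempt in range(1, RETRIES + 1):
        try:
            return subprocess.check_output(
                ["openstack"] + args, timeout=TIMEOUT)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            if attempt == RETRIES:
                raise
            time.sleep(DELAY)

# Example: run_openstack_command(["endpoint", "list"])
```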
16:17:28 <xarses> ok, next up
16:17:35 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491725 (kozhukalov)
16:17:36 <openstack> Launchpad bug 1491725 in Fuel for OpenStack "Deployment failed with error: <class 'cobbler.cexceptions.CX'>:'MAC address duplicated: 0c:c4:7a:14:25:36'" [Critical,In progress] - Assigned to Vladimir Kozhukalov (kozhukalov)
16:17:52 <kozhukalov> i am working on this
16:18:08 <kozhukalov> #link https://review.openstack.org/#/c/220191/
16:18:33 <kozhukalov> still in progress, testing to make sure nothing was broken by this patch
16:18:48 <mihgen> kozhukalov: scale related only?
16:18:59 <maximov> what are the chances of fixing it by Sunday this week?
16:19:01 <kozhukalov> yep
16:19:09 <kozhukalov> it is hard to reproduce
16:19:28 <kozhukalov> but the root cause is that cobbler is not intended to be scalable
16:19:33 <mihgen> kozhukalov: it's only when you remove env and immediatelly create new one?
16:19:44 <mihgen> or it's just when you remove, not all entries are removed from cobbler?
16:19:55 <kozhukalov> it is when you remove plenty of nodes at the same time
16:20:01 <mihgen> so it happens regardless of when you want the new env to be created
16:20:04 <kozhukalov> some of those nodes can be still there
16:20:29 <kozhukalov> and then new env tries to add the same nodes with other names
16:20:50 <sgolovatiuk> maybe we need to keep removing until we are sure all nodes are deleted
16:20:51 <kozhukalov> but cobbler does not allow having two nodes with the same MAC
16:20:57 <sgolovatiuk> not fire and forget ...
16:21:30 <sgolovatiuk> no matter how many cycles we need to delete the data from cobbler
16:22:08 <mihgen> kozhukalov: so there is a patch from you,
16:22:12 <aglarendil_> oh! sleep & retry
16:22:20 <mihgen> is it close to solution .. ?
16:22:30 <kozhukalov> sgolovatiuk, yes, maybe you are right
16:22:41 <kozhukalov> we need to try again and again
16:22:47 <kozhukalov> will add this
16:23:04 <kozhukalov> mihgen, yes, patch is here https://review.openstack.org/#/c/220191/
16:23:17 <xarses> just group the deletes into batches?
16:23:27 <mihgen> question is will it actually lead to success)
16:23:30 <kozhukalov> mihgen, it is quite close
16:23:31 <mihgen> and when we get rid of cobbler )
16:23:45 <sgolovatiuk> :)
16:23:46 <kozhukalov> like i said, i'm working on testing this patch
16:24:39 <kozhukalov> mihgen, at least we will know that if there are nodes with the same MAC, we will try to remove them
16:24:49 <mihgen> thanks kozhukalov
16:24:51 <kozhukalov> but yes, it might not be enough
16:25:02 <kozhukalov> I'll add retries
16:25:05 <xarses> ok, we need to keep moving to get through the others.
16:25:06 <mihgen> it was probably very hard to understand what happens in this bug..
16:25:19 <maximov> kozhukalov just to confirm - the patch is almost ready and we will merge it by Sunday, is this correct understanding?
16:25:25 <mattymo_> pre-emptive delete by mac is probably a smart idea
16:25:26 <aglarendil_> folks, just use SDD - sleep driven development
16:25:36 <kozhukalov> maximov, I think yes
16:25:56 <mihgen> aglarendil_: :)
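For illustration, a minimal sketch of sgolovatiuk's "keep removing until it is really gone" idea, against Cobbler's XML-RPC API. Method names follow Cobbler's remote API, but the URL, credentials, exact signatures, and retry budget are assumptions; this is not kozhukalov's actual patch (review 220191).

```python
# Sketch only: delete a cobbler system and retry until it is confirmed
# gone, so a new node with the same MAC can be registered safely.
import time
from xmlrpc.client import ServerProxy  # xmlrpclib on a Python 2 master node

COBBLER_URL = "http://10.20.0.2/cobbler_api"  # assumed master node address

def remove_system_with_retries(name, user, password, retries=10, delay=5):
    server = ServerProxy(COBBLER_URL)
    token = server.login(user, password)
    for _ in range(retries):
        if not server.find_system({"name": name}):
            return True           # really gone: the MAC is safe to reuse
        server.remove_system(name, token)
        time.sleep(delay)         # give cobbler time to finish the removal
    return False                  # still present: don't register a new node
                                  # with the same MAC yet
```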
16:26:04 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491015 (akislitsky)
16:26:05 <openstack> Launchpad bug 1491015 in Fuel for OpenStack "System test 'check_openstack_stat' failed: No JSON object could be decoded" [Critical,In progress] - Assigned to Alexander Kislitsky (akislitsky)
16:26:05 <mihgen> xarses: next bug?
16:26:57 <kozhukalov> aglarendil_, NAADD (Nice Aglarendil Advice Driven Development :-)
16:27:04 <akislitsky> Patch is on review. A proper fix will be done in 8.0; for now we have a workaround. The patch is tested in unit tests and manually on a live env
16:27:06 <mihgen> sbog: are stats broken because we enabled SSL?
16:27:48 <sbog> mihgen: couple secs
16:27:55 <akislitsky> mihgen, yep. it is due to enabling SSL
16:28:34 <mihgen> quite sad that we're catching such things this late in the cycle. ok akislitsky so your fix is ready to go?
16:28:44 <mihgen> we just need reviews / small fixes if needed ?
16:28:51 <sbog> yep, seems so
16:29:21 <akislitsky> mihgen, it is ready to be merged
16:29:28 <mihgen> akislitsky: already by ikalnitsky :)
16:29:35 <mihgen> cool, thanks guys!
16:29:37 <xarses> it was just merged
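The bug title ("No JSON object could be decoded") suggests the stats code received a non-JSON body once SSL was enabled. The merged patch is the workaround mentioned above; the sketch below only illustrates the generic defensive-parsing idea, with a hypothetical endpoint URL.

```python
# Generic sketch of guarding against a "No JSON object could be decoded"
# failure: after a switch to HTTPS, a redirect or an HTML error page can
# come back instead of JSON. Not the actual patch; the URL is made up.
import requests

STATS_URL = "https://master.example:8443/api/v1/stats"  # hypothetical

def fetch_stats(url=STATS_URL):
    resp = requests.get(url)   # raises on SSL handshake problems
    resp.raise_for_status()    # surface HTTP errors instead of parsing them
    try:
        return resp.json()
    except ValueError:         # body was not JSON, e.g. an HTML error page
        raise RuntimeError("expected JSON, got: %r" % resp.text[:200])
```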
16:29:42 <mihgen> let's move on)
16:29:45 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491306 (bogdando)
16:29:46 <openstack> Launchpad bug 1491306 in Fuel for OpenStack "Rabbit join race with OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed after reschedule router from primary controller and destroying it" [Critical,In progress] - Assigned to Bogdan Dobrelya (bogdando)
16:31:24 <mihgen> bogdando: can you share a bit on what it's about?
16:32:26 <xarses> seems he's not around
16:33:08 <xarses> #link https://bugs.launchpad.net/mos/+bug/1491576 - not reproducible anymore! Considered as Invalid
16:33:09 <openstack> Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:33:28 <xarses> seems invalid now, though the bug status doesn't say so yet
16:33:41 <sgolovatiuk> we made manual deployments with dtyzhnenko
16:34:00 <sgolovatiuk> I was not able to reproduce it anymore
16:34:12 <xarses> Ok, can we update the LP then?
16:34:19 <mihgen> sgolovatiuk: how did you try to repro?
16:34:36 <sgolovatiuk> I've been trying all day long
16:34:49 <mihgen> how exactly?
16:34:50 <sgolovatiuk> not reproducible
16:35:04 <sgolovatiuk> with D. Tyznenko who opened it
16:35:11 <sgolovatiuk> using steps in bug
16:35:22 <mihgen> did you try to put some load / ostf, and do apache2 reload on all controllers at the same time
16:35:29 <sgolovatiuk> yep
16:35:43 <sgolovatiuk> apache reload is not related
16:35:48 <sgolovatiuk> it's different bug
16:36:16 <mwhahaha> doing a reload does restart keystone...
16:36:18 <mihgen> oh sorry I messed up with another bug, yeah
16:36:23 <sgolovatiuk> we shouldn't mix them ... but there can be a side effect of another bug which is already resolved
16:36:34 <mihgen> mwhahaha: I'm actually surprised a bit..
16:36:51 <mihgen> reload is supposed to just re-read the config
16:37:00 <mihgen> but it actually works like a restart
16:37:23 <mihgen> is that how wsgi things are supposed to work under apache?
16:37:43 <mihgen> or maybe we should just configure something differently, so that reload is not that dramatic?
16:37:46 <mwhahaha> yes because keystone services get managed via apache now, so that's to be expected i guess
16:37:57 <sgolovatiuk> I don't like mod_wsgi
16:38:06 <sgolovatiuk> we need to invest time in uwsgi
16:38:19 <sgolovatiuk> in that case apache will be proxy
16:38:21 <mwhahaha> would need to get upstream puppet keystone to support it
16:38:24 <mihgen> mwhahaha: I'm not sure - in order to get logs rotated, we need to restart the whole service...
16:38:42 <sgolovatiuk> logrotate - SIGHUP
16:38:47 <sgolovatiuk> that's enough
16:38:55 <sgolovatiuk> we don't need to restart apache
16:39:06 <mihgen> isn't apache2 reload actually SIGHUP?
16:39:13 <mwhahaha> no
16:39:17 <mihgen> wow
16:39:30 <mwhahaha> but reload is the same as a graceful
16:39:33 * sgolovatiuk nods
16:40:00 <mihgen> well then why do we use apache reload in our logrotate scripts, if we need SIGHUP.. ?
16:40:12 <mwhahaha> that's the script that ships with apache2 i think
16:40:35 <mwhahaha> i couldn't find that logrotate script in fuel-library
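One way to sidestep the whole reload-vs-SIGHUP question for log rotation is logrotate's copytruncate strategy, which never signals the server at all. Below is a minimal Python illustration of that strategy; the log path is assumed, and this is an alternative idea rather than what fuel-library ships.

```python
# Illustration of logrotate's `copytruncate` idea: rotate a log without
# signalling the process that holds it open, so a wsgi-hosted keystone
# would not need any apache reload/restart for rotation.
import shutil

LOG = "/var/log/apache2/keystone_access.log"  # assumed path

def copytruncate(path):
    """Copy the current contents aside, then truncate in place.

    The writer keeps its file descriptor. This works because apache
    opens its logs in append mode, so new writes land at the new end
    of file; like logrotate's copytruncate, a few lines written during
    the copy can be lost.
    """
    shutil.copy2(path, path + ".1")
    with open(path, "r+") as f:
        f.truncate(0)
```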
16:40:46 <mihgen> sgolovatiuk: sorry I shifted conversation to talk about that bug, not this one..
16:41:09 <xarses> sorry guys, we need to keep moving
16:41:25 <xarses> #link https://bugs.launchpad.net/mos/+bug/1491576 (holser)
16:41:26 <openstack> Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:41:52 <sgolovatiuk> we've already discussed it
16:41:58 <sgolovatiuk> I am still investigating it
16:42:01 <xarses> =)
16:42:11 <sgolovatiuk> I need a bit of time to produce a review
16:42:19 <sgolovatiuk> my ETA - end of tomorrow
16:42:24 <xarses> ok
16:42:31 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1461562 (ikalnitsky)
16:42:32 <openstack> Launchpad bug 1461562 in Fuel for OpenStack "Failed to casting message to the nailgun. RabbitMQ was dead" [Critical,In progress] - Assigned to Igor Kalnitsky (ikalnitsky)
16:43:01 <ikalnitsky> ok, guys, we've been facing this issue since the beginning of this week
16:43:15 <mihgen> ikalnitsky: this one is really annoying. According to dims, it should be fixed by upgrading rabbitmq..
16:43:26 <ikalnitsky> yeah, it could be
16:43:36 <mihgen> but I'm not sure if it's gonna be enough, as openstack uses oslo and there are reconnects, etc.
16:43:41 <sgolovatiuk> that's not true
16:43:50 <mihgen> whoops
16:43:53 <sgolovatiuk> on master node we still have rabbitmq 3.3
16:43:53 <ikalnitsky> but i hope that after reverting the patches, it will be barely reproducible
16:43:57 <sgolovatiuk> not 3.5
16:44:12 <ikalnitsky> sgolovatiuk: there's something strange on system level :(
16:44:14 <sgolovatiuk> I think upgrade to Centos 6.6 affected it
16:44:15 <dims> mihgen: i said, we have to test with a newer rabbitmq and verify heartbeats work
16:44:16 <mihgen> sgolovatiuk: yep, that's why I'm saying that in 8.0 we would upgrade it
16:44:30 <ikalnitsky> heartbeats actually work
16:44:35 <ikalnitsky> we had a 2 sec timeout
16:44:43 <ikalnitsky> and astute took up to 100% cpu usage
16:44:49 <mihgen> dims: yeah but do you need heartbeats implementations on client side?
16:44:53 <ikalnitsky> sometimes, 2 sec wasn't enough to send a message
16:45:07 <mihgen> or is it an internal rabbitmq feature? I'm just not sure how it works
16:45:09 <sgolovatiuk> if message is large :)
16:45:18 <dims> mihgen: we need to verify what is there on the client side, so far i have not looked at the client side stuff used in astute
16:45:29 <mihgen> 4mb is large guys? I was sending gigabytes over it back in 2010..
16:45:56 <sgolovatiuk> 2 seconds is not enough for 4MB
16:46:04 <sgolovatiuk> even on a 10Gb NIC
16:46:22 <mihgen> dims: it's likely dumb amqp lib..
16:46:30 <ikalnitsky> and quite a lot of cpu time was scheduled to other astute workers
16:46:44 <ikalnitsky> why are we talking about heartbeats at all?
16:46:46 <dims> mihgen: ack, i'll take an action item to research what we use :)
16:46:50 <ikalnitsky> there are no problems with heartbeats
16:47:19 <mihgen> ikalnitsky: we'd need to create a bug for 8.0 then to research if new rabbit will help
16:47:36 <ikalnitsky> our problem is that we don't handle our messages properly, and fixing that is the first thing we must do
16:47:55 <ikalnitsky> i just straced astute and rabbitmq and figured out some strange thing
16:48:19 <ikalnitsky> rabbitmq calls the shutdown syscall on the socket after the heartbeat timeout
16:48:36 <ikalnitsky> but after this, astute successfully writes to the rabbitmq socket
16:48:39 <ikalnitsky> with no errors
16:48:46 <ikalnitsky> i have no idea why this happens
16:48:56 <ikalnitsky> there must be some error
16:49:06 <maximov> shutdown can close receiver or sender or both
16:49:08 <ikalnitsky> but strace shows that "writev" syscall is successful
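What ikalnitsky describes matches ordinary TCP semantics: after the peer shuts a connection down, the next local write still "succeeds" (the kernel accepts it into the send buffer and only then receives the RST back), and the error only surfaces on a later write. A self-contained demo, independent of rabbitmq or astute:

```python
# Demo: the first write after the peer drops the connection succeeds;
# only a subsequent write reports the failure (ECONNRESET/EPIPE).
import socket
import time

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
conn.close()                       # the "rabbitmq" side drops the connection

time.sleep(0.1)
print(client.send(b"x" * 1024))    # succeeds: data enters the kernel buffer
time.sleep(0.1)                    # the peer answers with RST meanwhile
try:
    client.send(b"x" * 1024)       # only now does the failure surface
except OSError as exc:             # ECONNRESET or EPIPE, depending on timing
    print("second write failed:", exc)
```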
16:49:41 <mihgen> weird.. but how often did we have issues before introducing evgenyl's patch?
16:49:55 <mihgen> I mean, how much is it worth paying so much attention to it in general?
16:49:57 <ikalnitsky> looks like not often
16:50:19 <mihgen> we have hundreds of deployments every day..
16:50:28 <ikalnitsky> both reverts are merged, so qa will tell us how it works :)
16:50:49 <ikalnitsky> still, mihgen, we should properly handle submitted messages
16:51:04 <mihgen> ikalnitsky: ok. thank you for the quick turnaround here
16:51:21 <mihgen> ikalnitsky: yep.. looks like it'll need some further research
16:51:30 <mihgen> 9 min, xarses - moving to the next one?
16:51:39 <xarses> #topic https://bugs.launchpad.net/fuel/+bug/1481714 (bogdando)
16:51:41 <openstack> Launchpad bug 1481714 in Fuel for OpenStack "Zabbix plugin: Wrong check for RabbitMQ epmd process" [High,New] - Assigned to Maciej Relewicz (rlu)
16:52:05 <maximov> nurla has concerns that it could be a problem in fuel core
16:52:16 <maximov> rather than an issue in the plugin
16:52:35 <maximov> I would like to hear comments from bogdando
16:52:39 <maximov> who triaged it
16:52:49 <mihgen> yeah it's saying that we run rabbit under root first
16:52:57 <mihgen> which is weird
16:53:51 <mihgen> sgolovatiuk: dims: may be you guys know something about it?
16:53:51 <xarses> seems we are still missing bogdando
16:54:15 <mihgen> how rabbit starts, is it considered a normal behavior to start as root?
16:55:09 <maximov> i think it's because we start rabbitmq in a container
16:55:46 <maximov> it is isolated in the container
16:56:15 <dims> mihgen: we probably need bogdando for this as i am not sure
16:56:16 <xarses> maximov: i thought this was about the controller
16:56:20 <mwhahaha> https://bugs.launchpad.net/fuel/+bug/1483249
16:56:21 <openstack> Launchpad bug 1483249 in Fuel for OpenStack 8.0.x "rabbitmq epmd process running from user 'root'" [Medium,Confirmed] - Assigned to Fuel Library Team (fuel-library)
16:57:08 <mihgen> is it a duplicate?
16:57:26 <mihgen> this is about controller nodes or master node?
16:57:42 <mwhahaha> on the controller it runs as root initially i think
16:57:54 <mwhahaha> so i think that's just what it does; why does zabbix care what user it's run as?
16:58:00 <mihgen> on controllers we actually run two rabbits
16:58:25 <mwhahaha> 1481714 is about the zabbix plugin itself, could it be updated to not check the user but just that the process is running?
16:58:25 <mihgen> one for murano
16:58:30 <mwhahaha> only if murano is enabled
16:58:50 <mwhahaha> my test env doesn't have murano so i only have one epmd and it's currently running as root
16:59:25 <mihgen> mwhahaha: +1
16:59:49 <mwhahaha> i would fix the zabbix plugin to not check the user for the epmd to resolve 1481714
17:00:08 <mwhahaha> and we can look into what user it's supposed to be running as for 8.0 and fix the other bug
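The proposed fix is in the Zabbix plugin's check itself: drop the user match and only assert that the process exists. Purely as an illustration of that predicate, here is a Python/psutil sketch; the real change would live in the plugin's item/check definition, not in fuel code.

```python
# Sketch of the check mwhahaha proposes: assert that epmd is running at
# all, without pinning it to a particular user. Illustrative only.
import psutil

def epmd_running():
    return any(p.info["name"] == "epmd"
               for p in psutil.process_iter(["name"]))

if __name__ == "__main__":
    print("epmd running:", epmd_running())
```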
17:00:18 <maximov> can you add this comment to the bug please, mwhahaha
17:00:22 <mwhahaha> sure
17:00:49 <sgolovatiuk> time!
17:00:55 <xarses> #endmeeting