16:00:39 <xarses> #startmeeting fuel
16:00:39 <xarses> #chair xarses
16:00:40 <xarses> Today's Agenda:
16:00:40 <xarses> #link https://etherpad.openstack.org/p/fuel-weekly-meeting-agenda
16:00:40 <xarses> Who's here?
16:00:40 <openstack> Meeting started Thu Sep 3 16:00:39 2015 UTC and is due to finish in 60 minutes. The chair is xarses. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:41 <maximov> hi
16:00:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 <openstack> The meeting name has been set to 'fuel'
16:00:45 <mwhahaha> hi
16:00:46 <rvyalov> hi
16:00:46 <openstack> Current chairs: xarses
16:00:53 <mihgen> hi
16:01:02 <ashtokolov> \o/
16:01:16 <kozhukalov> hi
16:01:24 <dpyzhov> hi
16:02:25 <xarses> ok, let's get going
16:02:27 <xarses> #topic librarian update (mwhahaha)
16:02:35 <mwhahaha> As reported last week, we are starting the work to prepare versions of the modules on fuel-infra.
16:02:36 <mwhahaha> #link https://review.fuel-infra.org/#/q/status:open+topic:bug/1490469
16:02:36 <mwhahaha> The overall work is also being tracked on a spreadsheet prepared by degorenko.
16:02:37 <mwhahaha> #link https://docs.google.com/spreadsheets/d/1P8xJbYyHXnb0W7fVme3jkOUYj4OTCfbplZEYLX7SC0E/edit#gid=1195205959
16:02:39 <mwhahaha> As you can see from the spreadsheet, mos-puppet and the openstack teams have been working hard to identify and propose the fuel-specific changes required. That being said, we have some modules where the fuel-library team should probably take the lead in managing upstream changes and working to prepare a fuel-infra version for the 8.0 cycle.
16:02:41 <mwhahaha> In the spreadsheet we have identified the corosync, haproxy, mysql, openssl, rabbitmq and rsyslog modules as ones that the fuel-library team should take responsibility for: figuring out upstream changes and adapting.
16:02:43 <mwhahaha> As I mentioned last week, many of these modules have diverged from the upstream and may require significant work to flesh out a path forward. These modules may continue to live within the fuel-library code base for the 8.0 cycle, but I will be spending time trying to evaluate how much effort will be involved in moving these to an upstream version.
16:02:45 <mwhahaha> Questions?
16:04:25 <xarses> can you update the sharing rights on the doc?
16:04:32 <mwhahaha> i don't own it
16:04:35 <mwhahaha> but i'll reach out
16:05:18 <mihgen> thanks mwhahaha
16:05:37 <mihgen> who else will work on converging these modules to the upstream ones?
16:06:18 <mwhahaha> Other than the openstack teams, i'm not aware of anyone else currently working on these
16:06:32 <mihgen> there is a status column, and there are many open questions on the fuel-library side
16:07:02 <mihgen> looks like we'd need to split it between people
16:07:25 <mihgen> so we can work in parallel?
16:07:37 <mwhahaha> Ideally yes, we should identify some people to work on these
16:08:04 <mwhahaha> i know some people have expressed a desire to work on some of these, as there is some tech debt they are aware of
16:09:02 <mihgen> those in fuel-library who are becoming available from bugfixing may take those
16:09:54 <mihgen> I was thinking about how we execute such things
16:10:29 <mihgen> we have like ten places with tasks
16:10:29 <mihgen> looks like we'd need to decide whether it's trello or just a plain etherpad with the list of things
16:10:35 <mihgen> and ensure it's sorted in priority order
16:10:48 <mihgen> and those who get free from bugs can check against one single place
16:11:07 <mwhahaha> we can also create bugs for the modules
16:11:11 <mwhahaha> to assign ownership
16:11:27 <mwhahaha> I did create a few bugs previously as part of the initial migration
16:11:38 <mihgen> possibly. Ensure those have the "feature" tag
16:12:22 <mwhahaha> The permissions should be fixed in the spreadsheet
16:12:30 <mwhahaha> so we can leverage this if that works for everyone
16:12:33 <mihgen> ok guys - let's sync over email on this, let's move on now
16:12:37 <xarses> moving on?
16:12:42 <xarses> =)
16:12:56 <xarses> #topic HCF bugs review
16:13:19 <xarses> Ok, so we're going to go through some bugs blocking HCF and discuss status and the like
16:13:37 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1490523 (idv1985)
16:13:39 <openstack> Launchpad bug 1490523 in Fuel for OpenStack "OpenStack client hangs forever after simultaneous add/delete controller" [Critical,In progress] - Assigned to Dmitry Ilyin (idv1985)
16:14:26 <dilyin> this bug is already fixed, i've added retries and a timeout to the openstack provider
16:14:29 <sgolovatiuk> xarses: as far as I know the fix is ready
16:14:41 <dilyin> now i'm going to post the changes upstream
16:14:57 <mihgen> can you provide a link to the patch pls?
16:14:57 <xarses> good to hear
16:15:37 <ashtokolov> https://review.openstack.org/#/c/219668/
16:16:01 <mihgen> so it's not yet fixed since it's still not merged ;)
16:16:19 <mihgen> so it's just more retries?
16:16:39 <aglarendil_> yep, apache restart leads to wsgi accepting but not processing requests
16:16:50 <aglarendil_> and the openstack client was hanging forever
16:17:05 <aglarendil_> so we added timeout + retries there
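For context, a minimal Python sketch of the timeout-plus-retries pattern dilyin and aglarendil_ describe. The actual fix is a change to the puppet openstack provider (https://review.openstack.org/#/c/219668/); the function name and values below are illustrative, not the merged code:

    # A hung client (wsgi accepting but not processing requests after an
    # apache restart) now hits a timeout instead of blocking forever, and
    # the call is retried while the API recovers.
    import subprocess

    def run_openstack(args, retries=10, timeout=30):
        last_error = None
        for attempt in range(retries):
            try:
                return subprocess.run(
                    ["openstack"] + args, capture_output=True,
                    check=True, timeout=timeout)
            except (subprocess.TimeoutExpired,
                    subprocess.CalledProcessError) as exc:
                last_error = exc  # retry: the API may still be coming back up
        raise RuntimeError(
            "openstack call failed after %d attempts" % retries) from last_error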
16:17:28 <xarses> ok, next up
16:17:35 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491725 (kozhukalov)
16:17:36 <openstack> Launchpad bug 1491725 in Fuel for OpenStack "Deployment failed with error: <class 'cobbler.cexceptions.CX'>:'MAC address duplicated: 0c:c4:7a:14:25:36'" [Critical,In progress] - Assigned to Vladimir Kozhukalov (kozhukalov)
16:17:52 <kozhukalov> i am working on this
16:18:08 <kozhukalov> #link https://review.openstack.org/#/c/220191/
16:18:33 <kozhukalov> still in progress, testing to make sure nothing was broken by this patch
16:18:48 <mihgen> kozhukalov: scale related only?
16:18:59 <maximov> what are the chances of fixing it by Sunday this week?
16:19:01 <kozhukalov> yep
16:19:09 <kozhukalov> it is hard to reproduce
16:19:28 <kozhukalov> but the root cause is that cobbler is not intended to be scalable
16:19:33 <mihgen> kozhukalov: is it only when you remove an env and immediately create a new one?
16:19:44 <mihgen> or is it just that when you remove, not all entries are removed from cobbler?
16:19:55 <kozhukalov> it is when you remove plenty of nodes at the same time
16:20:01 <mihgen> so it's regardless of when you want the new env to be created
16:20:04 <kozhukalov> some of those nodes can still be there
16:20:29 <kozhukalov> and then the new env tries to add the same nodes with other names
16:20:50 <sgolovatiuk> maybe we need to keep removing until we are sure all nodes are deleted
16:20:51 <kozhukalov> but cobbler does not allow having two nodes with the same MAC
16:20:57 <sgolovatiuk> not fire and forget ...
16:21:30 <sgolovatiuk> no matter how many cycles we need to delete data from cobbler
16:22:08 <mihgen> kozhukalov: so there is a patch from you,
16:22:12 <aglarendil_> oh! sleep & retry
16:22:20 <mihgen> is it close to a solution .. ?
16:22:30 <kozhukalov> sgolovatiuk, yes, maybe you are right
16:22:41 <kozhukalov> we need to try again and again
16:22:47 <kozhukalov> will add this
16:23:04 <kozhukalov> mihgen, yes, the patch is here https://review.openstack.org/#/c/220191/
16:23:17 <xarses> just group the deletes into batches?
16:23:27 <mihgen> question is will it actually lead to success)
16:23:30 <kozhukalov> mihgen, it is quite close
16:23:31 <mihgen> and when do we get rid of cobbler )
16:23:45 <sgolovatiuk> :)
16:23:46 <kozhukalov> like i said, i'm working on testing this patch
16:24:39 <kozhukalov> mihgen, at least we will know that if there are nodes with the same MAC, we will try to remove them
16:24:49 <mihgen> thanks kozhukalov
16:24:51 <kozhukalov> but yes, it might not be enough
16:25:02 <kozhukalov> I'll add retries
16:25:05 <xarses> ok, we need to move on to get through the others.
16:25:06 <mihgen> it was probably very hard to understand what happens in this bug..
16:25:19 <maximov> kozhukalov just to confirm - the patch is almost ready and we will merge it by Sunday, is this correct understanding?
16:25:25 <mattymo_> pre-emptive delete by mac is probably a smart idea
16:25:26 <aglarendil_> folks, just use SDD - sleep driven development
16:25:36 <kozhukalov> maximov, I think yes
16:25:56 <mihgen> aglarendil_: :)
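A sketch of the retry-until-gone deletion sgolovatiuk suggests instead of fire-and-forget, assuming cobbler's XML-RPC remote API (find_system/remove_system). The endpoint, credentials, and retry values are illustrative, and this is not the code from review 220191:

    # Keep removing any cobbler system that owns the MAC and verify it is
    # actually gone, so a new env cannot trip over a stale record and fail
    # with 'MAC address duplicated'.
    import time
    import xmlrpc.client

    def purge_system_by_mac(server, token, mac, retries=5, delay=2):
        for _ in range(retries):
            names = server.find_system({"mac_address": mac})
            if not names:
                return True  # nothing left holding this MAC
            for name in names:
                server.remove_system(name, token)
            time.sleep(delay)  # let cobbler settle before re-checking
        return False

    server = xmlrpc.client.ServerProxy("http://10.20.0.2/cobbler_api")
    token = server.login("cobbler", "cobbler")  # credentials illustrative
    purge_system_by_mac(server, token, "0c:c4:7a:14:25:36")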
16:26:04 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491015 (akislitsky)
16:26:05 <openstack> Launchpad bug 1491015 in Fuel for OpenStack "System test 'check_openstack_stat' failed: No JSON object could be decoded" [Critical,In progress] - Assigned to Alexander Kislitsky (akislitsky)
16:26:05 <mihgen> xarses: next bug?
16:26:57 <kozhukalov> aglarendil_, NAADD (Nice Aglarendil Advice Driven Development :-)
16:27:04 <akislitsky> Patch is on review. A proper fix will be done in 8.0. For now we have a workaround. The patch is tested in unit tests and manually on a live env
16:27:06 <mihgen> sbog: are stats broken because we enabled SSL?
16:27:48 <sbog> mihgen: couple secs
16:27:55 <akislitsky> mihgen, yep. it is due to enabling SSL
16:28:34 <mihgen> quite sad that we're catching such things this late in the cycle. ok akislitsky so your fix is ready to go?
16:28:44 <mihgen> we just need reviews / small fixes if needed?
16:28:51 <sbog> yep, seems so
16:29:21 <akislitsky> mihgen, it is ready to be merged
16:29:28 <mihgen> akislitsky: already merged by ikalnitsky :)
16:29:35 <mihgen> cool, thanks guys!
16:29:37 <xarses> it was just merged
16:29:42 <mihgen> let's move on)
16:29:45 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1491306 (bogdando)
16:29:46 <openstack> Launchpad bug 1491306 in Fuel for OpenStack "Rabbit join race with OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed after reschedule router from primary controller and destroying it" [Critical,In progress] - Assigned to Bogdan Dobrelya (bogdando)
16:31:24 <mihgen> bogdando: can you share a bit on what it's about?
16:32:26 <xarses> seems he's not around
16:33:08 <xarses> #link https://bugs.launchpad.net/mos/+bug/1491576 - not reproducible anymore! Considered as Invalid
16:33:09 <openstack> Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:33:28 <xarses> seems invalid now, the bug doesn't think so though
16:33:41 <sgolovatiuk> we made manual deployments with dtyzhnenko
16:34:00 <sgolovatiuk> I was not able to reproduce it anymore
16:34:12 <xarses> Ok, can we update the LP then?
16:34:19 <mihgen> sgolovatiuk: how did you try to repro?
16:34:36 <sgolovatiuk> I've been trying the whole day
16:34:49 <mihgen> how exactly?
16:34:50 <sgolovatiuk> not reproducible
16:35:04 <sgolovatiuk> with D. Tyzhnenko who opened it
16:35:11 <sgolovatiuk> using the steps in the bug
16:35:22 <mihgen> did you try to put some load / ostf, and do an apache2 reload on all controllers at the same time?
16:35:29 <sgolovatiuk> yep
16:35:43 <sgolovatiuk> apache reload is not related
16:35:48 <sgolovatiuk> it's a different bug
16:36:16 <mwhahaha> doing a reload does restart keystone...
16:36:18 <mihgen> oh sorry, I mixed it up with another bug, yeah
16:36:23 <sgolovatiuk> we shouldn't mix them ... but there can be a side effect of another bug which is already resolved
16:36:34 <mihgen> mwhahaha: I'm actually a bit surprised..
16:36:51 <mihgen> reload is supposed to be fetching config
16:37:00 <mihgen> but it actually works like restart
16:37:23 <mihgen> is that how wsgi things are supposed to work under apache?
16:37:43 <mihgen> or maybe we should just configure something differently, so that reload is not that dramatic?
16:37:46 <mwhahaha> yes, because keystone services get managed via apache now, so that's to be expected i guess
16:37:57 <sgolovatiuk> I don't like mod_wsgi
16:38:06 <sgolovatiuk> we need to invest time in uwsgi
16:38:19 <sgolovatiuk> in that case apache will be a proxy
16:38:21 <mwhahaha> would need to get upstream puppet keystone to support it
16:38:24 <mihgen> mwhahaha: I'm not sure - in order to get logs rotated, do we need to restart the whole service...
16:38:42 <sgolovatiuk> logrotate - SIGHUP
16:38:47 <sgolovatiuk> that's enough
16:38:55 <sgolovatiuk> we don't need to restart apache
16:39:06 <mihgen> isn't apache2 reload actually SIGHUP?
16:39:13 <mwhahaha> no
16:39:17 <mihgen> wow
16:39:30 <mwhahaha> but reload is the same as a graceful
16:39:33 * sgolovatiuk nods
16:40:00 <mihgen> well then why do we use apache reload in our logrotate scripts, if we need SIGHUP.. ?
16:40:12 <mwhahaha> that's the script that ships with apache2 i think
16:40:35 <mwhahaha> i couldn't find that logrotate script in fuel-library
16:40:46 <mihgen> sgolovatiuk: sorry I shifted the conversation to talk about that bug, not this one..
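Background for the reload/SIGHUP exchange above, as a hedged sketch: apache maps SIGUSR1 to a graceful restart and SIGHUP to a full restart, and Debian's apache2ctl reload performs a graceful (i.e. SIGUSR1), so even a "reload" recycles the mod_wsgi children running keystone, which is what the bug observes. A postrotate hook that signals the apache parent directly would look roughly like this in Python; the pidfile path is illustrative:

    # Signal the apache parent instead of running a full reload.
    # SIGUSR1 = graceful restart (children finish in-flight requests);
    # SIGHUP = hard restart (children are killed immediately).
    import os
    import signal

    def signal_apache(sig=signal.SIGUSR1,
                      pidfile="/var/run/apache2/apache2.pid"):
        with open(pidfile) as f:
            os.kill(int(f.read().strip()), sig)

    signal_apache()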
16:41:09 <xarses> sorry guys, we need to keep moving
16:41:25 <xarses> #link https://bugs.launchpad.net/mos/+bug/1491576 (holser)
16:41:26 <openstack> Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:41:52 <sgolovatiuk> we've already discussed it
16:41:58 <sgolovatiuk> I am still investigating it
16:42:01 <xarses> =)
16:42:11 <sgolovatiuk> I need a bit of time to produce a review
16:42:19 <sgolovatiuk> my ETA - end of tomorrow
16:42:24 <xarses> ok
16:42:31 <xarses> #link https://bugs.launchpad.net/fuel/+bug/1461562 (ikalnitsky)
16:42:32 <openstack> Launchpad bug 1461562 in Fuel for OpenStack "Failed to casting message to the nailgun. RabbitMQ was dead" [Critical,In progress] - Assigned to Igor Kalnitsky (ikalnitsky)
16:43:01 <ikalnitsky> ok, guys, we've been facing this issue since the beginning of this week
16:43:15 <mihgen> ikalnitsky: this one is really annoying. According to dims, it should be fixed by upgrading rabbitmq..
16:43:26 <ikalnitsky> yeah, it could be
16:43:36 <mihgen> but I'm not sure if it's gonna be enough, as openstack uses oslo and there are reconnects, etc.
16:43:41 <sgolovatiuk> that's not true
16:43:50 <mihgen> whoops
16:43:53 <sgolovatiuk> on the master node we still have rabbitmq 3.3
16:43:53 <ikalnitsky> but i hope that after reverting the patches, it will be barely reproducible
16:43:57 <sgolovatiuk> not 3.5
16:44:12 <ikalnitsky> sgolovatiuk: there's something strange on the system level :(
16:44:14 <sgolovatiuk> I think the upgrade to Centos 6.6 affected it
16:44:15 <dims> mihgen: i said, we have to test with newer rabbitmq and verify heartbeats work
16:44:16 <mihgen> sgolovatiuk: yep, that's why I'm saying that in 8.0 we would upgrade it
16:44:30 <ikalnitsky> heartbeats actually work
16:44:35 <ikalnitsky> we had a 2 sec timeout
16:44:43 <ikalnitsky> and when astute took up to 100% cpu usage
16:44:49 <mihgen> dims: yeah but do you need heartbeat implementations on the client side?
16:44:53 <ikalnitsky> sometimes 2 sec wasn't enough to send a message
16:45:07 <mihgen> or is it an internal rabbitmq feature? I'm just not sure how it works
16:45:09 <sgolovatiuk> if the message is large :)
16:45:18 <dims> mihgen: we need to verify what is there on the client side, so far i have not looked at the client side stuff used in astute
16:45:29 <mihgen> 4mb is large, guys? I was sending gigabytes over it back in 2010..
16:45:56 <sgolovatiuk> 2 seconds is not enough for 4Mb
16:46:04 <sgolovatiuk> even on a 10GB NIC
16:46:22 <mihgen> dims: it's likely a dumb amqp lib..
16:46:30 <ikalnitsky> and quite a lot of cpu time was scheduled to other astute workers
16:46:44 <ikalnitsky> why are we talking about heartbeats at all?
16:46:46 <dims> mihgen: ack, i'll take an action item to research what we use :)
16:46:50 <ikalnitsky> there are no problems with heartbeats
16:47:19 <mihgen> ikalnitsky: we'd need to create a bug for 8.0 then, to research whether a new rabbit will help
16:47:36 <ikalnitsky> our problem is that we can't handle our messages properly, and that's the first thing we must fix
16:47:55 <ikalnitsky> i just straced astute and rabbitmq and figured out a strange thing
16:48:19 <ikalnitsky> rabbitmq calls the shutdown syscall on the socket after a heartbeat timeout
16:48:36 <ikalnitsky> but after this, astute successfully writes to the rabbitmq socket
16:48:39 <ikalnitsky> with no errors
16:48:46 <ikalnitsky> i have no idea why this happens
16:48:56 <ikalnitsky> there must be some error
16:49:06 <maximov> shutdown can close the receiver or the sender or both
16:49:08 <ikalnitsky> but strace shows that the "writev" syscall is successful
16:49:41 <mihgen> weird.. but how often did we have issues before introducing evgenyl's patch?
16:49:55 <mihgen> I mean, how much is it worth in general to pay so much attention to it?
16:49:57 <ikalnitsky> looks like not often
16:50:19 <mihgen> we have hundreds of deployments every day..
16:50:28 <ikalnitsky> both reverts are merged, so qa will tell us how it works :)
16:50:49 <ikalnitsky> still, mihgen, we should properly handle submitted messages
16:51:04 <mihgen> ikalnitsky: ok. thank you for a quick turnaround here
16:51:21 <mihgen> ikalnitsky: yep.. looks like it'll need some further research
16:51:30 <mihgen> 9 min, xarses - moving to the next one?
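ikalnitsky's strace observation matches the half-closed-socket behavior maximov points at: after the peer calls shutdown, a write can still succeed into the local socket buffer, and only a later write surfaces the error, so a publisher that does not wait for broker confirms can lose a message silently. A self-contained toy demonstration (timing-dependent, and not the astute code):

    # First send() succeeds even though the peer has shut the socket down;
    # the failure (ECONNRESET/EPIPE) only shows up on a later write.
    import socket
    import time

    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    peer, _ = server.accept()
    peer.shutdown(socket.SHUT_RDWR)   # what rabbitmq does on heartbeat timeout

    print(client.send(b"payload"))    # succeeds: bytes queue in the buffer
    time.sleep(0.2)                   # let the peer's RST arrive
    try:
        client.send(b"payload")
        client.send(b"payload")
        print("writes still reported success")
    except OSError as exc:
        print("write eventually fails:", exc)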
16:51:39 <xarses> #topic https://bugs.launchpad.net/fuel/+bug/1481714 (bogdando)
16:51:41 <openstack> Launchpad bug 1481714 in Fuel for OpenStack "Zabbix plugin: Wrong check for RabbitMQ epmd process" [High,New] - Assigned to Maciej Relewicz (rlu)
16:52:05 <maximov> nurla has concerns that it could be a problem in fuel core
16:52:16 <maximov> rather than an issue in the plugin
16:52:35 <maximov> I would like to hear comments from bogdando
16:52:39 <maximov> who triaged it
16:52:49 <mihgen> yeah, it's saying that we run rabbit under root first
16:52:57 <mihgen> which is weird
16:53:51 <mihgen> sgolovatiuk: dims: maybe you guys know something about it?
16:53:51 <xarses> seems we are still missing bogdando
16:54:15 <mihgen> how does rabbit start, is it considered normal behavior to start as root?
16:55:09 <maximov> i think because we start rabbitmq in a container
16:55:46 <maximov> it is isolated in a container
16:56:15 <dims> mihgen: we probably need bogdando for this as i am not sure
16:56:16 <xarses> maximov: i thought this was about the controller
16:56:20 <mwhahaha> https://bugs.launchpad.net/fuel/+bug/1483249
16:56:21 <openstack> Launchpad bug 1483249 in Fuel for OpenStack 8.0.x "rabbitmq epmd process running from user 'root'" [Medium,Confirmed] - Assigned to Fuel Library Team (fuel-library)
16:57:08 <mihgen> is it a duplicate?
16:57:26 <mihgen> is this about controller nodes or the master node?
16:57:42 <mwhahaha> controller, it runs as root initially i think
16:57:54 <mwhahaha> so i think that's just what it does; why does zabbix care what user it's run as
16:58:00 <mihgen> on controllers we actually run two rabbits
16:58:25 <mwhahaha> 1481714 is the zabbix plugin itself, could it be updated to not check the user but just that the process is running?
16:58:25 <mihgen> one for murano
16:58:30 <mwhahaha> only if murano is enabled
16:58:50 <mwhahaha> my test env doesn't have murano so i only have one epmd and it's currently running as root
16:59:25 <mihgen> mwhahaha: +1
16:59:49 <mwhahaha> i would fix the zabbix plugin to not check the user for the epmd, to resolve 1481714
17:00:08 <mwhahaha> and we can look into what user it's supposed to be running as for 8.0 and fix the other bug
17:00:18 <maximov> can you add this comment to the bug please, mwhahaha
17:00:22 <mwhahaha> sure
17:00:49 <sgolovatiuk> time!
17:00:55 <xarses> #endmeeting
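For reference, a sketch of the user-agnostic check mwhahaha proposes for the zabbix plugin: verify that an epmd process exists without asserting its owner (it may legitimately be root on controllers). In zabbix item terms that is roughly proc.num[epmd] instead of a user-qualified proc.num[epmd,rabbitmq]; the standalone version below just scans /proc and is illustrative, not the plugin code:

    # Return 1 if any process named 'epmd' is running, regardless of user.
    import os

    def epmd_running():
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/comm" % pid) as f:
                    if f.read().strip() == "epmd":
                        return True
            except OSError:  # process exited while we were scanning
                continue
        return False

    print(1 if epmd_running() else 0)  # zabbix-style 0/1 output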