16:00:39 #startmeeting fuel
16:00:39 #chair xarses
16:00:40 Today's Agenda:
16:00:40 #link https://etherpad.openstack.org/p/fuel-weekly-meeting-agenda
16:00:40 Who's here?
16:00:40 Meeting started Thu Sep 3 16:00:39 2015 UTC and is due to finish in 60 minutes. The chair is xarses. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:41 hi
16:00:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 The meeting name has been set to 'fuel'
16:00:45 hi
16:00:46 hi
16:00:46 Current chairs: xarses
16:00:53 hi
16:01:02 \o/
16:01:16 hi
16:01:24 hi
16:02:25 ok, let's get going
16:02:27 #topic librarian update (mwhahaha)
16:02:35 As reported last week, we are starting the work to prepare versions of the modules on fuel-infra.
16:02:36 #link https://review.fuel-infra.org/#/q/status:open+topic:bug/1490469
16:02:36 The overall work is also being tracked in a spreadsheet prepared by degorenko.
16:02:37 #link https://docs.google.com/spreadsheets/d/1P8xJbYyHXnb0W7fVme3jkOUYj4OTCfbplZEYLX7SC0E/edit#gid=1195205959
16:02:39 As you can see from the spreadsheet, mos-puppet and the openstack teams have been working hard to identify and propose the fuel-specific changes required. That being said, we have some modules where the fuel-library team should probably take the lead in managing upstream changes and preparing a fuel-infra version for the 8.0 cycle.
16:02:41 In the spreadsheet we have identified the corosync, haproxy, mysql, openssl, rabbitmq, and rsyslog modules as ones the fuel-library team should take responsibility for: figuring out upstream changes and adapting them.
16:02:43 As I mentioned last week, many of these modules have diverged from upstream and may require significant work to flesh out a path forward. These modules may continue to live within the fuel-library code base for the 8.0 cycle, but I will be spending time evaluating how much effort will be involved in moving them to an upstream version.
16:02:45 Questions?
16:04:25 can you update the sharing rights on the doc?
16:04:32 i don't own it
16:04:35 but i'll reach out
16:05:18 thanks mwhahaha
16:05:37 who else will work on converging these modules to the upstream ones?
16:06:18 Other than the openstack teams, i'm not aware of anyone else currently working on these
16:06:32 there is a status column; there are many open questions on the fuel-library side
16:07:02 looks like we'd need to split it between people
16:07:25 so we can work in parallel?
16:07:37 Ideally yes, we should identify some people to work on these
16:08:04 i know some people have expressed a desire to work on some of these, as they address tech debt some are already aware of
16:09:02 those in fuel-library who are freeing up from bugfixing may take those
16:09:54 I was thinking about how we execute such things
16:10:29 we have like ten places with tasks
16:10:29 looks like we'd need to decide between trello or just a plain etherpad with a list of things
16:10:35 ensure it's sorted in priority order
16:10:48 and those who get free from bugs can check against one single place
16:11:07 we can also create bugs for the modules
16:11:11 to assign ownership
16:11:27 I did create a few bugs previously as part of the initial migration
16:11:38 possibly. Ensure those have the "feature" tag
16:12:22 The permissions should be fixed in the spreadsheet
16:12:30 so we can leverage this if that works for everyone
16:12:33 ok guys - let's sync over email on this, let's move on now
16:12:37 moving on?
16:12:42 =)
16:12:56 #topic HCF bugs review
16:13:19 Ok, so we're going to go through some bugs blocking HCF and discuss status and the like
16:13:37 #link https://bugs.launchpad.net/fuel/+bug/1490523 (idv1985)
16:13:39 Launchpad bug 1490523 in Fuel for OpenStack "OpenStack client hangs forever after simultaneous add/delete controller" [Critical,In progress] - Assigned to Dmitry Ilyin (idv1985)
16:14:26 this bug is already fixed, i've added retries and a timeout to the openstack provider
16:14:29 xarses: as far as I know the fix is ready
16:14:41 now i'm going to post the changes upstream
16:14:57 can you provide a link to the patch pls?
16:14:57 good to hear
16:15:37 https://review.openstack.org/#/c/219668/
16:16:01 so it's not yet fixed since it's still not merged ;)
16:16:19 so it's just more retries?
16:16:39 yep, an apache restart led to wsgi accepting but not processing requests
16:16:50 and the openstack client was hanging forever
16:17:05 so we added timeout + retries there
16:17:28 ok, next up
16:17:35 #link https://bugs.launchpad.net/fuel/+bug/1491725 (kozhukalov)
16:17:36 Launchpad bug 1491725 in Fuel for OpenStack "Deployment failed with error: 'MAC address duplicated: 0c:c4:7a:14:25:36'" [Critical,In progress] - Assigned to Vladimir Kozhukalov (kozhukalov)
16:17:52 i am working on this
16:18:08 #link https://review.openstack.org/#/c/220191/
16:18:33 still in progress, testing to make sure nothing was broken by this patch
16:18:48 kozhukalov: scale related only?
16:18:59 what are the chances of fixing it by Sunday this week?
16:19:01 yep
16:19:09 it is hard to reproduce
16:19:28 but the root cause is that cobbler is not intended to be scalable
16:19:33 kozhukalov: is it only when you remove an env and immediately create a new one?
16:19:44 or is it just that when you remove, not all entries are removed from cobbler?
16:19:55 it is when you remove plenty of nodes at the same time
16:20:01 so it happens regardless of when the new env is created
16:20:04 some of those nodes can still be there
16:20:29 and then the new env tries to add the same nodes with other names
16:20:50 maybe we need to keep removing until we are sure all nodes are deleted
16:20:51 but cobbler does not allow having two nodes with the same MAC
16:20:57 not fire and forget ...
16:21:30 no matter how many cycles we need to delete data from cobbler
16:22:08 kozhukalov: so there is a patch from you,
16:22:12 oh! sleep & retry
16:22:20 is it close to a solution?
16:22:30 sgolovatiuk, yes, maybe you are right
16:22:41 we need to try again and again
16:22:47 will add this
16:23:04 mihgen, yes, the patch is here https://review.openstack.org/#/c/220191/
16:23:17 just group the deletes into batches?
16:23:27 the question is whether it will actually lead to success)
16:23:30 mihgen, it is quite close
16:23:31 and when do we get rid of cobbler )
16:23:45 :)
16:23:46 like i said, i'm working on testing this patch
16:24:39 mihgen, at least we will know that if there are nodes with the same MAC, we will try to remove them
16:24:49 thanks kozhukalov
16:24:51 but yes, it might not be enough
16:25:02 I'll add retries
16:25:05 ok, we need to move on to get through the others.
16:25:06 it was probably very hard to understand what happens in this bug..
16:25:19 kozhukalov just to confirm - the patch is almost ready and we will merge it by Sunday, is this correct understanding?
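[Editor's sketch, bug 1490523] The actual fix is in the Puppet openstack provider (https://review.openstack.org/#/c/219668/, Ruby), but the timeout-plus-retries pattern idv1985 describes looks roughly like the Python sketch below. The URL, attempt count, and backoff values are illustrative assumptions, not the patch's actual settings.

```python
import time
import requests  # assumed available; the real fix is Ruby inside the Puppet provider

def call_openstack_api(url, attempts=5, timeout=10, backoff=2):
    """Call an OpenStack endpoint, bounding each request with a socket
    timeout and retrying, so a wsgi worker that accepts connections but
    never answers (the 1490523 failure mode) cannot hang us forever."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=timeout)  # timeout covers connect + read
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.ConnectionError):
            if attempt == attempts:
                raise  # give up after the last attempt instead of hanging forever
            time.sleep(backoff * attempt)  # simple linear backoff between retries
```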
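[Editor's sketch, bug 1491725] A minimal sketch of the "sleep & retry" cobbler cleanup kozhukalov and sgolovatiuk discuss above, assuming cobbler 2.x's XML-RPC methods (`login`, `find_system`, `remove_system`) behave as documented; the actual patch is https://review.openstack.org/#/c/220191/ and may differ in detail. The point is verifying deletion rather than fire-and-forget.

```python
import time
import xmlrpc.client

def remove_node_by_mac(cobbler_url, user, password, mac, attempts=10, pause=5):
    """Delete every cobbler system owning `mac` and re-check until the MAC
    is really gone, so a new env cannot trip over a duplicate entry."""
    server = xmlrpc.client.ServerProxy(cobbler_url)  # e.g. http://master/cobbler_api
    token = server.login(user, password)
    for _ in range(attempts):
        names = server.find_system({"mac_address": mac})  # systems still holding this MAC
        if not names:
            return  # confirmed deleted, not fire-and-forget
        for name in names:
            server.remove_system(name, token)
        time.sleep(pause)  # sleep & retry: give cobbler time to settle
    raise RuntimeError("MAC address still present in cobbler: %s" % mac)
```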
16:25:25 pre-emptive delete by mac is probably a smart idea
16:25:26 folks, just use SDD - sleep driven development
16:25:36 maximov, I think yes
16:25:56 aglarendil_: :)
16:26:04 #link https://bugs.launchpad.net/fuel/+bug/1491015 (akislitsky)
16:26:05 Launchpad bug 1491015 in Fuel for OpenStack "System test 'check_openstack_stat' failed: No JSON object could be decoded" [Critical,In progress] - Assigned to Alexander Kislitsky (akislitsky)
16:26:05 xarses: next bug?
16:26:57 aglarendil_, NAADD (Nice Aglarendil Advice Driven Development :-)
16:27:04 The patch is on review. A proper fix will be done in 8.0; for now we have a workaround. The patch is covered by unit tests and tested manually on a live env
16:27:06 sbog: are the stats broken because we enabled SSL?
16:27:48 mihgen: couple secs
16:27:55 mihgen, yep. it is due to enabling SSL
16:28:34 quite sad that we are catching such things this late in the cycle.. ok akislitsky, so your fix is ready to go?
16:28:44 we just need reviews / small fixes if needed?
16:28:51 yep, seems so
16:29:21 mihgen, it is ready to be merged
16:29:28 akislitsky: already merged by ikalnitsky :)
16:29:35 cool, thanks guys!
16:29:37 it was just merged
16:29:42 let's move on)
16:29:45 #link https://bugs.launchpad.net/fuel/+bug/1491306 (bogdando)
16:29:46 Launchpad bug 1491306 in Fuel for OpenStack "Rabbit join race with OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed after reschedule router from primary controller and destroying it" [Critical,In progress] - Assigned to Bogdan Dobrelya (bogdando)
16:31:24 bogdando: can you share a bit about what this one is about?
16:32:26 seems he's not around
16:33:08 #link https://bugs.launchpad.net/mos/+bug/1491576 - not reproducible anymore! Considered Invalid
16:33:09 Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:33:28 seems invalid now; the bug status doesn't reflect that yet though
16:33:41 we made manual deployments with dtyzhnenko
16:34:00 I was not able to reproduce it anymore
16:34:12 Ok, can we update the LP then?
16:34:19 sgolovatiuk: how did you try to repro?
16:34:36 I've been trying all day long
16:34:49 how exactly?
16:34:50 not reproducible
16:35:04 with D. Tyzhnenko, who opened it
16:35:11 using the steps in the bug
16:35:22 did you try to put some load / ostf on it, and do an apache2 reload on all controllers at the same time?
16:35:29 yep
16:35:43 the apache reload is not related
16:35:48 that's a different bug
16:36:16 doing a reload does restart keystone...
16:36:18 oh sorry, I mixed it up with another bug, yeah
16:36:23 we shouldn't mix them ... but there can be a side effect of another bug which is already resolved
16:36:34 mwhahaha: I'm actually a bit surprised..
16:36:51 reload is supposed to just re-read the config
16:37:00 but it actually works like a restart
16:37:23 is that how wsgi things are supposed to work under apache?
16:37:43 or maybe we should just configure something differently, so that a reload is not that dramatic?
16:37:46 yes, because keystone services are managed via apache now, so that's to be expected i guess
16:37:57 I don't like mod_wsgi
16:38:06 we need to invest time in uwsgi
16:38:19 in that case apache will be a proxy
16:38:21 we would need upstream puppet keystone to support it
16:38:24 mwhahaha: I'm not sure - in order to get logs rotated, do we really need to restart the whole service...?
16:38:42 logrotate - SIGHUP
16:38:47 that's enough
16:38:55 we don't need to restart apache
16:39:06 isn't apache2 reload actually a SIGHUP?
16:39:13 no
16:39:17 wow
16:39:30 but reload is the same as a graceful
16:39:33 * sgolovatiuk nods
16:40:00 well then why do we use apache reload in our logrotate scripts, if we need SIGHUP..?
16:40:12 that's the script that ships with apache2, i think
16:40:35 i couldn't find that logrotate script in fuel-library
16:40:46 sgolovatiuk: sorry, I shifted the conversation to talk about that bug, not this one..
16:41:09 sorry guys, we need to keep moving
16:41:25 #link https://bugs.launchpad.net/mos/+bug/1491576 (holser)
16:41:26 Launchpad bug 1491576 in Mirantis OpenStack "logrotate script for apache leads to restarting keystone service" [High,Confirmed] - Assigned to Sergii Golovatiuk (sgolovatiuk)
16:41:52 we've already discussed it
16:41:58 I am still investigating it
16:42:01 =)
16:42:11 I need a bit of time to produce a review
16:42:19 my ETA - end of tomorrow
16:42:24 ok
16:42:31 #link https://bugs.launchpad.net/fuel/+bug/1461562 (ikalnitsky)
16:42:32 Launchpad bug 1461562 in Fuel for OpenStack "Failed to casting message to the nailgun. RabbitMQ was dead" [Critical,In progress] - Assigned to Igor Kalnitsky (ikalnitsky)
16:43:01 ok, guys, we've been facing this issue since the beginning of this week
16:43:15 ikalnitsky: this one is really annoying. According to dims, it should be fixed by upgrading rabbitmq..
16:43:26 yeah, it could be
16:43:36 but I'm not sure it's going to be enough, as openstack uses oslo and there are reconnects, etc.
16:43:41 that's not true
16:43:50 whoops
16:43:53 on the master node we still have rabbitmq 3.3
16:43:53 but i hope that after reverting the patches, it will be barely reproducible
16:43:57 not 3.5
16:44:12 sgolovatiuk: there's something strange at the system level :(
16:44:14 I think the upgrade to CentOS 6.6 affected it
16:44:15 mihgen: as i said, we have to test with a newer rabbitmq and verify heartbeats work
16:44:16 sgolovatiuk: yep, that's why I'm saying that in 8.0 we would upgrade it
16:44:30 heartbeats actually work
16:44:35 we had a 2 sec timeout
16:44:43 and astute took up to 100% cpu usage
16:44:49 dims: yeah, but do you need heartbeat implementations on the client side?
16:44:53 sometimes 2 sec wasn't enough to send a message
16:45:07 or is it an internal rabbitmq feature? I'm just not sure how it works
16:45:09 if the message is large :)
16:45:18 mihgen: we need to verify what is there on the client side; so far i have not looked at the client-side stuff used in astute
16:45:29 4 MB is large, guys? I was sending gigabytes over it back in 2010..
16:45:56 2 seconds is not enough for 4 MB
16:46:04 even on a 10Gb NIC
16:46:22 dims: it's likely a dumb amqp lib..
16:46:30 and when quite a lot of cpu time was scheduled to the other astute workers
16:46:44 why are we talking about heartbeats at all?
16:46:46 mihgen: ack, i'll take an action item to research what we use :)
16:46:50 there's no problem with heartbeats
16:47:19 ikalnitsky: we'd need to create a bug for 8.0 then, to research whether a new rabbit will help
16:47:36 our problem is that we don't handle our submitted messages properly, and that's the first thing we must fix
16:47:55 i just straced astute and rabbitmq and figured out a strange thing
16:48:19 rabbitmq calls the shutdown syscall on the socket after a heartbeat timeout
16:48:36 but after this, astute successfully writes to the rabbitmq socket
16:48:39 with no errors
16:48:46 i have no idea why that happens
16:48:56 there must be some error
16:49:06 shutdown can close the receiving side, the sending side, or both
16:49:08 but strace shows that the "writev" syscall is successful
16:49:41 weird..
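[Editor's sketch, bug 1461562] Astute itself is Ruby, so the Python/pika sketch below is only an analogue of ikalnitsky's "properly handle submitted messages" point, not the actual code. It shows how broker-side publisher confirms catch exactly the failure mode straced above: a writev() that succeeds at the syscall level even though rabbitmq has already shut down its side of the socket. The host, queue name, and heartbeat value are illustrative assumptions.

```python
import pika  # third-party AMQP client, assumed available for this sketch

def publish_with_confirm(body, host="localhost", queue="nailgun"):
    """Publish one message and wait for the broker's ack, so a socket that
    still accepts writes after rabbitmq shut down its end is detected
    instead of the message being silently lost."""
    params = pika.ConnectionParameters(host=host, heartbeat=30)  # generous heartbeat
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.confirm_delivery()  # broker must now ack every publish
    try:
        channel.basic_publish(exchange="", routing_key=queue, body=body,
                              mandatory=True)  # raises if unroutable or nacked
    finally:
        connection.close()
```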
but how often did we have issues before introducing evgenyl's patch?
16:49:55 I mean, how worthwhile is it in general to pay so much attention to it?
16:49:57 looks like not often
16:50:19 we have hundreds of deployments every day..
16:50:28 both reverts are merged, so qa will tell us how it works :)
16:50:49 still, mihgen, we should properly handle submitted messages
16:51:04 ikalnitsky: ok. thank you for the quick turnaround here
16:51:21 ikalnitsky: yep.. looks like it'll need some further research
16:51:30 9 min, xarses - moving to the next one?
16:51:39 #topic https://bugs.launchpad.net/fuel/+bug/1481714 (bogdando)
16:51:41 Launchpad bug 1481714 in Fuel for OpenStack "Zabbix plugin: Wrong check for RabbitMQ epmd process" [High,New] - Assigned to Maciej Relewicz (rlu)
16:52:05 nurla has concerns that it could be a problem in fuel core
16:52:16 rather than an issue in the plugin
16:52:35 I would like to hear comments from bogdando
16:52:39 who triaged it
16:52:49 yeah, it's saying that we run rabbit under root first
16:52:57 which is weird
16:53:51 sgolovatiuk: dims: maybe you guys know something about it?
16:53:51 seems we are still missing bogdando
16:54:15 how does rabbit start? is it considered normal behavior for it to start as root?
16:55:09 i think because we start rabbitmq in a container
16:55:46 it is isolated in a container
16:56:15 mihgen: we probably need bogdando for this, as i am not sure
16:56:16 maximov: i thought this was about the controller
16:56:20 https://bugs.launchpad.net/fuel/+bug/1483249
16:56:21 Launchpad bug 1483249 in Fuel for OpenStack 8.0.x "rabbitmq epmd process running from user 'root'" [Medium,Confirmed] - Assigned to Fuel Library Team (fuel-library)
16:57:08 is it a duplicate?
16:57:26 is this about the controller nodes or the master node?
16:57:42 on the controller it runs as root initially, i think
16:57:54 so i think that's just what it does; why does zabbix care what user it runs as?
16:58:00 on controllers we actually run two rabbits
16:58:25 1481714 is about the zabbix plugin itself; could it be updated to not check the user, just that the process is running?
16:58:25 one for murano
16:58:30 only if murano is enabled
16:58:50 my test env doesn't have murano, so i only have one epmd and it's currently running as root
16:59:25 mwhahaha: +1
16:59:49 i would fix the zabbix plugin to not check the user for epmd, to resolve 1481714
17:00:08 and we can look into what user it's supposed to run as for 8.0 and fix the other bug
17:00:18 can you add this comment to the bug please, mwhahaha
17:00:22 sure
17:00:49 time!
17:00:55 #endmeeting
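[Editor's sketch, bug 1481714] To illustrate mwhahaha's proposed plugin change: check that epmd is running at all, rather than which user owns it. The real fix would presumably be in the zabbix plugin's item key (for instance dropping the user argument from a `proc.num[...]` check); the Python/psutil function below, with a hypothetical name, just shows the same idea in code.

```python
import psutil  # third-party process library, assumed available for this sketch

def epmd_is_running():
    """Return True if any epmd process exists, regardless of its owner.
    The old check also matched on the user and failed when epmd ran as root."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == "epmd":
            return True
    return False
```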