16:01:19 #startmeeting Fuel
16:01:20 Meeting started Thu Jul 24 16:01:19 2014 UTC and is due to finish in 60 minutes. The chair is vkozhukalov. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:23 The meeting name has been set to 'fuel'
16:01:27 #chair vkozhukalov
16:01:28 Current chairs: vkozhukalov
16:01:30 Hi
16:01:32 hi everyone
16:01:34 o/
16:01:35 hi
16:01:38 agenda as usual
16:01:41 hi
16:01:49 #link https://etherpad.openstack.org/p/fuel-weekly-meeting-agenda
16:01:55 hi boys and girls
16:02:04 #topic Greetings
16:02:22 #topic 5.1 soft code freeze
16:02:22 hey folks
16:02:33 we plan SCF for today folks
16:02:43 mihgen, I'm sure you have something to say on this topic
16:02:47 let's keep working collaboratively on our patches
16:03:05 we have a lot of bugs still in progress, many of them have at least one +1
16:03:31 for those which have at least one +1 but didn't get approval from core folks yet, I'm going to give 1 more day to be merged
16:03:45 I think otherwise we are good enough to call SCF
16:04:05 we will have to review bug by bug and move them to the next milestone (only Low and Medium of course)
16:04:38 I raised the priority of some Medium bugs to make sure they get into the release, as they are really High rather than Medium
16:04:39 if some bug is important or hits many people, we may want to increase its priority
16:04:42 #link https://wiki.openstack.org/wiki/Fuel/5.1_Release_Schedule
16:05:12 please don't forget to add comments when you change bug priority
16:05:14 please don't forget about this important link, it has our schedule
16:05:28 sorry folks, my Mac is acting crazy and slowing me down today (
16:05:48 any objections to calling SCF today, with the exception which I mentioned?
16:06:06 #link https://wiki.openstack.org/wiki/Fuel/Soft_Code_Freeze - about SCF
16:06:35 looks like no objections so far, thanks everyone for the hard work squashing bugs!
16:06:37 mihgen: sounds good
16:06:38 there are some fixes I would like to merge into the library
16:06:39 no objections
16:06:53 but I guess they are attached to high-priority bugs
16:06:56 so I do not have any
16:07:20 cool. thanks. vkozhukalov - let's go on
16:07:33 ok, thanks mihgen, moving on
16:07:42 #topic 5.0.1 release status
16:07:48 Here's the current status of 5.0.1:
16:07:52 #link https://launchpad.net/fuel/+milestone/5.0.1
16:07:57 Some bugs are only tracked in mos; these can be found at:
16:08:02 #link https://launchpad.net/mos/+milestone/5.0.1
16:08:07 The primary blocker for 5.0.1 is still the RabbitMQ HA problem:
16:08:12 #link https://bugs.launchpad.net/mos/+bug/1340711
16:08:14 Launchpad bug 1340711 in mos/5.0.x "[OSCI] building oslo-messaging for rpm and deb" [Critical,In progress]
16:08:17 There are some new bugs in 5.0.1:
16:08:21 #link https://bugs.launchpad.net/bugs/1348166
16:08:22 Launchpad bug 1348166 in fuel/5.1.x "Upgrades, rollback for bootstrap brakes slaves bootstrapping" [High,In progress]
16:08:26 #link https://bugs.launchpad.net/bugs/1346093
16:08:27 Launchpad bug 1346093 in fuel "[System Tests] Dhcrelay didn't start on master in system tests" [High,Fix committed]
16:08:41 any comments on the above two? are they really blockers for 5.0.1?
16:09:04 I understand the temptation to squeeze more fixes into 5.0.1 while the mos team is dealing with the RabbitMQ problem
16:09:22 I'm testing the patch for 1348166 right now, and it looks good.
16:09:37 do we need any other bugs in 5.0.1?
16:09:42 evgeniyl: what if we don't merge it in 5.0.1?
16:10:04 angdraug: all < Critical should go away for sure
16:10:24 we just need to make sure they are not Criticals
16:10:25 mihgen: fater rollback user will have broken cobbler
16:10:35 after
16:10:46 evgeniyl: if I do 5.0 -> 5.0.1 and then rollback?
16:11:15 a broken cobbler means we can't deploy anything, isn't that Critical?
16:11:20 or is there a workaround?
16:11:34 mihgen: yep, 5.0 -> 5.0.1, and if there is some problem during the upgrade, rollback will break cobbler
16:11:53 evgeniyl: ok, only if there was a problem with the upgrade
16:11:53 the user can fix it manually, that's why I set High priority for the bug
16:12:01 still sounds pretty critical
16:12:11 the upgrade can fail and we should be ready for it
16:12:21 one more incoming bug from the mos team:
16:12:24 #link https://bugs.launchpad.net/mos/+bug/1348158
16:12:29 Launchpad bug 1348158 in mos "Murano-agent does not get rabbit settings" [Critical,New]
16:12:30 angdraug: if we test the patch, I think we should have the fix in 5.0.1
16:12:57 agreed
16:13:10 anyone from the murano team to comment on that?
16:13:42 iyozhikov: ^^
16:14:18 doubt he is here
16:14:22 let's get to it later
16:14:27 angdraug: let's move on
16:14:30 if it's as bad as it says on the tin, and there's a fix today or tomorrow, we should have it in 5.0.1 too
16:14:32 yep.
16:14:41 all other bugs are not Critical
16:14:52 unless someone objects now, I'll move them to 5.0.2
16:14:58 ok, good. waiting for oslo.messaging to be fixed then
16:15:05 yes. that concludes the 5.0.1 discussion
16:15:05 angdraug: let's do it
16:15:12 #topic 6.0 blueprints status
16:15:13 angdraug: thanks!
16:15:38 Folks, we have lots of blueprints there, many of them with high statuses
16:15:49 we don't have that kind of velocity for sure :)
16:16:03 let's schedule a blueprint triage day
16:16:07 if you take a look at the schedule again, we would like to have Juno support before the summit
16:16:10 aglarendil: agree
16:16:48 and discuss it in #fuel-dev, I guess right after the 5.0.1 release
16:17:04 no need to wait for the 5.0.1 release
16:17:05 and another very important feature is plugins
16:17:23 as discussed above, the fuel team's work on 5.0.1 is mostly done, so everyone should move on
16:17:25 So the most important things for 6.0 are: Juno support, HA fixes and other bugs, openstack upgrade building blocks and plugins, and also UI refactoring to improve UX
16:17:55 +1 to ^
16:17:56 also, vkozhukalov and agordeev have been working on image-based provisioning for a while now
16:18:07 so I assume we're gonna finally get it done in 6.0
16:18:12 vkozhukalov: are we?)
16:18:29 yes
16:18:32 :)
16:18:40 also granular deployment is critical for 6.0, I guess
16:18:47 no
16:18:49 and the upstream manifests merge
16:19:04 upstream manifests, yes -- we've done most of it in 5.1
16:19:06 and test coverage: rspecs and puppet module tests
16:19:16 ok, let's schedule a bp triaging meeting for 6.0
16:19:20 also fencing and log management
16:19:23 sustaining stuff in puppet is very important
16:19:46 aglarendil, can plugins in the library be implemented without granular deployment?
16:19:49 we are having most of our process issues there, so we definitely need to provide better test coverage etc.
16:20:21 akislitsky_: I will not hope
16:20:28 well, let's discuss the rest outside the scope of this meeting
16:20:37 mihgen: I agree
16:20:46 moving on
16:20:52 the items I listed are the most important ones, but not a complete list, obviously
16:21:01 vkozhukalov: sure
16:21:14 #topic Features
16:21:25 #topic upstream neutron based ml2: missed feature freeze exception deadline
16:21:38 xarses: hi, did you see the email from aglarendil and me?
16:21:48 yes
16:21:57 upstream ML2 has been moved to 6.0. Multi-node HA testing is basically impossible due to rabbit or galera issues, which compound testing and resolving possible issues with multiple controllers present.
16:22:13 xarses: :(
16:22:28 well, looks like we can't do anything here
16:22:45 and just need to get it merged in 6.0?
16:22:51 I think xarses should keep working on it and have it ready for when stable/5.1 is branched
16:23:01 We need to fix HA
16:23:18 yes, that's our next topic I think
16:23:23 xarses: we have working HA right now
16:23:25 and some other CI issues, I added them to the end of the agenda
16:23:42 wait aglarendil, why are we mixing it all here?
16:23:42 xarses: it might be due to puppet ordering issues
16:23:46 HA, CI,
16:24:12 let's move to the next topic and then we won't be off topic
16:24:13 moving on
16:24:14 stable/5.1? I'm lost, why do we want to have it ready exactly by stable/5.1?
16:24:31 HA and CI are certainly different topics
16:24:33 mihgen: because then master will be open for 6.0 stuff
16:24:44 angdraug: ohhh ok yeah
16:24:53 #topic Bugs
16:24:56 now it's clear, thanks
16:25:04 #topic 5.1 bvt tests failure trends: what breaks it most often lately
16:25:15 aglarendil: your turn on our smoke-HA
16:25:39 over the last several weeks, bvt for ubuntu (and to a lesser extent centos) has been failing more often than passing
16:25:55 angdraug: mihgen most of these tests were on custom ISOs
16:26:10 angdraug: mihgen we spoke with nurla and decided to separate these jobs
16:26:30 angdraug: mihgen nightly tests were mostly passing
16:26:41 aglarendil: ok so then what's the current status?
16:26:44 we created separate jobs, see the mail "Custom iso builds and custom bvt tests at Jenkins Product"
16:27:06 I think we need to stop doing single-controller HA testing, it's possibly a large reason why we see a lot of failures
16:27:06 nurla: aglarendil I see smoke failed now..
16:27:10 mihgen: we have bvt tests passing. the only problem I saw was an OSTF test on centos
16:27:23 mihgen: nurla we are waiting for an email from rvyalov
16:27:32 mihgen: It should be fixed in the next build
16:27:43 xarses: great, if something does not work, let's stop testing it -)
16:27:44 mihgen: some package mirror issue
16:28:02 we need to at least do 2-controller HA testing so that primary_controller true|false modes can be tested
16:28:37 xarses: I will talk about it with holser a little bit later, related to the galera improvements
16:28:38 vkozhukalov: no, single-controller HA passes to easily, while multiple-controller HA fails a lot, we need to test the harder case
16:29:09 s/passes to easily/passes too easily/
16:29:18 xarses: thanx, now I see
16:29:18 nurla: can we optimize builds so that the master node is reused for all of centos simple, centos HA & Ubuntu HA, reducing time and resources?
16:29:33 having 3 deployments in parallel, we have the multiple envs feature, right..
16:30:09 unfortunately, we don't have support for parallel deployment in our tests
16:30:30 nurla: ok.. we will talk about it later
16:30:47 ok, moving on then
16:31:00 mihgen, nurla: rmoe and I have some ideas around that too
16:31:08 ok, let's create a blueprint to improve this
16:31:10 folks, but let's keep all attention on smoke tests and BVT
16:31:30 these are very important things now, the only places where we can see regressions fast
16:31:33 yes, and please let's watch for trends
16:31:42 #topic Fuel CI Python Unit Tests failures
16:31:53 dpyzhov: around?
16:32:04 akislitsky_: ^^ ?
16:32:21 this is a real issue these days
16:32:25 I have fixed all known lock and inconsistency issues in the CI tests; on my local machine it passes fine, but after a rebase on master it fails. I'm going to fix it tomorrow and merge it into master
16:32:39 akislitsky_: sounds optimistic
16:32:52 looks like we had so many red builds because of these locks?
16:32:57 akislitsky_: this will really help
16:33:19 I hope so; the patch is not small, we need to review it tomorrow
16:34:04 here it is clear
16:34:06 moving on
16:34:10 mihgen, yep, locks and inconsistency of the data in the DB. we have fuzzy commits to the DB in the code
16:34:36 #topic Ceph bugs status
16:34:39 akislitsky_: meow-nofer thanks folks, looking forward to seeing it all green :)
16:34:45 The blocker issue that affected all ceph deployments with 5.1 is now fixed:
16:34:52 #link https://bugs.launchpad.net/fuel/+bug/1333814
16:34:54 Launchpad bug 1333814 in fuel "Environment deployment failed with ceph-deploy 1.5.2" [Critical,Fix committed]
16:34:58 This leaves another critical bug that only affects large-scale deployments:
16:35:04 #link https://bugs.launchpad.net/fuel/+bug/1341009
16:35:06 Launchpad bug 1341009 in fuel "[osci] obsolete ceph package in fuel-repository for 5.0.1 and 5.1" [Critical,In progress]
16:35:25 there are also medium and low priority bugs open, speak up if something must be fixed in 5.1
16:36:36 #link https://bugs.launchpad.net/fuel/+bugs?field.tag=ceph
16:37:23 what about
16:37:23 #1335628 seems to be automatically reopened by gerrit
16:37:28 #link https://bugs.launchpad.net/fuel/+bug/1335880
16:37:30 Launchpad bug 1335880 in fuel "[library] ERR: ceph-deploy osd activate node-11:/dev/sdb2 node-11:/dev/sdc2 returned 1 instead of one of [0]" [High,Confirmed]
16:37:40 it looks like it has some duplicates
16:38:29 yes, needs to be cleaned up
16:38:32 rmoe: ^
16:38:57 rmoe: I was not able to reproduce 1333779
16:39:06 I'll take a look at it but it looks like the same issue
16:39:14 maybe it is not as critical as it is written
16:39:45 #link https://bugs.launchpad.net/fuel/+bug/1266853
16:39:46 Launchpad bug 1266853 in fuel "[library] Adding Compute Node fails - ceph/manifests/conf.pp:44" [Medium,Confirmed]
16:39:51 also, this one looks really High
16:41:06 should be fixed by the new ceph-deploy 1.5.9, needs a retest
16:41:43 to sum up: 1266853 is critical, so we need to raise its level
16:42:11 huh?
16:42:14 1333779 is not reproducible (waiting for a diagnostic snapshot)
16:42:46 how can 1266853 be critical?
16:43:03 1335880 needs to be fixed
16:43:38 angdraug: if it is confirmed, it breaks the compute addition feature
16:43:46 1266853: a compute node cannot be added
16:43:52 1335880 needs to be confirmed not to be a duplicate of 1323343
16:44:26 ok, let's look at it more carefully
16:44:30 moving on
16:44:38 angdraug: should you and rmoe and I just go through and triage all of them?
16:44:45 and go from there?
16:44:50 yes
16:44:55 let's move on
16:44:57 #topic Some announcements
16:44:59 ok, then that's our action
16:45:10 #topic rabbitmq HA tightening status
16:45:20 here
16:45:35 so, I have been working on some HA stuff for rabbitmq
16:45:47 it turns out we can fix almost all rabbitmq issues
16:45:52 I mean the server part
16:46:07 #link https://review.openstack.org/#/c/108792/
16:46:22 this is still a draft
16:46:37 I hope to finish it during the next week and finally deal with the rabbitmq clustering stuff
16:46:47 as of yesterday it was still taking a long time to assemble the cluster and sometimes would still be broken
16:47:05 xarses: yep, that's what I am going to address with this
16:47:08 rmoe was having issues with it taking very long after a cold start
16:47:24 xarses: talking to low-level stuff in mnesia
16:47:33 xarses: in OCF scripts
16:47:38 this will help us a lot
16:47:50 and kick a dead controller out of the cluster quickly
16:47:57 right now it takes 60 seconds
16:48:04 for rabbitmq to declare a node dead
16:48:10 I spoke to the rabbitmq guys. it is configurable
16:48:15 by the net_ticktime parameter
16:48:21 but I am afraid of false positives
16:48:30 so I am gonna still leverage information from corosync
16:48:32 and kick dead nodes
16:48:41 with low-level operations
16:49:07 why do you have to go through mnesia?
16:49:17 can't you remove a node from the cluster via rabbitmqctl?
16:49:23 it cannot
16:49:29 angdraug: it first asks for the status
16:49:35 angdraug: and sends rpc:multicall to all the nodes
16:49:43 that it thinks are alive
16:49:50 and the timeout is net_ticktime
16:49:57 guys, this is a very detailed discussion
16:49:59 this freezes all the common rabbitmqctl operations
16:50:12 let's make it more abstract
16:50:19 we have 10 minutes
16:50:21 #link https://groups.google.com/forum/#!topic/rabbitmq-users/f5pVGtR5ct0
16:50:22 aglarendil: please add all these details to your commit message :)
16:50:28 here is a link to the discussion
16:50:34 angdraug: I will
16:50:55 let's move on
16:50:57 #topic galera improvements status
16:51:21 I will talk
16:51:31 holser and I finished work on the galera improvements
16:51:39 currently we are testing a fix for a small bug
16:51:52 also these improvements allow for parallel deployment of controllers
16:52:03 we are running bvt tests now
16:52:12 if they pass, we will be able to extend the CI tests
16:52:15 as of yesterday (fuel-library master and iso 340), galera still has assembly issues. I had several controllers fail to get mysql working
16:52:17 to run all the controllers
16:52:33 xarses: there is an ubuntu issue
16:52:39 let me show the link
16:53:01 xarses: #link https://bugs.launchpad.net/fuel/+bug/1347007
16:53:03 Launchpad bug 1347007 in fuel "Pacemaker cannot assemble Galera Cluster on Ubuntu" [High,In progress]
16:53:15 xarses: the fix is being tested
16:53:40 after that we will do what you want with CI coverage and decrease deployment time drastically
16:53:59 I am done
16:54:18 great
16:54:26 thanx a lot aglarendil
16:54:40 #topic Third Party CI testing
16:54:47 We need to have CI bots for integrations like MLNX, NSX, or vCenter, which we can't do with our current jenkins slaves. There have been a couple of changes that I'm hesitant to review, or know would break these components, but we don't have the CI to show that it's a problem. I propose that we start setting up CI bots to do basic tests so that we can ensure that we don't introduce large breakages with each commit.
16:56:13 what are they supposed to look like? i mean ci bots
16:56:37 is there kind of a scheme for how it is supposed to work?
16:56:45 flow etc.
16:56:53 like what we see in openstack projects, like vmware minesweeper
16:57:24 but some features require specific hardware, we may only check regressions for third party components
16:57:26 they attach to gerrit triggers and post voting or non-voting reviews
16:57:34 ok, I just did not see them
16:58:13 is there an email discussion for that?
16:58:29 nurla: that's the point of why they need to be "third party" bots, in that we know they run in the required environment
16:58:47 vkozhukalov: no, there is no ML thread yet
16:58:56 we don't have time now, let's start an ML thread
16:59:12 yes, will do
16:59:16 aglarendil: +1
16:59:23 thanx everyone
16:59:26 great meeting
16:59:30 closing
16:59:31 bye
16:59:36 #endmeeting
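
Appendix note on the rabbitmq HA tightening topic: a minimal sketch of the eviction idea discussed above (kicking members out of the RabbitMQ cluster based on corosync/pacemaker membership instead of waiting the roughly 60 seconds of net_ticktime-based detection). This is not the change under review at https://review.openstack.org/#/c/108792/; it assumes only that a set of dead controller hostnames is already available from the cluster manager, and uses the standard `rabbitmqctl cluster_status` and `rabbitmqctl forget_cluster_node` commands on a live controller.

```python
#!/usr/bin/env python
# Minimal sketch (not the actual OCF change from review 108792): evict RabbitMQ
# cluster members whose hosts corosync/pacemaker already consider dead, instead
# of waiting for RabbitMQ's own net_ticktime-based detection.
import re
import subprocess


def rabbit_cluster_members():
    """Return the rabbit@host node names RabbitMQ still lists as members."""
    out = subprocess.check_output(['rabbitmqctl', 'cluster_status'])
    return set(re.findall(r'rabbit@[\w.-]+', out.decode()))


def evict_dead_members(dead_hostnames):
    """Forget members running on hosts reported dead by the cluster manager.

    dead_hostnames is assumed to come from pacemaker/corosync; how exactly it
    is obtained is deliberately left out of this sketch.
    """
    for member in rabbit_cluster_members():
        host = member.split('@', 1)[1]
        if host in dead_hostnames:
            # forget_cluster_node removes the stopped member from the cluster
            # metadata immediately, without waiting for the net tick timeout.
            subprocess.check_call(
                ['rabbitmqctl', 'forget_cluster_node', member])


if __name__ == '__main__':
    evict_dead_members({'node-2'})  # example: corosync reported node-2 as lost
```

As noted in the discussion, rabbitmqctl itself can block on rpc:multicall to unreachable nodes until net_ticktime expires, which is why the actual change talks about lower-level mnesia operations from the OCF script; the sketch only illustrates the eviction step.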
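
Appendix note on the third-party CI topic: such bots typically attach to the Gerrit event stream over SSH and post voting or non-voting reviews, similar to vmware minesweeper. Below is a minimal sketch of that flow; the bot account, the `stackforge/fuel-library` project filter, and the `run_basic_tests` hook are placeholders, not an existing Fuel CI bot.

```python
#!/usr/bin/env python
# Minimal sketch of a third-party CI bot: listen to the Gerrit event stream and
# post a non-voting comment on new fuel-library patch sets. The bot account,
# project name and run_basic_tests() hook are hypothetical placeholders.
import json
import subprocess

GERRIT_SSH = ['ssh', '-p', '29418', 'ci-bot@review.openstack.org', 'gerrit']


def run_basic_tests(change, patchset):
    """Placeholder for a driver-specific smoke test run (MLNX/NSX/vCenter lab)."""
    return True


def post_review(change_num, patchset_num, message):
    # 'gerrit review' posts the comment back; a voting bot would also pass a
    # label option (e.g. '--verified +1/-1') if its account may vote on it.
    # Single quotes keep the message as one argument on the remote side.
    subprocess.check_call(GERRIT_SSH + [
        'review', '--message', "'%s'" % message,
        '%s,%s' % (change_num, patchset_num)])


def main():
    # 'gerrit stream-events' emits one JSON event per line.
    stream = subprocess.Popen(GERRIT_SSH + ['stream-events'],
                              stdout=subprocess.PIPE)
    for line in stream.stdout:
        event = json.loads(line.decode())
        if event.get('type') != 'patchset-created':
            continue
        if event['change']['project'] != 'stackforge/fuel-library':
            continue
        ok = run_basic_tests(event['change'], event['patchSet'])
        post_review(event['change']['number'], event['patchSet']['number'],
                    'third-party smoke test %s' % ('passed' if ok else 'FAILED'))


if __name__ == '__main__':
    main()
```

The point raised in the meeting still holds: the value of such a bot comes from where it runs (a lab with the required MLNX/NSX/vCenter hardware), not from the listener itself.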