16:00:26 #startmeeting neutron_ci
16:00:26 Meeting started Tue Feb 28 16:00:26 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:30 The meeting name has been set to 'neutron_ci'
16:00:35 o/
16:01:00 o/
16:01:50 I think kevinbenton and jlibosva won't join today for different reasons but nevertheless it's worth running through outstanding actions
16:01:59 #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:02:04 #topic Action items from previous meeting
16:02:09 o/
16:02:28 "ihrachys to look at e-r bot for openstack-neutron channel"
16:02:40 this landed: https://review.openstack.org/#/c/433828/
16:02:50 has anyone seen the bot reporting anything in the channel since then? :)
16:03:21 doesn't seem like it did a single time
16:03:36 gotta give it some time and see if it's some issue in configuration, or just no positive hits
16:03:58 #action ihrachys to monitor e-r irc bot reporting in the channel
16:04:18 "manjeets to polish the dashboard script and propose it for neutron/tools/"
16:04:37 manjeets: I know you proposed the patch for neutron tree but armax asked to move it to another place
16:04:55 has it happened so far?
16:04:58 0/
16:05:06 (the neutron patch was https://review.openstack.org/#/c/433893/)
16:05:12 i'll move it today, was busy with other stuff yesterday
16:05:26 ok cool, thanks for working on it
16:05:40 #action manjeets to repropose the CI dashboard script for reviewday
16:05:58 next item was "jlibosva to try reducing parallelization for scenario tests"
16:06:23 this landed: https://review.openstack.org/#/c/434866/
16:07:26 I believe scenario job was still not particularly stable due to qos tests so we also landed https://review.openstack.org/#/c/437011/ that temporarily disables the test that measures bandwidth
16:07:41 ajo was going to rework the test once more
16:08:09 at this point the failure rate for the job is: ovs ~ 40% and linuxbridge ~ 100%
16:09:00 seems like lb trunk test fails consistently due to missing connectivity: http://logs.openstack.org/69/438669/1/check/gate-tempest-dsvm-neutron-scenario-linuxbridge-ubuntu-xenial-nv/a16519c/testr_results.html.gz
16:09:09 could be a legit failure, will need to follow up with armax on the matter
16:09:28 #action ihrachys to follow up with armax on why trunk connectivity test fails for lb scenario job
16:09:48 ok next was "ihrachys to look at getting more info from kernel about ram-locked memory segments"
16:10:03 there are a bunch of patches on review
16:10:25 one is enabling the needed logging in peakmem_tracker service: https://review.openstack.org/#/c/434470/ (needs some small rework)
16:10:54 btw we enabled peakmem_tracker service in all gates yesterday, to help with memory consumption / oom-killer debugging: https://review.openstack.org/#/c/434511/
16:11:27 since the first patch renames peakmem_tracker, this patch also enables the service with the new name: https://review.openstack.org/434474
16:11:43 ihrachys: according to johndperkins, kibana isn't showing oom-killer issues since Feb 23
16:11:51 I hope that will give us an answer if any process blocks huge chunks of memory from swapping
16:12:08 dasm: oh nice. do we still see libvirtd crashes?
16:12:33 i don't know. i didn't look into that
16:13:23 there is an etherpad tracking oom-killer from infra side: https://etherpad.openstack.org/p/OOM_Taskforce
16:13:52 yes I think libvirt crashes are still happening
16:13:53 although, lack of oom-killer problems could be connected to lower usage of infra during the PTG.
16:14:10 dims had one pulled up yesterday
16:14:50 * electrocucaracha wonders if discovered something
16:15:00 clarkb: could it indeed be correlated with the lower utilization of the cloud as dasm suggests?
16:15:20 clarkb: i was looking through libvirt logs for 2nd and 3rd items in the rechecks list - http://status.openstack.org/elastic-recheck/ - not much luck
16:15:24 ihrachys: maybe? the other angle people were looking at was OOMs seem to happen more often on Rax which is Xen based so potentially related to that
16:16:04 clarkb: I assume that would require some close work with the owners of the cloud. do we have those relationships?
16:16:42 ihrachys: for rackspace johnthetubaguy is probably a good contact particularly for compute related things?
16:16:57 clarkb: the other thing that beslemon mentioned to me was the recent update of their centos images
16:17:16 electrocucaracha: who is beslemon? I'm probably missing some context.
16:17:26 infra runs its own images too that should update daily
16:17:45 ihrachys: she is a racker (perf engineer of the RPC)
16:18:22 ihrachys: she was helping us (me, dasm, johndperkins) last week to discover something
16:18:37 ok. if she works on any of that, it could make sense for her to join the meeting and update on her findings.
16:19:21 ihrachys: well, she only has a limited time for that assignment but I'm going to tell her to include her findings
16:19:45 in related news, several projects noticed instability in their scenario jobs because of broken (or disabled) nested virtualization in some clouds. I heard that from octavia.
16:20:00 neutron also experienced some slowdown and timeouts for some jobs
16:20:16 dumping cpu flags of the allocated nodes could give some clue: https://review.openstack.org/#/c/433949/
16:20:54 electrocucaracha: ack. at least an email or smth, otherwise we work in isolation and don't benefit from multiple eyes looking at the same thing from the same angle.
16:21:14 ihrachys: +1
16:21:45 ok these were all action items from the previous meeting, moving on
16:21:58 #topic PTG update
16:22:08 some CI matters were covered during the PTG the previous week
16:22:48 some points were captured in the etherpad: https://etherpad.openstack.org/p/neutron-ptg-pike-final lines 15-38
16:23:11 some things were also captured by kevinbenton in his report email: http://lists.openstack.org/pipermail/openstack-dev/2017-February/113032.html
16:23:34 we will need to follow up on those, so that at least tasks from our CI scope don't slip thru the cracks
16:23:49 I will do that this week, and we will run through the items next week
16:24:07 #action ihrachys to follow up on PTG working items related to CI and present next week
16:24:40 tl;dr there are a lot of specific work items on gate stability and also reshaping the gate (removing jobs, adding new ...)
16:25:23 any specific questions about PTG?
16:25:55 (we will discuss it in detail next week, but if you have anything time sensitive)
16:26:40 ok moving on
16:26:48 #topic Known gate issues
16:26:59 #link https://goo.gl/8vigPl Open bugs
16:27:10 #link https://bugs.launchpad.net/neutron/+bug/1627106 ovsdb native timeouts
16:27:10 Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [Critical,Triaged] - Assigned to Miguel Angel Ajo (mangelajo)
16:27:20 this did not get much progress per se
16:27:36 but there are some developments that should help us isolate some of its impact
16:27:57 specifically, otherwiseguy is working on splitting the ovsdb code into a separate project: https://review.openstack.org/#/c/438080/
16:28:12 at which point some of the unstable tests that cover the code will move into this new repo
16:28:26 which will offload some of the impact from neutron tree into the new tree
16:28:38 and hopefully will reduce impact on the integrated gate.
16:28:58 some may say it's just a shift of responsibility. it indeed is.
16:29:28 otherwiseguy also had plans to work on a native eventlet state machine for the library once we get initial integration of it with neutron.
16:29:58 he is hopeful that replacing the existing solution based on native threads with something more integrated with eventlet may squash some bugs that we experience.
16:30:03 only time will tell.
16:30:36 for other bugs in the list, I gotta walk thru them and see if they need some love
16:30:53 #action ihrachys to walk thru list of open gate failure bugs and give them love
16:31:24 any other known gate failure that would benefit from the discussion here?
16:32:48 ok let's move on
16:32:55 #topic Gate hook rework
16:33:19 some of you may have noticed the gate breakage that hit us on Friday due to the change in devstack-gate regarding local.conf handling
16:33:39 I think the breaking change was https://review.openstack.org/#/c/430857/
16:33:56 this was fixed with https://review.openstack.org/#/q/Ibe640a584add3acc89520a2bbb25b6f4c5818e1b,n,z
16:34:21 though Sean Dague then raised a point that the way we use devstack-gate in our gate hook is not sustainable and may break in the future
16:34:36 there is a WIP patch to fix that here: https://review.openstack.org/#/c/438682/
16:34:40 we will need to backport it too
16:35:08 I suspect that other repos that were affected by the initial d-g change and patched around it to pass the gate may also need to follow up with a better fix
16:35:27 I believe armax was going to at least assess the impact on other stadium repos
16:35:45 #action armax to assess impact of d-g change on stadium gate hooks
16:35:58 I will shape Sean's patch today to make it ready to merge
16:36:17 #action ihrachys to prepare https://review.openstack.org/#/c/438682/ for merge, then backport
16:36:37 stadium projects are advised to take another look at their gate setup
16:37:02 go ask me or sdague about details if you are lost
16:37:22 #topic lib/neutron for devstack-gate
16:37:46 One final thing, a heads-up that several folks are working on switching the gate to lib/neutron (from lib/neutron-legacy)
16:38:00 the result can be seen in https://review.openstack.org/436798 and the list of dependent patches
16:38:48 once the patches are in better shape and merged, we may need to do some more validation work with gates for other projects to make sure it won't break anything
16:39:09 #topic Open Discussion
16:39:25 anything anyone?
16:39:44 I will merely note that the patch that disables ovs compilation for the functional job is in the gate: https://review.openstack.org/437041
16:40:48 electrocucaracha: any updates on the memory consumption tracking work you looked at a while ago?
16:42:50 ok I believe we lost him :)
16:42:59 ok folks thanks for joining and bearing with me
16:43:05 thanks!
16:43:07 thanks
16:43:08 I hope that the next meetings will be more active and more populated :)
16:43:11 #endmeeting