17:00:08 #startmeeting ironic
17:00:09 Meeting started Mon Feb 12 17:00:08 2018 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:11 o/
17:00:12 o/
17:00:12 The meeting name has been set to 'ironic'
17:00:17 o/
17:00:19 o/
17:00:21 \o
17:00:23 o/
17:00:34 Our meeting agenda can be found on the wiki, as always!
17:00:37 o/
17:00:39 #link https://wiki.openstack.org/wiki/Meetings/Ironic
17:01:03 o/
17:01:05 o/
17:01:07 #topic Announcements / Reminder
17:01:29 First off, thank you dtantsur for your hard work!
17:01:35 ++
17:01:43 :)
17:01:55 +++
17:01:57 and congrats TheJulia for taking on this hard work ;)
17:02:00 dtantsur++
17:02:05 o/
17:02:11 +1 :)
17:02:12 +++ congrats!
17:02:16 I hope to meet everyone's expectations. Please remember that I too am human, as I take on the role of fearless leader.
17:02:22 Congrats Julia !!
17:02:26 and THANK YOU TheJulia for volunteering!
17:02:30 ++TheJulia
17:02:43 * jlvillal had expectations of an in-human leader :P
17:02:51 Anyway, time for the remaining announcements
17:02:59 jlvillal: only if I get a flying aircraft carrier ;)
17:03:11 :)
17:03:18 * dtantsur announces that his visa to Ireland was finally approved
17:03:36 The PTG is coming up. Please update the PTG planning etherpad today and over this week.
17:03:39 #link https://etherpad.openstack.org/p/ironic-rocky-ptg
17:04:00 * jlvillal thought we had a new core reviewer
17:04:07 I will break the etherpad up and generate a schedule for Wednesday/Thursday on this coming Friday.
17:04:30 jlvillal: thanks for the reminder!
17:04:46 #info hshiina is now a member of ironic-core, congrats!
17:04:50 \o/
17:04:53 Nice! :)
17:05:06 thanks, everyone
17:05:13 Congrats hshiina !!
17:05:24 I think that is about it for announcements, does anyone else have anything to announce?
17:05:30 congrats hshiina, welcome to moar reviews! :)
17:05:35 * jlvillal checks time in Japan and sees it is 2:05 AM. Yowzer!
17:05:37 PTG get-together poll: https://doodle.com/poll/d4ff6m9hxg887n9q
17:05:38 when is queens final?
17:05:56 #link https://doodle.com/poll/d4ff6m9hxg887n9q
17:06:01 o/
17:06:13 Please respond to the doodle so we can schedule an evening gathering at the PTG.
17:06:34 if you're interested in joining the PTG get-together, please indicate your availability via the poll. i had indicated a feb 16 deadline, but i think i'd like to book a place sooner rather than later, so please sign up by tomorrow. i'll send out email too. thx.
17:06:51 ++ let's not wait till the last moment with booking
17:06:52 #info Queens final releases are slated for the week of the 19th-23rd.
17:07:08 rloo: thanks!
17:07:20 oh, so one more week to fix all the bugs
17:07:31 * rloo wonders, what bugs? impossible...
17:07:35 and tests... and gates
17:07:45 rloo: 3 critical bugs..
17:07:47 Anyway, we should move on
17:07:58 move on \o/
17:07:59 dtantsur: is there a list on the whiteboard?
17:08:01 * dtantsur flies to the woods
17:08:04 TheJulia: there is
17:08:09 #topic Review action items from previous meeting
17:09:04 Looks like our only action item was to review/triage and work on bugs last week.
17:09:56 I think we can just move on since this week should be the same. Any disagreements?
17:10:21 +++
17:10:44 Moving on then!
17:11:12 #topic Subteam status reports
17:11:16 #link https://etherpad.openstack.org/p/IronicWhiteBoard
17:11:20 Starting at Line 202
17:12:40 dtantsur: are you going to work on classic driver deprecation this week? (doc needs updating?)
17:12:46 FYI, for those that don't see it, dtantsur has put the list of critical Queens bugs that need to land and be backported this week under the bugs section.
17:12:49 rloo: very likely so
17:13:02 rloo: after solving the API issue we talked about
17:13:10 dtantsur: we should ping vendors that need to update their docs wrt classic driver deprecations
17:13:12 * dtantsur finally has a devstack environment to test things
17:13:23 rloo: well, I did a call on the ML, and at least 2 vendors proposed patches
17:13:34 dtantsur: ok good.
17:13:37 * dtantsur hands TheJulia a loooong stick to poke people
17:13:56 the critical bugs are at line 215, for anyone else that also can't read today
17:14:04 dtantsur: also the TODOs wrt migrating CI to hardware types. who's going to do all those?
17:14:05 dtantsur: is the end sharpened?
17:14:19 TheJulia: just enough to make it annoying
17:14:26 dtantsur: awesome!
17:14:30 rloo: I suspect me, unless somebody wants to help
17:14:50 just a heads up in case you miss it, traits is almost done but we forgot one thing, it'll need to be backported (L285)
17:15:07 mgoddard: re: traits, is any further action absolutely required for this release?
17:15:17 * TheJulia looks at 285
17:15:26 should just be that one
17:16:08 https://review.openstack.org/#/c/543461/
17:16:08 patch 543461 - ironic - Validate instance_info.traits against node traits
17:16:20 it isn't clear to me what we did wrt routed network support :) L314+
17:16:52 * TheJulia starts a list
17:16:55 hjensas: mind cleaning it up please ^^^?
17:17:18 to make it clear what is to be done for queens, what is a follow-up for rocky, etc
17:17:42 o/ Will do.
17:17:55 i am deleting the 'split away tempest plugin', meant to do that before this meeting (L423)
17:18:16 Looks like the ansible docs need a revision, and we should likely try to land/backport
17:18:23 oh, before I forget: TheJulia we need to document creating queens jobs for the tempest plugin in our releasing documentation
17:18:32 and, well, create queens jobs :)
17:18:50 TheJulia: that patch is borderline required IMO, but would be nice to get it in. I have just found a small issue with the nova virt driver that will need fixing
17:19:06 TheJulia: wrt bifrost L431 -- that's the bug we're fixing now, right?
17:19:23 dtantsur: is it done for this cycle?
17:19:39 TheJulia: nope, just remembered
17:19:48 I can do it while we talk
17:20:21 rloo: no, different but not really a big deal, just needs to be gotten to "soon" since it is all the way off in keystoneauth1
17:20:34 TheJulia: :-(
17:20:50 Yeah :\
17:21:10 Anyway, I think we've looked most everything over for subteams
17:21:18 Are we ready to move on?
17:21:52 + moving on. do we want to continue with these subteam statuses until after PTG, or put on hold until after PTG?
17:22:34 although i guess we aren't quite done with queens so maybe continue...
17:22:36 Dmitry Tantsur proposed openstack/ironic-tempest-plugin master: Add jobs for stable/queens https://review.openstack.org/543555
17:22:36 TheJulia: ^^^
17:22:38 We ought to hold off for next week, I think next week will all be discussion
17:22:41 dtantsur: thanks!
17:23:06 #topic Priorities for this week
17:23:18 I'm going to remove the list of things from last week that are struck out
17:24:51 dtantsur: do you think we should explicitly add the list of bugs to the priority list?
17:25:06 TheJulia: dtantsur's doc patch for classic drivers dep: https://review.openstack.org/#/c/537959/
17:25:07 patch 537959 - ironic - Switch contributor documentation to hardware types
17:25:17 TheJulia: probably won't hurt
17:25:29 dtantsur: can you perform that copy/paste?
17:25:41 yep
17:25:44 Dmitry Tantsur proposed openstack/ironic master: releasing docs: document stable jobs for the tempest plugin https://review.openstack.org/543558
17:26:48 I think that works order-wise
17:26:52 thoughts/objections?
17:27:21 LGTM
17:27:32 Are we ready to move on?
17:27:39 i guess it is implicit that the traits patch is a weekly priority?
17:27:54 it's in "Required backports"
17:28:04 dtantsur: hence 'implicit' :)
17:28:11 :)
17:29:19 One thing to keep in mind: if anyone becomes aware of something that must be backported, please raise visibility as soon as possible.
17:29:33 Time to move on :)
17:29:46 I think https://review.openstack.org/542214 is nice to have
17:29:46 patch 542214 - ironic-inspector - Only set switch_id in local_link_connection if it ...
17:30:02 I agree
17:30:35 #topic Bug Triaging for the week
17:30:54 Same as last week?
17:31:01 ++
17:31:29 #action Everyone to triage/review bugs in preparation for final Queens release.
17:31:47 Moving on!
17:31:52 #topic Discussion
17:32:07 First, and only topic it looks like, is what to do with grenade.
17:32:14 fix it :)
17:32:27 The problem is there is no fixing it as-is...
17:32:47 so it seems like the grenade framework doesn't work for us/rolling upgrades
17:33:14 have we had discussion with the grenade folks in the past about it? cuz we're now trying to continue to hack to get it to work for us
17:33:34 and we hack something but then don't follow up. and then something breaks :-(
17:33:56 Yeah, we can't complete the nova upgrade without a nova patch in place to handle version negotiation either.
17:34:05 what's the latest problem with grenade?
17:34:06 which doesn't mean that we shouldn't hack something now but ...
17:34:34 jroll: tl;dr sqlalchemy gets upgraded, and old nova is incompatible with newer sqlalchemy
17:34:40 *boom*
17:35:02 and in our rolling upgrades scenario, we don't upgrade nova, just ironic
17:35:11 * jroll thinks he needs more time than we have to fully understand the thing
17:35:23 cuz the order of upgrading is ironic first, then nova
17:35:27 are we back to the segfault problem, is my actual question
17:35:32 And since we don't upgrade ironic-api either, we can't actually upgrade nova
17:35:41 jroll: We are! :)
17:35:55 TheJulia: that seems to me like a critical bug to be fixed, likely by the nova team
17:36:08 * jroll recalls dansmith saying similar, and then the bug disappeared for a while
17:36:12 A critical bug in Pike?
17:36:19 yes
17:36:26 but should old s/w be expected to work with new packages?
17:36:49 running software should not be expected to segfault after an apt-get upgrade.
17:37:00 ever, that's a bug, flat-out.
17:37:06 jroll: ok, in that case, it is a nova bug.
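[Editor's note: a hedged toy model of the partial-upgrade scenario described above. Every name and version string is invented for illustration; this is not the grenade code. It models the one point the discussion keeps returning to: ironic is upgraded first while nova stays on the old release, yet both share a single set of installed dependencies.]

```python
# Toy model (all names/versions invented): one machine, one site-packages
# shared by every service running on it.
OLD, NEW = "pike", "queens"

site_packages = {"sqlalchemy": "1.1.x"}  # shared by ironic AND nova services

def upgrade(deps):
    """Upgrading any one project also bumps shared deps for everyone."""
    site_packages.update(deps)
    return NEW

services = {"ironic-conductor": OLD, "nova-conductor": OLD}

# Rolling-upgrade order under discussion: ironic first, nova left on pike...
services["ironic-conductor"] = upgrade({"sqlalchemy": "1.2.x"})

# ...so old nova-conductor now lives alongside the bumped sqlalchemy: the
# mixed state the gate sees ending in a segfault when it is restarted.
assert services["nova-conductor"] == OLD
assert site_packages["sqlalchemy"] == "1.2.x"
```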
17:37:39 we don't upper-cap sqlalchemy in requirements, so we're expected to work with newer versions
17:38:11 rloo: a segv after a package upgrade would be a bug in some library
17:38:13 I'm totally open to a conversation about whether grenade is the right tool for the job here, but it seems to me we've been doing a lot to hack around this bug, and then complaining that grenade makes those hacks hard :)
17:38:16 So then it is a nova bug
17:38:25 rloo: there should be nothing you can do from python land to segv yourself
17:38:42 * jroll isn't sure it's a nova bug, but it's a bug with how nova interacts with the system, yes
17:38:45 dansmith: that is good to know!
17:39:00 jroll: you might even argue that grenade is the right tool since it's poking something that needs fixing :)
17:39:10 jroll: I think at the same time, we have an unrealistic scenario that we're executing with grenade
17:39:22 dansmith: yeah, I should have finished with "so that's a separate conversation" :)
17:39:27 jroll: aye :)
17:39:35 TheJulia: what's unrealistic about it?
17:39:51 aside from the fact that nobody would deploy any of this from devstack anyway
17:39:57 dansmith: Upgrading everything but nova on the same machine, without isolation of underlying shared packages
17:40:08 which we do because we can't run newer nova with older ironic
17:40:12 TheJulia: I don't think that's unrealistic
17:40:36 it's unideal for sure,
17:40:54 but if the package versions don't prohibit it, I think people would expect that it should work
17:41:18 would and do, unfortunately
17:41:25 right
17:41:41 so will we actually get traction for nova to fix it in stable/pike?
17:41:55 Well, for a fix to land
17:42:09 if there's something nova has to do, then sure, but I can't imagine what that is
17:42:48 i think we may need to work with nova to help pinpoint where/how it is failing... seems like if we take ironic out of the picture, nova should still segv?
17:42:49 If dtantsur's assertion holds that projects must be compatible with future sqlalchemy versions, then there is an extra kwarg that needs to be removed that is currently ignored, I believe
17:43:21 if the underlying bytecode is removed that the python runtime is using, does it recompile the bytecode?
17:43:33 dansmith: I think it's less that nova needs to do something and more us begging for help, because we've cumulatively put hundreds of people-hours into trying to track this down and/or fix it :(
17:43:45 jroll: I hear ya
17:44:47 TheJulia: IIRC yes
17:44:48 TheJulia: that shouldn't cause a segv, otherwise that'd be a python bug
17:44:54 doesn't nova have a rolling upgrades/grenade job? I'd think it would have barfed there too?
17:45:06 rloo: several of them, yeah
17:45:20 rloo: nova-conductor is upgraded in that job (which is the service that is segfaulting)
17:45:40 that's why it isn't seen there
17:45:41 but do those upgrades not actually upgrade nova?
17:45:44 jroll: really, i thought in our job, we didn't upgrade nova. let's take it offline
17:46:17 rloo: correct, we do not upgrade nova. nova's grenade jobs do. nova-conductor only breaks when not upgraded.
17:46:18 TheJulia: we upgrade pieces of nova in the partial job, but conductor always gets upgraded (i.e. restarted)
17:46:25 dansmith: ok
17:46:43 jroll: ah, got it. so can we change their test to not upgrade and see if it barfs?
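[Editor's note: a minimal sketch of the uncapped-requirements point at 17:37:39, using the `packaging` library. The version numbers are illustrative, not the actual global-requirements entry. Without an upper bound on a specifier, a stable branch is formally expected to run against any future release the upgrade pulls in.]

```python
# Minimal sketch, assuming illustrative version numbers: an uncapped
# requirement specifier admits any future release, which is why old nova
# is "expected to work" with whatever sqlalchemy gets installed.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

uncapped = SpecifierSet(">=1.0.10")      # no upper bound, like the gate's entry
capped = SpecifierSet(">=1.0.10,<1.2")   # what an upper cap would look like

new_release = Version("1.2.0")
print(new_release in uncapped)  # True  -- old services must tolerate this
print(new_release in capped)    # False -- the upgrade would be held back
```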
17:46:57 rloo: no, the whole point of our grenade test is to upgrade conductor :)
17:47:01 The takeaway I'm getting is that we don't try to change the grenade scenario, and that we hunt down and try to fix the root cause of the segfault?
17:47:24 has anyone tried to reproduce this locally?
17:47:39 TheJulia: that's my opinion, yes
17:47:42 TheJulia: yup, we should fix the root cause
17:47:46 because doing that would let us get core files more easily and dig into what was going on when the segv is triggered
17:47:50 dansmith: I'm fairly sure I did so last week
17:47:52 * jroll has not tried locally
17:48:08 um okay :)
17:48:18 I wiped the machine out though
17:48:27 TheJulia: does that mean you're fairly sure you reproduced it? or fairly sure you tried?
17:48:39 * rloo wonders why the segv appeared, then disappeared, then appeared again...
17:48:52 rloo: that's usually the nature of such things
17:49:04 dansmith: I really don't remember at this point :(
17:49:11 dansmith: that explains it then!
17:49:22 I think that I did, but last week was a blur
17:49:24 they can be deterministic, but often not, due to ordering and timing
17:50:00 dansmith: so it might be hard to reproduce. great.
17:50:43 although zuul is having great luck reproducing
17:51:19 unless it is breaking updated bytecode that is causing the segfault... I seem to remember the first time we ran into this, we got some lsofs out of a running system where the conductor was crashing, and we had some sqlalchemy files open but not all...
17:51:59 I'll continue to work on it this week, but with the constraint of not changing the job or scenario
17:52:01 AFAIK, python only opens those files whilst loading them the first time, not continually
17:52:25 and I don't think it ever purges them and has to reload them
17:52:43 I would imagine it's more about some shared library underneath getting upgraded
17:52:47 yes
17:52:51 I would bet on it
17:53:06 The case is the same for shared libraries
17:53:15 Open file handles don't change
17:53:32 TheJulia: but shared libraries can be opened and closed,
17:53:33 it would have to be opening a new file/library/thing that is often accessed
17:53:39 even when the process is forked?
17:53:51 jroll: yes, if it's just a fork
17:54:11 We're running out of time today
17:54:19 rloo: It doesn't look like we're going to get to RFEs at this point
17:54:28 no worries
17:54:39 i might poke people about them later. or not :)
17:54:50 rloo: I believe that is reasonable
17:55:33 #action TheJulia to try to reproduce the fun grenade crash situation locally and use that to collect data
17:56:01 Since we have only 4 minutes left, does anyone have anything else that needs to be discussed today?
17:56:52 * TheJulia queries crickets as a service
17:57:12 * jroll has nothing
17:57:18 crickets
17:57:27 * dtantsur too
17:57:28 Okay, thanks everyone!
17:57:32 Have a wonderful week!
17:57:35 thanks TheJulia and congrats again
17:57:43 #endmeeting
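[Editor's note: a minimal sketch of the import-caching behaviour debated at 17:52:01-17:53:33. It uses `json` as a stand-in for sqlalchemy so the snippet runs anywhere; the mixed-version hazard in the final comment is the scenario the meeting describes, not something this toy reproduces.]

```python
# Minimal sketch: Python reads a module's file once at import time and
# caches the module object in sys.modules, so an on-disk package upgrade
# does not disturb code that is already loaded.
import sys
import json  # stand-in for sqlalchemy; file read from disk on first import

assert "json" in sys.modules   # cached: repeated imports are dict lookups
before = sys.modules["json"]
import json                    # no file I/O here; returns the cached object
assert sys.modules["json"] is before

# The hazard: a *later* first-time import (say a lazily loaded submodule,
# touched only after the upgrade) would come from the new release, mixing
# two versions of one library inside a single running process.
```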