15:00:10 #startmeeting ironic
15:00:10 Meeting started Mon Sep 21 15:00:10 2020 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:11 o/
15:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:14 The meeting name has been set to 'ironic'
15:00:14 o/
15:00:18 o/
15:00:22 o/
15:00:25 o/
15:00:30 o/
15:00:36 o/
15:00:49 Our agenda this week can be found on the wiki.
15:00:50 o/
15:00:51 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:00:57 o/
15:01:00 \o
15:01:09 * iurygregory forgot to add 2 rfe's for discussion...
15:01:14 #topic Announcements / Reminders
15:01:19 iurygregory: quick! add them :)
15:01:27 o/
15:01:29 First off!
15:01:39 #info CI is very unhappy - Details are on the whiteboard.
15:01:42 CI yay...
15:01:48 This appears to be memory related :\
15:02:13 #info We're also in the home stretch for victoria. This week is R-3 for OpenStack.
15:02:23 #info Priority obviously is CI and reviews this week.
15:02:41 #info TC/PTL nominations are this week, if you're interested message TheJulia
15:02:51 I guess I'll run again if you folks want me to.
15:02:55 o/
15:03:08 #info Redfish interop status meeting has been scheduled
15:03:26 It will be on Thursday, September 23rd at 12 PM UTC.
15:03:33 #link https://cern.zoom.us/j/94808950339
15:03:34 Everyone is welcome of course.
15:03:37 we will give you cookies if you run again TheJulia
15:03:47 iurygregory: ++
15:03:48 o/
15:04:04 iurygregory: cranberry oatmeal and you'll have me sold.
15:04:18 One final item in my semi-out of order list of announcements/reminders
15:04:26 mraineri from Redfish Forum will attend the first half.
15:04:44 * iurygregory would ship food from Annapurna to TheJulia
15:05:07 It looks like the kexec effort should end up with some devoted PTG time to discuss and determine the next path.
I got an email from Boston University and the group of students did not choose ironic :(
15:05:20 #sad
15:05:24 c'est la vie
15:05:41 Does anyone have anything to announce or remind us of?
15:06:39 No action items so we can proceed to subteam statuses
15:06:56 Merged openstack/ironic-prometheus-exporter master: Fallback to `node_uuid` if `node_name` is not present https://review.opendev.org/723176
15:07:08 iurygregory: I guess you can release IPE :)
15:07:21 I will =)
15:07:23 * TheJulia guesses there are no other announcements and reminders
15:07:28 onward?
15:07:34 ++
15:07:54 #topic Review subteam status reports
15:07:57 #link https://etherpad.openstack.org/p/IronicWhiteBoard
15:08:34 Starting at line 279
15:08:57 I think we can remove the Zuulv3 migration
15:09:22 and have a topic for grenade efforts in the future
15:10:44 arne_wiebalck: w/r/t the scale issues item you noted, I've got a patch up to preserve the efi boot artifacts, we should likely make sure we don't collide in our efforts
15:11:41 TheJulia: ok
15:12:10 TheJulia: you have a link?
15:13:06 arne_wiebalck: it is on ipa, I don't at the moment but I'll get it to you
15:13:20 TheJulia: I should be able to find it ...
15:13:28 Otherwise I think most things look okay and in a good state. I realize we're also basically blocked on ipa at the moment due to CI
15:13:36 Anyhow, are we good to proceed?
15:14:02 yep
15:14:19 one moment, having to relaunch windows, my browser crashed
15:14:40 #topic Deciding on priorities for the coming week
15:14:53 Is there anything we need to add to the list of priorities?
15:15:05 #link https://etherpad.openstack.org/p/IronicWhiteBoard
15:15:19 Starting at line 164
15:16:25 iscsi deprecation?
https://review.opendev.org/750204
15:16:26 patch 750204 - ironic - Deprecate the iscsi deploy interface - 8 patch sets
15:16:33 I think it is already on the list
15:16:47 This can be added for vendor priority (iLO) https://review.opendev.org/#/c/752001/
15:16:48 patch 752001 - ironic - Adding changes for iso less vmedia support - 8 patch sets
15:16:49 it is, just marked as a wip
15:17:02 ah, right, removed WIP
15:17:04 stendulker: sure, if you could make that update on the etherpad that would be much appreciated
15:17:17 updated
15:17:22 thanks
15:17:43 Any objection if I remove the networking-generic-switch item?
15:18:10 last updated September 9th
15:18:32 https://review.opendev.org/#/c/748049 is the one you were referring to earlier?
15:18:33 patch 748049 - ironic-python-agent - Support partition image efi contents - 4 patch sets
15:18:45 arne_wiebalck: yes
15:19:03 I will add some backports of the IPE later (after I push them)
15:19:25 iurygregory: sounds good
15:19:27 I added a line under the IPA segment for https://review.opendev.org/#/c/752024/
15:19:27 patch 752024 - ironic-python-agent - Fix: make Intel CNA hardware manager none generic - 5 patch sets
15:19:36 I see a couple people have updated a few different areas
15:19:43 Any objections to what is present at this time?
15:20:50 none from me
15:21:23 okay, seems like we can proceed then!
15:21:42 So we have nothing listed for discussion today
15:21:50 So we can proceed to the Baremetal SIG!
15:21:55 #topic Baremetal SIG
15:22:07 No more input on the doodle, so I guess we just schedule a first meeting and see how that goes.
15:22:15 arne_wiebalck: that seems reasonable
15:22:19 That's it :)
15:23:19 Okay then, RFE Review it is then
15:23:23 #topic RFE Review
15:23:29 iurygregory: I believe these are yours?
15:23:36 yup
15:23:39 \o/
15:24:10 iurygregory: would you like to talk through them?
15:24:12 So, 1st RFE is https://storyboard.openstack.org/#!/story/2008171
15:24:32 to add some support for IPE for introspection data
15:25:08 so what problem do you see this solving?
15:25:09 this would probably be something interesting when we have active node introspection
15:25:25 hardware discrepancy detection?
15:25:54 for example the operator wants all the machines from vendor X on the same firmware version...
15:26:46 so they can set up an alarm based on the "metric" for firmware version to get a notification if something is different for the machines
15:26:49 assuming extra-hardware present? I don't think we collect firmware versions by default.
15:27:20 ofc it would depend on what the introspection data will have =)
15:27:21 What would be a downside for optionally supporting putting node inspection data in prometheus? As long as it's not enabled by default, it seems like a potentially good thing.
15:27:40 can introspection rules be used during active introspection?
15:27:40 iurygregory: would this only apply to the inspection data for the nodes that the IPE is responsible for, based on being paired with the conductor and the data supplied from the sensor data collection?
15:27:48 arne_wiebalck: yes, I think
15:28:01 Especially with some of the data you could "inspect" about node lifetime from utilizing plugins for more data, e.g. SMART cycles, firmware versions (as mentioned), etc
15:28:08 dtantsur: what would happen if they detect an issue?
15:28:14 TheJulia, only for the ones that conductor can report I would say
15:28:28 dtantsur: for normal inspection, the node would fail inspection
15:28:32 arne_wiebalck: probably nothing particular.. I cannot say for sure.
15:28:48 dtantsur: since that would be a similar functionality
15:29:00 dtantsur: for the alarming part at least
15:29:03 overall, I'm with JayF on the "why not" bit. but I'd rather see more technical details in the RFE before committing to it.
15:29:20 iurygregory: I suspect you're going to need to write something a bit more verbose along the lines of a spec. I like the idea, I'm only worried about size/scale/scope issues and mechanics.
15:29:21 ++ that is a pretty anemic
15:29:27 RFE **
15:29:37 yeah sorry for that =)
15:29:46 I personally don't insist on a spec, I'd just like to read more text on the story
15:29:46 That being said, I suspect many of us agree it would be a good thing
15:29:49 I will try to gather more details =)
15:30:05 same really, just more details would be excellent
15:30:20 ack
15:30:32 so moving to the second RFE
15:30:36 and maybe think through the questions posed in a spec while you're adding detail
15:30:47 sure =)
15:30:52 Because those questions are asked to provoke thought in many cases :)
15:30:56 "what if?"
15:31:01 well... ideally, it'd be some sort of 'plugin'? or class, so that other non-prometheus systems could also get that introspection data in the future?
15:31:15 we have plugins for ironic-inspector
15:31:26 Awesome
15:31:29 the problem is, prometheus only supports "pull" model
15:31:46 (there is something for the "push" model, but it's not recommended)
15:31:46 iurygregory: so your second RFE?
15:31:53 ^^ which means the general idea seems ok, but i personally would like a bit more details. if not a spec, please put in the story.
15:32:04 Templates for alarms https://storyboard.openstack.org/#!/story/2008176
15:32:33 when using prometheus normally you will have some set of alarm rules you will use that will trigger notifications
15:33:19 iurygregory: I think that makes a lot of sense, just maybe a little more detail on how we're going to make it easy
15:33:19 I guess I have the same concern with this RFE: it's very short
15:33:22 iurygregory: sorry, for the first rfe, would you mind updating the title or whatever to something more specific, eg 'push introspection data to prometheus' ?
15:33:33 for example you want to get a notification if the temperature of the nodes is higher than a threshold ...
15:33:43 rloo, sure I will do
15:33:46 I'd like to see two distinct parts: user story ("As an operator I want to") and solution ("we will change that, add this")
15:34:08 right now I don't quite understand why operators cannot configure it.. the way they usually configure it
15:34:17 well they can configure
15:34:17 ++
15:34:46 I guess I'm also missing a hint at the solution to the problem of it being hard
15:34:59 in my mind would be something like
15:35:08 you want an alarm for higher temperature
15:35:19 Generally no objection to the rfe otherwise, just need some more detail :)
15:35:25 so you can say the metric name and the expression it should use
15:35:42 and we would output the yml format you need to update in the configuration of prometheus...
15:36:06 instead of going and writing all rules you want etc..
15:36:26 I guess the conundrum in a way is they don't really know what to populate until after the fact
15:36:30 so they have no examples
15:36:30 I have no idea about prometheus, so please pardon my question if it's silly: is it really easier?
15:37:00 ah, hmm. do we have all the data we need? I thought it was also driver and hardware specific?
15:37:22 well I would prefer to get a file that I just need to add to the config instead of writing everything..
15:37:24 Yeah, that is a conundrum since there are some data transformations based on names if memory serves
15:37:49 * TheJulia wonders if this could almost just be sample alarms documentation
15:39:26 again, no hard feelings against, but I'd like to understand more before ack'ing
15:39:34 and ideally have it written :)
15:39:40 iurygregory: so you're going to make things more verbose on both and I guess we can revisit again next week?
15:39:41 it = details in the RFE
15:39:46 TheJulia++
15:39:48 yup
15:39:54 Awesome then
15:40:09 Anyone have any other RFE's while we're at it?
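A minimal sketch of the "give a metric name and an expression, get the yml back" idea from the second RFE, assuming the standard Prometheus alerting-rule file layout (`groups` / `rules` / `alert` / `expr` / `for`). The function name `render_alert_rule`, the group name `baremetal-alerts`, and the metric name `baremetal_temperature_celsius` are hypothetical and not taken from the RFE or from ironic-prometheus-exporter; real metric names depend on the driver and hardware, as noted in the discussion.

```python
def render_alert_rule(alert, metric, op, threshold,
                      duration="5m", severity="warning"):
    """Return the text of a Prometheus alerting-rule file for one alert.

    All argument values are supplied by the operator; nothing here is
    looked up from ironic or from the exporter.
    """
    expr = f"{metric} {op} {threshold}"
    lines = [
        "groups:",
        "  - name: baremetal-alerts",  # hypothetical group name
        "    rules:",
        f"      - alert: {alert}",
        f"        expr: {expr}",
        f"        for: {duration}",
        "        labels:",
        f"          severity: {severity}",
        "        annotations:",
        f'          summary: "{alert} fired"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metric name, used only to show the output shape.
rule_yaml = render_alert_rule(
    "NodeTemperatureHigh", "baremetal_temperature_celsius", ">", 80)
print(rule_yaml)
```

The output is a file the operator could reference from `rule_files` in the Prometheus configuration. The sketch does no quoting or escaping, so an expression containing label matchers with quotes would need more care.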
15:40:10 I will try to show up next week (monday is a holiday in CZ)
15:40:18 iurygregory: ack
15:40:30 and I'm moving things during the weekend...
15:40:47 iurygregory: ugh, well if you're not around we can always hold to the following week
15:40:48 I think you'll deserve some proper rest afterwards
15:40:54 Anyway!
15:40:58 #topic Open Discussion
15:40:59 iurygregory: just update it, and we can discuss without you
15:41:06 ack
15:41:07 worst case, we delay one more week
15:41:08 Now, we can plot to take over the world!
15:41:13 \o/
15:41:19 * TheJulia wonders where she left the coffee at
15:42:11 so regarding CI
15:42:29 \o/
15:43:26 hey folks, I ran across an issue when testing firmware update. After some investigation, I believe this issue exists in all cleaning steps that call task.process_event('fail'), which I modeled firmware update after. I fixed the issue in firmware update, but I believe it is present in all the other cleaning steps. wondering how this should be handled. should we discuss now, or outside the meeting?
15:43:44 it seems like both VMs are running and we're simply running the machines out of ram. 1GB of swap, 8 GB of ram... on RAX :(
15:44:02 cdearborn: I think we should get rid of any calls to task.process_event in drivers
15:44:09 I guess we need to force those jobs to use tinyipa and reduce memory count in rax as well?
15:44:17 it may be some functionality gap that we need to cover
15:44:19 it may be this stuff just works in other clouds
15:44:26 TheJulia, it's happening only in rax?
15:44:28 because of more swap on the instances
15:44:40 TheJulia: I've seen very different RAM consumption between different jobs - see the whiteboard
15:44:43 iurygregory: I'm not sure, but we're also trying centos builds on rax right now
15:44:52 gotcha
15:44:58 another option is to use concurrency==1 and 1 VM
15:45:12 will make the jobs take longer, of course
15:45:48 we default to 2 VMs on most ironic jobs (netboot/local) for tempest testing...
15:46:19 since we need to set capabilities before running tempest or nova will *BOOM*
15:46:33 ah, I remember now. we need two VMs because of cleaning..
15:47:25 and I think non-uefi jobs require 3GB and uefi 4GB, which can cause more swap etc since we have bigger instances..
15:47:52 the issue is that when a cleaning step detects failure and calls task.process_event('fail'), the node moves into the clean failed state, but does not go into maintenance mode, and the next cleaning step that is run against the node afterwards is never actually kicked off. The node goes into clean wait and stays there forever.
15:48:12 I fixed the issue in firmware update by calling conductor.utils.cleaning_error_handler() instead.
15:48:33 I believe this has something to do with the state that is left in driver_internal_info. If the cleaning step just calls into task.process_event('fail'), then that state is never cleaned up.
15:49:07 I think clean steps should only raise exceptions, not mess with states
15:49:26 but yes, you're right, just calling process_event is wrong and will leave the node in an unclear state
15:49:51 So this is an awful idea, what if we created more swap?
15:49:55 dtantsur, a raised exception is handled correctly, as is a timeout
15:50:20 TheJulia, if the awful idea helps I'm ok with it..
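The memory figures quoted above can be put side by side with a little arithmetic. This is a rough sketch only: it assumes the 3 GB and 4 GB figures are per test VM, that each VM uses its full allocation, and it ignores whatever the host services themselves consume, none of which is stated in the discussion.

```python
HOST_RAM_GB = 8    # RAM on the CI instance, as quoted in the meeting
HOST_SWAP_GB = 1   # swap on the CI instance, as quoted in the meeting

# Assumed to be per test VM; the meeting only says "I think".
VM_RAM_GB = {"non-uefi": 3, "uefi": 4}


def headroom_gb(boot_mode, vm_count):
    """RAM left on the host once the test VMs take their full allocation."""
    return HOST_RAM_GB - VM_RAM_GB[boot_mode] * vm_count


for mode in VM_RAM_GB:
    for vms in (2, 1):
        print(f"{mode}, {vms} VM(s): {headroom_gb(mode, vms)} GB RAM left, "
              f"{HOST_SWAP_GB} GB swap available")
```

Under these assumptions two UEFI VMs alone account for all 8 GB, which is consistent with the suggestions to drop to one VM, shrink the per-VM memory, or add swap.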
15:50:42 using concurrency 1 would be worth trying (if we are not doing so already) just to see how it goes
15:50:55 I think we can likely tune down the memory footprint we enable the centos job to have since we did some cleaning in the image
15:51:06 We were up to 500 megs at one point and now we're down to like 360
15:51:18 sounds like a plan
15:51:34 dtantsur, the issue is mainly with async cleaning steps where throwing an exception is not an option
15:51:42 that puts the footprint at worst around 2.25 GB if my back of the napkin math is correct
15:52:15 I'll try tuning down the ipa jobs first with an override and we can see how that goes
15:52:36 statistically if it passes and survives a recheck, we're likely good to reduce the overall memory consumption size across the board
15:52:51 ack
15:53:08 So anything else for us to randomly discuss this morning?
15:53:15 Hi I'm an engineer at red hat research and I just wanted to throw out there that we have begun working on a way to integrate Keylime into Ironic to provide a means of node attestation under different states. To begin I'll be reviving a patch for a generic security interface on nodes: https://review.opendev.org/#/c/576718/3/specs/approved/security-interface.rst
15:53:15 patch 576718 - ironic-specs - Add security interface spec - 3 patch sets
15:53:37 cdearborn: then probably cleaning_error_handler is the right tool to use
15:53:50 hi and welcome lmcgann_
15:53:59 welcome lmcgann_
15:54:03 great news!
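The clean-step failure described above can be illustrated with a toy model. This is not ironic code: `ToyNode` and both functions are hypothetical stand-ins, and the `clean_step_index` key is used only as an example of step bookkeeping kept in `driver_internal_info`. The sketch assumes, as the discussion suggests, that a full error handler sets maintenance and clears that bookkeeping while a bare 'fail' event does neither.

```python
class ToyNode:
    """Stand-in for a node record; NOT ironic's Node object."""

    def __init__(self):
        self.provision_state = "clean wait"
        self.maintenance = False
        self.last_error = None
        # Example of per-run bookkeeping left by an in-progress clean.
        self.driver_internal_info = {"clean_step_index": 2}


def fail_with_bare_event(node):
    """Mimic a step that only fires the 'fail' state-machine event."""
    node.provision_state = "clean failed"
    # Nothing else is touched: no maintenance flag, stale bookkeeping stays.


def fail_with_error_handler(node, msg):
    """Mimic a handler that records the error and resets bookkeeping."""
    node.provision_state = "clean failed"
    node.last_error = msg
    node.maintenance = True
    node.driver_internal_info.pop("clean_step_index", None)


stale = ToyNode()
fail_with_bare_event(stale)

tidy = ToyNode()
fail_with_error_handler(tidy, "firmware update failed")

print("bare event:   ", stale.maintenance, stale.driver_internal_info)
print("error handler:", tidy.maintenance, tidy.driver_internal_info)
```

In the toy model the first node ends up in clean failed with stale step bookkeeping and no maintenance flag, which mirrors the symptom reported for the next cleaning run.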
15:54:08 ++
15:54:09 lmcgann_: We _may_ need to split it into two specs mechanics wise, one on the interface and one on just keylime, but maybe wrapped together could work :)
15:54:10 welcome lmcgann_
15:54:16 o/ lmcgann_
15:54:28 Hi everybody :)
15:54:45 lmcgann_: and yes, feel free to revise that change set however you feel is appropriate
15:54:57 I would also like to shamelessly promote our DevConf talk this Thursday wherein we will demonstrate our work on Ironic multitenancy and other contributions as a means of sharing hardware in the Mass Open Cloud
15:54:57 \o lmcgann_
15:55:10 https://devconfus2020.sched.com/event/b2738f74c3a7e3021ba9fd53a035e5ed
15:55:20 lmcgann_: Awesome, good luck!
15:55:43 Well everyone, thank you and have a wonderful week!
15:55:53 Dmitry Tantsur proposed openstack/ironic-inspector master: Limit inspector jobs to 1 testing VM https://review.opendev.org/753051
15:55:54 TheJulia: ^^^
15:56:24 Yeah, that should work just fine for inspector jobs
15:56:39 IPA otoh :(
15:56:51 I'll put up a patch in a few minutes
15:56:55 Anyway, have a wonderful week!
15:56:57 Thanks!
15:57:12 #endmeeting