15:00:10 <TheJulia> #startmeeting ironic
15:00:10 <openstack> Meeting started Mon Sep 21 15:00:10 2020 UTC and is due to finish in 60 minutes.  The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:11 <TheJulia> o/
15:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:14 <openstack> The meeting name has been set to 'ironic'
15:00:14 <iurygregory> o/
15:00:18 <martalais> o/
15:00:22 <ajya> o/
15:00:25 <cdearborn> o/
15:00:30 <bdodd> o/
15:00:36 <erbarr> o/
15:00:49 <TheJulia> Our agenda this week can be found on the wiki.
15:00:50 <rloo> o/
15:00:51 <TheJulia> #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:00:57 <arne_wiebalck> o/
15:01:00 <rpioso> \o
15:01:09 * iurygregory forgot to add 2 rfe's for discussion...
15:01:14 <TheJulia> #topic Announcements / Reminders
15:01:19 <TheJulia> iurygregory: quick! add them :)
15:01:27 <kaifeng> o/
15:01:29 <TheJulia> First off!
15:01:39 <TheJulia> #info CI is very unhappy - Details are on the whiteboard.
15:01:42 <iurygregory> CI yay...
15:01:48 <TheJulia> This appears to be memory related :\
15:02:13 <TheJulia> #info We're also in the home stretch for victoria. This week is R-3 for OpenStack.
15:02:23 <TheJulia> #info Priority obviously is CI and reviews this week.
15:02:41 <TheJulia> #info TC/PTL nominations are this week, if you're interested, message TheJulia
15:02:51 <TheJulia> I guess I'll run again if you folks want me to.
15:02:55 <stendulker> o/
15:03:08 <TheJulia> #info Redfish interop status meeting has been scheduled
15:03:26 <TheJulia> It will be on Thursday, September 23rd at 12 PM UTC.
15:03:33 <TheJulia> #link https://cern.zoom.us/j/94808950339
15:03:34 <arne_wiebalck> Everyone is welcome of course.
15:03:37 <iurygregory> we will give you cookies if you run again TheJulia
15:03:47 <arne_wiebalck> iurygregory: ++
15:03:48 <rajinir> o/
15:04:04 <TheJulia> iurygregory: cranberry oatmeal and you'll have me sold.
15:04:18 <TheJulia> One final item in my semi-out of order list of announcements/reminders
15:04:26 <rpioso> mraineri from Redfish Forum will attend the first half.
15:04:44 * iurygregory would ship food from Annapurna to TheJulia
15:05:07 <TheJulia> It looks like the kexec effort should end up with some devoted PTG time to discuss and determine the next path. I got an email from Boston University and the group of students did not choose ironic :(
15:05:20 <iurygregory> #sad
15:05:24 <TheJulia> c'est la vie
15:05:41 <TheJulia> Does anyone have anything to announce or remind us of?
15:06:39 <TheJulia> No action items so we can proceed to subteam statuses
15:06:56 <openstackgerrit> Merged openstack/ironic-prometheus-exporter master: Fallback to `node_uuid` if `node_name` is not present  https://review.opendev.org/723176
15:07:08 <TheJulia> iurygregory: I guess you can release IPE :)
15:07:21 <iurygregory> I will =)
15:07:23 * TheJulia guesses there are no other announcements and reminders
15:07:28 <TheJulia> onward?
15:07:34 <iurygregory> ++
15:07:54 <TheJulia> #topic Review subteam status reports
15:07:57 <TheJulia> #link https://etherpad.openstack.org/p/IronicWhiteBoard
15:08:34 <TheJulia> Starting at line 279
15:08:57 <iurygregory> I think we can remove the Zuulv3 migration
15:09:22 <iurygregory> and have a topic for grenade efforts in the future
15:10:44 <TheJulia> arne_wiebalck: w/r/t the scale issues item you noted, I've got a patch up to preserve the efi boot artifacts, we should likely make sure we don't collide in our efforts
15:11:41 <arne_wiebalck> TheJulia: ok
15:12:10 <arne_wiebalck> TheJulia: you have a link?
15:13:06 <TheJulia> arne_wiebalck: it is on ipa, I don't at the moment but I'll get it to you
15:13:20 <arne_wiebalck> TheJulia: I should be able to find it ...
15:13:28 <TheJulia> Otherwise I think most things look okay and in a good state. I realize we're also basically blocked on ipa at the moment due to CI
15:13:36 <TheJulia> Anyhow, are we good to proceed?
15:14:02 <dtantsur> yep
15:14:19 <TheJulia> one moment, having to relaunch windows, my browser crashed
15:14:40 <TheJulia> #topic Deciding on priorities for the coming week
15:14:53 <TheJulia> Is there anything we need to add to the list of priorities?
15:15:05 <TheJulia> #link https://etherpad.openstack.org/p/IronicWhiteBoard
15:15:19 <TheJulia> Starting at line 164
15:16:25 <dtantsur> iscsi deprecation? https://review.opendev.org/750204
15:16:26 <patchbot> patch 750204 - ironic - Deprecate the iscsi deploy interface - 8 patch sets
15:16:33 <TheJulia> I think it is already on the list
15:16:47 <stendulker> This can be added for vendor priority (iLO) https://review.opendev.org/#/c/752001/
15:16:48 <patchbot> patch 752001 - ironic - Adding changes for iso less vmedia support - 8 patch sets
15:16:49 <TheJulia> it is, just marked as a wip
15:17:02 <dtantsur> ah, right, removed WIP
15:17:04 <TheJulia> stendulker: sure, if you could make that update on the etherpad that would be much appreciated
15:17:17 <stendulker> updated
15:17:22 <stendulker> thanks
15:17:43 <TheJulia> Any objection if I remove the networking-generic-switch item?
15:18:10 <TheJulia> last updated September 9th
15:18:32 <arne_wiebalck> https://review.opendev.org/#/c/748049 is the one you were referring to earlier?
15:18:33 <patchbot> patch 748049 - ironic-python-agent - Support partition image efi contents - 4 patch sets
15:18:45 <TheJulia> arne_wiebalck: yes
15:19:03 <iurygregory> I will add some backports of the IPE later (after I push them)
15:19:25 <TheJulia> iurygregory: sounds good
15:19:27 <Qianbiao> I added a line under the IPA segment for https://review.opendev.org/#/c/752024/
15:19:27 <patchbot> patch 752024 - ironic-python-agent - Fix: make Intel CNA hardware manager none generic - 5 patch sets
15:19:36 <TheJulia> I see a couple people have updated a few different areas
15:19:43 <TheJulia> Any objections to what is present at this time?
15:20:50 <iurygregory> none from me
15:21:23 <TheJulia> okay, seems like we can proceed then!
15:21:42 <TheJulia> So we have nothing listed for discussion today
15:21:50 <TheJulia> So we can proceed to the Baremetal SIG!
15:21:55 <TheJulia> #topic Baremetal SIG
15:22:07 <arne_wiebalck> No more input on the doodle, so I guess we just schedule a first meeting and see how that goes.
15:22:15 <TheJulia> arne_wiebalck: that seems reasonable
15:22:19 <arne_wiebalck> That's it :)
15:23:19 <TheJulia> Okay then, RFE Review it is then
15:23:23 <TheJulia> #topic RFE Review
15:23:29 <TheJulia> iurygregory: I believe these are yours?
15:23:36 <iurygregory> yup
15:23:39 <iurygregory> \o/
15:24:10 <TheJulia> iurygregory: would you like to talk through them?
15:24:12 <iurygregory> So, 1st RFE is https://storyboard.openstack.org/#!/story/2008171
15:24:32 <iurygregory> to add support to IPE for introspection data
15:25:08 <TheJulia> so what problem do you see this solving?
15:25:09 <iurygregory> this would probably be something interesting when we have active node introspection
15:25:25 <TheJulia> hardware discrepancy detection?
15:25:54 <iurygregory> for example, the operator wants the firmware versions for vendor X machines to all be the same version...
15:26:46 <iurygregory> so they can set up an alarm based on the "metric" for firmware version to get a notification if something differs between machines
15:26:49 <dtantsur> assuming extra-hardware present? I don't think we collect firmware versions by default.
15:27:20 <iurygregory> ofc it would depend on what the introspection data will have =)
15:27:21 <JayF> What would be a downside for optionally supporting putting node inspection data in prometheus? As long as it's not enabled by default, it seems like a potentially good thing.
15:27:40 <arne_wiebalck> can introspection rules be used during active introspection?
15:27:40 <TheJulia> iurygregory: would this only apply to the inspection data for the nodes that the IPE is responsible for, based on being paired with the conductor and the data supplied from the sensor data collection?
15:27:48 <dtantsur> arne_wiebalck: yes, I think
15:28:01 <JayF> Especially with some of the data you could "inspect" about node lifetime from utilizing plugins for more data, e.g. SMART cycles, firmware versions (as mentioned), etc
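(Aside: a rough sketch of the kind of collector JayF describes, written as an ironic-inspector processing hook — inspector's plugin mechanism comes up again just below. This assumes the ramdisk actually reports firmware versions; the hook name and the 'firmware_versions' field are made up for illustration.)

```python
from ironic_inspector.plugins import base


class FirmwareVersionHook(base.ProcessingHook):
    """Hypothetical hook; the 'firmware_versions' field is illustrative."""

    def before_update(self, introspection_data, node_info, **kwargs):
        fw = introspection_data.get('firmware_versions')
        if fw:
            # Stash the versions on the node so an exporter or operator
            # can later compare them across the fleet.
            node_info.patch([{'op': 'add',
                              'path': '/extra/firmware_versions',
                              'value': fw}])
```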
15:28:08 <arne_wiebalck> dtantsur: what would happen if they detect an issue?
15:28:14 <iurygregory> TheJulia, only for the ones that conductor can report I would say
15:28:28 <arne_wiebalck> dtantsur: for normal inspection, the node would fail inspection
15:28:32 <dtantsur> arne_wiebalck: probably nothing particular.. I cannot say for sure.
15:28:48 <arne_wiebalck> dtantsur: since that would be a similar functionality
15:29:00 <arne_wiebalck> dtantsur: for the alarming part at least
15:29:03 <dtantsur> overall, I'm with JayF on the "why not" bit. but I'd rather see more technical details in the RFE before committing to it.
15:29:20 <TheJulia> iurygregory: I suspect you're going to need to write something a bit more verbose along the lines of a spec. I like the idea, I'm only worried about size/scale/scope issues and mechanics.
15:29:21 <JayF> ++ that is a pretty anemic
15:29:27 <JayF> RFE **
15:29:37 <iurygregory> yeah sorry for that =)
15:29:46 <dtantsur> I personally don't insist on a spec, I'd just read more text on the story
15:29:46 <TheJulia> That being said, I suspect many of us agree it would be a good thing
15:29:49 <iurygregory> I will try gather more details =)
15:30:05 <TheJulia> same really, just more details would be excellent
15:30:20 <iurygregory> ack
15:30:32 <iurygregory> so moving to the second RFE
15:30:36 <TheJulia> and maybe think through the questions posed in a spec while you're adding detail
15:30:47 <iurygregory> sure =)
15:30:52 <TheJulia> Because those questions are asked to provoke thought in many cases :)
15:30:56 <TheJulia> "what if?"
15:31:01 <rloo> well... ideally, it'd be some sort of 'plugin'? or class, so that other non-prometheus systems could also get that introspection data in the future?
15:31:15 <dtantsur> we have plugins for ironic-inspector
15:31:26 <TheJulia> Awesome
15:31:29 <dtantsur> the problem is, prometheus only supports "pull" model
15:31:46 <dtantsur> (there is something for the "push" model, but it's not recommended)
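(Aside: a minimal sketch of the "pull" model dtantsur describes, using the well-known prometheus_client library — the exporter only exposes an HTTP endpoint that the Prometheus server scrapes; nothing is pushed. The metric name, labels, and port here are illustrative, not what ironic-prometheus-exporter actually exposes.)

```python
from time import sleep

from prometheus_client import Gauge, start_http_server

# Hypothetical metric fed from introspection data.
FIRMWARE_INFO = Gauge(
    'baremetal_firmware_version_info',
    'Firmware version reported by introspection',
    ['node_uuid', 'component'],
)

if __name__ == '__main__':
    start_http_server(9608)  # arbitrary port for the sketch
    # A real exporter would refresh this from introspection data;
    # a static sample is set here just so the endpoint has content.
    FIRMWARE_INFO.labels(node_uuid='abc-123', component='bmc').set(1.0)
    while True:
        sleep(60)
```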
15:31:46 <TheJulia> iurygregory: so your second RFE?
15:31:53 <rloo> ^^ which means the general idea seems ok, but I personally would like a bit more detail. if not a spec, please put it in the story.
15:32:04 <iurygregory> Templates for alarms  https://storyboard.openstack.org/#!/story/2008176
15:32:33 <iurygregory> when using prometheus, you will normally have a set of alarm rules that trigger notifications
15:33:19 <TheJulia> iurygregory: I think that makes a lot of sense, just maybe a little more detail on how we're going to make it easy
15:33:19 <dtantsur> I guess I have the same concern with this RFE: it's very short
15:33:22 <rloo> iurygregory: sorry, for the first rfe, would you mind updating the title or whatever to something more specific, eg 'push introspection data to prometheus' ?
15:33:33 <iurygregory> for example you want to get a notification if the temperature of the nodes is higher than a threshold ...
15:33:43 <iurygregory> rloo, sure I will do
15:33:46 <dtantsur> I'd like to see two distinct parts: user story ("As an operator I want to") and solution ("we will change that, add this")
15:34:08 <dtantsur> right now I don't quite understand why operators cannot configure it.. the way they usually configure it
15:34:17 <iurygregory> well they can configure
15:34:17 <TheJulia> ++
15:34:46 <TheJulia> I guess I'm also missing a hint at the solution to the "it is hard to configure" problem
15:34:59 <iurygregory> in my mind would be something like
15:35:08 <iurygregory> you want an alarm for higher temperature
15:35:19 <TheJulia> Generally no objection to the rfe otherwise, just need some more detail :)
15:35:25 <iurygregory> so you can say the metric name and the expression it should use
15:35:42 <iurygregory> and we would output the YAML you need to add to the prometheus configuration...
15:36:06 <iurygregory> instead of going and writing all rules you want etc..
15:36:26 <TheJulia> I guess the conundrum in a way is they don't really know what to populate until after the fact
15:36:30 <TheJulia> so they have no examples
15:36:30 <dtantsur> I have no idea about prometheus, so please pardon my question if it's silly: is it really easier?
15:37:00 <dtantsur> ah, hmm. do we have all the data we need? I thought it was also driver and hardware specific?
15:37:22 <iurygregory> well I would prefer to get a file that I just need to add to the config instead of writing everything..
15:37:24 <TheJulia> Yeah, that is a conundrum since there are some data transformations based on names if memory serves
15:37:49 * TheJulia wonders if this could almost just be sample alarms documentation
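(Aside: a minimal sketch of what iurygregory describes — the operator supplies a metric expression, and a helper emits the Prometheus alerting-rule YAML to paste into the rule files. The helper and metric names are hypothetical; only the Prometheus rule schema is standard.)

```python
import yaml  # PyYAML


def render_alert_rule(name, expr, duration='5m', severity='warning'):
    """Turn a metric expression into a Prometheus alerting-rule snippet."""
    rule = {
        'groups': [{
            'name': 'ironic-generated',
            'rules': [{
                'alert': name,
                'expr': expr,
                'for': duration,
                'labels': {'severity': severity},
                'annotations': {'summary': 'threshold crossed for ' + name},
            }],
        }],
    }
    return yaml.safe_dump(rule, sort_keys=False)


# e.g. alert when a (hypothetical) temperature metric exceeds 80:
print(render_alert_rule('NodeTemperatureHigh',
                        'baremetal_temperature_celsius > 80'))
```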
15:39:26 <dtantsur> again, no hard feelings against, but I'd like to understand more before ack'ing
15:39:34 <dtantsur> and ideally have it written :)
15:39:40 <TheJulia> iurygregory: so you're going to make things more verbose on both and I guess we can revisit again next week?
15:39:41 <dtantsur> it = details in the RFE
15:39:46 <dtantsur> TheJulia++
15:39:48 <iurygregory> yup
15:39:54 <TheJulia> Awesome then
15:40:09 <TheJulia> Anyone have any other RFE's while we're at it?
15:40:10 <iurygregory> I will try to show up next week (Monday is a holiday in CZ)
15:40:18 <TheJulia> iurygregory: ack
15:40:30 <iurygregory> and I'm moving things during the weekend...
15:40:47 <TheJulia> iurygregory: ugh, well if you're not around we can always hold to the following week
15:40:48 <dtantsur> I think you'll deserve some proper rest afterwards
15:40:54 <TheJulia> Anyway!
15:40:58 <TheJulia> #topic Open Discussion
15:40:59 <dtantsur> iurygregory: just update it, and we can discuss without you
15:41:06 <iurygregory> ack
15:41:07 <dtantsur> worst case, we delay one more week
15:41:08 <TheJulia> Now, we can plot to take over the world!
15:41:13 <dtantsur> \o/
15:41:19 * TheJulia wonders where she left the coffee at
15:42:11 <TheJulia> so regarding CI
15:42:29 <iurygregory> \o/
15:43:26 <cdearborn> hey folks, I ran across an issue when testing firmware update. After some investigation, I believe this issue exists in all cleaning steps that call task.process_event('fail'), which I modeled firmware update after. I fixed the issue in firmware update, but I believe it is present in all the other cleaning steps. Wondering how this should be handled - should we discuss now, or outside the meeting?
15:43:44 <TheJulia> it seems like both VMs are running and we're simply running the machines out of RAM. 1 GB of swap, 8 GB of RAM... on RAX :(
15:44:02 <dtantsur> cdearborn: I think we should get rid of any calls to task.process_event in drivers
15:44:09 <TheJulia> I guess we need to force those jobs to use tinyipa and reduce memory count in rax as well?
15:44:17 <dtantsur> it may be some functionality gap that we need to cover
15:44:19 <TheJulia> it may be this stuff just works in other clouds
15:44:26 <iurygregory> TheJulia, it's happening only in rax?
15:44:28 <TheJulia> because of more swap on the instances
15:44:40 <dtantsur> TheJulia: I've seen very different RAM consumption between different jobs - see the whiteboard
15:44:43 <TheJulia> iurygregory: I'm not sure, but we're also trying centos builds on rax right now
15:44:52 <iurygregory> gotcha
15:44:58 <dtantsur> another option is to use concurrency==1 and 1 VM
15:45:12 <dtantsur> will make the jobs take longer, of course
15:45:48 <iurygregory> we default to 2 VMs on most ironic jobs (netboot/local) for tempest testing...
15:46:19 <iurygregory> since we need to set capabilities before running tempest or nova will *BOOM*
15:46:33 <dtantsur> ah, I remember now. we need two VMs because of cleaning..
15:47:25 <iurygregory> and I think non-UEFI jobs require 3GB and UEFI jobs 4GB, which can cause more swapping etc. since we have bigger instances..
15:47:52 <cdearborn> the issue is that when a cleaning step detects failure and calls task.process_event('fail'), the node moves into the clean failed state, but does not go into maintenance mode, and the next cleaning operation run against the node afterwards never actually kicks off. The node goes into clean wait and stays there forever.
15:48:12 <cdearborn> I fixed the issue in firmware update by calling conductor.utils.cleaning_error_handler() instead.
15:48:33 <cdearborn> I believe this has something to do with the state that is left in driver_internal_info. If the cleaning step just calls task.process_event('fail'), that state is never cleaned up.
15:49:07 <dtantsur> I think clean steps should only raise exceptions, not mess with states
15:49:26 <dtantsur> but yes, you're right, just calling process_event is wrong and will leave the node in an unclear state
15:49:51 <TheJulia> So this is an awful idea, what if we created more swap?
15:49:55 <cdearborn> dtantsur, a raised exception is handled correctly, as is a timeout
15:50:20 <iurygregory> TheJulia, if the awful idea helps I'm ok with it..
15:50:42 <iurygregory> using concurrency 1 would be worth trying (if we are not doing that already), just to see how it goes
15:50:55 <TheJulia> I think we can likely tune down the memory footprint we enable the centos job to have since we did some cleaning in the image
15:51:06 <TheJulia> We were up to 500 megs at one point and now we're down to like 360
15:51:18 <iurygregory> sounds like a plan
15:51:34 <cdearborn> dtantsur, the issue is mainly with async cleaning steps where throwing an exception is not an option
15:51:42 <TheJulia> that puts the footprint at worst around 2.25 GB if my back-of-the-napkin math is correct
15:52:15 <TheJulia> I'll try tuning down the ipa jobs first with an override and we can see how that goes
15:52:36 <TheJulia> statistically if it passes and survives a recheck, we're likely good to reduce the overall memory consumption size across the board
15:52:51 <iurygregory> ack
15:53:08 <TheJulia> So anything else for us to randomly discuss this morning?
15:53:15 <lmcgann_> Hi, I'm an engineer at Red Hat Research and I just wanted to throw out there that we have begun working on a way to integrate Keylime into Ironic to provide a means of node attestation under different states. To begin, I'll be reviving a patch for a generic security interface on nodes: https://review.opendev.org/#/c/576718/3/specs/approved/security-interface.rst
15:53:15 <patchbot> patch 576718 - ironic-specs - Add security interface spec - 3 patch sets
15:53:37 <dtantsur> cdearborn: then probably cleaning_error_handler is the right tool to use
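(Aside: a minimal sketch of the pattern cdearborn and dtantsur settle on — synchronous steps should just raise an exception, while an async step's failure path calls the conductor helper rather than task.process_event('fail') directly. Assumes cleaning_error_handler(task, msg) keeps the signature used above; the callback name is illustrative.)

```python
from ironic.conductor import utils as conductor_utils


def _firmware_update_failed(task, error_msg):
    """Hypothetical failure callback for an async clean step."""
    msg = ('Firmware update failed on node %s: %s'
           % (task.node.uuid, error_msg))
    # Unlike a bare task.process_event('fail'), the helper also tears
    # down cleaning, clears the clean step bookkeeping left in
    # driver_internal_info, and puts the node into maintenance.
    conductor_utils.cleaning_error_handler(task, msg)
```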
15:53:50 <dtantsur> hi and welcome lmcgann_
15:53:59 <iurygregory> welcome lmcgann_
15:54:03 <dtantsur> great news!
15:54:08 <iurygregory> ++
15:54:09 <TheJulia> lmcgann_: We _may_ need to split it into two specs mechanics wise, one on the interface and one on just keylime, but maybe wrapped together could work :)
15:54:10 <kaifeng> welcome lmcgann_
15:54:16 <rpioso> o/ lmcgann_
15:54:28 <lmcgann_> Hi everybody :)
15:54:45 <TheJulia> lmcgann_: and yes, feel free to revise that change set however you feel is appropriate
15:54:57 <lmcgann_> I would also like to shamelessly promote our DevConf talk this Thursday wherein we will demonstrate our work on Ironic multitenancy and other contributions as a means of sharing hardware in the Mass Open Cloud
15:54:57 <cdearborn> \o lmcgann_
15:55:10 <lmcgann_> https://devconfus2020.sched.com/event/b2738f74c3a7e3021ba9fd53a035e5ed
15:55:20 <TheJulia> lmcgann_: Awesome, good luck!
15:55:43 <TheJulia> Well everyone, thank you and have a wonderful week!
15:55:53 <openstackgerrit> Dmitry Tantsur proposed openstack/ironic-inspector master: Limit inspector jobs to 1 testing VM  https://review.opendev.org/753051
15:55:54 <dtantsur> TheJulia: ^^^
15:56:24 <TheJulia> Yeah, that should work just fine for inspector jobs
15:56:39 <TheJulia> IPA otoh :(
15:56:51 <TheJulia> I'll put up a patch in a few minutes
15:56:55 <TheJulia> Anyway, have a wonderful week!
15:56:57 <TheJulia> Thanks!
15:57:12 <TheJulia> #endmeeting