Thursday, 2023-07-13

opendevreviewJacob Anders proposed openstack/ironic master: [WIP] Retry connecting vmedia through a DVD device if available.  https://review.opendev.org/c/openstack/ironic/+/88766507:27
opendevreviewDmitry Tantsur proposed openstack/ironic master: Very basic in-band inspection with the "agent" interface  https://review.opendev.org/c/openstack/ironic/+/88545007:49
opendevreviewDmitry Tantsur proposed openstack/ironic master: Add the initial skeleton of the agent inspect interface  https://review.opendev.org/c/openstack/ironic/+/87781407:50
opendevreviewDmitry Tantsur proposed openstack/ironic master: Very basic in-band inspection with the "agent" interface  https://review.opendev.org/c/openstack/ironic/+/88545007:50
dtantsurmasghar: recovered ^^^07:50
opendevreviewMahnoor Asghar proposed openstack/ironic master: WIP: Add inspection (processing) hooks  https://review.opendev.org/c/openstack/ironic/+/88755408:42
opendevreviewEbbex proposed openstack/bifrost master: Consolidate ubuntu/debian required_defaults  https://review.opendev.org/c/openstack/bifrost/+/88844409:09
opendevreviewEbbex proposed openstack/bifrost master: sgabios-bin is a subpackage of sgabios  https://review.opendev.org/c/openstack/bifrost/+/88844509:09
opendevreviewEbbex proposed openstack/bifrost master: Reduce the libvirt/qemu packages list  https://review.opendev.org/c/openstack/bifrost/+/88844609:09
opendevreviewEbbex proposed openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults  https://review.opendev.org/c/openstack/bifrost/+/88844709:09
opendevreviewEbbex proposed openstack/bifrost master: Refactor the use of include_vars  https://review.opendev.org/c/openstack/bifrost/+/88844809:09
JayFebbex lots of contributions! Thank you! If you're not getting timely review on anything lmk09:19
JayFSo going to pitch an idea here; will probably rfe it... an optional-to-enable deploy step that'd apply in all cases that does some kind of pre-check to ensure the machine will be deployable09:37
JayFlike checking BMC health, 'hello world' connectivity check, etc09:37
JayFI wonder if the existing validate() in drivers would work or if we'd  need something new09:37
dtantsurJayF: validate() was designed to be quick since it's used in a blocking fashion10:23
JayFYeah, that's what I figured. We'd need to add a method to validate live, and add stuff to all the driver interfaces for that :(10:24
dtantsurchecking BMC health sounds like a core feature, not an opt-in thing10:24
JayFopt-in would be my suggestion so we can be through10:24
JayFyou're basically talking about performing some of the core ops in a deployment an extra time10:24
JayFso I'm not sure it's a slam-dunk to make the correctness-for-speed tradeoff for everyone10:25
JayFwhich is why I default to status quo and opt-in10:25
dtantsurI'm always worried how many opt-in things we (and openstack in general) have10:25
dtantsur"make sure the BMC works" does not sound like a thing many operators would opt out of10:26
JayFI'm thinking of it'd be giving almost every driver interface a chance to do a thing10:26
JayFfor networking that might be pretty sizable10:27
dtantsurdo you have a specific example then?10:27
JayFfor downstream conversations, we were thinking maybe 1) bmc health check 2) ensure access to switches if !flat/noop 3) other configurable items that users might need downstream (such as external hardware/storage devices that might be assisting in deploy)10:29
JayFMVP of a BMC health check wouldn't be a bad idea, but in practice I see as many failures from network automation10:30
JayFso I'd wanna cover that too10:30
opendevreviewVerification of a change to openstack/bifrost master failed: remove pymysql system packages requirement  https://review.opendev.org/c/openstack/bifrost/+/87451910:30
dtantsurJayF: so, what's the case of NOT wanting it before deployment?10:32
JayFYou're talking about adding maybe a minute, more in broken cases, to a deployment process10:32
JayFwhen some of the most clear feedback I hear from operators I interact with IRL is that things often take too long10:33
JayFwhich is why I hesitate to default that time trade-off10:33
JayFbut I don't feel super strongly about that but10:33
JayF*bit10:33
dtantsurDoes "ensure access to switches" take that long?10:33
dtantsuror more validations?10:33
JayFI know that I've seen environments where a round trip to the switch can take 30 seconds+10:34
JayFthe flip side is; on the failure case, especially integrated w/nova+retries, is a lot better10:34
JayFyou'd get rescheduled a lot faster, especially when thinking about a fast-track use case where Ironic might image the machine before realizing BMC/switch is gone10:35
JayFlike I said, could go either way10:35
dtantsurhmmm10:35
dtantsuryeah, let's open this discussion for sure10:35
JayFbut the fact we're arguing about opt-in or opt-out means you agree on the piece that matters10:35
JayFdtantsur: my assumption: this is a spec and adding a method to ~all hardware interfaces, yeah?10:35
dtantsurWell, I'd *love* to see more opt-in and opt-out deploy steps in practice10:35
JayFdtantsur: okay, idea #2 that came up downstream this week10:35
dtantsurthis may mean that we need to open steps to more interfaces10:35
JayFthis one I love a lot more even, and fits with your topic now10:36
JayFstep templates10:36
JayFlike we have deploy templates, we should have cleaning/service step templates10:36
JayFand have it setup so we could potentially wire into RBAC where, e.g. a lessee could run service_step_template "update my firmware" on their own with locked in arguments10:36
JayFto allow some self-serve maintenance with incredibly strong guardrails10:36
JayFthis would fit *super well* with more steps, in general, in our library10:36
dtantsurFunny that you mention it. I've been thinking: if I were to write Ironic API v2, it would be organized around workflows.10:38
dtantsurIt's kinda-sorta same idea, but on a much more ambitious level10:39
JayFI am less ambitious and more get-shit-done by default10:39
JayFthat doesn't mean you are wrong that it shouldn't be part of a bigger v210:39
dtantsurWhich is to say, I like the idea of templates JayF. The API would need careful consideration.10:39
JayFjust that is not something I've pondered on10:39
JayFdtantsur: I see a nice symmetry in this, right? Steps grew up in cleaning, and we sorta have ^c^v that concept everywhere10:40
dtantsurYeah, I don't think we'll see Ironic v2.10:40
JayFdtantsur: now I wanna do the same with Deploy Templates, at least in concept10:40
JayFIDK how portable that specific code is10:40
dtantsurSomewhat?10:40
dtantsurMy biggest worry is that we'll have a vastly asymmetric API here10:40
dtantsurdeploy templates are invoked via traits, which is a fully Nova-centric approach10:41
JayFso perhaps we keep that as a facade for nova10:41
JayFand do "step templates" more generically for ironic-facing workloads10:41
JayF**ironic-user-facing workloads10:41
dtantsurI'm also curious what TheJulia says if we try to create dynamic RBAC rules, i.e. an RBAC rule per template :D10:41
JayFyeah that's really the killer piece10:42
JayFif I can't give $end_user_operator access to do a limited set of service steps, it's not useful10:42
JayFbecause securtiy would require me to limit the things that can be done, and we have steps generally which might be considred harmful10:42
JayFe.g. firmware downgrades10:42
dtantsurWell.. even if we grant access to all templates, it's still better than now10:43
JayFIf the templates were tenant-aware; yes10:43
JayFif not, it wouldn't help my use case10:43
JayFs/tenant/project/10:43
JayFlast decade is leaking lol10:43
dtantsur:D10:43
dtantsurI still like tenant more..10:43
JayFI like when magic words don't chage10:43
JayFlol10:44
JayF*change10:44
dtantsurWill it help if we add a project field to templates? (ignoring the issue with naming conflicts, which we already have with nodes)10:44
JayFyes; but my use cases are mostly on manual clean/service steps/maybe even rescue one day10:45
JayF(rescue needs to be reimplemented as a service step when we get there anyway, so that'd collapse down)10:45
opendevreviewMerged openstack/ironic master: DB: Fix result set locking with periodics  https://review.opendev.org/c/openstack/ironic/+/88818812:21
TheJuliarules per template?!?!?!?!?!13:05
TheJuliawhaaaat!?13:05
TheJuliasomeone, where is the beginning of the idea?!13:06
TheJuliaor how strong should my coffee be? 13:06
JayFTheJulia: so two ideas basically; semi-related:13:07
dtantsur:D13:07
JayF1) Have the ability to perform a precheck before doing node operations, to non-destructively fail (e.g. so a user running nova rebuild would still have their workload running even if the BMC was dead even if the instance goes to err)13:07
JayF2) The ability to template out anything that uses "steps" to operate, in such a way that it can be RBAC'd (maybe make templates tenant-aware? or some concept of "y template can be run by owner/lessee of a node"13:08
JayFThe use case for 1 is obvious13:08
JayFuse case for 2 would be self-service user firmware upgrades, done in a secure enough way to prevent the end-user from performing arbitrary steps13:09
JayFIt's easy to imagine a world where we would want Ironic to be able to say "you can run this firmware upgrade reciept; but you can't run firmware_upgrade step with arbitrary values to downgrade to a vulnerable firmware" or similar13:10
JayFs/reciept/recipe/13:10
TheJuliaI avoided pouring a ton of rbac on to templates because of the unknowns and lack of asks. We would need to add at least two fields to that. Not impossible on the template side13:12
JayFThe thing is, I don't want "deploy templates"13:12
JayFI want arbitrary step templates13:12
JayFall my use cases are around manual cleaning and service steps13:13
TheJulialike public=True/False or and owner. Actual policy with step execution would mean something. I need coffee to think through it13:13
TheJuliaJayF: technically, already exists, just sufficiantly high level access user must create13:13
TheJuliawait13:14
JayF*blink*13:14
TheJuliayou want to be able to invoke a step and pickup a specific step by name without providing other args?13:14
TheJuliaso user says "do x, y, templatez"13:14
JayFI want to be able to say `openstack baremetal clean --template=firmware-upgrade-2.3.4`13:14
JayFand for a larger admin to be able to define firmware-upgrade-2.3.413:14
JayFand for that user with access to run the template to not be able to do anything but curated sets of steps they have access to13:15
JayFagain, think self-service node maintenance, where you can hand the API off to the project team, and they can coordinate their own service13:15
JayF(really service steps is the bigger use case for this; but it seems weird to have service templates and deploy templates w/o manual clean templates)13:15
TheJuliaso the lacking part is a user, on demand, being able to say "go execute this template13:16
JayFSo this is just RBAC for templates then13:16
JayFmore or less13:16
JayFwith maybe a bit of api on top13:17
TheJuliawell13:17
TheJuliaactually, two fold13:17
TheJuliawe have deploy templates, not cleaning templates13:17
JayFyeah, that's what I thought, we have to generic-ify templates to cleaning and service13:17
JayFthen RBAC on top of it13:17
TheJuliaalthough on some level, we could just say "yes, you may cross-reference them13:17
JayFyeah, I consider that implementation detail13:17
JayFI don't know the code well enough to know how I'd write it13:17
TheJuliawe have enough things that are dual purpose that maybe it just stays slightly confusing name13:17
JayFI'm just teasing out enough of the interface so I can RFE it13:18
JayFweak -1 to that naming idea, but also meh13:18
TheJuliawe'd need a knob to deny API submitted steps13:18
JayFI consider that as part of "RBAC'ing templates"13:19
TheJuliamentally, this is sort of similar to dpus, where one step along the way is that, sort of like allowing a vendor interface to be called as a step13:19
TheJulia(... because ipmitool send_raw is needed to unblock Bluefield card's for inband firmware upgrades....)13:20
TheJuliaso a pile of semi-related things13:20
JayFhttps://bugs.launchpad.net/ironic/+bug/2027688 (this is for the precheck idea)13:20
JayFTheJulia: when things like that align, I take it as a sign that we're going down the right path13:21
* TheJulia finally returns from the coffee maker13:22
JayFTheJulia, others: I think I'm just going to call RBAC'ing a thing in Ironic "exposing it to the project API" or similar13:24
JayFit's happier than using rbac as a veb13:24
JayF*verb13:24
TheJuliaJayF: put a note at the top, the request context won't be in the task past the initial request, so we'll need to save it as a dict13:26
JayFack it's in there13:27
TheJuliaall *later* contexts will be admin aligned with ironic's configured access rights13:27
TheJuliaat least those attached to tasks13:27
JayFI'm writing up the RFE, I assume both of these will likely need a spec13:27
JayFeven if just to help me clarify my thinking13:27
TheJulia++13:27
* TheJulia sips coffee13:27
JayFbut I want us to agree it's valuable and the general shape before I go too deep13:27
JayFoh heh13:28
JayFif we have generic step templates13:28
JayFwe could also then say13:28
JayF"automated cleaning for this node is template XXXX"13:28
JayF🤯13:28
TheJuliadtantsur: w/r/t the database locking stuff, I suspect that when the sqlalchemy row just gets casted as a tuple (which it *can* represent itself as, that the original hangs around until the new tuple is gone, but I'm not digging into it *that* deep at this point, which I think is why iterating down the row makes a difference.13:31
dtantsurTheJulia: that must involve quite some magic.. but magic is what SA is about, soooo13:31
TheJulialess magic, at least having looked at sqlalchemy code, I *suspect* part of it is whatever cpython does under the hood when you ask for a tuple13:33
TheJuliabut if I scratch the itch of learning that, I worry I'll suddenly want to become a cpython maintainer13:34
dtantsurnormal tuple()/list()/set() stuff is building a structure from the iterator13:34
TheJuliaand then I risk becoming a lost soul13:34
dtantsurhence my confidence that this stuff should work13:34
TheJuliayeah13:35
TheJuliaIt is what I thoguht early on as well, but didn't quite behave as I expected13:35
TheJuliawow lots of discussion this morning13:36
JayFIt helps when talkative-jay is in BST lol13:36
TheJuliaand, btw, I literally agree with both of your comments on https://review.opendev.org/c/openstack/ironic/+/888359/1/ironic/conductor/base_manager.py13:37
TheJuliayay for internal metal conflicts!13:37
TheJuliaI've seen one lock error with the new patch on a heartbeat, which I think is a super good sign, but that also has me wanting to add a retry check to node_update13:38
dtantsurWe already add retry_on_deadlock or something like that to all/most db methods13:38
TheJuliaso fun fact, oslo_db doesn't recognize our exception with the way it gets handled13:39
dtantsuryup :(13:40
dtantsurI assume it's technically not a deadlock13:40
TheJuliayeah, technically13:40
TheJulia*also* we don't *really* get a sqlalchemy exception, we get a sqlite exception13:40
dtantsurhuh, isn't it wrapped?13:40
TheJulia... I remember looking one day and going "wait, that is weird!"13:41
TheJuliano exceptions on the merge, clearly we need to recheck/merge some things!13:42
TheJuliadtantsur: where did you find the error you pasted early this morning on https://review.opendev.org/c/openstack/ironic/+/887853 ?13:43
JayFrpittau: https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting I put two RFEs on the agenda for Monday's meeting; I will likely not be here but please mention them so they can get more eyes on them. Thank you in advance!!13:44
dtantsurTheJulia: the failing job https://zuul.opendev.org/t/openstack/build/46c9a1251917452aad479a26178f45de13:44
TheJuliamuch appreciated13:45
TheJulia... interesting!13:45
* TheJulia takes a minute to make some breakfast13:46
JayFTheJulia: oh, you were asking about eventlet science, it was OK but I got scared b/c I remembered that was to work around a bug in image streaming that we don't test in gate13:49
JayFbut I guess that doesn't apply to ironic with iscsi gonezo13:49
JayFhttps://review.opendev.org/c/openstack/ironic/+/887996 everything voting w/o known issues passed13:50
opendevreviewJay Faulkner proposed openstack/ironic master: DNM: Eventlet science  https://review.opendev.org/c/openstack/ironic/+/88799613:50
JayFlets see how it does with a recheck13:51
TheJuliaths issue is basically that the test doesn't actually sent db relevent config since it overrides it elsewhere13:51
TheJuliawhich means the test fails13:51
TheJuliaactually, erorrs, doesnt fai13:51
TheJulial13:51
opendevreviewJay Faulkner proposed openstack/ironic master: Add additional logging on iLO power failure  https://review.opendev.org/c/openstack/ironic/+/88554914:02
opendevreviewJulia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly  https://review.opendev.org/c/openstack/ironic/+/88785314:19
* JayF screams at sushy unit tests14:20
TheJuliale sigh Jul 12 19:33:19.897599 np0034662248 ironic-conductor[49614]: ERROR ironic.conductor.utils [None req-f1540afd-cdbe-49f0-a5d9-f5e8865d54d9 None None] While executing step {'step': 'log_passthrough', 'priority': 1, 'abortable': False, 'argsinfo': None, 'interface': 'vendor', 'requires_ramdisk': True} on node 81688121-bae1-4683-aa31-c23c8cbaead9, step returned invalid value: True14:22
TheJuliaJayF: why screaming?14:22
TheJuliathey respond better to soothing tones14:22
JayFI think there's just nothing plumbed through far enough for the tests I need to write14:24
JayFand I just realized that14:24
JayFbasically everyone is testing if _op() is ebign called and I can't find tests that make sure _op() does the right thing14:24
JayFhahaha and I just found the actual code bug, the test isn't broken I am14:29
JayFlolsog14:29
opendevreviewJulia Kreger proposed openstack/ironic master: Enable vendor interfaces to be called as steps  https://review.opendev.org/c/openstack/ironic/+/87908914:32
* TheJulia gives JayF a cookie14:33
JayFwell I know where the code is broken14:35
JayFbut I'm perplexed and I'm 99% sure it's just jet lag14:35
JayFbecause I think I'm just forgetting how kwargs work, somehow14:35
opendevreviewJay Faulkner proposed openstack/sushy master: Requests must always have a read/connect timeout  https://review.opendev.org/c/openstack/sushy/+/88813114:39
JayFmy compat code and test for it is failing, I'm going to walk and clear my head, if someone wants to look and point out what I assume has to be an obvious error in gerrit please do so14:39
JayFotherwise hopefully fresh eyes do the trick14:39
opendevreviewJulia Kreger proposed openstack/ironic master: Enable vendor interfaces to be called as steps  https://review.opendev.org/c/openstack/ironic/+/87908914:41
TheJuliaokay, needed to clarify the docs a little bit there14:41
TheJuliaoh, I kind of see what the disconnect is14:42
TheJuliahmm14:42
TheJuliayou did the needful14:42
TheJuliawut14:42
JayFPrint-statement-debugging says that the actual code in connector is not working14:49
JayF'string' in doesn't work, does it14:49
JayFbecause it checks equality not equivalency?14:49
JayFbut I reshaped it to use .get14:50
TheJuliaoh, no14:50
TheJuliano14:50
TheJuliapulling it down14:51
JayFI think we might have too much mocked out for it to work?14:51
JayFif you're looking  that close, we can just meet on it quick?14:51
JayFi have it in ide with print debugs here14:52
TheJuliaThe heavy mocking has been problematic there14:52
JayFand I have a big conf all to myself14:52
JayFI'm really, really tempted just to punt the new test14:52
TheJuliaoh14:52
TheJuliaheh14:52
TheJuliayeah, lets talk because two different things seem to be going on14:52
JayFhttps://us06web.zoom.us/j/86315191026?pwd=ZW53U0dENGF5bUNQc0s0MHFVOElWdz0914:53
JayFI'm going to have to restart TheJulia 14:55
TheJuliak14:55
opendevreviewJay Faulkner proposed openstack/sushy master: Requests must always have a read/connect timeout  https://review.opendev.org/c/openstack/sushy/+/88813115:12
opendevreviewMahnoor Asghar proposed openstack/ironic master: WIP: Add inspection (processing) hooks  https://review.opendev.org/c/openstack/ironic/+/88755416:42
TheJuliahmmm... power_sync why you cause problems17:58
NobodyCamGood morning Ironic Folks!18:18
NobodyCamhappy almost Friday18:19
TheJuliagood morning NobodyCam 18:21
NobodyCamo/ TheJulia 18:21
NobodyCamHow's the weather on that side of the hill18:22
TheJuliaNobodyCam: not awful18:34
TheJulia90F18:34
NobodyCamhehehehe 18:34
TheJuliahow about down in the valley18:34
NobodyCam103 right now :(18:35
NobodyCamhigh of 11118:35
TheJuliaugh18:35
NobodyCamSaturday 11718:35
TheJuliaugh, and we have a wedding to go to on Saturday afternoon down there18:36
opendevreviewJulia Kreger proposed openstack/ironic master: WIP: Restrict parallel power state sync workers with sqlite  https://review.opendev.org/c/openstack/ironic/+/88849718:49
opendevreviewJulia Kreger proposed openstack/ironic master: Add a little variability for heartbeating  https://review.opendev.org/c/openstack/ironic/+/88835919:05
opendevreviewJulia Kreger proposed openstack/ironic master: WIP: Retry node update failures due to locks  https://review.opendev.org/c/openstack/ironic/+/88850020:46
opendevreviewJulia Kreger proposed openstack/ironic master: WIP: add logging to help determine when periodics are actually running  https://review.opendev.org/c/openstack/ironic/+/88850320:52
opendevreviewJulia Kreger proposed openstack/ironic master: DNM: Don't actually heartbeat with sqlite!  https://review.opendev.org/c/openstack/ironic/+/88850621:36
opendevreviewMerged openstack/bifrost master: remove pymysql system packages requirement  https://review.opendev.org/c/openstack/bifrost/+/87451922:22
TheJuliaso... I wonder if that might *actually* work22:31

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!