Nick | Message | Time |
---|---|---|
opendevreview | Jacob Anders proposed openstack/ironic master: [WIP] Retry connecting vmedia through a DVD device if available. https://review.opendev.org/c/openstack/ironic/+/887665 | 07:27 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Very basic in-band inspection with the "agent" interface https://review.opendev.org/c/openstack/ironic/+/885450 | 07:49 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Add the initial skeleton of the agent inspect interface https://review.opendev.org/c/openstack/ironic/+/877814 | 07:50 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Very basic in-band inspection with the "agent" interface https://review.opendev.org/c/openstack/ironic/+/885450 | 07:50 |
dtantsur | masghar: recovered ^^^ | 07:50 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: WIP: Add inspection (processing) hooks https://review.opendev.org/c/openstack/ironic/+/887554 | 08:42 |
opendevreview | Ebbex proposed openstack/bifrost master: Consolidate ubuntu/debian required_defaults https://review.opendev.org/c/openstack/bifrost/+/888444 | 09:09 |
opendevreview | Ebbex proposed openstack/bifrost master: sgabios-bin is a subpackage of sgabios https://review.opendev.org/c/openstack/bifrost/+/888445 | 09:09 |
opendevreview | Ebbex proposed openstack/bifrost master: Reduce the libvirt/qemu packages list https://review.opendev.org/c/openstack/bifrost/+/888446 | 09:09 |
opendevreview | Ebbex proposed openstack/bifrost master: Consolidate centos/fedora/redhat required_defaults https://review.opendev.org/c/openstack/bifrost/+/888447 | 09:09 |
opendevreview | Ebbex proposed openstack/bifrost master: Refactor the use of include_vars https://review.opendev.org/c/openstack/bifrost/+/888448 | 09:09 |
JayF | ebbex lots of contributions! Thank you! If you're not getting timely review on anything lmk | 09:19 |
JayF | So going to pitch an idea here; will probably rfe it... an optional-to-enable deploy step that'd apply in all cases that does some kind of pre-check to ensure the machine will be deployable | 09:37 |
JayF | like checking BMC health, 'hello world' connectivity check, etc | 09:37 |
JayF | I wonder if the existing validate() in drivers would work or if we'd need something new | 09:37 |
dtantsur | JayF: validate() was designed to be quick since it's used in a blocking fashion | 10:23 |
JayF | Yeah, that's what I figured. We'd need to add a method to validate live, and add stuff to all the driver interfaces for that :( | 10:24 |
dtantsur | checking BMC health sounds like a core feature, not an opt-in thing | 10:24 |
JayF | opt-in would be my suggestion so we can be thorough | 10:24 |
JayF | you're basically talking about performing some of the core ops in a deployment an extra time | 10:24 |
JayF | so I'm not sure it's a slam-dunk to make the correctness-for-speed tradeoff for everyone | 10:25 |
JayF | which is why I default to status quo and opt-in | 10:25 |
dtantsur | I'm always worried how many opt-in things we (and openstack in general) have | 10:25 |
dtantsur | "make sure the BMC works" does not sound like a thing many operators would opt out of | 10:26 |
JayF | I'm thinking it'd be giving almost every driver interface a chance to do a thing | 10:26 |
JayF | for networking that might be pretty sizable | 10:27 |
dtantsur | do you have a specific example then? | 10:27 |
JayF | for downstream conversations, we were thinking maybe 1) bmc health check 2) ensure access to switches if !flat/noop 3) other configurable items that users might need downstream (such as external hardware/storage devices that might be assisting in deploy) | 10:29 |
JayF | MVP of a BMC health check wouldn't be a bad idea, but in practice I see as many failures from network automation | 10:30 |
JayF | so I'd wanna cover that too | 10:30 |
opendevreview | Verification of a change to openstack/bifrost master failed: remove pymysql system packages requirement https://review.opendev.org/c/openstack/bifrost/+/874519 | 10:30 |
dtantsur | JayF: so, what's the case of NOT wanting it before deployment? | 10:32 |
JayF | You're talking about adding maybe a minute, more in broken cases, to a deployment process | 10:32 |
JayF | when some of the most clear feedback I hear from operators I interact with IRL is that things often take too long | 10:33 |
JayF | which is why I hesitate to default that time trade-off | 10:33 |
JayF | but I don't feel super strongly about that but | 10:33 |
JayF | *bit | 10:33 |
dtantsur | Does "ensure access to switches" take that long? | 10:33 |
dtantsur | or more validations? | 10:33 |
JayF | I know that I've seen environments where a round trip to the switch can take 30 seconds+ | 10:34 |
JayF | the flip side is that the failure case, especially integrated w/nova+retries, is a lot better | 10:34 |
JayF | you'd get rescheduled a lot faster, especially when thinking about a fast-track use case where Ironic might image the machine before realizing BMC/switch is gone | 10:35 |
JayF | like I said, could go either way | 10:35 |
dtantsur | hmmm | 10:35 |
dtantsur | yeah, let's open this discussion for sure | 10:35 |
JayF | but the fact we're arguing about opt-in or opt-out means you agree on the piece that matters | 10:35 |
JayF | dtantsur: my assumption: this is a spec and adding a method to ~all hardware interfaces, yeah? | 10:35 |
dtantsur | Well, I'd *love* to see more opt-in and opt-out deploy steps in practice | 10:35 |
JayF | dtantsur: okay, idea #2 that came up downstream this week | 10:35 |
dtantsur | this may mean that we need to open steps to more interfaces | 10:35 |
JayF | this one I love a lot more even, and fits with your topic now | 10:36 |
JayF | step templates | 10:36 |
JayF | like we have deploy templates, we should have cleaning/service step templates | 10:36 |
JayF | and have it setup so we could potentially wire into RBAC where, e.g. a lessee could run service_step_template "update my firmware" on their own with locked in arguments | 10:36 |
JayF | to allow some self-serve maintenance with incredibly strong guardrails | 10:36 |
JayF | this would fit *super well* with more steps, in general, in our library | 10:36 |
dtantsur | Funny that you mention it. I've been thinking: if I were to write Ironic API v2, it would be organized around workflows. | 10:38 |
dtantsur | It's kinda-sorta same idea, but on a much more ambitious level | 10:39 |
JayF | I am less ambitious and more get-shit-done by default | 10:39 |
JayF | that doesn't mean you are wrong that it should be part of a bigger v2 | 10:39 |
dtantsur | Which is to say, I like the idea of templates JayF. The API would need careful consideration. | 10:39 |
JayF | just that is not something I've pondered on | 10:39 |
JayF | dtantsur: I see a nice symmetry in this, right? Steps grew up in cleaning, and we sorta have ^c^v that concept everywhere | 10:40 |
dtantsur | Yeah, I don't think we'll see Ironic v2. | 10:40 |
JayF | dtantsur: now I wanna do the same with Deploy Templates, at least in concept | 10:40 |
JayF | IDK how portable that specific code is | 10:40 |
dtantsur | Somewhat? | 10:40 |
dtantsur | My biggest worry is that we'll have a vastly asymmetric API here | 10:40 |
dtantsur | deploy templates are invoked via traits, which is a fully Nova-centric approach | 10:41 |
JayF | so perhaps we keep that as a facade for nova | 10:41 |
JayF | and do "step templates" more generically for ironic-facing workloads | 10:41 |
JayF | **ironic-user-facing workloads | 10:41 |
dtantsur | I'm also curious what TheJulia says if we try to create dynamic RBAC rules, i.e. an RBAC rule per template :D | 10:41 |
JayF | yeah that's really the killer piece | 10:42 |
JayF | if I can't give $end_user_operator access to do a limited set of service steps, it's not useful | 10:42 |
JayF | because security would require me to limit the things that can be done, and we have steps generally which might be considered harmful | 10:42 |
JayF | e.g. firmware downgrades | 10:42 |
dtantsur | Well.. even if we grant access to all templates, it's still better than now | 10:43 |
JayF | If the templates were tenant-aware; yes | 10:43 |
JayF | if not, it wouldn't help my use case | 10:43 |
JayF | s/tenant/project/ | 10:43 |
JayF | last decade is leaking lol | 10:43 |
dtantsur | :D | 10:43 |
dtantsur | I still like tenant more.. | 10:43 |
JayF | I like when magic words don't chage | 10:43 |
JayF | lol | 10:44 |
JayF | *change | 10:44 |
dtantsur | Will it help if we add a project field to templates? (ignoring the issue with naming conflicts, which we already have with nodes) | 10:44 |
JayF | yes; but my use cases are mostly on manual clean/service steps/maybe even rescue one day | 10:45 |
JayF | (rescue needs to be reimplemented as a service step when we get there anyway, so that'd collapse down) | 10:45 |
opendevreview | Merged openstack/ironic master: DB: Fix result set locking with periodics https://review.opendev.org/c/openstack/ironic/+/888188 | 12:21 |
TheJulia | rules per template?!?!?!?!?! | 13:05 |
TheJulia | whaaaat!? | 13:05 |
TheJulia | someone, where is the beginning of the idea?! | 13:06 |
TheJulia | or how strong should my coffee be? | 13:06 |
JayF | TheJulia: so two ideas basically; semi-related: | 13:07 |
dtantsur | :D | 13:07 |
JayF | 1) Have the ability to perform a precheck before doing node operations, to non-destructively fail (e.g. so a user running nova rebuild would still have their workload running even if the BMC was dead even if the instance goes to err) | 13:07 |
JayF | 2) The ability to template out anything that uses "steps" to operate, in such a way that it can be RBAC'd (maybe make templates tenant-aware? or some concept of "y template can be run by owner/lessee of a node") | 13:08 |
JayF | The use case for 1 is obvious | 13:08 |
JayF | use case for 2 would be self-service user firmware upgrades, done in a secure enough way to prevent the end-user from performing arbitrary steps | 13:09 |
JayF | It's easy to imagine a world where we would want Ironic to be able to say "you can run this firmware upgrade reciept; but you can't run firmware_upgrade step with arbitrary values to downgrade to a vulnerable firmware" or similar | 13:10 |
JayF | s/reciept/recipe/ | 13:10 |
TheJulia | I avoided pouring a ton of rbac on to templates because of the unknowns and lack of asks. We would need to add at least two fields to that. Not impossible on the template side | 13:12 |
JayF | The thing is, I don't want "deploy templates" | 13:12 |
JayF | I want arbitrary step templates | 13:12 |
JayF | all my use cases are around manual cleaning and service steps | 13:13 |
TheJulia | like public=True/False or an owner. Actual policy with step execution would mean something. I need coffee to think through it | 13:13 |
TheJulia | JayF: technically, already exists, just sufficiently high level access user must create | 13:13 |
TheJulia | wait | 13:14 |
JayF | *blink* | 13:14 |
TheJulia | you want to be able to invoke a step and pickup a specific step by name without providing other args? | 13:14 |
TheJulia | so user says "do x, y, templatez" | 13:14 |
JayF | I want to be able to say `openstack baremetal clean --template=firmware-upgrade-2.3.4` | 13:14 |
JayF | and for a larger admin to be able to define firmware-upgrade-2.3.4 | 13:14 |
JayF | and for that user with access to run the template to not be able to do anything but curated sets of steps they have access to | 13:15 |
JayF | again, think self-service node maintenance, where you can hand the API off to the project team, and they can coordinate their own service | 13:15 |
JayF | (really service steps is the bigger use case for this; but it seems weird to have service templates and deploy templates w/o manual clean templates) | 13:15 |
TheJulia | so the lacking part is a user, on demand, being able to say "go execute this template" | 13:16 |
JayF | So this is just RBAC for templates then | 13:16 |
JayF | more or less | 13:16 |
JayF | with maybe a bit of api on top | 13:17 |
TheJulia | well | 13:17 |
TheJulia | actually, two fold | 13:17 |
TheJulia | we have deploy templates, not cleaning templates | 13:17 |
JayF | yeah, that's what I thought, we have to generic-ify templates to cleaning and service | 13:17 |
JayF | then RBAC on top of it | 13:17 |
TheJulia | although on some level, we could just say "yes, you may cross-reference them" | 13:17 |
JayF | yeah, I consider that implementation detail | 13:17 |
JayF | I don't know the code well enough to know how I'd write it | 13:17 |
TheJulia | we have enough things that are dual purpose that maybe it just stays a slightly confusing name | 13:17 |
JayF | I'm just teasing out enough of the interface so I can RFE it | 13:18 |
JayF | weak -1 to that naming idea, but also meh | 13:18 |
TheJulia | we'd need a knob to deny API submitted steps | 13:18 |
JayF | I consider that as part of "RBAC'ing templates" | 13:19 |
TheJulia | mentally, this is sort of similar to dpus, where one step along the way is that, sort of like allowing a vendor interface to be called as a step | 13:19 |
TheJulia | (... because ipmitool send_raw is needed to unblock Bluefield cards for inband firmware upgrades....) | 13:20 |
TheJulia | so a pile of semi-related things | 13:20 |
JayF | https://bugs.launchpad.net/ironic/+bug/2027688 (this is for the precheck idea) | 13:20 |
JayF | TheJulia: when things like that align, I take it as a sign that we're going down the right path | 13:21 |
* TheJulia finally returns from the coffee maker | 13:22 | |
JayF | TheJulia, others: I think I'm just going to call RBAC'ing a thing in Ironic "exposing it to the project API" or similar | 13:24 |
JayF | it's happier than using rbac as a veb | 13:24 |
JayF | *verb | 13:24 |
TheJulia | JayF: put a note at the top, the request context won't be in the task past the initial request, so we'll need to save it as a dict | 13:26 |
JayF | ack it's in there | 13:27 |
TheJulia | all *later* contexts will be admin aligned with ironic's configured access rights | 13:27 |
TheJulia | at least those attached to tasks | 13:27 |
JayF | I'm writing up the RFE, I assume both of these will likely need a spec | 13:27 |
JayF | even if just to help me clarify my thinking | 13:27 |
TheJulia | ++ | 13:27 |
* TheJulia sips coffee | 13:27 | |
JayF | but I want us to agree it's valuable and the general shape before I go too deep | 13:27 |
JayF | oh heh | 13:28 |
JayF | if we have generic step templates | 13:28 |
JayF | we could also then say | 13:28 |
JayF | "automated cleaning for this node is template XXXX" | 13:28 |
JayF | 🤯 | 13:28 |
TheJulia | dtantsur: w/r/t the database locking stuff, I suspect that when the sqlalchemy row just gets cast as a tuple (which it *can* represent itself as), the original hangs around until the new tuple is gone, but I'm not digging into it *that* deep at this point, which I think is why iterating down the row makes a difference. | 13:31 |
dtantsur | TheJulia: that must involve quite some magic.. but magic is what SA is about, soooo | 13:31 |
TheJulia | less magic, at least having looked at sqlalchemy code, I *suspect* part of it is whatever cpython does under the hood when you ask for a tuple | 13:33 |
TheJulia | but if I scratch the itch of learning that, I worry I'll suddenly want to become a cpython maintainer | 13:34 |
dtantsur | normal tuple()/list()/set() stuff is building a structure from the iterator | 13:34 |
TheJulia | and then I risk becoming a lost soul | 13:34 |
dtantsur | hence my confidence that this stuff should work | 13:34 |
TheJulia | yeah | 13:35 |
TheJulia | It is what I thought early on as well, but didn't quite behave as I expected | 13:35 |
TheJulia | wow lots of discussion this morning | 13:36 |
JayF | It helps when talkative-jay is in BST lol | 13:36 |
TheJulia | and, btw, I literally agree with both of your comments on https://review.opendev.org/c/openstack/ironic/+/888359/1/ironic/conductor/base_manager.py | 13:37 |
TheJulia | yay for internal metal conflicts! | 13:37 |
TheJulia | I've seen one lock error with the new patch on a heartbeat, which I think is a super good sign, but that also has me wanting to add a retry check to node_update | 13:38 |
dtantsur | We already add retry_on_deadlock or something like that to all/most db methods | 13:38 |
TheJulia | so fun fact, oslo_db doesn't recognize our exception with the way it gets handled | 13:39 |
dtantsur | yup :( | 13:40 |
dtantsur | I assume it's technically not a deadlock | 13:40 |
TheJulia | yeah, technically | 13:40 |
TheJulia | *also* we don't *really* get a sqlalchemy exception, we get a sqlite exception | 13:40 |
dtantsur | huh, isn't it wrapped? | 13:40 |
TheJulia | ... I remember looking one day and going "wait, that is weird!" | 13:41 |
TheJulia | no exceptions on the merge, clearly we need to recheck/merge some things! | 13:42 |
TheJulia | dtantsur: where did you find the error you pasted early this morning on https://review.opendev.org/c/openstack/ironic/+/887853 ? | 13:43 |
JayF | rpittau: https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting I put two RFEs on the agenda for Monday's meeting; I will likely not be here but please mention them so they can get more eyes on them. Thank you in advance!! | 13:44 |
dtantsur | TheJulia: the failing job https://zuul.opendev.org/t/openstack/build/46c9a1251917452aad479a26178f45de | 13:44 |
TheJulia | much appreciated | 13:45 |
TheJulia | ... interesting! | 13:45 |
* TheJulia takes a minute to make some breakfast | 13:46 | |
JayF | TheJulia: oh, you were asking about eventlet science, it was OK but I got scared b/c I remembered that was to work around a bug in image streaming that we don't test in gate | 13:49 |
JayF | but I guess that doesn't apply to ironic with iscsi gonezo | 13:49 |
JayF | https://review.opendev.org/c/openstack/ironic/+/887996 everything voting w/o known issues passed | 13:50 |
opendevreview | Jay Faulkner proposed openstack/ironic master: DNM: Eventlet science https://review.opendev.org/c/openstack/ironic/+/887996 | 13:50 |
JayF | let's see how it does with a recheck | 13:51 |
TheJulia | the issue is basically that the test doesn't actually set db relevant config since it overrides it elsewhere | 13:51 |
TheJulia | which means the test fails | 13:51 |
TheJulia | actually, errors, doesn't fai | 13:51 |
TheJulia | l | 13:51 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add additional logging on iLO power failure https://review.opendev.org/c/openstack/ironic/+/885549 | 14:02 |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix SQLAlchemy listener for engine connection, correctly https://review.opendev.org/c/openstack/ironic/+/887853 | 14:19 |
* JayF screams at sushy unit tests | 14:20 | |
TheJulia | le sigh Jul 12 19:33:19.897599 np0034662248 ironic-conductor[49614]: ERROR ironic.conductor.utils [None req-f1540afd-cdbe-49f0-a5d9-f5e8865d54d9 None None] While executing step {'step': 'log_passthrough', 'priority': 1, 'abortable': False, 'argsinfo': None, 'interface': 'vendor', 'requires_ramdisk': True} on node 81688121-bae1-4683-aa31-c23c8cbaead9, step returned invalid value: True | 14:22 |
TheJulia | JayF: why screaming? | 14:22 |
TheJulia | they respond better to soothing tones | 14:22 |
JayF | I think there's just nothing plumbed through far enough for the tests I need to write | 14:24 |
JayF | and I just realized that | 14:24 |
JayF | basically everyone is testing if _op() is being called and I can't find tests that make sure _op() does the right thing | 14:24 |
JayF | hahaha and I just found the actual code bug, the test isn't broken I am | 14:29 |
JayF | lolsog | 14:29 |
opendevreview | Julia Kreger proposed openstack/ironic master: Enable vendor interfaces to be called as steps https://review.opendev.org/c/openstack/ironic/+/879089 | 14:32 |
* TheJulia gives JayF a cookie | 14:33 | |
JayF | well I know where the code is broken | 14:35 |
JayF | but I'm perplexed and I'm 99% sure it's just jet lag | 14:35 |
JayF | because I think I'm just forgetting how kwargs work, somehow | 14:35 |
opendevreview | Jay Faulkner proposed openstack/sushy master: Requests must always have a read/connect timeout https://review.opendev.org/c/openstack/sushy/+/888131 | 14:39 |
JayF | my compat code and test for it is failing, I'm going to walk and clear my head, if someone wants to look and point out what I assume has to be an obvious error in gerrit please do so | 14:39 |
JayF | otherwise hopefully fresh eyes do the trick | 14:39 |
opendevreview | Julia Kreger proposed openstack/ironic master: Enable vendor interfaces to be called as steps https://review.opendev.org/c/openstack/ironic/+/879089 | 14:41 |
TheJulia | okay, needed to clarify the docs a little bit there | 14:41 |
TheJulia | oh, I kind of see what the disconnect is | 14:42 |
TheJulia | hmm | 14:42 |
TheJulia | you did the needful | 14:42 |
TheJulia | wut | 14:42 |
JayF | Print-statement-debugging says that the actual code in connector is not working | 14:49 |
JayF | 'string' in doesn't work, does it | 14:49 |
JayF | because it checks equality not equivalency? | 14:49 |
JayF | but I reshaped it to use .get | 14:50 |
TheJulia | oh, no | 14:50 |
TheJulia | no | 14:50 |
TheJulia | pulling it down | 14:51 |
JayF | I think we might have too much mocked out for it to work? | 14:51 |
JayF | if you're looking that close, we can just meet on it quick? | 14:51 |
JayF | i have it in ide with print debugs here | 14:52 |
TheJulia | The heavy mocking has been problematic there | 14:52 |
JayF | and I have a big conf all to myself | 14:52 |
JayF | I'm really, really tempted just to punt the new test | 14:52 |
TheJulia | oh | 14:52 |
TheJulia | heh | 14:52 |
TheJulia | yeah, let's talk because two different things seem to be going on | 14:52 |
JayF | https://us06web.zoom.us/j/86315191026?pwd=ZW53U0dENGF5bUNQc0s0MHFVOElWdz09 | 14:53 |
JayF | I'm going to have to restart TheJulia | 14:55 |
TheJulia | k | 14:55 |
opendevreview | Jay Faulkner proposed openstack/sushy master: Requests must always have a read/connect timeout https://review.opendev.org/c/openstack/sushy/+/888131 | 15:12 |
opendevreview | Mahnoor Asghar proposed openstack/ironic master: WIP: Add inspection (processing) hooks https://review.opendev.org/c/openstack/ironic/+/887554 | 16:42 |
TheJulia | hmmm... power_sync why you cause problems | 17:58 |
NobodyCam | Good morning Ironic Folks! | 18:18 |
NobodyCam | happy almost Friday | 18:19 |
TheJulia | good morning NobodyCam | 18:21 |
NobodyCam | o/ TheJulia | 18:21 |
NobodyCam | How's the weather on that side of the hill | 18:22 |
TheJulia | NobodyCam: not awful | 18:34 |
TheJulia | 90F | 18:34 |
NobodyCam | hehehehe | 18:34 |
TheJulia | how about down in the valley | 18:34 |
NobodyCam | 103 right now :( | 18:35 |
NobodyCam | high of 111 | 18:35 |
TheJulia | ugh | 18:35 |
NobodyCam | Saturday 117 | 18:35 |
TheJulia | ugh, and we have a wedding to go to on Saturday afternoon down there | 18:36 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: Restrict parallel power state sync workers with sqlite https://review.opendev.org/c/openstack/ironic/+/888497 | 18:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add a little variability for heartbeating https://review.opendev.org/c/openstack/ironic/+/888359 | 19:05 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: Retry node update failures due to locks https://review.opendev.org/c/openstack/ironic/+/888500 | 20:46 |
opendevreview | Julia Kreger proposed openstack/ironic master: WIP: add logging to help determine when periodics are actually running https://review.opendev.org/c/openstack/ironic/+/888503 | 20:52 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: Don't actually heartbeat with sqlite! https://review.opendev.org/c/openstack/ironic/+/888506 | 21:36 |
opendevreview | Merged openstack/bifrost master: remove pymysql system packages requirement https://review.opendev.org/c/openstack/bifrost/+/874519 | 22:22 |
TheJulia | so... I wonder if that might *actually* work | 22:31 |
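
Editor's sketch of the "step templates" idea discussed between 10:36 and 13:19: an admin defines a named recipe with locked-in arguments, and a project user may run only that recipe, never arbitrary steps. Everything here is an assumption for illustration and not Ironic code: `StepTemplate`, `resolve_steps`, the `owner`/`public` fields (loosely following the "public=True/False or an owner" remark), the step name, and the firmware URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StepTemplate:
    """Hypothetical admin-defined recipe: fixed steps with locked-in args."""
    name: str
    steps: Tuple[dict, ...]
    owner: Optional[str] = None  # project allowed to run the template
    public: bool = False         # runnable by any project when True


def resolve_steps(template, project_id, user_steps=None):
    """Return the steps to execute, or raise PermissionError."""
    # In this sketch, API-submitted steps are always denied for template users.
    if user_steps:
        raise PermissionError("arbitrary steps are not allowed; use a template")
    if not (template.public or template.owner == project_id):
        raise PermissionError(f"project {project_id} may not run {template.name}")
    # Copy each step and its args so callers cannot mutate the curated values.
    return [dict(step, args=dict(step.get("args", {}))) for step in template.steps]


# Illustrative template; the step name and URL are placeholders.
firmware_upgrade = StepTemplate(
    name="firmware-upgrade-2.3.4",
    steps=({"interface": "management", "step": "update_firmware",
            "args": {"url": "https://example.com/fw-2.3.4.bin"}},),
    owner="project-a",
)
```

Calling `resolve_steps(firmware_upgrade, "project-a")` returns the curated steps, while another project, or any caller passing its own `user_steps`, gets a `PermissionError`. A real design would also need the request context saved with the task, as noted at 13:26.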