rpittau | good morning ironic! o/ | 06:48 |
iurygregory | good morning Ironic | 11:25 |
*** iurygregory_ is now known as iurygregory | 12:08 | |
iurygregory | anything urgent for review? | 12:15 |
TheJulia | good morning | 13:01 |
rpittau | good morning TheJulia :) | 13:03 |
rpittau | iurygregory: anything in ironic-week-prio tags? | 13:03 |
TheJulia | I added it to https://review.opendev.org/c/openstack/ironic/+/888506 | 13:05 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/888500 is likely super close, but I've not seen it log any retries :( | 13:05 |
iurygregory | good morning TheJulia o/ | 13:08 |
iurygregory | rpittau, right, one week of PTO and I started to forget things lol | 13:08 |
rpittau | :D | 13:08 |
rpittau | well to be fair it's probably not super up-to-date | 13:08 |
iurygregory | =X | 13:09 |
TheJulia | more along the lines of we've been fighting the gate a lot | 13:09 |
rpittau | yeah | 13:11 |
TheJulia | A continued fight to reduce lock conflicts in metal3 | 13:12 |
TheJulia | nuking the heartbeats seems to help a *ton* | 13:12 |
TheJulia | which leaves retry really only viable to test with a loaded CI system | 13:13 |
iurygregory | yup, I was imagining that we would still be fighting CI | 13:13 |
rpittau | "EFI variables are not supported on this system" ok.... | 13:44 |
TheJulia | where did you see that... 8| | 13:45 |
rpittau | tinycore 14.x patch https://621ce5ea301cd3e5f94e-66098508dda66f6b765978d818669c61.ssl.cf2.rackcdn.com/887754/2/check/ironic-standalone-ipa-src/aaad71d/job-output.txt | 13:45 |
rpittau | are we back to jammy for standalone? don't remember | 13:45 |
TheJulia | we are | 13:46 |
TheJulia | identified the issue | 13:46 |
TheJulia | bug filed, routed around | 13:46 |
rpittau | ok, something else going wrong there then | 13:47 |
TheJulia | If we can get some reviews on https://review.opendev.org/c/openstack/ironic/+/888506 I think it would make sense to proceed with | 13:51 |
TheJulia | Then again, I have no idea if CI is presently super happy load wise, or hates our existence right now. | 13:51 |
TheJulia | Also, file under "wheeeeeee": https://bugzilla.redhat.com/show_bug.cgi?id=2222981 | 13:51 |
rpittau | w00t | 13:56 |
iurygregory | Don't actually heartbeat with sqlite | 14:00 |
iurygregory | WOW | 14:00 |
iurygregory | re 2222981 WHAT?! | 14:01 |
TheJulia | yeah..... | 14:05 |
TheJulia | and the retry change passed again | 14:06 |
mohammed | TheJulia we have built an ironic image for metal3 with the fix https://review.opendev.org/c/openstack/ironic/+/888188 but still get the error sqlite3.OperationalError: database is locked | 14:12 |
TheJulia | mohammed: there are three competing issues, one is periodics, the next is the heartbeat sync for conductor status which I've got a patch up to disable when using sqlite, and the third seems to be to add db retry logic | 14:17 |
TheJulia | so, it was not a "silver bullet" unfortunately, but I think we're almost there | 14:17 |
TheJulia | I also worked until midnight local on Friday night to try and get it sorted, and I'm still trying to observe failures in CI | 14:17 |
TheJulia | if you're willing, https://review.opendev.org/c/openstack/ironic/+/888506 and https://review.opendev.org/c/openstack/ironic/+/888500/ should reduce the chance substantially and add a retry decorator around db writes. | 14:20 |
mohammed | TheJulia do you think replacing SQLite with MariaDB can limit the occurrences of this issue? | 14:20 |
TheJulia | absolutely, MariaDB can handle the multithreaded write operations natively, where write operations with sqlite are more transactional on a file level so everything going on right now tends to result in one consumer encountering another | 14:22 |
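A minimal sketch of the retry idea discussed above (the change in review 888500), assuming SQLAlchemy is the DB layer; the decorator and parameter names here are hypothetical, and the real patch may use oslo.db or tenacity instead:

    import functools
    import time

    from sqlalchemy import exc as sa_exc


    def retry_on_sqlite_lock(max_attempts=10, delay=0.5):
        # Hypothetical decorator: retry a DB write when SQLite reports
        # "database is locked"; any other error is re-raised immediately.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except sa_exc.OperationalError as e:
                        if ('database is locked' not in str(e)
                                or attempt == max_attempts):
                            raise
                        # Back off briefly so the competing writer can finish.
                        time.sleep(delay)
            return wrapper
        return decorator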
JayF | rpittau: are you going to run today's meeting? | 14:29 |
JayF | It's already 30 minutes late :( | 14:29 |
rpittau | ? | 14:29 |
rpittau | isn't it in 30 minutes ? | 14:29 |
JayF | I just traversed a giant amount of timezones | 14:29 |
JayF | so I would trust your clocking better than mine | 14:29 |
rpittau | :D | 14:29 |
JayF | lol | 14:29 |
JayF | see, this is why I'm not working today | 14:29 |
rpittau | JayF: no worries, I'll take care of the meeting :) | 14:30 |
JayF | thank you lol | 14:30 |
JayF | I was just so confused | 14:30 |
iurygregory | yeah, it's in 30min | 14:30 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks https://review.opendev.org/c/openstack/ironic/+/888500 | 14:34 |
TheJulia | mohammed: See the release note attached to ^ | 14:34 |
mohammed | TheJulia thanks for your efforts! We'll try replacing SQLite with MariaDB on our CI to mitigate the issue, while following the progress of solving it with SQLite | 14:37 |
TheJulia | mohammed: you might just want to wait 24-48 hours, since it seems like we might have a path forward ready to go, we just haven't seen load result in locking issues in our CI today or over the weekend | 14:39 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks https://review.opendev.org/c/openstack/ironic/+/888500 | 14:40 |
TheJulia | so... 4k disks. If we can't use iso9660, and we cannot use vfat, what options are really left, ext2/3 ? | 14:41 |
TheJulia | i guess we could detect and repack it as xfs as long as the config label exists | 14:42 |
* TheJulia wonders what local patch HPE's CI is keeping that merge conflicts with upstream | 14:48 | |
mohammed | TheJulia sounds like a great option, we can patiently wait for this fix! Thanks :) | 14:48 |
TheJulia | mohammed: since you can locally reproduce, definitely give the two additional patches a try, I'm really quite hopeful | 14:50 |
opendevreview | Elod Illes proposed openstack/ironic stable/victoria: [stable-only] Cap virtualenv/setuptools https://review.opendev.org/c/openstack/ironic/+/888701 | 14:57 |
*** iurygregory_ is now known as iurygregory | 14:59 | |
rpittau | #startmeeting ironic | 15:00 |
opendevmeet | Meeting started Mon Jul 17 15:00:05 2023 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'ironic' | 15:00 |
iurygregory | o/ | 15:00 |
TheJulia | o/ | 15:00 |
rpittau | welcome everyone to our weekly meeting! | 15:00 |
rpittau | I'll be your host for today :) | 15:00 |
rpittau | The meeting agenda can be found here: | 15:00 |
rpittau | #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting | 15:00 |
rpittau | #topic Announcements/Reminder | 15:01 |
rpittau | we announced last week that the next PTG will take place virtually October 23-27, 2023 | 15:01 |
rpittau | going to remove the reminder after this meeting | 15:02 |
rpittau | we'll remind everyone when we're closer to the date | 15:02 |
rpittau | #note usual friendly reminder to review patches tagged #ironic-week-prio, and tag your patches for priority review | 15:02 |
rpittau | I'm leaving the bobcat timeline for reference in the reminder section | 15:03 |
rpittau | any other announcement/reminder today ? | 15:04 |
TheJulia | I worked stupidly late on Friday night on a decorator for sqlite issues | 15:04 |
rpittau | lol | 15:04 |
TheJulia | I *think* it works, but I've not seen CI give me a resource contention issue to generate a "database is locked" error yet today | 15:04 |
TheJulia | unit test wise, it definitely works! | 15:05 |
rpittau | \o/ | 15:05 |
rpittau | let | 15:05 |
iurygregory | :D | 15:05 |
rpittau | I guess we need to review it and recheck a couple of times | 15:05 |
TheJulia | it has been tagged as a prio review, any reviews/feedback would be appreciated | 15:05 |
TheJulia | ++ | 15:05 |
rpittau | great | 15:05 |
rpittau | thanks TheJulia :) | 15:05 |
TheJulia | yeah, I think I'm on the 3rd stupidly minor change today, so hopefully that is helping | 15:05 |
rpittau | (I'm silently skipping the action items review as there are none from last time) | 15:06 |
rpittau | since we started with that patch let's go to | 15:06 |
rpittau | #topic Review Ironic CI Status | 15:06 |
rpittau | so we're back to jammy, except for the snmp pxe job | 15:07 |
rpittau | and still battling the sqlite shenanigans | 15:07 |
TheJulia | yup, I have no idea why it is failing, but I don't remember the exact reason we held it back to begin with | 15:07 |
rpittau | I don't know, I'll try to make time this week to try and replicate, downstream permitting | 15:08 |
rpittau | also found an issue with the new tinycore 14.x in the standalone job | 15:08 |
rpittau | the one I mentioned before | 15:08 |
rpittau | output of efibootmgr "EFI variables are not supported on this system" | 15:09 |
rpittau | oh well | 15:09 |
rpittau | we're probably not in a rush for that | 15:09 |
rpittau | anything else to mention for the CI status? | 15:10 |
rpittau | ok, moving on | 15:10 |
rpittau | #topic 2023.2 Workstream | 15:10 |
rpittau | #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2 | 15:10 |
rpittau | any update to share? | 15:11 |
iurygregory | feedback would be appreciated in the Firmware Update patches =) | 15:11 |
TheJulia | I've updated the patch to support vendor passthru methods as steps | 15:11 |
TheJulia | and it now passes CI \o/ | 15:11 |
rpittau | iurygregory: yeah, was going to mention that, I have some time this week, I'll put it on top of my list | 15:12 |
rpittau | TheJulia: great! | 15:12 |
TheJulia | hopefully get back to service steps this week | 15:12 |
* TheJulia hopes | 15:12 | |
iurygregory | tks rpittau o/ | 15:12 |
TheJulia | but CI is obviously the priority | 15:12 |
iurygregory | ++ | 15:12 |
iurygregory | I will take a look at patches today for CI | 15:13 |
rpittau | ok, great | 15:14 |
rpittau | I think we're good | 15:14 |
rpittau | #topic RFE review | 15:14 |
rpittau | JayF: left a couple of notes | 15:14 |
TheJulia | So, Jay also put the same link on both | 15:14 |
rpittau | so the first one is https://bugs.launchpad.net/ironic/+bug/2027688 | 15:14 |
rpittau | ah yeah :D | 15:14 |
* TheJulia hands JayF coffee | 15:14 | |
TheJulia | I left a comment on the first one | 15:15 |
rpittau | the second one is https://bugs.launchpad.net/ironic/+bug/2027690 | 15:15 |
rpittau | I'll update the agenda | 15:15 |
TheJulia | I think the idea is good overall, but we need to better understand where the delineation is, and even then, some of the things that fail are deep in the weeds | 15:15 |
TheJulia | and you only see the issue when you're at that step, deep in the weeds | 15:15 |
TheJulia | in other words, more discussion/clarification required | 15:16 |
rpittau | mmmm I agree | 15:16 |
TheJulia | since we can't go check switches we know *nothing* about | 15:16 |
TheJulia | but if we can provide better failure clarity, or "identify networking failed outright" mid-deploy, then we don't have to time out completely | 15:16 |
iurygregory | ++, I like the rfe, but we will need further discussion about it | 15:17 |
TheJulia | so yeah, I agree with the second rfe, we'll need it as an explicit spec and we'll need to figure out how to handle permissions | 15:18 |
TheJulia | it might just be we pre-flight validate all permission rights and then cache the step data from the templates | 15:18 |
* iurygregory is reading the second | 15:18 | |
TheJulia | but, that requires a human putting their brain into the deploy templates code | 15:19 |
rpittau | checking early for the permissions would be best IMHO | 15:19 |
TheJulia | *also* custom clean steps would be a thing to consider | 15:19 |
TheJulia | maybe we just... permit them to be referenced, dunno | 15:19 |
TheJulia | that is a later idea I guess | 15:19 |
rpittau | yeah | 15:19 |
rpittau | once we define the templates, it shouldn't be too hard to expand with custom clean steps | 15:20 |
TheJulia | yeah | 15:20 |
JayF | I mainly wanted feedback on the interfaces I laid out there; I intend to specify the implementation just to keep my thoughts straight | 15:21 |
iurygregory | ++ | 15:21 |
TheJulia | right now you need to have system privileges for the deploy template endpoint if memory serves, so if we take the same pattern as add fields, begin to wire everything together, and then change the rbac policy, it should be a good overall approach | 15:21 |
TheJulia | we have the patterns already spelled out in nodes fairly cleanly too | 15:22 |
TheJulia | and allocations | 15:22 |
rpittau | right! | 15:22 |
TheJulia | because you can have an allocation owner | 15:22 |
TheJulia | (but not lessee, since that model doesn't exist there) | 15:23 |
rpittau | we're probably going to discuss both RFEs further, but they both look good to me | 15:24 |
TheJulia | yeah, first one just needs some more "what are the specific failure cases we're trying to detect" to provide greater clarity | 15:25 |
rpittau | probably clarify some aspects and then finalize during the PTG ? | 15:25 |
TheJulia | because honestly, some areas we do a bad job with today, and we should make that better since they could still fail there even if neutron has a thing saying "yes, I can login to the switch" | 15:25 |
JayF | I don't think for the precheck idea, we're looking for anything perfect, just catch anything obviously broken | 15:27 |
JayF | it was suggested by johnthetubaguy that for ilo, for instance, there's an internal BMC status and you could fail if that wasn't 'green' | 15:27 |
TheJulia | I *suspect* the big one is "bmc is half broken" weird case, and I'm not sure we actually test a swift tempurl we create... | 15:28 |
JayF | and I know with the hardware I ran at Rackspace, even just an ipmi power status would've been enough to indicate if our most common failure modes were active | 15:28 |
iurygregory | yellow can be a thing depending on what you are looking at in iLO | 15:28 |
iurygregory | XD | 15:28 |
TheJulia | i guess one question might end up being, are there *really* operators still turning off power sync at scale? We went from 1 to 8 power sync workers. Of course, that is also ipmi | 15:30 |
TheJulia | Anyway, that is beyond the scope of the meeting, just a question to get a datapoint really | 15:31 |
rpittau | let's remember this (and more) during future discussions :) | 15:31 |
JayF | The more I think of it, the more I wonder if it's just a flag to validation to say "no really, actually validate" | 15:32 |
JayF | and have config to change default behavior for when Ironic does it | 15:32 |
TheJulia | to at least try and fail faster, I guess | 15:33 |
JayF | I also have a use case | 15:33 |
JayF | of failing a rebuild before the disk gets nuked | 15:33 |
rpittau | yeah, I guess that's the point (fail faster) | 15:33 |
JayF | so like e.g. a Ceph node wouldn't drop outta the cluster for longer than needed | 15:33 |
JayF | if we can prevent that even in 25% of failure cases, it reduces operator pain because they don't have to go $bmc_fix_period without their node in a usable state | 15:34 |
TheJulia | JayF: I think additional details or context might be needed there since that might also just be a defect | 15:34 |
TheJulia | or, maybe something we've done work on and a lack of a failure awareness is causing the wrong recovery path to be taken by humans | 15:34 |
JayF | that's sorta the theme of both of these rfes; allowing people with that kind of "I have a sensitive cluster" kind of use cases do their own scheduling of maintenance, and reduce the amount of times we'd have them in a failed-down state | 15:35 |
TheJulia | anyway, lets keep talking about this one after the meeting, I think we need to better understand your use case and what spawned the issue to begin with, if that makes sense | 15:36 |
TheJulia | specifically, if I'm trying to rebuild [to update] a node, and somehow I'm wiping all of ceph volumes out | 15:36 |
rpittau | alright, moving on! | 15:37 |
rpittau | #topic Open Discussion | 15:37 |
TheJulia | I had https://bugzilla.redhat.com/show_bug.cgi?id=2222981 pop up in my inbox this morning! | 15:38 |
rpittau | it's been great running the meetings again this time and the last week :) | 15:38 |
rpittau | next week JayF will be back! | 15:38 |
rpittau | #note Overcloud deploy fails when mounting config drive on 4k disks | 15:38 |
rpittau | #link https://bugzilla.redhat.com/show_bug.cgi?id=2222981 | 15:39 |
TheJulia | I'm looking at cloud-init code to see if there is a happy path forward, but I suspect it might be we may need to locally repack the config drive as an intermediate solution | 15:41 |
TheJulia | there might be, just depends on how long it has *really* been an "or", and we'll need to likely fix glean | 15:44 |
opendevreview | Merged openstack/bifrost master: Refactor use of include_vars https://review.opendev.org/c/openstack/bifrost/+/855807 | 15:44 |
iurygregory | I'm wondering if repack would cause performance issues (but it would be only in the 4k scenario right?) | 15:44 |
TheJulia | well, we would have to open the file on a loopback... if we even can... | 15:45 |
TheJulia | and then write out as a brand new filesystem | 15:45 |
TheJulia | so... dunno | 15:45 |
iurygregory | gotcha | 15:46 |
TheJulia | ¯\_(ツ)_/¯ | 15:46 |
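A rough sketch of the repack approach TheJulia describes above (mount the existing config drive on a loopback, copy it into a fresh filesystem carrying the config-2 label); the ext4 choice, image size, and helper name are illustrative assumptions only, and a real fix could equally target xfs:

    import shutil
    import subprocess
    import tempfile

    CONFIG_DRIVE_LABEL = 'config-2'  # the label cloud-init/glean look for


    def repack_configdrive(iso_path, out_path, size_mb=64):
        # Illustration only: no error handling, cleanup, or privilege checks.
        src = tempfile.mkdtemp()
        dst = tempfile.mkdtemp()
        # Mount the original iso9660 config drive read-only on a loop device.
        subprocess.check_call(['mount', '-o', 'loop,ro', iso_path, src])
        try:
            # Build a new image and format it with a 4k-sector-friendly filesystem.
            subprocess.check_call(['truncate', '-s', '%dM' % size_mb, out_path])
            subprocess.check_call(['mkfs.ext4', '-q', '-L', CONFIG_DRIVE_LABEL, out_path])
            subprocess.check_call(['mount', '-o', 'loop', out_path, dst])
            try:
                shutil.copytree(src, dst, dirs_exist_ok=True)
            finally:
                subprocess.check_call(['umount', dst])
        finally:
            subprocess.check_call(['umount', src])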
rpittau | I don't see a "straight" solution honestly | 15:49 |
TheJulia | yeah, since we support binary payloads | 15:50 |
TheJulia | I'll dig more, and write an upstream bug up | 15:51 |
rpittau | thanks | 15:51 |
iurygregory | ack | 15:51 |
rpittau | alright, anything else for open discussion ? | 15:52 |
rpittau | thanks everyone! | 15:53 |
rpittau | #endmeeting | 15:53 |
opendevmeet | Meeting ended Mon Jul 17 15:53:34 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:53 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.html | 15:53 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.txt | 15:53 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.log.html | 15:53 |
iurygregory | tks! | 15:53 |
TheJulia | JayF: so, that failure case, with the rebuild, what is your perception to the process/failure? I'm guessing the rebuild fails, and instead of being worked as a distinct thing, something happened with nova at that point? | 15:57 |
TheJulia | I'm trying to understand since we have capabilities to prevent those disks from getting erased, so I'm trying to sort out in my head, where things went from "oh, retry rebuild" to "oh no, we lost everything" | 16:02 |
rpittau | good night! o/ | 16:06 |
opendevreview | Merged openstack/bifrost master: remove nginx system packages requirement https://review.opendev.org/c/openstack/bifrost/+/874521 | 16:15 |
TheJulia | Anyone have any block devices with 1k, 2k, or 4k sector sizes handy? Could you run `blockdev --getss /dev/<device>` and provide output | 17:12 |
TheJulia | oh wow | 17:13 |
TheJulia | looks like maybe blockdev -getblocksz <device> might be the thing, maybe?! | 17:14 |
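For reference, the util-linux flags are --getss (logical sector size) and --getpbsz (physical block size). A small sketch of reading the same information from sysfs, where the device name passed in is an assumption of the caller:

    def get_sector_sizes(device_name):
        # device_name is e.g. 'vda' or 'sda'; mirrors
        # `blockdev --getss` / `blockdev --getpbsz` on the device node.
        base = '/sys/block/%s/queue' % device_name
        with open(base + '/logical_block_size') as f:
            logical = int(f.read().strip())
        with open(base + '/physical_block_size') as f:
            physical = int(f.read().strip())
        return logical, physical

    # A 4Kn disk reports 4096/4096, while a 512e disk reports 512/4096.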
TheJulia | 5-6 runs with no database is locked errors. I guess that is a good sign :\ | 17:57 |
iurygregory | yeah | 17:57 |
iurygregory | I need to drop now, going to the airport, will continue working from there o/ | 17:57 |
JayF | TheJulia: not lost data; just downtime in the cluster. I have a server, X, I do a rebuild with --preserve-ephemeral; it fails in the middle (while in prov network). That machine is now dead-to-me until I can get ops teams to fix the failure. If, instead, Ironic had predicted that failure and refused to touch the node; sure the nova instance would be in ERROR but the workload | 18:00 |
JayF | would be untouched | 18:00 |
JayF | think about cases where you are tight on capacity and can't take a long downtime in a cluster | 18:00 |
JayF | I'll talk about use cases in the rfe more. | 18:01 |
TheJulia | so, you'd need to be able to ask neutron if it can noop on that switch/interface | 18:02 |
JayF | yeah I'm not sure what an implementation looks like for every interface -- but having the hook is useful | 18:03 |
TheJulia | yeah, in ironic would mostly be fairly light touches, but that network part is the real conundrum | 18:03 |
JayF | for instance, in some future world with a non-neutron network driver, we might be able to ask better questions, even (does the smartnic have desired_vlan trunked?) | 18:03 |
TheJulia | ++ | 18:03 |
TheJulia | or switch, etc | 18:03 |
TheJulia | but yeah | 18:04 |
JayF | I do not want to say, we shouldn't add it to the interface b/c we might not be able to do a great job of it on that interface | 18:04 |
TheJulia | And that is not why I'm asking the question, I'm trying to understand under what case are you blocked from being able to perform basic actions to recover | 18:04 |
JayF | meaning I'd want to implement it across all interfaces (like validate), but it's going to naturally be more useful for some than others | 18:04 |
TheJulia | yeah | 18:05 |
JayF | I really think once I look at this tuesday | 18:05 |
JayF | I might shape it more like "validate, except you have time to do more stuff" | 18:06 |
JayF | which could make it useful for enrollment situations, or for doing actual-validation of a fix after a node has undergone servicing (either by human, or by Ironic I guess) | 18:06 |
JayF | I don't think this would actually work out in practice; but it'd be neat if we could also communicate via API the confidence level | 18:07 |
TheJulia | well, the actual validation today is just "can we get the power status" | 18:07 |
TheJulia | I do like the idea of giving that feedback | 18:08 |
JayF | heh, maybe even redfish super-validate is running the DMTF validation against it | 18:09 |
JayF | "It'll give you a medium Ironic experience" | 18:09 |
TheJulia | heh | 18:12 |
* TheJulia goes back to the gigabytes of logs a customer has shipped me | 18:12 | |
NobodyCam | Good afternoon Ironic folk | 19:37 |
TheJulia | good afternoon! | 19:37 |
TheJulia | it is... a bit... warm here today | 20:31 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded https://review.opendev.org/c/openstack/ironic-python-agent/+/887729 | 20:47 |
TheJulia | stevebaker[m]: ^ | 20:47 |
TheJulia | INFO ironic_python_agent.extensions.standby [-] Image streamed onto device /dev/vda in 201.56154799461365 seconds for 2958688256 bytes. Server originally reported 2958688256. | 22:47 |
TheJulia | woot | 22:47 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded https://review.opendev.org/c/openstack/ironic-python-agent/+/887729 | 23:19 |