adam-metal3 | hello Ironic o7 | 05:50 |
---|---|---|
adam-metal3 | Do you also experience inspection failures if you use an IPA that has this change? https://opendev.org/openstack/ironic-python-agent/commit/ab99f36baa1d89c7db1502be4ca36ebd91446a8c , I have built an IPA that has everyting up until the change before this and it works fine but if I include this very fresh commit I get an inspection failure. THe failure is a 400 error code from the inspector (conductor) | 05:57 |
adam-metal3 | that the inspection data format is not OK str is not None I will reproduce and give the you the exact error message soon | 05:57 |
adam-metal3 | There are new None return values around L2399, L2440 and L2490 of the hardware.py I suspect something funny there | 05:59 |
adam-metal3 | this is the error I am seeing if the mentioned commit is present in the code: https://pastebin.com/embed_js/ZKVyCfnh?theme=dark | 06:31 |
dtantsur | adam-metal3: we see issues in the metal3 job but not in our regular jobs | 07:50 |
dtantsur | mmm yeah, the commit probably changed what is returned on a missing IPMI address: None instead of something silly like 0.0.0.0 | 07:51 |
dtantsur | cid: ^^^ | 07:52 |
dtantsur | as much as I'd prefer to change to None, we might want to change back to whatever placeholder values we used to return | 07:52 |
dtantsur | (same for MAC's) | 07:52 |
adam-metal3 | dtantsur, it is a bit strange because in the respective Exception blocks there was a pssibilityto return None already, O_O so are those correct? because in here Ironoic as you have said always excepts string https://opendev.org/openstack/ironic/src/branch/master/ironic/api/controllers/v1/ramdisk.py#L288 | 07:55 |
dtantsur | I don't think it was actually possible to return None before, we always had a fallback value | 07:57 |
dtantsur | JayF: FYI the metal3 job failure could actually be a regression on our side | 07:59 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Fix inspection if bmc_address or bmc_v6address is None https://review.opendev.org/c/openstack/ironic/+/928885 | 08:03 |
dtantsur | adam-metal3: ^^^ | 08:03 |
cid | That section of the code is just weird. | 08:03 |
cid | It might just be a case where nothing really ever went wrong, because the earlier code returned both a default value and None too. | 08:03 |
cid | dtantsur: ty | 08:04 |
adam-metal3 | dtantsur: thx | 08:04 |
cid | I remember there was a weird error with also returning defaults. | 08:04 |
dtantsur | cid: I guess it rarely returned None in practice | 08:05 |
cid | Exactly | 08:05 |
cid | Probably the same for defaults too. But let's see what CI thinks. | 08:06 |
opendevreview | Verification of a change to openstack/ironic master failed: Set node "alive" when inspection finished https://review.opendev.org/c/openstack/ironic/+/927828 | 08:11 |
adam-metal3 | dtantsur: wdyt could this be backported to supported branches? because technically as I understand Ironic was expected to handle these None values by design and the so older Ironic release branches have become incomptible with the new IPAs. | 08:36 |
dtantsur | adam-metal3: I see no issues with backporting this patch | 08:53 |
adam-metal3 | dtantsur: nice, thanks should I do it? | 09:01 |
dtantsur | adam-metal3: I can request the backports once the master change merges | 09:48 |
adam-metal3 | dtantsur: sounds good, thanks! | 10:05 |
dtantsur | Could anyone who used to be excited above skip-level upgrade testing (JayF? TheJulia?) please fix it? https://zuul.opendev.org/t/openstack/builds?job_name=ironic-grenade-skip-level&project=openstack/ironic | 10:39 |
dtantsur | otherwise, I'll move the job to experimental to stop wasting CI resources. | 10:39 |
dtantsur | it only ever passes on 2024.1, apparently | 10:40 |
* dtantsur just realized bifrost fails on ubuntu-noble, send help! | 10:42 | |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: WIP: start working on Ubuntu 24.04 support https://review.opendev.org/c/openstack/bifrost/+/928895 | 10:45 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: CI: pin the benchmark job to ubuntu-jammy https://review.opendev.org/c/openstack/bifrost/+/928896 | 10:47 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Move the benchmark job to the experimental pipeline https://review.opendev.org/c/openstack/ironic/+/928897 | 10:54 |
dtantsur | We're hit pretty hard by something in Neutron's devstack plugin (see their channel) | 10:54 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: WIP: start working on Ubuntu 24.04 support https://review.opendev.org/c/openstack/bifrost/+/928895 | 10:59 |
JayF | dtantsur: we cannot remove that job for policy reasons. I'll have a look when I'm working | 12:40 |
dtantsur | JayF: there is no difference, policy or not, between a job that is neglected because it's experimental and a job that is neglected because it's non-voting and permafailing | 12:41 |
dtantsur | on top of that, the CI is in a very bad shape - see #openstack-neutron | 12:43 |
JayF | I just realized we might not need it for this one. Because this would be upgrading from 2023.2 to 2024.2 which is not supported | 13:12 |
dtantsur | I guess we don't, but it's good to keep it working | 13:12 |
JayF | Well I wouldn't be surprised if it's broken for actual reasons if someone did like a breaking migration | 13:14 |
JayF | Or due to machinations in things like requirements or similar shared config or resources | 13:14 |
TheJulia | dtantsur: I was never excited by it | 13:26 |
* TheJulia feels a bit rough today | 13:27 | |
masghar | Take care, Julia! | 14:17 |
masghar | Anybody looking at the metal3 intergration job? | 14:17 |
dtantsur | masghar: yep https://review.opendev.org/c/openstack/ironic/+/928885 | 14:19 |
masghar | Ah, I thought that might be it (saw inspection failing in the job logs I think) | 14:20 |
masghar | also, thanks! | 14:20 |
opendevreview | Stephen Finucane proposed openstack/ironic master: api: Introduce new mechanism for API versioning https://review.opendev.org/c/openstack/ironic/+/928919 | 14:25 |
opendevreview | Stephen Finucane proposed openstack/ironic master: WIP: api: Add schema validation framework https://review.opendev.org/c/openstack/ironic/+/928920 | 14:25 |
opendevreview | Stephen Finucane proposed openstack/ironic master: WIP: api: Add schema for allocations API https://review.opendev.org/c/openstack/ironic/+/928921 | 14:25 |
stephenfin | JayF: I was hoping to get tests passing on one API but ran out of time. However, that's a prototype for how you could do the JSONSchema integration that is very similar to how we're approaching it elsewhere ^ | 14:26 |
stephenfin | Not sure how far adamcarthur5 has gotten but it might be helpful | 14:27 |
JayF | Yeah this is what I figured those changes would look like, but I have significant concerns about upending our API in that manner without causing any regression | 14:35 |
JayF | Plus, even if cores agreed to a migration like that we would need to ensure it happened in a single release cycle. I don't want us to end up in a half-migrated limbo like what happened with the virt driver | 14:36 |
JayF | So I just kind of want us to be deliberate about how we approach it to avoid the regressions and make sure we can get it in one shot | 14:36 |
dtantsur | JFYI bifrost is also broken by the same regression as metal3 | 14:39 |
masghar | :its-fine-fire: | 14:39 |
dtantsur | probably everything running new inspection on virtual machines does not work | 14:39 |
masghar | Due to the same error as fixed in 928885? | 14:40 |
dtantsur | yep | 14:40 |
masghar | ok | 14:40 |
dtantsur | on the "bright" side, bifrost on ubuntu 24.04 is borken waaaaaay before inspection :D | 14:41 |
masghar | :its-fine-fire-2: | 14:41 |
dtantsur | emojis are not enough for this situation. animated gif support in irc when? | 14:42 |
masghar | Good question | 14:42 |
JayF | Honestly this result makes me wonder if we need to add a job on the IPA side or something to catch these things | 14:46 |
JayF | Although I do feel slightly good knowing that we at least resolved a bug through this process since it was always possible for that to return none in rare cases | 14:46 |
dtantsur | a voting job with inspection, yeah | 14:46 |
JayF | It might not be a bad idea at ptg to have a short session on how to troubleshoot that metal3 job | 14:47 |
JayF | Because I took a look at it and all I ended up seeing was something that looked like a leader election failed and combining that with the extremely long time out on failure I extrapolated (incorrectly) | 14:47 |
masghar | I didnt know where the logs went (but I didnt dig much) | 14:48 |
masghar | So I just popped in here and asked if someone was taking a look | 14:48 |
JayF | Realistically, given how many new people we have in the community who have less ops experience it might be smart to have a session about troubleshooting CI jobs in general | 14:48 |
masghar | I would be all eyes and ears, tbh | 14:49 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost master: WIP: start working on Ubuntu 24.04 support https://review.opendev.org/c/openstack/bifrost/+/928895 | 15:01 |
TheJulia | brraaaains | 15:23 |
opendevreview | Stephen Finucane proposed openstack/ironic master: WIP: api: Add schema validation framework https://review.opendev.org/c/openstack/ironic/+/928920 | 15:34 |
opendevreview | Stephen Finucane proposed openstack/ironic master: WIP: api: Add schema for allocations API https://review.opendev.org/c/openstack/ironic/+/928921 | 15:34 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Try limiting MTU to at least 1280 https://review.opendev.org/c/openstack/ironic/+/928931 | 15:38 |
JayF | the more I look at stephen's proposed changes, the more I like it tbh | 16:02 |
JayF | I'm just scared of breaking the API :D | 16:02 |
TheJulia | There is a required portion of fear, uncertinty, and doubt in all things. But then there is the cost to that... and yeah | 16:05 |
JayF | Honestly, it's one of those things where I think I thought the way we handled microversinos was fine | 16:08 |
JayF | until I had to explain it to two different junior devs (Boushra then C I D) | 16:08 |
JayF | and seeing a potentially clearer way was nice ... but the node object is going to be crazy hard | 16:09 |
JayF | btw; the missing context in this channel for this: if we did this migration, we get openapi for free | 16:09 |
JayF | and that means we get a lot of clients for free | 16:09 |
JayF | that's the real problem we're solving for | 16:10 |
dtantsur | weeelll... I'm highly doubtful about the "lot of clients" part | 16:15 |
dtantsur | but it will definitely help SDK authors | 16:15 |
JayF | I was told at least there's a rust SDK generator? | 16:17 |
JayF | I believe the exchange was | 16:17 |
JayF | adamcarthur5: "We can get a rust SDK for free with OpenAPI specs!" me: "We already do, dtantsur made it for us :D" ;) | 16:17 |
dtantsur | SDK folks have some hopes. From my experience, it's auxiliary best case. | 16:18 |
dtantsur | It's like with AI tools: you can generate *something*, you need to apply an engineer to generate *something good*. | 16:18 |
JayF | What are you talking about? LLM code generation is going to take all of our jobs, and they'll only need a middle manager for code review /s | 16:19 |
JayF | :D | 16:19 |
JayF | Makes sense; I am always skeptical of any code generation stuff generally | 16:19 |
JayF | but my opinion on making that API change is snowballing towards a "I'm probably for it" given the number of small benefits we pick up at each step | 16:20 |
dtantsur | it'll definitely save me from the mundane work of copying fields from the docs into rust | 16:21 |
dtantsur | JayF: ehhhhmmmm https://opendev.org/openstack/ironic-python-agent/src/branch/stable/2023.2/ironic_python_agent/hardware.py#L32 | 16:40 |
dtantsur | this looks like a problem to me | 16:40 |
JayF | đź‘€ | 16:40 |
dtantsur | I don't understand how it works upstream, in our downstream builds we now get oslo.config complaining about duplicate options | 16:41 |
JayF | dtantsur: those are safe calls | 16:41 |
JayF | dtantsur: oh | 16:41 |
dtantsur | oslo_config.cfg.DuplicateOptError: duplicate option: efi_system_partition_size | 16:42 |
JayF | https://review.opendev.org/c/openstack/ironic-python-agent/+/927978/3#message-cfacb46d3913a35e6e754e1ef68f654a89913705 it passed CI here | 16:42 |
JayF | I wonder if this is a ironic_lib version issue? | 16:42 |
JayF | one of us testing against the wrong version, perhaps? | 16:43 |
dtantsur | or oslo.config? | 16:43 |
TheJulia | dtantsur: your version is too old then | 16:43 |
TheJulia | dtantsur: they changed oslo.config at some point to ignore dupes | 16:43 |
TheJulia | code.. like.. wallaby age *does* throw an error | 16:43 |
dtantsur | TheJulia: we have python3-oslo-config-2:9.1.1-0.20230608145954.515daab | 16:46 |
dtantsur | which seems to match upper-constraints for 2023.1, for instance | 16:46 |
TheJulia | uhhhhh lovely | 16:47 |
TheJulia | I thought it changed behavior post wallaby, but didn't pin down the date | 16:47 |
dtantsur | well, somehow it works upstream | 16:47 |
dtantsur | I'll figure it out tomorrow, no brain power left | 16:49 |
TheJulia | ++ | 16:49 |
TheJulia | get some rest | 16:49 |
dtantsur | intending to lift some weights as a brain-clearing measure | 16:49 |
JayF | happy to patch upstream if it's broken, somehow, too, just don't see where it would be and TBH I don't think my digging in would be valuable as I am so satiated with thoughts of that code in multiple different versions already | 16:53 |
TheJulia | There *IS* definitely a point where oslo.config was very not friendly to duplicated options being defined due to delayed invocation | 16:58 |
JayF | I suspect this may be a weird SLURP/IPA interaction | 16:59 |
JayF | SLURP *services* have to work on 2023.1 dependencies | 16:59 |
JayF | we've never enforced that for IPA since it has to be co-installed | 16:59 |
JayF | (for stable/2023.2, they have to work on 2023.1 requirements AIUI) | 17:00 |
JayF | s/it has/it doesn't have/ | 17:03 |
JayF | I reversed the bool in that statement | 17:03 |
JayF | dtantsur: efi_system_partition_size is not in https://review.opendev.org/c/openstack/ironic-python-agent/+/927978/3/ironic_python_agent/config.py | 17:10 |
JayF | dtantsur: bad rebase, I'd guess | 17:10 |
opendevreview | cid proposed openstack/ironic-tempest-plugin master: Add tempest tests for runbooks https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/928958 | 17:26 |
TheJulia | it *could* also be the version of ironic-lib has the options setup for when the class is loaded as upposed to upfront registration | 17:34 |
TheJulia | so it could be just differences at that level?! | 17:34 |
JayF | I was having to be selective about what configs I pulled | 17:34 |
JayF | based on version of ironic_lib in that constraints | 17:34 |
JayF | because I intentionally didn't change non-image-related-code out of ironic-lib for backport safety | 17:35 |
TheJulia | well, a better question might be in the short term, is that option explicitly inside ironic-lib? | 17:35 |
cardoe | good morning ironic.... | 18:09 |
TheJulia | good morning cardoe | 18:10 |
TheJulia | How goes your morning? | 18:11 |
cardoe | Well its 1pm and I'm now looking at starting what I planned on doing first thing today. So instead I'm on IRC declaring bankruptcy on the day. | 18:11 |
JayF | when you hit that bankrupt, you gotta just grab that wheel and give it another spin | 18:13 |
cardoe | Looks like some CI fun today thus far for you guys. | 18:13 |
JayF | this time you'll hit that shiny $5,000 space, I know it | 18:13 |
TheJulia | cardoe: Have you watched the re-imagined battlestar galactica series? | 18:13 |
cardoe | There's a new one? | 18:13 |
TheJulia | 2005-ish? | 18:13 |
cardoe | omg. that's the re-imagined! | 18:14 |
cardoe | derp. yes. | 18:14 |
TheJulia | "as foretold in the sacred scrolls" | 18:14 |
TheJulia | ^^^ my favorite response to stuff like that | 18:15 |
cardoe | heh yes that is spot on | 18:15 |
cardoe | BSG as a whole sums up so many things in computing. | 18:17 |
cardoe | I was gonna say everything looks great. No patches sitting at +2/+1 | 18:19 |
cardoe | Then I look at Zuul with metal3 sitting at 3+ hours | 18:20 |
TheJulia | I think there is an ironic issue, and then a mtu issue related to CI it seems | 18:20 |
TheJulia | I'm focused on a backport to the days of train | 18:21 |
cardoe | ooo that'll be fun. | 18:21 |
* TheJulia gets out the archaeology tool to figure out where something went between Train and Wallaby | 18:22 | |
cardoe | I would have traded today though. | 18:22 |
cardoe | VP's wife was sick. Director is on jury duty. There was an all-hands and I had to fill in for slides I've never seen. My son forgot his lunch at home. So after that rodeo I took it up to school. I'm in the lobby waiting and massive police response shows up. Someone made a threat against the school online and we couldn't go anywhere. | 18:24 |
cardoe | I just wanted to work on my sushy patches! | 18:24 |
cardoe | I did add this item to the PTG "We would like to use inspection hooks or modify the node data based on inspection data coming from the redfish inspector" at the end of that's okay. | 18:26 |
cardoe | JayF: I see you talking about SRE types for Ironic. One the things I was wondering about was strengthing a story around metrics / monitoring. | 18:30 |
TheJulia | cardoe: I would highly recommend details of specifics about what one wishes to obtain | 18:33 |
TheJulia | so the overall discussion is appropriately framed | 18:34 |
cardoe | Okay added context. | 18:49 |
TheJulia | awesome | 18:54 |
TheJulia | wheeee: oslo_config.cfg.DuplicateOptError: duplicate option: image_convert_memory_limit | 18:54 |
JayF | cid and I are pairing on virtualpdu + pysnmp stuff | 19:13 |
JayF | just hit my first IRL clear heisenbug | 19:13 |
JayF | unit tests happily pass under debugger | 19:13 |
TheJulia | heisenbug? is that like a Schrödinger's unit test? | 19:18 |
* TheJulia tears up at the same number of unit tests failing | 19:20 | |
JayF | yeah basically the tests were passing in debug but not in full runs | 20:44 |
JayF | we nailed it down to basically threading.Lock(), and not just our use of it, the use of it in threading.Thread that is subclassed for the SNMPPDUHardness is causing issues | 20:44 |
JayF | we were able to fix the asyncio unhappiness, but without extracting all uses of threading (or changing them in a way I don't understand) it will deadlock eventually due to the locks not being event-loop-safe | 20:45 |
JayF | what I proposed to cid was, given pysnmp is BSD licensed, we try to import the asyncore version of the older APIs directly into virtualpdu, and migrate them to use pyasyncore | 20:45 |
JayF | because of the other thing I unearthed: pysnmp removed the low-level APIs we used, and AFAICT were the only way, to create an SNMP server around 6.4.3 release or so | 20:46 |
JayF | and they are up to 7.0.2 | 20:46 |
JayF | so I do not see a long-term happy path with pysnmp + virtualpdu | 20:46 |
JayF | but we might be able to keep it on life support for a while if we import only the bits of pysnmp we need into our project directly | 20:46 |
JayF | cid: ^ if you have anything to add, please do | 20:47 |
JayF | so tl;dr: 1) threading.Thread/threading.Lock incompatabilities with asyncio event loops (and modern pysnmp's use of em) is the immediate problem, but 2) pysnmp removed the interfaces we need for virtualpdu in recent releases, which makes solving #1 a bit of a moot point | 20:48 |
JayF | TheJulia: tl;dr basically https://en.wikipedia.org/wiki/Heisenbug > In computer programming jargon, a heisenbug is a software bug that seems to disappear or alter its behavior when one attempts to study it | 20:57 |
JayF | in this case, stepping thru tests 1-by-1 removed any thread-unsafety element :/ | 20:58 |
cid | JayF: ++, that's pretty much it. The path to a probable solution, or ... :D | 21:04 |
* cid I will be nipping out now | 21:04 | |
JayF | TheJulia: JFYI; your follow-up unmaintained/xena patch tested good in my devstack; I reproduced the original issue then verified the fix. Thank you for that, I'll spend the rest of the day patting CI on the head for them :D | 22:25 |
TheJulia | JayF: thank you so so so much for the repro and the verification | 22:26 |
JayF | we really do put the file down unmodified, matching checksum, if we don't convert | 22:26 |
JayF | plus those empty qcow test images convert from like, ~150k -> 4k | 22:26 |
JayF | it's pretty hilarious | 22:26 |
adamcarthur5 | stephenfin Thank you sharing those commits in Ironic. I haven't gotten much further than the discussions point, but I am going to look into https://review.opendev.org/c/openstack/codegenerator/+/928664 and try and scope exactly what still needs done. | 22:30 |
TheJulia | JayF: I'm not sure what you mean by by put the file down unmodified matching checksum, if we don't convert. | 22:31 |
JayF | I mean literally md5sum good-qcow (or whatever) on my http server ==== md5sum /var/lib/ironic/images/{blah} | 22:32 |
JayF | which made the validation pretty darn convincing | 22:32 |
TheJulia | oh, heh | 22:33 |
TheJulia | yeah | 22:33 |
JayF | bingo I think I might have just gotten yoga ci | 22:46 |
JayF | or at least now it's throwing over 9000 warnings instead of failing at install time :D | 22:46 |
adamcarthur5 | stephenfin (And anyone else interested) I added a few notes in the PTG etherpad https://etherpad.opendev.org/p/ironic-ptg-april-2024 w.r.t OpenAPI. let me know if you have questions / anything else you want me to put in there | 22:48 |
JayF | adamcarthur5: that's the last PTG etherpad | 22:49 |
JayF | adamcarthur5: you likely meant to update the oct one | 22:49 |
JayF | TheJulia: https://review.opendev.org/c/openstack/ironic/+/928974 [stable-only] Cap versions to allow CI to pass [NEW] | 22:51 |
JayF | you'll meed to pull the requirements change from your patch | 22:52 |
JayF | mine obviates it | 22:52 |
adamcarthur5 | yep.. thank you, here is the October one: https://etherpad.opendev.org/p/ironic-ptg-october-2024. I will add OpenAPI as a topic since it is not currently there | 22:53 |
JayF | I documented cid's and my findings in the etherpad entry about the SNMP driver, too | 23:08 |
cardoe | Yeah I wanna talk about the OpenAPI bits as well. | 23:50 |
cardoe | Like what’s the goal of the code generation. Cause IMHO keeping OpenAPI and generating server code that then doesn’t need modifications is hard. Then how do you know what’s a difference between modified code and altered generated code. | 23:52 |
cardoe | I personally prefer to write the server code and annotate it so that the server can generate the OpenAPI for you. | 23:52 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!