| opendevreview | OpenStack Proposal Bot proposed openstack/ironic-ui master: Imported Translations from Zanata https://review.opendev.org/c/openstack/ironic-ui/+/972615 | 02:18 |
|---|---|---|
| TheJulia | … I think we need to check the upgrade stamping code to make sure we’re iterating through and stamping the table. At a hockey game and it hit me as a possible cause/source | 04:34 |
| TheJulia | cardoe: if you could dump an offending related row out of the database, if you're willing, that will set my paranoia to the side or make me think we have an upgrade bug | 04:36 |
| rpittau | good morning ironic! o/ | 07:55 |
| opendevreview | Merged openstack/ironic stable/2025.2: fix: bios fields could not be fetched via the API https://review.opendev.org/c/openstack/ironic/+/972602 | 09:18 |
| opendevreview | Merged openstack/bifrost master: Document IPA image download options https://review.opendev.org/c/openstack/bifrost/+/964145 | 09:19 |
| opendevreview | Jacob Anders proposed openstack/ironic master: Make post-firmware-update reboot conditional on component https://review.opendev.org/c/openstack/ironic/+/966344 | 09:33 |
| opendevreview | Jacob Anders proposed openstack/ironic master: Make post-firmware-update reboot conditional on component https://review.opendev.org/c/openstack/ironic/+/966344 | 10:37 |
| abongale | good morning o/ | 10:49 |
| Continuity | Morning all | 10:54 |
| | *** dmellado7 is now known as dmellado | 11:19 |
| | *** hroy_ is now known as hroy | 11:44 |
| TheJulia | good morning | 14:04 |
| TheJulia | so the online data migrations seem okay to me as they use the object list | 14:13 |
| TheJulia | in the release mapping | 14:14 |
| cardoe | getting into the DB | 14:20 |
| TheJulia | cardoe: much appreciated | 14:23 |
| TheJulia | Continuity: Regarding the networking reset question, I've put it on the radar of my downstream product owner because it seems like it's a feature we absolutely need moving forward. I next chat with them Monday around noon US Pacific time. I can't promise anyone will jump on it immediately, but the more I dig into VXLAN, the more I'm convinced it's a necessity to have. | 14:28 |
| Continuity | TheJulia: thanks, I appreciate it. I am going to *try* to get more involved this year with actually writing code and patching stuff. So if I can I will attempt to help out. | 14:29 |
| Continuity | I also want my org to get more involved, so we are scaling engineers as we speak | 14:30 |
| TheJulia | very cool | 14:30 |
| cardoe | https://www.irccloud.com/pastebin/hYIdU2Bx/ | 14:38 |
| cardoe | TheJulia: ^ | 14:38 |
| cardoe | bleh node_id is an int | 14:39 |
| TheJulia | so this was *purely* when it's going to see and try to hydrate an empty object? | 14:39 |
| TheJulia | oh | 14:39 |
| TheJulia | heh | 14:39 |
| TheJulia | yeah | 14:39 |
| TheJulia | that is right | 14:39 |
| cardoe | gah someone isn't on IRC for me to emote stabbing them..... they deleted the specific box I reported and used it for other testing. | 14:44 |
| cardoe | I'm honestly not sure which field would be the problem. | 14:44 |
| cardoe | I can dump everything from one node. But it's gonna unfortunately be different. | 14:48 |
| TheJulia | ugh | 14:48 |
| TheJulia | it sounds like on upgrade you were able to reproduce it immediately based upon the logs you mentioned? is there another node where it was observed? | 14:49 |
| TheJulia | as long as it hasn't been updated, it could help clarity wise | 14:49 |
| TheJulia | I really just want to make sure I'm not incorrectly interpreting what is going on | 14:50 |
| cardoe | okay found another one. | 14:52 |
| TheJulia | is it the magical unicorn database record we could hope for?! | 14:55 |
| cardoe | https://gist.github.com/cardoe/3834eb3c7e55a2a323d22b082cccc218 | 14:56 |
| cardoe | It's even better cause that's a totally plain box that doesn't look like anyone's configured it before. | 14:56 |
| TheJulia | But all at 1.1 | 14:59 |
| TheJulia | wtaf | 15:00 |
| cardoe | So that DB is older but not too terribly old. | 15:01 |
| cardoe | As far as I can tell from logs it was deployed with 2024.2 originally. | 15:02 |
| cardoe | Maybe I missed something in the db upgrade? | 15:02 |
| cardoe | OpenStack Helm runs ironic-dbsync upgrade | 15:02 |
| TheJulia | does it not use online-data-migrations ? | 15:03 |
| TheJulia | it's a two-step process, upgrade creates the schema, but it doesn't roll/update versions/fields | 15:04 |
| TheJulia | granted, you're at 1.1 on that record, which is still weird if it's reproducing | 15:04 |
| | * TheJulia smells the additional fresh coffee | 15:04 |
| dtantsur | huh, is cotyledon broken on python 3.13? | 15:11 |
| cardoe | Yeah so that's the issue. | 15:11 |
| cardoe | OpenStack Helm isn't running online-data-migrations | 15:11 |
| cardoe | We do that ourselves as a follow on job. | 15:12 |
| cardoe | But it missed on this DB. I just ran it and everything is 1.2 | 15:12 |
| TheJulia | dtantsur: ... I think I was on 3.13 when I was doing the work originally last summer, did it break and if so how? | 15:14 |
| TheJulia | cardoe: okay, that is starting to make more sense then | 15:14 |
| dtantsur | TheJulia: sorry, it was 3.14, I'm just stupid and cannot type | 15:14 |
| TheJulia | OH! | 15:14 |
| TheJulia | okay | 15:14 |
| dtantsur | TheJulia: https://paste.opendev.org/show/b3U0gV1ZC5KFd7lvJGum/ | 15:15 |
| TheJulia | I've not actually tried starting ironic with 3.14 yet, considering not everything in our code base is ready and the project as a whole hasn't moved devstack to a 3.14-supporting state, as 3.13 is the version it currently supports | 15:15 |
| dtantsur | yeah, it's just the default in Fedora | 15:15 |
| TheJulia | That being said -> I was likely going to hit it next week if my hope for plans next week don't get quashed | 15:15 |
| dtantsur | Now, let's try to collect sensors from 3500 nodes using Ironic with 4000 conductor workers in a synchronous fashion. I'm sure nothing will go wrong (if I even manage to enroll all this...). | 15:17 |
| TheJulia | ewww | 15:17 |
| TheJulia | yeah | 15:17 |
| TheJulia | eww with such a config/state | 15:17 |
| TheJulia | the first eww was the backtrace | 15:18 |
| dtantsur | yeah | 15:18 |
| dtantsur | On the other hand, it will be quite silly to go ahead with an asynchronous implementation without being sure that native threads won't work. | 15:18 |
| TheJulia | Well, step 0 is likely giving such a spin on 3.13 | 15:20 |
| TheJulia | If the computer melts, I'm sorry. :\ :) | 15:20 |
| dtantsur | You never know until you do :D | 15:21 |
| cardoe | TheJulia: yeah so now everything is good... It was 1.1 cause you added the RPC stuff which bumped it to 1.2 in 2025.2 | 15:26 |
| TheJulia | Yup, and we expect online migrations to run as well, so Joy! | 15:27 |
| TheJulia | okay, I'm sorry about that, I feel kind of bad, but Helm *not* running the migrations is problematic and has me wondering if *ironic* itself needs to attempt to run record upgrades on its own | 15:27 |
| TheJulia | (I was at a shop where we did that like 14 years ago, so the fact we have a command makes me want to scream) | 15:28 |
| cardoe | My patch isn't wrong though cause the version convert code was wrong. | 15:32 |
| TheJulia | correct | 15:34 |
| TheJulia | but that was why I was really soft of freaking out because a happy state upgraded environment *shouldn't* be unhappy | 15:35 |
| TheJulia | sort, not soft | 15:35 |
| TheJulia | brraaain | 15:35 |
| cardoe | ah yeah makes sense | 15:36 |
| cardoe | https://bugs.launchpad.net/nova/+bug/2137673 (and I've written a patch which they're hopefully okay with) is probably just ironic specific since we're the only ones that actually do something in that call. But that should be another reason why the error messages will suck less with nova. | 15:38 |
| TheJulia | Yeah, I noticed and I hope they do accept it | 15:39 |
| TheJulia | they have been sort of resistant to such in the past but times change | 15:39 |
| dtantsur | dear lord, why does sushy need to log everything it touches... | 15:48 |
| dtantsur | Okay, this is interesting. Our thread pool implementation hardly goes beyond 400 threads even if 4000 is possible. I'm curious why. | 15:58 |
| dtantsur | Hmm, it's possible that idle thread detection in futurist is utter bullshit | 16:02 |
| TheJulia | That, right there | 16:07 |
| dtantsur | The more I'm looking at futurist logs, the more insane I'm getting here | 16:07 |
| TheJulia | Also, just because you have the thread configured doesn't mean it is actually needed/used | 16:08 |
| TheJulia | and can get nuked before. The workload itself needs to push it | 16:08 |
| dtantsur | Oh, it's very needed, I'm running sensor data with worker count == node count | 16:08 |
| TheJulia | we aggressively try to keep that pool from sitting idle | 16:08 |
| TheJulia | oh my | 16:08 |
| dtantsur | make this make sense: https://paste.opendev.org/show/b2v2pMF8oXkcJkAgWjoj/ | 16:09 |
| TheJulia | I'd add some extra logging just to make sure we're actually trying to launch that many | 16:09 |
| dtantsur | how can queue size jump between 21 and 1 for many-many iterations here? | 16:09 |
| dtantsur | I'm looking at pages of this back-and-forth | 16:09 |
| TheJulia | Locally I added quite a bit of additional logging into futurist to keep me from going too crazy | 16:10 |
| TheJulia | you may want to do the same as you ride the insane bus | 16:10 |
| dtantsur | I may indeed, it's just not clear what to log | 16:11 |
| TheJulia | _maybe_spin_up, the method which calls it, and the call which tosses more threads on the pile | 16:11 |
| dtantsur | Right, let's also log when no changes are needed.. | 16:13 |
| TheJulia | yup | 16:13 |
| | * dtantsur needs a logging level below debug (like rust-log has trace) | 16:13 |
| TheJulia | yeah, pretty much | 16:13 |
| dtantsur | One thing I can say for sure: in its current state Ironic is not capable of handling sensors for any large number of nodes at all quickly | 16:18 |
| opendevreview | Dmitry Tantsur proposed openstack/ironic-specs master: [WIP] Asynchronous sensor data collection https://review.opendev.org/c/openstack/ironic-specs/+/972754 | 16:28 |
| dtantsur | This is what I'm working on btw ^^^^ | 16:28 |
| opendevreview | Dmitry Tantsur proposed openstack/ironic-specs master: [WIP] Asynchronous sensor data collection https://review.opendev.org/c/openstack/ironic-specs/+/972754 | 16:28 |
| JayF | I filed an RFE: https://bugs.launchpad.net/ironic/+bug/2137729 -- basically bootc deploy_interface switching, except for ramdisk driver | 16:49 |
| dtantsur | JayF: note that you may also need Nova changes to populate different instance_info fields | 16:50 |
| JayF | I'm 99% sure I pass on all image metadata | 16:50 |
| JayF | checking | 16:50 |
| dtantsur | I mean, setting boot_iso instead of image_source, etc | 16:51 |
| JayF | I'm thinking we'd own that in the auto swapper | 16:51 |
| dtantsur | ah, interesting | 16:51 |
| JayF | I want to avoid adding code to the nova driver for this pattern as possible | 16:51 |
| JayF | *if possible | 16:51 |
| JayF | because it enables us to do more interesting things | 16:51 |
| JayF | We don't pass on full image metadata in the virt driver patcher; only really image_source; https://opendev.org/openstack/nova/src/commit/4b90fdf9af8de53fb6536d27b8fc654a0c011e2f/nova/virt/ironic/patcher.py#L61 | 16:55 |
| JayF | Which means we must be roundtripping glance ourselves for metadata; which is good news for not having to modify nova (if we're OK to fudge the instance_info ourselves at deploy time) | 16:55 |
| TheJulia | dtantsur: entirely do-able with just the url and the data lookup, fwiw. | 17:00 |
| TheJulia | if it's an OCI url | 17:00 |
| TheJulia | oh | 17:00 |
| TheJulia | for ramdisk, disregard me | 17:00 |
| TheJulia | anyway, it should be entirely doable with image metadata and I bet nova would prefer that be the case of use | 17:07 |
| JayF | I'm chatting in #openstack-glance | 17:07 |
| JayF | I'd prefer strongly that we keep the logic in Ironic as much as possible | 17:07 |
| TheJulia | JayF: also, keeping additional code out of nova allows for a direct api consumer to request a machine to be deployed without modifying interfaces manually | 17:07 |
| JayF | yes | 17:08 |
| JayF | yes yes yes | 17:08 |
| TheJulia | agree | 17:08 |
| JayF | I think we can do this with 0 Nova and Glance code changes | 17:08 |
| JayF | except a glance doc update as a follow-on | 17:08 |
| JayF | after looking at it all and chatting in #-glance | 17:08 |
| TheJulia | That is entirely what I would expect | 17:09 |
| JayF | Me too, but I'm used to having my expectations not met :P | 17:09 |
| JayF | this is a nice surprise | 17:09 |
| opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: WIP: l2vni plug case with Cisco NXOS https://review.opendev.org/c/openstack/networking-generic-switch/+/968377 | 17:18 |
| opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: Update hacking to 7.0.0 https://review.opendev.org/c/openstack/networking-generic-switch/+/972762 | 17:18 |
| opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: WIP: Arista EOS and vendor neutral SONiC support for VXLAN attachments https://review.opendev.org/c/openstack/networking-generic-switch/+/972763 | 17:18 |
| opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: WIP: VXLAN: Add Junos, Cumulus NVUE, and denote Dell OS10 as unsupported https://review.opendev.org/c/openstack/networking-generic-switch/+/972764 | 17:18 |
| opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: WIP: OVS testing patch for 'vxlan' binding model https://review.opendev.org/c/openstack/networking-generic-switch/+/972765 | 17:18 |
| opendevreview | Julia Kreger proposed openstack/ironic-specs master: VXLAN networking https://review.opendev.org/c/openstack/ironic-specs/+/959401 | 19:39 |
| TheJulia | cardoe and anyone else interested in vxlan ^^^ - Now simpler! | 19:39 |
| cardoe | That’s my intended afternoon | 19:40 |
| TheJulia | very cool, thanks | 19:54 |
| cardoe | TheJulia: so your create_network_postcommit that's gonna create the VNI... which switch is it creating the VNI on? | 21:53 |
| TheJulia | so... | 21:53 |
| TheJulia | I think that should actually be the port bind now | 21:53 |
| cardoe | So bind_port() shouldn't touch the switches | 21:53 |
| cardoe | bind_port() makes the decision | 21:54 |
| TheJulia | I mean, the post bind postcommit | 21:54 |
| TheJulia | sorry, juggling a couple conversations here | 21:54 |
| cardoe | I can take the backburner | 21:54 |
| cardoe | So update_port_postcommit() is where we do it. | 21:55 |
| cardoe | That's how my initial patch worked. | 21:55 |
| TheJulia | yeah | 21:55 |
| TheJulia | and I think that is actually correct | 21:55 |
| cardoe | if it's a vxlan segment, call add_vxlan_network(), if it's a vlan segment... do the existing stuff | 21:56 |
| TheJulia | I just -1'ed the spec with that note, fwiw | 21:56 |
| TheJulia | I think we should do it as part of port add/delete, that way we're also not adding the VNI to switches which don't need it | 21:57 |
| TheJulia | with some hindsight from playing with patches, I think your approach was actually the best approach in regards to the ml2 interaction | 21:58 |
| TheJulia | The NGS way of VLANs poisons the thoughts though | 21:58 |
| TheJulia | so we need to disambiguate a little bit | 21:58 |
| TheJulia | And that's okay | 21:58 |
| cardoe | only other nits are that in some places you've used bold or italics so far. the hierarchical port binding is dense but I don't have any criticisms | 21:58 |
| cardoe | I can update the style stuff if you're okay with that. I downloaded it locally and built the HTML and read that version. | 21:58 |
| cardoe | I've also got a solution to detect when the VNI is no longer needed on that switch as well. | 21:59 |
| cardoe | At least it's written on my whiteboard on my wall after tracing through the neutron code. | 21:59 |
| TheJulia | cardoe: oh, by all means, but do take a glance at the ngs patch I posted | 21:59 |
| TheJulia | I've not given it a spin yet, but keeping that with port binding tightly couples it | 22:00 |
| cardoe | It'd be infinitely easier to accomplish if I could figure out what to tell Nidhi to change in https://review.opendev.org/c/openstack/python-openstackclient/+/963947 so that the patch that includes if the network is l2 or not comes back to Ironic | 22:01 |
| cardoe | Actually no. It only matters for the trunk plugin support in Ironic | 22:02 |
| TheJulia | Well, we can kind of *know* it based upon everything else | 22:02 |
| TheJulia | yeah | 22:02 |
| TheJulia | There is a dividing line of logical support there, and yes it is weird | 22:02 |
| cardoe | That's why I went and stole my kids dry erase board to sketch this all out. | 22:03 |
| TheJulia | The big thing I feel the spec is missing is a super clear understanding of the port binding on the router side, that feels like, if my understanding is correct, something we might be able to just sort of sort out with some more mech driver code | 22:03 |
| cardoe | I feel like that meme where the guy has all the red strings. | 22:03 |
| TheJulia | When are we getting Incognito, Inc. badges? | 22:03 |
| TheJulia | or, was it cognito | 22:04 |
| TheJulia | I don't remember | 22:04 |
| TheJulia | See: Inside Job | 22:04 |
| TheJulia | anyway, decoupling the concept of the network add/delete from the VXLAN VNI is entirely the right thing to do | 22:05 |
| TheJulia | *because* the bottom binding segment doesn't get created until we go to do the attachment | 22:05 |
| TheJulia | so if we just check/delete when no longer needed, we keep the switch relatively clean | 22:06 |
| cardoe | Yep | 22:06 |
| TheJulia | that matches the patches I got claude to whip up | 22:08 |
| TheJulia | If you want to rev the spec, you're welcome to, I can also do it. Great catch regarding the method and modeling for the port/network config | 22:09 |
| TheJulia | The big challenge right now is to not feel like the crazy person with the board with red yarn all over it | 22:09 |
| TheJulia | because I've managed to start to put together your perception and I get a lot of it now | 22:09 |
| cardoe | Yeah. | 22:09 |
| cardoe | I've struggled at how to convey it. | 22:09 |
| TheJulia | I'm a little vague on your details/context, but it is making a ton more sense now | 22:10 |
| cardoe | I'm happy to fill in. | 22:10 |
| cardoe | I'll write the trunk stuff down now. | 22:10 |
| TheJulia | That would be amazing | 22:10 |
| TheJulia | because I think that is where there is room to improve or at least best understand and if we can do *that* we might be able to get out of the router bit as it is related | 22:10 |
| TheJulia | ... hopefully :) | 22:10 |
| -opendevstatus- | NOTICE: Zuul will be shutdown for maintenance work. See https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/WBBLBI6ZS6FA6Q5ZMH4C2MWPL3WG3H24/ for more details. | 23:43 |
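
The port-binding approach cardoe and TheJulia converge on in the 21:53 to 22:06 exchange (configure the VNI when a port on a VXLAN segment is bound, keep the existing path for VLAN segments, and remove the VNI once no port on that switch needs it) can be sketched roughly as below. This is a minimal, self-contained illustration and not networking-generic-switch code: `FakeSwitch`, `VniTracker`, `del_vxlan_network()` and `plug_port_to_vlan()` are hypothetical names invented for the sketch. Only `add_vxlan_network()` is named in the discussion, and its signature here is an assumption.

```python
class FakeSwitch:
    """Hypothetical stand-in for a switch driver; it only records calls."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def add_vxlan_network(self, vni):
        # Name taken from the discussion; the signature is assumed.
        self.calls.append(("add_vxlan_network", vni))

    def del_vxlan_network(self, vni):
        # Hypothetical counterpart used for cleanup.
        self.calls.append(("del_vxlan_network", vni))

    def plug_port_to_vlan(self, port_id, vlan_id):
        # Hypothetical placeholder for "the existing stuff" on VLAN segments.
        self.calls.append(("plug_port_to_vlan", port_id, vlan_id))


class VniTracker:
    """Tracks which bound ports still need a given VNI on a given switch."""

    def __init__(self):
        self._ports = {}  # (switch name, vni) -> set of bound port ids

    def port_bound(self, switch, port_id, segment):
        seg_id = segment["segmentation_id"]
        if segment["network_type"] == "vxlan":
            key = (switch.name, seg_id)
            if key not in self._ports:
                # First port for this VNI on this switch: configure it now.
                switch.add_vxlan_network(seg_id)
                self._ports[key] = set()
            self._ports[key].add(port_id)
        elif segment["network_type"] == "vlan":
            switch.plug_port_to_vlan(port_id, seg_id)

    def port_unbound(self, switch, port_id, segment):
        if segment["network_type"] != "vxlan":
            return
        key = (switch.name, segment["segmentation_id"])
        if key not in self._ports:
            return  # nothing was configured for this VNI on this switch
        self._ports[key].discard(port_id)
        if not self._ports[key]:
            # Last port gone: the VNI is no longer needed on this switch.
            del self._ports[key]
            switch.del_vxlan_network(segment["segmentation_id"])
```

Because the VNI is added on the first bind and removed on the last unbind, switches that never host a port for a network never receive its VNI, which is the "keep the switch relatively clean" property mentioned in the log. In a real ML2 mechanism driver the bookkeeping would presumably come from Neutron's port bindings rather than an in-memory dict.
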
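
The root cause found in the 14:59 to 15:27 exchange was that the schema upgrade had run but the online data migrations had not, so records stayed at version 1.1 while the 2025.2 code expected 1.2. The idea of a separate, batched "bump stale records" step can be sketched as follows. This is a toy model under stated assumptions, not Ironic's migration code: the row layout, `find_stale()` and `migrate()` are hypothetical, and only the 1.1 and 1.2 version numbers come from the log.

```python
EXPECTED_VERSION = "1.2"  # version the newer release expects (from the log)


def version_tuple(version):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_stale(rows, expected=EXPECTED_VERSION):
    """Return the rows whose stored object version is older than expected."""
    target = version_tuple(expected)
    return [row for row in rows if version_tuple(row["version"]) < target]


def migrate(rows, expected=EXPECTED_VERSION, max_count=None):
    """Bump stale rows to the expected version, at most max_count per call.

    Returns (stale rows found, rows migrated in this call).
    """
    stale = find_stale(rows, expected)
    batch = stale if max_count is None else stale[:max_count]
    for row in batch:
        row["version"] = expected
    return len(stale), len(batch)


rows = [
    {"id": 1, "version": "1.1"},
    {"id": 2, "version": "1.2"},
    {"id": 3, "version": "1.1"},
]
found, migrated = migrate(rows, max_count=1)
print(f"stale={found} migrated={migrated}")
```

On a real deployment the two steps correspond to `ironic-dbsync upgrade` followed by `ironic-dbsync online_data_migrations`; a deployment tool that runs only the first leaves records in the state `find_stale()` models here.
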