iurygregory | good morning Ironic | 10:59 |
TheJulia | good morning | 13:30 |
TheJulia | We should get https://review.opendev.org/c/openstack/ironic-inspector/+/895164/ sorted | 13:38 |
TheJulia | dtantsur: you around today? | 14:32 |
opendevreview | Jake Hutchinson proposed openstack/bifrost master: Bifrost NTP configuration https://review.opendev.org/c/openstack/bifrost/+/895691 | 14:37 |
ravlew | Good morning ironic | 14:43 |
ravlew | I'm getting an error in stable/yoga CI in bifrost-integration-redfish-vmedia-uefi-centos-8 | 14:44 |
ravlew | "/home/zuul/src/opendev.org/openstack/bifrost/scripts/collect-test-info.sh: line 95: /home/zuul/openrc: No such file or directory" | 14:44 |
dtantsur | TheJulia, I am indeed | 14:45 |
ravlew | could anyone help with that :) ? | 14:45 |
dtantsur | ravlew, that's probably not what makes the job fail, look before that | 14:45 |
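For readers following along: the openrc error above is the kind of non-fatal noise dtantsur means. A minimal sketch of how such a line could be made harmless in a log-collection script follows; the actual collect-test-info.sh contents and arguments are assumptions, not the real fix.

```bash
# Hypothetical guard (illustrative only, not the actual bifrost change):
# only source credentials if the file exists, so post-run log collection
# keeps going even when devstack never wrote an openrc.
if [[ -f "${HOME}/openrc" ]]; then
    # shellcheck disable=SC1091
    . "${HOME}/openrc" admin admin
fi
```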
TheJulia | dtantsur: ++ | 14:46 |
TheJulia | dtantsur: asking because of review feedback on 895164 | 14:46 |
dtantsur | yeah, I was going to get back to it, but we had some fire-fighting downstream | 14:47 |
JayF | Releases team put up PRs over the weekend requesting to cut stable/2023.2 branches by no later than Friday. | 14:47 |
JayF | I think we technically have a little more time than that, but I don't want us to overflow and cause them crunch time if we can help it | 14:47 |
JayF | So whatever the priorities are, we need to cut releases soon so we need to resolve that patch one way or another | 14:48 |
TheJulia | yeah, they will eventually just change devstack and we will have no choice if we don't do it before they force our hand | 14:48 |
* JayF notes he's not super keen on just ignoring inspector grenade failures and hiding them :/ | 14:48 | |
dtantsur | I'm by no means keen on that, just don't realistically have time/energy to debug it this week | 14:49 |
JayF | I am going to propose at the meeting we cut the release very soon and backport redfish firmware as a pseudo-FFE (we don't do FF, so FFE doesn't make sense) if it's done before 2023.2 release date | 14:49 |
dtantsur | makes sense to me | 14:49 |
JayF | dtantsur: TheJulia: Any hints on Inspector? | 14:49 |
JayF | I'm not super experienced with those jobs or the service in general, but I can try to fix the grenade job for a little bit today | 14:50 |
dtantsur | I don't believe the grenade failure is caused by the current Ironic work, but I can try double-checking | 14:50 |
JayF | Let me put it this way; it's my personal belief unless someone tells me otherwise that we have not manually tested upgrades of inspector | 14:50 |
JayF | so that means with the gate job not working we'd be applying liberal quantities of "hope" and "assumption" to that upgrade working which seems not-great for purposes of our users | 14:51 |
JayF | especially upstream end-users who may not have another layer of QA between them and a release | 14:51 |
TheJulia | blowing out on nova resource creation | 14:51 |
JayF | dnsmasq startup explodes, something else listening on 53 | 14:53 |
TheJulia | no, it is expecting cirros to be sitting around | 14:53 |
ravlew | thanks dtantsur I'll check it out | 14:53 |
TheJulia | and changes got made there in devstack at some point | 14:54 |
JayF | I think your brain is about a mile ahead of me rn :) | 14:54 |
JayF | makes sense that > 2023-09-14 15:19:07.382906 | controller | Sep 14 15:19:07 np0035253989 dnsmasq[57991]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use | 14:54 |
dtantsur | hmm, I thought I fixed cirros as part of https://review.opendev.org/c/openstack/ironic-inspector/+/895164 | 14:54 |
JayF | is OK because we appear to start it as devstack@ironic-inspector-dhcp | 14:55 |
TheJulia | oh, okay | 14:55 |
dtantsur | yeah, I'd assume it's fine | 14:55 |
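A quick way to confirm the "Address already in use" message is benign is to check which unit already owns port 53. A rough sketch, assuming a systemd-based devstack node (the unit name is taken from the log discussion above):

```bash
# See what is already listening on port 53 (TCP and UDP):
sudo ss -lntup | grep ':53 '
# Confirm the inspector-managed dnsmasq is the devstack unit JayF mentions:
systemctl status devstack@ironic-inspector-dhcp
```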
JayF | a lot of > RC_DIR: unbound variable # which I'm assuming is probably OK? | 14:56 |
TheJulia | so we need https://review.opendev.org/c/openstack/ironic-inspector/+/895164 to not post_fail basically | 14:57 |
TheJulia | and look at the inspector log to understand what is going on | 14:57 |
dtantsur | yeah, I'm looking at the previous run | 14:57 |
JayF | I'm reading the output from the original job right now, trying to get some kind of a baseline since this might be one of the first inspector grenade jobs I've looked at | 14:58 |
JayF | I'll resume after my morning meetings (in 2 minutes, then a chat with kubajj after) | 14:58 |
dtantsur | CMD "lshw -quiet -json" returned: 0 in 23.572s | 14:59 |
dtantsur | the ramdisk logs just stop at some point, interesting.. | 15:00 |
JayF | #startmeeting ironic | 15:00 |
opendevmeet | Meeting started Mon Sep 18 15:00:08 2023 UTC and is due to finish in 60 minutes. The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'ironic' | 15:00 |
JayF | Welcome to the Ironic team meeting! This meeting is held under OpenInfra Code of Conduct available at: https://openinfra.dev/legal/code-of-conduct | 15:00 |
JayF | Agenda is available at https://wiki.openstack.org/wiki/Meetings/Ironic | 15:00 |
dtantsur | o/ | 15:00 |
kubajj | o/ | 15:00 |
iurygregory | o/ | 15:00 |
JayF | #topic Announcements/Reminder | 15:00 |
JayF | #note Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash | 15:00 |
JayF | We will be listing patches later which we want landed before release, please ensure you look and help with those, too. | 15:00 |
JayF | #note PTG will take place virtually October 23-27, 2023! https://openinfra.dev/ptg/ | 15:01 |
JayF | #link https://etherpad.opendev.org/p/ironic-ptg-october-2023 | 15:01 |
JayF | After release activities are complete, I'll set up a short mailing list thread and maybe a sync call for us to pare down that list into items we think we should chat about, so please get your items in there. | 15:01 |
JayF | That's all our standing announcements; release is incoming but I made that a separate topic | 15:02 |
JayF | No action items outstanding; skipping agenda item. | 15:02 |
JayF | #topic Bobcat 2023.2 Release | 15:02 |
JayF | So essentially, my intention is to cut stable/2023.2 releases this week if possible. The sooner the better just because I don't want other teams waiting on us if we can help it. | 15:03 |
JayF | There are two items we've already identified as wanting to land before release: | 15:03 |
JayF | * Redfish Firmware Interface https://review.opendev.org/c/openstack/ironic/+/885425 | 15:03 |
JayF | * IPA support for Service Steps https://review.opendev.org/c/openstack/ironic-python-agent/+/890864 | 15:03 |
JayF | IPA Service Steps landed late Friday; so that's good news \o/ | 15:03 |
JayF | Redfish firmware I believe still needs a revision from iurygregory | 15:04 |
JayF | plus the inspector-related CI shenanigans talked about before the meeting | 15:04 |
JayF | Are there any other pending changes we'd like to ensure make Bobcat? | 15:04 |
iurygregory | I'm making changes and re-testing things to make sure they are working (still trying to figure out the DB part, almost there I think) | 15:04 |
JayF | If there are no other pending changes for Bobcat; I'd like to propose that we begin to cut releases (starting with !Ironic first, to give more time), and if we cut Ironic stable/2023.2 before iurygregory's changes land, that we permit them to be backported as long as they are completed before release finalizes (even though that may mean we can't cut a release with them until | 15:05 |
JayF | a week after or so) | 15:05 |
JayF | We generally don't practice FF here, so calling it an FFE doesn't make sense; but I don't want the whole release hanging on a change when it might hold up other bits of the stack. | 15:06 |
JayF | WDYT? | 15:06 |
TheJulia | other bits of the stack? | 15:06 |
JayF | like releases team work | 15:07 |
JayF | requirements et | 15:07 |
JayF | I don't want us to hold up any of the common work that has to happen | 15:07 |
TheJulia | well, service projects don't go into requirements | 15:07 |
JayF | I don't think it's crazy of me to suggest that we have the final Ironic release done a week before the marketing-deadline for said release? | 15:07 |
TheJulia | I don't either, but it feels like you're pushing for now() as opposed to in a few days | 15:08 |
JayF | TheJulia: I'm saying I'll go create PRs for stuff that's done now, and start walking down the list. We have dozens of these and I usually manually review the changes for the final release. | 15:08 |
TheJulia | we can absolutely cut the !ironic things and then cut ironic later in the week | 15:09 |
JayF | TheJulia: so I'm not just like, automating git sha readouts into yaml files | 15:09 |
JayF | I don't wanna get that process started too late, which is why I want to start now and have the freedom to do ironic e.g. Thurs or Friday | 15:09 |
TheJulia | I understand that | 15:09 |
JayF | less now() and more max(week) | 15:09 |
TheJulia | so what is the big deal then? let's do the needful and enable | 15:09 |
TheJulia | I'd say EOD Wednesday | 15:10 |
TheJulia | because release team won't push button on friday | 15:10 |
JayF | ++ that sounds pretty much exactly like what I had in mind | 15:10 |
JayF | wanted it done before Friday-Europe | 15:10 |
TheJulia | for Ironic unless we have full certainty that we can solve it early Thursday morning before release team disappears | 15:10 |
JayF | ++ | 15:10 |
JayF | #agreed Ironic projects will begin having stable/2023.2 releases cut. Projects with pending changes (Ironic + Inspector) have at least until Wednesday EOD to land them. | 15:11 |
JayF | Anything else related to 2023.2 release? | 15:12 |
JayF | #topic Review Ironic CI Status | 15:13 |
JayF | AFAICT, things look stable-ish. Just have that POST-FAILURE for Inspector grenade to figure out. | 15:13 |
dtantsur | I think the POSTFAILURE itself is less of a problem | 15:14 |
JayF | Yeah, the breakage is earlier/in our code which is different than a postfailure usually indicates | 15:14 |
JayF | but either way it's the only outstanding CI issue I'm aware of | 15:15 |
TheJulia | yeah, read timeouts against the api surface | 15:15 |
TheJulia | could entirely be environmental | 15:15 |
TheJulia | we just need more logs to confirm that or not | 15:16 |
JayF | we'll figure it out, if there's nothing else I'll move on so I can get back to helping with that :D | 15:16 |
JayF | #topic Branch Retirement to resolve zuul-config-errors | 15:16 |
JayF | I'm going to execute on this, probably today if inspector CI doesn't eat the day -> https://lists.openstack.org/pipermail/openstack-discuss/2023-August/034854.html | 15:16 |
JayF | take notice | 15:16 |
JayF | #topic RFE Review | 15:17 |
JayF | There was an RFE spotted earlier this week, Julia and I discussed in channel, I already tagged it as approved | 15:17 |
JayF | posting here for documentation/awareness | 15:17 |
JayF | #link https://bugs.launchpad.net/ironic/+bug/2034953 -- adding two fields to local_link_connection schema to allow physical switch integrations with OVS | 15:17 |
TheJulia | hjensas: you might find ^ interesting | 15:18 |
JayF | Basically it seems adding two fields to our local_link_connection gets us the win of supporting OVN-native switches | 15:18 |
JayF | which is a wonderful effort:value ratio | 15:18 |
dtantsur | Indeed. How popular are these? | 15:19 |
JayF | I learned they exist when I read bug 2034953 ;) | 15:19 |
dtantsur | Same :D | 15:19 |
JayF | two fields to get free support for something from neutron sounds great though | 15:20 |
JayF | and is the exact kind of good stuff we get from being stacked sometimes :D | 15:20 |
TheJulia | I don't know... There were some OVS-enabled OpenFlow switches years ago, whitebox sort of gear AIUI; I'm guessing this might just be an evolution | 15:20 |
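For reference, local_link_connection is supplied per baremetal port, and the RFE adds two extra keys alongside the existing ones. A hedged sketch of today's fields via the CLI (the placeholder values are illustrative, and the names of the two proposed keys are deliberately omitted since the schema was still under review):

```bash
# Existing local_link_connection keys on a baremetal port; bug 2034953
# proposes two additional keys alongside these for OVN-managed switches.
openstack baremetal port create 52:54:00:aa:bb:cc \
    --node <node-uuid> \
    --local-link-connection switch_id=00:2b:67:11:22:33 \
    --local-link-connection port_id=Ethernet1/10 \
    --local-link-connection switch_info=leaf-1
```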
JayF | I don't hear any objection; so I'm going to consider this one to remain approved. Probably a low-hanging fruit for someone to knock out (I have an MLH fellow starting in a couple of weeks, if we want to save this I can use it as an onboarding task) | 15:21 |
JayF | #topic Open Discussion | 15:21 |
JayF | Agenda is done; anything else | 15:21 |
JayF | Last chance? | 15:23 |
JayF | #endmeeting | 15:23 |
opendevmeet | Meeting ended Mon Sep 18 15:23:47 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:23 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.html | 15:23 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.txt | 15:23 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-09-18-15.00.log.html | 15:23 |
TheJulia | Has anyone seen "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.101.0 is smaller than minimum version 1.0." before? | 15:43 |
dtantsur | TheJulia, ansible modules? | 15:45 |
TheJulia | Yeah, I guess :\ | 15:46 |
TheJulia | I'm guessing we're getting too new ansible on stable branches | 15:46 |
dtantsur | yep, they have two branches, 2.0 is only compatible with the (future?) openstacksdk 1.0 | 15:46 |
TheJulia | yeah, that gets kicked out when metalsmith's deployment task triggers | 15:50 |
TheJulia | dtantsur: was that change somewhere in the zed -> 2023.1 timeframe? | 16:03 |
JayF | I believe so | 16:05 |
TheJulia | hmm so why is this breaking on zed then | 16:05 |
JayF | OSA versions not locked? | 16:06 |
JayF | OSDK version not locked? | 16:06 |
TheJulia | no OSA | 16:07 |
frickler | actually sdk >=0.99 should work for the openstack collection, but maybe they're being extra safe | 16:07 |
TheJulia | OSDK seems appropriate | 16:07 |
TheJulia | so yeah, it is unlocked ansible | 16:09 |
TheJulia | 2.15.3 on a backport which was released mid august | 16:09 |
TheJulia | and we use the host ansible | 16:10 |
TheJulia | My guess is the newer collection gets pulled in but with the older sdk, and then things explode | 16:21 |
TheJulia | would pinning back ansible be reasonable? | 16:21 |
TheJulia | I have no idea if that would work | 16:21 |
frickler | iiuc you'd need sdk < 0.99 for that | 16:40 |
TheJulia | so 0.101.0 is incompatible in general? | 16:46 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/zed: DNM: Test constrainting ansible version https://review.opendev.org/c/openstack/metalsmith/+/895703 | 16:51 |
jrosser | there is a description of the compatibility here https://galaxy.ansible.com/openstack/cloud | 16:57 |
frickler | 0.99.0 should work with the newer collection, but it seems ansible decided to place the bar at 1.0 instead, which is not unreasonable. and the choice of 0.99.0 was a bad decision in retrospect in which I do have some responsibility myself, so sorry for that | 17:00 |
TheJulia | except we have 0.101.0 in upper-constraints on zed | 18:01 |
TheJulia | *so* the inspector issue is basically we can't find the record in the db | 18:01 |
JayF | As emailed on the list; branch EOLs requested to clean up our zuul config errors | 18:11 |
frickler | \o/ | 18:12 |
JayF | TheJulia: can I help? I don't want to dupe effort if you're actively looking at anything | 18:12 |
JayF | TheJulia: it looks like tempest and grenade might be failing in similar ways? | 18:26 |
TheJulia | maybe, I'll check in a moment | 18:26 |
opendevreview | Julia Kreger proposed openstack/ironic-inspector master: DNM: Collect additional failure information https://review.opendev.org/c/openstack/ironic-inspector/+/895727 | 18:27 |
JayF | TheJulia: I'll note; the change in progress pins cirros to 0.6.2 and ironic pins to 0.6.1 | 18:29 |
JayF | unsure if related but it's suspicious | 18:29 |
TheJulia | https://review.opendev.org/c/openstack/metalsmith/+/895703/1/test-requirements.txt <-- seems to work for metalsmith \o/ | 18:29 |
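For context, the metalsmith change is essentially a one-line requirements pin. A sketch of the kind of constraint being tested follows, with the caveat that the exact bound is an assumption and may differ from what actually merged:

```
# test-requirements.txt (illustrative): keep ansible at a level whose bundled
# openstack.cloud collection still accepts the openstacksdk (<1.0) pinned in
# zed upper-constraints.
ansible<8  # assumption: ansible 8.x / ansible-core 2.15 pulls in openstack.cloud 2.x
```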
* JayF has been going the route of looking at CI-related commits in Ironic looking for things that could break inspector or need to be updated for inspector to work | 18:29 | |
JayF | \o/ | 18:29 |
TheJulia | the version pin doesn't matter in this case | 18:29 |
JayF | > 2023-09-18 15:42:34.321739 | controller | Details: Fault: {'code': 500, 'created': '2023-09-18T15:42:32Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 4d27add4-8a9d-4209-a4ea-746f58e86e8a.'}. Request ID of server operation performed before checking the server status | 18:30 |
JayF | req-92fe8a05-b6a1-4061-a152-fa5fed5e0646. | 18:30 |
JayF | TheJulia: I'll note: sharding is still mid-revert in Nova; I don't think this should impact inspector jobs but it looks like some of them are landed, some are not | 18:33 |
TheJulia | so grenade blows up because for some reason we don't seem to have record of the actual node | 18:33 |
JayF | https://review.opendev.org/c/openstack/nova/+/894946 is in the gate now | 18:33 |
TheJulia | I think that *might* actually be a break | 18:33 |
JayF | I'm looking at the tempest job | 18:34 |
JayF | trying to figure that piece out under the hope(?) it's related but simpler | 18:34 |
JayF | TheJulia: > Sep 18 15:41:43.807251 np0035284146 ironic-conductor[106327]: DEBUG ironic.drivers.modules.agent_client [req-9ed5070c-733d-49fc-86c7-a7d670fbde21 req-1809a5fb-b49d-46a4-b517-ebc6f7f575f3 None None] Status of agent commands for node dc7624b9-0cf9-4a24-ad4b-ddfa7a28039d: get_deploy_steps: result "{'deploy_steps': {'GenericHardwareManager': [{'step': | 18:36 |
JayF | 'erase_devices_metadata', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'apply_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'argsinfo': {'raid_config': {'description': 'The RAID configuration to apply.', 'required': True}, 'delete_existing': {'description': "Setting this to 'True' indicates to delete existing | 18:36 |
JayF | RAID configuration prior to creating the new configuration. Default value is 'True'.", 'required': False}}}, {'step': 'write_image', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False}, {'step': 'inject_files', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'argsinfo': {'files': {'description': "Files to inject, a list of file structures with | 18:36 |
JayF | keys: 'path' (path to the file), 'partition' (partition specifier), 'content' (base64 encoded string), 'mode' (new file mode) and 'dirmode' (mode for the leaf directory, if created). Merged with the values from node.properties[inject_files].", 'required': False}, 'verify_ca': {'description': 'Whether to verify TLS certificates. Global agent options are used by default.', | 18:36 |
JayF | 'required': False}}}]}, 'hardware_manager_version': {'generic_hardware_manager': '1.2'}}", error "None"; execute_deploy_step: result "{'deploy_result': {'result': 'prepare_image: image (23902a7f-65c7-4e87-80de-6a993829398a) written to device /dev/vda root_uuid=2d360c20-c65e-4d5c-b8ce-c9196635667a'}, 'deploy_step': {'interface': 'deploy', 'step': 'write_image', 'args': | 18:36 |
JayF | {'image_info': {'id': '23902a7f-65c7-4e87-80de-6a993829398a', 'urls': ['https://173.231.255.172:8080/v1/AUTH_03382fe70b254cb88c398793986c28b3/glance/23902a7f-65c7-4e87-80de-6a993829398a?temp_url_sig=5366935da2ec9e8511e69501943eb08decccd113986c0c196665520b26cb9448&temp_url_expires=1695054998'], 'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, | 18:36 |
JayF | 'checksum': 'b0d4ad249188e59ffbdfa5e10054fea0', 'os_hash_algo': 'sha512', 'os_hash_value': '1913006d9f40852f615542d15ae1eda2bf5fe1b067940ea8b02bdd4c6e2e5f04b077462f003c4ce436b7f6dc8720b80effcc54d6cef3425f2792b4dd3e8c2aa9', 'node_uuid': 'dc7624b9-0cf9-4a24-ad4b-ddfa7a28039d', 'kernel': None, 'ramdisk': None, 'root_gb': '4', 'root_mb': 4096, 'swap_mb': 0, 'ephemeral_mb': 0, | 18:36 |
JayF | 'ephemeral_format': None, 'configdrive': '***', 'preserve_ephemeral': False, 'image_type': 'partition', 'deploy_boot_mode': 'bios', 'boot_option': 'local'}, 'configdrive': '***'}}}", error "None"; get_partition_uuids: result "{'partitions': {'configdrive': '***', 'root': '/dev/vda2'}, 'root uuid': '2d360c20-c65e-4d5c-b8ce-c9196635667a', 'efi system partition uuid': None}", | 18:36 |
JayF | error "None"; install_bootloader: result "None", error "{'type': 'CommandExecutionError', 'code': 500, 'message': 'Command execution failed', 'details': 'Installing GRUB2 boot loader to device /dev/vda failed with Unexpected error while running command.\nCommand: chroot /tmp/tmpl2skmrue /bin/sh -c "mount -a -t vfat"\nExit code: 127\nStdout: \'\'\nStderr: "chroot: failed to | 18:36 |
JayF | run command \'/bin/sh\': No such file or directory\\n".'}" {{(pid=106327) get_commands_status /opt/stack/ironic/ironic/drivers/modules/agent_client.py:346}} | 18:36 |
TheJulia | dude... | 18:36 |
JayF | wow I had no idea that was that long | 18:36 |
TheJulia | seriously, paste | 18:36 |
JayF | putting it in a pastebin | 18:36 |
JayF | sorry about that | 18:36 |
JayF | my client warns me about >1 line | 18:37 |
JayF | apparently it doesn't warn about 1 line that wraps to 10 lines | 18:37 |
JayF | so I wasn't careful :( | 18:37 |
JayF | https://gist.github.com/jayofdoom/225fafa59106f71f085701a7b3c0c16f | 18:37 |
JayF | that looks like the provisioning failure that is breaking tempest | 18:37 |
JayF | IDK if it's related to the grenade fail, but there it is | 18:38 |
JayF | I'll note that I wonder how https://github.com/openstack/ironic-inspector/blob/master/devstack/plugin.sh#L150 interacts with https://github.com/openstack/ironic/blob/88fd22de796b8b936287ee0e39fed6a0bcf3b604/devstack/lib/ironic#L3351 in terms of ordering | 18:42 |
TheJulia | so looks like ironic/inspector on the non-standalone tempest job self-aborts the inspection | 18:43 |
JayF | due to the error I spammed across IRC, yeah? | 18:43 |
JayF | well, that's happening during imaging | 18:44 |
TheJulia | uhhh shouldn't be anywhere near that | 18:44 |
JayF | but it implies an environmental problem that could be similarly impacting | 18:44 |
TheJulia | that error you pasted is typical cirros | 18:44 |
JayF | oh, really? | 18:44 |
TheJulia | cirros has no actual bootloader or contents | 18:44 |
TheJulia | so bootloader deployment will *always* fail | 18:44 |
TheJulia | unless it is made to be present there | 18:45 |
JayF | It looks like that failure piped all the way back thru to Ironic, which is why I thought it was meaningful | 18:45 |
JayF | but I trust you have more context on this than I do | 18:45 |
JayF | is there value in us getting on a call or something? maybe make sure we're on the same wavelength? | 18:45 |
TheJulia | I'm not sure we are | 18:45 |
TheJulia | give me a few to keep digging | 18:45 |
JayF | I am almost 10000% certain we aren't :D | 18:45 |
TheJulia | okay, different node than I was looking at for the failing test, I was looking at the other test | 18:46 |
TheJulia | oh, so in the non-standalone one, it is deploying a node | 18:47 |
JayF | I'll note there's a power control failure in https://1e3a584a444c8ece91d9-a7e38d5d296143dfa7c720fa849f5cad.ssl.cf5.rackcdn.com/895164/2/experimental/ironic-inspector-tempest-managed-non-standalone/a0836f7/controller/logs/screen-ir-cond.txt too | 18:48 |
TheJulia | https://paste.opendev.org/show/blwPDRI2U06pUgQ3zxZj/ right ? | 18:50 |
JayF | that's the original error I spammed into the channel | 18:50 |
JayF | 15:42:08.165981 | 18:50 |
JayF | is the power control failure | 18:50 |
TheJulia | so yeah | 18:51 |
TheJulia | the failure I'm seeing is we're trying to deploy a cirros partition image | 18:51 |
TheJulia | which *is* empty by default | 18:51 |
TheJulia | and thus the CI job fails | 18:51 |
JayF | that matches what I saw, too | 18:51 |
JayF | which leads to the question: how the hell did that ever work? did older cirros not have empty there? | 18:52 |
TheJulia | https://github.com/openstack/ironic/blob/master/devstack/tools/ironic/scripts/cirros-partition.sh | 18:55 |
opendevreview | Merged openstack/ironic-prometheus-exporter master: CI: Remove ubuntu focal job https://review.opendev.org/c/openstack/ironic-prometheus-exporter/+/894016 | 18:55 |
opendevreview | Merged openstack/ironic-prometheus-exporter master: tox: Remove basepython https://review.opendev.org/c/openstack/ironic-prometheus-exporter/+/890314 | 18:55 |
JayF | ooh | 18:55 |
TheJulia | so lets see | 18:57 |
TheJulia | that got uploaded as cirros-0.6.1-x86_64-partition | 18:57 |
TheJulia | which is the image | 18:58 |
TheJulia | so... only guess I have is the packing fails | 18:58 |
JayF | again I note, we're pinning to 0.6.2 in inspector in that change | 18:59 |
TheJulia | that independently grabs cirros | 19:00 |
JayF | and it uploads a full disk 0.6.2 | 19:00 |
JayF | yeah I see | 19:00 |
JayF | damn | 19:01 |
JayF | aha | 19:01 |
TheJulia | That, I don't think is anything related to ironic changes with inspector merge | 19:01 |
JayF | that is where RC_DIR is unbound errors pop out | 19:01 |
TheJulia | the tempest job, sure looks like it is failing in super unexpected ways | 19:01 |
JayF | which I suspect is also in passing Ironic jobs; I'll go look at one and check for the same /me verifies | 19:01 |
TheJulia | err | 19:01 |
TheJulia | grenade tempest job | 19:01 |
JayF | yep, angry RC_DIR errors in passing Ironic jobs, so likely unrelated | 19:03 |
JayF | I'm going to change lanes to the grenade job | 19:04 |
JayF | but tempest being so broken implies to me there might be a base level environmental thing going on? IDK | 19:04 |
TheJulia | you'd need to hold a node to check at this point | 19:05 |
TheJulia | because fundamentally, it looks like we're getting a bogus disk image | 19:06 |
JayF | TheJulia: so that script, I believe, is pulling in 0.6.2 even though it's labelled 0.6.1 https://github.com/openstack/ironic/blob/master/devstack/tools/ironic/scripts/cirros-partition.sh respects CIRROS_VERSION | 19:07 |
JayF | TheJulia: so I think the most obvious change to move forward with is s/0.6.2/0.6.1/g in that pin, just to eliminate a variable | 19:08 |
TheJulia | the logs I have say 0.6.1 | 19:08 |
JayF | the name is set to 0.6.1 | 19:08 |
JayF | regardless of what the script uses | 19:08 |
JayF | based on my reading of the devstack logs, the partition script, and the outputs/inputs | 19:08 |
TheJulia | gets worse, 0.6.2 and 0.6.1 | 19:09 |
TheJulia | what a mess | 19:09 |
JayF | the output name is passed directly to cirros-partition.sh so we have it wired up to lie to us | 19:09 |
JayF | let's make those aligned and see if a clearer error pops out | 19:09 |
JayF | I can make the edit if you're +1 just don't wanna trample changes or running CI jobs? | 19:09 |
opendevreview | Harald Jensås proposed openstack/ironic master: redfish_address - wrap_ipv6 address https://review.opendev.org/c/openstack/ironic/+/895729 | 19:09 |
TheJulia | go ahead and make a change, since I'm bouncing between several different things and I have a meeting in a few minutes | 19:09 |
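The edit JayF goes off to make is a version alignment rather than new logic. A sketch of the idea, assuming the job pins cirros through the standard devstack variable (where exactly the setting lives in the job definition is an assumption):

```bash
# Align the inspector job with what ironic's cirros-partition.sh labels the
# partition image as, instead of pinning one version ahead (s/0.6.2/0.6.1/):
CIRROS_VERSION=0.6.1
```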
JayF | I'm going to this change then I have an important meeting with my local friendly assassin, who is more increasingly threatening me if I don't provide kibble ;) | 19:10 |
JayF | I think it was consistent | 19:11 |
JayF | it was inconsistent *between the tempest and the grenade job* | 19:11 |
JayF | but it was consistent within the job | 19:11 |
opendevreview | Jay Faulkner proposed openstack/ironic-inspector master: Update the project status and move broken jobs to experimental https://review.opendev.org/c/openstack/ironic-inspector/+/895164 | 19:12 |
JayF | I'm going to let that run and get lunch for myself and cats; I will be pointing my brain in this direction until I'm outta steam or hours in the day today | 19:13 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/zed: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895703 | 19:18 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/zed: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895703 | 19:18 |
TheJulia | JayF: ^^^ needed so I can unbrick the ipa stable branches | 19:19 |
TheJulia | since they are wedged due to the too new openstacksdk/ansible issues | 19:19 |
JayF | +2a | 19:20 |
opendevreview | Harald Jensås proposed openstack/ironic-inspector master: Handle bracketed IPv6 redfish_address https://review.opendev.org/c/openstack/ironic-inspector/+/895734 | 19:20 |
opendevreview | Merged openstack/metalsmith stable/zed: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895703 | 19:37 |
TheJulia | woot | 20:31 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/yoga: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895671 | 20:31 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/xena: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895672 | 20:31 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/xena: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895672 | 20:33 |
opendevreview | Julia Kreger proposed openstack/metalsmith stable/wallaby: stable-only: Constrain the upper Ansible version https://review.opendev.org/c/openstack/metalsmith/+/895673 | 20:33 |
opendevreview | Julia Kreger proposed openstack/ironic-inspector master: DNM: Collect additional failure information https://review.opendev.org/c/openstack/ironic-inspector/+/895727 | 20:44 |
JayF | I'm looking at any differences between Ironic and inspector grenade | 20:46 |
JayF | ironic has ipa in required-projects | 20:47 |
JayF | I don't think that matters tho b/c build_ramdisk is false | 20:47 |
JayF | INSTANCE_WAIT: 120 | 20:48 |
JayF | MYSQL_GATHER_PERFORMANCE: False | 20:48 |
JayF | both missing from inspector grenade as well | 20:48 |
JayF | those actually make me ponder if they could be impacting | 20:48 |
JayF | will add those in the next update for inspector after I get experimental results | 20:48 |
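The two settings JayF calls out translate to devstack configuration on the grenade job. A sketch of what carrying them over might look like, expressed as localrc-style entries (whether they are injected via zuul job vars or localrc directly is an assumption):

```bash
# Settings present on the ironic grenade job but missing from the inspector
# one; INSTANCE_WAIT gives slow CI nodes more time, and disabling MySQL
# performance gathering avoids the db-stats stalls discussed further down.
INSTANCE_WAIT=120
MYSQL_GATHER_PERFORMANCE=False
```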
JayF | I guess if tempest is failing too, it implies that the problem is deeper | 20:51 |
JayF | well, integrated tempest | 20:51 |
iurygregory | finally the power outage is over... | 21:12 |
iurygregory | I'm back | 21:12 |
TheJulia | iurygregory: welcome back | 21:35 |
iurygregory | TheJulia, tks! | 21:36 |
iurygregory | now time to work :D | 21:37 |
iurygregory | the funny thing is that I couldn't participate in a meeting about the customer case I've been working on since Friday | 21:37 |
TheJulia | doh | 21:39 |
JayF | iurygregory: IDK if you were here for it, but we set EOD Wednesday as a tentative deadline for cutting Ironic releases. | 21:49 |
iurygregory | ack | 21:50 |
JayF | iurygregory: of course, I'm now racing that deadline, too, to try and get passing CI on inspector so hey :) we can form a club | 21:50 |
iurygregory | I wasn't; I lost connection while we were talking about it, haven't checked logs yet | 21:50 |
iurygregory | JayF, perfect :D | 21:50 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: RedfishFirmware Interface https://review.opendev.org/c/openstack/ironic/+/885425 | 21:52 |
iurygregory | one more round to test :D | 21:52 |
* iurygregory updates bifrost to get the latest patchset | 21:52 | |
JayF | looks like the same failure mode re: inexplicably broken image for inspector with my change | 22:08 |
JayF | I have infra holding the next failure so I can look first thing tomorrow | 22:08 |
TheJulia | so grenade didn't fail the same way | 22:22 |
TheJulia | looks like keystone went on vacation | 22:22 |
JayF | TheJulia: looking at the conductor logs, we get the same partition-image-based error | 22:23 |
JayF | (or was that in the tempest job) | 22:23 |
JayF | > Sep 18 21:14:35.428023 np0035285949 ironic-conductor[220906]: ERROR ironic.drivers.modules.inspector.interface [None req-a2e5dec5-9f04-4a09-9356-e791f2979ce1 None None] Inspection failed for node 67265dca-e1f5-4a7f-b2cd-dcde31799b96 with error: Introspection timeout | 22:25 |
TheJulia | lets turn off the db stats stuff | 22:27 |
JayF | oh, this actually maybe fits | 22:27 |
JayF | I have a question for you | 22:27 |
JayF | I've been looking for devstack-plugin-evidence that we are getting right dnsmasq version in inspector | 22:27 |
TheJulia | ok | 22:27 |
JayF | since the newer one was crashy in ironic jobs | 22:27 |
JayF | I dumped that method of thinking because I assumed the standalone jobs would be blowing up too | 22:28 |
JayF | but it occurs to me that we might have different mechanisms of interacting with dhcp that are only breaky when it's under neutron | 22:28 |
JayF | tl;dr: do we need to ensure downgrade_dnsmasq runs on inspector devstack | 22:28 |
TheJulia | .... | 22:29 |
TheJulia | I'm struggling to grok how your getting to think dnsmasq is the root cause | 22:29 |
TheJulia | Would talking through it be helpful? | 22:30 |
JayF | It is a leftover from a troubleshooting technique I applied earlier: Try to find fixes landed this cycle to Ironic CI that might apply to inspector CI | 22:30 |
JayF | since the grenade job does not inherit from ironic jobs like some of the tempest jobs do | 22:30 |
TheJulia | so on grenade, I can see inspector did seemingly go out for vacation, and I remembered we saw similar pauses on the upgrade stuffs | 22:31 |
TheJulia | which is why we disabled it on the ironic grenade job | 22:31 |
JayF | do we want to apply INSTANCE_WAIT: 120 as well? | 22:31 |
* JayF JFDI | 22:33 | |
opendevreview | Jay Faulkner proposed openstack/ironic-inspector master: Update the project status and move broken jobs to experimental https://review.opendev.org/c/openstack/ironic-inspector/+/895164 | 22:33 |
* TheJulia shrugs on the instance wait | 22:33 | |
JayF | I did it, figured I'd rather have a passing job and remove something | 22:33 |
JayF | oooh TheJulia I might have found something | 22:37 |
TheJulia | ? | 22:37 |
JayF | https://zuul.opendev.org/t/openstack/build/3e8021686f884ae8b4c5e2b248138158/log/controller/logs/ironic-bm-logs/node-3_console_log.txt#3331 | 22:37 |
JayF | TheJulia: almost pasted the line but apparently I am capable of learning and improvement :P | 22:37 |
TheJulia | https://paste.opendev.org/show/b052qhfXQ8PL9zrryTsb/ <-- grenade job did literally pause | 22:38 |
JayF | holy cow | 22:39 |
JayF | that is nontrivial | 22:39 |
TheJulia | but look further down at 22:09:08 OSError | 22:39 |
JayF | BRB in 8 minutes | 22:39 |
JayF | that is extremely strange | 22:39 |
JayF | (the BRB was a joke related to the pause if it's not clear) | 22:39 |
JayF | did the FS literally go R/O during the run? Are we filling up the disk? | 22:40 |
JayF | heh that's silly, with a full disk we wouldn't get the log about it written | 22:41 |
TheJulia | no, ironic was doing some stuff | 22:41 |
TheJulia | that seems like thread locked, guessing related to the db since db counter is the last thing to do anything | 22:41 |
TheJulia | which is the same issue we saw on ironic | 22:41 |
TheJulia | my logs are from the change I put up to get some more debug logging | 22:42 |
JayF | hopefully check experimental with my change passes | 22:42 |
TheJulia | yours is the patch before | 22:42 |
JayF | yeah, I just added the bits to the yaml to turn off perf counters | 22:42 |
JayF | I have a meta-question about this: assuming we get grenade/integrated tempest job passing; should we leave them in the queue or put them back in exp? | 22:43 |
JayF | I think if we get it passing it doesn't hurt to keep it on the change but IDK | 22:43 |
TheJulia | Oh, I think I see what is going on there | 22:44 |
TheJulia | err, maybe not | 22:45 |
TheJulia | it might also be slow CI nodes, looks like the inspection timed out right before the post of the payload to inspector | 22:46 |
JayF | the mysql performance counters bump will help for that | 22:46 |
TheJulia | but that shouldn't result in the failure to find the node | 22:46 |
TheJulia | since it is not constrained | 22:46 |
TheJulia | as for leaving in the queue dunno, my impression right now is we're struggling to prove we didn't break ironic/inspector integration | 22:47 |
JayF | yeah that's my impression, too | 22:47 |
JayF | and it almost slipped off the radar because that service has been ignored in favor of getting it into ironic | 22:48 |
TheJulia | and if we didn't, then... we're chasing red herrings | 22:48 |
TheJulia | Well, that has been the case for years | 22:48 |
JayF | we're literally paying the price for not completing that migration, in hours :( | 22:48 |
TheJulia | we basically had to do major DB API updates this cycle *because* of a lack of attention | 22:48 |
JayF | yeah | 22:48 |
TheJulia | could be maybe we got list_nodes_by_attributes wrong too | 22:48 |
TheJulia | dunno | 22:48 |
JayF | I think I'm mostly done with digging logs on this for the day | 22:48 |
JayF | I'll look at a node tomorrow and that'll help | 22:48 |
TheJulia | okay, I guess I did the db stuffs last cycle | 22:51 |
TheJulia | yup I did | 22:51 |
opendevreview | Verification of a change to openstack/ironic-python-agent stable/zed failed: Handle the node being locked https://review.opendev.org/c/openstack/ironic-python-agent/+/892594 | 23:41 |