rpittau | good morning ironic! o/ | 06:48 |
iurygregory | good morning Ironic | 11:25 |
*** iurygregory_ is now known as iurygregory | 12:08 | |
iurygregory | anything urgent for review? | 12:15 |
TheJulia | good morning | 13:01 |
rpittau | good morning TheJulia :) | 13:03 |
rpittau | iurygregory: anything in ironic-week-prio tags? | 13:03 |
TheJulia | I added it to https://review.opendev.org/c/openstack/ironic/+/888506 | 13:05 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/888500 is likely super close, but I've not seen it log any retries :( | 13:05 |
iurygregory | good morning TheJulia o/ | 13:08 |
iurygregory | rpittau, right, one week of PTO and I started to forget things lol | 13:08 |
rpittau | :D | 13:08 |
rpittau | well to be fair it's probably not super up-to-date | 13:08 |
iurygregory | =X | 13:09 |
TheJulia | more along the lines of we've been fighting the gate a lot | 13:09 |
rpittau | yeah | 13:11 |
TheJulia | A continued fight to reduce lock conflicts in metal3 | 13:12 |
TheJulia | nuking the heartbeats seems to help a *ton* | 13:12 |
TheJulia | which leaves retry really only viable to test with a loaded CI system | 13:13 |
iurygregory | yup, I was imagining that we would still be fighting CI | 13:13 |
rpittau | "EFI variables are not supported on this system" ok.... | 13:44 |
TheJulia | where did you see that... 8| | 13:45 |
rpittau | tinycore 14.x patch https://621ce5ea301cd3e5f94e-66098508dda66f6b765978d818669c61.ssl.cf2.rackcdn.com/887754/2/check/ironic-standalone-ipa-src/aaad71d/job-output.txt | 13:45 |
rpittau | are we back to jammy for standalone? don't remember | 13:45 |
TheJulia | we are | 13:46 |
TheJulia | identified the issue | 13:46 |
TheJulia | bug filed, routed around | 13:46 |
rpittau | ok, something else going wrong there then | 13:47 |
TheJulia | If we can get some reviews on https://review.opendev.org/c/openstack/ironic/+/888506 I think it would make sense to proceed with | 13:51 |
TheJulia | Then again, I have no idea if CI is presently super happy load wise, or hates our existence right now. | 13:51 |
TheJulia | Also, file under "wheeeeeee": https://bugzilla.redhat.com/show_bug.cgi?id=2222981 | 13:51 |
rpittau | w00t | 13:56 |
iurygregory | Don't actually heartbeat with sqlite | 14:00 |
iurygregory | WOW | 14:00 |
iurygregory | re 2222981 WHAT?! | 14:01 |
TheJulia | yeah..... | 14:05 |
TheJulia | and the retry change passed again | 14:06 |
mohammed | TheJulia we have built an ironic image for metal3 with the fix https://review.opendev.org/c/openstack/ironic/+/888188 but still get the error sqlite3.OperationalError: database is locked | 14:12 |
TheJulia | mohammed: there are three competing issues, one is periodics, the next is the heartbeat sync for conductor status which I've got a patch up to disable when using sqlite, and the third seems to be to add db retry logic | 14:17 |
TheJulia | so, it was not a "silver bullet" unfortunately, but I think we're almost there | 14:17 |
TheJulia | I also worked until midnight local on Friday night to try and get it sorted, and I'm still trying to observe failures in CI | 14:17 |
TheJulia | if you're willing, https://review.opendev.org/c/openstack/ironic/+/888506 and https://review.opendev.org/c/openstack/ironic/+/888500/ should reduce the chance substantially and add a retry decorator around db writes. | 14:20 |
mohammed | TheJulia do you think replacing SQLite with MariaDB can limit the occurrences of this issue? | 14:20 |
TheJulia | absolutely, MariaDB can handle the multithreaded write operations natively, where write operations with sqlite are more transactional on a file level so everything going on right now tends to result in one consumer encountering another | 14:22 |
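A minimal sketch of the retry idea discussed above (the change in review 888500), assuming SQLAlchemy is the DB layer; the decorator and parameter names here are hypothetical, and the real patch may use oslo.db or tenacity instead:

    import functools
    import time

    from sqlalchemy import exc as sa_exc


    def retry_on_sqlite_lock(max_attempts=10, delay=0.5):
        # Hypothetical decorator: retry a DB write when SQLite reports
        # "database is locked"; any other error is re-raised immediately.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except sa_exc.OperationalError as e:
                        if ('database is locked' not in str(e)
                                or attempt == max_attempts):
                            raise
                        # Back off briefly so the competing writer can finish.
                        time.sleep(delay)
            return wrapper
        return decorator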
JayF | rpittau: are you going to run today's meeting? | 14:29 |
JayF | It's already 30 minutes late :( | 14:29 |
rpittau | ? | 14:29 |
rpittau | isn't it in 30 minutes ? | 14:29 |
JayF | I just traversed a giant amount of timezones | 14:29 |
JayF | so I would trust your clocking better than mine | 14:29 |
rpittau | :D | 14:29 |
JayF | lol | 14:29 |
JayF | see, this is why I'm not working today | 14:29 |
rpittau | JayF: no worries, I'll take care of the meeting :) | 14:30 |
JayF | thank you lol | 14:30 |
JayF | I was just so confused | 14:30 |
iurygregory | yeah, it's in 30min | 14:30 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks https://review.opendev.org/c/openstack/ironic/+/888500 | 14:34 |
TheJulia | mohammed: See the release note attached to ^ | 14:34 |
mohammed | TheJulia thanks for your efforts! We'll try replacing SQLite with MariaDB on our CI to mitigate the issue, while following the progress of solving it with SQLite | 14:37 |
TheJulia | mohammed: you might just want to wait 24-48 hours, since it seems like we might have a path forward ready to go, we just haven't seen load result in locking issues in our CI today or over the weekend | 14:39 |
opendevreview | Julia Kreger proposed openstack/ironic master: Retry SQLite DB write failures due to locks https://review.opendev.org/c/openstack/ironic/+/888500 | 14:40 |
TheJulia | so... 4k disks. If we can't use iso9660, and we cannot use vfat, what options are really left, ext2/3 ? | 14:41 |
TheJulia | i guess we could detect and repack it as xfs as long as the config label exists | 14:42 |
* TheJulia wonders what local patch HPE's CI is keeping that merge conflicts with upstream | 14:48 | |
mohammed | TheJulia sounds like a great option, we can patiently wait for this fix! Thanks :) | 14:48 |
TheJulia | mohammed: since you can locally reproduce, definitely give the two additional patches a try, I'm really quite hopeful | 14:50 |
opendevreview | Elod Illes proposed openstack/ironic stable/victoria: [stable-only] Cap virtualenv/setuptools https://review.opendev.org/c/openstack/ironic/+/888701 | 14:57 |
*** iurygregory_ is now known as iurygregory | 14:59 | |
rpittau | #startmeeting ironic | 15:00 |
opendevmeet | Meeting started Mon Jul 17 15:00:05 2023 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'ironic' | 15:00 |
iurygregory | o/ | 15:00 |
TheJulia | o/ | 15:00 |
rpittau | welcome everyone to our weekly meeting! | 15:00 |
rpittau | I'll be your host for today :) | 15:00 |
rpittau | The meeting agenda can be found here: | 15:00 |
rpittau | #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting | 15:00 |
rpittau | #topic Announcements/Reminder | 15:01 |
rpittau | we announced last week that the next PTG will take place virtually October 23-27, 2023 | 15:01 |
rpittau | going to remove the reminder after this meeting | 15:02 |
rpittau | we'll remind everyone when we're closer to the date | 15:02 |
rpittau | #note usual friendly reminder to review patches tagged #ironic-week-prio, and tag your patches for priority review | 15:02 |
rpittau | I'm leaving the bobcat timeline for reference in the reminder section | 15:03 |
rpittau | any other announcement/reminder today ? | 15:04 |
TheJulia | I worked stupidly late on Friday night on a decorator for sqlite issues | 15:04 |
rpittau | lol | 15:04 |
TheJulia | I *think* it works, but I've not seen CI give me a resource contention issue to generate a "database is locked" error yet today | 15:04 |
TheJulia | unit test wise, it definitely works! | 15:05 |
rpittau | \o/ | 15:05 |
rpittau | let | 15:05 |
iurygregory | :D | 15:05 |
rpittau | I guess we need to review it and recheck a couple of times | 15:05 |
TheJulia | it has been tagged as a prio review, any reviews/feedback would be appreciated | 15:05 |
TheJulia | ++ | 15:05 |
rpittau | great | 15:05 |
rpittau | thanks TheJulia :) | 15:05 |
TheJulia | yeah, I think I'm on the 3rd stupidly minor change today, so hopefully that is helping | 15:05 |
rpittau | (I'm silently skipping the action items review as there are none from last time) | 15:06 |
rpittau | since we started with that patch let's go to | 15:06 |
rpittau | #topic Review Ironic CI Status | 15:06 |
rpittau | so we're back to jammy, except for the snmp pxe job | 15:07 |
rpittau | and still battling the sqlite shenanigans | 15:07 |
TheJulia | yup, I have no idea why it is failing, but I don't remember the exact reason we held it back to begin with | 15:07 |
rpittau | I don't know, I'll try to make time this week to try and replicate, downstream permitting | 15:08 |
rpittau | also found an issue with the new tinycore 14.x in the standalone job | 15:08 |
rpittau | the one I mentioned before | 15:08 |
rpittau | output of efibootmgr "EFI variables are not supported on this system" | 15:09 |
rpittau | oh well | 15:09 |
rpittau | we're probably not in a rush for that | 15:09 |
rpittau | anything else to mention for the CI status? | 15:10 |
rpittau | ok, moving on | 15:10 |
rpittau | #topic 2023.2 Workstream | 15:10 |
rpittau | #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2 | 15:10 |
rpittau | any update to share? | 15:11 |
iurygregory | feedback would be appreciated in the Firmware Update patches =) | 15:11 |
TheJulia | I've updated the patch to support vendor passthru methods as steps | 15:11 |
TheJulia | and it now passes CI \o/ | 15:11 |
rpittau | iurygregory: yeah, was going to mention that, I have some time this week, I'll put it on top of my list | 15:12 |
rpittau | TheJulia: great! | 15:12 |
TheJulia | hopefully get back to service steps this week | 15:12 |
* TheJulia hopes | 15:12 | |
iurygregory | tks rpittau o/ | 15:12 |
TheJulia | but CI is obviously the priority | 15:12 |
iurygregory | ++ | 15:12 |
iurygregory | I will take a look at patches today for CI | 15:13 |
rpittau | ok, great | 15:14 |
rpittau | I think we're good | 15:14 |
rpittau | #topic RFE review | 15:14 |
rpittau | JayF: left a couple of notes | 15:14 |
TheJulia | So, Jay also put the same link on both | 15:14 |
rpittau | so the first one is https://bugs.launchpad.net/ironic/+bug/2027688 | 15:14 |
rpittau | ah yeah :D | 15:14 |
* TheJulia hands JayF coffee | 15:14 | |
TheJulia | I left a comment on the first one | 15:15 |
rpittau | the second one is https://bugs.launchpad.net/ironic/+bug/2027690 | 15:15 |
rpittau | I'll update the agenda | 15:15 |
TheJulia | I think the idea is good overall, but we need to better understand where the delineation is, and even then, some of the things that fail are deep in the weeds | 15:15 |
TheJulia | and you only see the issue when you're at that step, deep in the weeds | 15:15 |
TheJulia | in other words, more discussion/clarification required | 15:16 |
rpittau | mmmm I agree | 15:16 |
TheJulia | since we can't go check switches we know *nothing* about | 15:16 |
TheJulia | but if we can provide better failure clarity, or "identify networking failed outright" mid-deploy, then we don't have to time out completely | 15:16 |
iurygregory | ++, I like the rfe, but we will need further discussion about it | 15:17 |
TheJulia | so yeah, I agree with the second rfe, we'll need it as an explicit spec and we'll need to figure out how to handle permissions | 15:18 |
TheJulia | it might just be we pre-flight validate all permission rights and then cache the step data from the templates | 15:18 |
* iurygregory is reading the second | 15:18 | |
TheJulia | but, that requires a human putting their brain into the deploy templates code | 15:19 |
rpittau | checking early for the permissions would be best IMHO | 15:19 |
TheJulia | *also* custom clean steps would be a thing to consider | 15:19 |
TheJulia | maybe we just... permit them to be referenced, dunno | 15:19 |
TheJulia | that is a later idea I guess | 15:19 |
rpittau | yeah | 15:19 |
rpittau | once we define the templates, it shouldn't be too hard to expand with custom clean steps | 15:20 |
TheJulia | yeah | 15:20 |
JayF | I mainly wanted feedback on the interfaces I laid out there; I intend to specify the implementation just to keep my thoughts straight | 15:21 |
iurygregory | ++ | 15:21 |
TheJulia | right now you need to have system privileges for the deploy template endpoint if memory serves, so if we take the same pattern as add fields, begin to wire everything together, and then change the rbac policy, it should be a good overall approach | 15:21 |
TheJulia | we have the patterns already spelled out in nodes fairly cleanly too | 15:22 |
TheJulia | and allocations | 15:22 |
rpittau | right! | 15:22 |
TheJulia | because you can have an allocation owner | 15:22 |
TheJulia | (but not lessee, since that model doesn't exist there) | 15:23 |
rpittau | we're probably going to discuss both RFEs further, but they both look good to me | 15:24 |
TheJulia | yeah, first one just needs some more "what are the specific failure cases we're trying to detect" to provide greater clarity | 15:25 |
rpittau | probably clarify some aspects and then finalize during the PTG ? | 15:25 |
TheJulia | because honestly, some areas we do a bad job with today, and we should make that better since they could still fail there even if neutron has a thing saying "yes, I can login to the switch" | 15:25 |
JayF | I don't think for the precheck idea, we're looking for anything perfect, just catch anything obviously broken | 15:27 |
JayF | it was suggested by johnthetubaguy that for ilo, for instance, there's an internal BMC status and you could fail if that wasn't 'green' | 15:27 |
TheJulia | I *suspect* the big one is "bmc is half broken" weird case, and I'm not sure we actually test a swift tempurl we create... | 15:28 |
JayF | and I know with the hardware I ran at Rackspace, even just an ipmi power status would've been enough to indicate if our most common failure modes were active | 15:28 |
iurygregory | yellow can be a thing depending on what you are looking at in iLO | 15:28 |
iurygregory | XD | 15:28 |
TheJulia | i guess one question might end up being, are there *really* operators still turning off power sync at scale? We went from 1 to 8 power sync workers. Of course, that is also ipmi | 15:30 |
TheJulia | Anyway, that is beyond the scope of the meeting, just a question to get a datapoint really | 15:31 |
rpittau | let's remember this (and more) during future discussions :) | 15:31 |
JayF | The more I think of it, the more I wonder if it's just a flag to validation to say "no really, actually validate" | 15:32 |
JayF | and have config to change default behavior for when Ironic does it | 15:32 |
TheJulia | to at least try and fail faster, I guess | 15:33 |
JayF | I also have a use case | 15:33 |
JayF | of failing a rebuild before the disk gets nuked | 15:33 |
rpittau | yeah, I guess that's the point (fail faster) | 15:33 |
JayF | so like e.g. a Ceph node wouldn't drop outta the cluster for longer than needed | 15:33 |
JayF | if we can prevent that even in 25% of failure cases, it reduces operator pain because they don't have to go $bmc_fix_period without their node in a usable state | 15:34 |
TheJulia | JayF: I think additional details or context might be needed there since that might also just be a defect | 15:34 |
TheJulia | or, maybe something we've done work on and a lack of a failure awareness is causing the wrong recovery path to be taken by humans | 15:34 |
JayF | that's sorta the theme of both of these rfes; allowing people with that kind of "I have a sensitive cluster" kind of use cases do their own scheduling of maintenance, and reduce the amount of times we'd have them in a failed-down state | 15:35 |
TheJulia | anyway, lets keep talking about this one after the meeting, I think we need to better understand your use case and what spawned the issue to begin with, if that makes sense | 15:36 |
TheJulia | specifically, if I'm trying to rebuild [to update] a node, and somehow I'm wiping all of ceph volumes out | 15:36 |
rpittau | alright, moving on! | 15:37 |
rpittau | #topic Open Discussion | 15:37 |
TheJulia | I had https://bugzilla.redhat.com/show_bug.cgi?id=2222981 pop up in my inbox this morning! | 15:38 |
rpittau | it's been great running the meetings again this time and the last week :) | 15:38 |
rpittau | next week JayF will be back! | 15:38 |
rpittau | #note Overcloud deploy fails when mounting config drive on 4k disks | 15:38 |
rpittau | #link https://bugzilla.redhat.com/show_bug.cgi?id=2222981 | 15:39 |
TheJulia | I'm looking at cloud-init code to see if there is a happy path forward, but I suspect it might be we may need to locally repack the config drive as an intermediate solution | 15:41 |
TheJulia | there might be, just depends on how long it has *really* been an "or", and we'll need to likely fix glean | 15:44 |
opendevreview | Merged openstack/bifrost master: Refactor use of include_vars https://review.opendev.org/c/openstack/bifrost/+/855807 | 15:44 |
iurygregory | I'm wondering if repack would cause performance issues (but it would be only in the 4k scenario right?) | 15:44 |
TheJulia | well, we would have to open the file on a loopback... if we even can... | 15:45 |
TheJulia | and then write out as a brand new filesystem | 15:45 |
TheJulia | so... dunno | 15:45 |
iurygregory | gotcha | 15:46 |
TheJulia | ¯\_(ツ)_/¯ | 15:46 |
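A rough sketch of the repack approach TheJulia describes above (mount the existing config drive on a loopback, copy it into a fresh filesystem carrying the config-2 label); the ext4 choice, image size, and helper name are illustrative assumptions only, and a real fix could equally target xfs:

    import shutil
    import subprocess
    import tempfile

    CONFIG_DRIVE_LABEL = 'config-2'  # the label cloud-init/glean look for


    def repack_configdrive(iso_path, out_path, size_mb=64):
        # Illustration only: no error handling, cleanup, or privilege checks.
        src = tempfile.mkdtemp()
        dst = tempfile.mkdtemp()
        # Mount the original iso9660 config drive read-only on a loop device.
        subprocess.check_call(['mount', '-o', 'loop,ro', iso_path, src])
        try:
            # Build a new image and format it with a 4k-sector-friendly filesystem.
            subprocess.check_call(['truncate', '-s', '%dM' % size_mb, out_path])
            subprocess.check_call(['mkfs.ext4', '-q', '-L', CONFIG_DRIVE_LABEL, out_path])
            subprocess.check_call(['mount', '-o', 'loop', out_path, dst])
            try:
                shutil.copytree(src, dst, dirs_exist_ok=True)
            finally:
                subprocess.check_call(['umount', dst])
        finally:
            subprocess.check_call(['umount', src])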
rpittau | I don't see a "straight" solution honestly | 15:49 |
TheJulia | yeah, since we support binary payloads | 15:50 |
TheJulia | I'll dig more, and write an upstream bug up | 15:51 |
rpittau | thanks | 15:51 |
iurygregory | ack | 15:51 |
rpittau | alright, anything else for open discussion ? | 15:52 |
rpittau | thanks everyone! | 15:53 |
rpittau | #endmeeting | 15:53 |
opendevmeet | Meeting ended Mon Jul 17 15:53:34 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:53 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.html | 15:53 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.txt | 15:53 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-07-17-15.00.log.html | 15:53 |
iurygregory | tks! | 15:53 |
TheJulia | JayF: so, that failure case, with the rebuild, what is your perception to the process/failure? I'm guessing the rebuild fails, and instead of being worked as a distinct thing, something happened with nova at that point? | 15:57 |
TheJulia | I'm trying to understand since we have capabilities to prevent those disks from getting erased, so I'm trying to sort out in my head, where things went from "oh, retry rebuild" to "oh no, we lost everything" | 16:02 |
rpittau | good night! o/ | 16:06 |
opendevreview | Merged openstack/bifrost master: remove nginx system packages requirement https://review.opendev.org/c/openstack/bifrost/+/874521 | 16:15 |
TheJulia | Anyone have any block devices with 1k, 2k, or 4k sector sizes handy? Could you run `blockdev --getss /dev/<device>` and provide output | 17:12 |
TheJulia | oh wow | 17:13 |
TheJulia | looks like maybe blockdev -getblocksz <device> might be the thing, maybe?! | 17:14 |
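For reference, the util-linux flags are --getss (logical sector size) and --getpbsz (physical block size). A small sketch of reading the same information from sysfs, where the device name passed in is an assumption of the caller:

    def get_sector_sizes(device_name):
        # device_name is e.g. 'vda' or 'sda'; mirrors
        # `blockdev --getss` / `blockdev --getpbsz` on the device node.
        base = '/sys/block/%s/queue' % device_name
        with open(base + '/logical_block_size') as f:
            logical = int(f.read().strip())
        with open(base + '/physical_block_size') as f:
            physical = int(f.read().strip())
        return logical, physical

    # A 4Kn disk reports 4096/4096, while a 512e disk reports 512/4096.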
TheJulia | 5-6 runs with no database is locked errors. I guess that is a good sign :\ | 17:57 |
iurygregory | yeah | 17:57 |
iurygregory | I need to drop now, going to the airport, will continue working from there o/ | 17:57 |
JayF | TheJulia: not lost data; just downtime in the cluster. I have a server, X, I do a rebuild with --preserve-ephemeral; it fails in the middle (while in prov network). That machine is now dead-to-me until I can get ops teams to fix the failure. If, instead, Ironic had predicted that failure and refused to touch the node; sure the nova instance would be in ERROR but the workload | 18:00 |
JayF | would be untouched | 18:00 |
JayF | think about cases where you are tight on capacity and can't take a long downtime in a cluster | 18:00 |
JayF | I'll talk about use cases in the rfe more. | 18:01 |
TheJulia | so, you'd need to be able to ask neutron if it can noop on that switch/interface | 18:02 |
JayF | yeah I'm not sure what an implementation looks like for every interface -- but having the hook is useful | 18:03 |
TheJulia | yeah, in ironic would mostly be fairly light touches, but that network part is the real conundrum | 18:03 |
JayF | for instance, in some future world with a non-neutron network driver, we might be able to ask better questions, even (does the smartnic have desired_vlan trunked?) | 18:03 |
TheJulia | ++ | 18:03 |
TheJulia | or switch, etc | 18:03 |
TheJulia | but yeah | 18:04 |
JayF | I do not want to say, we shouldn't add it to the interface b/c we might not be able to do a great job of it on that interface | 18:04 |
TheJulia | And that is not why I'm asking the question, I'm trying to understand under what case are you blocked from being able to perform basic actions to recover | 18:04 |
JayF | meaning I'd want to implement it across all interfaces (like validate), but it's going to naturally be more useful for some than others | 18:04 |
TheJulia | yeah | 18:05 |
JayF | I really think once I look at this tuesday | 18:05 |
JayF | I might shape it more like "validate, except you have time to do more stuff" | 18:06 |
JayF | which could make it useful for enrollment situations, or for doing actual-validation of a fix after a node has undergone servicing (either by human, or by Ironic I guess) | 18:06 |
JayF | I don't think this would actually work out in practice; but it'd be neat if we could also communicate via API the confidence level | 18:07 |
TheJulia | well, the actual validation today is just "can we get the power status" | 18:07 |
TheJulia | I do like the idea of giving that feedback | 18:08 |
JayF | heh, maybe even redfish super-validate is running the DMTF validation against it | 18:09 |
JayF | "It'll give you a medium Ironic experience" | 18:09 |
TheJulia | heh | 18:12 |
* TheJulia goes back to the gigabytes of logs a customer has shipped me | 18:12 | |
NobodyCam | Good afternoon Ironic folk | 19:37 |
TheJulia | good afternoon! | 19:37 |
TheJulia | it is... a bit... warm here today | 20:31 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded https://review.opendev.org/c/openstack/ironic-python-agent/+/887729 | 20:47 |
TheJulia | stevebaker[m]: ^ | 20:47 |
TheJulia | INFO ironic_python_agent.extensions.standby [-] Image streamed onto device /dev/vda in 201.56154799461365 seconds for 2958688256 bytes. Server originally reported 2958688256. | 22:47 |
TheJulia | woot | 22:47 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: Log the number of bytes downloaded https://review.opendev.org/c/openstack/ironic-python-agent/+/887729 | 23:19 |