opendevreview | Merged openstack/ironic-inspector master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job https://review.opendev.org/c/openstack/ironic-inspector/+/867573 | 00:04 |
vanou | good morning ironic | 02:27 |
vanou | TheJulia JayF: Thanks for the review on https://review.opendev.org/c/openstack/ironic/+/865075 I have one concern about the try/fail model. If I follow this approach, the driver will first try IPMI and then, if IPMI fails, try Redfish. However, the IPMI code in Ironic resends the IPMI command until the timeout is reached (default 60 seconds), so this try/fail model can introduce a 60s delay on every failed IPMI attempt. | 06:00 |
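A minimal sketch of the ordering vanou describes, under the assumption that the IPMI attempt can be bounded; `ipmi_ping`/`redfish_ping` and the timeout parameter are hypothetical stand-ins, not the patch under review:

```python
# Hypothetical stand-ins throughout: the point is that the IPMI attempt
# must be bounded, otherwise Ironic's retry-until-timeout behaviour
# (default ~60 s) delays every fallback to Redfish.

class IPMIError(Exception):
    pass

def ipmi_ping(node, timeout):
    raise IPMIError()  # pretend this BMC does not answer IPMI

def redfish_ping(node):
    return True  # pretend this BMC answers Redfish

def detect_protocol(node, ipmi_timeout=5):
    try:
        ipmi_ping(node, timeout=ipmi_timeout)  # single bounded attempt
        return 'ipmi'
    except IPMIError:
        redfish_ping(node)
        return 'redfish'

print(detect_protocol('node-1'))  # -> 'redfish'
```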
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid https://review.opendev.org/c/openstack/ironic/+/862933 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid https://review.opendev.org/c/openstack/ironic/+/864781 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid https://review.opendev.org/c/openstack/ironic/+/864802 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for node allocation_uuid https://review.opendev.org/c/openstack/ironic/+/865989 | 08:15 |
codecap_ | Hi guys, | 08:36 |
codecap_ | I'm experiencing this problem: | 08:36 |
codecap_ | Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:98', '4c:52:62:42:de:96', '4c:52:62:52:ae:2b', '4c:52:62:42:de:97', '4c:52:62:42:de:95', '4c:52:62:52:ae:2a']} | 08:36 |
codecap_ | Can anybody help me understand the reason for this problem? | 08:36 |
rpittau | good morning ironic! o/ | 08:59 |
kubajj | Morning rpittau and ironic o/ | 09:29 |
rpittau | hey kubajj :) | 09:29 |
dtantsur | codecap_: could be a mismatch between existing BMC address and ports on the Ironic node | 09:37 |
codecap_ | dtantsur: Hi ! Can you please explain ? | 09:38 |
dtantsur | codecap_: so, you're inspecting the node right? the node has a BMC (IPMI/Redfish/iDRAC/...) address and maybe some pre-existing ports? | 09:39 |
codecap_ | dtantsur: IPMI ... pre-existing ports ? | 09:40 |
dtantsur | codecap_: okay, let's take a step back. What exactly are you doing? How familiar with Ironic are you? | 09:42 |
codecap_ | dtantsur: I'm trying to install and use it ... not much experience, I would say ... I've already learned how to use Bifrost. The problem I described above occurs when I use Ironic within OpenStack: the node being deployed comes up under the control of IPA, and at one point IPA tries to contact the inspector, which replies with HTTP 500 | 09:48 |
dtantsur | Is it still Bifrost or another installation with the rest of OpenStack? | 09:53 |
dtantsur | within the Bifrost context, attempting to contact Inspector first is normal | 09:54 |
dtantsur | when used with Nova/Neutron, we usually expect DHCP to be set up correctly to avoid that (in which case it may be a problem with networking or Neutron) | 09:54 |
dtantsur | codecap_: ^^ | 09:55 |
dtantsur | I need to step away for a while, other folks may help you further (or I'll check your messages when I'm back) | 09:56 |
codecap_ | dtantsur: bifrost is completely independent. Ironic within OpenStack was installed by kolla-ansible ... DHCP looks good, so the node being deployed starts loading the IPA image | 09:57 |
codecap_ | dtantsur: can you explain to me how Inspector searches for the node when contacted by IPA? | 10:00 |
opendevreview | Merged openstack/ironic master: Ironic doesn't use metering; don't start it in CI https://review.opendev.org/c/openstack/ironic/+/867574 | 10:21 |
ajya | janders: iDRAC 6.10.00.00 has been released, if you want to try it out | 10:30 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 10:42 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 10:48 |
dtantsur | codecap_: when inspection is started, it caches BMC address (ipmi_address, etc) and MAC addresses to use them for lookup later. Note that if you don't start inspection, it will never work. | 11:12 |
dtantsur | codecap_: so if you end up inspecting when you meant to deploy, something went wrong on the networking level | 11:12 |
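A simplified sketch of the lookup being discussed (illustrative only; the real matching logic lives in ironic-inspector):

```python
# Inspector caches the BMC address and MACs when inspection starts, then
# matches IPA's lookup request against that cache. With an empty cache
# (inspection never started) the result is the "Could not find a node
# for attributes" error shown earlier.

def lookup(cached_nodes, bmc_address, macs):
    for node in cached_nodes:
        if bmc_address in node.get('bmc_address', []):
            return node
        if set(macs) & set(node.get('mac', [])):
            return node
    return None

cache = [{'uuid': '8280e833-...', 'bmc_address': ['10.40.11.68'],
          'mac': ['4c:52:62:52:ae:2a']}]
print(lookup(cache, '10.40.11.68', ['4c:52:62:52:ae:2a']))
```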
codecap_ | dtantsur: so what is the order a node should be deployed in ? Create -> Inspect -> Deploy ? | 11:14 |
kubajj | dtantsur: We use the pecan REST for the decorators, right? | 11:17 |
dtantsur | codecap_: it depends on what you need. Inspection is optional. | 11:21 |
dtantsur | kubajj: I think we have a lot of our own decorator code, inherited from a project called wsme | 11:22 |
kubajj | dtantsur: and these are defined in the api repo? | 11:23 |
kubajj | (not repo, directory) | 11:23 |
codecap_ | dtantsur: how can I debug why the Lookup fails ? Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97' ... both (bmc_address and macs) are correct ones | 11:28 |
dtantsur | kubajj: yep, somewhere there | 11:29 |
dtantsur | codecap_: do you start inspection? if not, it will never work. if you start deploy and end up inspecting, you need to see what went wrong on the PXE stage. | 11:29 |
dtantsur | in a normal openstack installation, different dnsmasq instances are responsible for two processes | 11:29 |
dtantsur | when inspection is not running, ironic-inspector's dnsmasq is blocked for these nodes by its own filters, while neutron's is working | 11:30 |
dtantsur | somewhere there something went wrong, but I cannot guess what without a deep look | 11:31 |
codecap_ | dtantsur: Inspection works | 11:31 |
codecap_ | dtantsur: [node: 8280e833-8a86-4ee1-8cf6-675d55d54970 state finished MAC 4c:52:62:52:ae:2a BMC 10.40.11.68] Introspection finished successfully | 11:31 |
codecap_ | dtantsur: after successful inspection, the found data should be saved in the ironic-inspector DB, right? | 11:33 |
dtantsur | codecap_: if you have it set up to do it, yes (I don't know how kolla sets up inspector) | 11:34 |
dtantsur | so, inspection works, deployment does not? | 11:34 |
codecap_ | dtantsur: deployment fails with Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97', .... | 11:36 |
dtantsur | right, so your flow erroneously goes back into inspection | 11:37 |
dtantsur | codecap_: see what your PXE filter is ([pxe_filter]driver in inspector.conf). check for suspicious messages about it in inspector logs. | 11:38 |
kubajj | dtantsur: I am sorry for asking so many questions, but I've been stuck on this for more than a day now. Is the api I am aiming to implement much different from the StatesController? | 11:38 |
dtantsur | kubajj: it should not be (and don't worry about questions). It's not entirely clear to me at which step you're stuck now. If you elaborate, I may be able to help better. | 11:39 |
codecap_ | dtantsur: driver = noop | 11:40 |
dtantsur | that explains... | 11:41 |
dtantsur | is it something you did or something kolla set up for you? | 11:41 |
dtantsur | (driver noop means that access to inspector's dnsmasq is not limited, which means that inspection always starts) | 11:41 |
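For reference, the setting under discussion lives in inspector.conf; a minimal example of the recommended value follows (as dtantsur notes shortly after, extra dnsmasq configuration may be required, so this is a starting point rather than a complete recipe):

```ini
[pxe_filter]
# 'noop' leaves inspector's dnsmasq unfiltered, so any PXE-booting node
# lands in inspection; 'dnsmasq' makes inspector open and close access
# per node, only answering nodes actually being inspected.
driver = dnsmasq
```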
kubajj | dtantsur: I am still getting the "Missing mandatory parameter" node_ident error. The thing is, the states controller uses just node_ident instead of self.node_ident and does not have the constructor (which Julia suggested I implement as in the history controller) and it is able to be called from the node controller https://opendev.org/openstack/ironic/src/branch/master/ironic/api/controllers/v1/node.py#L1956 | 11:43 |
kubajj | I am trying to figure out why I can't call it as well | 11:43 |
dtantsur | okay, I'll check after the current meeting (please upload the latest patch) | 11:43 |
codecap_ | dtantsur: kolla configures it this way | 11:44 |
dtantsur | codecap_: then please ask mgoddard and other folks on #openstack-kolla which architecture they have in mind | 11:45 |
dtantsur | it feels like it's a mix of a standalone architecture and normal openstack one.. but I may simply be missing what they mean | 11:45 |
codecap_ | dtantsur: should it be driver = dnsmasq? | 11:53 |
dtantsur | codecap_: that's what we recommend, but it may require additional configuration, so I'd check with kolla folks first | 11:54 |
dtantsur | I assume they don't have a lot of docs around this area? | 11:54 |
codecap_ | dtantsur: sure ... will also ask there for help | 11:57 |
codecap_ | dtantsur: thanx a lot ! | 11:58 |
codecap_ | dtantsur: I've already read a lot ... but still not smart enough :) | 11:58 |
dtantsur | networking is the least trivial thing in openstack IMO :) | 11:58 |
opendevreview | Jakub Jelinek proposed openstack/ironic master: WIP: API for node inventory https://review.opendev.org/c/openstack/ironic/+/866876 | 12:16 |
kubajj | dtantsur: somehow I managed to fix it. What should happen if I try to access a node inventory of a node that doesn't have one? | 12:17 |
kubajj | dtantsur: What should be the behaviour if the version is too old? | 12:17 |
dtantsur | kubajj: in both cases we return 404 with an appropriate message | 12:31 |
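A hedged sketch of those two 404 rules; `NotFound`, the helper, and the version check below are illustrative stand-ins rather than ironic's actual controller code:

```python
class NotFound(Exception):
    code = 404

def get_inventory(node, requested_version, min_version):
    # Too old an API microversion: behave as if the resource is absent.
    if requested_version < min_version:
        raise NotFound('API version too old for node inventory')
    # Node exists but inspection never stored an inventory: also 404.
    if node.get('inventory') is None:
        raise NotFound('node %s has no inventory' % node['uuid'])
    return node['inventory']
```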
opendevreview | Jakub Jelinek proposed openstack/ironic master: API for node inventory https://review.opendev.org/c/openstack/ironic/+/866876 | 13:18 |
kubajj | dtantsur: if you had a moment, feedback for ^ would be appreciated | 13:19 |
dtantsur | sure, will put on my queue | 13:20 |
dtantsur | kubajj: before that: this is the point where we need a release note since we're finally adding something user-visible. I can help with wording, but feel free to propose a draft. | 13:23 |
kubajj | dtantsur: I am just drafting it, I was expecting you to say that we need it 😀 | 13:25 |
dtantsur | great :) | 13:25 |
kubajj | dtantsur: I need apiref as well, right? | 13:40 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 14:09 |
mgoddard | ajya: Hi, we're having trouble with cleaning on some Dell hardware. We have NVMes behind a RAID controller, and it seems to be falling back to shredding rather than secure erase | 14:13 |
mgoddard | RAID controller: https://dl.dell.com/content/manual61357984-dell-poweredge-raid-controller-11-user-s-guide-perc-h755-adapter-h755-front-sas-h755n-front-nvme-h755-mx-adapter-h750-adapter-sas-h355-adapter-sas-h355-front-sas-h350-adapter-sas-h350-mini-monolithic-sas.pdf?language=en-us&ps=true | 14:13 |
mgoddard | NVMe: https://www.dell.com/en-us/shop/dell-384tb-data-center-nvme-read-intensive-ag-drive-u2-gen4-with-carrier/apd/400-bmto/storage-drives-media#tabs_section | 14:13 |
mgoddard | have you seen this before? | 14:13 |
ajya | mgoddard: haven't seen, but I can check. What commands are being used? | 14:15 |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent master: [DNM] test ci https://review.opendev.org/c/openstack/ironic-python-agent/+/867659 | 14:34 |
TheJulia | mgoddard: just standard automated cleaning? if so I suspect you'll need to share an agent log from cleaning, but I wonder whether the raid controller's capabilities actually inhibit some of that | 14:42 |
dtantsur | kubajj: yes please | 14:45 |
dtantsur | mgoddard: I believe TheJulia is right: I don't think we can see the NVMe behind the controller | 14:46 |
TheJulia | secure erase as the code uses it is an ATA concept | 14:46 |
mgoddard | that does seem to be the case | 14:46 |
mgoddard | nvme list is empty | 14:47 |
TheJulia | scsi, depending on the protocol exposed by the raid controller, has no concept of it | 14:47 |
mgoddard | disks show up as /dev/sdX | 14:47 |
TheJulia | how do the devices appear to an OS? | 14:47 |
TheJulia | oh! | 14:47 |
TheJulia | so yeah | 14:47 |
TheJulia | they are exposed as SCSI devices :( | 14:47 |
* TheJulia is surprised they are not exposed as NVMe devices | 14:47 |
mgoddard | wondering whether it kills the benefits of NVMe | 14:47 |
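A quick check for this situation from a ramdisk shell (a sketch, not IPA code): NVMe drives hidden behind a RAID controller enumerate as SCSI block devices, so the NVMe/ATA secure-erase paths never trigger.

```python
import json
import subprocess

# List top-level disks with their transport: behind a PERC you typically
# see sd* devices with no 'nvme' transport, matching the symptoms above.
out = subprocess.run(
    ['lsblk', '--nodeps', '--json', '-o', 'NAME,TRAN,TYPE'],
    capture_output=True, text=True, check=True).stdout
for disk in json.loads(out)['blockdevices']:
    print(disk['name'], disk['tran'], disk['type'])
```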
TheJulia | the raid controller utilities *might* have a tool to help enable this | 14:47 |
TheJulia | raid controllers can be useful, and they can also hurt performance/usage at times | 14:48 |
TheJulia | it is a value tradeoff unfortunately | 14:48 |
dtantsur | or rebuild the raid every time.. | 14:48 |
TheJulia | yeah | 14:49 |
TheJulia | the raid controller, depending on how smart it is might issue unmap/deallocate commands to the device | 14:49 |
TheJulia | or it could write zeros | 14:49 |
TheJulia | or keep a memory map and just double zero | 14:49 |
TheJulia | like qcow2s | 14:49 |
* TheJulia has crashed many PERCs in her career | 14:51 |
mgoddard | seems like we need to try to expose the NVMes directly, if possible | 14:53 |
TheJulia | so, you could try disabling raid mode | 14:53 |
mgoddard | would much prefer not to need to use some custom tool in IPA (if that is even possible) | 14:53 |
TheJulia | the bios firmware should have a mode setting for the card | 14:53 |
mgoddard | there isn't currently any RAID virtual disk configured | 14:53 |
TheJulia | Oh, yeah, even better reason to check the card settings | 14:54 |
mgoddard | ok, I'll have a poke around | 14:54 |
mgoddard | thanks all | 14:54 |
TheJulia | so fwiw, people have wrapped individual binaries in their ramdisks to do special things, and a hardware plugin can always override, but disabling RAID mode on the card would hopefully just make it a pass-through | 14:55 |
TheJulia | Now, if that translates the protocol, that is a whole different question | 14:55 |
TheJulia | and if it does.... yeouch. We can work on figuring something out | 14:55 |
TheJulia | you'll know quickly after rebooting the machine with the raid controller card in a different mode | 14:56 |
* TheJulia wonders if we have this sort of caveat documented heavily | 14:56 | |
dtantsur | I feel like we do not | 15:05 |
TheJulia | I'm writing something now | 15:13 |
dtantsur | I recall there were some talks about dropping dhcp-all-interfaces from IPA deps, does anyone else remember? TheJulia? | 15:23 |
TheJulia | we talked about it because network manager has similar behavior | 15:23 |
TheJulia | although I think we now set networkmanager to behave like dhcp-all-interfaces | 15:23 |
TheJulia | so apparently there is a SCSI FORMAT UNIT command | 15:26 |
dtantsur | got someone complaining about the time it takes to DHCP interfaces that don't have a DHCP server on them | 15:26 |
TheJulia | the counterbalancing issue is that if you turn that off, networkmanager is going to focus on the first interface it finds | 15:31 |
TheJulia | and you may lose connectivity because there could be multiple interfaces :( | 15:31 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 15:55 |
TheJulia | mgoddard: dtantsur: ^^ | 15:55 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 15:57 |
JayF | We probably should mention that we have explicit examples for making your own hardware managers | 15:57 |
JayF | with tooling provided by raid card vendor | 15:57 |
JayF | to use the actual-vendor-tool to do this if you want | 15:57 |
JayF | https://github.com/openstack/ironic-python-agent/tree/master/examples/custom-disk-erase | 15:57 |
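In the spirit of the linked example, a vendor-tool hardware manager is short; the sketch below is from memory and the `vendor-secure-erase` CLI is invented, so treat the linked example as authoritative:

```python
from ironic_lib import utils as il_utils
from ironic_python_agent import hardware


class VendorDiskEraser(hardware.HardwareManager):
    HARDWARE_MANAGER_NAME = 'VendorDiskEraser'
    HARDWARE_MANAGER_VERSION = '1.0'

    def evaluate_hardware_support(self):
        # Outrank the generic manager so this erase path wins.
        return hardware.HardwareSupport.SERVICE_PROVIDER

    def erase_block_device(self, node, block_device):
        # Delegate the erase to the RAID vendor's tool (invented name).
        il_utils.execute('vendor-secure-erase', block_device.name)
```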
TheJulia | we've not seen that as much; most vendors are trying to integrate purely via BMC interactions, but yeah | 15:58 |
* TheJulia revises | 15:58 | |
JayF | Or, potentially, that pathway is better documented so more people self-serve it without bothering us ;) | 15:58 |
dtantsur | TheJulia: one formatting issue inline | 15:58 |
* JayF knows of at least 3 installs that have used it without consulting with us | 15:58 | |
TheJulia | ++ | 15:58 |
TheJulia | dtantsur: which line? | 15:58 |
TheJulia | JayF: that is an interesting data point | 15:59 |
dtantsur | TheJulia: 1091, need to use a :doc: link, otherwise it won't work when relocated or built locally | 15:59 |
TheJulia | ack | 15:59 |
JayF | TheJulia: well, it's what I tend to ask people about given it's the piece I know the most about, so there's a lot of confirmation bias :D | 15:59 |
opendevreview | Aija Jauntēva proposed openstack/sushy master: Fix exceeding retries https://review.opendev.org/c/openstack/sushy/+/867675 | 16:00 |
dtantsur | see you tomorrow folks | 16:00 |
ajya | rpittau: see the patch ^ | 16:00 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 16:05 |
TheJulia | web-edited that, so hopefully no issues | 16:05 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 16:05 |
TheJulia | Okay, that should be good | 16:06 |
JayF | TheJulia: just +A shard_key spec | 16:12 |
JayF | TheJulia: Arne gave me an email thumbs-up | 16:12 |
JayF | \o/ | 16:13 |
TheJulia | \o/ | 16:13 |
TheJulia | now, where was I with extracting useful metrics! | 16:20 |
opendevreview | Merged openstack/ironic-specs master: Add a shard key https://review.opendev.org/c/openstack/ironic-specs/+/861803 | 16:22 |
JayF | TheJulia: circling back to the hwmgr/disk stuff; I think "I'm building my own ramdisk" is a threshold for some folks; once it's crossed, those are not hard problems anymore (because the code is trivial) | 16:26 |
JayF | TheJulia: the thing that, on reflection, I realize is common to some of the issues reported here is that people are trying hard (for good reason) to use a fully stock ramdisk | 16:27 |
TheJulia | eh, we made ipa-b to simplify ramdisk building | 16:35 |
TheJulia | so the knowledge barrier is a bit lower now | 16:36 |
JayF | there's a reason I used 'threshold' over 'barrier' | 16:37 |
JayF | IME operating it, it's always surrounding BS that makes it hard: setting up internal forks, CI, etc | 16:37 |
JayF | building one special ramdisk is easy; setting up a ramdisk building machine is more :) | 16:37 |
TheJulia | I think it is variable based upon perception/experience, but I do agree | 16:38 |
opendevreview | Merged openstack/ironic-inspector master: Update tox.ini for tox 4 https://review.opendev.org/c/openstack/ironic-inspector/+/867541 | 16:40 |
opendevreview | Julia Kreger proposed openstack/ironic-lib master: Provide an interface to store metrics https://review.opendev.org/c/openstack/ironic-lib/+/865311 | 16:40 |
TheJulia | JayF: ^ I could use a sanity check of this ironic-lib change since you were deeply involved with this ages ago | 16:41 |
JayF | hopefully you like the interface :D | 16:41 |
JayF | I think aweeks wrote that though, not me, although I was upstreaming with him | 16:42 |
TheJulia | it took a little time to wrap my head around it, but once I did I was like "oh, this is perfect for what I want" | 16:42 |
JayF | this is a very bog-standard metrics interface from 2015 :) | 16:43 |
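For readers following along, consuming that interface looks roughly like this (sketched from memory; ironic_lib's metrics_utils is the authoritative reference):

```python
from ironic_lib import metrics_utils

METRICS = metrics_utils.get_metrics_logger(__name__)

@METRICS.timer('ConductorManager.do_node_deploy')
def do_node_deploy(task):
    ...  # wall time is reported to the configured backend (statsd/noop)

def on_continue_cleaning():
    METRICS.send_counter('continue_cleaning', 1)  # explicit counter
```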
JayF | TheJulia: reading this... you're reimplementing half of statsd | 16:44 |
JayF | TheJulia: is there a statsd->prometheus connector that we could document instead? | 16:44 |
TheJulia | dunno, but I need both node sensor data and ironic data in one pass | 16:44 |
TheJulia | and we have ipe for the node data | 16:44 |
TheJulia | and deploying | 16:44 |
TheJulia | err | 16:44 |
TheJulia | incomplete thought, disregard that last line | 16:45 |
JayF | TheJulia: https://github.com/prometheus/statsd_exporter | 16:45 |
TheJulia | yeah, I was just looking at that | 16:46 |
TheJulia | geared as a migration tool | 16:46 |
JayF | that even handles turning timers into summary/histogram | 16:46 |
JayF | I am not opposed to native support, I'm just asking the question | 16:46 |
TheJulia | if I were to do that, I'd have to completely forklift all of the sensor data stuff over | 16:46 |
JayF | Are you saying that right now | 16:46 |
JayF | we have two metrics stacks in Ironic? | 16:46 |
JayF | because whoever added sensor data didn't plug it into existing metrics collection (or vice-versa, but I think app metrics were first) | 16:47 |
TheJulia | not metrics stacks, sensor data collection walks/collects data from the BMC and was originally geared for handing it off wholesale to ceilometer | 16:47 |
JayF | yeah as I hit enter there I realized it was the ceilometer stuff | 16:47 |
JayF | which I steered clear of when designing these metrics because people thought ceilometer would continue to exist still :| | 16:47 |
JayF | I mean, "forklift all the sensor data over" sounds like the right answer | 16:48 |
JayF | or "forklift all the metrics stuff over" | 16:48 |
JayF | i don't think many deployers would distinguish, in the same way we do, between app and hardware metrics | 16:48 |
TheJulia | hmm | 16:49 |
TheJulia | we had to write a lot of code to handle the sensors | 16:49 |
JayF | This isn't definitive, I'm just talking through it | 16:49 |
TheJulia | but the proxy might be the right way to go | 16:49 |
TheJulia | stevebaker[m]: any thought on running the statsd->prometheus proxy? | 16:50 |
JayF | like, statsd is definitely the older school thing vs prometheus | 16:50 |
JayF | which is something to consider too | 16:50 |
JayF | but our metrics interface was written with statsd in mind, so using the proxy you sorta get the code you wanted to write here for free | 16:50 |
TheJulia | the downside of IPE is it relies upon the message bus architecture | 16:50 |
JayF | let me make sure I understand | 16:51 |
TheJulia | you don't actually *need* a message bus, but it plugs in there | 16:51 |
JayF | it used to be | 16:51 |
JayF | Ironic node sensor data -> ceil -> ??? | 16:51 |
JayF | now it's Ironic node sensor data -> IPE -> Prom | 16:51 |
TheJulia | yes, for those that wish to use it | 16:51 |
JayF | and separately, we have Ironic app metrics -> statsd (I think this is the only plugin we ship?) | 16:51 |
TheJulia | I don't think anyone actually runs with statsd, tbh | 16:52 |
JayF | me neither | 16:52 |
TheJulia | it is statsd, and noop currently | 16:52 |
JayF | Well, like let me put it this way | 16:52 |
JayF | everywhere I've run at crazy scale | 16:52 |
JayF | we ran statsd | 16:52 |
JayF | we almost never used the data though | 16:52 |
JayF | So if the answer is "plumb up metrics library to prom" I'm OK-ish with that, but we might wanna consider just changing the paradigm at that point | 16:53 |
JayF | because it seems bad that sensor data and app data don't flow through the same code paths | 16:53 |
TheJulia | yeah, you're raising a really good point | 16:54 |
TheJulia | does statsd record how many times it got a timer call? | 16:54 |
JayF | statsd gives you counters for free off timers | 16:54 |
JayF | well, IDK if statsd does it | 16:54 |
JayF | or if you can derive it at the display layer | 16:54 |
JayF | my knowledge of this is significantly muddied because most of my experience with it was sending the statsd output to the Rackspace Monitoring agent | 16:55 |
JayF | and then [magic] | 16:55 |
TheJulia | so we could always just go "oh, that implies a count as well, ship it!" | 16:56 |
TheJulia | if we need to | 16:56 |
JayF | I am less concerned with that level of implementation | 16:56 |
JayF | and more concerned with the core problem of having two metrics libraries (essentially) in the same project | 16:57 |
TheJulia | I'm more so because people tend to want to know how many times a thing was performed, not total time spent | 16:57 |
JayF | I will say from experience that timer metrics were the most useful | 16:57 |
TheJulia | you can't imply # of total deploys from just a timer unless statsd does magic there | 16:57 |
JayF | I am surprised you wouldn't see that, since you started the practice of actually benchmarking our API calls during dev :) | 16:58 |
TheJulia | with fixed counts :) | 16:58 |
JayF | TheJulia: most of the time, I used oslo notifications to derive that info... | 16:58 |
JayF | TheJulia: timing was more for troubleshooting or isolating perf issues | 16:58 |
JayF | this is part of why the app metrics seem so weird now | 16:58 |
TheJulia | wait... | 16:59 |
JayF | you don't need them as much because we don't have as many "101-level" performance metrics as you used to | 16:59 |
TheJulia | where is the oslo notifications consumption? | 16:59 |
TheJulia | or did that not make it upstream? | 16:59 |
JayF | Ironic emits oslo notifications on state change | 16:59 |
JayF | most places I've worked emitted them to splunk where you could aggregate them | 16:59 |
JayF | https://docs.openstack.org/ironic/6.2.2/dev/notifications.html | 17:00 |
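Enabling those notifications is a small config change (illustrative; the transport itself comes from the oslo.messaging configuration):

```ini
[DEFAULT]
# Emit versioned notifications (node state changes among them) on the
# message bus; 'info' is an illustrative level, not a recommendation.
notification_level = info
```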
TheJulia | ahh | 17:00 |
JayF | Like, if you just wanna quick and dirty solve this, use the statsd->prom connector... if we're going to enhance it, lets actually fix it though | 17:00 |
TheJulia | That is way beyond the level of info we need | 17:00 |
TheJulia | well, there is no quick and dirty if I need both streams | 17:01 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 17:01 |
TheJulia | on a prometheus payload :( | 17:01 |
JayF | TheJulia: so a manager rloo and I used to share used oslo notifications going to splunk; he aggregated nova and ironic notifications to create a splunk "success %" dashboard for how many builds succeeded | 17:01 |
JayF | TheJulia: wait, why do you need them on the *same payload* | 17:01 |
JayF | is there some prom limitation I don't have awareness of? | 17:01 |
TheJulia | because you do a poll per service/thing | 17:01 |
TheJulia | yeah | 17:01 |
TheJulia | it polls an http endpoint for a file | 17:01 |
TheJulia | which is all of the metrics data for it | 17:02 |
TheJulia | in a particular format | 17:02 |
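That "particular format" is the Prometheus text exposition format, where each metric carries a name, type, optional labels, and help text; the metric below is invented for illustration:

```text
# HELP ironic_node_power_state Node power state (1 = on); illustrative name.
# TYPE ironic_node_power_state gauge
ironic_node_power_state{node="8280e833-8a86-4ee1-8cf6-675d55d54970"} 1
```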
TheJulia | moving to a proxy *would* potentially allow us to sunset IPE | 17:02 |
JayF | It sounds like having prom fetch those metrics twice (node metrics then app metrics) is the correct configuration | 17:02 |
JayF | for how our code is currently structured | 17:02 |
TheJulia | iurygregory: would that make anyone in your area sad/unhappy? | 17:02 |
JayF | they are two separate gathering and export mechanisms | 17:02 |
JayF | so why would we aggregate their output together for prom? | 17:02 |
TheJulia | JayF: aiui, you actually can't if you want them connected; they are different query/sensor areas, since it is modeled as a single service/operation | 17:03 |
JayF | we need to tie together all the way back if it's the same; or leave them separate all the way.... last minute tie in is the worst of all worlds: ops don't have flexibility to split app/node metrics and we still have two codepaths to do the same thing | 17:03 |
JayF | TheJulia: I'm saying our code currently strongly implies "these are not tied together" because it's two separate code paths altogether | 17:04 |
TheJulia | I'm stressing that our metrics stuff was always intended with ceilometer as the consumer | 17:04 |
TheJulia | and ipe was a means to an end without deploying anything else or retooling stats | 17:04 |
TheJulia | oh | 17:05 |
TheJulia | you know what the issue is | 17:05 |
TheJulia | that payload to prometheus has to have descriptive text | 17:05 |
TheJulia | https://github.com/prometheus/statsd_exporter#explicit-metric-type-mapping | 17:05 |
TheJulia | and the sensor data we get from BMCs includes freeform stuff that we predictably translate | 17:05 |
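This is the sticking point with statsd_exporter: each statsd name must be mapped to a Prometheus name and labels, roughly as below. The mapping is illustrative; freeform BMC sensor names would make such a file painful to maintain.

```yaml
mappings:
  # Capture one segment of the statsd metric name and expose it as a
  # Prometheus label (names here are invented for illustration).
  - match: "ironic.conductor.*"
    name: "ironic_conductor_operation"
    labels:
      operation: "$1"
```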
JayF | that also makes sense as to why we were talking past each other a little bit | 17:06 |
JayF | that mismatch of "metric type" and "metric name" being separate in statsd but not prom | 17:06 |
JayF | Would it be useful for us to talk about this sync a little bit? | 17:06 |
TheJulia | yeah, prometheus has the additional layer of labeling for each metric | 17:07 |
JayF | I really wish I had a whiteboard and physical proximity to talk about this sorta thing | 17:07 |
TheJulia | well, this was supposed to be a lightweight exploration | 17:07 |
JayF | hard to do in text only | 17:07 |
TheJulia | going down a heavyweight path might mean we punt | 17:07 |
TheJulia | just, being completely transparent | 17:07 |
JayF | so lets back up a step then and ask the core question: what's your end goal? | 17:07 |
JayF | What's the specific thing you want to achieve? Not even "app metrics for prometheus" but like "I want to be able to count how many widgets we provisioned" or whatever | 17:08 |
TheJulia | to get meaningful metric data about ironic's operation out of ironic, plus the node sensor data, into prometheus | 17:08 |
JayF | node sensor data already goes to prom via IPE, right? | 17:08 |
TheJulia | if deployed, yes | 17:08 |
JayF | ack | 17:08 |
TheJulia | The goal would be "how many deploys, how many times we continued cleaning" sort of stuff | 17:09 |
JayF | So, if we merged this prom backend for metrics when it's done, would the next step be to make node sensor data use that natively, and deprecate "ceilometer"/ipe support? | 17:09 |
TheJulia | we don't need the classic notifications payloads because that is *way* too much information and is more for logging and log statistic aggregation tools like splunk | 17:09 |
TheJulia | I'm unsure we could merge a prometheus backend because we would need another service | 17:10 |
TheJulia | or another HTTP endpoint to offer the content up | 17:10 |
JayF | is there any concept of an official prometheus agent? | 17:11 |
JayF | that does that http endpoint for "free" for you? | 17:11 |
TheJulia | not really, everything is supposed to export for pickup by prometheus | 17:11 |
JayF | I assume the frequency of updates is such that using something intermediate, like uploading the stats file to a remote host somehow (webdav, swift temp urls, etc?) is out of the question | 17:11 |
TheJulia | there are things like node exporter, which walks a bunch of endpoints, and creates the text document when called | 17:12 |
TheJulia | kind of yeah, ipe does two things. It has a plugin that transforms the data, and then runs a tiny webserver for prometheus to grab the data out from it | 17:12 |
JayF | So does your PR to add a prom backend, in that case, target IPE as the intermediary? | 17:13 |
TheJulia | the path I was heading down was to graft on to the metrics collection interface in ironic-lib to serve as the collection point where I could then pull the data out of ironic's memory every time the conductor shipped a node sensor update to IPE | 17:14 |
TheJulia | IPE picks that up, transforms it, and has it ready for when promethus goes to get an update | 17:15 |
JayF | I'm not sure how you could do that from the ironic-lib level | 17:15 |
JayF | that sounds like a new API endpoint? | 17:16 |
TheJulia | Ironic-lib would be in memory and could be the keeper of the local stats data, it would mean a new method, but basic testing I've done shows that works as expected at the moment | 17:18 |
JayF | I'm confused as to how IPE and Ironic-lib communicate for this purpose | 17:18 |
rpittau | bye everyone o/ | 17:18 |
TheJulia | we have a periodic in the conductor which does a sweep for sensor data | 17:19 |
JayF | o/ | 17:19 |
TheJulia | to ship to the oslo notification endpoint, which IPE picks up | 17:19 |
TheJulia | so it would just be a "oh, lets grab this data and send it along too" action | 17:19 |
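Put together, the idea on the table is roughly the following; every name here is a stand-in, not existing conductor or IPE code:

```python
# Conceptual flow: the existing conductor periodic sweeps BMC sensor
# data, and the in-memory ironic-lib metrics ride along on the same
# notification for IPE to transform and expose to Prometheus.

def collect_bmc_sensor_data():
    return {'temperature': {'inlet_c': 24}}  # stand-in for the BMC sweep

def collect_app_metrics():
    return {'deploy_count': 3}  # stand-in for ironic-lib's in-memory stats

def sensor_periodic(notify):
    payload = collect_bmc_sensor_data()
    payload['ironic_metrics'] = collect_app_metrics()
    notify('hardware.metrics.update', payload)  # picked up by the IPE plugin

sensor_periodic(lambda event, data: print(event, data))
```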
JayF | okay; I dig it | 17:20 |
JayF | the thing that is cool about this, which I wonder if you've forgotten | 17:20 |
JayF | is we have IPA plumbed up to metrics, too | 17:20 |
JayF | so you could even enhance node sensor data with in-band metrics from IPA, if desired, too | 17:20 |
JayF | TheJulia: so here's my suggestion: don't worry about statsd compatibility, change the interface with impunity if you need to make it more prom-friendly, and we retire the statsd backend over time | 17:21 |
JayF | and then, sometime off in the future, we tie the existing ironic node sensor data fetching into metrics | 17:22 |
JayF | (from ironic_lib) | 17:22 |
JayF | and in fact, that could be a very low hanging fruit for a new/more junior contributor | 17:22 |
JayF | especially if we make sure to tailor the interface to be well suited for how we use it now (and ignore the use cases from 7 years ago which are not in use so much now) | 17:22 |
TheJulia | yeah... so then I would need another exporter plugin or to retool IPE's operation | 17:23 |
TheJulia | hmmm | 17:23 |
JayF | Why would you need another exporter plugin or to retool IPE's operation? | 17:23 |
JayF | I think you laid out a pretty clear migration path; intentional or not: | 17:24 |
JayF | - add IPE-compatible backend to metrics | 17:24 |
TheJulia | If I'm remembering correctly, it is one-shot operations, not disjointed, mashed-together data operations | 17:24 |
JayF | - IPE gets ir-lib metrics in the same polling loop it does for node sensor data | 17:24 |
iurygregory | TheJulia, let me scroll back to read things | 17:25 |
JayF | there's nothing wrong with a generic endpoint in the metrics lib that says | 17:25 |
JayF | "here are some metrics from out of band" | 17:25 |
TheJulia | ipe doesn't load ir-lib | 17:25 |
TheJulia | ipe is an entirely separate process | 17:25 |
TheJulia | so it doesn't have access to memory, we would need someplace to store the data | 17:25 |
TheJulia | hmmmmm | 17:25 |
JayF | I'm extremely confused; I thought you said IPE polls conductor for the metrics data | 17:25 |
TheJulia | no, the conductor does it and ships to IPE via the notification plugin | 17:26 |
* TheJulia needs to see how that gets invoked | 17:26 |
TheJulia | we *might* actually be able to ask ironic lib from the plugin | 17:26 |
JayF | ack; that's the point of misunderstanding I had | 17:26 |
TheJulia | so... this might actually just work if we ask from the i-p-e plugin to ironic-lib for data when running inside of the conductor context | 17:28 |
JayF | so if we figure out that point of contention, the path is pretty clear, yeah? | 17:28 |
TheJulia | yeah, it is basically the same path I've been on though :) | 17:28 |
JayF | yeah but I didn't see it until now | 17:29 |
iurygregory | going to lose some connection, brb | 17:29 |
JayF | it was foggy ;) | 17:29 |
TheJulia | oh, it is very confusing :) | 17:29 |
TheJulia | that is for sure | 17:29 |
JayF | so okay; thank you for the lesson and the little bit of back and forth | 17:29 |
JayF | I'm a little worried this is one of those situations where I feel like "we" figured something out but in reality; I was just confused the whole time until now lol | 17:29 |
TheJulia | I think I need to fire up a devstack soon() and dig through this to make sure it would work | 17:30 |
JayF | I do think that prom->statsd thing will be useful though | 17:30 |
JayF | in terms of figuring out how to translate the metrics from the existing interface terms | 17:31 |
JayF | not as a service; but as like, a reference | 17:31 |
TheJulia | yeah, that is a good point, and I'm realizing now to hand things out to prometheus, we're going to have to label them properly beyond the name | 17:31 |
TheJulia | which means we either create a reference, or... hmmm | 17:31 |
JayF | I suspect you have all the information you need | 17:31 |
* TheJulia puts her glasses on and opens a spreadsheet to review test cases | 17:32 | |
JayF | because the type of metric asked for is one piece of data; the name of the metric is the other | 17:32 |
TheJulia | I, unfortunately, don't exactly remember what the prometheus labeling is supposed to look like | 17:32 |
iurygregory | TheJulia, let me see if I understood your question correctly: you are referring to "fetch those metrics twice"? | 17:41 |
TheJulia | in what context? | 17:41 |
iurygregory | what JayF said before you asked me "would that make anyone in your area sad/unhappy? " | 17:43 |
JayF | we were talking about killing either existing ir-lib metrics or IPE at that point | 17:43 |
iurygregory | I don't see a problem tbh | 17:43 |
JayF | I think we landed on "make ir-lib metrics use IPE" | 17:44 |
JayF | which would, if anything, make you happier | 17:44 |
TheJulia | and would allow us to collect all the data together and have it handy in the prometheus way | 17:44 |
iurygregory | I do think it makes sense to collect all data and provide in the prometheus format | 17:45 |
TheJulia | One intermediate thought we reached is what if we dropped i-p-e for the statsd proxy exporter, but then I realized the need to maintain label mappings would be quite problematic for IPMI data in particular | 17:50 |
iurygregory | yeah, ipmi data is "funny" | 18:00 |
TheJulia | "funny" is a polite way of putting it | 18:41 |
stevebaker[m] | good morning | 19:08 |
TheJulia | o/ stevebaker[m] | 19:08 |
* stevebaker[m] starts reading the backscroll | 19:08 | |
TheJulia | hehe | 19:08 |
TheJulia | would you have a few minutes to join a call in say 10-15 m? | 19:08 |
* TheJulia guesses no :) | 19:32 | |
stevebaker[m] | yes! | 19:36 |
stevebaker[m] | TheJulia: just caught up | 19:36 |
stevebaker[m] | TheJulia: What is your github username? I'm just updating ironic-operator/OWNERS | 19:42 |
JayF | https://github.com/openstack/ironic/graphs/contributors looks like juliakreger | 19:44 |
stevebaker[m] | thanks | 19:45 |
TheJulia | yup | 19:56 |
TheJulia | Thanks! | 19:56 |
TheJulia | sorry, I stepped outside for a few to obtain my daily allocation of vitamin D | 19:57 |
TheJulia | stevebaker[m]: jparoly sent you an email downstream w/r/t what I was pinging you about. I have 2 hours before my next and final call of the day | 19:58 |
*** tosky_ is now known as tosky | 21:12 | |
opendevreview | Jay Faulkner proposed openstack/ironic master: DB & Object layer for node.shard https://review.opendev.org/c/openstack/ironic/+/864236 | 22:11 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866235 | 22:11 |
JayF | just bringing it in line with lint and the spec as written, still need to get into rbac testing | 22:12 |
JayF | and TheJulia I might take that sync intro after failing at RBAC testing for a bit on my own | 22:12 |
* JayF set up some stuff in yaml, couldn't get it to pass no matter what | 22:13 |
TheJulia | would 7 am work tomorrow? | 22:15 |
TheJulia | I can also do 11 am | 22:15 |
JayF | probably, yeah | 22:15 |
JayF | like 7am-ish, I start at 7am but not always able to communicate to other humans until some caff is ingested lol | 22:15 |
JayF | I suspect it's like, something basic I'm missing | 22:18 |
JayF | maybe just a single breadcrumb needed | 22:18 |
JayF | I'll even take another stab at it this afternoon; I'd like to get that RBAC testing in before I work on port queries filtered by node | 22:19 |
JayF | TheJulia: I think I managed to get on the right track starting again from scratch | 22:47 |
JayF | TheJulia: the yaml file is so well formatted that I did `shard_patch_set_node_shard_disallowed` and copilot literally completed the entire rest of the dict :) | 22:47 |
TheJulia | creepy magic! | 22:48 |
TheJulia | heh | 22:48 |
TheJulia | okay, well I'll be happy to look at things in the morning | 22:48 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866235 | 22:49 |
JayF | TheJulia: feel free to do it as a code review ^ | 22:49 |
JayF | I suspect there may be cases I wanna check that I'm not checking now, but there is checking being done now and I have confidence the rules are working | 22:49 |
TheJulia | cool cool | 22:57 |
TheJulia | well, off to the grocery store I go | 23:01 |
JayF | TheJulia: iurygregory: I don't know which of you is invested in bugfix branches; but we're about a month late for our first one of the cycle. My question is essentially: should we actually cut one? | 23:10 |
JayF | I am happy to propose a change to our release policy indicating we *may* cut a bugfix release (much happier than I'd be cutting this release with no known consumers of it) | 23:11 |
vanou | JayF TheJulia: if possible, could you give me feedback on my concern in the last comment at https://review.opendev.org/c/openstack/ironic/+/865075 ? | 23:16 |
JayF | oh yeah, that was on my mental list and dropped off | 23:17 |
JayF | I'll put this comment in the PR as well, but essentially, don't let yourself be limited by what exists | 23:18 |
JayF | if we need to add a way, in our ipmi module, to send a single message with no retries, or similar, I think you should just add it | 23:18 |
JayF | or more likely, promote some method from private to public in the ipmi modules, because it almost certainly already exists | 23:18 |
vanou | Thanks for caring. I'll reconsider how to deal with it based on your advice :) | 23:20 |
JayF | and fwiw, you get some sympathy :) I know in this sort of patch you have two sets of folks to answer to: the ones downstream who want it a certain way, and then we in Ironic who want it a certain way | 23:22 |
JayF | so don't hesitate to ask if you get stuck or need more direction | 23:23 |
vanou | Thanks a lot JayF o/ | 23:23 |
JayF | \o | 23:23 |
vanou | If I change an IPMI method/function, I think I should make a separate patch for that modification. Is that correct? | 23:25 |
JayF | Honestly, I'd just do it directly in the patch we're talking about; that way we can see it as a result of our comments. | 23:26 |
opendevreview | Verification of a change to openstack/ironic-python-agent stable/victoria failed: Drop python2 from bindep.txt https://review.opendev.org/c/openstack/ironic-python-agent/+/862656 | 23:26 |
JayF | If it helps you to split it up; feel free | 23:26 |
vanou | OK. Thanks :) | 23:27 |
TheJulia | JayF: in the past, when we have been late, typically we have evaluated whether there has been anything substantive | 23:28 |
TheJulia | And just skipping if it has been a quiet month or two | 23:29 |
JayF | TheJulia: I'm asking the question not because we're late; but because part of the reason we're late is some chatter in here that the release wouldn't be consumed | 23:29 |
JayF | TheJulia: and also I'm info gathering; I want to have a patch for our release policy ready for when I retire bugfix branches so I can feel like I cleaned up all the release-debt from that :D | 23:29 |
TheJulia | Ack, to be clear, we have skipped a few times when there just wasn’t a good reason to ship | 23:31 |
TheJulia | Around holidays in particular | 23:31 |
JayF | this would absolutely apply | 23:31 |
JayF | and we should also make sure our published policies fit reality around this | 23:31 |
JayF | even if it just means adding more "squish" to the public policy :D | 23:32 |
TheJulia | I thought we already allowed for that actually, but it has been a while | 23:34 |
JayF | you're 10000% right | 23:35 |
JayF | there's nothing about the release cadence in the spec now | 23:36 |
JayF | I'm not insane; it used to say that, right? | 23:36 |
JayF | nope; it never said that | 23:38 |
TheJulia | heh | 23:45 |