opendevreview | Merged openstack/ironic-inspector master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job https://review.opendev.org/c/openstack/ironic-inspector/+/867573 | 00:04 |
vanou | good morning ironic | 02:27 |
vanou | TheJulia JayF: Thanks for the review on https://review.opendev.org/c/openstack/ironic/+/865075 I have one concern about the try/fail model. If I follow this approach, the driver will first try IPMI and then, if IPMI fails, try Redfish. However, the IPMI code in Ironic resends the IPMI command until the timeout is reached (default 60 seconds), so this try/fail model can introduce a 60s delay on every failed IPMI attempt. | 06:00 |
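A minimal sketch of the ordering vanou describes, under the assumption that the IPMI attempt can be bounded; `ipmi_ping`/`redfish_ping` and the timeout parameter are hypothetical stand-ins, not the patch under review:

```python
# Hypothetical stand-ins throughout: the point is that the IPMI attempt
# must be bounded, otherwise Ironic's retry-until-timeout behaviour
# (default ~60 s) delays every fallback to Redfish.

class IPMIError(Exception):
    pass

def ipmi_ping(node, timeout):
    raise IPMIError()  # pretend this BMC does not answer IPMI

def redfish_ping(node):
    return True  # pretend this BMC answers Redfish

def detect_protocol(node, ipmi_timeout=5):
    try:
        ipmi_ping(node, timeout=ipmi_timeout)  # single bounded attempt
        return 'ipmi'
    except IPMIError:
        redfish_ping(node)
        return 'redfish'

print(detect_protocol('node-1'))  # -> 'redfish'
```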
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for ports node_uuid https://review.opendev.org/c/openstack/ironic/+/862933 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for port groups node_uuid https://review.opendev.org/c/openstack/ironic/+/864781 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for node chassis_uuid https://review.opendev.org/c/openstack/ironic/+/864802 | 08:15 |
opendevreview | Harald Jensås proposed openstack/ironic master: Use association_proxy for node allocation_uuid https://review.opendev.org/c/openstack/ironic/+/865989 | 08:15 |
codecap_ | Hi guys, | 08:36 |
codecap_ | I'm experiencing this problem: | 08:36 |
codecap_ | Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:98', '4c:52:62:42:de:96', '4c:52:62:52:ae:2b', '4c:52:62:42:de:97', '4c:52:62:42:de:95', '4c:52:62:52:ae:2a']} | 08:36 |
codecap_ | Can anybody help me understand the reason for this problem? | 08:36 |
rpittau | good morning ironic! o/ | 08:59 |
kubajj | Morning rpittau and ironic o/ | 09:29 |
rpittau | hey kubajj :) | 09:29 |
dtantsur | codecap_: could be a mismatch between existing BMC address and ports on the Ironic node | 09:37 |
codecap_ | dtantsur: Hi ! Can you please explain ? | 09:38 |
dtantsur | codecap_: so, you're inspecting the node right? the node has a BMC (IPMI/Redfish/iDRAC/...) address and maybe some pre-existing ports? | 09:39 |
codecap_ | dtantsur: IPMI ... pre-existing ports ? | 09:40 |
dtantsur | codecap_: okay, let's take a step back. What exactly are you doing? How familiar with Ironic are you? | 09:42 |
codecap_ | dtantsur: I'm trying to install and use it ... not much experience, I would say ... I've already learned how to use Bifrost. The problem I described above occurs when I use Ironic within OpenStack: the node being deployed comes up under the control of IPA, and at one point IPA tries to contact the inspector, which replies with HTTP 500 | 09:48 |
dtantsur | Is it still Bifrost or another installation with the rest of OpenStack? | 09:53 |
dtantsur | within the Bifrost context, attempting to contact Inspector first is normal | 09:54 |
dtantsur | when used with Nova/Neutron, we usually expect DHCP to be set up correctly to avoid that (in which case it may be a problem with networking or Neutron) | 09:54 |
dtantsur | codecap_: ^^ | 09:55 |
dtantsur | I need to step away for a while, other folks may help you further (or I'll check your messages when I'm back) | 09:56 |
codecap_ | dtantsur: bifrost is completely independent. Ironic within OpenStack was installed by kolla-ansible ... DHCP looks good, so the node being deployed starts loading the IPA image | 09:57 |
codecap_ | dtantsur: can you explain to me how Inspector searches for the node when contacted by IPA? | 10:00 |
opendevreview | Merged openstack/ironic master: Ironic doesn't use metering; don't start it in CI https://review.opendev.org/c/openstack/ironic/+/867574 | 10:21 |
ajya | janders: iDRAC 6.10.00.00 has been released, if you want to try it out | 10:30 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 10:42 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 10:48 |
dtantsur | codecap_: when inspection is started, it caches BMC address (ipmi_address, etc) and MAC addresses to use them for lookup later. Note that if you don't start inspection, it will never work. | 11:12 |
dtantsur | codecap_: so if you end up inspecting when you meant to deploy, something went wrong on the networking level | 11:12 |
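A simplified sketch of the lookup being discussed (illustrative only; the real matching logic lives in ironic-inspector):

```python
# Inspector caches the BMC address and MACs when inspection starts, then
# matches IPA's lookup request against that cache. With an empty cache
# (inspection never started) the result is the "Could not find a node
# for attributes" error shown earlier.

def lookup(cached_nodes, bmc_address, macs):
    for node in cached_nodes:
        if bmc_address in node.get('bmc_address', []):
            return node
        if set(macs) & set(node.get('mac', [])):
            return node
    return None

cache = [{'uuid': '8280e833-...', 'bmc_address': ['10.40.11.68'],
          'mac': ['4c:52:62:52:ae:2a']}]
print(lookup(cache, '10.40.11.68', ['4c:52:62:52:ae:2a']))
```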
codecap_ | dtantsur: so what is the order a node should be deployed in ? Create -> Inspect -> Deploy ? | 11:14 |
kubajj | dtantsur: We use the pecan REST for the decorators, right? | 11:17 |
dtantsur | codecap_: it depends on what you need. Inspection is optional. | 11:21 |
dtantsur | kubajj: I think we have a lot of our own decorator code, inherited from a project called wsme | 11:22 |
kubajj | dtantsur: and these are defined in the api repo? | 11:23 |
kubajj | (not repo, directory) | 11:23 |
codecap_ | dtantsur: how can I debug why the Lookup fails ? Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97' ... both (bmc_address and macs) are correct ones | 11:28 |
dtantsur | kubajj: yep, somewhere there | 11:29 |
dtantsur | codecap_: do you start inspection? if not, it will never work. if you start deploy and end up inspecting, you need to see what went wrong on the PXE stage. | 11:29 |
dtantsur | in a normal openstack installation, different dnsmasq instances are responsible for two processes | 11:29 |
dtantsur | when inspection is not running, ironic-inspector's dnsmasq is blocked for these nodes by its own filters, while neutron's is working | 11:30 |
dtantsur | somewhere there something went wrong, but I cannot guess what without a deep look | 11:31 |
codecap_ | dtantsur: Inspection works | 11:31 |
codecap_ | dtantsur: [node: 8280e833-8a86-4ee1-8cf6-675d55d54970 state finished MAC 4c:52:62:52:ae:2a BMC 10.40.11.68] Introspection finished successfully | 11:31 |
codecap_ | dtantsur: after successful inspection, the found data should be saved in the ironic-inspector DB, right? | 11:33 |
dtantsur | codecap_: if you have it set up to do it, yes (I don't know how kolla sets up inspector) | 11:34 |
dtantsur | so, inspection works, deployment does not? | 11:34 |
codecap_ | dtantsur: deployment fails with Look up error: Could not find a node for attributes {'bmc_address': ['10.40.11.68'], 'mac': ['4c:52:62:42:de:97', .... | 11:36 |
dtantsur | right, so your flow erroneously goes back into inspection | 11:37 |
dtantsur | codecap_: see what your PXE filter is ([pxe_filter]driver in inspector.conf). check for suspicious messages about it in inspector logs. | 11:38 |
kubajj | dtantsur: I am sorry for asking so many questions, but I've been stuck on this for more than a day now. Is the api I am aiming to implement much different from the StatesController? | 11:38 |
dtantsur | kubajj: it should not be (and don't worry about questions). It's not entirely clear to me at which step you're stuck now. If you elaborate, I may be able to help better. | 11:39 |
codecap_ | dtantsur: driver = noop | 11:40 |
dtantsur | that explains... | 11:41 |
dtantsur | is it something you did or something kolla set up for you? | 11:41 |
dtantsur | (driver noop means that access to inspector's dnsmasq is not limited, which means that inspection always starts) | 11:41 |
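For reference, the setting under discussion lives in inspector.conf; a minimal example of the recommended value follows (as dtantsur notes shortly after, extra dnsmasq configuration may be required, so this is a starting point rather than a complete recipe):

```ini
[pxe_filter]
# 'noop' leaves inspector's dnsmasq unfiltered, so any PXE-booting node
# lands in inspection; 'dnsmasq' makes inspector open and close access
# per node, only answering nodes actually being inspected.
driver = dnsmasq
```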
kubajj | dtantsur: I am still getting the "Missing mandatory parameter" node_ident error. The thing is, the states controller uses just node_ident instead of self.node_ident and does not have the constructor (which Julia suggested I implement as in the history controller) and it is able to be called from the node controller https://opendev.org/openstack/ironic/src/branch/master/ironic/api/controllers/v1/node.py#L1956 | 11:43 |
kubajj | I am trying to figure out why I can't call it as well | 11:43 |
dtantsur | okay, I'll check after the current meeting (please upload the latest patch) | 11:43 |
codecap_ | dtantsur: kolla configures it this way | 11:44 |
dtantsur | codecap_: then please ask mgoddard and other folks on #openstack-kolla which architecture they have in mind | 11:45 |
dtantsur | it feels like it's a mix of a standalone architecture and normal openstack one.. but I may simply be missing what they mean | 11:45 |
codecap_ | dtantsur: should it be driver = dnsmasq? | 11:53 |
dtantsur | codecap_: that's what we recommend, but it may require additional configuration, so I'd check with kolla folks first | 11:54 |
dtantsur | I assume they don't have a lot of docs around this area? | 11:54 |
codecap_ | dtantsur: sure ... will also ask there for help | 11:57 |
codecap_ | dtantsur: thanx a lot ! | 11:58 |
codecap_ | dtantsur: I've already read a lot ... but still not smart enough :) | 11:58 |
dtantsur | networking is the least trivial thing in openstack IMO :) | 11:58 |
opendevreview | Jakub Jelinek proposed openstack/ironic master: WIP: API for node inventory https://review.opendev.org/c/openstack/ironic/+/866876 | 12:16 |
kubajj | dtantsur: somehow I managed to fix it. What should happen if I try to access a node inventory of a node that doesn't have one? | 12:17 |
kubajj | dtantsur: What should be the behaviour if the version is too old? | 12:17 |
dtantsur | kubajj: in both cases we return 404 with an appropriate message | 12:31 |
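A hedged sketch of those two 404 rules; `NotFound`, the helper, and the version check below are illustrative stand-ins rather than ironic's actual controller code:

```python
class NotFound(Exception):
    code = 404

def get_inventory(node, requested_version, min_version):
    # Too old an API microversion: behave as if the resource is absent.
    if requested_version < min_version:
        raise NotFound('API version too old for node inventory')
    # Node exists but inspection never stored an inventory: also 404.
    if node.get('inventory') is None:
        raise NotFound('node %s has no inventory' % node['uuid'])
    return node['inventory']
```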
opendevreview | Jakub Jelinek proposed openstack/ironic master: API for node inventory https://review.opendev.org/c/openstack/ironic/+/866876 | 13:18 |
kubajj | dtantsur: if you had a moment, feedback for ^ would be appreciated | 13:19 |
dtantsur | sure, will put on my queue | 13:20 |
dtantsur | kubajj: before that: this is the point where we need a release note since we're finally adding something user-visible. I can help with wording, but feel free to propose a draft. | 13:23 |
kubajj | dtantsur: I am just drafting it, I was expecting you to say that we need it 😀 | 13:25 |
dtantsur | great :) | 13:25 |
kubajj | dtantsur: I need apiref as well, right? | 13:40 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 14:09 |
mgoddard | ajya: Hi, we're having trouble with cleaning on some Dell hardware. We have NVMes behind a RAID controller, and it seems to be falling back to shredding rather than secure erase | 14:13 |
mgoddard | RAID controller: https://dl.dell.com/content/manual61357984-dell-poweredge-raid-controller-11-user-s-guide-perc-h755-adapter-h755-front-sas-h755n-front-nvme-h755-mx-adapter-h750-adapter-sas-h355-adapter-sas-h355-front-sas-h350-adapter-sas-h350-mini-monolithic-sas.pdf?language=en-us&ps=true | 14:13 |
mgoddard | NVMe: https://www.dell.com/en-us/shop/dell-384tb-data-center-nvme-read-intensive-ag-drive-u2-gen4-with-carrier/apd/400-bmto/storage-drives-media#tabs_section | 14:13 |
mgoddard | have you seen this before? | 14:13 |
ajya | mgoddard: haven't seen, but I can check. What commands are being used? | 14:15 |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent master: [DNM] test ci https://review.opendev.org/c/openstack/ironic-python-agent/+/867659 | 14:34 |
TheJulia | mgoddard: just standard automated cleaning? if so I suspect you'll need to share an agent log from cleaning, but I wonder whether the raid controller's capabilities actually inhibit some of that | 14:42 |
dtantsur | kubajj: yes please | 14:45 |
dtantsur | mgoddard: I believe TheJulia is right: I don't think we can see the NVMe behind the controller | 14:46 |
TheJulia | secure erase as the code uses it is an ATA concept | 14:46 |
mgoddard | that does seem to be the case | 14:46 |
mgoddard | nvme list is empty | 14:47 |
TheJulia | scsi, depending on the protocol exposed by the raid controller, has no concept of it | 14:47 |
mgoddard | disks show up as /dev/sdX | 14:47 |
TheJulia | how do the devices appear to an OS? | 14:47 |
TheJulia | oh! | 14:47 |
TheJulia | so yeah | 14:47 |
TheJulia | they are exposed as SCSI devices :( | 14:47 |
* TheJulia is surprised they are not exposed as NVMe devices | 14:47 |
mgoddard | wondering whether it kills the benefits of NVMe | 14:47 |
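A quick check for this situation from a ramdisk shell (a sketch, not IPA code): NVMe drives hidden behind a RAID controller enumerate as SCSI block devices, so the NVMe/ATA secure-erase paths never trigger.

```python
import json
import subprocess

# List top-level disks with their transport: behind a PERC you typically
# see sd* devices with no 'nvme' transport, matching the symptoms above.
out = subprocess.run(
    ['lsblk', '--nodeps', '--json', '-o', 'NAME,TRAN,TYPE'],
    capture_output=True, text=True, check=True).stdout
for disk in json.loads(out)['blockdevices']:
    print(disk['name'], disk['tran'], disk['type'])
```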
TheJulia | the raid controller utilities *might* have a tool to help enable this | 14:47 |
TheJulia | raid controllers can be useful, and they can also hurt performance/usage at times | 14:48 |
TheJulia | it is a value tradeoff unfortunately | 14:48 |
dtantsur | or rebuild the raid every time.. | 14:48 |
TheJulia | yeah | 14:49 |
TheJulia | the raid controller, depending on how smart it is might issue unmap/deallocate commands to the device | 14:49 |
TheJulia | or it could write zeros | 14:49 |
TheJulia | or keep a memory map and just double zero | 14:49 |
TheJulia | like qcow2s | 14:49 |
* TheJulia has crashed many PERCs in her career | 14:51 |
mgoddard | seems like we need to try to expose the NVMes directly, if possible | 14:53 |
TheJulia | so, you could try disabling raid mode | 14:53 |
mgoddard | would much prefer not to need to use some custom tool in IPA (if that is even possible) | 14:53 |
TheJulia | the bios firmware should have a mode setting for the card | 14:53 |
mgoddard | there isn't currently any RAID virtual disk configured | 14:53 |
TheJulia | Oh, yeah, even better reason to check the card settings | 14:54 |
mgoddard | ok, I'll have a poke around | 14:54 |
mgoddard | thanks all | 14:54 |
TheJulia | so fwiw, people have wrapped individual binaries in their ramdisks to do special things, and a hardware plugin can always override, but disabling RAID mode on the card would hopefully just make it a pass-through | 14:55 |
TheJulia | Now, if that translates the protocol, that is a whole different question | 14:55 |
TheJulia | and if it does.... yeouch. We can work on figuring something out | 14:55 |
TheJulia | you'll know quickly after rebooting the machine with the raid controller card in a different mode | 14:56 |
* TheJulia wonders if we have this sort of caveat documented heavily | 14:56 | |
dtantsur | I feel like we do not | 15:05 |
TheJulia | I'm writing something now | 15:13 |
dtantsur | I recall there were some talks about dropping dhcp-all-interfaces from IPA deps, does anyone else remember? TheJulia? | 15:23 |
TheJulia | we talked about it because network manager has similar behavior | 15:23 |
TheJulia | although I think we now set networkmanager to behave like dhcp-all-interfaces | 15:23 |
TheJulia | so apparently there is a SCSI FORMAT UNIT command | 15:26 |
dtantsur | got someone complaining about the time it takes to DHCP interfaces that don't have a DHCP server on them | 15:26 |
TheJulia | the counterbalancing issue is that if you turn that off, networkmanager is going to focus on the first interface it finds | 15:31 |
TheJulia | and you may lose connectivity because there could be multiple interfaces :( | 15:31 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 15:55 |
TheJulia | mgoddard: dtantsur: ^^ | 15:55 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 15:57 |
JayF | We probably should mention that we have explicit examples for making your own hardware managers | 15:57 |
JayF | with tooling provided by raid card vendor | 15:57 |
JayF | to use the actual-vendor-tool to do this if you want | 15:57 |
JayF | https://github.com/openstack/ironic-python-agent/tree/master/examples/custom-disk-erase | 15:57 |
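In the spirit of the linked example, a vendor-tool hardware manager is short; the sketch below is from memory and the `vendor-secure-erase` CLI is invented, so treat the linked example as authoritative:

```python
from ironic_lib import utils as il_utils
from ironic_python_agent import hardware


class VendorDiskEraser(hardware.HardwareManager):
    HARDWARE_MANAGER_NAME = 'VendorDiskEraser'
    HARDWARE_MANAGER_VERSION = '1.0'

    def evaluate_hardware_support(self):
        # Outrank the generic manager so this erase path wins.
        return hardware.HardwareSupport.SERVICE_PROVIDER

    def erase_block_device(self, node, block_device):
        # Delegate the erase to the RAID vendor's tool (invented name).
        il_utils.execute('vendor-secure-erase', block_device.name)
```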
TheJulia | we've not seen that as much; most vendors are trying to integrate purely via BMC interactions, but yeah | 15:58 |
* TheJulia revises | 15:58 | |
JayF | Or, potentially, that pathway is better documented so more people self-serve it without bothering us ;) | 15:58 |
dtantsur | TheJulia: one formatting issue inline | 15:58 |
* JayF knows of at least 3 installs that have used it without consulting with us | 15:58 | |
TheJulia | ++ | 15:58 |
TheJulia | dtantsur: which line? | 15:58 |
TheJulia | JayF: that is an interesting data point | 15:59 |
dtantsur | TheJulia: 1091, need to use a :doc: link, otherwise it won't work when relocated or built locally | 15:59 |
TheJulia | ack | 15:59 |
JayF | TheJulia: well, it's what I tend to ask people about given it's the piece I know the most about, so there's a lot of confirmation bias :D | 15:59 |
opendevreview | Aija Jauntēva proposed openstack/sushy master: Fix exceeding retries https://review.opendev.org/c/openstack/sushy/+/867675 | 16:00 |
dtantsur | see you tomorrow folks | 16:00 |
ajya | rpittau: see the patch ^ | 16:00 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 16:05 |
TheJulia | web-edited that, so hopefully no issues | 16:05 |
opendevreview | Julia Kreger proposed openstack/ironic master: [DOC] Add entry regarding cleaning+raid https://review.opendev.org/c/openstack/ironic/+/867674 | 16:05 |
TheJulia | Okay, that should be good | 16:06 |
JayF | TheJulia: just +A shard_key spec | 16:12 |
JayF | TheJulia: Arne gave me an email thumbs-up | 16:12 |
JayF | \o/ | 16:13 |
TheJulia | \o/ | 16:13 |
TheJulia | now, where was I with extracting useful metrics! | 16:20 |
opendevreview | Merged openstack/ironic-specs master: Add a shard key https://review.opendev.org/c/openstack/ironic-specs/+/861803 | 16:22 |
JayF | TheJulia: circling back to the hwmgr/disk stuff; I think "I'm building my own ramdisk" is a threshold for some folks; once it's crossed, those are not hard problems anymore (because the code is trivial) | 16:26 |
JayF | TheJulia: the thing that, on reflection, I realize is common to some of the issues reported here is that people are trying hard (for good reason) to use a fully stock ramdisk | 16:27 |
TheJulia | eh, we made ipa-b to simplify ramdisk building | 16:35 |
TheJulia | so the knowledge barrier is a bit lower now | 16:36 |
JayF | there's a reason I used 'threshold' over 'barrier' | 16:37 |
JayF | IME operating it, it's always surrounding BS that makes it hard: setting up internal forks, CI, etc | 16:37 |
JayF | building one special ramdisk is easy; setting up a ramdisk building machine is more :) | 16:37 |
TheJulia | I think it is variable based upon perception/experience, but I do agree | 16:38 |
opendevreview | Merged openstack/ironic-inspector master: Update tox.ini for tox 4 https://review.opendev.org/c/openstack/ironic-inspector/+/867541 | 16:40 |
opendevreview | Julia Kreger proposed openstack/ironic-lib master: Provide an interface to store metrics https://review.opendev.org/c/openstack/ironic-lib/+/865311 | 16:40 |
TheJulia | JayF: ^ I could use a sanity check of this ironic-lib change since you were deeply involved with this ages ago | 16:41 |
JayF | hopefully you like the interface :D | 16:41 |
JayF | I think aweeks wrote that though, not me, although I was upstreaming with him | 16:42 |
TheJulia | it took a little time to wrap my head around it, but once I did I was like "oh, this is perfect for what I want" | 16:42 |
JayF | this is a very bog-standard metrics interface from 2015 :) | 16:43 |
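For readers following along, consuming that interface looks roughly like this (sketched from memory; ironic_lib's metrics_utils is the authoritative reference):

```python
from ironic_lib import metrics_utils

METRICS = metrics_utils.get_metrics_logger(__name__)

@METRICS.timer('ConductorManager.do_node_deploy')
def do_node_deploy(task):
    ...  # wall time is reported to the configured backend (statsd/noop)

def on_continue_cleaning():
    METRICS.send_counter('continue_cleaning', 1)  # explicit counter
```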
JayF | TheJulia: reading this... you're reimplementing half of statsd | 16:44 |
JayF | TheJulia: is there a statsd->prometheus connector that we could document instead? | 16:44 |
TheJulia | dunno, but I need both node sensor data and ironic data in one pass | 16:44 |
TheJulia | and we have ipe for the node data | 16:44 |
TheJulia | and deploying | 16:44 |
TheJulia | err | 16:44 |
TheJulia | incomplete thought, disregard that last line | 16:45 |
JayF | TheJulia: https://github.com/prometheus/statsd_exporter | 16:45 |
TheJulia | yeah, I was just looking at that | 16:46 |
TheJulia | geared as a migration tool | 16:46 |
JayF | that even handles turning timers into summary/histogram | 16:46 |
JayF | I am not opposed to native support, I'm just asking the question | 16:46 |
TheJulia | if I were to do that, I'd have to completely forklift all of the sensor data stuff over | 16:46 |
JayF | Are you saying that right now | 16:46 |
JayF | we have two metrics stacks in Ironic? | 16:46 |
JayF | because whoever added sensor data didn't plug it into existing metrics collection (or vice-versa, but I think app metrics were first) | 16:47 |
TheJulia | not metrics stacks, sensor data collection walks/collects data from the BMC and was originally geared for handing it off wholesale to ceilometer | 16:47 |
JayF | yeah as I hit enter there I realized it was the ceilometer stuff | 16:47 |
JayF | which I steered clear of when designing these metrics because people thought ceilometer would continue to exist still :| | 16:47 |
JayF | I mean, "forklift all the sensor data over" sounds like the right answer | 16:48 |
JayF | or "forklift all the metrics stuff over" | 16:48 |
JayF | i don't think many deployers would distinguish, in the same way we do, between app and hardware metrics | 16:48 |
TheJulia | hmm | 16:49 |
TheJulia | we had to write a lot of code to handle the sensors | 16:49 |
JayF | This isn't definitive, I'm just talking through it | 16:49 |
TheJulia | but the proxy might be the right way to go | 16:49 |
TheJulia | stevebaker[m]: any thought on running the statsd->prometheus proxy? | 16:50 |
JayF | like, statsd is definitely the older school thing vs prometheus | 16:50 |
JayF | which is something to consider too | 16:50 |
JayF | but our metrics interface was written with statsd in mind, so using the proxy you sorta get the code you wanted to write here for free | 16:50 |
TheJulia | the downside of IPE is it relies upon the message bus architecture | 16:50 |
JayF | let me make sure I understand | 16:51 |
TheJulia | you don't actually *need* a message bus, but it plugs in there | 16:51 |
JayF | it used to be | 16:51 |
JayF | Ironic node sensor data -> ceil -> ??? | 16:51 |
JayF | now it's Ironic node sensor data -> IPE -> Prom | 16:51 |
TheJulia | yes, for those that wish to use it | 16:51 |
JayF | and separately, we have Ironic app metrics -> statsd (I think this is the only plugin we ship?) | 16:51 |
TheJulia | I don't think anyone actually runs with statsd, tbh | 16:52 |
JayF | me neither | 16:52 |
TheJulia | it is statsd, and noop currently | 16:52 |
JayF | Well, like let me put it this way | 16:52 |
JayF | everywhere I've run at crazy scale | 16:52 |
JayF | we ran statsd | 16:52 |
JayF | we almost never used the data though | 16:52 |
JayF | So if the answer is "plumb up metrics library to prom" I'm OK-ish with that, but we might wanna consider just changing the paradigm at that point | 16:53 |
JayF | because it seems bad that sensor data and app data don't flow through the same code paths | 16:53 |
TheJulia | yeah, you're raising a really good point | 16:54 |
TheJulia | does statsd record how many times it got a timer call? | 16:54 |
JayF | statsd gives you counters for free off timers | 16:54 |
JayF | well, IDK if statsd does it | 16:54 |
JayF | or if you can derive it at the display layer | 16:54 |
JayF | my knowledge of this is significantly muddied because most of my experience with it was sending the statsd output to the Rackspace Monitoring agent | 16:55 |
JayF | and then [magic] | 16:55 |
TheJulia | so we could always just go "oh, that implies a count as well, ship it!" | 16:56 |
TheJulia | if we need to | 16:56 |
JayF | I am less concerned with that level of implementation | 16:56 |
JayF | and more concerned with the core problem of having two metrics libraries (essentially) in the same project | 16:57 |
TheJulia | I'm more so because people tend to want to know how many times a thing was performed, not total time spent | 16:57 |
JayF | I will say from experience that timer metrics were the most useful | 16:57 |
TheJulia | you can't imply # of total deploys from just a timer unless statsd does magic there | 16:57 |
JayF | I am surprised you wouldn't see that, since you started the practice of actually benchmarking our API calls during dev :) | 16:58 |
TheJulia | with fixed counts :) | 16:58 |
JayF | TheJulia: most of the time, I used oslo notifications to derive that info... | 16:58 |
JayF | TheJulia: timing was more for troubleshooting or isolating perf issues | 16:58 |
JayF | this is part of why the app metrics seem so weird now | 16:58 |
TheJulia | wait... | 16:59 |
JayF | you don't need them as much because we don't have as many "101-level" performance metrics as you used to | 16:59 |
TheJulia | where is the oslo notifications consumption? | 16:59 |
TheJulia | or did that not make it upstream? | 16:59 |
JayF | Ironic emits oslo notifications on state change | 16:59 |
JayF | most places I've worked emitted them to splunk where you could aggregate them | 16:59 |
JayF | https://docs.openstack.org/ironic/6.2.2/dev/notifications.html | 17:00 |
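Enabling those notifications is a small config change (illustrative; the transport itself comes from the oslo.messaging configuration):

```ini
[DEFAULT]
# Emit versioned notifications (node state changes among them) on the
# message bus; 'info' is an illustrative level, not a recommendation.
notification_level = info
```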
TheJulia | ahh | 17:00 |
JayF | Like, if you just wanna quick and dirty solve this, use the statsd->prom connector... if we're going to enhance it, lets actually fix it though | 17:00 |
TheJulia | That is way beyond the level of info we need | 17:00 |
TheJulia | well, there is no quick and dirty if I need both streams | 17:01 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: [WIP] [PoC] A metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 17:01 |
TheJulia | on a prometheus payload :( | 17:01 |
JayF | TheJulia: so a manager rloo and I used to share used oslo notifications going to splunk; he aggregated nova and ironic notifications to create a splunk "success %" dashboard for how many builds succeeded | 17:01 |
JayF | TheJulia: wait, why do you need them on the *same payload* | 17:01 |
JayF | is there some prom limitation I don't have awareness of? | 17:01 |
TheJulia | because you do a poll per service/thing | 17:01 |
TheJulia | yeah | 17:01 |
TheJulia | it polls an http endpoint for a file | 17:01 |
TheJulia | which is all of the metrics data for it | 17:02 |
TheJulia | in a particular format | 17:02 |
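That "particular format" is the Prometheus text exposition format, where each metric carries a name, type, optional labels, and help text; the metric below is invented for illustration:

```text
# HELP ironic_node_power_state Node power state (1 = on); illustrative name.
# TYPE ironic_node_power_state gauge
ironic_node_power_state{node="8280e833-8a86-4ee1-8cf6-675d55d54970"} 1
```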
TheJulia | moving to a proxy *would* potentially allow us to sunset IPE | 17:02 |
JayF | It sounds like having prom fetch those metrics twice (node metrics then app metrics) is the correct configuration | 17:02 |
JayF | for how our code is currently structured | 17:02 |
TheJulia | iurygregory: would that make anyone in your area sad/unhappy? | 17:02 |
JayF | they are two separate gathering and export mechanisms | 17:02 |
JayF | so why would we aggregate their output together for prom? | 17:02 |
TheJulia | JayF: aiui, you actually can't if you want them connected; they are different query/sensor areas, since it is modeled as a single service/operation | 17:03 |
JayF | we need to tie together all the way back if it's the same; or leave them separate all the way.... last minute tie in is the worst of all worlds: ops don't have flexibility to split app/node metrics and we still have two codepaths to do the same thing | 17:03 |
JayF | TheJulia: I'm saying our code currently strongly implies "these are not tied together" because it's two separate code paths altogether | 17:04 |
TheJulia | I'm stressing that our metrics stuff was always intended with ceilometer as the consumer | 17:04 |
TheJulia | and ipe was a means to an end without deploying anything else or retooling stats | 17:04 |
TheJulia | oh | 17:05 |
TheJulia | you know what the issue is | 17:05 |
TheJulia | that payload to prometheus has to have descriptive text | 17:05 |
TheJulia | https://github.com/prometheus/statsd_exporter#explicit-metric-type-mapping | 17:05 |
TheJulia | and the sensor data we get from BMCs includes freeform stuff that we predictably translate | 17:05 |
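This is the sticking point with statsd_exporter: each statsd name must be mapped to a Prometheus name and labels, roughly as below. The mapping is illustrative; freeform BMC sensor names would make such a file painful to maintain.

```yaml
mappings:
  # Capture one segment of the statsd metric name and expose it as a
  # Prometheus label (names here are invented for illustration).
  - match: "ironic.conductor.*"
    name: "ironic_conductor_operation"
    labels:
      operation: "$1"
```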
JayF | that also makes sense as to why we were talking past each other a little bit | 17:06 |
JayF | that mismatch of "metric type" and "metric name" being separate in statsd but not prom | 17:06 |
JayF | Would it be useful for us to talk about this sync a little bit? | 17:06 |
TheJulia | yeah, prometheus has the additional layer of labeling for each metric | 17:07 |
JayF | I really wish I had a whiteboard and physical proximity to talk about this sorta thing | 17:07 |
TheJulia | well, this was supposed to be a lightweight exploration | 17:07 |
JayF | hard to do in text only | 17:07 |
TheJulia | going down a heavyweight path might mean we punt | 17:07 |
TheJulia | just, being completely transparent | 17:07 |
JayF | so lets back up a step then and ask the core question: what's your end goal? | 17:07 |
JayF | What's the specific thing you want to achieve? Not even "app metrics for prometheus" but like "I want to be able to count how many widgets we provisioned" or whatever | 17:08 |
TheJulia | to get meaningful metric data about ironic's operation out of ironic, plus the node sensor data, into prometheus | 17:08 |
JayF | node sensor data already goes to prom via IPE, right? | 17:08 |
TheJulia | if deployed, yes | 17:08 |
JayF | ack | 17:08 |
TheJulia | The goal would be "how many deploys, how many times we continued cleaning" sort of stuff | 17:09 |
JayF | So, if we merged this prom backend for metrics when it's done, would the next step be to make node sensor data use that natively, and deprecate "ceilometer"/ipe support? | 17:09 |
TheJulia | we don't need the classic notifications payloads because that is *way* too much information and is more for logging and log statistic aggregation tools like splunk | 17:09 |
TheJulia | I'm unsure we could merge a prometheus backend because we would need another service | 17:10 |
TheJulia | or another HTTP endpoint to offer the content up | 17:10 |
JayF | is there any concept of an official prometheus agent? | 17:11 |
JayF | that does that http endpoint for "free" for you? | 17:11 |
TheJulia | not really, everything is supposed to export for pickup by prometheus | 17:11 |
JayF | I assume the frequency of updates is such that using something intermediate, like uploading the stats file to a remote host somehow (webdav, swift temp urls, etc?) is out of the question | 17:11 |
TheJulia | there are things like node exporter, which walks a bunch of endpoints, and creates the text document when called | 17:12 |
TheJulia | kind of yeah, ipe does two things. It has a plugin that transforms the data, and then runs a tiny webserver for prometheus to grab the data out from it | 17:12 |
JayF | So does your PR to add a prom backend, in that case, target IPE as the intermediary? | 17:13 |
TheJulia | the path I was heading down was to graft on to the metrics collection interface in ironic-lib to serve as the collection point where I could then pull the data out of ironic's memory every time the conductor shipped a node sensor update to IPE | 17:14 |
TheJulia | IPE picks that up, transforms it, and has it ready for when promethus goes to get an update | 17:15 |
JayF | I'm not sure how you could do that from the ironic-lib level | 17:15 |
JayF | that sounds like a new API endpoint? | 17:16 |
TheJulia | Ironic-lib would be in memory and could be the keeper of the local stats data, it would mean a new method, but basic testing I've done shows that works as expected at the moment | 17:18 |
JayF | I'm confused as to how IPE and Ironic-lib communicate for this purpose | 17:18 |
rpittau | bye everyone o/ | 17:18 |
TheJulia | we have a periodic in the conductor which does a sweep for sensor data | 17:19 |
JayF | o/ | 17:19 |
TheJulia | to ship to the oslo notification endpoint, which IPE picks up | 17:19 |
TheJulia | so it would just be a "oh, lets grab this data and send it along too" action | 17:19 |
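Put together, the idea on the table is roughly the following; every name here is a stand-in, not existing conductor or IPE code:

```python
# Conceptual flow: the existing conductor periodic sweeps BMC sensor
# data, and the in-memory ironic-lib metrics ride along on the same
# notification for IPE to transform and expose to Prometheus.

def collect_bmc_sensor_data():
    return {'temperature': {'inlet_c': 24}}  # stand-in for the BMC sweep

def collect_app_metrics():
    return {'deploy_count': 3}  # stand-in for ironic-lib's in-memory stats

def sensor_periodic(notify):
    payload = collect_bmc_sensor_data()
    payload['ironic_metrics'] = collect_app_metrics()
    notify('hardware.metrics.update', payload)  # picked up by the IPE plugin

sensor_periodic(lambda event, data: print(event, data))
```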
JayF | okay; I dig it | 17:20 |
JayF | the thing that is cool about this, which I wonder if you've forgotten | 17:20 |
JayF | is we have IPA plumbed up to metrics, too | 17:20 |
JayF | so you could even enhance node sensor data with in-band metrics from IPA, if desired, too | 17:20 |
JayF | TheJulia: so here's my suggestion: don't worry about statsd compatibility, change the interface with impunity if you need to make it more prom-friendly, and we retire the statsd backend over time | 17:21 |
JayF | and then, sometime off in the future, we tie the existing ironic node sensor data fetching into metrics | 17:22 |
JayF | (from ironic_lib) | 17:22 |
JayF | and in fact, that could be a very low hanging fruit for a new/more junior contributor | 17:22 |
JayF | especially if we make sure to tailor the interface to be well suited for how we use it now (and ignore the use cases from 7 years ago which are not in use so much now) | 17:22 |
TheJulia | yeah... so then I would need another exporter plugin or to retool IPE's operation | 17:23 |
TheJulia | hmmm | 17:23 |
JayF | Why would you need another exporter plugin or to retool IPE's operation? | 17:23 |
JayF | I think you laid out a pretty clear migration path; intentional or not: | 17:24 |
JayF | - add IPE-compatible backend to metrics | 17:24 |
TheJulia | If I'm remembering correctly, it is one-shot operations, not disjointed, mashed-together data operations | 17:24 |
JayF | - IPE gets ir-lib metrics in the same polling loop it does for node sensor data | 17:24 |
iurygregory | TheJulia, let me scroll back to read things | 17:25 |
JayF | there's nothing wrong with a generic endpoint in the metrics lib that says | 17:25 |
JayF | "here are some metrics from out of band" | 17:25 |
TheJulia | ipe doesn't load ir-lib | 17:25 |
TheJulia | ipe is an entirely separate process | 17:25 |
TheJulia | so it doesn't have access to memory, we would need someplace to store the data | 17:25 |
TheJulia | hmmmmm | 17:25 |
JayF | I'm extremely confused; I thought you said IPE polls conductor for the metrics data | 17:25 |
TheJulia | no, the conductor does it and ships to IPE via the notification plugin | 17:26 |
* TheJulia needs to see how that gets invoked | 17:26 |
TheJulia | we *might* actually be able to ask ironic lib from the plugin | 17:26 |
JayF | ack; that's the point of misunderstanding I had | 17:26 |
TheJulia | so... this might actually just work if we ask from the i-p-e plugin to ironic-lib for data when running inside of the conductor context | 17:28 |
JayF | so if we figure out that point of contention, the path is pretty clear, yeah? | 17:28 |
TheJulia | yeah, it is basically the same path I've been on though :) | 17:28 |
JayF | yeah but I didn't see it until now | 17:29 |
iurygregory | going to lose some connection, brb | 17:29 |
JayF | it was foggy ;) | 17:29 |
TheJulia | oh, it is very confusing :) | 17:29 |
TheJulia | that is for sure | 17:29 |
JayF | so okay; thank you for the lesson and the little bit of back and forth | 17:29 |
JayF | I'm a little worried this is one of those situations where I feel like "we" figured something out but in reality; I was just confused the whole time until now lol | 17:29 |
TheJulia | I think I need to fire up a devstack soon() and dig through this to make sure it would work | 17:30 |
JayF | I do think that prom->statsd thing will be useful though | 17:30 |
JayF | in terms of figuring out how to translate the metrics from the existing interface terms | 17:31 |
JayF | not as a service; but as like, a reference | 17:31 |
TheJulia | yeah, that is a good point, and I'm realizing now to hand things out to prometheus, we're going to have to label them properly beyond the name | 17:31 |
TheJulia | which means we either create a reference, or... hmmm | 17:31 |
JayF | I suspect you have all the information you need | 17:31 |
* TheJulia puts her glasses on and opens a spreadsheet to review test cases | 17:32 | |
JayF | because the type of metric asked for is one piece of data; the name of the metric is the other | 17:32 |
TheJulia | I, unfortunately, don't exactly remember what the prometheus labeling is supposed to look like | 17:32 |
iurygregory | TheJulia, let me see if I understood your question correctly: you are referring to "fetch those metrics twice"? | 17:41 |
TheJulia | in what context? | 17:41 |
iurygregory | what JayF said before you asked me "would that make anyone in your area sad/unhappy? " | 17:43 |
JayF | we were talking about killing either existing ir-lib metrics or IPE at that point | 17:43 |
iurygregory | I don't see a problem tbh | 17:43 |
JayF | I think we landed on "make ir-lib metrics use IPE" | 17:44 |
JayF | which would, if anything, make you happier | 17:44 |
TheJulia | and would allow us to collect all the data together and have it handy in the prometheus way | 17:44 |
iurygregory | I do think it makes sense to collect all data and provide in the prometheus format | 17:45 |
TheJulia | One intermediate thought we reached is what if we dropped i-p-e for the statsd proxy exporter, but then I realized the need to maintain label mappings would be quite problematic for IPMI data in particular | 17:50 |
iurygregory | yeah, ipmi data is "funny" | 18:00 |
TheJulia | "funny" is a polite way of putting it | 18:41 |
stevebaker[m] | good morning | 19:08 |
TheJulia | o/ stevebaker[m] | 19:08 |
* stevebaker[m] starts reading the backscroll | 19:08 | |
TheJulia | hehe | 19:08 |
TheJulia | would you have a few minutes to join a call in say 10-15 m? | 19:08 |
* TheJulia guesses no :) | 19:32 | |
stevebaker[m] | yes! | 19:36 |
stevebaker[m] | TheJulia: just caught up | 19:36 |
stevebaker[m] | TheJulia: What is your github username? I'm just updating ironic-operator/OWNERS | 19:42 |
JayF | https://github.com/openstack/ironic/graphs/contributors looks like juliakreger | 19:44 |
stevebaker[m] | thanks | 19:45 |
TheJulia | yup | 19:56 |
TheJulia | Thanks! | 19:56 |
TheJulia | sorry, I stepped outside for a few to obtain my daily allocation of vitamin D | 19:57 |
TheJulia | stevebaker[m]: jparoly sent you an email downstream w/r/t what I was pinging you about. I have 2 hours before my next and final call of the day | 19:58 |
*** tosky_ is now known as tosky | 21:12 | |
opendevreview | Jay Faulkner proposed openstack/ironic master: DB & Object layer for node.shard https://review.opendev.org/c/openstack/ironic/+/864236 | 22:11 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866235 | 22:11 |
JayF | just bringing it in line with lint and the spec as written, still need to get into rbac testing | 22:12 |
JayF | and TheJulia I might take that sync intro after failing at RBAC testing for a bit on my own | 22:12 |
* JayF set up some stuff in yaml, couldn't get it to pass no matter what | 22:13 |
TheJulia | would 7 am work tomorrow? | 22:15 |
TheJulia | I can also do 11 am | 22:15 |
JayF | probably, yeah | 22:15 |
JayF | like 7am-ish, I start at 7am but not always able to communicate to other humans until some caff is ingested lol | 22:15 |
JayF | I suspect it's like, something basic I'm missing | 22:18 |
JayF | maybe just a single breadcrumb needed | 22:18 |
JayF | I'll even take another stab at it this afternoon; I'd like to get that RBAC testing in before I work on port queries filtered by node | 22:19 |
JayF | TheJulia: I think I managed to get on the right track starting again from scratch | 22:47 |
JayF | TheJulia: the yaml file is so well formatted that I did `shard_patch_set_node_shard_disallowed` and copilot literally completed the entire rest of the dict :) | 22:47 |
TheJulia | creepy magic! | 22:48 |
TheJulia | heh | 22:48 |
TheJulia | okay, well I'll be happy to look at things in the morning | 22:48 |
opendevreview | Jay Faulkner proposed openstack/ironic master: API support for CRUD node.shard https://review.opendev.org/c/openstack/ironic/+/866235 | 22:49 |
JayF | TheJulia: feel free to do it as a code review ^ | 22:49 |
JayF | I suspect there may be cases I wanna check that I'm not checking now, but there is checking being done now and I have confidence the rules are working | 22:49 |
TheJulia | cool cool | 22:57 |
TheJulia | well, off to the grocery store I go | 23:01 |
JayF | TheJulia: iurygregory: I don't know which of you is invested in bugfix branches; but we're about a month late for our first one of the cycle. My question is essentially: should we actually cut one? | 23:10 |
JayF | I am happy to propose a change to our release policy indicating we *may* cut a bugfix release (much happier than I'd be cutting this release with no known consumers of it) | 23:11 |
vanou | JayF TheJulia: if possible, could you give me feedback on my concern in the last comment at https://review.opendev.org/c/openstack/ironic/+/865075 ? | 23:16 |
JayF | oh yeah, that was on my mental list and dropped off | 23:17 |
JayF | I'll put this comment in the PR as well, but essentially, don't let yourself be limited by what exists | 23:18 |
JayF | if we need to add a way, in our ipmi module, to send a single message with no retries, or similar, I think you should just add it | 23:18 |
JayF | or more likely, promote some method from private to public in the ipmi modules, because it almost certainly already exists | 23:18 |
vanou | Thanks for caring. I'll reconsider how to deal with it based on your advice :) | 23:20 |
JayF | and fwiw, you get some sympathy :) I know in this sort of patch you have two sets of folks to answer to: the ones downstream who want it a certain way, and then we in Ironic who want it a certain way | 23:22 |
JayF | so don't hesitate to ask if you get stuck or need more direction | 23:23 |
vanou | Thanks a lot JayF o/ | 23:23 |
JayF | \o | 23:23 |
vanou | If I change an IPMI method/function, I think I should make a separate patch for that modification. Is that correct? | 23:25 |
JayF | Honestly, I'd just do it directly in the patch we're talking about; that way we can see it as a result of our comments. | 23:26 |
opendevreview | Verification of a change to openstack/ironic-python-agent stable/victoria failed: Drop python2 from bindep.txt https://review.opendev.org/c/openstack/ironic-python-agent/+/862656 | 23:26 |
JayF | If it helps you to split it up; feel free | 23:26 |
vanou | OK. Thanks :) | 23:27 |
TheJulia | JayF: in the past, when we have been late, typically we have evaluated whether there has been anything substantive | 23:28 |
TheJulia | And just skipping if it has been a quiet month or two | 23:29 |
JayF | TheJulia: I'm asking the question not because we're late; but because part of the reason we're late is some chatter in here that the release wouldn't be consumed | 23:29 |
JayF | TheJulia: and also I'm info gathering; I want to have a patch for our release policy ready for when I retire bugfix branches so I can feel like I cleaned up all the release-debt from that :D | 23:29 |
TheJulia | Ack, to be clear, we have skipped a few times when there just wasn’t a good reason to ship | 23:31 |
TheJulia | Around holidays in particular | 23:31 |
JayF | this would absolutely apply | 23:31 |
JayF | and we should also make sure our published policies fit reality around this | 23:31 |
JayF | even if it just means adding more "squish" to the public policy :D | 23:32 |
TheJulia | I thought we already allowed for that actually, but it has been a while | 23:34 |
JayF | you're 10000% right | 23:35 |
JayF | there's nothing about the release cadence in the spec now | 23:36 |
JayF | I'm not insane; it used to say that, right? | 23:36 |
JayF | nope; it never said that | 23:38 |
TheJulia | heh | 23:45 |