Thursday, 2022-12-29

<opendevreview> Jay Faulkner proposed openstack/ironic master: Fix tox4 and setuptools errors  https://review.opendev.org/c/openstack/ironic/+/868749 [00:21]
<JayF> rpittau: ^^ figured an edit and seeing if it got V+1 was preferable to a PR comment... thanks for looking at this [00:22]
<JayF> rpittau: please email me directly at jay at jvf dot cc if/when some of these are ready to land and I can try to find time to unstick it [00:22]
<JayF> family in town for like two more days then I can have more time to help review/troubleshoot these in detail [00:22]
<opendevreview> Riccardo Pittau proposed openstack/ironic master: Fix CI  https://review.opendev.org/c/openstack/ironic/+/868749 [09:18]
<samuelkunkel[m]> Hi folks, I have a strange scenario where you may be able to give me a hint. We have deployed our second ironic infrastructure using the zed release (ipa running stream9 with release 9.2). We are facing an issue where the conductor fails to validate the cert of the agent (during cleaning, which is done as the first step) because the certificate is not yet valid. I am assuming this is due to a time difference between the conductor and the [12:12]
<samuelkunkel[m]> agent. (stream9 defaults to timezone EST, which I fixed by unpacking the initramfs, changing the symlink and repacking it). So the time (according to timedatectl) is basically the same... [12:12]
<samuelkunkel[m]> In our first deployment (currently running train) I did not see anything with the certificate stuff but I assume this was not present in the train release [12:13]
<iurygregory> good morning Ironic [12:18]
<iurygregory> samuelkunkel[m], release 9.2 you mean the IPA version in use? [12:19]
<samuelkunkel[m]> yes [12:19]
<iurygregory> this would be for Antelope not Zed... [12:20]
<iurygregory> zed is 9.1 [12:20]
<samuelkunkel[m]> hmm [12:20]
<samuelkunkel[m]> ok that's interesting [12:20]
<samuelkunkel[m]> but I also faced the same issue using 9.0 [12:20]
<samuelkunkel[m]> therefore I just built it from the newest sources [12:20]
<iurygregory> gotcha [12:20]
<samuelkunkel[m]> (and I just saw that stream 9 doubled its image size since august, very interesting) [12:21]
<iurygregory> it can be a configuration option you need to set, do you have the exact error message you get from logs about the cert problem? [12:21]
<samuelkunkel[m]> let me quickly fetch the one I can see [12:22]
<samuelkunkel[m]> Node failed to start the first cleaning step: Connection to agent failed: Failed to connect to the agent running on node 33cd3d68-b7df-44b1-9d2d-0027c2b93c79 for invoking command clean.get_clean_steps. Error: HTTPSConnectionPool(host='10.218.14.242', port=9999): Max retries exceeded with url: /v1/commands/?wait=true&agent_token=_N6wuKa-bhGvVV9l_Mp1PpNaIur0ylEvKFnIgiu5qmw (Caused by SSLError(SSLCertVerificationError(1, '[SSL: [12:22]
<samuelkunkel[m]> CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate is not yet valid (_ssl.c:1131)'))) [12:22]
<samuelkunkel[m]> on the ipa I can also see correlating ssl validation issues during the times the conductor tries to connect [12:23]
<samuelkunkel[m]> My guess would be it's down to a time issue as the cert is created to be valid for localtime up to localtime+1 on the ipa [12:24]
<samuelkunkel[m]> So my idea was to provide ipa-ntp-server so it syncs its time to one of our ntp servers to rule out a time issue [12:24]
<samuelkunkel[m]> but this one fails during startup as the difference is too big (let me quickly boot up a node to check the exact error message here) [12:26]
<iurygregory> yeah the problem can be a clock skew [12:27]
<samuelkunkel[m]> We had that before because stream9 defaulted to EST as its standard timezone, but after adjusting the initramfs the time looks mostly good [12:28]
<samuelkunkel[m]> (before = that the clock was not properly set) [12:28]
<samuelkunkel[m]> but in our train deployment nothing cared [12:29]
<samuelkunkel[m]> ok one piece of information was wrong: I just saw that the cert created on the ipa is valid for 30 days [12:39]
<samuelkunkel[m]> On the agent I see the error message (right after the successful heartbeat): ssl.SSLError: [SSL: SSLV3_ALERT_BAD_CERTIFICATE] sslv3 alert bad certificate (_ssl.c:2633) [12:48]
<samuelkunkel[m]> thrown by eventlet/green/ssl.py (does a full traceback help? Not sure if I can pack one in irc) [12:48]
<samuelkunkel[m]> This time cleaning worked but deploying failed. And every time it fails I see an issue with chrony [12:50]
<samuelkunkel[m]> 2022-12-29 14:43:59.713 3665 ERROR ironic_python_agent.utils [-] Failed to sync time using chrony to ntp server: 10.219.208.25: Unexpected error while running command. [12:51]
<samuelkunkel[m]> Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Command: chronyd -q 'server 10.219.208.25 iburst' [12:51]
<samuelkunkel[m]> Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Exit code: 1 [12:51]
<samuelkunkel[m]> Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Stdout: '' [12:51]
<samuelkunkel[m]> Dec 29 14:43:59 localhost.localdomain ironic-python-agent[3665]: Stderr: '2022-12-29T14:43:49Z chronyd version 4.3 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 +DEBUG)\n2022-12-29T14:43:59Z No suitable source for synchronisation\n2022-12-29T14:43:59Z chronyd exiting\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. [12:51]
<samuelkunkel[m]> So I guess I need to dig into why chrony fails [12:51]
<arozman> Hi Ironic! [13:19]
<samuelkunkel[m]> Ok it seems to be an issue related to dhcp. As my hardware starts the ipa kernel / ramdisk I seem to lose some connectivity (or basically ipa is too fast). I lose around 20 pings. If I run chrony around 20s after it first failed, it succeeds. If I do that (or every time chrony runs successfully) all the ssl issues are gone. [13:51]
<samuelkunkel[m]> If chrony fails I also can not see a request on port 123 [13:51]
<samuelkunkel[m]> so seems like the ipa is not "able" to query for ntp during that time [13:54]
<samuelkunkel[m]> meh [13:54]
<jssfr> checked syslog for timestamps when the DHCP lease was acquired? [13:54]
<jssfr> also, hi samuel [13:54]
<samuelkunkel[m]> hi! [13:55]
<TheJulia> Have you checked switch logs to see if the port is going into a blocking state? [14:11]
<samuelkunkel[m]> good points, going to investigate in that direction(s) [14:14]
<TheJulia> Some switches with bonds will behave oddly when you dhcp and line carrier up at certain versions [14:15]
<TheJulia> I know that is vague but some vendors actually say don’t try to pxe over bonds, and then things like networkmanager can impact dhcp behavior [14:16]
<samuelkunkel[m]> I think also the ipxe itself is capable of forming lacp. If I recall correctly we have the switchports configured for bonding but we also have the "port-channels" (name for bonds on the arista switches) configured for lacp fallback to individual (which is basically "if there is no lacp also allow them to run as standalone port"). So the idea was to just leave them unbundled within ipa and once we boot into the final deployment [14:22]
<samuelkunkel[m]> we have them bundled [14:22]
<TheJulia> So there is the conundrum [14:28]
<TheJulia> Ipxe sends an lacp bpdu “I’m here” message which the switch counts as an activation [14:28]
<TheJulia> When the OS boots, and no more messages are sent, then the switch resets port states [14:29]
<TheJulia> Supposedly a newer network manager version does try to set up bonds automatically which would navigate this… but I know no details [14:29]
<samuelkunkel[m]> what I see, on the switchports itself we do not have spanning-tree portfast activated (only on the port-channel) [14:29]
<samuelkunkel[m]> so once the switch falls back to individual it runs the whole spanning-tree chain [14:29]
<samuelkunkel[m]> I have configured my testsetup now to use also portfast on the individual ports, let's see. [14:30]
<samuelkunkel[m]> Gonna have a look if there is an ipxe config flag to not speak lacp (would make this a bit easier) [14:30]
<TheJulia> There is not, it gets compiled in [14:31]
<TheJulia> But portfast should fix you up [14:31]
<TheJulia> Vendor dependent of course [14:31]
<samuelkunkel[m]> that should not be the biggest issue as we compile it from source anyway [14:31]
<TheJulia> Yeah [14:32]
<samuelkunkel[m]> maybe there is something in the build to adjust [14:32]
<TheJulia> Some larger operators do [14:32]
<TheJulia> If memory serves, there is [14:32]
<TheJulia> We… might even have that mentioned in our docs someplace [14:32]
<opendevreview> waleed mousa proposed openstack/ironic-python-agent master: update NVIDIA NIC firmware images and settings by ironic-python-agent  https://review.opendev.org/c/openstack/ironic-python-agent/+/566544 [15:06]
<samuelkunkel[m]> TheJulia: your guess was correct, after configuring spanning-tree portfast on the respective switchports there is no network interrupt anymore, chrony runs fine and therefore all certificates are valid [15:55]
<samuelkunkel[m]> thanks to all of you for your help :) [15:56]
<samuelkunkel[m]> (have tested it now with 4 different hardware types, all work perfectly fine) [15:56]
<opendevreview> Riccardo Pittau proposed openstack/ironic master: Fix CI  https://review.opendev.org/c/openstack/ironic/+/868749 [16:10]
<TheJulia> samuelkunkel[m]: awesome [17:16]
* TheJulia goes back to home repairs [17:16]
* rpittau appears from the shadow [18:09]
* rpittau hello ironicers! Mind the CI is broken almost everywhere but this https://review.opendev.org/c/openstack/ironic/+/868749 should fix it [18:09]
* rpittau once that merges others that are failing with the same topic (like https://review.opendev.org/c/openstack/ironic-inspector/+/868750) should succeed after a recheck [18:09]
* rpittau happy holidays! [18:09]
* rpittau /me goes back into the shadows [18:09]
<TheJulia> 4gb of ram!?! [18:11]
<TheJulia> That is super problematic, what did they add that we can remove? [18:11]
