opendevreview | Merged openstack/ironic master: Grenade: Turn up interfaces for vxlan https://review.opendev.org/c/openstack/ironic/+/839420 | 00:46 |
---|---|---|
arne_wiebalck | Good morning, Ironic! | 06:20 |
rpittau | good morning ironic! o/ | 06:58 |
opendevreview | Riccardo Pittau proposed openstack/sushy-tools master: Use python Zed tests https://review.opendev.org/c/openstack/sushy-tools/+/838674 | 08:24 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Decouple deploy callback timeout from deploy step timeout https://review.opendev.org/c/openstack/ironic/+/837690 | 08:44 |
dtantsur | TheJulia: okay, apparently you were right, and I don't remember how the timeout stuff works :) we do update provision_updated_at on heartbeats, so running deploy steps never time out. | 09:41 |
dtantsur | folks, I'd really appreciate some reviews on https://review.opendev.org/c/openstack/sushy-tools/+/830157/ and https://review.opendev.org/c/openstack/sushy-tools/+/830598/ | 10:15 |
hjensas | networking-baremetal CI was broken, can cores take a look at https://review.opendev.org/c/openstack/networking-baremetal/+/839298 ?, thanks | 10:16 |
dtantsur | +2 | 10:18 |
iurygregory | morning Ironic | 11:04 |
hjensas | thanks dtantsur | 11:18 |
hjensas | TheJulia: how on earth can we get conductor takeover issues on the undercloud? (It's just one conductor?) | 11:18 |
hjensas | https://logserver.rdoproject.org/15/15761b77d91ab3e398f6fa1d10d2f5da267b7931/openstack-periodic-integration-stable1-cs8/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/2f5dc02/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz | 11:19 |
hjensas | 2022-04-26 23:40:31.669 10 WARNING ironic.conductor.manager [-] Forcibly removed reservation of conductor undercloud.localdomain on node 9187de8c-59e1-46d1-8fac-d6cb28fca0b4 as that conductor went offline | 11:19 |
dtantsur | hjensas: hostname change? | 11:24 |
hjensas | dtantsur: thanks, that seems like a plausible reason. | 11:25 |
hjensas | the conductor starts with "host = undercloud.localdomain", I don't see any indication of hostname change in the journal. | 11:46 |
hjensas | dtantsur: is it possible self.dbapi.get_offline_conductors() would return the same conductor in case the heartbeat was not received? | 11:48 |
dtantsur | hjensas: does not sound impossible to me, I'm not sure if we have any safeguards | 11:54 |
dtantsur | (unclear why a heartbeat wouldn't be received in this case) | 11:54 |
hjensas | yeah, how does this heartbeat work? A periodic task is updating a timestamp in the db? Or is RPC/MQ involved? | 11:56 |
dtantsur | hjensas: it's a thread in the conductor IIRC | 11:56 |
TheJulia | Local takeover can occur when it failure or super long pauses occur | 12:00 |
TheJulia | Which exceed 60 seconds. It is a sign that resources are way overcommitted | 12:01 |
dtantsur | maybe we need a safeguard for the current host | 12:01 |
dtantsur | (won't help in case of changing hostnames) | 12:01 |
TheJulia | A running process doesn’t learn of a host name change | 12:02 |
TheJulia | It is a launch time variable | 12:02 |
TheJulia | So that stays static in the running namespace aiui | 12:02 |
TheJulia | Well, process space | 12:02 |
TheJulia | A safeguard wouldn’t really defend it… it is basically a “I stopped writing to the dab or being able to write due to external conditions”…. | 12:04 |
TheJulia | To the db | 12:04 |
* TheJulia goes back to sleep | 12:04 | |
TheJulia | Or at least, try… cats | 12:05 |
hjensas | I looked at the dstat data of the job, there is for sure a load spike. And this is OVB, so potentially noisy neighbours. | 12:16 |
dtantsur | it shouldn't be too hard to make the conductor never take over itself | 12:16 |
hjensas | It may make sense to set CONF.conductor.check_provision_state_interval = 0 on the undercloud, or in the CI. | 12:16 |
dtantsur | please no | 12:17 |
dtantsur | this is a bug that is trivial to fix. what you're suggesting is not even the right option.. | 12:17 |
dtantsur | probably just wrap all calls to self.dbapi.get_offline_conductors in manager.py to a helper that excludes the current host | 12:19 |
dtantsur | (and maybe logs a warning that something is off) | 12:19 |
hjensas | hm, I figured from https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/manager.py#L1587 that check_provision_state_interval = 0 would disable the entire thing. | 12:20 |
hjensas | But yeah, I can add something to exclude current host. | 12:20 |
dtantsur | ah, so we're reusing the same option for many purposes. fun. | 12:20 |
dtantsur | even if you disable the periodic task, the conductor will be considered offline for e.g. hash ring purposes | 12:21 |
dtantsur | which is... extra fun because it's on the API side | 12:21 |
dtantsur | but yeah, at least never take over the current conductor, nor orphan its nodes or allocations | 12:22 |
dtantsur | this option also affects checking for deploy timeouts | 12:23 |
hjensas | ok, I'll suggest they try to bump the heartbeat_interval and heartbeat_timeout. That may allow it to survive the load peak. | 12:28 |
iurygregory | dtantsur, rpittau, TheJulia https://review.opendev.org/c/openstack/releases/+/839524 releasing ironic and sushy in victoria before moving to EM | 13:19 |
iurygregory | the other deliverables didn't have new commits or had commits that were just related to CI | 13:19 |
opendevreview | Merged openstack/ironic master: [iRMC] Change the way to get irmc-info in raid https://review.opendev.org/c/openstack/ironic/+/839122 | 13:22 |
opendevreview | Riccardo Pittau proposed openstack/ironic-python-agent master: Multipath Hardware path handling https://review.opendev.org/c/openstack/ironic-python-agent/+/837039 | 13:38 |
iurygregory | rpittau, should I update the wallaby patch? | 13:39 |
rpittau | iurygregory: yes please! | 13:39 |
iurygregory | doing now | 13:39 |
rpittau | thanks | 13:39 |
rpittau | dtantsur, iurygregory, TheJulia, please double-check the mpathconf options, I checked on RHEL8 so it should be fine, but still more eyes the better :) | 13:40 |
TheJulia | rpittau: does debian/ubuntu have /sbin/mpathconf? | 13:47 |
rpittau | TheJulia: mmm probably no | 13:51 |
TheJulia | not on debian | 13:52 |
rpittau | no, they don't, they use a different way to configure multipath | 13:52 |
TheJulia | so | 13:52 |
TheJulia | directly launching the daemon *should* configure pathing | 13:53 |
TheJulia | at least it did on my test machine | 13:53 |
rpittau | not in RHEL | 13:53 |
TheJulia | oh jebus | 13:53 |
rpittau | unfortunately the procedure for RHEL is to use mpathconf | 13:53 |
iurygregory | yay... | 13:54 |
rpittau | \o/ | 13:54 |
rpittau | we can put a big TRY there | 13:54 |
TheJulia | I suspect we need to | 13:54 |
TheJulia | :\ | 13:54 |
TheJulia | and I can re-test manually on my desktop after my next reboot | 13:54 |
* TheJulia feels like major distro differences should be a reason to begin drinking very early | 13:54 | |
* iurygregory agrees | 13:55 | |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic-python-agent stable/wallaby: Multipath Hardware path handling https://review.opendev.org/c/openstack/ironic-python-agent/+/837784 | 13:56 |
* rpittau had a beer at lunch | 13:58 | |
opendevreview | Merged openstack/networking-baremetal master: Register neutron common config options https://review.opendev.org/c/openstack/networking-baremetal/+/839298 | 14:05 |
TheJulia | rpittau: +2+A ;) | 14:46 |
rpittau | :D | 14:47 |
* TheJulia may, or may not, be a bad influence | 14:50 | |
*** mat_fechner is now known as matfechner | 14:58 | |
opendevreview | Julia Kreger proposed openstack/ironic master: Auto-populate lessee for deployments https://review.opendev.org/c/openstack/ironic/+/818641 | 16:06 |
opendevreview | Julia Kreger proposed openstack/ironic master: Auto-populate lessee for deployments https://review.opendev.org/c/openstack/ironic/+/818641 | 16:07 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 | 16:18 |
rpittau | good night! o/ | 16:21 |
iurygregory | gn! | 16:22 |
TheJulia | rpittau: are you updating the multipath change tests? | 16:22 |
rpittau | TheJulia: I was going to wait for the tests on the field before moving further forward with the patch | 16:25 |
TheJulia | rpittau: ack | 16:25 |
TheJulia | hjensas: figured out multinode, seems to be an issue with ngs | 18:48 |
TheJulia | how to fix is now the question | 18:48 |
TheJulia | got it! | 19:21 |
* TheJulia queues up dancing for after the next meeting | 19:22 | |
opendevreview | Julia Kreger proposed openstack/networking-generic-switch master: CI: Fix Multinode ssh key file placement https://review.opendev.org/c/openstack/networking-generic-switch/+/839645 | 22:24 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 | 22:25 |
TheJulia | hjensas: so... dhcp stateful + centos 8... it never got another dhcp address. | 22:32 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: v6/grenade multinode jobs https://review.opendev.org/c/openstack/ironic/+/839086 | 22:34 |
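
The safeguard dtantsur suggested at 12:19 (wrapping calls to `self.dbapi.get_offline_conductors` in a helper that excludes the current host and logs a warning) could look roughly like the sketch below. This is an illustration only, not the change that was eventually proposed: the helper name `offline_conductors_excluding_self`, the `FakeDBAPI` stand-in, and the hostname `conductor-2` are hypothetical, and the sketch assumes `get_offline_conductors()` returns an iterable of hostnames.

```python
import logging

LOG = logging.getLogger(__name__)


def offline_conductors_excluding_self(dbapi, own_host):
    """Return offline conductor hostnames, never including own_host.

    Hypothetical helper, not ironic's actual code. Assumes that
    dbapi.get_offline_conductors() returns an iterable of hostnames.
    """
    offline = list(dbapi.get_offline_conductors())
    if own_host in offline:
        # A running conductor that sees itself as offline has most likely
        # missed heartbeats (for example under heavy load), so warn and
        # drop it from the result instead of taking over its own nodes.
        LOG.warning("Conductor %s is listed as offline while still running; "
                    "its heartbeat may be delayed. Skipping self-takeover.",
                    own_host)
    return [host for host in offline if host != own_host]


class FakeDBAPI:
    """Stand-in for the database API, for illustration only."""

    def get_offline_conductors(self):
        return ["undercloud.localdomain", "conductor-2"]


if __name__ == "__main__":
    hosts = offline_conductors_excluding_self(FakeDBAPI(),
                                              "undercloud.localdomain")
    print(hosts)
```

As noted in the discussion, a filter like this only stops a conductor from taking over its own nodes or allocations. It would not help after a hostname change, and the API side could still treat the conductor as offline for hash ring purposes.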