opendevreview | yatin proposed openstack/neutron master: Move neutron rally jobs to wsgi https://review.opendev.org/c/openstack/neutron/+/939315 | 05:41 |
---|---|---|
opendevreview | yatin proposed openstack/neutron-tempest-plugin master: Always create router interface for ipv6 metadata test https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/939104 | 06:09 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: common: fix wait_until_true to support native thread https://review.opendev.org/c/openstack/neutron/+/937843 | 08:09 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: remove the usage of eventlet in the OVS agent https://review.opendev.org/c/openstack/neutron/+/937765 | 08:10 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: reimplement signals handling https://review.opendev.org/c/openstack/neutron/+/939321 | 08:10 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: dnm: use native os-ken implementation https://review.opendev.org/c/openstack/neutron/+/938487 | 08:10 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: common: fix wait_until_true to support native thread https://review.opendev.org/c/openstack/neutron/+/937843 | 08:13 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: remove the usage of eventlet in the OVS agent https://review.opendev.org/c/openstack/neutron/+/937765 | 08:13 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: dnm: use greenlet os-ken implementation not monkey-patched https://review.opendev.org/c/openstack/neutron/+/939057 | 08:13 |
ralonsoh | lajoskatona, slaweq hi folks, please check ykarel patch: https://review.opendev.org/c/openstack/neutron/+/939315 | 08:36 |
ralonsoh | new eventlet (surprise!) is breaking the Neutron CI | 08:36 |
slaweq | ralonsoh done | 08:38 |
ralonsoh | thanks! | 08:38 |
*** dmellado075539377 is now known as dmellado07553937 | 08:49 | |
ralonsoh | ykarel, hi! I' | 08:56 |
ralonsoh | (sorry) | 08:56 |
ralonsoh | I'm going to implement the fix for https://bugs.launchpad.net/neutron/+bug/2094736 | 08:57 |
ralonsoh | using the port_binding | 08:57 |
ralonsoh | of course, in parallel, we need to figure out why we are missing the LSP events | 08:57 |
lajoskatona | ralonsoh: thanks for headsup, just reached to this patch and LP bug | 09:52 |
ykarel | ralonsoh, yes sure as mentioned over mail no objection from me on that parallel effort | 10:07 |
ykarel | lajoskatona, ralonsoh can you also check https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/939104 | 11:05 |
ralonsoh | sure | 11:06 |
opendevreview | Slawek Kaplonski proposed openstack/neutron master: Add limit of tags for every resource https://review.opendev.org/c/openstack/neutron/+/937887 | 11:26 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: reimplement signals handling https://review.opendev.org/c/openstack/neutron/+/939321 | 11:36 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: remove the usage of eventlet in the OVS agent https://review.opendev.org/c/openstack/neutron/+/937765 | 11:36 |
opendevreview | Gaudenz Steinlin proposed openstack/neutron master: Fixup conntrackd support https://review.opendev.org/c/openstack/neutron/+/938800 | 13:29 |
ihrachys | ralonsoh: ykarel_ thanks for figuring out what happened with rally. so the fact that it was q-svc WAS important :) | 13:33 |
ralonsoh | ihrachys, actually was ykarel_ only, I just didn't migrate the rally jobs... because of this issue | 13:34 |
opendevreview | Rodolfo Alonso proposed openstack/neutron master: WIP == [OVN] ``PortBindingUpdateUpEvent`` https://review.opendev.org/c/openstack/neutron/+/939345 | 13:37 |
ralonsoh | ihrachys, ^^ btw | 13:37 |
ihrachys | ralonsoh: re LSP update missing, in pb switch, hiding is a concern. apparently with https://review.opendev.org/c/openstack/devstack/+/937606 we should have more logs for northd to confirm it translates pb up to LSP up. | 13:37 |
ralonsoh | ihrachys, but I see the ovn logs that this is happening | 13:38 |
* haleyb hides in shame regarding q-svc | 13:38 | |
ralonsoh | I've reviewed several jobs | 13:38 |
ihrachys | you mean northd translated? | 13:38 |
ralonsoh | and always the LSP.up=true happens | 13:38 |
ralonsoh | but Neutron never receives it | 13:38 |
ihrachys | ok. so then it's ovsdb-server not notifying / idl layer not receiving for reasons | 13:38 |
ralonsoh | I think is the IDL layer (I think, of course, with no proof) | 13:39 |
ihrachys | are notify events logged by ovsdb-server? | 13:39 |
ralonsoh | what events? | 13:39 |
ralonsoh | I see the NB and SB events (for the port_binding and the LSP) | 13:40 |
ihrachys | also interesting if this can be seen with any versions of ovs/ovn or just the old ones from ubuntu repos | 13:40 |
ralonsoh | we started hitting this issue since the wsgi migration | 13:40 |
ralonsoh | (this is why I suspect this could be something related to the hash ring manager) | 13:40 |
ralonsoh | for example, in the https://bugs.launchpad.net/neutron/+bug/2094736 description | 13:40 |
ihrachys | ralonsoh: AFAIU neutron-server registers a monitor with ovsdb-server. then on matching events, ovsdb-server would notify about them to corresponding monitors. correct? (if so then I'd expect some way to confirm ovsdb-server does it, on ovsdb side) | 13:41 |
ralonsoh | in these logs, all the NB and SB events are correctly defined: the port_binding.up and some usecs later the LSP.up | 13:41 |
ralonsoh | ihrachys, ok, so that's extra info, for sure | 13:41 |
ralonsoh | ihrachys, let me push a patch on top of yours | 13:41 |
ihrachys | the timing with uwsgi change is damning though, I agree. | 13:42 |
ralonsoh | with repeated CI jobs and (that's important) more API workers | 13:42 |
*** ykarel_ is now known as ykarel | 13:43 | |
opendevreview | Rodolfo Alonso proposed openstack/neutron master: DNM - Test "neutron-ovn-tempest-ipv6-only-ovs*" with WSGI https://review.opendev.org/c/openstack/neutron/+/939346 | 13:43 |
ralonsoh | ihrachys, ^ | 13:43 |
ralonsoh | ahhhh no, wrong patch | 13:44 |
ykarel | and just to add we see all these issues with wsgi switch when running with >1 uwsgi workers | 13:44 |
opendevreview | Rodolfo Alonso proposed openstack/neutron master: DNM - Test "neutron-ovn-tempest-ipv6-only-ovs*" with WSGI https://review.opendev.org/c/openstack/neutron/+/939346 | 13:45 |
ralonsoh | done ^ this patch is on top of a patch that uses the devstack one | 13:45 |
ralonsoh | let's wait for the CI results | 13:45 |
opendevreview | Rodolfo Alonso proposed openstack/neutron master: DNM - Test "neutron-ovn-tempest-ipv6-only-ovs*" with WSGI - BUG 2094736 https://review.opendev.org/c/openstack/neutron/+/939347 | 13:47 |
ihrachys | let's say hash ring is the culprit - somehow the worker that grabbed it is locked. I'd expect no events to be received at all then. but we see port-binding events happening, just not LSP. are these untangled in regards to locking? | 13:48 |
ralonsoh | ihrachys, the events are hashed using the row.uuid | 13:50 |
ralonsoh | the LSP.uuid and PB.uuid are not the same | 13:51 |
ralonsoh | despite we are talking about the same Neutron port | 13:51 |
ralonsoh | so, as you said, the worker dealing with any other LSP event could be locked, but not the other one that matches pb.uuid | 13:52 |
ihrachys | meaning, one worker handles PB but not LSP. if LSP worker locked / died, we'll see pb updates but not lsps. if so, switching to pb won't do much? | 13:52 |
ihrachys | I mean, now we will - sometimes - lose pb events :) | 13:53 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: reimplement signals handling https://review.opendev.org/c/openstack/neutron/+/939321 | 13:53 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: ovs: remove the usage of eventlet in the OVS agent https://review.opendev.org/c/openstack/neutron/+/937765 | 13:53 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: async_process: remove usage of eventlet for AsyncProcess https://review.opendev.org/c/openstack/neutron/+/939348 | 13:53 |
ihrachys | (that's the theory / rambling; we'll see if WIP changes anything I guess) | 13:53 |
opendevreview | Lajos Katona proposed openstack/tap-as-a-service master: Doc: add documentation for usage and flow examples for OVS https://review.opendev.org/c/openstack/tap-as-a-service/+/828382 | 13:55 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: dnm: use native os-ken implementation https://review.opendev.org/c/openstack/neutron/+/938487 | 13:55 |
opendevreview | Lajos Katona proposed openstack/tap-as-a-service master: Doc: add documentation for usage and driver details for SRIOV driver https://review.opendev.org/c/openstack/tap-as-a-service/+/881807 | 13:55 |
opendevreview | Sahid Orentino Ferdjaoui proposed openstack/neutron master: dnm: use greenlet os-ken implementation not monkey-patched https://review.opendev.org/c/openstack/neutron/+/939057 | 13:56 |
lajoskatona | Dear Neutron Cores: please review my documentation patches for tap-as-a-service: https://review.opendev.org/q/topic:%22taas_driver_docs%22 | 13:58 |
lajoskatona | Thanks in advance | 13:59 |
ralonsoh | lajoskatona, sure | 14:00 |
ihrachys | is it normal that I see "Hash Ring loaded. 2 active nodes. 0 offline nodes" in logs posted every other minute? does it reinitialize idl or what? | 14:01 |
ihrachys | (sometimes it's also "Hash Ring loaded. 1 active nodes. 0 offline nodes") | 14:02 |
ralonsoh | ihrachys, it is reloaded, this is just a debug message | 14:02 |
ralonsoh | ihrachys, in the initial transient period, the number can change | 14:03 |
ralonsoh | but once all workers are loaded (they have updated the Neutron DB register) | 14:03 |
ralonsoh | the number should be always static | 14:03 |
ralonsoh | and all must be active | 14:03 |
ihrachys | well this is not at the start of the service; it's mid-flight | 14:05 |
ralonsoh | ihrachys, so this is a problem | 14:05 |
ralonsoh | I have some patches there under review | 14:05 |
ralonsoh | one sec | 14:05 |
ralonsoh | https://review.opendev.org/c/openstack/neutron/+/937351 | 14:06 |
ralonsoh | actually I need to check the comments | 14:06 |
ralonsoh | ihrachys, the point is that we have a thread (per worker) that is in charge of refreshing the Neutron DB hashring register | 14:07 |
ralonsoh | if the hash ring reload method (other thread) reloads (reads the active nodes) and sees a non-updated one, we'll have this issue | 14:08 |
ralonsoh | so I would go for https://review.opendev.org/c/openstack/neutron/+/937351, at least during the eventlet deprecation | 14:08 |
ralonsoh | we have issues with the GIL yield and the thread in charge of refreshing the hashring register is not executed on time | 14:09 |
ralonsoh | (that should NOT happen with kernel/preemptive threads) | 14:09 |
opendevreview | yatin proposed openstack/neutron master: [DNM] repro functional failure with test order reversed https://review.opendev.org/c/openstack/neutron/+/937757 | 14:15 |
ihrachys | in a log, I see for a while all events handled by one node only | 14:17 |
opendevreview | Rodolfo Alonso proposed openstack/neutron master: [OVN] Reduce the OVN hash ring touch interval https://review.opendev.org/c/openstack/neutron/+/937351 | 14:17 |
ihrachys | and just before the other one ceased to handle any, I see Hash Ring loaded. 0 active nodes. 2 offline nodes; then HashRing is empty, error: Hash Ring returned empty when hashing "b'22ef8001-b0d9-43fd-956b-0abd72515c54'". All 2 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster: | 14:18 |
ralonsoh | ihrachys, exactly, that happens if the hash ring loses one active node (by default, we have 2 API workers in the CI) | 14:18 |
ihrachys | neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'22ef8001-b0d9-43fd-956b-0abd72515c54'". All 2 nodes were found offline. This should never happen in a normal situation, please check the status of your cluster | 14:18 |
ihrachys | and then Hash Ring loaded. 1 active nodes. 1 offline nodes | 14:18 |
ralonsoh | please check https://review.opendev.org/c/openstack/neutron/+/937351 | 14:18 |
ihrachys | ok. so one node falls into offline. the other one should pick up its events, so it should not be the reason for lsp event lost either, right? | 14:20 |
ralonsoh | ihrachys, yes, you are right | 14:20 |
ihrachys | ok looking more. so the worker that is actually handling events says "Hash Ring loaded. 2 active nodes. 0 offline nodes" but the worker that doesn't says "Hash Ring loaded. 1 active nodes. 0 offline nodes". that's just before the time when we miss the LSP update | 14:26 |
ihrachys | this looks like maybe there's a split reality situation | 14:27 |
ihrachys | the node that ceased to handle events expects the other one to pick up the work | 14:27 |
ihrachys | but the other one still believes the retired worker will continue handling its events? | 14:28 |
ralonsoh | ihrachys, that's the point, each node refreshes its own hashring manager independently, based on the DB status | 14:28 |
ralonsoh | but this refresh operation is not done at the same time | 14:29 |
ralonsoh | so yes, that could lead to a situation where 2 nodes can discard an event because it doesn't belong to them (according to their own local hashring managers) | 14:29 |
ihrachys | ok so is it then... inherently racy? | 14:29 |
ralonsoh | so far, is the best implementation we have | 14:30 |
ihrachys | ok and to confirm I'm not piling up, just trying to make sense :) | 14:31 |
ihrachys | so bumping timeouts in your patch is in hope that workers fall offline less often? | 14:31 |
ralonsoh | yes | 14:31 |
ralonsoh | and reducing the refresh time | 14:31 |
ihrachys | see, I'm slow but I'm getting there! | 14:32 |
ihrachys | ralonsoh: and what's the theory of why wsgi switch made it worse? | 14:43 |
ralonsoh | ihrachys, to be honest, I'm not sure. But the main problems we have are related to the hash ring | 14:46 |
ralonsoh | missing events, nodes offline, etc | 14:46 |
ralonsoh | these problems are not present in Ml2/OVS | 14:47 |
ihrachys | the theory of touching the node more often seems reasonable BUT | 14:49 |
ihrachys | we touch nodes in notify() | 14:49 |
ihrachys | and I see events handled by the worker-about-to-become-offline just second(s) before it goes offline | 14:49 |
ihrachys | I'd think any ovsdb-monitor event would refresh the timer in db? | 14:50 |
ralonsoh | ihrachys, that would be too much, to always update a register in the DB | 14:59 |
ralonsoh | having said that, this table has no references to other tables (that removes xreferences issues), the table is small and we are not modifying the indexes | 15:00 |
ralonsoh | so this touch should be very fast | 15:01 |
ralonsoh | ihrachys, btw, we have CI results: https://review.opendev.org/c/openstack/neutron/+/939346 | 15:02 |
ralonsoh | this is on top of your change, that uses the devstack patch | 15:02 |
ihrachys | ralonsoh: we already touch; it's not a suggestion, it's in code | 15:03 |
ihrachys | https://github.com/openstack/neutron/blob/585ea689d5d26356e28d8eb47f6d0511d21806cf/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L835 | 15:04 |
ihrachys | we see in logs the debug messages from line 852 from the worker | 15:04 |
ralonsoh | ah right, just after an event | 15:04 |
ihrachys | I don't see how this message can pop up without node also being touched. (assuming 30 seconds passed). so that's weird. | 15:06 |
ralonsoh | ihrachys, sorry, what do you mean? | 15:06 |
ihrachys | the linked code, it seems that if ovsdb-monitor event is being handled, and the node is past HASH_RING_CACHE_TIMEOUT then the event handler would also touch the node in db. | 15:08 |
ihrachys | but then it's not clear why it would immediately be seen as offline by the same worker | 15:08 |
ihrachys | except there's some caching of timestamps involved, so what is written in db may not necessarily be reflected into cache. maybe it should... | 15:09 |
ralonsoh | nononono | 15:10 |
ralonsoh | hold on | 15:10 |
ralonsoh | self._last_touch is a local variable | 15:10 |
ralonsoh | not the DB timestamp | 15:10 |
ihrachys | yeah I know. there's also _node_last_touch inside the hash manager | 15:10 |
ihrachys | so when it get_node(), it uses the cached timestamps | 15:11 |
ihrachys | but when it touches, it touches db; but at the same time, cache is not updated with the new timestamp. | 15:11 |
ihrachys | wonder if it should? | 15:11 |
ralonsoh | btw, I'm checking https://fdc1d05a9a337a8993b4-089607d394060d72ce519e30966a0033.ssl.cf2.rackcdn.com/939346/2/check/neutron-ovn-tempest-ipv6-only-ovs-release-wsgi-2/4eb0215/testr_results.html | 15:12 |
ralonsoh | in particular the port for instance 3c49676b-0677-4021-b6e6-4a6ee65f0704 | 15:12 |
ralonsoh | in https://fdc1d05a9a337a8993b4-089607d394060d72ce519e30966a0033.ssl.cf2.rackcdn.com/939346/2/check/neutron-ovn-tempest-ipv6-only-ovs-release-wsgi-2/4eb0215/controller/logs/ovn/ovn/ovsdb-server-nb_log.txt | 15:12 |
ralonsoh | that is LSP.uuid=10146369-053d-446e-a5fe-ba09158c3b45 | 15:13 |
ralonsoh | where is the LSP.up field defined there? | 15:13 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: WIP: refresh hash ring timestamp cache on touch https://review.opendev.org/c/openstack/neutron/+/939354 | 15:13 |
ihrachys | like in ^ (sorry too lazy to stash some other changes that I have locally) | 15:13 |
ralonsoh | ihrachys, nonono | 15:14 |
ralonsoh | https://review.opendev.org/c/openstack/neutron/+/936838 | 15:14 |
ralonsoh | I explicitly added this parameter | 15:15 |
ralonsoh | in order to read the init time (passed via WSGI config) | 15:15 |
ihrachys | ignore this. look at where I add refresh() to touch | 15:15 |
ihrachys | (should have backed these changes out not to confuse you) | 15:16 |
ihrachys | just this https://review.opendev.org/c/openstack/neutron/+/939354/1/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#845 | 15:16 |
ralonsoh | ihrachys, but this is actually the opposite to https://review.opendev.org/c/openstack/neutron/+/937351 | 15:17 |
ihrachys | without this, our touch of db does not affect the worker view of hash ring state (including its own state) and if the maint thread doesn't get to refresh() on time, then the touch was for naught (for this current worker) | 15:17 |
ralonsoh | yes but this refresh also reads other hash ring registers | 15:18 |
ralonsoh | (that should be updated, of course) | 15:18 |
ihrachys | let it do that, not sure what the problem with it is (but if that's a concern, we can refresh just ourselves) | 15:19 |
ralonsoh | ihrachys, ok, let's push a patch with this line change only | 15:19 |
ihrachys | the reduction of interval for cache update may also be good. not sure about cache timeout bump. | 15:19 |
ralonsoh | ihrachys, please check the CI logs that I mentioned: https://fdc1d05a9a337a8993b4-089607d394060d72ce519e30966a0033.ssl.cf2.rackcdn.com/939346/2/check/neutron-ovn-tempest-ipv6-only-ovs-release-wsgi-2/4eb0215/testr_results.html | 15:20 |
ihrachys | (actually I should probably do refresh=True there too) | 15:20 |
ihrachys | will check in a sec, let me push the refresh one first | 15:21 |
ralonsoh | I don't see the LSP.UP event | 15:21 |
ralonsoh | sure | 15:21 |
ihrachys | (scratch above, refresh means refresh=True already) | 15:21 |
ralonsoh | no no, sorry, it is there | 15:22 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: ovn: Refresh hash ring timestamp cache on touch_node https://review.opendev.org/c/openstack/neutron/+/939357 | 15:26 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: WIP: nit: Remove _last_touch attribute for OvnIdlDistributedLock https://review.opendev.org/c/openstack/neutron/+/939359 | 15:29 |
ihrachys | ralonsoh: how can I test the patch with refresh I wonder?.. is there a way to thrash the gate to the point that it would always fail with the problem? | 15:29 |
ralonsoh | ihrachys, sure, one sec | 15:30 |
ralonsoh | cherry pick this https://review.opendev.org/c/openstack/neutron/+/939346/. Remove the change-id (to create a new one) | 15:30 |
ralonsoh | push it on top of your patch | 15:30 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: DNM - Test "neutron-ovn-tempest-ipv6-only-ovs*" with WSGI https://review.opendev.org/c/openstack/neutron/+/939360 | 15:32 |
ihrachys | ralonsoh: to confirm, nothing to check in your logs above? | 15:33 |
ralonsoh | ihrachys, well, only to confirm that the problem is happening again | 15:33 |
ralonsoh | we have the LSP.up event | 15:33 |
ralonsoh | 2025-01-15T14:03:05.984Z|06409|jsonrpc|DBG|ssl:[2607:5300:201:2000::743]:40650: send notification, method="update3", params=["cfcf56ee-d348-11ef-aefb-fa163e340894","00000000-0000-0000-0000-000000000000",{"Logical_Switch_Port":{"b95d0ee3-cb36-4b02-87c9-1d07a051ab36":{"modify":{"up":true}}}}] | 15:33 |
ralonsoh | but nothing is received in Neutron API | 15:34 |
ihrachys | ok great that we can validate ovsdb-server is doing the right thing anyway | 15:34 |
ihrachys | I'll have to step down from this hash ring issue for the next few hours. will check results in ci later. thanks ralonsoh for bearing with me stoopid questions :p | 15:35 |
ralonsoh | a pleasure | 15:36 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: nit: Remove unused updated_at argument for touch_node https://review.opendev.org/c/openstack/neutron/+/939366 | 16:39 |
lajoskatona | otherwiseguy: Hi, there's an ovsdbapp bug perhaps you can decide if it can be something to check in detail: https://bugs.launchpad.net/ovsdbapp/+bug/2093247 , thanks in advance | 16:53 |
-opendevstatus- NOTICE: The paste service at paste.opendev.org will have a short (15-20) minute outage momentarily to replace the underlying server. | 17:08 | |
opendevreview | Lajos Katona proposed openstack/tap-as-a-service master: Doc: add documentation for usage and driver details for SRIOV driver https://review.opendev.org/c/openstack/tap-as-a-service/+/881807 | 17:11 |
lajoskatona | haleyb: Hi, if you have some free time for doc patches for taas: https://review.opendev.org/q/topic:%22taas_driver_docs%22 ;) | 17:12 |
opendevreview | Merged openstack/neutron master: Move neutron rally jobs to wsgi https://review.opendev.org/c/openstack/neutron/+/939315 | 17:24 |
ihrachys | still events missed with my refresh() patch, now in log I see: | 17:38 |
ihrachys | CRITICAL neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance [-] The number of nodes in the Hash Ring (4) is higher than the number of API workers (2) for host "np0039575603". Something is not right and OVSDB events could be missed because of this. Please check the status of the Neutron processes, this can happen when the API workers are killed and restarted. | 17:38 |
ihrachys | Restarting the service should fix the issue, see LP #2024205 for more information. | 17:38 |
ihrachys | it's weird though, don't we set processes: 4 in https://review.opendev.org/c/openstack/neutron/+/939360/1/zuul.d/tempest-singlenode.yaml#821 | 17:40 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: refactor: Remove OvnIdlDistributedLock._last_touch attribute https://review.opendev.org/c/openstack/neutron/+/939359 | 17:58 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: DNM - Test "neutron-ovn-tempest-ipv6-only-ovs*" with WSGI https://review.opendev.org/c/openstack/neutron/+/939360 | 18:05 |
opendevreview | Brian Haley proposed openstack/neutron master: Optionally configure IPv6 metadata address https://review.opendev.org/c/openstack/neutron/+/926497 | 18:13 |
opendevreview | Merged openstack/neutron-tempest-plugin master: Always create router interface for ipv6 metadata test https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/939104 | 18:31 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: functional: Handle ovsdb monitor returning inserts in different checks https://review.opendev.org/c/openstack/neutron/+/939384 | 18:50 |
haleyb | ihrachys: oh, that seems to make the functional tests happy ^^ | 20:44 |
ihrachys | one can hope, yes | 20:49 |
haleyb | i ran a recheck to double-check | 20:51 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: When storing _last_time_loaded for hash ring, use a lower timestamp https://review.opendev.org/c/openstack/neutron/+/939397 | 20:52 |
ihrachys | I mean, I haven't even run the tests with the change, it's straight from my brain :p | 20:53 |
ihrachys | haleyb: it's interesting that this test started to misbehave, don't you think | 21:07 |
ihrachys | since we are dealing with missed ovn events from monitor elsewhere | 21:08 |
ihrachys | I don't think anyone touched the test case since forever | 21:08 |
haleyb | no, but we did bump to a later OVN/OVS recently i think | 21:08 |
ihrachys | haleyb: apparently unit tests now trigger sudo? https://zuul.opendev.org/t/openstack/build/31c980b9fdb5498ca392ec863c6d1370 | 21:20 |
ihrachys | btw I also noticed that my mac asked me to fingerprint for sudo when I ran tests yesterday; I denied though thought it's something about macos; but looks like maybe something creeped into the code base | 21:21 |
haleyb | ihrachys: probably a missing mock somewhere, that's my bet | 21:21 |
ihrachys | reported here https://bugs.launchpad.net/neutron/+bug/2095044 | 21:23 |
ihrachys | yeah, probably a mock | 21:23 |
ihrachys | I also saw some other tests calling e.g. tc (for qos?) and failing on mac, which suggests that some more mocks may be missing (since unit tests should never call to system) | 21:23 |
haleyb | there is at least one other bug for a missing mock that i filed, for segment tests https://bugs.launchpad.net/neutron/+bug/2038373 | 21:25 |
ihrachys | in other news, functional is sometimes busted for a different reason, see https://zuul.opendev.org/t/openstack/build/d8a35748226140408d088f5273a71999 | 21:32 |
ihrachys | ovsdbapp.exceptions.TimeoutException: Commands [AddBridgeCommand(_result=None, name=ovs-test-d9a790, may_exist=True, datapath_type=system), DbAddCommand(_result=None, table=Bridge, record=ovs-test-d9a790, column=protocols, values=('OpenFlow13', 'OpenFlow14', 'OpenFlow10')), DbSetCommand(_result=None, table=Bridge, record=ovs-test-d9a790, col_values=(('other_config', | 21:32 |
ihrachys | {'mac-table-size': '50000'}),), if_exists=True)] exceeded timeout 30 seconds, cause: TXN queue is full | 21:32 |
ihrachys | oh fun | 21:32 |
haleyb | my head hurts | 21:33 |
ihrachys | this exact error was mentioned in the rally bug report https://bugs.launchpad.net/neutron/+bug/2094970 | 21:33 |
ihrachys | (about TXN queue is full) | 21:33 |
ihrachys | haleyb: being a certified Debby Downer, I'll say that I think the team gets into a habit of letting failures pile up (bare rechecks etc.) until it becomes untenable... then the heads indeed hurt :p | 21:35 |
haleyb | ihrachys: that does happen sometimes and expect $(someone_else) to fix them, it's hard to spend half a day tracking down a single failure and there's been >2 people doing it recently | 21:42 |
haleyb | i do want to know how you managed to get haproxy in that call trace, it's not remotely near that codepath | 21:45 |
ihrachys | it's not me, it's zuul | 21:52 |
ihrachys | that's on a very complex patch, see https://review.opendev.org/c/openstack/neutron/+/939272 :) | 21:53 |
ihrachys | something tells me this is not the patch that broke it | 21:53 |
haleyb | @mock.patch('eventlet.spawn_n') - i will blame it on eventlet since that's the easy thing to do :) | 21:53 |
haleyb | that patch! ygbfkm | 21:54 |
ihrachys | :) i'll blame both mock and eventlet | 21:56 |
haleyb | yahtzee | 21:57 |
ihrachys | still unclear what happens with hash ring members. apparently the thread to touch nodes is not running / blocked and they over time fall into offline state, sometimes. apparently no one has a good theory of why the thread is blocked, except a vague notion of GIL issues? | 21:59 |
haleyb | that is maybe a question for rodolfo | 22:23 |
haleyb | i'll be back again to look tomorrow | 22:42 |
opendevreview | Merged openstack/neutron master: functional: Handle ovsdb monitor returning inserts in different checks https://review.opendev.org/c/openstack/neutron/+/939384 | 22:49 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: WIP: ovn: don't attempt to create router port when no fixed-ips are set https://review.opendev.org/c/openstack/neutron/+/939253 | 22:51 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: Add option to configure live migration activation strategy for OVN https://review.opendev.org/c/openstack/neutron/+/938106 | 22:51 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: Remove shorturls from code https://review.opendev.org/c/openstack/neutron/+/939272 | 22:51 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: refactor: Remove OvnIdlDistributedLock._last_touch attribute https://review.opendev.org/c/openstack/neutron/+/939359 | 22:51 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: nit: Remove unused updated_at argument for touch_node https://review.opendev.org/c/openstack/neutron/+/939366 | 22:52 |
opendevreview | Ihar Hrachyshka proposed openstack/neutron master: Remove linuxbridge driver https://review.opendev.org/c/openstack/neutron/+/927216 | 22:53 |
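
Editor's sketch of the "split reality" race discussed between 13:50 and 14:29: events are hashed by row UUID, and each worker decides ownership from its own locally cached view of the active nodes. The code below is only an illustration under stated assumptions. The names `owner`, `handles`, `worker-a`, and `worker-b` are hypothetical, and plain modulo hashing stands in for whatever ring implementation Neutron actually uses.

```python
import hashlib


def owner(row_uuid, active_nodes):
    """Pick the node responsible for a row, given ONE worker's view of the ring.

    Hypothetical helper: simple md5-modulo hashing, not Neutron's real hash ring.
    """
    if not active_nodes:
        raise LookupError("hash ring is empty")
    nodes = sorted(active_nodes)
    digest = hashlib.md5(row_uuid.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def handles(me, row_uuid, my_view):
    """A worker processes an event only if its own view says it owns the row."""
    return owner(row_uuid, my_view) == me


# worker-a still believes both workers are active.
view_a = {"worker-a", "worker-b"}
# worker-b has already reloaded and no longer counts itself as active.
view_b = {"worker-a"}

# Find a row that worker-a's view assigns to worker-b.
lost = next(
    u for u in (f"row-{i}" for i in range(1000))
    if owner(u, view_a) == "worker-b"
)

# Both workers conclude "not mine", so the event is dropped by everyone.
print(handles("worker-a", lost, view_a), handles("worker-b", lost, view_b))
```

Because the LSP row UUID and the Port_Binding row UUID differ, the two events for the same Neutron port can hash to different workers, which is consistent with seeing the port-binding update but not the LSP one.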
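
Editor's sketch of the cache concern raised between 15:09 and 15:17: touching a node writes a fresh timestamp to the DB, but liveness is judged from a locally cached snapshot, so a worker can look offline to itself until the next refresh. This is a minimal model under assumptions, not the Neutron code. `RingView`, its attributes, and the `refresh` flag on `touch` are hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```python
class RingView:
    """One worker's view of the hash ring (hypothetical model)."""

    def __init__(self, db, timeout):
        self.db = db              # shared store: node name -> last touch time
        self.timeout = timeout    # seconds after which a node counts as offline
        self.cache = dict(db)     # local snapshot used for liveness checks

    def refresh(self):
        """Reload the local snapshot from the shared store."""
        self.cache = dict(self.db)

    def touch(self, node, now, refresh=False):
        """Record a heartbeat in the shared store, optionally refreshing the cache."""
        self.db[node] = now
        if refresh:
            self.refresh()

    def active(self, now):
        """Nodes considered alive according to the CACHED timestamps."""
        return sorted(
            n for n, t in self.cache.items() if now - t <= self.timeout
        )


db = {"w1": 0, "w2": 0}
view = RingView(db, timeout=60)

# Heartbeat reaches the DB, but the cached timestamp is still 0.
view.touch("w1", now=90)
stale = view.active(now=100)

# Same heartbeat with a cache refresh: w1 is now seen as active.
view.touch("w1", now=95, refresh=True)
fresh = view.active(now=100)

print(stale, fresh)
```

In this model the first heartbeat "was for naught" for the current worker, which is the behaviour the refresh-on-touch patch tries to avoid; whether that alone fixes the missed events was still open at the end of the log.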