TheJulia | doing some runs with 12 which is 1.5x my cores, seeing some stuff, but I'm also running with --parallel-class | 00:00 |
---|---|---|
stevebaker[m] | like dial concurrency down for reliability? | 00:12 |
JayF | Dial it up to find and fix more races! | 00:12 |
JayF | It should be safe | 00:12 |
TheJulia | the higher we go, the harder it will be to pin down | 00:13 |
TheJulia | unless it is purely cpu pressure | 00:13 |
JayF | I was thinking tomorrow I might write significant debug logging into those migration tests | 00:14 |
TheJulia | we likely need to spit out logging on each test start | 00:14 |
JayF | Normally with this kind of deadlock I would worry about the logger never getting a chance to print the message, but since it's waiting on the database I think it should make it to the disk. | 00:14 |
TheJulia | with test class name | 00:15 |
JayF | Yeah, I have a few ideas that are specific to test_migrations.py | 00:15 |
TheJulia | we also prevent buffered writes | 00:15 |
TheJulia | intentionally | 00:15 |
JayF | I haven't thought of anything that would be more universal than that | 00:15 |
TheJulia | adding --parallel-class might make the db stuff just go away (or happen more often!) | 00:16 |
TheJulia | to the stestr run | 00:16 |
JayF | I think from reading the description of that, that I was working under the assumption that's how it worked already | 00:18 |
TheJulia | by default it spreads it out | 00:19 |
TheJulia | randomly! | 00:19 |
* TheJulia expects people to scream with many curse wrods | 00:19 | |
TheJulia | words | 00:19 |
JayF | I guess for purposes of the migration tests, they still work the way I expect because it's one big test running all of those little methods serially | 00:20 |
TheJulia | the primary ones, yeah | 00:20 |
TheJulia | and the risk of stomping there is super high | 00:21 |
JayF | I'm on board for mitigation as long as we're certain we're not actually reducing our ability to test the code. I'd be happy to test our tests less and test our code more. | 00:24 |
JayF | Right now I don't think we've actually found any primary code path broken, it's all just been tests, yeah? | 00:24 |
JayF | All right, sandwiches are here for dinner, I'm going to get off of IRC for the evening. o/ | 00:25 |
TheJulia | the tests, interestingly, do seem to execute with a decent speed on my machine with 6 runners, only seen anything crazy with 14 or more | 00:56 |
opendevreview | Merged openstack/ironic master: Disable WAL Pragma for Unit Testing https://review.opendev.org/c/openstack/ironic/+/886865 | 01:24 |
opendevreview | Merged openstack/ironic master: Handle SAWarning around allocations FK Constratins https://review.opendev.org/c/openstack/ironic/+/886871 | 01:26 |
TheJulia | *gasp* | 01:28 |
TheJulia | Two in a row! | 01:28 |
iurygregory | \o/ | 03:05 |
* iurygregory dances | 03:05 | |
rpittau | good morning ironic! o/ | 07:11 |
rpittau | I'm going to increase the timeout in https://review.opendev.org/c/openstack/ironic/+/886985 for all tests to 60 as the test walkers can take longer to finish, unless we want to completely change the TestWalkVersions class | 07:30 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: CI: Change migrations timeout to be >60 seoncds https://review.opendev.org/c/openstack/ironic/+/886985 | 07:31 |
Nisha_Agarwal | morning ironic! | 08:54 |
Nisha_Agarwal | rpittau, o/ | 08:54 |
rpittau | hey Nisha_Agarwal :) | 08:56 |
Nisha_Agarwal | rpittau, we were looking for an update on bug https://bugs.launchpad.net/proliantutils/+bug/2021995.....we still need to try to reproduce the issue but if we know the iLO version it will help us to reproduce and even initaite discussion with firmware team | 08:58 |
rpittau | Nisha_Agarwal: I think that is entirely in JayF jands :) | 08:59 |
rpittau | *hands | 08:59 |
Nisha_Agarwal | hmm | 09:02 |
opendevreview | Huy Mai proposed openstack/sushy-tools master: Add fake_ipa inspection, lookup and heartbeater to fake system https://review.opendev.org/c/openstack/sushy-tools/+/875366 | 09:38 |
opendevreview | Verification of a change to openstack/ironic master failed: Revert "Fix IRONIC_IMAGE_NAME=non-existent-image" https://review.opendev.org/c/openstack/ironic/+/886960 | 11:17 |
iurygregory | good morning Ironic | 11:51 |
iurygregory | gerrit is down? | 11:55 |
dtantsur | iurygregory: seems to work for me | 11:56 |
iurygregory | .-. I'm getting Timed Out on my browser .-. | 11:56 |
iurygregory | time to clean the cookies, close the browser and see if works | 11:57 |
iurygregory | ok, sometimes it works and sometimes it doesn't lol | 12:11 |
iurygregory | ok, probably something wrong with my ISP .-. | 12:33 |
TheJulia | good morning | 13:17 |
TheJulia | scottsol: you might be able to provide clarity for Nisha_Agarwal | 13:18 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: Test class based test distribution https://review.opendev.org/c/openstack/ironic/+/887166 | 13:30 |
TheJulia | iurygregory: looks like tests on https://review.opendev.org/c/openstack/ironic/+/883062 fail legitimately | 13:44 |
iurygregory | TheJulia, yeah =) | 13:44 |
iurygregory | I'm updating things locally | 13:44 |
TheJulia | k | 13:44 |
iurygregory | and trying to address dtantsur comments, but gerrit is not helping too much today -.-' | 13:45 |
TheJulia | heh | 13:45 |
iurygregory | I'm looking at gerrit via my smartphone using 4g, because my wifi doesn't let me access LOL | 13:45 |
TheJulia | rpittau: what is your take on CI at this point? I rechecked some of the same things you've rechecked since the FK constraint change merged and still had failures, although it looks like deadlock timeout style failures where *even* it looks like we got through test_migrations, at least at a glance | 13:46 |
TheJulia | hmmmm, 12 executed | 13:47 |
TheJulia | 16 should have executed | 13:48 |
TheJulia | so still something, maybe | 13:48 |
rpittau | TheJulia: I've increased the timeout in https://review.opendev.org/c/openstack/ironic/+/886985 because I realized that the TestWalkVersions class depended on that as it's based on oslo test base class, but I see it failed again for timeout | 13:50 |
rpittau | the tests should just fail if they pass the timeout value, but I'm wondering if SIGALARM is enough to stop them completely in this case | 13:50 |
TheJulia | not in a deadlock state | 13:51 |
TheJulia | since the runner is sitting in epoll waiting for data on a socket | 13:51 |
TheJulia | which the DB cannot give it | 13:51 |
TheJulia | something is still hanging in all of that | 13:52 |
rpittau | yeah | 13:53 |
rpittau | the timeout fixtures is useful until a certain point, unfortunately | 13:53 |
* TheJulia works on augmenting our logging | 14:06 | |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: Add more debugging to tie test class, test, state https://review.opendev.org/c/openstack/ironic/+/887168 | 14:16 |
TheJulia | so in the good news category, we're not handing any raw data back past through test_walk_versions | 14:16 |
TheJulia | well, an object | 14:16 |
NobodyCam | Good Morning Ironic Folks happy hump day | 14:43 |
TheJulia | good morning! | 14:48 |
zorun | hi there | 14:50 |
TheJulia | greetings | 14:50 |
zorun | good afternoon for CEST folks :) | 14:51 |
zorun | I'm trying to wrap my head again around networking-baremetal: what is it used for? | 14:51 |
zorun | according to https://docs.openstack.org/ironic/latest/admin/multitenancy.html#configuring-the-networking-service it's used for "flat" networking | 14:51 |
zorun | while NGS or model-specific ML plugins would be used for "neutron" networking | 14:52 |
TheJulia | it does sort of two things | 14:52 |
TheJulia | It is required to update the physnet mapping data in neutron | 14:52 |
zorun | but I remember that there was a use-case where both plugins are used | 14:52 |
TheJulia | it also has some openconfig/netconfig functionality under development | 14:52 |
TheJulia | to kind of replace networking-generic-switch, but the reality is it may never do that because of how vendors operate | 14:53 |
TheJulia | and configuraiton comfort levels | 14:53 |
TheJulia | hjensas: we should likely clarify some docs | 14:53 |
zorun | TheJulia: so the first part is where the networking-baremetal agent pulls ports state from Ironic, and pushes it to Neutron? | 14:53 |
TheJulia | basically yes | 14:54 |
TheJulia | specifically for the physnet information | 14:54 |
zorun | so, when you create a baremetal port, you have to specify a physnet | 14:57 |
zorun | what does the physnet mapping map to? | 14:57 |
TheJulia | a neutron physnet | 14:58 |
TheJulia | being a physical fabric | 14:58 |
zorun | ah, is it a mapping from node to physnet? | 14:59 |
TheJulia | yes, think of it as if you had multiple distinct physical network fabrics | 15:00 |
TheJulia | The interactions need to know which things are plugged into | 15:00 |
TheJulia | otherwise you can end up in some awesome error states | 15:00 |
zorun | thanks for the explanation, but I feel I'm missing a key piece of the puzzle | 15:06 |
zorun | does Ironic manages the network ports of nodes internally, independently from Neutron? | 15:06 |
TheJulia | I'm definitely not an expert in that area, but if you have additional questions, i might be able to point you in the right direction | 15:07 |
TheJulia | so, think of ironic as the holder of the knowledge of the physical machine (i.e., mac address, and where things are plugged in) | 15:07 |
TheJulia | we use neutron largely for dhcp, and the actual sdn/switch integration | 15:07 |
opendevreview | Verification of a change to openstack/ironic master failed: Use tox env variables in coverage tests https://review.opendev.org/c/openstack/ironic/+/885507 | 15:09 |
zorun | so typically, you would create an "ironic port" with the local-link-connection information only once, when you initially enroll the node? | 15:09 |
TheJulia | correct | 15:10 |
zorun | and whenever a user wants a baremetal node, they (or probably) would create a Neutron port, which would then be dynamically "glued" to this ironic port? | 15:10 |
zorun | thanks, it's starting to make sense | 15:10 |
zorun | probably *Nova | 15:10 |
TheJulia | ideally with physnet information if your needing it, since the logical how humans have assembled things are not visible at that layer | 15:10 |
TheJulia | well, that being local link connection information | 15:11 |
TheJulia | it can typically be discovered through introspection, fwiw | 15:11 |
TheJulia | yes, nova or a baremetal user directly creates a neutron port, we call this a vif, and then ironic requests the port to be plugged in neutron. When we do this we ship a bunch of information up to neturon | 15:12 |
TheJulia | neutron either succeeds or fails based upon the data, the physnet, and everything else | 15:12 |
zorun | so you copy the physnet and the local-link-connection data to the VIF port? | 15:13 |
TheJulia | when the bind occurs, I think we send local-link-connection info | 15:13 |
TheJulia | physnet don't remember exactly | 15:13 |
zorun | ok, thanks | 15:14 |
zorun | now I understand better why you possibly wanted Ironic and NGS to communicate directly instead of going through Neutron | 15:14 |
TheJulia | because a lot of folks don't want to run neutron | 15:15 |
zorun | after looking at the code, networking-baremetal and networking-generic-switch are somewhat redundant (at least for the physical device reconfiguration part) | 15:53 |
zorun | they both implement the ML2 mechanism driver API to catch network and port creation/deletion/binding | 15:54 |
zorun | and they both have some idea of internal "drivers" | 15:54 |
zorun | networking-baremetal only has a netconf/openconfig driver | 15:55 |
zorun | networking-generic-switch only has a SSH-based netmiko driver (with many supported vendors and models) | 15:55 |
zorun | but both seem to be easily extendable by adding additional drivers (again, this is completely internal, nothing related to Neutron) | 15:56 |
rpittau | good night! o/ | 15:59 |
JayF | o/ | 16:00 |
zorun | funny, both projects were initially started by the same person?! | 16:00 |
dtantsur | they had initially very different goals | 16:01 |
dtantsur | the netconf/openconfig driver was a much later addition | 16:01 |
TheJulia | Also more forward looking, but networking moves at its own speed | 16:02 |
zorun | yes, it seems the original use-case was to use networking-baremetal for flat networks, and networking-generic-switch for tenant "neutron" networks | 16:03 |
TheJulia | ahh, yeah, on flat ports nothing says done so neutron automatically thinks it failed | 16:08 |
TheJulia | then physnet mapping went in | 16:08 |
TheJulia | and the rest is history | 16:08 |
zorun | from my (limited) perspective, it seems that netconf support would have been a better fit for NGS | 16:14 |
zorun | we have possible plans to extend NGS beyond netmiko devices, by using REST APIs provided by recent Aruba switches | 16:15 |
zorun | that's still a long way off though | 16:15 |
opendevreview | Julia Kreger proposed openstack/ironic master: CI: Fix PXE Ananconda cleanup test https://review.opendev.org/c/openstack/ironic/+/887198 | 16:39 |
TheJulia | zorun: yeah, I think one aspect is comfort of access rights... becuase I think netconf it is all or nothing, but with ssh there is ability to restrict based upon user role | 16:44 |
TheJulia | as indicated by the AAA service | 16:44 |
TheJulia | we see a variety of comfort or discomfort with touching network infrastructure | 16:44 |
TheJulia | it is... really... weird. | 16:44 |
JayF | we broke the wall between dev and ops | 16:44 |
JayF | to create the DEVOP | 16:45 |
JayF | next up: the NETDEVOP | 16:45 |
JayF | the CCNAs hear footsteps and fight the coming future ;) | 16:45 |
TheJulia | Suddenly the door explodes open with the lawyers and accountants demanding separation of duties | 16:45 |
TheJulia | a fight ensues, who will win! News at 11! | 16:45 |
JayF | it's OK, we have someone sitting in a box labelled NOC who has to push the button that says 'needful' four times a minute to actually push the operational changes | 16:46 |
zorun | at some point, eBPF engineers could replace networking engineers, and almost everything will run on Linux :) | 16:46 |
zorun | of course the physical layer is the difficult part here | 16:46 |
TheJulia | JayF: ahh, I met with someone who indicated that if that button has to be pushed, it is a 2 month process | 16:46 |
zorun | well, many switches actually run Linux already | 16:47 |
JayF | TheJulia: you didn't just run past them and dramatically dive for the button? :D | 16:49 |
zorun | TheJulia: networking-baremetal does Netconf over SSH. I guess it *could* be possible to restrict access rights (but I don't know enough about netconf) | 16:49 |
JayF | netconf itself doesn't have the access controls you need, whereas traditional ssh connections do in a lot of switches | 16:49 |
JayF | AIUI from user reports | 16:50 |
zorun | I like the ideas behind Netconf, it's just the vendor implementations that currently suck | 16:50 |
zorun | JayF: oh, I see. I guess in 10-20 years Netconf vendor implementations will reach feature parity | 16:51 |
TheJulia | JayF: I was in a secure floor on a conference room... bypassing process would have been very bad. | 16:51 |
TheJulia | zorun: nor do I, and it could still be vendor specific, like some devices out there have no AAA concepts... really. | 16:52 |
opendevreview | Verification of a change to openstack/ironic master failed: Use jammy for base jobs https://review.opendev.org/c/openstack/ironic/+/869052 | 16:54 |
dtantsur | TheJulia: hi, a quick question. I'm adding missing tests for the new inspection API, copy pasting the RBAC pattern resulted in "Policy baremetal:node:ipa_continue_inspection has not been registered". Do you know why? | 16:55 |
TheJulia | sounds like your test class/file or code tests are not loading the code policies | 16:56 |
TheJulia | from ironic/common/policy.py | 16:56 |
TheJulia | at least, that is my guess | 16:56 |
dtantsur | TheJulia: it's really a copy of the already working TestHeartbeatScopedRBAC.. | 16:57 |
zorun | I'm off for the evening, thanks for the interesting discussion TheJulia | 16:59 |
TheJulia | in the same file then... | 17:00 |
TheJulia | dtantsur: I guess I'd need to look at the the current state | 17:01 |
dtantsur | I can post the WIP patch, maybe you notice something? | 17:02 |
TheJulia | sure | 17:02 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Add missing unit tests for /continue_inspection https://review.opendev.org/c/openstack/ironic/+/887202 | 17:02 |
dtantsur | TheJulia: ^^ | 17:02 |
TheJulia | so you need to check your policy names :( | 17:05 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: Add DB API for Firmware and Object https://review.opendev.org/c/openstack/ironic/+/883062 | 17:05 |
TheJulia | baremetal:driver:ipa_continue_introspection is in one place, and baremetal:node:ipa_continue_introspection is in another | 17:05 |
TheJulia | ... in the code :( | 17:05 |
TheJulia | that first place is the policy definition | 17:05 |
dtantsur | I may be blind, but grepping continue_intro doesn't give anything | 17:06 |
dtantsur | It looks to me that I have the same text.. | 17:06 |
TheJulia | sorry ipa_continue_inspection | 17:06 |
TheJulia | I'm typing it out based on memory since I've got a failed test job log in by paste buffer which may yield an understanding of *where* we're blowing up in test_migraitons | 17:07 |
dtantsur | ahhhh, I have node:continue_inspection in one place and driver:continue_inspection in the other | 17:08 |
dtantsur | damn | 17:08 |
dtantsur | thank you, you led me to the right thought | 17:08 |
opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: Firmware Interface https://review.opendev.org/c/openstack/ironic/+/885276 | 17:09 |
TheJulia | well, we crashed it seems in _pre_upgrade_e918ff30eb42 | 17:13 |
JayF | that looks broken | 17:17 |
JayF | https://github.com/openstack/ironic/blob/master/ironic/tests/unit/db/sqlalchemy/test_migrations.py#L1387 | 17:18 |
JayF | we are expecting a DB error | 17:18 |
JayF | then continue in the *same txn* to insert good data | 17:18 |
JayF | shouldn't the expected fail and the expected success be in separate cxtmgr, and separate txns? | 17:18 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Correct the policy name for /continue_inspection https://review.opendev.org/c/openstack/ironic/+/887202 | 17:18 |
TheJulia | well, you can fail in a transaction and retry | 17:18 |
TheJulia | the transaction still lives unless you have a database fatal sort of thing | 17:19 |
TheJulia | I guess we need some more data on if we fail there, or not | 17:29 |
TheJulia | seems like we have trust issues with the db, dunno | 17:33 |
opendevreview | Julia Kreger proposed openstack/ironic master: CI: Fix PXE Ananconda cleanup test https://review.opendev.org/c/openstack/ironic/+/887198 | 17:57 |
TheJulia | That has passed cleanly twice on my parallel class change, so I suspect it is likely good on master branch as well | 17:58 |
JayF | where is your parallel class change? I can take a look | 18:00 |
JayF | or you just mean locally, not proposed to CI yet? | 18:00 |
TheJulia | oh, it is just a stestr change | 18:00 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/887168/1 is getting more data from the logs so we can pin down test_migrations stuffs. https://review.opendev.org/c/openstack/ironic/+/887166/1 is the stestr change | 18:01 |
TheJulia | I mentioned it yesterday | 18:01 |
JayF | sorry I've not been well tuned in, looking at other things since yesterday | 18:02 |
TheJulia | no worries | 18:02 |
* JayF still needs to get his CfP in for SeaGL and review john's sharding chain and revive+finish sharding ironicclient patch | 18:02 | |
TheJulia | the anaconda change is a direct result of the parallel change as it happened to happen there, but I've seen it elsewhere | 18:04 |
TheJulia | I'm starting to feel like I'll be able to actually focus on non CI things today | 18:18 |
TheJulia | any thoughts on just merging something like https://review.opendev.org/c/openstack/ironic/+/887168/1/ironic/tests/unit/db/sqlalchemy/test_migrations.py so we can spot in the buffer output roughly where things went side ways in stdout, not relying upon the logging buffer | 18:24 |
JayF | Are you sure if you use logging there it's less effective? | 18:24 |
JayF | or are you assuming? | 18:24 |
TheJulia | assuming that it gets kicked to a buffer, where as we shouldn't be buffering stdout | 18:24 |
JayF | I'm 100% OK with logs like that being in CI runs for test_migrations; I'm a little sketched out by it being print-statement-logging | 18:24 |
TheJulia | based upon run configuration | 18:24 |
JayF | that buffer shouldn't get stuck in an eventlet-patched unit test run, should it? | 18:25 |
TheJulia | we do have some print debug logging in the api for our own troubleshooting sanity | 18:25 |
TheJulia | fwiw | 18:25 |
JayF | well, if the precent is already there | 18:25 |
JayF | yolo | 18:25 |
TheJulia | w/r/t logging, no idea, but I don't htink we're doing anything with eventlet and the stestr run w/r/t logging | 18:26 |
* TheJulia wonders why openstack-tox-py38 gets queued | 18:26 | |
JayF | I asked Clark about that in #openstack-infra the other day | 18:29 |
JayF | tl;dr our infra clouds sometimes just ... don't wanna build, have to get retried, etc | 18:29 |
JayF | so you get weird jobs stuck queued or child jobs starting before parents | 18:29 |
JayF | it'll be eventually consistent | 18:30 |
TheJulia | fun! | 18:30 |
JayF | basically: zuul only guarantees correct ordering of the "build me a node" calls | 18:30 |
JayF | and provides no guarantees jobs will actually /start/ in that order | 18:30 |
TheJulia | iurygregory looks like your latest change got bit by the run the migrations twice issue | 18:30 |
TheJulia | oooh ahh, looks like https://zuul.opendev.org/t/openstack/stream/10ec1f5893d848f4800c107df2f36016?logfile=console.log is frozen | 18:50 |
JayF | this is without the fun output, yeah | 18:51 |
TheJulia | well, *hopefully* we'll have some extra logging | 18:51 |
TheJulia | and hopefully that will give us an idea | 18:51 |
JayF | when it finally goes kaput, you mean? | 18:51 |
TheJulia | yes, if we remember | 18:51 |
JayF | I wish I could forget :| lol | 18:52 |
opendevreview | Merged openstack/ironic master: CI: Fix PXE Ananconda cleanup test https://review.opendev.org/c/openstack/ironic/+/887198 | 18:53 |
TheJulia | /me looks at https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_10e/887168/1/check/openstack-tox-py38/10ec1f5/testr_results.html and cries | 19:13 |
TheJulia | pymysql.err.OperationalError: (3730, "Cannot drop table 'conductors' referenced by a foreign key constraint 'conductor_hardware_interfaces_ibfk_1' on table 'conductor_hardware_interfaces'.") seems like a solid hint | 20:20 |
JayF | TheJulia: the other test was "table conductors already exists" | 20:28 |
JayF | do you think that's b/c the cleanup of the first failed | 20:28 |
JayF | or is it possible that we're getting automatically created tables in the DB, with new schema, causing the migrations to freak | 20:29 |
TheJulia | absolutely | 20:29 |
TheJulia | we're attempting to create the tables, but we're in the same db | 20:29 |
JayF | I'm surprised it failed sanely tbh | 20:29 |
JayF | we must have finally "won" the race to see an error instead of a deadlock | 20:30 |
TheJulia | i suspect so | 20:30 |
TheJulia | actually.... | 20:30 |
TheJulia | :q | 20:33 |
TheJulia | err | 20:33 |
JayF | nothing sus whatsoever in hardware_interface related migrations | 20:34 |
JayF | and tests | 20:34 |
TheJulia | I don't think we're hanging there | 20:35 |
TheJulia | I *think* the issue is that we're failing teardown, and reusing the same database even tough with the OpportunisticDBFixture, we should be creating an entirely new db | 20:35 |
JayF | the problem I fall into with that line of thinking is that there's nothing substantially different about our migration tests than those in a couple other openstack projects I checked | 20:37 |
JayF | so either we have the most complex DB w/full migrations history still checked in, and that causes problems (plausible) or it's not a bug in shared code | 20:37 |
TheJulia | but if sqlalchemy is struggling to sort out how to tear down our objects | 20:37 |
TheJulia | because oslo.db trys to teardown objects as opposed to just drop the entire db... | 20:37 |
JayF | I wonder if there's a belief that metadata.drop_all() *should* work as part of the tests | 20:39 |
JayF | like if it's an explicit decision to do that to test the teardown path | 20:39 |
JayF | but that seems ~useless? | 20:39 |
TheJulia | if you end up dropping things in any order, you have to determine the teardown order properly | 20:39 |
TheJulia | oslo.db does build a list of foreign keys | 20:40 |
JayF | and at the object layer, we enforce that | 20:40 |
TheJulia | lets see what it does with them | 20:40 |
JayF | but this is not using the object layer | 20:40 |
TheJulia | hmmmmm if self.supports_drop_fk: | 20:40 |
JayF | did this change recently? | 20:42 |
TheJulia | no | 20:42 |
JayF | I feel like last time I looked at this code it was calling sqlalchemy metadata.drop_all() | 20:42 |
JayF | that has to be somewhere else | 20:42 |
TheJulia | trying to piece meal through this right now | 20:42 |
JayF | okay I am sus of this now too | 20:43 |
JayF | that I realized where I misread before | 20:43 |
JayF | I know why we're broken | 20:44 |
TheJulia | ???? | 20:44 |
JayF | for the 5th time, maybe :D | 20:44 |
TheJulia | heh | 20:44 |
JayF | we have a UniqueConstraint that's not an fk | 20:44 |
JayF | guess where it is? | 20:44 |
JayF | a UniqueConstraint that *contains* an FK but isn't one | 20:45 |
JayF | https://github.com/openstack/ironic/blob/master/ironic/db/sqlalchemy/alembic/versions/2353895ecfae_add_conductor_hardware_interfaces_table.py#L44 | 20:45 |
JayF | so you can't drop the FK because of the constraint | 20:45 |
JayF | right? | 20:45 |
JayF | if conductor.id FK to conductor_id and conductor_id is in a uniqueconstraint | 20:46 |
JayF | any test with conductor_hardware_interfaces populated is going to not tear down properly depending on how it's ordered | 20:46 |
TheJulia | uhhhh hmmmm | 20:46 |
TheJulia | the fk and the unique constraint should apply independently | 20:46 |
TheJulia | *should* | 20:46 |
JayF | because removing the FK won't blank conductor_id | 20:48 |
JayF | and removing the constraint won't touch the data | 20:48 |
JayF | so the data remains sane and the FK+constraint should be able to be removed in any order | 20:48 |
TheJulia | the table can be dropped with a constraint, a fk cannot be, and fk is independent | 20:49 |
TheJulia | aiui | 20:49 |
JayF | yeah, I think your reading of it is right | 20:50 |
JayF | gotta look for explanation #6 ;) | 20:50 |
TheJulia | heh | 20:50 |
JayF | TheJulia: so I wonder if https://github.com/openstack/ironic/commit/76a920aed2b103d72552c38b579769dced8d2e22 is breaking drop_all_objects | 20:56 |
JayF | TheJulia: the state that is repairing, not your fix | 20:56 |
JayF | TheJulia: and we traverse that state when running migration tests, yeah? even still? | 20:57 |
JayF | eh, I guess not | 20:57 |
JayF | b/c you commented it out in old migrations | 20:57 |
TheJulia | heh | 20:57 |
JayF | but I could see a "loop" like this screwing up the FK things | 20:57 |
TheJulia | yeah | 20:57 |
* JayF is rubber ducking in IRC and needs to ping you less while doing so | 20:57 | |
TheJulia | heh | 20:57 |
JayF | talking to a person is better than screaming into the board lol | 20:58 |
JayF | **void | 20:58 |
TheJulia | so interestingly enough, we have a different db name than what I expect based upon the oslodb test fixture code | 20:58 |
JayF | are you reading current version or version that was running in that test? | 20:59 |
TheJulia | current, but the code hasn't changed in lke 6-9 years | 20:59 |
JayF | aight | 20:59 |
JayF | I almost wanna hook up migrations tests in a debugger just to see the code flow, but I suspect the effort:value ratio is wildly off on that | 21:01 |
TheJulia | oh, the non-fixture code did change 2 months ago | 21:03 |
* iurygregory is back | 21:03 | |
iurygregory | TheJulia, yup just saw the failure on zuul, but the output of the failed log is weird | 21:03 |
TheJulia | eh, looks unrelated | 21:04 |
iurygregory | yeah, funny that is was only on the cover job | 21:05 |
TheJulia | it has to do a full run to analyize the code coverage | 21:05 |
iurygregory | (ง •̀_•́)ง StringException | 21:09 |
iurygregory | the only thing I wish today is my browser opening https://review.opendev.org without hitting time out '-' | 21:15 |
JayF | Some kind of internet routing issue over there? | 21:16 |
JayF | it's worked for me fine all day | 21:16 |
iurygregory | yeah =( | 21:16 |
iurygregory | still trying to figure out what is going on here | 21:17 |
TheJulia | sounds like your isp has a caching proxy of evilness | 21:17 |
JayF | flip off v6? | 21:18 |
JayF | I like IPv6 but turning it off sometimes fixes ISP flakiness | 21:18 |
iurygregory | TheJulia, yeah | 21:18 |
iurygregory | JayF, my ISP has V6 off | 21:18 |
JayF | d'oh | 21:18 |
TheJulia | https://meet.google.com/nqs-tiwa-hto?authuser=0 <-- call for spitballing/identifying the db issues | 21:18 |
JayF | give me a sec to grab headphones and I'll be there | 21:19 |
iurygregory | me too ^ | 21:19 |
TheJulia | k | 21:20 |
TheJulia | https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/test_fixtures.py | 21:30 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Drop databases in unit tests instead of contents https://review.opendev.org/c/openstack/ironic/+/887234 | 22:16 |
JayF | I bet part of the reason ^ may not be in oslo.db already is b/c it requires the CI testing user to have create/drop privs on the DB | 22:21 |
JayF | (our test-setup does, fwiw) | 22:22 |
JayF | interesting | 22:28 |
JayF | https://zuul.opendev.org/t/openstack/stream/7833fbfcd9f94fcb96a92b5210545b14?logfile=console.log looks to be frozen up | 22:28 |
JayF | I wonder if drop blocks on open connections, too? | 22:28 |
TheJulia | okay, I'm back | 22:29 |
JayF | I think it improved the failure case, at least | 22:29 |
TheJulia | uhhhhhhhh | 22:29 |
JayF | the fixture timeout caught the freeze and continued | 22:29 |
TheJulia | yes, it *should* actually | 22:29 |
JayF | 2 failures, both fixture timeouts with no useful data | 22:30 |
JayF | other than we know the two it failed on | 22:30 |
JayF | TheJulia: also is that patch what you expected? | 22:30 |
JayF | I somehow think you had something more elegant in mind LOL | 22:30 |
TheJulia | where your change froze? | 22:30 |
JayF | no, I mean https://review.opendev.org/c/openstack/ironic/+/887234/1/ironic/db/sqlalchemy/__init__.py | 22:30 |
TheJulia | looking | 22:30 |
JayF | literally monkeypatching it in | 22:30 |
JayF | ironic.tests.unit.db.sqlalchemy.test_migrations.ModelsMigrationsSyncMysql.test_models_sync | 22:31 |
TheJulia | Yeah, I guess that works :) | 22:31 |
JayF | ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_upgrade_and_version | 22:31 |
JayF | are the two that failed with timeouts | 22:31 |
TheJulia | I was thinking from the class inside test_migrations | 22:31 |
TheJulia | ... did the timeout change merge yet | 22:31 |
JayF | https://review.opendev.org/c/openstack/ironic/+/885507 no, v-2 | 22:32 |
JayF | should I stack on it? | 22:32 |
JayF | I mean, it V-2 with a TIMED_OUT, two of them | 22:32 |
TheJulia | might be worth trying to stack on top of it | 22:33 |
JayF | I'm really, really suspicous of the added timeouts causing some kind of extra problem | 22:33 |
TheJulia | oh, it absolutely is | 22:33 |
TheJulia | the value is so low we blow up on okay runs | 22:33 |
JayF | I kinda wanna stack on a change to *remove all timeout fixtures* | 22:34 |
JayF | ah okay | 22:34 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Drop databases in unit tests instead of contents https://review.opendev.org/c/openstack/ironic/+/887234 | 22:34 |
JayF | honestly the only reason I'm sus is just because it puts eventlet in my traceback | 22:34 |
JayF | and when eventlet is in my traceback I'm always sus | 22:34 |
JayF | which isn't really fair or technical of me | 22:34 |
TheJulia | I... feel like we should go ahead on the path of the class parallel execution, so we can at least isolate blast damage down to specific class behaviors too | 22:40 |
JayF | I want to see how my change, with the timeouts fix, works | 22:41 |
JayF | if it fixes it, then I think the path is to add a unit test that sets up our DB on mysql then calls self.meta.drop_all() or that drop_all_objects() method | 22:42 |
JayF | (the /original/ drop_all_objects method, to be clear) | 22:42 |
JayF | or really, maybe we merge one that looks more like your idea so ^^^ is a saner next step | 22:42 |
JayF | metadata.drop_all I mean, like from sqla | 22:43 |
JayF | so we can validate, in it's own freakin' test, that we can tear down the DB from the HEAD alembic state of it | 22:43 |
TheJulia | stacked on top, so we get 2x the test runs too | 22:46 |
opendevreview | Julia Kreger proposed openstack/ironic master: Execute tests by class, not randomly https://review.opendev.org/c/openstack/ironic/+/887166 | 22:46 |
TheJulia | there is the irc message | 22:46 |
JayF | on my change, https://zuul.opendev.org/t/openstack/stream/1c41663bc5e14deea4083562069fb4b2?logfile=console.log has been hanging out for 8 minutes now :( | 22:52 |
TheJulia | le sigh | 22:55 |
JayF | I'll note | 22:55 |
JayF | my patch is a place | 22:55 |
JayF | where we can put ALL KINDS OF DEBUGGING STUFF IN | 22:55 |
JayF | right before we drop the things | 22:55 |
JayF | and we have an engine | 22:55 |
JayF | so perhaps we can flip that approach to instead dump some useful metrics before dropping DB? | 22:55 |
JayF | like, perhaps, a show full processlist;? | 22:56 |
JayF | I'll also note that my change was mysql only; we haven't seen postgres freeze that we know of but it's possible | 22:56 |
JayF | TheJulia: ... https://zuul.opendev.org/t/openstack/build/ff5c2e1dd077459e91ef34eadf6d2806/log/job-output.txt did arm64 tests always run mysql? | 22:58 |
JayF | ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsMySQL.test_walk_versions failure on arm64 | 22:59 |
JayF | from https://review.opendev.org/c/openstack/ironic/+/887234 fwiw, last reported openstack-tox-py310-arm64 | 22:59 |
* JayF is about to turn into a pumpkin | 22:59 | |
TheJulia | ADD ALL THE DEBUG! | 22:59 |
* TheJulia is almost pumpkin already | 22:59 | |
JayF | my eod is usually at 4pm | 22:59 |
* TheJulia has pumpkin bread, and needs to start the sauce for dinner soon() | 22:59 | |
JayF | my brain hit the door about 15 minutes ago lol | 22:59 |
TheJulia | lol | 23:00 |
JayF | tbh after this we should consider stepping back | 23:00 |
JayF | documenting what we've tried/merged/determined was bad | 23:00 |
JayF | and hit the mailing list | 23:00 |
JayF | maybe engage fresh brains with less ironic brain | 23:00 |
JayF | I don't wanna do that, it won't be as satisfying as just finding the fix, but at some point we have to call for help | 23:01 |
TheJulia | we're super close, I think | 23:01 |
JayF | your optimism is slightly bothersome ;) | 23:01 |
TheJulia | at least, the path I've been on has been fairly well behaved, but looking at other jobs the big difference is the parallel class change | 23:02 |
TheJulia | well, I have a few meetings tomorrow so I expect I'll be busy | 23:02 |
JayF | I'll look at all the results | 23:02 |
JayF | and then pick a path, any path | 23:02 |
JayF | and keep going | 23:02 |
JayF | what else is there to do | 23:02 |
TheJulia | Yup :( | 23:06 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!