opendevreview | Pranali Deore proposed openstack/glance_store master: DNM: Test whether few failing jobs passes or not https://review.opendev.org/c/openstack/glance_store/+/879940 | 07:42 |
---|---|---|
opendevreview | Pranali Deore proposed openstack/glance master: Change DB migration constant to 2023_2 https://review.opendev.org/c/openstack/glance/+/879947 | 09:58 |
opendevreview | Merged openstack/glance-specs master: Add a script to prepare the next cycle https://review.opendev.org/c/openstack/glance-specs/+/878121 | 10:35 |
dansmith | I get all the "ResourceWarning" messages when I run functional locally | 13:36 |
dansmith | and I do see some leaked processes, but they fluctuate and apparently eventually go away | 13:39 |
dansmith | and lots of failed process launch status, "no such process" etc | 13:40 |
dansmith | abhishekk: something clearly happened between 3/14 and 3/27: https://zuul.opendev.org/t/openstack/builds?job_name=glance-tox-functional-py39-rbac-defaults | 14:02 |
dansmith | almost nothing other than merging that startup directory check, but that passed the same tests, and reverting it locally doesn't fix anything for me | 14:02 |
abhishekk | dansmith, ack, don't think that startup directory check has anything to do with timeouts | 14:03 |
dansmith | well, I thought maybe it was preventing the functional api workers from starting, because that seems to be the failure (api servers aren't running) | 14:04 |
dansmith | and because the time lined up, but yeah, seems unrelated | 14:04 |
abhishekk | maybe we need to use skip to isolate/find the failing test | 14:05 |
dansmith | it's a ton of them, and maybe all of them? | 14:06 |
dansmith | I assume you see this in the logs: AssertionError: False is not true : Unexpected server launch status for: api, | 14:06 |
abhishekk | is there any requirement side change in between related to eventlet or something ? | 14:06 |
dansmith | I've been looking but I don't think so, unless there's something unconstrained | 14:07 |
abhishekk | So as per your comment those are failing locally as well, right? | 14:07 |
dansmith | yeah, are they not for you? | 14:08 |
abhishekk | I haven't run locally anything recently, just doing it now | 14:08 |
abhishekk | yep, it hung locally for me as well | 14:15 |
abhishekk | So I think all tests which are failing are using legacy api server from tests (not the new one which you have written to add new tests for import/quota/policy changes) | 14:17 |
dansmith | yeah, where it starts a complete api worker process and talks to it over http | 14:17 |
abhishekk | right | 14:18 |
dansmith | and it seems that thing is crashing or never coming up or something, which has always been super hard to debug | 14:18 |
dansmith | I see lots of zombie processes under the test workers, which are those api processes AFAICT | 14:18 |
abhishekk | +1 | 14:18 |
abhishekk | Can we refactor existing api class similar to recent one? | 14:20 |
dansmith | tests you mean? | 14:21 |
abhishekk | yeah | 14:22 |
dansmith | it would be a massive effort | 14:23 |
dansmith | also, | 14:23 |
dansmith | the thing you're talking about is synchronous, so the nature of the tests which expect async behaviors would need to change | 14:23 |
abhishekk | right, I think shortest way is to debug a single test to find out what is going wrong | 14:24 |
dansmith | yeah | 14:25 |
abhishekk | pdeore, you around? | 14:25 |
dansmith | it's also weird that it's working in the devstack jobs, which I think means whatever is broken is something only related to this weird functional worker thing | 14:25 |
abhishekk | this made me more nervous :D | 14:26 |
abhishekk | is there a shortcut command to kill zombie processes? | 14:28 |
dansmith | they must be waited on | 14:30 |
abhishekk | yeah | 14:31 |
abhishekk | _warn("subprocess %s is still running" % self.pid, | 14:32 |
abhishekk | might be related, lots of occurrences in logs | 14:34 |
dansmith | I think that's a symptom | 14:34 |
abhishekk | likely | 14:35 |
dansmith | so I think this is what's crashing it: ERROR: 'NoneType' object has no attribute 'group' | 14:36 |
dansmith | but there's no trace so I have no idea where that is coming from | 14:36 |
abhishekk | I think we should try skipping test_reload once? | 14:36 |
dansmith | why? I can run any of the tests in isolation and they fail | 14:36 |
dansmith | I've been randomly using this one: test_invalid_cors_get_request | 14:37 |
abhishekk | ohh, I thought that is the one which is reloading configs | 14:37 |
abhishekk | ack | 14:37 |
abhishekk | greenlet-1.1.3 is what was installed on the passing job, whereas now it is greenlet==2.0.2 | 14:49 |
abhishekk | https://d2e021dde0f27c24b843-9e47a969cbb910cc10dbd93fca848265.ssl.cf5.rackcdn.com/850417/9/check/cross-glance-tox-functional/84bf072/job-output.txt | 14:49 |
abhishekk | this is the last passing cross-glance-tox-functional | 14:50 |
abhishekk | https://review.opendev.org/c/openstack/requirements/+/872065 | 14:53 |
abhishekk | this patch was submitted on 26/03 | 14:53 |
dansmith | ah, I checked greenlet, but it was released in january, so I figured unrelated | 14:55 |
dansmith | however, | 14:55 |
dansmith | are we getting that in local runs? | 14:55 |
dansmith | ah, u-c changed | 14:55 |
dansmith | and we install that from master when we run tox, regardless | 14:55 |
dansmith | that's unfortunate | 14:55 |
abhishekk | yeah | 14:56 |
dansmith | that's why walking back in the git history doesn't change it I guess | 14:56 |
dansmith | does that fix it for you? I'm trying | 14:56 |
abhishekk | nah, I just figured out the patch | 14:56 |
abhishekk | existing tests are still running for me :D | 14:57 |
dansmith | I'm not actually sure if eventlet uses greenlet | 14:57 |
dansmith | yeah, same behavior with 1.1.3 for me | 14:58 |
abhishekk | :/ | 14:58 |
abhishekk | how did you override it in the local run? | 15:00 |
abhishekk | changed uc inside .tox ? | 15:00 |
dansmith | $ .tox/functional/bin/pip install -U greenlet==1.1.3 | 15:00 |
dansmith | I'm rebuilding my tox env with u-c from march 14th | 15:01 |
dansmith | so that should get any others to see if that's related | 15:01 |
abhishekk | ack | 15:01 |
dansmith | that installed the older constraints, but didn't fix the problem | 15:04 |
abhishekk | eventlet 0.33.3 vs 0.33.1 ? | 15:05 |
dansmith | oh hang on, | 15:05 |
dansmith | I might have broken something else in my testing, just a sec | 15:05 |
abhishekk | ack | 15:05 |
dansmith | oh snap | 15:06 |
dansmith | - Passed: 1 | 15:06 |
abhishekk | greenlet or eventlet? | 15:06 |
dansmith | u-c from march 14 | 15:07 |
dansmith | but this same issue might have confused my just-greenlet testing earlier | 15:07 |
abhishekk | ack | 15:08 |
dansmith | I was halfway through trying to print out something on startup and got distracted with the greenlet thing, but had left a typo | 15:08 |
dansmith | manifests the same.. a typo preventing the service from starting :) | 15:08 |
abhishekk | :D | 15:09 |
abhishekk | what changed in u-c between the 14th and now? | 15:12 |
dansmith | okay greenlet alone does not fix it | 15:12 |
dansmith | I'm looking | 15:12 |
abhishekk | i found greenlet and eventlet with diff versions | 15:13 |
dansmith | https://termbin.com/cixg2 | 15:13 |
abhishekk | ack | 15:14 |
dansmith | rolling back eventlet failed with dns api error, rolling back dnspython too | 15:15 |
dansmith | that diff is 65245016de7cf2d1e585eeb1378aac6aa6d75de0..master in requirements, btw | 15:15 |
dansmith | nope | 15:16 |
dansmith | mmm, paste | 15:17 |
abhishekk | pastedeploy? | 15:17 |
dansmith | yeah, that's the one :) | 15:17 |
dansmith | works with PasteDeploy===2.1.1 | 15:18 |
abhishekk | bummer | 15:18 |
abhishekk | so we need to blacklist 3.0.1 for glance ? | 15:19 |
dansmith | I dunno why it's not failing in devstack though | 15:19 |
dansmith | but no, I think you need to fix the problem .. can't stay on 2.x forever right? | 15:19 |
abhishekk | yeah | 15:20 |
abhishekk | till we fix (which is going to take long) shouldn't we rollback to 2.1.1? | 15:20 |
dansmith | I think u-c is supposed to be across all the projects, right? | 15:21 |
dansmith | not sure it's an option to block it just for glance and rolling it back in u-c is problematic I think assuming some other project wanted it bumped | 15:21 |
abhishekk | yeah, but there should/might be a way to override it? | 15:21 |
dansmith | I dunno what the rules are here | 15:21 |
abhishekk | me too | 15:22 |
dansmith | overriding it just means that glance can't be installed alongside nova, for example | 15:22 |
dansmith | gmann: ^ | 15:22 |
dansmith | I think gmann has been getting in late recently, so might be a while before he's around | 15:22 |
abhishekk | ack | 15:22 |
dansmith | probably should quickly work to determine what the actual problem is though.. might be something simple | 15:22 |
abhishekk | also we can't skip 106 tests as well :D | 15:22 |
abhishekk | need to go through reno of PasteDeploy | 15:23 |
dansmith | I really need to get back to what I was supposed to be doing this morning, but I assume you can take it from here? or maybe pdeore can try to suss out the change? | 15:23 |
abhishekk | I think pdeore can take it from here | 15:23 |
dansmith | knowing what the problem is should be like 90% of the work I bet | 15:23 |
abhishekk | ++ | 15:24 |
dansmith | it's probably something simple like a missing or now-required arg or something | 15:24 |
abhishekk | likely | 15:24 |
abhishekk | thanks for spending time on it | 15:25 |
* dansmith nods | 15:25 | |
dansmith | also, maybe some unit tests for the deploy stuff will make it easier to debug what is going on | 15:26 |
dansmith | and also maybe let's not add any more functional tests based on these api workers :D | 15:26 |
abhishekk | ++ | 15:27 |
abhishekk | there are two releases 3.0 and 3.0.1 2022-10-16 and 2022-10-17 | 15:28 |
abhishekk | for pastedeploy ^^ | 15:28 |
abhishekk | https://docs.pylonsproject.org/projects/pastedeploy/en/latest/news.html | 15:28 |
dansmith | yeah, not much in the way of news for those | 15:28 |
dansmith | seems like the major version bump is just because of dropping py2 support | 15:29 |
abhishekk | likely | 15:29 |
abhishekk | app = deploy.loadapp("config:%s" % conf_file, name=app_name) | 15:35 |
abhishekk | this is the only function i think we are calling | 15:36 |
dansmith | yup | 15:39 |
dansmith | and it looks the same in nova | 15:39 |
dansmith | but of course, it's loading modules provided by glance (and nova) and might be choking on those | 15:39 |
dansmith | I don't know how it goes from the paste config to the python objects.. so someone probably needs to figure that out ... | 15:41 |
abhishekk | ack | 15:43 |
dansmith | oh yeah and those workers run with generated paste configs | 15:49 |
dansmith | different from the one in etc/ | 15:49 |
dansmith | so could be something there too I guess | 15:49 |
dansmith | this is what it's loading via paste I think: glance.api:root_app_factory | 15:49 |
dansmith | which of course is different in the generated paste config for those workers | 15:51 |
dansmith | er, well, maybe the same, but in a different stack | 15:51 |
abhishekk | right | 15:53 |
gmann | dansmith: abhishekk_ hi | 17:36 |
gmann | I do not think we can/should have different u-c (blacklist specific version) for glance only | 17:36 |
abhishekk_ | gmann, hi | 17:36 |
abhishekk_ | ack | 17:37 |
dansmith | yeah, agree | 17:37 |
dansmith | it's unfortunate that it got bumped without being realized, but alas, here we are | 17:37 |
abhishekk_ | dansmith, you have latest devstack deployed? | 17:38 |
gmann | doesn't requirements have a glance functional test job? | 17:38 |
dansmith | abhishekk_: no | 17:38 |
abhishekk_ | dansmith, ack | 17:39 |
dansmith | gmann: no | 17:39 |
dansmith | it's a bummer to have to run all the projects' functionals really, especially lately with things breaking so much | 17:39 |
dansmith | also, gmann, glance's functionals (these at least) are a nightmare to debug, so asking non-glance people to investigate failures is also unfortunate | 17:40 |
gmann | ohk | 17:40 |
abhishekk_ | somehow I came to the conclusion that VersionNegotiationFilter is causing trouble, but no further luck in the last couple of hours | 17:40 |
dansmith | abhishekk_: so you think it's not crashing on start, but refusing to do anything useful? | 17:41 |
dansmith | because the test waits for the timeout for the "ping" which should fail immediately if it's up but just not working | 17:41 |
abhishekk_ | yeah | 17:41 |
abhishekk_ | if I remove that filter from here | 17:42 |
abhishekk_ | https://github.com/openstack/glance/blob/master/glance/tests/functional/__init__.py#L499 | 17:42 |
abhishekk_ | test passes with latest paste deploy | 17:42 |
dansmith | huh, maybe just loading that filter fails? | 17:43 |
abhishekk_ | likely, tried putting logs there or in wsgi.Middleware but nothing actually logs | 17:44 |
dansmith | right | 17:44 |
dansmith | this got me some logs: | 17:44 |
dansmith | https://termbin.com/ycd6 | 17:44 |
abhishekk_ | maybe the next step is to deploy latest devstack and execute some version api calls | 17:44 |
abhishekk_ | looking | 17:45 |
dansmith | or make sure it configures it the same way | 17:45 |
dansmith | maybe devstack doesn't have that filter? or it's not in that order? | 17:45 |
abhishekk_ | i think it has | 17:46 |
dansmith | well, something has to be different :) | 17:47 |
abhishekk_ | agree | 17:47 |
dansmith | definitely a different order | 17:47 |
abhishekk_ | I think now I will hand it over to pdeore | 17:47 |
* dansmith nods | 17:48 | |
abhishekk_ | curiosity is what changed in deploy 3.0 that causes this middleware to fail :D | 17:48 |
dansmith | yeah | 17:49 |
abhishekk_ | tempest is also not failing, which means there's not much to worry about? | 17:56 |
dansmith | well, that's what I'm saying.. it must be something specific to the config in the functional workers, since the devstack jobs are fine | 17:56 |
abhishekk_ | yep | 17:57 |
* abhishekk_ signing out now, have a good day | 17:57 | |
dansmith | o/ | 18:01 |
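
On the zombie-process question at 14:28 ("they must be waited on"): a zombie cannot be killed, because it has already exited; it only disappears once its parent collects the exit status. The sketch below shows that idea with a non-blocking `os.waitpid`. The helper names `reap_finished` and `reap_until` are hypothetical and are not part of glance's functional test harness; this is only an illustration for POSIX systems.

```python
import os
import time


def _exit_code(status):
    """Translate a raw wait status into an exit code (negative for a signal)."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def reap_finished(pids):
    """Non-blocking wait on each pid; return {pid: exit_code} for exited children.

    Hypothetical helper: pids that are still running, or that are not
    children of this process (or were already reaped), are skipped.
    """
    finished = {}
    for pid in pids:
        try:
            done_pid, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            continue  # not our child, or already reaped
        if done_pid == 0:
            continue  # child exists but has not exited yet
        finished[pid] = _exit_code(status)
    return finished


def reap_until(pids, timeout=5.0, interval=0.05):
    """Poll reap_finished until every pid is collected or the timeout expires."""
    remaining = set(pids)
    finished = {}
    deadline = time.monotonic() + timeout
    while remaining and time.monotonic() < deadline:
        done = reap_finished(remaining)
        finished.update(done)
        remaining -= set(done)
        if remaining:
            time.sleep(interval)
    return finished
```

Only the parent of a zombie can reap it this way, so a harness that launches worker processes would have to call something like this itself (or the parent has to exit so that init adopts and reaps the children).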
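
The bisection above came down to comparing upper-constraints from March 14 against master and rolling packages back one at a time. A minimal sketch of that comparison follows, assuming both files use the `name===version` pin format seen in `PasteDeploy===2.1.1`. The function names are hypothetical, and the parser is deliberately simplified: it ignores environment markers, so if a package is pinned twice with different markers the last pin wins.

```python
def parse_constraints(text):
    """Parse 'name===version' pins into a {lowercased_name: version} dict.

    Simplified on purpose: comments and environment markers are dropped,
    and lines without '===' are ignored.
    """
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        requirement = line.split(";", 1)[0].strip()
        if "===" not in requirement:
            continue
        name, version = requirement.split("===", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for pins that differ.

    A package present on only one side shows up with None on the other.
    """
    old = parse_constraints(old_text)
    new = parse_constraints(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Illustrative input only: the greenlet and PasteDeploy versions are the ones
# mentioned in the log; the six pin is made up to show an unchanged entry.
OLD = "greenlet===1.1.3\nPasteDeploy===2.1.1\nsix===1.16.0\n"
NEW = "greenlet===2.0.2\nPasteDeploy===3.0.1\nsix===1.16.0\n"

for name, (before, after) in changed_pins(OLD, NEW).items():
    print(f"{name}: {before} -> {after}")
```

Each changed pin is then a candidate to roll back individually in the tox environment, in the same way as the `pip install -U greenlet==1.1.3` command shown at 15:00.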