frickler | dansmith: gmann: kopecmartin: seems 886795 causes some regression, see e.g. https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/888787 | 06:00 |
---|---|---|
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 06:09 |
opendevreview | Maxim Sava proposed openstack/tempest master: Add image task client and image tests task APIs. https://review.opendev.org/c/openstack/tempest/+/888755 | 06:10 |
opendevreview | Dr. Jens Harbott proposed openstack/devstack master: Revert "Set two different image in tempest irespective of DEFAULT_IMAGE_NAME" https://review.opendev.org/c/openstack/devstack/+/888650 | 06:26 |
opendevreview | yatin proposed openstack/devstack master: Handle more than 1 image while configuring tempest https://review.opendev.org/c/openstack/devstack/+/888906 | 06:47 |
ykarel | frickler, ^ | 06:47 |
frickler | ykarel: doh, lgtm, I always struggle with these advanced bash features, too. do you want to verify with a depends-on? else I'll just approve at once | 06:55 |
ykarel | frickler, sure can send a test patch, but let me first update patch to also add a break in for loop | 06:57 |
opendevreview | yatin proposed openstack/devstack master: Handle more than 1 image while configuring tempest https://review.opendev.org/c/openstack/devstack/+/888906 | 07:00 |
ykarel | testing in https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/888907 | 07:02 |
frickler | meh, next CI failure. /me fetches the force-merge hammer | 11:34 |
opendevreview | Merged openstack/devstack master: Handle more than 1 image while configuring tempest https://review.opendev.org/c/openstack/devstack/+/888906 | 11:41 |
sean-k-mooney | i just found something fun | 13:15 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/8525a961871c4602bedf3df605f72791/log/controller/logs/screen-n-cpu.txt#60338 | 13:15 |
sean-k-mooney | some of our job logs are large enough that they will crash your browser tab | 13:15 |
sean-k-mooney | i guess that's what happens if you try to mark up a 2.4MB text file in a browser tab | 13:22 |
dansmith | gmann: I'm seeing a test fail with an unexpected host key, but it's after a rebuild | 14:46 |
dansmith | AFAIK, post-rebuild the host key *should* be different, so I'm not sure how the ssh client in tempest would know what it *should* be until it has ssh'd in once | 14:46 |
dansmith | is that not correct? | 14:46 |
dansmith | hmm: ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) | 14:49 |
dansmith | perhaps we don't clear the ssh key after the rebuild, or race to do that? | 14:50 |
dansmith | hmm, I wonder if we start ssh polling before the rebuild is finished, connect once but don't fully execute the command, but have grabbed the host key, then when the rebuild finishes, the polling loop continues and finds the wrong host key | 14:52 |
dansmith | kopecmartin: does that make sense? | 14:53 |
kopecmartin | sean-k-mooney: my browser can open that fine, it's slow, but it works .. that's because of the UI i think, if you open the raw file, it works well https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_852/887255/5/check/tempest-integrated-compute/8525a96/job-output.txt .. although you can't send a link to a specific line , hmm, each has its pros and cons | 15:18 |
kopecmartin | dansmith: no idea, what test are we talking about? | 15:20 |
dansmith | kopecmartin: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_84d/879499/10/check/nova-multi-cell/84d38ab/testr_results.html | 15:24 |
dansmith | kopecmartin: if you notice there, it goes to post-rebuild ssh to the server, a couple failed attempts that take a while, and then starts complaining about the host key being wrong | 15:26 |
dansmith | looking at lib/ssh I think that likely means we've successfully gotten the host key in one iteration of the poll loop and then subsequent ones fail after the rebuild has completed and the host key is (obviously) different | 15:27 |
dansmith | I've seen two of these failures this morning | 15:27 |
dansmith | so I'm thinking we need to wait for REBUILD state before we wait for active to make sure we don't race through and start polling before the rebuild starts, maybe | 15:28 |
dansmith | although I dunno, seems like the transition to REBUILD is synchronous with the request call so we should stall on the wait for active | 15:29 |
kopecmartin | hm, we're rebuilding to an alt_image .. it used to be the same, but we changed that and now the alt_image is different from image - partially thanks to this? https://review.opendev.org/c/openstack/devstack/+/886795 .. so maybe that's the cause? question would be what we wanna do about it | 15:36 |
kopecmartin | i'm sorry if i'm stating obvious, had already a couple of drinks :D .. celebrating today | 15:37 |
dansmith | I don't think so .. the image shouldn't have a host key in it, that should be generated on first boot | 15:38 |
dansmith | however, after rebuild the host key should be regenerated of course | 15:39 |
dansmith | but yeah, the correlation with the image change in devstack does seem ... hard to ignore | 15:39 |
dansmith | the ssh client is complaining about the host key being different than what it expects, but we don't pass it into the client, so I think the only thing that it could be picking up on is the ssh key being different from one attempt to the other in a polling loop | 15:40 |
kopecmartin | i'm comparing the server's metadata before and after the rebuild, the name is different, the image is different, and all that is consistent with the code | 15:47 |
kopecmartin | i can try to run it and put a few breakpoints there, but only tomorrow | 15:47 |
dansmith | okay I think we're not on the same page here :) | 15:51 |
dansmith | I think gmann likely has more context here.. we'll see when he pops up ;) | 15:52 |
*** gthiemon1e is now known as gthiemonge | 15:59 | |
gmann | dansmith: checking | 17:17 |
gmann | dansmith: there might be a chance that the rebuild is not finished before ssh, let me check test code | 17:20 |
gmann | dansmith: no, we wait for rebuild to finish and server to be active before trying ssh https://github.com/openstack/tempest/blob/180717d3833b6e0f89c3aa8b34b369f4cccf69fd/tempest/api/compute/servers/test_server_actions.py#L229 | 17:21 |
gmann | dansmith: maybe this is the issue? we pass the 'server' response in this ssh which we get before rebuild so that is an obsolete 'server' response https://github.com/openstack/tempest/blob/180717d3833b6e0f89c3aa8b34b369f4cccf69fd/tempest/api/compute/servers/test_server_actions.py#L325 | 17:22 |
gmann | which is not just to get ip but here too https://github.com/openstack/tempest/blob/180717d3833b6e0f89c3aa8b34b369f4cccf69fd/tempest/api/compute/servers/test_server_actions.py#L329 | 17:23 |
dansmith | right, but.. I'm not sure how that would cause this | 17:23 |
dansmith | agree that we're waiting for active, which should only happen after the rebuild is finished since the rebuild sets the task state | 17:23 |
dansmith | the ip should not change across a rebuild and there's an assert in the inner rebuild test helper that verifies that | 17:25 |
gmann | yeah, and in other place it is used to log console output only so yes this should not be issue | 17:25 |
dansmith | so I just can't really see how we got here | 17:25 |
dansmith | we're also hitting this quite a bit lately, and I see it almost 100% locally: | 17:28 |
dansmith | https://zuul.opendev.org/t/openstack/build/94b65655fe6643aa9ec61ce61a5d3c75 | 17:28 |
dansmith | gouthamr: ^ | 17:28 |
gouthamr | o/ late response; weird | 19:18 |
gouthamr | seeing it consistently now in the past few builds: https://zuul.opendev.org/t/openstack/builds?job_name=nova-ceph-multistore&project=openstack/nova | 19:19 |
dansmith | it doesn't fail for me on fedora 37 nor macos, but does on jammy, all coming from the same local network | 19:22 |
dansmith | and I've tried updating my ca-certificates package in case it's relevant | 19:23 |
dansmith | curl -k works, as expected | 19:24 |
gouthamr | ^ same; no wonder.. i was trying to find evidence in a different job, but that runs on centos-9-stream | 19:25 |
dansmith | it's not failing all the time.. just this morning after a failure, I rechecked and it passed check then failed again in gate | 19:28 |
dansmith | so it doesn't seem to be 100% reproducible | 19:28 |
dansmith | it does, however, seem like maybe we should be doing something better than just curl'ing a raw file out of a repo. Like maybe we should just clone the ceph repo and run it out of there? | 19:29 |
gouthamr | https://bugs.launchpad.net/ubuntu/+source/curl/+bug/2028170 | 19:30 |
dansmith | I know that'll be a lot bigger, but... | 19:30 |
dansmith | hah | 19:30 |
dansmith | I think that bug is wrong (and is marked invalid now) since I was using curl from months ago this morning when testing | 19:31 |
gouthamr | marked invalid for everything but jammy? | 19:32 |
dansmith | oh? | 19:32 |
dansmith | okay I see | 19:32 |
gouthamr | yeah, they released a fixed package a few hours ago from what i can tell | 19:32 |
dansmith | is curl'ing the cephadm thing direct from the ceph tree really their recommendation? I dunno what the policy is for landing stuff there, but seems like it could be an easy thing for us to trip on if there's a bad commit there | 19:33 |
gouthamr | yes, because ceph is a really huge repo | 19:34 |
gouthamr | https://docs.ceph.com/en/latest/cephadm/install/#curl-based-installation | 19:35 |
gouthamr | our aversion to use a distro package lest we miss some bugfix :P | 19:35 |
gouthamr | but, the location's updated in those docs ^ | 19:35 |
dansmith | yeah, I just wish there was a github release we could fetch or something | 19:36 |
dansmith | ack | 19:36 |
* gouthamr will push up a fix | 19:36 | |
dansmith | that looks more cultivated than just-what-is-in-the-tree | 19:36 |
dansmith | cool | 19:36 |
gouthamr | weird to put this under "rpm-${CEPH_RELEASE}/el9/noarch/" cephadm is a python file, and the distro here wouldn't matter | 19:40 |
opendevreview | Goutham Pacha Ravi proposed openstack/devstack-plugin-ceph master: Update location of cephadm script https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/888952 | 19:47 |
opendevreview | Goutham Pacha Ravi proposed openstack/devstack-plugin-ceph master: Update location of cephadm script https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/888952 | 19:48 |
opendevreview | Brian Haley proposed openstack/devstack master: Always set image_uuid_alt in configure_tempest() https://review.opendev.org/c/openstack/devstack/+/888953 | 20:08 |
haleyb | frickler: ^^ that change is needed as well for tempest image change, just verified it fixed my stack.sh issue locally | 20:09 |
gmann | haleyb: thanks. i thought image_uuid is not set in case of single image but your fix lgtm, 1 comment in the gerrit | 20:21 |
haleyb | gmann: only image_uuid is ever set anymore, https://review.opendev.org/c/openstack/devstack/+/886795/4/lib/tempest remove setting of image_uuid_alt. Running stack.sh without this fails locally, gates must be doing something special | 20:23 |
haleyb | i added similar debugging locally and it was always "" | 20:24 |
gmann | haleyb: ok. I agree with setting the alt uuid also always. one comment on your change to check if that exist or not otherwise lgtm | 20:26 |
haleyb | i'm adding a response | 20:26 |
haleyb | gmann: see my comments, but adding a check doesn't get much since we know it's not set as that code was removed :( | 20:32 |
gmann | haleyb: but how is it set as "" and the if condition is not letting it be set to the proper uuid in your local env | 20:32 |
haleyb | gmann: it can only be "" if image_uuid is also "", which can't happen in the single image case. sorry i don't understand. it took me an hour of debugging to write that one line of code :) | 20:35 |
gmann | haleyb: ok but as we are removing setting of alt based on image_uuid let's set it separately and under if condition. if anyone setting it to "" that we need to fix because there might be valid case of setting it to "" somewhere | 20:36 |
gmann | so set it when it is not set at all. if any scenario does not want to set it to "" and it is set somewhere we can fix that | 20:37 |
haleyb | it will blow up on L302 call if "" in image_size_in_gib | 20:39 |
opendevreview | Jakub Skunda proposed openstack/devstack stable/yoga: git: git checkout for a commit hash combinated with depth argument https://review.opendev.org/c/openstack/devstack/+/888752 | 20:41 |
gmann | haleyb: but there should not be case where it is set as "" right ? | 20:42 |
gmann | with your change, as you are removing its setting at L252 | 20:43 |
haleyb | gmann: right, not with my change, without it and a single image it's always "" though. Even the check on L258 in the multi-image case can go away, as it's not been set there either. | 20:44 |
haleyb | https://review.opendev.org/c/openstack/devstack/+/886795/4/lib/tempest changed the behavior completely but the code seems wrong looking at it | 20:45 |
gmann | haleyb: need to go for my appointment, will come back and tty in an hr or so | 20:46 |
haleyb | gmann: ack, it is close to eod here so might not respond until tomorrow | 20:47 |
dansmith | gouthamr: ah, I guess devstack is updating my curl package, which explains why I see it during/after devstack | 20:53 |
gouthamr | dansmith: oh.. ack; the change (https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/888952) is cool to have despite the curl update; i can see it working on the jammy job we have https://zuul.opendev.org/t/openstack/stream/1b90f46c303b4ed2acdfad4075d504f7?logfile=console.log | 20:57 |
dansmith | ack, my mirror just got the curl update and I can confirm it works also | 20:58 |
dansmith | I've been rechecking something all week (>:() so I'm hoping I might get it through now... | 20:58 |
dansmith | unfortunately that console log has a test failure in it | 21:00 |
dansmith | I'll look at the details when it finishes | 21:00 |
gouthamr | ++ | 21:03 |
gouthamr | > I've been rechecking something all week -- :( | 21:03 |
dansmith | yeah, it's sucking my will to live :/ | 21:12 |
gmann | haleyb: I checked and could not find where it can be set as "" but anyways setting it always in case of single image in glance is ok. approved your changes | 22:31 |
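
The host-key race that dansmith hypothesizes at 14:52 and 15:27 can be illustrated with a small simulation. This is only a sketch under stated assumptions: `AutoAddClient`, `BadHostKey`, and `poll` are hypothetical names invented here, not tempest or paramiko code, and the sketch does not use paramiko at all. It assumes a single client object is reused across the polling loop and that it remembers the first host key it sees, which is the effect an auto-add policy has on an unknown host.

```python
class BadHostKey(Exception):
    """Raised when a host presents a key that differs from the remembered one."""


class AutoAddClient:
    """Hypothetical toy stand-in for an SSH client that auto-adds unknown host keys."""

    def __init__(self):
        self.known_hosts = {}  # host -> key remembered from first contact

    def connect(self, host, presented_key):
        remembered = self.known_hosts.get(host)
        if remembered is None:
            # Unknown host: trust and remember the key (auto-add behaviour).
            self.known_hosts[host] = presented_key
        elif remembered != presented_key:
            raise BadHostKey(f"{host}: expected {remembered}, got {presented_key}")


def poll(client, host, keys_over_time):
    """Attempt one connection per presented key and record each outcome."""
    outcomes = []
    for key in keys_over_time:
        try:
            client.connect(host, key)
            outcomes.append("key accepted")
        except BadHostKey:
            outcomes.append("bad host key")
    return outcomes


# Assumed timeline: one attempt lands before the rebuild completes,
# the remaining attempts land after the guest regenerated its host key.
timeline = ["key-before-rebuild", "key-after-rebuild", "key-after-rebuild"]

reused = poll(AutoAddClient(), "10.0.0.5", timeline)
fresh = [poll(AutoAddClient(), "10.0.0.5", [key])[0] for key in timeline]

print(reused)  # ['key accepted', 'bad host key', 'bad host key']
print(fresh)   # ['key accepted', 'key accepted', 'key accepted']
```

If the hypothesis holds, a client reused across attempts fails on every attempt after the key changes, while a client created fresh per attempt never has a stale key to compare against. Whether tempest's poll loop actually reuses a client this way is not established in the discussion above.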