opendevreview | Merged openstack/nova master: Add compute_id to Instance object https://review.opendev.org/c/openstack/nova/+/879500 | 01:38 |
---|---|---|
opendevreview | Takashi Kajinami proposed openstack/nova master: Enforce correct choice for [DEFAULT] enabled_ssl_apis https://review.opendev.org/c/openstack/nova/+/889787 | 01:38 |
opendevreview | melanie witt proposed openstack/nova master: block_device: Add encryption attributes to swap disks https://review.opendev.org/c/openstack/nova/+/884312 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: libvirt: Configure and teardown ephemeral encryption secrets https://review.opendev.org/c/openstack/nova/+/826754 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: imagebackend: Add support to libvirt_info for LUKS based encryption https://review.opendev.org/c/openstack/nova/+/826755 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to convert_image https://review.opendev.org/c/openstack/nova/+/870934 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Support create with ephemeral encryption for qcow2 https://review.opendev.org/c/openstack/nova/+/870932 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Support resize with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870933 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Add hw_ephemeral_encryption_secret_uuid image property https://review.opendev.org/c/openstack/nova/+/870935 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to qemu-img rebase https://review.opendev.org/c/openstack/nova/+/870936 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: Support snapshot with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870937 | 07:10 |
opendevreview | melanie witt proposed openstack/nova master: block_device: Add encryption attributes to swap disks https://review.opendev.org/c/openstack/nova/+/884312 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: libvirt: Configure and teardown ephemeral encryption secrets https://review.opendev.org/c/openstack/nova/+/826754 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: imagebackend: Add support to libvirt_info for LUKS based encryption https://review.opendev.org/c/openstack/nova/+/826755 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to convert_image https://review.opendev.org/c/openstack/nova/+/870934 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Support create with ephemeral encryption for qcow2 https://review.opendev.org/c/openstack/nova/+/870932 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Support resize with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870933 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Add hw_ephemeral_encryption_secret_uuid image property https://review.opendev.org/c/openstack/nova/+/870935 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Add encryption support to qemu-img rebase https://review.opendev.org/c/openstack/nova/+/870936 | 07:20 |
opendevreview | melanie witt proposed openstack/nova master: Support snapshot with ephemeral encryption https://review.opendev.org/c/openstack/nova/+/870937 | 07:20 |
*** tobias-urdin-pto is now known as tobias-urdin | 07:34 | |
opendevreview | Sylvain Bauza proposed openstack/nova master: DNM: Verify whether we persist some wrong ReqSpec values https://review.opendev.org/c/openstack/nova/+/889843 | 08:54 |
opendevreview | alecorps proposed openstack/nova-specs master: new spec: support of vnc console for mks https://review.opendev.org/c/openstack/nova-specs/+/889844 | 09:02 |
sean-k-mooney | bauzas: so we have a gate blocker in the nova-lvm job | 10:20 |
bauzas | shit shit shit | 10:20 |
sean-k-mooney | test_get_console_output_server_id_in_shutoff_status is failing with AssertionError: '' is not true : Console output was empty. | 10:20 |
sean-k-mooney | im seeing that quite a lot | 10:21 |
sean-k-mooney | always in the lvm job for some reason and not others | 10:22 |
sean-k-mooney | looking at the test im wondering if this could be a test bug or not | 10:23 |
sean-k-mooney | https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L713 | 10:23 |
sean-k-mooney | im wondering if we are perhaps stoping the server before the guest writes to the console? | 10:23 |
sean-k-mooney | https://opensearch.logs.openstack.org/_dashboards/app/discover?security_tenant=global#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30d,to:now))&_a=(columns:!(hosts_region,project,build_name),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'94869730-aea8-11ec-9e6a-83741af3fdcd',key:project,negate:!f,params:(query:openstack%2Fnova),ty | 10:27 |
sean-k-mooney | pe:phrase),query:(match_phrase:(project:openstack%2Fnova)))),index:'94869730-aea8-11ec-9e6a-83741af3fdcd',interval:d,query:(language:kuery,query:'message:%22Console%20output%20was%20empty.%22'),sort:!()) | 10:27 |
sean-k-mooney | hum let me get a short link | 10:27 |
sean-k-mooney | https://tinyurl.com/kvcc9289 | 10:28 |
sean-k-mooney | 132 failures in the last 3 days | 10:29 |
bauzas | sean-k-mooney: ack, ok | 10:36 |
bauzas | sorry, was on meeting | 10:37 |
sean-k-mooney | no worries just filed https://bugs.launchpad.net/nova/+bug/2028851 | 10:37 |
sean-k-mooney | im going ot add a wiat for sshabel before the server stop in tempest and see if that helps | 10:38 |
bauzas | sean-k-mooney: it looks to me it regressed two days before https://zuul.openstack.org/builds?job_name=nova-lvm&skip=0 | 10:38 |
bauzas | last successful run was 2023-07-24 09:18:03 | 10:39 |
sean-k-mooney | ya it stated on the 24th | 10:39 |
sean-k-mooney | im not really sure what could have merged then | 10:39 |
bauzas | looking at gerrit | 10:39 |
bauzas | it doesn't sound a race, rather a regression | 10:40 |
sean-k-mooney | https://github.com/openstack/nova/commits/master | 10:40 |
sean-k-mooney | we had one commit on the 21st and then the 25th | 10:40 |
sean-k-mooney | so i dont see anything in nova that aligns | 10:40 |
bauzas | yeah | 10:42 |
bauzas | sean-k-mooney: but tempest has suspects https://github.com/openstack/tempest/commits/04cb0adc822ffea6c7bfccce8fa08b03739894b7 | 10:43 |
sean-k-mooney | ya i looked at the optimize one | 10:44 |
sean-k-mooney | i dont think that is touching the relevent test | 10:44 |
bauzas | https://github.com/openstack/tempest/commit/eea2c1cfac1e5d240cad4f8be68cff7d72f220a8 | 10:45 |
sean-k-mooney | oh the merge commit | 10:45 |
bauzas | we have a problem with server actions but this patch touches this | 10:45 |
sean-k-mooney | yep it moved the test | 10:47 |
sean-k-mooney | this is the orginal commit https://github.com/openstack/tempest/commit/1643536c9f7407dceccfaa18a2dd95156de1419c looks like dan wrote it | 10:48 |
bauzas | probably an unnotified merge conflict when rebasing | 10:49 |
sean-k-mooney | i dont think so | 10:50 |
sean-k-mooney | well maybe so what dan did was split up the large class into 3 | 10:50 |
sean-k-mooney | so that we got bettter parallisum | 10:50 |
sean-k-mooney | reducing the over all time of the tests form 15 min to 5 on there system | 10:51 |
sean-k-mooney | but that might mean we have more load | 10:51 |
sean-k-mooney | and the vm boots might now be slower | 10:51 |
bauzas | yeah sounds plausible | 10:53 |
sean-k-mooney | im just going to change server = self.create_test_server(wait_until='ACTIVE') | 10:54 |
sean-k-mooney | to server = self.create_test_server(wait_until='SSHABLE') | 10:54 |
bauzas | shoot | 10:56 |
bauzas | and depends-on it with a nova dnm please | 10:56 |
sean-k-mooney | yep | 10:56 |
* bauzas goes eating a kebab | 10:56 | |
sean-k-mooney | now i want one :) | 10:57 |
sean-k-mooney | enjoy | 10:57 |
opendevreview | sean mooney proposed openstack/nova master: use disk caching to hide slow cinder performance https://review.opendev.org/c/openstack/nova/+/889383 | 11:03 |
sean-k-mooney | ill use my diskcache mode change to test it instead ^ that way we can test both and save ci time | 11:04 |
opendevreview | alecorps proposed openstack/nova-specs master: new spec: support of vnc console for mks https://review.opendev.org/c/openstack/nova-specs/+/889857 | 12:56 |
dansmith | sean-k-mooney: if you could use my nick so I get a notify I'd appreciate it | 13:29 |
dansmith | I can't notify on "dan" because it's far too common | 13:29 |
dansmith | sean-k-mooney: so which test are you changing from active to sshable? | 13:30 |
dansmith | the split patch literally just moved code, so it shouldn't affect anything | 13:30 |
dansmith | it will increase parallelism, but we usually are running that test when the workers are fully loaded with tests, not after they've run out of work, | 13:30 |
dansmith | so I wouldn't expect it to really change the profile much | 13:31 |
dansmith | bauzas: can you please get this queued up? you've already +2d the follow-up to address your feedback: https://review.opendev.org/c/openstack/nova/+/884752/2 | 13:35 |
bauzas | dansmith: cool | 13:35 |
dansmith | I've been rechecking these for three weeks and I'm getting close, so I want to be able to land this when the time comes | 13:35 |
bauzas | dansmith: as we discussed with sean-k-mooney this morning, we have a job issue | 13:35 |
dansmith | I realize that, I still would like to have it ready to go, the patch below it is still waiting so +W isn't going to send it | 13:36 |
sean-k-mooney | i dont think we should slow down reviewing bacause of the job issue | 13:37 |
dansmith | agree | 13:37 |
dansmith | sean-k-mooney: I see the console_output thing, that's because it's being requested as sshable and doesn't have the validation things it needs | 13:37 |
dansmith | sean-k-mooney: not sure how that changed for that test, but I can fix it easily | 13:38 |
dansmith | sean-k-mooney: did you open a bug for it? | 13:38 |
sean-k-mooney | i fixed the validation thing | 13:38 |
sean-k-mooney | i opened a bug | 13:38 |
sean-k-mooney | and i have a ptach in ci right now | 13:38 |
sean-k-mooney | its almost completed running | 13:39 |
dansmith | link? | 13:39 |
sean-k-mooney | https://review.opendev.org/c/openstack/tempest/+/889847 | 13:39 |
sean-k-mooney | https://launchpad.net/bugs/2028851 | 13:39 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/889383 | 13:39 |
sean-k-mooney | i used my disk caching patch to run it with a depends on | 13:39 |
sean-k-mooney | https://zuul.openstack.org/stream/9890c9b53059410eabc0c61df0376217?logfile=console.log | 13:39 |
sean-k-mooney | that is the nova-lvm job i have it open on my other screen | 13:40 |
dansmith | oh I was looking at this: https://8d44202ac4ef26a68d85-49968699300828e6c9b78fd54dff75ef.ssl.cf1.rackcdn.com/889383/4/check/nova-lvm/368f504/testr_results.html | 13:40 |
dansmith | the in_shutoff test fail | 13:40 |
dansmith | where it claims it doesn't have validation_resourecs | 13:40 |
dansmith | sorry, no the unlimited_size one | 13:40 |
dansmith | well, both | 13:40 |
dansmith | ah, that's yours | 13:40 |
sean-k-mooney | ya in a previous revsion i enabled sshable | 13:40 |
sean-k-mooney | but not the validation resouces | 13:41 |
dansmith | ah, okay cool | 13:41 |
dansmith | okay so yep, increased load making us lose that race makes sense | 13:42 |
dansmith | it's definitely not 100% fail | 13:42 |
sean-k-mooney | its only happening in the lvm job and i think that uses lvm on a loopback file so i suspect the disk io is worse there then other jobs | 13:43 |
dansmith | because we've merged stuff in the last couple of days | 13:43 |
dansmith | er | 13:44 |
dansmith | I actually don't see that on those patches.. is that in experimental or something? | 13:44 |
sean-k-mooney | no https://github.com/openstack/nova/blob/master/.zuul.yaml#L230-L244 | 13:44 |
sean-k-mooney | but it does not run on all changes | 13:44 |
sean-k-mooney | it only runs on change to libvirt driver | 13:45 |
dansmith | ah, only on libvirt I see | 13:45 |
dansmith | okay | 13:45 |
dansmith | well in that case we *definitely* shouldn't be holding up review ;P | 13:45 |
bauzas | sorry, on an internal meeting now | 13:45 |
bauzas | dansmith: yeah, I haven't asked to hold reviews :) | 13:46 |
dansmith | bauzas: sounded like you did, but okay :) | 13:46 |
bauzas | dansmith: it's the nova-lvm job, so yeah no | 13:47 |
dansmith | ack | 13:48 |
sean-k-mooney | ill bring this up as a ptg topic but we have had the vmware driver as experimatal and deprecated for almost 2 cycle now. i think in C we should consider removing it | 13:48 |
sean-k-mooney | or at least moving it out of tree | 13:49 |
bauzas | I don't personnally disagree, since Caracal will be a SLURP release | 13:51 |
sean-k-mooney | the deprecation/experimental warning went out in antelope so we ahve warned people and i dotn think we have had any ci in maybe 2 years | 13:52 |
sean-k-mooney | there is a spec propsoed to add a new nova-mksproxy intree for vmware. when we added mks supprot in liberty the proxy server was maintained out of tree | 13:53 |
sean-k-mooney | https://opendev.org/x/nova-mksproxy | 13:53 |
sean-k-mooney | but i dont really like the idea of adding a new proxy in tree only for vmware with no testing | 13:53 |
sean-k-mooney | maintianed is proably an overly stong statement as almost nothign has been done to it in 6 years | 13:54 |
bauzas | sean-k-mooney: at least I haven't heard of any operator using vmware after Train | 14:02 |
dansmith | sean-k-mooney: still failing this one: https://d5c0e4c80966c29b54ec-9ed411199ab33310a8b6bd95d117cd22.ssl.cf1.rackcdn.com/889383/4/check/nova-lvm/9890c9b/testr_results.html | 14:27 |
sean-k-mooney | ya.. just saw | 14:28 |
sean-k-mooney | any other idea on how to fix other then disable it | 14:28 |
dansmith | that's weird, because it's failing after it has already been sshable and shutoff | 14:29 |
dansmith | oh wait, lemme check something | 14:29 |
dansmith | okay, no.. I was thinking maybe validation was disabled for this job, but it's not | 14:30 |
dansmith | so yeah, pretty weird | 14:30 |
dansmith | maybe the api times out getting the console from the compute and doesn't report failure or something? | 14:31 |
dansmith | sean-k-mooney: man, I think that's a legit failure, I just don't know why | 14:45 |
dansmith | we ssh into the guest and then we poll the console log a LOT of times before we give up and there's nothing in it | 14:46 |
dansmith | AFAICT, that basically means we lost the console for that instance, maybe because we overwrote the file or it is hooked up to a different inode than what we're reading, like if we raced to create/ensure it or something | 14:47 |
bauzas | dansmith: sean-k-mooney: we should possibly skip the test until we find the root cause | 15:03 |
dansmith | bauzas: sean-k-mooney is going to try to repro locally and if not, then skip yeah | 15:03 |
bauzas | ++ | 15:03 |
frickler | if an instance is shutoff, is the console log cleared? or should the log from the last run still be shown? | 15:03 |
dansmith | frickler: the test is specifically testing that it's not cleared | 15:05 |
dansmith | however, there are circumstances where we read the log from a pty, which is supposed to be for container stuff | 15:05 |
dansmith | but if we're getting that on the lvm config because we try not to write files, then that could be a cause | 15:05 |
dansmith | in which case the test is invalid (currently) for LVM and we were just racing before to pass it | 15:05 |
dansmith | but that is what we're going to check locally | 15:06 |
sean-k-mooney | frickler: for our default config we write it to a file on disk (or rahter libvirt/qemu does) | 15:07 |
sean-k-mooney | so we normally would epxect it to still be there when its powered off | 15:07 |
sean-k-mooney | but it gets cleard via being overwtitten on the next vm start | 15:07 |
frickler | ah, o.k., I would have expected it to get removed when the instances is shut down | 15:09 |
sean-k-mooney | for the most part we just stop the domain and dont do any cleanup of files on shutdown | 15:14 |
sean-k-mooney | and since this file is created by qemu or virtlogd depening on your libvirt config | 15:14 |
sean-k-mooney | nova does not really mange it excpetion when you delete a vm | 15:15 |
sean-k-mooney | its created under the instance dir so it gets cleaned up when that is removed | 15:15 |
frickler | makes sense when you think about it, yes | 15:17 |
sean-k-mooney | the odd thing is the test has not changed in like 6 years or more | 15:23 |
sean-k-mooney | so why it started failing 3 days ago is beyond me | 15:23 |
melwitt | I was just about to ask, was this passing recently | 15:27 |
sean-k-mooney | ya opensearch says it started failing on the 24th | 15:28 |
sean-k-mooney | dan optimised the execution time by making some of the test run across multipel workers and that seams to have exposed an existing bug | 15:29 |
sean-k-mooney | wether its a bug in nova or in the test is TBD | 15:29 |
melwitt | I can try that locally and see if it shows up for me | 15:30 |
sean-k-mooney | hence why we are repoducing it locally now. im just waiting on devstack to complete | 15:30 |
sean-k-mooney | the more the merrier | 15:30 |
melwitt | lord knows I got devstacks already available | 15:30 |
sean-k-mooney | well do you have one with lvm | 15:31 |
sean-k-mooney | this is only affecting the lvm job and ther emight eb some lvm specific logic at play | 15:31 |
melwitt | hm, no. would it be equivalent at all to do it with a boot from lvm volume? | 15:32 |
sean-k-mooney | i dont think so but its worth a try i guess if you have it to hand | 15:32 |
sean-k-mooney | the test is passing we thing in other jobs but ill admit i didnt actully check if it was running in other jobs | 15:32 |
melwitt | I'll check. given that the only condition appears to be CONF.compute_feature_enabled.console_output I'd expect it is running for other jobs | 15:33 |
melwitt | yeah it runs and passes in tempest-integrated-compute, for example | 15:35 |
sean-k-mooney | ya it is | 15:35 |
sean-k-mooney | also in nova-ceph-multistore | 15:36 |
dansmith | yeah, and also note that sean-k-mooney has proposed changes to it, | 15:41 |
dansmith | which may make it even more lvm-specific | 15:41 |
sean-k-mooney | ... typo in my local.conf, take 2 i guess | 15:47 |
melwitt | well, I'm getting weird results in both the bfv and non bfv cases I get a 404 error when requesting the console log. not sure if that's just something wrong in my environment | 15:51 |
melwitt | meaning, I know it's passing in non lvm jobs so why is it doing this for me | 15:52 |
sean-k-mooney | not that could be the same issue | 15:53 |
dansmith | the lvm jobs are lvm backend right? | 15:54 |
dansmith | so really have nothing to do with a bfv instance | 15:55 |
sean-k-mooney | yes image_type=lvm | 15:55 |
melwitt | like, the code is relying on the guest xml to get the console log path, but a stopped instance won't have xml, so how is this ever working? | 15:55 |
melwitt | I must be missing something | 15:56 |
sean-k-mooney | it does | 15:56 |
sean-k-mooney | we dont undefien the xml when we power off | 15:56 |
sean-k-mooney | we delete the domain and recreate it every tiem we start a domian but power off leaves it in place | 15:57 |
melwitt | hm ok | 15:57 |
sean-k-mooney | we use that fact to make vGPU work by reading the mdev path out of the xml | 15:57 |
dansmith | yeah, it has xml when stopped | 15:58 |
dansmith | that's the "defined" vs. "active | 15:59 |
dansmith | " xml | 15:59 |
melwitt | ok, I'm not able to get the domain with 'virsh dumpxml [--inactive] <uuid>' so far I dunno how to look at it | 15:59 |
melwitt | gotcha | 15:59 |
sean-k-mooney | do virsh list --all | 16:00 |
dansmith | you need to use the id not the uuid | 16:00 |
melwitt | ok, thanks I see them | 16:00 |
melwitt | id is blank but it has "name" so I'll try that | 16:00 |
dansmith | oh for the inactive yeah | 16:00 |
sean-k-mooney | ok stacked lets boot a vm and see what happened | 16:01 |
dansmith | but uuid is just metadata for libvirt AFAIK.. I've never known to be able to use that for virsh commands | 16:01 |
melwitt | you can. I had been doing that to look at inactive domain xml | 16:02 |
melwitt | ok got it. I wasn't using sudo 🙄 | 16:03 |
sean-k-mooney | ok so i have the console in horizon | 16:03 |
sean-k-mooney | ill check the host fs/xml and then stop it | 16:03 |
melwitt | uuid works with sudo too | 16:03 |
melwitt | ok | 16:03 |
sean-k-mooney | so there is a console.log in the instanace dir | 16:04 |
sean-k-mooney | and it has data since i can cat it so ill power it off now | 16:05 |
melwitt | same | 16:05 |
melwitt | same | 16:05 |
melwitt | mine has data after poweroff to | 16:05 |
melwitt | but, in my case console_type == 'pty' and so there must be something weird happening with that | 16:05 |
melwitt | the console log file path in the xml is correct and I can use it to open the log file | 16:06 |
sean-k-mooney | https://paste.opendev.org/show/bMa9I7WsUnEfsrMC6iQz/ | 16:06 |
melwitt | is the console log get API working for you? | 16:06 |
sean-k-mooney | i did it in horizon but ill check in a sec on the cli | 16:07 |
sean-k-mooney | horizon is now showing "Unable to get log for instance "ea890cd1-c3fe-4455-bcd9-b44ea830a136"." | 16:07 |
dansmith | melwitt: interesting, I started with libvirt before there were even non-active domains and definitely since before uuids were in the xml, so I guess 15-year habits die hard :) | 16:08 |
melwitt | for mine there's helpfully no log messages that say anything about why/where it's running into a problem | 16:08 |
dansmith | ahh, console type is pty, | 16:08 |
dansmith | that's what I was wondering if it was happening | 16:08 |
dansmith | the comment says it's only for container types, but it would explain why we're losing console on shutdown | 16:09 |
dansmith | so with pty, there's no storage of the console stuff once the other side goes away, like when shutdown | 16:09 |
melwitt | I feel like it has to be this failing: data = nova.privsep.libvirt.readpty(console_path) | 16:09 |
dansmith | melwitt: your console path is a file or a path to a pty? | 16:10 |
melwitt | it's a file. but it is also type pty. which I assume is not right? | 16:10 |
dansmith | idk, to be honest | 16:10 |
sean-k-mooney | on a call by the way | 16:10 |
dansmith | the code looks like nova reads directly from the pty if the type is pty, so the file wouldn't be used | 16:10 |
dansmith | ah, wait, | 16:11 |
dansmith | it looks like it reads from the pty and writes the file if type=pty | 16:11 |
melwitt | the console files are there and are properly with data | 16:11 |
dansmith | but that would only work when it's running | 16:11 |
dansmith | so I wonder if the test is now breaking because we never read the log before the instance is shutdown (which is what the test was supposed to be doing) | 16:12 |
melwitt | oh hm ok, it must be bc I queried for the console before I shut it down | 16:12 |
melwitt | I'll try not querying before | 16:12 |
dansmith | if we did once, then it would write the file and then we could read the file after it's shutdown | 16:12 |
dansmith | so maybe the test just needs to hit the console endpoint once after ssh'able and then shutdown, and verify that it still works after | 16:12 |
melwitt | yeah.. for me I still can't get it via the API despite the fact that the file is there and the file path matches what's in the xml | 16:13 |
dansmith | is the domain inactive? | 16:13 |
melwitt | yes | 16:13 |
dansmith | I would think the console element would go away partially while inactive, | 16:14 |
dansmith | but is that causing us to not read from the file? | 16:14 |
melwitt | <console type='pty'> | 16:14 |
melwitt | <log file='/opt/stack/data/nova/instances/682931e5-4563-4955-a552-526a31c6957a/console.log' append='off'/> | 16:14 |
melwitt | <target type='serial' port='0'/> | 16:14 |
melwitt | </console> | 16:14 |
melwitt | and that file is right. it just can't get it for whatever reason | 16:15 |
dansmith | yeah | 16:15 |
dansmith | lemme re-read this code, hang on | 16:15 |
dansmith | okay so if it's a pty, we read anything from it we can and append to the log, | 16:15 |
dansmith | I guess so that we're constantly pulling pty content and filling the log file, and then we just "should" be returning the log file contents for the requester | 16:16 |
dansmith | melwitt: can you just add some logs to those functions and see what's going on? | 16:16 |
melwitt | yeah | 16:16 |
sean-k-mooney | ok call done | 16:58 |
sean-k-mooney | i was going to check the cli | 16:58 |
sean-k-mooney | did ye fined anything in the intrim | 16:58 |
sean-k-mooney | the content is still there on disk for what its worth | 16:58 |
sean-k-mooney | stack@upstream-devstack:~/repos/devstack$ openstack console log show test | 17:00 |
sean-k-mooney | ResourceNotFound: 404: Client Error for url: http://192.168.122.17/compute/v2.1/servers/ea890cd1-c3fe-4455-bcd9-b44ea830a136/action, Guest does not have a console available. | 17:00 |
sean-k-mooney | oh... | 17:03 |
sean-k-mooney | Fetching instance by UUID {{(pid=1349614) get /opt/stack/nova/nova/compute/api.py:2978 | 17:03 |
sean-k-mooney | ok that is not relevent its just https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2978-L2982 | 17:05 |
dansmith | melwitt: so to be clear, you've got regular guests with pty consoles yeah? | 17:05 |
sean-k-mooney | [None req-0042b3bd-d199-4c9d-b8f2-f1ee3a2d444e demo admin] HTTP exception thrown: Guest does not have a console available. | 17:06 |
dansmith | I wonder if this was the fix for the "console log grows forever" problem and my (our?) assumptions that it's still just a file on disk is old | 17:06 |
dansmith | if that's so, I don't see how we'd ever pass that test once we remove the races, if we don't get the console log once while it's running | 17:06 |
sean-k-mooney | so there is a file on disk | 17:07 |
dansmith | right, but the code in libvirt driver is feeding it from the pty each time it runs | 17:08 |
melwitt | ok, so what's happening is the xml for <console> element is different after the instance is stopped | 17:09 |
melwitt | so the parsing it's doing is not finding the console path | 17:09 |
melwitt | before the instance is stopped, there's <source path='/dev/pts/3'/> under console. after the instance is stopped there is not | 17:11 |
melwitt | and it's looking for this ('pty', "./devices/console[@type='pty']/source[@path]", 'path') | 17:11 |
sean-k-mooney | https://paste.opendev.org/show/bU3Vb2zX0TgOqU42xBAV/ | 17:11 |
sean-k-mooney | it is be we have the log file line in both cases | 17:12 |
sean-k-mooney | so we shoudl be usign that not source | 17:12 |
sean-k-mooney | or at least falling back to the log line if source is not found | 17:12 |
melwitt | https://paste.openstack.org/show/bnLH5aJxsHlzXdZS455s/ | 17:13 |
melwitt | yeah, I think that's the bug | 17:13 |
melwitt | it's not falling back | 17:13 |
sean-k-mooney | well i think log is alwasy where to im not sure if the fallback is ever needed | 17:13 |
sean-k-mooney | i woudl be incliend to prefer log and fallback ot source if its not there | 17:14 |
melwitt | the log file is always there. it's just using source only | 17:14 |
melwitt | it looks like the code intends to fall back to the file but it's not working | 17:15 |
sean-k-mooney | we also have https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4577C1-L4580C43 | 17:15 |
sean-k-mooney | we are generating what that will be | 17:15 |
sean-k-mooney | we dont ever need to parse the xml | 17:15 |
dansmith | so the test not waiting for sshable was actually the reason this never failed? because we were racing with the shutdown? | 17:15 |
melwitt | hah probably | 17:16 |
melwitt | I dunno | 17:16 |
dansmith | I guess I don't know why this is only failing lvm still, unless it's just lvm being slower or faster about something? | 17:16 |
melwitt | yeah that's what I don't understand | 17:16 |
melwitt | I would have wondered if console type is not pty if not lvm but my setup is rbd and I get console type pty.. and the ceph job isn't failing this test iiuc | 17:17 |
dansmith | right | 17:18 |
sean-k-mooney | i think its a pty when virtlogd is installed and running | 17:18 |
sean-k-mooney | qemu is using the pty to talk to virtlogd which is writing to the file | 17:18 |
sean-k-mooney | i think | 17:18 |
sean-k-mooney | but i dont see why we woudl ever need to read form that. we can simplfy this code by using | 17:19 |
sean-k-mooney | _get_console_log_path(instance) | 17:19 |
sean-k-mooney | to return the log file and just read directly form that | 17:19 |
melwitt | yeah I dunno if there is or was a reason for doing the reading from pty | 17:20 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4462-L4488 | 17:21 |
sean-k-mooney | so this is doub | 17:21 |
sean-k-mooney | we use console_log = self._get_console_log_path(instance) | 17:21 |
sean-k-mooney | to get the path which is to the file but that not the path we check for at the start | 17:22 |
sean-k-mooney | oh actully | 17:22 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4487-L4502 | 17:23 |
sean-k-mooney | we read form the pty and write to the log and then read form the log again but this time without privsep | 17:23 |
melwitt | this is like rube goldberg | 17:24 |
sean-k-mooney | ok so if i tail the log file and start the vm it gets the stream | 17:24 |
sean-k-mooney | so we dont actully need to write to the logfile | 17:25 |
sean-k-mooney | it already has the data form the pty | 17:25 |
sean-k-mooney | and shutting off the instance dow not affect the data | 17:25 |
melwitt | what about if the compute host were to be rebooted? just trying to think if there's any case where we need the file | 17:26 |
sean-k-mooney | the file is beiing wrttien by qemu/virtlogd | 17:26 |
sean-k-mooney | its only written by use if we do a console log show | 17:27 |
melwitt | oh ok | 17:27 |
sean-k-mooney | so if we removed that code from nova it would already have the data | 17:27 |
sean-k-mooney | i did sudo tail -F console.log to confirm that | 17:28 |
dansmith | sean-k-mooney: certainly we need to not be reading from that pty if virtlogd is already doing it | 17:28 |
sean-k-mooney | when the vm starts it recreates the file | 17:28 |
dansmith | so where in there are we only doing that if virtlogd is not? | 17:28 |
sean-k-mooney | so is this part of the xml generated by libvirt or us | 17:29 |
sean-k-mooney | i tought we were just setting the log location and libvirt filled in the rest based on if virtlogd was deployed and how it was configured | 17:29 |
dansmith | not sure that matters, since this is on get and doesn't know if we added it or libvirt itself did right? | 17:30 |
sean-k-mooney | well we are specifying the location of the file | 17:30 |
sean-k-mooney | so my point is i think we can ignore if its a pty or not | 17:30 |
sean-k-mooney | and alwasy just go to the file | 17:30 |
dansmith | not here.. here we're just looking at the state of the xml | 17:31 |
sean-k-mooney | but we dont need to look at the xml at all | 17:31 |
sean-k-mooney | thats my point we have a function that takes the instance and returs the log path to use | 17:31 |
sean-k-mooney | its always called consol.log in the instnace dir | 17:31 |
dansmith | I know that's your point, I'm just not sure it's right :) | 17:31 |
sean-k-mooney | ok :) | 17:32 |
dansmith | I want to know how we're not tripping over us reading from the pty when virtlogd already is | 17:32 |
sean-k-mooney | ah good question | 17:32 |
dansmith | I mean I guess if virtlogd is select()ing from it then it has already read everything we potentially could have from the pty each time we go to read from it, and we just get lucky that we never compete | 17:32 |
dansmith | but I would expect to occasionally see some corruption there | 17:33 |
dansmith | and is virtlogd always a thing in libvirt now, or is it optional? | 17:33 |
dansmith | and even still, | 17:33 |
dansmith | it looks to me like we *are* always returning the contents of the file at the bottom of that function, | 17:33 |
sean-k-mooney | https://paste.opendev.org/show/bW8SNwHjfUFkOMv4gp5N/ | 17:34 |
dansmith | so why are we getting blank log output when the file on disk clearly has stuff in it? | 17:34 |
sean-k-mooney | that is the orginal xml we generate | 17:34 |
sean-k-mooney | there is no console element at all we just set the serial port and pass the log path | 17:34 |
dansmith | that has no console stuff in it and is not complete | 17:34 |
sean-k-mooney | no i just didn copy it all one sec | 17:34 |
dansmith | but I don't see what it matters what we generate | 17:35 |
sean-k-mooney | https://paste.opendev.org/show/bdwD3vAg5f6D6iNsIHk1/ | 17:35 |
melwitt | it's relying on <source> in the xml to get the console path, so it gets no path at all | 17:35 |
sean-k-mooney | we are telling libvirt to do whatever it need to to make sure teh serial console is in a given file | 17:36 |
dansmith | melwitt: oh okay, gotcha you said that earlier I think | 17:36 |
dansmith | okay, so if we ask for pty/logfile and rely on virtlogd now it sounds like | 17:36 |
sean-k-mooney | i think we shoudl be changing this code to only use that instaed | 17:36 |
dansmith | so the comment in the code there about ptys says something about container types, | 17:36 |
dansmith | presumably because they will hang a process if something doesn't consume their pty | 17:36 |
* sean-k-mooney looks | 17:36 | |
dansmith | so perhaps that code is leftover from long ago before we moved to virtlogd everywhere, and that all needs to be ripped out? | 17:37 |
melwitt | I did but I know it's hard to visualize. here's my dumb print statements showing how it goes https://paste.openstack.org/show/bpovXD89OL2QzukI3BWi/ | 17:37 |
sean-k-mooney | he virt_types kvm and qemu are the only ones | 17:37 |
sean-k-mooney | # which create a dedicated file device for the console logging. | 17:37 |
sean-k-mooney | ok | 17:37 |
dansmith | keep reaing | 17:37 |
dansmith | *reading | 17:37 |
sean-k-mooney | so the pty fallback is for other virt drivers | 17:37 |
sean-k-mooney | not driver virt types like lxc | 17:37 |
dansmith | it says "like lxc" | 17:38 |
sean-k-mooney | yep | 17:38 |
dansmith | you said "not virt types like lxc" | 17:38 |
sean-k-mooney | so if virt_type not in (qemu,kvm) | 17:38 |
sean-k-mooney | then we need to read from the pty | 17:38 |
sean-k-mooney | and save it to a file | 17:38 |
sean-k-mooney | but for qemu/kvm we dont | 17:38 |
dansmith | anyway, it makes me super nervous to mess with stuff like this without a very clear understanding of what is going on, and I *still* don't understand why this isn't always failing especially after sean-k-mooney's change to the test | 17:39 |
sean-k-mooney | so i think hte fix is to wrap that cod in an if/else based on the virt_type | 17:39 |
melwitt | +1 to be really careful messing with this | 17:39 |
dansmith | sean-k-mooney: why do those virt types not also benefit from virtlogd? if anything, moving to the pty and daemon would make qemu/kvm more like lxc | 17:39 |
sean-k-mooney | ya im not sure why ceph is not exploding | 17:40 |
dansmith | ceph or any of the other jobns | 17:40 |
sean-k-mooney | so they might but i wonder if this comment predates the creation of virtlogd | 17:40 |
sean-k-mooney | i could try deploying with lxc and find out | 17:41 |
dansmith | the fact that we don't (that we know of) see corruption of console logs due to us sometimes reading from the pty also just makes me sus that this is not actually happening | 17:41 |
sean-k-mooney | before virtlogd was a thing qemu used to write to the file itself | 17:41 |
dansmith | sean-k-mooney: right that's my point. maybe this all needs to go, not preserve some of it for lxc | 17:41 |
dansmith | sean-k-mooney: right I know, and virtlogd should have made them all effectively the same | 17:41 |
sean-k-mooney | ok so there are a couple of thing i can try ill add some logs to the fucntion and see if we are hitting this | 17:42 |
sean-k-mooney | then i can spin up another vm with nova and lxc | 17:42 |
sean-k-mooney | do you know if stoping virtlogd would be enough to change libvirt behavior | 17:42 |
sean-k-mooney | i coudl also try that | 17:42 |
dansmith | ohhh hang on | 17:42 |
dansmith | melwitt: does the running instance have a source= tag in the console/pty element? | 17:43 |
dansmith | if not, then it's never finding this console_type==pty thing because the xpath won't match | 17:43 |
melwitt | dansmith: it does after the initial create. I haven't checked if it does after I start it after stopping it. checking now | 17:44 |
dansmith | okay I thought you said it didn't on stop | 17:44 |
sean-k-mooney | it does | 17:44 |
sean-k-mooney | after we stop it it goes away | 17:44 |
melwitt | dansmith: ? I said it has <source> before stop and then it does not have <source> after stop | 17:44 |
sean-k-mooney | when we do a dumpxml we are not actully reading a file | 17:44 |
sean-k-mooney | we are geting the responce form the libvirt deamon | 17:44 |
dansmith | so after stop we should be getting the actual file on disk without the pty messing it up | 17:45 |
sean-k-mooney | i dont know if the console is ever on the file on disk | 17:45 |
sean-k-mooney | ya proably does anyoen know where that file is and we can check | 17:45 |
melwitt | if I start it again it has <source> agian | 17:46 |
sean-k-mooney | ah its in /etc/libvirt/qemu/instance-00000001.xml | 17:46 |
melwitt | *again | 17:46 |
dansmith | okay, so | 17:46 |
dansmith | after we shutdown, | 17:46 |
dansmith | we match neither pty or file, the path.exists fails and we just return "" for the console log | 17:46 |
dansmith | is that the thinking? | 17:47 |
dansmith | (sorry if I'm just being slow) | 17:47 |
melwitt | (and console log get works again) | 17:47 |
dansmith | I just so don't see how this is working for the non-lvm cases | 17:47 |
melwitt | no it's bailing before it ever gets to the log file part | 17:47 |
sean-k-mooney | so so the file on disk is not updated with the souce element when its running so its only updated in the trainient domain while its running | 17:48 |
dansmith | melwitt: right, so here right? https://github.com/openstack/nova/blame/master/nova/virt/libvirt/driver.py#L4483 | 17:48 |
melwitt | because it doesn't parse <source>. I don't think that's intended probably | 17:48 |
melwitt | no, like this https://paste.openstack.org/show/bpovXD89OL2QzukI3BWi/ | 17:48 |
melwitt | fails (returns) before that | 17:49 |
dansmith | melwitt: shouldn't that return an error through the api then? | 17:49 |
sean-k-mooney | oh interesting https://termbin.com/0qcj | 17:49 |
dansmith | tempest is getting blank logs not an error, AFAIK | 17:49 |
sean-k-mooney | ya the error iss """AssertionError: '' is not true : Console output was empty.""" | 17:51 |
dansmith | unless we're not actually raising that, but I thought we were in one case you pasted earleir | 17:51 |
sean-k-mooney | the test is doing self.assertTrue(output, "Console output was empty.") | 17:51 |
dansmith | wait what | 17:51 |
melwitt | yeah... looking to see if I can see how. for me I think it goes here and raises HTTPNotFound https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/console_output.py#L58 | 17:51 |
sean-k-mooney | no i think the test is not checking the return code | 17:51 |
melwitt | so it must be something the client is doing? | 17:52 |
dansmith | omh | 17:52 |
dansmith | sean-k-mooney: it's checking to see that the console is actually empty in shutoff state/ | 17:52 |
sean-k-mooney | no output will be true if its not the empty string | 17:53 |
melwitt | on the command line I get "ResourceNotFound: 404: Client Error for url: http://127.0.0.1/compute/v2.1/servers/68b8344b-10dc-4a84-90e4-ad8be7a2bb60/action, Guest does not have a console available." | 17:53 |
melwitt | osc | 17:53 |
dansmith | wtf | 17:53 |
dansmith | so I guess I totally misread the test | 17:54 |
sean-k-mooney | ya thats what i get too | 17:54 |
sean-k-mooney | 404 | 17:54 |
sean-k-mooney | but i think the 404 has an empy body | 17:54 |
dansmith | I thought the test was asserting that the console was *still* available | 17:54 |
melwitt | yeah and it's bc | 17:54 |
melwitt | except (exception.InstanceNotFound, | 17:54 |
melwitt | exception.ConsoleNotAvailable) as e: | 17:54 |
melwitt | raise webob.exc.HTTPNotFound(explanation=e.format_message()) | 17:54 |
melwitt | wait what | 17:54 |
sean-k-mooney | dansmith: the test is | 17:54 |
sean-k-mooney | https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L211C7-L213 | 17:55 |
dansmith | it's not, it's asserting that the result is "console is empty" | 17:55 |
melwitt | """Test getting console output for a server in SHUTOFF status | 17:55 |
melwitt | Should be able to GET the console output for a given server_id | 17:55 |
melwitt | in SHUTOFF status. | 17:55 |
dansmith | sean-k-mooney: L213 | 17:55 |
frickler | you can add --debug to the osc call to verify the body | 17:55 |
melwitt | uh wtf self.assertTrue(output, "Console output was empty.") | 17:55 |
sean-k-mooney | no self.assertTrue takes an object to test and a message to return if it fails | 17:55 |
dansmith | melwitt: yeah I wonder if that means "doesn't fail" | 17:55 |
melwitt | this is asserting the exact opposite of what the docstring says | 17:56 |
dansmith | sean-k-mooney: ahhh, right okay | 17:56 |
dansmith | not assertEqual | 17:56 |
melwitt | ohh ok right | 17:56 |
dansmith | confusing | 17:56 |
sean-k-mooney | right | 17:56 |
sean-k-mooney | they shoudl not be useing assertTrue relaly in this case | 17:56 |
dansmith | yeah for serious | 17:56 |
sean-k-mooney | as i said they are relying on the boolean convertion to true of a non empty string | 17:56 |
melwitt | yeah bc they're saying "I got an error string, it must be a console!" | 17:57 |
sean-k-mooney | right so the fact this is failing | 17:57 |
sean-k-mooney | is telliing me the 404 had an emptyp body | 17:57 |
sean-k-mooney | im going to get the request id and check the api log | 17:57 |
dansmith | I wish frickler wasn't here watching this circus of three seasoned nova cores trying to figure out how one of the most basic calls works :/ | 17:57 |
sean-k-mooney | it should have the content lenght | 17:57 |
melwitt | seasoned curlies | 17:58 |
* frickler hides | 17:58 | |
dansmith | frickler: I'm kidding of course, just .. not our best moment :) | 17:58 |
frickler | well I can't say I'm not enjoying it ;) | 17:59 |
dansmith | lol | 17:59 |
melwitt | this means not an empty string no? | 17:59 |
melwitt | DEBUG nova.api.openstack.wsgi [None req-2441eccd-ea63-4c9d-94a3-29aaf1ecf5c4 demo demo] Returning 404 to user: Guest does not have a console available. {{(pid=2011570) __call__ /opt/stack/nova/nova/api/openstack/wsgi.py:936}} | 17:59 |
melwitt | I guess I should do osc --debug | 18:00 |
sean-k-mooney | status: 404 len: 86 microversion: 2.91 time: 0.719216 | 18:00 |
melwitt | RESP BODY: {"itemNotFound": {"code": 404, "message": "Guest does not have a console available."}} | 18:00 |
sean-k-mooney | 86 bytes might just be the header | 18:00 |
gmann | status code is already checked in service client side so we can just remove this assert itself ? | 18:00 |
sean-k-mooney | oh actully no | 18:00 |
sean-k-mooney | RESP BODY: {"itemNotFound": {"code": 404, "message": "Guest does not have a console available."}} | 18:00 |
melwitt | so I think it really is going assertTrue("Guest does not have a console available.") | 18:01 |
sean-k-mooney | ya but why whoudl that be false | 18:01 |
dansmith | maybe it's ... AI... | 18:01 |
melwitt | maybe for lvm it's not returning a non empty string? I dunno why | 18:01 |
melwitt | AI yes | 18:01 |
sean-k-mooney | no that form lvm | 18:02 |
sean-k-mooney | i have it repoduced locally | 18:02 |
melwitt | oh... uh wtf | 18:02 |
sean-k-mooney | i ran osc with --debug | 18:02 |
sean-k-mooney | so it pring the actul repsocne body | 18:02 |
gmann | but how come it reach to this assert if it return 404 ? it should have failed here https://github.com/openstack/tempest/blob/master/tempest/lib/api_schema/response/compute/v2_1/servers.py#L456 | 18:03 |
sean-k-mooney | thats the fully output if your interested form osc https://termbin.com/fq3i | 18:03 |
dansmith | gmann: yeah I was looking to see if we were ignoring 404 or something but we're not | 18:04 |
melwitt | I don't get how this is passing though | 18:04 |
melwitt | lines = len(output.split('\n')) | 18:04 |
melwitt | self.assertEqual(lines, 3) | 18:04 |
gmann | yeah, status code in resposne schema fail if different status code than expected and it does not proceed further | 18:04 |
* dansmith notes we now have four nova cores, one tempest core, and one devstack core...and zero understanding | 18:05 | |
melwitt | this is some kind of black magic I think | 18:05 |
dansmith | obviously the solution is to remove the nova-lvm job | 18:05 |
melwitt | yeah. removing job and skipping tests makes things work :D | 18:06 |
gmann | :) | 18:06 |
melwitt | we definitely need more logging in that tempest test cause like... lol | 18:06 |
sean-k-mooney | we only validate if the respocne is scculs or redirection | 18:08 |
sean-k-mooney | https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/common/rest_client.py#L1016 | 18:08 |
sean-k-mooney | we got a 404 | 18:08 |
sean-k-mooney | so the cleint does not validate it | 18:08 |
sean-k-mooney | that why its not failing earlier | 18:08 |
melwitt | and there's no else: noice | 18:09 |
sean-k-mooney | yep we fall off the end of the function | 18:09 |
melwitt | VALID@ | 18:09 |
melwitt | VALID! | 18:09 |
gmann | sean-k-mooney: it should have failed here https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/common/rest_client.py#L852 | 18:12 |
sean-k-mooney | i assume _error_checker is a decortor or called else whewer | 18:13 |
sean-k-mooney | i didnt see that in the funcitons i was looking at | 18:13 |
sean-k-mooney | thats not called form action https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/services/compute/servers_client.py#L218 | 18:14 |
sean-k-mooney | or get_console_output https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/services/compute/servers_client.py#L648-L656 | 18:14 |
sean-k-mooney | which is what we are calling https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L211 | 18:14 |
gmann | it is called for every API call from tempest https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/common/rest_client.py#L742 | 18:16 |
sean-k-mooney | so its called form this post https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/services/compute/servers_client.py#L225C27-L225C31 | 18:16 |
frickler | only success is validated https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/common/rest_client.py#L1016 | 18:16 |
sean-k-mooney | ya but gmann is correct | 18:17 |
sean-k-mooney | on https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/common/rest_client.py#L742 we shoudl go to _error_checker | 18:18 |
gmann | that is for schema only tempest validate status code and it is <400 then only it proceed to test otherwise fail in API call itself | 18:18 |
gmann | that->validate_response | 18:18 |
sean-k-mooney | ok so the post will raise | 18:19 |
sean-k-mooney | raise exceptions.NotFound(resp_body, resp=resp) | 18:20 |
gmann | i cannot see where getConsoleOutput return 400, it was always 200 https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#40163 | 18:20 |
frickler | action() doesn't use request, but post() https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/lib/services/compute/servers_client.py#L218-L233 | 18:20 |
gmann | and output is empty {"output": "" - https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#40167 | 18:20 |
sean-k-mooney | gmann: no melwitt found where we convert it | 18:20 |
sean-k-mooney | in the api log and in the debug of osc its a 404 | 18:21 |
gmann | frickler: every API call go through the request only either post, delete etc | 18:21 |
gmann | sean-k-mooney: I am saying in tempest it get 200 and empty output right? | 18:22 |
sean-k-mooney | https://github.com/openstack/nova/blob/e5ee5e035c0c85654a139131d51be5c4df76f400/nova/api/openstack/compute/console_output.py#L56-L58 | 18:22 |
sean-k-mooney | gmann: i dont see how it could be getting a 200 in tempest | 18:23 |
gmann | so _error_checker pass and self.assertTrue(output, "Console output was empty.") is valid assert right | 18:23 |
sean-k-mooney | i guess if we have the request id i can check in the job logs | 18:23 |
sean-k-mooney | gmann: that is the assert that is failing i belive yes | 18:24 |
sean-k-mooney | yes see the bug https://bugs.launchpad.net/nova/+bug/2028851 | 18:24 |
gmann | yeah I am checking those only | 18:24 |
sean-k-mooney | oh... | 18:25 |
gmann | because this is alst call to get console from tempest which return 200 and empty output https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#40163 | 18:26 |
sean-k-mooney | the isntance is still active | 18:26 |
sean-k-mooney | its ment to be stopped | 18:26 |
gmann | even every call on wait is same 200 and empty output https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#39248-40163 | 18:27 |
gmann | and while waiting for non empty console output, tempest timeout and raise exception from that assert https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/base.py#L340 | 18:29 |
sean-k-mooney | it should not have got there | 18:29 |
sean-k-mooney | https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L727C7-L727C79 | 18:29 |
sean-k-mooney | that was ment to be wating for the instance ot be shutoff | 18:29 |
sean-k-mooney | so i see 2023-07-27 14:03:23,945 91441 INFO [tempest.common.waiters] State transition "ACTIVE/powering-off" ==> "SHUTOFF/None" after 1 second wait | 18:31 |
gmann | sean-k-mooney: it does and server is in shutoff state only https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#39246 | 18:31 |
gmann | after that only it check for the console output | 18:32 |
frickler | so maybe this is a race? I wonder if you sleep a bit after the wait-for-shutoff whether it would always fail with 404 | 18:34 |
sean-k-mooney | so the server uuid is e526f14e-5878-4fab-ba8b-4f6c2456bd2 so we shoud lbe able to confirm the status code in the api logs | 18:34 |
sean-k-mooney | frickler: i have the vm stopped on my local devstack and it fails with 404 yes | 18:35 |
sean-k-mooney | in ci it looks like its getting 200's for about 3 minutes | 18:36 |
sean-k-mooney | but i get a 404 basically imiedaly after i power it off | 18:36 |
frickler | https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/screen-n-api.txt#34450 | 18:38 |
frickler | can you run console log show in a loop while you power off? also are you on a lvm setup? | 18:38 |
sean-k-mooney | i have it in a watch now | 18:41 |
sean-k-mooney | but its consitently 404 | 18:41 |
sean-k-mooney | POST /compute/v2.1/servers/e526f14e-5878-4fab-ba8b-4f6c2456bd2a/action => generated 14 bytes in 66 msecs (HTTP/1.1 200) 9 headers in 356 bytes (1 switches on core 0) | 18:45 |
sean-k-mooney | in ci its getting a 200 | 18:45 |
frickler | one other thing I don't understand is why this is wrapped in a wait_for() https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L728 | 18:45 |
frickler | _get_output() either asserts or returns nothing, right? | 18:46 |
sean-k-mooney | your right | 18:47 |
sean-k-mooney | it never returns anything | 18:47 |
frickler | oh, wait. _get_server acts on self.server_id not on temp_server_id | 18:47 |
gmann | frickler: sean-k-mooney dansmith melwitt: ahh I got the issue here. we are using different server in that tests not the one test create :( https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L212 | 18:47 |
gmann | and test use differnt server https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L723-L728 | 18:47 |
frickler | gmann: ^5 | 18:47 |
melwitt | wuuuut | 18:48 |
gmann | frickler: yeah | 18:48 |
gmann | diffeent server id in both place in log https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#39246-39248 | 18:48 |
gmann | fixing it | 18:48 |
gmann | just a 4 letter mistake and took so much time :) | 18:48 |
melwitt | wow. fr | 18:48 |
sean-k-mooney | oh we are not passign a server id to _get_output | 18:48 |
sean-k-mooney | i checked the ides that were in use | 18:49 |
sean-k-mooney | but i missed we did not pass it to _get_output | 18:49 |
gmann | both are different https://zuul.opendev.org/t/openstack/build/9890c9b53059410eabc0c61df0376217/log/controller/logs/tempest_log.txt#39246-39248 | 18:49 |
sean-k-mooney | gmann: but that does not explain why its working for ceph | 18:50 |
sean-k-mooney | oh unless the first vm | 18:50 |
sean-k-mooney | is not active | 18:50 |
sean-k-mooney | by the time we get to this? | 18:50 |
gmann | sean-k-mooney: might be that different server used in get_console is created in setup so it has the console output | 18:50 |
sean-k-mooney | or that ya | 18:50 |
sean-k-mooney | its runing longer | 18:50 |
gmann | yeah | 18:51 |
sean-k-mooney | well its more you only get the 404 if its stopped | 18:51 |
sean-k-mooney | gmann: are you workign on a pathch to optionally pass the server id to _get_output | 18:52 |
gmann | sean-k-mooney: yeah | 18:52 |
gmann | running the test | 18:53 |
sean-k-mooney | the test sill still not be really correct | 18:53 |
frickler | the test should start failing if this is fixed, because it should get 404s all the time | 18:54 |
frickler | so the question is whether nova is broken or the test | 18:55 |
sean-k-mooney | yep | 18:55 |
sean-k-mooney | both kind of | 18:55 |
sean-k-mooney | nova could be maded to work when the vm is stopped in this case | 18:55 |
sean-k-mooney | but the test is not correct today either | 18:55 |
gmann | why ? | 18:55 |
gmann | nova should return 404 or something if no actual console is get or maybe correct if empty console for shutoff VM is ok | 18:56 |
gmann | but I do not see why test is incorrect | 18:56 |
gmann | or should we expect empty console for shutoff VM as valid case? | 18:56 |
sean-k-mooney | because its going to be asserting that "ResourceNotFound: 404: Client Error for url: http://192.168.122.17/compute/v2.1/servers/ea890cd1-c3fe-4455-bcd9-b44ea830a136/action, Guest does not have a console available." is true | 18:56 |
sean-k-mooney | the test is ment to be asserting that you can retirve the concole log while the vm is stopped | 18:57 |
sean-k-mooney | that works on my home cloud which is using cinder BFV | 18:58 |
sean-k-mooney | so there is a bug in nova as this should work | 18:58 |
sean-k-mooney | if the test catches that then the test is ok but it shoudl be failing when using the correct server id | 18:59 |
sean-k-mooney | in the lvm job | 19:00 |
sean-k-mooney | so we shoudl retirage https://bugs.launchpad.net/nova/+bug/2028851 | 19:00 |
sean-k-mooney | its a valid bug for both nova and tempest | 19:00 |
sean-k-mooney | dansmith: melwitt: do you agree ^ | 19:00 |
sean-k-mooney | on the nova side we either need to say the console shoudl always be aviable for all dirver when the server is off | 19:01 |
sean-k-mooney | or that test is invalid for nova-lvm | 19:01 |
melwitt | I'm a little confused but if the tempest test is passing with the correct server id with a 404 being returned from nova then the tempest test has a bug as well | 19:02 |
sean-k-mooney | no i think that will fail | 19:02 |
sean-k-mooney | the only tempest bug is the one gmann found where its not using the correct id | 19:02 |
melwitt | oh gotcha. yeah. I agree there is a bug on nova too | 19:02 |
gmann | https://review.opendev.org/c/openstack/tempest/+/889895 | 19:04 |
dansmith | 404 should mean "that path you gave me does not exist | 19:04 |
gmann | melwitt: sean-k-mooney yeah i will fail, I tested locally and got the 404 from nova, tempest failed | 19:04 |
dansmith | 404 for a POST doing an action that doesn't return anything but for a guest that does exist is extremely wrong, IMHO | 19:05 |
dansmith | especially when the same POST url just worked for shutdown and will work a minute later for start up | 19:05 |
gmann | dansmith: yes it says 404 Guest does not have a console available | 19:05 |
dansmith | right, that is incorrect HTTP behavior, IMHO | 19:05 |
sean-k-mooney | dansmith: its kind of abusing 404 right now | 19:05 |
dansmith | yeah | 19:06 |
sean-k-mooney | but i can see why someone might have chosen it | 19:06 |
sean-k-mooney | i.e. the log was nto found | 19:06 |
gmann | maybe 400 we should return | 19:06 |
sean-k-mooney | no its a 500 or simiall | 19:06 |
sean-k-mooney | this is a server side issue | 19:06 |
dansmith | how is it a 500? | 19:06 |
dansmith | IMHO, it should be 200 | 19:06 |
sean-k-mooney | well it shoudl be a 200 with the content | 19:07 |
gmann | so until guest is shutoff, console not available is ok | 19:07 |
sean-k-mooney | which is what i get with a bfv vm with cinder lvm | 19:07 |
sean-k-mooney | so with nova and lvm today we can only get the console while the vm is running | 19:08 |
dansmith | sean-k-mooney: did I miss why lvm differs here? | 19:08 |
sean-k-mooney | with bfv we can get the console when the vm is stopped | 19:08 |
dansmith | because I would say that the console should always return 200, even if the console is empty, unless we know that it can't have a console or something.. and even still, 404 is not the right thing to return (nor 500) | 19:08 |
sean-k-mooney | dansmith: no im still not sure why that is the case although i cna try and compare to my home cloud | 19:08 |
dansmith | we have to figure that out, because I think that's wrong (as heck) | 19:09 |
sean-k-mooney | dansmith: ya i would be ok with always returnning 200 | 19:09 |
dansmith | but even still, 200 even if it's empty because the operation worked, this is what we have for console log, which may be nothing | 19:09 |
sean-k-mooney | i said 500 just ebcasue there is a bug in nova | 19:09 |
dansmith | if you have your kernel set to log to something other than tty0 it'll be blank too | 19:09 |
sean-k-mooney | but this is part of the interop tests right | 19:09 |
sean-k-mooney | so this cant have backend depent behavior | 19:09 |
gmann | yeah empty is still a response we got the console | 19:10 |
gmann | but console not available? | 19:10 |
gmann | 'console not available' should be error case right | 19:11 |
sean-k-mooney | it depends | 19:11 |
dansmith | it should never be 404, that's for sure | 19:12 |
gmann | because this fix will get the 404 on console not avaliable for the shutoff Guest https://review.opendev.org/c/openstack/tempest/+/889895 | 19:12 |
dansmith | because /action exists | 19:12 |
dansmith | it can be an error maybe if we've never started the guest yet, or *maybe* if it was lost from disk somehow | 19:12 |
dansmith | but that's different than blank | 19:12 |
sean-k-mooney | dansmith: i agree on the 404. the depends is more i belive its possibel to have a vm without any console | 19:12 |
gmann | I think 400 is more appropriate here | 19:12 |
dansmith | maybe 410 Gone | 19:13 |
dansmith | Honestly, 4xx means the client messed up and the client hasn't messed up here | 19:13 |
dansmith | there's also 425 Too Early | 19:13 |
dansmith | and 428 Precondition Required | 19:13 |
gmann | but when server is not yet active and console not availble 410 is not right. for shutoff case where console was available but now it is not | 19:13 |
dansmith | yeah, I really don't understand what use ever returning 4xx for this is | 19:14 |
gmann | dansmith: but if request is before server is in active or in shutoff state it is client error right? | 19:14 |
dansmith | maybe on a windows guest where there is no such console? | 19:14 |
sean-k-mooney | so its not a clinet error so 4XX is incorrect | 19:14 |
dansmith | gmann: I dunno, it's debatable I think | 19:14 |
dansmith | maybe if you call it before it's active, | 19:14 |
sean-k-mooney | the user should be able to expect the console to be ableiable i think if a console exits for the instnace | 19:14 |
dansmith | but since you shoudl be able to call it in SHUTOFF, that's kinda confusing to the user that doesn't care much | 19:14 |
gmann | but does nova says console can be available for before guest become active or in shutoff state? | 19:15 |
sean-k-mooney | yes the console can be avaiable in shutoff | 19:15 |
gmann | dansmith: humm that is good point | 19:15 |
sean-k-mooney | gmann: we have a bug in the livbirt drvier where it does not find it properly | 19:15 |
dansmith | gmann: so maybe if the instance isn't scheduled to a host yet, then 400 makes sense | 19:15 |
dansmith | but after it's on a host, I dunno, seems like 200 empty is the right move always | 19:16 |
gmann | yeah, it is difficult to make differentiation on those cases | 19:16 |
dansmith | so instance.host==None, 4xxsomething, but otherwise, I'd say 200 empty | 19:16 |
gmann | +1 | 19:16 |
dansmith | I don't think there's any useful distinction between "nothing has ever been written to the console" and "the console log is gone" | 19:17 |
sean-k-mooney | hehe ok fun | 19:17 |
sean-k-mooney | dansmith: so my boot form volume guest is not using a pty | 19:17 |
dansmith | like I said, *maybe* for a windows guest with no actual console, I dunno how that works really | 19:17 |
dansmith | even macos on uefi has a text console.. it's set to silent by default, but it's there | 19:17 |
gmann | I think I got your point of action is passed so it should be 200 irrespective of what output it give | 19:17 |
dansmith | gmann: yeah | 19:17 |
sean-k-mooney | dansmith: https://paste.opendev.org/show/bvGBKyn5fHQthz8TO13C/ | 19:18 |
dansmith | sean-k-mooney: interesting | 19:18 |
sean-k-mooney | dansmith: so its using tcp for the console | 19:18 |
sean-k-mooney | and the source is still there when its stoped | 19:18 |
dansmith | yeah, I dunno why that is set that way, but I can't imagine it has much to do with being lvm | 19:19 |
sean-k-mooney | that becasue its kolla-ansible i think | 19:19 |
dansmith | yeah *that* I'd believe :) | 19:19 |
sean-k-mooney | again this is related to how libvirt is configured | 19:19 |
sean-k-mooney | since its generating that so its connecting to virtlogd over tcp in this case | 19:20 |
sean-k-mooney | well to qemu i guess not virtlogd | 19:20 |
sean-k-mooney | this deployment is using molitic libvirt | 19:21 |
sean-k-mooney | dansmith: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4464 | 19:22 |
sean-k-mooney | so if we are usign tcp we get the path form log | 19:22 |
sean-k-mooney | not source | 19:22 |
sean-k-mooney | so thats why that is working | 19:22 |
dansmith | okay but virtlogd should hide this all from us and let us use the log file yeah? | 19:22 |
sean-k-mooney | yes for qemu/kvm | 19:23 |
sean-k-mooney | i would need to double check for lxc | 19:23 |
dansmith | ack | 19:23 |
sean-k-mooney | https://libvirt.org/formatdomain.html#consoles-serial-parallel-channel-devices is the relevent docs | 19:34 |
sean-k-mooney | Regardless of the type, character devices can have an optional log file associated with them. This is expressed via a log sub-element, with a file attribute. There can also be an append attribute which takes the same values described above. Since 1.3.3 . | 19:34 |
dansmith | that implies virtlogd works for everything yeah? | 19:34 |
dansmith | (as I would expect) | 19:35 |
sean-k-mooney | implies it yes btu does not actully say it | 19:36 |
sean-k-mooney | i also cant find anything that would imply how you tell libvirt to chooese the type | 19:37 |
sean-k-mooney | im going to check my kolla cloud incase it ahs something set in the libvirt configs | 19:37 |
sean-k-mooney | i suspect pty might be the default or the default in newer libvirts | 19:38 |
sean-k-mooney | ah found it | 19:40 |
sean-k-mooney | # The backend to use for handling stdout/stderr output from | 19:40 |
sean-k-mooney | # QEMU processes. | 19:40 |
sean-k-mooney | # | 19:40 |
sean-k-mooney | # 'file': QEMU writes directly to a plain file. This is the | 19:40 |
sean-k-mooney | # historical default, but allows QEMU to inflict a | 19:40 |
sean-k-mooney | # denial of service attack on the host by exhausting | 19:40 |
sean-k-mooney | # filesystem space | 19:40 |
sean-k-mooney | # | 19:40 |
sean-k-mooney | # 'logd': QEMU writes to a pipe provided by virtlogd daemon. | 19:40 |
sean-k-mooney | # This is the current default, providing protection | 19:40 |
sean-k-mooney | # against denial of service by performing log file | 19:40 |
sean-k-mooney | # rollover when a size limit is hit. | 19:40 |
sean-k-mooney | # | 19:40 |
sean-k-mooney | #stdio_handler = "logd" | 19:40 |
sean-k-mooney | my kolla cloud is using file | 19:41 |
dansmith | eesh man, pastebin c'mon | 19:41 |
dansmith | that's libvirt config? | 19:41 |
sean-k-mooney | :) sure ya that /etc/libvirt/qemu.conf | 19:41 |
dansmith | oh, qemu.conf, interesting | 19:42 |
dansmith | well, I dunno what to say | 19:42 |
dansmith | we don't support lxc | 19:42 |
dansmith | I don't want to break it for no reason, but... unless someone is going to try out a change... | 19:42 |
sean-k-mooney | ill try chanign my lvm devstack to file and see if it changes to tcp | 19:42 |
sean-k-mooney | no still using pty | 19:46 |
sean-k-mooney | ok im going ot leave it there today | 19:46 |
gmann | this test was wrong since starting and not sure how https://review.opendev.org/c/openstack/tempest/+/889109 unhide it on lvm and made it always failing | 19:47 |
gmann | 889109 does not change anything from this test perspective | 19:47 |
frickler | oh, interesting, iirc there is still a kolla bug about dos via console log filling the disk. I need to look into that | 19:47 |
sean-k-mooney | my cloud is not fully up to date | 19:48 |
frickler | gmann: interaction with other tests in the class? | 19:48 |
dansmith | frickler: I doubt it | 19:48 |
sean-k-mooney | but on the kolla sid i jsu tnotice that libvirtd.conf is configured to connect to libvirt with sasl over tcp | 19:48 |
dansmith | it may be because this change does increase the load with more parallelism, | 19:48 |
gmann | frickler: humm, no, it recreate the server in setup and keep it SSHABLE | 19:48 |
dansmith | however this test class was usually running early on enough that all the workers were slammed with work | 19:48 |
dansmith | I looked at a *lot* of stackviz reports before doing that :) | 19:49 |
sean-k-mooney | https://paste.opendev.org/show/bZMoOYeTqbcHyr79ZDwS/ | 19:49 |
gmann | dansmith: i doubt, server was active always in any parallel case in setup so it should have been working in same way it was working before | 19:50 |
dansmith | gmann: I agree, I'm just saying load can be higher with the split, but that was the point | 19:50 |
dansmith | I just rechecked sean-k-mooney's patch to get another data point.. i.e. to see if it's still only lvm and only that test | 19:51 |
gmann | anyways as 889895 corrent the tests and make it fail always. I am going to skip this test for now until we figure out the nova behavior in 'guest shutoff' case | 19:51 |
gmann | *coprrect | 19:51 |
dansmith | ack | 19:52 |
*** elodilles is now known as elodilles_pto | 20:35 | |
frickler | maybe this is related, since it has been moved to a different class https://opendev.org/openstack/tempest/src/branch/master/tempest/api/compute/servers/test_server_actions.py#L332-L337 | 21:10 |
frickler | also expected failures on the tempest patch going back to yoga, so not a recent regression at least | 21:10 |
gmann | but that was same before moving to different class also. this test s no issue, test_get_console_output_server_id_in_shutoff_status started failing after refactoring and not sure why | 21:22 |
gmann | or maybe you are right. the execution order of test in same class might be making the server console available before test_get_console_output_server_id_in_shutoff_status execute | 21:24 |
dansmith | how? | 21:29 |
dansmith | each class gets its own server to work on, and the tests within a class are single-threaded. if it matters now, it mattered before, so it must have already been broken | 21:30 |
gmann | one chance I see if that self.server_id server get through many operation before test_get_console_output_server_id_in_shutoff_status was executed before so console of that self.server_id was somehow available but now as it is in separate class, self.server_id is kind of fresh server and fail ? | 21:37 |
gmann | idk but that is something might be happening, i cannot find any other reasoning on this | 21:37 |
gmann | correcting and skipping this test for now https://review.opendev.org/c/openstack/tempest/+/889895 | 21:50 |
dansmith | yeah I don't think so.. at one point I was wondering if we had to call get_console once to process the pty data and start the log file, but with virtlogd that should definitely not be the case | 22:15 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!