*** ykarel is now known as ykarel|lunch | 09:47 | |
*** sshnaidm is now known as sshnaidm|afk | 10:09 | |
*** rlandy is now known as rlandy|ruck | 10:33 | |
*** ykarel|lunch is now known as ykarel | 11:23 | |
*** dviroel|out is now known as dviroel|rover | 11:35 | |
*** sshnaidm|afk is now known as sshnaidm | 11:39 | |
stephenfin | We're seeing regular failures on the nodepool-build-image-siblings job in openstacksdk. Looks like opensuse have done something funky with their mirrors and there's a checksum failure occurring as a result? https://zuul.opendev.org/t/openstack/build/a21d7820d59c4416a5c6171491927192/log/job-output.txt#24090 | 11:40 |
---|---|---|
stephenfin | Ah, never mind, looks like this was fixed with cce7dbc669cea619766b4bbe7e310f75b9461d2d | 11:46 |
stephenfin | sorry, Ica62392ebf4a665a04cd65458dda9e0a7545ccc8 | 11:46 |
fungi | stephenfin: so it's solved? just making sure i don't need to dig into that | 13:34 |
*** ykarel is now known as ykarel|away | 13:41 | |
clarkb | ya that change should fix it. We don't know what was up with that OBS repo but realized we didn't need to use it anymore now that we were updated to bullseye | 15:51 |
*** Guest3739 is now known as diablo_rojo_phone | 16:35 | |
*** diablo_rojo_phone is now known as Guest4625 | 16:36 | |
*** Guest4625 is now known as diablo_rojo_phone | 16:37 | |
afaranha | Hi timburke, fungi for the patch on swift that you guys are reviewing https://review.opendev.org/c/openstack/swift/+/796057 , I was trying to identify where tests were failing, I was able to track to this line https://opendev.org/openstack/swift/src/branch/master/swift/obj/diskfile.py#L265 on the "xattr.setxattr". Do you folks know what could be the issue with the set xattr? | 17:46 |
clarkb | that setxattr is what runs out of space? | 17:52 |
fungi | afaranha: without digging into write_metadata() calls, it's hard to guess what the side of those pickled structures would be | 17:52 |
fungi | clarkb: yeah, at least from the job log it's evident that xattr calls are involved | 17:52 |
fungi | s/side/size/ | 17:53 |
afaranha | fungi, one thing is that if you check the test https://opendev.org/openstack/swift/src/branch/master/test/functional/s3api/test_object.py#L361 when I try the header with only "!" it fails | 17:53 |
clarkb | ya ok so possible that you're hitting a size limit on the xattrs and not actual fs storage (which would explain why the fs percentage used is minimal) | 17:53 |
afaranha | I think some characters are making this go south | 17:53 |
fungi | what's curious is that it only seems to be happening when the server is rebooted into fips compliant mode | 17:54 |
afaranha | I tried creating a file and testing on it what the write_metadata does, as fd I put a file name I created with "touch" and the metadata I copied from the tests, I put some prints and copied it | 17:54 |
afaranha | I can try making some changes to this file "test", but I don't know what I could do | 17:55 |
afaranha | "OSError: [Errno 28] No space left on device: b'test'" | 17:55 |
afaranha | clarkb, for my test I created with plenty of space, but the issue was the file itself | 17:55 |
clarkb | right xattrs for files are separate from the normal disk space storage | 17:56 |
afaranha | what restrictions does xattr have? | 17:56 |
clarkb | I want to say the reason swift uses xfs is related to the size of them in the first place (other fses don't allow as much data to be set) | 17:56 |
clarkb | https://man7.org/linux/man-pages/man7/xattr.7.html | 17:57 |
clarkb | "In the JFS, XFS, and Reiserfs filesystem implementations, the limit on bytes used in an EA value is the ceiling imposed by the VFS." | 17:58 |
clarkb | attribute names are 255 bytes, attribute values limited to 64kb | 17:58 |
clarkb | Is it possible that fips is changing the VFS limitations? | 17:59 |
fungi | or possible fips-compliant algorithms are changing the length of the metadata itself? | 17:59 |
clarkb | that also seems possible | 18:00 |
clarkb | you can probably try to write 64kb -1 bytes, then 64kb bytes, then 64kb +1 bytes and see if it works | 18:00 |
fungi | like is something somewhere switching from an md5 hash to sha2-256? | 18:00 |
clarkb | then binary search for a limit if none of those work | 18:00 |
clarkb | then if 64kb is the actual limit you need to determine why you are writing more than 64kb which could be a different hash output as fungi suggests | 18:01 |
fungi | but yeah, it could also be that the xfs driver changes the internal representation of the xattr blobs, effectively reducing how long they can be | 18:02 |
timburke | fwiw, that xattr_size kwarg is never used (outside of unit tests, presumably), so every setxattr() call during func tests will be sending at most 64k | 18:04 |
timburke | the overall length of metastr is most likely in the 100s of kb; for vanilla swift it's usually well below the 64k chunking that's going on, but having encryption enabled will significantly increase the metadata sizes | 18:04 |
timburke | i did get around to bringing up a fips-enabled VM locally and verifying that 64k xattrs were still allowed, but i don't think i'd put *that* many of them on any single file | 18:05 |
fungi | so should the test(s) triggering that condition be adjusted to write less data into it? | 18:06 |
timburke | no -- the test is within expectations for production data, so if something about the combination of xfs and fips mode means we can't write it all, it's an indication that swift cannot be run under fips mode in production yet | 18:08 |
fungi | got it | 18:09 |
timburke | swift definitely prefers xfs in part due to the larger allowed xattr sizes -- iirc (some versions of?) ext3/4 would have limits down closer to 4k, and performance would suffer because of the relatively large xattrs we want to write | 18:10 |
clarkb | "The list of attribute names that can be returned is also limited to 64 kB (see BUGS in listxattr(2))." I wouldn't expect this to be the issue if you are getting errors on writes. But if it was a problem with incomplete data coming back that might be related | 18:10 |
clarkb | timburke: ext* limits you to one block | 18:10 |
clarkb | so depends on the fs block size | 18:10 |
afaranha | clarkb, fungi I'm not sure it's something limited to the bytes on the metadata, I tried just setting "!" (from the test) on the metadata, and it doesn't work | 18:11 |
clarkb | and that limit applies to all the metadata for a file. So ya fairly limiting | 18:11 |
clarkb | afaranha: right but is that the actual data written? | 18:11 |
fungi | yeah, the error seen in the job log is basically enospc, it's indistinguishable from a full filesystem except that it crops up when calling xattr | 18:11 |
clarkb | afaranha: fungi is suggesting that encryption or similar could be tripping it | 18:11 |
afaranha | I can check the metadata value before the calling of setxattr, do you suspect this is being changed on this method call or before? | 18:13 |
fungi | i don't have any precise suspicions, merely vague guesses as to what it could possibly be | 18:14 |
afaranha | wait a minute | 18:14 |
fungi | most of what i know about xfs's extended attributes came from the discussion here in the past few minutes ;) | 18:14 |
fungi | same for swift's use of them | 18:15 |
afaranha | I just ran it 2 times, first time it didn't work, last time worked fine | 18:15 |
afaranha | let me try again | 18:15 |
afaranha | for reference, I'm running using this command: tox -e func-encryption-py3 test.functional.s3api.test_object.TestS3ApiObject.test_put_object_weird_metadata | 18:15 |
stephenfin | fungi: yup, all good (the failures on the nodepool-build-image-siblings job in openstacksdk) | 18:15 |
afaranha | and test was modified to this: https://paste.openstack.org/show/810316/ | 18:16 |
afaranha | I ran it again, and it got stuck as before | 18:16 |
afaranha | METADATA: {'X-Timestamp': '1635790594.21114', 'Content-Type': 'binary/octet-stream', 'Content-Length': '10', 'ETag': '7d721f6bd24977788449b41a0b7ac912', 'X-Object-Sysmeta-S3Api-Acl': '{"Owner":"test:tester","Grant":[{"Permission":"FULL_CONTROL","Grantee":"test:tester"}]}', 'X-Object-Transient-Sysmeta-Crypto-Meta-!': 'aQ==; swift_meta=%7B%22cipher%22%3A+%22AES_CTR_256%22%2C+%22iv%22%3A+%22w6REfMVgZRLLWFdVG86U2w%3D%3D%22%7D', 'X-Object-Transient- | 18:16 |
afaranha | Sysmeta-Crypto-Meta': '%7B%22cipher%22%3A+%22AES_CTR_256%22%2C+%22key_id%22%3A+%7B%22path%22%3A+%22%2FAUTH_test%2Fbucket%2Fobject%22%2C+%22v%22%3A+%222%22%7D%7D'} | 18:16 |
clarkb | I think the next debugging step is find out what exactly is being written to the metadata and do it out of band by hand | 18:20 |
clarkb | and try to reproduce it | 18:20 |
afaranha | after running the: metastr = pickle.dumps(_encode_metadata(metadata), PICKLE_PROTOCOL); metastr value is https://paste.openstack.org/show/810318/ | 18:20 |
afaranha | ack | 18:20 |
timburke | pretty short, <2k | 18:21 |
timburke | was that value one that failed, or succeeded? | 18:21 |
clarkb | also I'm not sure anything here is special to opendev or infra. This is likely to be whatever platform you are on (centos 7/8?) + fips + swift related | 18:22 |
afaranha | it failed | 18:22 |
afaranha | centos8 + fips; But I tried creating and running a Centos8 VM without fips, and the tests passed, then I enabled fips and the tests passed again | 18:23 |
afaranha | we were only able to reproduce the issue so far, on the CI server that fungi reserved for this investigation | 18:23 |
clarkb | afaranha: right we know the tests pass generally without fips enabled. When you did it locally did you use a file based block device for the filesystem? | 18:24 |
clarkb | I suppose it is possibly related to that implementation detail somehow as well | 18:24 |
afaranha | locally I just run the tests using tox, I don't know yet how it does it | 18:24 |
fungi | i think the relevant tests get skipped if that file isn't created and mounted? | 18:25 |
clarkb | I think this may actually be it | 18:27 |
clarkb | tools/test-setup.sh runs to create the xfs filesystem | 18:27 |
afaranha | is there a way to force it to be run? | 18:27 |
afaranha | or can I just run tools/test-setup.sh and then tox? | 18:27 |
fungi | by... running it | 18:27 |
clarkb | the CI jobs then set TMPDIR: /home/zuul/xfstmp. But I'm not sure that tox passenvs' TMPDIR by default | 18:27 |
fungi | our ci jobs install any packages bindep indicates should be installed, run any tools/test-setup.sh script which is present in the repo, and then call tox with the specified env | 18:28 |
clarkb | it's possible that it is failing because we're using ext4 and that is limited to 2048 | 18:28 |
clarkb | (or whatever our block size is) | 18:28 |
clarkb | https://tox.wiki/en/latest/config.html#conf-passenv I think that may be it | 18:29 |
fungi | afaranha: note that tools/test-setup.sh needs to be run with root permissions, but then tox should be run as your testing account rather than root | 18:29 |
clarkb | and fips is pushing over the limit | 18:29 |
clarkb | it works locally because you run it on a centos8 with an xfs filesystem and the special file doesn't need to exist | 18:29 |
clarkb | when run on our CI system on ext4 you need to create the filesystem and use it but I think the tests may not be using it | 18:29 |
afaranha | ack | 18:29 |
fungi | oh, right, if centos uses xfs for its rootfs then those tests won't be skipped | 18:29 |
clarkb | timburke: ^ fyi this may be a more general problem | 18:30 |
clarkb | fungi: centos default is xfs but we ext4 everything by default with dib (for sanity) | 18:30 |
fungi | right, that's what i meant | 18:30 |
fungi | as to why the tests weren't being skipped on the manually set up test server, it likely used xfs for its rootfs | 18:31 |
clarkb | also TMPDIR is used more broadly isn't it? I think you should maybe use a different variable to set that path? | 18:31 |
clarkb | fungi: well it seems the tests aren't being skipped on our CI machines either | 18:31 |
clarkb | (otherwise why do they fail?) | 18:31 |
fungi | clarkb: so why would passenv be posing a problem only for the fips runs and not the rest of the time? | 18:31 |
clarkb | fungi: because fips is using bigger encryption data | 18:32 |
fungi | i didn't mean to imply the tests were being skipped in ci | 18:32 |
clarkb | fungi: I'm saying that i don't think the xfs block file setup in upstream CI is being used | 18:32 |
clarkb | fungi: because we are not properly passing the env var through. Unless swift just goes for it | 18:32 |
clarkb | that implies we are not skipping any tests based on the underlying fs | 18:33 |
timburke | fwiw, the encryption job should be using the same algos whether fips is enabled or not -- it shouldn't impact the metadata size | 18:33 |
fungi | and somehow in non-fips mode that ext4 fs has enough room to pass the tests anyway? | 18:33 |
clarkb | fungi: right | 18:33 |
fungi | could ext4's behavior be possibly changing under fips mode? | 18:33 |
clarkb | another possibility: swift is using TMPDIR to set this implying it relies on whatever is in /tmp (by default) or the override | 18:33 |
clarkb | maybe when you boot in fips mode /tmp is a different fs type | 18:34 |
clarkb | from ext4 -> tmpfs or vice versa type of thing | 18:34 |
afaranha | it just got complicated to me now D= | 18:34 |
clarkb | let me get some links to explain my suspicion | 18:34 |
timburke | fwiw, from https://tox.wiki/en/latest/config.html#conf-setenv -- "Some variables are always passed through to ensure the basic functionality of standard library functions or tooling like pip ... Others (e.g. UNIX, macOS): TMPDIR" | 18:35 |
clarkb | https://opendev.org/openstack/swift/src/branch/master/tools/test-setup.sh#L13 this is where test-setup.sh mounts the xfs filesystem at $HOME/xfstmp | 18:35 |
clarkb | timburke: ah ok. tox does the right thing then? | 18:36 |
timburke | should | 18:36 |
clarkb | next theory: is the reboot for fips happening before or after we mount that filesystem? | 18:36 |
clarkb | it isn't being put in fstab so if we reboot after the mount then you'll lose the mount and be on ext4 | 18:36 |
clarkb | is there a link to the failing job log? | 18:37 |
afaranha | clarkb, https://zuul.opendev.org/t/openstack/build/8fefe2da3d754c9484f2cdd2090eb484/logs | 18:38 |
clarkb | yup we run test-setup.sh first then reboot so the mount is lost | 18:39 |
fungi | aha, so need to reorder the roles being included there | 18:39 |
fungi | good find! | 18:39 |
clarkb | https://zuul.opendev.org/t/openstack/build/8fefe2da3d754c9484f2cdd2090eb484/console shows this. unittests/pre.yaml runs test-setup.sh and enable-fips.yaml happens later | 18:39 |
clarkb | fungi: or we should add that fs to fstab (that might be a bit heavy handed if it runs on say your laptop) | 18:40 |
fungi | this would probably have sorted itself out soon anyway, since at/after the ptg we talked through making the fips setup role run much earlier, because of similar issues with stateless multinode setup | 18:40 |
clarkb | and I don't think the tests are skipped on ext4. They are run and fail. Would be curious to know if they fail without fips too (they probably do) | 18:41 |
fungi | like if the setup fips role was just replaced by a simple reboot | 18:41 |
clarkb | afaranha: on your test node you can mount the xfs filesystem and set TMPDIR to that path and run tests to make sure it works | 18:41 |
afaranha | I'm trying to follow, but let me ask something, if it reboots after mounting the xfs block, that means the test will be skipped? | 18:42 |
afaranha | so we shouldn't see any issue with fips? | 18:42 |
fungi | apparently no, the test gets run anyway, just tries to write metadata into ext4 instead of xfs | 18:43 |
afaranha | but why, if I write "b" as the metadata in the test, does it work, but "!" doesn't? | 18:43 |
clarkb | luck of encryption size? | 18:44 |
afaranha | and it worked once with "!" | 18:44 |
clarkb | if you are close to the limit a few bytes either way encrypting things with non-deterministic length outputs could do it | 18:44 |
afaranha | by luck, you mean, the encryption resulted in a small metadata? | 18:44 |
clarkb | yes or large when it fails | 18:44 |
afaranha | okay | 18:44 |
afaranha | let me try the setenv on the tox then | 18:45 |
fungi | regardless, the fips setup will have to happen earlier in the job, because the reboot could clear away any number of stateless things done as part of the job setup | 18:45 |
fungi | just like we saw with the multinode jobs losing their network configs | 18:45 |
clarkb | right, mostly just thought confirming this was the issue really quickly would be good then we can delete the held node and fix this by reordering steps | 18:45 |
clarkb | side note to the fips stuff: I don't think it is an issue yet but we should be wary of creating two entirely identical but for fips sets of CI testing for projects | 18:48 |
clarkb | We can probably get away with asserting if it works under fips then it will work without fips and drop half the jobs | 18:48 |
clarkb | but I'd need to think that through a bit more | 18:48 |
clarkb | or do targeted testing (and focus on functional testing?) of fips | 18:49 |
afaranha | [testenv:func-encryption-py3] | 18:49 |
afaranha | [...] | 18:49 |
afaranha | passenv = TMPDIR=/home/zuul/xfstmp | 18:49 |
afaranha | like this right? | 18:49 |
clarkb | afaranha: no, timburke pointed out that TMPDIR is automatically passed through by default. You need to mount the filesystem to /home/zuul/xfstmp then when you run tox you need to do it like: TMPDIR=/home/zuul/xfstmp tox -e py36 -- something | 18:50 |
clarkb | afaranha: https://opendev.org/openstack/swift/src/branch/master/tools/test-setup.sh#L13 that is the mount command that test-setup.sh uses | 18:51 |
afaranha | okay, passed 3 times, let me try again | 18:53 |
afaranha | :O | 18:53 |
afaranha | I think it's fixed | 18:53 |
clarkb | cool so ya you need to reorder the steps the job takes then it should work | 18:56 |
afaranha | thank you all o/ | 18:57 |
afaranha | I'll leave now (EMEA timezone) hopefully we can send a patch tomorrow to have the tests for swift working :D | 18:58 |
*** dviroel|rover is now known as dviroel|out | 21:57 | |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly for some security updates, but should return to service momentarily | 22:10 |
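
The probing approach clarkb describes at 18:00 to 18:01 (try sizes around 64kb, then binary search for the real limit) could be sketched roughly as follows. This is only an illustration, not code from swift: the helper names `largest_ok_size` and `make_xattr_writer`, the `user.probe` attribute name, and the 128 KiB upper bound are all assumptions, and `os.setxattr` is only available on Linux.

```python
import errno
import os


def largest_ok_size(try_write, low=0, high=128 * 1024):
    """Binary-search the largest size in [low, high] that try_write accepts.

    Assumes try_write is monotonic: once a size fails, all larger sizes fail.
    Returns low - 1 if even the smallest size is rejected.
    """
    if not try_write(low):
        return low - 1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if try_write(mid):
            low = mid
        else:
            high = mid - 1
    return low


def make_xattr_writer(path, name="user.probe"):
    """Build a try_write(size) callable that sets one xattr of `size` bytes.

    `name` is a hypothetical attribute name chosen for this sketch.
    """
    def try_write(size):
        try:
            os.setxattr(path, name, b"x" * size)
            return True
        except OSError as err:
            # ENOSPC is the error seen in the job log; E2BIG and ERANGE are
            # other size-related errno values the kernel may return.
            if err.errno in (errno.ENOSPC, errno.E2BIG, errno.ERANGE):
                return False
            raise  # e.g. ENOTSUP: xattrs not supported here at all
    return try_write
```

To use it, create a file under the directory the tests write to (for example with `touch`, as afaranha did) and call `largest_ok_size(make_xattr_writer(path))`. A result near 65536 would point at the VFS ceiling, while a much smaller one would suggest a block-size-bound filesystem such as ext4.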
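
Since the root cause turned out to be the xfs mount disappearing after the fips reboot, a quick way to confirm which filesystem TMPDIR actually lands on might look like the sketch below. It assumes a Linux host with `/proc/mounts`; the function names `fs_type_for` and `tmpdir_fs_type` are made up for this example, and mount points containing spaces (escaped as `\040` in `/proc/mounts`) are not handled.

```python
import os
import tempfile


def fs_type_for(path, mounts_text):
    """Return the fs type of the deepest mount point that contains `path`.

    `mounts_text` is expected to be in /proc/mounts format:
    device, mount point, fs type, options, ...
    """
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # >= so that a later mount on the same mount point wins
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def tmpdir_fs_type():
    """Report the filesystem type behind tempfile's directory (honours TMPDIR)."""
    with open("/proc/mounts") as mounts:
        return fs_type_for(os.path.realpath(tempfile.gettempdir()), mounts.read())
```

Running `TMPDIR=/home/zuul/xfstmp python3 -c ...` with a call to `tmpdir_fs_type()` before and after the reboot would be expected to show `xfs` turning into `ext4` on the CI node, matching what clarkb found in the job console.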