opendevreview | Ian Wienand proposed opendev/system-config master: [dnm] hold gerrit 3.5 for downgrade testing https://review.opendev.org/c/opendev/system-config/+/842013 | 01:07 |
*** ysandeep|out is now known as ysandeep|rover | 05:04 | |
frickler | infra-root: wiki.openstack.org seems to be running with a non-LE cert still, which will expire in 2 weeks. I guess we'll just have to follow that path once more? | 05:43 |
*** gibi_pto is now known as gibi | 05:51 | |
frickler | ianw: some mirror hosts seem to be having issues with openafs updates, see service-mirror.yaml.log on bridge cf. https://zuul.opendev.org/t/openstack/build/1006e3bbd1264599b34f59129f3d7368 | 05:52 |
frickler | I can go through those manually if you'd rather end your day | 05:52 |
ianw | umm, let me see | 06:03 |
ianw | oh interesting, we don't encrypt/gpg the logs for TIMED_OUT i guess | 06:04 |
ianw | that's an oversight | 06:04 |
ianw | openafs-client : PreDepends: init-system-helpers (>= 1.54~) but 1.51 is to be installed | 06:05 |
ianw | now that is interesting, that is something i must have stuffed up with the new uploads | 06:06 |
ianw | this is a bionic system | 06:06 |
ianw | i bet we have something different enabled in the ppa build environment than we do on production bionic systems | 06:07 |
ianw | i bet it's backports | 06:08 |
ianw | but, why doesn't this fail the gate? | 06:12 |
opendevreview | Ian Wienand proposed opendev/system-config master: [dnm] trigger all openafs builds https://review.opendev.org/c/opendev/system-config/+/842039 | 06:15 |
*** ysandeep|rover is now known as ysandeep|rover|brb | 06:15 | |
ianw | yep the ppa has backports enabled. i'm not 100% sure when or why that happened. i guess we'll need to rebuild the bionic package without backports | 06:17 |
ianw | Primary Archive for Ubuntu - BACKPORTS (main, restricted, universe, multiverse) (included on 2010-03-10) | 06:18 |
ianw | https://launchpad.net/~openafs/+archive/ubuntu/stable | 06:18 |
ianw | i guess it was copied from that? | 06:18 |
ianw | it looks like it needs debhelper = 12; bionic without backports only has 11 | 06:25 |
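A quick way to confirm that diagnosis from a stock bionic host (a sketch; the version numbers are the ones quoted in this log, not independently re-verified):

```bash
# Per the apt error above, openafs-client PreDepends on
# init-system-helpers >= 1.54~ while plain bionic ships 1.51, and the
# source wants debhelper 12 where plain bionic only has 11.
apt-cache policy init-system-helpers debhelper
# Is backports enabled anywhere on this host?
grep -r backports /etc/apt/sources.list /etc/apt/sources.list.d/ 2>/dev/null
```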
opendevreview | Ian Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042 | 06:29 |
opendevreview | Ian Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043 | 06:30 |
ianw | that has failed, mea culpa on that https://zuul.opendev.org/t/openstack/build/6e1dd24132af43309e8ceee1b72d58a4 | 06:31 |
frickler | hmm, so I wonder why it did not fail earlier? | 06:33 |
opendevreview | Ian Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042 | 06:39 |
opendevreview | Ian Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043 | 06:39 |
*** ysandeep|rover|brb is now known as ysandeep|rover | 06:48 | |
ianw | frickler: it would have, but it didn't run | 07:03 |
ianw | it is a bit hard to test. we *could* figure out a build test, but at that point we'd just be re-writing the ppa build | 07:03 |
opendevreview | Merged opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042 | 07:09 |
opendevreview | Merged opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043 | 07:09 |
frickler | ianw: fair enough. the other thing that is not clear to me is why the job timed out, the playbook seems to have finished fine, just with some failed hosts. sadly there are no timestamps other than the one at the start | 07:18 |
opendevreview | Ian Wienand proposed opendev/infra-openafs-deb bionic: Fix typo in changelog https://review.opendev.org/c/opendev/infra-openafs-deb/+/842046 | 07:25 |
ianw | yeah, the timeout, i'm not 100% sure | 07:26 |
*** jpena|off is now known as jpena | 07:31 | |
opendevreview | Merged opendev/infra-openafs-deb bionic: Fix typo in changelog https://review.opendev.org/c/opendev/infra-openafs-deb/+/842046 | 07:38 |
ianw | ok, that one made it to building | 07:43 |
ianw | and built ok for amd64; so the build-deps are ok for bionic. running 842039 through should test it, but i might just wait for the arm64 builds | 07:50 |
*** ysandeep|rover is now known as ysandeep|rover|lunch | 08:00 | |
*** ysandeep|rover|lunch is now known as ysandeep|rover | 09:21 | |
*** rlandy|out is now known as rlandy | 10:26 | |
*** pojadhav is now known as pojadhav|afk | 10:49 | |
ianw | ok, that fixed it. the mirror jobs should deploy ok now | 10:59 |
ianw | i'll check on it tomorrow | 10:59 |
*** ysandeep|rover is now known as ysandeep|rover|brb | 11:18 | |
*** pojadhav|afk is now known as pojadhav | 11:26 | |
*** dviroel|out is now known as dviroel | 11:28 | |
*** ysandeep|rover|brb is now known as ysandeep|rover | 11:51 | |
fungi | apparently setuptools 62.3.0 (released yesterday) has deprecated "namespace_packages", which pbr uses when you have a module that lacks an __init__.py | 14:18 |
fungi | any subdirectory pbr packages that lacks an __init__.py (e.g. foo/tests/fixtures) raises a DeprecationWarning now | 14:18 |
fungi | adding foo/tests/fixtures/__init__.py silences it, but that seems a little misleading since importing foo.tests.fixtures is unlikely to ever happen | 14:19 |
fungi | trying to think through how to solve it in pbr, but unfortunately it's hard to tell the difference between a directory which contains data and a directory which is meant to be a module namespace | 14:20 |
fungi | it looks like pbr treats them the same at the moment | 14:21 |
fungi | oh, also "Relying on include_package_data to ensure sub-packages are automatically added to the build wheel distribution (as “data”) is now considered a deprecated practice." | 14:40 |
fungi | which pbr similarly seems to do | 14:41 |
fungi | looks like it's impacting yamllint for me too | 14:42 |
fungi | Python recognizes 'yamllint.conf' as an importable package, however it | 14:42 |
fungi | is included in the distribution as "data". This behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'yamllint.conf' is included as a package by using setuptools' `packages` configuration field or the proper discovery methods. You can read more about "package discovery" and "data files" on setuptools documentation page. | 14:44 |
clarkb | frickler: yes I was planning on purchasing a new cert next week. It's our last remaining non-LE cert instance because the host isn't directly managed by ansible | 14:47 |
clarkb | fungi: the problem is the packages system doesn't seem any better from a usability standpoint :/ | 14:48 |
clarkb | it lacks documentation and also spits out a ton of warnings/errors if you don't get it just right | 14:48 |
fungi | yeah | 14:49 |
fungi | the error seems to be that setuptools is assuming any subdirectory of an importable module should be a package namespace. i need to read up more on how to tell it that a given directory is non-importable "data" | 14:50 |
fungi | the fact that it's raising setuptools._deprecation_warning.SetuptoolsDeprecationWarning with a multi-line message makes it essentially impossible to filter with PYTHONWARNINGS | 15:09 |
fungi | apparently i can filter on the parent exception class, which ends up looking like ignore::Warning:setuptools.command.build_py (still really broad unfortunately) | 15:17 |
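For reference, that filter in practice looks something like this (a sketch; the filter string is the one fungi quotes above, the rest is illustrative):

```bash
# Match on the parent Warning class scoped to the raising module;
# message-based matching can't work because the message spans lines.
export PYTHONWARNINGS='ignore::Warning:setuptools.command.build_py'
python setup.py bdist_wheel  # build_py deprecation warnings now suppressed
```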
*** ysandeep|rover is now known as ysandeep|out | 15:24 | |
*** dviroel is now known as dviroel|lunch | 16:02 | |
*** marios is now known as marios|out | 16:03 | |
johnsom | Hi neighbors, so after sorting out the missing unbound log issue on centos, I have figured out why we have a fairly high failure rate on the fips jobs. I would like your input/ideas on how best to resolve it. | 16:09 |
johnsom | The fips jobs reboot the nodepool instance to enable the fips mode. The problem is devstack starts pretty quickly after that reboot. The unbound resolver is not always started in time for the repo installation step in devstack which later leads to packages failing to install. | 16:10 |
johnsom | https://zuul.opendev.org/t/openstack/build/c6ab0b93e2e6485c9e2eb97dbd74805d/logs | 16:10 |
johnsom | for the curious | 16:10 |
clarkb | johnsom: one approach may be to have the ansible wait for 127.0.1.1:53 to be listening again (or whatever the specific localhost ip is) | 16:11 |
johnsom | I looked at using "systemctl is-system-running" to check if systemd is done, but pretty much all of my instances seem to report "degraded" for various reasons. | 16:11 |
johnsom | clarkb Yeah, that was my next thought. | 16:12 |
clarkb | infra-root I'm trimming wheel mirror quotas to 20GB now | 16:12 |
johnsom | Any thoughts on timeouts or a link to places in the ansible we already do something similar? (I can dig, but if you have a top-of-head link it would be welcome) | 16:14 |
clarkb | johnsom: there is an ansible module that can do that waiting let me see if I can find it | 16:15 |
johnsom | clarkb Much appreciated | 16:16 |
clarkb | johnsom: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/wait_for_module.html the ssh examples in there are probably close | 16:16 |
clarkb | our resolver should do tcp as well as udp so that may just work | 16:16 |
johnsom | cool, thanks! | 16:16 |
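In shell terms the wait being discussed boils down to roughly this (a sketch; the Ansible wait_for module linked above expresses the same check with host/port/timeout parameters, and the exact loopback address unbound binds to may vary per image):

```bash
# Wait up to ~5 minutes after the fips reboot for the local resolver to
# accept TCP connections; per the discussion it answers TCP as well as UDP.
up=0
for i in $(seq 1 300); do
    if timeout 1 bash -c 'exec 3<>/dev/tcp/127.0.0.1/53' 2>/dev/null; then
        up=1
        break
    fi
    sleep 1
done
[ "$up" = 1 ] || { echo "resolver never came up" >&2; exit 1; }
```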
ade_lee | johnsom, where are you thinking of putting in that wait? | 16:17 |
johnsom | ade_lee It will need to go into the fips enable Ansible somewhere I think. It really only applies to the reboot. | 16:17 |
clarkb | another way to make it more generic might be to do reboot, wait for ssh, then wait for any dns record to resolve | 16:19 |
clarkb | I assume you're failing because the resolver isn't actually up when you try to resolve a name but /etc/resolv.conf says to use the local resolver | 16:19 |
clarkb | then it doesn't depend on a local resolver working, just that anything can resolve | 16:19 |
johnsom | Right | 16:21 |
ade_lee | clarkb, johnsom yeah - the fips job is pretty low level being in zuul-jobs - we may need to be pretty generic | 16:22 |
fungi | the wait could happen in devstack's playbook | 16:22 |
fungi | assuming it's devstack trying to do stuff too soon after boot | 16:23 |
ade_lee | yeah - that's the other place I was thinking about | 16:23 |
fungi | trying to be generic about it gets into the "how do you determine when a system has reached steady-state" conundrum which distros have been trying to solve forever | 16:23 |
fungi | when is a system fully "booted"? | 16:24 |
johnsom | Actually, that is probably good. It might help devstack fail earlier when DNS is broken. Currently, it runs pretty far before realizing it is missing some packages | 16:24 |
ade_lee | fungi, johnsom , clarkb so we're agreed that the best thing is to do a generic test to see if ssh is up and try to resolve something in devstack? | 16:28 |
*** diablo_rojo_phone_ is now known as diablo_rojo_phone | 16:29 | |
*** open10k8s_ is now known as open10k8s | 16:29 | |
*** dwhite449 is now known as dwhite44 | 16:29 | |
*** cloudnull4 is now known as cloudnull | 16:29 | |
*** andrewbonney_ is now known as andrewbonney | 16:29 | |
*** yoctozepto_ is now known as yoctozepto | 16:29 | |
*** snbuback_ is now known as snbuback | 16:29 | |
*** ricolin_ is now known as ricolin | 16:29 | |
johnsom | ade_lee Yeah, I think devstack could have a check (if OFFLINE=false) that loops/waits for a DNS resolution to succeed. maybe "opendev.org"? Not sure what is a good name to look up. | 16:30 |
*** diablo_rojo_phone is now known as Guest674 | 16:30 | |
fungi | yes, dns resolution not ssh | 16:32 |
clarkb | johnsom: it should look up its mirror if we can manage that | 16:32 |
clarkb | since the mirror is the main resource it is looking for externally | 16:33 |
clarkb | opendev.org is maybe a reasonable stand-in as they are in the same zone | 16:33 |
ade_lee | clarkb, where is the mirror defined? we can try that if we know how to look it up. | 16:34 |
fungi | there's a zuul variable which gets set... checking | 16:37 |
johnsom | zuul_site_mirror_fqdn https://opendev.org/zuul/zuul-jobs/src/branch/master/test-playbooks/base-roles/configure-mirrors.yaml | 16:37 |
johnsom | I'm looking to see if that gets dropped on the file system in some useful way that devstack could pick it up. | 16:39 |
fungi | yeah, that's the one i was thinking of | 16:40 |
fungi | if you want it from the filesystem, yeah the mirror-info role installs /etc/ci/mirror_info.sh which sets envvars | 16:41 |
johnsom | Yeah, I don't see anything obvious. Also, that would tie the devstack install to a zuul run, and we may want this to be a bit more generic. | 16:41 |
fungi | in mirror-info we set it from the mirror_fqdn value: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L17 | 16:44 |
clarkb | ya reading it out of /etc/apt/sources.list is probably easiest for devstack? I'm not sure it has direct exposure to the zuul vars | 16:45 |
clarkb | the mirror-info stuff is a holdover from zuul v2, but we haven't made any indications of deleting it, and it is probably fine too? | 16:45 |
ade_lee | I'd assume /etc/apt/sources.list is for ubuntu .. probably different for centos .. | 16:46 |
ade_lee | maybe easiest to start with "opendev.org" to begin with? | 16:46 |
clarkb | ya that is probably close enough and simple | 16:48 |
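A minimal sketch of what that devstack-side check could look like (the hostname, retry count, and OFFLINE guard come from the discussion above, not a settled design; die is devstack's error helper):

```bash
# Before touching package repos, loop until name resolution works at all;
# opendev.org is just a stand-in name in the same zone as the mirrors.
if [[ "$OFFLINE" != "True" ]]; then
    for i in $(seq 1 60); do
        getent hosts opendev.org >/dev/null && break
        sleep 2
    done
    getent hosts opendev.org >/dev/null || die $LINENO "DNS resolution not working"
fi
```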
ade_lee | clarkb, fungi johnsom ok cool - I'll give that a shot and see what happens. I'll test against octavia and barbican and see if it fixes things | 16:50 |
*** dviroel_ is now known as dviroel | 16:52 | |
jrosser | the projects with embedded ansibles seem pretty reliant on mirror-info stuff according to codesearch | 16:55 |
clarkb | that is unfortunate. it was only there for the legacy stuff and the expectation is you'd deal with it using the zuul info more directly | 16:56 |
clarkb | I know tripleo used it extensively, which is why we haven't bothered to remove it though | 16:56 |
jrosser | is the same info available elsewhere? | 16:56 |
clarkb | yes in the zuul ansible vars | 17:00 |
clarkb | zuul_site_mirror_fqdn is the var name | 17:00 |
fungi | it doesn't seem to appear in the zuul "inventory" we record, but yeah | 17:00 |
fungi | i guess it's a general ansible var not a zuul var | 17:01 |
clarkb | ah right. It should be recorded as a fact too? But ya the old shell script is from the land before time | 17:03 |
jrosser | ah TIL https://opendev.org/zuul/zuul-jobs/src/branch/master/doc/source/mirror.rst | 17:05 |
jrosser | though passing that to an embedded ansible is reasonably challenging as well | 17:06 |
fungi | how to get ansible vars into a nested ansible seems to be a challenge | 17:08 |
jrosser | currently i have to write out vars files in a pre playbook and then grab them from the embedded ansible | 17:09 |
jrosser | so the motivation to move away from loading things from mirror-info.sh has been small | 17:09 |
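Illustrated roughly (all names and paths here are invented for the example; the real playbooks differ):

```bash
# Outer (zuul-side) pre step writes the mirror name to a vars file...
echo "mirror_fqdn: ${NODEPOOL_MIRROR_HOST}" > /tmp/outer_vars.yml
# ...and the embedded ansible run loads it explicitly.
ansible-playbook -e @/tmp/outer_vars.yml site.yml
```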
fungi | right, rather similar | 17:24 |
clarkb | ya I think the main concern is we don't keep mirror-info.sh up to date | 17:32 |
clarkb | but you could keep your thing up to date with the pieces you need | 17:32 |
clarkb | even if they function in the same way, it's a shift in eyeballs and responsibility that ensures things stay up to date | 17:32 |
fungi | the main way that script is useful is as a source of $NODEPOOL_MIRROR_HOST | 17:33 |
fungi | but exporting that from ansible is also trivial | 17:33 |
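For completeness, consuming it from the legacy script looks like this (a sketch; NODEPOOL_MIRROR_HOST is the variable named above, and the apt fallback is an assumption that only holds on Debian/Ubuntu, as noted earlier for centos):

```bash
if [ -f /etc/ci/mirror_info.sh ]; then
    # legacy helper installed by the mirror-info role; exports envvars
    . /etc/ci/mirror_info.sh
    echo "mirror: ${NODEPOOL_MIRROR_HOST}"
else
    # fall back to parsing the configured apt mirror
    awk '/^deb /{print $2; exit}' /etc/apt/sources.list
fi
```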
*** jpena is now known as jpena|off | 18:08 | |
*** melwitt_ is now known as melwitt | 18:08 | |
*** deepcursion1 is now known as deepcursion | 18:45 | |
*** artom__ is now known as artom | 19:08 | |
*** ianw_ is now known as ianw | 19:11 | |
corvus | http://lists.zuul-ci.org/pipermail/zuul-discuss/2022-May/001801.html is what i mentioned in the meeting | 20:16 |
fungi | thanks! | 20:27 |
ianw | system-config-zuul-role-integration-jammy https://zuul.opendev.org/t/openstack/build/82e184f32e5449a4ae619d6176e5497e : DISK_FULL in 13m 54s | 20:32 |
ianw | do we have a full executor? | 20:32 |
fungi | more likely the job's workspace exceeded the allowed size | 20:32 |
ianw | that would certainly be unexpected for this job | 20:35 |
*** rlandy is now known as rlandy|biab | 20:35 | |
jrosser | i had a quick go at using zuul_site_mirror_info, but it seems undefined | 20:42 |
jrosser | maybe i've made a silly mistake https://review.opendev.org/c/openstack/openstack-ansible/+/842158 | 20:44 |
jrosser | ah the docs say it is still work-in-progress | 20:45 |
ianw | 2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000) | 20:47 |
ianw | TASK [get df disk usage] seems like it might help, but ... doesn't. the only thing we seem to be copying is syslog. i wonder if we're dumping something weird on jammy... | 20:51 |
ianw | (i am getting this by looking in the logs at ze02) | 20:53 |
ianw | Ansible output: b'base : ok=0 changed=0 unreachable=1 | 20:57 |
ianw | so the df didn't work because the host was unreachable | 20:58 |
ianw | root@ubuntu-jammy-rax-ord-0029677363:/var/log# du -h syslog | 21:03 |
ianw | 112K  syslog | 21:03 |
ianw | there is nothing immediately obviously wrong with jammy syslogs | 21:03 |
*** rlandy|biab is now known as rlandy | 21:07 | |
fungi | growroot failed maybe? | 21:12 |
fungi | oh, nevermind, on the executor right | 21:16 |
corvus | ianw: keep in mind the result of the cleanup playbook doesn't impact the result of the job. so while it's possible that the unreachable status could be related*, it's not a direct cause of a failure (*perhaps it's unreachable because of some momentary condition on the executor which caused disk_full and also broke outgoing ansible connections) | 21:27 |
corvus | ianw: if anything the other way around -- disk_full might cause weird ansible results as zuul begins to forcibly kill processes | 21:27 |
ianw | right, looking at the logs from the executor, it seems like it went disk full after copying the syslog | 21:45 |
ianw | 2022-05-17 20:08:59,463 DEBUG zuul.AnsibleJob.output: [e: 35914214f6ab4ab6ab13fa3125ce1797] [build: 82e184f32e5449a4ae619d6176e5497e] Ansible output: b'TASK [upload-logs-swift : Upload logs to swift] ********************************' | 21:46 |
ianw | 2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000) | 21:46 |
ianw | that was the immediate message before the kill | 21:46 |
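(For anyone retracing this: the accountant's numbers can be checked by hand on the executor, assuming the standard builds directory layout.)

```bash
# Largest build workspaces in MB; the log above shows 10091MB against the
# 5000MB limit for build 82e184f32e5449a4ae619d6176e5497e.
du -sm /var/lib/zuul/builds/* 2>/dev/null | sort -rn | head
```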
corvus | there might be a partial upload somewhere on one of our object storage providers | 21:52 |
corvus | but it would be difficult to find, and has a low probability of providing additional info | 21:53 |
ianw | i have the full logs but https://paste.opendev.org/show/bVaO8DeNt5dsHEr9glrN/ is just the upload bits | 21:55 |
corvus | ianw: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_aa0/841525/1/check/system-config-zuul-role-integration-jammy/aa02f68/ has a 500M syslog | 21:58 |
corvus | and the job that failed also ran in ovh | 21:58 |
ianw | theory, it's iptables logs | 21:59 |
corvus | perhaps the nodes in ovh start out with larger syslogs? and perhaps they've grown in the last week? | 21:59 |
ianw | oh, good find! | 21:59 |
corvus | that syslog has lots of kernel stack traces | 22:00 |
ianw | yeah. my theory was that the jammy nodes are underused at the moment, so they've been sitting for a long time, perhaps filling up with things like iptables rejection logs | 22:01 |
ianw | that doesn't even look like much of an oops | 22:02 |
*** rlandy is now known as rlandy|out | 22:02 | |
ianw | it's right from boot | 22:04 |
ianw | May 12 03:08:39 ubuntu glean-early.sh[486]: DEBUG:glean:Detected distro : ubuntu | 22:04 |
ianw | May 12 03:08:39 ubuntu kernel: [ 5.870384] Call Trace: | 22:04 |
ianw | e.g. glean is still running even | 22:04 |
fungi | might make sense to special-case syslog collection and strip earlier entries if over a sane size | 22:05 |
fungi | 100mb/day syslog seems like... a lot for an idle server | 22:06 |
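A rough sketch of the special-casing fungi suggests (the cap and file names are placeholders, not an agreed design):

```bash
# Keep only the newest chunk of syslog when it exceeds a sane size, so a
# chatty kernel can't blow the executor's per-build disk limit.
max_bytes=$((100 * 1024 * 1024))
src=/var/log/syslog
if [ "$(stat -c %s "$src")" -gt "$max_bytes" ]; then
    tail -c "$max_bytes" "$src" > syslog.truncated
fi
```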
ianw | i think the real problem here is this flood of traces | 22:07 |
ianw | but there's no bug/warning/etc. like a regular oops. very weird | 22:08 |
corvus | it's the "Everything's OK" oops | 22:09 |
*** dviroel is now known as dviroe|out | 22:12 | |
ianw | May 12 03:10:34 ubuntu kernel: [ 121.705979] Call Trace: | 22:12 |
ianw | May 12 03:10:34 ubuntu kernel: [ 121.705981] <TASK> | 22:12 |
ianw | i've never seen this xml-ish "<TASK>" before. it must be new i guess ... but it makes me suspicious what's putting this out | 22:13 |
ianw | the start of the logs is cut off, it feels like that must be where it started, and somehow it's ended up in a recursive loop dumping traces | 22:29 |
ianw | i can try booting an ovh jammy node and see if it replicates. i'm on a rax jammy node and that's ok | 22:30 |
ianw | one has booted in iweb and it's ok | 23:29 |
ianw | | 0029678242 | ovh-gra1 | ubuntu-jammy | 4bc97209-22c7-4ef5-ae24-fcbd95e77631 | 213.32.76.198 | | 23:41 |
ianw | this one *is* spewing the crap | 23:41 |
ianw | [ 2.277136] kernel: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff97490af4 (native_write_msr+0x4/0x20) | 23:42 |
ianw | i think that's where it starts | 23:42 |
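The checks being run here amount to something like this (a sketch, run on a suspect node):

```bash
# Does the MSR write failure appear, and how big is the trace flood?
grep -m1 'unchecked MSR access error' /var/log/syslog
grep -c 'Call Trace:' /var/log/syslog
```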
Clark[m] | ianw: amorin may be interested if it is ovh specific | 23:47 |
Clark[m] | https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1959721 may be related too? | 23:48 |
ianw | https://bugzilla.redhat.com/show_bug.cgi?id=1808996 also discusses 0x48 | 23:50 |
ianw | seems to be AMD related | 23:50 |
Clark[m] | I didn't think the ovh nodes were amd based but it may have changed | 23:52 |
ianw | https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921880/comments/12 describes it | 23:57 |
ianw | "Running a focal instance (5.4.0-70-generic) on the "before" system with the EPYC-Rome type on a Milan CPU results in the following error. This is due to the missing IBRS flag, which is one of the reasons, I'd like to see this backported ;-)" | 23:57 |
ianw | although that says it tried to write 0x6, whereas ours says 0x4 | 23:58 |