Tuesday, 2022-05-17

opendevreviewIan Wienand proposed opendev/system-config master: [dnm] hold gerrit 3.5 for downgrade testing  https://review.opendev.org/c/opendev/system-config/+/84201301:07
*** ysandeep|out is now known as ysandeep|rover05:04
fricklerinfra-root: wiki.openstack.org seems to be running with a non-LE cert still, which will expire in 2 weeks. I guess we'll just have to follow that path once more?05:43
*** gibi_pto is now known as gibi05:51
fricklerianw: some mirror hosts seem to be having issues with openafs updates, see service-mirror.yaml.log on bridge cf. https://zuul.opendev.org/t/openstack/build/1006e3bbd1264599b34f59129f3d736805:52
fricklerI can go through those manually if you rather want to end your day05:52
ianwumm, let me see06:03
ianwoh interesting, we don't encrypt/gpg the logs for TIMED_OUT i guess06:04
ianwthat's an oversight06:04
ianw openafs-client : PreDepends: init-system-helpers (>= 1.54~) but 1.51 is to be installed06:05
ianwnow that is interesting, that is something i must have stuffed up with the new uploads06:06
ianwthis is a bionic system06:06
ianwi bet we have something different enabled in the ppa build environment than we do on production bionic systems06:07
ianwi bet it's backports06:08
ianwbut, why doesn't this fail the gate?06:12
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] trigger all openafs builds  https://review.opendev.org/c/opendev/system-config/+/84203906:15
*** ysandeep|rover is now known as ysandeep|rover|brb06:15
ianwyep the ppa has backports enabled.  i'm not 100% sure when or why that happened.  i guess we'll need to rebuild the bionic package without backports06:17
ianwPrimary Archive for Ubuntu - BACKPORTS (main, restricted, universe, multiverse) (included on 2010-03-10) 06:18
ianwi guess it was copied from that?06:18
ianwit looks like it has debhelper = 12, bionic without backports only has 1106:25
opendevreviewIan Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204206:29
opendevreviewIan Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204306:30
ianwthat has failed, mea culpa on that https://zuul.opendev.org/t/openstack/build/6e1dd24132af43309e8ceee1b72d58a406:31
fricklerhmm, so I wonder why did it not fail earlier?06:33
opendevreviewIan Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204206:39
opendevreviewIan Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204306:39
*** ysandeep|rover|brb is now known as ysandeep|rover06:48
ianwfrickler: it would have, but it didn't run07:03
ianwit is a bit hard to test.  we *could* figure out a build test.  but at that point, we've just re-written the ppa building07:03
opendevreviewMerged opendev/infra-openafs-deb bionic: Rebuild for PPA without backports  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204207:09
opendevreviewMerged opendev/infra-openafs-deb bionic: gitreview: update branch to bionic  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204307:09
fricklerianw: fair enough. the other thing that is not clear to me is why the job timed out, the playbook seems to have finished fine, just with some failed hosts. sadly there are no timestamps other than the one at the start07:18
opendevreviewIan Wienand proposed opendev/infra-openafs-deb bionic: Fix typo in changelog  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204607:25
ianwyeah, the timeout, i'm not 100% sure 07:26
*** jpena|off is now known as jpena07:31
opendevreviewMerged opendev/infra-openafs-deb bionic: Fix typo in changelog  https://review.opendev.org/c/opendev/infra-openafs-deb/+/84204607:38
ianwok, that one made it to building07:43
ianwand built ok for amd64; so the build-deps are ok for bionic.  running 842039 through should test it, but i might just wait for the arm64 builds07:50
*** ysandeep|rover is now known as ysandeep|rover|lunch08:00
*** ysandeep|rover|lunch is now known as ysandeep|rover09:21
*** rlandy|out is now known as rlandy10:26
*** pojadhav is now known as pojadhav|afk10:49
ianwok, that fixed it.  the mirror jobs should deploy ok now10:59
ianwi'll check on it tomorrow10:59
*** ysandeep|rover is now known as ysandeep|rover|brb11:18
*** pojadhav|afk is now known as pojadhav11:26
*** dviroel|out is now known as dviroel11:28
*** ysandeep|rover|brb is now known as ysandeep|rover11:51
fungiapparently setuptools 62.3.0 (released yesterday) has deprecated "namespace_packages" which pbr uses when you have a module that lacks a __init__.py14:18
fungiany subdirectory pbr packages without a __init__.py (e.g. foo/tests/fixtures) raises a deprecationwarning now14:18
fungiadding foo/tests/fixtures/__init__.py silences it, but that seems a little misleading since importing foo.tests.fixtures is unlikely to ever happen14:19
fungitrying to think through how to solve it in pbr, but unfortunately it's hard to tell the difference between a directory which contains data and a directory which is meant to be a module namespace14:20
fungiit looks like pbr treats them the same at the moment14:21
fungioh, also "Relying on include_package_data to ensure sub-packages are automatically added to the build wheel distribution (as “data”) is now considered a deprecated practice."14:40
fungiwhich pbr similarly seems to do14:41
fungilooks like it's impacting yamllint for me too14:42
fungiPython recognizes 'yamllint.conf' as an importable package, however it14:42
fungiis included in the distribution as "data". This behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'yamllint.conf' is included as a package by using setuptools' `packages` configuration field or the proper discovery methods. You can read more about "package discovery" and "data files" on setuptools documentation page.14:44
clarkbfrickler: yes I was planning on purchasing a new cert next week. It's our last remaining non LE cert instance because the host isn't directly managed by ansible14:47
clarkbfungi: the problem is the packages system doesn't seem any better from a usability standpoint :/14:48
clarkblacks documentation also spits out a ton of warnings/errors if you don't get it just right14:48
fungithe error seems to be that setuptools is assuming any subdirectory of an importable module should be a package namespace. i need to read up more on how to tell it that a given directory is non-importable "data"14:50
fungithe fact that it's raising setuptools._deprecation_warning.SetuptoolsDeprecationWarning with a multi-line message makes it essentially impossible to filter with PYTHONWARNINGS15:09
fungiapparently i can filter on the parent exception class, which ends up looking like ignore::Warning:setuptools.command.build_py (still really broad unfortunately)15:17
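[editor's sketch] The trade-off fungi describes can be tried out in-process. This is a minimal illustration, not setuptools' or pbr's behavior: FakeDeprecationWarning is a hypothetical stand-in class, and the filter is installed with warnings.filterwarnings rather than through PYTHONWARNINGS. It only shows that ignoring the parent Warning class also hides every subclass, which is why the filter is so broad.

```python
import warnings


class FakeDeprecationWarning(Warning):
    """Hypothetical stand-in for a library-specific warning class."""


def emit():
    # A multi-line message, similar in shape to the one discussed above.
    warnings.warn("first line\nsecond line of a long message",
                  FakeDeprecationWarning)


def count_warnings(apply_filter):
    """Return how many warnings get through, with or without the filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if apply_filter:
            # Filtering on the parent class matches all subclasses too.
            warnings.filterwarnings("ignore", category=Warning)
        emit()
        return len(caught)


if __name__ == "__main__":
    print("without filter:", count_warnings(False))
    print("with filter:", count_warnings(True))
```

Narrowing the filter by module, as in the ignore::Warning:setuptools.command.build_py form quoted above, limits the damage somewhat but still hides unrelated warnings raised from that module.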
*** ysandeep|rover is now known as ysandeep|out15:24
*** dviroel is now known as dviroel|lunch16:02
*** marios is now known as marios|out16:03
johnsomHi neighbors, so following figuring out the missing unbound log issues on centos, I have figured out why we have a fairly high failure rate on the fips jobs. I would like your input/ideas on how best to resolve it.16:09
johnsomThe fips jobs reboot the nodepool instance to enable the fips mode. The problem is devstack starts pretty quickly after that reboot. The unbound resolver is not always started in time for the repo installation step in devstack which later leads to packages failing to install.16:10
johnsomfor the curious16:10
clarkbjohnsom: one approach may be to have the ansible query for to be listening again (or whatever the specific localhost ip is)16:11
johnsomI looked at using "systemctl is-system-running" to check if systemd is done, but pretty much all of my instances seem to report "degraded" for various reasons.16:11
johnsomclarkb Yeah, that was my next thought.16:12
clarkbinfra-root I'm trimming wheel mirror quota's to 20GB now16:12
johnsomAny thoughts on timeouts or a link to places in the ansible we already do something similar? (I can dig, but if you have a top-of-head link it would be welcome)16:14
clarkbjohnsom: there is an ansible module that can do that waiting let me see if I can find it16:15
johnsomclarkb Much appreciated16:16
clarkbjohnsom: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/wait_for_module.html the ssh examples in there are probably close16:16
clarkbour resolver should do tcp as well as udp so that may just work16:16
johnsomcool, thanks!16:16
ade_leejohnsom, where are you thinking of putting in that wait?16:17
johnsomade_lee It will need to go into the fips enable Ansible somewhere I think. It really only applies to the reboot.16:17
clarkbanother way to make it more generic might be to do reboot, wait for ssh, then wait for any dns record to resolve16:19
clarkbI assume you're failing because the resolver isn't actually up when you try to resolve a name but /etc/hosts says to use the local resolver16:19
clarkbthen it doesn't depend on a local resolver working, just that anything can resolve16:19
ade_leeclarkb, johnsom yeah - the fips job is pretty low level being in zuul-jobs - we may need to be pretty generic16:22
fungithe wait could happen in devstack's playbook16:22
fungiassuming it's devstack trying to do stuff too soon after boot16:23
ade_leeyeah - thats the other place I was thinking about 16:23
fungitrying to be generic about it gets into the "how do you determine when a system has reached steady-state" conundrum which distros have been trying to solve forever16:23
fungiwhen is a system fully "booted"?16:24
johnsomActually, that is probably good. It might help devstack fail earlier when DNS is broken. Currently, it runs pretty far before realizing it is missing some packages16:24
ade_leefungi, johnsom , clarkb so we're agreed that the best thing is to do a generic test to see if ssh is up and try to resolve something in devstack? 16:28
*** diablo_rojo_phone_ is now known as diablo_rojo_phone16:29
*** open10k8s_ is now known as open10k8s16:29
*** dwhite449 is now known as dwhite4416:29
*** cloudnull4 is now known as cloudnull16:29
*** andrewbonney_ is now known as andrewbonney16:29
*** yoctozepto_ is now known as yoctozepto16:29
*** snbuback_ is now known as snbuback16:29
*** ricolin_ is now known as ricolin16:29
johnsomade_lee Yeah, I think a check in devstack (if OFFLINE=false) have it loop/wait for a DNS resolution to succeed. maybe "opendev.org"? Not sure what is a good name to lookup.16:30
*** diablo_rojo_phone is now known as Guest67416:30
fungiyes, dns resolution not ssh16:32
clarkbjohnsom: it should lookup its mirror if we can manage that16:32
clarkbsince the mirror is the main resource it is looking for externally16:33
clarkbopendev.org is maybe a reasonable stand in as they are in the same zone16:33
ade_leeclarkb, where is the mirror defined?  we can try that if we know how to look it up.16:34
fungithere's a zuul variable which gets set... checking16:37
johnsomzuul_site_mirror_fqdn https://opendev.org/zuul/zuul-jobs/src/branch/master/test-playbooks/base-roles/configure-mirrors.yaml16:37
johnsomI'm looking to see if that gets dropped on the file system in some useful way that devstack could pick it up.16:39
fungiyeah, that's the one i was thinking of16:40
fungiif you want it from the filesystem, yeah the mirror-info role installs /etc/ci/mirror_info.sh which sets envvars16:41
johnsomYeah, I don't see anything obvious. Also, that would tie the devstack install to a zuul run which we may want this to be a bit more generic.16:41
fungiin mirror-info we set it from the mirror_fqdn value: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L1716:44
clarkbya reading it out of /etc/apt/sources.list is probably easiest for devstack? I'm not sure it has direct exposure to the zuul vars16:45
clarkbthe mirror-info stuff is holdover from zuul v2 but we haven't made any indications of deleting it and is probably fine too?16:45
ade_leeI'd assume  /etc/apt/sources.list is for ubuntu .. probably different for centos ..16:46
ade_leemaybe easiest to start with "opendev.org" to begin with?16:46
clarkbya that is probably close enough and simple16:48
ade_leeclarkb, fungi johnsom ok cool - I'll give that a shot and see what happens.  I'll test against octavia and barbican and see if it fixes things16:50
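[editor's sketch] The wait agreed on above (retry a DNS lookup until it succeeds or a deadline passes) could look roughly like this. It is a sketch under assumptions, not the change that was later proposed to devstack: the function name wait_for_dns, the 60 second timeout and the 2 second interval are all hypothetical, and "opendev.org" is used only because it was the name suggested in the discussion.

```python
import socket
import time


def wait_for_dns(name, timeout=60.0, interval=2.0,
                 resolve=socket.getaddrinfo,
                 sleep=time.sleep, clock=time.monotonic):
    """Retry resolving `name` until it works or `timeout` seconds pass.

    Returns True once a lookup succeeds, False if the deadline is hit.
    The resolve/sleep/clock parameters exist only so the loop can be
    exercised without real network access.
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(name, None)
            return True
        except OSError:
            # socket.gaierror is a subclass of OSError.
            if clock() >= deadline:
                return False
            sleep(interval)


if __name__ == "__main__":
    ok = wait_for_dns("opendev.org")
    print("resolver ready" if ok else "gave up waiting for DNS")
```

Because it only asks whether any lookup succeeds, this check does not depend on which resolver is configured, which matches the "just that anything can resolve" point made earlier.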
*** dviroel_ is now known as dviroel16:52
jrosserthe projects with embedded ansibles seem pretty reliant on mirror-info stuff according to codesearch16:55
clarkbthat is unfortunate. it was only there for the legacy stuff and expectation is you'd deal with it using the zuul info more directly16:56
clarkbI know tripleo used it extensively which is why we haven't bothered to remove it though16:56
jrosseris the same info available elsewhere?16:56
clarkbyes in the zuul ansible vars17:00
clarkbzuul_site_mirror_fqdn is the var name17:00
fungiit doesn't seem to appear in the zuul "inventory" we record, but yeah17:00
fungii guess it's a general ansible var not a zuul var17:01
clarkbah right. It should be recorded as a fact too? But ya the old shell script is from the land before time17:03
jrosserah TIL https://opendev.org/zuul/zuul-jobs/src/branch/master/doc/source/mirror.rst17:05
jrosserthough passing that to an embedded ansible is reasonably challenging as well17:06
fungihow to get ansible vars into a nested ansible seems to be a challenge17:08
jrossercurrently i have to write out vars files in a pre playbook and then grab them from the embedded ansible17:09
jrosserso the motivation to move away from loading things from mirror-info.sh has been small17:09
fungiright, rather similar17:24
clarkbya I think the main concern is we don't keep mirror-info.sh up to date17:32
clarkbbut you could keep your thing up to date with the pieces you need17:32
clarkbeven if they function in the same way it's a shift in eyeballs and responsibility that ensures things stay up to date17:32
fungithe main way that script is useful is as a source of $NODEPOOL_MIRROR_HOST17:33
fungibut exporting that from ansible is also trivial17:33
*** jpena is now known as jpena|off18:08
*** melwitt_ is now known as melwitt18:08
*** deepcursion1 is now known as deepcursion18:45
*** artom__ is now known as artom19:08
*** ianw_ is now known as ianw19:11
corvushttp://lists.zuul-ci.org/pipermail/zuul-discuss/2022-May/001801.html is what i mentioned in the meeting20:16
ianw    system-config-zuul-role-integration-jammy https://zuul.opendev.org/t/openstack/build/82e184f32e5449a4ae619d6176e5497e : DISK_FULL in 13m 54s20:32
ianwdo we have a full executor?20:32
fungimore likely the job's workspace exceeded the allowed size20:32
ianwthat would certainly be unexpected for this job20:35
*** rlandy is now known as rlandy|biab20:35
jrosseri had a quick go at using zuul_site_mirror_info, but seems undefined20:42
jrossermaybe i've made a silly mistake https://review.opendev.org/c/openstack/openstack-ansible/+/84215820:44
jrosserah the docs say it is still work-in-progress20:45
ianw2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000)20:47
ianwTASK [get df disk usage] seems like it might help, but ... doesn't.  the only thing we seem to be copying is syslog.  i wonder if we're dumping something weird on jammy...20:51
ianw(i am getting this by looking in the logs at ze02)20:53
ianwAnsible output: b'base                       : ok=0    changed=0    unreachable=120:57
ianwso the df didn't work because the host was unreachable20:58
ianwroot@ubuntu-jammy-rax-ord-0029677363:/var/log# du -h syslog21:03
ianwthere is nothing immediately obviously wrong with jammy syslogs21:03
*** rlandy|biab is now known as rlandy21:07
fungigrowroot failed maybe?21:12
fungioh, nevermind, on the executor right21:16
corvusianw: keep in mind the result of the cleanup playbook doesn't impact the result of the job.  so while it's possible that the unreachable status could be related*, it's not a direct cause of a failure (*perhaps it's unreachable because of some momentary condition on the executor which caused disk_full and also broke outgoing ansible connections)21:27
corvusianw: if anything the other way around -- disk_full might cause weird ansible results as zuul begins to forcibly kill processes21:27
ianwright, looking at the logs from the executor, it seems like it went disk full after copying the syslog21:45
ianw2022-05-17 20:08:59,463 DEBUG zuul.AnsibleJob.output: [e: 35914214f6ab4ab6ab13fa3125ce1797] [build: 82e184f32e5449a4ae619d6176e5497e] Ansible output: b'TASK [upload-logs-swift : Upload logs to swift] ********************************'21:46
ianw2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000)21:46
ianwthat was the message immediately before the kill21:46
corvusthere might be a partial upload somewhere on one of our object storage providers21:52
corvusbut it would be difficult to find, and has a low probability of providing additional info21:53
ianwi have the full logs but https://paste.opendev.org/show/bVaO8DeNt5dsHEr9glrN/ is just the upload bits21:55
corvusianw: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_aa0/841525/1/check/system-config-zuul-role-integration-jammy/aa02f68/ has a 500M syslog21:58
corvusand the job that failed also ran in ovh21:58
ianwtheory, it's iptables logs21:59
corvusperhaps the nodes in ovh start out with larger syslogs?  and perhaps they've grown in the last week?21:59
ianwoh, good find!21:59
corvusthat syslog has lots of kernel stack traces22:00
ianwyeah.  my theory was that the jammy nodes are underused at the moment, so they've been sitting for a long time, perhaps filling up with things like iptables rejection logs22:01
ianwthat doesn't even look like much of an oops22:02
*** rlandy is now known as rlandy|out22:02
ianwit's right from boot22:04
ianwMay 12 03:08:39 ubuntu glean-early.sh[486]: DEBUG:glean:Detected distro : ubuntu22:04
ianwMay 12 03:08:39 ubuntu kernel: [    5.870384] Call Trace:22:04
ianwe.g. glean is still running even22:04
fungimight make sense to special-case syslog collection and strip earlier entries if over a sane size22:05
fungi100mb/day syslog seems like... a lot for an idle server22:06
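[editor's sketch] fungi's suggestion to strip earlier entries when a collected syslog is over a sane size could be approximated as below. This is a hypothetical helper, not an existing role or Zuul feature: the name trim_log_to_tail and the idea of passing a byte limit are assumptions for illustration. It keeps only the newest part of the file and starts the kept part on a line boundary.

```python
import os


def trim_log_to_tail(path, max_bytes):
    """Keep at most the last `max_bytes` of the file at `path`.

    Returns True if the file was trimmed, False if it was already small
    enough. The partial line at the cut point is discarded, so the result
    always starts at the beginning of a line (and may be slightly smaller
    than max_bytes).
    """
    size = os.path.getsize(path)
    if size <= max_bytes:
        return False
    with open(path, "rb") as f:
        f.seek(size - max_bytes)
        f.readline()  # drop the (probably partial) first line
        tail = f.read()
    with open(path, "wb") as f:
        f.write(tail)
    return True
```

Note that keeping the tail would have discarded the boot-time traces in this case, so a real implementation might prefer to keep both the head and the tail of an oversized log.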
ianwi think the real problem here is this flood of traces22:07
ianwbut there's, no bug/warning/etc. like a regular oops.  very weird22:08
corvusit's the "Everything's OK" oops22:09
*** dviroel is now known as dviroe|out22:12
ianwMay 12 03:10:34 ubuntu kernel: [  121.705979] Call Trace:22:12
ianwMay 12 03:10:34 ubuntu kernel: [  121.705981]  <TASK>22:12
ianwi've never seen this xml-ish "<TASK>" before.  it must be new i guess ... but it makes me suspicious what's putting this out22:13
ianwthe start of the logs is cut off, it feels like that must be where it started, and somehow it's ended up in a recursive loop dumping traces22:29
ianwi can try booting an ovh jammy and see if it replicates.  i'm on a rax node jammy and that's ok22:30
ianwone has booted in iweb and it's ok23:29
ianw| 0029678242 | ovh-gra1            | ubuntu-jammy                | 4bc97209-22c7-4ef5-ae24-fcbd95e77631 |   |23:41
ianwthis one *is* spewing the crap23:41
ianw[    2.277136] kernel: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff97490af4 (native_write_msr+0x4/0x20)23:42
ianwi think that's where it starts23:42
Clark[m]ianw: amorin may be interested if it is ovh specific23:47
Clark[m]https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1959721 may be related too?23:48
ianwhttps://bugzilla.redhat.com/show_bug.cgi?id=1808996 also discusses 0x48 23:50
ianwseems to be AMD related23:50
Clark[m]I didn't think the ovh nodes were amd based but it may have changed23:52
ianwhttps://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921880/comments/12 describes it23:57
ianw"Running a focal instance (5.4.0-70-generic) on the "before" system with the EPYC-Rome type on a Milan CPU results in the following error. This is due to the missing IBRS flag, which is one of the reasons, I'd like to see this backported ;-)"23:57
ianwalthough that says tried to write 0x6, whereas ours says 0x423:58
