Tuesday, 2022-05-17

opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] hold gerrit 3.5 for downgrade testing https://review.opendev.org/c/opendev/system-config/+/842013	01:07
*** ysandeep|out is now known as ysandeep|rover		05:04
frickler	infra-root: wiki.openstack.org seems to be running with a non-LE cert still, which will expire in 2 weeks. I guess we'll just have to follow that path once more?	05:43
*** gibi_pto is now known as gibi		05:51
frickler	ianw: some mirror hosts seem to be having issues with openafs updates, see service-mirror.yaml.log on bridge cf. https://zuul.opendev.org/t/openstack/build/1006e3bbd1264599b34f59129f3d7368	05:52
frickler	I can go through those manually if you rather want to end your day	05:52
ianw	umm, let me see	06:03
ianw	oh interesting, we don't encrypt/gpg the logs for TIMED_OUT i guess	06:04
ianw	that's an oversight	06:04
ianw	openafs-client : PreDepends: init-system-helpers (>= 1.54~) but 1.51 is to be installed	06:05
ianw	now that is interesting, that is something i must have stuffed up with the new uploads	06:06
ianw	this is a bionic system	06:06
ianw	i bet we have something different enabled in the ppa build environment than we do on production bionic systems	06:07
ianw	i bet it's backports	06:08
ianw	but, why doesn't this fail the gate?	06:12
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] trigger all openafs builds https://review.opendev.org/c/opendev/system-config/+/842039	06:15
*** ysandeep|rover is now known as ysandeep|rover|brb		06:15
ianw	yep the ppa has backports enabled. i'm not 100% sure when or why that happened. i guess we'll need to rebuild the bionic package without backports	06:17
ianw	Primary Archive for Ubuntu - BACKPORTS (main, restricted, universe, multiverse) (included on 2010-03-10)	06:18
ianw	https://launchpad.net/~openafs/+archive/ubuntu/stable	06:18
ianw	i guess it was copied from that?	06:18
ianw	it looks like it has debhelper = 12, bionic without backports only has 11	06:25
opendevreview	Ian Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042	06:29
opendevreview	Ian Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043	06:30
ianw	that has failed, mea culpa on that https://zuul.opendev.org/t/openstack/build/6e1dd24132af43309e8ceee1b72d58a4	06:31
frickler	hmm, so I wonder why did it not fail earlier?	06:33
opendevreview	Ian Wienand proposed opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042	06:39
opendevreview	Ian Wienand proposed opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043	06:39
*** ysandeep|rover|brb is now known as ysandeep|rover		06:48
ianw	frickler: it would have, but it didn't run	07:03
ianw	it is a bit hard to test. we could figure out a build test. but at that point, we've just re-written the ppa building	07:03
opendevreview	Merged opendev/infra-openafs-deb bionic: Rebuild for PPA without backports https://review.opendev.org/c/opendev/infra-openafs-deb/+/842042	07:09
opendevreview	Merged opendev/infra-openafs-deb bionic: gitreview: update branch to bionic https://review.opendev.org/c/opendev/infra-openafs-deb/+/842043	07:09
frickler	ianw: fair enough. the other thing that is not clear to me is why the job timed out, the playbook seems to have finished fine, just with some failed hosts. sadly there are no timestamps other than the one at the start	07:18
opendevreview	Ian Wienand proposed opendev/infra-openafs-deb bionic: Fix typo in changelog https://review.opendev.org/c/opendev/infra-openafs-deb/+/842046	07:25
ianw	yeah, the timeout, i'm not 100% sure	07:26
*** jpena|off is now known as jpena		07:31
opendevreview	Merged opendev/infra-openafs-deb bionic: Fix typo in changelog https://review.opendev.org/c/opendev/infra-openafs-deb/+/842046	07:38
ianw	ok, that one made it to building	07:43
ianw	and built ok for amd64; so the build-deps are ok for bionic. running 842039 through should test it, but i might just wait for the arm64 builds	07:50
*** ysandeep|rover is now known as ysandeep|rover|lunch		08:00
*** ysandeep|rover|lunch is now known as ysandeep|rover		09:21
*** rlandy|out is now known as rlandy		10:26
*** pojadhav is now known as pojadhav|afk		10:49
ianw	ok, that fixed it. the mirror jobs should deploy ok now	10:59
ianw	i'll check on it tomorrow	10:59
*** ysandeep|rover is now known as ysandeep|rover|brb		11:18
*** pojadhav|afk is now known as pojadhav		11:26
*** dviroel|out is now known as dviroel		11:28
*** ysandeep|rover|brb is now known as ysandeep|rover		11:51
fungi	apparently setuptools 62.3.0 (released yesterday) has deprecated "namespace_packages" which pbr uses when you have a module that lacks a __init__.py	14:18
fungi	any subdirectory pbr packages without a __init__.py (e.g. foo/tests/fixtures) raises a deprecationwarning now	14:18
fungi	adding foo/tests/fixtures/__init__.py silences it, but that seems a little misleading since importing foo.tests.fixtures is unlikely to ever happen	14:19
fungi	trying to think through how to solve it in pbr, but unfortunately it's hard to tell the difference between a directory which contains data and a directory which is meant to be a module namespace	14:20
fungi	it looks like pbr treats them the same at the moment	14:21
fungi	oh, also "Relying on include_package_data to ensure sub-packages are automatically added to the build wheel distribution (as “data”) is now considered a deprecated practice."	14:40
fungi	which pbr similarly seems to do	14:41
fungi	looks like it's impacting yamllint for me too	14:42
fungi	Python recognizes 'yamllint.conf' as an importable package, however it	14:42
fungi	is included in the distribution as "data". This behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'yamllint.conf' is included as a package by using setuptools' `packages` configuration field or the proper discovery methods. You can read more about "package discovery" and "data files" on setuptools documentation page.	14:44
clarkb	frickler: yes I was planning on purchasing a new cert next week. It's our last remaining non-LE cert instance because the host isn't directly managed by ansible	14:47
clarkb	fungi: the problem is the packages system doesn't seem any better from a usability standpoint :/	14:48
clarkb	lacks documentation also spits out a ton of warnings/errors if you don't get it just right	14:48
fungi	yeah	14:49
fungi	the error seems to be that setuptools is assuming any subdirectory of an importable module should be a package namespace. i need to read up more on how to tell it that a given directory is non-importable "data"	14:50
fungi	the fact that it's raising setuptools._deprecation_warning.SetuptoolsDeprecationWarning with a multi-line message makes it essentially impossible to filter with PYTHONWARNINGS	15:09
fungi	apparently i can filter on the parent exception class, which ends up looking like ignore::Warning:setuptools.command.build_py (still really broad unfortunately)	15:17
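Note: a minimal Python sketch of the category-based filtering fungi describes, and of why filtering on the parent class is broad. FakeSetuptoolsDeprecationWarning and surviving_messages are hypothetical stand-ins for illustration, not the real setuptools class or any setuptools API.

```python
import warnings


class FakeSetuptoolsDeprecationWarning(Warning):
    """Hypothetical stand-in; derives directly from Warning for this demo."""


def surviving_messages(ignored_category):
    """Emit two warnings and return the messages that pass the filters."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Filtering by category sidesteps matching the multi-line message.
        warnings.filterwarnings("ignore", category=ignored_category)
        warnings.warn("line one\nline two", FakeSetuptoolsDeprecationWarning)
        warnings.warn("still visible", UserWarning)
    return [str(w.message) for w in caught]
```

Ignoring the parent class Warning hides the unrelated UserWarning as well, which is the "still really broad" problem; the module field in the filter narrows it somewhat.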
*** ysandeep|rover is now known as ysandeep|out		15:24
*** dviroel is now known as dviroel|lunch		16:02
*** marios is now known as marios|out		16:03
johnsom	Hi neighbors, so following figuring out the missing unbound log issues on centos, I have figured out why we have a fairly high failure rate on the fips jobs. I would like your input/ideas on how best to resolve it.	16:09
johnsom	The fips jobs reboot the nodepool instance to enable the fips mode. The problem is devstack starts pretty quickly after that reboot. The unbound resolver is not always started in time for the repo installation step in devstack which later leads to packages failing to install.	16:10
johnsom	https://zuul.opendev.org/t/openstack/build/c6ab0b93e2e6485c9e2eb97dbd74805d/logs	16:10
johnsom	for the curious	16:10
clarkb	johnsom: one approach may be to have the ansible query for 127.0.1.1:53 to be listening again (or whatever the specific localhost ip is)	16:11
johnsom	I looked at using "systemctl is-system-running" to check if systemd is done, but pretty much all of my instances seem to report "degraded" for various reasons.	16:11
johnsom	clarkb Yeah, that was my next thought.	16:12
clarkb	infra-root I'm trimming wheel mirror quotas to 20GB now	16:12
johnsom	Any thoughts on timeouts or a link to places in the ansible we already do something similar? (I can dig, but if you have a top-of-head link it would be welcome)	16:14
clarkb	johnsom: there is an ansible module that can do that waiting let me see if I can find it	16:15
johnsom	clarkb Much appreciated	16:16
clarkb	johnsom: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/wait_for_module.html the ssh examples in there are probably close	16:16
clarkb	our resolver should do tcp as well as udp so that may just work	16:16
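Note: the same "wait until the resolver port is listening" idea can be sketched in plain Python, relying on the resolver answering over TCP as clarkb mentions. This is only an illustration, not the Ansible wait_for module; wait_for_tcp_port is a hypothetical helper, and the address and port to probe (for example a localhost IP and port 53) are assumptions.

```python
import socket
import time


def wait_for_tcp_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connect succeeds, False if timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would run something like wait_for_tcp_port("127.0.0.1", 53) right after the reboot and fail the job early if it returns False.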
johnsom	cool, thanks!	16:16
ade_lee	johnsom, where are you thinking of putting in that wait?	16:17
johnsom	ade_lee It will need to go into the fips enable Ansible somewhere I think. It really only applies to the reboot.	16:17
clarkb	another way to make it more generic might be to do reboot, wait for ssh, then wait for any dns record to resolve	16:19
clarkb	I assume you're failing because the resolver isn't actually up when you try to resolve a name but /etc/hosts says to use the local resolver	16:19
clarkb	then it doesn't depend on a local resolver working, just that anything can resolve	16:19
johnsom	Right	16:21
ade_lee	clarkb, johnsom yeah - the fips job is pretty low level being in zuul-jobs - we may need to be pretty generic	16:22
fungi	the wait could happen in devstack's playbook	16:22
fungi	assuming it's devstack trying to do stuff too soon after boot	16:23
ade_lee	yeah - thats the other place I was thinking about	16:23
fungi	trying to be generic about it gets into the "how do you determine when a system has reached steady-state" conundrum which distros have been trying to solve forever	16:23
fungi	when is a system fully "booted"?	16:24
johnsom	Actually, that is probably good. It might help devstack fail earlier when DNS is broken. Currently, it runs pretty far before realizing it is missing some packages	16:24
ade_lee	fungi, johnsom , clarkb so we're agreed that the best thing is to do a generic test to see if ssh is up and try to resolve something in devstack?	16:28
*** diablo_rojo_phone_ is now known as diablo_rojo_phone		16:29
*** open10k8s_ is now known as open10k8s		16:29
*** dwhite449 is now known as dwhite44		16:29
*** cloudnull4 is now known as cloudnull		16:29
*** andrewbonney_ is now known as andrewbonney		16:29
*** yoctozepto_ is now known as yoctozepto		16:29
*** snbuback_ is now known as snbuback		16:29
*** ricolin_ is now known as ricolin		16:29
johnsom	ade_lee Yeah, I think a check in devstack (if OFFLINE=false) have it loop/wait for a DNS resolution to succeed. maybe "opendev.org"? Not sure what is a good name to lookup.	16:30
*** diablo_rojo_phone is now known as Guest674		16:30
fungi	yes, dns resolution not ssh	16:32
clarkb	johnsom: it should lookup its mirror if we can manage that	16:32
clarkb	since the mirror is the main resource it is looking for externally	16:33
clarkb	opendev.org is maybe a reasonable stand in as they are in the same zone	16:33
ade_lee	clarkb, where is the mirror defined? we can try that if we know how to look it up.	16:34
fungi	there's a zuul variable which gets set... checking	16:37
johnsom	zuul_site_mirror_fqdn https://opendev.org/zuul/zuul-jobs/src/branch/master/test-playbooks/base-roles/configure-mirrors.yaml	16:37
johnsom	I'm looking to see if that gets dropped on the file system in some useful way that devstack could pick it up.	16:39
fungi	yeah, that's the one i was thinking of	16:40
fungi	if you want it from the filesystem, yeah the mirror-info role installs /etc/ci/mirror_info.sh which sets envvars	16:41
johnsom	Yeah, I don't see anything obvious. Also, that would tie the devstack install to a zuul run which we may want this to be a bit more generic.	16:41
fungi	in mirror-info we set it from the mirror_fqdn value: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L17	16:44
clarkb	ya reading it out of /etc/apt/sources.list is probably easiest for devstack? I'm not sure it has direct exposure to the zuul vars	16:45
clarkb	the mirror-info stuff is a holdover from zuul v2 but we haven't made any indications of deleting it and it is probably fine too?	16:45
ade_lee	I'd assume /etc/apt/sources.list is for ubuntu .. probably different for centos ..	16:46
ade_lee	maybe easiest to start with "opendev.org" to begin with?	16:46
clarkb	ya that is probably close enough and simple	16:48
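Note: a minimal sketch of the loop/wait-for-DNS check discussed above, written in Python rather than devstack's shell. wait_for_dns and its parameters are hypothetical; "opendev.org" is the name suggested in the discussion, and the timeout and interval values are assumptions.

```python
import socket
import time


def wait_for_dns(name, timeout=60.0, interval=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Retry resolving name until it succeeds; return the attempt count.

    Raises TimeoutError if no lookup succeeds before timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            resolve(name, None)
            return attempts
        except OSError:  # socket.gaierror is a subclass of OSError
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not resolve {name!r}")
            sleep(interval)
```

Calling wait_for_dns("opendev.org") before the repo installation step would make a broken resolver fail early instead of surfacing later as missing packages. The resolve and sleep parameters exist only so the loop can be exercised without real DNS.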
ade_lee	clarkb, fungi johnsom ok cool - I'll give that a shot and see what happens. I'll test against octavia and barbican and see if it fixes things	16:50
*** dviroel_ is now known as dviroel		16:52
jrosser	the projects with embedded ansibles seem pretty reliant on mirror-info stuff according to codesearch	16:55
clarkb	that is unfortunate. it was only there for the legacy stuff and expectation is you'd deal with it using the zuul info more directly	16:56
clarkb	I know tripleo used it extensively which is why we haven't bothered to remove it though	16:56
jrosser	is the same info available elsewhere?	16:56
clarkb	yes in the zuul ansible vars	17:00
clarkb	zuul_site_mirror_fqdn is the var name	17:00
fungi	it doesn't seem to appear in the zuul "inventory" we record, but yeah	17:00
fungi	i guess it's a general ansible var not a zuul var	17:01
clarkb	ah right. It should be recorded as a fact too? But ya the old shell script is from the land before time	17:03
jrosser	ah TIL https://opendev.org/zuul/zuul-jobs/src/branch/master/doc/source/mirror.rst	17:05
jrosser	though passing that to an embedded ansible is reasonably challenging as well	17:06
fungi	how to get ansible vars into a nested ansible seems to be a challenge	17:08
jrosser	currently i have to write out vars files in a pre playbook and then grab them from the embedded ansible	17:09
jrosser	so the motivation to move away from loading things from mirror-info.sh has been small	17:09
fungi	right, rather similar	17:24
clarkb	ya I think the main concern is we don't keep mirror-info.sh up to date	17:32
clarkb	but you could keep your thing up to date with the pieces you need	17:32
clarkb	even if they function in the same way it's a shift in eyeballs and responsibility that ensures things stay up to date	17:32
fungi	the main way that script is useful is as a source of $NODEPOOL_MIRROR_HOST	17:33
fungi	but exporting that from ansible is also trivial	17:33
*** jpena is now known as jpena|off		18:08
*** melwitt_ is now known as melwitt		18:08
*** deepcursion1 is now known as deepcursion		18:45
*** artom__ is now known as artom		19:08
*** ianw_ is now known as ianw		19:11
corvus	http://lists.zuul-ci.org/pipermail/zuul-discuss/2022-May/001801.html is what i mentioned in the meeting	20:16
fungi	thanks!	20:27
ianw	system-config-zuul-role-integration-jammy https://zuul.opendev.org/t/openstack/build/82e184f32e5449a4ae619d6176e5497e : DISK_FULL in 13m 54s	20:32
ianw	do we have a full executor?	20:32
fungi	more likely the job's workspace exceeded the allowed size	20:32
ianw	that would certainly be unexpected for this job	20:35
*** rlandy is now known as rlandy|biab		20:35
jrosser	i had a quick go at using zuul_site_mirror_info, but seems undefined	20:42
jrosser	maybe i've made a silly mistake https://review.opendev.org/c/openstack/openstack-ansible/+/842158	20:44
jrosser	ah the docs say it is still work-in-progress	20:45
ianw	2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000)	20:47
ianw	TASK [get df disk usage] seems like it might help, but ... doesn't. the only thing we seem to be copying is syslog. i wonder if we're dumping something weird on jammy...	20:51
ianw	(i am getting this by looking in the logs at ze02)	20:53
ianw	Ansible output: b'base : ok=0 changed=0 unreachable=1	20:57
ianw	so the df didn't work because the host was unreachable	20:58
ianw	root@ubuntu-jammy-rax-ord-0029677363:/var/log# du -h syslog	21:03
ianw	112K syslog	21:03
ianw	there is nothing immediately obviously wrong with jammy syslogs	21:03
*** rlandy|biab is now known as rlandy		21:07
fungi	growroot failed maybe?	21:12
fungi	oh, nevermind, on the executor right	21:16
corvus	ianw: keep in mind the result of the cleanup playbook doesn't impact the result of the job. so while it's possible that the unreachable status could be related, it's not a direct cause of a failure (perhaps it's unreachable because of some momentary condition on the executor which caused disk_full and also broke outgoing ansible connections)	21:27
corvus	ianw: if anything the other way around -- disk_full might cause weird ansible results as zuul begins to forcibly kill processes	21:27
ianw	right, looking at the logs from the executor, it seems like it went disk full after copying the syslog	21:45
ianw	2022-05-17 20:08:59,463 DEBUG zuul.AnsibleJob.output: [e: 35914214f6ab4ab6ab13fa3125ce1797] [build: 82e184f32e5449a4ae619d6176e5497e] Ansible output: b'TASK [upload-logs-swift : Upload logs to swift] ********************************'	21:46
ianw	2022-05-17 20:09:12,242 WARNING zuul.ExecutorDiskAccountant: /var/lib/zuul/builds/82e184f32e5449a4ae619d6176e5497e is using 10091MB (limit=5000)	21:46
ianw	that was the message immediately before the kill	21:46
corvus	there might be a partial upload somewhere on one of our object storage providers	21:52
corvus	but it would be difficult to find, and has a low probability of providing additional info	21:53
ianw	i have the full logs but https://paste.opendev.org/show/bVaO8DeNt5dsHEr9glrN/ is just the upload bits	21:55
corvus	ianw: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_aa0/841525/1/check/system-config-zuul-role-integration-jammy/aa02f68/ has a 500M syslog	21:58
corvus	and the job that failed also ran in ovh	21:58
ianw	theory, it's iptables logs	21:59
corvus	perhaps the nodes in ovh start out with larger syslogs? and perhaps they've grown in the last week?	21:59
ianw	oh, good find!	21:59
corvus	that syslog has lots of kernel stack traces	22:00
ianw	yeah. my theory was that the jammy nodes are underused at the moment, so they've been sitting for a long time, perhaps filling up with things like iptables rejection logs	22:01
ianw	that doesn't even look like much of an oops	22:02
*** rlandy is now known as rlandy|out		22:02
ianw	it's right from boot	22:04
ianw	May 12 03:08:39 ubuntu glean-early.sh[486]: DEBUG:glean:Detected distro : ubuntu	22:04
ianw	May 12 03:08:39 ubuntu kernel: [ 5.870384] Call Trace:	22:04
ianw	e.g. glean is still running even	22:04
fungi	might make sense to special-case syslog collection and strip earlier entries if over a sane size	22:05
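Note: a minimal sketch of the "strip earlier entries if over a sane size" idea for syslog collection. tail_bytes is a hypothetical helper, and the size cap a job would pass is an assumption; nothing like this exists in the log collection roles as far as this discussion shows.

```python
import os


def tail_bytes(path, max_bytes):
    """Return at most the last max_bytes of path, starting on a line boundary."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= max_bytes:
            return f.read()
        # Read one extra byte so a cut that lands exactly on a line
        # start is detected and that whole line is kept.
        f.seek(size - max_bytes - 1)
        chunk = f.read()
    newline = chunk.find(b"\n")
    if newline == -1:
        # One giant line: fall back to a raw byte cut.
        return chunk[1:]
    # Drop the partial first line so output begins with a whole entry.
    return chunk[newline + 1:]
```

Keeping the newest entries favors the job's own run over boot-time noise, though in this case the flood of traces may continue past boot, so the cap only bounds the size rather than fixing the cause.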
fungi	100mb/day syslog seems like... a lot for an idle server	22:06
ianw	i think the real problem here is this flood of traces	22:07
ianw	but there's no bug/warning/etc. like a regular oops. very weird	22:08
corvus	it's the "Everything's OK" oops	22:09
*** dviroel is now known as dviroe|out		22:12
ianw	May 12 03:10:34 ubuntu kernel: [ 121.705979] Call Trace:	22:12
ianw	May 12 03:10:34 ubuntu kernel: [ 121.705981] <TASK>	22:12
ianw	i've never seen this xml-ish "<TASK>" before. it must be new i guess ... but it makes me suspicious what's putting this out	22:13
ianw	the start of the logs is cut off, it feels like that must be where it started, and somehow it's ended up in a recursive loop dumping traces	22:29
ianw	i can try booting an ovh jammy and see if it replicates. i'm on a rax node jammy and that's ok	22:30
ianw	one has booted in iweb and it's ok	23:29
ianw	| 0029678242 | ovh-gra1 | ubuntu-jammy | 4bc97209-22c7-4ef5-ae24-fcbd95e77631 | 213.32.76.198 |	23:41
ianw	this one is spewing the crap	23:41
ianw	[ 2.277136] kernel: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff97490af4 (native_write_msr+0x4/0x20)	23:42
ianw	i think that's where it starts	23:42
Clark[m]	ianw: amorin may be interested if it is ovh specific	23:47
Clark[m]	https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1959721 may be related too?	23:48
ianw	https://bugzilla.redhat.com/show_bug.cgi?id=1808996 also discusses 0x48	23:50
ianw	seems to be AMD related	23:50
Clark[m]	I didn't think the ovh nodes were amd based but it may have changed	23:52
ianw	https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921880/comments/12 describes it	23:57
ianw	"Running a focal instance (5.4.0-70-generic) on the "before" system with the EPYC-Rome type on a Milan CPU results in the following error. This is due to the missing IBRS flag, which is one of the reasons, I'd like to see this backported ;-)"	23:57
ianw	although that says tried to write 0x6, whereas ours says 0x4	23:58
