*** markvoelker has quit IRC | 00:02 | |
*** Goneri has quit IRC | 00:06 | |
ianw | fedora mirroring timed out. i'm rerunning on afs01 under localauth ... | 00:07 |
ianw | lock still held on mirror-update | 00:08 |
*** bobh has joined #openstack-infra | 00:11 | |
*** bobh has quit IRC | 00:17 | |
*** rcernin has quit IRC | 00:30 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add fedora-30 testing to gate https://review.opendev.org/680531 | 00:32 |
*** ccamacho has quit IRC | 00:51 | |
ianw | pabelanger: ^ yay, i think that's it for a dib release with f30 support. will cycle back on it and get it going | 00:53 |
johnsom | FYI, even a patch with no changes in it (extra carriage return) is failing and retrying. So it's definitely not that one patch. | 00:59 |
johnsom | https://www.irccloud.com/pastebin/LBgx6Mhh/ | 00:59 |
ianw | so it's like 213.32.79.175 just disappeared on you? | 01:03 |
johnsom | That is the end of the console stream, then zuul retries it. Sometimes not even one tempest test reports before that happens. | 01:05 |
ianw | johnsom: i have debugged such things before, it is a pain :) in one case we tickled a kernel oops on fedora | 01:06 |
johnsom | That ssh isn't part of our test to my knowledge. I think it comes from ansible | 01:06 |
ianw | johnsom: i don't think it has been ported to !devstack-gate, but https://opendev.org/openstack/devstack-gate/src/branch/master/devstack-vm-gate-wrap.sh#L446 might be a good thing to try | 01:06 |
ianw | https://opendev.org/openstack/devstack-gate/src/branch/master/functions.sh#L972 is the actual setup | 01:07 |
ianw | if you have something listening for that, it might let you catch why the host just dies | 01:07 |
johnsom | Hmm, that would be handy actually. I could just hijack the syslog from the nodepool level. I would have to come up with a target. I don't have any cloud instances anymore. | 01:09 |
johnsom | ianw Thanks for the idea! | 01:09 |
ianw | i should just pull that into a quick role ... | 01:10 |
ianw | johnsom: you should be able to pull one with a public address via rdocloud, but i can create one for you if you like | 01:11 |
johnsom | Yeah, I haven't dipped my toe in those waters yet. I should learn the process. | 01:12 |
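What ianw points at boils down to loading the netconsole module on the test node so kernel messages stream over UDP to a listener that outlives the host. A minimal sketch, assuming root access and a reachable listener; the addresses, interface, MAC and ports below are placeholders, not values from the jobs discussed here:

```bash
# Sender (the test node): stream kernel messages over UDP.
# Parameter format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
sudo modprobe netconsole \
  netconsole=6665@198.51.100.5/eth0,6666@203.0.113.10/aa:bb:cc:dd:ee:ff

# Listener (e.g. a cloud instance you keep around): capture whatever the
# dying kernel manages to emit before the host goes away.
nc -u -l 6666 | tee netconsole.log
```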
*** sgw has quit IRC | 01:22 | |
*** exsdev has quit IRC | 01:26 | |
*** exsdev0 has joined #openstack-infra | 01:26 | |
*** exsdev0 is now known as exsdev | 01:27 | |
*** exsdev has quit IRC | 01:39 | |
*** exsdev has joined #openstack-infra | 01:40 | |
*** sgw has joined #openstack-infra | 01:44 | |
*** rcernin has joined #openstack-infra | 01:45 | |
*** yamamoto has joined #openstack-infra | 01:46 | |
*** gregoryo has joined #openstack-infra | 01:50 | |
*** yamamoto has quit IRC | 02:26 | |
*** yamamoto has joined #openstack-infra | 02:26 | |
*** bobh has joined #openstack-infra | 02:37 | |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 02:40 |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 02:42 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 02:46 |
*** yamamoto has quit IRC | 03:11 | |
*** ramishra has joined #openstack-infra | 03:13 | |
*** bobh has quit IRC | 03:14 | |
*** rh-jelabarre has joined #openstack-infra | 03:27 | |
*** dklyle has quit IRC | 03:29 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 03:40 |
*** yamamoto has joined #openstack-infra | 03:47 | |
*** ykarel has joined #openstack-infra | 03:49 | |
*** udesale has joined #openstack-infra | 03:53 | |
*** yamamoto has quit IRC | 04:11 | |
*** soniya29 has joined #openstack-infra | 04:12 | |
*** ykarel has quit IRC | 04:17 | |
*** ykarel has joined #openstack-infra | 04:36 | |
*** yamamoto has joined #openstack-infra | 04:40 | |
*** eernst has joined #openstack-infra | 04:54 | |
*** markvoelker has joined #openstack-infra | 04:55 | |
*** markvoelker has quit IRC | 05:00 | |
*** jaosorior has joined #openstack-infra | 05:03 | |
*** ricolin has joined #openstack-infra | 05:08 | |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Use promote job for releasenotes https://review.opendev.org/678430 | 05:08 |
AJaeger_ | config-core, please review https://review.opendev.org/677158 for cleanup and https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and extra tests for zuul-jobs: https://review.opendev.org/678573 | 05:10 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Evaluate CODEOWNERS settings during canMerge check https://review.opendev.org/644557 | 05:15 |
*** eernst has quit IRC | 05:29 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 05:29 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 05:44 |
*** raukadah is now known as chandankumar | 05:46 | |
*** kopecmartin|off is now known as kopecmartin | 05:56 | |
*** jtomasek has joined #openstack-infra | 06:02 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 06:02 |
*** e0ne has joined #openstack-infra | 06:09 | |
*** jbadiapa has joined #openstack-infra | 06:13 | |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 06:13 |
*** e0ne has quit IRC | 06:18 | |
*** jtomasek has quit IRC | 06:22 | |
openstackgerrit | Merged openstack/project-config master: Add file matcher to releasenotes promote job https://review.opendev.org/679856 | 06:22 |
openstackgerrit | Merged zuul/zuul-jobs master: Switch releasenotes to fetch-sphinx-tarball https://review.opendev.org/678429 | 06:28 |
*** snapiri has quit IRC | 06:35 | |
*** snapiri has joined #openstack-infra | 06:36 | |
*** snapiri has quit IRC | 06:36 | |
*** jtomasek has joined #openstack-infra | 06:40 | |
*** ccamacho has joined #openstack-infra | 06:42 | |
*** mattw4 has joined #openstack-infra | 06:44 | |
*** slaweq_ has joined #openstack-infra | 06:49 | |
*** snapiri has joined #openstack-infra | 06:49 | |
*** pgaxatte has joined #openstack-infra | 06:55 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add Fedora 30 nodes https://review.opendev.org/680919 | 06:56 |
*** shachar has joined #openstack-infra | 06:58 | |
*** mattw4 has quit IRC | 06:59 | |
*** yamamoto has quit IRC | 07:00 | |
*** snapiri has quit IRC | 07:00 | |
ianw | infra-root: appreciate eyes on https://review.opendev.org/680895 to fix testinfra testing for good; after my pebkac issues with how to depends-on: for github it works | 07:01 |
*** rcernin has quit IRC | 07:02 | |
*** slaweq_ has quit IRC | 07:06 | |
*** slaweq has joined #openstack-infra | 07:08 | |
*** yamamoto has joined #openstack-infra | 07:09 | |
*** trident has quit IRC | 07:10 | |
*** tesseract has joined #openstack-infra | 07:13 | |
*** hamzy_ has quit IRC | 07:19 | |
*** trident has joined #openstack-infra | 07:21 | |
*** hamzy_ has joined #openstack-infra | 07:29 | |
*** gfidente has joined #openstack-infra | 07:31 | |
*** ykarel is now known as ykarel|lunch | 07:31 | |
*** owalsh has quit IRC | 07:32 | |
*** owalsh has joined #openstack-infra | 07:33 | |
*** aluria has quit IRC | 07:33 | |
*** jpena|off is now known as jpena | 07:35 | |
*** aluria has joined #openstack-infra | 07:38 | |
*** kjackal has joined #openstack-infra | 07:45 | |
*** kaiokmo has joined #openstack-infra | 07:46 | |
*** pcaruana has joined #openstack-infra | 07:49 | |
*** markvoelker has joined #openstack-infra | 07:50 | |
*** sshnaidm|afk is now known as sshnaidm | 07:51 | |
*** sshnaidm is now known as sshnaidm|ruck | 07:51 | |
*** markvoelker has quit IRC | 07:55 | |
*** rpittau|afk is now known as rpittau | 07:55 | |
*** xenos76 has joined #openstack-infra | 07:56 | |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems https://review.opendev.org/680937 | 08:04 |
*** apetrich has joined #openstack-infra | 08:05 | |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems https://review.opendev.org/680937 | 08:06 |
*** dchen has quit IRC | 08:07 | |
*** panda has quit IRC | 08:09 | |
*** panda has joined #openstack-infra | 08:11 | |
*** rascasoft has quit IRC | 08:11 | |
*** rascasoft has joined #openstack-infra | 08:12 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 08:14 |
ianw | johnsom: ^ working for me | 08:16 |
*** ralonsoh has joined #openstack-infra | 08:18 | |
*** e0ne has joined #openstack-infra | 08:19 | |
jrosser | ianw: doesnt ansible_default_ipv4 contain the fields you need there rather than a bunch of shell commands? | 08:19 |
*** ykarel|lunch is now known as ykarel | 08:23 | |
*** tkajinam has quit IRC | 08:42 | |
*** derekh has joined #openstack-infra | 08:43 | |
*** gregoryo has quit IRC | 08:48 | |
*** kaiokmo has quit IRC | 08:48 | |
ianw | jrosser: yes, quite likely! it was just more or less a straight port from the old code that has been working for a long time | 08:56 |
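For reference, jrosser's suggestion is easy to check against a live host; the host alias below is hypothetical:

```bash
# Dump the fact jrosser mentions (host alias "testnode" is hypothetical).
ansible testnode -m setup -a 'filter=ansible_default_ipv4'
# The returned fact includes address, interface, gateway and macaddress,
# which could stand in for the ip/route shell pipeline ported from devstack-gate.
```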
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occurring on fast systems https://review.opendev.org/680937 | 08:59 |
*** hamzy_ has quit IRC | 09:07 | |
*** ociuhandu has joined #openstack-infra | 09:08 | |
*** ociuhandu has quit IRC | 09:10 | |
*** ociuhandu has joined #openstack-infra | 09:10 | |
*** ociuhandu has quit IRC | 09:14 | |
*** ociuhandu_ has joined #openstack-infra | 09:14 | |
*** ociuhandu_ has quit IRC | 09:18 | |
zbr | ianw: do you know who can help with a new bindep release? last one was more than a year ago. | 09:18 |
*** soniya29 has quit IRC | 09:23 | |
*** ociuhandu has joined #openstack-infra | 09:27 | |
*** arxcruz_pto is now known as arxcruz | 09:27 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 09:35 |
ianw | zbr: i would ask fungi first, i think he is back today his time. just to see if there was anything in the queue. but if he doesn't have time, etc. ping me back and i imagine i could tag one tomorrow too | 09:36 |
zbr | ianw: thanks. i think that there are a few more patches I could fix, and a bug that prevents running the tests on macos (mocking fails to work), and after this we can make a release. | 09:37 |
*** yamamoto has quit IRC | 09:40 | |
*** kaisers has quit IRC | 09:48 | |
*** kaisers has joined #openstack-infra | 09:50 | |
*** markvoelker has joined #openstack-infra | 09:51 | |
*** shachar has quit IRC | 09:53 | |
*** snapiri has joined #openstack-infra | 09:53 | |
*** ricolin_ has joined #openstack-infra | 09:55 | |
*** lpetrut has joined #openstack-infra | 09:55 | |
*** ricolin has quit IRC | 09:57 | |
*** markvoelker has quit IRC | 10:00 | |
*** rcernin has joined #openstack-infra | 10:08 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin https://review.opendev.org/680962 | 10:13 |
*** ricolin_ has quit IRC | 10:15 | |
*** spsurya has joined #openstack-infra | 10:21 | |
*** soniya29 has joined #openstack-infra | 10:26 | |
*** gfidente has quit IRC | 10:26 | |
*** panda is now known as panda|rover | 10:42 | |
*** kaisers has quit IRC | 10:44 | |
*** ociuhandu has quit IRC | 10:45 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Remove openSUSE 42.3 https://review.opendev.org/677158 | 10:47 |
*** dave-mccowan has joined #openstack-infra | 10:52 | |
*** ricolin_ has joined #openstack-infra | 10:53 | |
*** dciabrin_ has joined #openstack-infra | 10:53 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support https://review.opendev.org/674092 | 10:54 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support https://review.opendev.org/674092 | 10:54 |
*** dciabrin has quit IRC | 10:54 | |
*** shachar has joined #openstack-infra | 11:02 | |
*** nicolasbock has joined #openstack-infra | 11:02 | |
*** snapiri has quit IRC | 11:03 | |
*** pgaxatte has quit IRC | 11:05 | |
*** dchen has joined #openstack-infra | 11:05 | |
*** yamamoto has joined #openstack-infra | 11:06 | |
*** rpittau is now known as rpittau|bbl | 11:12 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Fix promote-stx-api-ref https://review.opendev.org/680977 | 11:13 |
*** udesale has quit IRC | 11:15 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 11:15 | |
AJaeger_ | config-core, simple fix, please review ^ | 11:15 |
*** ricolin_ is now known as ricolin | 11:15 | |
*** ociuhandu has joined #openstack-infra | 11:17 | |
*** gfidente has joined #openstack-infra | 11:18 | |
*** ociuhandu has quit IRC | 11:22 | |
*** kaisers has joined #openstack-infra | 11:28 | |
*** pgaxatte has joined #openstack-infra | 11:34 | |
*** jpena is now known as jpena|lunch | 11:35 | |
openstackgerrit | Merged opendev/system-config master: Recognize DISK_FULL failure messages (review_dev) https://review.opendev.org/673893 | 11:37 |
*** ociuhandu has joined #openstack-infra | 11:41 | |
*** dchen has quit IRC | 11:44 | |
*** gfidente has quit IRC | 11:46 | |
*** shachar has quit IRC | 11:48 | |
*** gfidente has joined #openstack-infra | 11:53 | |
*** goldyfruit has quit IRC | 11:56 | |
*** markvoelker has joined #openstack-infra | 11:56 | |
*** snapiri has joined #openstack-infra | 12:02 | |
*** markvoelker has quit IRC | 12:02 | |
*** rlandy has joined #openstack-infra | 12:03 | |
*** yamamoto has quit IRC | 12:05 | |
*** rfolco has joined #openstack-infra | 12:06 | |
*** yamamoto has joined #openstack-infra | 12:09 | |
*** rosmaita has joined #openstack-infra | 12:10 | |
*** hamzy_ has joined #openstack-infra | 12:11 | |
*** markvoelker has joined #openstack-infra | 12:12 | |
*** ociuhandu has quit IRC | 12:21 | |
*** iurygregory has joined #openstack-infra | 12:22 | |
*** rcernin has quit IRC | 12:22 | |
ralonsoh | hello folks, I have one question if you can help me | 12:27 |
ralonsoh | this is about Neutron rally-tasks | 12:27 |
ralonsoh | for example: https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/ | 12:30 |
ralonsoh | there are no logs or results, and the task is failing | 12:30 |
*** jpena|lunch is now known as jpena | 12:31 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 12:32 | |
frickler | ralonsoh: rally seems to be complicating access to logs by generating an index.html file. there are logs showing what I think is the error, see the end of https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/controller/logs/devstacklog.txt.gz | 12:36 |
*** rosmaita has quit IRC | 12:38 | |
*** happyhemant has joined #openstack-infra | 12:39 | |
ralonsoh | frickler, thanks! I'll take a look at this | 12:40 |
frickler | ralonsoh: this is where the index file comes from, probably for our new logging setup it would work better to place this into a subdir and register the result as an artifact in zuul https://opendev.org/openstack/rally-openstack/src/branch/master/tests/ci/playbooks/roles/fetch-rally-task-results/tasks/main.yaml#L52-L64 | 12:43 |
*** ociuhandu has joined #openstack-infra | 12:43 | |
*** aaronsheffield has joined #openstack-infra | 12:44 | |
ralonsoh | frickler, or rename this index file to keep the browser from reading it by default | 12:45 |
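The relocation frickler suggests could look roughly like the following inside the fetch-rally-task-results role; the paths are assumptions rather than what the role actually uses, and registering the subdirectory as an artifact via zuul_return would be the second half of the change:

```bash
# Sketch only: keep the generated report out of the top-level log listing
# so it no longer shadows the directory index. Paths are assumptions.
mkdir -p "$WORKSPACE/logs/rally-report"
mv "$WORKSPACE/rally_html/index.html" "$WORKSPACE/logs/rally-report/index.html"
```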
*** Goneri has joined #openstack-infra | 12:47 | |
*** ociuhandu has quit IRC | 12:48 | |
*** rosmaita has joined #openstack-infra | 12:50 | |
*** rpittau|bbl is now known as rpittau | 12:55 | |
*** mriedem has joined #openstack-infra | 12:57 | |
*** roman_g has joined #openstack-infra | 13:00 | |
*** soniya29 has quit IRC | 13:06 | |
AJaeger_ | clarkb: I think all Fedora 28 changes are merged with exception of devstack... Open changes that I'm aware of are https://review.opendev.org/#/q/topic:fedora-latest+is:open | 13:08 |
*** e0ne has quit IRC | 13:19 | |
*** udesale has joined #openstack-infra | 13:20 | |
*** KeithMnemonic has joined #openstack-infra | 13:24 | |
*** gfidente has quit IRC | 13:30 | |
*** gfidente has joined #openstack-infra | 13:32 | |
*** ociuhandu has joined #openstack-infra | 13:34 | |
*** Goneri has quit IRC | 13:34 | |
*** ociuhandu has quit IRC | 13:39 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: Add prepare-workspace-openshift role https://review.opendev.org/631402 | 13:45 |
*** ykarel is now known as ykarel|afk | 13:45 | |
*** gfidente has quit IRC | 13:46 | |
*** gfidente has joined #openstack-infra | 13:48 | |
*** prometheanfire has quit IRC | 13:49 | |
*** e0ne has joined #openstack-infra | 13:50 | |
*** prometheanfire has joined #openstack-infra | 13:50 | |
*** beekneemech is now known as bnemec | 13:54 | |
*** priteau has joined #openstack-infra | 13:54 | |
*** AJaeger_ is now known as AJaeger | 13:54 | |
AJaeger | config-core, please review https://review.opendev.org/680977 to fix stx API promote job, https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and for adding extra tests for zuul-jobs: https://review.opendev.org/678573 | 13:55 |
*** ykarel|afk is now known as ykarel | 13:56 | |
*** zzehring has joined #openstack-infra | 13:57 | |
*** yamamoto has quit IRC | 14:02 | |
*** eharney has joined #openstack-infra | 14:04 | |
*** bdodd has joined #openstack-infra | 14:07 | |
*** ociuhandu has joined #openstack-infra | 14:13 | |
openstackgerrit | Merged zuul/nodepool master: Fix Kubernetes driver documentation https://review.opendev.org/680879 | 14:13 |
openstackgerrit | Merged zuul/nodepool master: Add extra spacing to avoid monospace rendering https://review.opendev.org/680880 | 14:13 |
openstackgerrit | Merged zuul/nodepool master: Fix chroot type https://review.opendev.org/680881 | 14:13 |
*** Goneri has joined #openstack-infra | 14:15 | |
*** jamesmcarthur has joined #openstack-infra | 14:15 | |
*** ociuhandu has quit IRC | 14:17 | |
*** yamamoto has joined #openstack-infra | 14:21 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin https://review.opendev.org/680962 | 14:25 |
dtantsur | hi folks! the release notes job on instack-undercloud stable/queens started failing like this: https://e29864d8017a43b5dd67-658295aea286d472989a3acc71bbfe02.ssl.cf5.rackcdn.com/680698/1/check/build-openstack-releasenotes/fae0512/job-output.txt | 14:25 |
dtantsur | do you have any ideas? | 14:25 |
*** dklyle has joined #openstack-infra | 14:29 | |
*** ykarel is now known as ykarel|afk | 14:30 | |
*** lpetrut has quit IRC | 14:30 | |
*** yamamoto has quit IRC | 14:30 | |
*** yamamoto has joined #openstack-infra | 14:34 | |
*** yamamoto has quit IRC | 14:34 | |
*** yamamoto has joined #openstack-infra | 14:34 | |
AJaeger | dtantsur: let me check, we just switched the implementation... | 14:35 |
*** ociuhandu has joined #openstack-infra | 14:35 | |
dtantsur | AJaeger: it may be that you're trying to do something on master. the master branch is deprecated and purged of contents. | 14:35 |
AJaeger | dtantsur: that explains it... | 14:36 |
AJaeger | dtantsur: we check out master branch for releasenotes. So, if master is dead, kill the releasenotes job. | 14:36 |
dtantsur | AJaeger: is there anything to be done except for removing the releasenotes job? | 14:36 |
dtantsur | ah | 14:36 |
dtantsur | mwhahaha: ^^^ | 14:36 |
AJaeger | dtantsur: releasenotes does basically: git checkout master;tox -e releasenotes | 14:37 |
AJaeger | So, master needs to work for this... | 14:37 |
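AJaeger's description translates into a quick local reproduction; this assumes the usual reno/tox wiring and is not the exact job definition:

```bash
# Roughly what the releasenotes job does; on instack-undercloud this fails
# because the master branch has been emptied.
git clone https://opendev.org/openstack/instack-undercloud
cd instack-undercloud
git checkout master
tox -e releasenotes
```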
dtantsur | this does seem a bit inconvenient since the stable branches are supported and receive changes. | 14:37 |
mwhahaha | just propose the job removal | 14:37 |
*** armax has joined #openstack-infra | 14:37 | |
AJaeger | dtantsur: I don't expect you to make many changes to releasenotes on a released branch anymore... | 14:38 |
AJaeger | dtantsur: to change this, reno would need to be redesigned completely ;( | 14:39 |
dtantsur | fair enough | 14:39 |
mwhahaha | release notes include bug fixes | 14:39 |
*** yamamoto has quit IRC | 14:39 | |
mwhahaha | so i would assume every change should have one | 14:39 |
dtantsur | ++ | 14:39 |
*** ociuhandu has quit IRC | 14:39 | |
AJaeger | normally you do the bug fix on master and then backport ;) - and have master working... | 14:40 |
mwhahaha | ANYWAY let's just remove the job | 14:40 |
AJaeger | agreed | 14:40 |
dtantsur | mwhahaha: want me to do it? | 14:40 |
mwhahaha | yes plz | 14:41 |
AJaeger | dtantsur: best on all active branches | 14:41 |
dtantsur | ack | 14:41 |
dtantsur | thanks AJaeger | 14:41 |
*** markvoelker has quit IRC | 14:41 | |
*** mtreinish has quit IRC | 14:41 | |
AJaeger | glad that we could solve that so quickly - you had me shocked for a moment ;) | 14:41 |
*** mtreinish has joined #openstack-infra | 14:42 | |
AJaeger | This is the change that merged earlier: https://review.opendev.org/#/c/678429/ | 14:42 |
AJaeger | config-core, we can switch releasenotes to promote jobs now: Please review https://review.opendev.org/#/c/678430/ | 14:43 |
dtantsur | aha, so rocky has already been fixed by another patch. but not queens. | 14:43 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:50 |
openstackgerrit | Matt McEuen proposed openstack/project-config master: Add per-project Airship groups https://review.opendev.org/680717 | 14:52 |
*** ociuhandu has joined #openstack-infra | 14:52 | |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:52 |
frickler | AJaeger: do you want to amend the comment on 678430? otherwise just +A | 14:53 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:53 |
AJaeger | frickler: what do you propose? | 14:53 |
AJaeger | "Build and publish" -> "build and promote"? | 14:54 |
frickler | AJaeger: sounds more appropriate I'd think, yes | 14:54 |
AJaeger | frickler: let me do a followup - and update the other templates as well... | 14:55 |
* AJaeger will +A | 14:55 | |
frickler | AJaeger: ok | 14:56 |
AJaeger | thanks! | 14:56 |
AJaeger | frickler: could you review https://review.opendev.org/680977 as well, please? | 14:56 |
frickler | AJaeger: I did this morning | 14:58 |
frickler | well, before noon anyway ;) | 14:59 |
*** jroll has joined #openstack-infra | 15:00 | |
*** ykarel|afk is now known as ykarel | 15:00 | |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Mention promote in template description https://review.opendev.org/681044 | 15:01 |
AJaeger | frickler: ah, thanks. Then I need another core ;) | 15:01 |
AJaeger | frickler: here's the change - is that what you had in mind? ^ | 15:01 |
*** markmcclain has joined #openstack-infra | 15:02 | |
AJaeger | thanks, clarkb ! | 15:03 |
AJaeger | clarkb: care to review https://review.opendev.org/678430 as well? That will switch releasenotes to promote jobs as well... | 15:04 |
clarkb | AJaeger: ya I'll be working through your scrollback list between meeting stuff | 15:04 |
AJaeger | clarkb: thanks | 15:04 |
*** diablo_rojo__ has joined #openstack-infra | 15:05 | |
AJaeger | clarkb: ignore 678430, just approved ;) | 15:05 |
zbr | ianw: clarkb : bindep: please have a look at https://review.opendev.org/#/c/680954/ and let me know if it is ok. had to find a way to pass unittests on all platforms so I can test other incoming patches. | 15:05 |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Use promote job for releasenotes https://review.opendev.org/678430 | 15:06 |
AJaeger | clarkb: https://review.opendev.org/676430 and https://review.opendev.org/681044 are the two remaining ones I care about, ignore the backscroll ;) | 15:06 |
AJaeger | clarkb: these are ready as well https://review.opendev.org/680919 and https://review.opendev.org/680830 | 15:06 |
AJaeger | config-core as is https://review.opendev.org/#/c/680717/ | 15:07 |
*** gyee has joined #openstack-infra | 15:07 | |
AJaeger | no, not yet ;( | 15:07 |
*** goldyfruit has joined #openstack-infra | 15:08 | |
AJaeger | sorry, it IS ready | 15:08 |
openstackgerrit | Merged openstack/project-config master: Fix promote-stx-api-ref https://review.opendev.org/680977 | 15:10 |
*** markmcclain has quit IRC | 15:12 | |
*** jaosorior has quit IRC | 15:12 | |
*** ykarel is now known as ykarel|away | 15:13 | |
*** ykarel|away has quit IRC | 15:22 | |
donnyd | Just an update on FN. I am currently waiting on a part to come in the mail (should be here in the next few hours), and also I am trying to get someone out to do the propane installation today or tomorrow as well. | 15:30 |
*** ociuhandu has quit IRC | 15:31 | |
donnyd | I don't really want to do the generator tests with FN at full tilt when the fuel is hooked up... so if it is going to be done in the next 48 hours I will leave it out of nodepool... if it's gonna take a week... then I will put it back in and pull it again when the time is right... That is *if* that makes sense to everyone here | 15:32 |
roman_g | Hello team. Could I ask to force-refresh OpenSUSE mirror synk, please? This one: [repo-update|http://mirror.dfw.rax.opendev.org/opensuse/update/leap/15.0/oss/] Valid metadata not found at specified URL | 15:33 |
roman_g | OpenSUSE Leap 15.0, Updates, OSS | 15:33 |
roman_g | synk -> sync | 15:33 |
clarkb | donnyd: wfm | 15:34 |
fungi | roman_g: any idea if the official obs mirrors ever got fixed after the reorg? they're what's been blocking our automatic opensuse mirroring | 15:34 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Pagure - handle Pull Request tags (labels) metadata https://review.opendev.org/681050 | 15:34 |
AJaeger | ricolin: could you help merge https://review.opendev.org/678353 , please? | 15:34 |
fungi | donnyd: are you planning to load-test the generator with a resistor bank or something? otherwise you probably do want to test it out at full tilt | 15:34 |
clarkb | fungi: maybe we should just remove the obs mirroring for now | 15:35 |
roman_g | fungi: yes, zypper works perfectly with original repo at http://download.opensuse.org/update/leap/15.0/oss/ | 15:35 |
fungi | roman_g: i said obs | 15:35 |
clarkb | roman_g: the problem is with obs not the main distro mirrors | 15:35 |
clarkb | roman_g: we have a list of obs repos to mirror and they are not present in the upstream we were pulling from any longer | 15:35 |
AJaeger | dirk, can you help, please? ^ | 15:35 |
clarkb | unfortunately we do all the suse mirroring together so if obs fails we don't publish main distro updates | 15:35 |
donnyd | fungi well first I have to function check the system. Make sure the auto transfer switch doesn | 15:35 |
roman_g | what is obs? fungi clarkb | 15:35 |
fungi | we have a script which mirrors opensuse distribution files and some select obs package trees. the latter have been blocking the script succeeding | 15:35 |
clarkb | roman_g: packaging that isn't part of the main distro release. In this case its probably similar to epel | 15:36 |
donnyd | doesn't blow a hole in my wall or anything.. so it will be initially tested unloaded | 15:36 |
fungi | or similar to uca | 15:36 |
roman_g | open build service? | 15:36 |
clarkb | roman_g: yes | 15:36 |
fungi | yes | 15:36 |
donnyd | And that is like a 1 hour cycle | 15:36 |
donnyd | then I have to test it loaded (at least what I can do) | 15:37 |
roman_g | so we mirror not from download.opensuse.org, but from some other (obs) location? | 15:37 |
fungi | donnyd: sounds like a fun day! | 15:37 |
roman_g | what is that location? I could check if repo is in good condition there | 15:37 |
donnyd | the UPS places a 17 amp load on the system at idle... with everything at full tilt its a 21 amp load | 15:37 |
donnyd | plus the rest of my house circuits | 15:37 |
donnyd | So you are correct I want to test loaded... but usually you start small and make sure it works.. and the only way that is done is to keep FN out of the pool for now | 15:38 |
donnyd | I should know more later today on propane install | 15:38 |
clarkb | roman_g: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L82-L96 | 15:39 |
clarkb | roman_g: they all originate from suse | 15:39 |
clarkb | but they are not all equally mirrored | 15:39 |
clarkb | the upstreams we use for OBS no longer have all of the obs repos we want. And after some looking I was unable to find one that did | 15:40 |
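Before switching the MIRROR= upstream in that script, a candidate second-level mirror can be probed for both the distro tree and the OBS trees; the host name and module paths below are placeholders, not a recommendation:

```bash
# Check whether a candidate upstream carries what the script expects
# (host and paths are placeholders).
rsync --list-only rsync://mirror.example.org/opensuse/update/leap/15.0/oss/repodata/ | head
rsync --list-only rsync://mirror.example.org/opensuse/repositories/ | head
```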
zbr | is the os-loganalyze project orphaned or in need of more cores? it seems to have reviews lingering for years, like https://review.opendev.org/#/c/468762/ which was clearly not controversial in any way :p | 15:40 |
slittle1 | RE: starlingx/utilities ... we had one major package that failed to migrate to the new repo. We would like an admin to help us import the missing content. | 15:41 |
clarkb | zbr: the move to swift likely did orphan it. | 15:41 |
clarkb | zbr: in general though for changes like that we all have much higher priority items than removing pypy from tox.ini where it harms nothing | 15:42 |
zbr | clarkb: maybe we should find a better way to address these. I am asking because once I review a change it stays on my queue until it merges (or is abandoned) as I feel the need not to waste the time already invested. | 15:43 |
zbr | what can I do in such cases? sometimes I remove my vote, but the change will still be lingering there. | 15:44 |
AJaeger | slittle1: what is the problem? https://opendev.org/starlingx/utilities was imported, wasn't it? | 15:44 |
clarkb | zbr: probably the best way to address things like that case specifically is to coordinate repo updates when we remove pypy | 15:44 |
zbr | i would find it better to just abandon such changes, reducing the noise. | 15:44 |
clarkb | zbr: but I'm guessing the global pypy removal wasn't that thorough | 15:44 |
clarkb | zbr: if you click the 'x' next to your name you should be removed from it | 15:45 |
slittle1 | yes, but one of the packages that I was supposed to relocate into starlingx/utilities failed to move | 15:45 |
zbr | clarkb: that was just one example, as you can imagine I do not care much about pypy. in fact not at all. | 15:45 |
zbr | i was more curious about which approach should I take | 15:45 |
slittle1 | That package has a very long commit history. I stopped counting at 100 | 15:46 |
*** ykarel|away has joined #openstack-infra | 15:46 | |
AJaeger | slittle1: I don't understand "failed to move" - let's ask differently: Was the repo you prepared not what you expected, and therefore you want to set it up again from scratch? | 15:46 |
slittle1 | Yes, that's what I'm proposing | 15:46 |
AJaeger | slittle1: so, do you have a new git repo ready? Then an admin - if he has time - can force update it for you... | 15:47 |
AJaeger | (note I'm not an admin, just trying to figure out next steps) | 15:48 |
slittle1 | I'm preparing the new git repo now | 15:48 |
slittle1 | Might take a couple hours to test | 15:48 |
clarkb | zbr: personally I track my priorities external to Gerrit, when those intersect with Gerrit I do reviews | 15:48 |
AJaeger | slittle1: then please come back once you're ready ;) | 15:49 |
AJaeger | slittle1: so, prepare the repo, and then ask for somebody here to force push this to starlingx/utilities. | 15:50 |
AJaeger | slittle1: and better tell starlingx team to keep the repo frozen and not merge anything - and everything proposed to it might need a rebase... | 15:50 |
AJaeger | (and devs might need to check from scratch) - all depending on the changes you do (if they can be fast forwarded, it should be fine) | 15:51 |
roman_g | clarkb: thank you for your help. I'm not sure I fully understand the code you are referring to, as it mentions repositories I don't use. A few lines above https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L26 there is a MIRROR="rsync://mirror.us.leaseweb.net/opensuse", and this one has broken repository metadata -> seems that we | 15:51 |
roman_g | mirror a broken leaseweb repository. | 15:51 |
clarkb | roman_g: our mirrors are only as good as what we pull from | 15:51 |
clarkb | roman_g: I think what we need is for someone to find a reliable upstream and we'll switch to it. If that means we can also continue to mirror OBS then we'll do that otherwise maybe we just turn off OBS mirroring | 15:52 |
fungi | one possibility would be to split opensuse and obs to separate afs volumes and update them with separate scripts, but that would need someone to do the work | 15:52 |
*** rlandy is now known as rlandy|brb | 15:52 | |
*** chandankumar is now known as raukadah | 15:53 | |
clarkb | fungi: and still requires finding a working obs upstream | 15:53 |
paladox | corvus also you'll want to add redirecting /p/ to the upgrade list (only needed for 2.16+) | 15:53 |
roman_g | clarkb: could leaseweb be changed to something else (original opensuse mirror)? http://provo-mirror.opensuse.org/ this one, for example. or http://download.opensuse.org/ this one. | 15:53 |
paladox | i only just remembered | 15:53 |
clarkb | roman_g: we don't use the main one because they limit rsync connections aggressively and we never update | 15:53 |
clarkb | roman_g: but we can point to any other second level mirror | 15:53 |
*** rpittau is now known as rpittau|afk | 15:54 | |
*** igordc has joined #openstack-infra | 15:54 | |
zbr | clarkb: ok, i will start removing myself from such reviews, but I will also post them here with recommendation to abandon, like: https://review.opendev.org/#/c/385217/ | 15:55 |
*** mattw4 has joined #openstack-infra | 15:58 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 16:00 | |
*** diablo_rojo__ is now known as diablo_rojo | 16:00 | |
*** mattw4 has quit IRC | 16:01 | |
*** mattw4 has joined #openstack-infra | 16:01 | |
*** mattw4 has quit IRC | 16:05 | |
*** mattw4 has joined #openstack-infra | 16:06 | |
*** tesseract has quit IRC | 16:07 | |
*** igordc has quit IRC | 16:09 | |
clarkb | zbr: another strategy we could probably get better at is trusting cores to single approve simple changes like that (but we've had a long history of two core reviewers so that probably merits more communication to tell people its ok in the simple case) | 16:09 |
*** e0ne has quit IRC | 16:10 | |
*** mattw4 has quit IRC | 16:10 | |
clarkb | infra-root I'd like to restart the nodepool launcher on nl02 to rule out bad config loads with the numa flavor problem that sean-k-mooney has had. FN is currently set to max-servers of zero so we won't be able to check it until back in the rotation but this way that is done and out of the way | 16:11 |
clarkb | that said the problems nodepool has are lack of ssh access | 16:11 |
clarkb | according to the logs anyway | 16:11 |
clarkb | and if we got that far then the config should be working | 16:11 |
zbr | not a bad idea, especially for "downsized" projects where maybe there is not even a single core still active. same applies to abandons: cores should not be worried about pressing that button, it is not like "delete", anyone can still restore them if needed. | 16:11 |
clarkb | but testing a boot with that flavor and our images myself by hand I had no issues getting onto the host via ssh in a relatively quick amount of time | 16:12 |
Shrews | clarkb: could you go ahead and make sure that openstacksdk is at the latest (i think it was last i checked) and restart nl01 as well? it contains a swift feature that auto-deletes the image objects after a day | 16:12 |
clarkb | Shrews: I can | 16:12 |
Shrews | ossum | 16:13 |
clarkb | I may as well rotate through all 4 so they are running the same code. I'll do that if 02 looks happy | 16:13 |
clarkb | Shrews: openstacksdk==0.35.0 is what is on 02 and that appears to be current | 16:14 |
*** virendra-sharma has joined #openstack-infra | 16:14 | |
sean-k-mooney | clarkb: the ci runs with the old label passed on saturday before FN was scaled down by the way https://zuul.opendev.org/t/openstack/build/dd0f5dad770d40a2afb3c506327d1b3e so it is just the new label that is broken | 16:14 |
clarkb | Shrews: nodepool==3.8.1.dev10 # git sha f2a80ef is the nodepool version there | 16:14 |
Shrews | clarkb: yes, that's the version. it just sets an auto-expire header on the objects (so not really a "feature" but a good thing) | 16:14 |
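The swift side of what Shrews describes is just an expiry header on the uploaded object; a hedged sketch, with the container and object names being hypothetical and 86400 seconds standing in for "a day":

```bash
# Equivalent of the auto-expire behaviour, applied by hand (names hypothetical).
swift post -H 'X-Delete-After: 86400' images some-image-object
```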
clarkb | sean-k-mooney: ya I saw that. That is what made me suspect maybe config loading | 16:15 |
clarkb | sean-k-mooney: once FN is back in the rotation we should be able to confirm or deny | 16:15 |
sean-k-mooney | ack | 16:15 |
sean-k-mooney | i probably should tell people FN is offline/disabled currently too | 16:15 |
Shrews | clarkb: nodepool version is good. i don't see anything of significance that might cause issues | 16:15 |
Shrews | since 3.8.0 release, that is | 16:16 |
openstackgerrit | Mohammed Naser proposed zuul/zuul master: spec: use operator-sdk for kubernetes operator https://review.opendev.org/681058 | 16:16 |
zbr | fungi: there are only 4 more pending reviews on bindep, after we address them, can we make an anniversary release? | 16:17 |
clarkb | Shrews: nl02 has been restarted. Limestone is having trouble (but appeared to have trouble prior to the restart too) going to understand that before doing 01 03 and 04 | 16:17 |
clarkb | logan-: fyi ^ "openstack.exceptions.SDKException: Error in creating the server (no further information available)" is what we get from the sdk | 16:17 |
clarkb | logan-: I'm going to try a manual boot now | 16:17 |
*** virendra-sharma has quit IRC | 16:20 | |
clarkb | logan-: {'message': 'No valid host was found. There are not enough hosts available.', 'code': 500, 'created': '2019-09-09T16:19:42Z'} | 16:20 |
clarkb | Shrews: ^ seems like sdk should be able to bubble that message up? maybe that error isn't available initially? | 16:21 |
Shrews | clarkb: did that come from the node.fault.message? | 16:21 |
clarkb | Shrews: yes | 16:21 |
*** markvoelker has joined #openstack-infra | 16:21 | |
Shrews | clarkb: https://review.opendev.org/671704 should begin logging that | 16:22 |
Shrews | clarkb: the issue is you need to either list servers with detailed info (which nodepool does not), or refetch the server after failure | 16:22 |
clarkb | oh right cool | 16:22 |
clarkb | I shall rereview it when this restart expedition is complete | 16:22 |
*** dtantsur is now known as dtantsur|afk | 16:23 | |
*** virendra-sharma has joined #openstack-infra | 16:23 | |
Shrews | clarkb: yeah, i think it's good to go now, but i can't +2 it since I've added to the review | 16:24 |
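Until that change is in, the same fault detail can be pulled by hand, which is what "refetch the server after failure" amounts to; the server name is an example:

```bash
# Re-fetch the failed server to see nova's fault message (name is an example).
openstack server show clarkb-test-boot | grep -i fault
```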
AJaeger | promotion of releasenotes is working fine so far - note that we now only update releasenotes if those are changed and not after each merge. | 16:25 |
logan- | clarkb: yep, thats expected. i disabled the host aggregate on our end out of caution since we had a transit carrier acting up last week. i am planning to re-enable soon, later today or tomorrow, when i am around for a bit to monitor job results | 16:25 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 16:26 |
logan- | thats how i typically stop nodepool from launching nodes while allowing it to finish up and delete already running jobs. does it mess anything up when I do it that way? | 16:26 |
clarkb | logan-: thanks for confirming. As a side note you can update the instance quota to zero and nodepool will stop launching against that cloud | 16:26 |
clarkb | it doesn't mess anything up, just noisy logs | 16:26 |
*** spsurya has quit IRC | 16:27 | |
logan- | gotcha | 16:27 |
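For completeness, the cloud-side knob clarkb mentions looks roughly like this; the project name is a placeholder:

```bash
# Stop new launches while existing nodes drain (project name is a placeholder).
openstack quota set --instances 0 opendev-ci-project
# The nodepool-side equivalent is setting max-servers: 0 for the provider.
```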
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix tox python3 overrides https://review.opendev.org/605613 | 16:32 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator https://review.opendev.org/676458 | 16:32 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add some more default help https://review.opendev.org/570721 | 16:33 |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant https://review.opendev.org/681063 | 16:36 |
mnaser | ^ if anyone is around for a trivial patch.. | 16:37 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator https://review.opendev.org/676458 | 16:40 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Deploy Zuul cluster using operator https://review.opendev.org/681065 | 16:40 |
*** rlandy|brb is now known as rlandy | 16:41 | |
*** pgaxatte has quit IRC | 16:42 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Mention promote in template description https://review.opendev.org/681044 | 16:43 |
openstackgerrit | Merged zuul/zuul-jobs master: Switch to fetch-sphinx-tarball for tox-docs https://review.opendev.org/676430 | 16:46 |
*** ykarel|away has quit IRC | 16:46 | |
*** jpena is now known as jpena|off | 16:47 | |
AJaeger | infra-root, we still have 6 hold nodes - are they all needed? | 16:47 |
mnaser | AJaeger: if any are mine, i dont need them | 16:48 |
*** ociuhandu has joined #openstack-infra | 16:48 | |
clarkb | I can take a look as part of my nodepool restarts. I've just done 03 and am confirming things work there since 02 only has offline clouds currently | 16:48 |
*** diablo_rojo has quit IRC | 16:49 | |
Shrews | wow, one has been in hold for 49 days | 16:49 |
Shrews | AJaeger: i suspect mordred has simply forgotten about that one :) | 16:50 |
AJaeger | ;) | 16:50 |
Shrews | didn't pabelanger say we could delete his held node the other day? | 16:51 |
mnaser | please dont interrupt that one, that's their bitcoin miner :) | 16:51 |
Shrews | mnaser: must be a deep mine | 16:52 |
AJaeger | mwhahaha, EmilienM, I see at http://zuul.opendev.org/t/openstack/status that https://review.opendev.org/680780 runs a non-voting job in gate - could you remove that one, please? | 16:52 |
AJaeger | Shrews: he did, so go ahead and delete it... | 16:52 |
mwhahaha | weshay: -^ | 16:52 |
AJaeger | mwhahaha, EmilienM, weshay, job is "tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates | 16:52 |
clarkb | ok there are nodes that went from building to in-use on nl03 now. Going to do 04 | 16:53 |
AJaeger | weshay, mwhahaha, and https://review.opendev.org/674065 has tripleo-build-containers-centos-7-buildah as non-voting in gate | 16:53 |
weshay | hrm.. k.. thanks for letting me know | 16:53 |
*** ykarel|away has joined #openstack-infra | 16:53 | |
Shrews | mnaser: you have a held node that i'll go ahead and delete now | 16:53 |
pabelanger | Shrews: yes, it can be deleted. Thanks | 16:54 |
*** mattw4 has joined #openstack-infra | 16:56 | |
*** bdodd has quit IRC | 16:56 | |
Shrews | AJaeger: since that change that mordred's held node is based on has merged, i'm going to assume that is not needed anymore either | 16:56 |
Shrews | ianw: you now have the only 2 held nodes. will let you decide if they are needed or not | 16:57 |
clarkb | Shrews: for the case where there are ready nodes locked for many many days is there a way to break the lock without restarting the zuul scheduler? | 16:58 |
clarkb | Shrews: 0010553302 for example | 16:58 |
*** trident has quit IRC | 16:58 | |
Shrews | clarkb: we should first verify who owns the lock. let me look | 16:58 |
Shrews | there's been a long standing zuul bug around that | 16:59 |
*** derekh has quit IRC | 17:00 | |
clarkb | #status log Restarted nodepool launchers with OpenstackSDK 0.35.0 and nodepool==3.8.1.dev10 # git sha f2a80ef | 17:00 |
openstackstatus | clarkb: finished logging | 17:00 |
roman_g | clarkb: I've asked SUSE guys if they could reach out to their OpenSUSE community and find a good CDN to rsync OpenSUSE from. | 17:01 |
clarkb | roman_g: fwiw AJaeger asked dirk here to do that too in scrollback (though he is CEST so may not happen today) | 17:01 |
clarkb | roman_g: let us know what you find out | 17:01 |
*** diablo_rojo has joined #openstack-infra | 17:02 | |
ricolin | AJaeger, done | 17:03 |
roman_g | clarkb, AJaeger, just for reference, I was replying to Jean-Philippe Evrard (SUSE) here https://review.opendev.org/#/c/672678/ in comment to patch set 4. | 17:03 |
Shrews | clarkb: yeah, zuul owns that lock. there is an out-of-band way to do it (just delete the znode) but that's not recommended | 17:03 |
clarkb | Shrews: ya it should go away if we restart the scheduler but now is a bad time for that due to job backlogs | 17:04 |
clarkb | (and I was thinking freeing up ~4 more nodes will help a bit with the backlog) | 17:04 |
Shrews | i just freed up 4 held nodes, so that might help a small bit | 17:05 |
clarkb | johnsom: fungi ianw re job network problems if it is rax specific it could be that we are seeing duplicate IPs there again? | 17:05 |
clarkb | in the past we would get job failures for those errors because ansible didn't return the correct rc code for network trouble but that has been fixed and so now we retry | 17:05 |
clarkb | however if it happens on more than just rax and it is always triggered during the start of tempest tests I would suspect something in the tempest tests | 17:06 |
fungi | clarkb: if we can manage to correlate the failures to a specific subset of recurring ip addresses, that can explain it | 17:06 |
clarkb | frickler: figured out worlddumping which will help with the dns debugging too I think | 17:06 |
johnsom | I have seen it fail after 4-5 tests have passed, it's not at the start | 17:06 |
*** trident has joined #openstack-infra | 17:06 | |
clarkb | johnsom: ok then during tempest | 17:06 |
clarkb | both ipv6 only clouds are offline currently so the dns issues shouldn't be happening for a bit | 17:07 |
clarkb | hopefully long enough to get worlddump fix merged in devstack | 17:07 |
clarkb | johnsom: the point was more that if it is duplicate IP problem we should see that during any point of time in a job | 17:08 |
clarkb | and it would be cloud/region specific | 17:08 |
clarkb | if however it happens during a common job location and on variety of clouds we probably don't have that problem | 17:08 |
johnsom | Yes, it is a definite possibility | 17:08 |
donnyd | ok electrical is done for FN and it looks like nova may need to use it.. So how about we put it back in service and work around the gas install | 17:08 |
clarkb | donnyd: I will defer to you on that | 17:09 |
donnyd | I am thinking only compute jobs, maybe hold off on swift logs | 17:09 |
clarkb | wfm | 17:09 |
*** ramishra has quit IRC | 17:09 | |
clarkb | https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e47/680797/1/check/openstack-tox-py36/e477716/ is still working | 17:09 |
clarkb | AJaeger: ^ should we turn ovh back on for swift logs? | 17:09 |
fungi | clarkb: last word from amorin was that the suspendbot was done for yesterday, but no idea when it would come back today and whether it would be fixed | 17:10 |
clarkb | fungi: ya its been just over 24 hours | 17:12 |
AJaeger | it's more than 24 hours since we talked - i would expect it to run again in that time. | 17:12 |
AJaeger | clarkb: let me remove my WIP from https://review.opendev.org/#/c/680855/ | 17:12 |
AJaeger | we might want to wait another 24 hours - or just go for it... | 17:12 |
clarkb | johnsom: fungi http://zuul.openstack.org/build/74d1733dd44f4233a37ef43039e338e8/log/job-output.txt#36386 that may be a clue (that job was a retry_limit) | 17:13 |
AJaeger | thanks, ricolin | 17:14 |
clarkb | johnsom: fungi ze04 (where that build ran) has disk now and doesn't have leaked builds | 17:14 |
openstackgerrit | Donny Davis proposed openstack/project-config master: Bringing FN back online for CI Jobs https://review.opendev.org/681075 | 17:15 |
fungi | clarkb: johnsom: oh! so another thing worth noting is that nodes in rackspace have a smallish rootfs with more space mounted at /opt | 17:15 |
AJaeger | roman_g: yeah, evrardjp is also a good contact - thanks | 17:15 |
clarkb | fungi: ya though in this case the thing that ran out of disk was the executor | 17:15 |
fungi | ahh, okay, there goes that idea | 17:15 |
AJaeger | config-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review | 17:18 |
clarkb | AJaeger: fwiw I asked airship if they are ready for their global acl updates to go in and haven't gotten an answer yet | 17:19 |
clarkb | it will prevent them from approving changes while new groups are configured so I figured I would ask | 17:20 |
AJaeger | clarkb: looking at the commit message: If we add the airship-core group everywhere, it should work just fine, wouldn't it? | 17:20 |
AJaeger | clarkb: but double checking is better ;) Thanks | 17:20 |
donnyd | clarkb: fungi so even though FN doesn't have any jobs running, the images should have still been loaded from nodepool.. Correct? | 17:21 |
fungi | donnyd: correct | 17:21 |
donnyd | and the mirror never actually went down | 17:21 |
fungi | nodepool continues performing uploads even if max-servers is 0 | 17:21 |
donnyd | i did have enough juice to keep it up this whole time | 17:22 |
fungi | awesome. thanks! | 17:22 |
clarkb | AJaeger: ya though it won't have the effect they want (they will still need to make changes) | 17:22 |
donnyd | so it should also be up to date with whatever has been done... | 17:22 |
clarkb | 23GB is executor-git and 7.9GB is builds in /var/lib/zuul | 17:23 |
clarkb | git will hardlink so that 23GB seems large but should be reasonable as its shared across builds | 17:23 |
clarkb | there are 36GB available on ze04, which represents 53% free on that fs | 17:24 |
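Those numbers come from the usual du/df pass over the executor state directory; the paths follow clarkb's description above:

```bash
# Run on the executor; directory names per the figures quoted above.
sudo du -sh /var/lib/zuul/executor-git /var/lib/zuul/builds
df -h /var/lib/zuul
```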
*** gfidente has quit IRC | 17:25 | |
clarkb | AJaeger: fungi maybe you can reach out to amorin tomorrow CEST time and see if amorin has any more input? then we can enable early tomorrow if amorin is happy with it at that point? | 17:26 |
corvus | paladox: can you elaborate on redirecting /p/ ? do we need to do something like /p/(.*) -> /$1 ? | 17:27 |
paladox | yup | 17:27 |
paladox | polygerrit takes over /p/ | 17:27 |
paladox | so cloning over /p/ breaks | 17:27 |
fungi | clarkb: hopefully my availability tomorrow morning is better than today was. i've added a reminder for it and will see what i can do | 17:27 |
paladox | corvus https://www.gerritcodereview.com/2.16.html#legacy-p-prefix-for-githttp-projects-is-removed | 17:28 |
*** jamesmcarthur has quit IRC | 17:28 | |
paladox | corvus https://github.com/wikimedia/puppet/commit/4a2a25f3cbcbabd03b6291459941304e67bbd1c5#diff-4d7f1c048cc827721ef9298a98d1f5d9 is what i did | 17:28 |
paladox | to prevent breaking clones && to prevent breaking PolyGerrit project dashboard support. | 17:29 |
corvus | paladox: gotcha, so that's targeting just cloning then | 17:29 |
corvus | that regex in your change | 17:29 |
paladox | yup | 17:29 |
clarkb | http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 vexxhost has some weird looking node status graphs | 17:30 |
paladox | we will be upgrading soon :) | 17:30 |
paladox | spoke with one of the releng members and with 2.15 going EOL, we will want to be going to 2.16 very soon. | 17:30 |
corvus | paladox: cool thanks! i added it to our list :) | 17:30 |
paladox | :) | 17:30 |
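The shape of the redirect corvus describes is a single rewrite of /p/(.*) to /$1; paladox's linked puppet change is narrower (git traffic only) so PolyGerrit's own /p/ dashboard URLs keep working, and that nuance is not reproduced in this sketch:

```bash
# Rough shape only -- the real rule should be restricted to git clone traffic.
sudo tee /etc/apache2/conf-available/gerrit-p-redirect.conf <<'EOF'
RewriteEngine On
RewriteRule ^/p/(.*)$ /$1 [R=301,L]
EOF
```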
* clarkb figures out boot from volume to test vexxhost | 17:31 | |
paladox | corvus you'll also want to convert any of your its-* rules to soy too! | 17:32 |
clarkb | I think we may have leaked volumes | 17:33 |
clarkb | I'm digging in | 17:33 |
fungi | paladox: oh, fun... what's soy? | 17:34 |
*** ociuhandu has quit IRC | 17:34 | |
paladox | Soy is https://github.com/google/closure-templates | 17:34 |
Shrews | clarkb: i would not doubt leaked volumes. there is something weird there that i couldn't quite figure out (i thought i had the bug down at one time, but alas, not so much) | 17:34 |
paladox | https://github.com/wikimedia/puppet/tree/production/modules/gerrit/files/homedir/review_site/etc/its is how it looks under soy :) | 17:35 |
*** ociuhandu has joined #openstack-infra | 17:36 | |
clarkb | Shrews: well volume list shows we have ~37 volumes at 80GB each. The quota error we got says we are using 6080GB which is enough for 76 instances | 17:36 |
Shrews | yowza | 17:36 |
clarkb | Shrews: so the volume leak isn't the only thing going on here from what I can tell (we do appear to have leaked a small number of volumes which I'm trying to trim now since that is actionable on our end) | 17:36 |
paladox | fungi it's also what is used for PolyGerrit's index.html :) | 17:37 |
paladox | it helped me to introduce support for base url's | 17:37 |
clarkb | Shrews: its basically 2 times our consumption | 17:37 |
fungi | thanks paladox. luckily we don't have a bunch of its rules: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L213-L234 | 17:37 |
Shrews | clarkb: can you capture that info for the ones you're deleting for me? at least a few of the most recent ones | 17:37 |
paladox | ah! | 17:37 |
paladox | ok | 17:37 |
paladox | i had to do the support for soy in its-base for security reasons :( | 17:38 |
clarkb | Shrews: ya I'll get a list and then get full volume shows for each of them (thats basically how I'm combing through to decide they are really leaked) | 17:38 |
Shrews | cool | 17:38 |
openstackgerrit | Merged openstack/project-config master: Bringing FN back online for CI Jobs https://review.opendev.org/681075 | 17:38 |
corvus | paladox, fungi: if we're using its-storyboard, do we still need to use soy? | 17:39 |
paladox | nope | 17:39 |
paladox | you don't use velocity it seems per the link fungi gave | 17:39 |
fungi | corvus: i thought its rules were how we configured its-storyboard (well, that and commentlinks, which seem to be integrated with how its does rule lookups) | 17:39 |
*** ricolin has quit IRC | 17:40 | |
fungi | is soy replacing the commentlinks feature too? | 17:40 |
paladox | though, its rules can now be configured from the UI :) | 17:40 |
paladox | fungi nope | 17:40 |
paladox | it's replacing the velocity feature | 17:40 |
*** ociuhandu has quit IRC | 17:40 | |
fungi | ahh, i have no idea what velocity is anyway | 17:42 |
fungi | just as well, i suppose | 17:42 |
paladox | https://velocity.apache.org | 17:42 |
clarkb | Shrews: bridge.o.o:/home/clarkb/vexxhost-leaked-volumes.txt | 17:42 |
clarkb | Shrews: I'm going to delete them now | 17:43 |
fungi | paladox: okay, so that was providing some sort of macro/templating capability i guess | 17:43 |
paladox | yup | 17:43 |
clarkb | Shrews: they are all from august 7th | 17:43 |
Shrews | clarkb: thx | 17:43 |
fungi | clarkb: last time i embarked on a vexxhost leaked image cleanup adventure, i similarly discovered a bunch of images hanging around that all leaked within a few minutes of each other on the same day, weeks prior | 17:44 |
fungi | something happens briefly there to cause this, i guess | 17:44 |
paladox | do you set javaOpts through puppet (for gerrit.config) | 17:44 |
Shrews | clarkb: fungi: my last exploration of this found they seemed to be associated with the inability to delete a node because of "volume in use" errors, iirc | 17:45 |
fungi | Shrews: correct, same for me. something caused server instances to go away ungracefully and left cinder thinking the volumes for them were still in use | 17:46 |
Shrews | maybe this new batch will have some key that i missed before | 17:46 |
fungi | paladox: not to my knowledge. i | 17:46 |
fungi | er, i'm not even sure what javaopts are | 17:46 |
paladox | ok, i think you'll want to set https://github.com/wikimedia/puppet/commit/01bf99d8c72886e876878ade7e99f9081dc313d5#diff-1145a1f82a8b6b5ee2c3238b41d39601 | 17:46 |
clarkb | paladox: fungi we set heap memory size | 17:46 |
clarkb | but that gets loaded via the defaults file in the init script | 17:47 |
clarkb | so ya we set things but puppet is only partially related there | 17:47 |
fungi | ahh, here: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L119 | 17:47 |
paladox | ahh, ok. | 17:47 |
fungi | https://opendev.org/opendev/puppet-gerrit/src/branch/master/templates/gerrit.config.erb#L62-L64 | 17:48 |
fungi | is where it gets plumbed through to gerrit.config | 17:49 |
*** udesale has quit IRC | 17:49 | |
*** ralonsoh has quit IRC | 17:49 | |
clarkb | mnaser: if you have a moment, vexxhost sjc1 claims we are using ~6080GB of volume quota but volume list shows ~3040GB (curious that it is 2x). Not sure if that is a known thing or something we can help debug further but thought I would point it out | 17:51 |
*** goldyfruit_ has joined #openstack-infra | 17:51 | |
*** virendra-sharma has quit IRC | 17:53 | |
*** goldyfruit has quit IRC | 17:53 | |
Shrews | clarkb: hrm, aug 7th logs for nodepool no longer exist. that makes investigating near impossible :( | 17:53 |
*** ociuhandu has joined #openstack-infra | 17:53 | |
clarkb | ya we keep a month of logs iirc | 17:54 |
clarkb | and that is ~2 days past | 17:54 |
Shrews | we only go back to aug 30 | 17:54 |
clarkb | oh just 10 days then | 17:54 |
clarkb | re the retrying jobs the top of the check queue is an ironic job that disappeared in tempest ~2 hours ago on rax-ord and zuul/ansible haven't caught up on that yet | 18:00 |
clarkb | I am able to hit the ssh port via home and the executor that was running the job there | 18:01 |
clarkb | so if it is the duplicate ip problem we've got the IP currently | 18:01 |
fungi | well, it's rarely just *one* duplicate ip address | 18:02 |
fungi | but it's usually an identifiably small percent of the addresses used there overall | 18:03 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 18:03 |
clarkb | ya it's just hard to identify because we don't get logs | 18:03 |
openstackgerrit | Merged openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant https://review.opendev.org/681063 | 18:03 |
clarkb | I've put an autohold on that job | 18:03 |
clarkb | tox/tempest is still running there | 18:08 |
fungi | previously when i've had to hunt these down i've built up lists of build ids which hit the retry_limit or whatever, and then programmatically tracked those back through zuul executor logs to corresponding node requests and then into nodepool launcher logs | 18:09 |
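A rough sketch of that chase; the log paths and message formats here are assumptions rather than the exact procedure:

    # executor side: find the build and note which node request it used
    grep "$BUILD_UUID" /var/log/zuul/executor-debug.log

    # launcher side: follow the node request to see which cloud/node served it
    grep "$NODE_REQUEST_ID" /var/log/nodepool/launcher-debug.log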
clarkb | 23.253.173.73 is the host if anyone else wants to take a look | 18:09 |
fungi | my access is still a bit crippled while my broadband is out | 18:10 |
clarkb | http://paste.openstack.org/show/774344/ is the last thing stestr logged | 18:11 |
clarkb | about 37 minutes after ansible stopped getting info | 18:11 |
*** e0ne has joined #openstack-infra | 18:12 | |
johnsom | I am going to try Ian's netconsole role plus redirect the rsyslog. See if that catches anything. | 18:14 |
clarkb | right around the time ansible loses connectivity dmesg says there was an sda gpt block device. This device has GPT errors and does not exist currently | 18:14 |
clarkb | dtantsur|afk: TheJulia ^ do you know if ironic-standalone is creating that sda device and giving it a GPT table? if so it seems to be buggy | 18:15 |
*** jtomasek_ has joined #openstack-infra | 18:17 | |
clarkb | journalctl is empty for that block of time too | 18:18 |
fungi | oh, i wonder if something to do with block device management is temporarily hanging the kernel long enough to cause ansible to give up | 18:19 |
*** jtomasek has quit IRC | 18:19 | |
fungi | that behavior could certainly be provider/hypervisor dependent | 18:19 |
clarkb | fungi: ya journalctl having missing data around then seems to point at sadness | 18:19 |
clarkb | syslog has the data though so not universally a problem | 18:19 |
clarkb | but could be just bad enough potentially | 18:19 |
*** priteau has quit IRC | 18:23 | |
*** kjackal has quit IRC | 18:23 | |
clarkb | fungi: wins /dev/xvda1 38G 36G 0 100% / | 18:25 |
clarkb | running du to track down what filled it up | 18:25 |
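Something along these lines, staying on the root filesystem so /opt and other mounts don't skew the result (a sketch, not the exact invocation):

    df -h /
    # -x keeps du on one filesystem; show the 20 largest directories
    du -xh --max-depth=3 / 2>/dev/null | sort -h | tail -n 20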
fungi | probably the job needs adjusting to write stuff in /opt | 18:27 |
fungi | how full is /opt on that node? | 18:27 |
*** yamamoto has joined #openstack-infra | 18:27 | |
clarkb | http://paste.openstack.org/show/774346/ | 18:28 |
clarkb | opt has 57GB | 18:28 |
clarkb | johnsom: ^ is it possible that your jobs are using fat images too and running into the same problem? | 18:28 |
clarkb | that could explain why it is rax and ironic + octavia hitting this a lot | 18:28 |
clarkb | because you both run real workloads in your VMs | 18:28 |
clarkb | and don't just boot cirros | 18:28 |
johnsom | Not likely, we just shaved another 250MB off them a month or two ago. It should be only one 350MB image | 18:29 |
johnsom | We are logging more than we were however, as we added log offloading a few months back. Seems unlikely that it would be failing at the start of the tempest job though. | 18:30 |
johnsom | Is the DF log in the output at the start or the end of the job? | 18:31 |
*** yamamoto has quit IRC | 18:31 | |
clarkb | that's me sshing in after the node broke | 18:31 |
fungi | so what is e.g. /var/lib/libvirt/images/node-1.qcow2 there which is 12gb? | 18:32 |
johnsom | This was a run that passed: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/df.txt.gz | 18:32 |
clarkb | ironics test image | 18:32 |
clarkb | fungi: ^ | 18:32 |
fungi | aha | 18:32 |
fungi | that's fairly massive | 18:32 |
fungi | especially since it's one of a dozen images in that same directory | 18:33 |
clarkb | ironic at least is clearly broken in this way | 18:33 |
clarkb | gives us something to fix and check on after | 18:33 |
fungi | but also it should almost certainly put that stuff in /opt | 18:33 |
fungi | so as not to fill the rootfs | 18:33 |
johnsom | Yeah, our new logging is only 250K and 2.3K for a full run, so not a big change. | 18:33 |
*** e0ne has quit IRC | 18:34 | |
clarkb | note as suspected that is a side effect of running tempest I think | 18:35 |
* clarkb makes a bug for ironic | 18:37 | |
clarkb | johnsom: "new logging" ? | 18:37 |
clarkb | johnsom: also keep in mind the original image size may grow up to the flavor disk size when booted and used | 18:37 |
*** jamesmcarthur has joined #openstack-infra | 18:37 | |
clarkb | johnsom: even if your qcow2 is, say, only 1GB when handed to glance, once booted and blocks start changing it can grow up to that limit | 18:38 |
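qemu-img makes that distinction visible: "virtual size" is the cap the guest sees, while "disk size" is what the qcow2 currently occupies on the host and creeps toward that cap as the guest writes blocks. For example, against the image fungi spotted:

    # compare "virtual size" (the cap) with "disk size" (current on-host usage)
    qemu-img info /var/lib/libvirt/images/node-1.qcow2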
johnsom | New logging == offloading the logging from inside the amphora instances. Looking at the passing runs it's only 250K uncompressed, so certainly not eating the disk. | 18:39 |
johnsom | Our qcow is ~300MB, the nova flavor it gets is 2GB. Other than getting smaller, it hasn't changed in years. | 18:39 |
clarkb | ok. I'm pointing it out because I've just shown that one of your example jobs is likely failing due to this problem. | 18:41 |
clarkb | I think it is worthwhile to be able to specifically rule it in or out in the other jobs we know are failing similarly | 18:41 |
*** ociuhandu has quit IRC | 18:41 | |
johnsom | Yeah, that seems very ironic job related. | 18:41 |
clarkb | oh is ironic in storyboard? | 18:42 |
johnsom | I hope this netconsole trick sheds light on our job. | 18:43 |
fungi | clarkb: yes | 18:43 |
fungi | and i agree, seems like there's a good chance that the ironic and octavia jobs are failing in similarly opaque ways for distinct reasons | 18:44 |
johnsom | clarkb This test run is going now: https://review.opendev.org/680894 assuming I didn't fumble the Ansible somehow, which is likely given I use it just enough to forget it. | 18:44 |
johnsom | Yep, nevermind, broken ansible. lol | 18:44 |
clarkb | fungi: yup could be distinct but we should rule out known causes first | 18:45 |
clarkb | rather than assuming they are fine | 18:45 |
clarkb | octavia like ironic is failing when tempest runs | 18:45 |
clarkb | octavia like ironic is being retried because ansible fails | 18:45 |
clarkb | we know ironic is filling its root disk on rax (which also explains why it may be cloud specific) | 18:46 |
clarkb | why not simply check "is octavia filling the disk"; if yes we solved it, if no we look elsewhere | 18:46 |
fungi | and also, predominantly in the same provider regions right? one where the available space on the rootfs is fairly limited compared to other providers | 18:46 |
fungi | er, what you said | 18:47 |
fungi | what's a good way to go about reporting the disk utilization on those nodes just before they fall over? the successful run johnsom linked earlier had fairly minimal disk usage reported by df at the end of the job | 18:48 |
fungi | but maybe disk utilization spikes during the tempest run and then the resources consuming disk get cleaned up? | 18:49 |
clarkb | fungi: ya it could be the test ordering that gets us if multiple tests with big disks run concurrently | 18:49 |
clarkb | fungi: as for how to get the data off, if ansible is failing (likely because it can't write data to /tmp) then we may need to use ansible raw? | 18:50 |
clarkb | pabelanger: ^ do you know if the raw module avoids needing to copy files to /tmp on the remote host? | 18:50 |
*** factor has joined #openstack-infra | 18:50 | |
*** vesper11 has quit IRC | 18:52 | |
pabelanger | clarkb: I am not sure, would have to look | 18:52 |
pabelanger | IIRC, raw is pretty minimal | 18:53 |
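For reference, a hedged sketch of poking a wedged node with the raw module, which just runs a command over ssh rather than shipping a module payload to the remote temp dir; the zuul user name and the trailing-comma ad-hoc inventory are assumptions about how you'd reach the held node:

    # run plain commands on the suspect host without the normal module copy step
    ansible all -i "23.253.173.73," -u zuul -m raw -a "df -h / && ls -la /tmp"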
clarkb | https://storyboard.openstack.org/#!/story/2006520 has been filed and I just shared it with the ironic channel | 18:56 |
*** diablo_rojo has quit IRC | 18:56 | |
*** diablo_rojo has joined #openstack-infra | 18:56 | |
fungi | i wonder if we could better trap/report ansible remote write failures like that | 18:58 |
AJaeger | config-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review | 18:58 |
*** jbadiapa has quit IRC | 19:00 | |
clarkb | fungi: looking at the log on ze04 for bb2716ef3894465cb1cfbf1b22d7736c the way it manifests for ironic is the run playbook times out, then the post run attempts to ssh in and do post run things but that fails with exit code 4 which is the "networking failed me" problem. In this case I don't think networking at layer 3 or below failed but instead at 7 because ansible/ssh needs /tmp to do things | 19:01 |
clarkb | johnsom: is there a preexisting octavia change that we can recheck without causing problems for you and then if the job lands on say rax-* we can hold it/investigate? | 19:03 |
clarkb | johnsom: if not I can push a no change change | 19:03 |
*** ykarel|away has quit IRC | 19:03 | |
johnsom | I am using this one for the netconsole test: https://review.opendev.org/#/c/680894/ | 19:03 |
johnsom | I just can't say if I got the ansible right this time. | 19:03 |
clarkb | k | 19:03 |
*** e0ne has joined #openstack-infra | 19:03 | |
johnsom | Looks like it got OVH | 19:04 |
fungi | hrm, yeah if ansible is overloading the same exit code for both network connection failure and an inability to write to the remote node's filesystem, then discerning which it is could be tough | 19:07 |
*** e0ne has quit IRC | 19:09 | |
clarkb | johnsom: if you notice a run on rax-* ping me and I'll happily add an autohold for it | 19:10 |
johnsom | Ok, sounds good | 19:10 |
*** e0ne has joined #openstack-infra | 19:10 | |
fungi | or we can set an autohold with a count >1 | 19:11 |
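Something like the following on the scheduler, with the project and job names as placeholders; --count lets the hold apply to more than one matching failure:

    zuul autohold --tenant openstack \
        --project openstack/octavia --job octavia-v2-dsvm-scenario \
        --reason "debugging retry_limit failures" --count 2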
*** jbadiapa has joined #openstack-infra | 19:13 | |
*** goldyfruit_ has quit IRC | 19:15 | |
*** eharney has quit IRC | 19:20 | |
*** bnemec has quit IRC | 19:27 | |
*** bnemec has joined #openstack-infra | 19:32 | |
*** goldyfruit has joined #openstack-infra | 19:33 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner https://review.opendev.org/677971 | 19:33 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner https://review.opendev.org/677971 | 19:34 |
zbr | fungi: if it's not too late, please take a look at https://review.opendev.org/#/q/project:opendev/bindep+status:open thanks. | 19:36 |
*** vesper11 has joined #openstack-infra | 19:41 | |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Add cleanup phase to base(-test) https://review.opendev.org/681100 | 19:46 |
clarkb | infra-root ^ that might help us debug these problems better in the future | 19:46 |
*** ociuhandu has joined #openstack-infra | 19:51 | |
*** igordc has joined #openstack-infra | 20:00 | |
*** jtomasek_ has quit IRC | 20:01 | |
*** ociuhandu has quit IRC | 20:03 | |
*** ociuhandu has joined #openstack-infra | 20:03 | |
*** michael-beaver has joined #openstack-infra | 20:04 | |
clarkb | corvus: have a quick second for 681100? once that is in I can test it | 20:09 |
*** markmcclain has joined #openstack-infra | 20:10 | |
clarkb | sean-k-mooney: I think fn is enabled again if you want to retry your label checks (restarted the nodepool service so that will rule out config problems if it persists) | 20:10 |
clarkb | or at least configs due to weird reloading of pools | 20:10 |
sean-k-mooney | clarkb: ack | 20:10 |
sean-k-mooney | i'll kick off the new label test and let you know if that works | 20:11 |
*** Lucas_Gray has joined #openstack-infra | 20:11 | |
*** nicolasbock has quit IRC | 20:11 | |
sean-k-mooney | related question vexxhost has nested virt as does FN. did packet ever turn it back on | 20:11 |
fungi | we haven't had a usable packethost environment in... over a year? | 20:12 |
openstackgerrit | Merged opendev/bindep master: Fix emerge testcases https://review.opendev.org/460217 | 20:12 |
sean-k-mooney | well i think that answers my question | 20:12 |
clarkb | ya there was a plan to redeploy using osa but I don't think that ever happened | 20:13 |
sean-k-mooney | my hope is that if we have 2+ providers that can support the multi numa / single numa labels with nested virt (that would be FN and maybe vexxhost) then we could replace the now defunct intel nfv ci with a first party version that at a minimum did a nightly build | 20:15 |
sean-k-mooney | if that is stable then maybe we can promote it to non-voting or even voting in check | 20:15 |
*** trident has quit IRC | 20:15 | |
clarkb | limestone may also be able to help | 20:15 |
sean-k-mooney | yes perhaps. now that we have 1 provider i can continue to refine the job and we can explore other options as they become available | 20:16 |
sean-k-mooney | ignoring the label issue the FN job seems to be working well | 20:17 |
clarkb | I think the biggest gotcha is going to be those periodic updates to kernels that break it and needing to update the various layers to cope | 20:18 |
clarkb | but if the job is nonvoting and we communicate when that happens its probably fine? | 20:18 |
sean-k-mooney | hopefully | 20:18 |
*** nicolasbock has joined #openstack-infra | 20:18 | |
sean-k-mooney | the intel nfv ci was always using nested virt and it was, at least for the first 2 years when it was staffed, fairly stable | 20:19 |
sean-k-mooney | most of the issues we had were not related to nested virt | 20:19 |
sean-k-mooney | but rather dns or our proxies/mirrors | 20:19 |
sean-k-mooney | i kicked off the new label job on 680738 | 20:20 |
sean-k-mooney | so we will see if that fails | 20:20 |
sean-k-mooney | or passes | 20:20 |
clarkb | ya we just know it breaks periodically | 20:21 |
clarkb | logan-: has given us a bit of insight into that and aiui something in the middle kernel will update, then we need to update the hypervisors to accommodate and it starts working again | 20:21 |
sean-k-mooney | it's on by default now in kernels from 4.19 | 20:21 |
*** ociuhandu has quit IRC | 20:22 | |
clarkb | ya so only about 2 years away :) | 20:22 |
*** ociuhandu has joined #openstack-infra | 20:22 | |
sean-k-mooney | it really bugs me that rhel8 is based on 4.18 | 20:22 |
sean-k-mooney | because it's an lts distro on a non-lts kernel | 20:22 |
*** iurygregory has quit IRC | 20:23 | |
sean-k-mooney | well it's a redhat kernel so the version number doesn't really mean anything in relation to the upstream kernel anyway | 20:23 |
*** ociuhandu has quit IRC | 20:27 | |
*** trident has joined #openstack-infra | 20:27 | |
*** trident has quit IRC | 20:32 | |
*** notmyname has quit IRC | 20:32 | |
*** trident has joined #openstack-infra | 20:33 | |
openstackgerrit | Merged openstack/project-config master: Add per-project Airship groups https://review.opendev.org/680717 | 20:35 |
*** notmyname has joined #openstack-infra | 20:42 | |
*** ociuhandu has joined #openstack-infra | 20:44 | |
*** kopecmartin is now known as kopecmartin|off | 20:45 | |
johnsom | clarkb rax-iad https://zuul.openstack.org/stream/cd166c6ecde942259f27abaeef8b89fa?logfile=console.log | 20:45 |
johnsom | I am getting remote logs for this one too | 20:46 |
clarkb | johnsom: placed a hold on it | 20:47 |
clarkb | I'll also ssh in now to see if my connection dies | 20:47 |
johnsom | I am going to run grab lunch as it will be at least 20 minutes for devstack. | 20:48 |
clarkb | k I'll keep an eye on it out of the corner of my eye | 20:48 |
openstackgerrit | Merged opendev/bindep master: Fix apk handling of warnings/output problems https://review.opendev.org/602433 | 20:50 |
*** markvoelker has quit IRC | 21:02 | |
*** ociuhandu has quit IRC | 21:09 | |
johnsom | clarkb Tempest is just starting | 21:11 |
clarkb | I'm still ssh'd in | 21:12 |
*** Goneri has quit IRC | 21:14 | |
clarkb | I don't think this is the problem but I wonder if neutron wants to clean it up: netlink: 'ovs-vswitchd': attribute type 5 has an invalid length. | 21:15 |
johnsom | Yeah I am seeing a lot of those too | 21:16 |
johnsom | Neutron does seem to do a lot of stuff at the start of tempest. Its log was scrolling pretty fast | 21:16 |
*** markvoelker has joined #openstack-infra | 21:16 | |
fungi | sean-k-mooney: is that ovs-vswitchd attribute length error something you're aware of? | 21:17 |
clarkb | I'm running watch against ip addr show eth0 ; df -h ; w ; ps -elf | grep tempest | 21:19 |
fungi | good call | 21:22 |
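That is, one loop watching the address, the disk, logins and the tempest workers together, roughly:

    watch -n 5 'ip addr show eth0; df -h; w; ps -elf | grep tempest'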
johnsom | boom kernel oops | 21:23 |
fungi | so we are triggering a panic? | 21:23 |
johnsom | https://www.irccloud.com/pastebin/7VeZaVsl/ | 21:23 |
clarkb | I didn't get that with my watch | 21:23 |
fungi | and i guess it's hypervisor-specific | 21:23 |
fungi | so probably only happening on xen guests | 21:23 |
clarkb | looks like an ovs issue | 21:24 |
clarkb | exciting | 21:24 |
fungi | neat. sean-k-mooney may be interested in that too | 21:24 |
johnsom | "exciting" Yes, yes it has been given it's freeze week | 21:25 |
fungi | any idea what might have changed around that in the relevant timeframe where the behavior seems to have emerged? | 21:26 |
johnsom | Nothing in our project, that is why this has been a pain. | 21:26 |
clarkb | could be a kernel update in bionic | 21:26 |
clarkb | or a xen update in the cloud(s) | 21:27 |
fungi | neutron changes? new ovs version? is it just happening on ubuntu bionic or do we have suspected failures on other distros/versions? | 21:27 |
fungi | but yeah, at least it's happening around feature freeze week, not release week | 21:27 |
clarkb | getting centos version of this test to run on rax to generate a bt (assuming it fails at all) would be a great next step I expect | 21:28 |
fungi | getting a minimum reproducer might make that easier too | 21:29 |
*** jamesmcarthur has quit IRC | 21:29 | |
clarkb | also could it be related to the invalid length error above? | 21:29 |
fungi | a distinct possibility | 21:29 |
johnsom | We started seeing it on the 3rd, so something around that time. | 21:30 |
johnsom | Looking to see if the centos-7 version of this has ever done this. (though centos has had its own issues) | 21:30 |
fungi | if the ovs error messages show up in success logs too, then that seems unlikely to be related (though still possible i suppose) | 21:30 |
johnsom | Well, I don't see them in this successful run: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz assuming they go into syslog too | 21:31 |
johnsom | Ok, for one patch, I have not seen it on the centos7 version. | 21:33 |
clarkb | johnsom: they are in that log you linked | 21:33 |
clarkb | https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz#2399 for example | 21:33 |
johnsom | Oh, I must have typo'd | 21:33 |
clarkb | so ya likely not related but could potentially still be if xen somehow tickles a corner case around the handling of that | 21:34 |
clarkb | either way cleaning it up would aid debugging because it's one less thing to catch your eye and be confused about | 21:34 |
fungi | yeah, maybe that's still a bug but only leads to a fatal condition on xen guests? | 21:34 |
fungi | wild guesses at this stage though | 21:35 |
clarkb | https://patchwork.ozlabs.org/patch/1012045/ I think it may just be noise | 21:36 |
fungi | seems that way | 21:37 |
fungi | so do we know which tempest test seems to have triggered the panic? | 21:37 |
clarkb | if the node did end up being held I may be able to restart the server and get the stestr log | 21:38 |
clarkb | I'll work on that now | 21:38 |
* fungi shakes fist at broadband provider... nearly 7 hours so far on repairing a fiber cut somewhere on the mainland | 21:39 | |
clarkb | does not look like holds apply in this case (because the job is retried?) | 21:39 |
fungi | you were doing a ps though... do you have the last one from before it died? can we infer tests from that? | 21:40 |
fungi | or i guess the console log stream also mentioned tests | 21:40 |
*** e0ne has quit IRC | 21:40 | |
clarkb | the ps doesn't tell you what tests are running | 21:41 |
clarkb | just that there were 4 processes running tests | 21:41 |
*** snapiri has quit IRC | 21:41 | |
johnsom | I was a bit surprised to see it running 4 for the concurrency, but I also tried locking it down to 2 with no luck | 21:42 |
fungi | ahh, yep, you used the f option (so included child processes) but then omitted those lines with the grep for "tempest" | 21:42 |
clarkb | fungi: we don't start new processes for each test though | 21:43 |
clarkb | the 4 child processes run their 1/4 of the tests loaded from stdin iirc | 21:44 |
fungi | sure, thought it might be inferrable from whatever shell commands were appearing in child process command lines, but also probably too short-lived for the actual culprit to be captured by watch | 21:44 |
fungi | i guess if we can reproduce it with tempest concurrency set to 1 that might allow narrowing it down | 21:45 |
fungi | it happens fairly early in the testset right? | 21:46 |
clarkb | ya I guess that depends on whether or not neutron uses the api or cli commands to interact with ovs there. However that said it almost looks like the failure was in frame processing | 21:46 |
clarkb | entry_SYSCALL_64_after_hwframe+0x3d/0xa2 | 21:46 |
clarkb | and crashes in do_output | 21:46 |
clarkb | so probably not easy to track that back to a specific test | 21:46 |
fungi | yeah, maybe ethernet frame forwarding? | 21:47 |
fungi | in which case perhaps any test which produces a fair amount of guest traffic might tickle it | 21:48 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Add support for the Gerrit checks plugin https://review.opendev.org/680778 | 21:49 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Add enqueue reporter action https://review.opendev.org/681132 | 21:49 |
clarkb | fungi: ya | 21:49 |
clarkb | thinking out loud here I wonder if having zuul retry jobs that blow up like this actually makes it harder to debug | 21:50 |
clarkb | fewer people notice because most of the time the retried probably do work by running on another cloud | 21:50 |
clarkb | and you can't hold the test nodes | 21:50 |
clarkb | etc | 21:50 |
clarkb | https://launchpad.net/ubuntu/+source/linux/4.15.0-60.67 went out on the third fwiw | 21:53 |
fungi | potential culprit, yep. is that what our images have? | 21:54 |
clarkb | 4.15.0-60-generic #67-Ubuntu os from the panic so ya I think they are | 21:55 |
clarkb | http://launchpadlibrarian.net/439948149/linux_4.15.0-58.64_4.15.0-60.67.diff.gz is the diff between the code that went out on the 3rd and what was in ubuntu prior | 21:56 |
sean-k-mooney | what does the waiting state mean in the zuul dashboard. how is it different from queued | 21:57 |
sean-k-mooney | does that mean its waiting for quota/nodepool? | 21:58 |
sean-k-mooney | but zuul has selected the job to run when the nodeset has been fulfilled? | 21:58 |
johnsom | https://www.irccloud.com/pastebin/Tnfw12ug/ | 21:58 |
johnsom | So, this is fixed in -62 | 21:58 |
*** markvoelker has quit IRC | 21:59 | |
johnsom | Wonder why the instance wasn't using 62. | 21:59 |
* johnsom feels like the canary project for *world* | 22:00 | |
clarkb | sean-k-mooney: from tobiash a few days ago 04:25:11* tobiash | pabelanger, clarkb, SpamapS: waiting means waiting for a dependency or semaphore (basically no node request created yet), queued means waiting for nodes | 22:00 |
clarkb | johnsom: 62 was only published 7 hours ago | 22:01 |
*** markvoelker has joined #openstack-infra | 22:01 | |
johnsom | Ha, so changelog is much earlier than availability... lol | 22:01 |
sean-k-mooney | ok so it's waiting in the merger? | 22:01 |
*** goldyfruit has quit IRC | 22:01 | |
ianw | infra-root: could i get a +w on https://review.opendev.org/#/c/680895/ so 3rd party testinfra is working please, thanks | 22:02 |
clarkb | ianw trade you https://review.opendev.org/#/c/681100/ | 22:02 |
johnsom | ianw Thanks for the role. It found the kernel panic | 22:03 |
sean-k-mooney | it went from queued to waiting but i don't see a node in http://zuul.openstack.org/nodes that matches the label. so it's unclear why it transitioned state | 22:03 |
pabelanger | sean-k-mooney: if multinode job, likely waiting for node in nodeset | 22:04 |
ianw | johnsom: oh nice! is it a known issue or unique? | 22:04 |
pabelanger | otherwise, nodepool is running at capacity and no nodes available | 22:04 |
johnsom | ianw Evidently the fixed kernel went out 7 hours ago | 22:05 |
clarkb | johnsom: note that bug is specific to fragmented packets which is somethign we do on our ovs bridges because nesting. So ya seems like really good match and likely is the fix | 22:05 |
ianw | johnsom: haha well nothing like living on the edge! | 22:05 |
sean-k-mooney | pabelanger: it is a multi node job but neither node has been created. could it be waiting for an io operation on the cloud such as uploading an image? | 22:05 |
pabelanger | sean-k-mooney: http://grafana.openstack.org/dashboard/db/zuul-status gives a good overview of what zuul is doing | 22:05 |
clarkb | sean-k-mooney: no we never wait for images to upload unless there are no images | 22:05 |
johnsom | clarkb Path forward? How soon will that hit the mirrors? | 22:05 |
pabelanger | sean-k-mooney: no, image uploads shouldn't be the issue here. Sounds like maybe a quota issue | 22:05 |
clarkb | sean-k-mooney: waiting is "normal" I wouldn't worry about it | 22:05 |
*** markvoelker has quit IRC | 22:06 | |
pabelanger | sean-k-mooney: cannot boot node on cloud, because no space | 22:06 |
sean-k-mooney | clarkb: ok | 22:06 |
johnsom | clarkb matches to the line number in the launchpad bug | 22:06 |
sean-k-mooney | ya i can see FN is at capacity more or less http://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=1 | 22:06 |
pabelanger | there are some odd interactions with multinode jobs and quota on clouds. It is possible for a job to boot 1 of 2 nodes fine, but with 2 of 2 boots, there is no quota | 22:06 |
pabelanger | then you are stuck on the cloud waiting | 22:06 |
pabelanger | we have that issue in zuul.ansible.com | 22:07 |
clarkb | johnsom: we update our mirrors every 4 hours and build images every 24 ish hours | 22:07 |
clarkb | johnsom: I've just checked the mirror and don't see it at http://mirror.dfw.rax.openstack.org/ubuntu/dists/bionic-updates/main/binary-amd64/Packages | 22:08 |
sean-k-mooney | in this case neither node is running (it uses a special label so i can tell from http://zuul.openstack.org/nodes) but it looks like it's waiting for space | 22:08 |
clarkb | johnsom: so I would guess anywhere from the next 4-24 hours ish | 22:08 |
pabelanger | sean-k-mooney: yah, that would be my guess too | 22:08 |
*** goldyfruit has joined #openstack-infra | 22:08 | |
pabelanger | you feel it much more, when running specific labels in a single cloud | 22:08 |
clarkb | http://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=1 you can see that cloud is at capacity | 22:09 |
sean-k-mooney | pabelanger: well this might still be a good thing. previously that label was going to node_error so this might actually be an improvement | 22:09 |
clarkb | the sun just came out so I'm going to sneak out for a quick bike ride before the rain returns | 22:09 |
clarkb | back in a bit | 22:10 |
*** ociuhandu has joined #openstack-infra | 22:10 | |
fungi | johnsom: moral of this story is if you'd just put off debugging another day it would have fixed itself? ;) | 22:13 |
johnsom | Yeah, that probably would have happened if it wasn't freeze week | 22:14 |
* fungi suspects that's a terrible moral for a story anyway | 22:14 | |
*** ociuhandu has quit IRC | 22:15 | |
johnsom | Yeah, burned a good week trying to figure out why we couldn't merge anything | 22:15 |
fungi | what's especially interesting is octavia's testing was thorough enough to hit this bug when neutron's was not | 22:16 |
johnsom | I also wonder if no one is actually testing with floating IPs.... Given this is tied to NAT. It's not like we are load testing either.... | 22:16 |
johnsom | Yeah, not the first time | 22:17 |
*** slaweq has quit IRC | 22:18 | |
*** goldyfruit_ has joined #openstack-infra | 22:21 | |
*** goldyfruit has quit IRC | 22:24 | |
adriant | Is this the irc channel to talk devstack issues? | 22:25 |
adriant | or does devstack have its own channel? | 22:25 |
johnsom | adriant General devstack issues are in #openstack-qa | 22:26 |
fungi | adriant: you're looking for #openstacl-qa | 22:26 |
johnsom | What he said.... grin | 22:27 |
fungi | er, the one johnsom said. you know, with the accurate typing | 22:27 |
adriant | johnsom wins :P | 22:27 |
adriant | ty | 22:27 |
pabelanger | ianw: in devstack, is there an option to enable nested virt for centos? Like is it modifying modprobe.d configs? | 22:28 |
pabelanger | not that I am asking to enable it, want to see how it is done | 22:29 |
pabelanger | cause I need to modprobe -r kvm && modprobe kvm to toggle it | 22:29 |
pabelanger | not sure why RPM isn't reading modprobe config | 22:30 |
Roamer` | pabelanger, judging by the fact that devstack's doc/source/guides/devstack-with-nested-kvm.rst explains how to do the rmmod/modprobe dance for different CPU types, and from the fact that there is no mention of rmmod in devstack itself other than this file, I'd say most probably not :/ | 22:34 |
openstackgerrit | Merged opendev/system-config master: Set zuul_work_dir for tox testing https://review.opendev.org/680895 | 22:36 |
*** markvoelker has joined #openstack-infra | 22:36 | |
pabelanger | Roamer`: thanks! I should read the manual next time. that is exactly what I am doing too | 22:36 |
pabelanger | rmmod | 22:37 |
pabelanger | above was a typo :) | 22:37 |
*** bobh has joined #openstack-infra | 22:37 | |
johnsom | pabelanger export DEVSTACK_GATE_LIBVIRT_TYPE=kvm is the setting for devstack to setup nova for it. | 22:39 |
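In a plain devstack run the same knob is LIBVIRT_TYPE in local.conf (DEVSTACK_GATE_LIBVIRT_TYPE is the devstack-gate wrapper around it); a minimal sketch:

    [[local|localrc]]
    LIBVIRT_TYPE=kvm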
*** EmilienM is now known as little_script | 22:39 | |
pabelanger | ty | 22:39 |
Roamer` | pabelanger, what do you mean RPM is not reading the modprobe config though? is it possible that the module has been already loaded, maybe even at boot time, and you modifying the config later has no effect without, well, reloading it using rmmod/modprobe? :) | 22:39 |
pabelanger | Roamer`: so, if I setup /etc/modprobe.d/kvm.conf with 'options kvm_intel nested=1' then yum install qemu-kvm, nested isn't enabled. I need to bounce the module, for it to work | 22:41 |
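What that looks like in practice, as a sketch (kvm_intel assumed; kvm_amd on AMD hosts):

    echo 'options kvm_intel nested=1' > /etc/modprobe.d/kvm.conf
    # the option only takes effect when the module is (re)loaded
    modprobe -r kvm_intel && modprobe kvm_intel
    cat /sys/module/kvm_intel/parameters/nested   # Y (or 1 on older kernels)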
*** factor has quit IRC | 22:41 | |
pabelanger | not an issue, but surprised it wasn't loaded properly | 22:41 |
*** factor has joined #openstack-infra | 22:41 | |
*** little_script is now known as EmilienM | 22:42 | |
*** bobh has quit IRC | 22:45 | |
Roamer` | pabelanger, well, sorry if I'm being obtuse and asking for things that may be obvious and you may have already checked, but still, are you really sure that the kvm_intel module has not been already loaded even before you installed qemu-kvm? I can't really think of anything that would want to load it, it's certainly not loaded by default, but it's usually part of the kernel package, so it will | 22:45 |
Roamer` | have been *installed* before you install qemu-kvm | 22:45 |
*** xenos76 has quit IRC | 22:46 | |
*** markvoelker has quit IRC | 22:46 | |
pabelanger | good point, I should check that | 22:46 |
pabelanger | I assumed qemu-kvm was pulling it in | 22:47 |
clarkb | pabelanger: note you only set nested=1 on the first hypervisor | 22:48 |
clarkb | on your second you can consume the nested virt without enabling it for the next layer | 22:48 |
*** happyhemant has quit IRC | 22:49 | |
*** rlandy is now known as rlandy|bbl | 22:50 | |
*** icarusfactor has joined #openstack-infra | 22:51 | |
*** mriedem has quit IRC | 22:51 | |
*** goldyfruit_ has quit IRC | 22:51 | |
clarkb | pabelanger: also note you cannot enable nested virt from the middle hypervisor if the first does not have it enabled | 22:52 |
clarkb | and note that it crashes a lot | 22:52 |
clarkb | but in linux 4.19 it is finally enabled by default for intel cpus so in theory its a lot better past that point in time | 22:53 |
*** factor has quit IRC | 22:53 | |
*** tkajinam has joined #openstack-infra | 22:55 | |
pabelanger | clarkb: Yup agree! Not going to run this in production, mostly wanted to understand how people enabled it for jobs | 22:56 |
*** Lucas_Gray has quit IRC | 22:57 | |
*** rcernin has joined #openstack-infra | 22:59 | |
ianw | clarkb: :/ ubuntu mirroring failed @ 2019-09-05T22:38:09,181762752+00:00 ish ... http://paste.openstack.org/show/774655/ | 23:11 |
clarkb | ianw: similar problems to fedora? | 23:11 |
*** owalsh has quit IRC | 23:11 | |
clarkb | auth expired because vos release took too long? | 23:11 |
ianw | yeah, seems likely. note that's mirror-update.openstack.org so the old server, and also rsync isn't involved there | 23:12 |
*** Lucas_Gray has joined #openstack-infra | 23:14 | |
*** owalsh has joined #openstack-infra | 23:14 | |
*** threestrands has joined #openstack-infra | 23:14 | |
ianw | istr from scrollback some sort of issues, and davolserver was restarted @ "Sep 6 18:16 /proc/4595" | 23:18 |
clarkb | ya afs02.dfw was out to lunch (really high load and ssh not responding) the console log showed it had a bunch of kernel timeouts for disks | 23:18 |
clarkb | and processes so we rebooted it | 23:18 |
clarkb | other vos releases were working but we didn't check all of them, it's possible we should've checked all of them | 23:18 |
ianw | hrm, so that failure happened before the reboot, and could be explained by afs02 being in bad state then | 23:21 |
clarkb | oh I didn't notice the date on the failure | 23:21 |
ianw | my only concern though is that if i unlock the volume, r/o needs to be completely recreated and maybe *that* will timeout now | 23:21 |
clarkb | but ya likely explained by that | 23:21 |
clarkb | ianw: ya in the past when I've manually unlocked I've done a manual sync with the lock held then run the vos release from screen on afs01.dfw using localauth | 23:22 |
clarkb | the localauth doesn't timeout | 23:22 |
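The shape of that recovery, using mirror.ubuntu as an example volume name and assuming it runs as root on afs01 inside screen (localauth tokens are built from the server's own key, so there is no Kerberos ticket to expire):

    screen -S afs-release
    vos unlock mirror.ubuntu -localauth
    vos release -v mirror.ubuntu -localauth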
ianw | right, it just seems like most volumes are in this state :/ | 23:22 |
clarkb | aspiers: following up on the zuul backlog. I've tracked at least part of the problem back to ironic retrying tempest jobs multiple times due to filling up disks on rax instances | 23:25 |
clarkb | aspiers: that means we get 3 attempts * 3 hour timeouts we have to wait for | 23:25 |
aspiers | clarkb: ouch, nice find! | 23:25 |
clarkb | so those ironic changes hang out in the queue for forever | 23:26 |
sean-k-mooney | clarkb the jobs still ended up in node failure so it looks like the nodepool restart was not the issue. | 23:27 |
clarkb | sean-k-mooney: ok good to rule out the config loading then | 23:27 |
clarkb | sean-k-mooney: let me check if we are still having ssh errors | 23:27 |
sean-k-mooney | this was the patch that failed if that helps but it's the only job that uses the numa labels | 23:28 |
sean-k-mooney | https://review.opendev.org/#/c/680738/ | 23:28 |
clarkb | sean-k-mooney: same http error | 23:28 |
sean-k-mooney | ok | 23:28 |
clarkb | sean-k-mooney: I'm going to dig through the logs for a specific instance to see if I can napkin math verify we are hitting the timeout | 23:28 |
sean-k-mooney | for now i'm just going to rework the job to use labels we know work. | 23:29 |
sean-k-mooney | i would like to have this working before FF to hopefully help merge the numa migration feature it's testing | 23:29 |
*** dchen has joined #openstack-infra | 23:29 | |
clarkb | looks like nova says the instance is active in about 35 seconds then we timeout after 120 seconds | 23:30 |
fungi | slow to boot then? | 23:30 |
sean-k-mooney | so it's hitting the ssh timeout | 23:30 |
sean-k-mooney | fungi: it should not be. it has a numa topology but that normally makes it faster | 23:31 |
clarkb | sean-k-mooney: ya the traceback is definitely the ssh timeout. I just wanted to make sure the math wasn't coming up that short, but the logging timestamps seem to indicate it hasn't | 23:31 |
sean-k-mooney | it also has more ram at least on the controller | 23:31 |
fungi | grabbing a nova console log from a boot attempt ought to help | 23:31 |
sean-k-mooney | 16G instead of 8 | 23:31 |
clarkb | specifically it is trying to scan the ssh hostkeys | 23:31 |
clarkb | brainstorming: we could be failing to get entropy to generate host keys? | 23:31 |
sean-k-mooney | it's really strange | 23:31 |
sean-k-mooney | oh | 23:32 |
clarkb | and sshd won't start until it has generated them | 23:32 |
sean-k-mooney | yes we could | 23:32 |
clarkb | does numa affect entropy in VMs? also we run haveged | 23:32 |
sean-k-mooney | we could enable the hardware random number gen in the guest | 23:32 |
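Two quick checks for that theory, plus the standard nova knobs for handing a virtio RNG to guests; the image and flavor names are placeholders and whether FN actually needs this is only a guess:

    # inside a slow guest: is the entropy pool starved while host keys generate?
    cat /proc/sys/kernel/random/entropy_avail

    # cloud side: request a virtio-rng device for instances built from this image/flavor
    openstack image set --property hw_rng_model=virtio ubuntu-bionic
    openstack flavor set --property hw_rng:allowed=True multi-numa-expanded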
fungi | or it could be timing out waiting for device scanning to settle or a particular block device to appear or any number of other things that it waits for during boot | 23:32 |
clarkb | fungi: ya | 23:32 |
sean-k-mooney | clarkb: it should not | 23:32 |
clarkb | fungi: when I booted by hand though we had none of those issues | 23:32 |
clarkb | fungi: we can try booting by hand more | 23:33 |
* clarkb tries now | 23:33 | |
sean-k-mooney | fungi: also we have other labels that provide identical flavor with a different name that work | 23:33 |
fungi | oh, that's wacky | 23:33 |
sean-k-mooney | ya | 23:33 |
clarkb | the hostids for all three attempts were different | 23:34 |
clarkb | donnyd: 71a9bf7925f98026e8a268d2cda4f8623c4812e3025626a2b395e7b0 is the hostid and it tried booting f5708f13-1918-44ce-8c2f-3676712d12a0 | 23:34 |
clarkb | if you want to double check the hyervisor for anything funny | 23:34 |
fungi | could nodepool be trying the wrong interface's address maybe? | 23:34 |
sean-k-mooney | i think it's booting with just one interface and it seems to be accessing the ipv6 ip | 23:35 |
clarkb | nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 2001:470:e045:8000:f816:3eff:fe57:3b6c on port 22 | 23:35 |
sean-k-mooney | well it might have more interfaces | 23:35 |
ianw | clarkb: so basically every volume of interest is locked ... http://paste.openstack.org/show/774658/ | 23:35 |
clarkb | that is 387fa4f4-f69a-4654-9c39-c9fd3443bd6a on 247bdf6d16fe8bfc4c76cbe3a03e5933b4215077c60e516b1a26fbd8 | 23:35 |
sean-k-mooney | that's a publicly routable ipv6 address | 23:35 |
clarkb | ianw: bah ok | 23:35 |
ianw | at this point, i think the best option is to shutdown the two mirror-updates to stop things getting any worse, and do localauth releases of those volumes | 23:36 |
clarkb | ianw: don't we need to run their respective syncs before vos releasing? | 23:36 |
clarkb | I suppose if we are happy with the RW state then your plan will work | 23:36 |
donnyd | I am afk atm | 23:37 |
donnyd | I can take a look when I get back | 23:37 |
sean-k-mooney | fungi: these are the labels/flavors/images/keys that we are using | 23:37 |
sean-k-mooney | https://github.com/openstack/project-config/blob/master/nodepool/nl02.openstack.org.yaml#L343-L357 | 23:37 |
sean-k-mooney | ubuntu-bionic-expanded works fine | 23:38 |
ianw | clarkb: yeah, for mine i think we get the volumes back in sync, and then let the normal mirroring process run | 23:38 |
*** Lucas_Gray has quit IRC | 23:38 | |
sean-k-mooney | when i use multi-numa-ubuntu-bionic-expanded or multi-numa-ubuntu-bionic it does not work | 23:38 |
fungi | sean-k-mooney: okay, so they're using different flavors | 23:38 |
sean-k-mooney | fungi: the flavors are identical https://www.irccloud.com/pastebin/FWxMEIqc/ | 23:38 |
clarkb | fungi: the flavors are named differently but have the same attributes. However maybe there are attributes not exposed by flavor show that are different | 23:39 |
sean-k-mooney | well the non-expanded one has 8G instead of 16G of ram but otherwise they are the same | 23:39 |
clarkb | I'm able to get right in on the node I just booted with multi-numa-expanded | 23:40 |
sean-k-mooney | clarkb: could you test a multi-numa instance instead of the expanded one. i mean ssh runs fine on a 64mb cirros image but just in case | 23:41 |
*** icarusfactor has quit IRC | 23:41 | |
clarkb | 23:39:05 is first entry in dmesg -T, 23:39:11 is sshd starting, 23:39:45 is me logging in | 23:41 |
clarkb | sean-k-mooney: ya though the one I have failure for is the expanded label | 23:42 |
sean-k-mooney | oh ok | 23:42 |
*** exsdev has quit IRC | 23:42 | |
*** exsdev0 has joined #openstack-infra | 23:42 | |
*** exsdev0 is now known as exsdev | 23:42 | |
sean-k-mooney | well honestly i have done 90% of my openstack dev in multi numa vms for the last 6 years and i have never seen it have any effect on boot time or time to ssh working | 23:43 |
clarkb | I can also hit it from nl02 via ipv6 to port 22 with telnet | 23:43 |
clarkb | just ruling out networking problems between nl02 and fn | 23:43 |
* clarkb tests the non expanded just to be safe | 23:43 | |
clarkb | both flavors seem to work manually | 23:45 |
clarkb | I did get a brief no route to host then host finished booting and ssh worked | 23:45 |
clarkb | all within a minute, well under the timeout | 23:46 |
sean-k-mooney | donnyd: how do you get your ipv6 routing? | 23:46 |
sean-k-mooney | could there be a propogation dely in the route being acessable ? | 23:46 |
sean-k-mooney | although that woudl not explain why the other lable seams to work at least most of the time | 23:47 |
sean-k-mooney | i think i had some node_failures with ubuntu-bionic-expanded but i think we tracked thoes to quota | 23:47 |
sean-k-mooney | i assme we have not seen other node_failres with FN? | 23:48 |
sean-k-mooney | its just this spefic set of labels? | 23:48 |
clarkb | not that I know of | 23:49 |
clarkb | these are the only labels that are fn specific | 23:50 |
clarkb | other clouds can pick up the slack if it happens then we don't see a node failure, just another cloud serviing the request | 23:50 |
clarkb | let me see if I can grep for this happening in fn on other labels | 23:50 |
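i.e. something along the lines of the following on the launcher, with the log path being an assumption:

    grep "Launch failed" /var/log/nodepool/launcher-debug.log | grep -i fortnebula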
sean-k-mooney | well the ubuntu-bionic-expanded and centos-7-expanded are also FN specific | 23:51 |
sean-k-mooney | but nothing was previously using them | 23:51 |
clarkb | 2019-09-09 18:36:39,267 ERROR nodepool.NodeLauncher: [node: 0011030429] Launch failed for node opensuse-tumbleweed-fortnebula-regionone-0011030429 | 23:51 |
clarkb | it is happening for other labels too | 23:51 |
sean-k-mooney | so maybe in other cases it's retrying on a different provider | 23:52 |
clarkb | ya | 23:52 |
sean-k-mooney | this is not a host key thing right like when people reused ipv4 addresses | 23:53 |
sean-k-mooney | it's not able to connect rather than being rejected | 23:53 |
clarkb | sean-k-mooney: correct my read of it is tcp is failing to connect to port 22 | 23:55 |
clarkb | so ya a networking issue in the cloud could explain it | 23:55 |
sean-k-mooney | well we do have that recurring issue in the tempest test where ssh sometimes does not work... | 23:55 |
ianw | #status log mirror-update.<opendev|openstack>.org shutdown during volume recovery. script running in root screen on afs01 unlocking and doing localauth releases on all affected volumes | 23:56 |
openstackstatus | ianw: finished logging | 23:56 |
sean-k-mooney | which i believe is somehow related to the l3 agent. i wonder if there's a race with neutron setting up routing of the vm | 23:56 |
fungi | so if it's just a sometimes network issue, retrying that job should work sometimes and return node_error other times | 23:58 |
clarkb | fungi: ya though with the way queues have been it might be a while :/ | 23:58 |
clarkb | I've not heard anything from ironic re the disk issues yet. here is hoping that is because of CEST timezones | 23:59 |
sean-k-mooney | fungi: ya so it could be bad luck that the old ubuntu expanded label seems to work and this one does not. | 23:59 |