*** lathiat has joined #openstack-infra | 00:03 | |
*** mtreinish has quit IRC | 00:12 | |
*** jamesmcarthur has quit IRC | 00:14 | |
*** ijw has quit IRC | 00:14 | |
clarkb | corvus: we can try the hwe kernel (it is 4.13 on xenial) I run it on my local nas box because the cpu is too new for the old kernel | 00:21 |
*** edmondsw has joined #openstack-infra | 00:21 | |
clarkb | corvus: I haven't had any issues with it other than waiting a little longer on the meltdown kpti patching, which was annoying, but at that point it was already a week behind | 00:21 |
*** gongysh has joined #openstack-infra | 00:21 | |
clarkb | considering these servers are largely disposable and rebuildable I would be on board with trying that | 00:21 |
*** bobh has quit IRC | 00:22 | |
*** Goneri has quit IRC | 00:23 | |
*** iyamahat has joined #openstack-infra | 00:24 | |
*** aeng has joined #openstack-infra | 00:25 | |
clarkb | that may also simplify bwrap for us and not require the setuid? | 00:25 |
tristanC | dmsimard: there was swift authentication settings in zuul.conf, then zuul would generate a temp url key per job | 00:25 |
*** edmondsw has quit IRC | 00:25 | |
clarkb | tristanC: dmsimard ya that part mostly worked (the only real issue we had there was the tempurl key was generated on job scheduling and could expire by the time a job ran iirc) the bigger issue was some dev going I want to say last weeks periodic job logs | 00:26 |
clarkb | s/say/see/ | 00:27 |
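For context on the temp URL mechanism tristanC and clarkb describe: Swift temporary URLs are signed with an HMAC-SHA1 over the HTTP method, an expiry timestamp, and the object path, using a temp URL key (here, the per-job key zuul generated). A minimal sketch of that signing step, with an illustrative key and object path; the `ttl_seconds` argument is exactly where the "key generated at scheduling time can expire before the job runs" problem came from:

```python
import hmac
import time
from hashlib import sha1

def swift_temp_url(key, method, path, ttl_seconds):
    """Sign a Swift temp URL; path is e.g. /v1/AUTH_acct/container/object."""
    expires = int(time.time()) + ttl_seconds
    body = "{}\n{}\n{}".format(method, expires, path).encode()
    sig = hmac.new(key.encode(), body, sha1).hexdigest()
    return "{}?temp_url_sig={}&temp_url_expires={}".format(path, sig, expires)

# Example: a PUT URL valid for one hour from *now* -- if the job only starts
# two hours after this was generated, the signature has already expired.
print(swift_temp_url("per-job-secret", "PUT", "/v1/AUTH_demo/logs/job-output.txt", 3600))
```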
tristanC | clarkb: which is now possible the the zuul-web builds.json controller | 00:27 |
clarkb | ya that was mordreds point | 00:27 |
tristanC | with the* | 00:27 |
*** dingyichen has joined #openstack-infra | 00:27 | |
*** xarses_ has quit IRC | 00:28 | |
clarkb | there are still some downsides to that approach but none would be a regression when compared against the current system | 00:28 |
*** r-daneel has quit IRC | 00:29 | |
corvus | clarkb: yeah, i'll put executor oom on tomorrow's meeting agenda | 00:30 |
*** mtreinish has joined #openstack-infra | 00:37 | |
*** ijw has joined #openstack-infra | 00:38 | |
clarkb | in other interesting kernel news apparently 4.15 kernel performance with kpti is only a percent or two slower than 4.11 without kpti based on some benchmarks | 00:40 |
clarkb | it is unfortunate that hard work on performance improvements just got negated by kpti though | 00:40 |
clarkb | but at least we aren't taking a long term massive regression | 00:40 |
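A quick, hedged way to see which kernel a node is running and whether KPTI is active: the vulnerabilities sysfs file is only present on kernels new enough to expose it (4.15 or distro backports), which is itself useful information here.

```python
import platform

print("kernel:", platform.release())
try:
    with open("/sys/devices/system/cpu/vulnerabilities/meltdown") as f:
        # Typically "Mitigation: PTI" when KPTI is enabled, or "Vulnerable".
        print("meltdown:", f.read().strip())
except FileNotFoundError:
    print("this kernel does not expose the vulnerabilities sysfs files")
```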
*** tosky has quit IRC | 00:41 | |
*** dave-mccowan has joined #openstack-infra | 00:42 | |
openstackgerrit | Merged openstack-infra/zuul master: Allow a few more starting builds https://review.openstack.org/540965 | 00:43 |
*** rloo has left #openstack-infra | 00:45 | |
*** slaweq has joined #openstack-infra | 00:46 | |
*** caphrim007_ has joined #openstack-infra | 00:48 | |
*** caphrim00_ has joined #openstack-infra | 00:49 | |
*** caphrim007_ has quit IRC | 00:49 | |
*** slaweq has quit IRC | 00:51 | |
*** caphrim007 has quit IRC | 00:51 | |
*** Swami has quit IRC | 00:53 | |
*** caphrim00_ has quit IRC | 00:54 | |
*** claudiub has quit IRC | 00:58 | |
*** camunoz has quit IRC | 01:00 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Add block-device defaults https://review.openstack.org/539375 | 01:02 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 01:02 |
*** aeng has quit IRC | 01:02 | |
*** gongysh has quit IRC | 01:09 | |
*** aeng has joined #openstack-infra | 01:19 | |
*** jamesmcarthur has joined #openstack-infra | 01:20 | |
*** stakeda has joined #openstack-infra | 01:22 | |
*** jamesmcarthur has quit IRC | 01:25 | |
*** ihrachys has quit IRC | 01:26 | |
*** ihrachys_ has joined #openstack-infra | 01:26 | |
clarkb | the falcon heavy is scheduled to launch 30 minutes before the infra meeting tomorrow | 01:26 |
*** jamesmcarthur has joined #openstack-infra | 01:28 | |
*** abelur_ has joined #openstack-infra | 01:34 | |
*** liujiong has joined #openstack-infra | 01:34 | |
*** dougwig has quit IRC | 01:41 | |
*** ijw has quit IRC | 01:42 | |
ianw | i noticed that, might have to get up extra early :) | 01:42 |
*** jamesmcarthur has quit IRC | 01:43 | |
ianw | my kids are pretty bored with them now ... not another rocket launch ... crazy world they live in :) | 01:43 |
clarkb | but this one is going to mars | 01:44 |
*** salv-orlando has joined #openstack-infra | 01:45 | |
*** hongbin has joined #openstack-infra | 01:49 | |
*** slaweq has joined #openstack-infra | 01:50 | |
*** jamesmcarthur has joined #openstack-infra | 01:54 | |
*** slaweq has quit IRC | 01:55 | |
*** aeng has quit IRC | 01:57 | |
*** aeng has joined #openstack-infra | 01:57 | |
*** larainema has joined #openstack-infra | 02:01 | |
*** caphrim007 has joined #openstack-infra | 02:03 | |
openstackgerrit | Merged openstack-infra/gerritbot master: Add unit test framework and one unit test https://review.openstack.org/499377 | 02:03 |
*** caphrim007 has quit IRC | 02:07 | |
corvus | it's either sending elon musk's tesla roadster to mars, or it's blowing up. apparently musk gives it even odds. | 02:08 |
*** edmondsw has joined #openstack-infra | 02:09 | |
*** liujiong has quit IRC | 02:10 | |
*** liujiong has joined #openstack-infra | 02:11 | |
*** bobh has joined #openstack-infra | 02:13 | |
*** edmondsw has quit IRC | 02:14 | |
*** esberglu has quit IRC | 02:14 | |
*** bobh has quit IRC | 02:15 | |
prometheanfire | whatever happens it'll be glorious | 02:16 |
*** gongysh has joined #openstack-infra | 02:18 | |
*** harlowja has quit IRC | 02:18 | |
*** shu-mutou-AWAY is now known as shu-mutou | 02:19 | |
*** askb has quit IRC | 02:21 | |
*** markvoelker has joined #openstack-infra | 02:22 | |
*** slaweq has joined #openstack-infra | 02:24 | |
*** askb has joined #openstack-infra | 02:24 | |
*** markvoelker has quit IRC | 02:24 | |
*** mriedem has quit IRC | 02:25 | |
*** mriedem has joined #openstack-infra | 02:28 | |
*** slaweq has quit IRC | 02:29 | |
*** askb has quit IRC | 02:29 | |
*** askb has joined #openstack-infra | 02:29 | |
*** markvoelker has joined #openstack-infra | 02:31 | |
*** askb has quit IRC | 02:31 | |
*** askb has joined #openstack-infra | 02:31 | |
*** askb has quit IRC | 02:31 | |
*** askb has joined #openstack-infra | 02:32 | |
*** salv-orlando has quit IRC | 02:33 | |
*** salv-orlando has joined #openstack-infra | 02:33 | |
*** askb has quit IRC | 02:36 | |
*** salv-orlando has quit IRC | 02:38 | |
*** greghaynes has quit IRC | 02:40 | |
*** greghaynes has joined #openstack-infra | 02:41 | |
*** greghaynes has quit IRC | 02:44 | |
*** greghaynes has joined #openstack-infra | 02:44 | |
*** askb has joined #openstack-infra | 02:45 | |
*** olaph1 has joined #openstack-infra | 02:46 | |
*** olaph has quit IRC | 02:47 | |
*** dave-mccowan has quit IRC | 02:48 | |
*** abelur_ has quit IRC | 02:49 | |
*** askb has quit IRC | 02:49 | |
*** abelur__ has joined #openstack-infra | 02:49 | |
*** caphrim007 has joined #openstack-infra | 02:50 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 02:50 |
*** bobh has joined #openstack-infra | 02:51 | |
*** abelur__ has quit IRC | 02:51 | |
*** askb has joined #openstack-infra | 02:51 | |
*** askb_ has joined #openstack-infra | 02:52 | |
*** jamesmcarthur has quit IRC | 02:52 | |
*** askb has quit IRC | 02:52 | |
*** askb has joined #openstack-infra | 02:53 | |
*** bobh has quit IRC | 02:53 | |
smcginnis | Another release job timeout on a git fetch. | 02:54 |
smcginnis | http://logs.openstack.org/c3/c3ae8a1d87084bcee73e96e2f9e2d174ee610ed4/release-post/tag-releases/48a1f5a/job-output.txt.gz#_2018-02-05_23_47_27_034616 | 02:54 |
smcginnis | Just dropping a note for now. Will follow up tomorrow. | 02:54 |
*** markvoelker has quit IRC | 02:56 | |
*** markvoelker has joined #openstack-infra | 02:59 | |
*** slaweq has joined #openstack-infra | 03:00 | |
*** askb_ has quit IRC | 03:01 | |
*** slaweq has quit IRC | 03:04 | |
*** caphrim007 has quit IRC | 03:08 | |
prometheanfire | pip mirrors busted? | 03:09 |
*** dave-mccowan has joined #openstack-infra | 03:09 | |
prometheanfire | smcginnis: related I imagine? | 03:09 |
clarkb | prometheanfire: http://mirror.dfw.rax.openstack.org/pypi/last-modified its there and was last modified just under 3 hours ago | 03:10 |
clarkb | can you be more specific on how pip mirrors are busted? | 03:10 |
*** markvoelker has quit IRC | 03:11 | |
prometheanfire | clarkb: I was getting 'no distribution found', I'll retry soon | 03:12 |
prometheanfire | http://logs.openstack.org/29/541029/1/check/requirements-tox-py35-check-uc/b098a8f/job-output.txt.gz#_2018-02-06_01_51_40_046774 | 03:13 |
*** mriedem has quit IRC | 03:14 | |
*** markvoelker has joined #openstack-infra | 03:14 | |
clarkb | http://mirror.dfw.rax.openstack.org/pypi/simple/sushy/ ya looks like they are missing | 03:15 |
clarkb | bandersnatch behind or having problems again is my best guess | 03:15 |
*** d0ugal_ has joined #openstack-infra | 03:16 | |
clarkb | nothing about sushy in the bandersnatch logs | 03:18 |
clarkb | for the 4th 5th and 6th | 03:18 |
*** d0ugal has quit IRC | 03:19 | |
clarkb | something wrong with upstream pypi's mirroring stuff? | 03:19 |
prometheanfire | not sure | 03:22 |
clarkb | looks like bandersnatch did error due to failed package retrieval early today PST | 03:22 |
clarkb | I wonder if that has it confused as to what serial it is on; maybe it thinks it is done | 03:23 |
clarkb | this may require another full resync :/ | 03:23 |
clarkb | I'm not going to be able to watch that tonight but can pick up in the morning if necessary | 03:23 |
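A throwaway check along the lines of what clarkb did by hand above: compare the mirror's simple index against upstream PyPI for a package a job failed to find. The mirror hostname and package name are the ones from the log; adjust both as needed.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

MIRROR = "http://mirror.dfw.rax.openstack.org/pypi/simple"
UPSTREAM = "https://pypi.org/simple"

def has_index(base, name):
    """True if the simple index page for `name` exists under `base`."""
    try:
        with urlopen("{}/{}/".format(base, name), timeout=10):
            return True
    except HTTPError as err:
        if err.code == 404:
            return False
        raise

for pkg in ("sushy",):
    print(pkg, "mirror:", has_index(MIRROR, pkg), "upstream:", has_index(UPSTREAM, pkg))
```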
*** markvoelker has quit IRC | 03:25 | |
*** gcb has joined #openstack-infra | 03:26 | |
*** gyee has quit IRC | 03:29 | |
*** markvoelker has joined #openstack-infra | 03:30 | |
tonyb | jhesketh: random question ... what do OpenSuse and SLES use for init? systemd? | 03:31 |
clarkb | tonyb: tumbleweed is systemd at least | 03:32 |
*** dave-mccowan has quit IRC | 03:33 | |
*** gongysh has quit IRC | 03:33 | |
clarkb | looks like SLES 12 is systemd | 03:33 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 03:33 |
tonyb | clarkb: ah cool | 03:37 |
prometheanfire | tonyb: ya, they use systemd, for some reason I thought they went with openrc or eudev, but guess not | 03:39 |
*** bobh has joined #openstack-infra | 03:39 | |
*** slaweq has joined #openstack-infra | 03:40 | |
tonyb | prometheanfire: Yeah. Seems like gentoo and debian are the last distros in the camp of 'default init' != 'systemd' | 03:41 |
*** markvoelker has quit IRC | 03:42 | |
prometheanfire | tonyb: debian uses systemd :P | 03:42 |
prometheanfire | I actually don't have much of a problem with systemd, run it on my laptop | 03:42 |
tonyb | prometheanfire: by default? I thought it was sysv with an option for systemd? I don't mind being wrong | 03:43 |
*** ramishra has joined #openstack-infra | 03:43 | |
*** slaweq has quit IRC | 03:44 | |
tonyb | prometheanfire: Yeah, for me it isn't anything other than "$change seems to introduce a regression on systems that don't use systemd" Which systems are they and do we consider that a problem? | 03:44 |
tonyb | prometheanfire: by no means is it a value judgement on systemd | 03:45 |
*** coolsvap has joined #openstack-infra | 03:45 | |
*** cshastri has joined #openstack-infra | 03:46 | |
prometheanfire | we do have a sysv init flag, but that was just to map 'shutdown' to systemctl shutdown and the like | 03:47 |
prometheanfire | it can understand sysv init scripts iirc, but not sure | 03:47 |
fungi | tonyb: debian stable release before last (jessie) switched to systemd by default (except on non-linux arches like kfreebsd and hurd) but debian still mostly works if you install sysvinit as your default | 03:47 |
prometheanfire | funny thing is, I was part of the group that forked udev | 03:48 |
fungi | though the upgrade to jessie would keep your previous init default | 03:48 |
*** rossella_s has quit IRC | 03:48 | |
ianw | tonyb: presume relates to https://review.openstack.org/#/c/529976/2 ? | 03:48 |
tonyb | fungi: Ahh cool. I'm clearly out of date. | 03:48 |
prometheanfire | ya, there was a big hubbub about it and some debian people forked into another non-systemd distro | 03:48 |
tonyb | ianw: Yeah | 03:48 |
ianw | we've dropped centos6 era stuff ... that was also python2.6 which we don't code for | 03:48 |
ianw | my thinking was there isn't really a non-systemd case? | 03:49 |
*** askb_ has joined #openstack-infra | 03:49 | |
*** bobh has quit IRC | 03:49 | |
fungi | systemd is still at least partly a no-go on !linux kernels because upstream has in the past been adamant they rely on linux-kernel-specific features | 03:49 |
*** sree has joined #openstack-infra | 03:50 | |
prometheanfire | it's also a no-go on non-glibc systems | 05:50 |
fungi | prometheanfire: the non-systemd debian derivative you're thinking of is https://devuan.org/ | 03:50 |
prometheanfire | ya, that's it | 03:51 |
prometheanfire | I knew how to pronounce it but not spell it | 03:51 |
tonyb | ianw: I'm happy to drop my objection in that case. Was trusty systemd? or upstart? | 03:51 |
prometheanfire | embedded tends to do a bit of non-glibc (uclibc, musl) | 03:51 |
prometheanfire | which is why we forked udev into eudev | 03:51 |
tonyb | if trusty is systemd then my objection seems to be largely theoretical ;p | 03:52 |
prometheanfire | trusty is not systemd, xenial is | 03:52 |
prometheanfire | it might have some small systemd parts iirc, but mainly upstart | 03:52 |
fungi | yeah, trusty was/is very much upstart | 03:52 |
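For the "which systems don't use systemd" question above, a small detection sketch (Linux-only assumptions): systemd creates /run/systemd/system at boot, and otherwise the command name of PID 1 gives a hint (upstart and sysvinit both show up as "init").

```python
import os

def init_system():
    if os.path.isdir("/run/systemd/system"):   # documented systemd boot marker
        return "systemd"
    try:
        with open("/proc/1/comm") as f:
            return f.read().strip()            # e.g. "init" for upstart/sysvinit
    except OSError:
        return "unknown"

print(init_system())
```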
tonyb | prometheanfire, fungi: thanks. | 03:54 |
tonyb | ianw: so I guess that gives us a line in the sand. | 03:55 |
*** hongbin has quit IRC | 03:55 | |
*** askb_ has quit IRC | 03:55 | |
*** askb_ has joined #openstack-infra | 03:56 | |
tonyb | ianw: I guess that probably means it boils down to: is trusty too old? | 03:56 |
prometheanfire | 18.04 comes out 'soon' too | 03:56 |
tonyb | ianw: I don't really mind what the answer is :) | 03:56 |
ianw | hmm, yeah trusty has been pretty unloved for some time | 03:56 |
tonyb | oooh /me notices etcd 3.2 (x86, arm and ppc64el) in bionic :) | 03:57 |
*** edmondsw has joined #openstack-infra | 03:58 | |
fungi | also openafs 1.8! | 03:58 |
tonyb | fungi: :) | 03:58 |
*** olaph1 has quit IRC | 03:59 | |
*** olaph has joined #openstack-infra | 04:00 | |
*** gongysh has joined #openstack-infra | 04:01 | |
*** edmondsw has quit IRC | 04:02 | |
*** s-shiono has joined #openstack-infra | 04:03 | |
tonyb | Does this http://logs.openstack.org/22/535622/10/check/puppet-openstack-unit-4.8-centos-7/84927af/job-output.txt.gz#_2018-02-06_02_58_30_732623 look like a real problem or just a 'run recheck' problem? | 04:04 |
ianw | tonyb: fatal: could not create leading directories of '/etc/puppet/modules/powerdns': Permission denied | 04:05 |
ianw | i'd say that needs some help... | 04:05 |
tonyb | ianw: I was afraid you'd say that ;P | 04:06 |
*** askb_ has quit IRC | 04:06 | |
*** askb has quit IRC | 04:06 | |
*** askb has joined #openstack-infra | 04:06 | |
* tonyb does some research | 04:06 | |
*** askb has quit IRC | 04:07 | |
ianw | maybe it just needs a "become: " line for the task ...? | 04:07 |
*** askb has joined #openstack-infra | 04:07 | |
*** askb has quit IRC | 04:09 | |
*** askb has joined #openstack-infra | 04:09 | |
*** abelur__ has joined #openstack-infra | 04:11 | |
gongysh | hi, | 04:11 |
gongysh | I want to set up a multinode ci job for my project tacker | 04:11 |
*** gongysh has quit IRC | 04:11 | |
*** askb has quit IRC | 04:12 | |
*** askb has joined #openstack-infra | 04:13 | |
*** gongysh has joined #openstack-infra | 04:14 | |
*** caphrim007 has joined #openstack-infra | 04:15 | |
*** psachin has joined #openstack-infra | 04:26 | |
*** dsariel has quit IRC | 04:28 | |
*** gongysh has quit IRC | 04:30 | |
*** yamamoto has joined #openstack-infra | 04:35 | |
*** rossella_s has joined #openstack-infra | 04:38 | |
*** iyamahat has quit IRC | 04:41 | |
*** slaweq has joined #openstack-infra | 04:44 | |
*** harlowja has joined #openstack-infra | 04:46 | |
*** slaweq has quit IRC | 04:48 | |
*** olaph1 has joined #openstack-infra | 04:49 | |
*** olaph has quit IRC | 04:50 | |
*** sree_ has joined #openstack-infra | 04:53 | |
*** rosmaita has quit IRC | 04:53 | |
*** sree_ is now known as Guest77194 | 04:54 | |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:54 |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:55 |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:56 |
*** sree has quit IRC | 04:57 | |
*** zhurong has quit IRC | 05:07 | |
*** dhajare has joined #openstack-infra | 05:10 | |
*** harlowja has quit IRC | 05:11 | |
*** links has joined #openstack-infra | 05:12 | |
*** abelur__ has quit IRC | 05:12 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Pass at cleaning up the bootloader https://review.openstack.org/541129 | 05:13 |
*** pgadiya has joined #openstack-infra | 05:14 | |
*** HawK3r has joined #openstack-infra | 05:15 | |
*** janki has joined #openstack-infra | 05:15 | |
HawK3r | I just installed https://www.rdoproject.org/install/packstack/ packstack | 05:17 |
HawK3r | first time messing with openstack so I thought this would be a good way to get started | 05:17 |
HawK3r | I have a pretty beefy server I was using as an ESX host that would now be an OpenStack node | 05:17 |
HawK3r | one issue I am having is that I installed it with a DHCP-given IP address on the server. CentOS 7 to be specific | 05:17 |
HawK3r | I need to change the IP address and I did it from the 15-horizon_vhost.conf file | 05:17 |
HawK3r | I am able to load horizon in the browser but every time I try to log in I get 'Unable to establish connection to keystone endpoint.' | 05:17 |
HawK3r | If I go back to the original IP address everything works fine | 05:17 |
HawK3r | I'm thinking there is another place where the IP needs to be changed, perhaps another config file I'm missing | 05:17 |
HawK3r | Also, coming from VMware I am not getting how the management network and the whole network stack works on OpenStack, to be honest with you | 05:17 |
*** askb has quit IRC | 05:18 | |
*** askb has joined #openstack-infra | 05:18 | |
*** gongysh has joined #openstack-infra | 05:19 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 05:20 |
AJaeger | HawK3r: #openstack is the proper channel for such questions, this channel is for the infrastructure OpenStack uses to run CI etc. | 05:22 |
HawK3r | thanks | 05:22 |
*** HawK3r has left #openstack-infra | 05:23 | |
*** eumel8 has joined #openstack-infra | 05:25 | |
*** armaan has quit IRC | 05:25 | |
*** daidv has quit IRC | 05:26 | |
AJaeger | ianw, frickler, could you put https://review.openstack.org/541009 https://review.openstack.org/540971 https://review.openstack.org/540971 and https://review.openstack.org/540595 on your review queue, please? | 05:26 |
*** armaan has joined #openstack-infra | 05:28 | |
AJaeger | ianw, frickler, duplicate in there - I meant https://review.openstack.org/538344 as well, please | 05:28 |
*** olaph has joined #openstack-infra | 05:29 | |
*** sdake_ is now known as sdake | 05:30 | |
*** olaph1 has quit IRC | 05:30 | |
*** sree has joined #openstack-infra | 05:31 | |
*** Guest77194 has quit IRC | 05:34 | |
openstackgerrit | Merged openstack-infra/project-config master: add git timeout setting for clone_repo.sh https://review.openstack.org/541050 | 05:34 |
openstackgerrit | Merged openstack-infra/project-config master: add a retry loop to clone_repo.sh https://review.openstack.org/541051 | 05:37 |
*** dhajare has quit IRC | 05:38 | |
*** dhajare has joined #openstack-infra | 05:40 | |
*** wolverineav has joined #openstack-infra | 05:46 | |
*** edmondsw has joined #openstack-infra | 05:46 | |
*** abelur__ has joined #openstack-infra | 05:50 | |
*** edmondsw has quit IRC | 05:50 | |
*** wolverineav has quit IRC | 05:50 | |
*** abelur__ has quit IRC | 05:51 | |
*** abelur__ has joined #openstack-infra | 05:51 | |
*** sree_ has joined #openstack-infra | 05:54 | |
*** sree_ is now known as Guest54253 | 05:54 | |
*** wolverineav has joined #openstack-infra | 05:57 | |
*** sree has quit IRC | 05:58 | |
*** wolverin_ has joined #openstack-infra | 06:00 | |
*** gcb has quit IRC | 06:00 | |
*** wolverineav has quit IRC | 06:02 | |
*** slaweq has joined #openstack-infra | 06:02 | |
*** wolverin_ has quit IRC | 06:03 | |
*** wolverineav has joined #openstack-infra | 06:03 | |
*** abelur__ has quit IRC | 06:03 | |
*** abelur__ has joined #openstack-infra | 06:03 | |
*** e0ne has joined #openstack-infra | 06:04 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/541136 | 06:05 |
*** jchhatbar has joined #openstack-infra | 06:06 | |
*** janki has quit IRC | 06:06 | |
*** wolverineav has quit IRC | 06:07 | |
*** slaweq has quit IRC | 06:07 | |
*** aeng has quit IRC | 06:09 | |
*** askb has quit IRC | 06:09 | |
*** abelur__ has quit IRC | 06:09 | |
*** askb has joined #openstack-infra | 06:10 | |
*** ihrachys_ has quit IRC | 06:13 | |
*** d0ugal_ has quit IRC | 06:14 | |
*** wolverineav has joined #openstack-infra | 06:15 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove legacy-devstack-dsvm-py35-updown from devstack https://review.openstack.org/540707 | 06:18 |
*** wolverineav has quit IRC | 06:19 | |
*** d0ugal_ has joined #openstack-infra | 06:23 | |
openstackgerrit | Merged openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/541136 | 06:30 |
*** dhill__ has joined #openstack-infra | 06:31 | |
*** dhill_ has quit IRC | 06:33 | |
*** claudiub has joined #openstack-infra | 06:37 | |
*** threestrands has quit IRC | 06:40 | |
*** shu-mutou has quit IRC | 06:43 | |
*** yolanda has joined #openstack-infra | 06:43 | |
*** ianychoi_ has quit IRC | 06:47 | |
*** ianychoi_ has joined #openstack-infra | 06:48 | |
*** snapiri has joined #openstack-infra | 06:49 | |
*** rwsu has joined #openstack-infra | 07:01 | |
*** liujiong has quit IRC | 07:03 | |
*** liujiong has joined #openstack-infra | 07:03 | |
*** sree has joined #openstack-infra | 07:04 | |
*** HeOS has joined #openstack-infra | 07:07 | |
*** Guest54253 has quit IRC | 07:07 | |
*** zhurong has joined #openstack-infra | 07:09 | |
AJaeger | ianw: thanks for reviewing - any reason not to +A the nodepool change https://review.openstack.org/#/c/540595/ ? | 07:09 |
openstackgerrit | Merged openstack-infra/project-config master: Remove windmill-buildimages https://review.openstack.org/541009 | 07:10 |
*** gcb has joined #openstack-infra | 07:13 | |
openstackgerrit | Merged openstack-infra/openstack-zuul-jobs master: Remove legacy infra-ansible job https://review.openstack.org/540971 | 07:13 |
*** andreas_s has joined #openstack-infra | 07:14 | |
*** khappone has quit IRC | 07:17 | |
*** jchhatba_ has joined #openstack-infra | 07:20 | |
*** e0ne has quit IRC | 07:22 | |
*** jchhatbar has quit IRC | 07:23 | |
openstackgerrit | Merged openstack-infra/openstack-zuul-jobs master: Zuul: Remove project name https://review.openstack.org/541078 | 07:25 |
*** rcernin has quit IRC | 07:37 | |
*** khappone has joined #openstack-infra | 07:39 | |
*** iyamahat has joined #openstack-infra | 07:41 | |
ianw | AJaeger: not really, just haven't been super active in nodepool changes lately | 07:47 |
jlvillal | Not sure if just me but http://zuul.openstack.org/ isn't loading | 07:47 |
ianw | jlvillal: wfm ... | 07:47 |
ianw | try a hard refresh ... there was something we did with redirects this morning | 07:47 |
jlvillal | ianw, Ah! I thought CTRL-F5 was hard-refresh but I guess not. | 07:48 |
jlvillal | shift-click on reload icon did it in Firefox | 07:48 |
*** slaweq has joined #openstack-infra | 07:49 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Remove TripleO pipelines from grafana https://review.openstack.org/541165 | 07:51 |
AJaeger | ianw: we have 0 executors right now ;( | 07:51 |
AJaeger | ianw: check http://grafana.openstack.org/dashboard/db/zuul-status | 07:51 |
AJaeger | Did they all die? | 07:52 |
vivsoni_ | hi team, i am trying to create devstack newton | 07:52 |
AJaeger | infra-root ^ | 07:52 |
AJaeger | vivsoni_: devstack is a QA project, best ask on #openstack-qa | 07:52 |
vivsoni_ | AJaeger: ok.. thanks | 07:52 |
*** jpena|off is now known as jpena | 07:53 | |
jlvillal | ianw, Is http://zuul.openstack.org/ still working for you? Now it stopped for me :( | 07:55 |
jlvillal | AJaeger, would your '0 executors' thing impact http://zuul.openstack.org/ ? | 07:56 |
AJaeger | jlvillal: it's working for me - but something else is going on ;( | 07:56 |
jlvillal | AJaeger, Good reason for me to go to sleep :) Almost midnight here. | 07:57 |
AJaeger | jlvillal: it depends on what the problem is - and that needs an infra-root to investigate | 07:57 |
jlvillal | AJaeger, Thanks | 07:57 |
AJaeger | jlvillal: 9am here ;) Good night! | 07:57 |
*** sree has quit IRC | 07:59 | |
*** slaweq has quit IRC | 08:02 | |
*** alexchadin has joined #openstack-infra | 08:02 | |
*** pcichy has joined #openstack-infra | 08:03 | |
*** slaweq has joined #openstack-infra | 08:03 | |
*** kjackal has quit IRC | 08:10 | |
*** kjackal has joined #openstack-infra | 08:11 | |
*** jchhatba_ has quit IRC | 08:12 | |
*** pcichy has quit IRC | 08:12 | |
*** jchhatba_ has joined #openstack-infra | 08:12 | |
*** pcaruana has joined #openstack-infra | 08:14 | |
*** s-shiono has quit IRC | 08:15 | |
*** dhill_ has joined #openstack-infra | 08:17 | |
*** dhill__ has quit IRC | 08:18 | |
*** iyamahat has quit IRC | 08:18 | |
*** florianf has joined #openstack-infra | 08:19 | |
*** askb has quit IRC | 08:19 | |
*** askb_ has joined #openstack-infra | 08:20 | |
*** tesseract has joined #openstack-infra | 08:22 | |
*** ralonsoh has joined #openstack-infra | 08:23 | |
*** hashar has joined #openstack-infra | 08:24 | |
*** ianychoi_ has quit IRC | 08:27 | |
*** ianychoi_ has joined #openstack-infra | 08:28 | |
*** askb_ has quit IRC | 08:31 | |
*** abelur_ has joined #openstack-infra | 08:32 | |
*** iyamahat has joined #openstack-infra | 08:32 | |
*** kjackal has quit IRC | 08:34 | |
AJaeger | infra-root, looking at grafana, half of our ze0's are using 90+ % of memory. | 08:35 |
*** kjackal has joined #openstack-infra | 08:36 | |
*** armaan has quit IRC | 08:37 | |
*** armaan has joined #openstack-infra | 08:37 | |
*** iyamahat has quit IRC | 08:38 | |
*** zhurong has quit IRC | 08:38 | |
*** yamahata has quit IRC | 08:38 | |
*** dingyichen has quit IRC | 08:39 | |
jhesketh | AJaeger: taking a look | 08:40 |
AJaeger | thanks, jhesketh. I'm wondering whether we have a network or connection problem | 08:41 |
*** gongysh has quit IRC | 08:53 | |
AJaeger | jhesketh, infra-root, did we lose inap? http://grafana.openstack.org/dashboard/db/nodepool-inap | 08:53 |
jhesketh | AJaeger: that does look possible | 08:54 |
*** abelur_ has quit IRC | 08:55 | |
*** abelur_ has joined #openstack-infra | 08:55 | |
*** amoralej|off is now known as amoralej | 08:56 | |
*** priteau has joined #openstack-infra | 08:56 | |
*** d0ugal_ has quit IRC | 08:57 | |
*** alexchadin has quit IRC | 08:57 | |
*** d0ugal has joined #openstack-infra | 08:57 | |
*** d0ugal has quit IRC | 08:57 | |
*** d0ugal has joined #openstack-infra | 08:57 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Temporary disable inap https://review.openstack.org/541188 | 08:58 |
AJaeger | jhesketh: ^ | 08:58 |
*** alexchadin has joined #openstack-infra | 08:58 | |
AJaeger | jhesketh: want to force merge? | 08:58 |
AJaeger | waiting for nodes might take ages ;( | 08:58 |
*** jpich has joined #openstack-infra | 08:58 | |
ianw | 2018-02-06 08:59:09,862 INFO nodepool.CleanupWorker: ZooKeeper suspended. Waiting | 08:59 |
ianw | 2018-02-06 08:59:16,062 INFO nodepool.DeletedNodeWorker: ZooKeeper suspended. Waiting | 08:59 |
ianw | what's that mean? | 08:59 |
jhesketh | sorry, I'm still looking through logs, will check if inap is down then submit that... I'm not sure if we'll need to restart the executors stuck on those nodes | 08:59 |
*** rpittau has joined #openstack-infra | 09:01 | |
ianw | jhesketh / AJaeger : i've restarted nl03 ... i think that error is related to auth timing out maybe? | 09:02 |
*** zhurong has joined #openstack-infra | 09:03 | |
ianw | there was definitely some sort of blip, but it is now seeming to sync up inap | 09:03 |
*** askb has joined #openstack-infra | 09:03 | |
AJaeger | ianw: I have no idea what it could be ;( | 09:04 |
*** gongysh has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:04 | |
ianw | if i had to guess, there was an inap blip, and it made nl03 unhappy and it lost its connection to zk | 09:04 |
AJaeger | ianw: but now grafana shows different graphs for inap, so restarting seems to have helped | 09:04 |
* ianw handy-wavy ... | 09:05 | |
*** askb has quit IRC | 09:05 | |
*** e0ne has joined #openstack-infra | 09:05 | |
*** askb_ has joined #openstack-infra | 09:05 | |
* AJaeger hopes the executors recover... | 09:05 | |
ianw | they all seem to be processing? | 09:06 |
*** threestrands has joined #openstack-infra | 09:06 | |
*** jaosorior has quit IRC | 09:06 | |
ianw | but i agree ... why are they not reporting | 09:06 |
* jhesketh is glad ianw is here :-) | 09:08 | |
jhesketh | ianw: some of the executors aren't picking up work so they might need restarting | 09:10 |
AJaeger | Reading #zuul: Shrews wanted to restart executors today to pick up new changes... | 09:12 |
*** sree has joined #openstack-infra | 09:12 | |
* AJaeger needs to step out a bit | 09:13 | |
ianw | jhesketh: which ones? | 09:14 |
jhesketh | 1&3 at least, I haven't checked them all | 09:14 |
jhesketh | at a guess from grafana 6,7,8,9 | 09:14 |
*** dsariel has joined #openstack-infra | 09:15 | |
jhesketh | ianw: I don't have any evidence for why that might be an effect of inap though | 09:17 |
jhesketh | and therefore if it'll make a difference | 09:18 |
ianw | i think the problem with the stats might be graphite ... seems the disk is full | 09:18 |
ianw | /var/log/graphite/carbon-cache-a is full | 09:19 |
jhesketh | oh, good find | 09:19 |
*** threestrands has quit IRC | 09:21 | |
*** edmondsw has joined #openstack-infra | 09:22 | |
*** wxy has quit IRC | 09:22 | |
*** rossella_s has quit IRC | 09:23 | |
*** kashyap has joined #openstack-infra | 09:24 | |
*** rossella_s has joined #openstack-infra | 09:24 | |
*** edmondsw has quit IRC | 09:27 | |
ianw | i'm trying to copy it to the storage volume | 09:27 |
ianw | alright, i'm going to quickly reboot it, because things running out of disk ... i'm not sure what state it is in | 09:28 |
*** jaosorior has joined #openstack-infra | 09:29 | |
jhesketh | ack | 09:29 |
*** sshnaidm|bbl is now known as sshnaidm|rover | 09:30 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul-jobs master: role: Inject public keys in case of failure https://review.openstack.org/535803 | 09:30 |
ianw | #status log graphite.o.o disk full. move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06 | 09:31 |
ianw | #status log graphite.o.o disk full. move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06 and server rebooted | 09:32 |
openstackstatus | ianw: finished logging | 09:32 |
ianw | clarkb: ^ are you the expert on graphite? i feel like we need to do something there; there is still a 4gb console.log file in there that we need to manage i guess | 09:34 |
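A small sketch of the sort of check that finds the offender when a partition fills up like this; the path is the carbon-cache log directory named above, and the "top ten files" cutoff is just an example.

```python
import os
import shutil

path = "/var/log/graphite/carbon-cache-a"
total, used, free = shutil.disk_usage(path)
print("{:.0%} used, {:.1f} GiB free".format(used / total, free / 2**30))

# List the largest files under the log directory to decide what to
# rotate or move off to the cinder volume.
sizes = []
for root, _dirs, files in os.walk(path):
    for name in files:
        full = os.path.join(root, name)
        sizes.append((os.path.getsize(full), full))
for size, full in sorted(sizes, reverse=True)[:10]:
    print("{:8.1f} MiB  {}".format(size / 2**20, full))
```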
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Clean held nodes automatically after configurable timeout https://review.openstack.org/536295 | 09:36 |
*** olaph1 has joined #openstack-infra | 09:36 | |
*** kopecmartin has joined #openstack-infra | 09:36 | |
*** olaph has quit IRC | 09:38 | |
*** florianf has quit IRC | 09:43 | |
jhesketh | ianw: the queued jobs are still climbing, any objections to restarting zuul on 01 to see if it picks up new work? | 09:43 |
*** derekh has joined #openstack-infra | 09:45 | |
*** pcichy has joined #openstack-infra | 09:46 | |
openstackgerrit | Rico Lin proposed openstack-infra/irc-meetings master: Move Magnum irc meeting to #openstack-containers https://review.openstack.org/541210 | 09:49 |
ianw | jhesketh: not really, i'm looking at the logs and it seems to have taken a few jobs | 09:49 |
ianw | ahh, nahh it's not doing anything is it | 09:50 |
jhesketh | ianw: ze01? | 09:50 |
jhesketh | right | 09:50 |
jhesketh | ianw: I'll do a graceful restart | 09:50 |
*** dizquierdo has joined #openstack-infra | 09:54 | |
*** stakeda has quit IRC | 09:54 | |
*** florianf has joined #openstack-infra | 09:55 | |
jhesketh | oh, apparently graceful isn't implemented in v3? | 09:56 |
jhesketh | ianw: sending a stop, and apparently it had to abort a bunch of jobs: | 09:57 |
jhesketh | 2018-02-06 09:56:49,228 DEBUG zuul.AnsibleJob: [build: a956d5c2e21741298eecbc23da2a3443] Abort: no process is running | 09:57 |
jhesketh | so it looks like ansible died somehow? | 09:57 |
*** liujiong has quit IRC | 09:59 | |
tobiash | we maybe leak job workers under some circumstances | 09:59 |
tobiash | leaked not yet started job workers would be counted towards starting builds and thus explain the current behavior I see in the stats | 09:59 |
jhesketh | yep, I'd agree with that | 10:01 |
ianw | tobiash: oohhh, interesting. so basically a big global slowdown? | 10:01 |
ianw | that's what was a bit weird, things seemed to be happening ... slowly | 10:01 |
tobiash | ianw: either that or we leak job workers in the executor: http://grafana.openstack.org/dashboard/db/zuul-status?panelId=25&fullscreen | 10:01 |
tobiash | because what I expect is that starting jobs shouldn't increase monotonically | 10:02 |
jhesketh | the stop command appears to be hanging... If I kill the process what state will those builds end up in with the scheduler? | 10:03 |
jhesketh | I suspect they'll be stuck and we might need to do a scheduler restart | 10:03 |
tobiash | so either something is hanging the preparation (hanging git?) or we leaked job worker threads which are counted wrongly towards starting builds | 10:03 |
ianw | jhesketh: it can take a while to shutdown cleanly | 10:03 |
tobiash | jhesketh: a restart should at least get us going | 10:03 |
tobiash | I'm not sure if rescheduling of failed jobs is already fixed | 10:04 |
jhesketh | ianw: okay, I'll wait for the shutdown for a while | 10:04 |
*** amrith has left #openstack-infra | 10:04 | |
tobiash | but at least the queued up jobs should be running fine | 10:04 |
jhesketh | fyi, ze01 is in stopped state | 10:05 |
ianw | jhesketh: yeah, otherwise you can just kill it manually and restart, it *shouldn't* require scheduler restart | 10:05 |
tobiash | the scheduler shouldn't need a restart | 10:05 |
jhesketh | the scheduler will pick up that ze01 has lost builds if we kill it? | 10:05 |
tobiash | jhesketh: I'm not sure about lost builds, it should reschedule them, if not that's a bug | 10:06 |
*** janki has joined #openstack-infra | 10:06 | |
tobiash | and I think there was such a bug, but I don't know if that has been fixed | 10:06 |
jhesketh | ack | 10:06 |
tobiash | but at least the 700 queued jobs should run fine then | 10:07 |
tobiash | ;) | 10:07 |
jhesketh | I'll let the executor try and shut down cleanly still for a bit | 10:07 |
jhesketh | (it's still doing repo updates) | 10:07 |
*** jchhatba_ has quit IRC | 10:07 | |
ianw | jhesketh: sorry i gotta disappear; i'll leave it in your capable hands :) | 10:08 |
jhesketh | no worries (although not sure how capable!) | 10:08 |
jhesketh | thanks heaps for your help! | 10:08 |
*** hjensas has joined #openstack-infra | 10:12 | |
*** sree has quit IRC | 10:13 | |
*** sree has joined #openstack-infra | 10:14 | |
*** sree has quit IRC | 10:18 | |
*** nmathew has joined #openstack-infra | 10:19 | |
*** sree has joined #openstack-infra | 10:19 | |
*** alexchadin has quit IRC | 10:22 | |
*** sree has quit IRC | 10:23 | |
openstackgerrit | caishan proposed openstack-infra/irc-meetings master: Move Barbican irc meeting to #openstack-barbican https://review.openstack.org/541230 | 10:24 |
*** dtantsur|afk is now known as dtantsur | 10:25 | |
*** alexchadin has joined #openstack-infra | 10:25 | |
AJaeger | jhesketh: should we send #status notice Our Zuul infrastructure is currently experiencing some problems, we're investigating. Please do not approve or recheck changes for now. ? | 10:25 |
jhesketh | AJaeger: yep, perhaps a note that it'll be a little slow but should get through changes in time | 10:26 |
jhesketh | AJaeger: do you have status perms? | 10:26 |
AJaeger | yes, I have permissions | 10:27 |
jhesketh | AJaeger: I'm happy for you to do it unless you prefer me to | 10:28 |
AJaeger | #status notice Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating. Please do not approve or recheck changes for now. | 10:28 |
openstackstatus | AJaeger: sending notice | 10:28 |
AJaeger | jhesketh: done myself - thanks | 10:28 |
-openstackstatus- NOTICE: Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating. Please do not approve or recheck changes for now. | 10:29 | |
*** alexchadin has quit IRC | 10:30 | |
*** alexchadin has joined #openstack-infra | 10:30 | |
*** alexchadin has quit IRC | 10:30 | |
jhesketh | I think it looks like ze01 is slowly shutting down, so I'm going to give it some time (but it is /very/ slow) | 10:31 |
*** alexchadin has joined #openstack-infra | 10:31 | |
openstackstatus | AJaeger: finished sending notice | 10:31 |
*** alexchadin has quit IRC | 10:31 | |
jhesketh | after that, if restarting fixes things I can do the other executors | 10:31 |
*** kashyap has left #openstack-infra | 10:32 | |
*** alexchadin has joined #openstack-infra | 10:32 | |
*** alexchadin has quit IRC | 10:32 | |
*** alexchadin has joined #openstack-infra | 10:33 | |
*** alexchadin has quit IRC | 10:33 | |
*** dtruong has quit IRC | 10:35 | |
*** dtruong has joined #openstack-infra | 10:35 | |
AJaeger | wow, that takes long ;( | 10:36 |
jhesketh | I can do the others in parallel if it is the cause | 10:38 |
AJaeger | yep | 10:40 |
*** kjackal has quit IRC | 10:40 | |
*** ldnunes has joined #openstack-infra | 10:41 | |
*** kjackal has joined #openstack-infra | 10:41 | |
*** wolverineav has joined #openstack-infra | 10:42 | |
*** andreas_s has quit IRC | 10:43 | |
*** andreas_s has joined #openstack-infra | 10:47 | |
jhesketh | so it's stuck processing the update_queue (for getting git changes etc) but it's going very slowly | 10:48 |
jhesketh | unless it's doing whole clones (which it shouldn't) I can't see why that might be the case | 10:48 |
jhesketh | tobiash: any thoughts why the update_queue might be going so slow? ^ | 10:50 |
tobiash | jhesketh: currently lunching | 10:51 |
*** fverboso has joined #openstack-infra | 10:51 | |
jhesketh | no worries | 10:51 |
*** lucas-afk is now known as lucasagomes | 10:52 | |
*** sambetts|afk is now known as sambetts | 10:54 | |
*** andreas_s has quit IRC | 10:57 | |
*** andreas_s has joined #openstack-infra | 10:58 | |
*** yamamoto has quit IRC | 10:59 | |
AJaeger | jhesketh: I'm getting timeouts from jobs that finish ;( | 10:59 |
* AJaeger joins tobiash for virtual ;) lunch now | 11:00 | |
jhesketh | AJaeger: oh, link please? | 11:01 |
jhesketh | (when you return) | 11:01 |
*** gfidente has quit IRC | 11:02 | |
*** yamamoto has joined #openstack-infra | 11:03 | |
*** eyalb has joined #openstack-infra | 11:05 | |
*** gfidente has joined #openstack-infra | 11:05 | |
*** gfidente has joined #openstack-infra | 11:05 | |
tobiash | jhesketh: hrm, the update_queue does only resetting, fetching and cleaning repos | 11:06 |
tobiash | is the connection to gerrit slow? | 11:07 |
*** andreas_s has quit IRC | 11:07 | |
*** andreas_s has joined #openstack-infra | 11:08 | |
*** tosky has joined #openstack-infra | 11:08 | |
tobiash | jhesketh: hasn't some connection firewall limit been merged recently for gerrit? | 11:08 |
tobiash | maybe that's a side effect | 11:09 |
jhesketh | connection seems fine, I can clone a repo very fast on the executor so it's probably also not a firewall | 11:10 |
*** edmondsw has joined #openstack-infra | 11:10 | |
* jhesketh will be back shortly | 11:10 | |
tobiash | then I'm currently out of ideas | 11:11 |
*** alexchadin has joined #openstack-infra | 11:11 | |
*** edmondsw has quit IRC | 11:15 | |
*** rfolco|off is now known as rfolco|ruck | 11:16 | |
*** andreas_s has quit IRC | 11:22 | |
*** andreas_s has joined #openstack-infra | 11:23 | |
*** nicolasbock has joined #openstack-infra | 11:26 | |
*** andreas_s has quit IRC | 11:27 | |
*** andreas_s has joined #openstack-infra | 11:27 | |
* jhesketh returns | 11:31 | |
jhesketh | so it's still processing the update_queue. I think I'm going to kill the process now | 11:38 |
*** nmathew has quit IRC | 11:38 | |
*** larainema has quit IRC | 11:40 | |
*** links has quit IRC | 11:41 | |
*** pcichy has quit IRC | 11:48 | |
jhesketh | it's gone back to having a very slow process_queue (although actually accepting builds afaict since I started it again) | 11:49 |
jhesketh | git-upload-pack's are taking minutes | 11:49 |
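One way to quantify the "git-upload-pack is taking minutes" observation is to time a remote ref listing from a suspect executor and a healthy one and compare; `git ls-remote` drives git-upload-pack on the server side. The Gerrit URL form and the repo names here are only examples.

```python
import subprocess
import time

GERRIT = "https://review.openstack.org"
REPOS = ["openstack-infra/zuul", "openstack/nova"]   # example projects

for repo in REPOS:
    start = time.monotonic()
    subprocess.run(
        ["git", "ls-remote", "{}/{}".format(GERRIT, repo)],
        stdout=subprocess.DEVNULL, check=True,
    )
    print("{}: {:.1f}s".format(repo, time.monotonic() - start))
```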
*** sree has joined #openstack-infra | 11:51 | |
*** sree_ has joined #openstack-infra | 11:51 | |
*** sree_ is now known as Guest61310 | 11:52 | |
openstackgerrit | Merged openstack-infra/nodepool master: Convert nodepool-zuul-functional job https://review.openstack.org/540595 | 11:53 |
*** links has joined #openstack-infra | 11:54 | |
*** tpsilva has joined #openstack-infra | 11:55 | |
*** sree has quit IRC | 11:55 | |
*** gongysh has quit IRC | 11:57 | |
*** maciejjozefczyk has joined #openstack-infra | 11:59 | |
tobiash | jhesketh: git-upload-pack is server side, so it might be worth checking gerrit as well | 12:00 |
AJaeger | jhesketh: https://review.openstack.org/538508 has the timeout | 12:01 |
d0ugal | Is there any way to search logs for a particular CI job? I want to find when an exception first started | 12:05 |
jhesketh | tobiash: yep, a cursory glance doesn't show anything though and the other ze's appear to be working okay | 12:05 |
* jhesketh wonders if that's since changed | 12:05 | |
jhesketh | AJaeger: thanks... looks like there might be some more network issues... maybe these are affecting process queue | 12:06 |
AJaeger | jhesketh: https://review.openstack.org/539854 has one timeout as well - and post failures | 12:06 |
AJaeger | d0ugal: which job? | 12:06 |
jhesketh | d0ugal: http://logstash.openstack.org/ might help | 12:06 |
AJaeger | http://zuul.openstack.org/jobs.html helps | 12:06 |
AJaeger | and then click on Builds - and search for project | 12:07 |
d0ugal | AJaeger: a tripleo one - tripleo-ci-centos-7-scenario003-multinode-oooq-container | 12:07 |
d0ugal | cool, I shall try both of these | 12:07 |
d0ugal | I don't want to know how much time I have wasted manually looking through logs... I should have asked before! | 12:07 |
AJaeger | http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container8 | 12:08 |
AJaeger | without the 8 at then end - http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container | 12:09 |
d0ugal | Thanks | 12:09 |
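The builds listing AJaeger links is also served as JSON by zuul-web (the builds.json controller mentioned earlier), which makes "when did this job first start failing" questions scriptable. The exact endpoint and the field names are assumptions that may differ per deployment, so this sketch just dumps the first record for inspection:

```python
import json
from urllib.request import urlopen

url = ("http://zuul.openstack.org/builds.json"
       "?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container")
builds = json.loads(urlopen(url, timeout=30).read().decode())
print(len(builds), "builds returned")
if builds:
    print(json.dumps(builds[0], indent=2))   # inspect which fields are available
```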
*** electrical has quit IRC | 12:13 | |
*** electrical has joined #openstack-infra | 12:13 | |
*** Guest61310 has quit IRC | 12:14 | |
d0ugal | I guess I still need to manually look at the files :) | 12:15 |
*** sree has joined #openstack-infra | 12:15 | |
AJaeger | you can run queries with logstash | 12:16 |
d0ugal | I am trying to figure out how to do that :) | 12:18 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:19 |
*** sree has quit IRC | 12:19 | |
AJaeger | d0ugal: can't help with that - you might want to ask qa team | 12:20 |
*** edwarnicke has quit IRC | 12:21 | |
*** edwarnicke has joined #openstack-infra | 12:22 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add a separate module for node management commands https://review.openstack.org/536303 | 12:25 |
*** fkautz has quit IRC | 12:25 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 12:26 |
*** HeOS has quit IRC | 12:26 | |
*** sshnaidm|rover is now known as sshnaidm|afk | 12:27 | |
*** hashar is now known as hasharAway | 12:28 | |
*** rossella_s has quit IRC | 12:29 | |
*** zoli is now known as zoli\ | 12:30 | |
*** zoli\ is now known as zoli | 12:30 | |
*** fverboso has quit IRC | 12:31 | |
*** rossella_s has joined #openstack-infra | 12:32 | |
*** rossella_s has quit IRC | 12:38 | |
*** rossella_s has joined #openstack-infra | 12:39 | |
*** jpena is now known as jpena|lunch | 12:51 | |
*** rossella_s has quit IRC | 12:52 | |
*** rossella_s has joined #openstack-infra | 12:52 | |
*** coolsvap has quit IRC | 12:54 | |
*** gongysh has joined #openstack-infra | 12:54 | |
*** links has quit IRC | 12:55 | |
*** jlabarre has joined #openstack-infra | 12:55 | |
*** panda|off is now known as panda | 12:55 | |
*** eharney has joined #openstack-infra | 12:58 | |
*** zhurong_ has joined #openstack-infra | 13:01 | |
*** rossella_s has quit IRC | 13:02 | |
*** larainema has joined #openstack-infra | 13:03 | |
*** rossella_s has joined #openstack-infra | 13:05 | |
*** zhurong has quit IRC | 13:08 | |
*** links has joined #openstack-infra | 13:08 | |
*** jbadiapa has joined #openstack-infra | 13:09 | |
*** edmondsw has joined #openstack-infra | 13:13 | |
*** jbadiapa has quit IRC | 13:14 | |
*** olaph has joined #openstack-infra | 13:16 | |
*** cshastri has quit IRC | 13:16 | |
*** olaph1 has quit IRC | 13:17 | |
*** amoralej is now known as amoralej|lunch | 13:19 | |
*** alexchadin has quit IRC | 13:22 | |
*** alexchadin has joined #openstack-infra | 13:23 | |
*** jaosorior has quit IRC | 13:23 | |
Shrews | AJaeger: I'm not going to restart the launchers now given the current situation. Don't want to add fuel to the fire. | 13:25 |
*** dayou has quit IRC | 13:26 | |
*** dave-mccowan has joined #openstack-infra | 13:29 | |
*** dhajare has quit IRC | 13:30 | |
*** rossella_s has quit IRC | 13:30 | |
AJaeger | Shrews: no idea what's going on ;( | 13:31 |
AJaeger | jhesketh: are you still around? Anything to share here? | 13:31 |
AJaeger | Shrews: looks like debugging/investigation is needed to move us forward - help welcome I guess | 13:32 |
jhesketh | AJaeger: still around but about to head off | 13:32 |
*** rossella_s has joined #openstack-infra | 13:32 | |
jhesketh | Shrews, infra-root: unfortunately I've been unable to make any more progress (see scrollback) and have to head off. As best as I can tell on some ze's they are taking a very long time to perform git operations | 13:34 |
jhesketh | I've restarted ze01 with no effect | 13:34 |
jhesketh | Some executors don't appear affected and resource usage on the common hosts (zuul, nodepool, gerrit etc) all look normal | 13:34 |
*** dave-mccowan has quit IRC | 13:35 | |
*** dave-mcc_ has joined #openstack-infra | 13:36 | |
*** hemna_ has joined #openstack-infra | 13:36 | |
*** pgadiya has quit IRC | 13:40 | |
*** rossella_s has quit IRC | 13:41 | |
fungi | https://rackspace.service-now.com/system_status/ doesn't indicate any widespread issues they're aware of | 13:42 |
fungi | still catching up (is there a summary?) but resource utilization on ze01 looks normal or even low | 13:43 |
*** rossella_s has joined #openstack-infra | 13:44 | |
*** janki has quit IRC | 13:44 | |
AJaeger | fungi: see jhesketh's last lines - and review http://grafana.openstack.org/dashboard/db/zuul-status . None of the ze's is accepting ;( | 13:45 |
jhesketh | Actually I think they are. Just very very very slowly | 13:46 |
AJaeger | so, slower than a snail ;( | 13:46 |
fungi | first guess is maybe this is the effect of the new throttle | 13:46 |
fungi | does the pace match the rate corvus described? | 13:47 |
jhesketh | fungi: git-upload-pack is taking minutes on a few executor hosts, causing the jobs to take forever to prepare and never keeping up with the work | 13:47 |
*** jpena|lunch is now known as jpena | 13:47 | |
fungi | ahh | 13:47 |
fungi | is that a subprocess of ansible pushing prepared repositories to the job nodes? any particular repos? | 13:47 |
jhesketh | Well that's my analysis of it. It's very likely I'm wrong and chasing a red herring | 13:48 |
jhesketh | No I think it's the executor mergers preparing the repos for config changes and cache etc | 13:48 |
fungi | not (yet) at a machine where i can pull up graphs easily. is there a list of which executors are exhibiting this behavior and which aren't? | 13:49 |
jhesketh | You can see the process in ps. The time it is running for seems extreme but maybe it's correct | 13:49 |
fungi | are any of the standalone mergers also having similar trouble? | 13:49 |
jhesketh | I've stepped away so I can't give you a list sorry. ze01 is affected and for comparison ze02 isn't | 13:50 |
jhesketh | From memory I think 3,5,6,7,8 are also affected | 13:50 |
jhesketh | fungi: great question, I didn't think to check sorry | 13:50 |
fungi | no problem. that helps, thanks! | 13:51 |
*** dbecker has joined #openstack-infra | 13:51 | |
jhesketh | fwiw git checkout on a suffering host (from gerrit) worked just fine | 13:51 |
*** rossella_s has quit IRC | 13:52 | |
*** zhurong_ has quit IRC | 13:52 | |
jhesketh | No worries, sorry I couldn't get further or stick around | 13:52 |
*** eumel8 has quit IRC | 13:52 | |
*** dizquierdo has quit IRC | 13:52 | |
*** sshnaidm|afk is now known as sshnaidm|rover | 13:52 | |
fungi | yeah, for now i'm not going to assume long-running git-upload-pack processes are atypical | 13:52 |
*** jcoufal has joined #openstack-infra | 13:52 | |
fungi | the governor changes also added some statsd counters/gauges | 13:53 |
*** esberglu has joined #openstack-infra | 13:53 | |
fungi | in a few minutes i should be in a better place to start looking into those if no other infra-root beats me to it | 13:53 |
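The statsd counters/gauges fungi mentions use the plain statsd wire protocol: one UDP datagram per metric. A minimal emitter, with a made-up metric name just to show the format:

```python
import socket

def send_gauge(name, value, host="localhost", port=8125):
    payload = "{}:{}|g".format(name, value).encode()   # "|c" would be a counter
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

send_gauge("zuul.example.starting_builds", 42)   # hypothetical metric name
```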
*** rossella_s has joined #openstack-infra | 13:54 | |
Shrews | well, i'm not sure what to look for. so far on ze01, i see a lot of ssh unreachable errors for inap nodes, but not sure if that's related | 13:56 |
AJaeger | http://grafana.openstack.org/dashboard/db/nodepool-inap - inap might have had some hiccups | 13:57 |
AJaeger | Shrews: so ianw restarted nl03 and then it continued to work. | 13:58 |
AJaeger | There's - according to graphs - a correlation to the inap hiccup (not sure whether at inap, or the network to it) | 13:59 |
*** hrw has quit IRC | 13:59 | |
pabelanger | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 14:00 |
*** jamesmcarthur has joined #openstack-infra | 14:00 | |
pabelanger | we are also just about out of memory on zuul.o.o. | 14:00 |
AJaeger | pabelanger: wasn't when this started - but with this long backlog (going on since around 7:00 UTC)... | 14:01 |
*** hrw has joined #openstack-infra | 14:01 | |
*** oidgar has joined #openstack-infra | 14:03 | |
*** rossella_s has quit IRC | 14:04 | |
*** jaosorior has joined #openstack-infra | 14:04 | |
*** psachin has quit IRC | 14:04 | |
AJaeger | pabelanger: that should reduce once we process jobs in "normal" speed | 14:05 |
*** rossella_s has joined #openstack-infra | 14:06 | |
fungi | well, it won't really "reduce" visibly since the python interpreter won't really free memory allocations back to the system | 14:06 |
AJaeger | mnaser: let's not approve changes for now, please - we have a really slow Zuul right now... | 14:06 |
fungi | but it'll stop growing, probably for a very long time until the next time it needs more than that amount of memory | 14:06 |
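For watching that "stops growing" behaviour from inside a Python process, the peak resident set size is available from the stdlib resource module (on Linux, ru_maxrss is reported in kilobytes):

```python
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: {:.1f} MiB".format(peak_kb / 1024))
```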
mnaser | AJaeger: sorry, ack | 14:07 |
AJaeger | fungi: yeah, we might do a restart at the end | 14:07 |
pabelanger | sadly, I'm not in a good spot to help debug this morning. | 14:07 |
pabelanger | Hope to be back into things in next 90mins or so | 14:08 |
*** jrist has quit IRC | 14:11 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 14:11 |
*** rossella_s has quit IRC | 14:11 | |
*** jrist has joined #openstack-infra | 14:12 | |
*** rossella_s has joined #openstack-infra | 14:12 | |
fungi | infra-root: ovh has also asked us to stop using any of their regions (due to customer impact) and check back with them in q3 | 14:13 |
Shrews | fwiw, i'm not seeing any unusual nodepool behavior, even on the launcher ianw restarted (though i did see a minor bug in our logs) | 14:13 |
Shrews | fungi: :( | 14:13 |
fungi | looks like they e-mailed that to me and clarkb just now | 14:14 |
*** andreas_s has quit IRC | 14:15 | |
pabelanger | fungi: ack, sad to see them turned off but understand. | 14:15 |
*** andreas_s has joined #openstack-infra | 14:15 | |
*** efried has joined #openstack-infra | 14:15 | |
*** Goneri has joined #openstack-infra | 14:16 | |
*** andreas_s has quit IRC | 14:18 | |
*** mriedem has joined #openstack-infra | 14:18 | |
*** andreas_s has joined #openstack-infra | 14:18 | |
*** rosmaita has joined #openstack-infra | 14:20 | |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 14:20 |
*** oidgar has quit IRC | 14:22 | |
*** amoralej|lunch is now known as amoralej | 14:24 | |
openstackgerrit | Merged openstack-infra/project-config master: Revert "base-test: test new mirror-workspace role" https://review.openstack.org/540949 | 14:28 |
*** jamesmcarthur has quit IRC | 14:29 | |
*** tiswanso has joined #openstack-infra | 14:30 | |
*** jamesmcarthur has joined #openstack-infra | 14:30 | |
*** mihalis68 has joined #openstack-infra | 14:30 | |
openstackgerrit | Liam Young proposed openstack-infra/project-config master: Add vault charm to gerrit https://review.openstack.org/541287 | 14:31 |
fungi | infra-root: sorry, i misread. not ovh but citycloud, so not nearly as big of a hit | 14:32 |
*** lucasagomes is now known as lucas-hungry | 14:34 | |
*** david-lyle has quit IRC | 14:34 | |
*** fverboso has joined #openstack-infra | 14:35 | |
openstackgerrit | Liam Young proposed openstack-infra/project-config master: Add vault charm to gerrit https://review.openstack.org/541287 | 14:35 |
*** jamesmcarthur has quit IRC | 14:35 | |
*** dtantsur is now known as dtantsur|bbl | 14:37 | |
*** rossella_s has quit IRC | 14:37 | |
*** links has quit IRC | 14:38 | |
dmsimard | fungi: it's not the first time citycloud has asked us to hold back, right? | 14:39 |
dmsimard | fungi: I mean, is it no-op or do we have to tune some things down still ? | 14:39 |
*** rossella_s has joined #openstack-infra | 14:40 | |
fungi | dmsimard: there are still a couple regions at max-servers: 50 and also we should probably pause image uploads since they say they want us to stop until "q3" which i take to mean check with them again on july 1 | 14:40 |
dmsimard | depends if it's q3 calendar or q3 fiscal, by default I guess it's calendar ? | 14:41 |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 14:46 |
*** rossella_s has quit IRC | 14:47 | |
*** alexchadin has quit IRC | 14:48 | |
*** rossella_s has joined #openstack-infra | 14:50 | |
*** dbecker has quit IRC | 14:51 | |
*** olaph1 has joined #openstack-infra | 14:51 | |
*** olaph has quit IRC | 14:53 | |
*** sweston has quit IRC | 14:54 | |
*** sweston has joined #openstack-infra | 14:54 | |
*** derekjhyang has quit IRC | 14:55 | |
*** davidlenwell has quit IRC | 14:55 | |
*** dbecker has joined #openstack-infra | 14:56 | |
*** derekjhyang has joined #openstack-infra | 14:56 | |
*** davidlenwell has joined #openstack-infra | 14:56 | |
*** kiennt26 has joined #openstack-infra | 14:57 | |
*** hasharAway is now known as hashar | 15:00 | |
*** olaph1 is now known as olaph | 15:00 | |
*** jamesmcarthur has joined #openstack-infra | 15:02 | |
*** ihrachys has joined #openstack-infra | 15:03 | |
fungi | i'm getting lost in the maze of nodepool configs and yaml anchors now. checking the nodepool docs to figure out where i need to add pause directives | 15:04 |
Shrews | fungi: the diskimage section | 15:07 |
fungi | sure, just trying to figure out which one(s) | 15:07 |
*** HeOS has joined #openstack-infra | 15:07 | |
*** fkautz has joined #openstack-infra | 15:07 | |
fungi | to pause all image uploads for citycloud, do i need to add pause to each of the citycloud diskimages, and does that need to be done both in nodepool.yaml and nl02.yaml? | 15:08 |
fungi | or do the image configurations on the launchers not matter to the builders i guess? | 15:08 |
Shrews | fungi: image uploads only happen on nb0* | 15:09 |
fungi | well, that wasn't my question. i'll check puppet to see if the nl0*.yaml configs get installed on the nb0* hosts | 15:10 |
Shrews | doesn't matter what the launchers have | 15:10 |
Shrews | fungi: oh sorry | 15:10 |
fungi | mainly trying to figure out which specific config files i need to update for this | 15:10 |
fungi | and whether this means undoing a bunch of the yaml anchor business in nodepool.yaml (i'm betting it does) | 15:11 |
*** mnaser has quit IRC | 15:11 | |
*** mnaser has joined #openstack-infra | 15:12 | |
*** dtantsur|bbl is now known as dtantsur | 15:14 | |
*** mgkwill has quit IRC | 15:14 | |
Shrews | fungi: oh hrm... i'm not familiar with how the anchors work | 15:15 |
dmsimard | infra-root: FYI there will be a 0.14.6 release of ARA to workaround an issue where Ansible sometimes passes ignore_errors as a non-boolean value | 15:15 |
*** mgkwill has joined #openstack-infra | 15:15 | |
*** rossella_s has quit IRC | 15:16 | |
*** jamesmca_ has joined #openstack-infra | 15:18 | |
*** rossella_s has joined #openstack-infra | 15:19 | |
*** hrybacki has quit IRC | 15:19 | |
Shrews | fungi: if it helps, the nl0*.yaml configs are not installed on the nb0* instances. so i think we only need to deal with nodepool.yaml with the anchor business | 15:19 |
*** hrybacki has joined #openstack-infra | 15:19 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Disable citycloud in nodepool https://review.openstack.org/541307 | 15:20 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add separate modules for management commands https://review.openstack.org/536303 | 15:20 |
fungi | Shrews: thanks for confirming! that's what i figured as well | 15:20 |
fungi | see 541307 | 15:20 |
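As background for 541307: image uploads run only on the nb0* builders, so pausing them means setting `pause: true` on each citycloud diskimage entry in nodepool.yaml; the launcher configs (nl0*.yaml) are left alone. A minimal sketch of the shape of such a stanza, with illustrative provider/image names rather than the real layout:

```shell
# Illustrative only -- not the actual nodepool.yaml anchors or provider names.
cat <<'EOF'
providers:
  - name: citycloud-sto2
    diskimages:
      - name: ubuntu-xenial
        pause: true    # builder keeps the image but stops uploading it here
EOF
```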
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 15:20 |
pabelanger | okay, ready to help now | 15:21 |
pabelanger | we still seeing slowness in zuul? | 15:21 |
fungi | pabelanger: i expect so. also 541307 is a quick review but semi-urgent so i can reply to citycloud | 15:22 |
pabelanger | yah, just looking now | 15:23 |
*** bobh has joined #openstack-infra | 15:23 | |
*** trown is now known as trown|outtypewww | 15:23 | |
pabelanger | +2 | 15:24 |
*** iyamahat has joined #openstack-infra | 15:25 | |
*** caphrim007 has quit IRC | 15:25 | |
Shrews | fungi: +2'd. can approve now if you wish, or wait for others to look | 15:26 |
*** olaph1 has joined #openstack-infra | 15:27 | |
fungi | Shrews: let's go ahead and approve if you're okay with the configuration. the sooner it merges, the sooner i can let the provider know | 15:27 |
pabelanger | http://grafana.openstack.org/dashboard/db/zuul-status is a little confusing to read, but looking at ze03 now | 15:27 |
*** olaph has quit IRC | 15:27 | |
pabelanger | I do see route issues on ze03.o.o at 2018-02-06 06:47:13,066 stderr: 'ssh: connect to host review.openstack.org port 29418: No route to host | 15:28 |
AJaeger | pabelanger: check http://grafana.openstack.org/dashboard/db/zuul-status?from=1517844541494&to=1517930941494 | 15:29 |
pabelanger | I also see ansible SSH errors | 15:29 |
pabelanger | 2018-02-06 14:35:07,097 DEBUG zuul.AnsibleJob: [build: d1d72c14bb2b47eca956568b37af9a49] Ansible output: b' "msg": "SSH Error: data could not be sent to remote host \\"104.130.72.49\\". Make sure this host can be reached over ssh",' | 15:29 |
AJaeger | that nicely shows the executors not accepting jobs since 7:22 | 15:30 |
pabelanger | however, I think we might have leaked jobs because | 15:30 |
pabelanger | 2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.0 | 15:30 |
pabelanger | however, there are no ansible-playbooks process running | 15:30 |
*** lucas-hungry is now known as lucasagomes | 15:30 | |
AJaeger | fungi: it might take *hours* to merge that change with the current slowness | 15:30 |
fungi | pabelanger: on ze03 i see that telnet to any open port on review.o.o hangs when using ipv6, but works over ipv4 | 15:31 |
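A sketch of the manual reachability check being described, assuming nc and ping6 are available on the executors:

```shell
# From an executor, compare IPv4 vs IPv6 reachability of Gerrit's SSH port.
nc -4 -zv -w 5 review.openstack.org 29418   # succeeds everywhere
nc -6 -zv -w 5 review.openstack.org 29418   # hangs/times out on the broken executors
ping6 -c 3 review.openstack.org             # also fails where v6 routing is broken
```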
fungi | AJaeger: i know | 15:31 |
pabelanger | fungi: k | 15:31 |
*** kiennt26 has quit IRC | 15:31 | |
pabelanger | I'm going to look at zuul source code for a moment | 15:31 |
fungi | AJaeger: but if it's at least approved then as soon as we get zuul back on track i can enqueue it into the gate to expedite it | 15:31 |
AJaeger | fungi: sure! | 15:32 |
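The expedite step fungi mentions would look roughly like this on the scheduler; the tenant name matches this deployment, but the patchset number is an assumption:

```shell
# Promote the already-approved change straight into the gate pipeline.
zuul enqueue --tenant openstack --trigger gerrit \
    --pipeline gate --project openstack-infra/project-config --change 541307,1
```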
fungi | pabelanger: from ze10, on the other hand, i can connect to review.openstack.org via both ipv4 and ipv6 | 15:32 |
mordred | fungi: what can I do to help? | 15:32 |
mordred | oh jeez. we're having routing issues? | 15:33 |
fungi | mordred: see what oddities you can spot with the misbehaving executors, i think | 15:33 |
fungi | mordred: that looks like one possible explanation | 15:33 |
pabelanger | yah, think so | 15:33 |
pabelanger | but, also at 20 running builds, at least the executor thinks that | 15:33 |
pabelanger | but no ansible-playbook processes are running | 15:34 |
tobiash | pabelanger: 20 starting builds | 15:34 |
tobiash | that means pre-ansible git and workspace preparation | 15:34 |
tobiash | so that could be explained by the routing issues maybe | 15:35 |
pabelanger | tobiash: where did you see that? | 15:35 |
mordred | oh - yah - if the mergers are having trouble cloning from gerrit due to routing issues | 15:35 |
*** jungleboyj has quit IRC | 15:36 | |
tobiash | pabelanger: wasn't that the starting builds metric earlier today? | 15:36 |
*** jungleboyj has joined #openstack-infra | 15:36 | |
*** sree has joined #openstack-infra | 15:36 | |
tobiash | pabelanger: 2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.0 | 15:36 |
*** aviau has quit IRC | 15:36 | |
pabelanger | yah | 15:37 |
*** aviau has joined #openstack-infra | 15:37 | |
tobiash | this means 20 builds in pre-ansible phase which is triggered by the slow start change | 15:37 |
fungi | ze01, ze03, ze06, ze07, ze08 and ze09 can't reach tcp services on review.o.o over ipv6. ze02, ze04, ze05 and ze10 can | 15:37 |
*** med_ has quit IRC | 15:38 | |
pabelanger | mordred: tobiash: I wonder if we are not resetting starting_builds in that case | 15:38 |
fungi | i'm checking routing for mergers next | 15:38 |
corvus | maybe we should switch the executors to ipv4 only? | 15:38 |
corvus | do we have any ipv6-only clouds right now? | 15:38 |
*** yamamoto has quit IRC | 15:39 | |
mordred | not to my knowledge | 15:39 |
fungi | that might be a viable stop-gap. we could just unconfigure the v6 addresses on them for the moment | 15:39 |
tobiash | pabelanger: that would mean we can leak job_workers in the executor which would be a zuul bug | 15:39 |
pabelanger | tobiash: I haven't confirmed yet, trying to understand how the new logic works | 15:40 |
*** armaan has quit IRC | 15:40 | |
fungi | worth noting, none of the standalone mergers (zm01-zm08) are exhibiting this ipv6 routing issue | 15:40 |
*** armaan has joined #openstack-infra | 15:41 | |
tobiash | pabelanger: that's entirely possible but I didn't spot such a code path in my 5min code scraping this morning | 15:41 |
pabelanger | that is good news | 15:41 |
*** sree has quit IRC | 15:41 | |
*** yamamoto has joined #openstack-infra | 15:42 | |
mordred | fungi: I wonder if reconfiguring the interfaces on one of the broken executors would have any impact ... oh, wait - these are rackspace so have static IP information ... | 15:42 |
fungi | i wonder if rebooting them may "fix" it | 15:42 |
mordred | I was thinking something might be wrong with RA... but since it's static config I think that's less likely | 15:43 |
fungi | i've seen traffic cease making it to instances in rackspace in the past, and a hard reboot gets routing reestablished in whatever their network gear is | 15:43 |
fungi | though in those cases it's usually been both v4 and v6 at the same time | 15:43 |
*** eyalb has left #openstack-infra | 15:44 | |
mordred | the ipv6 routing tables on a working and a non-working node are the same | 15:44 |
corvus | let's go ahead and stop ze01 and reboot it, that's an easy test. who wants to stop it? | 15:44 |
dmsimard | I have time to give a hand now | 15:44 |
fungi | yeah, i have a feeling this is due to something outside the instance itself | 15:44 |
fungi | i | 15:44 |
fungi | can stop it now | 15:45 |
*** vhosakot has joined #openstack-infra | 15:45 | |
fungi | it's stopping now | 15:45 |
corvus | fungi: cool, you're stopping ze01 | 15:45 |
fungi | yeah, even ping6 isn't making it through | 15:45 |
*** dbecker has quit IRC | 15:45 | |
dmsimard | The test node graph is super weird but you probably know that already http://grafana.openstack.org/dashboard/db/zuul-status?panelId=20&fullscreen | 15:45 |
corvus | while that's going on, let's try to deconfigure ipv6 on ze03 | 15:45 |
* dmsimard catches up | 15:45 | |
mordred | fwiw - http://paste.openstack.org/show/663521/ <-- that's got routing and address info for a working and a non-working node. both look the same to me | 15:46 |
fungi | what's strange is that (so far from my testing) i can ping6 instances in iad from ze01 but not instances in dfw | 15:46 |
fungi | i take that back. i can ping6 etherpad.o.o from ze01 successfully | 15:47 |
mordred | oh! | 15:47 |
fungi | so this may be some flow-based issue in their routing layer | 15:47 |
mordred | ip -6 neigh shows differences | 15:47 |
mordred | http://paste.openstack.org/show/663523/ | 15:47 |
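What that paste compares, roughly: the neighbor-cache state of the IPv6 default gateway on a working executor versus a broken one.

```shell
# Run on each executor and compare the gateway's neighbor entry state.
ip -6 route show default   # note the gateway address
ip -6 neigh show           # REACHABLE vs STALE/FAILED for that gateway
```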
fungi | i still see a few git remote operations as child processes of zuul-executor on ze01 | 15:48 |
fungi | wonder if i should just kill those | 15:48 |
corvus | ip -6 addr del 2001:4800:7818:104:be76:4eff:fe04:a4bf/64 dev eth0 | 15:48 |
corvus | how's that look? | 15:49 |
*** yamahata has joined #openstack-infra | 15:49 | |
corvus | fungi: killing git should be okay | 15:49 |
corvus | and i'll do the same for the fe80 addrs? | 15:49 |
*** dtantsur is now known as dtantsur|bbl | 15:49 | |
dmsimard | mordred: curious to try an arping on ze08 on that stale router | 15:50 |
fungi | shouldn't need to do it for any fe80:: addresses as those are just linklocal | 15:50 |
fungi | it's only the scope global address(es) which matter in this case | 15:50 |
corvus | ok, wasn't sure if that would affect anything. i'll just do it for that global address on ze03 then? | 15:50 |
dmsimard | mordred: ip -6 neigh shows the router as reachable now (didn't do anything btw) | 15:50 |
fungi | corvus: yeah, that should be plenty | 15:50 |
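Put together, the workaround applied on ze03 (and later on the other affected executors) amounts to the following; the address is the one quoted above and is specific to that host.

```shell
# Drop only the global-scope IPv6 address so traffic falls back to IPv4;
# the fe80:: link-local address is left in place.
ip -6 addr show dev eth0 scope global
sudo ip -6 addr del 2001:4800:7818:104:be76:4eff:fe04:a4bf/64 dev eth0
```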
*** camunoz has joined #openstack-infra | 15:50 | |
corvus | done on ze03 | 15:51 |
*** david-lyle has joined #openstack-infra | 15:51 | |
*** pcaruana has quit IRC | 15:51 | |
corvus | if that works, presumably we'll see the effect after the current git op completes | 15:51 |
dmsimard | mordred: some flapping going on ? http://paste.openstack.org/raw/663529/ | 15:51 |
corvus | which is now, and it's running very quickly | 15:51 |
*** samueldmq has quit IRC | 15:52 | |
*** samueldmq has joined #openstack-infra | 15:52 | |
corvus | now i see a lot of ssh unreachable errors from ansible | 15:52 |
corvus | they are to ipv4 addresses | 15:52 |
fungi | ugh, ze01 keeps spawning new git operations | 15:52 |
corvus | fungi: yeah, it probably will. known inefficiency in shutdown: it doesn't short-circuit the update queue | 15:53 |
corvus | also, it now retries failed git operations 3 times. | 15:53 |
*** yamamoto has quit IRC | 15:53 | |
*** xarses has joined #openstack-infra | 15:54 | |
corvus | weird, i'm attempting to manually connect to the hosts that ansible says it can't connect to, and it's working | 15:54 |
fungi | okay, so could also be some random v4 connectivity issues from that instance? | 15:55 |
*** xarses_ has joined #openstack-infra | 15:55 | |
*** claudiub|2 has joined #openstack-infra | 15:55 | |
*** claudiub|2 has quit IRC | 15:55 | |
fungi | that's sounding more and more like some sort of address-based flow balancing sending some connections to a dead device | 15:55 |
openstackgerrit | Merged openstack-infra/zuul master: Fix github connection for standalone debugging https://review.openstack.org/540772 | 15:56 |
corvus | i've yet to fail to connect to one of those hosts manually, and pings are working fine | 15:56 |
*** claudiub|2 has joined #openstack-infra | 15:56 | |
Shrews | corvus: the few addresses i saw that happening for were all inap | 15:57 |
*** salv-orlando has joined #openstack-infra | 15:57 | |
corvus | the other hosts have an increase in those errors too | 15:58 |
dmsimard | Just curious... are we graphing conntrack ? Don't see it in cacti.. it requires some amount of RAM to keep iptables/ip6tables going with conntrack. Could our RAM starvation problems end up impacting that ? I don't see any conntrack messages in dmesg.. I know it would print obvious errors if we're reaching nf conntrack's max number but I don't know if it complains about other things. | 15:58 |
*** claudiub has quit IRC | 15:58 | |
*** olaph has joined #openstack-infra | 15:58 | |
*** xarses has quit IRC | 15:58 | |
dmsimard | Obviously right now the current conntrack numbers are low because nothing's running | 15:58 |
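A quick way to check the conntrack theory dmsimard raises, using standard procfs paths:

```shell
# Is the connection-tracking table anywhere near its limit?
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg | grep -i conntrack   # "table full, dropping packet" would appear here
```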
corvus | ze02 had 7 instances of ssh connection errors yesterday and 678 today. ze02 is one of the hosts that is 'working'. | 15:59 |
dmsimard | ok I'll look at ze02 | 15:59 |
*** olaph1 has quit IRC | 16:00 | |
*** r-daneel has joined #openstack-infra | 16:01 | |
fungi | still playing whack-a-mole with git processes on ze01 waiting for it to complete zuul-executor service shutdown | 16:02 |
corvus | Shrews: this is a list of ip addrs that ze03 has recently failed to connect to: http://paste.openstack.org/show/663545/ | 16:02 |
*** salv-orlando has quit IRC | 16:03 | |
*** salv-orlando has joined #openstack-infra | 16:03 | |
corvus | the 104's in there are rax | 16:03 |
Shrews | yeah, those seem different enough to be multiple providers | 16:04 |
corvus | the 23. are rax | 16:04 |
corvus | the 166. are rax... | 16:04 |
corvus | hrm | 16:04 |
*** v1k0d3n has quit IRC | 16:04 | |
corvus | 37. is citycloud | 16:05 |
dmsimard | We probably want to tweak net.ipv4.tcp_keepalive_intvl... /proc/sys/net/ipv4/tcp_keepalive_intvl is showing 75 right now | 16:05 |
*** v1k0d3n has joined #openstack-infra | 16:05 | |
*** icey has quit IRC | 16:05 | |
dmsimard | That's quite a large interval :/ | 16:05 |
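For reference, the keepalive knobs under discussion; 75 seconds is the kernel default interval, and the lower value below is only an example, not a recommendation:

```shell
# Inspect, then tentatively tighten, TCP keepalive behaviour.
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=15   # example value only
```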
*** icey has joined #openstack-infra | 16:05 | |
*** jaosorior has quit IRC | 16:06 | |
fungi | https://rackspace.service-now.com/system_status/ still doesn't indicate any known widespread issue | 16:07 |
*** armaan has quit IRC | 16:07 | |
*** armaan has joined #openstack-infra | 16:08 | |
Shrews | 86 is citycloud too | 16:10 |
Shrews | 146 rax | 16:10 |
clarkb | the way rax does network throttling is to inject resets iirc | 16:10 |
*** neiljerram has joined #openstack-infra | 16:10 | |
clarkb | do the cacti graphs look like we are hitting the limit and getting throttled? | 16:11 |
fungi | worth noting citycloud has mentioned we're causing issues in their regions, 541307 will disable them in our nodepool configuration for now. we might want to consider whether the citycloud issues are unrelated | 16:11 |
Shrews | the first one, 213, is ovh | 16:11 |
fungi | clarkb: not seeing reset-type disconnects. more like no response/timeout | 16:12 |
neiljerram | Hi all. I just mistakenly did 'git review -y' on a long chain of commits, instead of the squashed one that I wanted to push to review.openstack.org. This can be seen under 'Submitted Together' at https://review.openstack.org/#/c/541365/. Is there a way that I can abandon all of those mistakenly submitted changes? | 16:12 |
corvus | the unreachable errors seem to be happening during the ansible setup phase. so basically, the first connection is failing, but my subsequent manual telnet connections are succeeding. | 16:12 |
fungi | neiljerram: if you use gertty, you can process-mark them and mass abandon | 16:13 |
clarkb | fungi: ya and cacti doesn't seem to support that theory either but graph is spotty (I'm guessing because udp packets are being blackholed too) | 16:13 |
neiljerram | fungi, I don't at the moment. But if that's the best approach, I'll set that up. | 16:13 |
fungi | neiljerram: or you can script something against one of the gerrit apis (ssh api, rest api) to abandon a list of changes | 16:13 |
*** clayg has quit IRC | 16:13 | |
fungi | i don't know of any mass review operations available in the gerrit webui | 16:13 |
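A minimal sketch of the scripted option, using Gerrit's SSH API; the change numbers are placeholders for the mistakenly pushed series:

```shell
# Abandon a list of changes over the Gerrit SSH API (numbers are illustrative).
for change in 541339 541340 541341; do
  ssh -p 29418 review.openstack.org \
      gerrit review --abandon --message "'pushed by mistake, please ignore'" "$change,1"
done
```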
neiljerram | fungi, Or I guess just some manual clicking.... | 16:14 |
AJaeger | neiljerram: 43 changes? Argh ;( | 16:14 |
dmsimard | neiljerram: depends how long it takes to automate it vs doing it by hand | 16:14 |
*** clayg has joined #openstack-infra | 16:14 | |
neiljerram | AJaeger, indeed, and sorry! | 16:14 |
*** beisner has quit IRC | 16:14 | |
clarkb | fungi: on the citycloud front we are waiting for the change to merge which is being affected by this outage too right? should we consider merging that more forcefully? | 16:14 |
AJaeger | neiljerram: unfortunate timing, please abandon quickly... | 16:14 |
*** beisner has joined #openstack-infra | 16:14 | |
fungi | clarkb: we can bypass ci with it, i'd support that | 16:15 |
neiljerram | AJaeger, doing that as quickly as I can right now. | 16:15 |
corvus | neiljerram: i'll do it | 16:15 |
*** andreas_s has quit IRC | 16:17 | |
*** andreas_s has joined #openstack-infra | 16:17 | |
neiljerram | corvus, OK, thanks | 16:17 |
corvus | neiljerram: done (via gertty) | 16:17 |
pabelanger | we're at less than 1GB (854M) of available memory before we start swapping on zuul.o.o too. | 16:17 |
*** yamamoto has joined #openstack-infra | 16:18 | |
corvus | i can't think of a next step for the ipv4 connection issues. i don't know what would cause the first connection to fail but not subsequent ones. | 16:18 |
openstackgerrit | Doug Hellmann proposed openstack-infra/openstack-zuul-jobs master: [WIP] docs sitemap generation automation https://review.openstack.org/524862 | 16:19 |
AJaeger | neiljerram, corvus looks like some are not abandoned - neiljerram could you double check, please? e.g. https://review.openstack.org/#/c/541339/1 | 16:19 |
neiljerram | AJaeger, I'm doing that | 16:19 |
*** yamamoto has quit IRC | 16:19 | |
clarkb | fungi: ok I'm going to go ahead and do that so that we can focus on fixing zuul without also worrying about making citycloud happy | 16:19 |
neiljerram | AJaeger, they're all gone now | 16:20 |
neiljerram | corvus, many thanks for your help! | 16:20 |
corvus | i also don't really know how to confirm that. it's pretty tricky to find a node that an executor is about to use before it uses it. | 16:20 |
*** yamamoto has joined #openstack-infra | 16:20 | |
corvus | neiljerram: np. you should take a look at gertty anyway :) | 16:20 |
neiljerram | corvus, I did in the past, but didn't quite get into it; I'll have another go. | 16:21 |
openstackgerrit | Merged openstack-infra/project-config master: Disable citycloud in nodepool https://review.openstack.org/541307 | 16:21 |
*** kopecmartin has quit IRC | 16:21 | |
corvus | neiljerram: let me know (later) what got in your way so i can improve it :) | 16:22 |
fungi | thanks clarkb | 16:22 |
clarkb | fungi: ^ do you want me to respond to them as well? | 16:22 |
*** e0ne has quit IRC | 16:22 | |
fungi | clarkb: i have a half-written reply awaiting merger of the change. i can go ahead and finish it | 16:22 |
clarkb | fungi: ok | 16:22 |
fungi | once we see it take effect on the np servers | 16:23 |
fungi | i mean, once it's np-complete? ;) | 16:23 |
*** pcichy has joined #openstack-infra | 16:23 | |
dmsimard | corvus: is it normal to have these stuck git processes on zuul01? http://paste.openstack.org/raw/663572/ | 16:23 |
corvus | dmsimard: that'll be puppet | 16:23 |
corvus | dmsimard: it's erroneous, couldn't say if it's abnormal. probably safe to kill. | 16:23 |
dmsimard | looks like puppet has successfully updated the repository about an hour ago so I'll go ahead and kill those | 16:25 |
*** yamamoto has quit IRC | 16:25 | |
fungi | still killing git processes and waiting for zuul-executor to stop on ze01 | 16:26 |
*** andreas_s has quit IRC | 16:26 | |
corvus | i think ze03 shows us that if we ifdown the ipv6 interfaces, we'll get things moving again, and more jobs will be accepted. *except* that a good portion of them (unsure what %) will hit connection errors as soon as they start. those jobs will be retried 3x. tbh, i'm not sure whether it's a net win to fix the ipv6 issue. | 16:27 |
AJaeger | btw. should we send a #status alert? I sent a #status notice earlier... | 16:28 |
dmsimard | corvus, pabelanger: looking at the zuul scheduler memory.. seeing http://paste.openstack.org/raw/663579/ which somewhat reminds me of the ze issues we saw the other day with Shrews about the log spam | 16:28 |
* AJaeger will be offline for a bit now, so can't do it | 16:28 | |
*** rossella_s has quit IRC | 16:28 | |
corvus | dmsimard: log spam? | 16:28 |
dmsimard | corvus: let me find the issue and the patch, sec | 16:29 |
dmsimard | corvus: but tail -f /var/log/zuul/zuul.log is not pretty right now | 16:29 |
*** rossella_s has joined #openstack-infra | 16:30 | |
corvus | that's a very serious error | 16:30 |
*** yamamoto has joined #openstack-infra | 16:30 | |
Shrews | dmsimard: you're thinking of the repeated request decline by nodepool, i think, which was something totally different | 16:32 |
dmsimard | corvus: okay so the nodepool thing was different -- we were seeing this: http://paste.openstack.org/raw/653420/ and Shrews submitted https://review.openstack.org/#/c/537932/ -- he mentioned that https://review.openstack.org/#/c/533372/ may have caused the issue | 16:32 |
dmsimard | Shrews: yeah | 16:32 |
corvus | okay, i'm no longer looking into executor issues, i'm looking into scheduler-nodepool issues | 16:33 |
*** gmann has quit IRC | 16:33 | |
*** gmann has joined #openstack-infra | 16:33 | |
dmsimard | corvus: this locked exception is likely contributing to the ram usage we're seeing | 16:33 |
corvus | dmsimard: in what way? | 16:33 |
dmsimard | corvus: instinct, but I'm trying to get data to back that up right now -- the exceptions started happening all of a sudden and at first glance it might correlate with the spike in memory usage http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 16:35 |
fungi | or the memory utilization increase and those exceptions could both be symptoms of a common problem. correlation != causation | 16:37 |
Shrews | that, btw, was one of the bugs that made me want to restart the launchers today, but i'm holding off on that to limit changes | 16:37 |
dmsimard | It seems we started getting exceptions after a "kazoo.exceptions.NoNodeError" | 16:38 |
dmsimard | Did we lose zookeeper at some point ? | 16:38 |
dmsimard | paste of what seems to be the first occurrence (with extra lines for context) http://paste.openstack.org/raw/663600/ | 16:39 |
pabelanger | dmsimard: should be able to grep kazoo.client in logs to see | 16:39 |
fungi | how long should i be giving the executor daemon to stop on ze01 before i need to start considering other solutions? it's _still_ starting new git operations and we're coming up on an hour since i issued the service stop | 16:39 |
pabelanger | but zookeeper (nodepool.o.o) is still running last I checked this morning | 16:39 |
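The grep pabelanger suggests, plus ZooKeeper's own four-letter health check, would look something like this (log path as used on the scheduler; assumes port 2181 is reachable from where you run it):

```shell
# Did the scheduler lose its ZooKeeper session, and is ZooKeeper itself healthy?
grep 'kazoo.client' /var/log/zuul/zuul.log | grep -i 'connection dropped' | tail -n 5
echo ruok | nc nodepool.openstack.org 2181   # "imok" means the server is fine
```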
corvus | fungi: if all the other jobs have stopped, you can probably note the current repo it's working on, kill it, then delete that repo. | 16:40 |
fungi | it's lots of different repos | 16:40 |
fungi | not sure how to tell whether the jobs have stopped on it | 16:40 |
fungi | i'm just looking at ps | 16:40 |
corvus | hrm, should only be like one at a time | 16:41 |
Shrews | corvus: theory... one of the launchers was restarted by ianw this morning, and has the pro-active delete solution. If the scheduler is so busy that it doesn't get around to locking its node set immediate, we could be deleting nodes out from under it. | 16:41 |
Shrews | s/immediate/immediately | 16:42 |
dmsimard | pabelanger: /var/log/zookeeper/zookeeper.log dates back to october 2 2017 :/ | 16:42 |
corvus | Shrews, dmsimard: there was a zk disconnection event at 10:15:02 | 16:42 |
*** priteau has quit IRC | 16:42 | |
*** dsariel has quit IRC | 16:42 | |
pabelanger | fungi: I haven't killed a zuul-executor process myself, if we are not in a rush, maybe just let it shut itself down? | 16:42 |
corvus | fungi: are you killing git processes as they pop up? | 16:42 |
fungi | yes | 16:42 |
fungi | i've manually killed hundreds (thousands?) of git operations for countless repos on ze01 over the past hour. right now in the process list i see it performing upload-pack operations for cinder and bifrost | 16:42 |
corvus | fungi: so if you kill the executor now, just delete the bifrost and cinder repos | 16:43 |
corvus | from /var/lib/zuul/executor-git | 16:43 |
fungi | got it | 16:43 |
*** hamzy has quit IRC | 16:43 | |
fungi | okay, all that's left is a few dozen ssh-agent processes | 16:45 |
fungi | deleted /var/lib/zuul/executor-git/git.openstack.org/openstack/bifrost and /var/lib/zuul/executor-git/git.openstack.org/openstack/cinder | 16:45 |
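Generalized, the cleanup on ze01 boils down to the following; the pkill pattern is approximate and the repo paths are the two that were in flight when the daemon was killed:

```shell
# After killing the executor mid-update, remove the repos whose git operations
# were interrupted so they are re-cloned cleanly on the next start.
pkill -f 'git-upload-pack|git fetch' || true
rm -rf /var/lib/zuul/executor-git/git.openstack.org/openstack/bifrost
rm -rf /var/lib/zuul/executor-git/git.openstack.org/openstack/cinder
```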
fungi | anything else i need to do cleanup-wise before rebooting? | 16:46 |
corvus | fungi: not that i can think of | 16:46 |
fungi | going to issue a poweroff on the instance and then hard reboot it from the nova api | 16:46 |
dmsimard | corvus: how did you find that disconnection event ? client side or server side ? not sure how to check on nodepool.o.o, I was trying to find a correlation but there's nothing in syslog/dmesg and /var/log/zookeeper is useless | 16:46 |
dmsimard | corvus: your timestamp also doesn't match that kazoo nonode exception | 16:46 |
corvus | dmsimard: 2018-02-06 10:15:02,480 WARNING kazoo.client: Connection dropped: socket connection broken | 16:47 |
corvus | scheduler log | 16:47 |
dmsimard | zk is using a bunch of cpu but doing a strace on the process shows no activity.. just FUTEX_WAIT | 16:48 |
dmsimard | not sure if it's doing anything | 16:49 |
*** sree has joined #openstack-infra | 16:49 | |
*** caphrim007 has joined #openstack-infra | 16:49 | |
clarkb | I'm still trying to catch up on the zuul stuff. Sounds like we've decided that going ipv4 only won't actually fix us? do we think the reboot fungi is doing will address networking? | 16:49 |
mordred | clarkb: maybe | 16:50 |
mordred | clarkb: I think we're in an 'it's worth a shot' mode on that one | 16:50 |
mordred | clarkb: but I do not believe we actually understand the underlying issue | 16:50 |
dmsimard | clarkb: there's different issues going on I think, I'll start a pad.. let's move this to incident ? | 16:50 |
corvus | Shrews, dmsimard: i think i understand the zk errors. there is a bug when a node request is fulfilled before a zk disconnection but the scheduler processes the event after the disconnection. | 16:51 |
fungi | clarkb: reboot fungi has _done_ now, but it's as much an experiment as anything. i've definitely seen wonky packet delivery issues in rackspace go away after hard rebooting an instance, though in my opinion this looks more like they have some misbehaving switch gear | 16:52 |
*** thingee has quit IRC | 16:52 | |
fungi | and the hard reboot of ze01 indeed doesn't seem to have solved this | 16:53 |
fungi | i'll delete its ipv6 address for the moment | 16:53 |
pabelanger | kk | 16:53 |
*** iyamahat has quit IRC | 16:54 | |
corvus | Shrews: it's pretty close to the case in 94e95886e2179f4a6aeecad687509bc7b1ab7fd3 i wonder if we just need to extend our testing of that a bit | 16:54 |
*** salv-orlando has quit IRC | 16:54 | |
fungi | heh, of course that killed my ssh connection to it ;) | 16:55 |
dmsimard | I have to step away ~10 minutes | 16:55 |
corvus | Shrews: oh, i see -- that test is only a narrow unit test... we don't have a test of the scheduler's response to that situation. we need a full scale functional test. | 16:55 |
*** gfidente has quit IRC | 16:55 | |
*** salv-orlando has joined #openstack-infra | 16:55 | |
dmsimard | infra-root: started a pad to follow the different issues we're tracking https://etherpad.openstack.org/p/HRUjBTyabM | 16:55 |
*** dsariel has joined #openstack-infra | 16:57 | |
*** eumel8 has joined #openstack-infra | 16:59 | |
*** salv-orlando has quit IRC | 16:59 | |
*** gfidente has joined #openstack-infra | 17:00 | |
*** gfidente has quit IRC | 17:00 | |
*** gfidente has joined #openstack-infra | 17:00 | |
*** ramishra has quit IRC | 17:01 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 17:02 |
*** gongysh has quit IRC | 17:02 | |
*** bobh has quit IRC | 17:02 | |
*** pbourke has quit IRC | 17:02 | |
Shrews | corvus: i think ^^ will solve the situation i theorized earlier | 17:02 |
*** gyee has joined #openstack-infra | 17:02 | |
corvus | Shrews: oh yes! that should definitely be a thing! :) | 17:03 |
*** pbourke has joined #openstack-infra | 17:04 | |
*** salv-orlando has joined #openstack-infra | 17:04 | |
Shrews | now we definitely need a restart since nl03 is running without that :( | 17:04 |
*** slaweq has quit IRC | 17:05 | |
*** slaweq has joined #openstack-infra | 17:05 | |
*** derekjhyang has quit IRC | 17:05 | |
clarkb | corvus: and for the zk connection issue in the scheduler we can't recover from that without a restart? | 17:06 |
corvus | clarkb: correct | 17:06 |
*** gongysh has joined #openstack-infra | 17:06 | |
corvus | it looks a lot like the executor queue is running down to the point where that's going to block us pretty soon. | 17:07 |
*** zoli is now known as zoli|gone | 17:07 | |
*** zoli|gone is now known as zoli | 17:07 | |
corvus | so we should plan on doing a scheduler restart in the next 30m or so | 17:07 |
corvus | we know how to deal with the ipv6 issue. i think the main unanswered question is about the ipv4 connection problems. | 17:07 |
pabelanger | when we switch to a zookeeper cluster, we'll have 3 connections to try and use from the scheduler, is that right? | 17:07 |
*** rossella_s has quit IRC | 17:08 | |
*** gongysh has quit IRC | 17:08 | |
corvus | pabelanger: yes it may be more robust in some cases, but if the problem is closer to the scheduler, then it can still go wrong. | 17:08 |
corvus | does anyone have any ideas how to further diagnose the ipv4 issues? | 17:08 |
dmsimard | infra-root: AJaeger suggested we send out an update on the situation, does this sound okay ? #status notice We are actively investigating different issues impacting the execution of Zuul jobs but have no ETA for the time being. Please stand by for updates or follow #openstack-infra while we work this out. Thanks. | 17:09 |
*** slaweq has quit IRC | 17:10 | |
pabelanger | corvus: understood | 17:10 |
dmsimard | corvus: I need to catch up on what we have found out so far about ipv4 and I'll try to look | 17:10 |
corvus | i'm not a fan of zero-content notices. would prefer a single alert. but i think we should send the next thing, whatever it is, after we restart. | 17:10 |
dmsimard | corvus: we did not send an alert previously, only a notice | 17:11 |
dmsimard | corvus: I'm also not a fan of zero content :( | 17:11 |
mordred | corvus: I am stumped on further diagnosing the ipv4 issue | 17:11 |
*** fverboso has quit IRC | 17:11 | |
pabelanger | fungi: are there other executors we need to reboot? (looking for something to do) | 17:11 |
clarkb | the underlying issue is networking conenctivity | 17:11 |
corvus | dmsimard: yeah, so let's either send an "alert:still not working" or "notice:maybe things are better" in a little bit. | 17:11 |
clarkb | I would say something along those lines | 17:11 |
fungi | pabelanger the reboot seems not to have helped | 17:11 |
clarkb | zk connection broke beacuse networking (likely) | 17:12 |
pabelanger | fungi: okay | 17:12 |
clarkb | ze can't talk to review.o.o and test nodes becaues networking (likely) | 17:12 |
dmsimard | I'll take a few minutes to dig into the connectivity issues now. | 17:12 |
clarkb | I need to grab caffeine before I get too comfortable in this chair | 17:12 |
clarkb | work fuel! | 17:13 |
corvus | i think the best thing we can do at the moment is: restart scheduler, remove the ipv6 addresses from all executors and mergers, then monitor for ipv4 ssh errors and hope enough jobs get through to make headway. | 17:13 |
dmsimard | I'm not even able to connect by SSH to ze01 or ze08 over ipv6, I had to force-resolve it to ipv4 | 17:13 |
corvus | also, file a trouble ticket. | 17:13 |
pabelanger | fungi: Hmm, we do seem to be accepting new jobs on ze01.o.o now. Is that related to removal of ipv6? | 17:14 |
mordred | corvus: I agree with your suggested plan of action | 17:15 |
pabelanger | +1 | 17:15 |
corvus | oh, all the executors just changed behavior | 17:15 |
fungi | pabelanger: it should be, yes | 17:15 |
fungi | corvus: all the executors or just the problem ones? i deleted the ipv6 global scope addresses from all executors which were unable to connect to review.openstack.org over v6 | 17:16 |
corvus | that may suggest that the ipv6 issue is resolved | 17:16 |
corvus | fungi: oh! i missed that | 17:16 |
corvus | heh | 17:16 |
dmsimard | looks like the builds just started picking up on grafana yeah | 17:16 |
corvus | fungi: i thought you only did ze01 | 17:16 |
*** olaph1 has joined #openstack-infra | 17:17 | |
corvus | anyway, we still need to restart scheduler, shall i do that now? | 17:17 |
*** iyamahat has joined #openstack-infra | 17:17 | |
*** olaph has quit IRC | 17:17 | |
*** jpena is now known as jpena|off | 17:18 | |
fungi | yeah, and you had previously done ze03, i ended up doing 06-09 based on discussion with "unknown purple person" in the etherpad, but i should have mentioned it in here (i probably didn't) | 17:18 |
fungi | now seems like a good time for the scheduler restart, agreed | 17:18 |
dmsimard | corvus: assuming we dump queues first ? there's also a good deal of post and periodic jobs still queued | 17:18 |
corvus | dmsimard: we don't usually save post or periodic | 17:19 |
fungi | we've generally considered post and periodic "best effort" | 17:19 |
dmsimard | I don't suppose losing periodic is that big of a deal but post might, I don't know | 17:19 |
fungi | most post jobs are idempotent and will be rerun on the next commit to merge to whatever repo | 17:19 |
pabelanger | ++ | 17:19 |
dmsimard | fair | 17:19 |
*** rwsu has quit IRC | 17:19 | |
*** rossella_s has joined #openstack-infra | 17:19 | |
*** rwsu has joined #openstack-infra | 17:20 | |
fungi | and for the ones which actually present a problem, we can reenqueue the branch tip into post for them later | 17:20 |
corvus | okay, restarting scheduler now | 17:20 |
fungi | if they don't have anything new to approve | 17:20 |
smcginnis | Are issues resolved with removing ipv6 and this pending scheduler restart? | 17:22 |
*** r-daneel has quit IRC | 17:22 | |
* smcginnis is trying to discern current status | 17:22 | |
*** slaweq has joined #openstack-infra | 17:22 | |
fungi | if things seem back on track after the scheduler restart, we could consider stopping the executor on ze01 again, rebooting it again, and then using it as the test system for a ticket with rackspace | 17:22 |
*** r-daneel has joined #openstack-infra | 17:22 | |
fungi | smcginnis: not resolved, no. possibly worked around, but we're also dealing with recovery from a zookeeper disconnection | 17:23 |
smcginnis | fungi: ack, thanks | 17:23 |
fungi | and executors failing to connect to far more job nodes than usual | 17:23 |
dmsimard | 515 *changes* in check queue ? really ? | 17:26 |
*** slaweq has quit IRC | 17:26 | |
corvus | re-enqueuing | 17:26 |
fungi | dmsimard: some had been in check for 9 hours when i looked earlier, so quite the backup | 17:27 |
dmsimard | I think OVH is not working right now. Time to ready for their nodes is flat http://grafana.openstack.org/dashboard/db/nodepool and seeing a lot of errors on nl04 -- example: http://paste.openstack.org/raw/663687/ | 17:30 |
dmsimard | Looking at this. | 17:30 |
pabelanger | yes, looks like they are having message queue issues | 17:31 |
pabelanger | http://status.ovh.net/?do=details&id=14073&PHPSESSID=e77302e17361afc822024180e2449e1f | 17:31 |
pabelanger | they do a good job keeping status.ovh.net updated | 17:31 |
dmsimard | can we temporarily set OVH to 0 so we don't needlessly hammer them ? | 17:32 |
pabelanger | no, should be fine | 17:32 |
dmsimard | pabelanger: that notice is from september 2016 ? | 17:32 |
*** bobh has joined #openstack-infra | 17:32 | |
pabelanger | oh, it is... odd | 17:33 |
*** sree has quit IRC | 17:33 | |
pabelanger | dmsimard: if you click on cloud box, at top, you can see open tickets | 17:33 |
dmsimard | nl04 is completely nuts at creating/deleting VMs non stop in the two OVH regions | 17:33 |
*** jlabarre has quit IRC | 17:34 | |
*** sree has joined #openstack-infra | 17:34 | |
pabelanger | dmsimard: yah, eventually it will start working again. | 17:34 |
dmsimard | we're not very nice :/ | 17:34 |
pabelanger | we don't usually turn down clouds until the provider asks. They all know how to contact us | 17:35 |
fungi | clarkb: i've replied to daniel at citycloud now, after confirming image uploads are stopped and node counts are at 0 in all their regions | 17:35 |
pabelanger | that way, we don't need to be constantly updating max-servers | 17:36 |
clarkb | fungi: I see the email, thanks! | 17:36 |
clarkb | looks like it was the removal of ipv6 that ended up "fixing" things? | 17:37 |
*** jlabarre has joined #openstack-infra | 17:37 | |
dmsimard | pabelanger: maybe we could have a soft disable or something ? a bit like we can enable/disable keep without restarting zuul executor | 17:37 |
mordred | clarkb: "yes" | 17:37 |
dmsimard | Did we remove ipv6 on all executors ? or just one ? Everything seemed to start working again all of a sudden | 17:37 |
*** dsariel has quit IRC | 17:37 | |
clarkb | dmsimard: pabelanger we can increase the time between api requests | 17:37 |
dmsimard | I'm not sure ipv6 has anything to do with it | 17:37 |
mordred | dmsimard: fungi took care of all the ones with broken ipv6 | 17:37 |
dmsimard | mordred: the non-broken ones were not really accepting any builds | 17:38 |
mordred | dmsimard: (disabled ipv6 on them that is) | 17:38 |
dmsimard | mordred: http://grafana.openstack.org/dashboard/db/zuul-status | 17:38 |
pabelanger | dmsimard: why? I'd rather not have to toggle nodepool-launcher each time we have an API error. Would mean spending more time monitoring it | 17:38 |
*** jbadiapa has joined #openstack-infra | 17:38 | |
*** gfidente is now known as gfidente|afk | 17:38 | |
*** sree has quit IRC | 17:38 | |
*** bobh has quit IRC | 17:38 | |
mordred | dmsimard: that's not surprising though - the executors will only accept builds up to a point - and with the others out of commission, the running executors were well past their max | 17:38 |
dmsimard | pabelanger: what do you mean why ? the cloud is broken and we're creating/deleting hundreds of VMs a minute, it's not very helpful to the nodepool sponsors imo | 17:38 |
pabelanger | clarkb: sure, that too | 17:38 |
fungi | clarkb: for very disappointing definitions of "fixing" anyway | 17:39 |
fungi | dmsimard: see your etherpad for details? | 17:39 |
dmsimard | fungi: details for what, sorry ? a bit multithreaded | 17:40 |
dmsimard | fungi: oh, you mean where we removed v6 | 17:40 |
fungi | dmsimard: you asked whether we removed ipv6 addresses from all executors | 17:40 |
fungi | details on which are in the pad | 17:40 |
pabelanger | dmsimard: sure, providers often don't notify us of outages either. I like the fact that nodepool just keeps doing its thing until the cloud comes back online | 17:40 |
dmsimard | fungi: but what I'm saying is that it didn't seem like the "non-ipv6 broken" executors were really doing anything, at least according to the graphs | 17:41 |
dmsimard | fungi: hmm, I was relying on the available executors graph and none of them were available because they were loaded.. nevermind that | 17:41 |
fungi | dmsimard: depended on the graphs. they were bogged down enough to stop accepting many new jobs | 17:41 |
*** bobh has joined #openstack-infra | 17:42 | |
pabelanger | dmsimard: OVH is pretty responsive to emails, we can reach out to them and see if they are aware of the issue. Would prefer we did that before shutting down nodepool-launcher. | 17:42 |
dmsimard | pabelanger: can we ? both bhs1 and gra1 are currently not working | 17:43 |
fungi | so http://status.ovh.net/?do=details&id=14073&PHPSESSID=e77302e17361afc822024180e2449e1f doesn't indicate they're aware? | 17:43 |
*** mtreinish has quit IRC | 17:43 | |
pabelanger | dmsimard: yes, I suggest we then email OVH about the issue we are seeing. I'm unsure if you have their contact, but I can forward it to you if you want to reach out. | 17:44 |
pabelanger | however, http://grafana.openstack.org/dashboard/db/nodepool-ovh seems to show nodes coming online now | 17:45 |
pabelanger | at least for BHS1 | 17:45 |
dmsimard | http://paste.openstack.org/raw/663704/ is an excerpt of the errors we're seeing for bhs1/gra1 | 17:45 |
*** jbadiapa has quit IRC | 17:45 | |
openstackgerrit | Pino de Candia proposed openstack-infra/project-config master: Add new project for Tatu (SSH as a Service) Horizon Plugin. https://review.openstack.org/537653 | 17:46 |
*** apetrich has quit IRC | 17:46 | |
pabelanger | dmsimard: yah, I usually just email them an example error message and few UUIDs of nodes. They offer to help debug | 17:46 |
pabelanger | I'll forward you the last emails I sent | 17:46 |
*** apetrich has joined #openstack-infra | 17:46 | |
dmsimard | ok I'll reach out | 17:46 |
dmsimard | Going to log a few things for our roots from other timezones | 17:47 |
fungi | oh, yeah, i'm seeing freshness issues with their status tracking i guess | 17:47 |
*** bobh has quit IRC | 17:47 | |
dmsimard | #status log (dmsimard) CityCloud asked us to disable nodepool usage with them until July: https://review.openstack.org/#/c/541307/ | 17:49 |
corvus | fyi: ze01 will behave differently than the others because it will accept jobs at a slightly higher rate | 17:49 |
openstackstatus | dmsimard: finished logging | 17:49 |
pabelanger | dmsimard: email forwarded | 17:49 |
fungi | corvus: out of curiosity, why is ze01 "special" like that? | 17:50 |
corvus | fungi: a patch landed and it rebooted | 17:50 |
Shrews | it goes to 11 | 17:50 |
fungi | ahh, got it | 17:50 |
fungi | yes, this one goes to 11 | 17:50 |
*** mtreinish has joined #openstack-infra | 17:50 | |
clarkb | corvus: this is the 4 vs 1 patch from yesterday? | 17:50 |
corvus | yep | 17:50 |
fungi | i think in earlier troubleshooting jhesketh may also have restarted an executor? before i woke up | 17:50 |
*** dsariel has joined #openstack-infra | 17:50 | |
fungi | seeing if i can find it in scrollback | 17:50 |
clarkb | fungi: I think he said ianw did the restart? and it was 01 too? | 17:51 |
fungi | not rebooted, just restarted the daemon | 17:51 |
fungi | oh, perfect | 17:51 |
* fungi cancels scrollback spelunking | 17:51 | |
corvus | ah, well that would have been masked by ipv6 issues till now | 17:51 |
fungi | yeah | 17:51 |
dmsimard | #status log (dmsimard) Different Zuul issues relative to ipv4/ipv6 connectivity, some executors have had their ipv6 removed: https://etherpad.openstack.org/p/HRUjBTyabM | 17:51 |
openstackstatus | dmsimard: finished logging | 17:51 |
*** olaph1 is now known as olaph | 17:52 | |
clarkb | do we have a ticket in for the ipv6 connectivity issue? | 17:53 |
dmsimard | #status log (dmsimard) zuul-scheduler issues with zookeeper ( kazoo.exceptions.NoNodeError / Exception: Node is not locked / kazoo.client: Connection dropped: socket connection broken ): https://etherpad.openstack.org/p/HRUjBTyabM | 17:53 |
openstackstatus | dmsimard: finished logging | 17:53 |
*** tbachman has joined #openstack-infra | 17:53 | |
dmsimard | #status log (dmsimard) High nodepool failure rates (500 errors) against OVH BHS1 and GRA1: http://paste.openstack.org/raw/663704/ | 17:53 |
openstackstatus | dmsimard: finished logging | 17:53 |
*** panda is now known as panda|off | 17:54 | |
fungi | clarkb: not yet because we need a system we can point to for troubleshooting. i'm proposing we stop the daemon on one of the "problem" executors and then reboot it with its executor service temporarily disabled so it just comes up doing nothing but with its v6 networking correctly configured | 17:54 |
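The plan for the ticket host, sketched out; this assumes a systemd unit named zuul-executor and would need adjusting if the service is still managed by a sysvinit script:

```shell
# Leave ze09 idle, but with its IPv6 configuration intact, for the provider to debug.
sudo systemctl stop zuul-executor
sudo systemctl disable zuul-executor   # don't start again on boot
sudo reboot
```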
*** bobh has joined #openstack-infra | 17:54 | |
clarkb | fungi: ++ | 17:54 |
fungi | we could also consider launching some more (replacement) executors | 17:55 |
*** iyamahat has quit IRC | 17:55 | |
*** iyamahat has joined #openstack-infra | 17:56 | |
dmsimard | pabelanger: ack, I'm sending an email to them nwo. | 17:56 |
dmsimard | now* | 17:56 |
*** bobh has quit IRC | 17:59 | |
*** yamamoto has quit IRC | 17:59 | |
*** yamamoto has joined #openstack-infra | 18:00 | |
corvus | i have a test-case reproduction of the scheduler/zk lock bug | 18:00 |
*** tesseract has quit IRC | 18:00 | |
*** jamesmca_ has quit IRC | 18:00 | |
*** jamesmcarthur has quit IRC | 18:01 | |
*** jamesmcarthur has joined #openstack-infra | 18:01 | |
*** sambetts is now known as sambetts|afk | 18:01 | |
*** derekh has quit IRC | 18:02 | |
pabelanger | great! | 18:02 |
openstackgerrit | Merged openstack-infra/zuul master: Enhance github debugging script for apps https://review.openstack.org/540774 | 18:03 |
dmsimard | sent an email to ovh, have to step away again. | 18:03 |
*** jamesmcarthur has quit IRC | 18:05 | |
*** xarses_ has quit IRC | 18:06 | |
pabelanger | dmsimard: thanks | 18:06 |
*** xarses has joined #openstack-infra | 18:06 | |
fungi | wow, monasca-thresh is removing mysql-connector from their java archive (due to license concerns) and relying on the drizzle jdbc? not sure whether that's amusing to former drizzlites: http://lists.openstack.org/pipermail/openstack-dev/2018-February/127027.html | 18:06 |
Shrews | O.o | 18:06 |
*** salv-orlando has quit IRC | 18:07 | |
Shrews | Drizzle says "We're not dead yet!" | 18:07 |
Shrews | "I'm feeling much better" | 18:07 |
mordred | hehe | 18:07 |
*** salv-orlando has joined #openstack-infra | 18:07 | |
mordred | that's truly awesome | 18:08 |
*** bobh has joined #openstack-infra | 18:10 | |
clarkb | fungi: should we just go ahead and pick a server and turn off the executor now? I guess when we have a backlog of jobs that isn't the happiest thing to do (but neither is remaining ipv4 only) | 18:10 |
fungi | i was giving things a few more minutes to see what other fallout we might spot | 18:11 |
fungi | and hoping to garner additional input on that plan | 18:11 |
*** salv-orlando has quit IRC | 18:12 | |
corvus | we may want to restart the other executors to pick up the 1->4 slow start change, it will let them be utilized better. | 18:12 |
*** apetrich has quit IRC | 18:13 | |
*** slaweq has joined #openstack-infra | 18:13 | |
corvus | maybe if we do that, we'll have a surplus compared to our backlog, and can turn one down with less effect. at the moment, we have more work than the executors can handle. | 18:13 |
pabelanger | wfm | 18:13 |
*** hamzy has joined #openstack-infra | 18:13 | |
*** apetrich has joined #openstack-infra | 18:13 | |
fungi | i like that idea | 18:13 |
corvus | stopping the others should be faster than ze01, but still, obviously, not super quick. | 18:13 |
corvus | (i'm still fixing the scheduler bug, so i'll leave that to others) | 18:14 |
fungi | do we have volunteers for a rolling restart of, say, ze02-ze08 and ze10? and then we can stop ze09 and reboot it with the executor disabled? | 18:14 |
pabelanger | I can help here in a moment, just getting more coffee | 18:14 |
*** hamzy_ has joined #openstack-infra | 18:15 | |
fungi | ze01, as mentioned, has already been restarted on the new code, and the instances impacted by the ipv6 situation seem to be 01, 03 and 06-09 | 18:15 |
*** myoung is now known as myoung|food | 18:15 | |
*** bobh has quit IRC | 18:16 | |
fungi | so the idea would be to restart all executors except 01 (which already got that treatment) and only stop one of the v6-problem-affected ones (like 09) without restarting the service on it again | 18:16 |
*** sshnaidm|rover is now known as sshnaidm|bbl | 18:16 | |
fungi | also be aware you probably want ssh -4 if you have working ipv6 connectivity | 18:17 |
*** slaweq has quit IRC | 18:17 | |
fungi | i need to eat something before the meeting | 18:17 |
*** hamzy has quit IRC | 18:17 | |
fungi | but can help with some of it | 18:17 |
*** mihalis68 has quit IRC | 18:20 | |
*** david-lyle has quit IRC | 18:20 | |
corvus | clarkb: launch delayed to 19:20, so we may need to have an agenda item for it | 18:21 |
*** jamesmcarthur has joined #openstack-infra | 18:21 | |
clarkb | I think I just saw they pushed it all the way to 3:05 EST which is 20:05 UTC? | 18:22 |
clarkb | weather needs to cooperate | 18:22 |
*** dtantsur|bbl is now known as dtantsur | 18:22 | |
pabelanger | fungi: okay, ready to get started on zuul-executors | 18:25 |
pabelanger | I'll do ze02.o.o first | 18:25 |
*** tosky has quit IRC | 18:26 | |
pabelanger | clarkb: I saw that too | 18:26 |
clarkb | not to completely distract from the zuul stuff but I think that sometime yesterday our bandersnatch may have gotten out of sync | 18:26 |
clarkb | sushy 1.3.1 never made it onto our mirror despite having mirror updates | 18:26 |
dansmith | I'm guessing it's known that zuul is down, but didn't see a status bot topic change so wanted to confirm? | 18:28 |
clarkb | dansmith: it should be up now | 18:28 |
clarkb | there are networking issues we've worked around by disabling ipv6 that were preventing things from functioning earlier | 18:29 |
* tbachman isn’t seeing zuul | 18:29 | |
pabelanger | where? | 18:29 |
dansmith | I haven't been able to hit it for like 30 minutes | 18:29 |
dansmith | just times out for me | 18:29 |
dhellmann | http://zuul.openstack.org is extremely slow or not responding | 18:29 |
AJaeger | clarkb: http://zuul.openstack.org/ is not reachable right now | 18:29 |
*** ralonsoh has quit IRC | 18:29 | |
tbachman | dansmith: same here | 18:29 |
pabelanger | looking | 18:30 |
pabelanger | I'm having issues SSHing into zuul.o.o | 18:30 |
*** bobh has joined #openstack-infra | 18:30 | |
pabelanger | checking cacti | 18:30 |
clarkb | pabelanger: I can ssh into it ok | 18:31 |
clarkb | but ya http is sad | 18:31 |
clarkb | zuul-web is running | 18:31 |
pabelanger | okay, SSH now works | 18:31 |
clarkb | it is listening on port 9000 and apache is up and configured to proxy to it | 18:32 |
pabelanger | clarkb: I see some exceptions in web.log, trying to see if related | 18:34 |
clarkb | hrm why does web-debug.log have less info in it than web.log? | 18:34 |
fungi | wonder if the restart picked up a new regression | 18:35 |
pabelanger | ze02 back online | 18:35 |
pabelanger | moving to ze03.o.o | 18:35 |
*** harlowja has joined #openstack-infra | 18:35 | |
clarkb | looks like web.log ends with it starting | 18:35 |
clarkb | web-debug.log doesn't show anything sad since it started | 18:35 |
clarkb | if I curl localhost:9000 I get html back | 18:38 |
clarkb | so the server is up and responding to requests | 18:38 |
clarkb | now to see if I can get a status.json from it | 18:38 |
fungi | one other possibility is the sheer hugeness of the status.json may simply be too much | 18:39 |
clarkb | AH00898: Error reading from remote server returned by /status.json | 18:39 |
pabelanger | maybe | 18:39 |
clarkb | ya I think apache is unable to get the status.json back | 18:39 |
pabelanger | we do have 587 jobs in check | 18:40 |
dhellmann | it sounds like that api needs a pagination feature :-) | 18:40 |
dmsimard | I was stepping in momentarily before going away again, but the status.json backups might help (or not, and make things worse, though iirc I set them up with flock/curl timeouts) | 18:40 |
Shrews | pabelanger: back from eating. can help you out | 18:41 |
pabelanger | Shrews: I'm currently doing ze03 / ze04, if you want to get others. I think we want to keep ze09 as is | 18:41 |
clarkb | `curl --verbose localhost:9000/openstack/status.json` works and shows a status of 200 | 18:41 |
Shrews | pabelanger: ack. on ze04 | 18:41 |
Shrews | pabelanger: err, ze05 | 18:41 |
clarkb | data is small enough to fit on my scrollback buffer :P | 18:42 |
Shrews | pabelanger: just zuul-executor restart, yeah? | 18:42 |
fungi | pabelanger: Shrews: stopping ze09 would be great, but that's the one we want to disable the service on and then reboot so it comes back up with ipv6 addresses | 18:42 |
Shrews | fungi: ze09 is all yours :) | 18:42 |
clarkb | its only 2.3MB | 18:42 |
clarkb | "only" | 18:42 |
*** dougwig has joined #openstack-infra | 18:43 | |
fungi | Shrews: pabelanger: i'll refrain from stopping ze09 until the others have been cleanly restarted so we don't have too many out of rotation at a time | 18:43 |
*** gmann has quit IRC | 18:43 | |
Shrews | fungi: to be clear, we're just restarting the zuul-executor process, right? | 18:43 |
fungi | yes | 18:43 |
fungi | which, as i understand it, can't currently be done as a typical service restart because the stop takes so long to complete | 18:44 |
clarkb | working my way back up the stack, those status.json errors were actually from 40 minutes or so ago | 18:44 |
fungi | Shrews: so is a stop and then watch the process list until it can be safely started again | 18:44 |
fungi | (or tail logs, or whatever) | 18:44 |
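The per-executor restart being coordinated here reduces to roughly this loop (a sketch; in practice the operators watched ps and the logs by hand):

```shell
# Stop, wait for the old executor and its git/ansible children to drain, restart.
sudo systemctl stop zuul-executor
while pgrep -f zuul-executor > /dev/null; do sleep 30; done
sudo systemctl start zuul-executor
```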
clarkb | I do not see requests from my desktop hitting apache | 18:45 |
clarkb | possibly this is related to the networking issues we've had? | 18:45 |
clarkb | I'm ipv4 only here | 18:45 |
dhellmann | I get "connection reset by peer" http://paste.openstack.org/show/663792/ | 18:45 |
dhellmann | I'm also ipv4-only | 18:45 |
*** caphrim007_ has joined #openstack-infra | 18:46 | |
fungi | doesn't appear we're overloading its network: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64796&rra_id=all | 18:47 |
clarkb | if I make a new connection from my home I see it in a SYN_RECV state on the server | 18:47 |
clarkb | and it's been that way for quite a few seconds now | 18:48 |
fungi | in fact, that's probably status.json we see dying on the graph at 17:20 | 18:48 |
clarkb | then it eventually dies client side | 18:48 |
fungi | which does correspond with the restart | 18:48 |
fungi | scheduler and web daemons were restarted at 17:22 and 17:23 respectively | 18:49 |
* clarkb breaks out the tcpdump | 18:49 | |
*** caphrim007 has quit IRC | 18:50 | |
Shrews | still waiting for ze05. doing ze06 now too | 18:51 |
*** suhdood has joined #openstack-infra | 18:51 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: [WIP] webapp: add optional admin endpoint https://review.openstack.org/536319 | 18:54 |
clarkb | aha! [Tue Feb 06 18:55:38.214934 2018] [mpm_event:error] [pid 13325:tid 140299712509824] AH00485: scoreboard is full, not at MaxRequestWorkers -- I mean, I should've checked there earlier, but now we know | 18:55 |
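For context, AH00485 is mpm_event reporting that every scoreboard slot is occupied (often by connections stuck in a closing state) even though MaxRequestWorkers hasn't been reached, so new connections get turned away. One way to watch for that condition is the machine-readable mod_status output; a rough sketch, assuming mod_status is enabled and /server-status is reachable on localhost:

```python
# Sketch: poll Apache mod_status (?auto) and print worker/scoreboard usage.
# Assumes mod_status is enabled and /server-status is reachable locally.
import time
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # placeholder

while True:
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        fields = dict(
            line.split(":", 1)
            for line in resp.read().decode().splitlines()
            if ":" in line
        )
    busy = fields.get("BusyWorkers", "?").strip()
    idle = fields.get("IdleWorkers", "?").strip()
    board = fields.get("Scoreboard", "").strip()
    # '.' entries are open slots; anything else is a worker in some state.
    print("busy=%s idle=%s open_slots=%d" % (busy, idle, board.count(".")))
    time.sleep(30)
```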
*** bobh has quit IRC | 18:55 | |
fungi | so apache restart time? | 18:56 |
clarkb | so tcp connections are getting to apache but apache is saying go away | 18:56 |
clarkb | ya | 18:56 |
clarkb | should I go ahead and do that? | 18:56 |
* clarkb gives everyone a minute before doing that | 18:56 | |
fungi | possible the inrush of reloads from people's browsers after the restart coupled with the giant status.json caused it, in which case we may see it come right back | 18:56 |
fungi | but worth a try | 18:56 |
Shrews | pabelanger: fungi: ze06 restarted. ze05 still ongoing. moving to ze07 | 18:57 |
dmsimard | status.json is rendered client-side right ? | 18:57 |
fungi | yes | 18:57 |
fungi | for now at least | 18:58 |
dmsimard | so it's probably getting downloaded just fine but then has to do all that javascript dancing | 18:58 |
clarkb | dmsimard: no it wasn't downloading anything | 18:58 |
tbachman | fwiw, I can hit it now | 18:58 |
fungi | dmsimard: well, no, apache is super unhappy at the moment | 18:58 |
clarkb | apache was closing incoming tcp connections because it had no workers to put them on | 18:58 |
dmsimard | hmm | 18:58 |
clarkb | status page is working now after apache restart | 18:58 |
fungi | cool, we should keep an eye on it | 18:58 |
clarkb | now we see if that lasts | 18:58 |
fungi | (i almost said "keep tabs on it" and then realized...) | 18:58 |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 18:59 |
dmsimard | fungi: that would've been appropriate | 18:59 |
dmsimard | are we doing an infra meeting ? | 18:59 |
fungi | in 45 seconds | 18:59 |
clarkb | and now it's meeting time and I'm not super prepared :P | 18:59 |
cmurphy | do we need to recheck? | 18:59 |
clarkb | cmurphy: you shouldn't need to, you can check now if your changes are in the queues though | 18:59 |
* fungi is entirely unprepared, but also not chairing | 18:59 | |
clarkb | (at least it loads for me currently) | 18:59 |
fungi | cmurphy: unless you had failures from earlier (retry_limit, post_failure) and those will need a recheck | 19:00 |
dmsimard | Because sometimes we can have good news too, it looks like OVH BHS1 has fully recovered and there's blips showing signs of recovery for GRA1 as well | 19:00 |
clarkb | I'll keep my tail of the error log open so I'll see if it happens again | 19:00 |
clarkb | and with that /me sorts out meeting | 19:00 |
cmurphy | nope it's in the queue | 19:00 |
*** tbachman has left #openstack-infra | 19:01 | |
*** lucasagomes is now known as lucas-afk | 19:02 | |
*** suhdood has quit IRC | 19:02 | |
*** myoung|food is now known as myoung | 19:03 | |
dmsimard | ze01 is effectively picking up more builds than the others | 19:04 |
dmsimard | 4.4GB swap :( | 19:04 |
Shrews | pabelanger: fungi: ze07 restarted. picking up ze08... still waiting for ze05 | 19:04 |
ianw | just catching up ... i know we've all moved on, but about 8 hours ago AJaeger noticed inap was offline; that led me to notice that nl03 had dropped out of zookeeper, so I restarted that. then jhesketh restarted ze01, as it was not picking up jobs | 19:05 |
ianw | and i found the red herring that graphite.o.o had a full root partition -- which i thought might have led to some of the graph dropouts (bad stat collection) but was actually unrelated in the end | 19:06 |
fungi | dmsimard: the others should catch up now that they're getting restarted on newer code | 19:06 |
pabelanger | Shrews: sorry, I lost network access. back online now | 19:06 |
pabelanger | looking at ze03 / ze04 now | 19:06 |
Shrews | pabelanger: summary is ze06 and ze07 are done. doing ze08 now. waiting on ze05 | 19:06 |
dmsimard | ianw: I summarized some of the things on the infra log https://wiki.openstack.org/wiki/Infrastructure_Status | 19:06 |
*** r-daneel_ has joined #openstack-infra | 19:08 | |
*** david-lyle has joined #openstack-infra | 19:08 | |
pabelanger | Shrews: ze03 online, waiting for ze04 | 19:08 |
*** yamamoto has quit IRC | 19:08 | |
*** david-lyle has quit IRC | 19:08 | |
*** dsariel has quit IRC | 19:08 | |
*** david-lyle has joined #openstack-infra | 19:09 | |
*** r-daneel has quit IRC | 19:09 | |
*** r-daneel_ is now known as r-daneel | 19:09 | |
clarkb | dmsimard: yes, because it is running more aggressive scheduling code than the others (or was, while we restart things) | 19:09 |
clarkb | dmsimard: it should stabilize hopefully | 19:09 |
ianw | dmsimard: interesting ... yeah i couldn't determine any reason why nl03 seemed to detach itself from zookeeper; as mentioned in the etherpad there wasn't any smoking gun logs | 19:09 |
*** HeOS has quit IRC | 19:10 | |
Shrews | pabelanger: ze08 restarted. ze05 is taking FOREVER | 19:10 |
pabelanger | ze04 is online, but hasn't accepted any jobs yet | 19:11 |
pabelanger | unsure why | 19:11 |
pabelanger | there we go | 19:11 |
SamYaple | can anyone review this bindep patch? its been sitting 2+ months. I dont know how to get traction on it, but really need the patch | 19:16 |
SamYaple | https://review.openstack.org/#/c/517105/ | 19:16 |
*** Goneri has quit IRC | 19:19 | |
*** vhosakot_ has joined #openstack-infra | 19:20 | |
*** vhosakot has quit IRC | 19:23 | |
*** jpich has quit IRC | 19:26 | |
Shrews | pabelanger: i'm not convinced that ze05 is actually trying to stop. is there a good indicator that it's trying to wind down? maybe number of active ansible-playbook processes or something? | 19:28 |
pabelanger | Shrews: what command did you use? | 19:28 |
clarkb | Shrews: pabelanger I watch ps -elf | grep zuul | wc -l | 19:28 |
clarkb | that number should generally trend down over time | 19:28 |
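A slightly more convenient version of the same check: a small watcher that polls the process table and prints the zuul/ansible process count, so the downward trend is easy to see. Just a sketch; it counts anything with 'zuul' or 'ansible-playbook' in its command line.

```python
# Sketch: poll the process table and print how many zuul/ansible-playbook
# processes remain, to watch an executor wind down after a stop.
import subprocess
import time

def count_processes():
    out = subprocess.check_output(["ps", "-eo", "args"],
                                  universal_newlines=True).splitlines()
    return sum(1 for line in out
               if "zuul" in line or "ansible-playbook" in line)

while True:
    print(time.strftime("%H:%M:%S"), count_processes())
    time.sleep(30)
```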
pabelanger | Shrews: should see socket accepting stopped in logs | 19:28 |
*** salv-orlando has joined #openstack-infra | 19:28 | |
pabelanger | 2018-02-06 18:35:54,142 DEBUG zuul.CommandSocket: Received b'stop' from socket for example | 19:29 |
fungi | though there will quite probably be tons of orphaned ssh-agent processes which never get cleaned up | 19:29 |
Shrews | pabelanger: i actually messed up on 05 and issued a restart before doing a stop | 19:29 |
pabelanger | that is on ze03 FWIW | 19:29 |
Shrews | which is why i'm not sure of its state | 19:29 |
pabelanger | ah | 19:30 |
* AJaeger spent so much time looking at grafana that he doesn't like to see N/A for TripleO anymore - https://review.openstack.org/541165 | 19:30 |
pabelanger | wonder if we should remove restart from init scripts, haven't had much luck with it | 19:30 |
Shrews | pabelanger: yes, we should :) | 19:30 |
pabelanger | Shrews: you might be able to issue stop again, if the command socket is still open | 19:30 |
Shrews | pabelanger: lemme try that | 19:30 |
*** salv-orlando has quit IRC | 19:30 | |
Shrews | the number of ansible processes seems to be dropping faster now | 19:32 |
*** slaweq has joined #openstack-infra | 19:33 | |
*** olaph1 has joined #openstack-infra | 19:36 | |
dmsimard | AJaeger: added a comment on https://review.openstack.org/#/c/541165/ | 19:37 |
AJaeger | dmsimard: which file? http://git.openstack.org/cgit/openstack-infra/project-config/tree/grafana | 19:38 |
*** slaweq has quit IRC | 19:38 | |
AJaeger | dmsimard: grafana.openstack.org does not delete removed files - so manual deletion is needed. I think the file you mean is removed from project-config already. If it's still in-tree, tell me and I'll update it | 19:38 |
dmsimard | AJaeger: oh, you're right the file is already removed | 19:39 |
*** olaph has quit IRC | 19:39 | |
AJaeger | dmsimard: if you want to manually remove dead files from grafana - great ;) There're a couple ;( | 19:39 |
dmsimard | AJaeger: if we can write down somewhere which ones are stale I can eventually look at it | 19:40 |
AJaeger | dmsimard: I'll make a list... | 19:40 |
*** baoli has joined #openstack-infra | 19:41 | |
ianw | ok, so re hwe kernel -- yeah afs doesn't build surprise surprise. it looks like 1.6.21.1 is the earliest that supports 4.13 ... but I'll make a custom OpenAFS 1.6.22 pkg i think | 19:41 |
*** baoli has quit IRC | 19:42 | |
*** tosky has joined #openstack-infra | 19:42 | |
*** jamesmca_ has joined #openstack-infra | 19:44 | |
dmsimard | T-60 minutes | 19:46 |
*** peterlisak has quit IRC | 19:46 | |
*** onovy has quit IRC | 19:47 | |
fungi | ianw: i wonder if the 1.8 beta packages in bionic build successfully on xenial ;) | 19:48 |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Remove unused grafana/nodepool files https://review.openstack.org/541421 | 19:49 |
AJaeger | dmsimard: two more unneeded ones ^ | 19:49 |
ianw | fungi: they probably would, at least the arm64 ones do on arm64 xenial ... but i only want to change one thing at a time as much as possible, so i think sticking with 1.6 branch for this test at this point? | 19:49 |
pabelanger | ianw: haven't been following, is fedora-27 ready to use now or still an issue with python? | 19:51 |
fungi | sure, i wasn't suggesting it's something we should run just now | 19:51 |
fungi | having newer 1.6 point release packages for all our systems would be good anyway, for previously-discussed reasons | 19:51 |
*** HeOS has joined #openstack-infra | 19:51 | |
AJaeger | dmsimard: http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards | 19:52 |
*** HeOS has quit IRC | 19:56 | |
dhellmann | do things seem stable now? is it safe to approve some releases? | 19:58 |
clarkb | dhellmann: I think they are stable in as much as we've worked around the ipv6 issues. The ipv6 problems haven't been corrected but shouldn't currently affect jobs | 19:59 |
dhellmann | clarkb : thanks | 19:59 |
AJaeger | dhellmann: just very heavy load - long backlog | 20:00 |
dhellmann | smcginnis : ^^ | 20:00 |
smcginnis | Just read that. | 20:00 |
clarkb | fungi: "kAFS supports IPv6 which OpenAFS does not; and it implements some of the AuriStorFS services so that IPv6 can be used not only with the Location service but with file services as well." | 20:00 |
smcginnis | So we can approve things, it just might take awhile. | 20:00 |
*** jtomasek has quit IRC | 20:00 | |
dhellmann | it looks like the aodh and panko releases were approved but not run. maybe remove your W+1 and re-apply it? | 20:01 |
corvus | ianw, fungi, clarkb: i think it was kerberos 5 support (which is needed for the sha256 stuff we use) which kafs was weak on last time i looked. perhaps that's changed -- but that's a thing to watch out for. | 20:01 |
smcginnis | dhellmann: Just about to do that. ;) | 20:01 |
ianw | auristor: ^^ i bet you'd know about that? | 20:02 |
*** owalsh has quit IRC | 20:02 | |
*** slaweq has joined #openstack-infra | 20:02 | |
clarkb | corvus: https://www.infradead.org/~dhowells/kafs/ looks like it supports kerberos 5 under AF_RXRPC | 20:02 |
*** owalsh has joined #openstack-infra | 20:02 | |
*** amoralej is now known as amoralej|off | 20:02 | |
*** dtantsur is now known as dtantsur|afk | 20:03 | |
*** onovy has joined #openstack-infra | 20:03 | |
*** caphrim007_ has quit IRC | 20:03 | |
clarkb | still no apache errors in the error log for zuul status so I'm closing my tail of that and making lunch | 20:03 |
smcginnis | dhellmann: Still not picking up my reapproval. Either things are really slow, or might need you to do it? | 20:03 |
*** caphrim007 has joined #openstack-infra | 20:03 | |
smcginnis | dhellmann: Oh, they're stacked up! | 20:04 |
dhellmann | sigh. we keep telling them not to do that. | 20:04 |
*** HeOS has joined #openstack-infra | 20:06 | |
dmsimard | corvus: I'm trying to see if there would be a way to highlight if a particular executor seems to be failing/timeouting jobs more than the others .. in graphite we have data for jobs but it doesn't link them back to a particular executor. I guess logstash allows me to query by job status and executor but we're much more aggressive in pruning that data out. Am I missing something ? | 20:06 |
*** caphrim007 has quit IRC | 20:07 | |
*** caphrim007 has joined #openstack-infra | 20:07 | |
*** rfolco|ruck is now known as rfolco|off | 20:07 | |
*** onovy has quit IRC | 20:07 | |
dmsimard | I understand that job failures/timeouts are not necessarily a result of bad behavior but instead legitimate job/patch issues but it might be worthwhile to see if there's a difference as we start to tweak some knobs here and there | 20:08 |
corvus | dmsimard: nope. right now you could grep logs. if you want to add stats counters to executors that'd be ok | 20:08 |
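dmsimard's actual patch shows up later in the log (https://review.openstack.org/541452); purely as an illustration of the idea, a sketch of per-executor counters using the python statsd client. The metric names here are made up, not zuul's real statsd keys.

```python
# Illustration only: per-executor statsd counters keyed by hostname, so
# graphite can show whether one executor fails/timeouts more than the rest.
# Metric names below are hypothetical, not zuul's actual statsd keys.
import socket

import statsd  # the 'statsd' client library

client = statsd.StatsClient("localhost", 8125, prefix="zuul.executor")
hostname = socket.gethostname().split(".")[0]

def record_build_result(result):
    # result would be something like SUCCESS, FAILURE, TIMED_OUT, ...
    client.incr("%s.builds.%s" % (hostname, result.lower()))

record_build_result("TIMED_OUT")
```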
Shrews | i know ze05 seems to be timing out jobs like crazy. waiting for it to shut down now and it's taking forever b/c the jobs seem to go to timeout | 20:08 |
dmsimard | corvus: okay, I'll see if I can figure out a patch. | 20:08 |
dmsimard | Shrews: I'll take a look | 20:09 |
dmsimard | Shrews: absolutely nothing going on in logs, huh | 20:09 |
dmsimard | last log entry ~4 minutes ago | 20:09 |
dmsimard | that's unusual | 20:10 |
Shrews | dmsimard: that's because it's shutting down i think. not accepting new jobs | 20:10 |
dmsimard | Shrews: yeah but even then the ongoing jobs ought to be printing things | 20:10 |
Shrews | pabelanger: fungi: number of ansible processes on ze05 into single digits now. i'm hoping that means it's almost done | 20:12 |
Shrews | it was in the 70's when i started watching it | 20:12 |
dmsimard | yeah it seems like it's wrapping up the remaining jobs | 20:12 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 20:13 |
*** ldnunes has quit IRC | 20:16 | |
*** r-daneel_ has joined #openstack-infra | 20:16 | |
*** wolverineav has quit IRC | 20:17 | |
*** r-daneel has quit IRC | 20:17 | |
*** r-daneel_ is now known as r-daneel | 20:17 | |
*** slaweq has quit IRC | 20:18 | |
*** wolverineav has joined #openstack-infra | 20:18 | |
*** slaweq has joined #openstack-infra | 20:19 | |
*** onovy has joined #openstack-infra | 20:20 | |
pabelanger | Shrews: ack | 20:22 |
*** Goneri has joined #openstack-infra | 20:24 | |
*** e0ne has joined #openstack-infra | 20:24 | |
pabelanger | clarkb: spacex stream live now! | 20:24 |
*** onovy has quit IRC | 20:25 | |
smcginnis | For those interested: http://www.spacex.com/webcast | 20:25 |
dmsimard | T-20 | 20:25 |
* fungi crosses fingers that the up goer five turns elon's scrap car into an asteroid | 20:26 | |
clarkb | ya watching | 20:26 |
clarkb | I'm sad that heavy gave up on asparagus | 20:26 |
*** florianf has quit IRC | 20:28 | |
*** florianf has joined #openstack-infra | 20:28 | |
*** florianf has quit IRC | 20:28 | |
*** jamesdenton has joined #openstack-infra | 20:29 | |
*** peterlisak has joined #openstack-infra | 20:32 | |
*** onovy has joined #openstack-infra | 20:33 | |
*** olaph has joined #openstack-infra | 20:38 | |
*** olaph1 has quit IRC | 20:40 | |
*** kjackal has quit IRC | 20:43 | |
*** dsariel has joined #openstack-infra | 20:44 | |
fungi | whoosh | 20:45 |
mordred | clarkb: where do they launch these from? | 20:45 |
fungi | this one's going up from kennedy | 20:46 |
clarkb | mordred: same pad saturn 5 used | 20:46 |
clarkb | but they are going to mars not the moon | 20:47 |
fungi | and there go the boosters | 20:47 |
openstackgerrit | Merged openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 20:48 |
fungi | i still don't have the audio going, but... this is a promisingly long time with no explosion | 20:48 |
smcginnis | Lots of cheering. | 20:48 |
smcginnis | And David Bowie. | 20:49 |
fungi | appropriate | 20:49 |
smcginnis | Haha, that was pretty cool. | 20:50 |
*** hongbin has joined #openstack-infra | 20:50 | |
*** gfidente|afk has quit IRC | 20:51 | |
*** olaph1 has joined #openstack-infra | 20:52 | |
smcginnis | Dang that was cool. | 20:53 |
fungi | wow! | 20:53 |
Shrews | that's science well done | 20:53 |
clarkb | its like synchronized swimming | 20:53 |
clarkb | but with fire and rockets and space | 20:53 |
fungi | looks like they didn't topple this time either? | 20:54 |
smcginnis | clarkb: Glad I'm not the only one that thought of that. ;) | 20:54 |
*** olaph has quit IRC | 20:54 | |
clarkb | fungi: ya boosters appeared to land safely | 20:54 |
fungi | that was just the first two, right? third is on its way down later? | 20:54 |
smcginnis | Yep | 20:55 |
corvus | yeah, it's either landed on the 'of course i still love you' drone ship at sea, or ... not. | 20:55 |
fungi | heh | 20:55 |
fungi | okay, the "don't panic" on the dash got me | 20:57 |
smcginnis | I loved that detail. :) | 20:57 |
pabelanger | okay, I was jumping up and down in living room with excitement | 20:57 |
pabelanger | way awesome | 20:57 |
mordred | amazing | 20:58 |
*** e0ne has quit IRC | 21:01 | |
*** camunoz has quit IRC | 21:02 | |
Shrews | pabelanger: i have to afk for a bit. can you watch ze05 and restart when it's done? it looks really close... maybe 1 or 2 jobs | 21:03 |
fungi | pabelanger: Shrews: looks like we're still waiting on ze05 to get started and then ze10 to get stopped/started before i stop ze09? | 21:03 |
Shrews | fungi: oh, i didn't know there was a ze10 | 21:03 |
fungi | i've confirmed the rest have a /var/run/zuul/executor.pid from today | 21:03 |
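That pid-file check can be scripted too; a rough sketch which, run on an executor, reports whether /var/run/zuul/executor.pid was written today and whether the recorded pid is still alive.

```python
# Sketch: confirm an executor was restarted today by checking the pid file's
# mtime and whether the recorded pid still refers to a live process.
import datetime
import os

PID_FILE = "/var/run/zuul/executor.pid"

mtime = datetime.date.fromtimestamp(os.path.getmtime(PID_FILE))
pid = int(open(PID_FILE).read().strip())

try:
    os.kill(pid, 0)  # signal 0: existence check only, sends nothing
    alive = True
except ProcessLookupError:
    alive = False
except PermissionError:
    alive = True  # process exists but is owned by another user

print("pid file written today:", mtime == datetime.date.today())
print("pid %d alive: %s" % (pid, alive))
```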
pabelanger | Shrews: sure | 21:03 |
Shrews | pabelanger: thx | 21:03 |
pabelanger | yah, I haven't done ze10 yet. I can do that now | 21:04 |
fungi | Shrews: yeah, we have 10 total | 21:04 |
pabelanger | okay, ze10 stopping | 21:05 |
*** tiswanso has quit IRC | 21:06 | |
*** tiswanso has joined #openstack-infra | 21:07 | |
pabelanger | ze10 started | 21:08 |
AJaeger | config-core, please consider to review later some grafana updates: Remove tripleo remains from grafana https://review.openstack.org/541165 - and cleanup of unused nodepool providers https://review.openstack.org/541421 | 21:13 |
fungi | pabelanger: awesome, so we're just waiting on 05 now? | 21:13 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 21:13 |
mnaser | can we approve things now? :> | 21:16 |
AJaeger | mnaser: yes, should be fine AFAIU | 21:17 |
*** hemna_ has quit IRC | 21:17 | |
pabelanger | fungi: yah, think we are down to last job running | 21:17 |
*** e0ne has joined #openstack-infra | 21:17 | |
AJaeger | fungi, pabelanger http://grafana.openstack.org/dashboard/db/zuul-status shows a constant 19 "starting builds" on ze05 | 21:17 |
pabelanger | only see 1 ansible-playbook process | 21:17 |
AJaeger | that looks strange... | 21:17 |
pabelanger | AJaeger: expected, statsd has stopped running I believe | 21:18 |
clarkb | for 541421 we'll have to manually delete the dashboards once we stop having them in the config | 21:18 |
AJaeger | clarkb, http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards | 21:18 |
clarkb | AJaeger: the first list can already be deleted? if so cool I can help get those all cleared out | 21:19 |
AJaeger | clarkb: yes, the first ones can be deleted already. | 21:19 |
AJaeger | dmsimard: ^ | 21:19 |
AJaeger | clarkb: just the last two not. and if you delete too many, I hope puppet creates them again ;) | 21:20 |
*** tpsilva has quit IRC | 21:20 | |
* AJaeger waves good night | 21:21 | |
clarkb | AJaeger: it should! | 21:22 |
dmsimard | (catching up on spacex).. the two side boosters landing simultaneously was awesome. | 21:25 |
fungi | that actually _is_ rocket science | 21:26 |
fungi | the rocket surgery comes later, when they have to put it all back together | 21:26 |
pabelanger | okay, we have no more ansible processes on ze05, but zuul-executor still running | 21:26 |
dmsimard | It must be so hard. I mean my day to day job is kinda hard.. but that is some crazy level of hard :) | 21:26 |
*** jamesmca_ has quit IRC | 21:26 | |
pabelanger | do we want to debug or just kill off? | 21:26 |
fungi | pabelanger: nothing else getting appended to the debug log? | 21:27 |
*** jamesmca_ has joined #openstack-infra | 21:27 | |
fungi | maybe attempt to trigger a thread dump (though i expect it may get ignored because it's already in the sigterm handler) | 21:27 |
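For background, that thread dump is the usual signal-triggered stack dump; a generic sketch of the mechanism (not zuul's exact code), writing every thread's stack to the log on SIGUSR2.

```python
# Generic sketch of a signal-triggered thread dump (not zuul's exact code):
# on SIGUSR2, write a stack trace for every live thread to the log.
import logging
import signal
import sys
import threading
import traceback

log = logging.getLogger("threaddump")

def dump_threads(signum, frame):
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        log.error("Thread %s (%s):\n%s",
                  ident, names.get(ident, "?"),
                  "".join(traceback.format_stack(stack)))

signal.signal(signal.SIGUSR2, dump_threads)
```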
pabelanger | fungi: no, I think using service zuul-executor restart was the reason we got into this state; I've seen odd things happen if you start another executor while one is already running | 21:28 |
pabelanger | fungi: k | 21:28 |
fungi | sounds possible | 21:28 |
*** salv-orlando has joined #openstack-infra | 21:30 | |
*** felipemonteiro has joined #openstack-infra | 21:31 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove TripleO pipelines from grafana https://review.openstack.org/541165 | 21:32 |
pabelanger | fungi: okay, I dumped threads, they are in debug logs now | 21:32 |
*** threestrands has joined #openstack-infra | 21:32 | |
*** threestrands has quit IRC | 21:32 | |
*** threestrands has joined #openstack-infra | 21:32 | |
pabelanger | fungi: I'll kill off processes now and we can inspect logs | 21:33 |
fungi | after you kill the processes, may also be good to save a trimmed copy of just the thread dump so we don't have to dig them out of the log later, and then start everything back up again | 21:33 |
*** salv-orlando has quit IRC | 21:33 | |
pabelanger | sure | 21:34 |
dmsimard | Still no news on the center core, hope it's just a video feed issue :/ | 21:34 |
fungi | 2 of 3 recovered is still a pretty good result at this point | 21:34 |
pabelanger | okay, ze05 started again | 21:34 |
fungi | given the team was only about 50% confident the payload would even make it to orbit | 21:35 |
pabelanger | cleaning up copied logs | 21:35 |
dmsimard | Not arguing otherwise, they often get video feed issues when landing on barges | 21:35 |
fungi | and considering the number of recovery failures they've had up to now | 21:35 |
*** jbadiapa has joined #openstack-infra | 21:36 | |
dmsimard | I bet they're pulling a bunch of telemetry off the car too, it's not just a dumb car :D | 21:36 |
openstackgerrit | Merged openstack-infra/project-config master: gerritbot: Add queens and rocky to ironic IRC notifications https://review.openstack.org/540986 | 21:37 |
*** jbadiapa has quit IRC | 21:40 | |
*** dave-mcc_ has quit IRC | 21:41 | |
*** ijw has joined #openstack-infra | 21:42 | |
*** hemna_ has joined #openstack-infra | 21:42 | |
fungi | stopping the executor on ze09 now in preparation for using it as a demo/canary for the v6 routing issues we've seen | 21:44 |
*** jamesmca_ has quit IRC | 21:44 | |
*** peterlisak has quit IRC | 21:44 | |
*** tiswanso has quit IRC | 21:45 | |
*** onovy has quit IRC | 21:45 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul master: Add Executor Merger and Ansible execution statsd counters https://review.openstack.org/541452 | 21:46 |
jhesketh | Morning | 21:46 |
*** agopi__ has joined #openstack-infra | 21:46 | |
ijw | Is there something funky with the check jobs? I've had one sat on the queue for 2 hours, which is highly unusual | 21:46 |
*** jamesmca_ has joined #openstack-infra | 21:47 | |
*** hemna_ has quit IRC | 21:48 | |
clarkb | ijw: we had networking problems in the cloud hosting the zuul control plane which led to a large backlog | 21:48 |
clarkb | ijw: we've worked around that but result is large backlog takes time to get through | 21:48 |
*** agopi_ has joined #openstack-infra | 21:49 | |
*** agopi_ is now known as agopi | 21:49 | |
*** agopi__ has quit IRC | 21:52 | |
*** tiswanso has joined #openstack-infra | 21:52 | |
*** tiswanso has quit IRC | 21:52 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 21:52 |
*** eumel8 has quit IRC | 21:57 | |
fungi | hrm, still some 26 zuul processes on ze09 15 minutes after stopping | 21:57 |
fungi | most look like: | 21:57 |
fungi | ssh: /var/lib/zuul/builds/ed57db43d883405e9665eac1e1cba797/.ansible/cp/ba8cd30f56 [mux] | 21:58 |
fungi | are those likely to just hang around indefinitely, or should i wait for them to clear up? | 21:58 |
*** ijw has quit IRC | 21:58 | |
fungi | the count hasn't changed in the ~5 minutes i've been checking it | 21:58 |
pabelanger | there seems to be a single job for 526171,3 that has been queued for some time in integrated gate. openstack-tox-pep8, unsure why it hasn't got a node yet | 21:59 |
fungi | looks like the zuul-executor daemons themselves are no longer running, so i assume these are just leaked orphan processes | 21:59 |
dmsimard | Did we talk about openstack summit presentations at the infra meeting ? I wasn't paying a lot of attention. Deadline is thursday. | 22:00 |
fungi | oh, right these are the ssh persistent sockets which time out after an hour of inactivity or something | 22:00 |
fungi | dmsimard: clarkb mentioned the cfp submission deadline, yeah | 22:00 |
dmsimard | I'm writing a draft for a talk about ARA (and Zuul) and another which is about the "life" of a commit from git review to RDO stable release (and a magic black box where the productized version appears) | 22:01 |
dmsimard | happy to participate in anything you'd like to throw together, I'm a frequent speaker at local meetups :D | 22:01 |
*** shrircha has joined #openstack-infra | 22:02 | |
*** jamesmca_ has quit IRC | 22:02 | |
*** vhosakot has joined #openstack-infra | 22:03 | |
*** shrircha has quit IRC | 22:03 | |
*** jamesmca_ has joined #openstack-infra | 22:04 | |
pabelanger | having issues with paste.o.o | 22:04 |
pabelanger | however | 22:04 |
pabelanger | https://pastebin.com/7aB9uvn0 | 22:04 |
pabelanger | corvus: I am seeing that error currently in zuul debug.log | 22:04 |
*** salv-orlando has joined #openstack-infra | 22:05 | |
pabelanger | seems to be looping | 22:05 |
dmsimard | that's the same error as this morning | 22:05 |
fungi | `sudo systemctl disable zuul-executor.service` seems to have properly disabled zuul-executor on ze09, so going to reboot it now | 22:06 |
pabelanger | fungi: ack | 22:06 |
*** vhosakot_ has quit IRC | 22:07 | |
*** ijw has joined #openstack-infra | 22:07 | |
fungi | pabelanger: hrm, yeah, paste.o.o is taking its sweet time responding to my browser's requests as well | 22:07 |
pabelanger | dmsimard: oh, maybe this is fixed by 541454 then. Hopefully corvus will confirm | 22:08 |
ijw | clarkb: thanks. I'm in no rush, I just wanted to make sure it wasn't our fault | 22:08 |
pabelanger | but, looks like integrated change queue is wedged | 22:08 |
*** vhosakot has quit IRC | 22:09 | |
fungi | restarting openstack-paste on paste.o.o seems to have fixed whatever was hanging requests indefinitely | 22:09 |
pabelanger | fungi: great, thanks | 22:09 |
pabelanger | I need to step away and help prepare dinner | 22:10 |
corvus | that is not the error from this morning. the error from this morning was http://paste.openstack.org/raw/663579/ | 22:10 |
*** hemna_ has joined #openstack-infra | 22:12 | |
*** e0ne has quit IRC | 22:15 | |
*** openstackgerrit has quit IRC | 22:16 | |
*** salv-orlando has quit IRC | 22:17 | |
corvus | Shrews: you win a cookie -- the error that pabelanger just found is an unused node deletion between assignment and lock | 22:17 |
fungi | okay, ze09 is back up after a reboot, has no executor service running, has a v6 global address configured again, and is still unable to reach the review.openstack.org ipv6 address but can via ipv4 | 22:18 |
fungi | working on a ticket with rackspace now | 22:18 |
corvus | Shrews, pabelanger: node 0002403946 if you're curious | 22:18 |
corvus | we can probably just pop that change out of the queue, or ignore it (it's in check) | 22:19 |
*** jamesmca_ has quit IRC | 22:19 | |
ianw | ok ... i've installed the hwe 4.13 kernel on ze02, and built some custom 1.6.22.2 afs modules (packages in my homedir). any objections to stopping executor and rebooting? | 22:22 |
*** peterlisak has joined #openstack-infra | 22:22 | |
*** openstackgerrit has joined #openstack-infra | 22:22 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/project-config master: Remove legacy-tempest-dsvm-neutron-nova-next-full usage https://review.openstack.org/541477 | 22:22 |
clarkb | ianw: now is probably as good a time as any | 22:23 |
openstackgerrit | Matt Riedemann proposed openstack-infra/openstack-zuul-jobs master: Remove legacy-tempest-dsvm-neutron-nova-next-full job https://review.openstack.org/541479 | 22:24 |
fungi | through the magic of tcpdump i've also narrowed the v6 traffic issue down to being unidirectional | 22:24 |
openstackgerrit | Merged openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 22:24 |
fungi | v6 traffic can _reach_ ze09 from review, but v6 traffic cannot reach review from ze09 | 22:25 |
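The tcpdump on both ends is what actually pins down the direction; for a quick ongoing check of whether the path has recovered, a sketch of an IPv6-only TCP connect probe with a short timeout (host and port are placeholders).

```python
# Sketch: attempt an IPv6-only TCP connect with a short timeout, to test
# reachability from this end. Host and port below are placeholders.
import socket

HOST = "review.openstack.org"  # target for this direction of the test
PORT = 443

infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
family, socktype, proto, _, sockaddr = infos[0]

sock = socket.socket(family, socktype, proto)
sock.settimeout(10)
try:
    sock.connect(sockaddr)
    print("ipv6 connect to %s:%d ok" % (sockaddr[0], PORT))
except OSError as exc:
    print("ipv6 connect to %s:%d failed: %s" % (sockaddr[0], PORT, exc))
finally:
    sock.close()
```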
*** onovy has joined #openstack-infra | 22:25 | |
dmsimard | any oddities in ip6tables ? | 22:25 |
*** hashar has quit IRC | 22:26 | |
fungi | well, tcpdump should be listening to a bpf on the interface, in which case ip(6)tables doesn't matter | 22:26 |
fungi | do we have an approximate time for when this all started? | 22:27 |
*** Goneri has quit IRC | 22:27 | |
corvus | fungi: looks like 06:43 based on graphs | 22:28 |
ianw | what's the current thinking re timing of "zuul-executor graceful" | 22:28 |
fungi | thanks corvus | 22:28 |
corvus | ianw: doesn't work, just use 'zuul-executor stop' | 22:28 |
*** rcernin has joined #openstack-infra | 22:28 | |
corvus | (not implemented yet) | 22:29 |
ianw | ahhhhh, that would explain a lot | 22:29 |
corvus | (also, since retries are automatic, probably almost never worth using) | 22:29 |
ianw | right, was hoping it cleaned up the builds better or something | 22:30 |
corvus | stop should be a clean shutdown | 22:30 |
corvus | even if it worked, graceful would be the one to use only if you wanted to wait 4 hours for it to stop | 22:31 |
*** jamesmca_ has joined #openstack-infra | 22:33 | |
dmsimard | Seeing a sudden increase in ram utilization on zuul.o.o http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 22:36 |
dmsimard | Seems like we're headed for a repeat of this morning's spike | 22:36 |
*** jamesmca_ has quit IRC | 22:37 | |
*** edmondsw has quit IRC | 22:37 | |
*** edmondsw has joined #openstack-infra | 22:38 | |
dmsimard | corvus: do we need to land https://review.openstack.org/#/c/541454/ and restart the scheduler again ? /var/log/zuul/zuul.log is getting plenty of kazoo.exceptions.NoNodeError. | 22:39 |
*** edmondsw has quit IRC | 22:42 | |
corvus | dmsimard: see above, nonodeerror is not because of the bug in 541454 | 22:42 |
corvus | we need to restart all launchers with https://review.openstack.org/541375 to fix that error | 22:42 |
dmsimard | corvus: ok I'll add that info to the pad, thanks | 22:43 |
ianw | #status log ze02.o.o rebooted with xenial 4.13 hwe kernel ... will monitor performance | 22:43 |
openstackstatus | ianw: finished logging | 22:43 |
*** tbachman has joined #openstack-infra | 22:45 | |
* tbachman wonders if others have noticed zuul to be down again | 22:45 | |
fungi | #status log provider ticket 180206-iad-0005440 has been opened to track ipv6 connectivity issues between some hosts in dfw; ze09.openstack.org has its zuul-executor process disabled so it can serve as an example while they investigate | 22:45 |
openstackstatus | fungi: finished logging | 22:45 |
pabelanger | corvus: Shrews: that is good news | 22:46 |
fungi | tbachman: meaning the status page for it? | 22:46 |
tbachman | fungi: ack | 22:46 |
tbachman | oh | 22:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 22:46 |
tbachman | it’s up :) | 22:46 |
tbachman | sorry | 22:46 |
tbachman | looked down | 22:46 |
* tbachman goes and hides in shame | 22:46 | |
fungi | may take a minute to load because the current status.json payload is pretty huge | 22:46 |
cmurphy | it looked down for me too for a second | 22:47 |
dmsimard | Getting pulled to dinner, will be back later -- the trend for the zuul.o.o ram usage is not good but I was not able to get to the bottom of it. | 22:47 |
dmsimard | If it keeps up, we're going to be maxing ram before long | 22:47 |
*** agopi has quit IRC | 22:47 | |
*** salv-orlando has joined #openstack-infra | 22:47 | |
ianw | so ... would anyone say they know anything about graphite's logs? do we need the stuff in /var/log/graphite/carbon-cache-a ? | 22:47 |
corvus | ianw: i don't think we need any of that. pretty much ever. | 22:48 |
* jhesketh has skimmed most of the scrollback... are we still having any network issues? | 22:48 | |
jhesketh | (specifically v4) | 22:49 |
pabelanger | dmsimard: 20GB ram usage was the norm before this morning, leaving a 10GB buffer for zuul.o.o | 22:49 |
pabelanger | lets see what happens | 22:49 |
fungi | jhesketh: not sure whether the v4 connectivity issues we were seeing between executors and job nodes have persisted | 22:50 |
pabelanger | Shrews: corvus: I can start launcher restarts in 30mins or hold off until the morning, what ever is good for everybody | 22:50 |
auristor | ianw, fungi, clarkb, corvus: the rx security class supported by kafs and openafs only supports fcrypt (a weaker than 56-bit DES) wire integrity protection and encryption. In order to produce rxkad_k5+kdf tokens to support Kerberos v5 AES256-CTS-HMAC-SHA-1 enctypes for authentication you need a proper version of aklog and supporting libraries. | 22:52 |
fungi | auristor: thanks for the update. that's definitely useful info | 22:53 |
* jhesketh nods | 22:53 | |
*** myoung is now known as myoung|off | 22:53 | |
auristor | ianw, fungi, clarkb, corvus: AuriStorFS supports the yfs-rxgk security class which uses GSS-API (Krb5 only at the moment) for auth and AES256-CTS-HMAC-SHA1-96. The support for AES256-CTS-HMAC-SHA256-384 was added to Heimdal Kerberos and once it is in MIT Kerberos we will enable it in yfs-rxgk. | 22:54 |
auristor | yfs-rxgk will be added to kAFS this year | 22:55 |
fungi | that's awesome news | 22:55 |
pabelanger | corvus: I'm showing 526171 in gate (and blocking jobs), I'm not sure we can ignore it. Mind confirming, and I'll abandon / restore. | 22:56 |
auristor | AuriStorFS servers, clients and all admin tooling support IPv6. clients, fileservers and admin tools can be IPv6 only. ubik servers have to be assigned unique IPv4 addresses but they don't have to be reachable via IPv4 | 22:56 |
corvus | pabelanger: yep, go ahead and pop it | 22:57 |
*** ijw has quit IRC | 22:57 | |
corvus | is someone restarting the launchers, or should i do that now? | 22:58 |
pabelanger | corvus: go for it | 22:59 |
corvus | auristor: rxgk hasn't made it into openafs yet, right? | 22:59 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Make timeout value apply to entire job https://review.openstack.org/541485 | 23:02 |
corvus | #status log all nodepool launchers restarted to pick up https://review.openstack.org/541375 | 23:02 |
openstackstatus | corvus: finished logging | 23:02 |
auristor | corvus: yfs-rxgk != rxgk. and no rxgk is not in OpenAFS. There is a long history and I would be happy to share if you are interested. But after Feb 2012 the openafs leadership concluded that we couldn't accomplish our goals for the technology within openafs and created a new suite of rx services that mirror the AFS3 architecture that we could rapidly extend and innovate upon. | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Sync when doing disk accountant testing https://review.openstack.org/541486 | 23:03 |
*** slaweq has quit IRC | 23:04 | |
corvus | auristor: ah i didn't pick up that yfs-rxgk != rxgk, thanks | 23:04 |
auristor | corvus: think of AuriStorFS and now kafs as supporting two different file system protocols, afs3 and auristorfs. the protocol that is selected depends on the behavior of the peer at the rx layer. | 23:06 |
Shrews | corvus: pabelanger: yay! i iz teh winner! | 23:06 |
*** felipemonteiro has quit IRC | 23:06 | |
Shrews | corvus: pabelanger: did the fix to nodepool merge? | 23:07 |
*** hemna_ has quit IRC | 23:07 | |
corvus | Shrews: yep, should be in prod now | 23:07 |
*** felipemonteiro has joined #openstack-infra | 23:07 | |
Shrews | corvus: awesome. sorry i had to afk for a bit | 23:07 |
auristor | corvus: being dual-headed permits maximum compatibility. it was a major goal of AuriStorFS to provide a zero flag day upgrade and zero data loss | 23:07 |
pabelanger | okay, 526171 finally popped from gate pipeline | 23:08 |
*** wolverineav has quit IRC | 23:12 | |
*** r-daneel has quit IRC | 23:12 | |
*** r-daneel has joined #openstack-infra | 23:13 | |
*** wolverineav has joined #openstack-infra | 23:13 | |
*** aeng has joined #openstack-infra | 23:16 | |
*** wolverineav has quit IRC | 23:17 | |
*** felipemonteiro has quit IRC | 23:18 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Use nested tempfile fixture for cleanups https://review.openstack.org/541487 | 23:20 |
*** sshnaidm|bbl has quit IRC | 23:21 | |
*** M4g1c5t0rM has joined #openstack-infra | 23:26 | |
*** armaan has quit IRC | 23:27 | |
*** rossella_s has quit IRC | 23:28 | |
*** rossella_s has joined #openstack-infra | 23:29 | |
*** slaweq has joined #openstack-infra | 23:31 | |
*** M4g1c5t0rM has quit IRC | 23:31 | |
openstackgerrit | Ian Wienand proposed openstack-infra/puppet-graphite master: Fix up log rotation https://review.openstack.org/541488 | 23:34 |
*** slaweq has quit IRC | 23:37 | |
openstackgerrit | Merged openstack-infra/project-config master: Add ansible-role-k8s-cinder to zuul.d https://review.openstack.org/534608 | 23:38 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 23:40 |
*** jcoufal has quit IRC | 23:42 | |
ianw | http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64155&rra_id=0&view_type=tree&graph_start=1517874206&graph_end=1517960606 <-- is this so choppy because of network, or something else? | 23:47 |
clarkb | ianw: ya I think cacti is using the AAAA records to get snmp and it's not reliable on some of our executors | 23:48 |
*** edgewing_ has joined #openstack-infra | 23:48 | |
clarkb | ianw: also we've just turned it off at this point by removing the ip address config for ipv6 on the executors | 23:49 |
*** caphrim007_ has joined #openstack-infra | 23:49 | |
corvus | that's been choppy for a while though. i think there's something unique about that host. i don't know what. | 23:50 |
*** dingyichen has joined #openstack-infra | 23:50 | |
clarkb | I'm going to look at cleaning up grafana now | 23:50 |
* clarkb reads grafana api docs | 23:51 | |
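The cleanup can be scripted against Grafana's HTTP API; a sketch assuming an admin API key and the slug-based delete endpoint available in Grafana of that era -- the slug list is a placeholder.

```python
# Sketch: delete stale dashboards via the Grafana HTTP API. Assumes an admin
# API key and the slug-based delete endpoint; slugs here are placeholders.
import requests

GRAFANA_URL = "https://grafana.openstack.org"
API_KEY = "REPLACE_ME"  # admin API key, kept out of version control
STALE_SLUGS = ["example-old-dashboard"]  # placeholder list

headers = {"Authorization": "Bearer %s" % API_KEY}
for slug in STALE_SLUGS:
    resp = requests.delete(
        "%s/api/dashboards/db/%s" % (GRAFANA_URL, slug),
        headers=headers,
    )
    print(slug, resp.status_code)
```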
*** caphrim007 has quit IRC | 23:52 | |
*** caphrim007_ has quit IRC | 23:53 | |
*** stakeda has joined #openstack-infra | 23:54 | |
clarkb | ok, the first portion of AJaeger's list is cleaned up in grafana; looks like we are still waiting for the second portion's change to merge? | 23:58 |
pabelanger | mgagne: it looks like we might have a quota mismatch again in inap. Do you mind confirming when you have a moment | 23:59 |