*** Tim_ok has quit IRC | 00:02 | |
*** sthussey has quit IRC | 00:22 | |
*** felipemonteiro has joined #openstack-infra | 00:24 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: use find instead of ls to list interfaces https://review.openstack.org/604677 | 00:26 |
*** longkb has joined #openstack-infra | 00:31 | |
*** jamesdenton has quit IRC | 00:36 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Allow debootstrap to cleanup without a kernel https://review.openstack.org/604692 | 00:42 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add Gentoo iptables handling https://review.openstack.org/604688 | 00:50 |
Shrews | clarkb: not sure if you're still seeing it, but if nodes are locked after restarting all launchers, then zuul has the lock and you shouldn't try to undo that | 00:51 |
*** hashar has joined #openstack-infra | 00:56 | |
*** felipemonteiro has quit IRC | 00:57 | |
clarkb | Shrews: even after 14 days? | 00:59 |
clarkb | maybe it is a hold? | 00:59 |
ianw | clarkb: there are a couple of held bhs1 nodes ... that was what i was looking at for the old configs | 01:02 |
clarkb | ah maybe we can clean those up now? | 01:03 |
*** rlandy has joined #openstack-infra | 01:05 | |
ianw | clarkb: hrm, none of the bhs1 nodes currently seem to have a comment | 01:06 |
ianw | hrm, well maybe it never had a comment. 0001975067 is the node i'm talking about from https://review.openstack.org/#/c/603988/ ... that was one of the few nodes that remained after the region was shutdown | 01:10 |
*** graphene has joined #openstack-infra | 01:12 | |
prometheanfire | just one more review is needed for https://review.openstack.org/#/c/602439/ to get gentoo as a usable image | 01:15 |
prometheanfire | :( | 01:17 |
*** rlandy has quit IRC | 01:19 | |
*** mrsoul has joined #openstack-infra | 01:19 | |
*** harlowja has quit IRC | 01:22 | |
*** hashar has quit IRC | 01:26 | |
openstackgerrit | Ian Wienand proposed openstack-infra/puppet-graphite master: [wip] rpsec: check service running https://review.openstack.org/605286 | 01:26 |
*** graphene has quit IRC | 01:26 | |
*** owalsh_ has joined #openstack-infra | 01:29 | |
Shrews | clarkb: I don't think nodes in hold are locked. There is a bug in zuul somewhere that keeps nodes locked, but I forget what triggers it. | 01:31 |
Shrews | That might be what you're seeing | 01:31 |
*** owalsh has quit IRC | 01:33 | |
*** rh-jelabarre has quit IRC | 01:34 | |
pabelanger | IIRC, it happens when a dynamic reload happens during a noderequest, if the job is removed from zuul, we leak the noderequest and will never unlock | 01:35 |
*** hongbin has joined #openstack-infra | 01:39 | |
*** jamesdenton has joined #openstack-infra | 01:46 | |
*** annp has joined #openstack-infra | 01:47 | |
openstackgerrit | Matthew Thode proposed openstack-infra/openstack-zuul-jobs master: add Gentoo jobs and vars and also fix install test https://review.openstack.org/602439 | 01:58 |
*** rkukura has quit IRC | 02:00 | |
*** adriant has quit IRC | 02:05 | |
*** adriant has joined #openstack-infra | 02:14 | |
*** adriant has quit IRC | 02:15 | |
*** adriant has joined #openstack-infra | 02:17 | |
*** adriant has quit IRC | 02:31 | |
*** psachin has joined #openstack-infra | 02:39 | |
*** graphene has joined #openstack-infra | 02:40 | |
*** apetrich has quit IRC | 02:42 | |
*** Bhujay has joined #openstack-infra | 02:49 | |
*** imacdonn has quit IRC | 02:50 | |
*** imacdonn has joined #openstack-infra | 02:50 | |
*** felipemonteiro has joined #openstack-infra | 03:02 | |
*** adriant has joined #openstack-infra | 03:07 | |
*** Bhujay has quit IRC | 03:07 | |
*** felipemonteiro has quit IRC | 03:19 | |
*** ijw has joined #openstack-infra | 03:19 | |
*** diablo_rojo has quit IRC | 03:19 | |
*** ijw has quit IRC | 03:24 | |
*** ramishra has joined #openstack-infra | 03:26 | |
*** jamesmcarthur has joined #openstack-infra | 03:42 | |
*** dave-mccowan has quit IRC | 03:44 | |
*** dave-mccowan has joined #openstack-infra | 03:46 | |
*** jamesmcarthur has quit IRC | 03:48 | |
*** jamesmcarthur has joined #openstack-infra | 03:49 | |
*** armax has quit IRC | 03:53 | |
*** udesale has joined #openstack-infra | 03:53 | |
openstackgerrit | Merged openstack-infra/project-config master: Fix not working kolla graphs https://review.openstack.org/605026 | 03:54 |
*** ykarel has joined #openstack-infra | 03:59 | |
*** hongbin has quit IRC | 03:59 | |
*** felipemonteiro has joined #openstack-infra | 04:06 | |
*** pcaruana has joined #openstack-infra | 04:14 | |
*** jamesmcarthur has quit IRC | 04:21 | |
*** jamesmcarthur has joined #openstack-infra | 04:23 | |
*** jamesmcarthur has quit IRC | 04:27 | |
*** yamamoto has quit IRC | 04:31 | |
*** yamamoto has joined #openstack-infra | 04:31 | |
*** pcaruana has quit IRC | 04:38 | |
openstackgerrit | Merged openstack-infra/zuul master: Web: don't update the status cache more than once https://review.openstack.org/605243 | 04:57 |
*** auristor has quit IRC | 05:03 | |
*** jamesmcarthur has joined #openstack-infra | 05:05 | |
*** e0ne has joined #openstack-infra | 05:08 | |
*** jamesmcarthur has quit IRC | 05:10 | |
*** jamesmcarthur has joined #openstack-infra | 05:25 | |
*** Bhujay has joined #openstack-infra | 05:26 | |
*** jamesmcarthur has quit IRC | 05:31 | |
cloudnull | fungi clarkb been afk - still need anything with that instance? | 05:32 |
*** Bhujay has quit IRC | 05:32 | |
*** rkukura has joined #openstack-infra | 05:33 | |
cloudnull | going to bed finally; however, if it needs digging into let me know and i'll tackle it first thing in the morning | 05:35 |
*** auristor has joined #openstack-infra | 05:39 | |
*** dave-mccowan has quit IRC | 05:39 | |
*** quique|rover|off is now known as quiquell|rover | 05:40 | |
prometheanfire | yarp | 05:40 |
*** pcaruana has joined #openstack-infra | 05:43 | |
*** apetrich has joined #openstack-infra | 05:46 | |
*** jamesmcarthur has joined #openstack-infra | 05:47 | |
*** e0ne has quit IRC | 05:49 | |
*** Bhujay has joined #openstack-infra | 05:49 | |
*** felipemonteiro has quit IRC | 05:50 | |
*** rkukura has quit IRC | 05:50 | |
*** jamesmcarthur has quit IRC | 05:51 | |
*** jistr has quit IRC | 05:55 | |
*** jistr has joined #openstack-infra | 05:56 | |
*** jamesmcarthur has joined #openstack-infra | 06:08 | |
*** jamesmcarthur has quit IRC | 06:12 | |
*** jtomasek has joined #openstack-infra | 06:13 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: [wip] port graphite setup to ansible https://review.openstack.org/605336 | 06:14 |
*** aojea has joined #openstack-infra | 06:22 | |
*** dpawlik has joined #openstack-infra | 06:23 | |
*** diablo_rojo has joined #openstack-infra | 06:27 | |
*** jamesmcarthur has joined #openstack-infra | 06:28 | |
*** jamesmcarthur has quit IRC | 06:33 | |
*** quiquell|rover is now known as quique|rover|brb | 06:43 | |
*** chkumar|off is now known as chkumar|ruck | 06:44 | |
*** ijw has joined #openstack-infra | 06:46 | |
*** jamesmcarthur has joined #openstack-infra | 06:50 | |
*** graphene has quit IRC | 06:50 | |
*** ijw has quit IRC | 06:50 | |
*** graphene has joined #openstack-infra | 06:51 | |
*** jamesmcarthur has quit IRC | 06:54 | |
*** ginopc has joined #openstack-infra | 06:57 | |
*** quique|rover|brb is now known as quiquell|rover | 07:00 | |
*** rcernin has quit IRC | 07:02 | |
*** alexchadin has joined #openstack-infra | 07:03 | |
*** hashar has joined #openstack-infra | 07:06 | |
*** ijw has joined #openstack-infra | 07:09 | |
*** jamesmcarthur has joined #openstack-infra | 07:10 | |
*** olivierb has joined #openstack-infra | 07:13 | |
*** ijw has quit IRC | 07:13 | |
*** jamesmcarthur has quit IRC | 07:15 | |
*** olivierb has quit IRC | 07:17 | |
*** olivierb has joined #openstack-infra | 07:17 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/openstack-zuul-jobs master: Remove tricircle dsvm jobs https://review.openstack.org/605344 | 07:19 |
*** psachin has quit IRC | 07:21 | |
*** shardy has joined #openstack-infra | 07:23 | |
*** alexchadin has quit IRC | 07:25 | |
*** psachin has joined #openstack-infra | 07:26 | |
*** jamesmcarthur has joined #openstack-infra | 07:31 | |
*** jamesmcarthur has quit IRC | 07:37 | |
*** ijw has joined #openstack-infra | 07:44 | |
*** jpich has joined #openstack-infra | 07:48 | |
*** ijw has quit IRC | 07:50 | |
*** alexchadin has joined #openstack-infra | 07:51 | |
*** jpena|off is now known as jpena | 07:51 | |
*** jamesmcarthur has joined #openstack-infra | 07:53 | |
*** rcernin has joined #openstack-infra | 07:56 | |
*** alexchadin has quit IRC | 07:57 | |
*** jamesmcarthur has quit IRC | 07:58 | |
*** ykarel is now known as ykarel|lunch | 07:59 | |
*** alexchadin has joined #openstack-infra | 07:59 | |
*** Guest42266 has joined #openstack-infra | 08:05 | |
*** hashar has quit IRC | 08:06 | |
*** hashar has joined #openstack-infra | 08:06 | |
*** jamesmcarthur has joined #openstack-infra | 08:09 | |
*** e0ne has joined #openstack-infra | 08:10 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 08:11 |
*** jamesmcarthur has quit IRC | 08:13 | |
*** Emine has joined #openstack-infra | 08:17 | |
*** e0ne has quit IRC | 08:20 | |
*** ykarel|lunch is now known as ykarel | 08:21 | |
*** jamesmcarthur has joined #openstack-infra | 08:24 | |
*** noama has joined #openstack-infra | 08:25 | |
*** jamesmcarthur has quit IRC | 08:29 | |
*** jistr has quit IRC | 08:30 | |
*** jistr has joined #openstack-infra | 08:31 | |
*** derekh has joined #openstack-infra | 08:33 | |
*** derekh has quit IRC | 08:33 | |
*** derekh has joined #openstack-infra | 08:34 | |
*** shardy has quit IRC | 08:35 | |
*** shardy has joined #openstack-infra | 08:36 | |
*** jamesmcarthur has joined #openstack-infra | 08:40 | |
*** jamesmcarthur has quit IRC | 08:45 | |
*** owalsh_ is now known as owalsh | 08:45 | |
*** olivier__ has joined #openstack-infra | 08:46 | |
*** olivierb has quit IRC | 08:46 | |
*** tosky has joined #openstack-infra | 08:48 | |
*** apetrich has quit IRC | 08:49 | |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Doc: executor operations document pause, remove graceful https://review.openstack.org/602455 | 08:55 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Doc: executor operations - explain jobs will be restarted at restart https://review.openstack.org/603136 | 08:55 |
*** jamesmcarthur has joined #openstack-infra | 08:56 | |
*** chkumar|ruck has quit IRC | 08:58 | |
*** chandankumar has joined #openstack-infra | 08:59 | |
*** chandankumar is now known as chkumar|ruck | 09:00 | |
*** jamesmcarthur has quit IRC | 09:01 | |
*** ykarel is now known as ykarel|away | 09:01 | |
*** e0ne has joined #openstack-infra | 09:04 | |
*** ykarel|away has quit IRC | 09:05 | |
*** alexchadin has quit IRC | 09:06 | |
*** dtantsur|afk is now known as dtantsur | 09:10 | |
*** jamesmcarthur has joined #openstack-infra | 09:12 | |
*** rcernin has quit IRC | 09:16 | |
*** jamesmcarthur has quit IRC | 09:17 | |
*** alexchadin has joined #openstack-infra | 09:20 | |
*** pbourke has quit IRC | 09:22 | |
*** pbourke has joined #openstack-infra | 09:23 | |
*** alexchadin has quit IRC | 09:25 | |
*** electrofelix has joined #openstack-infra | 09:26 | |
*** ssbarnea|bkp has quit IRC | 09:28 | |
*** jamesmcarthur has joined #openstack-infra | 09:28 | |
*** jamesmcarthur has quit IRC | 09:33 | |
*** priteau has joined #openstack-infra | 09:41 | |
*** alexchadin has joined #openstack-infra | 09:42 | |
*** jamesmcarthur has joined #openstack-infra | 09:44 | |
*** shardy is now known as shardy_mtg | 09:48 | |
*** jamesmcarthur has quit IRC | 09:49 | |
*** hashar is now known as hasharAway | 09:51 | |
*** gfidente has joined #openstack-infra | 09:53 | |
*** jamesmcarthur has joined #openstack-infra | 10:00 | |
*** diablo_rojo has quit IRC | 10:02 | |
*** jamesmcarthur has quit IRC | 10:04 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 10:06 |
*** longkb has quit IRC | 10:08 | |
*** shardy_mtg has quit IRC | 10:11 | |
*** jamesmcarthur has joined #openstack-infra | 10:15 | |
*** jamesmcarthur has quit IRC | 10:22 | |
*** e0ne has quit IRC | 10:27 | |
*** jamesmcarthur has joined #openstack-infra | 10:33 | |
*** yamamoto has quit IRC | 10:35 | |
*** jamesmcarthur has quit IRC | 10:38 | |
*** graphene has quit IRC | 10:40 | |
*** graphene has joined #openstack-infra | 10:41 | |
*** jamesmcarthur has joined #openstack-infra | 10:49 | |
*** alexchadin has quit IRC | 10:49 | |
*** felipemonteiro has joined #openstack-infra | 10:49 | |
*** jamesmcarthur has quit IRC | 10:53 | |
*** felipemonteiro has quit IRC | 10:56 | |
*** alexchadin has joined #openstack-infra | 10:58 | |
*** yamamoto has joined #openstack-infra | 11:00 | |
*** alexchadin has quit IRC | 11:03 | |
*** jamesmcarthur has joined #openstack-infra | 11:05 | |
*** ijw has joined #openstack-infra | 11:09 | |
*** jamesmcarthur has quit IRC | 11:09 | |
*** udesale has quit IRC | 11:12 | |
*** psachin has quit IRC | 11:13 | |
*** ijw has quit IRC | 11:13 | |
*** pcaruana has quit IRC | 11:15 | |
*** jamesmcarthur has joined #openstack-infra | 11:20 | |
*** jpena is now known as jpena|lunch | 11:21 | |
*** jamesmcarthur has quit IRC | 11:25 | |
*** psachin has joined #openstack-infra | 11:27 | |
*** olivier__ has quit IRC | 11:28 | |
*** olivierb has joined #openstack-infra | 11:29 | |
*** roman_g has quit IRC | 11:30 | |
*** oanson has quit IRC | 11:35 | |
*** jamesmcarthur has joined #openstack-infra | 11:36 | |
*** emerson has quit IRC | 11:39 | |
*** jamesmcarthur has quit IRC | 11:41 | |
*** emerson has joined #openstack-infra | 11:49 | |
*** roman_g has joined #openstack-infra | 11:49 | |
*** jamesmcarthur has joined #openstack-infra | 11:52 | |
*** e0ne has joined #openstack-infra | 11:53 | |
*** alexchadin has joined #openstack-infra | 11:57 | |
*** jamesmcarthur has quit IRC | 11:57 | |
*** rh-jelabarre has joined #openstack-infra | 11:58 | |
*** dpawlik has quit IRC | 11:58 | |
*** trown|outtypewww is now known as trown | 11:59 | |
*** jamesmcarthur has joined #openstack-infra | 12:08 | |
*** kgiusti has joined #openstack-infra | 12:11 | |
*** apetrich has joined #openstack-infra | 12:11 | |
*** jamesmcarthur has quit IRC | 12:13 | |
*** ijw has joined #openstack-infra | 12:15 | |
*** jamesmcarthur has joined #openstack-infra | 12:15 | |
*** alexchadin has quit IRC | 12:17 | |
*** dtantsur is now known as dtantsur|brb | 12:18 | |
*** rlandy has joined #openstack-infra | 12:18 | |
*** shardy has joined #openstack-infra | 12:19 | |
*** ijw has quit IRC | 12:19 | |
*** quiquell|rover is now known as quique|rover|lch | 12:20 | |
*** panda|off is now known as panda | 12:21 | |
*** agopi has quit IRC | 12:24 | |
*** alexchadin has joined #openstack-infra | 12:25 | |
*** jpena|lunch is now known as jpena | 12:28 | |
*** udesale has joined #openstack-infra | 12:30 | |
*** jamesmcarthur has quit IRC | 12:31 | |
*** yamamoto has quit IRC | 12:35 | |
AJaeger | fungi, do we need groups always when storyboard is used for new projects? See https://docs.openstack.org/infra/manual/creators.html#add-the-project-to-the-master-projects-list and https://review.openstack.org/#/c/605193/1/gerrit/projects.yaml , please | 12:40 |
*** psachin has quit IRC | 12:41 | |
*** quique|rover|lch is now known as quiquell|rover | 12:44 | |
*** jamesmcarthur has joined #openstack-infra | 12:46 | |
dmsimard | Just sharing: Ansible to adopt molecule and ansible-lint projects, https://groups.google.com/forum/m/#!topic/ansible-project/ehrb6AEptzA | 12:48 |
*** jcoufal has joined #openstack-infra | 12:49 | |
*** jamesmcarthur has quit IRC | 12:52 | |
*** janki has joined #openstack-infra | 12:55 | |
*** jamesmcarthur has joined #openstack-infra | 12:55 | |
fungi | AJaeger: groups are a convenience option, not mandatory. if a team only has one project then a group may not be warranted | 12:55 |
fungi | if a team has more than one project, they may want to put them in a project group together so they can query them as a set | 12:55 |
*** boden has joined #openstack-infra | 12:58 | |
*** Guest42266 is now known as florianf | 13:01 | |
fungi | our creators guide also doesn't suggest they're always needed, simply explains what they're used for | 13:02 |
*** gfidente has quit IRC | 13:02 | |
*** alexchadin has quit IRC | 13:02 | |
*** mriedem has joined #openstack-infra | 13:03 | |
AJaeger | fungi: ah, ok - thanks | 13:07 |
*** sshnaidm is now known as sshnaidm|mtg | 13:07 | |
*** yamamoto has joined #openstack-infra | 13:13 | |
*** alexchadin has joined #openstack-infra | 13:17 | |
*** ijw has joined #openstack-infra | 13:17 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 13:18 |
openstackgerrit | Merged openstack-infra/project-config master: Add new project: sardonic https://review.openstack.org/605193 | 13:20 |
*** yamamoto has quit IRC | 13:21 | |
*** yamamoto has joined #openstack-infra | 13:21 | |
*** ijw has quit IRC | 13:22 | |
dulek | Everything okay with http://zuul.openstack.org/status.html ? Running times (especially in post) seem a bit scary? | 13:22 |
fungi | dulek: http://lists.openstack.org/pipermail/openstack-dev/2018-September/134867.html | 13:26 |
*** chkumar|ruck is now known as chandankumar | 13:26 | |
dulek | fungi: Thanks! | 13:26 |
fungi | we're back up to capacity now but i think various openstack bugs are still causing a lot of churn in the gate pipeline starving other lower-priority pipelines | 13:26 |
openstackgerrit | Merged openstack-infra/irc-meetings master: Change congress meeting time https://review.openstack.org/605274 | 13:26 |
*** mrsoul has quit IRC | 13:26 | |
fungi | helping the community/qa team identify and fix bugs is probably the best place to focus on improving it | 13:27 |
AJaeger | fungi, should we kill the periodic jobs? We haven't run any of them for two days now... | 13:27 |
*** agopi has joined #openstack-infra | 13:27 | |
*** smarcet has joined #openstack-infra | 13:27 | |
fungi | they won't really use that much capacity once they do finally run | 13:29 |
dulek | fungi: I'll prioritize looking at kuryr-kubernetes CI flakiness then. Thanks for info! | 13:29 |
AJaeger | fungi, I fear we run them only at the weekend... | 13:32 |
*** panda has quit IRC | 13:32 | |
*** dtantsur|brb is now known as dtantsur | 13:33 | |
AJaeger | fungi: oh, we have them only once in the queue - not multiple times as I feared (if I interpret zuul.o.o correctly) - then this is fine. | 13:33 |
*** alexchadin has quit IRC | 13:44 | |
*** lbragstad has quit IRC | 13:45 | |
*** slaweq has quit IRC | 13:45 | |
*** alexchadin has joined #openstack-infra | 13:46 | |
*** jamesmcarthur has quit IRC | 13:49 | |
*** gfidente has joined #openstack-infra | 13:50 | |
*** lbragstad has joined #openstack-infra | 13:50 | |
*** alexchadin has quit IRC | 13:50 | |
*** jamesmcarthur has joined #openstack-infra | 13:51 | |
*** sthussey has joined #openstack-infra | 13:52 | |
*** ijw has joined #openstack-infra | 13:53 | |
*** jamesmcarthur has quit IRC | 13:56 | |
*** ijw has quit IRC | 13:58 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 13:58 |
fungi | yeah, if memory serves zuul doesn't enqueue multiples in periodic | 13:59 |
mordred | fungi: I think it will - but we could discuss changing its pipeline manager to supercedent so that it wouldn't | 14:01 |
mordred | also - good morning | 14:02 |
AJaeger | mordred: I'm fine with not enqueueing multiples ;) | 14:02 |
AJaeger | good morning, mordred | 14:02 |
*** yamamoto has quit IRC | 14:03 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787 | 14:03 |
chandankumar | fungi: AJaeger mordred https://review.openstack.org/#/c/605096/ please have a look , thanks :-) | 14:03 |
*** yamamoto has joined #openstack-infra | 14:04 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787 | 14:07 |
*** bobh has joined #openstack-infra | 14:08 | |
*** yamamoto has quit IRC | 14:09 | |
mordred | dtantsur: ^^ wow, that patch is old | 14:10 |
dtantsur | mordred: yeah, I hoped that nobody would come again with this problem to me.. I was proved wrong. | 14:10 |
dtantsur | it's complicated by the fact that I have no idea what I'm typing :) | 14:11 |
*** jamesmcarthur has joined #openstack-infra | 14:12 | |
*** janki has quit IRC | 14:12 | |
*** bobh has quit IRC | 14:13 | |
*** udesale has quit IRC | 14:15 | |
*** jamesmcarthur has quit IRC | 14:16 | |
mordred | dtantsur: my favorite problems! | 14:18 |
*** olivierb has quit IRC | 14:18 | |
*** panda has joined #openstack-infra | 14:21 | |
mordred | chandankumar, fungi: commented on https://review.openstack.org/#/c/605096/ | 14:22 |
fungi | i thought we were running that in check/gate? | 14:23 |
*** olivierb has joined #openstack-infra | 14:23 | |
fungi | at least AJaeger had pointed to the fact that we're testing sdist and wheel builds now as part of the standard template for projects participating in release management | 14:24 |
AJaeger | mordred: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n255 is the job | 14:26 |
AJaeger | mordred: part of publish-to-pypi template | 14:26 |
*** jamesmcarthur has joined #openstack-infra | 14:28 | |
*** eernst has joined #openstack-infra | 14:31 | |
*** jamesmcarthur has quit IRC | 14:32 | |
openstackgerrit | sebastian marcet proposed openstack-infra/openstackid-resources master: added new endpoint delete my presentation https://review.openstack.org/604130 | 14:32 |
njohnston | AJaeger: I have a quick question to clarify your feedback on https://review.openstack.org/#/c/605126/1/.zuul.yaml@a125 "remove these, they are not needed anymore now." Do you mean that the ansible playbooks are not consulted anymore for jobs depending on legacy-dsvm-base? | 14:33 |
*** bobh has joined #openstack-infra | 14:35 | |
AJaeger | njohnston: you remove the job using the roles, so remove the roles as well. | 14:38 |
*** armax has joined #openstack-infra | 14:39 | |
AJaeger | njohnston: I think you're confused by the diff ;) | 14:39 |
mordred | AJaeger, fungi: yes - but we also use those playbooks in the actual release job | 14:40 |
mordred | ah - wait | 14:41 |
mordred | playbooks/pti-python-tarball/check.yaml is the one that should get updated with the twine commands | 14:41 |
chandankumar | mordred: I will update that | 14:41 |
mordred | chandankumar: thanks! | 14:42 |
*** kashyap has left #openstack-infra | 14:42 | |
*** jamesmcarthur has joined #openstack-infra | 14:43 | |
*** Bhujay has quit IRC | 14:44 | |
jroll | hi friends, per https://review.openstack.org/#/c/605193/ could I please be added to the core and release groups? easiest to search for jim@jimrollenhagen.com https://review.openstack.org/#/admin/groups/1947,members https://review.openstack.org/#/admin/groups/1948,members | 14:45 |
njohnston | AJaeger: Ah! I thought you meant to remove the lines, not the files. *facepalm* Thanks for the clarification! | 14:46 |
AJaeger | njohnston: sorry for the confusion | 14:46 |
*** jamesmcarthur has quit IRC | 14:47 | |
*** yamamoto has joined #openstack-infra | 14:52 | |
pabelanger | mnaser: clarkb: sjc1 doesn't look happy right now: http://grafana.openstack.org/dashboard/db/nodepool-vexxhost | 14:55 |
mnaser | i know it was unhappy yesterday | 14:55 |
mnaser | let me double check | 14:55 |
pabelanger | nothing in nodepool log except cannot create server | 14:56 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: rewrite interface in react https://review.openstack.org/591604 | 14:58 |
*** jamesmcarthur has joined #openstack-infra | 14:59 | |
*** pcaruana has joined #openstack-infra | 15:01 | |
*** jamesmcarthur has quit IRC | 15:03 | |
*** bobh has quit IRC | 15:03 | |
quiquell|rover | Hello any infra-root here ? | 15:11 |
quiquell|rover | We have a review to fix timeouts https://review.openstack.org/#/c/605377/ | 15:12 |
quiquell|rover | Need to go over the gates | 15:12 |
clarkb | quiquell|rover: you mean that change needs to be promoted to the top of the gate? | 15:13 |
*** jamesmcarthur has joined #openstack-infra | 15:14 | |
fungi | i've just about given up trying to correct people on terminology. everybody seems to want to call everything "gates" even when they mean "gating jobs" or "changes in the gate pipeline" or whatever | 15:14 |
quiquell|rover | Hello | 15:14 |
clarkb | quiquell|rover: are you asking us to promote that change to the top of the gate? | 15:15 |
clarkb | this will restart testing for all of the other changes in the tripleo gate pipeline. So want to be sure we are doing that intentionally before we do it | 15:16 |
jaosorior | clarkb: restarting the testing of the other patches is fine. The patch he's trying to promote is meant to help out with the timeouts. | 15:16 |
*** bobh has joined #openstack-infra | 15:17 | |
*** janki has joined #openstack-infra | 15:17 | |
clarkb | jaosorior: quiquell|rover another thing I notice when looking at this is that tripleo changes have multiple non voting jobs in the gate. Can we remove those from the gate since they are non voting? | 15:17 |
clarkb | will help get things through more quickly for you (fewer jobs to wait on) and gives more resources to other changes | 15:18 |
*** jamesmcarthur has quit IRC | 15:18 | |
*** aojea has quit IRC | 15:19 | |
jaosorior | clarkb: yes, we're removing those https://review.openstack.org/#/c/603419/ | 15:19 |
clarkb | ok I am going to enqueue ^ to the gate, then promote the first change to the top and the second behind that change | 15:21 |
quiquell|rover | clarkb: non voting jobs running in gates ? will look into that | 15:22 |
quiquell|rover | clarkb, jaosorior: thanks | 15:22 |
* prometheanfire doesn't like fedora-multinode :| | 15:23 | |
jaosorior | clarkb: thank you! | 15:24 |
clarkb | quiquell|rover: jaosorior it is done | 15:25 |
*** chandankumar is now known as chkumar|off | 15:26 | |
*** quiquell|rover is now known as quique|rover|off | 15:27 | |
jaosorior | yay :D | 15:27 |
*** quique|rover|off is now known as quique|off | 15:28 | |
mnaser | infra-root: is it ok to just delete a vm that nodepool is trying to launch if its stuck? | 15:29 |
*** sshnaidm|mtg is now known as sshnaidm | 15:29 | |
*** jamesmcarthur has joined #openstack-infra | 15:29 | |
mnaser | (on my side) | 15:29 |
clarkb | mnaser: it should be, nodepool won't use the VM unless ssh works. And if the api shows it as error'd out then it should handle that fine | 15:30 |
openstackgerrit | Merged openstack-infra/openstackid-resources master: added new endpoint delete my presentation https://review.openstack.org/604130 | 15:30 |
clarkb | mnaser: if the node completely disappears I think nodepool will treat that as an error on its side too, but if it doesn't we should add that functionality | 15:30 |
fungi | if nodepool is repeatedly issuing "delete" calls for it to the api and it suddenly disappears, i think nodepool will treat that like business as usual? | 15:31 |
clarkb | fungi: ya | 15:32 |
clarkb | I expect it will be fine as is | 15:32 |
*** dpawlik has joined #openstack-infra | 15:32 | |
smarcet | fungi: afternoon hope that everything is fine on your side :) i have an issue on openstackid zuul jobs, https://review.openstack.org/#/c/604172/ seems that it failed due to a temporary error in the script, could you re-trigger the job ? | 15:33 |
clarkb | smarcet: leave comment that says "recheck" and it will reenqueue for you | 15:33 |
smarcet | oh ok | 15:34 |
smarcet | thx u ! | 15:34 |
fungi | yep, that will cause jobs to rerun automatically | 15:34 |
*** jamesmcarthur has quit IRC | 15:34 | |
*** jtomasek has quit IRC | 15:34 | |
smarcet | fungi: clarkb: thx u 4 info :) | 15:35 |
*** dpawlik has quit IRC | 15:36 | |
*** janki has quit IRC | 15:36 | |
*** eernst has quit IRC | 15:36 | |
fungi | though it may take some time. we're under a bit of a backlog this week | 15:37 |
pabelanger | mnaser: thrashing in sjc1 seems to have stopped, guess you are still looking into it | 15:40 |
mnaser | yup.. working on it.. | 15:40 |
pabelanger | ++ | 15:40 |
* prometheanfire thinks the requirements proposal bot update is failing again | 15:41 | |
clarkb | prometheanfire: failing or not running because of the backlog? | 15:42 |
*** Tim_ok has joined #openstack-infra | 15:42 | |
prometheanfire | backlog is 5 hours, I think it's past that by a bit now (double) | 15:43 |
prometheanfire | and it's been a couple days I think | 15:43 |
clarkb | prometheanfire: the post backlog is 53 hours | 15:44 |
clarkb | it has a lower priority than check and gate | 15:44 |
prometheanfire | wat, wow, ok | 15:44 |
*** jtomasek has joined #openstack-infra | 15:44 | |
prometheanfire | guess I wait | 15:44 |
*** jamesmcarthur has joined #openstack-infra | 15:45 | |
*** dklyle has joined #openstack-infra | 15:45 | |
prometheanfire | is there just more activity now? | 15:45 |
*** Bhujay has joined #openstack-infra | 15:46 | |
pabelanger | provider issues i think, ovh looks to also be having an outage: http://grafana.openstack.org/dashboard/db/nodepool-ovh | 15:46 |
prometheanfire | k | 15:46 |
pabelanger | and packethost, some launch errors too: http://grafana.openstack.org/dashboard/db/nodepool-packethost | 15:46 |
clarkb | pabelanger: prometheanfire it's easy to blame provider issues but I think the vast majority of it is that we have a lot of unreliable tests | 15:47 |
pabelanger | so, less VMs to service jobs | 15:47 |
clarkb | tripleo timing out with a gate queue of like 50 changes | 15:47 |
clarkb | neutron functional doesn't work | 15:47 |
pabelanger | clarkb: yup, gate resets too | 15:47 |
clarkb | glance can't pass unittests | 15:47 |
clarkb | tempest also has problems | 15:47 |
prometheanfire | clarkb: ya, seen some of that | 15:47 |
pabelanger | sorry, wasn't blaming, just noting things adding to the backlog | 15:47 |
clarkb | pabelanger: I just want to get away from the attitude that it is someone elses problem | 15:47 |
clarkb | openstack testing is bad right now | 15:47 |
*** xyang has joined #openstack-infra | 15:48 | |
clarkb | and openstack should work to fix that | 15:48 |
fungi | horizon only just merged fixes for their completely broken gating jobs | 15:48 |
clarkb | (and not think infra will fix it by adding capacity) | 15:48 |
pabelanger | Yes, I think that is a fair statement | 15:48 |
clarkb | the packethost issue is likely leaked ports | 15:48 |
clarkb | we should see if someone from neutron land (slaweq maybe?) can sync up with studarus on debugging that | 15:49 |
fungi | do we yet know what's going on with ovh-bhs1? | 15:49 |
mnaser | hmm | 15:49 |
*** jamesmcarthur has quit IRC | 15:49 | |
mnaser | it looks like nodepool isnt issuing creates anymore | 15:50 |
clarkb | fungi: no that is new to me | 15:50 |
mnaser | after deleting those stuck instances | 15:50 |
tosky | at least it's a nice stress test for zuul | 15:50 |
* tosky hides | 15:50 | |
fungi | looks like things started going sideways in bhs1 around 08:30 utc | 15:51 |
mnaser | i see vms spawning again | 15:51 |
clarkb | {"forbidden": {"message": "The number of defined ports: 360 is over the limit: 300", "code": 403}} | 15:51 |
mnaser | so if someone can kick off nodepool | 15:52 |
mnaser | oh | 15:52 |
mnaser | is that in sjc1? | 15:52 |
clarkb | no that is ovh bhs1 sorry | 15:52 |
*** panda is now known as panda|bbl | 15:52 | |
mnaser | after deleting the vms that were stuck | 15:52 |
mnaser | nodepool hasnt issued any creates | 15:52 |
mnaser | but it's working ok now | 15:52 |
fungi | so neutron has likely leaked ports in ovh-bhs1 for some reason | 15:52 |
fungi | i can try to manually delete them | 15:52 |
clarkb | sure enough we have many DOWN ports there | 15:52 |
prometheanfire | fungi: they were waiting on reqs a bit for that (horizon) | 15:53 |
clarkb | fungi: care to only delete the ones that are DOWN | 15:53 |
prometheanfire | they had to cap something :( | 15:53 |
pabelanger | mnaser: clarkb: I can look at nodepool-launcher, 1 sec | 15:53 |
clarkb | I wonder if all the clouds have upgraded to a buggy version of neutron/nova and now we leak ports all over | 15:53 |
clarkb | we should have a port cleanup thing in nodepool though, let me see if we can run that in ovh | 15:53 |
clarkb | or maybe that is only FIPs | 15:54 |
*** kopecmartin|ruck has joined #openstack-infra | 15:55 | |
*** kopecmartin|ruck has left #openstack-infra | 15:55 | |
*** kopecmartin|ruck has joined #openstack-infra | 15:55 | |
fungi | deleting all the ports marked as DOWN now | 15:56 |
*** jamesmcarthur has joined #openstack-infra | 15:56 | |
*** eglute has joined #openstack-infra | 15:56 | |
fungi | starting out with 357 down | 15:56 |
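The cleanup being run here is essentially a loop of `openstack port delete` calls over every port in DOWN state. A roughly equivalent sketch using openstacksdk is below; the cloud name is an assumption, and this mirrors the manual loop rather than any built-in nodepool feature.

```python
import openstack

# Cloud name is an assumption; use whatever clouds.yaml entry points at OVH BHS1.
conn = openstack.connect(cloud='ovh-bhs1')

# Leaked ports show up with status DOWN once the server that owned them is gone.
down_ports = list(conn.network.ports(status='DOWN'))
print('found %d DOWN ports' % len(down_ports))

for port in down_ports:
    try:
        conn.network.delete_port(port, ignore_missing=True)
    except openstack.exceptions.SDKException as exc:
        # The neutron API here was intermittently answering with 504s, so log
        # the failure and keep going rather than aborting the whole loop.
        print('failed to delete %s: %s' % (port.id, exc))
```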
clarkb | fwiw this looks very similar to the problems we have in packethost | 15:57 |
fungi | might be a bigger issue there... my first `openstack port delete ...` in the loop is hanging | 15:57 |
*** jamesmcarthur has quit IRC | 15:57 | |
clarkb | hrm they don't hang in packethost, but they aren't very fast | 15:57 |
pabelanger | 2018-09-26 15:56:36,487 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-vexxhost-specific]: Declining node request 100-0006323378 because node type(s) [ubuntu-trusty] not available | 15:57 |
*** jamesmcarthur has joined #openstack-infra | 15:57 | |
pabelanger | is sjc1 not running all images? | 15:57 |
fungi | well, i say "hanging" but i don't know that for sure. it's only been ~60 seconds so far | 15:57 |
fungi | without returning | 15:57 |
clarkb | this might be one of those situations where we need to get neutron and nova and our clouds all talking together to figure out why this is painful all of a sudden | 15:58 |
clarkb | pabelanger: it should be but maybe the config there is buggy? | 15:58 |
fungi | ahh, i guess my deletes aren't hanging, they're just "slow" (for fairly extreme definitions of the word) | 15:59 |
fungi | we're down to 340 now | 15:59 |
clarkb | dpawlik, amorin, studarus, mlavalle get together in a room and fix neutron | 15:59 |
fungi | and the port delete command in osc doesn't provide any output, so i thought it was still on the first in the set | 16:00 |
*** dave-mccowan has joined #openstack-infra | 16:00 | |
clarkb | it looks like something in the background may clean them up in ovh based on usage graphs | 16:01 |
clarkb | we'll spike then go quiet then spike again | 16:01 |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: Simplify vexxhost nodepool configuration https://review.openstack.org/605469 | 16:01 |
pabelanger | mnaser: clarkb: fungi: not a fix, but should reduce some copypasta for vexxhost ^ | 16:02 |
pabelanger | 2018-09-26 15:56:30,317 INFO nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Not enough quota remaining to satisfy request 200-0006317635 | 16:02 |
pabelanger | seems nodepool thinks it is at quota for sjc1 | 16:03 |
ssbarnea | clarkb: can you help with https://review.openstack.org/#/c/603061/ ? it is already 8 days old and without extra pings I doubt it will get merged. thanks. | 16:03 |
pabelanger | 2018-09-26 15:56:30,317 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Current pool quota: {'compute': {'ram': inf, 'instances': 0, 'cores': inf}} | 16:03 |
pabelanger | don't know why that is 0 | 16:03 |
pabelanger | mnaser: the instances you deleted, were they in ERROR state? | 16:04 |
*** jamesmcarthur has quit IRC | 16:04 | |
*** jamesmcarthur has joined #openstack-infra | 16:05 | |
clarkb | pabelanger: it is calculating min(quota, max-servers) - number of servers running in nodepool | 16:07 |
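A minimal sketch of the arithmetic described here (not nodepool's actual implementation); the numbers are illustrative, but with 46 nodes already counted against the pool the remaining instance quota works out to 0, matching the `'instances': 0` line pasted above.

```python
def remaining_instances(cloud_quota, max_servers, servers_in_pool):
    # Usable headroom for a pool: min(cloud quota, max-servers) minus the
    # servers nodepool already counts against this pool.
    return min(cloud_quota, max_servers) - servers_in_pool

# Illustrative numbers only: 46 nodes recorded as building against a
# max-servers of 46 leaves nothing, hence "Not enough quota remaining".
print(remaining_instances(cloud_quota=50, max_servers=46, servers_in_pool=46))  # -> 0
```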
*** sthussey has quit IRC | 16:07 | |
frickler | jroll: seems your request got overlooked, done now | 16:08 |
*** dtantsur is now known as dtantsur|afk | 16:08 | |
pabelanger | clarkb: okay, nodepool thinks there are 46 nodes building | 16:09 |
pabelanger | let me check openstack api | 16:09 |
*** ramishra has quit IRC | 16:09 | |
*** noama has quit IRC | 16:10 | |
pabelanger | clarkb: mnaser: okay, I see movement now. I think we just needed to wait for the launch-timeout to trigger | 16:12 |
clarkb | amorin: if you are around, we've noticed that we appear to leak neutron ports in ovh bhs1 now. Manually deleting them appears to work. Thought you may find this information useful as you continue to operate newer openstack | 16:14 |
clarkb | amorin: let us know if we can provide additional information to help debug or understand the problem | 16:14 |
fungi | amorin: from our graphs, it looks like the leak may have started around 08:30 utc | 16:16 |
*** Emine has quit IRC | 16:16 | |
fungi | down port deletion is slowing considerably... i have a feeling we're continuing to leak new ports as we start to be able to boot new nodes after i delete previous ones | 16:17 |
*** florianf is now known as florianf|afk | 16:17 | |
clarkb | fungi: would not surprise me | 16:17 |
pabelanger | clarkb: mnaser: I see jobs running in sjc1 now | 16:17 |
pabelanger | thanks! | 16:17 |
clarkb | ssbarnea: we don't maintain stackalytics, it is a third party service | 16:18 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: WIP Extract pep8 messages for inline comments https://review.openstack.org/589634 | 16:18 |
clarkb | I do not know why I have approval rights to that repo, but it isn't something we support | 16:18 |
clarkb | fungi: I hate to suggest it but we could add a leaked port cleaner like we do for floating ips to nodepool | 16:19 |
*** tpsilva has joined #openstack-infra | 16:19 | |
clarkb | this specific type of problem seems important enough to want the clouds to address it though | 16:19 |
fungi | yeah, i've hit the inflection point in bhs1 now where the count of down ports is rising faster than i'm deleting them | 16:20 |
*** kopecmartin|ruck is now known as kopecmartin|off | 16:20 | |
pabelanger | openstack.exceptions.SDKException: Error in creating the server: Build of instance 305569df-0b6b-4da8-9e85-a2c2273e34a5 aborted: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota. Requested 80G, quota is 5120G and 5120G has been consumed. | 16:20 |
pabelanger | mnaser: think we might have leaked volumes ^ | 16:20 |
*** jpich has quit IRC | 16:22 | |
ssbarnea | clarkb: infra-core is still a member of stackalytics-core which makes me believe it is able to perform reviews in the absence of the main maintainers, right? | 16:23 |
clarkb | ssbarnea: yes I have +2 on the repo. I have no desire to use it | 16:23 |
clarkb | it isn't a service or a repo we support | 16:23 |
AJaeger | clarkb: infra-core is now in stackalytics group ;( | 16:26 |
clarkb | fungi: rereading ovh bhs1 graphs I don't think the port issue is the only problem? we don't seem to have ever transitioned to in use nodes | 16:27 |
*** che-arne has joined #openstack-infra | 16:27 | |
* clarkb looks at logs again | 16:27 | |
AJaeger | sorry for duplicate - too slow in reading scrollback | 16:27 |
clarkb | fungi: or maybe we never got it below the 300 magic number? | 16:27 |
fungi | possible | 16:28 |
jroll | frickler: thanks! | 16:28 |
ssbarnea | clarkb: sure. i was curious more about the process in general. what happens when a project has only one or two people in the core team and none of them is available any more? may not be the case here, but should CRs be stalling forever? | 16:28 |
fungi | clarkb: i mean, my delete loop is still going but we're back up above 300 again as of a few minutes ago | 16:28 |
clarkb | Error in creating the server: Exceeded maximum number of retries. Exceeded max scheduling attempts 6 for instance 8f95c8e4-e70a-4744-b87c-ae7c6cdc57cd. Last exception: Maximum number of ports exceeded | 16:29 |
clarkb | seems we transitioned to that after some time | 16:29 |
fungi | ssbarnea: if it's an official openstack project the answer is different than for an unofficial project | 16:29 |
*** Bhujay has quit IRC | 16:29 | |
fungi | for official openstack projects the team in charge of it can be required by the tc to add more reviewers or retire that deliverable | 16:30 |
fungi | as stackalytics is not and never was an official openstack project, the tc has no jurisdiction over it | 16:30 |
ssbarnea | indeed, i was referring more to unofficial/side/infra projects because this is the only place where I encounter this dilemma; on official ones it is (kinda) easy to find someone. | 16:30 |
clarkb | ssbarnea: in this case there have been multiple discussions of various groups taking it over then no one does | 16:31 |
ssbarnea | haha, indeed, a good description of reality :D | 16:31 |
clarkb | it was a mirantis project and continues to be one as far as I know | 16:32 |
clarkb | if there is renewed interest in taking it over I would reach out to the previous maintainers and see if they will hand over the reins | 16:32 |
dmsimard | mnaser: looks like we're running some nodes on sjc1 again, thanks | 16:32 |
ssbarnea | i will send another email to the two committers, at some point they will find my email, hopefully. | 16:33 |
fungi | we've had contributors in the past interested in (and even getting pretty far on) making various improvements to stackalytics like fixing the persistent analysis store so it doesn't go offline for hours when you need to restart it or ripping out the affiliation static config in favor of querying the foundation profile api | 16:33 |
fungi | but the team in charge of the project and running the server for it seem to lack the bandwidth for such contribution (or even to discuss it) | 16:34 |
*** panda|bbl is now known as panda | 16:35 | |
clarkb | fungi: I added nl04 to the emergency file | 16:35 |
corvus | a minor correction: *infra* projects *are* official openstack projects :) | 16:35 |
clarkb | puppet just ran on it ~4 minutes ago so I think I am safe to go ahead and set max-servers to zero in bhs1 | 16:35 |
clarkb | then we can rerun the port cleanup. Set max-servers to ~5 and see if it ends up in a happier place or not | 16:36 |
ssbarnea | corvus: with the side note that not everything under infra/ in gerrit is an infra project :D | 16:36 |
fungi | well, not everything under openstack/ in gerrit is an official openstack project either | 16:36 |
clarkb | fungi: ok max-servers is set to 0 | 16:36 |
ssbarnea | now, I want to create a new infra project, probably named openstack-helpers, that would be used to host multiple greasemonkey scripts that are useful for openstack devs. i did read the (very) long docs page but it is not clear to me who creates the repository in the first place. | 16:36 |
fungi | ssbarnea: a script creates the repository when it sees a new entry appear in the gerrit/projects.yaml file in openstack-infra/project-config | 16:37 |
fungi | automation | 16:37 |
clarkb | I'm going to grab breakfast while waiting on that port cleanup to happen | 16:37 |
clarkb | fungi: ^ assuming it is still running? | 16:38 |
fungi | it just finished. i can start another pass | 16:38 |
clarkb | please do | 16:38 |
openstackgerrit | James E. Blair proposed openstack-infra/project-config master: Add zone-opendev.org project https://review.openstack.org/605095 | 16:38 |
fungi | we're currently at 295 leaked, so it may be cleaning itself up? | 16:38 |
*** jamesmcarthur has quit IRC | 16:38 | |
clarkb | fungi: do we want to wait ~10 minutes and see if that number moves? | 16:39 |
fungi | i'll give it a few, yes | 16:39 |
clarkb | max-servers is set to 0 so any servers we are using should clean up (and their ports too I hope) | 16:39 |
ssbarnea | fungi: thanks for this magic hint! i made a note. Before making the CR, does anyone have something against using the "openstack-helpers" name? I could go for "monkeys" if you don't like it. | 16:39 |
fungi | and then start deleting if it doesn't seem to be doing the cleanup on its own | 16:39 |
fungi | openstack-monkeys seems mildly offensive | 16:40 |
fungi | (or could be taken that way) | 16:40 |
*** trown is now known as trown|lunch | 16:40 | |
AJaeger | ssbarnea: why not remove openstack from the name? Or use "os"? | 16:40 |
mordred | os-greasemonkey would be descriptive | 16:41 |
ssbarnea | i do not want to use the full "grease*" name because it refers to a specific extension and there are multiple ones: tampermonkey, greasemonkey, and even others with different names. | 16:42 |
mordred | I did not know that - nod | 16:42 |
ssbarnea | that is why i was considering "helpers" as a more multi-purpose name to use that is not directly linked to a browser extension. | 16:42 |
fungi | browser-helpers? | 16:43 |
fungi | or are they useful outside a web browser context? | 16:43 |
*** e0ne has quit IRC | 16:43 | |
*** jamesmcarthur has joined #openstack-infra | 16:44 | |
ssbarnea | fungi: yep, for the moment web-helpers would be ok but i have some regex patterns inside that are used to highlight console logs. Still, in the future I want to also make a CLI tool to parse logs, one that uses the same patterns as the web extension ... this would make the "web" part confusing in the repo name. | 16:45 |
ssbarnea | i find the inclusion of web in the repo name ... limiting. | 16:45 |
*** dave-mccowan has quit IRC | 16:49 | |
*** rkukura has joined #openstack-infra | 16:50 | |
corvus | you could call it "ssbarnea's scripts, browser, and regex network emporium area". maybe abbreviate that somehow. | 16:50 |
fungi | yeah, useful names for catch-all repos are hard to come up with | 16:50 |
*** evrardjp has quit IRC | 16:50 | |
*** ginopc has quit IRC | 16:51 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate openstack* to github https://review.openstack.org/605486 | 16:53 |
*** olivierb has quit IRC | 16:54 | |
*** gfidente has quit IRC | 16:55 | |
mordred | corvus, fungi: I'm pretty sure that's all we need to let the zone-opendev.org be in opendev/ instead of openstack-infra, should we choose to do such a thing | 16:56 |
corvus | mordred: sounds reasonable; clarkb, mordred, fungi: should i rework that back to opendev/ ? or leave it to be (presumably) moved with the rest? | 16:57 |
fungi | reviewing | 16:57 |
mordred | https://gerrit.googlesource.com/plugins/replication/+doc/master/src/main/resources/Documentation/config.md <-- is what I was looking at - search for "remote.NAME.projects" | 16:58 |
mordred | we could alternately do [ | 16:58 |
mordred | we could alternately do ['openstack/*', 'openstack-dev/*', 'openstack-infra/*'] to be clearer | 16:59 |
clarkb | should double check against our gerrit docs | 16:59 |
*** rkukura has quit IRC | 16:59 | |
mordred | clarkb: I tried looking at the docs for our gerrit - but with them having been moved to plugins ... | 16:59 |
pabelanger | clarkb: I've deleted the leaked volumes in sjc1, but think mnaser will need to debug the openstack side, some are stuck deleting. Think I'll work on nodepool at ansiblefest to also try and clean up leaked volumes, I can see there is metadata in the volumes for nodepool_build_id | 16:59 |
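A hedged sketch of the kind of volume cleanup described here: list volumes, look for nodepool metadata (the nodepool_build_id key mentioned above), and flag unattached "available" ones as leak candidates. The cloud name is an assumption and the actual delete is left commented out.

```python
import openstack

# Cloud name is an assumption for the vexxhost sjc1 region.
conn = openstack.connect(cloud='vexxhost-sjc1')

for volume in conn.block_storage.volumes(details=True):
    meta = volume.metadata or {}
    # A volume carrying nodepool metadata that is available and unattached
    # is a likely leak candidate.
    if 'nodepool_build_id' in meta and volume.status == 'available' and not volume.attachments:
        print('possible leaked volume %s (build %s)' % (volume.id, meta['nodepool_build_id']))
        # conn.block_storage.delete_volume(volume)  # uncomment to actually delete
```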
mordred | clarkb: I got to that doc from https://review.openstack.org/Documentation/config-plugins.html#replication | 16:59 |
clarkb | ah | 16:59 |
mordred | clarkb: https://gerrit.googlesource.com/plugins/replication/+/stable-2.13/src/main/resources/Documentation/config.md | 17:01 |
mordred | there we go - there's the 2.13 docs - and it saysthe same thing | 17:01 |
*** derekh has quit IRC | 17:01 | |
fungi | so the leaked ports in bhs1 are still dropping, but slower than when i was deleting them i think. i'll go ahead and augment it with a loop of explicit deletes to speed things up further there | 17:01 |
fungi | we're down to 288 leaked so far | 17:01 |
mordred | fungi: glorious | 17:02 |
*** fuentess has joined #openstack-infra | 17:03 | |
mordred | clarkb: so I think if you're ok with that - it's got 2x+2 | 17:06 |
clarkb | do we want to test it on review-dev first? will require a gerrit restart too iirc | 17:07 |
mordred | clarkb: good points both of those | 17:07 |
mordred | clarkb: lemme make a review-dev patch | 17:07 |
*** jpena is now known as jpena|off | 17:09 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate openstack namespaces to github https://review.openstack.org/605486 | 17:11 |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate gtest-org and kdc https://review.openstack.org/605490 | 17:11 |
*** slaweq has joined #openstack-infra | 17:11 | |
mordred | clarkb, corvus, fungi: ^^ there ya go | 17:11 |
*** slaweq has quit IRC | 17:15 | |
*** dpawlik has joined #openstack-infra | 17:16 | |
*** dpawlik has quit IRC | 17:16 | |
*** jamesmcarthur has quit IRC | 17:16 | |
*** dpawlik has joined #openstack-infra | 17:17 | |
clarkb | fungi: only down to 212 ports so far | 17:18 |
fungi | yeah, but also hitting some like | 17:19 |
fungi | Failed to delete port with name or ID '3c864749-1664-4af9-8aab-d6dacaba24a4': HttpException: 504: Server Error for url: https://network.compute.bhs1.cloud.ovh.net/v2.0/ports/3c864749-1664-4af9-8aab-d6dacaba24a4, <html><body><h1>504 Gateway Time-out</h1>The server didn't respond in time.</body></html> 1 of 1 ports failed to delete. | 17:19 |
*** shardy has quit IRC | 17:20 | |
clarkb | I would not be surprised if that is part of why we are leaking them in the first place | 17:20 |
clarkb | nova asks neutron to delete them, neutron 504s, and server is deleted now we have a leaked port | 17:20 |
fungi | sounds remarkably familiar | 17:20 |
* fungi has an overwhelming sense of deja vu | 17:21 | |
*** jamesmcarthur has joined #openstack-infra | 17:21 | |
*** harlowja has joined #openstack-infra | 17:23 | |
*** rkukura has joined #openstack-infra | 17:26 | |
clarkb | ya I want to say we saw this issue with the gate and tempest | 17:28 |
clarkb | and initially a lot of the blame was pointed at the apache proxy that was terminating tls | 17:28 |
clarkb | I don't know if it was ever fixed though | 17:28 |
clarkb | oh actually it was a client thing with holding connections open | 17:28 |
fungi | down to 205 leaked ports in ovh-bhs1 now but it's been stuck there for a few minutes | 17:28 |
clarkb | apache by default allows for connections to be reused | 17:29 |
clarkb | python requests is buggy in the situation where the server closes a connection but the client races trying to reuse it for a new request | 17:29 |
clarkb | we fixed it by telling requests to use a new connection each request iirc | 17:29 |
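For context, one common way to make python-requests open a fresh connection per request is to send a `Connection: close` header, as sketched below; this illustrates the general workaround rather than the exact fix that was applied at the time.

```python
import requests

# Ask both ends not to keep the connection alive, so every request opens a
# new TCP connection instead of racing to reuse one the server may have
# already timed out.
session = requests.Session()
session.headers['Connection'] = 'close'

resp = session.get('https://example.com/')
print(resp.status_code)
```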
clarkb | fungi: now 204 | 17:30 |
fungi | indeed | 17:30 |
clarkb | and now 203. Not very quick. That could also explain the leaks (not being able to delete fast enough) | 17:33 |
fungi | yeah, i'm not seeing too many timeouts | 17:34 |
mordred | clarkb: amusingly enough I recently set session.keep_alive = False in the openstacksdk functional test suite because of tons of log spam due to "dropped connection, retrying" | 17:34 |
fungi | i have a feeling it's more that we're recycling instance quota faster than the neutron backend can clean up | 17:34 |
clarkb | mordred: ya python requests isn't great about handling when those keep-alive connections are killed due to timeouts | 17:35 |
clarkb | fungi: could be | 17:36 |
*** annp has quit IRC | 17:39 | |
fungi | getting soooo slooooow | 17:41 |
fungi | 200 now | 17:41 |
fungi | only hit 5 timeouts so far | 17:42 |
pabelanger | is it a load issue on nodes? | 17:42 |
AJaeger | config-core, two quick cleanup reviews, please: https://review.openstack.org/605076 https://review.openstack.org/605344 | 17:42 |
fungi | pabelanger: not sure what you're asking about | 17:42 |
pabelanger | we had a neutron issue in infra-cloud when CPU was pinned converting qcow2 images to raw | 17:42 |
pabelanger | fungi: sorry, just jumping in, was asking if the requests were slow due to the remote node not responding fast enough | 17:43 |
fungi | and that caused port deletion to be slow? | 17:43 |
pabelanger | fungi: creation | 17:43 |
fungi | this is just leaked ports. trying to remove them | 17:43 |
pabelanger | VMs would timeout on network getting created, and fail to boot because of it | 17:43 |
*** trown|lunch is now known as trown | 17:44 | |
clarkb | hand-wavy guess: similar to how we have to force dhcp in OVH because the neutron config isn't actually meant to be used, when cleaning up network related resources neutron is talking to something external to do the cleanups and this is slow | 17:46 |
fungi | yeah, probably | 17:47 |
*** dpawlik has quit IRC | 17:47 | |
clarkb | nova/neutron may actually have all of the port deletes queued up they just don't happen very fast | 17:49 |
clarkb | then on top of that somehow we ended up above the quota limit of 300 so when things caught up a little we still weren't able to boot new instances | 17:50 |
clarkb | if we can get it to zero we can bump max-servers to say 5 and see if we leak again | 17:51 |
fungi | yeah, it's just not happening any time soon | 17:53 |
fungi | 196 ports left to delete | 17:53 |
Shrews | that's, like, painfully slow | 17:54 |
fungi | we're up to 9 deletion timeouts now | 17:56 |
*** electrofelix has quit IRC | 17:56 | |
clarkb | out of ~100 ? | 17:56 |
openstackgerrit | Merged openstack-infra/project-config master: Move glare legacy jobs in-repo https://review.openstack.org/605076 | 17:57 |
fungi | yeah, something like that | 17:57 |
fungi | so maybe 10% | 17:57 |
mnaser | http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-3h&to=now fyi i notice only 15 nodes used in sjc1? do we know why? | 18:03 |
*** jcoufal has quit IRC | 18:03 | |
clarkb | mnaser: possibly the volume leaks pabelanger was talking about? | 18:04 |
openstackgerrit | sebastian marcet proposed openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow https://review.openstack.org/605497 | 18:04 |
clarkb | mnaser: apparently some of them are stuck in deleting | 18:04 |
mnaser | oh | 18:04 |
mnaser | let me see | 18:04 |
mnaser | deleting seems to fluctuate up and down | 18:04 |
mnaser | "Requested 80G, quota is 5120G and 5120G has been consumed." ok cool let me investigate | 18:04 |
mnaser | ok, let me see | 18:05 |
*** graphene has quit IRC | 18:07 | |
openstackgerrit | Merged openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow https://review.openstack.org/605497 | 18:10 |
* mnaser is in a call, will review shortly | 18:10 | |
*** slaweq has joined #openstack-infra | 18:12 | |
clarkb | fungi: now down to 76 (seems like it sped up) | 18:13 |
clarkb | 68 | 18:13 |
fungi | weird | 18:16 |
fungi | i wonder if they're working on it | 18:16 |
fungi | 33 now | 18:16 |
*** diablo_rojo has joined #openstack-infra | 18:18 | |
fungi | and now 0 | 18:19 |
mordred | yay! | 18:19 |
fungi | i guess let's try to start ramping it back up again, but i have a feeling the deletion speedup coincided with them fixing some problem in their backend | 18:19 |
clarkb | ok I'm going to set max-servers to 5 | 18:20 |
prometheanfire | lol | 18:20 |
clarkb | and we can watch if we leak from there | 18:20 |
*** dpawlik has joined #openstack-infra | 18:21 | |
fungi | wfm | 18:21 |
*** slaweq has quit IRC | 18:22 | |
clarkb | Shrews: following up on zuul potentially leaking nodes in nodepool (locked for ~14 days) any idea on how to clear those out? | 18:22 |
clarkb | Shrews: 0001950609 0001975058 0001975067 are the three nodes if you want to take a look | 18:22 |
Shrews | clarkb: would require a zuul restart to delete the locks | 18:23 |
Shrews | clarkb: but let me poke around zk a bit | 18:23 |
clarkb | Shrews: could we manually delete the lock? | 18:24 |
*** pcaruana has quit IRC | 18:24 | |
Shrews | clarkb: we'd have to manually delete zk nodes. i'm hesitant to do that | 18:24 |
*** lbragstad has quit IRC | 18:24 | |
clarkb | Shrews: ok | 18:25 |
Shrews | but i guess we could. they'd be seen as leaked nodes to nodepool. not sure what the zuul impact would be | 18:25 |
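A hedged sketch of how one might inspect (not delete) the lock znodes being discussed, using kazoo; the ZooKeeper host and the `/nodepool/nodes/<id>/lock` path layout are assumptions and should be verified against the running cluster before acting on anything.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk01.openstack.org:2181')  # hostname is illustrative
zk.start()

for node_id in ('0001950609', '0001975058', '0001975067'):
    lock_path = '/nodepool/nodes/%s/lock' % node_id  # assumed layout
    if zk.exists(lock_path):
        # The children of the lock znode are the current lock contenders.
        print(node_id, 'lock held by', zk.get_children(lock_path))
    else:
        print(node_id, 'is not locked')

zk.stop()
```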
clarkb | fungi: due to the 3 ready nodes that zuul has locked (leaked above) max-servers 5 really means only 2 new instances. I am going to bump to 8 to get 5 | 18:25 |
*** dpawlik has quit IRC | 18:25 | |
*** lbragstad has joined #openstack-infra | 18:25 | |
fungi | k | 18:26 |
*** yamamoto has quit IRC | 18:26 | |
*** yamamoto has joined #openstack-infra | 18:26 | |
Shrews | clarkb: confirmed zuul is holding the locks :( | 18:31 |
clarkb | Shrews: zuul should handle if the znode goes away though right? | 18:35 |
clarkb | | fault | {'message': 'Build of instance 960aff55-3795-43cb-ad73-e58816444355 aborted: Failed to allocate the network(s), not rescheduling.', 'code': 500, 'created': '2018-09-26T18:36:13Z'} one of the nodes failed to build | 18:37 |
clarkb | seems potentially related to our inability to clean up ports | 18:37 |
Shrews | clarkb: i'm not sure how zuul would handle that | 18:41 |
Shrews | we expect some zk nodes to disappear, but a Node isn't one of them | 18:41 |
Shrews | good question for corvus | 18:42 |
mordred | corvus knows everything | 18:42 |
*** jamesmcarthur has quit IRC | 18:42 | |
Shrews | we need a zuul restart at some point anyway for some fixes mentioned earlier (yesterday?). perhaps we should just schedule a time for that | 18:44 |
Shrews | that sql optimization at least | 18:45 |
clarkb | that probably depends on whether or not we will declare bankruptcy on the backlog or not | 18:45 |
clarkb | fungi: fwiw it seems that some nodes error as above and others are just really slow to boot. None have successfully booted yet | 18:46 |
corvus | with those node ids, we can trace them in the zuul logs and figure out why they leaked | 18:46 |
corvus | that should not preclude us restarting either nodepool or zuul whenever we wish | 18:47 |
*** jamesmcarthur has joined #openstack-infra | 18:47 | |
corvus | however, i'm in need of a sandwich so am not going to trace them now :) | 18:47 |
clarkb | fungi: I'll let it go a little longer but the oldest node i was watching was deleted by nodepool. Doesn't appear to have errored just taken too long | 18:50 |
fungi | okay, so may be that bhs1 is still just plain unusable at the moment | 18:50 |
clarkb | http://logs.openstack.org/25/604925/2/check/system-config-run-base/3a474c9/job-output.txt.gz#_2018-09-25_01_31_28_665890 is a fun ci bug. I think that happens because I am trying to change the uid of the zuul user on the test nodes | 18:54 |
clarkb | mordred: corvus ^ thoughts on creating a zuulcd user instead? | 18:54 |
mordred | clarkb: oh. HAH | 18:54 |
clarkb | then zuul the test node user can coexist with zuulcd the cd user | 18:54 |
mordred | clarkb: or else - maybe make the creation use the same uid for nodepool and prod and make the creation idempotent? | 18:54 |
mordred | I haven't thought long enough to have thoughts of whether that's a bad idea or not | 18:55 |
clarkb | mordred: ya I think we could set the uid to 1000? I'm not sure how difficult it will be to keep that in sync over time | 18:55 |
*** jcoufal has joined #openstack-infra | 18:55 | |
clarkb | (I did the last uid + 1 process for normal users which resulted in 203X) | 18:56 |
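A minimal sketch of the uid-pinning option (the zuulcd rename would sidestep the collision entirely); uid 1000 is the value mentioned above for the test images, and the existence check is only there to keep the creation idempotent:

```bash
if ! getent passwd zuul >/dev/null; then
  sudo useradd --uid 1000 --create-home --shell /bin/bash zuul
elif [ "$(id -u zuul)" != "1000" ]; then
  # roughly the situation in the CI failure linked above: the user already
  # exists with a different uid, so don't try to change it in place
  echo "zuul already exists with uid $(id -u zuul)" >&2
fi
```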
clarkb | fungi: ya I'm going to set max-servers back to 0, this isn't making any progress | 18:57 |
*** smarcet has quit IRC | 18:58 | |
clarkb | the gate just reset again | 18:58 |
clarkb | I think I need to step away from the computer for a bit | 18:58 |
* fungi is struggling to whittle down a forum session proposal to fit in the requisite 1k character matchbox | 18:59 | |
clarkb | fungi: you get three of those boxes :) | 19:00 |
fungi | yeah, but only one goes on the schedule | 19:01 |
fungi | (i think?) | 19:01 |
mordred | fungi: "gonna talk about stuff" | 19:02 |
mordred | fungi: who needs more words than that? | 19:02 |
fungi | "i will wear a funny shirt and make people talk to each other" | 19:03 |
fungi | perfect! | 19:03 |
clarkb | "if this session gets enough upvotes I will wear the orange cantina shirt" | 19:04 |
*** rtjure has quit IRC | 19:05 | |
*** dave-mccowan has joined #openstack-infra | 19:05 | |
AJaeger | "And the green one if not" - give us a choice to vote for one ;) | 19:05 |
fungi | grr... i'm still 95 characters over | 19:06 |
AJaeger | Remove all punctations ;) | 19:07 |
clarkb | the down ports list in ovh bhs1 fell to 1 after setting max servers to 0 | 19:07 |
*** rtjure has joined #openstack-infra | 19:08 | |
clarkb | jaosorior: https://review.openstack.org/#/c/603419/ fyi that didn't pass (it also failed the rdo third party check) | 19:12 |
*** e0ne has joined #openstack-infra | 19:15 | |
*** jcoufal has quit IRC | 19:21 | |
*** Tim_ok has quit IRC | 19:24 | |
*** slaweq has joined #openstack-infra | 19:24 | |
*** Emine has joined #openstack-infra | 19:26 | |
fuentess | clarkb: hi, quick question? what is the best way to know if I'm running under a Zuul slave? is there any environment variable that I can check? | 19:27 |
*** e0ne has quit IRC | 19:30 | |
fungi | fuentess: jobs can set any environment variables they like when executing a shell or command process... can you be more specific about what you're trying to do and where? | 19:31 |
mordred | yah - there is a zuul ansible host_var your playbooks can do things with | 19:32 |
*** jamesmcarthur has quit IRC | 19:32 | |
mordred | but once you're out of ansible and into some shell that the ansible has spawned, that's all job specific | 19:32 |
fuentess | fungi, mordred: I will add a change in one of our scripts to run the cri-o tests using overlay (instead of devicemapper) when in a Zuul slave, so I want to check using something like: if [ "$ZUUL_CI" == true ] | 19:35 |
mordred | yeah - in order for that to work, we'd need to update the job that calls your script to set a variable like ZUUL_CI | 19:35 |
mordred | alternately, maybe we should just have the zuul job pass something like --overlay to the script? | 19:36 |
fuentess | adding it here, right? https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/kata-setup/tasks/main.yaml#n37 | 19:36 |
Shrews | corvus: attempted to trace node 0001975067 through zuul logs (debug.log.14.gz, btw). For that request (200-0006054530), I see that it gets completed, but the nodes are never set in-use (i.e., no "Setting nodeset ... in use" log entry) | 19:37 |
slaweq | clarkb: mordred: with big help from both of You I finally managed to do patch to migrate dvr multinode job in neutron to zuulv3 syntax: https://review.openstack.org/#/c/578796/ - thx a lot guys :) | 19:37 |
Shrews | corvus: i'm a bit stumped as to why | 19:37 |
mordred | fuentess: yes- and it looks like CI=true is already being set there | 19:37 |
mordred | fuentess: so you could just add another line just like that | 19:38 |
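On the script side, the check fuentess describes would look something like this (ZUUL_CI is the variable name proposed above and only works if the kata-setup role exports it next to the existing CI=true; the storage_driver variable is purely illustrative):

```bash
if [ "${ZUUL_CI:-false}" = "true" ]; then
    storage_driver="overlay"
else
    storage_driver="devicemapper"
fi
echo "cri-o tests will use storage driver: ${storage_driver}"
```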
mordred | slaweq: \o/ woohoo! | 19:38 |
fuentess | mordred: cool, I am not sure if I can submit changes there... or how can I do it, any guidance? | 19:39 |
mordred | fuentess: you totally can - have you submitted patches to the openstack gerrit before, or will this be your first one? | 19:40 |
fuentess | mordred: this will be my first one | 19:40 |
*** jamesmcarthur has joined #openstack-infra | 19:41 | |
mordred | fuentess: excellent. well - we have a doc here: https://docs.openstack.org/infra/manual/developers.html#accout-setup ... I don't think we require signing the CLA for infra projects (do we clarkb fungi?) so you can probably skip that part | 19:42 |
fungi | checking on ozj there | 19:42 |
mordred | fuentess: tl;dr is "make sure you have a launchpad/ubuntu sso account", "log in to review.openstack.org", "add your ssh key to your profile in gerrit", "pip install git-review" ... then in a git clone of openstack-zuul-jobs, once you've made your commit, run "git review" | 19:43 |
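The same tl;dr as commands, for reference (the account setup in the gerrit web UI still comes first, per the doc):

```bash
pip install --user git-review
git clone https://git.openstack.org/openstack-infra/openstack-zuul-jobs
cd openstack-zuul-jobs
# edit roles/kata-setup/tasks/main.yaml, commit, then push the review
git commit -a
git review
```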
fuentess | mordred: great, thanks, I'll follow the doc | 19:43 |
mordred | fuentess: but the doc is more complete than messages from me in IRC :) | 19:43 |
fungi | https://review.openstack.org/#/admin/projects/openstack-infra/openstack-zuul-jobs says "Require a valid contributor agreement to upload: INHERIT (false)" so no cla required | 19:44 |
mordred | yay! | 19:44 |
fuentess | great, thanks | 19:44 |
mordred | let us know if you have any issues - the initial account setup is more onerous than we'd like, but such is the world we live in | 19:44 |
fungi | things we're (painfully slowly it seems like) changing for the better over time | 19:45 |
*** jamesmcarthur has quit IRC | 19:46 | |
*** jamesmcarthur has joined #openstack-infra | 19:48 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 19:49 |
corvus | Shrews: thanks, i'll start there and see if i can trace further | 19:50 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: web: add tenant and project scoped, JWT-protected actions https://review.openstack.org/576907 | 19:51 |
*** jamesmcarthur has quit IRC | 19:52 | |
mnaser | infra-root: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-1h&to=now&var-region=All is this starting to look..healthy? | 19:52 |
mnaser | it looks like for some reason a lot of instances accumulate as ready and then all get used up at once | 19:52 |
*** ijw has joined #openstack-infra | 19:53 | |
*** jamesmcarthur has joined #openstack-infra | 19:55 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 19:57 |
corvus | clarkb, Shrews: the job that node request was for (openstack-ansible-ironic-ssl-nv) was removed between the time the node request was issued and fulfilled: http://git.openstack.org/cgit/openstack/openstack-ansible-os_ironic/commit/?id=9a2f843dc15750cddba10db73afe381ee3785250 | 20:01 |
corvus | i think that's our smoking gun :) | 20:01 |
Shrews | corvus: w00t | 20:01 |
clarkb | mnaser: I think that may be lag induced by the zuul executors throttling themselves. You can compare with the active executor count on the zuul status page | 20:02 |
* mnaser clarkb: so it looks like we're at 90ish in-use which seems to tell me things are healthy | 20:02 | |
mnaser | that shouldn't have been a /me but okay | 20:03 |
mnaser | sorry for the noise/annoyance/problems/etc | 20:03 |
corvus | that behavior could also be zuul doing a bunch of reconfigs, or a gate reset | 20:03 |
clarkb | mnaser: was that a one off thing? eg we don't need nodepool to manage nodes/volumes better? | 20:04 |
mnaser | clarkb: yes, that was a one-off at our side thing, but i noticed that when i deleted the vms from under nodepool it was a bit confused | 20:04 |
mnaser | not sure what was done there | 20:04 |
*** agopi has quit IRC | 20:06 | |
clarkb | mnaser: I think pabelanger mentioned that it ended up waiting for the api request to time out or similar | 20:07 |
*** bnemec has quit IRC | 20:10 | |
*** evrardjp has joined #openstack-infra | 20:11 | |
clarkb | looks like inap might also be unhappy (though less unhappy than some of the other clouds) | 20:14 |
clarkb | (a lot of deleting nodes there and higher error rate in recent hours) | 20:15 |
*** bnemec has joined #openstack-infra | 20:15 | |
mgagne | clarkb: how can I make it happy? | 20:15 |
clarkb | mgagne: reading the nodepool logs the errors appear to be "timeout waiting for node to delete" | 20:17 |
mgagne | clarkb: could it be that same issue we had a couple of months ago? | 20:17 |
clarkb | mgagne: I think nodepool is not booting new nodes until it frees up capacity (so slow deletes are preventing new boots) | 20:17 |
clarkb | mgagne: you'll have to remind me what that was (sorry) | 20:18 |
mgagne | maybe I should have documented that SQL query somewhere.... | 20:18 |
clarkb | nodepool.exceptions.ServerDeleteException: Timeout waiting for server 3cc96068-e398-4c2c-a908-ebea018af044 deletion | 20:18 |
clarkb | is the full message and includes an instance uuid | 20:18 |
clarkb | if that helps | 20:18 |
mgagne | clarkb: I think delete task gets killed by restarting conductor or something. So you can't delete it again because database says it's already happening somewhere. | 20:19 |
mgagne | but it's stuck in BUILD too so might be something related to build time being higher than expected. | 20:20 |
mgagne | I'll see what I can do to unstick that mess ;) | 20:21 |
mgagne | I see that some instances are successfully deleted without me taking any actions. | 20:22 |
clarkb | mgagne: so maybe it is just slowness? | 20:23 |
mgagne | I haven't figured out what's going on. I know that some stuck instances are getting deleted. | 20:24 |
pabelanger | clarkb: mnaser: I don't think it is zuul-executors, our executor queue looks healthy: http://grafana.openstack.org/dashboard/db/zuul-status | 20:24 |
mgagne | what I would *really* love is a way to know what tasks are running on conductor or compute =) | 20:25 |
clarkb | mnaser: pabelanger http://paste.openstack.org/show/730964/ is a list of volume ids that claim to be attached to those server ids. At last check none of those server ids actually exist | 20:27 |
clarkb | mnaser: pabelanger I think that may be part of our leaked volume story | 20:27 |
clarkb | the oldest volume is from about two weeks ago and the newest from an hour ago | 20:27 |
*** smarcet has joined #openstack-infra | 20:27 | |
clarkb | mnaser: pabelanger I haven't tried to delete any of them yet but I figure that is my next step if mnaser doesn't say otherwise | 20:27 |
pabelanger | clarkb: Hmm, it is possible I may have deleted those volumes. I cleaned up some an hour so ago | 20:28 |
pabelanger | let me check history | 20:28 |
clarkb | pabelanger: well they are still there if you tried to delete them :) | 20:28 |
pabelanger | clarkb: the ones I deleted were available, but unattached | 20:28 |
clarkb | these all claim to be attached to those servers but those servers do not exist | 20:29 |
*** smarcet has quit IRC | 20:29 | |
*** smarcet has joined #openstack-infra | 20:30 | |
*** smarcet has quit IRC | 20:30 | |
mgagne | so I restarted a nova-compute and some instances are getting deleted on that node. | 20:30 |
pabelanger | clarkb: http://paste.openstack.org/show/730965/ | 20:30 |
pabelanger | that was a few hours ago when I started clean up | 20:30 |
pabelanger | 22f56d13-c67c-4aac-a6aa-cf58fe57b177 | 20:31 |
pabelanger | is in both pastebins | 20:31 |
clarkb | ya all of mine are those that look like attached to $uuid | 20:31 |
pabelanger | yah | 20:31 |
clarkb | there are a couple extras in mine too | 20:31 |
pabelanger | I'd expect nodepool to use hostname | 20:32 |
clarkb | pabelanger: it does, see the centos ones in your example | 20:32 |
clarkb | I think it shows uuid there because the instances don't exist (so it cannot look up the name) | 20:32 |
pabelanger | right | 20:32 |
pabelanger | think so too | 20:32 |
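The cross-check being described, roughly (a sketch; the openstackclient column names and the format of the attachment string are assumptions):

```bash
openstack volume list --long -f value -c ID -c "Attached to" \
  | while read -r volume attachment; do
      server=$(echo "$attachment" | grep -oE '[0-9a-f-]{36}' | head -1)
      [ -z "$server" ] && continue
      # if nova no longer knows the server, the volume is a leak candidate
      openstack server show "$server" >/dev/null 2>&1 \
        || echo "volume $volume attached to missing server $server"
    done
```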
*** priteau has quit IRC | 20:33 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix node leak on job removal https://review.openstack.org/605527 | 20:34 |
clarkb | mgagne: looks like in the last 10 minutes or so things may be improving (possibly related to that restart?) | 20:34 |
corvus | clarkb, Shrews: ^ fix for that leak | 20:34 |
fungi | oh, that's a fun bug | 20:34 |
mgagne | I restarted 1 stuck nova-compute service. Could it be it consumed all rpc workers on nova-conductor? ¯\_(ツ)_/¯ | 20:36 |
mgagne | I don't have much time in front of me. lets hope this improves the situation. | 20:36 |
*** hasharAway has quit IRC | 20:37 | |
clarkb | mgagne: thank you for looking | 20:38 |
mgagne | clarkb: will nodepool attempt to retry deletion if instance is now in error state? | 20:39 |
clarkb | mgagne: yes it should retry every 5 (or is it 15 minutes) | 20:39 |
mgagne | cool so restarting nova-compute should put the instance in error and nodepool can continue its cleanup from there. | 20:40 |
Shrews | corvus: sweet | 20:40 |
corvus | clarkb: there's definitely a 5 in there. but i think it's every 5 *seconds* now :) | 20:43 |
*** rkukura has quit IRC | 20:43 | |
clarkb | persistent | 20:43 |
fungi | somethingsomething5something | 20:43 |
clarkb | if anyone is wondering the tempest-slow job is really slow | 20:44 |
clarkb | but we might actually merge some changes shortly | 20:44 |
clarkb | corvus: http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-3h&to=now is it normal for zuul to not assign those ready nodes for this significant amount of time? | 20:45 |
clarkb | the zuul queue lengths are short | 20:46 |
clarkb | all 11 executors are online and accepting | 20:46 |
clarkb | this must be slowness in the scheduler? | 20:46 |
corvus | clarkb: yeah, if there's no executor queue, then it's the scheduler not getting around to dispatching the jobs | 20:46 |
corvus | the scheduler is not currently behind on work | 20:47 |
clarkb | oh wow refresh the graph | 20:48 |
clarkb | in the last minute or two almost all of those nodes went to in use | 20:48 |
corvus | i guess we looked just a bit too late | 20:48 |
pabelanger | could be zookeeper slowing down? | 20:48 |
pabelanger | (just a guess) | 20:48 |
Shrews | nodepool.o.o seems very idle | 20:49 |
corvus | pabelanger: based on what? | 20:49 |
Shrews | oh, wait. | 20:49 |
Shrews | java 170% cpu | 20:50 |
Shrews | neat | 20:50 |
Shrews | 320% | 20:50 |
pabelanger | corvus: there have been issues in base with nodepool not allocating nodes if zookeeper was laggy, based on comments SpamapS has made in past | 20:50 |
corvus | pabelanger: issues in base? | 20:50 |
pabelanger | sorry, past* | 20:51 |
corvus | pabelanger: the current issue is that nodepool allocated nodes and zuul did not immediately use them | 20:51 |
clarkb | spamaps problems were related to disk io I think. SpamapS and tobiash both run on top of tmpfs now | 20:51 |
pabelanger | corvus: right, that is what SpamapS said in the past | 20:51 |
corvus | Shrews: java, unlike python, is really good at using multiple processor | 20:51 |
corvus | s | 20:51 |
Shrews | iostat shows mostly idle disk | 20:51 |
clarkb | I'm booting a clarkb-test instance in bhs1 using the current xenial image. I want to see if it will ever start up without nodepool timing out on it | 20:54 |
ianw | clarkb: catching up ... lmn if i can help | 20:55 |
clarkb | ianw: mostly at a loss as to why bhs1 stopped working, but it's basically unusable for us now. Instances don't seem to boot at all and we leaked a bunch of ports prior to that | 20:55 |
clarkb | ianw: we think we have the ports cleaned up but now trying to see if anything will boot | 20:55 |
clarkb | ianw: likely our next step is to follow up with amorin during European hours tomorrow | 20:56 |
*** agopi has joined #openstack-infra | 20:56 | |
clarkb | other than that I think vexxhost is largely working again (though there are some weird volumes there that would be good to have mnaser glance at before we delete them) | 20:56 |
clarkb | http://paste.openstack.org/show/730964/ is those details | 20:56 |
clarkb | and mgagne just did a thing to make inap happier | 20:57 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 20:57 |
clarkb | ianw: other than general cloud capacity items like ^ I think the continuing issues are largely related to gate flakiness and the cost of resets | 20:58 |
clarkb | tripleo has a very deep gate queue and resets are common. Openstack integrated gate isn't quite so large but has also shown signs of flakiness | 20:58 |
*** trown is now known as trown|outtypewww | 20:59 | |
*** bobh has quit IRC | 20:59 | |
ianw | ok, thanks for the update :) | 20:59 |
*** colonwq has joined #openstack-infra | 20:59 | |
ianw | quick question; i started looking at graphite.o.o upgrade ... the puppet is very procedural (install stuff, template, start service). i've started looking at making it all ansible running on bionic ... see any particular issues with this? | 21:00 |
*** rkukura has joined #openstack-infra | 21:00 | |
corvus | ianw: maybe we can use containers? | 21:00 |
*** fuentess has quit IRC | 21:04 | |
ianw | corvus: yeah, i started looking at that; i don't know if i'm sold really ... say i make a job to make a statsd container, a carbon container, etc, then a playbook to plug it all together. what advantage do we have over just having a playbook putting this stuff on a host? | 21:04 |
ianw | and instead of unattended-upgrades, we now have to manage the dependencies in all those sub-containers | 21:04 |
corvus | ianw: it's a good question, but i think the spec addresses it. in short, os-independence. | 21:05 |
*** slaweq has quit IRC | 21:06 | |
corvus | Shrews, clarkb: i looked at node 0002342552 assigned to request 200-0006321477 which has been ready for 7 minutes in inap. the request is still 'pending'. so nodepool-launcher hasn't marked it fulfilled yet, even though the only node in the request is ready. | 21:07 |
corvus | that's on nl03 | 21:07 |
*** smarcet has joined #openstack-infra | 21:11 | |
fungi | is it really os-independent given that there is an os inside every container (albeit a hopefully minimal one)? | 21:11 |
corvus | Shrews, clarkb: nodepool is very busy there; perhaps this is thread starvation | 21:11 |
pabelanger | corvus: oh, interesting, we have 5 active providers there. Maybe getting close to splitting that out to another launcher? | 21:12 |
corvus | fungi: i should have said no more than "see the spec" | 21:12 |
clarkb | we could rebalance the providers. I think nl02 is particularly quiet recently | 21:12 |
fungi | heh | 21:12 |
mgagne | I really have to go. But to summarize, I restarted nova-compute on most of the compute nodes. (not all) It put "BUILD/deleting" instances in ERROR state so Nodepool could clean them up. Hopefully new deletions won't get stuck anymore. | 21:12 |
clarkb | mgagne: thanks, I think it did get things moving | 21:12 |
mgagne | +1 | 21:12 |
*** panda is now known as panda|off | 21:13 | |
fungi | cpu on nl01 looks pretty much pegged (much of that claimed for system) | 21:15 |
corvus | in the nicest way possible i'd like to convey the idea that i don't really enjoy having "container-vs-not-container" debates and i think it's important that we go with the consensus we achieved in the spec after much work and deliberation rather than having more container debates. if we truly want to re-open the discussion we had in the spec (without having even really begun on actual | 21:15 |
corvus | implementation) that's certainly a choice, but let's make that choice clearly. | 21:15 |
fungi | i'm good with the consensus, i just don't even know where to start with trying to containerize something | 21:16 |
clarkb | corvus: while I agree I think we also said in the spec we would move there gradually and build up the tooling to make this happen | 21:16 |
clarkb | and so maybe upgrading graphite can move in parallel to getting tooling in place to build container images so that we can deploy graphite with containers? | 21:16 |
fungi | like, i'd like to work out how to containerize mailman3 but i feel pretty out of my depth on container standards and paradigms to know what that should look like | 21:17 |
corvus | clarkb: true. i read that as "keep puppeting" rather than "convert puppet to ansible then maybe someone will container" | 21:17 |
*** jamesmcarthur has quit IRC | 21:18 | |
clarkb | fungi: the big step 0 which we've started for nodepool and zuul is building images in a repeatable manner to keep up with updates | 21:18 |
fungi | i'm still struggling to come to grips with ansible instead of puppet, but since we have people who want to do the ansible and container legwork i'm cool with the direction we settled on | 21:18 |
corvus | (and to be fair, many services will be easier to run in ansible rather than containers, especially if we're just running os packages. but graphite is a bunch of pypi packages, so seems like an ideal candidate for containers) | 21:19 |
*** jamesmcarthur has joined #openstack-infra | 21:19 | |
fungi | ahh, yeah, my main thought exercise was to try and figure out how to containerize gerrit and the various java libs it needs | 21:20 |
corvus | fungi: there's already a gerrit container image | 21:20 |
fungi | but maybe approaching python stuff first will be easier to grasp | 21:20 |
fungi | i thought we said reusing existing containers was a no-go and we were going to insist on building all ours from scratch instead? | 21:20 |
fungi | now i'm confused | 21:20 |
corvus | fungi: well, if that's the case, we at least presumably have a build script | 21:21 |
fungi | i probably misunderstood the answer when i asked that question earlier | 21:21 |
fungi | got the impression we didn't want to reuse anyone else's container build tooling | 21:21 |
fungi | but if it's just that we want to regenerate containers using existing tools provided by the upstream maintainers, that's not so tough to grasp | 21:22 |
clarkb | there is the build tooling and then the resulting images. I think we should reuse images if possible, but we'll likely have to see how reusable these images are | 21:22 |
corvus | anyway, i merely suggest that given the spec, perhaps leaving graphite as puppet or working on containerizing it may be better choices at the momemnt than translating the puppet to ansible | 21:23 |
corvus | ianw: thanks for asking :) | 21:23 |
ianw | say i make a job using buildah to produce an image that contains statsd (that's probably the very simplest case, no secrets, nodejs + statsd all required) | 21:23 |
ianw | as a first step, where does that image go after build? | 21:23 |
fungi | maybe also a terminology gap for me. when i hear "build tooling" i think the makefiles provided by upstream for installing things into a container image | 21:23 |
clarkb | ianw: dockerhub? | 21:24 |
fungi | i.e. the things we currently have pupet exec | 21:24 |
clarkb | fungi: there is also a severe case of NIH when it comes to container image build tools. Basically everyone has one | 21:24 |
clarkb | I think the spec suggests we start with dockerfiles and docker build because it is the most common tool | 21:25 |
fungi | so, like, would we still use pip for installing python-based packages into a container image or do we need to do something lower-level? is it okay to install pip temporarily into the container and then uninstall it before creating the image? | 21:26 |
fungi | do the images get created under a chroot on a loopback device and then copied? | 21:26 |
fungi | or do we tar up the chroot after it's tied off? | 21:27 |
fungi | i guess a dockerfile is like a makefile that contains the steps for writing things inside the chroot? | 21:28 |
clarkb | you would use pip or any other package manager (or make etc) as part of the image build to create the image contents. It is ok to have intermediate steps that don't show through to the final image (though if you want to keep image size down you have to take additional steps to clean those out) | 21:28 |
clarkb | fungi: yes | 21:28 |
clarkb | I'm not actually sure what docker uses as a transport format to and from dockerhub, likely something like a tarball | 21:29 |
ianw | hrm, there's an official graphite+statsd+carbon docker image | 21:29 |
ianw | https://github.com/graphite-project/docker-graphite-statsd#secure-the-django-admin | 21:29 |
fungi | as opposed to whatever the thing is that tells you how to stack multiple image layers? | 21:29 |
ianw | is a little worrying "First login at: http://localhost/account/login Then update the root user's profile at: http://localhost/admin/auth/user/1/" | 21:29 |
fungi | docker does layered images, right? | 21:29 |
fungi | or i guess we can just decide to squash them all into a single layer if we want? | 21:30 |
clarkb | fungi: yes it does layered images, that is one of the reasons that cleaning out the intermediate stuff you don't need in the end result can be tricky (e.g. pip being installed just to run pip install and then deleted afterwards; by default you'd keep the layer that had pip installed) | 21:31 |
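A small illustration of that point, assuming a Dockerfile-based build; the package choice is arbitrary:

```bash
cat > Dockerfile <<'EOF'
FROM debian:stretch-slim
# cleanup only helps if it happens in the same RUN that created the cruft;
# a later "RUN rm -rf /var/lib/apt/lists/*" would just add another layer on
# top of the one that still contains the apt cache
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
EOF
docker build -t layer-demo .
docker history layer-demo
```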
ianw | oh, i just thought of one big issue ... ipv6 | 21:31 |
fungi | why is ipv6 an issue? | 21:32 |
clarkb | shouldn't be with docker itself, k8s doesn't really support ipv6 yet though | 21:32 |
ianw | fungi: well, something like https://github.com/graphite-project/docker-graphite-statsd, which at first glance looks like a lot of work is done for you, well it's not really | 21:32 |
*** jamesmcarthur has quit IRC | 21:33 | |
clarkb | hrm my ubuntu-xenial test in bhs1 booted just fine looks like | 21:33 |
fungi | but also for web-based services we can still proxy from an apache running outside the container on the host server to the v4 loopback if we wanted, right? | 21:33 |
clarkb | let me try centos too | 21:33 |
clarkb | fungi: yes | 21:34 |
clarkb | fungi: possibly in another container | 21:34 |
ianw | fungi: if we're in a world where we're running things in containers for simplicity, and then running external apache to proxy ipv6 into containers instead of processes running on the host listening on ipv6 ... i'm not sure i'd count that as winning :/ | 21:35 |
fungi | well, it was suggested that we don't have to containerize all the things if it winds up being convenient to, say, run a gerrit container and leave apache in the host system context | 21:35 |
*** bobh has joined #openstack-infra | 21:35 | |
clarkb | ya not required to do so, but could be done | 21:35 |
fungi | (in cases where we already run services on the loopback and proxy them from apache anyway) | 21:35 |
ianw | oh right, yeah that's not the case for say statsd currently, but is for gerrit | 21:36 |
*** graphene has joined #openstack-infra | 21:36 | |
clarkb | ok bhs1 is working now that I manually boot stuff | 21:36 |
clarkb | I am going to turn max-servers back up to 8 again | 21:37 |
clarkb | perhaps this is nodepool specific somehow? we'll have to see if the behavior persists | 21:37 |
fungi | yeah, maybe they fixed it, or maybe there's some deeper problem there | 21:38 |
*** bobh has quit IRC | 21:40 | |
clarkb | | 0002343764 | ovh-bhs1 | ubuntu-xenial | d427d8d8-4a31-4f02-bab3-eb857e0fcf9b | 158.69.64.222 | 2607:5300:201:2000::26 | in-use | 00:00:00:06 | locked | | 21:40 |
clarkb | that looks promising | 21:40 |
clarkb | ianw: ^ fyi nl04 is in the emergency file so that we can tweak the max-servers value there to help debug bhs1 | 21:40 |
ianw | ++ | 21:42 |
*** ijw has quit IRC | 21:46 | |
mordred | ianw: yah - for some (possibly many) of the services, there is likely a container already built we can use | 21:50 |
mordred | ianw: for python things we need to build, I've been using the python:slim or python:alpine base images, which already have pip and stuff installed - so you can do really simple dockerfiles like "pip install ." or something | 21:51 |
mordred | and similarly, thinking about things like gerrit, I would expect to use a "java" base image and then just install the war file in there | 21:51 |
mordred | or something | 21:51 |
mordred | but that's also all just in my imagination of course :) | 21:52 |
mordred | ianw: also - for anything we need to make a container of that uses pbr and bindep, we can use pbrx to build images - so like when we start making storyboard containers, for instance | 21:53 |
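The sort of "really simple dockerfile" being described, sketched for an arbitrary pip-installable service (the image tag and console-script name are placeholders):

```bash
cat > Dockerfile <<'EOF'
FROM python:3.6-alpine
COPY . /src
RUN pip install --no-cache-dir /src
# "my-service" stands in for whatever entry point the package provides
CMD ["my-service"]
EOF
docker build -t opendev-ci/my-service .
```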
*** jamesmcarthur has joined #openstack-infra | 21:54 | |
ianw | mordred: so as step one, we need an ansible role to install & configure docker on a host, right? that's not done? | 21:54 |
clarkb | still only the one DOWN port in BHS1 too | 21:54 |
ianw | apt-get install docker seems fine ... reading up on making sure it talks ipv6 seems a little harder | 21:54 |
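The manual equivalent of such a role on ubuntu, as a sketch; the daemon.json keys are docker's documented ipv6 switches and the ULA prefix is a placeholder:

```bash
sudo apt-get update
sudo apt-get install -y docker.io
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:1::/64"
}
EOF
sudo systemctl restart docker
# containers on the default bridge should now get v6 addresses
docker network inspect bridge | grep -i ipv6
```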
clarkb | I've got to go pick up a box of vegetables, any objection to me increasing max-servers to say 80 on bhs1 to see how that goes? | 21:54 |
clarkb | I won't be able to check on it for about 45 minutes likely | 21:55 |
ianw | clarkb: just keep an eye on ports? | 21:56 |
clarkb | ianw: ports in a DOWN state and whether or not instances are actually booting | 21:57 |
clarkb | ianw: earlier today after we cleaned up the ports in a DOWN state we ended up not being able to boot anything | 21:57 |
*** jamesmcarthur has quit IRC | 21:58 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Don't report non-live items in stats https://review.openstack.org/605540 | 22:00 |
clarkb | ianw: fungi: you all ok with me making that max-server edit? | 22:01 |
ianw | clarkb: ok, i have a window up with pre-warmed bash cache for monitoring ovh-bhs1 openstackjenkins | 22:01 |
*** dklyle has quit IRC | 22:02 | |
clarkb | ok max servers set to 80 | 22:02 |
ianw | what's the 3 servers in there that don't have an "image"? | 22:03 |
ianw | possibly we just refreshed images? | 22:03 |
gouthamr | hi infra experts, i switched the nodeset on a set of jobs to ubuntu-bionic, but got a weird side-effect - the post run playbook is ignored - https://review.openstack.org/#/c/604929 - is this a known issue? or pbkac on my end? | 22:04 |
gouthamr | the playbook's there in the ARA report, but not executed | 22:04 |
gouthamr | sample: http://logs.openstack.org/29/604929/2/check/manila-tempest-dsvm-mysql-generic/ab0b8cf/ | 22:04 |
clarkb | gouthamr: I think the devstack base job may expect certain groups in your nodest | 22:05 |
clarkb | gouthamr: you'll need to update the nodeset and include those if you haven't | 22:05 |
*** boden has quit IRC | 22:06 | |
gouthamr | clarkb: oh.. you mean the "legacy-dsvm-base" job? | 22:06 |
*** kgiusti has left #openstack-infra | 22:06 | |
clarkb | hrm that job should be named legacy-manila-tempest-dsvm-mysql-generic if it isn't a v3 native job | 22:07 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Speed up build list query under mysql https://review.openstack.org/605170 | 22:07 |
clarkb | no I meant the native v3 devstack job base | 22:07 |
clarkb | I don't know if legacy-dsvm-base needs anything like that | 22:08 |
mordred | ianw: we have a role for that ... | 22:09 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve docs for inventory_file https://review.openstack.org/602665 | 22:09 |
gouthamr | clarkb: true! that isn't v3 native; dunno why we changed the name on that | 22:09 |
clarkb | gouthamr: the legacy jobs use the same devstack-single-node nodeset that the non legacy jobs use | 22:09 |
mordred | ianw: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/install-docker | 22:09 |
clarkb | gouthamr: and that sets a node to be named controller. look in openstack-dev/devstack/.zuul.yaml | 22:10 |
mordred | ianw: it might be worth us rearranging that a little bit - based on the discussion we had about roles in system-config | 22:10 |
mordred | ianw: this might be another of the ones that wants to live in system-config? | 22:10 |
corvus | mordred: why system-config? | 22:10 |
mordred | corvus: if we're going to use it in production ansible to install docker on a system we want to run services in containers on - then it would be a bit weird for system-config to need to add the zuul-jobs repo to bridge.o.o in order to get the role? | 22:11 |
corvus | that's currently zuul-jobs, which is our widest applicable source of roles, so moving to system-config is a scope reduction | 22:11 |
corvus | mordred: maybe that should be an independent repo then | 22:12 |
mordred | corvus: maybe so - maybe this is the first one where being an independent repo has a benefit | 22:12 |
clarkb | openstack.exceptions.SDKException: Error in creating the server: Build of instance 7bbfc561-5a4a-4fc7-8f5c-02e60bc61511 aborted: Request to https://network.compute.bhs1.cloud.ovh.net/v2.0/ports.json timed out | 22:12 |
corvus | mordred: though, *that* role sets up CI-specific mirrors -- so we may need 2 roles | 22:12 |
clarkb | maybe all our troubles are not behind us :/ | 22:12 |
mordred | corvus: oh -that's a great point | 22:12 |
ianw | clarkb: i also noticed a 500 error when i tried to list the servers, but only once | 22:13 |
ianw | clarkb: more general intermittent issues? | 22:13 |
ianw | yeah, lot of ERRORs | 22:14 |
*** jamesmcarthur has joined #openstack-infra | 22:14 | |
clarkb | I'll set it back to 8 and then go pick up my groceries | 22:15 |
fungi | sorry, stepped away for dinner. clarkb: increasing max-servers again seems worth trying, i agree | 22:16 |
ianw | fungi: heh, tried it and it didn't work :) | 22:16 |
*** smarcet has quit IRC | 22:16 | |
mordred | ianw: corvus has brought me back around to your way of thinking - we need an install-docker role for our production ansible | 22:16 |
mordred | ianw: I'd argue that it could be _very_ similar to the install-ansible role in zuul-jobs though - just maybe not with ci mirrors | 22:16 |
clarkb | fungi: ya was worth trying, but still broken, I undid it | 22:17 |
mordred | ianw, corvus: actually - sorry - now I'm changing my mind again | 22:17 |
mordred | how about I go play with something for a second and get back to the two of you | 22:17 |
fungi | i'm also figuring that rather than completing the mediawiki puppeting now i should be looking at containerizing it instead. is there a base php container it should be built on? | 22:18 |
ianw | mordred: sure, well knowing how to get docker onto the host seems like the first step in using this. | 22:19 |
*** jamesmcarthur has quit IRC | 22:19 | |
fungi | like the alpine python base? | 22:19 |
mordred | fungi: https://hub.docker.com/_/mediawiki/ | 22:20 |
mordred | fungi: there's even a mediawiki container | 22:20 |
fungi | yeah, just found https://docs.docker.com/samples/library/mediawiki/ | 22:20 |
mordred | \o/ | 22:20 |
fungi | as well as a lot of discussion about the complexity of adding extensions | 22:21 |
mordred | fungi: but if we want to roll our own - their dockerfile shows: FROM php:7.2-apache | 22:21 |
mordred | so that seems to maybe be a good base image | 22:21 |
clarkb | mordred: in the openstacksdk exception above is sdk/shade directly trying to manage the ports? | 22:21 |
mordred | fungi: https://github.com/wikimedia/mediawiki-docker/blob/41edcc8020aa47823d30c1b35f216b0a2834b2b6/stable/Dockerfile | 22:21 |
clarkb | I wonder if that is part of the problem. when I boot manually I use openstackclient which probably just talks boring nova api | 22:22 |
mordred | clarkb: in some cases it might be | 22:22 |
mordred | clarkb: we query ports after booting a server to find the port id | 22:22 |
mordred | clarkb: so that we can pass the port_id of the server's fixed ip to the floating_ip create call, creating the fip on the server's port | 22:22 |
clarkb | mordred: but we don't use floating IPs here | 22:22 |
mordred | oh - then something is wrong - we shouldn't be listing ports on a non fip cloud | 22:23 |
mordred | we MIGHT be trying to list them to get meta info about the server's interfaces | 22:23 |
mordred | but we should only ever be listing | 22:23 |
clarkb | ya this looks like its part of listing the server details | 22:23 |
clarkb | http://paste.openstack.org/show/730970/ is full traceback | 22:24 |
clarkb | if we go back to hunches I wouldn't be surprised if ovh doesn't expect you to talk to neutron so much since they don't really enable any sdn features | 22:24 |
fungi | mordred: aha so we would just have a periodic job slurp down a dockerfile like that and then push the resulting image to a server? | 22:24 |
mordred | oh - wait | 22:24 |
mordred | clarkb: that's not the sdk interacting with ports | 22:24 |
clarkb | ok really need to grab veggies now, but maybe that makes sense to you mordred | 22:25 |
clarkb | mordred: basically anything that seems to directly mess with ports is unreliable | 22:25 |
mordred | clarkb: "Error in creating the server: {reason}".format( | 22:25 |
mordred | reason=server['fault']['message']), | 22:25 |
clarkb | oh so that is from nova? | 22:25 |
mordred | that's us printing the fault.message in the server payload from nova | 22:25 |
clarkb | gotit | 22:26 |
mordred | we should clarify that in the error message | 22:26 |
mordred | "Error in creating the server, nova reports error: {reason}" | 22:26 |
mordred | would be clearer, yeah? | 22:26 |
clarkb | ++ | 22:28 |
mordred | clarkb: remote: https://review.openstack.org/605544 Clarify error message is from nova | 22:28 |
mordred | fungi: yes, that's what I'm thinking | 22:29 |
fungi | so this is probably similar to the errors i was getting from openstackclient in that case (nova reporting a problem talking to neutron)? | 22:29 |
mordred | fungi: or, a periodic job that builds an image based on a dockerfile like that - then publishes the image to either dockerhub or a docker hub we run | 22:29 |
fungi | i guess if we want to fork the dockerfile we can do that to add things like different configuration or additional non-core extension and theme bundles? | 22:30 |
mordred | fungi: then in the ansible, instead of "apt-get install mediawiki" we'd just do "docker run opendev-ci/mediawiki" or something | 22:30 |
mordred | fungi: yes. | 22:30 |
fungi | or is that where layered images come into play? | 22:30 |
mordred | fungi: we could also just build a new image based on the mediawiki image | 22:30 |
mordred | fungi: so make a dockerfile that says "from mediawiki" at the top, then plop our extensions and themes in | 22:31 |
mordred | fungi: I think for config, we just put it on the server like normal, and bind-mount the config dir in as a docker volume | 22:31 |
fungi | ahh, so dockerfiles have an inheritance concept. that's what the FROM php:7.2-apache at the top means? | 22:31 |
mordred | fungi: so like "docker run -v /etc/mediawiki:/etc/mediawiki opendev-ci/mediawiki" | 22:32 |
mordred | fungi: yes | 22:32 |
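Putting those two ideas together as a sketch (the extension path inside the image is an assumption about the upstream layout):

```bash
cat > Dockerfile <<'EOF'
FROM mediawiki
COPY extensions/SomeExtension /var/www/html/extensions/SomeExtension
EOF
docker build -t opendev-ci/mediawiki .
# config stays on the host and is bind-mounted in at run time
docker run -d -v /etc/mediawiki:/etc/mediawiki opendev-ci/mediawiki
```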
ianw | mordred: it would be more a .service file that does the docker run though? | 22:32 |
mordred | ianw: yah - I'm just spitting commandline docker commands in channel for conversation | 22:32 |
mordred | just saying - zuul job publishes to dockerhub (or a private dockerhub) - then on the server we just pull/run the image | 22:32 |
fungi | the fact that the mediawiki dockerfile runs `apt-get install ...` suggests that the php:7.2-apache image is debianish | 22:33 |
mordred | fungi: yes indeed it does | 22:33 |
fungi | and not some stripped-down thing like alpine or coreos i guess? | 22:33 |
mordred | fungi: if you want - you should be able to run "docker run -it --rm mediawiki /bin/bash" and get a shell in a container running that image and see what it's based on and what's there | 22:34 |
mordred | clarkb, fungi, ianw corvus : speaking of - opendev is taken on dockerhub - what say I grab opendevorg as an account to push things to? | 22:35 |
corvus | mordred: ++ | 22:35 |
*** jamesmcarthur has joined #openstack-infra | 22:35 | |
*** rcernin has joined #openstack-infra | 22:35 | |
fungi | the deeper down this rabbit hole i go, the more it seems like running containerized services means 1. understanding conventions of multiple distros since you won't know which ones a given upstream image is based on, and 2. sort of also being responsible for your own distribution since versions of stuff is hard-coded all over the place needing you to bump them to get security fixes? surely this can't | 22:36 |
fungi | be the actual reality of containerization | 22:36 |
fungi | opendevorg sgtm | 22:36 |
*** eernst has joined #openstack-infra | 22:37 | |
*** tosky has quit IRC | 22:37 | |
mordred | fungi: yup. that is, in fact, the reality of containerization | 22:37 |
fungi | at least for the mediawiki image, it's just mediawiki versions themselves which are hard-coded | 22:38 |
mordred | fungi: however, due to the idea of microservices and service teams - it's frequently only one service and one base image you're ever poking at | 22:38 |
fungi | so no worse than our (incomplete) configuration management in that regard | 22:38 |
mordred | fungi: yah | 22:38 |
mordred | fungi: also - I've been finding that the majority of container images are actually debuntu based - unless someone wants to make it smaller in which case there is also frequently an alpine variation | 22:39 |
*** jamesmcarthur has quit IRC | 22:40 | |
SpamapS | FYI, zookeeper's disk IO is *VERY* bursty | 22:40 |
fungi | mailman3 seems to use docker-compose to build their images https://github.com/maxking/docker-mailman/blob/master/docker-compose.yaml | 22:40 |
SpamapS | it wants to sync all of its writes to disk periodically, and if you did a lot.... | 22:40 |
SpamapS | it gets mad | 22:40 |
SpamapS | It never shows as heavy disk IO, it just shows as heavy latency on the syncs | 22:41 |
SpamapS | and no I don't run on tmpfs, I just run on a dedicated ZK node. | 22:41 |
SpamapS | previously it was battling with other processes for that IO latency and that made it miss checkpoints | 22:42 |
fungi | aha, FROM python:3.6-alpine https://github.com/maxking/docker-mailman/blob/master/core/Dockerfile | 22:42 |
mordred | fungi: yeah. that docker compose file is a way to say "please launch this set of images for me" | 22:43 |
fungi | so probably not too dissimilar from whatever we'll use for other python services | 22:43 |
mordred | fungi: yah - that's what we're using in the zuul and nodepool images | 22:43 |
fungi | oh, so docker-compose is not a higher-level image build language, it's a deployment language? | 22:43 |
mordred | ah | 22:43 |
mordred | yah | 22:43 |
fungi | and again, at least for the mm3 core image, it's only hard-coding mailman package versions and i guess taking whatever latest is on pypi otherwise | 22:45 |
fungi | so not too onerous | 22:45 |
mordred | in fact, jesse keating had a docker compose file a while back for booting a zuul on a machine - so you can say "docker compose up" and docker compose will launch all of the processes in containers needed connected together to produce the zuul service | 22:45 |
mordred | fungi: yah | 22:45 |
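For the mailman images linked above, the deployment side is then roughly this (still a sketch; the stack also expects some environment/configuration per the project's docs):

```bash
git clone https://github.com/maxking/docker-mailman
cd docker-mailman
docker-compose up -d
docker-compose ps
docker-compose logs
```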
fungi | i already worked out how to get mm3 services deployed from debian packages, but translating it to alpine will probably take some time | 22:46 |
fungi | i have a feeling we might stick with distro-packaged exim similarly to how we might stick with distro-packaged apache for some of these things | 22:47 |
mordred | fungi: well, you could maybe just use the docker mailman images? | 22:47 |
mordred | fungi: yup. totally agree | 22:47 |
fungi | i mean from a configuring it standpoint | 22:47 |
mordred | fungi: I don't think it's a win for us for the forseeable future for things like apache | 22:47 |
mordred | fungi: oh - gotcha | 22:47 |
fungi | the debian packages preconfigure an awful lot of the wsgi bits and stuff | 22:48 |
*** dpawlik has joined #openstack-infra | 22:48 | |
fungi | it's django and wsgi and... | 22:48 |
mordred | fungi: https://hub.docker.com/r/maxking/mailman-core/ seems to be the main image | 22:49 |
fungi | and database setup | 22:49 |
mordred | yeah - seems to be a lot of stuff | 22:49 |
mordred | fungi: well - if there are well maintained debian packages maybe it's one that we just install with ansible directly? I see the container stuff as more of a win when we're installing wars or installing python stuff from source or anything we're building ourselves | 22:51 |
fungi | https://asynchronous.in/docker-mailman/ is the container install walkthrough linked from the mm3 docs | 22:51 |
mordred | but - I guess if they have a full walkthrough of that- maybe it's a good learning example? | 22:51 |
mordred | fungi: ooh - the security section of that is nice - we should do that with our images | 22:52 |
*** dpawlik has quit IRC | 22:52 | |
mordred | (and make sure we appropriately sign our images too) | 22:52 |
fungi | yeah, maybe no need to container mm3, but also if we stick with the debian packages filtered through ubuntu we're stuck dealing with a mailman that's still rapidly changing (there are a _lot_ of bugs in mm3 being worked through still, from what i can see) | 22:53 |
*** rcernin has quit IRC | 22:53 | |
mordred | fungi: yah. whereas if you just follow these instructions | 22:54 |
*** eernst has quit IRC | 22:54 | |
*** rcernin has joined #openstack-infra | 22:55 | |
mordred | and maybe even use docker-compose for it since that's how they're recommending - and that way we can follow the stuff they're publishing and do it in a way that would enhance our ability to interact with them? | 22:55 |
*** jamesmcarthur has joined #openstack-infra | 22:56 | |
fungi | seems worth trying to redo the current poc following my notes from https://etherpad.openstack.org/p/mm3poc and adapting them to their container walkthrough | 22:56 |
mordred | ++ | 22:56 |
mordred | would at the very least be a learning case where you've got a clear set of instructions for the one way | 22:56 |
mordred | in the mean time - I will get an install-docker role done by morning | 22:57 |
ianw | mordred: sorry, where did we end up on "install docker to a host"? is there something i can do? i would like to get this bit nailed down (with ipv6) before i start looking at it | 22:57 |
ianw | oh, jinx | 22:57 |
mordred | ianw: :) | 22:57 |
mordred | ianw: I think I nerd-sniped myself into helping on that bit | 22:57 |
openstackgerrit | Merged openstack-infra/openstackid master: Updated user profile UI https://review.openstack.org/604172 | 22:57 |
*** jamesmcarthur has quit IRC | 22:57 | |
*** jamesmcarthur has joined #openstack-infra | 22:57 | |
ianw | mordred: i'm happy to take an initial stab, working from the existing role, if you're thinking of moving it into system-config? | 22:58 |
corvus | s/move/copy/ | 22:58 |
mordred | ianw: well - I think corvus made a great point, which is that the one in zuul-jobs has ci specific things | 22:58 |
mordred | so yeah, I think the next thought was copy it | 22:58 |
mordred | but then I was thinking - the ci mirror stuff is just generic "use this mirror" and could still be useful - we just need to remove the defaults of zuul_docker_mirror_host | 22:59 |
mordred | which is what I wanted to poke at to see what it looked like | 22:59 |
mordred | if that does seem reasonable - then I think moving it into its own repo and having the jobs add a roles-path - and adding it to the galaxy install in system-config - could be nice | 22:59 |
clarkb | I've set bhs1 back to max-servers 0 | 22:59 |
fungi | what is our story around testing this stuff? should we keep most of the zuul-side automation reusable for check/gate jobs on our container configs? | 23:00 |
mordred | corvus, ianw does that above stream-of-consciousness make sense? | 23:00 |
mordred | or - should we just copy the role to system-config for now | 23:00 |
mordred | and maybe make the reusable shared role for later? | 23:00 |
ianw | mordred: sort of, though i'm not sure what advantage having it in a separate repo affords? | 23:00 |
fungi | er, i guess that's basically what you're already discussing ;) | 23:00 |
mordred | ianw: to be able to use it in both zuul-jobs and system-config without having zuul-jobs need system-config or system-config need zuul-jobs | 23:01 |
ianw | presumably other people have done this, and we're not using *their* roles | 23:01 |
mordred | totally - and it might be a waste of energy :) | 23:01 |
ianw | given that it's not that complex, and we might want to do things in production that we don't want in zuul-jobs without having to worry about it being generic, i'm feeling like putting it in system-config wins, for mine | 23:01 |
mordred | so maybe we should just start with a copy of install-docker in system-config but without the mirror config? | 23:02 |
mordred | ianw: ++ | 23:02 |
fungi | we can also run roles from system-config in jobs that test things in or related to system-config if we want, right? | 23:02 |
mordred | fungi: totally | 23:02 |
ianw | ok, i'll be happy to spin that up with some basic test-infra as step 1 | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Add zuul user to bridge.openstack.org https://review.openstack.org/604925 | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Manage user ssh keys from urls https://review.openstack.org/604932 | 23:02 |
mordred | fungi: but for ci and for docker specifically, we should use the install-docker role in zuul-jobs | 23:03 |
mordred | fungi: since it properly configures ci mirrors | 23:03 |
fungi | oh, sure | 23:03 |
mordred | but yes, in general :) | 23:03 |
mordred | ianw: sweet! | 23:03 |
fungi | okay, now i get it | 23:03 |
fungi | so for deployment we have a shadow equivalent of install-docker which doesn't do the ci-specific setup | 23:03 |
fungi | and stick that in system-config | 23:04 |
ianw | fungi: yes, and quite possibly might do things production-specific, if we find the need | 23:04 |
mordred | yah | 23:04 |
clarkb | we did leak a bunch of ports out of that short time with max servers 80 | 23:04 |
mordred | like - we might decide we want some settings in the docker daemon.json | 23:04 |
fungi | clarkb: yech. are you still worried this is something recently regressed in nodepool/shade? | 23:05 |
mordred | ianw: also - we can simplify that role for production and just keep the 'use_upstream_docker' codepath | 23:05 |
ianw | yes, like the ipv6 settings i keep harping on about | 23:05 |
clarkb | fungi: no mordred pointed out that message came from nova itself and shade was just passing it through | 23:05 |
mordred | fungi: no - I believe it's an openstack-side issue | 23:05 |
mordred | ianw: ++ | 23:05 |
fungi | yeah, makes sense that the errors nodepool reported are the equivalent of some of what i was seeing from osc | 23:05 |
mordred | ianw: so maybe it's "copy install-docker, start deleting, completely change daemon.json.j2" :) | 23:05 |
ianw | sounds about right | 23:07 |
ianw | clarkb: am i understanding that at lower limits, no port leaks and everything was working, but then at 80 it started going wrong? | 23:08 |
clarkb | ianw: yes, though turning it back down to 8 it was unhappy still. (but plenty of ports leaked so maybe have to clean those up again?) | 23:09 |
clarkb | I am composing an email to amorin since our timezones don't line up great. I will cc you and fungi as people who have looked at this so far | 23:10 |
ianw | right, yeah saw those errors. but individual boots didn't leak ports | 23:10 |
ianw | anyway, ++ on the email. it does seem like the error is coming from the other end | 23:11 |
fungi | thanks clarkb! | 23:11 |
*** jamesmcarthur has quit IRC | 23:17 | |
clarkb | email sent | 23:18 |
*** jamesmcarthur has joined #openstack-infra | 23:19 | |
*** tpsilva has quit IRC | 23:19 | |
ianw | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz | 23:24 |
ianw | traceroute_v4_output": "git.openstack.org: Name or service not known\nCannot handle \"host\" cmdline arg `git.openstack.org' on position 1 (argc 2)\n", | 23:24 |
*** jlvillal has joined #openstack-infra | 23:24 | |
ianw | that's interesting, did we know dns was the cause of those intermittent errors? | 23:25 |
ianw | and how fascinating that was a multinode job and they *both* failed. that suggests it wasn't just a fluke | 23:26 |
*** rh-jelabarre has quit IRC | 23:26 | |
clarkb | I think prometheanfire had pointed it out (they were failing when setting up gentoo) | 23:26 |
*** dklyle has joined #openstack-infra | 23:26 | |
clarkb | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/zuul-info/host-info.primary.yaml says localhost is the resolver so must be something with unbound? | 23:27 |
clarkb | (grep dns in that file) | 23:28 |
*** graphene has quit IRC | 23:28 | |
prometheanfire | ya, it's been happening a bunch | 23:29 |
prometheanfire | silly fedora | 23:29 |
*** mriedem is now known as mriedem_away | 23:30 | |
mwhahaha | any chance on getting this project-config change merged? https://review.openstack.org/#/c/603489/ | 23:31 |
ianw | clarkb: but unbound isn't setup yet, right, on these integration jobs? | 23:33 |
mwhahaha | gracias | 23:33 |
*** graphene has joined #openstack-infra | 23:33 | |
clarkb | ianw: it should be but using our default config (which is ipv4 and ipv6 resolvers) | 23:33 |
clarkb | unless we don't do that config on fedora | 23:34 |
ianw | but why would it only sometimes not work ... | 23:34 |
ianw | clarkb: where did you see the resolver in info yaml? | 23:34 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Update Zuul service documentation https://review.openstack.org/605556 | 23:35 |
clarkb | ianw: it is under ansible_dns | 23:36 |
clarkb | says nameservers: - 127.0.0.1 | 23:36 |
prometheanfire | is the service running? | 23:38 |
ianw | hrm, right, so that would be pre reconfiguration | 23:38 |
ianw | a working log is http://logs.openstack.org/39/602439/11/check/openstack-infra-base-integration-fedora-latest/08950a4/zuul-info/host-info.fedora-28.yaml and it has the same | 23:38 |
clarkb | prometheanfire: I don't think we have that information, that would certainly be something to check. Maybe it isn't starting on boot on fedora reliably | 23:39 |
prometheanfire | maybe provider related? iirc, 2 or 3 out of every 4 of my rechecks for adding gentoo are for fedora failing | 23:40 |
ianw | we should capture unbound.log | 23:40 |
ianw | and a ps dump at least | 23:41 |
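Capturing that in a job could look roughly like the following (illustrative commands only; LOG_DIR stands in for wherever the job collects artifacts, and the unbound log path differs between our images and distros):

    mkdir -p "$LOG_DIR"
    ps auxww > "$LOG_DIR/ps.txt"
    cp /var/log/unbound.log "$LOG_DIR/" 2>/dev/null || \
        journalctl -u unbound > "$LOG_DIR/unbound.log"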
*** rcernin_ has joined #openstack-infra | 23:41 | |
pabelanger | prometheanfire: yes, it will fail on ipv6 hosts I believe | 23:43 |
*** rcernin has quit IRC | 23:43 | |
clarkb | pabelanger: it shouldn't | 23:43 |
pabelanger | clarkb: unbound needs to be configured before validate-hosts | 23:44 |
pabelanger | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz | 23:44 |
clarkb | it is configured | 23:44 |
prometheanfire | it's that race thing? | 23:44 |
pabelanger | that was the whole reason fedora-28 nodes were flakey in the gate | 23:44 |
openstackgerrit | Merged openstack-infra/project-config master: Add ansible-role-chrony project https://review.openstack.org/603489 | 23:44 |
pabelanger | clarkb: right, but configure-unbound hasn't run yet for base-minimal jobs | 23:44 |
ianw | pabelanger: it should be minimally configured in dib -> https://git.openstack.org/cgit/openstack-infra/project-config/tree/nodepool/elements/nodepool-base/finalise.d/89-unbound#n37 | 23:44 |
clarkb | yes, but the unbound config that is there should work | 23:45 |
pabelanger | ianw: yup, agree. for some reason, it is broken on ipv6. The fix was to first do configure-unbound and reload the unbound service, and then it started working | 23:45 |
clarkb | we boot with working dns, if we don't that is a bug | 23:45 |
clarkb | we are only optimizing it from there on the jobs | 23:45 |
pabelanger | http://git.openstack.org/cgit/openstack-infra/project-config/tree/playbooks/base/pre.yaml#n6 | 23:45 |
clarkb | ya I wonder if unbound is just not running at boot there | 23:46 |
pabelanger | the only difference I can think of is that for DIB we put in both ipv6/ipv4 forwarders, while for the configure-unbound role we choose the right one based on ipv6 / ipv4 | 23:46 |
clarkb | and restarting it is what made it happy | 23:46 |
ianw | ok, this seems like something i can test ... if we boot a fedora node in rax the suggestion is that unbound isn't immediately working, right? | 23:47 |
pabelanger | clarkb: should be able to manually boot a fedora-28 to confirm | 23:47 |
pabelanger | in inap | 23:47 |
clarkb | ianw: yes | 23:47 |
pabelanger | but, i think I did manually test this in rax or ipv6 | 23:47 |
pabelanger | but happy to see ianw try | 23:48 |
pabelanger | couldn't reproduce the issue | 23:48 |
clarkb | pabelanger: the error above was on rax is why ianw is looking at rax I think | 23:48 |
ianw | yeah ... but we have rax logs where it passes and where it fails | 23:48 |
pabelanger | ++ | 23:49 |
*** openstackgerrit has quit IRC | 23:49 | |
ianw | let me try creating a rax node manually and we can boot-cycle it and see if we hit something | 23:49 |
pabelanger | maybe add port tcp/53 localhost check to validate-host too? | 23:50 |
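One shape such a check could take (a sketch, not what the validate-host role actually does): fail early if nothing answers TCP on the loopback resolver.

    if ! timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/53'; then
        echo "nothing listening on 127.0.0.1:53" >&2
        exit 1
    fi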
*** jamesmcarthur has quit IRC | 23:50 | |
ianw | a hold on this job might help, although i think i tried that and got distracted because we never hit it to hold the node | 23:50 |
ianw | and then zuul got restarted etc | 23:51 |
pabelanger | ianw: yah, I think that is what I did, autohold. then when I finally went back to check, everything was working. | 23:51 |
pabelanger | oh, I remember, maybe it was a race starting unbound, but the journald logs were too old to confirm | 23:52 |
ianw | classic heisenbug | 23:52 |
ianw | we do start unbound from rc.local right ... | 23:52 |
ianw | maybe we should drop in a .service file | 23:53 |
ianw | no, what we do is "echo 'nameserver 127.0.0.1' > /etc/resolv.conf" in rc.local | 23:55 |
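A minimal sketch of the .service approach (the unit name and contents are assumptions, not something that exists in project-config today): a oneshot unit dropped into /etc/systemd/system/ and enabled, ordered after unbound so the resolv.conf rewrite only happens once the local resolver is up.

    [Unit]
    Description=Point resolv.conf at the local unbound resolver
    After=unbound.service
    Wants=unbound.service

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 'nameserver 127.0.0.1' > /etc/resolv.conf"

    [Install]
    WantedBy=multi-user.target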