*** xek has joined #openstack-infra | 00:01 | |
*** whoami-rajat has quit IRC | 00:01 | |
*** yamamoto has joined #openstack-infra | 00:02 | |
openstackgerrit | Merged zuul/nodepool master: Add nodepool_debug flag to openstack functional jobs https://review.opendev.org/669939 | 00:15 |
*** jistr has quit IRC | 00:15 | |
*** jistr has joined #openstack-infra | 00:15 | |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: [wip] functional testing: test journal-to-console element https://review.opendev.org/669787 | 00:21 |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: Enable debug logs for openstack-functional tests https://review.opendev.org/672412 | 00:23 |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: [wip] functional testing: test journal-to-console element https://review.opendev.org/669787 | 00:23 |
*** larainema_ has joined #openstack-infra | 00:48 | |
*** larainema_ is now known as larainema | 00:49 | |
*** gyee has quit IRC | 00:49 | |
*** ricolin has joined #openstack-infra | 00:55 | |
ianw | clarkb: http://logs.openstack.org/87/669787/9/check/nodepool-functional-openstack-src/235e201/nodepool/nodepool-launcher.log | 00:56 |
ianw | @ around 2019-07-25 00:52:37,654 ... sending the systemd output to the journal, it gets captured ok ... i think that will be helpful in general for any such future issues | 00:57 |
*** igordc has quit IRC | 01:04 | |
*** yamamoto has quit IRC | 01:04 | |
clarkb | ya that bit was working iirc | 01:07 |
*** slaweq has joined #openstack-infra | 01:11 | |
*** slaweq has quit IRC | 01:15 | |
*** tdasilva has quit IRC | 01:20 | |
*** tdasilva has joined #openstack-infra | 01:21 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: journal-to-console: element to send systemd journal to console https://review.opendev.org/669784 | 01:25 |
*** mriedem has quit IRC | 01:49 | |
*** Frootloop has quit IRC | 02:09 | |
*** jcoufal has joined #openstack-infra | 02:19 | |
*** jcoufal has quit IRC | 02:33 | |
*** yamamoto has joined #openstack-infra | 02:46 | |
*** bhavikdbavishi has joined #openstack-infra | 02:51 | |
*** bhavikdbavishi1 has joined #openstack-infra | 02:54 | |
*** bhavikdbavishi has quit IRC | 02:55 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 02:55 | |
*** ykarel|away has joined #openstack-infra | 02:56 | |
*** whoami-rajat has joined #openstack-infra | 03:06 | |
*** slaweq has joined #openstack-infra | 03:11 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Remove gitea02 from inventory so we can replace it https://review.opendev.org/672621 | 03:13 |
clarkb | fungi: ^ head start on tomorrow | 03:13 |
*** slaweq has quit IRC | 03:16 | |
openstackgerrit | Ian Wienand proposed zuul/nodepool master: Functional testing: add journal-to-console element https://review.opendev.org/669787 | 03:35 |
*** eernst has joined #openstack-infra | 03:36 | |
*** psachin has joined #openstack-infra | 03:38 | |
*** yamamoto has quit IRC | 03:42 | |
*** yamamoto has joined #openstack-infra | 03:46 | |
*** yamamoto has quit IRC | 03:51 | |
*** yamamoto has joined #openstack-infra | 03:53 | |
*** rcernin has quit IRC | 03:55 | |
*** yamamoto has quit IRC | 03:57 | |
*** yamamoto has joined #openstack-infra | 04:02 | |
*** lmiccini has quit IRC | 04:04 | |
*** lmiccini has joined #openstack-infra | 04:05 | |
*** udesale has joined #openstack-infra | 04:06 | |
*** dchen has quit IRC | 04:07 | |
*** ykarel|away has quit IRC | 04:08 | |
*** dchen has joined #openstack-infra | 04:10 | |
*** yolanda has quit IRC | 04:21 | |
*** yolanda has joined #openstack-infra | 04:22 | |
*** ykarel|away has joined #openstack-infra | 04:34 | |
*** pcaruana has joined #openstack-infra | 04:44 | |
*** pcaruana has quit IRC | 04:56 | |
*** slittle1 has joined #openstack-infra | 05:04 | |
*** slittle1 has quit IRC | 05:09 | |
*** slaweq has joined #openstack-infra | 05:11 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: support alternate portage directories https://review.opendev.org/671530 | 05:15 |
*** slaweq has quit IRC | 05:16 | |
*** kopecmartin|offf is now known as kopecmartin | 05:18 | |
*** eernst has quit IRC | 05:22 | |
*** rcernin has joined #openstack-infra | 05:33 | |
*** ykarel|away is now known as ykarel | 05:42 | |
*** dchen has quit IRC | 05:50 | |
openstackgerrit | Merged openstack/diskimage-builder master: Enable nodepool debugging for functional tests https://review.opendev.org/672608 | 06:00 |
*** kjackal has joined #openstack-infra | 06:01 | |
*** jaosorior has quit IRC | 06:03 | |
*** rcernin has quit IRC | 06:03 | |
*** yamamoto has quit IRC | 06:07 | |
*** dchen has joined #openstack-infra | 06:09 | |
*** slaweq has joined #openstack-infra | 06:11 | |
*** slaweq has quit IRC | 06:15 | |
openstackgerrit | Kartikeya Jain proposed openstack/diskimage-builder master: Adding new dib element https://review.opendev.org/578773 | 06:18 |
*** yamamoto has joined #openstack-infra | 06:18 | |
*** rcernin has joined #openstack-infra | 06:18 | |
*** jaosorior has joined #openstack-infra | 06:20 | |
*** pcaruana has joined #openstack-infra | 06:21 | |
*** rcernin has quit IRC | 06:21 | |
*** rcernin has joined #openstack-infra | 06:21 | |
*** jaicaa has quit IRC | 06:28 | |
AJaeger | infra-root, I cannot log in to Zanata at translate.openstack.org, is our openid somehow broken? I do not get a login screen at all ;( | 06:30 |
*** dpawlik has joined #openstack-infra | 06:31 | |
*** jaicaa has joined #openstack-infra | 06:31 | |
*** joeguo has quit IRC | 06:33 | |
*** slaweq has joined #openstack-infra | 06:33 | |
*** udesale has quit IRC | 06:33 | |
*** udesale has joined #openstack-infra | 06:34 | |
*** cshen has joined #openstack-infra | 06:36 | |
cshen | morning, is opendev.org DOWN? | 06:36 |
AJaeger | https://opendev.org/ is up - what exactly is failing for you? | 06:37 |
*** abhishekk has joined #openstack-infra | 06:38 | |
AJaeger | infra-root, do we have gitea problem again? | 06:38 |
openstackgerrit | Kartikeya Jain proposed openstack/diskimage-builder master: Adding support for SLES 15 in element 'sles' https://review.opendev.org/619186 | 06:38 |
AJaeger | I get: "fatal: unable to access 'https://opendev.org/openstack/openstack-manuals.git/': Empty reply from server" | 06:38 |
AJaeger | cshen: is that your problem as well? ^ | 06:38 |
AJaeger | infra-root, this is running a git pull from opendev | 06:38 |
abhishekk | hi, I am not able to access https://opendev.org/openstack/glance/ or https://opendev.org/openstack/glance_store/ | 06:39 |
abhishekk | is there any problem? | 06:39 |
cshen | AJaeger: opendev.org is not accessible. | 06:39 |
AJaeger | abhishekk: seems so, see the last lines | 06:39 |
*** marios|ruck has joined #openstack-infra | 06:39 | |
AJaeger | cshen: Which URL exactly? The git clone or anything else? | 06:40 |
cshen | just our luck, it happened right when we started our major upgrade :-D | 06:40 |
abhishekk | AJaeger, ack | 06:40 |
cshen | AJaeger: basically, the whole site is not accessible. | 06:40 |
AJaeger | cshen: for me https://opendev.org/ works at the top level, so are you running into the same problem with git cloning that abhishekk and I see, or is there another one? How exactly can we reproduce? | 06:41 |
cshen | AJaeger: git clone failed for me as well. | 06:42 |
AJaeger | #infra log cloning with git from opendev is failing | 06:42 |
yoctozepto | AJaeger: does not work from here either | 06:42 |
yoctozepto | not via browser either | 06:42 |
yoctozepto | seems like a connection issue? | 06:42 |
cshen | yoctozepto: it seems that the site is down. | 06:43 |
yoctozepto | cshen: AJaeger has just claimed it works for him :D | 06:43 |
yoctozepto | top-level, from browser, does not load for me | 06:43 |
cshen | yoctozepto: I can't access opendev.org from Germany right now. neither HTTP nor git clone. | 06:44 |
yoctozepto | Poland here | 06:44 |
yoctozepto | Podlachia region (north east) | 06:44 |
AJaeger | yoctozepto: git cloning fails for me, https://opendev.org (top-level) works but nothing git related like browsing repositories - from Germany | 06:45 |
abhishekk | me From India - Asia | 06:45 |
abhishekk | not able to clone or access via browser | 06:45 |
AJaeger | #status alert The git service on opendev.org is currently down. | 06:46 |
openstackstatus | AJaeger: sending alert | 06:46 |
* AJaeger sends an alert to reduce questions ;) | 06:46 | |
*** rlandy has joined #openstack-infra | 06:46 | |
AJaeger | I think we can all agree that git is broken - and without an admin around, nothing we can do until the US wakes up. So, this might take another 5 hours... | 06:47 |
AJaeger | yoctozepto, cshen , abhishekk, thanks for reporting - and sorry for this. But nothing we can do right now | 06:48 |
*** pgaxatte has joined #openstack-infra | 06:48 | |
abhishekk | AJaeger, ack | 06:48 |
-openstackstatus- NOTICE: The git service on opendev.org is currently down. | 06:49 | |
*** ChanServ changes topic to "The git service on opendev.org is currently down." | 06:49 | |
yoctozepto | AJaeger: roger that, git is definitely down when all http is down :-) | 06:49 |
yoctozepto | it's odd | 06:50 |
yoctozepto | I debugged it | 06:50 |
*** dpawlik has quit IRC | 06:50 | |
yoctozepto | http does a redirect to https | 06:50 |
yoctozepto | https negotiates tls session | 06:50 |
yoctozepto | and hangs | 06:51 |
yoctozepto | after tunnel is established | 06:51 |
yoctozepto | should be region independent | 06:51 |
yoctozepto | http://paste.openstack.org/show/754833/ | 06:52 |
*** jpena|off is now known as jpena | 06:52 | |
yoctozepto | could it be that it banned us at app level? ;d | 06:52 |
*** jpena is now known as jpena|mtg | 06:53 | |
openstackstatus | AJaeger: finished sending alert | 06:53 |
cshen | AJaeger: ack, any backup git repo which we could check out? | 06:53 |
yoctozepto | cshen: review.opendev.org seems to still work | 06:54 |
cshen | yoctozepto: same here | 06:54 |
yoctozepto | cshen: cool, I meant you can use the repos via gerrit | 06:54 |
Tengu | wait, comodo CA is still alive ?! | 06:56 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul CLI: allow access via REST https://review.opendev.org/636315 | 06:56 |
yoctozepto | Tengu: that's what it seems, at least for this cert | 06:57 |
Tengu | surprising..... didn't they get an intrusion and the CA stolen? | 06:57 |
Tengu | (now, wondering why not using something free like «let's encrypt» :D) | 06:58 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Add Authorization Rules configuration https://review.opendev.org/639855 | 06:58 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Web: plug the authorization engine https://review.opendev.org/640884 | 06:59 |
cshen | yoctozepto: could you give me an example of repo url? | 06:59 |
yoctozepto | Tengu: yup, as long as you don't need EV (i.e. you are not a payment processing org) | 06:59 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint https://review.opendev.org/641099 | 06:59 |
Tengu | yoctozepto: of course :). | 06:59 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: authentication config: add optional token_expiry https://review.opendev.org/642408 | 06:59 |
yoctozepto | cshen: sure, it requires you to be a registered user though: | 06:59 |
yoctozepto | [remote "gerrit"] | 07:00 |
yoctozepto | url = ssh://yoctozepto@review.opendev.org:29418/openstack/kolla-ansible.git | 07:00 |
yoctozepto | fetch = +refs/heads/*:refs/remotes/gerrit/* | 07:00 |
*** apetrich has quit IRC | 07:00 | |
yoctozepto | change to your username obviously | 07:01 |
Tengu | https://review.opendev.org/openstack/tripleo-ci also | 07:01 |
Tengu | anonymous | 07:01 |
Tengu | and http(s) | 07:01 |
cshen | or maybe use the repos in github.com? | 07:02 |
cshen | it seems to be 1:1 mirrored. | 07:02 |
yoctozepto | Tengu: it said 'not found'? | 07:02 |
yoctozepto | cshen: yeah, openstack/ are | 07:02 |
Tengu | o_O that's the link provided within the project listing of gerrit | 07:03 |
yoctozepto | though I wonder whether the opendev.org outage stopped the sync at some point | 07:03 |
Tengu | for instance: https://review.opendev.org/#/admin/projects/openstack/tripleo-ci | 07:03 |
*** odicha has joined #openstack-infra | 07:03 | |
yoctozepto | Tengu: yeah, it worked now | 07:03 |
*** jamesmcarthur has joined #openstack-infra | 07:04 | |
Tengu | but the git link doesn't.... | 07:04 |
Tengu | that's interesting. | 07:04 |
yoctozepto | it works from git, not browser, just checked | 07:04 |
Tengu | hmm.... didn't work for me using git. | 07:04 |
yoctozepto | then it's magic | 07:04 |
Tengu | {"changed": false, "cmd": ["/bin/git", "fetch", "origin"], "msg": "Failed to download remote objects and refs: fatal: remote error: Git repository not found\n"} | 07:05 |
Tengu | unless... wait. | 07:05 |
yoctozepto | $ git clone https://review.opendev.org/openstack/tripleo-ci | 07:05 |
yoctozepto | Cloning into 'tripleo-ci'... | 07:05 |
yoctozepto | remote: Counting objects: 13343, done | 07:05 |
yoctozepto | remote: Finding sources: 100% (13343/13343) | 07:05 |
yoctozepto | remote: Total 13343 (delta 6671), reused 11016 (delta 6671) | 07:05 |
yoctozepto | Receiving objects: 100% (13343/13343), 5.99 MiB | 3.27 MiB/s, done. | 07:05 |
yoctozepto | Resolving deltas: 100% (6671/6671), done. | 07:05 |
yoctozepto | so anonymous https works too via gerrit | 07:05 |
yoctozepto | good to know | 07:05 |
Tengu | oh, my fault. | 07:05 |
yoctozepto | next time gitea refuses to work | 07:05 |
Tengu | was still using the old project "openstack-infra". | 07:06 |
*** rcernin has quit IRC | 07:06 | |
*** rlandy is now known as rlandy|mtg | 07:07 | |
yoctozepto | AJaeger: wonder if you can send an announcement about the availability of the git repos via gerrit? | 07:07 |
yoctozepto | should make ppl happier | 07:07 |
yoctozepto | the path seems to be exactly the same | 07:08 |
ianw | hrrm, this is definitely not my area of knowledge with the changes going on atm | 07:09 |
yoctozepto | #status info The git service on review.opendev.org can be used in place of opendev.org's - project paths are preserved | 07:12 |
yoctozepto | (was worth trying ;D ) | 07:12 |
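For reference, a minimal sketch of the workaround described above - pointing an existing clone at the Gerrit mirror while gitea is down (this assumes the project path really is identical, as claimed; tripleo-ci is just the repo used in this exchange):

    # re-point the existing "origin" remote at Gerrit's anonymous https mirror
    git remote set-url origin https://review.opendev.org/openstack/tripleo-ci
    git fetch origin
    # or clone fresh
    git clone https://review.opendev.org/openstack/tripleo-ci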
*** tesseract has joined #openstack-infra | 07:15 | |
*** iurygregory has joined #openstack-infra | 07:15 | |
*** udesale has quit IRC | 07:16 | |
*** iokiwi has quit IRC | 07:17 | |
*** adriant has quit IRC | 07:17 | |
*** dpawlik has joined #openstack-infra | 07:17 | |
*** udesale has joined #openstack-infra | 07:18 | |
*** iokiwi has joined #openstack-infra | 07:18 | |
*** adriant has joined #openstack-infra | 07:18 | |
*** gfidente has joined #openstack-infra | 07:20 | |
ianw | [12660704.934832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. | 07:21 |
ianw | [12660825.726429] INFO: task jbd2/vda1-8:248 blocked for more than 120 seconds. | 07:21 |
ianw | [12660825.732761] Not tainted 4.15.0-45-generic #48-Ubuntu | 07:21 |
ianw | this is on the gitea-lb01 console | 07:22 |
ianw | http://paste.openstack.org/show/754834/ for posterity | 07:22 |
ianw | i think it needs a reboot ... i guess it can't make it worse | 07:22 |
*** aedc has joined #openstack-infra | 07:22 | |
*** rpittau|afk is now known as rpittau | 07:22 | |
*** raissa has quit IRC | 07:23 | |
*** raissa has joined #openstack-infra | 07:24 | |
*** raissa has joined #openstack-infra | 07:25 | |
ianw | great, now it is in error state | 07:26 |
yoctozepto | life is full of surprises | 07:27 |
cshen | ianw: do we have only one server for serving git service? | 07:27 |
ianw | cshen: one load balancer, anyway :/ | 07:28 |
yoctozepto | cshen: but review.opendev.org works with the same paths | 07:30 |
*** Goneri has joined #openstack-infra | 07:30 | |
yoctozepto | so it's a no-brainer actually to replace ;D | 07:30 |
ianw | i think this is a problem on vexxhost that i can't solve | 07:30 |
yoctozepto | discussed a bit above | 07:30 |
yoctozepto | cshen: change opendev.org to review.opendev.org and it should magically work (for git) | 07:31 |
cshen | yoctozepto: yes, I checked, I even checked out from github.com. But the upgrade scripts have some dependencies on opendev.org. | 07:32 |
*** kobis1 has joined #openstack-infra | 07:32 | |
yoctozepto | cshen: which scripts are you talking about? | 07:32 |
ianw | i don't think there's much i can do at this point. either vexxhost need to look at what's going on in the backend and recover the server, or we need to build a new one | 07:34 |
ianw | mnaser: ^ | 07:35 |
*** dchen has quit IRC | 07:35 | |
cshen | yoctozepto: https://github.com/openstack/openstack-ansible/blob/master/scripts/bootstrap-ansible.sh | 07:39 |
yoctozepto | ah, osa | 07:40 |
cshen | it pulls a lot of things from opendev.org | 07:40 |
*** ykarel is now known as ykarel|lunch | 07:41 | |
noonedeadpunk | guilhermesp probably you can help with opendev thing ^ | 07:42 |
yoctozepto | yeah, kolla's CI does too, it is broken for the moment | 07:43 |
yoctozepto | mostly due to redirect from upper-constraints to opendev | 07:43 |
yoctozepto | ;D | 07:43 |
*** priteau has joined #openstack-infra | 07:45 | |
*** marekchm has joined #openstack-infra | 07:50 | |
*** tkajinam has quit IRC | 07:53 | |
*** tkajinam has joined #openstack-infra | 07:53 | |
AJaeger | yoctozepto: upper-constraints should be downloaded from releases.openstack.org | 07:57 |
*** jaosorior has quit IRC | 07:57 | |
AJaeger | yoctozepto: e.g. https://releases.openstack.org/constraints/upper/master | 07:57 |
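A minimal sketch of the usage AJaeger points to, constraining pip against the published upper-constraints URL rather than a hard-coded opendev.org path (note the caveat below that this URL currently redirects to opendev.org anyway):

    # install requirements constrained by the published upper-constraints file
    pip install -c https://releases.openstack.org/constraints/upper/master -r requirements.txt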
AJaeger | ianw: do you know how to take git01 out of haproxy? | 07:58 |
ianw | #status log sent email update about opendev.org downtime, appears to be vexxhost region-wide http://lists.openstack.org/pipermail/openstack-infra/2019-July/006426.html | 07:58 |
openstackstatus | ianw: finished logging | 07:58 |
AJaeger | ianw: thanks ! | 07:58 |
yoctozepto | AJaeger: yeah and that REDIRECTS ;D | 07:58 |
ianw | AJaeger: ^ see above email. not only does the load-balancer have issues, but the gitea backend servers also have kernel errors about storage. i think it's a region-wide issue on vexxhost | 07:58 |
ianw | so yeah, just rebuilding the lb somewhere else won't help | 07:59 |
yoctozepto | to opendev which is utterly broken atm ;/ | 07:59 |
AJaeger | yoctozepto: oh, it redirects? didn't know that ;( | 08:00 |
AJaeger | ianw: argh ;/ | 08:00 |
yoctozepto | AJaeger: yeah, unfortunately, someone even suggested it was inefficient when it was proposed | 08:00 |
yoctozepto | forgot it could also be "unstable" | 08:01 |
ianw | AJaeger: yeah sorry i've got to step away, but i think the most practical thing is to wait for vexxhost to confirm issues | 08:03 |
*** dtantsur|afk is now known as dtantsur | 08:11 | |
*** pkopec has joined #openstack-infra | 08:11 | |
*** lucasagomes has joined #openstack-infra | 08:12 | |
*** pkopec has quit IRC | 08:12 | |
*** pkopec has joined #openstack-infra | 08:12 | |
*** ralonsoh has joined #openstack-infra | 08:13 | |
AJaeger | ianw: I'm in meetings all day, so not much time either (and even fewer options than you have). Is the alert good enough or do you have a proposal to change it? | 08:13 |
jamesmcarthur | ianw: yeah, openstack.org, etc... are all down as well | 08:16 |
jamesmcarthur | if anyone is asking :| | 08:16 |
yoctozepto | jamesmcarthur, ianw, AJaeger: oh, that escalated pretty quickly | 08:21 |
*** apetrich has joined #openstack-infra | 08:24 | |
*** fdegir has joined #openstack-infra | 08:24 | |
AJaeger | So, is the following ok to send out "Services at opendev.org like our git server and at openstack.org are currently down, looks like an outage in one of our cloud providers." ? | 08:26 |
*** siqbal has joined #openstack-infra | 08:26 | |
ianw | jamesmcarthur: yeah, i guess that goes through the same lb | 08:27 |
yoctozepto | AJaeger: looks fine | 08:27 |
*** panda has quit IRC | 08:28 | |
yoctozepto | guys, https://review.opendev.org/671178 , are cyclic dependencies possible? | 08:29 |
yoctozepto | I get no error but it does not seem to be picked up | 08:29 |
yoctozepto | ;/ | 08:29 |
AJaeger | #status alert Services at opendev.org like our git server and at openstack.org are currently down, looks like an outage in one of our cloud providers. | 08:29 |
openstackstatus | AJaeger: sending alert | 08:29 |
AJaeger | yoctozepto: cyclic dependencies are not fine - Zuul will refuse to test these since it cannot put them in any sequential order | 08:30 |
*** tosky has joined #openstack-infra | 08:30 | |
*** panda has joined #openstack-infra | 08:31 | |
-openstackstatus- NOTICE: Services at opendev.org like our git server and at openstack.org are currently down, looks like an outage in one of our cloud providers. | 08:32 | |
*** ChanServ changes topic to "Services at opendev.org like our git server and at openstack.org are currently down, looks like an outage in one of our cloud providers." | 08:32 | |
* AJaeger sends an email to openstack-discuss now as well... | 08:32 | |
noonedeadpunk | so, seems that mnaser just fixed balancer | 08:33 |
AJaeger | cool! | 08:34 |
cshen | thanks, better now. | 08:34 |
AJaeger | are we green again? | 08:34 |
AJaeger | looks good on my end... | 08:35 |
cshen | I'm bootstraping. | 08:35 |
AJaeger | noonedeadpunk: thanks for telling us | 08:35 |
yoctozepto | looks green | 08:35 |
AJaeger | ok, then I'll send the "ok" ;) | 08:35 |
*** ysastri has joined #openstack-infra | 08:36 | |
*** wpp has joined #openstack-infra | 08:36 | |
AJaeger | #status ok The problem in our cloud provider has been fixed, services should be working again | 08:36 |
openstackstatus | AJaeger: finished sending alert | 08:36 |
*** tkajinam has quit IRC | 08:36 | |
openstackstatus | AJaeger: sending ok | 08:36 |
AJaeger | mnaser: thanks for fixing! | 08:36 |
*** kobis1 has quit IRC | 08:37 | |
noonedeadpunk | AJaeger: I guess you should have sent the alert a bit earlier - probably we'd have gotten a solution faster :P | 08:38 |
*** sshnaidm has quit IRC | 08:38 | |
*** dkopper has joined #openstack-infra | 08:39 | |
*** ChanServ changes topic to "Discussion of OpenStack Developer and Community Infrastructure | docs http://docs.openstack.org/infra/ | bugs https://storyboard.openstack.org/ | source https://opendev.org/opendev/ | channel logs http://eavesdrop.openstack.org/irclogs/%23openstack-infra/" | 08:39 | |
AJaeger | noonedeadpunk: first alert was sent two hours ago - as soon as it was reported... | 08:39 |
-openstackstatus- NOTICE: The problem in our cloud provider has been fixed, services should be working again | 08:39 | |
jamesmcarthur | appears everything is back online now | 08:39 |
noonedeadpunk | ah... | 08:40 |
AJaeger | jamesmcarthur: thanks for confirming. | 08:41 |
* AJaeger is offline again... | 08:41 | |
*** sshnaidm has joined #openstack-infra | 08:42 | |
cshen | me is still working. | 08:42 |
openstackstatus | AJaeger: finished sending ok | 08:43 |
mnaser | sorry about that, this should not have happened and I'm a bit embarrassed at how it all went down | 08:46 |
mnaser | And sorry for the lack of communication on my side. | 08:46 |
mnaser | Also, is it possible to drop max-servers to 0 in sjc for now? | 08:47 |
*** jamesmcarthur has quit IRC | 08:48 | |
ianw | mnaser: np, stuff happens! yep we can, is it a fast-merge situation? | 08:53 |
*** apetrich has quit IRC | 08:53 | |
mnaser | ianw: I mean I kinda disabled the user already on my side | 08:53 |
mnaser | So not really unless it breaks you a whole ton having the OpenStack Jenkins user disabled | 08:54 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Disable sjc https://review.opendev.org/672662 | 08:54 |
AJaeger | ianw: want to fast-merge ^ | 08:54 |
AJaeger | and apply on the server directly? | 08:54 |
ianw | AJaeger: heh, you beat me to it :) | 08:55 |
ianw | AJaeger: umm i can, maybe it will miss a puppet run. with the remote end disabled we'll just timeout | 08:55 |
AJaeger | you're the expert ;) | 08:56 |
ianw | i'd never claim that :) but i've set it to zero on nl03 for the mean time anyway | 08:57 |
*** ykarel|lunch is now known as ykarel | 08:57 | |
*** jtomasek has joined #openstack-infra | 09:01 | |
*** joeguo has joined #openstack-infra | 09:01 | |
*** kobis1 has joined #openstack-infra | 09:02 | |
*** siqbal90 has joined #openstack-infra | 09:02 | |
*** apetrich has joined #openstack-infra | 09:02 | |
*** siqbal has quit IRC | 09:04 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Return dependency cycle failure to user https://review.opendev.org/672487 | 09:12 |
*** lpetrut has joined #openstack-infra | 09:15 | |
*** lpetrut has quit IRC | 09:16 | |
*** lennyb has joined #openstack-infra | 09:16 | |
*** lpetrut has joined #openstack-infra | 09:16 | |
*** kobis1 has quit IRC | 09:24 | |
openstackgerrit | Merged openstack/project-config master: Disable sjc https://review.opendev.org/672662 | 09:24 |
*** e0ne has joined #openstack-infra | 09:32 | |
*** yamamoto has quit IRC | 09:39 | |
*** apetrich has quit IRC | 09:42 | |
*** ysastri has quit IRC | 09:52 | |
*** bhavikdbavishi has quit IRC | 09:52 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Fix reference pipelines syntax coloration for Pagure driver https://review.opendev.org/672677 | 09:54 |
*** Lucas_Gray has joined #openstack-infra | 09:55 | |
*** Lucas_Gray has quit IRC | 10:06 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add reference pipelines file for Gerrit driver https://review.opendev.org/672683 | 10:12 |
*** yamamoto has joined #openstack-infra | 10:17 | |
*** yamamoto has quit IRC | 10:27 | |
*** yamamoto has joined #openstack-infra | 10:27 | |
*** siqbal has joined #openstack-infra | 10:33 | |
*** siqbal90 has quit IRC | 10:34 | |
*** abhishekk has quit IRC | 10:38 | |
*** ykarel is now known as ykarel|afk | 10:43 | |
*** jaosorior has joined #openstack-infra | 10:47 | |
*** yamamoto has quit IRC | 10:54 | |
*** yamamoto has joined #openstack-infra | 11:01 | |
*** yamamoto has quit IRC | 11:06 | |
*** adriant has quit IRC | 11:07 | |
*** adriant has joined #openstack-infra | 11:07 | |
*** jaosorior has quit IRC | 11:08 | |
*** udesale has quit IRC | 11:13 | |
*** marekchm has quit IRC | 11:13 | |
*** cshen has quit IRC | 11:25 | |
*** cshen has joined #openstack-infra | 11:28 | |
*** yamamoto has joined #openstack-infra | 11:32 | |
*** rh-jelabarre has joined #openstack-infra | 11:35 | |
*** yamamoto has quit IRC | 11:37 | |
*** stakeda has quit IRC | 11:39 | |
*** pcaruana has quit IRC | 11:42 | |
*** bhavikdbavishi has joined #openstack-infra | 11:42 | |
*** igordc has joined #openstack-infra | 11:43 | |
*** mriedem has joined #openstack-infra | 11:51 | |
*** apetrich has joined #openstack-infra | 11:58 | |
*** armax has quit IRC | 11:58 | |
*** armax has joined #openstack-infra | 11:59 | |
*** ykarel|afk is now known as ykarel | 12:00 | |
*** lmiccini has quit IRC | 12:02 | |
*** dpawlik has quit IRC | 12:02 | |
*** lmiccini has joined #openstack-infra | 12:08 | |
*** iurygregory has quit IRC | 12:11 | |
*** yamamoto has joined #openstack-infra | 12:11 | |
*** iurygregory has joined #openstack-infra | 12:11 | |
*** lmiccini has quit IRC | 12:15 | |
*** yamamoto has quit IRC | 12:17 | |
*** yamamoto has joined #openstack-infra | 12:18 | |
*** dpawlik has joined #openstack-infra | 12:21 | |
*** pcaruana has joined #openstack-infra | 12:22 | |
*** aedc has quit IRC | 12:25 | |
*** yamamoto has quit IRC | 12:27 | |
openstackgerrit | Monty Taylor proposed zuul/zuul master: Improve SQL query performance in some cases https://review.opendev.org/672606 | 12:31 |
*** jcoufal has joined #openstack-infra | 12:34 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add reference pipelines file for Github driver https://review.opendev.org/672712 | 12:41 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Add change replacement field in doc for start-message https://review.opendev.org/665974 | 12:44 |
*** joeguo has quit IRC | 12:47 | |
*** aaronsheffield has joined #openstack-infra | 12:56 | |
*** yamamoto has joined #openstack-infra | 13:00 | |
*** ekultails has joined #openstack-infra | 13:01 | |
*** gtarnaras has joined #openstack-infra | 13:06 | |
*** rfarr has joined #openstack-infra | 13:07 | |
*** rfarr_ has joined #openstack-infra | 13:07 | |
*** bhavikdbavishi has quit IRC | 13:07 | |
*** bhavikdbavishi has joined #openstack-infra | 13:10 | |
*** yamamoto has quit IRC | 13:14 | |
*** udesale has joined #openstack-infra | 13:16 | |
*** ykarel is now known as ykarel|away | 13:22 | |
*** jhesketh has quit IRC | 13:22 | |
*** jaosorior has joined #openstack-infra | 13:23 | |
*** jhesketh has joined #openstack-infra | 13:26 | |
*** ykarel_ has joined #openstack-infra | 13:27 | |
petevg | I've got a question about the outage earlier: I had a change that got merged around the same time as the outage, and it seems to have been merged to gerrit's view of the master branch, but not to origin's view of the master branch. | 13:29 |
petevg | This is https://opendev.org/x/microstack | 13:29 |
*** ykarel|away has quit IRC | 13:29 | |
petevg | My local view of the change that didn't get merged to origin looks like this: | 13:29 |
petevg | commit 59551ca2cdf387fb3a1e857f3aeb89912731e3f2 (HEAD -> master, gerrit/master, multipass-testing-support) | 13:30 |
petevg | As opposed to my local view of the last change to appear in "origin's" master: | 13:30 |
petevg | commit 8ea5dc8679eea1921888fec1a3d468c0b3ae09ce (origin/master, origin/HEAD) | 13:30 |
petevg | Does anybody have a suggestion for a fix? I'm thinking of just running git review on my local copy of master, which I've manually pulled from gerrit, to see if that triggers the gate to fix things ... | 13:31 |
*** goldyfruit has joined #openstack-infra | 13:32 | |
AJaeger | petevg: what is link for the change? | 13:32 |
*** ykarel_ has quit IRC | 13:32 | |
petevg | AJaeger: https://review.opendev.org/#/c/672586/ | 13:32 |
AJaeger | petevg: where exactly are you missing it? | 13:33 |
petevg | AJaeger: if I git clone https://opendev.org/x/microstack.git, the change doesn't show up in the master branch. | 13:34 |
petevg | AJaeger: (also, if I just "git pull origin master" on the previously cloned repo.) | 13:34 |
AJaeger | petevg: I see it on https://opendev.org/x/microstack - let me check cloning | 13:34 |
AJaeger | petevg: I just downloaded and it's there... | 13:35 |
AJaeger | it's also here https://opendev.org/x/microstack/commit/59551ca2cdf387fb3a1e857f3aeb89912731e3f2 | 13:35 |
petevg | AJaeger: yeah. I see it there, too. That's why I pasted the commit lines from git log above. It's in a weird state where it's merged to HEAD and gerrit/master, but not to origin/master. | 13:35 |
petevg | I'll try recloning. Maybe it fixed itself while I was poking at it. | 13:36 |
*** ricolin has quit IRC | 13:36 | |
petevg | AJaeger: nope. It's still not there when you clone. | 13:36 |
AJaeger | It is fine on my end - but we have a git farm. So, if it still fails for you, we need help from an admin to check each of the systems in the git farm - maybe you hit one that is out of sync | 13:36 |
fungi | it's possible that some gitea backends are missing some objects which should have been replicated at the time | 13:37 |
petevg | AJaeger: that would make sense. Just to verify, when you say "download", do you mean that you grabbed a tarball, or that you cloned w/ git? | 13:37 |
fungi | probably best if we force replication to all of them from gerrit just to be sure | 13:37 |
AJaeger | cloned with git - git clone https://opendev.org/x/microstack | 13:37 |
AJaeger | fungi: yeah... | 13:37 |
petevg | AJaeger: cool. fungi: thank you! | 13:38 |
fungi | you can reach them individually without going through the lb like http://gitea08.opendev.org:3080/x/microstack | 13:38 |
petevg | fungi: ooh, cool. I can self service on the troubleshooting next time :-) | 13:39 |
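A rough sketch of the self-service check fungi describes, asking each backend directly whether it has the ref; the assumption that the backends are named gitea01 through gitea08 is based only on the hosts mentioned in this log:

    # compare the master tip reported by each gitea backend
    for i in 01 02 03 04 05 06 07 08; do
      echo "gitea${i}:"
      git ls-remote "http://gitea${i}.opendev.org:3080/x/microstack" refs/heads/master
    done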
fungi | anyway, mass replicating to all of them is likely a good precaution but it will take some hours to complete and will delay replication of newer refs | 13:40 |
petevg | fungi: If I've got a new ref ready to merge, will that fix it? | 13:40 |
petevg | Because I'm selfishly okay w/ that. I don't know whether anybody else was affected, though. | 13:41 |
fungi | petevg: for that one repo, it should | 13:41 |
fungi | odds are there are plenty of missing refs if there's at least one | 13:41 |
*** wpp has quit IRC | 13:41 | |
petevg | Yeah ... | 13:41 |
*** rfarr_ has quit IRC | 13:41 | |
*** rfarr has quit IRC | 13:41 | |
petevg | I won't complain about any delays when/if you decide to kick off the mass replication, then. I have a lot of meetings today, anyway :-) | 13:42 |
fungi | i'll give #openstack-release a heads up so they don't approve any openstack release changes while this is still going on | 13:42 |
*** jaosorior has quit IRC | 13:43 | |
*** apetrich has quit IRC | 13:43 | |
*** yamamoto has joined #openstack-infra | 13:44 | |
fungi | ~17k gerrit replication tasks queued | 13:47 |
*** apetrich has joined #openstack-infra | 13:47 | |
*** yamamoto has quit IRC | 13:48 | |
AJaeger | thanks! | 13:48 |
openstackgerrit | Merged opendev/system-config master: Remove gitea02 from inventory so we can replace it https://review.opendev.org/672621 | 13:54 |
*** iurygregory has quit IRC | 13:59 | |
*** iurygregory has joined #openstack-infra | 14:02 | |
*** eernst has joined #openstack-infra | 14:02 | |
openstackgerrit | Merged openstack/project-config master: Cleanup in-tree removed jobs https://review.opendev.org/671412 | 14:03 |
*** yamamoto has joined #openstack-infra | 14:04 | |
*** yamamoto has quit IRC | 14:04 | |
*** goldyfruit has quit IRC | 14:07 | |
*** ykarel_ has joined #openstack-infra | 14:08 | |
*** wpp has joined #openstack-infra | 14:09 | |
clarkb | fungi: one trick to make it go faster is to only replicate to the gitea backends (then github and local /p are left alone) | 14:13 |
fungi | that's what i did | 14:14 |
*** gtarnaras has quit IRC | 14:14 | |
*** gtarnaras has joined #openstack-infra | 14:14 | |
fungi | in retrospect i should have skipped 02 since we're about to rip it out | 14:14 |
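For context, the forced replication mentioned above is driven by Gerrit's replication plugin over SSH; a hypothetical sketch of limiting it to the gitea targets, as clarkb suggests (the admin account placeholder and the "gitea" URL pattern are assumptions, not taken from this log):

    # re-replicate every project, but only to remotes whose URL matches "gitea"
    ssh -p 29418 <admin>@review.opendev.org replication start --all --url gitea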
*** ian-pittwood has joined #openstack-infra | 14:15 | |
*** goldyfruit has joined #openstack-infra | 14:16 | |
*** wpp has quit IRC | 14:18 | |
*** bobh has joined #openstack-infra | 14:23 | |
*** dpawlik has quit IRC | 14:28 | |
ian-pittwood | I'm currently stumped by a problem I am having with Zuul. I have a tox job that I need to run in a py36 environment. I know that Zuul uses py35 by default so I added a line to set the bindep_profile to use py36. Unfortunately that didn't seem to help as the job still fails, stating that py36 wasn't found. Does anyone know what I might be missing? | 14:30 |
ian-pittwood | Here's the zuul.yaml in question https://review.opendev.org/#/c/672599/4/.zuul.yaml | 14:30 |
clarkb | ian-pittwood: You likely need to change the nodeset. Ubuntu xenial has py35 but not 36. Bionic has py36. There should be existing py36 jobs you can use too | 14:31 |
ian-pittwood | Ok, I'll give that a try. Thank you | 14:32 |
*** ccamacho has joined #openstack-infra | 14:32 | |
clarkb | but this specific issue is related to your nodeset | 14:32 |
*** ysastri has joined #openstack-infra | 14:40 | |
*** yikun has quit IRC | 14:40 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: support alternate portage directories https://review.opendev.org/671530 | 14:42 |
*** yamamoto has joined #openstack-infra | 14:43 | |
*** eernst has quit IRC | 14:47 | |
*** yamamoto has quit IRC | 14:53 | |
*** ccamacho has quit IRC | 14:53 | |
*** jjohnson42 has joined #openstack-infra | 14:58 | |
jjohnson42 | So I have an issue where it says 'Change has been successfully merged by Zuul' but I don't see it in the opendev git repo? | 14:59 |
*** roman_g has quit IRC | 14:59 | |
AJaeger | jjohnson42: we had some downtime this morning and are currently replicating everything to our git farm to ensure the servers are in sync. So, I hope this will be fixed in a few hours... | 15:00 |
*** ricolin_phone has joined #openstack-infra | 15:00 | |
fungi | jjohnson42: yeah, we're down to 1.25k replication tasks queued so should be caught up in the next couple hours | 15:01 |
fungi | er, 12.5k i mean | 15:01 |
jjohnson42 | ok, figured it would be something well known, just asking to double check, thanks for the info | 15:01 |
fungi | what's an order of magnitude among friends? ;) | 15:01 |
mordred | fungi: I dunno, joey vs chandler? | 15:02 |
fungi | i'm doing my best to forget that i have context to parse that punchline | 15:02 |
*** rlandy|mtg has quit IRC | 15:05 | |
*** jpena|mtg is now known as jpena|off | 15:07 | |
*** siqbal has quit IRC | 15:12 | |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: test operator / iptables https://review.opendev.org/672755 | 15:17 |
*** dklyle has quit IRC | 15:17 | |
*** _erlon_ has joined #openstack-infra | 15:18 | |
*** dklyle has joined #openstack-infra | 15:18 | |
*** dkopper has quit IRC | 15:20 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 15:21 |
*** larainema has quit IRC | 15:21 | |
*** ricolin has joined #openstack-infra | 15:22 | |
*** siqbal has joined #openstack-infra | 15:23 | |
*** e0ne has quit IRC | 15:23 | |
*** kopecmartin is now known as kopecmartin|off | 15:24 | |
mordred | fungi: this punchline is cut in half. I'd like to exchange it for a punchline that is NOT ... cut in half. | 15:24 |
*** pgaxatte has quit IRC | 15:25 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 15:25 |
*** gfidente has quit IRC | 15:26 | |
clarkb | 11k tasks. I do wonder if it goes faster when we do them one or two at a time | 15:27 |
clarkb | still waiting for gitea02 removal to show up on bridge (likely due to the replication backlog) | 15:27 |
*** odicha has quit IRC | 15:28 | |
clarkb | that must be gerrit's way of telling me to go on an early bike ride | 15:29 |
*** Goneri has quit IRC | 15:31 | |
*** ricolin_ has joined #openstack-infra | 15:33 | |
*** siqbal has quit IRC | 15:34 | |
*** ricolin_phone has quit IRC | 15:34 | |
*** ricolin has quit IRC | 15:36 | |
*** ricolin_ is now known as ricolin | 15:38 | |
*** adriancz has quit IRC | 15:39 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Assure ensure-tox installs latest tox version https://review.opendev.org/672760 | 15:39 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Improve SQL query performance in some cases https://review.opendev.org/672606 | 15:39 |
zbr_ | clarkb: mordred ^ i hope I explained the reasoning for the ensure-tox change well. i am curious what you think. | 15:40 |
AJaeger | do we need to sync to codesearch as well? Or will it be updated once the replication is done? | 15:41 |
clarkb | zbr: that would break any users that might preselect a working tox in their image builds | 15:42 |
clarkb | AJaeger: I think codesearch pulls from opendev.org on its own so should self correct once opendev is up to date | 15:42 |
clarkb | (codesearch is the #3 requestor to opendev when I looked) | 15:42 |
zbr_ | clarkb: depends how they call it. if they call it with full path, it should not. | 15:43 |
AJaeger | clarkb: great, thanks | 15:43 |
clarkb | zbr: unless that path is in the user-install venv | 15:43 |
clarkb | zbr: we have had to do this a couple of times in the past due to changes in tox breaking backward compat | 15:44 |
zbr_ | yep, and I already see jobs failing. any ideas? | 15:44 |
*** gyee has joined #openstack-infra | 15:44 | |
clarkb | I would add a separate upgrade tox step to jobs that know they always want the latest version | 15:44 |
zbr_ | i could add a variable that tells it to update or not, default not to. | 15:44 |
zbr_ | in fact it is even worse: i need to remove the system one to be sure it will work. | 15:44 |
zbr_ | clarkb: i discovered an hour ago that i was not able to add new stuff to a tox.ini file because the repository was running tox-docs on centos7, which happens to have tox 1.6. | 15:46 |
clarkb | running the job on a different node type is probably the quickest path forward there | 15:46 |
corvus | zbr_: seems to me that maybe someone setting that job up wanted to make sure that development could happen on centos7? | 15:46 |
zbr_ | so I am trying to find a solution that would not break existing systems | 15:47 |
clarkb | corvus: ya that is similar to my other concern | 15:47 |
clarkb | basically that tox version choice may be intentional | 15:47 |
zbr_ | clarkb: it is not intentional in this case. so is it ok if I add a parameter to change the behavior? so only those wanting the latest would get it. | 15:48 |
*** altlogbot_0 has quit IRC | 15:48 | |
*** ykarel_ is now known as ykarel|away | 15:49 | |
clarkb | a flag to opt into upgrading would probably be ok | 15:49 |
*** altlogbot_2 has joined #openstack-infra | 15:49 | |
*** marios|ruck has quit IRC | 15:49 | |
*** tesseract has quit IRC | 15:50 | |
zbr_ | here is an interesting finding: upgrading tox as a user breaks tox on systems that do not have ~/.local/bin in PATH (aka CentOS7, newer ones do have it) | 15:52 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: test operator / iptables https://review.opendev.org/672755 | 15:53 |
zbr_ | so in this particular case one user cannot have a working system-tox and a working tox-in-user-dir -- one of them will fail to import. | 15:53 |
zbr_ | workarounds: calling tox with `python -m tox` | 15:53 |
zbr_ | or removing the old one. me being inclined to like the module-calling method in general. | 15:54 |
*** ginopc has quit IRC | 15:54 | |
zbr_ | only the script is broken, module works fine, both versions. | 15:54 |
zbr_ | another approach would be to check if ~/.local/bin is in PATH and add it before calling tox, but it is a bit ugly. | 15:55 |
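A minimal sketch of the workaround zbr_ settles on, upgrading tox into the user site and invoking it as a module so the stale system entry point and PATH ordering stop mattering (the -e target is only an example):

    # upgrade tox for the current user only
    pip install --user --upgrade tox
    # call it via the interpreter instead of the (possibly broken) tox script
    python -m tox -e pep8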
*** siqbal has joined #openstack-infra | 15:57 | |
openstackgerrit | Merged zuul/zuul-jobs master: Skip test-setup.sh in pep8 jobs https://review.opendev.org/670133 | 15:57 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Assure ensure-tox installs latest tox version https://review.opendev.org/672760 | 15:58 |
*** cdent has joined #openstack-infra | 16:05 | |
yoctozepto | jjohnson42: re: opendev.org - I reconfigured my repos to use review.opendev.org, also wanted to report my repos are not in sync | 16:06 |
cdent | how long does it normally take for a patch to show up in opendev.org master? https://review.opendev.org/#/c/672298/ is in gerrit/master but not origin/master (where origin is opendev.org) | 16:07 |
cdent | ah. | 16:07 |
cdent | seems it is already being discussed | 16:07 |
fungi | cdent: yoctozepto: yep, we're down to 9.6k remaining replication tasks in the queue | 16:08 |
*** gtarnaras has quit IRC | 16:08 | |
cdent | I assume that's fallout from the earlier disk issues? | 16:08 |
fungi | yep, since there were block device problems in the provider hosting the gitea servers, they ended up missing some git objects, so i initiated a full replication of all repositories to them to make sure any missing objects are fixed | 16:09 |
fungi | but that causes all replication for new refs to queue up behind that | 16:09 |
*** wpp has joined #openstack-infra | 16:10 | |
yoctozepto | fungi: thanks for background | 16:11 |
cdent | ditto | 16:11 |
mnaser | fungi: i wonder if long term, it would be faster to replicate to a 'master' gitea node that then replicates to a bunch of other ones | 16:11 |
mnaser | eliminating latency and reducing load on the gerrit server too | 16:11 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 16:12 |
fungi | mnaser: long term we want gitea servers to be able to share a backend | 16:13 |
openstackgerrit | James E. Blair proposed zuul/zuul-operator master: WIP: test operator / iptables https://review.opendev.org/672755 | 16:13 |
fungi | mnaser: but there are some enhancements it needs to be able to support that | 16:13 |
mnaser | Gotcha | 16:13 |
fungi | our original deployment model involved only replicating to one, and it mostly worked accidentally | 16:14 |
fungi | but gitea isn't actually designed for that (yet) so it stopped working when we upgraded | 16:14 |
*** mattw4 has joined #openstack-infra | 16:14 | |
fungi | and so the current design with independent backends is a workaround for now | 16:15 |
corvus | work is in progress to support that | 16:15 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 16:16 |
cdent | thank fungi, now back to my reguarly scheduled assorted manyness | 16:17 |
*** iurygregory has quit IRC | 16:20 | |
*** lucasagomes has quit IRC | 16:21 | |
*** ykarel|away has quit IRC | 16:22 | |
*** mattw4 has quit IRC | 16:23 | |
*** mattw4 has joined #openstack-infra | 16:23 | |
*** rascasoft has quit IRC | 16:23 | |
*** rascasoft has joined #openstack-infra | 16:27 | |
*** lpetrut has quit IRC | 16:29 | |
mordred | mnaser: in fact, once the work in progress to support single-shared-gitea is done, it would be made even better by manila-cephfs - so there are several future improvement possibilities | 16:33 |
mnaser | mordred: forever hinting at the need/want of manila-cephfs :P | 16:33 |
mnaser | soon(tm) | 16:34 |
mnaser | :p | 16:34 |
mordred | mnaser: it's how I let you know I care ;) | 16:34 |
*** cdent has left #openstack-infra | 16:35 | |
*** rpittau is now known as rpittau|afk | 16:36 | |
*** ricolin has quit IRC | 16:39 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: install-openshift: bump version to 3.11.0 https://review.opendev.org/672785 | 16:40 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Add clear-firewall role https://review.opendev.org/672786 | 16:41 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 16:41 |
openstackgerrit | Tristan Cacqueray proposed zuul/nodepool master: DNM: test openshift version bump https://review.opendev.org/672788 | 16:43 |
*** ykarel|away has joined #openstack-infra | 16:45 | |
*** pkopec has quit IRC | 16:48 | |
*** dtantsur is now known as dtantsur|afk | 16:49 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 16:50 |
*** chandankumar is now known as raukadah | 16:52 | |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: Add telnet to Docker Image https://review.opendev.org/672791 | 16:53 |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: Add telnet to Docker Image https://review.opendev.org/672791 | 16:56 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: install-openshift: bump version to 3.11.0 https://review.opendev.org/672785 | 16:57 |
*** igordc has quit IRC | 16:58 | |
*** igordc has joined #openstack-infra | 16:58 | |
openstackgerrit | Tristan Cacqueray proposed zuul/nodepool master: DNM: test openshift version bump https://review.opendev.org/672788 | 16:58 |
*** ysastri has quit IRC | 16:59 | |
*** jcoufal_ has joined #openstack-infra | 17:03 | |
fungi | it's under 9000 | 17:04 |
*** jcoufal has quit IRC | 17:07 | |
*** roman_g has joined #openstack-infra | 17:08 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 17:10 |
*** diablo_rojo has joined #openstack-infra | 17:11 | |
*** armax has quit IRC | 17:13 | |
*** ian-pittwood has quit IRC | 17:19 | |
*** odicha has joined #openstack-infra | 17:19 | |
*** betherly has joined #openstack-infra | 17:19 | |
*** odicha_ has joined #openstack-infra | 17:21 | |
*** odicha__ has joined #openstack-infra | 17:22 | |
*** ralonsoh has quit IRC | 17:24 | |
*** betherly has quit IRC | 17:24 | |
*** igordc has quit IRC | 17:28 | |
*** odicha__ has quit IRC | 17:28 | |
*** odicha has quit IRC | 17:28 | |
*** odicha_ has quit IRC | 17:28 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 17:30 |
*** odicha has joined #openstack-infra | 17:33 | |
*** odicha has quit IRC | 17:33 | |
*** udesale has quit IRC | 17:34 | |
*** odicha has joined #openstack-infra | 17:36 | |
*** siqbal has quit IRC | 17:36 | |
*** bobh has quit IRC | 17:39 | |
*** weifan has joined #openstack-infra | 17:45 | |
*** odicha_ has joined #openstack-infra | 17:46 | |
clarkb | bringing the security group discussion here. Historically the two major issues with them have been 1) rax didn't support security groups and 2) they were very inefficient with group-to-group rules (which we'd need to rely on for multinode testing and the like) on the database. I believe rax has security groups now and that the database is no longer as sad about security groups | 17:47 |
clarkb | I think that means we could reconsider them as an option for preventing open dns resolvers and such on the internet then remove our firewall rules from the test nodes entirely | 17:47 |
*** goldyfruit has quit IRC | 17:49 | |
clarkb | Then zuul testing and everyone else's testing doesn't have to worry about modifying firewall rules at job time | 17:49 |
*** psachin has quit IRC | 17:50 | |
*** armax has joined #openstack-infra | 17:50 | |
weifan | Has there been any changes to tag pushing? | 17:50 |
weifan | I was trying to push a new tag using following remote, which used to work.. | 17:50 |
weifan | ssh://<username>@review.opendev.org:29418/x/<project_name> | 17:50 |
weifan | Right now it says the push is completed, and I could also find it on pypi. But I don't see the tag on opendev for some reason.. | 17:50 |
clarkb | weifan: there was a cloud outage a little while ago that prevented us from replicating gerrit repo data to the opendev backends. That outage has been corrected and we are now in the process of re-replicating everything to gitea to ensure it is up to date | 17:51 |
clarkb | weifan: when that process completes your tag should be present on opendev, but until then it is somewhere in the queue | 17:51 |
weifan | i see, thanks :) | 17:51 |
clarkb | at this rate I'm guessing a few more hours? | 17:51 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 17:54 |
clarkb | I think we could turn on security groups with our existing images (we'll just be double firewalled), then if that doesn't break anything remove the firewalls from the images. The transition should be fairly safe (and if adding security groups does break something, revert the cloud launcher change) | 17:55 |
fungi | clarkb: yeah, it's possible we could orchestrate whitelist security groups over each of the job node tenant networks... as long as things like temporary docker registries coexist in the same region as the builds which connect to them | 17:56 |
fungi | otherwise i think we're stuck with a blacklist model instead | 17:56 |
clarkb | fungi: I believe zuul enforces that requirement currently, but good point we should double check that | 17:56 |
fungi | basically if we can assume that builds which interact with each other will only attempt to connect to job nodes in the same provider/region then it's probably pretty straightforward | 17:57 |
clarkb | I'm 99% sure zuul does enforce that locality requirement (probably because we were thinking about stuff like this) | 17:57 |
clarkb | corvus would likely know 100% | 17:58 |
*** odicha has quit IRC | 17:58 | |
*** odicha_ has quit IRC | 17:59 | |
clarkb | and we should double check that security groups do work on rax (their docs say you can do that with public cloud so I expect it to work) | 17:59 |
*** jtomasek has quit IRC | 18:01 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 18:03 |
*** goldyfruit has joined #openstack-infra | 18:04 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 18:05 |
mordred | clarkb, fungi: what's the email address we're using for when we need an opedev root email address? infra-root@openstack.org still? | 18:11 |
*** bobh has joined #openstack-infra | 18:11 | |
clarkb | mordred: yes | 18:11 |
*** mattw4 has quit IRC | 18:11 | |
mordred | clarkb: thx | 18:11 |
*** mattw4 has joined #openstack-infra | 18:11 | |
*** dklyle has quit IRC | 18:11 | |
clarkb | fungi: re locality I remember why we enforce that, it is because some clouds have ipv6 only and others are ipv4 only so we can't assume they can talk to each other even if firewalls are wide open | 18:11 |
*** dklyle has joined #openstack-infra | 18:12 | |
clarkb | the firewalls are 1980s wood paneling | 18:12 |
mordred | such lovely wood paneling | 18:12 |
*** priteau has quit IRC | 18:13 | |
fungi | yup | 18:13 |
fungi | okay, so a fairly simple (22/tcp from everywhere) whitelist is probably sufficient? | 18:13 |
clarkb | fungi: and an in group wide open rule (security group members can talk to themselves) | 18:14 |
fungi | though to allow instance-to-instance traffic we have to add the instances to groups | 18:14 |
fungi | yeah, that | 18:14 |
clarkb | that is a thing you can express in the rules too | 18:14 |
fungi | is there a default group they appear in automatically? | 18:14 |
clarkb | there is a default group | 18:14 |
clarkb | and by default that group has the talk to myself rule (but our cloud launcher removes it currently) | 18:15 |
*** auristor has quit IRC | 18:15 | |
*** jamesmcarthur has joined #openstack-infra | 18:17 | |
*** auristor has joined #openstack-infra | 18:17 | |
clarkb | we also need to open the zuul console log port | 18:17 |
clarkb | ssh + console log port + in group connectivity. Anything else missing? | 18:17 |
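A hypothetical sketch of those rules with the openstack CLI; the group name, the wide-open 0.0.0.0/0 scope, and 19885 as the Zuul console log port are assumptions for illustration, not decisions from this discussion:

    # group for test nodes: ssh and the Zuul console log stream from anywhere
    openstack security group create test-nodes
    openstack security group rule create --protocol tcp --dst-port 22 --remote-ip 0.0.0.0/0 test-nodes
    openstack security group rule create --protocol tcp --dst-port 19885 --remote-ip 0.0.0.0/0 test-nodes
    # members of the group can reach each other on any protocol
    openstack security group rule create --protocol any --remote-group test-nodes test-nodes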
*** bobh has quit IRC | 18:17 | |
openstackgerrit | Merged zuul/zuul master: Improve SQL query performance in some cases https://review.opendev.org/672606 | 18:18 |
*** dims has quit IRC | 18:19 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 18:21 |
*** roman_g has quit IRC | 18:22 | |
*** igordc has joined #openstack-infra | 18:22 | |
*** roman_g has joined #openstack-infra | 18:23 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Use cloud security groups for test node isolation https://review.opendev.org/672806 | 18:28 |
clarkb | fungi: mordred ^ that's roughly what it would look like (and applied to vexxhost mtl1 only in that change if we want to merge it; only gpu test nodes reside there currently) | 18:29 |
*** dims has joined #openstack-infra | 18:29 | |
clarkb | I believe the default ruleset is applied to instances by default if you don't specify one, so nothing in nodepool would have to change either | 18:30 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 18:30 |
*** weifan has quit IRC | 18:35 | |
mordred | clarkb: udp? | 18:36 |
clarkb | mordred: we don't need udp inbound do we? | 18:36 |
mordred | oh- default group rule is typeless | 18:36 |
clarkb | (I think iptables treats udp as "stateful" so the outbound dns requests should get responses) | 18:36 |
clarkb | mordred: ya | 18:36 |
mordred | (was more thinking instance-to-instance traffic) | 18:36 |
*** goldyfruit has quit IRC | 18:38 | |
clarkb | 5.9k tasks to go now | 18:39 |
*** eharney has quit IRC | 18:39 | |
*** betherly has joined #openstack-infra | 18:41 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 18:42 |
fungi | yeah, we're down to ~1/3 of the replication backlog remaining | 18:44 |
fungi | going to try and knock out some yardwork so that my evening is free to work on gitea server replacement stuff | 18:44 |
*** betherly has quit IRC | 18:45 | |
*** fdegir has quit IRC | 18:45 | |
*** fdegir has joined #openstack-infra | 18:46 | |
*** ykarel|away has quit IRC | 18:49 | |
clarkb | I think if we want to move ahead with that change the next two things to do would be to confirm it doesn't break anything (by applying it to vexxhost as proposed) and also to try and apply it to the rax regions | 18:50 |
clarkb | since "will it work with rax" and "will it not break existing jobs" are the two big questions | 18:50 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 18:51 |
corvus | fungi, clarkb: okay so on the firewall thing -- let me summarize and see if we're on the same page: 1) the firewall is good because it's easy for folks to mess up and accidentally create an open proxy/resolver/etc. 2) we give folks root, they can disable it if they need to. 3) it's good to have that speedbump though so that they have to think about it, so we should not remove it from the base | 18:51 |
corvus | images. 4) it is reasonable to disable the firewall for the k8s case because the very next step is that k8s is going to create a bunch of firewall rules that are not going to allow undue external access. 5) we could consider using security groups in our providers as a replacement for the firewall (but that's going to take some careful engineering since we have jobs which communicate cross-region) | 18:51 |
clarkb | corvus: yes basically and maybe 6) the major historical reasons for not using security groups are no longer present (according to neutron and rax docs) | 18:52 |
clarkb | corvus: what jobs communicate cross region? I seem to recall we couldn't do that due to ipv6 and ipv4 only clouds existing | 18:52 |
*** boden has joined #openstack-infra | 18:52 | |
fungi | i concur with the summary | 18:53 |
boden | hi... wondering if anyone has any pointers on a functional job failure related to "Error when trying to get requirement for VCS system" as shown in http://logs.openstack.org/25/672725/5/check/neutron-classifier-functional-dsvm/5eb2c85/job-output.txt.gz#_2019-07-25_18_41_29_579110 | 18:53 |
boden | is this because keystone is not in the test-requirements.txt maybe? | 18:54 |
*** mriedem has quit IRC | 18:54 | |
mordred | boden: that's not actually an error | 18:55 |
clarkb | boden: http://logs.openstack.org/25/672725/5/check/neutron-classifier-functional-dsvm/5eb2c85/job-output.txt.gz#_2019-07-25_18_40_32_918662 is the error | 18:56 |
mordred | boden: it's an unfortunate error printed by pip because of the lack of origin remote in the repos - but is harmless ... ^^ what clarkb said | 18:56 |
corvus | clarkb: i think jobs that use the buildset registry may do that (and yes, it's a pita) | 18:56 |
clarkb | you are running into ERROR_ON_CLONE because devstack needs to clone some repos but we've told it that isn't allowed. The way to address that is to add them to the required projects of the job or remove those services from the devstack config | 18:56 |
clarkb | boden: ^ | 18:56 |
boden | clarkb mordred thanks for that | 18:57 |
yoctozepto | did iad.rax experience issues with the epel mirror around 16:50 UTC? because different images failed to build due to different 404 packages | 18:57 |
clarkb | corvus: in cases where we pause a job with a buildset registry then other jobs consume from that? for some reason I thought we did restrict that to the same region | 18:57 |
mordred | yeah - I thought the same thing | 18:57 |
mordred | but I am most likely just wrong | 18:57 |
clarkb | yoctozepto: that is our kafs canary, that implies the fixes for falling back to the second afs server are not working | 18:57 |
clarkb | yoctozepto: can you provide direct links to where that happens? it will help us and possibly the kernel devs debug | 18:58 |
corvus | clarkb, fungi, mordred: multinode jobs are restricted to the same region, but jobs which depend on other jobs aren't | 18:58 |
clarkb | corvus: got it | 18:58 |
fungi | if a job paused to serve a registry in limestone (global v6 access only) and then the build trying to use that ran in ovh (no global ipv6 egress routing) they'd be unable to talk | 18:58 |
clarkb | corvus: considering that we can't rely on that cross cloud region communication working anyway (regardless of where we put the firewall) I think we may want to fix that anyway? | 18:58 |
*** cshen has quit IRC | 18:59 | |
yoctozepto | clarkb: e.g. here http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-source/1ea1696/logs/build/ - timestamps on _FAILED_ | 18:59 |
yoctozepto | though it only pinpoints the time | 18:59 |
yoctozepto | 404 is generic ;-) | 18:59 |
clarkb | yoctozepto: why do your log files not have timestamps in them? | 18:59 |
yoctozepto | clarkb: this one does: http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-source/1ea1696/job-output.txt.gz | 19:00 |
yoctozepto | though it's all-in-one | 19:00 |
clarkb | yoctozepto: 404 is generic but we know it happens in kafs when the filesystem is being updated and clients are supposed to fall back to the secondary fs, however kafs wasn't doing that and we are running proposed changes that are supposed to fix that in kafs which I'm guessing they don't. That feedback is useful to the kernel | 19:00 |
yoctozepto | ok, then I pinpoint the time for ya | 19:01 |
clarkb | and ya if we have the timestamp we can check if the fs was updating at that time to correlate the two events | 19:01 |
yoctozepto | grep "HTTP Error 404" on http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-source/1ea1696/job-output.txt.gz | 19:01 |
*** betherly has joined #openstack-infra | 19:01 | |
yoctozepto | nice timestamps | 19:01 |
clarkb | yoctozepto: note you can direct link to the timestamps on that file | 19:02 |
corvus | clarkb, fungi: maybe we need to fix that by getting ipv6 in ovh | 19:02 |
clarkb | http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-source/1ea1696/job-output.txt.gz#_2019-07-25_16_44_54_226419 for example | 19:02 |
clarkb | corvus: and inap iirc | 19:02 |
clarkb | and rax | 19:02 |
clarkb | (we only support ipv6 on rax on debuntu hosts) | 19:02 |
yoctozepto | more here: http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-binary/0d3fc67/job-output.txt.gz | 19:03 |
*** mriedem has joined #openstack-infra | 19:03 | |
zbr_ | AJaeger: clarkb: i made the required changes to ensure-tox, if you can have another look it would be great. | 19:03 |
zbr_ | https://review.opendev.org/#/c/672760/ | 19:03 |
yoctozepto | clarkb: thanks, you are right, though there are many to share | 19:03 |
clarkb | yoctozepto: we only need the one probably | 19:03 |
clarkb | just enough to correlate to an updating afs volume | 19:03 |
corvus | zbr_, AJaeger: that sort of change should have a test job | 19:04 |
yoctozepto | http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-binary/0d3fc67/job-output.txt.gz#_2019-07-25_16_39_23_606826 | 19:04 |
yoctozepto | ^ earliest probably | 19:04 |
yoctozepto | seems it hit epel only | 19:04 |
yoctozepto | centos mirror seems to have worked | 19:04 |
clarkb | yoctozepto: they are separate afs volumes iirc (though I'll double check that when I look at this more closely) | 19:04 |
clarkb | currently about to consume lunch | 19:04 |
*** bobh has joined #openstack-infra | 19:05 | |
*** betherly has quit IRC | 19:05 | |
*** weifan has joined #openstack-infra | 19:07 | |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: [WIP] Verify Operator Pod Running https://review.opendev.org/670395 | 19:08 |
corvus | clarkb, fungi, mordred: jobs which depend on other paused jobs *request* nodes from the same provider, and will get them if that provider is still online. | 19:08 |
corvus | clarkb, fungi, mordred: so that case should usually not be a problem | 19:08 |
corvus | only in weird edge cases (like a provider going offline during a buildset) | 19:08 |
corvus | (in that case, it'll fall back on letting any provider fulfill it) | 19:09 |
*** igordc has quit IRC | 19:09 | |
*** tosky has quit IRC | 19:10 | |
corvus | clarkb, fungi, mordred: and we're talking nodepool provider here, so that's a cloud-region combo | 19:10 |
corvus | could come from a different 'pool' though | 19:11 |
*** bobh has quit IRC | 19:11 | |
*** weifan has quit IRC | 19:11 | |
clarkb | our nodepool providers are per region | 19:12 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 19:13 |
clarkb | From that I think we'd be ok except for the fallback case that we also risk breaking in the ipv4 vs ipv6 case, however with security groups that would be a hard fail all the time rather than a sometimes fail | 19:14 |
zbr_ | corvus: done, added test jobs and referenced it with needed-by. see https://review.rdoproject.org/r/#/c/21594/ | 19:14 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-operator master: WIP: Add zuul-operator-functional-openshift job https://review.opendev.org/672756 | 19:15 |
*** xek_ has joined #openstack-infra | 19:15 | |
* clarkb lunches | 19:16 | |
*** xek has quit IRC | 19:17 | |
*** bhavikdbavishi has quit IRC | 19:18 | |
*** igordc has joined #openstack-infra | 19:25 | |
*** dims has quit IRC | 19:30 | |
*** goldyfruit has joined #openstack-infra | 19:32 | |
*** igordc has quit IRC | 19:32 | |
AJaeger | zbr_: we have in-tree test jobs nowadays in zuul-jobs, have a look at the zuul-tests.d/ directory | 19:38 |
*** rfarr has joined #openstack-infra | 19:38 | |
*** rfarr has quit IRC | 19:38 | |
*** e0ne has joined #openstack-infra | 19:39 | |
*** jamesmcarthur has quit IRC | 19:40 | |
*** jamesmcarthur has joined #openstack-infra | 19:41 | |
*** joeguo has joined #openstack-infra | 19:44 | |
*** rascasoft has quit IRC | 19:45 | |
*** jamesmcarthur has quit IRC | 19:46 | |
*** rascasoft has joined #openstack-infra | 19:47 | |
zbr_ | AJaeger: no problem with me, so you want one more job that uses this new param and triggers when someone edits this role, right? | 19:48 |
zbr_ | i personally prefer using molecule to test ansible roles, as I can easily test lots of usecases in seconds, and locally too. maybe I should make a demonstration | 19:49 |
fungi | what's nice about the existing jobs is they exercise these roles the way they'll be used in ci jobs, rather than in an abstract framework | 19:59 |
*** michael-beaver has joined #openstack-infra | 20:02 | |
*** betherly has joined #openstack-infra | 20:02 | |
*** betherly has quit IRC | 20:07 | |
*** jcoufal_ has quit IRC | 20:07 | |
*** igordc has joined #openstack-infra | 20:08 | |
*** jamesmcarthur has joined #openstack-infra | 20:11 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Add clear-firewall role https://review.opendev.org/672786 | 20:15 |
corvus | zbr_: a third-party test is great, but how about a first party test? :) AJaeger had some suggestions there | 20:17 |
corvus | zbr_: this may be a candidate for testing on different platforms too; there are examples for that | 20:18 |
*** jamesmcarthur has quit IRC | 20:19 | |
zbr_ | corvus: sure. which platforms/versions you want me to cover? | 20:20 |
corvus | zbr_: at least ubuntu-bionic (the default) plus any you don't want to break. since centos7 was a concern, you may want to include that. | 20:20 |
corvus | zbr_: there's a special macro you can use if you think it should be tested on all platforms | 20:21 |
corvus | zbr_: http://lists.zuul-ci.org/pipermail/zuul-discuss/2019-July/000973.html has more info too | 20:21 |
corvus | zbr_: i'm writing a patch for zuul-jobs to update the docs with the info in that ml post | 20:21 |
zbr_ | cool, that was what I expected. i will read that too. | 20:22 |
*** gyee has quit IRC | 20:22 | |
clarkb | I have WIP'd https://review.opendev.org/#/c/672806/1 given that buildset registries may run in different clouds | 20:25 |
corvus | clarkb: did you see my update? | 20:25 |
corvus | clarkb: you were 99% right about that (and i was 1% right) | 20:25 |
clarkb | oh no I missed it then | 20:25 |
corvus | so i don't think it's a problem we need to concern ourselves with | 20:26 |
corvus | see 19:08-19:11 in here; i think you were getting lunch | 20:26 |
fungi | (depends on a provider outage or similar immediate catastrophy) | 20:26 |
corvus | fungi: right | 20:26 |
clarkb | oh neat. Should I remove the WIP then? I guess the question now becomes: do we think that this is worth pursuing as it will take some measured rollout | 20:26 |
corvus | clarkb: i kinda think so? i like the idea of having a cleaner test env | 20:27 |
clarkb | k I'll remove the WIP then as I think the current ps is a good starting point for testing a rollout | 20:28 |
fungi | it does mean that, e.g., someone manually troubleshooting a job who wants to initiate connections to it other than those allowed by the security groups we apply will be unable to (aside from reverse tunneling or similar complexity). not sure if that's a concern | 20:28 |
fungi | i'm not personally concerned by that aspect, fwiw | 20:29 |
*** weifan has joined #openstack-infra | 20:29 | |
clarkb | looking up epel afs volume update times now | 20:29 |
corvus | i have held nodes running a docker registry and performed local actions from my workstation against them to debug. this would make that harder. not sure if that's a deal killer. | 20:29 |
*** jamesmcarthur has joined #openstack-infra | 20:30 | |
clarkb | ya you'd likely end up doing ssh -L type proxying | 20:30 |
corvus | yep. should suffice i think. | 20:30 |
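A hedged illustration of the ssh -L workaround mentioned above, for the held-node registry case (port and address purely illustrative):

```sh
# Forward local port 5000 to a registry on the held node's loopback, tunnelled
# over ssh; 22/tcp stays open in the security group, so this keeps working
ssh -L 5000:localhost:5000 root@HELD_NODE_IP
# then point local tooling at localhost:5000 on the workstation
```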
fungi | gerrit replication backlog is under 3k now | 20:31 |
fungi | i think we're on track for an 8 hour completion time, which implies that it currently takes ~1 hour to perform full replication to a single gitea backend | 20:32 |
corvus | aren't they in parallel? | 20:32 |
fungi | estimating completion around 21:40z | 20:32 |
clarkb | http://paste.openstack.org/show/754873/ does seem to coincide with http://logs.openstack.org/86/668286/4/check/kolla-build-oraclelinux-binary/0d3fc67/job-output.txt.gz#_2019-07-25_16_39_23_606826 | 20:33 |
clarkb | ianw: ^ re kafs I don't think the fixes for falling back to other servers are working properly | 20:33 |
clarkb | yes it is in parallel | 20:33 |
clarkb | there are N threads per replication target | 20:33 |
clarkb | However, I think it may be faster if we do them one by one? seems like it didn't take me that long to run through them all after OOMs | 20:34 |
*** zbr_ has quit IRC | 20:35 | |
clarkb | I wonder if that implies we should have fewer replication threads (contention being a likely cause of slowdown when run in parallel?) | 20:35 |
fungi | ahh, yeah that i don't know about. because i issued replication commands for each of them one by one (so as to exclude local and github... i couldn't manage to get a glob/regex working) that might have caused them to get serialized? hard to tell from what's left in the backlog at this point but can probably suss it out from cacti graphs | 20:35 |
*** raissa has joined #openstack-infra | 20:36 | |
*** zbr has joined #openstack-infra | 20:37 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Update testing section https://review.opendev.org/672820 | 20:37 |
corvus | AJaeger, zbr: ^ | 20:38 |
*** diablo_rojo has quit IRC | 20:40 | |
*** cshen has joined #openstack-infra | 20:40 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Add clear-firewall role https://review.opendev.org/672786 | 20:41 |
*** harlowja has joined #openstack-infra | 20:43 | |
fungi | corvus: clarkb skimming the active replication processes in the queue, they do appear to be parallelized (~4 active per destination) | 20:45 |
clarkb | looks like we do set it to 4 threads per gitea backend | 20:46 |
clarkb | that is in system-config/modules/openstack_project/manifests/review.pp | 20:47 |
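The backlog numbers quoted throughout come from watching the Gerrit work queue; a hedged example of inspecting it over the ssh API (account name is a placeholder, and the grep is only a rough way to count replication pushes):

```sh
# List pending/active tasks, including replication pushes per gitea backend
ssh -p 29418 user@review.openstack.org gerrit show-queue --wide
# Rough count of remaining replication tasks targeting the gitea backends
ssh -p 29418 user@review.openstack.org gerrit show-queue --wide | grep -c gitea
```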
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 20:49 |
*** mriedem has quit IRC | 20:52 | |
*** mriedem has joined #openstack-infra | 20:53 | |
*** bobh has joined #openstack-infra | 20:53 | |
*** bobh has quit IRC | 20:59 | |
*** gyee has joined #openstack-infra | 20:59 | |
*** jamesmcarthur has quit IRC | 21:00 | |
*** Lucas_Gray has joined #openstack-infra | 21:00 | |
*** jamesmcarthur has joined #openstack-infra | 21:01 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: WIP: Allow ensure-tox to upgrade tox version https://review.opendev.org/672760 | 21:02 |
*** betherly has joined #openstack-infra | 21:03 | |
zbr | corvus: thanks for documenting this, I will try to use it tomorrow as it is 10pm here. For the moment i enabled the tox-molecule job for testing that role (just to compare the two approaches) | 21:04 |
*** jjohnson42 has quit IRC | 21:05 | |
*** cshen has quit IRC | 21:07 | |
*** betherly has quit IRC | 21:08 | |
*** zbr has quit IRC | 21:11 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Build gerrit images for 2.16 and 3.0 as well https://review.opendev.org/672273 | 21:11 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Trim some bazel flags https://review.opendev.org/672274 | 21:12 |
*** ekultails has quit IRC | 21:12 | |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: [WIP] Verify Operator Pod Running https://review.opendev.org/670395 | 21:12 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Add clear-firewall role https://review.opendev.org/672786 | 21:13 |
mordred | corvus, clarkb: https://review.opendev.org/#/c/671457 is ready for re-review - I think I took care of the review comments | 21:13 |
*** jamesmcarthur has quit IRC | 21:13 | |
clarkb | mordred: safe to approve since nothing is using it yet right? | 21:14 |
*** slaweq has quit IRC | 21:15 | |
mordred | clarkb: that's right | 21:15 |
corvus | i agree | 21:15 |
clarkb | done | 21:15 |
mordred | woot! | 21:15 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Update testing section https://review.opendev.org/672820 | 21:17 |
*** cshen has joined #openstack-infra | 21:18 | |
*** diablo_rojo has joined #openstack-infra | 21:19 | |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add note to clear-firewall docs https://review.opendev.org/672829 | 21:20 |
*** cshen has quit IRC | 21:23 | |
*** zbr has joined #openstack-infra | 21:26 | |
*** whoami-rajat has quit IRC | 21:28 | |
*** pcaruana has quit IRC | 21:28 | |
fungi | replication backlog is nearly down to 1k. gonna go grab dinner and by the time i'm done hopefully the haproxy config change will have taken effect and i can rip out gitea02 and start building its replacement | 21:29 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add note to clear-firewall docs https://review.opendev.org/672829 | 21:30 |
*** zbr has quit IRC | 21:32 | |
*** boden has quit IRC | 21:32 | |
*** panda has quit IRC | 21:34 | |
*** panda has joined #openstack-infra | 21:34 | |
openstackgerrit | Merged zuul/zuul-jobs master: Add clear-firewall role https://review.opendev.org/672786 | 21:34 |
clarkb | mriedem: thank you for calling out the nova memcache thing on the config drive bug | 21:44 |
clarkb | mriedem: I left a note on it suggesting that having devstack just do it when memcache is enabled would be great | 21:44 |
*** jamesmcarthur has joined #openstack-infra | 21:46 | |
mriedem | \o/ | 21:46 |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: [WIP] Verify Operator Pod Running https://review.opendev.org/670395 | 21:48 |
openstackgerrit | Merged zuul/zuul-jobs master: Add note to clear-firewall docs https://review.opendev.org/672829 | 21:50 |
*** jamesmcarthur has quit IRC | 21:51 | |
openstackgerrit | Jeff Liu proposed zuul/zuul-operator master: [WIP] Verify Operator Pod Running https://review.opendev.org/670395 | 21:55 |
*** e0ne has quit IRC | 21:56 | |
*** rascasoft has quit IRC | 21:56 | |
openstackgerrit | Merged opendev/system-config master: Build docker images of gerrit https://review.opendev.org/671457 | 21:58 |
*** rascasoft has joined #openstack-infra | 21:58 | |
*** slaweq has joined #openstack-infra | 22:11 | |
*** bdodd_ has joined #openstack-infra | 22:12 | |
*** bdodd_ has quit IRC | 22:13 | |
clarkb | we are now processing replication events from after the great enqueuing | 22:15 |
*** slaweq has quit IRC | 22:16 | |
*** betherly has joined #openstack-infra | 22:16 | |
*** rcernin has joined #openstack-infra | 22:16 | |
clarkb | and we are caught up | 22:21 |
*** betherly has quit IRC | 22:21 | |
clarkb | I think we are about half an hour from bridge's system-config updating based on where it is in the loop | 22:25 |
clarkb | hrm | 22:26 |
clarkb | except https://opendev.org/opendev/system-config/commits/branch/master is still out of date | 22:26 |
clarkb | I wonder if all of these have corrupt root disks like 06 did around the summit :/ | 22:26 |
* clarkb checks them individually | 22:27 | |
ianw | clarkb: were they rebooted after the outage? | 22:27 |
ianw | they all had various kernel messages with things like "vda" in them | 22:28 |
clarkb | ianw: I don't know | 22:28 |
clarkb | 01 and 08 have the latest system-config refs but none of the others do | 22:28 |
clarkb | I'm going to try replicating system-config to gitea02 | 22:29 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Remember tab location on build page https://review.opendev.org/672836 | 22:29 |
clarkb | unless things are cached I don't think that is working | 22:30 |
clarkb | which is very similar to the behavior we observed in gitea06 | 22:30 |
ianw | clarkb: looks like no ... gitea02 for example systemd has decided the journal is corrupt at least | 22:31 |
ianw | although, rebooting it might make it worse if it doesn't want to mount the disk any more | 22:31 |
clarkb | ianw: I guess we remove it from haproxy, reboot it, retrigger replication and see if that helps? | 22:31 |
clarkb | fungi: ^ are you back yet? | 22:32 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Use base 1 line number anchors in log view https://review.opendev.org/672837 | 22:33 |
ianw | clarkb: i looped through the gitea* servers last night and they all had similar things; especially the systemd journal unhappiness | 22:33 |
ianw | but then again, they haven't logged anything since, so maybe it's recovered | 22:35 |
clarkb | except that replication doesn't work | 22:35 |
clarkb | but maybe a reboot will solve that? | 22:35 |
clarkb | I'll remove 02 from the haproxy and reboot it | 22:37 |
*** jamesmcarthur has joined #openstack-infra | 22:38 | |
fungi | clarkb: back now | 22:38 |
clarkb | 02 has been removed | 22:38 |
ianw | clarkb: the rax rescue image thing would be good to try a fsck on the disk and see what that thinks ... | 22:39 |
ianw | not sure how to do that | 22:39 |
fungi | i did check the gitea servers and none seemed to have marked their root filesystems read-only | 22:39 |
clarkb | ianw: https://docs.openstack.org/infra/system-config/gitea.html#backend-maintenance | 22:39 |
fungi | which i would have expected if they had irrecoverable i/o errors | 22:39 |
clarkb | I'm checking gitea docker logs now to see that connections have stopped | 22:39 |
clarkb | fungi: maybe you want to grab a db backup or 10 just in case these filesystems are really unhappy? | 22:40 |
ianw | clarkb: oh i mean more mount the disk from outside and check it | 22:40 |
*** armax has quit IRC | 22:40 | |
clarkb | last request at 2019-07-25 22:38:15 so going to reboot now | 22:40 |
clarkb | ianw: oh | 22:40 |
clarkb | sorry skipped the fsck message | 22:40 |
clarkb | lets reboot since that is easy, rereplicate and check | 22:41 |
fungi | just copy the last nightly backup from one? should be fine since we haven't created new projects | 22:41 |
clarkb | fungi: ya | 22:41 |
ianw | although agree with fungi, they didn't offline themselves. and also it seemed to be a pretty hard shutoff, so it's not like some writes were getting through, but others weren't | 22:42 |
fungi | we can experiment with 02 presumably | 22:42 |
clarkb | I think the writes are happening | 22:42 |
clarkb | but you can't read them back again | 22:42 |
clarkb | anyways rebooting 02 now | 22:42 |
fungi | if this comes right back up, maybe we need to touch /forcefsck | 22:42 |
clarkb | it came right back up | 22:43 |
clarkb | waiting for docker to show happy containers then will try rereplicating | 22:43 |
*** jamesmcarthur has quit IRC | 22:44 | |
fungi | but you did confirm it had missing git objects? | 22:45 |
clarkb | fungi: yes | 22:45 |
clarkb | er not after reboot | 22:45 |
clarkb | I haven't replicated yet | 22:45 |
clarkb | panic: Failed to execute 'git config --global core.quotepath false': error: could not lock config file /data/git/.gitconfig: File exists | 22:45 |
clarkb | I am going to delete that file | 22:45 |
fungi | curious if they were still missing after a reboot too | 22:45 |
fungi | not that i have high hopes | 22:46 |
fungi | are all 8 affected, or just some of them? any idea? | 22:46 |
clarkb | er the .lock file | 22:46 |
clarkb | fungi: 01 and 08 have the system-config refs, none of the others do | 22:46 |
clarkb | I don't know if that means 01 and 08 are ok or if it is per repo problem | 22:46 |
fungi | yeah | 22:46 |
fungi | well, i've got nothing better to do with my evening than churn through gitea server rebuilds. yardwork is done, dinner is behind me | 22:48 |
fungi | and we've ironed out most of the gotchas as of yesterday | 22:48 |
clarkb | https://gitea02.opendev.org:3000/opendev/system-config/commits/branch/master is serving content again (old content) going to trigger replication now | 22:50 |
clarkb | after triggering replication those refs are present | 22:51 |
clarkb | given that should we rotate through all 8, reboot them all, then trigger replication again? | 22:51 |
clarkb | I'm adding 02 back to haproxy since its reboot is done | 22:52 |
clarkb | I'm going to remove 03 now | 22:54 |
clarkb | any objections to proceeding to do all of these? maybe I should start with 01? | 22:54 |
ianw | clarkb: i'm happy to help ... would a little playbook help? | 22:55 |
clarkb | ianw: maybe? the tricky bit with a playbook will be clearing the .gitconfig.lock file but only if gitea fails to start | 22:55 |
auristor | ianw: was the "5.3.0-rc1-afs-next-48c7a244 : volume is offline messages during release" e-mail sent due to additional failures of the mirror? | 22:57 |
clarkb | decided to start with 01 | 22:57 |
ianw | auristor: it was mostly an update, but i think we have a case reported above of a file that seemed missing during a release. i need to correlate it all into something readable, will respond to your mail :) | 22:58 |
clarkb | ianw: I think it may be quicker to just do it given how complicated checking that lock file may be? | 22:58 |
clarkb | (would have to check docker logs output after determining the container id to find if there are errors around the lockfile?) | 22:58 |
clarkb | 01 is back up and didn't have lock errors. Putting it back in haproxy again | 22:59 |
clarkb | I wonder if that lockfile is gonna be the canary for broken gitea replication | 22:59 |
ianw | clarkb: yep, sure. if you want to log the steps i can follow along and help out with some of the others in due course | 22:59 |
*** weifan has quit IRC | 22:59 | |
clarkb | ianw: run the disable commands in that link I pasted earlier on the load balancer. Log into giteaXY and do `docker ps -a` then `docker logs --tail 50 $ID_FOR_GITEA` (the ID comes from the previous command's output). When you see no new connections, reboot | 23:01 |
clarkb | then on start do the docker ps -a and docker logs again to see if it is sad about the lock file | 23:02 |
clarkb | if it is the file to delete is in /var/haproxy/data/git/.gitconfig.lock | 23:02 |
clarkb | docker should try again and it will succeed after that file is gone, then you can enable the host in haproxy as per my link earlier | 23:02 |
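A rough shell version of that per-backend procedure, assuming the haproxy backend/server names and socket path used by the gitea maintenance docs (those names are assumptions here and should be checked before pasting):

```sh
# On the load balancer: take the backend out of rotation (names assumed)
echo "disable server balance_git_https/gitea02.opendev.org" | \
    sudo socat stdio /var/haproxy/run/stats

# On the gitea host: find the gitea container, watch for traffic to stop, reboot
docker ps -a
docker logs --tail 50 GITEA_CONTAINER_ID
sudo reboot

# After reboot: if gitea panics on the stale lock file, remove it and let
# docker restart the container
docker logs --tail 50 GITEA_CONTAINER_ID
sudo rm /var/gitea/data/git/.gitconfig.lock

# Back on the load balancer: re-enable the server
echo "enable server balance_git_https/gitea02.opendev.org" | \
    sudo socat stdio /var/haproxy/run/stats
```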
ianw | ok, should i try 08? | 23:02 |
fungi | clarkb: cycling through all of them makes sense. we should hold off on rebuilds i guess | 23:02 |
clarkb | sorry /var/gitea/data/git | 23:02 |
fungi | maybe i can knock some out tomorrow and over the weekend | 23:03 |
clarkb | ianw: yup I am on 03 and it is failing on the lockfile | 23:03 |
ianw | ok, bringing up some windows ... | 23:03 |
clarkb | I bet 08 doesn't fail on the lockfile because it had the system-config ref | 23:03 |
fungi | it's an interesting theory, but still hard to know for sure it's not missing something else | 23:04 |
clarkb | ya :/ | 23:04 |
clarkb | but a lock file may prevent replication from succeeding maybe? | 23:04 |
fungi | certainly possible | 23:05 |
*** aaronsheffield has quit IRC | 23:05 | |
clarkb | 03 is done, doing 04 now | 23:06 |
*** _erlon_ has quit IRC | 23:07 | |
clarkb | 04 also had lock problem | 23:08 |
clarkb | gitea logs when it starts listening on 3000 too | 23:10 |
clarkb | though the new health checks should make that a non-issue if we want to enable early? | 23:10 |
fungi | just wondering if we should take this opportunity to yank several of the problem servers out of rotation and rebuild them in parallel while volume is low | 23:10 |
ianw | ok 08 rebooted, back in rotation and i can't see anything bad in logs | 23:11 |
clarkb | fungi: to do that the "right" way we have to update system-config which requires working gitea | 23:11 |
fungi | true | 23:11 |
clarkb | 04 is back up, doing 05 now | 23:12 |
ianw | i'll do 07 | 23:12 |
clarkb | 05 too had lockfile problems | 23:13 |
clarkb | (the correlation seems very strong) | 23:14 |
fungi | i guess the "wrong" way would be to put the haproxy server into emergency disable and then manually tweak the config to remove those from pools | 23:14 |
fungi | or use the command socket | 23:14 |
clarkb | fungi: using the command socket should be safe without emergency updates | 23:14 |
clarkb | the problem is we can't use the inventory we have to launch nodes until we remove the nodes we want to replace | 23:14 |
fungi | right | 23:15 |
clarkb | 05 is done. doing 06 | 23:16 |
ianw | 07 has the lockfile issue | 23:17 |
clarkb | 06 did too | 23:17 |
*** mriedem has quit IRC | 23:18 | |
clarkb | ianw: when 07 is happy let me know and I think I'll trigger system-config replication on all the giteas | 23:18 |
*** betherly has joined #openstack-infra | 23:18 | |
clarkb | then we can check if they have all updated, if they have then I think we trigger replication again globally | 23:18 |
clarkb | (maybe do it one gitea at a time to see if that is faster than all at once?) | 23:18 |
fungi | yeah, should definitely see if there's any speedup | 23:19 |
fungi | if it takes roughly an hour to replicate one, then the parallel replication is apparently not buying us any performance increase | 23:19 |
*** jamesmcarthur has joined #openstack-infra | 23:20 | |
fungi | and we ought to focus first on replicating to the ones we suspect are broken before the rest | 23:20 |
clarkb | 07 looks up? | 23:20 |
fungi | we could even take some/all of the ones we think have stale state out of the haproxy pools in the interim | 23:21 |
ianw | yep just came back into rotation and seems ok | 23:21 |
clarkb | alright I'm going to trigger system-config replication to all giteas now | 23:21 |
fungi | not going to try one at a time after all? | 23:21 |
clarkb | just system-config | 23:21 |
fungi | oh, right | 23:21 |
fungi | so with that we can still take a few out of the inventory and replace them while we replicate to the others | 23:22 |
*** weifan has joined #openstack-infra | 23:22 | |
clarkb | all 8 render the latest commit of system-config now | 23:22 |
*** betherly has quit IRC | 23:23 | |
clarkb | for replication should we do 01 then 03-08 in that order? skipping 02 since it is going to be replaced? | 23:23 |
clarkb | I'll trigger 01 replication now if so | 23:23 |
fungi | i'd say 01 and 06 first? | 23:23 |
fungi | since 06 will also not be rebuilt | 23:23 |
clarkb | oh good point | 23:23 |
clarkb | ya 01, 06, 03, 04, 05, 07, 08 in that order | 23:24 |
clarkb | triggering 01 now | 23:24 |
fungi | i mean, serially still if you want | 23:24 |
clarkb | yes serially | 23:24 |
*** jamesmcarthur has quit IRC | 23:24 | |
clarkb | 01 is in progress now. ~2100 tasks | 23:25 |
fungi | but yeah, the new servers first, and we could work on replacing 02,03,04 together or something | 23:25 |
fungi | and then replace 05,07,08 in a second batch | 23:25 |
clarkb | fungi: we'll need a new change to the inventory if we want ot batch them | 23:25 |
clarkb | at this point unlikely to get any of them done today? so maybe we push that up as prep for tomorrow? | 23:26 |
fungi | that's fine too. i'm willing to work on some server replacements this evening but just as happy to save them for tomorrow when more folks are on hand | 23:26 |
fungi | and when we're not conflating today's incident with issues we might create with server replacements | 23:27 |
clarkb | fungi: well I don't want you to feel pressured to do that. I think we'll be ok to limp into tomorrow if these replications work | 23:27 |
*** weifan has quit IRC | 23:27 | |
clarkb | I'm going to have to make dinner in the near future: curry too so won't be able to type and eat :) | 23:27 |
fungi | ahh, yes, let's not get in the way of curry ;) | 23:28 |
* fungi is envious | 23:28 | |
ianw | (not something i want to take on while you're all away, not quite across it well enough) | 23:29 |
*** armax has joined #openstack-infra | 23:30 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Parse log file in action module https://review.opendev.org/672839 | 23:30 |
*** tjgresha has quit IRC | 23:31 | |
clarkb | already down to 1100 tasks | 23:31 |
clarkb | at this rate serializing will be done in ~12 minutes? | 23:32 |
clarkb | (maybe we should reduce the thread count then) | 23:32 |
*** weifan has joined #openstack-infra | 23:32 | |
fungi | i also wonder if it just goes faster when nobody's using gerrit | 23:34 |
clarkb | could be | 23:36 |
*** weifan has quit IRC | 23:37 | |
fungi | i basically started the mass replication just when the bulk of our activity was climbing for the day | 23:38 |
clarkb | and time | 23:38 |
clarkb | about 14-15 minutes? | 23:39 |
clarkb | starting 06 now | 23:39 |
clarkb | `ssh -p 29418 user@review.openstack.org replication start --url gitea06.opendev.org` is the command I'm running | 23:39 |
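To run the same trigger serially for the remaining backends, one option is a small loop like the sketch below; it assumes the replication plugin's --wait flag (which blocks until the pushes for that URL finish) and follows the order agreed above:

```sh
# Trigger replication to each remaining backend one at a time; --wait (assumed
# flag) makes each invocation block until that backend's pushes complete
for host in gitea03 gitea04 gitea05 gitea07 gitea08; do
    ssh -p 29418 user@review.openstack.org \
        replication start --wait --url ${host}.opendev.org
done
```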
fungi | yeah, that was waaaay faster than earlier | 23:40 |
*** goldyfruit has quit IRC | 23:40 | |
*** sshnaidm is now known as sshnaidm|off | 23:43 | |
*** dchen has joined #openstack-infra | 23:45 | |
*** dchen has quit IRC | 23:45 | |
*** dchen has joined #openstack-infra | 23:46 | |
fungi | already more than halfway done with 06 | 23:46 |
fungi | wonder how fast it goes with two at a time | 23:47 |
clarkb | fungi: I can do 03 and 04 together next | 23:47 |
fungi | though another possibility is that 01 and 06 are faster than the rest? | 23:48 |
clarkb | could be | 23:48 |
fungi | (on faster storage owing to being created more recently) | 23:48 |
clarkb | fungi: and proper journal size | 23:48 |
fungi | yeah | 23:49 |
clarkb | however I Think we should do 03 and 04 together for science | 23:49 |
fungi | for science, yes | 23:49 |
*** jamesmcarthur has joined #openstack-infra | 23:50 | |
clarkb | 03 and 04 started | 23:53 |
*** jamesmcarthur has quit IRC | 23:55 | |
*** smcginnis has quit IRC | 23:56 |