*** efoley has joined #openstack-infra-incident | 08:36 | |
*** efoley has quit IRC | 09:04 | |
*** efoley_ has joined #openstack-infra-incident | 09:05 | |
*** tushar has quit IRC | 09:54 | |
*** efoley_ has quit IRC | 09:58 | |
*** efoley has joined #openstack-infra-incident | 10:05 | |
*** tushar has joined #openstack-infra-incident | 10:05 | |
*** tumbarka has joined #openstack-infra-incident | 10:23 | |
*** tushar has quit IRC | 10:26 | |
*** efoley_ has joined #openstack-infra-incident | 10:28 | |
*** efoley has quit IRC | 10:28 | |
*** efoley has joined #openstack-infra-incident | 10:34 | |
*** efoley_ has quit IRC | 10:34 | |
*** efoley has quit IRC | 10:57 | |
*** efoley has joined #openstack-infra-incident | 11:02 | |
*** jesusaur has quit IRC | 11:19 | |
*** efoley has quit IRC | 11:20 | |
*** efoley_ has joined #openstack-infra-incident | 11:20 | |
*** jesusaur has joined #openstack-infra-incident | 11:33 | |
*** efoley_ has quit IRC | 11:56 | |
*** efoley has joined #openstack-infra-incident | 12:08 | |
AJaeger | I've moved "fixed" issues to the proper section in etherpad and added some more entries (specs publishing) | 12:56 |
* AJaeger will now remove merged changes from the review list | 12:57 | |
fungi | thanks AJaeger! | 13:37 |
fungi | just a heads up, the pitchforks are coming out on the -dev ml now... calls for identifying an acceptable failure threshold by a specific date/time or executing a rollback to v2 | 13:56 |
pabelanger | sigh | 13:59 |
pabelanger | it's been over 12 minutes since zuulv3.o.o has logged to debug.log | 14:02 |
dmsimard | I'd anticipate a whole different set of issues if we rolled back | 14:03 |
*** efoley has quit IRC | 14:03 | |
*** efoley_ has joined #openstack-infra-incident | 14:03 | |
pabelanger | I can see we are trying to promote a change too, I wonder if that is related to the current issue | 14:04 |
pabelanger | mordred: fungi: clarkb: jeblair: ^ | 14:04 |
fungi | yeah, i wouldn't be surprised if the promote has tanked it | 14:05 |
fungi | swap is climbing steadily over recent minutes too | 14:05 |
fungi | my promote command still hasn't returned control to the shell | 14:05 |
fungi | and we're at almost 9gib swap used (up almost 5gib in a matter of minutes) | 14:06 |
mordred | fungi: yah- I think we've likely hit the place we were discussing on friday and have now run for as long as is reasonable to collect data | 14:06 |
pabelanger | Ya, I didn't see how large the change queue was, so maybe zuul is moving around a lot of things currently | 14:06 |
fungi | there were maybe 10 changes in that gate queue, if even that many | 14:07 |
fungi | i don't recall exactly but it wasn't a ton | 14:07 |
pabelanger | So, sounds like we might be thinking of a rollback? | 14:07 |
fungi | swap utilization just dropped by several gib in the past few seconds so maybe it's about done doing... whatever | 14:08 |
fungi | still falling | 14:08 |
pabelanger | zuul processing again | 14:08 |
pabelanger | in debug.log | 14:08 |
mordred | fungi: and would not be opposed to the friday plan of shifting zuulv3 to check-only and turning v2 back on - running v3 in check mode should still trigger the reconfigure issues | 14:10 |
pabelanger | yah | 14:11 |
pabelanger | It does look like zuul-executors are in better shape since rolling out the CPU limits and ansible fixes | 14:12 |
mordred | ++ | 14:12 |
mordred | that was, I believe, a great addition | 14:12 |
AJaeger | if we roll back - I would still freeze zuul/layout.yaml and new job creation. Or have both running in parallel with optional opt-in. We've done some great work on Zuul v3 and I would rather not run the migration scripts again... | 14:13 |
AJaeger | this morning reviews/merges were nicely fast - but then Zuul slowed down suddenly | 14:13 |
mordred | AJaeger: I agree | 14:13 |
pabelanger | swapping is the reason for the slowness, hopefully once we address memory issues, that goes away | 14:14 |
mordred | AJaeger: I think we leave v2 layout/jobs frozen - although allow case-by-case exceptions if we can also make the equivalent change in the v3 defs (there were some where people wanted to disable some v2 jobs right around cutover - I think those are fair) | 14:15 |
mordred | pabelanger: oh - totally - once we figure out whatever is causing zuul to go sideways occasionally | 14:15 |
fungi | swap is definitely in flux... almost at 10gib used now | 14:21 |
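For context, a minimal sketch of how swap can be watched live on the scheduler host while correlating with debug.log timestamps; the exact tooling used here is an assumption:

```bash
# Poll memory and swap usage every 5 seconds, human-readable
watch -n 5 free -h
# or stream periodic samples for later correlation with the logs
vmstat 5
```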
pabelanger | Yup, last logging was 2017-10-03 14:11:12,051 | 14:23 |
pabelanger | i believe we are just out of memory now and will keep swapping | 14:24 |
mordred | fungi, pabelanger: shall we consider restarting? when we're in this state I don't believe we can do the objgraph so I'm fairly sure we will not be able to collect any information that hasn't already been logged | 14:28 |
fungi | my promote command from 40 minutes ago still hasn't returned | 14:29 |
mordred | yah | 14:29 |
fungi | so, yeah i guess a restart is inevitable now | 14:29 |
dmsimard | AJaeger, mordred: some projects (such as tripleo) have had to implement different changes *in* their different projects to support zuul v3. Rolling back would mean breaking them (again) | 14:29 |
mordred | dmsimard: that's a very good point | 14:29 |
fungi | dmsimard: yep, we've been trying to discourage those sorts of changes so as not to complicate possible rollback | 14:30 |
mordred | fungi: yah - unfortunately in some cases not doing those changes was unpossible | 14:30 |
fungi | (similarly, not making changes to things like on-disk scripts in ways which will render them nonfunctional under v2) | 14:30 |
dmsimard | fungi: discourage how ? their jobs would not work without making these changes | 14:30 |
fungi | dmsimard: rather, encourage making changes which don't break v2 usage | 14:30 |
mordred | dmsimard: most of the tripleo changes were made in the tripleo-ci repo right? | 14:31 |
fungi | anyway, if projects have a vested interest in _not_ rolling back to v2 temporarily and trying again in a week or two, then they should be chiming in on that ml thread | 14:32 |
fungi | right now it's projects annoyed that they have merged very few changes since thursday speaking up | 14:32 |
clarkb | tripleo changes should be compatible right? | 14:33 |
clarkb | (I dont know of all the tripleo changes but the ones I was involved in should be) | 14:33 |
dmsimard | clarkb: I don't know, there's a wide variety of changes.. from that sub_nodes_private thing to the hardcoded zuul user vs jenkins | 14:34 |
mordred | fungi, clarkb: should we hit the restart button on the scheduler? | 14:35 |
clarkb | mordred: sorry just catching up, but sounds like it may not recover on its own? | 14:36 |
fungi | i guess. any chance we can get a dump of check and gate pipelines to reenqueue? | 14:36 |
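One way to capture such a dump is to snapshot the status endpoint before restarting; a rough sketch, where the URL and the jq filter's assumptions about the status.json layout are illustrative:

```bash
# Save the scheduler's status for safekeeping
curl -s http://zuulv3.openstack.org/status.json -o /tmp/zuul-status.json
# List the enqueued change ids in the check and gate pipelines
jq -r '.pipelines[]
       | select(.name == "check" or .name == "gate")
       | .change_queues[].heads[][]
       | .id' /tmp/zuul-status.json
```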
Shrews | mordred: restarting would pick up the branch matching fix | 14:36 |
mordred | Shrews: I think jeblair restarted last night with that applied | 14:36 |
fungi | Shrews: i thought jeblair stacked that in last night when he restarted the scheduler | 14:36 |
mordred | yah | 14:37 |
Shrews | ah, saw that it didn't merge until this morning | 14:37 |
mordred | "restarted with 508786 508787 508793 509014 509040 508955 manually applied; should fix branch matchers, use *slightly* less memory, and fix the 'base job not defined' error" | 14:37 |
pabelanger | +1 for restart, if we can save queues great | 14:39 |
clarkb | dmsimard: the user thing is probably the biggest one. Zuul is a valid user in the old setup though so may work too depending on what bits of user stuff are hardcoded | 14:39 |
mordred | my initial thoughts on rollback logistics were that we could redefine the v3 pipelines that are not check and add a pipeline requirement that is unpossible - basically keep the existence of the pipelines but make sure nothing is ever enqueued in them | 14:40 |
dmsimard | mordred: like branch foobar or something ? | 14:41 |
mordred | yah. or require a +10 vote from zuul or something else silly | 14:41 |
clarkb | ++ | 14:42 |
mordred | that way we don't have to actually modify any of the project pipeline defns - then make a second gate pipeline - like v3-gate or something - that we can modify natively v3 projects (zuul, zuul-jobs, openstack-zuul-jobs, project-config) to use | 14:42 |
pabelanger | ya, that seems simple | 14:43 |
mordred | that way we can keep running check jobs on everything and keep battling the memory leak with proper scale - but leave job config mostly frozen and not make re-roll-forward more difficult | 14:43 |
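A minimal sketch of what such an unsatisfiable pipeline requirement could look like; the file path and the exact require syntax are assumptions, not the actual content of 509196:

```bash
# Keep the gate pipeline defined but make its requirement impossible to
# meet, so nothing is ever enqueued (path and syntax are illustrative)
cat >> zuul.d/pipelines.yaml <<'EOF'
- pipeline:
    name: gate
    manager: dependent
    require:
      gerrit:
        approval:
          - username: zuul
            Verified: 10   # zuul never votes +10, so this never matches
EOF
```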
jeblair | catching up | 14:44 |
mordred | morning jeblair ! | 14:44 |
fungi | from the perspective of correlating memory-usage increases with known events, i'm starting to wonder if the promote command has given us a way to trigger a memory spike on demand (suggesting perhaps they're related to dependent queue resets?). maybe worth testing after the next restart | 14:46 |
jeblair | it's decidedly swappy now | 14:49 |
AJaeger | did anybody save the pipeline state? | 14:52 |
jeblair | i think it's fine to restart. and i think it's fine to roll back. | 14:52 |
jeblair | AJaeger: we probably won't be able to do so until zuul becomes responsive again | 14:53 |
AJaeger | jeblair: our lucky day ;( Ok | 14:53 |
fungi | we're at 1 hour 5 minutes since i started the promote of 508344,3 in the gate now, and the command is still hanging | 14:54 |
fungi | i have a feeling attempting to get status data back isn't going to complete until around the same time that does, however long that takes | 14:55 |
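For reference, the hanging command fungi describes would have been a v2-era zuul client invocation along these lines (run on the scheduler host):

```bash
# Move change 508344,3 to the head of the gate pipeline
zuul promote --pipeline gate --changes 508344,3
```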
pabelanger | If we are rolling back, do we want to do 50/50 between nodepool.o.o and nl01.o.o/nl02.o.o for capacity? | 14:55 |
fungi | probably more like 67/33 | 14:56 |
clarkb | fungi ++ | 14:56 |
pabelanger | sure | 14:56 |
mordred | remote: https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines | 14:56 |
pabelanger | let me prepare that patch | 14:56 |
fungi | v2 will need sufficient nodes to handle check, gate, post, et cetera while v3 only needs enough for check | 14:56 |
jeblair | yeah. or even more. we don't need to run that many jobs to find problems. | 14:56 |
clarkb | do we need any other changes than the nodepool shift and pipeline update? | 14:57 |
AJaeger | Let's first write down the new policy and what to do - before rolling back too quickly. | 14:57 |
mordred | there's a strawman for v3 config to block things from running in anything other than check | 14:57 |
jeblair | (we'll find *job* problems at a lower rate, but that's fine -- we're finding them faster than we can fix them atm anyway) | 14:57 |
pabelanger | jeblair: 80/20 then? | 14:57 |
dmsimard | some projects don't even need a rollback | 14:57 |
fungi | prior to rollback, we likely need one more v3 restart so that it can get back to processing changes while we work on the rollback, right? | 14:58 |
dmsimard | like for example puppet is fully migrated and I suspect some other projects have as well | 14:58 |
mordred | fungi: yes | 14:58 |
pabelanger | dmsimard: I don't think we want split gating for projects | 14:58 |
dmsimard | are we going to roll everything back ? | 14:58 |
mordred | dmsimard: if we rollback, those projects will go back to using their v2 jobs | 14:58 |
AJaeger | can we add puppet and others that want it to Zuul v3? let them use mordred's infra-gate? | 14:59 |
mordred | I think there is a fundamental thing we need to decide before we can decide further details ... | 15:00 |
mordred | that is whether or not we want to continue to run v3 job content in check broadly during the v2 + v3 period, or whether we do not want to do that | 15:00 |
mordred | continuing to run check on all the projects will let us continue to iterate on finding and fixing job issues in parallel to figuring out and fixing the memory/config issue - but will obviously eat more nodepool quota | 15:01 |
*** efoley_ has quit IRC | 15:01 | |
clarkb | mordred: I think we can run it and if it doesn't report on every change due to node quota or restarts that is ok | 15:02 |
clarkb | we also need to decide if we allow changes to v2 jobs | 15:02 |
mordred | not running v3 check on all the projects will keep the v3 signal-to-noise level sane and will not use as much node quota, but will suspend further work on v3 job issues | 15:02 |
pabelanger | remote: https://review.openstack.org/509200 Revert "Shift nodepool quota from v2 to v3" | 15:03 |
mordred | clarkb: yes, I agree - I vote for "no, unless the same change is applied to the v3 job" (there are some things, like "I want to turn this job off" that I think may be important exceptions to allow) | 15:03 |
pabelanger | that will be our revert for nodepool | 15:03 |
pabelanger | topic:zuulv3-rollback | 15:03 |
mordred | I'm not convinced on 80/20 - I think we need to decide on the check question first | 15:04 |
pabelanger | sure | 15:04 |
pabelanger | I'll WIP | 15:04 |
mordred | pabelanger: (patch looks good - mostly just saying I think we should get on to the same page on what we want to accomplish) | 15:05 |
jeblair | mordred: i'm not quite sure i understand the question | 15:05 |
pabelanger | mordred: agree | 15:05 |
AJaeger | Questions I see are: a) Run check on all repos? b) Which projects to gate on and c) can projects go fully v3 if they are ready (like puppet)? | 15:05 |
jeblair | what do you mean by "v3 job content"? auto-migrated jobs? or native pre-converted jobs? or new in-repo post-cutover jobs? ....? | 15:05 |
jeblair | mordred: ^ | 15:06 |
AJaeger | Once those three are answered, we can say 80/20 or whatever IMHO | 15:06 |
mordred | jeblair: do we run the currently defined jobs that are defined to be in the check pipeline for all of the projects | 15:06 |
jeblair | mordred: what's the alternative you are considering? run no check jobs except for infra-v3? | 15:07 |
mordred | jeblair: yah - although I personally think we should run check jobs | 15:08 |
jeblair | mordred: yes, if those are the 2 choices you're presenting, i would say run check as-is. | 15:08 |
mordred | cool. infra-root - anyone have a different opinion? ^^ | 15:08 |
clarkb | no that is what I had in mind | 15:09 |
dmsimard | It effectively means cutting the nodepool capacity in half, yeah ? | 15:10 |
dmsimard | I mean, because we'll be running check twice | 15:10 |
mordred | being agreed on that - do we think 80/20 nodepool split will provide enough nodes for us to land zuul, zuul-jobs, openstack-zuul-jobs and project-config patches? would 66/33 split be better? do we have quota? | 15:10 |
jeblair | dmsimard: not half ^ | 15:10 |
pabelanger | +1 to running check as we are now | 15:10 |
AJaeger | dmsimard: 60:40 or 80:20 I guess - since there's gate and post as well which we do not duplicate | 15:11 |
jeblair | mordred: let's make infra-check and give it high priority. | 15:11 |
dmsimard | jeblair: what I mean by that is that regardless of how we split v2 and v3 max-servers, if we end up running check twice we're effectively splitting that down further | 15:11 |
AJaeger | jeblair: +1 | 15:11 |
mordred | jeblair: kk | 15:11 |
jeblair | mordred: then "check" can run as backlogged as it wants, but we can still land changes to zuul-jobs | 15:11 |
pabelanger | ++ | 15:11 |
AJaeger | So, this would allow projects to continue working on their check jobs, correct? They could send patches in, we apply them and iterate on these. | 15:13 |
clarkb | AJaeger: yes | 15:13 |
fungi | probably a terrible question, but do we want to try rebuilding zuulv3 with a bigger flavor? might give us more breathing room between necessary restarts (seems it's levelled off at around 25gib total virtual memory in use for the moment, but that may just be because it hasn't handled anything new for a while) | 15:14 |
mordred | remote: https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines | 15:14 |
mordred | updated with an infra-check | 15:14 |
AJaeger | Is there a variable that scripts can use, like "if running under Zuul v3, use this, otherwise use zuul-cloner"? | 15:14 |
pabelanger | AJaeger: i think the shim will handle that | 15:15 |
AJaeger | pabelanger: but projects might do other stuff as well... | 15:15 |
mordred | I believe there are projects that have shell scripts in their own repos that do things - and some of those people may have already updated their shell scripts to not use zuul-cloner | 15:15 |
mordred | so if we rollback to v2 for them, their jobs will break again and they'll need to revert their changes to their scripts | 15:16 |
dmsimard | yes. | 15:16 |
AJaeger | So, I want to give them a way to use their scripts for both v2 and v3 - and thus start migrating to v3 if they have not | 15:16 |
pabelanger | Right, i think that is the right way. Roll back, revert your zuulv3 changes. Supporting both shouldn't be an option | 15:16 |
mordred | I believe it would be fairly easy to add a "running in zuulv3 env var" | 15:16 |
AJaeger | pabelanger: I disagree, we should give them an option! That will make the future step to v3 easier. | 15:17 |
mordred | so that they could half-revert and put things in an if | 15:17 |
AJaeger | mordred: +1 | 15:17 |
clarkb | can check whoami | 15:17 |
pabelanger | yes, old will be jenkins | 15:17 |
pabelanger | new will be zuul users | 15:17 |
AJaeger | that's one way of doing it - let's document it as recommendation | 15:17 |
jeblair | or, you know, check whatever it is that's breaking. :) | 15:18 |
AJaeger | (if we agree on that - I don't care, I just want it documented and offered) | 15:18 |
mordred | AJaeger: ++ | 15:20 |
jeblair | https://etherpad.openstack.org/p/1sLlNKa7FU | 15:21 |
jeblair | i started writing out what we're discussing | 15:22 |
AJaeger | thanks, jeblair | 15:22 |
jeblair | we could use ZUUL_URL, ZUUL_REF, or LOG_PATH as ways to tell v2 from v3. | 15:26 |
jeblair | i hesitate to suggest the whoami thing because then we'll end up with the username as an API. if we ever change it, we'll break scripts. | 15:26 |
dmsimard | jeblair: the username is fairly foolproof, do we need something else ? | 15:26 |
dmsimard | ah | 15:27 |
jeblair | at least, if folks check that whoami is zuul | 15:27 |
jeblair | if they check for jenkins, that's fine. | 15:27 |
jeblair | the main thing is that the check should be *backwards* facing. | 15:27 |
clarkb | jeblair: ya whoami probably bad idea | 15:28 |
jeblair | LOG_PATH is probably the least likely thing we would add to the backwards-compat var filter | 15:28 |
jeblair | we also don't add ZUUL_URL or ZUUL_REF because there is no value we can put there, but we've entertained the idea of setting them to "N/A" or something, so they are slightly more likely to change. | 15:29 |
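Putting that together, a backwards-facing check in a job script might look like the sketch below; the variable choice follows the reasoning above, while the clone command and paths are illustrative assumptions:

```bash
# Detect Zuul v2 by the presence of a v2-only variable and treat its
# absence as v3; project names and paths are illustrative.
if [ -n "${ZUUL_URL:-}" ]; then
    # Zuul v2: fetch dependencies with zuul-cloner as before
    zuul-cloner git://git.openstack.org openstack/example-project
else
    # Zuul v3: required repos are already prepared on the node
    cd "$HOME/src/git.openstack.org/openstack/example-project"
fi
```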
dmsimard | infra-root: I think zuulv3.o.o is unexpectedly empty right now | 15:30 |
pabelanger | it is swapping! | 15:31 |
mordred | jeblair: ++ | 15:31 |
jeblair | so 80/20 with infra-check as high priority? | 15:41 |
clarkb | wfm | 15:42 |
pabelanger | sure | 15:42 |
AJaeger | jeblair: we can adjust later if needed, it's a first good guess ;) | 15:43 |
pabelanger | we can also bring infracloud back online | 15:43 |
pabelanger | and see if sto2 is good to use again | 15:43 |
pabelanger | both are still disabled | 15:43 |
jeblair | okay, i think that's a factual summary of everything we discussed; any outstanding questions? | 15:46 |
mordred | jeblair: I think it looks good, I like the plan, and I think the top section makes a good status email | 15:46 |
jeblair | mordred: you want to add more words and send it? | 15:47 |
mordred | jeblair: sure! | 15:49 |
mordred | jeblair: also - completely unrelated to this- but in making the infra- pipelines patch I realized that we're not doing sql reporter on anything other than check or gate- followup patch submitted | 15:50 |
jeblair | derp | 15:50 |
jeblair | we'd figure that out eventually | 15:50 |
*** jdandrea has joined #openstack-infra-incident | 15:51 | |
jeblair | so, logistics -- should we restart zuulv3 now, then land changes to do rollback, or stop, force merge them, then bring up both systems in the rollback state? | 15:54 |
pabelanger | wfm | 15:56 |
fungi | etherpad content before the horizontal rule lgtm for a status update | 15:56 |
mordred | infra-root: updated etherpad with wrapper email words | 15:56 |
* clarkb looks | 15:57 | |
pabelanger | looking | 15:57 |
mordred | also - one last question - how badly is this going to bork openstack-health and our subunit2sql friends? | 15:57 |
fungi | i agree with the strikethrough in the first paragraph. we have no solid evidence to indicate it's reconfiguration-related afaik | 15:57 |
fungi | i think subunit2sql is still working to get the v3 job naming change landed | 15:58 |
pabelanger | +1 to etherpad | 15:58 |
clarkb | email lgtm | 15:58 |
clarkb | mordred: I'm not super concerned about it from elasticsearch's perspective | 15:59 |
clarkb | mordred: it's just more data | 15:59 |
fungi | jeblair: how much of the rollback would be able to get accomplished without force merging those changes anyway? | 15:59 |
jeblair | i would like to suggest an alternate first paragraph | 15:59 |
clarkb | mordred: however health looks at pass/fail rates which may skew with check running in v3, but I think it's much more gate focused there | 15:59 |
jeblair | one that does not suggest that the only reason we are rolling back is the memory leak | 15:59 |
clarkb | mordred: my hunch is its ok because of the focus on gate jobs | 15:59 |
jeblair | i'm working on a proposal | 15:59 |
* AJaeger is fine with etherpad, thanks | 16:00 | |
fungi | since jeblair was able to snag pipeline contents, i feel like restarting v3 in the interim until the rollback patches are ready may provide slightly less disruption than the continued not-much-going-on state it's in right now | 16:00 |
mordred | yah - I think we need to restart v3 - if for no other reason than to land the rollback patches | 16:02 |
jeblair | mordred: alternate suggestion for pgraph1 on line 7. | 16:03 |
jeblair | also, last pgraph on line 25 | 16:04 |
mordred | jeblair: wfm | 16:04 |
fungi | i prefer the suggested alternatives | 16:04 |
fungi | however the last paragraph will inevitably result in people asking "how long?" | 16:05 |
jeblair | we could say, 'hopefully in a week or two'. | 16:05 |
fungi | that sounds reasonable | 16:06 |
AJaeger | what about saying "We revisit on Friday the current status"? | 16:06 |
AJaeger | (or Monday?) | 16:06 |
mordred | how about we say 'hopefully in a week or two, but we'll send out updates as we have them?' | 16:06 |
fungi | i think in part some people will want to know because they're eager to get back to not having to think about v2 jobs any longer, while others will simply want to know how much longer their job configuration in v2 will remain frozen in place | 16:07 |
mordred | (updated final line in etherpad) | 16:07 |
fungi | though i guess paragraph 3 doesn't really say v2 configuration is completely frozen, so that's fine | 16:07 |
clarkb | wfm | 16:07 |
pabelanger | a week or two is good for me. If we fix things sooner, I'm sure we'll try a rollout sooner | 16:08 |
mordred | luckily a rollout winds up being very easy at this point- turn the other pipelines back on :) | 16:09 |
AJaeger | pabelanger: if we're ready quickly, let's run zuul v3 for a day or two without interruption ;) | 16:09 |
AJaeger | mordred: agreed, we should be able to go quickly from one to the other... | 16:10 |
pabelanger | ya, we should be in a good spot to move nodes between zuulv2.5 and zuulv3 as we are ready for more load testing too | 16:10 |
pabelanger | I'm currently using topic:zuulv3-rollback for patches that would be needed to rollback. Currently 2 | 16:15 |
mordred | jeblair: you're good with the final version? | 16:15 |
fungi | thanks pabelanger! | 16:15 |
jeblair | mordred: yep | 16:15 |
pabelanger | It looks like the zuul-launchers might have been stopped too, just noting we'll need to start them back up | 16:17 |
mordred | kk. sending | 16:17 |
*** bnemec has joined #openstack-infra-incident | 16:19 | |
*** ying_zuo has joined #openstack-infra-incident | 16:30 | |
AJaeger | mordred: where did you send it to? It's not in openstack-dev archives yet... | 16:31 |
AJaeger | mordred: remote: https://review.openstack.org/509221 - where else do we need this? | 16:37 |
*** dansmith has joined #openstack-infra-incident | 16:37 | |
clarkb | AJaeger: zuul-jobs and zuul. Maybe also project-config itself? | 16:38 |
AJaeger | Let me do zuul-jobs quickly, any volunteer for zuul? | 16:39 |
clarkb | AJaeger: we should probably do it in a single change | 16:40 |
clarkb | as that will be easier to merge | 16:40 |
clarkb | ? | 16:40 |
AJaeger | clarkb: mordred updated project-config. But we need these separate for each repo with individual config | 16:40 |
AJaeger | nothing needs to be done for zuul-jobs | 16:41 |
AJaeger | pushed for zuul now | 16:46 |
AJaeger | not needed for zuul-jobs | 16:46 |
AJaeger | that should be all - please double check | 16:46 |
AJaeger | https://review.openstack.org/#/q/topic:zuulv3-rollback+(status:open+OR+status:merged) | 16:47 |
pabelanger | I don't think it is causing issues, but I think the hostnames on ze09 and ze10 are not correct. I'll look into that shortly once we've rolled back | 16:55 |
pabelanger | more for my own info, I can see in zuul debug.log that every time we do a dynamic reload, we spike the CPU (which is expected I guess). However, after the reload has finished, zuul processes things like a champ, nice and fast, with the zuul-scheduler process at lowish CPU | 17:02 |
mordred | infra-root: I'm back - sorry - had a phone call pop up | 17:03 |
clarkb | fungi is working on clearing out zuul -2 votes, other than that I think we are ready to start implementation? | 17:03 |
fungi | i think we can do those in parallel | 17:07 |
fungi | pabelanger: yeah, i ended up having to manually repair the hostname on the ze01 i rebuilt. seems cloud-init may be racing ansible or something on fresh builds lately | 17:09 |
fungi | basically just edit /etc/hostname to contain an fqdn instead of the shortname, and then reboot | 17:10 |
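In other words, the manual repair amounts to something like this; the hostname shown is illustrative:

```bash
# Put the fqdn into /etc/hostname, then reboot to apply it
echo "ze09.openstack.org" | sudo tee /etc/hostname
sudo reboot
```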
pabelanger | kk | 17:10 |
pabelanger | I'll look into that once after the rollback | 17:10 |
fungi | clarkb: i got the bulk of the v3 verify -2 votes cleared, but will do another pass after we get v2 online | 17:14 |
pabelanger | we seem to be averaging about 1min of CPU time for each dynamic reload, according to debug.log | 17:14 |
pabelanger | maybe a little less | 17:14 |
pabelanger | I think we might be at the point again, where zuul is just processing dynamic reloads. It doesn't look like it has the steam to get ahead of the demand | 17:29 |
clarkb | ok, so what is the process for implementing the rollback, do we force merge or manually apply the pipeline changes and the nodepool quota shift then start the zuulv2 daemon? | 17:34 |
pabelanger | quota should be ready to land now | 17:36 |
pabelanger | https://review.openstack.org/509200/ | 17:37 |
pabelanger | once merged, nodepool-launchers should do the right thing | 17:37 |
jeblair | we have a +1 on the pipeline change, so i say we force-merge that and the quota change, then startup v2 | 17:38 |
*** tobiash has joined #openstack-infra-incident | 17:38 | |
pabelanger | maybe nodepool change first to bleed off resources from launchers | 17:38 |
jeblair | pabelanger: you want to take care of doing that? | 17:39 |
pabelanger | sure | 17:39 |
clarkb | hrm, looking at the pipeline change, quick question | 17:39 |
clarkb | with older zuul if you tried to look at attributes that weren't present in the event it could get in an unhappy state | 17:39 |
clarkb | any concern with using things like current-patchset on post pipeline for that reason? | 17:40 |
jeblair | clarkb: those are attributes of refs, not events, so should be fine | 17:40 |
jeblair | since they are pipeline requirements | 17:40 |
clarkb | cool | 17:40 |
jeblair | clarkb: and i double checked the approvals one -- that should fail closed, ie, refs without approvals do not match and will not be enqueued | 17:41 |
clarkb | jeblair: perfect | 17:41 |
jeblair | clarkb: looks like the same holds for current-patchset | 17:41 |
pabelanger | which is the correct group again to force merge? | 17:41 |
jeblair | pabelanger: project bootstrappers | 17:41 |
pabelanger | jeblair: ty | 17:41 |
pabelanger | okay, 509200 force merged | 17:44 |
pabelanger | removing myself from group | 17:44 |
fungi | pabelanger: i usually pull this out of my command history: `ssh -p 29418 review.openstack.org gerrit set-members "Project\ Bootstrappers" --add fungi` | 17:44 |
fungi | and then --remove instead of --add when i'm done | 17:45 |
pabelanger | fungi: ty | 17:45 |
fungi | faster than fiddling with the webui | 17:45 |
pabelanger | okay, I have kicked both nl01 and nl02 | 17:47 |
AJaeger | next approving https://review.openstack.org/#/c/509196/ (new pipelines)? | 17:47 |
pabelanger | just confirming nodepool-launchers are happy, then will move on to nodepool.o.o | 17:49 |
clarkb | and then once nodepool is done merge pipeline change and start zuulv2? | 17:53 |
clarkb | pretty sure we need the pipeline change in place before zuulv2 starts running (to avoid them fighting) | 17:53 |
jeblair | it would be best i think | 17:53 |
jeblair | i'll translate the latest re-enqueue scripts into v2 | 17:54 |
jeblair | so we can restore queue state (from earlier this morning) | 17:54 |
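Per saved entry, restoring a change with the v2 client looks roughly like this (the project is illustrative; the change id is reused from earlier in the log):

```bash
# Re-enqueue one change into the v2 gate pipeline
zuul enqueue --trigger gerrit --pipeline gate \
     --project openstack/example-project --change 508344,3
```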
AJaeger | Question on project-config repo: Will we run both zuul v2 and zuulv3 checks on it - and merge with v3 only? | 17:55 |
AJaeger | Then we need to remove the v2 jobs from zuul/layout.yaml. Or how to handle this central repo? | 17:55 |
*** bnemec has quit IRC | 17:56 | |
pabelanger | okay, cleaning up some of the ready nodes on nodepool-launchers, then going to kick nodepool.o.o | 17:56 |
jeblair | AJaeger: i vote v2 check only, v3 check/gate | 17:57 |
* AJaeger prepares a patch... | 17:58 | |
pabelanger | kicking nodepool.o.o now | 17:58 |
pabelanger | okay, nodepool is trying to launch servers again, but failing to talk to gearman | 18:00 |
pabelanger | so, think we are ready on nodepool front | 18:00 |
clarkb | (that is expected since geard and zuulv2 are not running on zuul.o.o) | 18:00 |
pabelanger | yah | 18:00 |
*** bnemec has joined #openstack-infra-incident | 18:01 | |
clarkb | ok so ready to merge the pipeline change? | 18:03 |
pabelanger | I think so | 18:03 |
clarkb | pabelanger: did you want to do that one too? | 18:04 |
pabelanger | clarkb: force-merge, sure | 18:04 |
pabelanger | clarkb: 509196 right? | 18:05 |
clarkb | ya | 18:05 |
pabelanger | kk | 18:05 |
pabelanger | done | 18:06 |
pabelanger | guess we kick zuul.o.o now? | 18:06 |
pabelanger | actually, no | 18:06 |
pabelanger | that was just for zuulv3 | 18:06 |
AJaeger | https://review.openstack.org/509244 updates project-config for Zuul v2 check only - please review carefully whether this will work | 18:06 |
pabelanger | clarkb: ready to kick zuulv3.o.o to pickup changes for 509196? | 18:09 |
clarkb | pabelanger: we shouldn't need to kick it right? zuulv3 will pick it up on its own | 18:09 |
pabelanger | no, we still need puppet to sighup for pipeline changes I think | 18:10 |
pabelanger | Hmm, maybe not | 18:10 |
pabelanger | I think I am confusing it with main.yaml | 18:10 |
AJaeger | should we force merge https://review.openstack.org/#/c/509220/ ? | 18:11 |
AJaeger | that's the status.o.o/zuul redirect | 18:11 |
clarkb | lets see if we can get it in normally first maybe? | 18:14 |
clarkb | are we ready to start zuuld and zuul launchers? | 18:15 |
pabelanger | Ya, I think that should merge on its own now | 18:15 |
AJaeger | ok, let's monitor ;) | 18:15 |
pabelanger | I am not seeing infra-check or infra-gate yet on zuulv3.o.o | 18:15 |
clarkb | you can hit http://zuul.openstack.org to get status until redirect is fixed | 18:15 |
clarkb | pabelanger: I'm not sure zuul has chewed through the event queue yet? | 18:17 |
pabelanger | possible | 18:17 |
pabelanger | asking in #zuul | 18:17 |
AJaeger | clarkb: zuul.o.o redirects as well ;( | 18:19 |
clarkb | oh really? | 18:19 |
clarkb | it does :/ | 18:19 |
AJaeger | we should revert that also... | 18:19 |
clarkb | jeblair: mordred fungi ok I think we are just waiting for zuulv3 to pick up the new pipelines | 18:20 |
clarkb | once that is done I think we start zuuld and zuul launchers and we can work through getting status pages pointing to the right place | 18:20 |
pabelanger | yah, think so too | 18:21 |
pabelanger | should we maybe consider a stop / start of zuulv3 to clear out pipelines? | 18:22 |
clarkb | jeblair: ^ what do you think? | 18:23 |
jeblair | clarkb: ideally v3 should eject everything from gate after it reconfigures with that change | 18:26 |
clarkb | gotcha, so we should wait and confirm that happens? | 18:27 |
pabelanger | I'm going to have to step away again here shortly | 18:30 |
pabelanger | waiting for zuulv3.o.o to pickup 509196 still | 18:31 |
pabelanger | nodepool.o.o is ready, just waiting on zuulv2.5 to be started for gearman | 18:31 |
clarkb | ok | 18:31 |
pabelanger | we'll also need to start zuul-launchers too | 18:31 |
* clarkb waits patiently. Will be around to start services once its done | 18:31 | |
pabelanger | great | 18:32 |
AJaeger | where do we redirect zuul.openstack.org to zuulv3.openstack.org in our config? Can anybody revert that change? | 18:32 |
clarkb | AJaeger: I'm not seeing it either | 18:34 |
clarkb | fungi: ^ you did the redirect stuff, do you know? | 18:35 |
fungi | we should just be able to unset zuul.o.o from the emergency disable list | 18:37 |
fungi | that wasn't done through configuration management | 18:38 |
clarkb | aha | 18:38 |
clarkb | that is why I don't see it :) | 18:38 |
fungi | it's a one-line addition to the vhost config on the server | 18:38 |
fungi | that and i commented out the rewrites for the local api, i believe | 18:39 |
clarkb | ok, infra-check and infra-gate are present now | 18:40 |
clarkb | the gate has not been evicted yet | 18:40 |
clarkb | jeblair: ^ if you want to look at that before we turn on zuulv2 (probably a good idea to avoid fighting) | 18:40 |
jeblair | checking | 18:41 |
AJaeger | interesting to see what happens with 509145 - a project-config change that just finished in gate queue but is not merged | 18:41 |
jeblair | zuul has not gotten around to processing the gate pipeline after the reconfig yet | 18:42 |
jeblair | so -- status indeterminate there | 18:42 |
* clarkb remains patient then | 18:43 | |
jeblair | it probably has a bunch of dynamic configs in check to redo | 18:43 |
AJaeger | changes are moving over now... | 18:43 |
jeblair | gate manager is running | 18:45 |
jeblair | okay, it's done. i guess it doesn't check those constraints after they've already been enqueued | 18:47 |
jeblair | so we may as well restart zuulv3 to clear it out | 18:47 |
clarkb | jeblair: do you want to do that and I will work on starting zuul on zuul.o.o and the zl nodes? | 18:47 |
jeblair | clarkb: will do | 18:47 |
clarkb | ok starting zuulv2 now as well | 18:48 |
clarkb | here goes | 18:48 |
clarkb | do I want zuul-executor or zuul-launcher on zl0X? | 18:49 |
fungi | launcher | 18:50 |
clarkb | kk starting those now | 18:50 |
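A sketch of the service starts being discussed; the init-script names are assumptions based on the v2 deployment:

```bash
# On zuul.o.o: start the v2 scheduler (geard runs inside it, which is
# what the nodepool launchers were waiting on)
sudo service zuul start
# On each launcher node: start the v2 launcher
for host in zl01 zl02 zl03 zl04 zl05 zl06; do
    ssh "$host" sudo service zuul-launcher start
done
```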
clarkb | fungi: can you edit the redirect on zuul.o.o? | 18:50 |
fungi | clarkb: on it | 18:50 |
clarkb | job is running on zl01 | 18:51 |
clarkb | going to start the other launchers since that looks good | 18:51 |
fungi | i've removed "RedirectMatch temp ^/(.*) http://zuulv3.openstack.org/$1" and uncommented all the old RewriteRule lines | 18:52 |
fungi | also reloaded apache2 | 18:52 |
fungi | should i go ahead and remove zuul.o.o from the emergency disable list too? | 18:52 |
clarkb | fungi: I think that should be safe now? | 18:52 |
fungi | done | 18:52 |
clarkb | infra-root ^ any reason not to do that? | 18:52 |
fungi | i was pretty sure the only reason we had it in there was to avoid reverting the redirect | 18:53 |
clarkb | zl01 - zl06 have launcher running | 18:53 |
mordred | clarkb, fungi: wfm | 18:53 |
jeblair | ++ | 18:53 |
clarkb | looks like those are the launchers we've got | 18:53 |
jeblair | do we need to delete some ze and add zl? | 18:54 |
clarkb | jeblair: possibly, since we've flipped the quotas | 18:54 |
clarkb | though maybe we want to watch it and see how it holds up with the lower quota on v2? | 18:55 |
clarkb | fungi: I think it is safe to run the post rollback -2 clearout now (since zuulv3 should not leave any more after the restart it just had) | 18:55 |
fungi | on it | 18:56 |
fungi | there was only one new verify -2, and i've cleared it now | 18:57 |
clarkb | fungi: status.o.o/zuul redirect was in puppet right? | 18:57 |
clarkb | do we have a revert of that change yet? | 18:57 |
fungi | correct, and i haven't seen a revert for it yet | 18:58 |
AJaeger | check https://review.openstack.org/#/q/status:open+++topic:zuulv3-rollback | 18:58 |
AJaeger | https://review.openstack.org/509220 is the revert | 18:58 |
clarkb | I've approved and rechecked that change | 18:58 |
fungi | aha, i missed that | 19:00 |
AJaeger | since we froze the zuul v2 files, could you review https://review.openstack.org/#/c/509158 again - it gives a non-voting -1 for any change to them, to alert us | 19:01 |
mordred | we should send out a note to anyone who has a .zuul.yaml in their repo to remind them that since v3 isn't voting it will be possible for them to land changes that contain broken syntax | 19:15 |
clarkb | ya the change to fix the status redirect is about to merge | 19:16 |
mordred | so if they ARE making changes to their .zuul.yaml as part of working on things in the check pipeline- they need to be careful to not land changes if v3 has complained about syntax errors | 19:16 |
clarkb | I think once that does we should send a general "it's in place like we described" note with details like that | 19:16 |
jeblair | mordred: s/voting/gating/ but yes | 19:16 |
mordred | jeblair: ++ | 19:16 |
jeblair | (and of course, the vote may be delayed or never show up) | 19:16 |
mordred | yah. basically if you make a change to your .zuul.yaml file - please make sure that v3 actually runs check jobs and finishes before approving | 19:17 |
jeblair | ++ | 19:18 |
clarkb | mordred: are you going to write and send that update? | 19:18 |
mordred | jeblair: I was trying to think of a fancy job we could put in v2 and add to the merge-check template with a files matcher to make sure it only ran on patches with .zuul.yaml in them ... but I couldn't come up with any way to make it anything other than informational | 19:18 |
mordred | clarkb: yah - I can do that | 19:18 |
clarkb | fungi is kick.sh'ing status.o.o so we should be set for that to go out when ready | 19:19 |
fungi | yup | 19:19 |
fungi | it's just about completed now | 19:19 |
jeblair | mordred: we could have it -2 | 19:19 |
jeblair | mordred: oh in v2, sorry | 19:19 |
fungi | fatal: [status.openstack.org]: FAILED! => {"changed": false, "failed": true, "msg": "/usr/bin/timeout -s 9 30m /usr/bin/puppet apply /opt/system-config/production/manifests/site.pp --logdest syslog --environment 'production' --no-noop --detailed-exitcodes failed with return code: 6", "rc": 6, "stderr": "", "stdout": "", "stdout_lines": []} | 19:19 |
jeblair | mordred: yeah, that's hard in v2 | 19:19 |
fungi | um | 19:19 |
clarkb | fungi: to the syslog! | 19:20 |
fungi | oh, i bet that's because of 508564 not merging yet | 19:20 |
fungi | i can get the apache config revert change applied manually for now, but getting 508564 in would be nice | 19:22 |
clarkb | fungi: I'll recheck 8564 | 19:22 |
fungi | okay, redirect has been manually reverted on status.o.o and apache2 reloaded | 19:23 |
fungi | looks like it's returning the v2 status page again | 19:24 |
mordred | infra-root: https://etherpad.openstack.org/p/yLGexJjd7U | 19:24 |
fungi | it looks so quaint and old-fashioned now | 19:24 |
dmsimard | mordred: oh man that's a nasty side effect of not gating with v3 | 19:25 |
mordred | dmsimard: yah. this is amongst the reasons partial rollout was a thing we were trying to avoid :) | 19:26 |
clarkb | mordred: put a note at the top that we are running in that mode now | 19:26 |
mordred | fungi: we _could_ also force-merge a rollback on their repo :) | 19:27 |
mordred | jinx | 19:27 |
fungi | heh | 19:27 |
AJaeger | ;) | 19:27 |
dmsimard | is it obvious in zuul logs when there is a syntax error and where it is from ? | 19:31 |
dmsimard | if it happens, are we easily able to tell which review merged that was the culprit ? | 19:31 |
clarkb | dmsimard: zuul identifies the file and location so ya should be fairly obvious | 19:32 |
clarkb | and I think that it shows up in the logs because the messages zuul leaves on gerrit are logged in the zuul log too iirc | 19:32 |
dmsimard | clarkb: yeah but I mean, if someone merges a typo, every file will now yield a syntax error right ? | 19:32 |
clarkb | maybe? that I don't know | 19:33 |
AJaeger | dmsimard: I think so | 19:33 |
mordred | yah | 19:33 |
AJaeger | dmsimard: but we should know which repo and file easily | 19:34 |
mordred | I mean - we'll know :) | 19:34 |
mordred | infra-root: we good with that etherpad now? I'll hit send if so | 19:34 |
dmsimard | ok, if you're able to tell easily that's good | 19:34 |
fungi | lgtm | 19:34 |
AJaeger | mordred: lgtm | 19:34 |
AJaeger | and then send an #status notice? | 19:35 |
jdandrea | AJaeger Better! https://review.openstack.org/#/c/508924/ (ignore the Zuul results, right?) | 19:35 |
clarkb | ship it | 19:35 |
jdandrea | Oops - I retract that - wrong channel. | 19:36 |
jeblair | ++ | 19:37 |
*** bnemec has quit IRC | 19:49 | |
* AJaeger is waiting for zuulv3 to process jobs again - it's been recalculating the queue for the last 40+ mins | 19:51 |
fungi | like old times! ;) | 19:53 |
AJaeger | ;) | 19:54 |
AJaeger | yeah, https://review.openstack.org/509158 finally got a +1. Since it has two +2s, I'll add my +A to my own change... - this is the change that gives a -1 on the frozen files | 20:03 |
AJaeger | Just noticed that on zuul-jobs we now have a job running in check (build-sphinx) and others in infra-check... will that cause problems? | 20:09 |
jeblair | AJaeger: yes, we should remove it from check | 20:10 |
AJaeger | so, changing the template? On it... | 20:11 |
clarkb | fyi there is apparently a problem with devstack and devstack plugins that boris says wasn't a problem before the v3 rollout | 21:42 |
clarkb | so I'm helping to debug that | 21:42 |
*** bnemec has joined #openstack-infra-incident | 23:25 |