Tuesday, 2017-10-03

*** efoley has joined #openstack-infra-incident08:36
*** efoley has quit IRC09:04
*** efoley_ has joined #openstack-infra-incident09:05
*** tushar has quit IRC09:54
*** efoley_ has quit IRC09:58
*** efoley has joined #openstack-infra-incident10:05
*** tushar has joined #openstack-infra-incident10:05
*** tumbarka has joined #openstack-infra-incident10:23
*** tushar has quit IRC10:26
*** efoley_ has joined #openstack-infra-incident10:28
*** efoley has quit IRC10:28
*** efoley has joined #openstack-infra-incident10:34
*** efoley_ has quit IRC10:34
*** efoley has quit IRC10:57
*** efoley has joined #openstack-infra-incident11:02
*** jesusaur has quit IRC11:19
*** efoley has quit IRC11:20
*** efoley_ has joined #openstack-infra-incident11:20
*** jesusaur has joined #openstack-infra-incident11:33
*** efoley_ has quit IRC11:56
*** efoley has joined #openstack-infra-incident12:08
AJaegerI've moved "fixed" issues to the proper section in etherpad and added some more entries (specs publishing)12:56
* AJaeger will now remove merged changes from the review list12:57
fungithanks AJaeger!13:37
fungijust a heads up, the pitchforks are coming out on the -dev ml now... calls for identifying an acceptable failure threshold by a specific date/time or executing a rollback to v213:56
pabelangersigh13:59
pabelangerit's been over 12 minutes since zuulv3.o.o has logged to debug.log14:02
dmsimardI'd anticipate a whole different set of issues if we rolled back14:03
*** efoley has quit IRC14:03
*** efoley_ has joined #openstack-infra-incident14:03
pabelangerI can see we are trying to promote a change too, I wonder if that is related to the current issue14:04
pabelangermordred: fungi: clarkb: jeblair: ^14:04
fungiyeah, i wouldn't be surprised if the promote has tanked it14:05
fungiswap is climbing steadily over recent minutes too14:05
fungimy promote command still hasn't returned control to the shell14:05
fungiand we're at almost 9gib swap used (up almost 5gib in a matter of minutes)14:06
mordredfungi: yah - I think we've likely hit the place we were discussing on friday and have now run for as long as is reasonable to collect data14:06
pabelangerYa, I didn't see how large the change queue was, so maybe zuul is moving around a lot of things currently14:06
fungithere were maybe 10 changes in that gate queue, if even that many14:07
fungii don't recall exactly but it wasn't a ton14:07
pabelangerSo, sounds like we might be thinking of a rollback?14:07
fungiswap utilization just dropped by several gib in the past few seconds so maybe it's about done doing... whatever14:08
fungistill falling14:08
pabelangerzuul processing again14:08
pabelangerin debug.log14:08
mordredfungi: and I would not be opposed to the friday plan of shifting zuulv3 to check-only and turning v2 back on - running v3 in check mode should still trigger the reconfigure issues14:10
pabelangeryah14:11
pabelangerIt does look like zuul-executors are in better shape since rolling out the limits to CPU and ansible fixes14:12
mordred++14:12
mordredthat was, I believe, a great addition14:12
AJaegerif we roll back, I would still freeze zuul/layout.yaml and new job creation. Or have both running in parallel with optional opt-in. We've done some great work on Zuul v3 and I would rather not run the migration scripts again...14:13
AJaegerthis morning reviews/merges were nicely fast - but then Zuul slowed down suddenly14:13
mordredAJaeger: I agree14:13
pabelangerswapping is the reason for the slowness, hopefully once we address memory issues, that goes away14:14
mordredAJaeger: I think we leave v2 layout/jobs frozen - although allow case-by-case exceptions if we can also make the equivalent change in the v3 defs (there were some where people wanted to disable some v2 jobs right around cutover - I think those are fair)14:15
mordredpabelanger: oh - totally - once we figure out whatever is causing zuul to go sideways occasionally14:15
fungiswap is definitely in flux... almost at 10gib used now14:21
pabelangerYup, last logging was 2017-10-03 14:11:12,05114:23
pabelangeri believe we are just out of memory now and will keep swapping14:24
mordredfungi, pabelanger: shall we consider restarting? when we're in this state I don't believe we can do the objgraph so I'm fairly sure we will not be able to collect any information that hasn't already been logged14:28
fungimy promote command from 40 minutes ago still hasn't returned14:29
mordredyah14:29
fungiso, yeah i guess a restart is inevitable now14:29
dmsimardAJaeger, mordred: some projects (such as tripleo) have had to implement different changes *in* their different projects to support zuul v3. Rolling back would mean breaking them (again)14:29
mordreddmsimard: that's a very good point14:29
fungidmsimard: yep, we've been trying to discourage those sorts of changes so as not to complicate possible rollback14:30
mordredfungi: yah - unfortunately in some cases not doing those changes was unpossible14:30
fungi(similarly, not making changes to things like on-disk scripts in ways which will render them nonfunctional under v2)14:30
dmsimardfungi: discourage how ? their jobs would not work without making these changes14:30
fungidmsimard: rather, encourage making changes which don't break v2 usage14:30
mordreddmsimard: most of the tripleo changes were made in the tripleo-ci repo right?14:31
fungianyway, if projects have a vested interest in _not_ rolling back to v2 temporarily and trying again in a week or two, then they should be chiming in on that ml thread14:32
fungiright now it's projects annoyed that they have merged very few changes since thursday speaking up14:32
clarkbtripleo changes should be compatible right?14:33
clarkb(I dont know of all the tripleo changes but the ones I was involved in should be)14:33
dmsimardclarkb: I don't know, there's a wide variety of changes.. from that sub_nodes_private thing to the hardcoded zuul user vs jenkins14:34
mordredfungi, clarkb: should we hit the restart button on the scheduler?14:35
clarkbmordred: sorry just catching up, but sounds like it may not recover on its own?14:36
fungii guess. any chance we can get a dump of check and gate pipelines to reenqueue?14:36
Shrewsmordred: restarting would pick up the branch matching fix14:36
mordredShrews: I think jeblair restarted last night with that applied14:36
fungiShrews: i thought jeblair stacked that in last night when he restarted the scheduler14:36
mordredyah14:37
Shrewsah, saw that it didn't merge until this morning14:37
mordred"restarted with 508786 508787 508793 509014 509040 508955 manually applied; should fix branch matchers, use *slightly* less memory, and fix the 'base job not defined' error"14:37
pabelanger+1 for restart, if we can save queues great14:39
clarkbdmsimard: the user thing is probably the biggest one. Zuul is a valid user in the old setup though so may work too depending on what bits of user stuff are hardcoded14:39
mordredmy initial thoughts on rollback logistics were that we could redefine the v3 pipelines that are not check and add a pipeline requirement that is unpossible - basically keep the existence of the pipelines but make sure nothing is ever enqueued in them14:40
dmsimardmordred: like branch foobar or something ?14:41
mordredyah. or require a +10 vote from zuul or something else silly14:41
clarkb++14:42
mordredthat way we don't have to actually modify any of the project pipeline defns - then make a second gate pipeline - like v3-gate or something - that we can modify the natively-v3 projects (zuul, zuul-jobs, openstack-zuul-jobs, project-config) to use14:42
pabelangerya, that seems simple14:43
mordredthat way we can keep running check jobs on everything and keep battling the memory leak with proper scale - but leave job config mostly frozen and not make re-roll-forward more difficult14:43
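
To make the pipeline trick above concrete: a gate pipeline that keeps existing but can never be enqueued into might be defined roughly as below. This is only a sketch of the idea, not the actual content of 509196; the +10 requirement is the deliberately unsatisfiable condition mordred describes, and triggers/reporters are omitted.

    # Sketch only - illustrative, not the real project-config change.
    - pipeline:
        name: gate
        manager: dependent
        require:
          gerrit:
            # No account ever leaves a +10 vote, so this requirement is
            # never met and nothing gets enqueued into the pipeline.
            approval:
              - Verified: 10
                username: zuul
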
jeblaircatching up14:44
mordredmorning jeblair !14:44
fungifrom the perspective of correlating memory-usage increases with known events, i'm starting to wonder if the promote command has given us a way to trigger a memory spike on demand (suggesting perhaps they're related to dependent queue resets?). maybe worth testing after the next restart14:46
jeblairit's decidedly swappy now14:49
AJaegerdid anybody save the pipeline state?14:52
jeblairi think it's fine to restart.  and i think it's fine to roll back.14:52
jeblairAJaeger: we probably won't be able to do so until zuul becomes responsive again14:53
AJaegerjeblair: our lucky day ;( Ok14:53
fungiwe're at 1 hour 5 minutes since i started the promote of 508344,3 in the gate now, and the command is still hanging14:54
fungii have a feeling attempting to get status data back isn't going to complete until around the same time that does, however long that takes14:55
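
For reference, the promote fungi has been waiting on is the Zuul client operation that moves a change to the head of a dependent pipeline. The exact invocation is not captured in the log; it would have looked something like the following, where the --tenant argument is an assumption about the v3 client and only the pipeline and change id come from the discussion above.

    # Illustrative only - exact flags not recorded in the log:
    zuul promote --tenant openstack --pipeline gate --changes 508344,3
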
pabelangerIf we are rolling back, do we want to do 50/50 between nodepool.o.o and nl01.o.o/nl02.o.o for capacity?14:55
fungiprobably more like 67/3314:56
clarkbfungi ++14:56
pabelangersure14:56
mordredremote:   https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines14:56
pabelangerlet me prepare that patch14:56
fungiv2 will need sufficient nodes to handle check, gate, post, et cetera while v3 only needs enough for check14:56
jeblairyeah.  or even more.  we don't need to run that many jobs to find problems.14:56
clarkbdo we need any changes other than the nodepool shift and pipeline update?14:57
AJaegerLet's first write down the new policy and what to do - before rolling back too quickly.14:57
mordredthere's a strawman for v3 config to block things from running in anything other than check14:57
jeblair(we'll find *job* problems at a lower rate, but that's fine -- we're finding them faster than we can fix them atm anyway)14:57
pabelangerjeblair: 80/20 then?14:57
dmsimardsome projects don't even need a rollback14:57
fungiprior to rollback, we likely need one more v3 restart so that it can get back to processing changes while we work on the rollback, right?14:58
dmsimardlike for example puppet is fully migrated and I suspect some other projects have as well14:58
mordredfungi: yes14:58
pabelangerdmsimard: I don't think we want split gating for projects14:58
dmsimardare we going to roll everything back ?14:58
mordreddmsimard: if we rollback, those projects will go back to using their v2 jobs14:58
AJaegercan we add puppet and others that want to stay on Zuul v3? let them use mordred's infra-gate?14:59
mordredI think there is a fundamental thing we need to decide before we can decide further details ...15:00
mordredthat is whether or not we want to continue to run v3 job content in check broadly during the v2 + v3 period, or whether we do not want to do that15:00
mordredcontinuing to run check on all the projects will let us continue to iterate on finding and fixing job issues in parallel to figuring out and fixing the memory/config issue - but will obviously eat more nodepool quota15:01
*** efoley_ has quit IRC15:01
clarkbmordred: I think we can run it and if it doesn't report on every change due to node quota or restarts that is ok15:02
clarkbwe also need to decide if we allow changes to v2 jobs15:02
mordrednot running v3 check on all the projects will keep the signal-to-noise level of v3 sane and will not use as much node quota, but will suspend further work on v3 job issues15:02
pabelangerremote:   https://review.openstack.org/509200 Revert "Shift nodepool quota from v2 to v3"15:03
mordredclarkb: yes, I agree - I vote for "no, unless the same change is applied to the v3 job" (there are some things, like "I want to turn this job off" that I think may be important exceptions to allow)15:03
pabelangerthat will be our revert for nodepool15:03
pabelangertopic:zuulv3-rollback15:03
mordredI'm not convinced on 80/20 - I think we need to decide on the check question first15:04
pabelangersure15:04
pabelangerI'll WIP15:04
mordredpabelanger: (patch looks good - mostly just saying I think we should get on to the same page on what we want to accomplish)15:05
jeblairmordred: i'm not quite sure i understand the question15:05
pabelangermordred: agree15:05
AJaegerQuestions I see are: a) Run check on all repos? b) Which projects to gate on? and c) can projects go fully v3 if they are ready (like puppet)?15:05
jeblairwhat do you mean by "v3 job content"?  auto-migrated jobs? or native pre-converted jobs?  or new in-repo post-cutover jobs?  ....?15:05
jeblairmordred: ^15:06
AJaegerOnce those three are answered, we can say 80/20 or whatever IMHO15:06
mordredjeblair: do we run the currently defined jobs that are defined to be in the check pipeline for all of the projects15:06
jeblairmordred: what's the alternative you are considering?  run no check jobs except for infra-v3?15:07
mordredjeblair: yah - although I personally think we should run check jobs15:08
jeblairmordred: yes, if those are the 2 choices you're presenting, i would say run check as-is.15:08
mordredcool. infra-root - anyone have a different opinion? ^^15:08
clarkbno that is what I had in mind15:09
dmsimardIt effectively means cutting the nodepool capacity in half, yeah ?15:10
dmsimardI mean, because we'll be running check twice15:10
mordredbeing agreed on that - do we think 80/20 nodepool split will provide enough nodes for us to land zuul, zuul-jobs, openstack-zuul-jobs and project-config patches? would 66/33 split be better? do we have quota?15:10
jeblairdmsimard: not half ^15:10
pabelanger+1 to running check as we are now15:10
AJaegerdmsimard: 60:40 or 80:20 I guess - since there's gate and post as well which we do not duplicate15:11
jeblairmordred: let's make infra-check and give it high priority.15:11
dmsimardjeblair: what I mean by that is that regardless of how we split v2 and v3 max-servers, if we end up running check twice we're effectively splitting that down further15:11
AJaegerjeblair: +115:11
mordredjeblair: kk15:11
jeblairmordred: then "check" can run as backlogged as it wants, but we can still land changes to zuul-jobs15:11
pabelanger++15:11
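
A rough idea of what the infra-check pipeline jeblair proposes could look like; only the name and the high precedence come from the discussion, everything else is illustrative and this is not the eventual patch.

    # Sketch: an independent pipeline whose node requests are served
    # ahead of the backlogged regular check pipeline.
    - pipeline:
        name: infra-check
        manager: independent
        precedence: high
        trigger:
          gerrit:
            - event: patchset-created
        success:
          gerrit:
            Verified: 1
        failure:
          gerrit:
            Verified: -1
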
AJaegerSo, this would allow projects to continue working on their check jobs, correct? They could send patches in, we apply them and iterate on these.15:13
clarkbAJaeger: yes15:13
fungiprobably a terrible question, but do we want to try rebuilding zuulv3 with a bigger flavor? might give us more breathing room between necessary restarts (seems it's levelled off at around 25gib total virtual memory in use for the moment, but that may just be because it hasn't handled anything new for a while)15:14
mordredremote:   https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines15:14
mordredupdated with an infra-check15:14
AJaegerIs there a variable that scripts can use, like "if running under Zuul v3, use this, otherwise use zuul-cloner"?15:14
pabelangerAJaeger: i think the shim will handle that15:15
AJaegerpabelanger: but projects might do other stuff as well...15:15
mordredI believe there are projects that have shell scripts in their own repos that do things - and some of those people may have already updated their shell scripts to not use zuul-cloner15:15
mordredso if we rollback to v2 for them, their jobs will break again and they'll need to revert their changes to their scripts15:16
dmsimardyes.15:16
AJaegerSo, I want to give them a way to use their scripts for both v2 and v3 - and thus start migrating to v3 if they have not15:16
pabelangerRight, i think that is the right way.  Roll back, revert your zuulv3 changes. Supporting both shouldn't be an option15:16
mordredI believe it would be fairly easy to add a "running in zuulv3" env var15:16
AJaegerpabelanger: I disagree, we should give them an option! That will make the future step to v3 easier.15:17
mordredso that they could half-revert and put things in an if15:17
AJaegermordred: +115:17
clarkbcan check whoami15:17
pabelangeryes, old will be jenkins15:17
pabelangernew will be zuul users15:17
AJaegerthat's one way of doing it - let's document it as recommendation15:17
jeblairor, you know, check whatever it is that's breaking.  :)15:18
AJaeger(if we agree on that - I don't care, I just want it documented and offered)15:18
mordredAJaeger: ++15:20
jeblairhttps://etherpad.openstack.org/p/1sLlNKa7FU15:21
jeblairi started writing out what we're discussing15:22
AJaegerthanks, jeblair15:22
jeblairwe could use ZUUL_URL, ZUUL_REF, or LOG_PATH as ways to tell v2 from v3.15:26
jeblairi hesitate to suggest the whoami thing because then we'll end up with the username as an API.  if we ever change it, we'll break scripts.15:26
dmsimardjeblair: the username is fairly foolproof, do we need something else ?15:26
dmsimardah15:27
jeblairat least, if folks check that whoami is zuul15:27
jeblairif they check for jenkins, that's fine.15:27
jeblairthe main thing is that the check should be *backwards* facing.15:27
clarkbjeblair: ya whoami probably bad idea15:28
jeblairLOG_PATH is probably the least likely thing we would add to the backwards-compat var filter15:28
jeblairwe also don't add ZUUL_URL or ZUUL_REF because there is no value we can put there, but we've entertained the idea of setting them to "N/A" or something, so they are slightly more likely to change.15:29
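
For project scripts that need to behave correctly under both systems, the check jeblair recommends boils down to branching on a v2-only environment variable rather than on whoami. A minimal sketch, with the repository path and the zuul-cloner arguments purely illustrative:

    # Illustrative sketch - LOG_PATH would work equally well as the
    # discriminator, per the discussion above.
    if [ -n "${ZUUL_URL:-}" ]; then
        # Zuul v2: fetch the speculatively merged repo with zuul-cloner.
        zuul-cloner git://git.openstack.org openstack-dev/devstack
    else
        # Zuul v3: required repos are already checked out on the node.
        cd ~/src/git.openstack.org/openstack-dev/devstack
    fi
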
dmsimardinfra-root: I think zuulv3.o.o is unexpectedly empty right now15:30
pabelangerit is swapping!15:31
mordredjeblair: ++15:31
jeblairso 80/20 with infra-check as high priority?15:41
clarkbwfm15:42
pabelangersure15:42
AJaegerjeblair: we can adjust later if needed, it's a good first guess ;)15:43
pabelangerwe can also bring infracloud back online15:43
pabelangerand see if sto2 is good to use again15:43
pabelangerboth are still disabled15:43
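
In nodepool terms, the 80/20 discussion is just about how max-servers is divided between the v2 launcher on nodepool.o.o and the v3 launchers on nl01/nl02. A sketch with made-up provider names and numbers; the v3 launchers nest max-servers under a pool section, shown flat here for brevity.

    # Illustrative only - each launcher has its own nodepool.yaml.
    # nodepool.o.o (v2: check, gate, post, etc.) keeps roughly 80%:
    providers:
      - name: example-cloud
        max-servers: 160
    # nl01/nl02 (v3: check only) get the remaining ~20%, e.g. max-servers: 40
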
jeblairokay, i think that's a factual summary of everything we discussed; any outstanding questions?15:46
mordredjeblair: I think it looks good, I like the plan, and I think the top section makes a good status email15:46
jeblairmordred: you want to add more words and send it?15:47
mordredjeblair: sure!15:49
mordredjeblair: also - completely unrelated to this - but in making the infra- pipelines patch I realized that we're not doing sql reporter on anything other than check or gate - followup patch submitted15:50
jeblairderp15:50
jeblairwe'd figure that out eventually15:50
*** jdandrea has joined #openstack-infra-incident15:51
jeblairso, logistics -- should we restart zuulv3 now, then land changes to do rollback, or stop, force merge them, then bring up both systems in the rollback state?15:54
pabelangerwfm15:56
fungietherpad content before the horizontal rule lgtm for a status update15:56
mordredinfra-root: updated etherpad with wrapper email words15:56
* clarkb looks15:57
pabelangerlooking15:57
mordredalso - one last question - how badly is this going to bork openstack-health and our subunit2sql friends?15:57
fungii agree with the strikethrough in the first paragraph. we have no solid evidence to indicate it's reconfiguration-related afaik15:57
fungii think subunit2sql is still working to get the v3 job naming change landed15:58
pabelanger+1 to etherpad15:58
clarkbemail lgtm15:58
clarkbmordred: I'm not super concerned about it from elasticsearch's perspective15:59
clarkbmordred: it's just more data15:59
fungijeblair: how much of the rollback would be able to get accomplished without force merging those changes anyway?15:59
jeblairi would like to suggest an alternate first paragraph15:59
clarkbmordred: however health looks at pass/fail rates which may skew with check running in v3, but I think it's much more gate focused there15:59
jeblairone that does not suggest that the only reason we are rolling back is the memory leak15:59
clarkbmordred: my hunch is it's ok because of the focus on gate jobs15:59
jeblairi'm working on a proposal15:59
* AJaeger is fine with etherpad, thanks16:00
fungisince jeblair was able to snag pipeline contents, i feel like restarting v3 in the interim until the rollback patches are ready may provide slightly less disruption than the continued not-much-going-on state it's in right now16:00
mordredyah - I think we need to restart v3 - if for no other reason than to land the rollback patches16:02
jeblairmordred: alternate suggestion for pgraph1 on line 7.16:03
jeblairalso, last pgraph on line 2516:04
mordredjeblair: wfm16:04
fungii prefer the suggested alternatives16:04
fungihowever the last paragraph will inevitably result in people asking "how long?"16:05
jeblairwe could say, 'hopefully in a week or two'.16:05
fungithat sounds reasonable16:06
AJaegerwhat about saying "We revisit on Friday the current status"?16:06
AJaeger(or Monday?)16:06
mordredhow about we say 'hopefully in a week or two, but we'll send out updates as we have them?'16:06
fungii think in part some people will want to know because they're eager to get back to not having to think about v2 jobs any longer, while others will simply want to know how much longer their job configuration in v2 will remain frozen in place16:07
mordred(updated final line in etherpad)16:07
fungithough i guess paragraph 3 doesn't really say v2 configuration is completely frozen, so that's fine16:07
clarkbwfm16:07
pabelangera week or two is good for me. If we fix things sooner, I'm sure we'll try a rollout sooner16:08
mordredluckily a rollout winds up being very easy at this point- turn the other pipelines back on :)16:09
AJaegerpabelanger: if we're ready quickly, let's run zuul v3 for a day or two without interruption ;)16:09
AJaegermordred: agreed, we should be able to go quickly from one to the other...16:10
pabelangerya, we should be in a good spot to move nodes between zuulv2.5 and zuulv3 as we are ready for more load testing too16:10
pabelangerI'm currently using topic:zuulv3-rollback for patches that would be needed to roll back. Currently 216:15
mordredjeblair: you're good with the final version?16:15
fungithanks pabelanger!16:15
jeblairmordred: yep16:15
pabelangerIt looks like the zuul-launchers might have been stopped too, just noting we'll need to start them back up16:17
mordredkk. sending16:17
*** bnemec has joined #openstack-infra-incident16:19
*** ying_zuo has joined #openstack-infra-incident16:30
AJaegermordred: where did you send it to? It's not in openstack-dev archives yet...16:31
AJaegermordred: remote: https://review.openstack.org/509221 - where else do we need this?16:37
*** dansmith has joined #openstack-infra-incident16:37
clarkbAJaeger: zuul-jobs and zuul. Maybe also project-config itself?16:38
AJaegerLet me do zuul-jobs quickly, any volunteer for zuul?16:39
clarkbAJaeger: we should probably do it in a single change16:40
clarkbas that will be easier to merge16:40
clarkb?16:40
AJaegerclarkb: mordred updated project-config. But we need these separate for each repo with individual config16:40
AJaegernothing needs to be done for zuul-jobs16:41
AJaegerpushed for zuul now16:46
AJaegernot needed for zuul-jobs16:46
AJaegerthat should be all - please double check16:46
AJaegerhttps://review.openstack.org/#/q/topic:zuulv3-rollback+(status:open+OR+status:merged)16:47
pabelangerI don't think it is causing issues, but I think the hostnames on ze09 and ze10 are not correct.  I'll look into that shortly once we've rolled back16:55
pabelangermore for my own info, I do see in zuul's debug.log that every time we do a dynamic reload, we spike the CPU (which is expected I guess).  However, after the reload has finished, zuul does process things like a champ, nice and fast, with the zuul-scheduler process at lowish CPU17:02
mordredinfra-root: I'm back - sorry - had a phone call pop up17:03
clarkbfungi is working on clearing out zuul -2 votes, other than that I think we are ready to start implementation?17:03
fungii think we can do those in parallel17:07
fungipabelanger: yeah, i ended up having to manually repair the hostname on the ze01 i rebuilt. seems cloud-init may be racing ansible or something on fresh builds lately17:09
fungibasically just edit /etc/hostname to contain an fqdn instead of the shortname, and then reboot17:10
pabelangerkk17:10
pabelangerI'll look into that after the rollback17:10
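
Spelled out, the manual hostname repair fungi describes amounts to something like this on the affected host (the FQDN shown is just an example):

    # Example only: put the FQDN in /etc/hostname, then reboot.
    echo "ze09.openstack.org" | sudo tee /etc/hostname
    sudo reboot
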
fungiclarkb: i got the bulk of the v3 verify -2 votes cleared, but will do another pass after we get v2 online17:14
pabelangerwe seem to be averaging about 1min of CPU time for each dynamic reload, according to debug.log17:14
pabelangermaybe a little less17:14
pabelangerI think we might be at the point again where zuul is just processing dynamic reloads.  It doesn't look like it has the steam to get ahead of the demand17:29
clarkbok, so what is the process for implementing the rollback? do we force merge or manually apply the pipeline changes and the nodepool quota shift then start the zuulv2 daemon?17:34
pabelangerquota should be ready to land now17:36
pabelangerhttps://review.openstack.org/509200/17:37
pabelangeronce merged, nodepool-launchers should do the right thing17:37
jeblairwe have a +1 on the pipeline change, so i say we force-merge that and the quota change, then startup v217:38
*** tobiash has joined #openstack-infra-incident17:38
pabelangermaybe nodepool change first to bleed off resources from launchers17:38
jeblairpabelanger: you want to take care of doing that?17:39
pabelangersure17:39
clarkbhrm, looking at the pipeline change - quick question17:39
clarkbwith older zuul if you tried to look at attributes that weren't present in the event it could get into an unhappy state17:39
clarkbany concern with using things like current-patchset on the post pipeline for that reason?17:40
jeblairclarkb: those are attributes of refs, not events, so should be fine17:40
jeblairsince they are pipeline requirements17:40
clarkbcool17:40
jeblairclarkb: and i double checked the approvals one -- that should fail closed, ie, refs without approvals do not match and will not be enqueued17:41
clarkbjeblair: perfect17:41
jeblairclarkb: looks like the same holds for current-patchset17:41
pabelangerwhich is the correct group again to force merge?17:41
jeblairpabelanger: project bootstrappers17:41
pabelangerjeblair: ty17:41
pabelangerokay, 509200 force merged17:44
pabelangerremoving myself from group17:44
fungipabelanger: i usually pull this out of my command history: `ssh -p 29418 review.openstack.org gerrit set-members "Project\ Bootstrappers" --add fungi`17:44
fungiand then --remove instead of --add when i'm done17:45
pabelangerfungi: ty17:45
fungifaster than fiddling with the webui17:45
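
For completeness, the add/remove pattern fungi describes looks like this; the account name is a placeholder for whoever is doing the force-merge.

    # Temporarily join the privileged group, do the force-merge, then leave it.
    ssh -p 29418 review.openstack.org gerrit set-members "Project Bootstrappers" --add yournick
    # ... force-merge the change via the Gerrit UI ...
    ssh -p 29418 review.openstack.org gerrit set-members "Project Bootstrappers" --remove yournick
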
pabelangerokay, I have kicked both nl01 and nl0217:47
AJaegernext, approving https://review.openstack.org/#/c/509196/ (new pipelines)?17:47
pabelangerjust confirming nodepool-launchers are happy, then will move on to nodepool.o.o17:49
clarkband then once nodepool is done merge pipeline change and start zuulv2?17:53
clarkbpretty sure we need the pipeline change in place before zuulv2 starts running (to avoid them fighting)17:53
jeblairit would be best i think17:53
jeblairi'll translate the latest re-enqueue scripts into v217:54
jeblairso we can restore queue state (from earlier this morning)17:54
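
The re-enqueue jeblair mentions amounts to replaying the saved queue contents through the Zuul v2 client, one change at a time. Roughly, with placeholder projects and change numbers:

    # Placeholders only - the real script iterates over the saved status.json.
    zuul enqueue --trigger gerrit --pipeline gate --project openstack/nova --change 509123,2
    zuul enqueue --trigger gerrit --pipeline check --project openstack/keystone --change 509045,1
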
AJaegerQuestion on project-config repo: Will we run both zuul v2 and zuulv3 checks on it - and merge with v3 only?17:55
AJaegerThen we need to remove the v2 jobs from zuul/layout.yaml. Or how to handle this central repo?17:55
*** bnemec has quit IRC17:56
pabelangerokay, cleaning up some of the ready nodes on nodepool-launchers, then going to kick nodepool.o.o17:56
jeblairAJaeger: i vote v2 check only, v3 check/gate17:57
* AJaeger prepares a patch...17:58
pabelangerkicking nodepool.o.o now17:58
pabelangerokay, nodepool is trying to launch servers again, but failing to talk to gearman18:00
pabelangerso, think we are ready on nodepool front18:00
clarkb(that is expected since geard and zuulv2 are not running on zuul.o.o)18:00
pabelangeryah18:00
*** bnemec has joined #openstack-infra-incident18:01
clarkbok so ready to merge the pipeline change?18:03
pabelangerI think so18:03
clarkbpabelanger: did you want to do that one too?18:04
pabelangerclarkb: force-merge, sure18:04
pabelangerclarkb: 509196 right?18:05
clarkbya18:05
pabelangerkk18:05
pabelangerdone18:06
pabelangerguess we kick zuul.o.o now?18:06
pabelangeractually, no18:06
pabelangerthat was just for zuulv318:06
AJaegerhttps://review.openstack.org/509244 updates project-config for Zuul v2 check only - please review carefully whether this will work18:06
pabelangerclarkb: ready to kick zuulv3.o.o to pickup changes for 509196?18:09
clarkbpabelanger: we shouldn't need to kick it right? zuulv3 will pick it up on its own18:09
pabelangerno, we still need puppet to sighup for pipeline changes I think18:10
pabelangerHmm, maybe not18:10
pabelangerI think I am confusing it with main.yaml18:10
AJaegershould we force merge https://review.openstack.org/#/c/509220/ ?18:11
AJaegerthat's the status.o.o/zuul redirect18:11
clarkblet's see if we can get it in normally first maybe?18:14
clarkbare we ready to start zuuld and zuul launchers?18:15
pabelangerYa, I think that should merge on its own now18:15
AJaegerok, let's monitor ;)18:15
pabelangerI am not seeing infra-check or infra-gate yet on zuulv3.o.o18:15
clarkbyou can hit http://zuul.openstack.org to get status until redirect is fixed18:15
clarkbpabelanger: I'm not sure zuul has chewed through the event queue yet?18:17
pabelangerpossible18:17
pabelangerasking in #zuul18:17
AJaegerclarkb: zuul.o.o redirects as well ;(18:19
clarkboh really?18:19
clarkbit does :/18:19
AJaegerwe should revert that also...18:19
clarkbjeblair: mordred fungi ok I think we are just waiting for zuulv3 to pick up the new pipelines18:20
clarkbonce that is done I think we start zuuld and zuul launchers and we can work through getting status pages pointing to the right place18:20
pabelangeryah, think so too18:21
pabelangershould we maybe consider a stop / start of zuulv3 to clear our pipelines?18:22
pabelangerout*18:22
clarkbjeblair: ^ what do you think?18:23
jeblairclarkb: ideally v3 should eject everything from gate after it reconfigures with that change18:26
clarkbgotcha, so we should wait and confirm that happens?18:27
pabelangerI'm going to have to step away again here shortly18:30
pabelangerwaiting for zuulv3.o.o to pickup 509196 still18:31
pabelangernodepool.o.o is ready, just waiting on zuulv2.5 to be started for gearman18:31
clarkbok18:31
pabelangerwe'll also need to start zuul-launchers too18:31
* clarkb waits patiently. Will be around to start services once its done18:31
pabelangergreat18:32
AJaegerwhere do we redirect zuul.openstack.org to zuulv3.openstack.org in our config? Can anybody revert that change?18:32
clarkbAJaeger: I'm not seeing it either18:34
clarkbfungi: ^ you did the redirect stuff, do you know?18:35
fungiwe should just be able to unset zuul.o.o from the emergency disable list18:37
fungithat wasn't done through configuration management18:38
clarkbaha18:38
clarkbthat is why I don't see it :)18:38
fungiit's a one-line addition to the vhost config on the server18:38
fungithat and i commented out the rewrites for the local api, i believe18:39
clarkbok infra-check and infra-gate are present now18:40
clarkbthe gate has not been evicted yet18:40
clarkbjeblair: ^ if you want to look at that before we turn on zuulv2 (probably a good idea to avoid fighting)18:40
jeblairchecking18:41
AJaegerinteresting to see what happens with 509145 - a project-config change that just finished in gate queue but is not merged18:41
jeblairzuul has not gotten around to processing the gate pipeline after the reconfig yet18:42
jeblairso -- status indeterminate there18:42
* clarkb remains patient then18:43
jeblairit probably has a bunch of dynamic configs in check to redo18:43
AJaegerchanges are moving over now...18:43
jeblairgate manager is running18:45
jeblairokay, it's done.  i guess it doesn't check those constraints after they've already been enqueued18:47
jeblairso we may as well restart zuulv3 to clear it out18:47
clarkbjeblair: do you want to do that and I will work on starting zuul on zuul.o.o and the zl nodes?18:47
jeblairclarkb: will do18:47
clarkbok starting zuulv2 now as well18:48
clarkbhere goes18:48
clarkbdo I want zuul-executor or zuul-launcher on zl0X?18:49
fungilauncher18:50
clarkbkk starting those now18:50
clarkbfungi: can you edit the redirect on zuul.o.o?18:50
fungiclarkb: on it18:50
clarkbjob is running on zl0118:51
clarkbgoing to start the other launchers since that looks good18:51
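
For the record, bringing the v2 side back up amounted to starting the old services again, roughly as below; the service names are assumptions based on the v2 puppet setup rather than commands captured in the log.

    # Assumed service names:
    sudo service zuul start           # v2 scheduler (and gearman) on zuul.o.o
    sudo service zuul-launcher start  # on each of zl01 through zl06
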
fungii've removed "RedirectMatch temp ^/(.*) http://zuulv3.openstack.org/$1" and uncommented all the old RewriteRule lines18:52
fungialso reloaded apache218:52
fungishould i go ahead and remove zuul.o.o from the emergency disable list too?18:52
clarkbfungi: I think that should be safe now?18:52
fungidone18:52
clarkbinfra-root ^ any reason not to do that?18:52
fungii was pretty sure the only reason we had it in there was to avoid reverting the redirect18:53
clarkbzl01 - zl06 have launcher running18:53
mordredclarkb, fungi: wfm18:53
jeblair++18:53
clarkblooks like those are the launchers we've got18:53
jeblairdo we need to delete some ze and add zl?18:54
clarkbjeblair: possibly, since we've flipped the quota split18:54
clarkbthough maybe we want to watch it and see how it holds up with the lower quota on v2?18:55
clarkbfungi: I think it is safe to run the post rollback -2 clearout now (since zuulv3 should not leave any more after the restart it just had)18:55
fungion it18:56
fungithere was only one new verify -2, and i've cleared it now18:57
clarkbfungi: status.o.o/zuul redirect was in puppet right?18:57
clarkbdo we hvae a revert of that change yet?18:57
fungicorrect, and i haven't seen a revert for it yet18:58
AJaegercheck https://review.openstack.org/#/q/status:open+++topic:zuulv3-rollback18:58
AJaegerhttps://review.openstack.org/509220 is the revert18:58
clarkbI've approved and rechecked that change18:58
fungiaha, i missed that19:00
AJaegersince we froze the zuul v2 files, could you review https://review.openstack.org/#/c/509158 again - that gives a non-voting -1 on any change to those files, to alert us19:01
mordredwe should send out a note to anyone who has a .zuul.yaml in their repo to remind them that since v3 isn't voting it will be possible for them to land changes that contain broken syntax19:15
clarkbya the change to fix the status redirect is about to merge19:16
mordredso if they ARE making changes to their .zuul.yaml as part of working on things in the check pipeline- they need to be careful to not land changes if v3 has complained about syntax errors19:16
clarkbI think once that does we should send a general "it's in place like we described" note with details like that19:16
jeblairmordred: s/voting/gating/ but yes19:16
mordredjeblair: ++19:16
jeblair(and of course, the vote may be delayed or never show up)19:16
mordredyah. basically if you make a change to your .zuul.yaml file - please make sure that v3 actually runs check jobs and finishes before approving19:17
jeblair++19:18
clarkbmordred: are you going to write and send that update?19:18
mordredjeblair: I was trying to think of a fancy job we could put in v2 and add to the merge-check template with a files matcher to make sure it only ran on patches with .zuul.yaml in them ... but I couldn't come up with any way to make it anything other than informational19:18
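
The v2 layout.yaml construct mordred is describing would look something like the sketch below; the job name is made up, and as he says it could only ever be informational.

    # Hypothetical job entry in zuul/layout.yaml (v2):
    jobs:
      - name: gate-noop-dot-zuul-warning
        files:
          - '^\.zuul\.yaml$'
        voting: false
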
mordredclarkb: yah - I can do that19:18
clarkbfungi is kick.sh'ing status.o.o so we should be set for that to go out when ready19:19
fungiyup19:19
fungiit's just about completed now19:19
jeblairmordred: we could have it -219:19
jeblairmordred: oh in v2, sorry19:19
fungifatal: [status.openstack.org]: FAILED! => {"changed": false, "failed": true, "msg": "/usr/bin/timeout -s 9 30m /usr/bin/puppet apply /opt/system-config/production/manifests/site.pp --logdest syslog --environment 'production' --no-noop --detailed-exitcodes failed with return code: 6", "rc": 6, "stderr": "", "stdout": "", "stdout_lines": []}19:19
jeblairmordred: yeah, that's hard in v219:19
fungium19:19
clarkbfungi: to the syslog!19:20
fungioh, i bet that's because of 508564 not merging yet19:20
fungii can get the apache config revert change applied manually for now, but getting 508564 in would be nice19:22
clarkbfungi: I'll recheck 50856419:22
fungiokay, redirect has been manually reverted on status.o.o and apache2 reloaded19:23
fungilooks like it's returning the v2 status page again19:24
mordredinfra-root: https://etherpad.openstack.org/p/yLGexJjd7U19:24
fungiit looks so quaint and old-fashioned now19:24
dmsimardmordred: oh man that's a nasty side effect of not gating with v319:25
mordreddmsimard: yah. this is amongst the reasons partial rollout was a thing we were trying to avoid :)19:26
clarkbmordred: put a note at the top that we are running in that mode now19:26
mordredfungi: we _could_ also force-merge a rollback on their repo :)19:27
mordredjinx19:27
fungiheh19:27
AJaeger;)19:27
dmsimardis it obvious in zuul logs when there is a syntax error and where it is from ?19:31
dmsimardif it happens, are we easily able to tell which review merged that was the culprit ?19:31
clarkbdmsimard: zuul identifies the file and location so ya should be fairly obvious19:32
clarkband I think that it shows up in the logs because the messages zuul leaves on gerrit are logged in the zuul log too iirc19:32
dmsimardclarkb: yeah but I mean, if someone merges a typo, every file will now yield a syntax error right ?19:32
clarkbmaybe? that I don't know19:33
AJaegerdmsimard: I think so19:33
mordredyah19:33
AJaegerdmsimard: but we should know which repo and file easily19:34
mordredI mean - we'll know :)19:34
mordredinfra-root: we good with that etherpad now? I'll hit send if so19:34
dmsimardok, if you're able to tell easily that's good19:34
fungilgtm19:34
AJaegermordred: lgtm19:34
AJaegerand then send an #status notice?19:35
jdandreaAJaeger Better! https://review.openstack.org/#/c/508924/ (ignore the Zuul results, right?)19:35
clarkbship it19:35
jdandreaOops - I retract that - wrong channel.19:36
jeblair++19:37
*** bnemec has quit IRC19:49
* AJaeger is waiting for zuulv3 to process jobs again - it's been recalculating the queue for the last 40+ mins19:51
fungilike old times! ;)19:53
AJaeger;)19:54
AJaegeryeah, https://review.openstack.org/509158 finally got a +1. Since it has two +2s, I'll add my +A to my own change... - this is the change that gives a -1 on the frozen files20:03
AJaegerJust noticed we now have a job on zuul-jobs running in check (build-sphinx) and others in infra-check... will that cause problems?20:09
jeblairAJaeger: yes, we should remove it from check20:10
AJaegerso, changing the template? On it...20:11
clarkbfyi there is apparently a problem with devstack and devstack plugins that boris says wasn't a problem before the v3 rollout21:42
clarkbso I'm helping to debug that21:42
*** bnemec has joined #openstack-infra-incident23:25
