*** efoley has joined #openstack-infra-incident | 08:36 | |
*** efoley has quit IRC | 09:04 | |
*** efoley_ has joined #openstack-infra-incident | 09:05 | |
*** tushar has quit IRC | 09:54 | |
*** efoley_ has quit IRC | 09:58 | |
*** efoley has joined #openstack-infra-incident | 10:05 | |
*** tushar has joined #openstack-infra-incident | 10:05 | |
*** tumbarka has joined #openstack-infra-incident | 10:23 | |
*** tushar has quit IRC | 10:26 | |
*** efoley_ has joined #openstack-infra-incident | 10:28 | |
*** efoley has quit IRC | 10:28 | |
*** efoley has joined #openstack-infra-incident | 10:34 | |
*** efoley_ has quit IRC | 10:34 | |
*** efoley has quit IRC | 10:57 | |
*** efoley has joined #openstack-infra-incident | 11:02 | |
*** jesusaur has quit IRC | 11:19 | |
*** efoley has quit IRC | 11:20 | |
*** efoley_ has joined #openstack-infra-incident | 11:20 | |
*** jesusaur has joined #openstack-infra-incident | 11:33 | |
*** efoley_ has quit IRC | 11:56 | |
*** efoley has joined #openstack-infra-incident | 12:08 | |
AJaeger | I've moved "fixed" issues to the proper section in etherpad and added some more entries (specs publishing) | 12:56 |
* AJaeger will now remove merged changes from the review list | 12:57 | |
fungi | thanks AJaeger! | 13:37 |
fungi | just a heads up, the pitchforks are coming out on the -dev ml now... calls for identifying an acceptable failure threshold by a specific date/time or executing a rollback to v2 | 13:56 |
pabelanger | sigh | 13:59 |
pabelanger | it's been over 12 minutes since zuulv3.o.o has logged to debug.log | 14:02 |
dmsimard | I'd anticipate a whole different set of issues if we rolled back | 14:03 |
*** efoley has quit IRC | 14:03 | |
*** efoley_ has joined #openstack-infra-incident | 14:03 | |
pabelanger | I can see we are trying to promote a change too, I wonder if that is related to the current issue | 14:04 |
pabelanger | mordred: fungi: clarkb: jeblair: ^ | 14:04 |
fungi | yeah, i wouldn't be surprised if the promote has tanked it | 14:05 |
fungi | swap is climbing steadily over recent minutes too | 14:05 |
fungi | my promote command still hasn't returned control to the shell | 14:05 |
fungi | and we're at almost 9gib swap used (up almost 5gib in a matter of minutes) | 14:06 |
mordred | fungi: yah- I think we've likely hit the place we were discussing on friday and have now run for as long as is reasonable to collect data | 14:06 |
pabelanger | Ya, I didn't see how large the change queue was, so maybe zuul is moving around a lot of things currently | 14:06 |
fungi | there were maybe 10 changes in that gate queue, if even that many | 14:07 |
fungi | i don't recall exactly but it wasn't a ton | 14:07 |
pabelanger | So, sounds like we might be thinking of a rollback? | 14:07 |
fungi | swap utilization just dropped by several gib in the past few seconds so maybe it's about done doing... whatever | 14:08 |
fungi | still falling | 14:08 |
pabelanger | zuul processing again | 14:08 |
pabelanger | in debug.log | 14:08 |
mordred | fungi: and would not be opposed to the friday plan of shifting zuulv3 to check-only and turning v2 back on - running v3 in check mode should still trigger the reconfigure issues | 14:10 |
pabelanger | yah | 14:11 |
pabelanger | It does look like zuul-executors are in better shape since rolling out the CPU limits and ansible fixes | 14:12 |
mordred | ++ | 14:12 |
mordred | that was, I believe, a great addition | 14:12 |
AJaeger | if we roll back - I would still freeze zuul/layout.yaml and new job creation. Or have both running in parallel with optional opt-in. We've done some great work on Zuul v3 and I would rather not run the migration scripts again... | 14:13 |
AJaeger | this morning reviews/merges were nicely fast - but then Zuul slowed down suddenly | 14:13 |
mordred | AJaeger: I agree | 14:13 |
pabelanger | swapping is the reason for the slowness, hopefully once we address memory issues, that goes away | 14:14 |
mordred | AJaeger: I think we leave v2 layout/jobs frozen - although allow case-by-case exceptions if we can also make the equivalent change in the v3 defs (there were some where people wanted to disable some v2 jobs right around cutover - I think those are fair) | 14:15 |
mordred | pabelanger: oh - totally - once we figure out whatever is causing zuul to go sideways occasionally | 14:15 |
fungi | swap is definitely in flux... almost at 10gib used now | 14:21 |
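For context, a minimal sketch of how swap can be watched live on the scheduler host while correlating with debug.log timestamps; the exact tooling used here is an assumption:

```bash
# Poll memory and swap usage every 5 seconds, human-readable
watch -n 5 free -h
# or stream periodic samples for later correlation with the logs
vmstat 5
```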
pabelanger | Yup, last logging was 2017-10-03 14:11:12,051 | 14:23 |
pabelanger | i believe we are just out of memory now and will keep swapping | 14:24 |
mordred | fungi, pabelanger: shall we consider restarting? when we're in this state I don't believe we can do the objgraph so I'm fairly sure we will not be able to collect any information that hasn't already been logged | 14:28 |
fungi | my promote command from 40 minutes ago still hasn't returned | 14:29 |
mordred | yah | 14:29 |
fungi | so, yeah i guess a restart is inevitable now | 14:29 |
dmsimard | AJaeger, mordred: some projects (such as tripleo) have had to implement different changes *in* their different projects to support zuul v3. Rolling back would mean breaking them (again) | 14:29 |
mordred | dmsimard: that's a very good point | 14:29 |
fungi | dmsimard: yep, we've been trying to discourage those sorts of changes so as not to complicate possible rollback | 14:30 |
mordred | fungi: yah - unfortunately in some cases not doing those changes was unpossible | 14:30 |
fungi | (similarly, not making changes to things like on-disk scripts in ways which will render them nonfunctional under v2) | 14:30 |
dmsimard | fungi: discourage how ? their jobs would not work without making these changes | 14:30 |
fungi | dmsimard: rather, encourage making changes which don't break v2 usage | 14:30 |
mordred | dmsimard: most of the tripleo changes were made in the tripleo-ci repo right? | 14:31 |
fungi | anyway, if projects have a vested interest in _not_ rolling back to v2 temporarily and trying again in a week or two, then they should be chiming in on that ml thread | 14:32 |
fungi | right now it's projects annoyed that they have merged very few changes since thursday speaking up | 14:32 |
clarkb | tripleo changes should be compatible right? | 14:33 |
clarkb | (I dont know of all the tripleo changes but the ones I was involved in should be) | 14:33 |
dmsimard | clarkb: I don't know, there's a wide variety of changes.. from that sub_nodes_private thing to the hardcoded zuul user vs jenkins | 14:34 |
mordred | fungi, clarkb: should we hit the restart button on the scheduler? | 14:35 |
clarkb | mordred: sorry just catching up, but sounds like it may not recover on its own? | 14:36 |
fungi | i guess. any chance we can get a dump of check and gate pipelines to reenqueue? | 14:36 |
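One way to capture such a dump is to snapshot the status endpoint before restarting; a rough sketch, where the URL and the jq filter's assumptions about the status.json layout are illustrative:

```bash
# Save the scheduler's status for safekeeping
curl -s http://zuulv3.openstack.org/status.json -o /tmp/zuul-status.json
# List the enqueued change ids in the check and gate pipelines
jq -r '.pipelines[]
       | select(.name == "check" or .name == "gate")
       | .change_queues[].heads[][]
       | .id' /tmp/zuul-status.json
```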
Shrews | mordred: restarting would pick up the branch matching fix | 14:36 |
mordred | Shrews: I think jeblair restarted last night with that applied | 14:36 |
fungi | Shrews: i thought jeblair stacked that in last night when he restarted the scheduler | 14:36 |
mordred | yah | 14:37 |
Shrews | ah, saw that it didn't merge until this morning | 14:37 |
mordred | "restarted with 508786 508787 508793 509014 509040 508955 manually applied; should fix branch matchers, use *slightly* less memory, and fix the 'base job not defined' error" | 14:37 |
pabelanger | +1 for restart, if we can save queues great | 14:39 |
clarkb | dmsimard: the user thing is probably the biggest one. Zuul is a valid user in the old setup though so may work too depending on what bits of user stuff are hardcoded | 14:39 |
mordred | my initial thoughts on rollback logistics were that we could redefine the v3 pipelines that are not check and add a pipeline requirement that is unpossible - basically keep the existence of the pipelines but make sure nothing is ever enqueued in them | 14:40 |
dmsimard | mordred: like branch foobar or something ? | 14:41 |
mordred | yah. or require a +10 vote from zuul or something else silly | 14:41 |
clarkb | ++ | 14:42 |
mordred | that way we don't have to actually modify any of the project pipeline defns - then make a second gate pipeline - like v3-gate or something - that we can modify natively v3 projects (zuul, zuul-jobs, openstack-zuul-jobs, project-config) to use | 14:42 |
pabelanger | ya, that seems simple | 14:43 |
mordred | that way we can keep running check jobs on everything and keep battling the memory leak with proper scale - but leave job config mostly frozen and not make re-roll-forward more difficult | 14:43 |
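A minimal sketch of what such an unsatisfiable pipeline requirement could look like; the file path and the exact require syntax are assumptions, not the actual content of 509196:

```bash
# Keep the gate pipeline defined but make its requirement impossible to
# meet, so nothing is ever enqueued (path and syntax are illustrative)
cat >> zuul.d/pipelines.yaml <<'EOF'
- pipeline:
    name: gate
    manager: dependent
    require:
      gerrit:
        approval:
          - username: zuul
            Verified: 10   # zuul never votes +10, so this never matches
EOF
```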
jeblair | catching up | 14:44 |
mordred | morning jeblair ! | 14:44 |
fungi | from the perspective of correlating memory-usage increases with known events, i'm starting to wonder if the promote command has given us a way to trigger a memory spike on demand (suggesting perhaps they're related to dependent queue resets?). maybe worth testing after the next restart | 14:46 |
jeblair | it's decidedly swappy now | 14:49 |
AJaeger | did anybody save the pipeline state? | 14:52 |
jeblair | i think it's fine to restart. and i think it's fine to roll back. | 14:52 |
jeblair | AJaeger: we probably won't be able to do so until zuul becomes responsive again | 14:53 |
AJaeger | jeblair: our lucky day ;( Ok | 14:53 |
fungi | we're at 1 hour 5 minutes since i started the promote of 508344,3 in the gate now, and the command is still hanging | 14:54 |
fungi | i have a feeling attempting to get status data back isn't going to complete until around the same time that does, however long that takes | 14:55 |
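For reference, the hanging command fungi describes would have been a v2-era zuul client invocation along these lines (run on the scheduler host):

```bash
# Move change 508344,3 to the head of the gate pipeline
zuul promote --pipeline gate --changes 508344,3
```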
pabelanger | If we are rolling back, do we want to do 50/50 between nodepool.o.o and nl01.o.o/nl02.o.o for capacity? | 14:55 |
fungi | probably more like 67/33 | 14:56 |
clarkb | fungi ++ | 14:56 |
pabelanger | sure | 14:56 |
mordred | remote: https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines | 14:56 |
pabelanger | let me prepare that patch | 14:56 |
fungi | v2 will need sufficient nodes to handle check, gate, post, et cetera while v3 only needs enough for check | 14:56 |
jeblair | yeah. or even more. we don't need to run that many jobs to find problems. | 14:56 |
clarkb | do we need any other changes than the nodepool shift and pipeline update? | 14:57 |
AJaeger | Let's first write down the new policy and what to do - before rolling back too quickly. | 14:57 |
mordred | there's a strawman for v3 config to block things from running in anything other than check | 14:57 |
jeblair | (we'll find *job* problems at a lower rate, but that's fine -- we're finding them faster than we can fix them atm anyway) | 14:57 |
pabelanger | jeblair: 80/20 then? | 14:57 |
dmsimard | some projects don't even need a rollback | 14:57 |
fungi | prior to rollback, we likely need one more v3 restart so that it can get back to processing changes while we work on the rollback, right? | 14:58 |
dmsimard | like for example puppet is fully migrated and I suspect some other projects have as well | 14:58 |
mordred | fungi: yes | 14:58 |
pabelanger | dmsimard: I don't think we want split gating for projects | 14:58 |
dmsimard | are we going to roll everything back ? | 14:58 |
mordred | dmsimard: if we rollback, those projects will go back to using their v2 jobs | 14:58 |
AJaeger | can we add puppet and others that want it to Zuul v3? let them use mordred's infra-gate? | 14:59 |
mordred | I think there is a fundamental thing we need to decide before we can decide further details ... | 15:00 |
mordred | that is whether or not we want to continue to run v3 job content in check broadly during the v2 + v3 period, or whether we do not want to do that | 15:00 |
mordred | continuing to run check on all the projects will let us continue to iterate on finding and fixing job issues in parallel to figuring out and fixing the memory/config issue - but will obviously eat more nodepool quota | 15:01 |
*** efoley_ has quit IRC | 15:01 | |
clarkb | mordred: I think we can run it and if it doesn't report on every change due to node quota or restarts that is ok | 15:02 |
clarkb | we also need to decide if we allow changes to v2 jobs | 15:02 |
mordred | not running v3 check on all the projects will keep the v3 signal-to-noise level sane and will not use as much node quota, but will suspend further work on v3 job issues | 15:02 |
pabelanger | remote: https://review.openstack.org/509200 Revert "Shift nodepool quota from v2 to v3" | 15:03 |
mordred | clarkb: yes, I agree - I vote for "no, unless the same change is applied to the v3 job" (there are some things, like "I want to turn this job off" that I think may be important exceptions to allow) | 15:03 |
pabelanger | that will be our revert for nodepool | 15:03 |
pabelanger | topic:zuulv3-rollback | 15:03 |
mordred | I'm not convinced on 80/20 - I think we need to decide on the check question first | 15:04 |
pabelanger | sure | 15:04 |
pabelanger | I'll WIP | 15:04 |
mordred | pabelanger: (patch looks good - mostly just saying I think we should get on to the same page on what we want to accomplish) | 15:05 |
jeblair | mordred: i'm not quite sure i understand the question | 15:05 |
pabelanger | mordred: agree | 15:05 |
AJaeger | Questions I see are: a) Run check on all repos? b) Which projects to gate on and c) can projects go fully v3 if they are ready (like puppet)? | 15:05 |
jeblair | what do you mean by "v3 job content"? auto-migrated jobs? or native pre-converted jobs? or new in-repo post-cutover jobs? ....? | 15:05 |
jeblair | mordred: ^ | 15:06 |
AJaeger | Once those three are answered, we can say 80/20 or whatever IMHO | 15:06 |
mordred | jeblair: do we run the currently defined jobs that are defined to be in the check pipeline for all of the projects | 15:06 |
jeblair | mordred: what's the alternative you are considering? run no check jobs except for infra-v3? | 15:07 |
mordred | jeblair: yah - although I personally think we should run check jobs | 15:08 |
jeblair | mordred: yes, if those are the 2 choices you're presenting, i would say run check as-is. | 15:08 |
mordred | cool. infra-root - anyone have a different opinion? ^^ | 15:08 |
clarkb | no that is what I had in mind | 15:09 |
dmsimard | It effectively means cutting the nodepool capacity in half, yeah ? | 15:10 |
dmsimard | I mean, because we'll be running check twice | 15:10 |
mordred | being agreed on that - do we think 80/20 nodepool split will provide enough nodes for us to land zuul, zuul-jobs, openstack-zuul-jobs and project-config patches? would 66/33 split be better? do we have quota? | 15:10 |
jeblair | dmsimard: not half ^ | 15:10 |
pabelanger | +1 to running check as we are now | 15:10 |
AJaeger | dmsimard: 60:40 or 80:20 I guess - since there's gate and post as well which we do not duplicate | 15:11 |
jeblair | mordred: let's make infra-check and give it high priority. | 15:11 |
dmsimard | jeblair: what I mean by that is that regardless of how we split v2 and v3 max-servers, if we end up running check twice we're effectively splitting that down further | 15:11 |
AJaeger | jeblair: +1 | 15:11 |
mordred | jeblair: kk | 15:11 |
jeblair | mordred: then "check" can run as backlogged as it wants, but we can still land changes to zuul-jobs | 15:11 |
pabelanger | ++ | 15:11 |
AJaeger | So, this would allow projects to continue working on their check jobs, correct? They could send patches in, we apply them and iterate on these. | 15:13 |
clarkb | AJaeger: yes | 15:13 |
fungi | probably a terrible question, but do we want to try rebuilding zuulv3 with a bigger flavor? might give us more breathing room between necessary restarts (seems it's levelled off at around 25gib total virtual memory in use for the moment, but that may just be because it hasn't handled anything new for a while) | 15:14 |
mordred | remote: https://review.openstack.org/509196 Prevent non-check changes, add infra pipelines | 15:14 |
mordred | updated with an infra-check | 15:14 |
AJaeger | Is there a variable that scripts can use, like "if running under Zuul v3, use this, otherwise use zuul-cloner"? | 15:14 |
pabelanger | AJaeger: i think the shim will handle that | 15:15 |
AJaeger | pabelanger: but projects might do other stuff as well... | 15:15 |
mordred | I believe there are projects that have shell scripts in their own repos that do things - and some of those people may have already updated their shell scripts to not use zuul-cloner | 15:15 |
mordred | so if we rollback to v2 for them, their jobs will break again and they'll need to revert their changes to their scripts | 15:16 |
dmsimard | yes. | 15:16 |
AJaeger | So, I want to give them a way to use their scripts for both v2 and v3 - and thus start migrating to v3 if they have not | 15:16 |
pabelanger | Right, i think that is the right way. Roll back, revert your zuulv3 changes. Supporting both shouldn't be an option | 15:16 |
mordred | I believe it would be fairly easy to add a "running in zuulv3 env var" | 15:16 |
AJaeger | pabelanger: I disagree, we should give them an option! That will make the future step to v3 easier. | 15:17 |
mordred | so that they could half-revert and put things in an if | 15:17 |
AJaeger | mordred: +1 | 15:17 |
clarkb | can check whoami | 15:17 |
pabelanger | yes, old will be jenkins | 15:17 |
pabelanger | new will be zuul users | 15:17 |
AJaeger | that's one way of doing it - let's document it as recommendation | 15:17 |
jeblair | or, you know, check whatever it is that's breaking. :) | 15:18 |
AJaeger | (if we agree on that - I don't care, I just want it documented and offered) | 15:18 |
mordred | AJaeger: ++ | 15:20 |
jeblair | https://etherpad.openstack.org/p/1sLlNKa7FU | 15:21 |
jeblair | i started writing out what we're discussing | 15:22 |
AJaeger | thanks, jeblair | 15:22 |
jeblair | we could use ZUUL_URL, ZUUL_REF, or LOG_PATH as ways to tell v2 from v3. | 15:26 |
jeblair | i hesitate to suggest the whoami thing because then we'll end up with the username as an API. if we ever change it, we'll break scripts. | 15:26 |
dmsimard | jeblair: the username is fairly foolproof, do we need something else ? | 15:26 |
dmsimard | ah | 15:27 |
jeblair | at least, if folks check that whoami is zuul | 15:27 |
jeblair | if they check for jenkins, that's fine. | 15:27 |
jeblair | the main thing is that the check should be *backwards* facing. | 15:27 |
clarkb | jeblair: ya whoami probably bad idea | 15:28 |
jeblair | LOG_PATH is probably the least likely thing we would add to the backwards-compat var filter | 15:28 |
jeblair | we also don't add ZUUL_URL or ZUUL_REF because there is no value we can put there, but we've entertained the idea of setting them to "N/A" or something, so they are slightly more likely to change. | 15:29 |
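Putting that together, a backwards-facing check in a job script might look like the sketch below; the variable choice follows the reasoning above, while the clone command and paths are illustrative assumptions:

```bash
# Detect Zuul v2 by the presence of a v2-only variable and treat its
# absence as v3; project names and paths are illustrative.
if [ -n "${ZUUL_URL:-}" ]; then
    # Zuul v2: fetch dependencies with zuul-cloner as before
    zuul-cloner git://git.openstack.org openstack/example-project
else
    # Zuul v3: required repos are already prepared on the node
    cd "$HOME/src/git.openstack.org/openstack/example-project"
fi
```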
dmsimard | infra-root: I think zuulv3.o.o is unexpectedly empty right now | 15:30 |
pabelanger | it is swapping! | 15:31 |
mordred | jeblair: ++ | 15:31 |
jeblair | so 80/20 with infra-check as high priority? | 15:41 |
clarkb | wfm | 15:42 |
pabelanger | sure | 15:42 |
AJaeger | jeblair: we can adjust later if needed, it's a first good guess ;) | 15:43 |
pabelanger | we can also bring infracloud back online | 15:43 |
pabelanger | and see if sto2 is good to use again | 15:43 |
pabelanger | both are still disabled | 15:43 |
jeblair | okay, i think that's a factual summary of everything we discussed; any outstanding questions? | 15:46 |
mordred | jeblair: I think it looks good, I like the plan, and I think the top section makes a good status email | 15:46 |
jeblair | mordred: you want to add more words and send it? | 15:47 |
mordred | jeblair: sure! | 15:49 |
mordred | jeblair: also - completely unrelated to this- but in making the infra- pipelines patch I realized that we're not doing sql reporter on anything other than check or gate- followup patch submitted | 15:50 |
jeblair | derp | 15:50 |
jeblair | we'd figure that out eventually | 15:50 |
*** jdandrea has joined #openstack-infra-incident | 15:51 | |
jeblair | so, logistics -- should we restart zuulv3 now, then land changes to do rollback, or stop, force merge them, then bring up both systems in the rollback state? | 15:54 |
pabelanger | wfm | 15:56 |
fungi | etherpad content before the horizontal rule lgtm for a status update | 15:56 |
mordred | infra-root: updated etherpad with wrapper email words | 15:56 |
* clarkb looks | 15:57 | |
pabelanger | looking | 15:57 |
mordred | also - one last question - how badly is this going to bork openstack-health and our subunit2sql friends? | 15:57 |
fungi | i agree with the strikethrough in the first paragraph. we have no solid evidence to indicate it's reconfiguration-related afaik | 15:57 |
fungi | i think subunit2sql is still working to get the v3 job naming change landed | 15:58 |
pabelanger | +1 to etherpad | 15:58 |
clarkb | email lgtm | 15:58 |
clarkb | mordred: I'm not super concerned about it from elasticsearch's perspective | 15:59 |
clarkb | mordred: it's just more data | 15:59 |
fungi | jeblair: how much of the rollback would be able to get accomplished without force merging those changes anyway? | 15:59 |
jeblair | i would like to suggest an alternate first paragraph | 15:59 |
clarkb | mordred: however health looks at pass/fail rates which may skew with check running in v3, but I think it's much more gate focused there | 15:59 |
jeblair | one that does not suggest that the only reason we are rolling back is the memory leak | 15:59 |
clarkb | mordred: my hunch is its ok because of the focus on gate jobs | 15:59 |
jeblair | i'm working on a proposal | 15:59 |
* AJaeger is fine with etherpad, thanks | 16:00 | |
fungi | since jeblair was able to snag pipeline contents, i feel like restarting v3 in the interim until the rollback patches are ready may provide slightly less disruption than the continued not-much-going-on state it's in right now | 16:00 |
mordred | yah - I think we need to restart v3 - if for no other reason than to land the rollback patches | 16:02 |
jeblair | mordred: alternate suggestion for pgraph1 on line 7. | 16:03 |
jeblair | also, last pgraph on line 25 | 16:04 |
mordred | jeblair: wfm | 16:04 |
fungi | i prefer the suggested alternatives | 16:04 |
fungi | however the last paragraph will inevitably result in people asking "how long?" | 16:05 |
jeblair | we could say, 'hopefully in a week or two'. | 16:05 |
fungi | that sounds reasonable | 16:06 |
AJaeger | what about saying "We revisit on Friday the current status"? | 16:06 |
AJaeger | (or Monday?) | 16:06 |
mordred | how about we say 'hopefully in a week or two, but we'll send out updates as we have them?' | 16:06 |
fungi | i think in part some people will want to know because they're eager to get back to not having to think about v2 jobs any longer, while others will simply want to know how much longer their job configuration in v2 will remain frozen in place | 16:07 |
mordred | (updated final line in etherpad) | 16:07 |
fungi | though i guess paragraph 3 doesn't really say v2 configuration is completely frozen, so that's fine | 16:07 |
clarkb | wfm | 16:07 |
pabelanger | a week or two is good for me. If we fix things sooner, I'm sure we'll try a rollout sooner | 16:08 |
mordred | luckily a rollout winds up being very easy at this point- turn the other pipelines back on :) | 16:09 |
AJaeger | pabelanger: if we're ready quickly, let's run zuul v3 for a day or two without interruption ;) | 16:09 |
AJaeger | mordred: agreed, we should be able to go quickly from one to the other... | 16:10 |
pabelanger | ya, we should be in a good spot to move nodes between zuulv2.5 and zuulv3 as we are ready for more load testing too | 16:10 |
pabelanger | I'm currently using topic:zuulv3-rollback for patches that would be needed to rollback. Currently 2 | 16:15 |
mordred | jeblair: you're good with the final version? | 16:15 |
fungi | thanks pabelanger! | 16:15 |
jeblair | mordred: yep | 16:15 |
pabelanger | It looks like the zuul-launchers might have been stopped too, just noting we'll need to start them back up | 16:17 |
mordred | kk. sending | 16:17 |
*** bnemec has joined #openstack-infra-incident | 16:19 | |
*** ying_zuo has joined #openstack-infra-incident | 16:30 | |
AJaeger | mordred: where did you send it to? It's not in openstack-dev archives yet... | 16:31 |
AJaeger | mordred: remote: https://review.openstack.org/509221 - where else do we need this? | 16:37 |
*** dansmith has joined #openstack-infra-incident | 16:37 | |
clarkb | AJaeger: zuul-jobs and zuul. Maybe also project-config itself? | 16:38 |
AJaeger | Let me do zuul-jobs quickly, any volunteer for zuul? | 16:39 |
clarkb | AJaeger: we should probably do it in a single change | 16:40 |
clarkb | as that will be easier to merge | 16:40 |
clarkb | ? | 16:40 |
AJaeger | clarkb: mordred updated project-config. But we need these separate for each repo with individual config | 16:40 |
AJaeger | nothing needs to be done for zuul-jobs | 16:41 |
AJaeger | pushed for zuul now | 16:46 |
AJaeger | not needed for zuul-jobs | 16:46 |
AJaeger | that should be all - please double check | 16:46 |
AJaeger | https://review.openstack.org/#/q/topic:zuulv3-rollback+(status:open+OR+status:merged) | 16:47 |
pabelanger | I don't think it is causing issues, but I think the hostnames on ze09 and ze10 are not correct. I'll look into that shortly once we've rolled back | 16:55 |
pabelanger | more for my own info, I can see in zuul debug.log that every time we do a dynamic reload, we spike the CPU (which is expected I guess). However, after the reload has finished, zuul processes things like a champ, nice and fast, with the zuul-scheduler process at lowish CPU | 17:02 |
mordred | infra-root: I'm back - sorry - had a phone call pop up | 17:03 |
clarkb | fungi is working on clearing out zuul -2 votes, other than that I think we are ready to start implementation? | 17:03 |
fungi | i think we can do those in parallel | 17:07 |
fungi | pabelanger: yeah, i ended up having to manually repair the hostname on the ze01 i rebuilt. seems cloud-init may be racing ansible or something on fresh builds lately | 17:09 |
fungi | basically just edit /etc/hostname to contain an fqdn instead of the shortname, and then reboot | 17:10 |
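In other words, the manual repair amounts to something like this; the hostname shown is illustrative:

```bash
# Put the fqdn into /etc/hostname, then reboot to apply it
echo "ze09.openstack.org" | sudo tee /etc/hostname
sudo reboot
```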
pabelanger | kk | 17:10 |
pabelanger | I'll look into that once after the rollback | 17:10 |
fungi | clarkb: i got the bulk of the v3 verify -2 votes cleared, but will do another pass after we get v2 online | 17:14 |
pabelanger | we seem to be averaging about 1min of CPU time for each dynamic reload, according to debug.log | 17:14 |
pabelanger | maybe a little less | 17:14 |
pabelanger | I think we might be at the point again, where zuul is just processing dynamic reloads. It doesn't look like it has the steam to get ahead of the demand | 17:29 |
clarkb | ok, so what is the process for implementing the rollback, do we force merge or manually apply the pipeline changes and the nodepool quota shift then start the zuulv2 daemon? | 17:34 |
pabelanger | quota should be ready to land now | 17:36 |
pabelanger | https://review.openstack.org/509200/ | 17:37 |
pabelanger | once merged, nodepool-launchers should do the right thing | 17:37 |
jeblair | we have a +1 on the pipeline change, so i say we force-merge that and the quota change, then startup v2 | 17:38 |
*** tobiash has joined #openstack-infra-incident | 17:38 | |
pabelanger | maybe nodepool change first to bleed off resources from launchers | 17:38 |
jeblair | pabelanger: you want to take care of doing that? | 17:39 |
pabelanger | sure | 17:39 |
clarkb | hrm, looking at the pipeline change, quick question | 17:39 |
clarkb | with older zuul if you tried to look at attributes that weren't present in the event it could get in an unhappy state | 17:39 |
clarkb | any concern with using things like current-patchset on post pipeline for that reason? | 17:40 |
jeblair | clarkb: those are attributes of refs, not events, so should be fine | 17:40 |
jeblair | since they are pipeline requirements | 17:40 |
clarkb | cool | 17:40 |
jeblair | clarkb: and i double checked the approvals one -- that should fail closed, ie, refs without approvals do not match and will not be enqueued | 17:41 |
clarkb | jeblair: perfect | 17:41 |
jeblair | clarkb: looks like the same holds for current-patchset | 17:41 |
pabelanger | which is the correct group again to force merge? | 17:41 |
jeblair | pabelanger: project bootstrappers | 17:41 |
pabelanger | jeblair: ty | 17:41 |
pabelanger | okay, 509200 force merged | 17:44 |
pabelanger | removing myself from group | 17:44 |
fungi | pabelanger: i usually pull this out of my command history: `ssh -p 29418 review.openstack.org gerrit set-members "Project\ Bootstrappers" --add fungi` | 17:44 |
fungi | and then --remove instead of --add when i'm done | 17:45 |
pabelanger | fungi: ty | 17:45 |
fungi | faster than fiddling with the webui | 17:45 |
pabelanger | okay, I have kicked both nl01 and nl02 | 17:47 |
AJaeger | next approving https://review.openstack.org/#/c/509196/ (new pipelines)? | 17:47 |
pabelanger | just confirming nodepool-launchers are happy, then will move on to nodepool.o.o | 17:49 |
clarkb | and then once nodepool is done merge pipeline change and start zuulv2? | 17:53 |
clarkb | pretty sure we need the pipeline change in place before zuulv2 starts running (to avoid them fighting) | 17:53 |
jeblair | it would be best i think | 17:53 |
jeblair | i'll translate the latest re-enqueue scripts into v2 | 17:54 |
jeblair | so we can restore queue state (from earlier this morning) | 17:54 |
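Per saved entry, restoring a change with the v2 client looks roughly like this (the project is illustrative; the change id is reused from earlier in the log):

```bash
# Re-enqueue one change into the v2 gate pipeline
zuul enqueue --trigger gerrit --pipeline gate \
     --project openstack/example-project --change 508344,3
```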
AJaeger | Question on project-config repo: Will we run both zuul v2 and zuulv3 checks on it - and merge with v3 only? | 17:55 |
AJaeger | Then we need to remove the v2 jobs from zuul/layout.yaml. Or how to handle this central repo? | 17:55 |
*** bnemec has quit IRC | 17:56 | |
pabelanger | okay, cleaning up some of the ready nodes on nodepool-launchers, then going to kick nodepool.o.o | 17:56 |
jeblair | AJaeger: i vote v2 check only, v3 check/gate | 17:57 |
* AJaeger prepares a patch... | 17:58 | |
pabelanger | kicking nodepool.o.o now | 17:58 |
pabelanger | okay, nodepool is trying to launch servers again, but failing to talk to gearman | 18:00 |
pabelanger | so, think we are ready on nodepool front | 18:00 |
clarkb | (that is expected since geard and zuulv2 are not running on zuul.o.o) | 18:00 |
pabelanger | yah | 18:00 |
*** bnemec has joined #openstack-infra-incident | 18:01 | |
clarkb | ok so ready to merge the pipeline change? | 18:03 |
pabelanger | I think so | 18:03 |
clarkb | pabelanger: did you want to do that one too? | 18:04 |
pabelanger | clarkb: force-merge, sure | 18:04 |
pabelanger | clarkb: 509196 right? | 18:05 |
clarkb | ya | 18:05 |
pabelanger | kk | 18:05 |
pabelanger | done | 18:06 |
pabelanger | guess we kick zuul.o.o now? | 18:06 |
pabelanger | actually, no | 18:06 |
pabelanger | that was just for zuulv3 | 18:06 |
AJaeger | https://review.openstack.org/509244 updates project-config for Zuul v2 check only - please review carefully whether this will work | 18:06 |
pabelanger | clarkb: ready to kick zuulv3.o.o to pickup changes for 509196? | 18:09 |
clarkb | pabelanger: we shouldn't need to kick it right? zuulv3 will pick it up on its own | 18:09 |
pabelanger | no, we still need puppet to sighup for pipeline changes I think | 18:10 |
pabelanger | Hmm, maybe not | 18:10 |
pabelanger | I think I am confusing it with main.yaml | 18:10 |
AJaeger | should we force merge https://review.openstack.org/#/c/509220/ ? | 18:11 |
AJaeger | that's the status.o.o/zuul redirect | 18:11 |
clarkb | lets see if we can get it in normally first maybe? | 18:14 |
clarkb | are we ready to start zuuld and zuul launchers? | 18:15 |
pabelanger | Ya, I think that should merge on its own now | 18:15 |
AJaeger | ok, let's monitor ;) | 18:15 |
pabelanger | I am not seeing infra-check or infra-gate yet on zuulv3.o.o | 18:15 |
clarkb | you can hit http://zuul.openstack.org to get status until redirect is fixed | 18:15 |
clarkb | pabelanger: I'm not sure zuul has chewed through the event queue yet? | 18:17 |
pabelanger | possible | 18:17 |
pabelanger | asking in #zuul | 18:17 |
AJaeger | clarkb: zuul.o.o redirects as well ;( | 18:19 |
clarkb | oh really? | 18:19 |
clarkb | it does :/ | 18:19 |
AJaeger | we should revert that also... | 18:19 |
clarkb | jeblair: mordred fungi ok I think we are just waiting for zuulv3 to pick up the new pipelines | 18:20 |
clarkb | once that is done I think we start zuuld and zuul launchers and we can work through getting status pages pointing to the right place | 18:20 |
pabelanger | yah, think so too | 18:21 |
pabelanger | should we maybe consider a stop / start of zuulv3 to clear out pipelines? | 18:22 |
clarkb | jeblair: ^ what do you think? | 18:23 |
jeblair | clarkb: ideally v3 should eject everything from gate after it reconfigures with that change | 18:26 |
clarkb | gotcha, so we should wait and confirm that happens? | 18:27 |
pabelanger | I'm going to have to step away again here shortly | 18:30 |
pabelanger | waiting for zuulv3.o.o to pickup 509196 still | 18:31 |
pabelanger | nodepool.o.o is ready, just waiting on zuulv2.5 to be started for gearman | 18:31 |
clarkb | ok | 18:31 |
pabelanger | we'll also need to start zuul-launchers too | 18:31 |
* clarkb waits patiently. Will be around to start services once its done | 18:31 | |
pabelanger | great | 18:32 |
AJaeger | where do we redirect zuul.openstack.org to zuulv3.openstack.org in our config? Can anybody revert that change? | 18:32 |
clarkb | AJaeger: I'm not seeing it either | 18:34 |
clarkb | fungi: ^ you did the redirect stuff, do you know? | 18:35 |
fungi | we should just be able to unset zuul.o.o from the emergency disable list | 18:37 |
fungi | that wasn't done through configuration management | 18:38 |
clarkb | aha | 18:38 |
clarkb | that is why I don't see it :) | 18:38 |
fungi | it's a one-line addition to the vhost config on the server | 18:38 |
fungi | that and i commented out the rewrites for the local api, i believe | 18:39 |
clarkb | ok, infra-check and infra-gate are present now | 18:40 |
clarkb | the gate has not been evicted yet | 18:40 |
clarkb | jeblair: ^ if you want to look at that before we turn on zuulv2 (probably a good idea to avoid fighting) | 18:40 |
jeblair | checking | 18:41 |
AJaeger | interesting to see what happens with 509145 - a project-config change that just finished in gate queue but is not merged | 18:41 |
jeblair | zuul has not gotten around to processing the gate pipeline after the reconfig yet | 18:42 |
jeblair | so -- status indeterminate there | 18:42 |
* clarkb remains patient then | 18:43 | |
jeblair | it probably has a bunch of dynamic configs in check to redo | 18:43 |
AJaeger | changes are moving over now... | 18:43 |
jeblair | gate manager is running | 18:45 |
jeblair | okay, it's done. i guess it doesn't check those constraints after they've already been enqueued | 18:47 |
jeblair | so we may as well restart zuulv3 to clear it out | 18:47 |
clarkb | jeblair: do you want to do that and I will work on starting zuul on zuul.o.o and the zl nodes? | 18:47 |
jeblair | clarkb: will do | 18:47 |
clarkb | ok starting zuulv2 now as well | 18:48 |
clarkb | here goes | 18:48 |
clarkb | do I want zuul-executor or zuul-launcher on zl0X? | 18:49 |
fungi | launcher | 18:50 |
clarkb | kk starting those now | 18:50 |
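A sketch of the service starts being discussed; the init-script names are assumptions based on the v2 deployment:

```bash
# On zuul.o.o: start the v2 scheduler (geard runs inside it, which is
# what the nodepool launchers were waiting on)
sudo service zuul start
# On each launcher node: start the v2 launcher
for host in zl01 zl02 zl03 zl04 zl05 zl06; do
    ssh "$host" sudo service zuul-launcher start
done
```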
clarkb | fungi: can you edit the redirect on zuul.o.o? | 18:50 |
fungi | clarkb: on it | 18:50 |
clarkb | job is running on zl01 | 18:51 |
clarkb | going to start the other launchers since that looks good | 18:51 |
fungi | i've removed "RedirectMatch temp ^/(.*) http://zuulv3.openstack.org/$1" and uncommented all the old RewriteRule lines | 18:52 |
fungi | also reloaded apache2 | 18:52 |
fungi | should i go ahead and remove zuul.o.o from the emergency disable list too? | 18:52 |
clarkb | fungi: I think that should be safe now? | 18:52 |
fungi | done | 18:52 |
clarkb | infra-root ^ any reason not to do that? | 18:52 |
fungi | i was pretty sure the only reason we had it in there was to avoid reverting the redirect | 18:53 |
clarkb | zl01 - zl06 have launcher running | 18:53 |
mordred | clarkb, fungi: wfm | 18:53 |
jeblair | ++ | 18:53 |
clarkb | looks like those are the launchers we've got | 18:53 |
jeblair | do we need to delete some ze and add zl? | 18:54 |
clarkb | jeblair: possibly, since we've flipped the quotas | 18:54 |
clarkb | though maybe we want to watch it and see how it holds up with the lower quota on v2? | 18:55 |
clarkb | fungi: I think it is safe to run the post rollback -2 clearout now (since zuulv3 should not leave any more after the restart it just had) | 18:55 |
fungi | on it | 18:56 |
fungi | there was only one new verify -2, and i've cleared it now | 18:57 |
clarkb | fungi: status.o.o/zuul redirect was in puppet right? | 18:57 |
clarkb | do we have a revert of that change yet? | 18:57 |
fungi | correct, and i haven't seen a revert for it yet | 18:58 |
AJaeger | check https://review.openstack.org/#/q/status:open+++topic:zuulv3-rollback | 18:58 |
AJaeger | https://review.openstack.org/509220 is the revert | 18:58 |
clarkb | I've approved and rechecked that change | 18:58 |
fungi | aha, i missed that | 19:00 |
AJaeger | since we froze the zuul v2 files, could you review https://review.openstack.org/#/c/509158 again - it gives a non-voting -1 for any change to them, to alert us | 19:01 |
mordred | we should send out a note to anyone who has a .zuul.yaml in their repo to remind them that since v3 isn't voting it will be possible for them to land changes that contain broken syntax | 19:15 |
clarkb | ya the change to fix the status redirect is about to merge | 19:16 |
mordred | so if they ARE making changes to their .zuul.yaml as part of working on things in the check pipeline- they need to be careful to not land changes if v3 has complained about syntax errors | 19:16 |
clarkb | I think once that does we should send a general "it's in place like we described" note with details like that | 19:16 |
jeblair | mordred: s/voting/gating/ but yes | 19:16 |
mordred | jeblair: ++ | 19:16 |
jeblair | (and of course, the vote may be delayed or never show up) | 19:16 |
mordred | yah. basically if you make a change to your .zuul.yaml file - please make sure that v3 actually runs check jobs and finishes before approving | 19:17 |
jeblair | ++ | 19:18 |
clarkb | mordred: are you going to write and send that update? | 19:18 |
mordred | jeblair: I was trying to think of a fancy job we could put in v2 and add to the merge-check template with a files matcher to make sure it only ran on patches with .zuul.yaml in them ... but I couldn't come up with any way to make it anything other than informational | 19:18 |
mordred | clarkb: yah - I can do that | 19:18 |
clarkb | fungi is kick.sh'ing status.o.o so we should be set for that to go out when ready | 19:19 |
fungi | yup | 19:19 |
fungi | it's just about completed now | 19:19 |
jeblair | mordred: we could have it -2 | 19:19 |
jeblair | mordred: oh in v2, sorry | 19:19 |
fungi | fatal: [status.openstack.org]: FAILED! => {"changed": false, "failed": true, "msg": "/usr/bin/timeout -s 9 30m /usr/bin/puppet apply /opt/system-config/production/manifests/site.pp --logdest syslog --environment 'production' --no-noop --detailed-exitcodes failed with return code: 6", "rc": 6, "stderr": "", "stdout": "", "stdout_lines": []} | 19:19 |
jeblair | mordred: yeah, that's hard in v2 | 19:19 |
fungi | um | 19:19 |
clarkb | fungi: to the syslog! | 19:20 |
fungi | oh, i bet that's because of 508564 not merging yet | 19:20 |
fungi | i can get the apache config revert change applied manually for now, but getting 508564 in would be nice | 19:22 |
clarkb | fungi: I'll recheck 8564 | 19:22 |
fungi | okay, redirect has been manually reverted on status.o.o and apache2 reloaded | 19:23 |
fungi | looks like it's returning the v2 status page again | 19:24 |
mordred | infra-root: https://etherpad.openstack.org/p/yLGexJjd7U | 19:24 |
fungi | it looks so quaint and old-fashioned now | 19:24 |
dmsimard | mordred: oh man that's a nasty side effect of not gating with v3 | 19:25 |
mordred | dmsimard: yah. this is amongst the reasons partial rollout was a thing we were trying to avoid :) | 19:26 |
clarkb | mordred: put a note at the top that we are running in that mode now | 19:26 |
mordred | fungi: we _could_ also force-merge a rollback on their repo :) | 19:27 |
mordred | jinx | 19:27 |
fungi | heh | 19:27 |
AJaeger | ;) | 19:27 |
dmsimard | is it obvious in zuul logs when there is a syntax error and where it is from ? | 19:31 |
dmsimard | if it happens, are we easily able to tell which review merged that was the culprit ? | 19:31 |
clarkb | dmsimard: zuul identifies the file and location so ya should be fairly obvious | 19:32 |
clarkb | and I think that it shows up in the logs because the messages zuul leaves on gerrit are logged in the zuul log too iirc | 19:32 |
dmsimard | clarkb: yeah but I mean, if someone merges a typo, every file will now yield a syntax error right ? | 19:32 |
clarkb | maybe? that I don't know | 19:33 |
AJaeger | dmsimard: I think so | 19:33 |
mordred | yah | 19:33 |
AJaeger | dmsimard: but we should know which repo and file easily | 19:34 |
mordred | I mean - we'll know :) | 19:34 |
mordred | infra-root: we good with that etherpad now? I'll hit send if so | 19:34 |
dmsimard | ok, if you're able to tell easily that's good | 19:34 |
fungi | lgtm | 19:34 |
AJaeger | mordred: lgtm | 19:34 |
AJaeger | and then send an #status notice? | 19:35 |
jdandrea | AJaeger Better! https://review.openstack.org/#/c/508924/ (ignore the Zuul results, right?) | 19:35 |
clarkb | ship it | 19:35 |
jdandrea | Oops - I retract that - wrong channel. | 19:36 |
jeblair | ++ | 19:37 |
*** bnemec has quit IRC | 19:49 | |
* AJaeger is waiting for zuulv3 to process jobs again - it's been recalculating the queue for the last 40+ mins | 19:51 |
fungi | like old times! ;) | 19:53 |
AJaeger | ;) | 19:54 |
AJaeger | yeah, https://review.openstack.org/509158 finally got a +1. Since it has two +2s, I'll add my +A to my own change... - this is the change that gives a -1 on the frozen files | 20:03 |
AJaeger | Just noticed that on zuul-jobs we now have a job running in check (build-sphinx) and others in infra-check... will that cause problems? | 20:09 |
jeblair | AJaeger: yes, we should remove it from check | 20:10 |
AJaeger | so, changing the template? On it... | 20:11 |
clarkb | fyi there is apparently a problem with devstack and devstack plugins that boris says wasn't a problem before the v3 rollout | 21:42 |
clarkb | so I'm helping to debug that | 21:42 |
*** bnemec has joined #openstack-infra-incident | 23:25 |