Friday, 2018-07-06

*** yolanda__ has joined #zuul00:09
*** yolanda_ has quit IRC00:13
*** ianychoi_ is now known as ianychoi01:16
*** yolanda_ has joined #zuul01:51
*** yolanda__ has quit IRC01:54
*** yolanda__ has joined #zuul01:59
*** yolanda_ has quit IRC02:02
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: wip: add github event filter debug  https://review.openstack.org/58054703:23
*** swest has joined #zuul05:11
*** yolanda_ has joined #zuul05:48
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Add test with unresolvable static node  https://review.openstack.org/58006305:50
*** yolanda__ has quit IRC05:52
*** fbo|off is now known as fbo06:04
*** nchakrab has joined #zuul06:12
*** nchakrab_ has joined #zuul06:13
*** nchakrab has quit IRC06:17
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Fix relaunch attempts when hitting quota errors  https://review.openstack.org/53693006:39
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Fix relaunch attempts when hitting quota errors  https://review.openstack.org/53693006:41
*** dmellado has joined #zuul06:46
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Fix race at shutdown  https://review.openstack.org/58056406:46
*** gtema_ has joined #zuul06:57
*** yolanda_ is now known as yolanda07:12
*** nchakrab_ has quit IRC07:20
*** nchakrab has joined #zuul07:40
*** nchakrab_ has joined #zuul07:42
*** nchakrab has quit IRC07:46
*** jpena|off is now known as jpena07:47
*** yolanda_ has joined #zuul07:54
*** yolanda has quit IRC07:56
*** hashar has joined #zuul09:29
*** abelur has quit IRC10:07
*** abelur has joined #zuul10:07
*** nchakrab_ has quit IRC10:38
*** nchakrab has joined #zuul10:39
*** nchakrab has quit IRC10:53
*** nchakrab has joined #zuul10:54
*** yolanda_ is now known as yolanda11:12
*** jpena is now known as jpena|lunch11:47
*** rlandy has joined #zuul12:37
*** nchakrab_ has joined #zuul12:39
*** nchakrab has quit IRC12:42
*** nchakrab_ has quit IRC12:43
mordredtobiash: your quota patch makes me think we should be raising a QuotaExceeded error in openstacksdk so you don't have to do a string match on the exception text12:50
tobiashmordred: that absolutely makes sense :)12:50
mordredtobiash: I'll put it on my list12:51
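
[Editor's sketch] The idea discussed above, a typed quota error instead of matching on exception text, can be illustrated roughly as follows. The class and function names here (CloudError, QuotaExceeded, looks_like_quota_error, classify) are hypothetical and are not taken from openstacksdk or nodepool; this only shows why a type check tends to be less fragile than a string match.

```python
class CloudError(Exception):
    """Hypothetical base error; not an actual openstacksdk class."""


class QuotaExceeded(CloudError):
    """Hypothetical typed error raised when a quota is hit."""


def looks_like_quota_error(exc):
    # Fragile approach: depends on the exact wording of the message.
    return "quota exceeded" in str(exc).lower()


def classify(exc):
    # Typed approach: independent of the message wording.
    return "quota" if isinstance(exc, QuotaExceeded) else "other"


# If the message wording changes, the string match silently stops working,
# while the isinstance check keeps working.
reworded = QuotaExceeded("Limit reached for instances")
string_match = looks_like_quota_error(reworded)
typed_match = classify(reworded)
print(string_match, typed_match)
```
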
*** yolanda_ has joined #zuul12:54
*** yolanda has quit IRC12:57
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add job to build container images using pbrx  https://review.openstack.org/58016012:58
*** jpena|lunch is now known as jpena12:58
*** nchakrab has joined #zuul13:06
openstackgerritMerged openstack-infra/zuul-jobs master: Install build bindep profiles alongside doc and test  https://review.openstack.org/58052113:08
openstackgerritMerged openstack-infra/nodepool master: Fix race at shutdown  https://review.openstack.org/58056413:15
*** pawelzny has quit IRC13:27
*** gtema_ has quit IRC13:39
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add job to build container images using pbrx  https://review.openstack.org/58016013:42
corvuslogan-: i don't think it needs to be added to the config for depends-on to work, but it would need to be for required-projects (though i'd like to make that unnecessary as well)14:19
corvusmordred, Shrews: quick question on 57282914:33
*** nchakrab has quit IRC14:33
corvusalso, gtema says that 566158 should pass tests when based on 572829 so i'm going to add a depends-on to that.  hope that's ok.14:34
*** nchakrab has joined #zuul14:34
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Use openstacksdk instead of os-client-config  https://review.openstack.org/56615814:34
*** nchakrab has quit IRC14:38
openstackgerritJeremy Stanley proposed openstack-infra/zuul master: Change "core developer" reference  https://review.openstack.org/57924114:45
*** acozine1 has joined #zuul14:56
*** yolanda__ has joined #zuul15:00
Shrewscorvus: i think that has to do with shade now using openstacksdk for cloud config instead of occ. mordred can probably explain better15:00
* Shrews has to afk for a bit15:00
corvusi was especially wondering about dropping the **args at the end15:01
Shrewserr, not shade, but openstacksdk has its own version of config15:01
*** yolanda_ has quit IRC15:03
*** acozine1 has quit IRC15:10
corvusclarkb: i don't understand the "until converted to a string" part of your comment on line 119 of model.py on 57999715:17
clarkbcorvus: the hash function explicitly converts the attributes to str, if the objects are not str prior to that they may hash to the same value afterwards but fail the __eq__ check15:18
*** openstackgerrit has quit IRC15:19
corvusclarkb: ah, i see, thanks.  the answer to that is the same as hashing in general: things in python that use __hash__ (like set membership) also use __eq__ to account for hash collisions.15:19
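
[Editor's sketch] A small illustration of the point made above: containers such as set use __hash__ to pick a bucket and then __eq__ to decide whether two objects are really the same, so equal hashes alone do not merge objects. The JobKey class is a hypothetical stand-in, not the class from zuul's model.py.

```python
class JobKey:
    """Hypothetical object that hashes on the str() of its attributes."""

    def __init__(self, name, branch):
        self.name = name
        self.branch = branch

    def __hash__(self):
        # Attributes are converted to str before hashing,
        # so branch=1 and branch="1" produce the same hash.
        return hash((str(self.name), str(self.branch)))

    def __eq__(self, other):
        if not isinstance(other, JobKey):
            return NotImplemented
        # Equality compares the raw attributes, so 1 != "1".
        return (self.name, self.branch) == (other.name, other.branch)


a = JobKey("job", 1)
b = JobKey("job", "1")

same_hash = hash(a) == hash(b)   # the hashes collide
are_equal = a == b               # but the objects are not equal
distinct = len({a, b})           # so the set keeps both
print(same_hash, are_equal, distinct)
```
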
tobiashcorvus: I spent quite some time again grepping and scrolling through the logs I have and wrote some things up in an etherpad: https://etherpad.openstack.org/p/GQYVEEv2D315:20
tobiashthe relevant logs I have are 3.5 minutes of scheduler logs in which the state must have gotten broken15:21
tobiashthat's 24000 lines of log :/15:21
tobiashso I tried to filter some stuff there, if you want me to grep for additional data I can maybe add more data to that15:22
tobiashunfortunately I have no merger and executor logs of that timeframe15:22
*** gtema has joined #zuul15:23
corvustobiash: cool, if you can save the log for a bit, that'd be great :)15:23
tobiashthat log is on my hard disk and I won't delete it :)15:23
*** nchakrab has joined #zuul15:28
*** hashar is now known as hasharAway15:29
*** nchakrab has quit IRC15:29
*** fbo is now known as fbo|off15:34
tobiashSo at least what I can tell from the logs is that there is no cat job and no reconfig in between15:37
tobiashSo I think we can rule out a locking problem on executor/merger15:38
*** openstackgerrit has joined #zuul15:41
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Report config errors when removing cross-repo jobs  https://review.openstack.org/57999715:41
corvusclarkb: ^ with many more things in the hash soup :)15:41
clarkbcorvus: cool, I'll rereview shortly15:43
corvustobiash: can you double check that zuul-conf-global never shows up as an untrusted project (ie, it wasn't accidentally added as an untrusted project to one of the tenants)?15:43
corvustobiash: did the zuul-conf-global PR Depends-On anything?  did any other change Depends-On it?15:45
corvustobiash: i've added those two questions and one more to the etherpad.  the third question i just put there because i'm looking for some log lines.15:55
*** acozine1 has joined #zuul16:10
*** nchakrab has joined #zuul16:11
*** nchakrab has quit IRC16:13
tobiashcorvus: updated the pad16:17
tobiashI've also added the relevant parts of the tenant config16:18
tobiashthere is also shadowing regarding zuul-jobs involved if that might matter16:18
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Report config errors when removing cross-repo jobs  https://review.openstack.org/57999716:20
corvustobiash: yeah, shadowing is good to keep in mind16:21
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add job to build container images using pbrx  https://review.openstack.org/58016016:21
corvustobiash: in the pad i pointed where i think there are still missing log lines16:24
tobiashcorvus: checking16:24
tobiashcorvus: created layout id was missing16:26
tobiashadded to these lines16:26
clarkbcorvus: heh I left a comment on ps2 of the error change and found there was a ps3 already addressing my comment16:27
tobiashcorvus: what I also noticed is that all except one tenant exclude project on that repo but every tenant scheduled a merge and calculated a layout to just notice in check that nothing needs to be done16:28
corvustobiash: i pasted in the 3 log lines, one of which we should see after loading phase 116:29
tobiashthat is probably unrelated though16:29
corvustobiash: yeah, good to keep in mind16:29
tobiashphase 2 is definitely missing in the log16:30
corvusyeah, i don't expect the phase 2 message in this case16:30
corvusbut we should definitely have exactly one of the 3 that follows16:30
tobiashcorvus: I cannot see any of these three16:31
corvusthe phase 1 message is inside the exception handler for the last one, so we know we've entered that, so if there was an exception, it should be reported there16:31
corvustobiash: interesting.  what git sha are you on, and do you have any local patches?16:31
corvusi need to make sure i'm looking at exactly the right code now :)16:31
corvustobiash: specifically, any local patches which touch manager/__init__.py16:32
tobiashuhm, have to check what state that was at that time16:33
tobiashjust a sec16:33
tobiashcorvus: I cannot tell precisely but it was based on the master from 7 days ago16:42
tobiashI'm currently using a moving branch and probably should tag all deployments...16:42
tobiashI'll try to prepare an equivalent branch16:42
corvustobiash: oh, i see it, there is a path that ends with phase 116:45
tobiashcorvus: is that good or bad?16:46
*** yolanda_ has joined #zuul16:46
corvustobiash: it's good, it makes sense for this case16:46
corvustobiash: this line: http://git.zuul-ci.org/cgit/zuul/tree/zuul/manager/__init__.py#n45216:47
tobiashcorvus: I tried to reproduce the branch as precisely as possible https://github.com/tobiashenkel/zuul/tree/debug16:48
corvustobiash: thanks16:48
tobiashthe base must have been 02682856 or one before that16:49
*** yolanda__ has quit IRC16:49
tobiashand on top my current branches where I reverted one which was created after that16:49
tobiashbut no changes to manager/__init__.py as far as I know16:49
Shrewstobiash: approved https://review.openstack.org/536930, but i think you left a couple of debugging lines in the test we should clean up (noted in comments)17:16
tobiashShrews: oops17:17
Shrewstobiash: oh actually... i spotted something17:20
Shrewstobiash: left another comment in launchesCompleted. I'll leave it to you if you'd rather clean that up in a follow too, or just change that review17:22
tobiashlooking17:23
tobiashShrews: I think that was needed to get the test green, but I can double check17:24
Shrewstobiash: why did you change that for loop to use self.nodeset[:] from self.nodeset?17:25
tobiashto be safe that I can remove items during iteration17:25
tobiashotherwise I also can loop again after that loop to remove the aborted nodes from the nodeset17:25
Shrewsoh, does that make a copy?17:26
tobiashyes17:26
*** jpena is now known as jpena|off17:26
Shrewsah, ok.17:26
tobiashI also can change that to self.nodeset.copy() if you like17:26
tobiashthat's the same17:26
Shrewsi think that's more clear, tbh17:27
tobiashno problem17:27
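
[Editor's sketch] Why the loop iterates over a copy: removing items from a list while iterating over that same list skips elements. The example assumes the nodeset is a plain Python list; the node states are made-up strings, not nodepool objects. A slice ([:]) and a .copy() call both produce a shallow copy.

```python
nodes = ["ready", "aborted", "aborted", "building"]

# Unsafe: mutating the list that is being iterated skips the element
# that slides into the position of the one just removed.
unsafe = list(nodes)
for state in unsafe:
    if state == "aborted":
        unsafe.remove(state)

# Safe: iterate over a shallow copy made with a slice ...
safe_slice = list(nodes)
for state in safe_slice[:]:
    if state == "aborted":
        safe_slice.remove(state)

# ... or, equivalently, with .copy(), which reads more explicitly.
safe_copy = list(nodes)
for state in safe_copy.copy():
    if state == "aborted":
        safe_copy.remove(state)

print(unsafe, safe_slice, safe_copy)
```
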
Shrewstobiash: don't you love how i come up with questions AFTER i click the approve button???17:28
Shrews:)17:28
tobiash:)17:29
corvustime is meaningless17:29
tobiashShrews: ok, just double checked, the zk.ABORTED addition is really needed17:29
Shrewscorvus: not if you can bottle it17:29
*** yolanda__ has joined #zuul17:29
Shrewstobiash: hrm... trying to make the connection in my head17:30
tobiashis this a copy of the nodeset somewhere else?17:30
Shrewstobiash: oh, they aren't removed immediately, only during poll17:31
Shrewsi see it now17:31
tobiashah that was it17:31
*** hasharAway has quit IRC17:32
Shrewstobiash: ok, i'm comfortable approving again. the comments and copy() can be a followup if you want17:32
*** yolanda_ has quit IRC17:32
tobiashShrews: as you like, it's already in my working copy (uncommitted yet)17:33
Shrewsor i can do it if you are EOD17:33
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Report config errors when removing cross-repo jobs  https://review.openstack.org/57999717:33
Shrewstobiash: approved17:33
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Cleanup test_over_quota  https://review.openstack.org/58072817:35
tobiashand the followup ^17:35
Shrewsthx17:35
tobiashwell I'm officially eod since more than three hours ;)17:36
Shrewsoh, sorry. feel free to ignore us after EOD  :)17:36
gtemayeah, EOD is too early in Germany ;-)17:36
tobiashI enjoy working with you guys :)17:37
gtema:-)17:37
openstackgerritMerged openstack-infra/nodepool master: Fix relaunch attempts when hitting quota errors  https://review.openstack.org/53693017:41
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Install config pointing docker at the dockerhub mirror  https://review.openstack.org/58073017:48
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Add job to build container images using pbrx  https://review.openstack.org/58016017:50
*** gtema has quit IRC17:57
*** yolanda_ has joined #zuul18:07
*** yolanda__ has quit IRC18:10
*** acozine1 has quit IRC18:18
*** acozine1 has joined #zuul18:20
*** robled has quit IRC18:23
*** robled has joined #zuul18:24
*** robled has quit IRC18:24
*** robled has joined #zuul18:24
tobiashcorvus: we are getting increasing problems with reconfigurations blocking the whole system and we are thinking about ways how we can improve that18:38
tobiashI think our main problem with that is that the scheduler has a shared event queue for all tenants that is blocked by any reconfiguration event regardless if that's relevant for the tenant or not18:40
tobiashcorvus: what do you think about the possibilities to migrate to an event queue per tenant?18:41
*** acozine1 has quit IRC18:42
tobiashI think that could help in further tenant isolation, less blocking of tenants and it could be a first step in direction of a scale out scheduler18:42
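
[Editor's sketch] A toy model of the proposal above, comparing one shared event queue with one queue per tenant. This is not how the zuul scheduler is implemented; the tenant names and event kinds are invented, and the sketch only counts how many events sit in front of a tenant's first event in each layout.

```python
from collections import defaultdict, deque

# Hypothetical incoming events as (tenant, kind) pairs, in arrival order.
events = [
    ("tenant-a", "reconfig"),
    ("tenant-b", "trigger"),
    ("tenant-a", "trigger"),
]

# Shared queue: every tenant waits behind everything that arrived earlier.
shared = deque(events)

# Per-tenant queues: a tenant only waits behind its own events.
per_tenant = defaultdict(deque)
for tenant, kind in events:
    per_tenant[tenant].append(kind)

# Events ahead of tenant-b's first event in the shared queue.
shared_ahead = next(
    i for i, (tenant, _) in enumerate(shared) if tenant == "tenant-b"
)

# Events ahead of tenant-b's first event in its own queue.
own_ahead = per_tenant["tenant-b"].index("trigger")

print(shared_ahead, own_ahead)
```

In the shared layout tenant-b's trigger sits behind tenant-a's reconfiguration; with per-tenant queues it is at the front of its own queue.
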
*** acozine1 has joined #zuul18:47
corvustobiash: i think there are things that can be done to make the existing reconfiguration less painful.  what issue are you seeing specifically?19:06
corvustobiash: ie, what aspect is slow?19:06
tobiashthe full reconfiguration takes around 5 minutes19:09
corvustobiash: when you add tenants or projects?19:09
tobiashespecially in a multi tenant setup the full reconfig is inefficient because tenants are reconfigured sequentially and thus we cannot use the mergers efficiently19:10
tobiashadding projects19:10
tobiashwhich we do a lot currently19:10
tobiashalso a tenant reconfiguration within several tenants is also taking almost a minute blocking all other tenants19:11
corvustobiash: tenant reconfiguration should be fast, it uses mostly cached data19:11
corvustobiash: re-freezing the jobs is maybe what's taking a long time there?19:11
corvus(re-enqueuing changes)19:12
tobiashunfortunately the tenant reconfiguration of our most active tenant is taking around 40s currently19:13
corvustobiash: it might be useful to have a breakdown of that 40s19:14
tobiashwe're also working on a change that skips tenant reconfiguration if only branches are deleted and no branch contained cached config19:15
tobiashmuch of that is asking github for branches19:15
corvustobiash: but back to full-reconfiguration: we can probably add a new kind of reconfiguration which uses cached config, and use that for changes to main.yaml19:16
corvustobiash: so that nothing automatically does a full reconfiguration.  we can relegate full reconfiguration to just a manual operation that an operator could perform in case something has gone wrong.19:16
corvustobiash: i think the tenant reconfiguration, and the proposed semi-full reconfiguration could both use cached branches (except when a project has changed).  that should reduce a lot of time19:17
tobiashcorvus: you mean comparing the changes to main.yaml and just delete caches of removed projects, then trigger a tenant configuration for every tenant?19:17
corvustobiash: something like that, yes.19:17
tobiashcorvus: a break down of the 40s is about: get branches: 7s, reenqueueing changes: 30s19:24
corvustobiash: we might also be able to optimize adding/removing projects (especially if they have no configuration) -- in those cases, we might be able to short-circuit reconfiguration altogether.19:27
corvustobiash: other than that, i don't think there's much we can do about the re-enqueuing until we parallelize queue processors (zuulv4)19:28
tobiashcorvus: in that example 7 changes were re-enqueued19:30
tobiashthat took around 5s per change19:30
corvustobiash: we might be able to make job freezing more efficient19:31
corvusi have to run now19:32
*** yolanda__ has joined #zuul19:56
*** yolanda_ has quit IRC19:59
*** yolanda_ has joined #zuul20:22
corvustobiash: do you have any log entries related to the branch creation event?20:25
*** yolanda__ has quit IRC20:26
*** yolanda__ has joined #zuul20:31
*** yolanda_ has quit IRC20:34
tobiashcorvus: added the branch push event20:39
corvustobiash: it didn't do any reconfiguration due to the branch creation? (it shouldn't -- i just wanted to make sure the logs confirm that)20:39
corvus(ie, there's no "Submitting tenant reconfiguration event" after that)20:39
tobiashcorvus: no, there was no reconfiguration in that time frame20:40
corvustobiash: anything about shadow jobs after the layout was created?  (see line 59 in etherpad)20:41
tobiashcorvus: before 576538 it would have done a tenant reconfiguration but that was already deployed at that time20:41
corvustobiash: yeah, that agrees with the code on your github debug branch20:42
tobiasha grep for shadow returns nothing20:48
corvustobiash: thanks -- you don't have any shadow jobs?20:48
tobiashcorvus: I'm not sure, the only shadow config is with zuul-jobs and I think the base job there was removed20:49
corvusyeah, moved into zuul-base-jobs20:50
tobiashcorvus: I think the base job was the only reason in the past to shadow zuul-jobs so that's probably not needed anymore20:52
corvustobiash: was there reconfiguration of any kind between the branch creation and the PR?20:52
tobiashcorvus: no20:52
tobiashcorvus: the first reconfigurations happened after the first occurrence of the wrong config20:53
tobiashand a full reconfig an hour later fixed it20:53
corvustobiash: your clone-timeout merge pulled in some more recent fixes21:00
corvushttps://github.com/tobiashenkel/zuul/commit/fbbab57ff2ae11e104e32a36aeaeecae1b9600fd21:00
tobiashcorvus: oh, I overlooked that, the clone-timeout was added today21:02
corvustobiash: did you say you were working on a test?21:04
tobiashcorvus: I tried that but had no luck and zuul behaved as expected21:05
corvustobiash: is it something you can push up to gerrit?21:06
tobiashcorvus: I have to check but I think I've thrown it away in frustration ;)21:06
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Report config errors when removing cross-repo jobs  https://review.openstack.org/57999721:08
corvustobiash: 'git reflog' ? :)21:08
tobiashcorvus: found it, just a sec21:08
tobiashcorvus: I wasn't sure that I even did a commit21:08
openstackgerritTobias Henkel proposed openstack-infra/zuul master: WIP: unsuccessful attempt to reproduce zuul erratic behavior  https://review.openstack.org/58076521:09
tobiashcorvus: ^ really dirty21:09
corvustobiash: do you think we can do that with fake github instead of fake gerrit?  one of the things that strikes me as interesting about this is that a branch is created on a config project21:10
tobiashcorvus: should be possible21:11
tobiashcorvus: fake gerrit was just the quickest copy&paste start21:11
tobiashcorvus: the test case also misses assertions as I just added a break point on the pass and inspected the messages on B21:12
corvusi usually just run in the foreground and look at stdout21:12
tobiashI like the debugger :)21:13
corvusthat can be useful -- but it's good to look at the logs emitted by new tests to make sure they're what we need in prod21:14
tobiashthat's true21:14
corvusanyway -- whatever works best for debugging of course -- i just want to advocate occasionally looking at test logs for that purpose.  :)21:14
corvustobiash: do you want to take a stab at doing a fake github version of that test?21:15
corvus(also, fwiw, i threw a change at openstack's zuul and did not see any issues: https://review.openstack.org/580759 )21:16
tobiashcorvus: I'm going to sleep in a few minutes so probably on monday21:16
corvustobiash: cool.  i think the most interesting parts of this are multi-tenant, github, and creating the branch on the config project21:17
corvusi don't see anything jumping out at me as obviously the problem, but those things seem like they could have plausible connections21:17
tobiashyes, so I'll try to put all three things into the test case21:18
corvustobiash: i'm assuming this has only happened once -- have you seen this *not* happen with the same circumstances?21:18
tobiashcorvus: it was kind of a standard behavior but I've just thrown a similar change into that repo21:22
tobiashgah21:22
tobiashbroken21:22
corvustobiash: yay it's reproducible :)21:23
corvustobiash: with any luck, everyone else is asleep so the logs will be much smaller now, with fewer things going on?21:23
tobiashcorvus: I assume so21:25
tobiashcorvus: so I'll save the logs and repair prod and then go to sleep ;)21:25
corvustobiash: sounds good -- we'll have more data and more ideas to try next week!  thanks!21:26
tobiashok, now I have logs of scheduler, all 8 mergers and all 8 executors21:32
tobiashand reconfig triggered21:33
tobiashcorvus: ok, just did a last test for today. A recheck of that PR also broke it so the branch creation is probably unrelated.21:42
corvustobiash: oh good idea21:42
corvustobiash: we should still keep in mind that it exists on a branch of the repo though21:42
tobiashyah21:43
tobiashbut now time for bed21:43
tobiashn821:43
*** yolanda_ has joined #zuul21:53
*** yolanda__ has quit IRC21:56
*** acozine1 has quit IRC22:10
*** yolanda__ has joined #zuul23:04
*** yolanda_ has quit IRC23:06
*** yolanda has joined #zuul23:07
*** yolanda__ has quit IRC23:09
*** rlandy has quit IRC23:16
openstackgerritJames E. Blair proposed openstack-infra/zuul-jobs master: WIP: Add a role to return file comments  https://review.openstack.org/57903323:25
openstackgerritTristan Cacqueray proposed openstack-infra/zuul-jobs master: configure-mirrors: add set_epel_repository option  https://review.openstack.org/58078123:39
*** yolanda_ has joined #zuul23:43
openstackgerritTristan Cacqueray proposed openstack-infra/zuul-jobs master: configure-mirrors: skip system mirrors if zuul can't sudo  https://review.openstack.org/58078223:45
*** yolanda has quit IRC23:46
