Wednesday, 2016-12-07

jeblair	jamielennox: i think the runtime would be the same even if we make the triggers smart enough to act like that -- because ultimately the goal of project-change-merged is to put every open change for a project in the queue	00:00
jeblair	jamielennox: so whether that's one event which enqueues multiple changes (your suggestion), or multiple events each enqueuing one change (current), we still need to go ask gerrit for all the changes	00:01
*** yolanda has joined #zuul		00:01
*** saneax is now known as saneax-_-|AFK		00:01
jeblair	the current approach at least has the advantage of matching up pretty well with the "events map directly to a ref/change" idea, so while creating the synthetic events is weird, the actual event matching is very simple and behaves like the gerrit trigger	00:02
*** pabelanger_ has joined #zuul		00:05
*** yolanda has quit IRC		00:05
pabelanger_	webchat FTW	00:06
pabelanger_	can	00:06
pabelanger_	err	00:06
pabelanger_	can't get to IRC proxy atm	00:06
pabelanger_	ianw: mordred: rbergeron: harlowja: I'm hoping to show up next week with a CORS repo for zookeeper for EPEL7. Like ianw said, not a priority for anybody on internal list.	00:08
harlowja	cool	00:08
mordred	awesome	00:09
pabelanger_	if that goes well, I'll see what is needed to sync into EPEL7	00:09
harlowja	sweet	00:09
pabelanger_	otherwise, roll the CORS repo until centos8?	00:09
mordred	yah - sounds like a good plan	00:09
pabelanger_	since rawhide has it	00:09
mordred	pabelanger_: is a CORS repo like a PPA?	00:09
pabelanger_	ya	00:09
mordred	neat	00:09
mordred	pabelanger_: I, for one, welcome our CORS repo overlords	00:10
ianw	pabelanger_: yeah, just working with ggillies on it a bit right now :)	00:10
* mordred hands pies to pabelanger_ and ianw		00:10
ianw	i think pabelanger_ means COPR	00:10
pabelanger_	my other thought, was just to roll the packaging in openstack-infra, using the same process as zigo	00:10
pabelanger_	oh, haha, ya	00:10
pabelanger_	that	00:10
pabelanger_	COPR	00:10
pabelanger_	now I disappear again	00:11
*** pabelanger_ has quit IRC		00:11
*** yolanda has joined #zuul		00:19
*** yolanda has quit IRC		00:23
*** yolanda has joined #zuul		00:26
*** yolanda has quit IRC		00:29
*** yolanda has joined #zuul		00:30
jamielennox	jeblair: sorry to keep coming in and out, any advice on where this event searching would live?	00:35
jeblair	jamielennox: i think the event searching is currently in the gerrit connection and can stay there. that's called by the onChangeMerged method which i think should move from zuultrigger into the scheduler	00:36
*** yolanda has quit IRC		00:38
jamielennox	so wouldn't the event searching have to be common across connections?	00:38
*** yolanda has joined #zuul		00:44
*** yolanda has quit IRC		00:49
jeblair	jamielennox: yeah, i think this needs to be scoped to the source/connection associated with the originating pipeline. i'm starting to think that maybe you should kick this over to me for a bit -- i keep typing and deleting things i'm not 100% sure of and i'm afraid i may be about to send you astray.	00:51
jamielennox	jeblair: i'm fine to let you play with it for a while	00:52
jamielennox	there does seem to be scoping problems going on that might be bigger issues	00:52
jeblair	jamielennox: yeah, and i think this might have an "interesting" interaction with dynamic layouts	00:52
jamielennox	also for our usage i'm pretty sure we'd be fine with just deleting the zuultrigger and adding it back if and when it's required	00:52
jeblair	jamielennox: anyway, i'll poke at it tomorrow and see where i get	00:53
jamielennox	jeblair: ok, i'll leave it with you	00:53
jeblair	jamielennox: did you grab a test in storyboard?	00:55
jamielennox	jeblair: i don't think so for zuultrigger, i was trying to see if i knew enough to fix it first then went down the rabbithole	00:56
jeblair	ok, i'll grab it tomorrow then	00:57
jamielennox	night - and thanks	00:57
*** yolanda has joined #zuul		00:59
*** yolanda has quit IRC		01:07
*** jamielennox is now known as jamielennox|away		01:07
*** jamielennox|away is now known as jamielennox		01:21
*** yolanda has joined #zuul		01:31
*** yolanda has quit IRC		01:33
*** yolanda has joined #zuul		01:34
*** yolanda has quit IRC		01:36
*** yolanda has joined #zuul		01:37
*** yolanda has quit IRC		01:41
*** yolanda has joined #zuul		01:53
*** yolanda has quit IRC		01:58
*** yolanda has joined #zuul		02:10
Shrews	jeblair: my idea for 406411 wasn't really great. let's go with PS10	02:28
*** yolanda has quit IRC		02:33
*** yolanda has joined #zuul		02:34
*** hogepodge has quit IRC		02:38
*** yolanda has quit IRC		02:42
*** yolanda has joined #zuul		02:54
*** yolanda has quit IRC		02:58
*** yolanda has joined #zuul		03:05
*** yolanda has quit IRC		03:11
*** yolanda has joined #zuul		03:11
*** yolanda has quit IRC		03:16
*** yolanda has joined #zuul		03:28
*** yolanda has quit IRC		03:33
*** yolanda has joined #zuul		03:34
*** yolanda has quit IRC		03:40
*** yolanda has joined #zuul		03:47
*** yolanda has quit IRC		03:51
*** yolanda has joined #zuul		04:05
*** yolanda has quit IRC		04:10
*** Cibo_ has joined #zuul		04:17
*** yolanda has joined #zuul		04:22
*** yolanda has quit IRC		04:26
*** yolanda has joined #zuul		04:29
*** yolanda has quit IRC		04:35
*** yolanda has joined #zuul		04:37
*** yolanda has quit IRC		04:41
*** yolanda has joined #zuul		04:57
*** yolanda has quit IRC		05:02
*** yolanda has joined #zuul		05:14
*** mgagne has quit IRC		05:15
*** morgan has quit IRC		05:15
*** tflink has quit IRC		05:16
*** saneax-_-|AFK has quit IRC		05:16
*** jamielennox has quit IRC		05:16
*** yolanda has quit IRC		05:19
*** tflink has joined #zuul		05:21
*** morgan has joined #zuul		05:23
*** saneax-_-|AFK has joined #zuul		05:27
*** jamielennox has joined #zuul		05:31
*** yolanda has joined #zuul		06:12
*** saneax-_-|AFK is now known as saneax		06:24
*** yolanda has quit IRC		06:28
*** yolanda has joined #zuul		06:30
*** abregman has joined #zuul		06:31
*** yolanda has quit IRC		06:37
*** yolanda has joined #zuul		06:40
*** yolanda has quit IRC		06:45
*** yolanda has joined #zuul		06:46
*** willthames has quit IRC		06:52
*** yolanda has quit IRC		07:10
*** jamielennox is now known as jamielennox|away		07:11
*** yolanda has joined #zuul		07:14
*** yolanda has quit IRC		07:23
*** yolanda has joined #zuul		07:24
openstackgerrit	Joshua Hesketh proposed openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3 https://review.openstack.org/407923	08:23
*** Cibo_ has quit IRC		09:35
*** bhavik1 has joined #zuul		09:48
*** Cibo_ has joined #zuul		09:50
*** mgagne has joined #zuul		10:47
*** mgagne is now known as Guest2615		10:47
*** bhavik1 has quit IRC		10:48
*** openstackgerrit has quit IRC		11:32
*** hashar has joined #zuul		11:51
*** hashar_ has joined #zuul		11:54
*** hashar has quit IRC		11:57
*** hashar_ is now known as hashar		13:33
*** Guest2615 is now known as mgagne		13:54
*** mgagne has quit IRC		13:54
*** mgagne has joined #zuul		13:54
*** saneax is now known as saneax-_-|AFK		14:16
*** abregman has quit IRC		14:55
*** yolanda has quit IRC		14:59
*** yolanda has joined #zuul		14:59
*** openstackgerrit has joined #zuul		15:25
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Add --checksum support to disk-image-create https://review.openstack.org/406411	15:25
pabelanger	jeblair: Shrews: clarkb: revert back to PS10^ for --checksum	15:25
*** rcarrillocruz has quit IRC		15:31
Shrews	pabelanger: +1'd	15:31
Shrews	pabelanger: btw, what's the git or gerrit magic to easily revert to a previous patchset?	15:32
Shrews	oh, maybe review -m	15:34
clarkb	git review -d change,patchset && git commit --amend #change something because gerrit && git review	15:34
pabelanger	ya	15:35
pabelanger	I've been known to cherry-pick the previous patchset too from gerrit ui	15:36
*** hashar is now known as hasharAway		15:36
mordred	fascinating - gerrit has re-applied the votes from clarkb and jeblair from ps10 to ps12	15:40
clarkb	yup I learned it does that when it applied a -1 of mine when an old patchset was pushed and I couldnt figure out why	15:42
clarkb	"I didnt -1 this patchset" later "oh its that old patchset again that I -1'd"	15:42
*** hogepodge has joined #zuul		15:50
openstackgerrit	Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194	15:59
*** abregman has joined #zuul		16:04
pabelanger	okay, doing some ops things with nb01 and nb02	16:11
pabelanger	image-build ubuntu-precise: build started on nb02	16:11
pabelanger	image-build ubuntu-trusty: build started on nb01	16:11
pabelanger	image-build ubuntu-xenial: nothing listed in dib-image-list	16:11
pabelanger	looks like we don't display pending builds	16:12
pabelanger	only active	16:12
openstackgerrit	Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194	16:16
rbergeron	ianw / pabelanger / mordred: re: zookeeper -- i find it odd that someone is ... an approver / owner for it for epel (different from the owner in fedora, which isn't unheard of) -- but not sure if anyone has pinged that human to see if he's ... going to ever do anything on that front or not.	16:17
rbergeron	but i can rustle up package approvers and all that faiiiirly easily	16:17
pabelanger	Right, I think what i was looking for, was to get zookeeper package into some product manager's pipeline at Red Hat, and have some team become responsible for it. After some emails, it's now clear no such team exists. The current roadmap is tooz and etcd	16:18
pabelanger	so, guess ianw and I will push on zookeeper package ourselves and see where to maintain it	16:19
pabelanger	COPR for now, maybe into epel7	16:19
pabelanger	Shrews: ^ questions on image-build when you are free	16:23
openstackgerrit	Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194	16:26
*** abregman_ has joined #zuul		16:28
*** abregman has quit IRC		16:31
mordred	pabelanger, rbergeron: well - not to take over the channel with red hat things completely- but what I got was that it's not on any _openstack_ team's radar	16:38
mordred	but I'm not convinced that there is no interest anywhere in the company with zookeeper, kafka or mesos (kafka and mesos both use zk too)	16:38
mordred	pabelanger: so we may still be able to find some product team somewhere - maybe over in jboss land?	16:39
Shrews	pabelanger: no, pending image builds are not displayed. the only things stored in ZK are active or past builds	16:52
Shrews	pabelanger: but honestly, pending builds should not be pending for long. builders should build them as soon as it notices they need building	16:54
pabelanger	in our case, it takes about 1h20m to do a build	16:54
pabelanger	only issue I have, I don	16:55
pabelanger	err	16:55
Shrews	i'm not sure what that has to do with it	16:55
pabelanger	only issue I have, I don't actually have a way to confirm builds are queued up, without using zk-shell (found the key it sets)	16:55
Shrews	pabelanger: are you saying that a build in the 'building' state is not being displayed, even though it is being built?	16:56
pabelanger	Shrews: no, other way around. I have 2 building images, but issues image-build 3 times	16:56
pabelanger	issued*	16:57
*** hasharAway has quit IRC		16:58
Shrews	pabelanger: i'm failing to grok something. so, you issued 'image-build ubuntu-xenial', it is not actually building, but you see the build request node for it in zk-shell? is that correct?	16:59
pabelanger	Shrews: okay, give me a sec, I'll get a pastebin	17:00
Shrews	k. thx	17:00
pabelanger	current value of dib-image-list: http://paste.openstack.org/show/591687/	17:01
pabelanger	Shrews: from that, we don't actually know we have a pending ubuntu-xenial build	17:02
pabelanger	using zk-shell, I can tell: http://paste.openstack.org/show/591688/	17:02
pabelanger	since I think that is the key we use to trigger it	17:02
pabelanger	the issue, I see, if clarkb can along and looked at dib-image-list, he would have no idea I've already queued up the ubuntu-xenial build	17:03
pabelanger	came*	17:03
Shrews	pabelanger: is that image paused?	17:04
pabelanger	Hmm,	17:04
pabelanger	I don't think so	17:04
pabelanger	let me check	17:04
pabelanger	no	17:05
Shrews	pabelanger: then that is odd. if there is a builder thread free to build it (i'm assuming there is), then seems like a bug since the request should not be unhandled for very long	17:06
jeblair	Shrews, pabelanger: both builders are currently occupied	17:07
pabelanger	yes	17:07
jeblair	based on that pastebin	17:07
pabelanger	nb01 is almost done	17:07
Shrews	how many build threads though?	17:07
pabelanger	1 per server	17:07
jeblair	Shrews: 1 per build machine	17:07
Shrews	OH!	17:07
Shrews	well, yeah	17:07
Shrews	i assumed we used at least the default # of build workers	17:08
Shrews	which is 4 IIRC	17:08
jeblair	Shrews: i think that is the default	17:08
jeblair	Shrews: 4 is uploaders (we use 16)	17:08
pabelanger	I don't think diskimage-builder will support parallel builds	17:08
Shrews	pabelanger: really? wow	17:09
pabelanger	Shrews: I believe so, I'll have to check again	17:09
Shrews	then perhaps we shouldn't support more than one build worker? or at least warn about it	17:10
clarkb	dib can do it so I think its worth having the option	17:10
clarkb	you just have to be extremely careful doing it	17:10
clarkb	mostly in your use of the cache	17:10
clarkb	(tl;dr its mostly up to your elements not dib itself which should be fine as it builds things with all sorts of unique ids and properly loopbacks etc)	17:11
jeblair	so that's probably the warning we should put in the docs :)	17:11
pabelanger	I found this old blueprint a few weeks ago: https://blueprints.launchpad.net/tripleo/+spec/tripleo-diskimage-builder-parallel-builds that's what I am taking my cue from	17:11
Shrews	pabelanger: so, yeah, unhandled manual build requests will not show up. Would be easy enough to add, but there would be no other information in the output other than image name and some sort of new "pending" status	17:13
Shrews	so, if that's useful, we can add it	17:14
pabelanger	okay, something to think about. Not a blocker	17:16
pabelanger	ubuntu-xenial is now building too	17:16
pabelanger	okay, not to stop zookeeper	17:16
pabelanger	now*	17:16
jeblair	yeah, it's an 'image' attribute, not a 'build' attribute. i'm thinking we may need another set of commands for those...	17:16
jeblair	(like image-list, build-list, upload-list)	17:17
clarkb	maybe even possibly a summarize command too?	17:18
clarkb	summarize ubuntu-xenial outputs "pending: 1 builds: 2 uploaded_to: rax-dfw, ovh-gra1, osic-cloud1" or something	17:19
pabelanger	zookeeper stopped / started	17:19
pabelanger	now to see what happened	17:19
*** abregman_ has quit IRC		17:28
*** adam_g has quit IRC		17:33
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Clean up exception message to use image / provider name https://review.openstack.org/408239	17:47
pabelanger	Shrews: what we saw on nb02 when zookeeper was stopped / started: http://paste.openstack.org/show/591694/	17:58
pabelanger	currently waiting to see if all ubuntu-precise images will be uploaded	17:58
*** adam_g has joined #zuul		18:00
jeblair	Shrews: see reply on 408239	18:01
Shrews	jeblair: see my reply to myself :)	18:02
jeblair	Shrews: aha :)	18:02
Shrews	pabelanger: that looks normal. i'm very interested in what happens when ZK is killed during a build or during an upload.	18:02
pabelanger	Shrews: that's what i did for this test, we had 2 builds going and uploads. Was stopped for 15 seconds	18:03
pabelanger	so far, I don't see problems	18:03
Shrews	pabelanger: should see something when the upload actually completes	18:04
Shrews	pabelanger: because our upload lock should be lost	18:04
pabelanger	hmm	18:04
pabelanger	2016-12-07 17:28:41,710 INFO nodepool.builder.UploadWorker.7: Image build ubuntu-precise-0000000011 in infracloud-vanilla is ready	18:05
Shrews	pabelanger: did you stop zk gracefully or -9 it?	18:05
pabelanger	that is the first image uploaded after stop / start	18:05
pabelanger	graceful	18:05
pabelanger	stop / start	18:05
pabelanger	I can do -9 next	18:05
jeblair	let's etherpad this: https://etherpad.openstack.org/p/nN69gpYoAO	18:05
Shrews	ah, that could be different. i think zk does some storing of session state (at least, that's what it looked like when i played around with it)	18:06
Shrews	so graceful might not break locks	18:06
pabelanger	let me add some logs	18:06
jeblair	pabelanger: i don't see an UploadWorker.7 from before the shutdown	18:07
openstackgerrit	Merged openstack-infra/nodepool: Add --checksum support to disk-image-create https://review.openstack.org/406411	18:08
jeblair	pabelanger: oh are you on nb01 or 02?	18:08
pabelanger	jeblair: ya, this is nb02	18:08
jeblair	ah ok, that's better	18:08
jeblair	pabelanger, Shrews: i agree, it looks like uploadworker.7 survived the graceful zk restart with an upload in progress	18:12
jeblair	neat :)	18:12
pabelanger	I don't actually think UploadWorker.11 was actually uploading anything	18:13
Shrews	anyone know offhand how long until a session is considered expired by zk/kazoo? those SUSPENDED messages indicate it had NOT expired	18:13
pabelanger	I am not sure	18:13
Shrews	wondering if the results would be different if we waited longer to restart	18:14
*** adam_g has quit IRC		18:14
pabelanger	For sure, we should do that too.	18:14
jeblair	pabelanger: yeah, uploadworker.11 (and the others) were likely in their poll loop looking for things to upload	18:16
jeblair	harlowja: ^ do you know the answer to Shrews question?	18:17
Shrews	kazoo code seems to indicate 10s	18:17
pabelanger	jeblair: on nb01, you can see uploadworker.06 has an exception, but just after zookeeper is back online, it finds an image	18:17
*** adam_g has joined #zuul		18:18
jeblair	pabelanger: yeah, i think that agrees with what i said	18:19
pabelanger	\o/	18:19
harlowja	jeblair not sure, ha	18:20
pabelanger	just waiting for rax-ord ubuntu-precise image to come online, if that happens we are good	18:20
openstackgerrit	James E. Blair proposed openstack-infra/nodepool: Delete builds when diskimage removed from config https://review.openstack.org/400421	18:21
jeblair	Shrews: can you take a quick look at https://review.openstack.org/407124 ?	18:22
Shrews	jeblair: seems harmless	18:23
Shrews	jeblair: fyi, 407736 to pluralize things is fine, but it isn't backwards compatible	18:26
Shrews	like, pabelanger couldn't just use a new builder with that. the current ZK nodes would need to change, or else start fresh	18:26
jeblair	Shrews: yeah; i could do a bunch of backwards compat code, or we could just manually fix the production zk db, or start over	18:27
jeblair	i assume we're still the only production db :)	18:27
jeblair	i'd be happy to shepherd that through	18:27
Shrews	i don't mind either way. just mentioning it	18:27
jeblair	i'll leave a note on the review to let me approve it and i'll fix up the db when it lands	18:28
openstackgerrit	Merged openstack-infra/nodepool: Sort images and providers in zookeeper https://review.openstack.org/407124	18:28
openstackgerrit	Merged openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3 https://review.openstack.org/407923	18:34
openstackgerrit	James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663	18:34
openstackgerrit	Merged openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194	18:43
openstackgerrit	Merged openstack-infra/nodepool: Fix zookeeper config in test fixture https://review.openstack.org/407632	18:47
pabelanger	i'll restart nodepool-builder here shortly, once the git repo is updated on disk	18:56
pabelanger	aside from that, I don't think I have any outstanding issues right now	18:57
pabelanger	\o/	18:57
jeblair	pabelanger: did you want to perform more zk kill tests?	18:58
pabelanger	jeblair: not yet, if we are good with first round, I'll start another build then -9 zookeeper	18:58
pabelanger	let me pick up latest code first	18:58
Shrews	pabelanger: lol, your exception cleanup change failed in all sorts of fantastic ways	19:01
Shrews	none of which were related to your patch	19:01
pabelanger	nice, didn't see that	19:01
jeblair	yeah was just going through those -- regular tests failed with the mysql db timeout thing	19:02
pabelanger	oh, pep8 failed because of bhs1 mirror issue from this morning	19:02
jeblair	(it's not as obvious now, but i think it's that issue because it hit during a lockfile action in the cleanup)	19:02
pabelanger	nods	19:03
Shrews	is that NodeDeleter exception common?	19:03
jeblair	and i think the coverage job failed with two similar timeouts	19:04
SpamapS	jeblair: fungi Can we talk about PTG space for Zuul some time soon?	19:04
jeblair	Shrews: the FakeError?	19:04
SpamapS	Or has it already been arranged?	19:04
jeblair	er "Fake Error"	19:04
* SpamapS isn't sure where to look.		19:04
Shrews	jeblair: yeah	19:04
Shrews	jeblair: oh, that one is expected, i guess	19:05
jeblair	Shrews: i think that's a test simulation	19:05
Shrews	yeah	19:05
* Shrews should have looked at the name of the test more closely		19:05
fungi	SpamapS: we have ptg space for infra, and i'm happy for much of that to be devoted to working on zuul	19:06
fungi	it's by far our most complex deliverable now	19:06
jeblair	that seems useful to me	19:06
jeblair	i hope we'll be at a point where we'll be ready to do some work on planning the openstack rollout. if we're not there, then we will still have plenty of work to do.	19:08
fungi	yeah, either way we won't be done with this undertaking by time for the ptg	19:09
fungi	so it's safe to say there will be plenty of zuulishness to be had there	19:10
fungi	given how far along it's likely to be, i expect we'll want it to be the primary focus for our infra days anyway	19:10
Shrews	fungi: your optimism is amusing to me	19:13
Shrews	(but hopefully it will be far along)	19:13
jeblair	SpamapS: are you looking for setting an agenda now, or agreement that "yeah, we do enough zuul things to warrant attending"?	19:13
fungi	it's a survival mechanism	19:13
SpamapS	jeblair: I'm justifying travel budgets right now.	19:14
SpamapS	Which means I need to make sure people have space and something important to do while there. :)	19:14
fungi	SpamapS: i will make sure if you come you can spend as much time collaborating on zuul as you want (at least for the horizontal team days where infra gets a space)	19:14
SpamapS	It will help if agendas start to show zuul when there are, in fact, agendas. :)	19:15
mordred	Shrews: I figure the node launcher work is gonna take like ... a week, right?	19:15
Shrews	mordred: your optimism is even MORE amusing than fungi's	19:15
fungi	SpamapS: ttx seems to want us to play a little fast and loose with "agendas" for this, but to the degree that we can declare the things we intend to work on while we're there i'll make certain zuul gets a very prominent spot on that list	19:16
fungi	that is unless i'm replaced as ptl before the ptg anyway ;)	19:17
pabelanger	fungi: by a robot fungi?	19:17
fungi	in which case i will merely strongly advise that it should be a major focus	19:17
* fungi thought he _was_ the robot fungi		19:18
pabelanger	doh	19:18
openstackgerrit	Merged openstack-infra/nodepool: Clean up exception message to use image / provider name https://review.openstack.org/408239	19:25
*** rcarrillocruz has joined #zuul		19:47
pabelanger	okay, nodepool-builder restarted on nb01.o.o / nb02.o.o	19:51
jeblair	Shrews: https://review.openstack.org/405663 is ready now	19:52
jeblair	wait maybe not	19:52
jeblair	just noticed the nv tests are failing	19:52
*** hashar has joined #zuul		19:53
openstackgerrit	James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663	19:58
*** adam_g_ has joined #zuul		20:06
pabelanger	okay, all the leaked checksum files have been manually cleaned up	20:38
*** harlowja has quit IRC		20:41
jeblair	pabelanger: is now a good time for me to stop the builders and do the manual zk work?	20:54
pabelanger	jeblair: yes, feel free	20:56
jeblair	pabelanger: looks like i stopped nb01 in the middle of a dib run	20:57
jeblair	dib is still running, however... i think we normally expect it to stop	20:58
pabelanger	indeed, looks like fedora-24 is doing checksums now	20:58
jeblair	it's stopped now	20:59
jeblair	maybe it just needed to finish the checksum programs	20:59
pabelanger	ya	20:59
jeblair	"cd nodepool" "cp image images true" "rmr image" is what i'm doing for the move	21:02
openstackgerrit	Merged openstack-infra/nodepool: Pluralize zk nodes with children https://review.openstack.org/407736	21:02
jeblair	the 'true' in the cp command means 'recursive'	21:02
Shrews	jeblair: hmmm, wonder how that affects sequence node numbers	21:02
Shrews	if at all	21:02
jeblair	Shrews: good question, we should look for that in the next build	21:03
pabelanger	I tried using dstat logs on https://review.openstack.org/#/c/405663 but didn't see much difference with a single provider. But +2'd	21:05
jeblair	pabelanger: hrm, it should remove about 250 threads	21:06
jeblair	pabelanger: oh, single provider	21:07
pabelanger	ya	21:07
jeblair	pabelanger: yes, there will be very little difference in that case.	21:07
pabelanger	nods	21:07
pabelanger	that is what I figured	21:07
jeblair	pabelanger: threads = providers * workers	21:07
pabelanger	k	21:08
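The relationship jeblair states above (threads = providers * workers) can be sketched as a one-line calculation. This is only an illustration: the function name is hypothetical, and the provider count of 16 in the usage note is an assumption chosen to land near the "about 250 threads" figure, not a number given in the log.

```python
def taskmanager_thread_count(providers: int, workers: int) -> int:
    """Estimate thread count using the formula quoted in the chat:
    threads = providers * workers.

    Both arguments are illustrative inputs, not values read from a
    real nodepool configuration.
    """
    if providers < 0 or workers < 0:
        raise ValueError("counts must be non-negative")
    return providers * workers
```

With a single provider and the 16 uploaders mentioned earlier in the log, `taskmanager_thread_count(1, 16)` gives 16, which is consistent with pabelanger seeing little difference. A hypothetical deployment with 16 providers would give 256.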
jeblair	mordred just approved that so i'll wait till it lands to restart builders	21:08
pabelanger	Yay	21:08
jeblair	(which works well since i have more zk moves to do)	21:08
jeblair	pabelanger: fedora-24/builds/00..10 and 11 are empty but still exist	21:09
jeblair	do you know the story there?	21:09
pabelanger	jeblair: I wonder if that is a result of me stopping nodepool-builder during builds too	21:09
pabelanger	let me check the logs	21:10
mordred	jeblair: double plus bonus points for use of boartty as a library in your storyboard script	21:10
jeblair	15 and 16 are empty too	21:10
jeblair	mordred: i am lazy :)	21:10
Shrews	jeblair: hrm, i think we'll have problems	21:10
jeblair	Shrews: because the new nodes will have reset version numbers?	21:11
Shrews	jeblair: quick test, using the method you outlined, makes a subsequent create fail with NodeExistsError	21:11
Shrews	jeblair: and i don't know why	21:11
Shrews	(this is a hacked kazoo script i'm using, not the builder)	21:11
pabelanger	jeblair: fedora-24-00..16 is empty too?	21:12
jeblair	pabelanger: yep 10,11,15,16 all empty	21:12
pabelanger	jeblair: if so, that must be a result of us stopping nodepool-builder when we have builds in progress	21:13
pabelanger	because that 0016 was the build that was just running	21:13
jeblair	pabelanger: 13 14 exist, 12 does not.	21:13
jeblair	pabelanger: well, that one is fine -- there's nothing to clean it up right now	21:13
jeblair	pabelanger: but the others may be an error	21:13
pabelanger	yes, 13 and 14 are valid builds which are ready	21:13
mordred	jeblair: 407135 has 2x+2 but I think should go in the list of jeblair reviews	21:14
jeblair	mordred: yeah, i'm going to switch to zuul soon	21:14
jeblair	thanks	21:14
mordred	jeblair: I've been trying to clear out things that don't need your attention	21:15
jeblair	mordred: oh thanks	21:15
pabelanger	I'll also work on more test coverage around stopping nodepool-builder during a diskimage build	21:16
pabelanger	should be easy to expose the issue	21:16
Shrews	jeblair: http://paste.openstack.org/show/591725/	21:17
jeblair	Shrews: hrm, that wfm	21:20
Shrews	O.o	21:20
jeblair	ii zookeeper 3.4.5+dfsg-1 all High-performance coordination service for distributed applications	21:20
Shrews	jeblair: check zk-shell listing. do you have a 'junks2'	21:21
Shrews	?	21:22
jeblair	Shrews: no, just junk and junks	21:22
Shrews	i missed a step in that paste... i deleted the first sequence node, then re-ran k.py	21:22
jeblair	(i also did an attempt where i 'rmr junk' after cping. that also worked)	21:22
jeblair	oh i'll try that	21:23
Shrews	but, if i don't delete that, i get a junks2	21:23
Shrews	perhaps i'm using an older version of zookeeper	21:23
openstackgerrit	Merged openstack-infra/zuul: Add roadmap to README https://review.openstack.org/407213	21:23
openstackgerrit	Merged openstack-infra/zuul: Re-enable TestScheduler.test_rerun_on_error https://review.openstack.org/406416	21:24
openstackgerrit	Merged openstack-infra/zuul: Re-enable test_rerun_on_abort https://review.openstack.org/407000	21:24
openstackgerrit	Merged openstack-infra/nodepool: Delete builds when diskimage removed from config https://review.openstack.org/400421	21:24
jeblair	Shrews: aha, if i delete the initial sequence znode i get the error	21:24
Shrews	jeblair: if those tests work against the production zookeeper, then go for it. we might be seeing version differences	21:24
openstackgerrit	Merged openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/404487	21:24
Shrews	jeblair: neat	21:24
Shrews	so, that's not going to work	21:25
Shrews	most likely	21:25
jeblair	fascinating :)	21:25
Shrews	jeblair: we are learning more ZK things	21:26
jeblair	yay us!	21:26
Shrews	fyi, i have zk 3.4.8, which is newer	21:29
jeblair	i like that the kazoo docs explicitly state this will never happen	21:31
Shrews	lol! link?	21:31
jeblair	http://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.create	21:31
jeblair	Note that since a different actual path is used for each invocation of creating sequential nodes with the same path argument, the call will never raise NodeExistsError.	21:31
Shrews	something about the 'cp' probably makes it lose some sequence attributes	21:33
jhesketh	Morning	21:35
jeblair	Shrews: i think i understand -- i think it uses the cversion stat to get the next seqno	21:35
jeblair	the new node starts with cversion at 0 and then increments with each new child	21:35
Shrews	so that's the attribute it loses :)	21:36
jeblair	so if you had 4 children, you would have cversion=4. then delete 1, you still have cversion=4. copy that to a new node, cversion=3.	21:36
Shrews	ah	21:37
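The explanation jeblair gives here can be mimicked with a small in-memory toy. This is a simplified model of the behaviour as described in the chat (a per-parent counter that grows on create, is untouched by delete, and is rebuilt from scratch on copy), not ZooKeeper's actual implementation. The class and exception names are hypothetical stand-ins.

```python
class NodeExists(Exception):
    """Toy stand-in for the NodeExistsError seen in the chat."""


class ToyParent:
    """Toy parent znode: a set of child sequence numbers plus a counter."""

    def __init__(self):
        self.children = set()
        self.cversion = 0  # in this model: number of child creations so far

    def create_sequential(self):
        # The next sequence number is taken from the counter.
        seq = self.cversion
        if seq in self.children:
            raise NodeExists(seq)
        self.children.add(seq)
        self.cversion += 1
        return seq

    def delete(self, seq):
        # Deleting a child does not lower the counter in this model.
        self.children.remove(seq)

    def copy(self):
        # A copy re-creates each surviving child under a fresh parent,
        # keeping the child's name, so the counter restarts from zero.
        clone = ToyParent()
        for seq in sorted(self.children):
            clone.children.add(seq)
            clone.cversion += 1
        return clone
```

Creating four children, deleting one, and copying leaves the copy with a counter of 3 while a child numbered 3 still exists, so the next sequential create on the copy collides. The original parent, whose counter is still 4, does not.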
jeblair	so in production, we're going to have cversion=2 on most of these things; we'll probably end up having sequence numbers start at 3. they'll work for a while until they hit a collision with 11 or whatever we're up to. unless 11 gets deleted in time. which it probably would. except for fedora-24.	21:38
jeblair	Shrews: so i think we can fix this with a script that looks at all the sequence number containers, and adds/removes children until cversion==max(sequence number)	21:38
jeblair	(even though this is ridiculous, i'm glad we're learning this now before we had to learn it for something that's actually important :)	21:39
Shrews	yeah	21:40
jeblair	i guess we want cversion to be max(sequence number)+1	21:40
jeblair	i'll write a script to do that real quick	21:41
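One way such a fix-up could look is sketched below. It is not the script jeblair actually ran (that lives in a paste not reproduced in the log). It assumes a client object exposing `get_children`, `create(path, value, sequence=True)` and `delete`, as kazoo's `KazooClient` does, and it assumes every child name ends in the 10-digit sequence suffix. The function name, the `seqbump-` prefix and the `max_bumps` guard are hypothetical.

```python
def bump_sequence_counter(client, parent, prefix="seqbump-", max_bumps=10000):
    """Create and delete throwaway sequential children under `parent`
    until the sequence number handed out reaches the highest existing one.

    Returns the number of throwaway nodes created. The throwaway prefix
    keeps the temporary names from colliding with real children.
    """
    # Assumption: each child name ends with a zero-padded 10-digit sequence.
    existing = [int(name[-10:]) for name in client.get_children(parent)]
    if not existing:
        return 0
    highest = max(existing)

    for bumps in range(1, max_bumps + 1):
        path = client.create(parent + "/" + prefix, b"", sequence=True)
        seq = int(path[-10:])
        client.delete(path)
        if seq >= highest:
            # The next sequential create should now land past `highest`.
            return bumps
    raise RuntimeError("sequence counter did not advance past %d" % highest)
```

Driving the counter by observing the returned sequence numbers avoids relying on any particular formula for how the counter is derived, at the cost of one round trip per bump.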
*** harlowja has joined #zuul		21:41
*** jamielennox|away is now known as jamielennox		21:49
jeblair	Shrews: experimentally, it seems the next sequence number is cversion-numChildren+1	21:54
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Update waitForBuildDeletion() to protect against delete race https://review.openstack.org/408324	21:54
pabelanger	jeblair: Shrews: should fix the race condition in: http://logs.openstack.org/63/405663/6/gate/gate-nodepool-python27-ubuntu-xenial/91a1e7d/console.html	21:54
jeblair	Shrews: okay i ran this: http://paste.openstack.org/show/591733/	22:06
jeblair	i think we should be ready to restart now	22:06
jeblair	wait what we decided to go with the activate virtualenv thing?	22:07
jeblair	mordred: fyi https://review.openstack.org/403966	22:10
jeblair	mordred: you may want to read the entire discussion there, and on the linked change	22:12
jeblair	pabelanger, Shrews: i'm going to restart the builders now	22:13
Shrews	jeblair: okie dokie. will keep my fingers and toes crossed	22:13
jeblair	Shrews: i apparently missed something: http://paste.openstack.org/show/591734/	22:15
Shrews	ugh	22:15
jeblair	Shrews: hrm, actually, any chance that's normal for a lock collision?	22:16
Shrews	jeblair: nope. they have their own exceptions	22:16
Shrews	but that's failing on the sequence node create anyway	22:17
Shrews	jeblair: heading out to meet ansible type folks (hi rbergeron!) but will try to pay attention to my phone pings if you need me	22:22
jeblair	Shrews: ok. i think the situation corrected itself after some deletions or something anyway	22:24
jeblair	the other builder got it. :)	22:24
openstackgerrit	Merged openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663	22:40
jeblair	SpamapS: in 407000, i don't understand why the test is changing. i can't think of a reason for it to do so, and anyway, the final assertion in that test is basically confirming that we're running one less job than before.	22:50
jeblair	SpamapS: looking at the test output from master, "Launch job project-test1" shows up 5 times, and attempts is set at 4. that's correct because the 5th attempt is the one that returns RETRY_LIMIT -- so from the user POV, we tried to run the job 4 times. with 407000 the v3 branch shows that job launching 4 times, and it's the 4th that returns RETRY_LIMIT (so the user told us to run it 4 times and sees it run 3).	22:51
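The counting jeblair describes can be shown with a toy model. This is not Zuul's code: `simulate_launches` and its `off_by_one` flag are hypothetical, and the model assumes every real run asks to be retried, as in the test being discussed.

```python
def simulate_launches(attempts: int, off_by_one: bool = False) -> tuple[int, int]:
    """Return (launches, real_runs) for a job that always needs a retry.

    off_by_one=False models the behaviour described for master: the launch
    after the last allowed attempt is the one that returns RETRY_LIMIT.
    off_by_one=True models the behaviour described for change 407000: the
    launch numbered `attempts` already returns RETRY_LIMIT.
    """
    limit_launch = attempts if off_by_one else attempts + 1
    launches = 0
    real_runs = 0
    while True:
        launches += 1
        if launches >= limit_launch:
            # This launch reports RETRY_LIMIT instead of running the job.
            return launches, real_runs
        real_runs += 1


if __name__ == "__main__":
    print(simulate_launches(4))                   # master, as described
    print(simulate_launches(4, off_by_one=True))  # with 407000, as described
```

With `attempts` set to 4, the first call yields 5 launches and 4 real runs, and the second yields 4 launches and 3 real runs, matching the two cases in the message above.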
SpamapS	jeblair: yeah, I wasn't sure that changing the test was the right call.	22:55
SpamapS	jeblair: I thought what I saw was just that we ended up not recording one of the tries anymore.	22:55
SpamapS	but I may have misunderstood how things are recorded.	22:55
jeblair	SpamapS: yeah -- that's what i went looking for, but i figured "Launching job project-test1" was a pretty good proxy for "how many times did we run this job" since that's emitted by the pipeline manager when it tells the launcher to run a job (so it shouldn't be as affected by things changing around the launcher)	22:57
openstackgerrit	Merged openstack-infra/zuul: Re-enable test_client_get_running_jobs https://review.openstack.org/407135	23:04
jeblair	jhesketh: https://review.openstack.org/406699	23:06
* jhesketh looks		23:07
jhesketh	oh, I reviewed that last night... why don't I have a vote	23:07
jeblair	jhesketh: did you use gertty?	23:07
jhesketh	nope, I suspect it was a human error	23:07
* jhesketh tries to regain state		23:08
jeblair	well there's your problem! ;)	23:08
jhesketh	heh, that's a default for me ;-)	23:08
SpamapS	jeblair: so then it's possible the logic on retries changed and we're actually doing it wrong.	23:11
jeblair	SpamapS: yeah, nothing about what might cause that comes immediately to mind, so it may require spelunking.	23:12
jhesketh	jamielennox: left a question on 406699 if you have a moment	23:14
SpamapS	jeblair: worth doing. I'll take a deeper look.	23:15
*** harlowja has quit IRC		23:16
jeblair	SpamapS: thanks. note that the change has merged.	23:16
jamielennox	jhesketh: yea, it's probably easier to just make them separate git repos on the filesystem, they're already separate pipelines and everything	23:16
SpamapS	jeblair: indeed, I think I just glossed over a bug, rather than introduced one. :-/	23:17
jamielennox	i'll spin a new one today	23:17
jeblair	SpamapS: yeah, odds are i introduced it. but we've lost our tracking of it, so we might want to do something (revert, propose a todo note, or file a story)	23:18
jhesketh	jamielennox: cool, mostly curious about the second comment though as I think we need that to make sure it's behaving as expected	23:19
openstackgerrit	Merged openstack-infra/zuul: Re-enable merge-mode config option and add more tests https://review.openstack.org/406361	23:19
SpamapS	jeblair: I'll add a story with the v3 tag and assign myself	23:22
SpamapS	https://storyboard.openstack.org/#!/story/2000827	23:24
openstackgerrit	James E. Blair proposed openstack-infra/zuul: Remove v3 project template test https://review.openstack.org/395722	23:26
*** jamielennox is now known as jamielennox\|away		23:27
*** jamielennox\|away is now known as jamielennox		23:28
*** hashar has quit IRC		23:36
*** hashar has joined #zuul		23:39
*** Cibo_ has quit IRC		23:50
*** hashar has quit IRC		23:52
