*** willthames has joined #zuul | 00:17 | |
*** adam_g has quit IRC | 00:37 | |
*** adam_g_ is now known as adam_g | 00:37 | |
*** jamielennox is now known as jamielennox|away | 00:45 | |
*** jamielennox|away is now known as jamielennox | 01:50 | |
*** Cibo_ has joined #zuul | 02:55 | |
openstackgerrit | Merged openstack-infra/zuul: Remove v3 project template test https://review.openstack.org/395722 | 03:49 |
*** harlowja has joined #zuul | 04:37 | |
*** bhavik1 has joined #zuul | 05:00 | |
*** harlowja has quit IRC | 05:46 | |
*** bhavik1 has quit IRC | 05:58 | |
*** abregman has joined #zuul | 06:22 | |
*** saneax-_-|AFK is now known as saneax | 07:59 | |
*** hashar has joined #zuul | 08:23 | |
*** jamielennox is now known as jamielennox|away | 08:33 | |
*** hogepodge has quit IRC | 09:45 | |
*** hogepodge has joined #zuul | 09:46 | |
*** saneax is now known as saneax-_-|AFK | 10:08 | |
*** saneax-_-|AFK is now known as saneax | 10:09 | |
*** hogepodge has quit IRC | 11:01 | |
*** hogepodge has joined #zuul | 11:21 | |
*** hogepodge has quit IRC | 11:26 | |
*** hogepodge has joined #zuul | 11:38 | |
*** hogepodge has quit IRC | 11:48 | |
*** hogepodge has joined #zuul | 12:00 | |
*** bhavik1 has joined #zuul | 13:39 | |
*** abregman has quit IRC | 14:26 | |
*** abregman has joined #zuul | 14:49 | |
*** abregman has quit IRC | 15:02 | |
*** abregman has joined #zuul | 15:08 | |
*** abregman has quit IRC | 15:29 | |
*** saneax is now known as saneax-_-|AFK | 15:41 | |
*** hashar is now known as hasharCall | 17:02 | |
*** adam_g is now known as adm_g | 17:46 | |
*** adm_g is now known as adam_g | 17:47 | |
*** adam_g has quit IRC | 17:47 | |
*** adam_g has joined #zuul | 17:47 | |
pabelanger | o/ | 18:06 |
pabelanger | just checking nb01, I see we are getting an exception | 18:07 |
*** hasharCall is now known as hashar | 18:07 | |
pabelanger | http://paste.openstack.org/show/591842/ | 18:07 |
Shrews | pabelanger: is that recent? | 18:10 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception https://review.openstack.org/408756 | 18:10 |
pabelanger | Shrews: just happened | 18:10 |
pabelanger | Shrews: however, its stopped | 18:11 |
pabelanger | let me check nb02 to see what happened | 18:11 |
Shrews | pabelanger: ok. that's due to the change jeblair made yesterday. i thought that was transient | 18:11 |
pabelanger | oh, no. It is still happening | 18:11 |
pabelanger | let me check zk-shell | 18:11 |
pabelanger | Shrews: okay, lets see what jeblair wants to do | 18:12 |
jeblair | drat | 18:13 |
Shrews | we might just need to start fresh :( | 18:13 |
jeblair | yes, though i think if we land pabelanger's change, we will know which note to go in and fix harder | 18:13 |
jeblair | er node | 18:13 |
Shrews | which change? | 18:13 |
jeblair | 756 | 18:14 |
pabelanger | I don't mind a fresh start, rebuilds and uploads are working very well | 18:14 |
*** hashar has quit IRC | 18:16 | |
*** harlowja has joined #zuul | 18:17 | |
jeblair | let's restart with 756, poke at that node (mostly i want to see what i missed earlier). if that's confusing or doesn't work, let's reboot. | 18:17 |
pabelanger | wfm | 18:20 |
*** harlowja_ has joined #zuul | 18:20 | |
Shrews | maybe our config objects should just implement __repr__() | 18:22 |
*** harlowja has quit IRC | 18:22 | |
jeblair | Shrews: ++ | 18:27 |
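(Editorial note: a minimal sketch of the `__repr__` idea Shrews floats above, later proposed as change 408776 "Add __repr__ to ConfigValue objects". The attribute-dumping strategy and the `DiskImage` subclass here are illustrative assumptions, not the actual patch.)

```python
# Hypothetical sketch: giving config objects a useful __repr__ so log
# messages and exceptions show attribute values instead of a bare
# "<object at 0x...>" id. The real nodepool change (408776) may differ.
class ConfigValue(object):
    def __repr__(self):
        # Dump all instance attributes, sorted for stable output.
        attrs = ', '.join('%s=%r' % kv for kv in sorted(vars(self).items()))
        return '<%s %s>' % (self.__class__.__name__, attrs)


class DiskImage(ConfigValue):
    """Illustrative config subclass; name borrowed from the discussion."""
    def __init__(self, name):
        self.name = name


print(repr(DiskImage('ubuntu-trusty')))  # <DiskImage name='ubuntu-trusty'>
```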
pabelanger | jeblair: regarding https://review.openstack.org/#/c/404976/, which moves fake-image-create out of production code, how would you like to see it reworked? | 18:27 |
jeblair | pabelanger: i'm fine with that change as-is -- the thing i wanted to communicate is that i don't think we should change how we cause failures to happen. we can set that command in our test configs, but we should not change the command to cause failures, we should continue using the 'should_fail' (or whatever it's called) image metadata attribute to cause build failures. | 18:29 |
jeblair | pabelanger: (basically, my -1 was in reference to the last sentence of the commit message) | 18:30 |
pabelanger | okay, I am okay with that | 18:30 |
pabelanger | I'll update it here in a minute | 18:30 |
jeblair | (the reason being: by using the metadata, we can say "image A should succeed, image B should fail". we can't do that if we set the command to fail for all images) | 18:31 |
pabelanger | That makes sense | 18:34 |
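(Editorial note: a toy illustration of the approach jeblair describes above: the fake build command stays the same for every image, and a per-image metadata flag decides whether a given build fails, so "image A should succeed, image B should fail" is expressible. The function and flag name `should_fail` follow the chat's wording but are assumptions about the real test fixture.)

```python
# Hypothetical sketch: failures driven by image metadata rather than by
# swapping in a different build command for all images.
def fake_disk_image_create(image_metadata):
    """Stand-in for the configurable fake-image-create test command."""
    if image_metadata.get('should_fail'):
        raise RuntimeError('simulated build failure')
    return 'build ok'


images = {
    'image-a': {},                     # should succeed
    'image-b': {'should_fail': True},  # should fail
}

results = {}
for name, meta in images.items():
    try:
        results[name] = fake_disk_image_create(meta)
    except RuntimeError:
        results[name] = 'failed'
```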
openstackgerrit | Merged openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception https://review.openstack.org/408756 | 18:36 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing https://review.openstack.org/404976 | 18:38 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing https://review.openstack.org/404976 | 18:39 |
pabelanger | k, 756 has landed on disk. going to stop / start nodepool-builder | 18:44 |
pabelanger | http://paste.openstack.org/show/591848/ | 18:47 |
jeblair | okay, i'll poke around in zkshell | 18:52 |
pabelanger | ya, just looking now too. | 18:54 |
pabelanger | /nodepool/images/ubuntu-trusty/builds does exist | 18:54 |
pabelanger | which I think is the path it should be using | 18:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Add __repr__ to ConfigValue objects https://review.openstack.org/408776 | 18:57 |
jeblair | cversion is 17, numchildren is 3, 17-3+1=15 which is > 11.... | 18:58 |
jeblair | (i'm poking at zk -- ignore json object decode errors) | 19:01 |
pabelanger | ack | 19:02 |
pabelanger | we should dump threads again too, when finished with zookeeper, nodepool-builder still using 139% CPU. Figured it would have dropped a little | 19:03 |
jeblair | okay, i did a create/rm cycle 2 times and now it works | 19:04 |
jeblair | i no longer think i understand the math about how it picks the next sequence number | 19:04 |
jeblair | i think if we want to be able to understand how to do this in the future, we should probably read some code. | 19:05 |
pabelanger | indeed, I have no idea why it was failing :) | 19:05 |
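(Editorial note: a toy simulation, not real ZooKeeper, of why the numbers jeblair quotes can look inconsistent. ZooKeeper derives a sequential znode's number from the parent's `cversion`, a counter bumped by every child create *and* delete, so deletes keep advancing it while `numchildren` shrinks. This model is a simplification for illustration.)

```python
# Simulate ZooKeeper sequential-znode numbering: the next sequence number
# comes from the parent's cversion (child-change counter), which counts
# deletes as well as creates -- so cversion, numchildren, and the next
# sequence number can diverge after create/delete churn.
class FakeParentZnode:
    def __init__(self):
        self.cversion = 0       # bumped on every child create or delete
        self.children = set()

    def create_sequential(self, prefix):
        name = '%s%010d' % (prefix, self.cversion)
        self.children.add(name)
        self.cversion += 1
        return name

    def delete(self, name):
        self.children.discard(name)
        self.cversion += 1


parent = FakeParentZnode()
names = [parent.create_sequential('build-') for _ in range(3)]
parent.delete(names[0])
# cversion advanced 4 times (3 creates + 1 delete), but only 2 children
# remain -- inspecting the stats afterward no longer tells you the history.
```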
jeblair | pabelanger, Shrews: we *could* just let things proceed and if it crops up again, perform a fix like i just did. i think there's a chance that this may happen a few more times as other images age out and the builders decide they need to be rebuilt. there's also a chance that things could work now but break unexpectedly in the future because of some quirk of the math. so if we want to play it safe and scrap the whole thing and restart, that's ... | 19:07 |
jeblair | ... okay too. | 19:07 |
Shrews | i feel like us not understanding the knobs we are twiddling is a good reason to just start fresh. but we could wait and see what happens, too. | 19:08 |
pabelanger | Ya, I think a fresh start if we are going live with nodepool.o.o is a good idea. But same, if we want to see what happens, I'm sure we'll learn more | 19:09 |
jeblair | okay, let's do a fresh start then. | 19:10 |
jeblair | pabelanger: do you want to do the honors? | 19:10 |
pabelanger | jeblair: sure, how do we want to do the fresh start? | 19:11 |
Shrews | pause everything, delete images, shutdown builder, rmr /nodepool, unpause everything, start builder? | 19:12 |
pabelanger | ya, that's what I've come up with too | 19:13 |
jeblair | Shrews: yeah, i think that's it | 19:13 |
pabelanger | okay, let me put nb01 / nb02 into emergency file so puppet doesn't do things | 19:13 |
Shrews | make sure to hit both nb01 and nb02 | 19:13 |
pabelanger | since we don't have a pause via CLI yet | 19:13 |
Shrews | pabelanger: oh, should probably delete local dib files before the restart, too | 19:15 |
pabelanger | ++ | 19:15 |
jeblair | Shrews: yeah, any that are left that the delete's don't take care of | 19:16 |
Shrews | yeah, hopefully there are none | 19:17 |
pabelanger | okay, deleting centos-7 dibs | 19:22 |
pabelanger | let's see what happens | 19:22 |
clarkb | is there a tldr of what the problem is/was? | 19:25 |
Shrews | pabelanger: image-delete properly did the things? | 19:25 |
Shrews | (since we had a bug there previously) | 19:26 |
pabelanger | Shrews: I believe so, looking at logs now | 19:27 |
Shrews | clarkb: this https://review.openstack.org/407736 was a non-backward compatible change, causing manual intervention via zk-shell | 19:28 |
pabelanger | we have 1 build in progress, ubuntu-xenial | 19:28 |
pabelanger | going to try deleting that | 19:29 |
Shrews | clarkb: that caused the zk sequence nodes (used for build and upload IDs) to stop working | 19:29 |
pabelanger | not sure if it will work | 19:29 |
*** bhavik1 has quit IRC | 19:33 | |
pabelanger | clean up still working | 19:33 |
Shrews | pabelanger: hmm, unlikely to work via CLI | 19:33 |
pabelanger | Ya, we don't have a good way to abort a DIB in progress from CLI | 19:34 |
pabelanger | aside from -9 disk-image-create | 19:35 |
Shrews | pabelanger: hrm, the CLI has a bug. we *should* report that we can't delete it b/c it's in progress. i don't have a lock around it :( | 19:35 |
pabelanger | same goes for uploads in progress, they didn't stop. | 19:37 |
pabelanger | k | 19:37 |
pabelanger | I had to stop nodepool-builder to stop the uploads, started again | 19:38 |
pabelanger | all uploaded images are now gone | 19:38 |
pabelanger | we have 3 DIBs stuck in deleting, but I think we expected that | 19:39 |
pabelanger | both nb01 / nb02 have empty /opt/nodepool_dib | 19:39 |
pabelanger | I think we are good | 19:40 |
pabelanger | okay, both builders stopped | 19:40 |
pabelanger | just for fun, running sudo -H -u nodepool nodepool alien-image-list on nodepool.o.o | 19:41 |
pabelanger | it should be empty | 19:41 |
pabelanger | and it is, no images from nb01 / nb02 | 19:42 |
pabelanger | rmr /nodepool done | 19:43 |
pabelanger | starting builders again | 19:43 |
pabelanger | and removing nb01 / nb02 from emergency file | 19:44 |
Shrews | oh, no. i was wrong. code is fine | 19:45 |
Shrews | can only delete READY builds (or uploads, for that matter) | 19:45 |
Shrews | ugh, no | 19:47 |
pabelanger | and diskimage builds started again | 19:47 |
pabelanger | Shrews: jeblair: we are back online | 19:49 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Check for in progress build/upload in CLI https://review.openstack.org/408794 | 19:49 |
pabelanger | only 2 issues, if we want to call them issues. | 19:49 |
pabelanger | 1) cannot abort inprogress diskimage builds, had to -9 disk-image-create | 19:49 |
pabelanger | 2) cannot abort uploads, had to stop nodepool-builder | 19:50 |
pabelanger | aside from that, dib-image-delete worked great | 19:50 |
Shrews | 794 should at least fix the CLI and it will say you can't do those things | 19:50 |
Shrews | being able to *actually* do those things is a new feature. Could you do that in the old builder? | 19:51 |
Shrews | i don't recall seeing code to handle that (without actually stopping the builder itself) | 19:51 |
pabelanger | no, I don't think we could | 19:51 |
Shrews | oh, so this is feature creep :) | 19:52 |
jeblair | clarkb: note that we chose to go ahead and do 407736 knowing we might have to restart, as an opportunity to potentially learn some things about zk. we did. :) i think we know enough to come up with a better migration plan next time something like this comes up. | 19:52 |
pabelanger | however, that was the first time I've ever had to purge all images in nodepool :) | 19:52 |
Shrews | pabelanger: lol, touche | 19:52 |
pabelanger | but ya, feature creep :) | 19:53 |
jeblair | pabelanger: yeah, i think restarting builder after setting pause in order to abort builds/uploads is the right thing | 19:53 |
pabelanger | Yup, it wasn't too painful | 19:53 |
jeblair | it looks like f23 and f24 failed? | 19:55 |
pabelanger | checking | 19:55 |
jeblair | 2016-12-08 19:50:28,424 INFO nodepool.image.build.fedora-24: Error: Failed to synchronize cache for repo 'updates' | 19:56 |
jeblair | they both say that | 19:56 |
pabelanger | oh, ya | 19:57 |
pabelanger | that happens from time to time | 19:57 |
pabelanger | actually | 19:57 |
pabelanger | 2016-12-08 19:47:43,817 INFO nodepool.image.build.fedora-24: mount: none already mounted or /opt/dib_tmp/dib_build.bIPidWaE/mnt/proc busy | 19:57 |
jeblair | 'info' huh? :) | 19:58 |
jeblair | pabelanger: i think killing dib was the wrong thing :( | 19:58 |
pabelanger | yes | 19:59 |
jeblair | pabelanger: probably should have let the nodepool-builder shutdown handle stopping it | 19:59 |
pabelanger | agreed | 19:59 |
pabelanger | I didn't check the dib_tmp folder | 20:00 |
Shrews | db/lockfile errors seem to be happening frequently today | 20:02 |
pabelanger | http://paste.openstack.org/show/591854/ | 20:03 |
jeblair | pabelanger: i don't understand that error -- that path should be for the current build, not something left over | 20:03 |
pabelanger | when should we expect the cleanup worker to pick up the deleting state? | 20:03 |
pabelanger | jeblair: ya, this might be something in diskimage-builder, http://logs.openstack.org/88/408288/8/check/gate-dib-dsvm-functests-ubuntu-xenial/9c7656f/console.html#_2016-12-08_19_23_48_471594 has the same message | 20:04 |
jeblair | pabelanger: we may never delete those builds from zk since there are no files for them :( -- no builder will think it's responsible for it. | 20:07 |
pabelanger | jeblair: sad face indeed | 20:08 |
pabelanger | will poke at it shortly | 20:09 |
jeblair | we may need to make sure the builder field is set correctly and then use that to determine whether a given builder should delete the zk record | 20:10 |
jeblair | (it looks like the builder got unset when the state changed to deleting) | 20:10 |
Shrews | touch fedora-23-0000000001.raw should do it | 20:11 |
jeblair | agreed | 20:11 |
pabelanger | k, trying that | 20:12 |
pabelanger | yup, worked | 20:13 |
*** jamielennox|away is now known as jamielennox | 20:13 | |
pabelanger | did fedora-24 too | 20:14 |
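(Editorial note: a sketch of the placeholder-file workaround Shrews suggests above. The premise, per the discussion, is that the cleanup worker only claims a build's ZooKeeper record if it finds a matching local image file, so touching an empty file with the expected name lets the builder delete the orphaned record. The directory layout and helper name here are assumptions for illustration.)

```python
# Hypothetical sketch: create a zero-byte placeholder image file so the
# builder thinks it owns the stuck build record and cleans it up.
import tempfile
from pathlib import Path


def touch_placeholder(images_dir, image_name, build_id, ext='raw'):
    # Filename pattern mirrors the one in the chat: fedora-23-0000000001.raw
    path = Path(images_dir) / ('%s-%010d.%s' % (image_name, build_id, ext))
    path.touch()  # an empty file is enough for the existence check
    return path


workdir = tempfile.mkdtemp()
p = touch_placeholder(workdir, 'fedora-23', 1)
```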
Shrews | i'm not sure what the code path is that would NOT set the hostname | 20:17 |
Shrews | or would remove it | 20:17 |
pabelanger | Shrews: Would it be line 451 that uses a new ImageBuild()? over updating the existing data? | 20:20 |
pabelanger | in builder.py | 20:20 |
jeblair | pabelanger: yep | 20:20 |
Shrews | oh, i see it | 20:20 |
Shrews | yes | 20:20 |
pabelanger | cool | 20:21 |
Shrews | that *should* just be reusing 'build' | 20:21 |
Shrews | fix coming | 20:21 |
pabelanger | k | 20:22 |
pabelanger | getting a coffee | 20:22 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING https://review.openstack.org/408808 | 20:24 |
jeblair | ianw: do you know what went wrong with this build? http://nb01.openstack.org/dib.fedora-23.log | 20:24 |
jeblair | ianw: same thing happened to f24 i think | 20:24 |
ianw | jeblair: 2016-12-08 19:48:19,954 INFO nodepool.image.build.fedora-23: Error: Failed to synchronize cache for repo 'updates' | 20:26 |
jeblair | Shrews: should we lock around that? | 20:26 |
jeblair | ianw: yeah, that looked suspicious to me. but i don't understand what that means or why it would happen 3 times on 2 hosts. | 20:26 |
ianw | i think upstream mirror issues. we had a couple of failures in dib CI similar | 20:26 |
jeblair | ianw: okay, so hopefully will be fixed by the time the builders cycle around to them again | 20:27 |
ianw | jeblair: yeah, i think so. i really need to get rid of f23! keep getting sidetracked | 20:27 |
Shrews | jeblair: that really shouldn't be necessary as it's not a current build or in progress build | 20:29 |
Shrews | we short-circuit the loop if it is | 20:29 |
jeblair | Shrews: yeah, but any builder can choose to mark it as deleting, so i'm wondering about two builders doing that at the same time, with the second one possibly issuing the store call after the first one actually deleted the znode. | 20:30 |
Shrews | jeblair: good point | 20:31 |
jeblair | Shrews: (also, we should probably only store it if the state has changed so we don't do so more than necessary) | 20:32 |
Shrews | jeblair: how do you mean? | 20:33 |
Shrews | oh, if it's not already DELETING | 20:33 |
jeblair | ya | 20:33 |
Shrews | k | 20:33 |
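(Editorial note: a sketch combining the two points from the exchange above: serialize the DELETING transition behind a lock, re-read the record inside it, and skip the store when the record is gone or already DELETING, so a second builder can't re-store after the first deleted the znode. The client class and method names are invented stand-ins, not the real nodepool ZooKeeper API.)

```python
# Hypothetical sketch: guarded, idempotent state change to DELETING.
import threading

DELETING = 'deleting'


class FakeZK:
    """Stand-in for the real ZooKeeper-backed build store."""
    def __init__(self):
        self._data = {'state': 'ready'}
        self._lock = threading.Lock()
        self.stores = 0  # count writes to show the short-circuit working

    def get_build(self):
        return dict(self._data) if self._data is not None else None

    def store_build(self, data):
        self.stores += 1
        self._data = data

    def mark_deleting(self):
        with self._lock:  # serialize competing builders
            build = self.get_build()
            if build is None or build['state'] == DELETING:
                return False  # record gone, or someone else got here first
            build['state'] = DELETING
            self.store_build(build)
            return True


zk = FakeZK()
first = zk.mark_deleting()
second = zk.mark_deleting()
# The first call stores once; the second sees DELETING and skips the write.
```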
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul: Fix retry accounting off-by-one bug https://review.openstack.org/408814 | 20:40 |
SpamapS | jeblair: ^ found it ;) | 20:40 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING https://review.openstack.org/408808 | 20:42 |
jeblair | SpamapS: \o/ | 20:58 |
SpamapS | jeblair: hey, looking at test_swift_instructions .. looks like that will need some glue written. That's still a thing in v3, yes? | 21:07 |
jeblair | SpamapS: yes, there's a little bit in the spec about moving it to the new auth section of the job config. so yeah, will require some re-plumbing. | 21:26 |
jeblair | SpamapS: it looks like the master branch has tries as base 1 as well? http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/model.py#n675 | 21:29 |
SpamapS | jeblair: entirely possible. I did not look at master, I looked at why we gave up early in v3. | 21:53 |
SpamapS | jeblair: Since I have 0 familiarity with master, it's a lot harder for me to go back that far (though I have done it a bit lately) | 21:54 |
SpamapS | much simpler for me to try and grok the test, and make sure v3 does what the test wanted and the spec wants. :-P | 21:54 |
SpamapS | so my commit message does in fact contain an assumption that may be untrue | 21:56 |
SpamapS | equally possible is that in master we don't set tries until we've _tried_ once... and in v3 we set it as part of the model creation | 21:57 |
* SpamapS should be more careful with words | 21:57 | |
jeblair | SpamapS: yeah, maybe this is the right fix (it seems to make what we expect happen)... i probably won't be able to sleep until i know what changed so i can update my mental map, but i'm happy to dig into that since that might just be a personal character flaw. :) | 21:59 |
SpamapS | jeblair: I get the benefit of having no base on which to build my anxiety .. my condolences to your sleep ;) | 22:03 |
*** jamielennox is now known as jamielennox|away | 22:29 | |
*** jamielennox|away is now known as jamielennox | 22:34 | |
*** saneax-_-|AFK is now known as saneax | 23:12 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul: WIP triggers https://review.openstack.org/408848 | 23:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul: WIP organize connections into drivers https://review.openstack.org/408849 | 23:25 |
jeblair | jamielennox, jhesketh: ^ that isn't remotely ready for review; i'm literally about halfway into the change. i'm mostly pushing it up so i don't lose it. the commit message is literally my shorthand notes. don't worry if you can't make heads or tails of it -- i think it will be clear when i'm done (and i'll add nice docstrings). but basically i think that direction gives us the ability to have nice singleton objects (drivers, connections) for ... | 23:29 |
jeblair | ... the things we don't want lots of copies of, but also the ability to have lots of per-tenant/pipeline objects (triggers, reporters, etc). and to nicely organize all the supporting code. i think it will make it easier to make extensible using entrypoints, and provide a nice foundation for new drivers (github, sqlalchemy, etc) | 23:29 |
jeblair | jamielennox, jhesketh: i have to run now, but i think i can finish that change tomorrow and maybe have it ready to look at next week | 23:30 |
jlk | oh sorry, I already peeked at it :) | 23:31 |
jeblair | jlk: np :) but hopefully that ^ explains the 'print' statements remaining :) | 23:32 |
jlk | yup well, that and the giant "WIP" | 23:32 |
jlk | and part of commenting was so that I see updates, since I've been staring at the 2.5 + github code | 23:32 |
jlk | and am eager to get to porting it over :) | 23:32 |
*** jamielennox is now known as jamielennox|away | 23:49 | |
*** Cibo_ has quit IRC | 23:58 | |
*** Cibo_ has joined #zuul | 23:59 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!