fungi | i was able to launch one myself in az2 (and delete it) | 00:02 |
---|---|---|
*** david-lyle_ has quit IRC | 00:06 | |
*** sarob has joined #openstack-infra | 00:13 | |
*** sdake has quit IRC | 00:14 | |
*** gokrokve has quit IRC | 00:17 | |
*** sarob has quit IRC | 00:18 | |
fungi | oh, i see one issue... we have name-filter: 'Performance' on the bare-precise entries for hpcloud | 00:33 |
fungi | that explains why we're getting the flavor error | 00:37 |
clarkb | odd that it seems to have recently stopped building slaves though | 00:37 |
fungi | yeah, however i can launch one myself manually, so something's up with nodepool i'm thinking | 00:38 |
fungi | still digging | 00:38 |
*** salv-orlando_ has joined #openstack-infra | 00:40 | |
*** salv-orlando has quit IRC | 00:40 | |
*** salv-orlando_ is now known as salv-orlando | 00:40 | |
clarkb | we should fix the performance thing. if you edit the file locally nodepool should just pick it up | 00:41 |
fungi | yeah, i'm going to | 00:41 |
fungi | i'm uploading a patch too while i'm thinking about it so we don't forget | 00:41 |
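For context, the filter fungi is removing lives in the image entries of nodepool.yaml, where it narrows which flavor names nodepool will pair with `min-ram`. An illustrative fragment, with the provider name, base image, and values assumed rather than copied from the real file:

```yaml
providers:
  - name: hpcloud-az2              # hypothetical provider entry
    images:
      - name: bare-precise
        base-image: 'Ubuntu Precise 12.04 LTS Server 64-bit'   # assumed
        min-ram: 8192                                          # assumed
        # name-filter: 'Performance'   # the entry being removed; no flavor
        #                              # here matches it, hence the error
```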
mordred | clarkb, fungi: we should also figure out how to get off of az[1-3] and on to the 1.1 cloud at HP - because the old azs are going to go away at some point | 00:42 |
mordred | sdague: yes. we should put that in the check queue - but that requires slightly more tooling - which I'm working on | 00:42 |
clarkb | mordred "figure out" | 00:42 |
*** gokrokve has joined #openstack-infra | 00:43 | |
mordred | clarkb: yeah. | 00:43 |
clarkb | mordred I'm not sure there is anything we can do. we just increased test time by a major factor | 00:43 |
mordred | clarkb: we still need to test nodes that are twice the size | 00:43 |
clarkb | mordred we can use rax for all those tests and hpcloud for single use unittesters | 00:43 |
mordred | since that's supposed to give us twice the cpu throttling allocation | 00:43 |
mordred | oh wow: min-ram: 30720 | 00:46 |
mordred | we're asking for pretty large nodes in region-b | 00:46 |
clarkb | mordred: initial testing of that had very poor results | 00:46 |
mordred | spectacular | 00:46 |
clarkb | yes and it didnt help | 00:46 |
clarkb | well it helped a tiny bit but not 2x | 00:46 |
mordred | well - maybe lifeless cloud will save us | 00:46 |
clarkb | mt rainier is out! | 00:47 |
mordred | nice! | 00:47 |
openstackgerrit | Jeremy Stanley proposed a change to openstack-infra/config: Remove incorrect name filters from nodepool config https://review.openstack.org/67684 | 00:51 |
*** talluri has joined #openstack-infra | 00:52 | |
*** talluri has quit IRC | 00:53 | |
*** talluri has joined #openstack-infra | 00:53 | |
fungi | aha, okay so manually launching a node in az2 from the webui seems to work, but launching one using novaclient hangs at "Instance building... 0% complete" | 00:54 |
clarkb | mordred: mattoliverau: I have to say the tmpfs/eatmydata zuul idea was really good. seems to still be humming along | 00:54 |
clarkb | fungi: weird, nova api version trouble maybe? | 00:55 |
sdague | clarkb: so fwiw, people are still approving stable/havana changes into the gate | 00:58 |
sdague | there are a few still in there | 00:58 |
sdague | I'm very tempted to bulk -2 all of stable/havana | 00:58 |
sdague | to prevent more of that | 00:58 |
*** talluri has quit IRC | 00:59 | |
clarkb | sdague: or, and this is eviler, delete the stable/havana branch permissions :) | 00:59 |
sdague | actually, that might be less evil. Then I don't have to bulk unset it | 00:59 |
fungi | we *could* just remove approve from the all-projects acl entry for refs/stable/* | 00:59 |
fungi | that way stable release managers can still +2, just can't approve | 01:00 |
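The change fungi is describing would go in the All-Projects access rules (project.config in refs/meta/config). A hedged sketch of the idea, with the ref pattern, group name, and label name assumed rather than taken from the live ACL:

```ini
[access "refs/heads/stable/*"]
    label-Code-Review = -2..+2 group stable-maint    # +2 stays available
    # label-Approved = +0..+1 group stable-maint     # approve permission dropped
```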
sdague | fungi: sounds good to me | 01:00 |
fungi | lemme finish formulating this support case with hpcloud first while you discuss amongst yourselves | 01:00 |
sdague | I also think we should remove reverify completely | 01:01 |
clarkb | my reason for it being eviler is it is a bit like a coup | 01:01 |
sdague | because what happens is patch authors will reverify their code because they want it in, and are never reading the ML about things that will or will not work | 01:02 |
*** odyssey4me has quit IRC | 01:02 | |
sdague | if it's only cores that can toggle the approved, then it should be a responsible set | 01:02 |
*** Sonicwall[A] has joined #openstack-infra | 01:03 | |
*** Sonicwall[A] has left #openstack-infra | 01:03 | |
fungi | how about this... we unset any approval votes on stable changes and send an e-mail to the stable management ml pleading with them not to approve until the bug(s) linked in that message are resolved (and to please pitch in if they can) | 01:03 |
clarkb | removing reverify entirely has long been my stance; definite +2 for that from me | 01:03 |
sdague | fungi: so emails aren't helping, that was a set of approves | 01:03 |
fungi | did e-mails about it go to the -dev ml or the stable branch ml? | 01:04 |
sdague | -dev | 01:04 |
fungi | i wonder if some of them don't read -dev as regularly | 01:04 |
sdague | well they should be | 01:05 |
fungi | 'course they might just not read any lists regularly, the stable branch ml included | 01:05 |
sdague | seriously, we can't be going and tracking down every freaking bad actor | 01:05 |
fungi | oh, my boot test finally went to 100% but now any ssh attempt to the resulting vm gets an immediate connection closed | 01:07 |
mordred | clarkb: yay re: tmpfs | 01:07 |
fungi | clarkb: it was pointed out last night that we missed one more thing... i got the rechecks page working again by stopping recheckwatch on the old zuul, copying the pickle and report from it to new zuul and starting the service there | 01:09 |
*** dcramer_ has quit IRC | 01:09 | |
mordred | I approve removing +A from refs/stable/* | 01:09 |
mordred | and then we can let bad actors declare themselves | 01:09 |
mordred | when they complain | 01:09 |
*** odyssey4me has joined #openstack-infra | 01:10 | |
sdague | you also need to reset all the approved bits, so reverifies don't happen | 01:12 |
sdague | or kill reverify | 01:12 |
mordred | we could block reverify on stable/* too | 01:12 |
sdague | ok, well, I have a bulk -2 script I can loop on now. Or someone else can take those on | 01:13 |
openstackgerrit | Derek Higgins proposed a change to openstack-infra/config: Add some dependencies required by toci https://review.openstack.org/67685 | 01:13 |
*** sarob has joined #openstack-infra | 01:13 | |
mordred | fungi: do we have any example negative lookahead regexes in zuul anywhere? | 01:14 |
mordred | ref: ^(?!(refs/stable/.*)).*$ | 01:16 |
mordred | perhaps? | 01:16 |
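Zuul's layout.yaml already uses this kind of pattern for ref filters, so a negative lookahead like the one above would sit on a pipeline trigger. A rough sketch, with the pipeline name and event chosen for illustration:

```yaml
pipelines:
  - name: post
    trigger:
      gerrit:
        - event: ref-updated
          ref: ^(?!(refs/stable/.*)).*$    # match everything except refs/stable/*
```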
sdague | actually, I'm going to bulk recheck all the stable havana jobs | 01:18 |
*** sarob has quit IRC | 01:18 | |
sdague | that don't already have a -1 | 01:18 |
mordred | ++ | 01:18 |
sdague | then they'll get a -1 and hopefully the reviewers won't be silly | 01:18 |
sdague | though with the node starvation, it will slow down the rest of things. But best idea I have. | 01:21 |
mordred | sdague: why not remove +A? | 01:21 |
sdague | doesn't solve reverify | 01:21 |
clarkb | it does | 01:21 |
sdague | also, I don't have those permissions. | 01:21 |
clarkb | zuul wont reverify without the votes | 01:21 |
sdague | ok, well, I already fired off the bulk recheck | 01:22 |
*** gokrokve has quit IRC | 01:23 | |
mordred | well - I just removed +A on stable/* from stable-maint | 01:23 |
mordred | between the two, let's see how it goes | 01:24 |
sdague | mordred: cool | 01:24 |
mordred | I'm going to announce that I've done that too | 01:24 |
sdague | mordred: or don't, and see who complains :) | 01:24 |
sdague | who didn't keep up with the list | 01:24 |
*** derekh has quit IRC | 01:24 | |
*** Hefeweizen has quit IRC | 01:26 | |
*** Hefeweizen has joined #openstack-infra | 01:26 | |
mordred | :) | 01:26 |
mordred | sdague: so how do we work on fixing the problem if stable/* is blocked for +A? | 01:27 |
sdague | mordred: a possible patch is in the queue | 01:28 |
sdague | https://review.openstack.org/#/c/67425/ | 01:28 |
sdague | though I have not tested a stable/havana change behind it | 01:29 |
sdague | so we could promote that to see | 01:29 |
sdague | I found and pulled out two more stable/havana changes from the queue | 01:34 |
sdague | and getting called to dinner, night all | 01:34 |
clarkb | ok, into areas of I-5 with poor service; going to afk now too | 01:38 |
*** praneshp has joined #openstack-infra | 01:38 | |
*** morganfainberg|z has quit IRC | 01:39 | |
*** morganfainberg|z has joined #openstack-infra | 01:40 | |
*** morganfainberg|z is now known as morganfainberg | 01:40 | |
*** morganfainberg is now known as Guest52195 | 01:40 | |
*** FallenPegasus has joined #openstack-infra | 01:44 | |
fungi | that failing glance change in the gate hit a socket timeout pip-installing sqlalchemy on both its pep8 and python27 jobs | 01:47 |
clarkb | fungi I think rax has had network blips today | 01:48 |
clarkb | I did not check their status page though | 01:48 |
*** FallenPegasus has quit IRC | 01:48 | |
fungi | yeah, two different bare-precise nodes in iad | 01:49 |
fungi | status.r.c says investigating potential issue for next-gen cloud servers in london, but otherwise green | 01:50 |
*** talluri has joined #openstack-infra | 01:55 | |
*** talluri has quit IRC | 01:59 | |
fungi | actually, playing around with the az2 problem, i think it may just be our old friend ssh timeout | 02:00 |
clarkb | 120 seconds not long enough? this edge connection is surprisingly useable for irc | 02:02 |
fungi | yeah, i think it's taking longer. i was able to get into an az2 i launched after waiting a few minutes | 02:09 |
*** milki has quit IRC | 02:09 | |
*** milki has joined #openstack-infra | 02:10 | |
*** sarob has joined #openstack-infra | 02:13 | |
*** sarob has quit IRC | 02:18 | |
*** oubiwann_ has quit IRC | 02:18 | |
*** oubiwann_ has joined #openstack-infra | 02:20 | |
*** thuc has joined #openstack-infra | 02:21 | |
fungi | yep, bumping my test script to sleep 300 seconds allows me to continue building... | 02:28 |
fungi | in fact, we've already got it set to 180 | 02:31 |
fungi | for hpcloud | 02:31 |
*** thuc has quit IRC | 02:32 | |
*** thuc has joined #openstack-infra | 02:33 | |
*** thuc has quit IRC | 02:37 | |
*** rfolco has joined #openstack-infra | 02:40 | |
fungi | actually 180 seems to be enough for my tests too | 02:42 |
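A rough sketch of the kind of manual boot-and-ssh check fungi describes, with the image, flavor, and key names as placeholders; the only real point is the longer wait before the first ssh attempt:

```bash
#!/bin/bash
# Hypothetical reproduction script for the az2 ssh-timeout theory.
nova boot --image precise-server --flavor standard.medium \
    --key-name infra-root testnode-az2
sleep 300   # 180 seconds turned out to be enough; 120 was not
IP=$(nova show testnode-az2 | awk '/accessIPv4/ {print $4}')
ssh -o StrictHostKeyChecking=no ubuntu@"$IP" true && echo "ssh works"
```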
*** talluri has joined #openstack-infra | 02:56 | |
*** talluri has quit IRC | 03:01 | |
*** ok_delta has quit IRC | 03:06 | |
*** HenryG has joined #openstack-infra | 03:07 | |
*** rfolco has quit IRC | 03:10 | |
*** talluri has joined #openstack-infra | 03:11 | |
*** ok_delta has joined #openstack-infra | 03:13 | |
*** sarob has joined #openstack-infra | 03:13 | |
fungi | mmm, i'm starting to think nodepoold might be in a bad way with respect to az2, because all those "building" status nodes are from 4-7 hours ago, don't show up at all in nova list, and can't be nodepool delete'd | 03:16 |
fungi | http://paste.openstack.org/show/61504 | 03:16 |
fungi | sqlalchemy.exc.OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE node SET state=%s, state_time=%s WHERE node.id = %s' (4, 1390101192, 1131534L) | 03:17 |
*** sarob has quit IRC | 03:18 | |
fungi | i don't think i'll be able to cleanly stop nodepool, but will get it restarted and see whether that helps | 03:19 |
*** talluri_ has joined #openstack-infra | 03:20 | |
*** ok_delta has quit IRC | 03:20 | |
*** talluri has quit IRC | 03:24 | |
*** talluri_ has quit IRC | 03:29 | |
fungi | huh, actually it stopped cleanly | 03:31 |
fungi | seems to have fixed the inability to delete at least | 03:33 |
fungi | yeah, i think it's okay now. i'm deleting all the remaining stale nodes now | 03:41 |
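The cleanup being described uses the nodepool CLI; a minimal sketch (the grep pattern and the node id are illustrative, and assume the id is shown by `nodepool list`):

```bash
# Find nodes stuck in the building state, then delete them one at a time.
nodepool list | grep building
nodepool delete 1131534    # repeat for each stale node id
```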
*** yamahata has quit IRC | 03:45 | |
*** rakhmerov has joined #openstack-infra | 03:52 | |
*** rakhmerov1 has joined #openstack-infra | 03:53 | |
*** rakhmerov has quit IRC | 03:54 | |
*** rakhmerov1 has quit IRC | 03:55 | |
*** jhesketh has quit IRC | 04:03 | |
*** rakhmerov has joined #openstack-infra | 04:04 | |
*** obondarev_ has joined #openstack-infra | 04:04 | |
*** sarob has joined #openstack-infra | 04:13 | |
*** senk has joined #openstack-infra | 04:15 | |
*** sarob has quit IRC | 04:18 | |
*** obondarev_ has quit IRC | 04:28 | |
*** obondarev_ has joined #openstack-infra | 04:29 | |
*** senk has quit IRC | 04:31 | |
*** senk has joined #openstack-infra | 04:31 | |
*** obondarev_ has quit IRC | 04:45 | |
*** odyssey4me has quit IRC | 04:48 | |
*** DennyZhang has joined #openstack-infra | 04:53 | |
*** odyssey4me has joined #openstack-infra | 04:55 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:00 | |
clarkb | fungi weird, glad to know all is better now | 05:01 |
*** DinaBelova_ is now known as DinaBelova | 05:02 | |
fungi | still keeping an eye on it, but probably passing out soon | 05:03 |
*** DinaBelova is now known as DinaBelova_ | 05:09 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:09 | |
*** rakhmerov1 has joined #openstack-infra | 05:10 | |
*** rakhmerov has quit IRC | 05:10 | |
*** sarob has joined #openstack-infra | 05:13 | |
*** sarob has quit IRC | 05:18 | |
*** david-lyle_ has joined #openstack-infra | 05:21 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:28 | |
*** david-lyle_ has quit IRC | 05:32 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:37 | |
*** vkozhukalov has joined #openstack-infra | 05:38 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:43 | |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 05:43 | |
*** senk has quit IRC | 05:44 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 05:45 | |
*** DinaBelova_ is now known as DinaBelova | 05:45 | |
*** rakhmerov1 has quit IRC | 05:46 | |
*** DinaBelova is now known as DinaBelova_ | 05:49 | |
*** rakhmerov has joined #openstack-infra | 05:54 | |
*** rakhmerov has quit IRC | 05:55 | |
*** rakhmerov has joined #openstack-infra | 06:09 | |
*** sarob has joined #openstack-infra | 06:13 | |
*** rakhmerov has quit IRC | 06:13 | |
*** San_D has quit IRC | 06:14 | |
*** sarob has quit IRC | 06:18 | |
*** oubiwann_ has quit IRC | 06:19 | |
*** nati_ueno has joined #openstack-infra | 06:20 | |
*** nati_uen_ has joined #openstack-infra | 06:25 | |
*** nati_ueno has quit IRC | 06:28 | |
*** sarob has joined #openstack-infra | 06:45 | |
*** odyssey4me has quit IRC | 06:45 | |
*** sarob has quit IRC | 06:50 | |
*** sarob has joined #openstack-infra | 07:06 | |
*** rakhmerov has joined #openstack-infra | 07:09 | |
*** rakhmerov1 has joined #openstack-infra | 07:11 | |
*** rakhmerov has quit IRC | 07:11 | |
*** DennyZhang has quit IRC | 07:16 | |
*** madmike has joined #openstack-infra | 07:21 | |
*** rakhmerov1 has quit IRC | 07:22 | |
*** salv-orlando has quit IRC | 07:22 | |
*** salv-orlando has joined #openstack-infra | 07:22 | |
*** bnemec_ has joined #openstack-infra | 07:24 | |
*** rakhmerov has joined #openstack-infra | 07:25 | |
*** crank has quit IRC | 07:25 | |
*** mfink has quit IRC | 07:25 | |
*** bnemec has quit IRC | 07:25 | |
*** lifeless has quit IRC | 07:25 | |
*** akscram has quit IRC | 07:25 | |
*** bradm has quit IRC | 07:25 | |
*** obondarev has quit IRC | 07:25 | |
*** obondarev has joined #openstack-infra | 07:26 | |
*** sandywalsh has quit IRC | 07:26 | |
*** rakhmerov has quit IRC | 07:26 | |
*** rakhmerov1 has joined #openstack-infra | 07:26 | |
*** crank has joined #openstack-infra | 07:26 | |
*** bradm has joined #openstack-infra | 07:27 | |
*** lifeless has joined #openstack-infra | 07:27 | |
*** akscram has joined #openstack-infra | 07:28 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow use of node from packages https://review.openstack.org/67604 | 07:29 |
*** rakhmerov1 has quit IRC | 07:33 | |
*** sandywalsh has joined #openstack-infra | 07:39 | |
*** sarob has quit IRC | 07:56 | |
*** julim has joined #openstack-infra | 08:21 | |
*** julim has quit IRC | 08:25 | |
*** yolanda has joined #openstack-infra | 08:26 | |
*** derekh has joined #openstack-infra | 08:27 | |
*** sarob has joined #openstack-infra | 08:28 | |
*** rakhmerov has joined #openstack-infra | 08:30 | |
*** sarob has quit IRC | 08:33 | |
*** sarob has joined #openstack-infra | 08:44 | |
*** rakhmerov has quit IRC | 08:45 | |
*** sarob has quit IRC | 08:49 | |
*** derekh has quit IRC | 09:02 | |
*** sarob has joined #openstack-infra | 09:13 | |
*** sarob has quit IRC | 09:19 | |
*** flaper87|afk is now known as flaper87 | 09:31 | |
*** rakhmerov has joined #openstack-infra | 09:41 | |
*** rakhmerov has quit IRC | 09:47 | |
*** rakhmerov has joined #openstack-infra | 09:57 | |
*** praneshp_ has joined #openstack-infra | 10:02 | |
*** praneshp has quit IRC | 10:02 | |
*** praneshp_ is now known as praneshp | 10:02 | |
*** rakhmerov has quit IRC | 10:03 | |
*** elasticio has joined #openstack-infra | 10:17 | |
*** yolanda has quit IRC | 10:49 | |
*** emagana has joined #openstack-infra | 10:50 | |
*** rakhmerov has joined #openstack-infra | 11:00 | |
*** emagana has quit IRC | 11:02 | |
*** thuc has joined #openstack-infra | 11:04 | |
*** yolanda has joined #openstack-infra | 11:06 | |
*** rakhmerov has quit IRC | 11:09 | |
*** thuc has quit IRC | 11:13 | |
*** matrohon has quit IRC | 11:40 | |
*** odyssey4me has joined #openstack-infra | 11:42 | |
*** odyssey4me has quit IRC | 11:44 | |
*** rakhmerov has joined #openstack-infra | 12:05 | |
*** rakhmerov has quit IRC | 12:10 | |
*** praneshp has quit IRC | 12:12 | |
*** obondarev_ has joined #openstack-infra | 12:17 | |
*** elasticio has quit IRC | 12:34 | |
sdague | so I'm starting to feel that we need to take the jenkins outage and get the logs fixed, because our fail rate is as high as it was before russellb's concurrency patch, but we're pretty blind on what's causing it without console logs | 12:53 |
*** dizquierdo has joined #openstack-infra | 12:57 | |
*** DinaBelova_ is now known as DinaBelova | 12:57 | |
*** afazekas_ has joined #openstack-infra | 12:57 | |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Remove non-voting documentation gate job https://review.openstack.org/67702 | 13:03 |
*** rakhmerov has joined #openstack-infra | 13:07 | |
*** rakhmerov has quit IRC | 13:11 | |
*** sarob has joined #openstack-infra | 13:13 | |
*** sarob has quit IRC | 13:18 | |
*** DinaBelova is now known as DinaBelova_ | 13:25 | |
*** salv-orlando has quit IRC | 13:33 | |
*** DinaBelova_ is now known as DinaBelova | 13:40 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 13:44 |
*** obondarev_ has quit IRC | 14:06 | |
*** rakhmerov has joined #openstack-infra | 14:08 | |
*** rakhmerov has quit IRC | 14:12 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/devstack-gate: Timestamp setup logs https://review.openstack.org/67086 | 14:13 |
*** sarob has joined #openstack-infra | 14:13 | |
*** sarob has quit IRC | 14:18 | |
*** oubiwann_ has joined #openstack-infra | 14:29 | |
*** beagles has quit IRC | 14:43 | |
*** oubiwann_ has quit IRC | 14:46 | |
*** bknudson has left #openstack-infra | 14:52 | |
sdague | there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU | 14:56 |
*** dizquierdo has quit IRC | 15:04 | |
*** rakhmerov has joined #openstack-infra | 15:09 | |
*** rakhmerov has quit IRC | 15:14 | |
*** bknudson has joined #openstack-infra | 15:15 | |
*** coolsvap has joined #openstack-infra | 15:25 | |
*** obondarev has quit IRC | 15:41 | |
*** obondarev has joined #openstack-infra | 15:42 | |
*** skraynev has quit IRC | 15:44 | |
*** skraynev has joined #openstack-infra | 15:44 | |
*** luqas has joined #openstack-infra | 15:46 | |
mordred | sdague: I agree with all of the things in your email | 15:52 |
sdague | mordred: great, now we just need to implement them :) | 15:54 |
sdague | the early kick out is something I'd like to understand, specifically how we'd handle it | 15:54 |
sdague | mostly how do we signal back about it | 15:55 |
sdague | basically, under the current state of things, I don't think icehouse-2 is possible next week | 15:57 |
sdague | we've got comment typo fixes in the gate queue that have been there for 40 hrs | 15:58 |
mordred | sdague: early kick is tricky (responded to email) | 15:58 |
sdague | cool | 15:58 |
*** yolanda has quit IRC | 15:59 | |
mordred | sdague: the theory is that we should start streaming subunit results back to zuul, I believe | 15:59 |
mordred | hrm | 15:59 |
sdague | the gate is at a new level of bad | 15:59 |
mordred | and/or have the thing running/processing the tests be attached to the gearman bus so that it could read the stream and return a gearman status on detected fail | 15:59 |
sdague | and we're actually completely blind as to why, because we're losing at least 3/4 of the console logs from elastic search | 16:00 |
mordred | the work towards getting rid of jenkins was work towards being able to do early kick | 16:00 |
mordred | I wonder if maybe we should think about how to implement it without not-jenkins | 16:00 |
mordred | sdague: I _think_ zaro has the scp-plugin fix | 16:00 |
sdague | yeh, I think it needs implementing by march 1st, otherwise i3 will not be possible | 16:01 |
sdague | mordred: right, I keep hearing that :) | 16:01 |
mordred | sdague: :) | 16:01 |
mordred | sdague: is our incoming velocity higher than usual? or have things just gotten worse in the racy department? | 16:02 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Remove reverify entirely https://review.openstack.org/67708 | 16:05 |
sdague | so we've merged 48 changes since friday | 16:05 |
sdague | means we're merging only 30 changes / day right now | 16:06 |
sdague | git log --since=2014-01-17 --author=jenkins | grep '^commit' | wc -l on openstack/openstack | 16:06 |
mordred | wow | 16:06 |
notmyname | wow | 16:06 |
bknudson | how many could merge in a day? | 16:07 |
sdague | bknudson: if we weren't resetting... hundreds | 16:07 |
sdague | but the biggest issue in my mind right now is we're actually completely blind to *why* we are failing, and largely have been for the last 2 weeks | 16:08 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Early fail on pep8 in the check pipeline https://review.openstack.org/67709 | 16:08 |
mordred | sdague: ok. there's two of your things | 16:08 |
sdague | as we're seeing huge loss of logs going into elastic search | 16:08 |
sdague | mordred: awesome | 16:08 |
mordred | fungi: you awake? ^^ | 16:09 |
sdague | also, out of those 48, at least 2 were ninja merges that I had fungi do to relieve some of the fails | 16:09 |
mordred | sdague: https://github.com/jenkinsci/scp-plugin/pull/8 | 16:09 |
mordred | anybody else who wants to review some java ^^ | 16:09 |
*** DinaBelova is now known as DinaBelova_ | 16:10 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 16:10 | |
*** rakhmerov has joined #openstack-infra | 16:10 | |
sdague | mordred: we could also post-populate the missing log files not from jenkins. We don't lose them to the log server, just to ES | 16:11 |
mordred | agree | 16:11 |
sdague | that would at least let the bulk query for ES actually help us sort issues | 16:11 |
sdague | there also seems to be a new zuul bug, where if a failing change hits the top of the gate, it gets restarted again instead of thrown out | 16:12 |
notmyname | mordred: why not run all the project-specific tests before the common integration tests (instead of just pep8)? | 16:13 |
*** sarob has joined #openstack-infra | 16:13 | |
sdague | notmyname: I think it's a judgement call of bang vs. buck. Throwing out on pep8 takes very little time, and saves a bunch of nodepool resources. | 16:15 |
*** rakhmerov has quit IRC | 16:15 | |
*** luqas has quit IRC | 16:16 | |
mordred | notmyname: I agree with you on that too - but it would require adding some new logic to zuul's config processing, where I can do the pep8 early right now | 16:16 |
notmyname | ok | 16:16 |
mordred | notmyname: specifically, I don't have a way of saying "run these _Three_ things and then when all are done run this additional thing" | 16:17 |
mordred | I think it's a feature we need, tbh | 16:17 |
notmyname | I was looking at the neutron one that just fell off the top of the gate queue (of course causing a gate flush). https://review.openstack.org/#/c/67475/ but I didn't realize those neutron tests were taking longer than the integration ones | 16:17 |
notmyname | mordred: isn't a feature of zuul the composability of the jobs? run set one, then set two | 16:17 |
*** sarob has quit IRC | 16:18 | |
zaro | clarkb: i've added the additional logging to scp-plugin. new build is on review-dev.o.o | 16:19 |
sdague | notmyname: well, actually neutron at top was the other issue | 16:19 |
sdague | where that thing failed deeper in the queue | 16:19 |
fungi | zaro: jenkins-dev? | 16:19 |
sdague | the stuff above it merged | 16:19 |
sdague | and zuul then reconnected the pipeline to it | 16:19 |
zaro | ohh yeah, jenkins-dev. | 16:19 |
sdague | <sdague> there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU | 16:20 |
bknudson | I've seen the failure from 67475 in another review, since I was just looking into why the other one failed ... "tempest.scenario.test_cross_tenant_connectivity.TestNetworkCrossTenant.test_cross_tenant_traffic" | 16:21 |
sdague | yeh, so that's just which test failed. That's not why it failed. We need to know why that was expected to work and did not | 16:23 |
notmyname | sdague: when you added the timer to the zuul queue (you did that right?), did you by any chance add in a statsd timing metric for graphite? I'd love to graph the average time a patch spends in the gate queue | 16:23 |
sdague | notmyname: nope, didn't touch graphite at all | 16:24 |
sdague | I actually think the graphite metrics in this space are kind of broken. The jenkins interrupt which happens on resetting nodes is often classified as failure | 16:25 |
sdague | which it isn't | 16:25 |
sdague | so all the graphite numbers are worse that reality | 16:25 |
*** jkt has joined #openstack-infra | 16:25 | |
sdague | s/that/than/ | 16:26 |
bknudson | looks like the failure from 67475 and my own review is a known problem -- https://bugs.launchpad.net/tempest/+bug/1262613 | 16:26 |
jkt | hi there, I'm reading through your openstack-infra/config repo, and have noticed the remark about an ongoing transition towards puppet modules "straight from puppetforge" | 16:26 |
fungi | sdague: your screen capture doesn't look to me like it's showing what your comment implies | 16:26 |
sdague | fungi: so I don't have the capture from before | 16:27 |
sdague | that was already in a fail state | 16:27 |
jkt | I'm trying to deploy openstack setup at $job, to be managed by puppet, and I've never done a green-field deployment of puppet before | 16:27 |
sdague | got moved to the head | 16:27 |
bknudson | there's an e-r check for bug/1262613 already | 16:27 |
sdague | then got the entire stream behind it | 16:27 |
jkt | I'm wondering about the security implications of using "random" version of code from "random" site on the web | 16:27 |
bknudson | the e-r check says it was resolved. | 16:27 |
sdague | fungi: you kind of have to be watching zuul to see these happen | 16:27 |
jkt | I mean, I'm OK with using their packages for puppet itself, but blindly loading modules from forge makes me a bit uneasy | 16:28 |
fungi | sdague: that screen capture says that the failing head was severed and still has 19 minutes remaining until its other tests complete, but a recalculation of all the rest of the gate is underway so the following changes don't have all their jobs started yet | 16:28 |
jkt | is that really the plan, and do you have something for version management of these? | 16:28 |
bknudson | oh, I guess I got the wrong bug. | 16:28 |
mordred | jkt: hi! | 16:28 |
mordred | jkt: do you mean an install of the openstack-infra testing stuff? or of openstack itself? | 16:29 |
sdague | fungi: I'm pretty sure that change was off failing, the jobs behind were running, after merge they all reset again | 16:29 |
sdague | I've seen this twice this morning | 16:29 |
jkt | mordred: in the end, I'd like to install and configure openstack, but I'm not that far yet. What I'm doing now is getting familiar with puppetizing the infrastructure from day zero | 16:29 |
fungi | sdague: oh, okay. in that case i'll keep an eye out and see if i can catch it doing what you're suggesting | 16:29 |
jkt | mordred: and I'm asking here because I'm more or less copying the setup you have documented in that repo | 16:30 |
mordred | jkt: gotcha. so, those of us in here don't know anything about installing openstack itself via puppet - so I wanted to be clear on expectations :) | 16:30 |
mordred | jkt: for the other stuff - yeah, it's still in the plans to use what we can from puppet forge | 16:30 |
mordred | but we don't really do it blindly - we take one module at a time as we can - and already use several key ones - like the puppet-mysql module | 16:31 |
mordred | that said - one of the other goals of that is to break that repo apart and treat several of the modules like they're forge modules | 16:31 |
mordred | even though we wrote them | 16:31 |
clarkb | zaro: thanks | 16:31 |
mordred | for better lifecycle and composability | 16:31 |
jkt | mordred: so essentially specifying a version beforehand, in that install_modules.sh script, and running it every now and then on the master? | 16:32 |
mordred | clarkb, fungi: could you +2/+A the two config changes I posted above - I agree with sdague that we should do them | 16:32 |
mordred | jkt: that's right | 16:32 |
fungi | mordred: will 67709 deal more sanely now with the situation which caused us to turn it off before (changes to requirements result in very obscure failures on the pep8 jobs which are hard for devs to diagnose)? | 16:32 |
jkt | what I like a *lot* in your setup is that everything is in one repo; and having to update modules manually is something which, to me, looks a bit against that goal | 16:32 |
mordred | fungi: I believe so - but even if it doesn't , I think the tradeoff is worth it for the next couple of weeks | 16:33 |
mordred | jkt: the things we want in external modules are the things that don't change much - or that _we_ don't change that much | 16:33 |
jkt | I've just learned about `git subtree add --prefix modules/... ... --squash`, and I have to admit I like it a lot | 16:33 |
mordred | hehe | 16:33 |
mordred | we don't use submodules at all :) | 16:34 |
jkt | subtree != submodule, that's the catch | 16:34 |
mordred | $ git subtree --help | 16:34 |
mordred | No manual entry for gitsubtree | 16:34 |
jkt | it's essentially "get a checkout of that remote ref I specify as ..., squash it as a single commit, and merge it as a subdirectory under the --prefix" | 16:34 |
jkt | http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/ | 16:35 |
mordred | ah. interesting. so like a way to keep a super-repo for composability - but not attempt to treat the subtrees as things you'd do dev on in that context | 16:35 |
jkt | it's also 100% client-side; what gets pushed is a boring old commit | 16:36 |
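For reference, the workflow jkt is describing looks roughly like this; the module path, upstream URL, and tags are placeholders:

```bash
# Vendor an external puppet module as a single squashed commit.
git subtree add --prefix=modules/mysql \
    https://github.com/puppetlabs/puppetlabs-mysql.git 0.6.1 --squash

# Later, pull a newer upstream release into the same subdirectory.
git subtree pull --prefix=modules/mysql \
    https://github.com/puppetlabs/puppetlabs-mysql.git 1.0.0 --squash
```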
mordred | it's also not installed in the version of git in debian | 16:37 |
mordred | :) | 16:37 |
mordred | anyway - looks like interesting evening reading | 16:37 |
jkt | yeah, looks like it got only added in 2012 | 16:38 |
clarkb | subtree is bad for other reasons :) | 16:38 |
fungi | mordred: oh, i see... 67709 only skips py26/27 and docs, but not other things like requirements checks, dsvm jobs, py33, one-offs and so on | 16:38 |
jkt | clarkb: I would love to listen to them | 16:38 |
mordred | fungi: oh! piddle. that's a bug and you're right | 16:38 |
mordred | hrm. in fact | 16:38 |
clarkb | biggest problem for us immediately would be no gerrit support | 16:38 |
mordred | with the new template org - I do not think it's possible to do what I was trying to do | 16:38 |
fungi | mordred: well, i already approved | 16:38 |
mordred | clarkb: it doesn't need it | 16:38 |
fungi | but i can -2 | 16:38 |
mordred | fungi: well, it'll be _something_ | 16:38 |
jkt | mordred: and btw, on that Atlassian page, they even show how to push commits back to the original repo (which provided the subtree contents) | 16:39 |
mordred | fungi: but yeah - go ahead and -2 and I'll rework | 16:39 |
clarkb | second problem is you have to manually know to write commits that don't span trees iirc | 16:39 |
mordred | that's the main problem I'd see | 16:39 |
jkt | clarkb: it's "subtree", not "submodule", not what you'll see in Gerrit (or any other git client) is a simple commit adding whole directory at once | 16:39 |
clarkb | and humans fail at things like that | 16:40 |
zaro | clarkb, sdague : i tried disabling drafts on review-dev.o.o but was _not_ able to. | 16:40 |
clarkb | jkt i know | 16:40 |
sdague | zaro: ok, thanks | 16:40 |
mordred | zaro: darn | 16:40 |
clarkb | jkt but you can't have commits that span trees | 16:40 |
jkt | clarkb: how come? | 16:40 |
sdague | ok, I need to get away from the computer for a bit. I'll check back in during football later. | 16:40 |
clarkb | without gerrit support you can't enforce that easily | 16:40 |
mordred | clarkb: so, early fail - how do we do that without getting rid of jenkins first? | 16:41 |
clarkb | jkt because then you can't split trees iirc (or can but need filter branching) | 16:41 |
clarkb | its a matter of sanity | 16:41 |
clarkb | mordred our test runner jenkins side, e.g. run-testr, needs to do testr run --subunit and return 1 as soon as a fail happens | 16:42 |
jkt | clarkb: https://github.com/git/git/blob/master/contrib/subtree/git-subtree.txt shows the split feature (even with stable, i.e. deterministic and non-changing commit IDs, but I have no experience with them | 16:42 |
mordred | clarkb: but won't that abort the rest of the test run? | 16:42 |
clarkb | mordred not hard to do but would require a special subunit parser unless lifeless has a flag for that | 16:42 |
clarkb | mordred yes | 16:43 |
clarkb | without changing zuul to understand fail without return 1 I think that is our option | 16:43 |
mordred | clarkb: yeah - I don't think there's any quick and dirty way to do it - I'm just trying to figure out if there is any conceivable way at all to get there that I could parcel some tasks out to achieve | 16:44 |
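A minimal sketch of the "special subunit parser" clarkb mentions, assuming a subunit v2 stream on stdin and the python-subunit/testtools stream APIs; it is illustrative only, not actual infra code, and simply exits non-zero as soon as a failing test shows up:

```python
#!/usr/bin/env python
# Hypothetical fail-fast filter: testr run --subunit | ./fail_fast.py
# (python 2 assumed; on python 3 read from sys.stdin.buffer instead)
import sys

import subunit
import testtools


class FailFast(testtools.StreamResult):
    def status(self, test_id=None, test_status=None, **kwargs):
        # 'fail' and 'uxsuccess' are the failing statuses in the stream API.
        if test_status in ('fail', 'uxsuccess'):
            sys.stderr.write('first failure: %s\n' % test_id)
            sys.exit(1)


case = subunit.ByteStreamToStreamResult(sys.stdin, non_subunit_name='stdout')
case.run(FailFast())
```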
mordred | also - anybody know when we get jeblair back? today? tomorrow? tuesday? | 16:44 |
clarkb | either tomorrow or tuesday | 16:45 |
jkt | anyway, thanks mordred and clarkb, your experience is appreciated | 16:46 |
*** elasticio has joined #openstack-infra | 16:46 | |
mordred | jkt: you're welcome - thanks for the pointer to the blog - I'm less unhappy about it than clarkb is - although I've got a really narrow usecase for it I'd like to poke at | 16:46 |
clarkb | mordred because I have seen people use subtree thinking it fixes the world but really it just changes the problems :) | 16:47 |
*** sdake has joined #openstack-infra | 16:47 | |
mordred | clarkb: yeah - but it seems like it could be a specific solution for module composability for _us_ instead of puppet librarian or whatnot | 16:48 |
clarkb | oh and since subtree smashes trees together you have an extra element of license stuff to consider | 16:48 |
clarkb | mordred no that is the case people thought it would fix | 16:48 |
clarkb | they wrote r10k instead | 16:48 |
*** rakhmerov has joined #openstack-infra | 16:48 | |
mordred | but purely for deployment mechanics - not as something I'd expect us to ever check out ourselves | 16:48 |
mordred | clarkb: I betcha they thought they could use it for composability and development | 16:49 |
*** sandywalsh has quit IRC | 16:49 | |
clarkb | mordred: so you are thinking some post merge step that builds a new tree that only deployments use? | 16:51 |
mordred | clarkb: yes | 16:51 |
mordred | and that way we'd only do commits to that repo ourselves to update the commit tracking the external module | 16:52 |
mordred | it's probably a bad idea still and I should probably figure out r10k | 16:52 |
clarkb | I think r10k is simpler and it can consume items not in git as well | 16:53 |
mordred | yeah | 16:53 |
*** rakhmerov has quit IRC | 16:53 | |
clarkb | back to jenkins. should I shutdown a jenkins and try new scp plugin? | 16:55 |
mordred | clarkb: yes | 16:55 |
clarkb | ok starting that shortly | 16:55 |
mordred | clarkb: is there any specific reason why the dsvm jobs don't have a template? | 16:56 |
mordred | or just not gotten to? | 16:56 |
clarkb | not gotten to | 16:58 |
mordred | k | 16:58 |
clarkb | we have been doing that refactor with small deltas to make it easier to review and help prevent massive test breakage | 16:59 |
mordred | gotcha | 16:59 |
mordred | I LOVE that work, btw | 16:59 |
*** senk has joined #openstack-infra | 16:59 | |
clarkb | mordred: what we need is a template that uses the envinject plugin to set all of the various d-g flags without changing the actual script calls | 16:59 |
clarkb | since the env vars are what vary test to test | 17:00 |
mordred | clarkb: we need many things | 17:00 |
clarkb | I am going to put the scp plugin on jenkins02 because it runs the largest variety of tests | 17:00 |
mordred | okie | 17:00 |
clarkb | well 01 does too but 01 is old jenkins and old scp plugin so it's fine | 17:00 |
clarkb | is fungi still around? | 17:01 |
clarkb | fungi: any opinions on ^ | 17:01 |
clarkb | 02 is in shutdown mode | 17:03 |
*** senk has quit IRC | 17:06 | |
clarkb | estimated time remaining 54 minutes :( | 17:07 |
mordred | clarkb: I'm too dumb to have actually followed the whole thing - can you give me the tl;dr on why we have different jenkins versions? | 17:07 |
clarkb | mordred: we upgraded one jenkins host (02) then went on holidays | 17:07 |
clarkb | spun up 03 and 04 on the new version but haven't upgraded jenkins.o.o and jenkins01 yet | 17:07 |
clarkb | we were being conservative | 17:08 |
*** luqas has joined #openstack-infra | 17:08 | |
openstackgerrit | Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation https://review.openstack.org/67712 | 17:08 |
mordred | gotcha | 17:08 |
Mithrandir | (that jjb change)> what? Literalinclude looks pretty broken then | 17:10 |
Mithrandir | and should rather be fixed | 17:10 |
clarkb | Mithrandir: that is kind of funny | 17:11 |
clarkb | would ./ work too? that might be more intuitive | 17:11 |
Mithrandir | I'd be fine with ./, but / meaning "start looking from where this file is located" is.. not how paths work. | 17:11 |
clarkb | Mithrandir: yup | 17:12 |
Mithrandir | The file name is usually relative to the current file’s path. However, if it is absolute (starting with /), it is relative to the top source directory. | 17:12 |
Mithrandir | is what the docs say | 17:12 |
Mithrandir | so the current one should work, barring bugs | 17:12 |
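Put differently, per the docs Mithrandir quotes, both of these spellings should resolve (the filenames are placeholders):

```rst
.. literalinclude:: ../samples/example-job.yaml

.. literalinclude:: /samples/example-job.yaml
```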
zaro | the './' didn't work, got the same warning. | 17:16 |
*** gokrokve has joined #openstack-infra | 17:16 | |
Mithrandir | can we use a custom sphinx tag instead? nonbrokenliteralinclude? | 17:17 |
clarkb | mordred: I am going to afk for a short time while I wait for tests to finish on 02 | 17:17 |
Mithrandir | or override the built-in | 17:17 |
mordred | clarkb: kk | 17:18 |
zaro | Mithrandir: how would custom tag work? | 17:20 |
*** senk has joined #openstack-infra | 17:20 | |
openstackgerrit | Brant Knudson proposed a change to openstack-infra/elastic-recheck: Add check for bug 1270608 https://review.openstack.org/67713 | 17:20 |
*** rakhmerov has joined #openstack-infra | 17:20 | |
Mithrandir | zaro: http://sphinx-doc.org/extensions.html, apparently | 17:21 |
* fungi is back... reading now | 17:21 | |
zaro | Mithrandir: i'm not following, is literalinclude an extension? | 17:24 |
Mithrandir | zaro: it's a directive, I'm not sure if it's core or not. | 17:25 |
Mithrandir | and we could have our own that works like literalinclude, but with non-crazy semantics. | 17:25 |
fungi | clarkb: on getting jeblair back, keep in mind that he, i and the rest of the foundation staff will be in utah or in transit most of the week. i'll be working from airline seats and airport lounges most of tuesday and friday | 17:25 |
*** rakhmerov has quit IRC | 17:25 | |
clarkb | gah | 17:26 |
clarkb | its like the conference madness never ends | 17:26 |
fungi | also, r10k was the cpu for the sgi o2. what else is it in this context? | 17:26 |
mordred | Mithrandir: perhaps it has to do with how we're running sphinx | 17:26 |
clarkb | and we are rotating batters | 17:26 |
mordred | Mithrandir: and what it thinks the top of our source dir is | 17:26 |
mordred | rather than being a bug in literalinclude itself | 17:26 |
clarkb | fungi puppet librarian that actually works | 17:26 |
mordred | fungi: jesus, really/ | 17:27 |
mordred | ? | 17:27 |
Mithrandir | according to the docs, if it starts with something else than /, it should be relative to the file. | 17:27 |
mordred | oh - salt conf | 17:27 |
Mithrandir | mordred: so either it's a bug in the implementation or the docs. | 17:27 |
mordred | hrm | 17:27 |
fungi | clarkb: upgrading scp plugin on 02 sounds good | 17:27 |
clarkb | fungi great. currently waiting for tests to finish there | 17:27 |
mordred | Mithrandir: but we're doing extraction with an extension of our own | 17:28 |
clarkb | I copied the scp.jpi from -dev to my homedir on 02 | 17:28 |
mordred | Mithrandir: so the file that it's relative to might not be the file we think it is | 17:28 |
*** sdake has quit IRC | 17:28 | |
Mithrandir | mordred: oh, that might make for extra fun. | 17:28 |
Mithrandir | mordred: maybe that extension should adjust the paths or something, then? | 17:29 |
mordred | Mithrandir: so, while I do think that it's a bug somewhere ... yeah - we might want to investigate the yaml extension we're using, and/or the results of pulling docstrings and generating sphinx from them | 17:30 |
*** sdague has quit IRC | 17:30 | |
fungi | mordred: well, we happen to be there coincident with saltconf, but it's mainly a couple days mid-cycle for the staff to get face time not at a summit | 17:30 |
mordred | fungi: wait - so you're saying we don't REALLY get jeblair back, AND we lose you? | 17:30 |
*** sdague has joined #openstack-infra | 17:30 | |
clarkb | mordred: yes | 17:30 |
clarkb | you and I are batting next | 17:30 |
bknudson | there's an e-r check for 1269940 that hit on https://review.openstack.org/#/c/66209/ | 17:30 |
bknudson | but the string in the yaml doesn't match anything in console.html | 17:30 |
mordred | fungi: just so you know, I don't think it's valuable to anyone for the foundation staff to have face time | 17:31 |
fungi | mordred: well, i'll probably be worthless wednesday/thursday, but will be working with lossy/high-latency network access on tuesday and friday | 17:31 |
mordred | fungi: but I have no leg to stand on as I've been afk for several weeks | 17:31 |
mordred | and i'll be going to brussels week after next | 17:31 |
clarkb | jeblair is going to brussels too | 17:31 |
fungi | mordred: annual performance reviews and whatnot. i guess there's some perceived benefit to do face-to-face team building | 17:32 |
clarkb | we need to clone a few fungis | 17:32 |
mordred | fungi: but your team is all of openstack - not each other | 17:32 |
*** gokrokve has quit IRC | 17:32 | |
* fungi reproduces asexually through spore propagation, so should be doable | 17:32 | |
clarkb | fungi also north carolinians are all robots | 17:33 |
mordred | fungi: you have ZERO goals separate from the project's goals, or at least you _shouldn't_ have any goals separate from the project's goals | 17:33 |
mordred | clarkb: ++ | 17:33 |
fungi | mordred: that i agree with. i'd rather see my performance review come from random cross-sections of the project ;) | 17:33 |
mordred | fungi: ++ | 17:33 |
clarkb | we just cp your AI from one machine to another :P | 17:33 |
mordred | fungi: in fact, seriously - performance review for foundation staff should be done by the project - maybe using condorcet | 17:33 |
fungi | clarkb: i haven't figured out how to upload my consciousness yet, but once i get that working we should be able to make copies just fine | 17:33 |
fungi | we have clouds | 17:33 |
fungi | mordred: sounds like a motion for the board | 17:34 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Allow use of node from packages https://review.openstack.org/67604 | 17:34 |
*** nati_uen_ has quit IRC | 17:35 | |
mordred | clarkb: oh! you know what - if both fungi and jeblair are afk, we can finally push those HP-specific goals we've been hiding caring about!!! | 17:35 |
fungi | heh | 17:35 |
*** dcramer_ has joined #openstack-infra | 17:42 | |
*** gokrokve has joined #openstack-infra | 17:42 | |
*** coolsvap has quit IRC | 17:50 | |
openstackgerrit | Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation https://review.openstack.org/67712 | 17:51 |
*** gokrokve has quit IRC | 17:53 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow people to source bin/setenv.sh https://review.openstack.org/67714 | 17:54 |
clarkb | jenkins02 is idle now | 18:00 |
clarkb | fungi: mordred: I am going to turn it off and start it with zaro's scp plugin build | 18:00 |
clarkb | jenkins02 is starting again | 18:03 |
bknudson | is there a good way to download logs in http://logs.openstack.org/47/66247/4/check/check-tempest-dsvm-full/0d6e9cc/logs/ ? | 18:03 |
bknudson | when I wget screen-n-cpu.txt.gz it's downloading forever | 18:03 |
bknudson | the logs page shows 10M but I canceled the download after 50M | 18:04 |
clarkb | bknudson: it is a massive file :), if you set the encoding type to gzip you will get gzipped files instead of uncompressed data | 18:04 |
clarkb | bknudson: it is on disk as 10MB compressed, wget requests uncompressed data though | 18:04 |
bknudson | I can uncompress it after I download if I can figure out how to get it compressed. | 18:05 |
clarkb | bknudson: right you set the encoding header to gzip to get it compressed | 18:05 |
dims | http://www.commandlinefu.com/commands/view/7180/get-gzip-compressed-web-page-using-wget. | 18:05 |
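Concretely, the trick is just to ask the log server for the gzip-encoded body instead of letting it decompress on the fly; the URL below is shortened to a placeholder:

```bash
wget --header="Accept-Encoding: gzip" \
    "http://logs.openstack.org/.../logs/screen-n-cpu.txt.gz" -O screen-n-cpu.txt.gz
gunzip screen-n-cpu.txt.gz
```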
*** senk has quit IRC | 18:06 | |
*** thuc has joined #openstack-infra | 18:07 | |
clarkb | a quick spot check of console logs in logstash after the new scp plugin went in looks good | 18:08 |
clarkb | sdague: fungi zaro I figure we let that burn in a bit today then do the others | 18:08 |
clarkb | I also need to run errands shortly so letting it run for a bit is my excuse to let me do that :) | 18:08 |
mordred | clarkb: awesome! | 18:10 |
bknudson | dims: Thanks, --header="Accept-Encoding: gzip" worked for me. | 18:12 |
*** salv-orlando has joined #openstack-infra | 18:12 | |
*** persia has quit IRC | 18:12 | |
*** 23LAAXGQE has joined #openstack-infra | 18:15 | |
*** pcrews has joined #openstack-infra | 18:15 | |
zaro | clarkb: sounds good. | 18:15 |
*** 23LAAXGQE has quit IRC | 18:16 | |
*** 23LAAXGQE has joined #openstack-infra | 18:16 | |
*** 23LAAXGQE is now known as persia | 18:16 | |
*** gokrokve has joined #openstack-infra | 18:17 | |
openstackgerrit | Felipe Reyes proposed a change to openstack-infra/jenkins-job-builder: Adds Mercurial SCM support https://review.openstack.org/61547 | 18:20 |
fungi | clarkb: awesome. i'm happy to iterate over the other jenkins masters later if we want | 18:27 |
*** nati_ueno has joined #openstack-infra | 18:35 | |
*** nati_ueno has quit IRC | 18:35 | |
*** nati_ueno has joined #openstack-infra | 18:36 | |
*** ewindisch is now known as zz_ewindisch | 18:38 | |
notmyname | if someone wants to abort the jobs running for 65399,3 I'm ok with that. I've got the log I need (https://jenkins02.openstack.org/job/gate-swift-python26/3523/console). Looks like a timing thing, but I'll make sure there is a LP bug for it in case it comes up again | 18:41 |
notmyname | bug 1224208 | 18:45 |
notmyname | well, actually since it's not at the top of the queue, some failure ahead of it could re-enqueue it and it would pass next time | 18:48 |
*** thuc has quit IRC | 18:52 | |
fungi | yep | 18:54 |
fungi | unless you expect that to be a consistent failure, better to just let it ride | 18:54 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox https://review.openstack.org/67721 | 18:55 |
mordred | ok. | 18:55 |
mordred | that's the craziest thing I've written in quite some time | 18:55 |
mordred | ttx, fungi, clarkb, sdague ^^ that, I'm not even kidding, allows you to run javascript toolchain stuff via tox and virtualenvs without having to install any javascript toolchain globally | 18:56 |
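One way to get there, and roughly the idea behind the change, is to have tox pull the node toolchain into the virtualenv via the nodeenv package; the env name and commands below are assumptions, not the actual tox.ini:

```ini
# Hypothetical tox.ini fragment: node/npm are installed into the virtualenv
# by nodeenv, so no global javascript toolchain is required.
[testenv:grunt]
deps = nodeenv
commands =
    nodeenv -p --prebuilt
    npm install
    npm run build
```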
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Early abort documentation builds https://review.openstack.org/67722 | 18:56 |
notmyname | fungi: nope. I don't expect it to be constant. thanks | 18:56 |
*** Ajaeger has joined #openstack-infra | 18:57 | |
openstackgerrit | Jeremy Stanley proposed a change to openstack-infra/nodepool: Delete nodes more aggressively https://review.openstack.org/67723 | 18:58 |
notmyname | fungi: yup. just reset | 19:01 |
*** gokrokve has quit IRC | 19:06 | |
*** gokrokve has joined #openstack-infra | 19:07 | |
lifeless | o/ | 19:08 |
lifeless | clarkb: flag for what? | 19:08 |
*** oubiwann_ has joined #openstack-infra | 19:08 | |
*** thuc has joined #openstack-infra | 19:08 | |
*** thuc has quit IRC | 19:11 | |
*** thuc has joined #openstack-infra | 19:11 | |
*** gokrokve has quit IRC | 19:12 | |
*** thuc has quit IRC | 19:15 | |
clarkb | lifeless: fail fast, which we can do in the runner now that I think of it | 19:17 |
fungi | clarkb: if we do it in the runner, will that still let it run to completion and simply signal zuul earlier that the job is not going to succeed so it can get a head start on the ensuing gate reset? | 19:19 |
*** gokrokve has joined #openstack-infra | 19:19 | |
fungi | or would we be sacrificing remaining test results for the failing job? | 19:20 |
*** thuc has joined #openstack-infra | 19:20 | |
mordred | fungi: I almost think that sacrificing remaining test results at this point might be acceptable sacrifice | 19:22 |
mordred | fungi: in fact, given the state of the gate right now - I'd say that sacrificing remaining test results is almost certainly acceptable sacrifice | 19:23 |
mordred | sdague: ^^ ? | 19:23 |
fungi | i'm inclined to agree, just curious of the implications | 19:23 |
mordred | yeah | 19:23 |
mordred | I mean, I think ultimately it's not what we want | 19:23 |
mordred | ultimately we want all of the tests to run to completion and we want to fail fast | 19:24 |
*** nati_uen_ has joined #openstack-infra | 19:24 | |
mordred | but perfect might be getting in the way of good here, and just having the test runner hard fail and exit on first subunit stream failure might be what we want today | 19:24 |
*** flaper87 is now known as flaper87|afk | 19:24 | |
fungi | also, 67723 is intended to relieve me from scraping nodepool list for nodes in a delete state for more than 20 minutes and manually retrying to delete them to get available resources back | 19:25 |
fungi | left unchecked, they end up accounting for more than 50% of our aggregate quota | 19:26 |
mordred | fungi: lgtm | 19:26 |
*** nati_ueno has quit IRC | 19:27 | |
sdague | mordred: reading scroll back... | 19:30 |
mordred | sdague: most pressing thing is the idea of doing hard-fail with no run continuation on first fail | 19:31 |
sdague | so if we can get it out of zuul faster, that's a clear win. Losing the rest of the tests might not be. | 19:32 |
*** sarob has joined #openstack-infra | 19:33 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox https://review.openstack.org/67721 | 19:34 |
mordred | sdague: that's the question - how much not would it be? | 19:37 |
*** sarob has quit IRC | 19:38 | |
mordred | sdague: in the balance between early ejection and keeping the remaining tests after the fail | 19:38 |
mordred | do we, in our current issues, get a lot of good data from the stuff that happens on tests after the first fail? | 19:38 |
sdague | well if we end up aborting, we might abort too early for the service logs to continue | 19:40 |
sdague | I know, for instance the issue on some of the network tests was the fact the allocation was taking too long | 19:40 |
sdague | it actually shows up later in the logs | 19:41 |
sdague | that would be missed | 19:41 |
*** DinaBelova_ is now known as DinaBelova | 19:42 | |
*** senk has joined #openstack-infra | 19:42 | |
sdague | so basically we can't get fail fast but keep running? | 19:42 |
sdague | zuul also failed to allocate a dsvm node on job #2, given starvation, I wonder when that is going to show up | 19:43 |
lifeless | clarkb: fungi: mordred: I've said before :) - don't make zuul parse subunit | 19:44 |
mordred | sdague: fast fail but keep running is a harder problem. fast fail and stop running is easy and can be done now | 19:44 |
lifeless | it's a central non distributed process | 19:44 |
mordred | lifeless: yeah - I don't mean zuul parse subunit | 19:45 |
mordred | I mean SOMETHING needs to parse subunit, and that something needs to be able to talk back to the gearman status | 19:45 |
lifeless | testr can be taught to raise a signal (e.g. run a script that calls something over gear) on detecting a failure | 19:45 |
lifeless | mordred: so testr is the thing that parses subunit here; I don't see why testr *itself* needs to talk gearman | 19:45 |
lifeless | I mean it could, but it seems overly intimate coupling | 19:45 |
mordred | I agree. and in fact, it's problematic for testr to be the one talking gear | 19:46 |
mordred | because that means that we've violated the trust boundry | 19:46 |
mordred | we need the thing that is in the context to talk to gearman to be able to peer in to what's going on as it happens | 19:46 |
lifeless | mordred: so move testr to a different context | 19:47 |
lifeless | mordred: it's not part of the code under test | 19:47 |
lifeless | mordred: and it can run things remotely, on multiple machines etc | 19:47 |
mordred | possibly - but now we're once again talking about massive changes that will not happen soon | 19:47 |
lifeless | so, the reason I'm pushing back on zuul (or even jenkins) parsing the full subunit stream is because 400 test machines will overwhelm 5 jenkins or 1 zuul | 19:48 |
*** sdake has joined #openstack-infra | 19:48 | |
lifeless | many MB of data to look for one bit | 19:48 |
mordred | sure | 19:48 |
mordred | but the reason that I'm pushing back on testr being involved with distributing work across workers | 19:48 |
lifeless | we could use subunit as the encoding but have testr only send failure signals | 19:48 |
mordred | is that there are too few devs hacking on testr and our ability to fix ui issues in it this past year has been very poor | 19:49 |
mordred | so placing it in a position of more operational complexity at this point would be a bad idea | 19:49 |
fungi | sdague: i don't think it actually failed to allocate a node. we get that behavior when jenkins fails to start a job i think or screws it up in certain ways which zuul recognizes as a need to restart that job | 19:49 |
lifeless | mordred: I've merged every patch I've been given :(, but I ack your point | 19:49 |
sdague | yeh, testr being locked in bzr means testr changes to fix this are basically a non-starter | 19:49 |
lifeless | sdague: so its not locked in bzr | 19:49 |
lifeless | sdague: it just needs tuits to move it | 19:49 |
clarkb | fungi: yup should reschedule on another node | 19:49 |
sdague | tuits? | 19:50 |
fungi | sdague: the round variety | 19:50 |
lifeless | sdague: http://2.bp.blogspot.com/-op8uJYMwdfI/TYpTkmur5hI/AAAAAAAAAd0/ty8WqHjiS58/s1600/A_Round_Tuit_Picture.jpg | 19:50 |
mordred | sdague: "round-to-it == tuit" | 19:50 |
fungi | well, "a round tuit" (you generally only need one) | 19:51 |
sdague | right. well, we're basically at the point of having to start a separate project to work around testr ui for tempest. | 19:51 |
*** DinaBelova is now known as DinaBelova_ | 19:51 | |
mordred | lifeless: in any case - the things you and I are both talking about here are implementation details of the fact that the system overall needs to be designed to handle detect-fail-and-keep-running | 19:51 |
sdague | so it would be good if the tuits got prioritized | 19:51 |
lifeless | sdague: I'd like to talk in detail before such a project is started; it might be the right answer, but I suspect not. | 19:52 |
sdague | it was really clear at the neutron code sprint that we are inflicting massive pain on our developers by making them manually dig through the test layers to do the things they need to do | 19:53 |
lifeless | sdague: is that this discussion, or a different one ? | 19:53 |
sdague | it was a separate one, which sort of joined on the first one | 19:53 |
sdague | lifeless: I 100% agree it's a less than ideal solution. But testr is locked in bzr in a corner. So I'm going to be pragmatic about it; we just have to stop copying and pasting workarounds between projects. | 19:54 |
lifeless | sdague: you said that already; I got that much. | 19:55 |
lifeless | sdague: I am happy to prioritise the tuits if they are the actual blocker, but until I understand the problem I won't know whether, even after a move to git, a separate project might be the right thing | 19:56 |
lifeless | sdague: so when I say I want to talk about it, I mean I want to talk about it :) | 19:56 |
lifeless | sdague: but since its a different discussion, lets not distract the fix-the-gate thread | 19:56 |
sdague | sure, fair | 19:56 |
*** david-lyle_ has joined #openstack-infra | 19:57 | |
lifeless | I will have to go out in ~ 10m to run some minor errands, including picking up my pushbike after repair, should be < 1 hr | 19:57 |
mordred | I do not think that we're going to magic in the real solution to fast-fail-continue-running this week, next week, or the week after | 19:57 |
*** yolanda has joined #openstack-infra | 19:57 | |
*** vkozhukalov has quit IRC | 19:58 | |
mordred | like, we've deep dived into what needs to happen for it already, and it's not quick, or it would be done already | 19:58 |
mordred | especially with jeblair largely out for more time the next two weeks | 19:58 |
lifeless | fail fast and stop is fairly straight forward | 19:58 |
mordred | yes | 19:58 |
lifeless | what proposed impl is on the table ? | 19:59 |
sdague | yeh, I'm not sure fail fast is a win at this point if we have to stop | 19:59 |
mordred | fail fast and stop is straight forward and actionable | 19:59 |
mordred | but it may not be a win | 19:59 |
mordred | sdague: would it be worth trying? or are you pretty sure about that | 19:59 |
* mordred is fine either way - just wants to be presenting options when we have them | 19:59 | |
sdague | mordred: given what I know of the delayed network allocations, I think it would have completely masked that issue and made it undebuggable | 20:00 |
sdague | and I'm going to assume there are other such issues | 20:00 |
*** sarob has joined #openstack-infra | 20:00 | |
mordred | sdague: so that I understand - you're saying the appropriate logging happened AFTER the test timed out? | 20:00 |
bknudson | seems like we need the bugs fixed more than we need workarounds | 20:00 |
sdague | mordred: correct | 20:00 |
mordred | sdague: ok. I grok | 20:01 |
mordred | we could put a delay on returning the error to jenkins perhaps | 20:01 |
lifeless | whats the proposed implementation ? | 20:01 |
sdague | mordred: because 2 minutes later we'd get a message from a network service that it allocated the network | 20:01 |
lifeless | it will help a little if I know that ;) | 20:01 |
mordred | wow. gotcha | 20:01 |
sdague | that was one of the things that russellb saw, that had us bring down the concurrency | 20:02 |
mordred | lifeless: proposed impl was just to have our testr runner scan for failures and exit 1 if it sees one | 20:02 |
lifeless | righto | 20:02 |
lifeless | so a small patch to testr - ack | 20:02 |
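The proposal above amounts to a small wrapper around the test runner: read the subunit stream it emits and exit non-zero if a failure shows up, without zuul or jenkins ever parsing the full stream. A minimal sketch, assuming a subunit v1 stream on stdin (e.g. piped from `testr run --subunit`) and the python-subunit library:

    #!/usr/bin/env python
    # Minimal fail-fast sketch: replay a subunit stream against a plain
    # TestResult and exit 1 if the run contained any failure or error.
    import sys
    import unittest

    import subunit  # python-subunit


    def main():
        stream = getattr(sys.stdin, 'buffer', sys.stdin)
        result = unittest.TestResult()
        # ProtocolTestCase replays the serialized test run against our result.
        subunit.ProtocolTestCase(stream).run(result)
        if result.failures or result.errors:
            # A real runner would also signal "stop the job" here (this is
            # where the gearman discussion above comes in) rather than just
            # returning a non-zero exit code.
            return 1
        return 0


    if __name__ == '__main__':
        sys.exit(main())
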
mordred | sdague: maybe we only fail-fast in the gate, and we leave it on run-all-the-way in check? | 20:02 |
lifeless | ^ this | 20:02 |
lifeless | I don't think the gate should be about diagnostics | 20:02 |
lifeless | except for things that *only* run in the gate | 20:03 |
sdague | mordred: so if we don't keep enough of those logs, we lose our fingerprinting | 20:03 |
sdague | or potentially lose our fingerprinting | 20:03 |
mordred | that's true - but if we're ramping up auto-recheck, we should still have tons of check-level fails | 20:03 |
sdague | sure, but then we don't know what's *actually* failing us in the gate | 20:04 |
mordred | just ones that aren't also wrecking-balls | 20:04 |
mordred | nod | 20:04 |
sdague | the check queue is super noisy | 20:04 |
lifeless | because bad code + flake | 20:04 |
sdague | because there is tons of bad code in it | 20:04 |
* mordred has to run - back in a bit... | 20:04 | |
lifeless | how do you determine signal? | 20:04 |
sdague | lifeless: basically job fail rate check vs. gate | 20:05 |
*** sarob has quit IRC | 20:05 | |
lifeless | on the fingerprint of the failure, right? | 20:05 |
sdague | this is overall fails. We don't have things broken down yet per job the way I'd like. That's blocked by ES having lost data, and just time. | 20:06 |
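For concreteness, the "signal" being described is just the per-queue failure rate over a window of build records; a toy sketch with made-up data (in practice the records come out of logstash/elasticsearch):

    # Toy sketch of "job fail rate, check vs. gate"; the build tuples here
    # are illustrative only.
    from collections import Counter

    builds = [
        ('check', 'FAILURE'), ('check', 'SUCCESS'), ('check', 'FAILURE'),
        ('gate', 'SUCCESS'), ('gate', 'FAILURE'), ('gate', 'SUCCESS'),
    ]

    totals, fails = Counter(), Counter()
    for queue, status in builds:
        totals[queue] += 1
        if status != 'SUCCESS':
            fails[queue] += 1

    for queue in sorted(totals):
        print('%s: %d/%d failed (%.0f%%)' % (
            queue, fails[queue], totals[queue],
            100.0 * fails[queue] / totals[queue]))
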
lifeless | tuits again :P | 20:07 |
lifeless | so | 20:07 |
lifeless | the question is, if we didn't have full gate logs | 20:07 |
lifeless | it seems like that will limit the ER evolution | 20:07 |
sdague | yes | 20:07 |
lifeless | because gate signal will be extremely hard to correlate to checks | 20:07 |
sdague | correct | 20:07 |
lifeless | unless we have some canaries - no-op changes that we use to probe for flakiness | 20:08 |
*** luqas has quit IRC | 20:08 | |
lifeless | that run on the gate machines, in the gate trust (thinking ahead to baremetal stuff), but with no changes merged | 20:08 |
lifeless | and more often than 1/day | 20:08 |
sdague | yep, we need probably 100 / day to have a solid baseline | 20:09 |
lifeless | I'm thinking while true: do | 20:09 |
lifeless | as a starting point | 20:09 |
lifeless | which would get 24 odd | 20:09 |
lifeless | ok, bb in an hour | 20:09 |
sdague | yep | 20:09 |
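The canary idea above is essentially a timed loop: re-run a known no-op change often enough to establish a baseline failure rate independent of incoming patches. A rough sketch, with the trigger mechanism left as a placeholder since how the no-op change actually gets re-run is not decided in this discussion:

    # Rough canary-loop sketch: re-run a no-op change N times per day to
    # measure baseline flakiness. trigger_canary() is a placeholder; the
    # real mechanism (e.g. a recheck comment via gerrit) is an assumption.
    import time

    RUNS_PER_DAY = 100                    # the suggested baseline above
    INTERVAL = 86400 // RUNS_PER_DAY      # ~864 seconds between runs


    def trigger_canary():
        print('would trigger a canary run here')


    while True:
        trigger_canary()
        time.sleep(INTERVAL)
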
fungi | that plays back into the idle pool jobs idea | 20:10 |
fungi | but we'd need a dedicated idle pool if we compromise gate results in that manner | 20:10 |
sdague | yes | 20:11 |
fungi | rather than just available nodes to round out the count when the gate is less busy | 20:11 |
sdague | correct | 20:12 |
*** sarob has joined #openstack-infra | 20:13 | |
*** salv-orlando has quit IRC | 20:15 | |
*** luqas has joined #openstack-infra | 20:15 | |
sdague | hmmm... I really wish we had the data over in ES. I think basically neutron jobs are 100% failing right now in gate, but it's hard to see | 20:16 |
*** sarob has quit IRC | 20:18 | |
fungi | sdague: well, jobs run via jenkins01 or jenkins02 should be getting their console logs into elasticsearch currently | 20:18 |
fungi | it's just 03 and 04 which aren't | 20:18 |
fungi | (well, and jenkins.o.o but it's special-purpose jobs only anyway) | 20:19 |
*** afazekas_ has quit IRC | 20:20 | |
sdague | http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbi1pc29sYXRlZCBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2Mjk5MTgyM30= | 20:23 |
sdague | so of the logs we have, the isolated job in neutron has a 66% failure rate in the gate for the last 48hrs | 20:23 |
sdague | based on that, I think we should remove all neutron from the gate | 20:24 |
*** luqas has quit IRC | 20:25 | |
sdague | the regular neutron job is at 45% failure - http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbiBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2MzA5Mzk5NiwibW9kZSI6InRlcm1zIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX3N0YXR1cyJ9 | 20:25 |
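Those two dashboard links encode a query-string search plus a breakdown on build_status; roughly the same check can be done programmatically. A hedged sketch using the elasticsearch python client, where the endpoint, index selection and field names are assumptions based on what the logstash UI exposes:

    # Hedged sketch: count gate results for one job via elasticsearch.
    # Endpoint, index and field names are assumptions for illustration.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://logstash.openstack.org:9200'])
    query = ('filename:"console.html" AND build_queue:"gate" AND '
             'build_name:"gate-tempest-dsvm-neutron-isolated" AND '
             'message:"Finished: "')
    result = es.search(index='_all', body={
        'size': 0,
        'query': {'query_string': {'query': query}},
        'aggs': {'status': {'terms': {'field': 'build_status'}}},
    })
    for bucket in result['aggregations']['status']['buckets']:
        print(bucket['key'], bucket['doc_count'])
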
*** thuc has quit IRC | 20:26 | |
*** thuc has joined #openstack-infra | 20:27 | |
*** yolanda has quit IRC | 20:29 | |
mattoliverau | Good morning infra peeps! | 20:30 |
*** thuc has quit IRC | 20:32 | |
fungi | sdague: what are the chances for the improvements from the sprint? i gathered a couple of them needed updated patchsets to pass tests, but that combined they should provide drastic improvement. any idea if they're ready to go yet? | 20:36 |
fungi | morning mattoliverau | 20:36 |
mikal | Morning | 20:36 |
sdague | fungi: honestly, I don't know. | 20:37 |
clarkb | fungi: they didn't pass testing | 20:39 |
clarkb | and were rechecked with "recheck no bug", so no debug context | 20:39 |
fungi | clarkb: right, just didn't know if their respective owners had been working on them since | 20:39 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 20:39 | |
sdague | fungi: I don't know | 20:39 |
sdague | I think there is a base failure rate that's crept back up in the isolated jobs now | 20:39 |
sdague | part of the challenge for the week was it was taking 4 - 5 hrs to get check results back | 20:40 |
sdague | so the timing of the zuul overload was unfortunate | 20:40 |
*** pcrews has quit IRC | 20:40 | |
fungi | or the timing of the sprint coincided with the timing of other sprints and general pre-milestone patch rush | 20:40 |
sdague | yeh, agreed | 20:41 |
sdague | i2 runup, definitely an issue | 20:41 |
fungi | i know of at least one other major openstack project which had a sprint the same week | 20:41 |
*** pcrews has joined #openstack-infra | 20:41 | |
fungi | but there may have been more | 20:41 |
sdague | but I think we're basically at a point now where there no longer is a "good" time during the cycle to get together because activity is always so high | 20:41 |
lifeless | back | 20:45 |
*** elasticio has quit IRC | 20:48 | |
*** mrodden has joined #openstack-infra | 20:50 | |
*** Ajaeger has quit IRC | 20:52 | |
*** dcramer_ has quit IRC | 20:52 | |
*** rakhmerov has joined #openstack-infra | 20:53 | |
*** mrodden1 has joined #openstack-infra | 20:53 | |
sdague | fungi: so we are about to get a merge | 20:55 |
*** mrodden has quit IRC | 20:55 | |
sdague | we can see if the change after the neutron fail is made to start over | 20:55 |
sdague | nope, seemed to do the right thing | 20:56 |
fungi | yeah, i've watched several such incidents since you mentioned it, and haven't seen it happen yet | 20:57 |
fungi | so must be an odd combo of circumstances triggering it | 20:58 |
sdague | I think in those cases where I saw it the jobs were still running on the failed job | 21:02 |
sdague | I'm now cron grabbing zuul every 60 seconds so I can reconstruct some of these | 21:02 |
fungi | good idea | 21:02 |
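Grabbing the zuul status periodically is a simple polling loop against the JSON feed the status page itself uses (the URL here is an assumption for the sketch); keeping timestamped snapshots lets a reset be reconstructed afterwards:

    # Poll the zuul status feed every 60 seconds and keep timestamped copies.
    # The status.json URL is assumed for this sketch.
    import time

    try:
        from urllib.request import urlopen   # python 3
    except ImportError:
        from urllib2 import urlopen          # python 2

    URL = 'http://zuul.openstack.org/status.json'

    while True:
        stamp = time.strftime('%Y%m%dT%H%M%S')
        data = urlopen(URL).read()
        with open('zuul-status-%s.json' % stamp, 'wb') as f:
            f.write(data)
        time.sleep(60)
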
clarkb | was it just continuing to run tests detached from the rest of the queue? | 21:03 |
clarkb | it will do that | 21:03 |
sdague | clarkb: it looked like it reset the job below it | 21:05 |
*** flaper87|afk is now known as flaper87 | 21:05 | |
*** mrda has joined #openstack-infra | 21:07 | |
fungi | sdague: it does that when the change fails, because the change behind it needs to be retested against the branch tip as the new head of the gate, rather than on top of the failing change | 21:08 |
fungi | but it sounded like you were describing something else, like a second gate reset | 21:09 |
openstackgerrit | Eli Klein proposed a change to openstack-infra/jenkins-job-builder: Added rbenv-env wrapper https://review.openstack.org/65352 | 21:09 |
sdague | yeh, that's what it looked like | 21:11 |
lifeless | mordred: on fail-early-keep-running; what if we added a second untrusted geard that can *only* signal failure | 21:13 |
*** sarob has joined #openstack-infra | 21:13 | |
*** sarob has quit IRC | 21:17 | |
sdague | so I'm basically manually pulling all of neutron out right now | 21:21 |
sdague | any neutron or python neutron client job has something like a 5% chance of passing at the moment | 21:22 |
sdague | and there were 5 of them in a row in there | 21:22 |
mikal | What's the state of stable at the moment? | 21:24 |
mordred | lifeless: right. so, honestly I can't remember the state of the design for that - but jeblair has a plan for implementing the complex version of this | 21:24 |
mikal | Rechecks are ok but the gate is still busted? | 21:24 |
mordred | lifeless: but with all things being perfect, I would expect it to take us at least a month to get all of the various pieces landed | 21:24 |
mordred | lifeless: the issue there isn't figuring it out - it's just working through the steps to do it | 21:25 |
lifeless | ok | 21:25 |
mordred | lifeless: today's question is more "are there any less-ideal shortcuts we can take to help the current gate-slam" | 21:25 |
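The second-geard idea above needs nothing more from the untrusted side than the ability to submit one narrowly defined job. A hedged sketch of what the signalling end could look like, assuming the python gear client library; the server address and job name are made up for illustration, not an agreed protocol:

    # Hedged sketch: the only thing the untrusted context can do is submit
    # a "failure" signal to a dedicated geard. Names are placeholders.
    import gear

    client = gear.Client()
    client.addServer('failure-geard.example.org')
    client.waitForServer()

    job = gear.Job(b'signal:test-failure', b'tempest.scenario.test_foo FAILED')
    client.submitJob(job)
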
*** salv-orlando has joined #openstack-infra | 21:25 | |
lifeless | sure | 21:25 |
mordred | lifeless: btw - not related to this, but just because it's the other thing I'm hacking on right now ... apparently it's possible to comingle js tooling in a python venv | 21:26 |
fungi | mikal: all of stable is still broken because of some exercises not passing on grizzly (which in turn means grenade can't test upgrading to havana changes) | 21:31 |
fungi | mikal: sdague mass-rechecked all outstanding stable changes so that they would get an obvious -1 from jenkins, to prevent stable cores from approving any more of them | 21:32 |
sdague | yep | 21:32 |
sdague | fungi: it's actually because cinder can't run because of stevedore version checking explosion | 21:33 |
fungi | right, but that's where it manifests in the jobs | 21:34 |
sdague | yeh | 21:34 |
sdague | well it also manifests in *all* grizzly being broken | 21:34 |
sdague | but stable maint didn't seem to care on that one :) | 21:34 |
fungi | i thought it was only devstack/tempest failures for grizzly, but regardless yeah | 21:35 |
sdague | sure, but that means you couldn't land any changes | 21:35 |
fungi | if it was also failing grizzly changes on non-integration jobs i missed that | 21:35 |
*** dizquierdo has joined #openstack-infra | 21:38 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Use nodeenv via tox to do javascript testing https://review.openstack.org/67729 | 21:39 |
mordred | there we go. generic tox-based js-unittest job template that has docs-draft-like functionality | 21:39 |
notmyname | "BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" seems familiar to me, but I can't seem to find anything in LP | 21:43 |
notmyname | ring a bell with anyone else or is it something new? | 21:44 |
bknudson | https://bugs.launchpad.net/nova/+bug/1266740 | 21:44 |
notmyname | hmm...this is in test_volume_boot_pattern | 21:45 |
notmyname | same bug or should it be filed as something new in LP? | 21:45 |
notmyname | logs at http://logs.openstack.org/16/66916/1/gate/gate-tempest-dsvm-full/b7f51bb/console.html | 21:45 |
*** gokrokve has quit IRC | 21:46 | |
*** senk has quit IRC | 21:47 | |
bknudson | notmyname: I opened this bug https://bugs.launchpad.net/nova/+bug/1270608 | 21:47 |
bknudson | which has the same log message from n-cpu. | 21:47 |
notmyname | bknudson: thanks. I'll use that one | 21:47 |
bknudson | notmyname: I added a e-r check for it https://review.openstack.org/#/c/67713/ | 21:48 |
portante | notmyname: I filed https://bugs.launchpad.net/cinder/+bug/1270350 so that I could find it searching LP | 21:55 |
portante | bknudson: shall I close that one in favor of 1270608? | 21:56 |
mikal | sdague: I think you missed at least one, because my script just rechecked it | 21:56 |
mikal | sdague: so, same outcome... | 21:56 |
mikal | Oh, no I see. | 21:57 |
mikal | sdague rechecked it, jenkins passed | 21:57 |
mikal | This is yesterday | 21:57 |
mikal | https://review.openstack.org/#/c/62206 | 21:57 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Genericize javascript release artifact creation https://review.openstack.org/67731 | 21:57 |
mikal | Ahhh, because grenade is non-voting for neutron | 21:58 |
sdague | yeh | 21:58 |
sdague | neutron, ceilometer, and oslo still pass on havana | 21:58 |
sdague | because they don't attempt to upgrade | 21:58 |
mikal | That's good. Users never upgrade those things. | 21:58 |
sdague | nope | 21:59 |
sdague | bknudson: before putting this through - https://review.openstack.org/#/c/67713/ question in line | 21:59 |
notmyname | sdague: on the gate timings, how feasible is adding the "time in gate" to the log message (both success and failure)? without a statsd timing metric, it would at least give the ability to track how long a piece of code stays in the gate | 21:59 |
sdague | notmyname: to what log message? | 22:00 |
notmyname | sdague: the jenkins message in gerrit | 22:00 |
sdague | don't know | 22:01 |
sdague | you could dive into the zuul code to see | 22:01 |
*** gokrokve has joined #openstack-infra | 22:01 | |
notmyname | sdague: ie I'm now looking at another many hours to get https://review.openstack.org/#/c/66916/ to the top of the gate (which has a 61% chance of passing). and the zuul status page has now been reset to 1 minute | 22:01 |
notmyname | sdague: got a starting point to look at? | 22:01 |
sdague | nope, I don't know zuul code very well, I just dove in to do the enqueue time stuff | 22:02 |
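If zuul already records when an item was enqueued (the enqueue-time work mentioned above), reporting time-in-queue is just a subtraction at report time. A toy sketch, where the field name and millisecond units are assumptions for illustration:

    # Toy sketch: derive "time in queue" for a zuul item at report time.
    # The enqueue_time field and its millisecond units are assumptions.
    import time


    def hours_in_queue(item):
        enqueued = item['enqueue_time'] / 1000.0   # assumed epoch milliseconds
        return (time.time() - enqueued) / 3600.0


    fake_item = {'enqueue_time': (time.time() - 5 * 3600) * 1000}  # enqueued 5h ago
    print('%.1f hours in queue' % hours_in_queue(fake_item))
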
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1270309 https://review.openstack.org/67594 | 22:02 |
*** gokrokve_ has joined #openstack-infra | 22:02 | |
*** gokrokve has quit IRC | 22:06 | |
*** gokrokv__ has joined #openstack-infra | 22:06 | |
*** gokrokve_ has quit IRC | 22:06 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add noirc option to bot https://review.openstack.org/67525 | 22:07 |
sdague | so the stable/grizzly fix made it to top of queue now, with any luck | 22:09 |
bknudson | portante: is it the same problem? see the e-r query | 22:10 |
bknudson | portante: if the query for the e-r works for https://bugs.launchpad.net/cinder/+bug/1270350 then close https://bugs.launchpad.net/nova/+bug/1270608 as a dup | 22:11 |
bknudson | and I'll update the e-r change to use https://bugs.launchpad.net/cinder/+bug/1270350 | 22:11 |
bknudson | portante: I just want there to be an e-r query for it. | 22:11 |
portante | bknudson: agreed, looking | 22:13 |
*** sarob has joined #openstack-infra | 22:13 | |
*** sarob has quit IRC | 22:18 | |
*** sarob has joined #openstack-infra | 22:19 | |
*** salv-orlando has quit IRC | 22:20 | |
*** sarob has quit IRC | 22:25 | |
bknudson | portante: logstash query with 'message:"BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" AND filename:"console.html"' | 22:26 |
bknudson | gets more hits than 'filename:"logs/screen-n-cpu.txt" AND message:"Error: iSCSI device not found at /dev/disk/by-path/"' | 22:26 |
bknudson | but they're all failures either way. | 22:27 |
portante | yes, and I made 1270350 a dupe of 1270608 since it is more specific | 22:27 |
sdague | bknudson: the n-cpu.txt message is better, as that is specific of an underlying error, not just the symptom it causes | 22:29 |
sdague | mordred: so thinking about this more, while we are still at starvation, fast fail doesn't really help all that much, we're still going to be waiting around for nodes to tear down and rebuild | 22:31 |
*** dizquierdo has quit IRC | 22:31 | |
sdague | that's another bit of why we are hurting right now. We can't restart the changes behind the fail point very quickly | 22:32 |
*** salv-orlando has joined #openstack-infra | 22:39 | |
*** 45PAA4WSM has joined #openstack-infra | 22:41 | |
*** 45PAA4WSM is now known as jhesketh | 22:41 | |
*** dcramer_ has joined #openstack-infra | 22:43 | |
*** yassine has joined #openstack-infra | 22:43 | |
*** yassine has quit IRC | 22:43 | |
fungi | i've got a few things in play to help with some of the starvation... manually removed nodepool tracking for several nodes which are hung deleting at the provider or for a nonexistent provider, cleaning up some stray "alien" vms which nodepool has forgotten it created through unclean daemon restarts, and going to give 67723 a whirl to see if we reclaim some deleted nodes faster | 22:52 |
*** gokrokv__ has quit IRC | 22:52 | |
fungi | we'll still be resource-starved, but at least the available resources should be increased a bit | 22:52 |
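The "alien" cleanup described above is conceptually a set difference: anything the provider reports that nodepool has no record of was leaked by an unclean restart and is a candidate for deletion. A toy sketch with placeholder data:

    # Toy sketch of alien detection: provider-side instances minus the ones
    # nodepool knows about. The UUIDs here are placeholders.
    provider_servers = {'uuid-aaa', 'uuid-bbb', 'uuid-ccc'}  # e.g. from nova list
    nodepool_nodes = {'uuid-aaa', 'uuid-ccc'}                # e.g. from nodepool's db

    for server in sorted(provider_servers - nodepool_nodes):
        print('alien instance, candidate for cleanup: %s' % server)
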
*** rakhmerov has quit IRC | 23:03 | |
*** praneshp has joined #openstack-infra | 23:05 | |
*** gokrokve has joined #openstack-infra | 23:08 | |
*** jamielennox|away is now known as jamielennox | 23:12 | |
sdague | yeh, I've had to walk away from beating my head on the gate for a while. I'm off trying to clean up nova request logs now | 23:14 |
*** sarob has joined #openstack-infra | 23:20 | |
*** rakhmerov has joined #openstack-infra | 23:36 | |
*** flaper87 is now known as flaper87|afk | 23:47 | |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Don't load system host keys. https://review.openstack.org/67738 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Ignore vim editor backup and swap files. https://review.openstack.org/67651 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Add some debugging around image checking. https://review.openstack.org/67650 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Only attempt to copy files when bootstrapping. https://review.openstack.org/67678 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Document that fake.yaml isn't usable. https://review.openstack.org/67679 | 23:54 |
*** sarob has quit IRC | 23:55 |