*** melwitt has joined #openstack-infra | 00:03 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 00:04 |
*** zz_ewindisch is now known as ewindisch | 00:08 | |
*** vipul-away is now known as vipul | 00:08 | |
*** CaptTofu has joined #openstack-infra | 00:09 | |
*** ewindisch is now known as zz_ewindisch | 00:10 | |
*** sarob has joined #openstack-infra | 00:11 | |
*** MarkAtwood has quit IRC | 00:13 | |
pabelanger | A few weeks / month ago somebody was suggesting a graphic rendering lib for rst docs... it wasn't graphviz but something else | 00:14 |
pabelanger | there was some talk about maybe using it for -infra documentation | 00:14 |
*** rnirmal has quit IRC | 00:14 | |
clarkb | pabelanger: I think it was hashar, but I forget what the lib was called | 00:15 |
pabelanger | clarkb: Ya, I thought it was hashar too | 00:15 |
zaro | clarkb: hey, i just got back. i'm just finishing up the gerrit testing, was gonna put it aside to start hacking on the scp-plugin tomorrow. | 00:16 |
clarkb | zaro: great, thanks | 00:16 |
pabelanger | http://blockdiag.com/ | 00:19 |
pabelanger | eavesdrop.o.o to the rescue | 00:19 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 00:21 |
*** wenlock has quit IRC | 00:21 | |
mattoliverau | I haven't read the email, and this is just me thinking out loud, but in regards to rate limiting how about doing something similar to TCP windowing: pick a low point that the queue will never be smaller than, say 20. Then every time a patch is merged increase the queue by X, say 1. Every time there needs to be a reset, be brutal, like halve the queue size, requeuing the stuff taken off in a high-priority | 00:22 |
mattoliverau | queue. This would mean when there are lots of resets the queue will be smaller and smaller, so less ref re-pointing and hopefully push through all the congestion. When working again the queue will continue to build up. | 00:22 |
mattoliverau | Again, i'm new, so just my 2 cents. | 00:22 |
clarkb | mattoliverau: yup that was my thinking | 00:22 |
clarkb | tcp slow start | 00:22 |
clarkb | it has its faults, you almost never hit peak efficiency, but it does work at protecting you | 00:22 |
mattoliverau | clarkb: Lol, missed your comment with the name slow start :) | 00:23 |
mattoliverau | clarkb: of course, but it's somewhere in between: better than a fixed queue length, but without the problem of a huge queue when zuul is needed most. | 00:23 |
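A minimal sketch of the windowing scheme being discussed here, assuming hypothetical parameters (a floor of 20, +1 per merged change, halve on reset); this illustrates the idea, it is not zuul code.

```python
# Sketch of the proposed AIMD-style window for the gate queue.
# Hypothetical parameters: never shrink below 20, grow by 1 per merged
# change, halve on a gate reset. Not zuul code.
class GateWindow:
    def __init__(self, floor=20, increase=1):
        self.floor = floor
        self.increase = increase
        self.size = floor

    def on_merge(self):
        # Additive increase: each successful merge lets one more change in.
        self.size += self.increase

    def on_reset(self):
        # Multiplicative decrease: halve on failure, but never drop below
        # the floor. Changes pushed out would go to a high-priority requeue
        # (not modelled here).
        self.size = max(self.floor, self.size // 2)

window = GateWindow()
for event in ["merge"] * 10 + ["reset"] + ["merge"] * 3:
    window.on_merge() if event == "merge" else window.on_reset()
    print(event, "->", window.size)
```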
*** vipul is now known as vipul-away | 00:24 | |
*** mrodden1 has quit IRC | 00:31 | |
*** vipul-away is now known as vipul | 00:34 | |
*** fifieldt has joined #openstack-infra | 00:36 | |
*** ok_delta has quit IRC | 00:37 | |
*** odyssey4me has quit IRC | 00:37 | |
*** sarob has quit IRC | 00:40 | |
*** wenlock has joined #openstack-infra | 00:40 | |
*** nati_uen_ has joined #openstack-infra | 00:40 | |
*** smurugesan has quit IRC | 00:40 | |
*** gokrokve has joined #openstack-infra | 00:41 | |
*** yamahata has quit IRC | 00:42 | |
*** michchap_ has quit IRC | 00:43 | |
*** michchap has joined #openstack-infra | 00:43 | |
*** nati_ueno has quit IRC | 00:43 | |
*** odyssey4me has joined #openstack-infra | 00:46 | |
clarkb | this is interesting: the "run handler sleeping" / "run handler awake" log messages haven't happened for 15 minutes, so that is what is starving us | 00:46 |
clarkb | something is spending a lot of time in the middle of that loop | 00:46 |
clarkb | sdague: ^ | 00:46 |
openstackgerrit | A change was merged to openstack-dev/hacking: Move hacking guide to root directory https://review.openstack.org/62132 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Cleanup HACKING.rst https://review.openstack.org/62133 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Re-Add section on assertRaises(Exception https://review.openstack.org/62134 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Turn Python3 section into a list https://review.openstack.org/62135 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Add Python3 deprecated assert* to HACKING.rst https://review.openstack.org/62136 | 00:47 |
*** mrodden has joined #openstack-infra | 00:48 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Moved homepage content to about page. https://review.openstack.org/67344 | 00:50 |
clarkb | sdague: I am digging through logs now to see if I can determine where it is starving itself | 00:50 |
*** hogepodge has quit IRC | 00:50 | |
*** harlowja is now known as harlowja_away | 00:50 | |
*** CaptTofu has quit IRC | 00:53 | |
*** melwitt has quit IRC | 00:58 | |
*** melwitt1 has joined #openstack-infra | 00:59 | |
clarkb | it looks like it takes that long to submit all of the gearman jobs after a gate reset | 00:59 |
*** melwitt1 has quit IRC | 01:03 | |
*** sarob has joined #openstack-infra | 01:05 | |
*** sarob has quit IRC | 01:05 | |
*** CaptTofu has joined #openstack-infra | 01:05 | |
*** sarob has joined #openstack-infra | 01:06 | |
*** dkranz has joined #openstack-infra | 01:07 | |
*** harlowja_away is now known as harlowja | 01:08 | |
*** melwitt has joined #openstack-infra | 01:08 | |
clarkb | the bulk of the time was spent resetting the gate | 01:08 |
clarkb | 2014-01-17 00:31:52,791 DEBUG zuul.DependentPipelineManager: Starting queue processor: gate | 01:08 |
clarkb | 2014-01-17 00:47:17,732 DEBUG zuul.DependentPipelineManager: Finished queue processor: gate (changed: True) | 01:08 |
*** sarob_ has joined #openstack-infra | 01:08 | |
clarkb | that is ~15 minutes of just dealing with a gate reset, which is bad considering how often the gate resets | 01:09 |
openstackgerrit | Eric Guo proposed a change to openstack/requirements: Have tox install via setup.py develop https://review.openstack.org/66549 | 01:09 |
mordred | clarkb: wow | 01:10 |
*** sarob has quit IRC | 01:10 | |
*** sarob has joined #openstack-infra | 01:12 | |
clarkb | it is taking 9-11 seconds to do git reset, git remote update, git reset --hard $BRANCH, git merge $patchset, then create a ref that zuul can advertise to the testers | 01:13 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Added apache license to footer https://review.openstack.org/67347 | 01:13 |
dkranz | Scrolling back, this might be a bad time to say this but I did a reverify with bug number on https://review.openstack.org/#/c/63934/ which closes the error-in-log-file hole. | 01:14 |
clarkb | 90*9 = ~13 minutes | 01:14 |
clarkb | so that accounts for the bulk of the reset time | 01:14 |
clarkb | knowing that, I think jeblair's farm-of-zuul-workers plan is a really good one | 01:14 |
*** sarob has quit IRC | 01:15 | |
clarkb | if we can distribute that work instead of doing it serially we should be able to get that number much smaller | 01:15 |
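Rough arithmetic behind the figures above; the per-change cost is the value quoted in the conversation, and the worker count is a hypothetical figure for comparison.

```python
# Rough reset-cost arithmetic: ~90 changes, each needing ~9 s of serial
# git work; worker count is a hypothetical figure for comparison.
changes, seconds_per_change = 90, 9
serial = changes * seconds_per_change          # 810 s, roughly 13.5 minutes
workers = 8                                    # hypothetical worker pool size
distributed = serial / workers                 # ~101 s, under 2 minutes
print(f"serial ~{serial / 60:.1f} min, distributed ~{distributed / 60:.1f} min")
```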
*** sarob_ has quit IRC | 01:15 | |
*** sarob has joined #openstack-infra | 01:15 | |
clarkb | now it is also possible that the git repos themselves are degrading and are usually faster | 01:16 |
clarkb | which isn't that far-fetched, as sdague indicated zuul had much better performance previously. Restarting zuul won't fix the problem, but clearing out the git repos or otherwise repairing them might | 01:16 |
*** sarob has quit IRC | 01:18 | |
clarkb | http://paste.openstack.org/show/61413/ I have filtered out everything but the git checkouts there. This shows the amount of time between each git checkout, which is roughly the amount of time it takes to do a checkout, reset, merge, etc. | 01:19 |
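A sketch of the filtering described here: take the timestamp off each checkout line in the zuul debug log and print the gap to the previous one. The sample lines are placeholders in the log format quoted earlier in the conversation ("2014-01-17 00:31:52,791 DEBUG ..."), not real zuul output.

```python
from datetime import datetime

def checkout_gaps(lines):
    # Yield the seconds between consecutive "checkout" lines.
    prev = None
    for line in lines:
        if "checkout" not in line.lower():
            continue
        ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            yield (ts - prev).total_seconds()
        prev = ts

sample = [
    "2014-01-17 00:32:01,100 DEBUG zuul.Repo: Checking out master",
    "2014-01-17 00:32:10,400 DEBUG zuul.Repo: Checking out master",
]
print(list(checkout_gaps(sample)))  # [9.3]
```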
mordred | clarkb: I wonder if git remote update is potentially too heavy of a hammer too. (although the farm of workers is better) | 01:20 |
clarkb | we might also try using a newer version of git on the zuul box | 01:20 |
clarkb | we can use https://launchpad.net/~git-core/+archive/ppa to get newer git on zuul.o.o | 01:21 |
*** zhiwei has joined #openstack-infra | 01:21 | |
*** sarob has joined #openstack-infra | 01:21 | |
clarkb | mordred: it may be | 01:21 |
mordred | clarkb: git fetch remotes/origin/$BRANCH ; git reset --hard FETCH_HEAD might do slightly less work | 01:21 |
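A hedged sketch of mordred's narrower-fetch idea using GitPython's command pass-through; the repo path and branch are placeholders, and whether this actually beats `git remote update` would need measuring.

```python
# Fetch just the branch we care about, then hard-reset to what was fetched,
# rather than updating every remote ref. Path and branch are placeholders.
import git

repo = git.Repo("/var/lib/zuul/git/openstack/nova")
repo.git.fetch("origin", "master")
repo.git.reset("--hard", "FETCH_HEAD")
```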
*** zhiwei has quit IRC | 01:22 | |
clarkb | mordred: there is a big time delta between updating repository and the next step | 01:22 |
* clarkb looks at that code | 01:22 | |
mordred | clarkb: as in, the remote update step is taking a long time? | 01:22 |
clarkb | ya | 01:23 |
*** sarob has quit IRC | 01:24 | |
*** sarob has joined #openstack-infra | 01:24 | |
clarkb | yup, looks like the vast majority of time is in the remote update step | 01:25 |
clarkb | it is happening in GitPython though. need to read up on it to see if we can make that smarter | 01:25 |
*** pcrews_ has quit IRC | 01:27 | |
*** melwitt has quit IRC | 01:27 | |
openstackgerrit | A change was merged to openstack-infra/config: Increase timeouts for jobs doing tempest runs https://review.openstack.org/66379 | 01:28 |
*** sarob has quit IRC | 01:29 | |
mordred | clarkb: will you point me to the part of the code you're looking at? | 01:29 |
clarkb | mordred: I am digging through zuul/merger.py. mergeChanges() is the function that seems to do the work | 01:30 |
clarkb | mordred: the repo update only happens once per project:branch relationship during a reset | 01:32 |
clarkb | so while it is costly when it happens it isn't the biggest cost. the git checkouts seem to be most painful | 01:32 |
mordred | really? | 01:32 |
clarkb | ya checkout happens for each change so * 90 | 01:32 |
*** xchu has joined #openstack-infra | 01:32 | |
clarkb | and takes about as much time as an update | 01:32 |
mordred | is that just because it's modifying a working tree? | 01:33 |
*** sdake has quit IRC | 01:33 | |
clarkb | oh possibly as git has to reflect the changes on disk | 01:34 |
mordred | can I make a REALLY stupid suggestion? | 01:34 |
mordred | what if we ran it under eatmydata? | 01:34 |
*** zhiwei has joined #openstack-infra | 01:35 | |
mattoliverau | Can you wrap the git checkout + other git reset stuff into some python thread so they can be done in parallel? That way it shouldn't be 90*9 | 01:35 |
clarkb | hmm that is an interesting question. my initial thought was "are you crazy?", my second thought is that may just be an incredible idea | 01:35 |
clarkb | mattoliverau: we can, that is what jeblair's make workers do the work idea gets at | 01:36 |
clarkb | mattoliverau: I think we will end up doing that regardless, but we need a short term solution | 01:36 |
*** afazekas has quit IRC | 01:36 | |
clarkb | mordred: eatmydata disables fsync? does that mean no data will ever get synced or it will sync whenever the OS feels like it? | 01:36 |
clarkb | mordred: my biggest concern now is that zuul relies on disk persistence to do graceful restarts | 01:37 |
clarkb | mordred: I am pretty sure that will break if we put zuul under eatmydata | 01:37 |
mattoliverau | so how about tmpfs then? no disk IO then, only ram. | 01:38 |
mordred | clarkb: hrm. good point | 01:38 |
mordred | yeah - tmpfs would be the next question - but I don't think we have the ram to handle all of the repo size | 01:38 |
mordred | I lied | 01:39 |
*** gokrokve has quit IRC | 01:39 | |
mordred | mordred@zuul:~$ sudo du -chs /var/lib/zuul/git/ | 01:39 |
mordred | 2.8G    /var/lib/zuul/git/ | 01:39 |
*** gokrokve has joined #openstack-infra | 01:40 | |
clarkb | tmpfs sounds like a great idea | 01:40 |
mattoliverau | so we may be able to put /var/lib/zuul/git under tmpfs and bypass disk altogether; if it doesn't work then it just means it isn't disk IO causing issues. | 01:40 |
clarkb | mordred: I think if we stop zuul, overlay a mount on /var/lib/zuul/git then start zuul it will just reclone everything | 01:40 |
*** pcrews has joined #openstack-infra | 01:41 | |
clarkb | currently git has about 4GB of cached and buffered memory | 01:41 |
*** thuc has quit IRC | 01:41 | |
clarkb | so a 2.8GB filesystem may eat into that in ways that are unhappy, though I bet a good chunk of that cache is for the git stuff | 01:41 |
*** thuc has joined #openstack-infra | 01:42 | |
mattoliverau | How much RAM is the system using for everything else? Is the server's RAM under-utilised? I guess I could just go check out cacti :) | 01:42 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 01:43 |
*** gokrokve has quit IRC | 01:44 | |
clarkb | zuul is 4GB virt, 1.3GB resident, geard is 1GB virt, 760MB resident | 01:44 |
clarkb | then recheckwatch, puppet, and apache processes hang around with about 100MB apiece | 01:44 |
mordred | we could bump it to 8G if putting the temp merge location in ram would be good, ya know? | 01:46 |
*** thuc has quit IRC | 01:46 | |
clarkb | it is 8GB now | 01:46 |
clarkb | it is an 8vcpu 8GB rackspace performance node | 01:47 |
mordred | meh | 01:47 |
mattoliverau | Yeah, the rest is mainly in cached memory. So the question is, how much will the kernel actually give back to us.. what is the real free figure. | 01:47 |
mattoliverau | clarkb: there was a talk at LCA about this during the sysadmin miniconf. | 01:48 |
clarkb | I missed it :( | 01:49 |
dstufft | mordred: clarkb fungi So pip 1.5.1rc1 and 1.11.1rc2 just dropped, if you're at all able to run them through the paces in the openstack infra to make sure we fixed all your issues that would be really really awesome | 01:49 |
*** yamahata has joined #openstack-infra | 01:50 | |
*** jp_at_hp has quit IRC | 01:51 | |
mattoliverau | if I remember correctly, we can check meminfo: cat /proc/meminfo |grep -i active | 01:51 |
clarkb | 4836276 kB | 01:52 |
mattoliverau | whatever the figure is for inactive should be currently what the kernel can dump (and thus give back at this point in time) | 01:52 |
mordred | dstufft: I'd love to - the gate is so slammed though I don't think we're likely able to run anything with a difference - but I'll see what I can cook up | 01:52 |
clarkb | mattoliverau: inactive is 2269672 kB | 01:52 |
*** wenlock has quit IRC | 01:52 | |
dstufft | mordred: ok I totally understand if you can't fwiw :) Mostly I want to avoid another upgrade apocalypse | 01:52 |
clarkb | oh there are several inactive categories, are they distinct or subsets? | 01:53 |
mattoliverau | clarkb: so if I am correct, we may only get about 2.2 G back. | 01:53 |
*** locke105 has joined #openstack-infra | 01:53 | |
clarkb | looks like they add up so we only need that value above | 01:53 |
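A small illustration of the check being discussed: read the aggregate Inactive figure from /proc/meminfo (Linux only), which is roughly what the kernel could hand back without touching swap.

```python
# Read the aggregate Inactive value (kB) from /proc/meminfo -- roughly
# what the kernel could give back without swapping. Linux only.
def inactive_kb():
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("Inactive:"):   # skips Inactive(anon)/(file)
                return int(line.split()[1])
    return None

print(inactive_kb(), "kB inactive")
```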
mordred | dstufft: well, we're blocking >=1.5 anyway - so I think we can test upgrading to 1.5.1 at our leisure | 01:53 |
* mordred is still excited for his new 1.5 overlord | 01:53 | |
mattoliverau | I might go find the talk in question, the videos are up and it only went for 10 minutes or so. | 01:54 |
clarkb | if we want to keep 8vcpu we can go up to a 30GB perf node | 01:55 |
clarkb | that will give us plenty of room for a ~16GB tmpfs | 01:55 |
mattoliverau | Yeah, that might be a good idea, that would give us room to grow. | 01:56 |
mordred | clarkb: use 30G perf nodes for all the things!!! | 01:56 |
clarkb | I think that is a not so crazy idea, but it is also late on thursday | 01:56 |
mattoliverau | http://is.gd/U9kBon | 01:56 |
clarkb | would be curious to get fungi's input | 01:56 |
mordred | clarkb: we should make the build farm use 30G perf nodes | 01:56 |
mattoliverau | the talk in question ^^ | 01:57 |
mattoliverau | I think | 01:57 |
mordred | can you imagine just how quickly pvo would show up and scold us? | 01:57 |
mattoliverau | mordred: lol | 01:57 |
clarkb | I also think newer git is worth a shot, the version of git we are running is pretty old | 01:57 |
mordred | clarkb: ++ | 01:57 |
mattoliverau | can't hurt, in theory the code has to be more efficient.. unless linus broke something :P | 02:00 |
clarkb | ya I am doing some quick unscientific tests locally | 02:01 |
mattoliverau | Lol the best kind of test ;) | 02:01 |
*** nosnos has joined #openstack-infra | 02:04 | |
*** senk has quit IRC | 02:04 | |
clarkb | git checkouts weren't any better. git clone was about 20 seconds faster for nova | 02:05 |
clarkb | also I just realized this is GitPython so it may be doing some stuff in python | 02:05 |
mattoliverau | that's true, makes it hard to determine the bottleneck. were those times from gitpython or git? | 02:06 |
dstufft | I think GitPython just shells out | 02:06 |
dstufft | but I might be thinking of a different project | 02:06 |
clarkb | dstufft: it does for some stuff and not for others iirc | 02:06 |
*** jhesketh__ has joined #openstack-infra | 02:06 | |
clarkb | also they use tabs in their source so now I don't want to read it | 02:06 |
dstufft | clarkb: I've learned to avoid reading other people's source code unless I really want to be caremad | 02:07 |
dstufft | (it's too late not to read my own :( ) | 02:07 |
dims | just peeked at gate queue, looks like it crept up to 104 | 02:08 |
clarkb | dstufft: it appears to shell out for checkout | 02:09 |
*** gyee is now known as gyee_nothere | 02:09 | |
*** adrian_otto has joined #openstack-infra | 02:10 | |
*** CaptTofu has quit IRC | 02:10 | |
*** gokrokve has joined #openstack-infra | 02:10 | |
dstufft | clarkb: also question, is this cloning stuff to run tests on it? | 02:10 |
adrian_otto | are our Zuul workers clogged up? I have 4 Solum gerrit reviews that have no votes on them from jenkins, dating back over the past ~4 hours. | 02:11 |
dstufft | e.g. is it a read only clone and are you or can you use a shallow clone to make it go faster? | 02:11 |
clarkb | adrian_otto: no zuul is clogged up | 02:11 |
clarkb | dstufft: we can't shallow clone for reasons. this is the repo zuul is using to build the refs that get tested | 02:11 |
clarkb | iirc it needs all the refs in order to build the zuul refs which a shallow clone won't give you | 02:12 |
adrian_otto | clarkb: no zuul is clogged, or no it is not? | 02:12 |
dstufft | clarkb: ok! | 02:12 |
dstufft | I don't know much about zuul so :( | 02:12 |
clarkb | adrian_otto: no, zuul is clogged | 02:12 |
clarkb | the workers themselves are fine | 02:12 |
adrian_otto | clarkb: ok, thanks | 02:12 |
*** slong- is now known as slong-afk | 02:15 | |
*** gothicmindfood has joined #openstack-infra | 02:15 | |
*** pballand has quit IRC | 02:21 | |
*** julim has joined #openstack-infra | 02:22 | |
*** yaguang has joined #openstack-infra | 02:23 | |
clarkb | adrian_otto: long story short is that the longer the gate queue gets the more time zuul spends resetting it (currently a full gate reset takes more than 15 minutes), and while it is doing that reset the zuul scheduler does nothing else. There are plans to make that better (farming the expensive git work out to workers to allow massive scale out, and we have been fiddling with using a tmpfs as the cost of | 02:23 |
clarkb | disk seems to hurt quite a bit) | 02:23 |
*** julim has quit IRC | 02:24 | |
*** portante_ is now known as portante | 02:24 | |
*** gothicmindfood has quit IRC | 02:27 | |
adrian_otto | clarkb: thanks for the detail. Can you help me understand what a gate reset is, and why it happens? | 02:29 |
clarkb | adrian_otto: the gate pipeline is where we test serialized changes in parallel. change A gets approved first and goes onto the head of the queue, then change B gets approved and gets added behind A. Instead of waiting for A to merge before testing B we test B with A assuming A will pass and merge | 02:30 |
clarkb | adrian_otto: when A does not pass and merge we have to retest B without A as the previous scenario is no longer valid | 02:31 |
clarkb | that is a gate reset. | 02:31 |
clarkb | when you have 102 changes in the pipeline something failing at the head of the queue means we have to cancel jobs for 101 changes, then completely rebuild the git refs to test 101 changes (the 102nd is removed as it failed) then restart all of the tests | 02:32 |
fungi | except in the current gate it's a plus b plus c plus... plus z and then repeat the alphabet several more times | 02:32 |
adrian_otto | ok, so that sounds like a definite design weakness in zuul | 02:32 |
clarkb | adrian_otto: its not a design weakness in zuul, it is a problem with speculative merging and testing | 02:33 |
adrian_otto | isn't that the key feature that makes zuul compelling? | 02:33 |
clarkb | yes | 02:33 |
clarkb | adrian_otto: in the best case you merge all 102 changes at one time and your time to test is O(1) | 02:34 |
fungi | adrian_otto: more to the point, consider the integrated projects to basically be one software project with more than a thousand developers approving a hundred changes a day and trying to make sure every change passes the entire integration test suite prior to letting it merge | 02:34 |
clarkb | when you are consistently failing that goes to O(n) | 02:34 |
clarkb | in the previous state you were in O(n) | 02:34 |
clarkb | so this is a win over the old state, but in the worst case is still bad | 02:34 |
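A worked example of the best-case/worst-case point just made, assuming a one-hour job wall time for the 102-change queue mentioned above; the numbers are illustrative only.

```python
# Illustrative best/worst case for a 102-change speculative queue,
# assuming a one-hour job wall time (numbers are made up for scale).
changes = 102
job_hours = 1.0
best_case = job_hours              # O(1): everything tested in parallel and passes
worst_case = changes * job_hours   # O(n): a failure at the head on every pass
print(f"best ~{best_case:.0f} h, worst ~{worst_case:.0f} h")
```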
adrian_otto | indeed | 02:35 |
fungi | the alternative, which a lot of projects settle for, is merge first, then test periodically and see if the published software is obviously broken, then try to bisect and hope you can narrow down which commit to revert | 02:35 |
adrian_otto | so might it make sense to use an admission control strategy? | 02:35 |
adrian_otto | so the queue is limited? | 02:36 |
clarkb | adrian_otto: see scrollback :) | 02:36 |
adrian_otto | that might speed up the reset case, at the cost of some concurrency in the best case | 02:36 |
*** nati_uen_ has quit IRC | 02:37 | |
clarkb | jeblair has historically been opposed to rate limiting the size of a zuul queue. I have argued for the feature in the past. I think something simple like TCP's slow start would help quite a bit | 02:37 |
adrian_otto | thanks for the additional detail! | 02:37 |
*** nati_ueno has joined #openstack-infra | 02:37 | |
clarkb | at LCA jeblair seemed to be more on board with adding something like that to zuul | 02:38 |
adrian_otto | you can still have a backlog that's not part of the active queue | 02:38 |
clarkb | yup | 02:38 |
mattoliverau | It was the sun and the Aussie beer ;P | 02:38 |
adrian_otto | and spoon feed the active queue so it remains a more optimal length | 02:39 |
*** yamahata has quit IRC | 02:39 | |
clarkb | adrian_otto: exactly, just like a tcp connection | 02:39 |
adrian_otto | yep | 02:39 |
clarkb | well tcp rarely if ever hits optimal state, but it is consistently not worst case | 02:39 |
*** dstanek has joined #openstack-infra | 02:46 | |
lifeless | clarkb: mmm | 02:46 |
lifeless | clarkb: you could argue that tcp is nothing but worst case :) | 02:46 |
clarkb | lifeless: maybe when your latency is NZ bad | 02:46 |
clarkb | :) | 02:46 |
StevenK | clarkb: Well, it's a sliding window, and also best effort. | 02:48 |
StevenK | clarkb: However, I agree with you -- I think checking a queue of 90-100 all the time is bong, and we should limit it to a window | 02:48 |
*** jishaom has joined #openstack-infra | 02:49 | |
*** odyssey4me has quit IRC | 02:53 | |
*** carl_baldwin has joined #openstack-infra | 02:54 | |
*** AaronGr is now known as AaronGr_Zzz | 02:57 | |
notmyname | adrian_otto: since you were asking about stuff, I threw together a quick graph for you http://not.mn/solum_gate_status.html | 03:01 |
notmyname | adrian_otto: if that's not the right jobs, let me know (or open a pull request--the repo link is at the bottom) | 03:02 |
*** odyssey4me has joined #openstack-infra | 03:02 | |
*** jhesketh__ has quit IRC | 03:02 | |
sdague | clarkb: so can you promote this now - https://review.openstack.org/#/c/65805/ ? | 03:02 |
*** rakhmerov has quit IRC | 03:02 | |
*** jhesketh_ has quit IRC | 03:03 | |
sdague | if the theory on load is correct, that should level things out a bunch | 03:03 |
*** jhesketh_ has joined #openstack-infra | 03:03 | |
clarkb | sdague: it has been promoted, should see it in a bit | 03:04 |
*** rossella_s has joined #openstack-infra | 03:05 | |
sdague | ok, just looked at the queue and it was still at the bottom | 03:05 |
sdague | but I guess we're just processing the events still? | 03:05 |
*** jhesketh__ has joined #openstack-infra | 03:05 | |
clarkb | ya the promotion takes ~15 minutes according to fungi | 03:05 |
clarkb | sdague see sb for long explanation for zuul slowness | 03:06 |
sdague | yep, just read it | 03:06 |
clarkb | i hunted it down. tldr really long gate is expensive | 03:06 |
sdague | clarkb: right, especially as it starves out the other events | 03:06 |
*** pballand has joined #openstack-infra | 03:06 | |
sdague | the tmpfs approach look promissing? | 03:06 |
*** yaguang has quit IRC | 03:09 | |
clarkb | ya, walking home now, was hoping to chat with fungi about that when I get back | 03:10 |
*** yaguang has joined #openstack-infra | 03:10 | |
sdague | cool | 03:10 |
*** HenryG has joined #openstack-infra | 03:10 | |
*** krotscheck has quit IRC | 03:10 | |
*** ArxCruz has quit IRC | 03:11 | |
*** zhiyan has joined #openstack-infra | 03:12 | |
sdague | so I guess the other question is if we're taking forever to reset with the change that we think will make this better, would it make sense to just dump the gate queue at this point? | 03:13 |
sdague | the d-g just popped to the head | 03:14 |
*** salv-orlando has joined #openstack-infra | 03:15 | |
notmyname | sdague: if stuff is getting promoted, then dumping the gate feels like something to do just to do something | 03:16 |
sdague | notmyname: sure, though given that we can't allocate devstack nodes to jobs until the gate reset finishes, it's still adding 15 minutes additional friction on each hit. Which while small, adds up. | 03:18 |
sdague | clarkb / fungi: looks like a bad py26 node - https://jenkins01.openstack.org/job/gate-nova-python26/17060/console | 03:19 |
notmyname | yes, but I'm working on getting a patch through for the past 12 hours, and I've got another dependency that's been over 50 hours in the gate with over 13 resets. an extra 15 minutes really isn't much | 03:19 |
sdague | it's not one extra 15 minutes, it's 15 * failing tests in gate (and right now there are at least 2 py 2.6 unit test failures that I see) | 03:21 |
fungi | i've taken centos6-1 offline | 03:22 |
fungi | thanks sdague | 03:22 |
sdague | 7 py26 unit tests fails... at least | 03:22 |
fungi | i'm also caught back up on scrollback since dinner now. i am a dismally slow reader | 03:22 |
sdague | yeh, about 40% of the gate jobs right now are in a fail state because of that py26 node | 03:23 |
sdague | zuul hasn't noticed yet because it's still processing the first promote | 03:23 |
fungi | i agree, in light of the performance breakdown, that saving the state of the pipelines and gracefully stopping zuul, mounting a suitably large tmpfs on /var/lib/zuul/git, starting zuul and restoring the changes would likely help performance | 03:24 |
*** rakhmerov has joined #openstack-infra | 03:25 | |
fungi | the +/- buffers/cache amount is a good bit larger than a du of that dir | 03:26 |
fungi | and zuul has a ton of swap for spillover if that ends up being an underestimate | 03:26 |
*** adrian_otto has quit IRC | 03:28 | |
*** dcramer_ has quit IRC | 03:28 | |
fungi | 4g tmpfs should be doable looking at the present state of the server | 03:29 |
mattoliverau | fungi: you need to check the active and inactive memory in meminfo to see how much the kernel will really give back to you; +/- buffers is a bit of a lie. | 03:30 |
mattoliverau | but yeah there is swap.. so long as it swaps out something it doesn't need again :) | 03:30 |
fungi | yeah, but some of what's currently resident is safe to page out | 03:31 |
fungi | more a question of how much | 03:31 |
*** pballand has quit IRC | 03:31 | |
*** nati_uen_ has joined #openstack-infra | 03:31 | |
fungi | active(anon) is under 3g | 03:33 |
mattoliverau | what is inactive | 03:33 |
fungi | there's a fair amount of active(file) but i anticipate that being git | 03:33 |
clarkb | fungi: I am willing to give tmpfs on current zuul a shot, we will need probably at least a 3GB filesystem | 03:33 |
fungi | inactive is about 1.5g | 03:33 |
clarkb | but only about 2 gb was inactive | 03:33 |
mattoliverau | you may get your 2G + whatever is actually free, and then everything else will be swapped. | 03:34 |
*** nati_ueno has quit IRC | 03:35 | |
mattoliverau | there should be an inactive, inactive(anon) and inactive(file). Use the first as it is the total. | 03:35 |
mattoliverau | but i don't have access to the server so I don't actually know what the current value is. | 03:35 |
*** jerryz has quit IRC | 03:35 | |
fungi | right, inactive is roughly 2g | 03:35 |
clarkb | fungi: I mentioned to mattoliverau earlier that we could go to a 30GB perf node to keep our vcpu count that will give us plenty of room for a massive tmpfs | 03:36 |
mattoliverau | So from my understanding, that is as much as the kernel can actually give you. | 03:36 |
fungi | clarkb: yeah, i'm hesitant since the downtime to swap nodes would be a bit greater. how long did it take you the other day? | 03:37 |
clarkb | it wasn't too bad, you basically prestage the node completely then do the swap. making sure firewall rules are correct everywhere was the biggest hurdle | 03:38 |
fungi | i'm rapidly running out of steam for the night but can probably squeeze in another hour or so | 03:38 |
clarkb | we can probably get it done in well under half an hour | 03:38 |
StevenK | mattoliverau: For a tmpfs? tmpfs are swapped-back | 03:38 |
clarkb | fungi: I don't think we should do anything tonight unless you really really want to | 03:38 |
StevenK | swap-backed, even | 03:39 |
mattoliverau | StevenK: tmpfs is just a ramdisk, so yes, it'll be swapped out.. in theory. | 03:39 |
clarkb | fungi: maybe fire off a 30GB node build tonight and plan for swap tomorrow? | 03:39 |
clarkb | fungi: or, put tmpfs in place on existing zuul and see what happens | 03:39 |
fungi | clarkb: i can get a new-new-zuul spinning up now. we'll hang our hopes on the tempest parallelism reduction to make some stability headway in the meantime | 03:41 |
clarkb | ++ I think that is path of most sanity | 03:41 |
*** weshay has quit IRC | 03:42 | |
clarkb | fungi: zuuls A record ttl is already 5 minutes so that is covered | 03:42 |
fungi | awesome | 03:42 |
clarkb | then tomorrow we grab the pipeline state, stop zuul, update dns, make sure firewalls update (which, the more I think of it, may not be a problem since most connections are to zuul so only zuul's firewall matters) and start zuul on the new server | 03:43 |
clarkb | if anything goes uber terrible we put old server back in use | 03:43 |
*** amotoki has joined #openstack-infra | 03:43 | |
mattoliverau | Sounds like a plan! And on that note then I'm going to go to lunch, ttyl. | 03:44 |
clarkb | I am no longer convinced new git will make much of an impact | 03:46 |
fungi | heh... "120 GB Performance" | 03:48 |
* fungi resists temptation | 03:49 | |
fungi | so we want 30 not 15? | 03:49 |
StevenK | fungi: And then put / on a tmpfs? :-P | 03:49 |
fungi | StevenK: bitcoins aplenty | 03:49 |
clarkb | fungi: 15 has 4vcpu | 03:50 |
fungi | oh weird | 03:50 |
clarkb | fungi: the current 8gb have 8vcpu | 03:50 |
clarkb | I think we should go 30 just to keep the vcpu value constant | 03:50 |
fungi | so new-zuul was non-performance? | 03:50 |
fungi | or 15g perf have fewer cpus than 8 and 30? | 03:51 |
clarkb | new zuul was performance, 8gb 8vcpu | 03:52 |
clarkb | but the flavors are weird, 8gb gives you 8vcpu but 15 gives you 4vcpu | 03:52 |
clarkb | double check that with nova flavor-list but pretty sure those were the values I saw earlier today | 03:52 |
fungi | you're right | 03:53 |
fungi | strange but true | 03:53 |
clarkb | the other nice thing about 30GB is we can make the tmpfs pretty large and not worry too much about it filling unexpectedly | 03:54 |
fungi | yup | 03:54 |
clarkb | eg 16GB :) | 03:55 |
sdague | they basically seem to have created a high memory set of perf nodes | 03:55 |
clarkb | on my way home I was also thinking that zuul could do a better job in its scheduler of handling more than one discrete item at once | 03:57 |
clarkb | at the very least it should be able to process different pipelines independently | 03:57 |
clarkb | the nice thing about the serial way it does things now is it makes it very predictable about the order jobs run in and so on | 03:58 |
clarkb | but gate being slow doesn't have to affect check for example | 03:58 |
clarkb | but I think making changes like that probably won't have large benefits when 99% of your time is waiting for a forked git process to do its thing | 03:58 |
*** coolsvap has joined #openstack-infra | 04:00 | |
fungi | also, you'd need multiple git workspaces to avoid collisions | 04:01 |
*** harlowja is now known as harlowja_away | 04:01 | |
clarkb | oh right good point | 04:01 |
fungi | don't want to be building two nova refs in one git clone at the same moment | 04:01 |
*** praneshp has quit IRC | 04:04 | |
*** sarob has joined #openstack-infra | 04:08 | |
*** sarob_ has joined #openstack-infra | 04:10 | |
*** CaptTofu has joined #openstack-infra | 04:11 | |
*** sarob has quit IRC | 04:13 | |
*** CaptTofu has quit IRC | 04:15 | |
*** sdake has joined #openstack-infra | 04:16 | |
mikal | zuul hates me | 04:17 |
notmyname | mikal: don't worry. zuul hates everybody today ;-) | 04:17 |
mikal | Yay! | 04:17 |
mikal | On the performance nodes front, there are two types | 04:18 |
fungi | clarkb: we haven't merged the change yet that autopartitions the secondary block device on these performance nodes, have we? | 04:18 |
mikal | Which might not be obvious from flavour list | 04:18 |
sdague | so it's not incredibly helpful for people to "reverify bug 123456789" - https://review.openstack.org/#/c/61714/2 | 04:18 |
sdague | because that patch can't pass right now, due to grizzly devstack issues | 04:18 |
mikal | OMG, who did that? | 04:18 |
*** _ruhe is now known as ruhe | 04:18 | |
mikal | Performance 1 has its biggest at 8 vcpus, 8gb ram | 04:19 |
mikal | Performance 2 has its biggest at 32 vcpus, 120gb ram | 04:19 |
*** coolsvap_away has joined #openstack-infra | 04:20 | |
*** coolsvap has quit IRC | 04:21 | |
*** coolsvap_away is now known as coolsvap | 04:21 | |
*** vkozhukalov has joined #openstack-infra | 04:21 | |
sdague | fungi: so given that the grizzly devstack issues are out there, could you kick out all the stable/havana patches in the queue? because they are all just time bombs | 04:22 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 04:22 | |
notmyname | is there a single job that is run for _every_ gate job that isn't run for check jobs? I'm looking for a graphite metric | 04:23 |
notmyname | eg maybe gate-grenade-dsvm | 04:23 |
fungi | sdague: i'm not sure how to "kick them out" aside from uploading trivial new patchsets to each of them | 04:24 |
clarkb | fungi: I think that is the only way | 04:24 |
sdague | fungi: yeh, that would be the only way | 04:24 |
fungi | but a 'zuul eject' command would make for a good future addition | 04:24 |
sdague | yeh | 04:24 |
sdague | notmyname: gate-tempest-dsvm-full is the best approximation of the gate | 04:25 |
notmyname | sdague: thanks | 04:25 |
sdague | however, it's dynamic | 04:25 |
sdague | so not exact | 04:25 |
notmyname | dynamic? | 04:25 |
sdague | the integrated queue is assembled based on overlapping jobs | 04:25 |
sdague | so if change A runs tests 1 2 3, and change B runs tests 3 4 5, and change C runs tests 5 6 7 | 04:26 |
sdague | they will be in a single queue | 04:26 |
sdague | even though A doesn't overlap with C | 04:26 |
clarkb | and only one job of that entire set needs to fail to create a reset | 04:26 |
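A sketch of how shared queues fall out of overlapping job sets, using sdague's A/B/C example above; the grouping code is illustrative, not zuul's implementation.

```python
# Changes share a queue if their job sets overlap, directly or transitively:
# A shares a job with B and B with C, so A, B and C land in one queue even
# though A and C have no jobs in common.
jobs = {"A": {1, 2, 3}, "B": {3, 4, 5}, "C": {5, 6, 7}}

queues = []  # each queue tracks its changes and the union of their jobs
for change, job_set in jobs.items():
    overlapping = [q for q in queues if q["jobs"] & job_set]
    merged = {"changes": {change}, "jobs": set(job_set)}
    for q in overlapping:
        merged["changes"] |= q["changes"]
        merged["jobs"] |= q["jobs"]
        queues.remove(q)
    queues.append(merged)

print([sorted(q["changes"]) for q in queues])  # [['A', 'B', 'C']]
```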
*** dcramer_ has joined #openstack-infra | 04:26 | |
clarkb | so I was thinking about this a bit more after LCA, and I think what I would like to do is expose the zuul logs more. There shouldn't be any privileged info in them so it should be safe to just logstash them or whatever, but it will give clear data on "this was a gate reset" and so on | 04:27 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 04:28 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Moved homepage content to about page. https://review.openstack.org/67344 | 04:28 |
SergeyLukjanov | evening guys! | 04:28 |
clarkb | SergeyLukjanov: ohai | 04:28 |
SergeyLukjanov | clarkb, it'll be awesome to be able to read zuul logs :) | 04:28 |
clarkb | SergeyLukjanov: ya, I want to double check with jeblair to see if there are any known gotchas with that, but we can pipe it into the test log logstash too and get overlapping data | 04:29 |
*** rossella_s has quit IRC | 04:29 | |
mikal | clarkb: noting that if you turn on our swift reporter and debug logging, it logs the swift password | 04:29 |
clarkb | sdague: also did you see I diagnosed the missing console.html in logstash problem? zaro is going to work on a fix | 04:29 |
clarkb | mikal: is swift reporter a thing? | 04:30 |
clarkb | mikal: in any case we should sanitize that logging imo | 04:30 |
sdague | clarkb: cool | 04:30 |
mikal | clarkb: it is for us. I think it's meant to be for you in the future. | 04:30 |
mikal | clarkb: but perhaps I am mis-representing jhesketh__ and jblair's plan | 04:30 |
clarkb | sdague: what happens there is jenkins hasn't even touched the file on logs.o.o by the time logstash processes it, logstash gets a 404 and moves on | 04:31 |
sdague | great | 04:31 |
clarkb | sdague: so we will update the scp plugin to not finish the job until that file has at least been touched | 04:31 |
clarkb | should be a simple wait on a thread sync event | 04:31 |
sdague | cool | 04:32 |
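The scp plugin itself is Java, but the synchronization pattern clarkb describes looks roughly like this (illustrative Python with hypothetical names): job completion blocks on an event that the upload path sets once console.html has at least been touched.

```python
# Illustration of "wait on a thread sync event"; the real plugin is Java
# and these names are hypothetical.
import threading

console_touched = threading.Event()

def on_console_uploaded():
    # Called by the upload side once the file exists on the log server.
    console_touched.set()

def finish_job(timeout=60):
    # Block until the console log has been touched, or give up after a while.
    if not console_touched.wait(timeout):
        print("console.html never appeared; finishing anyway")
    print("job finished")
```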
*** chandankumar has joined #openstack-infra | 04:32 | |
clarkb | sdague: how is the neutron thing going? | 04:33 |
fungi | clarkb: yeah, we haven't approved https://review.openstack.org/63190 apparently, so that explains the lack of swap | 04:33 |
sdague | good, though slowed down by the current gate backup. The lower concurrency patch is looking promising in the gate right now though. | 04:34 |
clarkb | sdague: yeah | 04:34 |
clarkb | fungi: are you just going to manually add swap then? or should we merge 63190 and rebuild? | 04:34 |
fungi | i'll delete my first launch and build another with that patch added | 04:35 |
sdague | I need to go to bed, but I think if we kick out the stable branch changes in the gate the gate will empty by morning | 04:35 |
fungi | unless we want to merge it first | 04:35 |
clarkb | fungi: ok, I am reviewing that change now too | 04:35 |
clarkb | sdague: noted | 04:35 |
fungi | clarkb: thanks. we might as well approve it rather than continue manually applying it on every launch environment ;) | 04:35 |
sdague | that swift one that was reverify 123456789 is reset #2 in there, but that's because it can't pass | 04:35 |
sdague | there is a nova unit test fail as well, above it | 04:36 |
clarkb | sdague: is it just the two? I will look and figure it out don't stay up | 04:36 |
sdague | so that's what's failing now, though they are off to the side | 04:36 |
sdague | the rest is being computed | 04:36 |
sdague | however, stable/havana patches will fail on grenade | 04:36 |
sdague | because of grizzly | 04:36 |
sdague | so they take a while to show up after a reset | 04:37 |
clarkb | ok so all stable/havana changes should be kicked out | 04:37 |
sdague | yes | 04:37 |
clarkb | got it | 04:37 |
*** carl_baldwin has quit IRC | 04:37 | |
*** esker has quit IRC | 04:37 | |
sdague | chmouel and I were looking at stable grizzly devstack today, will do so again in the morning | 04:38 |
clarkb | ok | 04:38 |
sdague | I think it's fundamentally the pip 1.5 thing | 04:38 |
*** esker has joined #openstack-infra | 04:38 | |
sdague | anyway, bed time. Talk to you later. | 04:38 |
clarkb | we aren't using 1.5 anymore though right? or did we deal with that differently? | 04:38 |
clarkb | fungi: ^ | 04:38 |
clarkb | fungi: oh I remember, I asked for that change not to be in install_puppet.sh | 04:39 |
clarkb | fungi: because well it is doing something completely different and potentially harmful instead of simply installing puppet | 04:39 |
clarkb | fungi: why don't you use a local checkout and we can figure out how to deal with that properly when jeblair is back | 04:39 |
fungi | clarkb: ahh, right. you should note that in the review | 04:39 |
clarkb | yup sorry I didn't do that before, my bad | 04:40 |
openstackgerrit | A change was merged to openstack-infra/devstack-gate: Cut tempest concurrency in half https://review.openstack.org/65805 | 04:40 |
*** fifieldt has quit IRC | 04:41 | |
HenryG | In gerrit, is there a way to search for any reviews in progress that touch a particular file? | 04:42 |
*** fifieldt has joined #openstack-infra | 04:42 | |
*** emagana has joined #openstack-infra | 04:42 | |
jhesketh__ | clarkb: so (reading back...) I suggested on the infra mailing list that we run a zuul per pipeline to ease the load on the gate | 04:42 |
clarkb | jhesketh__: that won't ease the load on the gate but would help the other pipelines | 04:43 |
jhesketh__ | jblair didn't think it was necessary with the move to a performance node and also his future plan of sending git methods to workers | 04:43 |
notmyname | clarkb: fungi: thanks for the help with the CVE patch today | 04:43 |
clarkb | HenryG: if you have watched the projects and use the ssh query api then I think the answer is yes | 04:43 |
jhesketh__ | well zuul will be able to do its git magic faster if it doesn't have to fight other pipelines | 04:43 |
clarkb | jhesketh__: there is no fighting though, they are all dealt with serially | 04:44 |
clarkb | the problem is that the gate pipeline takes 15 minutes to handle a reset, and nothing else in zuul runs | 04:44 |
fungi | notmyname: of course, it's my pleasure | 04:44 |
clarkb | we need to make that faster, the worker idea should help there as it distributes the expensive git work across nodes | 04:44 |
jhesketh__ | clarkb: if zuul is pulling in a patch for nova in the check pipeline doesn't that block any merge it might be wanting to try on the gate pipeline? | 04:45 |
jhesketh__ | right okay | 04:45 |
clarkb | jhesketh__: not really because it will handle those one at a time | 04:45 |
clarkb | this interim idea is use tmpfs to speed up git operations | 04:45 |
fungi | jhesketh__: zuul's output is a constructed git ref, in the end, so the state of its work tree doesn't have to hang around. just a git object | 04:46 |
clarkb | as that requires no code changes and should help quite a bit | 04:46 |
jhesketh__ | clarkb: so it does block, it's just not significant? | 04:46 |
jhesketh__ | (the check pipeline that is) | 04:46 |
clarkb | jhesketh__: ya because the check pipeline work is once and done | 04:46 |
clarkb | ~10 seconds of work | 04:47 |
*** praneshp has joined #openstack-infra | 04:47 | |
jhesketh__ | sure, but if somebody commits a dozen patches at once that's still a delay | 04:47 |
clarkb | but for dependent pipelines it processes the entire queue before being done. which is ~10 seconds multiplied by the number of changes | 04:47 |
fungi | jhesketh__: it blocks, but insofar as it all blocks because git operations are not happening in parallel | 04:47 |
jhesketh__ | yep | 04:47 |
clarkb | jhesketh__: but it allows other work to happen between those changes | 04:47 |
HenryG | clarkb: yes I have "watched" the project (tempest, in this case). Do you have a ptr handy to the ssh query api for a noob to get started? | 04:48 |
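For reference, a gerrit ssh query along these lines can list open changes touching a file; the host, port, and which operators a given Gerrit version supports (e.g. `file:^regex`) are assumptions here, so treat this as a sketch rather than a verified recipe.

```python
# Hedged example: query Gerrit over ssh for open changes touching a file.
import json
import subprocess

cmd = [
    "ssh", "-p", "29418", "review.openstack.org",
    "gerrit", "query", "--format=JSON",
    "status:open", "project:openstack/tempest", "file:^.*run_tests.*",
]
out = subprocess.run(cmd, capture_output=True, text=True).stdout
for line in out.splitlines():
    change = json.loads(line)
    if "url" in change:  # the last line is a stats record without a url
        print(change["url"], change.get("subject"))
```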
clarkb | so the total work is 10*10 seconds but it doesn't starve the other queues | 04:48 |
jhesketh__ | sure, | 04:48 |
clarkb | with the gate it literally stops everything else for that 15 minute period | 04:48 |
mikal | I can assume that my stackforge approval from an hour ago isn't lost, right? | 04:48 |
mikal | Just slow? | 04:48 |
clarkb | mikal: yes just very very slow | 04:48 |
StevenK | 515 events, wheee | 04:48 |
clarkb | the compounding problem with the gate is on a failure it does all of the work again | 04:49 |
clarkb | then you fail and it does it all again | 04:49 |
clarkb | and on and on | 04:49 |
*** praneshp_ has joined #openstack-infra | 04:52 | |
clarkb | finding trivial patchset content is non trivial | 04:53 |
clarkb | fungi: just update commit message? | 04:53 |
sdague | clarkb: ok, not quite asleep yet | 04:53 |
* mikal promises not to approve anything for a while | 04:53 | |
sdague | but it looks like there are 6 - 8 stable/havana patches in the gate | 04:53 |
sdague | so if you nuke them now, I think the gate will clear out by morning | 04:53 |
clarkb | sdague: I found 5 | 04:53 |
sdague | lots of keystone with month old test results | 04:53 |
sdague | I went through and started -2ing a ton of stuff | 04:54 |
*** praneshp has quit IRC | 04:54 | |
*** praneshp_ is now known as praneshp | 04:54 | |
mikal | Oh, we still have that "old checks" problem? | 04:54 |
sdague | apparently, I have -2 on havana | 04:54 |
fungi | clarkb: yeah, update commit message will work | 04:54 |
sdague | mikal: yes | 04:54 |
mikal | Would it be meaningful to have that quick and dirty rechecker turned on | 04:54 |
StevenK | sdague: But that turns into an event, and zuul isn't really getting around to that ... | 04:54 |
sdague | mikal: probably | 04:54 |
mikal | I didn't do it because I was told that we'd have gerrit doing it soon | 04:54 |
sdague | StevenK: sure | 04:54 |
mikal | But if it would help, I'll get it done today | 04:54 |
sdague | however it will signal | 04:54 |
sdague | mikal: yes, it would be helpful, have it have a variable for # of days that we consider something stale | 04:55 |
sdague | that we could set in infra | 04:55 |
sdague | it would be awesome | 04:55 |
mikal | sdague: as in projects.yaml? | 04:55 |
* mikal pulls out that code and dusts it off | 04:55 | |
jhesketh__ | mikal: is this the turbo-hipster gerrit rechecker? | 04:55 |
mikal | jhesketh__: yeah | 04:55 |
sdague | mikal: wherever clarkb and fungi think it should live | 04:55 |
sdague | just want to make it configurable | 04:56 |
mikal | It will reduce the number of merge fails | 04:56 |
mikal | Well, what you get today is quick and dirty | 04:56 |
jhesketh__ | mikal: unless you set up turbo-hipster on infra the config will have to be in our cloud | 04:56 |
sdague | mikal: this actually isn't a merge fail problem | 04:56 |
jhesketh__ | well I guess you could hit a url for it | 04:56 |
mikal | And then we do something less shit sometime real soon | 04:56 |
sdague | it's the fact that tox or deps changed in a month | 04:56 |
sdague | so the passing results aren't valid at all | 04:56 |
clarkb | sdague: some of these do actually fail to merge | 04:56 |
clarkb | sdague: its fun... | 04:56 |
mikal | Yeah, so a recheck of checks older than a week would have covered this, right? | 04:56 |
sdague | clarkb: ok | 04:56 |
sdague | mikal: yes | 04:56 |
clarkb | sdague: I am pushing patchsets though to make it clear | 04:57 |
jhesketh__ | sdague: sure, so this code mikal whacked together is a turbo-hipster plugin.. so it'll probably not be configurable today if you want quick and dirty | 04:57 |
mikal | Ok, cool | 04:57 |
mikal | I shall do a thing | 04:57 |
mikal | jhesketh__: I think that's ok | 04:57 |
sdague | mikal: you are my hero :) | 04:57 |
mikal | We can make it suck less tomorrow | 04:57 |
jhesketh__ | mikal: oh yeah, I agree. Just letting others know | 04:57 |
mikal | I need theme music | 04:57 |
fungi | clearly i can't work on things and keep up with irc at the same time | 04:57 |
fungi | i'm sure you're all discussing exciting things | 04:58 |
mikal | LOL | 04:58 |
mikal | Just robots of doom | 04:58 |
mikal | jhesketh__: is testzuul free at the moment? | 04:59 |
mikal | jhesketh__: I might run this there | 04:59 |
jhesketh__ | mikal: go for it... I think it's in an okay state | 04:59 |
mikal | jhesketh__: cool | 04:59 |
clarkb | sdague: lol bugs are getting assigned to me because I am writing those patchsets :) | 04:59 |
fungi | clarkb: so, new-new-zuul is 2001:4800:7815:0101:3bc3:d7f6:ff04:e07f | 05:00 |
fungi | 15g tmpfs on the git dir | 05:00 |
fungi | zuul daemon seems to properly recreate the contents of that directory when it's started | 05:00 |
clarkb | fungi: noice | 05:00 |
fungi | i've also started the puppet agent on it | 05:00 |
clarkb | fungi: is it accepting jobs though? | 05:00 |
clarkb | oh I know where we need to update firewalls, on the jenkins masters | 05:01 |
clarkb | er wait no | 05:01 |
clarkb | we just need to make sure the jenkins masters connect to new new zuul's geard | 05:01 |
fungi | yeah. but i've stopped the zuul daemon again just to be safe | 05:01 |
clarkb | cool | 05:02 |
*** chandankumar has quit IRC | 05:03 | |
clarkb | fungi: so ya, I think we plan to do a switcheroo early tomorrow and see if tmpfs helps a bunch | 05:03 |
*** mrda has quit IRC | 05:03 | |
clarkb | I will attempt to wake up early | 05:03 |
*** resker has joined #openstack-infra | 05:03 | |
fungi | i'll be around and ready | 05:04 |
clarkb | sdague: I have killed two keystone changes and one swift, there appear to be 3 more changes | 05:04 |
clarkb | sdague: slowly getting through them | 05:04 |
notmyname | clarkb: https://review.openstack.org/#/c/67186/ and https://review.openstack.org/#/c/67187/ are backports for the CVE bug | 05:05 |
notmyname | for grizzly and havana | 05:06 |
clarkb | notmyname: ok, neither will pass the gate until grenade is working for grizzly and havana | 05:06 |
clarkb | notmyname: sdague and chmouel are working on that as a priority | 05:06 |
notmyname | clarkb: right. I just thought you were working on making sure those don't get into the queue. they were/are marked as approved | 05:07 |
*** esker has quit IRC | 05:07 | |
clarkb | notmyname: I didn't see them in the queue | 05:07 |
notmyname | ah ok | 05:07 |
*** ruhe is now known as _ruhe | 05:07 | |
*** krtaylor has joined #openstack-infra | 05:08 | |
clarkb | I think I got all of them according to a gerrit search | 05:09 |
clarkb | jhesketh__: going back to zuul slowness. I probably wasn't entirely clear, but in zuul's main loop it processes all results then processes events | 05:11 |
*** yamahata has joined #openstack-infra | 05:12 | |
clarkb | jhesketh__: results cause gate resets (if a job result was a fail); this causes zuul to cancel all jobs in the gate behind it, then remerge the new state of proposed git merging, then start jobs for all of those changes. That process takes 15 minutes or more with 90 changes in the queue | 05:12 |
fungi | right. pragmatic ordering since results have a chance of reducing the complexity | 05:12 |
clarkb | jhesketh__: that entire process is one iteration through the loop so no other results or events are processed during that time | 05:12 |
clarkb | jhesketh__: because of that zuul per pipeline won't fix the problem but it will decouple it from check and post and so on | 05:12 |
*** zz_ewindisch is now known as ewindisch | 05:13 | |
*** mrda has joined #openstack-infra | 05:13 | |
clarkb | jhesketh__: zuul per pipeline will still result in really slow gate processing. The way to fix that is to make git operations quicker. git worker nodes and git repos in tmpfs should make that better. And honestly after reading through the logs I think if we solve that problem then zuul per pipeline isn't necessary | 05:13 |
clarkb | we are literally spending minutes running git remote update and git checkout foo and git merge | 05:14 |
*** resker has quit IRC | 05:14 | |
jhesketh_ | clarkb: okay, thanks for the clarification, makes sense | 05:14 |
fungi | clarkb: it might also have the effect of interleaving workers between pipelines, unlike the broad swing we see now (gate resets, all pending check changes get workers, then attempts are made on the gate changes, repeat) | 05:15 |
clarkb | fungi: yup | 05:15 |
fungi | since there would be more than one gearman server for a jenkins master to listen to | 05:15 |
clarkb | jhesketh_: I do think another thing that would help, but would require massive rewrites of zuul, is to do everything in a non-blocking manner: fire off hundreds of git merges at once and wait for IO to happen. Using the git gearman workers approximates this but it could probably just be done in-process too | 05:16 |
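A sketch of the non-blocking idea, kept consistent with the workspace-collision point fungi raised earlier (one working tree can't build two refs at once): parallelize across projects, but keep each project's merges serial. Project names, paths, and timings are placeholders.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_ref(workspace, change):
    # Stand-in for zuul's checkout/merge/ref-creation sequence; a real
    # version would run git inside `workspace`.
    time.sleep(0.01)
    return f"{workspace} -> refs/zuul/{change}"

changes = [("openstack/nova", "Z1"), ("openstack/neutron", "Z2"),
           ("openstack/nova", "Z3"), ("openstack/cinder", "Z4")]

by_project = defaultdict(list)
for project, change in changes:
    by_project[project].append(change)

def prepare_project(project):
    # Serial within a project, parallel across projects.
    return [prepare_ref(f"/var/lib/zuul/git/{project}", c) for c in by_project[project]]

with ThreadPoolExecutor(max_workers=8) as pool:
    for refs in pool.map(prepare_project, by_project):
        print(refs)
```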
*** sarob_ has quit IRC | 05:18 | |
clarkb | lifeless: https://jenkins02.openstack.org/job/gate-neutron-python27/6117/console is that a limitation of testtools matchers? | 05:18 |
clarkb | jhesketh_: the whole situation has led me to drinking heavily | 05:18 |
*** amotoki_ has joined #openstack-infra | 05:18 | |
*** sarob has joined #openstack-infra | 05:18 | |
clarkb | jhesketh_: :) | 05:18 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:20 | |
lifeless | clarkb: no | 05:20 |
jhesketh_ | clarkb: heh, okay | 05:20 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:21 | |
fungi | clarkb: the whole situation has gotten in the way of my usual heavy drinking. opposite of the expected effect | 05:21 |
clarkb | fungi: I'm sorry, I found this IPA to help tremendously | 05:21 |
lifeless | the matcher api doesn't assume strings etc | 05:21 |
*** sarob_ has joined #openstack-infra | 05:21 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:21 | |
mikal | clarkb: is there a way to specify a wildcard project name in layout.yaml? | 05:21 |
fungi | clarkb: as long as it's a v6 ipa | 05:21 |
clarkb | lifeless: I didn't think so, but figured I would ask anyways | 05:21 |
mikal | i.e. I want this to match more than one project | 05:21 |
clarkb | mikal: no, but you can have templates that you apply to many projects | 05:21 |
mikal | But I still need to list the projects, right? | 05:22 |
clarkb | mikal: yup | 05:22 |
mikal | :( | 05:22 |
clarkb | mikal: actually wait | 05:22 |
clarkb | mikal: the thing that does event matching may do regexes everywhere /me examines code | 05:22 |
*** amotoki has quit IRC | 05:22 | |
*** sarob has quit IRC | 05:23 | |
clarkb | mikal: best I can tell project is a magical key and doesn't | 05:24 |
clarkb | sdague: russellb: fungi: the spice flows. I think that d-g change helped | 05:24 |
*** esker has joined #openstack-infra | 05:25 | |
fungi | clarkb: awesome. instead of ipa, i think i'm going to settle in for a nap | 05:25 |
*** sarob_ has quit IRC | 05:25 | |
fungi | maybe after the zuul upgrade tomorrow i'll actually find some time to start catching up on e-mail and code review | 05:26 |
mikal | fungi: better code review, or we'll kick you out of core! | 05:26 |
fungi | mikal: somehow i think my current code review stats would let me kick everyone else out | 05:26 |
mikal | LOL | 05:27 |
mikal | Project of one | 05:27 |
fungi | but that's holidays for you | 05:27 |
fungi | last month shouldn't really count | 05:27 |
clarkb | last month was a lie | 05:28 |
*** nicedice has quit IRC | 05:29 | |
fungi | but there *was* cake, at least | 05:29 |
clarkb | code review is high on list of things now that we seem to have a handle on gate badness | 05:29 |
clarkb | and by have a handle on I mean understand | 05:29 |
fungi | cower in ph33r of | 05:30 |
fungi | +++ATH | 05:31 |
fungi | NO CARRIER | 05:32 |
clarkb | fungi: is the zuul tmpfs in fstab? | 05:34 |
fungi | clarkb: yup | 05:34 |
clarkb | awesome, it occurred to me that a reboot may result in weird things if it wasn't | 05:35 |
fungi | none /var/lib/zuul/git tmpfs defaults,size=15G 0 0 | 05:35 |
fungi | what kinda sysadmin do you take me for? ;) | 05:35 |
clarkb | :P I am just double checking | 05:35 |
fungi | yeah, good to confirm that | 05:35 |
fungi | i just double-checked too because i'm running on fumes and no longer trust myself | 05:36 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1269940 https://review.openstack.org/67303 | 05:36 |
uvirtbot | Launchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/1269940 | 05:36 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1260311 https://review.openstack.org/67314 | 05:37 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 05:37 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add e-r query for bug 1266611 https://review.openstack.org/65344 | 05:37 |
uvirtbot | Launchpad bug 1266611 in nova "test_create_image_with_reboot fails with InstanceInvalidState in gate-nova-python*" [Undecided,New] https://launchpad.net/bugs/1266611 | 05:37 |
clarkb | fungi: I trust you | 05:38 |
*** odyssey4me has quit IRC | 05:38 | |
fungi | eh, i don't recommend it. counterindicated by my operating manual | 05:39 |
* fungi is covered in warning labels | 05:40 | |
clarkb | fungi: I have a thing at ~10am PST, will try to be up early maybe we can attempt zuul stuff around 8am PST | 05:40 |
fungi | sounds great | 05:40 |
clarkb | also watch the gate, it may merge a ton of things all at once over the next 10 minutes | 05:41 |
fungi | i saw | 05:41 |
fungi | though the longest-running changes have had a tendency to be the ones that fail, so it's always a major fake-out | 05:41 |
clarkb | :/ we did just increase test time by a non trivial factor | 05:42 |
fungi | plus, job run times are longer than jenkins expects now, so its estimates are a bit optimistic | 05:42 |
* clarkb hopes it is just that | 05:42 | |
clarkb | NNOOOOOO a job just failed | 05:42 |
clarkb | oh it was just a test timeout for grenade, let's bump that timeout too | 05:43 |
* clarkb proposes that change | 05:43 | |
fungi | i'll stick around to approve it if you propose | 05:43 |
openstackgerrit | Clark Boylan proposed a change to openstack-infra/config: Double grenade test timeouts https://review.openstack.org/67374 | 05:46 |
clarkb | fungi: ^ | 05:46 |
clarkb | with that in place I feel confident that the queue will move | 05:47 |
fungi | it's in | 05:47 |
clarkb | danke | 05:47 |
fungi | well, approved. will take time to get through the event queue | 05:47 |
clarkb | ya I figure we don't worry too much about that :) | 05:48 |
*** slong has joined #openstack-infra | 05:50 | |
*** slong-afk has quit IRC | 05:51 | |
*** HenryG has quit IRC | 05:52 | |
*** DinaBelova has joined #openstack-infra | 05:53 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:53 | |
*** pballand has joined #openstack-infra | 05:56 | |
clarkb | fungi: anyways don't stay up anymore, things should settle down overnight (I hope) and we can hit this with a big hammer tomorrow | 05:57 |
*** zhiwei has quit IRC | 05:59 | |
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/storyboard: Fixed doc build https://review.openstack.org/67376 | 06:02 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters https://review.openstack.org/67265 | 06:04 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id https://review.openstack.org/66036 | 06:04 |
*** reed has quit IRC | 06:07 | |
*** odyssey4me has joined #openstack-infra | 06:08 | |
*** CaptTofu has joined #openstack-infra | 06:12 | |
*** chandankumar has joined #openstack-infra | 06:15 | |
*** CaptTofu has quit IRC | 06:16 | |
*** pballand has quit IRC | 06:17 | |
*** praneshp is now known as praneshp_afk | 06:18 | |
*** denis_makogon has joined #openstack-infra | 06:26 | |
*** pelix has left #openstack-infra | 06:31 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 06:34 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 06:36 | |
*** afazekas_ has quit IRC | 06:37 | |
*** gokrokve has quit IRC | 06:38 | |
*** gokrokve has joined #openstack-infra | 06:38 | |
mikal | I think I just realized my approach wont work | 06:39 |
mikal | The extra text zuul puts in the review comment will stop the recheck from triggering | 06:39 |
clarkb | mikal: oh right, because the regex is very restrictive : | 06:40 |
mikal | Yeah | 06:41 |
mikal | I'm going to write a crappy daemon for now | 06:41 |
mikal | But its a shame I can't use zuul | 06:41 |
*** sHellUx has joined #openstack-infra | 06:41 | |
*** gokrokve has quit IRC | 06:42 | |
*** sHellUx has quit IRC | 06:42 | |
*** SergeyLukjanov_ has joined #openstack-infra | 06:44 | |
*** SergeyLukjanov_ has quit IRC | 06:45 | |
*** DinaBelova_ has joined #openstack-infra | 06:46 | |
*** vkozhukalov has quit IRC | 06:52 | |
*** ewindisch is now known as zz_ewindisch | 06:55 | |
*** DinaBelova has quit IRC | 06:56 | |
*** DinaBelova_ is now known as DinaBelova | 06:56 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 06:58 | |
*** DinaBelova is now known as DinaBelova_ | 06:58 | |
*** mrda has quit IRC | 07:01 | |
*** odyssey4me has quit IRC | 07:04 | |
*** yolanda has joined #openstack-infra | 07:07 | |
*** nati_uen_ has quit IRC | 07:11 | |
*** odyssey4me has joined #openstack-infra | 07:12 | |
*** afazekas_ has joined #openstack-infra | 07:25 | |
*** jcoufal has joined #openstack-infra | 07:27 | |
clarkb | anteaya: can you check if https://review.openstack.org/#/c/66490/ is just broken? it is flapping in the gate and I think the patch itself doesn't work | 07:33 |
clarkb | anteaya: and if so can you make sure someone proposes a new patchset to it to remove it from the gate if it is still in the gate when you see this? | 07:33 |
openstackgerrit | A change was merged to openstack-infra/config: Double grenade test timeouts https://review.openstack.org/67374 | 07:41 |
clarkb | oh good now I can go to bed | 07:42 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add gates for API projects and operations-guide https://review.openstack.org/67394 | 07:47 |
*** dizquierdo has joined #openstack-infra | 07:51 | |
*** jamielennox is now known as jamielennox|away | 07:54 | |
*** flaper87|afk is now known as flaper87 | 07:55 | |
*** DinaBelova_ is now known as DinaBelova | 07:58 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 07:58 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 08:01 | |
*** odyssey4me has quit IRC | 08:01 | |
*** fifieldt has quit IRC | 08:05 | |
*** fifieldt has joined #openstack-infra | 08:07 | |
*** odyssey4me has joined #openstack-infra | 08:09 | |
*** CaptTofu has joined #openstack-infra | 08:12 | |
*** bookwar has quit IRC | 08:14 | |
*** bookwar has joined #openstack-infra | 08:16 | |
*** CaptTofu has quit IRC | 08:17 | |
*** jcoufal has quit IRC | 08:21 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 08:24 | |
*** mancdaz_away is now known as mancdaz | 08:25 | |
*** mancdaz is now known as mancdaz_away | 08:25 | |
*** vkozhukalov has joined #openstack-infra | 08:28 | |
*** jcoufal has joined #openstack-infra | 08:31 | |
*** luqas has joined #openstack-infra | 08:32 | |
*** mancdaz_away is now known as mancdaz | 08:34 | |
*** coolsvap has quit IRC | 08:35 | |
*** coolsvap has joined #openstack-infra | 08:35 | |
*** odyssey4me has quit IRC | 08:36 | |
*** fifieldt has quit IRC | 08:37 | |
*** NikitaKonovalov has joined #openstack-infra | 08:42 | |
*** odyssey4me has joined #openstack-infra | 08:44 | |
*** dpyzhov has joined #openstack-infra | 08:47 | |
*** talluri has joined #openstack-infra | 08:48 | |
*** odyssey4me has quit IRC | 08:49 | |
*** mrmartin has joined #openstack-infra | 08:50 | |
*** ogelbukh has quit IRC | 08:55 | |
*** odyssey4me has joined #openstack-infra | 08:56 | |
*** hashar has joined #openstack-infra | 08:57 | |
*** lyle has joined #openstack-infra | 08:58 | |
*** mrmartin has quit IRC | 08:58 | |
*** david-lyle has quit IRC | 08:58 | |
*** emagana has quit IRC | 08:59 | |
*** mdenny has quit IRC | 09:01 | |
*** mdenny has joined #openstack-infra | 09:01 | |
*** vkozhukalov has quit IRC | 09:03 | |
*** mrmartin has joined #openstack-infra | 09:04 | |
*** mrmartin has quit IRC | 09:08 | |
*** kruskakli has quit IRC | 09:11 | |
*** fbo_away is now known as fbo | 09:12 | |
*** praneshp_afk has quit IRC | 09:12 | |
*** mrmartin has joined #openstack-infra | 09:13 | |
*** _ruhe is now known as ruhe | 09:17 | |
*** vkozhukalov has joined #openstack-infra | 09:18 | |
*** yassine has joined #openstack-infra | 09:20 | |
*** IvanBerezovskiy has joined #openstack-infra | 09:20 | |
*** JohanH has joined #openstack-infra | 09:21 | |
*** markmc has joined #openstack-infra | 09:22 | |
*** max_lobur_afk is now known as max_lobur | 09:23 | |
*** pblaho has joined #openstack-infra | 09:26 | |
JohanH | Hi, we are trying to get Zuul to work in our own project and we are running into an issue where we cannot get several concurrent gate checks to execute in parallel. The first job starts but all the other changes in the queue are skipped. Does anyone know what the problem might be? We would like to run as many parallel jobs as possible, utilizing all our jenkins slave workers | 09:28 |
*** luqas has quit IRC | 09:38 | |
*** ruhe is now known as ruhe_away | 09:41 | |
*** ruhe_away is now known as ruhe | 09:42 | |
*** denis_makogon has quit IRC | 09:44 | |
SergeyLukjanov | JohanH, which dependency manager are you using? | 09:45 |
SergeyLukjanov | JohanH, if you're setting up zuul for gerrit.o.o then you need to use the 'check' pipeline instead of 'gate', because zuul.o.o will do the merging instead of yours | 09:46 |
*** jooools has joined #openstack-infra | 09:47 | |
*** luqas has joined #openstack-infra | 09:47 | |
*** odyssey4me has quit IRC | 09:54 | |
*** yamahata has quit IRC | 09:56 | |
JohanH | Hi SergeyLukjanov, we are using the gate pipeline, so I guess that it is the dependent pipeline manager. According to the zuul documentation and the description for the DependentPipelineManager: In order to achieve parallel testing of changes, the dependent pipeline manager performs speculative execution on changes. It orders changes based on their entry into the pipeline. It begins testing all changes in parallel, assuming that each change ahead of it will pass its tests. If they all succeed, all the changes can be tested and merged in parallel. | 09:58 |
*** jishaom has quit IRC | 09:59 | |
flaper87 | fungi: any way I can ssh into a box running this test? http://logs.openstack.org/99/65499/4/check/gate-glance-python27/ff2cac8/nose_results.html | 09:59 |
JohanH | So, wouldn't it start testing the changes in parallel | 09:59 |
flaper87 | fungi: I've no idea what's going on there and tests pass in my box | 09:59 |
*** xchu has quit IRC | 09:59 | |
*** odyssey4me has joined #openstack-infra | 10:03 | |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 10:10 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 10:11 | |
*** dpyzhov has quit IRC | 10:11 | |
*** CaptTofu has joined #openstack-infra | 10:13 | |
*** jp_at_hp has joined #openstack-infra | 10:14 | |
*** CaptTofu has quit IRC | 10:18 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 10:21 | |
*** pblaho has quit IRC | 10:21 | |
*** rakhmerov has quit IRC | 10:22 | |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters https://review.openstack.org/67265 | 10:25 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id https://review.openstack.org/66036 | 10:25 |
*** talluri has quit IRC | 10:29 | |
*** mrda has joined #openstack-infra | 10:29 | |
*** talluri has joined #openstack-infra | 10:30 | |
mikal | It is scary how often the stale recheck bot fires | 10:32 |
mikal | It's like... really common | 10:33 |
*** dpyzhov has joined #openstack-infra | 10:35 | |
*** jooools has quit IRC | 10:40 | |
openstackgerrit | SlickNik proposed a change to openstack-infra/config: Update devstack-gate jobs for Trove tempest tests https://review.openstack.org/65065 | 10:40 |
openstackgerrit | SlickNik proposed a change to openstack-infra/devstack-gate: Add Trove testing support https://review.openstack.org/65040 | 10:42 |
*** zhiyan has left #openstack-infra | 10:43 | |
SlickNik | ^^ jeblair / mordred / fungi / clarkb Please review when you get a chance. Thanks! | 10:43 |
mikal | clarkb: I have a simple bot which does rechecks, I'm not going to leave it running overnight though, as it scares me that it might recheck the world without permission | 10:44 |
mikal | Also, the check queue is pretty long at the moment | 10:44 |
*** jooools has joined #openstack-infra | 10:46 | |
*** vkozhukalov has quit IRC | 10:46 | |
*** nosnos has quit IRC | 10:53 | |
SergeyLukjanov | JohanH, it should start in parallel | 10:54 |
SergeyLukjanov | JohanH, do you have enough slaves? | 10:54 |
*** mrmartin has quit IRC | 10:54 | |
anteaya | mikal: thank you for holding off on the recheck bot | 10:55 |
anteaya | we would never climb out of the current situation | 10:55 |
anteaya | yay down to 64 events, progress | 10:55 |
anteaya | we started off yesterday with over 1000 events but never got below 600 by the end of my day yesterday | 10:56 |
mikal | anteaya: so, the thinking is a recheck is a lot cheaper than a gate merge flush | 10:58 |
mikal | So, we were hoping doing rechecks on ancient check runs would make the gate queue a bit less horrible | 10:58 |
*** vkozhukalov has joined #openstack-infra | 10:58 | |
*** yaguang has quit IRC | 10:58 | |
mikal | The bot only does a recheck if someone comments on a review with an ancient check, so it's also not a blanket thing | 10:58 |
mikal | But I will stop it overnight and keep an eye on it while it's running | 10:58 |
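As a rough illustration of the kind of bot mikal describes (not his actual code), the sketch below asks Gerrit over ssh for open changes and leaves a "recheck" comment on stale ones. It simplifies the trigger to an age check rather than watching for new comments, and the host, project, message text, and age cutoff are all assumptions.

```python
# Minimal sketch of a "stale check" recheck bot (illustrative only).
import json
import subprocess
import time

GERRIT = ["ssh", "-p", "29418", "review.openstack.org", "gerrit"]
MAX_AGE = 7 * 24 * 3600  # treat check results older than a week as "ancient" (assumption)

def open_changes(project):
    """Yield open changes for a project, including their current patch set."""
    out = subprocess.check_output(
        GERRIT + ["query", "--format=JSON", "--current-patch-set",
                  "status:open", "project:" + project]).decode()
    for line in out.splitlines():
        data = json.loads(line)
        if "id" in data:  # the trailing line is a stats record, skip it
            yield data

def recheck(change):
    """Leave a recheck comment on the current patch set of a change."""
    ref = "%s,%s" % (change["number"], change["currentPatchSet"]["number"])
    # the inner quotes survive the remote shell so the message stays one argument
    subprocess.check_call(GERRIT + ["review", "-m", "'recheck no bug'", ref])

if __name__ == "__main__":
    for change in open_changes("openstack/nova"):
        if time.time() - change["lastUpdated"] > MAX_AGE:
            recheck(change)
```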
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:00 | |
*** tma996 has joined #openstack-infra | 11:02 | |
*** talluri has quit IRC | 11:05 | |
*** amotoki has joined #openstack-infra | 11:05 | |
*** derekh has joined #openstack-infra | 11:05 | |
anteaya | mikal: hmmm okay, let's keep an eye on the number of events | 11:06 |
anteaya | if you have been running it on the system for the past 8 hours, it might partly explain the > 500 event decrease I see on the zuul status page | 11:07 |
*** amotoki_ has quit IRC | 11:07 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 11:07 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:08 | |
anteaya | clarkb: salv-orlando has beat me to it with a big -2 on 66490, thanks for alerting us and sorry for causing a problem | 11:09 |
kiall | So - Just noticed a change that merged yesterday https://review.openstack.org/#/c/67143/ never got pushed to github, but did make it to git.o.o .. | 11:09 |
kiall | I'm assuming the next merge will "fix" it .. But might be a problem | 11:09 |
*** NikitaKonovalov has quit IRC | 11:10 | |
*** rakhmerov has joined #openstack-infra | 11:10 | |
anteaya | he sniped it with a new patchset | 11:10 |
sdague | morning folks | 11:13 |
anteaya | morning sdague | 11:14 |
*** rakhmerov has quit IRC | 11:14 | |
anteaya | mikal: I just read part of the backscroll, clarkb and fungi were casting incantations last night and some of them seemed to be working | 11:15 |
anteaya | so that might be part of the source of the > 500 decrease in events | 11:15 |
sdague | yeh, jenkins is still blowing us up it looks like | 11:16 |
sdague | which actually seems to be the root cause of the problem right now | 11:16 |
anteaya | clarkb and fungi are planning a zuul upgrade at 11am this morning | 11:17 |
anteaya | all things being equal | 11:17 |
sdague | http://status.openstack.org/elastic-recheck/ - graphs 1, 2, and 3 are jenkins errors | 11:17 |
sdague | #2 isn't affecting us, but the others are | 11:17 |
anteaya | goodness we didn't fare well yesterday afternoon | 11:18 |
anteaya | grenade test timeouts have been doubled: https://review.openstack.org/#/c/67374/ | 11:18 |
anteaya | and I think there was another d-g change but I didn't get far enough back in the backscroll to id the url for it | 11:19 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 11:22 | |
*** ArxCruz has joined #openstack-infra | 11:22 | |
*** mrda has quit IRC | 11:26 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: only run on openstack gate projects https://review.openstack.org/67273 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: expose on channel when we timeout on logs https://review.openstack.org/66565 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: move to static LOG https://review.openstack.org/66564 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: create more sane logging for the er bot https://review.openstack.org/66435 | 11:27 |
anteaya | the timeout for tempest runs has also been increased: https://review.openstack.org/66379 | 11:30 |
anteaya | I think that was the other change I saw referenced | 11:30 |
*** vipul is now known as vipul-away | 11:31 | |
anteaya | mordred and clarkb: jog0 had done some evaluation of times using eatmydata yesterday and I believe the conclusion he and fungi had reached was that it was not a significant time savings | 11:32 |
anteaya | if I recall they were both rather disappointed by the outcome | 11:32 |
anteaya | ping jog0 for exact details as I might be incorrect in the application of what was being evaluated | 11:33 |
anteaya | it's early | 11:33 |
*** ruhe is now known as _ruhe | 11:33 | |
*** rfolco has joined #openstack-infra | 11:33 | |
*** NikitaKonovalov has joined #openstack-infra | 11:39 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: only run on openstack gate projects https://review.openstack.org/67273 | 11:40 |
*** DinaBelova is now known as DinaBelova_ | 11:41 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: create more sane logging for the er bot https://review.openstack.org/66435 | 11:41 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:41 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: move to static LOG https://review.openstack.org/66564 | 11:41 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: expose on channel when we timeout on logs https://review.openstack.org/66565 | 11:43 |
*** DinaBelova_ is now known as DinaBelova | 11:47 | |
*** smarcet has joined #openstack-infra | 11:51 | |
*** _ruhe is now known as ruhe | 11:52 | |
*** dpyzhov has quit IRC | 11:52 | |
*** dpyzhov has joined #openstack-infra | 11:53 | |
*** jcoufal has quit IRC | 11:56 | |
*** mrmartin has joined #openstack-infra | 11:59 | |
*** DinaBelova is now known as DinaBelova_ | 12:00 | |
*** vkozhukalov has quit IRC | 12:00 | |
*** hashar has quit IRC | 12:03 | |
*** dstanek has quit IRC | 12:06 | |
*** talluri has joined #openstack-infra | 12:10 | |
*** lcestari has joined #openstack-infra | 12:10 | |
*** rakhmerov has joined #openstack-infra | 12:11 | |
*** vkozhukalov has joined #openstack-infra | 12:12 | |
*** pblaho has joined #openstack-infra | 12:12 | |
*** CaptTofu has joined #openstack-infra | 12:14 | |
*** rakhmerov has quit IRC | 12:15 | |
dims | sdague, i had a suggestion in https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/3 | 12:15 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] | 12:15 |
dims | for the jenkins troubles | 12:15 |
sdague | sure | 12:16 |
sdague | honestly, that's suspiciously high to me | 12:16 |
sdague | I need to talk with fungi when he gets up | 12:16 |
dims | we are on 1.525 of jenkins | 12:16 |
sdague | because it might be one of the things that there is retry logic around, but we still count it as a fail | 12:17 |
*** vkozhukalov has quit IRC | 12:17 | |
sdague | which would totally skew things in graphite | 12:17 |
dims | y | 12:17 |
*** CaptTofu has quit IRC | 12:19 | |
*** dpyzhov has quit IRC | 12:19 | |
*** talluri has quit IRC | 12:21 | |
*** vkozhukalov has joined #openstack-infra | 12:32 | |
*** jcoufal has joined #openstack-infra | 12:33 | |
dims | sdague, bit more looking around and new recommendation on the version # for jenkins (https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/4) | 12:38 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] | 12:38 |
*** chandankumar has quit IRC | 12:42 | |
*** hashar has joined #openstack-infra | 12:43 | |
*** derekh has quit IRC | 12:46 | |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/elastic-recheck: Better query for bug 1260311 https://review.openstack.org/67446 | 12:49 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 12:49 |
*** dpyzhov has joined #openstack-infra | 12:51 | |
*** emagana has joined #openstack-infra | 12:52 | |
*** talluri has joined #openstack-infra | 12:53 | |
*** dstanek has joined #openstack-infra | 12:53 | |
*** CaptTofu has joined #openstack-infra | 12:55 | |
*** emagana has quit IRC | 12:56 | |
*** dstanek has quit IRC | 12:59 | |
*** salv-orlando has quit IRC | 13:02 | |
*** zz_ewindisch is now known as ewindisch | 13:02 | |
*** coolsvap has quit IRC | 13:09 | |
*** ewindisch is now known as zz_ewindisch | 13:09 | |
*** mrmartin has quit IRC | 13:09 | |
*** markmc has quit IRC | 13:11 | |
*** rakhmerov has joined #openstack-infra | 13:12 | |
*** zz_ewindisch is now known as ewindisch | 13:14 | |
*** rakhmerov has quit IRC | 13:16 | |
*** ewindisch is now known as zz_ewindisch | 13:18 | |
*** amotoki_ has joined #openstack-infra | 13:18 | |
*** amotoki has quit IRC | 13:20 | |
*** jcoufal has quit IRC | 13:21 | |
*** dizquierdo has quit IRC | 13:26 | |
*** mfink has quit IRC | 13:26 | |
*** dstanek has joined #openstack-infra | 13:29 | |
*** thomasem has joined #openstack-infra | 13:31 | |
*** hashar has quit IRC | 13:31 | |
chmouel | sdague: i was wondering if you were working on stable/grizzly issues as well? | 13:31 |
*** DinaBelova_ is now known as DinaBelova | 13:33 | |
*** dstanek has quit IRC | 13:34 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Better query for bug 1260311 https://review.openstack.org/67446 | 13:34 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 13:34 |
sdague | chmouel: trying to get your patch up now in a test env to try to help | 13:35 |
chmouel | sdague: i think there is a bit more than that, at least with euca2ools and boto being incompatible | 13:35 |
*** dims has quit IRC | 13:35 | |
sdague | chmouel: but we aren't running those anyway, right? | 13:36 |
chmouel | sdague: i think we still have failures in tempest.tests.boto.test_ec2_volumes.EC2VolumesTest.test_create_volume_from_snapshot | 13:37 |
*** pblaho has quit IRC | 13:37 | |
*** pblaho has joined #openstack-infra | 13:37 | |
chmouel | sdague: from https://review.openstack.org/#/c/67311/ | 13:37 |
chmouel | sdague: if i just rm -rf /usr/local/lib/**/*boto and rerun tempest it seems to work | 13:38 |
*** mfink has joined #openstack-infra | 13:38 | |
*** carl_baldwin has joined #openstack-infra | 13:40 | |
*** markmc has joined #openstack-infra | 13:40 | |
sdague | chmouel: so in that review I'm seeing volume failures unrelated to ec2 | 13:40 |
sdague | chmouel: http://logs.openstack.org/11/67311/2/check/check-tempest-dsvm-full/779c8f6/logs/screen-c-sch.txt.gz | 13:40 |
*** hashar has joined #openstack-infra | 13:41 | |
chmouel | sdague: oh yeah right, the ec2 runs but fails as you say due of the issue with cinder http://ep.chmouel.com:8080/Screenshots/2014-01-17__14-41-56.png | 13:42 |
*** nati_ueno has joined #openstack-infra | 13:42 | |
*** nati_ueno has quit IRC | 13:42 | |
russellb | so, based on the failure rates graph here, looks like failure rates are down a good bit today? http://status.openstack.org/elastic-recheck/ | 13:43 |
sdague | russellb: yes, I definitely think the concurrency reduction helped | 13:43 |
russellb | ok cool | 13:43 |
*** jcoufal-m has joined #openstack-infra | 13:43 | |
russellb | may take the weekend for the queues to recover a bit it seems | 13:44 |
*** nati_ueno has joined #openstack-infra | 13:44 | |
*** DinaBelova is now known as DinaBelova_ | 13:44 | |
sdague | yeh, there are still other kinds of fails going on, which we'll need to figure out | 13:44 |
*** julim has joined #openstack-infra | 13:44 | |
sdague | also need to get the word out that stable bits can't be put in the gate right now until we address the pip 1.5 issue on grizzly devstack | 13:44 |
*** jcoufal-m_ has joined #openstack-infra | 13:45 | |
sdague | which will kill a stable/havana change because of grenade | 13:45 |
*** jcoufal-m_ has quit IRC | 13:45 | |
*** jcoufal-m_ has joined #openstack-infra | 13:45 | |
sdague | chmouel: so the log for that run is confusing | 13:45 |
*** emagana has joined #openstack-infra | 13:45 | |
russellb | alrighty | 13:45 |
russellb | on to some other bugs then | 13:45 |
sdague | russellb: yep, and thanks for getting to the bottom of the load thing | 13:46 |
chmouel | sdague: yeah with my patch on my just rekicked test vm i definitely get netaddr updated properly: | 13:46 |
chmouel | ubuntu@devstack:~$ pip freeze|grep netaddr | 13:46 |
chmouel | Warning: cannot find svn location for distribute==0.6.24dev-r0 | 13:46 |
chmouel | netaddr==0.7.10 | 13:46 |
sdague | right, but something isn't right | 13:47 |
russellb | sdague: np | 13:47 |
fungi | mmm, dims is gone, but what he doesn't realize is that we're actually only on 1.525 for jenkins01, but we're also seeing the same java stack trace (the missing class master one) on jenkins02 which runs 1.543 | 13:47 |
sdague | chmouel: the fact that we pip install netaddr 6 times over the course of the console | 13:47 |
sdague | means pip keeps thinking there is a 0.7.5 to remove | 13:47 |
sdague | which is why cinder explodes | 13:47 |
*** dims has joined #openstack-infra | 13:48 | |
sdague | fungi: right, so we started classifying infra bugs in er yesterday (because our classification rate was down to 30%) | 13:48 |
*** nati_ueno has quit IRC | 13:48 | |
*** jcoufal-m has quit IRC | 13:49 | |
sdague | fungi: http://status.openstack.org/elastic-recheck/ - Bug 1260311 | 13:49 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 13:49 |
*** DinaBelova_ is now known as DinaBelova | 13:49 | |
sdague | it's the 3rd graph down | 13:49 |
sdague | of the er graphs | 13:49 |
sdague | it's so high and so frequent, I feel like we must be misunderstanding something | 13:49 |
chmouel | do we need to install python-netaddr from the packages first? | 13:50 |
*** emagana has quit IRC | 13:50 | |
*** emagana has joined #openstack-infra | 13:50 | |
fungi | right. it's like i was explaining to jog0, we can *either* have catchall buckets like "jenkins it breakybadz" or we can track specific problems, but please let's not try to use a bug with "we gots stack traces" to diagnose actual failures | 13:50 |
sdague | fungi: so, we can do it however you'd like to | 13:51 |
sdague | but realize that those are failure events in graphite | 13:51 |
sdague | so right now ~ 40% of graphite failures for gate jobs are infra | 13:51 |
sdague | for the last week | 13:51 |
*** salv-orlando has joined #openstack-infra | 13:52 | |
*** zul has quit IRC | 13:52 | |
fungi | well, i'm okay with catchall bucket bugs for that. and i'm fine with "jenkins stack trace" as an elastic-recheck pattern, but keep in mind that it's not going to assist much in diagnosing the underlying problem and the moment other devs start jumping in and trying to use the bug to that end, we're going to be running in circles chasing our tails | 13:52 |
*** dkliban has joined #openstack-infra | 13:52 | |
*** jcoufal-m_ has quit IRC | 13:53 | |
sdague | fungi: sure | 13:53 |
sdague | fungi: the point I'm trying to get at is: is that issue, which looks like a failure to launch at all, something that we already recover from? | 13:53 |
*** dcramer_ has quit IRC | 13:53 | |
fungi | that bug you linked has already collected stack trace details for two almost certainly unrelated issues, and dims was trying to use it to track down upstream bugs in jenkins. that's going to waste a lot of people's time | 13:53 |
*** yamahata has joined #openstack-infra | 13:53 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 13:54 | |
fungi | the *first* stack trace in that bug, from what we've seen, is the vm going missing between when it first talks to the jenkins master and when it gets assigned a job | 13:54 |
*** dprince has joined #openstack-infra | 13:55 | |
sdague | fungi: so we can work on getting these broken out, which is fine, this is a process | 13:55 |
fungi | the second stack trace in that bug is deeper in the slave agent, causing some manner of miscommunication with the master | 13:55 |
chmouel | it's a bit annoying that i can't reproduce on a clean precise vm :( tempest runs fine afterwards with my patch | 13:55 |
*** emagana has quit IRC | 13:55 | |
*** zul has joined #openstack-infra | 13:55 | |
*** emagana has joined #openstack-infra | 13:55 | |
fungi | sdague: we already had two separate bugs. i referred that comment back to the other bug | 13:56 |
*** markmc has quit IRC | 13:56 | |
sdague | fungi: ok, so we'll refine this. What I really want to know is are these gate resetting bugs, or are we actually autorecovering in zuul | 13:57 |
*** herndon_ has joined #openstack-infra | 13:57 | |
fungi | well, we have seen both those stack traces associated with job failures. that's not to say that they don't also appear when a job gets aborted/cancelled and we tear down the vm before jenkins is done processing the abort/cancellation | 13:58 |
sdague | fungi: so the current rates on those makes those the biggest cause of resets right now | 13:59 |
fungi | but i think in those cases we don't get logs into logstash, so if you're finding them there then these are likely jobs which did fail at some level | 13:59 |
sdague | fungi: this is datamining logstash | 13:59 |
fungi | right. that's what i figured | 13:59 |
sdague | so only if it gets to logstash, and is marked as FAILURE | 13:59 |
*** markmc has joined #openstack-infra | 13:59 | |
fungi | was there a job status of failure associated with those? | 13:59 |
sdague | build_status:FAILURE | 14:00 |
fungi | this keyboard is annoying me | 14:00 |
*** CaptTofu has quit IRC | 14:00 | |
*** CaptTofu has joined #openstack-infra | 14:00 | |
*** jcoufal has joined #openstack-infra | 14:01 | |
fungi | i do think it's probably not the biggest cause of actual gate resets though. the majority are going to be the one where the persistent slave is eaten by bug 1267364 and kills a lot of jobs at once, but we fix it by the time it's ejected one or two changes out of the gate (and the rest end up testing clean when the gate reset is done processing) | 14:02 |
uvirtbot | Launchpad bug 1267364 in openstack-ci "Recurrent jenkins slave agent failures" [Critical,In progress] https://launchpad.net/bugs/1267364 | 14:02 |
fungi | the continuing work to move our testing off persistent slaves is our current solution to that | 14:02 |
*** mfer has joined #openstack-infra | 14:03 | |
fungi | the incidence of it has gone way down in the past week from what i've seen (i've only had to offline one persistent slave in several days even under the heaviest load we've been seeing) | 14:03 |
fungi | it does still crop up for nonpersistent slaves, but they get torn down after impacting a single job rather than taking out dozens in a shooting-spree | 14:04 |
*** CaptTofu has quit IRC | 14:05 | |
*** annegent_ has joined #openstack-infra | 14:05 | |
*** smarcet has left #openstack-infra | 14:05 | |
sdague | fungi: http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiamF2YS5pby5JbnRlcnJ1cHRlZElPRXhjZXB0aW9uXCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIgIEFORCBtZXNzYWdlOlwiaHVkc29uLkxhdW5jaGVyJFJlbW90ZUxhdW5jaGVyLmxhdW5jaFwiIEFORCBidWlsZF9xdWV1ZTpnYXRlIiwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODk5Njc1MzI3MDl9 | 14:05 |
fungi | the combination of it mostly only cropping up when the jenkins masters are under heavy strain and the accompanying gate dynamics when we're under that sort of load make the ratio of full gate resets to individual job failures for that bug abnormally high (probably by orders of magnitude) | 14:06 |
IvanBerezovskiy | fungi, hi. Can I ask you a question about Cassandra and Hbase installation on CI nodes? | 14:06 |
sdague | 231 gate errors in the last 24 hrs | 14:06 |
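For context, the long logstash.openstack.org link above is a saved Kibana search; a count like "231 gate errors in the last 24 hrs" can be reproduced with a query along the following lines. This is only a sketch: the Elasticsearch endpoint path and the old-style "filtered" query form are assumptions, while the query string mirrors the one encoded in the link.

```python
# Count recent gate-queue hits for the hudson.Launcher stack-trace signature.
import json
import requests

QUERY = ('message:"java.io.InterruptedIOException" '
         'AND message:"hudson.Launcher$RemoteLauncher.launch" '
         'AND filename:"console.html" AND build_queue:"gate"')

body = {
    "query": {
        "filtered": {
            "query": {"query_string": {"query": QUERY}},
            "filter": {"range": {"@timestamp": {"gte": "now-24h"}}},
        }
    },
    "size": 0,  # we only want the hit count, not the documents
}

# endpoint path is an assumption about how the logstash site proxies Elasticsearch
resp = requests.post("http://logstash.openstack.org/elasticsearch/_search",
                     data=json.dumps(body))
print(resp.json()["hits"]["total"])
```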
fungi | IvanBerezovskiy: are you the one working to get it supported in ubuntu lts? | 14:06 |
*** jhesketh__ has quit IRC | 14:06 | |
fungi | sdague: how many were from centos6-1? | 14:06 |
*** markmcclain has joined #openstack-infra | 14:07 | |
fungi | that's the one which went wild last night while i was at dinner, and i had to put it down when i got back to the computer | 14:07 |
fungi | sdague: but i agree, we should take this as a sign to continue prioritizing a move to nonpersistent slaves for all non-privileged jobs | 14:09 |
sdague | sure | 14:09 |
sdague | 25 were tempest-dsvm-full | 14:09 |
sdague | so it's not just the unit test nodes | 14:09 |
fungi | good to know. those hopefully should have been only one job affected per slave experiencing that error | 14:10 |
dims | fungi, when you get a chance can i please have a stack trace from the 1.543 install for the JENKINS-19453 bug so i can try to match it to jenkins source to see if i can find something (per your comment #5 in bug 1260311) | 14:10 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 14:10 |
fungi | sdague: and that's for the stacktrace in 1267364, not the other one? | 14:10 |
fungi | dims: it's in the bug i linked | 14:10 |
*** prad has joined #openstack-infra | 14:11 | |
*** jaypipes has joined #openstack-infra | 14:11 | |
fungi | dims: oh, actually i guess it's not | 14:12 |
fungi | we only had them from the jenkins console, which expires out after 24 hours | 14:12 |
*** rakhmerov has joined #openstack-infra | 14:13 | |
*** NikitaKonovalov has quit IRC | 14:13 | |
sdague | fungi: http://logs.openstack.org/84/65184/4/gate/gate-tempest-dsvm-postgres-full/7c3f2bc/console.html is being classified as Bug 1260311 by jog0's query | 14:13 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 14:13 |
*** NikitaKonovalov has joined #openstack-infra | 14:13 | |
IvanBerezovskiy | fungi, as it was said here https://review.openstack.org/#/c/66884/ we can't use non-ubuntu mirrors, so i want to find another way to install these packages. My suggestion is to create a job for a single-use node like https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/files/jenkins_job_builder/config/storyboard.yaml . So it'll be a job with a shell script that'll install cassandra and hbase. What do you think? | 14:13 |
sdague | which, we can figure out if that's wrong | 14:13 |
*** yaguang has joined #openstack-infra | 14:14 | |
fungi | sdague: so there may be several different issues there | 14:14 |
jog0 | that query was taken straight from dims comment in the bug | 14:14 |
*** nati_ueno has joined #openstack-infra | 14:14 | |
sdague | fungi: sure, so we should narrow that out | 14:14 |
fungi | dims: the stacktrace we were seeing in both 1.525 and 1.543 is the java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins$MasterComputer one http://paste.openstack.org/show/60883/ | 14:15 |
sdague | I'm just concerned that we've had 40+ dsvm hits on that in the last 24hrs so resetting that way every 35 minutes seems very bad | 14:15 |
sdague | and could be a contributing factor to the zuul load | 14:15 |
fungi | sdague: right, this takes us back to "do we want a catchall bucket for people to recheck these against or should we have separate bugs for the different causes/events" | 14:15 |
sdague | fungi: what would you like? | 14:16 |
dims | fungi, the line numbers will be different between 1.525 and 1.543 - trying to figure out which stack trace came from which version | 14:16 |
*** jhesketh_ has quit IRC | 14:16 | |
fungi | IvanBerezovskiy: if it's for non-openstack jobs, that's fine. for openstack projects, all those jobs would fail any time that remote repository is unreachable/broken | 14:16 |
sdague | the er bug reporting is an art not a science, so we just want rules in there on how to categorize it. | 14:16 |
fungi | dims: ahh, i may not have captured the exact line numbers for one triggered from jenkins02 in that bug. we'd need to find a new slave exhibiting that failure from jenkins02 and get those details | 14:17 |
*** nati_ueno has quit IRC | 14:17 | |
*** rakhmerov has quit IRC | 14:17 | |
dims | thanks fungi i'll look for it as well | 14:18 |
sdague | fungi: is there better metadata in ES that we need to bin these? | 14:18 |
*** nati_ueno has joined #openstack-infra | 14:18 | |
jog0 | fungi: I am happy to split the bugs as you want | 14:18 |
*** zz_ewindisch is now known as ewindisch | 14:18 | |
jog0 | long as we are categorizing them under something I am happy | 14:18 |
fungi | sdague: i'm fine with catch-all bugs for elastic-recheck to use for infra problems, but we would still use separate infra bugs to work through the actual causes. in many cases, the bugs themselves will be solved before someone can add an accurate e-s pattern to match them | 14:19 |
ruhe | fungi: (on the topic started by IvanBerezovskiy), so the only option to test ceilometer backends, which aren't present in stable mirrors - is to get them (hbase and cassandra) supported in ubuntu lts? | 14:19 |
sdague | fungi: well that's not the case for at least 3 infra bugs right now | 14:19 |
jog0 | fungi: you won't like this query then: bug 1269940 | 14:19 |
uvirtbot | Launchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/1269940 | 14:19 |
fungi | ruhe: how do you expect people running plain ubuntu to test that on their own systems (particularly if they can't/won't install unvetted/insecure third-party packages)? | 14:20 |
*** sandywalsh has quit IRC | 14:20 | |
fungi | sdague: agreed. it ends up being the case for other infra bugs however | 14:21 |
*** rossella_s has joined #openstack-infra | 14:21 | |
fungi | jog0: i think it's like matching on "python traceback" | 14:21 |
jog0 | fungi: haha yup | 14:21 |
* dims realizes we need the jenkins01/02 info in logstash as well :) | 14:22 | |
jog0 | that is a catch all as a stop gap for classifying things | 14:22 |
fungi | "bug: we seem to be using python" | 14:22 |
*** amotoki_ has quit IRC | 14:22 | |
jog0 | so yes I agree its a really vague somewhat useless bug. so as we know more we can split the bug up | 14:23 |
ruhe | fungi: i understand your concern. the problem with these storage backends is that they only have vendor-managed repositories and no one wants to maintain them since they're complex software. i guess this topic should be discussed over email | 14:23 |
fungi | anyway, i need to step away for a few. i should learn not to start checking work e-mail and irc when i first wake up... it leads to me working half the morning from my bedroom and skipping breakfast as a result | 14:23 |
sdague | fungi: so I think that, given the windows of time where there aren't infra folks online, using er for real has value. Because bugs don't get fixed immediately | 14:23 |
*** yamahata has quit IRC | 14:23 | |
sdague | :) | 14:23 |
sdague | yeh, sorry about that | 14:23 |
*** yamahata has joined #openstack-infra | 14:23 | |
dims | fungi, :) | 14:23 |
chmouel | EmilienM: ping? | 14:24 |
EmilienM | chmouel: pong | 14:25 |
EmilienM | chmouel: here is good too, i usually talk about devstack on #openstack-qa though :-) | 14:25 |
*** dstanek has joined #openstack-infra | 14:25 | |
fungi | ruhe: i would argue that makes them immature software projects, and we should seek to help them improve that situation so that we *can* use them rather than just accepting that situation | 14:25 |
* fungi will bbiab | 14:25 | |
sdague | chmouel: yeh, lets take the grizzly devstack over to -qa | 14:26 |
EmilienM | chmouel: i was wondering about the cinder issue in devstack/havana; it's WIP by you and sdague, right? | 14:26 |
*** eharney has joined #openstack-infra | 14:28 | |
*** ryanpetrello has joined #openstack-infra | 14:29 | |
openstackgerrit | Nikita Konovalov proposed a change to openstack-infra/storyboard: Introducing basic REST API https://review.openstack.org/63118 | 14:30 |
*** herndon_ has quit IRC | 14:31 | |
*** nprivalova has joined #openstack-infra | 14:33 | |
*** sandywalsh has joined #openstack-infra | 14:33 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: add uncategorized failure generation code https://review.openstack.org/67267 | 14:35 |
*** mrmartin has joined #openstack-infra | 14:36 | |
*** pblaho has quit IRC | 14:36 | |
*** mrodden has quit IRC | 14:38 | |
*** dcramer_ has joined #openstack-infra | 14:39 | |
*** dansmith is now known as damnsmith | 14:40 | |
openstackgerrit | A change was merged to openstack-infra/reviewstats: Add --csv-rows option https://review.openstack.org/60115 | 14:42 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: add uncategorized failure generation code https://review.openstack.org/67267 | 14:42 |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 14:43 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 14:44 | |
openstackgerrit | Max Lobur proposed a change to openstack/requirements: Add futures library to global requirements https://review.openstack.org/66349 | 14:45 |
*** dizquierdo has joined #openstack-infra | 14:45 | |
openstackgerrit | Max Lobur proposed a change to openstack/requirements: Add futures library to global requirements https://review.openstack.org/66349 | 14:47 |
*** thuc has joined #openstack-infra | 14:49 | |
*** thuc_ has joined #openstack-infra | 14:49 | |
jog0 | was a bug filed for 'No distributions at all found for oslo.messaging>=1.2.0a11' ? | 14:50 |
jog0 | example: http://logs.openstack.org/82/64682/1/gate/gate-glance-pep8/f1dce31/console.html.gz | 14:50 |
*** beagles is now known as beagles_brb | 14:50 | |
*** mrodden has joined #openstack-infra | 14:51 | |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 14:51 |
jgriffith | EmilienM: Cinder issue in devstack/havana? | 14:51 |
*** fifieldt has joined #openstack-infra | 14:51 | |
EmilienM | jgriffith: yeah, the stuff you were talking about yesterday | 14:53 |
*** coolsvap has joined #openstack-infra | 14:53 | |
*** thuc has quit IRC | 14:53 | |
*** annegent_ has quit IRC | 14:53 | |
jgriffith | EmilienM: oh, but interesting it's only affecting Cinder now, which leads me to believe there's been a patch for other projects to address this? | 14:53 |
*** emagana_ has joined #openstack-infra | 14:54 | |
*** senk has joined #openstack-infra | 14:55 | |
*** russellb is now known as rustlebee | 14:55 | |
*** mrmartin has quit IRC | 14:55 | |
*** rakhmerov has joined #openstack-infra | 14:56 | |
*** jog0 is now known as flashgordon | 14:56 | |
*** oubiwann_ has joined #openstack-infra | 14:56 | |
*** emagana has quit IRC | 14:56 | |
*** marun has joined #openstack-infra | 14:57 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 14:57 | |
*** talluri has quit IRC | 14:57 | |
flashgordon | looks like this is the closest bug 1261253 | 14:59 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 14:59 |
*** dims is now known as dimsum | 15:00 | |
*** burt1 has joined #openstack-infra | 15:01 | |
*** Ajaeger has joined #openstack-infra | 15:01 | |
*** pblaho has joined #openstack-infra | 15:02 | |
fungi | aww, we lost jog0 now | 15:03 |
fungi | oh, wait, flashgordon | 15:03 |
fungi | flashgordon: the No distributions at all found for oslo.messaging>=1.2.0a11 is an interesting one | 15:03 |
fungi | flashgordon: that looks like pip 1.5 ignoring the -f | 15:04 |
fungi | i wish we had a pip --version and/or pip freeze at the end of that job | 15:05 |
*** talluri has joined #openstack-infra | 15:05 | |
*** esker has quit IRC | 15:06 | |
*** esker has joined #openstack-infra | 15:06 | |
*** esker has quit IRC | 15:06 | |
*** nicedice has joined #openstack-infra | 15:07 | |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Don't run non-voting gate-grenade-dsvm-neutron https://review.openstack.org/67485 | 15:08 |
flashgordon | fungi: casual nick friday in nova land | 15:08 |
flashgordon | sdague: ^ | 15:08 |
*** thedodd has joined #openstack-infra | 15:09 | |
flashgordon | fungi: logstash query message:"No distributions at all found for oslo.messaging>=1.2.0a11" AND filename:"console.html" | 15:09 |
flaper87 | fungi: another case where it fails in the gate and not locally: https://review.openstack.org/#/c/65499/ :( Do you think I can get access to one box? | 15:09 |
fungi | flashgordon: i got it, just slow. i've lost track of which days are which any more | 15:09 |
flaper87 | FWIW, I'm setting up an ubuntu saucy to test it too | 15:09 |
flashgordon | fungi: heh I am amazed you're still alive after this week | 15:09 |
*** nati_uen_ has joined #openstack-infra | 15:12 | |
*** jergerber has joined #openstack-infra | 15:12 | |
fungi | flaper87: which one? the py26 and py27 unit tests fail in entirely different ways (though also, no, can't really grant you access to the long-running 26 slave for infra policy reasons unless i completely tear down and replace it, and the 27 slave is a single-use node which was automatically deleted after it ran) | 15:12 |
flaper87 | fungi: py27 would've been enough. | 15:13 |
flaper87 | fungi: I'll set it up in my vm and see if I can replicate it | 15:14 |
*** nati_uen_ has quit IRC | 15:14 | |
dstufft | fungi: adding various --version invocations to things you're using is the best thing I learned from travis-ci tbh | 15:14 |
dstufft | it makes debugging things massively better | 15:14 |
*** nati_uen_ has joined #openstack-infra | 15:15 | |
*** nati_ueno has quit IRC | 15:15 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version https://review.openstack.org/67487 | 15:16 |
fungi | dstufft: yep, we do that in a lot of places | 15:17 |
fungi | just not ever enough places ;) | 15:17 |
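A small sketch of the "print your tool versions" habit dstufft describes, done from Python with pkg_resources; the package list is just an example of what a job might care about.

```python
# Log interpreter and key package versions at the start of a job run.
import sys
import pkg_resources

print("python %s" % sys.version.split()[0])
for name in ("pip", "setuptools", "oslo.messaging"):
    try:
        print("%s %s" % (name, pkg_resources.get_distribution(name).version))
    except pkg_resources.DistributionNotFound:
        print("%s (not installed)" % name)
```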
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template https://review.openstack.org/67489 | 15:19 |
*** emagana_ has quit IRC | 15:21 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3 https://review.openstack.org/67487 | 15:22 |
*** emagana has joined #openstack-infra | 15:22 | |
*** HenryG has joined #openstack-infra | 15:22 | |
flashgordon | fungi: what file do I touch to add branch name to logstash? | 15:23 |
flashgordon | re: master or stable/havana | 15:23 |
*** annegent_ has joined #openstack-infra | 15:24 | |
*** bookwar has left #openstack-infra | 15:24 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.2 https://review.openstack.org/67491 | 15:24 |
*** CaptTofu has joined #openstack-infra | 15:24 | |
*** rnirmal has joined #openstack-infra | 15:26 | |
fungi | flashgordon: do we not already index the zuul parameters for jobs in logstash? | 15:27 |
flashgordon | we have the build_ref | 15:27 |
*** gokrokve has joined #openstack-infra | 15:28 | |
*** IvanBerezovskiy has left #openstack-infra | 15:28 | |
*** annegent_ has quit IRC | 15:28 | |
flashgordon | would it be zuul_branch? | 15:29 |
fungi | flashgordon: i think that's probably what you want. remember in the context of our various integration tests there are multiple branches in play | 15:30 |
flashgordon | ohh nice zuul has docs | 15:30 |
fungi | and yes, zuul has very nice docs | 15:31 |
*** carl_baldwin has quit IRC | 15:31 | |
flashgordon | ' The target branch for the change that triggered this build | 15:31 |
flashgordon | fungi: if there is no zuul_change is there zuul_branch? | 15:32 |
*** carl_baldwin has joined #openstack-infra | 15:32 | |
clarkb | anteaya: salv-orlando: a -2 doesn't kick the change out of the gate. has a new patchset been pushed to it to kick it out of the gate? | 15:33 |
*** _NikitaKonovalov has joined #openstack-infra | 15:33 | |
clarkb | anteaya: salv-orlando: at this point it probably doesn't matter much as fungi and I are going to fork lift zuul and can simply not reverify that change | 15:33 |
fungi | flashgordon: i believe there is always a zuul_branch, yes (periodic bitrot jobs for example have no zuul_change but would still have a zuul_branch) | 15:34 |
*** mancdaz is now known as mancdaz_away | 15:34 | |
flashgordon | fungi: thanks | 15:34 |
*** kmartin has quit IRC | 15:34 | |
fungi | flashgordon: i'm going to double-check that though | 15:34 |
*** NikitaKonovalov has quit IRC | 15:34 | |
*** _NikitaKonovalov is now known as NikitaKonovalov | 15:34 | |
fungi | because now that i say it, i start to doubt myself | 15:34 |
flashgordon | heh thanks | 15:34 |
*** mancdaz_away is now known as mancdaz | 15:35 | |
*** kmartin has joined #openstack-infra | 15:35 | |
*** talluri has quit IRC | 15:35 | |
fungi | and that reminds me, some of the periodic jobs are still broken... need to track down where /opt/stack/new/devstack-gate/devstack-vm-gate.sh went: http://logs.openstack.org/periodic-qa/periodic-tempest-dsvm-all-havana/037442e/console.html | 15:36 |
*** marun has quit IRC | 15:36 | |
*** marun has joined #openstack-infra | 15:36 | |
*** jgrimm has joined #openstack-infra | 15:37 | |
*** annegent_ has joined #openstack-infra | 15:38 | |
mordred | morning fungi | 15:38 |
mordred | morning flashgordon clarkb | 15:38 |
fungi | morning mordred | 15:39 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Record build_branch in logstash https://review.openstack.org/67498 | 15:39 |
*** wenlock has joined #openstack-infra | 15:39 | |
flashgordon | fungi: ^ | 15:39 |
clarkb | morning | 15:39 |
flashgordon | sdague: ^ | 15:39 |
*** emagana has quit IRC | 15:39 | |
*** emagana has joined #openstack-infra | 15:40 | |
clarkb | fungi: I am mostly booted at this point and ready to do the zuul dance if you still think we should do that | 15:40 |
*** rcleere has joined #openstack-infra | 15:40 | |
*** davidhadas has joined #openstack-infra | 15:41 | |
fungi | dimsum: to your earlier question about identifying which jenkins master a job ran on, you can actually mine that out of the console log (though having it as a parameter would definitely be nice). the "Building remotely on" line hyperlinks to the appropriate jenkins master's webui | 15:41 |
fungi | clarkb: sure thing | 15:41 |
clarkb | zuul just merged a bunch of changes by the way. I think the d-g tempest concurrency change did have a drastic effect | 15:41 |
*** herndon has joined #openstack-infra | 15:41 | |
dimsum | fungi, y, just can't build a query that has the name of the jenkins host and snippet from hudson stack trace | 15:42 |
clarkb | https://jenkins02.openstack.org/job/gate-tempest-dsvm-full/6416/console seems to be a relatively common failure causing resets (but I haven't even looked at e-r just noticed that 404 is common to several test failures last night and this morning) | 15:42 |
*** esker has joined #openstack-infra | 15:43 | |
*** NikitaKonovalov is now known as NikitaKonovalov_ | 15:44 | |
*** bnemec is now known as beekneemech | 15:45 | |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 15:45 |
fungi | dimsum: in the meantime, the next rogue persistent slave i get failing jobs with that stack trace from jenkins02, i'll get the exact text including line numbers | 15:45 |
dimsum | fungi, cool | 15:45 |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3 https://review.openstack.org/67491 | 15:47 |
fungi | clarkb: so what's the zuul swap operation here? we snapshot the pipelines, kill zuul ungracefully, update a/aaaa records, copy over the queue dump to new-new-new-zuul, start the zuul service there, wait as necessary for the dns propagation, make sure jenkins masters are connecting to it, load the queue dumps and we're off to the races? | 15:48 |
*** carl_baldwin has quit IRC | 15:48 | |
*** carl_baldwin has joined #openstack-infra | 15:48 | |
clarkb | basically | 15:48 |
clarkb | we also need to check nodepool has connected to new zuul | 15:49 |
fungi | right, jenkins masters *and* nodepool | 15:49 |
fungi | good reminder | 15:49 |
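A rough sketch of the "snapshot the pipelines" step discussed here: read Zuul's status.json and dump every queued change with its pipeline, so the list can be fed to whatever re-enqueue mechanism is used once the new server is up. The URL and the exact JSON layout are assumptions based on the status page of this era.

```python
# Dump every change currently queued in Zuul, per pipeline, for later re-enqueueing.
import requests

status = requests.get("http://zuul.openstack.org/status.json").json()

snapshot = []
for pipeline in status.get("pipelines", []):
    for queue in pipeline.get("change_queues", []):
        for head in queue.get("heads", []):
            for item in head:
                snapshot.append((pipeline["name"], item.get("project"), item["id"]))

for pipeline_name, project, change in snapshot:
    # e.g. "gate openstack/nova 67374,1"; feed these to the re-enqueue step
    print("%s %s %s" % (pipeline_name, project, change))
```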
openstackgerrit | Ben Nemec proposed a change to openstack-dev/hacking: Enforce import group ordering https://review.openstack.org/54403 | 15:50 |
clarkb | 2 more changes can merge and there are a few check tests that can be reported but I am less concerned about the check tests | 15:51 |
*** JohanH has quit IRC | 15:51 | |
clarkb | but I think right around now is a decent time to do it as 2 changes will be merging and the gate is resetting otherwise | 15:52 |
*** senk has quit IRC | 15:52 | |
*** adrian_otto has joined #openstack-infra | 15:52 | |
fungi | oh, though the rather large queue lengths mean maybe we should gracefully stop it and wait for it to finish processing those? | 15:52 |
clarkb | fungi: that requires it fully processing everything in those queues which could take days | 15:53 |
clarkb | >_> | 15:53 |
fungi | we won't be able to copy over the event and result queues, right? | 15:53 |
clarkb | fungi: right | 15:53 |
fungi | it was down to 0/0 earlier | 15:53 |
clarkb | I suppose we can wait to see if those numbers fall shortly | 15:53 |
fungi | but it's started picking up now | 15:53 |
clarkb | it picked up during the last gate reset where the zuul main loop does nothing | 15:54 |
fungi | we caught a nova fail a couple changes from the head of the gate an hour or two ago and the delay that caused allowed the events/results to pile up | 15:54 |
fungi | yeah | 15:54 |
clarkb | normally that loop has a few iterations per second. during a gate reset it is one iteration every 15 or so minutes | 15:55 |
clarkb | another thing that occurred to me with back of napkin maths is that we only have enough slaves to run tests for ~64 changes concurrently | 15:55 |
salv-orlando | clarkb: I did first put a new patch set and then -2 it to ensure people did not approve it | 15:55 |
clarkb | salv-orlando: awesome, I missed that thanks | 15:55 |
clarkb | so we are battling the resets but also having only about 1/3 of the test resources we need to get out of the hole | 15:56 |
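A worked version of clarkb's back-of-napkin estimate; the node pool size, jobs-per-change count, and demand figure are stand-in assumptions chosen only to show the arithmetic, not measurements.

```python
# Back-of-napkin capacity estimate (all figures are illustrative assumptions).
nodes_available = 450    # devstack nodes nodepool can keep supplied at once
jobs_per_change = 7      # dsvm jobs each gate change needs concurrently

concurrent_changes = nodes_available // jobs_per_change
print(concurrent_changes)                       # ~64 changes under test at once

changes_needing_tests = 192                     # rough gate + check demand
print(concurrent_changes / float(changes_needing_tests))  # roughly 1/3 of what is needed
```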
*** jcoufal has quit IRC | 15:56 | |
*** adrian_otto has quit IRC | 15:56 | |
salv-orlando | But we've probably found out that all those unit test failures are related to an oslo change that went in yesterday | 15:56 |
clarkb | fungi: results queue is falling, under 100 now. I say we wait a handful of minutes to see if the events queue falls too | 15:56 |
*** pblaho has quit IRC | 15:56 | |
fungi | k | 15:57 |
mordred | fungi: from an hour ago, I would argue that it might also mean that distros haven't adapted to how some newer software operates and are trying to perpetuate a model that is more beneficial to their own processes than it is to solving today's problems | 15:57 |
fungi | clarkb: also i think 67186,1 and 67187,1 there are probably contributing to gate churn | 15:57 |
fungi | clarkb: since they're both stable branch changes | 15:58 |
clarkb | fungi: they would be then, we should omit them from the zuul reenqueue | 15:58 |
clarkb | fungi: oh other thing to do after we stop zuul, is to manually stop jobs in jenkinses so that nodepool can create new nodes | 15:58 |
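For the "manually stop jobs in jenkinses" step, a sketch of aborting in-progress builds of a single job through Jenkins' REST API so nodepool can recycle the nodes. The master URL, job name, and credentials are placeholders, and a real cleanup would loop over both masters and every job name (CSRF crumbs are ignored for brevity).

```python
# Abort any in-progress builds of one job on one Jenkins master.
import requests

MASTER = "https://jenkins02.openstack.org"
JOB = "gate-tempest-dsvm-full"
AUTH = ("admin", "secret")  # placeholder credentials

job = requests.get("%s/job/%s/api/json?tree=builds[number,building]" % (MASTER, JOB),
                   auth=AUTH).json()

for build in job["builds"]:
    if build["building"]:
        # POSTing to .../stop is the standard Jenkins "abort build" endpoint
        requests.post("%s/job/%s/%d/stop" % (MASTER, JOB, build["number"]), auth=AUTH)
        print("aborted %s #%d" % (JOB, build["number"]))
```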
*** annegent_ has quit IRC | 15:58 | |
clarkb | fungi: do you want to grab queue state, stop zuul, and update DNS while I kill jobs in jenkinses as quickly as I can? | 15:59 |
fungi | mordred: entirely possible, but in that case we need some serious reevaluation of our security support model | 15:59 |
mordred | fungi: I think we might need some serious reevaluation of our security support model | 16:00 |
clarkb | fungi: mordred: I am not seeing the context to security and distros | 16:00 |
clarkb | have a timestamp? | 16:00 |
mordred | because I'm not sure that the distro approach, which may involve staying on an old version of a piece of software that the otherwise very active upstream has stopped caring about, is the right thing to do | 16:00 |
fungi | clarkb: 14:13 utc | 16:00 |
*** dpyzhov has quit IRC | 16:01 | |
fungi | clarkb: our previous decisions not to install software from random third-party package repositories | 16:01 |
fungi | for testing official openstack projects | 16:01 |
mordred | cassandra has consistently not been a thing you really want to include in a distro - but I would not call it immature, even though I personally dislike many of their core devs | 16:01 |
clarkb | I see thanks | 16:01 |
mordred | they produce software intended for continuous deployment - and people who use it use it in those contexts - so manufacturing a 3-year stable release is just silly | 16:02 |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 16:03 |
*** yolanda has quit IRC | 16:03 | |
mordred | in fact - new thing from the CEO of redhat ... https://enterprisersproject.com/article/death-20 | 16:03 |
fungi | mordred: well, i didn't mean immature in a negative connotation. i meant the reasons free software is usually not packaged at all is one of 1. it's so new not enough people have interest in it yet, 2. it's not interesting in general or, 3. there are design issues with the software which make it too hard to package reliably/consistently | 16:03 |
mordred | talks about how even 6-month releases are getting to be too much | 16:03 |
*** nati_ueno has joined #openstack-infra | 16:03 | |
mordred | fungi: indeed. I'm mainly saying that I think that one of the design pieces of 3 might not be something you want to fix in some cases | 16:04 |
anteaya | clarkb: yes, salv-orlando submitted a new change to remove it from the gate, sorry I wasn't clear | 16:04 |
mordred | such as "the delivery model is intended for continual consumption" - which is actually more likely to be able to be dealt with at scale than a periodic release model | 16:04 |
anteaya | sorry salv-orlando already answered you | 16:05 |
clarkb | mordred: we should all start running arch | 16:05 |
fungi | mordred: i agree that some software stays well-tested enough that you can be reasonably assured of its reliability when drinking from the firehose. but there's also enough out there which still isn't, so the linux distributions play a useful role in shielding admins who don't want to discover yet another new software bug every morning when they get to work | 16:05 |
*** gokrokve has quit IRC | 16:05 | |
*** marun has quit IRC | 16:05 | |
mordred | fungi: totally | 16:05 |
mordred | I think that the distros can and do play a very useful role | 16:05 |
*** marun has joined #openstack-infra | 16:05 | |
*** gokrokve has joined #openstack-infra | 16:06 | |
*** nati_uen_ has quit IRC | 16:06 | |
clarkb | fungi: event queue isn't falling very quickly. I figure we give it a few more minutes but otherwise I feel like we should take a hatchet to it | 16:06 |
mordred | I'm just saying that a strict adherence to distro-packaged software may not necessarily be the right choice every time - which is a reversal of my traditional position | 16:06 |
mordred | fungi: I think that some things have changed in the high-volume/high-scale world and I don't think distro-world has caught up | 16:07 |
*** reed has joined #openstack-infra | 16:08 | |
clarkb | mordred: I agree, but I also think that projects need to provide something. eg a pip installable thing from pypi (we fail at this), because the 'put a jar file behind http without a sha1' approach that logstash takes, and our tarballs with similar problems, aren't very friendly | 16:08 |
fungi | mordred: well, i do agree, particularly since we're part of that ;) | 16:08 |
mordred | clarkb: +100 | 16:09 |
mordred | fungi: hehe | 16:09 |
mordred | I had the idea the other day that someone should upgrade apt-get so that it understood pip and mvn and npm and gem | 16:09 |
clarkb | fungi: I am going to step away for ~3 minutes then I say we go for it | 16:09 |
*** pasquier-s_ has quit IRC | 16:09 | |
mordred | so that you could perhaps do "apt-get install pip:python-novaclient" and it would do the right thing | 16:10 |
fungi | clarkb: sounds good. i need a quick coffee refill anyway | 16:10 |
clarkb | note we should reverify savanna changes first and omit stable/* changes | 16:10 |
fungi | mordred: you mean pip install apt:mysql-client | 16:11 |
*** gokrokve has quit IRC | 16:11 | |
fungi | ;) | 16:11 |
fungi | clarkb: which savanna changes? | 16:11 |
fungi | i probably missed them in scrollback | 16:11 |
*** vipul-away is now known as vipul | 16:12 | |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins host name to the logstash records https://review.openstack.org/67508 | 16:12 |
*** tangestani has joined #openstack-infra | 16:12 | |
dkranz | fungi: Any chance we can move https://review.openstack.org/#/c/63934/ (restoring fail on log errors) up in the queue? | 16:13 |
dkranz | I really don't want to see this fail because another log error crept in. | 16:13 |
fungi | dkranz: clarkb: let's move 63934 to the top of the gate list before we import it on the replacement zuul | 16:14 |
flashgordon | clarkb: btw only 5 hits on gate for the 404 issue you found | 16:14 |
*** afazekas_ has quit IRC | 16:14 | |
flashgordon | in last 7 days | 16:14 |
*** thuc has joined #openstack-infra | 16:14 | |
*** tangestani has quit IRC | 16:15 | |
fungi | flashgordon: for the console logs which got indexed anyway (scp plugin bug still lurking) | 16:15 |
flashgordon | fungi: ack, thats implied for everything | 16:15 |
flashgordon | 196 hits with check queue | 16:15 |
* fungi nods | 16:15 | |
* flashgordon files a bug | 16:15 | |
dimsum | Added a couple of reviews to grab the jenkins host name for logstash (https://review.openstack.org/#/c/67495/ https://review.openstack.org/#/c/67508/ ) | 16:15 |
* SergeyLukjanov triggered by savanna word used :) | 16:16 | |
fungi | dimsum: yep, saw those just now | 16:16 |
clarkb | fungi: in the gate queue | 16:16 |
clarkb | fungi: one last thing to check before we dive in, we should make sure that the zuul ref replication is disabled on new zuul and new new zuul | 16:17 |
clarkb | pretty sure jeblair dealt with that a week ago so all should be well | 16:17 |
*** thuc_ has quit IRC | 16:18 | |
fungi | clarkb: right, that was reverted in the zuul source. i'll check the clone on it | 16:18 |
clarkb | fungi: was it reverted in zuul source or just the config? | 16:18 |
*** thuc has quit IRC | 16:19 | |
fungi | oh... hrm | 16:19 |
fungi | right, it was the config | 16:19 |
*** lyle is now known as david-lyle | 16:19 | |
clarkb | I am logged into all 5 jenkins masters and ready to kill jobs | 16:20 |
clarkb | fungi: basically ready when you are | 16:20 |
fungi | i'm looking for the revert | 16:20 |
*** dizquierdo has quit IRC | 16:21 | |
*** anteaya is now known as tired | 16:21 | |
*** tired is now known as very_tired | 16:22 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: add bug metadata to graph list https://review.openstack.org/67510 | 16:22 |
clarkb | fungi: 0c8845494d308e8fedfd6e9890c5ea6cd2f85bdb | 16:22 |
clarkb | in config | 16:22 |
fungi | right, why couldn't i find that in the commit log? | 16:23 |
fungi | trying to do too many things at once | 16:23 |
clarkb | I did git log -p manifests/site.pp because I remembered it getting piped through there | 16:23 |
fungi | i don't see any reference to the git replication urls in zuul.conf on the new server | 16:24 |
fungi | hold on | 16:24 |
fungi | okay, sorry. local distraction | 16:25 |
fungi | so i missed why we need to reverify savanna changes if they're already in the gate | 16:26 |
*** esker has quit IRC | 16:26 | |
clarkb | fungi: isn't that how we restore the gate? | 16:26 |
*** BobBall is now known as BobBallAway | 16:26 | |
fungi | yeah, but don't we want to restore the whole gate, not just the savanna changes? | 16:26 |
*** vipul is now known as vipul-away | 16:27 | |
fungi | i'm clearly confused on some point | 16:27 |
clarkb | fungi: we do, just pointing out we want to reverify them first | 16:27 |
clarkb | so that their jobs queue up first as they are currently running | 16:27 |
fungi | oh, so they were causing some sort of disruption | 16:27 |
SergeyLukjanov | could I ask why savanna changes are so prio now? :) | 16:27 |
fungi | er, fixing some sort of disruption? | 16:27 |
clarkb | SergeyLukjanov: simply because they managed to run tests for half an hour and we are about to kill them | 16:27 |
clarkb | SergeyLukjanov: fungi: there is nothing special about those changes beyond their current position in the queue | 16:28 |
fungi | ahh, i see, you mean because they're in a different gate queue, so don't want to make them wait on available nodes | 16:28 |
clarkb | exactly | 16:28 |
*** Ajaeger has quit IRC | 16:29 | |
SergeyLukjanov | oh, see it too ;) | 16:29 |
SergeyLukjanov | thanks | 16:29 |
*** gyee_nothere has quit IRC | 16:29 | |
fungi | clarkb: and also prioritize 63934,3 so that we reduce the risk of more errors getting introduced before that merges | 16:29 |
clarkb | yup | 16:30 |
*** Ajaeger has joined #openstack-infra | 16:30 | |
clarkb | I am actually less worried about the stable/* jobs, I can push new patchsets to them in order to make an impression on the change approvers :) | 16:30 |
fungi | getting logged into rackspace and jenkins masters now | 16:30 |
*** gyee has joined #openstack-infra | 16:30 | |
clarkb | fungi: s/rackspace/nodepool/ ? | 16:31 |
*** marun has quit IRC | 16:31 | |
fungi | let's at least leave out 67186,1 and 67187,1 since we know about them and they're already relatively high up in the gate | 16:31 |
clarkb | fungi: k | 16:31 |
fungi | rackspace to make dns changes | 16:31 |
*** mrodden has quit IRC | 16:31 | |
clarkb | oh that | 16:31 |
*** vipul-away is now known as vipul | 16:31 | |
fungi | trying to reduce the zuul outage window as much as possible so we miss fewer patchset and approve events | 16:32 |
clarkb | ++ | 16:32 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 16:32 |
*** marun has joined #openstack-infra | 16:32 | |
*** krotscheck has joined #openstack-infra | 16:32 | |
clarkb | mordred: any chance you can statusbot us? | 16:33 |
*** MarkAtwood has joined #openstack-infra | 16:34 | |
fungi | the rackspace dns interface needs a filter | 16:34 |
clarkb | oh ya, otherwise it is tons of scrolling | 16:35 |
*** mrodden has joined #openstack-infra | 16:35 | |
fungi | okay, logged into the jenkins webuis, rackspace dashboard at the dns entries, cli on nodepool and both old and new zuul | 16:36 |
*** nati_uen_ has joined #openstack-infra | 16:36 | |
fungi | are you doing the zuul pipeline dump/restore, clarkb? | 16:36 |
clarkb | fungi: I thought you were :P I was going to kill jenkins jobs | 16:36 |
fungi | ahh, okay | 16:36 |
fungi | gimme a sec to refresh my memory on how that works | 16:37 |
clarkb | np, I believe the script is in zuul's tools dir | 16:37 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Record short_build_uuid in logstash/ElasticSearch https://review.openstack.org/67516 | 16:37 |
*** nati_uen_ has quit IRC | 16:37 | |
*** markwash has quit IRC | 16:37 | |
fungi | there was also a ~root/zuul-changes2.py left over from the last round | 16:38 |
clarkb | flashgordon: re ^ I am pretty sure you can match on the short uuid | 16:38 |
*** nati_uen_ has joined #openstack-infra | 16:38 | |
flashgordon | clarkb: sample query? | 16:38 |
clarkb | flashgordon: just search for build_uuid:someshortuuid | 16:38 |
clarkb | notice the lack of quotes | 16:38 |
*** markwash has joined #openstack-infra | 16:39 | |
mfer | fungi is there a place I can "subscribe" to get an update of the openstack in an sdk name? i don't want to bug you but I'm so darn curious. | 16:39 |
clarkb | fungi: oh right, you want that one as it uses the zuul rpc cli | 16:39 |
*** nati_ueno has quit IRC | 16:39 | |
clarkb | fungi: but the old one will work using the reverifies too (if you give reverify a bug) | 16:40 |
*** Ajaeger has quit IRC | 16:40 | |
flashgordon | build_uuid:2123b9a | 16:40 |
flashgordon | vs: build_uuid:2123b9a6a1464d41864e8436d5bf4397 | 16:41 |
flashgordon | short has no hits | 16:41 |
flashgordon | clarkb: ^ | 16:41 |
clarkb | flashgordon: sorry you need build_uuid:2123b9a* | 16:41 |
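(For reference, the same short-uuid wildcard can be run against the logstash Elasticsearch backend directly. A minimal sketch using the elasticsearch Python client; the endpoint and index pattern are assumptions for illustration, not the real infra values.)

```python
# Minimal sketch, assuming a reachable Elasticsearch endpoint and a
# logstash-style index pattern; neither is the real infra configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

# Lucene query string; the trailing '*' lets the short uuid match the full
# build_uuid values stored in the indexed documents (note: no quotes).
result = es.search(index="logstash-*", q="build_uuid:2123b9a*", size=10)

for hit in result["hits"]["hits"]:
    src = hit["_source"]
    print("%s %s" % (src.get("build_uuid"), src.get("message")))
```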
*** SergeyLukjanov is now known as SergeyLukjanov_ | 16:42 | |
flashgordon | clarkb: sweet! | 16:42 |
flashgordon | thanks | 16:42 |
*** adrian_otto has joined #openstack-infra | 16:43 | |
flashgordon | clarkb: here is another one https://review.openstack.org/#/c/67498/ | 16:43 |
fungi | clarkb: okay, so it's... for pipeline in check gate post ; do python zuul-changes2.py http://zuul.openstack.org $pipeline > $pipeline.sh ; done | 16:43 |
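(A rough sketch of what a pipeline dump script like zuul-changes2.py does: walk zuul's status.json and emit one re-enqueue command per change. The JSON layout and the `zuul enqueue` options shown here are assumptions based on zuul of that era, not a copy of the actual script.)

```python
# Rough sketch only: the status.json structure and the zuul CLI flags are
# assumptions for illustration, not the real zuul-changes2.py.
import json
import sys
import urllib.request

url, pipeline = sys.argv[1], sys.argv[2]
with urllib.request.urlopen("%s/status.json" % url) as resp:
    status = json.loads(resp.read().decode())

for p in status["pipelines"]:
    if p["name"] != pipeline:
        continue
    for queue in p["change_queues"]:
        for head in queue["heads"]:
            for change in head:
                # prints something like:
                # zuul enqueue --trigger gerrit --pipeline gate \
                #   --project openstack/nova --change 12345,2
                print("zuul enqueue --trigger gerrit --pipeline %s "
                      "--project %s --change %s"
                      % (pipeline, change["project"], change["id"]))
```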
*** markmcclain has quit IRC | 16:44 | |
*** markmcclain has joined #openstack-infra | 16:44 | |
*** gothicmindfood has joined #openstack-infra | 16:44 | |
clarkb | fungi: k | 16:44 |
*** senk has joined #openstack-infra | 16:44 | |
*** AaronGr_Zzz is now known as AaronGr | 16:45 | |
*** adrian_otto has left #openstack-infra | 16:45 | |
fungi | oh, it won't dump post | 16:45 |
fungi | because those aren't changes | 16:45 |
clarkb | oh right, I think we can get away with that here | 16:45 |
fungi | looking through a sample real quick so i can confirm the reordering/filtering we want to do on the gate | 16:46 |
clarkb | flashgordon: that change is technically fine. question about why it is necessary though. A bug fingerprint should indicate a bug regardless of branch, and a false positive due to branch should itself be a bug correct? | 16:47 |
*** davidhadas_ has joined #openstack-infra | 16:47 | |
flashgordon | two fold | 16:47 |
flashgordon | one is its easier when digging through logstash | 16:47 |
*** davidhadas has quit IRC | 16:47 | |
flashgordon | and two, if we *know* a bug is stable only we can prevent false positives | 16:48 |
*** senk has quit IRC | 16:49 | |
clarkb | preventing false positives that way masks other bugs though | 16:49 |
*** markwash has quit IRC | 16:49 | |
*** DennyZhang has joined #openstack-infra | 16:49 | |
*** markmcclain has quit IRC | 16:50 | |
clarkb | fungi: how does the sample work? I suppose you can just comment out the lines for changes we want to ignore | 16:50 |
fungi | clarkb: yep, i'm getting the reordering into the final command line too though | 16:50 |
clarkb | oh right for savanna :) | 16:50 |
flashgordon | clarkb: it won't mask bugs it will leave them as unclassified | 16:50 |
clarkb | and dkranz's change | 16:50 |
fungi | and the error filtering fix | 16:50 |
*** senk has joined #openstack-infra | 16:51 | |
clarkb | flashgordon: if you didn't filter on branch it would match all branches | 16:51 |
*** cp16net is now known as goofy-nick-frida | 16:51 | |
flashgordon | either way, having the data makes understanding logstash data easier. | 16:51 |
flashgordon | before writing the fingerprint | 16:51 |
*** goofy-nick-frida is now known as goofy-nic-friday | 16:51 | |
fungi | okay, have it the way i want it. making sure i can copy it quickly now | 16:52 |
clarkb | flashgordon: gotcha | 16:54 |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins master name to the logstash records https://review.openstack.org/67508 | 16:54 |
*** NayanaD has joined #openstack-infra | 16:55 | |
*** NayanaD is now known as San_D | 16:55 | |
fungi | all set. so dumping the check/gate pipelines and immediately stopping zuul | 16:56 |
fungi | ready? | 16:56 |
clarkb | I am ready | 16:56 |
*** sgrasley has joined #openstack-infra | 16:56 | |
fungi | done | 16:57 |
fungi | updating dns now | 16:57 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard. https://review.openstack.org/67520 | 16:57 |
clarkb | ok killing jenkins jobs now | 16:57 |
*** coolsvap has quit IRC | 16:57 | |
*** tma996 has quit IRC | 16:58 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard. https://review.openstack.org/67520 | 16:58 |
krotscheck | My bad, sorry | 16:58 |
krotscheck | That one's good | 16:58 |
fungi | clarkb: whups. the aaaa record had a one-hour ttl on it | 16:58 |
fungi | i should have double-checked that last night | 16:59 |
clarkb | fungi: that'll teach me | 16:59 |
*** coolsvap has joined #openstack-infra | 16:59 | |
clarkb | :/ | 16:59 |
zaro | morning | 17:00 |
*** mancdaz is now known as mancdaz_away | 17:00 | |
*** vkozhukalov has quit IRC | 17:00 | |
fungi | clarkb: should nodepool get restarted to connect to the new zuul? | 17:01 |
clarkb | fungi: I believe the gear lib should do automatic reconnection | 17:01 |
fungi | and is it safe to start new zuul and reenqueue changes now even though the jenkins masters aren't connected to it yet? | 17:02 |
*** davidhadas_ has quit IRC | 17:02 | |
clarkb | jenkins masters cannot connect to it until it has started, the geard lib is embedded | 17:03 |
zaro | clarkb: you in today? | 17:03 |
clarkb | I think you need to wait for at least one master to advertise its job list before reenqueuing | 17:03 |
clarkb | zaro: after the zuul stuff is done I had planned on trying to make it in | 17:03 |
zaro | clarkb: office i mean | 17:03 |
clarkb | yes | 17:03 |
fungi | clarkb: more to the point, i meant is it okay to reenqueue changes before the jenkins masters are connecting to the new zuul. i assume so | 17:04 |
*** yaguang has quit IRC | 17:04 | |
clarkb | fungi: I don't think so | 17:04 |
*** markwash has joined #openstack-infra | 17:04 | |
clarkb | fungi: zuul may report those jobs as lost since gearman won't know how to run those jobs | 17:04 |
fungi | ahh, right, jobs won't be registered | 17:05 |
clarkb | so I think we start new new zuul, then get at least one master to connect to it, then reenqueue | 17:05 |
zaro | clarkb: Azher asked for a meeting to help him get setup with zuul and jjb today. didn't know if you were interested to be on the call. | 17:05 |
zaro | clarkb: meeting will be at 11am pst | 17:06 |
clarkb | zaro: we'll see... | 17:06 |
*** hashar has quit IRC | 17:07 | |
clarkb | fungi: ok, jenkins masters have had their jobs killed | 17:07 |
fungi | jenkins01 seems to have established sockets to the gearman port on new zuul's ipv4 address. that's a good sign | 17:09 |
clarkb | fungi: nodepool is connected to 162.242.150.96:4730 | 17:09 |
clarkb | which I think is new zuul | 17:09 |
fungi | yep, checking the other masters still, but good so far | 17:09 |
*** sarob has joined #openstack-infra | 17:10 | |
fungi | jenkins.o.o has no gearman connections according to netstat | 17:10 |
fungi | the other masters are connected to new zuul though | 17:10 |
clarkb | cool /me looks at jenkins.o.o | 17:10 |
*** hashar has joined #openstack-infra | 17:11 | |
clarkb | fungi: I am going to try disabling then enabling a job on that host as that kicks the gearman plugin | 17:11 |
fungi | k | 17:12 |
*** fifieldt has quit IRC | 17:12 | |
clarkb | that hasn't appeared to help | 17:13 |
*** sarob_ has joined #openstack-infra | 17:13 | |
clarkb | I lied I think it worked | 17:13 |
fungi | we could just restart jenkins service entirely | 17:13 |
clarkb | oh it is talking to old gearman | 17:13 |
clarkb | yeah lets do that | 17:13 |
fungi | netstat -nt|grep 4730 shows nothing on jenkins.o.o | 17:13 |
*** obondarev_ has joined #openstack-infra | 17:13 | |
fungi | stopping it now | 17:14 |
clarkb | fungi: jenkins log shows it trying to talk to the .88 address | 17:14 |
fungi | starting | 17:14 |
fungi | right, i suspected that was why there were no established sockets | 17:14 |
ttx | Why oh WHY is Gerrit asking me to rebase | 17:14 |
fungi | there it goes | 17:14 |
*** sarob has quit IRC | 17:14 | |
ttx | https://review.openstack.org/#/c/67422/ | 17:15 |
ttx | and I'm rebasing and it doesn't really help | 17:15 |
notmyname | gate status graphs for common gate jobs + several projects http://not.mn/all_gate_status.html | 17:15 |
fungi | i see 8 connections to the right gearman server now | 17:15 |
clarkb | ttx: hold on | 17:15 |
fungi | clarkb: ready for me to reenqueue all the things then? | 17:15 |
clarkb | fungi: I think we should try reenqueueing one thing first | 17:15 |
* ttx holds (and drinks more) | 17:15 | |
clarkb | fungi: see ttx's question | 17:15 |
*** thuc has joined #openstack-infra | 17:15 | |
fungi | clarkb: will do | 17:16 |
clarkb | fungi: because something seems off but that may just be that he got zuul when it had no workers | 17:16 |
*** thuc_ has joined #openstack-infra | 17:16 | |
fungi | enqueued 63934,3 into the gate | 17:17 |
*** jooools has quit IRC | 17:17 | |
fungi | clarkb: zuul hasn't cloned any repos in /var/lib/zuul/git yet | 17:18 |
clarkb | fungi: it should do that automagically | 17:18 |
fungi | git clone -v ssh://jenkins@review.openstack.org:29418/openstack/neutron /var/lib/zuul/git/openstack/neutron' returned exit status 128: Host key verification failed. | 17:18 |
clarkb | oh that :) | 17:18 |
fungi | i guess it's not puppeted? | 17:18 |
clarkb | apparently not | 17:18 |
fungi | what file(s) do i need? | 17:18 |
*** rakhmerov has quit IRC | 17:19 | |
fungi | i'll grab them from old zuul | 17:19 |
fungi | ahh, right, i can just accept the host key | 17:19 |
clarkb | fungi: it would be for the zuul users known hosts file | 17:19 |
fungi | added | 17:20 |
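(For the record, one way that step could be pre-seeded non-interactively, e.g. by a future puppet change, might look like the sketch below; the home directory path and the keyscan approach are assumptions, not what was actually done here.)

```python
# Sketch only: append review.openstack.org's gerrit ssh host key to the zuul
# user's known_hosts so the first ssh-based git clone does not fail host key
# verification. The home directory path is an assumption.
import subprocess

keys = subprocess.check_output(
    ["ssh-keyscan", "-p", "29418", "review.openstack.org"])
with open("/home/zuul/.ssh/known_hosts", "ab") as known_hosts:
    known_hosts.write(keys)
```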
fungi | should i restart zuul>? | 17:20 |
clarkb | no, just try reenqueueing that one change | 17:20 |
clarkb | zuul will do clones on the fly if necessary | 17:20 |
*** thuc has quit IRC | 17:20 | |
fungi | worked | 17:21 |
*** sarob_ has quit IRC | 17:21 | |
*** beagles_brb is now known as beagles | 17:21 | |
clarkb | but still failed to merge | 17:21 |
*** kiall has quit IRC | 17:21 | |
fungi | though i seem to have the old ipv6 address lodged deep within my browser | 17:21 |
*** sarob has joined #openstack-infra | 17:21 | |
fungi | it did check out the project for that change though | 17:21 |
clarkb | yup | 17:22 |
fungi | UnboundLocalError: local variable 'repo' referenced before assignment | 17:22 |
fungi | zuul bug? | 17:22 |
clarkb | yup must be | 17:23 |
fungi | i assume restarting zuul daemon is the best course of action for now? | 17:23 |
*** kiall has joined #openstack-infra | 17:24 | |
clarkb | ya why don't we do that | 17:24 |
*** jp_at_hp has quit IRC | 17:24 | |
clarkb | fungi: oh wait | 17:24 |
*** senk has quit IRC | 17:25 | |
clarkb | git doesn't know who we are | 17:25 |
*** yolanda has joined #openstack-infra | 17:25 | |
clarkb | that should really be puppeted. on old zuul the zuul user's gitconfig was set to set the name and email | 17:25 |
clarkb | we should do that by hand on new new zuul | 17:25 |
fungi | fixing | 17:25 |
*** sarob_ has joined #openstack-infra | 17:25 | |
clarkb | then document it needs puppeting | 17:25 |
*** gokrokve has joined #openstack-infra | 17:26 | |
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template https://review.openstack.org/67489 | 17:26 |
clarkb | looks like new zuul has a ~zuul/.gitconfig as well | 17:26 |
fungi | it does now | 17:26 |
clarkb | ok now try reenqueueing 63934 | 17:26 |
*** pballand has joined #openstack-infra | 17:27 | |
*** sarob__ has joined #openstack-infra | 17:27 | |
mordred | clarkb: ++ to puppeting | 17:28 |
*** sarob has quit IRC | 17:28 | |
clarkb | fungi: zuul is cloning all the things | 17:28 |
clarkb | which is something to note about using a tmpfs: if we don't prepopulate it, zuul startup will be a bit slower than before | 17:29 |
*** markwash has quit IRC | 17:29 | |
fungi | yeah, i expected that | 17:30 |
fungi | but that's just on reboot of the server | 17:30 |
clarkb | yup | 17:30 |
* mordred is excited about our new tmpfs overlord | 17:30 | |
*** sarob_ has quit IRC | 17:30 | |
fungi | hmmm, my bsd firewall here is segfaulting my shell | 17:30 |
clarkb | ok stuff is queueing | 17:30 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add noirc option to bot https://review.openstack.org/67525 | 17:31 |
fungi | i might not be long for this internet if my dhclient segfaults too | 17:31 |
fungi | ready for me to enqueue everything else then? | 17:31 |
sdague | cool, check queue going to refill automatically? | 17:31 |
clarkb | fungi: :( where did you stash your preserved queues? | 17:31 |
clarkb | sdague: yup, fungi grabbed check and gate queue state we just need to apply them now | 17:31 |
fungi | my homedir, though the first entry in the gate.sh is redundant now. fixing | 17:31 |
sdague | clarkb: do you have a list of promote bits from markmcclain | 17:31 |
ttx | fungi: any idea why i'm asked to rebase stuff ? | 17:32 |
clarkb | sdague: no I do not | 17:32 |
ttx | I rebased on HEAD and that doesn't work either | 17:32 |
*** nati_ueno has joined #openstack-infra | 17:32 | |
ttx | https://review.openstack.org/#/c/67422/ | 17:32 |
fungi | clarkb: ready for me to requeue all the things? | 17:32 |
clarkb | ttx: yes, we just moved zuul to new host with a tmpfs /var/lib/zuul/git to speed up the zuul git operations. When we did that we discovered that puppet did not configure git for zuul properly | 17:32 |
ttx | hah. | 17:32 |
clarkb | fungi: I think so | 17:32 |
clarkb | ttx: we fixed that by hand and have noted that we need to automate it, you should not be asked to rebase anymore | 17:33 |
ttx | clarkb: any ETA on fix ? Should I stay online for the next 5 min or come back in two hours ? | 17:33 |
clarkb | ttx: we just fixed it | 17:33 |
fungi | clarkb: it's running under a screen session for the root user now | 17:33 |
fungi | in case i disappear | 17:33 |
ttx | clarkb: hmm, but how do I push the change AGAIN | 17:33 |
fungi | ttx: recheck or reverify | 17:33 |
fungi | ttx: or reapprove | 17:34 |
clarkb | ttx: you shouldn't need to, the existing patchset should be fine | 17:34 |
clarkb | fungi: looks like python26 slaves/jobs are having trouble ;( | 17:34 |
*** fbo is now known as fbo_away | 17:34 | |
sdague | https://etherpad.openstack.org/p/montreal-code-sprint - under Parallel | 17:34 |
fungi | clarkb: i still can't see the new status page because of my resolver cache | 17:34 |
clarkb | fungi: I am going to disable then enable jobs on jenkins01 and 02 to rekick gearman | 17:34 |
ttx | clarkb: except Jenkins -2'ed it already. I reverified it. We'll see how it goes. Thanks! | 17:34 |
fungi | trying to call rndc flushname was how i discovered my firewall is in trouble | 17:34 |
sdague | but unfortunately markmcclain isn't here at the moment | 17:34 |
fungi | clarkb: okay | 17:35 |
sdague | I guess we'll just wait until he builds a list when he gets back | 17:35 |
fungi | clarkb: do we think gearman plugin didn't reconnect to zuul properly when we restarted the service? | 17:35 |
*** nati_uen_ has quit IRC | 17:35 | |
clarkb | fungi: I think there is a bug in gearman client where it doesn't register all of its jobs | 17:35 |
fungi | ahh | 17:36 |
clarkb | er gearman plugin not client | 17:36 |
*** nati_ueno has quit IRC | 17:36 | |
clarkb | fungi: you should edit your /etc/hosts :P to get zuul status | 17:36 |
fungi | clarkb: i'm going to | 17:36 |
*** nati_ueno has joined #openstack-infra | 17:36 | |
clarkb | fungi: https://jenkins01.openstack.org/job/gate-cinder-python27/5053/console | 17:36 |
clarkb | not sure why that is happening | 17:37 |
fungi | clarkb: oh, wait, i'm not resolving zuul incorrectly. the status page just seems to be broken for some reason | 17:37 |
fungi | oh, or maybe i am | 17:38 |
*** rakhmerov has joined #openstack-infra | 17:38 | |
clarkb | oh I wonder if the test slaves have the ipv6 address cached | 17:38 |
clarkb | I can fetch the ref that the gate-cinder-ypthon27 job failed to fetch | 17:38 |
*** rakhmerov has joined #openstack-infra | 17:38 | |
fungi | there we go. had to clear my browser cache too | 17:39 |
*** yassine has quit IRC | 17:39 | |
*** senk has joined #openstack-infra | 17:39 | |
*** yassine has joined #openstack-infra | 17:40 | |
fungi | clarkb: hmmm, you mean like maybe a local dnscache daemon on the slaves? | 17:40 |
clarkb | ya | 17:40 |
fungi | that might be a centos thing, agreed | 17:40 |
*** DennyZha` has joined #openstack-infra | 17:40 | |
clarkb | that is a python27 job | 17:40 |
*** DennyZhang has quit IRC | 17:41 | |
*** tjones has joined #openstack-infra | 17:41 | |
clarkb | I think we are mostly good now, just need to ride out the hiccups | 17:42 |
*** ruhe is now known as _ruhe | 17:42 | |
*** yassine has quit IRC | 17:42 | |
*** yassine has joined #openstack-infra | 17:43 | |
*** praneshp has joined #openstack-infra | 17:43 | |
*** yassine has quit IRC | 17:43 | |
*** hashar has quit IRC | 17:43 | |
clarkb | though the enqueue seems to not update zuul status? debug.log shows many jobs starting implying the enqueue is working but status doesn't reflect that for me | 17:44 |
fungi | maybe those are still in the event queue? | 17:44 |
clarkb | looks like the Run handler has only woken twice in the last 10 minutes, I think using the rpc to enqueue may act like a gate reset and hold everything up while it does its work | 17:45 |
*** yassine has joined #openstack-infra | 17:45 | |
*** yassine has quit IRC | 17:45 | |
*** sarob has joined #openstack-infra | 17:45 | |
*** DennyZha` has quit IRC | 17:45 | |
*** sarob has quit IRC | 17:45 | |
*** sarob has joined #openstack-infra | 17:45 | |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Added apache license to footer https://review.openstack.org/67347 | 17:46 |
*** mattray has joined #openstack-infra | 17:47 | |
*** yamahata has quit IRC | 17:48 | |
*** sarob__ has quit IRC | 17:48 | |
fungi | worth noting, the server is basically idle cpu-wise | 17:48 |
fungi | so this has to be network-related delays, right? | 17:48 |
clarkb | or the enqueue isn't doing what we expect | 17:49 |
fungi | 2014-01-17 17:49:27,357 INFO zuul.Gerrit: Updating information for 67333,4 | 17:50 |
*** sarob_ has joined #openstack-infra | 17:50 | |
fungi | maybe gerrit's getting firebombed | 17:50 |
*** talluri has joined #openstack-infra | 17:50 | |
mordred | load is fine on gerrit | 17:50 |
fungi | yep | 17:50 |
clarkb | http://paste.openstack.org/show/61460/ | 17:50 |
clarkb | I think gearman function registering is not working so well. I will enable disable on all jenkins masters | 17:51 |
fungi | okay | 17:51 |
*** harlowja_away is now known as harlowja | 17:51 | |
*** mrodden1 has joined #openstack-infra | 17:52 | |
clarkb | have done 1 2 3 and 4 doing jenkins.o.o now | 17:52 |
fungi | clarkb: you also have a thing of some kind in 8 minutes, right? if you need to stop, i can work through the rest of this | 17:52 |
*** mrodden has quit IRC | 17:52 | |
*** sarob has quit IRC | 17:52 | |
clarkb | fungi: well its a meeting thing. I should be able to give you a bit of time | 17:53 |
*** rnirmal has quit IRC | 17:53 | |
fungi | k | 17:53 |
clarkb | all jenkinses should have reregistered their gearman functions | 17:53 |
fungi | load on gerrit is spiking, so we did something | 17:53 |
clarkb | ok, going to watch tail -f /var/log/zuul/debug.log | grep ERROR | 17:54 |
fungi | same thing i'm doing | 17:54 |
*** sarob_ has quit IRC | 17:55 | |
*** sarob has joined #openstack-infra | 17:55 | |
clarkb | we seem to still be hitting ERROR zuul.Gearman: Exception while checking functions | 17:56 |
clarkb | for that same set_description job | 17:56 |
clarkb | zaro: any idea why that is happening? | 17:56 |
clarkb | Function set_description:jenkins01.openstack.org is not registered | 17:56 |
zaro | clarkb: i think i've got the scp-plugin patch ready. but i have a few meetings now, so will discuss with you after 1pm. | 17:57 |
*** jerryz has joined #openstack-infra | 17:57 | |
clarkb | zaro: sure, can you take a quick look at ^ | 17:57 |
*** jerryz has quit IRC | 17:57 | |
*** jerryz has joined #openstack-infra | 17:57 | |
zaro | clarkb: yeah. let me find that in the code during my meeting. | 17:57 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1261253 https://review.openstack.org/67539 | 17:58 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 17:58 |
*** sarob_ has joined #openstack-infra | 17:59 | |
*** yolanda has quit IRC | 17:59 | |
fungi | seen a couple timeout errors since... gate-tempest-dsvm-neutron-large-ops and gate-ceilometer-pep8 | 17:59 |
*** sarob has quit IRC | 18:00 | |
fungi | er, the jobs were probably unrelated | 18:01 |
fungi | Exception while checking functions | 18:01 |
*** sarob has joined #openstack-infra | 18:01 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 18:01 |
clarkb | fungi: ya, those exceptions seem to be timeout errors | 18:01 |
zaro | clarkb: is stop function registered? | 18:02 |
fungi | in connection.sendAdminRequest | 18:02 |
clarkb | zaro: fungi: I am not sure if stop function is regisered but /var/log/zuul/gearman-server.log shows errors around getting its status | 18:03 |
clarkb | zaro: fungi: that looks like a possible geard bug | 18:03 |
*** sarob__ has joined #openstack-infra | 18:04 | |
mordred | clarkb, fungi: I've been floating in and out - please ping me if I can be useful to your brains | 18:04 |
*** odyssey4me has quit IRC | 18:04 | |
clarkb | fungi: zaro: I think zuul slowness may be due to those timeouts, it is waiting and waiting and well waiting | 18:05 |
clarkb | should we possibly try restarting zuul to begin a new geard? | 18:05 |
*** sarob_ has quit IRC | 18:05 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 18:05 |
fungi | clarkb: i can do that and reenqueue it all again | 18:05 |
fungi | clarkb: should we include a brief wait for jenkins masters to reconnect to the gearman service? | 18:06 |
*** sarob has quit IRC | 18:06 | |
clarkb | fungi: yes, I think so | 18:06 |
fungi | okay, killing zuul now | 18:06 |
clarkb | fungi: well a wait before reenqueing | 18:06 |
fungi | yeah, that's what i meant | 18:06 |
clarkb | fungi: the gearman service is a child of the zuul service so you start them both with the zuul init script | 18:07 |
fungi | how long do you think is sane? | 18:07 |
clarkb | half a minute is probably plenty | 18:07 |
fungi | k | 18:07 |
clarkb | fungi: you can telnet localhost 4730 and run send status to the socket | 18:07 |
clarkb | that should return a giant list of everything ever | 18:07 |
fungi | right now it returns nothing | 18:08 |
clarkb | just 'status' returns nothing? | 18:08 |
fungi | oh you said run send status | 18:08 |
clarkb | gah my bad | 18:08 |
fungi | yeah, status returns a ton | 18:09 |
clarkb | the command is just 'status' | 18:09 |
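(The same check works without telnet; a small sketch that speaks the gearman admin protocol over a plain socket, using the host/port discussed above. The per-line format noted in the comment is from the gearman protocol docs.)

```python
# Minimal sketch: ask geard for its status over the admin protocol. Each
# response line is roughly "function<TAB>total<TAB>running<TAB>workers" and
# the listing ends with a line containing a single '.'.
import socket

sock = socket.create_connection(("localhost", 4730), timeout=10)
sock.sendall(b"status\n")

data = b""
while not data.endswith(b".\n"):
    chunk = sock.recv(4096)
    if not chunk:
        break
    data += chunk
sock.close()

for line in data.decode().splitlines():
    print(line)
```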
fungi | though it picked up a nova change in the check pipeline already and marked a gate-nova-python26 as lost | 18:09 |
clarkb | fungi: then before reenqueueing the world I think we try enqueueing one change again. and tail zuul/debug.log and zuul/gearman-server.log | 18:09 |
*** galstrom has joined #openstack-infra | 18:09 | |
clarkb | fungi: :( | 18:09 |
clarkb | fungi: I wonder if that means jenkins* but not jenkins01 and jenkins02 have registered their functions | 18:10 |
clarkb | as only 01 and 02 can run the python26 jobs | 18:10 |
fungi | well, i've reenqueued the devstack-gate change we had at the top before | 18:11 |
fungi | but it has no py26 jobs | 18:11 |
*** sarob__ has quit IRC | 18:11 | |
fungi | ERROR zuul.Gearman: Job <gear.Job 0x7fbe68147690 handle: None name: build:gate-trove-python27 unique: 247c5ef1806f4581ac54f8b7cb31e8b3> is not registered with Gearman | 18:11 |
clarkb | fungi: how does zuul/gearman-server.log look? are there any recent tracebacks for the stop job? | 18:11 |
*** sarob has joined #openstack-infra | 18:11 | |
clarkb | why is gearman so cranky | 18:12 |
*** pballand has quit IRC | 18:12 | |
fungi | 2014-01-17 18:06:23 [...] KeyError: 'stop:jenkins01.openstack.org' | 18:12 |
clarkb | so that is from before the restart correct? | 18:12 |
fungi | checking | 18:13 |
fungi | 18:06 was the start | 18:13 |
*** Ajaeger has joined #openstack-infra | 18:14 | |
fungi | ahh, stopped at 18:06:34 | 18:15 |
zaro | fungi: is that from jenkins gearman plugin? | 18:15 |
Ajaeger | What is a "LOST" failure for a gate? https://review.openstack.org/#/c/67493/ | 18:15 |
fungi | started at 18:06:49 | 18:15 |
fungi | Ajaeger: us | 18:15 |
fungi | so that keyerror was from before i stopped it | 18:15 |
Ajaeger | fungi: ok, I'll let you fix it ;) | 18:15 |
*** herndon has quit IRC | 18:16 | |
*** sarob has quit IRC | 18:16 | |
*** yamahata has joined #openstack-infra | 18:16 | |
clarkb | fungi: in that status listing does jenkins01 or jenkins02 show up at all? | 18:17 |
*** hogepodge has joined #openstack-infra | 18:17 | |
fungi | clarkb: i just tried reenqueuing a savanna change and got this in the log... | 18:18 |
fungi | 2014-01-17 18:16:52,521 WARNING zuul.Scheduler: Build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> not found by any queue manager | 18:18 |
clarkb | fungi: ya that is resulting in LOST builds | 18:18 |
fungi | ERROR zuul.DependentPipelineManager: Exception while canceling build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> for change <Change 0x7fbe60456410 66554,4> | 18:18 |
clarkb | it couldn't cancel it because there was no job I bet | 18:18 |
fungi | oh, wait, i need the non-cancel errors | 18:19 |
fungi | there | 18:19 |
fungi | 2014-01-17 18:16:52,401 ERROR zuul.Gearman: Job <gear.Job 0x7fbe602e1210 handle: None name: build:gate-savanna-python26 unique: 8363618cf6394cf4bfc5e2596c900e09> is not registered with Gearman | 18:19 |
clarkb | ya, that means the jenkins masters never registered that function with the geard daemon | 18:19 |
clarkb | fungi: perhaps look at jenkins logs on 01 and 02 to see if the gearman plugin is puking? | 18:20 |
salv-orlando | fungi, clarkb: sorry for the interruption - I assume it's not yet ok to start approving again patches? | 18:20 |
clarkb | salv-orlando: ya not quite yet, we have run into unexpected trouble with gearman | 18:20 |
fungi | and now status on port 4730 returns nothing | 18:20 |
salv-orlando | clarkb: will keep lurking waiting for a go-ahead | 18:20 |
clarkb | fungi: o_O how does the gearman-server.log look? | 18:21 |
fungi | 2014-01-17 18:20:54,214 ERROR gear.BaseClientServer: Exception in poll loop | 18:21 |
fungi | KeyError: 'stop:jenkins03.openstack.org' | 18:21 |
*** salv-orlando has quit IRC | 18:22 | |
*** marun has quit IRC | 18:22 | |
*** marun has joined #openstack-infra | 18:22 | |
fungi | quite a few, but all for jenkins02 | 18:22 |
fungi | er, for jenkins03 | 18:22 |
clarkb | fungi: out of curiosity how does the version of gear compare on new zuul and new new zuul | 18:23 |
fungi | oh crap, this is what you ran into last time | 18:23 |
clarkb | ya | 18:23 |
clarkb | so restarting it didn't help | 18:23 |
fungi | pip freeze says gear==0.5.0 | 18:23 |
*** herndon has joined #openstack-infra | 18:24 | |
fungi | same as on old zuul | 18:24 |
fungi | also, we have newer statsd on new zuul | 18:25 |
fungi | separate problem | 18:25 |
fungi | i've downgraded statsd while i'm thinking about it | 18:26 |
clarkb | next crazy idea, stop jenkinses, bring up one at a time in a relatively slow manner allowing each to register with gearman without thrash | 18:26 |
fungi | okay, doing | 18:26 |
fungi | sounds sane enough to me | 18:26 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 18:27 | |
*** hogepodge has quit IRC | 18:28 | |
*** aude has joined #openstack-infra | 18:30 | |
clarkb | fungi: and check the gearman plugin versions are consistent across jenkinses, pretty sure jeblair ran into that though and made them consistent | 18:30 |
*** max_lobur is now known as max_lobur_afk | 18:30 | |
*** hogepodge has joined #openstack-infra | 18:30 | |
*** nati_uen_ has joined #openstack-infra | 18:31 | |
fungi | will do. also deleting offline slaves, including long-running ones, so they don't get brought back online when jenkins restarts. i'll note them here | 18:31 |
*** CaptTofu has quit IRC | 18:31 | |
*** smurugesan has joined #openstack-infra | 18:31 | |
*** kgriffs has joined #openstack-infra | 18:33 | |
*** nati_ueno has quit IRC | 18:33 | |
*** luqas has quit IRC | 18:35 | |
*** marun has quit IRC | 18:35 | |
*** marun has joined #openstack-infra | 18:35 | |
*** jaypipes has quit IRC | 18:36 | |
fungi | centos6-1, precise{1,11,13,17,19,21,27,29,3,37,39,7,14,16,34,38,4,40,8} | 18:37 |
clarkb | wow that is a lot of slaves | 18:37 |
*** hogepodge has quit IRC | 18:37 | |
clarkb | I spot checked gearman plugin versions and they all look consistent and are 0.0.4.2.ad75b7e | 18:37 |
fungi | yeah, jenkins masters have been so loaded they're failing out slaves right and left | 18:37 |
fungi | i'm still deleting offline nodepool nodes on 03 and 04, but i'll begin restarting jenkins services one at a time on the other masters | 18:39 |
clarkb | k | 18:39 |
clarkb | fungi: if you tail the jenkins.log for the masters as they come up you should see it registering gearman functions. you can use that to get a sense for what is being registered and how long it takes | 18:39 |
fungi | INFO: ---- Worker pypi.slave.openstack.org_exec-0 registering 184 functions | 18:43 |
fungi | clarkb: ^ that? | 18:43 |
clarkb | yeah | 18:43 |
clarkb | it should happen for all the workers and go on and on. the lists are fairly large, which is why I wonder if geard or the gearman plugin may not keep up | 18:43 |
fungi | so, status is still returning absolutely nothing from the gear socket on new zuul, fwiw | 18:43 |
clarkb | really | 18:43 |
fungi | a few minutes after starting jenkins on jenkins.o.o | 18:44 |
fungi | making me wonder if the geard is kaput | 18:44 |
clarkb | ya | 18:44 |
clarkb | oh I bet status fails due to that keyerror | 18:44 |
clarkb | and once that happens geard is kaput | 18:44 |
fungi | so stop jenkins.o.o again, restart zuul, then start jenkins again? | 18:45 |
clarkb | sure? | 18:45 |
*** lucasagomes has joined #openstack-infra | 18:46 | |
*** lucasagomes has left #openstack-infra | 18:46 | |
fungi | status is working now | 18:47 |
*** _ruhe is now known as ruhe | 18:48 | |
*** herndon has quit IRC | 18:50 | |
clarkb | ermagerd 67025 is running python26 job | 18:50 |
*** smarcet has joined #openstack-infra | 18:50 | |
clarkb | fungi: I wonder, could the reenqueue thing that speaks rpc be breaking zuul/geard because of some bug? | 18:51 |
fungi | clarkb: maybe. though i gathered that's how stuff was reenqueued on the last zuul too | 18:52 |
clarkb | fungi: k, probably worth retrying with the reenqueue rpc and if it fails *AGAIN* then fall back on reverify/recheck | 18:52 |
*** salv-orlando has joined #openstack-infra | 18:53 | |
fungi | yep, confirmed that all the jenkins masters are restarted and gear status is still responding | 18:53 |
*** vkozhukalov has joined #openstack-infra | 18:53 | |
*** markwash has joined #openstack-infra | 18:53 | |
clarkb | yay! | 18:54 |
fungi | reenqueued the savanna change which was bailing on us before | 18:54 |
clarkb | I think that is a real bug in geard, when the dust settles we should grab relevant logs, and submit a bug | 18:54 |
SergeyLukjanov | fungi, sorry for our naughty jobs :) | 18:55 |
fungi | reenqueued the devstack-gate change | 18:55 |
fungi | geard status is still fine | 18:55 |
*** markmcclain has joined #openstack-infra | 18:55 | |
fungi | SergeyLukjanov: your jobs were fine. our servers were not | 18:55 |
*** markmcclain has quit IRC | 18:55 | |
*** markmcclain has joined #openstack-infra | 18:56 | |
fungi | someone snuck a neutron change in, but it's looking fine too | 18:56 |
fungi | so far everything has workers and no "lost" results | 18:57 |
*** thuc_ has quit IRC | 18:57 | |
clarkb | fungi: yup looks good from my end too | 18:57 |
fungi | trying the mass reenqueue again now | 18:57 |
*** thuc has joined #openstack-infra | 18:57 | |
fungi | event queue is spiking, of course | 18:57 |
clarkb | fungi: pretty sure the registration and starting of jobs is racy in the zuul, geard, gearman-plugin stack and if you catch it just right it causes geard to crash | 18:57 |
*** markwash_ has joined #openstack-infra | 18:58 | |
fungi | reenqueue scripts have returned | 18:59 |
clarkb | nice | 18:59 |
fungi | zuul seems to be tearing through the event queue now | 18:59 |
clarkb | fungi: now, one last quick sanity check. if you grep for 'zuul.Repo' in the debug.log you will get timestamps for all of the git operations | 18:59 |
clarkb | it used to take 9-15 seconds per change, but tmpfs should make that faster | 19:00 |
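(A throwaway sketch of that measurement: pull the timestamps off consecutive zuul.Repo lines and print the gaps as a crude proxy for how long each git operation takes. The log path and timestamp format match the log lines pasted earlier; treating line-to-line gaps as durations is an assumption.)

```python
# Crude sketch: gaps between consecutive zuul.Repo log lines, used here as a
# stand-in for per-operation git timing. Assumes the standard zuul debug.log
# timestamp prefix ("2014-01-17 17:49:27,357 ...").
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
prev = None

with open("/var/log/zuul/debug.log") as log:
    for line in log:
        if "zuul.Repo" not in line:
            continue
        stamp = datetime.strptime(line[:23], FMT)
        if prev is not None:
            gap = (stamp - prev).total_seconds()
            print("%7.2fs  %s" % (gap, line.rstrip()))
        prev = stamp
```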
*** markwash has quit IRC | 19:00 | |
*** markwash_ is now known as markwash | 19:00 | |
fungi | load on zuul is not at all heavy thus far | 19:01 |
clarkb | just looking at the status we are up to 80 something changes in the gate pipeline and it only took a few minutes, much better than the 15-20 it took before | 19:02 |
fungi | gerrit's really not breaking a sweat either | 19:02 |
*** thuc has quit IRC | 19:02 | |
SergeyLukjanov | clarkb, have you already proved that problem is in IO? | 19:02 |
*** rfolco has quit IRC | 19:02 | |
*** azherkhna has joined #openstack-infra | 19:02 | |
clarkb | 'checking out master' is now a subsecond operation | 19:03 |
clarkb | SergeyLukjanov: 'proved'. preliminary results look very very good | 19:03 |
fungi | SergeyLukjanov: we suspect there was a lot of write delay/contention based on the system profiling stats, but i think we need to watch this go for a while under constant load to be certain it's improved significantly (we'll get that opportunity) | 19:03 |
*** galstrom is now known as galstrom_zzz | 19:04 | |
SergeyLukjanov | k, see it | 19:04 |
*** marun has quit IRC | 19:04 | |
*** marun has joined #openstack-infra | 19:05 | |
*** jaypipes has joined #openstack-infra | 19:05 | |
*** julim has quit IRC | 19:05 | |
*** julim has joined #openstack-infra | 19:06 | |
*** vipul is now known as vipul-away | 19:09 | |
fungi | #status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:09 |
*** vipul-away is now known as vipul | 19:09 | |
fungi | did we lose statusbot? | 19:09 |
clarkb | apparently | 19:09 |
fungi | yup. fixing | 19:09 |
*** openstackstatus has joined #openstack-infra | 19:11 | |
mordred | clarkb: nice! | 19:12 |
fungi | #status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:13 |
openstackstatus | NOTICE: zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:13 |
fungi | the event/result queues are back to trivial levels already, and enormous pipeline lengths are active | 19:14 |
fungi | statsd is still broken though, even though i downgraded the new zuul's statsd package to be the same as the old one's | 19:14 |
clarkb | fungi: is statsd erroring? | 19:14 |
fungi | good question | 19:15 |
clarkb | oh we just got our first gate reset | 19:15 |
clarkb | lets see how long it takes to clear | 19:15 |
fungi | and then snipe it out, because outdated sample config | 19:16 |
fungi | no, wait, i misread the log. wrong job entirely for that anyway | 19:16 |
clarkb | reset processed | 19:16 |
clarkb | in ~1.75 minutes? not bad :) | 19:17 |
*** pballand has joined #openstack-infra | 19:17 | |
zaro | clarkb: do you want to review scp-plugin on github? | 19:17 |
clarkb | zaro: I don't see a new pull request | 19:18 |
*** nati_ueno has joined #openstack-infra | 19:18 | |
*** nati_ueno has quit IRC | 19:18 | |
*** nati_ueno has joined #openstack-infra | 19:19 | |
*** CaptTofu has joined #openstack-infra | 19:19 | |
clarkb | zaro: I am going to head into the office around lunch, if you are in today we can go over it there | 19:20 |
*** yolanda has joined #openstack-infra | 19:20 | |
zaro | ok. i'll just wait for you. see you later. | 19:21 |
*** sarob has joined #openstack-infra | 19:22 | |
clarkb | fungi: I think I know the statsd problem | 19:22 |
clarkb | fungi: that is one place where the firewall rules on the remote end may need updating | 19:22 |
clarkb | fungi: if you start the iptables persistent service it should redig DNS records and update the ruleset | 19:22 |
*** nati_uen_ has quit IRC | 19:23 | |
fungi | right, it's updated by dns name! | 19:23 |
fungi | fixing | 19:23 |
*** tjones has quit IRC | 19:23 | |
fungi | wow the graphite server is running at a crawl too | 19:24 |
*** sarob has quit IRC | 19:27 | |
*** thuc has joined #openstack-infra | 19:27 | |
fungi | clarkb: good call. stats are updating again | 19:28 |
mordred | yay stats | 19:28 |
*** marun has quit IRC | 19:28 | |
*** marun has joined #openstack-infra | 19:29 | |
fungi | there's another gate reset | 19:29 |
fungi | BadRequest: Multiple possible networks found, use a Network ID to be more specific. (HTTP 400) | 19:30 |
*** tjones has joined #openstack-infra | 19:30 | |
fungi | oh, that's the one which snuck into the gate behind my one test reenqueue | 19:31 |
clarkb | hahahahaha | 19:31 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: add bug metadata to graph list https://review.openstack.org/67510 | 19:31 |
*** denis_makogon has joined #openstack-infra | 19:31 | |
*** tjones has quit IRC | 19:31 | |
fungi | looks like the last two patchsets were uploaded while zuul was offline, and then it was approved with no check results | 19:31 |
clarkb | well it has been taken care of now :) | 19:31 |
fungi | indeed | 19:31 |
clarkb | fungi: was statsd the last remaining major issue? | 19:32 |
*** tjones has joined #openstack-infra | 19:32 | |
fungi | clarkb: my home firewall is my next major remaining issue | 19:32 |
fungi | i worry when a 15-year-old sparc64 server starts randomly segfaulting running processes | 19:33 |
clarkb | notes from switchover, should puppet known_hosts file for zuul ssh, should puppet zuul .gitconfig, gearman-plugin + geard + zuul is not happy with registering our jobs and needs handholding currently (believe this is a bug in geard) | 19:33 |
*** azherkhna has quit IRC | 19:33 | |
clarkb | fungi: you know you can buy dirt cheap power sipping boxes that work as great routers right? | 19:33 |
fungi | server comes up with too-new statsd, need to reload firewall rules on graphite server | 19:33 |
fungi | clarkb: yes, i know this. i even have the hardware spec'd out and everything but... so little available free time lately | 19:34 |
clarkb | I am going to afk now and catch up on my morning. If no one beats me to it I will write bugs up for what we learned today | 19:34 |
fungi | clarkb: sounds good | 19:34 |
clarkb | also scp plugin, and lca expense reports | 19:34 |
*** vipul is now known as vipul-away | 19:35 | |
*** vipul-away is now known as vipul | 19:35 | |
*** vipul is now known as vipul-away | 19:35 | |
clarkb | fungi: when I get back you should just stop working for the rest of the afternoon | 19:36 |
clarkb | because EWHENDOYOUSLEEP? | 19:36 |
fungi | clarkb: that would be appreciated. i have the gf's folks in town visiting one more night and should at least pretend i enjoy their company | 19:36 |
*** mgagne has quit IRC | 19:37 | |
fungi | so will probably be disappearing for dinner again maybe 2300utc-ish | 19:37 |
clarkb | fungi: yup no worries. ok really afking now so that I am able to cover the afternoon | 19:37 |
sdague | fungi: puppet question ... | 19:37 |
fungi | sdague: sure | 19:37 |
sdague | so we're going to add another elastic recheck program that runs on cron | 19:38 |
* fungi nods | 19:38 | |
sdague | and what I'd also like to do is trigger these jobs after CD | 19:38 |
*** hogepodge has joined #openstack-infra | 19:39 | |
*** mattray has left #openstack-infra | 19:39 | |
sdague | because we might be landing a change, and we'd like to trigger that output | 19:39 |
sdague | but right now the cron jobs are defined on the status site | 19:40 |
*** sarob has joined #openstack-infra | 19:40 | |
sdague | which is done because the state dir is set there | 19:41 |
fungi | okay, so you want a script which is called from a cron entry and from an exec, and wrap them both in lockfile (or implement a locking mechanism within the script) presumably, then subscribe the exec to the vcsrepo object | 19:41 |
*** marun has quit IRC | 19:41 | |
*** oubiwann_ has quit IRC | 19:41 | |
fungi | am i answering the right question? | 19:41 |
*** marun has joined #openstack-infra | 19:41 | |
sdague | I think so | 19:41 |
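(A minimal sketch of the "locking mechanism within the script" option: both the cron entry and the puppet exec call one wrapper, and an flock makes the second invocation exit quietly. The lockfile path and the wrapped command are placeholders, not the real elastic-recheck job.)

```python
# Sketch only: flock-guarded wrapper so cron and a puppet exec can both call
# the same script without overlapping runs. Paths/commands are placeholders.
import fcntl
import subprocess
import sys

LOCKFILE = "/var/run/elastic-recheck-graphs.lock"  # assumed path

with open(LOCKFILE, "w") as lock:
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.exit(0)  # another run (cron or exec) already holds the lock
    # placeholder for the actual graph-generation command
    subprocess.check_call(["echo", "regenerate elastic-recheck graphs"])
```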
sdague | I am wondering if we could define the command as a var in the elastic_recheck/init.pp | 19:42 |
*** oubiwann_ has joined #openstack-infra | 19:42 | |
fungi | almost certainly | 19:42 |
sdague | can we get vars from one pp to another easily? | 19:42 |
sdague | you have a call example for something like that? | 19:42 |
*** sarob has quit IRC | 19:43 | |
fungi | oh, hrm... class scope lookup | 19:43 |
sdague | yeh | 19:43 |
*** sarob has joined #openstack-infra | 19:43 | |
fungi | i know how to do it in an erb template... | 19:43 |
dkranz | fungi: Grr. So the error log gate run is being bitten by https://bugs.launchpad.net/tempest/+bug/1260537 | 19:43 |
uvirtbot | Launchpad bug 1260537 in tempest "Generic catchall bug for non triaged bugs where a server doesn't reach it's required state" [High,Confirmed] | 19:43 |
fungi | trying to remember if i've seen it in a puppet manifest | 19:43 |
dkranz | fungi: Do I just do a reverify now or is some other action appropriate? | 19:43 |
fungi | dkranz: reverify once it dies (i can abort the remaining running jobs) and then when it gets into the queue i'll promote it | 19:44 |
dkranz | fungi: Will reverify kill the current failing build? | 19:44 |
dkranz | fungi: ok | 19:44 |
fungi | dkranz: nope, that's why i need to abort the jobs | 19:44 |
fungi | okay, it's out of the gate now. should be safe to reverify | 19:45 |
fungi | dkranz: ^ | 19:46 |
*** mrmartin has joined #openstack-infra | 19:46 | |
*** yassine has joined #openstack-infra | 19:46 | |
*** vipul-away is now known as vipul | 19:47 | |
dkranz | fungi: Thanks, I did the reverify. | 19:47 |
fungi | i see it | 19:47 |
fungi | promoting now | 19:47 |
*** reed has quit IRC | 19:47 | |
fungi | bam. there it is | 19:47 |
fungi | snappy, snappy new zuulie | 19:47 |
*** sarob has quit IRC | 19:48 | |
fungi | oh zuulie you nut | 19:48 |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 19:49 |
*** AaronGr is now known as aarongr_afk | 19:49 | |
*** vipul is now known as vipul-away | 19:50 | |
mrmartin | re | 19:50 |
*** denis_makogon_ has joined #openstack-infra | 19:52 | |
mrmartin | fungi: if you have 5 minutes today, please comment on this review request: https://review.openstack.org/#/c/67443/ This contains the gating / distro tarball task required for community portal. | 19:52 |
salv-orlando | I might be stating the obvious but since I still see a consistent number of failures in unit test jobs, perhaps there is a case for bumping up patches for bug 1270212 | 19:52 |
uvirtbot | Launchpad bug 1270212 in oslo "regression: multiple calls to Message.__mod__ trigger exceptions" [Critical,In progress] https://launchpad.net/bugs/1270212 | 19:52 |
*** pballand has quit IRC | 19:52 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 19:52 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: fix css style to make page more readable https://review.openstack.org/67560 | 19:54 |
*** Ajaeger has quit IRC | 19:55 | |
*** yassine has quit IRC | 19:55 | |
*** smarcet has quit IRC | 19:55 | |
*** denis_makogon has quit IRC | 19:55 | |
*** kgriffs has left #openstack-infra | 19:56 | |
clarkb | salv-orlando: are there fixes for that change yet? | 19:57 |
clarkb | er for that bug | 19:57 |
fungi | sdague: i think you want http://docs.puppetlabs.com/puppet/2.7/reference/lang_scope.html#accessing-out-of-scope-variables | 19:57 |
clarkb | fungi: sdague: you can reference variables in manifests like $::somescope::innerscope::variablename | 19:57 |
sdague | ok | 19:58 |
clarkb | you do need to make sure you have previously included that class that defines the variable | 19:58 |
*** denis_makogon_ is now known as denis_makogon | 19:58 | |
sdague | cool | 19:58 |
fungi | mrmartin: i don't see a change 67443 at all. did you maybe experiment with gerrit's drafts option, or is that a typo? | 19:59 |
clarkb | zaro: I am on my way in now | 19:59 |
mrmartin | it was a draft :D | 19:59 |
mrmartin | how can I share this draft review with you? | 20:00 |
sdague | cool, i'll see if I can figure it out | 20:00 |
fungi | mrmartin: just set them work-in-progress in the future. drafts are implemented in gerrit in a fairly broken fashion | 20:00 |
*** markmc has quit IRC | 20:00 | |
mrmartin | good to know that. | 20:00 |
*** herndon has joined #openstack-infra | 20:00 | |
*** marun has quit IRC | 20:00 | |
fungi | mrmartin: in the interim, you can add me as a reviewer (just add "fungi" in the requested reviewers line) | 20:00 |
sdague | also - http://status.openstack.org/elastic-recheck/ - shift reload, and we have descriptions on bugs now | 20:00 |
*** marun has joined #openstack-infra | 20:00 | |
*** ruhe is now known as _ruhe | 20:01 | |
fungi | mrmartin: it will resolve it to my name and e-mail address when you do that | 20:01 |
mrmartin | fungi: I did it | 20:01 |
fungi | sdague: great! | 20:01 |
fungi | mrmartin: i can see it now | 20:01 |
clarkb | sdague: flashgordon: fwiw I think some of the jenkins errors will be false positives. When zuul aborts a job, occasionally that manifests as an uncaught exception (I forget which) and the job fails | 20:03 |
mrmartin | fungi: ok add as many comments as you can, so if anything missing, I can correct the patch. thnx! | 20:03 |
clarkb | but zuul aborting jobs is perfectly normal | 20:03 |
clarkb | that said the vast majority are likely slaves falling over and running tests to failure as quickly as they can | 20:03 |
fungi | mrmartin: will do | 20:03 |
*** mrodden1 is now known as mrodden | 20:04 | |
notmyname | ...and I thought 100+ jobs in the check queue yesterday were a lot | 20:05 |
*** galstrom_zzz is now known as galstrom | 20:05 | |
fungi | notmyname: yeah, i'm hoping they go far faster now that zuul is on an even bigger server and is doing all its git scratch work on tmpfs | 20:06 |
fungi | as of an hour ago | 20:07 |
*** nati_uen_ has joined #openstack-infra | 20:07 | |
clarkb | it definitely seems to have made the gate reset cost much lower | 20:07 |
clarkb | which was putting the brakes on everything | 20:07 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 20:08 | |
fungi | the event/result queue pileup is completely resolved | 20:08 |
clarkb | now we suffer from having about 1/3 to 1/4 of the test infra needed to run all of the tests | 20:08 |
*** nati_uen_ has quit IRC | 20:08 | |
notmyname | is that a matter of getting more workers in the nodepool? | 20:09 |
*** nati_uen_ has joined #openstack-infra | 20:09 | |
fungi | well, the nodepool capacity is driven somewhat by gate resets still, since a gate reset near the front of the gate will decimate the entire quota and all of those nodes will need to be rebuilt | 20:10 |
clarkb | notmyname: sort of. we need more cloud quota to do that and we have to be careful that adding more nodes doesn't make jenkins flakier | 20:10 |
clarkb | and we just saw geard get cranky... | 20:10 |
*** mfink has quit IRC | 20:10 | |
clarkb | for now I think we are better off working to make jenkins and geard happier then ramp up nodepool | 20:10 |
fungi | at our current aggregate quota i saw things moving fairly smoothly even with a modest reset rate when the gate was around 25-30 changes deep | 20:10 |
*** nati_ueno has quit IRC | 20:10 | |
fungi | once it got bigger than that, it got into a decimate/rebuild pendulum swing | 20:11 |
*** smarcet has joined #openstack-infra | 20:11 | |
fungi | which makes me think that if we do decide to arbitrarily limit the number of testable changes at the front of an integrated queue, the sweet spot is currently somewhere around there | 20:12 |
clarkb | fungi: I don't think we arbitrarily limit the number of testable changes, I think we let it scale a window based on performance | 20:13 |
fungi | clarkb: i agree that makes more sense | 20:13 |
*** hashar has joined #openstack-infra | 20:13 | |
fungi | i liked the slow-start/backoff idea, as much as i can like any pessimistic model for this | 20:13 |
clarkb | I don't think it will be too hard to implement either as zuul basically takes a list and iterates over it until done. we can slice that list first | 20:14 |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/devstack-gate: Temporary HACK : Enable UCA https://review.openstack.org/67564 | 20:14 |
clarkb | the trickier bits will be in presenting it to users so that folks know they are in the queue but not being tested | 20:14 |
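A rough sketch of the windowing idea being discussed: additive increase on success, multiplicative decrease on a reset, applied as a slice over the change queue. This is illustrative only; the class name and the default parameters are assumptions for the example, not zuul's actual implementation.

    class WindowedChangeQueue(object):
        """Illustrative additive-increase/multiplicative-decrease window."""

        def __init__(self, floor=10, increase=1):
            self.changes = []         # every change enqueued for the gate
            self.floor = floor        # the window never shrinks below this
            self.increase = increase  # additive growth per merged change
            self.window = floor

        def active_changes(self):
            # Only the slice inside the window gets test jobs; the rest of
            # the queue is visible to users but idle, which is the
            # presentation problem mentioned above.
            return self.changes[:self.window]

        def on_change_merged(self):
            # Reward success by widening the window a little.
            self.window += self.increase

        def on_gate_reset(self):
            # Halve the window on a reset, but keep a floor so the gate
            # never stalls completely.
            self.window = max(self.floor, self.window // 2)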
clarkb | dimsum: re ^ do we expect libvirt to work now? | 20:15 |
*** marun has quit IRC | 20:15 | |
*** marun has joined #openstack-infra | 20:15 | |
sdague | clarkb: so if that's the case, realize that it's being reported as a FAILURE to ES and graphite | 20:15 |
sdague | which means it will make the gate look worse than it is | 20:16 |
sdague | when you run stats on it. So it would be good if those could be classified as a different status | 20:16 |
dimsum | clarkb, i have a vm with UCA and don't see the problems reported hence trying to run it in d-g | 20:16 |
clarkb | sdague agree but it is a jenkins limitation | 20:17 |
clarkb | sdague the way they implement job aborts is by raising an exception. if not caught cleanly you lose | 20:17 |
*** galstrom has left #openstack-infra | 20:17 | |
clarkb | dimsum: did you run nova unittests too? | 20:17 |
salv-orlando | clarkb: neutron fix is up for review. I can prepare patches for other projects if you're ok to bump them ahead of the queue | 20:18 |
salv-orlando | clarkb: neutron patch --> https://review.openstack.org/#/c/67537/ | 20:18 |
sdague | clarkb: so the abort job exception is a different exception | 20:18 |
*** tjones has quit IRC | 20:18 | |
fungi | clarkb: does jenkins report it as "FAILURE" state though in that case rather than "ABORT"? | 20:18 |
sdague | from what I can tell | 20:18 |
clarkb | fungi in some corner cases yes | 20:18 |
sdague | I've definitely seen ABORT | 20:18 |
clarkb | ya abort is the 99% case | 20:19 |
sdague | clarkb: right, that's one of the reasons I wanted to raise the question | 20:19 |
dimsum | clarkb, yep | 20:19 |
clarkb | but when jenkins doesn't cleanly catch the abort exception it looks like failure | 20:19 |
notmyname | I'm not sure who to direct this at, so I'm throwing it in here: | 20:19 |
fungi | dimsum: interesting idea. i was trying to test it myself using d-g on a vm, but our recent refactor moved some repos around from where my script/instructions expect them | 20:19 |
notmyname | I'm currently working on the Swift 1.12.0 release. I consider this somewhat of a test run for the gates for next week's i2 stuff. | 20:20 |
*** goofy-nic-friday is now known as cp16net | 20:20 | |
notmyname | my plan is to get the last patches through the gate for an RC (today or when stuff lands, whichever is last) | 20:20 |
notmyname | I'm currently looking at these patches: https://review.openstack.org/#/q/branch:master+AND+Approved%253D1+AND+status:open+AND+project:openstack/swift,n,z | 20:20 |
notmyname | other patches would be whatever else is approved today, including one for the release notes update | 20:21 |
notmyname | I don't think I need anything specific from -infra (beyond the hard work you're already doing). I wanted to give you a status update, especially because of the milestone next week (this is sort of a trial run, I'd think) | 20:22 |
fungi | notmyname: makes sense. as far as i know we're done with emergency disruptions. we spent this morning doing what we can to try to beef up gating performance/throughput in preparation for the bigger rush next week | 20:22 |
dimsum | fungi, don't know if this will work - https://review.openstack.org/#/c/67564/ - taking a shot | 20:23 |
fungi | dimsum: it looks like i would expect it to, but set that to wip because we won't actually put that change as it stands into production. we'd want to do that in nodepool prep scripts instead and/or in puppet configuration (but it may make for a worthwhile proof-of-concept) | 20:24 |
fungi | dimsum: the other place you could try testing it would be with a change to devstack (before it starts installing packages) | 20:26 |
dimsum | ah. right | 20:26 |
dimsum | will do | 20:26 |
fungi | but either way will probably work | 20:26 |
*** prad_ has joined #openstack-infra | 20:26 | |
*** salv-orlando has quit IRC | 20:28 | |
*** herndon has quit IRC | 20:28 | |
*** prad has quit IRC | 20:28 | |
*** prad_ is now known as prad | 20:28 | |
openstackgerrit | Evgeny Fadeev proposed a change to openstack-infra/askbot-theme: made launchpad importer read and write data separately https://review.openstack.org/67567 | 20:30 |
sdague | clarkb, fungi: easy change - gate status to dedicated page, so we can pull it off er - https://review.openstack.org/#/c/65700/ | 20:30 |
sdague | if anyone's up for walking away from fire :) | 20:30 |
*** DinaBelova is now known as DinaBelova_ | 20:36 | |
*** Ryan_Lane has quit IRC | 20:36 | |
*** Ryan_Lane has joined #openstack-infra | 20:36 | |
*** mrmartin has quit IRC | 20:36 | |
*** salv-orlando has joined #openstack-infra | 20:38 | |
*** herndon has joined #openstack-infra | 20:38 | |
*** yolanda has quit IRC | 20:38 | |
*** markwash has quit IRC | 20:39 | |
*** markwash has joined #openstack-infra | 20:41 | |
*** marun has quit IRC | 20:41 | |
*** marun has joined #openstack-infra | 20:41 | |
notmyname | wow. I am noticing that zuul is picking up approved changes _much_ more quickly now | 20:44 |
*** carl_baldwin has quit IRC | 20:46 | |
*** senk has quit IRC | 20:47 | |
*** carl_baldwin has joined #openstack-infra | 20:47 | |
*** markmcclain has quit IRC | 20:47 | |
*** vipul-away is now known as vipul | 20:47 | |
*** jaypipes has quit IRC | 20:48 | |
*** jaypipes_ has joined #openstack-infra | 20:48 | |
*** jaypipes_ has quit IRC | 20:48 | |
*** dprince has quit IRC | 20:49 | |
*** pballand has joined #openstack-infra | 20:49 | |
fungi | notmyname: that's thanks to the event queue no longer being backlogged | 20:49 |
rustlebee | queues are huge :) | 20:50 |
fungi | rustlebee: yep, i expect them to start dropping once the check pipeline catches up on worker assignments now | 20:51 |
rustlebee | cool | 20:52 |
fungi | rustlebee: without your awesome collapseypatch, my browser would have choked on the current status page i think | 20:52 |
rustlebee | heh | 20:52 |
openstackgerrit | Emilien Macchi proposed a change to openstack-infra/config: gerritbot: Add API doc git notifications on #openstack-doc https://review.openstack.org/67573 | 20:53 |
dimsum | rustlebee, ya, very handy! | 20:55 |
* rustlebee clicks expand all ... poor chrome | 20:56 | |
fungi | *boom* | 20:56 |
*** tjones has joined #openstack-infra | 20:58 | |
*** hashar has quit IRC | 20:59 | |
*** herndon has quit IRC | 20:59 | |
*** thomasem has quit IRC | 21:01 | |
*** marun has quit IRC | 21:01 | |
*** marun has joined #openstack-infra | 21:01 | |
*** nati_ueno has joined #openstack-infra | 21:03 | |
*** smarcet has left #openstack-infra | 21:05 | |
very_tired | rustlebee: yes thanks for the collapsy patch | 21:05 |
*** herndon has joined #openstack-infra | 21:06 | |
rustlebee | you're welcome :) | 21:06 |
rustlebee | it was fun. | 21:06 |
*** herndon has quit IRC | 21:06 | |
rustlebee | anything web related is out of my normal comfort zone | 21:06 |
very_tired | fungi: email alert, I just sent this: http://lists.openstack.org/pipermail/openstack-infra/2014-January/000661.html | 21:06 |
*** herndon has joined #openstack-infra | 21:07 | |
*** nati_uen_ has quit IRC | 21:07 | |
very_tired | will ping at 8pm and if they haven't responded, no voting for them | 21:07 |
*** herndon has quit IRC | 21:07 | |
very_tired | rustlebee: you did a nice job of it | 21:07 |
fungi | very_tired: sounds good. in the meantime, get some rest | 21:07 |
very_tired | fungi: :D | 21:07 |
very_tired | code sprint winding down | 21:07 |
very_tired | we have patches to gate | 21:07 |
*** herndon_ has joined #openstack-infra | 21:08 | |
clarkb | fungi: I am back at a different desk now | 21:08 |
very_tired | fungi: https://etherpad.openstack.org/p/montreal-code-sprint | 21:08 |
very_tired | under the to be promoted section | 21:08 |
fungi | very_tired: more stability fixes? | 21:08 |
sdague | fungi: yes, these should decrease load on the neutron side | 21:09 |
sdague | which should make it more likely to pass | 21:09 |
very_tired | still working on getting +A on all the neutron patches, marun is going through them | 21:09 |
very_tired | so is nati_ueno | 21:10 |
fungi | sdague: very_tired: if you could work up a preferred sequence, we can promote the whole batch. more stable gate means more faster gate | 21:10 |
fungi | need to know changenum,psnum | 21:11 |
very_tired | fungi mtreinish is double checking that now | 21:11 |
*** jgrimm has quit IRC | 21:12 | |
mtreinish | fungi: I just reordered the tempest test list | 21:13 |
fungi | clarkb: do you think we have a chance of being able to sanely quiesce zuul tomorrow for that project rename maintenance? | 21:13 |
*** oubiwann_ has quit IRC | 21:14 | |
*** nati_ueno has quit IRC | 21:14 | |
*** marun has quit IRC | 21:14 | |
*** marun has joined #openstack-infra | 21:14 | |
*** oubiwann_ has joined #openstack-infra | 21:15 | |
*** nati_ueno has joined #openstack-infra | 21:15 | |
very_tired | fungi: they responded to my email, so you might not need to do anything | 21:16 |
mikal | Morning | 21:16 |
very_tired | mikal: morning | 21:16 |
very_tired | happy saturday | 21:16 |
clarkb | fungi: maybe? but it is looking less likely | 21:16 |
*** nati_ueno has quit IRC | 21:17 | |
fungi | clarkb: i try to look at it as we're load-testing the new zuul ;) | 21:18 |
*** nati_ueno has joined #openstack-infra | 21:18 | |
*** oubiwann_ has quit IRC | 21:19 | |
mikal | very_tired: you're anteaya? | 21:19 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 21:20 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: [WIP] Storyboard API Interface and basic project management https://review.openstack.org/67582 | 21:22 |
very_tired | mikal: I am | 21:23 |
mikal | very_tired: so, I don't think I caused the recheck backlog... The script didn't run for that long. | 21:24 |
fungi | mikal: oh, the thing to recheck stale patches? | 21:25 |
mikal | fungi: yeah | 21:25 |
fungi | anyway, no, the check volume is from us dumping the state of the zuul check and gate pipelines, moving to a bigger badder zuul and restoring them... so they all needed fresh workers and then new patchsets came in on top of that | 21:26 |
clarkb | but, bigger badder zuul is pretty awesome | 21:26 |
*** marun has quit IRC | 21:26 | |
*** marun has joined #openstack-infra | 21:27 | |
fungi | bigger badder zuul will eat your spleen for breakfast it's so awesome | 21:27 |
*** UtahDave has joined #openstack-infra | 21:28 | |
fungi | or at least, according to our design specs it has a taste for spleen. more testing required | 21:28 |
*** pcrews has quit IRC | 21:28 | |
mikal | So, are gate flushes still hurting us? | 21:28 |
*** krotscheck has quit IRC | 21:29 | |
*** pcrews has joined #openstack-infra | 21:29 | |
clarkb | mikal: yes in that they force us to retest stuff, no they don't cause zuul to stop for forever to process them | 21:29 |
fungi | mikal: they will still severely deplete our available job workers for prolonged periods | 21:29 |
*** NikitaKonovalov_ is now known as NikitaKonovalov | 21:30 | |
mikal | Ok, so I got to the point with my rechecker where it would run until it found something to recheck, recheck that, and then exit. I would then go and hand verify the recheck. I hadn't found any incorrect rechecks in a while. | 21:30 |
fungi | though apparently the neutron+qa testing/stability sprint has a stack of patches which they think will make a big improvement on reset frequency | 21:30 |
*** NikitaKonovalov is now known as NikitaKonovalov_ | 21:30 | |
mikal | I'm wondering if I should turn it back on this morning, or if the queues are so long I should just let it rest for a day | 21:31 |
mikal | The queues do look pretty long... | 21:31 |
*** vipul is now known as vipul-away | 21:31 | |
fungi | mikal: i see it as a tradeoff there. at least some of the more persistent gate resets we're getting are actually from stale changes getting approved after bit-rotting in review for too long | 21:32 |
mikal | fungi: I was surprised by how many stale checks there were last night | 21:32 |
fungi | so catching those early might help keep cores from approving them | 21:32 |
mikal | It was a non-trivial percentage of reviews | 21:32 |
mikal | Noting that sdague doesn't want checks on stable at the moment because of pip | 21:32 |
mikal | (wow, the nova check fail rate at the moment is really high) | 21:33 |
*** SumitNaiksatam has joined #openstack-infra | 21:37 | |
*** marun has quit IRC | 21:37 | |
*** marun has joined #openstack-infra | 21:38 | |
mordred | mikal: I, for one, support your rechecker | 21:38 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: fix css style to make page more readable https://review.openstack.org/67560 | 21:40 |
mikal | I just don't want to break the world with my well meaning flailing | 21:41 |
mikal | There's only so much kermit arms can do | 21:41 |
portante | mordred: do we run the devstack environments with GRO turned on (the generic receive off-load stuff)? | 21:41 |
portante | I am guessing it is not a concern, but just checking | 21:41 |
*** vipul-away is now known as vipul | 21:43 | |
*** aarongr_afk is now known as AaronGr | 21:45 | |
very_tired | heh, kermit arms | 21:45 |
fungi | the muppet geek in me knew exactly what he meant | 21:46 |
*** vipul is now known as vipul-away | 21:47 | |
*** rustlebee is now known as russellb | 21:52 | |
*** sdake has quit IRC | 21:52 | |
very_tired | fungi: the patches in the "to be promoted" section are all +A'd and in the order they need to go into the gate: https://etherpad.openstack.org/p/montreal-code-sprint | 21:52 |
very_tired | fungi: let me know if you need more | 21:52 |
*** beekneemech has quit IRC | 21:52 | |
very_tired | more as in more information, not more as in more work to do | 21:53 |
fungi | very_tired: they're separated by project... is that the order you want them in? (neutron block first, then that standalone "please also this" change, then the tempest changes)? | 21:53 |
*** herndon_ has quit IRC | 21:53 | |
*** derekh has joined #openstack-infra | 21:54 | |
fungi | looks like that standalone 67537 isn't approved anyway | 21:54 |
*** thedodd has quit IRC | 21:54 | |
*** carl_baldwin has quit IRC | 21:54 | |
very_tired | fungi: this one goes first please: Please also this: https://review.openstack.org/#/c/67537/ | 21:54 |
*** carl_baldwin has joined #openstack-infra | 21:55 | |
*** marun has quit IRC | 21:55 | |
*** sarob has joined #openstack-infra | 21:55 | |
very_tired | fungi: yes, salv-orlando is getting a +A on that, sorry I thought we were ready on our end | 21:55 |
fungi | no problem | 21:55 |
*** sarob has quit IRC | 21:55 | |
*** marun has joined #openstack-infra | 21:55 | |
*** sandywalsh has quit IRC | 21:56 | |
*** herndon_ has joined #openstack-infra | 21:56 | |
fungi | very_tired: though you may need to get clarkb's help on those. i'm about to disappear to go out for food | 21:57 |
clarkb | fungi: go disappear, I will be mostly here in a few minutes | 21:57 |
*** UtahDave has quit IRC | 21:57 | |
very_tired | fungi: happy food, I will work with clarkb | 21:58 |
very_tired | thanks | 21:58 |
fungi | very_tired: also, kudos to you and the attendees at the sprint--that's an impressive list of stability and debugging fixes | 21:58 |
very_tired | fungi thanks, it was very beneficial on many levels | 21:59 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1261253 https://review.openstack.org/67539 | 21:59 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 21:59 |
very_tired | we had a good group here | 21:59 |
*** rnirmal has joined #openstack-infra | 22:02 | |
very_tired | clarkb: all the tempest patches in the "to be promoted" section can go in | 22:02 |
very_tired | https://etherpad.openstack.org/p/montreal-code-sprint | 22:02 |
*** tjones has left #openstack-infra | 22:03 | |
very_tired | we are waiting on a +A on 67537; once we have that, 67537 goes first, followed by the rest of the neutron block | 22:03 |
fungi | clarkb: though keep in mind that anything you promote now will mean all the remaining changes in the check queue which have accumulated since the last gate reset will also be serviced before zuul takes a crack at what's in the gate (including 63934,3 which we intentionally placed at the front) | 22:04 |
*** vipul-away is now known as vipul | 22:04 | |
clarkb | very_tired: I would like to do all of them at once as promotion requires a reset | 22:04 |
*** vipul is now known as vipul-away | 22:04 | |
fungi | clarkb: i agree that's probably the best choice | 22:04 |
*** melwitt has joined #openstack-infra | 22:05 | |
clarkb | very_tired: so once everything has been approved and queued ping me and we will promote | 22:05 |
very_tired | clarkb: will do | 22:05 |
very_tired | clarkb: good to go | 22:06 |
flashgordon | you guys ever see this bug: http://logs.openstack.org/21/65121/2/gate/gate-grenade-dsvm/efd816b/console.html | 22:06 |
flashgordon | SCPRepositoryPublisher aborted due to exception | 22:06 |
*** carl_baldwin has quit IRC | 22:06 | |
mordred | flashgordon: it means that java hates us | 22:07 |
*** carl_baldwin has joined #openstack-infra | 22:07 | |
flashgordon | mordred: yup | 22:08 |
flashgordon | but about to file a bug if we don't have one | 22:08 |
flashgordon | 263 hits in logstash | 22:08 |
*** sandywalsh has joined #openstack-infra | 22:08 | |
fungi | flashgordon: that log you linked doesn't seem to have been associated with a result posted to the associated change | 22:09 |
fungi | flashgordon: i wonder if that job got intentionally killed when a job on a change ahead of it failed in the gate | 22:09 |
*** jcooley_ has joined #openstack-infra | 22:10 | |
very_tired | clarkb: problem | 22:10 |
clarkb | very_tired: ? | 22:10 |
very_tired | https://review.openstack.org/#/c/67537/ never passed check | 22:10 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 22:10 |
very_tired | so salv-orlando says go with the tempest block | 22:10 |
very_tired | let https://review.openstack.org/#/c/67537/ come back with check | 22:11 |
*** CaptTofu has quit IRC | 22:11 | |
fungi | flashgordon: there are only two grenade failures on the change for that log, and neither of them refer to that particular job run | 22:11 |
very_tired | and then if it does promote it and the rest of the neutron block | 22:11 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard: Fix the intial db migration https://review.openstack.org/67592 | 22:11 |
very_tired | does that sound reasonable? | 22:11 |
clarkb | very_tired: I am not doing two promotions | 22:11 |
*** gema has quit IRC | 22:11 | |
*** nati_uen_ has joined #openstack-infra | 22:11 | |
clarkb | promotions are very expensive | 22:11 |
flashgordon | fungi: hmm | 22:12 |
*** MarkAtwood has quit IRC | 22:13 | |
fungi | flashgordon: i'm guessing java.lang.InterruptedException is something akin to sigint | 22:13 |
very_tired | clarkb: I understand | 22:13 |
flashgordon | fungi: that makes sense | 22:14 |
fungi | flashgordon: "Thrown when a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity." | 22:14 |
fungi | (from oracle's language doc reference) | 22:14 |
*** med_ has quit IRC | 22:14 | |
*** UtahDave has joined #openstack-infra | 22:14 | |
flashgordon | fungi: that makes a lot of sense | 22:15 |
fungi | flashgordon: so i think you have a cancelled/aborted job there that jenkins reported as a failure | 22:15 |
flashgordon | yup | 22:15 |
fungi | because EJENKINS | 22:15 |
very_tired | clarkb: this is our fault and we will wear it | 22:15 |
flashgordon | so I will add an elastic-recheck fingerprint for that so we can ignore those and get better classification rate numbers | 22:15 |
*** CaptTofu has joined #openstack-infra | 22:16 | |
flashgordon | if that sounds good to you | 22:16 |
flashgordon | which means add a bug marked as resolved | 22:16 |
fungi | flashgordon: sounds like a good call | 22:16 |
*** nati_ueno has quit IRC | 22:16 | |
fungi | anyway, really disappearing for several hours starting now... back later for more fun | 22:16 |
clarkb | fungi: have fun | 22:17 |
*** mfer has quit IRC | 22:17 | |
*** reed has joined #openstack-infra | 22:18 | |
very_tired | fungi: enjoy | 22:18 |
*** thedodd has joined #openstack-infra | 22:19 | |
flashgordon | fungi: so this happens in the gate queue only which fits your hypothesis | 22:19 |
*** ewindisch is now known as zz_ewindisch | 22:19 | |
dimsum | flashgordon, i've seen many stack traces that finally end up in the wait interrupt at line hudson.remoting.Request.call(Request.java:146) | 22:20 |
flashgordon | dimsum: link? | 22:22 |
flashgordon | dimsum: I am using this query: message:"java.lang.InterruptedException" AND filename:"console.html" | 22:22 |
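As a side note, a small sketch of how that query's hit count could be checked against an Elasticsearch backend before turning it into an elastic-recheck fingerprint. The endpoint URL and index pattern here are placeholder assumptions, not the project's real configuration.

    import requests

    # Query string quoted above; URL and index pattern are placeholders.
    QUERY = 'message:"java.lang.InterruptedException" AND filename:"console.html"'
    ES_URL = 'http://localhost:9200/logstash-*/_count'

    def hit_count():
        body = {'query': {'query_string': {'query': QUERY}}}
        resp = requests.post(ES_URL, json=body)
        resp.raise_for_status()
        return resp.json()['count']

    if __name__ == '__main__':
        print(hit_count())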
*** salv-orlando has quit IRC | 22:22 | |
*** nati_uen_ has quit IRC | 22:22 | |
*** med_ has joined #openstack-infra | 22:23 | |
*** nati_ueno has joined #openstack-infra | 22:23 | |
*** vkozhukalov has quit IRC | 22:24 | |
*** eharney has quit IRC | 22:24 | |
flashgordon | fungi: https://bugs.launchpad.net/openstack-ci/+bug/1270309 | 22:24 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] | 22:24 |
flashgordon | can you triage that, I think won't fix makes sense but your call | 22:24 |
flashgordon | but something closed | 22:24 |
*** bnemec has joined #openstack-infra | 22:24 | |
*** rossella_s has quit IRC | 22:25 | |
*** carl_baldwin has quit IRC | 22:26 | |
*** carl_baldwin has joined #openstack-infra | 22:26 | |
dimsum | "hudson.remoting.Request.call(Request.java" | 22:28 |
very_tired | I'm out for the weekend and Monday, I expect to be online again on Tuesday | 22:28 |
*** sarob has joined #openstack-infra | 22:28 | |
mordred | have a great weekend very_tired | 22:28 |
very_tired | clarkb and fungi thanks for all your help | 22:28 |
very_tired | thanks | 22:28 |
*** marun has quit IRC | 22:28 | |
very_tired | :D | 22:28 |
*** very_tired is now known as anteaya | 22:28 | |
*** marun has joined #openstack-infra | 22:29 | |
*** gema has joined #openstack-infra | 22:30 | |
*** carl_baldwin has quit IRC | 22:30 | |
*** lcestari has quit IRC | 22:31 | |
flashgordon | fungi: I think it is a valid infra bug actually, these shouldn't be marked as failures | 22:32 |
*** obondarev_ has quit IRC | 22:32 | |
*** reed_ has joined #openstack-infra | 22:32 | |
*** emagana has quit IRC | 22:32 | |
flashgordon | dimsum: I think that is the same issue, that is part of the InterruptedException stacktrace | 22:32 |
flashgordon | dimsum: see https://bugs.launchpad.net/openstack-ci/+bug/1270309 | 22:33 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] | 22:33 |
*** nati_ueno has quit IRC | 22:33 | |
*** reed__ has joined #openstack-infra | 22:34 | |
notmyname | why would this change https://review.openstack.org/#/c/67538/ be marked as SKIPPED in zuul? | 22:34 |
notmyname | it's towards the bottom of the gate queue | 22:35 |
*** reed has quit IRC | 22:35 | |
*** reed__ has quit IRC | 22:35 | |
*** senk has joined #openstack-infra | 22:36 | |
*** reed_ has quit IRC | 22:37 | |
*** HenryG has quit IRC | 22:37 | |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1270309 https://review.openstack.org/67594 | 22:39 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] https://launchpad.net/bugs/1270309 | 22:39 |
clarkb | notmyname: probably a merge conflict, if you hover over the red bubble it will tell you | 22:41 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Use short build_uuids in elasticSearch queries https://review.openstack.org/67596 | 22:45 |
zaro | clarkb: new scp plugin is on jenkins-dev.o.o | 22:45 |
*** ArxCruz has quit IRC | 22:49 | |
*** flashgordon is now known as jog0 | 22:50 | |
*** marun has quit IRC | 22:50 | |
*** mrda has joined #openstack-infra | 22:53 | |
*** dstanek has quit IRC | 22:56 | |
*** prad has quit IRC | 22:57 | |
*** mrda has quit IRC | 22:57 | |
*** thedodd has quit IRC | 22:59 | |
*** radix has joined #openstack-infra | 23:01 | |
radix | jenkins seems to be ignoring one of my patches, https://review.openstack.org/#/c/67006/3 , is there something wedged? | 23:01 |
radix | or is there something messed up with my patch because I've done something wrong, maybe | 23:02 |
*** rcleere has quit IRC | 23:02 | |
clarkb | radix: it is being rechecked | 23:02 |
radix | oh ok cool :) | 23:03 |
clarkb | radix: looks like it was a draft at one point though | 23:03 |
radix | yep, started out as one | 23:03 |
clarkb | drafts are evil and don't work at all in the CI systems | 23:03 |
clarkb | you can use Work in progress instead | 23:03 |
radix | well, I assumed jenkins would notice the first non-draft I posted | 23:03 |
clarkb | depends on how the non draft is posted | 23:03 |
*** dcramer_ has quit IRC | 23:04 | |
clarkb | if it is just published jenkins won't notice | 23:04 |
clarkb | if it is pushed as a fresh non-draft patchset jenkins should notice, and in that case jenkins may have missed it because we have been hitting zuul with a hammer to make it go quicker | 23:04 |
radix | ah, ok | 23:04 |
radix | yeah, I just pushed a new rev as a non-draft, so it was probably that | 23:04 |
*** zz_ewindisch is now known as ewindisch | 23:05 | |
radix | I'll point out that https://wiki.openstack.org/wiki/Gerrit_Workflow explains how to use drafts, and doesn't discourage them | 23:05 |
clarkb | gah | 23:06 |
* clarkb goes on a bug filing spree | 23:06 | |
radix | hehe :) | 23:06 |
clarkb | since the chances I get all of this done today are slim | 23:06 |
*** burt1 has quit IRC | 23:07 | |
*** sarob has quit IRC | 23:10 | |
*** sarob has joined #openstack-infra | 23:10 | |
openstackgerrit | A change was merged to openstack-infra/devstack-gate: comparison to stable/grizzly is not numeric https://review.openstack.org/63934 | 23:11 |
*** jergerber has quit IRC | 23:11 | |
*** thuc has quit IRC | 23:12 | |
*** thuc has joined #openstack-infra | 23:12 | |
sdague | yay, the non numeric patch finally landed! | 23:14 |
*** senk has quit IRC | 23:14 | |
*** reed__ has joined #openstack-infra | 23:14 | |
sdague | also, there is a fix for stable/grizzly devstack in the gate now | 23:14 |
sdague | no need to promote it, it's fine if it churns through the weekend | 23:14 |
sdague | but that should be handy | 23:14 |
*** sarob has quit IRC | 23:15 | |
*** thuc_ has joined #openstack-infra | 23:16 | |
clarkb | sdague: woot | 23:16 |
clarkb | sdague: what was the fix? | 23:16 |
sdague | https://review.openstack.org/#/c/67425/ | 23:16 |
*** markmcclain has joined #openstack-infra | 23:16 | |
sdague | basically, we were so wrapped up in the pip 1.5 thing, we forgot about the broken run-arounds on pip 1.4 | 23:16 |
sdague | that never got back ported | 23:17 |
clarkb | :( | 23:17 |
*** thuc has quit IRC | 23:17 | |
sdague | however, it passed | 23:17 |
sdague | so I think it will fix things | 23:17 |
sdague | chmouel has additional good backports and fixes for grizzly | 23:17 |
sdague | but that one should be sufficient to get stable/havana working | 23:17 |
*** soleblaze has quit IRC | 23:18 | |
*** markmcclain1 has joined #openstack-infra | 23:19 | |
clarkb | bugs 1270321 1270319 and 1270320 submitted to cover the stuff we ran into today | 23:19 |
uvirtbot | Launchpad bug 1270321 in openstack-ci "Puppet manifests for zuul install too new statsd." [Medium,Triaged] https://launchpad.net/bugs/1270321 | 23:19 |
clarkb | radix: I think I am just going to update the wiki now | 23:19 |
*** mrodden has quit IRC | 23:20 | |
radix | thanks :) | 23:20 |
*** denis_makogon has quit IRC | 23:20 | |
*** markmcclain has quit IRC | 23:21 | |
sdague | clarkb: was your hack to disable draft perms ever something that worked? | 23:21 |
*** herndon_ has quit IRC | 23:22 | |
*** soleblaze has joined #openstack-infra | 23:23 | |
*** CaptTofu has quit IRC | 23:24 | |
*** sarob has joined #openstack-infra | 23:24 | |
*** CaptTofu has joined #openstack-infra | 23:25 | |
*** reed__ has quit IRC | 23:25 | |
*** carl_baldwin has joined #openstack-infra | 23:27 | |
sdague | now that we did a zuul restart with the durable enqueue times in it - https://review.openstack.org/#/q/status:open+project:openstack-infra/config+branch:master+topic:status_ui,n,z could land any time, which displays enqueue duration in jobs | 23:28 |
sdague | and makes the merge conflict changes black, so they are easier to distinguish | 23:29 |
*** CaptTofu has quit IRC | 23:29 | |
clarkb | sdague: haven't had a chance to test that | 23:30 |
clarkb | zaro: is that something you can test on review-dev? disable push rights to refs/drafts/* for all projects | 23:31 |
*** jcooley_ has quit IRC | 23:32 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 23:35 |
sdague | clarkb: there is actually a url in the review that shows it in action | 23:35 |
clarkb | sdague: cool, I will take a look momentarily | 23:35 |
sdague | it's all just status ui on the zuul json | 23:35 |
sdague | so you can just gvfs-open it locally actually | 23:35 |
sdague | cd config/modules/openstack_project/files/status && gvfs-open index.html | 23:36 |
*** emagana has joined #openstack-infra | 23:38 | |
*** mfink has joined #openstack-infra | 23:39 | |
*** jcooley_ has joined #openstack-infra | 23:39 | |
mordred | sdague: looks good to me | 23:40 |
sdague | mordred: cool | 23:40 |
sdague | mordred: so the grizzly devstack thing | 23:40 |
mordred | yeah? | 23:40 |
sdague | apparently you pushed a fix for that in august | 23:40 |
sdague | which got lost | 23:40 |
sdague | and someone found it | 23:40 |
mordred | AWESOME | 23:40 |
sdague | https://review.openstack.org/#/c/67425/ | 23:40 |
sdague | why it only started screwing us now... I don't know | 23:41 |
mordred | so broken | 23:41 |
sdague | so anyway, once that gets through the gate, havana patches can land again | 23:42 |
sdague | I think | 23:42 |
*** vipul-away is now known as vipul | 23:42 | |
*** jcooley_ has quit IRC | 23:44 | |
*** boris-42 has quit IRC | 23:45 | |
*** rnirmal has quit IRC | 23:45 | |
mordred | sdague, clarkb: perhaps we should make some of the different colors different shapes too - for people with colorblindness | 23:46 |
zaro | clarkb, sdague : i'll give disabling drafts a try. | 23:46 |
clarkb | zaro: thank you | 23:46 |
*** salv-orlando has joined #openstack-infra | 23:47 | |
sdague | mordred: yeh, I think that would be good. Honestly, we should probably do the shape draws with svg anyway. | 23:47 |
sdague | maybe after turning status.js into templates I'll do that | 23:47 |
*** jerryz has quit IRC | 23:47 | |
*** obondarev_ has joined #openstack-infra | 23:48 | |
mordred | sdague: yeah. and on your plane - def look at the bower/grunt stuff for that - if we're going to get fancier, I think we should consider not just being files in the config repo | 23:49 |
sdague | yep, I'd be fine with that | 23:49 |
mordred | it also may be way overkill - which is why you should look at it and not me | 23:49 |
sdague | heh | 23:49 |
portante | sdague, mousing over the circle was a well hidden feature in zuul for me | 23:49 |
portante | thanks for pointing that out | 23:50 |
*** krotscheck has joined #openstack-infra | 23:50 | |
clarkb | ok fixing the wiki article finally :) | 23:51 |
*** markmcclain1 has quit IRC | 23:52 | |
*** flaper87 is now known as flaper87|afk | 23:53 | |
*** carl_baldwin has quit IRC | 23:53 | |
*** jerryz has joined #openstack-infra | 23:56 | |
clarkb | https://wiki.openstack.org/wiki/Gerrit_Workflow#Work_in_Progress how does that look? | 23:57 |
*** pballand has quit IRC | 23:57 |