*** dkranz has quit IRC | 01:05 | |
*** xinliang has joined #zuul | 01:05 | |
*** bhavik1 has joined #zuul | 02:45 | |
*** bhavik1 has quit IRC | 02:54 | |
*** zhuli has joined #zuul | 03:00 | |
*** isaacb has joined #zuul | 06:26 | |
*** hashar has joined #zuul | 08:30 | |
*** bhavik1 has joined #zuul | 08:56 | |
*** electrofelix has joined #zuul | 09:01 | |
*** bhavik1 has quit IRC | 09:30 | |
*** jkilpatr has joined #zuul | 10:43 | |
*** jkilpatr has quit IRC | 10:51 | |
*** jkilpatr has joined #zuul | 11:04 | |
*** dkranz has joined #zuul | 12:20 | |
*** dkranz has quit IRC | 13:05 | |
mrhillsman | how does nodepool read clouds? i keep getting an error - os_client_config.exceptions.OpenStackConfigException: Cloud fake was not found. | 13:14 |
mrhillsman | or is that expected behavior | 13:14 |
mrhillsman | everything else appears to be working | 13:15 |
mrhillsman | dib has built an image | 13:15 |
mrhillsman | and node-launcher looks like it is fine as i do not see any errors in the logs either | 13:15 |
mrhillsman | i see successful requests and responses | 13:16 |
mrhillsman | for both | 13:16 |
mrhillsman | but nodepool dib-image-list fails with error re cloud fake | 13:17 |
rcarrillocruz | mrhillsman: it reads a file called clouds.yaml | 13:21 |
rcarrillocruz | typically, it should be under ~/.config/openstack/clouds.yaml | 13:22 |
mrhillsman | ok cool, i have that in place, and using the default | 13:22 |
mrhillsman | ah | 13:22 |
mrhillsman | it is in /etc/nodepool/clouds.yaml | 13:22 |
rcarrillocruz | i think that's ok, it falls back to that location if the user's home does not have it | 13:22 |
mrhillsman | nope | 13:23 |
mrhillsman | you are right | 13:23 |
mrhillsman | moving it there fixed it :) | 13:24 |
rcarrillocruz | https://docs.openstack.org/os-client-config/latest/ | 13:24 |
rcarrillocruz | yah, it's an oscc thing, not a thing specific to nodepool | 13:24 |
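For reference, os-client-config resolves a named cloud by searching a small set of clouds.yaml locations (typically ./clouds.yaml, ~/.config/openstack/clouds.yaml, then /etc/openstack/clouds.yaml). A minimal sketch of that lookup follows, assuming an illustrative cloud named `fake`; the YAML shown in the comment is an example, not the actual config from this conversation.

```python
# Sketch of the oscc lookup behind the error above. Assumptions: os-client-config
# is installed, and the clouds.yaml content below is illustrative.
#
# ~/.config/openstack/clouds.yaml (or /etc/openstack/clouds.yaml) might contain:
#   clouds:
#     fake:
#       auth:
#         auth_url: http://controller:5000/v3
#         username: nodepool
#         password: secret
#         project_name: ci
#       region_name: RegionOne
import os_client_config
from os_client_config.exceptions import OpenStackConfigException

config = os_client_config.OpenStackConfig()  # searches the standard locations only
try:
    cloud = config.get_one_cloud('fake')
    print(cloud.name)
except OpenStackConfigException:
    # "Cloud fake was not found": none of the searched clouds.yaml files define
    # a cloud called "fake" (a file under /etc/nodepool/ is not searched).
    raise
```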
mrhillsman | hopefully i am almost done then :) | 13:24 |
mrhillsman | need to get it working against this local openstack cloud | 13:25 |
rcarrillocruz | if you hit roadblocks, happy to help, i've been installing a nodepool/zuul CI lately (which I think you have too per previous comments from you in this channel) | 13:25 |
mrhillsman | awesome, thanks | 13:25 |
mrhillsman | yes i have | 13:25 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add base job and roles for javascript https://review.openstack.org/510236 | 13:32 |
Shrews | jeblair: I'm curious about the new nodepool issue "When nodepool launchers are restarted with building nodes, the requests can be left in a pending state and end up stuck" | 13:47 |
Shrews | jeblair: we have code (and a test) to handle unlocked PENDING requests: http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n381 | 13:48 |
Shrews | jeblair: is it possible the request disappeared (by a zuul disconnect) before the nodepool cleanup thread ran? | 13:48 |
Shrews | launcher logs aren't showing me anything about the request from that pastebin after the zk restart, so it seems like it disappeared before it could be cleaned up properly | 13:50 |
Shrews | hrm, is it possible an active lock could survive a ZK restart? | 13:51 |
*** isaacb has quit IRC | 13:58 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't modify project-templates to add job name attributes https://review.openstack.org/510304 | 13:58 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't use cached config when deleting files https://review.openstack.org/510317 | 13:58 |
Shrews | jeblair: oh, simple kazoo test shows that a lock CAN survive a normal ZK stop/start! TIL | 13:59 |
Shrews | so now the paste makes sense b/c nodepool still has a lock on the PENDING request *after* the ZK restart | 14:01 |
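A rough sketch of the kind of kazoo experiment described above, assuming a local ZooKeeper on 127.0.0.1:2181 and a made-up lock path. The point is that a kazoo Lock is backed by an ephemeral znode tied to the client *session*, so a quick server stop/start does not release it as long as the client reconnects within the session timeout.

```python
# Sketch (assumptions: a local ZooKeeper on 127.0.0.1:2181, hypothetical lock
# path). A kazoo Lock is an ephemeral znode tied to the client session, so it
# survives a quick server stop/start if the client reconnects before the
# session times out -- which is why the PENDING request stayed locked.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

lock = zk.Lock('/nodepool/requests/200-0000000001/lock', identifier='launcher-1')
lock.acquire()

# ... restart the ZK server here (e.g. stop/start the zookeeper service) ...
# After kazoo reconnects, the lock's ephemeral contender node is still present:
print(zk.get_children('/nodepool/requests/200-0000000001/lock'))

lock.release()
zk.stop()
```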
Shrews | actually, hrm... why did the building node lose its lock though? ugh, confused again | 14:03 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update statsd output for tenants https://review.openstack.org/510580 | 14:03 |
jeblair | Shrews: yeah, i assume we would lose the lock for both the node and the request, or neither of them. it's strange to think that we would have the lock on just the request but not the node. in retrospect, i suppose i should have examined the request node more carefully to determine whether it was locked, and if so, by what. | 14:07 |
Shrews | jeblair: i'll see about working on a test case for this, but w/o more details, might be kind of hard to fix something where I don't know what was wrong. :/ | 14:09 |
Shrews | and i'm really confused about lock survival across ZK restarts now. maybe jhesketh can help shed light on that? | 14:11 |
jeblair | Shrews: maybe you meant harlowja? | 14:11 |
Shrews | oops, yes. | 14:12 |
jeblair | Shrews: how often does the thing that checks for unlocked pending requests run? we can probably verify whether it ran or not from logs. but i think this request was outstanding for quite a while. | 14:12 |
Shrews | i knew there was a 'j' and 'h' in there somewhere :) | 14:12 |
Shrews | jeblair: 60s | 14:12 |
jeblair | i can attest that all j names are the same | 14:12 |
jeblair | Shrews: i see the bug | 14:20 |
jeblair | Shrews: the cleanup thread hit an exception while trying to lock the node allocated to the request: http://paste.openstack.org/show/623109/ | 14:21 |
jeblair | Shrews: there's no protection in http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n391 to unlock the node request in case of an exception | 14:22 |
jeblair | Shrews: so the request stayed locked after that and nothing touched it again. | 14:22 |
*** hashar is now known as hasharAway | 14:27 | |
Shrews | jeblair: ah. wow, good find | 14:27 |
jeblair | Shrews: you want to work on sprinkling some try/finally clauses in there? | 14:28 |
Shrews | jeblair: yup, i'll take care of it | 14:29 |
jeblair | cool, thx | 14:29 |
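The actual fix is the change announced just below; purely for illustration, here is a sketch of the try/finally shape being discussed, using assumed helper names (lockNodeRequest, lockNode, and friends) rather than nodepool's exact API.

```python
# Sketch of the try/finally pattern discussed above -- not the actual nodepool
# patch (that is https://review.openstack.org/510594), just the general shape:
# if locking or handling an allocated node blows up, the request lock must
# still be released so a later cleanup pass can pick the request up.
def _handle_lost_request(zk, request):        # hypothetical helper names
    zk.lockNodeRequest(request)               # assumed ZooKeeper wrapper API
    try:
        for node_id in request.nodes:
            node = zk.getNode(node_id)
            zk.lockNode(node)                 # this is where the exception hit
            try:
                node.allocated_to = None
                zk.storeNode(node)
            finally:
                zk.unlockNode(node)
    finally:
        # Without this, an exception left the request locked forever and the
        # cleanup thread never touched it again.
        zk.unlockNodeRequest(request)
```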
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests https://review.openstack.org/510594 | 14:47 |
Shrews | jeblair: ^^^ | 14:47 |
Shrews | actually, i want to make 1 improvement there | 14:50 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add base job and roles for javascript https://review.openstack.org/510236 | 14:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests https://review.openstack.org/510594 | 14:57 |
SpamapS | Shrews: Remember that zookeeper wasn't meant to be "restarted" | 15:54 |
SpamapS | Shrews: so all the recipes assume you have an ensemble that survives forever. | 15:54 |
SpamapS | I think they're getting better about pointing out when something is different for a single server use case. | 15:54 |
SpamapS | But for the most part.. the assumption is that you'll roll a restart through the members. | 15:55 |
jeblair | i've started investigating a hung git operation on ze06 for build 55769796f30e442d817feb96a6854eb1 | 15:56 |
SpamapS | And while zuul is still a spof..and nodepool-launcher is still active/passive at best..it might make sense to look at a ZK ensemble on 3 boxes for us, just for the "one less thing we have to be careful with" | 15:56 |
jeblair | GIT_HTTP_LOW_SPEED_TIME=30 and GIT_HTTP_LOW_SPEED_LIMIT=1000 are in the environment. and it is doing an https fetch. | 16:00 |
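For context, GIT_HTTP_LOW_SPEED_LIMIT / GIT_HTTP_LOW_SPEED_TIME make git abort an HTTP transfer whose rate stays below the given bytes-per-second limit for the given number of seconds. A hedged sketch of applying them to a GitPython fetch; the repo path and remote are illustrative, not Zuul's merger code.

```python
# Sketch: applying the low-speed abort settings to a GitPython fetch.
# Assumption: GitPython's Git.custom_environment() context manager; the repo
# path and remote below are illustrative.
import git

repo = git.Repo('/var/lib/zuul/git/some-project')   # hypothetical path
with repo.git.custom_environment(GIT_HTTP_LOW_SPEED_LIMIT='1000',
                                 GIT_HTTP_LOW_SPEED_TIME='30'):
    # git aborts the transfer if it stays below 1000 bytes/s for 30 seconds;
    # note this only covers slow HTTP transfers, not the internal hang seen here.
    repo.remotes.origin.fetch()
```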
*** maxamillion has quit IRC | 16:18 | |
*** maxamillion has joined #zuul | 16:18 | |
mrhillsman | sooooo...no python3 support :) | 16:31 |
SpamapS | mrhillsman: eh? | 16:36 |
mrhillsman | dib - was wondering why image build kept failing | 16:37 |
mrhillsman | with no module errors | 16:37 |
mrhillsman | and some other stuff | 16:38 |
mrhillsman | turns out python3 was the issue - i was using it by default | 16:38 |
AJaeger | does anybody know what's up with 510540? - it's been in the gate for three hours waiting for a node - the change after it has already been tested | 16:55 |
*** harlowja has joined #zuul | 17:01 | |
*** electrofelix has quit IRC | 17:26 | |
fungi | looks like zuulv3 started swapping a bit around 1.5 hours ago, in an effort to save more memory for cache | 17:32 |
fungi | that probably isn't translating to performance degradation though | 17:32 |
fungi | 40mb paged out | 17:32 |
fungi | but the server would clearly use more cache at this point, if it had some | 17:33 |
fungi | granted, we're sitting on a check pipeline with 400+ changes now | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role https://review.openstack.org/509233 | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts https://review.openstack.org/509234 | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic process-test-results role https://review.openstack.org/509459 | 17:33 |
AJaeger | 467 changes in periodic alone... | 17:34 |
AJaeger | fungi, so 403 in check, 467 in periodic, 261 in periodic-stable; and a few in infra pipelines | 17:35 |
AJaeger | which also means that it did not manage to run a single periodic job today - far too busy serving higher prio requests | 17:36 |
fungi | oh, yep i basically can't load the status page well enough to scroll around in it | 17:36 |
AJaeger | fungi: yeah, that page is verrrrry slow. took me some time to find infra-gate and see 510540 waiting... | 17:36 |
fungi | so over 1100 changes in all | 17:36 |
fungi | i'd say given that sort of backlog, memory utilization is remarkably low | 17:37 |
AJaeger | ;) | 17:37 |
fungi | i'm trying to figure out how to look into the situation with 510540 | 17:37 |
fungi | 73 million lines in the current zuulv3 scheduler debug log. that certainly takes a while to load | 17:59 |
AJaeger | wow ;( | 18:00 |
fungi | the logfile is 9.2gib in size | 18:02 |
Shrews | fungi: perhaps it is too busy logging? ;) | 18:02 |
fungi | does lead me to wonder | 18:03 |
fungi | aside from it checking and not finding any dependencies for 510540,1 the other occasional mentions in the debug log seem to be about it getting reenqueued... which i wonder if that is just how it deals with configuration changes | 18:24 |
Shrews | fungi: did you manage to find any request #'s for it? | 18:24 |
fungi | i need to figure out which one it isn't running. that will take a bit for me to parse it back out of the status | 18:25 |
Shrews | holy smokes. 6300+ requests outstanding for nodepool | 18:25 |
Shrews | 218 total nodes, only 12 are READY. the rest are BUILDING or IN-USE | 18:27 |
Shrews | maybe at total capacity? | 18:27 |
AJaeger | multinode-integration-centos-7 is the job that is missing for 510540 | 18:27 |
fungi | okay, in that case this is the last request: | 18:28 |
fungi | 2017-10-09 15:21:22,408 INFO zuul.ExecutorClient: Execute job base-integration-centos-7 (uuid: fc4e31ad056c4d64b54b5f3cd7fac86b) on nodes <NodeSet centos-7 OrderedDict([('centos-7', <Node 0000178819 centos-7:centos-7>)])OrderedDict()> for change <Change 0x7fd20240b710 510556,2> with dependent changes [<Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change | 18:28 |
fungi | 0x7fd242ede390 510540,1>] | 18:28 |
Shrews | the ones in-use and locked don't seem to be too old. | 18:28 |
AJaeger | might be - but the gate has highest priority, so the changes after it got nodes (and those were added 30 mins later) | 18:28 |
fungi | oh, wait, that's not the one | 18:29 |
fungi | that's base-integration-centos-7 | 18:29 |
fungi | looking further back | 18:29 |
AJaeger | fungi multinode-integration-centos-7 is what you need | 18:29 |
fungi | yeah | 18:29 |
fungi | thanks, i'm still waiting for the status page filtering to complete | 18:29 |
Shrews | fungi: doesn't seem to be a request id in that :( | 18:30 |
AJaeger | fungi, curl -s http://zuulv3.openstack.org/status.json|jq '.' > out | 18:30 |
AJaeger | uuid is 55769796f30e442d817feb96a6854eb1 | 18:30 |
Shrews | request id is of the form: 200-xxxxxxxxxx or 100-xxxxxxxxxx | 18:31 |
Shrews | just wanted to see if maybe it was waiting on a node. i can't map request ID to change ID with the info that's stored in ZK | 18:32 |
fungi | 2017-10-09 15:08:12,415 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: d1ddd1c5e30842859d299983465dd75e) on nodes <NodeSet OrderedDict([('primary', <Node 0000178676 primary:centos-7>), ('secondary', <Node 0000178677 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>), | 18:32 |
fungi | ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd283f29240 509969,1> with dependent changes [<Change 0x7fd20240b710 510556,2>, <Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change 0x7fd242ede390 510540,1>] | 18:32 |
Shrews | but those lead me to believe that it got its nodes? | 18:32 |
Shrews | so... *shrug* | 18:33 |
fungi | no, wait, that's the wrong change | 18:33 |
fungi | just a sec, i'll go regex on this | 18:34 |
Shrews | go go gadget regex! | 18:34 |
fungi | think i'm going to switch to grep, since loading a >9gb file into memory is almost certainly causing undue memory pressure on the server | 18:36 |
*** hasharAway is now known as hashar | 18:36 | |
fungi | oh yeah. cacti confirms | 18:36 |
AJaeger | and we're out of space nearly, aren't we? | 18:37 |
fungi | oh, yikes, yep | 18:38 |
fungi | closing out my filter/viewer pipe freed all of that up | 18:39 |
jeblair | oh wow, so that memory use spike *isn't* zuul | 18:46 |
fungi | right, that was me | 18:46 |
jeblair | neat. that saves some time ): | 18:47 |
jeblair | :) even | 18:47 |
fungi | (: | 18:47 |
fungi | sorry for the false alarm there | 18:47 |
jeblair | and the free space is increasing too | 18:48 |
fungi | also me | 18:48 |
jeblair | we still plan on recovering swap and using it for the zuul log dirs, because we will probably run out of log space in prod | 18:48 |
jeblair | but also, i think we can make the scheduler log more efficient, it'll just take some cleanup work. | 18:48 |
jeblair | (the whole "print out the queue" thing is *extremely* useful once, but not more. finding the right time to print it out when it's useful is the trick) | 18:49 |
jeblair | at any rate, that shouldn't be significantly different than v2, so we don't have to block on it | 18:50 |
fungi | at any rate, i think this was the entry i was originally looking for: | 18:51 |
fungi | 2017-10-09 13:56:10,708 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: 55769796f30e442d817feb96a6854eb1) on nodes <NodeSet OrderedDict([('primary', <Node 0000177867 primary:centos-7>), ('secondary', <Node 0000177868 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>), | 18:51 |
fungi | ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd242ede390 510540,1> with dependent changes [<Change 0x7fd2533fc518 510009,1>] | 18:51 |
fungi | there is no more recent execute for that job on that change, and it's from ~5hrs ago | 18:52 |
jeblair | fungi: you can also grab the build uuid from the console log link | 18:53 |
fungi | but as mentioned in #openstack-infra, i guess this is the one jeblair was looking at back around 14:00z in scrollback | 18:53 |
jeblair | fungi: and you can use ansible to grep for that uuid in all the executor logs to find out which one is running it | 18:53 |
fungi | oh, cool idea | 18:54 |
jeblair | as best as i can tell, git is *not* hung waiting on http traffic; it's in some kind of internal deadlock | 18:55 |
jeblair | fungi: https://etherpad.openstack.org/p/8aKzRXq6ae | 18:59 |
jeblair | after lunch, i'm going to dust off the git hard-timeout change from last week | 19:01 |
jeblair | also, i think i'm done poking at that git process. i'll kill it so things move again after lunch, unless someone objects. | 19:09 |
SpamapS | Does look like it is deadlocked | 19:19 |
mrhillsman | nodepool-builder and nodepool-launcher work fine as long as i have -d; without it they do not run and just close | 19:25 |
mrhillsman | are there any instructions/example of how to get it to dump logs somewhere? | 19:34 |
Shrews | mrhillsman: are you using sudo? it probably can't start up because it can't create its pid file (/var/run/nodepool dir, iirc) | 19:53 |
mrhillsman | using root | 19:54 |
pabelanger | I haven't been around much today (Turkey day here in Canada) but will be around only for the upcoming zuul meeting today | 19:59 |
mrhillsman | i'll ignore it for now, things appear to be working, zuul and nodepool are talking to each other and local openstack environment, just need a job beyond noop to work and will flush out the minor issues later | 20:14 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add upload-npm role https://review.openstack.org/510686 | 20:42 |
*** hashar has quit IRC | 20:43 | |
jeblair | mordred, pabelanger: in order to use the git hard-timeout function, we need pabelanger's fix here: https://github.com/gitpython-developers/GitPython/commit/1dbfd290609fe43ca7d94e06cea0d60333343838 | 20:45 |
jeblair | that's merged to the gitpython master branch | 20:45 |
jeblair | unfortunately, the master branch still has this bug: https://github.com/gitpython-developers/GitPython/issues/605 | 20:46 |
jeblair | i proposed the very basic workaround that Byron suggested here: https://github.com/gitpython-developers/GitPython/pull/686 | 20:47 |
jeblair | it's not good, but it's something | 20:47 |
jeblair | mordred: in doing that, i noticed that it looks like you may have done some work in that area in https://github.com/emonty/GitPython/commit/0533dd6e1dddfaa05549a4c16303ea4b4540c030 but maybe never submitted pull requests for it? | 20:48 |
jeblair | at any rate, i think for the moment we will have to try to run from my fork, until we get a release with both my and pabelanger's fix | 20:49 |
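As a generic illustration of what a hard timeout adds over the low-speed abort -- this is not the actual change 509517, which works through GitPython -- a plain subprocess-based sketch:

```python
# Generic illustration of a hard wall-clock timeout on a git operation.
# A stalled fetch (even one deadlocked internally, like the ze06 case) gets
# killed after `timeout` seconds regardless of transfer speed.
import subprocess

def fetch_with_hard_timeout(repo_dir, remote='origin', timeout=300):
    subprocess.run(
        ['git', 'fetch', remote],
        cwd=repo_dir,
        check=True,
        timeout=timeout,   # raises subprocess.TimeoutExpired and kills git
    )
```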
pabelanger | Ya, I hope to look at my open PR tomorrow for GitPython. Didn't get much time over the weekend for it | 20:50 |
jeblair | pabelanger: they merged your encoding fix to master | 20:50 |
pabelanger | okay cool, I had another for as_process change | 20:51 |
*** jkilpatr has quit IRC | 20:56 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout https://review.openstack.org/509517 | 21:29 |
jeblair | mordred, pabelanger: ^ | 21:30 |
*** jkilpatr has joined #zuul | 21:33 | |
*** Shrews has quit IRC | 21:36 | |
*** Shrews has joined #zuul | 21:39 | |
jeblair | it's zuul meeting time in #openstack-meeting-alt | 22:00 |
Shrews | fungi: fyi, when you get time: https://review.openstack.org/510594 | 22:19 |
fungi | thanks Shrews! | 22:22 |
pabelanger | I'll be dropping offline again for the next 2 hours, but will review issues later this evening | 22:57 |
jeblair | please see the meeting minutes for important information about the openstack-infra roll-forward: http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-10-09-22.03.html | 22:59 |
mrhillsman | so i want to move to a production deployment, i know i need a target openstack cloud, have one | 23:06 |
mrhillsman | how many dedicated servers do i need? | 23:06 |
jeblair | mrhillsman: depends largely on the number of simultaneous jobs you'll run | 23:10 |
mrhillsman | got it, we will not rise to as many as infra runs right now of course | 23:11 |
jeblair | mrhillsman: (which of course is a function of the size of your openstack cloud, and how busy the repos you're testing are) | 23:11 |
mrhillsman | but maybe like 30-50 starting off | 23:11 |
mrhillsman | 30-50 VMs | 23:11 |
mrhillsman | need to prove it in a live environment but want to get the ball rolling | 23:12 |
mrhillsman | only thing left in this PoC is to see an actual job run (other than noop :) ) | 23:12 |
jeblair | mrhillsman: you *can* do everything on one host. for that size deployment, i'd probably use 3 -- one host as a nodepool image builder+launcher (it will probably want more disk than anything else). one zuul scheduler on maybe an 8g vm. and one zuul executor, also probably around 8g. | 23:13 |
mrhillsman | ok cool, thanks | 23:14 |
jeblair | mrhillsman: we've found our 8g executors can handle, we think, maybe around 100 simultaneous jobs. so those scale linearly like that. | 23:14 |
jeblair | mrhillsman: as you start to get larger, dedicated mergers can be useful to handle some of the git load. 1 merger for every 2 executors might be a good ratio. those can be tiny too -- like 2g or even 1g vm. | 23:15 |
mrhillsman | got it | 23:16 |
jeblair | mrhillsman: i'd probably separate the nodepool builder onto its own host if your nodepool cloud quota is > 250 nodes. and add another nodepool launcher after 500 nodes. | 23:16 |
mrhillsman | noted | 23:17 |
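A back-of-the-envelope sketch encoding the rules of thumb above (these are the channel's estimates, not official sizing guidance): roughly 100 simultaneous jobs per 8 GB executor, about one merger per two executors, a separate builder host above ~250 nodes of quota, and a second launcher above ~500.

```python
# Rough capacity-planning helper based on the rules of thumb above.
# These numbers are the channel's estimates, not official guidance.
def suggest_layout(max_simultaneous_jobs, cloud_quota_nodes):
    executors = max(1, -(-max_simultaneous_jobs // 100))  # ~100 jobs per 8 GB executor
    return {
        'scheduler_hosts': 1,                              # ~8 GB VM
        'executor_hosts': executors,                       # ~8 GB VMs
        'merger_hosts': executors // 2,                    # tiny VMs (1-2 GB), optional at small scale
        'launcher_hosts': 1 if cloud_quota_nodes <= 500 else 2,
        'builder_on_own_host': cloud_quota_nodes > 250,
    }

# For the 30-50 job deployment discussed above:
#   suggest_layout(50, 50) -> 1 executor, no dedicated mergers, builder and
#   launcher sharing a host -- i.e. roughly the three-host layout suggested.
```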
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout https://review.openstack.org/509517 | 23:22 |
mrhillsman | in terms of getting the web status page working, do i need to proxy through apache/nginx? | 23:25 |
mrhillsman | simply running zuul-web and pointing at the port does not work unless i have it configured improperly | 23:25 |
mrhillsman | if it should work out the box i'll figure it out, just don't want to go on a rabbit chase | 23:28 |
jeblair | mrhillsman: no, the status page is not straightforward yet. there's pending work to integrate it into zuul-web. but for the moment, there's manual installation steps and proxying may be required. | 23:44 |
mrhillsman | ok cool, i'll work to figure it out, ty sir | 23:45 |
jeblair | mrhillsman: i think it's all in the etc/ directory of the zuul repo | 23:50 |
mrhillsman | ah ok, was looking in the wrong dir, thx for that | 23:50 |
jeblair | yeah, it's.... not like the rest :) | 23:52 |
mrhillsman | ;) | 23:53 |
jeblair | fungi, mordred, pabelanger: even with my fix to gitpython, simple unit tests still take 2x as long. i think there's more at play. i think i might be inclined to make a new local fork of gitpython that has 2.1.1 with pabelanger's fix, and run with that until we figure out a long-term solution. but i expect that to take more than just a couple of days. | 23:55 |
jeblair | all told, i don't think that needs to change anything about our plan -- we still run with a local fork for a bit. i think it just changes expectations around what step 2 is. | 23:55 |
fungi | okay | 23:57 |
fungi | i'm fine with that, even if it's not quite as nice as we'd hoped | 23:58 |