openstackgerrit | Tristan Cacqueray proposed zuul/nodepool master: static: enable using a single host with different user or port https://review.opendev.org/659209 | 00:01 |
openstackgerrit | Merged zuul/zuul master: Update quickstart nodepool node to python3 https://review.opendev.org/658486 | 02:52 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Add more test coverage on using python-path https://review.opendev.org/659812 | 03:31 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 07:00 |
ofosos | Is there any way I can tell the executor what SSH key to use? In theory Bitbucket has an API for uploading SSH keys and I would like to use that to upload the Zuul key to Bitbucket. | 07:47 |
tristanC | ofosos: for gerrit, there is a sshkey option that can be set per connection in the zuul.conf | 08:09 |
ofosos | tristanC: i'll have a look | 08:15 |
ofosos | Thanks | 08:15 |
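[editor's note] For reference, the per-connection `sshkey` option mentioned above goes in the connection section of zuul.conf; the server name and key path below are placeholders, not values from this conversation:

```ini
[connection gerrit]
driver=gerrit
server=review.example.com
user=zuul
sshkey=/var/lib/zuul/ssh/id_rsa
```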
openstackgerrit | Matthieu Huin proposed zuul/zuul master: web: add tenant and project scoped, JWT-protected actions https://review.opendev.org/576907 | 08:38 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Allow operator to generate auth tokens through the CLI https://review.opendev.org/636197 | 09:20 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul CLI: allow access via REST https://review.opendev.org/636315 | 09:31 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 10:49 |
badboy | any ideas what's causing this: | 10:52 |
badboy | AttributeError: type object 'EllipticCurvePublicKey' has no attribute 'from_encoded_point' | 10:52 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 10:58 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 11:20 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 13:27 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul CLI: allow access via REST https://review.opendev.org/636315 | 13:34 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Add Authorization Rules configuration https://review.opendev.org/639855 | 13:34 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Web: plug the authorization engine https://review.opendev.org/640884 | 13:35 |
fungi | badboy: are you seeing that with a gerrit connection? i want to say we've seen broken ecc implementation in some gerrit versions | 13:55 |
fungi | badboy: oh, looks like that could be a mismatch with the installed version of pyca/cryptography | 13:57 |
fungi | you may be running too old of a version? | 13:57 |
fungi | what version of cryptography does pip say is installed? | 13:58 |
fungi | also, sticking the full traceback on http://paste.openstack.org/ would help provide some context | 13:58 |
fungi | https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst#25---2019-01-22 suggests you need at least 2.5 (from january of this year) for that method | 14:02 |
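[editor's note] A quick way to verify the installed cryptography version is new enough, sketched with only the standard library; the `cryptography` package itself may or may not be installed in a given environment:

```python
# A sketch: check whether pyca/cryptography >= 2.5 is installed, the release
# that added EllipticCurvePublicKey.from_encoded_point per its changelog.
from importlib.metadata import PackageNotFoundError, version

def version_at_least(version_string, required=(2, 5)):
    """Compare the major.minor prefix of a version string against a tuple."""
    prefix = tuple(int(part) for part in version_string.split(".")[:2])
    return prefix >= required

def has_from_encoded_point():
    """True if the installed cryptography should provide from_encoded_point."""
    try:
        return version_at_least(version("cryptography"))
    except PackageNotFoundError:
        return False  # package not installed at all
```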
pabelanger | dmsimard: have you seen this ARA failure before? Looks to be an encoding issue when generating html: https://logs.zuul.ansible.com/89/57789/8d9f8e0547417362c0241ab039e360035b778478/third-party-check-silent/ansible-test-network-integration-ios-python27/bc7e0b5/job-output.html#l8652 | 14:27 |
dmsimard | pabelanger: I have not, I thought all those encoding issues had been ironed out :D | 14:30 |
pabelanger | dmsimard: yah, this is the first time I've seen it happen, we've been using ARA for ansible-test for some time. Will dig more into it | 14:31 |
pabelanger | I know it does some odd things with directory names for testing | 14:31 |
dmsimard | pabelanger: oh, it might be https://github.com/ansible-community/ara/issues/48 then -- that's >1.0 though but it's possible 0.x is also impacted | 14:32 |
dmsimard | it was also for a filesystem path with non-ascii characters | 14:32 |
dmsimard | (who does that?) | 14:32 |
dmsimard | I ran the ansible integration test suite against 1.x but not 0.x -- I should be able to reproduce | 14:33 |
pabelanger | dmsimard: yup, that likely is it | 14:35 |
pabelanger | let me confirm we have that non-ascii chars disabled for ansible integration testing | 14:35 |
pabelanger | I also don't know why they do it | 14:35 |
dmsimard | #ansible-devel said it's because it bubbles up bugs like this one :p | 14:36 |
pabelanger | that is true | 14:36 |
smcginnis | Daily third party CI question... :) | 14:43 |
smcginnis | If I want to use the devstack job in my local zuul instance from https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L343 | 14:44 |
smcginnis | I've added it to my untrusted-projects, but trying to define a job locally that inherits from it results in "Job devstack not defined". | 14:44 |
smcginnis | Is there something else I need to do in order to be able to use that? | 14:45 |
pabelanger | smcginnis: in your tenant config for devstack, did you allow loading of jobs? | 14:45 |
smcginnis | pabelanger: Ah, just noticing... is that "include: - job"? | 14:46 |
pabelanger | yah | 14:46 |
smcginnis | OK, nope, I did not do that part. Trying now. | 14:46 |
smcginnis | pabelanger: That job has a long list of required-projects. Will I also need to add those in my zuul config as untrusted-projects? | 14:47 |
pabelanger | yes | 14:47 |
smcginnis | OK, thanks. That saves me some digging then. I'll get all of that added and try things out. Thanks! | 14:47 |
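[editor's note] A tenant-config sketch of what pabelanger describes; the tenant name, connection name, and extra project are hypothetical, and the `include` list is what restricts which configuration item types Zuul loads from the project:

```yaml
- tenant:
    name: local
    source:
      opendev:
        untrusted-projects:
          - openstack/devstack:
              include:
                - job
                - nodeset
          # ...plus one entry per project in the job's required-projects, e.g.:
          - openstack/tempest
```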
pabelanger | I started testing devstack on rdoproject zuul a while back, but don't think I got it working 100%. Let me look to see if I can find the code | 14:48 |
smcginnis | Oh, great. Or if there's some other way - I ultimately need to be get things running and run tempest against it for third party CI testing. | 14:50 |
pabelanger | smcginnis: looks like I just added the tenant config, which you already know. Seems I didn't import all the required-projects, and code seems to have been removed now from rdoproject | 14:52 |
smcginnis | pabelanger: OK, no worries. This is a great start, so I'll keep fiddling. Thanks for looking. | 14:52 |
pabelanger | smcginnis: the job _should_ load properly, that was some of the work that needed to be done. However, you also might run into issues with missing nodesets, which you'll need to also define locally | 14:53 |
smcginnis | Ah, I didn't think about that. | 14:53 |
clarkb | devstack has its own nodesets | 14:53 |
pabelanger | yah, zuul does a good job at saying what doesn't work :) | 14:53 |
smcginnis | I'll see if it makes sense to get all that matched up, or just define a local job and hope it doesn't diverge too much over time. | 14:54 |
fungi | as long as the nodeset is defined zuul will be happy. you don't actually need nodes matching those provided by nodepool if you're not actually going to run the jobs declaring they use them | 14:57 |
corvus | smcginnis: it's going to be some upfront investment to get the list of all the things you need to add, but i think it's going to be worth it. | 14:58 |
corvus | yeah, nodesets are something that may make sense to override locally | 14:58 |
fungi | someone else was working on putting together exactly the same list for the base devstack job... who was it? | 14:58 |
fungi | maybe they've already done that legwork now | 14:58 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 14:59 |
corvus | fungi: several people have done it, i have not seen it shared yet. | 15:09 |
smcginnis | OK, I'll keep working towards that then and hopefully get it all documented well. | 15:11 |
Shrews | corvus: i have no idea what's going on with the plugin tests, but https://review.opendev.org/663762 has seen so many random failures with it. Earlier failures were timeout related. The new one post-fungi's timeout fix is now: http://logs.openstack.org/62/663762/10/check/tox-py35/ec8a51c/job-output.txt.gz#_2019-06-13_15_08_38_574336 | 15:24 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 15:24 |
corvus | Shrews: that means an executor is still running a job | 15:25 |
Shrews | corvus: seems unrelated to my change. and the following change passed all tests | 15:27 |
corvus | Shrews: could it be related to http://logs.openstack.org/62/663762/10/check/tox-py35/ec8a51c/job-output.txt.gz#_2019-06-13_15_19_28_066472 ? | 15:28 |
corvus | Shrews: and also http://logs.openstack.org/62/663762/10/check/tox-py35/ec8a51c/job-output.txt.gz#_2019-06-13_15_08_58_190771 | 15:29 |
Shrews | subunit exception. neat | 15:29 |
Shrews | i wonder if there was a recent release of subunit | 15:30 |
corvus | Shrews: i think the test needs to be split. let's just cut it in half. | 15:30 |
corvus | Shrews: that will address both timeouts and subunit report lengths. | 15:31 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 15:31 |
corvus | fungi: ^ | 15:31 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 15:32 |
fungi | will take a look after the meetings i'm in. we could just revert for now | 15:33 |
Shrews | corvus: by "cut in half", you mean a new test_plugins() test with half of the entries in the plugin_tests array? | 15:34 |
corvus | Shrews: yep | 15:35 |
corvus | fungi: i think it'll be just as easy to split as it would be to revert | 15:36 |
Shrews | i'll toss a change up | 15:36 |
fungi | thanks Shrews! | 15:36 |
fungi | i hadn't yet looked closely enough at how the framework for that test was done to work out how much duplication/abstraction would be needed to split it | 15:37 |
corvus | clarkb, pabelanger, fungi, Shrews, tobiash: did we decide yesterday that we should release zuul now? how about nodepool? | 15:38 |
fungi | i think there was a suggestion to restart the opendev deployment on the current state. i expect we might as well lump nodepool in | 15:38 |
fungi | though i haven't looked to see what's landed in nodepool since the last tag | 15:38 |
corvus | enough for a release i think https://zuul-ci.org/docs/nodepool/releasenotes.html | 15:39 |
pabelanger | corvus: +1 for zuul release | 15:39 |
corvus | okay, so the plan is: restart all of opendev today, release both later today or tomorrow? | 15:39 |
pabelanger | wfm | 15:40 |
pabelanger | also, we've been using nodepool 3.6.1.dev16 without any issues, so +1 for tagging that too | 15:40 |
openstackgerrit | David Shrewsbury proposed zuul/zuul master: Split plugin tests into two https://review.opendev.org/665161 | 15:41 |
fungi | yeah, the conclusion with the zuul memory leak we're seeing in opendev is that it took a couple weeks to manifest after scheduler restart the last couple times it bit us, so it'll probably be a while before it crops up again and we shouldn't delay the release waiting for that | 15:41 |
Shrews | corvus: i'm not sure which version of nodepool we are running in production and if we're in sync with master | 15:42 |
Shrews | lemme check | 15:42 |
Shrews | looks like status logs says we last restarted launchers on 58a2a2b68c58f9626170795dba10b75e96cd551 to pick up memory leak fix | 15:43 |
Shrews | err f58a2a2b68c58f9626170795dba10b75e96cd551 | 15:44 |
Shrews | that's the 3.6.0 tag | 15:44 |
corvus | i think that's old enough to warrant a restart | 15:44 |
Shrews | corvus: agreed. which means "no" on nodepool release just yet | 15:45 |
corvus | Shrews: we need to restart zuul too, so i was thinking we'd restart both today and release tomorrow; how's that sound? | 15:45 |
Shrews | corvus: fine by me. if there's a nodepool problem, it should be spotted rather quickly | 15:46 |
ofosos | I get 'something went wrong' from zuul.openstack.org | 15:56 |
corvus | ofosos: please see #openstack-infra for information related to that service | 15:57 |
corvus | pabelanger, tobiash: http://paste.openstack.org/show/752891/ | 15:57 |
clarkb | catching up after getting kids out the door for school and ya that plan sounds good to me | 15:57 |
pabelanger | corvus: oh, was that a retry? | 15:58 |
corvus | pabelanger: no idea | 15:58 |
clarkb | corvus: I feel like that should go under "achievement unlocked" re abuse against github by zuul | 15:59 |
clarkb | did that happen after the restart? If so the multiprocessing change may send too many requests at once to github? | 15:59 |
corvus | clarkb: yes | 15:59 |
pabelanger | I wonder if ad668d74-8df3-11e9-93ab-4ff1818b4f8e got 502 Server error, then we sleep(1) and tried again | 15:59 |
smcginnis | OK, I'm feeling dumb now. Where do I need to define things to get rid of 'Unable to freeze job graph: The nodeset "openstack-single-node-bionic" was not found.' | 16:00 |
corvus | pabelanger: did you log retries? | 16:00 |
clarkb | corvus: yes they shouldbe logged | 16:00 |
pabelanger | corvus: yah, you should see it as exception | 16:00 |
corvus | smcginnis: define a nodeset called "openstack-single-node-bionic" | 16:00 |
corvus | pabelanger: feel free to dig, i have too many windows open at the moment running the restart | 16:00 |
smcginnis | corvus: Where is that actually done. I thought I did, but I still get the error. | 16:00 |
pabelanger | https://review.opendev.org/664843/ might be why | 16:00 |
corvus | smcginnis: it can be in any repo in the tenant | 16:01 |
pabelanger | corvus: sure, looking now | 16:01 |
clarkb | pabelanger: I think it more likely the multiprocessing change is to blame | 16:01 |
clarkb | pabelanger: I would think a second delay between requests is plenty. But sending ~20 (or however many threads are in the pool) requests at once may make it unhappy | 16:01 |
tobiash | corvus: you hit the rate limit? | 16:01 |
pabelanger | clarkb: oh, maybe | 16:02 |
tobiash | Oh and the retry succeeded :) | 16:02 |
tobiash | corvus: re release, did the python interpreter work land? | 16:03 |
pabelanger | tobiash: where do you see the succeed? | 16:03 |
tobiash | pabelanger: maybe i misinterpreted the log | 16:03 |
pabelanger | tobiash: Oh, I think you are right | 16:04 |
corvus | tobiash: i believe so: Merge "executor: use node python path" | 16:04 |
pabelanger | tobiash: let me look at code again | 16:05 |
corvus | tobiash: i'm under the impression that both sides of that will be present in the nodepool and zuul releases, but if i'm wrong, let me know :) | 16:05 |
tobiash | Ah yes, so then ++ for release after burn in | 16:05 |
pabelanger | tobiash: actually, I don't think we retried. Looking at 664843 we'd have to add github3.exceptions.ForbiddenError too. But right now we don't trap generic github exceptions. cc clarkb | 16:06 |
tobiash | corvus: correct, just wanted to make sure that the zuul part is not missing :) | 16:06 |
corvus | ++ | 16:06 |
clarkb | pabelanger: ya I think that behavior is correct there | 16:06 |
clarkb | pabelanger: retrying would only make the abuse perception worse | 16:06 |
pabelanger | yah | 16:06 |
tobiash | Github timeout and retry is also in so ++ for burn in and release | 16:07 |
pabelanger | clarkb: I think you might be right, I don't see a previous failure on ad668d74-8df3-11e9-93ab-4ff1818b4f8e in zuul logs. So maybe your comment about the multiprocessing change is to blame. | 16:09 |
pabelanger | tobiash:^ | 16:09 |
clarkb | pabelanger: for retries and abuse detection I think we may want a backoff that is more sophisticated than a sleep(1) | 16:09 |
clarkb | like sleep with increasing backoff if we detect that case or something | 16:09 |
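[editor's note] A sketch of the kind of backoff clarkb suggests (this is not Zuul's actual retry code): exponential delays with a cap, optionally seeded from a server-provided Retry-After value:

```python
# Exponential backoff sketch: double the delay each attempt, up to a cap.
# If the server sent a Retry-After hint, start from that instead of `base`.
def backoff_delays(attempts, base=1.0, cap=60.0, retry_after=None):
    """Return the delay (seconds) to sleep before each retry attempt."""
    start = retry_after if retry_after is not None else base
    return [min(cap, start * (2 ** n)) for n in range(attempts)]
```

Production retry loops usually also add random jitter on top of this so many clients don't retry in lockstep.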
pabelanger | clarkb: yah, I'm kinda curious why ratelimit didn't help here | 16:10 |
tobiash | And I think there is still some potential to optimize away a few requests | 16:10 |
tobiash | We don't do rate limiting afaik | 16:11 |
tobiash | We only log it | 16:11 |
pabelanger | okay, I see a retry attempt: dc71183c-8df3-11e9-97c9-52bfbc81ffb5 | 16:11 |
pabelanger | looking at logs | 16:11 |
pabelanger | tobiash: clarkb: :( http://paste.openstack.org/show/752892/ | 16:15 |
pabelanger | that is a retry attempt | 16:15 |
pabelanger | from a 502 Server Error | 16:15 |
pabelanger | so, sleep(1) doesn't seem to be enough time | 16:15 |
pabelanger | time to read docs on why that is | 16:16 |
corvus | tobiash: re log annotations -- did you happen to see a way to get the tracebacks formatted like the other lines? (so they show up in a grep?) | 16:19 |
pabelanger | jlk: maybe you also have suggestion about 502 Server Error we get back from github api. We created, https://review.opendev.org/664843/ but now look to be tripping the abuse detection mechanism. | 16:21 |
fungi | pabelanger: what sort of api query is causing that? | 16:24 |
corvus | pabelanger, tobiash, jlk: here's the expanded log entries for that event: http://paste.openstack.org/show/752893/ | 16:24 |
corvus | fungi: our old friend "getPullBySha" -- the info that everyone (github internal devs included) really wants included in the event | 16:25 |
fungi | got it | 16:25 |
pabelanger | yah | 16:25 |
pabelanger | a quick google says we _should_ get Retry-After header back | 16:25 |
fungi | so it's a read operation | 16:25 |
pabelanger | but need to confirm that | 16:25 |
jlk | is it timing out or is the 502 immediate? | 16:26 |
jlk | My team was CCd on an issue that looks like there is a recent spike in somewhat immediate 502s | 16:26 |
corvus | jlk: i think it took a little over a minute to get the 502 back, if i'm reading the logs right; i'll double check that | 16:27 |
pabelanger | right, we now see it more because we've also bumped up the default_read_timeout to 300: https://review.opendev.org/664667/ | 16:27 |
corvus | yeah, in http://paste.openstack.org/show/752893/ "Handling status event" is right before the call, and "Failed handling" is right after | 16:27 |
corvus | pabelanger: right, if someone was using github3.py with the defaults, they would have hit the 10 second read timeout before getting the 502 | 16:28 |
pabelanger | talking to some ansibullbot folks, they say there is an undocumented ~20 POST per minute, before hitting abuse things. Maybe we are also hitting that now | 16:32 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 16:36 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 16:37 |
smcginnis | Is there a config option somewhere that controls allowed disk space? I see I'm getting aborts now from ExecutorDiskAccountant because the limit is set to 250mb. | 16:43 |
clarkb | smcginnis: https://zuul-ci.org/docs/zuul/admin/components.html#attr-executor.disk_limit_per_job | 16:45 |
smcginnis | Perfect, thanks clarkb | 16:46 |
clarkb | that should be plenty for devstack jobs last I checked | 16:46 |
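[editor's note] For anyone finding this later, the knob clarkb links lives in the `[executor]` section of zuul.conf; the value below is just an illustrative size in MiB, and -1 disables the check entirely:

```ini
[executor]
disk_limit_per_job=5120
```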
clarkb | pabelanger: POST would be to leave comments on PRs? | 16:47 |
smcginnis | It looks like it's happening while checking out the other repos. The one I have on screen right now was Checking out openstack/cinder, Checking out master, ExecutorDiskAccountant warning using 544MB (limit=250) | 16:47 |
clarkb | pabelanger: the searching should all be GETs right? | 16:47 |
pabelanger | clarkb: yah, that is right. So, I'm now looking into search api | 16:48 |
pabelanger | because i do see a bit of google hits around search and abuse | 16:48 |
clarkb | smcginnis: hrm I didn't think the git repos counted against that | 16:48 |
pabelanger | maybe we are missing something with rate-limit | 16:48 |
pabelanger | with new multiprocessing change | 16:48 |
corvus | smcginnis: that can happen if the executor mounts are misconfigured -- https://zuul-ci.org/docs/zuul/admin/components.html#attr-executor.git_dir and https://zuul-ci.org/docs/zuul/admin/components.html#attr-executor.job_dir need to be on the same filesystem | 16:49 |
smcginnis | clarkb: I was a little surprised to see them all checked out there. | 16:49 |
pabelanger | I have to relocate, but plan to keep looking. We do seem to be hitting abuse message on zuul.o.o often | 16:49 |
pabelanger | back shortly | 16:49 |
fungi | smcginnis: normally you would deploy it so the workspace and the git cache are on the same fs, and then git will just make hardlinks when cloning | 16:54 |
fungi | if they are not on the same fs, git will copy all the data | 16:54 |
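[editor's note] That layout might look like this in zuul.conf (paths are illustrative); keeping `git_dir` and `job_dir` on one filesystem lets git hardlink objects into the workspace instead of copying them:

```ini
[executor]
git_dir=/var/lib/zuul/executor-git
job_dir=/var/lib/zuul/builds
```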
smcginnis | fungi: I'm just running the containers from the doc/source/admin/examples docker-compose setup, so I would have thought it would all be on the same fs. | 16:55 |
corvus | smcginnis: can you share your docker-compose.yaml file and your zuul.conf? | 16:56 |
corvus | smcginnis: also "docker exec -it mount examples_executor_1" may be helpful | 16:58 |
smcginnis | "Error: No such container: mount" | 16:58 |
smcginnis | Getting the rest... | 16:59 |
corvus | er, other way around then | 16:59 |
corvus | "docker exec -it examples_executor_1 mount" | 16:59 |
smcginnis | Heh, yep. Sorry, didn't really look at the command when I ran it. That makes more sense. | 16:59 |
fungi | mattw4 ran into this exactly a week ago too: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2019-06-06.log.html#t2019-06-06T20:22:08 | 16:59 |
smcginnis | corvus: Adding to https://etherpad.openstack.org/p/yvyRWS72JG | 17:00 |
fungi | though in that case it seems to have been caused by a spurious /var/lib/zuul bindmount | 17:01 |
fungi | so i guess same symptom, different underlying misconfiguration | 17:01 |
corvus | smcginnis: try running "df" in the container | 17:03 |
smcginnis | In the executor? | 17:03 |
corvus | yeah | 17:03 |
smcginnis | corvus: Added to the bottom. | 17:03 |
smcginnis | No I just get NODE_FAILURE | 17:06 |
corvus | this is really weird... | 17:08 |
smcginnis | I noticed that I had one DISK_FULL failure, but now the last couple attempts were NODE_FAILUREs. | 17:09 |
corvus | i'm focused on the disk issue | 17:09 |
corvus | it's going to be really easy to sweep that under the rug; it needs to be fixed | 17:10 |
corvus | when i run docker-compose locally, i'm seeing mounts in containers which i don't expect | 17:10 |
tobiash | corvus: re log annotations, I think I saw a change in nodepool that does something like this | 17:10 |
smcginnis | Just want to warn that its root cause may have gone away, since the configs I pasted appear to have gotten past the DISK_FULL error and are hitting NODE_FAILURE instead. | 17:10 |
corvus | smcginnis: but you disabled the disk limit? | 17:11 |
corvus | anyway, give me a minute, i'm trying to put together a demonstration of what i'm seeing that is weird | 17:12 |
smcginnis | corvus: I did now. I can remove disk_limit_per_job and restart to see if the DISK_FULL error comes back, but I just am not sure right now if that went away before or after that change. | 17:12 |
corvus | smcginnis, fungi, tobiash: this doesn't make sense to me: https://etherpad.openstack.org/p/yvyRWS72JG lines 225-237 | 17:15 |
corvus | that's in my executor container; there should be no /var/lib/zuul mount there | 17:15 |
tobiash | that's weird | 17:17 |
corvus | this seems to match the behavior that smcginnis is seeing too -- smcginnis, if you run "df /var/lib/zuul" does it also show you that the fs is mounted on /var/lib/zuul ? | 17:17 |
smcginnis | corvus: Is it from here: https://opendev.org/zuul/zuul/src/branch/master/Dockerfile#L43 | 17:18 |
tobiash | corvus: does the dockerfile specify /var/lib/zuul as volume? | 17:18 |
fungi | indeed, that's the presumed spurious /var/lib/zuul mount mattw4 had | 17:18 |
corvus | oh, it does... | 17:18 |
fungi | he said he thought he'd mounted it to give access to ssh keys | 17:18 |
fungi | but maybe not? | 17:18 |
smcginnis | corvus: /dev/vda1 40470732 6877524 33576824 18% /var/lib/zuul | 17:18 |
corvus | that would do it | 17:18 |
tobiash | because the scheduler container specifies it (line 61) and there is probably some automagic connection | 17:18 |
corvus | okay, given that, i think i understand the patch that's needed. i'll push it up and we can see if we agree | 17:19 |
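[editor's note] A compose-file sketch of the shape of the problem and fix (service and volume names are made up, and this is not necessarily what the actual patch did): since the Dockerfile declares `/var/lib/zuul` as a VOLUME, mounting one named volume over that whole path keeps the git cache and job workspaces on the same filesystem:

```yaml
services:
  executor:
    volumes:
      - zuul-var:/var/lib/zuul
volumes:
  zuul-var:
```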
smcginnis | So, should the reason for my current NODE_FAILURE show up in the executor log, or should I be looking somewhere else to figure out why that's happening? | 17:22 |
tobiash | corvus: re log annotations, this is the nodepool change I meant: https://review.opendev.org/613196 | 17:22 |
tobiash | but it is using a custom formatter | 17:23 |
corvus | smcginnis: probably nodepool launcher, or if not, possibly scheduler | 17:23 |
smcginnis | k, thanks. I'll look | 17:23 |
corvus | smcginnis: (the executor doesn't go into action until the scheduler hands it a node which it gets from the nodepool launcher) | 17:24 |
smcginnis | OK, that makes sense. So if the node has a failure along the way, it never gets sent over. | 17:25 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Correct quick-start executor volume error https://review.opendev.org/665186 | 17:25 |
corvus | smcginnis: yep. which is why the most likely place to find the error is the launcher, but if that's inconclusive, the scheduler should know why it declared it a node failure | 17:26 |
corvus | smcginnis, fungi, tobiash, mattw4: see https://review.opendev.org/665186 | 17:26 |
fungi | yup, saw you push and just finished reviewing | 17:27 |
fungi | thanks!!! | 17:27 |
corvus | (i still think we should change the executor default, but i think it's safer to make this change quicker) | 17:27 |
tobiash | ++ for changing the default | 17:27 |
corvus | smcginnis, mattw4: thanks for helping us find that; that was rather subtle, and i'm sorry you had to run into it for us to see it | 17:28 |
smcginnis | Glad some of this has been useful! | 17:29 |
mattw4 | me too! You all have helped a tremendous help too!! | 17:30 |
mattw4 | I am a native English speaker so I have no excuse for the grammar ^ :) | 17:31 |
smcginnis | :D | 17:31 |
fungi | i make excuses for my grammar all the time | 17:32 |
corvus | it's okay, that sentence made me feel very helpful :) | 17:35 |
fungi | extra helpful even | 17:37 |
smcginnis | Just FYI, the NODE_FAILURE I was hitting was found in the executor logs and it was due to not being able to deploy the openstack-single-node-bionic nodeset defined by the devstack job. So makes sense, just needed to figure out where to look for the error. Thanks again. | 17:38 |
mattw4 | Tremendously! :) | 17:39 |
mattw4 | smcginnis, I kinda faked it by defining a nodeset with that name and supplying my own node label in the definition. | 17:40 |
mattw4 | Scheduler complains that some nodes are undefined, but I don't need those nodes for my jobs. I'm not sure if that is a problem, but it doesn't seem to impact the tests that I'm running ATM | 17:41 |
smcginnis | mattw4: Good call, I think that was my mistake of not setting the label right. | 17:41 |
fungi | i think that highlights a rough patch in the job sharing model, not sure if anyone's yet thought through how to deal with reusing jobs that specify node labels which may not be relevant in the consumer's context | 17:42 |
smcginnis | Seems like you need to be able to separately share the jobs and their resource requirements with the nodes and what resources they can provide. | 17:43 |
fungi | smcginnis: oh, so i guess we're missing something to make /var/lib/zuul/builds a valid job_dir? | 17:43 |
smcginnis | But that's a drastic oversimplification. | 17:43 |
smcginnis | fungi: Yeah, looks like it. | 17:44 |
fungi | does it need to be created first? | 17:44 |
smcginnis | So it would appear. | 17:44 |
*** jamesmcarthur has quit IRC | 17:58 | |
tobiash | does anybody know the book Powerful Python by Aaron Maxwell? | 18:02 |
tobiash | I just stumbled across it and I'm wondering if anybody would recommend reading it | 18:03 |
clarkb | corvus: re the volume thing, do we think it would be better to have flexibility in the deployment and have docker-compose or similar do the specification rather than the image? | 18:04 |
clarkb | I guess the problem with that is then people have to know to add it to compose or whatever | 18:04 |
clarkb | so better off in the image | 18:04 |
smcginnis | clarkb: Umm, you just approved the patch that has an error. Maybe I should have left -1 instead of just commenting. Might want to remove approval on that one. | 18:07 |
clarkb | smcginnis: done | 18:07 |
clarkb | smcginnis: and ya that is what -1 is for :P | 18:07 |
smcginnis | :) | 18:08 |
clarkb | fungi's comment is probably on the money for why it isn't working | 18:08 |
smcginnis | It was a "I'm getting an error but could be convinced it's just me" 0. | 18:08 |
clarkb | Because that is a volume mount we can't mkdir it during the build so I think we have to add that to the init script thing | 18:09 |
*** ianychoi has quit IRC | 18:13 | |
corvus | or have the executor create it | 18:14 |
tobiash | I'd vote for the executor | 18:14 |
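The executor-creates-it option corvus and tobiash favor could be as small as ensuring the directory exists at startup. A minimal sketch (the function name and default path are illustrative, not Zuul's actual code):

```python
import os

def ensure_job_dir(job_dir="/var/lib/zuul/builds"):
    # Creating the directory at executor startup works even when the path
    # is a container volume mount, which a docker image build cannot
    # populate -- the mount shadows whatever the image put there.
    os.makedirs(job_dir, exist_ok=True)
    return job_dir
```

This is also idempotent, so it is safe to run on every restart.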
corvus | mattw4, smcginnis, fungi: defining a local nodeset that satisfies what upstream jobs like devstack needs is exactly what i would expect. and if you don't actually need to use it, you could define it with "nodes: []". | 18:15 |
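For reference, a local stand-in nodeset like the one corvus describes might look roughly like this (the label name is a placeholder for whatever your Nodepool provides; the `nodes: []` form is the empty variant he mentions for when the nodes aren't actually needed):

```yaml
# Local stand-in so upstream jobs referencing this nodeset can load.
- nodeset:
    name: openstack-single-node-bionic
    nodes:
      - name: primary
        label: my-local-bionic-label  # hypothetical local label

# Or, if the jobs using it never run here:
# - nodeset:
#     name: openstack-single-node-bionic
#     nodes: []
```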
corvus | i'll work on an update to the patch; i'll probably just go ahead and switch the default, since it's going to involve executor code changes | 18:20 |
*** bhavikdbavishi has quit IRC | 18:23 | |
*** michael-beaver has joined #zuul | 18:23 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Change default job_dir location https://review.opendev.org/665186 | 18:28 |
pabelanger | clarkb: so, looking at github3.py, we should be able to inspect the exception for response headers on 502, I'm trying to see if there is 'Retry-After', if so we can use that value for our sleep. | 18:28 |
clarkb | pabelanger: ++ | 18:29 |
pabelanger | clarkb: otherwise, maybe we need a better backoff process as you mentioned before | 19:29 |
corvus | tobiash, smcginnis, fungi, mattw4: okay, there's a slightly more substantial change ^ since that will need new images, etc, i'd suggest smcginnis and mattw4 just manually "mkdir /var/lib/zuul/builds" on the executor (since it's a volume, that will persist) and set the job_dir value in zuul.conf as in the previous patch. then after that change merges, you should be able to undo that. | 18:30 |
corvus | pabelanger: sounds good -- also, be thinking about whether we should hold the release for this (i'm inclined to -- this is the sort of thing we hope to catch by burning in on opendev). | 18:31 |
mattw4 | sounds good corvus, I will do that | 18:32 |
pabelanger | corvus: +1, I think we'll need to fix this before releasing | 18:32 |
Shrews | i guess our zuul timeouts are still not long enough? http://logs.openstack.org/61/665161/1/gate/tox-py36/c0ebbc7/job-output.txt.gz#_2019-06-13_18_15_58_351905 | 18:32 |
corvus | Shrews: wow, that was a job timeout | 18:33 |
openstackgerrit | Mark Meyer proposed zuul/zuul master: Extend event reporting https://review.opendev.org/662134 | 18:39 |
*** jamesmcarthur has joined #zuul | 18:56 | |
*** hashar has joined #zuul | 19:02 | |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Improve retry handling for github driver https://review.opendev.org/665220 | 19:21 |
clarkb | pabelanger: were you able to check if retry after is ever present? | 19:22 |
pabelanger | corvus: clarkb: tobiash: jlk: ^ is my first attempt to deal with 502 / 403 github errors. Based on things I am reading on the web, and some manual testing 'retry-after' was there | 19:22 |
pabelanger | clarkb: yah, let me get paste | 19:22 |
pabelanger | clarkb: but I am not sure if it is on 502 error | 19:22 |
clarkb | cool that explains the fallback | 19:23 |
clarkb | note that will cause a 5 minute backup if it never recovers from the 502 and there isn't shorter retry after values | 19:23 |
pabelanger | http://paste.openstack.org/show/752900/ | 19:23 |
clarkb | (I think we can probably test with this and see if that causes problems) | 19:23 |
*** jamesmcarthur has quit IRC | 19:24 | |
pabelanger | clarkb: yah, I didn't actually wait 60 seconds, so maybe we should add a little buffer? | 19:24 |
pabelanger | I just did testing using curl | 19:24 |
corvus | is that inside our parallelized workers or outside? | 19:24 |
clarkb | corvus: I believe it is inside | 19:25 |
clarkb | so once we get past that 5 minute zone we should catch up quick | 19:25 |
pabelanger | or maybe we don't retry 5 times? | 19:25 |
corvus | clarkb: but other queries will still be happening in parallel, so we're only waiting for the sequencing | 19:25 |
corvus | ? | 19:25 |
clarkb | corvus: correct | 19:26 |
corvus | cool, i think (based on what i know atm) that's the way to go. at least, until we discover more about github rate limiting :) | 19:26 |
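The retry strategy being discussed — honor a `Retry-After` header when the server sends one, fall back to a fixed sleep otherwise, and give up after a few attempts — could be sketched like this (names and the response shape are illustrative; this is not the actual patch under review):

```python
import time

def request_with_retry(send, max_attempts=5, default_sleep=60, sleep=time.sleep):
    """Retry a request on 502/403, honoring a Retry-After header when the
    server provides one and falling back to a fixed delay otherwise.

    ``send`` is a callable returning (status_code, headers, body);
    ``sleep`` is injectable so tests don't actually wait.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in (502, 403):
            return status, body
        if attempt == max_attempts:
            break  # out of attempts; surface the error to the caller
        # Prefer the server's hint; otherwise use the fallback delay.
        delay = int(headers.get("Retry-After", default_sleep))
        sleep(delay)
    return status, body
```

With five attempts and a 60-second fallback, this is where clarkb's "about 5 minutes if it never recovers" figure comes from: four sleeps of up to 60 seconds each.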
pabelanger | yah, I didn't find https://developer.github.com/v3/#rate-limiting too helpful, with examples | 19:28 |
corvus | pabelanger: i like the patch, but i left a suggestion about improving the debug info for us | 19:29 |
pabelanger | ack, give me a few mins to look | 19:30 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 19:31 |
*** rf0lc0 has quit IRC | 19:44 | |
*** rfolco has joined #zuul | 19:44 | |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Improve retry handling for github driver https://review.opendev.org/665220 | 19:48 |
pabelanger | corvus: clarkb: ^updated | 19:48 |
hogepodge | clarkb: right now locistack is broken during a refactor to use stock images, chasing down the issue right now. should have numbers sometime next week. | 19:50 |
corvus | hogepodge: cool, i'm going to proceed with the devstack approach, and we can look at swapping it in later. | 19:52 |
corvus | should be fairly isolated | 19:52 |
hogepodge | that sounds best, will give me a chance to do a last bit of housekeeping and setting up a tempest job against it so I can create an opendev repository | 19:54 |
pabelanger | jlk: maybe you could confirm if 'Retry-After' would be present on a 502 Server Error response, I haven't been able to find much info on the web. If you have the ability | 19:57 |
fungi | as a first step we could start logging more details from the 502 responses | 19:59 |
corvus | pabelanger: not quite what i had in mind, may i push up a revision? | 20:00 |
pabelanger | corvus: please do so | 20:01 |
corvus | pabelanger: also, are you sure you want to retry forbidden errors? | 20:02 |
pabelanger | corvus: that was mostly based on the pastebin from today, so we could open it to more | 20:04 |
pabelanger | from my readings on the web, 403 did return 'retry-after' header | 20:04 |
pabelanger | but it was difficult to see what else did | 20:04 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Improve retry handling for github driver https://review.opendev.org/665220 | 20:05 |
corvus | pabelanger: this update should supply the information we need to answer that question ^ | 20:06 |
pabelanger | ah, much better | 20:06 |
corvus | oh 1 thing | 20:06 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Improve retry handling for github driver https://review.opendev.org/665220 | 20:07 |
corvus | missed type conversion | 20:07 |
pabelanger | corvus: thanks, I see what you were asking now. +1 | 20:11 |
corvus | does anyone know how this $REGION_NAME variable gets set? https://opendev.org/zuul/nodepool/src/branch/master/devstack/plugin.sh#L303 | 20:28 |
corvus | oh, that must come from devstack | 20:29 |
openstackgerrit | Merged zuul/zuul master: Split plugin tests into two https://review.opendev.org/665161 | 20:31 |
pabelanger | tobiash: clarkb: if you don't mind adding https://review.opendev.org/665220/ to your review pipeline, I think we should try to restart zuul.o.o with that to help avoid the 'abuse' errors we are now getting | 20:36 |
*** pcaruana has quit IRC | 20:36 | |
tobiash | Lgtm | 20:41 |
openstackgerrit | James E. Blair proposed zuul/zuul master: Change default job_dir location https://review.opendev.org/665186 | 20:43 |
corvus | just a minor pep8 fix on that ^, otherwise it passed all the tests, so should be gtg | 20:43 |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Web: plug the authorization engine https://review.opendev.org/640884 | 20:44 |
corvus | clarkb, fungi, Shrews: running devstack without the benefit of local git clones took 25 minutes 11.9 seconds (which ara rounds up to 13? cc:dmsimard): http://logs.openstack.org/23/665023/8/check/nodepool-functional-openstack/194fed6/ara-report/ | 20:47 |
*** panda has quit IRC | 20:49 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Zuul Web: add /api/user/authorizations endpoint https://review.opendev.org/641099 | 20:49 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 20:50 |
dmsimard | the duration in ara is calculated based on the time the task started and when it ended and then it's rounded in the webapp | 20:50 |
fungi | heh, i suppose you can make the argument that 11.9 is roughly 13 when rounded to the nearest odd number? ;) | 20:51 |
fungi | well, rounded up to the next odd number anyway | 20:51 |
dmsimard | there is some latency | 20:51 |
dmsimard | because task ends -> tells ara task ended -> ara marks end timestamp | 20:51 |
*** panda has joined #zuul | 20:51 | |
fungi | got it. so this is time it took for ara to become aware it was done | 20:51 |
dmsimard | yes | 20:52 |
fungi | it just gets a notification, not a timestamp passed to it | 20:52 |
*** hashar has quit IRC | 20:55 | |
openstackgerrit | Matthieu Huin proposed zuul/zuul master: Web: plug the authorization engine https://review.opendev.org/640884 | 20:57 |
dmsimard | fungi: right -- this is more or less also how the upstream profile_tasks callback plugin calculates the duration but there is less overhead since it's just printing to stdout | 20:57 |
dmsimard | https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/profile_tasks.py | 20:57 |
corvus | dmsimard: ah ok, i assumed it was working from the same data that shows up here: http://logs.openstack.org/23/665023/8/check/nodepool-functional-openstack/194fed6/ara-report/result/6cad0ed8-1cee-47d3-b1a3-58426aef0e37/ | 21:20 |
corvus | (start/end/delta) | 21:20 |
dmsimard | corvus: the problem is that (unless mistaken), those fields are not always returned | 21:20 |
corvus | yeah, i guess those are "command module" specific fields? | 21:21 |
dmsimard | like, depending on which module was used | 21:21 |
dmsimard | yeah | 21:21 |
corvus | got it, til, thx :) | 21:22 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 21:24 |
fungi | also, i suppose they could "lie" under some circumstances, so having an external timer helps keep them honest even if it does only provide loose bounds on the runtime | 21:25 |
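The "external timer" fungi describes — timestamping around the task from the observer's side instead of trusting module-reported fields — amounts to something like this toy sketch (not ARA's actual implementation):

```python
import time

def timed(task):
    """Run a task and report wall-clock duration as observed from the
    outside, independent of anything the task itself reports."""
    start = time.monotonic()
    result = task()
    duration = time.monotonic() - start
    # The observed duration includes notification/callback latency, so it
    # can only overestimate; a self-reported delta could claim anything.
    return result, duration
```

That latency is exactly why ARA can show 13 seconds for a step a module says took 11.9: the end timestamp is taken when ARA hears the task finished, not when it finished.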
pabelanger | darn, we had py36 job timeout | 21:38 |
pabelanger | looks like it was limestone | 21:38 |
dmsimard | fungi: if Ansible would reliably return timestamps for every module/action/etc, we'd probably use it | 21:40 |
smcginnis | Maybe a little more relevant in -infra than #zuul, but any idea why the devstack job would have ansible_interfaces undefined errors? Didn't collect facts, but where? | 21:55 |
mattw4 | smcginnis, I know this one!! | 21:55 |
smcginnis | :) | 21:55 |
mattw4 | I created a new role in base:pre.yaml to collect all facts with the setup module | 21:55 |
smcginnis | Oh good, I'm not the only one hitting some weird thing. | 21:55 |
mattw4 | I couldn't figure out how to make Zuul collect all facts by default so I just added a small role with the setup module | 21:56 |
smcginnis | Hmm, I tried something similar adding "gather_facts: True" to my task in pre.yaml, but same error. | 21:56 |
smcginnis | mattw4: Do you have that up somewhere I could take a peek? | 21:57 |
mattw4 | I think it gathers facts by default, but the fact set is limited to the minimum (!all) | 21:58 |
mattw4 | smcginnis: it' | 21:58 |
mattw4 | smcginnis: it's in an internal repo ATM, but I can share the role, just a sec | 21:58 |
smcginnis | Thanks mattw4! | 21:58 |
mattw4 | smcginnis: I named it "gather-all-ansible-facts" and it's super-small: http://paste.openstack.org/show/752907/ | 22:00 |
mattw4 | smcginnis: that added the 'ansible_interfaces' list to the fact set | 22:01 |
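The role mattw4 describes — a pre-run task that re-runs the setup module with the full fact set, rather than Zuul's minimal subset — would look roughly like this (a sketch; the actual role is in the linked paste):

```yaml
# roles/gather-all-ansible-facts/tasks/main.yaml
- name: Gather all facts (including network facts like ansible_interfaces)
  setup:
    gather_subset: all
```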
smcginnis | Awesome, I'll give that a shot. Thanks! | 22:01 |
mattw4 | np :) | 22:01 |
mattw4 | I already posted this in #openstack-qa, but I think this may be the right audience: Does anyone know why devstack would fail to install an SSL certificate for Apache2, causing a failure when apache2.service is restarted after installing uwsgi? | 22:03 |
mattw4 | the job is a child of devstack-minimal with a few additional services enabled. | 22:04 |
corvus | mattw4: that's probably a better #openstack-qa question unless zuul itself is somehow involved (but it doesn't sound like it) | 22:05 |
mattw4 | corvus: ok. True, it's probably not Zuul. Thanks tho. | 22:06 |
openstackgerrit | James E. Blair proposed zuul/nodepool master: WIP: new devstack-based functional job https://review.opendev.org/665023 | 22:17 |
*** jamesmcarthur has joined #zuul | 22:18 | |
pabelanger | corvus: clarkb: https://review.opendev.org/665220/ should land in the next 90mins, do we want to look to restart zuul again today or hold off until another time? I'll be able to assist either way | 22:19 |
clarkb | I'm in the middle of copying lots of data around so that we can do ssl cert updates on infra services | 22:22 |
clarkb | so I'll defer to others | 22:22 |
pabelanger | ack | 22:25 |
pabelanger | also, I should have used #openstack-infra for that | 22:25 |
*** tobiash has quit IRC | 22:40 | |
ofosos | corvus, SpamapS: any love for https://review.opendev.org/#/c/662134/54 ? All the basic tests are good now... I'd like some guidance on how to proceed | 23:14 |
ofosos | Linter will be fixed tomorrow | 23:18 |
clarkb | ofosos: you may want to send an email to the zuul-discuss list soliciting reviews? Sounds like its to a point where it is generally working and now its double checking (and potential refinement)? | 23:19 |
ofosos | clarkb: +1 | 23:20 |
*** jamesmcarthur has quit IRC | 23:24 | |
corvus | ofosos: looks like there's a pep8 error at http://logs.openstack.org/34/662134/54/check/tox-pep8/cc56a61/job-output.txt.gz#_2019-06-13_20_51_57_086167 but that's the only failing test | 23:25 |
corvus | ofosos: might want to go ahead and push up a fix for that; i can give the whole stack a closer look tomorrow. i'm looking forward to it! :) | 23:26 |
ofosos | Yup, it's a single line. I was already at the pub when that popped up. My IDE was happy with the code. I'll fix it tomorrow | 23:27 |
ofosos | Had to celebrate a birthday yesterday (already Friday in my tz). | 23:28 |
ofosos | corvus: very good :) | 23:29 |
openstackgerrit | Merged zuul/zuul master: Improve retry handling for github driver https://review.opendev.org/665220 | 23:31 |
ofosos | I'd still like to refactor some things. Sometimes it's unclear where you have to pass a project or a project name. That's the biggest problem I saw. | 23:31 |
ofosos | The testing process was really nice though. The fixture tests took 10 hours today to get right, but in the end I think it's for the better. | 23:32 |
ofosos | I also need to incorporate API paging, but that can be done on the client level. | 23:33 |
jlk | pabelanger: I don't know if a Retry-After is going to be present. It looks like our system can throw a 502 if a query has gone longer than 10 seconds, and there was a recent change that caused that to happen a lot more often. This change was reverted a few hours ago, so I'm curious if Zuul is still seeing a slew of 502s. | 23:34 |
mattw4 | Where can I set zuul_log_verbose: true to produce more verbose logs? | 23:51 |
pabelanger | jlk: great, thanks for the information, we've landed https://review.opendev.org/665220/ to help deal with it, and give additional info | 23:51 |
pabelanger | mattw4: you can set it in your playbook where you call the upload-logs role | 23:52 |
mattw4 | pabelanger: gotcha, Thanks! | 23:52 |
pabelanger | Hmm, for some reason, we have 3 entries in our third-party-check pipeline, all from the same PR: https://dashboard.zuul.ansible.com/t/ansible/status | 23:59 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!