*** nchakrab has joined #zuul | 04:53 | |
tobiash | pabelanger: I added a comment. | 05:36 |
*** quiquell|off is now known as quiquell | 05:48 | |
*** pcaruana has joined #zuul | 07:02 | |
*** nchakrab has quit IRC | 07:22 | |
*** quiquell is now known as quiquell|bbl | 07:22 | |
*** nchakrab has joined #zuul | 07:25 | |
*** quiquell|bbl is now known as quiquell | 07:52 | |
*** hashar has joined #zuul | 08:05 | |
*** goern has joined #zuul | 08:35 | |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Add variables to project https://review.openstack.org/584230 | 09:04 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: doc: Move zuul variable references to a section https://review.openstack.org/584779 | 09:04 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Use definition list for job status https://review.openstack.org/584780 | 09:04 |
*** nchakrab has quit IRC | 09:08 | |
*** nchakrab has joined #zuul | 09:09 | |
*** nchakrab has quit IRC | 09:13 | |
*** CrayZee has joined #zuul | 09:24 | |
*** jimi|ansible has quit IRC | 09:30 | |
*** panda|off is now known as panda | 09:48 | |
*** nchakrab has joined #zuul | 09:49 | |
*** sshnaidm|afk is now known as sshnaidm|rover | 09:56 | |
tobiash | mordred: I just found a docker benchmark collection that benches various things using different base images: https://www.phoronix.com/scan.php?page=article&item=docker-summer-2018&num=4 | 09:59 |
tobiash | mordred: | 10:00 |
*** sshnaidm|rover is now known as sshnaidm|ruck | 10:00 | |
*** chkumar|ruck is now known as chkumar|rover | 10:01 | |
tobiash | mordred: python with the alpine base image seems to perform quite badly compared to all the other base images | 10:02 |
tristanC | tobiash: interesting results... a superuser answer points at the musl libc ( https://superuser.com/a/1234279 ) | 10:25 |
*** nchakrab has quit IRC | 10:30 | |
*** nchakrab has joined #zuul | 10:31 | |
*** nchakrab has quit IRC | 10:36 | |
*** quiquell is now known as quiquell|bbl | 10:39 | |
*** nchakrab has joined #zuul | 10:47 | |
*** bhavik1 has joined #zuul | 10:57 | |
*** quiquell|bbl is now known as quiquell | 11:03 | |
panda | pabelanger: re OVB, the overlay network remains a difficult approach for the iPXE images. The iPXE process would first have to ask for its management IP, then establish a VXLAN tunnel, then send a DHCP request on that tunnel for the provision network. iPXE doesn't support any of this. I think we can work with the static provision network approach and MAC filtering on the undercloud. The only difficult part here is to pass the | 11:16 |
panda | overcloud instances' MAC addresses to the undercloud so it can enable leases only for those. The other part is to create an overlay network for the overcloud nodes to emulate network isolation. | 11:17 |
quiquell | Hello zuul guys, have one question regarding ZUUL_CHANGES | 11:36 |
quiquell | if you want a commit to be there, you add a Depends-On or you parent it ? | 11:36 |
quiquell | that's right ? | 11:36 |
*** bhavik1 has quit IRC | 11:37 | |
tristanC | quiquell: yes, depends-on in commit message or git repos dependency. | 11:38 |
tristanC | quiquell: ZUUL_CHANGES is a legacy var from zuulv2; the dependencies are now stored in the zuul.items variable | 11:38 |
quiquell | tristanC: We will arrive there :-) | 11:38 |
tristanC | quiquell: e.g. http://logs.rdoproject.org/62/584762/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/fd5f3d6/zuul-info/inventory.yaml | 11:39 |
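For reference, a hedged sketch of what that looks like from a job's point of view: a Depends-On footer in the commit message declares the dependency, and each change in the resulting chain shows up as an entry in the inventory's zuul.items list. All hostnames, project names, and change numbers below are illustrative and not taken from the linked log.

```yaml
# Commit message footer that declares a cross-project dependency:
#   Depends-On: https://review.example.com/123456
#
# Roughly how the dependency chain then appears in the job inventory:
zuul:
  items:
    - project:
        name: org/dependency-project
        canonical_name: review.example.com/org/dependency-project
      branch: master
      change: "123456"
      patchset: "1"
      change_url: https://review.example.com/123456
    - project:
        name: org/project
        canonical_name: review.example.com/org/project
      branch: master
      change: "123457"
      patchset: "1"
      change_url: https://review.example.com/123457
```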
panda | quiquell: what do you mean "parent" it ? | 11:39 |
*** GonZo2000 has joined #zuul | 11:40 | |
*** GonZo2000 has quit IRC | 11:40 | |
*** GonZo2000 has joined #zuul | 11:40 | |
tristanC | panda: i think he meant in gerrit, when you click rebase and "change parent revision", or when you git review multiple commit stacked together | 11:41 |
quiquell | panda: If you have two commits for the same project, and you want zuul to test them at one review | 11:41 |
quiquell | panda: You need to make one parent on the other | 11:41 |
panda | tristanC: hhmm, I thought rebasing a change on top of another didn't create another zuul.item, because zuul doesn't need to fetch the change, it's in the git log | 11:43 |
quiquell | panda: But it's stuff that is going to be tested | 11:44 |
quiquell | panda: Makes sense to have it there | 11:44 |
quiquell | panda: The other thing is the refspec where the required projects point to | 11:45 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ignore files produced by tox-cover https://review.openstack.org/584823 | 11:45 |
panda | and how does zuul know what changes in the log need to be added to zuul.items ? | 11:45 |
panda | looks at the whole log and sees what Change-Id: is still open ? | 11:45 |
quiquell | panda: I suppose | 11:46 |
quiquell | panda: Cannot have stuff in the middle | 11:46 |
tristanC | panda: it does, gerrit automatically returns parent information in the "dependsOn" field, see this function for the full logic: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/gerrit/gerritconnection.py#n524 | 11:48 |
panda | tristanC: ah, I see, thanks. | 11:49 |
*** quiquell is now known as quiquell|lunch | 11:59 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Invalidate az cache on bad request https://review.openstack.org/582746 | 12:04 |
*** dkranz has joined #zuul | 12:21 | |
*** quiquell|lunch is now known as quiquell | 12:24 | |
*** jimi|ansible has joined #zuul | 12:36 | |
*** jimi|ansible has joined #zuul | 12:36 | |
*** rlandy has joined #zuul | 12:39 | |
*** nchakrab has quit IRC | 12:45 | |
*** samccann has joined #zuul | 12:50 | |
*** samccann_ has joined #zuul | 12:51 | |
mordred | tobiash: interesting. unfortunately the other outlier is debian 9 - which is the base for the other main python base image | 12:52 |
mordred | tobiash: definitely worth keeping our eyes on | 12:52 |
*** nchakrab has joined #zuul | 12:58 | |
*** samccann_ has quit IRC | 12:59 | |
*** jesusaur has quit IRC | 13:05 | |
openstackgerrit | Merged openstack-infra/nodepool master: Invalidate az cache on bad request https://review.openstack.org/582746 | 13:25 |
openstackgerrit | Merged openstack-infra/nodepool master: Ignore files produced by tox-cover https://review.openstack.org/584823 | 13:25 |
pabelanger | tobiash: replied | 13:27 |
pabelanger | panda: +1 | 13:27 |
panda | pabelanger: I know how to filter dhcp offers. kinda. | 13:30 |
tobiash | pabelanger: oh thanks, I think I understood this now | 13:37 |
tobiash | pabelanger: but I have another comment, sorry | 13:37 |
pabelanger | tobiash: sure, let me update and test | 13:39 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Fix delegate_to for ansible synchronize https://review.openstack.org/584656 | 13:47 |
pabelanger | tobiash: ^should be better now | 13:47 |
tobiash | pabelanger: I'd even leave out 'path: /tmp' and let the os decide where to put the tmpdir | 13:49 |
tobiash | pabelanger: apart from that lgtm | 13:49 |
*** myoung has joined #zuul | 13:49 | |
pabelanger | tobiash: I wanted to keep /tmp since on the executor that is outside of the filter path, just to ensure we can write data outside of the workdir, even though this is on the node | 13:50 |
pabelanger | thanks! left comment on review explaining too | 13:53 |
tobiash | pabelanger: ok (although it won't be inside of the workdir as it connects back via ssh) | 13:53 |
pabelanger | tobiash: right, agree! However, the original bug was actually still checking is_path_safe and failing a job: http://logs.openstack.org/14/584614/9/check/ansible-role-nodepool-ubuntu-xenial/82c6155/job-output.txt.gz#_2018-07-22_13_48_56_126339 | 13:56 |
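For anyone following along, a minimal sketch of the pattern the fix exercises: a scratch directory is created with tempfile outside the work dir, and synchronize is delegated back to the node so the rsync happens on the node itself. The directory and file names here are hypothetical and not copied from the actual test fixture.

```yaml
- hosts: all
  tasks:
    - name: Create a scratch dir outside the build work dir
      tempfile:
        state: directory
        path: /tmp            # deliberately outside the work dir, per the discussion above
      register: scratch

    - name: Pre-populate a file to copy
      copy:
        content: "hello"
        dest: "{{ scratch.path }}/common-file"

    - name: rsync the file on the node itself
      synchronize:
        src: "{{ scratch.path }}/common-file"
        dest: "{{ ansible_user_dir }}/copied-file"
        mode: pull
      delegate_to: "{{ inventory_hostname }}"
```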
*** quiquell is now known as quiquell|off | 14:02 | |
*** acozine1 has joined #zuul | 14:08 | |
openstackgerrit | Merged openstack-infra/nodepool master: Update README and add TESTING similar to Zuul repo https://review.openstack.org/581348 | 14:16 |
*** yolanda__ has joined #zuul | 14:19 | |
*** acozine1 has quit IRC | 14:20 | |
*** acozine1 has joined #zuul | 14:21 | |
*** yolanda_ has quit IRC | 14:21 | |
*** yolanda__ is now known as yolanda | 14:26 | |
Shrews | would anyone care to review this builder change? https://review.openstack.org/564687 | 14:33 |
pabelanger | looking | 14:35 |
Shrews | pabelanger: maybe you can speak to ian's concerns about dib there | 14:38 |
Shrews | (i missed those initially, but seems like a caveat might be warranted there) | 14:39 |
pabelanger | Shrews: yah, I just left a comment for just that | 14:40 |
pabelanger | if people are okay with it in general, I'll drop a +2 | 14:40 |
pabelanger | but want to get another set of eyes | 14:40 |
Shrews | pabelanger: maybe corvus or mordred can comment there too | 14:41 |
pabelanger | +1 | 14:42 |
mordred | pabelanger: I'd argue that needing REG_PASSWORD in a RHEL image is a design flaw in rhel - that password would be completely exposed to anyone booting that image and running workloads on it | 14:43 |
mordred | BUT - given that, I don't think it's any _more_ insecure than the need to put the REG_PASSWORD in the image in the first place | 14:44 |
mordred | (the nodepool patch) | 14:44 |
pabelanger | yah, that is true too. Didn't think of that | 14:45 |
pabelanger | really don't know how RHEL licensing works TBH | 14:45 |
mordred | it's funny how old-school proprietary software techniques like license managers are _completely_ incompatible with cloud | 14:45 |
pabelanger | Like, could you use the RHEL key in DIB, yum install all the latest packages, delete the key, and the booted image works? | 14:45 |
mordred | yeah - but then users of the image wouldn't be able to install new images | 14:46 |
mordred | packages | 14:46 |
mordred | that is | 14:46 |
rcarrillocruz | haha, yah...the state of cloud images for networking make me want to stab my eyes at times | 14:46 |
mordred | I mean - this is amongst the reasons that literally nobody uses rhel as guest images in cloud | 14:46 |
rcarrillocruz | cloud-init is not a thing for most | 14:46 |
pabelanger | yah, wonder if you could write a wrapper in ansible using zuul secret, for bindep, then delete key after yum, etc | 14:47 |
pabelanger | but that should likely work too | 14:47 |
mordred | pabelanger: yah - you could likely do that - but it's still a mess. I'd just expose the password since it's an absurd protection | 14:47 |
mordred | "you must use a password to download these things that you can download for free from centos" | 14:47 |
pabelanger | ++ | 14:48 |
* mordred looks around to see if he's been fired yet | 14:48 | |
mordred | pabelanger: I think in the non-CI world, one would have a rhel key as a customer, and then injecting that key at boot time using something like cloud-init or ansible would not be exposing your personal rhel secret, because you're the user | 14:51 |
*** jesusaur has joined #zuul | 14:51 | |
mordred | pabelanger: for public CI, it's a complete mess ... and I think basically means that for public CI purposes RHEL is 100% unsuitable and centos basically must be used | 14:52 |
pabelanger | Yah, maybe that is why I've not seen much discussion around RHEL and CI when I've asked. Just for this reason of key leaks | 14:53 |
mordred | yah | 14:53 |
pabelanger | mordred: since you are here, and non-RHEL related: https://review.openstack.org/584656/ fixes a bug with synchronize in ansible and zuul. If you'd like to review and see if I missed something obvious in zuul's trusted / untrusted checks | 14:54 |
mordred | pabelanger: I think that patch looks right. I added a +2 - but I'd like corvus or clarkb to look too - it's one of those areas that more eyes is more better | 14:57 |
pabelanger | thanks! | 14:57 |
pabelanger | yah, happy to get more eyeballs on this | 14:57 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Add tenant yaml validation option to zuul client https://review.openstack.org/574265 | 15:01 |
clarkb | mordred: which one? | 15:03 |
mordred | clarkb: https://review.openstack.org/584656/ | 15:05 |
corvus | pabelanger: i don't understand your answer to tobiash's comment; can you rephrase that, possibly with more examples? | 15:05 |
pabelanger | yes | 15:06 |
pabelanger | corvus: I've added more detail to the explanation. | 15:11 |
*** mrhillsman has quit IRC | 15:18 | |
*** jtanner has quit IRC | 15:19 | |
*** patrickeast has quit IRC | 15:19 | |
*** hashar has quit IRC | 15:19 | |
*** maxamillion has quit IRC | 15:19 | |
*** clarkb has quit IRC | 15:19 | |
*** goern has quit IRC | 15:20 | |
*** sdoran has quit IRC | 15:20 | |
*** TheJulia has quit IRC | 15:20 | |
*** andreaf has quit IRC | 15:21 | |
*** patrickeast has joined #zuul | 15:21 | |
*** mrhillsman has joined #zuul | 15:21 | |
*** jtanner has joined #zuul | 15:22 | |
*** maxamillion has joined #zuul | 15:23 | |
*** sdoran has joined #zuul | 15:23 | |
*** jtanner has quit IRC | 15:24 | |
*** sdoran has quit IRC | 15:24 | |
*** sdoran has joined #zuul | 15:24 | |
*** jtanner has joined #zuul | 15:25 | |
*** andreaf has joined #zuul | 15:25 | |
*** maxamillion has quit IRC | 15:25 | |
*** maxamillion has joined #zuul | 15:25 | |
*** sdoran has quit IRC | 15:25 | |
*** hashar has joined #zuul | 15:26 | |
*** sdoran has joined #zuul | 15:26 | |
*** jtanner has quit IRC | 15:26 | |
*** hashar has quit IRC | 15:26 | |
*** hashar has joined #zuul | 15:26 | |
*** sdoran has quit IRC | 15:27 | |
*** sdoran has joined #zuul | 15:27 | |
corvus | pabelanger: thanks, i think i grok | 15:29 |
*** clarkb has joined #zuul | 15:33 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add container spec https://review.openstack.org/560136 | 15:34 |
*** TheJulia has joined #zuul | 15:35 | |
*** jtanner has joined #zuul | 15:36 | |
*** jtanner has quit IRC | 15:38 | |
*** jtanner has joined #zuul | 15:38 | |
corvus | pabelanger: can you point me at a summary of your log work? (maybe a change that explains most of what's going on, or an email, or something?) i want to refresh my memory and think about how it might interact with things i've been thinking about wrt logs-in-swift | 15:44 |
*** nchakrab has quit IRC | 15:45 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Docs: increase visibility of env secrets warning https://review.openstack.org/584950 | 15:48 |
corvus | Shrews, pabelanger, ianw, tristanC: ^ | 15:48 |
openstackgerrit | Merged openstack-infra/nodepool master: Use detail listing in integration testing https://review.openstack.org/584542 | 16:00 |
openstackgerrit | Merged openstack-infra/nodepool master: builder: support setting diskimage env-vars in secure configuration https://review.openstack.org/564687 | 16:04 |
*** nchakrab has joined #zuul | 16:12 | |
*** nchakrab has quit IRC | 16:13 | |
*** panda is now known as panda|off | 16:30 | |
*** gchenuet has joined #zuul | 16:36 | |
mordred | corvus, Shrews: if you get a spare sec, https://review.openstack.org/#/c/584395/ doc patch for pbrx | 16:37 |
pabelanger | clarkb: could I trouble you for a review of https://review.openstack.org/584656/ too, having extra set of eyes helps | 16:38 |
gchenuet | Hi guys ! We are working (at leboncoin, a French classified ads website) on updating our Zuul V2 infrastructure to Zuul V3 and we have a question about the Nodepool Launcher service: is it possible to scale it up to provide redundancy ? We saw that Nodepool Builder is able to do this, but we don't know about Nodepool Launcher. | 16:39 |
gchenuet | Thanks, and congrats on this awesome stack of tools ! | 16:39 |
clarkb | gchenuet: it is possible, but you must partition the clouds across launchers as they won't see each other's resource usage directly. If you only have one cloud you could give each launcher a different account with independent quota | 16:40 |
gchenuet | thanks clarkb ! What is the best redundancy solution for you if we have two DCs (with 1 OS per DC) to keep all the Zuul infra (Zuul + Nodepool services) alive if we lose 1 DC ? | 16:45 |
clarkb | gchenuet: for that case I would run two nodepool launchers, one in each DC, speaking to the local DC openstack. This way if the datacenter goes away the other nodepool keeps running. (do similar for nodepool builders). The one thing this doesn't accommodate is where to run your zookeeper. I'm not sure you can run a zookeeper across multiple datacenters | 16:46 |
clarkb | pabelanger: why is the src value for synchronize in your test file destdir.path? | 16:46 |
clarkb | pabelanger: is that intentional? | 16:46 |
*** pcaruana has quit IRC | 16:47 | |
corvus | gchenuet, clarkb: i agree about launchers -- but i'd probably have both builders talk to both clouds | 16:47 |
corvus | (that way if they are both working, you're still only building the images once) | 16:48 |
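A hedged sketch of the partitioning being described: each launcher gets a nodepool.yaml listing only its local provider, while every launcher and builder points at the same ZooKeeper cluster. All hostnames, provider names, and sizes below are made up for illustration.

```yaml
# nodepool.yaml for the launcher running in DC1; the DC2 launcher would
# carry an equivalent file listing only the dc2 provider.
zookeeper-servers:
  - host: zk01.example.com
    port: 2181
  - host: zk02.example.com
    port: 2181
  - host: zk03.example.com
    port: 2181

labels:
  - name: ubuntu-xenial
    min-ready: 2

providers:
  - name: dc1-openstack
    cloud: dc1
    diskimages:
      - name: ubuntu-xenial
    pools:
      - name: main
        max-servers: 50
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            min-ram: 8192
```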
pabelanger | clarkb: yah, comes from the copy-common role | 16:48 |
pabelanger | see depends-on for more examples of usage | 16:48 |
clarkb | corvus: oh good point | 16:48 |
clarkb | pabelanger: isn't that backwards? if nothing else it is confusing :) | 16:48 |
corvus | SpamapS: any thoughts/experience running zookeeper across a wan (multiple datacenters)? (see question from gchenuet) | 16:49 |
clarkb | pabelanger: I guess you are prepping files for copying in the other direction? | 16:49 |
pabelanger | clarkb: right, depends on mode: pull or push | 16:49 |
*** acozine1 has quit IRC | 16:49 | |
pabelanger | fetching common-file from the copy-common role tmpdir | 16:50 |
clarkb | pabelanger: both use src: destdir.path in the child change | 16:51 |
corvus | gchenuet: if you try running zk across multiple DCs, please let us know how it goes. we (in openstack infra) have a nodepool system that's similar to what you're setting up, but we've only run zk within one data center. i assume it can be done but may require some tuning to allow for greater latencies? | 16:51 |
pabelanger | clarkb: which line? | 16:51 |
clarkb | pabelanger: https://review.openstack.org/#/c/584656/6/tests/fixtures/config/remote-action-modules/git/org_project/playbooks/synchronize-delegate-good.yaml 22 and 46 | 16:52 |
tobiash | gchenuet: and you should check if you can encrypt the communication between zookeepers over wan | 16:52 |
tobiash | but client side tls is not implemented in kazoo yet (there are pull requests for that) | 16:53 |
gchenuet | our DCs are connecting with LAN and not WAN | 16:53 |
gchenuet | FYI | 16:53 |
gchenuet | I think we'll setup a zk cluster with 6 nodes (3 per DC) | 16:54 |
clarkb | gchenuet: it needs to be an odd number | 16:55 |
tobiash | and write performance drops with more nodes (upstream suggestion is 3, 5 or 7 I think) | 16:55 |
gchenuet | ok, good to know ! thanks, we have a third DC so 2 - 2 -1 | 16:56 |
pabelanger | clarkb: right, I don't think it matters too much, because in this case we are doing an rsync on a single server. So push and pull are both delegated to the nodepool node; basically in this case we want src to always be destdir.path regardless of mode | 16:56 |
pabelanger | https://docs.ansible.com/ansible/2.5/modules/synchronize_module.html | 16:57 |
pabelanger | see mode | 16:57 |
clarkb | pabelanger: mostly its just confusing that we are using destdir as the src not the dest dir | 16:57 |
clarkb | why not make the variable called srcdir in this case? | 16:57 |
tobiash | regarding the launchers, I think now that nodepool is resilient against unexpectedly hitting the quota you could even run multiple launchers against the same cloud | 16:58 |
gchenuet | really ? | 16:58 |
*** bbayszczak has joined #zuul | 16:58 | |
tobiash | it might cause more openstack api requests than necessary when under quota pressure but I think it shouldn't break | 16:58 |
pabelanger | clarkb: that is what common-copy does: it creates a tmp directory with a file, and I just reused that to pre-populate something. That is the file we are using for src here. But I can see how that is a little confusing | 16:59 |
clarkb | it would be racy on consumption of quota, which may lead to NODE_FAILURES, but other than that it may work | 16:59 |
*** bbayszcz_ has joined #zuul | 16:59 | |
tobiash | clarkb: nodepool now retries when hitting quota failures | 16:59 |
clarkb | tobiash: oh good | 16:59 |
*** bbayszczak has quit IRC | 16:59 | |
tobiash | that doesn't cause node_failures anymore | 16:59 |
pabelanger | tobiash: ++ | 17:00 |
gchenuet | so i can have one zk cluster across DCs with nodepool launchers and builders in each DC connected to the same zk cluster ? | 17:00 |
*** GonZo2000 has quit IRC | 17:01 | |
tobiash | yes, that should work, if the latency doesn't have too much impact on zk | 17:01 |
*** hashar is now known as hasharDinner | 17:02 | |
clarkb | pabelanger: as for push vs pull, that just changes which host is looked at to copy from dest to src. | 17:02 |
*** bbayszcz_ is now known as bbayszczak | 17:02 | |
tobiash | gchenuet: but that's something you probably just need to try | 17:02 |
gchenuet | to be sure, when you say 'cloud', is that a 'cloud provider' in nodepool ? | 17:02 |
clarkb | pabelanger: the test takes advantage of the fact that they are the same in this case | 17:02 |
tobiash | gchenuet: yes | 17:02 |
pabelanger | clarkb: right, I can propose a follow change that removes the usage of destdir variable and sets a fact for srcdir | 17:05 |
pabelanger | give me a few minutes | 17:05 |
*** nchakrab has joined #zuul | 17:05 | |
tobiash | corvus: regarding buildset resources, what do you think about this user api: https://etherpad.openstack.org/p/JINkHL4VuM | 17:07 |
tobiash | we already map child_jobs into the zuul variable so I think we should do the same for pause | 17:08 |
corvus | tobiash: oh, hrm; i wonder if we should actually do option #2 for both things? | 17:09 |
tobiash | corvus: can we still change that for child_jobs? | 17:10 |
tobiash | (I'd also be in favor of mapping behavior things to real module parameters) | 17:11 |
gchenuet | thanks tobiash ! | 17:12 |
corvus | (the main difference between the two being whether the variables are exported to the child job) | 17:12 |
tobiash | but that should be consistent then | 17:12 |
gchenuet | In your OpenStack CI, where are the nodepool launchers ? in which cloud providers ? | 17:12 |
corvus | tobiash: we haven't released a version with child_jobs yet, i think we can change it | 17:12 |
corvus | pabelanger: ^ we're talking about maybe changing how child_jobs is returned | 17:13 |
tobiash | corvus: thinking about that, even if we switch to module args we still could map the values into the child jobs within the executor | 17:13 |
tobiash | in case of pause I would even remap that to a different variable like zuul.paused_parents: [p1, p2] | 17:14 |
pabelanger | corvus: okay, I haven't started using it yet | 17:15 |
corvus | hrm. well, both of these are using the special 'zuul' dictionary. maybe that's enough to distinguish it from regular data that's passed to children... | 17:15 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Clean up remote ansible synchronize tests https://review.openstack.org/584978 | 17:17 |
pabelanger | clarkb: I believe ^ is clearer now? | 17:17 |
corvus | tobiash: ultimately the interface is actually just a json file on disk. if we did option #2, we'd have to revise that structure somehow. | 17:17 |
corvus | tobiash: (i guess "zuul_return: pause: True" could just be a shortcut for "zuul_return: data: zuul: pause: True"....) | 17:18 |
tobiash | that's true, it's easier for the user, but a little bit harder to implement | 17:18 |
tobiash | interesting idea | 17:18 |
corvus | tobiash: well, log_url is already like this and is in a release. child_jobs is like this, but isn't in a release yet (but is about to be). i think maybe we should stick with option #1 for now, and then think about either making an alias later, or restructuring it (if we feel like actually deprecating things) | 17:21 |
openstackgerrit | Merged openstack-infra/nodepool master: Docs: increase visibility of env secrets warning https://review.openstack.org/584950 | 17:21 |
tobiash | corvus: ok, I'm fine with that | 17:22 |
corvus | (i bet if we just make it an alias later, that will improve the UX without making things more complicated) | 17:22 |
tobiash | corvus: shall we name it pause or do you have a better naming proposal (run_children_now, ...)? | 17:23 |
corvus | tobiash: i like pause (or suspend is my runner-up) | 17:24 |
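To make the discussion concrete, roughly what "option #1" (everything carried inside the special zuul dict) would look like from a job playbook. At the time of this conversation pause was only a proposal, so this is a sketch of the interface being discussed rather than released behavior; the child job name is illustrative.

```yaml
- hosts: localhost
  tasks:
    - name: Pause this job and restrict which children run
      zuul_return:
        data:
          zuul:
            pause: true              # keep this job (and its resources) alive while children run
            child_jobs:
              - integration-test     # only start this child job
```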
pabelanger | tobiash: corvus: speaking of child_jobs for zuul_return, tristanC noticed this change in functionality for zuul: https://storyboard.openstack.org/#!/story/2003026 I think we want to maybe address that before tagging | 17:24 |
*** bbayszcz_ has joined #zuul | 17:24 | |
corvus | pabelanger: agreed; is someone working on that? | 17:24 |
corvus | gchenuet: we use quite a few public and private providers, you can see which by looking at our grafana dashboards (search for nodepool): http://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1 | 17:27 |
corvus | gchenuet: (some of those have multiple regions too) | 17:27 |
*** bbayszczak has quit IRC | 17:27 | |
pabelanger | corvus: I can, but could use help on how best to fix it. This is a side effect of not allowing all jobs to be skipped: https://git.zuul-ci.org/cgit/zuul/commit/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807 | 17:27 |
pabelanger | it didn't account for non-voting | 17:28 |
*** bbayszcz_ has quit IRC | 17:28 | |
corvus | pabelanger: i think there's just a logic error in didAllJobsSucceed -- a non-voting job that ran isn't 'skipped', but it's checked first | 17:29 |
*** myoung is now known as myoung|bbl | 17:29 | |
tobiash | pabelanger: I think you just could set skipped to false in https://git.zuul-ci.org/cgit/zuul/tree/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807#n1808 | 17:29 |
gchenuet | thanks corvus, and you have the same single zk + nodepool infra for all providers ? | 17:31 |
corvus | gchenuet: we have a single zk, 4 builders and 4 launchers. some builders and launchers are located in specific clouds, some are just in one cloud and talk to several clouds. some of that is dictated by architecture and networking requirements, not just redundancy/availability) | 17:34 |
*** CrayZee has quit IRC | 17:35 | |
pabelanger | corvus: tobiash: I'll create a new test for all skipped jobs and work to fix the logic error | 17:35 |
pabelanger | I'll do that now | 17:35 |
corvus | pabelanger: thanks! | 17:35 |
gchenuet | but are any of your launchers speaking to the same cloud ? | 17:36 |
clarkb | gchenuet: no | 17:37 |
corvus | well, at least one of them is | 17:38 |
corvus | and at least two of the builders are | 17:38 |
clarkb | corvus: which launcher of ours talks to the same cloud? | 17:40 |
clarkb | they all talk to distinct sets of clouds | 17:41 |
corvus | clarkb: i interpreted "same cloud" as "cloud in which the launcher is residing". | 17:46 |
corvus | if the question is "do any two launchers talk to the same cloud" then the answer is indeed no | 17:47 |
corvus | (but i thought we already answered that) | 17:48 |
clarkb | ah | 17:48 |
gchenuet | thanks ! I think you've found a solution; let's test it and if it works, we'll be very happy to share our infra with you (we run 50K builds per week) | 17:49 |
gchenuet | thank you so much guys for your work and your help ! | 17:49 |
*** gchenuet has quit IRC | 17:54 | |
pabelanger | with stestr, is there a way to move the captured traceback output after the captured logging? Or customize it locally? I'm finding it a pain to always be scrolling up in the terminal to see why something failed. | 17:54 |
Shrews | pabelanger: have you tried 'script'? there might be a simpler solution, but i often use that for things where i'm not sure how to control output | 18:04 |
Shrews | i often use that for zk-shell output capturing | 18:04 |
mordred | corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: you may want to skim the recent scrollback in #ansible-devel ... it's about an upcoming behavior change in ansible related to modules returning ansible_facts in their return json | 18:07 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Fix zuul reporting build failure with only non-voting jobs https://review.openstack.org/584990 | 18:07 |
mordred | https://github.com/ansible/ansible/pull/41811 is the pr adding the deprecation messages - agaffney says we can test the new behavior by setting inject_facts_as_vars to false in ansible.cfg | 18:08 |
*** nchakrab has quit IRC | 18:08 | |
pabelanger | corvus: ^ implements tobiash suggestion for fixing non-voting jobs with zuul.child_jobs patch | 18:08 |
pabelanger | Shrews: yah, that could help too | 18:10 |
mordred | corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: http://paste.ubuntu.com/p/xJFyJJ7rcy/ is the relevant channel scrollback in case you weren't in #ansible-devel | 18:14 |
corvus | mordred: so that might affect roles/multi-node-known-hosts/library/generate_all_known_hosts.py ? | 18:17 |
mordred | corvus: yes | 18:17 |
corvus | (sorry, that was in zuul-jobs) | 18:17 |
pabelanger | reading | 18:17 |
corvus | mordred: but maybe could be worked around with a normal return / register / set_fact ? | 18:18 |
mordred | corvus: we're on 2.5 now right? | 18:18 |
corvus | mordred: yes | 18:18 |
mordred | corvus: cool. so - I believe we can just access the value as ansible_facts.all_known_hosts safely now | 18:18 |
mordred | corvus: or we could change it to return/register | 18:19 |
mordred | as of 2.5 the ansible_facts dict exists and that return should be setting the value both as a top-level var and in the ansible_facts dict | 18:19 |
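The practical upshot for roles, in a hedged before/after form: the debug tasks below are hypothetical stand-ins for whatever the real role does with the value, and only illustrate the variable scoping change.

```yaml
# Before: relies on the fact being injected as a top-level variable, which
# breaks once inject_facts_as_vars = False is set in ansible.cfg [defaults].
- name: Use the fact via the injected top-level variable
  debug:
    msg: "{{ all_known_hosts }}"

# After: scope the lookup through the ansible_facts dict, which exists
# as of Ansible 2.5 regardless of the injection setting.
- name: Use the fact via the ansible_facts dict
  debug:
    msg: "{{ ansible_facts.all_known_hosts }}"
```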
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Scope all_known_hosts in mult-node-known-hosts https://review.openstack.org/584993 | 18:21 |
pabelanger | +2 | 18:22 |
pabelanger | haven't used ansible_facts much yet, but sounds like we should be | 18:23 |
corvus | it's apparently new :) | 18:26 |
*** ianychoi has quit IRC | 18:33 | |
*** ianychoi has joined #zuul | 18:34 | |
*** dkranz has quit IRC | 18:38 | |
*** dkranz has joined #zuul | 18:42 | |
*** hasharDinner has quit IRC | 19:00 | |
*** hashar has joined #zuul | 19:00 | |
Shrews | tobiash: according to https://review.openstack.org/536930, if "quota exceeded" is returned during a launch, we should pause, right? I'm not seeing that pause in deployed code | 19:03 |
tobiash | What do you see instead? | 19:04 |
Shrews | i'm not sure what implications this might have | 19:04 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Allow launch-retries to indefinitely retry for openstack https://review.openstack.org/584488 | 19:04 |
pabelanger | clarkb: okay, think ^ handles launch-retries: 0 | 19:05 |
pabelanger | but still haven't figured out best way to test indefinitely retries | 19:05 |
Shrews | tobiash: http://paste.openstack.org/show/726468/ | 19:05 |
clarkb | pabelanger: you saw the note about how quota failures will short circuit too right? | 19:06 |
clarkb | pabelanger: I think that likely to be your biggest enemy in that code | 19:06 |
Shrews | tobiash: oh, maybe we just don't log the actual pause in that code path? | 19:08 |
Shrews | tobiash: b/c i do see an "Unpaused request" for that request ID | 19:08 |
tobiash | Shrews: yes, I see that too | 19:09 |
Shrews | tobiash: in your env? | 19:09 |
tobiash | Shrews: in your log | 19:09 |
tobiash | In my env it unpauses every 30 seconds or so and checks again | 19:10 |
Shrews | i didn't include the unpause in that paste | 19:10 |
Shrews | it was further down in what i pasted, but regardless, looks like that's what's happening | 19:11 |
tobiash | I looked wrong, never mind, I saw a pause after quota recalculation | 19:12 |
pabelanger | clarkb: yah, I'm not sure how best to do that. tobiash Shrews do you mind looking at the comment from clarkb on https://review.openstack.org/584488 | 19:13 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add logging when pausing due to node abort https://review.openstack.org/585007 | 19:17 |
Shrews | tobiash: ^^ | 19:17 |
tobiash | Lgtm | 19:19 |
tobiash | pabelanger: your change changes the failure count to count in advance | 19:21 |
tobiash | That will break indefinitely retrying on quota exceeded | 19:22 |
tobiash | As quota exceeded should not be counted | 19:23 |
tobiash | pabelanger: note that the quota exceeded raises before increasing the counter | 19:23 |
pabelanger | Hmm, let me try again. I'll add a test where launch-retries is 0, because I'm pretty sure there is a bug today where that isn't accounted for | 19:25 |
clarkb | pabelanger: ya rereading it I think max-retries is really "max-attempts" and it can't be 0 valued | 19:26 |
clarkb | or if it is that is undefined | 19:26 |
pabelanger | okay, I'll grab some coffee and try again in a bit | 19:27 |
Shrews | tobiash: pabelanger: wait... if the "quota exceeded" part is hit, yes, it will leave the launch() method, but we should pause and retry the launch later (effectively still retrying forever as pabelanger wants). i don't see the breakage | 19:32 |
Shrews | so pabelanger's change alters the "inner scope" of retrying (within a single launch). but the "outer scope" of retrying when paused is unaffected (unless i miss something) | 19:33 |
tobiash | The breakage is when you don't have unlimited retries and hit quota exceeded multiple times; then those are counted as failures | 19:34 |
Shrews | tobiash: ah! i see noiw | 19:35 |
Shrews | now | 19:35 |
tobiash | Or is that retry a different launch with a new counter? | 19:35 |
Shrews | tobiash: i believe we should reset the counter on quota error (launch() gets called again) | 19:36 |
Shrews | so i think it may be ok | 19:36 |
clarkb | Shrews: tobiash this is specific to single provider configurations | 19:36 |
clarkb | I thought that a quota fialure was treated as a failure in that provider? | 19:36 |
tobiash | switching to laptop | 19:37 |
Shrews | clarkb: a recent change by tobiash makes quota errors pause and retry until quota frees up. | 19:38 |
clarkb | gotcha | 19:38 |
Shrews | so before, it would have been a launch error and tried max-tries times | 19:38 |
Shrews | or whatever that config setting is called | 19:38 |
tobiash | Shrews: ah I see now | 19:39 |
tobiash | the retries variable is just local | 19:39 |
tobiash | so it shouldn't matter as the retry on quota exceeded is in fact a new node | 19:39 |
Shrews | yes | 19:39 |
tobiash | as the failed node was aborted | 19:39 |
clarkb | in that case the original code was probably fine save for my confusion over max-retries actually being max attempts | 19:40 |
clarkb | pabelanger: ^ | 19:40 |
tobiash | but it breaks quota retry if launch_retries is 0 | 19:41 |
Shrews | double yes | 19:41 |
Shrews | but, i think it does need some good testing in the various scenarios | 19:41 |
clarkb | tobiash: I think the original patchset may be fine | 19:41 |
tobiash | we might need to test almost all combinations of retries = [inf, 0, >0] together with quota failures | 19:43 |
clarkb | I'm beginning to wonder if 0 is a valid value | 19:44 |
clarkb | I had assumed so because in config it is "retries" but in execution it is really attempts | 19:44 |
clarkb | and 0 attempts means nothing boots | 19:45 |
Shrews | tobiash: ++ | 19:45 |
Shrews | clarkb: launch_attempts is probably more accurate | 19:45 |
Shrews | because launch_retries=1 would have always done 1 attempt | 19:46 |
Shrews | 0 doesn't seem valid | 19:48 |
clarkb | ya | 19:48 |
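For context, launch-retries is a provider-level knob in the nodepool config; per the discussion above it effectively counts attempts rather than retries. A hypothetical fragment (provider name illustrative):

```yaml
providers:
  - name: dc1-openstack
    cloud: dc1
    launch-retries: 3   # effectively three launch attempts per node; 0 is not a useful value
```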
tobiash | btw, the quota retry saved us already a lot of node_failures because we now also have nodes that boot from volume combined with a relatively low volume quota and no volume quota support in nodepool yet :) | 19:48 |
tobiash | what I also learned is that labels that compete for different types of quota (cpu vs cpu+volume) need to be separated into different pools | 19:50 |
tobiash | otherwise a cpu+volume node can pause the handler regardless if there are cpus left for other nodes or not | 19:51 |
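A hedged sketch of the pool split tobiash describes: boot-from-volume labels isolated in their own pool so a volume-quota stall cannot pause requests for labels that only need cpu/ram quota. Names and sizes are illustrative.

```yaml
providers:
  - name: dc1-openstack
    cloud: dc1
    diskimages:
      - name: ubuntu-xenial
    pools:
      - name: cpu-only                 # labels here only consume instance/cpu/ram quota
        max-servers: 40
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            min-ram: 8192
      - name: cpu-and-volume           # labels here also compete for volume quota
        max-servers: 10
        labels:
          - name: ubuntu-xenial-bfv
            diskimage: ubuntu-xenial
            min-ram: 8192
            boot-from-volume: true
            volume-size: 80
```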
Shrews | tobiash: these sound like helpful things we should document someplace | 19:54 |
clarkb | tobiash: that is probably a good way to define a pool too | 19:54 |
clarkb | a pool is a rsource contention aggregation | 19:54 |
clarkb | ++ to documenting | 19:54 |
Shrews | i just don't know where to document since we don't really have a deployment/admin suggestive guide | 19:55 |
Shrews | suggestion*/faq/whatever | 19:55 |
clarkb | to start we can probably just add it to the config docs | 19:56 |
clarkb | and then incorporate that into more narrative style docs later | 19:56 |
tobiash | k, will write something tomorrow | 20:06 |
pabelanger | Shrews: clarkb: tobiash: could somebody reply with the expected changes on 584488? I'll look and re-read the backscroll again shortly | 20:14 |
clarkb | pabelanger: I think for me its go back to original patchset | 20:15 |
clarkb | pabelanger: but I think tobiash wants to figure it out a bit more | 20:15 |
pabelanger | kk | 20:15 |
Shrews | pabelanger: left a comment with the most important take away for me | 20:23 |
pabelanger | Shrews: ++ | 20:24 |
mhu | tobiash, can the dequeue in CLI patch get a +3? https://review.openstack.org/#/c/95035/ | 20:26 |
openstackgerrit | Merged openstack-infra/nodepool master: Add logging when pausing due to node abort https://review.openstack.org/585007 | 20:51 |
*** samccann has quit IRC | 21:05 | |
*** hashar has quit IRC | 21:05 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add upgrade release note about aborted node status https://review.openstack.org/585035 | 21:07 |
Shrews | mordred: corvus: we need ^ merged before releasing | 21:09 |
corvus | ++ | 21:10 |
mordred | Shrews: should we do another restart to also pick up the logging change just to be doubly-sure? | 21:10 |
Shrews | mordred: i think that would be nice for debugging | 21:10 |
Shrews | but not technically necessary | 21:10 |
mordred | yah - I'm fairly confident that we can emit log lines | 21:10 |
Shrews | but being double-sure is bueno | 21:10 |
mordred | corvus: I think Shrews and I have the same "yes but also meh" thought - do you have any divergent opinions? | 21:11 |
Shrews | i have to EOD and go pick up my vehicle, so if someone else wants to do that, please go for it (launchers only) | 21:11 |
mordred | kk | 21:12 |
clarkb | maybe restart with 585035 merged then we will run exactly what is tagged? | 21:12 |
mordred | wfm | 21:12 |
Shrews | wfm x 2 | 21:12 |
clarkb | though I wonder how long check will take this time of day | 21:13 |
mordred | clarkb: that is also a good point - and it's going to be that way all week | 21:13 |
corvus | i think we can wait for 035 and test again. we probably won't release zuul until tues or wed at the earliest anyway. | 21:14 |
clarkb | ya that's why I've been spending time improving reliability of clouds and trying to increase capacity. We are right in the heart of the merge all things | 21:14 |
clarkb | corvus: ok | 21:14 |
corvus | (on the plus side, this should give us plenty of traffic to see if the zuul memory problem is fixed) | 21:15 |
clarkb | indeed | 21:16 |
pabelanger | http://paste.openstack.org/show/726483/ | 22:14 |
pabelanger | we are getting that from latest zuul-executor for some reason | 22:14 |
pabelanger | guessing something broke in cherrypy | 22:14 |
pabelanger | CherryPy==17.0.0 seems to be installed | 22:16 |
mordred | pabelanger: sounds like a problem with six version | 22:17 |
pabelanger | six==1.10.0 | 22:17 |
pabelanger | checking docs | 22:19 |
pabelanger | heh | 22:20 |
pabelanger | Pull request #203: Add parse_http_list and parse_keqv_list to moved | 22:20 |
pabelanger | urllib.request. | 22:20 |
pabelanger | that is 1.10.0 | 22:20 |
pabelanger | https://github.com/benjaminp/six/blob/master/CHANGES | 22:20 |
pabelanger | checking ze01 to see what is there | 22:21 |
openstackgerrit | Merged openstack-infra/nodepool master: Add upgrade release note about aborted node status https://review.openstack.org/585035 | 22:25 |
*** rlandy has quit IRC | 22:28 | |
pabelanger | mordred: for some reason, we didn't install six=1.11.0 on ze11.o.o, but it seems to be in install_requires: https://github.com/cherrypy/cherrypy/blob/master/setup.py | 22:31 |
pabelanger | manually doing pip install -U six fixed it | 22:31 |
mordred | yah- pip won't upgrade a package or do an internal dag | 22:32 |
mordred | as part of a single install | 22:33 |
pabelanger | yah, xenial is 1.10.0, how can we fix that with puppet-zuul? | 22:33 |
mordred | wait - is six getting installed by apt? | 22:33 |
pabelanger | yah | 22:33 |
mordred | ugh | 22:33 |
mordred | we REALLY need to stop installing python things via apt | 22:33 |
mordred | it breaks literally everything every time | 22:33 |
pabelanger | it looks like a dependency of ubuntu-server | 22:34 |
mordred | sigh | 22:34 |
mordred | of course it is | 22:34 |
mordred | oh - I think because of cloud-init? | 22:34 |
mordred | pabelanger: did we not uninstall cloud-init? | 22:34 |
pabelanger | yes, no longer on server | 22:34 |
mordred | hrm | 22:35 |
mordred | well - containers solve this ... but lemme stare at puppet for a sec | 22:35 |
pabelanger | mordred: would virtualenv help here? | 22:35 |
mordred | pabelanger: I think virtualenv would be too much engineering that we'd be throwing away anyway | 22:36 |
pabelanger | mordred: agree, I'd rather push on containers too | 22:36 |
mordred | pabelanger: we've got the containers building now - so it's about getting them published then getting ansible written | 22:36 |
mordred | pabelanger: but - in case we need to spin up another ze before that's done ... | 22:36 |
pabelanger | yah, fixing in puppet would be great. | 22:37 |
*** dkranz has quit IRC | 22:49 | |
*** myoung|bbl is now known as myoung | 22:59 | |
*** threestrands has joined #zuul | 23:30 |