*** nchakrab has joined #zuul | 04:53 | |
tobiash | pabelanger: I added a comment. | 05:36 |
*** quiquell|off is now known as quiquell | 05:48 | |
*** pcaruana has joined #zuul | 07:02 | |
*** nchakrab has quit IRC | 07:22 | |
*** quiquell is now known as quiquell|bbl | 07:22 | |
*** nchakrab has joined #zuul | 07:25 | |
*** quiquell|bbl is now known as quiquell | 07:52 | |
*** hashar has joined #zuul | 08:05 | |
*** goern has joined #zuul | 08:35 | |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Add variables to project https://review.openstack.org/584230 | 09:04 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: doc: Move zuul variable references to a section https://review.openstack.org/584779 | 09:04 |
openstackgerrit | Ian Wienand proposed openstack-infra/zuul master: Use definition list for job status https://review.openstack.org/584780 | 09:04 |
*** nchakrab has quit IRC | 09:08 | |
*** nchakrab has joined #zuul | 09:09 | |
*** nchakrab has quit IRC | 09:13 | |
*** CrayZee has joined #zuul | 09:24 | |
*** jimi|ansible has quit IRC | 09:30 | |
*** panda|off is now known as panda | 09:48 | |
*** nchakrab has joined #zuul | 09:49 | |
*** sshnaidm|afk is now known as sshnaidm|rover | 09:56 | |
tobiash | mordred: I just found a docker benchmark collection that benches various things using different base images: https://www.phoronix.com/scan.php?page=article&item=docker-summer-2018&num=4 | 09:59 |
tobiash | mordred: | 10:00 |
*** sshnaidm|rover is now known as sshnaidm|ruck | 10:00 | |
*** chkumar|ruck is now known as chkumar|rover | 10:01 | |
tobiash | mordred: python with the alpine base image seems to perform quite badly compared to all the other base images | 10:02 |
tristanC | tobiash: interesting results... a superuser answer points at the musl libc ( https://superuser.com/a/1234279 ) | 10:25 |
*** nchakrab has quit IRC | 10:30 | |
*** nchakrab has joined #zuul | 10:31 | |
*** nchakrab has quit IRC | 10:36 | |
*** quiquell is now known as quiquell|bbl | 10:39 | |
*** nchakrab has joined #zuul | 10:47 | |
*** bhavik1 has joined #zuul | 10:57 | |
*** quiquell|bbl is now known as quiquell | 11:03 | |
panda | pabelanger: re OVB, the overlay network remains a difficult approach for the iPXE images. The iPXE process would first have to ask for its management IP, then establish a VXLAN tunnel, then send a DHCP request on that tunnel for the provision network. iPXE doesn't support any of this. I think we can work with the static provision network approach and MAC filtering on the undercloud. The only difficult part here is to pass the | 11:16 |
panda | overcloud instances' MAC addresses to the undercloud so it can enable leases only for those. The other part is to create an overlay network for the overcloud nodes to emulate network isolation. | 11:17 |
quiquell | Hello zuul guys, have one question regarding ZUUL_CHANGES | 11:36 |
quiquell | if you want a commit to be there, you add a Depends-On or you parent it ? | 11:36 |
quiquell | that's right ? | 11:36 |
*** bhavik1 has quit IRC | 11:37 | |
tristanC | quiquell: yes, depends-on in commit message or git repos dependency. | 11:38 |
tristanC | quiquell: ZUUL_CHANGES is a legacy var from zuulv2; the dependencies are now stored in the zuul.items variable | 11:38 |
quiquell | tristanC: We will arrive there :-) | 11:38 |
tristanC | quiquell: e.g. http://logs.rdoproject.org/62/584762/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/fd5f3d6/zuul-info/inventory.yaml | 11:39 |
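For reference, a hedged sketch of what that looks like from a job's point of view: a Depends-On footer in the commit message declares the dependency, and each change in the resulting chain shows up as an entry in the inventory's zuul.items list. All hostnames, project names, and change numbers below are illustrative and not taken from the linked log.

```yaml
# Commit message footer that declares a cross-project dependency:
#   Depends-On: https://review.example.com/123456
#
# Roughly how the dependency chain then appears in the job inventory:
zuul:
  items:
    - project:
        name: org/dependency-project
        canonical_name: review.example.com/org/dependency-project
      branch: master
      change: "123456"
      patchset: "1"
      change_url: https://review.example.com/123456
    - project:
        name: org/project
        canonical_name: review.example.com/org/project
      branch: master
      change: "123457"
      patchset: "1"
      change_url: https://review.example.com/123457
```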
panda | quiquell: what do you mean "parent" it ? | 11:39 |
*** GonZo2000 has joined #zuul | 11:40 | |
*** GonZo2000 has quit IRC | 11:40 | |
*** GonZo2000 has joined #zuul | 11:40 | |
tristanC | panda: i think he meant in gerrit, when you click rebase and "change parent revision", or when you git review multiple commit stacked together | 11:41 |
quiquell | panda: If you have two commits for the same project, and you want zuul to test them at one review | 11:41 |
quiquell | panda: You need to make one parent on the other | 11:41 |
panda | tristanC: hhmm, I thought rebasing a change on top of another didn't create another zuul.item, because zuul doesn't need to fetch the change, it's in the git log | 11:43 |
quiquell | panda: But it's stuff that is going to be tested | 11:44 |
quiquell | panda: Makes sense to have it there | 11:44 |
quiquell | panda: The other thing is the refspec where the required projects point to | 11:45 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ignore files produced by tox-cover https://review.openstack.org/584823 | 11:45 |
panda | and how does zuul know what changes in the log need to be added to zuul.items ? | 11:45 |
panda | looks at the whole log and sees what Change-Id: is still open ? | 11:45 |
quiquell | panda: I suppose | 11:46 |
quiquell | panda: Cannot have stuff in the middle | 11:46 |
tristanC | panda: it does, gerrit automatically returns parent information in the "dependsOn" field, see this function for the full logic: https://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/gerrit/gerritconnection.py#n524 | 11:48 |
panda | tristanC: ah, I see, thanks. | 11:49 |
*** quiquell is now known as quiquell|lunch | 11:59 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Invalidate az cache on bad request https://review.openstack.org/582746 | 12:04 |
*** dkranz has joined #zuul | 12:21 | |
*** quiquell|lunch is now known as quiquell | 12:24 | |
*** jimi|ansible has joined #zuul | 12:36 | |
*** jimi|ansible has joined #zuul | 12:36 | |
*** rlandy has joined #zuul | 12:39 | |
*** nchakrab has quit IRC | 12:45 | |
*** samccann has joined #zuul | 12:50 | |
*** samccann_ has joined #zuul | 12:51 | |
mordred | tobiash: interesting. unfortunately the other outlier is debian 9 - which is the base for the other main python base image | 12:52 |
mordred | tobiash: definitely worth keeping our eyes on | 12:52 |
*** nchakrab has joined #zuul | 12:58 | |
*** samccann_ has quit IRC | 12:59 | |
*** jesusaur has quit IRC | 13:05 | |
openstackgerrit | Merged openstack-infra/nodepool master: Invalidate az cache on bad request https://review.openstack.org/582746 | 13:25 |
openstackgerrit | Merged openstack-infra/nodepool master: Ignore files produced by tox-cover https://review.openstack.org/584823 | 13:25 |
pabelanger | tobiash: replied | 13:27 |
pabelanger | panda: +1 | 13:27 |
panda | pabelanger: I know how to filter dhcp offers. kinda. | 13:30 |
tobiash | pabelanger: oh thanks, I think I understood this now | 13:37 |
tobiash | pabelanger: but I have another comment, sorry | 13:37 |
pabelanger | tobiash: sure, let me update and test | 13:39 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Fix delegate_to for ansible synchronize https://review.openstack.org/584656 | 13:47 |
pabelanger | tobiash: ^should be better now | 13:47 |
tobiash | pabelanger: I'd even leave out 'path: /tmp' and let the os decide where to put the tmpdir | 13:49 |
tobiash | pabelanger: apart from that lgtm | 13:49 |
*** myoung has joined #zuul | 13:49 | |
pabelanger | tobiash: I wanted to keep /tmp since on the executor that is outside of the filter path, just to ensure we can write data outside of the workdir, even though this is on the node | 13:50 |
pabelanger | thanks! left comment on review explaining too | 13:53 |
tobiash | pabelanger: ok (although it won't be inside of the workdir as it connects back via ssh) | 13:53 |
pabelanger | tobiash: right, agree! However, the original bug was actually still checking is_path_safe and failing a job: http://logs.openstack.org/14/584614/9/check/ansible-role-nodepool-ubuntu-xenial/82c6155/job-output.txt.gz#_2018-07-22_13_48_56_126339 | 13:56 |
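For anyone following along, a minimal sketch of the pattern the fix exercises: a scratch directory is created with tempfile outside the work dir, and synchronize is delegated back to the node so the rsync happens on the node itself. The directory and file names here are hypothetical and not copied from the actual test fixture.

```yaml
- hosts: all
  tasks:
    - name: Create a scratch dir outside the build work dir
      tempfile:
        state: directory
        path: /tmp            # deliberately outside the work dir, per the discussion above
      register: scratch

    - name: Pre-populate a file to copy
      copy:
        content: "hello"
        dest: "{{ scratch.path }}/common-file"

    - name: rsync the file on the node itself
      synchronize:
        src: "{{ scratch.path }}/common-file"
        dest: "{{ ansible_user_dir }}/copied-file"
        mode: pull
      delegate_to: "{{ inventory_hostname }}"
```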
*** quiquell is now known as quiquell|off | 14:02 | |
*** acozine1 has joined #zuul | 14:08 | |
openstackgerrit | Merged openstack-infra/nodepool master: Update README and add TESTING similar to Zuul repo https://review.openstack.org/581348 | 14:16 |
*** yolanda__ has joined #zuul | 14:19 | |
*** acozine1 has quit IRC | 14:20 | |
*** acozine1 has joined #zuul | 14:21 | |
*** yolanda_ has quit IRC | 14:21 | |
*** yolanda__ is now known as yolanda | 14:26 | |
Shrews | would anyone care to review this builder change? https://review.openstack.org/564687 | 14:33 |
pabelanger | looking | 14:35 |
Shrews | pabelanger: maybe you can speak to ian's concerns about dib there | 14:38 |
Shrews | (i missed those initially, but seems like a caveat might be warranted there) | 14:39 |
pabelanger | Shrews: yah, I just left a comment for just that | 14:40 |
pabelanger | if people are okay with it in general, I'll drop a +2 | 14:40 |
pabelanger | but want to get another set of eyes | 14:40 |
Shrews | pabelanger: maybe corvus or mordred can comment there too | 14:41 |
pabelanger | +1 | 14:42 |
mordred | pabelanger: I'd argue that needing REG_PASSWORD in a RHEL image is a design flaw in rhel - that password would be completely exposed to anyone booting that image and running workloads on it | 14:43 |
mordred | BUT - given that, I don't think it's any _more_ insecure than the need to put the REG_PASSWORD in the image in the first place | 14:44 |
mordred | (the nodepool patch) | 14:44 |
pabelanger | yah, that is true too. Didn't think of that | 14:45 |
pabelanger | really don't know how RHEL licensing works TBH | 14:45 |
mordred | it's funny how old-school proprietary software techniques like license managers are _completely_ incompatible with cloud | 14:45 |
pabelanger | Like, could you use the RHEL key in DIB, yum install all the latest packages, delete the key, and the booted image works? | 14:45 |
mordred | yeah - but then users of the image wouldn't be able to install new images | 14:46 |
mordred | packages | 14:46 |
mordred | that is | 14:46 |
rcarrillocruz | haha, yah...the state of cloud images for networking make me want to stab my eyes at times | 14:46 |
mordred | I mean - this is amongst the reasons that literally nobody uses rhel as guest images in cloud | 14:46 |
rcarrillocruz | cloud-init is not a thing for most | 14:46 |
pabelanger | yah, wonder if you could write a wrapper in ansible using zuul secret, for bindep, then delete key after yum, etc | 14:47 |
pabelanger | but that should likely work too | 14:47 |
mordred | pabelanger: yah - you could likely do that - but it's still a mess. I'd just expose the password since it's an absurd protection | 14:47 |
mordred | "you must use a password to download these things that you can download for free from centos" | 14:47 |
pabelanger | ++ | 14:48 |
* mordred looks around to see if he's been fired yet | 14:48 | |
mordred | pabelanger: I think in the non-CI world, one would have a rhel key as a customer, and then injecting that key at boot time using something like cloud-init or ansible would not be exposing your personal rhel secret, because you're the user | 14:51 |
*** jesusaur has joined #zuul | 14:51 | |
mordred | pabelanger: for public CI, it's a complete mess ... and I think basically means that for public CI purposes RHEL is 100% unsuitable and centos basically must be used | 14:52 |
pabelanger | Yah, maybe that is why I've not seen much discussion around RHEL and CI when I've asked. Just for this reason of key leaks | 14:53 |
mordred | yah | 14:53 |
pabelanger | mordred: since you are here, and non-RHEL related: https://review.openstack.org/584656/ fixes a bug with synchronize in ansible and zuul. If you'd like to review and see if I missed something obvious in zuul's trusted / untrusted checks | 14:54 |
mordred | pabelanger: I think that patch looks right. I added a +2 - but I'd like corvus or clarkb to look too - it's one of those areas that more eyes is more better | 14:57 |
pabelanger | thanks! | 14:57 |
pabelanger | yah, happy to get more eyeballs on this | 14:57 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Add tenant yaml validation option to zuul client https://review.openstack.org/574265 | 15:01 |
clarkb | mordred: which one? | 15:03 |
mordred | clarkb: https://review.openstack.org/584656/ | 15:05 |
corvus | pabelanger: i don't understand your answer to tobiash's comment; can you rephrase that, possibly with more examples? | 15:05 |
pabelanger | yes | 15:06 |
pabelanger | corvus: I've added more detail to the explanation. | 15:11 |
*** mrhillsman has quit IRC | 15:18 | |
*** jtanner has quit IRC | 15:19 | |
*** patrickeast has quit IRC | 15:19 | |
*** hashar has quit IRC | 15:19 | |
*** maxamillion has quit IRC | 15:19 | |
*** clarkb has quit IRC | 15:19 | |
*** goern has quit IRC | 15:20 | |
*** sdoran has quit IRC | 15:20 | |
*** TheJulia has quit IRC | 15:20 | |
*** andreaf has quit IRC | 15:21 | |
*** patrickeast has joined #zuul | 15:21 | |
*** mrhillsman has joined #zuul | 15:21 | |
*** jtanner has joined #zuul | 15:22 | |
*** maxamillion has joined #zuul | 15:23 | |
*** sdoran has joined #zuul | 15:23 | |
*** jtanner has quit IRC | 15:24 | |
*** sdoran has quit IRC | 15:24 | |
*** sdoran has joined #zuul | 15:24 | |
*** jtanner has joined #zuul | 15:25 | |
*** andreaf has joined #zuul | 15:25 | |
*** maxamillion has quit IRC | 15:25 | |
*** maxamillion has joined #zuul | 15:25 | |
*** sdoran has quit IRC | 15:25 | |
*** hashar has joined #zuul | 15:26 | |
*** sdoran has joined #zuul | 15:26 | |
*** jtanner has quit IRC | 15:26 | |
*** hashar has quit IRC | 15:26 | |
*** hashar has joined #zuul | 15:26 | |
*** sdoran has quit IRC | 15:27 | |
*** sdoran has joined #zuul | 15:27 | |
corvus | pabelanger: thanks, i think i grok | 15:29 |
*** clarkb has joined #zuul | 15:33 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add container spec https://review.openstack.org/560136 | 15:34 |
*** TheJulia has joined #zuul | 15:35 | |
*** jtanner has joined #zuul | 15:36 | |
*** jtanner has quit IRC | 15:38 | |
*** jtanner has joined #zuul | 15:38 | |
corvus | pabelanger: can you point me at a summary of your log work? (maybe a change that explains most of what's going on, or an email, or something?) i want to refresh my memory and think about how it might interact with things i've been thinking about wrt logs-in-swift | 15:44 |
*** nchakrab has quit IRC | 15:45 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Docs: increase visibility of env secrets warning https://review.openstack.org/584950 | 15:48 |
corvus | Shrews, pabelanger, ianw, tristanC: ^ | 15:48 |
openstackgerrit | Merged openstack-infra/nodepool master: Use detail listing in integration testing https://review.openstack.org/584542 | 16:00 |
openstackgerrit | Merged openstack-infra/nodepool master: builder: support setting diskimage env-vars in secure configuration https://review.openstack.org/564687 | 16:04 |
*** nchakrab has joined #zuul | 16:12 | |
*** nchakrab has quit IRC | 16:13 | |
*** panda is now known as panda|off | 16:30 | |
*** gchenuet has joined #zuul | 16:36 | |
mordred | corvus, Shrews: if you get a spare sec, https://review.openstack.org/#/c/584395/ doc patch for pbrx | 16:37 |
pabelanger | clarkb: could I trouble you for a review of https://review.openstack.org/584656/ too, having extra set of eyes helps | 16:38 |
gchenuet | Hi guys ! We are working (at leboncoin, a French classified ads website) on updating our Zuul V2 infrastructure to Zuul V3 and we have a question about the Nodepool Launcher service: is it possible to scale it up to provide redundancy ? We saw that Nodepool Builder is able to do this, but we don't know about Nodepool Launcher. | 16:39 |
gchenuet | Thanks, and congrats on this awesome stack of tools ! | 16:39 |
clarkb | gchenuet: it is possible, but you must partition the clouds across launchers as they won't see each other's resource usage directly. If you only have one cloud you could give each launcher a different account with independent quota | 16:40 |
gchenuet | thanks clarkb ! What is the best redundancy solution for you if we have two DCs (with 1 OS per DC) to keep all the Zuul infra (Zuul + Nodepool services) alive if we lose 1 DC ? | 16:45 |
clarkb | gchenuet: for that case I would run two nodepool launchers, one in each DC, speaking to the local DC openstack. This way if the datacenter goes away the other nodepool keeps running. (do similar for nodepool builders). The one thing this doesn't accommodate is where to run your zookeeper. I'm not sure you can run a zookeeper across multiple datacenters | 16:46 |
clarkb | pabelanger: why is the src value for synchronize in your test file destdir.path? | 16:46 |
clarkb | pabelanger: is that intentional? | 16:46 |
*** pcaruana has quit IRC | 16:47 | |
corvus | gchenuet, clarkb: i agree about launchers -- but i'd probably have both builders talk to both clouds | 16:47 |
corvus | (that way if they are both working, you're still only building the images once) | 16:48 |
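A hedged sketch of the partitioning being described: each launcher gets a nodepool.yaml listing only its local provider, while every launcher and builder points at the same ZooKeeper cluster. All hostnames, provider names, and sizes below are made up for illustration.

```yaml
# nodepool.yaml for the launcher running in DC1; the DC2 launcher would
# carry an equivalent file listing only the dc2 provider.
zookeeper-servers:
  - host: zk01.example.com
    port: 2181
  - host: zk02.example.com
    port: 2181
  - host: zk03.example.com
    port: 2181

labels:
  - name: ubuntu-xenial
    min-ready: 2

providers:
  - name: dc1-openstack
    cloud: dc1
    diskimages:
      - name: ubuntu-xenial
    pools:
      - name: main
        max-servers: 50
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            min-ram: 8192
```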
pabelanger | clarkb: yah, comes from the copy-common role | 16:48 |
pabelanger | see depends-on for more examples of usage | 16:48 |
clarkb | corvus: oh good point | 16:48 |
clarkb | pabelanger: isn't that backwards? if nothing else it is confusing :) | 16:48 |
corvus | SpamapS: any thoughts/experience running zookeeper across a wan (multiple datacenters)? (see question from gchenuet) | 16:49 |
clarkb | pabelanger: I guess you are prepping files for copying in the other direction? | 16:49 |
pabelanger | clarkb: right, depends on mode: pull or push | 16:49 |
*** acozine1 has quit IRC | 16:49 | |
pabelanger | fetching common-file from the copy-common role tmpdir | 16:50 |
clarkb | pabelanger: both use src: destdir.path in the child change | 16:51 |
corvus | gchenuet: if you try running zk across multiple DCs, please let us know how it goes. we (in openstack infra) have a nodepool system that's similar to what you're setting up, but we've only run zk within one data center. i assume it can be done but may require some tuning to allow for greater latencies? | 16:51 |
pabelanger | clarkb: which line? | 16:51 |
clarkb | pabelanger: https://review.openstack.org/#/c/584656/6/tests/fixtures/config/remote-action-modules/git/org_project/playbooks/synchronize-delegate-good.yaml 22 and 46 | 16:52 |
tobiash | gchenuet: and you should check if you can encrypt the communication between zookeepers over wan | 16:52 |
tobiash | but client side tls is not implemented in kazoo yet (there are pull requests for that) | 16:53 |
gchenuet | our DCs are connecting with LAN and not WAN | 16:53 |
gchenuet | FYI | 16:53 |
gchenuet | I think we'll setup a zk cluster with 6 nodes (3 per DC) | 16:54 |
clarkb | gchenuet: it needs to be an odd number | 16:55 |
tobiash | and write performance drops with more nodes (upstream suggestion is 3, 5 or 7 I think) | 16:55 |
gchenuet | ok, good to know ! thanks, we have a third DC so 2 - 2 -1 | 16:56 |
pabelanger | clarkb: right, I don't think it matters too much, because in this case we are doing an rsync on a single server. So push and pull are both delegated to the nodepool node; basically in this case we want src to always be destdir.path regardless of mode | 16:56 |
pabelanger | https://docs.ansible.com/ansible/2.5/modules/synchronize_module.html | 16:57 |
pabelanger | see mode | 16:57 |
clarkb | pabelanger: mostly its just confusing that we are using destdir as the src not the dest dir | 16:57 |
clarkb | why not make the variable called srcdir in this case? | 16:57 |
tobiash | regarding the launchers, I think now that nodepool is resilient against unexpectedly hitting the quota you could even run multiple launchers against the same cloud | 16:58 |
gchenuet | really ? | 16:58 |
*** bbayszczak has joined #zuul | 16:58 | |
tobiash | it might cause more openstack api requests than necessary when under quota pressure but I think it shouldn't break | 16:58 |
pabelanger | clarkb: that is what common-copy does: it creates a tmp directory with a file, and I just reused that to pre-populate something. That is the file we are using for src here. But I can see how that is a little confusing | 16:59 |
clarkb | it would be racy on consumption of quota, which may lead to NODE_FAILURES, but other than that it may work | 16:59 |
*** bbayszcz_ has joined #zuul | 16:59 | |
tobiash | clarkb: nodepool now retries when hitting quota failures | 16:59 |
clarkb | tobiash: oh good | 16:59 |
*** bbayszczak has quit IRC | 16:59 | |
tobiash | that doesn't cause node_failures anymore | 16:59 |
pabelanger | tobiash: ++ | 17:00 |
gchenuet | so i can have one zk cluster across DCs with nodepool launchers and builders in each DC connected to the same zk cluster ? | 17:00 |
*** GonZo2000 has quit IRC | 17:01 | |
tobiash | yes, that should work, if the latency doesn't have too much impact on zk | 17:01 |
*** hashar is now known as hasharDinner | 17:02 | |
clarkb | pabelanger: as for push vs pull, that just changes which host is looked at to copy from dest to src. | 17:02 |
*** bbayszcz_ is now known as bbayszczak | 17:02 | |
tobiash | gchenuet: but that's something you probably just need to try | 17:02 |
gchenuet | to be sure, when you say 'cloud', is that a 'cloud provider' in nodepool ? | 17:02 |
clarkb | pabelanger: the test takes advantage of the fact that they are the same in this case | 17:02 |
tobiash | gchenuet: yes | 17:02 |
pabelanger | clarkb: right, I can propose a follow change that removes the usage of destdir variable and sets a fact for srcdir | 17:05 |
pabelanger | give me a few minutes | 17:05 |
*** nchakrab has joined #zuul | 17:05 | |
tobiash | corvus: regarding buildset resources, what do you think about this user api: https://etherpad.openstack.org/p/JINkHL4VuM | 17:07 |
tobiash | we already map child_jobs into the zuul variable so I think we should do the same for pause | 17:08 |
corvus | tobiash: oh, hrm; i wonder if we should actually do option #2 for both things? | 17:09 |
tobiash | corvus: can we still change that for child_jobs? | 17:10 |
tobiash | (I'd also be in favor of mapping behavior things to real module parameters) | 17:11 |
gchenuet | thanks tobiash ! | 17:12 |
corvus | (the main difference between the two being whether the variables are exported to the child job) | 17:12 |
tobiash | but that should be consistent then | 17:12 |
gchenuet | In your OpenStack CI, where are the nodepool launchers ? in which cloud providers ? | 17:12 |
corvus | tobiash: we haven't released a version with child_jobs yet, i think we can change it | 17:12 |
corvus | pabelanger: ^ we're talking about maybe changing how child_jobs is returned | 17:13 |
tobiash | corvus: thinking about that, even if we switch to module args we still could map the values into the child jobs within the executor | 17:13 |
tobiash | in case of pause I would even remap that to a different variable like zuul.paused_parents: [p1, p2] | 17:14 |
pabelanger | corvus: okay, I haven't started using it yet | 17:15 |
corvus | hrm. well, both of these are using the special 'zuul' dictionary. maybe that's enough to distinguish it from regular data that's passed to children... | 17:15 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Clean up remote ansible synchronize tests https://review.openstack.org/584978 | 17:17 |
pabelanger | clarkb: I believe ^ is clearer now? | 17:17 |
corvus | tobiash: ultimately the interface is actually just a json file on disk. if we did option #2, we'd have to revise that structure somehow. | 17:17 |
corvus | tobiash: (i guess "zuul_return: pause: True" could just be a shortcut for "zuul_return: data: zuul: pause: True"....) | 17:18 |
tobiash | that's true, it's easier for the user, but a little bit harder to implement | 17:18 |
tobiash | interesting idea | 17:18 |
corvus | tobiash: well, log_url is already like this and is in a release. child_jobs is like this, but isn't in a release yet (but is about to be). i think maybe we should stick with option #1 for now, and then think about either making an alias later, or restructuring it (if we feel like actually deprecating things) | 17:21 |
openstackgerrit | Merged openstack-infra/nodepool master: Docs: increase visibility of env secrets warning https://review.openstack.org/584950 | 17:21 |
tobiash | corvus: ok, I'm fine with that | 17:22 |
corvus | (i bet if we just make it an alias later, that will improve the UX without making things more complicated) | 17:22 |
tobiash | corvus: shall we name it pause or do you have a better naming proposal (run_children_now, ...)? | 17:23 |
corvus | tobiash: i like pause (or suspend is my runner-up) | 17:24 |
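To make the discussion concrete, roughly what "option #1" (everything carried inside the special zuul dict) would look like from a job playbook. At the time of this conversation pause was only a proposal, so this is a sketch of the interface being discussed rather than released behavior; the child job name is illustrative.

```yaml
- hosts: localhost
  tasks:
    - name: Pause this job and restrict which children run
      zuul_return:
        data:
          zuul:
            pause: true              # keep this job (and its resources) alive while children run
            child_jobs:
              - integration-test     # only start this child job
```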
pabelanger | tobiash: corvus: speaking of child_jobs for zuul_return, tristanC noticed this change in functionality for zuul: https://storyboard.openstack.org/#!/story/2003026 I think we want to maybe address that before tagging | 17:24 |
*** bbayszcz_ has joined #zuul | 17:24 | |
corvus | pabelanger: agreed; is someone working on that? | 17:24 |
corvus | gchenuet: we use quite a few public and private providers, you can see which by looking at our grafana dashboards (search for nodepool): http://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1 | 17:27 |
corvus | gchenuet: (some of those have multiple regions too) | 17:27 |
*** bbayszczak has quit IRC | 17:27 | |
pabelanger | corvus: I can, but could use help on how best to fix it. This is a side effect of not allowing all jobs to be skipped: https://git.zuul-ci.org/cgit/zuul/commit/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807 | 17:27 |
pabelanger | it didn't account for non-voting | 17:28 |
*** bbayszcz_ has quit IRC | 17:28 | |
corvus | pabelanger: i think there's just a logic error in didAllJobsSucceed -- a non-voting job that ran isn't 'skipped', but it's checked first | 17:29 |
*** myoung is now known as myoung|bbl | 17:29 | |
tobiash | pabelanger: I think you just could set skipped to false in https://git.zuul-ci.org/cgit/zuul/tree/zuul/model.py?id=5c797a12a8229b30124988723f786d4ee8dea807#n1808 | 17:29 |
gchenuet | thanks corvus, and you have the same single zk + nodepool infra for all providers ? | 17:31 |
corvus | gchenuet: we have a single zk, 4 builders and 4 launchers. some builders and launchers are located in specific clouds, some are just in one cloud and talk to several clouds. some of that is dictated by architecture and networking requirements, not just redundancy/availability) | 17:34 |
*** CrayZee has quit IRC | 17:35 | |
pabelanger | corvus: tobiash: I'll create a new test for all skipped jobs and work to fix the logic error | 17:35 |
pabelanger | I'll do that now | 17:35 |
corvus | pabelanger: thanks! | 17:35 |
gchenuet | but are any of your launchers speaking to the same cloud ? | 17:36 |
clarkb | gchenuet: no | 17:37 |
corvus | well, at least one of them is | 17:38 |
corvus | and at least two of the builders are | 17:38 |
clarkb | corvus: which launcher of ours talks to the same cloud? | 17:40 |
clarkb | they all talk to distinct sets of clouds | 17:41 |
corvus | clarkb: i interpreted "same cloud" as "cloud in which the launcher is residing". | 17:46 |
corvus | if the question is "do any two launchers talk to the same cloud" then the answer is indeed no | 17:47 |
corvus | (but i thought we already answered that) | 17:48 |
clarkb | ah | 17:48 |
gchenuet | thanks ! I think you've found a solution; let's test it and if it works, we'll be very happy to share our infra with you (we run 50K builds per week) | 17:49 |
gchenuet | thank you so much guys for your work and your help ! | 17:49 |
*** gchenuet has quit IRC | 17:54 | |
pabelanger | with stestr, is there a way to move the captured traceback output after the captured logging? Or customize it locally? I'm finding it a pain to always be scrolling up in the terminal to see why something failed. | 17:54 |
Shrews | pabelanger: have you tried 'script'? there might be a simpler solution, but i often use that for things where i'm not sure how to control output | 18:04 |
Shrews | i often use that for zk-shell output capturing | 18:04 |
mordred | corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: you may want to skim the recent scrollback in #ansible-devel ... it's about an upcoming behavior change in ansible related to modules returning ansible_facts in their return json | 18:07 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Fix zuul reporting build failure with only non-voting jobs https://review.openstack.org/584990 | 18:07 |
mordred | https://github.com/ansible/ansible/pull/41811 is the pr adding the deprecation messages - agaffney says we can test the new behavior by setting inject_facts_as_vars to false in ansible.cfg | 18:08 |
*** nchakrab has quit IRC | 18:08 | |
pabelanger | corvus: ^ implements tobiash suggestion for fixing non-voting jobs with zuul.child_jobs patch | 18:08 |
pabelanger | Shrews: yah, that could help too | 18:10 |
mordred | corvus, Shrews, pabelanger, tobiash, tristanC, clarkb, SpamapS: http://paste.ubuntu.com/p/xJFyJJ7rcy/ is the relevant channel scrollback in case you weren't in #ansible-devel | 18:14 |
corvus | mordred: so that might affect roles/multi-node-known-hosts/library/generate_all_known_hosts.py ? | 18:17 |
mordred | corvus: yes | 18:17 |
corvus | (sorry, that was in zuul-jobs) | 18:17 |
pabelanger | reading | 18:17 |
corvus | mordred: but maybe could be worked around with a normal return / register / set_fact ? | 18:18 |
mordred | corvus: we're on 2.5 now right? | 18:18 |
corvus | mordred: yes | 18:18 |
mordred | corvus: cool. so - I believe we can just access the value as ansible_facts.all_known_hosts safely now | 18:18 |
mordred | corvus: or we could change it to return/register | 18:19 |
mordred | as of 2.5 the ansible_facts dict exists and that return should be setting the value both as a top-level var and in the ansible_facts dict | 18:19 |
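The practical upshot for roles, in a hedged before/after form: the debug tasks below are hypothetical stand-ins for whatever the real role does with the value, and only illustrate the variable scoping change.

```yaml
# Before: relies on the fact being injected as a top-level variable, which
# breaks once inject_facts_as_vars = False is set in ansible.cfg [defaults].
- name: Use the fact via the injected top-level variable
  debug:
    msg: "{{ all_known_hosts }}"

# After: scope the lookup through the ansible_facts dict, which exists
# as of Ansible 2.5 regardless of the injection setting.
- name: Use the fact via the ansible_facts dict
  debug:
    msg: "{{ ansible_facts.all_known_hosts }}"
```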
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Scope all_known_hosts in mult-node-known-hosts https://review.openstack.org/584993 | 18:21 |
pabelanger | +2 | 18:22 |
pabelanger | haven't used ansible_facts much yet, but sounds like we should be | 18:23 |
corvus | it's apparently new :) | 18:26 |
*** ianychoi has quit IRC | 18:33 | |
*** ianychoi has joined #zuul | 18:34 | |
*** dkranz has quit IRC | 18:38 | |
*** dkranz has joined #zuul | 18:42 | |
*** hasharDinner has quit IRC | 19:00 | |
*** hashar has joined #zuul | 19:00 | |
Shrews | tobiash: according to https://review.openstack.org/536930, if "quota exceeded" is returned during a launch, we should pause, right? I'm not seeing that pause in deployed code | 19:03 |
tobiash | What do you see instead? | 19:04 |
Shrews | i'm not sure what implications this might have | 19:04 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool master: Allow launch-retries to indefinitely retry for openstack https://review.openstack.org/584488 | 19:04 |
pabelanger | clarkb: okay, think ^ handles launch-retries: 0 | 19:05 |
pabelanger | but still haven't figured out best way to test indefinitely retries | 19:05 |
Shrews | tobiash: http://paste.openstack.org/show/726468/ | 19:05 |
clarkb | pabelanger: you saw the note about how quota failures will short circuit too right? | 19:06 |
clarkb | pabelanger: I think that likely to be your biggest enemy in that code | 19:06 |
Shrews | tobiash: oh, maybe we just don't log the actual pause in that code path? | 19:08 |
Shrews | tobiash: b/c i do see an "Unpaused request" for that request ID | 19:08 |
tobiash | Shrews: yes, I see that too | 19:09 |
Shrews | tobiash: in your env? | 19:09 |
tobiash | Shrews: in your log | 19:09 |
tobiash | In my env it unpauses every 30 seconds or so and checks again | 19:10 |
Shrews | i didn't include the unpause in that paste | 19:10 |
Shrews | it was further down in what i pasted, but regardless, looks like that's what's happening | 19:11 |
tobiash | I looked wrong, never mind, I saw a pause after quota recalculation | 19:12 |
pabelanger | clarkb: yah, I'm not sure how best to do that. tobiash Shrews do you mind looking at the comment from clarkb on https://review.openstack.org/584488 | 19:13 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add logging when pausing due to node abort https://review.openstack.org/585007 | 19:17 |
Shrews | tobiash: ^^ | 19:17 |
tobiash | Lgtm | 19:19 |
tobiash | pabelanger: your change changes the failure count to count in advance | 19:21 |
tobiash | That will break indefinitely retrying on quota exceeded | 19:22 |
tobiash | As quota exceeded should not be counted | 19:23 |
tobiash | pabelanger: note that the quota exceeded raises before increasing the counter | 19:23 |
pabelanger | Hmm, let me try again. I'll add a test where launch-retries is 0, because I'm pretty sure there is a bug today where that isn't accounted for | 19:25 |
clarkb | pabelanger: ya rereading it I think max-retries is really "max-attempts" and it can't be 0 valued | 19:26 |
clarkb | or if it is that is undefined | 19:26 |
pabelanger | okay, I'll grab some coffee and try again in a bit | 19:27 |
Shrews | tobiash: pabelanger: wait... if the "quota exceeded" part is hit, yes, it will leave the launch() method, but we should pause and retry the launch later (effectively still retrying forever as pabelanger wants). i don't see the breakage | 19:32 |
Shrews | so pabelanger's change alters the "inner scope" of retrying (within a single launch). but the "outer scope" of retrying when paused is unaffected (unless i miss something) | 19:33 |
tobiash | The breakage is when you don't have unlimited retries and hit quota exceeded multiple times; then those are counted as failures | 19:34 |
Shrews | tobiash: ah! i see noiw | 19:35 |
Shrews | now | 19:35 |
tobiash | Or is that retry a different launch with a new counter? | 19:35 |
Shrews | tobiash: i believe we should reset the counter on quota error (launch() gets called again) | 19:36 |
Shrews | so i think it may be ok | 19:36 |
clarkb | Shrews: tobiash this is specific to single provider configurations | 19:36 |
clarkb | I thought that a quota fialure was treated as a failure in that provider? | 19:36 |
tobiash | switching to laptop | 19:37 |
Shrews | clarkb: a recent change by tobiash makes quota errors pause and retry until quota frees up. | 19:38 |
clarkb | gotcha | 19:38 |
Shrews | so before, it would have been a launch error and tried max-tries times | 19:38 |
Shrews | or whatever that config setting is called | 19:38 |
tobiash | Shrews: ah I see now | 19:39 |
tobiash | the retries variable is just local | 19:39 |
tobiash | so it shouldn't matter as the retry on quota exceeded is in fact a new node | 19:39 |
Shrews | yes | 19:39 |
tobiash | as the failed node was aborted | 19:39 |
clarkb | in that case the original code was probably fine save for my confusion over max-retries actually being max attempts | 19:40 |
clarkb | pabelanger: ^ | 19:40 |
tobiash | but it breaks quota retry if launch_retries is 0 | 19:41 |
Shrews | double yes | 19:41 |
Shrews | but, i think it does need some good testing in the various scenarios | 19:41 |
clarkb | tobiash: I think the original patchset may be fine | 19:41 |
tobiash | we might need to test almost all combinations of retries = [inf, 0, >0] together with quota failures | 19:43 |
clarkb | I'm beginning to wonder if 0 is a valid value | 19:44 |
clarkb | I had assumed so because in config it is "retries" but in execution it is really attempts | 19:44 |
clarkb | and 0 attempts means nothing boots | 19:45 |
Shrews | tobiash: ++ | 19:45 |
Shrews | clarkb: launch_attempts is probably more accurate | 19:45 |
Shrews | because launch_retries=1 would have always done 1 attempt | 19:46 |
Shrews | 0 doesn't seem valid | 19:48 |
clarkb | ya | 19:48 |
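For context, launch-retries is a provider-level knob in the nodepool config; per the discussion above it effectively counts attempts rather than retries. A hypothetical fragment (provider name illustrative):

```yaml
providers:
  - name: dc1-openstack
    cloud: dc1
    launch-retries: 3   # effectively three launch attempts per node; 0 is not a useful value
```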
tobiash | btw, the quota retry saved us already a lot of node_failures because we now also have nodes that boot from volume combined with a relatively low volume quota and no volume quota support in nodepool yet :) | 19:48 |
tobiash | what I also learned is that labels that compete for different types of quota (cpu vs cpu+volume) need to be separated into different pools | 19:50 |
tobiash | otherwise a cpu+volume node can pause the handler regardless if there are cpus left for other nodes or not | 19:51 |
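A hedged sketch of the pool split tobiash describes: boot-from-volume labels isolated in their own pool so a volume-quota stall cannot pause requests for labels that only need cpu/ram quota. Names and sizes are illustrative.

```yaml
providers:
  - name: dc1-openstack
    cloud: dc1
    diskimages:
      - name: ubuntu-xenial
    pools:
      - name: cpu-only                 # labels here only consume instance/cpu/ram quota
        max-servers: 40
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            min-ram: 8192
      - name: cpu-and-volume           # labels here also compete for volume quota
        max-servers: 10
        labels:
          - name: ubuntu-xenial-bfv
            diskimage: ubuntu-xenial
            min-ram: 8192
            boot-from-volume: true
            volume-size: 80
```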
Shrews | tobiash: these sound like helpful things we should document someplace | 19:54 |
clarkb | tobiash: that is probably a good way to define a pool too | 19:54 |
clarkb | a pool is a rsource contention aggregation | 19:54 |
clarkb | ++ to documenting | 19:54 |
Shrews | i just don't know where to document since we don't really have a deployment/admin suggestive guide | 19:55 |
Shrews | suggestion*/faq/whatever | 19:55 |
clarkb | to start we can probably just add it to the config docs | 19:56 |
clarkb | and then incorporate that into more narrative style docs later | 19:56 |
tobiash | k, will write something tomorrow | 20:06 |
pabelanger | Shrews: clarkb: tobiash: could somebody reply with the expected changes on 584488? I'll look and re-read the backscroll again shortly | 20:14 |
clarkb | pabelanger: I think for me its go back to original patchset | 20:15 |
clarkb | pabelanger: but I think tobiash wants to figure it out a bit more | 20:15 |
pabelanger | kk | 20:15 |
Shrews | pabelanger: left a comment with the most important take away for me | 20:23 |
pabelanger | Shrews: ++ | 20:24 |
mhu | tobiash, can the dequeue in CLI patch get a +3? https://review.openstack.org/#/c/95035/ | 20:26 |
openstackgerrit | Merged openstack-infra/nodepool master: Add logging when pausing due to node abort https://review.openstack.org/585007 | 20:51 |
*** samccann has quit IRC | 21:05 | |
*** hashar has quit IRC | 21:05 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add upgrade release note about aborted node status https://review.openstack.org/585035 | 21:07 |
Shrews | mordred: corvus: we need ^ merged before releasing | 21:09 |
corvus | ++ | 21:10 |
mordred | Shrews: should we do another restart to also pick up the logging change just to be doubly-sure? | 21:10 |
Shrews | mordred: i think that would be nice for debugging | 21:10 |
Shrews | but not technically necessary | 21:10 |
mordred | yah - I'm fairly confident that we can emit log lines | 21:10 |
Shrews | but being double-sure is bueno | 21:10 |
mordred | corvus: I think Shrews and I have the same "yes but also meh" thought - do you have any divergent opinions? | 21:11 |
Shrews | i have to EOD and go pick up my vehicle, so if someone else wants to do that, please go for it (launchers only) | 21:11 |
mordred | kk | 21:12 |
clarkb | maybe restart with 585035 merged then we will run exactly what is tagged? | 21:12 |
mordred | wfm | 21:12 |
Shrews | wfm x 2 | 21:12 |
clarkb | though I wonder how long check will take this time of day | 21:13 |
mordred | clarkb: that is also a good point - and it's going to be that way all week | 21:13 |
corvus | i think we can wait for 035 and test again. we probably won't release zuul until tues or wed at the earliest anyway. | 21:14 |
clarkb | ya that's why I've been spending time improving reliability of clouds and trying to increase capacity. We are right in the heart of the merge all things | 21:14 |
clarkb | corvus: ok | 21:14 |
corvus | (on the plus side, this should give us plenty of traffic to see if the zuul memory problem is fixed) | 21:15 |
clarkb | indeed | 21:16 |
pabelanger | http://paste.openstack.org/show/726483/ | 22:14 |
pabelanger | we are getting that from latest zuul-executor for some reason | 22:14 |
pabelanger | guessing something broke in cherrypy | 22:14 |
pabelanger | CherryPy==17.0.0 seems to be installed | 22:16 |
mordred | pabelanger: sounds like a problem with six version | 22:17 |
pabelanger | six==1.10.0 | 22:17 |
pabelanger | checking docs | 22:19 |
pabelanger | heh | 22:20 |
pabelanger | Pull request #203: Add parse_http_list and parse_keqv_list to moved | 22:20 |
pabelanger | urllib.request. | 22:20 |
pabelanger | that is 1.10.0 | 22:20 |
pabelanger | https://github.com/benjaminp/six/blob/master/CHANGES | 22:20 |
pabelanger | checking ze01 to see what is there | 22:21 |
openstackgerrit | Merged openstack-infra/nodepool master: Add upgrade release note about aborted node status https://review.openstack.org/585035 | 22:25 |
*** rlandy has quit IRC | 22:28 | |
pabelanger | mordred: for some reason, we didn't install six=1.11.0 on ze11.o.o, but it seems to be in install_requires: https://github.com/cherrypy/cherrypy/blob/master/setup.py | 22:31 |
pabelanger | manually doing pip install -U six fixed it | 22:31 |
mordred | yah- pip won't upgrade a package or do an internal dag | 22:32 |
mordred | as part of a single install | 22:33 |
pabelanger | yah, xenial is 1.10.0, how can we fix that with puppet-zuul? | 22:33 |
mordred | wait - is six getting installed by apt? | 22:33 |
pabelanger | yah | 22:33 |
mordred | ugh | 22:33 |
mordred | we REALLY need to stop installing python things via apt | 22:33 |
mordred | it breaks literally everything every time | 22:33 |
pabelanger | it looks like a dependency of ubuntu-server | 22:34 |
mordred | sigh | 22:34 |
mordred | of course it is | 22:34 |
mordred | oh - I think because of cloud-init? | 22:34 |
mordred | pabelanger: did we not uninstall cloud-init? | 22:34 |
pabelanger | yes, no longer on server | 22:34 |
mordred | hrm | 22:35 |
mordred | well - containers solve this ... but lemme stare at puppet for a sec | 22:35 |
pabelanger | mordred: would virtualenv help here? | 22:35 |
mordred | pabelanger: I think virtualenv would be too much engineering that we'd be throwing away anyway | 22:36 |
pabelanger | mordred: agree, I'd rather push on containers too | 22:36 |
mordred | pabelanger: we've got the containers building now - so it's about getting them published then getting ansible written | 22:36 |
mordred | pabelanger: but - in case we need to spin up another ze before that's done ... | 22:36 |
pabelanger | yah, fixing in puppet would be great. | 22:37 |
*** dkranz has quit IRC | 22:49 | |
*** myoung|bbl is now known as myoung | 22:59 | |
*** threestrands has joined #zuul | 23:30 |