*** dmsimard has quit IRC | 00:18 | |
*** dmsimard has joined #zuul | 00:20 | |
mordred | jeblair: yes - that was my thinking - the string there is a thing for a log entry | 00:29 |
mordred | jeblair: however, I don't feel strongly one way or the other | 00:29 |
mordred | jeblair: also - yes, your playbook in the etherpad looks right to me | 00:35 |
*** xinliang_ has quit IRC | 01:13 | |
*** xinliang_ has joined #zuul | 01:24 | |
*** timrc has quit IRC | 02:33 | |
*** timrc has joined #zuul | 02:42 | |
xinliang_ | tristanC: I figured out the reason why my CI won't trigger jobs. I was confused by the two projects: openstack-dev/ci-sandbox and openstack-dev/sandbox | 02:54 |
xinliang_ | My CI is listening to the ci-sandbox project, but I added the recheck comment to the sandbox project, so it won't trigger the job ;-) | 02:55 |
xinliang_ | now it is working, thanks. | 02:55 |
tristanC | xinliang_: nice! good to hear :-) | 02:57 |
xinliang_ | :-) | 02:57 |
tristanC | xinliang_: you might want to setup the zuul status web page, it's quite handy to check what's going on. For example using http://git.openstack.org/cgit/openstack-infra/puppet-zuul/tree/templates/zuul.vhost.erb | 03:00 |
tristanC | and copying the zuul/etc/status/public_html over to /var/lib/zuul/www | 03:01 |
xinliang_ | tristanC: sounds great! will try to set it up | 03:02 |
xinliang_ | tristanC: The zuul web page is already set up by puppet. | 03:27 |
xinliang_ | tristanC: One other question: Do you know if there is any way to run CI jobs on bare metal machines? | 03:29 |
xinliang_ | currently, all the openstack CI jobs are running on VMs, right? | 03:30 |
tristanC | xinliang_: with jenkins, you can set up a jenkins slave on bare metal, and if it has the right node label, then it will pick up zuul jobs | 03:33 |
tristanC | xinliang_: with zuul-launcher, you can set up custom nodes in zuul.conf so that jobs get scheduled directly to a specific node | 03:34 |
tristanC | xinliang_: and yes, currently all the openstack CI jobs run on ephemeral VMs, so that the environment is identical between runs | 03:35 |
xinliang_ | tristanC: thanks, I heard that openstack CI has migrated away from jenkins, so jobs are only managed by zuul, right? | 03:41 |
tristanC | xinliang_: yes, zuul2.5 has a zuul-launcher service that can execute jenkins-job-builder jobs in place of jenkins | 03:42 |
tristanC | xinliang_: and the next version (zuul3) will replace jjb definitions with ansible playbooks | 03:43 |
xinliang_ | that sounds good, which makes the CI system simpler | 03:44 |
xinliang_ | And if we run jobs on bare metal machines, are there any upstream tools for managing the metal machines? | 03:44 |
tristanC | not afaik, you'll have to build something custom with ironic for example | 03:46 |
xinliang_ | ok | 03:47 |
tristanC | at least you can prevent a slave from running more than one job using the OFFLINE_NODE_WHEN_COMPLETE job parameter | 03:47 |
clarkb | nodepool works with ironic and nova | 03:48 |
clarkb | people are using it that way for some ci | 03:48 |
xinliang_ | clarkb: great! good to hear this. | 03:49 |
tristanC | clarkb: oh, I didn't know that, good to know! | 03:50 |
*** lennyb has quit IRC | 05:09 | |
*** lennyb has joined #zuul | 05:21 | |
*** isaacb has joined #zuul | 05:51 | |
*** isaacb has quit IRC | 06:31 | |
*** isaacb has joined #zuul | 08:37 | |
*** hashar has joined #zuul | 08:47 | |
*** jkilpatr has quit IRC | 10:00 | |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Common threading layer https://review.openstack.org/478466 | 10:34 |
*** jkilpatr has joined #zuul | 10:58 | |
*** jkilpatr has quit IRC | 11:41 | |
*** dkranz has quit IRC | 11:52 | |
*** jkilpatr has joined #zuul | 12:01 | |
*** jkilpatr has quit IRC | 12:11 | |
*** jkilpatr has joined #zuul | 12:12 | |
*** jkilpatr has quit IRC | 12:28 | |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Common threading layer https://review.openstack.org/478466 | 13:10 |
*** dkranz has joined #zuul | 13:27 | |
*** rcarrill1 is now known as rcarrillocruz | 14:44 | |
*** isaacb has quit IRC | 14:58 | |
*** jkilpatr has joined #zuul | 15:37 | |
jeblair | mordred: i think we're going to run into tobiash's role-repo problem soon. i think we hadn't noticed it because of our openstack-zuul-jobs/openstack-zuul-roles split, but as we move roles into zuul-jobs and project-config, we're going to need to name those same repos as role repos. | 15:56 |
mordred | jeblair: yay! | 16:04 |
jeblair | mordred: so i was thinking i might take a stab at implementing that rather than writing it up as a story :) | 16:04 |
jeblair | mordred: can you take a look at 478313 and 478315 ? | 16:05 |
jeblair | i'd like to merge those and 478311 and restart the executor and see if we can't get log publishing going | 16:05 |
mordred | jeblair: ++ | 16:08 |
mordred | jeblair: for 478313 - as a temp workaround while you fix the tobiash bug, we could also just add openstack-infra/zuul-jobs to the roles list of the base job | 16:09 |
jeblair | mordred: yeah i think we'll need to do that | 16:09 |
mordred | jeblair: I can do that as a followup patch | 16:10 |
jeblair | mordred: cool | 16:11 |
mordred | jeblair: oh - I just realized something - hosts: all roles: collect-logs - is the collect-logs role handling overlapping log file names? | 16:11 |
jeblair | mordred: it will handle them by clobbering them right now. :) | 16:11 |
mordred | :) | 16:11 |
* mordred will file story for that | 16:11 | |
tobiash | mordred: you might want to encode inventory_hostname into the destination dir | 16:12 |
jeblair | mordred: so that's an interesting thing -- do we want the collect-logs to ... what tobiash said :) | 16:12 |
jeblair | mordred, tobiash: maybe yes, if there's more than one host, but no if there's only one host | 16:12 |
jeblair | (that way you don't have to navigate to logs/ubuntu-xenial/subunit.html for a unit test) | 16:13 |
tobiash | jeblair: that should be possible with ansible | 16:14 |
mordred | https://storyboard.openstack.org/#!/story/2001092 | 16:14 |
mordred | funny - I wrote all of those things into the story :) | 16:14 |
jeblair | it's like we're all on the same page :) | 16:15 |
mordred | there's probably a third case - which is "what if a job wants to do pre-rationalization" - like I could imagine devstack maybe wanting the logs from the 'controller' node to show up in the base logs dir as "the logs" - but still have subdirs for each additional node | 16:16 |
mordred | which I dont think the base job needs to know about | 16:16 |
mordred | but maybe there should be a way for a job to do some log collection and then say "I've already done this work, please do not collect logs for me" | 16:16 |
mordred | I added a note about that to the story | 16:18 |
jeblair | mordred, tobiash: or... perhaps we leave the role as is -- it copies every logs/ dir from every host, which is fine for the single-host case. and if you write a multi-node job, well, it's not going to do anything by default. you just expect to pre-rationalize them. ie, devstack may pull all of its logs into the logs/ dir on the controller, in whatever layout it wants. | 16:18 |
tobiash | jeblair: would also be an option | 16:19 |
jeblair | (i mean, a job has to *put* something into the logs/ dir on the node -- so just say that the job needs to be responsible for understanding what it's putting there) | 16:19 |
mordred | jeblair: I like that for an answer to the follow up about "what to do about advanced jobs like devstack" ... I kind of like the system automatically handling multi-node jobs with host directory logs | 16:21 |
mordred | since it's kind of a neat way to support multi-node - and if the rationalize-step is just "is there content in only one hostname dir in the local logs dir, if so collapse" - then devstack can simply move all of the logs to the 'controller' node | 16:22 |
jeblair | mordred: yeah, if we implement it that way, we may have the best of both worlds. if you accidentally or on purpose have content in logs/ on more than one host, we rationalize it for you. if you know what you're doing, you don't end up in that position and we do nothing. | 16:23 |
mordred | ++ | 16:24 |
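A minimal Ansible sketch of the per-host destination idea tobiash raised above (illustrative only, not the actual collect-logs role; the zuul.executor.log_root variable is an assumption):

```yaml
# Illustrative only: pull each node's logs/ dir into a per-host
# subdirectory on the executor so multi-node jobs don't clobber
# each other's files.
- hosts: all
  tasks:
    - name: Copy logs from each node into a per-host directory
      synchronize:
        mode: pull
        src: "{{ ansible_user_dir }}/logs/"
        dest: "{{ zuul.executor.log_root }}/{{ inventory_hostname }}/"
```

The single-host "collapse" behaviour jeblair and mordred describe would be a post-processing step on top of something like this.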
* tobiash is heading home now | 16:24 | |
*** bhavik1 has joined #zuul | 16:30 | |
*** bhavik1 has quit IRC | 16:32 | |
dmsimard | The zuul executor nodes, are they localized to each region or is it a global zuul cluster somewhere ? | 16:34 |
dmsimard | https://github.com/openstack-infra/system-config/blob/master/hiera/common.yaml suggests it's a common cluster somewhere | 16:35 |
mordred | dmsimard: common cluster | 16:36 |
mordred | dmsimard: the communication between scheduler and executors is over gear - communication between regions from executors to nodes is over ssh - which is better suited to WAN traffic | 16:37 |
dmsimard | mordred: fair, I guess the hidden question behind that was did you ever consider localizing more parts of the infrastructure (such as logs) -- cause that common cluster (for example) could end up pulling logs from an EMEA region to USA | 16:38 |
dmsimard | I'm not really concerned about things like data privacy laws, but more around latency and throughput | 16:38 |
dmsimard | I guess decentralizing the infrastructure would have a cost on complexity/management | 16:39 |
dmsimard | (not just in software but in human resources) | 16:39 |
mordred | dmsimard: it has come up in conversation a few times (for openstack, we do, in fact, pull logs from EMEA to USA) - but doing so would require teaching zuul more about clouds and regions, and at the moment that's not information zuul takes into account when scheduling | 16:40 |
dmsimard | yeah, that logic lives in nodepool. | 16:40 |
mordred | dmsimard: since it's been fine so far at infra-scale, the idea hasn't really bubbled up to the point of being a problem that needs solving | 16:40 |
dmsimard | alright, thanks :) | 16:40 |
clarkb | fwiw our bandwidth between emea and north america is really great | 16:41 |
clarkb | so as long as you aren't sensitive to latency you tend to not have problems | 16:41 |
clarkb | our clouds in europe seem to have really solid network setups (ovh and citycloud) | 16:42 |
dmsimard | yeah, it's also okay not to prematurely optimize things until there's a real problem | 16:42 |
dmsimard | if it works at infra scale it's good enough for me :) | 16:43 |
mordred | dmsimard: :) | 16:43 |
mordred | dmsimard: it's nice that we have a stupidly big version for data, isn't it? :) | 16:43 |
clarkb | (afs is sensitive to latency so I've got a couple ideas to test in order to make that better, but currently focused on gerrit upgrade items) | 16:43 |
dmsimard | mordred: the 14x1TB cinder volume over LVM is still a bit mind boggling :) | 16:44 |
clarkb | dmsimard: and its too small! | 16:44 |
clarkb | :) | 16:44 |
dmsimard | clarkb: I know right, part of why decentralizing that would perhaps ease the burden :p | 16:45 |
dmsimard | but fungi said it's a WIP to move to a bigger node/provider (at which point does it make sense to scale horizontally instead?) | 16:45 |
clarkb | potentially. We want to switch providers to avoid volume maintenance taking out the entire fs as first priority | 16:46 |
clarkb | but sounds like there may be some room for scaling up too | 16:46 |
clarkb | the problem with 14 1TB volumes under a single fs is that anytime that cloud does maintenance on their volumes at least one of our volumes seems to be affected | 16:47 |
clarkb | s/the/a/ | 16:47 |
dmsimard | yeah that's awful | 16:47 |
*** hashar has quit IRC | 16:48 | |
fungi | it's an outage magnet, more or less | 16:49 |
fungi | coupled with needing a day or more outage to our ci system to perform a full fsck when that happens | 16:50 |
fungi | so, so, so many tiny files | 16:50 |
*** jkilpatr has quit IRC | 16:50 | |
mordred | dmsimard: one of the biggest issues with de-centralized log storage is that we'd lose a global index for the logs - knowing the change id would no longer be enough to find the build logs - you'd also have to know what cloud it ran on | 16:51 |
fungi | decentralized log storage seems fine to me as long as there's a centralized index | 16:51 |
mordred | dmsimard: this, of course, is solvable, as all things are - mostly just pointing out one of the issues that can make it slightly more complex to deal with | 16:51 |
mordred | fungi: yup | 16:51 |
jeblair | someone said something about a zuul dashboard... :) | 16:51 |
fungi | if only the thing deciding where the logs go knew where the logs went... | 16:52 |
clarkb | just tested to reassure myself and we get about 10MBps from gra1 to dfw on a 57MB file | 16:52 |
jeblair | speaking of logs, i'm going to try restarting the executor now | 16:52 |
clarkb | which isn't bad for crossing an ocean. Better than my home connection | 16:52 |
jeblair | and i will restart the scheduler | 16:57 |
rcarrillocruz | hey folks, noob question: I have a nodepool node stuck in a locked state, can't delete it. I assume this can be unlocked with some zookeeper command? | 17:04 |
Shrews | rcarrillocruz: weird. what state does ZK report the node in? | 17:05 |
Shrews | READY, BUILDING, etc | 17:05 |
rcarrillocruz | deleting | 17:05 |
Shrews | rcarrillocruz: hmm, is the cleanup thread running? | 17:06 |
rcarrillocruz | should be, cos I can delete other nodes just fine | 17:06 |
rcarrillocruz | but can't this one | 17:06 |
jlk | o/ | 17:07 |
Shrews | would be good if we could figure out how it got into this state | 17:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Create a new logger for gerrit IO https://review.openstack.org/478566 | 17:08 |
rcarrillocruz | Shrews: can't really tell how i got there, any hint how to unlock the node to delete it tho? | 17:09 |
Shrews | rcarrillocruz: do you have zk-shell? | 17:10 |
rcarrillocruz | let me see | 17:10 |
rcarrillocruz | nope, is that on pypi? | 17:10 |
Shrews | https://github.com/rgs1/zk_shell | 17:10 |
rcarrillocruz | yah, pip installing it | 17:10 |
jeblair | mordred: speedy +3 on 478570 pls? | 17:11 |
rcarrillocruz | lol, it installs twitter libs ? | 17:11 |
rcarrillocruz | anyway, i'm on zk-shell *shell* now Shrews | 17:11 |
rcarrillocruz | oh neat | 17:12 |
rcarrillocruz | so | 17:12 |
rcarrillocruz | zookeeper has a tree structure | 17:12 |
Shrews | rcarrillocruz: using that, just rm the znode | 17:12 |
jeblair | wait | 17:12 |
rcarrillocruz | ls nodepool/nodes | 17:12 |
jeblair | do not rm the node | 17:13 |
jeblair | give me a second to find some debug commands | 17:13 |
* rcarrillocruz waits | 17:14 | |
mordred | jeblair: done | 17:14 |
jeblair | (it's never okay for a nodepool admin to have to use zk-shell, so we need to debug any time it happens) | 17:14 |
jeblair | rcarrillocruz: run 'dump' and paste the output please | 17:14 |
jeblair | you can actually run that without zk-shell | 17:16 |
jeblair | echo dump|nc localhost 2181 | 17:17 |
jeblair | will work | 17:17 |
jeblair | https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#The+Four+Letter+Words | 17:17 |
jeblair | (but zk-shell also supports it) | 17:17 |
rcarrillocruz | there you go | 17:17 |
rcarrillocruz | https://paste.fedoraproject.org/paste/FmmFyItw~8KpVWFZIVz0wA | 17:17 |
jeblair | so it is the poolworker that holds the lock | 17:19 |
jeblair | rcarrillocruz: are you running with debug level logging? | 17:20 |
rcarrillocruz | i'm not running nodepool in the foreground with debug level set, if that's what you're asking; it was installed with an ansible role and started with systemd scripts. I can tail launcher-debug.log | 17:22 |
jeblair | rcarrillocruz: no, i just meant debug logs -- if "DEBUG" entries show up in launcher-debug.log, that's what i'm looking for | 17:23 |
rcarrillocruz | yeah, they show up | 17:23 |
jeblair | rcarrillocruz: can you grep "00166" in launcher-debug.log? it will also have a server UUID on the "Waiting for server" line. also grep for that UUID and pastebin both please? | 17:23 |
rcarrillocruz | http://paste.openstack.org/show/613957/ | 17:25 |
rcarrillocruz | that's the last hit | 17:25 |
rcarrillocruz | otoh | 17:25 |
rcarrillocruz | i spotted an error | 17:25 |
rcarrillocruz | sec, i'll paste | 17:25 |
rcarrillocruz | http://paste.openstack.org/show/613958/ | 17:26 |
jeblair | rcarrillocruz: perfect, thanks! | 17:26 |
rcarrillocruz | sorry for the rightmost dots, i'm on a tmux session, some other user must have a different window size | 17:27 |
*** Shuo has joined #zuul | 17:27 | |
jeblair | rcarrillocruz, Shrews: i wonder, since we have a stuck ephemeral node, if the best solution might be to restart the launcher rather than manually deleting the zk node? Shrews, what do you think? | 17:27 |
Shrews | that would work | 17:28 |
jeblair | rcarrillocruz: oh, that traceback is useful, though i'm not convinced that's the problem | 17:28 |
jeblair | i think it may be from a different thread | 17:28 |
jeblair | rcarrillocruz: were there any more interesting log entries around 2017-06-27 17:01:12,489 ? | 17:29 |
rcarrillocruz | let me look | 17:29 |
Shrews | rcarrillocruz: did you shutdown/restart or otherwise do something to zookeeper while nodepool was running? | 17:29 |
jeblair | (that was the "deleting ZK node" line) | 17:29 |
*** bhavik1 has joined #zuul | 17:35 | |
*** bhavik1 has quit IRC | 17:42 | |
Shuo | What's zuul v3's dependency on nodepool? Can a zuul environment work without it? | 17:43 |
jlk | Zuul relies on nodepool to provide nodes on which to execute the tests | 17:43 |
jlk | s/tests/jobs/ | 17:44 |
SpamapS | Shuo: I'd say that zuulv3 needs _something_ to be on the other side of ZK. Nodepool is currently the only thing that implements that, so it's an effective dependency, but you could, in theory, write something else to do nodepool's job. | 17:44 |
jlk | Without nodepool, there are very few things Zuul can do | 17:44 |
SpamapS | I've oft wondered whether, when we get to a place of stability in that interface, we could write smaller shims for clouds that have their own nodepool-ish things. Like AWS autoscale groups, for instance. | 17:45 |
SpamapS | sans nodepool, you can only do noop jobs.. I think? | 17:46 |
mordred | SpamapS: yah - although those are also likely good backend modules for nodepool itself | 17:48 |
mordred | a heat driver got brought up the other day, for instance - but there's some design questions we'll need to think about in such cases | 17:49 |
tobiash | I think it could be possible to run jobs without a nodeset locally on the executor | 17:51 |
jeblair | yeah, there's enough overhead to the protocol and resource management that i think it's going to be useful to think of most species of that problem as another nodepool driver (possibly even an out-of-tree driver in the future). but, yeah, if we run into a problem from a different genus, that might be a good option for just writing to the zk api. | 17:51 |
jeblair | tobiash: correct; a job with no nodes can still do things on the executor. that could come in handy for things like "trigger this remote webhook". | 17:52 |
Shuo | Spamaps: I wonder if something like Kubernetes/Mesos could fill nodepool's role | 17:52 |
tobiash | that would restrict the jobs to trusted jobs (not sure it makes any sense for a nodepool-less deployment) | 17:52 |
jeblair | tobiash: untrusted jobs run on the executor too and can still be useful. like that webhook example. it doesn't need to be trusted. | 17:53 |
Shuo | I am asking this because we are not an OpenStack shop, nor a VM shop (in our data center) | 17:53 |
jlk | SpamapS: you can't even do a "noop" job without a nodepool. That doesn't work outside of the test runner | 17:53 |
mordred | Shuo: writing a k8s/mesos driver for nodepool is definitely a thing that we should do | 17:54 |
tobiash | jeblair: ah, right, missed things like that | 17:54 |
jeblair | Shuo: yeah, kubernetes, etc, support in nodepool is on the roadmap, but we don't know what it will look like yet. mesos may be easier. :) | 17:55 |
jlk | I think there's some code needed to allow zuul to run a job without a node | 17:55 |
Shuo | jlk: if my plan is to avoid a dependency on nodepool (and, say, I can bear with static resource capacity at the moment), can I have some kind of setup? | 17:55 |
jeblair | jlk: i think it should work -- that's actually the code path for about 85% of the tests :) | 17:55 |
jeblair | Shuo: we plan on using nodepool to manage static resources as well. there are actually patches in progress for that. | 17:56 |
jlk | jeblair: oh? I'd like to know more! I thought it was special cased for the test runner. | 17:56 |
mordred | Shuo: static resource capacity is another thing that should come from nodepool - there isn't really any intent to modify zuul to not need nodepool - however, tristanC is working on the patches to allow nodepool to have statically defined resources | 17:56 |
mordred | Shuo: as that's a thing many people need | 17:56 |
Shuo | jeblair: that would be great because my current thought is to make a simple cluster to make the case for zuul. | 17:57 |
mordred | Shuo: ++ | 17:57 |
jeblair | jlk: i *think* it's just an empty node request. it might still go out to nodepool to be satisfied (which is silly). let's test it later after i get logs working. :) | 17:57 |
jlk | fair enough | 17:57 |
SpamapS | jlk: oh, fake nodepool to the rescue I guess ;-) | 18:00 |
SpamapS | funny story... | 18:00 |
SpamapS | I think if you had nodeset-less jobs.. | 18:00 |
SpamapS | you could just scale out executors | 18:00 |
SpamapS | with "whatever method you like" | 18:00 |
Shuo | jeblair: "let's test it later after i get logs working. :)" sound a very quick ETA? My time horizon is around 5 weeks :-) | 18:00 |
SpamapS | and if you don't want to let job definers define their host OS... that would be a viable Zuulv3 use case. | 18:00 |
Shuo | mordred: glad this resonates :-) | 18:01 |
SpamapS | like if all you want to let them do is run stuff that will also be on the executor.. I see no problem with that. | 18:01 |
SpamapS | (it's not something I want... but.. I can see a case for it) | 18:01 |
jlk | I could see a case for it | 18:02 |
jlk | zuul as a relay agent to other systems | 18:02 |
jlk | handle ingest, gate keeping, and reporting, but farm "work" out to somewhere else. | 18:02 |
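A hypothetical sketch of the node-less job being discussed, written in the syntax later Zuul v3 releases settled on (the exact schema at this point in development may have differed; the job and playbook names are made up):

```yaml
# Hypothetical job with an empty nodeset: its run playbook executes on
# the executor itself (e.g. to trigger a remote webhook).
- job:
    name: notify-webhook
    nodeset:
      nodes: []
    run: playbooks/notify-webhook.yaml
```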
Shuo | jeblair: "mesos may be easier. :)" if that's something I can manage to contribute to (bear in mind that I don't know any zuul internals), I'd love to take someone to my plate. | 18:03 |
jlk | Shuo: that'd be more of a nodepool contribution | 18:03 |
Shrews | tfw you spend an hour debugging a client/server interaction to realize a lack of \n caused all of your problems | 18:03 |
Shuo | jlk: do we have an up-to-date nodepool architecture diagram? | 18:04 |
jlk | Shuo: I do not know the answer to that | 18:04 |
Shuo | jeblair: do we have an up-to-date zuul (working with nodepool) architecture diagram? | 18:04 |
jeblair | Shuo: nope | 18:05 |
Shrews | well, we have this (https://github.com/openstack-infra/nodepool/blob/feature/zuulv3/doc/source/devguide.rst) though we've let that get slightly out of date. but it is close | 18:05 |
Shrews | and only covers the builder | 18:05 |
Shuo | Shrews: thanks. | 18:06 |
jeblair | Shuo: we're just now working out what the internal nodepool architecture is for multiple drivers | 18:06 |
Shrews | oh, yeah, def no diagrams for the drivers | 18:06 |
Shuo | jeblair: not meant to interrupt or demand any commitment, but if documentation is something that I can help on, I'd love to help, as I hope to see zuul's internals well discussed. | 18:07 |
Shuo | Shrews: thanks for sharing. btw, are you working on nodepool specifically? | 18:08 |
jeblair | Shuo: you may be interested in taking a look at 463328 and its 12 child changes | 18:08 |
jeblair | Shuo: https://review.openstack.org/463328 | 18:08 |
Shrews | Shuo: i rewrote most parts of nodepool to implement jeblair's v3 specs. not actively working on any changes to it right now. | 18:09 |
*** jkilpatr has joined #zuul | 18:09 | |
* SpamapS now wonders about something like a set of ansible modules that manage mesos things that would just run on the executor. | 18:10 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Make sure playbook and role repos are updated https://review.openstack.org/477627 | 18:10 |
mordred | SpamapS: yah - passing in mesos/k8s endpoint information as part of the inventory is definitely a thing that has come up in the discussions we've had so far when we've looked into container workloads intended for such systems - there are a few fun and complex issues that come up once the whiteboard comes out, so it's going to be a fun conversation when we all have the brainspace to tackle it | 18:13 |
jeblair | SpamapS: it would work, but i don't think it's a good fit for most kinds of jobs. we gave up on that architecture with jenkins 6 years ago when openstack was 1/200th the size it is now. managing resources separately from job execution makes a lot of things better at all scales. | 18:14 |
Shuo | Shrews: reading the pointer you shared now. maybe a dumb question, can you give an example of "BuildWorker" (is that a single machine or is that an endpoint of a driver such as mesos master)? | 18:15 |
jeblair | mordred, SpamapS: we need to land https://review.openstack.org/477627 and restart the executor again. we're running into the problem that fixes. | 18:16 |
mordred | jeblair: +2 from me - SpamapS you got a sec to look at it? | 18:16 |
* SpamapS looks | 18:19 | |
Shrews | Shuo: that diagram is not useful to you in relation to drivers. It's for the nodepool-builder (nodepool/builder.py). You would be interested in nodepool-launcher (nodepool/launcher.py). | 18:19 |
SpamapS | so many updates | 18:20 |
SpamapS | mordred: I feel like maybe mesos and k8s include better resource management themselves than jenkins + openstack. But I could be wrong. | 18:21 |
* SpamapS hears Dak ... STAY ON TARGET... | 18:21 | |
SpamapS | I +3'd 477627, but do we need to kick it to get it into the gate? | 18:22 |
SpamapS | ah no, it's just rechecking | 18:22 |
mordred | SpamapS: I'm not necessarily agreeing or disagreeing with that - I'm saying there's a bunch of other bits that get tricky, and it'll be a fun problem to work on in a few months because I think solving it will be valuable to many people | 18:23 |
SpamapS | mordred: Right.. first let's get these torpedos into the exhaust port. | 18:29 |
mordred | SpamapS: ++ | 18:31 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Make sure playbook and role repos are updated https://review.openstack.org/477627 | 18:32 |
Shuo | jeblair: "https://review.openstack.org/463328" looks like a great start point for me. How do I start picking it up? | 18:35 |
jeblair | Shuo: i'm not sure i understand the question | 18:36 |
Shuo | jeblair: https://review.openstack.org/463328 is an active CL under review, how should I jump in and start working on it? | 18:38 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul feature/zuulv3: Fix and test report urls for unknown failures https://review.openstack.org/469214 | 18:46 |
jlk | oh crap, I think I need to change my docker file for python3 needs now | 18:46 |
jeblair | Shuo: feel free to log into gerrit and submit comments; there are 12 patches after that one too | 18:48 |
Shuo | jeblair: thanks. leave for lunch for now, will be back and try to catch up in the afternoon. by now... | 18:49 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: Add web-based console log streaming https://review.openstack.org/463353 | 18:53 |
*** Shuo has quit IRC | 19:06 | |
jeblair | mordred: i'm not seeing any output from the post playbook: http://paste.openstack.org/show/613976/ | 19:09 |
jeblair | mordred: i'm re-running with verbose | 19:09 |
mordred | jeblair: hrm. let me look in just a sec | 19:18 |
*** Guest28796 is now known as mgagne | 19:19 | |
*** mgagne has quit IRC | 19:19 | |
*** mgagne has joined #zuul | 19:19 | |
*** Shuo has joined #zuul | 20:20 | |
jeblair | i'm going to temporarily stop the nodepool launcher as a poor substitute for lacking autohold | 20:21 |
*** Shuo has quit IRC | 20:28 | |
*** Shuo has joined #zuul | 20:28 | |
mordred | jeblair: heya - sorry - I had a phone call with dkranz - any current debugging state I should start looking at? or just start from scratch? | 20:44 |
jeblair | mordred: nah, verbose didn't give any more output; i'm going to "hold" a node ^ so i can poke | 20:45 |
mordred | jeblair: kk | 20:45 |
jeblair | mordred: while i'm waiting -- i just realized that pabelanger implemented a log pull in the tox job -- so i think he was imagining that any jobs with interesting logs would do their own pulls back to the executor, whereas i imagined they would move things into ~/logs on the node. i'm not sure my idea is better. we may want to go with the pabelanger plan for a bit and see what that looks like. | 20:48 |
clarkb | jeblair: with proper secrets they don't have to use the executor as a staging area right? | 20:50 |
clarkb | seems like avoiding that if possible would be good for bw reasons | 20:50 |
jeblair | clarkb: technically... yeah, we could rsync directly from node to logserver, however, that means the node has some kind of credential access (not sure what ansible does there... agent forwarding?) after the node has run an untrusted job, so i get itchy about that. | 20:52 |
jeblair | clarkb: v2.5 does the staging thing and we haven't really noticed :) | 20:52 |
clarkb | that's true and we staged with jenkins for years too (that is how scp worked) | 20:53 |
jeblair | mordred: there's some output (like the "PLAY" and "TASK") lines going to the ZUUL_JOB_OUTPUT_FILE that isn't ending up in the zuul log because it's not going to stdout; i think that's normal -- but did we discuss maybe having everything go to stdout when we use -vvv? | 20:58 |
jeblair | in ansible, i want to have a "when:" conditional based on whether an attribute/key/whatever is defined. in other words, sometimes "zuul.newrev" will be defined, sometimes it won't. "when: zuul.newrev" throws an error when it doesn't exist... what's the right way to do that? | 21:03 |
mordred | jeblair: maybe? I mean, we can definitely do that | 21:03 |
jeblair | maybe jinja with a default value...? | 21:03 |
mordred | yah | 21:04 |
mordred | when: zuul.newrev | default(None) or something like that | 21:04 |
jeblair | [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: {{ zuul.newrev | default(None) }} | 21:05 |
jeblair | i think it *worked* since it's just a warning.... | 21:05 |
jeblair | oh it just doesn't want {{}} because it's already a jinja string | 21:07 |
jeblair | yeah, so just dropping that works | 21:07 |
Shuo | jeblair: can you add me to the reviewer's list? I don't seem to be able to publish my comment on https://review.openstack.org/#/c/463328/ for some reason (though I logged onto gerrit) | 21:07 |
jeblair | Shuo: that shouldn't be necessary -- what happens when you try to publish it? | 21:07 |
jeblair | mordred: apparently "zuul.newrev is defined" is also a thing | 21:08 |
mordred | jeblair: that seems clearer | 21:08 |
jeblair | mordred: ya, hard to argue with that :) | 21:08 |
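For reference, a sketch of the two working forms from this exchange (the task bodies are illustrative):

```yaml
# Form mordred suggested (note: no jinja {{ }} delimiters in "when:"):
- name: Act on the new revision if present
  debug:
    msg: "newrev is {{ zuul.newrev }}"
  when: zuul.newrev | default(None)

# Form jeblair found, which reads more clearly:
- name: Act on the new revision if present
  debug:
    msg: "newrev is {{ zuul.newrev }}"
  when: zuul.newrev is defined
```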
Shuo | jeblair: I have written a full comment, and can 'save' it (it's now a draft in my web view) but can't publish -- sorry, long time no use gerrit :-) | 21:09 |
*** dkranz has quit IRC | 21:10 | |
jeblair | Shuo: after you do that, there's a "Reply..." button at the top of the main change screen, that should publish your comments | 21:10 |
jeblair | mordred: i don't see a way to set the user with add_host, but we need to log in as 'jenkins' to static.openstack.org (which we add to the inventory with add_host) | 21:13 |
jeblair | mordred: oh, wait, maybe i just add an ansible_user variable and it does it...trying | 21:15 |
Shuo | jeblair: done | 21:15 |
jeblair | Shuo: thanks! | 21:15 |
mordred | jeblair: yes - should be ansible_ssh_user iirc | 21:15 |
mordred | nope. ansible_user - ansible_ssh_user is the old one | 21:16 |
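A sketch of the add_host approach being described (the hostname and user come from the conversation above; the surrounding play structure is illustrative):

```yaml
# Add the log server to the in-memory inventory; setting ansible_user
# makes later plays connect to it as the 'jenkins' user.
- hosts: localhost
  tasks:
    - name: Add the log server to the inventory
      add_host:
        name: static.openstack.org
        groups: logserver
        ansible_user: jenkins
```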
Shuo | also, sent an email to your @redhat.com, please take a look once you get a chance... | 21:16 |
Shuo | jeblair: also, sent an email to your @redhat.com, please take a look once you get a chance... | 21:16 |
jeblair | Shuo: will do | 21:18 |
Shuo | jeblair: thanks! | 21:18 |
jeblair | mordred: w00t! i got http://logs.openstack.org/76/478576/2/check/31feea4/logs/ to happen with a bunch of manual changes. :) patching now. | 21:23 |
jeblair | 21:26 < openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Fix base post playbook https://review.openstack.org/478639 | 21:28 |
jeblair | 21:28 < openstackgerrit> James E. Blair proposed openstack-infra/openstack-zuul-roles master: Fix several issues with upload-logs role https://review.openstack.org/478646 | 21:28 |
jeblair | mordred: ^ can you +3 those? | 21:29 |
Shuo | I'd like to share this with the zuul community (see the 2nd reply to the first comment); I really feel zuul has huge potential to grow | 21:36 |
Shuo | https://lwn.net/Articles/702177/ | 21:36 |
mordred | jeblair: on it - also - did you have to patch zuul itself? or just the playbooks? | 21:51 |
mordred | jeblair: and if it was just playbooks - do we need to address debuggability? or are these just hard because they are dealing with the log publication system, so chicken-and-egg? | 21:51 |
jeblair | mordred: i think both. i think sending all output to stdout on -vvv will help. as will the cmdline zuul-executor idea. | 21:53 |
jeblair | and, of course, autohold | 21:53 |
jeblair | i think with those 3 things, this would be human-debuggable. | 21:54 |
mordred | ++ | 21:54 |
jlk | Interesting question time | 22:02 |
jlk | wait, no. n/m | 22:02 |
SpamapS | Shuo: interesting article | 22:04 |
* SpamapS is reading it | 22:04 | |
SpamapS | they completely missed that gerrit has keyboard shortcuts for one.. and git-review.. | 22:04 |
SpamapS | and "it's hard to do local testing" .. ???!!! | 22:04 |
SpamapS | the UI has the clone instructions right there (not to mention gertty just does it for you) | 22:04 |
SpamapS | weird | 22:04 |
jeblair | you can't argue with kernel developers | 22:05 |
jeblair | 2017-06-28 22:03:02,939 DEBUG zuul.AnsibleJob: [build: 4f9c2f7d45e9428ba1f09964d3cb9e33] Ansible output: b'fatal: [static.openstack.org]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \\"static.openstack.org\\". Make sure this host can be reached over ssh", "unreachable": true}' | 22:05 |
jeblair | okay that's new | 22:05 |
mordred | jeblair: that doesn't seem ideal | 22:06 |
SpamapS | Shuo: "if only zuul scaled well" .. but no supporting evidence. I'd say Zuul scales pretty darn well. | 22:07 |
mordred | SpamapS: someone said zuul doesn't scale? | 22:07 |
mordred | SpamapS: are they high? | 22:07 |
jeblair | mordred: apparently we did, in the openstack-infra channel | 22:07 |
mordred | we did? | 22:08 |
jeblair | so says a person on the internet | 22:08 |
mordred | wow | 22:08 |
mordred | I mean | 22:08 |
mordred | people on the internet are often wrong | 22:08 |
mordred | but that's about the wrongest I've heard anyone be in quite some time | 22:08 |
mordred | it may not (yet) scale DOWN as well as we'd like - but I think we prove just about every day that it scales UP | 22:09 |
jeblair | this is from last october, too, after zuul achieved its highest scaling point :) | 22:09 |
mordred | oh - that's from prometheanfire - that's really weird, he's much more clued in than that | 22:10 |
* mordred walks away from the lwn article | 22:10 | |
SpamapS | yeah | 22:11 |
SpamapS | lwn is like that | 22:11 |
jeblair | this is one of those "is everything i read in the newspaper about things that i'm not an expert in *this* inaccurate?" moments | 22:12 |
jeblair | mordred, SpamapS: could it be that we have to add static.o.o host keys to known_hosts? | 22:13 |
mordred | jeblair: it's worth trying an ssh as zuul to jenkins@static and see - are you on and can you try that? | 22:14 |
clarkb | jeblair: I'm pretty sure its all that inaccurate yes | 22:14 |
mordred | jeblair: there is definitely no entry in zuul user's known_hosts | 22:15 |
jlk | jeblair: so I was just introducing zuul to somebody, and they kind of stumbled over the project definition schema, and were a bit confused/irritated that the pipelines weren't in their own key, which would make it much clearer what is a pipeline, instead of just taking anything that isn't one of the reserved keywords in a project hash | 22:15 |
jlk | jeblair: they expected something more like https://review.openstack.org/#/c/463328/5/doc/source/project-config.rst | 22:16 |
jlk | er | 22:16 |
jlk | https://gist.github.com/caphrim007/40ec478f0ab2e233ea804f22b8531d99 | 22:16 |
mordred | wow, how fascinating. I never want additional deepness in my yaml :) | 22:17 |
mordred | jeblair: oh - duh. we create a known_hosts in the job dir - so we'd need to be able to pass in additional known_hosts | 22:19 |
mordred | jeblair: or maybe start with the zuul user's known hosts and append to it? | 22:19 |
jlk | mordred: yeah, he's totally unfamiliar with zuul so he thought "check" was a specific keyword for the project, not implicitly a pipeline name (because it didn't match any other keys in the schema) | 22:23 |
mordred | jlk: nod | 22:23 |
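To illustrate the point of confusion, roughly the two shapes being compared (project and job names are made up; the second, nested form is what the new user expected, not something Zuul supports):

```yaml
# Today: any key that is not a reserved keyword is read as a pipeline name.
- project:
    name: example/repo
    check:
      jobs:
        - example-unit-tests

# What the new user expected: pipelines grouped under an explicit key.
- project:
    name: example/repo
    pipelines:
      check:
        jobs:
          - example-unit-tests
```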
mordred | jlk: I was actually just chatting with someone earlier today and the "how does a new user discover what pipelines are available" question came up - which may or may not be related | 22:24 |
jlk | somewhat related yea | 22:24 |
jlk | comes down to "admin has to have documented them" | 22:24 |
jlk | until we do something like an API where you can ask zuul what pipelines exist | 22:25 |
mordred | this is the lovely area in between "the system is flexible and lets you define any pipeline name you want" and "as a noob user, what are the names and what do they mean" | 22:27 |
clarkb | ++ on not adding unnecessary depth to yaml. I think that happens a lot because people don't understand the datastructure itself | 22:27 |
clarkb | the datastructure is part of the semantics here and we should take advantage | 22:27 |
mordred | clarkb: sure - but also I think we should not ignore reported confusion from new users | 22:32 |
clarkb | definitely | 22:33 |
Shuo | mordred: to reply to "SpamapS: someone said zuul doesn't scale?", this is exactly why I feel it is worth speaking up about my thoughts here https://review.openstack.org/#/c/463328/ | 22:37 |
jeblair | jlk: yeah, i could probably be convinced we should move it down a level; we've had to do similar things elsewhere (job graph), this is probably about the last place we have something like that... | 22:37 |
Shuo | SpamapS: I was not saying zuul does not scale, and I think the person making such a comment probably did not spend time doing research, and that's why I felt something I commented on James's CL might add huge value to zuul https://review.openstack.org/#/c/463328/ | 22:40 |
mordred | Shuo: totally! fwiw I did not think you were saying that - and I agree | 22:43 |
SpamapS | Shuo: indeed we were just reacting to the comment violently. It is not directed at you, and I like where you're going. :) | 22:46 |
jeblair | mordred: yeah, if i re-run my command, i get asked to accept the key, and if i say no, i get the same error | 22:51 |
jeblair | mordred: so i think it's known_hosts | 22:51 |
mordred | jeblair: yah | 22:52 |
jeblair | mordred: i think i'll just append to it in the playbook for now (since that's where the host is being added, it makes sense to keep those things together) | 22:52 |
mordred | jeblair: kk. you have access to the host_key in the playbook? | 22:52 |
jeblair | mordred: well, it's got a bunch of hardcoded stuff in there right now: username and host, so i can just add this. | 22:53 |
mordred | jeblair: ++ | 22:53 |
jeblair | something something parameterize later | 22:53 |
jeblair | sweet there's a known_hosts module | 22:53 |
mordred | jeblair: ya- seems like either an option to seed the per-job known_hosts file with ~zuul/.ssh/known_hosts - or to pass in one or more to inject | 22:53 |
mordred | jeblair: \o/ | 22:53 |
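A minimal sketch of what using the known_hosts module here could look like (the key value is a placeholder; jeblair's actual change is 478672, linked below):

```yaml
# Placeholder key shown; by default the module writes to the connecting
# user's ~/.ssh/known_hosts.
- name: Add the log server's host key to known_hosts
  known_hosts:
    name: static.openstack.org
    key: "static.openstack.org ssh-rsa AAAA...placeholder..."
    state: present
```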
jlk | crap I think I hit some python3 issues in the github code | 23:00 |
jlk | TypeError: a bytes-like object is required, not 'str' | 23:02 |
clarkb | yup | 23:03 |
clarkb | you'll need to encode whatever is str to bytes | 23:03 |
jeblair | mordred: 23:07 < openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Add static.o.o key to known_hosts for log publisher https://review.openstack.org/478672 | 23:08 |
jlk | wtf, I can't even access the object | 23:09 |
jlk | (Pdb) p request.json | 23:10 |
jlk | *** TypeError: a bytes-like object is required, not 'str' | 23:10 |
mordred | jlk: request.data I believe is where the raw data is, if that's a requests object | 23:12 |
jlk | it's not, it's a webob | 23:12 |
jlk | <class 'webob.request.Request'> | 23:12 |
mordred | jlk: ah. well - request.body I think? | 23:13 |
jlk | (Pdb) request.body | 23:13 |
jlk | *** TypeError: a bytes-like object is required, not 'str' | 23:13 |
jlk | webob page claims to support python3. I'm skeptical | 23:14 |
mordred | jlk: where in this code is this hitting you - do you know? | 23:14 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Add ability to auto-generate simple one-line shell playbooks https://review.openstack.org/478675 | 23:14 |
jlk | it's in the githubconnection.py file, when it's trying to handle a webhook event | 23:14 |
jlk | it's where we try to get the content to be a json dict object | 23:15 |
mordred | json_body = request.json_body | 23:15 |
mordred | that bit? | 23:15 |
jlk | these are all internal methods of webob. | 23:15 |
jlk | yeah | 23:15 |
mordred | jlk: will it let you print request.headerlist? | 23:16 |
mordred | or request.headers | 23:16 |
jlk | yeah | 23:17 |
jlk | ugh, python3, how do I quickly show the keys of a thing... | 23:17 |
clarkb | jlk: foo.__dict__ ? | 23:17 |
mordred | jlk: list(foo.keys()) | 23:17 |
clarkb | oh actual keys of a dict | 23:17 |
mordred | jlk: specifically curious if the incoming payload from gh had a 'Content-type' header set | 23:18 |
jlk | 'CONTENT_TYPE': 'application/json' | 23:18 |
jlk | I'm sending it with curl | 23:18 |
mordred | or charset | 23:18 |
mordred | ah | 23:18 |
jlk | maybe I need to add a charset? | 23:19 |
clarkb | content-encoding is I think what you want | 23:19 |
jamielennox | jlk: this one? http://paste.openstack.org/show/613989/ | 23:19 |
jlk | jamielennox: the bottom one yes | 23:19 |
mordred | jlk: yah - my hunch is that you need to add a charset in your curl | 23:20 |
jamielennox | ok - but not caused by the same thing then? | 23:20 |
mordred | jlk: my reading of the webob docs is that it needs a charset set so that it knows what to do with encoding | 23:20 |
jamielennox | whatever is making repository == None | 23:20 |
jamielennox | webob should set a default charset? | 23:20 |
jlk | this I think is my first time running this all in python3, I may be up for a while :( | 23:20 |
jlk | github doesn't send that in the headers | 23:21 |
mordred | jlk: you should be able to pass "content-type: application/json; charset=utf8" and webob claims it'll do the right thing | 23:21 |
mordred | hrm | 23:21 |
jlk | http://paste.openstack.org/show/613990/ | 23:21 |
mordred | jlk: excellent | 23:22 |
jlk | yeah I can't select others. | 23:23 |
jlk | default_body_encoding is set to 'UTF-8' by default, it exists to allow users to get/set the Response object using .text, even if no charset has been set for the Content-Type. | 23:25 |
jlk | The default_content_type is used as the default for the Content-Type header that is returned on the response. It is text/html. | 23:25 |
jlk | oh that's content-type, not charset :/ | 23:25 |
mordred | yah - default body encoding seems like the right thing | 23:25 |
jamielennox | charset should be its own property on webob, you shouldn't need to manually do the headers | 23:26 |
jlk | the object knows it's application/json | 23:27 |
jlk | and it thinks the charset is 'UTF-8' | 23:27 |
jeblair | i think we should drop the --keep-jobdir argument to the executor and replace it with an IPC like verbose. so we'd run "zuul-executor keep" or something to enable it. | 23:45 |
jeblair | (it's difficult to start the executor with the init system and pass an argument like that) | 23:46 |
mordred | jeblair: ++ | 23:46 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Replace --keep-jobdirs with an IPC https://review.openstack.org/478682 | 23:51 |
jeblair | 2017-06-28 23:49:30.347020 | localhost | ERROR: Results: => {"changed": false, "failed": true, "msg": "Failed to write to file /root/.ssh/known_hosts: [Errno 2] No such file or directory: '/root/.ssh/tmpS8tEVy'"} | 23:51 |
jeblair | why's it trying to write there? | 23:51 |
jlk | I am so confused | 23:51 |