*** jamielennox|away is now known as jamielennox | 00:02 | |
jamielennox | is there a way i can get zuul to scan my project-config | 00:03 |
jamielennox | s/scan/validate | 00:03 |
SpamapS | mordred: I'm just poking at a thing that translates layout <-> layout and does not attempt to grok jjb | 00:13 |
SpamapS | My thinking is to keep it as simple and unambitious as possible. pipelines and projects to a central config is step 1, and then I want to try and have project job selection moved into git repos as .zuul.yaml | 00:14 |
SpamapS | jamielennox: v2 yes, v3, start it up. | 00:14 |
SpamapS | jamielennox: v3 has trouble with validation the way v2 did it so all we do is validate that the config can parse. | 00:15 |
jamielennox | SpamapS: yea, it's just leading to me debugging by push change to git, restart server, parse logs | 00:15 |
jamielennox | i'm not sure how exactly you could expose it with all the in-repo stuff, just asking | 00:15 |
SpamapS | jamielennox: I think we could definitely write a validator that loads config. | 00:48 |
SpamapS | jamielennox: it just hasn't been done. | 00:48 |
SpamapS | like, just read the zuul.conf and the full layout from every source, like you would at startup | 00:48 |
jamielennox | SpamapS: so realistically i don't even need that, just scan the file i give you and does it make sense | 00:49 |
jamielennox | oh, but you can't | 00:49 |
SpamapS | Oh yeah for that I think you can write a little cmdline entry point | 00:49 |
SpamapS | Well you can | 00:49 |
SpamapS | you can run it through the voluptuous schema checker | 00:49 |
jamielennox | yea, but it depends where things like pipelines and connections are defined | 00:49 |
jamielennox | because gerrit: is a key, and if that's defined elsewhere your schema will fail | 00:49 |
jamielennox | zuul (at least used to) validate reviews against its own project-config so once you have it working it's not so bad | 00:50 |
jamielennox | but if validation fails on startup it seems to take down a thread | 00:50 |
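For context, a rough sketch of the kind of standalone check SpamapS describes: load a YAML fragment and run it through a voluptuous schema. The schema and helper names here are illustrative assumptions, not Zuul's actual validator, and as jamielennox notes, keys that depend on connections defined elsewhere (e.g. `gerrit:`) can't be fully checked from a fragment alone.

```python
# Hedged sketch of a fragment validator using voluptuous; not Zuul's
# real schema, just an illustration of the approach discussed above.
import sys

import voluptuous as vs
import yaml

# Hypothetical, heavily simplified schema for a single job definition.
job_schema = vs.Schema({
    vs.Required('name'): str,
    'parent': str,
    'nodes': [{vs.Required('name'): str, vs.Required('label'): str}],
}, extra=vs.ALLOW_EXTRA)


def validate_fragment(path):
    """Parse a YAML file and schema-check any job entries in it."""
    with open(path) as f:
        data = yaml.safe_load(f)
    for item in data or []:
        if 'job' in item:
            try:
                job_schema(item['job'])
            except vs.Invalid as e:
                print("%s: %s" % (path, e))
                return 1
    return 0


if __name__ == '__main__':
    sys.exit(validate_fragment(sys.argv[1]))
```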
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Store cache expiry times per status object https://review.openstack.org/461330 | 00:57 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Use routes for URL handling in webapp https://review.openstack.org/461331 | 00:57 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul feature/zuulv3: Use tenant_name to look up project key https://review.openstack.org/461332 | 00:57 |
SpamapS | jamielennox: I think you could write a cli version that validates the whole config. Not so sure about fragments. | 00:59 |
*** jamielennox is now known as jamielennox|away | 01:00 | |
SpamapS | jamielennox|away: that said, for fragments... submit review, wait for angry report of failed config load? | 01:00 |
*** jamielennox|away is now known as jamielennox | 01:10 | |
jamielennox | SpamapS: yea, i think once you're up and running you can rely on the reports | 01:11 |
jamielennox | just for now the error is happening during startup and causing problems | 01:11 |
SpamapS | on startup you can do the whole-config thing | 01:12 |
jamielennox | SpamapS: is there anything about bubblewrap that means the executor should run as root | 01:13 |
jamielennox | i caught something about it on IRC the other day but don't remember it exactly | 01:13 |
jamielennox | we're currently running executor as a zuul user and i have to change the finger port to make that happen, which is fine, i don't care about users being able to finger the executors directly | 01:14 |
jamielennox | but bubblewrap is the other thing i can think of that would be affected by this | 01:15 |
mordred | SpamapS: oh! right - the layout conversion. nod | 01:16 |
SpamapS | jamielennox: no, it should install setuid by default | 02:22 |
SpamapS | jamielennox: OR if you have a very new kernel, it can do its thing with USER_NS stuff without setuid | 02:22 |
jamielennox | SpamapS: so you're saying i have to run executor as root? | 02:43 |
tristanC | jamielennox: bwrap should be root setuid so that it can be used by a regular user | 03:25 |
*** dkranz has quit IRC | 03:26 | |
*** dkranz has joined #zuul | 03:31 | |
*** hashar has joined #zuul | 09:09 | |
*** hashar has quit IRC | 09:23 | |
*** hashar has joined #zuul | 09:40 | |
*** hashar has quit IRC | 10:18 | |
*** jkilpatr has joined #zuul | 10:59 | |
Shrews | tristanC: left you a comment on https://review.openstack.org/472128 along with the +2. was there a reason to avoid the 2 extra characters for 'list_address' ? | 12:37 |
Shrews | vs listen_address | 12:38 |
tristanC | Shrews: not at all, it's a mistake, should be listen_address | 12:40 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Add webapp port and listen_address configuration https://review.openstack.org/472128 | 12:41 |
Shrews | tfw the code you had semi-working yesterday no longer works at all today | 13:08 |
pabelanger | Shrews: tristanC: we should also make sure to update good.yaml fixtures too for 472128 | 13:24 |
tristanC | pabelanger: done | 13:56 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool feature/zuulv3: Add webapp port and listen_address configuration https://review.openstack.org/472128 | 13:56 |
pabelanger | tristanC: thanks | 13:56 |
tristanC | you're welcome, thanks for the review! | 14:25 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: WIP: Add reporter for Federated Message Bus (fedmsg) https://review.openstack.org/426861 | 15:00 |
Shrews | So, something has merged within the last few days to affect ansible jobs in unit tests | 15:07 |
Shrews | because what used to work, now hangs here: http://paste.openstack.org/show/612143/ | 15:08 |
pabelanger | doesn't look to be using bubblewrap | 15:10 |
Shrews | hrm. lemme rebuild the env. forgot to do that when i rebased | 15:12 |
Shrews | pabelanger: you run fedora, yes? what's the package for bubblewrap? | 15:21 |
pabelanger | Shrews: should be bubblewrap for package name | 15:21 |
pabelanger | there is also bwrap-oci, but haven't tried that | 15:22 |
Shrews | hrm, bubblewrap already installed, it seems | 15:23 |
Shrews | *sigh* | 15:23 |
Shrews | ok, well the pause was from my wait_for in my playbook. but now i'm not getting build.jobdir populated | 15:32 |
* Shrews takes a long lunch break to blow off frustration. bbl | 15:32 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: WIP: Add reporter for Federated Message Bus (fedmsg) https://review.openstack.org/426861 | 15:58 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul-jobs master: Add Sphinx module for Zuul jobs https://review.openstack.org/472743 | 16:04 |
jeblair | SpamapS, mordred: can you take a look at that change and its parents? ^ i'm working on establishing a documentation framework for the zuul stdlib. obviously that should be spun out into a new sphinx module, but incubating it there for the moment. | 16:09 |
SpamapS | mmmmmmmmmmmmmm doc framework | 16:19 |
mordred | jeblair, tristanC, pabelanger, jlk, SpamapS, Shrews: ok. I have sent email to the list on the REST question - sorry it took so long, there was a bunch of research needed doing | 16:22 |
mordred | clarkb, jamielennox: ^^ you too - sorry for lack of ping | 16:28 |
SpamapS | mordred: that's good readin | 16:47 |
pabelanger | jeblair: I have restarted ze01.o.o. We look to be running now | 16:48 |
jeblair | neat! | 16:49 |
jeblair | pabelanger: how about we stop zuulv3-dev now | 16:49 |
pabelanger | jeblair: yes | 16:49 |
jeblair | i'll do that | 16:49 |
pabelanger | http://zuulv3.openstack.org/ | 16:49 |
pabelanger | should also properly display things now | 16:49 |
jeblair | zuulv3-dev is stopped | 16:50 |
jeblair | i will recheck a change now | 16:50 |
jeblair | hrm, i'm going to restart the scheduler | 16:51 |
pabelanger | ack | 16:52 |
pabelanger | seen something that time | 16:52 |
SpamapS | how does zuulv3.openstack.org know what status.json to get? | 17:02 |
jeblair | SpamapS: it's in the apache config for now | 17:03 |
jeblair | RewriteRule ^/status.json$ http://127.0.0.1:8001/openstack/status.json [P] | 17:03 |
jeblair | RewriteRule ^/status/(.*) http://127.0.0.1:8001/openstack/status/$1 [P] | 17:03 |
jeblair | we should add a story about making the status page tenant aware, if there isn't one already | 17:04 |
mordred | jeblair, SpamapS: yah - and we'll probably want to figure out an auth story around access to tenant status too (I do see that jamielennox already has patches for tenant-aware logs which should also allow a similar amount of protection) | 17:06 |
jeblair | mordred: right. for now, the story is, if you have private tenants, build that into your web server infrastructure. | 17:07 |
jeblair | (like, use mod_auth_foo to restrict access to /private_tenant/....) | 17:08 |
jeblair | auth in zuul itself would be a significant distraction at this point | 17:08 |
mordred | jeblair: ++ | 17:08 |
jeblair | SpamapS: have you started any work regarding the new ssh context manager/wrapper thing in the merger? (this came up in the context of test_timer_sshkey) | 17:18 |
SpamapS | jeblair: none.. that's way down in the stack unfortunately. | 17:22 |
* jlk looks at the size of the scrollbar on mordred 's email, settles in for a long read. | 17:22 | |
jeblair | SpamapS: okay, i think it's about to pop onto the top of my stack | 17:22 |
jeblair | SpamapS: pabelanger and i are suspecting that change is not at all working in reality | 17:22 |
jeblair | i do not see how to use the context manager for an initial clone | 17:25 |
jeblair | SpamapS, pabelanger: okay, i've confirmed that's the problem with the merger. i will start working on a fix. it will take a little bit, but i think we can do it fairly easily and comprehensively. | 17:34 |
jeblair | pabelanger: i have to take care of a few things first; it will probably be a couple of hours until i'm ready with this change. | 17:38 |
jlk | mordred: just to screw with your noodle, what if we skipped REST and went straight to GraphQL? | 17:38 |
pabelanger | jeblair: understood | 17:39 |
pabelanger | jlk: is GraphQL the new hotness these days? | 17:40 |
jlk | pabelanger: seems like a buzzy word, but I'm not really clear on what problems it solves. I'm mostly aware of it because Github builds their website with graphql, and are now exposing it so that us app writers can get first-class access as they change the platform, instead of waiting for somebody to re-write everything for their REST API. | 17:42 |
jlk | at some point, we're going to have GraphQL in zuul, to deal with github, at least on a client level. | 17:42 |
jeblair | fwiw, i'm not sure 'rest api' actually describes what we'll end up doing with zuul anyway. more like 'http api'. | 17:42 |
pabelanger | Ya, I've only heard of it from here and githubv4 (I think). Not sure what else is using it. Looks like Facebook started it | 17:43 |
jlk | nod | 17:43 |
jeblair | (there are almost zero actual REST apis in the world) | 17:44 |
SpamapS | Because that's hard and mostly an exercise in correctness not pragmatism? :) | 17:46 |
SpamapS | Looks to me like GraphQL is more about being super expressive on the client side about what you want from the server and reducing round trips by allowing secondary lookups in the responses. | 17:47 |
SpamapS | I could see that being useful for a better status responder | 17:47 |
SpamapS | Like if no jobs are expanded you don't need the lists of jobs. | 17:48 |
jlk | yeah I'm not seriously suggesting it. | 17:48 |
jlk | although now that I re-read what you're saying | 17:48 |
SpamapS | jlk: I actually think it's for making what we do with status.json more efficient. | 17:48 |
SpamapS | I mean, status.json is 110 - 120 kB | 17:49 |
jlk | My basic read is that you can say "Go fetch me this info, and while you're there, fetch this adjacent info, and that info, and that over there too, and all the things attached to that" | 17:49 |
jlk | or you can just say "I only want this tiny bit" | 17:49 |
SpamapS | if you refresh that 1/s ... that's not a trivial amount of data if you have a lot of users watching. | 17:49 |
SpamapS | jlk: it's like SQL, for backend API queries. ;) | 17:49 |
jlk | It'd allow us to go from 4 or 5 round trips to the github API to a single call. | 17:50 |
SpamapS | jlk: yeah seems like something worth it for Github to push, since it will likely reduce their bandwidth usage and server side wasted overhead. | 17:50 |
SpamapS | so I think it's a super valid thing to push for when we revisit status.json | 17:51 |
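As an illustration of the single-round-trip idea jlk describes, a hedged sketch of a GitHub v4 (GraphQL) query made with `requests`; the owner, repo, PR number, and token are placeholders, and the field names are from memory of the v4 schema rather than taken from this discussion.

```python
# Sketch: ask for several pieces of PR data in one request instead of
# several REST calls. Placeholders: example-org/example-repo, PR #1,
# and YOUR_TOKEN_HERE.
import requests

QUERY = """
query {
  repository(owner: "example-org", name: "example-repo") {
    pullRequest(number: 1) {
      title
      state
      mergeable
    }
  }
}
"""

resp = requests.post(
    'https://api.github.com/graphql',
    json={'query': QUERY},
    headers={'Authorization': 'bearer YOUR_TOKEN_HERE'},
)
print(resp.json())
```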
SpamapS | also I didn't know we had some kind of frontend for nodepool? | 17:51 |
SpamapS | or is it just an HTTP backend | 17:51 |
jlk | I could see these things being useful for monitoring | 17:52 |
jlk | particularly in a containerized world, where we're not going to run zuul _and_ a monitoring daemon | 17:52 |
jlk | I want something external that can poke at it and tell me if it's healthy | 17:52 |
jlk | and later data mining for efficiency fixes. "where are you spinning your wheels" | 17:52 |
SpamapS | yeah those /health checks mordred mentioned | 17:53 |
SpamapS | jlk: that's more statsd's job tho | 17:53 |
SpamapS | just let zuul spit that stuff out and you can go get answers from influxdb or datadog | 17:53 |
jlk | how does statsd work? what collects it? Do you have to tell it where to send things to? | 17:54 |
jlk | brb, walking the furry four legger. | 17:54 |
mordred | SpamapS: /health checks are a thing k8s people want for prometheus | 18:00 |
mordred | SpamapS: I'm not saying we implement them today - but it's a thing that will come up in the future | 18:01 |
SpamapS | I like them for stuff too | 18:01 |
mordred | yay | 18:01 |
SpamapS | I used to have nagios ssh'ing into boxes and restarting our apache when /health timed out | 18:01 |
SpamapS | this isn't a new paradigm ;) | 18:01 |
SpamapS | jlk: statsd is a daemon that listens for a UDP protocol. You add little calls in your app when you want to increment counters and it sends them off to statsd. Zuul and nodepool already do that. | 18:02 |
SpamapS | jlk: and then statsd takes those increments and puts them where you want.. graphite/influxdb/etc. Datadog speaks statsd too. | 18:03 |
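A minimal sketch of the pattern SpamapS describes, using the Python `statsd` client; the host, port, prefix, and metric names are examples, not Zuul's actual configuration or metric names.

```python
# Fire-and-forget UDP metrics to a statsd daemon, which forwards them
# to graphite/influxdb/etc. All names below are illustrative.
import statsd

client = statsd.StatsClient('localhost', 8125, prefix='zuul')

# Increment a counter when something happens.
client.incr('event.github.push')

# Record how long an operation took, in milliseconds.
client.timing('merger.merge_time', 1250)

# Report a point-in-time value.
client.gauge('executor.running_builds', 7)
```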
SpamapS | mordred: does my response make sense at all? | 18:18 |
SpamapS | I'm kind of arguing that you have a nice clean "here's where the world talks to Zuul" spot, and then you can have any kind of messy backend weirdness you want behind that. | 18:19 |
mordred | SpamapS: I goes to read | 18:35 |
mordred | SpamapS: it does - although amusingly enough part of it (the zk async stream concern related to webhooks) is in the spec I wrote about ingestors that I told people to ignore that clarkb also suggested mqtt for. I don't want to actually get in to that debate currently (I think there are pros and cons to each and we should discuss those) - but I don't think we have to for now | 18:40 |
mordred | SpamapS: I would like to bikeshed on the one vs. two endpoints thought though | 18:41 |
mordred | lemme respond to the email first though | 18:41 |
*** yolanda has quit IRC | 18:43 | |
jlk | SpamapS: right, so okay. The app has to support shuffling things off to statsd itself, and has to have configuration for _which_ statsd to shuffle them off to. | 18:45 |
*** yolanda has joined #zuul | 18:45 | |
jlk | SpamapS: that's a little different in that you need prior knowledge of this stuff IN the config file to run the service, making moving an image to different environments a bit more difficult (but not really since we already have some of that stuff). | 18:46 |
mordred | jlk: it's actually usually configured via env vars though | 18:46 |
jlk | whereas if it were health things hanging off the API, it wouldn't matter where they're run, outside parties could poke at the health points. | 18:46 |
jlk | mordred: I can't tell you how much I hate configuration via ENV vars. | 18:46 |
mordred | me too - but you know it's the "right" way in cloud native, right? ;) | 18:46 |
mordred | https://12factor.net/config | 18:47 |
mordred | jlk: but in any case, in this particular instance it's a mechanism where communicating to the app where the statsd for your k8s is would be fairly easy to accomplish without the app having a-priori knowledge, no? | 18:47 |
jlk | so.. | 18:48 |
jlk | yeah, statsd daemon startup should advertise its location to k8s, which will toss it in etcd | 18:48 |
mordred | also - to provide metrics to a /status endpoint, the app has to _save_ the metrics somewhere | 18:48 |
mordred | jlk: ++ | 18:49 |
jlk | your service still has to be able to read from etcd to, at start up, know where to send statsd stuff | 18:49 |
jlk | (and within the cluster that place is going to be fairly static, due to k8s proxy things) | 18:49 |
clarkb | mordred: it was mostly a response to having a spec for a thing that already exists mostly. I was saying you don't need a spec for that you just need to have zuul read from mqtt | 18:49 |
mordred | clarkb: right. I'm not convinced that actually solves the use cases | 18:49 |
SpamapS | jlk: it's push vs. pull all over again. :) | 18:49 |
mordred | clarkb: which is why I think it'll be fun to talk about | 18:49 |
* SpamapS will ponder whilst at the gym | 18:49 | |
jlk | mordred: hrm, I hadn't thought about saving the metrics, just more of getting snapshots of state periodically. Hit the API point to get a snapshot of state, some external system does the time series analysis | 18:50 |
mordred | jlk: yes. I think polling urls for some state is a FINE idea | 18:50 |
mordred | like, polling the zuul status page, for instance - would be a fine way to get an amount of data | 18:50 |
jlk | I also think statsd is a fine idea. | 18:50 |
jlk | since we have ALREADY built statsd support into the app | 18:51 |
mordred | yah- I think having a /status AND also emitting to statsd is best of both worlds | 18:51 |
jlk | which reminds me, should probably take a pass through the github driver code and litter some statsd about | 18:51 |
mordred | I think there will be metrics we provide to statsd that would be hard to collect and provide to /status - but maybe that's fine | 18:51 |
jlk | yeah I can see a separation of concerns | 18:53 |
jlk | you don't want lengthy analysis to happen when hitting the status/ url | 18:53 |
jlk | not unless we do something like graphQL and allow taking a sip of the stats vs the firehose | 18:54 |
mordred | SpamapS: ok. responded with more words than are likely necessary | 19:12 |
mordred | jlk: agree | 19:13 |
pabelanger | lynx finger://ze01.openstack.org | 19:18 |
pabelanger | :D | 19:18 |
pabelanger | that is kinda cool | 19:18 |
pabelanger | cannot wait for jobs to be listed | 19:18 |
jeblair | mordred, SpamapS: i also responded, in an apparently complementary way to mordred. hopefully walked the right line between "let's decide the now stuff now, and talk about the later stuff later". | 19:19 |
jeblair | pabelanger: "finger @ze01.openstack.org" should return a nice error message | 19:21 |
jeblair | lacking a newline | 19:21 |
pabelanger | jeblair: ya, was playing with that already | 19:22 |
pabelanger | was trying to see if any client supports a port other than 79 | 19:23 |
pabelanger | GNU finger apparently does, but that is not shipped in fedora | 19:23 |
mordred | jeblair, Shrews: that finger command is the most exciting thing ever | 19:27 |
* Shrews very angry at finger things atm | 19:28 | |
* mordred hands Shrews a box of fingers he found laying around | 19:28 | |
jeblair | mordred: you're always bringing game of thrones into things | 19:29 |
* Shrews gives middle finger back to mordred | 19:29 | |
Shrews | did i win? i feel like i just won | 19:29 |
jeblair | Shrews: yes but you gave the prize to mordred | 19:30 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 19:30 |
Shrews | if someone wants to explain why my ansible job never runs in that test (it did until i rebased yesterday), i would be terribly grateful | 19:31 |
Shrews | because i'm a bit fed up | 19:31 |
Shrews | test_log_streamer.py line 102 | 19:32 |
mordred | Shrews: when you say "don't seem to have real sync for ansible jobs" ... | 19:33 |
Shrews | we can't pause ansible jobs with hold_jobs_in_build | 19:33 |
mordred | Shrews: http://git.openstack.org/cgit/openstack-infra/zuul/tree/tests/unit/test_inventory.py?h=feature/zuulv3#n29 | 19:34 |
mordred | Shrews: self.executor_server.hold_jobs_in_build = True seems to work for me? | 19:34 |
Shrews | mordred: for fake builds | 19:34 |
mordred | OOOHOHHHHHHHH | 19:34 |
mordred | gotcha | 19:34 |
mordred | sorry - and thanks for clarifying for my dumb brainhole | 19:35 |
Shrews | the entire test attempts to keep the ansible log file around long enough so i can attempt to start the finger thingy and stream it | 19:35 |
Shrews | which i almost had working yesterday (was streaming and getting contents) | 19:36 |
Shrews | now nothing works | 19:36 |
clarkb | mordred: I have responded to the great api thread of june 2017 | 19:36 |
pabelanger | http://logs.openstack.org/79/471079/3/check/gate-zuul-python35/fd2756a/console.html#_2017-06-07_17_13_58_908303 seems to be a warning, but I don't see any other tasks after it | 19:36 |
Shrews | pabelanger: that's another thing i've been avoiding. i have no idea why that's there | 19:37 |
pabelanger | Ya, I don't see any of the other tasks after welcome | 19:38 |
pabelanger | possible that ansible is just not running them now? | 19:38 |
mordred | I think it's actually that something is failing in that method so no file is getting written | 19:38 |
* mordred looks | 19:38 | |
pabelanger | I'd actually expect http://logs.openstack.org/79/471079/3/check/gate-zuul-python35/fd2756a/console.html#_2017-06-07_17_13_58_909229 to be ok=4, since you have 4 tasks | 19:39 |
mordred | Shrews: mind if I push up an update with some added debugging? | 19:40 |
Shrews | mordred: go for it | 19:41 |
Shrews | i can't even get that locally. something missing from my local setup? | 19:42 |
jeblair | Shrews: remember the venv activation required when using ttrun on real ansible jobs? | 19:42 |
Shrews | jeblair: yup, got that | 19:42 |
jeblair | drat, out of ideas | 19:43 |
mordred | Shrews: I think you may have rebased on a commit that's in the middle of the updates to the callback plugin | 19:43 |
mordred | nothing that should be causing an issue here though | 19:44 |
Shrews | welp, going to just rebase on an older commit that works | 19:45 |
Shrews | mordred: fwiw, i was getting that v2_playback method error before rebasing, too | 19:45 |
mordred | Shrews: oh - I say that ... | 19:45 |
mordred | Shrews: nope. there is a bugfix that went in after your rebase that is probably the thing screwing you | 19:46 |
Shrews | i thought maybe something was wrong with my test ansible config setup causing that | 19:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 19:46 |
mordred | Shrews: I missed an encode('utf-8') on the send to the socket part | 19:46 |
jeblair | py3 is the best! | 19:46 |
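For anyone following along, the py2/py3 difference behind that fix in a tiny self-contained example (a socketpair stands in for the real streamer socket; the message content is just an example):

```python
# str payloads must be encoded to bytes before socket.send() on py3.
import socket

# A pair of connected sockets stands in for the streamer connection.
sender, receiver = socket.socketpair()

message = 'Hello, World\n'

# Under py27, str is already bytes, so sender.send(message) works;
# under py35 it raises "TypeError: a bytes-like object is required".

# Encoding first works under both:
sender.send(message.encode('utf-8'))
print(receiver.recv(4096).decode('utf-8'))

sender.close()
receiver.close()
```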
mordred | Shrews: that adds a debugging and a rebase | 19:47 |
mordred | Shrews: so either it'll work now, or we'll hopefully see the error message | 19:47 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 19:48 |
mordred | sorry - one more fix | 19:48 |
jlk | alright I'm out for the weekend. Would love some eyes on the github caching change object er, change. | 19:52 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix git ssh env in merger https://review.openstack.org/472788 | 19:55 |
jeblair | pabelanger, SpamapS, mordred: ^ that's to fix the current blocker on ze01 | 19:56 |
pabelanger | looking | 19:57 |
Shrews | mordred: the exception would seem to imply something wrong with my job yaml, but i don't see it | 20:00 |
* Shrews wishes for a separate utility for zuul-purposed yaml files | 20:06 | |
jeblair | Shrews: what exception? | 20:11 |
Shrews | jeblair: http://logs.openstack.org/79/471079/6/check/gate-zuul-python35/0744d62/testr_results.html.gz | 20:11 |
jeblair | Shrews: s/image/label/ | 20:11 |
jeblair | Shrews: https://review.openstack.org/472372 changed that | 20:12 |
Shrews | jeblair: other configs use image | 20:12 |
jeblair | Shrews: before yesterday, yes | 20:12 |
pabelanger | Ah, also didn't know we changed it | 20:13 |
jeblair | trying to get the breaking config changes in before we write too many real configs :) | 20:14 |
pabelanger | ya, I don't believe we are using that today in our jobs | 20:15 |
Shrews | ah, mordred rebased it | 20:15 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 20:15 |
mordred | Shrews: oh - sorry - yah - rebased to get the utf-8 fix, forgot the label change | 20:16 |
Shrews | mordred: pushed up a PS to correct it | 20:16 |
mordred | Shrews: fingers crossed | 20:18 |
mordred | jeblair: your patch has the sads | 20:18 |
Shrews | still don't know why things won't run locally now. just going to have to depend on zuul, i guess | 20:19 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix git ssh env in merger https://review.openstack.org/472788 | 20:21 |
jeblair | mordred: thanks. let's try that ^. :) | 20:21 |
pabelanger | mordred: jeblair: On the topic of nodepool-drivers, would a heat implementation fall under the current openstack driver via shade today? Or some new dedicated driver? Was having some talks with tripleo-CI folks and the use case would be to have nodepool create their OVB resources directly when possible | 20:22 |
mordred | pabelanger: that's a great question. I think a heat driver could be a separate driver, but I'd want it to use shade not use heatclient directly | 20:24 |
jeblair | pabelanger: could that use case be covered by linch-pin? | 20:25 |
mordred | pabelanger: it's also possible that making it a different driver doesn't make sense and just adding it as an option to the current one would make more sense | 20:25 |
pabelanger | jeblair: yes, that is also possible. I am really starting to like the idea of a generic ansible driver or we just say that is a linch-pin thing | 20:25 |
pabelanger | mordred: understood | 20:26 |
mordred | like - I'm guessing it would be "boot this heat stack" rather than "boot this server" - so one could imagine having a "stack" option instead of an "image" option | 20:26 |
pabelanger | I get the feeling it is more about how people would like to support it long term | 20:26 |
mordred | but I don't know enough | 20:26 |
pabelanger | mordred: ya, that was my simplest use case atm | 20:26 |
mordred | pabelanger: it's also worth putting some thought into how we express that zuul-side and how the resources in question make it into the inventory | 20:26 |
jeblair | yes, it's 'nodepool' not 'stackpool' :) | 20:27 |
mordred | which is to say - I don't think I know enough to know whether it should be linch-pin, a modification to the current openstack driver or a new driver | 20:27 |
mordred | but - I can say that in all three of those cases the heat api interactions should all be via shade ;) | 20:27 |
mordred | also - the same question will need to be asked re: multi-node and linch-pin integration - how does that get expressed in zuul config and how does it map into inventories | 20:28 |
mordred | so there is a worthwhile question to explore regardless of impl details | 20:28 |
jeblair | yep. i don't think we have time to fully explore it now. i'd really like us to focus on getting v3 out the door. | 20:29 |
jeblair | these are important considerations, but let's add them to the backlog, not get distracted by them now. | 20:30 |
pabelanger | Ya, I don't want to distract on current efforts for sure | 20:30 |
mordred | yah. agree. but definitely think there is a thing that is super worth digging in to when the time is right | 20:30 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 20:32 |
jeblair | mordred, pabelanger: on that note, how about https://review.openstack.org/472472 ? | 20:32 |
jeblair | pabelanger raised a good point there | 20:33 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add initial license, docs, and other config https://review.openstack.org/472410 | 20:34 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add revoke-sudo role https://review.openstack.org/472461 | 20:34 |
pabelanger | Ya, on the fence. If we wanted to enforce ansible, it would be a great thing to skip. But the opposite, it is an easy way to run a bash script | 20:34 |
jeblair | and i guess the answer is how much should the "standard library" python jobs attempt to do? after conversations with mordred, i'm encouraged to try to push it as far as possible. | 20:34 |
pabelanger | +2 however, don't want to block | 20:35 |
pabelanger | ya, if we want to support it cool | 20:35 |
jeblair | pabelanger: i think the idea for this job would be to make it as simple as possible for someone to "just run the python unit tests in a repo". we might even add things to it later to support nose as well as testr, to try to make it universally applicable. | 20:35 |
jeblair | pabelanger: so it would work as well for an openstack project as a random github project | 20:36 |
jeblair | pabelanger: i've been digging into all the stuff we do (in run-tox.sh), and i actually think most of it *can* be universally applied. which surprised me a bit. | 20:36 |
jeblair | (as long as we do it carefully) | 20:36 |
jeblair | (aside from the 50mb subunit limit) | 20:37 |
pabelanger | sure, wfm | 20:37 |
jeblair | ok, 1 test failure on the merger fix; going to refresh that now | 20:38 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix git ssh env in merger https://review.openstack.org/472788 | 20:39 |
jeblair | pabelanger: i think the next step in that series is to start decomposing run-tox.sh into roles | 20:39 |
pabelanger | Ya, that seems right | 20:40 |
Shrews | mordred: http://logs.openstack.org/79/471079/7/check/gate-zuul-python35/b0b14f4/console.html#_2017-06-09_20_20_30_429766 | 20:40 |
mordred | Shrews: ooh - that's fun | 20:42 |
Shrews | missed encoding there | 20:42 |
Shrews | unless it's fixed elsewhere? | 20:42 |
mordred | Shrews: oh for the love of ... yes, that's the bug | 20:43 |
* mordred feels bad - was testing this locally with py27 - should stop doing that | 20:43 | |
Shrews | mordred: we are at least now hitting my artificial failure point again. many many thanks | 20:44 |
mordred | Shrews: you got that encoding or want me to? | 20:46 |
Shrews | i got it | 20:46 |
mordred | kk. cool | 20:46 |
mordred | Shrews: is that working locally again too? | 20:48 |
mordred | by "working" I mean "breaking properly" | 20:48 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix zuul_streamer send() call for py35 https://review.openstack.org/472816 | 20:48 |
jeblair | yay passes tests now | 20:51 |
pabelanger | Yay tests | 20:52 |
mordred | jeblair: ^^ that change if you get a sec to bump it in | 20:52 |
jeblair | mordred, pabelanger: what do you think of the documentation-related bits i've started on in 472485 and 472743? | 20:52 |
jeblair | mordred: +3 | 20:53 |
mordred | jeblair: ++ | 20:53 |
pabelanger | jeblair: I like them, I was hoping to see the output | 20:53 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 20:54 |
Shrews | mordred: i don't think so | 20:54 |
jeblair | pabelanger: yeah; that'll be easier with zuulv3.o.o up and running. :) | 20:54 |
pabelanger | ++ | 20:54 |
jeblair | pabelanger: you can run 'tox -e docs' with it checked out locally though and you'll get something | 20:54 |
pabelanger | Agree, I'm not sure I have them cloned locally yet :) | 20:56 |
pabelanger | will have to try that out in a bit, looks like heading out for some food momentarily | 20:56 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix git ssh env in merger https://review.openstack.org/472788 | 21:01 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix zuul_streamer send() call for py35 https://review.openstack.org/472816 | 21:06 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 21:10 |
mordred | Shrews: zomg. patch 9 actually passed the test for py27! | 21:11 |
Shrews | mordred: that one has always passed | 21:11 |
mordred | oh | 21:11 |
mordred | oh right - because you skip if not py35 | 21:11 |
Shrews | right | 21:12 |
Shrews | the string exceptions are new. hoping that's some other random thing | 21:12 |
Shrews | mordred: locally, it seems to hang on the 'shell' task since i get no output in the ansible log for that task beyond the name. it seems to continue on zuul, but the current patchset should validate that | 21:14 |
Shrews | specifically: shell: echo 'Hello, World' | 21:15 |
mordred | Shrews: that sounds like the same issue I'm seeing just when I try to run a simple playbook not in the test suite ... but on the patch where I combined the two log streamers | 21:19 |
mordred | I wonder ... | 21:20 |
Shrews | ugh, mysterious StringException again | 21:20 |
Shrews | aaaaand nothing in the logs. wunderbar | 21:21 |
mordred | Shrews: http://logs.openstack.org/79/471079/10/check/gate-zuul-python35/3fb578e/console.html#_2017-06-09_21_17_20_206490 | 21:21 |
Shrews | wuh? | 21:22 |
mordred | hrm - I dont think that's for your test .. | 21:22 |
jeblair | oh yeah, that looks like the old "everything spews to the console" bug again | 21:22 |
mordred | yah | 21:22 |
jeblair | that's a test timeout | 21:23 |
jeblair | there's no way we're going to extract its output (if any) from the console log | 21:23 |
Shrews | so perhaps my test IS hanging as well? i removed the short-circuit to prevent the hang | 21:24 |
Shrews | hanging in zuul as well, i mean | 21:24 |
jeblair | yep | 21:24 |
Shrews | i wonder what happens if i remove the shell task... | 21:25 |
* Shrews tests | 21:25 | |
Shrews | hah! it proceeds as normal | 21:26 |
mordred | Shrews: I've got a thing I want to toss up to try | 21:26 |
Shrews | something funky with our shell module | 21:26 |
mordred | well - that's gonna be doing the stream-logs-from-remote-host code - which maybe isn't working so well in the test framework | 21:27 |
mordred | Shrews: oh- actually - you need to be running a zuul_console on the "remote" node for this to work at all | 21:27 |
* mordred feels really stupid | 21:27 | |
jeblair | oh yeah. | 21:28 |
mordred | one -sec - lemme make you a quick patch thing | 21:28 |
jeblair | this test is going to need a good long comment. :) | 21:28 |
mordred | yah | 21:28 |
*** jkilpatr has quit IRC | 21:28 | |
mordred | so - if we don't test with shell, but instead test with non-shell things | 21:28 |
mordred | it should be fine for testing the finger log streamer | 21:28 |
mordred | since non-shell things do not need zuul_console to produce output | 21:28 |
Shrews | k. that's fine. i really didn't need that shell thing anyway | 21:28 |
mordred | yah- you just need things to produce output so you can test that you stream that output | 21:29 |
Shrews | yup | 21:29 |
Shrews | jeblair: this test has been a... experience... for sure | 21:30 |
Shrews | jeblair: can i do this without defining the 'nodes' in the job? I'm seeing entries for 'localhost' and 'ubuntu-xenial' in the ansible log and they're sort of conflicting when i go to remove the flag file | 21:33 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Don't wait for forever to join streamer https://review.openstack.org/472839 | 21:34 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Use display.display for executor debug messages https://review.openstack.org/472840 | 21:34 |
mordred | Shrews, jeblair: two minor changes to zuul_stream to consider | 21:34 |
jeblair | Shrews: yes; i think we always add 'localhost' in the tests. | 21:35 |
Shrews | mordred: lgtm, except the 2nd one has an unintentional indent | 21:37 |
jeblair | mordred: 2 comments on the first one; a 0 and a -0.5. :) | 21:39 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: WIP: add log streaming test https://review.openstack.org/471079 | 21:41 |
jeblair | mordred: does 840 put the "starting to log" message into the main job log? | 21:41 |
jeblair | mordred: if so, i'm not sure i'm keen on that. maybe we could run with '-vvv' for the tests instead? | 21:42 |
mordred | jeblair: no - it does not | 21:46 |
SpamapS | quite the spirited discussion on HTTP stuffs | 21:46 |
mordred | jeblair: it's probably worth a comment somewhere | 21:46 |
jeblair | mordred: ok. i are confused. | 21:46 |
mordred | jeblair: there are essentially now two places things can go - the log file we defined, and stdout | 21:46 |
jeblair | mordred: got it, thanks. :) | 21:47 |
jeblair | mordred: then i'll be +2 on that after the fixup. | 21:47 |
mordred | jeblair: self._display.display will put things on stdout - which it seemed like our test suite was capturing in some manner at least in an earlier link from pabelanger | 21:47 |
mordred | cool | 21:47 |
jeblair | mordred: yeah, and it'll end up in the zuul executor log, which is fine | 21:47 |
mordred | jeblair: it's possible we may want to explore intentional uses of those two a little more | 21:47 |
jeblair | *nod* | 21:48 |
mordred | jeblair: re: waiting for 30 - tasks will not proceed any further if that is blocking | 21:48 |
mordred | we ok with that still? | 21:48 |
jeblair | mordred: it should only matter if something has gone wrong, or if we're really backlogged reading the log. so it seems okay to me....? | 21:49 |
mordred | ++ | 21:50 |
jeblair | pabelanger: sweet. zuulv3 is up sufficiently that it is complaining about config errors related to image/label | 21:50 |
SpamapS | also... nobody's biting on my Netflix/zuul reference? ;-) | 21:51 |
mordred | SpamapS: :) | 21:51 |
mordred | SpamapS: I almost did, but then decided not to | 21:51 |
SpamapS | good choice | 21:53 |
SpamapS | it was bad | 21:53 |
Shrews | mordred: ps11 fails as I expect now | 21:53 |
jeblair | 2017-06-09 21:54:36,390 INFO zuul.IndependentPipelineManager: Adding change <Change 0x7f1bfc2e4780 472483,1> to queue <ChangeQueue check: openstack-infra/zuul> in <Pipeline check> | 21:55 |
jeblair | that's progress | 21:55 |
jeblair | 2017-06-09 21:55:13,467 DEBUG zuul.AnsibleJob: [build: 77890c7afde348899105bccf5bcb71f3] Ansible output: b'fatal: [ubuntu-xenial]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \\"15.184.66.20\\". Make sure this host can be reached over ssh", "unreachable": true}' | 21:55 |
jeblair | that's unfortunate | 21:55 |
mordred | jeblair: do you remember why we're doing the log streaming in zuul_stream as a subprocess and not a thread? | 21:55 |
SpamapS | key problems? | 21:56 |
jeblair | mordred: i thought it was something about ansible modules, but i can't recall what. | 21:56 |
jeblair | er plugins | 21:56 |
jeblair | like, something about that forced our hand. but i have no idea. | 21:57 |
mordred | nod | 21:57 |
jeblair | mordred: should be easy to switch, since they have almost the same interface. | 21:57 |
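A small sketch of why that switch is mechanical: multiprocessing.Process and threading.Thread take the same constructor arguments and expose the same start()/join() calls. The target function and log path here are placeholders, not Zuul's streaming code.

```python
# Swapping a subprocess for a thread with minimal changes.
import multiprocessing
import threading


def follow_log(path):
    # placeholder for the log-following loop
    pass


# Current approach: a separate process.
p = multiprocessing.Process(target=follow_log, args=('/tmp/console.log',))
p.start()
p.join()

# Proposed approach: same call pattern, but a thread in-process.
t = threading.Thread(target=follow_log, args=('/tmp/console.log',))
t.daemon = True
t.start()
t.join()
```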
jeblair | SpamapS: highly likely | 21:57 |
* mordred is starting to worry about layers of subprocesses with things reading and writing sockets and files and whatnot | 21:57 | |
SpamapS | jeblair: is this trusted or untrusted? While I tested the ssh-agent stuff lightly locally... and the tests verify it works the way we hope it does.. there are a few new moving parts there... | 21:58 |
jeblair | SpamapS: amusingly, i think *all* of our config right now is untrusted | 21:58 |
SpamapS | fail closed FTW? | 21:58 |
jeblair | heh | 21:58 |
Shrews | mordred: it does seem like that could get fragile rather quickly | 21:59 |
SpamapS | how's that subprocess spawned btw? | 22:00 |
* SpamapS looks | 22:00 | |
jeblair | multiprocessing i think | 22:00 |
SpamapS | :q | 22:01 |
SpamapS | yay vim reflex | 22:01 |
jeblair | SpamapS: i think we have more local key problems on the host | 22:02 |
SpamapS | I've always thought multiprocessing was only for stepping around the GIL.. if you have other issues, subprocess is the cleaner path. | 22:03 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Use display.display for executor debug messages https://review.openstack.org/472840 | 22:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Don't wait for forever to join streamer https://review.openstack.org/472839 | 22:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Use threads instead of processes in zuul_stream https://review.openstack.org/472850 | 22:04 |
mordred | jeblair, Shrews: ^^ local testing shows that to work just fine | 22:04 |
mordred | SpamapS: you too | 22:04 |
SpamapS | mordred: is it possible we were worried about GIL contention? | 22:05 |
SpamapS | log streaming is going to be a constant load | 22:05 |
jeblair | this is inside of an ansible process | 22:05 |
SpamapS | or is that only inside ansible-playbook and thus not such a concern? | 22:05 |
mordred | ya | 22:06 |
mordred | that | 22:06 |
mordred | I think it was cargo-cult, not a specific decision | 22:06 |
SpamapS | so we already have our own per-job CPU eater | 22:06 |
SpamapS | ack | 22:06 |
* SpamapS is about 25 minutes from EOD'ing early to begin a 3 day LEGOland extravaganza in beautiful Carlsbad CA... | 22:06 | |
mordred | ooh fun | 22:06 |
SpamapS | so.. my focus is already drifting | 22:06 |
* SpamapS pulls it together for a last review push | 22:07 | |
mordred | SpamapS: also - I agree, I don't know that moving status.json out of the scheduler is of immediate concern and just shifting that to be aiohttp in place for now seems fine | 22:07 |
SpamapS | mordred: that stack is a little funny | 22:08 |
SpamapS | adds the terminate() call and then removes it | 22:08 |
SpamapS | mordred: I actually really love the idea of coalescing with ETags of the hashed status.json or something.. but.. yeah, KISS says if you already need to rework your HTTP, just rework it to be the way you want it. | 22:09 |
jeblair | SpamapS, mordred: you both +2d https://review.openstack.org/472485 which uses https://review.openstack.org/472483 -- does that mean you like the addition of that field? | 22:13 |
SpamapS | jeblair: I do! +2'd | 22:16 |
mordred | jeblair: yup | 22:19 |
clarkb | mordred: fwiw the api services that just talk to the other services in openstack are what are being wsgi'd; there is enough application logic in them talking to backends to make that desirable | 22:21 |
clarkb | but maybe I still misunderstand what you are saying? | 22:21 |
mordred | clarkb: yah - I'm saying that in either case we want to have the api services be separate services | 22:22 |
mordred | clarkb: so either WSGI or aiohttp it'll be a thing thats job is just to handle http requests | 22:22 |
clarkb | mordred: right, but the problem is not just that they are separate but that running the http server in python is sadness | 22:22 |
mordred | clarkb: right - aiohttp is apparently _much_ better at that | 22:23 |
mordred | one sec - lemme get you links | 22:23 |
SpamapS | right that's basically the point of aiohttp | 22:23 |
clarkb | so I see aiohttp as roughly equivalent to eventlet + whatever webserver that nova api (used to) give you | 22:23 |
SpamapS | I have zero problem with WSGI.. but if we're already going python3.5+ and asyncio for streaming... | 22:23 |
mordred | SpamapS: exactly | 22:23 |
SpamapS | yeah I don't think comparing it to eventlet is fair at all | 22:24 |
clarkb | they are very similar in both design and use (if you don't use eventlet monkeypatching and instead fully cooperate) | 22:24 |
clarkb | asyncio has the benefit of the syntax being nicer in new python though | 22:24 |
SpamapS | eventlet is trying very hard to hide complexity from you, and in so doing creates a debugging nightmare and a compatibility nightmare (basically invalidates half of pypy's advantages because of its trickery) | 22:24 |
clarkb | SpamapS: no that's not quite true, you can use eventlet fully explicitly without monkey patching iirc | 22:25 |
clarkb | now it happens that openstack doesn't | 22:25 |
SpamapS | Did any of the openstack services not use monkeypatching? | 22:25 |
mordred | clarkb: https://github.com/aio-libs/aiohttp/issues/234 - the bug about "add docs explaining how to use this in production" and the associated PR discuss that the situation is different | 22:25 |
mordred | https://github.com/aio-libs/aiohttp/pull/237 | 22:25 |
SpamapS | And it's not just the monkeypatching that breaks eventlet for pypi | 22:25 |
SpamapS | pypy | 22:25 |
clarkb | pypy works fine now | 22:25 |
clarkb | has for years (vishy got it going iirc) | 22:26 |
SpamapS | works, does not improve your performance | 22:26 |
SpamapS | (the way pypy should) | 22:26 |
clarkb | SpamapS: according to intel it does | 22:26 |
SpamapS | ah maybe intel fixed eventlet | 22:26 |
clarkb | I don't know how they made it happen but they did all kinds of testing and benchmarks | 22:26 |
mordred | given that we _currently_ are doing fine with a paste server in a thread, I think aiohttp is likely to be fine for us too :) | 22:26 |
clarkb | mordred: sort of we have to cache the response in apache for zuul | 22:26 |
clarkb | because paste in a thread doesn't cut it | 22:26 |
SpamapS | there were some things that were causing the jit to run over and over IIRC | 22:26 |
clarkb | (granted you could continue to do that) | 22:26 |
mordred | yah | 22:26 |
clarkb | specifically for status.json because its big | 22:27 |
SpamapS | I think we would definitely continue to do that. | 22:27 |
SpamapS | And then refactor to read from zk when we have a zk to read from. | 22:27 |
mordred | yup. but that'll be for later :) | 22:27 |
SpamapS | and maybe by then we'll have more than 3 things to have in the web tier that might make WSGI+Flask a more compelling choice. | 22:28 |
SpamapS | (like the admin requests) | 22:28 |
clarkb | SpamapS: the recent thread about dropping pypy testing for cinderclient brought up the pypy + openstack + intel stuff and I think they had rough numbers there | 22:28 |
clarkb | unfortunately they haven't done much upstream (at least not explicitly) that I have seen so it's somewhat hand-wavy still | 22:28 |
SpamapS | clarkb: I remember seeing those and thinking it was surprising. As usual, my info is outdated. :-P | 22:28 |
clarkb | SpamapS: I went to a local talk they gave on it using swift specifically and I want to say it was like 2x speedup but only after processes were running for ~5 minutes | 22:29 |
clarkb | so it won't make $clitool better but for long lived services even with eventlet it can be quite beneficial | 22:29 |
SpamapS | Sounds like a win to me. :) | 22:29 |
SpamapS | I figure aiohttp will finally give you twisted level performance without twisted level brain twisting | 22:31 |
mordred | yah | 22:32 |
mordred | my main thinking is that we have to run log streaming anyway - and that's websockets in python and isn't served by wsgi. so we _could_ do wsgi for the other things, but since we're not actually an API service in that way, it doesn't seem like any win to justify 2 http technologies when we can just use one and likely be fine | 22:32 |
mordred | especially since people report that aiohttp performs very well | 22:32 |
SpamapS | also the API for aiohttp looks simple enough | 22:33 |
SpamapS | it's not like we're saddling developers with something super weird | 22:33 |
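To give a feel for that claim, a minimal aiohttp sketch serving a status document; the port and hard-coded payload are placeholders, and the real service would return Zuul's actual (possibly cached) status.json.

```python
# Minimal async HTTP app serving a status endpoint with aiohttp.
from aiohttp import web


async def status(request):
    # In reality this would be the status.json blob from the scheduler
    # (or, later, from ZooKeeper).
    return web.json_response({'pipelines': []})


app = web.Application()
app.router.add_get('/status.json', status)

if __name__ == '__main__':
    web.run_app(app, host='127.0.0.1', port=8001)
```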
clarkb | mordred: ya I agree that keeping them similar is worthwhile | 22:33 |
clarkb | mordred: I'm just afeared of deciding in a year it all has to be rewritten because of average response time | 22:33 |
SpamapS | in a year it does all have to be rewritten | 22:34 |
SpamapS | so we can scale out schedulers ;) | 22:34 |
jeblair | finger 5b7d2c04bd0a4a229167e1a89d2fa2ce@ze01.openstack.org | 22:36 |
jeblair | everybody run that | 22:36 |
mordred | I am running that | 22:36 |
mordred | jeblair: mine is not streaming anything but is sitting at 2017-06-09 22:33:12.248107 | TASK [openstack-info : Display networking information about zuul worker.] | 22:36 |
jeblair | mordred: agreed | 22:36 |
mordred | cool | 22:36 |
jeblair | mordred: i am happy about the things it output up to that point | 22:37 |
clarkb | censored is a funny verb for "this writes too much data" | 22:37 |
clarkb | :) | 22:37 |
jeblair | i am less happy about the lack of things it output after that | 22:37 |
mordred | jeblair: I will confirm that that is all th efile on disk shows | 22:38 |
jeblair | i'll ssh into the worker | 22:38 |
jeblair | which, amusingly, would be easier if it had gotten around to printing its ip address | 22:38 |
mordred | jeblair: hah | 22:38 |
jeblair | the only thing running on the worker is zuul 839 0.0 0.1 115788 10068 ? Sl 22:33 0:00 /usr/bin/python /tmp/ansible_XtEykT/ansible_module_zuul_console.py | 22:41 |
mordred | jeblair: is there a /tmp/console file? | 22:41 |
pabelanger | I think ansible-playbook is defunct | 22:41 |
jeblair | mordred: /tmp/console-17a1464795d146bcb85c9802956a908f.log on the worker has the complete output from openstack-info | 22:41 |
jeblair | 2017-06-09 22:33:17.593676 | [Zuul] Task exit code: 0 | 22:42 |
jeblair | ends with that | 22:42 |
mordred | jeblair: ok. so somewhere the streaming borked | 22:42 |
mordred | jeblair: what's the IP? | 22:42 |
pabelanger | oh, we are also using python3 for ansible. Are we wanting that too? | 22:42 |
jeblair | ssh -i /var/lib/zuul/ssh/nodepool_id_rsa zuul@15.184.65.167 | 22:42 |
jeblair | ssh -i /var/lib/zuul/ssh/nodepool_id_rsa zuul@15.184.65.167 | 22:42 |
jeblair | mordred: ^ you'll want to run that from ze01 | 22:42 |
jeblair | mordred: (pabelanger is fixing up missing root ssh keys) | 22:43 |
mordred | jeblair: I was actually just trying to hit the console streamer | 22:43 |
jeblair | ah k | 22:43 |
pabelanger | https://review.openstack.org/#/c/472853/ and https://review.openstack.org/#/c/472854 are nl01.o.o updates we'll need I think | 22:43 |
pabelanger | I included flavor-name change too | 22:44 |
mordred | jeblair: telnet 15.184.65.167 19885 opens the connection, then I put 17a1464795d146bcb85c9802956a908f in and hit enter, but it did not start streaming | 22:44 |
jeblair | mordred: i also get that behavior | 22:45 |
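The manual telnet test above, written as a short script for reference: connect to the zuul_console port, send the build's log UUID, and print whatever streams back. The host and UUID are the ones from this conversation; that this is the exact handshake the streamer expects is assumed from the telnet description.

```python
# Minimal client for the zuul_console finger-style streamer.
import socket

HOST = '15.184.65.167'   # worker IP from the conversation
PORT = 19885             # zuul_console streaming port
LOG_UUID = '17a1464795d146bcb85c9802956a908f'

with socket.create_connection((HOST, PORT)) as sock:
    # The trailing newline mirrors hitting enter after pasting the UUID.
    sock.sendall((LOG_UUID + '\n').encode('utf-8'))
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        print(chunk.decode('utf-8', errors='replace'), end='')
```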
mordred | jeblair: maybe we should add some trace logging into zuul_console that we send to a file or something and see what it thinks is going on | 22:45 |
jeblair | mordred: yeah; i'll see if i can eke anything out of the running process | 22:46 |
jeblair | mordred: it's looping on this: | 22:47 |
jeblair | [pid 1322] open("/tmp/console.log", O_RDONLY) = -1 ENOENT (No such file or directory) | 22:47 |
jeblair | [pid 1322] select(0, NULL, NULL, NULL, {0, 500000} <unfinished ...> | 22:47 |
jeblair | [pid 1269] <... select resumed> ) = 0 (Timeout) | 22:47 |
mordred | jeblair: wow. it should never do that | 22:47 |
mordred | jeblair: out of sync copies of python things? | 22:47 |
mordred | jeblair: that's an old copy of zuul_console | 22:47 |
jeblair | weird. the version on ze01 looks current. | 22:48 |
mordred | jeblair: we copy to /opt/zuul and work from there, right? | 22:49 |
jeblair | we restarted it only a couple hours ago; that change merged yesterday, right? | 22:49 |
mordred | yah | 22:49 |
jeblair | mordred: /var/lib/zuul | 22:49 |
jeblair | so /var/lib/zuul/ansible/zuul/ansible/library/zuul_console.py should be the operative version | 22:49 |
mordred | jeblair: I agree - that looks like what I expect it to look like | 22:50 |
jeblair | /tmp/5b7d2c04bd0a4a229167e1a89d2fa2ce/ansible/untrusted.cfg says library = /var/lib/zuul/ansible/zuul/ansible/library | 22:51 |
mordred | and /var/lib/zuul/ansible/zuul/ansible/library is what's written to the untrusted.cfg | 22:51 |
jeblair | which is correct | 22:51 |
mordred | jeblair: it's like we have the same process in our heads | 22:51 |
jeblair | the start time of the zuul_console process corresponds with the start of the job (so it's not an old one somehow) | 22:54 |
mordred | jeblair: I ran an updatedb and then a locate | 22:54 |
mordred | jeblair: checking the contents of all of the zuul_console.py files, I see only one that doesn't reference console-{uuid}.log | 22:55 |
mordred | which is /var/lib/zuul/executor-git/git.openstack.org/openstack-infra/zuul/zuul/ansible/library/zuul_console.py | 22:55 |
mordred | but that also doesn't reference /tmp/console.log | 22:55 |
mordred | OH WAIT A SECOND | 22:55 |
jeblair | that's probably a master checkout | 22:55 |
mordred | yah | 22:56 |
mordred | jeblair: but zuul_console takes a filename as an argument | 22:56 |
jeblair | (that's the working directory of the merger, so its contents shouldn't matter) | 22:56 |
mordred | are we passing it a filename in the job content? | 22:56 |
pabelanger | we set it to /tmp/console.log in our prepare-workspace role | 22:56 |
mordred | that's the bug | 22:56 |
mordred | we need to stop doing that | 22:56 |
pabelanger | k, what should it be? | 22:57 |
mordred | nothing. just leave it out :) | 22:57 |
pabelanger | k, we should likely disable the override then :) | 22:57 |
pabelanger | I can patch, 1 sec | 22:57 |
mordred | yah. | 22:57 |
jeblair | oh, the uuid thing is the *default* | 22:57 |
mordred | you can also set it to '/tmp/console-{log_uuid}.log' | 22:58 |
pabelanger | port? can that be left or removed? | 22:58 |
mordred | yah - I think we need to rework that as a thing that has a parameter | 22:58 |
mordred | pabelanger: just remove it for now | 22:58 |
pabelanger | ack | 22:58 |
mordred | pabelanger: and we can rethink how we might want to allow this to be parameterized | 22:58 |
pabelanger | Hmm, how do you want to handle clean up on static nodes? | 22:58 |
mordred | pabelanger: I have some thoughts on that - I'll write them up for folkses | 22:59 |
jeblair | oh, wait, this has broken static node log streaming hasn't it? | 23:00 |
jeblair | or has it? | 23:00 |
mordred | shouldn't have - it will have broken cleaning up lurking log files | 23:00 |
jeblair | ah, gotcha | 23:00 |
pabelanger | ya, we just ensure /tmp/console.log was purged before | 23:00 |
pabelanger | we could wildcard it moving forward | 23:01 |
jeblair | pabelanger: not a bad idea | 23:01 |
mordred | ++ | 23:02 |
mordred | jeblair, pabelanger: we could also send a "we're done, cleanup after yourself" to the on-node console streamer | 23:03 |
jeblair | ya | 23:03 |
pabelanger | agree | 23:03 |
pabelanger | tmpreaper would also work I think | 23:04 |
mordred | I actually started poking at "cleanup" the other day, but then got lost in the "refactor these two to be the same code" | 23:04 |
jeblair | pabelanger: yeah, but tmpreaper is externalizing the cost of zuul ops onto the sysadmin | 23:04 |
pabelanger | finger is still working here :D | 23:04 |
jeblair | better for us to clean up ourselves | 23:04 |
pabelanger | agree | 23:04 |
jeblair | finger b82946f7a81a496bbfa45451606c34b2@ze01.openstack.org | 23:04 |
mordred | I'll pull that thought back up and see if I can't get y'all a patch on monday | 23:05 |
pabelanger | Oh, ah. we hit ffi missing dependency | 23:05 |
jeblair | Shrews: fingering is happening ^ :) | 23:05 |
mordred | and WORKING | 23:05 |
pabelanger | ya | 23:05 |
pabelanger | Nice, it is now logging our rsync attempts too | 23:06 |
clarkb | pabelanger: and censoring them | 23:06 |
pabelanger | Oh, this is just pulling logs to executor | 23:06 |
mordred | jeblair: what do you think of zuul itself doing a socket connection to each node on port 19885 and sending a cleanup command before it returns the nodes? | 23:06 |
mordred | jeblair: we could also make it a thing in the base job's post playbook | 23:07 |
jeblair | mordred: i think 19885 has to be read-only; so something in the playbook which logs into the host and sends a signal would be better | 23:07 |
clarkb | mordred: could you use an at exit type construct for the service instead? I assume we expect the process to die | 23:07 |
mordred | jeblair: kk | 23:07 |
mordred | clarkb: no - nothing kills the process currently | 23:07 |
pabelanger | guess we need to also publish properly to logs.o.o now too | 23:08 |
jeblair | (we need it to continue running at least until the executor has finished streaming from it) | 23:08 |
jeblair | (which is slightly *after* the thing it is running is finished) | 23:08 |
mordred | jeblair: ok - maybe zuul_console should return the PID of the child it spawns | 23:08 |
mordred | jeblair: and then we can run a post-playbook in the base job that signals that pid and tells it to shut down | 23:08 |
jeblair | mordred: considering we just said "we're only going to wait 30 seconds for the stream to catch up", we could probably have the streamer wait 40 seconds, then clean up and exit. | 23:09 |
mordred | jeblair: 40 seconds after what? | 23:09 |
mordred | it doesn't know when the last task will have been performed | 23:09 |
mordred | (there is a connect-disconnect per task now) | 23:09 |
clarkb | mordred: will it get a signal when the parent dies (parents get a signal when children die) | 23:10 |
jeblair | mordred: oh, right... hrm. | 23:10 |
mordred | clarkb: nope | 23:10 |
clarkb | maybe we can make parent explicitly send a signal? | 23:10 |
mordred | clarkb: the parent is LONG gone on purpose | 23:10 |
jeblair | mordred: pid/signal sounds best so far. | 23:10 |
clarkb | well whatever ends the job | 23:10 |
mordred | clarkb: the first task of the first pre-playbook is "run zuul_console" | 23:10 |
clarkb | can look up pid based on socket ownership or fd | 23:10 |
mordred | clarkb: ooh - TIL ... | 23:11 |
mordred | clarkb: what's the best way to look up a pid based on socket ownership? | 23:11 |
clarkb | that's a good question :) I know it's possible, but not sure of the best way | 23:11 |
mordred | heh :) | 23:12 |
clarkb | mordred: looks like psutil | 23:12 |
mordred | clarkb: sudo ss -lptn 'sport = :19885' | 23:13 |
clarkb | you can iterate over processes and for each process you can list the files | 23:13 |
mordred | or sudo netstat -nlp | grep :19885 | 23:13 |
clarkb | ya or lsof | 23:13 |
clarkb | but I am assuming you want to do it from python right? | 23:13 |
mordred | no - it'll be an ansible task | 23:13 |
clarkb | I think psutil may be easiest there | 23:13 |
clarkb | ah | 23:13 |
mordred | because the only thing that will have context to know things are done is the base job's post playbook | 23:14 |
mordred | clarkb: I don't have psutil installed anywhere | 23:14 |
clarkb | mordred: it's a python module | 23:14 |
mordred | oh - duh | 23:14 |
clarkb | I think it's from pypi | 23:14 |
mordred | yah - let's see if we can get this without needing to install stuff | 23:15 |
clarkb | https://pypi.python.org/pypi/psutil | 23:15 |
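For reference, a hedged sketch of the psutil route clarkb describes (iterate over processes and inspect each one's sockets). It assumes psutil is importable on the node, which is exactly the objection that pushes the conversation toward stock tools next; the function name and port constant are illustrative only:

    import psutil

    CONSOLE_PORT = 19885  # the zuul_console streaming port discussed above

    def find_console_pid(port=CONSOLE_PORT):
        # Walk all processes and look at their TCP sockets; without root
        # this only resolves processes owned by the same user, which is
        # the case that matters here.
        for proc in psutil.process_iter():
            try:
                for conn in proc.connections(kind='tcp'):
                    if conn.status == psutil.CONN_LISTEN and conn.laddr[1] == port:
                        return proc.pid
            except (psutil.AccessDenied, psutil.NoSuchProcess):
                continue
        return None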
clarkb | ya ss/netstat/lsof likely to work fine. Just not sure how consistently they are installed in various places | 23:15 |
clarkb | I seem to recall needing to install lsof on centos | 23:15 |
mordred | netstat is pretty much always around, right? | 23:16 |
clarkb | it is in /bin for me | 23:17 |
clarkb | which I think means yes at least for suse | 23:17 |
clarkb | mordred: looks like netstat is being killed as part of the switch to ip and ss is the new command | 23:17 |
clarkb | so your original one is likely best | 23:18 |
mordred | kill $(netstat -nlp | grep :19885 | awk '{print $7}' | cut -f1 -d/) | 23:18 |
mordred | lovely | 23:18 |
mordred | ss has the hardest output to parse | 23:18 |
clarkb | of course | 23:18 |
mordred | root@ubuntu-xenial-rax-ord-9191322:~# ss -lptn 'sport = :19885' | 23:18 |
mordred | State Recv-Q Send-Q Local Address:Port Peer Address:Port | 23:19 |
mordred | LISTEN 0 5 :::19885 :::* users:(("python",pid=3440,fd=3)) | 23:19 |
mordred | SIGH | 23:19 |
clarkb | oh that's not too bad, give me one sec | 23:19 |
jeblair | mordred: hah, i thought SIGH was a State. | 23:19 |
clarkb | | sed -n -e 's/.*,\(pid=[0-9]\+\),/\1/p' | 23:20 |
* jeblair renames nodepool states... SIGH. JUST_GET_IT_OVER_WITH. WHATS_THE_HOLDUP. | 23:20 | |
clarkb | SIGH has value 256 | 23:21 |
mordred | clarkb: that gets me pid=3440fd=3)) | 23:21 |
* clarkb tests more locally | 23:21 | |
clarkb | oh right you just want pid | 23:22 |
clarkb | | sed -n -e 's/.*,pid=\([0-9]\+\),.*/\1/p' | 23:22 |
mordred | yes | 23:22 |
mordred | thank you | 23:22 |
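Assembled, mordred's ss command and clarkb's sed expression amount to something like the following in Python, using only the stdlib. This is a hedged illustration of the kill path being discussed, not the actual change, and the helper name is made up; as the discussion that follows confirms, it works without root as long as the listener belongs to the same user:

    import os
    import re
    import signal
    import subprocess

    CONSOLE_PORT = 19885

    def kill_console(port=CONSOLE_PORT):
        # ss prints e.g. users:(("python",pid=3440,fd=3)) for the listener
        out = subprocess.check_output(
            ['ss', '-lptn', 'sport = :%d' % port]).decode()
        match = re.search(r'pid=(\d+)', out)
        if match:
            os.kill(int(match.group(1)), signal.SIGTERM)
            return True
        return False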
clarkb | not to derail anything but can you imagine sorting all this out for windows too? | 23:23 |
pabelanger | jeblair: are we expecting the current jobs to get killed once job timeout happens? | 23:24 |
mordred | oy | 23:25 |
mordred | clarkb: well - this isn't going to work actually - because you have to be root to look up process by port | 23:25 |
clarkb | mordred: even if that process is sharing a user with you? | 23:26 |
mordred | hrm. maybe? lemme try | 23:26 |
* clarkb checks /proc | 23:26 | |
jeblair | pabelanger: yes | 23:26 |
clarkb | mordred: things are owned by you in /proc so if the utility does best effort it should work | 23:26 |
jeblair | pabelanger: i'm inclined to just let that happen and check back later | 23:26 |
clarkb | but if it just bails out when not root then ya, it won't work | 23:26 |
mordred | clarkb: ok. cool. you're right | 23:26 |
mordred | it works | 23:26 |
clarkb | cool | 23:27 |
Shrews | Yay finger things being useful | 23:27 |
clarkb | mordred: you definitely won't be able to do it for arbitrary users as not root though | 23:27 |
pabelanger | jeblair: agree, happy to see what happens | 23:27 |
mordred | clarkb, jeblair, pabelanger: https://review.openstack.org/472866 | 23:28 |
jeblair | mordred: is that going to put us in a catch-22? it will create a log file that the callback plugin will want to stream? | 23:29 |
mordred | jeblair: I was just about to say that | 23:29 |
mordred | jeblair: I'll workon that next :) | 23:29 |
mordred | jeblair: can probably write it as an option to the zuul_console module actually - so that we can just call "zuul_console: state=absent" | 23:30 |
jeblair | mordred: interestingly, that's a DoS that your 30s timeout patch will prevent :) | 23:30 |
mordred | jeblair: and have the zuul_console python module that gets copied over (and that does not log to the shell log) do the kill in that case | 23:30 |
jeblair | mordred: ++ | 23:31 |
mordred | it'll also keep the logic in a zuul file and have the base job be easy and symmetrical for others | 23:31 |
clarkb | mordred: that would be python then? | 23:33 |
clarkb | probably still don't want to rely on psutil? | 23:33 |
jeblair | i think it could be problematic to use something not in the stdlib | 23:39 |
clarkb | in that case still possible to iterate through /proc/$piddirsownedbyouruser/fd | 23:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Add shutdown option for zuul_console https://review.openstack.org/472867 | 23:43 |
mordred | clarkb, jeblair: ^^ how's that look? | 23:43 |
clarkb | that looks fine, I'm just trying to find a sane way to make it all python without the subprocess | 23:46 |
mordred | I welcome that | 23:46 |
mordred | I'm gonna push up a quick update adding a comment | 23:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Add shutdown option for zuul_console https://review.openstack.org/472867 | 23:47 |
mordred | clarkb, jeblair: it is time for me to EOD - but I think that should at least put a placeholder in the approach we can use there | 23:51 |
jeblair | mordred: lgtm; ++ to proc if clarkb works it out. | 23:51 |
clarkb | mordred: https://github.com/giampaolo/psutil/blob/0a6953cfd59009b422b808b2c59e37077c0bdcb1/psutil/_pslinux.py#L1870 ya that does what I describe using /proc | 23:51 |
mordred | clarkb: so we could potentially copypasta some and be not 100% terrible | 23:51 |
jeblair | bsd licensed, should be ok | 23:52 |
clarkb | so I think the general process is: list /proc, filter by process dirs owned by us, for each process dir readlink fd/*, and if one matches the tcp port then return that pid | 23:52 |
clarkb | yup licensing should be fine and I think we can mostly use that implementation too | 23:52 |
mordred | cool | 23:52 |
mordred | clarkb: if you get a while desire to do that before I do and update the change, I will not be offended | 23:52 |
mordred | s/while/wild/ | 23:53 |
clarkb | though looking in my /proc we may need to do a second lookup because I get things like 3 -> socket:[11248846] in fd | 23:53 |
jeblair | clarkb: fdinfo? | 23:53 |
clarkb | that gives me pos, flags, and mnt_id. Apparently that number in [] is an inode | 23:54 |
pabelanger | we seem to be getting a deprecation warning about commas as lists, but not sure where that is coming from just yet | 23:54 |
jeblair | mordred: for monday: http://paste.openstack.org/show/612171/ | 23:56 |
clarkb | jeblair: mordred you read /proc/net/tcp which gives you all the connections in hex-encoded tabular form | 23:56 |
pabelanger | warning and error seem to go hand in hand: http://paste.openstack.org/show/612172/ | 23:57 |
jeblair | clarkb: is pid in there? | 23:57 |
clarkb | so I think what we actually do is read /proc/net/tcp first to find the inode for the socket on the correct port. Then look for that inode to find the process | 23:57 |
jeblair | clarkb: oh gotcha | 23:58 |
clarkb | jeblair: it doesn't look like pid is in there :) that would be too easy | 23:58 |
clarkb | sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode <- are the fields | 23:58 |
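Putting clarkb's two steps together, a rough stdlib-only sketch (no psutil): first pull the socket inode out of /proc/net/tcp, then scan our own /proc/<pid>/fd entries for it. The names are illustrative and this is not the implementation that ultimately landed:

    import os

    CONSOLE_PORT = 19885

    def console_socket_inode(port=CONSOLE_PORT):
        # The ss paste above shows the listener on the IPv6 wildcard
        # (":::19885"), so check the tcp6 table as well as tcp.
        for table in ('/proc/net/tcp', '/proc/net/tcp6'):
            with open(table) as f:
                next(f)  # skip the header line quoted above
                for line in f:
                    fields = line.split()
                    local_port = int(fields[1].rsplit(':', 1)[1], 16)
                    if fields[3] == '0A' and local_port == port:  # 0A == LISTEN
                        return fields[9]  # the inode column
        return None

    def pid_for_inode(inode):
        # fd symlinks are only readable for our own processes (or as root),
        # which matches the earlier discussion.
        target = 'socket:[%s]' % inode
        for pid in filter(str.isdigit, os.listdir('/proc')):
            fd_dir = '/proc/%s/fd' % pid
            try:
                for fd in os.listdir(fd_dir):
                    if os.readlink(os.path.join(fd_dir, fd)) == target:
                        return int(pid)
            except OSError:
                continue  # process exited or belongs to another user
        return None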
jeblair | pabelanger: are you sure those are related? | 23:58 |
jeblair | pabelanger: i mean, it's outputting the comma warning on every invocation. i would expect it to *also* show up in invocations with the log error. | 23:59 |