*** hashar has quit IRC | 00:18 | |
openstackgerrit | John L. Villalovos proposed openstack-infra/zuul master: Only depend-on open changes https://review.openstack.org/254957 | 00:20 |
*** jamielennox is now known as jamielennox|away | 00:36 | |
*** jamielennox|away is now known as jamielennox | 00:37 | |
openstackgerrit | Merged openstack-infra/zuul master: Only depend-on open changes https://review.openstack.org/254957 | 00:40 |
*** saneax is now known as saneax-_-|AFK | 00:55 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Improve job dependencies using graph instead of tree https://review.openstack.org/443973 | 01:26 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update test config for job graph syntax https://review.openstack.org/444055 | 01:26 |
jeblair | that's an auto-generated update to the test config files. | 01:27 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Support for github commit status https://review.openstack.org/444060 | 01:47 |
jamielennox | in zuul 2.5 is there a way from the CLI to stop a job and mark it failed | 03:24 |
*** saneax-_-|AFK is now known as saneax | 04:09 | |
*** saneax is now known as saneax-_-|AFK | 05:51 | |
*** saneax-_-|AFK is now known as saneax | 05:59 | |
*** saneax is now known as saneax-_-|AFK | 06:27 | |
*** bhavik1 has joined #zuul | 07:16 | |
*** saneax-_-|AFK is now known as saneax | 07:20 | |
*** Cibo has joined #zuul | 07:42 | |
*** Cibo_ has quit IRC | 07:45 | |
*** Cibo has quit IRC | 07:46 | |
*** dmsimard has quit IRC | 08:12 | |
*** bhavik1 has quit IRC | 08:14 | |
*** dmsimard has joined #zuul | 08:35 | |
*** hashar has joined #zuul | 10:05 | |
*** Cibo has joined #zuul | 10:17 | |
*** Cibo has quit IRC | 10:22 | |
*** openstackgerrit has quit IRC | 10:33 | |
*** bhavik1 has joined #zuul | 10:41 | |
*** bhavik1 has quit IRC | 11:02 | |
*** saneax is now known as saneax-_-|AFK | 11:28 | |
*** Cibo has joined #zuul | 12:29 | |
*** mptacekx has joined #zuul | 12:30 | |
mptacekx | Hi zuul, can someone please suggest good nodepool version for zuul v2.5.x ? | 12:31 |
*** hashar is now known as rahsah | 12:32 | |
mptacekx | we tried latest nodepool 0.4.0 but some strange behavior is seen, like from time to time VMs building / deleting in never-ending loops ... | 12:32 |
*** Cibo has quit IRC | 12:34 | |
*** yolanda has quit IRC | 12:46 | |
*** yolanda has joined #zuul | 13:06 | |
pabelanger | jamielennox: no CLI command | 13:12 |
pabelanger | mptacekx: we are using 0.4.0 in production for openstack, what do the logs say? | 13:12 |
pabelanger | sounds like your DIBs are failing to build, they get deleted, and start again | 13:13 |
mptacekx | pabelanger: actually it's about VMs, we noticed that from time to time they are in building state but have already been available and manually accessible for several minutes. sometimes it helps if I connect to them manually, then nodepool also passes its check or the VM goes to delete state, so some change is triggered | 13:15 |
pabelanger | mptacekx: check the debug log for nodepool, it will say why it is deleting the VMs | 13:17 |
mptacekx | pabelanger: usually it fails on OpenStackCloudTimeout: Timeout waiting for the server to come up. | 13:19 |
mptacekx | , occasionally on Exception: Unable to run ready script | 13:19 |
mptacekx | how often nodepool should be checking that ? | 13:20 |
pabelanger | mptacekx: so you are hitting 2 issues, first is you fail to boot a VM on your cloud. For that, you'll need to debug the cloud side and see why that is, could be lack of IPs, resources, etc. | 13:21 |
pabelanger | as for ready-script, if you have nodepool setup to use it, it will be run for every node launch. Again, you'll have to debug the output of your ready-script to see what it is doing, common issues are networking to the VM and possible DNS issues from the VM to internet | 13:23 |
mptacekx | pabelanger: I think it's a nodepool issue, the VM is spawned properly and I can access it myself, but nodepool can't. It's not a functional problem blocking all attempts, it happens e.g. on 1 of 5 VMs spawned in parallel | 13:24 |
*** Cibo_ has joined #zuul | 13:24 | |
pabelanger | mptacekx: do you mind posting your debug logs? That will tell us if it is nodepool or cloud | 13:25 |
pabelanger | it is possible you might need to update your clouds.yaml file | 13:25 |
*** hashar has joined #zuul | 13:26 | |
pabelanger | also, nodepool is hard on clouds, so launching 5 VMs might be an issue too. Easy way to test that, is limit jobs to a single VM to start | 13:26 |
mptacekx | this might help a lot, do you know how to do that ? I think it's simply failing when too many attempts are in parallel | 13:27 |
pabelanger | mptacekx: max-servers setting | 13:27 |
*** Cibo_ has quit IRC | 13:28 | |
mptacekx | pabelanger: this will just set a hard limit for the number of servers, I thought there might be some way to tell nodepool not to spawn 5 VMs at once | 13:29 |
*** yolanda has quit IRC | 13:29 | |
mptacekx | or are you suggesting to slowly increase max-servers limits to avoid that ? | 13:29 |
*** yolanda has joined #zuul | 13:30 | |
pabelanger | right, I would decrease max-servers now, to see if you are having cloud issues. | 13:30 |
pabelanger | but yes, it is a hard limit for the cloud | 13:30 |
pabelanger | otherwise, just have nodepool keep doing what is does, it is pretty aggressive about booting nodes. Eventually it will boot something :) | 13:31 |
mptacekx | thanks, I will explore that option. There is definitely something wrong in the cloud but I thought there was a second issue in nodepool itself, since as I mentioned it times out accessing a VM which is normally accessible. Is there any way to debug that apart from the logs in /var/log/nodepool/* ? | 13:33 |
mptacekx | All I have is nodepool timeout in that logs | 13:33 |
pabelanger | mptacekx: we also have a rate setting, you could try playing with. Time, in seconds, to wait between operations for a provider | 13:33 |
mptacekx | pabelanger: rate setting ? please elaborate a little more | 13:34 |
pabelanger | mptacekx: that to me sounds like a networking issue, once the VM is online, nodepool cannot SSH into it. | 13:34 |
pabelanger | mptacekx: see: https://docs.openstack.org/infra/nodepool/configuration.html#provider | 13:35 |
mptacekx | how often nodepool is trying ? each min ? | 13:35 |
pabelanger | trying what? | 13:35 |
mptacekx | vm to pass ssh check and turn from build to ready state | 13:35 |
pabelanger | look at boot-timeout, I believe the default is 60 secs | 13:36 |
mptacekx | thanks a lot, I will play with all of that stuff | 13:37 |
pabelanger | sure, np | 13:37 |
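For reference, the three knobs discussed above (max-servers, rate, boot-timeout) are all per-provider options in nodepool.yaml. A hedged fragment for the nodepool 0.4.x era, with an illustrative provider name and values:

```yaml
providers:
  - name: mycloud        # illustrative provider name
    max-servers: 1       # hard cap; lower it to test cloud health
    rate: 1.0            # seconds to wait between API operations
    boot-timeout: 120    # seconds to wait for the server to come up
```

Lowering max-servers to 1, as pabelanger suggests, isolates whether a single boot succeeds before parallel launches are reintroduced.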
dmsimard | So I never (ever) wrote a spec before but I took a shot at writing one for "a job reporting interface" in Zuul -- hope it makes sense: https://review.openstack.org/#/c/444088/ | 13:48 |
*** openstackgerrit has joined #zuul | 14:13 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Add reporter for Federated Message Bus (fedmsg) https://review.openstack.org/426861 | 14:13 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Re-enable test_alien_list_fail and alien_list cmd https://review.openstack.org/443714 | 14:58 |
mordred | dmsimard: nice | 15:02 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Add 'requestor' to NodeRequest model https://review.openstack.org/443151 | 15:07 |
jeblair | dmsimard: thanks, that looks good. i'll reply with some comments soon. can you point me at an ara server install i can browse? | 15:08 |
dmsimard | jeblair: ara is available in openstack-ansible and kolla-ansible jobs already -- let me give you a few examples | 15:08 |
jeblair | dmsimard: no i meant one running on a server, not the statically generated site | 15:09 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Add back statsd reporting https://review.openstack.org/443605 | 15:09 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Remove old/dead classes https://review.openstack.org/443644 | 15:09 |
dmsimard | jeblair: sure, but there's no difference, though | 15:09 |
dmsimard | there's no disparity in features | 15:09 |
jeblair | dmsimard: well, i wondered what you see when you go do "http://ara.example.com/" :) | 15:10 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Remove job_list, job_create, job_delete cmds/tests https://review.openstack.org/444344 | 15:10 |
dmsimard | jeblair: I mean, there's http://ara-demo.dmsimard.com/ | 15:10 |
dmsimard | although hang on, that one isn't up to date | 15:10 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Add leaked instance cleanup https://review.openstack.org/443690 | 15:11 |
jeblair | mordred: can you do a pass of zuul changes please? some of mine have been sitting for a few days. | 15:12 |
jeblair | mordred: also, feedback on 443973 and 443985 would be appreciated | 15:13 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool feature/zuulv3: Stop json-encoding the nodepool metadata https://review.openstack.org/410812 | 15:16 |
mordred | jeblair: yup - was just doing the walk on the nodepool changes | 15:16 |
jeblair | mordred: w00t | 15:17 |
*** mptacekx has quit IRC | 15:17 | |
dmsimard | jeblair: okay, I updated http://ara-demo.dmsimard.com/ to the latest version (there had been some bugfixes/improvements since I last updated it) | 15:17 |
mordred | jeblair: gah. I could have sworn I reviewed some of these already | 15:22 |
jeblair | mordred, Shrews: in 410812 i wonder if, instead of removing snapshot_image_id, we should update it to record the zookeeper image upload id? that's what it actually is in the current version (recall 'snapshot ~= upload`) | 15:24 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Re-enable test_image_upload_fail https://review.openstack.org/444349 | 15:24 |
jeblair | mordred, Shrews: otoh, since there's an arbitrary limit on the number of metadata fields, perhaps we should drop it. | 15:27 |
pabelanger | Shrews: yay for statsd things | 15:27 |
Shrews | jeblair: i had a similar question on node_id. i don't currently populate it. should we? | 15:27 |
Shrews | pabelanger: i'm sure we'll need to adjust the keys | 15:28 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add a test for a broken config on startup https://review.openstack.org/441499 | 15:31 |
pabelanger | Shrews: we can do something testing this morning, if you like. Get some data into graphite.o.o then update dashboards as needed | 15:31 |
Shrews | pabelanger: i'll leave that for you if you want. i'm going to re-enable all the things, then FINALLY remove mysql \o/ | 15:32 |
pabelanger | Shrews: sure, I'll see if the code has landed and restart nl01.o.o | 15:33 |
pabelanger | also, Yay for database removal | 15:33 |
jeblair | Shrews: the old leak algorithm used it to look up the node record to see if it's known. you switched to scanning all of the nodes and checking server id. if we switched back to the old method, we would end up retrieving fewer znode records (since we would only pull the ones for our provider). if we want to stick with the full scan, we can remove it. | 15:33 |
jeblair | pabelanger: please don't have new nodepool send stats to statsd if it has the same provider names as production nodepool. | 15:34 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Re-enable test_nodepool_osc_config_reload https://review.openstack.org/444356 | 15:34 |
jeblair | pabelanger: we'll end up overwriting our production data. | 15:34 |
jeblair | pabelanger: as a solution, we can either adopt the clarkb method of using different provider names (and performing extra image uploads), or you can configure a statsd prefix so all the new stuff goes to a different place | 15:36 |
Shrews | jeblair: hrm, lemme think that one over. checking external_id does seem safer, but using the node id would be more efficient. | 15:36 |
pabelanger | jeblair: Ah, right. good call. | 15:36 |
pabelanger | let me see how to configure a prefix | 15:36 |
jeblair | pabelanger: 'grep -i statsd_prefix' i think will turn up some things | 15:37 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Re-enable TestWebApp tests https://review.openstack.org/444358 | 15:42 |
pabelanger | jeblair: looks like export STATSD_PREFIX is a thing | 15:44 |
pabelanger | reading up more now | 15:44 |
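The statsd client nodepool uses reads its configuration from environment variables, so a distinct prefix keeps the v3 launcher's keys away from the production namespace, as jeblair asks. A hedged example (host name illustrative):

```shell
# Environment read by the statsd client before starting the launcher.
export STATSD_HOST=graphite.example.com   # illustrative host
export STATSD_PORT=8125
export STATSD_PREFIX=nodepool-v3          # keys land under nodepool-v3.*
```

With the prefix set, metrics from the v3 launcher are written under a separate subtree even when provider names match production.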
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Provide file locations of config syntax errors https://review.openstack.org/441606 | 15:54 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Clarify Job/Build/BuildSet docstrings https://review.openstack.org/435948 | 15:56 |
mordred | jeblair: the job graph change looks good to me - other than not passing tests - but who needs tests | 16:00 |
mordred | jeblair: +1 on the spec change | 16:02 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool feature/zuulv3: Rename osc to occ in tests https://review.openstack.org/444383 | 16:05 |
jeblair | mordred: cool, i've just about finished cleaning up the tests | 16:06 |
*** yolanda has quit IRC | 16:06 | |
mordred | jeblair: awesome | 16:06 |
mordred | Shrews: ^^ I just submitted a meaningless followup to one of your patches | 16:06 |
*** yolanda has joined #zuul | 16:06 | |
pabelanger | mordred: soooo, how much longer until we can finger zuulv3 :) | 16:07 |
mordred | pabelanger: oh right. I need to finish those patches | 16:07 |
Shrews | mordred: stellar | 16:07 |
pabelanger | mordred: I'm happy to help too | 16:08 |
*** yolanda has quit IRC | 16:11 | |
Shrews | hrm... why is this node failing and locked? http://logs.openstack.org/44/444344/1/check/gate-dsvm-nodepool/7ab80aa/console.html#_2017-03-10_16_05_48_329818 | 16:16 |
*** yolanda has joined #zuul | 16:18 | |
Shrews | ok, this should not be happening: http://logs.openstack.org/44/444344/1/check/gate-dsvm-nodepool/7ab80aa/logs/screen-nodepool.txt.gz#_2017-03-10_15_48_15_721 | 16:19 |
mordred | Shrews: I blame jaypipes | 16:20 |
Shrews | mordred: how random | 16:23 |
Shrews | pabelanger: you may want to hold off on updating nl01 until we investigate this odd failure | 16:23 |
pabelanger | Shrews: ack | 16:24 |
Shrews | and i'm a bit stumped atm | 16:24 |
Shrews | oh! i know. it doesn't have an external id yet, so it's racing | 16:25 |
Shrews | wheeeeee | 16:26 |
*** yolanda has quit IRC | 16:26 | |
Shrews | i blame EVERYONE else for not catching my mistake | 16:26 |
*** yolanda has joined #zuul | 16:27 | |
Shrews | i think we'll have to do jeblair's suggestion to store node_id in metadata and check that against ZK | 16:27 |
pabelanger | Shrews: we should also think about merging master into feature/zuulv3 for nodepool. Will pick up some testing fixes specifically. eg: we shouldn't be building fedora-25 for the dsvm job | 16:28 |
mordred | pabelanger: with the deletions Shrews has been doing - it might be easier at this point to cherry-pick relevant testing fixes | 16:28 |
mordred | pabelanger: but it's worth a try for sure | 16:29 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Improve job dependencies using graph instead of tree https://review.openstack.org/443973 | 16:29 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update test config for job graph syntax https://review.openstack.org/444055 | 16:29 |
pabelanger | mordred: agree | 16:29 |
Shrews | also, we could skip instances whose state is BUILDING | 16:30 |
jeblair | Shrews: aha. i don't recall if that's the reason we did that originally or not. it may be. at any rate, when you make that change, it's probably worth a comment so that we don't have to learn this lesson (yet?) again. :) | 16:33 |
mordred | jeblair: the best lessons are the lessons you learn over and over again | 16:34 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Fix fedora 25 pause bug with devstack https://review.openstack.org/444400 | 16:34 |
pabelanger | mordred: Shrews: we actually just need ^ | 16:34 |
mordred | ++ | 16:35 |
Shrews | jeblair: k | 16:35 |
Shrews | Got a lunch appointment now, so will put up a fix when I return | 16:36 |
mordred | Shrews: silly food eating | 16:36 |
Shrews | mordred: it is needed energy-prep for tonight's game | 16:37 |
mordred | Shrews: yeah. good point | 16:38 |
mordred | I need to find more energy for that myself | 16:38 |
mordred | since each team won on their home court so far - I guess the question is - is NYC truly a second home court for Duke? | 16:38 |
jeblair | pabelanger: tobiash_ has some good comments on 438281. since that's what people seem to favor, do you want to address those and we can move the tox jobs along? | 16:41 |
jeblair | mordred: it's just like duke to have a second home in ny. | 16:42 |
mordred | jeblair: where else would they store all of their i-banker alums? | 16:48 |
*** Cibo_ has joined #zuul | 16:50 | |
jeblair | mordred, tobiash_: 443976 (graph) + 444055 (test config file updates) pass tests when combined together now. | 16:52 |
pabelanger | jeblair: toabctl: thanks, left reply / question. | 16:52 |
jeblair | so i think i'd like to get some provisional reviews on the first one and the spec, and then we're ready to merge, i'll squash them. | 16:53 |
jeblair | *when* we're ready to merge, that is | 16:53 |
mordred | jeblair: fwiw, I do not think squashing those two will make review harder - since there is only one file shared between the two | 16:59 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Add generic tox job (multiple playbooks) https://review.openstack.org/438281 | 17:02 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Add generic tox job (multiple playbooks) https://review.openstack.org/438281 | 17:03 |
jeblair | mordred: true... though there are a lot of files in the second one. | 17:03 |
mordred | that is true. there are a lot of files in the second one | 17:03 |
jeblair | pabelanger: are we going to fix the 'all' problem? | 17:04 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Add generic tox job (multiple playbooks) https://review.openstack.org/438281 | 17:04 |
*** Cibo_ has quit IRC | 17:05 | |
jeblair | pabelanger: looks like it. :) some of those still have xenial though | 17:05 |
pabelanger | jeblair: yes, let me find the error message | 17:05 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Add generic tox job (multiple playbooks) https://review.openstack.org/438281 | 17:06 |
pabelanger | actually, ^ should expose the problem in ansible log | 17:06 |
pabelanger | will be interesting moving forward on the tox jobs, my thoughts about setting hosts: ubuntu-xenial in the playbook, means a little more control which hosts jobs run on. I think its possible now, for a project to override the nodeset for the tox-py27 job and run on centos-7 lets say | 17:09 |
pabelanger | I guess the same is true that projects can redefine the job and use a new playbook | 17:09 |
jeblair | pabelanger: we can set "final: true" on any job where we don't want people to override things like nodesets | 17:10 |
pabelanger | great | 17:11 |
jeblair | (though, in general, zuulv3 is a bit more trusting that people want to do the right thing) | 17:11 |
pabelanger | ya | 17:12 |
pabelanger | either way, Yay, we have playbooks | 17:12 |
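jeblair's "final: true" suggestion, in the Zuul v3 job syntax as it later settled, looks roughly like this (job and label names illustrative):

```yaml
- job:
    name: tox-py27
    final: true            # variants/projects may not override attributes
    nodeset:
      nodes:
        - name: test-node
          label: ubuntu-xenial
```

A project attempting to override the nodeset of a final job would get a configuration error instead of silently running on a different platform.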
*** hashar has quit IRC | 17:20 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Remove allocator https://review.openstack.org/444425 | 17:23 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Remove jenkins_manager https://review.openstack.org/444427 | 17:25 |
jeblair | Shrews: approved https://review.openstack.org/443714 with comments | 17:26 |
pabelanger | +3 on deletes too | 17:28 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Re-enable test_alien_list_fail and alien_list cmd https://review.openstack.org/443714 | 17:29 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Remove job_list, job_create, job_delete cmds/tests https://review.openstack.org/444344 | 17:31 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Re-enable test_image_upload_fail https://review.openstack.org/444349 | 17:32 |
pabelanger | jeblair: Shrews: we have 5 locked ready nodes on nl01.o.o, do you mind looking? | 17:32 |
jeblair | sure thing | 17:32 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Re-enable test_nodepool_osc_config_reload https://review.openstack.org/444356 | 17:33 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Re-enable TestWebApp tests https://review.openstack.org/444358 | 17:33 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Rename osc to occ in tests https://review.openstack.org/444383 | 17:34 |
jeblair | pabelanger: http://paste.openstack.org/show/602292/ | 17:35 |
jeblair | pabelanger: the dump command has a section where it shows you what sessions hold ephemeral nodes | 17:35 |
jeblair | pabelanger: you can see that all the node locks are held by session 0x15aa4882759000c | 17:35 |
jeblair | pabelanger: and conveniently, that session also holds this node: /nodepool/launchers/nl01-5544-ProviderWorker.infracloud-chocolate | 17:37 |
jeblair | pabelanger: we know that launchers, when they come online, create an ephemeral node with their name so they know which other launchers are active | 17:37 |
jeblair | pabelanger: but it's convenient for us because we can see that it is the launcher which holds those node locks, without having to track down that session id some other way :) | 17:37 |
jeblair | pabelanger: it rather looks like the launcher has frozen? | 17:38 |
jeblair | pabelanger: i don't see any log entries for the past 30 mins | 17:38 |
mordred | jeblair: I think launchers freezing sounds unfun | 17:38 |
jeblair | (btw, we should rename nodepoold to nodepool-launcher) | 17:39 |
mordred | jeblair: ++ | 17:39 |
jeblair | i'm going to sigusr2 it to get a stack dump (i hope) | 17:39 |
jeblair | yay that worked | 17:39 |
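The SIGUSR2 trick jeblair uses relies on the daemon installing a handler that dumps every thread's stack on demand. A minimal hedged sketch of such a handler (not the actual nodepool code):

```python
import signal
import sys
import traceback


def stack_dump_handler(signum, frame):
    """Write the current stack of every thread to stderr.

    Roughly what a SIGUSR2 debug handler does; this is an
    illustrative sketch, not nodepool's implementation.
    """
    for thread_id, stack in sys._current_frames().items():
        sys.stderr.write("Thread: %s\n" % thread_id)
        traceback.print_stack(stack, file=sys.stderr)


# Install the handler; trigger it externally with: kill -USR2 <pid>
signal.signal(signal.SIGUSR2, stack_dump_handler)
```

Because the handler only reads frames and writes to stderr, it is safe to fire against a wedged process, which is exactly what makes it useful for diagnosing a frozen launcher.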
jeblair | okay, first of all -- did we port over that paramiko fix to v3? :) | 17:40 |
pabelanger | not sure | 17:41 |
* jeblair sorts relevant/irrelevant threads | 17:41 | |
SpamapS | mordred: jeblair fyi, I'm about to start writing a launcher security spec in earnest. I hope to have 1st draft shortly. If you have points you think we haven't discussed, now's a good time to poke them into my brain. | 17:42 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool feature/zuulv3: Add destructor to SSHClient https://review.openstack.org/444433 | 17:43 |
pabelanger | cherry-pick of paramiko^ | 17:44 |
jeblair | SpamapS: ++. i *think* we hit all the high points in chats at the ptg. | 17:44 |
jeblair | pabelanger: thanks! | 17:44 |
SpamapS | jeblair: me too, just making sure so I can reduce edits. :) | 17:47 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Handle exception edge cases in node launching https://review.openstack.org/444437 | 17:54 |
pabelanger | jeblair: so, back to locks and zookeeper. If I understand, for some reason nodepool-launcher didn't unlock the node. Which is what I see in the logs too | 17:55 |
jeblair | Shrews, pabelanger: i see the problem. the providerworker is responsible for determining when a launch is complete and releasing the node locks. however, it does that *after* starting new launches and, if needed, pausing new launches. the way it pauses is to busy-wait right in the middle of the code path between starting new launches and finalizing completed ones. in other words: it is paused indefinitely, waiting for nodes to be released. ... | 17:58 |
jeblair | ... they never will be because everyone is waiting on it to release them. | 17:58 |
jeblair | Shrews, pabelanger: in other other words, we may need to move the work that happens in _removeCompletedHandlers out of the ProviderWorker thread. | 17:59 |
jeblair | i will work on a test case for this | 17:59 |
mordred | jeblair: nice catch | 18:01 |
pabelanger | I see | 18:02 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Remove allocator https://review.openstack.org/444425 | 18:11 |
Shrews | jeblair: ick | 18:24 |
jeblair | Shrews: i'm making good progress on the test (it will take a while due to the complications you pointed out previously), but i haven't started on a solution yet. | 18:25 |
Shrews | that's such a silly mistake. Can't think of a good solution off the top of my head | 18:28 |
Shrews | maybe instead of waiting to fulfill the node set, short-circuit the fulfillment and release the node set we have, trying again later? | 18:30 |
Shrews | jeblair: also, regarding your comments on 443714... removing the database stuff and unused files was going to be last on my list after re-enabling all tests. | 18:34 |
jeblair | Shrews: that will cause large node requests (where large means >1) to starve at the expense of smaller ones when all providers are near quota | 18:34 |
Shrews | ah yeah, don't want that | 18:35 |
Shrews | pabelanger: oops, i re-enabled the image upload fail test and forgot we needed this: https://review.openstack.org/435481 | 18:39 |
Shrews | mordred: any chance you can +3 https://review.openstack.org/435481 for us? | 18:39 |
mordred | Shrews: yup | 18:46 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Disable CleanupWorker thread for test_image_upload_fail https://review.openstack.org/435481 | 18:52 |
Shrews | jeblair: congratulations. you encountered our first configuration file disagreement test failure: http://logs.openstack.org/27/444427/1/check/nodepool-coverage-ubuntu-xenial/185e0db/console.html#_2017-03-10_17_30_07_610345 | 18:54 |
Shrews | UploadWorker working from the old config, CleanupWorker working from the new | 18:54 |
Shrews | i don't know how to eliminate that totally | 18:55 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Add a failing test of node assignment at quota https://review.openstack.org/444462 | 18:55 |
jeblair | Shrews: ^ | 18:56 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: Merge pull requests from github reporter https://review.openstack.org/444463 | 18:59 |
* SpamapS learning about the dark corners of cgroups, containers, and selinux. | 19:04 | |
Shrews | jeblair: can haz node_quota.yaml file? | 19:04 |
mordred | SpamapS: fun for you! | 19:05 |
jeblair | derp | 19:05 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Add a failing test of node assignment at quota https://review.openstack.org/444462 | 19:06 |
jeblair | Shrews: ^ | 19:06 |
dmsimard | SpamapS: even the bright corners of those are scary, good luck sir | 19:06 |
SpamapS | dmsimard: so true :) | 19:09 |
jeblair | Shrews: i think we may want shared locks on builds for that. get a (shared) read lock on an image to perform an upload, get a write lock on the image to delete it. | 19:15 |
Shrews | jeblair: yeah, well... kazoo doesn't have that | 19:15 |
jeblair | Shrews: https://zookeeper.apache.org/doc/r3.1.2/recipes.html#Shared+Locks | 19:15 |
jeblair | doesn't look too hard | 19:16 |
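The ZooKeeper recipe jeblair links implements shared locks by ordering ephemeral sequence znodes under a lock path. Stripped of ZooKeeper, the read/write semantics he wants (many uploaders holding the shared side at once, a deleter needing exclusive access) look roughly like this in-process sketch (illustrative only; no writer-starvation handling):

```python
import threading


class SharedLock:
    """Readers-writer lock sketch mirroring the shared/exclusive
    semantics discussed for image builds: uploads take the shared
    (read) side, deletion takes the exclusive (write) side.
    Not the ZooKeeper recipe itself, just its semantics."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # uploads wait out a deleter
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:       # last upload done: wake deleter
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()        # deletion excludes everyone
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

In the ZooKeeper version, the same ordering falls out of comparing sequence numbers: a read request only waits on lower-numbered write znodes, while a write request waits on every lower-numbered znode.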
Shrews | jeblair: https://github.com/python-zk/kazoo/pull/306 | 19:16 |
Shrews | at least mordred approved the PR! :) | 19:17 |
jeblair | i read that as mordred volunteering to fix it | 19:17 |
mordred | heh | 19:18 |
jeblair | after lunch i will look at what is required to get that pr into shape | 19:18 |
mordred | jeblair: I think mostly it's just about yelling at harlowja right? | 19:19 |
jeblair | (before lunch, i will try to figure out how to use github) | 19:19 |
harlowja | lol | 19:19 |
harlowja | whatt | 19:19 |
mordred | jeblair: good luck ... it's a pretty shitty ui | 19:19 |
Shrews | harlowja: the kazoo shared locks thing came up again | 19:20 |
mordred | harlowja: we've hit the point where https://github.com/python-zk/kazoo/pull/306 is gonna be important - so jeblair is going to work on it | 19:20 |
harlowja | uh oh | 19:20 |
harlowja | sweet | 19:20 |
pabelanger | I too like to live dangerously | 19:20 |
harlowja | mordred bbangert is more senior in that library than me, so once he's happy i'm happy | 19:20 |
harlowja | then i can click merge | 19:20 |
harlowja | lol | 19:20 |
mordred | jeblair: the fun part is that you can't submit patches to that PR - you'll have to pull those patches into your own repo and put up a completely different pr | 19:20 |
jeblair | i what? | 19:21 |
jeblair | how do i collaborate? | 19:21 |
mordred | jeblair: you don't | 19:21 |
mordred | you fork | 19:21 |
jeblair | but i thought github was about collaborating with people? | 19:21 |
mordred | github is all about celebrating the individual ego | 19:21 |
harlowja | none of that collaboration crap | 19:21 |
harlowja | lol | 19:21 |
jeblair | harlowja: you may get an email with a diff. | 19:21 |
harlowja | not sure if i can update someone else's PR either | 19:21 |
harlowja | lol | 19:21 |
Shrews | jeblair: that seems like a LHF task. i'd personally rather see you on zuul things and maybe get a volunteer for the kazoo patch | 19:22 |
mordred | nope. the only way to update a PR is to push more patches to the branch the PR is a request to merge | 19:22 |
Shrews | jeblair: on the other hand, i'm almost done with nodepool, so.... | 19:22 |
mordred | Shrews offers to jump on the GH grenade | 19:22 |
harlowja | i already jumped on the manifesto grenade | 19:23 |
harlowja | lol | 19:23 |
Shrews | well, i'm hoping to NOT have to... just would rather see jeblair's scarce time better used | 19:23 |
Shrews | doesn't SpamapS have some new resources to point at things like that? ;) | 19:25 |
Shrews | jeblair: mordred: SpamapS: I propose we put this as a topic for the next zuul meeting and see who has the time to see that PR through. Unless Jim just REALLY wants to work on it. | 19:31 |
jeblair | Shrews: i'm curious enough to spend a few minutes on it, but i'll give myself a short timeout. :) | 19:31 |
Shrews | jeblair: fair enough | 19:31 |
*** bhavik1 has joined #zuul | 19:33 | |
jeblair | "test_dirty_sock" | 19:34 |
*** bhavik1 has quit IRC | 19:37 | |
jeblair | Shrews: the only ideas coming to my head so far for the deadlock are: 1) have a new thread-per-provider to handle the nodelauncher poll/cleanup. 2) set an attribute on the providerworker indicating we are paused so we can proceed through the main loop without accepting new requests. | 19:37 |
Shrews | jeblair: yeah. i'm going to experiment with #2 | 19:39 |
Shrews | jeblair: wow. just what timing issues has your https://review.openstack.org/444427 review exposed? New failures each run | 19:43 |
* Shrews tries hard to avoid looking at two problems at once | 19:43 | |
SpamapS | wha hoo? | 19:44 |
SpamapS | interesting | 19:51 |
SpamapS | lxc in its original form is EOL | 19:51 |
SpamapS | lxc 2.0 is just lxd in local-socket-comm-only mode. | 19:51 |
SpamapS | (well technically LXC 1.x is supported until 2019, but that's effectively dead to me :) | 19:51 |
openstackgerrit | K Jonathan Harker proposed openstack-infra/nodepool master: Write per-label nodepool demand info to statsd https://review.openstack.org/246037 | 19:59 |
jeblair | Shrews: that was a removal of a file that isn't used or even loaded by anything, carefully crafted to expose subtle timing errors. apparently. | 20:07 |
SpamapS | https://review.openstack.org/444495 <-- Security spec 1st draft | 20:33 |
SpamapS | It's pretty light on details. I think we'll want to have subject matter experts weigh in on things. | 20:33 |
mordred | jesusaur: ^^ your patch there - Shrews just reworked statsd reporting in the v3 nodepool fwiw | 20:40 |
pabelanger | mordred: jeblair: speaking of statsd, any reason not to +3 444363? currently has 2 +2 | 20:42 |
mordred | pabelanger: nope | 20:43 |
*** hashar has joined #zuul | 20:45 | |
jesusaur | Shrews: how drastically has nodepool statsd reporting changed? | 20:47 |
Shrews | jesusaur: take a look at the StatsReporter class in nodepool.py | 20:47 |
jeblair | well, the main thing relevant to that change is that the allocator is completely gone | 20:53 |
Shrews | jeblair: I think I have a fix for the deadlock. But the leaked instance race keeps biting me, so I'm going to now fix that and put that up ahead of it. | 21:09 |
jeblair | kk | 21:12 |
Shrews | jeblair: actually, i'll just put out what i have and rebase your review when i have it | 21:13 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix failure of node assignment at quota https://review.openstack.org/444462 | 21:13 |
Shrews | jeblair: ^^^ | 21:13 |
Shrews | jeblair: tl;dr, I've made the NodeRequestHandler code re-entrant so that it maintains state across calls to run() | 21:14 |
Shrews | jeblair: when ProviderWorker is paused, it just drains the handlers that are waiting on nodes until they're all finished | 21:15 |
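(Editor's aside: "re-entrant so that it maintains state across calls to run()" means roughly the following shape — a hypothetical minimal sketch, not the actual NodeRequestHandler API.)

```python
class NodeRequestHandler:
    """Illustrative sketch: run() may be called repeatedly and makes
    incremental progress each time, so a paused ProviderWorker can keep
    polling its outstanding handlers until they all drain."""

    def __init__(self, nodes_needed):
        self.nodes_needed = nodes_needed
        self.launched = 0
        self.done = False

    def run(self):
        # Each call does one unit of work instead of blocking until
        # the whole request is satisfied.
        if self.launched < self.nodes_needed:
            self.launched += 1  # stand-in for "launch one node"
        self.done = self.launched >= self.nodes_needed
        return self.done
```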
* jeblair pauses kazoo timer and context switches | 21:16 | |
Shrews | mordred: i have no idea why your nodepool metadata change is in merge conflict. it applied cleanly for me on top of current feature/zuulv3 | 21:25 |
jeblair | Shrews: i get it. i left a couple comments. | 21:31 |
Shrews | jeblair: thx | 21:39 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Stop json-encoding the nodepool metadata https://review.openstack.org/410812 | 21:41 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Use node ID for instance leak detection https://review.openstack.org/444508 | 21:41 |
Shrews | mordred: rebased for you | 21:41 |
mordred | Shrews: thanks! | 21:45 |
openstackgerrit | Jesse Keating proposed openstack-infra/zuul feature/zuulv3: support github pull request labels https://review.openstack.org/444511 | 21:47 |
jeblair | harlowja: while i'm in there, would you prefer RLock/WLock to be called ReadLock/WriteLock ? | 21:48 |
harlowja | doesn't matter to me | 21:48 |
* jeblair paints bikesheds already painted | 21:49 | |
harlowja | match python threading stuff? | 21:49 |
harlowja | that'd be fine with me | 21:49 |
harlowja | oh wait, python doesn't have it | 21:49 |
harlowja | match the other library i made that does have it, lol | 21:49 |
harlowja | ReadLock/WriteLock matches more of what i did there | 21:50 |
harlowja | in https://github.com/harlowja/fasteners/blob/master/fasteners/lock.py#L100 | 21:50 |
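(Editor's aside: the fasteners library linked above exposes this as `fasteners.ReaderWriterLock`, with `read_lock()`/`write_lock()` context managers. A toy pure-stdlib equivalent, for illustration only, looks like this — readers share, writers exclude everyone.)

```python
import threading

class ReadWriteLock:
    """Toy reader-writer lock (illustration only, not production code):
    many readers may hold the lock at once; a writer waits until there
    are no readers and no other writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```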
jeblair | cool, i like that better. i have obtained validation. :) | 21:50 |
harlowja | u can go read the upstream python change for that | 21:50 |
harlowja | they bikesheded all over that | 21:50 |
harlowja | http://bugs.python.org/issue8800 from what i remember | 21:51 |
harlowja | 'Seems to have fizzled out due to the intense amount of bikeshedding required.' | 21:51 |
harlowja | lol | 21:51 |
harlowja | http://bugs.python.org/issue8800#msg274795 (wasn't kidding) | 21:52 |
jeblair | wow, a complete example of the law of triviality! https://en.wikipedia.org/wiki/Law_of_triviality | 21:53 |
jeblair | the term references the act of killing an idea with trivial arguments | 21:53 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix failure of node assignment at quota https://review.openstack.org/444462 | 21:53 |
jeblair | i think we usually use it to inoculate ourselves against that | 21:54 |
harlowja | ya, so i read over that bug a long time ago, was like wtf, and then just made something | 21:54 |
harlowja | too much bikeshed for me | 21:54 |
harlowja | lol | 21:54 |
harlowja | feels bad that such patch started in 2010 | 21:55 |
harlowja | :( | 21:55 |
harlowja | poor author | 21:55 |
harlowja | sometimes on such threads i want to punch the other commenters in the nuts and tell them to have some feelings | 21:57 |
harlowja | but i can't say such things on such threads | 21:57 |
harlowja | lol | 21:57 |
Shrews | Ok, I have fixes up for the big flaws found today. Going to call it a week and prepare for the sportsball things. Night all | 22:05 |
jeblair | harlowja: okay, clear your schedule. i'm preparing a pr for you. :) | 22:06 |
jeblair | Shrews: goodnight! happy sportsball! | 22:06 |
jeblair | harlowja: as soon as i figure out how. you may have a few minutes. :) | 22:07 |
harlowja | lol | 22:07 |
harlowja | counting down | 22:07 |
jeblair | harlowja: https://github.com/python-zk/kazoo/pull/419 make magic happen! | 22:11 |
harlowja | yes sir | 22:11 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool feature/zuulv3: Store a pointer to the paused node request handler https://review.openstack.org/444520 | 22:33 |
jeblair | Shrews: +2 with an option ^ | 22:34 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Add destructor to SSHClient https://review.openstack.org/444433 | 22:37 |
jeblair | harlowja: hrm, it looks like the election tests are hanging | 22:44 |
harlowja | poopie | 22:44 |
jeblair | that doesn't make sense to me; i'm looking into it | 22:44 |
harlowja | travis and this stuff has always been let's say unreliable | 22:44 |
harlowja | so it may be travis fault, but may not, ha | 22:44 |
harlowja | but ya, looks like election something or other | 22:45 |
harlowja | afaik election stuff is just using a lock | 22:45 |
jeblair | i think i can reproduce it locally, and i'm pretty sure i had a full working run before starting | 22:45 |
harlowja | k | 22:45 |
harlowja | https://github.com/python-zk/kazoo/blob/master/kazoo/recipe/election.py#L53-L54 | 22:46 |
harlowja | lol | 22:46 |
harlowja | that whole recipe is funny | 22:46 |
harlowja | lol | 22:46 |
jeblair | harlowja: derp, i think i see the issue | 22:46 |
harlowja | almost feels like it shouldn't exist | 22:46 |
harlowja | since it adds about no logic to the lock recipe | 22:46 |
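(Editor's aside: the point that the election recipe "adds about no logic to the lock recipe" can be shown with a hypothetical miniature — leader election is just "acquire the lock, run the callback while holding it". kazoo's real Election wraps its ZooKeeper-backed Lock in much the same way; the stand-in Lock below exists only so the sketch runs without a ZooKeeper server.)

```python
class Lock:
    """Tiny stand-in lock so the sketch is self-contained."""
    def __init__(self):
        self.held = False
    def __enter__(self):
        self.held = True
        return self
    def __exit__(self, *exc):
        self.held = False

class Election:
    """Illustrative sketch: whoever acquires the lock is the leader,
    and runs the supplied function while holding it."""
    def __init__(self, lock):
        self.lock = lock
    def run(self, func, *args, **kwargs):
        with self.lock:  # block until we are the leader
            return func(*args, **kwargs)
```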
harlowja | lol | 22:46 |
jeblair | harlowja: okay, fixed locally... | 22:52 |
jeblair | harlowja: while i'm here -- when i looked at the diff, i noticed a couple of minor things i can fix up... | 22:52 |
harlowja | u can do it | 22:52 |
jeblair | harlowja: there used to be a Lock._NODE_NAME. the previous patch got rid of that in favor of passing a variable to the constructor; i'm making it a class attribute again, but i inadvertently called it Lock.node_name. would you prefer Lock.node_name, Lock.NODE_NAME, or Lock._NODE_NAME? (note that the subclasses override this variable, but otherwise we don't expect users to touch it). | 22:54 |
jeblair | harlowja: https://github.com/python-zk/kazoo/pull/419/files#diff-a08f51f50ea54f2f8138ab6045dc59c0L72 | 22:55 |
jeblair | for context | 22:55 |
jeblair | and https://github.com/python-zk/kazoo/pull/419/files#diff-a08f51f50ea54f2f8138ab6045dc59c0R406 | 22:55 |
harlowja | hmmmm | 23:00 |
harlowja | i'll let u pick | 23:00 |
jeblair | i'll go with _NODE_NAME and hope that conveys "protected class attribute" :) | 23:02 |
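(Editor's aside: the chosen pattern — a leading-underscore class attribute that subclasses override but users don't touch — looks like this. The sketch is illustrative; the node-name strings are assumptions, not necessarily the exact values in the kazoo PR.)

```python
class Lock:
    """Illustrative sketch of the '_NODE_NAME as protected class
    attribute' pattern: the underscore signals 'not public API, but
    subclasses may override'."""

    _NODE_NAME = "__lock__"

    def prefix(self):
        # Contender znodes are prefixed with the class's node name,
        # so read and write contenders are distinguishable under the
        # same lock path.
        return self._NODE_NAME

class ReadLock(Lock):
    _NODE_NAME = "__rlock__"

class WriteLock(Lock):
    _NODE_NAME = "__wlock__"
```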
harlowja | wfm | 23:02 |
jeblair | harlowja: https://github.com/python-zk/kazoo/pull/419/ updated | 23:03 |
jeblair | harlowja: builds are passing this time (with the exception of gevent which is failing to install; pretty sure that's not my fault). | 23:06 |
harlowja | kk | 23:06 |
*** rahsah has quit IRC | 23:11 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Remove jenkins_manager https://review.openstack.org/444427 | 23:21 |
*** saneax-_-|AFK is now known as saneax | 23:30 | |
mordred | harlowja: assuming that PR from jeblair is good, how long do you think it would take to land and get into a release? | 23:38 |
harlowja | once benbangert i guess checks it? | 23:39 |
harlowja | then i can make a release pretty quickly | 23:39 |
mordred | cool! | 23:39 |
* mordred hands harlowja a pie | 23:39 | |
harlowja | now u just have to accept my manifesto | 23:39 |
harlowja | lol | 23:39 |
* harlowja takes pie before it gets taken away | 23:40 | |
harlowja | lol | 23:40 |
mordred | :) | 23:40 |
* harlowja runs away with pie | 23:40 | |
* mordred starts handing harlowja thousands of pies | 23:41 | |
* harlowja dies | 23:41 | |
mordred | o noes! | 23:42 |
* rbergeron recommends not overdosing nice humans on the pie | 23:46 | |
mordred | MOAR PIE FOR EVERYONE | 23:53 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!