*** dkranz has quit IRC | 01:05 | |
*** xinliang has joined #zuul | 01:05 | |
*** bhavik1 has joined #zuul | 02:45 | |
*** bhavik1 has quit IRC | 02:54 | |
*** zhuli has joined #zuul | 03:00 | |
*** isaacb has joined #zuul | 06:26 | |
*** hashar has joined #zuul | 08:30 | |
*** bhavik1 has joined #zuul | 08:56 | |
*** electrofelix has joined #zuul | 09:01 | |
*** bhavik1 has quit IRC | 09:30 | |
*** jkilpatr has joined #zuul | 10:43 | |
*** jkilpatr has quit IRC | 10:51 | |
*** jkilpatr has joined #zuul | 11:04 | |
*** dkranz has joined #zuul | 12:20 | |
*** dkranz has quit IRC | 13:05 | |
mrhillsman | how does nodepool read clouds? i keep getting an error - os_client_config.exceptions.OpenStackConfigException: Cloud fake was not found. | 13:14 |
mrhillsman | or is that expected behavior | 13:14 |
mrhillsman | everything else appears to be working | 13:15 |
mrhillsman | dib has built an image | 13:15 |
mrhillsman | and node-launcher looks like it is fine as i do not see any errors in the logs either | 13:15 |
mrhillsman | i see successful requests and responses | 13:16 |
mrhillsman | for both | 13:16 |
mrhillsman | but nodepool dib-image-list fails with error re cloud fake | 13:17 |
rcarrillocruz | mrhillsman: it reads a file called clouds.yaml | 13:21 |
rcarrillocruz | typically, it should be under ~/.config/openstack/clouds.yaml | 13:22 |
mrhillsman | ok cool, i have that in place, and using the default | 13:22 |
mrhillsman | ah | 13:22 |
mrhillsman | it is in /etc/nodepool/clouds.yaml | 13:22 |
rcarrillocruz | i think that's ok, it falls back to that location if the user's home does not have it | 13:22 |
mrhillsman | nope | 13:23 |
mrhillsman | you are right | 13:23 |
mrhillsman | moving it there fixed it :) | 13:24 |
rcarrillocruz | https://docs.openstack.org/os-client-config/latest/ | 13:24 |
rcarrillocruz | yah, it's an oscc thing, not a thing specific to nodepool | 13:24 |
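For reference, os-client-config resolves a named cloud by searching a small set of clouds.yaml locations (typically ./clouds.yaml, ~/.config/openstack/clouds.yaml, then /etc/openstack/clouds.yaml). A minimal sketch of that lookup follows, assuming an illustrative cloud named `fake`; the YAML shown in the comment is an example, not the actual config from this conversation.

```python
# Sketch of the oscc lookup behind the error above. Assumptions: os-client-config
# is installed, and the clouds.yaml content below is illustrative.
#
# ~/.config/openstack/clouds.yaml (or /etc/openstack/clouds.yaml) might contain:
#   clouds:
#     fake:
#       auth:
#         auth_url: http://controller:5000/v3
#         username: nodepool
#         password: secret
#         project_name: ci
#       region_name: RegionOne
import os_client_config
from os_client_config.exceptions import OpenStackConfigException

config = os_client_config.OpenStackConfig()  # searches the standard locations only
try:
    cloud = config.get_one_cloud('fake')
    print(cloud.name)
except OpenStackConfigException:
    # "Cloud fake was not found": none of the searched clouds.yaml files define
    # a cloud called "fake" (a file under /etc/nodepool/ is not searched).
    raise
```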
mrhillsman | hopefully i am almost done then :) | 13:24 |
mrhillsman | need to get it working against this local openstack cloud | 13:25 |
rcarrillocruz | if you hit roadblocks, happy to help, i've been installing a nodepool/zuul CI lately (which I think you have too per previous comments from you in this channel) | 13:25 |
mrhillsman | awesome, thanks | 13:25 |
mrhillsman | yes i have | 13:25 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add base job and roles for javascript https://review.openstack.org/510236 | 13:32 |
Shrews | jeblair: I'm curious about the new nodepool issue "When nodepool launchers are restarted with building nodes, the requests can be left in a pending state and end up stuck" | 13:47 |
Shrews | jeblair: we have code (and a test) to handle unlocked PENDING requests: http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n381 | 13:48 |
Shrews | jeblair: is it possible the request disappeared (by a zuul disconnect) before the nodepool cleanup thread ran? | 13:48 |
Shrews | launcher logs aren't showing me anything about the request from that pastebin after the zk restart, so it seems like it disappeared before it could be cleaned up properly | 13:50 |
Shrews | hrm, is it possible an active lock could survive a ZK restart? | 13:51 |
*** isaacb has quit IRC | 13:58 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't modify project-templates to add job name attributes https://review.openstack.org/510304 | 13:58 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Don't use cached config when deleting files https://review.openstack.org/510317 | 13:58 |
Shrews | jeblair: oh, simple kazoo test shows that a lock CAN survive a normal ZK stop/start! TIL | 13:59 |
Shrews | so now the paste makes sense b/c nodepool still has a lock on the PENDING request *after* the ZK restart | 14:01 |
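A rough sketch of the kind of kazoo experiment described above, assuming a local ZooKeeper on 127.0.0.1:2181 and a made-up lock path. The point is that a kazoo Lock is backed by an ephemeral znode tied to the client *session*, so a quick server stop/start does not release it as long as the client reconnects within the session timeout.

```python
# Sketch (assumptions: a local ZooKeeper on 127.0.0.1:2181, hypothetical lock
# path). A kazoo Lock is an ephemeral znode tied to the client session, so it
# survives a quick server stop/start if the client reconnects before the
# session times out -- which is why the PENDING request stayed locked.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

lock = zk.Lock('/nodepool/requests/200-0000000001/lock', identifier='launcher-1')
lock.acquire()

# ... restart the ZK server here (e.g. stop/start the zookeeper service) ...
# After kazoo reconnects, the lock's ephemeral contender node is still present:
print(zk.get_children('/nodepool/requests/200-0000000001/lock'))

lock.release()
zk.stop()
```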
Shrews | actually, hrm... why did the building node lose its lock though? ugh, confused again | 14:03 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update statsd output for tenants https://review.openstack.org/510580 | 14:03 |
jeblair | Shrews: yeah, i assume we would lose the lock for both the node and the request, or neither of them. it's strange to think that we would have the lock on just the request but not the node. in retrospect, i suppose i should have examined the request node more carefully to determine whether it was locked, and if so, by what. | 14:07 |
Shrews | jeblair: i'll see about working on a test case for this, but w/o more details, might be kind of hard to fix something where I don't know what was wrong. :/ | 14:09 |
Shrews | and i'm really confused about lock survival across ZK restarts now. maybe jhesketh can help shed light on that? | 14:11 |
jeblair | Shrews: maybe you meant harlowja? | 14:11 |
Shrews | oops, yes. | 14:12 |
jeblair | Shrews: how often does the thing that checks for unlocked pending requests run? we can probably verify whether it ran or not from logs. but i think this request was outstanding for quite a while. | 14:12 |
Shrews | i knew there was a 'j' and 'h' in there somewhere :) | 14:12 |
Shrews | jeblair: 60s | 14:12 |
jeblair | i can attest that all j names are the same | 14:12 |
jeblair | Shrews: i see the bug | 14:20 |
jeblair | Shrews: the cleanup thread hit an exception while trying to lock the node allocated to the request: http://paste.openstack.org/show/623109/ | 14:21 |
jeblair | Shrews: there's no protection in http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/launcher.py?h=feature/zuulv3#n391 to unlock the node request in case of an exception | 14:22 |
jeblair | Shrews: so the request stayed locked after that and nothing touched it again. | 14:22 |
*** hashar is now known as hasharAway | 14:27 | |
Shrews | jeblair: ah. wow, good find | 14:27 |
jeblair | Shrews: you want to work on sprinkling some try/finally clauses in there? | 14:28 |
Shrews | jeblair: yup, i'll take care of it | 14:29 |
jeblair | cool, thx | 14:29 |
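The actual fix is the change announced just below; purely for illustration, here is a sketch of the try/finally shape being discussed, using assumed helper names (lockNodeRequest, lockNode, and friends) rather than nodepool's exact API.

```python
# Sketch of the try/finally pattern discussed above -- not the actual nodepool
# patch (that is https://review.openstack.org/510594), just the general shape:
# if locking or handling an allocated node blows up, the request lock must
# still be released so a later cleanup pass can pick the request up.
def _handle_lost_request(zk, request):        # hypothetical helper names
    zk.lockNodeRequest(request)               # assumed ZooKeeper wrapper API
    try:
        for node_id in request.nodes:
            node = zk.getNode(node_id)
            zk.lockNode(node)                 # this is where the exception hit
            try:
                node.allocated_to = None
                zk.storeNode(node)
            finally:
                zk.unlockNode(node)
    finally:
        # Without this, an exception left the request locked forever and the
        # cleanup thread never touched it again.
        zk.unlockNodeRequest(request)
```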
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests https://review.openstack.org/510594 | 14:47 |
Shrews | jeblair: ^^^ | 14:47 |
Shrews | actually, i want to make 1 improvement there | 14:50 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add base job and roles for javascript https://review.openstack.org/510236 | 14:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve exception handling around lost requests https://review.openstack.org/510594 | 14:57 |
SpamapS | Shrews: Remember that zookeeper wasn't meant to be "restarted" | 15:54 |
SpamapS | Shrews: so all the recipes assume you have an ensemble that survives forever. | 15:54 |
SpamapS | I think they're getting better about pointing out when something is different for a single server use case. | 15:54 |
SpamapS | But for the most part.. the assumption is that you'll roll a restart through the members. | 15:55 |
jeblair | i've started investigating a hung git operation on ze06 for build 55769796f30e442d817feb96a6854eb1 | 15:56 |
SpamapS | And while zuul is still a spof..and nodepool-launcher is still active/passive at best..it might make sense to look at a ZK ensemble on 3 boxes for us, just for the "one less thing we have to be careful with" | 15:56 |
jeblair | GIT_HTTP_LOW_SPEED_TIME=30 and GIT_HTTP_LOW_SPEED_LIMIT=1000 are in the environment. and it is doing an https fetch. | 16:00 |
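For context, GIT_HTTP_LOW_SPEED_LIMIT / GIT_HTTP_LOW_SPEED_TIME make git abort an HTTP transfer whose rate stays below the given bytes-per-second limit for the given number of seconds. A hedged sketch of applying them to a GitPython fetch; the repo path and remote are illustrative, not Zuul's merger code.

```python
# Sketch: applying the low-speed abort settings to a GitPython fetch.
# Assumption: GitPython's Git.custom_environment() context manager; the repo
# path and remote below are illustrative.
import git

repo = git.Repo('/var/lib/zuul/git/some-project')   # hypothetical path
with repo.git.custom_environment(GIT_HTTP_LOW_SPEED_LIMIT='1000',
                                 GIT_HTTP_LOW_SPEED_TIME='30'):
    # git aborts the transfer if it stays below 1000 bytes/s for 30 seconds;
    # note this only covers slow HTTP transfers, not the internal hang seen here.
    repo.remotes.origin.fetch()
```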
*** maxamillion has quit IRC | 16:18 | |
*** maxamillion has joined #zuul | 16:18 | |
mrhillsman | sooooo...no python3 support :) | 16:31 |
SpamapS | mrhillsman: eh? | 16:36 |
mrhillsman | dib - was wondering why image build kept failing | 16:37 |
mrhillsman | with no module errors | 16:37 |
mrhillsman | and some other stuff | 16:38 |
mrhillsman | turns out python3 was the issue - i was using it by default | 16:38 |
AJaeger | does anybody know what's up with 510540? - it's been in the gate for three hours waiting for a node - the change after it has already been tested | 16:55 |
*** harlowja has joined #zuul | 17:01 | |
*** electrofelix has quit IRC | 17:26 | |
fungi | looks like zuulv3 started swapping a bit around 1.5 hours ago, in an effort to save more memory for cache | 17:32 |
fungi | that probably isn't translating to performance degradation though | 17:32 |
fungi | 40mb paged out | 17:32 |
fungi | but the server would clearly use more cache at this point, if it had some | 17:33 |
fungi | granted, we're sitting on a check pipeline with 400+ changes now | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic stage-artifacts role https://review.openstack.org/509233 | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add compress capabilities to stage artifacts https://review.openstack.org/509234 | 17:33 |
openstackgerrit | Andrea Frittoli proposed openstack-infra/zuul-jobs master: Add a generic process-test-results role https://review.openstack.org/509459 | 17:33 |
AJaeger | 467 changes in periodic alone... | 17:34 |
AJaeger | fungi, so 403 in check, 467 in periodic, 261 in periodic-stable; and a few in infra pipelines | 17:35 |
AJaeger | which also means that it did not manage to run a single periodic job today - far too busy serving higher prio requests | 17:36 |
fungi | oh, yep i basically can't load the status page well enough to scroll around in it | 17:36 |
AJaeger | fungi: yeah, that page is verrrrry slow. took me some time to find infra-gate and see 510540 waiting... | 17:36 |
fungi | so over 1100 changes in all | 17:36 |
fungi | i'd say given that sort of backlog, memory utilization is remarkably low | 17:37 |
AJaeger | ;) | 17:37 |
fungi | i'm trying to figure out how to look into the situation with 510540 | 17:37 |
fungi | 73 million lines in the current zuulv3 scheduler debug log. that certainly takes a while to load | 17:59 |
AJaeger | wow ;( | 18:00 |
fungi | the logfile is 9.2gib in size | 18:02 |
Shrews | fungi: perhaps it is too busy logging? ;) | 18:02 |
fungi | does lead me to wonder | 18:03 |
fungi | aside from it checking and not finding any dependencies for 510540,1 the other occasional mentions in the debug log seem to be about it getting reenqueued... which i wonder if that is just how it deals with configuration changes | 18:24 |
Shrews | fungi: did you manage to find any request #'s for it? | 18:24 |
fungi | i need to figure out which one it isn't running. that will take a bit for me to parse it back out of the status | 18:25 |
Shrews | holy smokes. 6300+ requests outstanding for nodepool | 18:25 |
Shrews | 218 total nodes, only 12 are READY. the rest are BUILDING or IN-USE | 18:27 |
Shrews | maybe at total capacity? | 18:27 |
AJaeger | multinode-integration-centos-7 is the job that is missing for 510540 | 18:27 |
fungi | okay, in that case this is the last request: | 18:28 |
fungi | 2017-10-09 15:21:22,408 INFO zuul.ExecutorClient: Execute job base-integration-centos-7 (uuid: fc4e31ad056c4d64b54b5f3cd7fac86b) on nodes <NodeSet centos-7 OrderedDict([('centos-7', <Node 0000178819 centos-7:centos-7>)])OrderedDict()> for change <Change 0x7fd20240b710 510556,2> with dependent changes [<Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change | 18:28 |
fungi | 0x7fd242ede390 510540,1>] | 18:28 |
Shrews | the ones in-use and locked don't seem to be too old. | 18:28 |
AJaeger | might be - but the gate has highest priority, so the changes after it got nodes (and those were added 30 mins later) | 18:28 |
fungi | oh, wait, that's not the one | 18:29 |
fungi | that's base-integration-centos-7 | 18:29 |
fungi | looking further back | 18:29 |
AJaeger | fungi multinode-integration-centos-7 is what you need | 18:29 |
fungi | yeah | 18:29 |
fungi | thanks, i'm still waiting for the status page filtering to complete | 18:29 |
Shrews | fungi: doesn't seem to be a request id in that :( | 18:30 |
AJaeger | fungi, curl -s http://zuulv3.openstack.org/status.json|jq '.' > out | 18:30 |
AJaeger | uuid is 55769796f30e442d817feb96a6854eb1 | 18:30 |
Shrews | request id is of the form: 200-xxxxxxxxxx or 100-xxxxxxxxxx | 18:31 |
Shrews | just wanted to see if maybe it was waiting on a node. i can't map request ID to change ID with the info that's stored in ZK | 18:32 |
fungi | 2017-10-09 15:08:12,415 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: d1ddd1c5e30842859d299983465dd75e) on nodes <NodeSet OrderedDict([('primary', <Node 0000178676 primary:centos-7>), ('secondary', <Node 0000178677 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>), | 18:32 |
fungi | ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd283f29240 509969,1> with dependent changes [<Change 0x7fd20240b710 510556,2>, <Change 0x7fd243b52e10 510555,1>, <Change 0x7fd242811b70 510578,1>, <Change 0x7fd20285d4a8 510564,2>, <Change 0x7fd242ede390 510540,1>] | 18:32 |
Shrews | but those lead me to believe that it got its nodes? | 18:32 |
Shrews | so... *shrug* | 18:33 |
fungi | no, wait, that's the wrong change | 18:33 |
fungi | just a sec, i'll go regex on this | 18:34 |
Shrews | go go gadget regex! | 18:34 |
fungi | think i'm going to switch to grep, since loading a >9gb file into memory is almost certainly causing undue memory pressure on the server | 18:36 |
*** hasharAway is now known as hashar | 18:36 | |
fungi | oh yeah. cacti confirms | 18:36 |
AJaeger | and we're out of space nearly, aren't we? | 18:37 |
fungi | oh, yikes, yep | 18:38 |
fungi | closing out my filter/viewer pipe freed all of that up | 18:39 |
jeblair | oh wow, so that memory use spike *isn't* zuul | 18:46 |
fungi | right, that was me | 18:46 |
jeblair | neat. that saves some time ): | 18:47 |
jeblair | :) even | 18:47 |
fungi | (: | 18:47 |
fungi | sorry for the false alarm there | 18:47 |
jeblair | and the free space is increasing too | 18:48 |
fungi | also me | 18:48 |
jeblair | we still plan on recovering swap and using it for the zuul log dirs, because we will probably run out of log space in prod | 18:48 |
jeblair | but also, i think we can make the scheduler log more efficient, it'll just take some cleanup work. | 18:48 |
jeblair | (the whole "print out the queue" thing is *extremely* useful once, but not more. finding the right time to print it out when it's useful is the trick) | 18:49 |
jeblair | at any rate, that shouldn't be significantly different than v2, so we don't have to block on it | 18:50 |
fungi | at any rate, i think this was the entry i was originally looking for: | 18:51 |
fungi | 2017-10-09 13:56:10,708 INFO zuul.ExecutorClient: Execute job multinode-integration-centos-7 (uuid: 55769796f30e442d817feb96a6854eb1) on nodes <NodeSet OrderedDict([('primary', <Node 0000177867 primary:centos-7>), ('secondary', <Node 0000177868 secondary:centos-7>)])OrderedDict([('switch', <Group switch ['primary']>), | 18:51 |
fungi | ('peers', <Group peers ['secondary']>)])> for change <Change 0x7fd242ede390 510540,1> with dependent changes [<Change 0x7fd2533fc518 510009,1>] | 18:51 |
fungi | there is no more recent execute for that job on that change, and it's from ~5hrs ago | 18:52 |
jeblair | fungi: you can also grab the build uuid from the console log link | 18:53 |
fungi | but as mentioned in #openstack-infra, i guess this is the one jeblair was looking at back around 14:00z in scrollback | 18:53 |
jeblair | fungi: and you can use ansible to grep for that uuid in all the executor logs to find out which one is running it | 18:53 |
fungi | oh, cool idea | 18:54 |
jeblair | as best as i can tell, git is *not* hung waiting on http traffic; it's in some kind of internal deadlock | 18:55 |
jeblair | fungi: https://etherpad.openstack.org/p/8aKzRXq6ae | 18:59 |
jeblair | after lunch, i'm going to dust off the git hard-timeout change from last week | 19:01 |
jeblair | also, i think i'm done poking at that git process. i'll kill it so things move again after lunch, unless someone objects. | 19:09 |
SpamapS | Does look like it is deadlocked | 19:19 |
mrhillsman | nodepool-builder and nodepool-launcher work fine as long as i have -d; without it they do not run and just close | 19:25 |
mrhillsman | are there any instructions/example of how to get it to dump logs somewhere? | 19:34 |
Shrews | mrhillsman: are you using sudo? it probably can't start up because it can't create its pid file (/var/run/nodepool dir, iirc) | 19:53 |
mrhillsman | using root | 19:54 |
pabelanger | I haven't been around much today (Turkey day here in Canada) but will be around only for the upcoming zuul meeting today | 19:59 |
mrhillsman | i'll ignore it for now, things appear to be working, zuul and nodepool are talking to each other and local openstack environment, just need a job beyond noop to work and will flush out the minor issues later | 20:14 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add upload-npm role https://review.openstack.org/510686 | 20:42 |
*** hashar has quit IRC | 20:43 | |
jeblair | mordred, pabelanger: in order to use the git hard-timeout function, we need pabelanger's fix here: https://github.com/gitpython-developers/GitPython/commit/1dbfd290609fe43ca7d94e06cea0d60333343838 | 20:45 |
jeblair | that's merged to the gitpython master branch | 20:45 |
jeblair | unfortunately, the master branch still has this bug: https://github.com/gitpython-developers/GitPython/issues/605 | 20:46 |
jeblair | i proposed the very basic workaround that Byron suggested here: https://github.com/gitpython-developers/GitPython/pull/686 | 20:47 |
jeblair | it's not good, but it's something | 20:47 |
jeblair | mordred: in doing that, i noticed that it looks like you may have done some work in that area in https://github.com/emonty/GitPython/commit/0533dd6e1dddfaa05549a4c16303ea4b4540c030 but maybe never submitted pull requests for it? | 20:48 |
jeblair | at any rate, i think for the moment we will have to try to run from my fork, until we get a release with both my and pabelanger's fix | 20:49 |
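As a generic illustration of what a hard timeout adds over the low-speed abort -- this is not the actual change 509517, which works through GitPython -- a plain subprocess-based sketch:

```python
# Generic illustration of a hard wall-clock timeout on a git operation.
# A stalled fetch (even one deadlocked internally, like the ze06 case) gets
# killed after `timeout` seconds regardless of transfer speed.
import subprocess

def fetch_with_hard_timeout(repo_dir, remote='origin', timeout=300):
    subprocess.run(
        ['git', 'fetch', remote],
        cwd=repo_dir,
        check=True,
        timeout=timeout,   # raises subprocess.TimeoutExpired and kills git
    )
```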
pabelanger | Ya, I hope to look at my open PR tomorrow for GitPython. Didn't get much time over the weekend for it | 20:50 |
jeblair | pabelanger: they merged your encoding fix to master | 20:50 |
pabelanger | okay cool, I had another for as_process change | 20:51 |
*** jkilpatr has quit IRC | 20:56 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout https://review.openstack.org/509517 | 21:29 |
jeblair | mordred, pabelanger: ^ | 21:30 |
*** jkilpatr has joined #zuul | 21:33 | |
*** Shrews has quit IRC | 21:36 | |
*** Shrews has joined #zuul | 21:39 | |
jeblair | it's zuul meeting time in #openstack-meeting-alt | 22:00 |
Shrews | fungi: fyi, when you get time: https://review.openstack.org/510594 | 22:19 |
fungi | thanks Shrews! | 22:22 |
pabelanger | I'll be dropping offline again for the next 2 hours, but will review issues later this evening | 22:57 |
jeblair | please see the meeting minutes for important information about the openstack-infra roll-forward: http://eavesdrop.openstack.org/meetings/zuul/2017/zuul.2017-10-09-22.03.html | 22:59 |
mrhillsman | so i want to move to a production deployment, i know i need a target openstack cloud, have one | 23:06 |
mrhillsman | how many dedicated servers do i need? | 23:06 |
jeblair | mrhillsman: depends largely on the number of simultaneous jobs you'll run | 23:10 |
mrhillsman | got it, we will not rise to as many as infra runs right now of course | 23:11 |
jeblair | mrhillsman: (which of course is a function of the size of your openstack cloud, and how busy the repos you're testing are) | 23:11 |
mrhillsman | but maybe like 30-50 starting off | 23:11 |
mrhillsman | 30-50 VMs | 23:11 |
mrhillsman | need to prove it in a live environment but want to get the ball rolling | 23:12 |
mrhillsman | only thing left in this PoC is to see an actual job run (other than noop :) ) | 23:12 |
jeblair | mrhillsman: you *can* do everything on one host. for that size deployment, i'd probably use 3 -- one host as a nodepool image builder+launcher (it will probably want more disk than anything else). one zuul scheduler on maybe an 8g vm. and one zuul executor, also probably around 8g. | 23:13 |
mrhillsman | ok cool, thanks | 23:14 |
jeblair | mrhillsman: we've found our 8g executors can handle, we think, maybe around 100 simultaneous jobs. so those scale linearly like that. | 23:14 |
jeblair | mrhillsman: as you start to get larger, dedicated mergers can be useful to handle some of the git load. 1 merger for every 2 executors might be a good ratio. those can be tiny too -- like 2g or even 1g vm. | 23:15 |
mrhillsman | got it | 23:16 |
jeblair | mrhillsman: i'd probably separate the nodepool builder onto its own host if your nodepool cloud quota is > 250 nodes. and add another nodepool launcher after 500 nodes. | 23:16 |
mrhillsman | noted | 23:17 |
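A back-of-the-envelope sketch encoding the rules of thumb above (these are the channel's estimates, not official sizing guidance): roughly 100 simultaneous jobs per 8 GB executor, about one merger per two executors, a separate builder host above ~250 nodes of quota, and a second launcher above ~500.

```python
# Rough capacity-planning helper based on the rules of thumb above.
# These numbers are the channel's estimates, not official guidance.
def suggest_layout(max_simultaneous_jobs, cloud_quota_nodes):
    executors = max(1, -(-max_simultaneous_jobs // 100))  # ~100 jobs per 8 GB executor
    return {
        'scheduler_hosts': 1,                              # ~8 GB VM
        'executor_hosts': executors,                       # ~8 GB VMs
        'merger_hosts': executors // 2,                    # tiny VMs (1-2 GB), optional at small scale
        'launcher_hosts': 1 if cloud_quota_nodes <= 500 else 2,
        'builder_on_own_host': cloud_quota_nodes > 250,
    }

# For the 30-50 job deployment discussed above:
#   suggest_layout(50, 50) -> 1 executor, no dedicated mergers, builder and
#   launcher sharing a host -- i.e. roughly the three-host layout suggested.
```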
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add git timeout https://review.openstack.org/509517 | 23:22 |
mrhillsman | in terms of getting the web status page working, do i need to proxy through apache/nginx? | 23:25 |
mrhillsman | simply running zuul-web and pointing at the port does not work unless i have it configured improperly | 23:25 |
mrhillsman | if it should work out the box i'll figure it out, just don't want to go on a rabbit chase | 23:28 |
jeblair | mrhillsman: no, the status page is not straightforward yet. there's pending work to integrate it into zuul-web. but for the moment, there's manual installation steps and proxying may be required. | 23:44 |
mrhillsman | ok cool, i'll work to figure it out, ty sir | 23:45 |
jeblair | mrhillsman: i think it's all in the etc/ directory of the zuul repo | 23:50 |
mrhillsman | ah ok, was looking in the wrong dir, thx for that | 23:50 |
jeblair | yeah, it's.... not like the rest :) | 23:52 |
mrhillsman | ;) | 23:53 |
jeblair | fungi, mordred, pabelanger: even with my fix to gitpython, simple unit tests still take 2x as long. i think there's more at play. i think i might be inclined to make a new local fork of gitpython that has 2.1.1 with pabelanger's fix, and run with that until we figure out a long-term solution. but i expect that to take more than just a couple of days. | 23:55 |
jeblair | all told, i don't think that needs to change anything about our plan -- we still run with a local fork for a bit. i think it just changes expectations around what step 2 is. | 23:55 |
fungi | okay | 23:57 |
fungi | i'm fine with that, even if it's not quite as nice as we'd hoped | 23:58 |