14:00:31 #startmeeting kuryr
14:00:32 Meeting started Mon Dec 17 14:00:31 2018 UTC and is due to finish in 60 minutes. The chair is dmellado. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:35 The meeting name has been set to 'kuryr'
14:00:45 Hi everyone! Who's around here for today's kuryr meeting? ;)
14:00:52 o/
14:01:01 o/
14:01:03 #chair dulek
14:01:04 Current chairs: dmellado dulek
14:01:05 #chair yboaron_
14:01:06 Current chairs: dmellado dulek yboaron_
14:01:54 All right, so let's do a quick roll call
14:02:13 first of all, due to the Christmas holidays there won't be kuryr meetings on the 24th nor the 31st
14:02:14 * dulek is trying to summon celebdor as he might be able to give some insight on that kuryr-daemon mem usage.
14:02:26 I'll be sending an email as a reminder
14:02:36 dulek: could you give a brief summary about that?
14:02:41 on which gates is this happening?
14:02:43 o/
14:02:47 what's eating memory and so on?
14:02:48 Oh maaan, we could wish ourselves happy holidays on the 24th and happy new year on the 31st. :D
14:02:52 #topic kuryr-kubernetes
14:03:02 dulek: lol
14:03:06 #chair ltomasbo
14:03:07 Current chairs: dmellado dulek ltomasbo yboaron_
14:03:13 Okay, let me explain then.
14:03:43 So it's the issue of events being skipped and not passed to kuryr-controller, which is watching the K8s API.
14:03:56 I've linked that to errors in kubernetes-api/openshift-master
14:04:05 http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:07 like, random type events? or is it tied to specific ones?
14:04:09 Here's an example.
14:04:16 #link http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:27 dmellado: Like the updates on pods, services or endpoints.
14:04:45 Okay, so then I've linked those issues with etcd warnings: http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-etcd.txt.gz
14:04:51 o/
14:04:58 I am here
14:05:02 what's up
14:05:07 dulek: so that's linked with http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000906.html
14:05:08 sorry for the delay, I was on a meeting
14:05:14 dmellado: Yes!
14:05:15 hiya folks
14:05:18 And then with some insight from infra I've started looking at dstat files from those gate runs.
14:05:45 What immediately caught my attention is the column that shows the process using the most memory.
14:05:55 celebdor: dulek has been seeing odd things on dstat, you might have some experience on that
14:06:00 It's kuryr-daemon with a value of 2577G.
14:06:03 how much memory was the process eating, dulek? xD
14:06:22 that looks far too high
14:06:31 http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-dstat.txt.gz - see for yourself, it's the third column from the last.
14:06:36 dulek: only on the gates or locally as well?
14:06:47 celebdor: I'll be checking that after the meeting.
14:06:51 ok
14:06:53 thanks
14:06:54 So first of all I don't understand why OOM is not happening.
14:07:03 dulek: is this also tied to specific cloud providers on the infra?
14:07:12 i.e. it happens on RAX but not on OVH and so
14:07:13 ?
14:07:25 dmellado: Not really, no. The logs I'm sending you now are from INAP.
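The 2577G figure dstat reports for kuryr-daemon is implausible on a gate VM, so it is worth cross-checking against the kernel's own accounting before chasing a leak. Below is a minimal sketch of such a cross-check (an assumption for illustration, not part of the meeting tooling): it assumes a Linux host where kuryr-daemon runs as a plain process whose comm contains 'kuryr-daemon'.

```python
#!/usr/bin/env python3
"""Cross-check dstat's 'most expensive memory process' reading by summing
RSS and swap for kuryr-daemon straight from /proc.

Illustrative sketch only: assumes a Linux host where kuryr-daemon runs as
a plain process whose comm (the Name: field) contains 'kuryr-daemon'.
"""
import re
from pathlib import Path


def rss_and_swap_kib(comm_substring):
    """Return (rss_kib, swap_kib) summed over all matching processes."""
    rss = swap = 0
    for status in Path('/proc').glob('[0-9]*/status'):
        try:
            lines = status.read_text().splitlines()
        except OSError:
            continue  # process exited while we were scanning
        if comm_substring not in lines[0]:
            continue  # first line is 'Name:\t<comm>'
        for line in lines:
            if line.startswith(('VmRSS:', 'VmSwap:')):
                kib = int(re.search(r'\d+', line).group())
                if line.startswith('VmRSS:'):
                    rss += kib
                else:
                    swap += kib
    return rss, swap


if __name__ == '__main__':
    rss, swap = rss_and_swap_kib('kuryr-daemon')
    print('kuryr-daemon RSS: %.1f MiB, swap: %.1f MiB'
          % (rss / 1024.0, swap / 1024.0))
```

Run alongside dstat on a gate-like VM; if the two disagree by orders of magnitude, the dstat column is likely the quirk dulek suspects.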
14:07:32 Second - once the kuryr-daemon is started it drains all the swap.
14:07:55 So this might make some sense - the issues in etcd might be due to some IO delays.
14:08:41 dulek: s/drains/occupies?
14:09:27 dulek, didn't we have a healthcheck that was restarting kuryr-daemon if eating more than X MBs?
14:09:43 celebdor: Well, swap usage rises. But now that I look again it's only using up to 3 GB of it, so not all.
14:09:49 ltomasbo: That's only on containerized.
14:10:17 dulek: really?
14:10:33 a docker leak then?
14:10:39 dulek, Q: Does finding this string in the etcd log file mean that etcd lost requests? 'avoid queries with large range/delete range!'
14:10:41 that would be unusual, I suppose
14:10:50 celebdor: In that case kuryr-daemon runs as a process, not a container.
14:11:04 dulek: I guess he's referring to etcd
14:11:09 yboaron_: Not 100% sure really.
14:11:19 but if kuryr-daemon is the process eating up the memory, then it doesn't apply
14:11:23 dmellado: We run etcd as a process as well.
14:11:28 dulek: so does this happen on containerized or not? Not sure I understood
14:11:48 celebdor: Ah, sorry. I saw it on non-containerized at the moment.
14:12:02 it could be due to the healthcheck
14:12:12 But I can check…
14:12:18 it'd be interesting to see if we can replicate it if we disable that healthcheck on containerized
14:12:58 dulek: did you freeze an upstream VM to try to fetch some data?
14:13:14 Looks the same in containerized: kuryr-daemon 2589G
14:13:33 celebdor: Note that we run kuryr-daemon as a privileged container, so it's on the host's ps namespace, right?
14:13:45 dmellado: Not yet, I'll try this locally first.
14:14:12 Okay, so if nobody has an idea, I'll try to poke it a bit more.
14:14:23 Hopefully it's some dstat quirk
14:14:38 But in any case I guess it's worth looking at.
14:14:38 dulek: in any case let me know if you get stuck trying this locally
14:14:50 and we'll take a look together in a vm either upstream or rdocloud
14:14:56 dulek: indeed
14:15:10 Okay, cool!
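For context on the healthcheck ltomasbo mentions: in the containerized deployment kuryr-daemon gets restarted once it exceeds some memory limit. The sketch below shows the general shape such a memory-threshold liveness endpoint could take; it is not kuryr's actual health server, and the port, route and limit are assumptions for illustration.

```python
"""Sketch of a memory-threshold liveness endpoint: it starts returning 500
once the daemon's current RSS crosses a limit, so a container runtime's
liveness probe restarts the pod.  Not kuryr's actual implementation; the
port, route and MAX_MEMORY_MIB below are assumptions."""
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_MEMORY_MIB = 512  # hypothetical limit standing in for the real config option


def current_rss_mib():
    """Current RSS of this process, read from /proc (Linux only)."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0  # value is in KiB
    return 0.0


class AliveHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rss = current_rss_mib()
        ok = rss <= MAX_MEMORY_MIB
        self.send_response(200 if ok else 500)
        self.end_headers()
        self.wfile.write(
            ('rss=%.0fMiB limit=%dMiB\n' % (rss, MAX_MEMORY_MIB)).encode())


if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8090), AliveHandler).serve_forever()
```

Disabling such a probe (as celebdor suggests) would show whether the restarts themselves are what's inflating the reported memory figures on the containerized gate.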
14:15:30 thanks for bringing this up dulek
14:16:50 so, I also wanted to let you know
14:17:01 that I've been working on getting python-openshift packaged for fedora
14:17:07 which involves a hell of a lot of bureaucracy
14:17:20 and being sponsored into the fedora-packagers group
14:17:32 I thought it was just an rpmspec file
14:17:36 I'm not sure if besides me there's anyone on the team who also has those privileges
14:17:46 but if you could vote for this I'd be happy
14:17:58 celebdor: it involved a dependency hell xD
14:18:01 https://bugzilla.redhat.com/show_bug.cgi?id=1654833
14:18:02 bugzilla.redhat.com bug 1654833 in Package Review "Review Request: python-kubernetes - add python-kuberenetes to EPEL 7" [Medium,New] - Assigned to nobody
14:18:04 https://bugzilla.redhat.com/show_bug.cgi?id=1654835
14:18:04 bugzilla.redhat.com bug 1654835 in Package Review "Review Request: python-google-auth - Add python-google-auth to EPEL 7" [Medium,Post] - Assigned to zebob.m
14:18:06 https://bugzilla.redhat.com/show_bug.cgi?id=1659281
14:18:06 bugzilla.redhat.com bug 1659281 in Package Review "Review Request: python-dictdiffer - Dictdiffer is a module that helps you to diff and patch dictionaries" [Medium,New] - Assigned to nobody
14:18:08 https://bugzilla.redhat.com/show_bug.cgi?id=1659282
14:18:09 bugzilla.redhat.com bug 1659282 in Package Review "Review Request: python-string_utils - A python module containing utility functions for strings" [Medium,New] - Assigned to nobody
14:18:10 https://bugzilla.redhat.com/show_bug.cgi?id=1659286
14:18:11 bugzilla.redhat.com bug 1659286 in Package Review "Review Request: python-openshift - Python client for the OpenShift API" [Medium,New] - Assigned to nobody
14:18:44 celebdor: ltomasbo maysams yboaron_ any topics from your side?
14:19:11 no
14:19:19 updates from NP?
14:19:25 neither from my side
14:19:32 dmellado: celebdor: yep
14:19:37 go for it maysams
14:19:54 I have a topic
14:20:04 https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808506
14:20:06 Launchpad bug 1808506 in kuryr-kubernetes "neutron-vif requires admin rights" [Medium,New] - Assigned to Luis Tomas Bolivar (ltomasbo)
14:20:29 sorry maysams, go ahead...
14:20:34 I am working on updating the sg rules in the CRD when pods are created/deleted/updated
14:20:35 https://review.openstack.org/#/c/625588
14:20:42 ltomasbo: it's okay
14:20:54 great! I'll take a look asap!
14:20:56 ltomasbo: that's BM only
14:21:17 maysams: you also reported some bug, right?
14:21:37 celebdor: yes
14:21:39 https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808787
14:21:41 Launchpad bug 1808787 in kuryr-kubernetes "Pod Label Handler tries to handle pod creation causing controller restart" [Undecided,New]
14:21:41 celebdor, it's still broken... and it raises another concern, which is that we are using the wrong type of tenant both on devstack and on the gates
14:22:17 ltomasbo, what do you mean by 'under a normal tenant' in the bug description, do we use a tenant with extra permissions in our gates?
14:22:18 ltomasbo: is that related to the devstack tenant permissions you told me about?
14:22:30 ltomasbo: that is my biggest concren
14:22:33 *concern
14:22:39 celebdor: basically the pod label handler is waiting for the vif handler to handle the event, but due to resource not ready exceptions this will not work
14:22:40 yboaron_, unfortunately, yes
14:23:05 ltomasbo, can you elaborate on that?
14:23:05 maysams: why wait and not drop?
14:23:09 ltomasbo: I'll need to spend some time investigating it
14:23:16 if it drops the event if it has no annotation
14:23:25 dmellado, it would be great if you could take a look, yes!
14:23:33 it will eventually get the event from the patch that the vif handler does
14:23:35 celebdor: Hey, that makes sense.
14:23:42 ltomasbo: I'll sync with you after the meeting to work out the details, thanks!
14:23:42 dmellado, it is not only on the gates, but also on the k8s tenant that we create on devstack
14:23:52 celebdor: indeed makes sense
14:23:58 :O
14:24:11 let's try that then
14:24:14 incredible, something that celebdor says makes sense
14:24:17 xD
14:24:21 celebdor: thanks :)
14:24:22 dmellado: I feel the same way
14:24:24 believe me
14:24:28 something's going really odd
14:24:33 maysams, celebdor: yep, it is easy to just drop the event, but worth keeping in mind that we have a problem there
14:24:34 :D
14:24:47 ltomasbo: maysams we'll treat that as a bug
14:24:55 so if you haven't already, please open it in LP
14:24:58 ltomasbo: why is this a problem?
14:25:09 dmellado, it is a bug we hit on network policies, but it is not a bug produced by network policies...
14:25:11 the pod label handler only cares about annotated pods
14:25:14 I think it is fair
14:25:16 on the other hand
14:25:35 celebdor, if we have different handlers listening to the same events, why is one blocked by the other?
14:25:38 ltomasbo: dmellado: we need the tenant to be a regular neutron policy tenant
14:25:41 like right now
14:25:54 ltomasbo is right
14:26:02 celebdor, I agree with the workaround
14:26:08 ltomasbo: totally agree
14:26:11 celebdor, but that is just avoiding the problem, not fixing it
14:26:13 ltomasbo: well, I don't think it's that odd for one handler to depend on stuff done by the other
14:26:27 celebdor, no no, I don't mean the depends-on itself
14:26:29 just that the current handler design makes it weird
14:26:44 celebdor, what would happen in the case that one handler does not depend on the other?
14:27:12 celebdor, if that is the design, it does not make a lot of sense to have different handlers... perhaps simpler to have just one...
14:27:46 ltomasbo: but I think having several ones is better for the granularity
14:27:54 if by any chance we have handlers a and b
14:27:58 ltomasbo: I don't understand the "what would happen if they do not depend on each other?"
14:28:01 and b depends on a then it's fine
14:28:16 but maybe some user would like to use just 'a' and it would use less memory
14:28:17 sorry guys, I was disconnected for the last few minutes .. network issues
14:28:22 they will just annotate the resource (if they need to) in some random order
14:28:28 what's the prob with that
14:28:30 ?
14:29:21 celebdor, imagine the vif handler was not doing pod annotations and other neutron actions
14:29:27 ok
14:29:35 celebdor, it will never get executed due to the other handler being scheduled first
14:30:59 anyway, we can discuss this on the kuryr channel after the meeting
14:31:06 ltomasbo: yep, let's do that
14:31:13 ltomasbo: is the handler now serializing per resource?
14:31:23 ok, ok
14:31:27 we can take it up later
14:32:20 gcheresh: itzikb juriarte
14:32:22 o/
14:32:28 anything you'd like to add to the meeting?
14:32:45 I'm pretty sure you'd be interested in the python-openshift thing
14:33:38 https://media1.tenor.com/images/184a80bf791b211b0e2f3f02badab20e/tenor.gif?itemid=12469549
14:33:40 awesome!
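The drop-instead-of-wait idea celebdor proposes could look roughly like the sketch below: the pod label handler simply returns when the vif annotation is not there yet and relies on the vif handler's annotation patch to generate a fresh event. The handler and annotation names follow the upstream pattern from memory and should be treated as assumptions, not kuryr's exact code.

```python
"""Sketch of dropping unannotated pod events instead of retrying.

When the vif handler later annotates the pod, that patch produces a new
MODIFIED event, which this handler then processes normally.  Names below
(on_present hook, annotation key) are assumptions for illustration."""

VIF_ANNOTATION = 'openstack.org/kuryr-vif'  # assumed annotation key


class PodLabelHandler:
    """Watches pod events and syncs security group rules for labels."""

    def on_present(self, pod):
        annotations = pod['metadata'].get('annotations', {})
        if VIF_ANNOTATION not in annotations:
            # The vif handler has not processed this pod yet.  Drop the
            # event: annotating the pod will trigger another event and
            # the labels will be handled then.  This avoids the
            # ResourceNotReady retry loop that was restarting the
            # controller.
            return
        self._update_label_security_groups(pod)

    def _update_label_security_groups(self, pod):
        # Placeholder for the real work: compare the pod's current labels
        # with the recorded state and update Neutron SG rules accordingly.
        pass
```

This sidesteps the retry loop while keeping the two handlers independent, though it leaves open ltomasbo's broader concern that one handler's progress still depends on another handler's annotations.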
14:33:43 xD
14:34:12 dmellado: had network issues as well on the TLV (everyone did)
14:34:27 gcheresh: ouch, hope you can get it fixed
14:34:40 it just disconnected and now is back again
14:34:53 just wanted to ask whether you can push for the python-openshift rpms to get accepted
14:35:11 I pointed the links out earlier and they'll be available in the logs
14:35:23 besides that, is there anything you'd like to cover? ;)
14:39:28 All right, so it seems that was it for the day!
14:39:41 ltomasbo: celebdor let's continue the discussion on the kuryr channel
14:39:43 thanks for attending!
14:39:48 #endmeeting