14:00:31 <dmellado> #startmeeting kuryr
14:00:32 <openstack> Meeting started Mon Dec 17 14:00:31 2018 UTC and is due to finish in 60 minutes. The chair is dmellado. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:35 <openstack> The meeting name has been set to 'kuryr'
14:00:45 <dmellado> Hi everyone! Who's around here for today's kuryr meeting? ;)
14:00:52 <dulek> o/
14:01:01 <yboaron_> o/
14:01:03 <dmellado> #chair dulek
14:01:04 <openstack> Current chairs: dmellado dulek
14:01:05 <dmellado> #chair yboaron_
14:01:06 <openstack> Current chairs: dmellado dulek yboaron_
14:01:54 <dmellado> All right, so let's do a quick roll call
14:02:13 <dmellado> first of all, due to the Christmas holidays there won't be kuryr meetings on the 24th nor the 31st
14:02:14 * dulek is trying to summon celebdor as he might be able to give some insight on that kuryr-daemon mem usage.
14:02:26 <dmellado> I'll be sending an email as a reminder
14:02:36 <dmellado> dulek: could you give a brief summary about that?
14:02:41 <dmellado> on which gates is this happening?
14:02:43 <ltomasbo> o/
14:02:47 <dmellado> what's eating memory and so on?
14:02:48 <dulek> Oh maaan, we could wish ourselves happy holidays on the 24th and happy new year on the 31st. :D
14:02:52 <dmellado> #topic kuryr-kubernetes
14:03:02 <dmellado> dulek: lol
14:03:06 <dmellado> #chair ltomasbo
14:03:07 <openstack> Current chairs: dmellado dulek ltomasbo yboaron_
14:03:13 <dulek> Okay, let me explain then.
14:03:43 <dulek> So it's the issue of events being skipped and not passed to kuryr-controller, which is watching the K8s API.
14:03:56 <dulek> I've linked that to errors in kubernetes-api/openshift-master
14:04:05 <dulek> http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:07 <dmellado> like, random type events? or is it tied to specific ones?
14:04:09 <dulek> Here's an example.
14:04:16 <dmellado> #link http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:27 <dulek> dmellado: Like the updates on pods, services or endpoints.
14:04:45 <dulek> Okay, so then I've linked those issues with etcd warnings: http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-etcd.txt.gz
14:04:51 <maysams> o/
14:04:58 <celebdor> I am here
14:05:02 <celebdor> what's up
14:05:07 <dmellado> dulek: so that's linked with http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000906.html
14:05:08 <celebdor> sorry for the delay, I was in a meeting
14:05:14 <dulek> dmellado: Yes!
14:05:15 <dmellado> hiya folks
14:05:18 <dulek> And then with some insight from infra I've started looking at dstat files from those gate runs.
14:05:45 <dulek> What immediately caught my attention is the column that shows the process using the most memory.
14:05:55 <dmellado> celebdor: dulek has been seeing odd things on dstat, you might have some experience with that
14:06:00 <dulek> It's kuryr-daemon with a value of 2577G.
14:06:03 <dmellado> how much memory was the process eating, dulek? xD
14:06:22 <celebdor> that looks far too high
14:06:31 <dulek> http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-dstat.txt.gz - see for yourself, it's the third column from the end.
14:06:36 <celebdor> dulek: only on the gates or locally as well?
14:06:47 <dulek> celebdor: I'll be checking that after the meeting.
14:06:51 <celebdor> ok
14:06:53 <celebdor> thanks
14:06:54 <dulek> So first of all I don't understand why OOM is not happening.
14:07:03 <dmellado> dulek: is this also tied to specific cloud providers on the infra?
14:07:12 <dmellado> i.e. it happens on RAX but not on OVH and so on
14:07:13 <dmellado> ?
14:07:25 <dulek> dmellado: Not really, no. The logs I'm sending you now are from INAP.
14:07:32 <dulek> Second - once the kuryr-daemon is started it drains all the swap.
14:07:55 <dulek> So this might make some sense - the issues in etcd might be due to some IO delays.
14:08:41 <celebdor> dulek: s/drains/occupies?
14:09:27 <ltomasbo> dulek, didn't we have a healthcheck that was restarting kuryr-daemon if it was eating more than X MBs?
14:09:43 <dulek> celebdor: Well, swap usage rises. But now that I look again it's only using up to 3 GB of it, so not all.
14:09:49 <dulek> ltomasbo: That's only on containerized.
14:10:17 <celebdor> dulek: really?
14:10:33 <celebdor> a docker leak then?
14:10:39 <yboaron_> dulek, Q: Does finding this string in the etcd log file mean that etcd lost requests? 'avoid queries with large range/delete range!' ?
14:10:41 <celebdor> that would be unusual, I suppose
14:10:50 <dulek> celebdor: In that case kuryr-daemon runs as a process, not a container.
14:11:04 <dmellado> dulek: I guess he's referring to etcd
14:11:09 <dulek> yboaron_: Not 100% sure really.
14:11:19 <dmellado> but if kuryr-daemon is the process eating up the memory, then it doesn't apply
14:11:23 <dulek> dmellado: We run etcd as a process as well.
14:11:28 <celebdor> dulek: so does this happen on containerized or not? Not sure I understood
14:11:48 <dulek> celebdor: Ah, sorry. I saw it on non-containerized at the moment.
14:12:02 <dmellado> it could be due to the healthcheck
14:12:12 <dulek> But I can check…
14:12:18 <dmellado> it'd be interesting to see if we can replicate it if we disable that healthcheck on containerized
14:12:58 <dmellado> dulek: did you freeze an upstream VM to try to fetch some data?
14:13:14 <dulek> Looks the same in containerized: kuryr-daemon 2589G
14:13:33 <dulek> celebdor: Note that we run kuryr-daemon as a privileged container, so it's in the host's ps namespace, right?
14:13:45 <dulek> dmellado: Not yet, I'll try this locally first.
14:14:12 <dulek> Okay, so if nobody has an idea, I'll try to poke at it a bit more.
14:14:23 <dulek> Hopefully it's some dstat quirk
14:14:38 <dulek> But in any case I guess it's worth looking at.
14:14:38 <dmellado> dulek: in any case let me know if you get stuck trying this locally
14:14:50 <dmellado> and we'll take a look together in a VM either upstream or on rdocloud
14:14:56 <celebdor> dulek: indeed
14:15:10 <dulek> Okay, cool!
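
For checking dulek's finding locally, a minimal sketch (not part of kuryr-kubernetes) for comparing the dstat figure with the daemon's own memory counters. It assumes psutil is installed and that the daemon shows up with 'kuryr-daemon' in its command line; reading swap/USS from /proc may require root.

    #!/usr/bin/env python3
    """Cross-check what dstat reports against a process's memory counters.

    Hypothetical helper, not part of kuryr-kubernetes: assumes psutil is
    installed and the daemon is identifiable by 'kuryr-daemon' in its cmdline.
    """
    import psutil


    def report_memory(name_fragment='kuryr-daemon'):
        for proc in psutil.process_iter(['pid', 'cmdline']):
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if name_fragment not in cmdline:
                continue
            try:
                # memory_full_info() reads /proc/<pid>/smaps; may need root.
                mem = proc.memory_full_info()
            except psutil.AccessDenied:
                print('pid=%d: access denied, rerun as root' % proc.info['pid'])
                continue
            print('pid=%d rss=%.1f MiB vms=%.1f MiB swap=%.1f MiB' % (
                proc.info['pid'],
                mem.rss / 2 ** 20,
                mem.vms / 2 ** 20,
                getattr(mem, 'swap', 0) / 2 ** 20))


    if __name__ == '__main__':
        report_memory()
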
14:15:30 <dmellado> thanks for bringing this up, dulek
14:16:50 <dmellado> so, I also wanted to let you know
14:17:01 <dmellado> that I've been working on getting python-openshift packaged for Fedora
14:17:07 <dmellado> which involves a hell of a lot of bureaucracy
14:17:20 <dmellado> and being sponsored into the fedora-packagers group
14:17:32 <celebdor> I thought it's just a rpmspec file
14:17:36 <dmellado> I'm not sure if besides me there's anyone on the team who also got those privileges
14:17:46 <dmellado> but if you could vote for this I'd be happy
14:17:58 <dmellado> celebdor: it involved a dependency hell xD
14:18:01 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1654833
14:18:02 <openstack> bugzilla.redhat.com bug 1654833 in Package Review "Review Request: python-kubernetes - add python-kuberenetes to EPEL 7" [Medium,New] - Assigned to nobody
14:18:04 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1654835
14:18:04 <openstack> bugzilla.redhat.com bug 1654835 in Package Review "Review Request: python-google-auth - Add python-google-auth to EPEL 7" [Medium,Post] - Assigned to zebob.m
14:18:06 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659281
14:18:06 <openstack> bugzilla.redhat.com bug 1659281 in Package Review "Review Request: python-dictdiffer - Dictdiffer is a module that helps you to diff and patch dictionaries" [Medium,New] - Assigned to nobody
14:18:08 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659282
14:18:09 <openstack> bugzilla.redhat.com bug 1659282 in Package Review "Review Request: python-string_utils - A python module containing utility functions for strings" [Medium,New] - Assigned to nobody
14:18:10 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659286
14:18:11 <openstack> bugzilla.redhat.com bug 1659286 in Package Review "Review Request: python-openshift - Python client for the OpenShift API" [Medium,New] - Assigned to nobody
14:18:44 <dmellado> celebdor: ltomasbo maysams yboaron_ any topics from your side?
14:19:11 <celebdor> no
14:19:19 <celebdor> updates from NP?
14:19:25 <yboaron_> neither from my side
14:19:32 <maysams> dmellado: celebdor: yep
14:19:37 <dmellado> go for it maysams
14:19:54 <ltomasbo> I have a topic
14:20:04 <ltomasbo> https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808506
14:20:06 <openstack> Launchpad bug 1808506 in kuryr-kubernetes "neutron-vif requires admin rights" [Medium,New] - Assigned to Luis Tomas Bolivar (ltomasbo)
14:20:29 <ltomasbo> sorry maysams, go ahead...
14:20:34 <maysams> I am working on updating the sg rules in the CRD when pods are created/deleted/updated
14:20:35 <maysams> https://review.openstack.org/#/c/625588
14:20:42 <maysams> ltomasbo: it's okay
14:20:54 <ltomasbo> great! I'll take a look asap!
14:20:56 <celebdor> ltomasbo: that's BM only
14:21:17 <celebdor> maysams: you also reported some bug, right?
14:21:37 <maysams> celebdor: yes
14:21:39 <maysams> https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808787
14:21:41 <openstack> Launchpad bug 1808787 in kuryr-kubernetes "Pod Label Handler tries to handle pod creation causing controller restart" [Undecided,New]
14:21:41 <ltomasbo> celebdor, it is still broken... and it raises another concern, which is that we are using the wrong type of tenant both on devstack and on the gates
14:22:17 <yboaron_> ltomasbo, what do you mean by 'under a normal tenant' in the bug description, do we use a tenant with extra permissions in our gates?
14:22:18 <dmellado> ltomasbo: is that related to the devstack tenant permissions you told me about?
14:22:30 <celebdor> ltomasbo: that is my biggest concren
14:22:33 <celebdor> *concern
14:22:39 <maysams> celebdor: basically the pod label handler is waiting for the vif handler to handle the event, but due to resource not ready exceptions this will not work
14:22:40 <ltomasbo> yboaron_, unfortunately, yes
14:23:05 <yboaron_> ltomasbo, can you elaborate on that?
14:23:05 <celebdor> maysams: why wait and not drop?
14:23:09 <dmellado> ltomasbo: I'll need to spend some time investigating it
14:23:16 <celebdor> if it drops the event if it has no annotation
14:23:25 <ltomasbo> dmellado, it would be great if you can take a look, yes!
14:23:33 <celebdor> it will eventually get the event from the patch that the vif handler does
14:23:35 <dulek> celebdor: Hey, that makes sense.
14:23:42 <dmellado> ltomasbo: I'll sync with you after the meeting to work on the details, thanks!
14:23:42 <ltomasbo> dmellado, it is not only on the gates, but also on the k8s tenant that we create on devstack
14:23:52 <maysams> celebdor: indeed, makes sense
14:23:58 <celebdor> :O
14:24:11 <celebdor> let's try that then
14:24:14 <dmellado> incredible, something that celebdor says makes sense
14:24:17 <dmellado> xD
14:24:21 <maysams> celebdor: thanks :)
14:24:22 <celebdor> dmellado: I feel the same way
14:24:24 <celebdor> believe me
14:24:28 <dmellado> something's going really odd
14:24:33 <ltomasbo> maysams, celebdor: yep, it is easy to just drop the event, but worth keeping in mind that we have a problem there
14:24:34 <dmellado> :D
14:24:47 <dmellado> ltomasbo: maysams we'll treat that as a bug
14:24:55 <dmellado> so if you haven't already, please open it in LP
14:24:58 <celebdor> ltomasbo: why is this a problem?
14:25:09 <ltomasbo> dmellado, it is a bug we hit on network policies, but it is not a bug produced by network policies...
14:25:11 <celebdor> the pod label handler only cares about annotated pods
14:25:14 <celebdor> I think it is fair
14:25:16 <celebdor> on the other hand
14:25:35 <ltomasbo> celebdor, if we have different handlers listening to the same events, why is one blocked by the other?
14:25:38 <celebdor> ltomasbo: dmellado: we need the tenant to be a regular neutron policy tenant
14:25:41 <celebdor> like right now
14:25:54 <maysams> ltomasbo is right
14:26:02 <ltomasbo> celebdor, I agree with the workaround
14:26:08 <dmellado> ltomasbo: totally agree
14:26:11 <ltomasbo> celebdor, but that is just avoiding the problem, not fixing it
14:26:13 <celebdor> ltomasbo: well, I don't think it's that odd for one handler to depend on stuff done by the other
14:26:27 <ltomasbo> celebdor, no no, I don't mean the 'depends on'
14:26:29 <celebdor> just that the current handler design makes it weird
14:26:44 <ltomasbo> celebdor, what would happen in the case that one handler is not depending on the other?
14:27:12 <ltomasbo> celebdor, if that is the design, it does not make a lot of sense to have different handlers... perhaps simpler to have just one...
14:27:46 <dmellado> ltomasbo: but I think having several ones is better for the granularity
14:27:54 <dmellado> if by any chance we have handlers a and b
14:27:58 <celebdor> ltomasbo: I don't understand the "what would happen if they do not depend on each other?"
14:28:01 <dmellado> and b depends on a then it's fine
14:28:16 <dmellado> but maybe some user would like to use just 'a' and it would use less memory
14:28:17 <yboaron> sorry guys, I was disconnected for the last few minutes... network issues
14:28:22 <celebdor> they will just annotate the resource (if they need to) in some random order
14:28:28 <celebdor> what's the prob with that
14:28:30 <celebdor> ?
14:29:21 <ltomasbo> celebdor, imagine the vif handler was not doing pod annotations and other neutron actions
14:29:27 <celebdor> ok
14:29:35 <ltomasbo> celebdor, it would never get executed due to the other handler being scheduled first
14:30:59 <ltomasbo> anyway, we can discuss this on the kuryr channel after the meeting
14:31:06 <dmellado> ltomasbo: yep, let's do that
14:31:13 <celebdor> ltomasbo: is the handler now serializing per resource?
14:31:23 <celebdor> ok, ok
14:31:27 <celebdor> we can take it up later
14:32:20 <dmellado> gcheresh: itzikb juriarte
14:32:22 <dmellado> o/
14:32:28 <dmellado> anything you'd like to add to the meeting?
14:32:45 <dmellado> I'm pretty sure you'd be interested in the python-openshift thing
14:33:38 <dmellado> https://media1.tenor.com/images/184a80bf791b211b0e2f3f02badab20e/tenor.gif?itemid=12469549
14:33:40 <dmellado> awesome!
14:33:43 <dmellado> xD
14:34:12 <gcheresh> dmellado: had network issues as well in TLV (everyone did)
14:34:27 <dmellado> gcheresh: ouch, hope you get that fixed
14:34:40 <gcheresh> it just disconnected and now it's back again
14:34:53 <dmellado> just wanted to ask whether you can push for the python-openshift rpms to get accepted
14:35:11 <dmellado> I pointed the links out earlier and they'll be available in the logs
14:35:23 <dmellado> besides that, is there anything you'd like to cover? ;)
14:39:28 <dmellado> All right, so it seems that was it for the day!
14:39:41 <dmellado> ltomasbo: celebdor let's continue the discussion on the kuryr channel
14:39:43 <dmellado> thanks for attending!
14:39:48 <dmellado> #endmeeting
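
As a footnote to the pod label handler discussion above, a minimal sketch of the approach celebdor suggested: drop pod events that do not yet carry the VIF annotation instead of waiting for the vif handler. The class name, method name, and annotation key below are illustrative only and not the actual kuryr-kubernetes code; the point is that the vif handler's later annotation patch triggers a new event, so nothing is lost by dropping the early one.

    # Illustrative sketch only; the real handlers live under
    # kuryr_kubernetes/controller/handlers/ and may differ in names and keys.

    VIF_ANNOTATION = 'openstack.org/kuryr-vif'  # assumed key, for illustration


    class PodLabelHandlerSketch(object):
        """Reacts to pod updates, but only for pods that are already wired."""

        def on_present(self, pod):
            annotations = pod['metadata'].get('annotations', {})
            if VIF_ANNOTATION not in annotations:
                # No VIF yet: drop the event instead of raising
                # ResourceNotReady. The vif handler will patch the pod with
                # the annotation later, and that patch produces a new
                # MODIFIED event which lands here again.
                return
            self._update_security_group_rules(pod)

        def _update_security_group_rules(self, pod):
            # Placeholder for the logic that syncs SG rules for the pod.
            pass
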