Tuesday, 2020-09-15

<openstackgerrit> Michael Johnson proposed openstack/octavia master: Add a requirements.txt check job  https://review.opendev.org/751925 00:18
<johnsom> If it's working, that will fail. 00:19
<openstackgerrit> Michael Johnson proposed openstack/octavia master: Add a requirements.txt check job  https://review.opendev.org/751925 00:23
<johnsom> Fix that so both tests run even if one fails. 00:23
<openstackgerrit> Michael Johnson proposed openstack/octavia master: Add a requirements.txt check job  https://review.opendev.org/751925 00:29
<openstackgerrit> Adam Harwell proposed openstack/octavia master: Use routed network filter if it exists  https://review.opendev.org/706153 01:04
<openstackgerrit> zhangchun proposed openstack/octavia-tempest-plugin master: Remove install unnecessary packages  https://review.opendev.org/751949 02:12
<openstackgerrit> zhangchun proposed openstack/octavia-dashboard master: Remove install unnecessary packages  https://review.opendev.org/751953 02:16
<cgoncalves> haleyb, allowed cidrs were not tested against the OVN provider driver because the driver does not support it. support was added only to amphora. if you're asking whether it was tested against OVN ML2, I have not tested that either but it should just work as the amphora driver is just calling Neutron APIs so the ML2 plugin shouldn't matter 06:02
<openstackgerrit> Ann Taraday proposed openstack/octavia master: Remove unnecessary joinedload  https://review.opendev.org/740994 07:46
<openstackgerrit> Arkady Shtempler proposed openstack/octavia-tempest-plugin master: Adding failover test. Send HTTP traffic while MASTER Amphore is rebooted. BACKUP Amphore should serve the traffic.  https://review.opendev.org/751617 11:37
<openstackgerrit> Ann Taraday proposed openstack/octavia master: Add experimental amphorav2 jobs  https://review.opendev.org/737993 13:08
<dulek> cgoncalves, johnsom: Hi there! We've recently been seeing maaany failures in our gate and I've just correlated that with OOM killing Amps and our containers. 13:42
<dulek> My bet is that it's the Kuryr apps that are leaking, but I figured it won't hurt to ask whether anything changed in that regard on the Octavia side? 13:43
<cgoncalves> dulek, nothing comes to mind 13:43
<dulek> cgoncalves: Okay, thanks! 13:45
<johnsom> dulek: I haven’t seen anything either. The only thing that comes to mind is the recent e-mail chain on the discuss list about keystone middleware leaking memcached connections via neutron API. 13:59
<dulek> That would mean our tests were using just the right amount of memory until Friday and that little leak made it overflow the limits. A bit unlikely, I guess. ;) 14:01
<haleyb> cgoncalves: ack, for now i'm just having it fail as not implemented, will have to look at the amphora driver to see what it does 14:26
<openstackgerrit> Michael Johnson proposed openstack/octavia master: Add a requirements.txt check job  https://review.opendev.org/751925 15:32
<dulek> cgoncalves, johnsom: Uhm, probably a stupid one… So we lived with the assumption that we need barbican to run Octavia, yet I don't see it on your gates. Can I just remove it from ours? 15:36
<cgoncalves> dulek, we do have a  scenario -barbican job 15:37
<johnsom> dulek You only need barbican if you are doing TLS offload. 15:37
<johnsom> We have a special job that runs with barbican to test TLS offload 15:37
<dulek> johnsom: Hm, we're creating a listener and pool with HTTPS protocol - is that TLS offload already? 15:39
<johnsom> No, that would be HTTPS pass through. It's the TERMINATED_HTTPS protocol. (terminology we had to inherit from neutron-lbaas) 15:40
<dulek> Alright, then I'm trying to remove it. Thanks! 15:40
<johnsom> Oh man, the pypi cache/CDN issue is still happening.... 15:43
<dulek> True, it's the third stacked fatal issue on Kuryr CI too. :/ 15:45
<johnsom> dulek Do you have a link to a job that ran out of memory? I can look through some logs too if you would like 15:50
<dulek> johnsom: Sure, this is the simplest one: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e65/751263/3/check/kuryr-kubernetes-tempest/e65494e/controller/logs/ 15:57
<dulek> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e65/751263/3/check/kuryr-kubernetes-tempest/e65494e/controller/logs/syslog.txt - you can find oom instances here. 15:58
<johnsom> Ok, cool, I will have a browse 15:58
<dulek> johnsom: But I tend to bet that it's Kuryr's fault as runs with lower-constraints.txt used to build Kuryr containers seem to work just fine. And the requirements are the only difference here. 16:00
<dulek> I'd have a lot more data points, but the PyPI issue interfered. 16:00
-openstackstatus- NOTICE: Our PyPI caching proxies are serving stale package indexes for some packages. We think because PyPI's CDN is serving stale package indexes. We are sorting out how we can either fix or workaround that. In the meantime updating requirements is likely the wrong option. 16:10
<openstackgerrit> Carlos Goncalves proposed openstack/octavia-lib master: Add alpn_protocols to the pool data model  https://review.opendev.org/752094 18:56
<openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Add ALPN support for TLS-enabled pools  https://review.opendev.org/752095 18:56
<openstackgerrit> Carlos Goncalves proposed openstack/python-octaviaclient master: Add ALPN support for pools  https://review.opendev.org/752096 18:57
*** zzzeek has joined #openstack-lbaas 19:03
<johnsom> dulek So, looking through the logs I see that the test is running out of memory. At one point the OOM killer starts killing the amphora, which causes a failover process, which consumes more RAM during that process. It looked to me like etcd and mysql were high consumers. I may dig a little more as the logs are slightly inconsistent. 19:36
<johnsom> dulek One thing I noticed is the job is running with the default thread settings, which will scale to the core count. For your tests you don't need that much, so I have proposed a patch to limit the threads used in your test. That should save you some RAM. Certainly not the root cause, but maybe a help. 19:37
<rm_work> johnsom / cgoncalves: https://review.opendev.org/#/c/751111/ Carlos' comments here would actually be a CHANGE right? a good one, but not even what was done before? 20:23
<johnsom> Yeah, that is a big change. Currently provisioning_status does not have a DEGRADE state to my knowledge. Only operating status 20:24
<johnsom> rm_work I commented on that. How are those two patches going? Are things better? 21:13
<johnsom> It seems like we have some time to discuss this patch as nothing is going to pass the gates until this CDN issue is resolved 21:14
<rm_work> having issues getting them deployed 21:48
<rm_work> due to ... CI stuff 21:48
<rm_work> but they seemed to fix issues in QE 21:48
<johnsom> Whelp, we (infra folks) found what appears to be the issue in a full 12TB disk on a pypa mirror. Now it's just cleaning up the negative caches, etc. 23:23
<sorrison> johnsom: Just wondering if you have a smart trick to fix an LB stuck in error due to https://storyboard.openstack.org/#!/story/2008099 23:52
<sorrison> I was wondering if I can fudge the DB somehow to get it to failover 23:52

