*** tosky_ is now known as tosky | 00:25 |
fungi | dpawlik: amazon has restructured how they're covering the elasticsearch donation, and for the past three months we've seen increasing "fargate" memory and cpu utilization. looks like it approximately doubled between september and november (31893 GB-hours/15946 vCPU-hours to 58906 GB-hours/29453 vCPU-hours) which puts us over our allotted spend... any guesses what might be causing that | 13:37 |
fungi | increase in resource utilization? | 13:37 |
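The two utilization figures quoted above can be compared directly; a minimal sanity check of the "approximately doubled" claim in plain Python, using only the numbers from the message:

```python
# Sanity check of the "approximately doubled" claim, using only the
# Fargate utilization figures quoted in the message above.
september = {"gb_hours": 31893, "vcpu_hours": 15946}
november = {"gb_hours": 58906, "vcpu_hours": 29453}

for metric in ("gb_hours", "vcpu_hours"):
    growth = november[metric] / september[metric]
    print(f"{metric}: {september[metric]} -> {november[metric]} ({growth:.2f}x)")

# Both dimensions grew by the same ~1.85x factor, which points to more
# (or longer-running) tasks rather than a change in task sizing.
```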
dpawlik | fungi: hey, dunno, but maybe melwitt ^^ will know | 13:39 |
dpawlik | I guess the Nova team is mostly using OpenSearch | 13:39 |
dpawlik | if it's still unknown, maybe it would be time to remove some visualizations. If someone has been checking visualizations frequently, that might be it. | 13:41 |
dpawlik | and the reason I pointed to melwitt is that there was a question on openstack-nova: "do you know if there is a time limit of some kind on the opensearch log indexing? I generally see around 10 days worth of search results but the logs are actually kept for 30 days? ^" That could also be taking some resources | 13:47 |
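A minimal sketch of how that retention question could be checked directly against the cluster, assuming the standard `_cat/indices` API is reachable; the URL, credentials, and `logstash-*` index pattern here are placeholders, not the real deployment's values:

```python
# Hypothetical sketch: list the logstash-* indices with their creation
# dates, document counts, and sizes to see how many days of logs are
# actually searchable. URL, credentials, and index pattern are
# placeholders for illustration only.
import requests

OPENSEARCH_URL = "https://opensearch.example.org:9200"  # placeholder
AUTH = ("readonly-user", "not-a-real-password")         # placeholder

resp = requests.get(
    f"{OPENSEARCH_URL}/_cat/indices/logstash-*",
    params={
        "format": "json",
        "h": "index,creation.date.string,docs.count,store.size",
        "s": "creation.date",
    },
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()

for idx in resp.json():
    print(idx["index"], idx["creation.date.string"],
          idx["docs.count"], idx["store.size"])
```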
*** elvira1 is now known as elvira | 14:03 |
fungi | i wonder if there are logs to indicate whether anything new is running constant queries in an automated fashion | 14:37 |
fungi | dpawlik: looks like fargate is some ephemeral ecs+eks based task runner service. do you have any idea what part of the toolchain might be relying on that? | 17:12 |
fungi | "AWS Fargate is a pay-as-you-go service that lets you focus on building applications without managing servers. It works with Amazon ECS and Amazon EKS to run tasks in dedicated runtime environments, isolate them from other workloads, and optimize for cost and security." | 17:13 |
fungi | dpawlik: looking at what we have in dns, the logstash.logs.openstack.org "server" which is a cname to logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com looks like it could be what's on fargate. if memory serves, tom built a log ingestion service in amazon which the ci log processor stuffs data into. if so, that would point to a recent increase in log data volume. is | 17:54 |
fungi | that something you have stats to confirm maybe? | 17:54 |
fungi | melwitt: ^ this also might be of interest to you | 17:58 |
fungi | if that's what's driving up fargate usage, we'll need to look into ways to reduce the log data we're feeding into opensearch so we don't exceed this month's allocation and the foundation doesn't get charged for the overage | 17:59 |
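The CNAME theory above is easy to confirm from public DNS; a small sketch using dnspython (an assumed dependency, not part of the toolchain discussed here):

```python
# Resolve the public CNAME for the logstash ingestion endpoint to see
# where it actually points; an ELB hostname in us-east-1 would be
# consistent with a Fargate-backed service behind a load balancer.
import dns.resolver  # requires the dnspython package

answer = dns.resolver.resolve("logstash.logs.openstack.org", "CNAME")
for rdata in answer:
    print("logstash.logs.openstack.org ->", rdata.target)
```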
melwitt | fungi: hm, ok. I personally haven't been doing anything different than I usually do with opensearch. definitely no automated queries or anything like that | 21:13 |
melwitt | I can't think of why the usage is drastically different off the top of my head. but I'll pay attention to it and point out anything that might be related if I come across something | 21:16 |
fungi | melwitt: my best guess right now is that something happened a couple of months ago to significantly increase the number of log lines we're feeding into it | 21:19 |
fungi | seems less likely to be an increase in query volume, more likely to be an increase in load on the ingestion systems | 21:19 |
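If the ingestion-volume theory is right, it should show up as a rising per-day document count in the cluster, at least within the retention window; a hedged sketch, assuming the usual logstash index pattern and `@timestamp` field:

```python
# Hypothetical sketch: aggregate indexed documents per day to look for
# an upward trend in ingestion volume. The index pattern and timestamp
# field follow common logstash conventions and are assumptions; the
# URL and credentials are placeholders.
import requests

OPENSEARCH_URL = "https://opensearch.example.org:9200"  # placeholder
AUTH = ("readonly-user", "not-a-real-password")         # placeholder

query = {
    "size": 0,
    "aggs": {
        "docs_per_day": {
            "date_histogram": {"field": "@timestamp",
                               "calendar_interval": "day"}
        }
    },
}
resp = requests.post(f"{OPENSEARCH_URL}/logstash-*/_search",
                     json=query, auth=AUTH, timeout=60)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["docs_per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```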
melwitt | fungi: I did do some tests with a DNM patch that added sql query logging a month or so ago. that ran a few times, I wonder if that could be it? if so, I wonder if there's any way to tag a patch as something to not send logs to opensearch for | 21:21 |
fungi | i doubt a few tests with one change would be increasing it, but i really don't know the first thing about that system so it's hard for me to say that with certainty | 21:24 |
fungi | basically, amazon bills two sets of line items against the credits they donate to us: managed opensearch, and "fargate" | 21:25 |
fungi | based on dns cnames we added and some vague recollections of the missing pieces which a third party developed to handle log ingestion, i think the latter is what's using fargate resources. it looks like queries are hitting opensearch directly and so probably don't factor into the fargate utilization | 21:27 |
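Which of those two line items is actually growing could be confirmed from Cost Explorer on the billing account; a rough boto3 sketch, where the date range is a placeholder and Cost Explorer access on that account is an assumption:

```python
# Hypothetical sketch: pull monthly unblended cost grouped by service to
# compare the Fargate and managed OpenSearch line items over time.
# The date range is a placeholder; adjust it to the months in question.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-09-01", "End": "2022-12-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for period in resp["ResultsByTime"]:
    print(period["TimePeriod"]["Start"])
    for group in period["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {service}: ${amount:.2f}")
```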
fungi | unfortunately, while a stipulation of agreeing to coordinate donated resources for this service was that foundation staff and opendev sysadmins shouldn't need to know how it works or what's being used, it's the foundation that will potentially have to pay for overages, and i'm being asked to make sure that doesn't happen | 21:29 |
fungi | one way i can make sure that doesn't happen is to turn it off before it exceeds the allocated donation, but i'd rather have help finding a less painful resolution | 21:30 |
melwitt | yeah ... I wonder if it could be something as simple as a bunch of rechecks? like has there been a significant increase in the number of jobs run? | 21:37 |
fungi | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&viewPanel=16&from=now-90d&to=now | 21:39 |
fungi | doesn't really seem like it | 21:39 |
melwitt | ack | 21:43 |
*** haleyb is now known as haleyb|out | 23:53 |