*** jph5 is now known as jph1 | 08:19 | |
*** jph1 is now known as jph | 08:20 | |
dpawlik | fungi: so normally we can downscale the OpenSearch. Some time ago we disabled DEBUG messages, so 25% of the cluster is not used (I mean disk space) | 08:53 |
dpawlik | fungi: I can remove a few visualizations. Maybe someone is periodically watching them, or maybe some of them have a wrong query that takes more resources than expected | 08:54 |
dpawlik | fungi: I don't have the privileges on AWS to say anything more. My account is very very limited. | 08:54 |
*** elodilles_pto is now known as elodilles | 09:06 | |
fungi | dpawlik: yeah, it doesn't look like we're short on disk space or heavy query volume, it looks like that gets covered by the opensearch line item in the bill which hasn't really changed. the more i think about it, the more likely it is that the volume of log lines getting fed to that logstash.logs.openstack.org endpoint is where all this fargate service usage is | 13:59 |
fungi | do you have any visibility into how much data the job log processor you manage has been shipping to it, and whether that's increased significantly in the past few months? | 14:00 |
dpawlik | checking | 14:08 |
dpawlik | so I don't have any statistics in Logsender for how much data it sends, but I can see the "Searchable documents" stats in the AWS OpenSearch cluster health | 14:14 |
dpawlik | and it looks normal | 14:14 |
dpawlik | but when I check other metrics in cluster health, one is strange | 14:14 |
dpawlik | JVM garbage collection => Young collection and Young collection time => they always grow, never go down | 14:15 |
dpawlik | maybe we can try an OpenSearch upgrade from 2.5 (the current one) to 2.10 | 14:16 |
dpawlik | so all services would be restarted. That should help | 14:16 |
dpawlik | and if it does, we can send AWS feedback | 14:16 |
jrosser | isn't fargate generally more expensive than regular ec2 instances? | 14:17 |
dpawlik | got something. Checking "instance health" for the last 1h: JVM memory pressure for the selected instance is currently 50.6; the minimum across all instances is 26.8, the maximum across all instances is 74.2, and the selected instance's range is 26.8 to 74.2 | 14:18 |
dpawlik | when I check the last 2 weeks, it's at 73.9 | 14:19 |
fungi | dpawlik: if it's that we're sending a ton of data to the log ingestion system that's getting filtered out/discarded, that could be increasing processing and memory there without being reflected in the amount of data in opensearch (if it's data that's getting discarded and not indexed into opensearch) | 14:19 |
dpawlik | same with data nodes. The average value from the last 2 weeks is higher than from 1 hour ago | 14:20 |
fungi | jrosser: yes, i think that's part of the problem. the contractor who developed the log ingestion system for this wanted it to be entirely stateless so it would require no maintenance | 14:20 |
dpawlik | oh my | 14:21 |
jrosser | that might be like doubling the cost vs. a long term discount on an ec2 | 14:21 |
fungi | so there are no virtual servers, no container management, it's in fargate so that there won't need to be a sysadmin basically | 14:21 |
jrosser | right | 14:22 |
fungi | but it also means when it blows up in utilization there's nobody to look into why | 14:22 |
dpawlik | I don't have enough permissions to see more, maybe you or clarkb do | 14:24 |
dpawlik | from the logscraper/logsender side: everything is working normally | 14:24 |
dpawlik | I will increase the "sleep" time for logsender, because it does not need to run every 5 seconds (but that parameter was set a year ago) | 14:25 |
fungi | i'm not sure we have that access either, i can't even trigger the ssl cert renewals for opensearch, but i'll see if it looks like i have access to something in that regard | 14:26 |
dpawlik | fungi: first you need to do some AWS certification to touch something there <I'm joking> | 14:28 |
dpawlik | on the other hand, the whole situation started in September 2023, right? | 14:28 |
dpawlik | or in the last few weeks? | 14:28 |
dpawlik | or did they not mention anything? | 14:29 |
fungi | dpawlik: ttx was looking at the bill. aws switched our free credits from yearly to monthly, so exceeding the monthly allocation will now result in overage charges. these were the numbers he reported: | 14:32 |
fungi | AWS Fargate (September): Memory=31893 GB.hours, vCPU=15946 vCPU.hours | 14:33 |
fungi | AWS Fargate (November): Memory=58906 GB.hours, vCPU=29453 vCPU.hours | 14:33 |
fungi | the billing interface also lists usage for "Amazon OpenSearch Service ESDomain" (general purpose provisioned storage and m6g.xlarge.search instance hours) | 14:35 |
fungi | that's separate from the fargate stuff, which is what leads me to believe the log ingestion system tom developed is what's responsible for the separate fargate charges | 14:35 |
fungi | he also noted, "Looking at December, on day 12 we are at 21833 GB.hours and 10916 vCPU.hours" (for fargate charges), so consistent with the usage we saw for november if you project it out through the end of the month | 14:38 |
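(As a quick sanity check of fungi's "project it out" remark, the arithmetic can be spelled out; this is just a linear extrapolation of the day-12 December figures quoted above, nothing beyond that.)

```python
# Extrapolate December's Fargate usage (reported at day 12) to a full 31-day
# month and compare it with the reported November totals.
dec_mem_gb_hours = 21833    # December, first 12 days
dec_vcpu_hours = 10916
nov_mem_gb_hours = 58906    # November totals
nov_vcpu_hours = 29453

projected_mem = dec_mem_gb_hours / 12 * 31    # ~56,400 GB.hours
projected_vcpu = dec_vcpu_hours / 12 * 31     # ~28,200 vCPU.hours

print(f"projected December memory: {projected_mem:.0f} GB.hours vs November {nov_mem_gb_hours}")
print(f"projected December vCPU:   {projected_vcpu:.0f} vCPU.hours vs November {nov_vcpu_hours}")
```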
dpawlik | so to understand the whole situation: the AWS fargate service is available at the address https://opensearch.logs.openstack.org and then it redirects the traffic to the opensearch host logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com , right? | 14:43 |
fungi | we have a logstash.logs.openstack.org cname to logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com which i think is related to the fargate resources | 14:46 |
fungi | is that where the ci log processor sends its data? | 14:47 |
fungi | i think opensearch.logs.openstack.org goes to the opensearch service, which is billed separately from fargate usage | 14:47 |
dpawlik | it sends to: opensearch.logs.openstack.org | 14:51 |
fungi | i've signed into my aws account and am poking around now, though most of the panels say "access denied" | 14:51 |
dpawlik | yay :D | 14:51 |
dpawlik | so you know my pain | 14:51 |
fungi | in particular, i do not have access to "applications" or "cost and usage" | 14:52 |
fungi | i also don't have access to "security" so probably can't change that | 14:52 |
dpawlik | oh my. I was hoping that you and clarkb had access, according to Reed's messages | 14:53 |
fungi | granted there was never any expectation that i would be responsible for managing any of this, so the lack of access is expected | 14:53 |
dpawlik | but not you Reed :P | 14:53 |
fungi | oh, his name was reed not tom. i kept saying tom | 14:54 |
fungi | dunno why i had that name stuck in my head | 14:54 |
dpawlik | hehe | 14:54 |
dpawlik | his name was Reed | 14:54 |
fungi | (not to be confused with the reed in this channel, who is someone else entirely) | 14:54 |
dpawlik | oh my, Reed you will be pinged a few times :P | 14:54 |
reed | yes :) | 14:54 |
fungi | sorry! | 14:54 |
dpawlik | fungi: I'm not able to reach OpenSearch (logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com) from the logscraper host | 14:56 |
dpawlik | so all the traffic needs to go through opensearch.logs.openstack.org | 14:56 |
dpawlik | or something needs to be changed | 14:56 |
dpawlik | or I need to go get some AWS cert | 14:56 |
ttx | I have access to the top IAM account so I can probably grant rights (just have no idea how) | 14:56 |
fungi | okay, it's possible logstash.logs.openstack.org is just cruft in dns | 14:56 |
dpawlik | or, wait for AWS to provide some AI that will do what we ask | 14:57 |
fungi | i'm trying to see if maybe i'm in the wrong dashboard view and need to be looking at an organization, but the "organization" view also tells me my access is denied | 14:58 |
ttx | but yeah, the managed opensearch service itself runs on classic EC2 instances, but there is something that consumes Fargate resources (a container service) and that doubled its footprint mid-October (and has been fairly constant since) | 14:58 |
dpawlik | ah fungi, you sent me the wrong address and I did not verify it :D | 14:58 |
dpawlik | so direct OpenSearch url is: search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com | 14:59 |
dpawlik | so I'm able to change that in logsender | 14:59 |
*** d34dh0r5- is now known as d34dh0r53 | 14:59 | |
fungi | i honestly have no idea what is the "right" or "wrong" hostname, all i know is what we have cnames for in openstack.org's dns | 14:59 |
fungi | search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com is what opensearch.logs.openstack.org is cname'd to | 15:00 |
dpawlik | that makes sense: logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com ("loadb") goes to fargate, whereas search-blablabla goes directly to OpenSearch | 15:00 |
dpawlik | hm | 15:01 |
fungi | it's possible the other logstash.logs.openstack.org cname is leftover cruft from earlier iterations of work on the log ingestion in aws | 15:01 |
fungi | or maybe the api you're sending to on opensearch.l.o.o is then forwarding the workload to logstash.l.o.o for processing | 15:02 |
dpawlik | logsender can send data directly to the OpenSearch and skip fargate \o/ | 15:03 |
dpawlik | fargate or whatever it is | 15:04 |
fungi | yeah, i have no idea what exactly the role of the log processor there is (my understanding is that it accepted data using a logstash protocol and then fed it into opensearch) | 15:05 |
dpawlik | the logstash service should have been stopped a long time ago, after we moved to logsender. (r e e d was supposed to do that) | 15:05 |
dpawlik | if it wasn't, that service is just running for no reason and I don't have permissions to remove it :( | 15:06 |
fungi | i wonder why it has any workload at all, in that case | 15:06 |
fungi | if we're not sending it data | 15:06 |
dpawlik | so logsender processes the logs and sends them directly to the OpenSearch (opensearch.logs.openstack.org), which is fargate | 15:07 |
dpawlik | let's see. I changed the url to the direct URL for OpenSearch | 15:09 |
dpawlik | but can't say if it will help :( | 15:09 |
fungi | what is the direct url for opensearch? | 15:09 |
dpawlik | you already pasted it: https://search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com | 15:09 |
dpawlik | <we should remove it from the history> | 15:09 |
fungi | why should we remove it? that name is in public dns | 15:10 |
fungi | it's what the cname points to | 15:10 |
dpawlik | hm | 15:11 |
fungi | but anyway, i still don't know why you think that's more direct. it sounds like all you've done is eliminate a cname lookup in dns | 15:11 |
dpawlik | right. I see in the history that I ran "dig logstash.logs.openstack.org" instead of "dig opensearch.logs.openstack.org" | 15:11 |
dpawlik | so yes, nothing will change | 15:12 |
dpawlik | sorry for making noise | 15:12 |
dpawlik | so I don't know how to communicate with OpenSearch directly and skip fargate | 15:13 |
fungi | judging from the name resolution in dns, logstash.l.o.o is our convenience alias for a load balancer (elb service) though i have no idea what's on the other side of the load balancer nor whether it even still exists | 15:13 |
fungi | and opensearch.l.o.o is our convenience alias for an "es" cluster, which i think means their elasticsearch service? | 15:14 |
dpawlik | yup | 15:15 |
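(For reference, a minimal sketch of the lookups being discussed here, equivalent to running dig against both convenience aliases; it only asks the system resolver for the canonical name each openstack.org alias currently points at.)

```python
# Show which canonical (CNAME target) name and addresses each alias resolves to.
import socket

for alias in ("logstash.logs.openstack.org", "opensearch.logs.openstack.org"):
    try:
        canonical, _aliases, addresses = socket.gethostbyname_ex(alias)
        print(f"{alias} -> {canonical} ({', '.join(addresses)})")
    except socket.gaierror as error:
        # e.g. if the logstash alias really is leftover cruft and no longer resolves
        print(f"{alias}: lookup failed ({error})")
```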
dpawlik | I guess I know how to get it. I remember that R eed said that he does everything in yaml, and it was possible to get the yaml content | 15:16 |
fungi | ttx: wendar: (if you're around) is it possible to reengage the contractor who set up the elasticsearch service/integration for us to look into the recent usage increases and/or whether we even still need whatever is running on fargate-based resources? it appears none of us have access (or know how to obtain access) to investigate further why we're now needing more resources than amazon | 15:19 |
fungi | has agreed to donate, and i'd rather not have to simply turn it off in order to avoid overage charges | 15:19 |
dpawlik | fungi: in the AWS console, search for "cloudwatch", and under "logs" there is "log groups"; what is in there is IMHO strange | 15:25 |
dpawlik | especially because we are not using that service | 15:25 |
fungi | "This IAM user does not have permission to view Log Groups in this account." | 15:27 |
fungi | "...no identity-based policy allows the logs:DescribeLogGroups action" | 15:28 |
fungi | anyway, i've got some paperwork i need to get back to, but i'll try to get something out to the openstack-discuss ml later today to talk about what options we might have | 15:30 |
dpawlik | oh, you've got lower permissions than I do O_O | 15:30 |
fungi | unfortunately i have a very limited amount of time to spend on this (the original expectation was that we shouldn't need to spend any time helping maintain this, but it seems like amazon makes that harder than it should be) | 15:31 |
dpawlik | I will try to check more carefully https://opendev.org/openstack/ci-log-processing/src/branch/master/opensearch-config/opensearch.yaml . Maybe there is an answer there | 15:33 |
dpawlik | aha | 15:33 |
dpawlik | and in the AWS console, when you search for "cloudformation" you can see that logstash, opensearch and dashboards stacks are running | 15:37 |
dpawlik | If I knew AWS better, I would click delete on the "logstashstack" stack (and of course later get a permissions error) | 15:41 |
dpawlik | but what if that later somehow affects opensearchstack... | 15:42 |
dpawlik | I will do some research in the AWS console tomorrow. Now I need to leave. Thanks fungi++ for checking | 16:01 |
fungi | thanks dpawlik for looking into it! | 16:01 |
ttx | dpawlik fungi if y'all need extra rights to get enough visibility on this issue, just let me know, and I can try to figure out how to grant that | 16:11 |
fungi | ttx: thanks, i'll reach out if dpawlik exhausts his current options | 16:14 |
fungi | JayF: i've just now accepted the maintainer invitation for eventlet on openstackci's behalf | 19:07 |
JayF | Thank you! I'll note that other openstack contributors, including hberaud and itamarst (the GR-OSS employee who has been posting on the list about eventlet) have been given access in pypi/github as well. | 19:08 |
fungi | https://pypi.org/project/eventlet/ reflects the addition | 19:08 |
JayF | Pretty much an ideal situation to start off on getting maintenance kicked back off. | 19:08 |
fungi | let me know if/when you need anything else (other than for someone to fix bugs in eventlet of course, my dance card is pretty full) | 19:09 |
JayF | I suspect that the reality of automation on that will end up being github-driven; unsure if openstackci account will end up used at all, but I appreciate that we have an organizational account in place and not just individuals :) | 19:09 |
fungi | right, i or other tact sig folks can at least step in to help do pypi things with it as a fallback | 19:10 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd | 19:12 |
sean-k-mooney | cross posting in the infra channel | 19:12 |
sean-k-mooney | that ssh known host key verification error ^ | 19:12 |
sean-k-mooney | is not a known issue, right? | 19:12 |
sean-k-mooney | "Collect test-results" failed to rsync the logs to, i assume, the executor or log server | 19:13 |
sean-k-mooney | i know there were issues in the past with reusing ips in some clouds | 19:13 |
sean-k-mooney | so i'm wondering if this is infra related, a random failure, or likely to just work if i recheck | 19:14 |
fungi | that problem has never really gone away, it rises and falls like the tides | 19:14 |
JayF | I've 100% seen those before, and seen them clear with a recheck. I believe it's a known issue? | 19:14 |
sean-k-mooney | ok the frustrating thing is the review's recheck failed on a timeout and this one failed to upload logs in the gate | 19:15 |
sean-k-mooney | i'll recheck but i don't want to waste ci time if it's actually an issue affecting others | 19:16 |
fungi | have a link to the log upload failure? hoping one of our swift donors isn't having another outage | 19:16 |
sean-k-mooney | well sorry, i'm not sure it's an upload failure; in fact there are logs | 19:17 |
sean-k-mooney | it's from fetch-subunit-output: Collect test-results | 19:18 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/console#6/0/14/ubuntu-focal | 19:18 |
fungi | that's often indicative of an earlier problem, like something broke before subunit data was written | 19:19 |
fungi | oh, okay that one's a rogue vm probably | 19:19 |
fungi | the executor tried to scp the testr_results.html file from the test node, but when it connected the ssh host key wasn't the one it expected to see, so usually a second machine in that provider which is also using the same ip address | 19:20 |
sean-k-mooney | oh ok | 19:21 |
fungi | often times, the ip address (23.253.20.246 in this case) will be consistent across failures, or many failures will indicate a small number of affected addresses | 19:21 |
sean-k-mooney | neutron is meant to prevent that... | 19:21 |
sean-k-mooney | http://tinyurl.com/4xex5xrp | 19:22 |
JayF | fungi: sean-k-mooney: Anyone wanna take the other side of the bet that this bug disappears when our eventlet dep does? | 19:22 |
JayF | :) | 19:22 |
fungi | yes, our working theory is that something happens during server instance deletion and the vm never really stops/goes away, so continues to arp for its old ip address, but neutron thinks the address has been freed and assigns it to a new server instance | 19:22 |
sean-k-mooney | JayF: you mean when hell freezes over | 19:23 |
JayF | sean-k-mooney: it's cold down here today, bring a coat if you get dragged down | 19:23 |
* fungi bought a new winter coat just for this | 19:23 | |
JayF | sean-k-mooney: openstackci, hberaud, and Itamar (my co-worker from GR-OSS) all are maintainers on eventlet gh org + pypi now :) | 19:23 |
sean-k-mooney | so neutron installs mac anti-spoofing rules | 19:23 |
sean-k-mooney | that should prevent that, but depending on how networking is configured and how we connect, perhaps not | 19:24 |
fungi | sean-k-mooney: you're also operating on the assumption that the provider is using a new enough neutron to have that feature, or even using neutron at all as you know it | 19:24 |
sean-k-mooney | fungi: it was there when it was called quantum | 19:24 |
sean-k-mooney | but that does not mean they have security groups enabled | 19:24 |
sean-k-mooney | fungi: also this may happen if we reuse floating ips | 19:25 |
fungi | sean-k-mooney: "rax-ord" https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/log/zuul-info/inventory.yaml#20 | 19:25 |
sean-k-mooney | i would not expect this to happen with ipv6 just due to entropy | 19:25 |
fungi | so i think they're still using nova-network | 19:25 |
sean-k-mooney | ... | 19:25 |
sean-k-mooney | ya that would not have that... | 19:26 |
JayF | IIRC the neutron bits in that stack are mostly custom plugin anyway | 19:26 |
fungi | not sure rackspace ever migrated to neutron (but i can't say for sure) | 19:26 |
sean-k-mooney | they did but they still use linux bridge | 19:26 |
JayF | fungi: I know they half-migrated for OnMetal to work, dunno if they went further for VM or not | 19:26 |
fungi | aha | 19:26 |
sean-k-mooney | and the thing i was referring to is what ovs does | 19:26 |
sean-k-mooney | in the ovs firewall driver | 19:26 |
fungi | anyway, we do still see this in various providers. in some there was a known bug in nova cells which eventually got fixed and that has helped i think | 19:27 |
fungi | i've been told by a number of different providers' operators though that they run periodic scripts to query locally with virsh on all the compute nodes and then stop and remove any virtual machines which nova is unaware of | 19:28 |
sean-k-mooney | we have a running-deleted periodic job | 19:29 |
sean-k-mooney | but really nova should not leave a vm running when it's deleted | 19:29 |
fungi | but if we see persistent failures for the same ip address(es) over a span of days, it's helped in the past to assemble a list of those addresses and open a trouble ticket asking to have them hunted down | 19:29 |
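(If it comes to assembling that list, a rough sketch along these lines could tally recurring addresses; the directory of saved failure output and the matched message text are assumptions for illustration, not an existing tool.)

```python
# Count IPv4 addresses appearing in saved host-key-mismatch failure messages,
# so that addresses recurring across builds can be reported to the provider.
import collections
import glob
import re

ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
counts = collections.Counter()

for path in glob.glob("failed-console-logs/*.txt"):    # hypothetical location
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "host key" in line.lower():    # assumed failure message wording
                counts.update(ip_pattern.findall(line))

for address, hits in counts.most_common():
    if hits > 1:    # only addresses seen in more than one failure
        print(f"{address}: {hits} failures")
```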
sean-k-mooney | we would put it to error, or fall back to a local delete "if the agent is down when it starts again" | 19:30 |
sean-k-mooney | given the age of their cloud, hoping they move to ipv6 is probably wishful thinking | 19:30 |
fungi | they have working ipv6 but it's not exposed by the api in a way that nodepool/zuul can make use of the info | 19:31 |
fungi | mainly because, i think, they rolled their own ipv6 support at "the beginning" (before quantum) | 19:32 |