*** jph5 is now known as jph1 | 08:19 | |
*** jph1 is now known as jph | 08:20 | |
dpawlik | fungi: so normally we can downscale the OpenSearch. Some time ago we disabled DEBUG messages, so 25% of the cluster is not used (I mean disk space) | 08:53 |
dpawlik | fungi: I can remove a few visualizations. Maybe someone is periodically watching them, or maybe some of them have a wrong query that takes more resources than expected | 08:54 |
dpawlik | fungi: I don't have the privileges on AWS to say anything more. My account is very very limited. | 08:54 |
*** elodilles_pto is now known as elodilles | 09:06 | |
fungi | dpawlik: yeah, it doesn't look like we're short on disk space or heavy query volume, it looks like that gets covered by the opensearch line item in the bill which hasn't really changed. the more i think about it, the more likely it is that the volume of log lines getting fed to that logstash.logs.openstack.org endpoint is where all this fargate service usage is | 13:59 |
fungi | do you have any visibility into how much data the job log processor you manage has been shipping to it, and whether that's increased significantly in the past few months? | 14:00 |
dpawlik | checking | 14:08 |
dpawlik | so I don't have any statistics in Logsender for how much data it sends, but I can see the "Searchable documents" stats in the AWS OpenSearch cluster health | 14:14 |
dpawlik | and it looks normal | 14:14 |
dpawlik | but when I check other metrics in cluster health, one is strange | 14:14 |
dpawlik | JVM garbage collection => Young collection and Young collection time => they always grow, never go down | 14:15 |
dpawlik | maybe we can try an OpenSearch upgrade from 2.5 (the current one) to 2.10 | 14:16 |
dpawlik | so all services would be restarted. That should help | 14:16 |
dpawlik | and if it does, we can send AWS feedback | 14:16 |
jrosser | isn't fargate generally more expensive than regular ec2 instances? | 14:17 |
dpawlik | got something. Checking "instance health" for the last 1h: JVM memory pressure for the selected instance is currently 50.6; the minimum across all instances is 26.8, the maximum across all instances is 74.2, and the selected instance's range is 26.8 to 74.2 | 14:18 |
dpawlik | when I check the last 2 weeks, it's at 73.9 | 14:19 |
fungi | dpawlik: if it's that we're sending a ton of data to the log ingestion system that's getting filtered out/discarded, that could be increasing processing and memory there without being reflected in the amount of data in opensearch (if it's data that's getting discarded and not indexed into opensearch) | 14:19 |
dpawlik | same with data nodes. The average value from the last 2 weeks is higher than from 1 hour ago | 14:20 |
fungi | jrosser: yes, i think that's part of the problem. the contractor who developed the log ingestion system for this wanted it to be entirely stateless so it would require no maintenance | 14:20 |
dpawlik | oh my | 14:21 |
jrosser | that might be like doubling the cost vs. a long term discount on an ec2 | 14:21 |
fungi | so there are no virtual servers, no container management, it's in fargate so that there won't need to be a sysadmin basically | 14:21 |
jrosser | right | 14:22 |
fungi | but it also means when it blows up in utilization there's nobody to look into why | 14:22 |
dpawlik | I don't have enough permissions to see more, maybe you or clarkb do | 14:24 |
dpawlik | from the logscraper/logsender side: everything is working normally | 14:24 |
dpawlik | I will increase the "sleep" time for logsender, because it does not need to run every 5 seconds (but that parameter was set a year ago) | 14:25 |
fungi | i'm not sure we have that access either, i can't even trigger the ssl cert renewals for opensearch, but i'll see if it looks like i have access to something in that regard | 14:26 |
dpawlik | fungi: first you need to do some AWS certification to touch something there <I'm joking> | 14:28 |
dpawlik | on the other hand, the whole situation started in September 2023, right? | 14:28 |
dpawlik | or in the last few weeks? | 14:28 |
dpawlik | or did they not mention anything? | 14:29 |
fungi | dpawlik: ttx was looking at the bill. aws switched our free credits from yearly to monthly, so exceeding the monthly allocation will now result in overage charges. these were the numbers he reported: | 14:32 |
fungi | AWS Fargate (September): Memory=31893 GB.hours, vCPU=15946 vCPU.hours | 14:33 |
fungi | AWS Fargate (November): Memory=58906 GB.hours, vCPU=29453 vCPU.hours | 14:33 |
fungi | the billing interface also lists usage for "Amazon OpenSearch Service ESDomain" (general purpose provisioned storage and m6g.xlarge.search instance hours) | 14:35 |
fungi | that's separate from the fargate stuff, which is what leads me to believe the log ingestion system tom developed is what's responsible for the separate fargate charges | 14:35 |
fungi | he also noted, "Looking at December, on day 12 we are at 21833 GB.hours and 10916 vCPU.hours" (for fargate charges), so consistent with the usage we saw for november if you project it out through the end of the month | 14:38 |
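(As a quick sanity check of fungi's "project it out" remark, the arithmetic can be spelled out; this is just a linear extrapolation of the day-12 December figures quoted above, nothing beyond that.)

```python
# Extrapolate December's Fargate usage (reported at day 12) to a full 31-day
# month and compare it with the reported November totals.
dec_mem_gb_hours = 21833    # December, first 12 days
dec_vcpu_hours = 10916
nov_mem_gb_hours = 58906    # November totals
nov_vcpu_hours = 29453

projected_mem = dec_mem_gb_hours / 12 * 31    # ~56,400 GB.hours
projected_vcpu = dec_vcpu_hours / 12 * 31     # ~28,200 vCPU.hours

print(f"projected December memory: {projected_mem:.0f} GB.hours vs November {nov_mem_gb_hours}")
print(f"projected December vCPU:   {projected_vcpu:.0f} vCPU.hours vs November {nov_vcpu_hours}")
```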
dpawlik | so to understand the whole situation: the AWS fargate service is available at the address https://opensearch.logs.openstack.org and then it redirects the traffic to the opensearch host logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com , right? | 14:43 |
fungi | we have a logstash.logs.openstack.org cname to logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com which i think is related to the fargate resources | 14:46 |
fungi | is that where the ci log processor sends its data? | 14:47 |
fungi | i think opensearch.logs.openstack.org goes to the opensearch service, which is billed separately from fargate usage | 14:47 |
dpawlik | it sends to: opensearch.logs.openstack.org | 14:51 |
fungi | i've signed into my aws account and am poking around now, though most of the panels say "access denied" | 14:51 |
dpawlik | yay :D | 14:51 |
dpawlik | so you know my pain | 14:51 |
fungi | in particular, i do not have access to "applications" or "cost and usage" | 14:52 |
fungi | i also don't have access to "security" so probably can't change that | 14:52 |
dpawlik | oh my. I was hoping that you and clarkb had access, according to Reed's messages | 14:53 |
fungi | granted there was never any expectation that i would be responsible for managing any of this, so the lack of access is expected | 14:53 |
dpawlik | but not you Reed :P | 14:53 |
fungi | oh, his name was reed not tom. i kept saying tom | 14:54 |
fungi | dunno why i had that name stuck in my head | 14:54 |
dpawlik | hehe | 14:54 |
dpawlik | his name was Reed | 14:54 |
fungi | (not to be confused with the reed in this channel, who is someone else entirely) | 14:54 |
dpawlik | oh my, Reed you will be pinged a few times :P | 14:54 |
reed | yes :) | 14:54 |
fungi | sorry! | 14:54 |
dpawlik | fungi: I'm not able to reach OpenSearch (logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com) from the logscraper host | 14:56 |
dpawlik | so all the traffic needs to go through opensearch.logs.openstack.org | 14:56 |
dpawlik | or something needs to be changed | 14:56 |
dpawlik | or I need to go get some AWS cert | 14:56 |
ttx | I have access to the top IAM account so I can probably grant rights (just have no idea how) | 14:56 |
fungi | okay, it's possible logstash.logs.openstack.org is just cruft in dns | 14:56 |
dpawlik | or, wait for AWS to provide some AI that will do what we ask | 14:57 |
fungi | i'm trying to see if maybe i'm in the wrong dashboard view and need to be looking at an organization, but the "organization" view also tells me my access is denied | 14:58 |
ttx | but yeah, the managed opensearch service itself runs on classic EC2 instances, but there is something that consumes Fargate resources (a container service) and that doubled its footprint mid-October (and has been fairly constant since) | 14:58 |
dpawlik | ah fungi, you sent me the wrong address and I did not verify it :D | 14:58 |
dpawlik | so direct OpenSearch url is: search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com | 14:59 |
dpawlik | so I'm able to change that in logsender | 14:59 |
*** d34dh0r5- is now known as d34dh0r53 | 14:59 | |
fungi | i honestly have no idea what is the "right" or "wrong" hostname, all i know is what we have cnames for in openstack.org's dns | 14:59 |
fungi | search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com is what opensearch.logs.openstack.org is cname'd to | 15:00 |
dpawlik | that makes sense: logst-loadb-c8b36j1ja8ub-1adb08efd8fe1ae1.elb.us-east-1.amazonaws.com ("loadb") goes to fargate, whereas search-blablabla goes directly to OpenSearch | 15:00 |
dpawlik | hm | 15:01 |
fungi | it's possible the other logstash.logs.openstack.org cname is leftover cruft from earlier iterations of work on the log ingestion in aws | 15:01 |
fungi | or maybe the api you're sending to on opensearch.l.o.o is then forwarding the workload to logstash.l.o.o for processing | 15:02 |
dpawlik | logsender can send data directly to the OpenSearch and skip fargate \o/ | 15:03 |
dpawlik | fargate or whatever it is | 15:04 |
fungi | yeah, i have no idea what exactly the role of the log processor there is (my understanding is that it accepted data using a logstash protocol and then fed it into opensearch) | 15:05 |
dpawlik | the logstash service should have been stopped a long time ago, after we moved to logsender. (r e e d was supposed to do that) | 15:05 |
dpawlik | if it wasn't, that service is just running for no reason and I don't have permissions to remove it :( | 15:06 |
fungi | i wonder why it has any workload at all, in that case | 15:06 |
fungi | if we're not sending it data | 15:06 |
dpawlik | so logsender processes the logs and sends them directly to the OpenSearch (opensearch.logs.openstack.org), which is fargate | 15:07 |
dpawlik | let's see. I changed the url to the direct URL for OpenSearch | 15:09 |
dpawlik | but can't say if it will help :( | 15:09 |
fungi | what is the direct url for opensearch? | 15:09 |
dpawlik | you already pasted it: https://search-openstack-prod-cluster-hygp6wprdmtikuwtqg3z3rvapm.us-east-1.es.amazonaws.com | 15:09 |
dpawlik | <we should remove it from the history> | 15:09 |
fungi | why should we remove it? that name is in public dns | 15:10 |
fungi | it's what the cname points to | 15:10 |
dpawlik | hm | 15:11 |
fungi | but anyway, i still don't know why you think that's more direct. it sounds like all you've done is eliminate a cname lookup in dns | 15:11 |
dpawlik | right. I see in the history that I ran "dig logstash.logs.openstack.org" instead of "dig opensearch.logs.openstack.org" | 15:11 |
dpawlik | so yes, nothing will change | 15:12 |
dpawlik | sorry for making noise | 15:12 |
dpawlik | so I don't know how to communicate with OpenSearch directly and skip fargate | 15:13 |
fungi | judging from the name resolution in dns, logstash.l.o.o is our convenience alias for a load balancer (elb service) though i have no idea what's on the other side of the load balancer nor whether it even still exists | 15:13 |
fungi | and opensearch.l.o.o is our convenience alias for an "es" cluster, which i think means their elasticsearch service? | 15:14 |
dpawlik | yup | 15:15 |
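(For reference, a minimal sketch of the lookups being discussed here, equivalent to running dig against both convenience aliases; it only asks the system resolver for the canonical name each openstack.org alias currently points at.)

```python
# Show which canonical (CNAME target) name and addresses each alias resolves to.
import socket

for alias in ("logstash.logs.openstack.org", "opensearch.logs.openstack.org"):
    try:
        canonical, _aliases, addresses = socket.gethostbyname_ex(alias)
        print(f"{alias} -> {canonical} ({', '.join(addresses)})")
    except socket.gaierror as error:
        # e.g. if the logstash alias really is leftover cruft and no longer resolves
        print(f"{alias}: lookup failed ({error})")
```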
dpawlik | I guess I know how to get it. I remember that R eed said that he does everything in yaml, and it was possible to get the yaml content | 15:16 |
fungi | ttx: wendar: (if you're around) is it possible to reengage the contractor who set up the elasticsearch service/integration for us to look into the recent usage increases and/or whether we even still need whatever is running on fargate-based resources? it appears none of us have access (or know how to obtain access) to investigate further why we're now needing more resources than amazon | 15:19 |
fungi | has agreed to donate, and i'd rather not have to simply turn it off in order to avoid overage charges | 15:19 |
dpawlik | fungi: in the AWS console, search for "cloudwatch", and under "logs" there is "log groups"; what is in there is IMHO strange | 15:25 |
dpawlik | especially because we are not using that service | 15:25 |
fungi | "This IAM user does not have permission to view Log Groups in this account." | 15:27 |
fungi | "...no identity-based policy allows the logs:DescribeLogGroups action" | 15:28 |
fungi | anyway, i've got some paperwork i need to get back to, but i'll try to get something out to the openstack-discuss ml later today to talk about what options we might have | 15:30 |
dpawlik | oh, you've got lower permissions than I do O_O | 15:30 |
fungi | unfortunately i have a very limited amount of time to spend on this (the original expectation was that we shouldn't need to spend any time helping maintain this, but it seems like amazon makes that harder than it should be) | 15:31 |
dpawlik | I will try to check more carefully https://opendev.org/openstack/ci-log-processing/src/branch/master/opensearch-config/opensearch.yaml . Maybe there is an answer there | 15:33 |
dpawlik | aha | 15:33 |
dpawlik | and in the AWS console, when you search for "cloudformation" you can see that logstash, opensearch and dashboards stacks are running | 15:37 |
dpawlik | If I knew AWS better, I would click delete on the "logstashstack" stack (and of course later get a permissions error) | 15:41 |
dpawlik | but what if that later somehow affects opensearchstack... | 15:42 |
dpawlik | I will do some research in the AWS console tomorrow. Now I need to leave. Thanks fungi++ for checking | 16:01 |
fungi | thanks dpawlik for looking into it! | 16:01 |
ttx | dpawlik fungi if y'all need extra rights to get enough visibility on this issue, just let me know, and I can try to figure out how to grant that | 16:11 |
fungi | ttx: thanks, i'll reach out if dpawlik exhausts his current options | 16:14 |
fungi | JayF: i've just now accepted the maintainer invitation for eventlet on openstackci's behalf | 19:07 |
JayF | Thank you! I'll note that other openstack contributors, including hberaud and itamarst (the GR-OSS employee who has been posting on the list about eventlet) have been given access in pypi/github as well. | 19:08 |
fungi | https://pypi.org/project/eventlet/ reflects the addition | 19:08 |
JayF | Pretty much an ideal situation to start off on getting maintenance kicked back off. | 19:08 |
fungi | let me know if/when you need anything else (other than for someone to fix bugs in eventlet of course, my dance card is pretty full) | 19:09 |
JayF | I suspect that the reality of automation on that will end up being github-driven; unsure if openstackci account will end up used at all, but I appreciate that we have an organizational account in place and not just individuals :) | 19:09 |
fungi | right, i or other tact sig folks can at least step in to help do pypi things with it as a fallback | 19:10 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd | 19:12 |
sean-k-mooney | cross posting in the infra channel | 19:12 |
sean-k-mooney | that ssh known host key verification error ^ | 19:12 |
sean-k-mooney | is not a known issue, right? | 19:12 |
sean-k-mooney | "Collect test-results" failed to rsync the logs to, i assume, the executor or log server | 19:13 |
sean-k-mooney | i know there were issues in the past with reusing ips in some clouds | 19:13 |
sean-k-mooney | so i'm wondering if this is infra related, a random failure, or likely to just work if i recheck | 19:14 |
fungi | that problem has never really gone away, it rises and falls like the tides | 19:14 |
JayF | I've 100% seen those before, and seen them clear with a recheck. I believe it's a known issue? | 19:14 |
sean-k-mooney | ok the frustrating thing is the review's recheck failed on a timeout and this one failed to upload logs in the gate | 19:15 |
sean-k-mooney | i'll recheck but i don't want to waste ci time if it's actually an issue affecting others | 19:16 |
fungi | have a link to the log upload failure? hoping one of our swift donors isn't having another outage | 19:16 |
sean-k-mooney | well sorry, i'm not sure it's an upload failure; in fact there are logs | 19:17 |
sean-k-mooney | it's from fetch-subunit-output: Collect test-results | 19:18 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/console#6/0/14/ubuntu-focal | 19:18 |
fungi | that's often indicative of an earlier problem, like something broke before subunit data was written | 19:19 |
fungi | oh, okay that one's a rogue vm probably | 19:19 |
fungi | the executor tried to scp the testr_results.html file from the test node, but when it connected the ssh host key wasn't the one it expected to see, so usually a second machine in that provider which is also using the same ip address | 19:20 |
sean-k-mooney | oh ok | 19:21 |
fungi | often times, the ip address (23.253.20.246 in this case) will be consistent across failures, or many failures will indicate a small number of affected addresses | 19:21 |
sean-k-mooney | neutron is meant to prevent that... | 19:21 |
sean-k-mooney | http://tinyurl.com/4xex5xrp | 19:22 |
JayF | fungi: sean-k-mooney: Anyone wanna take the other side of the bet that this bug disappears when our eventlet dep does? | 19:22 |
JayF | :) | 19:22 |
fungi | yes, our working theory is that something happens during server instance deletion and the vm never really stops/goes away, so continues to arp for its old ip address, but neutron thinks the address has been freed and assigns it to a new server instance | 19:22 |
sean-k-mooney | JayF: you mean when hell freezes over | 19:23 |
JayF | sean-k-mooney: it's cold down here today, bring a coat if you get dragged down | 19:23 |
* fungi bought a new winter coat just for this | 19:23 | |
JayF | sean-k-mooney: openstackci, hberaud, and Itamar (my co-worker from GR-OSS) all are maintainers on eventlet gh org + pypi now :) | 19:23 |
sean-k-mooney | so neutron installs mac anti-spoofing rules | 19:23 |
sean-k-mooney | that should prevent that, but depending on how networking is configured and how we connect, perhaps not | 19:24 |
fungi | sean-k-mooney: you're also operating on the assumption that the provider is using a new enough neutron to have that feature, or even using neutron at all as you know it | 19:24 |
sean-k-mooney | fungi: it was there when it was called quantum | 19:24 |
sean-k-mooney | but that does not mean they have security groups enabled | 19:24 |
sean-k-mooney | fungi: also this may happen if we reuse floating ips | 19:25 |
fungi | sean-k-mooney: "rax-ord" https://zuul.opendev.org/t/openstack/build/32cf62df705344cf8693fb547f40b3cd/log/zuul-info/inventory.yaml#20 | 19:25 |
sean-k-mooney | i would not expect this to happen with ipv6 just due to entropy | 19:25 |
fungi | so i think they're still using nova-network | 19:25 |
sean-k-mooney | ... | 19:25 |
sean-k-mooney | ya that would not have that... | 19:26 |
JayF | IIRC the neutron bits in that stack are mostly custom plugin anyway | 19:26 |
fungi | not sure rackspace ever migrated to neutron (but i can't say for sure) | 19:26 |
sean-k-mooney | they did but they still use linux bridge | 19:26 |
JayF | fungi: I know they half-migrated for OnMetal to work, dunno if they went further for VM or not | 19:26 |
fungi | aha | 19:26 |
sean-k-mooney | and the thing i was referring to is what ovs does | 19:26 |
sean-k-mooney | in the ovs firewall driver | 19:26 |
fungi | anyway, we do still see this in various providers. in some there was a known bug in nova cells which eventually got fixed and that has helped i think | 19:27 |
fungi | i've been told by a number of different providers' operators though that they run periodic scripts to query locally with virsh on all the compute nodes and then stop and remove any virtual machines which nova is unaware of | 19:28 |
sean-k-mooney | we have a running-deleted periodic job | 19:29 |
sean-k-mooney | but really nova should not leave a vm running when it's deleted | 19:29 |
fungi | but if we see persistent failures for the same ip address(es) over a span of days, it's helped in the past to assemble a list of those addresses and open a trouble ticket asking to have them hunted down | 19:29 |
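(If it comes to assembling that list, a rough sketch along these lines could tally recurring addresses; the directory of saved failure output and the matched message text are assumptions for illustration, not an existing tool.)

```python
# Count IPv4 addresses appearing in saved host-key-mismatch failure messages,
# so that addresses recurring across builds can be reported to the provider.
import collections
import glob
import re

ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
counts = collections.Counter()

for path in glob.glob("failed-console-logs/*.txt"):    # hypothetical location
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "host key" in line.lower():    # assumed failure message wording
                counts.update(ip_pattern.findall(line))

for address, hits in counts.most_common():
    if hits > 1:    # only addresses seen in more than one failure
        print(f"{address}: {hits} failures")
```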
sean-k-mooney | we would put it to error, or fall back to a local delete "if the agent is down when it starts again" | 19:30 |
sean-k-mooney | given the age of their cloud, hoping they move to ipv6 is probably wishful thinking | 19:30 |
fungi | they have working ipv6 but it's not exposed by the api in a way that nodepool/zuul can make use of the info | 19:31 |
fungi | mainly because, i think, they rolled their own ipv6 support at "the beginning" (before quantum) | 19:32 |