*** tosky has quit IRC | 00:15 | |
clarkb | ianw: left some comments on ^ note the inline comments aren't the reason for the -1, the top level comment is | 00:27 |
clarkb | let me know if I've missed something obvious and I can amend my review | 00:27 |
ianw | cool, replied. i may have missed gitea, will check in a sec | 00:31 |
ianw | basically i think the script should work by looping through and trying to backup everything, and if one part fails, the whole thing will exit with !0 | 00:32 |
clarkb | ianw: re your reply on the pipefail: I mention it because you are doing `bash foo | something else` and that will only exit non-zero if the something else fails | 00:32 |
clarkb | I agree with your plan. I'm just worried we'll ignore if bash foo fails | 00:32 |
clarkb | I think if we set -o pipefail we'll get both things | 00:33 |
clarkb | ? | 00:33 |
clarkb | this is distinct from set -e | 00:33 |
ianw | oh yes i see. we can check PIPEFAIL or whatever that is, that's a good idea for robustness | 00:33 |
ianw | PIPESTATUS | 00:34 |
clarkb | ah ya if we can check it directly too that would work | 00:34 |
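A minimal sketch of the pattern under discussion, assuming a hypothetical mysqldump-into-borg pipeline (the real implementation is the system-config change announced just below): with `set -o pipefail` the pipeline's exit status reflects a failure in either stage, and PIPESTATUS still exposes the per-stage codes so the script can keep looping over other backups and exit non-zero at the end.

    #!/bin/bash
    # Hypothetical sketch only, not the actual backup script.
    set -o pipefail          # pipeline exit status reflects any failed stage
    RETVAL=0

    mysqldump --all-databases | borg create ::db-backup-{now} -
    # PIPESTATUS holds the per-stage exit codes of the last pipeline,
    # e.g. "0 0" on success or "2 0" if only the dump failed.
    status=( "${PIPESTATUS[@]}" )
    if [[ ${status[0]} -ne 0 || ${status[1]} -ne 0 ]]; then
        echo "database stream backup failed (dump=${status[0]}, borg=${status[1]})" >&2
        RETVAL=1             # keep going over other backups, fail the run at the end
    fi

    exit $RETVAL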
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup: implement saving a stream, use for database backups https://review.opendev.org/c/opendev/system-config/+/771738 | 00:40 |
ianw | clarkb: nice ideas, thanks, implemented with ^ | 00:41 |
clarkb | ianw: one little formatting thing that yaml will be sad about. Otherwise that lgtm | 00:42 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup: implement saving a stream, use for database backups https://review.opendev.org/c/opendev/system-config/+/771738 | 00:43 |
ianw | indeed, i was actually just playing with yamllint wrt https://review.opendev.org/c/opendev/system-config/+/733406 | 00:44 |
clarkb | side note: would it be worth testing starting review-test up against an empty accountPatchReviewDb? | 00:44 |
clarkb | and if that works with the only loss being the little check marks on the ui next to things you've reviewed maybe we just stop backing up the review database entirely? | 00:45 |
clarkb | what I'm not sure about is if there are tendrils of that data in the notedb. I don't think there are, but it is possible | 00:45 |
ianw | possibly, but i don't think it's a major size concern | 00:45 |
clarkb | ok | 00:46 |
ianw | i mean, it's not going to be atomic with any gerrit state saved in backups anyway, so if they do communicate ... i guess it's likely to be corrupt | 00:46 |
ianw | or at least ... have corruption | 00:46 |
clarkb | ya | 00:46 |
clarkb | that change lgtm now too | 00:47 |
clarkb | I still think docs telling people to set up db backups separately would be a good addition :) | 00:49 |
ianw | clarkb: yes, i've started :) it got all tangled up with my modifying the stuff we have there about rotation, which now i'm not sure what to do about. i'll separate it out | 00:50 |
clarkb | ++ to separating the two things and we can update them as we get to each piece | 00:50 |
clarkb | re the gerrit testing: my change seemed to have worked but hit a POST_FAILURE on some unrelated issues that appear to be network related while getting logs. I rechecked it | 00:50 |
ianw | it was all going to be so simple... :) | 00:50 |
clarkb | ianw: I used your screenshots to confirm the gerrit versions were as expected too :) | 00:51 |
clarkb | I thought that would end up in the container logs but didn't find them there | 00:51 |
clarkb | (I think because it logs that info to disk; the container log just gets stdout/stderr which is fairly short) | 00:51 |
ianw | yeah with some sleep() and adjusting the viewport the screenshots seem to be pretty good now | 00:52 |
clarkb | and now it is time to go figure out some dinner | 00:52 |
*** diablo_rojo has quit IRC | 00:56 | |
ianw | #status log afsdb01/02 restarted with afs 1.8 packages | 01:15 |
openstackstatus | ianw: finished logging | 01:15 |
openstackgerrit | Merged openstack/diskimage-builder master: Install last stable version of get-pip.py script https://review.opendev.org/c/openstack/diskimage-builder/+/772254 | 01:16 |
*** dviroel has quit IRC | 01:54 | |
*** mlavalle has quit IRC | 01:55 | |
openstackgerrit | Merged opendev/system-config master: Manage afsdb servers with Ansible https://review.opendev.org/c/opendev/system-config/+/771340 | 02:03 |
*** lbragstad_ has joined #opendev | 02:14 | |
*** ysandeep|away is now known as ysandeep | 02:15 | |
*** lbragstad has quit IRC | 02:17 | |
*** hemanth_n has joined #opendev | 02:19 | |
*** DSpider has quit IRC | 02:55 | |
openstackgerrit | Merged opendev/system-config master: borg-backup: implement saving a stream, use for database backups https://review.opendev.org/c/opendev/system-config/+/771738 | 03:11 |
*** ykarel has joined #opendev | 03:18 | |
*** ykarel has quit IRC | 03:27 | |
*** d34dh0r53 has quit IRC | 03:47 | |
*** d34dh0r53 has joined #opendev | 03:48 | |
*** d34dh0r53 has quit IRC | 03:48 | |
*** d34dh0r53 has joined #opendev | 03:49 | |
*** d34dh0r53 has quit IRC | 03:49 | |
*** lbragstad_ is now known as lbragstad | 03:51 | |
*** d34dh0r53 has joined #opendev | 03:53 | |
*** d34dh0r53 has quit IRC | 03:55 | |
*** d34dh0r53 has joined #opendev | 03:56 | |
*** d34dh0r53 has quit IRC | 03:56 | |
*** d34dh0r53 has joined #opendev | 03:57 | |
*** d34dh0r53 has joined #opendev | 03:58 | |
*** d34dh0r53 has joined #opendev | 03:59 | |
*** brinzhang has quit IRC | 04:52 | |
*** brinzhang has joined #opendev | 04:53 | |
*** ykarel has joined #opendev | 05:01 | |
*** brinzhang_ has joined #opendev | 05:02 | |
*** brinzhang has quit IRC | 05:05 | |
ianw | clarkb/kopecmartin : re 705258 i left review comments, but i've started https://etherpad.opendev.org/p/refstack-docker to try and flesh out the steps we'll use to bring up the host and various other things. as mentioned, i think it would be good to validate the db migration procedure on a test host before we start on the production host | 05:49 |
*** ykarel_ has joined #opendev | 05:50 | |
*** ykarel has quit IRC | 05:53 | |
*** ykarel_ is now known as ykarel | 05:53 | |
*** ykarel_ has joined #opendev | 05:58 | |
*** marios has joined #opendev | 05:59 | |
*** ykarel has quit IRC | 06:00 | |
*** ykarel_ is now known as ykarel | 06:15 | |
*** dirtygiraffe has joined #opendev | 06:58 | |
*** dirtygiraffe has quit IRC | 07:02 | |
*** brinzhang_ has quit IRC | 07:04 | |
*** brinzhang_ has joined #opendev | 07:04 | |
*** eolivare has joined #opendev | 07:28 | |
*** slaweq has joined #opendev | 07:28 | |
*** ralonsoh has joined #opendev | 07:28 | |
*** hashar has joined #opendev | 07:58 | |
*** hashar has quit IRC | 08:01 | |
*** hashar has joined #opendev | 08:01 | |
*** sboyron_ has joined #opendev | 08:04 | |
*** fressi has joined #opendev | 08:04 | |
*** ysandeep is now known as ysandeep|lunch | 08:18 | |
*** andrewbonney has joined #opendev | 08:19 | |
*** valery_t has joined #opendev | 08:21 | |
*** ykarel is now known as ykarel|lunch | 08:21 | |
valery_t | I need a reviewer for my review https://review.opendev.org/c/openstack/python-openstackclient/+/773649 | 08:22 |
*** valery_t has quit IRC | 08:32 | |
frickler | wow, that one was really hasty | 08:33 |
cgoncalves | hey folks. not sure if this issue has been reported or not, apologies in advance. https://releases.openstack.org is super slow, CI jobs timing out | 08:36 |
cgoncalves | (HTTP 443, connection timed out) | 08:38 |
*** tosky has joined #opendev | 08:40 | |
*** rpittau|afk is now known as rpittau | 08:41 | |
openstackgerrit | Merged openstack/diskimage-builder master: Remove the deprecated ironic-agent element https://review.opendev.org/c/openstack/diskimage-builder/+/771808 | 08:45 |
*** valery_t has joined #opendev | 08:49 | |
frickler | cgoncalves: works fine for me, do you have some logs? is this our CI or downstream? | 08:51 |
cgoncalves | frickler, https://zuul.opendev.org/t/openstack/build/4d4a897c012e4f7a8cd13d16fdb114f8/log/controller/logs/dib-build/amphora-x64-haproxy.qcow2_log.txt | 08:51 |
cgoncalves | I also hit HTTP 443 locally | 08:52 |
*** valery_t has quit IRC | 08:55 | |
*** jpena|off is now known as jpena | 08:57 | |
*** brinzhang_ has quit IRC | 08:59 | |
*** brinzhang_ has joined #opendev | 09:00 | |
frickler | hmm, seems to be a bit of a load spike, but I don't see anything wrong locally http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=140&rra_id=all | 09:02 |
frickler | there also seems to be a regular peak in io load starting every day at 6, not sure if that's our periodic jobs or possibly the backup, ianw? | 09:04 |
cgoncalves | frickler, FYI 2m11s http://paste.openstack.org/show/802273/ | 09:04 |
cgoncalves | and thanks for checking! | 09:05 |
*** DSpider has joined #opendev | 09:07 | |
*** valery_t_ has joined #opendev | 09:14 | |
*** ysandeep|lunch is now known as ysandeep | 09:38 | |
priteau | Good morning. tarballs.o.o is extremely slow for me today. I remember it happened some time ago and someone restarted apache (IIRC) which fixed it | 09:53 |
priteau | Yeah, that was on 2020-11-27 | 09:55 |
priteau | 16:31 fungi: #status log restarted apache2 on static.opendev.org in order to troubleshoot very long response times | 09:56 |
priteau | cgoncalves: I see the same problem | 09:57 |
priteau | frickler: See quote from fungi above ^^^ | 09:58 |
*** wanzenbug has joined #opendev | 10:00 | |
*** wanzenbug has quit IRC | 10:04 | |
ttx | Yes, affects docs.openstack.org too | 10:10 |
*** CeeMac has joined #opendev | 10:20 | |
frickler | #status log restarted apache2 on static.opendev.org in order to resolve slow responses and timeouts | 10:20 |
openstackstatus | frickler: finished logging | 10:20 |
frickler | ttx: priteau: cgoncalves: infra-root: ^^ looks better to me currently, please let us know if you see any further issues | 10:21 |
cgoncalves | frickler, functional now. thanks a lot! | 10:22 |
priteau | Thank you frickler! upper constraints fetched in 1 to 2 seconds | 10:23 |
*** ykarel|lunch is now known as ykarel | 10:29 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 10:35 | |
*** hashar has quit IRC | 10:45 | |
*** dtantsur|afk is now known as dtantsur | 10:49 | |
openstackgerrit | Martin Kopec proposed opendev/system-config master: Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 11:13 |
*** dviroel has joined #opendev | 11:14 | |
openstackgerrit | Dinesh Garg proposed zuul/zuul-jobs master: Allow customization of helm charts repos https://review.opendev.org/c/zuul/zuul-jobs/+/767354 | 11:43 |
*** hrw has joined #opendev | 12:18 | |
hrw | morning | 12:18 |
hrw | can someone review/approve https://review.opendev.org/c/openstack/project-config/+/772887 patch? it adds centos 8 stream for aarch64 nodes | 12:19 |
*** jpena is now known as jpena|lunch | 12:41 | |
openstackgerrit | Pedro Luis Marques Sliuzas proposed openstack/project-config master: Add Metrics Server App to StarlingX https://review.opendev.org/c/openstack/project-config/+/773883 | 12:42 |
*** hemanth_n has quit IRC | 13:00 | |
*** hrw has quit IRC | 13:19 | |
openstackgerrit | Merged openstack/project-config master: CentOS 8 Stream initial enablement for AArch64 https://review.opendev.org/c/openstack/project-config/+/772887 | 13:25 |
openstackgerrit | Pedro Luis Marques Sliuzas proposed openstack/project-config master: Add Metrics Server App to StarlingX https://review.opendev.org/c/openstack/project-config/+/773883 | 13:26 |
*** jpena|lunch is now known as jpena | 13:37 | |
*** ykarel_ has joined #opendev | 13:51 | |
*** ykarel has quit IRC | 13:54 | |
*** whoami-rajat__ has joined #opendev | 13:55 | |
*** lbragstad has quit IRC | 13:57 | |
*** ykarel_ is now known as ykarel | 13:59 | |
*** brinzhang_ has quit IRC | 14:17 | |
*** brinzhang_ has joined #opendev | 14:17 | |
*** zoharm has joined #opendev | 14:33 | |
*** akahat|rover is now known as akahat | 14:34 | |
*** brinzhang_ has quit IRC | 14:35 | |
*** lbragstad has joined #opendev | 14:35 | |
*** brinzhang_ has joined #opendev | 14:36 | |
*** ysandeep is now known as ysandeep|afk | 14:48 | |
*** bcafarel has quit IRC | 14:58 | |
*** d34dh0r53 has quit IRC | 15:01 | |
*** d34dh0r53 has joined #opendev | 15:01 | |
*** fressi has quit IRC | 15:23 | |
*** ykarel_ has joined #opendev | 15:30 | |
*** ysandeep|afk is now known as ysandeep | 15:31 | |
*** ykarel has quit IRC | 15:32 | |
*** alfred188 has joined #opendev | 15:50 | |
*** ykarel_ is now known as ykarel | 16:00 | |
clarkb | hrw isn't here anymore, but that arm64 centos 8 stream image has me wondering if maybe the centos 8 image should be removed? I don't know if anything is using it currently though | 16:04 |
*** ysandeep is now known as ysandeep|away | 16:06 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run gerrit 3.2 and 3.3 functional tests https://review.opendev.org/c/opendev/system-config/+/773807 | 16:08 |
*** ykarel has quit IRC | 16:17 | |
*** d34dh0r53 has quit IRC | 16:18 | |
*** d34dh0r53 has joined #opendev | 16:19 | |
openstackgerrit | Matt McEuen proposed openstack/project-config master: New Project Request: airship/gerrit-to-github-bot https://review.opendev.org/c/openstack/project-config/+/773936 | 16:19 |
*** mlavalle has joined #opendev | 16:34 | |
*** hashar has joined #opendev | 16:45 | |
fungi | clarkb: given the concern over centos 8 vs centos stream 8 it seemed like projects were going to want to have both available at least for a bit so they can make sure stream still works the same for them | 16:49 |
clarkb | fungi: yup, but I'm not sure if anything used the arm64 centos 8 image? | 16:49 |
clarkb | seemed like most of the work there was done on debuntu, but I am probably also just working on out of date info | 16:50 |
fungi | oh, the arm64 images specifically. right, that may be | 16:50 |
*** sshnaidm|ruck is now known as sshnaidm | 16:52 | |
*** zoharm has quit IRC | 16:52 | |
*** marios is now known as marios|out | 17:24 | |
*** ralonsoh has quit IRC | 17:31 | |
clarkb | using codesearch kolla uses centos-8-arm64 in the kolla-centos8-aarch64 nodeset and that is used in kolla-build-centos8-source-aarch64. Opendev also uses it to build the arm64 centos 8 wheel cache | 17:36 |
clarkb | my hunch is that hrw is adding the new image for kolla so that kolla-build-centos8-source-aarch64 job is likely to get replaced with a stream job. Once that happens we can drop the centos-8 wheel cache in favor of a stream wheel cache and then drop the image I bet | 17:36 |
clarkb | but we can't just drop it today | 17:36 |
fungi | yeah, that sounds about right | 17:37 |
clarkb | infra-root I think the stack at https://review.opendev.org/c/opendev/system-config/+/773807/ is ready for review now. These are housekeeping changes to add gerrit 3.3 image builds and testing | 17:40 |
clarkb | I figured out why those jobs were post failuring and it was because the run playbook was short-circuiting due to an error which caused a log file copy to fail since the file wasn't present | 17:41 |
clarkb | tl;dr best to look at post failures as if they are actual failures first | 17:41 |
fungi | #status log Requested Spamhaus SBL delisting for the lists.katacontainers.io IPv6 address | 17:48 |
openstackstatus | fungi: finished logging | 17:48 |
fungi | infra-root: i checked all the addresses and hostnames for lists.opendev.org and they're still clean | 17:49 |
fungi | just as a heads up | 17:49 |
clarkb | thanks! | 17:50 |
*** valery_t_ has quit IRC | 17:56 | |
*** jpena is now known as jpena|off | 17:59 | |
clarkb | iurygregory: I have approved https://review.opendev.org/c/openstack/project-config/+/772427 to allow ironic project cores to edit hashtags on the appropriate projects. I would be curious to hear how that goes | 18:04 |
iurygregory | clarkb, awesome thanks! after it merges I will give it a try | 18:05 |
corvus | i'd be in favor of allowing that for all auth'd users | 18:06 |
clarkb | iurygregory: note there will be a delay while we sync the acl update, you can follow along in the deploy pipeline on zuul status for that change (it will be the manage-projects job) | 18:07 |
*** eolivare has quit IRC | 18:07 | |
fungi | corvus: yeah, i think we mostly wanted to see how it played out for volunteer test projects before we turned it on globally | 18:07 |
clarkb | yup | 18:07 |
iurygregory | clarkb, ack | 18:08 |
fungi | main concern is that any user can remove a hashtag, so some projects may find that they want to override our global access for it and restrict it to a core reviewer group | 18:08 |
*** rpittau is now known as rpittau|afk | 18:08 | |
fungi | but honestly, there are so many ways someone can vandalize a change in gerrit, i'm not too concerned about rampant hashtag deletion | 18:08 |
*** fbo is now known as fbo|off | 18:09 | |
corvus | yep, that's my thought. a measured introduction with clear guidelines would probably help. maybe a standard place (CONTRIBUTING?) to describe a project's "reserved" hashtags | 18:10 |
*** zimmerry has joined #opendev | 18:10 | |
openstackgerrit | Merged openstack/project-config master: Update ACLs of Ironic Projects to allow Edit Hashtags https://review.opendev.org/c/openstack/project-config/+/772427 | 18:10 |
clarkb | fungi: looking at project-config changes I notice that you've got a revert to reenable gentoo image builds again. I presume that means they are off now? should we reenable them at this point? | 18:12 |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 18:13 |
*** dtantsur is now known as dtantsur|afk | 18:16 | |
fungi | clarkb: prometheanfire had some fixes into dib, we probably need to check that those appear in a release before we try again | 18:20 |
clarkb | got it | 18:20 |
*** d34dh0r53 has quit IRC | 18:22 | |
openstackgerrit | Merged openstack/project-config master: Remove anachronistic jobs from scciclient https://review.opendev.org/c/openstack/project-config/+/772908 | 18:23 |
*** d34dh0r53 has joined #opendev | 18:24 | |
fungi | i see at least one gentoo-related entry in `git log --no-merges --oneline 3.6.0..origin/master` | 18:28 |
fungi | ianw: frickler: how do you feel about tagging another dib release? | 18:28 |
fungi | looks like that'll pull in the get-pip.py change too | 18:29 |
fungi | and a fix for centos stream | 18:29 |
prometheanfire | fungi: clarkb yep, the gentoo update would be nice, iirc it may help fix the build issues for gentoo | 18:33 |
openstackgerrit | Merged openstack/project-config master: Add Metrics Server App to StarlingX https://review.opendev.org/c/openstack/project-config/+/773883 | 18:52 |
corvus | i'm seeing gerrit http response times > 30s in gertty | 18:53 |
clarkb | load is a bit high right now, but not drastically so. I've been having decent luck through the web ui. I wouldn't say it's super fast, but it also hasn't been terribly slow doing project-config and zuul-jobs reviews | 18:54 |
*** marios|out has quit IRC | 18:54 | |
clarkb | dansmith is doing things with the api to get zuul comments (based on conversation in -infra) | 18:56 |
clarkb | I wonder if that could be related, or if its just another researcher | 18:56 |
clarkb | load appears to be falling off now | 18:59 |
iurygregory | clarkb, worked =) | 19:01 |
iurygregory | https://review.opendev.org/c/openstack/bifrost/+/766742 I can add hashtag for a change that I'm not the owner/uploader | 19:02 |
clarkb | iurygregory: cool, if you end up with some examples of how you are using it, I would be interested in seeing those | 19:02 |
iurygregory | \o/ | 19:02 |
iurygregory | the idea is that we will use to track priorities for review | 19:02 |
clarkb | iurygregory: you could tag changes "urgent" I guess | 19:03 |
clarkb | and then core reviewers start reviewing anything tagged urgent when they review sort of thing? | 19:03 |
iurygregory | and probably for backports also (we are thinking of adding the backport-candidate label), and maybe try to use the gerrit api to automatically add a hashtag that would tell us we need a backport in some patches | 19:04 |
iurygregory | clarkb, with the hashtag we can have a simple search in gerrit | 19:04 |
clarkb | corvus: I wonder too if possibly updating acls slows things down (maybe there are locks involved in that?) | 19:04 |
iurygregory | https://review.opendev.org/q/hashtag:bifrost | 19:04 |
iurygregory | for example | 19:04 |
iurygregory | so maybe we will have specific ironic hashtags we want to use to make things easier for us and have a dashboard that would help the community | 19:05 |
clarkb | right, in that example "bifrost" is implied because it is the bifrost repo. But I can see how other values for things like backports and urgency would help out | 19:06 |
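For reference, a hedged sketch of what that could look like against the Gerrit REST API; the hashtag names here are made up, 766742 is the bifrost change linked above, and the authenticated call assumes HTTP credentials plus the Edit Hashtags permission granted by the ACL change.

    # search open changes carrying a hypothetical priority hashtag
    curl -s 'https://review.opendev.org/changes/?q=status:open+hashtag:ironic-priority'

    # add/remove hashtags on a change via POST /changes/{id}/hashtags
    curl -s -u "$GERRIT_USER:$GERRIT_HTTP_PASSWORD" \
        -X POST -H 'Content-Type: application/json' \
        -d '{"add": ["ironic-priority"], "remove": ["needs-backport"]}' \
        https://review.opendev.org/a/changes/766742/hashtags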
*** andrewbonney has quit IRC | 19:17 | |
*** psliuzas has joined #opendev | 19:18 | |
psliuzas | Hey folks, My commit just got merged https://review.opendev.org/c/openstack/project-config/+/773883 and I would like to be the first core reviewer for the repo starlingx/metrics-server-armada-app , could someone help me with that? thanks! | 19:24 |
openstackgerrit | Matt McEuen proposed openstack/project-config master: New Project Request: airship/gerrit-to-github-bot https://review.opendev.org/c/openstack/project-config/+/773936 | 19:29 |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 19:32 |
fungi | psliuzas: sure, taking care of it now, just a moment | 19:32 |
fungi | psliuzas: oh, our deployment automation hasn't run for that yet, i'll check it again in a few minutes | 19:34 |
psliuzas | Thanks! | 19:40 |
fungi | infra-prod-manage-projects TIMED_OUT in 30m 39s | 19:43 |
fungi | i guess that's why it wasn't created | 19:43 |
fungi | looking into it now | 19:43 |
fungi | "Failed to set desciption for: openstack/puppet-openstack_extras 500 Server Error: Internal Server Error for url: https://localhost:3000/api/v1/repos/openstack/puppet-openstack_extras" | 19:47 |
fungi | looks like gitea01 may be having a bad day | 19:47 |
fungi | it errored about setting descriptions on a bunch of projects | 19:48 |
openstackgerrit | Merged openstack/project-config master: New Project Request: airship/gerrit-to-github-bot https://review.opendev.org/c/openstack/project-config/+/773936 | 19:49 |
fungi | i'll keep an eye on that one ^ and see if the problem persists | 19:51 |
clarkb | fungi: thanks | 19:57 |
clarkb | I want to say we considered making the project description update failures non-fatal? | 19:58 |
clarkb | it happens in a different spot than the initial project setup iirc, so we could separate those two concerns and get the project update on the next pass whenever that happens | 19:58 |
fungi | yeah, it's not 100% clear to me from the log that's why it didn't run tasks for gerrit, but seems likely | 19:59 |
clarkb | fungi: I think the whole job short circuits if the gitea stuff fails because we don't want to create a repo in gerrit that will fail to replicate | 19:59 |
fungi | yeah | 20:03 |
clarkb | I'll look into that after my bike ride as that seems like a good improvement | 20:04 |
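Illustrative only (the real tooling is the manage-projects code, not a shell script): the improvement described above amounts to treating the description update as best effort, roughly like the sketch below, while leaving project creation and replication setup failures fatal. The URL mirrors the one from the earlier error; the token variable is a placeholder.

    if ! curl -skf -X PATCH \
        -H "Authorization: token $GITEA_TOKEN" \
        -H 'Content-Type: application/json' \
        -d '{"description": "example description"}' \
        'https://localhost:3000/api/v1/repos/openstack/puppet-openstack_extras'; then
        echo 'WARNING: description update failed; it will be retried on a later pass' >&2
        # deliberately not exiting non-zero here -- only a failed project
        # creation should short-circuit the run, since creating the repo in
        # gerrit without its gitea counterpart would break replication
    fi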
openstackgerrit | Merged zuul/zuul-jobs master: bindep: remove set_fact usage when converting string to list https://review.opendev.org/c/zuul/zuul-jobs/+/771585 | 20:09 |
*** hashar has quit IRC | 20:11 | |
*** klonn has joined #opendev | 20:16 | |
fungi | now gitea08 is returning "Internal Server Error for url: https://localhost:3000/api/v1/orgs/pypa/teams?limit=50&page=2" according to the latest log | 20:23 |
fungi | and gitea04 said "401 Client Error: Unauthorized for url: https://localhost:3000/api/v1/user/orgs?limit=50&page=1" | 20:24 |
fungi | i wonder if something is going sideways in gitea | 20:24 |
clarkb | cacti shows significant new cpu demand on 01 | 20:25 |
clarkb | 04 was in a similar situation until recently but seems to have subsided | 20:26 |
clarkb | 07 and 08 exhibit similar | 20:27 |
fungi | yeah, seeing that. maybe we're getting slammed by something/someone | 20:27 |
fungi | if it hasn't subsided by the time my kettle reaches a boil, i'll start digging into apache access logs and looking at blocking abusive client addresses | 20:29 |
clarkb | fungi: remember that you need to map the connecting port in apache to the haproxy logs in the lb syslog | 20:30 |
clarkb | fungi: since from apache's perspective all connections originate from the load balancer | 20:30 |
clarkb | hrm apache may not be logging that :/ | 20:31 |
clarkb | fungi: maybe we set up something like 'LogFormat "%h:%{remote}p %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined' and then tell CustomLog to use that format? | 20:40 |
clarkb | I'll push that change up now and we can iterate on it if necessary | 20:40 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add remote port info to gitea apache access logs https://review.opendev.org/c/opendev/system-config/+/774000 | 20:43 |
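In vhost terms the change boils down to roughly the following (paraphrased from the review; the exact log path used by the role is assumed). `%{remote}p` is the client-side port of the TCP connection, which from apache's point of view is haproxy's source port, i.e. the key needed to line apache entries up with the lb syslog.

    LogFormat "%h:%{remote}p %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
    CustomLog /var/log/apache2/gitea-ssl-access.log combined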
fungi | ahh, yep | 20:45 |
fungi | established tcp connections through the lb shot waaay up around 19z | 20:48 |
*** sshnaidm is now known as sshnaidm|afk | 20:52 | |
ianw | o/ | 20:53 |
clarkb | ianw: for the last little bit the giteas are doing the dance similar to the thing you wrote the apache vhost for | 20:54 |
clarkb | ianw: we've noticed that apache isn't logging source ports, so it's hard to map to the load balancer logs. I pushed https://review.opendev.org/c/opendev/system-config/+/774000 | 20:54 |
ianw | interesting, and it seems like they must be coming from separately hashed addresses if multiple gitea's are feeling it | 20:55 |
fungi | seems the lb is strafing problem traffic to different backends until they oom | 20:56 |
clarkb | I think too what can happen is one gets overwhelmed and the lb takes it out of the pool and then the addrs shift | 20:56 |
fungi | so it could be just one or a small handful of client addresses | 20:56 |
fungi | but memory consumption seems to be the predominant symptom, we're reaching oom conditions on backend servers | 20:57 |
fungi | huh, any idea why we've got afs set up on the gitea servers? | 20:57 |
clarkb | I don't see afs on gitea08, but I may not be looking properly | 20:58 |
fungi | d'oh, my bad. i should be on gitea08 not ze08 | 20:59 |
fungi | yeah, definite oom there | 20:59 |
fungi | killed a gitea process | 20:59 |
fungi | clarkb spotted one ipv4 address making a ton of requests which were getting directed to gitea01 where the current memory crisis seems to be unfolding. i've temporarily blocked it in iptables on the lb to see what happens | 21:03 |
clarkb | already load seems better fwiw | 21:03 |
fungi | mnaser: seems we may be getting spammed by very heavy git clone operations in volume from 38.102.83.175 which looks like a vexxhost customer (but isn't us as far as i can tell). i've temporarily blocked access from that address to the git servers | 21:05 |
clarkb | that is a bit of an imperfect correlation without the port details | 21:05 |
clarkb | we can get the logging improvement in then open things up and see what we can infer from there | 21:06 |
fungi | yeah, and i'll watch the logs here for a bit, then try to remove the block rule and see if the problem resumes | 21:06 |
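The temporary block amounts to an iptables rule of roughly this shape (chain and rule placement are assumptions; the address is the one identified in the lb logs):

    # insert a drop rule for the suspect client on the load balancer ...
    sudo iptables -I INPUT -s 38.102.83.175 -j DROP
    # ... and delete the same rule once that client is ruled out as the cause
    sudo iptables -D INPUT -s 38.102.83.175 -j DROP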
ianw | there is something cloning /vexxhost/* with an odd UA "GET /vexxhost/helm-charts/info/refs?service=git-upload-pack HTTP/1.1" 200 8436 "-" "git/1.0" | 21:10 |
*** psliuzas has quit IRC | 21:10 | |
ianw | # cat gitea-ssl-access.log | grep 'git/1.0' | awk '{print $7}' | sort | uniq -c | 21:11 |
ianw | 1229 /vexxhost/helm-charts/info/refs?service=git-upload-pack | 21:11 |
ianw | 297 /vexxhost/openstack-operator/info/refs?service=git-upload-pack | 21:11 |
ianw | 297 /vexxhost/rbac-helm/info/refs?service=git-upload-pack | 21:11 |
ianw | it has very particular interest | 21:11 |
fungi | yeah, the potentially problematic requests i was seeing all had git/1.0 as the ua | 21:11 |
ianw | https://github.com/src-d/go-git/blob/master/plumbing/transport/http/common.go#L19 | 21:12 |
fungi | looks like gitea01 also reached oom conditions | 21:12 |
fungi | yep | 21:13 |
fungi | [Wed Feb 3 20:54:22 2021] Killed process 29676 (gitea) total-vm:30048404kB, anon-rss:7604728kB, file-rss:0kB, shmem-rss:0kB | 21:13 |
fungi | problem client(s) may have gotten punted by the lb to a fresh backend after that | 21:14 |
ianw | gitea01 seems to have no "git/1.0" UA requests? | 21:15 |
fungi | so far the problem seems to have hit 01, 04, 07 and 08 | 21:16 |
fungi | 01 looks reasonably healthy again in past 10-15 minutes | 21:19 |
fungi | i don't see any indication the load has shifted to another backend | 21:20 |
fungi | the secondary symptom of established tcp connection count on the lb also seems to have subsided around the same timeframe | 21:21 |
fungi | in a few more minutes i'll try removing the firewall rule blocking 38.102.83.175 | 21:21 |
ianw | i can't pick any common themes from the logs like on gitea08 with the git/1.0 thing. although git/1.0 seems to be a pretty common thing used in a few git libraries. all it really indicates is whatever is cloning isn't actually a basic git client, but something using a library | 21:23 |
fungi | i've approved another project creation change, in hopes that might flush the incompletely applied changes from earlier | 21:25 |
fungi | gonna pop out to check the mail while that grist churns through the mill, brb | 21:27 |
*** hamalq has joined #opendev | 21:27 | |
*** klonn has quit IRC | 21:32 | |
*** whoami-rajat__ has quit IRC | 21:34 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 21:34 |
openstackgerrit | Merged openstack/project-config master: Add ansible-role-pki repo https://review.opendev.org/c/openstack/project-config/+/773385 | 21:36 |
fungi | and just waiting for that to deploy now | 21:37 |
fungi | seems to be in progress, tailing the log on bridge.o.o | 21:41 |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 21:44 |
fungi | so far it's only gitea06 which hasn't reported any task result | 21:46 |
fungi | load average there is pretty high | 21:47 |
fungi | like around 10 right now | 21:47 |
fungi | looks like swap usage is spiking on 06 just in the last poll interval | 21:48 |
fungi | possible it's just manage-projects briefly running through all the descriptions | 21:49 |
*** sboyron_ has quit IRC | 21:49 | |
fungi | but the other 7 backends completed far faster | 21:49 |
fungi | yeah, swap is getting exhausted quickly there | 21:51 |
fungi | already basically no ram available and half the swap in use | 21:51 |
fungi | okay, seems to be subsiding now | 21:52 |
fungi | no oom (yet anyway) | 21:52 |
fungi | i need to start working on dinner but will try to keep one eye on my terminals | 21:55 |
ianw | ok sorry back now | 22:05 |
ianw | seems like we don't really have a smoking gun | 22:05 |
*** openstackgerrit has quit IRC | 22:11 | |
fungi | not so far, no | 22:15 |
clarkb | have we gotten better logging in place? | 22:39 |
ianw | umm i +2'd it but it hadn't finished testing | 22:40 |
clarkb | I think last time we weren't able to pinpoint anything until we had something similar in the gitea logs | 22:40 |
ianw | looks like it's still moving through | 22:40 |
clarkb | I expect that to be a big help given previous experiences | 22:40 |
*** slaweq has quit IRC | 22:41 | |
*** slaweq has joined #opendev | 22:43 | |
fungi | looks like the manage-projects run actually completed without timing out | 22:45 |
clarkb | lots of connections from a single vexxhost ip to gitea06 according to the lb | 22:45 |
clarkb | I wonder if it's just bouncing around a few IPs there? | 22:45 |
fungi | psliuzas is gone, but i've added them to starlingx-metrics-server-armada-app-core | 22:46 |
fungi | does look like i caught that right as load was ramping up for gitea06 | 22:46 |
fungi | can see it probably hit an oom condition a few minutes ago now | 22:47 |
clarkb | as far as I can tell these IPs from vexxhost are not part of our gitea cluster or in our nodepool logs (so not ours) | 22:47 |
fungi | [Wed Feb 3 22:30:41 2021] Killed process 14724 (gitea) total-vm:22863900kB, anon-rss:7564180kB, file-rss:0kB, shmem-rss:0kB | 22:47 |
fungi | yeah so oom on gitea06 ~47 minutes ago | 22:47 |
*** slaweq has quit IRC | 22:47 | |
fungi | er, ~17 | 22:47 |
clarkb | load was high as of a few minutes ago | 22:48 |
clarkb | ya | 22:48 |
clarkb | I didn't think it was that long ago :) | 22:48 |
clarkb | I feel like the key is to catch whichever one is next now | 22:49 |
clarkb | before it goes completely sad | 22:49 |
fungi | i've reset iptables on gitea-lb01 now so 38.102.83.175 is no longer blocked | 22:49 |
fungi | as it didn't seem that one (or that one alone anyway) was the problem | 22:50 |
clarkb | 06 has the highest system load of the set, the rest look quite happy actually | 22:50 |
fungi | system load average is back down around 1 now | 22:51 |
fungi | on gitea06 | 22:51 |
clarkb | that vexxhost IP seems to have continuously made requests that hit 06 for hours and hours and hours | 22:53 |
clarkb | which is interesting, but maybe an indication it isn't to blame | 22:54 |
clarkb | however that vexxhost IP made far and away the most requests to gitea06 while cacti reports it as being under high load | 22:58 |
fungi | eyeballing the overall impact, it's possible these two ip addresses together are the cause | 23:01 |
fungi | since it looks like blocking one of them may have roughly halved the effect | 23:01 |
fungi | but it's also possible utilization is trailing off in general for the day, and is no longer compounding the problem | 23:02 |
fungi | mildly amusing, the address i blocked earlier, when stuffed into my web browser, reveals that it's actually trunk-centos8.rdoproject.org | 23:04 |
fungi | and the other one seems to be trunk-primary.rdoproject.org | 23:05 |
fungi | so maybe we need to reach out to rdo folks and make sure everything is okay on their end? | 23:05 |
fungi | we probably even have some rdo people in here or at least in #openstack-infra who can check on things | 23:07 |
fungi | and would probably be faster than having vexxhost support act as a relay for the discussion | 23:08 |
*** openstackgerrit has joined #opendev | 23:24 | |
openstackgerrit | Merged opendev/system-config master: Add remote port info to gitea apache access logs https://review.opendev.org/c/opendev/system-config/+/774000 | 23:24 |
clarkb | I was able to use the new logging from ^ on gitea06 to correlate some requests to the rdo host fungi pointed out above. | 23:39 |
clarkb | That is the host I identified as making the bulk of the requests to gitea06 via haproxy logs | 23:39 |
clarkb | still not an indication that what they did is wrong (and in fact they seem to regularly poll repos for ref updates) | 23:39 |
clarkb | but it was a good test case for whether our logs give us what we need to correlate things now, and I think they do. | 23:39 |
clarkb | We might consider logging the apache source port on the connection to gitea so that we can correlate between apache and gitea too? | 23:40 |
clarkb | actually I don't know how to expose that with apache logging | 23:41 |
clarkb | %{format}p doesn't seem to have a format for that | 23:41 |
clarkb | ianw: fungi fwiw I think at this point we largely need to see it happening again so that we log it with the data necessary to correlate things then go from there | 23:42 |
ianw | ++ | 23:42 |
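A hedged sketch of the correlation workflow this enables; the log paths, field positions and the example port are assumptions about typical haproxy/apache layouts rather than the production configuration.

    # rank client addresses by connection count in the haproxy syslog
    # (field position depends on the configured log format)
    awk '{print $6}' /var/log/haproxy.log | cut -d: -f1 | sort | uniq -c | sort -rn | head

    # pull the haproxy entries for one suspect client ...
    grep -F '38.102.83.175:' /var/log/haproxy.log | tail -5
    # ... then look a specific lb source port (51234 is hypothetical) up in the
    # backend's apache log, which now prefixes each line with ip:port
    grep -E '^[0-9.]+:51234 ' /var/log/apache2/gitea-ssl-access.log | tail -5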
corvus | i wonder if we can get metrics from gerrit on certain operations (like how long a push takes) | 23:42 |
corvus | i was wondering that as i just pushed a change and it seemed to take a good 10-15 seconds | 23:43 |
clarkb | corvus: for replication to gitea? oh this is separate | 23:43 |
corvus | yeah, sorry, separate | 23:43 |
clarkb | corvus: I think that you can probably get that out of the ssh logs | 23:43 |
clarkb | I want to say there is timing info there and there should be enough info to split out the git operations | 23:43 |
corvus | might be a nice thing to track in a dashboard as opposed to anecdata | 23:43 |
clarkb | but its been while since I looked at that log file | 23:43 |
clarkb | fwiw I noticed that pushing to gerrit's gerrit is similarly slow (but I've only pushed a handful of times to there recently) | 23:44 |
clarkb | also that is over http not ssh | 23:44 |
corvus | clarkb: yeah; though i always chalked that up to their backend (i assume a lot of distributed locking is involved) | 23:44 |
corvus | i sort of assumed they had a high cost for each push, but that they could scale out to a lot of simultaneous pushes (to different repos at least) | 23:45 |
corvus | but that's totally just assumption/inference on my part | 23:45 |
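If the sshd_log does carry per-command timing (something to verify against the live file before trusting any numbers), a rough first look at push latency might be as simple as the following; the log path and the assumption that a trailing column holds the execution time are both guesses.

    # speculative: surface the slowest recent receive-pack operations, assuming
    # each command line ends with timing columns like "... 12ms 1532ms 0"
    grep 'git-receive-pack' /var/gerrit/logs/sshd_log | awk '{print $(NF-1), $0}' | sort -rn | head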
*** tosky has quit IRC | 23:54 | |
clarkb | fungi: it appears that updating project descriptions is already a best effort attempt and shouldn't cause things to fail | 23:59 |
clarkb | fungi: I think the implication there is that something failed when trying to create the new project in gitea and that was a valid failure | 23:59 |