*** mlavalle has quit IRC | 00:30 | |
*** DSpider has quit IRC | 00:43 | |
ianw | clarkb: could you look at https://review.opendev.org/#/c/744038/ for additional quay.io mirror bits | 00:55 |
*** openstackgerrit has joined #opendev | 01:23 | |
openstackgerrit | Merged opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497 | 01:23 |
*** qchris has quit IRC | 01:53 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:00 |
*** qchris has joined #opendev | 02:05 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: A pyca/cryptography to Zuul tenant https://review.opendev.org/745990 | 02:10 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:14 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:36 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:45 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:07 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:12 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 03:23 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:29 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:33 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:44 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:49 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:59 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 04:41 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 04:50 |
*** ysandeep|away is now known as ysandeep | 06:06 | |
*** jaicaa has quit IRC | 06:07 | |
*** jaicaa has joined #opendev | 06:10 | |
ianw | infra-root: seems opendev.org is having ... issues | 06:45 |
ianw | hard to say | 06:45 |
ianw | the gitea container on gitea04 restarted just recently | 06:45 |
ianw | gitea03 is under memory pressure, but no one thing | 06:46 |
ianw | http://paste.openstack.org/show/796802/ | 06:46 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=properties&local_graph_id=66680&rra_id=0&view_type=tree&graph_start=1597299176&graph_end=1597301076 | 06:48 |
ianw | 06:25 it started going crazy | 06:48 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=66797&rra_id=0&view_type=tree&graph_start=1597298351&graph_end=1597300943 | 06:54 |
ianw | it seems all the gitea hosts dropped off from ~ 6:08 -> ~6:35 | 06:55 |
ianw | i think what might have happened here is some sort of progressive outage on the gitea servers; the load balancer noticed some of them not responding and cut them out | 06:58 |
ianw | but that then started to overload whatever was left | 06:58 |
ianw | gitea03 and 05 maybe | 07:00 |
*** ryohayakawa has quit IRC | 07:04 | |
*** tosky has joined #opendev | 07:42 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Create pyca/infra https://review.opendev.org/746014 | 07:49 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** hashar has joined #opendev | 08:03 | |
*** ryohayakawa has joined #opendev | 08:09 | |
*** DSpider has joined #opendev | 08:15 | |
*** ysandeep is now known as ysandeep|lunch | 09:20 | |
*** ysandeep|lunch is now known as ysandeep | 09:35 | |
*** hashar has quit IRC | 09:46 | |
mnaser | ianw, infra-root: anything from our side? | 12:12 |
mnaser | looks like it's more in the vms.. | 12:12 |
*** hashar has joined #opendev | 12:12 | |
*** ryohayakawa has quit IRC | 12:18 | |
*** marios|ruck has joined #opendev | 12:35 | |
*** andrewbonney has joined #opendev | 13:17 | |
*** qchris has quit IRC | 13:24 | |
*** tkajinam has quit IRC | 13:32 | |
*** qchris has joined #opendev | 13:53 | |
*** qchris has quit IRC | 13:56 | |
clarkb | that was basically what our china source ip ddos looked like. I wonder if we've got another ddos | 14:07 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all | 14:11 |
clarkb | that shows a connection spike but not like the ddos which hit the haproxy connection limits | 14:11 |
clarkb | still possible those were costly requests that backed things up | 14:11 |
*** qchris has joined #opendev | 14:14 | |
clarkb | thinking out loud here: we may want to reboot each of the backends in sequence to clear out any OOM fallout then do a gerrit full sync replication (there are reports some repos are not in sync) | 14:18 |
clarkb | this assumes the issue isn't persistent and was related to that spike in requests | 14:18 |
fungi | the timing suggests daily cron jobs | 14:20 |
clarkb | based on when gaps in cacti graph data happened we seem to have largely recovered. The gaps also correlate well to that spike in connections except for gitea05 | 14:27 |
clarkb | http://cacti.openstack.org/cacti/graph_view.php | 14:27 |
clarkb | her | 14:27 |
clarkb | *er | 14:27 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66728&rra_id=all | 14:27 |
clarkb | I can never figure out linking to top level for a host but that shows it in a specific graph | 14:28 |
fungi | yeah, i agree rolling reboots followed by full replication is probably warranted | 14:28 |
clarkb | I can work on that in about half an hour | 14:29 |
clarkb | looking at gitea05 more, the early blank spot doesn't correlate to any major increase in network connections or traffic | 14:30 |
clarkb | is it possible there was networking trouble there like we saw yesterday that caused the servers that were reachable to take on more load? | 14:31 |
clarkb | different cloud regions though aiui | 14:31 |
fungi | yeah, yesterday's v6 routing problem was in ca-ymq-1 and the gitea servers are in sjc1 | 14:33 |
fungi | i can start on reboots... do we need to down the servers in haproxy first? | 14:34 |
fungi | not sure how graceful others have tried to be with these in the past | 14:34 |
clarkb | looking at syslog on gitea05 we seem to just be OOMing in a loop | 14:35 |
clarkb | that stopped about 5 hours ago | 14:35 |
clarkb | but also started before those gaps in time | 14:36 |
clarkb | 02:51:23 is when that started | 14:36 |
clarkb | oh thats actually when we have the first gap on gitea05 | 14:37 |
clarkb | there are a bunch of git GETs for charms around when the OOM first started there | 14:42 |
*** hashar has quit IRC | 14:45 | |
clarkb | yes a canonical IP is second biggest requestor of gitea05 between 02:00 and 03:00 | 14:47 |
clarkb | not surprising that charms show up given that. However there is a much more request happy IP I'm trying to figure out next | 14:48 |
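For context, a rough sketch of the per-IP request counting being done here, assuming the client IP is the first whitespace-separated field of the gitea access log; the log path and the time filter are illustrative only:

```python
import collections
import re

LOG = "/var/gitea/logs/access.log"   # path mentioned later in this log; format assumed here
HOUR = re.compile(r":02:\d\d:\d\d")  # crude filter for the 02:00-03:00 window (illustrative)

counts = collections.Counter()
with open(LOG) as fh:
    for line in fh:
        if HOUR.search(line):
            counts[line.split()[0]] += 1  # first field assumed to be the client IP

for ip, hits in counts.most_common(10):
    print(f"{hits:8d}  {ip}")
```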
*** qchris has quit IRC | 14:48 | |
clarkb | according to our logs at about 01:04 gitea logs a request from this particular IP as a GET for a charm repo then haproxy reports it was in a CD state at 02:51 | 14:51 |
clarkb | and all but one request from this IP (of which there are thousands) ends in a CD state | 14:52 |
clarkb | which is a closed disconnected error state from haproxy iirc | 14:52 |
clarkb | looking at our top 10 requestors to the load balancer only those two charms requestors show up in gitea05's log during that time span. The abundance of CD state connections and the amount of time that they seem to be held open is somewhat suspicious | 14:58 |
clarkb | a full 99.84% of requests from that particular IP end up in that state | 14:59 |
clarkb | I wonder if this is a client issue? | 15:00 |
clarkb | in any case it does seem to have subsided | 15:00 |
clarkb | and I think the rolling reboots are worthwhile to clear out any issues. I'll start that now and will take hosts out of the rotation in haproxy before I reboot them | 15:00 |
*** qchris has joined #opendev | 15:01 | |
clarkb | there are also a couple of really chatty vexxhost IPs that we will want to cross check against our nodepool logs (they don't seem to have reverse dns) | 15:03 |
clarkb | but they don't seem to correlate to when problems start | 15:03 |
*** mlavalle has joined #opendev | 15:10 | |
clarkb | reboots are done | 15:17 |
clarkb | I'll start gerrit replication momentarily | 15:17 |
fungi | i was willing to take care of the reboots but wanted to know if we gracefully down them in the haproxy pools one at a time and how long we wait before rebooting to make sure requests aren't still in progress | 15:18 |
clarkb | fungi: oh sorry, yes I gracefully downed them then tailed /var/gitea/logs/access.log and waited for requests from the load balancer IP to trail off | 15:19 |
clarkb | there are internal requests from 127.0.0.1 that get made and some web crawler is also crawling them which I ignored | 15:19 |
clarkb | I missed your messages earlier I was so heads down on this (early morning blinders) | 15:19 |
clarkb | https://docs.opendev.org/opendev/system-config/latest/gitea.html#backend-maintenance has docs on the haproxy manipulation | 15:20 |
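As an aside, a minimal sketch of the kind of graceful backend drain described above, using haproxy's admin socket; the socket path and the backend/server names below are assumptions for illustration, not the exact values from the linked backend-maintenance docs:

```python
import socket

# Assumed values for illustration; the real socket path and the
# backend/server names are in the system-config backend-maintenance docs.
HAPROXY_SOCKET = "/var/haproxy/run/stats"
BACKEND = "balance_git_https"
SERVER = "gitea05.opendev.org"


def haproxy_cmd(command: str) -> str:
    """Send one command to the haproxy admin socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall((command + "\n").encode())
        return sock.recv(65536).decode()


# Take the backend out of rotation before the reboot...
haproxy_cmd(f"disable server {BACKEND}/{SERVER}")
# ...wait for requests from the load balancer IP to trail off in the gitea
# access log, reboot the host, then put it back in rotation.
haproxy_cmd(f"enable server {BACKEND}/{SERVER}")
```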
fungi | no worries, just didn't want you stuck shouldering it all | 15:22 |
clarkb | I've also pinged mnaser about the chatty vexxhost IPs in case they are doing something unexpected. I don't really think they were doing anything to trigger the problems though | 15:25 |
clarkb | it really does seem like the IP interested in charms that couldn't successfully finish a connection is related | 15:25 |
clarkb | I wonder if all those connections failed because it was doing something that caused gitea to fail (OOM?) | 15:26 |
clarkb | making that correlation is likely to be more difficult though we could try making the requests it was making I suppose | 15:26 |
clarkb | in trying to correlate the requests that IP is making more accurately I'm discovering the 65k limit on port numbers means we recycle them often :/ | 15:29 |
clarkb | ah ok I see more things. The data transferred values seem to be important here | 15:31 |
clarkb | sometimes we transfer nothing and the gitea backend never sees it | 15:31 |
*** ysandeep is now known as ysandeep|away | 15:35 | |
fungi | also further investigation of a suspicious client ip address has turned up what appears to be a socket proxy to our git hosting running on a vm in hetzner | 15:44 |
fungi | very bizarre | 15:45 |
clarkb | my current hunch is that that proxy undoes any load balancing from those sources since we balance on source IP. That then allowed it to bounce between backends as they failed under the load associated with those connections | 15:45 |
clarkb | it's possible other connections were responsible though, don't have a strong enough correlation to that yet | 15:46 |
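A toy illustration of the hunch: with source-IP balancing, every client hiding behind one proxy IP hashes to the same backend (simplified hashing and an assumed backend pool, not haproxy's actual `balance source` algorithm):

```python
import hashlib

# Assumed backend pool for illustration.
BACKENDS = ["gitea01", "gitea02", "gitea03", "gitea04", "gitea05"]


def pick_backend(client_ip: str) -> str:
    # Simplified stand-in for consistent source-IP hashing.
    digest = int(hashlib.sha1(client_ip.encode()).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]


# Distinct client IPs spread out across the pool...
print({ip: pick_backend(ip) for ip in ("198.51.100.7", "203.0.113.9", "192.0.2.44")})
# ...but every client tunnelled through one proxy IP maps to a single backend,
# and all of that traffic follows the proxy to a new backend whenever the
# current one is pulled out of rotation.
print(pick_backend("198.51.100.200"))
```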
*** priteau has joined #opendev | 15:46 | |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change https://review.opendev.org/746168 | 15:52 |
clarkb | fungi: ^ thats one of two gerritbot job tie ins we need (but should wait for gerrit replication to finish before merging that) | 15:53 |
clarkb | the other is to run when gerritbot's docker images update | 15:53 |
fungi | ahh, thanks, i had already forgotten about that. it was fairly late last night | 15:53 |
fungi | though in good news, my patch seems to have solved the regression we saw | 15:53 |
clarkb | excellent | 15:53 |
fungi | frickler: ^ it was your excellent eye which spotted the cause, so thanks! | 15:54 |
clarkb | actually hrm. For the image update causing things to update we'd need to add gerritbot's project ssh key to bridge | 15:55 |
fungi | it bears remembering that if you're iterating over a dict's keys() iterable, and then you change the dict in that loop, you change what you're iterating over | 15:55 |
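A minimal Python reproduction of that pitfall (not the actual gerritbot code):

```python
channels = {"#opendev": ["opendev/system-config"], "#stale": []}

# Buggy: deleting entries while iterating over the live keys() view raises
# "RuntimeError: dictionary changed size during iteration".
# for name in channels.keys():
#     if not channels[name]:
#         del channels[name]

# Safe: iterate over a snapshot of the keys instead.
for name in list(channels.keys()):
    if not channels[name]:
        del channels[name]

print(channels)  # {'#opendev': ['opendev/system-config']}
```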
clarkb | I'm thinking that maybe it is better to simply run that hourly or daily instead? | 15:55 |
fungi | clarkb: yeah, i was getting sleepy but was trying to figure out how this was any different from other stuff we're deploying where the image builds don't happen in the context of system-config changes | 15:56 |
fungi | i think last time this came up we concluded that we'd need to rely on periodic jobs for now | 15:56 |
clarkb | wfm I'll get that patch up shortly now | 15:56 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run service-eavesdrop hourly https://review.opendev.org/746181 | 15:59 |
*** diablo_rojo has joined #opendev | 16:20 | |
*** marios|ruck is now known as marios|out | 16:26 | |
*** tosky has quit IRC | 16:43 | |
*** marios|out has quit IRC | 16:44 | |
*** JayF has quit IRC | 17:15 | |
*** andrewbonney has quit IRC | 17:31 | |
*** priteau has quit IRC | 17:36 | |
*** priteau has joined #opendev | 17:44 | |
*** priteau has quit IRC | 17:53 | |
AJaeger | clarkb, just saw your comment on 746168 and removed my +A, please self-approve once ready | 18:30 |
clarkb | AJaeger: replication is done I'll reapprove. Thanks | 18:33 |
AJaeger | clarkb: great | 18:34 |
clarkb | mnaser: osc reports 'Certificate did not match expected hostname: compute.public.mtl1.vexxhost.net. Certificate: {'subject': ((('commonName', '*.vexxhost.net'),),), 'subjectAltName': [('DNS', '*.vexxhost.net'), ('DNS', 'vexxhost.net')]}' trying to show an instance details | 18:34 |
clarkb | fungi: ^ do glob certs only do a single level of dns? | 18:34 |
openstackgerrit | Merged openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change https://review.opendev.org/746168 | 18:36 |
fungi | good question, i thought they covered anything within that zone | 18:37 |
clarkb | and now we should be able to land smcginnis' change to update the gerritbot channel config and be good to go | 18:37 |
fungi | but if you delegate subdomains to other zones they won't | 18:37 |
fungi | wildcard records aren't returned as dns responses, they're a shorthand instruction to the authoritative nameserver to match any request, but they're zone-specific | 18:38 |
fungi | oh! though this isn't wildcard dns records, this is wildcard subject (alt)names | 18:39 |
clarkb | ya its sslcert verification | 18:39 |
smcginnis | \o/ | 18:39 |
fungi | clarkb: confirmed, apparently you can't wildcard multiple levels of subdomains with a single subject (alt)name | 18:43 |
clarkb | mnaser: ^ I think that may be something you'll want to fix | 18:45 |
fungi | "Names may contain the wildcard character * which is considered to match any single domain name component or component fragment. E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com." https://www.ietf.org/rfc/rfc2818.txt §3.1¶4 | 18:45 |
fungi | also finding a number of kb articles from certificate authorities and questions at places like serverfault agreeing this is the case | 18:46 |
fungi | apparently some browsers did at one point treat the wildcard as matching any subsequent levels, but most (all?) have ceased doing so as it was a blatant standards violation | 18:47 |
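A small sketch of the single-label wildcard rule being quoted, roughly following RFC 2818 §3.1 (an illustration, not the exact matching code the SSL libraries use):

```python
import fnmatch


def cert_name_matches(pattern: str, hostname: str) -> bool:
    """Match a certificate (alt)name against a hostname, where '*' only
    covers a single domain label, per RFC 2818 section 3.1."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans label boundaries, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    return all(fnmatch.fnmatchcase(h, p) for p, h in zip(p_labels, h_labels))


print(cert_name_matches("*.vexxhost.net", "compute.vexxhost.net"))              # True
print(cert_name_matches("*.vexxhost.net", "compute.public.mtl1.vexxhost.net"))  # False
```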
clarkb | I've approved smcginnis' gerritbot config change | 18:47 |
clarkb | we should see gerritbot reconnect when that gets applied | 18:48 |
fungi | this will be a good test | 18:48 |
*** JayF has joined #opendev | 18:55 | |
openstackgerrit | Merged openstack/project-config master: Gerritbot: only comment on stable:follows-policy repos https://review.opendev.org/744947 | 18:59 |
*** openstackgerrit has quit IRC | 19:02 | |
mnaser | clarkb: it should be ok again now | 19:05 |
*** hashar has joined #opendev | 19:10 | |
diablo_rojo | In thinking about the ptg. Its probably good to 'de-openstack' the irc channel. Any qualms with my making a new one just called '#ptg'? | 19:53 |
clarkb | diablo_rojo: we get some management simplification by namespacing on freenode | 19:55 |
clarkb | basically freenode knows who to go to for all #openstack- prefixed channels | 19:55 |
clarkb | not a reason to avoid #ptg but something to keep in mind | 19:55 |
diablo_rojo | Makes sense. We have the #openstack-ptg channel already obviously, but I figured it might be more inclusive to other projects to make one without the prefix | 19:57 |
*** tosky has joined #opendev | 20:10 | |
mnaser | clarkb: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 -- is it possible to maybe rekick nodepool as it may be using a cached service catalog? | 20:19 |
fungi | diablo_rojo: if osf is going to have a bunch of those sorts of channels (this already came up wrt the #openstack-diversity channel for example) maybe we want an #osf prefix or something | 20:21 |
mnaser | cc infra-root ^ | 20:28 |
fungi | looking | 20:29 |
corvus | fungi: i'm around if you need help | 20:32 |
fungi | mnaser: it's the ssl cert problem clarkb noted earlier | 20:33 |
mnaser | fungi, corvus: the endpoint has changed and i think nodepool has the value cached | 20:33 |
fungi | the cert you're serving is not valid for compute-ca-ymq-1.vexxhost.net | 20:33 |
fungi | oh, i get it | 20:33 |
mnaser | right, but the url in the service catalog is compute.public.mtl1.vexxhost.net | 20:33 |
mnaser | :) | 20:34 |
fungi | yeah, the launcher will need a restart for that | 20:34 |
fungi | just a sec | 20:34 |
fungi | #status manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1) | 20:36 |
openstackstatus | fungi: unknown command | 20:36 |
fungi | d'oh! | 20:36 |
fungi | #status log manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1) | 20:36 |
openstackstatus | fungi: finished logging | 20:36 |
fungi | there we go | 20:36 |
fungi | mnaser: thanks, that seems to be spewing a lot fewer errors in its logs now | 20:37 |
*** openstackgerrit has joined #opendev | 21:05 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 21:05 |
ianw | fungi/clarkb: thanks for looking at gitea; do we think the reboot has done it? | 21:55 |
clarkb | ianw: I think it sorted itself out then reboots were mostly to ensure there wasn't any bad fallout from the OOMs | 21:57 |
clarkb | ianw: it looked like that proxy may have contributed to the problem, possibly because it had a bunch of things behind it all hitting a single backend due to the proxy having a single IP | 21:57 |
clarkb | then when one server was sad the haproxy lb took it out of the rotation pointing that proxy at a new backend and rinse and repeat | 21:58 |
ianw | ahh, yes that sounds likely | 21:58 |
ianw | infra-root: if i could get some eyes on creating a pyca/infra project @ https://review.opendev.org/#/c/746014/ that would help me continue fiddling with manylinux wheel generation | 22:02 |
ianw | my hopes that we'd have sort of drop-in manylinux support are probably dashed ... for example cryptography does a custom builder image on top of the upstream builder images that pulls in openssl and builds it fresh | 22:03 |
ianw | which is fine, but not generic | 22:04 |
ianw | one thing is though, if i build custom manylinux2014_aarch64 images speculatively using buildx, i unfortunately can't run them on arm64 speculatively | 22:06 |
ianw | because can't mix architectures | 22:06 |
clarkb | fwiw with buildx things that do IO seem fine but not cpu (like compiling) | 22:07 |
clarkb | expect compiling openssl to take significant time. Though we can certainly test it to find out how much | 22:07 |
clarkb | ianw: I think you can do speculative testing without buildx though | 22:07 |
clarkb | then run both the image build and the use of the image in the linaro cloud | 22:08 |
ianw | hrm, yes i wasn't sure of the state of native image builds | 22:09 |
ianw | native container builds i should probably say | 22:09 |
clarkb | I think they work fine, though the manifest info might assume x86 by default? | 22:09 |
ianw | maybe ... https://review.opendev.org/#/c/746011/ is sort of the framework, but i don't want to put it in pyca/project-config because that's a trusted repo | 22:12 |
diablo_rojo | fungi, yeah was thinking about an osf prefix too. If that's easier to manage, I am totally cool with that. | 22:15 |
diablo_rojo | (waaaaaaay late in my response, got sucked into other things) | 22:15 |
ianw | when you see how the sausage is made with all this ... it does make you wonder a little bit if you still like sausages | 22:15 |
corvus | yeah, i think we avoided native builds in the general case because we don't want the zuul/nodepool gate to stop if we lose the linaro cloud; that's probably less worrisome for an arm-only situation | 22:17 |
ianw | i can try it and see what happens :) | 22:20 |
openstackgerrit | Merged openstack/project-config master: Create pyca/infra https://review.opendev.org/746014 | 22:29 |
corvus | ianw: ^ deploy playbook is done | 22:51 |
ianw | corvus: thanks, already in testing :) https://zuul.opendev.org/t/pyca/status | 22:52 |
ianw | from what i can tell of upstream, ISTM that the wheels get generated and published as an artifact by github actions | 22:53 |
ianw | i can not see that they are uploaded via that mechanism though, although i may have missed it | 22:53 |
ianw | (uploaded to pypi) | 22:54 |
ianw | https://github.com/pyca/cryptography/actions/runs/176310608/workflow if interested | 22:56 |
*** tkajinam has joined #opendev | 22:58 | |
*** gema has quit IRC | 23:05 | |
*** mlavalle has quit IRC | 23:08 | |
*** tosky has quit IRC | 23:09 | |
*** sgw1 has quit IRC | 23:15 | |
ianw | heh, the manylinux container build decided to use http://mirror.facebook.net/ ... who knew | 23:18 |
fungi | ew | 23:22 |
ianw | building openssl ... https://zuul.opendev.org/t/pyca/stream/d35730a2d4fa4121985b01692cc45c9d?logfile=console.log | 23:34 |
ianw | slowly | 23:34 |
ianw | corvus: so the theory is if i run this on an arm64 node, it might "just work"? i guess the intermediate registry also needs to run there? | 23:36 |
*** DSpider has quit IRC | 23:38 | |
ianw | ... ok, to answer my own question -- the intermediate registry seems happy to run in ovh | 23:52 |
ianw | however, ensure-docker is failing on arm64 | 23:53 |
ianw | no package "docker-ce" | 23:53 |
*** hashar has quit IRC | 23:53 | |
*** hashar has joined #opendev | 23:55 | |
*** diablo_rojo has quit IRC | 23:59 |