*** mlavalle has quit IRC | 00:30 | |
*** DSpider has quit IRC | 00:43 | |
ianw | clarkb: could you look at https://review.opendev.org/#/c/744038/ for additional quay.io mirror bits | 00:55 |
*** openstackgerrit has joined #opendev | 01:23 | |
openstackgerrit | Merged opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497 | 01:23 |
*** qchris has quit IRC | 01:53 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:00 |
*** qchris has joined #opendev | 02:05 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: A pyca/cryptography to Zuul tenant https://review.opendev.org/745990 | 02:10 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:14 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:36 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 02:45 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:07 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:12 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 03:23 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:29 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:33 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:44 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:49 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] manylinux builds https://review.opendev.org/745989 | 03:59 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 04:41 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 04:50 |
*** ysandeep|away is now known as ysandeep | 06:06 | |
*** jaicaa has quit IRC | 06:07 | |
*** jaicaa has joined #opendev | 06:10 | |
ianw | infra-root: seems opendev.org is having ... issues | 06:45 |
ianw | hard to say | 06:45 |
ianw | the gitea container on gitea04 restarted just recently | 06:45 |
ianw | gitea03 is under memory pressure, but no one thing | 06:46 |
ianw | http://paste.openstack.org/show/796802/ | 06:46 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=properties&local_graph_id=66680&rra_id=0&view_type=tree&graph_start=1597299176&graph_end=1597301076 | 06:48 |
ianw | 06:25 it started going crazy | 06:48 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=66797&rra_id=0&view_type=tree&graph_start=1597298351&graph_end=1597300943 | 06:54 |
ianw | it seems all the gitea hosts dropped off from ~ 6:08 -> ~6:35 | 06:55 |
ianw | i think what might have happened here is some sort of progressive outage on the gitea servers; the load balancer noticed some of them not responding and cut them out | 06:58 |
ianw | but that then started to overload whatever was left | 06:58 |
ianw | gitea03 and 05 maybe | 07:00 |
*** ryohayakawa has quit IRC | 07:04 | |
*** tosky has joined #opendev | 07:42 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Create pyca/infra https://review.opendev.org/746014 | 07:49 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** hashar has joined #opendev | 08:03 | |
*** ryohayakawa has joined #opendev | 08:09 | |
*** DSpider has joined #opendev | 08:15 | |
*** ysandeep is now known as ysandeep|lunch | 09:20 | |
*** ysandeep|lunch is now known as ysandeep | 09:35 | |
*** hashar has quit IRC | 09:46 | |
mnaser | ianw, infra-root: anything from our side? | 12:12 |
mnaser | looks like it's more in the vms.. | 12:12 |
*** hashar has joined #opendev | 12:12 | |
*** ryohayakawa has quit IRC | 12:18 | |
*** marios|ruck has joined #opendev | 12:35 | |
*** andrewbonney has joined #opendev | 13:17 | |
*** qchris has quit IRC | 13:24 | |
*** tkajinam has quit IRC | 13:32 | |
*** qchris has joined #opendev | 13:53 | |
*** qchris has quit IRC | 13:56 | |
clarkb | that was basically what our china source ip ddos looked like. I wonder if we've got another ddos | 14:07 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all | 14:11 |
clarkb | that shows a connection spike but not like the ddos which hit the haproxy connection limits | 14:11 |
clarkb | still possible those were costly requests that backed things up | 14:11 |
*** qchris has joined #opendev | 14:14 | |
clarkb | thinking out loud here: we may want to reboot each of the backends in sequence to clear out any OOM fallout then do a gerrit full sync replication (there are reports some repos are not in sync) | 14:18 |
clarkb | this assumes the issue isn't persistent and was related to that spike in requests | 14:18 |
fungi | the timing suggests daily cron jobs | 14:20 |
clarkb | based on when gaps in cacti graph data happened we seem to have largely recovered. The gaps also correlate well to that spike in connections except for gitea05 | 14:27 |
clarkb | http://cacti.openstack.org/cacti/graph_view.php | 14:27 |
clarkb | her | 14:27 |
clarkb | *er | 14:27 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66728&rra_id=all | 14:27 |
clarkb | I can never figure out linking to top level for a host but that shows it in a specific graph | 14:28 |
fungi | yeah, i agree rolling reboots followed by full replication is probably warranted | 14:28 |
clarkb | I can work on that in about half an hour | 14:29 |
clarkb | looking at gitea05 more, the early blank spot doesn't correlate to any major increase in network connections or traffic | 14:30 |
clarkb | is it possible there was networking trouble there like we saw yesterday that caused the servers that were reachable to take on more load? | 14:31 |
clarkb | different cloud regions though aiui | 14:31 |
fungi | yeah, yesterday's v6 routing problem was in ca-ymq-1 and the gitea servers are in sjc1 | 14:33 |
fungi | i can start on reboots... do we need to down the servers in haproxy first? | 14:34 |
fungi | not sure how graceful others have tried to be with these in the past | 14:34 |
clarkb | looking at syslog on gitea05 we seem to just be OOMing in a loop | 14:35 |
clarkb | that stopped about 5 hours ago | 14:35 |
clarkb | but also started before those gaps in time | 14:36 |
clarkb | 02:51:23 is when that started | 14:36 |
clarkb | oh thats actually when we have the first gap on gitea05 | 14:37 |
clarkb | there are a bunch of git GETs for charms around when the OOM first started there | 14:42 |
*** hashar has quit IRC | 14:45 | |
clarkb | yes a canonical IP is second biggest requestor of gitea05 between 02:00 and 03:00 | 14:47 |
clarkb | not surprising that charms show up given that. However there is a much more request happy IP I'm trying to figure out next | 14:48 |
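For context, a rough sketch of the per-IP request counting being done here, assuming the client IP is the first whitespace-separated field of the gitea access log; the log path and the time filter are illustrative only:

```python
import collections
import re

LOG = "/var/gitea/logs/access.log"   # path mentioned later in this log; format assumed here
HOUR = re.compile(r":02:\d\d:\d\d")  # crude filter for the 02:00-03:00 window (illustrative)

counts = collections.Counter()
with open(LOG) as fh:
    for line in fh:
        if HOUR.search(line):
            counts[line.split()[0]] += 1  # first field assumed to be the client IP

for ip, hits in counts.most_common(10):
    print(f"{hits:8d}  {ip}")
```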
*** qchris has quit IRC | 14:48 | |
clarkb | according to our logs at about 01:04 gitea logs a request from this particular IP as a GET for a charm repo then haproxy reports it was in a CD state at 02:51 | 14:51 |
clarkb | and all but one request from this IP (of which there are thousands) ends in a CD state | 14:52 |
clarkb | which is a closed disconnected error state from haproxy iirc | 14:52 |
clarkb | looking at our top 10 requestors to the load balancer only those two charms requestors show up in gitea05's log during that time span. The abundance of CD state connections and the amount of time that they seem to be held open is somewhat suspicious | 14:58 |
clarkb | a full 99.84% of requests from that particular IP end up in that state | 14:59 |
clarkb | I wonder if this is a client issue? | 15:00 |
clarkb | in any case it does seem to have subsided | 15:00 |
clarkb | and I think the rolling reboots are worthwhile to clear out any issues. I'll start that now and will take hosts out of the rotation in haproxy before I reboot them | 15:00 |
*** qchris has joined #opendev | 15:01 | |
clarkb | there are also a couple of really chatty vexxhost IPs that we will want to cross check against our nodepool logs (they don't seem to have reverse dns) | 15:03 |
clarkb | but they don't seem to correlate to when problems start | 15:03 |
*** mlavalle has joined #opendev | 15:10 | |
clarkb | reboots are done | 15:17 |
clarkb | I'll start gerrit replication momentarily | 15:17 |
fungi | i was willing to take care of the reboots but wanted to know if we gracefully down them in the haproxy pools one at a time and how long we wait before rebooting to make sure requests aren't still in progress | 15:18 |
clarkb | fungi: oh sorry, yes I gracefully downed them then tailed /var/gitea/logs/access.log and waited for requests from the load balancer IP to trail off | 15:19 |
clarkb | there are internal requests from 127.0.0.1 that get made and some web crawler is also crawling them which I ignored | 15:19 |
clarkb | I missed your messages earlier I was so heads down on this (early morning blinders) | 15:19 |
clarkb | https://docs.opendev.org/opendev/system-config/latest/gitea.html#backend-maintenance has docs on the haproxy manipulation | 15:20 |
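As an aside, a minimal sketch of the kind of graceful backend drain described above, using haproxy's admin socket; the socket path and the backend/server names below are assumptions for illustration, not the exact values from the linked backend-maintenance docs:

```python
import socket

# Assumed values for illustration; the real socket path and the
# backend/server names are in the system-config backend-maintenance docs.
HAPROXY_SOCKET = "/var/haproxy/run/stats"
BACKEND = "balance_git_https"
SERVER = "gitea05.opendev.org"


def haproxy_cmd(command: str) -> str:
    """Send one command to the haproxy admin socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall((command + "\n").encode())
        return sock.recv(65536).decode()


# Take the backend out of rotation before the reboot...
haproxy_cmd(f"disable server {BACKEND}/{SERVER}")
# ...wait for requests from the load balancer IP to trail off in the gitea
# access log, reboot the host, then put it back in rotation.
haproxy_cmd(f"enable server {BACKEND}/{SERVER}")
```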
fungi | no worries, just didn't want you stuck shouldering it all | 15:22 |
clarkb | I've also pinged mnaser about the chatty vexxhost IPs in case they are doing something unexpected. I don't really think they were doing anything to trigger the problems though | 15:25 |
clarkb | it really does seem like the IP interested in charms that couldn't successfully finish a connection is related | 15:25 |
clarkb | I wonder if all those connections failed because it was doing something that caused gitea to fail (OOM?) | 15:26 |
clarkb | making that correlation is likely to be more difficult though we could try making the requests it was making I suppose | 15:26 |
clarkb | in trying to correlate the requests that IP is making more accurately I'm discovering the 65k limit on port numbers means we recycle them often :/ | 15:29 |
clarkb | ah ok I see more things. The data transferred values seem to be important here | 15:31 |
clarkb | sometimes we transfer nothing and the gitea backend never sees it | 15:31 |
*** ysandeep is now known as ysandeep|away | 15:35 | |
fungi | also further investigation of a suspicious client ip address has turned up what appears to be a socket proxy to our git hosting running on a vm in hetzner | 15:44 |
fungi | very bizarre | 15:45 |
clarkb | my current hunch is that that proxy undoes any load balancing from those sources since we balance on source IP. That then allowed it to bounce between backends as they failed under the load associated with those connections | 15:45 |
clarkb | it's possible other connections were responsible though, don't have a strong enough correlation to that yet | 15:46 |
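A toy illustration of the hunch: with source-IP balancing, every client hiding behind one proxy IP hashes to the same backend (simplified hashing and an assumed backend pool, not haproxy's actual `balance source` algorithm):

```python
import hashlib

# Assumed backend pool for illustration.
BACKENDS = ["gitea01", "gitea02", "gitea03", "gitea04", "gitea05"]


def pick_backend(client_ip: str) -> str:
    # Simplified stand-in for consistent source-IP hashing.
    digest = int(hashlib.sha1(client_ip.encode()).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]


# Distinct client IPs spread out across the pool...
print({ip: pick_backend(ip) for ip in ("198.51.100.7", "203.0.113.9", "192.0.2.44")})
# ...but every client tunnelled through one proxy IP maps to a single backend,
# and all of that traffic follows the proxy to a new backend whenever the
# current one is pulled out of rotation.
print(pick_backend("198.51.100.200"))
```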
*** priteau has joined #opendev | 15:46 | |
openstackgerrit | Clark Boylan proposed openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change https://review.opendev.org/746168 | 15:52 |
clarkb | fungi: ^ thats one of two gerritbot job tie ins we need (but should wait for gerrit replication to finish before merging that) | 15:53 |
clarkb | the other is to run when gerritbot's docker images update | 15:53 |
fungi | ahh, thanks, i had already forgotten about that. it was fairly late last night | 15:53 |
fungi | though in good news, my patch seems to have solved the regression we saw | 15:53 |
clarkb | excellent | 15:53 |
fungi | frickler: ^ it was your excellent eye which spotted the cause, so thanks! | 15:54 |
clarkb | actually hrm. For the image update causing things to update we'd need to add gerritbot's project ssh key to bridge | 15:55 |
fungi | it bears remembering that if you're iterating over a dict's keys() iterable, and then you change the dict in that loop, you change what you're iterating over | 15:55 |
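A minimal Python reproduction of that pitfall (not the actual gerritbot code):

```python
channels = {"#opendev": ["opendev/system-config"], "#stale": []}

# Buggy: deleting entries while iterating over the live keys() view raises
# "RuntimeError: dictionary changed size during iteration".
# for name in channels.keys():
#     if not channels[name]:
#         del channels[name]

# Safe: iterate over a snapshot of the keys instead.
for name in list(channels.keys()):
    if not channels[name]:
        del channels[name]

print(channels)  # {'#opendev': ['opendev/system-config']}
```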
clarkb | I'm thinking that maybe it is better to simply run that hourly or daily instead? | 15:55 |
fungi | clarkb: yeah, i was getting sleepy but was trying to figure out how this was any different from other stuff we're deploying where the image builds don't happen in the context of system-config changes | 15:56 |
fungi | i think last time this came up we concluded that we'd need to rely on periodic jobs for now | 15:56 |
clarkb | wfm I'll get that patch up shortly now | 15:56 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run service-eavesdrop hourly https://review.opendev.org/746181 | 15:59 |
*** diablo_rojo has joined #opendev | 16:20 | |
*** marios|ruck is now known as marios|out | 16:26 | |
*** tosky has quit IRC | 16:43 | |
*** marios|out has quit IRC | 16:44 | |
*** JayF has quit IRC | 17:15 | |
*** andrewbonney has quit IRC | 17:31 | |
*** priteau has quit IRC | 17:36 | |
*** priteau has joined #opendev | 17:44 | |
*** priteau has quit IRC | 17:53 | |
AJaeger | clarkb, just saw your comment on 746168 and removed my +A, please self-approve once ready | 18:30 |
clarkb | AJaeger: replication is done I'll reapprove. Thanks | 18:33 |
AJaeger | clarkb: great | 18:34 |
clarkb | mnaser: osc reports 'Certificate did not match expected hostname: compute.public.mtl1.vexxhost.net. Certificate: {'subject': ((('commonName', '*.vexxhost.net'),),), 'subjectAltName': [('DNS', '*.vexxhost.net'), ('DNS', 'vexxhost.net')]}' trying to show an instance details | 18:34 |
clarkb | fungi: ^ do glob certs only do a single level of dns? | 18:34 |
openstackgerrit | Merged openstack/project-config master: Trigger service-eavesdrop when gerritbot channels change https://review.opendev.org/746168 | 18:36 |
fungi | good question, i thought they covered anything within that zone | 18:37 |
clarkb | and now we should be able to land smcginnis' change to update the gerritbot channel config and be good to go | 18:37 |
fungi | but if you delegate subdomains to other zones they won't | 18:37 |
fungi | wildcard records aren't returned as dns responses, they're a shorthand instruction to the authoritative nameserver to match any request, but they're zone-specific | 18:38 |
fungi | oh! though this isn't wildcard dns records, this is wildcard subject (alt)names | 18:39 |
clarkb | ya its sslcert verification | 18:39 |
smcginnis | \o/ | 18:39 |
fungi | clarkb: confirmed, apparently you can't wildcard multiple levels of subdomains with a single subject (alt)name | 18:43 |
clarkb | mnaser: ^ I think that may be something you'll want to fix | 18:45 |
fungi | "Names may contain the wildcard character * which is considered to match any single domain name component or component fragment. E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com." https://www.ietf.org/rfc/rfc2818.txt §3.1¶4 | 18:45 |
fungi | also finding a number of kb articles from certificate authorities and questions at places like serverfault agreeing this is the case | 18:46 |
fungi | apparently some browsers did at one point treat the wildcard as matching any subsequent levels, but most (all?) have ceased doing so as it was a blatant standards violation | 18:47 |
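A small sketch of the single-label wildcard rule being quoted, roughly following RFC 2818 §3.1 (an illustration, not the exact matching code the SSL libraries use):

```python
import fnmatch


def cert_name_matches(pattern: str, hostname: str) -> bool:
    """Match a certificate (alt)name against a hostname, where '*' only
    covers a single domain label, per RFC 2818 section 3.1."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans label boundaries, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    return all(fnmatch.fnmatchcase(h, p) for p, h in zip(p_labels, h_labels))


print(cert_name_matches("*.vexxhost.net", "compute.vexxhost.net"))              # True
print(cert_name_matches("*.vexxhost.net", "compute.public.mtl1.vexxhost.net"))  # False
```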
clarkb | I've approved smcginnis' gerritbot config change | 18:47 |
clarkb | we should see gerritbot reconnect when that gets applied | 18:48 |
fungi | this will be a good test | 18:48 |
*** JayF has joined #opendev | 18:55 | |
openstackgerrit | Merged openstack/project-config master: Gerritbot: only comment on stable:follows-policy repos https://review.opendev.org/744947 | 18:59 |
*** openstackgerrit has quit IRC | 19:02 | |
mnaser | clarkb: it should be ok again now | 19:05 |
*** hashar has joined #opendev | 19:10 | |
diablo_rojo | In thinking about the ptg. Its probably good to 'de-openstack' the irc channel. Any qualms with my making a new one just called '#ptg'? | 19:53 |
clarkb | diablo_rojo: we get some management simplification by namespacing on freenode | 19:55 |
clarkb | basically freenode knows who to go to for all #openstack- prefixed channels | 19:55 |
clarkb | not a reason to avoid #ptg but something to keep in mind | 19:55 |
diablo_rojo | Makes sense. We have the #openstack-ptg channel already obviously, but I figured it might be more inclusive to other projects to make one without the prefix | 19:57 |
*** tosky has joined #opendev | 20:10 | |
mnaser | clarkb: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 -- is it possible to maybe rekick nodepool as it may be using a cached service catalog? | 20:19 |
fungi | diablo_rojo: if osf is going to have a bunch of those sorts of channels (this already came up wrt the #openstack-diversity channel for example) maybe we want an #osf prefix or something | 20:21 |
mnaser | cc infra-root ^ | 20:28 |
fungi | looking | 20:29 |
corvus | fungi: i'm around if you need help | 20:32 |
fungi | mnaser: it's the ssl cert problem clarkb noted earlier | 20:33 |
mnaser | fungi, corvus: the endpoint has changed and i think nodepool has the value cached | 20:33 |
fungi | the cert you're serving is not valid for compute-ca-ymq-1.vexxhost.net | 20:33 |
fungi | oh, i get it | 20:33 |
mnaser | right, but the url in the service catalog is compute.public.mtl1.vexxhost.net | 20:33 |
mnaser | :) | 20:34 |
fungi | yeah, the launcher will need a restart for that | 20:34 |
fungi | just a sec | 20:34 |
fungi | #status manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1) | 20:36 |
openstackstatus | fungi: unknown command | 20:36 |
fungi | d'oh! | 20:36 |
fungi | #status log manually restarted nodepool-launcher container on nl03 to pick up changed catalog entries in vexxhost ca-ymq-1 (aka mtl1) | 20:36 |
openstackstatus | fungi: finished logging | 20:36 |
fungi | there we go | 20:36 |
fungi | mnaser: thanks, that seems to be spewing a lot fewer errors in its logs now | 20:37 |
*** openstackgerrit has joined #opendev | 21:05 | |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: update gentoo flags and package maps in preperation for arm64 support https://review.opendev.org/746000 | 21:05 |
ianw | fungi/clarkb: thanks for looking at gitea; do we think the reboot has done it? | 21:55 |
clarkb | ianw: I think it sorted itself out then reboots were mostly to ensure there wasn't any bad fallout from the OOMs | 21:57 |
clarkb | ianw: it looked like that proxy may have contributed to the problem, possibly because it had a bunch of things behind it all hitting a single backend due to the proxy having a single IP | 21:57 |
clarkb | then when one server was sad the haproxy lb took it out of the rotation pointing that proxy at a new backend and rinse and repeat | 21:58 |
ianw | ahh, yes that sounds likely | 21:58 |
ianw | infra-root: if i could get some eyes on creating a pyca/infra project @ https://review.opendev.org/#/c/746014/ that would help me continue fiddling with manylinux wheel generation | 22:02 |
ianw | my hopes that we'd have sort of drop-in manylinux support are probably dashed ... for example cryptography does a custom builder image on top of the upstream builder images that pulls in openssl and builds it fresh | 22:03 |
ianw | which is fine, but not generic | 22:04 |
ianw | one thing is though, if i build custom manylinux2014_aarch64 images speculatively using buildx, i unfortunately can't run them on arm64 speculatively | 22:06 |
ianw | because can't mix architectures | 22:06 |
clarkb | fwiw with buildx things that do IO seem fine but not cpu (like compiling) | 22:07 |
clarkb | expect compiling openssl to take significant time. Though we can certainly test it to find out how much | 22:07 |
clarkb | ianw: I think you can do speculative testing without buildx though | 22:07 |
clarkb | then run both the image build and the use of the image in the linaro cloud | 22:08 |
ianw | hrm, yes i wasn't sure of the state of native image builds | 22:09 |
ianw | native container builds i should probably say | 22:09 |
clarkb | I think they work fine, though the manifest info might assume x86 by default? | 22:09 |
ianw | maybe ... https://review.opendev.org/#/c/746011/ is sort of the framework, but i don't want to put it in pyca/project-config because that's a trusted repo | 22:12 |
diablo_rojo | fungi, yeah was thinking about an osf prefix too. If that's easier to manage, I am totally cool with that. | 22:15 |
diablo_rojo | (waaaaaaay late in my response, got sucked into other things) | 22:15 |
ianw | when you see how the sausage is made with all this ... it does make you wonder a little bit if you still like sausages | 22:15 |
corvus | yeah, i think we avoided native builds in the general case because we don't want the zuul/nodepool gate to stop if we lose the linaro cloud; that's probably less worrisome for an arm-only situation | 22:17 |
ianw | i can try it and see what happens :) | 22:20 |
openstackgerrit | Merged openstack/project-config master: Create pyca/infra https://review.opendev.org/746014 | 22:29 |
corvus | ianw: ^ deploy playbook is done | 22:51 |
ianw | corvus: thanks, already in testing :) https://zuul.opendev.org/t/pyca/status | 22:52 |
ianw | from what i can tell of upstream, ISTM that the wheels get generated and published as an artifact by github actions | 22:53 |
ianw | i can not see that they are uploaded via that mechanism though, although i may have missed it | 22:53 |
ianw | (uploaded to pypi) | 22:54 |
ianw | https://github.com/pyca/cryptography/actions/runs/176310608/workflow if interested | 22:56 |
*** tkajinam has joined #opendev | 22:58 | |
*** gema has quit IRC | 23:05 | |
*** mlavalle has quit IRC | 23:08 | |
*** tosky has quit IRC | 23:09 | |
*** sgw1 has quit IRC | 23:15 | |
ianw | heh, the manylinux container build decided to use http://mirror.facebook.net/ ... who knew | 23:18 |
fungi | ew | 23:22 |
ianw | building openssl ... https://zuul.opendev.org/t/pyca/stream/d35730a2d4fa4121985b01692cc45c9d?logfile=console.log | 23:34 |
ianw | slowly | 23:34 |
ianw | corvus: so the theory is if i run this on an arm64 node, it might "just work"? i guess the intermediate registry also needs to run there? | 23:36 |
*** DSpider has quit IRC | 23:38 | |
ianw | ... ok, to answer my own question -- the intermediate registry seems happy to run in ovh | 23:52 |
ianw | however, ensure-docker is failing on arm64 | 23:53 |
ianw | no package "docker-ce" | 23:53 |
*** hashar has quit IRC | 23:53 | |
*** hashar has joined #opendev | 23:55 | |
*** diablo_rojo has quit IRC | 23:59 |