*** mlavalle has quit IRC | 00:03 | |
fungi | yeah, still fails the same way, just takes longer | 00:08 |
fungi | i'll try to recreate it again | 00:08 |
*** tosky has quit IRC | 00:09 | |
ianw | fungi: interesting, i mean we do the same thing in the zuul job @ https://opendev.org/opendev/system-config/src/branch/master/playbooks/test-review.yaml#L71 | 00:09 |
fungi | are we running gerrit in a container there? | 00:11 |
fungi | i'm increasingly tempted to just pin cryptography in jeepyb in order to avoid redesigning integration tests for something we're phasing out | 00:13 |
ianw | fungi: yep, container | 00:18 |
openstackgerrit | Merged openstack/project-config master: grafana/afs : add ubuntu-cloud volume tracking https://review.opendev.org/c/openstack/project-config/+/782620 | 00:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove references to review-dev https://review.opendev.org/c/opendev/system-config/+/780691 | 00:42 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Create review-staging group https://review.opendev.org/c/opendev/system-config/+/780698 | 00:42 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 00:42 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive https://review.opendev.org/c/opendev/glean/+/782017 | 00:47 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Cleanup glean.sh variable names https://review.opendev.org/c/opendev/glean/+/782355 | 00:47 |
*** ysandeep|away is now known as ysandeep | 00:51 | |
fungi | ianw: don't know why this didn't dawn on me before: http://paste.openstack.org/show/803842/ | 00:55 |
fungi | zuul@ubuntu-focal-limestone-regionone-0023652422:~$ netstat -lnt|grep :29418 | 00:56 |
fungi | tcp 0 0 0.0.0.0:29418 0.0.0.0:* LISTEN | 00:56 |
* fungi slaps forehead | 00:56 | |
openstackgerrit | Jeremy Stanley proposed opendev/gerritlib master: Run gerritlib-jeepyb-integration on ubuntu-focal https://review.opendev.org/c/opendev/gerritlib/+/782603 | 01:10 |
fungi | finally! | 01:24 |
openstackgerrit | Jeremy Stanley proposed opendev/jeepyb master: Stop trying to assign Launchpad bugs https://review.opendev.org/c/opendev/jeepyb/+/782538 | 01:25 |
fungi | that should hopefully succeed now too, with the depends-on | 01:25 |
fungi | worth keeping in mind for the inevitable (and overdue) push to switch our default nodeset to ubuntu-focal... "localhost" gets resolved first to ::1 now, at least by some tools which may not try to fall back to connecting to 127.0.0.1 | 01:28 |
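
For context on the localhost/::1 ordering fungi describes, here is a minimal, illustrative Python sketch (not part of any job above) showing what a node returns for "localhost"; on Ubuntu Focal the IPv6 loopback typically comes back first, which bites clients that never fall back to 127.0.0.1. The port just mirrors the Gerrit example in the netstat paste.

```python
import socket

# On Focal, ::1 is usually listed before 127.0.0.1, so a client that
# only tries the first result connects to ::1 and never reaches a
# service bound to IPv4 only (e.g. Gerrit on 0.0.0.0:29418 above).
for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        "localhost", 29418, proto=socket.IPPROTO_TCP):
    print(socket.AddressFamily(family).name, sockaddr)
```
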
fungi | there's nothing quite like being ground under the wheels of progress | 01:31 |
ianw | ahhh | 02:01 |
ianw | ssl.CertificateError: hostname 'files.pythonhosted.org' doesn't match either of 'r.ssl.fastly.net', '*.catchpoint.com', '*.cnn.io', '*.dollarshaveclub.com', | 02:03 |
ianw | wtf? | 02:03 |
ianw | someone else mentioned this too ... https://forums.aws.amazon.com/thread.jspa?messageID=978418&tstart=0 | 02:04 |
fungi | awesome. the simple patch i tried to push first thing this morning when i awoke is finally passing tests shortly before i go to bed. that's more than i can say on a lot of days! | 02:07 |
fungi | ianw: sounds like a fastly endpoint has an incomplete cert or traffic is ending up at an endpoint it wasn't meant to | 02:08 |
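
One way to confirm the "some fastly edge is serving the wrong cert" theory is to handshake against every address the name resolves to and see which edges present a mismatching certificate. A hedged sketch, assuming nothing beyond the standard library:

```python
import socket
import ssl

HOST = "files.pythonhosted.org"

def handshake(ip, hostname=HOST, port=443):
    """Verified TLS handshake against one CDN edge; report whether the
    certificate it presents actually covers `hostname`."""
    ctx = ssl.create_default_context()      # chain + hostname checks enabled
    try:
        with socket.create_connection((ip, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname):
                return "ok"
    except ssl.CertificateError as exc:     # hostname mismatch, as in the traceback above
        return f"mismatch: {exc}"
    except (ssl.SSLError, OSError) as exc:
        return f"error: {exc}"

# Intermittent failures often mean only some edges are misconfigured.
for *_, addr in socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP):
    print(addr[0], handshake(addr[0]))
```
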
ianw | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=(message:%5C%22No%20matching%20distribution%20found%20for%5C%22%20OR%20message:%5C%22%5C%5CnNo%20matching%20distribution%20found%20for%5C%22)%20AND%20tags:console%20AND%20voting:1&from=864000s | 02:14 |
ianw | oops sorry | 02:14 |
ianw | but anyway, that shows scattered results all day | 02:17 |
fungi | ugh | 02:20 |
ianw | fungi: one that would be good before any gerrit restart is https://review.opendev.org/c/opendev/system-config/+/777298 , just updates the zuul-summary plugin to handle a few more status flags | 02:47 |
fungi | oh, cool | 02:49 |
*** ysandeep is now known as ysandeep|afk | 02:53 | |
*** hemanth_n has joined #opendev | 03:15 | |
*** whoami-rajat has joined #opendev | 03:31 | |
ianw | i can't see anything common amongst these errors. mostly xenial, but some bionic. across basically all providers | 04:15 |
fungi | yeah, it's probably a problem with fastly | 04:27 |
fungi | and may not even be regional | 04:27 |
*** ykarel has joined #opendev | 04:32 | |
*** ysandeep|afk is now known as ysandeep | 04:58 | |
*** gnuoy has quit IRC | 05:35 | |
*** gnuoy has joined #opendev | 05:37 | |
*** zbr|rover has quit IRC | 06:08 | |
*** zbr|rover has joined #opendev | 06:10 | |
*** marios has joined #opendev | 06:19 | |
*** ralonsoh has joined #opendev | 06:38 | |
*** sboyron has joined #opendev | 06:59 | |
*** lpetrut has joined #opendev | 07:06 | |
*** moshiur has joined #opendev | 07:08 | |
*** cgoncalves has quit IRC | 07:18 | |
*** ysandeep is now known as ysandeep|lunch | 07:19 | |
*** cgoncalves has joined #opendev | 07:20 | |
*** cgoncalves has quit IRC | 07:20 | |
*** cgoncalves has joined #opendev | 07:21 | |
*** parallax has quit IRC | 07:29 | |
*** fressi has joined #opendev | 07:37 | |
*** eolivare has joined #opendev | 07:42 | |
*** amoralej|off is now known as amoralej | 08:11 | |
*** andrewbonney has joined #opendev | 08:12 | |
*** ysandeep|lunch is now known as ysandeep | 08:15 | |
*** ykarel is now known as ykarel|lunch | 08:17 | |
*** rpittau|afk is now known as rpittau | 08:19 | |
*** hashar has joined #opendev | 08:28 | |
*** ricolin has joined #opendev | 08:32 | |
*** dtantsur|afk is now known as dtantsur | 08:32 | |
*** tosky has joined #opendev | 08:41 | |
*** brinzhang has quit IRC | 09:07 | |
*** jpena|off is now known as jpena | 09:09 | |
openstackgerrit | Merged openstack/diskimage-builder master: update gentoo keywords to support gcc-10 https://review.opendev.org/c/openstack/diskimage-builder/+/781594 | 09:15 |
*** ykarel|lunch is now known as ykarel | 09:16 | |
*** dtantsur has quit IRC | 09:22 | |
*** dtantsur has joined #opendev | 09:26 | |
*** dtantsur has quit IRC | 09:28 | |
*** dtantsur has joined #opendev | 09:29 | |
*** brinzhang has joined #opendev | 09:30 | |
*** dtantsur has quit IRC | 09:36 | |
*** dtantsur has joined #opendev | 09:38 | |
*** parallax has joined #opendev | 09:53 | |
*** fressi has quit IRC | 10:02 | |
*** brinzhang_ has joined #opendev | 10:03 | |
*** brinzhang has quit IRC | 10:06 | |
*** fressi has joined #opendev | 10:09 | |
*** fressi has quit IRC | 10:39 | |
*** fressi has joined #opendev | 10:45 | |
*** hashar has quit IRC | 11:07 | |
*** brinzhang0 has joined #opendev | 11:08 | |
*** brinzhang_ has quit IRC | 11:12 | |
*** sshnaidm|off is now known as sshnaidm | 12:13 | |
*** frigo has joined #opendev | 12:17 | |
*** eolivare_ has joined #opendev | 12:21 | |
*** eolivare has quit IRC | 12:24 | |
*** hemanth_n has quit IRC | 12:31 | |
*** owalsh has quit IRC | 12:32 | |
*** jpena is now known as jpena|lunch | 12:32 | |
*** eolivare_ has quit IRC | 12:44 | |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: Fix a typo in a log message https://review.opendev.org/c/opendev/glean/+/782711 | 12:57 |
*** ykarel has quit IRC | 13:04 | |
*** ykarel has joined #opendev | 13:07 | |
*** ysandeep is now known as ysandeep|away | 13:09 | |
*** brinzhang0 has quit IRC | 13:09 | |
*** amoralej is now known as amoralej|lunch | 13:12 | |
*** ykarel_ has joined #opendev | 13:13 | |
*** ykarel has quit IRC | 13:14 | |
*** eolivare_ has joined #opendev | 13:15 | |
*** fressi has quit IRC | 13:40 | |
*** fressi has joined #opendev | 13:45 | |
*** sboyron has quit IRC | 13:54 | |
*** sboyron has joined #opendev | 13:54 | |
*** ysandeep|away is now known as ysandeep | 13:59 | |
fungi | disappearing to run some errands, should be back by 15:30 utc at the latest | 14:00 |
*** ykarel_ is now known as ykarel | 14:10 | |
*** amoralej|lunch is now known as amoralej | 14:12 | |
*** jpena|lunch is now known as jpena | 14:28 | |
*** owalsh has joined #opendev | 14:29 | |
*** lpetrut has quit IRC | 14:39 | |
*** fressi has quit IRC | 14:55 | |
corvus | i feel like the zuul dashboard graphs are missing recent data | 15:12 |
*** moshiur has quit IRC | 15:13 | |
fungi | corvus: in grafana? i see data points within the last minute | 15:16 |
corvus | oh! somehow my time range got set to the future | 15:17 |
corvus | so it looked like everything stopped a few hours ago | 15:17 |
fungi | so like you to look to the future and expect grafana to keep up ;) | 15:18 |
corvus | yeah, i mean, if it can't handle that simple task it's dead to me | 15:18 |
fungi | if only all software was zuul | 15:18 |
mordred | speculative future monitoring | 15:22 |
*** frigo has quit IRC | 15:26 | |
*** stand has joined #opendev | 15:29 | |
*** mlavalle has joined #opendev | 15:39 | |
*** sboyron has quit IRC | 15:39 | |
*** ykarel is now known as ykarel|away | 16:00 | |
*** ysandeep is now known as ysandeep|dinner | 16:03 | |
*** sboyron has joined #opendev | 16:15 | |
*** ykarel|away has quit IRC | 16:22 | |
*** sboyron has quit IRC | 16:25 | |
*** sboyron has joined #opendev | 16:25 | |
*** hamalq has joined #opendev | 16:32 | |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: WIP: Set Gentoo profile in configure-mirrors https://review.opendev.org/c/zuul/zuul-jobs/+/782339 | 16:36 |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: Revert "Temporarily stop running Gentoo base role tests" https://review.opendev.org/c/zuul/zuul-jobs/+/771106 | 16:36 |
*** ysandeep|dinner is now known as ysandeep | 16:37 | |
*** dtantsur is now known as dtantsur|brb | 16:43 | |
*** Eighth_Doctor has quit IRC | 16:48 | |
*** mordred has quit IRC | 16:48 | |
*** mordred has joined #opendev | 16:51 | |
*** ysandeep is now known as ysandeep|away | 16:52 | |
*** marios is now known as marios|out | 17:02 | |
*** Eighth_Doctor has joined #opendev | 17:04 | |
*** rpittau is now known as rpittau|afk | 17:24 | |
otherwiseguy | Did I miss an announcement about opendev.org being down? https://opendev.org/openstack/nova.git is timing out for me. | 17:36 |
*** eolivare_ has quit IRC | 17:38 | |
otherwiseguy | fungi: ^ | 17:39 |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Add an empty project for an OpenStack base ACL https://review.opendev.org/c/openstack/project-config/+/782830 | 17:44 |
fungi | otherwiseguy: oh, sorry, heads down working on stuff. i'll look into it now | 17:47 |
fungi | otherwiseguy: it's returning for me. might be you're getting persisted to a backend which is overloaded in some pathological way that haproxy isn't taking it out of the pool, i'll start looking at graphs | 17:48 |
otherwiseguy | fungi: weird. it *just* looks like it came back. | 17:50 |
fungi | yeah, i'm going through graphs now | 17:50 |
otherwiseguy | but was gone for like 10 mins. I'll take it. :D | 17:50 |
fungi | coffee break | 17:50 |
otherwiseguy | :) I just picked a bad time to blow away my devstack vm :) | 17:51 |
fungi | due to the way we load-balance by source ip, we sometimes get a bad actor overloading a backend, and then worst case haproxy can take that backend out and migrate the load to another backend, strafing them all offline fairly quickly | 17:52 |
otherwiseguy | That's what I get for trading all of my pets for cattle. :p | 17:52 |
fungi | cattle can be pets too ;) | 17:52 |
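
A rough sketch of the failure mode fungi describes: source balancing pins each client IP to one backend, so a single heavy client keeps hammering the same server, and when that server leaves the pool the mapping reshuffles onto the survivors. Backend names are illustrative and real haproxy uses its own hash/map-based algorithm; this is only the shape of the behaviour.

```python
import hashlib

def pick_backend(client_ip, usable_backends):
    # Hash the source address and map it onto the currently usable pool.
    digest = hashlib.sha1(client_ip.encode()).digest()
    return usable_backends[int.from_bytes(digest, "big") % len(usable_backends)]

pool = ["gitea01", "gitea02", "gitea05", "gitea06", "gitea07", "gitea08"]
print(pick_backend("203.0.113.50", pool))        # always the same backend...
print(pick_backend("203.0.113.50", pool[:-1]))   # ...until the pool shrinks and clients re-map
```
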
otherwiseguy | hmm, and now back to not connecting. | 17:53 |
* otherwiseguy stares at haproxy | 17:54 | |
fungi | um, wow: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66609&rra_id=all | 17:54 |
fungi | that's the haproxy server | 17:54 |
fungi | "unhappy" is how i'm going to characterize that | 17:54 |
fungi | my cat's cpu graph looks like that a few times a day, but not typical for haproxy | 17:55 |
fungi | there's an haproxy process consuming 796.0 of a cpu | 17:56 |
fungi | 796.0% | 17:56 |
otherwiseguy | fungi: uh, wow. | 17:57 |
otherwiseguy | were you letting your cat play with haproxy? | 17:57 |
fungi | maybe when my back was turned | 17:58 |
fungi | nothing anomalous on the traffic graph for it | 17:58 |
slittle1 | we are seeing a lot if git clone/pull failures .... | 18:00 |
slittle1 | e.g. fatal: unable to access 'https://opendev.org/openstack/python-openstackclient.git/': Encountered end of file | 18:00 |
fungi | slittle1: yep, i'm looking at it right now | 18:01 |
slittle1 | It affects many openstack repos | 18:01 |
*** jralbert has joined #opendev | 18:01 | |
slittle1 | ok, good | 18:01 |
otherwiseguy | fungi: well, it definitely wasn't my dog. He is a very lazy St. Bernard. So my money is still on your cat. | 18:01 |
jralbert | Good morning; our OpenStack site is in the midst of a major version upgrade today, but we find ourselves suddenly unable to git clone from opendev - the connection is made, and SSL is negotiated, but then the opendev answering server waits a long time before returning an empty response. Is this a known issue currently? | 18:02 |
otherwiseguy | jralbert: yeah, it's being looked into | 18:03 |
otherwiseguy | jralbert: there's something funky going on with haproxy | 18:03 |
*** jpena is now known as jpena|off | 18:04 | |
fungi | slittle1: whatever it is, it's overloading the load balancer in front of the entire git farm, so yet it's affecting every repository we're hosting there | 18:06 |
fungi | i'm working to classify most of the network traffic now to see if we're under some sort of request flood from somewhere maybe | 18:07 |
*** jralbert has quit IRC | 18:07 | |
fungi | nothing really out of the ordinary there. i'm going to take the haproxy container down and bring it back up | 18:10 |
*** amoralej is now known as amoralej|off | 18:10 | |
fungi | it's running again, and seems to not be eating tons of system and user cpu cycles now | 18:10 |
fungi | i'll give it a few minutes before i rule out that having only temporarily stopped whatever was going wrong | 18:11 |
*** andrewbonney has quit IRC | 18:12 | |
*** jralbert has joined #opendev | 18:12 | |
fungi | #status log A service anomaly on our Git load balancer has been disrupting access to opendev.org hosted repositories since 17:20 UTC; we've taken action to restore functionality, but have not yet identified a root cause | 18:13 |
openstackstatus | fungi: finished logging | 18:13 |
otherwiseguy | fungi: looks like I'm still having trouble cloning. What does work is the very dirty GIT_SSL_NO_VERIFY=true and setting a hosts entry that points opendev.org to github's ip :p | 18:14 |
fungi | that'll work for things which are mirrored to github, yes | 18:14 |
jralbert | Looks like I got disconnected from the channel so I may have missed some messages in the middle there. Thanks fungi and otherwiseguy for looking into this | 18:17 |
otherwiseguy | it's all fungi. I'm just the first complainer. ;) | 18:17 |
fungi | interestingly i can access gitea by web browser but not via git | 18:17 |
jralbert | indeed; I expect a caching layer is helping there? | 18:18 |
fungi | aha, ipv4 connections are working, ipv6 is not | 18:18 |
fungi | if i git clone -4 it's fine and snappy | 18:18 |
fungi | may also be a particular backend misbehaving though, -4 vs -6 will likely get directed to different backend servers | 18:19 |
fungi | at the moment over v6 i'm getting "The requested URL returned error: 500" when cloning | 18:19 |
*** marios|out has quit IRC | 18:20 | |
jralbert | It's interesting that haproxy is chewing up so much cpu at the LB, is that just too many connections in flight I wonder? | 18:21 |
fungi | the connection count reported via snmp didn't seem out of the ordinary | 18:21 |
*** frigo has joined #opendev | 18:22 | |
fungi | oh, yeah, memory usage on gitea06 went through the roof, swap too, looks like it fell over but haproxy hasn't taken it out of the pool i guess: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66750&rra_id=all | 18:22 |
jralbert | just if they're taking a long time to get serviced by the backends it could gum up haproxy - I'd wonder whether haproxy is the problem or a symptom of the problem | 18:23 |
fungi | i'll manually set it to maintenance in haproxy | 18:23 |
jralbert | ooh, paging to death it looks like? | 18:23 |
fungi | yeah, gitea has a tendency to chew up a lot of memory under the wrong conditions | 18:24 |
fungi | #status log Temporarily removed gitea06 from the balance_git_https pool | 18:24 |
openstackstatus | fungi: finished logging | 18:24 |
fungi | now my ipv6 connections are landing on gitea08 | 18:24 |
openstackgerrit | Radosław Piliszek proposed opendev/irc-meetings master: Move the Masakari meeting one hour back https://review.opendev.org/c/opendev/irc-meetings/+/782838 | 18:25 |
fungi | and i can clone over v6 | 18:25 |
otherwiseguy | fungi++ | 18:25 |
fungi | but my ipv4 connections are hanging now. the problem may be strafing across backends | 18:25 |
fungi | yep, gitea05 now: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66727&rra_id=all | 18:28 |
fungi | i'll take it out too and see what happens | 18:28 |
fungi | #status log Temporarily removed gitea05 from the balance_git_https pool | 18:28 |
openstackstatus | fungi: finished logging | 18:28 |
fungi | looks like there was a similar incident which hit gitea08 around 10-11:00 utc but resolved itself | 18:30 |
fungi | at the moment i can clone over both ipv4 and ipv6 but that doesn't mean the problem isn't moving to one of the 6 remaining backends | 18:32 |
*** dtantsur|brb is now known as dtantsur | 18:33 | |
*** dwilde has joined #opendev | 18:34 | |
fungi | looks like things got bad for gitea01 starting in the 17:05-17:10 snmp sample | 18:34 |
fungi | er, gitea06 i mean | 18:34 |
fungi | i'll see if i can spot any anomalies in what we were directing to it during the 17z hour before i removed it from the pool | 18:35 |
jralbert | things are much happier from the outside world now | 18:36 |
fungi | unsurprisingly, i suppose, haproxy logged having sent far more requests to gitea06 during the 16z hour when nothing was broken, and the top connecting ip address (one of rdo's autobuilders) is the same both hours | 18:38 |
fungi | if the problem emerges on yet a third backend, i may be able to narrow things down to one (or at least a very few) specific source ip address which was being balanced to each backend during its period of distress, and can then block it at the lb layer | 18:45 |
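
A minimal sketch of the cross-referencing fungi outlines, assuming the client IPs haproxy forwarded to each backend during its window of distress have already been extracted (the file names and one-address-per-line format are hypothetical):

```python
def ips(path):
    # One suspect address per line, as pulled from the haproxy log for
    # a single backend's incident window.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

suspects = ips("gitea06-spike.txt") & ips("gitea05-spike.txt") & ips("gitea04-spike.txt")
print(sorted(suspects))   # addresses present in every incident are candidates to block at the LB
```
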
*** hashar has joined #opendev | 18:54 | |
*** frigo has quit IRC | 18:55 | |
*** dwilde has quit IRC | 19:01 | |
*** dwilde has joined #opendev | 19:01 | |
fungi | at the moment memory utilization looks "normal" on the 6 backends in the pool, and is coming down on the other two i removed | 19:03 |
*** ralonsoh has quit IRC | 19:07 | |
fungi | we had some prior incidents surrounding project creation, which we think was related to bulk project description updates triggering a flurry of auth events which chew up memory in gitea due to their choice of password hashing algorithm, but we disabled description updates and also there was no new project creation happening around the time of this event | 19:08 |
*** dtantsur is now known as dtantsur|afk | 19:08 | |
*** sboyron has quit IRC | 19:10 | |
*** sshnaidm is now known as sshnaidm|afk | 19:12 | |
fungi | unfortunately, whatever initiated this also went away | 19:26 |
fungi | i'm still not seeing any new memory spikes on the remaining backends | 19:27 |
fungi | i mean, fortunate that the problem seems to have subsided, but disappointing that the amount of data we have is probably not enough to draw useful conclusions from | 19:34 |
*** dwilde has quit IRC | 19:40 | |
*** jralbert has quit IRC | 19:43 | |
*** dwilde has joined #opendev | 19:49 | |
*** hashar is now known as hasharAway | 19:58 | |
otherwiseguy | thanks for all of the work, fungi! | 20:11 |
*** hasharAway is now known as hashar | 20:24 | |
*** rfayan has joined #opendev | 20:26 | |
fungi | well, i didn't do much, hoping i can scrape some useful data from the event but not finding much to point at a source of the problem | 20:37 |
fungi | apparently there was a brief (~20:10-20:20) spike i missed on gitea04 | 20:39 |
fungi | it resolved itself quickly but the server does seem to have stopped responding to several snmp queries in a row there | 20:39 |
ianw | i'm guessing we didn't see any more fallout of the ssl cert issue i mentioned yesterday? | 20:48 |
ianw | logstash isn't showing any hits | 20:50 |
*** hamalq has quit IRC | 20:59 | |
*** hamalq has joined #opendev | 21:00 | |
ianw | the gitea job has been very unstable for me with | 21:27 |
ianw | 3000): Max retries exceeded with url: /api/v1/repos/x/os-xenapi (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5b081960f0>: Failed to establish a new connection: [Errno 111] Connection refused',))\n2021-03-24T09:02:00.475447 Failed to set desciption for: openstac | 21:27 |
ianw | tcp 127.0.0.1:3306: connect: connection reset by peer | 21:33 |
*** hashar has quit IRC | 21:33 | |
ianw | it seems like it might actually be related to db<->container | 21:33 |
ianw | there's a lot of 9:01:46 2581 [Warning] Aborted connection 2581 to db: 'gitea' user: 'gitea' host: '127.0.0.1' (Got an error reading communication packets) | 21:34 |
openstackgerrit | Gomathi Selvi Srinivasan proposed opendev/base-jobs master: This is to test the changes made in https://review.opendev.org/c/zuul/zuul-jobs/+/773474 https://review.opendev.org/c/opendev/base-jobs/+/782864 | 21:35 |
fungi | when did we try to set descriptions? i thought we disabled that | 21:36 |
fungi | the idea was to only set them at repo creation for now | 21:36 |
fungi | oh, this is into a test gitea yeah? | 21:37 |
fungi | do we have memory profiling there? you saw the thing clark found about the default password hash algorithm needing 64mb of ram for every auth event, right? | 21:38 |
openstackgerrit | Tobias Henkel proposed openstack/diskimage-builder master: Pre-pull docker images https://review.opendev.org/c/openstack/diskimage-builder/+/767706 | 21:51 |
ianw | fungi: no i didn't but maybe related :/ | 22:03 |
ianw | we have the dstat log | 22:04 |
fungi | ianw: https://github.com/go-gitea/gitea/issues/14294 | 22:05 |
fungi | worth keeping in mind anyway | 22:05 |
ianw | yeah, the dstat log has a few huge peaks, something seems to crash, and it comes back | 22:06 |
ianw | https://fea5ca9795003415e574-ecaf6f24685cfabe89c3e9d8fac0dfc4.ssl.cf2.rackcdn.com/780691/2/check/system-config-run-gitea/cfcd32f/gitea99.opendev.org/dstat-csv.log | 22:06 |
ianw | copied into https://lamada.eu/dstat-graph/ | 22:06 |
fungi | that sounds like what our production servers sometimes did during description updates when manage-projects ran | 22:07 |
fungi | well, not manage-projects, but the gitea related portion of the job which fires when we have config changes | 22:07 |
fungi | and, yeah, the theory was we weren't pipelining requests and each api call was a new auth event, so 64mb more ram | 22:08 |
fungi | and since we did it massively parallel, that quickly consumed all available memory | 22:08 |
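
Back-of-the-envelope for that theory (numbers are approximate; see the gitea issue linked above): if each auth event costs on the order of 64 MiB while its password hash runs, a parallel burst of API calls adds up quickly. The parallelism figure below is hypothetical.

```python
per_hash_mib = 64           # rough per-auth cost of the default password hash
concurrent_calls = 128      # hypothetical parallelism of a bulk description update
print(f"~{per_hash_mib * concurrent_calls / 1024:.0f} GiB needed just for password hashing")
```
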
fungi | ianw: if we're capturing syslog, it may contain an oom dump too | 22:09 |
ianw | interestingly, syslog has the entire dstat log in it, i don't think that's intended | 22:10 |
ianw | https://zuul.opendev.org/t/openstack/build/cfcd32fa1b27407ab61f5b44be83f6fc/log/gitea99.opendev.org/syslog.txt#4088 | 22:10 |
ianw | gitea invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 | 22:10 |
fungi | yeah, that looks hauntingly familiar :( | 22:11 |
fungi | testing like production indeed | 22:11 |
fungi | speaking of, gitea07 is now out of memory | 22:14 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: dstat-logger: redirect stdout to /dev/null https://review.opendev.org/c/opendev/system-config/+/782868 | 22:16 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: dstat-logger: redirect stdout to /dev/null https://review.opendev.org/c/opendev/system-config/+/782868 | 22:23 |
fungi | looking at ip address movement, of the ip addresses which made requests forwarded to gitea06 during its memory troubles, 10 of the same addresses were seen getting forwarded to 05 when it spiked after i took 06 out of the pool | 22:25 |
fungi | of those, two appeared in the later spike on 04 after 05 was removed from the pool | 22:25 |
fungi | though none of the 05/06 intersection appear in the current event for 07 | 22:26 |
fungi | i'm going to take 07 out of the pool and see who moves | 22:27 |
fungi | #status log Temporarily removed gitea07 from the lb pool due to memory exhaustion | 22:30 |
openstackstatus | fungi: finished logging | 22:30 |
*** whoami-rajat has quit IRC | 22:30 | |
*** rfayan has quit IRC | 22:40 | |
fungi | and looks like the problem has relocated to 02 | 22:42 |
fungi | 05 and 06 have basically recovered so in a few more minutes i'm going to reenable them and disable 02, then see who moves where again | 22:43 |
fungi | if we can track a continuous event across several backend shifts, hopefully i can narrow down the source | 22:43 |
fungi | though i'm starting to see some cross-sections with reverse dns like crawl33.bl.semrush.com. | 22:45 |
fungi | ianw: ^ | 22:45 |
fungi | though in theory we're filtering those in apache, unless they've adjusted their ua strings, right? | 22:46 |
fungi | i'll check that example against apache logs | 22:47 |
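
A small helper for the classification being done here: reverse-resolve the suspect addresses so obvious crawlers (semrush, petalbot, bytespider, ...) stand out before checking whether the apache UA filters should have caught them. The addresses below are placeholders.

```python
import socket

def rdns(ip):
    # Return the PTR name for an address, or a marker if none exists.
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return "(no PTR record)"

for ip in ["192.0.2.10", "198.51.100.7"]:
    print(ip, rdns(ip))
```
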
fungi | #status log Re-enabled gitea05 and 06 in pool, removed 02 due to memory exhaustion | 22:51 |
openstackstatus | fungi: finished logging | 22:51 |
fungi | though from the graph it looks like 02 had just started to recover before i removed it | 22:53 |
fungi | okay, memory graphs look reasonable for all the backends, so i've put them all back in the pool now | 22:55 |
fungi | #status log All gitea backends have been enabled in the haproxy LB once more | 22:56 |
openstackstatus | fungi: finished logging | 22:56 |
fungi | so there were 8 addresses hitting 07 during its spike which then moved to 02 when i took 07 out of the pool | 22:59 |
fungi | of those, five had reverse dns like bytespider-*.crawl.bytedance.com, petalbot-*.petalsearch.com, or crawl*.bl.semrush.com. | 23:01 |
fungi | the other three were an ibm ip address, a google ip address, and a cloudbase ip address | 23:01 |
*** slaweq has quit IRC | 23:02 | |
openstackgerrit | Merged opendev/system-config master: Remove references to review-dev https://review.opendev.org/c/opendev/system-config/+/780691 | 23:06 |
fungi | not only no specific address overlap with the earlier incident across 06/05, but not even much commonality in the reverse dns or whois info | 23:06 |
*** stand has quit IRC | 23:20 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed opendev/base-jobs master: This is to test the changes made in https://review.opendev.org/c/zuul/zuul-jobs/+/773474 https://review.opendev.org/c/opendev/base-jobs/+/782864 | 23:59 |