Wednesday, 2021-03-24

*** mlavalle has quit IRC00:03
fungiyeah, still fails the same way, just takes longer00:08
fungii'll try to recreate it again00:08
*** tosky has quit IRC00:09
ianwfungi: interesting, i mean we do the same thing in the zuul job @
fungiare we running gerrit in a container there?00:11
fungii'm increasingly tempted to just pin cryptography in jeepyb in order to avoid redesigning integration tests for something we're phasing out00:13
ianwfungi: yep, container00:18
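If jeepyb were to pin cryptography as fungi suggests, it would be a one-line requirements constraint. The version boundary below is an assumption, not taken from the log (cryptography 3.4 was the first release to require a Rust toolchain at build time, a common integration-test breakage in early 2021):

```
# requirements.txt sketch -- hypothetical pin for jeepyb
cryptography<3.4  # assumed boundary: 3.4 introduced the Rust build requirement
```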
openstackgerritMerged openstack/project-config master: grafana/afs : add ubuntu-cloud volume tracking
openstackgerritIan Wienand proposed opendev/system-config master: Remove references to review-dev
openstackgerritIan Wienand proposed opendev/system-config master: Create review-staging group
openstackgerritIan Wienand proposed opendev/system-config master: gerrit: add mariadb_container option
openstackgerritIan Wienand proposed opendev/glean master: Run a glean-early service to mount configdrive
openstackgerritIan Wienand proposed opendev/glean master: Cleanup variable names
*** ysandeep|away is now known as ysandeep00:51
fungiianw: don't know why this didn't dawn on me before:
fungizuul@ubuntu-focal-limestone-regionone-0023652422:~$ netstat -lnt|grep :2941800:56
fungitcp        0      0 *               LISTEN00:56
* fungi slaps forehead00:56
openstackgerritJeremy Stanley proposed opendev/gerritlib master: Run gerritlib-jeepyb-integration on ubuntu-focal
openstackgerritJeremy Stanley proposed opendev/jeepyb master: Stop trying to assign Launchpad bugs
fungithat should hopefully succeed now too, with the depends-on01:25
fungiworth keeping in mind for the inevitable (and overdue) push to switch our default nodeset to ubuntu-focal... "localhost" gets resolved first to ::1 now, at least by some tools which may not try to fall back to connecting to
fungithere's nothing quite like being ground under the wheels of progress01:31
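The focal/localhost pitfall fungi describes: getaddrinfo now returns ::1 ahead of 127.0.0.1, and a client that only tries the first result fails against an IPv4-only listener. A minimal sketch (names illustrative) of the fallback loop such tools were missing:

```python
import socket

def connect_any(host, port, timeout=5.0):
    """Try every address getaddrinfo returns (e.g. ::1 before 127.0.0.1
    on newer distros) instead of failing on the first one."""
    last_err = None
    for family, socktype, proto, _name, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock  # first address that actually accepts wins
        except OSError as exc:
            last_err = exc
            sock.close()
    raise last_err or OSError("no addresses for %r" % host)
```

A service bound only to an IPv4 wildcard (as in the netstat output above) rejects the initial ::1 attempt; the loop then falls through to 127.0.0.1 instead of giving up.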
ianwssl.CertificateError: hostname '' doesn't match either of '', '*', '*', '*',02:03
ianwsomeone else mentioned this too ...
fungiawesome. the simple patch i tried to push first thing this morning when i awoke is finally passing tests shortly before i go to bed. that's more than i can say on a lot of days!02:07
fungiianw: sounds like a fastly endpoint has an incomplete cert or traffic is ending up at an endpoint it wasn't meant to02:08
ianwoops sorry02:14
ianwbut anyway, that shows scattered results all day02:17
ianwfungi: one that would be good before any gerrit restart is , just updates the zuul-summary plugin to handle a few more status flags02:47
fungioh, cool02:49
*** ysandeep is now known as ysandeep|afk02:53
*** hemanth_n has joined #opendev03:15
*** whoami-rajat has joined #opendev03:31
ianwi can't see anything common amongst these errors.  mostly xenial, but some bionic.  across basically all providers04:15
fungiyeah, it's probably a problem with fastly04:27
fungiand may not even be regional04:27
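The CertificateError above (the logger stripped the hostnames) comes from matching the requested hostname against the certificate's subjectAltName entries. A simplified sketch of that wildcard check, with hypothetical names; the stdlib's real implementation follows RFC 6125 more completely:

```python
import fnmatch

def hostname_matches(hostname, san_dns_names):
    """Simplified SAN check: a leftmost '*' label matches exactly one
    label, so '*.example.org' matches 'a.example.org' but not
    'a.b.example.org'. Illustrative only, not the stdlib's full logic."""
    for pattern in san_dns_names:
        if pattern.startswith("*."):
            # label-count guard keeps the wildcard to a single label
            if hostname.count(".") == pattern.count(".") and \
                    fnmatch.fnmatch(hostname.lower(), pattern.lower()):
                return True
        elif hostname.lower() == pattern.lower():
            return True
    return False
```

A CDN endpoint serving a cert whose SAN list omits the requested name fails this check, which is exactly the "hostname doesn't match" error ianw hit.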
*** ykarel has joined #opendev04:32
*** ysandeep|afk is now known as ysandeep04:58
*** gnuoy has quit IRC05:35
*** gnuoy has joined #opendev05:37
*** zbr|rover has quit IRC06:08
*** zbr|rover has joined #opendev06:10
*** marios has joined #opendev06:19
*** ralonsoh has joined #opendev06:38
*** sboyron has joined #opendev06:59
*** lpetrut has joined #opendev07:06
*** moshiur has joined #opendev07:08
*** cgoncalves has quit IRC07:18
*** ysandeep is now known as ysandeep|lunch07:19
*** cgoncalves has joined #opendev07:20
*** cgoncalves has quit IRC07:20
*** cgoncalves has joined #opendev07:21
*** parallax has quit IRC07:29
*** fressi has joined #opendev07:37
*** eolivare has joined #opendev07:42
*** amoralej|off is now known as amoralej08:11
*** andrewbonney has joined #opendev08:12
*** ysandeep|lunch is now known as ysandeep08:15
*** ykarel is now known as ykarel|lunch08:17
*** rpittau|afk is now known as rpittau08:19
*** hashar has joined #opendev08:28
*** ricolin has joined #opendev08:32
*** dtantsur|afk is now known as dtantsur08:32
*** tosky has joined #opendev08:41
*** brinzhang has quit IRC09:07
*** jpena|off is now known as jpena09:09
openstackgerritMerged openstack/diskimage-builder master: update gentoo keywords to support gcc-10
*** ykarel|lunch is now known as ykarel09:16
*** dtantsur has quit IRC09:22
*** dtantsur has joined #opendev09:26
*** dtantsur has quit IRC09:28
*** dtantsur has joined #opendev09:29
*** brinzhang has joined #opendev09:30
*** dtantsur has quit IRC09:36
*** dtantsur has joined #opendev09:38
*** parallax has joined #opendev09:53
*** fressi has quit IRC10:02
*** brinzhang_ has joined #opendev10:03
*** brinzhang has quit IRC10:06
*** fressi has joined #opendev10:09
*** fressi has quit IRC10:39
*** fressi has joined #opendev10:45
*** hashar has quit IRC11:07
*** brinzhang0 has joined #opendev11:08
*** brinzhang_ has quit IRC11:12
*** sshnaidm|off is now known as sshnaidm12:13
*** frigo has joined #opendev12:17
*** eolivare_ has joined #opendev12:21
*** eolivare has quit IRC12:24
*** hemanth_n has quit IRC12:31
*** owalsh has quit IRC12:32
*** jpena is now known as jpena|lunch12:32
*** eolivare_ has quit IRC12:44
openstackgerritDmitry Tantsur proposed opendev/glean master: Fix a typo in a log message
*** ykarel has quit IRC13:04
*** ykarel has joined #opendev13:07
*** ysandeep is now known as ysandeep|away13:09
*** brinzhang0 has quit IRC13:09
*** amoralej is now known as amoralej|lunch13:12
*** ykarel_ has joined #opendev13:13
*** ykarel has quit IRC13:14
*** eolivare_ has joined #opendev13:15
*** fressi has quit IRC13:40
*** fressi has joined #opendev13:45
*** sboyron has quit IRC13:54
*** sboyron has joined #opendev13:54
*** ysandeep|away is now known as ysandeep13:59
fungidisappearing to run some errands, should be back by 15:30 utc at the latest14:00
*** ykarel_ is now known as ykarel14:10
*** amoralej|lunch is now known as amoralej14:12
*** jpena|lunch is now known as jpena14:28
*** owalsh has joined #opendev14:29
*** lpetrut has quit IRC14:39
*** fressi has quit IRC14:55
corvusi feel like the zuul dashboard graphs are missing recent data15:12
*** moshiur has quit IRC15:13
fungicorvus: in grafana? i see data points within the last minute15:16
corvusoh! somehow my time range got set to the future15:17
corvusso it looked like everything stopped a few hours ago15:17
fungiso like you to look to the future and expect grafana to keep up ;)15:18
corvusyeah, i mean, if it can't handle that simple task it's dead to me15:18
fungiif only all software was zuul15:18
mordredspeculative future monitoring15:22
*** frigo has quit IRC15:26
*** stand has joined #opendev15:29
*** mlavalle has joined #opendev15:39
*** sboyron has quit IRC15:39
*** ykarel is now known as ykarel|away16:00
*** ysandeep is now known as ysandeep|dinner16:03
*** sboyron has joined #opendev16:15
*** ykarel|away has quit IRC16:22
*** sboyron has quit IRC16:25
*** sboyron has joined #opendev16:25
*** hamalq has joined #opendev16:32
openstackgerritJeremy Stanley proposed zuul/zuul-jobs master: WIP: Set Gentoo profile in configure-mirrors
openstackgerritJeremy Stanley proposed zuul/zuul-jobs master: Revert "Temporarily stop running Gentoo base role tests"
*** ysandeep|dinner is now known as ysandeep16:37
*** dtantsur is now known as dtantsur|brb16:43
*** Eighth_Doctor has quit IRC16:48
*** mordred has quit IRC16:48
*** mordred has joined #opendev16:51
*** ysandeep is now known as ysandeep|away16:52
*** marios is now known as marios|out17:02
*** Eighth_Doctor has joined #opendev17:04
*** rpittau is now known as rpittau|afk17:24
otherwiseguyDid I miss an announcement about being down? is timing out for me.17:36
*** eolivare_ has quit IRC17:38
otherwiseguyfungi: ^17:39
openstackgerritJeremy Stanley proposed openstack/project-config master: Add an empty project for an OpenStack base ACL
fungiotherwiseguy: oh, sorry, heads down working on stuff. i'll look into it now17:47
fungiotherwiseguy: it's returning for me. might be you're getting persisted to a backend which is overloaded in some pathological way that haproxy isn't taking it out of the pool, i'll start looking at graphs17:48
otherwiseguyfungi: weird. it *just* looks like it came back.17:50
fungiyeah, i'm going through graphs now17:50
otherwiseguybut was gone for like 10 mins. I'll take it. :D17:50
fungicoffee break17:50
otherwiseguy:) I just picked a bad time to blow away my devstack vm :)17:51
fungidue to the way we load-balance by source ip, we sometimes get a bad actor overloading a backend, and then worst case haproxy can take that backend out and migrate the load to another backend, strafing them all offline fairly quickly17:52
otherwiseguyThat's what I get for trading all of my pets for cattle. :p17:52
fungicattle can be pets too ;)17:52
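fungi's description of source-IP balancing and backend ejection corresponds to a handful of haproxy directives. A hypothetical config sketch (server names echo the pool name from the log, addresses are documentation-range, check parameters invented):

```
# haproxy.cfg sketch -- hypothetical values throughout
backend balance_git_https
    mode tcp
    balance source                 # same client IP -> same backend
    option httpchk GET /           # layer-7 health probe
    server gitea05 192.0.2.15:3000 check fall 3 rise 2
    server gitea06 192.0.2.16:3000 check fall 3 rise 2
```

The caveat visible later in the log: a backend that is thrashing but still answers probes within the timeout never trips `fall`, so haproxy keeps sending it traffic and operators end up pulling it from the pool by hand.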
otherwiseguyhmm, and now back to not connecting.17:53
* otherwiseguy stares at haproxy17:54
fungium, wow:
fungithat's the haproxy server17:54
fungi"unhappy" is how i'm going to characterize that17:54
fungimy cat's cpu graph looks like that a few times a day, but not typical for haproxy17:55
fungithere's an haproxy process consuming 796.0% of a cpu17:56
otherwiseguyfungi: uh, wow.17:57
otherwiseguywere you letting your cat play with haproxy?17:57
fungimaybe when my back was turned17:58
funginothing anomalous on the traffic graph for it17:58
slittle1we are seeing a lot of git clone/pull failures ....18:00
slittle1e.g. fatal: unable to access '': Encountered end of file18:00
fungislittle1: yep, i'm looking at it right now18:01
slittle1It affects many openstack repos18:01
*** jralbert has joined #opendev18:01
slittle1ok, good18:01
otherwiseguyfungi: well, it definitely wasn't my dog. He is a very lazy St. Bernard. So my money is still on your cat.18:01
jralbertGood morning; our OpenStack site is in the midst of a major version upgrade today, but we find ourselves suddenly unable to git clone from opendev - the connection is made, and SSL is negotiated, but then the opendev answering server waits a long time before returning an empty response. Is this a known issue currently?18:02
otherwiseguyjralbert: yeah, it's being looked into18:03
otherwiseguyjralbert: there's something funky going on with haproxy18:03
*** jpena is now known as jpena|off18:04
fungislittle1: whatever it is, it's overloading the load balancer in front of the entire git farm, so yes, it's affecting every repository we're hosting there18:06
fungii'm working to classify most of the network traffic now to see if we're under some sort of request flood from somewhere maybe18:07
*** jralbert has quit IRC18:07
funginothing really out of the ordinary there. i'm going to take the haproxy container down and bring it back up18:10
*** amoralej is now known as amoralej|off18:10
fungiit's running again, and seems to not be eating tons of system and user cpu cycles now18:10
fungii'll give it a few minutes before i rule out that having only temporarily stopped whatever was going wrong18:11
*** andrewbonney has quit IRC18:12
*** jralbert has joined #opendev18:12
fungi#status log A service anomaly on our Git load balancer has been disrupting access to hosted repositories since 17:20 UTC; we've taken action to restore functionality, but have not yet identified a root cause18:13
openstackstatusfungi: finished logging18:13
otherwiseguyfungi: looks like I'm still having trouble cloning. What does work is the very dirty GIT_SSL_NO_VERIFY=true and setting a hosts entry that points to github's ip :p18:14
fungithat'll work for things which are mirrored to github, yes18:14
jralbertLooks like I got disconnected from the channel so I may have missed some messages in the middle there. Thanks fungi and otherwiseguy for looking into this18:17
otherwiseguyit's all fungi. I'm just the first complainer. ;)18:17
fungiinterestingly i can access gitea by web browser but not via git18:17
jralbertindeed; I expect a caching layer is helping there?18:18
fungiaha, ipv4 connections are working, ipv6 is not18:18
fungiif i git clone -4 it's fine and snappy18:18
fungimay also be a particular backend misbehaving though, -4 vs -6 will likely get directed to different backend servers18:19
fungiat the moment over v6 i'm getting "The requested URL returned error: 500" when cloning18:19
*** marios|out has quit IRC18:20
jralbertIt's interesting that haproxy is chewing up so much cpu at the LB, is that just too many connections in flight I wonder?18:21
fungithe connection count reported via snmp didn't seem out of the ordinary18:21
*** frigo has joined #opendev18:22
fungioh, yeah, memory usage on gitea06 went through the roof, swap too, looks like it fell over but haproxy hasn't taken it out of the pool i guess:
jralbertjust if they're taking a long time to get serviced by the backends it could gum up haproxy - I'd wonder whether haproxy is the problem or a symptom of the problem18:23
fungii'll manually set it to maintenance in haproxy18:23
jralbertooh, paging to death it looks like?18:23
fungiyeah, gitea has a tendency to chew up a lot of memory under the wrong conditions18:24
fungi#status log Temporarily removed gitea06 from the balance_git_https pool18:24
openstackstatusfungi: finished logging18:24
funginow my ipv6 connections are landing on gitea0818:24
openstackgerritRadosław Piliszek proposed opendev/irc-meetings master: Move the Masakari meeting one hour back
fungiand i can clone over v618:25
fungibut my ipv4 connections are hanging now. the problem may be strafing across backends18:25
fungiyep, gitea05 now:
fungii'll take it out too and see what happens18:28
fungi#status log Temporarily removed gitea05 from the balance_git_https pool18:28
openstackstatusfungi: finished logging18:28
fungilooks like there was a similar incident which hit gitea08 around 10-11:00 utc but resolved itself18:30
fungiat the moment i can clone over both ipv4 and ipv6 but that doesn't mean the problem isn't moving to one of the 6 remaining backends18:32
*** dtantsur|brb is now known as dtantsur18:33
*** dwilde has joined #opendev18:34
fungilooks like things got bad for gitea01 starting in the 17:05-17:10 snmp sample18:34
fungier, gitea06 i mean18:34
fungii'll see if i can spot any anomalies in what we were directing to it during the 17z hour before i removed it from the pool18:35
jralbertthings are much happier from the outside world now18:36
fungiunsurprisingly, i suppose, haproxy logged having sent far more requests to gitea06 during the 16z hour when nothing was broken, and the top connecting ip address (one of rdo's autobuilders) is the same both hours18:38
fungiif the problem emerges on yet a third backend, i may be able to narrow things down to one (or at least a very few) specific source ip address which was being balanced to each backend during its period of distress, and can then block it at the lb layer18:45
*** hashar has joined #opendev18:54
*** frigo has quit IRC18:55
*** dwilde has quit IRC19:01
*** dwilde has joined #opendev19:01
fungiat the moment memory utilization looks "normal" on the 6 backends in the pool, and is coming down on the other two i removed19:03
*** ralonsoh has quit IRC19:07
fungiwe had some prior incidents surrounding project creation, which we think was related to bulk project description updates triggering a flurry of auth events which chew up memory in gitea due to their choice of password hashing algorithm, but we disabled description updates and also there was no new project creation happening around the time of this event19:08
*** dtantsur is now known as dtantsur|afk19:08
*** sboyron has quit IRC19:10
*** sshnaidm is now known as sshnaidm|afk19:12
fungiunfortunately, whatever initiated this also went away19:26
fungii'm still not seeing any new memory spikes on the remaining backends19:27
fungii mean, fortunate that the problem seems to have subsided, but disappointing that the amount of data we have is probably not enough to draw useful conclusions from19:34
*** dwilde has quit IRC19:40
*** jralbert has quit IRC19:43
*** dwilde has joined #opendev19:49
*** hashar is now known as hasharAway19:58
otherwiseguythanks for all of the work, fungi!20:11
*** hasharAway is now known as hashar20:24
*** rfayan has joined #opendev20:26
fungiwell, i didn't do much, hoping i can scrape some useful data from the event but not finding much to point at a source of the problem20:37
fungiapparently there was a brief (~20:10-20:20) spike i missed on gitea0420:39
fungiit resolved itself quickly but the server does seem to have stopped responding to several snmp queries in a row there20:39
ianwi'm guessing we didn't see any more fallout of the ssl cert issue i mentioned yesterday?20:48
ianwlogstash isn't showing any hits20:50
*** hamalq has quit IRC20:59
*** hamalq has joined #opendev21:00
ianwthe gitea job has been very unstable for me with21:27
ianw3000): Max retries exceeded with url: /api/v1/repos/x/os-xenapi (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5b081960f0>: Failed to establish a new connection: [Errno 111] Connection refused',))\n2021-03-24T09:02:00.475447 Failed to set desciption for: openstac21:27
ianwtcp connect: connection reset by peer21:33
*** hashar has quit IRC21:33
ianwit seems like it might actually be related to db<->container21:33
ianwthere's a lot of  9:01:46 2581 [Warning] Aborted connection 2581 to db: 'gitea' user: 'gitea' host: '' (Got an error reading communication packets)21:34
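Transient "connection refused" / "reset by peer" failures like these are commonly wrapped in bounded exponential backoff. A generic sketch, not jeepyb's or the gitea setup scripts' actual logic:

```python
import time

def retry(func, attempts=5, base_delay=0.5,
          retriable=(ConnectionError, OSError)):
    """Call func(), retrying transient failures with exponential backoff.
    Raises the last error once the attempt budget is exhausted."""
    for attempt in range(attempts):
        try:
            return func()
        except retriable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Retrying only papers over the symptom, of course; in this case the underlying db<->container connection drops still needed their own diagnosis.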
openstackgerritGomathi Selvi Srinivasan proposed opendev/base-jobs master: This is to test the changes made in
fungiwhen did we try to set descriptions? i thought we disabled that21:36
fungithe idea was to only set them at repo creation for now21:36
fungioh, this is into a test gitea yeah?21:37
fungido we have memory profiling there? you saw the thing clark found about the default password hash algorithm needing 64mb of ram for every auth event, right?21:38
openstackgerritTobias Henkel proposed openstack/diskimage-builder master: Pre-pull docker images
ianwfungi: no i didn't but maybe related :/22:03
ianwwe have the dstat log22:04
fungiworth keeping in mind anyway22:05
ianwyeah, the dstat log has a few huge peaks, something seems to crash, and it comes back22:06
ianwcopied into
fungithat sounds like what our production servers sometimes did during description updates when manage-projects ran22:07
fungiwell, not manage-projects, but the gitea related portion of the job which fires when we have config changes22:07
fungiand, yeah, the theory was we weren't pipelining requests and each api call was a new auth event, so 64mb more ram22:08
fungiand since we did it massively parallel, that quickly consumed all available memory22:08
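The 64 MB-per-auth figure matches a memory-hard password hash (gitea's argon2 option is commonly cited at a 64 MiB memory cost; treating that as the algorithm here is an assumption). The stdlib's scrypt illustrates how KDF parameters translate directly into per-call memory, which concurrent auth events then multiply:

```python
import hashlib

def scrypt_hash(password: bytes, salt: bytes, n=2**14, r=8, p=1):
    """Memory-hard KDF sketch: peak memory is roughly 128 * r * n bytes
    per call (16 MiB with these parameters), so N parallel auth events
    need about N times that much RAM."""
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p, dklen=32)

# rough per-call memory for the parameters above
approx_bytes = 128 * 8 * 2**14  # 16 MiB
```

With massively parallel, non-pipelined API calls each triggering a fresh hash, exhausting a backend's memory takes surprisingly few concurrent requests.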
fungiianw: if we're capturing syslog, it may contain an oom dump too22:09
ianwinterestingly, syslog has the entire dstat log in it, i don't think that's intended22:10
ianwgitea invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=022:10
fungiyeah, that looks hauntingly familiar :(22:11
fungitesting like production indeed22:11
fungispeaking of, gitea07 is now out of memory22:14
openstackgerritIan Wienand proposed opendev/system-config master: dstat-logger: redirect stdout to /dev/null
openstackgerritIan Wienand proposed opendev/system-config master: dstat-logger: redirect stdout to /dev/null
fungilooking at ip address movement, of the ip addresses which made requests forwarded to gitea06 during its memory troubles, 10 of the same addresses were seen getting forwarded to 05 when it spiked after i took 06 out of the pool22:25
fungiof those, two appeared in the later spike on 04 after 05 was removed from the pool22:25
fungithough none of the 05/06 intersection appear in the current event for 0722:26
fungii'm going to take 07 out of the pool and see who moves22:27
fungi#status log Temporarily removed gitea07 from the lb pool due to memory exhaustion22:30
openstackstatusfungi: finished logging22:30
*** whoami-rajat has quit IRC22:30
*** rfayan has quit IRC22:40
fungiand looks like the problem has relocated to 0222:42
fungi05 and 06 have basically recovered so in a few more minutes i'm going to reenable them and disable 02, then see who moves where again22:43
fungiif we can track a continuous event across several backend shifts, hopefully i can narrow down the source22:43
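The cross-backend tracking fungi describes is a set intersection over the client IPs seen during each backend's distress window; survivors across every shift are the candidates. A minimal sketch (addresses hypothetical, documentation-range):

```python
def suspects(windows):
    """Intersect per-window client-IP sets; an address that appears in
    every distressed backend's window is a candidate for the moving load."""
    result = None
    for ips in windows:
        result = set(ips) if result is None else result & set(ips)
    return result or set()
```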
fungithough i'm starting to see some cross-sections with reverse dns like
fungiianw: ^22:45
fungithough in theory we're filtering those in apache, unless they've adjusted their ua strings, right?22:46
fungii'll check that example against apache logs22:47
fungi#status log Re-enabled gitea05 and 06 in pool, removed 02 due to memory exhaustion22:51
openstackstatusfungi: finished logging22:51
fungithough from the graph it looks like 02 had just started to recover before i removed it22:53
fungiokay, memory graphs look reasonable for all the backends, so i've put them all back in the pool now22:55
fungi#status log All gitea backends have been enabled in the haproxy LB once more22:56
openstackstatusfungi: finished logging22:56
fungiso there were 8 addresses hitting 07 during its spike which then moved to 02 when i took 07 out of the pool22:59
fungiof those, five had reverse dns like bytespider-*, petalbot-*, or crawl*
fungithe other three were an ibm ip address, a google ip address, and a cloudbase ip address23:01
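Classifying the movers by reverse DNS, as done by hand above, can be automated. A small sketch using the crawler name prefixes mentioned in the log (the `.example` hostnames in the test are hypothetical):

```python
import socket

CRAWLER_PREFIXES = ("bytespider", "petalbot", "crawl")

def classify_hostname(hostname):
    """'crawler' if the reverse-DNS name starts with a known crawler prefix."""
    return "crawler" if hostname.lower().startswith(CRAWLER_PREFIXES) else "other"

def classify_ip(ip):
    """Resolve the PTR record and classify it; lookup failures -> 'unknown'."""
    try:
        return classify_hostname(socket.gethostbyaddr(ip)[0])
    except OSError:
        return "unknown"
```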
*** slaweq has quit IRC23:02
openstackgerritMerged opendev/system-config master: Remove references to review-dev
funginot only no specific address overlap with the earlier incident across 06/05, but not even much commonality in the reverse dns or whois info23:06
*** stand has quit IRC23:20
openstackgerritGomathi Selvi Srinivasan proposed opendev/base-jobs master: This is to test the changes made in

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at!