Monday, 2023-02-27

opendevreviewTakashi Kajinami proposed openstack/project-config master: Fix wrong project removed  https://review.opendev.org/c/openstack/project-config/+/87532502:23
opendevreviewTakashi Kajinami proposed openstack/project-config master: Retire puppet-tacker - Step 1: End project Gating  https://review.opendev.org/c/openstack/project-config/+/87453902:24
opendevreviewTakashi Kajinami proposed openstack/project-config master: Retire puppet-tacker - Step 5: Remove Project  https://review.opendev.org/c/openstack/project-config/+/87529102:24
opendevreviewMerged openstack/project-config master: Fix wrong project removed  https://review.opendev.org/c/openstack/project-config/+/87532502:50
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-submodule-init role  https://review.opendev.org/c/zuul/zuul-jobs/+/87153904:39
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167904:39
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168004:39
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-submodule-init role  https://review.opendev.org/c/zuul/zuul-jobs/+/87153904:42
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167904:42
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168004:42
opendevreviewMerged zuul/zuul-jobs master: Add conditional for UA registration role  https://review.opendev.org/c/zuul/zuul-jobs/+/87490704:46
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167904:49
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168004:49
*** yadnesh|away is now known as yadnesh04:49
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-submodule-init role  https://review.opendev.org/c/zuul/zuul-jobs/+/87153904:56
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167904:56
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168004:56
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167904:59
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168004:59
opendevreviewMerged zuul/zuul-jobs master: Changes to make fips work on ubuntu  https://review.opendev.org/c/zuul/zuul-jobs/+/87389305:02
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167905:08
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168005:08
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-submodule-init role  https://review.opendev.org/c/zuul/zuul-jobs/+/87153905:21
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167905:21
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168005:21
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-submodule-init role  https://review.opendev.org/c/zuul/zuul-jobs/+/87153905:42
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add ensure-git-lfs  https://review.opendev.org/c/zuul/zuul-jobs/+/87167905:42
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: roles: Add git-lfs-init  https://review.opendev.org/c/zuul/zuul-jobs/+/87168005:42
opendevreviewTakashi Kajinami proposed openstack/project-config master: Retire puppet-tacker - Step 5: Remove Project  https://review.opendev.org/c/openstack/project-config/+/87529107:47
*** jpena|off is now known as jpena08:36
fricklerinfra-root: https://mirror.iad3.inmotion.opendev.org/ seems down, can't check myself until later today09:03
mnasiadkafrickler: just wanted to write about it ;)09:20
*** thuvh1 is now known as thuvh09:22
*** jpena is now known as jpena|off10:22
*** jpena|off is now known as jpena10:29
*** dviroel_ is now known as dviroel11:28
*** bhagyashris is now known as bhagyashris|ruck11:51
opendevreviewJeremy Stanley proposed openstack/project-config master: Temporarily stop booting nodes in inmotion iad3  https://review.opendev.org/c/openstack/project-config/+/87548812:47
fungifrickler: mnasiadka: i'll self-approve that ^12:47
mnasiadkathanks12:48
toskythanks12:57
opendevreviewMerged openstack/project-config master: Temporarily stop booting nodes in inmotion iad3  https://review.opendev.org/c/openstack/project-config/+/87548813:17
opendevreviewJeremy Stanley proposed openstack/project-config master: Revert "Temporarily stop booting nodes in inmotion iad3"  https://review.opendev.org/c/openstack/project-config/+/87549513:43
fungithat's ^ wip while i see if i can spot the issue13:43
fungiopenstack server show indicates it's been in a SHUTOFF state since 2023-02-26T22:19:58Z13:46
fungiOS-EXT-STS:power_state=Shutdown OS-EXT-STS:vm_state=stopped13:46
fungii'll try booting it13:47
fungiseems to be up13:49
fungi#status log Booted mirror.iad3.inmotion via Nova API after it was found in power_state=Shutdown since 22:19:58 UTC yesterday13:50
opendevstatusfungi: finished logging13:50
fungihttp://mirror.iad3.inmotion.opendev.org/debian/pool/main/ returns expected content, so afs cache is fine13:51
fungihttp://mirror.iad3.inmotion.opendev.org/pypi/simple/bindep/ returns expected content, so apache cache is fine13:53
fungiinfra-root: i've un-wipped 875495 if anyone wants to double-check so we can start booting nodes in inmotion-iad3 again13:56
fungialso if someone wants to check logs on the nova controller there (since we have access) that might be good. i need to switch to other tasks for the moment13:58
clarkbcatching up now. fungi  I guess double check the mirror looks good then approve?16:30
clarkbI agree afs and proxy content both look good at the urls you posted above. I'll approve it16:30
clarkbinfra-root I would love feedback on https://review.opendev.org/c/opendev/system-config/+/874340 as I try to sort out bringing gitea09 into the fold of gerrit replication targets16:31
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.18.5  https://review.opendev.org/c/opendev/system-config/+/87553316:38
clarkbwe'll want to coordinate ^ with the bring up of gitea09. But I think it can land either before or after we do the replication and it should be fine. Just want to avoid doing replication while also restarting giteas intentionally16:39
opendevreviewMerged openstack/project-config master: Revert "Temporarily stop booting nodes in inmotion iad3"  https://review.opendev.org/c/openstack/project-config/+/87549516:42
*** jpena is now known as jpena|off17:42
clarkbinfra-root: review02 seems quite busy at the moment18:10
clarkbThe error log doesn't seem to have any new content and we don't seem to be logging to the /var/log/containers location yet?18:17
clarkbI'm having a hard time seeing what is going on18:17
clarkbok that info is in docker logs now for whatever reason. That means ianw's change to put it in /var/log/containers would address this18:19
clarkbinfra-root any ideas?18:24
clarkbI'm tempted to restart gerrit18:24
clarkbthe logs don't really show anything specific that seems problematic...18:24
clarkbbut more eyeballs are appreciated18:25
clarkbfungi: frickler ^ are you around?18:27
TheJulia... I take it gerrit is down?18:33
clarkb"yes" the process is running but spinning its wheels and I've yet to figure out why. I'm working on a java thread dump next but then i think we may have to restart it18:34
artomhttps://www.youtube.com/watch?v=uRGljemfwUE18:35
TheJulia... sadly upgrade parts are not here yet for KSP218:36
clarkbinfra-root it looks like debian doesn't package the jcmd tool in the jre headless package (need the full jdk for that? so we should update our image I guess). But looks like kill -3 should work?18:37
clarkbok I've managed to get a threaddump stored in a file in my homedir18:44
clarkbinfra-root ^ I'll plan to restart the server shortly if I don't hear any objections18:45
clarkbthere are a bunch of threads waiting on a condition(s)18:45
fungiclarkb: yep, sorry stepped away for lunch but looking now18:45
fungiclarkb: i agree, restart seems like the expedient approach and then if it comes back we know it's some ongoing external cause18:46
fungisystem load is around 2018:47
fungimemory is really only half used, half cache, early zero paged out18:47
fungis/early/nearly/18:47
clarkbok I'll restart18:48
fungithanks!18:48
clarkbside note the restart will attempt to replicate to gitea09 too18:48
clarkb(just be aware of that)18:48
funginoted18:48
clarkbok thats done18:52
clarkbI can get the web ui again18:52
fungiyeah, seems to be going again18:52
fungissh api is working for me too18:52
clarkbthe thread dump is inside 20230227-gerrit-spinning-logs in my homedir. Note there are logs on either end of that as the kill -3 emits the thread dump in the normal log output destination18:53
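
The observation above that "a bunch of threads" were waiting on a condition can be checked quickly by tallying thread states in a saved dump. The following is a minimal sketch, assuming the dump uses the usual HotSpot layout in which each thread has a `java.lang.Thread.State:` line; the function name `count_thread_states` is hypothetical, and the dump file mentioned in the channel is not reproduced here.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
# Assumption: the dump follows the standard HotSpot thread dump layout.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)(?: \(([^)]*)\))?")


def count_thread_states(dump_text):
    """Tally threads by state (and detail, when present) in a thread dump."""
    states = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            state, detail = match.groups()
            key = f"{state} ({detail})" if detail else state
            states[key] += 1
    return states
```

Calling `count_thread_states(open(path).read()).most_common()` on a dump file would list the dominant states first, which makes a pile-up of waiting threads easy to spot.
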
fungiare you going to #status log the restart, or should i?18:55
clarkbcan you? I'm still looking at logs18:55
fungion it!18:55
fungi#status log The Gerrit service on review.opendev.org has been restarted to clear an as of yet undiagnosed condition which lead to a prolonged period of unresponsiveness18:57
opendevstatusfungi: finished logging18:57
fungier, s/lead/led/18:57
fungioh well, my grammar errors are my own18:58
corvusi'm still seeing timeouts18:58
fungiso sounds like the situation is coming back already18:58
corvusyeah, gertty logs suggest that's the case18:58
clarkbya the load is still high18:58
fungitrying to see if i can get an ssh connection count out of it18:59
corvusoffline until 18:50, online from 18:51--18:54 then offline again  (approx times)18:59
fungihopefully it's not already too far gone18:59
clarkbI can't run show queue anymore18:59
fungii'll try to digest apache logs looking for something hammering the webui, and also see if there's any useful clues in the gerrit ssh log19:00
clarkbthe error log has some complaints about things in /var/gerrit/data/replication/ref-updates (note we don't seem to bind mount this dir so downing the container will delete these contents)19:03
clarkbideas: We could block access via the firewall and restart again and then see if it is stable without external connectivity.19:07
clarkbI'm going to get another thread dump first I guess19:07
fungithere's a couple of netapp ci accounts with a few auth failures a minute over the 18z hour19:08
funginothing really jumping out at me in the ssh log though19:08
clarkbI've got a second thread dump from the current state in my homedir now19:11
fungireally not a substantial count of requests through apache from any single client. the most active client made 379 requests during the 18z hour19:14
fungiif it's someone hitting the server over https, then it's the nature of the requests not their volume19:14
clarkbI think I may see it19:15
clarkbor a thing anyway.19:15
fungithe second most active client is definitely querying a lot of changes19:17
fungifrom an ipv6 addy in a research network in au19:17
clarkbyes that19:17
clarkbI think we should temporarily ask them not so kindly via the firewall to leave us alone19:18
fungiGET /changes/?q=is:merged&o=CURRENT_REVISION&o=CURRENT_FILES&start=230500 HTTP/1.1"19:18
fungiyes19:18
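
The log digging described above (requests per client, then spotting clients paging deep into merged changes) could be approximated with a short script. This is only a sketch under stated assumptions: it expects Apache access log lines whose first field is the client address and whose request line is the first quoted field, which may not match the actual log format on the server. The function names and the `deep_offset` threshold are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def parse_request(line):
    """Return (client, path, query dict) for one access log line, or None."""
    parts = line.split('"')
    if len(parts) < 3:
        return None  # no quoted request line found
    client = line.split(" ", 1)[0]
    request = parts[1].split()  # e.g. ["GET", "/changes/?q=...", "HTTP/1.1"]
    if len(request) < 2:
        return None
    url = urlsplit(request[1])
    return client, url.path, parse_qs(url.query)


def summarize(lines, deep_offset=10000):
    """Count all requests per client, and deep-paging change queries per client."""
    total = Counter()
    deep = Counter()
    for line in lines:
        parsed = parse_request(line)
        if parsed is None:
            continue
        client, path, query = parsed
        total[client] += 1
        start = query.get("start", ["0"])[0]
        if path.startswith("/changes/") and start.isdigit() and int(start) >= deep_offset:
            deep[client] += 1
    return total, deep
```

As noted in the channel, raw request volume alone did not single out the offender; the second counter is there because it was the nature of the queries (a large `start` offset on `/changes/`) that stood out.
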
TheJulia... Did this happen ?last year? or was it the year before19:33
fungiit happens semi-regularly19:33
fungilast one i remember was a student/researcher at a university in canada19:34
fungii wonder if there's a way to rate-limit expensive rest api queries, but that likely gets deep into layer 7 inspection19:34
clarkbis it still sad?19:37
clarkbarg19:37
fungithere's another ip address19:37
fungiassociated with monash university19:37
clarkbhrm looks like maybe its just really slow right now.19:37
clarkbthough unsure if it will continue to trend worse. Oh if there is another ip then ya maybe we need to be more aggressive in the firewall19:38
fungialso in au i think19:38
fungimy guess is it's a dual-stack machine and the "new" address is just its ipv4 identity because we blocked its v6 from reaching us19:40
clarkbah19:40
dansmithI just tried to comment on a patch and it timed out during the submit.. when i refresh I see a new comment, but without the text19:41
dansmithso I dunno if it's "just slow"19:41
clarkbdansmith: fungi's theory above seems likely and we'll need to block more ips19:41
fungiwell, i've blocked the offending ipv4 and ipv4 address i found in our logs and don't see any new sources for the same query pattern as of the past ~8 minutes, but it may take gerrit time to recover19:42
dansmithah, text just showed up19:42
dansmithreally weird19:42
fungier, ipv4 and ipv619:42
clarkbthere is also likely a lot of demand for the service since its been afk for a bit19:43
clarkbI can show queues via ssh and load my personal dashboard on the webui19:43
fungiyeah, some python script is making parallel queries for large swaths of merged changes from gerrit, likely a student research project based on the networks19:43
clarkbsystem load is still high but it seems to be responsive at least19:43
clarkbside note gitea09 replication seems to be unhappy :(19:44
clarkbIt seems to be trying to replicate things then saying "NOT_ATTEMPTED"19:44
fungii suppose in situations like that, the sockets might stay open long enough waiting for a response that a connlimit overload rule like we use for concurrent ssh sockets could help mitigate it19:44
clarkboh except https://gitea09.opendev.org:3081/openstack/manila seems to have replicated19:45
TheJuliadoes gerrit tell browsers to close the socket or keep it open? (http keepalive)19:46
clarkbI'm going to try and manually replicate bindep to gitea09 and see what it does for that19:46
clarkbya that results in not attempted too. weird, it seems to have worked earlier when I restarted but now is sad?19:48
clarkbon the gitea side I can see it accepting gerrit's pubkey for auth in the ssh logs19:50
clarkboh huh now it replicated. Maybe I'm not reading the logs properly19:51
clarkbya I think that is it I'm just not reading it correctly.19:51
clarkbI'm going to hold off on doing a full replication of everything (something we should consider doing for all giteas due to the outage) until we're reasonably happy that things are mostly resolved19:52
clarkbsystem load is normal now19:52
fungistatus notice The Gerrit service on review.opendev.org experienced severe performance degradation between 17:50 and 19:45 due to excessive API query activity; the addresses involved are now blocked but any changes missing job results from that timeframe should be rechecked19:53
fungiclarkb: ^ lgty?19:53
clarkbfungi: ++19:53
fungi#status notice The Gerrit service on review.opendev.org experienced severe performance degradation between 17:50 and 19:45 due to excessive API query activity; the addresses involved are now blocked but any changes missing job results from that timeframe should be rechecked19:53
opendevstatusfungi: sending notice19:53
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org experienced severe performance degradation between 17:50 and 19:45 due to excessive API query activity; the addresses involved are now blocked but any changes missing job results from that timeframe should be rechecked19:54
clarkbinfra-root: things that I don't want to forget 1) determining if we need to make /var/gerrit/data a persistent docker volume 2) fixing the docker logs to go in /var/log/containers (I think ianw had changes for this already) 3) deciding if we want to install the normal jdk headless package instead of the jre headless package. This will add tools like jcmd which can be used for thread19:55
clarkbdumps (though kill -3 worked fine)19:55
opendevstatusfungi: finished sending notice19:56
clarkbI'm going to finish this review I was doing when I noticed gerrit was sad then find lunch as it seems to be staying stable19:58
funginow i'm noticing i should have specified utc for that time range. i'm a bit scattered after a week of travel/meetings19:59
fungiTheJulia: as to whether the server is recommending http keepalive/pipelining, i'm not sure (i guess that would be up to the apache daemon that proxies those requests to the java-based httpd embedded in gerrit itself). also i have no idea if scripts/libraries like the one involved in this incident pay attention to that signal20:01
fungithose would be things to look into20:01
TheJuliafungi: so servers *can* say "close the connection", and realistically should if a rule as such is put into place20:02
TheJuliaweb browsers *really* don't like ti when they think the socket is still open, inside of the timeout window for the kept alive connection, and find out that it is no longer open20:02
TheJulias/ti/it/20:02
TheJuliaat a job a long time ago, we... injected "Connection: Close" into response headers a bit :)20:03
clarkbin this case I think the remote made a request and gerrit held the connection open while it was responding to that request. That would happen regardless of keepalives20:06
clarkbas long as data was flowing (which I think it was based on the network utilization graphs)20:06
clarkbI'm not sure I understand what the suggestion is as far as changing the pipelining behavior20:06
TheJuliaindeed, but if the webserver/browser thought the socket was open and things suddenly go down the right path of failure, the browser would hang slightly until it tried opening a new connection20:07
fungiwhat i'm thinking is that if apache recommended pipelining of requests, then we could be more aggressive about connlimit overload rules in iptables, because a browser should have very few concurrent sockets that way20:07
TheJulia... years ago firefox was 4 sockets, fwiw20:07
clarkbhttp 1.1 is pipelined by default iirc.20:09
TheJuliaI believe so yes, so if you do want to target long lived sockets, you should explicitly tell the browsers "do not hold the socket open"20:10
fungii mean, limiting clients to 10 concurrent connections would probably have mitigated this, but it's hard to say without recreating the incident and testing different concurrencies20:10
TheJulia++20:10
TheJuliathat also... would have prevented it... except for anyone behind a NAT gateway.20:11
clarkbya NAT is the main issue. Most of red hat is behind a NAT for example20:11
clarkband I'm sure thats common with many corp networks (though HP gave us all individual IPs out of their /8 when I was there)20:11
fungiright, that's why we have the ssh connlimit threshold at 100 simultaneous states20:12
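
The trade-off discussed above (a low per-address connection limit stops a single scraper but also hurts everyone behind a shared NAT address) can be illustrated with a toy model. This is not the actual firewall configuration; the function name and all connection counts below are hypothetical, chosen only to show why a threshold of 10 and a threshold of 100 behave so differently.

```python
def rejected_connections(concurrent_by_source, limit):
    """Return, per source address, how many concurrent connections exceed the limit."""
    return {
        source: count - limit
        for source, count in concurrent_by_source.items()
        if count > limit
    }


# Hypothetical snapshot of concurrent connections per source address.
snapshot = {
    "scraper": 40,        # one script issuing many parallel queries
    "single-browser": 6,  # an ordinary user
    "corporate-nat": 60,  # many users sharing one public address
}

strict = rejected_connections(snapshot, limit=10)
lenient = rejected_connections(snapshot, limit=100)
```

In this made-up snapshot the strict limit cuts off the scraper but also most of the users behind the NAT address, while the lenient limit rejects nobody, scraper included.
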
clarkbok lunch now. Back in a bit to followup on some of those concerns20:12
TheJuliaenjoy!20:13
fungi#status log Restarted the ptgbot service, apparently hung and serving a dead web page at ptg.opendev.org since 2023-02-0720:23
opendevstatusfungi: finished logging20:23
clarkbload levels appear to remain low and stable. I'm going to try and kick off a full gitea09 replication now20:52
opendevreviewClark Boylan proposed opendev/system-config master: Switch gerrit container from jre to jdk packages  https://review.opendev.org/c/opendev/system-config/+/87555320:57
clarkbthats one of my todos in followup20:57
fungii still need to remember to add x/virtualpdu openstack handover plan to the meeting agenda before you send it out21:01
fungiwill try to get to that in a few21:02
clarkbfungi: no rush I probably won't get to the agenda until the end of my day21:04
clarkbfungi: re auto reloading configs for replication this replication data storage system seems to write out replication tasks to disk which would in theory allow the tasks to survive reloads. I've asked about that in the gerrit discord room and will work on an update to my autoreload change to bind mount that directory21:06
clarkbfungi: is it happening again?21:17
ianwi think so21:18
johnsomSo, I can't get gerrit to load. Was my IP in the blocked list? lol21:18
clarkbjohnsom: no I think our unfriendly user(s) have found new IPs21:19
johnsomOh, sigh. Cheering for you then21:19
clarkbfungi: I think we should go ahead and block their entire ip range21:19
fungichecking now21:24
clarkbit should be coming back up now (I just restarted it)21:36
fungikeeping an eye on the apache logs to see if it jumps outside the range we've blocked so far21:37
fungithis stuff seems to keep cropping up every time i break to eat. guess i should avoid having a midnight snack21:39
claygclarkb: thanks for the quick fix 👍21:40
clarkbclayg: its a team effort :)21:40
fungii have sympathy for whatever student this is we're blocking, but we really need them to coordinate with us for bulk data21:41
opendevreviewClark Boylan proposed opendev/system-config master: Bind mount Gerrit's review_site/data dir  https://review.opendev.org/c/opendev/system-config/+/87557021:41
fungii do think it's flattering that just about every time this happens it's queries from a university netblock21:42
clarkbinfra-root ^ if we can get confirmation on the replication plugin persisting things through that fs location then I think this change is a good followup to the autoreload change21:42
clarkbbut also one that should be considered carefully since it has to do with persisting disk contents21:43
clarkbits possible we may only want to do that for the replication plugin and not the other plugins. I don't know yet. Tried to leave notes about that in the commit message too21:43
fungiyeah, we previously tried it and rolled back because it lost replication tasks in the queue, so i'm wary21:43
fungibut so long as we think that will fix it this time on newer gerrit, i agree it would be an added convenience21:44
clarkbyup exactly. I asked about it on gerrit discord so hopefully someone will chime in21:44
clarkbif not soon I'll eventually convert that to an email to their list I guess21:45
clarkbaha looks like when you delete a project with the delete-project plugin it archives it to review_site/data/delete-poject21:45
clarkbso thats less useful to make long lived but not a problem if we do I think21:46
fungipoject typo yours or theirs?21:46
clarkbmine21:47
clarkbI'm just typing these things21:47
fungiah, okay. wasn't sure whether to be amused21:47
clarkbrelated to deleting projects the gitea09 replication is slow and I'm realizing we've got a ton of dead projects...21:47
clarkboh well I GUESS21:47
clarkbThat was weird my keyboard got its shift key stuck21:48
fungisecret alternate capslock21:48
fungiit's proof you're turning into an old man21:49
clarkbha21:49
clarkbI never use caps lock and through this process discovered that caps lock doesn't affect symbols or the number row. Makes sense21:50
fungiit's really a useless key. not a proper bucky-lock21:51
fungiokay, added x/virtualpdu to the agenda. i'm clear on topics21:57
clarkbsome of these replication tasks for gitea09 are going into a retry mode due to "short read of block" TransportException errors in jgit. When I view the repo in gitea09 there is content there so this may be an eventually consistent thing22:03
clarkbOh this might be due to our 900 second timeout?22:04
clarkbI'll have to continue to keep an eye on it I guess22:04
clarkbya I think that is it. So ya should eventually be consistent after a few retries22:04
fungimakes sense. thanks22:05
clarkbfungi: NasserG over in Gerrit land believes the on disk storage of replication tasks should persist things through autoreload config updates22:36
clarkbso I think those two changes together are probably a good end result to aim for. I do think the second one (that modifies volumes and mounts) deserves extra careful review though22:36
fungisounds good, thanks for digging into it!22:56
clarkbI've updated the meeting agenda now with the content I was aware of23:04
clarkbanything else to add to it?23:04
funginothing else i know of23:09
clarkbianw: responded to your question on https://review.opendev.org/c/opendev/system-config/+/874340 tldr is it appears that the followup change may make all these problems go away23:14
ianwok i'm fine to try it.  if we notice things out of sync we have a thread to pull on23:14
clarkbianw: we are down to 2100 ish replication tasks for gitea09 now from like 2400 or so? I don't expect this to be done by the time I need to call it a day. I think it isn't impacting anything else as there are 4 dedicated threads for each replication target. I just want you to be aware of that as a thing gerrit is currently doing23:50
clarkbI'll probably trigger a full replication tomorrow of all targets once this first pass of gitea09 is done to make sure we didn't lose anything in the restarts23:50
ianw++ sounds good23:50
clarkbthat should go much quicker since the bulk of the data will already be there23:50
fungistill no sign of new research23:56
