19:01:15 <clarkb> #startmeeting infra
19:01:16 <openstack> Meeting started Tue Nov 24 19:01:15 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 <openstack> The meeting name has been set to 'infra'
19:01:40 <clarkb> I didn't send out an email agenda because I figured we'd use this time to write down some Gerrit situation updates
19:01:48 <clarkb> that way people don't have to dig through as much scrollback
19:01:56 <ianw> o/
19:02:18 <fungi> situation normal: all functioning usually?
19:02:22 <clarkb> #link https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes continues to capture a good chunk of info particularly from users
19:02:40 <clarkb> if you're needing to catch up on anything starting there seems like a good place
19:03:13 <fungi> i've been doing my best to direct users there or to the maintenance completion announcement which links to it
19:03:43 <fungi> we should definitely make sure to add links to fixes or upstream bug reports, and cross out stuff which gets solved
19:03:46 <clarkb> There were 6 major items I managed to scrape out of recent events though: the Gerritbot situation, high system load this morning, the /x/* bug, the account index lock bug, the project watches bug, and the openstack releases situation
19:04:08 <clarkb> Why don't we start with ^ that list then continue with anything else?
19:04:20 <clarkb> #topic Gerritbot connection issues to gerrit 3.2
19:04:33 <clarkb> ianw: fungi: I've not managed to keep up on this topic, can you fill us in?
19:05:11 <ianw> in short, i think it is not retrying correctly when the connection drops
19:05:42 <fungi> and somehow the behavior there changed coincident with the upgrade (either because of the upgrade or for some other reason we haven't identified)
19:05:45 <ianw> ... sorry, have changes but just have to re-log in
19:05:59 <fungi> i approved the last of them i think
19:06:20 <fungi> including the one to switch our gerritbot container to use master branch of gerritlib instead of consuming releases
19:06:31 <ianw> ok, https://review.opendev.org/c/opendev/gerritlib/+/763892 should fix the retry loop
19:06:47 <ianw> a version of that is actually running in a screen on eavesdrop now, manually edited
19:07:10 <fungi> oh, right, i had read through that one and forgot to leave a review. approved now
19:07:38 <ianw> https://review.opendev.org/c/opendev/gerritbot/+/763927 as mentioned builds the gerritbot image using master of gerritlib, it looks like it failed
19:07:42 <fungi> it was partially duplicative of the change i had pushed up before, which has since been abandoned
19:07:48 <ianw> gerritbot-upload-opendev-image https://zuul.opendev.org/t/openstack/build/28ddd61a8f024791880517f4b2be97de : POST_FAILURE in 5m 34s
19:07:50 <ianw> will debug
19:08:22 <ianw> i can babysit that and keep debugging if there's more issues
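For background on the image change referenced above, here is a minimal sketch of one way a container build can consume gerritlib from its master branch instead of a PyPI release; the commands are illustrative only and the actual opendev image build may wire this up differently:

    # illustrative only: install gerritlib straight from its git source,
    # then install gerritbot on top of it
    pip install 'git+https://opendev.org/opendev/gerritlib@master#egg=gerritlib'
    pip install gerritbot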
19:08:29 <clarkb> cool so just need some changes to land? anything else on this topic?
19:08:48 <ianw> the other thing i thought we should do is use the python3.8 base container too, just to keep up to date
19:09:04 <clarkb> if it works that seems like a reasonable change
19:09:09 <fungi> sounds good to me
19:09:12 <ianw> but yeah, i have that somewhat under control and will keep on it
19:09:19 <clarkb> thanks!
19:09:42 <clarkb> #topic The gerrit /x/* namespace conflict
19:10:00 <clarkb> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13721
19:10:21 <clarkb> The good news is upstream has responded to the bug; they think only about three plugins actually conflict, only two of those are open source, and I don't think we run either
19:10:25 <clarkb> so our fix should be really safe for now
19:10:47 <clarkb> The less good news is they suggested changing the path and updating those plugins instead of fixing the bigger issue, which means we have to check for conflicts when adding new namespaces or upgrading
19:11:01 <clarkb> fungi: ^ you responded to them asking about the bigger issue right?
19:11:09 <fungi> yep
19:11:22 <fungi> it's down there at the bottom
19:11:31 <clarkb> ok, I think we can likely sit on this one while we sort it out with upstream, particularly now that we have more confirmation that the conflicts are minimal
19:12:54 <clarkb> sounds like that may be it on this one
19:13:03 <clarkb> #topic Excessive change emails for some users
19:13:12 <clarkb> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13733
19:13:33 <clarkb> We tracked this down to overly greedy/buggy project watch rules. The bug has details on how to work around it (the user can update their settings)
19:14:23 <clarkb> I think it is a bug because the rules really mean "send me change notifications for things I own or have reviewed" but you end up getting all the changes
19:14:36 <clarkb> I was able to reproduce with my own user and then confirm the fix worked for me too
19:14:45 <clarkb> Just be aware of that if people complain about spam
19:14:59 <clarkb> #topic Loss of account index filesystem lock
19:15:04 <fungi> easy workaround is to remove your watch on all-projects
19:15:09 <clarkb> yup
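For reference, the workaround fungi describes can be done per user in the Gerrit web UI (Settings, then Notifications), or via the standard accounts REST API; a hedged example with curl, where the credentials are placeholders and the entry to delete should match what the list call returns:

    # list your current project watches
    curl -u USERNAME:HTTP_PASSWORD https://review.opendev.org/a/accounts/self/watched.projects
    # drop the All-Projects watch
    curl -u USERNAME:HTTP_PASSWORD -X POST \
      -H 'Content-Type: application/json' \
      -d '[{"project": "All-Projects"}]' \
      https://review.opendev.org/a/accounts/self/watched.projects:delete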
19:15:13 <clarkb> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13726
19:15:52 <clarkb> Yesterday a user mentioned they got a 500 error when trying to reset their http password
19:16:15 <clarkb> examining the gerrit error_log we found tracebacks showing the gerrit server had lost its linux fs lock on the accounts index lock file
19:16:25 <clarkb> sudo lslocks on review confirmed it had no lock for the file in question
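(lslocks comes from util-linux and lists the file locks currently held on the system; a check along these lines, with the grep pattern adjusted to the site's index path, is the sort of thing that confirmed the missing lock:)

    sudo lslocks
    sudo lslocks | grep -i accounts    # exact lock file path depends on the gerrit site layout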
19:16:51 <clarkb> After a bit of debugging we decided the best thing to do was to restart gerrit, which allowed it to reclaim the lock, and things have appeared happy since
19:16:58 <clarkb> which leads us to the next topic
19:17:11 <clarkb> #topic High Gerrit server load with low cpu utilization and no iowait
19:17:39 <clarkb> Today users were complaining about slowness in gerrit. cacti and melody confirmed it was a busy server based on load but other resources were fine (memory, cpu, io, etc)
19:18:02 <clarkb> digging into the melody thread listing we noticed two things: first, we only had one email send thread and our email sending queue had started to back up
19:18:22 <clarkb> second, many zuul ssh queries (typical zuul requests to get info about changes) were taking significant time
19:18:41 <clarkb> We updated gerrit to use 4 threads to send email instead of 1 in case this was the issue. After restarting gerrit the problem came back
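For context, the email sender thread count is a gerrit.config setting; the change described was presumably along these lines (sendemail.threadPoolSize defaults to 1):

    [sendemail]
        threadPoolSize = 4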
19:19:18 <clarkb> Looking closer at the other issue, the stacktraces from melody showed that many of the ssh queries by zuul were looking up account details via jgit, bypassing both the lucene index and the cache
19:19:49 <clarkb> From this we theorized that perhaps the account index lock failure meant our index was incomplete and that was forcing gerrit to go straight to the source, which is slow
19:20:25 <clarkb> in particular it almost looked like the slowness had to do with locking, as if each ssh query was waiting for the jgit backend lock so they wouldn't be reading out of sync (but I haven't confirmed this, it is a hunch based on low cpu and low io but high load)
19:20:47 <clarkb> fungi triggered an online reindex of accounts with --force and since that completed things have been happier
19:21:11 <clarkb> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=26&rra_id=all shows the fall off in load
19:21:25 <fungi> but it's hard to know for sure if that "fixed" it or was merely coincident timing
19:21:35 <clarkb> yup, though definitely correlation seems strong
19:21:35 <fungi> so we need to keep a close eye on this
19:22:01 <clarkb> if we have to restart again due to index lock failures we should probably consider reindexing as well
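The online reindex mentioned above maps to Gerrit's ssh index command; roughly, with an admin account:

    ssh -p 29418 admin@review.opendev.org gerrit index start accounts --force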
19:22:21 <clarkb> anything else to add on this one?
19:23:10 <fungi> i don't think so
19:23:19 <clarkb> oh wait, I did have one other thing
19:23:22 <ianw> ... sounds good
19:23:39 <clarkb> Our zuul didn't seem to have these issues. It uses the http api for querying change info
19:23:58 <clarkb> to get zuul to do that you have to set the gerrit http password setting
19:24:34 <clarkb> It is possible that the http side of things is going to be better performing for that sort of stuff (maybe it doesn't fall back to git as aggressively) and we should consider encouraging ci operators to switch over
19:24:46 <clarkb> sean-k-mooney did it today at our request to test things out and it seems to have gone well
19:25:21 <clarkb> #topic OpenStack Release tooling changes to accommodate new Gerrit
19:25:32 <clarkb> fungi: I have also not kept up to date on this one
19:25:34 <fungi> if they're running relatively recent zuul v3 releases (like from ~this year) then they should be able to just add an http password and make sure they're set for basic auth
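For CI operators wanting to make that switch, the relevant bits of the zuul.conf gerrit connection look roughly like this; hostnames, usernames, and paths here are placeholders:

    [connection gerrit]
    driver=gerrit
    server=review.opendev.org
    user=my-third-party-ci
    sshkey=/var/lib/zuul/ssh/id_rsa
    # setting an http password makes zuul use the REST API for change queries;
    # ssh access is still needed for the event stream
    password=MY_GERRIT_HTTP_PASSWORD
    auth_type=basic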
19:25:39 <ianw> is there any way we can ... strongly suggest that via disabling something?
19:25:46 <clarkb> #undo
19:25:47 <openstack> Removing item from minutes: #topic OpenStack Release tooling changes to accommodate new Gerrit
19:26:00 <clarkb> ianw: yes we could disable their accounts and force them to talk to us
19:26:02 <fungi> ianw: we can individually disable their accounts
19:26:06 <clarkb> maybe do that as a last resort after emailing them first
19:26:32 <fungi> thing is, zuul still needs ssh access for the event stream (and i think it uses that for git fetches as well)
19:26:43 <ianw> ok, yeah.  if over the next few days things go crazy and everyone is off doing other things, it might be an option
19:26:48 <clarkb> fungi: yes, but neither of those seem to be a load burden based on show-queue info
19:27:09 <fungi> yep, but it does mean that we can't just turn off ssh and leave rest api access for them
19:27:10 <clarkb> I think it's specifically the change info queries because those pull comments and votes, which need account info
19:27:15 <clarkb> fungi: ah yup
19:27:38 <clarkb> fwiw load is spiking right now and it's ci ssh queries; if you do a show-queue you'll see it
19:27:53 <clarkb> (so maybe we haven't completely addressed this)
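(The show-queue mentioned here is Gerrit's ssh show-queue command; something like the following, run as an administrator, lists the pending tasks per account:)

    ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue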
19:28:46 <fungi> some 20+ ci accounts have ssh tasks in the queue but it's like one per account
19:28:52 <clarkb> https://review.opendev.org/c/openstack/magnum/+/763997/ is what they are fetching (for ps2 I think then they will do 3 and 4)
19:29:07 <fungi> looks like mostly cinder third-party ci
19:29:24 <fungi> maybe this happens whenever someone pushes a stack of cinder changes
19:29:43 <clarkb> but if you look at our zuul it already has 763997 in the dashboard and there and running jobs
19:29:51 <clarkb> this is why I'm fairly confident the http lookups are better
19:30:21 <clarkb> maybe when corvus gets back from vacation he can look at this from the zuul perspective and see if we need to file bugs upstream or if we can make the ssh stuff better or something
19:30:42 <clarkb> ok lets talk release stuff now
19:30:48 <clarkb> #topic OpenStack Release tooling changes to accommodate new Gerrit
19:31:30 <fungi> this is the part where i say there were problems, all available fixes have been merged and as of a moment ago we're testing it again to see what else breaks
19:31:45 <clarkb> cool so nothing currently outstanding on this subject
19:32:01 <fungi> this is more generally a problem for jobs using existing methods to push tags or propose new changes in gerrit
19:32:26 <fungi> one factor in this is that our default nodeset is still ubuntu-bionic which carries a too-old git-review version
19:32:51 <fungi> specifying a newer ubuntu-focal nodeset gets us new enough git-review package to interact with our gerrit
19:32:52 <frickler> an update for git-review is in progress there
19:33:11 <frickler> https://bugs.launchpad.net/ubuntu/+source/git-review/+bug/1905282
19:33:14 <openstack> Launchpad bug 1905282 in git-review (Ubuntu Bionic) "[SRU] git-review>=1.27 for updated opendev gerrit" [High,Triaged]
19:33:45 <fungi> another item is that new gerrit added and reordered its ssh host keys, so the ssh-rsa key we were pre-adding into known_hosts was not the first key ssh saw
19:34:14 <fungi> this got addressed by adding all gerrit's current host keys into known_hosts
19:34:49 <clarkb> fungi: was it causing failures or just excessive logging?
19:34:56 <fungi> another problem is that the launchpad creds role in zuul-jobs relied on the python 2.7 python-launchpadlib package which was dropped in focal
19:35:06 <fungi> clarkb: the host key problem was causing failures
19:35:35 <fungi> new ssh is compatible with the first host key gerrit serves, but it's not in known_hosts, so ssh just errors out with an unrecognized host key
19:35:59 <clarkb> huh I thought it did more negotiating than that
19:36:26 <fungi> basically whichever host key the sshd presents first that your openssh client supports is the one that needs to be recognized
19:36:33 <clarkb> got it
19:36:49 <fungi> so there is negotiation between what key types are present and what ones the client supports
19:37:07 <fungi> but if it reaches one which is there and supported by the client that's the one it insists on using
19:37:36 <fungi> so the fact that gerrit 3.2 puts a new key type sooner than the rsa host key means the new type has to be accepted by the client if the client supports it
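A hedged example of how known_hosts can be pre-populated with every host key gerrit currently presents rather than only the ssh-rsa one (ssh-keyscan is standard openssh; the actual zuul role wiring differs):

    ssh-keyscan -p 29418 review.opendev.org >> ~/.ssh/known_hosts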
19:38:21 <fungi> and yeah, the other thing was updating zuul-jobs to install python3-launchpadlib which we're testing now to see if that worked or was hiding yet more problems
19:39:22 <fungi> also zuul-jobs had some transitive test requirements which dropped python2 support recently and needed to be addressed before we could merge the launchpadlib change
19:39:38 <fungi> so it's been sort of involved
19:40:12 <fungi> and i think we can close this topic out because the latest tag-releases i reenqueued has succeeded
19:40:50 <clarkb> that is great news
19:40:54 <clarkb> #topic Open Discussion
19:41:12 <clarkb> That concluded the items I had identified. Are there others to bring up?
19:41:55 <ianw> oh, the java 11 upgrade you posted
19:42:04 <ianw> do you think that's worth pushing on?
19:42:16 <clarkb> ianw: ish? the main reason gerrit recommends java 11 is better GC performance
19:42:29 <clarkb> we don't seem to be having GC issues currently so I don't think it is urgent, but it would be good to do at some point
19:42:45 <ianw> yeah, but we certainly have hit issues with that previously
19:42:52 <fungi> though also they're getting ready to drop support for <11
19:43:12 <fungi> so we'd need to do it anyway to keep upgrading gerrit past some point in the near future
19:43:15 <ianw> i noticed they had a java 15 issue with jgit, and then identified they didn't have 15 CI
19:43:30 <ianw> so, going that far seems like a bad idea
19:43:42 <clarkb> ya we need to drop java 8 before we upgrade to 3.4
19:43:49 <clarkb> I think doing it earlier is fine, just calling it out as not strictly urgent
19:44:02 <clarkb> ianw: yes Gerrit publishes which javas they support
19:44:16 <clarkb> for 3.2 it is java 8 and 11. 3.3 is 8 and 11 and 3.4 will be just 11 I think
19:44:27 <fungi> might be nice to upgrade to 11 while not doing it at the same time as a gerrit upgrade
19:44:31 <clarkb> fungi: ++
19:44:44 <fungi> just so we can rule out problems as being one or the other
19:45:01 <frickler> there's failing dashboards, not sure how urgent we want to fix those
19:45:22 <clarkb> frickler: I noted on the etherpad that I don't see any method for configuring the query terms limit via configuration
19:45:23 <frickler> also dashboards not working when not logged in
19:45:30 <clarkb> frickler: I also suggested bugs be filed for those items
19:45:41 <clarkb> (I was hoping that users hitting the problems would file the bugs as I've been so swamped with other things)
19:45:41 <fungi> yeah, those both seem like good candidates for upstream bugs
19:46:04 <ianw> oh, and also tristanC's plugin
19:46:08 <clarkb> I tried to update the etherpad where I thought filing bugs upstream was appropriate and asked reporters to do that, though I suspect that etherpad is largely write only
19:46:08 <fungi> we've been filing bugs for the problems we're working on, but not generally acting as a bug forwarder for users
19:46:35 <clarkb> re performance things I'm noticing there are a number of tunables at https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#core
19:46:57 <clarkb> it might be worth a thread to the gerrit mailing list or to luca to ask about how we might modify our tunables now that we have real world data
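As a hedged illustration of the kind of gerrit.config tunables in that section of the docs, with placeholder values rather than recommendations:

    [core]
        packedGitLimit = 4g
        packedGitOpenFiles = 4096
    [sshd]
        threads = 24
        batchThreads = 8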
19:55:58 <clarkb> As a heads up I'm going to try and start winding down my week. I have a few more things I want to get done but I'm finding I really need a break and will endeavor to take one
19:56:14 <clarkb> and sounds like everyone else may be done based on lack of new conversation here
19:56:17 <clarkb> thanks everyone!
19:56:21 <clarkb> #endmeeting