*** sboyron__ has joined #opendev-meeting | 07:58 | |
*** sboyron__ is now known as sboyron | 08:12 | |
*** hashar has joined #opendev-meeting | 12:00 | |
*** gouthamr_ has quit IRC | 14:36 | |
*** hashar is now known as hasharAway | 16:11 | |
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly to troubleshoot an SMTP queuing backlog, downtime should be less than 5 minutes | 16:41 | |
*** hasharAway is now known as hashar | 16:46 | |
*** timburke has quit IRC | 17:00 | |
clarkb | Anyone else here for the infra meeting? we'll get started in a couple of minutes | 18:59 |
fungi | ohai | 19:00 |
clarkb | #startmeeting infra | 19:01 |
openstack | Meeting started Tue Nov 24 19:01:15 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
*** openstack changes topic to " (Meeting topic: infra)" | 19:01 | |
openstack | The meeting name has been set to 'infra' | 19:01 |
clarkb | I didn't send out an email agenda because I figured we'd use this time to write down some Gerrit situation updates | 19:01 |
clarkb | that way people don't have to dig through as much scrollback | 19:01 |
ianw | o/ | 19:01 |
fungi | situation normal: all functioning usually? | 19:02 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes continues to capture a good chunk of info particularly from users | 19:02 |
clarkb | if you're needing to catch up on anything starting there seems like a good place | 19:02 |
fungi | i've been doing my best to direct users there or to the maintenance completion announcement which links to it | 19:03 |
fungi | we should definitely make sure to add links to fixes or upstream bug reports, and cross out stuff which gets solved | 19:03 |
clarkb | There were 6 major items I managed to scrape out of recent events though: Gerritbot situation, High system load this morning, the /x/* bug, the account index log bug, the project watches bug, and openstack releases situation | 19:03 |
clarkb | Why don't we start with ^ that list then continue with anything else? | 19:04 |
clarkb | #topic Gerritbot connection issues to gerrit 3.2 | 19:04 |
*** openstack changes topic to "Gerritbot connection issues to gerrit 3.2 (Meeting topic: infra)" | 19:04 | |
clarkb | ianw: fungi: I've not managed to keep up on this topic, can you fill us in? | 19:04 |
ianw | in short, i think it is not retrying correctly when the connection drops | 19:05 |
fungi | and somehow the behavior there changed coincident with the upgrade (either because of the upgrade or for some other reason we haven't identified) | 19:05 |
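For reference, a minimal shell sketch of the kind of reconnect-with-backoff loop being discussed. This is not the actual gerritlib fix (that is change 763892 below); the host, port, and account name are placeholders.

```bash
#!/bin/bash
# Sketch only: reconnect to the Gerrit event stream with exponential backoff.
HOST=review.opendev.org
PORT=29418
ACCOUNT=gerritbot   # placeholder account name
DELAY=1
while true; do
    # stream-events blocks until the connection drops or the server restarts
    ssh -p "$PORT" "$ACCOUNT@$HOST" gerrit stream-events
    echo "stream-events connection lost; retrying in ${DELAY}s" >&2
    sleep "$DELAY"
    DELAY=$(( DELAY * 2 ))          # exponential backoff...
    (( DELAY > 300 )) && DELAY=300  # ...capped at five minutes
done
```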
ianw | ... sorry, have changes but just have to re-log in | 19:05 |
fungi | i approved the last of them i think | 19:05 |
fungi | including the one to switch our gerritbot container to use master branch of gerritlib instead of consuming releases | 19:06 |
ianw | ok, https://review.opendev.org/c/opendev/gerritlib/+/763892 should fix the retry loop | 19:06 |
ianw | a version of that is actually running in a screen on eavesdrop now, manually edited | 19:06 |
fungi | oh, right, i had read through that one and forgot to leave a review. approved now | 19:07 |
ianw | https://review.opendev.org/c/opendev/gerritbot/+/763927 as mentioned builds the gerritbot image using master of gerritlib, it looks like it failed | 19:07 |
fungi | it was partially duplicative of the change i had pushed up before, which is since abandoned | 19:07 |
ianw | gerritbot-upload-opendev-image https://zuul.opendev.org/t/openstack/build/28ddd61a8f024791880517f4b2be97de : POST_FAILURE in 5m 34s | 19:07 |
ianw | will debug | 19:07 |
ianw | i can babysit that and keep debugging if there's more issues | 19:08 |
clarkb | cool so just need some changes to land? anything else on this topic? | 19:08 |
ianw | the other thing i thought we should do is use the python3.8 base container too, just to keep up to date | 19:08 |
clarkb | if it works that seems like a reasonable change | 19:09 |
fungi | sounds good to me | 19:09 |
ianw | but yeah, i have that somewhat under control and will keep on it | 19:09 |
clarkb | thanks! | 19:09 |
clarkb | #topic The gerrit /x/* namespace conflict | 19:09 |
*** openstack changes topic to "The gerrit /x/* namespace conflict (Meeting topic: infra)" | 19:09 | |
clarkb | #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13721 | 19:10 |
clarkb | The good news is upstream has responded to the bug and they think there are only like three plugins that actually conflict and only two are open source and I don't think we run either | 19:10 |
clarkb | so our fix should be really safe for now | 19:10 |
clarkb | The less good news is they suggested changing the path and updating those plugins instead of fixing the bigger issue, which means we have to check for conflicts when adding new namespaces or upgrading | 19:10 |
clarkb | fungi: ^ you responded to them asking about the bigger issue right? | 19:11 |
fungi | yep | 19:11 |
fungi | it's down there at the bottom | 19:11 |
clarkb | ok, I think we can likely sit on this one while we sort it out with upstream, particularly now that we have more confirmation that the conflicts are minimal | 19:11 |
clarkb | sounds like that may be it on this one | 19:12 |
clarkb | #topic Excessive change emails for some users | 19:13 |
*** openstack changes topic to "Excessive change emails for some users (Meeting topic: infra)" | 19:13 | |
clarkb | #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13733 | 19:13 |
clarkb | We tracked this down to overly greedy/buggy project watch rules. The bug has details on how to work around it (the user can update their settings) | 19:13 |
clarkb | I think it is a bug because the rules really mean "send me change notifications for things I own or have reviewed" but you end up getting all the changes | 19:14 |
clarkb | I was able to reproduce with my own user and then confirm the fix worked for me too | 19:14 |
clarkb | Just be aware of that if people complain about spam | 19:14 |
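For reference, a hedged example of the settings workaround (removing an over-broad watch such as All-Projects, as noted just below) done via Gerrit's accounts REST API instead of the web UI. The endpoints follow the upstream accounts REST documentation; the credentials are placeholders and this would be run by the affected user.

```bash
GERRIT=https://review.opendev.org
CREDS='myuser:my-http-password'   # placeholder HTTP credentials

# list the account's current project watches
curl -s -u "$CREDS" "$GERRIT/a/accounts/self/watched.projects"

# drop the All-Projects watch (same effect as removing it under Settings -> Notifications)
curl -s -u "$CREDS" -X POST -H 'Content-Type: application/json' \
  -d '[{"project": "All-Projects"}]' \
  "$GERRIT/a/accounts/self/watched.projects:delete"
```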
clarkb | #topic Loss of account index filesystem lock | 19:14 |
*** openstack changes topic to "Loss of account index filesystem lock (Meeting topic: infra)" | 19:15 | |
fungi | easy workaround is to remove your watch on all-projects | 19:15 |
clarkb | yup | 19:15 |
clarkb | #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13726 | 19:15 |
clarkb | Yesterday a user mentioned they got a 500 error when trying to reset their http password | 19:15 |
clarkb | examining the gerrit error_log we found that the related tracebacks showed the gerrit server had lost its linux fs lock on the accounts index lock file | 19:16 |
clarkb | sudo lslocks on review confirmed it had no lock for the file in question | 19:16 |
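Roughly what that check looks like; opendev runs Gerrit in a container, so the path below is an assumption about a default-ish site layout rather than the literal command used.

```bash
# if the accounts index lock file has no row here, the flock has been lost
sudo lslocks --output PID,COMMAND,TYPE,PATH | grep -F write.lock
# locate the Lucene accounts index lock file itself (assumed site path)
sudo ls /home/gerrit2/review_site/index/accounts_*/write.lock
```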
clarkb | After a bit of debugging we decided the best thing to do was to restart gerrit which allowed it to reclaim the lock and things appeared happy since | 19:16 |
clarkb | which leads us to the next topic | 19:16 |
clarkb | #topic High Gerrit server load with low cpu utilization and no iowait | 19:17 |
*** openstack changes topic to "High Gerrit server load with low cpu utilization and no iowait (Meeting topic: infra)" | 19:17 | |
clarkb | Today users were complaining about slowness in gerrit. cacti and melody confirmed it was a busy server based on load but other resources were fine (memory, cpu, io, etc) | 19:17 |
clarkb | digging into the melody thread listing we noticed two things: first we only had one email send thread and had started to back up our queues for email sending | 19:18 |
clarkb | second, many zuul ssh queries (typical zuul things to get info about changes) were taking significant time | 19:18 |
clarkb | We updated gerrit to use 4 threads to send email instead of 1 in case this was the issue. After restarting gerrit the problem came back | 19:18 |
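The email sender thread count is the sendemail.threadPoolSize setting in gerrit.config; a sketch of the 1-to-4 bump described above, with the site path being an assumption.

```bash
SITE=/home/gerrit2/review_site   # assumed site path
git config -f "$SITE/etc/gerrit.config" sendemail.threadPoolSize 4
git config -f "$SITE/etc/gerrit.config" --get sendemail.threadPoolSize
# Gerrit has to be restarted to pick up the new value
```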
clarkb | Looking closer at the other issue we identified that the stacktraces from melody showed many of the ssh queries by zuul were looking up account details via jgit, bypassing both the lucene index and the cache | 19:19 |
clarkb | From this we theorized that perhaps the account index lock failure meant our index was incomplete and that was forcing gerrit to go straight to the source, which is slow | 19:19 |
clarkb | in particular it almost looked like the slowness had to do with locking, like each ssh query was waiting for the jgit backend lock so they wouldn't be reading out of sync (but I haven't confirmed this, it is a hunch based on low cpu, low io but high load) | 19:20 |
clarkb | fungi triggered an online reindex of accounts with --force and since that completed things have been happier | 19:20 |
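A sketch of that forced online reindex; "gerrit index start" is the documented SSH command, and the admin account name is a placeholder.

```bash
ssh -p 29418 admin@review.opendev.org gerrit index start accounts --force
# progress and completion of the online reindex are logged to Gerrit's error_log
```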
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=26&rra_id=all shows the fall off in load | 19:21 |
fungi | but it's hard to know for sure if that "fixed" it or was merely coincident timing | 19:21 |
clarkb | yup, though definitely correlation seems strong | 19:21 |
fungi | so we need to keep a close eye on this | 19:21 |
clarkb | if we have to restart again due to index lock failures we should probably consider reindexing as well | 19:22 |
clarkb | anything else to add on this one? | 19:22 |
fungi | i don't think so | 19:23 |
clarkb | oh wait, I did have one other thing | 19:23 |
ianw | ... sounds good | 19:23 |
clarkb | Our zuul didn't seem to have these issues. It uses the http api for querying change info | 19:23 |
clarkb | to get zuul to do that you have to set the gerrit http password setting | 19:23 |
clarkb | It is possible that the http side of things is going to be better performing for that sort of stuff (maybe it doesn't fall back to git as aggressively) and we should consider encouraging ci operators to switch over | 19:24 |
clarkb | sean-k-mooney did it today at our request to test things out and it seems to have gone well | 19:24 |
clarkb | #topic OpenStack Release tooling changes to accommodate new Gerrit | 19:25 |
*** openstack changes topic to "OpenStack Release tooling changes to accommodate new Gerrit (Meeting topic: infra)" | 19:25 | |
clarkb | fungi: I have also not kept up to date on this one | 19:25 |
fungi | if they're running relatively recent zuul v3 releases (like from ~this year) then they should be able to just add an http password and make sure they're set for basic auth | 19:25 |
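A hedged sketch of the Zuul connection settings fungi describes: keep the SSH key for the event stream, and add an HTTP password with basic auth so change queries go over the REST API. The option names follow the Zuul gerrit driver documentation; every value here is a placeholder.

```bash
cat >> /etc/zuul/zuul.conf <<'EOF'
[connection gerrit]
driver=gerrit
server=review.opendev.org
user=my-third-party-ci
sshkey=/var/lib/zuul/.ssh/id_rsa
password=http-password-generated-in-gerrit-settings
auth_type=basic
EOF
```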
ianw | is there any way we can ... strongly suggest that via disabling something? | 19:25 |
clarkb | #undo | 19:25 |
openstack | Removing item from minutes: #topic OpenStack Release tooling changes to accommodate new Gerrit | 19:25 |
clarkb | ianw: yes we could disable their accounts and force them to talk to us | 19:26 |
fungi | ianw: we can individually disable their accounts | 19:26 |
clarkb | maybe do that as a last resort after emailing them first | 19:26 |
fungi | thing is, zuul still needs ssh access for the event stream (and i think it uses that for git fetches as well) | 19:26 |
ianw | ok, yeah. if over the next few days things go crazy and everyone is off doing other things, it might be an option | 19:26 |
clarkb | fungi: yes, but neither of those seem to be a load burden based on show-queue info | 19:26 |
fungi | yep, but it does mean that we can't just turn off ssh and leave rest api access for them | 19:27 |
clarkb | I think its specifically the change info queries because that pulls comments and votes which needs account info | 19:27 |
clarkb | fungi: ah yup | 19:27 |
clarkb | fwiw load is spiking right now and it's ci ssh queries; if you do a show-queue you'll see it | 19:27 |
clarkb | (so maybe we haven't completely addressed this) | 19:27 |
fungi | some 20+ ci accounts have ssh tasks in the queue but it's like one per account | 19:28 |
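show-queue is the Gerrit SSH command being used here to spot the CI query load; a hedged example, with the admin account name as a placeholder.

```bash
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
```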
clarkb | https://review.opendev.org/c/openstack/magnum/+/763997/ is what they are fetching (for ps2 I think then they will do 3 and 4) | 19:28 |
fungi | looks like mostly cinder third-party ci | 19:29 |
fungi | maybe this happens whenever someone pushes a stack of cinder changes | 19:29 |
clarkb | but if you look at our zuul it already has 763997 in the dashboard and there are running jobs | 19:29 |
clarkb | this is why I'm fairly confident the http lookups are better | 19:29 |
clarkb | maybe when corvus gets back from vacation he can look at this from the zuul perspective and see if we need to file bugs upstream or if we can make the ssh stuff better or something | 19:30 |
clarkb | ok lets talk release stuff now | 19:30 |
clarkb | #topic OpenStack Release tooling changes to accommodate new Gerrit | 19:30 |
*** openstack changes topic to "OpenStack Release tooling changes to accommodate new Gerrit (Meeting topic: infra)" | 19:30 | |
fungi | this is the part where i say there were problems, all available fixes have been merged and as of a moment ago we're testing it again to see what else breaks | 19:31 |
clarkb | cool so nothing currently outstanding on this subject | 19:31 |
fungi | this is more generally a problem for jobs using existing methods to push tags or propose new changes in gerrit | 19:32 |
fungi | one factor in this is that our default nodeset is still ubuntu-bionic which carries a too-old git-review version | 19:32 |
fungi | specifying a newer ubuntu-focal nodeset gets us new enough git-review package to interact with our gerrit | 19:32 |
frickler | an update for git-review is in progress there | 19:32 |
frickler | https://bugs.launchpad.net/ubuntu/+source/git-review/+bug/1905282 | 19:33 |
openstack | Launchpad bug 1905282 in git-review (Ubuntu Bionic) "[SRU] git-review>=1.27 for updated opendev gerrit" [High,Triaged] | 19:33 |
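A hedged illustration of the git-review situation: the bionic package predates the >=1.27 floor named in the SRU bug above, focal's package is new enough, and pip can serve as a stopgap where the distro package lags.

```bash
apt-cache policy git-review           # compare the packaged version against the 1.27 floor
pip install --user 'git-review>=1.27' # stopgap on nodes with an older package
git review --version
```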
fungi | another item is that new gerrit added and reordered its ssh host keys, so the ssh-rsa key we were pre-adding into known_hosts was not the first key ssh saw | 19:33 |
fungi | this got addressed by adding all gerrit's current host keys into known_hosts | 19:34 |
clarkb | fungi: was it causing failures or just excessive logging? | 19:34 |
fungi | another problem is that the launchpad creds role in zuul-jobs relied on the python 2.7 python-launchpadlib package which was dropped in focal | 19:34 |
fungi | clarkb: the host key problem was causing failures | 19:35 |
fungi | new ssh is compatible with the first host key gerrit serves, but it's not in known_hosts, so ssh just errors out with an unrecognized host key | 19:35 |
clarkb | huh I thought it did more negotiating than that | 19:35 |
fungi | basically whatever the earliest host key the sshd presents that your openssh client supports needs to be recognized | 19:36 |
clarkb | got it | 19:36 |
fungi | so there is negotiation between what key types are present and what ones the client supports | 19:36 |
fungi | but if it reaches one which is there and supported by the client that's the one it insists on using | 19:37 |
fungi | so the fact that gerrit 3.2 puts a new key type sooner than the rsa host key means the new type has to be accepted by the client if it's supported by it | 19:37 |
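The known_hosts fix, sketched for a local client: record every host key type the upgraded Gerrit offers rather than pre-seeding only the ssh-rsa one.

```bash
ssh-keyscan -p 29418 review.opendev.org                        # shows all offered key types
ssh-keyscan -p 29418 review.opendev.org >> ~/.ssh/known_hosts  # record them all
```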
fungi | and yeah, the other thing was updating zuul-jobs to install python3-launchpadlib which we're testing now to see if that worked or was hiding yet more problems | 19:38 |
fungi | also zuul-jobs had some transitive test requirements which dropped python2 support recently and needed to be addressed before we could merge the launchpadlib change | 19:39 |
fungi | so it's been sort of involved | 19:39 |
fungi | and i think we can close this topic out because the latest tag-releases i reenqueued has succeeded | 19:40 |
clarkb | that is great news | 19:40 |
clarkb | #topic Open Discussion | 19:40 |
*** openstack changes topic to "Open Discussion (Meeting topic: infra)" | 19:40 | |
clarkb | That concluded the items I had identified. Are there others to bring up? | 19:41 |
ianw | oh, the java 11 upgrade you posted | 19:41 |
ianw | do you think that's worth pushing on? | 19:42 |
clarkb | ianw: ish? the main reason gerrit recommends java 11 is better GC performance | 19:42 |
clarkb | we don't seem to be having GC issues currently so I don't think it is urgent, but it would be good to do at some point | 19:42 |
ianw | yeah, but we certainly have hit issues with that previously | 19:42 |
fungi | though also they're getting ready to drop support for <11 | 19:42 |
fungi | so we'd need to do it anyway to keep upgrading gerrit past some point in the near future | 19:43 |
ianw | i noticed they had a java 15 issue with jgit, and then identified they didn't have 15 CI | 19:43 |
ianw | so, going that far seems like a bad idea | 19:43 |
clarkb | ya we need to drop java 8 before we upgrade to 3.4 | 19:43 |
clarkb | I think doing it earlier is fine, just calling it out as not strictly urgent | 19:43 |
clarkb | ianw: yes Gerrit publishes which javas they support | 19:44 |
clarkb | for 3.2 it is java 8 and 11. 3.3 is 8 and 11 and 3.4 will be just 11 I think | 19:44 |
fungi | might be nice to upgrade to 11 while not doing it at the same time as a gerrit upgrade | 19:44 |
clarkb | fungi: ++ | 19:44 |
fungi | just so we can rule out problems as being one or the other | 19:44 |
frickler | there's failing dashboards, not sure how urgent we want to fix those | 19:45 |
clarkb | frickler: I noted on the etherpad that I don't see any method for configuring the query terms limit via configuration | 19:45 |
frickler | also dashboards not working when not logged in | 19:45 |
clarkb | frickler: I also suggested bugs be filed for those items | 19:45 |
clarkb | (I was hoping that users hitting the problems would file the bugs as I've been so swamped with other things) | 19:45 |
fungi | yeah, those both seem like good candidates for upstream bugs | 19:45 |
ianw | oh, and also tristanC's plugin | 19:46 |
clarkb | I tried to update the etherpad where I thought filing bugs upstream was appropriate and asked reporters to do that, though I suspect that etherpad is largely write only | 19:46 |
fungi | we've been filing bugs for the problems we're working on, but not generally acting as a bug forwarder for users | 19:46 |
clarkb | re performance things I'm noticing there are a number of tunables at https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#core | 19:46 |
clarkb | it might be worth a thread to the gerrit mailing list or to luca to ask about how we might modify our tunables now that we have real world data | 19:46 |
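A hedged look at the [core] tunables referenced above: the config path is an assumption, and the option names are only examples from that documentation page, not recommended values.

```bash
GERRIT_CONFIG=/home/gerrit2/review_site/etc/gerrit.config   # assumed path
git config -f "$GERRIT_CONFIG" --get-regexp '^core\.'       # what is currently set
# e.g. core.packedGitLimit, core.packedGitOpenFiles, core.packedGitWindowSize
```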
clarkb | As a heads up I'm going to try and start winding down my week. I have a few more things I want to get done but I'm finding I really need a break and will endeavor to do so | 19:55 |
clarkb | and sounds like everyone else may be done based on lack of new conversation here | 19:56 |
clarkb | thanks everyone! | 19:56 |
clarkb | #endmeeting | 19:56 |
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev" | 19:56 | |
openstack | Meeting ended Tue Nov 24 19:56:21 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:56 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.html | 19:56 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.txt | 19:56 |
openstack | Log: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.log.html | 19:56 |
diablo_rojo | thanks clarkb! | 19:56 |
*** sboyron has quit IRC | 20:09 | |
*** hamalq has joined #opendev-meeting | 20:16 | |
*** sboyron has joined #opendev-meeting | 20:18 | |
*** hamalq has quit IRC | 20:55 | |
*** hashar has quit IRC | 20:59 | |
*** hamalq has joined #opendev-meeting | 21:10 | |
*** hamalq has quit IRC | 21:15 | |
*** sboyron has quit IRC | 21:34 | |
*** sboyron has joined #opendev-meeting | 21:34 | |
*** sboyron has quit IRC | 21:58 | |
*** sboyron has joined #opendev-meeting | 21:59 | |
*** sboyron has quit IRC | 22:11 | |
*** sboyron has joined #opendev-meeting | 22:11 | |
*** sboyron has quit IRC | 22:12 | |
*** sboyron has joined #opendev-meeting | 22:13 | |
*** jentoio has quit IRC | 22:14 | |
*** sboyron has quit IRC | 22:15 | |
*** sboyron has joined #opendev-meeting | 22:16 | |
*** sboyron has quit IRC | 22:20 | |
*** sboyron has joined #opendev-meeting | 22:20 | |
*** sboyron has quit IRC | 22:25 | |
*** sboyron has joined #opendev-meeting | 22:26 | |
*** jmorgan has joined #opendev-meeting | 22:29 | |
*** sboyron has quit IRC | 22:33 | |
*** sboyron has joined #opendev-meeting | 22:34 | |
*** sboyron has quit IRC | 22:40 | |
*** sboyron has joined #opendev-meeting | 22:41 | |
*** sboyron has quit IRC | 22:48 | |
*** sboyron has joined #opendev-meeting | 22:48 | |
*** sboyron has quit IRC | 22:57 | |
*** sboyron has joined #opendev-meeting | 23:06 | |
*** hamalq has joined #opendev-meeting | 23:37 | |
*** hamalq has quit IRC | 23:41 | |
*** hamalq has joined #opendev-meeting | 23:52 | |
*** hamalq has quit IRC | 23:57 |