Tuesday, 2021-03-30

*** mlavalle has quit IRC00:02
*** tosky has quit IRC00:10
clarkbin ~7 minutes the gitea update should start00:13
fungii'm still sorta on hand00:28
clarkbgitea01 seems to have updated now to check on it00:30
clarkbhttps://gitea01.opendev.org:3000/openstack/nova seems to work for me00:30
clarkbversion is reported as being updated00:31
clarkb02 and 03 also look good now00:32
ianwclarkb: i was thinking for the token thing too, it probably helps in production because the new versions have switched the default hashing algorithm to something less intensive, but our current passwords are hashed using the old method?00:33
ianwperhaps we should redo the auth?00:33
fungior just go with your token work00:34
clarkbya I'm not sure yet if they'll hash it back again on first requests or not00:34
clarkbwe should be able to find out monitoring a new project addition after this upgrade completes?00:34
clarkbwe'll be able to compare resource usage at least00:34
clarkb04 and 05 are now done and look good00:35
fungididn't we turn off the mass description updating though?00:35
clarkbfungi: good point we did reduce the overhead externally as well00:36
clarkbprobably can check the db directly?00:36
clarkband see what sort of hash is in the records00:36
fungiif they use a typical identifier, yeah00:37
clarkb06-08 lgtm now too00:41
clarkbthe user table has a passwd_hash_algo field00:43
clarkbnow to find that issue so I can determine what we want and what we don't want00:44
*** kevinz has joined #opendev00:49
openstackgerritMerged opendev/system-config master: Add review02.opendev.org  https://review.opendev.org/c/opendev/system-config/+/78318300:49
clarkbhttps://github.com/go-gitea/gitea/issues/14294 if that was the issue then we are using pbkdf2 now according to the db00:49
clarkbwhich is good according to that issue00:49
clarkbI've got to help sort out dinner now, but gitea looks happy and I think it is using the hash that is less memory intensive00:50
ianwcool, will keep an eye00:51
fungiyeah, seems fine so far00:55
fungii hadn't noticed this before (new?) but if you look at the very bottom of the page where the gitea version is listed, it also tells you how long it took to load that page00:57
fungifor example, https://opendev.org/openstack/nova/src/branch/master/releasenotes gave me "Page: 39332ms Template: 102ms"00:59
fungiso, yeah, 40 seconds there00:59
fungibut it's a fairly pathological example00:59
clarkbfungi: the first time you request an asset it isn't cached and can take a long time, particularly for large repos like nova (because gitea is happy to inspect history for objects to tell you when they were last modified)01:05
clarkbbut then it caches it, if you refresh it should be much quicker01:05
clarkbok really going to do dinner now. I'm being yelled at01:06
fungiyep "Page: 3041ms Template: 122ms"01:15
fungithat much i knew, just wasn't aware it displayed those stats01:15
fungior else i was aware and then forgot01:16
*** hamalq has quit IRC01:25
openstackgerritIan Wienand proposed opendev/system-config master: gerrit: remove mysql-client-core-5.7 package  https://review.opendev.org/c/opendev/system-config/+/78376902:10
*** prometheanfire has quit IRC02:11
*** prometheanfire has joined #opendev02:12
openstackgerritIan Wienand proposed opendev/system-config master: review01.openstack.org: add key for gerrit data copying  https://review.opendev.org/c/opendev/system-config/+/78377802:45
ianwinfra-root: ^ that installs a key from review02 -> review01 that can r/o rsync data.  i think that will be generally useful as we go through this process to sync02:45
openstackgerritIan Wienand proposed opendev/system-config master: gerrit: add mariadb_container option  https://review.opendev.org/c/opendev/system-config/+/77596103:35
ianwfinally navigated getting the mariadb fixes into the stable branch, so ^ doesn't require any patches any more03:36
*** whoami-rajat_ has joined #opendev04:21
*** ricolin has quit IRC04:47
*** whoami-rajat_ is now known as whoami-rajat04:56
*** marios has joined #opendev05:01
*** ykarel|away has joined #opendev05:03
*** ykarel|away is now known as ykarel05:07
openstackgerritIan Wienand proposed opendev/system-config master: gerrit: add mariadb_container option  https://review.opendev.org/c/opendev/system-config/+/77596105:36
*** ykarel_ has joined #opendev05:49
*** ykarel has quit IRC05:49
*** cloudnull has quit IRC05:54
*** ysandeep|away is now known as ysandeep05:59
*** lpetrut has joined #opendev06:12
*** slaweq has joined #opendev06:16
*** eolivare has joined #opendev06:17
*** ralonsoh has joined #opendev06:17
openstackgerritDmitriy Rabotyagov proposed openstack/diskimage-builder master: [doc] Update supported distros  https://review.opendev.org/c/openstack/diskimage-builder/+/78378806:27
*** ykarel__ has joined #opendev06:29
*** ykarel_ has quit IRC06:31
openstackgerritDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job  https://review.opendev.org/c/openstack/diskimage-builder/+/78379006:32
openstackgerritSlawek Kaplonski proposed openstack/project-config master: Add noop-jobs for networking-midonet projects  https://review.opendev.org/c/openstack/project-config/+/78379206:40
*** cloudnull has joined #opendev06:42
openstackgerritSlawek Kaplonski proposed openstack/project-config master: Readd publish-to-pypi for neutron-fwaas and dashboard  https://review.opendev.org/c/openstack/project-config/+/78379606:45
*** sboyron has joined #opendev06:54
*** hashar has joined #opendev06:57
*** frigo has joined #opendev07:10
*** rpittau|afk is now known as rpittau07:29
*** hashar_ has joined #opendev07:43
*** amorin_ has joined #opendev07:46
*** hashar has quit IRC07:46
*** amorin has quit IRC07:47
*** hashar_ is now known as hashar07:51
*** ykarel__ is now known as ykarel07:59
*** tosky has joined #opendev08:11
*** bodgix has quit IRC08:14
*** smcginnis has quit IRC08:14
*** bodgix has joined #opendev08:14
*** arxcruz has quit IRC08:14
*** smcginnis has joined #opendev08:14
*** frigo has quit IRC08:16
*** arxcruz has joined #opendev08:18
*** DSpider has joined #opendev08:44
*** dtantsur|afk is now known as dtantsur08:48
*** dtantsur is now known as dtantsur|brb08:55
*** hashar has quit IRC09:20
*** Guest55766 has quit IRC09:30
*** ykarel is now known as ykarel|lunch09:45
*** dtantsur|brb is now known as dtantsur09:59
*** dirk1 has joined #opendev10:34
*** dirk1 is now known as dirk10:39
*** ykarel|lunch is now known as ykarel10:52
openstackgerritGuillaume Chauvel proposed opendev/gear master: Create SSL context using PROTOCOL_TLS, fallback to highest supported version  https://review.opendev.org/c/opendev/gear/+/74128811:03
openstackgerritGuillaume Chauvel proposed opendev/gear master: Update testing to Python 3.9 and linters  https://review.opendev.org/c/opendev/gear/+/78010311:03
*** DSpider has quit IRC11:18
*** openstackgerrit has quit IRC11:21
*** hashar has joined #opendev11:39
*** ysandeep is now known as ysandeep|afk11:50
*** redrobot9 has joined #opendev12:25
*** artom has quit IRC12:26
*** artom has joined #opendev12:26
*** redrobot has quit IRC12:28
*** redrobot9 is now known as redrobot12:28
*** artom has quit IRC12:36
*** ysandeep|afk is now known as ysandeep12:45
*** dpawlik6 is now known as dpawlik12:58
*** hashar has quit IRC13:24
*** spotz has joined #opendev13:30
*** artom has joined #opendev13:33
*** ykarel is now known as ykarel|away13:50
*** ykarel|away has quit IRC13:54
*** mlavalle has joined #opendev13:58
*** ralonsoh has quit IRC14:18
*** ralonsoh has joined #opendev14:19
*** lpetrut has quit IRC14:30
*** tosky has quit IRC14:39
*** tosky has joined #opendev14:39
*** ysandeep is now known as ysandeep|away14:56
zbr|roverA page like https://review.opendev.org/admin/repos/openstack/hacking,access was supposed to list the groups that have review rights but now is ~empty15:23
zbr|roverIs that a desired change caused by some security concerns or just a glitch?15:23
clarkbI think it is related to the fixes for the bug we discovered when testing upgrades15:24
zbr|roverimho, it should be possible for a logged in user to discover which other users or groups have access to a repo15:24
fungigerrit doesn't display permissions to you if you don't have them15:24
clarkbgerrit significantly trimmed down who can access metadata15:24
fungibasically the acl view shows you what permissions apply to your account15:25
zbr|roverand is not configurable in default settings?15:25
clarkball of that info is available in project-config though15:25
fungithis makes our acls in the project-config repo even more important, yes15:25
fungiacl copies in the repo, i mean15:26
zbr|roverthat is bad for the user experience, imagine a random user trying to propose a patch to a project. Assume that is his first experience contributing something to opendev gerrit.15:26
zbr|roverhe passed CI and now he wants to get attention of someone that can help him review a patch.15:26
fungizbr|rover: are you asking us to convey your concerns to the gerrit maintainers, or replace gerrit? i can't tell15:27
zbr|roveri am wondering if we can do something to improve the discoverability of a gerrit project maintainers (cores)15:27
fungiprojects can publish a link to the group view for their core review team, sure15:28
fungigerrit even supports convenience urls where you can specify the group name instead of its id15:28
zbr|roverso basically the only option we currently have is to expect the repo owners to mention that link in their docs. as this is a problem especially on projects with lower maintenance, it will never be addressed by the most vulnerable projects (active ones would likely be able to document this)15:31
zbr|roveri guess the practical answer is to look at previous reviews and see who performed them, "punishing" those few that do perform reviews :D15:32
fungii don't know that it's our *only* option, but if you have ideas we can evaluate them15:32
zbr|roversadly no ideas, only questions for now15:32
fungiokay, i'm done talking to you for now, you're suggesting that we intentionally punish our users15:32
clarkbproviding feedback like this to gerrit is also helpful. Even if they don't take action on it at least we're communicating to the people most likely to take action15:32
*** hashar has joined #opendev15:32
*** rpittau is now known as rpittau|afk15:33
clarkbunfortunately I suspect this is directly related to the security issues that were recently identified and fixed.15:33
clarkbwhich may make changing this tricky and people will probably avoid it15:33
zbr|roverwhat i was trying to say is that people that do reviews are visible in gerrit history, and more likely to be added as reviewers by other users. Those that do not perform reviews are unlikely to be picked because nobody knows them.15:33
fungibut please don't suggest to the gerrit maintainers that they choose to be user-hostile and intentionally break the usability of their software. i'm really afraid they will think you're representing our community with your abusive comments15:34
clarkbfwiw if you add me as a reviewer the email goes into a folder in my mail client that I don't really watch. It gets far too many emails every day for me to keep up.15:35
fungithere may be gerrit plugins targeted at what you're wanting, or it's possible a gerrit plugin could be developed to do it15:35
clarkbI think gerrit could definitely use better tooling around helping people see what they should review. Adding people as reviewers doesn't seem to be it15:35
fungiwe can evaluate adding new plugins if they're stable and reasonably unobtrusive (we're in the process of adding the reviewers plugin currently)15:36
clarkb(I suspect something with hash tagging may be the ticket)15:36
*** tosky has quit IRC15:36
clarkbexperienced devs consistently hash tag, reviewers can look for those specific tags to review those changes and also look for absence of tags to find new contributors and help them out15:36
* zbr|rover wonder what gave the impression that his comments are abusive15:37
fungiyou suggested we've made decisions to punish reviewr15:38
fungimy keyboard needs new fingers15:38
JayFYou may be missing context that most core reviewers are inundated with unsolicited emails and DMs from people desiring reviews who did not participate at all with the community in the reporting or design phase of whatever work they are doing.15:39
fungithe reviewers plugin probably doesn't address your concern, which seems to be that under-supported repositories don't have an easy way for contributors to find reviewers who don't exist15:39
fungibut once https://review.opendev.org/724914 merges we can do some trials with projects to see if it helps those who want a more structured way to auto-associate reviewers with reviews: https://gerrit.googlesource.com/plugins/reviewers/+/refs/heads/master/src/main/resources/Documentation/15:40
clarkbJayF: I'm not sure that making it easier to lookup the entire core reviewer list makes that better? seems like it would make the shotgun approach easier?15:42
JayFThat's what I was saying. The suggestion to make the core reviewer list more discoverable could be construed as abusive to already-harried core reviewers.15:42
clarkbah got it, you were addressing it at zbr15:43
JayFHonestly, 1:1, out of band review requests without asking in a public IRC channel or participating in storyboard/launchpad/etc is the #1 best way to not get your code reviewed.15:43
clarkbinfra-root I've started trying to dig up zk docs on the proper way to do rolling replacement of zk servers. Haven't had great luck finding official docs but have found a few independent articles and I think I may need to dig into this a bit more before booting servers and adding them. In particular it seems we need to be very careful with the myid values for the new servers. They should not15:48
clarkboverlap with old servers (I think this means we want zk04.opendev.org-zk06.opendev.org? need to confirm). Also there appears to be some coordination needed to trigger dynamic reconfiguration of the existing cluster members after adding a new member to the configs. Or we have to restart everything.15:48
clarkbmy current understanding of what the process looks like is start 04-06. Identify the current leader L and followers F1 and F2. Stop zk on F2 and replace with one of 04-06. Trigger reconfig or restart things and ensure we have quorum and a leader. Repeat.15:49
fungizbr|rover: there's also https://gerrit.googlesource.com/plugins/reviewers-by-blame/+/refs/heads/master/src/main/resources/Documentation/about.md but i get the impression that doesn't distinguish reviewers with approval rights, and is based more on who contributed changes touching certain lines than who reviewed similar changes in the past15:50
clarkbonce F1 and F2 are replaced we should be left with a leader. I think we stop zk on that and ensure the other nodes elect a new leader L out of the pool of replaced servers. Then old L is now F3 and can be replaced too15:50
clarkbthere is also a thing where the ordinality of the myid value also affects behavior. A low id won't join a cluster with higher ids? something like that. I don't think it affects us since we'll have all the new ids higher than the old ids15:52
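The replacement order discussed above can be sketched in a few lines of Python. This is only an illustration of the plan as described in the channel, not a tested ZooKeeper procedure: the function name `plan_rolling_replacement` and the role labels are hypothetical, and the rule that every new myid must be higher than every old one is taken from the discussion rather than from the ZooKeeper documentation.

```python
from typing import Dict, List, Tuple


def plan_rolling_replacement(old: Dict[int, str],
                             new_ids: List[int]) -> List[Tuple[int, int]]:
    """Return (old_myid, new_myid) pairs in the order they would be swapped.

    `old` maps each existing myid to "leader" or "follower". Followers are
    replaced first; the leader goes last so that a new leader gets elected
    from the already-replaced servers. (Hypothetical helper, for illustration.)
    """
    if set(old) & set(new_ids):
        raise ValueError("new myid values must not overlap old ones")
    if len(new_ids) != len(old):
        raise ValueError("need exactly one new server per old server")
    # Assumption from the discussion: all new ids are higher than all old ids.
    if min(new_ids) <= max(old):
        raise ValueError("new myid values should all exceed the old ones")
    leaders = [myid for myid, role in old.items() if role == "leader"]
    if len(leaders) != 1:
        raise ValueError("expected exactly one leader")
    followers = sorted(myid for myid, role in old.items() if role == "follower")
    order = followers + leaders  # followers first, leader last
    return list(zip(order, sorted(new_ids)))


# Example: three old servers with 02 as leader, replaced by 04-06.
plan = plan_rolling_replacement({1: "follower", 2: "leader", 3: "follower"},
                                [4, 5, 6])
print(plan)
```

After each swap in the plan, quorum and the presence of a leader would still need to be checked by hand before moving on to the next pair.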
*** dtantsur is now known as dtantsur|afk16:02
corvusclarkb: or you could copy the data over manually16:05
*** iurygregory_ has joined #opendev16:06
*** iurygregory has quit IRC16:06
clarkbcorvus: if we do that it would look something like stop 03.openstack.org, copy 03.openstack.org data to 03.opendev.org, replace 03.openstack.org's IP address with 03.opendev.org's IP address in configs. Trigger reconfiguration or rolling restarts?16:15
*** hamalq has joined #opendev16:15
*** iurygregory_ is now known as iurygregory16:15
*** marios is now known as marios|out16:17
corvusclarkb: i think rolling restarts.  if you do that (actually this applies to any process), i'd probably do 2/3 of them before restarting the scheduler, and make sure the last one is already in the config on disk before restarting the scheduler.  that way there's only one scheduler restart and it restarts into the final config.  scheduler should only need to be able to reach one server.16:18
*** hamalq_ has joined #opendev16:19
clarkboh hrm I hadn't even considered that the clients may need to be restarted to see the new config, but that makes sense16:20
*** hamalq has quit IRC16:20
corvusyeah i think we have ip addrs in their config too16:20
clarkband ya we want to restart frequently on the cluster side to ensure we are maintaining quorum. I agree the client should be fine as long as one of the quorum members remains in its config16:20
clarkbok, I think my next step is to gather concrete info on what the current cluster looks like, put it in an etherpad and write down some options with the real info16:23
zbr|roverfungi: clarkb: please excuse me if I was not clear regarding my questions about reviewing, I did not want to complain about something done or not done by infra team about our gerrit, I only wanted to find-out if there is something I can help with in order to make that part of the review process easier for other gerrit users.16:33
*** eolivare has quit IRC16:33
clarkbzbr|rover: I think the first thing to do is communicate the concern upstream. Indicate it can be difficult for new contributors in particular to find out who to work with to get changes reviewed and that maybe gerrit can help with this. It is possible that gerrit already has plugins or other tools that we aren't aware of as well they can point us to16:34
fungion a related note, https://review.opendev.org/724914 is now passing if we're ready for it16:37
fungithe next phase will be adding support for etc/reviewers.config files in manage-projects16:39
zbr|roveryep, i commented there, i am offering to help on experimenting with it.16:39
zbr|roverwe can use a side-project for that and see exactly how it works.16:40
*** gothicserpent has quit IRC16:41
zbr|roveri like that fact that it can do ignoreWip, and we should see if suggestOnly proves to be more useful or not.16:41
zbr|roverif suggestOnly would really work fine to bump those with rights on the top it could be an alternative.16:41
zbr|roverwhat I found curious is that I did not see any options regarding how many reviewers to auto-assign or an option to exclude specific groups16:42
zbr|roverwhile infra-core has permissions i doubt members of this group want to end up being auto-picked by the plugin just because they happen to be the fallback.16:43
*** gothicserpent has joined #opendev16:44
fungiyeah, suggestOnly unfortunately looks like it would be high-maintenance16:44
fungibecause it doesn't support suggesting groups, only individuals, which means maintaining a separate list of individuals, and for lots of teams that list would quickly fall stale16:45
mordreddoes the per-project config go in the primary repo of the project? or is it a thing that goes into a refs/ location somewhere16:45
*** gothicserpent has quit IRC16:45
mordred(yeah - it would make way more sense if it could be tied to a group)16:46
fungii expect the main way projects might use it would be to have it auto-add specific groups as requested reviewers on changes matching particular branches or file subpaths (e.g. nova could automatically add a vmware-reviewers group to changes touching the vmware hypervisor backend files)16:46
mordredor like - tied to a group but with specific overrides possible - I could see some projects saying "make sure to get files in sub/dir reviewed by $human_a and $human_b"16:46
mordredyeah - what you said16:47
fungifor many cases i expect teams would still prefer to rely on custom dashboards so reviewers can voluntarily find what they want to review rather than having themselves added to reviews automatically, but different teams have different reviewing habits16:47
fungiwhat triggered the addition of this plugin in the first place was that some of the teams working on starlingx wanted to see if it could help them improve how they're reviewing changes for their projects16:48
zbr|roverthe way CODEOWNERS works on github is that it picks a random one or two (based on min reviews config rule) and assigns them. I kind of find it working fine so far but i used it in only one project.16:49
*** ralonsoh has quit IRC16:50
zbr|roverif everyone from the list is added to each review, i would not see that working in practice. the entire idea is to spread the review-load.16:50
JayFI think something that's different about OpenStack vs many other projects is that many projects have a sense of priorities, and working together to ensure something is 100% done instead of having 10 things 10% done. That makes it hard for people outside that upstream process entirely to get code reviewed/paid attention to -- it's almost explicitly not a priority to review that code.16:51
JayFThat's why "step 1" to getting something merged in almost any OpenStack project is to make the case it's needed, via stories/bugs/mailing list/irc, then once you are there, it gets more easy for folks to review your code.16:51
JayFThere are very few successful openstack contributors who do not engage with the community in ways other than code.16:51
TheJuliaWell I just clicked into an interesting discussion :)16:57
zbr|roverJayF: you are right that discussions are highly likely to be needed. Still, there are projects where these may not really be necessary and where it is easy for a valid CR to be ignored just because nobody that can help is notified.16:58
JayFIf your CR is being ignored, maybe it does need some discussion even if you don't realize it yet. :D16:58
JayFSometimes that happens in the CR itself, but not every team is structured as to that being how it works.16:58
TheJuliaAnd sometimes people downright ignore reviewer comments in change requests. :\16:59
TheJuliaso off the wall side question, is CI becoming unhappy?16:59
TheJuliaseeing 2nd attempt randomly pop up on an ironic job on things that should have worked fine17:00
JayFTheJulia: the tldr of how this started is zbr|rover was asking for an infra feature to make it easier to ID the core reviewers for a change :) We had to inform them that core reviewers are already getting canned spiced ham chucked at them with high frequency and velocity :D17:00
TheJuliaahh yes17:00
TheJuliawhich all goes to the nearly automatic canning machine in our mailboxes17:01
JayF(as a side note: I think we've all been on the other side of it too, hoping another project merges our change and you can't get anyone's attention, and it's frustrating, so yes it's a problem, but the proposal is not the solution IMO)17:01
*** tosky has joined #opendev17:03
zbr|roveri had a recent example from last week where I had to help with a new release of python-jenkins, even if i no longer use the library myself. probably we can ask him about how long it took to find someone that can make a new release (his change was already merged long)17:04
clarkbinfra-root https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 I've put two options for the zk upgrade in there. Please add text or let me know if I've missed things or if you have preferences17:04
clarkbzbr|rover: I think the issue there was largely that the software was unmaintained, not that gerrit didn't show core reviewer groups17:05
clarkbgerrit could add reviewers and it would still be ignored if those individuals are no longer maintaining the software17:06
fungiTheJulia: i haven't heard anyone mention a new global issue, we're not making changes right now either, but if you have some examples i'm happy to take a quick look17:08
zbr|roveri usually look at those reviews i am added to, but keep in mind that the issue i mentioned is that the OP may have no clue about who can help or not, regardless on which communication channel he may attempt to use gerrit, irc, mailing list, carrier pigeon.17:09
clarkbhuh some tripleo jobs show 4 attempts (I thought we capped at 3)17:10
clarkbfungi: TheJulia  ^ as another data point17:10
clarkbconsidering how widespread that is I wonder if we lost zk connectivity with a launcher?17:11
clarkber no it would have to be the scheduler as the launcher unlocks things once handed to the jobs17:11
clarkb2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost17:14
clarkbfungi: TheJulia  ^ I think that is the cause17:14
funginone of the zk servers is down17:15
clarkbI ran echo stat | nc localhost 2181 against all 3 not too long ago to figure out which was leader and which are followers and they all looked happy17:16
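The same leader/follower check can be done from Python instead of `nc`. A minimal sketch, assuming the server answers the `stat` four-letter command on port 2181 as in the command above; the helper names `four_letter_word` and `parse_mode` are made up for this example, and depending on the ZooKeeper version the four-letter commands may need to be enabled in the server config.

```python
import socket
from typing import Optional


def four_letter_word(host: str, port: int = 2181, cmd: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(stat_output: str) -> Optional[str]:
    """Extract the value of the 'Mode:' line (e.g. leader or follower)."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling `parse_mode(four_letter_word("localhost"))` on each of the three servers would give the same leader/follower answer as the shell one-liner.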
fungiall last started on the 27th17:16
clarkbseems it has happened 3 times today and not at all in the previous log file17:16
fungiso their daemons have been running for a few days17:16
funginothing in dmesg for any of them since saturday either17:17
clarkb"Connection dropped: socket connection broken" appears to be the underlying issue as reported by the zuul side17:17
clarkbI wonder if the servers timed out zuul for non responsiveness?17:17
clarkband the client sees that as the socket connection breaking17:17
fungifirewall rules for them were last touched on the 18th17:18
fungiyeah, could be17:18
clarkbon zk02 (the current leader) I see it expiring a session at Mar 30 16:37:14 in syslog17:20
clarkbtimeout of 40000ms exceeded17:20
clarkbI suspect that is what happened though I'm not sure how to confirm via the logged session id17:20
clarkbcorvus: ^ fyi this may be important performance feedback gathering?17:20
clarkbnote I haven't changed anything with zk yet, only done info gathering (ran stat command)17:21
clarkboh yup a few log lines later I see zuul01 has reconnected17:21
fungiunrelated, i saw a infra-prod-run-cloud-launcher failure where the log on bridge says /home/zuul/src/opendev.org/opendev/system-config/run_cloud_launcher.sh did not exist17:23
clarkbfungi: that script was run by the old cron based system and it was removed. Maybe we didn't properly convert over the ansible side?17:24
fungiyeah, i found https://review.opendev.org/718799 removed it almost a year ago17:24
clarkbor maybe part of the cleanup is written such that if it fails (because it already succeeded in cleaning up) the ansible play fails?17:25
fungid'oh, i'm looking at the wrong logfile17:25
fungi--- end run @ 2020-05-07T15:13:58+00:00 ---17:25
fungiyeah, that was from when we removed the script17:26
fungii was looking at the log from the cronjob not from the continuous deployment17:26
fungiconnection timeout to the openedge api. i guess this has been failing for some time17:27
clarkboh right we need to clean that up17:27
*** lpetrut has joined #opendev17:28
fungii can push it in a bit17:29
*** hashar is now known as hasharDinner17:36
fungidonnyd: should we unconfigure openedge entirely in our systems, or is it likely coming back at some point in the future?17:44
*** ykarel|away has joined #opendev17:46
*** marios|out has quit IRC17:49
clarkbfungi: my understanding was that we should probably unconfigure it, and if we want to add it back that is straightforward to do17:54
fungiokay, i'll work on a complete rip-out for now17:54
*** prometheanfire has quit IRC17:55
clarkbzk connectivity was just lost again17:59
clarkb(I was running a tail looking for that string)18:00
*** gothicserpent has joined #opendev18:00
*** gothicserpent has quit IRC18:01
clarkband on zk02 we see the same sort of logs connection for session foo timed out18:01
clarkbthen a bit later a reconnect from zuul0118:01
clarkbthe timeout is 40000ms which I think means that zuul's zk connection didn't transfer any data for 40 seconds?18:01
clarkbwow zuul is certainly busy it has generated over 100k log lines in about the time since the last disconnect18:02
fungithe zk graphs at the end of https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 show a bunch of drops which i expect are coincident with the timing of the disconnects18:04
clarkbgoing back 200k lines only gets me an extra 2 minutes18:04
clarkbthe bulk of this seems to be collecting job variants and similar config loading logging18:04
clarkbfungi: you can see the zuul event processing time spike around then too18:07
clarkbtypical seems to be in the ms to second range but then when we see restarts it goes to the minutes range18:08
clarkbI wonder if while we are processing events we somehow block the zk connection from ping ponging appropriately18:08
clarkbwhich means if processing an event spikes to > 40s we lose18:08
clarkbthough we don't disconnect every time we spike and we don't spike every time we disconnect so maybe a bad correlation18:09
clarkbI do suspect though that if we work backward from really long event processing time we might find something useful18:10
TheJuliaclarkb: w/r/t the connection lost, I had a feeling... but it might not be bunnies18:12
fungithe job queue and executors accepting graphs do look like what we see when there's a large gate reset or huge change series pushed, but that could also be the result of jobs restarting en masse from a zk disconnect (effect rather than cause)18:15
*** lpetrut has quit IRC18:15
clarkbyes I think that is likely more the symptom than the cause18:16
*** ykarel|away has quit IRC18:16
clarkbat 17:54:40 we report our last event_enqueue_processing time for a few minutes then the next one says it took 1.3 minutes according to grafana18:17
clarkbthe next one arrives at 18:01:2018:17
clarkbwithin that period of time we've lost connectivity to zk because it has timed us out for being non responsive for >40s18:18
clarkbI strongly suspect something is monopolizing the cpu, but I think the rerun jobs is simply because nodepool has helpfully cleaned the old ones up for us18:18
clarkb2021-03-30 18:01:04,090 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 363.146 seconds)18:21
clarkbwhat if it is ^18:21
clarkbthat also doesn't correlate to every restart though18:21
clarkb2021-03-30 15:16:16,061 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 252.64 seconds) is another around when we got disconnected18:21
clarkbthis disconnect doesn't have bad graphs or tenant reconfiguration that takes forever: 2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost18:22
clarkbI am still not fully up to speed around what has changed with zk recently though so I may be looking in the completely wrong location18:23
*** roman_g has joined #opendev18:26
fungiclarkb: internal scheduler events/results queues so far, i think18:26
fungialso semaphores18:27
clarkbwe have had 4 disconnects in the last several hours. three of them have an adjacent openstack tenant reconfiguration that takes 4-6 minutes. There are other long reconfigurations though18:27
fungialso change queues are in zk now, looks like18:28
clarkbmakes me less confident that long reconfiguration is a cause, but it may be another symptom (essentially something is consuming resources and when that happens zk can disconnect and reconfigurations can go long)18:28
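Checking by hand whether each disconnect falls inside a long reconfiguration gets tedious at 100k log lines per minute, so the comparison can be scripted. A rough sketch, assuming the scheduler log lines look exactly like the two kinds quoted above; the function name `correlate` and the 40 second slack (borrowed from the session timeout) are choices made for this example only.

```python
import re
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
RECONFIG = re.compile(
    r"^(\S+ \S+) INFO zuul\.Scheduler: Tenant reconfiguration complete "
    r"for (\S+) \(duration: ([\d.]+) seconds\)")
DISCONNECT = re.compile(
    r"^(\S+ \S+) INFO kazoo\.client: Zookeeper connection lost")


def correlate(lines, slack_seconds=40.0):
    """Map each disconnect time to the reconfigurations running around it."""
    windows = []      # (start, end, tenant) for each reconfiguration
    disconnects = []  # timestamps of "connection lost" lines
    for line in lines:
        m = RECONFIG.match(line)
        if m:
            end = datetime.strptime(m.group(1), TS_FORMAT)
            start = end - timedelta(seconds=float(m.group(3)))
            windows.append((start, end, m.group(2)))
            continue
        m = DISCONNECT.match(line)
        if m:
            disconnects.append(datetime.strptime(m.group(1), TS_FORMAT))
    slack = timedelta(seconds=slack_seconds)
    return {
        d: [w for w in windows if w[0] - slack <= d <= w[1] + slack]
        for d in disconnects
    }
```

Fed the three log lines quoted in the channel, the 16:37:19 disconnect matches no reconfiguration window, which agrees with the observation that not every disconnect has a long reconfiguration next to it.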
fungiso most of the things you would think of as internal scheduler state have moved from in-memory data structures to znodes18:30
clarkbone thing that makes digging in logs difficult is that we generate a ton of kazoo exceptions after this happens because all the nodes have been cleaned up and zuul can't update the nodes in zk18:31
clarkbalso 100k log lines per minute18:31
fungicacti graphs for the zk servers don't suggest any system-level resource exhaustion18:31
fungithe zuul scheduler is showing some heavy periods of read activity on its disks around that time18:33
fungialso we've been seeing a steady rise in used and cache memory on the scheduler, but used memory starts to plateau after we run out of free memory and cache begins to get squeezed (circa 14:00 utc)18:34
corvustypically if we see zk disconnects its due to cpu starvation on the scheduler18:36
corvususually due to swapping18:36
fungicould this be cacti says scheduler cpu utilization is nominal. maybe a little higher than usual but still below 20%18:37
fungiand no real swap utilization18:37
clarkbfungi: it's a many cpu instance and zuul can only use one of them for the scheduler18:37
fungifair point18:37
clarkbcorvus: ya I suspect that is why reconfiguration and event queue processing is also slow when this occurs but not always18:37
fungialso i suppose we could have very brief spikes which don't register in a 5-minute aggregate sample18:37
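A rough illustration of the point about brief spikes and 5-minute aggregates, assuming hypothetical one-second CPU samples (the 10% baseline and 15-second spike are made-up numbers, not measurements from the scheduler):

```python
# Hypothetical data: 300 one-second CPU samples (one 5-minute window).
# Baseline of 10% with a 15-second burst at 100% in the middle.
samples = [10] * 300
for second in range(140, 155):
    samples[second] = 100

# A 5-minute aggregate graph would plot only the average of the window.
window_average = sum(samples) / len(samples)
# The peak is what would actually starve a single-threaded process.
window_peak = max(samples)

print(f"5-minute average: {window_average}%")
print(f"peak within window: {window_peak}%")
```

With these assumed numbers the window still averages under 20% even though one core was fully saturated for 15 seconds.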
clarkbwe're seeing cpu starvation hit a number of things and this is the most prominent as it restarts jobs18:37
fungithere is one zuul-scheduler process consuming most of a cpu according to top18:38
fungiand also a majority of the system memory18:38
corvusi can't seem to get cacti to show me more than 1 day of data18:39
clarkbthat does seem to show memory use has significantly grown in the last day or so18:42
clarkbmaybe we've got a leak that leads to swapping? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64793&rra_id=all18:43
corvusyep, that would be my guess18:43
fungiexcept the swap usage graph is *nearly* empty18:44
corvusi think we should restart the scheduler immediately to address the immediate problem and start debugging the leak.  i can start debugging tomorrow, but am occupied today.18:44
fungibut yes there's a tiny bump up to 233mib around the time of the recent distress18:44
corvusfungi: the swap activity graph isn't though, and that can cause pauses18:44
fungiright, and if it's the wrong 233mib...18:44
fungithat also may explain some of the spikes in disk read activity18:45
clarkbI need to start prepping for the meeting, is someone else able to lead the restart?18:45
fungii can do the scheduler restart now18:45
fungidid we get the enqueuing fix merged?18:45
* fungi checks18:45
fungiahh, no, https://review.opendev.org/783556 hasn't been approved yet18:47
fungicorvus: what's your guess as to the fallout we'll see from trying to reenqueue things... is there something i should edit in the queue dump beforehand?18:47
corvusfungi: in that case just drop the hostname from any enqueue-ref calls18:47
fungiperfect, can do, thanks!18:47
corvus(drop the hostname from the project name)18:47
fungiyep, got it18:48
fungi| sed 's, opendev\.org/,' ought to do the trick18:49
fungier, forgot an additional , there18:49
fungiand i shouldn't eat the blank space ;)18:50
fungi/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's, opendev\.org/, ,' > queues.sh18:50
fungithat seems to do the trick18:50
corvusfungi: i'd just do that for enqueue-ref not the changes18:51
fungioh, can do18:51
*** diablo_rojo has joined #opendev18:52
fungilooks like the scheduler is very busy again, probably another event18:52
fungi/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's18:55
fungi,^\(zuul enqueue-ref .* --project \)opendev\.org/,\1,' > queues.sh18:55
fungistray newline in the paste, but that's what seems to take care of it18:55
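For readers who prefer Python over sed, the same substitution can be sketched as below. This is only an equivalent of the sed expression above, not part of the Zuul tooling; the sample command lines are illustrative and were not taken from the real queue dump.

```python
import re

# Same pattern as the sed expression: only lines starting with
# "zuul enqueue-ref" have the "opendev.org/" prefix removed from --project.
ENQUEUE_REF = re.compile(r"^(zuul enqueue-ref .* --project )opendev\.org/")


def strip_hostname(line: str) -> str:
    """Drop the hostname from the project name on enqueue-ref lines only."""
    return ENQUEUE_REF.sub(r"\1", line, count=1)


# Illustrative input lines (hypothetical, for demonstration only).
ref_line = (
    "zuul enqueue-ref --tenant openstack --pipeline tag "
    "--project opendev.org/openstack/ironic --ref refs/tags/x"
)
change_line = (
    "zuul enqueue --tenant openstack --pipeline check "
    "--project opendev.org/openstack/nova --change 1,1"
)

print(strip_hostname(ref_line))
print(strip_hostname(change_line))  # left untouched, as corvus suggested
```

Lines that enqueue changes rather than refs pass through unchanged, which matches the advice to apply the edit to enqueue-ref only.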
fungilooks like openstack release management had just merged some release requests so there are tag events in the queue. i'm a little hesitant to try bulk reenqueuing those and would rather wait another few minutes for them to hopefully wrap up18:59
fungii let the team know to hold on further release approvals for a bit18:59
* TheJulia sighs19:04
fungiTheJulia: it was for ironic. you can sigh as loudly as you like ;)19:05
TheJuliais there an official guestimate on how far the status display trails behind reality?19:05
TheJuliafungi: joy19:05
fungiTheJulia: top-left corner... "Queue lengths: 399 events, 2 management events, 301 results."19:05
TheJuliaNow to just get a ci fix merged into ironic.19:05
fungithat's basically the internal count of backlogs for those categories of events not yet processed19:06
fungiit's not really a time because zuul can't express much about events it hasn't processed yet19:06
TheJuliaI was meaning more in regards to a running ci job19:06
fungibut that's generally the reason for any perceived lag in status reflecting expected reality19:07
TheJulialike, I've opened the status display in the past and seen things appear completely done running past log uploads and can't view the console... yet the display says the job is still running for like 15 minutes19:07
TheJuliaokay, so same queue then19:07
fungithe results backlog is how many builds have completed that zuul hasn't reflected the completion result for yet19:07
fungievents is how many things like gerrit comments or new patchsets it has received but not yet processed to figure out if something should be enqueued19:08
fungithat sort of thing19:08
TheJuliaThat was what I was figuring but didn't quite grok the meaning of results19:09
TheJuliaat least in that context19:09
fungiand management events are generally reconfiguration19:11
fungiso now that the counts are ~0 again, the status info should be reasonably current19:11
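A small sketch of reading those three counters out of the status line quoted above. The parsing is an assumption based only on the text fungi pasted ("Queue lengths: 399 events, 2 management events, 301 results."); it is not how the status page itself works, and the helper names are hypothetical.

```python
import re

# Pattern built from the status text quoted in the chat.
QUEUE_LENGTHS = re.compile(
    r"Queue lengths: (\d+) events, (\d+) management events, (\d+) results"
)


def parse_queue_lengths(text: str) -> dict:
    """Return the three backlog counters as integers."""
    match = QUEUE_LENGTHS.search(text)
    if match is None:
        raise ValueError("no queue length summary found")
    events, management, results = (int(n) for n in match.groups())
    return {
        "events": events,                 # e.g. gerrit comments, new patchsets
        "management_events": management,  # generally reconfigurations
        "results": results,               # completed builds not yet reflected
    }


def status_is_current(counts: dict, threshold: int = 5) -> bool:
    """Hypothetical rule of thumb: near-zero backlogs mean little status lag."""
    return all(value <= threshold for value in counts.values())


counts = parse_queue_lengths(
    "Queue lengths: 399 events, 2 management events, 301 results."
)
print(counts, status_is_current(counts))
```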
fungiokay, zuul estimates we're 5 minutes out from being done with the last releases site update, so i'll plan to restart the scheduler and reenqueue everything as soon as that finishes19:49
fungiokay, all the reasonably critical openstack release jobs have wrapped up, i'll check the other tenants quickly19:59
ianwclarkb: likely to get caught in a restart, but is https://review.opendev.org/c/opendev/system-config/+/783778 to add a key for review02 to r/o copy gerrit data OK?  not sure if it was done like that in the past20:00
fungiseems we're all clear. grabbing a corrected queue dump per earlier bug discussion, then restarting20:00
ianwthe other one that is ready is https://review.opendev.org/c/opendev/system-config/+/775961 to add a mariadb container; upstream has merged the required fixes to stable now20:00
clarkbianw: I think review-test did similar but in the opposite direction20:01
fungii'll reenqueue once the cat jobs are finished20:02
clarkbI would cross check against what review-test did, though I think we may have used forwarded keys temporarily (with agent confirmation)20:02
clarkbfungi: ^ when you are done restarting things maybe you want to look at that too20:02
ianwclarkb: hrm ok.  i liked the idea the wrapper could do read-only20:03
ianwmostly so i don't type the wrong thing :)20:03
clarkboh is that what rrsync is? I was just about to go and read up on it :)20:04
*** Dmitrii-Sh has joined #opendev20:04
*** prometheanfire has joined #opendev20:04
ianwi'm not going to claim great knowledge, but i just googled "read only rsync" and this seems to be the solution20:04
clarkbya seems like it causes rsync run using that key to be restricted to read only access with a "chroot" path as set20:07
fungistarting to reenqueue everything now, 288 items20:10
clarkbianw: any idea why rsync doesn't just install that script?20:10
clarkbianw: otherwise ya I think this looks ok20:10
fungilooks like i may need to restart zuul-web too20:10
ianwclarkb: not really, seems it ships it in some form standard.  maybe it's a bit obscure for /usr/bin i guess20:10
clarkbfungi: ya that is normal iirc20:11
fungidone, that seems to have cleared the "proxy error"20:11
clarkbianw: I've +2'd it but I think it would be good to have fungi look it over too20:12
fungijohnsom: check again20:12
fungiseems we often need to restart the zuul-web daemon any time we restart the scheduler20:12
johnsomYeah, loading now.20:12
*** hasharDinner has quit IRC20:20
*** slaweq has quit IRC20:26
fungi#status log Restarted the Zuul scheduler to address problematic memory pressure, and reenqueued all in flight changes20:32
openstackstatusfungi: finished logging20:32
clarkb2021-03-30 18:52:06,525 INFO kazoo.client: Zookeeper connection lost <- last logged connection lost if we need to compare notes later20:32
clarkbmemory use looks much better now though so I suspect we've got a while before it happens again20:33
*** roman_g has quit IRC20:52
*** artom has quit IRC21:00
clarkbI've realized I tested the always_update path in the gitea job which means we should already have data for the gitea memory usage with dstat21:00
*** artom has joined #opendev21:00
clarkbI'll also push up a change that doesn't use tokens so we can compare between those as well21:03
clarkbactually we may already have that in https://review.opendev.org/c/opendev/system-config/+/78177621:04
clarkbya https://zuul.opendev.org/t/openstack/build/143900b189f84e1296d66d60159c1c87/log/gitea99.opendev.org/dstat-csv.log is 1.13.6 + passwd auth, https://zuul.opendev.org/t/openstack/build/08ebc2b8c7344473bbc6f4790b26b416/log/gitea99.opendev.org/dstat-csv.log is 1.13.1 + token auth and soon enough we should have 1.13.6 + tokens as well21:05
*** artom has quit IRC21:10
*** artom has joined #opendev21:11
ianwmemory usage looks sane with 143900b189f84e1296d66d60159c1c8721:39
*** whoami-rajat has quit IRC21:51
fungiback for a bit. had some impromptu guests so sitting out on the deck and the sun is making it very hard to see the screen, even at maximum backlight21:52
*** artom has quit IRC21:53
*** artom has joined #opendev21:55
*** artom has quit IRC21:56
*** artom has joined #opendev21:59
ianwhrm, weirdly i think the id_rsa.pub we install for gerrit2 isn't valid22:27
ianwwe also seem to have lost gerritbot22:34
*** brinzhang_ has quit IRC22:35
*** brinzhang_ has joined #opendev22:36
*** dpawlik has quit IRC22:37
*** osmanlicilegi has quit IRC22:42
*** otherwiseguy has quit IRC22:42
*** amoralej|off has quit IRC22:42
*** Jeffrey4l has quit IRC22:42
*** openstackstatus has quit IRC22:42
*** dpawlik0 has joined #opendev22:42
*** janders1 has joined #opendev22:42
*** openstack has joined #opendev22:43
*** ChanServ sets mode: +o openstack22:43
ianwit would be better to have ansible managing these keys than have them as secrets22:44
ianwi guess it means we'd need to reference the review server during gitea runs.  that's probably a pita22:45
*** sboyron has quit IRC23:02
clarkbianw: we could maybe put them as individual secrets in a single location like a shared group var file23:04
clarkbianw: re gitea looking at the dstat results 1.13.6 + passwd seems to be about ~900MB peak memory usage and 1.13.6 + token is about 760MB23:12
clarkbianw: do you know what it was with password on 1.13.1?23:13
*** tosky has quit IRC23:14
ianwclarkb: i think https://imgur.com/a/YrlDNcd would have been 1.13.123:16
ianw Gitea v#033[1mv1.13.1# ... https://zuul.opendev.org/t/openstack/build/cfcd32fa1b27407ab61f5b44be83f6fc/ ... that is a passwd one that ran OOM23:18
clarkbwow that is quite a difference23:19
ianwyeah it goes bananas23:19
fungidid my gerritbot fix ever merge?23:19
clarkbfungi: I thought you said it did the same thing again?23:20
clarkbno wait I'm mixing jeepyb and gerritbot and gerritlib23:20
fungino, that was jeepyb23:20
* fungi checks open changes23:20
fungiokay 781920 merged 10 days ago23:21
* fungi checks to see if 3cefaa8 is what we have installed23:22
fungioh, it'll be a container23:22
fungiquick batman, to dockerhub23:22
clarkbianw: that makes me think that token auth is less urgent, though it may help a bit anyway23:22
fungihttps://hub.docker.com/r/opendevorg/gerritbot/tags?page=1&ordering=last_updated says updated 11 days ago so roughly right23:23
fungiin theory 781919 should have been sufficient to fix the message-too-long disconnects at least but both merged at about the same time23:25
fungiprevious changes to merge were in january anyway23:25
fungiso that's gotta be a sufficient container image23:25
fungiimage digest of 21def9f40d85 according to dockerhub23:26
clarkbthe current gerritbot has been running for 48 minutes23:26
clarkbnot sure how long the prior instance had been running23:26
fungi2021-03-30 11:21:38     <--     openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (Ping timeout: 260 seconds)23:27
fungithat looks like a different behavior anyway23:27
funginormally it would have been a quit if it was the known bug23:27
clarkbchanging servers or something is what it said before with the issue you fixed iirc23:28
fungi11:21:38-260 would be 11:17:1823:30
clarkbfungi: have we tracked down the jeepyb gerritlib thing yet? I should probably page that in and give it a proper look if not23:30
clarkbI wonder if it is related to depends on post merge23:31
ianwclarkb: yeah, if production has switched itself (or, because it was started before they switched?) away from argon2 then probably ok23:31
clarkbianw: ya the db reports pbkdf223:31
fungilog says it announced 778572 in #openstack-charms 7 seconds before that at 11:17:1123:31
funginothing else jumping out at me23:31
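The timestamp arithmetic above can be double-checked with a few lines of Python. The times come from the quit message and the log line quoted in the chat; the function names are just for illustration.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"


def last_heard(quit_time: str, timeout_seconds: int) -> str:
    """Subtract the ping timeout from the quit time."""
    quit_at = datetime.strptime(quit_time, TIME_FORMAT)
    return (quit_at - timedelta(seconds=timeout_seconds)).strftime(TIME_FORMAT)


def seconds_between(earlier: str, later: str) -> int:
    """Whole seconds from one same-day timestamp to another."""
    delta = datetime.strptime(later, TIME_FORMAT) - datetime.strptime(
        earlier, TIME_FORMAT
    )
    return int(delta.total_seconds())


silent_since = last_heard("11:21:38", 260)
print(silent_since)                               # 11:17:18
print(seconds_between("11:17:11", silent_since))  # 7
```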
fungipbkdf2 would be a much more standard kdf, yeah23:32
fungiclarkb: i haven't dug deeper into the jeepyb integration test with stale gerritlib ref, no23:32
ianwclarkb: it's probably always been that right?  i'm not sure when they switched *to* argon2.  that would hit us in the gate, as we start fresh, but not production23:33
ianwit switched in 1.13.023:34
clarkbfungi: just as a sanity check cloning gerritlib and pip installing from source gives me Created wheel for gerritlib: filename=gerritlib-0.10.1.dev1-py3-none-any.whl23:35
clarkbwhich is definitely not the 0.10.0 we see in the job23:35
fungihttps://review.opendev.org/782538 is the one with the weird stale gerritlib ref23:35
clarkbianw: ya I'm not sure if they ported pbkdf2 to argon2 then back again or if we just always were pbkdf223:35
fungijust a sec, need to move myself to a different room23:35
clarkbCreated wheel for gerritlib: filename=gerritlib-0.10.0-py3-none-any.whl is what the job does23:36
clarkbwhich does strongly imply the job is seeing the 0.10.0 ref not the new one23:36
ianwclarkb: yeah, there's no db updates included in the swap afaics.  i'd say we've always been on pbkdf2 in production23:37
clarkbfungi `ze08:/var/log/zuul$ grep ed1e29cb099a4c139b257bb4f57f5c30  executor-debug.log.1` that is what I'm looking at now23:38
clarkbianw: ok, so we may still have trouble with the description updates, should probably leave them off in that case23:38
clarkb2021-03-29 22:18:15,805 INFO zuul.Merger: [e: ed1e29cb099a4c139b257bb4f57f5c30] [build: f10d0a0281c043678acf4393f2629810] Skipping updating local repository gerrit/opendev/gerritlib23:40
clarkbI wonder if that is the clue we need23:40
fungii wonder why it skipped that23:41
ianwclarkb: yeah; i guess my immediate issue was the gitea jobs failing gate constantly23:41
clarkbfungi: the isUpdateNeeded() method in zuul merger seems to check if each of the refs exists and that each of the revs exist23:43
fungiyep, was just tracing that23:43
clarkbfungi: I wonder if this is a case where maybe if a change is fast forwardable we don't necessarily update the branch ref to point to a rev because the branch is already there as is the rev?23:43
clarkbnow to see what the gerritlib change looks like23:43
clarkbya no merge commit23:44
clarkbso maybe that is what is going on here?23:44
*** gothicserpent has joined #opendev23:44
fungithat would be really odd23:45
clarkbthe code is given a list of (ref, rev) tuples but then we check each component separately and not that one points to the other if I am reading this correctly23:45
clarkbdoes the ref exist? yes ok good. Does the rev exist? yes ok good. We should be checking does ref point at rev? (I think)23:46
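To make the suspected gap concrete, here is a toy model of the two checks. This is not Zuul's isUpdateNeeded() code; the repository is modeled as a plain dict of ref names to commit ids plus a set of known commit ids, and all names and values are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

RefRev = Tuple[str, str]


def update_needed_separate(
    refs: Dict[str, str], objects: Set[str], wanted: Iterable[RefRev]
) -> bool:
    """The behavior described in the chat: ref and rev are checked separately."""
    for ref, rev in wanted:
        if ref not in refs:
            return True
        if rev not in objects:
            return True
    return False


def update_needed_pointing(refs: Dict[str, str], wanted: Iterable[RefRev]) -> bool:
    """The proposed check: the ref must actually point at the rev."""
    return any(refs.get(ref) != rev for ref, rev in wanted)


# Hypothetical stale state: the branch exists and the new commit object is
# already present, but the branch still points at the old commit.
local_refs = {"refs/heads/master": "old_sha"}
local_objects = {"old_sha", "new_sha"}
wanted_state = [("refs/heads/master", "new_sha")]

print(update_needed_separate(local_refs, local_objects, wanted_state))  # False
print(update_needed_pointing(local_refs, wanted_state))                 # True
```

In this model the separate checks skip the update while the branch is stale, which is the kind of outcome the "Skipping updating local repository" log line would be consistent with if the theory holds.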
fungialso nothing about isUpdateNeeded() has changed in two years23:46
fungiaccording to git blame23:46
fungiso if that's really the bug it's been lying latent for a very long time and i would expect us to have hit it frequently23:47
clarkbfwiw git show master shows https://opendev.org/opendev/gerritlib/commit/99136e1e164baa7b1d9dac4f64c5fb511b813c19 and git show 874d4776eee5ae7c8c15debbb9e943110be299dd shows your commit23:47
clarkbI agree I would've expected us to hit this more if it's been there a long time. Maybe the calling code changed?23:48
clarkbmaybe we always did an unconditional update until recently for things that had branch moves?23:49
* clarkb goes back to zuul history23:49
clarkbheh I was the last one to change that function23:50
fungisee, now i get to blame you ;)23:50
clarkbI agree I'm not seeing any recent changes that scream "this is related". 75d5dc2289d4b85b5e2d721d3fdbafdf5779e02b whispers it23:51
clarkbI would say based on the logging and the actual repo state that we are not updating it as expected though23:52
clarkband I strongly suspect isUpdateNeeded() needs to check that ref points at rev23:52
fungithis seems like something we could unit test in zuul23:53
clarkbhrm getRepoState did change to do the buildset consistent revs23:53
clarkbthat produces the input to isUpdateNeeded I think, maybe that is related?23:54
clarkbfungi: that is a good point23:54
*** openstackgerrit has joined #opendev23:57
openstackgerritMerged opendev/system-config master: gerrit: remove mysql-client-core-5.7 package  https://review.opendev.org/c/opendev/system-config/+/78376923:57
openstackgerritMerged opendev/system-config master: launch-node : cap to 8gb swap  https://review.opendev.org/c/opendev/system-config/+/78289823:57
openstackgerritMerged opendev/system-config master: dstat-logger: redirect stdout to /dev/null  https://review.opendev.org/c/opendev/system-config/+/78286823:57