*** mlavalle has quit IRC | 00:02 | |
*** tosky has quit IRC | 00:10 | |
clarkb | in ~7 minutes the gitea update should start | 00:13 |
fungi | i'm still sorta on hand | 00:28 |
clarkb | gitea01 seems to have updated, now to check on it | 00:30 |
clarkb | https://gitea01.opendev.org:3000/openstack/nova seems to work for me | 00:30 |
clarkb | version is reported as being updated | 00:31 |
ianw | clarkb:lgtm | 00:32 |
clarkb | 02 and 03 also look good now | 00:32 |
ianw | clarkb: i was thinking for the token thing too, it probably helps in production because the new versions have switched the default hashing algorithm to something less intensive, but our current passwords are hashed using the old method? | 00:33 |
ianw | perhaps we should redo the auth? | 00:33 |
fungi | or just go with your token work | 00:34 |
clarkb | ya I'm not sure yet if they'll re-hash it again on first request or not | 00:34 |
clarkb | we should be able to find out by monitoring a new project addition after this upgrade completes? | 00:34 |
clarkb | we'll be able to compare resource usage at least | 00:34 |
clarkb | 04 and 05 are now done and look good | 00:35 |
fungi | didn't we turn off the mass description updating though? | 00:35 |
clarkb | fungi: good point we did reduce the overhead externally as well | 00:36 |
clarkb | probably can check the db directly? | 00:36 |
clarkb | and see what sort of hash is in the records? | 00:36 |
fungi | if they use a typical identifier, yeah | 00:37 |
clarkb | 06-08 lgtm now too | 00:41 |
clarkb | the user table has a passwd_hash_algo field | 00:43 |
clarkb | now to find that issue so I can determine what we want and what we don't want | 00:44 |
*** kevinz has joined #opendev | 00:49 | |
openstackgerrit | Merged opendev/system-config master: Add review02.opendev.org https://review.opendev.org/c/opendev/system-config/+/783183 | 00:49 |
clarkb | https://github.com/go-gitea/gitea/issues/14294 if that was the issue then we are using pbkdf2 now according to the db | 00:49 |
clarkb | which is good according to that issue | 00:49 |
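For reference, the "check the db directly" step described above could look like this quick sketch; it assumes the MySQL/MariaDB backend and the pymysql library, and the connection parameters are placeholders, not the production credentials:

```python
# Sketch: count gitea users per password hash algorithm using the
# passwd_hash_algo column mentioned above. Connection details are placeholders.
import pymysql

conn = pymysql.connect(host='localhost', user='gitea',
                       password='REDACTED', database='gitea')
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT passwd_hash_algo, COUNT(*) FROM `user` "
            "GROUP BY passwd_hash_algo")
        for algo, count in cur.fetchall():
            print(algo, count)
finally:
    conn.close()
```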
clarkb | I've got to help sort out dinner now, but gitea looks happy and I think it is using the hash that is less memory intensive | 00:50 |
ianw | cool, will keep an eye | 00:51 |
clarkb | thanks! | 00:51 |
fungi | yeah, seems fine so far | 00:55 |
fungi | i hadn't noticed this before (new?) but if you look at the very bottom of the page where the gitea version is listed, it also tells you how long it took to load that page | 00:57 |
fungi | for example, https://opendev.org/openstack/nova/src/branch/master/releasenotes gave me "Page: 39332ms Template: 102ms" | 00:59 |
fungi | so, yeah, 40 seconds there | 00:59 |
fungi | but it's a fairly pathological example | 00:59 |
clarkb | fungi: the first time you request an asset it isn't cached and can take a long time, particularly for large repos like nova (because gitea is happy to inspect history for objects to tell you when they were last modified) | 01:05 |
clarkb | but then it caches it, if you refresh it should be much quicker | 01:05 |
clarkb | ok really going to do dinner now. I'm being yelled at | 01:06 |
fungi | yep "Page: 3041ms Template: 122ms" | 01:15 |
fungi | that much i knew, just wasn't aware it displayed those stats | 01:15 |
fungi | or else i was aware and then forgot | 01:16 |
*** hamalq has quit IRC | 01:25 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit: remove mysql-client-core-5.7 package https://review.opendev.org/c/opendev/system-config/+/783769 | 02:10 |
*** prometheanfire has quit IRC | 02:11 | |
*** prometheanfire has joined #opendev | 02:12 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: review01.openstack.org: add key for gerrit data copying https://review.opendev.org/c/opendev/system-config/+/783778 | 02:45 |
ianw | infra-root: ^ that installs a key from review02 -> review01 that can r/o rsync data. i think that will be generally useful as we go through this process to sync | 02:45 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 03:35 |
ianw | finally navigated getting the mariadb fixes into the stable branch, so ^ doesn't require any patches any more | 03:36 |
*** whoami-rajat_ has joined #opendev | 04:21 | |
*** ricolin has quit IRC | 04:47 | |
*** whoami-rajat_ is now known as whoami-rajat | 04:56 | |
*** marios has joined #opendev | 05:01 | |
*** ykarel|away has joined #opendev | 05:03 | |
*** ykarel|away is now known as ykarel | 05:07 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 05:36 |
*** ykarel_ has joined #opendev | 05:49 | |
*** ykarel has quit IRC | 05:49 | |
*** cloudnull has quit IRC | 05:54 | |
*** ysandeep|away is now known as ysandeep | 05:59 | |
*** lpetrut has joined #opendev | 06:12 | |
*** slaweq has joined #opendev | 06:16 | |
*** eolivare has joined #opendev | 06:17 | |
*** ralonsoh has joined #opendev | 06:17 | |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: [doc] Update supported distros https://review.opendev.org/c/openstack/diskimage-builder/+/783788 | 06:27 |
*** ykarel__ has joined #opendev | 06:29 | |
*** ykarel_ has quit IRC | 06:31 | |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job https://review.opendev.org/c/openstack/diskimage-builder/+/783790 | 06:32 |
openstackgerrit | Slawek Kaplonski proposed openstack/project-config master: Add noop-jobs for networking-midonet projects https://review.opendev.org/c/openstack/project-config/+/783792 | 06:40 |
*** cloudnull has joined #opendev | 06:42 | |
openstackgerrit | Slawek Kaplonski proposed openstack/project-config master: Readd publish-to-pypi for neutron-fwaas and dashboard https://review.opendev.org/c/openstack/project-config/+/783796 | 06:45 |
*** sboyron has joined #opendev | 06:54 | |
*** hashar has joined #opendev | 06:57 | |
*** frigo has joined #opendev | 07:10 | |
*** rpittau|afk is now known as rpittau | 07:29 | |
*** hashar_ has joined #opendev | 07:43 | |
*** amorin_ has joined #opendev | 07:46 | |
*** hashar has quit IRC | 07:46 | |
*** amorin has quit IRC | 07:47 | |
*** hashar_ is now known as hashar | 07:51 | |
*** ykarel__ is now known as ykarel | 07:59 | |
*** tosky has joined #opendev | 08:11 | |
*** bodgix has quit IRC | 08:14 | |
*** smcginnis has quit IRC | 08:14 | |
*** bodgix has joined #opendev | 08:14 | |
*** arxcruz has quit IRC | 08:14 | |
*** smcginnis has joined #opendev | 08:14 | |
*** frigo has quit IRC | 08:16 | |
*** arxcruz has joined #opendev | 08:18 | |
*** DSpider has joined #opendev | 08:44 | |
*** dtantsur|afk is now known as dtantsur | 08:48 | |
*** dtantsur is now known as dtantsur|brb | 08:55 | |
*** hashar has quit IRC | 09:20 | |
*** Guest55766 has quit IRC | 09:30 | |
*** ykarel is now known as ykarel|lunch | 09:45 | |
*** dtantsur|brb is now known as dtantsur | 09:59 | |
*** dirk1 has joined #opendev | 10:34 | |
*** dirk1 is now known as dirk | 10:39 | |
*** ykarel|lunch is now known as ykarel | 10:52 | |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: Create SSL context using PROTOCOL_TLS, fallback to highest supported version https://review.opendev.org/c/opendev/gear/+/741288 | 11:03 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: Update testing to Python 3.9 and linters https://review.opendev.org/c/opendev/gear/+/780103 | 11:03 |
*** DSpider has quit IRC | 11:18 | |
*** openstackgerrit has quit IRC | 11:21 | |
*** hashar has joined #opendev | 11:39 | |
*** ysandeep is now known as ysandeep|afk | 11:50 | |
*** redrobot9 has joined #opendev | 12:25 | |
*** artom has quit IRC | 12:26 | |
*** artom has joined #opendev | 12:26 | |
*** redrobot has quit IRC | 12:28 | |
*** redrobot9 is now known as redrobot | 12:28 | |
*** artom has quit IRC | 12:36 | |
*** ysandeep|afk is now known as ysandeep | 12:45 | |
*** dpawlik6 is now known as dpawlik | 12:58 | |
*** hashar has quit IRC | 13:24 | |
*** spotz has joined #opendev | 13:30 | |
*** artom has joined #opendev | 13:33 | |
*** ykarel is now known as ykarel|away | 13:50 | |
*** ykarel|away has quit IRC | 13:54 | |
*** mlavalle has joined #opendev | 13:58 | |
*** ralonsoh has quit IRC | 14:18 | |
*** ralonsoh has joined #opendev | 14:19 | |
*** lpetrut has quit IRC | 14:30 | |
*** tosky has quit IRC | 14:39 | |
*** tosky has joined #opendev | 14:39 | |
*** ysandeep is now known as ysandeep|away | 14:56 | |
zbr|rover | A page like https://review.opendev.org/admin/repos/openstack/hacking,access was supposed to list the groups that have review rights but now is ~empty | 15:23 |
zbr|rover | Is that a desired change caused by some security concerns or just a glitch? | 15:23 |
clarkb | I think it is related to the fixes for the bug we discovered when testing upgrades | 15:24 |
zbr|rover | imho, it should be possible for a logged in user to discover which other users or groups have access to a repo | 15:24 |
fungi | gerrit doesn't display permissions to you if you don't have them | 15:24 |
clarkb | gerrit significantly trimmed down who can access metadata | 15:24 |
fungi | basically the acl view shows you what permissions apply to your account | 15:25 |
zbr|rover | and is not configurable in default settings? | 15:25 |
fungi | nope | 15:25 |
clarkb | all of that info is available in project-config though | 15:25 |
fungi | this makes our acls in the project-config repo even more important, yes | 15:25 |
fungi | acl copies in the repo, i mean | 15:26 |
zbr|rover | that is bad for the user experience, imagine a random user trying to propose a patch to a project. Assume that is his first experience contributing something to opendev gerrit. | 15:26 |
zbr|rover | he passed CI and now he wants to get the attention of someone who can help review his patch. | 15:26 |
fungi | zbr|rover: are you asking us to convey your concerns to the gerrit maintainers, or replace gerrit? i can't tell | 15:27 |
zbr|rover | i am wondering if we can do something to improve the discoverability of a gerrit project maintainers (cores) | 15:27 |
fungi | projects can publish a link to the group view for their core review team, sure | 15:28 |
fungi | gerrit even supports convenience urls where you can specify the group name instead of its id | 15:28 |
zbr|rover | so basically the only option we currently have is to expect the repo owners to mention that link in their docs. as this is a problem especially on projects with lower maintenance, it will never be addressed by the most vulnerable projects (active ones would likely be able to document this) | 15:31 |
fungi | https://review.opendev.org/admin/groups/project-config-core,members | 15:31 |
zbr|rover | i guess the practical answer is to look at previous reviews and see who performed them, "punishing" those few that do perform reviews :D | 15:32 |
fungi | i don't know that it's our *only* option, but if you have ideas we can evaluate them | 15:32 |
zbr|rover | sadly no ideas, only questions for now | 15:32 |
fungi | okay, i'm done talking to you for now, you're suggesting that we intentionally punish our users | 15:32 |
clarkb | providing feedback like this to gerrit is also helpful. Even if they don't take action on it at least we're communicating to the people most likely to take action | 15:32 |
*** hashar has joined #opendev | 15:32 | |
*** rpittau is now known as rpittau|afk | 15:33 | |
clarkb | unfortunately I suspect this is directly related to the security issues that were recently identified and fixed. | 15:33 |
clarkb | which may make changing this tricky and people will probably avoid it | 15:33 |
zbr|rover | what i was trying to say is that people who do reviews are visible in gerrit history, and more likely to be added as reviewers by other users. Those who do not perform reviews are unlikely to be picked because nobody knows them. | 15:33 |
fungi | but please don't suggest to the gerrit maintainers that they choose to be user-hostile and intentionally break the usability of their software. i'm really afraid they will think you're representing our community with your abusive comments | 15:34 |
clarkb | fwiw if you add me as a reviewer the email goes into a folder in my mail client that I don't really watch. It gets far too many emails every day for me to keep up. | 15:35 |
fungi | there may be gerrit plugins targeted at what you're wanting, or it's possible a gerrit plugin could be developed to do it | 15:35 |
clarkb | I think gerrit could definitely use better tooling around helping people see what they should review. Adding people as reviewers doesn't seem to be it | 15:35 |
fungi | we can evaluate adding new plugins if they're stable and reasonably unobtrusive (we're in the process of adding the reviewers plugin currently) | 15:36 |
clarkb | (I suspect something with hash tagging may be the ticket) | 15:36 |
*** tosky has quit IRC | 15:36 | |
clarkb | experienced devs consistently hash tag, reviewers can look for those specific tags to review those changes and also look for absence of tags to find new contributors and help them out | 15:36 |
* zbr|rover wonder what gave the impression that his comments are abusive | 15:37 | |
fungi | you suggested we've made decisions to punish reviewr | 15:38 |
fungi | reviewrrs | 15:38 |
fungi | my keyboard needs new fingers | 15:38 |
JayF | You may be missing context that most core reviewers are inundated with unsolicited emails and DMs from people desiring reviews who did not participate at all with the community in the reporting or design phase of whatever work they are doing. | 15:39 |
fungi | the reviewers plugin probably doesn't address your concern, which seems to be that under-supported repositories don't have an easy way for contributors to find reviewers who don't exist | 15:39 |
fungi | but once https://review.opendev.org/724914 merges we can do some trials with projects to see if it helps those who want a more structured way to auto-associate reviewers with reviews: https://gerrit.googlesource.com/plugins/reviewers/+/refs/heads/master/src/main/resources/Documentation/ | 15:40 |
clarkb | JayF: I'm not sure that making it easier to look up the entire core reviewer list makes that better? seems like it would make the shotgun approach easier? | 15:42 |
JayF | That's what I was saying. The suggestion to make the core reviewer list more discoverable could be construed as abusive to already-harried core reviewers. | 15:42 |
clarkb | ah got it, you were addressing it at zbr | 15:43 |
JayF | yes | 15:43 |
JayF | Honestly, 1:1, out of band review requests without asking in a public IRC channel or participating in storyboard/launchpad/etc is the #1 best way to not get your code reviewed. | 15:43 |
clarkb | infra-root I've started trying to dig up zk docs on the proper way to do rolling replacement of zk servers. Haven't had great luck finding official docs but have found a few independent articles and I think I may need to dig into this a bit more before booting servers and adding them. In particular it seems we need to be very careful with the myid values for the new servers. They should not | 15:48 |
clarkb | overlap with old servers (I think this means we want zk04.opendev.org-zk06.opendev.org? need to confirm). Also there appears to be some coordination needed to trigger dynamic reconfiguration of the existing cluster members after adding a new member to the configs. Or we have to restart everything. | 15:48 |
clarkb | my current understanding of what the process looks like is start 04-06. Identify the current leader L and followers F1 and F2. Stop zk on F2 and replace with one of 04-06. Trigger reconfig or restart things and ensure we have quorum and a leader. Repeat. | 15:49 |
fungi | zbr|rover: there's also https://gerrit.googlesource.com/plugins/reviewers-by-blame/+/refs/heads/master/src/main/resources/Documentation/about.md but i get the impression that doesn't distinguish reviewers with approval rights, and is based more on who contributed changes touching certain lines than who reviewed similar changes in the past | 15:50 |
clarkb | once F1 and F2 are replaced we should be left with a leader. I think we stop zk on that and ensure the other nodes elect a new leader L out of the pool of replaced servers. Then old L is now F3 and can be replaced too | 15:50 |
clarkb | there is also a thing where the ordinality of the myid value also affects behavior. A low id won't join a cluster with higher ids? something like that. I don't think it affects us since we'll have all the new ids higher than the old ids | 15:52 |
*** dtantsur is now known as dtantsur|afk | 16:02 | |
corvus | clarkb: or you could copy the data over manually | 16:05 |
*** iurygregory_ has joined #opendev | 16:06 | |
*** iurygregory has quit IRC | 16:06 | |
clarkb | corvus: if we do that it would look something like stop 03.openstack.org, copy 03.openstack.org data to 03.opendev.org, replace 03.openstack.org's IP address with 03.opendev.org's IP address in configs. Trigger reconfiguration or rolling restarts? | 16:15 |
*** hamalq has joined #opendev | 16:15 | |
*** iurygregory_ is now known as iurygregory | 16:15 | |
*** marios is now known as marios|out | 16:17 | |
corvus | clarkb: i think rolling restarts. if you do that (actually this applies to any process), i'd probably do 2/3 of them before restarting the scheduler, and make sure the last one is already in the config on disk before restarting the scheduler. that way there's only one scheduler restart and it restarts into the final config. scheduler should only need to be able to reach one server. | 16:18 |
*** hamalq_ has joined #opendev | 16:19 | |
clarkb | oh hrm I hadn't even considered that the clients may need to be restarted to see the new config, but that makes sense | 16:20 |
*** hamalq has quit IRC | 16:20 | |
corvus | yeah i think we have ip addrs in their config too | 16:20 |
clarkb | and ya we want to restart frequently on the cluster side to ensure we are maintaining quorum. I agree the client should be fine as long as one of the quorum members remains in its config | 16:20 |
clarkb | ok, I think my next step is to gather concrete info on what the current cluster looks like, put it in an etherpad and write down some options with the real info | 16:23 |
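As a starting point for that info gathering, a small sketch along these lines would ask each cluster member for its `stat` output (the four-letter-word equivalent of `echo stat | nc <host> 2181`) and report whether it is the leader or a follower; the hostnames below are placeholders for the actual cluster members:

```python
# Sketch: query each ZooKeeper server's "stat" output and print its Mode line.
import socket

def zk_stat(host, port=2181, timeout=5):
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b'stat')
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode()

for host in ('zk01.openstack.org', 'zk02.openstack.org', 'zk03.openstack.org'):
    output = zk_stat(host)
    mode = next((line for line in output.splitlines()
                 if line.startswith('Mode:')), 'Mode: unknown')
    print(f'{host}: {mode}')
```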
zbr|rover | fungi: clarkb: please excuse me if I was not clear regarding my questions about reviewing, I did not want to complain about something done or not done by infra team about our gerrit, I only wanted to find-out if there is something I can help with in order to make that part of the review process easier for other gerrit users. | 16:33 |
*** eolivare has quit IRC | 16:33 | |
clarkb | zbr|rover: I think the first thing to do is communicate the concern upstream. Indicate it can be difficult for new contributors in particular to find out who to work with to get changes reviewed and that maybe gerrit can help with this. It is possible that gerrit already has plugins or other tools that we aren't aware of as well they can point us to | 16:34 |
fungi | on a related note, https://review.opendev.org/724914 is now passing if we're ready for it | 16:37 |
fungi | the next phase will be adding support for etc/reviewers.config files in manage-projects | 16:39 |
zbr|rover | yep, i commented there, i am offering to help on experimenting with it. | 16:39 |
zbr|rover | we can use a side-project for that and wee exactly how it works. | 16:40 |
*** gothicserpent has quit IRC | 16:41 | |
zbr|rover | i like that fact that it can do ignoreWip, and we should see if suggestOnly proves to be more useful or not. | 16:41 |
zbr|rover | if suggestOnly would really work fine to bump those with rights on the top it could be an alternative. | 16:41 |
zbr|rover | what I found curious is that I did not see any options regarding how many reviewers to auto-assign or an option to exclude specific groups | 16:42 |
zbr|rover | while infra-core has permissions i doubt members of this group want to endup being auto-picked by the plugin just because they happen to be the fallback. | 16:43 |
*** gothicserpent has joined #opendev | 16:44 | |
fungi | yeah, suggestOnly unfortunately looks like it would be high-maintenance | 16:44 |
fungi | because it doesn't support suggesting groups, only individuals, which means maintaining a separate list of individuals, and for lots of teams that list would quickly fall stale | 16:45 |
mordred | does the per-project config go in the primary repo of the project? or is it a thing that goes into a refs/ location somewhere | 16:45 |
*** gothicserpent has quit IRC | 16:45 | |
mordred | (yeah - it would make way more sense if it could be tied to a group) | 16:46 |
fungi | i expect the main way projects might use it would be to have it auto-add specific groups as requested reviewers on changes matching particular branches or file subpaths (e.g. nova could automatically add a vmware-reviewers group to changes touching the vmware hypervisor backend files) | 16:46 |
mordred | or like - tied to a group but with specific overrides possible - I could see some projects saying "make sure to get files in sub/dir reviewed by $human_a and $human_b" | 16:46 |
mordred | yeah - what you said | 16:47 |
fungi | for many cases i expect teams would still prefer to rely on custom dashboards to so reviewers can voluntarily find what they want to review rather than having themselves added to reviews automatically, but different teams have different reviewing habits | 16:47 |
fungi | what triggered the addition of this plugin in the first place was that some of the teams working on starlingx wanted to see if it could help them improve how they're reviewing changes for their projects | 16:48 |
zbr|rover | the way CODEOWNERS works on github is that it picks a random one or two (based on a min-reviews config rule) and assigns them. I kind of find it working fine so far, but i have used it in only one project. | 16:49 |
*** ralonsoh has quit IRC | 16:50 | |
zbr|rover | if everyone from the list is added to each review, i would not see that working in practice. the entire idea is to spread the review-load. | 16:50 |
JayF | I think something that's different about OpenStack vs many other projects is that many projects have a sense of priorities, and working together to ensure something is 100% done instead of having 10 things 10% done. That makes it hard for people outside that upstream process entirely to get code reviewed/paid attention to -- it's almost explicitly not a priority to review that code. | 16:51 |
JayF | That's why "step 1" to getting something merged in almost any OpenStack project is to make the case it's needed, via stories/bugs/mailing list/irc, then once you are there, it gets more easy for folks to review your code. | 16:51 |
JayF | There are very few successful openstack contributors who do not engage with the community in ways other than code. | 16:51 |
TheJulia | Well I just clicked into an interesting discussion :) | 16:57 |
zbr|rover | JayF: you are right that discussions are highly likely to be needed. Still, there are projects where these may not really be necessary and where it is easy for a valid CR to be ignored just because nobody who can help is notified. | 16:58 |
JayF | If your CR is being ignored, maybe it does need some discussion even if you don't realize it yet. :D | 16:58 |
JayF | Sometimes that happens in the CR itself, but not every team is structured as to that being how it works. | 16:58 |
TheJulia | And sometimes people downright ignore reviewer comments in change requests. :\ | 16:59 |
TheJulia | so off the wall side question, is CI becoming unhappy? | 16:59 |
TheJulia | seeing 2nd attempt randomly pop up on an ironic job on things that should have worked fine | 17:00 |
JayF | TheJulia: the tldr of how this started is zbr|rover was asking for an infra feature to make it easier to ID the core reviewers for a change :) We had to inform them that core reviewers are already getting canned spiced ham chucked at them with high frequency and velocity :D | 17:00 |
TheJulia | ahh yes | 17:00 |
TheJulia | which all goes to the nearly automatic canning machine in our mailboxes | 17:01 |
JayF | (as a side note: I think we've all been on the other side of it too, hoping another project merges our change and you can't get anyones' attention, and it's frustrating, so yes it's a problem, but the proposal is not the solution IMO) | 17:01 |
*** tosky has joined #opendev | 17:03 | |
zbr|rover | i had a recent example from last week where I had to help with a new release of python-jenkins, even though i no longer use the library myself. probably we can ask him how long it took to find someone who could make a new release (his change had already been merged long before) | 17:04 |
clarkb | infra-root https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 I've put two options for the zk upgrade in there. Please add text or let me know if I've missed things or if you have preferences | 17:04 |
clarkb | zbr|rover: I think the issue there was largely that the software was unmaintained, not that gerrit didn't show core reviewer groups | 17:05 |
clarkb | gerrit could add reviewers and it would still be ignored if those individuals are no longer maintaining the software | 17:06 |
fungi | TheJulia: i haven't heard anyone mention a new global issue, we're not making changes right now either, but if you have some examples i'm happy to take a quick look | 17:08 |
zbr|rover | i usually look at those reviews i am added to, but keep in mind that the issue i mentioned is that the OP may have no clue about who can help or not, regardless of which communication channel he may attempt to use: gerrit, irc, mailing list, carrier pigeon. | 17:09 |
clarkb | huh some tripleo jobs show 4 attempts (I thought we capped at 3) | 17:10 |
clarkb | fungi: TheJulia ^ as another data point | 17:10 |
clarkb | considering how widespread that is I wonder if we lost zk connectivity with a launcher? | 17:11 |
clarkb | er no it would have to be the scheduler as the launcher unlocks things once handed to the jobs | 17:11 |
clarkb | 2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost | 17:14 |
clarkb | fungi: TheJulia ^ I think that is the cause | 17:14 |
fungi | none of the zk servers is down | 17:15 |
clarkb | I ran echo stat | nc localhost 2181 against all 3 not too long ago to figure out which was the leader and which were followers and they all looked happy | 17:16 |
fungi | all last started on the 27th | 17:16 |
clarkb | seems it has happened 3 times today and not at all in the previous log file | 17:16 |
fungi | so their daemons have been running for a few days | 17:16 |
fungi | nothing in dmesg for any of them since saturday either | 17:17 |
clarkb | "Connection dropped: socket connection broken" appears to be the underlying issue as reported by the zuul side | 17:17 |
clarkb | I wonder if the servers timed out zuul for non responsiveness? | 17:17 |
clarkb | and the client sees that as the socket connection breaking | 17:17 |
fungi | firewall rules for them were last touched on the 18th | 17:18 |
fungi | yeah, could be | 17:18 |
clarkb | on zk02 (the current leader) I see it expiring a session at Mar 30 16:37:14 in syslog | 17:20 |
clarkb | timeout of 40000ms exceeded | 17:20 |
clarkb | I suspect that is what happened though I'm not sure how to confirm via the logged session id | 17:20 |
clarkb | corvus: ^ fyi this may be important performance feedback gathering? | 17:20 |
clarkb | note I haven't changed anything with zk yet, only done info gathering (ran stat command) | 17:21 |
clarkb | oh yup a few log lines later I see zuul01 has reconnected | 17:21 |
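For context, the client side of this looks roughly like the following kazoo sketch (not zuul's actual code); the host list and the 40-second session timeout are assumptions taken from the log lines quoted above. If the client is starved of CPU and misses heartbeats, the server expires the session and the listener sees SUSPENDED and then LOST:

```python
# Sketch: a kazoo client with a connection-state listener.
from kazoo.client import KazooClient, KazooState

zk = KazooClient(hosts='zk01:2181,zk02:2181,zk03:2181', timeout=40.0)

def state_listener(state):
    # Called from kazoo's connection thread on every state transition.
    if state == KazooState.SUSPENDED:
        print('connection interrupted (socket connection broken)')
    elif state == KazooState.LOST:
        # The server expired the session; ephemeral nodes (e.g. locks) are gone
        # and anything held under them must be re-established.
        print('zookeeper connection lost')
    else:  # KazooState.CONNECTED
        print('connected/reconnected')

zk.add_listener(state_listener)
zk.start()
```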
fungi | unrelated, i saw an infra-prod-run-cloud-launcher failure where the log on bridge says /home/zuul/src/opendev.org/opendev/system-config/run_cloud_launcher.sh did not exist | 17:23 |
clarkb | fungi: that script was run by the old cron based system and it was removed. Maybe we didn't properly convert over the ansible side? | 17:24 |
fungi | yeah, i found https://review.opendev.org/718799 removed it almost a year ago | 17:24 |
clarkb | or maybe part of the cleanup is written such that if it fails (because it already succeeded in cleaning up) the ansible play fails? | 17:25 |
fungi | d'oh, i'm looking at the wrong logfile | 17:25 |
fungi | --- end run @ 2020-05-07T15:13:58+00:00 --- | 17:25 |
fungi | yeah, that was from when we removed the script | 17:26 |
fungi | i was looking at the log from the cronjob not from the continuous deployment | 17:26 |
fungi | connection timeout to the openedge api. i guess this has been failing for some time | 17:27 |
clarkb | oh right we need to clean that up | 17:27 |
*** lpetrut has joined #opendev | 17:28 | |
fungi | i can push it in a bit | 17:29 |
*** hashar is now known as hasharDinner | 17:36 | |
fungi | donnyd: should we unconfigure openedge entirely in our systems, or is it likely coming back at some point in the future? | 17:44 |
*** ykarel|away has joined #opendev | 17:46 | |
*** marios|out has quit IRC | 17:49 | |
clarkb | fungi: my understanding was that we should probably unconfigure it, and if we want ot add it back that is straightforward to do | 17:54 |
fungi | okay, i'll work on a complete rip-out for now | 17:54 |
*** prometheanfire has quit IRC | 17:55 | |
clarkb | zk connectivity was just lost again | 17:59 |
clarkb | (I was running a tail looking for that string) | 18:00 |
*** gothicserpent has joined #opendev | 18:00 | |
fungi | huh | 18:00 |
*** gothicserpent has quit IRC | 18:01 | |
clarkb | and on zk02 we see the same sort of logs connection for session foo timed out | 18:01 |
clarkb | then a bit later a reconnect from zuul01 | 18:01 |
clarkb | the timeout is 40000ms which I think means that zuul's zk connection didn't transfer any data for 40 seconds? | 18:01 |
clarkb | wow zuul is certainly busy it has generated over 100k log lines in about the time since the last disconnect | 18:02 |
fungi | the zk graphs at the end of https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 show a bunch of drops which i expect are coincident with the timing of the disconnects | 18:04 |
clarkb | going back 200k lines only gets me an extra 2 minutes | 18:04 |
clarkb | the bulk of this seems to be collecting job variants and similar config loading logging | 18:04 |
clarkb | fungi: you can see the zuul event processing time spike around then too | 18:07 |
clarkb | typical seems to be in the ms to second range but then when we see restarts it goes to the minutes range | 18:08 |
clarkb | I wonder if while we are processing events we somehow block the zk connection from ping ponging appropriately | 18:08 |
clarkb | which means if processing an event spikes to > 40s we lose | 18:08 |
clarkb | though we don't disconnect every time we spike and we don't spike every time we disconnect so maybe a bad correlation | 18:09 |
clarkb | I do suspect though that if we work backward from really long event processing time we might find something useful | 18:10 |
TheJulia | clarkb: w/r/t the connection lost, I had a feeling... but it might not be bunnies | 18:12 |
fungi | the job queue and executors accepting graphs do look like what we see when there's a large gate reset or huge change series pushed, but that could also be the result of jobs restarting en masse from a zk disconnect (effect rather than cause) | 18:15 |
*** lpetrut has quit IRC | 18:15 | |
clarkb | yes I think that is likely more the symptom than the cause | 18:16 |
*** ykarel|away has quit IRC | 18:16 | |
clarkb | at 17:54:40 we report our last event_enqueue_processing time for a few minutes, then the next one says it took 1.3 minutes according to grafana | 18:17 |
clarkb | the next one arrives at 18:01:20 | 18:17 |
clarkb | within that period of time we've lost connectivity to zk because it has timed us out for being non responsive for >40s | 18:18 |
clarkb | I strongly suspect something is monopolizing the cpu, but I think the rerun jobs are simply because nodepool has helpfully cleaned the old ones up for us | 18:18 |
clarkb | 2021-03-30 18:01:04,090 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 363.146 seconds) | 18:21 |
clarkb | what if it is ^ | 18:21 |
clarkb | that also doesn't correlate to every restart though | 18:21 |
clarkb | 2021-03-30 15:16:16,061 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 252.64 seconds) is another from around when we got disconnected | 18:21 |
clarkb | this disconnect doesn't have bad graphs or tenant reconfiguration that takes forever: 2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost | 18:22 |
clarkb | I am still not fully up to speed around what has changed with zk recently though so I may be looking in the completely wrong location | 18:23 |
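One way to test that correlation without paging through 100k log lines a minute is a quick script like this sketch; the scheduler log path is a guess, and the regexes only match log lines of the form quoted above:

```python
# Sketch: correlate "Zookeeper connection lost" events with long tenant
# reconfigurations in the scheduler debug log.
import re
from datetime import datetime

TS_FMT = '%Y-%m-%d %H:%M:%S,%f'
lost_re = re.compile(r'^(\S+ \S+) INFO kazoo\.client: Zookeeper connection lost')
reconf_re = re.compile(r'^(\S+ \S+) INFO zuul\.Scheduler: Tenant reconfiguration '
                       r'complete for (\S+) \(duration: ([\d.]+) seconds\)')

losses = []
reconfigs = []
with open('/var/log/zuul/debug.log') as logfile:  # path is a guess
    for line in logfile:
        m = lost_re.match(line)
        if m:
            losses.append(datetime.strptime(m.group(1), TS_FMT))
            continue
        m = reconf_re.match(line)
        if m:
            reconfigs.append((datetime.strptime(m.group(1), TS_FMT),
                              m.group(2), float(m.group(3))))

# for each disconnect, list reconfigurations that finished within 10 minutes
for lost in losses:
    nearby = [(t, tenant, dur) for t, tenant, dur in reconfigs
              if abs((t - lost).total_seconds()) < 600]
    print(lost, '->', nearby or 'no reconfiguration nearby')
```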
*** roman_g has joined #opendev | 18:26 | |
fungi | clarkb: internal scheduler events/results queues so far, i think | 18:26 |
fungi | also semaphores | 18:27 |
clarkb | we have had 4 disconnects in the last several hours. three of them have an adjacent openstack tenant reconfiguration that takes 4-6 minutes. There are other long reconfigurations though | 18:27 |
fungi | also change queues are in zk now, looks like | 18:28 |
clarkb | makes me less confident that long reconfiguration is a cause, but it may be another symptom (essentially something is consuming resources and when that happens zk can disconnect and reconfigurations can go long) | 18:28 |
fungi | so most of the things you would think of as internal scheduler state have moved from in-memory data structures to znodes | 18:30 |
clarkb | one thing that makes digging in logs difficult is that we generate a ton of kazoo exceptions after this happens because all the nodes have been cleaned up and zuul can't update the nodes in zk | 18:31 |
clarkb | also 100k log lines per minute | 18:31 |
fungi | cacti graphs for the zk servers don't suggest any system-level resource exhaustion | 18:31 |
fungi | the zuul scheculer is showing some heavy periods of read activity on its disks around that time | 18:33 |
fungi | also we've been seeing a steady rise in used and cache memory on the scheduler, but used memory starts to plateau after we run out of free memory and cache begins to get squeezed (circa 14:00 utc) | 18:34 |
corvus | typically if we see zk disconnects its due to cpu starvation on the scheduler | 18:36 |
corvus | usually due to swapping | 18:36 |
fungi | could this be? cacti says scheduler cpu utilization is nominal, maybe a little higher than usual but still below 20% | 18:37 |
fungi | and no real swap utilization | 18:37 |
clarkb | fungi: it's a many-cpu instance and zuul can only use one of them for the scheduler | 18:37 |
fungi | fair point | 18:37 |
clarkb | corvus: ya I suspect that is why reconfiguration and event queue processing is also slow when this occurs but not always | 18:37 |
fungi | also i suppose we could have very brief spikes which don't register in a 5-minute aggregate sample | 18:37 |
clarkb | we're seeing cpu starvation hit a number of things and this is the most prominent as it restarts jobs | 18:37 |
fungi | there is one zuul-scheduler process consuming most of a cpu according to top | 18:38 |
fungi | and also a majority of the system memory | 18:38 |
corvus | i can't seem to get cacti to show me more than 1 day of data | 18:39 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all | 18:42 |
clarkb | that does seem to show memory use has significantly grown in the last day or so | 18:42 |
clarkb | maybe we've got a leak that leads to swapping? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64793&rra_id=all | 18:43 |
corvus | yep, that would be my guess | 18:43 |
fungi | except the swap usage graph is *nearly* empty | 18:44 |
corvus | i think we should restart the scheduler immediately to address the immediate problem and start debugging the leak. i can start debugging tomorrow, but am occupied today. | 18:44 |
fungi | but yes there's a tiny bump up to 233mib around the time of the recent distress | 18:44 |
corvus | fungi: the swap activity graph isn't though, and that can cause pauses | 18:44 |
fungi | right, and if it's the wrong 233mib... | 18:44 |
fungi | that also may explain some of the spikes in disk read activity | 18:45 |
clarkb | I need to start prepping for the meeting, is someone else able to lead the restart? | 18:45 |
fungi | i can do the scheduler restart now | 18:45 |
fungi | did we get the enqueuing fix merged? | 18:45 |
* fungi checks | 18:45 | |
fungi | ahh, no, https://review.opendev.org/783556 hasn't been approved yet | 18:47 |
fungi | corvus: what's your guess as to the fallout we'll see from trying to reenqueue things... is there something i should edit in the queue dump beforehand? | 18:47 |
corvus | fungi: in that case just drop the hostname from any enqueue-ref calls | 18:47 |
fungi | perfect, can do, thanks! | 18:47 |
corvus | (drop the hostname from the project name) | 18:47 |
fungi | yep, got it | 18:48 |
fungi | | sed 's, opendev\.org/,' ought to do the trick | 18:49 |
fungi | er, forgot an additional , there | 18:49 |
fungi | and i shouldn't eat the blank space ;) | 18:50 |
fungi | /opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's, opendev\.org/, ,' > queues.sh | 18:50 |
fungi | that seems to do the trick | 18:50 |
corvus | fungi: i'd just do that for enqueue-ref not the changes | 18:51 |
fungi | oh, can do | 18:51 |
*** diablo_rojo has joined #opendev | 18:52 | |
fungi | looks like the scheduler is very busy again, probably another event | 18:52 |
fungi | /opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's | 18:55 |
fungi | ,^\(zuul enqueue-ref .* --project \)opendev\.org/,\1,' > queues.sh | 18:55 |
fungi | stray newline in the paste, but that's what seems to take care of it | 18:55 |
fungi | looks like openstack release management had just merged some release requests so there are tag events in the queue. i'm a little hesitant to try bulk reenqueuing those and would rather wait another few minutes for them to hopefully wrap up | 18:59 |
fungi | i let the team know to hold on further release approvals for a bit | 18:59 |
* TheJulia sighs | 19:04 | |
fungi | TheJulia: it was for ironic. you can sigh as loudly as you like ;) | 19:05 |
TheJulia | is there an official guestimate on how far the status display trails behind reality? | 19:05 |
TheJulia | fungi: joy | 19:05 |
fungi | TheJulia: top-left corner... "Queue lengths: 399 events, 2 management events, 301 results." | 19:05 |
TheJulia | Now to just get a ci fixed merged into ironic. | 19:05 |
fungi | that's basically the internal count of backlogs for those categories of events not yet processed | 19:06 |
fungi | it's not really a time because zuul can't express much about events it hasn't processed yet | 19:06 |
TheJulia | I was meaning more in regards to a running ci job | 19:06 |
fungi | but that's generally the reason for any perceived lag in status reflecting expected reality | 19:07 |
TheJulia | like, I've opened the status display in the past and seen things appear completely done running past log uploads and can't view the console... yet the display says the job is still running for like 15 minutes | 19:07 |
TheJulia | okay, so same queue then | 19:07 |
fungi | the results backlog is how many builds have completed that zuul hasn't reflected the completion result for yet | 19:07 |
TheJulia | AHH! | 19:08 |
TheJulia | okay | 19:08 |
fungi | events is how many things like gerrit comments or new patchsets it has received but not yet processed to figure out if something should be enqueued | 19:08 |
fungi | that sort of thing | 19:08 |
TheJulia | That was what I was figuring but didn't quite grok the meaning of results | 19:09 |
TheJulia | at least in that context | 19:09 |
fungi | and management events are generally reconfiguration | 19:11 |
fungi | so now that the counts are ~0 again, the status info should be reasonably current | 19:11 |
fungi | okay, zuul estimates we're 5 minutes out from being done with the last releases site update, so i'll plan to restart the scheduler and reenqueue everything as soon as that finishes | 19:49 |
fungi | okay, all the reasonably critical openstack release jobs have wrapped up, i'll check the other tenants quickly | 19:59 |
ianw | clarkb: likely to get caught in a restart, but is https://review.opendev.org/c/opendev/system-config/+/783778 to add a key for review02 to r/o copy gerrit data OK? not sure if it was done like that in the past | 20:00 |
fungi | seems we're all clear. grabbing a corrected queue dump per earlier bug discussion, then restarting | 20:00 |
ianw | the other one that is ready is https://review.opendev.org/c/opendev/system-config/+/775961 to add a mariadb container; upstream has merged the required fixes to stable now | 20:00 |
clarkb | ianw: I think review-test did similar but in the opposite direction | 20:01 |
fungi | i'll reenqueue once the cat jobs are finished | 20:02 |
clarkb | I would cross check against what review-test did, though I think we may have used forwarded keys temporarily (with agent confirmation) | 20:02 |
clarkb | fungi: ^ when you are done restarting things maybe you want to look at that too | 20:02 |
ianw | clarkb: hrm ok. i liked the idea the wrapper could do read-only | 20:03 |
ianw | mostly so i don't type the wrong thing :) | 20:03 |
clarkb | oh is that what rrsync is? I was just about to go and read up on it :) | 20:04 |
*** Dmitrii-Sh has joined #opendev | 20:04 | |
*** prometheanfire has joined #opendev | 20:04 | |
ianw | i'm not going to claim great knowledge, but i just googled "read only rsync" and this seems to be the solution | 20:04 |
clarkb | ya seems like it causes rsync runs using that key to be restricted to read-only access under the "chroot" path as set | 20:07 |
fungi | starting to reenqueue everything now, 288 itens | 20:10 |
clarkb | ianw: any idea why rsync doesn't just install that script? | 20:10 |
fungi | items | 20:10 |
clarkb | ianw: otherwise ya I think this looks ok | 20:10 |
fungi | looks like i may need to restart zuul-web too | 20:10 |
ianw | clarkb: not really, seems it ships it in some form as standard. maybe it's a bit obscure for /usr/bin i guess | 20:10 |
clarkb | fungi: ya that is normal iirc | 20:11 |
johnsom | FYI, | 20:11 |
fungi | done, that seems to have cleared the "proxy error" | 20:11 |
johnsom | https://www.irccloud.com/pastebin/35ySboyv/ | 20:11 |
clarkb | ianw: I've +2'd it but I think it would be good to have fungi look it over too | 20:12 |
fungi | johnsom: check again | 20:12 |
fungi | seems we often need to restart the zuul-web daemon any time we restart the scheduler | 20:12 |
johnsom | Yeah, loading now. | 20:12 |
*** hasharDinner has quit IRC | 20:20 | |
*** slaweq has quit IRC | 20:26 | |
fungi | #status log Restarted the Zuul scheduler to address problematic memory pressure, and reenqueued all in flight changes | 20:32 |
openstackstatus | fungi: finished logging | 20:32 |
clarkb | 2021-03-30 18:52:06,525 INFO kazoo.client: Zookeeper connection lost <- last logged connection lost if we need to compare notes later | 20:32 |
clarkb | memory use looks much better now though so I suspect we've got a while before it happens again | 20:33 |
*** roman_g has quit IRC | 20:52 | |
*** artom has quit IRC | 21:00 | |
clarkb | I've realized I tested the always_update path in the gitea job which means we should already have data for the gitea memory usage with dstat | 21:00 |
*** artom has joined #opendev | 21:00 | |
clarkb | I'll also push up a change that doesn't use tokens so we can compare between those as well | 21:03 |
clarkb | actually we may already have that in https://review.opendev.org/c/opendev/system-config/+/781776 | 21:04 |
clarkb | ya https://zuul.opendev.org/t/openstack/build/143900b189f84e1296d66d60159c1c87/log/gitea99.opendev.org/dstat-csv.log is 1.13.6 + passwd auth, https://zuul.opendev.org/t/openstack/build/08ebc2b8c7344473bbc6f4790b26b416/log/gitea99.opendev.org/dstat-csv.log is 1.13.1 + token auth and soon enough we should have 1.13.6 + tokens as well | 21:05 |
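To compare those runs, something like the sketch below could pull the peak memory figure out of each dstat-csv.log; it assumes dstat's --output CSV layout (preamble rows, then a header row with a memory "used" column, then data rows, with values in bytes), and the local file names are made up:

```python
# Sketch: extract peak "used" memory from a dstat --output CSV log.
import csv

def peak_used_memory(path):
    """Return the peak value of the first 'used' column in a dstat CSV log.

    If the log also records swap there may be a second 'used' column; this
    simply takes the first one it finds.
    """
    used_idx = None
    peak = 0.0
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if used_idx is None:
                if 'used' in row:
                    used_idx = row.index('used')
                continue
            try:
                peak = max(peak, float(row[used_idx]))
            except (IndexError, ValueError):
                continue
    return peak

# hypothetical local copies of the dstat-csv.log job artifacts
for name in ('dstat-passwd.csv', 'dstat-token.csv'):
    print(name, round(peak_used_memory(name) / 1e6), 'MB (assuming bytes)')
```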
*** artom has quit IRC | 21:10 | |
*** artom has joined #opendev | 21:11 | |
ianw | memory usage looks sane with 143900b189f84e1296d66d60159c1c87 | 21:39 |
*** whoami-rajat has quit IRC | 21:51 | |
fungi | back for a bit. had some impromptu guests so sitting out on the deck and the sun is making it very hard to see the screen, even at maximum backlight | 21:52 |
*** artom has quit IRC | 21:53 | |
*** artom has joined #opendev | 21:55 | |
*** artom has quit IRC | 21:56 | |
*** artom has joined #opendev | 21:59 | |
ianw | hrm, weirdly i think the id_rsa.pub we install for gerrit2 isn't valid | 22:27 |
ianw | we also seem to have lost gerritbot | 22:34 |
*** brinzhang_ has quit IRC | 22:35 | |
*** brinzhang_ has joined #opendev | 22:36 | |
*** dpawlik has quit IRC | 22:37 | |
*** osmanlicilegi has quit IRC | 22:42 | |
*** otherwiseguy has quit IRC | 22:42 | |
*** amoralej|off has quit IRC | 22:42 | |
*** Jeffrey4l has quit IRC | 22:42 | |
*** openstackstatus has quit IRC | 22:42 | |
*** dpawlik0 has joined #opendev | 22:42 | |
*** janders1 has joined #opendev | 22:42 | |
*** openstack has joined #opendev | 22:43 | |
*** ChanServ sets mode: +o openstack | 22:43 | |
ianw | it would be better to have ansible managing these keys than have them as secrets | 22:44 |
ianw | i guess it means we'd need to reference the review server during gitea runs. that's probably a pita | 22:45 |
*** sboyron has quit IRC | 23:02 | |
clarkb | ianw: we could maybe put them as individual secrets in a single location like a shared group var file | 23:04 |
clarkb | ianw: re gitea looking at the dstat results 1.13.6 + passwd seems to be about ~900MB peak memory usage and 1.13.6 + token is about 760MB | 23:12 |
clarkb | ianw: do you know what it was with password on 1.13.1? | 23:13 |
*** tosky has quit IRC | 23:14 | |
ianw | clarkb: i think https://imgur.com/a/YrlDNcd would have been 1.13.1 | 23:16 |
ianw | Gitea v1.13.1 ... https://zuul.opendev.org/t/openstack/build/cfcd32fa1b27407ab61f5b44be83f6fc/ ... that is a passwd one that ran OOM | 23:18 |
clarkb | wow that is quite a difference | 23:19 |
ianw | yeah it goes bananas | 23:19 |
fungi | did my gerritbot fix ever merge? | 23:19 |
clarkb | fungi: I thought you said it did the same thing again? | 23:20 |
clarkb | no wait I'm mixing jeepyb and gerritbot and gerritlib | 23:20 |
fungi | no, that was jeepyb | 23:20 |
fungi | yeah | 23:20 |
* fungi checks open changes | 23:20 | |
fungi | okay 781920 merged 10 days ago | 23:21 |
* fungi checks to see if 3cefaa8 is what we have installed | 23:22 | |
fungi | oh, it'll be a container | 23:22 |
fungi | quick batman, to dockerhub | 23:22 |
clarkb | ianw: that makes me think that token auth is less urgent, though it may help a bit anyway | 23:22 |
fungi | https://hub.docker.com/r/opendevorg/gerritbot/tags?page=1&ordering=last_updated says updated 11 days ago so roughly right | 23:23 |
fungi | in theory 781919 should have been sufficient to fix the message-too-long disconnects at least but both merged at about the same time | 23:25 |
fungi | previous changes to merge were in january anyway | 23:25 |
fungi | so that's gotta be a sufficient container image | 23:25 |
fungi | image digest of 21def9f40d85 according to dockerhub | 23:26 |
clarkb | the current gerritbot has been running for 48 minutes | 23:26 |
clarkb | not sure how long the prior instance had been running | 23:26 |
fungi | 2021-03-30 11:21:38 <-- openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (Ping timeout: 260 seconds) | 23:27 |
fungi | that looks like a different behavior anyway | 23:27 |
fungi | normally it would have been a quit if it was the known bug | 23:27 |
clarkb | changing servers or something is what it said before with the issue you fixed iirc | 23:28 |
fungi | yep | 23:28 |
fungi | 11:21:38-260 would be 11:17:18 | 23:30 |
clarkb | fungi: have we tracked down the jeepyb gerritlib thing yet? I should probably page that in and give it a proper look if not | 23:30 |
clarkb | I wonder if it is related to depends on post merge | 23:31 |
ianw | clarkb: yeah, if production has switched itself (or, because it was started before they switched?) away from argon2 then probably ok | 23:31 |
clarkb | ianw: ya the db reports pbkdf2 | 23:31 |
fungi | log says it announced 778572 in #openstack-charms 7 seconds before that at 11:17:11 | 23:31 |
fungi | nothing else jumping out at me | 23:31 |
fungi | pbkdf2 would be a much more standard kdf, yeah | 23:32 |
fungi | clarkb: i haven't dug deeper into the jeepyb integration test with stale gerritlib ref, no | 23:32 |
ianw | clarkb: it's probably always been that right? i'm not sure when they switched *to* argon2. that would hit us in the gate, as we start fresh, but not production | 23:33 |
ianw | it switched in 1.13.0 | 23:34 |
clarkb | fungi: just as a sanity check cloning gerritlib and pip installing from source gives me Created wheel for gerritlib: filename=gerritlib-0.10.1.dev1-py3-none-any.whl | 23:35 |
clarkb | which is definitely not the 0.10.0 we see in the job | 23:35 |
fungi | https://review.opendev.org/782538 is the one with the weird stale gerritlib ref | 23:35 |
clarkb | ianw: ya I'm not sure if they ported pbkdf2 to argon2 then back again or if we just always were pbkdf2 | 23:35 |
fungi | just a sec, need to move myself to a different room | 23:35 |
clarkb | Created wheel for gerritlib: filename=gerritlib-0.10.0-py3-none-any.whl is what the job does | 23:36 |
clarkb | which does strongly imply the job is seeing the 0.10.0 ref not the new one | 23:36 |
ianw | clarkb: yeah, there are no db updates included in the swap afaics. i'd say we've always been on pbkdf2 in production | 23:37 |
clarkb | fungi `ze08:/var/log/zuul$ grep ed1e29cb099a4c139b257bb4f57f5c30 executor-debug.log.1` that is what I'm looking at now | 23:38 |
clarkb | ianw: ok, so we may still have trouble with the description updates, should probably leave them off in that case | 23:38 |
clarkb | 2021-03-29 22:18:15,805 INFO zuul.Merger: [e: ed1e29cb099a4c139b257bb4f57f5c30] [build: f10d0a0281c043678acf4393f2629810] Skipping updating local repository gerrit/opendev/gerritlib | 23:40 |
clarkb | I wonder if that is the clue we need | 23:40 |
fungi | i wonder why it skipped that | 23:41 |
ianw | clarkb: yeah; i guess my immediate issue was the gitea jobs failing gate constantly | 23:41 |
clarkb | fungi: the isUpdateNeeded() method in zuul merger seems to check if each of the refs exists and that each of the revs exist | 23:43 |
fungi | yep, was just tracing that | 23:43 |
clarkb | fungi: I wonder if this is a case where maybe if a change is fast forwardable we don't necessarily update the branch ref to point to a rev because the branch is already there as is the rev? | 23:43 |
clarkb | now to see what the gerritlib change looks like | 23:43 |
clarkb | ya no merge commit | 23:44 |
clarkb | so maybe that is what is going on here? | 23:44 |
*** gothicserpent has joined #opendev | 23:44 | |
fungi | that would be really odd | 23:45 |
clarkb | the code is given a list of (ref, rev) tuples but then we check each component separately and not that one points to the other if I am reading this correctly | 23:45 |
clarkb | does the ref exist? yes ok good. Does the rev exist? yes ok good. We should be checking does ref point at rev? (I think) | 23:46 |
fungi | also nothing about isUpdateNeeded() has changed in two years | 23:46 |
fungi | according to git blame | 23:46 |
fungi | so if that's really the bug it's been lying latent for a very long time and i would expect us to have hit it frequently | 23:47 |
clarkb | fwiw git show master shows https://opendev.org/opendev/gerritlib/commit/99136e1e164baa7b1d9dac4f64c5fb511b813c19 and git show 874d4776eee5ae7c8c15debbb9e943110be299dd shows your commit | 23:47 |
clarkb | I agree I would've expected us to hit this more if its been there a long time. Maybe the calling code changed? | 23:48 |
clarkb | maybe we always did an unconditional update until recently for things that had branch moves? | 23:49 |
* clarkb goes back to zuul history | 23:49 |
clarkb | heh I was the last one to change that function | 23:50 |
fungi | see, now i get to blame you ;) | 23:50 |
clarkb | I agree I'm not seeing any recent changes that scream "this is related". 75d5dc2289d4b85b5e2d721d3fdbafdf5779e02b whispers it | 23:51 |
clarkb | I would say based on the logging and the actual repo state that we are not updating it as expected though | 23:52 |
clarkb | and I strongly suspect isUpdateNeeded() needs to check that ref points at rev | 23:52 |
fungi | this seems like something we could unit test in zuul | 23:53 |
clarkb | hrm getRepoState did change to do the buildset consistent revs | 23:53 |
clarkb | that produces the input to isUpdateNeeded I think, maybe that is related? | 23:54 |
clarkb | fungi: that is a good point | 23:54 |
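A unit test could then target a stricter check along these lines; this is only a sketch of the idea, not zuul's actual merger code, and the repo helper method names are hypothetical:

```python
def is_update_needed(repo, refs):
    """Return True if the cached repo should be updated from its source.

    ``refs`` is a list of (ref, rev) tuples, e.g.
    ('refs/heads/master', '874d4776...').  The existing logic discussed
    above only verifies that each ref exists and that each rev exists
    somewhere in the repo; the final comparison is the extra
    "does ref point at rev" check being proposed.
    """
    for ref, rev in refs:
        if not repo.hasRef(ref):         # hypothetical helper
            return True
        if not repo.hasCommit(rev):      # hypothetical helper
            return True
        if repo.resolveRef(ref) != rev:  # the missing check: ref -> rev
            return True
    return False
```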
*** openstackgerrit has joined #opendev | 23:57 | |
openstackgerrit | Merged opendev/system-config master: gerrit: remove mysql-client-core-5.7 package https://review.opendev.org/c/opendev/system-config/+/783769 | 23:57 |
openstackgerrit | Merged opendev/system-config master: launch-node : cap to 8gb swap https://review.opendev.org/c/opendev/system-config/+/782898 | 23:57 |
openstackgerrit | Merged opendev/system-config master: dstat-logger: redirect stdout to /dev/null https://review.opendev.org/c/opendev/system-config/+/782868 | 23:57 |