Tuesday, 2021-03-30

*** mlavalle has quit IRC		00:02
*** tosky has quit IRC		00:10
clarkb	in ~7 minutes the gitea update should start	00:13
fungi	i'm still sorta on hand	00:28
clarkb	gitea01 seems to have updated now to check on it	00:30
clarkb	https://gitea01.opendev.org:3000/openstack/nova seems to work for me	00:30
clarkb	version is reported as being updated	00:31
ianw	clarkb:lgtm	00:32
clarkb	02 and 03 also look good now	00:32
ianw	clarkb: i was thinking for the token thing too, it probably helps in production because the new versions have switched the default hashing algorithm to something less intensive, but our current passwords are hashed using the old method?	00:33
ianw	perhaps we should redo the auth?	00:33
fungi	or just go with your token work	00:34
clarkb	ya I'm not sure yet if they'll hash it back again on first requests or not	00:34
clarkb	we should be able to find out monitoring a new project addition after this upgrade completes?	00:34
clarkb	we'll be able to compare resource usage at least	00:34
clarkb	04 and 05 are now done and look good	00:35
fungi	didn't we turn off the mass description updating though?	00:35
clarkb	fungi: good point we did reduce the overhead externally as well	00:36
clarkb	probably can check the db directly?	00:36
clarkb	and see what sort of hash is in the records	00:36
clarkb	?	00:36
fungi	if they use a typical identifier, yeah	00:37
clarkb	06-08 lgtm now too	00:41
clarkb	the user table has a passwd_hash_algo field	00:43
clarkb	now to find that issue so I can determine what we want and what we don't want	00:44
*** kevinz has joined #opendev		00:49
openstackgerrit	Merged opendev/system-config master: Add review02.opendev.org https://review.opendev.org/c/opendev/system-config/+/783183	00:49
clarkb	https://github.com/go-gitea/gitea/issues/14294 if that was the issue then we are using pbkdf2 now according to the db	00:49
clarkb	which is good according to that issue	00:49
clarkb	I've got to help sort out dinner now, but gitea looks happy and I think it is using the hash that is less memory intensive	00:50
ianw	cool, will keep an eye	00:51
clarkb	thanks!	00:51
fungi	yeah, seems fine so far	00:55
fungi	i hadn't noticed this before (new?) but if you look at the very bottom of the page where the gitea version is listed, it also tells you how long it took to load that page	00:57
fungi	for example, https://opendev.org/openstack/nova/src/branch/master/releasenotes gave me "Page: 39332ms Template: 102ms"	00:59
fungi	so, yeah, 40 seconds there	00:59
fungi	but it's a fairly pathological example	00:59
clarkb	fungi: the first time you request an asset it isn't cached and can take a long time particularly for large repos like nova (because gitea is happy to inspect history for objects to tell you when they were last modified)	01:05
clarkb	but then it caches it, if you refresh it should be much quicker	01:05
clarkb	ok really going to do dinner now. I'm being yelled at	01:06
fungi	yep "Page: 3041ms Template: 122ms"	01:15
fungi	that much i knew, just wasn't aware it displayed those stats	01:15
fungi	or else i was aware and then forgot	01:16
*** hamalq has quit IRC		01:25
openstackgerrit	Ian Wienand proposed opendev/system-config master: gerrit: remove mysql-client-core-5.7 package https://review.opendev.org/c/opendev/system-config/+/783769	02:10
*** prometheanfire has quit IRC		02:11
*** prometheanfire has joined #opendev		02:12
openstackgerrit	Ian Wienand proposed opendev/system-config master: review01.openstack.org: add key for gerrit data copying https://review.opendev.org/c/opendev/system-config/+/783778	02:45
ianw	infra-root: ^ that installs a key from review02 -> review01 that can r/o rsync data. i think that will be generally useful as we go through this process to sync	02:45
openstackgerrit	Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961	03:35
ianw	finally navigated getting the mariadb fixes into the stable branch, so ^ doesn't require any patches any more	03:36
*** whoami-rajat_ has joined #opendev		04:21
*** ricolin has quit IRC		04:47
*** whoami-rajat_ is now known as whoami-rajat		04:56
*** marios has joined #opendev		05:01
*** ykarel\|away has joined #opendev		05:03
*** ykarel\|away is now known as ykarel		05:07
openstackgerrit	Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961	05:36
*** ykarel_ has joined #opendev		05:49
*** ykarel has quit IRC		05:49
*** cloudnull has quit IRC		05:54
*** ysandeep\|away is now known as ysandeep		05:59
*** lpetrut has joined #opendev		06:12
*** slaweq has joined #opendev		06:16
*** eolivare has joined #opendev		06:17
*** ralonsoh has joined #opendev		06:17
openstackgerrit	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: [doc] Update supported distros https://review.opendev.org/c/openstack/diskimage-builder/+/783788	06:27
*** ykarel__ has joined #opendev		06:29
*** ykarel_ has quit IRC		06:31
openstackgerrit	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job https://review.opendev.org/c/openstack/diskimage-builder/+/783790	06:32
openstackgerrit	Slawek Kaplonski proposed openstack/project-config master: Add noop-jobs for networking-midonet projects https://review.opendev.org/c/openstack/project-config/+/783792	06:40
*** cloudnull has joined #opendev		06:42
openstackgerrit	Slawek Kaplonski proposed openstack/project-config master: Readd publish-to-pypi for neutron-fwaas and dashboard https://review.opendev.org/c/openstack/project-config/+/783796	06:45
*** sboyron has joined #opendev		06:54
*** hashar has joined #opendev		06:57
*** frigo has joined #opendev		07:10
*** rpittau\|afk is now known as rpittau		07:29
*** hashar_ has joined #opendev		07:43
*** amorin_ has joined #opendev		07:46
*** hashar has quit IRC		07:46
*** amorin has quit IRC		07:47
*** hashar_ is now known as hashar		07:51
*** ykarel__ is now known as ykarel		07:59
*** tosky has joined #opendev		08:11
*** bodgix has quit IRC		08:14
*** smcginnis has quit IRC		08:14
*** bodgix has joined #opendev		08:14
*** arxcruz has quit IRC		08:14
*** smcginnis has joined #opendev		08:14
*** frigo has quit IRC		08:16
*** arxcruz has joined #opendev		08:18
*** DSpider has joined #opendev		08:44
*** dtantsur\|afk is now known as dtantsur		08:48
*** dtantsur is now known as dtantsur\|brb		08:55
*** hashar has quit IRC		09:20
*** Guest55766 has quit IRC		09:30
*** ykarel is now known as ykarel\|lunch		09:45
*** dtantsur\|brb is now known as dtantsur		09:59
*** dirk1 has joined #opendev		10:34
*** dirk1 is now known as dirk		10:39
*** ykarel\|lunch is now known as ykarel		10:52
openstackgerrit	Guillaume Chauvel proposed opendev/gear master: Create SSL context using PROTOCOL_TLS, fallback to highest supported version https://review.opendev.org/c/opendev/gear/+/741288	11:03
openstackgerrit	Guillaume Chauvel proposed opendev/gear master: Update testing to Python 3.9 and linters https://review.opendev.org/c/opendev/gear/+/780103	11:03
*** DSpider has quit IRC		11:18
*** openstackgerrit has quit IRC		11:21
*** hashar has joined #opendev		11:39
*** ysandeep is now known as ysandeep\|afk		11:50
*** redrobot9 has joined #opendev		12:25
*** artom has quit IRC		12:26
*** artom has joined #opendev		12:26
*** redrobot has quit IRC		12:28
*** redrobot9 is now known as redrobot		12:28
*** artom has quit IRC		12:36
*** ysandeep\|afk is now known as ysandeep		12:45
*** dpawlik6 is now known as dpawlik		12:58
*** hashar has quit IRC		13:24
*** spotz has joined #opendev		13:30
*** artom has joined #opendev		13:33
*** ykarel is now known as ykarel\|away		13:50
*** ykarel\|away has quit IRC		13:54
*** mlavalle has joined #opendev		13:58
*** ralonsoh has quit IRC		14:18
*** ralonsoh has joined #opendev		14:19
*** lpetrut has quit IRC		14:30
*** tosky has quit IRC		14:39
*** tosky has joined #opendev		14:39
*** ysandeep is now known as ysandeep\|away		14:56
zbr\|rover	A page like https://review.opendev.org/admin/repos/openstack/hacking,access was supposed to list the groups that have review rights but now is ~empty	15:23
zbr\|rover	Is that a desired change caused by some security concerns or just a glitch?	15:23
clarkb	I think it is related to the fixes for the bug we discovered when testing upgrades	15:24
zbr\|rover	imho, it should be possible for a logged in user to discover which other users or groups have access to a repo	15:24
fungi	gerrit doesn't display permissions to you if you don't have them	15:24
clarkb	gerrit significantly trimmed down who can access metadata	15:24
fungi	basically the acl view shows you what permissions apply to your account	15:25
zbr\|rover	and is not configurable in default settings?	15:25
fungi	nope	15:25
clarkb	all of that info is available in project-config though	15:25
fungi	this makes our acls in the project-config repo even more important, yes	15:25
fungi	acl copies in the repo, i mean	15:26
zbr\|rover	that is bad for the user experience, imagine a random user trying to propose a patch to a project. Assume that is his first experience contributing something to opendev gerrit.	15:26
zbr\|rover	he passed CI and now he wants to get attention of someone that can help him review a patch.	15:26
fungi	zbr\|rover: are you asking us to convey your concerns to the gerrit maintainers, or replace gerrit? i can't tell	15:27
zbr\|rover	i am wondering if we can do something to improve the discoverability of a gerrit project maintainers (cores)	15:27
fungi	projects can publish a link to the group view for their core review team, sure	15:28
fungi	gerrit even supports convenience urls where you can specify the group name instead of its id	15:28
zbr\|rover	so basically the only option we currently have is to expect the repo owners to mention that link in their docs. as this is a problem especially on projects with lower maintenance, it will never be addressed by the most vulnerable projects (active ones would likely be able to document this)	15:31
fungi	https://review.opendev.org/admin/groups/project-config-core,members	15:31
zbr\|rover	i guess the practical answer is to look at previous reviews and see who performed them, "punishing" those few that do perform reviews :D	15:32
fungi	i don't know that it's our only option, but if you have ideas we can evaluate them	15:32
zbr\|rover	sadly no ideas, only questions for now	15:32
fungi	okay, i'm done talking to you for now, you're suggesting that we intentionally punish our users	15:32
clarkb	providing feedback like this to gerrit is also helpful. Even if they don't take action on it at least we're communicating to the people most likely to take action	15:32
*** hashar has joined #opendev		15:32
*** rpittau is now known as rpittau\|afk		15:33
clarkb	unfortunately I suspect this is directly related to the security issues that were recently identified and fixed.	15:33
clarkb	which may make changing this tricky and people will probably avoid it	15:33
zbr\|rover	what i was trying to say is that people that do reviews are visible in gerrit history, and more likely to be added as reviewers by other users. Those that do not perform reviews are unlikely to be picked because nobody knows them.	15:33
fungi	but please don't suggest to the gerrit maintainers that they choose to be user-hostile and intentionally break the usability of their software. i'm really afraid they will think you're representing our community with your abusive comments	15:34
clarkb	fwiw if you add me as a reviewer the email goes into a folder in my mail client that I don't really watch. It gets far too many emails every day for me to keep up.	15:35
fungi	there may be gerrit plugins targeted at what you're wanting, or it's possible a gerrit plugin could be developed to do it	15:35
clarkb	I think gerrit could definitely use better tooling around helping people see what they should review. Adding people as reviewers doesn't seem to be it	15:35
fungi	we can evaluate adding new plugins if they're stable and reasonably unobtrusive (we're in the process of adding the reviewers plugin currently)	15:36
clarkb	(I suspect something with hash tagging may be the ticket)	15:36
*** tosky has quit IRC		15:36
clarkb	experienced devs consistently hash tag, reviewers can look for those specific tags to review those changes and also look for absence of tags to find new contributors and help them out	15:36
* zbr\|rover wonder what gave the impression that his comments are abusive		15:37
fungi	you suggested we've made decisions to punish reviewr	15:38
fungi	reviewrrs	15:38
fungi	my keyboard needs new fingers	15:38
JayF	You may be missing context that most core reviewers are inundated with unsolicited emails and DMs from people desiring reviews who did not participate at all with the community in the reporting or design phase of whatever work they are doing.	15:39
fungi	the reviewers plugin probably doesn't address your concern, which seems to be that under-supported repositories don't have an easy way for contributors to find reviewers who don't exist	15:39
fungi	but once https://review.opendev.org/724914 merges we can do some trials with projects to see if it helps those who want a more structured way to auto-associate reviewers with reviews: https://gerrit.googlesource.com/plugins/reviewers/+/refs/heads/master/src/main/resources/Documentation/	15:40
clarkb	JayF: I'm not sure that making it easier to lookup the entire core reviewer list makes that better? seems like it would make the shotgun approach easier?	15:42
JayF	That's what I was saying. The suggestion to make the core reviewer list more discoverable could be construed as abusive to already-harried core reviewers.	15:42
clarkb	ah got it, you were addressing it at zbr	15:43
JayF	yes	15:43
JayF	Honestly, 1:1, out of band review requests without asking in a public IRC channel or participating in storyboard/launchpad/etc is the #1 best way to not get your code reviewed.	15:43
clarkb	infra-root I've started trying to dig up zk docs on the proper way to do rolling replacement of zk servers. Haven't had great luck finding official docs but have found a few independent articles and I think I may need to dig into this a bit more before booting servers and adding them. In particular it seems we need to be very careful with the myid values for the new servers. They should not	15:48
clarkb	overlap with old servers (I think this means we want zk04.opendev.org-zk06.opendev.org? need to confirm). Also there appears to be some coordination needed to trigger dynamic reconfiguration of the existing cluster members after adding a new member to the configs. Or we have to restart everything.	15:48
clarkb	my current understanding of what the process looks like is start 04-06. Identify the current leader L and followers F1 and F2. Stop zk on F2 and replace with one of 04-06. Trigger reconfig or restart things and ensure we have quorum and a leader. Repeat.	15:49
fungi	zbr\|rover: there's also https://gerrit.googlesource.com/plugins/reviewers-by-blame/+/refs/heads/master/src/main/resources/Documentation/about.md but i get the impression that doesn't distinguish reviewers with approval rights, and is based more on who contributed changes touching certain lines than who reviewed similar changes in the past	15:50
clarkb	once F1 and F2 are replaced we should be left with a leader. I think we stop zk on that and ensure the other nodes elect a new leader L out of the pool of replaced servers. Then old L is now F3 and can be replaced too	15:50
clarkb	there is also a thing where the ordinality of the myid value also affects behavior. A low id won't join a cluster with higher ids? something like that. I don't think it affects us since we'll have all the new ids higher than the old ids	15:52
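The rolling replacement described above (followers first, old leader last, new ids kept clear of the old ones) can be sketched as a small planning helper. This is only an illustration of the ordering being discussed: the short host names, the myid values 1-3 and 4-6, and both function names are assumptions made for the example, and nothing here talks to a real ZooKeeper ensemble.

```python
def rolling_replacement_plan(leader, followers, new_servers):
    """Return ordered steps: replace each follower first, the old leader last."""
    if len(new_servers) != len(followers) + 1:
        raise ValueError("need exactly one new server per old server")
    replacements = list(new_servers)
    steps = []
    for old in followers:
        new = replacements.pop(0)
        steps.append(f"stop zk on {old}; bring {new} into the ensemble")
        steps.append("reconfigure or restart; confirm quorum and a leader")
    new = replacements.pop(0)
    steps.append(f"stop zk on old leader {leader}; wait for a new leader")
    steps.append(f"bring {new} into the ensemble")
    steps.append("reconfigure or restart; confirm quorum and a leader")
    return steps


def ids_are_safe(old_ids, new_ids):
    """True when new myid values do not overlap old ones and are all higher."""
    no_overlap = not (set(old_ids) & set(new_ids))
    return no_overlap and min(new_ids) > max(old_ids)


if __name__ == "__main__":
    # Hypothetical names: zk02 as current leader, zk04-zk06 as replacements.
    plan = rolling_replacement_plan("zk02", ["zk01", "zk03"],
                                    ["zk04", "zk05", "zk06"])
    for number, step in enumerate(plan, start=1):
        print(number, step)
    print("ids safe:", ids_are_safe([1, 2, 3], [4, 5, 6]))
```

Keeping the plan as plain data makes it easy to paste into an etherpad and review before any server is touched.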
*** dtantsur is now known as dtantsur\|afk		16:02
corvus	clarkb: or you could copy the data over manually	16:05
*** iurygregory_ has joined #opendev		16:06
*** iurygregory has quit IRC		16:06
clarkb	corvus: if we do that it would look something like stop 03.openstack.org, copy 03.openstack.org data to 03.opendev.org, replace 03.openstack.org's IP address with 03.opendev.org's IP address in configs. Trigger reconfiguration or rolling restarts?	16:15
*** hamalq has joined #opendev		16:15
*** iurygregory_ is now known as iurygregory		16:15
*** marios is now known as marios\|out		16:17
corvus	clarkb: i think rolling restarts. if you do that (actually this applies to any process), i'd probably do 2/3 of them before restarting the scheduler, and make sure the last one is already in the config on disk before restarting the scheduler. that way there's only one scheduler restart and it restarts into the final config. scheduler should only need to be able to reach one server.	16:18
*** hamalq_ has joined #opendev		16:19
clarkb	oh hrm I hadn't even considered that the clients may need to be restarted to see the new config, but that makes sense	16:20
*** hamalq has quit IRC		16:20
corvus	yeah i think we have ip addrs in their config too	16:20
clarkb	and ya we want to restart frequently on the cluster side to ensure we are maintaining quorum. I agree the client should be fine as long as one of the quorum members remains in its config	16:20
clarkb	ok, I think my next step is to gather concrete info on what the current cluster looks like, put it in an etherpad and write down some options with the real info	16:23
zbr\|rover	fungi: clarkb: please excuse me if I was not clear regarding my questions about reviewing, I did not want to complain about something done or not done by infra team about our gerrit, I only wanted to find-out if there is something I can help with in order to make that part of the review process easier for other gerrit users.	16:33
*** eolivare has quit IRC		16:33
clarkb	zbr\|rover: I think the first thing to do is communicate the concern upstream. Indicate it can be difficult for new contributors in particular to find out who to work with to get changes reviewed and that maybe gerrit can help with this. It is possible that gerrit already has plugins or other tools that we aren't aware of as well they can point us to	16:34
fungi	on a related note, https://review.opendev.org/724914 is now passing if we're ready for it	16:37
fungi	the next phase will be adding support for etc/reviewers.config files in manage-projects	16:39
zbr\|rover	yep, i commented there, i am offering to help on experimenting with it.	16:39
zbr\|rover	we can use a side-project for that and see exactly how it works.	16:40
*** gothicserpent has quit IRC		16:41
zbr\|rover	i like the fact that it can do ignoreWip, and we should see if suggestOnly proves to be more useful or not.	16:41
zbr\|rover	if suggestOnly would really work fine to bump those with rights on the top it could be an alternative.	16:41
zbr\|rover	what I found curious is that I did not see any options regarding how many reviewers to auto-assign or an option to exclude specific groups	16:42
zbr\|rover	while infra-core has permissions i doubt members of this group want to end up being auto-picked by the plugin just because they happen to be the fallback.	16:43
*** gothicserpent has joined #opendev		16:44
fungi	yeah, suggestOnly unfortunately looks like it would be high-maintenance	16:44
fungi	because it doesn't support suggesting groups, only individuals, which means maintaining a separate list of individuals, and for lots of teams that list would quickly fall stale	16:45
mordred	does the per-project config go in the primary repo of the project? or is it a thing that goes into a refs/ location somewhere	16:45
*** gothicserpent has quit IRC		16:45
mordred	(yeah - it would make way more sense if it could be tied to a group)	16:46
fungi	i expect the main way projects might use it would be to have it auto-add specific groups as requested reviewers on changes matching particular branches or file subpaths (e.g. nova could automatically add a vmware-reviewers group to changes touching the vmware hypervisor backend files)	16:46
mordred	or like - tied to a group but with specific overrides possible - I could see some projects saying "make sure to get files in sub/dir reviewed by $human_a and $human_b"	16:46
mordred	yeah - what you said	16:47
fungi	for many cases i expect teams would still prefer to rely on custom dashboards so reviewers can voluntarily find what they want to review rather than having themselves added to reviews automatically, but different teams have different reviewing habits	16:47
fungi	what triggered the addition of this plugin in the first place was that some of the teams working on starlingx wanted to see if it could help them improve how they're reviewing changes for their projects	16:48
zbr\|rover	the way CODEOWNERS works on github is that it picks a random one or two (based on min reviews config rule) and assigns them. I kind of find it working fine so far but i used it in only one project.	16:49
*** ralonsoh has quit IRC		16:50
zbr\|rover	if everyone from the list is added to each review, i would not see that working in practice. the entire idea is to spread the review-load.	16:50
JayF	I think something that's different about OpenStack vs many other projects is that many projects have a sense of priorities, and working together to ensure something is 100% done instead of having 10 things 10% done. That makes it hard for people outside that upstream process entirely to get code reviewed/paid attention to -- it's almost explicitly not a priority to review that code.	16:51
JayF	That's why "step 1" to getting something merged in almost any OpenStack project is to make the case it's needed, via stories/bugs/mailing list/irc, then once you are there, it gets more easy for folks to review your code.	16:51
JayF	There are very few successful openstack contributors who do not engage with the community in ways other than code.	16:51
TheJulia	Well I just clicked into an interesting discussion :)	16:57
zbr\|rover	JayF: you are right that discussions are highly likely to be needed. Still, there are projects where these may not really be necessary and where it is easy for a valid CR to be ignored just because nobody that can help is notified.	16:58
JayF	If your CR is being ignored, maybe it does need some discussion even if you don't realize it yet. :D	16:58
JayF	Sometimes that happens in the CR itself, but not every team is structured as to that being how it works.	16:58
TheJulia	And sometimes people downright ignore reviewer comments in change requests. :\	16:59
TheJulia	so off the wall side question, is CI becoming unhappy?	16:59
TheJulia	seeing 2nd attempt randomly pop up on an ironic job on things that should have worked fine	17:00
JayF	TheJulia: the tldr of how this started is zbr\|rover was asking for an infra feature to make it easier to ID the core reviewers for a change :) We had to inform them that core reviewers are already getting canned spiced ham chucked at them with high frequency and velocity :D	17:00
TheJulia	ahh yes	17:00
TheJulia	which all goes to the nearly automatic canning machine in our mailboxes	17:01
JayF	(as a side note: I think we've all been on the other side of it too, hoping another project merges our change and you can't get anyone's attention, and it's frustrating, so yes it's a problem, but the proposal is not the solution IMO)	17:01
*** tosky has joined #opendev		17:03
zbr\|rover	i had a recent example from last week where I had to help with a new release of python-jenkins, even if i no longer use the library myself. probably we can ask him about how long it took to find someone that can make a new release (his change was already merged long)	17:04
clarkb	infra-root https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 I've put two options for the zk upgrade in there. Please add text or let me know if I've missed things or if you have preferences	17:04
clarkb	zbr\|rover: I think the issue there was largely that the software was unmaintained, not that gerrit didn't show core reviewer groups	17:05
clarkb	gerrit could add reviewers and it would still be ignored if those individuals are no longer maintaining the software	17:06
fungi	TheJulia: i haven't heard anyone mention a new global issue, we're not making changes right now either, but if you have some examples i'm happy to take a quick look	17:08
zbr\|rover	i usually look at those reviews i am added to, but keep in mind that the issue i mentioned is that the OP may have no clue about who can help or not, regardless on which communication channel he may attempt to use gerrit, irc, mailing list, carrier pigeon.	17:09
clarkb	huh some tripleo jobs show 4 attempts (I thought we capped at 3)	17:10
clarkb	fungi: TheJulia ^ as another data point	17:10
clarkb	considering how widespread that is I wonder if we lost zk connectivity with a launcher?	17:11
clarkb	er no it would have to be the scheduler as the launcher unlocks things once handed to the jobs	17:11
clarkb	2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost	17:14
clarkb	fungi: TheJulia ^ I think that is the cause	17:14
fungi	none of the zk servers is down	17:15
clarkb	I ran echo stat | nc localhost 2181 against all 3 not too long ago to figure out which was leader and which are followers and they all looked happy	17:16
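The same leader/follower check can be scripted rather than typed into nc on each server. The sketch below sends ZooKeeper's `stat` four-letter word over a plain socket and pulls out the `Mode:` line; the function names are made up for this example, the sample text is illustrative rather than captured from these servers, and newer ZooKeeper releases only answer `stat` if it is allowed by the four-letter-word whitelist.

```python
import socket


def send_four_letter_word(host, port=2181, word=b"stat", timeout=5.0):
    """Send a four-letter word to a ZooKeeper server and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(stat_output):
    """Return the value of the 'Mode:' line (e.g. leader/follower), or None."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    # Illustrative reply only; a real one carries more fields.
    sample = "Latency min/avg/max: 0/0/10\nMode: leader\nNode count: 5\n"
    print(parse_mode(sample))
    # Against a live server (assumed to listen locally on 2181):
    # print(parse_mode(send_four_letter_word("localhost")))
```

Looping `send_four_letter_word` over the three server names would give the leader and followers in one pass.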
fungi	all last started on the 27th	17:16
clarkb	seems it has happened 3 times today and not at all in the previous log file	17:16
fungi	so their daemons have been running for a few days	17:16
fungi	nothing in dmesg for any of them since saturday either	17:17
clarkb	"Connection dropped: socket connection broken" appears to be the underlying issue as reported by the zuul side	17:17
clarkb	I wonder if the servers timed out zuul for non responsiveness?	17:17
clarkb	and the client sees that as the socket connection breaking	17:17
fungi	firewall rules for them were last touched on the 18th	17:18
fungi	yeah, could be	17:18
clarkb	on zk02 (the current leader) I see it expiring a session at Mar 30 16:37:14 in syslog	17:20
clarkb	timeout of 40000ms exceeded	17:20
clarkb	I suspect that is what happened though I'm not sure how to confirm via the logged session id	17:20
clarkb	corvus: ^ fyi this may be important performance feedback gathering?	17:20
clarkb	note I haven't changed anything with zk yet, only done info gathering (ran stat command)	17:21
clarkb	oh yup a few log lines later I see zuul01 has reconnected	17:21
fungi	unrelated, i saw a infra-prod-run-cloud-launcher failure where the log on bridge says /home/zuul/src/opendev.org/opendev/system-config/run_cloud_launcher.sh did not exist	17:23
clarkb	fungi: that script was run by the old cron based system and it was removed. Maybe we didn't properly convert over the ansible side?	17:24
fungi	yeah, i found https://review.opendev.org/718799 removed it almost a year ago	17:24
clarkb	or maybe part of the cleanup is written such that if it fails (because it already succeeded in cleaning up) the ansible play fails?	17:25
fungi	d'oh, i'm looking at the wrong logfile	17:25
fungi	--- end run @ 2020-05-07T15:13:58+00:00 ---	17:25
fungi	yeah, that was from when we removed the script	17:26
fungi	i was looking at the log from the cronjob not from the continuous deployment	17:26
fungi	connection timeout to the openedge api. i guess this has been failing for some time	17:27
clarkb	oh right we need to clean that up	17:27
*** lpetrut has joined #opendev		17:28
fungi	i can push it in a bit	17:29
*** hashar is now known as hasharDinner		17:36
fungi	donnyd: should we unconfigure openedge entirely in our systems, or is it likely coming back at some point in the future?	17:44
*** ykarel\|away has joined #opendev		17:46
*** marios\|out has quit IRC		17:49
clarkb	fungi: my understanding was that we should probably unconfigure it, and if we want to add it back that is straightforward to do	17:54
fungi	okay, i'll work on a complete rip-out for now	17:54
*** prometheanfire has quit IRC		17:55
clarkb	zk connectivity was just lost again	17:59
clarkb	(I was running a tail looking for that string)	18:00
*** gothicserpent has joined #opendev		18:00
fungi	huh	18:00
*** gothicserpent has quit IRC		18:01
clarkb	and on zk02 we see the same sort of logs connection for session foo timed out	18:01
clarkb	then a bit later a reconnect from zuul01	18:01
clarkb	the timeout is 40000ms which I think means that zuul's zk connection didn't transfer any data for 40 seconds?	18:01
clarkb	wow zuul is certainly busy it has generated over 100k log lines in about the time since the last disconnect	18:02
fungi	the zk graphs at the end of https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1 show a bunch of drops which i expect are coincident with the timing of the disconnects	18:04
clarkb	going back 200k lines only gets me an extra 2 minutes	18:04
clarkb	the bulk of this seems to be collecting job variants and similar config loading logging	18:04
clarkb	fungi: you can see the zuul event processing time spike around then too	18:07
clarkb	typical seems to be in the ms to second range but then when we see restarts it goes to the minutes range	18:08
clarkb	I wonder if while we are processing events we somehow block the zk connection from ping ponging appropriately	18:08
clarkb	which means if processing an event spikes to > 40s we lose	18:08
clarkb	though we don't disconnect every time we spike and we don't spike every time we disconnect so maybe a bad correlation	18:09
clarkb	I do suspect though that if we work backward from really long event processing time we might find something useful	18:10
TheJulia	clarkb: w/r/t the connection lost, I had a feeling... but it might not be bunnies	18:12
fungi	the job queue and executors accepting graphs do look like what we see when there's a large gate reset or huge change series pushed, but that could also be the result of jobs restarting en masse from a zk disconnect (effect rather than cause)	18:15
*** lpetrut has quit IRC		18:15
clarkb	yes I think that is likely more the symptom than the cause	18:16
*** ykarel\|away has quit IRC		18:16
clarkb	at 17:54:40 we report our last event_enqueue_processing time for a few minutes then the next one says it took 1.3 minutes according to grafana	18:17
clarkb	the next one arrives at 18:01:20	18:17
clarkb	within that period of time we've lost connectivity to zk because it has timed us out for being non responsive for >40s	18:18
clarkb	I strongly suspect something is monopolizing the cpu, but I think the rerun jobs is simply because nodepool has helpfully cleaned the old ones up for us	18:18
clarkb	2021-03-30 18:01:04,090 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 363.146 seconds)	18:21
clarkb	what if it is ^	18:21
clarkb	that also doesn't correlate to every restart though	18:21
clarkb	2021-03-30 15:16:16,061 INFO zuul.Scheduler: Tenant reconfiguration complete for openstack (duration: 252.64 seconds) is another around when we got disconnected	18:21
clarkb	this disconnect doesn't have bad graphs or tenant reconfiguration that takes forever: 2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost	18:22
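One way to test the "long reconfiguration causes the disconnect" theory is to turn each "Tenant reconfiguration complete" line into a start/end window and check whether a "Zookeeper connection lost" timestamp falls inside it. The sketch below does that for the three log lines quoted above; the regular expressions are written against those quoted lines only, and the helper names are assumptions for this example, not part of zuul or kazoo.

```python
import re
from datetime import datetime, timedelta

STAMP = "%Y-%m-%d %H:%M:%S,%f"
RECONFIG = re.compile(
    r"^(\S+ \S+) INFO zuul\.Scheduler: Tenant reconfiguration complete "
    r"for (\S+) \(duration: ([\d.]+) seconds\)"
)
LOST = re.compile(r"^(\S+ \S+) INFO kazoo\.client: Zookeeper connection lost")

SAMPLE_LINES = [
    "2021-03-30 15:16:16,061 INFO zuul.Scheduler: Tenant reconfiguration "
    "complete for openstack (duration: 252.64 seconds)",
    "2021-03-30 16:37:19,580 INFO kazoo.client: Zookeeper connection lost",
    "2021-03-30 18:01:04,090 INFO zuul.Scheduler: Tenant reconfiguration "
    "complete for openstack (duration: 363.146 seconds)",
]


def reconfig_windows(lines):
    """Return (start, end, tenant) for every reconfiguration-complete line."""
    windows = []
    for line in lines:
        match = RECONFIG.match(line)
        if match:
            end = datetime.strptime(match.group(1), STAMP)
            start = end - timedelta(seconds=float(match.group(3)))
            windows.append((start, end, match.group(2)))
    return windows


def disconnects(lines):
    """Return the timestamp of every connection-lost line."""
    return [datetime.strptime(m.group(1), STAMP)
            for m in map(LOST.match, lines) if m]


def overlapping(moment, windows):
    """Return the reconfiguration windows that contain the given moment."""
    return [w for w in windows if w[0] <= moment <= w[1]]


if __name__ == "__main__":
    windows = reconfig_windows(SAMPLE_LINES)
    for lost in disconnects(SAMPLE_LINES):
        print(lost, "inside a reconfiguration:", bool(overlapping(lost, windows)))
```

Run over the full scheduler log, this would show which disconnects sit inside a reconfiguration and which, like the 16:37:19 one, do not.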
clarkb	I am still not fully up to speed around what has changed with zk recently though so I may be looking in the completely wrong location	18:23
*** roman_g has joined #opendev		18:26
fungi	clarkb: internal scheduler events/results queues so far, i think	18:26
fungi	also semaphores	18:27
clarkb	we have had 4 disconnects in the last several hours. three of them have an adjacent openstack tenant reconfiguration that takes 4-6 minutes. There are other long reconfigurations though	18:27
fungi	also change queues are in zk now, looks like	18:28
clarkb	makes me less confident that long reconfiguration is a cause, but it may be another symptom (essentially something is consuming resources and when that happens zk can disconnect and reconfigurations can go long)	18:28
fungi	so most of the things you would think of as internal scheduler state have moved from in-memory data structures to znodes	18:30
clarkb	one thing that makes digging in logs difficult is that we generate a ton of kazoo exceptions after this happens because all the nodes have been cleaned up and zuul can't update the nodes in zk	18:31
clarkb	also 100k log lines per minute	18:31
fungi	cacti graphs for the zk servers don't suggest any system-level resource exhaustion	18:31
fungi	the zuul scheduler is showing some heavy periods of read activity on its disks around that time	18:33
fungi	also we've been seeing a steady rise in used and cache memory on the scheduler, but used memory starts to plateau after we run out of free memory and cache begins to get squeezed (circa 14:00 utc)	18:34
corvus	typically if we see zk disconnects it's due to cpu starvation on the scheduler	18:36
corvus	usually due to swapping	18:36
fungi	cacti says scheduler cpu utilization is nominal. maybe a little higher than usual but still below 20%	18:37
fungi	and no real swap utilization	18:37
clarkb	fungi: it's a many cpu instance and zuul can only use one of them for the scheduler	18:37
fungi	fair point	18:37
clarkb	corvus: ya I suspect that is why reconfiguration and event queue processing is also slow when this occurs but not always	18:37
fungi	also i suppose we could have very brief spikes which don't register in a 5-minute aggregate sample	18:37
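The two effects just mentioned (one busy scheduler process on a many-CPU instance, and brief spikes averaged away in a 5-minute sample) can be sketched with a little arithmetic. The core count, spike length, and baseline below are assumptions for illustration only; the log does not state them.

```python
def host_utilization(busy_cores: float, total_cores: int) -> float:
    """Host-wide CPU utilization (percent) when `busy_cores` cores are fully busy."""
    return 100.0 * busy_cores / total_cores


def windowed_average(spike_seconds: float, window_seconds: float,
                     spike_pct: float, baseline_pct: float) -> float:
    """Average utilization over one sample window containing a single spike."""
    quiet_seconds = window_seconds - spike_seconds
    return (spike_pct * spike_seconds + baseline_pct * quiet_seconds) / window_seconds


# Assumed 8-core host: one fully pegged core still reads well under 20% host-wide.
print(host_utilization(1, 8))

# Assumed 40 s of 100% use inside a 300 s sample with a 10% baseline:
# the aggregate smooths the spike down to a modest figure.
print(windowed_average(40, 300, 100.0, 10.0))
```

Under these assumed numbers a pegged scheduler core and a 40-second stall would both sit comfortably inside a graph that looks nominal.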
clarkb	we're seeing cpu starvation hit a number of things and this is the most prominent as it restarts jobs	18:37
fungi	there is one zuul-scheduler process consuming most of a cpu according to top	18:38
fungi	and also a majority of the system memory	18:38
corvus	i can't seem to get cacti to show me more than 1 day of data	18:39
clarkb	http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all	18:42
clarkb	that does seem to show memory use has significantly grown in the last day or so	18:42
clarkb	maybe we've got a leak that leads to swapping? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64793&rra_id=all	18:43
corvus	yep, that would be my guess	18:43
fungi	except the swap usage graph is nearly empty	18:44
corvus	i think we should restart the scheduler immediately to address the immediate problem and start debugging the leak. i can start debugging tomorrow, but am occupied today.	18:44
fungi	but yes there's a tiny bump up to 233mib around the time of the recent distress	18:44
corvus	fungi: the swap activity graph isn't though, and that can cause pauses	18:44
fungi	right, and if it's the wrong 233mib...	18:44
fungi	that also may explain some of the spikes in disk read activity	18:45
clarkb	I need to start prepping for the meeting, is someone else able to lead the restart?	18:45
fungi	i can do the scheduler restart now	18:45
fungi	did we get the enqueuing fix merged?	18:45
* fungi checks		18:45
fungi	ahh, no, https://review.opendev.org/783556 hasn't been approved yet	18:47
fungi	corvus: what's your guess as to the fallout we'll see from trying to reenqueue things... is there something i should edit in the queue dump beforehand?	18:47
corvus	fungi: in that case just drop the hostname from any enqueue-ref calls	18:47
fungi	perfect, can do, thanks!	18:47
corvus	(drop the hostname from the project name)	18:47
fungi	yep, got it	18:48
fungi	| sed 's, opendev\.org/,' ought to do the trick	18:49
fungi	er, forgot an additional , there	18:49
fungi	and i shouldn't eat the blank space ;)	18:50
fungi	/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's, opendev\.org/, ,' > queues.sh	18:50
fungi	that seems to do the trick	18:50
corvus	fungi: i'd just do that for enqueue-ref not the changes	18:51
fungi	oh, can do	18:51
*** diablo_rojo has joined #opendev		18:52
fungi	looks like the scheduler is very busy again, probably another event	18:52
fungi	/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org | sed 's	18:55
fungi	,^\(zuul enqueue-ref .* --project \)opendev\.org/,\1,' > queues.sh	18:55
fungi	stray newline in the paste, but that's what seems to take care of it	18:55
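The intent of that substitution (drop the opendev.org/ hostname from the project name, but only on enqueue-ref lines) can be sketched in Python as follows. The sample dump lines are made up for illustration and are not taken from the real queue dump.

```python
import re

# Match only lines that start with "zuul enqueue-ref" and capture everything
# up to and including "--project " so it can be kept in the replacement.
ENQUEUE_REF = re.compile(r"^(zuul enqueue-ref .* --project )opendev\.org/")


def strip_hostname(line: str) -> str:
    """Remove the hostname prefix from the project on enqueue-ref lines only."""
    return ENQUEUE_REF.sub(r"\1", line)


# Hypothetical dump lines (illustrative only).
sample = [
    "zuul enqueue-ref --pipeline tag --project opendev.org/openstack/ironic --ref refs/tags/1.0",
    "zuul enqueue --pipeline check --project opendev.org/openstack/nova --change 1,1",
]

for line in sample:
    print(strip_hostname(line))
```

Only the first sample line is rewritten; the plain enqueue line keeps its hostname, which matches the advice to apply the edit to enqueue-ref calls and not to changes.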
fungi	looks like openstack release management had just merged some release requests so there are tag events in the queue. i'm a little hesitant to try bulk reenqueuing those and would rather wait another few minutes for them to hopefully wrap up	18:59
fungi	i let the team know to hold on further release approvals for a bit	18:59
* TheJulia sighs		19:04
fungi	TheJulia: it was for ironic. you can sigh as loudly as you like ;)	19:05
TheJulia	is there an official guestimate on how far the status display trails behind reality?	19:05
TheJulia	fungi: joy	19:05
fungi	TheJulia: top-left corner... "Queue lengths: 399 events, 2 management events, 301 results."	19:05
TheJulia	Now to just get a ci fix merged into ironic.	19:05
fungi	that's basically the internal count of backlogs for those categories of events not yet processed	19:06
fungi	it's not really a time because zuul can't express much about events it hasn't processed yet	19:06
TheJulia	I was meaning more in regards to a running ci job	19:06
fungi	but that's generally the reason for any perceived lag in status reflecting expected reality	19:07
TheJulia	like, I've opened the status display in the past and seen things appear completely done running past log uploads and can't view the console... yet the display says the job is still running for like 15 minutes	19:07
TheJulia	okay, so same queue then	19:07
fungi	the results backlog is how many builds have completed that zuul hasn't reflected the completion result for yet	19:07
TheJulia	AHH!	19:08
TheJulia	okay	19:08
fungi	events is how many things like gerrit comments or new patchsets it has received but not yet processed to figure out if something should be enqueued	19:08
fungi	that sort of thing	19:08
TheJulia	That was what I was figuring but didn't quite grok the meaning of results	19:09
TheJulia	at least in that context	19:09
fungi	and management events are generally reconfiguration	19:11
fungi	so now that the counts are ~0 again, the status info should be reasonably current	19:11
fungi	okay, zuul estimates we're 5 minutes out from being done with the last releases site update, so i'll plan to restart the scheduler and reenqueue everything as soon as that finishes	19:49
fungi	okay, all the reasonably critical openstack release jobs have wrapped up, i'll check the other tenants quickly	19:59
ianw	clarkb: likely to get caught in a restart, but is https://review.opendev.org/c/opendev/system-config/+/783778 to add a key for review02 to r/o copy gerrit data OK? not sure if it was done like that in the past	20:00
fungi	seems we're all clear. grabbing a corrected queue dump per earlier bug discussion, then restarting	20:00
ianw	the other one that is ready is https://review.opendev.org/c/opendev/system-config/+/775961 to add a mariadb container; upstream has merged the required fixes to stable now	20:00
clarkb	ianw: I think review-test did similar but in the opposite direction	20:01
fungi	i'll reenqueue once the cat jobs are finished	20:02
clarkb	I would cross check against what review-test did, though I think we may have used forwarded keys temporarily (with agent confirmation)	20:02
clarkb	fungi: ^ when you are done restarting things maybe you want to look at that too	20:02
ianw	clarkb: hrm ok. i liked the idea the wrapper could do read-only	20:03
ianw	mostly so i don't type the wrong thing :)	20:03
clarkb	oh is that what rrsync is? I was just about to go and read up on it :)	20:04
*** Dmitrii-Sh has joined #opendev		20:04
*** prometheanfire has joined #opendev		20:04
ianw	i'm not going to claim great knowledge, but i just googled "read only rsync" and this seems to be the solution	20:04
clarkb	ya seems like it causes rsync run using that key to be restricted to read only access with a "chroot" path as set	20:07
fungi	starting to reenqueue everything now, 288 itens	20:10
clarkb	ianw: any idea why rsync doesn't just install that script?	20:10
fungi	items	20:10
clarkb	ianw: otherwise ya I think this looks ok	20:10
fungi	looks like i may need to restart zuul-web too	20:10
ianw	clarkb: not really, seems it ships it in some form standard. maybe it's a bit obscure for /usr/bin i guess	20:10
clarkb	fungi: ya that is normal iirc	20:11
johnsom	FYI,	20:11
fungi	done, that seems to have cleared the "proxy error"	20:11
johnsom	https://www.irccloud.com/pastebin/35ySboyv/	20:11
clarkb	ianw: I've +2'd it but I think it would be good to have fungi look it over too	20:12
fungi	johnsom: check again	20:12
fungi	seems we often need to restart the zuul-web daemon any time we restart the scheduler	20:12
johnsom	Yeah, loading now.	20:12
*** hasharDinner has quit IRC		20:20
*** slaweq has quit IRC		20:26
fungi	#status log Restarted the Zuul scheduler to address problematic memory pressure, and reenqueued all in flight changes	20:32
openstackstatus	fungi: finished logging	20:32
clarkb	2021-03-30 18:52:06,525 INFO kazoo.client: Zookeeper connection lost <- last logged connection lost if we need to compare notes later	20:32
clarkb	memory use looks much better now though so I suspect we've got a while before it happens again	20:33
*** roman_g has quit IRC		20:52
*** artom has quit IRC		21:00
clarkb	I've realized I tested the always_update path in the gitea job which means we should already have data for the gitea memory usage with dstat	21:00
*** artom has joined #opendev		21:00
clarkb	I'll also push up a change that doesn't use tokens so we can compare between those as well	21:03
clarkb	actually we may already have that in https://review.opendev.org/c/opendev/system-config/+/781776	21:04
clarkb	ya https://zuul.opendev.org/t/openstack/build/143900b189f84e1296d66d60159c1c87/log/gitea99.opendev.org/dstat-csv.log is 1.13.6 + passwd auth, https://zuul.opendev.org/t/openstack/build/08ebc2b8c7344473bbc6f4790b26b416/log/gitea99.opendev.org/dstat-csv.log is 1.13.1 + token auth and soon enough we should have 1.13.6 + tokens as well	21:05
*** artom has quit IRC		21:10
*** artom has joined #opendev		21:11
ianw	memory usage looks sane with 143900b189f84e1296d66d60159c1c87	21:39
*** whoami-rajat has quit IRC		21:51
fungi	back for a bit. had some impromptu guests so sitting out on the deck and the sun is making it very hard to see the screen, even at maximum backlight	21:52
*** artom has quit IRC		21:53
*** artom has joined #opendev		21:55
*** artom has quit IRC		21:56
*** artom has joined #opendev		21:59
ianw	hrm, weirdly i think the id_rsa.pub we install for gerrit2 isn't valid	22:27
ianw	we also seem to have lost gerritbot	22:34
*** brinzhang_ has quit IRC		22:35
*** brinzhang_ has joined #opendev		22:36
*** dpawlik has quit IRC		22:37
*** osmanlicilegi has quit IRC		22:42
*** otherwiseguy has quit IRC		22:42
*** amoralej|off has quit IRC		22:42
*** Jeffrey4l has quit IRC		22:42
*** openstackstatus has quit IRC		22:42
*** dpawlik0 has joined #opendev		22:42
*** janders1 has joined #opendev		22:42
*** openstack has joined #opendev		22:43
*** ChanServ sets mode: +o openstack		22:43
ianw	it would be better to have ansible managing these keys than have them as secrets	22:44
ianw	i guess it means we'd need to reference the review server during gitea runs. that's probably a pita	22:45
*** sboyron has quit IRC		23:02
clarkb	ianw: we could maybe put them as individual secrets in a single location like a shared group var file	23:04
clarkb	ianw: re gitea looking at the dstat results 1.13.6 + passwd seems to be about ~900MB peak memory usage and 1.13.6 + token is about 760MB	23:12
clarkb	ianw: do you know what it was with password on 1.13.1?	23:13
*** tosky has quit IRC		23:14
ianw	clarkb: i think https://imgur.com/a/YrlDNcd would have been 1.13.1	23:16
ianw	Gitea v#033[1mv1.13.1# ... https://zuul.opendev.org/t/openstack/build/cfcd32fa1b27407ab61f5b44be83f6fc/ ... that is a passwd one that ran OOM	23:18
clarkb	wow that is quite a difference	23:19
ianw	yeah it goes bananas	23:19
fungi	did my gerritbot fix ever merge?	23:19
clarkb	fungi: I thought you said it did the same thing again?	23:20
clarkb	no wait I'm mixing jeepyb and gerritbot and gerritlib	23:20
fungi	no, that was jeepyb	23:20
fungi	yeah	23:20
* fungi checks open changes		23:20
fungi	okay 781920 merged 10 days ago	23:21
* fungi checks to see if 3cefaa8 is what we have installed		23:22
fungi	oh, it'll be a container	23:22
fungi	quick batman, to dockerhub	23:22
clarkb	ianw: that makes me think that token auth is less urgent, though it may help a bit anyway	23:22
fungi	https://hub.docker.com/r/opendevorg/gerritbot/tags?page=1&ordering=last_updated says updated 11 days ago so roughly right	23:23
fungi	in theory 781919 should have been sufficient to fix the message-too-long disconnects at least but both merged at about the same time	23:25
fungi	previous changes to merge were in january anyway	23:25
fungi	so that's gotta be a sufficient container image	23:25
fungi	image digest of 21def9f40d85 according to dockerhub	23:26
clarkb	the current gerritbot has been running for 48 minutes	23:26
clarkb	not sure how long the prior instance had been running	23:26
fungi	2021-03-30 11:21:38 <-- openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (Ping timeout: 260 seconds)	23:27
fungi	that looks like a different behavior anyway	23:27
fungi	normally it would have been a quit if it was the known bug	23:27
clarkb	changing servers or something is what it said before with the issue you fixed iirc	23:28
fungi	yep	23:28
fungi	11:21:38-260 would be 11:17:18	23:30
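That subtraction can be checked with a couple of lines of Python. The timestamp and the 260-second ping timeout come from the quit message quoted above; the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Time the quit was logged, and the ping timeout reported with it.
quit_time = datetime.strptime("2021-03-30 11:21:38", "%Y-%m-%d %H:%M:%S")
ping_timeout = timedelta(seconds=260)

# Roughly when the server last heard from the bot.
last_seen = quit_time - ping_timeout
print(last_seen.strftime("%H:%M:%S"))  # 11:17:18
```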
clarkb	fungi: have we tracked down the jeepyb gerritlib thing yet? I should probably page that in and give it a proper look if not	23:30
clarkb	I wonder if it is related to depends on post merge	23:31
ianw	clarkb: yeah, if production has switched itself (or, because it was started before they switched?) away from argon2 then probably ok	23:31
clarkb	ianw: ya the db reports pbkdf2	23:31
fungi	log says it announced 778572 in #openstack-charms 7 seconds before that at 11:17:11	23:31
fungi	nothing else jumping out at me	23:31
fungi	pbkdf2 would be a much more standard kdf, yeah	23:32
fungi	clarkb: i haven't dug deeper into the jeepyb integration test with stale gerritlib ref, no	23:32
ianw	clarkb: it's probably always been that right? i'm not sure when they switched to argon2. that would hit us in the gate, as we start fresh, but not production	23:33
ianw	it switched in 1.13.0	23:34
clarkb	fungi: just as a sanity check cloning gerritlib and pip installing from source gives me Created wheel for gerritlib: filename=gerritlib-0.10.1.dev1-py3-none-any.whl	23:35
clarkb	which is definitely not the 0.10.0 we see in the job	23:35
fungi	https://review.opendev.org/782538 is the one with the weird stale gerritlib ref	23:35
clarkb	ianw: ya I'm not sure if they ported pbkdf2 to argon2 then back again or if we just always were pbkdf2	23:35
fungi	just a sec, need to move myself to a different room	23:35
clarkb	Created wheel for gerritlib: filename=gerritlib-0.10.0-py3-none-any.whl is what the job does	23:36
clarkb	which does strongly imply the job is seeing the 0.10.0 ref not the new one	23:36
ianw	clarkb: yeah, there's no db updates included in the swap afaics. i'd say we've always been on pbkdf2 in production	23:37
clarkb	fungi `ze08:/var/log/zuul$ grep ed1e29cb099a4c139b257bb4f57f5c30 executor-debug.log.1` that is what I'm looking at now	23:38
clarkb	ianw: ok, so we may still have trouble with the description updates, should probably leave them off in that case	23:38
clarkb	2021-03-29 22:18:15,805 INFO zuul.Merger: [e: ed1e29cb099a4c139b257bb4f57f5c30] [build: f10d0a0281c043678acf4393f2629810] Skipping updating local repository gerrit/opendev/gerritlib	23:40
clarkb	I wonder if that is the clue we need	23:40
fungi	i wonder why it skipped that	23:41
ianw	clarkb: yeah; i guess my immediate issue was the gitea jobs failing gate constantly	23:41
clarkb	fungi: the isUpdateNeeded() method in zuul merger seems to check if each of the refs exists and that each of the revs exist	23:43
fungi	yep, was just tracing that	23:43
clarkb	fungi: I wonder if this is a case where maybe if a change is fast forwardable we don't necessarily update the branch ref to point to a rev because the branch is already there as is the rev?	23:43
clarkb	now to see what the gerritlib change looks like	23:43
clarkb	ya no merge commit	23:44
clarkb	so maybe that is what is going on here?	23:44
*** gothicserpent has joined #opendev		23:44
fungi	that would be really odd	23:45
clarkb	the code is given a list of (ref, rev) tuples but then we check each component separately and not that one points to the other if I am reading this correctly	23:45
clarkb	does the ref exist? yes ok good. Does the rev exist? yes ok good. We should be checking does ref point at rev? (I think)	23:46
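The suspected gap can be illustrated with a toy model. This is not Zuul's merger code: the function names, the dict/set representation of a repository, and the placeholder SHAs are all hypothetical, and it only models the hypothesis being discussed (ref and rev checked separately versus checking that the ref points at the rev).

```python
from typing import Dict, Iterable, Set, Tuple

RepoState = Iterable[Tuple[str, str]]  # (ref, rev) pairs the build expects


def update_needed_componentwise(refs: Dict[str, str], commits: Set[str],
                                wanted: RepoState) -> bool:
    """Hypothesised behaviour: ref and rev are each checked for existence only."""
    for ref, rev in wanted:
        if ref not in refs or rev not in commits:
            return True
    return False


def update_needed_strict(refs: Dict[str, str], commits: Set[str],
                         wanted: RepoState) -> bool:
    """Proposed behaviour: the ref must actually point at the expected rev."""
    for ref, rev in wanted:
        if rev not in commits or refs.get(ref) != rev:
            return True
    return False


# Toy local repo: master still points at the old commit, but the new commit
# object is already present (e.g. fetched earlier for a fast-forwardable change).
local_refs = {"refs/heads/master": "old_sha"}
local_commits = {"old_sha", "new_sha"}
wanted_state = [("refs/heads/master", "new_sha")]

print(update_needed_componentwise(local_refs, local_commits, wanted_state))
print(update_needed_strict(local_refs, local_commits, wanted_state))
```

In this toy case the component-wise check reports that no update is needed while the strict check asks for one, which is the kind of stale-ref outcome being described.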
fungi	also nothing about isUpdateNeeded() has changed in two years	23:46
fungi	according to git blame	23:46
fungi	so if that's really the bug it's been lying latent for a very long time and i would expect us to have hit it frequently	23:47
clarkb	fwiw git show master shows https://opendev.org/opendev/gerritlib/commit/99136e1e164baa7b1d9dac4f64c5fb511b813c19 and git show 874d4776eee5ae7c8c15debbb9e943110be299dd shows your commit	23:47
clarkb	I agree I would've expected us to hit this more if it's been there a long time. Maybe the calling code changed?	23:48
clarkb	maybe we always did an unconditional update until recently for things that had branch moves?	23:49
* clarkb goes back to zuul history		23:49
clarkb	heh I was the last one to change that function	23:50
fungi	see, now i get to blame you ;)	23:50
clarkb	I agree I'm not seeing any recent changes that scream "this is related". 75d5dc2289d4b85b5e2d721d3fdbafdf5779e02b whispers it	23:51
clarkb	I would say based on the logging and the actual repo state that we are not updating it as expected though	23:52
clarkb	and I strongly suspect isUpdateNeeded() needs to check that ref points at rev	23:52
fungi	this seems like something we could unit test in zuul	23:53
clarkb	hrm getRepoState did change to do the buildset consistent revs	23:53
clarkb	that produces the input to isUpdateNeeded I think, maybe that is related?	23:54
clarkb	fungi: that is a good point	23:54
*** openstackgerrit has joined #opendev		23:57
openstackgerrit	Merged opendev/system-config master: gerrit: remove mysql-client-core-5.7 package https://review.opendev.org/c/opendev/system-config/+/783769	23:57
openstackgerrit	Merged opendev/system-config master: launch-node : cap to 8gb swap https://review.opendev.org/c/opendev/system-config/+/782898	23:57
openstackgerrit	Merged opendev/system-config master: dstat-logger: redirect stdout to /dev/null https://review.opendev.org/c/opendev/system-config/+/782868	23:57
