*** ysandeep|away is now known as ysandeep | 00:04 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-git-repos: update deprecated API path https://review.opendev.org/741562 | 00:05 |
fungi | okay, so after reading the change in their gerrit and https://sourceware.org/bugzilla/show_bug.cgi?id=20358 it fixes, i gather the idea is that in the future applications should be doing their own dnssec validation instead of trusting any resolver to tell them whether a record is valid, but as a workaround there is now a stub resolver option to enable the old behavior of trusting the configured resolver's | 00:06 |
openstack | sourceware.org bug 20358 in network "RES_USE_DNSSEC sets DO; should also have a way to set AD" [Normal,Resolved: fixed] - Assigned to fweimer | 00:06 |
fungi | evaluation of the record validity | 00:06 |
fungi | seems mildly premature if no applications are actually checking rrsigs themselves, but what do i know | 00:07 |
fungi | glibc 2.31 has essentially broken existing dnssec for anyone not savvy enough to know to set that option on every single client system | 00:08 |
ianw | fungi: but it's probably not good to be trusting your isp's resolvers? | 00:14 |
ianw | Archive name: borg-backup-test01-2020-07-17T00:11:29 | 00:14 |
ianw | Archive fingerprint: 3c0535de9273f2e132ee02eae0d2458db9ae6fd46cdd92a361199f7a0a261c82 | 00:14 |
ianw | Time (start): Fri, 2020-07-17 00:11:30 | 00:14 |
ianw | Time (end): Fri, 2020-07-17 00:11:37 | 00:14 |
ianw | Duration: 7.64 seconds | 00:14 |
openstackgerrit | Clark Boylan proposed opendev/jeepyb master: Set repo HEAD on gerrit project creation https://review.opendev.org/741279 | 00:14 |
ianw | pretty cool, hosts backing themselves up during CI to test the full path | 00:15 |
clarkb | the git operations in there could probably use double checking. They seem to work with local testing on test repos but I'm not sure if those are the best options available to us | 00:15 |
fungi | ianw: it's definitely not good to be trusting your isp's resolvers, but that's why you should have a local validating resolver | 00:17 |
fungi | i trust the instance of unbound running on my openbsd firewall | 00:17 |
ianw | right, but you have to tell glibc you trust that ... so their change is probably generically correct? | 00:18 |
fungi | i get that if i roam some portable device onto coffee shop wifi i shouldn't suddenly start trusting their resolver, sure, but i generally don't trust anything about that internet connection | 00:18 |
fungi | anyway, it looks like nm has also grown an option for you to associate the trust-ad toggle with specific network profiles, so i'll likely use that to prevent needing to constantly fiddle resolv.conf | 00:20 |
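For reference, the workaround fungi describes is a one-line stub resolver option (glibc 2.31 and later); a minimal sketch of the system-wide form:

```
# /etc/resolv.conf -- restore the pre-2.31 behavior of trusting the
# AD (authenticated data) bit set by the configured validating resolver:
options trust-ad
```

The per-profile NetworkManager equivalent should be something like `nmcli connection modify <profile> ipv4.dns-options trust-ad` (property name per nm-settings(5); worth double-checking against your NM version).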
clarkb | 2020-07-17 00:21:23.448189 | ubuntu-bionic | 2020-07-17 00:21:23,447: jeepyb.utils - INFO - Executing command: git --git-dir=/tmp/jeepyb-cache/test/test-repo-2/.git --work-tree=/tmp/jeepyb-cache/test/test-repo-2 push ssh://localhost:29418/test/test-repo-2 HEAD:refs/heads/main that looks better | 00:26 |
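The push in that log line can be exercised locally without a gerrit; a minimal sketch (throwaway temp paths, not jeepyb's actual code) of the two steps the jeepyb change needs: push the initial content to refs/heads/main, then point the remote's HEAD at it.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# a bare repo standing in for the gerrit-hosted project
git init -q --bare remote.git
git init -q work
cd work
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
# push whatever HEAD points at into the remote's main branch
git push -q ../remote.git HEAD:refs/heads/main
# make fresh clones of the remote check out main by default
git --git-dir=../remote.git symbolic-ref HEAD refs/heads/main
git --git-dir=../remote.git symbolic-ref HEAD
```

The final command confirms the remote's HEAD now resolves to refs/heads/main regardless of what the local default branch name was.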
fungi | it's just annoying that dnssec was finally starting to achieve some degree of widespread penetration, even if it relied on mechanisms from rfc 2535 §6.1 which didn't validate the last hop between the client and its configured recursive resolver... to suddenly break that everywhere and ask users to evaluate whether their configured resolvers are trustworthy is basically going to undo all of that and we'll go | 00:29 |
fungi | back to basically everyone relying on unvalidated dns responses instead, which is far worse | 00:29 |
fungi | this is the sort of change which governments sneak into cryptographic systems to keep them unusable by almost everyone | 00:30 |
ianw | fungi: i thought the main problem with dnssec was that you have zone exposure | 00:31 |
ianw | clarkb/corvus: ok, i'm reading about append-only mode : https://borgbackup.readthedocs.io/en/stable/usage/notes.html#append-only-mode | 00:32 |
*** ryohayakawa has joined #opendev | 00:32 | |
ianw | it does seem like what we want | 00:32 |
ianw | i think that if we restrict the "command=" on the backup server side to "borg serve --append-only" we get what we want | 00:33 |
ianw | that leaves us with the possibility to have something running on the backup server itself that might prune the repos periodically | 00:33 |
clarkb | oh interesting they can be treated differently based on client | 00:33 |
ianw | it seems to be a bug or a feature : https://github.com/borgbackup/borg/issues/3504 | 00:36 |
ianw | i think feature, although you could argue the client commands should make it more obvious what's going on | 00:36 |
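The `command=` restriction being discussed would look roughly like this in the backup user's `~/.ssh/authorized_keys` on the backup server; the repo path and key are placeholders, and the pattern follows the append-only example in the borg documentation:

```
command="borg serve --append-only --restrict-to-path /opt/backups/host01",restrict ssh-ed25519 AAAA...example host01-backup-key
```

With this in place the client can only ever append; pruning has to happen out-of-band on the backup server itself, as ianw suggests.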
fungi | ianw: as in that the sequence of signatures allows you to bisect records and effectively find records in a zone which you wouldn't otherwise be aware of? pretty sure that got solved by rrsig range responses, but it's been a while since i read up on that issue | 00:37 |
fungi | something to do with signed nxdomain responses if memory serves | 00:37 |
ianw | fungi: this is what i was thinking of https://blog.cloudflare.com/dnssec-complexities-and-considerations/ | 00:39 |
fungi | ianw: aha, yeah, on-line signing is the mitigation i was thinking of | 00:44 |
fungi | but really, the ultimate mitigation is "don't put secret data into the public dns" | 00:45 |
fungi | if you think about the system dns replaced, the shared hostfile, everyone got a copy of every hostname on the internet. dns wasn't designed to keep records secret either, people just started relying on the side effect that refusing axfr requests forced people to brute-force guess your records to find them | 00:46 |
fungi | as for the nxdomain range responses, the original dnssec design didn't even try to obscure the ranges by hashing the zone record ordering | 00:50 |
fungi | that got tacked on later when people raised objections to the fact that information they were putting in public dns was *gasp* public | 00:51 |
ianw | i think clearly the solution is to move dns into the bitcoin blockchain | 00:55 |
ianw | speaking of keys, there's an internal RH server i've been signing into for about 7 years just fine with a kerberos ticket that has suddenly stopped working with | 00:58 |
ianw | debug1: Unspecified GSS failure. Minor code may provide more information | 00:58 |
ianw | KDC has no support for encryption type | 00:58 |
ianw | googling that just gets more and more confusing | 00:59 |
*** Eighth_Doctor is now known as Conan_Kudo | 01:01 | |
*** Conan_Kudo is now known as Eighth_Doctor | 01:02 | |
corvus | ianw: blockchain dns is a thing: https://handshake.org/ | 01:05 |
corvus | i think it's not completely insane, aside from the whole proof-of-work-energy-consumption-destroying-the-environment thing that all blockchain solutions share | 01:07 |
corvus | ianw: yeah, that append only option looks good | 01:10 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 01:13 |
ianw | ^ that should implement it. i expect testinfra to pass for that, which does a full backup cycle for the two test hosts, which is pretty cool | 01:14 |
ianw | todo is update documentation, and probably something to configure per-host backup locations | 01:15 |
*** ysandeep is now known as ysandeep|away | 01:49 | |
ianw | https://917e391602178bc40e5f-1cbf7c2bad1b53c710605a1cfc31790e.ssl.cf1.rackcdn.com/741366/14/check/system-config-run-borg-backup/d512c68/bridge.openstack.org/test-results.html is nice for testinfra | 02:05 |
*** shtepanie has quit IRC | 02:11 | |
*** rh-jelabarre has quit IRC | 02:20 | |
*** sgw1 has quit IRC | 02:28 | |
*** sgw1 has joined #opendev | 02:29 | |
*** sgw1 has quit IRC | 02:33 | |
*** sgw1 has joined #opendev | 02:59 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 03:12 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 03:35 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 04:04 |
*** sgw1 has quit IRC | 04:34 | |
*** xiaolin has quit IRC | 05:34 | |
*** xiaolin has joined #opendev | 05:36 | |
*** xiaolin has quit IRC | 05:43 | |
*** xiaolin has joined #opendev | 05:55 | |
*** ysandeep|away is now known as ysandeep | 06:09 | |
*** ysandeep is now known as ysandeep|rover | 06:24 | |
*** qchris has quit IRC | 06:25 | |
*** marios has joined #opendev | 06:47 | |
*** markPilon has joined #opendev | 06:52 | |
*** qchris has joined #opendev | 06:53 | |
*** xiaolin has quit IRC | 06:55 | |
*** xiaolin has joined #opendev | 06:57 | |
openstackgerrit | vinay kumar muddu proposed openstack/diskimage-builder master: Fixes nit in DIB_IPA_CERT certificate copy https://review.opendev.org/741583 | 06:58 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add borg-backup roles https://review.opendev.org/741366 | 06:58 |
ianw | infra-root: ^ ready for review; i think more or less everything we discussed is mentioned in the changelog | 06:59 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 07:00 |
*** calcmandan has quit IRC | 07:26 | |
*** xiaolin has quit IRC | 07:26 | |
*** calcmandan has joined #opendev | 07:26 | |
*** dougsz has joined #opendev | 07:27 | |
*** DSpider has joined #opendev | 07:29 | |
*** tosky has joined #opendev | 07:33 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:03 | |
*** avass has quit IRC | 08:17 | |
*** fressi has joined #opendev | 08:28 | |
*** fressi_ has joined #opendev | 08:32 | |
*** fressi has quit IRC | 08:33 | |
*** fressi_ is now known as fressi | 08:33 | |
*** roman_g has joined #opendev | 08:39 | |
*** ysandeep|rover is now known as ysandeep|lunch | 08:41 | |
*** dtantsur|afk is now known as dtantsur | 09:19 | |
*** ysandeep|lunch is now known as ysandeep|rover | 09:27 | |
*** marios has quit IRC | 09:38 | |
*** fressi has quit IRC | 09:42 | |
*** fressi has joined #opendev | 09:46 | |
openstackgerrit | vinay kumar muddu proposed openstack/diskimage-builder master: Fixes DIB_IPA_CERT certificate copy issue https://review.opendev.org/741583 | 09:49 |
*** tkajinam has quit IRC | 09:52 | |
*** dtantsur is now known as dtantsur|brb | 10:07 | |
*** ryohayakawa has quit IRC | 10:11 | |
*** sshnaidm|afk is now known as sshnaidm|off | 10:16 | |
*** markPilon has quit IRC | 10:25 | |
*** fressi has quit IRC | 10:29 | |
*** fressi has joined #opendev | 10:41 | |
*** marios has joined #opendev | 10:42 | |
*** ysandeep|rover is now known as ysandeep|afk | 11:11 | |
*** fressi has quit IRC | 11:20 | |
*** ysandeep|afk is now known as ysandeep | 11:31 | |
*** avass has joined #opendev | 11:33 | |
*** ysandeep is now known as ysandeep|rover | 11:34 | |
*** rh-jelabarre has joined #opendev | 12:07 | |
*** dtantsur|brb is now known as dtantsur | 12:10 | |
*** ysandeep|rover is now known as ysandeep|coffee | 12:19 | |
*** ysandeep|coffee is now known as ysandeep|rover | 12:40 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 13:02 |
*** fressi has joined #opendev | 13:04 | |
*** fressi has quit IRC | 13:44 | |
*** sgw1 has joined #opendev | 13:48 | |
*** marios has quit IRC | 13:53 | |
*** marios has joined #opendev | 13:57 | |
*** ysandeep|rover is now known as ysandeep|away | 13:58 | |
*** mlavalle has joined #opendev | 14:04 | |
sgw1 | Morning, is there a known issue with opendev.org being really slow this morning | 15:11 |
sgw1 | ?? | 15:11 |
* fungi checks graphs | 15:12 | |
fungi | it doesn't look like our ddos crawler is back at least | 15:13 |
fungi | sgw1: yours is the first report i've heard | 15:13 |
sgw1 | might be a slow issue with corp proxies | 15:13 |
fungi | slow rendering? slow cloning? | 15:13 |
fungi | https://opendev.org/openstack/nova/ displays very quickly for me at least | 15:14 |
sgw1 | It's something on my end | 15:14 |
sgw1 | someone else just confirmed it works ok for them, sorry for the noise | 15:15 |
fungi | no worries and thanks for checking! | 15:15 |
sgw1 | an excuse to say Morning to all of you! | 15:15 |
fungi | and a very merry friday to you as well! | 15:34 |
*** mlavalle has quit IRC | 15:40 | |
*** mlavalle has joined #opendev | 15:41 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:11 |
*** marios is now known as marios|out | 16:12 | |
*** dtantsur is now known as dtantsur|afk | 16:13 | |
clarkb | ildikov reported that https://meetpad.opendev.org/stx-build fails to load the etherpad. I've confirmed that other etherpads work and this one fails. It seems to fail with a cross origin request except I have no idea how to get chrome and/or firefox to tell me what the requested url is | 16:13 |
clarkb | I half suspect something to do with the etherpad state itself | 16:14 |
clarkb | (the etherpad is an older one with a large ish number of revisions) | 16:14 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:14 |
clarkb | does anyone else have better insight into what the browser is doing there or know how to manipulate the developer tools to do so? I've tried firefox and chrome and both are lacking what was actually requested from what I can see (though ff shows a slightly different error) | 16:15 |
clarkb | on firefox I see pad.js is trying to warn the console that it cannot set the author id for some reason and that is treated as a cross origin request? | 16:19 |
clarkb | Uncaught DOMException: Permission denied to access property "console" on cross-origin object | 16:19 |
clarkb | ok that happens after loading some user info so maybe it is related to pad state | 16:20 |
*** marios|out has quit IRC | 16:23 | |
clarkb | it is calling top.console.warn() | 16:23 |
clarkb | I guess top is a cross origin resource? | 16:23 |
clarkb | I need to step out for a bit but will try and sort out what top is I guess | 16:23 |
fungi | fwiw on other pads, even large ones, i still see some cors blocking show up in the web console, but the pad still loads for me | 16:24 |
clarkb | ya I think the issue is this cross origin request happens in the pad loading of the content | 16:25 |
clarkb | so it doesn't complete. In other contexts you won't get other bits but the pad text loads | 16:25 |
*** dougsz has quit IRC | 16:35 | |
clarkb | https://github.com/ether/etherpad-lite/commit/00b6a1d9feae2399c08b42b7a3d711aed2d87a73 | 16:35 |
clarkb | we have a winner | 16:35 |
clarkb | I guess we cherrypick that fix into our image? | 16:36 |
fungi | huh, good find! | 16:37 |
fungi | and that didn't make it into 1.8.4? | 16:38 |
*** ildikov has joined #opendev | 16:38 | |
clarkb | no missed it by about 2 weeks | 16:38 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:41 |
fungi | oh, right, i forgot we held off upgrading for a while | 16:42 |
clarkb | I'm not sure how to cherrypick that into upstream's image though | 16:42 |
clarkb | I guess this is the downside of not installing it ourselves | 16:43 |
fungi | i thought you had worked out a custom built image with your fix for the author colors overlap a little while back? | 16:43 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix certificate issue with use buildset registry https://review.opendev.org/741584 | 16:43 |
clarkb | fungi: sed on the prod files :/ | 16:44 |
fungi | pj | 16:44 |
fungi | er, oh | 16:44 |
clarkb | I mean we can do that here too | 16:44 |
clarkb | s/top.console/\/\/top.console/ | 16:44 |
clarkb | we can probably make a patch file and apply that | 16:48 |
clarkb | but I need to boot the etherpad image locally and figure it out | 18:48 |
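clarkb's substitution, run against a stand-in file (the real target would be etherpad's minimized pad.js inside the image; the filename and content here are hypothetical):

```shell
# comment out the offending top.console call the way the proposed
# sed substitution would
printf 'top.console.warn("failed to set author");\n' > pad.js
sed 's|top\.console|//top.console|' pad.js > pad.js.patched
cat pad.js.patched
```

Using `|` as the sed delimiter avoids having to escape the `/` characters in the `//` comment marker.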
clarkb | currently trying to sort out how the minimized sources happen. It's almost like node does it on the fly? | 17:18 |
clarkb | since the installation it does is actually to symlink back to itself in the source dir | 17:19 |
clarkb | rather than do an install of minimized content (but what the browser gets is definitely minimized compared to what is in the source dir) | 17:19 |
*** owalsh has quit IRC | 17:21 | |
clarkb | ya I think that is the case which makes this a little simpler | 17:22 |
*** owalsh has joined #opendev | 17:42 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Patch etherpad console logging to fix cross origin error https://review.opendev.org/741692 | 17:49 |
clarkb | fungi: ^ something like that maybe? I think if that passes testing the next thing to do is a followon change that forces a failure and do a hold and test with the held node? | 17:49 |
*** chandankumar is now known as raukadah | 18:02 | |
*** roman_g has quit IRC | 18:06 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: DNM force etherpad failure to hold node https://review.opendev.org/741698 | 18:24 |
*** johnsom has quit IRC | 18:24 | |
clarkb | fungi: can patch apply the git diff? | 18:24 |
clarkb | I would've expected the additional git metadata to be a problem | 18:25 |
*** johnsom has joined #opendev | 18:25 | |
clarkb | also how does it know which files to apply to when the paths are git specific? | 18:25 |
fungi | clarkb: yeah, git's diffs are totally compatible with the patch utility's unified diff handling (or always has been when i've tried it) | 18:25 |
clarkb | fungi: on a single file level I would expect that to be the case because you can do patch this_file patch_file | 18:26 |
fungi | but agreed, if the deployed file relationships are not the same as in the repository (at least relative to some parent directory) then that gets problematic for multi-file diffs | 18:26 |
clarkb | but with multiple files you'd need the files to be named properly? I mean it may work by magic and a/ b/ is a thing | 18:27 |
clarkb | fungi: well when git does it it does an a/ b/ prefix | 18:27 |
fungi | if they're all in the same relative locations/names though you can just use the -p option to tell it how many levels to prune | 18:27 |
clarkb | which won't exist on disk when patch runs | 18:27 |
clarkb | oh thats the magic | 18:27 |
clarkb | TIL | 18:27 |
fungi | patch -p1 < some.patch | 18:27 |
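A quick demonstration of that: git's diffs carry `a/` and `b/` path prefixes plus extra header lines, and plain patch(1) copes with both once `-p1` strips the first path component (throwaway files in a temp dir):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
printf 'hello\n' > greeting.txt
# a git-style diff, complete with the extra "diff --git"/"index" headers
cat > some.patch <<'EOF'
diff --git a/greeting.txt b/greeting.txt
index ce01362..177ebfc 100644
--- a/greeting.txt
+++ b/greeting.txt
@@ -1 +1 @@
-hello
+goodbye
EOF
patch -p1 < some.patch
cat greeting.txt
```

patch treats the git-specific header lines as leading garbage and keys off the `---`/`+++` lines, so no `git apply` is needed.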
clarkb | I've put a hold on https://review.opendev.org/#/c/741698/1 | 18:27 |
clarkb | I don't think we'll be able to confirm it fixes the meetpad issue pre merge | 18:29 |
clarkb | but we should be able to confirm it doesn't regress etherpad in normal operation | 18:29 |
clarkb | if we didn't proxy etherpad in meetpad we'd be able to use local /etc/hosts overrides to test it but we do proxy (in order to fix other issues with cross origin requests) | 18:30 |
clarkb | fungi: fwiw it is a git repo in the docker image, it's just that git doesn't exist on the image and pulling it in is a lot larger than pulling in patch | 18:30 |
clarkb | otherwise I would've just installed git and done a cherrypick | 18:30 |
fungi | makes sense | 18:31 |
fungi | and yeah, i agree testing whether this fixes the meetpad issue will be nontrivial without standing up a separate jitsi-meet pointed at it | 18:31 |
fungi | but i'm okay with it just not obviously breaking etherpad | 18:32 |
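A hedged sketch of how that might look in an image build layered on the upstream image (the base image name, the `/opt/etherpad-lite` path, and the `etherpad` user are assumptions about upstream's image, and `console-logging.patch` stands in for the cherry-picked diff):

```
FROM etherpad/etherpad
USER root
# git would drag in far more than the small patch utility needs
RUN apt-get update && apt-get install -y --no-install-recommends patch \
    && rm -rf /var/lib/apt/lists/*
COPY console-logging.patch /tmp/console-logging.patch
RUN cd /opt/etherpad-lite && patch -p1 < /tmp/console-logging.patch
USER etherpad
```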
clarkb | while I'm hopeful we'll be able to delete all these hacks when 1.8.5 or whatever the next release will be happens, I have a hunch that we'll just replace the existing set of fixes with a set of new fixes | 18:33 |
clarkb | so having some sort of system for that seems good (and patch works well enough I think) | 18:33 |
fungi | or it may be a sign that we should look into building our own images at some point | 18:35 |
fungi | since that's how we're handling the same sort of problems for other services | 18:35 |
clarkb | ya, we can copy their image though its fairly involved in order to get node and yarn and related things installed. At least so far we've only cared about the source of the etherpad service itself | 18:36 |
clarkb | definitely if we need to start changing node versions or similar we'd probably want to drop the use of their image | 18:36 |
clarkb | I'm going to grab lunch now then when I get back we should have a test node we can check the fix against | 18:39 |
clarkb | also I had intended to send those emails this morning before people weekended but now I'm thinking I should wait for monday so that there is a chance people see them | 18:39 |
fungi | great point, friday e-mail has a tendency to slip through the cracks | 18:40 |
clarkb | fungi: 158.69.67.103 etherpad.opendev.org | 19:14 |
clarkb | The css fix seems to be working at least | 19:15 |
clarkb | I'm going to see if I can infer from minimized sources that the console logs are cleaned up | 19:15 |
clarkb | using the debugger and ^F I think they are gone | 19:16 |
clarkb | I'm going to remove my -W now | 19:18 |
*** owalsh has quit IRC | 19:20 | |
clarkb | infra-root I think that we can land https://review.opendev.org/#/c/741692/1 if 158.69.67.103 as etherpad.opendev.org looks good to you | 19:23 |
fungi | yeah, tested, lgtm. thanks! | 19:27 |
*** owalsh has joined #opendev | 19:29 | |
clarkb | I'm looking at my jeepyb branches update test logs and overall it looks good. I have however discovered an interesting jeepyb behavior. | 19:34 |
clarkb | Any idea why it seems we create and sort of use two different repos https://zuul.opendev.org/t/openstack/build/469948ee648347719d92b70f041649b3/log/job-output.txt#750-753 | 19:35 |
clarkb | I'm kind of thinking that may be something we can cleanup since it seems we only use the jeepyb-cache repo for useful tasks and ignore jeepyb-git | 19:35 |
fungi | we need a clone from which to apply acls | 19:36 |
fungi | we also need a clone from which to push imported content | 19:37 |
clarkb | fungi: ya they both seem to be jeepyb-cache for that | 19:37 |
corvus | clarkb: +w etherpad | 19:37 |
clarkb | fungi: if you follow the larger context of the log there you see it doing those steps and randomly in the middle it inits another bare repo | 19:37 |
clarkb | corvus: thanks | 19:37 |
fungi | clarkb: ahh, i wonder if we used to use jeepyb-git for one of them and combined them in a refactor at some point but never cleaned up? | 19:38 |
clarkb | fungi: ya thats what I'm beginning to suspect | 19:38 |
fungi | i agree it seems to be unused | 19:38 |
fungi | or at least is unused now | 19:38 |
clarkb | based on that https://review.opendev.org/#/q/topic:opendev-git-branches lgtm now | 19:39 |
clarkb | but careful review is much appreciated | 19:39 |
fungi | clarkb: the split between 741277 and 741279 is a little hard to follow... for example you add default-branch to the sample projects.yaml in the earlier change but don't actually use it until the later change | 20:00 |
fungi | actually that may be the only thing confusing me | 20:00 |
fungi | was the projects.yaml edit meant to happen in 741279 instead? | 20:00 |
clarkb | fungi: they are in two different repos so rather than make a third change I decided to consolidate | 20:01 |
clarkb | fungi: really that job should've been defined in jeepyb but we were trying to test a gerritlib release at the time so the focus was there | 20:01 |
clarkb | I'm open to ideas on how to make it clearer, maybe a third change would help? | 20:03 |
fungi | oh, yeah i missed those were gerritlib and jeepyb, now i get it | 20:07 |
fungi | nah, it makes sense. my head was in the "split because can't turn it on until gerrit upgrade" space | 20:07 |
fungi | i missed that it was also split between repos | 20:08 |
openstackgerrit | Merged opendev/system-config master: Patch etherpad console logging to fix cross origin error https://review.opendev.org/741692 | 20:10 |
clarkb | we don't seem to have auto applied ^ to the etherpad server | 20:15 |
clarkb | the role is set up such that if it runs it would happen but I don't think we ran the job at all | 20:15 |
clarkb | I'm going to manually pull and restart the service now | 20:16 |
fungi | i saw the deploy job run (and succeed | 20:16 |
fungi | ) | 20:16 |
clarkb | that was the image promote job | 20:17 |
clarkb | we didn't run infra-prod | 20:17 |
fungi | it was a build in the deploy pipeline | 20:17 |
clarkb | https://meetpad.opendev.org/stx-build loads the etherpad now fwiw | 20:18 |
clarkb | the problem is we don't run that job if the etherpad image updates | 20:19 |
fungi | oh, huh, yeah system-config-promote-image-etherpad runs in deploy? | 20:20 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run our etherpad prod deploy job when docker updates https://review.opendev.org/741708 | 20:21 |
clarkb | fungi: ya I think all of the image promotions run in deploy so that the sequencing with infra-prod is correct? | 20:22 |
clarkb | if they are in different pipelines then the dependency stuff gets weird (maybe impossible?) | 20:22 |
clarkb | fungi: re defaults and https://review.opendev.org/#/c/741264/ ya that is basically what I was thinking. We want to have control to switch earlier than gerrit or gitea do it if git switches and possibly hold back if gitea or gerrit go early | 20:23 |
clarkb | basically gives us control to make the switch consistently across the board | 20:24 |
clarkb | and not discover issues the hard way | 20:24 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Use non deprecated gitea repo creation endpoint https://review.opendev.org/741710 | 20:32 |
clarkb | ianw: fungi ^ thats the followon fix for gitea | 20:32 |
clarkb | oh wait ianw already pushed that change /me abandons the extra one | 20:33 |
clarkb | fungi: https://review.opendev.org/#/c/741562/1 fyi | 20:33 |
clarkb | oh except that will need a rebase to not conflict with the other change | 20:34 |
clarkb | silly git | 20:34 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use (upload|promote)-docker-image roles in periodic jobs https://review.opendev.org/740560 | 20:38 |
openstackgerrit | Merged opendev/system-config master: Allow setting Gitea repo branch on project creation https://review.opendev.org/741264 | 20:51 |
corvus | the trendline is headed down on node requests | 20:54 |
corvus | i think i might start preparing for zuul restart | 20:55 |
corvus | maybe just take a short outage and do the scheduler and all the executors at once | 20:55 |
fungi | i'm around to help/watch | 20:55 |
corvus | mordred: i just want to confirm you're not around and i'm not about to step on your toes | 20:56 |
clarkb | corvus: remember the ze's are in the emergency file. Not sure if that changes anything dramatically | 20:57 |
clarkb | I too can help though have a call in a minute | 20:57 |
corvus | iiuc, we want to remove ze from emergency, then allow ansible to write out pending updates, then shut down all executors (using docker on ze1 and init everywhere else), restart scheduler, start all executors using docker | 20:58 |
clarkb | maybe? I'm not sure how the current code which assumes docker containers will interact with the non docker services | 20:58 |
corvus | clarkb: what current code? | 20:58 |
clarkb | corvus: https://review.opendev.org/#/c/733967/ | 20:59 |
clarkb | looks like start.yaml is not run by default | 20:59 |
clarkb | so I think your plan will work | 21:00 |
corvus | yeah, that's what i was thinking; if we're worried, we could probably manually stop, then run ansible, then start; but that'll take a bit longer | 21:00 |
clarkb | in looking at that change I think I convinced myself its fine | 21:01 |
clarkb | it should install the docker things but not switch running services as well as apply the zk updates | 21:01 |
clarkb | then you can stop the non docker services and start them under docker | 21:01 |
corvus | k. i'll start by removing emergency and then running service-zuul playbook on bridge | 21:01 |
corvus | how do i stop the cron? | 21:02 |
corvus | i mean the hourly zuul job | 21:02 |
clarkb | I don't think you can stop it, you can run the stopper script which will stop the next job from running | 21:02 |
clarkb | but once a job has started we don't have a way to abort that | 21:02 |
corvus | i ran the disable-ansible script | 21:03 |
clarkb | looks like zuul isn't running now so you can run the script to pause the next thing | 21:03 |
corvus | infra-prod-install-ansible just started | 21:03 |
corvus | oh i think maybe i did it just in time | 21:03 |
corvus | to conclude from that. | 21:03 |
corvus | err | 21:04 |
corvus | to conclude from that. | 21:04 |
corvus | why can't i copy/paste from zuul's console log? | 21:04 |
corvus | gimme a second while i type it in | 21:04 |
corvus | "TASK [Make sure a manual maint isn't going on]" | 21:04 |
corvus | is where it's currently sitting | 21:04 |
clarkb | cool then ya you probably caught it | 21:05 |
corvus | i just noticed nb03 is in emergency | 21:07 |
corvus | so we'll need to do something about that before we turn off plaintext | 21:07 |
clarkb | ya we have multi arch images now so that should be doable in the near future | 21:08 |
corvus | i'm running service-zuul on ze01; just want to see it noop first | 21:08 |
*** rpittau has quit IRC | 21:12 | |
*** rpittau has joined #opendev | 21:13 | |
corvus | apparently we failed to pull the image: ERROR: for executor read tcp 104.239.136.252:36290->104.18.123.25:443: read: connection reset by peer | 21:15 |
corvus | i will run that again | 21:15 |
fungi | huh | 21:17 |
clarkb | I'm having dns trouble locally | 21:17 |
fungi | that looks like a cloudflare address... i guess fronting dockerhub (or trying to)? | 21:18 |
clarkb | oh hey that explains it | 21:18 |
clarkb | my dns is cloudflare too | 21:19 |
clarkb | I bet they are having an outage | 21:19 |
clarkb | fun fun | 21:19 |
clarkb | discord is apparently broken too | 21:23 |
clarkb | its a friday "internet is broken" day | 21:23 |
*** johnsom_ has joined #opendev | 21:26 | |
mnaser | i was just going to hint here about 1.1.1.1 being down | 21:26 |
mnaser | discord being broken is a by product of 1.1.1.1 down | 21:27 |
mnaser | https://www.cloudflarestatus.com "all systems operational" | 21:28 |
clarkb | my dns resolves again | 21:28 |
corvus | ERROR: for executor error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/6f/6fbba1285c105d80eedaef06c284b770b7d6e30ad3694229178d835c3d2d53d7/data?verify=1595023763-0q3vtPSuEtfFCNyYszpaQBhtqQE%3D: dial tcp: lookup production.cloudflare.docker.com on 127.0.0.1:53: read udp 127.0.0.1:39681->127.0.0.1:53: i/o timeout | 21:29 |
corvus | yeah, that was the second error | 21:29 |
corvus | trying a 3rd time | 21:29 |
*** johnsom_ has quit IRC | 21:29 | |
mnaser | i'm seeing dns restored in some parts of the world but still broken in others | 21:29 |
mnaser | 1.1.1.1 is resolving again | 21:36 |
corvus | attempt #3 failed, trying again | 21:46 |
mnaser | https://www.cloudflarestatus.com | 21:47 |
mnaser | "Identified - The issue has been identified and a fix is being implemented. " | 21:47 |
*** melwitt is now known as jgwentworth | 21:49 | |
*** avass has quit IRC | 21:53 | |
fungi | "the issue has been implemented and a fix is being identified" | 21:54 |
corvus | managed to complete this time | 21:54 |
corvus | i'll run it on all the zes now | 21:55 |
corvus | oh | 21:55 |
corvus | we do 'remove old init script files' | 21:55 |
corvus | but these are executors, we can just run 'zuul-executor stop' to stop it | 21:56 |
corvus | though i don't know if systemd will be left in a confused state | 21:56 |
corvus | Active: failed (Result: exit-code) since Thu 2020-06-25 17:11:53 UTC; 3 weeks 1 days ago | 21:56 |
clarkb | I would try stopping it with systemd first (it compiles configs but not sure if that includes the bash) | 21:56 |
corvus | that's what ze01 says | 21:56 |
clarkb | but then ya it may just go into a failed or error state and that should be fine | 21:56 |
corvus | so i think if we just run zuul-executor stop, that's the worst case scenario | 21:56 |
corvus | clarkb: yeah | 21:57 |
corvus | so i'll proceed with the plan as discussed and run the playbook on all zes now | 21:57 |
clarkb | sounds good | 21:57 |
fungi | stopping the initscript after the service is stopped should be idempotent, so if you're concerned about systemd's state tracking you could just ask systemd to stop it last thing (after it's actually stopped) | 21:58 |
clarkb | fungi: well in this case the script won't be there anymore | 21:59 |
clarkb | but I think systemd will just say "the unit is error or failed" now | 21:59 |
clarkb | then we can disable the unit to prevent it from starting on boot | 22:00 |
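The teardown sequence discussed above can be sketched as a small helper; this is an illustrative sketch only, assuming the unit is named `zuul-executor`, and every step tolerates failure since `systemctl stop` may no longer work once the init script is gone:

```python
import subprocess

def stop_legacy_executor(run=subprocess.run):
    # Sequence sketched from the discussion above; the unit name
    # "zuul-executor" is an assumption.
    cmds = [
        ["systemctl", "stop", "zuul-executor"],          # best effort via systemd
        ["zuul-executor", "stop"],                       # the command that actually stops it
        ["systemctl", "reset-failed", "zuul-executor"],  # clear any failed/error state
        ["systemctl", "disable", "zuul-executor"],       # don't start the old unit on boot
    ]
    for cmd in cmds:
        run(cmd, check=False)
    return cmds
```

Passing the runner in makes the ordering easy to verify without touching a real system; in production you would just call it with the default `subprocess.run`.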
corvus | (ot: bob and doug are planning on returning from space on aug 2) | 22:00 |
clarkb | do you think we should tell them that there is a pandemic and they may want to hang out up there longer? | 22:01 |
fungi | (whaa? take off, ya hoser) | 22:01 |
corvus | clarkb: right? i mean, what's the rush? you're stuck inside either way. | 22:01 |
corvus | down here they only have to wear a mask outside instead of a suit, though the suit might be a good idea | 22:02 |
fungi | oh! that bob and doug, i thought you meant the mckenzie brothers | 22:02 |
corvus | fungi: wow, i think you just wrote an *amazing* sketch. | 22:03 |
fungi | it includes a case of molson | 22:03 |
corvus | floating around the dragon capsule | 22:03 |
corvus | okay, i will try to stop laughing and proceed with maint; the playbook is finished | 22:04 |
clarkb | I bet their view of the comet is better than mine too | 22:05 |
corvus | i'll stop all the executors now (will attempt systemctl stop) | 22:05 |
*** rh-jelabarre has quit IRC | 22:06 | |
corvus | systemd is fussing about the units having changed, so i don't know if it actually did stop it or not | 22:08 |
corvus | i'm just going to run zuul-executor stop now | 22:09 |
clarkb | ok | 22:09 |
corvus | i'm pretty sure systemd didn't do anything | 22:10 |
corvus | looks like it's really stopping now | 22:10 |
clarkb | corvus: are we leaving ze01 alone ? | 22:10 |
corvus | no, i stopped it so it doesn't end up with all the jobs | 22:10 |
clarkb | corvus: and you did that via docker-compose ya? | 22:10 |
corvus | yep | 22:10 |
corvus | i'll leave the scheduler up while the executors shut down | 22:11 |
corvus | (to minimize downtime for the scheduler and maximize the number of mergers online when it starts) | 22:11 |
fungi | makes sense | 22:17 |
corvus | 3 remaining | 22:19 |
corvus | all clear | 22:29 |
corvus | i'll save queues, stop the scheduler, start the executors, then start the scheduler | 22:30 |
fungi | sounds good | 22:30 |
* clarkb still watching if anything comes up | 22:31 | |
corvus | hrm playbooks/start-mergers-executors.yaml does not look like it was updated for docker | 22:31 |
corvus | i will run a locally modified version of that | 22:32 |
*** DSpider has quit IRC | 22:33 | |
corvus | executors are up, scheduler is starting | 22:33 |
corvus | i'm unsure what the scheduler is doing | 22:35 |
corvus | http://paste.openstack.org/show/796062/ | 22:37 |
corvus | could that be a fatal error? | 22:37 |
fungi | oh stopped the geard fork? | 22:38 |
corvus | yeah | 22:39 |
corvus | that TB appears to be from the server | 22:39 |
clarkb | the fork is still up too | 22:39 |
fungi | yeah, i see two zuul-scheduler processes | 22:39 |
corvus | i wonder if it's wedged though? | 22:40 |
fungi | strace says it's reading from fd 4 | 22:40 |
corvus | though that should "just" be from a merger | 22:40 |
corvus | maybe it got stuck on that | 22:40 |
fungi | and yeah, if it weren't wedged i'd expect strace to have said more than that by now | 22:41 |
fungi | the parent is on recvfrom(28, ...silence | 22:41 |
corvus | where's our docs on how to connect to geard? | 22:42 |
corvus | i thought we had something in https://docs.opendev.org/opendev/system-config/latest/zuul.html | 22:42 |
fungi | components i think? | 22:42 |
clarkb | corvus: telnet host 4730 ? | 22:42 |
corvus | i thought we used ssl | 22:42 |
clarkb | oh right | 22:42 |
corvus | https://zuul-ci.org/docs/zuul/howtos/troubleshooting.html | 22:43 |
clarkb | looking at that gear code I expect it should set statsd to None or something useful in Server.__init__ | 22:43 |
clarkb | which makes me wonder if there was an earlier failure configuring geard | 22:43 |
fungi | https://zuul-ci.org/docs/zuul/howtos/troubleshooting.html?highlight=s_client#gearman-jobs | 22:43 |
fungi | openssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/client.pem -key /etc/zuul/ssl/client.key | 22:43 |
corvus | that connection is rejected because the cert can't be verified | 22:44 |
corvus | are those the right cert/key files? | 22:45 |
corvus | nope | 22:45 |
clarkb | ssl_cert=/etc/zuul/ssl/gearman-server.pem ssl_key=/etc/zuul/ssl/gearman-server.key in our zuul.conf | 22:45 |
corvus | okay, final iteration is: openssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/gearman-client.pem -key /etc/zuul/ssl/gearman-client.key | 22:47 |
clarkb | s/server/client/ for client side | 22:47 |
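Once connected with the `openssl s_client` invocation above, geard speaks gearman's plain-text administrative protocol: typing `status` returns one tab-separated line per registered function (name, queued jobs, running jobs, available workers), terminated by a lone `.`. A small parser for that output, written as a sketch against the documented gearman admin format:

```python
def parse_gear_status(text):
    # Parse geard's "status" admin-command output: one tab-separated
    # line per function (name, queued, running, available workers),
    # terminated by a line containing a single ".".
    functions = {}
    for line in text.splitlines():
        if line == ".":
            break
        name, queued, running, workers = line.split("\t")
        functions[name] = {
            "queued": int(queued),
            "running": int(running),
            "workers": int(workers),
        }
    return functions
```

This makes it easy to spot a wedged scheduler: plenty of queued jobs for a function with zero available workers is the usual smoking gun.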
corvus | it is up and running | 22:47 |
corvus | by that i meant geard | 22:48 |
corvus | but also apparently the scheduler is doing something now | 22:48 |
clarkb | the scheduler just started ya | 22:49 |
clarkb | does it have startup tasks like the executor to clean things up? | 22:49 |
corvus | i'm trying to figure out what the holdup was | 22:49 |
clarkb | executors can take a minute to actually begin work while they clear the build dirs up | 22:49 |
corvus | the first interesting message after the slow time was: 2020-07-17 22:47:33,949 INFO zuul.GithubConnection: Starting event connector | 22:50 |
fungi | it's using a trove instance for the db reporter? | 22:50 |
clarkb | fungi: yes I believe so | 22:50 |
clarkb | (it's a huge db) | 22:50 |
corvus | oh, let me see if there was a migration | 22:50 |
clarkb | ohhhh | 22:50 |
fungi | yeah, that's what i was wondering | 22:51 |
fungi | if it was waiting for (however slow) trove db to migrate | 22:51 |
fungi | i wonder if that's something we can better surface in the logs | 22:52 |
corvus | 2020-07-17 22:32:59,666 DEBUG zuul.SQLConnection: Current migration revision: 16c1dc9054d0 | 22:52 |
corvus | zuul/driver/sql/alembic/versions/269691d2220e_add_build_final_column.py:down_revision = '16c1dc9054d0' | 22:52 |
corvus | fungi: i think you win | 22:52 |
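Surfacing the migration in logs, as fungi suggests, could start with something as simple as scanning the scheduler debug log for the SQLConnection revision message quoted above. A hedged sketch (the message text is taken verbatim from the pasted log and may differ across zuul versions):

```python
import re

# Matches the zuul.SQLConnection line quoted above.
MIGRATION_RE = re.compile(r"Current migration revision: (\w+)")

def find_migration_revision(log_lines):
    # Return the database's stamped alembic revision from scheduler
    # debug logs, or None if no such line is present.
    for line in log_lines:
        m = MIGRATION_RE.search(line)
        if m:
            return m.group(1)
    return None
```

Comparing the stamped revision against the head revision shipped in the code would then tell an operator, before restarting, whether a (possibly lengthy) migration is about to run.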
*** tosky_ has joined #opendev | 22:52 | |
corvus | and yeah, we should do that. because i'm pretty sure we did the same thing last time. | 22:52 |
fungi | that was indeed lengthy | 22:52 |
corvus | and this is going to happen a lot. | 22:52 |
corvus | also, we should prune our db. | 22:52 |
clarkb | maybe even a zuul command to check if a migration will happen | 22:53 |
clarkb | so that we can easily approximate downtime | 22:53 |
corvus | i'm re-enqueueing now | 22:53 |
*** tosky has quit IRC | 22:53 | |
fungi | "me: please restart; zuul: i'm sorry dave, i can't do that" | 22:53 |
clarkb | ++ to more logging and pruning and all those ideas though | 22:53 |
corvus | #status log restarted all of zuul using tls zookeeper and executors in containers | 22:54 |
openstackstatus | corvus: finished logging | 22:54 |
corvus | mordred: FYI ^ | 22:54 |
fungi | so that was ~15 minutes spent on the db migration i guess? | 22:54 |
corvus | about. a little less. | 22:57 |
*** mlavalle has quit IRC | 22:59 | |
*** tosky_ is now known as tosky | 23:01 | |
clarkb | that's everything but nb03 running or ready to run zk connections with tls, ya? | 23:04 |
clarkb | next week is a good time to convert nb03 to our new multi arch images I guess | 23:04 |
clarkb | corvus: we landed the python base and builder side of that right? | 23:04 |
clarkb | https://hub.docker.com/r/zuul/nodepool-builder/tags has both arches but https://hub.docker.com/r/opendevorg/python-builder/tags does not | 23:07 |
clarkb | https://review.opendev.org/#/c/726263/7/zuul.d/docker-images/python.yaml has merged. Do we just need to trigger builds? | 23:08 |
clarkb | https://zuul.openstack.org/builds?change=726263 looks like we ran the jobs but not the promote and I think that is because we only ran the jobs due to them changing | 23:10 |
corvus | clarkb: yep, i think we need a nodepool build | 23:15 |
corvus | enqueue is finished | 23:16 |
clarkb | corvus: well a nodepool build won't rebuild python-builder and python-base | 23:16 |
corvus | oh er i thought the change we merged would have done that | 23:16 |
clarkb | I was looking for a useful change to one of those (the jobs will run for both if either changes) but I can't come up with anything so will push a noop change instead | 23:16 |
clarkb | corvus: according to https://zuul.openstack.org/builds?change=726263 it did not. I think because only the jobs changed | 23:16 |
clarkb | and the post merge jobs don't have the context of "did my own job update" to know they should run unconditionally? | 23:17 |
corvus | clarkb: makes sense. :/ | 23:17 |
corvus | i will approve your noop change :) | 23:17 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Noop update to force python-builder/base to rebuild https://review.opendev.org/741789 | 23:17 |
clarkb | I think ^ should do it | 23:18 |
clarkb | then we can land a nodepool change then we can deploy nb03 | 23:18 |
clarkb | exciting | 23:18 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Update zuul-executor stop/start playbook https://review.opendev.org/741790 | 23:18 |
*** tosky has quit IRC | 23:38 | |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!