Friday, 2020-07-17

*** ysandeep|away is now known as ysandeep00:04
openstackgerritIan Wienand proposed opendev/system-config master: gitea-git-repos: update deprecated API path
fungiokay, so after reading the change in their gerrit and it fixes, i gather the idea is that in the future applications should be doing their own dnssec validation instead of trusting any resolver to tell them whether a record is valid, but as a workaround there is now a stub resolver option to enable the old behavior of trusting the configured resolver's00:06
bug 20358 in network "RES_USE_DNSSEC sets DO; should also have a way to set AD" [Normal,Resolved: fixed] - Assigned to fweimer00:06
fungievaluation of the record validity00:06
fungiseems mildly premature if no applications are actually checking rrsigs themselves, but what do i know00:07
fungiglibc 2.31 has essentially broken existing dnssec for anyone not savvy enough to know to set that option on every single client system00:08
ianwfungi: but it's probably not good to be trusting your isp's resolvers?00:14
ianwArchive name: borg-backup-test01-2020-07-17T00:11:2900:14
ianwArchive fingerprint: 3c0535de9273f2e132ee02eae0d2458db9ae6fd46cdd92a361199f7a0a261c8200:14
ianwTime (start): Fri, 2020-07-17 00:11:3000:14
ianwTime (end):   Fri, 2020-07-17 00:11:3700:14
ianwDuration: 7.64 seconds00:14
openstackgerritClark Boylan proposed opendev/jeepyb master: Set repo HEAD on gerrit project creation
ianwpretty cool, hosts backing themselves up during CI to test the full path00:15
clarkbthe git operations in there could probably use double checking. They seem to work with local testing on test repos but I'm not sure if those are the best options available to us00:15
fungiianw: it's definitely not good to be trusting your isp's resolvers, but that's why you should have a local validating resolver00:17
fungii trust the instance of unbound running on my openbsd firewall00:17
ianwright, but you have to tell glibc you trust that ... so their change is probably generically correct?00:18
fungii get that if i roam some portable device onto coffee shop wifi i shouldn't suddenly start trusting their resolver, sure, but i generally don't trust anything about that internet connection00:18
fungianyway, it looks like nm has also grown an option for you to associate the trust-ad toggle with specific network profiles, so i'll likely use that to prevent needing to constantly fiddle resolv.conf00:20
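For readers unfamiliar with the toggle being discussed, here is a minimal sketch of a resolv.conf that opts in to trusting the resolver's AD bit with glibc 2.31 or later. It writes to a temporary file instead of /etc/resolv.conf, and the 127.0.0.1 nameserver is an assumption standing in for a local validating resolver such as the unbound instance mentioned above.

```shell
set -eu
# Hypothetical sample file; a real system would use /etc/resolv.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Assumed local validating resolver (for example unbound on localhost).
nameserver 127.0.0.1
# trust-ad asks glibc >= 2.31 to keep the AD bit set by this resolver.
options edns0 trust-ad
EOF
# Show the options line that carries the toggle.
grep '^options' "$conf"
```

The option only makes sense when every listed nameserver is one you trust to validate, which is the point of the exchange above.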
clarkb2020-07-17 00:21:23.448189 | ubuntu-bionic | 2020-07-17 00:21:23,447: jeepyb.utils - INFO - Executing command: git --git-dir=/tmp/jeepyb-cache/test/test-repo-2/.git --work-tree=/tmp/jeepyb-cache/test/test-repo-2 push ssh://localhost:29418/test/test-repo-2 HEAD:refs/heads/main that looks better00:26
fungiit's just annoying that dnssec was finally starting to achieve some degree of widespread penetration, even if it relied on mechanisms from rfc 2535§6.1 which didn't validate the last hop between the client and its configured recursive resolver... to suddenly break that everywhere and ask users to evaluate whether their configured resolvers are trustworthy is basically going to undo all of that and we'll go00:29
fungiback to basically everyone relying on unvalidated dns responses instead, which is far worse00:29
fungithis is the sort of change which governments sneak into cryptographic systems to keep them unusable by almost everyone00:30
ianwfungi: i thought the main problem with dnssec was that you have zone exposure00:31
ianwclarkb/corvus: ok, i'm reading about append-only mode :
*** ryohayakawa has joined #opendev00:32
ianwit does seem like what we want00:32
ianwi think that if we restrict the "command=" on the backup server side to "borg serve --append-only" we get what we want00:33
ianwthat leaves us with the possibility to have something running on the backup server itself that might prune the repos periodically00:33
clarkboh interesting they can be treated differently based on client00:33
ianwit seems to be a bug or a feature :
ianwi think feature, although you could argue the client commands should make it more obvious what's going on00:36
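The forced-command idea above could look roughly like the following authorized_keys entry on the backup server. This is only a sketch: the repository path, the key material and the key comment are placeholders, and the entry is written to a temporary file instead of a real ~/.ssh/authorized_keys.

```shell
set -eu
# Hypothetical values: the repository path and public key are placeholders.
repo_path=/opt/backups/example-host
pubkey='ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIexamplekeyonly backup@example-host'
keys_file=$(mktemp)
# command= forces every login with this key into an append-only borg server;
# "restrict" disables forwarding, pty allocation and similar ssh features.
printf 'command="borg serve --append-only --restrict-to-path %s",restrict %s\n' \
  "$repo_path" "$pubkey" > "$keys_file"
cat "$keys_file"
```

Because the restriction is tied to the client's key, a separate process on the server itself stays free to prune the repositories, as noted above.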
fungiianw: as in that the sequence of signatures allows you to bisect records and effectively find records in a zone which you wouldn't otherwise be aware of? pretty sure that got solved by rrsig range responses, but it's been a while since i read up on that issue00:37
fungisomething to do with signed nxdomain responses if memory serves00:37
ianwfungi: this is what i was thinking of
fungiianw: aha, yeah, on-line signing is the mitigation i was thinking of00:44
fungibut really, the ultimate mitigation is "don't put secret data into the public dns"00:45
fungiif you think about the system dns replaced, the shared hostfile, everyone got a copy of every hostname on the internet. dns wasn't designed to keep records secret either, people just started relying on the side effect that refusing axfr requests forced people to brute-force guess your records to find them00:46
fungias for the nxdomain range responses, the original dnssec design didn't even try to obscure the ranges by hashing the zone record ordering00:50
fungithat got tacked on later when people raised objections to the fact that information they were putting in public dns was *gasp* public00:51
ianwi think clearly the solution is to move dns into the bitcoin blockchain00:55
ianwspeaking of keys, there's an internal RH server i've been signing into for about 7 years just fine with a kerberos ticket that has suddenly stopped working with00:58
ianwdebug1: Unspecified GSS failure.  Minor code may provide more information00:58
ianwKDC has no support for encryption type00:58
ianwgoogling that just gets more and more confusing00:59
*** Eighth_Doctor is now known as Conan_Kudo01:01
*** Conan_Kudo is now known as Eighth_Doctor01:02
corvusianw: blockchain dns is a thing:
corvusi think it's not completely insane, aside from the whole proof-of-work-energy-consumption-destroying-the-environment thing that all blockchain solutions share01:07
corvusianw: yeah, that append only option looks good01:10
openstackgerritIan Wienand proposed opendev/system-config master: [wip] borg backups
ianw^ that should implement it.  i expect testinfra to pass for that, which does a full backup cycle for the two test hosts, which is pretty cool01:14
ianwtodo is update documentation, and probably something to configure per-host backup locations01:15
*** ysandeep is now known as ysandeep|away01:49
ianw is nice for testinfra02:05
*** shtepanie has quit IRC02:11
*** rh-jelabarre has quit IRC02:20
*** sgw1 has quit IRC02:28
*** sgw1 has joined #opendev02:29
*** sgw1 has quit IRC02:33
*** sgw1 has joined #opendev02:59
openstackgerritIan Wienand proposed opendev/system-config master: [wip] borg backups
openstackgerritIan Wienand proposed opendev/system-config master: [wip] borg backups
openstackgerritIan Wienand proposed opendev/system-config master: [wip] borg backups
*** sgw1 has quit IRC04:34
*** xiaolin has quit IRC05:34
*** xiaolin has joined #opendev05:36
*** xiaolin has quit IRC05:43
*** xiaolin has joined #opendev05:55
*** ysandeep|away is now known as ysandeep06:09
*** ysandeep is now known as ysandeep|rover06:24
*** qchris has quit IRC06:25
*** marios has joined #opendev06:47
*** markPilon has joined #opendev06:52
*** qchris has joined #opendev06:53
*** xiaolin has quit IRC06:55
*** xiaolin has joined #opendev06:57
openstackgerritvinay kumar muddu proposed openstack/diskimage-builder master: Fixes nit in DIB_IPA_CERT certificate copy
openstackgerritIan Wienand proposed opendev/system-config master: Add borg-backup roles
ianwinfra-root: ^ ready for review; i think more or less everything we discussed is mentioned in the changelog06:59
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry
*** calcmandan has quit IRC07:26
*** xiaolin has quit IRC07:26
*** calcmandan has joined #opendev07:26
*** dougsz has joined #opendev07:27
*** DSpider has joined #opendev07:29
*** tosky has joined #opendev07:33
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:03
*** avass has quit IRC08:17
*** fressi has joined #opendev08:28
*** fressi_ has joined #opendev08:32
*** fressi has quit IRC08:33
*** fressi_ is now known as fressi08:33
*** roman_g has joined #opendev08:39
*** ysandeep|rover is now known as ysandeep|lunch08:41
*** dtantsur|afk is now known as dtantsur09:19
*** ysandeep|lunch is now known as ysandeep|rover09:27
*** marios has quit IRC09:38
*** fressi has quit IRC09:42
*** fressi has joined #opendev09:46
openstackgerritvinay kumar muddu proposed openstack/diskimage-builder master: Fixes DIB_IPA_CERT certificate copy issue
*** tkajinam has quit IRC09:52
*** dtantsur is now known as dtantsur|brb10:07
*** ryohayakawa has quit IRC10:11
*** sshnaidm|afk is now known as sshnaidm|off10:16
*** markPilon has quit IRC10:25
*** fressi has quit IRC10:29
*** fressi has joined #opendev10:41
*** marios has joined #opendev10:42
*** ysandeep|rover is now known as ysandeep|afk11:11
*** fressi has quit IRC11:20
*** ysandeep|afk is now known as ysandeep11:31
*** avass has joined #opendev11:33
*** ysandeep is now known as ysandeep|rover11:34
*** rh-jelabarre has joined #opendev12:07
*** dtantsur|brb is now known as dtantsur12:10
*** ysandeep|rover is now known as ysandeep|coffee12:19
*** ysandeep|coffee is now known as ysandeep|rover12:40
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry
*** fressi has joined #opendev13:04
*** fressi has quit IRC13:44
*** sgw1 has joined #opendev13:48
*** marios has quit IRC13:53
*** marios has joined #opendev13:57
*** ysandeep|rover is now known as ysandeep|away13:58
*** mlavalle has joined #opendev14:04
sgw1Morning, is there a known issue with being really slow this morning15:11
* fungi checks graphs15:12
fungiit doesn't look like our ddos crawler is back at least15:13
fungisgw1: yours is the first report i've heard15:13
sgw1might be a slow issue with corp proxies15:13
fungislow rendering? slow cloning?15:13
fungi displays very quickly for me at least15:14
sgw1something on my end15:14
sgw1someone else just confirmed it works ok for them, sorry for the noise15:15
fungino worries and thanks for checking!15:15
sgw1an excuse to say Morning to all of you!15:15
fungiand a very merry friday to you as well!15:34
*** mlavalle has quit IRC15:40
*** mlavalle has joined #opendev15:41
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry
*** marios is now known as marios|out16:12
*** dtantsur is now known as dtantsur|afk16:13
clarkbildikov reported that fails to load the etherpad. I've confirmed that other etherpads work and this one fails. It seems to fail with a cross origin request except I have no idea how to get chrome and/or firefox to tell me what the requested url is16:13
clarkbI half suspect something to do with the etherpad state itself16:14
clarkb(the etherpad is an older one with a large ish number of revisions)16:14
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry
clarkbdoes anyone else have better insight into what the browser is doing there or know how to manipulate the developer tools to do so? I've tried firefox and chrome and both are lacking what was actually requested from what I can see (though ff shows a slightly different error)16:15
clarkbon firefox I see pad.js is trying to warn the console that it cannot set the author id for some reason and that is treated as a cross origin request?16:19
clarkbUncaught DOMException: Permission denied to access property "console" on cross-origin object16:19
clarkbok that happens after loading some user info so maybe it is related to pad state16:20
*** marios|out has quit IRC16:23
clarkbit is calling top.console.warn()16:23
clarkbI guess top is a cross origin resource?16:23
clarkbI need to step out for a bit but will try and sort out what top is I guess16:23
fungifwiw on other pads, even large ones, i still see some cors blocking show up in the web console, but the pad still loads for me16:24
clarkbya I think the issue is this cross origin request happens in the pad loading of the content16:25
clarkbso it doesn't complete. In other contexts you won't get other bits but the pad text loads16:25
*** dougsz has quit IRC16:35
clarkbwe have a winner16:35
clarkbI guess we cherrypick that fix into our image?16:36
fungihuh, good find!16:37
fungiand that didn't make it into 1.8.4?16:38
*** ildikov has joined #opendev16:38
clarkbno missed it by about 2 weeks16:38
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry
fungioh, right, i forgot we held off upgrading for a while16:42
clarkbI'm not sure how to cherrypick that into upstream's image though16:42
clarkbI guess this is the downside of not installing it ourselves16:43
fungii thought you had worked out a custom built image with your fix for the author colors overlap a little while back?16:43
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Fix certificate issue with use buildset registry
clarkbfungi: sed on the prod files :/16:44
fungier, oh16:44
clarkbI mean we can do that here too16:44
clarkbwe can probably make a patch file and apply that16:48
clarkbbut I need to boot the etherpad image locally and figure it out16:48
clarkbcurrently trying to sort out how the minimized sources happen. Its almost like node does it on the fly ?17:18
clarkbsince the installation it does is actually to symlink back to itself in the source dir17:19
clarkbrather than do an install of minimized content (but what the browser gets is definitely minimized compared to what is in the source dir)17:19
*** owalsh has quit IRC17:21
clarkbya I think that is the case which makes this a little simpler17:22
*** owalsh has joined #opendev17:42
openstackgerritClark Boylan proposed opendev/system-config master: Patch etherpad console logging to fix cross origin error
clarkbfungi: ^ something like that maybe? I think if that passes testing the next thing to do is a followon change that forces a failure and do a hold and test with the held node?17:49
*** chandankumar is now known as raukadah18:02
*** roman_g has quit IRC18:06
openstackgerritClark Boylan proposed opendev/system-config master: DNM force etherpad failure to hold node
*** johnsom has quit IRC18:24
clarkbfungi: can patch apply the git diff?18:24
clarkbI would've expected the additional git metadata to be a problem18:25
*** johnsom has joined #opendev18:25
clarkbalso how does it know which files to apply to when the paths are git specific?18:25
fungiclarkb: yeah, git's diffs are totally compatible with the patch utility's unified diff handling (or always has been when i've tried it)18:25
clarkbfungi: on a single file level I would expect that to be the case because you can do patch this_file patch_file18:26
fungibut agreed, if the deployed file relationships are not the same as in the repository (at least relative to some parent directory) then that gets problematic for multi-file diffs18:26
clarkbbut with multiple files you'd need the files to be named properly? I mean it may work by magic and a/ b/ is a thing18:27
clarkbfungi: well when git does it it does an a/ b/ prefix18:27
fungiif they're all in the same relative locations/names though you can just use the -p option to tell it how many levels to prune18:27
clarkbwhich won't exist on disk when patch runs18:27
clarkboh thats the magic18:27
fungipatch -p1 < some.patch18:27
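The -p behaviour described above can be tried in a throwaway directory. The sketch below builds a git-style diff by hand and applies it with patch -p1, so the a/ and b/ prefixes are stripped before the path is looked up on disk. The file name and contents are made up for the illustration and are not the etherpad sources.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical file standing in for a source file inside an image.
mkdir -p example
printf 'line one\nline two\n' > example/notes.txt
target="$workdir/example/notes.txt"
# A git-style diff: paths carry the a/ and b/ prefixes git adds.
cat > fix.patch <<'EOF'
diff --git a/example/notes.txt b/example/notes.txt
index 1111111..2222222 100644
--- a/example/notes.txt
+++ b/example/notes.txt
@@ -1,2 +1,2 @@
 line one
-line two
+line 2
EOF
# -p1 drops the first path component (a/ or b/) from each file name.
patch -p1 < fix.patch
cat "$target"
```

This only works when the files sit at the same relative paths as in the repository, which is the caveat raised above for multi-file diffs.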
clarkbI've put a hold on
clarkbI don't think we'll be able to confirm it fixes the meetpad issue pre merge18:29
clarkbbut we should be able to confirm it doesn't regress etherpad in normal operation18:29
clarkbif we didn't proxy etherpad in meetpad we'd be able to use local /etc/hosts overrides to test it but we do proxy (in order to fix other issues with cross origin requests)18:30
clarkbfungi: fwiw it is a git repo in the docker image, its just that git doesn't exist on the image and pulling it in is a lot larger than pulling in patch18:30
clarkbotherwise I would've just installed git and done a cherrypick18:30
fungimakes sense18:31
fungiand yeah, i agree testing whether this fixes the meetpad issue will be nontrivial without standing up a separate jitsi-meet pointed at it18:31
fungibut i'm okay with it just not obviously breaking etherpad18:32
clarkbwhile I'm hopeful we'll be able to delete all these hacks when 1.8.5 or whatever the next release will be happens, I have a hunch that we'll just replace the existing set of fixes with a set of new fixes18:33
clarkbso having some sort of system for that seems good (and patch works well enough I think)18:33
fungior it may be a sign that we should look into building our own images at some point18:35
fungisince that's how we're handling the same sort of problems for other services18:35
clarkbya, we can copy their image though its fairly involved in order to get node and yarn and related things installed. At least so far we've only cared about the source of the etherpad service itself18:36
clarkbdefinitely if we need to start changing node versions or similar we'd probably want to drop the use of their image18:36
clarkbI'm going to grab lunch now then when I get back we should have a test node we can check the fix against18:39
clarkbalso I had intended to send those emails this morning before people weekended but now I'm thinking I should wait for monday so that there is a chance people see them18:39
fungigreat point, friday e-mail has a tendency to slip through the cracks18:40
clarkbfungi: etherpad.opendev.org19:14
clarkbThe css fix seems to be working at least19:15
clarkbI'm going to see if I can infer from minimized sources that the console logs are cleaned up19:15
clarkbusing the debugger and ^F I think they are gone19:16
clarkbI'm going to remove my -W now19:18
*** owalsh has quit IRC19:20
clarkbinfra-root I think that we can land if as looks good to you19:23
fungiyeah, tested, lgtm. thanks!19:27
*** owalsh has joined #opendev19:29
clarkbI'm looking at my jeepyb branches update test logs and overall it looks good. I have however discovered an interesting jeepyb behavior.19:34
clarkbAny idea why it seems we create and sort of use two different repos
clarkbI'm kind of thinking that may be something we can cleanup since it seems we only use the jeepyb-cache repo for useful tasks and ignore jeepyb-git19:35
fungiwe need a clone from which to apply acls19:36
fungiwe also need a clone from which to push imported content19:37
clarkbfungi: ya they both seem to be jeepyb-cache for that19:37
corvusclarkb: +w etherpad19:37
clarkbfungi: if you follow the larger context of the log there you see it doing those steps and randomly in the middle it inits another bare repo19:37
clarkbcorvus: thanks19:37
fungiclarkb: ahh, i wonder if we used to use jeepyb-git for one of them and combined them in a refactor at some point but never cleaned up?19:38
clarkbfungi: ya thats what I'm beginning to suspect19:38
fungii agree it seems to be unused19:38
fungior at least is unused now19:38
clarkbbased on that lgtm now19:39
clarkbbut careful review is much appreciated19:39
fungiclarkb: the split between 741277 and 741279 is a little hard to follow... for example you add default-branch to the sample projects.yaml in the earlier change but don't actually use it until the later change20:00
fungiactually that may be the only thing confusing me20:00
fungiwas the projects.yaml edit meant to happen in 741279 instead?20:00
clarkbfungi: they are in two different repos so rather than make a third change I decided to consolidate20:01
clarkbfungi: really that job should've been defined in jeepyb but we were trying to test a gerritlib release at the time so the focus was there20:01
clarkbI'm open to idea on how to make it clearer, maybe a third change would help?20:03
fungioh, yeah i missed those were gerritlib and jeepyb, now i get it20:07
funginah, it makes sense. my head was in the "split because can't turn it on until gerrit upgrade" space20:07
fungii missed that it was also split between repos20:08
openstackgerritMerged opendev/system-config master: Patch etherpad console logging to fix cross origin error
clarkbwe don't seem to have auto applied ^ to the etherpad server20:15
clarkbthe role is set up such that if it runs it would happen but I don't think we ran the job at all20:15
clarkbI'm going to manually pull and restart the service now20:16
fungii saw the deploy job run (and succeed)20:16
clarkbthat was the image promote job20:17
clarkbwe didn't run infra-prod20:17
fungiit was a build in the deploy pipeline20:17
clarkb loads the etherpad now fwiw20:18
clarkbthe problem is we don't run that job if the etherpad image updates20:19
fungioh, huh, yeah system-config-promote-image-etherpad runs in deploy?20:20
openstackgerritClark Boylan proposed opendev/system-config master: Run our etherpad prod deploy job when docker updates
clarkbfungi: ya I think all of the image promotions run in deploy so that the sequencing with infra-prod is correct?20:22
clarkbif they are in different pipelines then the dependency stuff gets weird (maybe impossible?)20:22
clarkbfungi: re defaults and ya that is basically what I was thinking. We want to have control to switch earlier than gerrit or gitea do it if git switches and possibly hold back if gitea or gerrit go early20:23
clarkbbasically gives us control to make the switch consistently across the board20:24
clarkband now discover issues the hard way20:24
openstackgerritClark Boylan proposed opendev/system-config master: Use non deprecated gitea repo creation endpoint
clarkbianw: fungi ^ thats the followon fix for gitea20:32
clarkboh wait ianw already pushed that change /me abandons the extra one20:33
clarkbfungi: fyi20:33
clarkboh except that will need a rebase to not conflict with the other  change20:34
clarkbsilly git20:34
openstackgerritAndrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use (upload|promote)-docker-image roles in periodic jobs
openstackgerritMerged opendev/system-config master: Allow setting Gitea repo branch on project creation
corvusthe trendline is headed down on node requests20:54
corvusi think i might start preparing for zuul restart20:55
corvusmaybe just take a short outage and do the scheduler and all the executors at once20:55
fungii'm around to help/watch20:55
corvusmordred: i just want to confirm you're not around and i'm not about to step on your toes20:56
clarkbcorvus: remember the ze's are in the emergency file. Not sure if that changes anything dramatically20:57
clarkbI too can help though have a call in a minute20:57
corvusiiuc, we want to remove ze from emergency, then allow ansible to write out pending updates, then shut down all executors (using docker on ze1 and init everywhere else), restart scheduler, start all executors using docker20:58
clarkbmaybe? I'm not sure how the current code which assumes docker containers will interact with the non docker services20:58
corvusclarkb: what current code?20:58
clarkblooks like start.yaml is not run by default20:59
clarkbso I think your plan will work21:00
corvusyeah, that's what i was thinking; if we're worried, we could probably manually stop, then run ansible, then start; but that'll take a bit longer21:00
clarkbin looking at that change I think I convinced myself its fine21:01
clarkbit should install the docker things but not switch running services as well as apply the zk updates21:01
clarkbthen you can stop the non docker services and start them under docker21:01
corvusk.  i'll start by removing emergency and then running service-zuul playbook on bridge21:01
corvushow do i stop the cron?21:02
corvusi mean the hourly zuul job21:02
clarkbI don't think you can stop it, you can run the stopper script which will stop the next job from running21:02
clarkbbut once a job has started we don't have a way to abort that21:02
corvusi ran the disable-ansible script21:03
clarkblooks like zuul isn't running now so you can run the script to pause the next thing21:03
corvusinfra-prod-install-ansible just started21:03
corvusoh i think maybe i did it just in time21:03
corvusto conclude from that.21:03
corvusto conclude from that.21:04
corvuswhy can't i copy/paste from zuul's console log?21:04
corvusgimme a second while i type it in21:04
corvus"TASK [Make sure a manual maint isn't going on]"21:04
corvusis where it's currently sitting21:04
clarkbcool then ya you probably caught it21:05
corvusi just noticed nb03 is in emergency21:07
corvusso we'll need to do something about that before we turn off plaintext21:07
clarkbya we have multi arch images now so that should be doable in the near future21:08
corvusi'm running service-zuul on ze01; just want to see it noop first21:08
*** rpittau has quit IRC21:12
*** rpittau has joined #opendev21:13
corvusapparently we failed to pull the image:ERROR: for executor  read tcp> read: connection reset by peer21:15
corvusi will run that again21:15
clarkbI'm having dns trouble locally21:17
fungithat looks like a cloudflare address... i guess fronting dockerhub (or trying to)?21:18
clarkboh hey that explains it21:18
clarkbmy dns is cloudflare too21:19
clarkbI bet they are having an outage21:19
clarkbfun fun21:19
clarkbdiscord is apparently broken too21:23
clarkbits a friday "internet is broken" day21:23
*** johnsom_ has joined #opendev21:26
mnaseri was just going to hint here about being down21:26
mnaserdiscord being broken is a by product of down21:27
mnaser "all systems operational"21:28
clarkbmy dns resolves again21:28
corvusERROR: for executor  error pulling image configuration: Get dial tcp: lookup on read udp> i/o timeout21:29
corvusyeah, that was the second error21:29
corvustrying a 3rd time21:29
*** johnsom_ has quit IRC21:29
mnaseri'm seeing dns restored in some parts of the world but still broken in others21:29
mnaser1.1.1.1 is resolving again21:36
corvusattempt #3 failed, trying again21:46
mnaser"Identified - The issue has been identified and a fix is being implemented. "21:47
*** melwitt is now known as jgwentworth21:49
*** avass has quit IRC21:53
fungi"the issue has been implemented and a fix is being identified"21:54
corvusmanaged to complete this time21:54
corvusi'll run it on all the zes now21:55
corvuswe do 'remove old init script files'21:55
corvusbut these are executors, we can just run 'zuul-executor stop' to stop it21:56
corvusthough i don't know if systemd will be left in a confused state21:56
corvus   Active: failed (Result: exit-code) since Thu 2020-06-25 17:11:53 UTC; 3 weeks 1 days ago21:56
clarkbI would try stopping it with systemd first (it compiles configs but not sure if that includes the bash)21:56
corvusthat's what ze01 says21:56
clarkbbut then ya it may just go into a failed or error state and that should be fine21:56
corvusso i think if we just run zuul-executor stop, that's the worst case scenario21:56
corvusclarkb: yeah21:57
corvusso i'll proceed with the plan as discussed and run the playbook on all zes now21:57
clarkbsounds good21:57
fungistopping the initscript after the service is stopped should be idempotent, so if you're concerned about systemd's state tracking you could just ask systemd to stop it last thing (after it's actually stopped)21:58
clarkbfungi: well in this case the script won't be there anymore21:59
clarkbbut I think systemd will just say "the unit is error or failed" now21:59
clarkbthen we can disable the unit to prevent it from starting on boot22:00
corvus(ot: bob and doug are planning on returning from space on aug 2)22:00
clarkbdo you think we should tell them that there is a pandemic and they may want to hang out up there longer?22:01
fungi(whaa? take off, ya hoser)22:01
corvusclarkb: right?  i mean, what's the rush?  you're stuck inside either way.22:01
corvusthough they only have to wear a mask outside down here instead of a suit, though that might be a good idea22:02
fungioh! that bob and doug, i thought you meant the mckenzie brothers22:02
corvusfungi: wow, i think you just wrote an *amazing* sketch.22:03
fungiit includes a case of molson22:03
corvusfloating around the dragon capsule22:03
corvusokay, i will try to stop laughing and proceed with maint; the playbook is finished22:04
clarkbI bet their view of the comet is better than mine too22:05
corvusi'll stop all the executors now (will attempt systemctl stop)22:05
*** rh-jelabarre has quit IRC22:06
corvussystemd is fussing about the units having changed, so i don't know if it actually did stop it or not22:08
corvusi'm just going to run zuul-executor stop now22:09
corvusi'm pretty sure systemd didn't do anything22:10
corvuslooks like it's really stopping now22:10
clarkbcorvus: are we leaving ze01 alone ?22:10
corvusno, i stopped it so it doesn't end up with all the jobs22:10
clarkbcorvus: and you did that via docker-compose ya?22:10
corvusi'll leave the scheduler up while the executors shut down22:11
corvus(to minimize downtime for the scheduler and maximize the number of mergers online when it starts)22:11
fungimakes sense22:17
corvus3 remaining22:19
corvusall clear22:29
corvusi'll save queues, stop the scheduler, start the executors, then start the scheduler22:30
fungisounds good22:30
* clarkb still watching if anything comes up22:31
corvushrm playbooks/start-mergers-executors.yaml does not look like it was updated for docker22:31
corvusi will run a locally modified version of that22:32
*** DSpider has quit IRC22:33
corvusexecutors are up, scheduler is starting22:33
corvusi'm unsure what the scheduler is doing22:35
corvuscould that be a fatal error?22:37
fungioh stopped the geard fork?22:38
corvusthat TB appears to be from the server22:39
clarkbthe fork is still up too22:39
fungiyeah, i see two zuul-scheduler processes22:39
corvusi wonder if it's wedged though?22:40
fungistrace says it's reading from fd 422:40
corvusthough that should "just" be from a merger22:40
corvusmaybe it got stuck on that22:40
fungiand yeah, if it weren't wedged i'd expect strace to have said more than that by now22:41
fungithe parent is on recvfrom(28, ...silence22:41
corvuswhere's our docs on how to connect to geard?22:42
corvusi thought we had something in
fungicomponents i think?22:42
clarkbcorvus: telnet host 4730 ?22:42
corvusi thought we used ssl22:42
clarkboh right22:42
clarkblooking at that gear code I expect it should set statsd to None or something useful in Server.__init__22:43
clarkbwhich makes me wonder if there was an earlier failure configuring geard22:43
fungiopenssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/client.pem  -key /etc/zuul/ssl/client.key22:43
corvusthat connection is rejected because the cert can't be verified22:44
corvusare those the right cert/key files?22:45
clarkbssl_cert=/etc/zuul/ssl/gearman-server.pem ssl_key=/etc/zuul/ssl/gearman-server.key in our zuul.conf22:45
corvusokay, final iteration is: openssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/gearman-client.pem -key /etc/zuul/ssl/gearman-client.key22:47
clarkbs/server/client/ for client side22:47
corvusit is up and running22:47
corvusby that i meant geard22:48
corvusbut also apparently the scheduler is doing something now22:48
clarkbthe scheduler just started ya22:49
clarkbdoes it have startup tasks like the executor to clean things up?22:49
corvusi'm trying to figure out what the holdup was22:49
clarkbexecutors can take a minute to actually begin work while they clear the build dirs up22:49
corvusthe first interesting message after the slow time was: 2020-07-17 22:47:33,949 INFO zuul.GithubConnection: Starting event connector22:50
fungiit's using a trove instance for the db reporter?22:50
clarkbfungi: yes I believe so22:50
clarkb(its a huge db)22:50
corvusoh, let me see if there was a migration22:50
fungiyeah, that's what i was wondering22:51
fungiif it was waiting for (however slow) trove db to migrate22:51
fungii wonder if that's something we can better surface in the logs22:52
corvus2020-07-17 22:32:59,666 DEBUG zuul.SQLConnection: Current migration revision: 16c1dc9054d022:52
corvuszuul/driver/sql/alembic/versions/ = '16c1dc9054d0'22:52
corvusfungi: i think you win22:52
*** tosky_ has joined #opendev22:52
corvusand yeah, we should do that.  because i'm pretty sure we did the same thing last time.22:52
fungithat was indeed lengthy22:52
corvusand this is going to happen a lot.22:52
corvusalso, we should prune our db.22:52
clarkbmaybe even a zuul command to check if a migration will happen22:53
clarkbso that we can easily approximate downtime22:53
corvusi'm re-enqueueing now22:53
*** tosky has quit IRC22:53
fungi"me: please restart; zuul: i'm sorry dave, i can't do that"22:53
clarkb++ to more logging and pruning and all those ideas though22:53
corvus#status log restarted all of zuul using tls zookeeper and executors in containers22:54
openstackstatuscorvus: finished logging22:54
corvusmordred: FYI ^22:54
fungiso that was ~15 minutes spent on the db migration i guess?22:54
corvusabout.  a little less.22:57
*** mlavalle has quit IRC22:59
*** tosky_ is now known as tosky23:01
clarkbthats everything but nb03 running or ready to run zk connections with tls ya?23:04
clarkbnext week is good time to convert nb03 to our new multi arch images I guess23:04
clarkbcorvus: we landed the python base and builder side of that right?23:04
clarkb has both arches but does not23:07
clarkb has merged. Do we just need to trigger builds?23:08
clarkb looks like we ran the jobs but not the promote and I think that is because we only ran the jobs due to them changing23:10
corvusclarkb: yep, i think we need a nodepool build23:15
corvusenqueue is finished23:16
clarkbcorvus: well a nodepool build won't rebuild python-builder and python-base23:16
corvusoh er i thought the change we merged would have done that23:16
clarkbI was looking for a useful change to one of those (the jobs will run for both if either change) but I can't come up with anything so will push a noop change instead23:16
clarkbcorvus: according to it did not. I think because only the jobs changed23:16
clarkband the post merge jobs don't have the context of "did my own job update" to know they should run unconditionally?23:17
corvusclarkb: makes sense.  :/23:17
corvusi will approve your noop change :)23:17
openstackgerritClark Boylan proposed opendev/system-config master: Noop update to force python-builder/base to rebuild
clarkbI think ^ should do it23:18
clarkbthen we can land a nodepool change then we can deploy nb0323:18
openstackgerritJames E. Blair proposed opendev/system-config master: Update zuul-executor stop/start playbook
*** tosky has quit IRC23:38
