*** ysandeep|away is now known as ysandeep | 00:04 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-git-repos: update deprecated API path https://review.opendev.org/741562 | 00:05 |
fungi | okay, so after reading the change in their gerrit and https://sourceware.org/bugzilla/show_bug.cgi?id=20358 it fixes, i gather the idea is that in the future applications should be doing their own dnssec validation instead of trusting any resolver to tell them whether a record is valid, but as a workaround there is now a stub resolver option to enable the old behavior of trusting the configured resolver's | 00:06 |
openstack | sourceware.org bug 20358 in network "RES_USE_DNSSEC sets DO; should also have a way to set AD" [Normal,Resolved: fixed] - Assigned to fweimer | 00:06 |
fungi | evaluation of the record validity | 00:06 |
fungi | seems mildly premature if no applications are actually checking rrsigs themselves, but what do i know | 00:07 |
fungi | glibc 2.31 has essentially broken existing dnssec for anyone not savvy enough to know to set that option on every single client system | 00:08 |
ianw | fungi: but it's probably not good to be trusting your isp's resolvers? | 00:14 |
ianw | Archive name: borg-backup-test01-2020-07-17T00:11:29 | 00:14 |
ianw | Archive fingerprint: 3c0535de9273f2e132ee02eae0d2458db9ae6fd46cdd92a361199f7a0a261c82 | 00:14 |
ianw | Time (start): Fri, 2020-07-17 00:11:30 | 00:14 |
ianw | Time (end): Fri, 2020-07-17 00:11:37 | 00:14 |
ianw | Duration: 7.64 seconds | 00:14 |
openstackgerrit | Clark Boylan proposed opendev/jeepyb master: Set repo HEAD on gerrit project creation https://review.opendev.org/741279 | 00:14 |
ianw | pretty cool, hosts backing themselves up during CI to test the full path | 00:15 |
clarkb | the git operations in there could probably use double checking. They seem to work with local testing on test repos but I'm not sure if those are the best options available to us | 00:15 |
fungi | ianw: it's definitely not good to be trusting your isp's resolvers, but that's why you should have a local validating resolver | 00:17 |
fungi | i trust the instance of unbound running on my openbsd firewall | 00:17 |
ianw | right, but you have to tell glibc you trust that ... so their change is probably generically correct? | 00:18 |
fungi | i get that if i roam some portable device onto coffee shop wifi i shouldn't suddenly start trusting their resolver, sure, but i generally don't trust anything about that internet connection | 00:18 |
fungi | anyway, it looks like nm has also grown an option for you to associate the trust-ad toggle with specific network profiles, so i'll likely use that to prevent needing to constantly fiddle resolv.conf | 00:20 |
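For reference, the workaround fungi describes is a one-line stub resolver option (glibc 2.31 and later); a minimal sketch of the system-wide form:

```
# /etc/resolv.conf -- restore the pre-2.31 behavior of trusting the
# AD (authenticated data) bit set by the configured validating resolver:
options trust-ad
```

The per-profile NetworkManager equivalent should be something like `nmcli connection modify <profile> ipv4.dns-options trust-ad` (property name per nm-settings(5); worth double-checking against your NM version).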
clarkb | 2020-07-17 00:21:23.448189 | ubuntu-bionic | 2020-07-17 00:21:23,447: jeepyb.utils - INFO - Executing command: git --git-dir=/tmp/jeepyb-cache/test/test-repo-2/.git --work-tree=/tmp/jeepyb-cache/test/test-repo-2 push ssh://localhost:29418/test/test-repo-2 HEAD:refs/heads/main that looks better | 00:26 |
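The push in that log line can be exercised locally without a gerrit; a minimal sketch (throwaway temp paths, not jeepyb's actual code) of the two steps the jeepyb change needs: push the initial content to refs/heads/main, then point the remote's HEAD at it.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# a bare repo standing in for the gerrit-hosted project
git init -q --bare remote.git
git init -q work
cd work
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
# push whatever HEAD points at into the remote's main branch
git push -q ../remote.git HEAD:refs/heads/main
# make fresh clones of the remote check out main by default
git --git-dir=../remote.git symbolic-ref HEAD refs/heads/main
git --git-dir=../remote.git symbolic-ref HEAD
```

The final command confirms the remote's HEAD now resolves to refs/heads/main regardless of what the local default branch name was.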
fungi | it's just annoying that dnssec was finally starting to achieve some degree of widespread penetration, even if it relied on mechanisms from rfc 2535 §6.1 which didn't validate the last hop between the client and its configured recursive resolver... to suddenly break that everywhere and ask users to evaluate whether their configured resolvers are trustworthy is basically going to undo all of that and we'll go | 00:29 |
fungi | back to basically everyone relying on unvalidated dns responses instead, which is far worse | 00:29 |
fungi | this is the sort of change which governments sneak into cryptographic systems to keep them unusable by almost everyone | 00:30 |
ianw | fungi: i thought the main problem with dnssec was that you have zone exposure | 00:31 |
ianw | clarkb/corvus: ok, i'm reading about append-only mode : https://borgbackup.readthedocs.io/en/stable/usage/notes.html#append-only-mode | 00:32 |
*** ryohayakawa has joined #opendev | 00:32 | |
ianw | it does seem like what we want | 00:32 |
ianw | i think that if we restrict the "command=" on the backup server side to "borg serve --append-only" we get what we want | 00:33 |
ianw | that leaves us with the possibility to have something running on the backup server itself that might prune the repos periodically | 00:33 |
clarkb | oh interesting they can be treated differently based on client | 00:33 |
ianw | it seems to be a bug or a feature : https://github.com/borgbackup/borg/issues/3504 | 00:36 |
ianw | i think feature, although you could argue the client commands should make it more obvious what's going on | 00:36 |
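The `command=` restriction being discussed would look roughly like this in the backup user's `~/.ssh/authorized_keys` on the backup server; the repo path and key are placeholders, and the pattern follows the append-only example in the borg documentation:

```
command="borg serve --append-only --restrict-to-path /opt/backups/host01",restrict ssh-ed25519 AAAA...example host01-backup-key
```

With this in place the client can only ever append; pruning has to happen out-of-band on the backup server itself, as ianw suggests.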
fungi | ianw: as in that the sequence of signatures allows you to bisect records and effectively find records in a zone which you wouldn't otherwise be aware of? pretty sure that got solved by rrsig range responses, but it's been a while since i read up on that issue | 00:37 |
fungi | something to do with signed nxdomain responses if memory serves | 00:37 |
ianw | fungi: this is what i was thinking of https://blog.cloudflare.com/dnssec-complexities-and-considerations/ | 00:39 |
fungi | ianw: aha, yeah, on-line signing is the mitigation i was thinking of | 00:44 |
fungi | but really, the ultimate mitigation is "don't put secret data into the public dns" | 00:45 |
fungi | if you think about the system dns replaced, the shared hostfile, everyone got a copy of every hostname on the internet. dns wasn't designed to keep records secret either, people just started relying on the side effect that refusing axfr requests forced people to brute-force guess your records to find them | 00:46 |
fungi | as for the nxdomain range responses, the original dnssec design didn't even try to obscure the ranges by hashing the zone record ordering | 00:50 |
fungi | that got tacked on later when people raised objections to the fact that information they were putting in public dns was *gasp* public | 00:51 |
ianw | i think clearly the solution is to move dns into the bitcoin blockchain | 00:55 |
ianw | speaking of keys, there's an internal RH server i've been signing into for about 7 years just fine with a kerberos ticket that has suddenly stopped working with | 00:58 |
ianw | debug1: Unspecified GSS failure. Minor code may provide more information | 00:58 |
ianw | KDC has no support for encryption type | 00:58 |
ianw | googling that just gets more and more confusing | 00:59 |
*** Eighth_Doctor is now known as Conan_Kudo | 01:01 | |
*** Conan_Kudo is now known as Eighth_Doctor | 01:02 | |
corvus | ianw: blockchain dns is a thing: https://handshake.org/ | 01:05 |
corvus | i think it's not completely insane, aside from the whole proof-of-work-energy-consumption-destroying-the-environment thing that all blockchain solutions share | 01:07 |
corvus | ianw: yeah, that append only option looks good | 01:10 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 01:13 |
ianw | ^ that should implement it. i expect testinfra to pass for that, which does a full backup cycle for the two test hosts, which is pretty cool | 01:14 |
ianw | todo is update documentation, and probably something to configure per-host backup locations | 01:15 |
*** ysandeep is now known as ysandeep|away | 01:49 | |
ianw | https://917e391602178bc40e5f-1cbf7c2bad1b53c710605a1cfc31790e.ssl.cf1.rackcdn.com/741366/14/check/system-config-run-borg-backup/d512c68/bridge.openstack.org/test-results.html is nice for testinfra | 02:05 |
*** shtepanie has quit IRC | 02:11 | |
*** rh-jelabarre has quit IRC | 02:20 | |
*** sgw1 has quit IRC | 02:28 | |
*** sgw1 has joined #opendev | 02:29 | |
*** sgw1 has quit IRC | 02:33 | |
*** sgw1 has joined #opendev | 02:59 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 03:12 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 03:35 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] borg backups https://review.opendev.org/741366 | 04:04 |
*** sgw1 has quit IRC | 04:34 | |
*** xiaolin has quit IRC | 05:34 | |
*** xiaolin has joined #opendev | 05:36 | |
*** xiaolin has quit IRC | 05:43 | |
*** xiaolin has joined #opendev | 05:55 | |
*** ysandeep|away is now known as ysandeep | 06:09 | |
*** ysandeep is now known as ysandeep|rover | 06:24 | |
*** qchris has quit IRC | 06:25 | |
*** marios has joined #opendev | 06:47 | |
*** markPilon has joined #opendev | 06:52 | |
*** qchris has joined #opendev | 06:53 | |
*** xiaolin has quit IRC | 06:55 | |
*** xiaolin has joined #opendev | 06:57 | |
openstackgerrit | vinay kumar muddu proposed openstack/diskimage-builder master: Fixes nit in DIB_IPA_CERT certificate copy https://review.opendev.org/741583 | 06:58 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add borg-backup roles https://review.opendev.org/741366 | 06:58 |
ianw | infra-root: ^ ready for review; i think more or less everything we discussed is mentioned in the changelog | 06:59 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 07:00 |
*** calcmandan has quit IRC | 07:26 | |
*** xiaolin has quit IRC | 07:26 | |
*** calcmandan has joined #opendev | 07:26 | |
*** dougsz has joined #opendev | 07:27 | |
*** DSpider has joined #opendev | 07:29 | |
*** tosky has joined #opendev | 07:33 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:03 | |
*** avass has quit IRC | 08:17 | |
*** fressi has joined #opendev | 08:28 | |
*** fressi_ has joined #opendev | 08:32 | |
*** fressi has quit IRC | 08:33 | |
*** fressi_ is now known as fressi | 08:33 | |
*** roman_g has joined #opendev | 08:39 | |
*** ysandeep|rover is now known as ysandeep|lunch | 08:41 | |
*** dtantsur|afk is now known as dtantsur | 09:19 | |
*** ysandeep|lunch is now known as ysandeep|rover | 09:27 | |
*** marios has quit IRC | 09:38 | |
*** fressi has quit IRC | 09:42 | |
*** fressi has joined #opendev | 09:46 | |
openstackgerrit | vinay kumar muddu proposed openstack/diskimage-builder master: Fixes DIB_IPA_CERT certificate copy issue https://review.opendev.org/741583 | 09:49 |
*** tkajinam has quit IRC | 09:52 | |
*** dtantsur is now known as dtantsur|brb | 10:07 | |
*** ryohayakawa has quit IRC | 10:11 | |
*** sshnaidm|afk is now known as sshnaidm|off | 10:16 | |
*** markPilon has quit IRC | 10:25 | |
*** fressi has quit IRC | 10:29 | |
*** fressi has joined #opendev | 10:41 | |
*** marios has joined #opendev | 10:42 | |
*** ysandeep|rover is now known as ysandeep|afk | 11:11 | |
*** fressi has quit IRC | 11:20 | |
*** ysandeep|afk is now known as ysandeep | 11:31 | |
*** avass has joined #opendev | 11:33 | |
*** ysandeep is now known as ysandeep|rover | 11:34 | |
*** rh-jelabarre has joined #opendev | 12:07 | |
*** dtantsur|brb is now known as dtantsur | 12:10 | |
*** ysandeep|rover is now known as ysandeep|coffee | 12:19 | |
*** ysandeep|coffee is now known as ysandeep|rover | 12:40 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 13:02 |
*** fressi has joined #opendev | 13:04 | |
*** fressi has quit IRC | 13:44 | |
*** sgw1 has joined #opendev | 13:48 | |
*** marios has quit IRC | 13:53 | |
*** marios has joined #opendev | 13:57 | |
*** ysandeep|rover is now known as ysandeep|away | 13:58 | |
*** mlavalle has joined #opendev | 14:04 | |
sgw1 | Morning, is there a known issue with opendev.org being really slow this morning | 15:11 |
sgw1 | ?? | 15:11 |
* fungi checks graphs | 15:12 | |
fungi | it doesn't look like our ddos crawler is back at least | 15:13 |
fungi | sgw1: yours is the first report i've heard | 15:13 |
sgw1 | might be a slow issue with corp proxies | 15:13 |
fungi | slow rendering? slow cloning? | 15:13 |
fungi | https://opendev.org/openstack/nova/ displays very quickly for me at least | 15:14 |
sgw1 | It's something on my end | 15:14 |
sgw1 | someone else just confirmed it works ok for them, sorry for the noise | 15:15 |
fungi | no worries and thanks for checking! | 15:15 |
sgw1 | an excuse to say Morning to all of you! | 15:15 |
fungi | and a very merry friday to you as well! | 15:34 |
*** mlavalle has quit IRC | 15:40 | |
*** mlavalle has joined #opendev | 15:41 | |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:11 |
*** marios is now known as marios|out | 16:12 | |
*** dtantsur is now known as dtantsur|afk | 16:13 | |
clarkb | ildikov reported that https://meetpad.opendev.org/stx-build fails to load the etherpad. I've confirmed that other etherpads work and this one fails. It seems to fail with a cross origin request except I have no idea how to get chrome and/or firefox to tell me what the requested url is | 16:13 |
clarkb | I half suspect something to do with the etherpad state itself | 16:14 |
clarkb | (the etherpad is an older one with a large ish number of revisions) | 16:14 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:14 |
clarkb | does anyone else have better insight into what the browser is doing there or know how to manipulate the developer tools to do so? I've tried firefox and chrome and both are lacking what was actually requested from what I can see (though ff shows a slightly different error) | 16:15 |
clarkb | on firefox I see pad.js is trying to warn the console that it cannot set the author id for some reason and that is treated as a cross origin request? | 16:19 |
clarkb | Uncaught DOMException: Permission denied to access property "console" on cross-origin object | 16:19 |
clarkb | ok that happens after loading some user info so maybe it is related to pad state | 16:20 |
*** marios|out has quit IRC | 16:23 | |
clarkb | it is calling top.console.warn() | 16:23 |
clarkb | I guess top is a cross origin resource? | 16:23 |
clarkb | I need to step out for a bit but will try and sort out what top is I guess | 16:23 |
fungi | fwiw on other pads, even large ones, i still see some cors blocking show up in the web console, but the pad still loads for me | 16:24 |
clarkb | ya I think the issue is this cross origin request happens in the pad loading of the content | 16:25 |
clarkb | so it doesn't complete. In other contexts you won't get other bits but the pad text loads | 16:25 |
*** dougsz has quit IRC | 16:35 | |
clarkb | https://github.com/ether/etherpad-lite/commit/00b6a1d9feae2399c08b42b7a3d711aed2d87a73 | 16:35 |
clarkb | we have a winner | 16:35 |
clarkb | I guess we cherrypick that fix into our image? | 16:36 |
fungi | huh, good find! | 16:37 |
fungi | and that didn't make it into 1.8.4? | 16:38 |
*** ildikov has joined #opendev | 16:38 | |
clarkb | no missed it by about 2 weeks | 16:38 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix issues with buildset registry https://review.opendev.org/741584 | 16:41 |
fungi | oh, right, i forgot we held off upgrading for a while | 16:42 |
clarkb | I'm not sure how to cherrypick that into upstream's image though | 16:42 |
clarkb | I guess this is the downside of not installing it ourselves | 16:43 |
fungi | i thought you had worked out a custom built image with your fix for the author colors overlap a little while back? | 16:43 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Fix certificate issue with use buildset registry https://review.opendev.org/741584 | 16:43 |
clarkb | fungi: sed on the prod files :/ | 16:44 |
fungi | pj | 16:44 |
fungi | er, oh | 16:44 |
clarkb | I mean we can do that here too | 16:44 |
clarkb | s/top.console/\/\/top.console/ | 16:44 |
clarkb | we can probably make a patch file and apply that | 16:48 |
clarkb | but I need to boot the etherpad image locally and figure it out | 18:48 |
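clarkb's substitution, run against a stand-in file (the real target would be etherpad's minimized pad.js inside the image; the filename and content here are hypothetical):

```shell
# comment out the offending top.console call the way the proposed
# sed substitution would
printf 'top.console.warn("failed to set author");\n' > pad.js
sed 's|top\.console|//top.console|' pad.js > pad.js.patched
cat pad.js.patched
```

Using `|` as the sed delimiter avoids having to escape the `/` characters in the `//` comment marker.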
clarkb | currently trying to sort out how the minimized sources happen. It's almost like node does it on the fly? | 17:18 |
clarkb | since the installation it does is actually to symlink back to itself in the source dir | 17:19 |
clarkb | rather than do an install of minimized content (but what the browser gets is definitely minimized compared to what is in the source dir) | 17:19 |
*** owalsh has quit IRC | 17:21 | |
clarkb | ya I think that is the case which makes this a little simpler | 17:22 |
*** owalsh has joined #opendev | 17:42 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Patch etherpad console logging to fix cross origin error https://review.opendev.org/741692 | 17:49 |
clarkb | fungi: ^ something like that maybe? I think if that passes testing the next thing to do is a followon change that forces a failure and do a hold and test with the held node? | 17:49 |
*** chandankumar is now known as raukadah | 18:02 | |
*** roman_g has quit IRC | 18:06 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: DNM force etherpad failure to hold node https://review.opendev.org/741698 | 18:24 |
*** johnsom has quit IRC | 18:24 | |
clarkb | fungi: can patch apply the git diff? | 18:24 |
clarkb | I would've expected the additional git metadata to be a problem | 18:25 |
*** johnsom has joined #opendev | 18:25 | |
clarkb | also how does it know which files to apply to when the paths are git specific? | 18:25 |
fungi | clarkb: yeah, git's diffs are totally compatible with the patch utility's unified diff handling (or always has been when i've tried it) | 18:25 |
clarkb | fungi: on a single file level I would expect that to be the case because you can do patch this_file patch_file | 18:26 |
fungi | but agreed, if the deployed file relationships are not the same as in the repository (at least relative to some parent directory) then that gets problematic for multi-file diffs | 18:26 |
clarkb | but with multiple files you'd need the files to be named properly? I mean it may work by magic and a/ b/ is a thing | 18:27 |
clarkb | fungi: well when git does it it does an a/ b/ prefix | 18:27 |
fungi | if they're all in the same relative locations/names though you can just use the -p option to tell it how many levels to prune | 18:27 |
clarkb | which won't exist on disk when patch runs | 18:27 |
clarkb | oh thats the magic | 18:27 |
clarkb | TIL | 18:27 |
fungi | patch -p1 < some.patch | 18:27 |
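A quick demonstration of that: git's diffs carry `a/` and `b/` path prefixes plus extra header lines, and plain patch(1) copes with both once `-p1` strips the first path component (throwaway files in a temp dir):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
printf 'hello\n' > greeting.txt
# a git-style diff, complete with the extra "diff --git"/"index" headers
cat > some.patch <<'EOF'
diff --git a/greeting.txt b/greeting.txt
index ce01362..177ebfc 100644
--- a/greeting.txt
+++ b/greeting.txt
@@ -1 +1 @@
-hello
+goodbye
EOF
patch -p1 < some.patch
cat greeting.txt
```

patch treats the git-specific header lines as leading garbage and keys off the `---`/`+++` lines, so no `git apply` is needed.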
clarkb | I've put a hold on https://review.opendev.org/#/c/741698/1 | 18:27 |
clarkb | I don't think we'll be able to confirm it fixes the meetpad issue pre merge | 18:29 |
clarkb | but we should be able to confirm it doesn't regress etherpad in normal operation | 18:29 |
clarkb | if we didn't proxy etherpad in meetpad we'd be able to use local /etc/hosts overrides to test it but we do proxy (in order to fix other issues with cross origin requests) | 18:30 |
clarkb | fungi: fwiw it is a git repo in the docker image, it's just that git doesn't exist on the image and pulling it in is a lot larger than pulling in patch | 18:30 |
clarkb | otherwise I would've just installed git and done a cherrypick | 18:30 |
fungi | makes sense | 18:31 |
fungi | and yeah, i agree testing whether this fixes the meetpad issue will be nontrivial without standing up a separate jitsi-meet pointed at it | 18:31 |
fungi | but i'm okay with it just not obviously breaking etherpad | 18:32 |
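A hedged sketch of how that might look in an image build layered on the upstream image (the base image name, the `/opt/etherpad-lite` path, and the `etherpad` user are assumptions about upstream's image, and `console-logging.patch` stands in for the cherry-picked diff):

```
FROM etherpad/etherpad
USER root
# git would drag in far more than the small patch utility needs
RUN apt-get update && apt-get install -y --no-install-recommends patch \
    && rm -rf /var/lib/apt/lists/*
COPY console-logging.patch /tmp/console-logging.patch
RUN cd /opt/etherpad-lite && patch -p1 < /tmp/console-logging.patch
USER etherpad
```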
clarkb | while I'm hopeful we'll be able to delete all these hacks when 1.8.5 or whatever the next release will be happens, I have a hunch that we'll just replace the existing set of fixes with a set of new fixes | 18:33 |
clarkb | so having some sort of system for that seems good (and patch works well enough I think) | 18:33 |
fungi | or it may be a sign that we should look into building our own images at some point | 18:35 |
fungi | since that's how we're handling the same sort of problems for other services | 18:35 |
clarkb | ya, we can copy their image though its fairly involved in order to get node and yarn and related things installed. At least so far we've only cared about the source of the etherpad service itself | 18:36 |
clarkb | definitely if we need to start changing node versions or similar we'd probably want to drop the use of their image | 18:36 |
clarkb | I'm going to grab lunch now then when I get back we should have a test node we can check the fix against | 18:39 |
clarkb | also I had intended to send those emails this morning before people weekended but now I'm thinking I should wait for monday so that there is a chance people see them | 18:39 |
fungi | great point, friday e-mail has a tendency to slip through the cracks | 18:40 |
clarkb | fungi: 158.69.67.103 etherpad.opendev.org | 19:14 |
clarkb | The css fix seems to be working at least | 19:15 |
clarkb | I'm going to see if I can infer from minimized sources that the console logs are cleaned up | 19:15 |
clarkb | using the debugger and ^F I think they are gone | 19:16 |
clarkb | I'm going to remove my -W now | 19:18 |
*** owalsh has quit IRC | 19:20 | |
clarkb | infra-root I think that we can land https://review.opendev.org/#/c/741692/1 if 158.69.67.103 as etherpad.opendev.org looks good to you | 19:23 |
fungi | yeah, tested, lgtm. thanks! | 19:27 |
*** owalsh has joined #opendev | 19:29 | |
clarkb | I'm looking at my jeepyb branches update test logs and overall it looks good. I have however discovered an interesting jeepyb behavior. | 19:34 |
clarkb | Any idea why it seems we create and sort of use two different repos https://zuul.opendev.org/t/openstack/build/469948ee648347719d92b70f041649b3/log/job-output.txt#750-753 | 19:35 |
clarkb | I'm kind of thinking that may be something we can cleanup since it seems we only use the jeepyb-cache repo for useful tasks and ignore jeepyb-git | 19:35 |
fungi | we need a clone from which to apply acls | 19:36 |
fungi | we also need a clone from which to push imported content | 19:37 |
clarkb | fungi: ya they both seem to be jeepyb-cache for that | 19:37 |
corvus | clarkb: +w etherpad | 19:37 |
clarkb | fungi: if you follow the larger context of the log there you see it doing those steps and randomly in the middle it inits another bare repo | 19:37 |
clarkb | corvus: thanks | 19:37 |
fungi | clarkb: ahh, i wonder if we used to use jeepyb-git for one of them and combined them in a refactor at some point but never cleaned up? | 19:38 |
clarkb | fungi: ya thats what I'm beginning to suspect | 19:38 |
fungi | i agree it seems to be unused | 19:38 |
fungi | or at least is unused now | 19:38 |
clarkb | based on that https://review.opendev.org/#/q/topic:opendev-git-branches lgtm now | 19:39 |
clarkb | but careful review is much appreciated | 19:39 |
fungi | clarkb: the split between 741277 and 741279 is a little hard to follow... for example you add default-branch to the sample projects.yaml in the earlier change but don't actually use it until the later change | 20:00 |
fungi | actually that may be the only thing confusing me | 20:00 |
fungi | was the projects.yaml edit meant to happen in 741279 instead? | 20:00 |
clarkb | fungi: they are in two different repos so rather than make a third change I decided to consolidate | 20:01 |
clarkb | fungi: really that job should've been defined in jeepyb but we were trying to test a gerritlib release at the time so the focus was there | 20:01 |
clarkb | I'm open to ideas on how to make it clearer, maybe a third change would help? | 20:03 |
fungi | oh, yeah i missed those were gerritlib and jeepyb, now i get it | 20:07 |
fungi | nah, it makes sense. my head was in the "split because can't turn it on until gerrit upgrade" space | 20:07 |
fungi | i missed that it was also split between repos | 20:08 |
openstackgerrit | Merged opendev/system-config master: Patch etherpad console logging to fix cross origin error https://review.opendev.org/741692 | 20:10 |
clarkb | we don't seem to have auto applied ^ to the etherpad server | 20:15 |
clarkb | the role is set up such that if it runs it would happen but I don't think we ran the job at all | 20:15 |
clarkb | I'm going to manually pull and restart the service now | 20:16 |
fungi | i saw the deploy job run (and succeed | 20:16 |
fungi | ) | 20:16 |
clarkb | that was the image promote job | 20:17 |
clarkb | we didn't run infra-prod | 20:17 |
fungi | it was a build in the deploy pipeline | 20:17 |
clarkb | https://meetpad.opendev.org/stx-build loads the etherpad now fwiw | 20:18 |
clarkb | the problem is we don't run that job if the etherpad image updates | 20:19 |
fungi | oh, huh, yeah system-config-promote-image-etherpad runs in deploy? | 20:20 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run our etherpad prod deploy job when docker updates https://review.opendev.org/741708 | 20:21 |
clarkb | fungi: ya I think all of the image promotions run in deploy so that the sequencing with infra-prod is correct? | 20:22 |
clarkb | if they are in different pipelines then the dependency stuff gets weird (maybe impossible?) | 20:22 |
clarkb | fungi: re defaults and https://review.opendev.org/#/c/741264/ ya that is basically what I was thinking. We want to have control to switch earlier than gerrit or gitea do it if git switches and possibly hold back if gitea or gerrit go early | 20:23 |
clarkb | basically gives us control to make the switch consistently across the board | 20:24 |
clarkb | and not discover issues the hard way | 20:24 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Use non deprecated gitea repo creation endpoint https://review.opendev.org/741710 | 20:32 |
clarkb | ianw: fungi ^ thats the followon fix for gitea | 20:32 |
clarkb | oh wait ianw already pushed that change /me abandons the extra one | 20:33 |
clarkb | fungi: https://review.opendev.org/#/c/741562/1 fyi | 20:33 |
clarkb | oh except that will need a rebase to not conflict with the other change | 20:34 |
clarkb | silly git | 20:34 |
openstackgerrit | Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use (upload|promote)-docker-image roles in periodic jobs https://review.opendev.org/740560 | 20:38 |
openstackgerrit | Merged opendev/system-config master: Allow setting Gitea repo branch on project creation https://review.opendev.org/741264 | 20:51 |
corvus | the trendline is headed down on node requests | 20:54 |
corvus | i think i might start preparing for zuul restart | 20:55 |
corvus | maybe just take a short outage and do the scheduler and all the executors at once | 20:55 |
fungi | i'm around to help/watch | 20:55 |
corvus | mordred: i just want to confirm you're not around and i'm not about to step on your toes | 20:56 |
clarkb | corvus: remember the ze's are in the emergency file. Not sure if that changes anything dramatically | 20:57 |
clarkb | I too can help though have a call in a minute | 20:57 |
corvus | iiuc, we want to remove ze from emergency, then allow ansible to write out pending updates, then shut down all executors (using docker on ze1 and init everywhere else), restart scheduler, start all executors using docker | 20:58 |
clarkb | maybe? I'm not sure how the current code which assumes docker containers will interact with the non docker services | 20:58 |
corvus | clarkb: what current code? | 20:58 |
clarkb | corvus: https://review.opendev.org/#/c/733967/ | 20:59 |
clarkb | looks like start.yaml is not run by default | 20:59 |
clarkb | so I think your plan will work | 21:00 |
corvus | yeah, that's what i was thinking; if we're worried, we could probably manually stop, then run ansible, then start; but that'll take a bit longer | 21:00 |
clarkb | in looking at that change I think I convinced myself its fine | 21:01 |
clarkb | it should install the docker things but not switch running services as well as apply the zk updates | 21:01 |
clarkb | then you can stop the non docker services and start them under docker | 21:01 |
corvus | k. i'll start by removing emergency and then running service-zuul playbook on bridge | 21:01 |
corvus | how do i stop the cron? | 21:02 |
corvus | i mean the hourly zuul job | 21:02 |
clarkb | I don't think you can stop it, you can run the stopper script which will stop the next job from running | 21:02 |
clarkb | but once a job has started we don't have a way to abort that | 21:02 |
corvus | i ran the disable-ansible script | 21:03 |
clarkb | looks like zuul isn't running now so you can run the script to pause the next thing | 21:03 |
corvus | infra-prod-install-ansible just started | 21:03 |
corvus | oh i think maybe i did it just in time | 21:03 |
corvus | to conclude from that. | 21:03 |
corvus | err | 21:04 |
corvus | to conclude from that. | 21:04 |
corvus | why can't i copy/paste from zuul's console log? | 21:04 |
corvus | gimme a second while i type it in | 21:04 |
corvus | "TASK [Make sure a manual maint isn't going on]" | 21:04 |
corvus | is where it's currently sitting | 21:04 |
clarkb | cool then ya you probably caught it | 21:05 |
corvus | i just noticed nb03 is in emergency | 21:07 |
corvus | so we'll need to do something about that before we turn off plaintext | 21:07 |
clarkb | ya we have multi arch images now so that should be doable in the near future | 21:08 |
corvus | i'm running service-zuul on ze01; just want to see it noop first | 21:08 |
*** rpittau has quit IRC | 21:12 | |
*** rpittau has joined #opendev | 21:13 | |
corvus | apparently we failed to pull the image: ERROR: for executor read tcp 104.239.136.252:36290->104.18.123.25:443: read: connection reset by peer | 21:15 |
corvus | i will run that again | 21:15 |
fungi | huh | 21:17 |
clarkb | I'm having dns trouble locally | 21:17 |
fungi | that looks like a cloudflare address... i guess fronting dockerhub (or trying to)? | 21:18 |
clarkb | oh hey that explains it | 21:18 |
clarkb | my dns is cloudflare too | 21:19 |
clarkb | I bet they are having an outage | 21:19 |
clarkb | fun fun | 21:19 |
clarkb | discord is apparently broken too | 21:23 |
clarkb | its a friday "internet is broken" day | 21:23 |
*** johnsom_ has joined #opendev | 21:26 | |
mnaser | i was just going to hint here about 1.1.1.1 being down | 21:26 |
mnaser | discord being broken is a by product of 1.1.1.1 down | 21:27 |
mnaser | https://www.cloudflarestatus.com "all systems operational" | 21:28 |
clarkb | my dns resolves again | 21:28 |
corvus | ERROR: for executor error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/6f/6fbba1285c105d80eedaef06c284b770b7d6e30ad3694229178d835c3d2d53d7/data?verify=1595023763-0q3vtPSuEtfFCNyYszpaQBhtqQE%3D: dial tcp: lookup production.cloudflare.docker.com on 127.0.0.1:53: read udp 127.0.0.1:39681->127.0.0.1:53: i/o timeout | 21:29 |
corvus | yeah, that was the second error | 21:29 |
corvus | trying a 3rd time | 21:29 |
*** johnsom_ has quit IRC | 21:29 | |
mnaser | i'm seeing dns restored in some parts of the world but still broken in others | 21:29 |
mnaser | 1.1.1.1 is resolving again | 21:36 |
corvus | attempt #3 failed, trying again | 21:46 |
mnaser | https://www.cloudflarestatus.com | 21:47 |
mnaser | "Identified - The issue has been identified and a fix is being implemented. " | 21:47 |
*** melwitt is now known as jgwentworth | 21:49 | |
*** avass has quit IRC | 21:53 | |
fungi | "the issue has been implemented and a fix is being identified" | 21:54 |
corvus | managed to complete this time | 21:54 |
corvus | i'll run it on all the zes now | 21:55 |
corvus | oh | 21:55 |
corvus | we do 'remove old init script files' | 21:55 |
corvus | but these are executors, we can just run 'zuul-executor stop' to stop it | 21:56 |
corvus | though i don't know if systemd will be left in a confused state | 21:56 |
corvus | Active: failed (Result: exit-code) since Thu 2020-06-25 17:11:53 UTC; 3 weeks 1 days ago | 21:56 |
clarkb | I would try stopping it with systemd first (it compiles configs but not sure if that includes the bash) | 21:56 |
corvus | that's what ze01 says | 21:56 |
clarkb | but then ya it may just go into a failed or error state and that should be fine | 21:56 |
corvus | so i think if we just run zuul-executor stop, that's the worst case scenario | 21:56 |
corvus | clarkb: yeah | 21:57 |
corvus | so i'll proceed with the plan as discussed and run the playbook on all zes now | 21:57 |
clarkb | sounds good | 21:57 |
fungi | stopping the initscript after the service is stopped should be idempotent, so if you're concerned about systemd's state tracking you could just ask systemd to stop it last thing (after it's actually stopped) | 21:58 |
clarkb | fungi: well in this case the script won't be there anymore | 21:59 |
clarkb | but I think systemd will just say "the unit is error or failed" now | 21:59 |
clarkb | then we can disable the unit to prevent it from starting on boot | 22:00 |
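The teardown sequence discussed above can be sketched as a small helper; this is an illustrative sketch only, assuming the unit is named `zuul-executor`, and every step tolerates failure since `systemctl stop` may no longer work once the init script is gone:

```python
import subprocess

def stop_legacy_executor(run=subprocess.run):
    # Sequence sketched from the discussion above; the unit name
    # "zuul-executor" is an assumption.
    cmds = [
        ["systemctl", "stop", "zuul-executor"],          # best effort via systemd
        ["zuul-executor", "stop"],                       # the command that actually stops it
        ["systemctl", "reset-failed", "zuul-executor"],  # clear any failed/error state
        ["systemctl", "disable", "zuul-executor"],       # don't start the old unit on boot
    ]
    for cmd in cmds:
        run(cmd, check=False)
    return cmds
```

Passing the runner in makes the ordering easy to verify without touching a real system; in production you would just call it with the default `subprocess.run`.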
corvus | (ot: bob and doug are planning on returning from space on aug 2) | 22:00 |
clarkb | do you think we should tell them that there is a pandemic and they may want to hang out up there longer? | 22:01 |
fungi | (whaa? take off, ya hoser) | 22:01 |
corvus | clarkb: right? i mean, what's the rush? you're stuck inside either way. | 22:01 |
corvus | down here they only have to wear a mask outside instead of a suit, though the suit might be a good idea | 22:02 |
fungi | oh! that bob and doug, i thought you meant the mckenzie brothers | 22:02 |
corvus | fungi: wow, i think you just wrote an *amazing* sketch. | 22:03 |
fungi | it includes a case of molson | 22:03 |
corvus | floating around the dragon capsule | 22:03 |
corvus | okay, i will try to stop laughing and proceed with maint; the playbook is finished | 22:04 |
clarkb | I bet their view of the comet is better than mine too | 22:05 |
corvus | i'll stop all the executors now (will attempt systemctl stop) | 22:05 |
*** rh-jelabarre has quit IRC | 22:06 | |
corvus | systemd is fussing about the units having changed, so i don't know if it actually did stop it or not | 22:08 |
corvus | i'm just going to run zuul-executor stop now | 22:09 |
clarkb | ok | 22:09 |
corvus | i'm pretty sure systemd didn't do anything | 22:10 |
corvus | looks like it's really stopping now | 22:10 |
clarkb | corvus: are we leaving ze01 alone ? | 22:10 |
corvus | no, i stopped it so it doesn't end up with all the jobs | 22:10 |
clarkb | corvus: and you did that via docker-compose ya? | 22:10 |
corvus | yep | 22:10 |
corvus | i'll leave the scheduler up while the executors shut down | 22:11 |
corvus | (to minimize downtime for the scheduler and maximize the number of mergers online when it starts) | 22:11 |
fungi | makes sense | 22:17 |
corvus | 3 remaining | 22:19 |
corvus | all clear | 22:29 |
corvus | i'll save queues, stop the scheduler, start the executors, then start the scheduler | 22:30 |
fungi | sounds good | 22:30 |
* clarkb still watching if anything comes up | 22:31 | |
corvus | hrm playbooks/start-mergers-executors.yaml does not look like it was updated for docker | 22:31 |
corvus | i will run a locally modified version of that | 22:32 |
*** DSpider has quit IRC | 22:33 | |
corvus | executors are up, scheduler is starting | 22:33 |
corvus | i'm unsure what the scheduler is doing | 22:35 |
corvus | http://paste.openstack.org/show/796062/ | 22:37 |
corvus | could that be a fatal error? | 22:37 |
fungi | oh stopped the geard fork? | 22:38 |
corvus | yeah | 22:39 |
corvus | that TB appears to be from the server | 22:39 |
clarkb | the fork is still up too | 22:39 |
fungi | yeah, i see two zuul-scheduler processes | 22:39 |
corvus | i wonder if it's wedged though? | 22:40 |
fungi | strace says it's reading from fd 4 | 22:40 |
corvus | though that should "just" be from a merger | 22:40 |
corvus | maybe it got stuck on that | 22:40 |
fungi | and yeah, if it weren't wedged i'd expect strace to have said more than that by now | 22:41 |
fungi | the parent is on recvfrom(28, ...silence | 22:41 |
corvus | where's our docs on how to connect to geard? | 22:42 |
corvus | i thought we had something in https://docs.opendev.org/opendev/system-config/latest/zuul.html | 22:42 |
fungi | components i think? | 22:42 |
clarkb | corvus: telnet host 4730 ? | 22:42 |
corvus | i thought we used ssl | 22:42 |
clarkb | oh right | 22:42 |
corvus | https://zuul-ci.org/docs/zuul/howtos/troubleshooting.html | 22:43 |
clarkb | looking at that gear code I expect it should set statsd to None or something useful in Server.__init__ | 22:43 |
clarkb | which makes me wonder if there was an earlier failure configuring geard | 22:43 |
fungi | https://zuul-ci.org/docs/zuul/howtos/troubleshooting.html?highlight=s_client#gearman-jobs | 22:43 |
fungi | openssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/client.pem -key /etc/zuul/ssl/client.key | 22:43 |
corvus | that connection is rejected because the cert can't be verified | 22:44 |
corvus | are those the right cert/key files? | 22:45 |
corvus | nope | 22:45 |
clarkb | ssl_cert=/etc/zuul/ssl/gearman-server.pem ssl_key=/etc/zuul/ssl/gearman-server.key in our zuul.conf | 22:45 |
corvus | okay, final iteration is: openssl s_client -connect localhost:4730 -cert /etc/zuul/ssl/gearman-client.pem -key /etc/zuul/ssl/gearman-client.key | 22:47 |
clarkb | s/server/client/ for client side | 22:47 |
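Once connected with the `openssl s_client` invocation above, geard speaks gearman's plain-text administrative protocol: typing `status` returns one tab-separated line per registered function (name, queued jobs, running jobs, available workers), terminated by a lone `.`. A small parser for that output, written as a sketch against the documented gearman admin format:

```python
def parse_gear_status(text):
    # Parse geard's "status" admin-command output: one tab-separated
    # line per function (name, queued, running, available workers),
    # terminated by a line containing a single ".".
    functions = {}
    for line in text.splitlines():
        if line == ".":
            break
        name, queued, running, workers = line.split("\t")
        functions[name] = {
            "queued": int(queued),
            "running": int(running),
            "workers": int(workers),
        }
    return functions
```

This makes it easy to spot a wedged scheduler: plenty of queued jobs for a function with zero available workers is the usual smoking gun.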
corvus | it is up and running | 22:47 |
corvus | by that i meant geard | 22:48 |
corvus | but also apparently the scheduler is doing something now | 22:48 |
clarkb | the scheduler just started ya | 22:49 |
clarkb | does it have startup tasks like the executor to clean things up? | 22:49 |
corvus | i'm trying to figure out what the holdup was | 22:49 |
clarkb | executors can take a minute to actually begin work while they clear the build dirs up | 22:49 |
corvus | the first interesting message after the slow time was: 2020-07-17 22:47:33,949 INFO zuul.GithubConnection: Starting event connector | 22:50 |
fungi | it's using a trove instance for the db reporter? | 22:50 |
clarkb | fungi: yes I believe so | 22:50 |
clarkb | (it's a huge db) | 22:50 |
corvus | oh, let me see if there was a migration | 22:50 |
clarkb | ohhhh | 22:50 |
fungi | yeah, that's what i was wondering | 22:51 |
fungi | if it was waiting for (however slow) trove db to migrate | 22:51 |
fungi | i wonder if that's something we can better surface in the logs | 22:52 |
corvus | 2020-07-17 22:32:59,666 DEBUG zuul.SQLConnection: Current migration revision: 16c1dc9054d0 | 22:52 |
corvus | zuul/driver/sql/alembic/versions/269691d2220e_add_build_final_column.py:down_revision = '16c1dc9054d0' | 22:52 |
corvus | fungi: i think you win | 22:52 |
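Surfacing the migration in logs, as fungi suggests, could start with something as simple as scanning the scheduler debug log for the SQLConnection revision message quoted above. A hedged sketch (the message text is taken verbatim from the pasted log and may differ across zuul versions):

```python
import re

# Matches the zuul.SQLConnection line quoted above.
MIGRATION_RE = re.compile(r"Current migration revision: (\w+)")

def find_migration_revision(log_lines):
    # Return the database's stamped alembic revision from scheduler
    # debug logs, or None if no such line is present.
    for line in log_lines:
        m = MIGRATION_RE.search(line)
        if m:
            return m.group(1)
    return None
```

Comparing the stamped revision against the head revision shipped in the code would then tell an operator, before restarting, whether a (possibly lengthy) migration is about to run.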
*** tosky_ has joined #opendev | 22:52 | |
corvus | and yeah, we should do that. because i'm pretty sure we did the same thing last time. | 22:52 |
fungi | that was indeed lengthy | 22:52 |
corvus | and this is going to happen a lot. | 22:52 |
corvus | also, we should prune our db. | 22:52 |
clarkb | maybe even a zuul command to check if a migration will happen | 22:53 |
clarkb | so that we can easily approximate downtime | 22:53 |
corvus | i'm re-enqueueing now | 22:53 |
*** tosky has quit IRC | 22:53 | |
fungi | "me: please restart; zuul: i'm sorry dave, i can't do that" | 22:53 |
clarkb | ++ to more logging and pruning and all those ideas though | 22:53 |
corvus | #status log restarted all of zuul using tls zookeeper and executors in containers | 22:54 |
openstackstatus | corvus: finished logging | 22:54 |
corvus | mordred: FYI ^ | 22:54 |
fungi | so that was ~15 minutes spent on the db migration i guess? | 22:54 |
corvus | about. a little less. | 22:57 |
*** mlavalle has quit IRC | 22:59 | |
*** tosky_ is now known as tosky | 23:01 | |
clarkb | that's everything but nb03 running or ready to run zk connections with tls, ya? | 23:04 |
clarkb | next week is a good time to convert nb03 to our new multi arch images I guess | 23:04 |
clarkb | corvus: we landed the python base and builder side of that right? | 23:04 |
clarkb | https://hub.docker.com/r/zuul/nodepool-builder/tags has both arches but https://hub.docker.com/r/opendevorg/python-builder/tags does not | 23:07 |
clarkb | https://review.opendev.org/#/c/726263/7/zuul.d/docker-images/python.yaml has merged. Do we just need to trigger builds? | 23:08 |
clarkb | https://zuul.openstack.org/builds?change=726263 looks like we ran the jobs but not the promote and I think that is because we only ran the jobs due to them changing | 23:10 |
corvus | clarkb: yep, i think we need a nodepool build | 23:15 |
corvus | enqueue is finished | 23:16 |
clarkb | corvus: well a nodepool build won't rebuild python-builder and python-base | 23:16 |
corvus | oh er i thought the change we merged would have done that | 23:16 |
clarkb | I was looking for a useful change to one of those (the jobs will run for both if either changes) but I can't come up with anything so will push a noop change instead | 23:16 |
clarkb | corvus: according to https://zuul.openstack.org/builds?change=726263 it did not. I think because only the jobs changed | 23:16 |
clarkb | and the post merge jobs don't have the context of "did my own job update" to know they should run unconditionally? | 23:17 |
corvus | clarkb: makes sense. :/ | 23:17 |
corvus | i will approve your noop change :) | 23:17 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Noop update to force python-builder/base to rebuild https://review.opendev.org/741789 | 23:17 |
clarkb | I think ^ should do it | 23:18 |
clarkb | then we can land a nodepool change then we can deploy nb03 | 23:18 |
clarkb | exciting | 23:18 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Update zuul-executor stop/start playbook https://review.opendev.org/741790 | 23:18 |
*** tosky has quit IRC | 23:38 | |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!