opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828630 | 00:01 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828630 | 00:02 |
corvus | 2 executors stopped. ah ah ah. | 00:16 |
fungi | narration by the count never gets old | 00:18 |
corvus | one left | 00:44 |
corvus | hrm, i can't tell what it's waiting on | 00:52 |
corvus | i don't see any build related subprocesses. i do see a bunch of stale looking 'git cat-file' jobs | 00:52 |
corvus | i may send it a sigusr2 | 00:53 |
corvus | ah, it's a paused build | 00:54 |
corvus | tripleo-ci-centos-8-content-provider head of gate | 00:56 |
clarkb | ya that has confused me before but Zuul does the correct thing | 01:07 |
corvus | it's resumed; apparently the ooo quickstart collect logs is not fast | 01:10 |
corvus | done; on to batch 2 now | 01:14 |
corvus | the first batch looks like it's running jobs okay. i'm going to afk now | 01:15 |
ianw | hrm, https://review.opendev.org/c/zuul/zuul-jobs/+/828630 reported NODE_FAILURE when i switched the node types to centos-8 anyway. it feels like that should have run on centos-8-stream nodes. i wonder what i'm missing... | 01:27 |
clarkb | ianw: label: centos-8 is not stream | 01:34 |
clarkb | and that change seems to set label to centos-8 | 01:34 |
clarkb | maybe we can just land that invalid config and it will report NODE_FAILURE? I thought zuul would validate more than that but seems not to | 01:35 |
ianw | clarkb: yeah, but i thought that centos-8 label now actually selected centos-8-stream nodes | 01:36 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Replace kpartx with qemu-nbd in extract-image https://review.opendev.org/c/openstack/diskimage-builder/+/828617 | 02:25 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828630 | 02:28 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups https://review.opendev.org/c/openstack/diskimage-builder/+/790878 | 03:09 |
*** ysandeep|out is now known as ysandeep | 03:15 | |
Clark[m] | ianw: not the label. The centos-8 nodeset | 03:23 |
ianw | Clark[m]: yeah, i had it wrong, it was not using the nodeset defined in base-jobs | 03:24 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828630 | 03:54 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups https://review.opendev.org/c/openstack/diskimage-builder/+/790878 | 04:09 |
*** ysandeep is now known as ysandeep|afk | 05:11 | |
*** ysandeep|afk is now known as ysandeep | 05:57 | |
*** amoralej|off is now known as amoralej | 07:21 | |
sshnaidm | clarkb, ianw, corvus if you have merge rights, please merge: https://review.opendev.org/c/openstack/project-config/+/828371 | 08:17 |
*** ysandeep is now known as ysandeep|lunch | 08:31 | |
*** jpena|off is now known as jpena | 08:38 | |
*** ysandeep|lunch is now known as ysandeep | 10:06 | |
*** bhagyashris__ is now known as bhagyashris | 11:26 | |
*** dviroel|out is now known as dviroel|ruck | 11:30 | |
*** rlandy|out is now known as rlandy|ruck | 11:38 | |
*** ykarel is now known as ykarel|away | 12:23 | |
frickler | kevinz_: Certificate for us.linaro.cloud is at 21 days, could you have a look please? | 12:42 |
kevinz_ | frickler: Sure, I will re-gen it. | 12:43 |
*** amoralej is now known as amoralej|lunch | 13:09 | |
*** dviroel is now known as dviroel|ruck | 13:13 | |
*** artom__ is now known as artom | 13:14 | |
*** ysandeep is now known as ysandeep|afk | 13:52 | |
*** pojadhav is now known as pojadhav|afk | 13:53 | |
*** amoralej|lunch is now known as amoralej | 13:58 | |
*** ysandeep|afk is now known as ysandeep | 14:55 | |
dtantsur | hi folks! I remember I was talking to some of you about having a real partition image for cirros. Has there been any movement around it? | 15:08 |
dtantsur | I'm asking because we're about to start building our own centos images with DIB in the CI, but I'd rather not | 15:08 |
fungi | dtantsur: frickler has been working on a cirros fork at https://opendev.org/cirros/cirros so maybe he has some ideas | 15:15 |
dtantsur | oh, I also wanted to ask about the reason of creating a fork. is upstream development stagnating? | 15:16 |
fungi | he can speak better to the reasons, but my understanding is that he wanted to set up some zuul jobs, possibly do integration testing with devstack, and discuss with smoser about relocating development to here and/or adopting the project | 15:17 |
dtantsur | k understood | 15:17 |
fungi | there have been ml threads in the past about cirros going stale upstream for long periods and the possibility of the openstack community picking up maintenance of it, but i don't know if those prior discussions had any bearing on the present situation | 15:18 |
frickler | so currently this isn't a fork, but an attempt to get a working CI again for the original project | 15:19 |
frickler | regarding the "real partition image", I did some testing, and the main issue seems to be getting grub installed into the image, which requires changing library options and in the end makes the result twice as large as the original | 15:21 |
frickler | so I don't think that this is feasible as a default solution, but possibly as an optional, different flavor of cirros | 15:21 |
frickler | I also don't have too much time for this myself, so expect some results in a couple of months, not within days or weeks | 15:24 |
dtantsur | frickler: is it something I could pick up or is there too much context to transfer? | 15:24 |
frickler | dtantsur: well help is always welcome, if you want to look into setting up the build to generate what you need, best join us in #cirros (currently still on libera), so smoser and myself can work together answering your questions | 15:28 |
frickler | you could also look at https://review.opendev.org/c/cirros/cirros/+/827916 and help me find out how to collect and store build results in a useful way without exploding our log storage | 15:30 |
*** ysandeep is now known as ysandeep|out | 16:03 | |
*** priteau_ is now known as priteau | 16:11 | |
dtantsur | frickler: you mean, store the actual generated images? I wish I knew, we could use that in Ironic... | 16:24 |
*** tkajinam is now known as Guest210 | 16:30 | |
corvus | the zuul executor restarts from yesterday are done; i'm going to restart zuul01 now | 16:31 |
corvus | 2022-02-10 16:33:05,483 INFO zuul.ComponentRegistry: System minimum data model version 1; this component 3 | 16:34 |
corvus | 2022-02-10 16:33:05,484 INFO zuul.ComponentRegistry: The data model version of this component is newer than the rest of the system; this component will operate in compatability mode until the system is upgraded | 16:34 |
corvus | that's as expected. as soon as i shut down the zuul02 components, that should bump up. | 16:34 |
clarkb | and then 02 will start on the new version when it see everyone else is at the new rev too | 16:34 |
corvus | yep | 16:34 |
corvus | i was just thinking, a zuul CD job to upgrade zuul would be a little tricky... gracefully restarting an executor can take longer than the max job runtime due to paused jobs... | 16:38 |
clarkb | corvus: crazy talk but I wonder if zuul could fork itself on the new code and just keep running with the old state | 16:39 |
clarkb | basically replace itself in place and not need a synchronization at all | 16:40 |
corvus | clarkb: kinda awesome.. but tricky with our container deployment model... :/ | 16:41 |
corvus | zuul01 is up, restarting 02 now | 16:44 |
clarkb | In theory it would work pretty well to do that if you got the mechanics down since we're already storing the bulk of the state in zk | 16:44 |
clarkb | the danger would be if needing to migrate internal datastructures but they could be forced to refetch from zk maybe | 16:45 |
corvus | 2022-02-10 16:45:19,445 INFO zuul.ComponentRegistry: System minimum data model version 3; this component 3 | 16:45 |
corvus | 2022-02-10 16:45:19,445 INFO zuul.ComponentRegistry: The rest of the system has been upgraded to the data model version of this component | 16:45 |
clarkb | nice | 16:46 |
clarkb | corvus: if you get a chance this morning can you look at https://review.opendev.org/c/opendev/system-config/+/828203/ since you pointed out the slurp module which I used there | 16:47 |
*** priteau is now known as priteau_ | 16:48 | |
*** priteau_ is now known as priteau | 16:48 | |
corvus | clarkb: lgtm | 16:49 |
clarkb | thanks I went ahead and approved it (and responded to your question) | 16:49 |
corvus | the big zuul changes necessitating the model upgrade are related to semaphores and changes in gate superceding check; so please keep an eye out for any unexpected behavior there | 16:50 |
clarkb | I'm going to followup on that gerrit bug I filed about the cloning weirdness once my brain has fully booted. | 16:52 |
clarkb | I suspect that we can go ahead and close the bug out | 16:52 |
fungi | clarkb: migrating file descriptors and socket handles gets tricky when you're replacing processes live, but it's doable | 16:52 |
fungi | closing and reopening everything is probably simpler | 16:53 |
fungi | forks do mostly inherit them though | 16:53 |
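The fd inheritance fungi mentions is the OS behavior that would make an in-place replacement scheme workable. A minimal Python sketch of just that mechanism (this is illustrative only, not Zuul code):

```python
import os

# File descriptors opened before fork() are shared with the child, so a
# replacement process can keep serving existing pipes/sockets. exec()
# additionally preserves any fd whose close-on-exec flag is cleared
# (see os.set_inheritable).
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # child: the read end inherited from the parent is still usable
    os._exit(0 if os.read(r, 1024) == b"state" else 1)
os.write(w, b"state")
_, status = os.waitpid(pid, 0)
exit_code = os.waitstatus_to_exitcode(status)
print(exit_code)  # → 0
```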
gibi | is it just me or the zuul web ui is down? https://zuul.opendev.org/ | 16:53 |
fungi | gibi: it's being restarted | 16:53 |
gibi | ahh, OK | 16:53 |
corvus | and it's up | 16:54 |
fungi | gibi: zuul itself is able to do hitless rolling restarts now, but we only have one zuul-web service at the moment so it goes offline for a bit | 16:54 |
corvus | #status log rolling restarted all of zuul on ad1351c225c8516a0281d5b7da173a75a60bf10d | 16:54 |
clarkb | there are some TODOs to get a load balancer in front of multiple webs | 16:54 |
opendevstatus | corvus: finished logging | 16:54 |
gibi | fungi: nice improvement | 16:54 |
corvus | what was the decision about LB -- make a new one or reuse the gitea one? | 16:56 |
clarkb | corvus: I think my slight preference is to make a new one since small nodes seem to work well for haproxy and this way we can continue to operate zuul and gitea independently | 16:57 |
fungi | is the gitea one in the same region as the zuul servers anyway? | 17:01 |
clarkb | fungi: it is not | 17:01 |
fungi | better to have the lb as topologically close to the servers as possible if it's doing socket forwarding | 17:01 |
fungi | from a performance and stability standpoint | 17:01 |
corvus | can haproxy handle websockets? | 17:04 |
fungi | the short answer is "yes" because haproxy has a variety of different load balancing solutions | 17:06 |
fungi | and client persistence algorithms | 17:07 |
fungi | the longer answer depends on how exactly you want websockets "handled" | 17:07 |
corvus | oh and we use tcp right? | 17:07 |
fungi | for gitea we do, yes | 17:07 |
clarkb | corvus: we have historically used tcp | 17:07 |
clarkb | I think the reason for that is it simplified tls | 17:08 |
corvus | so that should work fine for this, modulo maybe needing to set max connections or something | 17:08 |
clarkb | basically instead of needing certs for every point in between you just do it in the service and pass straight through | 17:08 |
*** rlandy|ruck is now known as rlandy|ruck|mtg | 17:09 | |
fungi | yeah, layer-4 proxying with client ip persistence can work for just about anything modulo cgn clients | 17:09 |
fungi | if you want to do things like layer-7 forwarding with ssl/tls termination on the load balancer, or more granular client persistence to specific backends based on session ids or injecting cookies, that's where it starts to depend a lot on the application itself | 17:10 |
fungi | is it important that multiple requests from the same client are persisted to the same backend in this case? like for authenticated sessions? | 17:11 |
corvus | nope :) | 17:12 |
corvus | there is no server-side session state | 17:12 |
fungi | in that case we can probably ignore client persistence entirely | 17:13 |
fungi | which should get us a much more even load distribution | 17:13 |
fungi | it's a bigger problem for gitea, where a git operation can involve multiple requests over different connections, and there's no guarantee that the state of the repositories between backends is completely consistent (repacks, replication races, et cetera) | 17:15 |
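For reference, the layer-4 arrangement discussed above might look something like the following haproxy fragment. This is a sketch only; the server names, ports, and roundrobin choice are assumptions for illustration, not the actual deployed config:

```
frontend zuul-web-https
    bind :::443 v4v6
    mode tcp
    default_backend zuul-web

backend zuul-web
    mode tcp
    balance roundrobin   # no client persistence, per the discussion above
    server zuul01 zuul01.opendev.org:443 check
    server zuul02 zuul02.opendev.org:443 check
```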
*** dviroel|ruck is now known as dviroel|ruck|afk | 17:16 | |
clarkb | ok I updated https://bugs.chromium.org/p/gerrit/issues/detail?id=15649 with what we learned | 17:22 |
corvus | we.. have a 5 node job limit? | 17:23 |
clarkb | ya I seem to recall someone went a bit overboard and we had to set that. But maybe I'm misremembering | 17:24 |
corvus | i would suggest that we increase that for the opendev tenant, but that wouldn't help us. | 17:24 |
corvus | since the opendev tenant isn't where we run the opendev service jobs | 17:24 |
clarkb | corvus: what we can do is use groups rather than hosts and have some things colocated. I've thought about doing that in the past but it seemed non urgent | 17:25 |
fungi | i have no problem with raising it if we have jobs that need that many, it was simply a useful starting point | 17:25 |
clarkb | and ya I think we could bump it | 17:25 |
clarkb | I guess the way we do zuul services doesn't really allow for colocating though (since everything is bind mounted from the same path regardless of service) | 17:29 |
opendevreview | James E. Blair proposed opendev/system-config master: Add Zuul load balancer https://review.opendev.org/c/opendev/system-config/+/828773 | 17:31 |
corvus | presumably zuul will refuse to run that until we figure out how to run 6 nodes | 17:32 |
*** jpena is now known as jpena|off | 17:32 | |
sshnaidm | clarkb, hi, can you please merge the perms patch in your time https://review.opendev.org/c/openstack/project-config/+/828371 | 17:34 |
clarkb | fungi: sshnaidm: we clarified that deleting branches is lossy right? | 17:35 |
clarkb | sshnaidm: if you delete a branch and don't have another permanent ref pointing to that commit our regular garbage collection will delete data | 17:35 |
sshnaidm | clarkb, yeah, I'm aware of that | 17:36 |
sshnaidm | it was created by mistake, so I just don't want it to be there to confuse users with right branches.. | 17:37 |
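The lossiness clarkb describes is easy to demonstrate in a throwaway repository (a hypothetical example, unrelated to the actual repo in question):

```shell
# Commits reachable only from a deleted branch become unreachable and are
# removed by garbage collection (forced immediately here with --prune=now).
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m base
git checkout -q -b scratch
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m only-on-scratch
sha=$(git rev-parse HEAD)
git checkout -q -
git branch -D scratch >/dev/null
git reflog expire --expire=now --all
git gc --prune=now --quiet
git cat-file -e "$sha" 2>/dev/null || echo "commit is gone"
```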
clarkb | fungi: I think your ethercalc copy may be trying to back up and failing based on emails we are getting. Can you double check that and maybe comment out the backup crons on your copy? | 17:40 |
fungi | clarkb: i've done one better and deleted the server | 17:42 |
fungi | just cleaning up the snapshot i built it on now | 17:42 |
clarkb | thanks | 17:42 |
clarkb | corvus: I think we can either bump the limit or combine the load balancer and zk or merger or similar. | 17:48 |
opendevreview | Merged openstack/project-config master: Give perm to release team to delete branches https://review.opendev.org/c/openstack/project-config/+/828371 | 17:52 |
opendevreview | Merged opendev/system-config master: Test pushes into gitea over ssh https://review.opendev.org/c/opendev/system-config/+/828203 | 17:52 |
clarkb | I was hoping ^ that stack would cause semaphores to be exercised but the second change is test only so we don't run the prod deploy after it | 17:55 |
clarkb | oh well | 17:55 |
corvus | clarkbfungi : oh apparently max is 10 nodes | 17:55 |
fungi | oh neat | 17:58 |
clarkb | I'm trying to write a change to fix ls-members --recursive now | 18:10 |
*** amoralej is now known as amoralej|off | 18:21 | |
clarkb | https://gerrit-review.googlesource.com/c/gerrit/+/330179 that may do it | 18:36 |
fungi | so it was working for the rest api, but they pulled the rug out from under the cli? | 18:37 |
clarkb | fungi: yup I think at some point they combined the rest api and ssh commands but then in cases like this missed that the recursive flag was private and needed to be explicitly handled | 18:38 |
clarkb | Next I need to work on a depends on to check this output | 18:38 |
clarkb | it is easier for me to do that than figure out their test framework :/ | 18:38 |
fungi | their release notes handling is kinda neat, i didn't realize they embed that in commit message footers | 18:39 |
clarkb | fungi: that is brand new as of last night | 18:39 |
fungi | i wonder if they've considered how to go about correcting/updating a release note after the commit merges ;) | 18:39 |
* corvus uploaded an image: (36KiB) < https://matrix.org/_matrix/media/r0/download/acmegating.com/vwNHycEJaXWOcgjpDcZbNpRM/image.png > | 18:43 | |
corvus | my guess is they write a new note. | 18:43 |
fungi | hah | 18:44 |
fungi | a moose once bit my sister | 18:44 |
fungi | today i learned about mysql's string replace function. wish i had known about it during the login.launchpad.net to login.ubuntu.com migration | 18:53 |
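The function fungi means is MySQL's REPLACE(str, from_str, to_str). SQLite implements the same scalar function, so a small Python sketch can stand in for that kind of bulk rewrite; the table and column names here are made up for illustration:

```python
import sqlite3

# REPLACE(column, 'old', 'new') rewrites substrings across a whole table
# in one UPDATE -- the sort of bulk edit a login.launchpad.net ->
# login.ubuntu.com migration would need.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (openid TEXT)")
db.execute(
    "INSERT INTO account VALUES ('https://login.launchpad.net/+id/x1')"
)
db.execute(
    "UPDATE account SET openid = "
    "REPLACE(openid, 'login.launchpad.net', 'login.ubuntu.com')"
)
print(db.execute("SELECT openid FROM account").fetchone()[0])
# → https://login.ubuntu.com/+id/x1
```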
opendevreview | Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members https://review.opendev.org/c/opendev/system-config/+/828786 | 18:55 |
clarkb | and ^ should hopefully test this for us | 18:55 |
*** rlandy|ruck|mtg is now known as rlandy|ruck | 18:57 | |
clarkb | on the surface it is a silly little bug but wow did it create a lot of confusion for us debugging the performance issues | 18:58 |
clarkb | the other really neat thing about that depends on setup is if we push the fix to stable-3.4 and not also to stable-3.5 then we'll get a test that checks for failure and success in the same buildset :) | 19:00 |
clarkb | granted on different releases of gerrit but for this situation that should be fine | 19:01 |
*** dviroel|ruck|afk is now known as dviroel|ruck | 19:19 | |
opendevreview | Ade Lee proposed zuul/zuul-jobs master: WIP/DNM - Test new version of mariadb https://review.opendev.org/c/zuul/zuul-jobs/+/827366 | 19:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members https://review.opendev.org/c/opendev/system-config/+/828786 | 19:46 |
clarkb | apparently admin is already a member of groups it creates? I guess implicitly as owner maybe? To be extra sure that we're getting recursive listings I went ahead and updated that to make another user that is distinct | 19:47 |
fungi | makes sense, as gerrit sets its groups to be self-owned by default | 20:52 |
corvus | clarkb, fungi: https://review.opendev.org/828773 is, um, possibly a hole-in-one? | 20:55 |
clarkb | corvus: nice I'll review it shortly. Just sitting back down after some lunch | 20:55 |
opendevreview | Ian Wienand proposed openstack/project-config master: Remove Fedora 34 https://review.opendev.org/c/openstack/project-config/+/816933 | 20:56 |
fungi | corvus: and on a par 4 hole at least | 20:57 |
clarkb | woot my gerrit test seems to work. I'll update it now to ensure that we aren't always recursive etc but I think the change I wrote is working | 20:58 |
clarkb | it is really cool that we can do this sort of thing with zuul | 20:58 |
opendevreview | Merged openstack/diskimage-builder master: Futher bootloader cleanups https://review.opendev.org/c/openstack/diskimage-builder/+/790878 | 21:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members https://review.opendev.org/c/opendev/system-config/+/828786 | 21:03 |
ianw | corvus: couple of minor comments, pleasing how easily the roles and testing and just generally everything make it look easy | 21:08 |
clarkb | I left a few too | 21:10 |
ianw | fungi / clarkb: the centos-8 failure patch I believe is fully tested now -> https://review.opendev.org/c/opendev/base-jobs/+/828437 | 21:13 |
ianw | if we're ok we can go with that, and i can send another follow-up email about the current status | 21:14 |
fungi | ianw: sounds good to me, thanks. i also brought it up in the tc meeting today so at least the openstack leadership is aware of the current situation and impending behavior change | 21:15 |
clarkb | approved. I guess keep an eye out for unexpected fallout, but thank you for testing it so that we avoid all that I hope :) | 21:17 |
ianw | i think we're covered for migration now. if you're explicitly using centos-8 then you'll get NODE_FAILURE. if you're using the base-jobs definition then you'll get RETRY_FAILURE with a clear failure message not to use it | 21:21 |
fungi | it's too bad there's no clear signal to explicitly trigger a failure result directly from pre-run, but at least it'll triple-fail quickly | 21:22 |
clarkb | looks like OSA has managed to clean things up for the most part. https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/827483 should be the last one and I rechecked it | 21:23 |
clarkb | jrosser: fyi I think that didn't auto enter the gate with its parent because they must not share a gate queue | 21:24 |
clarkb | y'all may want to ensure all the osa repos share a queue so that they are co-gated and your integration testing makes use of speculative states properly | 21:24 |
opendevreview | Ade Lee proposed zuul/zuul-jobs master: WIP/DNM - Test new version of mariadb https://review.opendev.org/c/zuul/zuul-jobs/+/827366 | 21:25 |
corvus | ianw: clarkb how sure are you that we don't need the letsencrypt job to run before the lb job? | 21:26 |
opendevreview | Merged opendev/base-jobs master: base: fail centos-8 if pointing to centos-8-stream image type https://review.opendev.org/c/opendev/base-jobs/+/828437 | 21:26 |
clarkb | corvus: like 80%. I think the risk is that we'll end up having a proxy up that connects to backends that don't have valid https certs yet. It might be better to have users get no tcp connection at all | 21:27 |
clarkb | its also not a major problem to have that dep there if we want to be safe | 21:27 |
fungi | we could also mitigate that by changing the health check to be more than just a tcp socket prove | 21:27 |
fungi | probe | 21:27 |
opendevreview | James E. Blair proposed opendev/system-config master: Add Zuul load balancer https://review.opendev.org/c/opendev/system-config/+/828773 | 21:27 |
opendevreview | James E. Blair proposed opendev/system-config master: Clean up some gitea-lb zuul config https://review.opendev.org/c/opendev/system-config/+/828793 | 21:27 |
clarkb | fungi: I think if you are doing tcp lb you cannot do the richer health checks | 21:28 |
clarkb | as haproxy doesn't load the necessary bits into its state tables | 21:28 |
corvus | okay but if you want to add it back we're totally adding that to my handicap. strictly speaking, the -focal/-bionic comment is the only error i made :) | 21:28 |
fungi | mulligan's fair there | 21:29 |
corvus | i believe i made sure that zuul-web doesn't answer on http until it's fully initialized, so should be compatible with a tcp healthcheck | 21:30 |
corvus | (as we've seen from the rolling-restart outages) | 21:30 |
fungi | corvus: we put it behind apache which does answer though? | 21:31 |
fungi | also looks like you're introducing a config error on 828793 | 21:31 |
corvus | hrm, that could be a problem then. | 21:31 |
corvus | do we just gloss over that discrepancy with gitea? | 21:32 |
corvus | "if apache is up, gitea is probably up" | 21:33 |
clarkb | corvus: ya I think so | 21:33 |
fungi | it's a good question. and yes i think it's probably something we should try to solve | 21:33 |
clarkb | when manually doing gitea work I always try to tell the load balancer about it first | 21:33 |
clarkb | fungi: ++ | 21:33 |
fungi | we put the haproxy config in first. later we added apache in between haproxy and gitea but didn't consider what that might do to our health checks | 21:34 |
fungi | i think that was a regression we simply haven't noticed | 21:34 |
clarkb | one solution is to have the health check check the direct port | 21:34 |
clarkb | rather than the apache ssl terminator | 21:35 |
clarkb | I think that is possible as it has the bits to do the tcp check in place. Its just a matter of telling it to use the other port? | 21:35 |
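One possible shape for that, hedged: haproxy configures health checks separately from the forwarding itself, so even a `mode tcp` backend can probe a different port than the one it proxies to, and can send an HTTP-level check on that port. The server name, ports, and check request below are illustrative assumptions, not the deployed config:

```
backend balance_gitea
    mode tcp
    # probe gitea's direct HTTP port, bypassing the apache TLS terminator
    option httpchk GET /
    server gitea01 gitea01.opendev.org:3081 check port 3000
```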
clarkb | looking at merger queue graphs and wow can you see the periodic jobs loading in | 21:39 |
clarkb | tomorrow's periodic run will be interesting since we should have the full complement of mergers for it | 21:42 |
clarkb | today's set took about the same amount of time as yesterday's but with 60% of the mergers | 21:42 |
*** dviroel|ruck is now known as dviroel|out | 21:49 | |
opendevreview | Merged openstack/project-config master: Remove Fedora 34 https://review.opendev.org/c/openstack/project-config/+/816933 | 21:51 |
clarkb | infra-root any opinions on the best way to start shutting down subunit2sql workers and openstack health api? The health api hasn't worked in months and I've discussed with the qa team and we're basically going to turn it off. I was thinking I should shutdown apache (running the wsgi service) and the gearman workers for subunit2sql but it looks like puppet will restart apache. Should | 21:54 |
clarkb | I put the hosts in the emergency file or go ahead and start removing the puppet for them then shutdown the services. Then delete stuff in a bit? | 21:54 |
ianw | probably makes sense to emergency them and shutdown | 22:01 |
clarkb | ianw: oh ya maybe that is the easiest thing | 22:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/82be22c4e5b646438f0038851f3042f1/log/job-output.txt#30230-30264 ok I think that shows my upstream fix is working. I left a comment on the upstream change pointing to that | 22:11 |
clarkb | ianw: when you do that do you `shutdown -h now` or do it via the nova api? | 22:12 |
fungi | yeah, that makes sense to me. i do `sudo poweroff` | 22:25 |
clarkb | cool I'll get started on that shortly. Will be the two subunit2sql workers and the health.openstack.org server | 22:25 |
clarkb | All the data is in the trove db though so I'm not too concerned about deleting these other than not being sure we'll be able to rebuild them if somehow necessary | 22:26 |
clarkb | But I figure if we go slow it gives people a chance to scream :) | 22:26 |
fungi | we can make images of them before deleting, i've done that with most of the other services i've shut down | 22:30 |
clarkb | ok health01, subunit-worker01, and subunit-worker02.openstack.org are all in the emergency file now | 22:35 |
clarkb | next up server shutdowns | 22:35 |
clarkb | and done. They can be booted back up again if necessary but I don't expect much trouble since no one noticed the services were not working for a long time anyway | 22:37 |
clarkb | gmann: ^ fyi I shutdown the servers as a first step in cleaning things up. If nothing comes up in the next ~week I'll snapshot the servers and delete them | 22:38 |
clarkb | status.openstack.org is another related server but it hosts e-r things so I want to make sure that the whole ELK thing is settled before I clean it up | 22:39 |
clarkb | but once that is done I think status can go away too | 22:39 |
gmann | clarkb: ack. | 22:43 |
ianw | i wonder if we should move that to static and make status.openstack.org/zuul redirect | 22:46 |
ianw | i did have that in bookmarks for years, just thanks to inertia | 22:47 |
clarkb | ianw: I think the zuul redirect would be just about the only thing that is still valid there when we are done | 22:47 |
clarkb | reviewday hasn't been working, it doesn't look like (and I'm not sure anyone has used it recently); health is broken and going away. e-r + ELK is moving. | 22:47 |
opendevreview | Merged opendev/system-config master: Add Zuul load balancer https://review.opendev.org/c/opendev/system-config/+/828773 | 23:20 |
ianw | ianw@bridge:/var/log/ansible$ ls -l *.2020-* | wc -l | 23:37 |
ianw | 3175 | 23:37 |
ianw | does anyone have a problem if i remove these? | 23:38 |
ianw | it seems like in 2020 we had a period where we kept all the logs for a while | 23:38 |
ianw | this was inspired by trying to figure out why this infra-prod-service-nodepool run failed https://zuul.opendev.org/t/openstack/build/63b692ad36d94bf1b2f574bbee98e8cb/console | 23:39 |
clarkb | ianw: I think I started cleaning those up at one point and then frickler determined something else was hogging all the disk | 23:39 |
clarkb | But I'm not opposed to removing the 2020 log files | 23:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up some gitea-lb zuul config https://review.opendev.org/c/opendev/system-config/+/828793 | 23:39 |
ianw | hrm, i guess actually if i expand it, we're just keeping everything | 23:41 |
ianw | i thought we only kept the last few runs, but must be wrong | 23:41 |
ianw | it would be good to have a more direct way to sync zuul build -> file on disk | 23:42 |
ianw | ok, for my own reference, "Rename playbook log on bridge" is i guess that | 23:43 |
ianw | and then it seems to run "find /var/log/ansible -name 'service-nodepool.yaml.log.*' -type f -mtime 30 -delete" | 23:43 |
clarkb | ya it should be cleaning up. I think what hapepns is if the filename changes we orphan some things | 23:44 |
opendevreview | Ian Wienand proposed opendev/system-config master: bridge production: fix mtime matching https://review.opendev.org/c/opendev/system-config/+/828808 | 23:47 |
ianw | clarkb: ^ that will probably help | 23:48 |
*** prometheanfire is now known as Guest2 | 23:49 | |
*** osmanlicilegi is now known as Guest0 | 23:49 | |
clarkb | ah we can leak then if we don't run often enough to get the exact match | 23:51 |
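The pitfall 828808 fixes: find's numeric tests match exactly unless prefixed, so `-mtime 30` only matches files whose age (in 24-hour units) is exactly 30 days, while `-mtime +30` matches anything older. A quick demonstration (GNU touch/find assumed):

```shell
d=$(mktemp -d)
# a log file last touched 45 days ago
touch -d "45 days ago" "$d/service-nodepool.yaml.log.old"
find "$d" -type f -mtime 30 -print    # matches nothing: 45 != 30
find "$d" -type f -mtime +30 -print   # prints the stale log
```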
ianw | crazy idea; keep a list of "log reader" gpg keys and encrypt each log file from the bridge runs with multiple --recipient keys. then have a build artifact like our download-all-logs which is a command you can paste to just cat out the logs from a production run | 23:51 |
fungi | not a bad idea | 23:51 |
ianw | presumably infra-root, but if someone were interested in a particular service they could add themselves | 23:52 |
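A sketch of the multi-recipient idea using throwaway keys in a scratch keyring; the recipient addresses and filenames are placeholders, and real "log reader" key ids would differ. Any one listed recipient can decrypt the result:

```shell
# build a temporary keyring with two passphrase-less reader keys
export GNUPGHOME=$(mktemp -d) && chmod 700 "$GNUPGHOME"
for reader in root1@example.org root2@example.org; do
    gpg -q --batch --pinentry-mode loopback --passphrase '' \
        --quick-gen-key "$reader" default default never
done
echo "ansible run output" > run.log
# one ciphertext, decryptable by either recipient's private key
gpg -q --batch --trust-model always --encrypt \
    --recipient root1@example.org --recipient root2@example.org \
    --output run.log.gpg run.log
gpg -q --batch --decrypt run.log.gpg
```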
fungi | clarkb: -mtime +30 would solve that | 23:53 |
clarkb | fungi: ya that is ianw's fix | 23:54 |
fungi | oh, missed it | 23:54 |
fungi | thanks, reviewing | 23:54 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!