| priteau | Hello. I am seeing lots of CI errors while accessing mirror.dfw3.raxflex.opendev.org. It is refusing all connections. | 08:28 |
|---|---|---|
| tonyb | priteau: I'll look we did have issues with that server but we thought they were fixed | 08:29 |
| priteau | Thanks. For example here: https://9b62756387dc03e179ba-9f2ef8380661efedee2428ccb67a9720.ssl.cf2.rackcdn.com/openstack/ec48ad54ccda4860928611e9d04a946c/primary/ansible/install-pre-upgrade | 08:30 |
| tonyb | So that's about an hour ago right? | 08:31 |
| tonyb | priteau: Should be fixed | 08:35 |
| tonyb | #status log apache wasn't running on mirror.dfw3.raxflex.opendev.org with nothing obvious in the logs. Restarted. | 08:39 |
| opendevstatus | tonyb: finished logging | 08:39 |
| priteau | Thanks! | 08:39 |
| opendevreview | Merged openstack/project-config master: Grafana: add core graphs to rackspace-flex https://review.opendev.org/c/openstack/project-config/+/966981 | 10:59 |
| *** rlandy_ is now known as rlandy | 12:01 | |
| fungi | apache dying on mirror.dfw3.raxflex definitely seems odd. that's also the server i just brought back online yesterday after a multi-week outage | 14:30 |
| fungi | yeah, very odd. i booted mirror.dfw3.raxflex around 21:35 utc. i don't see any reference in journald for apache starting until tonyb presumably addressed it at 08:34:26 utc | 14:37 |
| fungi | when i tested i was so focused on making sure afs was working i'm not sure i confirmed apache was actually serving things, so maybe apache didn't actually start at boot for some reason? | 14:38 |
| fungi | maybe /var/cache/apache2 wasn't mounted when systemd tried to start apache? though i find no evidence of that in syslog either | 14:40 |
| clarkb | does journalctl -u apache2 indicate anything useful? | 15:48 |
| fungi | it did not sadly, no | 15:56 |
| fungi | that was the first thing i checked | 15:56 |
| fungi | shows it stopping when we tried to reboot back on the 3rd, and shows when tonyb started it manually, which is what's leading me to suspect it never started at boot | 15:57 |
| clarkb | for some reason I thought I checked ping and apache but maybe it was ping and ssh | 15:58 |
| clarkb | apache never starting on boot would certainly explain what you discovered so is probably the most likely case | 15:58 |
| fungi | and yeah i merely cat'ed files from the afs ro replica path in a shell there | 15:58 |
| fungi | it didn't dawn on me that apache might not start | 15:59 |
| fungi | but it certainly will from now on | 15:59 |
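For reference, a minimal sketch of the kind of post-boot check being discussed here, assuming a systemd host; the unit name and mountpoint are the ones mentioned above, everything else is illustrative:

```python
import os
import subprocess

def unit_property(unit: str, prop: str) -> str:
    """Read a single systemd unit property, e.g. ActiveState or UnitFileState."""
    return subprocess.run(
        ["systemctl", "show", "-p", prop, "--value", unit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# Did apache actually come up, is it enabled for boot, and is its cache mounted?
print("apache2 ActiveState:  ", unit_property("apache2", "ActiveState"))
print("apache2 UnitFileState:", unit_property("apache2", "UnitFileState"))
print("/var/cache/apache2 mounted:", os.path.ismount("/var/cache/apache2"))
```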
| clarkb | I'm going to catch up on my morning but then expect to approve the two gerrit update changes before too long. Chime in if there is reason to delay | 16:03 |
| fungi | i'll be steppout out momentarily but expect i'll be back before those deploy regardless | 16:03 |
| fungi | er, stepping out | 16:03 |
| clarkb | ya I don't think the real fun will start until 2100 so that tonyb can join in | 16:04 |
| fungi | okay, afk for a little while, probably back in an hour-ish | 16:15 |
| clarkb | enjoy | 16:16 |
| clarkb | gerrit changes have been approved | 17:19 |
| clarkb | I did some self re-review just to be sure I felt ready and can't come up with any reason not to proceed that isn't "gerrit restarts have been annoying recently" | 17:19 |
| fungi | cool, i'm back and ready to help with restarts later whenever | 17:30 |
| clarkb | idea that just occurred to me: should we consider reducing the rax classic limits in zuul relative to the amounts we're able to increase quotas elsewhere? Mostly thinking that cloud in particular doesn't support the newer instruction sets, so we can't run all images there, and the disk + fs setup makes some jobs slower to bootstrap there than elsewhere (specifically devstack style jobs). | 17:44 |
| clarkb | I think the way the launcher works it will give the rax classic provider proportionally more work because it is larger. We can tune it away from that by reducing its quota | 17:44 |
| clarkb | the downside to that is we'd have less quota than we can actually field when demand is high | 17:45 |
| fungi | or maybe we decide where the demarcation line is when we can declare that we have sufficient quota elsewhere to just stop using rax classic for job nodes | 17:46 |
| fungi | platform-specific challenges aside (xen, performance, et cetera), we also run into provider-oriented problems like slow api responses, accumulation of undeletable cruft, and so on | 17:47 |
| clarkb | another thought that occurred to me was adding some sort of prioritization to providers in the launcher so that classic would only be considered if everything else was full up | 17:48 |
| clarkb | but I'm not sure that is a generally useful feature or if that is just something that makes sense in this one circumstance which may not be worth solving directly like that | 17:49 |
| fungi | it's a common use case in other cloud-oriented applications. long ago we used to refer to it as "cloudbursting" for situations where you had low-cost/sunk-cost infrastructure like a private cloud that you preferred to run your workloads on, but wanted to "burst" into a public cloud when your backlog reached a set threshold, basically paying some outside provider to avoid falling farther behind | 17:51 |
| fungi | though if the idea is to prefer performant resources and then fall back on a slower/feature-limited alternative under load, i think that's a recipe for increasing job failure rates. the tests will become attuned to the non-overflow providers, and so grow assumptions that don't hold when they spill over into rax classic | 17:53 |
| clarkb | that is a good point | 17:54 |
| fungi | it happens already, but is kept somewhat in check by some percentage of those jobs always running there | 17:54 |
| clarkb | In my mind it's more about improving throughput. Preferring rax classic means we're choosing some of our slowest nodes more than others | 17:54 |
| clarkb | I don't really care too much about individual job performance as much as overall throughput in this context | 17:54 |
| clarkb | and ya that risk is fine if we think we'll never need those resources again, but if we do and we don't operate in that environment anymore it would be painful | 17:55 |
| fungi | if we can tune the ratio of assignments but still keep some jobs always running there, we probably mostly avoid the acclimitzation problem | 17:55 |
| fungi | if we have lengthy periods where no jobs run there, the risk is much higher | 17:56 |
| fungi | acclimatization | 17:56 |
| fungi | it's a hard word to type after lunch | 17:57 |
| corvus | nodepool has a priority feature for the provider pools.... zuul-launcher does not. i think it would be generally useful and we can probably accommodate it. | 18:20 |
| corvus | an interesting question, which fungi is getting at, is do we set it up so that it's strictly ordered (fill everything else first, then rax-classic), or slightly more random (give rax-classic a 1% chance of handling a request). | 18:22 |
| clarkb | I think I'm convinced we'd want to make it proportional like it is today but probably with more direct control of the proportions | 18:23 |
| clarkb | to ensure that you're semi-constantly checking that you can still operate in each environment | 18:23 |
| corvus | currently it works by going through a sorted list of providers and finding the first one that can fulfill the request, but they are sorted like this: | 18:24 |
| corvus | 1) unassigned nodes; 2) quota used percentage (quantized to 10% buckets); 3) random | 18:24 |
| corvus | so the primary sort is by unassigned nodes (so we grab those first) | 18:24 |
| corvus | then after that we use the provider that's least used | 18:25 |
| corvus | and among providers with the same usage, the order is randomized | 18:25 |
| corvus | (and by "same usage" i mean "within 10%" so that the randomization is meaningful) | 18:25 |
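For illustration, a rough Python sketch of the ordering corvus describes here; this is not the actual zuul-launcher code, and the provider fields and example names are made up:

```python
import random
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    unassigned_nodes: int   # ready-but-unassigned nodes this provider is holding
    quota_used_pct: float   # percentage of quota currently in use

def sort_key(p: Provider):
    # 1) providers holding unassigned nodes sort first (reuse before booting)
    # 2) then least-used quota, quantized to 10% buckets
    # 3) ties within a bucket are broken randomly
    return (-p.unassigned_nodes, int(p.quota_used_pct // 10), random.random())

providers = [
    Provider("rax-classic-dfw", unassigned_nodes=0, quota_used_pct=72),
    Provider("raxflex-dfw3", unassigned_nodes=2, quota_used_pct=40),
    Provider("ovh-bhs1", unassigned_nodes=0, quota_used_pct=38),
]

# The launcher then walks the sorted list and assigns the request to the
# first provider that can actually fulfill it (label, flavor, quota headroom).
for p in sorted(providers, key=sort_key):
    print(p.name)
```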
| opendevreview | Merged opendev/system-config master: Drop /opt/project-config/gerrit/projects.ini on review https://review.opendev.org/c/opendev/system-config/+/966083 | 18:26 |
| opendevreview | Merged opendev/system-config master: Update Gerrit images to 3.10.9 and 3.11.7 https://review.opendev.org/c/opendev/system-config/+/966084 | 18:26 |
| fungi | probably the ideal algorithm would be distributing across a tunable proportion of available providers; the real question is how backlogged requests can/should factor into those decisions (for example does a solution exist where we always run with x% distributed to that provider across available providers, but prefer to also delay satisfying some requests unless they reach a backlog threshold?) | 18:26 |
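For concreteness, a toy sketch of the overflow rule fungi is asking about: only consider a designated overflow provider once the request backlog crosses a threshold. This is purely hypothetical (as corvus notes a bit further down, nothing like this exists in zuul-launcher today), and all names and numbers here are invented:

```python
def eligible_providers(providers, pending_requests, overflow=("rax-classic-dfw",),
                       backlog_threshold=50):
    """Return the providers a new request may be assigned to right now."""
    if pending_requests > backlog_threshold:
        # backlog is high enough: burst into the overflow provider too
        return list(providers)
    return [p for p in providers if p not in overflow]

all_providers = ["ovh-bhs1", "raxflex-dfw3", "rax-classic-dfw"]
print(eligible_providers(all_providers, pending_requests=10))   # overflow excluded
print(eligible_providers(all_providers, pending_requests=120))  # overflow included
```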
| corvus | so if we did what clarkb suggests, one way to do that might be to tweak the step at #2 | 18:26 |
| fungi | yeah, i can imagine a couple of different tunable distribution methods, but i expect they directly conflict with one another and we'd need to decide which to support (or support multiple distribution algorithms, but that seems less likely to be worth the effort) | 18:28 |
| corvus | fungi: there isn't anything like that in z-l today -- eventually if all providers become full enough, it's just a random allocation among all of them | 18:28 |
| corvus | (except that we try to avoid processing requests that can't be fulfilled, so if things are working well and the clouds are not lying to us, we just wait for the next available provider) | 18:29 |
| fungi | right, algorithmically speaking, "random" could get a weighted distribution applied to it | 18:29 |
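To illustrate fungi's point, the random tiebreak in step 3 could become a weighted draw, e.g. giving a rax-classic provider a much smaller chance of winning a tie. This is a hypothetical tweak rather than an existing zuul-launcher option, and the weights and names are invented:

```python
import random

# hypothetical per-provider weights: 1.0 = normal, small value = rarely preferred
WEIGHTS = {
    "raxflex-dfw3": 1.0,
    "ovh-bhs1": 1.0,
    "rax-classic-dfw": 0.05,
}

def weighted_tiebreak(tied_providers):
    """Pick one provider from a group that tied on the earlier sort criteria."""
    weights = [WEIGHTS.get(name, 1.0) for name in tied_providers]
    return random.choices(tied_providers, weights=weights, k=1)[0]

# rax-classic still gets picked occasionally, so jobs keep exercising it,
# but most ties go elsewhere
print(weighted_tiebreak(["rax-classic-dfw", "ovh-bhs1"]))
```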
| clarkb | ya reusing unassigned nodes first regardless of where they are makes sense as the first criterion after filtering for the ability to fulfill the request | 18:29 |
| corvus | (usually the cloud lies to us though, and we end up bouncing over-quota requests around different providers) | 18:29 |
| fungi | oh, good point, "already available" vs "need to be booted" is another critical efficiency factor to figure in | 18:30 |
| clarkb | the naive approach of applying a preference ratio to the existing #2 criterion seems like it could starve clouds unexpectedly | 18:30 |
| corvus | yeah. there's a discontinuity in the algorithm at 90% though -- we quantize to 10% buckets below 90, but above 90%, it's actual values, so we try to land at exactly max quota for every provider. we could incorporate something like that into the proposed scaling. | 18:32 |
| clarkb | manage-projects is running now for the updated docker compose file for gerrit. Manage-projects was already bind mounting the homedir location for projects.ini so this should be a noop | 18:32 |
| corvus | oh, sorry, the discontinuity is actually at 100%. but still, point stands. | 18:33 |
| corvus | (above 100% is a launcher with a backlog) | 18:33 |
| corvus | s/launcher/provider/ | 18:33 |
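A sketch of the quantization corvus describes, including the discontinuity at 100% (illustrative only, not the real launcher code):

```python
def quota_sort_value(quota_used_pct: float) -> float:
    """Secondary sort key derived from a provider's quota usage."""
    if quota_used_pct < 100:
        # below full quota: collapse into 10% buckets so providers within
        # roughly 10% of each other tie and fall through to the random tiebreak
        return quota_used_pct // 10
    # at or above full quota (i.e. a provider with a backlog): use the exact
    # value so the least-backlogged provider sorts first
    return quota_used_pct

assert quota_sort_value(32) == quota_sort_value(38)   # same bucket -> tie
assert quota_sort_value(95) < quota_sort_value(105)   # buckets sort below backlogged values
assert quota_sort_value(105) < quota_sort_value(130)  # exact ordering above 100%
```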
| fungi | anyway, back to the original topic, my opinion is that we just need to decide when to stop booting job nodes in rackspace classic (by adjusting our configuration), probably based on our determination that we have sufficient quota in other providers. something more flexible could be nice, but is probably not necessary | 18:34 |
| fungi | i don't think *we* have a strong use case for such features in zuul-launcher, though we might take advantage of them if they existed | 18:36 |
| clarkb | ya manage projects logs lgtm. Lots of skipping due to matching shas | 18:36 |
| clarkb | that implies a noop, but also I don't think it would've gotten that far if there was a problem with the projects.ini config, which tells it where and how to connect to gerrit | 18:36 |
| fungi | (and others probably do have stronger use cases, so what's optimal for them should be more informative of actual development effort, if any) | 18:36 |
| clarkb | sorry for splitting the discussion I just want to make sure the gerrit stuff updates as expected | 18:37 |
| fungi | yep, appreciated, trying to keep tabs on it myself | 18:37 |
| clarkb | fungi: yes agreed as far as this not being a strong need | 18:37 |
| clarkb | I was mostly brainstorming if we can start managing the old cloud more intelligently in response to new resources | 18:38 |
| fungi | i do feel like we don't get a lot of complaints about being backlogged on change queues, but we do hear about performance/featureset problems with rackspace classic nodes somewhat regularly still | 18:39 |
| fungi | so i'd be willing to trade a bit more of the former for less of the latter | 18:39 |
| clarkb | both buildsets in deploy succeeded. I'm checking the docker-compose.yaml now | 18:40 |
| clarkb | ya projects.ini bind mount is gone from docker-compose.yaml. I realized the new images don't update docker-compose.yaml; for that I need to look at quay | 18:41 |
| fungi | perfect | 18:42 |
| clarkb | quay.io has a banner warning of a tls cert update for their cdn occurring on the 17th. The reason for the warning is the trust chain will change. But if you open the document explaining this it's subscriber content only :/ | 18:42 |
| clarkb | anyway I don't expect that to affect us and https://quay.io/repository/opendevorg/gerrit?tab=tags says 3.10 did update as expected | 18:42 |
| fungi | TheJulia: do you happen to know who at rh would make the decision to fix the fact that the tls certificate warning banner at the top of https://quay.io/ takes users to a red hat knowledgebase article that requires a red hat subscription? | 18:46 |
| fungi | seems like some wires got crossed wrt appropriate publication platforms for public-facing services | 18:46 |
| clarkb | fungi: tonyb: I think that around 2000 UTC we can start to get ready, pull and confirm the new image looks correct, prepare the suite of commands to clear caches and replication queues, then ~2100 UTC send it | 18:48 |
| fungi | sounds good, i'll be here | 18:48 |
| clarkb | maybe around 2000 we also send a status notice indicating our plan to take a short outage | 18:48 |
| clarkb | I'll go ahead and warn the openstack release team now | 18:49 |
| fungi | good idea | 18:49 |
| fungi | though looks like it's been about 5 hours since anyone there was approving new release requests | 18:49 |
| TheJulia | I suspect it is because the "easy" path for teams is the kb, which does require an account and possibly a subscription entitlement. I suspect the right thing to do would be to open an issue in issues.redhat.com in PROJQUAY, type bug, component set to quay.io | 18:50 |
| fungi | thanks! now to see if i have a "red hat account" so i can log into jira | 18:52 |
| TheJulia | I take it you don't at all... hmm | 18:58 |
| fungi | no, i do, took me a few minutes to find it since i don't use it often | 18:58 |
| fungi | now i'm just fumbling around the interface, shouldn't take me long | 18:59 |
| fungi | oh, that's fun. i have a preexisting account on the rh jira with the same e-mail address as my rh sso account, so it won't let me log into jira due to that conflict? | 19:01 |
| fungi | "SAML-Login failed: You have logged in with username [...], but you already have a Jira account with [...]. If you need assistance, email jiraconf-sd@redhat.com and provide tracker-id." | 19:01 |
| fungi | i guess this is where i hit the "too much effort required to bother reporting something so trivial" threshold | 19:02 |
| clarkb | its gerrit openid all over again | 19:02 |
| fungi | yeah, complicated in my case because the conflict is on an e-mail address that redhat.com's mail provider rejects messages from, so if i e-mail them to try and get it resolved i'd have to do it from a different address than the one for the account in question | 19:10 |
| clarkb | heh I've just realized my 2000 UTC math coincides with lunch. I should be able to make that work just fine though | 19:11 |
| clarkb | my mental timezone map post DST drop hasn't updated properly yet | 19:11 |
| TheJulia | fungi: opening an issue for you now | 19:12 |
| fungi | TheJulia: oh thank you! | 19:15 |
| TheJulia | PROJQUAY-9772 | 19:15 |
| fungi | appreciated. we don't think it impacts us at all, but suspect it may impact other users who don't have rh accounts to read the article explaining whether or not it does | 19:16 |
| TheJulia | It is kind of not great, tbh | 19:17 |
| TheJulia | and the only way for teams to sometimes know is when they triage such issues as a team | 19:18 |
| clarkb | how does this look #status notice The OpenDev team will be restarting Gerrit at approximately 2100 UTC in order to pick up the latest 3.10 bugfix release. | 20:00 |
| clarkb | tonyb: I'm happy to wait for your morning to start to do the on server actions so that you're able to follow along and/or be the typing driver | 20:02 |
| clarkb | and then I guess the other question is do we want to jump on meetpad to make coordination more synchronous or are we happy with irc? | 20:04 |
| corvus | clarkb: msg lgtm | 20:07 |
| fungi | yeah, i'm good with the wording | 20:07 |
| clarkb | oh looks like I said we'd get started at 2100 with actual restart happening at about 2130 when planning on Tuesday | 20:09 |
| clarkb | so maybe update that warning to 2130? but otherwise send it at 2030? | 20:09 |
| fungi | fine by me, better to stick with the prior plan yeah | 20:09 |
| clarkb | I then recorded that in my notes as 2100 hence the confusion. | 20:12 |
| clarkb | Mostly I want to make sure tonyb has a chance to participate | 20:12 |
| fungi | hah | 20:12 |
| fungi | yeah, it's all good with me, i'll be here regardless | 20:12 |
| clarkb | ok sending the notice now | 20:30 |
| clarkb | #status notice The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release. | 20:30 |
| opendevstatus | clarkb: sending notice | 20:30 |
| -opendevstatus- NOTICE: The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release. | 20:30 | |
| tonyb | I'll be at my laptop in a couple of minutes. | 20:31 |
| fungi | there's no rush | 20:31 |
| clarkb | ya I may take this as an opportunity for a short break before we dive in. I think the next steps are deciding if we want to use meetpad or not, then starting a screen session and pulling and verifying the image | 20:32 |
| clarkb | then we wait for ~2130 UTC and can proceed with restarting things at that time | 20:32 |
| opendevstatus | clarkb: finished sending notice | 20:33 |
| fungi | i'm happy to jump in a meetpad room if people want that (e.g. for screen sharing), but gnu screen multi-attach works fine as long as we're all infra-root | 20:34 |
| opendevreview | Merged openstack/project-config master: Set noop job for the governance-sigs repository https://review.opendev.org/c/openstack/project-config/+/966755 | 20:39 |
| tonyb | clarkb, fungi: ready when you are | 20:47 |
| clarkb | fungi: ya I was just thinking maybe synchronous voice chat would help explain what we're doing, but I'll let tonyb decide if that is helpful. I don't think we should use meetpad to screenshare and instead use screen to share the context | 20:52 |
| clarkb | tonyb: want to start a root screen we can attach to? Then we can take notes on what the current image is that we're running and fetch and check the new image | 20:53 |
| tonyb | Can do | 20:54 |
| clarkb | tonyb: if you do a docker ps -a you should see that the container's image matches that image there too | 20:58 |
| clarkb | ya so the container running gerrit (gerrit-compose-gerrit-1) says its image is quay.io/opendevorg/gerrit:3.10 which is currently 16e3e6710a12 | 21:00 |
| clarkb | this is mostly for historical record keeping and theoretical rollback paths | 21:00 |
| tonyb | Okay | 21:00 |
| clarkb | so now I think you can do `docker-compose pull` and that should only update images for us and not touch the running containers. Then we can cross check the new container image with what we have in quay.io | 21:01 |
| clarkb | or `docker compose pull` since the - is optional on noble nodes | 21:01 |
| tonyb | https://meetpad.opendev.org/gerrit-update if y'all want | 21:01 |
| clarkb | tonyb: so the new image can be found at https://quay.io/repository/opendevorg/gerrit?tab=tags | 21:04 |
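A small sketch (assuming the container and tag names used above) of this cross-check: compare the image ID backing the running container against the ID the freshly pulled tag points at:

```python
import subprocess

def docker_inspect(*args):
    """Run a docker inspect variant with a --format template and return the output."""
    return subprocess.run(["docker", *args], capture_output=True, text=True,
                          check=True).stdout.strip()

# image ID the running Gerrit container was created from
running = docker_inspect("inspect", "--format", "{{.Image}}", "gerrit-compose-gerrit-1")
# image ID the pulled tag now points at
pulled = docker_inspect("image", "inspect", "--format", "{{.Id}}",
                        "quay.io/opendevorg/gerrit:3.10")

# right after `docker compose pull` these may differ; once the container is
# restarted on the new image they should match
print("running:", running)
print("pulled: ", pulled)
print("match:  ", running == pulled)
```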
| opendevreview | James E. Blair proposed opendev/system-config master: Update zuul-client image location https://review.opendev.org/c/opendev/system-config/+/967111 | 21:22 |
| corvus | clarkb: tonyb zuul isn't super busy, but it's also not idle... if we want to be very conservative, we can pause event processing before you shut down gerrit. | 21:24 |
| corvus | lmk if you want to do that, i've got a command ready (which i can send to you if you want to run it) | 21:24 |
| clarkb | corvus: will that pause merge attempts for changes already in the gate? | 21:24 |
| corvus | yes | 21:24 |
| clarkb | I think we should go ahead and do that | 21:25 |
| corvus | (that would be the main reason to do it; i'm not worried about disconnections or events or anything like that) | 21:25 |
| corvus | our helper script has an error, which i worked on correcting in 967111; in the mean time, i made a local copy on zuul01 that is corrected | 21:26 |
| corvus | so, as root on zuul01: /root/zuul-client manage-events --all-tenants --reason "Gerrit upgrade in progress" pause-result | 21:26 |
| clarkb | thanks! | 21:26 |
| corvus | when finished: /root/zuul-client manage-events --all-tenants normal | 21:26 |
| clarkb | shutdown is proceeding now | 21:30 |
| clarkb | [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.10.9-4-g309e8c8861-dirty ready | 21:34 |
| clarkb | https://groups.google.com/g/repo-discuss/c/TxeJ9lvW6Is/m/NGLT2D5aAQAJ I think this is related to the error we see from the plugin manager preloader | 21:40 |
| clarkb | however the specific error is different | 21:40 |
| clarkb | https://gerrit.googlesource.com/plugins/webhooks | 21:42 |
| clarkb | https://gerrit-review.googlesource.com/Documentation/cmd-index-start.html | 21:47 |
| opendevreview | Clark Boylan proposed opendev/system-config master: DNM this is a change ot test gerrit functionality https://review.opendev.org/c/opendev/system-config/+/967131 | 21:49 |
| clarkb | this didn't make it into the IRC logs so I'll add it: corvus pointed out that at least one change would've reported during the zuul pause period. If we hadn't paused that could've broken a merge | 22:09 |
| clarkb | so that's a great little feature for us to remember we have now with zuul | 22:09 |
| clarkb | I'm going to delete my autoholds for the old gerrit 3.11 image (we just updated it to a new version) and etherpad 2.5.3 and gitea 1.25.1 | 22:17 |
| clarkb | I don't see any other autoholds in the openstack tenant | 22:17 |
| clarkb | usually I make note of this here so that I can delete other autoholds too but not necessary this time | 22:17 |
| *** dmellado9 is now known as dmellado | 22:18 | |
| clarkb | infra-root gerrit reindexing has completed and reports `Reindex changes to version 86 complete` with 3 failed changes (the expected count) | 22:27 |
| clarkb | I have requested new autoholds for gerrit 3.10 and 3.11 jobs that will hopefully finally be used to test the upgrade and get that 3.10 to 3.11 upgrade done | 23:48 |