opendevreview | Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659 | 00:50 |
opendevreview | Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659 | 01:27 |
ianw | Manifest has invalid size for layer sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5 (size:203129434 actual:185016320) | 02:06 |
ianw | https://zuul.opendev.org/t/openstack/build/22089e51b052400db2970e78b08f60be | 02:07 |
ianw | system-config-build-image-gerrit-3.3 | 02:07 |
ianw | ls -l ./_local/blobs/sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5 | 02:17 |
ianw | -rw-r--r-- 1 root root 203129433 Sep 7 01:53 data | 02:17 |
ianw | it's out by a byte, but did get to the final size. i wonder if this is a race or a sync issue | 02:18 |
ianw | there is a "Failed to obtain lock(1) on digest sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5" | 02:21 |
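The mismatch ianw describes (the manifest declares one layer size, the stored blob has another) can be caught with a straightforward integrity check. A minimal sketch of the idea, not zuul-registry's actual code; `verify_blob` and its arguments are invented names:

```python
import hashlib
import os

def verify_blob(path, expected_digest, expected_size):
    """Check a stored blob against the size and sha256 digest that the
    manifest declares for it. A mismatch like the one above usually means
    a truncated or concurrently-written blob."""
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        return False, f"size mismatch: expected {expected_size}, got {actual_size}"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = "sha256:" + h.hexdigest()
    if digest != expected_digest:
        return False, f"digest mismatch: got {digest}"
    return True, "ok"
```

Note the blob above is out by exactly one byte (203129434 declared vs 203129433 on disk), which a check like this would flag before the manifest is served.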
ianw | https://review.opendev.org/c/zuul/zuul-registry/+/807663 is my suggestion | 02:53 |
opendevreview | Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659 | 02:55 |
*** ysandeep|out is now known as ysandeep | 05:07 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672 | 06:14 |
ianw | clarkb: ^ we can discuss in meeting, i started to try and think about what it takes to run these things in parallel. | 06:26 |
*** jpena|off is now known as jpena | 07:39 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 09:16 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 10:54 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 10:56 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 11:05 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Make default tox run more strict about interpreter version https://review.opendev.org/c/zuul/zuul-jobs/+/807702 | 11:05 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 11:07 |
*** jpena is now known as jpena|lunch | 11:26 | |
*** jpena|lunch is now known as jpena | 12:24 | |
*** ysandeep is now known as ysandeep|out | 13:07 | |
clarkb | kevinz: our ssl cert checker is warning us now that the linaro cert will expire in 27 days | 14:07 |
clarkb | mordred: corvus: there is a thread on gerrit's ml asking for links to presentations given at the 2019 sunnyvale/gothenburg gerrit user summits. I think you gave at least one presentation at both? I could be wrong but wanted to pass that along in case you have that infor them | 14:09 |
clarkb | "info" and "for" became "infor" in that last sentence | 14:09 |
opendevreview | Douglas Mendizábal proposed opendev/irc-meetings master: Move Keystone meeting to 1500 UTC https://review.opendev.org/c/opendev/irc-meetings/+/807729 | 14:18 |
clarkb | infra-root I think the stack at https://review.opendev.org/c/opendev/system-config/+/805932/3 is ready for merging. There is one small zuul-registry fix that we might want to land first: https://review.opendev.org/c/zuul/zuul-registry/+/807663 to ensure those images build and update cleanly though | 14:29 |
clarkb | that is the assets work. | 14:30 |
corvus | clarkb: reg change apvd | 14:31 |
clarkb | thanks! | 14:31 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:42 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:57 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:59 |
clarkb | the zuul-registry update should be deployed now. | 15:30 |
opendevreview | Merged opendev/system-config master: Add assets and a related docker image/bundle https://review.opendev.org/c/opendev/system-config/+/805932 | 15:47 |
clarkb | exciting ^ | 15:48 |
fungi | i just finished reviewing and approving that whole series | 15:48 |
opendevreview | Merged opendev/elastic-recheck rdo: Updates for docker and frontpage of elastic-recheck https://review.opendev.org/c/opendev/elastic-recheck/+/806739 | 15:54 |
*** marios is now known as marios|out | 16:01 | |
mnaser | infra-root: hello, i'd like to do a reboot of {gitea0[1-8],gitea-lb01,mirror01.sjc1.vexxhost}.opendev.org -- purpose is that it will be moved on newer amd hardware (live migration is not available because intel=>amd is no good) -- can i go ahead with that? | 16:12 |
clarkb | mnaser: doing gitea01-08 in a rolling fashion should be fine. gitea-lb01 will be noticed as that is the front end and a current spof. The mirror will also be noticed by any jobs running on that cloud. | 16:13 |
clarkb | mnaser: also I think the changes that fungi approved above will result in a gitea "upgrade" to our new images with assets copied in differently. Might be best to wait for that to finish first? | 16:13 |
clarkb | though those are at least 40 minutes away says zuul. | 16:14 |
mnaser | clarkb: can you give me a ping when i'm good to go in that case? :) | 16:14 |
mnaser | that's totally fine | 16:14 |
clarkb | mnaser: I can do that | 16:14 |
mnaser | thank you | 16:14 |
mnaser | clarkb: i see a bunch of vms inside `openstackci`, called `opendev-k8s-{master,1,2,3,4}`. i think mordred was playing with k8s.. a really long time ago? | 16:24 |
mnaser | i dont know if these are in the ansible inventory, if i should move them, if we should nuke them.. | 16:25 |
clarkb | mnaser: I think nodepool may still "know" about them but they aren't really being used there (basically no one took that experiment further). If we can get mordred to confirm we can probably remove any remaining system-config/project-config for them and clean them up | 16:25 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 things related to that | 16:26 |
mnaser | clarkb: ok, but i guess they are safe to migrate for now until that is all cleaned up? | 16:26 |
clarkb | that cluster may also still be running the old test of gitea in k8s? In any case I suspect that will all need to be redone with proper k8s deployment tooling if we go that route again | 16:26 |
clarkb | mnaser: yes, I think it should be safe to migrate them | 16:26 |
mnaser | great thank you | 16:27 |
corvus | i just got an internal error pushing a change; this is the gerrit error log entry: Caused by: java.lang.IllegalArgumentException: cannot chain ref update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 after update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 with result REJECTEDOTHERREASON | 16:40 |
corvus | re-trying worked. i'm mentioning it for monitoring purposes. | 16:41 |
fungi | thanks, i wonder what could lead to that | 16:46 |
*** jpena is now known as jpena|off | 16:46 | |
clarkb | some other reason apparently :) | 16:49 |
opendevreview | Merged opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933 | 16:57 |
opendevreview | Merged opendev/system-config master: gitea: add some screenshots to testing https://review.opendev.org/c/opendev/system-config/+/807489 | 16:57 |
mordred | clarkb, mnaser : yes, those should be safe to migrate. honestly, we should probably just delete them, there's no way we're going to use them in the current state | 16:58 |
opendevreview | Merged opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659 | 17:05 |
clarkb | mnaser: is this the sort of thing that we can trigger the reboots for and have it do the right thing behind the scenes or do you need to actively move them? Asking because I think the least impact method would be to remove a gitea0X or two from haproxy and shutdown its services then reboot it | 17:19 |
clarkb | but both gerrit replication and haproxy should detect if it happens unexpectedly as well | 17:19 |
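The rolling approach clarkb outlines (pull a backend or two out of the haproxy pool, reboot, verify health, put them back, move to the next pair) can be sketched as a small driver. This is an illustration only, not opendev's actual tooling: the `balance_git_https` backend name and the injected `run_socket_cmd`, `reboot_and_wait`, and `is_healthy` callables are all hypothetical.

```python
def rolling_reboot(servers, run_socket_cmd, reboot_and_wait, is_healthy,
                   backend="balance_git_https", batch=2):
    """Reboot haproxy backends a few at a time.

    run_socket_cmd(cmd) feeds a command to the haproxy admin socket
    (e.g. 'set server <backend>/<name> state maint'), reboot_and_wait(name)
    performs the actual reboot, and is_healthy(name) verifies the service
    afterwards. Returns the list of servers successfully cycled.
    """
    done = []
    for i in range(0, len(servers), batch):
        group = servers[i:i + batch]
        # Drain the whole batch before touching any of its members.
        for name in group:
            run_socket_cmd(f"set server {backend}/{name} state maint")
        for name in group:
            reboot_and_wait(name)
        # Only re-enable a backend once it checks out healthy.
        for name in group:
            if not is_healthy(name):
                raise RuntimeError(f"{name} unhealthy after reboot; "
                                   f"leaving it out of rotation")
            run_socket_cmd(f"set server {backend}/{name} state ready")
            done.append(name)
    return done
```

In practice the socket commands could be delivered with something like `echo "set server ..." | socat stdio /var/run/haproxy.sock` (socket path assumed).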
mnaser | clarkb: unfortunately an actual migration needs to be done which is an admin action — but I am happy and available to coordinate. | 17:20 |
clarkb | ok good to know. I can probably do the haproxy and service stops after the current meeting and work through those. The deployment stuff should be done by then too | 17:20 |
mnaser | Ok sounds good. Just shoot me a ping and I can kick it off | 17:21 |
clarkb | mnaser: also sjc1 max-servers are currently set to 0 I think you can safely do the mirror there now | 17:23 |
clarkb | we are using ca-ymq-1 for jobs currently according to the config | 17:23 |
mnaser | Yeah that’s what I thought too | 17:24 |
mnaser | Okay, I’ll kick that one off now then | 17:24 |
mnaser | mirror01 should be done | 17:28 |
mnaser | urgh, configdrive migration bug, nevermind, let me dig | 17:29 |
mnaser | nope i lied, it worked just fine :) | 17:30 |
clarkb | https://mirror.sjc1.vexxhost.opendev.org/ is serving content and the uptime says we did reboot and /proc/cpuinfo shows amd | 17:30 |
clarkb | ya from what I can see it seems happy | 17:30 |
clarkb | I don't see a config drive on the instance, but not sure if it was booted with one in the first place | 17:30 |
mnaser | doesnt look like it, so yeah, that was my bad | 17:31 |
mnaser | clarkb: i will be unavailable in ~1h30m for ~1h30m-ish. so if you want to remove some backends from haproxy that i can asynchronously move if/when you're afk, just a heads up =) | 17:32 |
clarkb | mnaser: noted. Why don't we start with gitea08 as a first run and I'll go stop services and let you know when it is ready | 17:32 |
clarkb | oh except the upgrade playbook finally just started. we'll wait for that to finish first | 17:33 |
mnaser | no worries | 17:33 |
clarkb | mnaser: the upgrade is done and the gitea cluster seems to still be happy from my spot checking. I've removed gitea08 from the haproxy pool if you want to go ahead and reboot that one as a first check? We can do two at a time afterwards if that one is happy | 17:55 |
opendevreview | Artom Lifshitz proposed opendev/git-review master: WIP: Allow custom SSH args https://review.opendev.org/c/opendev/git-review/+/807787 | 17:56 |
clarkb | mnaser: I'm going to take a short break. But you should be good to reboot gitea08 whenever you are ready and I can help verify it is happy when done. Then we can go through the rest in batches | 18:04 |
mnaser | gitea08 starting now :) | 18:32 |
mnaser | and done | 18:33 |
clarkb | mnaser: that one actually does seem to have a config drive and I still see it | 18:43 |
clarkb | also its services are serving | 18:43 |
clarkb | mnaser: 06 and 07 have been pulled out of the haproxy rotation if you want to do them next | 18:44 |
clarkb | from what I see 08 is happy | 18:44 |
mnaser | clarkb: ok I won’t be able to do that until an hour or so but will do when I’m around | 18:46 |
clarkb | mnaser: sounds good. Just let me know how you're progressing and I can continue to rotate them out in haproxy. I've got the infra meeting starting in 13 minutes so an hour break wfm | 18:47 |
fungi | i need to start prepping dinner shortly, but should hopefully be able to take a look at the prometheus spec after | 19:56 |
clarkb | mnaser: I'm grabbing lunch right now but feel free to do gitea06 and gitea07 when you are ready and I'll put them back into the rotation after and pull out the next pair | 19:58 |
mnaser | clarkb: alright, im around again, i will kick off 06 and 07 | 20:33 |
clarkb | mnaser: sounds good I'm around to work through these too | 20:33 |
mnaser | 06 started | 20:35 |
mnaser | clarkb: both are done :) | 20:37 |
clarkb | checking | 20:37 |
clarkb | yup they lgtm. I've put them back in the rotation and pulled gitea04 and gitea05 out if you want to do those now | 20:38 |
mnaser | cool, starting 4 and 5 now | 20:39 |
mnaser | clarkb: should be done | 20:40 |
clarkb | mnaser: yup all looks good. gitea03 and gitea02 are ready for you now | 20:42 |
mnaser | ok, starting | 20:43 |
mnaser | clarkb: both done | 20:44 |
clarkb | yup continues to look good to me. gitea01 is ready when you are | 20:46 |
mnaser | cool, starting | 20:46 |
mnaser | clarkb: completed | 20:47 |
clarkb | great that all looks happy on my end. | 20:48 |
clarkb | mnaser: for the load balancer we decided in the meeting today that just going for it and ripping the bandaid off is likely the easiest thing | 20:49 |
clarkb | mnaser: I'm happy for you to do that now if you want and I can check it after | 20:49 |
clarkb | Also for review02.opendev.org does that still need a reboot? | 20:49 |
mnaser | clarkb: i can do lb now if you want, that would really be appreciated. review02 will need a reboot (even though its in mtl, moving to the new dc). | 20:50 |
mnaser | we don't have to do that right now though (wrt review02) because i figure that might be a bit trickier i guess | 20:50 |
clarkb | ya review02 probably needs a bit more coordination. | 20:51 |
clarkb | Ya I think we should probably just go ahead and do the load balancer now | 20:51 |
clarkb | lets rip the bandaid off | 20:51 |
mnaser | alright i'll push that through | 20:51 |
mnaser | clarkb: and its back | 20:52 |
mnaser | the underlying hardware is _way_ faster so we should see measurably better performance | 20:52 |
clarkb | https://opendev.org loads for me and the server looks happy | 20:53 |
mnaser | awesome, thanks for the flexibility clarkb :) | 20:54 |
clarkb | for review02 the absolute safest thing would be to wait for after the openstack release happens, but we can probably get away with a reboot during a quiet period like late friday through ianw's monday? | 20:54 |
clarkb | mnaser: does the ceph volume that we host the gerrit site on review02 impact the DC move at all? | 20:54 |
clarkb | eg do you have to move the volume at the same time and if so is that expected to make review02 movement particularly slow? | 20:55 |
mnaser | clarkb: it will be moved but it will not slow down, we've got some magic movement tools to avoid any downtime/slowdown | 20:55 |
mnaser | we use snapshots to minimize the amount of data moved during the reboot | 20:55 |
mnaser | so we move the majority of the data before hand, then one more small move, shutdown, move whatever was written to disk, then start up again | 20:56 |
mnaser | so its mostly migrated online except for the flip over | 20:56 |
clarkb | nice | 20:56 |
mnaser | so we can 'prep' vms to be moved ahead of time so the outage is really small | 20:56 |
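The two-phase pattern mnaser describes (bulk pre-copy from a snapshot while the VM runs, then a brief shutdown to move only the blocks written since) can be modeled abstractly. This is a toy simulation of the idea, not vexxhost's actual tooling; the dict-based "volume" and all function names are invented for illustration.

```python
def migrate_volume(src, dst, snapshot, stop_vm, start_vm):
    """Two-phase copy: everything present at snapshot time moves while the
    VM is still running; only blocks written after the snapshot move during
    the short shutdown window. Returns the number of blocks copied offline."""
    # Phase 1 (online): copy the snapshot's view of the data.
    for block, value in snapshot.items():
        dst[block] = value
    # Phase 2 (offline): stop the VM, copy only what changed since the
    # snapshot, then start it back up. The outage lasts only this delta copy.
    stop_vm()
    delta = {b: v for b, v in src.items() if snapshot.get(b) != v}
    dst.update(delta)
    start_vm()
    return len(delta)
```

The point of the snapshot is that the (large) phase-1 copy carries no downtime cost, so the outage scales with write activity since the snapshot rather than with volume size.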
clarkb | mnaser: if there is a quieter time that also works for you I guess let us know and we can coordinate review's move then | 20:59 |
clarkb | The biggest thing right now is lots of changes are merging and zuul will have a sad if gerrit isn't up when it tries to merge things, so we want to try and pick a time when zuul is unlikely to be doing that | 21:00 |
clarkb | Another option is to do it with a coordinated zuul restart | 21:00 |
clarkb | now I should migrate outside and do some code review in the backyard | 21:01 |
corvus | if you want to expedite, shut down all the zuul executors during the move. no perceived outage from zuul's pov. | 21:07 |
clarkb | fungi: do we care about the dell openstack ironic ci user bouncing gerrit emails because they are only allowed to receive email from people in their organization? (also wow I guess that is one way to combat phishing and spam) | 22:06 |
clarkb | maybe we should ask them to disable gerrit email as much as possible? | 22:06 |
opendevreview | Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672 | 22:10 |
fungi | yeah we shouldn't have users with invalid/undeliverable e-mail addresses, but i have a feeling there are many | 22:42 |
fungi | if we do decide that's a problem, we should start analyzing the mta logs and disabling accounts or something | 22:43 |
clarkb | in this case I suspect the account is active they just don't realize their email policy is not useful | 22:44 |
clarkb | and ya I'm sure there are others but this account seems active enough to generate the bounces | 22:44 |
fungi | i think new gerrit will prevent that by requiring addresses to be verifiable? though maybe not those which come in through openid autocreation (but then hopefully the idp has similar requirements). of course, none of that protects against working addresses ceasing to work | 22:51 |
clarkb | in this case I suspect it was working then ceased, but ya exactly | 22:53 |
ianw | clarkb: it is striking me that with parallel jobs, https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml will keep overwriting /home/zuul/src/opendev.org/opendev/system-config at random times | 23:40 |
clarkb | hrm maybe we do need to centralize that in a parent job before we parallelize? | 23:41 |
clarkb | or use some sort of lock around that (though that might get clunky fast) | 23:41 |
ianw | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/mirror-workspace-git-repos/tasks/main.yaml#L39 might be a trouble point, it does a clean | 23:44 |
ianw | the consequence might be cleaning a .pyc file that is in use. i guess that ansible probably survives | 23:44 |
clarkb | another approach would be different workspaces per job but that might get complicated with how we set up ansible | 23:45 |
clarkb | and also bootstrapping an env to run it ourselves | 23:45 |
fungi | python 3.10.0rc2 is out! | 23:46 |
ianw | i mean it would not be terrible to put cloud credentials, etc. in zuul secrets and have each job run from it's own self-contained environment | 23:47 |
clarkb | the struggle becomes using the host as a bastion at that point. It is certainly doable but the more we put in zuul the less we're able to interact directly (which isn't entirely a bad thing, but if you aren't ready to do everything with zuul ...) | 23:48 |
clarkb | this particular problem might be worth having a proper brainstorm over and maybe if we can rope mordred in that would be good too as he designed a bit of that original stuff | 23:48 |
ianw | yep | 23:51 |
ianw | extracting it into a job that all others depend on is probably the most logical step | 23:52 |
ianw | that would be a hard dependency, and it should always run | 23:52 |
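The setup-job approach ianw sketches could look something like the following in Zuul's project-pipeline syntax. This is a hypothetical fragment with invented job names, just to illustrate the shape: one job refreshes the shared system-config checkout, and every other infra-prod job declares a hard (non-soft) dependency on it so nothing races on the workspace.

```yaml
# Hypothetical illustration: infra-prod-setup-src updates the shared
# checkout; all service jobs hard-depend on it and so run only after it
# completes, even when they otherwise run in parallel.
- project:
    deploy:
      jobs:
        - infra-prod-setup-src
        - infra-prod-service-gitea:
            dependencies:
              - name: infra-prod-setup-src
                soft: false
        - infra-prod-service-registry:
            dependencies:
              - name: infra-prod-setup-src
                soft: false
```

With a hard dependency, a failure of the setup job also skips the dependent jobs, which matches ianw's "it should always run" requirement.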
clarkb | yup | 23:53 |
clarkb | certainly it would probably be the simplest to implement and easiest to understand (at least for me) | 23:53 |
fungi | but would also need a mutex so the periodic pipeline build of that job doesn't fire independently of deploy pipeline builds? | 23:54 |
ianw | if we keep the semaphore as is, i think it could follow-on to the existing change as well, where it would fit more logically | 23:54 |
clarkb | fungi: that is already an assumption of the system | 23:54 |
fungi | k | 23:54 |
clarkb | parallelization would only occur within a buildset | 23:54 |
fungi | so the periodic pipeline buildset couldn't run while the deploy buildset was in progress | 23:55 |
clarkb | correct, and that is the situation today | 23:55 |
fungi | just making sure it wouldn't suddenly become possible with parallelization | 23:56 |
ianw | yep, that bit should remain the same, except for periodic this theoretical setup job will pull from master instead of the zuul change | 23:57 |
clarkb | if we want to try a meetpad call tomorrow to talk through some of this more I'm happy to do that | 23:57 |
fungi | i'm available whenevs | 23:58 |
clarkb | It would probably have to be at or after 3pm for me to juggle school pickup and ianw's schedule. | 23:58 |
clarkb | that would be 2200UTC or later | 23:58 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!