Tuesday, 2021-09-07

[00:50] <opendevreview> Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking  https://review.opendev.org/c/opendev/system-config/+/807659
[01:27] <opendevreview> Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking  https://review.opendev.org/c/opendev/system-config/+/807659
[02:06] <ianw> Manifest has invalid size for layer sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5 (size:203129434 actual:185016320)
[02:17] <ianw> ls -l ./_local/blobs/sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5
[02:17] <ianw> -rw-r--r-- 1 root root 203129433 Sep  7 01:53 data
[02:18] <ianw> it's out by a byte, but did get to the final size.  i wonder if this is a race or a sync issue
[02:21] <ianw> there is a "Failed to obtain lock(1) on digest sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5"
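The symptom above (the manifest check sees the blob a byte or so short, yet the file does reach its final size on disk) suggests a reader racing a writer that has not finished flushing. As a rough illustration only (this helper is hypothetical, not zuul-registry's actual code), one defensive pattern is to poll the on-disk size until it reaches what the manifest claims before declaring the layer corrupt:

```python
import os
import time

def wait_for_stable_size(path, expected, timeout=5.0, interval=0.1):
    """Poll a blob file until it reaches the size the manifest claims,
    so a concurrent writer's unflushed tail is not mistaken for a
    corrupt layer. Returns the last observed size either way."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        actual = os.path.getsize(path)
        if actual >= expected:
            return actual
        time.sleep(interval)
    return os.path.getsize(path)
```

A caller would compare the returned size against the manifest's expected size and only then raise the "invalid size" error.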
[02:53] <ianw> https://review.opendev.org/c/zuul/zuul-registry/+/807663 is my suggestion
[02:55] <opendevreview> Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking  https://review.opendev.org/c/opendev/system-config/+/807659
[05:07] *** ysandeep|out is now known as ysandeep
[06:14] <opendevreview> Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running  https://review.opendev.org/c/opendev/system-config/+/807672
[06:26] <ianw> clarkb: ^ we can discuss in meeting, i started to try and think about what it takes to run these things in parallel.
[07:39] *** jpena|off is now known as jpena
[09:16] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[10:54] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[10:56] <opendevreview> Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/805638
[11:05] <opendevreview> Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/805638
[11:05] <opendevreview> Sorin Sbârnea proposed zuul/zuul-jobs master: Make default tox run more strict about interpreter version  https://review.opendev.org/c/zuul/zuul-jobs/+/807702
[11:07] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[11:26] *** jpena is now known as jpena|lunch
[12:24] *** jpena|lunch is now known as jpena
[13:07] *** ysandeep is now known as ysandeep|out
[14:07] <clarkb> kevinz: our ssl cert checker is warning us now that the linaro cert will expire in 27 days
[14:09] <clarkb> mordred: corvus: there is a thread on gerrit's ml asking for links to presentations given at the 2019 sunnyvale/gothenburg gerrit user summits. I think you gave at least one presentation at both? I could be wrong but wanted to pass that along in case you have that infor them
[14:09] <clarkb> "info" and "for" became "infor" in that last sentence
[14:18] <opendevreview> Douglas Mendizábal proposed opendev/irc-meetings master: Move Keystone meeting to 1500 UTC  https://review.opendev.org/c/opendev/irc-meetings/+/807729
[14:29] <clarkb> infra-root I think the stack at https://review.opendev.org/c/opendev/system-config/+/805932/3 is ready for merging. There is one small zuul-registry fix that we might want to land first: https://review.opendev.org/c/zuul/zuul-registry/+/807663 to ensure those images build and update cleanly though
[14:30] <clarkb> that is the assets work.
[14:31] <corvus> clarkb: reg change apvd
[14:42] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[14:57] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[14:59] <opendevreview> Jiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/807031
[15:30] <clarkb> the zuul-registry update should be deployed now.
[15:47] <opendevreview> Merged opendev/system-config master: Add assets and a related docker image/bundle  https://review.opendev.org/c/opendev/system-config/+/805932
[15:48] <clarkb> exciting ^
[15:48] <fungi> i just finished reviewing and approving that whole series
[15:54] <opendevreview> Merged opendev/elastic-recheck rdo: Updates for docker and frontpage of elastic-recheck  https://review.opendev.org/c/opendev/elastic-recheck/+/806739
[16:01] *** marios is now known as marios|out
[16:12] <mnaser> infra-root: hello, i'd like to do a reboot of {gitea0[1-8],gitea-lb01,mirror01.sjc1.vexxhost}.opendev.org -- purpose is that it will be moved onto newer amd hardware (live migration is not available because intel=>amd is no good) -- can i go ahead with that?
[16:13] <clarkb> mnaser: doing gitea01-08 in a rolling fashion should be fine. gitea-lb01 will be noticed as that is the front end and a current spof. The mirror will also be noticed by any jobs running on that cloud.
[16:13] <clarkb> mnaser: also I think the changes that fungi approved above will result in a gitea "upgrade" to our new images with assets copied in differently. Might be best to wait for that to finish first?
[16:14] <clarkb> though those are at least 40 minutes away says zuul.
[16:14] <mnaser> clarkb: can you give me a ping when i'm good to go in that case? :)
[16:14] <mnaser> that's totally fine
[16:14] <clarkb> mnaser: I can do that
[16:14] <mnaser> thank you
[16:24] <mnaser> clarkb: i see a bunch of vms inside `openstackci`, called `opendev-k8s-{master,1,2,3,4}`.  i think mordred was playing with k8s.. a really long time ago?
[16:25] <mnaser> i don't know if these are in the ansible inventory, if i should move them, if we should nuke them..
[16:25] <clarkb> mnaser: I think nodepool may still "know" about them but they aren't really being used there (basically no one took that experiment further). If we can get mordred to confirm we can probably remove any remaining system-config/project-config for them and clean them up
[16:26] <clarkb> https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 and things related to that
[16:26] <mnaser> clarkb: ok, but i guess they are safe to migrate for now until that is all cleaned up?
[16:26] <clarkb> it may also be running the old test gitea from the experiment with running it in k8s? In any case I suspect that will all need to be redone with proper k8s deployment tooling if we go that route again
[16:26] <clarkb> mnaser: yes, I think it should be safe to migrate them
[16:27] <mnaser> great thank you
[16:40] <corvus> i just got an internal error pushing a change; this is the gerrit error log entry: Caused by: java.lang.IllegalArgumentException: cannot chain ref update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 after update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 with result REJECTED_OTHER_REASON
[16:41] <corvus> re-trying worked.  i'm mentioning it for monitoring purposes.
[16:46] <fungi> thanks, i wonder what could lead to that
[16:46] *** jpena is now known as jpena|off
[16:49] <clarkb> some other reason apparently :)
[16:57] <opendevreview> Merged opendev/system-config master: gitea: use assets bundle  https://review.opendev.org/c/opendev/system-config/+/805933
[16:57] <opendevreview> Merged opendev/system-config master: gitea: add some screenshots to testing  https://review.opendev.org/c/opendev/system-config/+/807489
[16:58] <mordred> clarkb, mnaser: yes, those should be safe to migrate. honestly, we should probably just delete them, there's no way we're going to use them in the current state
[17:05] <opendevreview> Merged opendev/system-config master: testinfra: refactor screenshot taking  https://review.opendev.org/c/opendev/system-config/+/807659
[17:19] <clarkb> mnaser: is this the sort of thing that we can trigger the reboots for and have it do the right thing behind the scenes or do you need to actively move them? Asking because I think the least impact method would be to remove a gitea0X or two from haproxy and shut down its services then reboot it
[17:19] <clarkb> but both gerrit replication and haproxy should detect if it happens unexpectedly as well
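The rolling procedure described here (pull a backend or two out of the haproxy pool, reboot, health-check, re-add, repeat) can be sketched as driver logic. This is a hypothetical illustration, not opendev's tooling; the pool is an in-memory set standing in for haproxy state, and `reboot`/`healthy` are callables supplied by the operator:

```python
def rolling_reboot(backends, reboot, healthy, batch_size=2):
    """Reboot backends in batches, never removing more than
    batch_size servers from the pool at once. Returns the final pool."""
    pool = set(backends)
    for i in range(0, len(backends), batch_size):
        batch = backends[i:i + batch_size]
        for b in batch:
            pool.discard(b)   # "disable server" in haproxy terms
        for b in batch:
            reboot(b)         # operator-supplied action
        for b in batch:
            if healthy(b):
                pool.add(b)   # re-enable only once it serves again
    return pool
```

The conversation below follows exactly this shape: gitea08 alone as a canary, then pairs (06/07, 04/05, 03/02), then gitea01.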
[17:20] <mnaser> clarkb: unfortunately an actual migration needs to be done which is an admin action -- but I am happy and available to coordinate.
[17:20] <clarkb> ok good to know. I can probably do the haproxy and service stops after the current meeting and work through those. The deployment stuff should be done by then too
[17:21] <mnaser> Ok sounds good. Just shoot me a ping and I can kick it off
[17:23] <clarkb> mnaser: also sjc1 max-servers are currently set to 0 I think you can safely do the mirror there now
[17:23] <clarkb> we are using ca-ymq-1 for jobs currently according to the config
[17:24] <mnaser> Yeah that's what I thought too
[17:24] <mnaser> Okay, I'll kick that one off now then
[17:28] <mnaser> mirror01 should be done
[17:29] <mnaser> urgh, configdrive migration bug, nevermind, let me dig
[17:30] <mnaser> nope i lied, it worked just fine :)
[17:30] <clarkb> https://mirror.sjc1.vexxhost.opendev.org/ is serving content and the uptime says we did reboot and /proc/cpuinfo shows amd
[17:30] <clarkb> ya from what I can see it seems happy
[17:30] <clarkb> I don't see a config drive on the instance, but not sure if it was booted with one in the first place
[17:31] <mnaser> doesn't look like it, so yeah, that was my bad
[17:32] <mnaser> clarkb: i will be unavailable in ~1h30m for ~1h30m-ish.  so if you want to remove some backends from haproxy that i can asynchronously move if/when you're afk, just a heads up =)
[17:32] <clarkb> mnaser: noted. Why don't we start with gitea08 as a first run and I'll go stop services and let you know when it is ready
[17:33] <clarkb> oh except the upgrade playbook finally just started. we'll wait for that to finish first
[17:33] <mnaser> no worries
[17:55] <clarkb> mnaser: the upgrade is done and the gitea cluster seems to still be happy from my spot checking. I've removed gitea08 from the haproxy pool if you want to go ahead and reboot that one as a first check? We can do two at a time afterwards if that one is happy
[17:56] <opendevreview> Artom Lifshitz proposed opendev/git-review master: WIP: Allow custom SSH args  https://review.opendev.org/c/opendev/git-review/+/807787
[18:04] <clarkb> mnaser: I'm going to take a short break. But you should be good to reboot gitea08 whenever you are ready and I can help verify it is happy when done. Then we can go through the rest in batches
[18:32] <mnaser> gitea08 starting now :)
[18:33] <mnaser> and done
[18:43] <clarkb> mnaser: that one actually does seem to have a config drive and I still see it
[18:43] <clarkb> also its services are serving
[18:44] <clarkb> mnaser: 06 and 07 have been pulled out of the haproxy rotation if you want to do them next
[18:44] <clarkb> from what I see 08 is happy
[18:46] <mnaser> clarkb: ok I won't be able to do that until an hour or so but will do when I'm around
[18:47] <clarkb> mnaser: sounds good. Just let me know how you're progressing and I can continue to rotate them out in haproxy. I've got the infra meeting starting in 13 minutes so an hour break wfm
[19:56] <fungi> i need to start prepping dinner shortly, but should hopefully be able to take a look at the prometheus spec after
[19:58] <clarkb> mnaser: I'm grabbing lunch right now but feel free to do gitea06 and gitea07 when you are ready and I'll put them back into the rotation after and pull out the next pair
[20:33] <mnaser> clarkb: alright, i'm around again, i will kick off 06 and 07
[20:33] <clarkb> mnaser: sounds good I'm around to work through these too
[20:35] <mnaser> 06 started
[20:37] <mnaser> clarkb: both are done :)
[20:38] <clarkb> yup they lgtm. I've put them back in the rotation and pulled gitea04 and gitea05 out if you want to do those now
[20:39] <mnaser> cool, starting 4 and 5 now
[20:40] <mnaser> clarkb: should be done
[20:42] <clarkb> mnaser: yup all looks good. gitea03 and gitea02 are ready for you now
[20:43] <mnaser> ok, starting
[20:44] <mnaser> clarkb: both done
[20:46] <clarkb> yup continues to look good to me. gitea01 is ready when you are
[20:46] <mnaser> cool, starting
[20:47] <mnaser> clarkb: completed
[20:48] <clarkb> great that all looks happy on my end.
[20:49] <clarkb> mnaser: for the load balancer we decided in the meeting today that just going for it and ripping the bandaid off is likely the easiest thing
[20:49] <clarkb> mnaser: I'm happy for you to do that now if you want and I can check it after
[20:49] <clarkb> Also for review02.opendev.org does that still need a reboot?
[20:50] <mnaser> clarkb: i can do lb now if you want, that would really be appreciated.  review02 will need a reboot (even though it's in mtl, moving to the new dc).
[20:50] <mnaser> we don't have to do that right now though (wrt review02) because i figure that might be a bit trickier i guess
[20:51] <clarkb> ya review02 probably needs a bit more coordination.
[20:51] <clarkb> Ya I think we should probably just go ahead and do the load balancer now
[20:51] <clarkb> lets rip the bandaid off
[20:51] <mnaser> alright i'll push that through
[20:52] <mnaser> clarkb: and it's back
[20:52] <mnaser> the underlying hardware is _way_ faster so we should see measurably better performance
[20:53] <clarkb> https://opendev.org loads for me and the server looks happy
[20:54] <mnaser> awesome, thanks for the flexibility clarkb :)
[20:54] <clarkb> for review02 the absolute safest thing would be to wait for after the openstack release happens, but we can probably get away with a reboot during a quiet period like late friday through ianw's monday?
[20:54] <clarkb> mnaser: does the ceph volume that we host the gerrit site on review02 impact the DC move at all?
[20:55] <clarkb> eg do you have to move the volume at the same time and if so is that expected to make review02 movement particularly slow?
[20:55] <mnaser> clarkb: it will be moved but it will not slow down, we've got some magic movement tools to avoid any downtime/slow down
[20:55] <mnaser> we use snapshots to minimize the amount of data moved during the reboot
[20:56] <mnaser> so we move the majority of the data beforehand, then one more small move, shutdown, move whatever was written to disk, then start up again
[20:56] <mnaser> so it's mostly migrated online except for the flip over
[20:56] <mnaser> so we can 'prep' vms to be moved ahead of time so the outage is really small
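The snapshot-based move mnaser outlines is the classic pre-copy pattern: copy the bulk of the disk while the VM runs, iterate over ever-smaller change deltas, and only stop the VM for the final small delta. A toy model of that loop (all names and thresholds are illustrative, not vexxhost's actual tooling):

```python
def precopy_migrate(total_bytes, next_delta, small_enough=1024):
    """Model a pre-copy migration: copy the whole disk online, then
    re-copy successive deltas until one is small enough to move during
    a brief shutdown. Returns (bytes copied online, outage bytes)."""
    copied_online = total_bytes      # initial bulk copy, VM still running
    delta = next_delta()             # bytes written since last copy
    while delta > small_enough:
        copied_online += delta       # re-copy the changes, still online
        delta = next_delta()
    return copied_online, delta      # final delta moves during the outage
```

The point of the loop is that the outage cost is bounded by the last delta, not by the disk size, which is why the gitea reboots above each took under a minute.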
[20:59] <clarkb> mnaser: if there is a quieter time that also works for you I guess let us know and we can coordinate review's move then
[21:00] <clarkb> The biggest thing right now is lots of changes are merging and zuul will have a sad if gerrit isn't up when it tries to merge things so we want to try and pick a time when zuul is unlikely to be doing that
[21:00] <clarkb> Another option is to do it with a coordinated zuul restart
[21:01] <clarkb> now I should migrate outside and do some code review in the backyard
[21:07] <corvus> if you want to expedite, shut down all the zuul executors during the move.  no perceived outage from zuul's pov.
[22:06] <clarkb> fungi: do we care about the dell openstack ironic ci user bouncing gerrit emails because they are only allowed to receive email from people in their organization? (also wow I guess that is one way to combat phishing and spam)
[22:06] <clarkb> maybe we should ask them to disable gerrit email as much as possible?
[22:10] <opendevreview> Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running  https://review.opendev.org/c/opendev/system-config/+/807672
[22:42] <fungi> yeah we shouldn't have users with invalid/undeliverable e-mail addresses, but i have a feeling there are many
[22:43] <fungi> if we do decide that's a problem, we should start analyzing the mta logs and disabling accounts or something
[22:44] <clarkb> in this case I suspect the account is active they just don't realize their email policy is not useful
[22:44] <clarkb> and ya I'm sure there are others but this account seems active enough to generate the bounces
[22:51] <fungi> i think new gerrit will prevent that by requiring addresses to be verifiable? though maybe not those which come in through openid autocreation (but then hopefully the idp has similar requirements). of course, none of that protects against working addresses ceasing to work
[22:53] <clarkb> in this case I suspect it was working then ceased, but ya exactly
[23:40] <ianw> clarkb: it is striking me that with parallel jobs, https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml will keep overwriting /home/zuul/src/opendev.org/opendev/system-config at random times
[23:41] <clarkb> hrm maybe we do need to centralize that in a parent job before we parallelize?
[23:41] <clarkb> or use some sort of lock around that (though that might get clunky fast)
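One shape the lock option could take is an advisory flock held around the shared checkout update, so a second job blocks instead of clobbering the tree mid-run. A hedged sketch, assuming Unix; `checkout_lock` and the lock path are made up, not anything the infra-prod jobs do today:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def checkout_lock(lock_path):
    """Hold an exclusive advisory lock for the duration of the update;
    flock blocks here if another process already holds the lock."""
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Usage would be something like `with checkout_lock("/home/zuul/.system-config.lock"): update the repo`, though as noted this serializes exactly the step the parallelization is trying to speed up, which is part of why it could get clunky.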
[23:44] <ianw> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/mirror-workspace-git-repos/tasks/main.yaml#L39 might be a trouble point, it does a clean
[23:44] <ianw> the consequence might be cleaning a .pyc file that is in use.  i guess that ansible probably survives
[23:45] <clarkb> another approach would be different workspaces per job but that might get complicated with how we set up ansible
[23:45] <clarkb> and also bootstrapping an env to run it ourselves
[23:46] <fungi> python 3.10.0rc2 is out!
[23:47] <ianw> i mean it would not be terrible to put cloud credentials, etc. in zuul secrets and have each job run from its own self-contained environment
[23:48] <clarkb> the struggle becomes using the host as a bastion at that point. It is certainly doable but the more we put in zuul the less we're able to interact directly (which isn't entirely a bad thing but if you aren't ready to do everything with zuul ...)
[23:48] <clarkb> this particular problem might be worth having a proper brainstorm over and maybe if we can rope mordred in that would be good too as he designed a bit of that original stuff
[23:52] <ianw> extracting it into a job that all others depend on is probably the most logical step
[23:52] <ianw> that would be a hard dependency, and it should always run
[23:53] <clarkb> certainly it would probably be the simplest to implement and easiest to understand (at least for me)
[23:54] <fungi> but would also need a mutex so the periodic pipeline build of that job doesn't fire independently of deploy pipeline builds?
[23:54] <ianw> if we keep the semaphore as is, i think it could follow-on to the existing change as well, where it would fit more logically
[23:54] <clarkb> fungi: that is already an assumption of the system
[23:54] <clarkb> parallelization would only occur within a buildset
[23:55] <fungi> so the periodic pipeline buildset couldn't run while the deploy buildset was in progress
[23:55] <clarkb> correct, and that is the situation today
[23:56] <fungi> just making sure it wouldn't suddenly become possible with parallelization
[23:57] <ianw> yep, that bit should remain the same, except for periodic this theoretical setup job will pull from master instead of the zuul change
[23:57] <clarkb> if we want to try a meetpad call tomorrow to talk through some of this more I'm happy to do that
[23:58] <fungi> i'm available whenevs
[23:58] <clarkb> It would probably have to be at or after 3pm for me to juggle school pickup and ianw's schedule.
[23:58] <clarkb> that would be 2200UTC or later

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!