Tuesday, 2021-09-07

opendevreview	Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659	00:50
opendevreview	Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659	01:27
ianw	Manifest has invalid size for layer sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5 (size:203129434 actual:185016320)	02:06
ianw	https://zuul.opendev.org/t/openstack/build/22089e51b052400db2970e78b08f60be	02:07
ianw	system-config-build-image-gerrit-3.3	02:07
ianw	ls -l ./_local/blobs/sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5	02:17
ianw	-rw-r--r-- 1 root root 203129433 Sep 7 01:53 data	02:17
ianw	it's out by a byte, but did get to the final size. i wonder if this is a race or a sync issue	02:18
ianw	there is a "Failed to obtain lock(1) on digest sha256:9815a275e5d0f93566302aeb58a49bf71b121debe1a291cf0f64278fe97ec9b5"	02:21
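A rough sketch of the kind of check behind the error above: compare a blob on disk with the size and digest a manifest claims for it. The helper name `check_blob` and the throwaway demo file are illustrative assumptions; this is not zuul-registry code.

```python
import hashlib
import os
import tempfile


def check_blob(path, expected_digest, expected_size):
    """Return a list of mismatch descriptions for a blob on disk (empty if OK)."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"size:{expected_size} actual:{actual_size}")
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large layers are not read into memory at once.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(chunk)
    actual_digest = "sha256:" + sha.hexdigest()
    if actual_digest != expected_digest:
        problems.append(f"digest:{expected_digest} actual:{actual_digest}")
    return problems


# Demo on a throwaway 10-byte file (hypothetical data, not a real layer).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"layer data")
    demo_path = tmp.name
demo_digest = "sha256:" + hashlib.sha256(b"layer data").hexdigest()
ok = check_blob(demo_path, demo_digest, 10)
off_by_one = check_blob(demo_path, demo_digest, 11)
os.unlink(demo_path)
```

A one-byte discrepancy like the one seen here would show up as a single size entry in the returned list, while the digest could still match.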
ianw	https://review.opendev.org/c/zuul/zuul-registry/+/807663 is my suggestion	02:53
opendevreview	Ian Wienand proposed opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659	02:55
*** ysandeep|out is now known as ysandeep		05:07
opendevreview	Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672	06:14
ianw	clarkb: ^ we can discuss in meeting, i started to try and think about what it takes to run these things in parallel.	06:26
*** jpena|off is now known as jpena		07:39
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	09:16
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	10:54
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	10:56
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	11:05
opendevreview	Sorin Sbârnea proposed zuul/zuul-jobs master: Make default tox run more strict about interpreter version https://review.opendev.org/c/zuul/zuul-jobs/+/807702	11:05
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	11:07
*** jpena is now known as jpena|lunch		11:26
*** jpena|lunch is now known as jpena		12:24
*** ysandeep is now known as ysandeep|out		13:07
clarkb	kevinz: our ssl cert checker is warning us now that the linaro cert will expire in 27 days	14:07
clarkb	mordred: corvus: there is a thread on gerrit's ml asking for links to presentations given at the 2019 sunnyvale/gothenburg gerrit user summits. I think you gave at least one presentation at both? I could be wrong but wanted to pass that along in case you have that infor them	14:09
clarkb	"info" and "for" became "infor" in that last sentence	14:09
opendevreview	Douglas Mendizábal proposed opendev/irc-meetings master: Move Keystone meeting to 1500 UTC https://review.opendev.org/c/opendev/irc-meetings/+/807729	14:18
clarkb	infra-root I think the stack at https://review.opendev.org/c/opendev/system-config/+/805932/3 is ready for merging. There is one small zuul-registry fix that we might want to land first: https://review.opendev.org/c/zuul/zuul-registry/+/807663 to ensure those images build and update cleanly though	14:29
clarkb	that is the assets work.	14:30
corvus	clarkb: reg change apvd	14:31
clarkb	thanks!	14:31
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	14:42
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	14:57
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	14:59
clarkb	the zuul-registry update should be deployed now.	15:30
opendevreview	Merged opendev/system-config master: Add assets and a related docker image/bundle https://review.opendev.org/c/opendev/system-config/+/805932	15:47
clarkb	exciting ^	15:48
fungi	i just finished reviewing and approving that whole series	15:48
opendevreview	Merged opendev/elastic-recheck rdo: Updates for docker and frontpage of elastic-recheck https://review.opendev.org/c/opendev/elastic-recheck/+/806739	15:54
*** marios is now known as marios|out		16:01
mnaser	infra-root: hello, i'd like to do a reboot of {gitea0[1-8],gitea-lb01,mirror01.sjc1.vexxhost}.opendev.org -- purpose is that it will be moved on newer amd hardware (live migration is not available because intel=>amd is no good) -- can i go ahead with that?	16:12
clarkb	mnaser: doing gitea01-08 in a rolling fashion should be fine. gitea-lb01 will be noticed as that is the front end and a current spof. The mirror will also be noticed by any jobs running on that cloud.	16:13
clarkb	mnaser: also I think the changes that fungi approved above will result in a gitea "upgrade" to our new images with assets copied in differently. Might be best to wait for that to finish first?	16:13
clarkb	though those are at least 40 minutes away says zuul.	16:14
mnaser	clarkb: can you give me a ping when i'm good to go in that case? :)	16:14
mnaser	that's totally fine	16:14
clarkb	mnaser: I can do that	16:14
mnaser	thank you	16:14
mnaser	clarkb: i see a bunch of vms inside `openstackci`, called `opendev-k8s-{master,1,2,3,4}`. i think mordred was playing with k8s.. a really long time ago?	16:24
mnaser	i dont know if these are in the ansible inventory, if i should move them, if we should nuke them..	16:25
clarkb	mnaser: I think nodepool may still "know" about them but they aren't really being used there (basically no one took that experiment further). If we can get mordred to confirm we can probably remove any remaining system-config/project-config for them and clean them up	16:25
clarkb	https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 things related to that	16:26
mnaser	clarkb: ok, but i guess they are safe to migrate for now until that is all cleaned up?	16:26
clarkb	it may also be running the old test gitea around running it in k8s? In any case I suspect that will all need to be redone with proper k8s deployment tooling if we go that route again	16:26
clarkb	mnaser: yes, I think it should be safe to migrate them	16:26
mnaser	great thank you	16:27
corvus	i just got an internal error pushing a change; this is the gerrit error log entry: Caused by: java.lang.IllegalArgumentException: cannot chain ref update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 after update CREATE: 0000000000000000000000000000000000000000 6fcde31c9e22230e8a04d12861b0137420d13796 refs/changes/21/807221/8 with result REJECTEDOTHERREASON	16:40
corvus	re-trying worked. i'm mentioning it for monitoring purposes.	16:41
fungi	thanks, i wonder what could lead to that	16:46
*** jpena is now known as jpena|off		16:46
clarkb	some other reason apparently :)	16:49
opendevreview	Merged opendev/system-config master: gitea: use assets bundle https://review.opendev.org/c/opendev/system-config/+/805933	16:57
opendevreview	Merged opendev/system-config master: gitea: add some screenshots to testing https://review.opendev.org/c/opendev/system-config/+/807489	16:57
mordred	clarkb, mnaser : yes, those should be safe to migrate. honestly, we should probably just delete them, there's no way we're going to use them in the current state	16:58
opendevreview	Merged opendev/system-config master: testinfra: refactor screenshot taking https://review.opendev.org/c/opendev/system-config/+/807659	17:05
clarkb	mnaser: is this the sort of thing that we can trigger the reboots for and have it do the right thing behind the scenes or do you need to actively move them? Asking because I think the least impact method would be to remove a gitea0X or two from haproxy and shutdown its services then reboot it	17:19
clarkb	but both gerrit replication and haproxy should detect if it happens unexpectedly as well	17:19
mnaser	clarkb: unfortunately an actual migration needs to be done which is an admin action — but I am happy and available to coordinate.	17:20
clarkb	ok good to know. I can probably do the haproxy and service stops after the current meeting and work through those. The deployment stuff should be done by then too	17:20
mnaser	Ok sounds good. Just shoot me a ping and I can kick it off	17:21
clarkb	mnaser: also sjc1 max-servers are currently set to 0 I think you can safely do the mirror there now	17:23
clarkb	we are using ca-ymq-1 for jobs currently according to the config	17:23
mnaser	Yeah that’s what I thought too	17:24
mnaser	Okay, I’ll kick that one off now then	17:24
mnaser	mirror01 should be done	17:28
mnaser	urgh, configdrive migration bug, nevermind, let me dig	17:29
mnaser	nope i lied, it worked just fine :)	17:30
clarkb	https://mirror.sjc1.vexxhost.opendev.org/ is serving content and the uptime says we did reboot and /proc/cpuinfo shows amd	17:30
clarkb	ya from what I can see it seems happy	17:30
clarkb	I don't see a config drive on the instance, but not sure if it was booted with one in the first place	17:30
mnaser	doesnt look like it, so yeah, that was my bad	17:31
mnaser	clarkb: i will be unavailable in ~1h30m for ~1h30m-ish. so if you want to remove some backends from haproxy that i can asynchronously move if/when you're afk, just a heads up =)	17:32
clarkb	mnaser: noted. Why don't we start with gitea08 as a first run and I'll go stop services and let you know when it is ready	17:32
clarkb	oh except the upgrade playbook finally just started. we'll wait for that to finish first	17:33
mnaser	no worries	17:33
clarkb	mnaser: the upgrade is done and the gitea cluster seems to still be happy from my spot checking. I've removed gitea08 from the haproxy pool if you want to go ahead and reboot that one as a first check? We can do two at a time afterwards if that one is happy	17:55
opendevreview	Artom Lifshitz proposed opendev/git-review master: WIP: Allow custom SSH args https://review.opendev.org/c/opendev/git-review/+/807787	17:56
clarkb	mnaser: I'm going to take a short break. But you should be good to reboot gitea08 whenever you are ready and I can help verify it is happy when done. Then we can go through the rest in batches	18:04
mnaser	gitea08 starting now :)	18:32
mnaser	and done	18:33
clarkb	mnaser: that one actually does seem to have a config drive and I still see it	18:43
clarkb	also its services are serving	18:43
clarkb	mnaser: 06 and 07 have been pulled out of the haproxy rotation if you want to do them next	18:44
clarkb	from what I see 08 is happy	18:44
mnaser	clarkb: ok I won’t be able to do that until an hour or so but will do when I’m around	18:46
clarkb	mnaser: sounds good. Just let me know how you're progressing and I can continue to rotate them out in haproxy. I've got the infra meeting starting in 13 minutes so an hour break wfm	18:47
fungi	i need to start prepping dinner shortly, but should hopefully be able to take a look at the prometheus spec after	19:56
clarkb	mnaser: I'm grabbing lunch right now but feel free to do gitea06 and gitea07 when you are ready and I'll put them back into the rotation after and pull out the next pair	19:58
mnaser	clarkb: alright, im around again, i will kick off 06 and 07	20:33
clarkb	mnaser: sounds good I'm around to work through these too	20:33
mnaser	06 started	20:35
mnaser	clarkb: both are done :)	20:37
clarkb	checking	20:37
clarkb	yup they lgtm. I've put them back in the rotation and pulled gitea04 and gitea05 out if you want to do those now	20:38
mnaser	cool, starting 4 and 5 now	20:39
mnaser	clarkb: should be done	20:40
clarkb	mnaser: yup all looks good. gitea03 and gitea02 are ready for you now	20:42
mnaser	ok, starting	20:43
mnaser	clarkb: both done	20:44
clarkb	yup continues to look good to me. gitea01 is ready when you are	20:46
mnaser	cool, starting	20:46
mnaser	clarkb: completed	20:47
clarkb	great that all looks happy on my end.	20:48
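The rolling order used above (one backend as a first check, then pairs) can be sketched as a small batching helper. This is only an illustration under stated assumptions: `rolling_batches` is a hypothetical name, and the real haproxy drain, migration, and verification steps were done by hand.

```python
def rolling_batches(hosts, canary=1, batch_size=2):
    """Split hosts into a small canary batch followed by fixed-size batches."""
    if canary < 0 or batch_size < 1:
        raise ValueError("canary must be >= 0 and batch_size must be >= 1")
    batches = []
    if canary and hosts:
        # The first batch is deliberately small so a bad outcome affects few backends.
        batches.append(hosts[:canary])
    rest = hosts[canary:]
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return batches


# Backends listed from gitea08 down to gitea01, matching the order in the log.
backends = [f"gitea{n:02d}" for n in range(8, 0, -1)]
plan = rolling_batches(backends)
for batch in plan:
    # In practice each batch is pulled from haproxy, migrated, checked, then restored.
    print("drain, migrate, verify, restore:", ", ".join(batch))
```

Keeping the first batch to a single host limits the blast radius if the migration misbehaves, which is the reason gitea08 went alone.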
clarkb	mnaser: for the load balancer we decided in the meeting today that just going for it and ripping the bandaid off is likely the easiest thing	20:49
clarkb	mnaser: I'm happy for you to do that now if you want and I can check it after	20:49
clarkb	Also for review02.opendev.org does that still need a reboot?	20:49
mnaser	clarkb: i can do lb now if you want, that would really be appreciated. review02 will need a reboot (even though its in mtl, moving to the new dc).	20:50
mnaser	we don't have to do that right now though (wrt review02) because i figure that might be a bit trickier i guess	20:50
clarkb	ya review02 probably needs a bit more coordination.	20:51
clarkb	Ya I think we should probably just go ahead and do the load balancer now	20:51
clarkb	lets rip the bandaid off	20:51
mnaser	alright i'll push that through	20:51
mnaser	clarkb: and its back	20:52
mnaser	the underlying hardware is _way_ faster so we should see measurably better performance	20:52
clarkb	https://opendev.org loads for me and the server looks happy	20:53
mnaser	awesome, thanks for the flexibility clarkb :)	20:54
clarkb	for review02 the absolute safest thing would be to wait for after the openstack release happens, but we can probably get away with a reboot during a quiet period like late friday through ianw's monday?	20:54
clarkb	mnaser: does the ceph volume that we host the gerrit site on review02 impact the DC move at all?	20:54
clarkb	eg do you have to move the volume at the same time and if so is that expected to make review02 movement particularly slow?	20:55
mnaser	clarkb: it will be moved but it will not slow down, we've got some magic movement tools to avoid any downtime/slowdown	20:55
mnaser	we use snapshots to minimize the amount of data moved during the reboot	20:55
mnaser	so we move the majority of the data before hand, then one more small move, shutdown, move whatever was written to disk, then start up again	20:56
mnaser	so its mostly migrated online except for the flip over	20:56
clarkb	nice	20:56
mnaser	so we can 'prep' vms to be moved ahead of time so the outage is really small	20:56
clarkb	mnaser: if there is a quieter time that also works for you I guess let us know and we can coordinate review's move then	20:59
clarkb	The biggest thing right now is lots of changes are merging and zuul will have a sad if gerrit isn't up when it tries to merge things so we want to try and pick a time when zuul is unlikely to be doing that	21:00
clarkb	Another option is to do it with a coordinated zuul restart	21:00
clarkb	now I should migrate outside and do some code review in the backyard	21:01
corvus	if you want to expedite, shut down all the zuul executors during the move. no perceived outage from zuul's pov.	21:07
clarkb	fungi: do we care about the dell openstack ironic ci user bouncing gerrit emails because they are only allowed to receive email from people in their organization? (also wow I guess that is one way to combat phishing and spam)	22:06
clarkb	maybe we should ask them to disable gerrit email as much as possible?	22:06
opendevreview	Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672	22:10
fungi	yeah we shouldn't have users with invalid/undeliverable e-mail addresses, but i have a feeling there are many	22:42
fungi	if we do decide that's a problem, we should start analyzing the mta logs and disabling accounts or something	22:43
clarkb	in this case I suspect the account is active they just don't realize their email policy is not useful	22:44
clarkb	and ya I'm sure there are others but this account seems active enough to generate the bounces	22:44
fungi	i think new gerrit will prevent that by requiring addresses to be verifiable? though maybe not those which come in through openid autocreation (but then hopefully the idp has similar requirements). of course, none of that protects against working addresses ceasing to work	22:51
clarkb	in this case I suspect it was working then ceased, but ya exactly	22:53
ianw	clarkb: it is striking me that with parallel jobs, https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/infra-prod/pre.yaml will keep overwriting /home/zuul/src/opendev.org/opendev/system-config at random times	23:40
clarkb	hrm maybe we do need to centralize that in a parent job before we parallelize?	23:41
clarkb	or use some sort of lock around that (though that might get clunky fast)	23:41
ianw	https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/mirror-workspace-git-repos/tasks/main.yaml#L39 might be a trouble point, it does a clean	23:44
ianw	the consequence might be cleaning a .pyc file that is in use. i guess that ansible probably survives	23:44
clarkb	another approach would be different workspaces per job but that might get complicated with how we set up ansible	23:45
clarkb	and also bootstrapping an env to run it ourselves	23:45
fungi	python 3.10.0rc2 is out!	23:46
ianw	i mean it would not be terrible to put cloud credentials, etc. in zuul secrets and have each job run from its own self-contained environment	23:47
clarkb	the struggle becomes using the host as a bastion at that point. It is certainly doable but the more we put in zuul the less we're able to interact directly (which isn't entirely a bad thing but if you aren't ready to do everything with zuul ...)	23:48
clarkb	this particular problem might be worth having a proper brainstorm over and maybe if we can rope mordred in that would be good too as he designed a bit of that original stuff	23:48
ianw	yep	23:51
ianw	extracting it into a job that all others depend on is probably the most logical step	23:52
ianw	that would be a hard dependency, and it should always run	23:52
clarkb	yup	23:53
clarkb	certainly it would probably be the simplest to implement and easiest to understand (at least for me)	23:53
fungi	but would also need a mutex so the periodic pipeline build of that job doesn't fire independently of deploy pipeline builds?	23:54
ianw	if we keep the semaphore as is, i think it could follow-on to the existing change as well, where it would fit more logically	23:54
clarkb	fungi: that is already an assumption of the system	23:54
fungi	k	23:54
clarkb	parallelization would only occur within a buildset	23:54
fungi	so the periodic pipeline buildset couldn't run while the deploy buildset was in progress	23:55
clarkb	correct, and that is the situation today	23:55
fungi	just making sure it wouldn't suddenly become possible with parallelization	23:56
ianw	yep, that bit should remain the same, except for periodic this theoretical setup job will pull from master instead of the zuul change	23:57
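The "one setup job that everything else depends on" idea can be illustrated with a plain Python analogy. This is not Zuul configuration; `run_buildset`, `refresh_workspace`, and the job names are hypothetical, and the dict stands in for the shared system-config checkout.

```python
from concurrent.futures import ThreadPoolExecutor


def run_buildset(setup, jobs, max_workers=4):
    """Run setup once; only if it succeeds, run the dependent jobs in parallel."""
    if not setup():
        # Hard dependency: a failed setup means no dependent job runs.
        return {"setup": False, "results": {}}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in jobs.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    return {"setup": True, "results": results}


# Stand-in for the shared checkout on the bastion host.
workspace = {}


def refresh_workspace():
    # Written exactly once per buildset, before any parallel job starts.
    workspace["system-config"] = "checked out once"
    return True


def make_job(name):
    def job():
        # Dependent jobs only read the workspace; they never rewrite it.
        return f"{name} used: {workspace['system-config']}"
    return job


outcome = run_buildset(refresh_workspace, {n: make_job(n) for n in ("job-a", "job-b")})
skipped = run_buildset(lambda: False, {"job-a": make_job("job-a")})
```

Because the workspace is refreshed before any parallel job starts, no job can have the checkout rewritten underneath it, which is the race raised at 23:40.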
clarkb	if we want to try a meetpad call tomorrow to talk through some of this more I'm happy to do that	23:57
fungi	i'm available whenevs	23:58
clarkb	It would probably have to be at or after 3pm for me to juggle school pickup and ianw's schedule.	23:58
clarkb	that would be 2200UTC or later	23:58
