clarkb | fungi: my utility venv was still python2 so I've had to switch that over to 3 now :) | 00:07 |
---|---|---|
*** tosky has quit IRC | 00:11 | |
clarkb | infra-prod service zuul is running now. So should be able to remove ze01.openstack.org whenever we like. At this rate it will likely be tomorrow morning for me to get to that | 00:16 |
clarkb | cool if we say that accounts without usernames or sshkeys and no reviews in the last year are fair game to retire. We should be able to cleanup another 180ish | 00:29 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies https://review.opendev.org/c/opendev/system-config/+/777846 | 00:31 |
clarkb | that contains my latest updates to check usernames and ssh keys | 00:31 |
ianw | afs01.ord is upgraded. since nothing has made like a spacex rocket and seemed to work but then exploded, i'll assume it's ok :) | 00:31 |
clarkb | I didn't catch it live but watched the replay which ended before it had its big kaboom | 00:32 |
clarkb | then later found out it also had a sad | 00:32 |
clarkb | I've uploaded the results of the sshkeys and username checking to review if other infra-root want to look that over | 00:35 |
ianw | clarkb: in your homedir? | 00:36 |
clarkb | yes under gerrit_user_cleanups | 00:37 |
clarkb | it's the file with today's (yesterday for you and utc) timestamp suffix | 00:38 |
clarkb | I think the biggest risk there is if there are two accounts and one is being used for ssh (code pushes) and the other is used for http (code reviews) we can end up preventing people from logging in via http and doing reviews. However the recency check on code reviews should guard against that for any current users | 00:43 |
ianw | how did they not choose a username? didn't finish the signup? | 00:43 |
clarkb | ianw: ya, I suspect for many gerrit gave them a new account and either they sorted out how to keep using the old account and just deal with it, we fixed things enough for them to use old account, or they used one for ssh and one for http | 00:44 |
clarkb | I think if we couple this with the recency check we can avoid the vast majority of issues, then hopefully if anyone shows up and needs account fixing we can do that on a case by case basis once we've addressed the consistency issues | 00:45 |
clarkb | I do think we're fast approaching where we're going to have to accept we may get things wrong though :/ | 00:45 |
clarkb | that is why I think setting accounts inactive for a bit first then following up with the external id cleanups is probably a good idea. We can also send people email and tell them feel free to ignore this unless you expect to use this account again and if so tell us | 00:46 |
clarkb | then put anyone who responds aside | 00:46 |
ianw | yeah i don't think a pre-emptive email is a bad thing | 00:46 |
clarkb | ya maybe we refine this ssh list down a bit (there are a few which appear odder than most like the 6 account case). Then fungi can help me sort out how to send a bunch of emails | 00:47 |
*** stevebaker has quit IRC | 00:49 | |
clarkb | I suspect that the vast majority will not care as many of these do appear to be long ago contributors | 00:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested https://review.opendev.org/c/opendev/system-config/+/778593 | 00:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies https://review.opendev.org/c/opendev/system-config/+/778404 | 00:51 |
clarkb | but asking nicely first is a good way to ensure we don't inadvertently ruin someone's day | 00:51 |
*** stevebaker has joined #opendev | 00:51 | |
openstackgerrit | Merged openstack/project-config master: Add custom cirros image with ahci module enabled to cache https://review.opendev.org/c/openstack/project-config/+/778590 | 01:00 |
fungi | ianw: a while back gerrit stopped requesting (or perhaps lp stopped handing over) a username, so users have to enter one manually now | 01:01 |
fungi | if they never use the ssh or rest apis to do anything, like maybe they're only reviewing changes with the account, then they may be actively using it even though they have no username | 01:02 |
* fungi sighs... privmsg spam has started back up again | 01:02 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested https://review.opendev.org/c/opendev/system-config/+/778593 | 01:08 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies https://review.opendev.org/c/opendev/system-config/+/778404 | 01:08 |
*** mlavalle has quit IRC | 01:14 | |
openstackgerrit | Kendall Nelson proposed openstack/project-config master: Add New Repo for StoryBoard-vue https://review.opendev.org/c/openstack/project-config/+/777244 | 01:20 |
*** brinzhang_ has joined #opendev | 01:32 | |
*** brinzhang has quit IRC | 01:35 | |
*** hemanth_n has joined #opendev | 02:51 | |
*** hemanth_n has quit IRC | 02:53 | |
*** hemanth_n has joined #opendev | 02:54 | |
ianw | well this is ... crap ... it seems that afs01.dfw is having issues creating its vicepa ... seems like some of the partitions aren't responding | 03:16 |
ianw | "failed to start LVM2 PV scan for device 202:145" | 03:17 |
fungi | like cinder volumes? | 03:17 |
fungi | oh yeah | 03:17 |
ianw | it would seem to me that one of the PV's isn't responding, yeah they are all attached volumes | 03:17 |
ianw | there are three other "PV scans" that worked | 03:18 |
fungi | it's also refusing ssh for m | 03:18 |
fungi | nw | 03:18 |
fungi | me | 03:18 |
ianw | this has 5 volumes | 03:18 |
ianw | yeah, it's stuck in early boot, i can only see via emergency console | 03:18 |
fungi | ahh, got it | 03:18 |
ianw | if i detach these volumes, it seems like they will reattach as different devices | 03:19 |
fungi | well, that's fine | 03:19 |
ianw | i *think* lvm puts a uuid on them, right | 03:19 |
ianw | so it won't care if they move around | 03:19 |
fungi | lvm2 writes headers to all of them, yep | 03:19 |
fungi | that's what pvcreate does, in fact | 03:19 |
ianw | all the volumes are green, and listed as attached | 03:19 |
ianw | i'm not sure what else to do :/ | 03:20 |
fungi | it's possible they live migrated the server instance and the device xmls for libvirt got all screwy | 03:20 |
fungi | you say you've tried detaching and reattaching with the server instance powered off? | 03:21 |
ianw | interestingly "shutoff" seems to have not made the cut for the webui | 03:24 |
fungi | i think openstack server poweroff will do it though | 03:25 |
ianw | yep, i've stopped it via that now | 03:26 |
*** lpetrut has joined #opendev | 03:27 | |
ianw | i've detached all 5 volumes, and will try reattaching | 03:28 |
ianw | this is not looking good | 03:31 |
fungi | can't attach? | 03:34 |
ianw | this host really doesn't want to boot | 03:35 |
ianw | "A start job is running for LVM2 PV scan on ..." and it keeps looping through things | 03:35 |
fungi | might be better to try booting it with the cinder volumes detached, then hot attach them after | 03:36 |
ianw | hrm, ok | 03:36 |
ianw | i've unattached everything and am trying booting it | 03:39 |
ianw | it still has a start job for dev-main-vicepa.device | 03:41 |
ianw | "reached target network (pre)" and sitting there | 03:42 |
ianw | networking issues could explain both unattached volumes and this i guess... | 03:42 |
ianw | hrm, it's gone to an emergency shell prompt | 03:43 |
ianw | trying to boot it into default target ... | 03:44 |
ianw | i wonder if this is all based off fstab; if i use the emergency attach thing to go in and modify the fstab on disk maybe it will boot | 03:45 |
fungi | yeah, i think since systemd things get dodgy if you have a filesystem set to mount at boot in fstab which can't actually mount | 03:46 |
fungi | previously mount would timeout/give up and you could still at least try to boot up the rest of the way | 03:47 |
ianw | in good news, every other afsdb/afs server has rebooted just fine :/ | 03:49 |
ianw | ok, i edited out the mount | 03:56 |
ianw | ok, exiting rescue mode ... | 03:58 |
ianw | alright, the host is back | 04:00 |
ianw | but with no storage obviously | 04:00 |
ianw | let me try attaching and see what it thinks | 04:00 |
ianw | b,c,d,f,g are attached | 04:02 |
ianw | blkid only shows three drives however | 04:02 |
ianw | b,c,d it shows. f & g not showing | 04:03 |
fungi | that's certainly not what i'd expect | 04:03 |
ianw | pvscan has just gone into something unkillable | 04:04 |
ianw | there's no /dev/xvdg1 | 04:05 |
ianw | this is afs01.dfw.opendev.org/main05 | 04:05 |
fungi | got it, i wonder if it's just that volume or both which actually have problems | 04:06 |
ianw | it took forever to respond, but it did | 04:09 |
ianw | it's possible xvdg doesn't have an lvm partition but is just a raw disk | 04:09 |
ianw | we have /dev/mapper/main-vicepa now ... | 04:09 |
ianw | i'm not sure if i trust mounting it ... | 04:10 |
ianw | well i did. it seems to be there | 04:11 |
fungi | we can probably work out from the lvm config which devices (by uuid) are actually used to assemble | 04:11 |
fungi | or vgs --verbose (i think) should show which pvs are used | 04:11 |
ianw | main05 / xvdg5 was the extra area for snapshots i added when we did the 1.8 upgrade | 04:12 |
ianw | s/xvdg5/xvdg/ | 04:13 |
ianw | i think we should remove that snapshot and remove the xvdg / main05 volume | 04:14 |
fungi | ahh, is it part of the main vg? | 04:14 |
ianw | yep | 04:14 |
fungi | i'm switching rooms but should be able to jump on there in just a sec | 04:14 |
ianw | but the lv is vicepa_snap | 04:14 |
ianw | fungi: i think it's in a stable state ... though i'm not sure why it was failing | 04:15 |
ianw | it is upgraded to bionic, vicepa is mounted and openafs is running | 04:15 |
fungi | i'd hesitate to reboot it if we uncomment /vicepa in fstab | 04:15 |
*** gothicserpent has quit IRC | 04:15 | |
ianw | that's still commented, only manually mounted | 04:15 |
fungi | until we can work out what's choking it | 04:16 |
ianw | it was definitely xvdg being very slow ... pvscan took like 5 minutes but now seems ok | 04:16 |
*** gothicserpent has joined #opendev | 04:16 | |
fungi | we should probably try scaling back of the main05 volume, but that can wait until i get some sleep, i suppose | 04:17 |
ianw | i think we should leave it given both our time constraints now | 04:17 |
ianw | tomorrow we can delete the snapshot volume and get it back to a regular 4tb array | 04:17 |
fungi | yep, full agreement from me | 04:17 |
fungi | thanks for working through that! | 04:17 |
ianw | ok, i will keep an eye but not planning on any more excitement now :) | 04:18 |
fungi | sounds good, thanks again | 04:20 |
*** whoami-rajat has joined #opendev | 04:20 | |
*** ykarel has joined #opendev | 04:44 | |
*** lpetrut has quit IRC | 04:47 | |
*** ykarel has quit IRC | 05:04 | |
*** ykarel has joined #opendev | 05:06 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies https://review.opendev.org/c/opendev/system-config/+/778404 | 05:39 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit docker: match some more files https://review.opendev.org/c/opendev/system-config/+/778613 | 05:39 |
*** ykarel_ has joined #opendev | 05:43 | |
*** ykarel has quit IRC | 05:45 | |
*** yoctozepto has quit IRC | 06:00 | |
*** yoctozepto has joined #opendev | 06:01 | |
*** marios has joined #opendev | 06:07 | |
*** ykarel_ is now known as ykarel | 06:18 | |
*** jaicaa has quit IRC | 06:29 | |
*** jaicaa has joined #opendev | 06:31 | |
*** ykarel_ has joined #opendev | 06:34 | |
*** ykarel has quit IRC | 06:36 | |
*** slaweq has joined #opendev | 07:17 | |
*** lpetrut has joined #opendev | 07:22 | |
*** sboyron has joined #opendev | 07:29 | |
*** ralonsoh has joined #opendev | 07:37 | |
*** hamalq has joined #opendev | 07:45 | |
*** eolivare has joined #opendev | 07:46 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: cabal-test: add install_args and build_args role var https://review.opendev.org/c/zuul/zuul-jobs/+/777653 | 07:51 |
*** fressi has joined #opendev | 07:52 | |
*** jpena|off is now known as jpena | 07:52 | |
*** lpetrut has quit IRC | 07:56 | |
*** rpittau|afk is now known as rpittau | 08:08 | |
*** ykarel_ is now known as ykarel | 08:13 | |
*** andrewbonney has joined #opendev | 08:25 | |
*** TheJulia_ has joined #opendev | 08:29 | |
*** bbezak_ has joined #opendev | 08:29 | |
*** ildikov_ has joined #opendev | 08:29 | |
*** seongsoocho_ has joined #opendev | 08:29 | |
*** parallax_ has joined #opendev | 08:29 | |
*** johnsom_ has joined #opendev | 08:30 | |
*** hashar has joined #opendev | 08:30 | |
*** persia has joined #opendev | 08:31 | |
*** seongsoocho_ has quit IRC | 08:33 | |
*** seongsoocho_ has joined #opendev | 08:33 | |
*** seongsoocho_ has quit IRC | 08:33 | |
*** mnasiadka_ has joined #opendev | 08:33 | |
*** rpittau_ has joined #opendev | 08:33 | |
*** seongsoocho_ has joined #opendev | 08:33 | |
*** seongsoocho_ has quit IRC | 08:34 | |
*** jrollen has joined #opendev | 08:35 | |
*** seongsoocho_ has joined #opendev | 08:36 | |
*** tosky has joined #opendev | 08:36 | |
*** seongsoocho has quit IRC | 08:36 | |
*** seongsoocho_ is now known as seongsoocho | 08:36 | |
*** mnasiadka has quit IRC | 08:43 | |
*** parallax has quit IRC | 08:43 | |
*** persia_ has quit IRC | 08:43 | |
*** rpittau has quit IRC | 08:43 | |
*** johnsom has quit IRC | 08:43 | |
*** TheJulia has quit IRC | 08:43 | |
*** ildikov has quit IRC | 08:43 | |
*** bbezak has quit IRC | 08:43 | |
*** jroll has quit IRC | 08:43 | |
*** mnasiadka_ is now known as mnasiadka | 08:43 | |
*** ildikov_ is now known as ildikov | 08:43 | |
*** TheJulia_ is now known as TheJulia | 08:43 | |
*** bbezak_ is now known as bbezak | 08:43 | |
*** johnsom_ is now known as johnsom | 08:43 | |
*** rpittau_ is now known as rpittau | 08:43 | |
*** parallax_ is now known as parallax | 08:43 | |
*** hamalq has quit IRC | 08:47 | |
*** lpetrut has joined #opendev | 08:51 | |
*** toomer has joined #opendev | 09:19 | |
*** ykarel has quit IRC | 09:23 | |
*** zoharm has joined #opendev | 09:33 | |
*** sshnaidm|afk is now known as sshnaidm|aoff | 09:41 | |
*** sshnaidm|aoff is now known as sshnaidm|off | 09:41 | |
*** klonn has joined #opendev | 09:54 | |
*** roman_g has joined #opendev | 09:58 | |
ianw | fungi/clarkb: forgot to mention, mirror-update is still off to prevent any releases happening and getting corrupt volumes | 10:16 |
ianw | my plan is to remove the lvm snapshot / /dev/xvdg device asap, then complete the upgrade to focal (has gone fine on the other two servers) | 10:16 |
ianw | if there's urgent need, feel free to do that, or even just turn mirror-update back on -- it *should* be fine | 10:17 |
ianw | i'm not going to do anything at this stage, i'm more likely to make a mistake than get it done correctly :) | 10:17 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit docker: match some more files https://review.opendev.org/c/opendev/system-config/+/778613 | 10:22 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies https://review.opendev.org/c/opendev/system-config/+/778404 | 10:22 |
openstackgerrit | Bharat Kunwar proposed openstack/project-config master: [magnum] Add Backport-Candidate and Review-Priority labels https://review.opendev.org/c/openstack/project-config/+/778629 | 10:23 |
*** ykarel has joined #opendev | 10:30 | |
openstackgerrit | Oleksandr Kozachenko proposed opendev/base-jobs master: Update post-logs playbook https://review.opendev.org/c/opendev/base-jobs/+/777087 | 10:35 |
*** artom has quit IRC | 10:39 | |
*** JayF has quit IRC | 10:51 | |
openstackgerrit | Maksim Malchuk proposed openstack/diskimage-builder master: Don't use hardcode while override base image file https://review.opendev.org/c/openstack/diskimage-builder/+/771978 | 10:54 |
openstackgerrit | Maksim Malchuk proposed openstack/diskimage-builder master: Fix hooks order for CentOS/Fedora when mirror used https://review.opendev.org/c/openstack/diskimage-builder/+/772350 | 10:54 |
*** JayF has joined #opendev | 10:56 | |
*** klonn has quit IRC | 11:34 | |
*** artom has joined #opendev | 11:37 | |
*** hashar is now known as hasharLunch | 11:40 | |
*** ykarel_ has joined #opendev | 11:45 | |
*** ykarel has quit IRC | 11:47 | |
*** tkajinam has quit IRC | 11:56 | |
*** ykarel_ is now known as ykarel | 12:30 | |
*** jpena is now known as jpena|lunch | 12:32 | |
*** hasharLunch is now known as hashar | 12:43 | |
*** klonn has joined #opendev | 12:44 | |
*** jpena|lunch is now known as jpena | 13:33 | |
*** hemanth_n has quit IRC | 13:55 | |
*** swest_ is now known as swest | 13:59 | |
*** klonn has quit IRC | 14:41 | |
openstackgerrit | Moshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse. https://review.opendev.org/c/openstack/diskimage-builder/+/778723 | 15:13 |
*** fressi has quit IRC | 15:23 | |
*** lpetrut has quit IRC | 15:36 | |
*** dmellado has quit IRC | 15:45 | |
*** dmellado has joined #opendev | 15:46 | |
*** ykarel has quit IRC | 15:58 | |
openstackgerrit | Moshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse. This PR is also related to the following PR in Ironic-python-agent-builder: https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/778726 Change-Id: Id2759be29bfcbf2ecf1ce67e171686924b506b1a https://review.opendev.org/c/openstack/diskimage-builder/+/778723 | 15:59 |
clarkb | ianw: fungi: thank you for working through that, let me know if I can help otherwise the plan to keep mirror update idle for now makes sense to me as well as cleaning up the snapshot volume | 16:10 |
*** zoharm has quit IRC | 16:12 | |
fungi | i'm planning to try to pick it back up once meetings calm down | 16:13 |
fungi | revisiting the gentoo image situation, looks like we've got uploaded images from 4 days ago and today, so it's ~roughly working again i guess? | 16:15 |
clarkb | neat | 16:16 |
fungi | however, dib-image-list also mentions what i think must be orphaned records from before the builders got replaced | 16:16 |
fungi | two entries from nearly a year ago | 16:16 |
clarkb | fungi: they may also be held in clouds that are preventing us from deleting the image | 16:16 |
fungi | i wonder what the best way is to clean those up (they're not uploaded to any clouds at this point, according to image-list) | 16:16 |
fungi | the builder which held them no longer exists | 16:17 |
clarkb | fungi: image-list is paginated, you should image show the image uuids | 16:17 |
clarkb | but ya if not in any clouds, then probably need to manually remove the records from zk | 16:17 |
clarkb | since there won't be a builder around to do that for us (only the builder that built an image can fully delete it iirc) | 16:18 |
fungi | confirmed, those image ids also don't appear in image-list | 16:18 |
clarkb | fungi: image-list or show? | 16:18 |
clarkb | be very careful with image list because it will only show you like 50 images | 16:18 |
fungi | nodepool image-list | 16:18 |
fungi | oh! i didn't realize that | 16:18 |
clarkb | oh nodepool image list, I thought you meant glance image list | 16:19 |
fungi | interesting | 16:19 |
clarkb | nodepool will not paginate | 16:19 |
fungi | no, not glance | 16:19 |
clarkb | but glance will | 16:19 |
fungi | nodepool dib-image-list mentions the old entries, nodepool image-list does not show them uploaded anywhere | 16:19 |
fungi | so presumably they're not in use for booting nodes any longer | 16:19 |
fungi | i think they were still current at the time their builder was taken out of the pool | 16:20 |
clarkb | ya in that case I think we may have to manually remove the zk records for the builds | 16:20 |
fungi | i'll try to remember to take care of that in the near future, but if anyone else is in zk-shell at some point cleaning up anything else keep in mind those can be cleared out | 16:20 |
*** mlavalle has joined #opendev | 16:28 | |
clarkb | fungi: re 'maybe they're only reviewing changes with the account, then they may be actively using it even though they have no username' yup, that is why we're also cross checking against reviewedby:foo after:year-ago | 16:31 |
clarkb | I'm about to dig into the data the audit script returned a bit more to try and understand how some of these accounts ended up in this situation | 16:31 |
clarkb | curiously for a non zero number of emails with conflict they never set a username or ssh keys on any of the associated accounts | 16:33 |
clarkb | most of them have an account with ssh credentials and one without | 16:33 |
clarkb | with those I worry about the scenario where one could be used for reviews and the other for pushes and I should be able to see evidence of that once I've settled in and start looking at what gerrit says | 16:34 |
*** lpetrut has joined #opendev | 16:36 | |
*** lpetrut_ has joined #opendev | 16:38 | |
*** lpetrut_ has quit IRC | 16:38 | |
*** lpetrut has quit IRC | 16:42 | |
clarkb | ya first insight digging in is that some of these accounts have never reviewed anything | 16:45 |
clarkb | in addition to having no ssh keys or username | 16:45 |
clarkb | fungi: I suspect that ^ those are actually quite safe to clean up. I'll work on modifying the audit script to differentiate between the two groups (no reviews ever vs no recent reviews) | 16:46 |
clarkb | fungi: I'm finding cases where different openids report the same email address. do you know what sort of behavior may account for that? | 16:52 |
clarkb | (maybe that will help us further narrow down accounts that can be cleaned up) | 16:53 |
clarkb | doesn't seem to be super common though, but maybe represents a set of accounts that can be simplified | 16:55 |
*** marios is now known as marios|out | 16:57 | |
*** marios|out has quit IRC | 17:00 | |
*** jpena is now known as jpena|off | 17:02 | |
clarkb | ooh this may be promising, if you open the openid links it looks like login.ubuntu will indicate if the openid is valid or not? | 17:07 |
clarkb | fungi: ^ this may be a golden ticket if I'm processing it correctly | 17:07 |
fungi | clarkb: so, if you'll remember back, there was a time when our account management involved a "sync" from launchpad because of cla management happening there. it's quite likely a number of accounts were created by that sync but never used, and then the same people later created new accounts and added the same e-mail addresses to them | 17:09 |
fungi | and the openid lookup trick does seem like a good way to find invalid ids, yes | 17:11 |
clarkb | ya I think these will be my two next improvements to the audit script. First is identifying where no reviews have ever been done, but also call out accounts without valid openids | 17:12 |
clarkb | I'm testing some assumptions about the openid thing now though to see if that makes sense | 17:12 |
clarkb | ya it seems that valid accounts get an http 200 and invalid openids get http 404 | 17:13 |
clarkb | and if we know there is no valid openid and no username and no ssh keys that account should be 100% safe to clean up? | 17:14 |
clarkb | also in doing some manual digging I found at least one account's openid is named foo-not-used and has never pushed or reviewed code. I'm going to just keep notes for accounts like this one and the tripleo one as "these can be cleaned up" and aren't immediately mechanically determined to be that way | 17:16 |
fungi | yes, ssh keys are irrelevant there for that matter. if there's no username and no openid, there's no way to log in. if there's a username and a password or a username and ssh keys, then the lack of valid openid doesn't necessarily mean it's unused | 17:19 |
fungi | but both methods of authenticating without an openid rely on having a username | 17:20 |
fungi | if there's a username but no password and no ssh keys and no openid, then also no way to log in, but i expect that to be an unusual combo | 17:21 |
clarkb | ya I think we'll ignore that combo for now | 17:21 |
clarkb | and chip away at the easier sets (I don't know if we can check password via api) | 17:21 |
fungi | but yeah, no/invalid openid + no username is a great set to look for | 17:22 |
*** rpittau is now known as rpittau|afk | 17:27 | |
openstackgerrit | Abhishek Kekane proposed openstack/project-config master: Change gerrit ACLs for glance-tempest-plugin https://review.opendev.org/c/openstack/project-config/+/778758 | 17:37 |
fungi | i'm starting to suspect that the wackiness with the main05 volume on afs01.dfw is that the raw block device was made a pv rather than being partitioned with a single partition marked as a pv | 17:40 |
fungi | i have a feeling this is causing problems for pvscan/vgscan | 17:40 |
*** slaweq has quit IRC | 17:49 | |
fungi | infra-root: i'm going to delete the vicepa_snap logical volume from the main vg on afs01.dfw (it's not mounted, this was insurance from before the openafs 1.8 upgrade). if that goes well i'll vgreduce main off of the main05 pv (it will have no extents in use at that point) and then detach it from the server instance | 17:50 |
*** slaweq has joined #opendev | 17:51 | |
clarkb | fungi: roger | 17:57 |
fungi | double-checking how to properly deactivate the snapshot volume first, since normal lvchange also wants to deactivate the origin volume when i try (that would be bad) | 17:58 |
*** roman_g has quit IRC | 17:59 | |
*** roman_g has joined #opendev | 17:59 | |
*** eolivare has quit IRC | 17:59 | |
*** roman_g has quit IRC | 17:59 | |
*** roman_g has joined #opendev | 18:00 | |
*** roman_g has quit IRC | 18:00 | |
fungi | all the docs i find seem to indicate it's okay to remove an active snapshot volume as long as it's not mounted. guess i'll ignore the warning from lvremove | 18:01 |
*** roman_g has joined #opendev | 18:01 | |
fungi | seems to have worked fine | 18:01 |
fungi | the vicepa logical volume is still present and active | 18:01 |
*** roman_g has joined #opendev | 18:01 | |
*** roman_g has quit IRC | 18:02 | |
*** roman_g has joined #opendev | 18:02 | |
*** roman_g has quit IRC | 18:03 | |
clarkb | fungi: re account sync from lp, the accounts exhibiting this are much newer than that I think | 18:03 |
fungi | ahh, okay | 18:03 |
fungi | so now i've used vgreduce to remove /dev/xvdg (main05) from the main vg | 18:05 |
fungi | and that pv is showing unused | 18:05 |
fungi | now i've used pvremove to clean off the pv header from that device | 18:05 |
clarkb | I've just kicked off a run against all 608 remaining email conflicts that will check for valid openids | 18:06 |
clarkb | I expect this to be much slower as I also did the check if there were any reviews or changes pushed at all rather than using after:year-ago on that query | 18:07 |
*** ralonsoh has quit IRC | 18:09 | |
clarkb | side note: the openid validity check seems to properly catch the tripleo ci accounts too | 18:10 |
clarkb | (that gives me a bit more confidence that it is doing the right thing) | 18:10 |
*** toomer has quit IRC | 18:10 | |
clarkb | weshay|ruck: ^ do you know what you all did if anything to make the ubuntu one account for tripleo.ci return a 404 on its openid now? | 18:11 |
clarkb | I wonder if that is just a side effect of being assigned a new openid or if there is an explicit deactivation step. Either way just trying to build confidence in this categorization and it isn't super urgent | 18:12 |
weshay|ruck | clarkb, hrm.. I have not done anything w/ it.. I'll check w/ sshnaidm|off | 18:12 |
weshay|ruck | we're probably the only two that would have done anything though | 18:12 |
*** iurygregory has quit IRC | 18:22 | |
*** iurygregory has joined #opendev | 18:22 | |
*** andrewbonney has quit IRC | 18:23 | |
*** hashar has quit IRC | 18:23 | |
fungi | i ended up detaching afs01.dfw.opendev.org/main05 through the rackspace dashboard, i couldn't remember how to wrangle osc to communicate correctly with their cinder since it's stuck on the v1 api (at least i think that's the problem) | 18:24 |
fungi | anyway, that volume is detached and deleted now | 18:25 |
fungi | my current theory is that adding the main05 volume as a pv directly without any partitioning confused the device scan the next time the server was rebooted | 18:26 |
fungi | and that rebooting now with the remaining 4 cinder volumes attached will come up just fine. i'm hesitant to readd /vicepa to the fstab until we've tested that though, since otherwise systemd will have many sads and we'll need to do another emergency repair boot which is a royal pain | 18:28 |
fungi | ianw: since you also have a lot of this paged in from yesterday, i'll hold until you're around before i test that theory | 18:28 |
clarkb | hrm temporary failure in name resolution caused my script to bail out :( /me reruns it | 18:33 |
*** slaweq has quit IRC | 18:34 | |
clarkb | the name it was looking for has a very low ttl, I'll blame that on random internets for now. If it persists I may need to look at my local networking | 18:36 |
*** smekala has joined #opendev | 18:42 | |
clarkb | infra-root I'm going to delete ze01.openstack.org 0cbe6ecb-be68-43aa-ba0d-58296a81ebcf now unless I hear objections in the next few minutes | 18:44 |
fungi | sounds good | 18:45 |
clarkb | and done | 18:49 |
clarkb | starting launches for 02-04 now | 18:54 |
*** mgagne has quit IRC | 18:56 | |
*** mgagne has joined #opendev | 18:57 | |
clarkb | #status log Removed old ze01.openstack.org in favor of ze01.opendev.org. More new zuul executors to arrive shortly. | 18:57 |
openstackstatus | clarkb: finished logging | 18:57 |
*** whoami-rajat has quit IRC | 18:59 | |
openstackgerrit | Clark Boylan proposed opendev/zone-opendev.org master: Add new ze02-04 servers to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/778765 | 19:08 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Replace ze02-04.openstack.org with ze02-04.opendev.org https://review.opendev.org/c/opendev/system-config/+/778766 | 19:12 |
clarkb | infra-root ^ I think both of those are ready for review now. The servers are up and running and /var/lib/zuul is properly configured | 19:12 |
clarkb | and now it is time for lunch | 19:13 |
*** klonn has joined #opendev | 19:14 | |
fungi | awesome, yeah i saw some unattended upgrades spam from them | 19:14 |
*** hamalq has joined #opendev | 19:41 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Add new ze02-04 servers to dns https://review.opendev.org/c/opendev/zone-opendev.org/+/778765 | 19:41 |
fungi | looks like debian buster-updates has a patched openafs now since roughly a month, but it raced the 10.8 stable point release. if we're adding buster-updates to our default sources we can probably drop the workaround we put in place... otherwise we're roughly a month out from 10.9 which should include it | 19:44 |
fungi | https://bugs.debian.org/980115 for those following along, 1.8.2-1+deb10u1 is the patched version for buster | 19:46 |
openstack | Debian bug 980115 in openafs-client "connection failure when rx initialized after 08:25:36 GMT 14 Jan 2021" [Grave,Fixed] | 19:46 |
clarkb | fungi: I think our sources are whatever debootstrap will install? | 19:46 |
clarkb | though I guess our mirror configuration role will then go over that after | 19:46 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/templates/apt/etc/apt/sources.list.j2 that seems to have -updates in it so we should be ok? | 19:47 |
clarkb | looking at accounts with no username, no ssh keys, no reviews, no changes, and no valid openid (openid url is not a 200) we end up with ~70 accounts for cleaning | 19:50 |
clarkb | let me copy the output of that and upload the script changes | 19:50 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies https://review.opendev.org/c/opendev/system-config/+/777846 | 19:51 |
clarkb | fungi: ianw: maybe you can look over ^ and the resulting file (it has been copied to review) and see if it makes sense to clean those ~70 up? | 19:52 |
openstackgerrit | Merged opendev/system-config master: Replace ze02-04.openstack.org with ze02-04.opendev.org https://review.opendev.org/c/opendev/system-config/+/778766 | 19:57 |
clarkb | it is interesting that a number of these with a completely idle account have >1 non completely idle accounts too | 20:23 |
clarkb | that means doing the cleanup for the idle side won't fix the conflict, however, it will reduce the number of accounts involved in the conflict | 20:23 |
clarkb | one individual has 6 accounts. One seems to have no valid openid, no username, no changes, no reviews. two have pushed changes a couple years apart and still more than half a decade ago and the other 3 while they have ssh usernames have never reviewed or pushed code | 20:26 |
clarkb | I'm not even sure where we start with something like that | 20:27 |
clarkb | maybe declare bankruptcy since all 6 haven't been recently used, email them, and plan to disable all the accounts and clean them up except for maybe the one that was most recently used? | 20:27 |
*** slaweq has joined #opendev | 20:29 | |
*** Jeffrey4l has quit IRC | 20:39 | |
*** Jeffrey4l has joined #opendev | 20:39 | |
*** smekala has quit IRC | 20:43 | |
fungi | i favor bankruptcy | 20:44 |
ianw | fungi: thanks for looking at it! | 20:52 |
clarkb | jrosser: related to the gerrit account cleanup discussion above it appears you have three accounts. The first is the one you appear to be actively using. There are two others that share an email address which is causing us problems with gerrit consistency checks. One of those two does not appear to have any username or valid openid. | 20:52 |
ianw | fungi: i was thinking, i'm not sure the partition made any difference, however the amount of time pvscan took to come back i think did -- i think it took so long that the startup jobs would timeout, putting us into an emergency mode loop | 20:53 |
clarkb | jrosser: I'm sort of spot checking the output of an audit script I wrote which tries to assert which accounts are probably safe to retire and remove all their externalids to fix the conflict. It has detected that the one without a username, valid openid, etc is one to clean up. I thought I'd run this by you as a sanity check against my script and see if there are issues with that | 20:53 |
clarkb | jrosser: feel free to PM me and I can share details like emails and account ids, etc | 20:54 |
ianw | i feel like maybe, since that was added probably many many years after the other volumes; maybe it's some sort of back-end weirdness with them being very far apart or something | 20:54 |
jrosser | clarkb: some time ago I got in a spectacular mess with gerrit/openid and my personal email address | 20:55 |
clarkb | jrosser: and the account I've identified as not really ever being used or having a valid openid would be the case? | 20:56 |
jrosser | I think I broke the openid/ubuntu one stuff sufficiently that I would need them to help me ever make it valid again | 20:57 |
clarkb | "cool". That helps me with confidence in the output of the script I've written | 20:58 |
clarkb | it seems to have properly detected that situation and identified an account that is otherwise unusable | 20:58 |
jrosser | yup, cleaning up on the gerrit side is really a valid first step to me ever getting that account working again, should I need to use something other than my current work email | 21:01 |
fungi | ianw: i feel like i've seen similar pv detection problems in rax when we didn't add a partition table on a cinder device | 21:02 |
ianw | fungi: pvscan did detect it, just eventually ... anyway a good thing to keep in mind | 21:02 |
clarkb | fungi: ianw: if you have time for a quick review https://review.opendev.org/c/opendev/system-config/+/777768 will speed up the rotation of ze05-12 though too late for 02-04 (it's ok I have a workaround I can use) | 21:03 |
ianw | clarkb: lgtm | 21:04 |
clarkb | jrosser: I've put that account in my notes for direct cleanup too since you've confirmed that situation (though the cleanup suggested by the script will likely be used anyway). Should get around to it once I've got sufficient confidence in a large enough group to make that set of changes | 21:05 |
*** openstackgerrit has quit IRC | 21:05 | |
clarkb | given that ^ checks out I think we likely can move ahead with those ~70 identified as not having a username, ssh keys, reviews, pushes, or valid openids | 21:08 |
clarkb | I've got an appointment this afternoon so probably won't get to it today which means others can review the script and data and object :) | 21:08 |
clarkb | also we can set the accounts inactive first, then wait a bit and do the external id removals | 21:08 |
jrosser | clarkb: thanks for taking the time to clean this all up | 21:09 |
clarkb | then I expect the next sets of accounts will be ones we want to email about since they all have some activity somewhere | 21:09 |
ianw | clarkb: script looks good to me. i'd be careful, it sort of feels like you're writing a replacement for our jeepyb "first contribution" scripts :) | 21:09 |
clarkb | jrosser: it has been a very interesting experience to see how accounts have gotten mixed up due to gerrit and ubuntu one assumptions that don't hold true for either side | 21:09 |
clarkb | ianw: I'm mostly worried that my own script will decide that my account should be turned off :) | 21:10 |
ianw | "I'm sorry, clarkb. I'm afraid I can't do that." :) | 21:12 |
clarkb | exactly | 21:13 |
fungi | just stay clear of the airlock | 21:14 |
clarkb | there is also another account I identified when digging into openid validity that is named foo-not-current and doesn't have any use while another one does so I'll mix that one in as well as the tripleo.ci account that we identified wasn't used anymore | 21:15 |
clarkb | I think that gets us ~72 or so potential cleanups. However, not all will result in happy gerrit consistency checks because some still have multiple active accounts with external id conflicts even after that cleanup | 21:15 |
clarkb | just keep chipping away I guess | 21:16 |
fungi | ianw: just about done with dinner but i figure we can do two test reboots of afs01.dfw, the first with /vicepa still commented out of fstab, and the second with it back in if the first succeeds | 21:21 |
clarkb | the infra-prod-service-mirror-update job is running now. I don't expect that to cause problems with however you disable it but thought I would call it out just in case | 21:22 |
ianw | fungi: yeah, i just kicked off the in-place focal upgrade; i'm fairly ok to reboot it with /vicepa set to mount because we can go in via the emergency host if we have issues and turn it off, if you are | 21:23 |
fungi | ianw: i'm okay with that if you are, i just didn't want to risk having to muddle through the emergency recovery boot on my own (have had inconsistent experience with that in the past) | 21:33 |
fungi | i miss being able to just escape a grub boot menu and override init in the kernel command line | 21:35 |
*** klonn has quit IRC | 21:35 | |
fungi | rackspace's emergency boot wants to create a new server instance from the original metadata (including rerunning any userdata scripts) but then mount the original rootfs | 21:36 |
fungi | which always weirds me out a bit | 21:36 |
fungi | ianw: when you say "just kicked off the in-place focal upgrade" you mean yesterday or now? | 21:49 |
clarkb | fungi: I read it as now (I think yesterday was to bionic?) | 21:51 |
fungi | oh, right, multi-phase upgrade | 21:55 |
fungi | cool, i'll wait until ianw is satisfied with the focal upgrade, and am around (and hopefully more useful) if this reboot poses similar problems | 21:56 |
fungi | in /etc/issue it already says the server is on 20.04.2 | 21:56 |
fungi | but i likely checked too late | 21:56 |
clarkb | the new zuul executors are getting ansibled now. I'll get them started and turn off the old ones as soon as that finishes | 21:58 |
clarkb | iirc it took a while compiling the openafs kernel driver last time so probably a little ways away still | 21:58 |
ianw | fungi: yeah, sorry, kicked it off this morning. i'm back now, i think we just uncomment and try a regular reboot as i'm pretty confident the array will build | 22:00 |
*** sboyron has quit IRC | 22:01 | |
ianw | it's been fine on all the other hosts, anyway | 22:01 |
ianw | it's back ... hopefully that is it for afs! still krb hosts but very close. turning on mirror-update now to confirm | 22:04 |
clarkb | exciting | 22:04 |
fungi | ianw: i'm feeling fairly sure that fifth cinder volume was the problem, so yeah i guess let's go for it | 22:05 |
fungi | supposed problem volume is completely gone now | 22:06 |
ianw | ok, i've run a manual --flush-cache --limit base.yaml run against it, mirror-update is up and trying some docs partition releases | 22:11 |
clarkb | infra-prod-service-zuul is done. I'll start the new executors and ask the old ones to pause now | 22:16 |
clarkb | and that's done. Now we wait for the old ones to go quiet | 22:18 |
ianw | clarkb: nice, i think that's the last of the xenial afs clients too, one less thing to worry about | 22:19 |
ianw | i guess it has decided docs needs a full release ... that's weird | 22:21 |
fungi | if we rebooted in the middle of an earlier release, that could easily happen | 22:21 |
*** slaweq has quit IRC | 22:27 | |
clarkb | ianw: note I still have to do 8-12 | 22:32 |
clarkb | er 5-12 | 22:32 |
ianw | mirror-update was shutdown the whole time with no active releases (it's all release from the python script on there). anyway, it's moved on; i'm feeling pretty confident it's all working | 22:32 |
clarkb | I figured I'd do ~3-4 at a time just to avoid having too many moving pieces at once | 22:32 |
ianw | ++ | 22:33 |
ianw | clarkb: there's a little stack of changes ending at https://review.opendev.org/c/opendev/system-config/+/778404 that updates some role matching and gerrit building bits. apart from the slight change to build arguments should be a no-op generally | 22:38 |
ianw | clarkb: and if you have a sec to double-check the stevedore one @ https://review.opendev.org/c/opendev/system-config/+/778354 i can babysit that too, and cleanup the old bits | 22:39 |
clarkb | ianw: for the stevedore one I guess we create the cache dir on all hosts? | 22:40 |
ianw | yeah, /root/.cache seems like something generic, to me | 22:41 |
clarkb | ianw: for the change at the bottom of the noopy stack is the log collection working properly? https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_043/778593/3/check/system-config-zuul-role-integration-focal/043da25/base/logs/ looks empty | 22:45 |
ianw | hrm, maybe i got the matchers wrong, i can never remember it | 22:47 |
clarkb | The other two changes lgtm though | 22:47 |
clarkb | I've not approved anything though as I will soon need to pop out | 22:48 |
* clarkb looks at the log collection more to see if it can be figured out | 22:48 | |
ianw | maybe it needs stage_dir? | 22:51 |
clarkb | the default for that appears to be ansible_user_dir and it should append /logs to that since these are logs_txt I think | 22:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/043da25a8967434d83f348c95a06e467/console shows it created some stage dirs | 22:52 |
ianw | it should be "'/var/log/openafs': logs" (not logs_txt) | 22:53 |
clarkb | and that also shows it copying things like syslog into the stage dir | 22:53 |
clarkb | ianw: I think it may be a mismatch between the copies done on the executor and where we stage | 22:55 |
clarkb | the executor copy seems to want to copy from remote:/home/zuul/zuul-output/logs but we stage at /home/zuul/logs | 22:56 |
ianw | yeah, i think that's the stage-output | 22:56 |
*** tkajinam has joined #opendev | 22:57 | |
clarkb | ya I think if you set stage_dir to {{ ansible_user_dir }}/zuul-output it might work | 22:58 |
clarkb | ianw: but also I'm not sure you need the explicit dir creates in the post run you added | 22:58 |
clarkb | I think that may happen for you | 22:58 |
corvus | #status log updated eavesdrop hosts entry and restarted gerritbot due to netsplit | 23:04 |
openstackstatus | corvus: finished logging | 23:05 |
*** openstackgerrit has joined #opendev | 23:05 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested https://review.opendev.org/c/opendev/system-config/+/778593 | 23:05 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gerrit docker: match some more files https://review.opendev.org/c/opendev/system-config/+/778613 | 23:05 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies https://review.opendev.org/c/opendev/system-config/+/778404 | 23:05 |
clarkb | ianw: I have +2'd that on the assumption that it will fix things, but I probably won't be around when the jobs finish to double check | 23:07 |
ianw | clarkb: ok, thanks .. it's fairly minor. the actual problem was that the x86-64 centos-8 image was out of date, and it couldn't find the kernel headers package for the kernel it was actually running so couldn't build openafs | 23:09 |
ianw | i don't like these jobs in general; they're an odd construction. when we don't care about xenial i'll rework them :) | 23:10 |
clarkb | all three zuul executors are still running jobs (I asked them to pause though). I'll check on these when I can and get them shut down when quiet, but popping out now | 23:27 |
*** stevebaker has quit IRC | 23:42 | |
*** stevebaker has joined #opendev | 23:44 | |
openstackgerrit | Martin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI https://review.opendev.org/c/opendev/system-config/+/776292 | 23:58 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!