opendevreview | Merged opendev/system-config master: Add static02 to inventory https://review.opendev.org/c/opendev/system-config/+/879383 | 00:24 |
---|---|---|
opendevreview | Merged opendev/system-config master: install-launch-node: upgrade all packages https://review.opendev.org/c/opendev/system-config/+/879712 | 00:24 |
*** Trevor is now known as Guest10124 | 01:22 | |
ianw | ahh, i guess vos release needs an admin key, not the wheel volume key | 01:48 |
opendevreview | Ian Wienand proposed openstack/project-config master: wheel builds : move to individual releases https://review.opendev.org/c/openstack/project-config/+/879722 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: wheel builds : move to individual releases https://review.opendev.org/c/openstack/project-config/+/879722 | 04:11 |
opendevreview | Ian Wienand proposed openstack/project-config master: wheel builds : move to individual releases https://review.opendev.org/c/openstack/project-config/+/879722 | 04:18 |
noonedeadpunk | clarkb: ianw it will help, as upgrading existing systems does not break | 07:30 |
noonedeadpunk | centos9 I mean | 07:30 |
noonedeadpunk | as you already have gpg imported to rpm, and gnupg update breaks adding new gpgs not existing ones | 07:30 |
noonedeadpunk | but yeah, given it's already built... | 07:31 |
frickler | we could still delete the latest build if that is showing failures and revert to the previous one | 07:54 |
apevec | <noonedeadpunk> "centos9 I mean" <- -3 build was reverted in the latest CS9 compose, now it just needs to hit the mirrors https://composes.stream.centos.org/production/CentOS-Stream-9-20230405.1/compose/BaseOS/source/tree/Packages/gnupg2-2.3.3-2.el9.src.rpm | 11:23 |
mnasiadka | Hello, I'm seeing some POST_FAILUREs (e.g. https://zuul.opendev.org/t/openstack/build/67162651408647bcb7f81b5cab808e2e) - maybe some issues with log uploads? | 11:38 |
noonedeadpunk | apevec: aha, good news I assume | 11:39 |
noonedeadpunk | apevec: another issue I saw is that multiple SIGs, while updating their GPG keys on https://www.centos.org/keys more than a year ago, are still packaging old ones. NFV was fixed really fast yesterday, but now I've spotted Storage with the exact same thing | 11:40 |
apevec | I'm now asking around when we can expect this on mirrors, then we need our AFS refreshed, which goes via non-primary CS9 mirror ... | 11:41 |
apevec | amoralej: ^ who is the best contact for Storage SIG these days? | 11:42 |
noonedeadpunk | frickler: well, at least we had short period during the night when jobs were passing, until images got updated. so for osa at least that would be helpful. but I'm not sure it will fix rest | 11:42 |
noonedeadpunk | I was told it's ndevos as chair, but not sure how to reach them, except emailing directly | 11:43 |
frickler | mnasiadka: I checked the latest POST_FAILUREs, only that single one has no logs, so I think you can just recheck. | 12:06 |
fungi | 2023-04-06 10:20:10,289 DEBUG zuul.AnsibleJob.output: [e: 1aeec5b0e51048b4bd5d6cec82ba92d0] [build: 67162651408647bcb7f81b5cab808e2e] Ansible output: b'TASK [upload-logs-swift : Upload logs to swift] ********************************' | 12:08 |
fungi | 2023-04-06 10:20:21,337 DEBUG zuul.AnsibleJob.output: [e: 1aeec5b0e51048b4bd5d6cec82ba92d0] [build: 67162651408647bcb7f81b5cab808e2e] Ansible output: b'fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that \'no_log: true\' was specified for this result", "changed": false}' | 12:08 |
fungi | so yeah, log upload failed, but we don't know why | 12:08 |
fungi | i can see from the logs that it wanted to upload to the swift endpoint in ovh-gra1 | 12:09 |
fungi | i guess if we see more of those, we should check to see if they're all uploading to the same provider/region | 12:10 |
apevec | <noonedeadpunk> "apevec: another issue I saw..." <- do you have a link to the centos-devel discussion where the NFV key was updated? | 12:20 |
noonedeadpunk | apevec: I have no idea where logs are for this. It was yesterday between 16 and 17 UTC | 12:21 |
frickler | apevec: replying to messages isn't really a feature in IRC and looks weird in native clients. I would appreciate if you could stop using that while being bridged here | 12:37 |
apevec | sorry about that, will take notice! | 12:38 |
apevec | BTW was move to native Matrix considered for opendev channels? | 12:39 |
frickler | there was some discussion some time ago, but quite a bit of opposition to that, too, including myself | 12:40 |
*** blarnath is now known as d34dh0r53 | 13:15 | |
genekuo | ianw Sure, I'll try to wake up and shadow the process | 13:53 |
genekuo | clarkb: having a short talk will be great to get more understanding on what tasks infra team is working on and skills needed. | 13:55 |
genekuo | I'm ok with late evening call which will probably work better for you. | 13:55 |
opendevreview | gustavo ornaghi antunes proposed openstack/project-config master: Add Dell Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/879744 | 14:36 |
opendevreview | gustavo ornaghi antunes proposed openstack/project-config master: Add Dell Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/879744 | 14:45 |
noonedeadpunk | the thing I hate most about IRC is that you need to maintain a bouncer to not lose history, which needs slightly more effort than you usually want to put in | 14:48 |
opendevreview | gustavo ornaghi antunes proposed openstack/project-config master: Add Dell Storage App to StarlingX https://review.opendev.org/c/openstack/project-config/+/879744 | 14:56 |
clarkb | genekuo: ok cool lets try to schedule something for next week? I think fungi won't be around but that should be fine | 15:12 |
fungi | yeah, there's no need to include me, but if there's one when i'm not on vacation i'll be happy to help | 15:26 |
genekuo | clarkb sounds good to me. For late evening, Monday and Wednesday will be best for my time. | 15:32 |
genekuo | So Monday, Wednesday morning in your time zone | 15:32 |
genekuo | Tuesday and Thursday will also work if it's better for you | 15:33 |
clarkb | genekuo: Wednesday would be perfect for me I think. | 15:33 |
frickler | what's that in UTC? mon I'm out, wed I could likely join, too | 15:33 |
clarkb | frickler: genekuo I think I could start as early as 1300 UTC Wednesday. 1400 would be better but I'll live | 15:34 |
genekuo | 1400 UTC works for me | 15:34 |
frickler | 14 avoids conflict with kolla meeting, that'd be fine for me, too | 15:35 |
clarkb | cool see you at 1400 UTC wednesday. I can share a meetpad link for that as we get closer | 15:35 |
genekuo | cool | 15:38 |
clarkb | I've spot checked static02.opendev.org hosting docs.openstack.org, static.opendev.org, and tarballs.opendev.org via local /etc/hosts overrides. There are a number of things hosted there so it will take some time to double check everything (probably won't finish today but hopefully can get that done tomorrow) | 17:46 |
clarkb | Oh I meant to say the spot checks look good | 17:47 |
clarkb | we also need to ensure everything CNAME'd to static01 does so via the `static CNAME static01` record and not directly | 17:47 |
clarkb | but I think I can check that as I check the hosting works since its all related to digging info out of dns | 17:47 |
fungi | we can/should go ahead and merge https://review.opendev.org/879414 ahead of the maintenance, right? | 17:49 |
clarkb | yes, historically I think we've left that for when we are done. But we can always revert later instead if necessary | 17:50 |
clarkb | ok looks like the ssl cert list for static will be a good cheatsheet for content | 17:50 |
fungi | all the changes for topic:gerrit-3.7 lgtm, and i approved the associated git-review series | 17:50 |
clarkb | fungi: git-review series? | 17:51 |
opendevreview | Merged opendev/project-config master: Add renames for April 6th outage https://review.opendev.org/c/opendev/project-config/+/879414 | 17:52 |
fungi | topic:gerrit-3.7+project:git-review | 17:52 |
fungi | mostly your stuff around testing git-review with newer gerrit | 17:52 |
fungi | er, topic:gerrit-3.7+project:opendev/git-review | 17:52 |
clarkb | ah | 17:52 |
fungi | also the tox to nox switch, which it depended on | 17:53 |
fungi | figure that's been up plenty long enough for anyone who cares to object | 17:53 |
clarkb | these records will not update when we update the static.opendev.org CNAME static02.opendev.org record: devstack.org registry.zuul-ci.org zuul-ci.org www.zuul-ci.org zuulci.org www.zuulci.org gating.dev www.gating.dev | 17:58 |
clarkb | some of those are root records and this is expected. Others should probably be updated to point at the cname. I'll work on changes for that as part of the dns update and we can do a staged shutdown of the old server once we think we've got everything to avoid accidents | 17:58 |
clarkb | fungi: is devstack.org managed through the rax dns stuff? | 17:59 |
fungi | yes | 17:59 |
fungi | i expect www.zuul-ci.org, www.zuulci.org and www.gating.dev are already cnames, just not (directly) to static01 | 18:00 |
fungi | and i expected wrong! | 18:00 |
fungi | they're all a/aaaa | 18:00 |
clarkb | yup I'll update those that can be CNAMEs to CNAMEs to cut down on needing to do extra work the next time this is done | 18:00 |
clarkb | and update the A/AAAA records for those that can't be CNAMEs | 18:01 |
fungi | thanks | 18:01 |
clarkb | Initially I wasn't going to bother with a short ttl on the static.opendev.org CNAME but now I'm thinking that may be a good idea in case I miss something and need to revert. | 18:03 |
fungi | in which case i guess we need a followup change to re-default the ttl(s)? | 18:04 |
clarkb | yes | 18:04 |
clarkb | not a big deal I just thought i could get away without doing that | 18:04 |
opendevreview | Merged opendev/git-review master: Switch from tox to nox https://review.opendev.org/c/opendev/git-review/+/871652 | 18:11 |
opendevreview | Merged opendev/git-review master: Test Python bounds only https://review.opendev.org/c/opendev/git-review/+/877321 | 18:11 |
opendevreview | Merged opendev/git-review master: Test old and new Gerrit https://review.opendev.org/c/opendev/git-review/+/877313 | 18:11 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Update static.o.o CNAME to point at static02 https://review.opendev.org/c/opendev/zone-opendev.org/+/879780 | 18:11 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Remove old static01 records https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 | 18:11 |
clarkb | I don't think we want to land any of those today just to avoid debugging new static overlapping with gerrit things | 18:12 |
clarkb | but I'll get changes up anyway | 18:12 |
fungi | sure | 18:19 |
opendevreview | Clark Boylan proposed opendev/zone-zuul-ci.org master: Update zuul dns records to the new static02 server https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/879782 | 18:21 |
opendevreview | Clark Boylan proposed opendev/zone-zuul-ci.org master: Revert short @ record TTLs https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/879783 | 18:21 |
clarkb | remote: https://review.opendev.org/c/opendev/zone-gating.dev/+/879784 Point gating.dev at the new static02 server | 18:26 |
clarkb | that one doesn't report to us apparently | 18:26 |
clarkb | that's all of the code review based updates necessary to do the switch. devstack.org I'll have to do by hand as this is in process | 18:26 |
clarkb | also I used the ssl cert generation config to determine which names to look at. I suspect that is fairly complete | 18:26 |
fungi | our certcheck is based on le config now, so good call yeah | 18:55 |
fungi | clarkb: minor question on 879780 but i'm not really all that worried about the ordering | 19:00 |
fungi | so +2 anyway | 19:00 |
Clark[m] | Ya short answer while I sort out lunch is when I looked at the follow-up change to remove static01 records I realized I didn't really need to move it as much as not delete it | 19:03 |
fungi | wfm | 19:57 |
fungi | we're at t minus one hour until maintenance | 21:00 |
fungi | should we status notice a reminder and maybe start putting things into disable state? | 21:00 |
clarkb | sounds good. | 21:00 |
clarkb | let me update the list of emergency file hosts on the etherpad to include those that the rename playbook touches | 21:01 |
fungi | status notice The Gerrit service on review.opendev.org will be offline for extended periods between 22:00 and 23:00 UTC for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:02 |
fungi | oh, we said two hours, so between 22:00 and 00:00 | 21:02 |
fungi | otherwise lgty? | 21:02 |
clarkb | ya due to the rename too. | 21:02 |
clarkb | yes lgtm | 21:02 |
clarkb | https://etherpad.opendev.org/p/gerrit-upgrade-3.7 is the etherpad if you want to check the list of hosts | 21:03 |
fungi | #status notice The Gerrit service on review.opendev.org will be offline for extended periods between 22:00 and 00:00 UTC for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:03 |
opendevstatus | fungi: sending notice | 21:03 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline for extended periods between 22:00 and 00:00 UTC for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:03 | |
fungi | clarkb: list of hosts lgtm, thanks | 21:05 |
clarkb | fungi: I'll go ahead and add them on bridge now then | 21:05 |
fungi | go for it | 21:06 |
opendevstatus | fungi: finished sending notice | 21:06 |
clarkb | fungi: that's done if you want to double check it on bridge too | 21:08 |
fungi | can do | 21:11 |
fungi | clarkb: looks correct on bridge. thanks! | 21:14 |
fungi | i guess we're all set for about the next 30-40 minutes | 21:15 |
clarkb | I think so | 21:15 |
fungi | for the second status announcement, do we want to #status alert or just stick with notice? | 21:17 |
clarkb | I feel like notice is sufficient? | 21:18 |
fungi | i'm cool with it | 21:19 |
ianw | o/ thanks for sending the notice | 21:20 |
fungi | wanted to make sure you're able to wake up and enjoy your tea et cetera | 21:21 |
fungi | there are 3 detached root screen sessions on bridge. we should maybe clean them up. newest is from a month ago | 21:22 |
fungi | any objections? | 21:22 |
clarkb | no objections from me. I have a screen I was using for the static and etherpad bootstrapping but it should be owned by my user not root | 21:23 |
fungi | newest one looks like it was related to docker updates | 21:23 |
fungi | pretty sure that can go | 21:23 |
fungi | one from january was a zuul rolling restart | 21:24 |
ianw | sounds good | 21:24 |
fungi | one from december was a zuul restart as well | 21:24 |
fungi | i'll close out all three | 21:24 |
fungi | step #3 in the pad is creating a root screen session on review02 right? | 21:25 |
ianw | look like the launch-env on bridge updated as we'd hoped now | 21:25 |
ianw | fungi: was just starting that | 21:26 |
fungi | cool, thanks! | 21:26 |
clarkb | ianw: I've made a suggested edit to the reindex step. Looking at the 3.7 release notes and the upgrade scripting for our test jobs, it does a full reindex of everything, not just changes | 21:26 |
fungi | i've joined it. was the only one owned by root | 21:26 |
clarkb | (changes is the slow one so its not like this will make it take much longer just makes it more complete and is safer I think) | 21:26 |
ianw | oh, thanks! yeah that's a copy-paste, i intended that to be everything | 21:27 |
fungi | sgtm, and yeah the account index rebuilds quickly | 21:27 |
clarkb | and a note on step 15 (basically it should be a noop which is totally fine) | 21:28 |
clarkb | I think worth keeping in the doc so that we have it for future upgrades if we refer back to this one | 21:29 |
ianw | yep; i think the checklist gets better each time as we pull things from the last one | 21:31 |
fungi | ianw: we went ahead and did up to rename step 2 earlier as well | 21:33 |
fungi | or i should say through rename step #2 | 21:34 |
ianw | oh i should pull the latest change on bridge actually | 21:35 |
ianw | and that playbook should run in a screen as well | 21:35 |
ianw | ok HEAD of that now 854a22aeae1bce7cb6ad63579af87fa1aa956566 | 21:36 |
clarkb | the suggestion for tea has inspired me /me boils a kettle | 21:37 |
fungi | i'm mowing the lawn, which seems like an odd thing to be doing just before a maintenance window, but i'll do my best not to initiate a medical emergency in the next few minutes | 21:38 |
fungi | status notice The Gerrit service on review.opendev.org will be offline for extended periods over the next two hours for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:50 |
fungi | that look good for sending in about 10 minutes? | 21:50 |
clarkb | wfm | 21:50 |
ianw | ++ | 21:50 |
fungi | cool, i'll record it in the pad and stage to send it just before the top of the hour | 21:51 |
fungi | i've started a root screen session on bridge01 and turned on session logging for it like was done for review02 in step #3 | 21:56 |
fungi | since i didn't see one yet (though we don't need it until the rename work starts) | 21:56 |
clarkb | ok I guess we're operating from both hosts so have to join them separately | 21:56 |
* clarkb organizes workspaces | 21:56 | |
fungi | yeah, upgrade happening in one, rename in the other | 21:56 |
fungi | we're at t minus two minutes, time to send the notice? | 21:57 |
clarkb | ++ | 21:57 |
fungi | #status notice The Gerrit service on review.opendev.org will be offline for extended periods over the next two hours for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:58 |
opendevstatus | fungi: sending notice | 21:58 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline for extended periods over the next two hours for software upgrades and project renames: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/VW2O56AXI4OX34CWDNRNZDCWJDZR3QJP/ | 21:58 | |
fungi | takes about that long to hit all the channels | 21:58 |
ianw | does the twitter update still work or did they can that? | 22:00 |
clarkb | they made changes to the api | 22:00 |
clarkb | very likely we were impacted | 22:00 |
opendevstatus | fungi: finished sending notice | 22:00 |
fungi | but we do send it to mastodon still, right? | 22:00 |
ianw | i guess it's still updating. i have a change to pull it out somewhere if it breaks other things but as long as it isn't i guess we can keep it | 22:01 |
ianw | anyway, i think we can start? | 22:01 |
clarkb | yes I'm ready if you are | 22:01 |
clarkb | I have a cup of warm tea and a wool shirt on this cold and rainy day. | 22:01 |
fungi | yep | 22:01 |
fungi | step #4 here we come | 22:01 |
clarkb | ianw: did you double check the emergency file update? | 22:02 |
clarkb | just to be sure I got the names correct etc | 22:02 |
clarkb | ianw: you have to use `up -d mariadb` | 22:02 |
clarkb | the down stops and deletes the containers so they aren't startable. Up is create + start | 22:02 |
clarkb | or that should work too :) | 22:03 |
fungi | it's 80f/27c here and uncharacteristically calm winds, so not a great day to mow the lawn, but it's supposed to rain all weekend and i'm getting on a plane monday :/ | 22:03 |
clarkb | a week ago the forecast made it look like spring was starting this weekend. ~72F Sunday. Then a few days later spring was cancelled. | 22:04 |
fungi | it's also very sunny. and ~100% humidity but that's to be expected here | 22:05 |
ianw | ok, watched the backup logs, all good | 22:05 |
fungi | yep, lgtm | 22:05 |
fungi | and gerrit is offline | 22:05 |
fungi | we have a lot of container images on there | 22:08 |
clarkb | fungi: ya I did a cleanup of images older than like a year or something a while back | 22:08 |
clarkb | but we're somewhat careful with gerrit and I think that is a good thing especially since it supports reverts | 22:09 |
ianw | ac4763fec95aab55deafe2e1e48f0e166fb7ff59561df82cf87da15c52775b15 confirmed | 22:09 |
fungi | sure | 22:09 |
fungi | so offline reindex time? | 22:09 |
clarkb | not sure if we automated the gerrit pruning though probably should if we havent | 22:10 |
fungi | there it goes | 22:10 |
fungi | and now we wait | 22:11 |
fungi | i'm going to do a few more laps with the mower while this spins | 22:11 |
clarkb | ya we might consider overriding the default cpu count for this in the future; it's like 1/2 or 1/4 of total cpus iirc | 22:11 |
fungi | could probably crank it up to nearly 1:1 | 22:11 |
fungi | but it's mostly a few large slices which have to complete, so at some point more parallelism doesn't buy us much | 22:12 |
clarkb | exactly | 22:12 |
ianw | i feel like we ran stats on that with the big notedb upgrade , when we had a complete mirror we were testing on | 22:13 |
fungi | that certainly sounds like us | 22:13 |
clarkb | ianw: hrm I'm looking at the release notes and it says we need to run the init step too but that doesn't appear to be on the etherpad? | 22:15 |
clarkb | https://www.gerritcodereview.com/3.7.html#offline-upgrade | 22:15 |
clarkb | I think we've normally run that in previous upgrades so that should be captured somewhere in a previous doc? | 22:16 |
ianw | hrm, yes i agree it does say that, and we haven't done that | 22:16 |
clarkb | what I'm not sure about is if we need to index after init'ing or if order matters at all | 22:16 |
ianw | https://etherpad.opendev.org/p/gerrit-upgrade-3.6 we didn't | 22:17 |
ianw | nor | 22:17 |
ianw | https://etherpad.opendev.org/p/gerrit-upgrade-3.5 | 22:17 |
genekuo | just wondering, isn't the etherpad steps reviewed before the execution, or it's just some bit that is missed in review | 22:17 |
ianw | we've just missed this one in review | 22:18 |
fungi | genekuo: yeah, we review the pad, also it's based (not too loosely) on prior testing | 22:18 |
clarkb | ianw: we did https://etherpad.opendev.org/p/gerrit-upgrade-3.3 | 22:18 |
genekuo | I see | 22:19 |
fungi | but also, it's got a lot of stuff in it, so easy to miss something | 22:19 |
clarkb | line 46 | 22:19 |
fungi | all the more reason why more eyes are helpful ;) | 22:19 |
clarkb | part of the problem too is the gerrit upgrade process changes almost every upgrade | 22:19 |
ianw | i think for an abundance of caution, we should run init after this reindex, then re-run the reindex | 22:19 |
clarkb | ianw: ++ | 22:20 |
fungi | yeah, we didn't need to reindex for 3.3, but the notes suggest that's the order we would have gone with | 22:20 |
fungi | this is why we budget extra time ;) | 22:21 |
ianw | ++ i agree on running the same init command as in the revert docs | 22:23 |
clarkb | ianw: I updated the etherpad to capture this command. I pulled it from the 3.3 upgrade but it also matches what we have in the system-config ansible stuff | 22:23 |
fungi | thanks! | 22:24 |
ianw | hrm, what was that exception | 22:25 |
ianw | Loading commit AnyObjectId[9cd80009587b67757b34a278063ab98c56a06316] for ps 3 of change 19316 failed. | 22:25 |
clarkb | its angry about a sub 100k change unfortunately we've got a few of those in the installation iirc | 22:25 |
ianw | error getting field added of ChangeData{Change{19321 (Id26497a7655c69b367aeee959d0078495879b1cf), dest=x/kwapi,refs/heads/master, status=M}} | 22:25 |
clarkb | basically over the years the gerrit data migrations weren't all as reliable as hoped for | 22:25 |
clarkb | and we've ended up with a small number of corrupted changes. I suspect because we may have manipulated them by hand in the DB for one reason or another | 22:26 |
clarkb | did the index progress halt or is that an artifact of screen copy mode scrollback? | 22:27 |
clarkb | top output implies things haven't halted | 22:27 |
clarkb | 1061.7s self reported time | 22:29 |
ianw | it's finished, i'll do the init | 22:29 |
fungi | log should indicate whether it completed | 22:29 |
fungi | but yeah, i think we're good | 22:30 |
* fungi actually doesn't remember whether offline reindexes report into the error_log like online ones do | 22:30 | |
clarkb | ah ok this migration is for our submit requirements stuff that should be a noop. We probably want to spot check say all-projects and a handful of others to ensure it was a noop | 22:31 |
clarkb | I strongly suspect we didn't actually need to migrate other than for bookkeeping purposes since we took care of this explicitly upfront | 22:31 |
fungi | also hopefully the first reindex primed the file cache enough to speed up the next run of it | 22:31 |
fungi | seems like it started up faster at least | 22:33 |
fungi | a little over 25% of the time we booked, so i feel like we're in a pretty comfortable spot still | 22:36 |
ianw | 80% | 22:44 |
clarkb | same error again I expect it was even around 86% so the order it processes these appears to be deterministic | 22:45 |
clarkb | it was 30s faster doing the changes index. Not much | 22:48 |
fungi | woohoo! | 22:48 |
clarkb | 17:42 wall total | 22:48 |
fungi | 30 seconds i can spend doing something pointless | 22:48 |
fungi | and gerrit should be on its way up (but this is not the last outage) | 22:49 |
ianw | hrm, what's with the replication errors | 22:50 |
clarkb | ianw: I understand it I think its ok | 22:50 |
clarkb | or at least I think I do | 22:50 |
clarkb | it has to do with keeping state on disk but new containers wipe that out? I had changes to improve that but maybe we have a bug around it | 22:50 |
ianw | oh that was mounting another dir right, i thought we merged that | 22:51 |
clarkb | oh ya we did | 22:51 |
fungi | oh, this is the not losing replication events when we restart, right | 22:51 |
clarkb | ya but maybe there is something wrong with it since it is so persistent. I would've expected it to flush and then move on | 22:51 |
ianw | gerrit2@review02:~/review_site/data/replication/ref-updates/waiting$ ls -l | wc -l | 22:52 |
ianw | 9717 | 22:52 |
clarkb | but they appear to be unique tasks | 22:52 |
fungi | that's a couple of orders of magnitude more than i would have expected | 22:52 |
clarkb | so not retrying over and over again (which is good) and it stopped | 22:52 |
ianw | appears to have stopped | 22:53 |
ianw | heh, yeah | 22:53 |
fungi | unless gerrit re-replicates everything on start | 22:53 |
clarkb | I think we check if new replication tasks while gerrit is running are happy | 22:53 |
fungi | but the state tracking should mean it wouldn't need to do that, so i expect that's not what created them | 22:53 |
ianw | Error while renaming task 02744617a81f6f87e68186e01e9d46f29c6033b7 [CONTEXT pushOneId="6fa17c8f" ] | 22:53 |
ianw | java.nio.file.NoSuchFileException: /var/gerrit/data/replication/ref-updates/waiting/02744617a81f6f87e68186e01e9d46f29c6033b7 -> /var/gerrit/data/replication/ref-updates/running/02744617a81f6f87e68186e01e9d46f29c6033b7 | 22:53 |
ianw | so it's trying to put a waiting into running, but can't find the "waiting" file? | 22:54 |
clarkb | ya it sticks task records in the waiting queue then when it goes to process them it moves them. It might also be a race between different threads? | 22:54 |
clarkb | ianw: yes exactly | 22:54 |
clarkb | I think this is possibly related to the migration which has me concerned the migration actually generated meta config diffs | 22:54 |
ianw | waiting dir still has 9717 entries, so that's not moving | 22:55 |
clarkb | I think two things should be checked 1) does replication occur for new events while gerrit is up and running now 2) what does meta config look like for $projects | 22:55 |
ianw | ok well the ui is up, that's one thing | 22:56 |
clarkb | ianw: if you look at the timestamps in waiting they are older than I expected. This might be something that we want to clear out between upgrades? | 22:57 |
clarkb | basically these aren't a bunch of new replication tasks from the upgrade process based on timestamps | 22:57 |
ianw | i just put https://review.opendev.org/879722 in for a recheck and zuul has picked that up | 22:57 |
ianw | -rw------- 1 gerrit2 gerrit2 139 Mar 2 21:17 1f9f0629cc0e76c48676ece942d7b0099ca2e253 | 22:58 |
ianw | yeah, they start from a long time ago... | 22:58 |
clarkb | we should push a new change/newps and see if it replicates. I suspect the replication stuff is stale and we can ignore it for now if replication for current stuff is working | 22:58 |
clarkb | and then figure out why those are leaking and how to clean them up | 22:58 |
clarkb | they do not show in show-queue | 22:58 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM testing gerrit replication https://review.opendev.org/c/opendev/system-config/+/879790 | 22:59 |
ianw | https://review.opendev.org/c/zuul/zuul-client/+/879520 is a pretty simple one if we want to merge something quick | 23:00 |
fungi | well, new changes/patchsets get replicated too | 23:00 |
fungi | shouldn't have to merge anything | 23:00 |
clarkb | ya 879790 looks good in the replication log | 23:00 |
clarkb | now to fetch it from one of the giteas | 23:00 |
clarkb | `git fetch https://opendev.org/opendev/system-config refs/changes/90/879790/1` worked for me | 23:01 |
ianw | ++ agree | 23:02 |
clarkb | so I think replication is mostly working and whatever the issue is with stale replication events is separate. We should fix that but it can happen later | 23:02 |
clarkb | I think those files are meant to be json with info in them, or similar serialized data, that we might be able to work from to debug further | 23:02 |
fungi | sgtm, so we can proceed i guess | 23:04 |
clarkb | they are all replication events for all-projects | 23:05 |
clarkb | or at least after sampling two of them (they are json data) | 23:05 |
clarkb | we don't/can't replicate all projects because we don't let the system have permission to do so | 23:05 |
clarkb | its possible this is a bug in the plugin recording those but not removing them when it finds it doesn't have permissions | 23:05 |
fungi | oh, yep! | 23:06 |
clarkb | I was able to push a change, review the change, and verify the change is replicated. I think that all lgtm | 23:06 |
fungi | so dealing with 879412 is next if everyone's ready to proceed? | 23:08 |
ianw | omg my keyboard just went bananas and wouldn't stop sending "enter" | 23:08 |
fungi | i saw, it was epic | 23:08 |
ianw | i had to rip the laptop out of the thunderbolt dock to stop it | 23:09 |
clarkb | were we happy with the diffs? was it just email templates | 23:09 |
clarkb | oh wow | 23:09 |
ianw | sorry, back now after heart attack that gerrit was exploding somehow :) | 23:09 |
fungi | no worries | 23:09 |
clarkb | and before we proceed do we want to check the all-projects meta config history and/or another project or two? | 23:09 |
clarkb | just to see that the migration nooped as expected? | 23:10 |
ianw | there does seem to be some diff in the gitea config section | 23:10 |
clarkb | ianw: for replication or commentlinks? | 23:10 |
ianw | just unquoting | 23:10 |
clarkb | ah ok that should be safe for now | 23:10 |
fungi | but just the templates right? | 23:10 |
fungi | oh, there | 23:10 |
clarkb | fungi: the templates are purely for email | 23:11 |
fungi | yeah | 23:11 |
ianw | https://paste.opendev.org/show/bLQqwsnIoMIJOVLLAdAX/ | 23:11 |
ianw | it looks safe, i wonder if this isn't in the system-config test? | 23:11 |
fungi | mostly quoting | 23:11 |
clarkb | ianw: ya I think that's an ok delta; we can fix that later since it doesn't appear to be semantic | 23:11 |
ianw | i'll add a note but it looks ok to me too | 23:12 |
fungi | it's all quoting differences, yeah | 23:12 |
ianw | ok, i think we're about ready to call it on the gerrit screen, and things are in a stable state? | 23:14 |
clarkb | as far as I can tell they are. My test change even got a +1 from zuul already | 23:14 |
clarkb | my only outstanding question is the refs/meta/config after the migration step | 23:14 |
clarkb | but voting and submit requirements also appear to be working as I expect so I don't think anything bad happened there | 23:15 |
ianw | oh right, if it merged anything? | 23:15 |
fungi | git log in /home/gerrit2/review_site/git/All-Projects.git looks okay | 23:15 |
clarkb | yes basically did it create changes for refs/meta/config even though we did our best to prevent that from happening | 23:16 |
clarkb | fungi: what branch is that looking at? | 23:16 |
fungi | "Migrate label configs to copy conditions" | 23:16 |
clarkb | fungi: is that from now or is that from when ianw did it? | 23:16 |
fungi | 8803e0f011d8892a5230fc3cd262b78481a8039b | 23:16 |
fungi | it's from 22:30z | 23:16 |
clarkb | ok so it did make a difference? | 23:16 |
fungi | looks like it | 23:17 |
clarkb | if that is safe to share can you share the diff? | 23:17 |
clarkb | or the full git show of that ref | 23:17 |
ianw | yeah | 23:17 |
ianw | https://paste.opendev.org/show/biW2eeakwF2arkF0Gp3Y/ | 23:17 |
fungi | https://paste.opendev.org/show/bagB8WhAD3XNJdkcIvf3/ | 23:17 |
clarkb | ok I think that is wrong for verified and workflow | 23:18 |
clarkb | but ok for code review | 23:18 |
ianw | it's added NO_CHANGE | 23:19 |
clarkb | right and workflow in particular shouldn't persist that | 23:19 |
clarkb | since it's about triggering events and making zuul take action etc | 23:19 |
fungi | looks like it really wants changekind:NO_CHANGE everywhere | 23:19 |
clarkb | It might be ok for verified but I'd have to think about it | 23:19 |
clarkb | but I also don't think fixing that is urgent | 23:19 |
fungi | we probably want to clean it up for verified | 23:19 |
clarkb | did it make any other changes to all-projects today or just that one? | 23:19 |
fungi | that was the only one in the history | 23:20 |
fungi | other than our changes | 23:20 |
ianw | yep prior one was "Fix boolean operators to all-caps" | 23:20 |
clarkb | ok I think I'm comfortable proceeding for now and coming back to cleaning that up | 23:20 |
fungi | agreed | 23:20 |
clarkb | it makes me wonder if it did that to all the projects as well but I think those matter less than verified and workflow so even less urgent if it did | 23:20 |
fungi | right | 23:21 |
ianw | we could remove the acl cache again and re-do it | 23:21 |
clarkb | ya or maybe we decide NO_CHANGE is appropriate for some things and just update config on our side | 23:21 |
clarkb | but ya I think this is all ok for now if not ideal | 23:21 |
clarkb | I'm happy for people to disagree with me too :) just want to communicate my comfort level with proceeding | 23:22 |
ianw | NO_CHANGE is more trivial than a trivial rebase, no code change and a first parent update, hence this change kind is also matched by changekind:TRIVIAL_REBASE | 23:24 |
ianw | i think i thought about this, now i look at it | 23:24 |
ianw | TRIVIAL_REBASE includes NO_CHANGE | 23:24 |
clarkb | ianw: ya I think it's fine on code-review (even if it appears to be a noop there) but for verified and workflow we expect those to reset state with new patchsets as they are used to drive state machine state | 23:24 |
clarkb | basically we never wanted to copy verified or workflow | 23:25 |
fungi | but we can quickly merge a change to address that after the window | 23:25 |
clarkb | yup | 23:25 |
fungi | its impact should be minimal | 23:25 |
clarkb | (I also think this is a bug in their migration and I'm going to file a bug about it) | 23:26 |
clarkb | but later | 23:26 |
ianw | yeah i think there's two things | 23:26 |
ianw | 1) it added NO_CHANGE to TRIVIAL_REBASE when it didn't need to | 23:27 |
ianw | 2) it added NO_CHANGE when there was no copyCondition | 23:27 |
clarkb | ++ | 23:27 |
ianw | 2 might be on us, but 1 feels like an unnecessary addition | 23:27 |
clarkb | I mean no copy condition means no copying | 23:27 |
fungi | time check: we're almost 75% of the way through our window, so now is probably a good go/no-go point for the project renames | 23:27 |
clarkb | they shouldn't assume one out of thin air. I think both are a bug in gerrit | 23:27 |
ianw | i think let's do the renames? | 23:28 |
clarkb | yes I'm good to proceed | 23:28 |
fungi | i'm in favor, shouldn't require the full remaining 30 minutes | 23:28 |
fungi | just wanted to be sure | 23:28 |
ianw | i think we can merge the 879412 config update later and watch it apply | 23:28 |
clarkb | ianw: wfm | 23:29 |
clarkb | ianw: but we should maybe manually edit the manage-projects command in the interim | 23:29 |
clarkb | basically do the same 3.6 -> 3.7 replacement that you did in the docker compose file | 23:30 |
clarkb | in /usr/local/bin/manage-projects | 23:30 |
fungi | the "docker.io/opendevorg/gerrit:3.6 manage-projects $@" line needs s/6/7/ | 23:32 |
ianw | ok, storyboard-dev push seemed to fail, but i don't think that's a concern | 23:32 |
clarkb | ianw: it's not but it stopped things early | 23:32 |
fungi | yeah, storyboard-dev can be commented out | 23:32 |
clarkb | and we need to comment out everything prior to storyboard-dev | 23:32 |
fungi | we don't need to worry about renaming things on it | 23:32 |
clarkb | because this isn't idempotent | 23:32 |
ianw | ok, manage-projects update done | 23:33 |
fungi | want me to go ahead and make the /usr/local/bin/manage-projects edit? | 23:33 |
fungi | oh, you beat me to it | 23:33 |
clarkb | keep in mind we aren't done with the rename | 23:33 |
fungi | right | 23:33 |
ianw | yeah, so let's edit out storyboard-dev | 23:33 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/rename_repos.yaml#L1-L53 all of that needs to be edited out/commented out | 23:34 |
clarkb | we can't rerun the gitea stuff safely I don't think | 23:34 |
clarkb | nor the gerrit moves | 23:34 |
ianw | huh, this has ignore errors | 23:35 |
clarkb | it was a connection error | 23:35 |
clarkb | which I think is a level before the task ignore errors | 23:36 |
fungi | interesting. i ssh'd into it just before the maintenance | 23:36 |
fungi | maybe we have a stale host key from the server replacement | 23:36 |
ianw | oh, or it's not in our now automated list for some reason :/ | 23:36 |
fungi | i should have tried from bridge | 23:36 |
ianw | sigh, anyway, i've removed everything before line 54 | 23:37 |
ianw | we feel like that's ok to run? | 23:37 |
clarkb | ianw: yes I have looked at the file and it looks right to me. | 23:37 |
fungi | yep | 23:37 |
ianw | ok, attempt 2 looks good | 23:41 |
fungi | i concur | 23:41 |
clarkb | opendev.org redirects work for me against whatever backend I balance to | 23:42 |
clarkb | I have no reason to expect that other backends would be different given the status of the playbook running | 23:42 |
fungi | yes, lgtm | 23:43 |
clarkb | https://review.opendev.org/q/project:openstack/virtualpdu exists but has no changes yet (should show up via reindexing iirc) | 23:43 |
clarkb | show queue shows the reindexing is happening | 23:43 |
fungi | yep | 23:43 |
ianw | ++ | 23:44 |
clarkb | so I think we can take a quick breather and we shouldn't have any further gerrit downtime | 23:44 |
clarkb | but then follow up with the project-config synchronization, emergency file updates, and then sort out the copycondition stuff | 23:45 |
ianw | i can merge the two project config changes, so we can watch those fail | 23:45 |
fungi | sounds good | 23:46 |
fungi | also, for the sake of anyone following along from the peanut gallery, we don't expect more gerrit outages for the remainder of the window, but it's not a guarantee ;) | 23:46 |
opendevreview | Merged openstack/project-config master: Ironic program adopting virtualpdu https://review.opendev.org/c/openstack/project-config/+/876231 | 23:47 |
opendevreview | Merged openstack/project-config master: Rename x/xstatic-angular-fileupload->openstack/xstatic-angular-fileupload https://review.opendev.org/c/openstack/project-config/+/873843 | 23:47 |
ianw | ^ those two are in | 23:48 |
clarkb | ianw: might be good to start a single todo list somewhere too rather than have it in the doc all over? But I'll defer to you on that | 23:49 |
clarkb | I detached from the screen but didn't `exit` not sure if I would've been the last one in it | 23:49 |
ianw | yeah will work it into a single todo | 23:51 |
ianw | so they didn't fire manage-projects jobs after the force merge -- zuul complained that the config was invalid | 23:52 |
clarkb | see I thought that may have been the case but I looked at jobs and it seemed zuul runs after | 23:52 |
clarkb | but maybe this also saves us? | 23:52 |
clarkb | fwiw the changes appear to be at https://opendev.org/openstack/project-config/commits/branch/master | 23:52 |
clarkb | another check for replication | 23:53 |
clarkb | pretty sure the error would be due to the projects being invalid in gerrit now on the old side | 23:53 |
clarkb | and things should work when merging the third | 23:53 |
ianw | i'm going to unemergency things how | 23:55 |
fungi | sounds good | 23:55 |
fungi | i assume how was now, and so a statement not a question ;) | 23:56 |
ianw | i also put in https://review.opendev.org/c/opendev/system-config/+/879412 to start, that's the gerrit 3.7 config file update | 23:57 |
fungi | thanks | 23:57 |
ianw | i'm going to merge the last change, let's see what happens | 23:57 |
fungi | the others reported for their deploy fails, right? | 23:57 |
opendevreview | Merged openstack/project-config master: Rename x/ovn-bgp-agent to openstack/ovn-bgp-agent https://review.opendev.org/c/openstack/project-config/+/879456 | 23:58 |
fungi | or i guess they don't actually deploy because of the config errors? | 23:58 |
ianw | fungi: yeah, they had config errors so didn't run | 23:59 |
ianw | https://zuul.opendev.org/t/openstack/stream/9891dc34ec3a4406b7eb34d39c7c91ff?logfile=console.log | 23:59 |
fungi | even better | 23:59 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!