clarkb | fungi: completely unrelated to ^ at 00:00 I always get exim panic log emails from lists. I assume those are actually old panics and we might be able to clear them out? But I'm not clued into the dark arts of email enough to know for sure | 00:05 |
---|---|---|
*** rlandy|ruck|biab is now known as rlandy|ruck | 00:06 | |
*** rlandy|ruck is now known as rlandy|out | 00:30 | |
fungi | clarkb: the one from lists.o.o looks like something was going on around 07:05:52-07:07:34 tuesday and again at 00:01:59 wednesday which caused contention for access to /var/spool/exim4/db/retry.lockfile, possibly just collisions between deliveries for different mailman sites? | 00:38 |
fungi | er, no would have to be between mailman processes i suppose | 00:39 |
fungi | maybe something else was locking it temporarily, but i can't imagine what | 00:39 |
fungi | s/mailman processes/exim processes/ i meant | 00:39 |
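For reference, a hedged sketch of the usual first checks for this kind of report on a Debian/Ubuntu exim4 host (standard paths assumed; the contended lockfile lives in the hints database directory fungi mentions):

```bash
sudo cat /var/log/exim4/paniclog                 # the daily cron mails root whenever this file is non-empty
sudo exim_dumpdb /var/spool/exim4 retry | head   # inspect the retry hints db guarded by retry.lockfile
sudo exim_tidydb /var/spool/exim4 retry          # prune stale retry entries if the db has grown
sudo truncate -s 0 /var/log/exim4/paniclog       # once understood, clear the paniclog so the nightly mail stops
```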
fungi | as for new debugging info in the zuul build inventories, is that the playbook context stuff? | 00:40 |
Clark[m] | Yup the new playbook context | 00:43 |
corvus | yeah, that's a bunch of info that was only in the executor logs earlier; should help advanced users figure out what zuul did in complex situations | 00:45 |
fungi | awesome, thanks! | 00:48 |
corvus | we should be good to do a rolling restart of schedulers+web whenever convenient to pick up the bugfix | 01:57 |
corvus | i'll start on that now | 02:38 |
corvus | zuul02 scheduler is restarting | 02:41 |
corvus | this time i'm just doing: docker-compose down; docker-compose up -d | 02:41 |
corvus | that seems to be working well so far | 02:41 |
corvus | 02 is done; restarting 01 now | 02:52 |
corvus | ah, this time zuul01 took too long to shut down and docker killed it; so i think we still need to tune that. | 02:54 |
corvus | i think that means i'm a chaos monkey and we just tested "kill a scheduler while it's in the middle of re-enqueuing all changes in a pipeline". that appears to have worked fine. | 02:58 |
ianw | haha i've been called worse | 03:01 |
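A minimal sketch of the shutdown-timeout tuning corvus refers to, assuming a standard docker-compose setup (the real zuul scheduler compose file may spell this differently):

```bash
# Give the scheduler more than the default 10s to shut down cleanly before docker kills it
docker-compose down --timeout 300
docker-compose up -d
# Or persistently, in docker-compose.yaml under the scheduler service:
#   stop_grace_period: 5m
```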
corvus | i'm going to restart the web server now; so expect status page outage | 03:03 |
corvus | looks like everything is up now | 03:12 |
opendevreview | Merged opendev/system-config master: Upgrade to gerrit 3.3.8 https://review.opendev.org/c/opendev/system-config/+/819733 | 03:14 |
ianw | i was going to sneak ^ in but you beat me to it :) | 03:16 |
corvus | oh sorry... | 03:21 |
corvus | it looks like there's a problem with the periodic-stable pipeline; it may be a result of my chaos-monkey action | 03:22 |
corvus | i'm going to see if i can manually correct it; otherwise we may need a full shutdown/start | 03:22 |
opendevreview | Merged openstack/diskimage-builder master: Fix BLS based bootloader installation https://review.opendev.org/c/openstack/diskimage-builder/+/818851 | 03:26 |
corvus | okay, i performed zk surgery to completely empty the periodic-stable pipeline and am now re-enqueuing it. i'll try to figure out what went wrong from the log files tomorrow | 03:30 |
corvus | there are a lot of failures in that pipeline now; i can't tell if they're legitimate, or if it has something to do with the 00000 commit sha they are all enqueued with | 03:34 |
corvus | i think it's too uncertain and we should just drop the queue | 03:35 |
corvus | which is unfortunate since we have no way to restore it | 03:35 |
corvus | i've done that now. | 03:36 |
corvus | status summary: everything is up and running, but we won't have periodic-stable results for today | 03:37 |
corvus | i'm out for the night | 03:44 |
ianw | thanks for looking after it! i'm sure i would have got it helplessly tangled up :) | 03:45 |
*** ysandeep|out is now known as ysandeep|ruck | 04:33 | |
*** pojadhav- is now known as pojadhav | 05:22 | |
*** ysandeep|ruck is now known as ysandeep|afk | 05:52 | |
*** ysandeep|afk is now known as ysandeep|ruck | 06:15 | |
*** raukadah is now known as chandankumar | 06:51 | |
*** ykarel__ is now known as ykarel | 07:08 | |
frickler | clarkb: fungi: I'm still trying to clean up exim paniclogs on other servers, but I didn't get mail from lists.o.o, likely because the aliases there were never updated. also ianw is missing from those aliases, not sure if that's intentional or not | 07:14 |
frickler | most of the locking errors seem to be happening at logrotate time, which with the focal upgrade seems to have moved from 06:25 to 00:00? | 07:17 |
frickler | I'll see whether one can tune the timeout | 07:18 |
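A hedged sketch of checking and tuning this: on focal, logrotate is driven by a systemd timer that fires at midnight rather than the old 06:25 cron.daily slot, so the schedule can be confirmed and staggered with a drop-in:

```bash
systemctl list-timers logrotate.timer   # confirm when it last fired / will next fire
systemctl cat logrotate.timer           # show the shipped OnCalendar=daily definition
sudo systemctl edit logrotate.timer     # add a drop-in, e.g.:
#   [Timer]
#   RandomizedDelaySec=30m              # stagger the midnight run to reduce lock contention
```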
*** ysandeep|ruck is now known as ysandeep|lunch | 07:23 | |
*** ysandeep|lunch is now known as ysandeep | 08:28 | |
*** ysandeep is now known as ysandeep|ruck | 08:28 | |
ianw | frickler: i didn't intentionally not update, i think just never got around to it! | 09:05 |
Unit193 | fungi: Well, not quite what we were hoping for, but at least https://launchpad.net/ubuntu/+source/pastebinit/1.5.1-1ubuntu1 is a start... | 10:05 |
ykarel | Is there some issue with zuul.openstack.org? it's not loading | 10:06 |
ykarel | https://zuul.opendev.org/t/openstack/status working though | 10:07 |
ykarel | inspecting returns: TypeError: "r is undefined" | 10:08 |
frickler | ianw: not sure if we were talking about the same thing. I meant to say that you are missing in the list of aliases to send root mail to on lists.o.o | 10:19 |
frickler | ykarel: I can confirm that, best use zuul.opendev.org for now. will need to wait for corvus to dig deeper I guess | 10:23 |
ykarel | frickler, ack and thanks for checking | 10:24 |
opendevreview | Arx Cruz proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 10:38 |
*** rlandy|out is now known as rlandy|ruck | 11:12 | |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:33 |
marios | fungi: whenever you next have some review time please add to your queue ^^^ i updated to use bash instead of jinja per comment thanks for looking | 11:34 |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:58 |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 12:01 |
*** pojadhav is now known as pojadhav|afk | 12:01 | |
*** pojadhav|afk is now known as pojadhav | 12:54 | |
*** pojadhav is now known as pojadhav|afk | 13:48 | |
*** ysandeep|ruck is now known as ysandeep|afk | 14:14 | |
dtantsur | can confirm the zuul.o.o problem | 14:17 |
*** ysandeep|afk is now known as ysandeep | 14:20 | |
*** ysandeep is now known as ysandeep|afk | 14:27 | |
*** ysandeep|afk is now known as ysandeep | 15:12 | |
corvus | that should be fixed by https://review.opendev.org/820184 | 15:27 |
*** ysandeep is now known as ysandeep|out | 15:34 | |
clarkb | as far as we know the system-config deploy jobs are running again right? I'll plan to approve the matrix-gerritbot update after gerrit user summit if so | 15:41 |
fungi | i believe so, yes. i haven't approved the lists.openinfra.dev addition yet though | 16:08 |
fungi | want to wait until i'm less distracted by meetings | 16:08 |
*** chandankumar is now known as raukadah | 16:12 | |
*** tosky_ is now known as tosky | 16:17 | |
*** marios is now known as marios|out | 16:35 | |
*** priteau is now known as Guest7388 | 16:38 | |
*** priteau_ is now known as priteau | 16:38 | |
clarkb | making this note here so I don't forget. Gerrit 3.4 (or is it 3.5?) allows usernames to be case insensitive. Existing installations remain case sensitive by default. We should check in our 3.3 to 3.4 test jobs that we don't break usernames | 16:45 |
clarkb | we can create a zuul and a Zuul user or similar and then going forward we should catch problems automatically | 16:45 |
clarkb | except we may need to toggle the config explicitly to avoid the default on new installs being insensitive. Anyway the testing we've got should cover this well, just need to update the system a bit | 16:47 |
fungi | we could in theory check for collisions, but i expect there are many | 16:48 |
clarkb | yes I know we have collisions just from the user cleanups I've done for the conflicting external ids problem | 16:53 |
clarkb | when people end up with a second user they often make their username a variant of the original | 16:54 |
clarkb | often by changing case of a character or three | 16:54 |
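A sketch of the test clarkb describes, run against a held/test Gerrit rather than production; the host, credentials and the exact config option name are assumptions to verify against the Gerrit release notes:

```bash
# Create two accounts whose usernames differ only in case via the create-account REST endpoint
curl -s -u admin:secret -X PUT -H 'Content-Type: application/json' \
  -d '{"name": "Zuul (lowercase username)"}' https://gerrit.test.example/a/accounts/zuul
curl -s -u admin:secret -X PUT -H 'Content-Type: application/json' \
  -d '{"name": "Zuul (uppercase username)"}' https://gerrit.test.example/a/accounts/Zuul
# Pin the case-sensitivity behaviour explicitly so a new-install default can't mask a regression
# (option name from memory, may differ):
git config -f etc/gerrit.config auth.userNameCaseInsensitive false
```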
fungi | clarkb: any opinion on whether we should be using base-test to vet https://review.opendev.org/820018 before approving? | 17:26 |
clarkb | fungi: it's probably sufficient to run the script locally if you want to avoid that dance | 17:28 |
fungi | i've approved 818826 to create lists.openinfra.dev and will keep an eye on it | 17:28 |
clarkb | but I think we should test it since the mirror config affects a lot of jobs | 17:28 |
fungi | yes, i looked at it very closely in order to spot obvious syntax or logic issues which could have broader fallout, but i'm not confident in my skills as a shell parser | 17:28 |
clarkb | But also, that config is long since deprecated iirc | 17:28 |
fungi | yes | 17:29 |
clarkb | we might suggest that starting with centos stream people use the proper mirror configuration tooling | 17:29 |
clarkb | but I'm indifferent to that, as shell script vars are useful in various contexts | 17:29 |
fungi | that's not a bad idea, it would be starting with stream 9 specifically though | 17:30 |
fungi | stream 8 didn't need changes to the mirroring | 17:30 |
clarkb | ah | 17:30 |
clarkb | ya the -ge 9 | 17:30 |
fungi | centos changed up their mirror path for stream 9 | 17:30 |
clarkb | tl;dr if the script as proposed runs locally I think we can approve it | 17:33 |
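Roughly the kind of local smoke test being suggested; the variable names and paths below are stand-ins, not the actual contents of 820018:

```bash
NODEPOOL_MIRROR_HOST=mirror.example.opendev.org
for CENTOS_RELEASE in 8 9; do
    if [ "${CENTOS_RELEASE}" -ge 9 ]; then
        # stream 9 moved to a different mirror path
        NODEPOOL_CENTOS_MIRROR="http://${NODEPOOL_MIRROR_HOST}/centos-stream"
    else
        NODEPOOL_CENTOS_MIRROR="http://${NODEPOOL_MIRROR_HOST}/centos"
    fi
    echo "${CENTOS_RELEASE}: ${NODEPOOL_CENTOS_MIRROR}"   # eyeball both results before approving
done
```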
opendevreview | Merged opendev/system-config master: Create a new lists.openinfra.dev mailing list site https://review.opendev.org/c/opendev/system-config/+/818826 | 17:57 |
clarkb | one thing I notice is that the order of jobs isn't quite what I expected but that must be an artifact of actually writing down our dependencies :) | 18:29 |
fungi | our dependencies aren't quite what we expected | 18:31 |
fungi | fwiw, looks like the periodic puppet-else job ran again, but /var/lib/storyboard/www/js/templates.js on storyboard.o.o did not get updated | 18:34 |
clarkb | I think the source isn't updated the way we think it is | 18:35 |
clarkb | git log -1 in /home/zuul/src/opendev.org/opendev/system-config shows Merge "Cache Ansible Galaxy on CI mirror servers" | 18:36 |
clarkb | we should probably hold off on making updates otherwise we'll have a giant pile of them that all apply at once when we fix that | 18:36 |
clarkb | also before manage-projects runs do we need to stop ansible? | 18:36 |
clarkb | (I don't know if we've changed projects.yaml in the last few days) | 18:37 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755 is the build that should've updated the source and looking at the logs I think we did | 18:38 |
clarkb | but then later jobs must've reset it or something? | 18:38 |
clarkb | I'm confused, and can't really debug right now as I'm trying to pay attention to gerrit user summit | 18:38 |
fungi | yeah, should i disable ansible on bridge for now? | 18:39 |
clarkb | probably? | 18:40 |
clarkb | the problem with the ansible disable is that we retry every job 3 times :/ | 18:40 |
clarkb | but I haven't come up with a better idea than that other than adding everything to the emergency file, but that is problematic for other reasons. I think the ansible disable is probably warranted until we can understand this better | 18:40 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#225-229 is not what is reflected on the system | 18:41 |
fungi | #status log Temporarily disabled ansible deployment through bridge.o.o while we troubleshoot system-config state there | 18:41 |
opendevstatus | fungi: finished logging | 18:41 |
clarkb | it synced to a different host | 18:42 |
clarkb | https://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#214 is not bridge | 18:42 |
clarkb | I think it was a single use test node? | 18:42 |
fungi | oho | 18:43 |
clarkb | so basically we're not updating system-config on bridge before running things. I think that we're likely ok except for potentially recreating an old project on gerrit if we had done renames, but we haven't done renames so should be fine | 18:43 |
clarkb | anyway back to gerrit user summit now that I've largely convinced myself we aren't breaking anything, just not updating the way we expected | 18:44 |
fungi | yeah, our deployments are basically just being deferred | 18:44 |
clarkb | ianw's day should be starting soon and may understand this | 18:44 |
clarkb | this is almost certainly a result of the switch to a single job to update system-config at the beginning of a buildset | 18:45 |
fungi | yeah, the zuul inventory for that build indicates there's an ubuntu-focal node | 18:45 |
clarkb | fungi: note that infra-prod-service-lists is running now (it must've started before you put the prevention in place, or we've broken the prevention in the CD refactor) but as mentioned previously I think this will just apply tuesday's state and we should be ok | 18:48 |
clarkb | (side note: the thing that tipped me off that we were updating a different host was that I checked the reflog on system-config and didn't see the refs shown in the job log) | 18:48 |
fungi | yeah | 18:51 |
fungi | 6bcf28b from 21:03:56 tuesday was the last update to ~zuul/src/opendev.org/opendev/system-config on bridge | 18:52 |
fungi | c663d9b from 00:50:45 wednesday was the next change which should have been updated there | 18:54 |
fungi | so the breakage started in that ~3.75hr timespan | 18:54 |
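For the record, the kind of check that confirmed the stale checkout (paths from the discussion above):

```bash
cd /home/zuul/src/opendev.org/opendev/system-config
git log -1 --format='%h %ci %s'   # last commit actually present on bridge
git reflog -5                     # recent ref updates; the refs from the job log never show up here
```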
clarkb | DISABLE-ANSIBLE is only evaluated in the setup src job | 18:55 |
clarkb | since we put the file in place after that job the other jobs are free to continue | 18:55 |
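Roughly what the DISABLE-ANSIBLE guard amounts to when a job does evaluate it (the flag path is from memory and may differ):

```bash
if [ -f /home/zuul/DISABLE-ANSIBLE ]; then
    echo "DISABLE-ANSIBLE flag present on bridge; refusing to run deploy playbooks" >&2
    exit 1
fi
```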
clarkb | fungi: it was almost certainly 9cccb02bb09671fc98e42b335e649589610b33cf/42df57b545d6f8dd314678174c281c249171c1d0 | 18:57 |
fungi | in theory 42df57b from 13:48:44 wednesday would have switched to running the correct job | 18:58 |
fungi | and that much it seems to have done | 18:58 |
fungi | but the job itself is not yet doing the right thing | 18:58 |
clarkb | well the key is we stopped updating system-config in the other jobs | 18:58 |
clarkb | and then started running a job that wasn't updating properly | 18:59 |
fungi | yep | 18:59 |
clarkb | We might get away with a simple revert for now. Then reevaluate from there | 18:59 |
clarkb | but might be good to see if ianw has an opinion first | 19:00 |
clarkb | Its still a bit early there though | 19:00 |
fungi | yeah, once he's around he may already have a clearer picture of what it was supposed to be doing vs what it's actually doing | 19:00 |
clarkb | opendev-infra-prod-base <- that job still seems to exist and the changes linked above switched us off of that. I think if we revert we'll go back to using this job and it should work? maybe? I hope? | 19:02 |
clarkb | heh | 19:02 |
clarkb | the hourly job runs are not running the source update job | 19:03 |
clarkb | so we've got another layer of problem: once we get things working, if we reenqueue stuff we'll apply updates and then hourly will undo them | 19:03 |
clarkb | I'm wondering if we shouldn't consider disabling ssh access since DISABLE-ANSIBLE is non functional | 19:03 |
clarkb | ya I think we need to revert for that reason either way | 19:04 |
clarkb | we can't safely roll forward without adding pipeline edits in addition to fixing the setup-src job | 19:05 |
fungi | so squash a revert of 9cccb02+42df57b i guess | 19:05 |
fungi | i can push that up | 19:05 |
clarkb | yes I think so. But I'm leaning towards lets disable ssh access, push the revert then wait for ianw to help untangle | 19:06 |
fungi | how do we globally disable ssh access to our servers? | 19:07 |
fungi | or do you mean just disable ssh access for zuul@bridge | 19:07 |
clarkb | fungi: you only need to disable it for zuul@bridge | 19:07 |
clarkb | move the authorized_keys file aside? | 19:07 |
fungi | we have a zuul-zone-zuul-ci.org-20200401 key and a zuul-opendev.org-20200401 authorized, i guess it's the latter? | 19:09 |
fungi | ahh, yeah the first is for dns i guess | 19:09 |
fungi | okay, i've commented out the zuul-opendev.org-20200401 key | 19:09 |
clarkb | ok I think I'm understanding what the setup-src job is doing that is wrong. Because it has a regular node (no nodes: []) we run the normal repo setup against the remote host | 19:12 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 19:12 |
clarkb | Then our tasks that run against bridge.openstack.org are completely skipped because it isn't in the inventory | 19:12 |
fungi | 820250 is a squashing of reverts for commits 42df57b and 9cccb02 | 19:12 |
clarkb | 70827542adfaf5816fdf396e61c5d021b0fa3769 is a flawed change | 19:14 |
clarkb | the assertion in the commit message is only half true | 19:15 |
clarkb | fungi: we need to revert ^ as well | 19:15 |
clarkb | because the inventory add in setup-keys is what was allowing setup-src.yaml to find bridge and update the system-config repo | 19:16 |
fungi | okay | 19:16 |
clarkb | when we dropped the inventory add from setup-keys we dropped the ability to update system-config | 19:16 |
fungi | i can't find that commit | 19:17 |
clarkb | fungi: it is in opendev/base-jobs | 19:17 |
fungi | oh, got it | 19:17 |
clarkb | I think the order is revert 70827542adfaf5816fdf396e61c5d021b0fa3769 then do 820250 | 19:17 |
clarkb | if we do it in the other order we'll still be broken | 19:17 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add" https://review.opendev.org/c/opendev/base-jobs/+/820251 | 19:18 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 19:19 |
fungi | depends-on added | 19:19 |
clarkb | Once we're reverted I think the plan forward is to update the setup-src job to not run with nodes first, then update our pipeline config updates as before but ensure the src update job is in all the pipelines and that all the jobs hard depend on that setup src job. We want them to fail if setup src fails. | 19:20 |
clarkb | But maybe we get back to where each job is updating system-config today so that we can reenqueue stuff (we have to be careful doing this because reenqueuing to deploy will use the exact change state, which means if we reenqueue out of order or whatever we can have problems) | 19:21 |
clarkb | then pick up the break out again next week? | 19:21 |
fungi | wfm | 19:22 |
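A hedged sketch of the shape clarkb's plan implies, written as a heredoc just to show the YAML; the job names are illustrative, not the exact opendev definitions:

```bash
cat <<'EOF'
- project:
    deploy:
      jobs:
        - infra-prod-bootstrap-bridge       # empty nodeset; updates system-config on bridge once per buildset
        - infra-prod-service-lists:
            dependencies:
              - name: infra-prod-bootstrap-bridge
                soft: false                 # hard dependency: this job is skipped if the bootstrap fails
EOF
```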
clarkb | re reenqueuing stuff, a safer approach may be to let something update system-config (the hourly deploy jobs most likely) then manually run other playbooks that we want to pick up that stuff | 19:23 |
fungi | sure. the daily will also kick off in a few hours | 19:23 |
fungi | well, ~6.5 i think | 19:23 |
clarkb | the last thing we need to sort out is where DISABLE-ANSIBLE got broken. That might also need a (partial) revert | 19:24 |
clarkb | ok I think the existing revert to go back to the old base job will return DISABLE-ANSIBLE before | 19:27 |
clarkb | s/before/behavior | 19:27 |
clarkb | I've +2'd both changes and left notes about what I've found in my debugging. I guess we can wait another hour or so to see what ianw thinks? | 19:29 |
clarkb | in the meantime infra-root please do not approve any system-config changes | 19:29 |
clarkb | fungi: we should make a list of changes to system-config and project-config to audit and rerun as necessary once happy again | 19:30 |
clarkb | for system-config f29aa2da1688ab445d78d3c6596467bae9281f48 3c993c317b79640c2f86d91559f6d2b7ec83d17a 4285b4092839daea4bb7d2574f2a8923310d8278 33fc2a4d4e0628f1580893579c275f0095ce7eec | 19:31 |
clarkb | of those the lists update is probably the most scary one. I think the gerrit image update wouldn't have really affected prod since all we'd do is pull the image maybe | 19:31 |
clarkb | the haproxy changes might update haproxy in production. | 19:32 |
fungi | i've got to step away to cook dinner (christine has something pressing at 21:00) but i can take a look once we eat | 19:32 |
clarkb | for project-config 9d2f65a663df801beae4385368c86a21fca83c8e is the only one we need to check but I think it landed early enough to not be a problem | 19:33 |
fungi | i can probably scrape a list of changes reported in here by gerritbot as a cross-check | 19:33 |
clarkb | so really just the system-config commits above and of those only the lists one is concerning. I think once we think we're fixed we manually update system-config and manually run the gitea load balancer, lists and gerrit playbooks | 19:34 |
clarkb | Then we can fix ssh for zuul on bridge and see if gerrit does the right thing? I guess the fear there is it might revert our checkout somehow but I think the risk of that is low | 19:34 |
clarkb | ya I'm going to need lunch soon so this is probbaly all fine to pause a bit until ianw is awake and can review what we've found and decide if the plan is good | 19:36 |
*** artom__ is now known as artom | 19:36 | |
Clark[m] | I've switched to lunch mode but just realized that maybe landing the system-config revert will trigger all the things to run? And maybe that is better than trying to manually run stuff? If we choose to manually run stuff we should do that before approving the revert I guess | 19:47 |
fungi | yeah, might make the most sense to put ssh key back and enable ansible when approving the system-config revert? | 19:48 |
Clark[m] | Ya possibly | 19:51 |
fungi | for visibility, should the disable-ansible check be its own role even? easier to see when and where we include it in each job that way | 20:03 |
Clark[m] | ++ | 20:05 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Make the disable-ansible check into its own role https://review.opendev.org/c/opendev/base-jobs/+/820258 | 20:31 |
fungi | that's the role, we can switch to it where convenient i guess | 20:31 |
corvus | i'd like to rolling restart zuul scheduler and web... any thoughts on timing? | 20:46 |
corvus | i mean, should be non-disruptive, but also non-zero-risk | 20:47 |
clarkb | corvus: well we're hoping to untangle the system-config breakage when ianw's day starts. Might be good to get through that first just so that we're not debugging zuul and system-config? | 20:50 |
clarkb | I think we've got the two changes necessary to do that proposed above https://review.opendev.org/c/opendev/base-jobs/+/820251 https://review.opendev.org/c/opendev/system-config/+/820250 but was hoping ianw could weigh in as he was driving that work | 20:50 |
corvus | yep, can wait. | 20:51 |
clarkb | I'm not sure how long we should wait on the off chance that ianw isn't around today. The base-jobs change should be super straightforward to land. It is the system-config change that is a bit more intertwined, but from what I can see that change is safe too | 21:13 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 21:14 |
corvus | i expect that to pass tests and ready for review now | 21:15 |
clarkb | that is unexpected, the hourly jobs are still managing to run somehow | 21:18 |
fungi | could they be authenticating with one of the other keys? | 21:19 |
clarkb | or a # doesn't do what we think it does in that file? | 21:20 |
clarkb | oh yup its the wrong key | 21:20 |
clarkb | the system-config jobs use the system-config key | 21:20 |
clarkb | the key you commented out is for the opendev.org zone I think | 21:20 |
fungi | oh, is that entry misnamed? | 21:21 |
clarkb | no it isn't misnamed, we just misinterpreted what it meant | 21:21 |
fungi | the zuul-ci.org one has a comment of zuul-zone-zuul-ci.org-20200401 | 21:21 |
fungi | the zuul-opendev.org-20200401 doesn't say "zone" in it | 21:22 |
clarkb | system-config/inventory/base/group_vars/all.yaml sets the value. I think it was just recorded that way | 21:22 |
fungi | i guess we should have called it zuul-zone-opendev.org-20200401 for consistency | 21:22 |
clarkb | yes. But also maybe we should move the file aside as we don't really want anything running until we're happy with the fixups? | 21:22 |
fungi | done, moved it temporarily to ~zuul/.ssh/disabled.authorized_keys | 21:23 |
clarkb | in the meantime should we go ahead and approve the base-jobs revert? | 21:24 |
clarkb | I'm going to rereview the system-config revert now with some fresh eyes to make sure we aren't missing anything | 21:24 |
fungi | yeah, i can approve the one for base-jobs | 21:24 |
clarkb | https://review.opendev.org/c/opendev/base-jobs/+/807807 was the last change to opendev-infra-prod-base. Which means we ran with that in place for about a week and it seemed to be working. The system-config revert switches us back to using that job | 21:26 |
clarkb | now to double check the contents of that job for changes | 21:27 |
clarkb | the two changes to the playbooks that job runs are the one to remove the inventory entry, which we are reverting, and another that renames a playbook, which I think is fine because it appears to have been just a 1:1 file name change for consistency with job names | 21:29 |
clarkb | and ya the git log for the rename shows no delta in the file itself | 21:29 |
clarkb | so ya I think the system-config revert is also safe. | 21:29 |
clarkb | fungi: once base-jobs lands should we approve the system-config revert and plan to move ssh authorized_keys back and also remove DISABLE-ANSIBLE? | 21:30 |
clarkb | then figure out if we need to run any playbooks by hand after it runs its jobs? | 21:30 |
clarkb | basically in my rereview I can't find anything that would indicate going back to the old situation of running the repo update for each job would be a problem | 21:31 |
opendevreview | Merged opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add" https://review.opendev.org/c/opendev/base-jobs/+/820251 | 21:34 |
clarkb | I guess give it a little longer in case ianw's day is still booting up and then plan to approve the other revert at the top of the hour otherwise? | 21:36 |
clarkb | unrelated gitea just made a new 1.15 release with a bunch of bugfixes | 21:40 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.15.7 https://review.opendev.org/c/opendev/system-config/+/820267 | 21:47 |
clarkb | unlikely to land that today, but we can start the CI process on it | 21:47 |
fungi | yeah, top of the hour wfm... put ssh keys back, undo the disable-ansible, approve the change | 21:53 |
clarkb | I think we might want to approve first so that the hourly jobs can quickly cycle out | 21:54 |
clarkb | but then ya reenable things with the plan being the change will land and have a go | 21:55 |
ianw | sorry, here now! | 21:56 |
ianw | just reading | 21:57 |
clarkb | ianw: oh hi! so basically there are a few issues we discovered with the CD refactors that landed most recently. The main issue is system-config on bridge isn't being updated by the -src job | 21:57 |
clarkb | ianw: the reason for this is that we aren't adding bridge to the inventory anymore since we removed that from the keys playbook. But even if we fix that we also noticed that we aren't running the update job on hourly deploy or the daily periodic pipeline | 21:58 |
clarkb | separately we also found that only the -src job was checking DISABLE-ANSIBLE, which means you can't really get ahead of the next job, only the next buildset | 21:58 |
clarkb | fungi pushed up two revert changes, the first of which has landed and restores the inventory stuff to the setup-keys playbook. The other revert has us going back to the every-job-updates-system-config state so that we can roll forward addressing the whole set of issues | 21:59 |
ianw | ok, i thought it all seemed to be going too easily :) | 22:00 |
clarkb | ianw: I tried to leave comments on the revert changes to serve as hints for the future fixups but right now the priority is getting things working again as we are building up a delta (gitea haproxy, gerrit image update, and lists.openinfra.dev changes) that hasn't applied fully | 22:00 |
clarkb | We suspect that if we land the system-config revert then a bunch of those jobs will run, so we can reenable zuul access to bridge and approve that if you are happy with that plan | 22:01 |
clarkb | we disabled ansible so that we could figure out what was going on. I think at this point I'm reasonably well convinced it wasn't doing anything bad, just not doing anything new. We can probably reenable whenever I suppose | 22:02 |
clarkb | fungi: in https://review.opendev.org/c/opendev/base-jobs/+/820258 I think you can go ahead and add that role to the base job playbooks? | 22:03 |
fungi | is that safe? i suppose it is | 22:04 |
clarkb | fungi: ya it should be | 22:04 |
clarkb | with the usual caveats that updating base jobs is tricky and we should monitor | 22:04 |
fungi | does it need to be scoped to a specific inventory host? | 22:04 |
clarkb | fungi: yes it needs to only check on bridge | 22:04 |
clarkb | fungi: I think you can put that in the setup-keys playbook that adds bridge to the inventory | 22:05 |
clarkb | something like that should work well. And we can land it later when we are able to monitor and out of the unhappy current state | 22:05 |
ianw | thanks, 820250 is approved so we can get things moving | 22:05 |
fungi | ahh, okay, i assumed we'd want to explicitly add it to other jobs, but i guess if it's in base then it's implicitly added to all jobs without us needing to do anything | 22:05 |
clarkb | fungi: exactly | 22:06 |
fungi | with 820250 approved i should put back the ssh keys and undo the disable-ansible now? | 22:06 |
clarkb | fungi: if you do that the hourly jobs will run which will delay when the 820250 jobs start. I think if we can wait for hourly to finish and then reenable that would be best | 22:06 |
clarkb | but that only works if 820250 doesn't merge first :) | 22:06 |
fungi | got it | 22:07 |
fungi | i'll try to keep an eye on the screen | 22:07 |
clarkb | I think the hourly jobs need about 4-5 more minutes to cycle out. 820250 hasn't started all jobs yet so we should have some time | 22:07 |
clarkb | oh it just started and zuul says 26 minutes so ya we should be good to wait on the hourlies to finish first | 22:07 |
ianw | i do wonder if we want every job checking DISABLE-ANSIBLE | 22:10 |
ianw | i did totally overlook the other pipelines | 22:11 |
clarkb | for me at least its nice to be able to recognize there is an issue and then hit the off switch. I suppose if we want to keep things more fine grained we could say the ssh keys are the big red button and DISABLE-ANSIBLE is more graceful | 22:12 |
ianw | i guess you're saying you might want to stop things between the end of the src job and the other jobs starting? | 22:13 |
clarkb | yes or between some other job in the list and the next one if we realize something is off | 22:13 |
ianw | i was mostly thinking that cloning the source would be the place it stops; i don't have a problem with the flag as such | 22:14 |
ianw | hmm, fair enough. does the new zuul authentication bits give the option to cancel a buildset too? | 22:14 |
clarkb | maybe? we can dequeue with gearman as long as that still exists too | 22:15 |
clarkb | fungi the last job in the hourly buildset is about to timeout once that is done I think we can restore the ssh keys and remove DISABLE-ANSIBLE | 22:16 |
clarkb | fungi: it's done, we can reenable now. Were you going to do that or should I? | 22:17 |
corvus | yes you can dequeue an item | 22:18 |
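For reference, roughly what the dequeue invocation looks like with the gearman-based CLI (run wherever the zuul RPC client is available, e.g. in the scheduler container; the project/ref values are just examples):

```bash
zuul dequeue --tenant openstack --pipeline periodic-stable \
    --project opendev.org/openstack/nova --ref refs/heads/stable/xena
```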
clarkb | I went ahead and removed DISABLE-ANSIBLE and put the authorized_keys file back | 22:19 |
clarkb | we're making CD omelets | 22:20 |
opendevreview | Merged opendev/system-config master: Revert "infra-prod: clone source once" https://review.opendev.org/c/opendev/system-config/+/820250 | 22:23 |
clarkb | re Gerrit User Summit I did try to take a bunch of notes which I'll try to curate and post up somewhere. I think the big thing for us to think about is case sensitive username settings in 3.4 before we upgrade. Just to be sure that doesn't bite us later | 22:23 |
clarkb | but I also understand how the new check stuff works | 22:24 |
clarkb | For the new checks stuff you write a plugin that queries some CI endpoint for a change (in our case it would hit the zuul rest api I think). Then the plugin emits data in their standard format to the central checks UI system | 22:25 |
clarkb | then they handle all the rendering for you | 22:25 |
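Illustrative only: the kind of Zuul REST queries such a plugin would issue for a change (these endpoints exist in Zuul's web API; the change number is just an example):

```bash
curl -s 'https://zuul.opendev.org/api/tenant/openstack/status/change/820250,1' | python3 -m json.tool
curl -s 'https://zuul.opendev.org/api/tenant/openstack/builds?change=820250' | python3 -m json.tool
```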
clarkb | that was anti climactic it decided to not run any jobs | 22:27 |
clarkb | I guess because no jobs trigger on the base job updating? | 22:27 |
clarkb | just when you think you understand how computers work they remind you that no no you do not :) | 22:27 |
clarkb | Should we just wait for the hourly runs to happen then we can manually run the gitea-lb playbook and the lists playbook? | 22:27 |
fungi | thanks, i got back to the keyboard too late | 22:28 |
clarkb | My one concern with manually updating the system-config checkout is that we won't know that the jobs are doing it properly | 22:28 |
clarkb | I think I've decided we don't need to do the review playbook as all we did was update the image and those did promote to docker hub properly | 22:28 |
clarkb | or we can enqueue the lists change to deploy | 22:29 |
clarkb | that was the last system-config change to land. I don't think we should enqueue any older changes as that will create confusion | 22:29 |
fungi | i'm okay waiting for the hourly deploy | 22:30 |
clarkb | cool that wfm too then | 22:30 |
fungi | slightly worried that we've picked apart our deploy jobs enough that reenqueuing a particular change may not run everything anyway | 22:30 |
clarkb | fungi: ya it would only run whatever jobs it enqueued previously | 22:32 |
clarkb | though will it use the old state of the jobs too? I don't think so | 22:32 |
ianw | fungi: why do you think the deploy jobs won't run? | 22:34 |
clarkb | ianw: well the lists addition change won't run jobs for haproxy on gitea for example | 22:35 |
clarkb | but it will run some jobs related to lists | 22:35 |
ianw | oh right, yes i see what you mean | 22:36 |
clarkb | but we can manually run those playbooks once we're happy the automated jobs are updating commits properly | 22:37 |
fungi | hence the list of missing commits from the dark time | 22:38 |
fungi | so we know what needs to be rerun | 22:38 |
ianw | do we need infra-prod-setup-src, or should it just be part of infra-prod-install-ansible? | 22:53 |
clarkb | ianw: hrm thats a good question. I think if we're hard depending on the source update job and there is another job we always want to run it could pull double duty | 22:54 |
clarkb | call it prep-bridge or similar? | 22:54 |
ianw | maybe bootstrap-bridge? | 22:57 |
clarkb | ++ | 22:59 |
clarkb | hourly jobs are starting now | 23:00 |
fungi | good, i'm mostly back around again now | 23:01 |
clarkb | woot it just updated system-config | 23:02 |
clarkb | I think we're good. And can proceed with running the lists and gitea haproxy playbooks when we like (I don't think either of those playbooks conflicts with the jobs that hourly runs) | 23:02 |
clarkb | service-gitea-lb.yaml <- that is the playbook we run for the gitea lb. I'll go ahead and run it now | 23:04 |
clarkb | that is done. It updated the docker compose file to set the ro flag on the config bind mount and restarted the container | 23:06 |
clarkb | I can still reach https://opendev.org | 23:07 |
fungi | same | 23:07 |
clarkb | I think we're good | 23:07 |
ianw | thanks! | 23:07 |
clarkb | service-lists.yaml is the lists playbook. Fungi did you want to run that one? | 23:07 |
clarkb | `sudo ansible-playbook -f 20 -v /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-gitea-lb.yaml` is the command I used for the gitealb | 23:07 |
fungi | i can, just a sec | 23:08 |
clarkb | just need to wap out the playbook name | 23:08 |
clarkb | I'm happy to join a screen if you want to run it in screen too | 23:08 |
fungi | cueued up in a root screen session now | 23:08 |
fungi | er, cued | 23:09 |
clarkb | I'm in the screen and that command looks right to me | 23:09 |
fungi | well, also queued | 23:09 |
fungi | okay, running | 23:09 |
clarkb | interestingly infra-prod-service-bridge needs to be retried? | 23:10 |
clarkb | there doesn't appear to be a new playbook log file from that job in our ansible log dir | 23:11 |
fungi | it's working on adding the new site now | 23:11 |
clarkb | corvus: is there a good way to see those logs from a failed but will be retried job somewhere? | 23:11 |
clarkb | fungi: considering how long this command is taking I wonder if it is stuck on a read like we had before | 23:12 |
fungi | yeah, looking | 23:12 |
fungi | it's in an epoll loop | 23:13 |
fungi | epoll_wait(5, [], 2, 1000) = 0 | 23:13 |
fungi | wait4(2537565, 0x7ffdcda5976c, WNOHANG, NULL) = 0 | 23:13 |
fungi | clock_gettime(CLOCK_MONOTONIC, {tv_sec=1398764, tv_nsec=118518927}) = 0 | 23:13 |
fungi | i think | 23:13 |
clarkb | that task is the one we fixed for the read | 23:14 |
clarkb | by setting stdin: '' | 23:14 |
fungi | i don't see any child processes of that AnsiballZ_command.py anyway | 23:14 |
clarkb | fungi: ps shows it `ps -elf | grep newlist` | 23:15 |
fungi | oh, yup, my ps afuxww wrapped at an inconvenient column | 23:15 |
fungi | that newlist command looked like it wasn't a child so i skimmed past | 23:16 |
fungi | so i wonder why newlist would hang | 23:16 |
clarkb | strace says a read on fp 0 | 23:17 |
fungi | yes, it does | 23:17 |
clarkb | which seems like the same issue as before | 23:17 |
fungi | so waiting on a pipe | 23:17 |
* fungi sighs | 23:17 | |
clarkb | well fd 0 is stdin | 23:17 |
fungi | right, waiting on something to pipe into it i meant | 23:17 |
clarkb | what's weird is we fixed this and made sure the fix worked, I thought | 23:17 |
fungi | i thought so too | 23:18 |
clarkb | is there something special about newlisting the mailman list? | 23:18 |
fungi | it was prompting for confirmation last time, right? | 23:18 |
clarkb | fungi: prompting to send confirmation emails iirc ya | 23:18 |
clarkb | to the list admin | 23:18 |
fungi | ansible was making it look like a tty which caused it to go interactive | 23:18 |
clarkb | we didn't catch it in testing because testing sets the flag to not send notifications | 23:19 |
clarkb | but we do want those notifications in production :/ | 23:19 |
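A sketch of the trade-off being described with Mailman 2's newlist (list/owner values here are hypothetical): without -q it stops at the "Hit enter to notify ... owner" prompt and hangs under Ansible, while -q skips both the prompt and the notification mail that is wanted in production:

```bash
newlist mailman listadmin@example.org somepassword              # blocks waiting for Enter under Ansible
echo '' | newlist mailman listadmin@example.org somepassword    # empty stdin satisfies the prompt, still notifies
newlist -q mailman listadmin@example.org somepassword           # no prompt, but also no owner notification email
```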
fungi | i guess we should kill the hanging newlist or wait for the task to timeout | 23:19 |
clarkb | ya I think killing the newlist is probably best. Then we can put lists.o.o and lists.kc.io in the emergency file and go try and reproduce in testing? | 23:20 |
fungi | well, emergency file shouldn't be necessary for lists.k.i unless we try to add a list to it | 23:20 |
fungi | but may as well | 23:20 |
clarkb | good point | 23:21 |
fungi | i'll add them both and then kill the newlist process | 23:21 |
clarkb | sounds like a plan. We may need to kill a few more newlists if it continues to try after the failed attempt (I think it will short circuit though and we should have a half-configured site that we can ignore?) | 23:21 |
clarkb | ya appears to have short circuited | 23:22 |
fungi | i didn't initially add any other lists so it was only trying to create the default metalist | 23:22 |
clarkb | yup and I'm wondering if that metalist has additional prompts from newlist? | 23:22 |
clarkb | since we know that adding a normal list seems to work fine we have done a few of those iirc | 23:22 |
fungi | i suppose it might | 23:23 |
clarkb | but we should be able to work through it via held test nodes | 23:23 |
fungi | anyway, it's in emergency disable now, i can probably try to debug more tomorrow | 23:23 |
clarkb | yup thanks | 23:23 |
clarkb | I need to take a break to get some stuff done while the sun is still up | 23:23 |
clarkb | The other thing on my list was to restart gerrit on the new image. But will see where we are at later and if I've got brain space for that | 23:24 |
ianw | i'll be happy to do that when it's a bit quieter in a few hours | 23:28 |
corvus | clarkb: yes, you can find logs of retried builds by going to the buildset, and you can get to the buildset by clicking on any completed job in the buildset to get the build page for that build, then click the buildset link. example: Bearer | 23:29 |
corvus | eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpYXQiOjE2Mzg0ODc0MDEuNjMwOTM5MiwiZXhwIjoxNjM4NDg4MDAxLjYzMDkzOTIsImlzcyI6Inp1dWxfb3BlcmF0b3IiLCJhdWQiOiJ6dXVsLmV4YW1wbGUuY29tIiwic3ViIjoicm9vdCIsInp1dWwiOnsiYWRtaW4iOlsibm9uZSJdfX0.ONXqLWPTlGEUa-rKkjYHnclbtsS2sxsD9FIPY7kjV3M | 23:29 |
corvus | oh dear that's not the right example :) | 23:30 |
ianw | i'll rework the parallel changes into another series of "noop" jobs | 23:30 |
corvus | example: https://zuul.opendev.org/t/openstack/buildset/6cb2b00359e349ba954be34c2f06904a | 23:30 |
corvus | (that is not an important token, ftr) | 23:30 |
ianw | hopefully that meets the definition of noop this time | 23:31 |
opendevreview | Merged opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 23:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Add local auth provider to zuul https://review.opendev.org/c/opendev/system-config/+/820276 | 23:39 |
ianw | i'm keeping an eye on ^^. it's a very quick revert, but it was only an if conditional | 23:40 |
ianw | (i mean, if it does go wrong, it can be a quick revert) | 23:40 |
opendevreview | James E. Blair proposed openstack/project-config master: Add REST api auth rules https://review.opendev.org/c/openstack/project-config/+/820277 | 23:43 |
corvus | infra-root: the ansible hostvars file group_vars/grafana_opendev.yaml is not checked into git. should it be? | 23:44 |
corvus | infra-root: (also there are several *.old files which seems redundant for content that's in a git repo, should they be deleted?) | 23:45 |
fungi | ianw: ^ is that something you were working on? | 23:45 |
ianw | yeah, looking, it might be something i've left behind | 23:45 |
fungi | corvus: i'd delete old/backup copies yes | 23:45 |
corvus | i'll wait for ianw to clear before i do anything | 23:45 |
ianw | yeah it was from the swizzle time; that group went with https://review.opendev.org/c/opendev/system-config/+/739625 | 23:46 |
ianw | i'll rm it | 23:46 |
ianw | .. done | 23:47 |
corvus | thx. i'm going to rm emergency.yaml.old groups.yaml.old openstack.yaml.old | 23:48 |
ianw | ++ | 23:52 |
opendevreview | James E. Blair proposed openstack/project-config master: Add REST api auth rules https://review.opendev.org/c/openstack/project-config/+/820277 | 23:54 |
clarkb | thanks for doing that cleanup. I'm back at the computer and will try to be useful again | 23:58 |
clarkb | first up understanding why the bridge job retried | 23:58 |
corvus | at this point in the day, i don't think i have time to do the rolling zuul restart i asked about earlier... if someone wants to do that once things settle down, feel free, otherwise i'll ask again tomorrow. meanwhile, https://review.opendev.org/819923 https://review.opendev.org/820276 and https://review.opendev.org/820277 are all ready to merge. we should merge the latter two soon. like, before the gearman removal happens. | 23:58 |
clarkb | https://zuul.opendev.org/t/openstack/build/317db45bca0a45ba8d79e491b74b1f5c it hit the exact time the haproxy was not working | 23:58 |
clarkb | I can review those. I've already reviewed the keycloak change, but really the other two seem urgent and worth a check | 23:59 |