*** ysandeep is now known as ysandeep|afk | 00:09 | |
*** dviroel|rover is now known as dviroel|out | 00:10 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 00:29 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 00:34 |
ianw | hrm, not sure how to test the upload-pypi role | 00:46 |
ianw | i can make a limited api key that can only update one project on test.pypi.org and we can assume that is public, and use it in zuul-jobs | 00:47 |
ianw | i mean to say test it automatically, rather than just a one-off manual approach | 00:48 |
ianw | i think the best approach might be to test upload-pypi in zuul jobs with an api key separately and manually, before committing. then we can have the switch ready in project-config and merge it just before we do something that will push to pypi like a dib release, and monitor it closely | 00:51 |
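(A note on the mechanics being added here: PyPI API tokens are supplied to twine as a password with the fixed username __token__. A minimal sketch of the upload the role would perform against test.pypi.org, with the token environment variable name purely illustrative:)
    # upload built artifacts to test.pypi.org with a project-scoped API token
    twine upload \
        --repository-url https://test.pypi.org/legacy/ \
        --username __token__ \
        --password "$TEST_PYPI_TOKEN" \
        dist/*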
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:04 |
fungi | maybe uploads of opendev/sandbox? | 01:11 |
fungi | though we may be missing a lot of the necessary bits for that | 01:12 |
*** ysandeep|afk is now known as ysandeep | 01:16 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 01:20 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:20 |
ianw | yeah, it would be a bit of a pain to make something that increases its version number on every gate check | 01:21 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:38 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:47 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 01:52 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:04 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:32 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:50 |
*** ysandeep is now known as ysandeep|afk | 03:19 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 04:27 |
*** ysandeep|afk is now known as ysandeep | 04:44 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 05:03 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:07 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:20 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:27 |
*** ysandeep is now known as ysandeep|afk | 06:44 | |
*** soniya is now known as soniya|ruck | 06:48 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 07:01 |
*** ysandeep|afk is now known as ysandeep | 07:40 | |
*** ysandeep is now known as ysandeep|lunch | 08:25 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 08:34 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 08:53 |
*** anbanerj is now known as frenzy_friday | 09:17 | |
*** soniya|ruck is now known as soniya|ruck|lunch | 09:41 | |
*** soniya|ruck|lunch is now known as soniya|ruck | 10:09 | |
*** soniya|ruck is now known as soniya|ruck|afk | 10:11 | |
*** rlandy|out is now known as rlandy | 10:26 | |
*** ysandeep|lunch is now known as ysandeep | 10:40 | |
ianw | fungi: https://review.opendev.org/q/topic:upload-pypi-api is the base work for pypi api upload | 10:57 |
*** soniya|ruck|afk is now known as soniya|ruck | 11:07 | |
*** rlandy is now known as rlandy|rover | 11:15 | |
*** dviroel is now known as dviroel|rover | 12:12 | |
*** rlandy|rover is now known as rlandy | 12:23 | |
*** ysandeep is now known as ysandeep|afk | 12:59 | |
*** ysandeep|afk is now known as ysandeep | 13:31 | |
mnaser | infra-root: https://tarballs.opendev.org is returning forbidden | 14:27 |
fungi | looking | 14:28 |
fungi | may be an afs outage | 14:28 |
mnaser | thank you fungi ! | 14:28 |
fungi | apache throwing lots of kernel oopses in dmesg | 14:29 |
*** dasm|off is now known as dasm | 14:29 | |
fungi | [Wed Jul 13 13:15:38 2022] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server) | 14:29 |
mnaser | that'll do it | 14:30 |
mnaser | 104.130.138.161 is not pingable | 14:30 |
fungi | time reported by dmesg may also not be accurate so that may be more recent than an hour ago | 14:30 |
mnaser | so maybe afs could be the real issue here (unless that ip is not pingable by icmp) | 14:30 |
fungi | yeah, that's afs01.dfw.openstack.org | 14:30 |
fungi | trying to ssh into it now but it's hanging | 14:31 |
fungi | i'll check the oob console | 14:31 |
mnaser | is afs02 a replica for afs01 ? | 14:31 |
fungi | yes, for most things anyway | 14:31 |
mnaser | im wondering why it didnt fall back to that | 14:31 |
fungi | it did for some volumes, but doesn't seem to have for tarballs | 14:32 |
fungi | possible something is wrong/stuck with the replica for it | 14:32 |
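(A hedged aside on checking replica state: vos examine shows a volume's RW site, RO sites and last-release status, and vos listvldb lists what lives on a given fileserver. The volume name below is a guess, not taken from the log:)
    # does the tarballs volume have a current RO replica, and where?
    vos examine project.tarballs
    # which volumes are hosted on the unreachable fileserver?
    vos listvldb -server afs01.dfw.openstack.org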
fungi | for now i'm going to dig into what's happening to afs01.dfw though | 14:33 |
fungi | infra-root: ^ heads up, and also i have a conference call i have to jump to in 25 minutes, just fyi | 14:33 |
fungi | ticket from rackspace: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on alerted our monitoring systems at 2022-07-13T13:29:01.300633. We are currently investigating the issue and will update you as soon as we have additional information regarding what is causing the alert. | 14:36 |
mnaser | ah | 14:36 |
fungi | followup: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts. | 14:36 |
fungi | that followup was stamped roughly an hour ago | 14:37 |
fungi | so i guess the instance didn't come back when the host rebooted | 14:37 |
fungi | yeah, the api reports it in an "error" state | 14:38 |
fungi | fault | {'message': 'Storage error: Reached maximum number of retries trying to unplug VBD OpaqueRef:6d2337f7-aa1d-46b3-5da6-209ac49fd06b', 'code': 500, 'created': '2022-04-28T20:06:53Z'} | 14:40 |
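(That status/fault pair is the sort of output the compute API returns; roughly, with the exact invocation assumed:)
    # inspect the instance state and the last fault nova recorded
    openstack server show afs01.dfw.openstack.org -c status -c fault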
mnaser | the date of that fault seems to show that its unrelated | 14:41 |
mnaser | (also wow what a throwback to see 'OpaqueRef', old school xenserver code) | 14:41 |
mnaser | afaik the nova api will let you hard reboot if vm is in error state | 14:41 |
fungi | afs01.dfw has four volumes in cinder, all in-use, none of which match that uuid | 14:43 |
mnaser | that is an internal uuid used by xenserver | 14:43 |
fungi | ahh | 14:44 |
fungi | so no clue which cinder volume it might be | 14:44 |
fungi | anyway, yeah, i'll try a hard reboot and hope we don't corrupt any filesystems | 14:44 |
fungi | fault | {'message': 'Failure', 'code': 500, 'created': '2022-07-13T14:45:08Z'} | 14:46 |
fungi | that's less than helpful | 14:46 |
fungi | putting it into shutoff for a minute and then trying a server start | 14:47 |
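(A sketch of that recovery sequence with the standard OpenStack CLI commands:)
    # a hard reboot will sometimes clear an ERROR state
    openstack server reboot --hard afs01.dfw.openstack.org
    # failing that, force it off and attempt a clean start
    openstack server stop afs01.dfw.openstack.org
    openstack server start afs01.dfw.openstack.org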
fungi | it went into shutoff state fine, but server start seems to be getting ignored now | 14:49 |
mnaser | i think the hypervisor feels borked :( | 14:49 |
fungi | yeah, i'll follow up on the ticket they opened about the host reboot | 14:50 |
fungi | in the meantime we can see if the read-only replica for tarballs can be brought online | 14:50 |
fungi | #status notice Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery | 14:51 |
opendevstatus | fungi: sending notice | 14:51 |
-opendevstatus- NOTICE: Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery | 14:52 | |
*** dviroel|rover is now known as dviroel|rover|biab | 14:53 | |
opendevstatus | fungi: finished sending notice | 14:54 |
fungi | infra-root: i've updated the ticket (220713-ord-0002114) and am awaiting further response from rackspace support | 14:55 |
fungi | i probably don't have time to dig into what's preventing failover for the tarballs volume to the read-only replica before my call in a few minutes, but can try to poke at it some. also we should disable afs volume releases in the meantime and work on doing a full switchover to afs02.dfw | 14:57 |
*** soniya is now known as soniya|ruck | 15:01 | |
*** ysandeep is now known as ysandeep|out | 15:04 | |
Clark[m] | I'm getting my morning started but need to do quick system updates first. | 15:12 |
Clark[m] | fungi: are we serving the RW path on static? | 15:12 |
jrosser | would this be related? https://mirror-int.dfw.rax.opendev.org/ubuntu/dists/bionic/universe/binary-amd64/Packages 403 Forbidden [IP: 10.209.161.66 443] | 15:16
clarkb | jrosser: likely yes | 15:21 |
clarkb | static's dmesg has a number of tracebacks involving afs after losing contact with the server. mirror.dfw does not | 15:24
clarkb | all three afsdb0X servers report they are running happily according to bos status so I'm not sure why failover wouldn't have happened except for maybe talking to RW paths instead of RO paths | 15:27 |
clarkb | or maybe the kernel tracebacks crashed afs hard enough to prevent failover on the client side? | 15:27
clarkb | looking at /var/www/mirror on mirror.dfw I think some volumes failed over and others did not | 15:29 |
clarkb | https://mirror.ord.rax.opendev.org/ubuntu/dists/bionic/universe/ has content so ya this may be ~luck of the draw on individual clients for handling failovers. | 15:29 |
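(For reference, the usual client-side nudges when a cache manager is stuck on a dead fileserver are roughly the following, assuming the Debian/Ubuntu service name and an illustrative path:)
    # ask the cache manager to re-probe which fileservers are up
    fs checkservers
    # drop cached data for a path that keeps returning failures
    fs flushvolume /afs/openstack.org/mirror/ubuntu
    # heavier hammer: restart the client outright
    sudo systemctl restart openafs-client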
clarkb | I'm trying to restart openafs on mirror.iad3.inmotion | 15:31 |
clarkb | it isn't going very quickly | 15:31 |
clarkb | ok that was slow but it didn't seem to break anything. I'll try that on mirror.dfw now | 15:33 |
clarkb | before I did that I simply navigated to the path on the fs and now it seems happy on dfw? | 15:35 |
clarkb | I wonder if we cached a failed lookup in apache and then apache stopped trying to hit the fs to refresh the failover? | 15:35
fungi | yeah, apache restarts might help, i suppose | 15:36 |
fungi | and sorry, confcall is pretty distracting | 15:36 |
fungi | but should be free again in ~25 minutes | 15:36 |
clarkb | ok ya I think mirror.dfw is good now simply by manually traversing the path on afs | 15:36 |
clarkb | I'll check the other mirrors first (I know that static is probably what people want more but I feel like I'm learning and mirrors are far less stressful) | 15:37 |
rlandy | hi ... Failed to fetch https://mirror.iad3.inmotion.opendev.org/ubuntu/dists/focal/main/binary-amd64/Packages 403 Forbidden [IP: 173.231.253.126 443] | 15:37 |
rlandy | mirror.iad3.inmotion.opendev.org seems to be the failing mirror now for us | 15:38 |
clarkb | rlandy: it is working now I think. Note timestamps and also links to failures are always useful. But ya I think that particular mirror as well as dfw is happy now | 15:38
clarkb | (it could be that failure occurred when I restarted openafs) | 15:38
rlandy | clarkb: thanks - will watch that | 15:39 |
clarkb | if we see failures after this point in time for mirror.dfw and mirror.iad3 let us know. And now I'm looking at the others | 15:39 |
rlandy | failures are probably from an hour back | 15:39 |
clarkb | mirror.mtl01.iweb appears happy | 15:40 |
clarkb | mirror.ord and mirror.iad as well. None of them have the tracebacks like static does | 15:40 |
clarkb | both ovh mirrors are similarly happy from what I see. No tracebacks either | 15:42 |
*** marios is now known as marios|out | 15:44 | |
clarkb | ya all the mirrors appear ok now based on filesystem listings against /var/www/mirror | 15:45 |
clarkb | none contain the dmesg tracebacks that static shows | 15:45 |
rlandy | thanks for checking | 15:47 |
clarkb | looks like tarballs is up too? I wonder if taking the unhappy fileserver down was what we needed to failover | 15:47 |
clarkb | fungi: ^ | 15:47 |
fungi | i can check in a few | 15:48 |
clarkb | I *think* we're in a good state now via failover to RO volumes on afs02.dfw | 15:48 |
clarkb | I think the next steps are likely going to be disabling any vos releases so that we don't possibly replicate corrupted RW volumes on afs01 to RO volumes on afs02 when 01 comes back (openafs likely protects against this but I'm not sure) | 15:49 |
clarkb | then we can bring back afs01 and convert its volumes to RO and switch 02 to RW then enable releases in the other direction? | 15:50 |
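(A hedged sketch of what that switchover could look like with vos; the volume name and partition are placeholders, and this assumes afs01's RW copies really are unusable:)
    # promote the RO copy on afs02 to RW (only when the original RW is lost)
    vos convertROtoRW -server afs02.dfw.openstack.org -partition vicepa -id mirror.ubuntu
    # later, re-add a replica site on afs01 and release to it
    vos addsite -server afs01.dfw.openstack.org -partition vicepa -id mirror.ubuntu
    vos release mirror.ubuntu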
fungi | also possible 01 came back up, i haven't checked yet | 15:54 |
clarkb | it doesn't ping and there is no ssh | 15:55 |
clarkb | anyway I didn't really have to do anything on the servers other than navigate their /afs/openstack.org/mirror and /afs/openstack.org/project paths and that seemed to make things happy. Either that or the shutdown of afs01 caused the afs db servers to finally notice it is down and fail over | 15:56 |
clarkb | I believe we are in a RO state with content being served. I've notified the release team to not do releases and updated the mailing list thread with this info | 15:56
clarkb | I'm going to take a break now as I haven't had breakfast yet and I have a bunch of email to catch up on after being out for a few days | 15:57 |
fungi | thanks! i'm freeing up again now for a bit, but will have an errand to run soon as well, so will see what i can get done on this in the meantime | 16:00 |
clarkb | fungi: I think holding locks/commenting out vos release cron jobs so that we control how, when and what syncs when afs01 is back is the next thing | 16:01 |
clarkb | and then it is probably just a matter of monitoring and seeing what rax says? I guess we could try booting a recovery instance to inspect why it is failing | 16:01 |
clarkb | But I really need food | 16:02 |
fungi | go eat! | 16:04 |
*** dviroel|rover|biab is now known as dviroel|rover | 16:11 | |
fungi | i've added mirror-update02.opendev.org to the emergency disable list | 16:15 |
fungi | i've also temporarily commented out all lines in the root crontab on that server | 16:16 |
clarkb | fungi: I think docs and tarballs etc are released via a cronjob elsewhere? Worth double checking | 16:23 |
fungi | those are handled by the release-volumes.py cronjob on that server, as far as i'm aware | 16:24 |
fungi | which runs every 5 minutes | 16:24 |
fungi | or did, until i commented it out | 16:24 |
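(The pause itself is just hand-editing root's crontab on mirror-update02, e.g.:)
    # list, then comment out the release-volumes.py / mirror-update entries
    sudo crontab -u root -l
    sudo crontab -u root -e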
fungi | we had separate mirror-update servers which reprepro and rsync mirroring was split between for a while, but that's been consolidated onto the newer server more recently | 16:25 |
clarkb | aha | 16:26 |
clarkb | looks like there is an update to our ticket? I'm not in a good place to login and check that yet | 16:26 |
clarkb | (I've got a post road trip todo list a mile long too :( ...) | 16:26 |
*** dviroel__ is now known as dviroel|rover|biab | 16:42 | |
fungi | i can only imagine | 16:44 |
fungi | the ticket updates were "The query regarding unable to boot afs01.dfw.openstack.org has been received. I am currently reviewing this ticket and I will update you with more information as it becomes available." followed by "I will now escalate to appropriate team for further review." | 16:46 |
fungi | so i guess we're waiting for an appropriate team | 16:47 |
clarkb | good to know it has been seen at least | 16:47 |
fungi | i need to go run some errands, but will make them as quick as possible. shouldn't be more than an hour i hope | 16:47 |
clarkb | I think we've done what we can until we hear back form them short of booting a recovery instance | 16:48 |
clarkb | and it is probably better to let them poke at it now that they have seen it | 16:48
fungi | yep | 16:50 |
*** dviroel|rover|biab is now known as dviroel|rover | 17:21 | |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 17:33 |
fungi | racker todd is my new hero! "the volume afs01.dfw.opendev.org/main03 eafb4d8d-19e2-453e-8657-013c4da7acb6 lost it's iscsi connection to the Compute host... Detaching and reattaching it did the trick." | 18:08 |
fungi | reboot system boot Wed Jul 13 18:08 | 18:08 |
fungi | i think afs01.dfw is back in business now, but need to double-check all the volumes to make sure everything's copacetic before i can say with any certainty | 18:10 |
fungi | i've gone ahead and closed out the ticket with much thanks, since we can at least take it from here | 18:12 |
clarkb | excellent | 18:12 |
fungi | for future reference, i suppose we can try detaching/reattaching through cinder | 18:13 |
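(For the record, that detach/reattach maps to these client commands, using the volume ID from the ticket; whether nova can complete it while the iSCSI session is wedged is another matter:)
    openstack server remove volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6
    openstack server add volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6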
fungi | i've got a narrow window to try and catch up on yardwork, but may be able to poke at checking those over on breaks or once i finish | 18:17 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 18:20 |
fungi | per discussion in #openstack-infra a zuul job successfully wrote to the docs rw volume, so i'm going to uncomment the vos release cronjob for that next and see if we have any new problems there | 19:10 |
fungi | i'm tailing /var/log/afs-release/afs-release.log on mirror-update and should hopefully see it kick off in ~2 minutes | 19:13 |
clarkb | thanks | 19:15 |
fungi | looks like all releases were successful, including tarballs | 19:16 |
fungi | we dodged a bullet there | 19:16 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 19:16 |
fungi | clarkb: any objections to me uncommenting the other cronjobs and taking mirror-update02 out of the emergency disable list now? | 19:18 |
clarkb | fungi: no, I probably would've done the mirrors first myself since they are all upstream data :) | 19:19 |
clarkb | I think if tarballs et al are happy then mirrors are good to go | 19:19 |
fungi | fair, but there was a request to rerun a docs job so i took the opportunity | 19:19 |
clarkb | ya | 19:19 |
fungi | okay, undoing the rest | 19:19 |
fungi | and done | 19:19 |
fungi | i'll hold my hopes until we see if there are mirror volumes remaining stale, but i think we can status log a conclusion (i only did status notice earlier, not alert) | 19:20 |
fungi | #status log The afs01.dfw server is back in full operation and writes are successfully replicating once more | 19:21 |
opendevstatus | fungi: finished logging | 19:21 |
fungi | i'll let #openstack-release know too | 19:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Update to Gitea 1.17.0-rc1 https://review.opendev.org/c/opendev/system-config/+/847204 | 20:45 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.16.9 https://review.opendev.org/c/opendev/system-config/+/849754 | 20:45 |
clarkb | There is a new gitea bugfix release. I put that update between the testing update and the 1.17.0 rc change | 20:45 |
clarkb | Hopefully we can land the testing update and the 1.16.9 update soon. But as always please review the changelog and template updates | 20:45
clarkb | git diff didn't show me any template changes between .8 and .9 for the three templates we override | 20:47 |
ianw | o/ ... is the short story one of the afs volumes went away for a bit? | 20:56 |
ianw | it seems we didn't need to fsck, which is good | 20:57 |
clarkb | ianw: the entire fileserver went away due to one of the cinder volumes going away | 20:57 |
clarkb | I think that may have impacted all of the afs volumes due to lvm? | 20:58 |
clarkb | but ya it seems to have come back | 20:58 |
clarkb | ianw: while we are doing catchup thank you for updating our default ansible version in zuul (I should've set myself a calendar reminder for that and just spaced it). Also looks like we updated ansible to v5 on bridge too? | 21:00
ianw | umm, i didn't touch the ansible version on bridge, i don't think | 21:00 |
ianw | i guess /vicepa reports itself as clean ... how it survived that I don't know :) | 21:04 |
clarkb | ianw: oh maybe I misread something over the last week | 21:04 |
clarkb | I may have just smashed together the zuul update and bridge update in my head | 21:05 |
corvus | ianw: clarkb fungi for any who are interested, https://review.opendev.org/848792 is an image-build-in-zuul-job which has 2 successful runs -- one at 1 hour, one in 38 minutes. i believe that further improvement in runtime is possible with better use of the cached data already on the nodes. it does use the existing git repo cache (but then fetches updates, which is a little slow. it also copies it twice, and i feel like we should be able to | 21:13 |
corvus | avoid that somehow, but that requires some detailed thought about what's mounted where and when). it doesn't use any of the devstack/tarball/blob cache on the host, so those files are all being fetched each time; that could obviously be improved. anyway, i think that's a useful starting point, and it could be used to test out the containerfile stuff ianw was looking at. i'm currently working on a new spec for nodepool/zuul, and i wanted to | 21:13
corvus | get an idea of what a job like that would look like. feel free to take that change and modify it or copy it or whatever if you have any ideas you want to explore; i'm basically done with that for right now (it answered my questions). | 21:13 |
clarkb | corvus: re caching off the host I think the existing dib caching knows how to check for updates to those files we just have to copy/link them into the right locations in the dib build path? | 21:15
corvus | clarkb: maybe -- but it also has some shasum hash thing it does and i think that's only in the /opt/dib_cache dir, so i don't think we have all that data on the host (which in this case is one of our built images) | 21:16
clarkb | ya the dib_cache dir isn't copied into the zuul runtime images | 21:17 |
clarkb | but we could probably update things to leak that across assuming it isn't very large and is also useful | 21:17 |
clarkb | I'd have to think about that a bit more | 21:17 |
corvus | yeah | 21:17 |
corvus | at least, the theoretical problem of "we have foo.img, let's update it iff it needs updating" seems solveable :) | 21:18 |
corvus | (i went ahead and put a bit of effort into the git repo cache already though since i knew that was the big thing) | 21:18 |
fungi | ianw: clarkb: a more accurate summary would be the primary afs server went away because the hypervisor host went away, but then we couldn't boot it back up for hours because the host got confused when it lost contact with the iscsi backend for one of the attached volumes | 21:19 |
clarkb | fungi: thanks | 21:19 |
fungi | so it was a bit of a cascade failure | 21:19 |
*** dmitriis is now known as Guest4934 | 21:20 | |
fungi | also we didn't manage to automatically fail over serving the ro replica for something (tarballs volume at least) and needed to intercede | 21:20 |
clarkb | fungi: was the server off for all those hours then? If so then I think the idea that shutting it down caused failover to happen is unlikely (and more likely that my manual navigation of paths made it happier) | 21:20
fungi | the server was offline until 18:08 yes | 21:20 |
fungi | and the outage started around 13:something | 21:21 |
clarkb | ok that helps. For some reason i had it in my head that the server was up but with sad openafs and that may have confused the afs dbs | 21:21 |
fungi | tarballs.o.o didn't start serving content until somewhere in between those times | 21:21
clarkb | ya I suspect more that my manual navigation of afs paths on static forced openafs there to try again and it started working? | 21:22 |
fungi | possibly, though i also did that earlier in the outage | 21:22 |
clarkb | or maybe we cached the bad results for a couple of hours and that timing just lined up where the caching timed out | 21:22 |
fungi | just as part of inspecting things to see what was actually down | 21:23 |
clarkb | fungi: if you have time can you take a look at https://review.opendev.org/c/opendev/system-config/+/849754 you've already reviewed its parent. | 21:58
clarkb | ianw: ^ if you get a chance to look too that would be great | 21:59 |
clarkb | CI results on the child should be back momentarily | 21:59 |
opendevreview | Clark Boylan proposed opendev/system-config master: Install Limnoria from upstream https://review.opendev.org/c/opendev/system-config/+/821331 | 22:01 |
clarkb | infra-root ^ is a change that keeps ending up stale because there is never a good time to land it :/ I think Fridays are generally quiet with meetings if we want to try and land it this friday (seems like the last time I picked a day there was a big fire that distracted me) | 22:02
*** dasm is now known as dasm|off | 22:08 | |
fungi | clarkb: lgtm. unrelated, a review of 849576 and its child would be awesome when you have time | 22:13 |
clarkb | fungi: I've +2'd both but didn't approve in case you wanted to respond to ianw first. Feel free to self approve | 22:15 |
ianw | oh i assume that it is all in order | 22:16 |
clarkb | I've approved the update to gitea testing. I think I'll hold off on gitea upgrade proper until tomorrow though as I'm still getting distracted by all the "home after a week away" problems | 22:18 |
clarkb | feel free to land the gitea upgrade if you're able to monitor it, but I'm happy to do that tomorrow | 22:18 |
ianw | i can monitor it, can merge in a few hours when it all slows down | 22:18 |
opendevreview | Ian Wienand proposed openstack/project-config master: Remove testpypi references https://review.opendev.org/c/openstack/project-config/+/849757 | 22:19 |
fungi | ianw: did i not respond to ianw? maybe i missed something | 22:20 |
fungi | er clarkb ^ | 22:20 |
ianw | oh you did, about the handbook v the guide v the open way v the four opens etc. | 22:21 |
fungi | looking back, i left a review comment in reply to an inline comment, rather than replying with an inline comment, sorry! | 22:23 |
fungi | and yeah, they're intentionally distinct | 22:24 |
fungi | (we debated the option of putting them together or not at great length) | 22:24 |
fungi | i was personally in favor of fewer repos, but one more repo wasn't that great of a cost to appease those who disagreed with my position on the matter | 22:25 |
opendevreview | Ian Wienand proposed openstack/project-config master: twine: default to python3 install https://review.opendev.org/c/openstack/project-config/+/849758 | 22:27 |
clarkb | fungi: hrm this is the problem of not responding to the inline comment directly so it doesn't show up as a response on the file | 22:30
clarkb | https://review.opendev.org/c/openstack/project-config/+/849576/1/gerrit/projects.yaml basically that doesn't show a response | 22:31 |
clarkb | but ya I see it now | 22:31 |
fungi | well, in this case i missed that it was an inline comment so i made a normal review comment instead. was my bad | 22:39 |
clarkb | with the web ui if you click reply it automatically attaches it to the correct place. I wonder if gertty could grow a similar functionality | 22:45 |
clarkb | or maybe it does and I just haven't used it in long enough to have forgotten | 22:45 |
*** dviroel|rover is now known as dviroel|rover|Afk | 22:58 | |
jrosser | i might have a very long running logs upload in progress here https://zuul.opendev.org/t/openstack/stream/915484832105431892e804fb86abc2d3?logfile=console.log | 23:08 |
clarkb | hrm doesn't look like we've ported the base-test updates to log the target to the production base job? | 23:09 |
clarkb | or if we have I'm not seeing it in that log yet | 23:09 |
jrosser | it's from 847991 | 23:09 |
opendevreview | Merged opendev/system-config master: Move gitea partial clone test https://review.opendev.org/c/opendev/system-config/+/848174 | 23:09 |
jrosser | no i don't think we have merged that yet | 23:09 |
* clarkb makes a note to catch back up on that tomorrow | 23:10 | |
jrosser | i have a patch to do that but it needs updating | 23:10 |
jrosser | i saw one POST FAILURE earlier, and just noticed that one apparently stuck | 23:10
ianw | lsof on that shows connections to ... | 23:19 |
ianw | 142.44.227.102 | 23:19 |
ianw | OVH Hosting Inc. | 23:20 |
ianw | looking at it in strace, it doesn't seem to be doing anything | 23:21 |
fungi | clarkb: in this case it wasn't a gertty failing, i replied to the review comment which contained the inline comment rather than replying to the inline comment itself | 23:21 |
clarkb | ianw: is it waiting on a read or a write? (might point to which side is idling) | 23:23 |
ianw | https://paste.opendev.org/show/bzXL1q1f2G0e4d4dQgvA/ | 23:23 |
ianw | looks to me stuck in a bunch of reads | 23:24 |
clarkb | to me that implies something about the remote end being unhappy | 23:26 |
clarkb | we're waiting for ovh to respond to us? | 23:26 |
clarkb | could be something on the network between as well | 23:26 |
ianw | pinging it from ze02 seems fine | 23:27 |
ianw | it really just looks like those threads are sitting there waiting for something | 23:27 |
clarkb | might be something amorin could help with | 23:28 |
clarkb | (to check on the ovh side to see if there is any obvious reason for the pause) | 23:28 |
ianw | i've had that under strace for a while and nothing has got any data or timed out either | 23:30 |
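(The inspection described above amounts to roughly the following, with the executor-side PID as a placeholder:)
    # which remote endpoints does the stuck upload process hold open?
    lsof -p <pid> -a -i
    # are its threads blocked in read()/recv on those sockets?
    strace -f -p <pid> -e trace=network,read
    # socket-level view, including send/receive queue depths
    ss -tpn | grep 'pid=<pid>'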
fungi | when was the connection initiated? | 23:31 |
ianw | https://paste.opendev.org/show/bdRYt3Lbz7PZZoEuovxE/ | 23:33 |
ianw | it would indicate Jul 13 23:33 i guess | 23:34 |
ianw | although, that's 1 minute ago? | 23:34 |
ianw | ... and it's gone ... | 23:35 |
ianw | did it get killed? | 23:35 |
ianw | 2022-07-13 20:34:38.552391 | TASK [upload-logs-swift : Upload logs to swift] | 23:37 |
ianw | 2022-07-13 23:34:33.029640 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master] | 23:37 |
ianw | yes | 23:37 |
jrosser | still an html ara report there which needs to be got rid of | 23:38
ianw | i guess the time of the file in /proc/<pid>/fd is the time that the kernel made the virtual file in response to the dirent or whatever (i.e. when you "ls" it), not the file creation time. not sure i've ever considered the timestamp of it before | 23:40 |
ianw | anyway, that's a data point i guess? it was ovh, and it was all the thread stuck in read() calls | 23:41 |
clarkb | ++ someone like timburke might know what portion of the upload is doing reads too. Though it may just be waiting for a status result from the http server | 23:46 |
clarkb | and the problem is processing/storing the data on the remote | 23:46 |
ianw | my next thought was a backtrace on one of those threads, but they disappeared | 23:48
opendevreview | Ian Wienand proposed openstack/project-config master: pypi: use API token for upload https://review.opendev.org/c/openstack/project-config/+/849763 | 23:54 |
ianw | does "Job publish-service-types-authority not defined" in project-config ring any bells? | 23:56 |
clarkb | service types authority is the thing that is published for keystone ? I think its the static json blob | 23:57 |
*** dviroel|rover|Afk is now known as dviroel|rover | 23:57 | |
ianw | https://review.opendev.org/c/openstack/project-config/+/708518 removed the job in feb 2020 | 23:58 |
clarkb | ianw: did you get that error from the zuul scheduler log? | 23:58 |
clarkb | all it said in gerrit was the change depends on a change with invalid config | 23:58 |
ianw | we have a reference @ https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5018 | 23:59 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5017-L5018 ya I just found that too | 23:59 |
clarkb | maybe we need to clean that up? | 23:59 |
ianw | https://review.opendev.org/c/openstack/project-config/+/849757 gives me a zuul error | 23:59 |
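(Tracking down such stale references is just a grep through the zuul config in project-config, e.g.:)
    git grep -n publish-service-types-authority -- zuul.d/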