clarkb | I'm not sure we want a new super group to contain zuul and the registry. Double accounting is probably the best option :/ | 00:00 |
---|---|---|
jentoio | registry.yaml has registry_user: zuul | 00:02 |
clarkb | jentoio: that appears to configure the user inside the docker registry service itself | 00:05 |
clarkb | (I didn't realize these layers might make things confusing) | 00:05 |
clarkb | jentoio: maybe registry_process_user_id: 10001 and registry_process_group_id: 10001 as well as registry_process_user and registry_process_group to help differentiate? | 00:06 |
jentoio | yeah, it's a little confusing. You have a container user (inside) and the user the docker-compose is running as | 00:07 |
jentoio | okay, so we want to separate the registry from zuul, that makes sense. I will work on this task assuming a new registry user per your examples | 00:08 |
clarkb | ya. Maybe add comments to registry.yaml as well to explain the difference even though they are similar. | 00:08 |
jentoio | sounds good, start here system-config/inventory/service/group_vars/registry.yaml | 00:10 |
jentoio | I'll start here.. | 00:10 |
clarkb | yup | 00:12 |
clarkb | and don't be afraid of pushing something that seems half done. We can usually talk about things more concretely in review once we have something to look at | 00:12 |
*** ysandeep|out is now known as ysandeep | 00:15 | |
fungi | i constantly push half-done work into gerrit, just so i don't have to worry about accidentally blowing it away on my workstation when i switch to some other task | 00:28 |
jentoio | how do I push my changes to gerrit again? I don't see it in gerrit after git commit | 01:35 |
opendevreview | Jack Morgan proposed opendev/system-config master: Adds support for running zuul-registry as a non-root user https://review.opendev.org/c/opendev/system-config/+/831462 | 01:37 |
fungi | the `git review` command will add a remote named "gerrit" and will also download the commit hook which adds the change-id footer to your commit message and will rerun it if needed | 01:37 |
fungi | aha, you found the docs, i guess ;) | 01:37 |
jentoio | nevermind, figured it out | 01:37 |
fungi | jentoio: https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html#proposing-a-change in case you found other less accurate docs | 01:39 |
jentoio | I used this one, opendev/infra-manual/latest/developers.html | 01:40 |
jentoio | but didn't scroll down enough to see submitting a change for review. Got stuck at running unit tests ;) | 01:40 |
jentoio | Do I need to add reviews in gerrit? | 01:47 |
*** ysandeep is now known as ysandeep|afk | 01:49 | |
*** pojadhav|afk is now known as pojadhav | 01:50 | |
fungi | jentoio: add reviewers? nah | 01:54 |
fungi | just pester people in here if we don't get around to looking at it | 01:54 |
jentoio | well, 1st one (zuul-registry) was kinda fun. Once it's done, I'll take the learnings and apply them to other containers. | 01:58 |
fungi | yeah, hopefully we actually exercise that one with tests, more than just making sure it deploys... seeing if we do | 01:59 |
*** rlandy|ruck is now known as rlandy|out | 02:00 | |
fungi | we do, not a lot, but we check that the service starts enough to bind a listening socket, so that's better than nothing: https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_registry.py | 02:02 |
Clark[m] | I'll be sure to review it tomorrow morning | 02:11 |
ianw | jentoio: thanks! i'm pretty sure there's a typo in there, but what's more interesting than that is why that typo didn't cause a CI failure? | 02:20 |
ianw | this did actually fail : https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_341/831462/1/check/system-config-run-docker-registry/341fd02/bridge.openstack.org/ara-report/playbooks/5.html?status=failed&status=unreachable#results | 02:35 |
ianw | i think i've screwed up catching the error with the | tee to logs | 02:37 |
opendevreview | Ian Wienand proposed opendev/system-config master: zuul run-base: make sure we catch failures when teeing to logs https://review.opendev.org/c/opendev/system-config/+/831465 | 02:44 |
*** ysandeep|afk is now known as ysandeep | 03:37 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-pip: test with install/upgrade of pip https://review.opendev.org/c/zuul/zuul-jobs/+/831469 | 04:08 |
ianw | frickler: thanks for investigating this, i've approved your other change. if you think ^ might avoid even more problems, we can go with it, but i'm not that fussed | 04:09 |
opendevreview | Merged zuul/zuul-jobs master: Fix ensure-pip test on Debian Buster https://review.opendev.org/c/zuul/zuul-jobs/+/831443 | 04:17 |
opendevreview | Jack Morgan proposed opendev/system-config master: Fixed minor spelling mistake https://review.opendev.org/c/opendev/system-config/+/831472 | 04:45 |
jentoio | Well, I fixed my spelling mistake but I created a new commit instead of updating my previous commit. Need to read the docs more. | 04:49 |
ianw | jentoio: in short, update whatever it is, "git add -i" and then "git commit --amend", then push. gerrit will know it's the same because the change-id in the commit message doesn't update | 05:01 |
jentoio | ianw: thanks for the pointer. | 05:04 |
ianw | once you've gone with amending/rebase you'll never go back :) the fact there was a typo is of historical interest in the gerrit review, where we can discuss such things. but it's not something we want to commit to the tree, which should just be the final revision :) | 05:07 |
*** ykarel_ is now known as ykarel | 06:39 | |
*** ysandeep is now known as ysandeep|lunch | 08:13 | |
frickler | ianw: I like the idea, but if I do this on buster, pip is already at the latest version, which may not be as good a test since it doesn't change anything. I'm also wondering why the affected jobs aren't executed for either of our patches, I guess we should fix that, too | 08:15 |
frickler | also your comment change now failed on f35 with what looks like broken repo mirrors. does this never stop? | 08:17 |
frickler | finally, whom to ping about the Software Factory CI node failures? | 08:19 |
*** pojadhav- is now known as pojadhav | 08:24 | |
*** jpena|off is now known as jpena | 08:34 | |
*** ysandeep|lunch is now known as ysandeep | 08:57 | |
ianw | frickler: haha, it hasn't ended in the 9 years or so i've been here, so i guess not :) | 08:57 |
ianw | yeah, the --reinstall should get it to try again? but there might be better ideas too | 08:58 |
ianw | i noticed the fedora thing first in zuul-repository, which is for some reason using fedora nodes. i haven't looked, i don't have time right now but will in the morning if nobody else does! :) | 08:59 |
*** ykarel_ is now known as ykarel | 09:54 | |
frickler | I checked that the rsync job is finishing fine, so likely we are tracking some broken upstream mirror. will need someone with more fedora knowledge to verify | 10:06 |
*** rlandy|out is now known as rlandy|ruck | 10:51 | |
*** ysandeep is now known as ysandeep|afk | 10:51 | |
*** ysandeep|afk is now known as ysandeep | 11:05 | |
*** pojadhav is now known as pojadhav|brb | 11:58 | |
*** pojadhav|brb is now known as pojadhav | 12:42 | |
opendevreview | Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Improve check for CentOS/RHEL 9 packages https://review.opendev.org/c/zuul/zuul-jobs/+/831423 | 13:14 |
opendevreview | Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Improve check for CentOS/RHEL 9 packages https://review.opendev.org/c/zuul/zuul-jobs/+/831423 | 13:16 |
tristanC | frickler: i've disabled the Software Factory CI jobs that produce in node failure with https://softwarefactory-project.io/r/c/third-party-ci-config/+/24185 | 13:37 |
frickler | tristanC: great, thx | 13:39 |
*** ykarel_ is now known as ykarel | 14:17 | |
*** timburke__ is now known as timburke | 15:23 | |
*** ykarel is now known as ykarel|away | 15:40 | |
*** ysandeep is now known as ysandeep|out | 15:51 | |
dpawlik | probably this mirror was unsynchronized https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L36 | 15:51 |
dpawlik | that's why we have such error | 15:52 |
dpawlik | frickler, fungi: should we switch fedora mirror to other? | 15:54 |
fungi | dpawlik: our record of picking mirrors is apparently not great... https://opendev.org/opendev/system-config/commit/c80c6ee | 15:59 |
fungi | can you get a good recommendation? | 15:59 |
fungi | or should we switch back to mit's mirror again? | 16:00 |
* dpawlik asking folks for recommendation | 16:01 | |
fungi | thanks! | 16:01 |
mnasiadka | hello | 16:02 |
mnasiadka | what's the latest on rocky linux 8 nodes availability? | 16:02 |
dpawlik | wondering if Facebook mirror is not stable | 16:02 |
dpawlik | fungi: I will propose a patch to switch to download-ib01.fedoraproject.org (https://admin.fedoraproject.org/mirrormanager/mirrors/Fedora/35/x86_64) | 16:06 |
fungi | mnasiadka: they're available, though they may not yet be booting in all of our providers, they're at least in place enough to be able to try out | 16:08 |
fungi | dpawlik: the rsync socket on that server says not to use rsync directly and instead mirror with https://pagure.io/quick-fedora-mirror | 16:11 |
dpawlik | fungi: nhicher proposes to take this one https://pagure.io/quick-fedora-mirror/blob/master/f/quick-fedora-mirror.conf.dist#_15 | 16:12 |
dpawlik | fungi: oh, same repo | 16:12 |
dpawlik | but then it can be more difficult to find what is an issue | 16:13 |
clarkb | fungi: mnasiadka: the dib update for our nodepool builders did land yesterday so I think any new problems are unknown at this point? Definitely in the try it out and lets see how it goes stage | 16:18 |
opendevreview | daniel.pawlik proposed opendev/system-config master: Change Fedora mirror to dl.fedoraproject.org https://review.opendev.org/c/opendev/system-config/+/831555 | 16:20 |
opendevreview | daniel.pawlik proposed opendev/system-config master: Change Fedora mirror to dl.fedoraproject.org https://review.opendev.org/c/opendev/system-config/+/831555 | 16:21 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/831465 looks like an important one | 16:21 |
clarkb | dpawlik: dl.fedoraproject.org is the master mirror. I'm pretty sure they don't want us using that | 16:24 |
clarkb | dpawlik: typically the master mirrors are meant to only be pulled from the tier 1 mirrors and then you talk to those or tier 2 | 16:24 |
clarkb | iirc the rules for being tier 1 involve being publicly accessible which we aren't really (even though we are, we don't want to advertise them because they come and go, for example we just shutdown a mirror yesterday) | 16:25 |
clarkb | https://fedoraproject.org/wiki/Infrastructure/Mirroring/Tiering describes this. Anyway I think you need to find an alternative tier 2 mirror is what I'm getting at | 16:26 |
dpawlik | clarkb: yeah, I see on doing a test that it does not work as expected | 16:26 |
dpawlik | clarkb: send me later your PS. I will check. Now I need to go | 16:30 |
clarkb | well I don't have any suggestions for a better mirror. Just saying we cannot use the one that is proposed | 16:36 |
*** marios is now known as marios|out | 16:44 | |
opendevreview | Merged opendev/system-config master: zuul run-base: make sure we catch failures when teeing to logs https://review.opendev.org/c/opendev/system-config/+/831465 | 16:51 |
clarkb | jentoio: left some notes on the change. Overall looks good, but we should use the variables consistently or remove them (as noted on the review). Thanks again. I'll be sure to rereview once we've got updates | 16:53 |
jentoio | clarkb: okay, i'll make the change but I just copied the gerritbot example, https://opendev.org/opendev/system-config/commit/fd8808733536d19d2a62b4120e43a508d77a4846 | 17:10 |
clarkb | jentoio: ya I think we can just drop the variables and use the hardcoded value for the name and group name | 17:26 |
clarkb | jentoio: that would be in line with gerritbot | 17:27 |
jentoio | clarkb: nm, I'm understanding your gerrit comments now that the coffee is kicking in | 17:31 |
jentoio | i will abandon my 2nd commit, https://review.opendev.org/c/opendev/system-config/+/831472, revert my git tree back to 1st commit and make the changes there | 17:32 |
clarkb | sounds good | 17:32 |
fungi | jentoio: i meant to link it last night, but we have documentation on the change revision workflow too if you ever need a reminder: https://docs.opendev.org/opendev/infra-manual/latest/developers.html#updating-a-change | 17:33 |
*** jpena is now known as jpena|off | 17:44 | |
yoctozepto | I report there is something wrong with gitea | 18:06 |
yoctozepto | when I git clone/fetch from gitea | 18:07 |
yoctozepto | https://opendev.org/openstack/kolla-ansible | 18:07 |
yoctozepto | I only get 83fa9079611b359e390d3a7a0f2c7aa935a44946 | 18:07 |
clarkb | yoctozepto: can you be more specific? | 18:07 |
yoctozepto | as the latest commit | 18:07 |
yoctozepto | clarkb: yes, sorry, writing multiple lines | 18:07 |
yoctozepto | bad habit | 18:07 |
yoctozepto | newer commits are not propagated | 18:07 |
clarkb | ok so the clone is functional but the data is old | 18:07 |
yoctozepto | yes | 18:08 |
yoctozepto | the web ui shows new commits well | 18:08 |
yoctozepto | the gerrit endpoint also serves fresh data | 18:08 |
yoctozepto | discovered doing rebases and figured out I'm not current with gerrit heh | 18:08 |
clarkb | yoctozepto: can you go to https://opendev.org and check the names in the cert to see which backend you map to | 18:11 |
clarkb | I see some errors in the replication log that may be specific to gitea01 | 18:11 |
clarkb | want to cross check with where you are pulling from | 18:11 |
clarkb | I'm checking gitea01 and gitea02 now to compare as well | 18:13 |
clarkb | ya gitea02 seems up to date and gitea01 is not. I'm going to try forcing full replication to gitea01 of kolla-ansible | 18:13 |
clarkb | this may be related to the gitea01 outage we had | 18:14 |
clarkb | and if that corrects it I'll ask for a full replication of gitea01 for everything I guess | 18:14 |
clarkb | ok that didn't seem to help still getting errors. Says "reason: missing necessary objects" | 18:17 |
clarkb | But this is a force push so you'd think it would push the missing objects? | 18:17 |
clarkb | infra-root I've pulled gitea01 out of the haproxy rotation | 18:20 |
clarkb | I suspect that we may need to move the repo aside and recreate it. But I'm not currently sure how to do that so I've pulled the server out of rotation so the other 7 can serve the up to date data | 18:21 |
clarkb | yoctozepto: ^ can you check that you get up to date content now that I've pulled this out of rotation? | 18:21 |
clarkb | and git fsck says error: object file ./objects/0a/a678cd8d48015b9db75cc5bd926660bd7aeb89 is empty and then complains it is corrupt | 18:23 |
clarkb | almost certainly related to the server dying a painful death recently. | 18:24 |
clarkb | infra-root ^ any ideas on the best way to approach this? We should be able to move the repo aside and have gerrit re replicate it but I'm unsure how to make gitea aware of that cleanly | 18:24 |
clarkb | I think we can shutdown all of gitea, move the unhappy repo aside, git init a new repo, start gitea, tell gerrit to replicate | 18:28 |
clarkb | er shutdown all gitea services on gitea01. not the entire gitea service | 18:33 |
clarkb | infra-root ^ if that plan seems reasonable I'll start doing that | 18:34 |
corvus | ++ | 18:34 |
clarkb | my only real concern with it I guess is I'm not sure what flags gitea is using to init the repo | 18:35 |
clarkb | I should probably go look and see if I can determine how it does that | 18:35 |
clarkb | looks like git init and if bare is set then --bare and I think these are bare repos | 18:36 |
clarkb | func InitRepository in gitea/modules/git/repo.go | 18:37 |
yoctozepto | clarkb: sorry, I went out for a moment | 18:38 |
yoctozepto | clarkb: yes, it's up to date now | 18:38 |
yoctozepto | thank you! | 18:39 |
clarkb | yoctozepto: thanks, I think I've identified the issue then. I've stopped the gitea services on gitea01 while I prepare to make a bare repo then tell gerrit to rereplicate it | 18:39 |
fungi | clarkb: any guess as to whether this could also be impacting other repos on the server? | 18:39 |
clarkb | fungi: I think the answer is yes it is possible, but no I do not know of any that are impacted other than this one | 18:41 |
clarkb | since we can take it back out of the rotation again later easily I think fixing this one and proving out the system is fine | 18:41 |
clarkb | then we can audit and do additional fixes if necessary | 18:41 |
fungi | just wondering if it might make more sense to blow away all the repos on the server and repopulate them for good measure | 18:42 |
clarkb | maybe, but we don't even know if this surgery will work | 18:44 |
fungi | yeah, i agree it's a good first experiment. if that seems to fix it, i guess we can do the rest easily in a shell loop | 18:46 |
clarkb | if anyone wants to check I've put the sad repo in /root/corrupt_repos and the kolla-ansible.git in /var/gitea/..... is a new git init --bare repo that I chowned to gitea | 18:47 |
clarkb | I think I'm ready to start gitea up again and ask gerrit to replicate. | 18:47 |
clarkb | ok proceeding to start services back up again | 18:50 |
opendevreview | Jack Morgan proposed opendev/system-config master: Adds support for running zuul-registry as a non-root user https://review.opendev.org/c/opendev/system-config/+/831462 | 18:50 |
clarkb | visiting gitea01 kolla-ansible via the web ui is a 500 error currently. | 18:51 |
jentoio | patchset 2 pushed. thanks for help / feedback | 18:51 |
clarkb | No branches in non-empty repository <- that is the error reported by gitea. So maybe replication will fix it | 18:52 |
clarkb | https://gitea01.opendev.org:3081/openstack/kolla-ansible has content now | 18:54 |
clarkb | yoctozepto: ^ can you check it please? (note the backend specific url) | 18:54 |
clarkb | fungi: `grep 'Failed replicate of' /home/gerrit2/review_site/logs/replication_log | grep -v kolla-ansible` is empty implying kolla-ansible is the only active repo with the problem from today's log | 18:55 |
fungi | clarkb: that seems reasonably safe that it's the only one affected then, unless nothing's tried to fetch commits from some repos on 01 today | 18:56 |
clarkb | fungi: note that is gerrit's replication log not gitea logs | 18:57 |
clarkb | fungi: basically no one has pushed changes to repos to gerrit that exhibit the issue | 18:57 |
clarkb | other than kolla | 18:57 |
fungi | oh, errors on push | 18:57 |
fungi | yeah | 18:57 |
clarkb | also no other gitea exhibits the issue | 18:57 |
clarkb | I changed grep -v kolla-ansible to grep -v gitea01 to check that | 18:57 |
fungi | and i agree the kolla-ansible repo seems to match gerrit (same commit is my origin/master and gerrit/master) | 18:57 |
fungi | after recloning from 01 | 18:58 |
clarkb | if yoctozepto is still around and can confirm it looks good I will put gitea01 back into the haproxy rotation. Then maybe I should write up a doc on how to do this | 18:59 |
clarkb | I suspect that we could've taken another approach of deleting broken things in the old git repo until we got to a "consistent" state then have gerrit push. But starting clean felt well cleaner :) | 19:00 |
clarkb | I wonder if there are more safeguards we could put into place to avoid these issues in the first place? | 19:01 |
clarkb | I believe gitea01 is boot from volume and gerrit's git repos are hosted on a volume in a different region of the same cloud | 19:01 |
clarkb | does ceph or whatever backs this maybe have some sort of stronger write back guarantees? cc mnaser | 19:01 |
clarkb | if we could opt into that via /etc/fstab options or /sys/ settings that might be a good idea? | 19:02 |
clarkb | infra-root I've just realized that we document an admin gitea function to create missing repos. If you like I can move this new repo I created aside and have gitea create it for us that way. This would ensure that gitea is creating it exactly the way it wants it to be. But I think I found the code and it just does a `git init --bare` so we should be fine | 19:05 |
clarkb | I suppose if we run into additional problems that would be the next step to take. | 19:05 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add docs on restoring a gitea repository https://review.opendev.org/c/opendev/system-config/+/831601 | 19:17 |
clarkb | infra-root ^ that should capture what I did pretty well. Please look it over and if the process looks good we can land the docs update. If you see problems we should address them on gitea01 and then update the docs too | 19:17 |
mnaser | im trying to capture context, what happened | 19:20 |
clarkb | mnaser: when the gitea01 server's hypervisor crashed we wrote an empty git object file to the kolla-ansible repo | 19:23 |
mnaser | fwiw there is no more different region stuff, everything is all the same place | 19:23 |
clarkb | mnaser: this corrupted the repo and prevented further pushing to it to keep it replicated. I'm wondering if the volumes provided by vexxhost support certain writeback options that we might be able to set to try and avoid those problems? Like I know the nvme in my local machines has different write back options. The default is fast and maybe less safe but I can opt into slower and more | 19:24 |
clarkb | safe | 19:24 |
clarkb | mnaser: well gitea01 is in sjc1 and review is in ca-ymq-1 iirc. But both use ceph backed volumes for the git data iirc | 19:24 |
clarkb | Mostly me talking out loud wondering if there are options we can set to say "this mount is important be more careful" | 19:24 |
mnaser | ok so you're saying the hard power off resulted in data loss | 19:25 |
fungi | more like data corruption | 19:25 |
fungi | a truncated write, essentially | 19:25 |
clarkb | ya it created the file but with no content so ^ | 19:25 |
clarkb | we've since recovered because the source of truth is gerrit not gitea01. But my concern is if gerrit were to experience similar we'd be pulling from backups most likely | 19:25 |
clarkb | and wondering if we can mitigate that via mount or device options | 19:25 |
fungi | probably the filesystem layer allocated the inode but never actually got a chance to flush buffered data to it before the disruption | 19:26 |
mnasiadka | clarkb: seems to launch ok, thanks :) | 19:26 |
clarkb | /sys/block/nvme.*/queue/write_cache is how linux controls this for nvme devices for example | 19:27 |
clarkb | defaults to write back typically but you can set it to write through | 19:27 |
clarkb | which will more than likely affect your write speed, but you get better integrity on power loss | 19:27 |
clarkb | these aren't nvme devices as they are ceph backed volumes mounted via iscsi or similar. But wondered if there was similar we could do maybe | 19:28 |
fungi | though if filesystem allocation and writing out the blocks are performed as separate steps, i'm not sure the caching layer can necessarily solve it | 19:28 |
clarkb | fungi: thats a good point. And I would've expected git to do atomic renames | 19:28 |
clarkb | this may also point to a git bug as it should be more resilient to this sort of thing too | 19:28 |
fungi | is gitea using cgit behind the scenes? | 19:30 |
clarkb | fungi: yes | 19:30 |
clarkb | fungi: well for the writes anyway | 19:30 |
clarkb | the reads are done via a golang library iirc | 19:30 |
fungi | cool, i couldn't remember. i know gerrit doesn't | 19:30 |
clarkb | fungi: the way it works is we run an sshd that can execute git commands and then those commands run to accept the pushes and those pushes trigger hooks which update the gitea application side of things | 19:31 |
clarkb | the application side then does some things with golang lib and other things with forking cgit tools iirc | 19:31 |
mnaser | i think ceph handles fsync's correctly | 19:32 |
jrosser | ceph should be synchronous underneath iirc but it presents a block device for volumes so I expect the behaviour of whichever file system used under power loss will dominate what happens | 19:32 |
mnaser | yep, what jrosser said as well -- now, there is a cache on the host level which combines writes and pushes them out to the cluster | 19:33 |
clarkb | mnaser: ya so that might be the issue | 19:34 |
clarkb | do you know if there is a way to tell the ceph mount to act write through rather than write back? | 19:34 |
clarkb | I bet that completely destroys performance though | 19:34 |
clarkb | but may be worth testing if it is an option | 19:35 |
clarkb | I'm guessing we won't hear back from yoctozepto today due to timezones. I'll put gitea01 back in rotation after lunch if we don't find any reason to keep it out by then | 19:40 |
jrosser | ceph disk write cache settings are somewhat counterintuitive https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches | 19:42 |
jrosser | fwiw my Intel nvme come set with the write cache off by default | 19:43 |
clarkb | jrosser: is it an "enterprise" device? Apparently consumer devices are far less safe | 19:45 |
jrosser | yes these are enterprise spec drives | 19:45 |
clarkb | supposedly the other thing enterprise drives do is include extra capacitors/batteries to ensure writes make it all the way down into non volatile storage and consumer drives are far less likely to do that (even with write through set, people have started testing this after the apple m1 nvme controller issues with linux came to light) | 19:46 |
mnaser | i think the issue here is that the writes never made it out to the ceph cluster | 19:47 |
mnaser | we didnt lose power to the ceph nodes | 19:47 |
mnaser | the problem is the qemu writes didnt make it to the cluster | 19:47 |
clarkb | mnaser: right. And I think if we were operating in write through mode the software would know the writes hadn't completed yet? But if in write back as soon as it hit the caches the software running on top thought it was good to go | 19:48 |
clarkb | the downside to write through is that it is almost definitely going to be slower. So not necessarily something you'd want to set globally I don't think | 19:48 |
mnaser | write but the performance will be terrible | 19:49 |
mnaser | right* | 19:49 |
mnaser | yep | 19:49 |
clarkb | ya if it is possible to opt in on a per mount basis that would be interesting to benchmark at least | 19:49 |
clarkb | but I'm not sure if that is possible. Looking at jrosser's doc it seems this may be set pretty globally? | 19:49 |
mnaser | id start by making sure the fs level is flushing writes | 19:49 |
clarkb | globally per local cache at least | 19:50 |
clarkb | mnaser: I mean its git | 19:50 |
clarkb | on ext4, I'd be super surprised if they were doing something naive | 19:50 |
clarkb | (but it is possible) | 19:50 |
clarkb | we are mounted data=ordered which is "All data are forced directly out to the main file system prior to its metadata being committed to the journal." | 19:52 |
clarkb | barrier=1 would be another thing to set, we don't set it explicitly and I'm not sure what the default is | 19:53 |
clarkb | not sure if git supports that either | 19:53 |
clarkb | ext4 enables write barriers by default | 19:53 |
clarkb | (ext3 didn't) | 19:54 |
clarkb | Those seem to be the primary fs options for ensuring integrity (the other is commit=nsec which forces flushes every n seconds) | 19:56 |
clarkb | the default for that one is 5 seconds | 19:56 |
clarkb | anyway we don't have to solve this now. Just wanted to throw ideas out there to see if there were any good options or things we might be missing. Seems like maybe not? | 19:57 |
clarkb | or at least nothing obvious to me | 19:58 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/831601/1/doc/source/gitea.rst any idea why it doesn't find the Backend Maintenance heading at the top of the file? | 20:00 |
clarkb | oh I may need to enable an extension for that | 20:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add docs on restoring a gitea repository https://review.opendev.org/c/opendev/system-config/+/831601 | 20:03 |
fungi | clarkb: yeah, i occasionally try to use implicit labels in docs and then remember that they're fragile. using explicit labels also helps you keep old bookmarks and external links working even if you rename the section at a later date | 20:07 |
fungi | so preferable all around | 20:07 |
clarkb | makes sense | 20:08 |
clarkb | lunch now. Back in a bit and I can restore gitea01 in haproxy at that point | 20:11 |
clarkb | gitea01 is back in the rotation now | 20:58 |
fungi | thanks! | 21:01 |
clarkb | my docs change passes CI now too | 21:13 |
clarkb | ready for review if yall can take a look | 21:13 |
fungi | i've apparently clogged gertty by resubscribing to all the ancient opendev namespace repos so i can review retirement changes on them | 21:16 |
fungi | catching up on other things while i watch its sync counter slowly decrease | 21:17 |
fungi | it'll be convenient though, because after approving the retirement changes in them, i can bulk abandon all other open changes | 21:17 |
fungi | courtesy of gertty's awesome process mark feature | 21:18 |
clarkb | ooh yes | 21:18 |
ianw | i saw some chat about the fedora mirror but it doesn't seem we merged anything? | 21:18 |
clarkb | ianw: correct the proposal from dpawlik was to use the master repos to sync from which if possible (I don't think it is) is not recommended | 21:19 |
clarkb | we need to find another tier 2 mirror to pull from aiui | 21:19 |
fungi | we could just unrevert the mit mirror change if they're looking stable again | 21:20 |
fungi | (see git history on that file) | 21:20 |
ianw | ok, i need to do school runs but will sort something out after | 21:20 |
opendevreview | Merged opendev/system-config master: Add docs on restoring a gitea repository https://review.opendev.org/c/opendev/system-config/+/831601 | 21:24 |
clarkb | thanks! | 21:25 |
fungi | my gitea sync seems to be about 75% done. hopefully after dinner i'll be able to go back to using it for reviewing stuff | 21:26 |
fungi | s/gitea.gertty/ | 21:26 |
fungi | s/\./\// | 21:26 |
fungi | (anyone need a toothpick?) | 21:26 |
*** dviroel is now known as dviroel|out | 21:31 | |
clarkb | jentoio: I think I diagnosed the job failure for the latest patchset. I think this is really close. I left a comment on how I think you can work around it | 21:31 |
jentoio | clarkb: yup, working on it. patchset 3 coming shortly. thanks | 21:33 |
clarkb | no rush just wanted to make sure you saw that | 21:35 |
jentoio | np, I just happened to be working on it. | 21:36 |
opendevreview | Jack Morgan proposed opendev/system-config master: Adds support for running zuul-registry as a non-root user https://review.opendev.org/c/opendev/system-config/+/831462 | 21:38 |
clarkb | jentoio: heh the last error should've been a hint for me to look closer. Hopefully comment on most recent patchset explains things well enough | 22:12 |
jentoio | clarkb: yeah, looking into it now. I'll implement your feedback. thanks | 22:18 |
opendevreview | Jack Morgan proposed opendev/system-config master: Adds support for running zuul-registry as a non-root user https://review.opendev.org/c/opendev/system-config/+/831462 | 22:24 |
ianw | https://zuul.opendev.org/t/zuul/build/fa98a05f171444caa5bd6cae06aeaec6 just passed so i'll assume the mirror fixed itself | 22:31 |
fungi | oh good | 22:37 |
fungi | don't tell the technicians from central services, but sometimes these things *do* fix themselves | 22:38 |
fungi | no need to involve harry tuttle | 22:38 |
fungi | (though we are all in it together!) | 22:38 |
clarkb | jentoio: woot that passes testing now. I noticed two more small things. Sorry for not catching them earlier. But I think if we make those updates and find a second reviewer we may be about done here | 22:59 |
fungi | i can be a second reviewer now that my gertty has caught up | 23:02 |
clarkb | fungi: well I think we need a couple more updates. but passes testing | 23:02 |
fungi | excellent work! | 23:02 |
clarkb | and then maybe land tomorrow when we can watch it? I'm feeling like today might be an early one for me | 23:02 |
fungi | it has seemed like a long day for me too for some reason | 23:03 |
jentoio | clarkb: cool, taking a look now | 23:10 |
corvus | fungi: what happened to your gertty? also, internally it has a priority queue, so if you know what changes you want to review, you can jump right to them and it'll sync them first. | 23:14 |
corvus | fungi: oh, nm, i found the explanation in scrollback | 23:14 |
fungi | corvus: i subscribed to all the ancient opendev namespace repos so that i can use it to review clarkb's retirement changes and mass abandon old reviews on those | 23:14 |
fungi | and yeah, i could have jumped the sync queue, but it was pretty sluggish anyway | 23:15 |
fungi | so preferred to do other things until it finished | 23:15 |
fungi | mostly it was busy cloning repos and such | 23:15 |
* corvus didn't know there were other things to do | 23:15 | |
fungi | heh | 23:16 |
opendevreview | Jack Morgan proposed opendev/system-config master: Adds support for running zuul-registry as a non-root user https://review.opendev.org/c/opendev/system-config/+/831462 | 23:16 |
*** rlandy|ruck is now known as rlandy|ruck|bbl | 23:23 |
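
Editor's sketch, not part of the log: at 18:23 `git fsck` reported a loose object file under `./objects/0a/` that was empty, and fungi later described it as a truncated write. A quick way to look for the same signature in other repositories might be to scan for zero-byte loose objects. The function name `find_empty_loose_objects` is hypothetical, and the sketch assumes the standard loose-object layout of two-hex-character directories under `objects/`. It only spots empty files; it is not a substitute for `git fsck`.

```python
import os
import tempfile


def find_empty_loose_objects(git_dir):
    """Return sorted paths of zero-byte loose object files under git_dir/objects."""
    objects_dir = os.path.join(git_dir, "objects")
    empty = []
    if not os.path.isdir(objects_dir):
        return empty
    for fanout in sorted(os.listdir(objects_dir)):
        # Loose objects live in two-hex-character fan-out directories;
        # this skips siblings such as "pack" and "info".
        if len(fanout) != 2 or any(c not in "0123456789abcdef" for c in fanout):
            continue
        fanout_path = os.path.join(objects_dir, fanout)
        if not os.path.isdir(fanout_path):
            continue
        for name in sorted(os.listdir(fanout_path)):
            path = os.path.join(fanout_path, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return empty


if __name__ == "__main__":
    # Build a throwaway directory that mimics a bare repo with one empty object.
    with tempfile.TemporaryDirectory() as repo:
        os.makedirs(os.path.join(repo, "objects", "0a"))
        open(os.path.join(repo, "objects", "0a", "placeholder"), "w").close()
        for path in find_empty_loose_objects(repo):
            print("empty object file:", path)
```

For a bare repository, `git_dir` is the repository directory itself; for a normal checkout it would be the `.git` directory.
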
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!
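
Editor's sketch, not part of the log: the check clarkb ran at 18:55 was a two-stage `grep` over gerrit's replication log, keeping lines that contain `Failed replicate of` and dropping those that mention kolla-ansible. The same filter could be written as below. The function name `failed_replications` is hypothetical, and the sample lines are invented placeholders, since the log above does not show the real replication log format.

```python
def failed_replications(log_lines, exclude=None):
    """Keep lines containing 'Failed replicate of'; drop lines containing `exclude`."""
    marker = "Failed replicate of"
    return [
        line
        for line in log_lines
        if marker in line and (exclude is None or exclude not in line)
    ]


if __name__ == "__main__":
    # Invented sample lines; only the marker text comes from the log above.
    sample = [
        "Failed replicate of X to example-host-a/kolla-ansible",
        "Replication completed for example-host-b/other-project",
    ]
    remaining = failed_replications(sample, exclude="kolla-ansible")
    print(len(remaining), "failures outside kolla-ansible")
```

An empty result corresponds to the empty `grep` output in the log: no failed replication was logged for any other repository. Passing `exclude="gitea01"` instead mirrors the follow-up check at 18:57.
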