Thursday, 2024-07-25

tonybI'm working on a small script to use as the basis of verifying that an RPM based distro mirror is "correct", any objections to me installing python3.20-venv on one of the existing region-cloud mirrors ?00:05
fungino objection here, except that i want details on the time machine you used to get your hands on a packaged python 3.20 module00:37
fungiif you meant 3.10, also no objection and you can keep the time machine plans safely hidden for a while longer00:38
tonybIt's Australia .... we live in the future ... I've said too much00:38
fungiokay, that makes sense00:38
fungiyou've harnessed the relativistic time dilation inherent to time zone maths00:39
tonyb:)00:40
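
A minimal sketch of the kind of check such a mirror-verification script might start from, assuming the standard yum/dnf repodata layout; the URLs below are placeholders, not the real mirror paths. A fuller check would parse repomd.xml and verify the checksums of the metadata and package files it references.

    #!/usr/bin/env python3
    # Hypothetical sketch: compare a mirror's repomd.xml against upstream to
    # spot stale or incomplete repo metadata. Both URLs are placeholders.
    import hashlib
    import urllib.request

    UPSTREAM = "https://example.org/distro/repodata/repomd.xml"               # placeholder
    MIRROR = "https://mirror.example.opendev.org/distro/repodata/repomd.xml"  # placeholder

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    if hashlib.sha256(fetch(UPSTREAM)).digest() == hashlib.sha256(fetch(MIRROR)).digest():
        print("repomd.xml matches upstream")
    else:
        print("repomd.xml differs: mirror may be stale or mid-sync")
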
*** benj_7 is now known as benj_01:53
*** ykarel_ is now known as ykarel12:07
fricklerjust another data point for possible gerrit slowness, creating https://review.opendev.org/c/openstack/python-openstackclient/+/924927 via the UI took about 15s12:22
fungiinteresting. i wonder if it's slow from some parts of the internet and fast for others because of packet loss/latency on certain routes impacting all the rest api round trips between the js webclient and the server13:42
fungihow's your connectivity to the server more generally?13:42
fungialso sort of distracted with storm-related power outages this morning, so still catching up13:45
fungigood reminder that i really need a better ups for my workstation13:47
fricklermost other actions are without noticeable delay, so unless the cherry-pick is doing a huge bunch of API calls in the background, I'd assume the delay to be independent of the network latency14:13
fricklernow you can create spam in gmail via some "react" button, great14:40
JayFFWIW Microsoft started it with that particular piece of fun. There's some email header you can add to outgoing mail to opt out of having your messages reacted to.14:42
fricklerJayF: do you happen to have a pointer to that? not sure though whether we could or would want to teach mailman to do that, though, but I'm curious anyway14:44
fricklerdelete one of the duplicate "though" ad lib ;)14:45
JayFhttps://neilzone.co.uk/2024/07/attempting-to-stop-microsoft-users-sending-reactions-to-email-from-me-by-adding-a-postfix-header/14:46
fricklernice. maybe we should just set "x-spam-allowed: false" and expect people to adhere to that14:49
JayFI make sure to set the do not track bit in my browser, so those nasty unethical ad networks and malware know to stay away.14:53
JayFlol14:53
JayFBTW; I have a stack of "Please don't steal this car" bumper stickers if you need one14:53
fricklerbrilliant idea, could save a lot on insurance that way I'm sure14:57
*** benj_1 is now known as benj_15:02
fungiyeah, for me (in mutt) all it displays is a url to the "emoji" they reacted with15:08
fungiwe can at least add a pattern match to specific mailing lists' receipt rules to automatically moderate or reject such posts15:09
*** dviroel is now known as dviroel|afk15:14
clarkbfrickler: I think it's less about api calls and more about doing git things15:43
clarkbfwiw on my bike ride this morning I wondered if we're more sensitive to slowness via the web ui than via git review just because of the medium we're interacting through15:44
clarkband maybe this isn't abnormal, it's just more noticeable15:44
clarkbbut I almost never use the web ui for stuff like this so I don't have a good feeling for a baseline15:44
clarkbfwiw gmail greys out the react button on google group threads15:46
clarkbso adding that header to our lists would probably work for gmail at least15:46
fricklerwell if not using the UI, the cherry-pick would get done by git locally and the ensuing git-review wouldn't differ from any other ps submission, so I'm not surprised that there should be no difference in that case15:58
clarkbright exactly16:00
clarkband you'd use a likely underloaded personally dedicated machine to perform all the git operations and io compared to the server which is often fairly loaded and doing many things at once16:00
clarkbbut also I think people have reported pushes can be slow. If I had to guess the underlying cause is io related16:01
fricklerthat's surely possible and likely not something we could do much about. also likely to be so bursty that we couldn't detect anything in cacti16:02
clarkbya we might be able to write some ebpf code that could log/trap when we go over certain thresholds or something16:03
clarkbbut as you say we probably can't do much about that as it's relying on ceph under the hood iirc and we're at the mercy of the network and disks involved for that16:04
fricklerand sadly we haven't been able to get good performance related feedback from any of our cloud providers, even the ones that sometimes lurk around here :-(16:06
fricklerand I don't think moving gerrit to openmetal would sound like a reasonable option. or would it?16:07
clarkbI'm not sure that is a good option. In particular we are so resource constrained there that the only way to do upgrades of the underlying cloud is to delete the existing cloud and start over. This works well enough for our CI test nodes but less well for long lived services with persistent data needs16:08
clarkbalso it too uses ceph (but at least we have more tuning and insight there)16:08
fricklerI think we could switch to using local storage if we wanted, but yes the upgrade situation certainly is a good argument16:10
*** iurygregory__ is now known as iurygregory16:35
fricklerclarkb: there were some new-looking failures from the gc cron in gitea09 just now17:27
clarkblooks like the same thing at gitea12?17:29
clarkbya this server has an uptime of only 2 days17:29
clarkbso likely the exact same issue :/17:29
clarkbfungi: ^ fyi since you worked through that previously17:30
clarkbany chance you are willing to correct the issue on 09 the same way?17:30
clarkbside note: anyone know anyone at ceph/ibm/red hat that might be willing to debug these issues?17:30
clarkbbecause git having data loss/corruption is almost certainly going to be a problem at the data persistence layer and not git itself17:31
clarkbgitea14 also has an uptime of 2 days. The others have uptimes of 10 days17:32
fricklerside note: trying to log into the server, I noticed that IPv6 to vexxhost seems broken from here once again :(17:32
clarkbso doesn't seem to be a 100% failure. Probably requires us to be attempting writes while the shutdown occurs and git believing the data has been persisted when it hasn't been17:33
fricklernot sure if that might also play a role in the apparent gerrit UI slowness17:33
clarkbthose are different locations but could be17:33
clarkbgerrit vs gitea I mean (different locations in the same provider but I think they may have very different routes)17:33
clarkbgitea14 is complaining about the same repo17:34
fricklerclarkb: couldn't the write thing just be normal ext4 behavior in the case of unexpected shutdown? 17:34
fricklermaybe check whether a fsck happened during the boot, I think that should show up in the journal?17:34
clarkbfrickler: maybe? I thought git was extra careful about persisting things17:36
fricklerseems like gitea09 was actually rebooted twice? or it might just be short log retention17:36
clarkbbut ya I guess if the filesystem is happy then git probably can't be any happier17:36
clarkblooking at /var/log/dmesg I don't see anything indicating a fsck occurred17:38
fricklerI only found this in the journal, vda15 is /boot/efi https://paste.opendev.org/show/bjM0rrpKGjPNfuw9wys7/17:38
frickleralso "last" says only one reboot17:39
clarkb"Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls; indeed, that's exactly what those system calls are for." from https://lwn.net/Articles/322823/ I think the last time I looked at this git was very careful about doing these things17:40
fricklerbut there seems to have been a downtime of >25 mins, at least according to the gap in the log17:40
fricklerso either there was some longer hypervisor downtime, or it might have stopped being able to write to the disk early and gotten rebooted later17:41
fricklerthe latter might also be able to explain inconsistencies17:41
clarkbya thats a good point. It could be there were disk io issues that led to crash and reboot17:42
clarkbrather than the other way around17:42
fricklerif mnaser would read this, they could maybe check if the affected servers are even on the same host17:42
frickleror is there a way to check from within the instance? I think openstack tries to avoid making that public17:43
clarkblooking at email it appears that gitea09 and gitea14 both have sad cinder repos and need to be recovered similar to the way fungi did so for gitea12. Both gitea09 and gitea14 have been up for ~2 days17:43
clarkbfrickler: I think a uuid may be available in metadata somewhere. Let me see if I can find it17:43
clarkbit's just a uuid but that is enough to determine if the hosts are the same17:43
clarkboh we don't have config drive here so have to use web metadata this will take me a bit longer17:44
fricklerlooking at ping latency, ping from 09 to 14 seems to be faster than to 11 or 12, so that might at least be consistent with that hypothesis17:45
clarkbI'm not finding that info in the meta_data.json. vendor data is empty and network data doesn't have it either17:46
clarkbso ya maybe this info is not available17:46
clarkb`curl http://169.254.169.254/openstack/latest/meta_data.json` fwiw17:46
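
One thing the compute API (as opposed to the metadata service) does expose is hostId, a per-project hash of the hypervisor, so comparing it between two servers in the same project shows whether they share a host. A sketch with openstacksdk; the cloud and server names are assumptions.

    import openstack

    # hostId is a project-scoped hash of the hypervisor host, so equal values for
    # two servers in the same project mean they landed on the same host.
    conn = openstack.connect(cloud="vexxhost-sjc1")
    a = conn.compute.get_server(conn.compute.find_server("gitea09"))
    b = conn.compute.get_server(conn.compute.find_server("gitea14"))
    print("same host" if a.host_id == b.host_id else "different hosts")
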
fricklerthe timestamps from "sudo journalctl --list-boot" also agree to within 2 seconds17:48
clarkbre ext4 I think that generally yes losing data (but not journal metadata) is expected in a defaultish setup which we appear to be using. However applications have the ability to force the kernel to flush things out when necessary and my recollection is that git is extremely careful about doing so. This would imply to me that the problem lives under our filesystem in the ceph block17:54
clarkblayer, but I don't actually have hard evidence of that just inferences17:54
clarkbas far as fixing goes I can do that after lunch. I'm actually going to pop out in a little bit for that otherwise I'd start working on this now but we had previous plans for lunch today I need to keep. I'm also happy for someone else to go ahead and do it or have fungi walk me through it etc. I just can't really dive into that for a couple of hours17:55
*** iurygregory__ is now known as iurygregory17:57
fungiclarkb: yeah, i can take care of it now, stepped out during a break in the rain to run some quick errands but i'm back for the rest of the day18:16
clarkbthanks!18:16
fungifrickler: the outage for gitea12 was similarly lengthy. i expect something like a hung hypervisor host and then it had to be rebooted before the instances came back up18:17
fungifwiw, looks like gitea09 and 14 have uptimes of ~2.5 days (booted about a minute apart)18:29
fungigitea10-13 have uptimes of a little over 10 days, again all within 2 minutes of one another18:29
fungiso something is definitely causing ungraceful reboots in vexxhost sjc118:29
fungilooking at gc errors mailed by cron, gitea09 and gitea12 both complained today about empty cinder.git/objects/fa/bf8eb32672de75f86c6644ea69c43e465eb35c18:49
fungigitea09 also complained about it on tuesday (but not 14, and neither of them yesterday)18:49
fungino complaints from any others since sunday when gitea12 was unhappy about the one i fixed18:51
fungii've taken 09 and 14 out of service in the http and https haproxy pools temporarily18:53
fungicurrently working on transplanting the good cinder bare repo from gitea13 to those servers18:58
opendevreviewJan Gutter proposed zuul/zuul-jobs master: Update ensure-kubernetes with podman support  https://review.opendev.org/c/zuul/zuul-jobs/+/92497019:07
fungimy transfer rate downloading this 655 megabyte file from gitea13 is averaging around 4 megabits per second. maybe i should have set up some temporary authentication to transfer it directly between the servers instead19:22
clarkbfungi: I'm mostly back at this point if there is anything I can do to be useful20:20
fungicopies to servers just completed20:24
fungiuntarring them into place now20:26
clarkbfungi: do we shutdown the gitea services and remove the old content first or just go over the top?20:27
fungii've not been shutting down gitea services, just deleting the bare repo out from under them and immediately untarring the donor copy20:28
clarkback20:28
fungibecause who knows what differences there may be due to independent git gc between the servers20:28
clarkbya going over the top of the repo seems like it probably won't work in all cases20:29
fungirerunning the command from the gc cron on both servers to see if they complain about anything else now20:30
fungiif they come out clean, i'll re-replicate the repo to them from gerrit next20:30
clarkb++20:31
clarkbhttps://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/20:31
clarkbI'm with read the docs on this one. The bots are behaving poorly and impacting the people they rely on.20:33
fungiJayF: ^ seems like you had friends in that space, though they've probably already seen the post too20:34
JayFThere was also a recent post in that channel, with my buddy from anthropic, about ifixit complaining they got 1M hits in a day20:35
JayFhe doesn't directly control the crawlers or anything, but he feeds in the feedback thru whatever mechanism they have 20:36
fungigit gc clean now on both servers, replication from gerrit in progress20:50
fungiand done20:50
JayFfungi: https://www.404media.co/anthropic-ai-scraper-hits-ifixits-website-a-million-times-in-a-day/ is that ifixit article20:50
fungii'll enable them in haproxy again20:50
fungi#status log Repaired data corruption for a repository on the gitea09 and gitea14 backends, root cause seems to be from an unexpected hypervisor host outage20:52
opendevstatusfungi: finished logging20:52
fungithanks JayF!20:52
clarkbethics was a required course for me to get my degree. Kinda feels like everyone has collectively decided to leave that behind in the race for AI dominance20:57
clarkbfungi: thank you for taking care of the git repo issue.20:58
clarkbAs far as next steps go on that, we might be able to boot another node in that region and use nova shutdown apis while running git operations in a loop to try and reproduce20:58
clarkbI suspect that nova shutdown apis might be too graceful though20:58
clarkb(things will flush before going down vs say a truly hard shutdown from the hypervisor side)20:59
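
A rough sketch of that reproduction idea using openstacksdk; the cloud and server names are assumptions, and a HARD reboot is substituted for a plain stop since it resets the instance without giving the guest a chance to flush, which is closer to the ungraceful hypervisor outage suspected above. The git write loop itself would run on the test node, with a git fsck of the target repo after each cycle.

    import time
    import openstack

    conn = openstack.connect(cloud="vexxhost-sjc1")              # assumed clouds.yaml entry
    server = conn.compute.find_server("gitea-corruption-repro")  # assumed throwaway test node

    for _ in range(10):
        time.sleep(300)  # let the git write loop on the node churn for a while
        # HARD reboot: reset without guest cooperation, unlike a graceful stop
        conn.compute.reboot_server(server, reboot_type="HARD")
        conn.compute.wait_for_server(server, status="ACTIVE", wait=600)
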
tonybI've been working on creating noble images (Basically this: https://paste.opendev.org/show/b8Q8E7t1OahyCgxITOYX/)  Now when it comes to uploading them .... Do I need to upload the generated images to each region in each cloud or just once per cloud ?22:36
tonybor is it more like upload it once and see if they appear each region?22:37
clarkbtonyb: I believe that each region has its own images22:40
clarkbso they have to be uploaded separately to each region.22:40
clarkbtonyb: looking at your paste I think your vhd-util conversions are only half done22:42
tonybclarkb: Thanks.22:42
tonybclarkb: Oh!  What'd I miss?22:42
clarkbthere should be two vhd-util convert commands one from the raw source to an intermediate format then a second to the final format22:42
clarkbtonyb: I think (but I'm not sure) that the 0 and 1 in the first command become a 1 and a 2 in the second. I think that nodepool builder logs would confirm this22:43
clarkbif you look at logs for one of the image builds specifically not the builder service (they are split up)22:43
tonybOh that may be a formatting issue:  https://paste.opendev.org/raw/bbPlASvfVf1shcRihnqz/  Is what I did based on https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/img-functions#L15122:47
clarkbnow to figure out what `qemu-img convert -p -S 512` does. I've always just used the defaults I think22:47
fungiyes, every region's glance store is separate, at least for all the providers we're using22:47
tonyb-p == show a progress meter22:47
clarkbtonyb: oh yup I had to scroll to the right sorry. That does look correct now22:47
tonyb-S 512 == treat 512 consecutive zero bytes as a "sparse block" and write the image as such22:48
clarkbtonyb: ah ok. Does that help keep the size down? I wonder if that is something we could add to dib/nodepool22:48
clarkbtonyb: one other thing to keep in mind is that image upload to rax is weird22:49
clarkbthe images have to go to swift first and then get imported from there to glance using the task api22:49
tonybI'd have to double check, I admit it's something from muscle memory from $way back22:49
clarkbopenstacksdk's high level methods for image upload should do that stuff for you I think (and previously shade would) but you may need to use a small script rather than openstack cli tooling22:50
tonybclarkb: Can you share your history for when you did it for OpenMetal recently?22:50
clarkbtonyb: for OpenMetal I was lazy and just did it through horizon22:52
clarkbunfortunately that means there isn't any command history :/22:53
tonybOkay22:53
clarkbtonyb: I think it would be something like `openstack image create --disk-format raw --file ./path-to-local-raw-file.raw --private image-name-here`22:55
tonybAh okay that's what I have22:55
clarkb--private might be too restrictive. I think that means only this tenant can use it. And since the image comes from upstream maybe we don't care if others use it. The upside to making it private though is we can delete it later without errors if people have bfv using it or whatever so I'd use private22:56
tonybI think I went with --shared, so if needed we can share it with the jenkins/zuul project on a given cloud22:57
tonybbut I can switch to --private22:57
clarkbtonyb: let me pull up the glance docs. its probably fine22:57
clarkbya shared seems fine since its an explicit action to add other tenants22:58
tonybYup.22:59
clarkband if we only do that for the tenants we care about then we avoid the bfv problem without having control over that22:59
clarkbhttps://docs.openstack.org/api-ref/image/v2/ ^F shared if anyone else is curious22:59
tonybAlso https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/image-v2.html#cmdoption-openstack-image-create-shared22:59
tonybhttps://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ ?23:10
tonybFirst block should be all the raw images, second block should be the vhd versions23:11
clarkbtonyb: as mentioned previously I don't think uploading images to rax using openstack client will work. Also there is already a noble image in openmetal so we don't need to upload there (we can, it doesn't hurt much other than consuming more disk space). The arm servers will need their own arm images too (not amd64)23:12
clarkband i haven't checked if ovh and vexxhost have images or not23:12
tonybThanks, I was thinking that it would be helpful to have exactly the same version of the image in $all clouds to reduce variability23:14
tonybI see what you're saying about RAX now, sorry I missed that23:15
tonybOnly OpenMetal has Noble images23:15
clarkbfwiw I don't think it hurts to try and upload to rax. But I'm like 98% certain the client never implemented the two-phase swift-then-glance task upload process that rax uses because they are the only cloud that does it and it isn't really standard23:15
clarkband instead we will have to use the sdk directly to get the cloud.upload_image() magic or whatever it is to do that.23:15
tonybGot it23:16
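
A small-script sketch of that, assuming a clouds.yaml entry for the Rackspace region and a locally built VHD file (both names are placeholders); the SDK's cloud-layer create_image() helper is the piece that should route the upload through swift and the glance task import when the cloud is configured to use tasks.

    import openstack

    conn = openstack.connect(cloud="rax-dfw")  # assumed clouds.yaml entry

    # The cloud layer performs the swift upload + glance task import when the
    # cloud's profile says to use image tasks, which the plain CLI does not do.
    image = conn.create_image(
        "ubuntu-noble",
        filename="noble-server-cloudimg-amd64.vhd",  # assumed local file
        disk_format="vhd",
        container_format="bare",
        wait=True,
        timeout=3600,
    )
    print(image.id)
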
JayFclarkb: might be worth reaching out to James D from rackspace; they have any resources at all pointed at openstack now; making the client actually work for their cloud I'd imagine would be a very good use of them23:16
tonybJayF: In the past they've focused development on the 'rax'23:17
tonyb commandline tool rather than openstack23:17
JayFYeah, but they've been making a big deal about being more openstack-y as of late. Never hurts to try and get someone to point in the right direction :) 23:17
tonybbut maybe in this new world23:18
JayFthat's what I'm suggesting? hoping? 23:18
clarkband yes I think the glance image task api was a really unfortunate moment in history for openstack users. Maybe one day we'll be completely beyond it23:18
tonybyup.  I'm hoping too23:18
clarkbre talking to rax we might also encourage them to provide noble images (since others may want them)23:28
tonybclarkb: Good point23:29
