tonyb | I'm working on a small script to use as the basis of verifying that an RPM based distro mirror is "correct", any objections to me installing python3.20-venv on one of the existing region-cloud mirrors ? | 00:05 |
---|---|---|
fungi | no objection here, except that i want details on the time machine you used to get your hands on a packaged python 3.20 module | 00:37 |
fungi | if you meant 3.10, also no objection and you can keep the time machine plans safely hidden for a while longer | 00:38 |
tonyb | It's Australia .... we live in the future ... I've said too much | 00:38 |
fungi | okay, that makes sense | 00:38 |
fungi | you've harnessed the relativistic time dilation inherent to time zone maths | 00:39 |
tonyb | :) | 00:40 |
*** benj_7 is now known as benj_ | 01:53 | |
*** ykarel_ is now known as ykarel | 12:07 | |
frickler | just another data point for possible gerrit slowness, creating https://review.opendev.org/c/openstack/python-openstackclient/+/924927 via the UI took about 15s | 12:22 |
fungi | interesting. i wonder if it's slow from some parts of the internet and fast for others because of packet loss/latency on certain routes impacting all the rest api round trips between the js webclient and the server | 13:42 |
fungi | how's your connectivity to the server more generally? | 13:42 |
fungi | also sort of distracted with storm-related power outages this morning, so still catching up | 13:45 |
fungi | good reminder that i really need a better ups for my workstation | 13:47 |
frickler | most other actions are without noticeable delay, so unless the cherry-pick is doing a huge bunch of API calls in the background, I'd assume the delay to be independent of the network latency | 14:13 |
frickler | now you can create spam in gmail via some "react" button, great | 14:40 |
JayF | FWIW Microsoft started it with that particular piece of fun. There's some email header you can add outgoing to opt out of having your messages reacted to. | 14:42 |
frickler | JayF: do you happen to have a pointer to that? not sure though whether we could or would want to teach mailman to do that, though, but I'm curious anyway | 14:44 |
frickler | delete one of the duplicate "though" ad lib ;) | 14:45 |
JayF | https://neilzone.co.uk/2024/07/attempting-to-stop-microsoft-users-sending-reactions-to-email-from-me-by-adding-a-postfix-header/ | 14:46 |
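The approach in the linked post boils down to prepending a header on outgoing mail. A minimal sketch for Postfix, assuming the `X-MS-Reactions: disallow` header is still what the clients honor and using the From: header as a convenient match trigger (the exact rule in the post may differ):

```shell
# Prepend "X-MS-Reactions: disallow" to all mail Postfix sends out.
cat > /etc/postfix/smtp_header_checks <<'EOF'
/^From:/ PREPEND X-MS-Reactions: disallow
EOF

postconf -e 'smtp_header_checks = regexp:/etc/postfix/smtp_header_checks'
postfix reload
```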
frickler | nice. maybe we should just set "x-spam-allowed: false" and expect people to adhere to that | 14:49 |
JayF | I make sure to set the do not track bit in my browser, so those nasty unethical ad networks and malware know to stay away. | 14:53 |
JayF | lol | 14:53 |
JayF | BTW; I have a stack of "Please don't steal this car" bumper stickers if you need one | 14:53 |
frickler | brilliant idea, could save a lot on insurance that way I'm sure | 14:57 |
*** benj_1 is now known as benj_ | 15:02 | |
fungi | yeah, for me (in mutt) all it displays is a url to the "emoji" they reacted with | 15:08 |
fungi | we can at least add a pattern match to specific mailing lists' receipt rules to automatically moderate or reject such posts | 15:09 |
*** dviroel is now known as dviroel|afk | 15:14 | |
clarkb | frickler: I think its less about api calls and more about doing git things | 15:43 |
clarkb | fwiw on my bike ride this morning I wondered if we're more sensitive to slowness via the web ui than via git review just because of the medium we're interacting through | 15:44 |
clarkb | and maybe this isn't abnormal, it's just more noticeable | 15:44 |
clarkb | but I almost never use the web ui for stuff like this so I don't have a good feeling for a baseline | 15:44 |
clarkb | fwiw gmail greys out the react button on google group threads | 15:46 |
clarkb | so adding that header to our lists would probably work for gmail at least | 15:46 |
frickler | well if not using the UI, the cherry-pick would get done by git locally and the ensuing git-review wouldn't differ from any other ps submission, so I'm not surprised that there should be no difference in that case | 15:58 |
clarkb | right exactly | 16:00 |
clarkb | and you'd use a likely underloaded personally dedicated machine to perform all the git operations and io compared to the server which is often fairly loaded and doing many things at once | 16:00 |
clarkb | but also I think people have reported pushes can be slow. If I had to guess the underlying cause is io related | 16:01 |
frickler | that's surely possible and likely not something we could do much about. also likely to be so bursty that we couldn't detect anything in cacti | 16:02 |
clarkb | ya we might be able to write some ebpf code that could log/trap when we go over certain thresholds or something | 16:03 |
clarkb | but as you say we probably can't do much about that as it's relying on ceph under the hood iirc and we're at the mercy of the network and disks involved for that | 16:04 |
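For the eBPF idea, something like this bpftrace one-liner could log block I/O completions that cross a threshold (a sketch only; the 100ms cutoff and output format are arbitrary, and bpftrace would need to be installed on the server):

```shell
# Record issue time per (dev, sector) and print completions slower than 100ms.
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  $ms = (nsecs - @start[args->dev, args->sector]) / 1000000;
  if ($ms > 100) { printf("dev=%d sector=%d latency_ms=%d\n", args->dev, args->sector, $ms); }
  delete(@start[args->dev, args->sector]);
}'
```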
frickler | and sadly we haven't been able to get good performance related feedback from any of our cloud providers, even the ones that sometimes lurk around here :-( | 16:06 |
frickler | and I don't think moving gerrit to openmetal would sound like a reasonable option. or would it? | 16:07 |
clarkb | I'm not sure that is a good option. In particular we are so resource constrained there that the only way to do upgrades of the underlying cloud is to delete the existing cloud and start over. This works well enough for our CI test nodes but less well for long lived services with persistent data needs | 16:08 |
clarkb | also it too uses ceph (but at least we have more tuning and insight there) | 16:08 |
frickler | I think we could switch to using local storage if we wanted, but yes the upgrade situation certainly is a good argument | 16:10 |
*** iurygregory__ is now known as iurygregory | 16:35 | |
frickler | clarkb: there were some new-looking failures from the gc cron in gitea09 just now | 17:27 |
clarkb | looks like the same thing at gitea12? | 17:29 |
clarkb | ya this server has an uptime of only 2 days | 17:29 |
clarkb | so likely the exact same issue :/ | 17:29 |
clarkb | fungi: ^ fyi since you worked through that previously | 17:30 |
clarkb | any chance you are willing to correct the issue on 09 the same way? | 17:30 |
clarkb | side note: anyone know anyone at ceph/ibm/red hat that might be willing to debug these issues? | 17:30 |
clarkb | because git having data loss/corruption is almost certainly going to be a problem at the data persistence layer and not git itself | 17:31 |
clarkb | gitea14 also has an uptime of 2 days. The others have uptimes of 10 days | 17:32 |
frickler | side note: trying to log into the server, I noticed that IPv6 to vexxhost seems broken from here once again :( | 17:32 |
clarkb | so doesn't seem to be a 100% failure. Probably requires us to be attempting writes while the shutdown occurs and git believing the data has been persisted when it hasn't been | 17:33 |
frickler | not sure if that might also play a role in the apparent gerrit UI slowness | 17:33 |
clarkb | those are different locations but could be | 17:33 |
clarkb | gerrit vs gitea I mean (different locations in the same provider but I think they may have very different routes) | 17:33 |
clarkb | gitea14 is complaining about the same repo | 17:34 |
frickler | clarkb: couldn't the write thing just be normal ext4 behavior in the case of unexpected shutdown? | 17:34 |
frickler | maybe check whether a fsck happened during the boot, I think that should show up in the journal? | 17:34 |
clarkb | frickler: maybe? I thought git was extra careful about persisting things | 17:36 |
frickler | seems gitea09 was actually rebooted twice, it looks like? or might be just short log retention | 17:36 |
clarkb | but ya I guess if the filesystem is happy then git probably can't be any happier | 17:36 |
clarkb | looking at /var/log/dmesg I don't see anything indicating a fsck occurred | 17:38 |
frickler | I only found this in the journal, vda15 is /boot/efi https://paste.opendev.org/show/bjM0rrpKGjPNfuw9wys7/ | 17:38 |
frickler | also "last" says only one reboot | 17:39 |
clarkb | "Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls; indeed, that's exactly what those system calls are for." from https://lwn.net/Articles/322823/ I think the last time I looked at this git was very careful about doing these things | 17:40 |
frickler | but there seems to have been a downtime of >25 mins, at least according to the gap in the log | 17:40 |
frickler | so either there was some longer hypervisor downtime, or it might have stopped being able to write to the disk early and gotten rebooted later | 17:41 |
frickler | the latter might also be able to explain inconsistencies | 17:41 |
clarkb | ya thats a good point. It could be there were disk io issues that led to crash and reboot | 17:42 |
clarkb | rather than the other way around | 17:42 |
frickler | if mnaser would read this, they could maybe check if the affected servers are even on the same host | 17:42 |
frickler | or is there a way to check from within the instance? I think openstack tries to avoid making that public | 17:43 |
clarkb | looking at email it appears that gitea09 and gitea14 both have sad cinder repos and need to be recovered similar to the way fungi did so for gitea12. Both gitea09 and gitea14 have been up for ~2 days | 17:43 |
clarkb | frickler: I think a uuid may be available in metadata somewhere. Let me see if I can find it | 17:43 |
clarkb | it's just a uuid but that is enough to determine if the hosts are the same | 17:43 |
clarkb | oh we don't have config drive here so have to use the web metadata; this will take me a bit longer | 17:44 |
frickler | looking at ping latency, ping from 09 to 14 seems to be faster than to 11 or 12, so that might at least be consistent with that hypothesis | 17:45 |
clarkb | I'm not finding that info in the meta_data.json. vendor data is empty and network data doesn't have it either | 17:46 |
clarkb | so ya maybe this info is not available | 17:46 |
clarkb | `curl http://169.254.169.254/openstack/latest/meta_data.json` fwiw | 17:46 |
frickler | the timestamps from "sudo journalctl --list-boot" also agree to within 2 seconds | 17:48 |
clarkb | re ext4 I think that generally yes losing data (but not journal metadata) is expected in a defaultish setup which we appear to be using. However applications have the ability to force the kernel to flush things out when necessary and my recollection is that git is extremely careful about doing so. This would imply to me that the problem lives under our filesystem in the ceph block | 17:54 |
clarkb | layer, but I don't actually have hard evidence of that just inferences | 17:54 |
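One way to test that assumption about git's fsync behavior directly would be to strace a throwaway repo (a sketch; the repo path is a placeholder):

```shell
# See which fsync-family syscalls git issues while writing a commit.
strace -f -e trace=fsync,fdatasync,sync_file_range \
    git -C /tmp/fsync-probe-repo commit --allow-empty -m 'fsync probe' 2>&1 \
    | grep -E 'fsync|fdatasync|sync_file_range'
```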
clarkb | as far as fixing goes I can do that after lunch. I'm actually going to pop out in a little bit for that otherwise I'd start working on this now but we had previous plans for lunch today I need to keep. I'm also happy for someone else to go ahead and do it or have fungi walk me through it etc. I just can't really dive into that for a couple of hours | 17:55 |
*** iurygregory__ is now known as iurygregory | 17:57 | |
fungi | clarkb: yeah, i can take care of it now, stepped out during a break in the rain to run some quick errands but i'm back for the rest of the day | 18:16 |
clarkb | thanks! | 18:16 |
fungi | frickler: the outage for gitea12 was similarly lengthy. i expect something like a hung hypervisor host and then it had to be rebooted before the instances came back up | 18:17 |
fungi | fwiw, looks like gitea09 and 14 have uptimes of ~2.5 days (booted about a minute apart) | 18:29 |
fungi | gitea10-13 have uptimes of a little over 10 days, again all within 2 minutes of one another | 18:29 |
fungi | so something is definitely causing ungraceful reboots in vexxhost sjc1 | 18:29 |
fungi | looking at gc errors mailed by cron, gitea09 and gitea12 both complained today about empty cinder.git/objects/fa/bf8eb32672de75f86c6644ea69c43e465eb35c | 18:49 |
fungi | gitea09 also complained about it on tuesday (but not 14, and neither of them yesterday) | 18:49 |
fungi | no complaints from any others since sunday when gitea12 was unhappy about the one i fixed | 18:51 |
fungi | i've taken 09 and 14 out of service in the http and https haproxy pools temporarily | 18:53 |
fungi | currently working on transplanting the good cinder bare repo from gitea13 to those servers | 18:58 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Update ensure-kubernetes with podman support https://review.opendev.org/c/zuul/zuul-jobs/+/924970 | 19:07 |
fungi | my transfer rate downloading this 655 megabyte file from gitea13 is averaging around 4 megabits per second. maybe i should have set up some temporary authentication to transfer it directly between the servers instead | 19:22 |
clarkb | fungi: I'm mostly back at this point if there is anything I can do to be useful | 20:20 |
fungi | copies to servers just completed | 20:24 |
fungi | untarring them into place now | 20:26 |
clarkb | fungi: do we shutdown the gitea services and remove the old content first or just go over the top? | 20:27 |
fungi | i've not been shutting down gitea services, just deleting the bare repo out from under them and immediately untarring the donor copy | 20:28 |
clarkb | ack | 20:28 |
fungi | because who knows what differences there may be due to independent git gc between the servers | 20:28 |
clarkb | ya going over the top of the repo seems like it probably won't work in all cases | 20:29 |
fungi | rerunning the command from the gc cron on both servers to see if they complain about anything else now | 20:30 |
fungi | if they come out clean, i'll re-replicate the repo to them from gerrit next | 20:30 |
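For reference, the transplant fungi describes is roughly the following (a sketch with an assumed repository root; the real gitea data path and ownership on these servers may differ):

```shell
# On the donor (gitea13): archive the known-good bare repo.
REPO_ROOT=/var/gitea/data/git/repositories/openstack   # assumed path
tar -C "$REPO_ROOT" -czf /tmp/cinder.git.tar.gz cinder.git

# Copy the tarball to each broken backend, then swap the repo out from
# under gitea and re-check with the same gc the cron job runs.
rm -rf "$REPO_ROOT/cinder.git"
tar -C "$REPO_ROOT" -xzf /tmp/cinder.git.tar.gz
git -C "$REPO_ROOT/cinder.git" gc
# Finally, trigger re-replication of the repo from gerrit.
```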
clarkb | ++ | 20:31 |
clarkb | https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ | 20:31 |
clarkb | I'm with read the docs on this one. The bots are behaving poorly and impacting the people they rely on. | 20:33 |
fungi | JayF: ^ seems like you had friends in that space, though they've probably already seen the post too | 20:34 |
JayF | There was also a recent post in that channel, with my buddy from anthropic, about ifixit complaining they got 1M hits in a day | 20:35 |
JayF | he doesn't directly control the crawlers or anything, but he feeds in the feedback thru whatever mechanism they have | 20:36 |
fungi | git gc clean now on both servers, replication from gerrit in progress | 20:50 |
fungi | and done | 20:50 |
JayF | fungi: https://www.404media.co/anthropic-ai-scraper-hits-ifixits-website-a-million-times-in-a-day/ is that ifixit article | 20:50 |
fungi | i'll enable them in haproxy again | 20:50 |
fungi | #status log Repaired data corruption for a repository on the gitea09 and gitea14 backends, root cause seems to be from an unexpected hypervisor host outage | 20:52 |
opendevstatus | fungi: finished logging | 20:52 |
fungi | thanks JayF! | 20:52 |
clarkb | ethics was a required course for me to get my degree. Kinda feels like everyone has collectively decided to leave that behind in the race for AI dominance | 20:57 |
clarkb | fungi: thank you for taking care of the git repo issue. | 20:58 |
clarkb | As far as next steps go on that we might be able to boot another node in that region and use nova shutdown apis while running git operations in a loop to try and reproduce | 22:58 |
clarkb | I suspect that nova shutdown apis might be too graceful though | 20:58 |
clarkb | (things will flush before going down vs say a truly hard shutdown from the hypervisor side) | 20:59 |
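A sketch of that reproduction idea (instance name and paths are placeholders; as noted, `openstack server stop` probably requests a graceful guest shutdown, so a harder power-off from the provider side may be needed to really mimic the failure):

```shell
# On the throwaway test instance: keep git busy writing objects.
git init /opt/testrepo
while true; do
    head -c 1M /dev/urandom > /opt/testrepo/blob
    git -C /opt/testrepo add blob
    git -C /opt/testrepo commit -qm "write $(date +%s)"
done

# Meanwhile, from a workstation with credentials for that region:
openstack server stop test-git-crash
openstack server start test-git-crash
# ...then re-run git fsck / git gc on the instance and look for corruption.
```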
tonyb | I've been working on creating noble images (Basically this: https://paste.opendev.org/show/b8Q8E7t1OahyCgxITOYX/) Now when it comes to uploading them .... Do I need to upload the generated images to each region in each cloud or just once per cloud ? | 22:36 |
tonyb | or is it more like upload it once and see if they appear each region? | 22:37 |
clarkb | tonyb: I believe that each region has its own images | 22:40 |
clarkb | so they have to be uploaded separately per image. | 22:40 |
clarkb | tonyb: looking at your paste I think your vhd-util conversions are only half done | 22:42 |
tonyb | clarkb: Thanks. | 22:42 |
tonyb | clarkb: Oh! What'd I miss? | 22:42 |
clarkb | there should be two vhd-util convert commands one from the raw source to an intermediate format then a second to the final format | 22:42 |
clarkb | tonyb: I think (but I'm not sure) that the 0 and 1 in the first command become a 1 and a 2 in the second. I think that nodepool builder logs would confirm this | 22:43 |
clarkb | if you look at logs for one of the image builds specifically not the builder service (they are split up) | 22:43 |
tonyb | Oh that may be a formatting issue: https://paste.opendev.org/raw/bbPlASvfVf1shcRihnqz/ Is what I did based on https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/img-functions#L151 | 22:47 |
clarkb | now to figure out what `qemu-img convert -p -S 512` does. I've always just used the defaults I think | 22:47 |
fungi | yes, every region's glance store is separate, at least for all the providers we're using | 22:47 |
tonyb | -p == show a progress meter | 22:47 |
clarkb | tonyb: oh yup I had to scroll to the right sorry. That does look correct now | 22:47 |
tonyb | -S 512 means: consider 512 consecutive 0's a "sparse block" and write the image as such | 22:48 |
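Putting the paste together, the conversion pipeline is roughly the following (filenames are placeholders; the two-step vhd-util convert mirrors the dib img-functions code linked above and needs the patched vhd-util that dib expects):

```shell
# Convert to raw, writing runs of 512 zero bytes as sparse blocks,
# then step the raw image through the intermediate VHD type to the final one.
qemu-img convert -p -S 512 -O raw noble-server-cloudimg-amd64.img noble.raw
vhd-util convert -s 0 -t 1 -i noble.raw -o noble.vhd-intermediate
vhd-util convert -s 1 -t 2 -i noble.vhd-intermediate -o noble.vhd
```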
clarkb | tonyb: ah ok. Does that help keep the size down? I wonder if that is something we could add to dib/nodepool | 22:48 |
clarkb | tonyb: one other thing to keep in mind is that image upload to rax is weird | 22:49 |
clarkb | the images have to go to swift first and then get imported from there to glance using the task api | 22:49 |
tonyb | I'd have to double check, I admit it's something from muscle memory from $way back | 22:49 |
clarkb | openstacksdk's high level methods for image upload should do that stuff for you I think (and previously shade would) but you may need to use a small script rather than openstack cli tooling | 22:50 |
tonyb | clarkb: Can you share your history for when you did it for OpenMetal recently? | 22:50 |
clarkb | tonyb: for OpenMetal I was lazy and just did it through horizon | 22:52 |
clarkb | unfortunately that means there isn't any command history :/ | 22:53 |
tonyb | Okay | 22:53 |
clarkb | tonyb: I think it would be something like `openstack image create --disk-format raw --file ./path-to-local-raw-file.raw --private image-name-here` | 22:55 |
tonyb | Ah okay that's what I have | 22:55 |
clarkb | --private might be too restrictive. I think that means only this tenant can use it. And since the image comes from upstream maybe we don't care if others use it. The upside to making it private though is we can delete it later without errors if people have bfv using it or whatever so I'd use private | 22:56 |
tonyb | I think I went with --shared, so if needed we can share it with the jenkins/zuul project on a given cloud | 22:57 |
tonyb | but I can switch to --private | 22:57 |
clarkb | tonyb: let me pull up the glance docs. its probably fine | 22:57 |
clarkb | ya shared seems fine since its an explicit action to add other tenants | 22:58 |
tonyb | Yup. | 22:59 |
clarkb | and if we only do that for the tenants we care about then we avoid the bfv problem without having control over that | 22:59 |
clarkb | https://docs.openstack.org/api-ref/image/v2/ ^F shared if anyone else is curious | 22:59 |
tonyb | Also https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/image-v2.html#cmdoption-openstack-image-create-shared | 22:59 |
tonyb | https://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ ? | 23:10 |
tonyb | First block should be all the raw images, second block should be the vhd versions | 23:11 |
clarkb | tonyb: as mentioned previously I don't think the rax images using openstack client will work. Also there is already a noble image in openmetal so we don't need to upload there (we can, it doesn't hurt much other than consuming more disk space). The arm servers will need their own arm images too (not amd64) | 23:12 |
clarkb | and i haven't checked if ovh and vexxhost have images or not | 23:12 |
tonyb | Thanks, I was thinking that it would be helpful to have exactly the same version of image in $all clouds to reduce variability | 23:14 |
tonyb | I see what you're saying about RAX now, sorry I missed that | 23:15 |
tonyb | Only OpenMetal has Noble images | 23:15 |
clarkb | fwiw I don't think it hurts to try and upload to rax. But I'm like 98% certain the client never implemented the two phase swift then glance task upload process that rax uses because they are the only cloud that does it and it isn't really standard | 23:15 |
clarkb | and instead we will have to use the sdk directly to get the cloud.upload_image() magic or whatever it is to do that. | 23:15 |
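A sketch of that small-script approach via openstacksdk's cloud layer, which is what should take care of the swift upload plus glance task import on rax (the cloud name, image name, and filename here are all assumptions):

```shell
python3 - <<'EOF'
import openstack

# "rax-dfw" is an assumed clouds.yaml entry; repeat per region as needed.
cloud = openstack.connect(cloud='rax-dfw')
image = cloud.create_image(
    'ubuntu-noble',                             # hypothetical image name
    filename='noble-server-cloudimg-amd64.vhd',
    disk_format='vhd',
    container_format='bare',
    wait=True,
)
print(image.id)
EOF
```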
tonyb | Got it | 23:16 |
JayF | clarkb: might be worth reaching out to James D from rackspace; they have any resources at all pointed at openstack now; making the client actually work for their cloud I'd imagine would be a very good use of them | 23:16 |
tonyb | JayF: In the past they've focused development on the 'rax' | 23:17 |
tonyb | commandline tool rather than openstack | 23:17 |
JayF | Yeah, but they've been making a big deal about being more openstack-y as of late. Never hurts to try and get someone to point in the right direction :) | 23:17 |
tonyb | but maybe in this new world | 23:18 |
JayF | that's what I'm suggesting? hoping? | 23:18 |
clarkb | and yes I think the glance image task api was a really unfortunate moment in history for openstack users. Maybe one day we'll be completely beyond it | 23:18 |
tonyb | yup. I'm hoping too | 23:18 |
clarkb | re talking to rax we might also encourage them to provide noble images (since others may want it) | 23:28 |
tonyb | clarkb: Good point | 23:29 |