BlaisePabon[m] | How much work are we talking about? | 00:38 |
---|---|---|
BlaisePabon[m] | I have a fedora nested kvm server with 20 cores sitting in my closet? | 00:38 |
BlaisePabon[m] | You're welcome to throw jobs at it. | 00:38 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: [wip] test real import of graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 00:38 |
ianw | BlaisePabon[m]: we do obviously gratefully accept resources. but it's also non-zero cost to setup and maintain them, so we have to find the balance | 00:39 |
ianw | we have a control-plane and a CI tenant, in the control plane we run a cloud-local mirror (so CI nodes don't have to go over the network). | 00:40 |
ianw | the best situation is where somebody is motivated to maintain the cloud for other reasons ($$$$, probably :) and provides us resources. so we help develop the software they use, and they contribute resources in return. everyone wins when it works out | 00:42 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: [wip] test real import of graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 00:45 |
fungi | i'm going to perform a controlled reboot of lists.o.o now to make sure the latest unpacked kernel is viable | 01:29 |
*** rlandy|bbl is now known as rlandy | 01:33 | |
fungi | ugh, it's going straight into shutoff state | 01:37 |
fungi | i'm going to have to perform a rescue boot with it | 01:39 |
*** rlandy is now known as rlandy|out | 01:40 | |
ianw | :/ do you think it's the kernel or something else that's happened in the install? | 01:41 |
Clark[m] | In the past it's been either xen can't find the kernel or the kernel wasn't extracted properly | 01:41 |
Clark[m] | You should be able to edit the menu.lst to put an old kernel back ? | 01:42 |
clarkb | fungi: it was the vmlinuz that you extracted right? that's the file that needs to be uncompressed. Otherwise I got nothing | 01:48 |
fungi | yeah, i edited the menu.lst and moved the entry for the kernel we were previously booted on to the top of the list, then unrescued, but it still immediately goes to shutoff state when i try to start it | 01:49 |
fungi | it just picks the first one listed, right? | 01:50 |
clarkb | yes that was what it seemed to do in the past | 01:50 |
clarkb | I wonder if it is looking at grub.conf instead? | 01:50 |
clarkb | though I thought for us to do that we had to put an entry in menu.lst to boot the shim instead of the normal vmlinuz to chain load grub | 01:50 |
clarkb | and I think I tried to get that to work back when and couldn't get it to happen | 01:51 |
fungi | i'll try rolling it back in grub.cfg, yeah | 01:55 |
clarkb | it's a lot more complicated there unfortunately | 01:56 |
clarkb | I wonder if the console log shows anything that might indicate which kernel it is finding and attempting? | 01:57 |
clarkb | (that seems like a long shot but may offer clues if it does) | 01:57 |
*** ysandeep|out is now known as ysandeep | 01:59 | |
corvus | oh hi | 01:59 |
clarkb | hello | 01:59 |
corvus | i was just noticing mailing list accessibility probs, and i see you're looking at it... catching up now. | 01:59 |
fungi | just trying to test a reboot on a new unpacked kernel when it's less likely to impact anyone. glad i didn't do it at peak time | 02:00 |
clarkb | corvus: fungi was doing a controlled reboot to bump the kernel after extracting it (as is necessary because xen), but something is unhappy with that; almost certainly xen finding the unpacked kernel is the problem | 02:00 |
corvus | anything another set of eyes/hands can do? | 02:01 |
fungi | unfortunately switching the primary profile in grub.cfg to the old kernel doesn't seem to have solved it either | 02:02 |
clarkb | I'm not sure. I'm still feeling pretty blah and not firing all cylinders so will defer to fungi on that | 02:02 |
fungi | i'll see if i can get a console log out of it, but my luck with getting boot console output in rackspace has not been good | 02:03 |
clarkb | fungi: what is the suffix of the old kernel? -96? | 02:03 |
fungi | yes | 02:03 |
fungi | the new one is 121 | 02:03 |
clarkb | ya that matches my memory from when I looked a few days ago | 02:03 |
corvus | i've got a console up in my browser | 02:05 |
corvus | i'll stand by and wait to see if fungi gets a working console; otherwise, we can reboot it and i can watch it for clues | 02:06 |
clarkb | https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy is the chainloading thing I mentioned earlier should we want to try that | 02:07 |
clarkb | basically you edit menu.lst so that xen's grub1 built in stuff chainloads grub2 which reads the grub2 configs | 02:07 |
clarkb | But if I remember correctly I was never able to get that to work as expected | 02:08 |
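A minimal sketch of the chainload approach clarkb describes, based on the PvGrub2 wiki page linked above. The /boot/xen-shim path assumes the distro's grub-xen packaging provides a grub2 PV image there, and the root spec may need to be (hd0) or (hd0,0) as debated below:

    # menu.lst read by xen's legacy pvgrub; its only job is to load grub2,
    # which then reads the normal grub.cfg on the filesystem
    default 0
    timeout 5

    title Chainload GRUB 2
        root   (hd0,0)
        kernel /boot/xen-shim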
fungi | i've put the configs back to what they were for the first reboot and unrescued | 02:08 |
corvus | hrm, console disconnected and i didn't get it back, so i guess the browser console doesn't work for this situation. | 02:10 |
clarkb | I think it may disconnect you when you reboot which makes this sort of debugging more difficult? | 02:10 |
fungi | i've started the server now | 02:11 |
corvus | web ui says it's in "error" state | 02:11 |
fungi | oh, right after unrescue it seems to go into error | 02:12 |
fungi | i'll stop and start it again, but previously it would immediately go into shutoff when i tried to start it again | 02:12 |
fungi | yeah, trying to start the server it remains in "shutoff" state according to the api | 02:13 |
clarkb | Just off the top of my head things to consider: double check that vmlinuz was the extracted file. Double check that reextracting it results in the same file. Double check the default entry in menu.lst is the first (0th) item? It's possible xen is doing more parsing of the content than simply finding the first entry. Try the chainload grub2 thing from above | 02:15 |
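A hedged sketch of what the first of those checks might look like from the rescue environment once the rootfs is mounted at /mnt (paths follow the ones used later in this session):

    # is the installed kernel still the extracted/uncompressed copy?
    file /mnt/boot/vmlinuz-5.4.0-121-generic
    sha256sum /mnt/boot/vmlinuz-5.4.0-121-generic /mnt/root/kernel-stuff/vmlinuz-5.4.0-121-generic.extracted
    # confirm the default entry really is the first (0th) title
    grep -n '^default' /mnt/boot/grub/menu.lst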
fungi | yeah, i think we're stuck blindly trying things from rescue mode | 02:15 |
fungi | i also shouldn't have tried this so late in my day, but i expected rolling back to the working kernel would be a viable way out if the new one wasn't working | 02:16 |
fungi | putting it back into a rescue boot again now | 02:17 |
clarkb | fungi: do you want to add my pubkey to the rescue node and get a second set of eyes on it? | 02:20 |
fungi | sure, just a sec | 02:21 |
fungi | i've added the public keys for you, corvus and ianw to the root account | 02:23 |
fungi | the server's rootfs is mounted at /mnt | 02:23 |
fungi | checksum of /mnt/root/kernel-stuff/vmlinuz-5.4.0-121-generic.extracted matches /mnt/boot/vmlinuz-5.4.0-121-generic which is roughly the same size as /mnt/boot/vmlinuz-5.4.0-96-generic | 02:24 |
clarkb | menu.lst default is 0 so that is confirmed that the first entry should be used | 02:28 |
fungi | and /mnt/boot/vmlinuz-5.4.0-96-generic still has the same checksum as the extracted copy in kernel-stuff and has a january last modified date | 02:32 |
clarkb | ya I'm thinking it likely isn't an issue with the kernels but with finding them if that makes sense | 02:35 |
clarkb | since the old kernel should still boot if it is findable | 02:35 |
clarkb | talking out loud here: maybe we try a super simplified menu.lst to remove possibility that xen is being confused by something else in the file. And if that doesn't work try to chainload also using a super simplified version of the file? | 02:37 |
corvus | menu.lst_backup_by_grub2_prerm is interesting; it looks very similar to menu.lst | 02:37 |
corvus | if that's a legit old working menu.lst, it would suggest that nothing significant changed other than version numbers | 02:38 |
corvus | regardless, i do like clarkb's plan | 02:38 |
corvus | step 1: menu.lst with only the old kernel; step 2 https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy | 02:38 |
fungi | yes, that sounds solid | 02:39 |
corvus | i think (hd0,0) should be right for us (assuming those menu.lst files are roughly correct) | 02:39 |
clarkb | corvus: I think it may just be hd0? | 02:39 |
clarkb | our existing and old menu.lsts all use hd0 not hd0,0 at least | 02:39 |
fungi | though also that may be irrelevant if xen is reading the fs anyway | 02:40 |
corvus | well, the pvgrub example has a partition... and our device has a partition table and / is the first partition, so i'd assume (hd0,0) for that... | 02:41 |
clarkb | corvus: good point | 02:41 |
clarkb | /root/clarkb-menu.lst on the rescue side has a simplified version of what I think it may want to look like, though that still uses (hd0) | 02:41 |
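A hedged guess at what such a simplified single-entry menu.lst looks like; the kernel version and root device come from this session, while the exact kernel command line (console=hvc0 and so on) is an assumption:

    default 0
    timeout 5

    title Ubuntu, kernel 5.4.0-121-generic
        root   (hd0)
        kernel /boot/vmlinuz-5.4.0-121-generic root=/dev/xvda1 ro console=hvc0
        initrd /boot/initrd.img-5.4.0-121-generic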
clarkb | maybe if that looks correct to the others we move aside the current menu.lst and copy that in place? | 02:42 |
corvus | (actually (and still just thinking ahead to pvgrub) should it be (hd0,1)?) | 02:42 |
fungi | the current menu.lst and the menu.lst_backup_by_grub2_prerm from a year ago both use (hd0) in the root entries | 02:42 |
clarkb | corvus: because it is xvda1 ? | 02:43 |
corvus | yeah i'm trying to remember whether the part arg is base 0 or 1 | 02:43 |
fungi | the example in the comment block shows a dual-boot scenario with windows on 0,0 and linux on 0,1 | 02:44 |
corvus | legacy is base0 for partitions | 02:44 |
fungi | and equates root=/dev/hda2 with (hd0,1) | 02:44 |
corvus | grub2 is base1 (but also has things like "(hd0, msdos1)") | 02:45 |
clarkb | I guess the other thing to consider is if we have to try both new and old kernel with the slimmed down menu.lst as it may be the new kernel that has problems too? But one step at a time | 02:46 |
corvus | so clarkb's file lgtm except do we want to do 96 instead of 121? | 02:46 |
clarkb | jinx! | 02:46 |
corvus | :) | 02:46 |
clarkb | I'm happy to update to that version instead | 02:46 |
corvus | step1: simplified 121; step2: simplified 96; step3: chainload? that sounds good to me | 02:46 |
fungi | probably try 96 and if it works plan another window to retry an upgraded boot | 02:46 |
clarkb | ok updating | 02:47 |
corvus | ++ | 02:47 |
fungi | but i'm okay trying 121 first if people don't mind extending the unscheduled window | 02:47 |
clarkb | hows that | 02:47 |
clarkb | fungi: I'll let you put it in place since you have control of the unrescue | 02:47 |
corvus | lgtm | 02:48 |
corvus | i've logged out | 02:48 |
clarkb | I have too | 02:48 |
fungi | yeah, lgtm | 02:48 |
fungi | i've moved the current menu.lst into /mnt/boot/grub/menu.lst_broken_2022-06-24 | 02:50 |
fungi | and swapped in clarkb's | 02:50 |
fungi | unrescuing now | 02:50 |
fungi | went into error state after unrescuing | 02:52 |
fungi | i'll stop and start the server | 02:52 |
fungi | still staying in shutoff state | 02:52 |
fungi | shall i put it back into rescue? | 02:53 |
clarkb | ya I think the other thing to try is the chainload and we need rescue for that | 02:53 |
clarkb | and if that doesn't work maybe we need to see if rax will give us some error logs so that we can divine where it is failing? that old kernel worked before and hasn't changed which really makes me suspect the grub interaction not the kernel itself | 02:55 |
ianw | sorry stepped out for lunch at an exciting time; back now | 02:57 |
fungi | okay, i have everyone's keys back on the rescue root | 02:58 |
clarkb | I made another clarkb-menu.lst this time for chainloading (it uses hd0,0; can change to hd0 if we prefer) if that is what we want to try next | 02:59 |
fungi | lgtm, i can put that in place next and see what happens | 03:02 |
fungi | it's copied in | 03:02 |
clarkb | if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi <- just noticing that in grub.conf | 03:03 |
clarkb | makes me wonder if there are other things that need extracting | 03:03 |
fungi | looks like you probably have an open file handle on /mnt | 03:03 |
clarkb | yup I've dropped it | 03:03 |
clarkb | was looking at grub.conf and am done now | 03:04 |
fungi | umounted and unrescuing | 03:04 |
fungi | it went into an active state instead of error | 03:05 |
clarkb | it is pinging for me too | 03:06 |
clarkb | and I can log in | 03:06 |
fungi | novnc shows a login prompt | 03:06 |
clarkb | and it is running the kernel | 03:07 |
clarkb | wow so it was the grub stuff | 03:07 |
clarkb | the kernel was fine | 03:07 |
fungi | indeed | 03:07 |
* fungi grumbles | 03:07 | |
clarkb | if we can continue to chainload I think that is preferable fwiw, then we can use grub2 almost like everything else | 03:07 |
fungi | thanks for working that out! | 03:08 |
clarkb | but we should probably double check that menu.lst isn't updated the wrong way with the next kernel update | 03:08 |
fungi | yep | 03:08 |
corvus | i wonder if a kernel failure would leave us in an active state; so it showing up in error points toward "grub issues" | 03:08 |
clarkb | looks like 6 mailmanctls and some exim processes are running which implies to me that services started ok too | 03:08 |
fungi | `ps auxww|grep -c ^list` returns 54 | 03:09 |
fungi | (9*6) | 03:09 |
fungi | so that looks sane | 03:09 |
clarkb | if corvus isn't ready to call it a day, sending that held-up email through might be a good test? | 03:10 |
corvus | i released ianw's zuul-announce message | 03:10 |
clarkb | jinx again! :) | 03:10 |
corvus | jinx | 03:10 |
clarkb | I see it in my inbox too | 03:10 |
fungi | i've copied /boot/grub/menu.lst to /boot/grub/menu.lst_working_2022-06-24 just in case | 03:11 |
fungi | yep, i too received the ensure-pip announcement | 03:11 |
clarkb | ++ on making that copy | 03:12 |
clarkb | ok I'm going to pop off now as it seems like things are likely working | 03:12 |
fungi | yep thanks, and sorry about the fire drill! i anticipated that the reboot might not work, but did not expect to be unable to roll it back | 03:12 |
clarkb | ya not being able to rollback was definitely unexpected. I wonder what about the menu.lst we had before wasn't working. Could it be the (hd0)? | 03:13 |
clarkb | not the sort of thing I want to spend all day rescuing and unrescuing to bisect though | 03:13 |
clarkb | anyway good night! | 03:13 |
fungi | thanks again! | 03:13 |
corvus | g'night! | 03:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: grafana: import graphs and take screenshots https://review.opendev.org/c/openstack/project-config/+/847129 | 03:56 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: Test with project-config graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 04:00 |
*** undefined_ is now known as Guest3112 | 04:09 | |
*** akahat is now known as akahat|ruck | 04:39 | |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit docs: cleanup and use shell-session https://review.opendev.org/c/opendev/system-config/+/845066 | 05:38 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove https://review.opendev.org/c/opendev/system-config/+/845853 | 05:38 |
*** undefined_ is now known as Guest3115 | 05:51 | |
opendevreview | Merged opendev/system-config master: gerrit docs: cleanup and use shell-session https://review.opendev.org/c/opendev/system-config/+/845066 | 05:56 |
opendevreview | Merged opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove https://review.opendev.org/c/opendev/system-config/+/845853 | 06:03 |
*** arxcruz|rover is now known as arxcruz | 06:55 | |
*** jpena|off is now known as jpena | 07:06 | |
*** ysandeep is now known as ysandeep|afk | 07:11 | |
*** ysandeep|afk is now known as ysandeep | 09:00 | |
noonedeadpunk | hey there! I was wondering if you folks don't accidentally run ara-api somewhere for opendev? | 09:46 |
noonedeadpunk | As I bet that ara-report is part of why we get POST_FAILURES from one of the providers | 09:47 |
noonedeadpunk | Basically I downloaded 1 ara report for random job and got 4805 files worth of 179M | 09:48 |
noonedeadpunk | and am trying to search for ways to optimize that | 09:50 |
noonedeadpunk | seems like posting records to a remote API server is the most efficient way of all the available ones. But then looking at the demo deployment I'm not sure how to search for specific job results... | 09:51 |
noonedeadpunk | but it's probably more a question for dmsimard regarding how to mark a specific job/deployment to make it searchable and relate all the playbooks that were run within that job | 09:52 |
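For reference, a hedged sketch of the remote-API callback setup being described here; the server URL and labels are made up for illustration, and using ARA_DEFAULT_LABELS as the way to tag runs per job is an assumption about ara's settings:

    # enable the ara callback and point it at a remote API server instead of a local sqlite db
    export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
    export ARA_API_CLIENT=http
    export ARA_API_SERVER=https://ara.example.org             # hypothetical server
    export ARA_DEFAULT_LABELS="osa-deploy-job,build-1234abcd" # hypothetical labels to make runs searchable per job
    ansible-playbook site.yml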
*** rlandy|out is now known as rlandy | 10:21 | |
*** ysandeep is now known as ysandeep|brb | 11:01 | |
*** ysandeep|brb is now known as ysandeep | 11:18 | |
*** dviroel|out is now known as dviroel | 11:19 | |
akahat|ruck | Hello | 11:27 |
akahat|ruck | I'm seeing POST_FAILURES in the jobs | 11:28 |
akahat|ruck | https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=0 | 11:28 |
akahat|ruck | can someone tell me why this is happening? | 11:28 |
*** rlandy is now known as rlandy|dr_appt | 11:48 | |
fungi | noonedeadpunk: yes, we looked into the new centralized ara model, but if memory serves (it's been a while) it's not well-adapted to separate out test results, and it would be yet another service we'd need to care for and feed | 11:55 |
fungi | akahat|ruck: are you referring specifically to tripleo jobs? because the only post_failure result in the openstack tenant's gate pipeline today was for a tripleo-ci-centos-9-containers-multinode build roughly 5 hours ago | 11:58 |
noonedeadpunk | fungi: we have tons of them today fwiw | 11:59 |
fungi | noonedeadpunk: not in the gate pipeline | 11:59 |
fungi | though there were 6 yesterday, most of which were for openstack-ansible-deploy jobs | 12:00 |
noonedeadpunk | nah, in check | 12:00 |
noonedeadpunk | that's why I again started looking at how to reduce the pressure our logs put on swift | 12:00 |
fungi | harder to reason about post_failure results in check because there's no way to know that the change is ever able to pass its jobs, while in the gate pipeline it has at least succeeded once | 12:01 |
noonedeadpunk | and ara is an outstanding leader... | 12:01 |
noonedeadpunk | I bet we had some in gate as well | 12:01 |
noonedeadpunk | let me look for it | 12:02 |
noonedeadpunk | https://zuul.opendev.org/t/openstack/build/128f27979b424d4f825036f6301053df | 12:02 |
fungi | as soon as i get a little more coffee in me, i'll try to run down the causes in the executor logs | 12:03 |
noonedeadpunk | but I bet for us it's swift upload timeouts again | 12:03 |
fungi | probably, but i'm more curious to see if it's the same thing for the tripleo jobs or something different | 12:04 |
*** ysandeep is now known as ysandeep|afk | 12:12 | |
noonedeadpunk | yup, I just caught that in the console | 12:12 |
fungi | the upload timeout? | 12:14 |
Clark[m] | Note some tripleo jobs do more than half an hour of log uploads. It's possible they ride right along the edge of what times out and what doesn't | 12:14 |
Clark[m] | (we've noticed this when doing zuul executor restarts and waiting on that one last job to finish) | 12:15 |
fungi | yeah, you may be onto something with ara reports, they can create a massive number of files, so may significantly inflate swift upload times (since the files are uploaded one-by-one to the api) without exceeding the disk space limit we enforce on the workspace | 12:16 |
noonedeadpunk | Clark[m]: so we also see log uploads >30m now. But on average it takes <10m... | 12:16 |
noonedeadpunk | I'm not sure if tripleo _always_ uploads that long though | 12:16 |
Clark[m] | Ya I think tripleo long uploads are consistent. But if things slow down then they may be very likely to exceed timeouts | 12:17 |
fungi | i wonder if it's impacted by rtt? jobs run in na providers uploading to rackspace and ovh-bhs1 go quickly, jobs run in eu providers uploading to ovh-gra1 go quickly... | 12:17 |
noonedeadpunk | yah, ara spawns ridiculously many files | 12:17 |
Clark[m] | Yes, it was a regression and part of why we stopped running it for jobs by default | 12:18 |
fungi | so it could be the latency crossing the atlantic inflating the upload times to explain the 3x discrepancy you're observing | 12:18 |
Clark[m] | fungi: and that could be made worse by internet activity | 12:18 |
fungi | absolutely | 12:18 |
fungi | or choked peering points between backbone providers along a preferred route... | 12:19 |
fungi | the sorts of things that tend to silently (from our perspective) come and go | 12:19 |
noonedeadpunk | well, for projects that have ansible as a base, ara is quite an important source of information :( | 12:21 |
fungi | could you tar it up? | 12:24 |
Clark[m] | Or talk to dmsimard about adding the functionality back that was removed to store it in a SQLite db | 12:25 |
Clark[m] | Or use a different tool like zuul did | 12:25 |
Clark[m] | I wonder how hard it would be to point zuul's renderer at a different json log | 12:26 |
fungi | though i get the impression that the file-backed implementation in ara is considered "legacy" and the intent is that users put multiple runs in a single database now | 12:26 |
noonedeadpunk | fungi: well, then you need to download the tar locally for each job, unpack it, and browse locally | 12:26 |
fungi | noonedeadpunk: yes, that's exactly what i'm suggesting | 12:26 |
noonedeadpunk | Clark[m]: oh well we had some talk back then and it was hardly achievable unfortunately... | 12:26 |
fungi | i know it would be less convenient than consuming it directly over the web | 12:26 |
noonedeadpunk | perfect situation would be having ara-api :) | 12:27 |
fungi | have you tried feeding test results for multiple jobs into an ara-api server? | 12:27 |
noonedeadpunk | given that dmsimard can help out with implementing some filtering/tagging per job | 12:27 |
noonedeadpunk | yeah... dmsimard uploads kolla-ansible and ours results https://demo.recordsansible.org/ | 12:28 |
noonedeadpunk | quite impossible to understand where's what | 12:28 |
Clark[m] | I'm going to say upfront that I don't think we should deploy another service for this. There are alternatives like zuul's that work with minimal overhead and no additional service work on our part. Adding tons of little services like this has only bitten us down the road when it needs maintaining | 12:28 |
fungi | if you (or someone) wanted to run an ara-api service similar to how dpawlik is running a log indexing service for openstack, you could adjust jobs to submit results to that api | 12:28 |
fungi | but yes, i agree with Clark[m] that the bar to add it to opendev's current service set would be pretty high | 12:29 |
fungi | or you could do a scraper similar to how ci-log-processing is set up, to query zuul and pull archived ara datasets from the recorded job logs/artifacts and feed those into an ara-api instance asynchronously | 12:32 |
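A hedged sketch of that scraper model: the zuul builds API endpoint is real, but the job name is only an example and the layout of the archived ara data under each build's log_url is an assumption:

    # list recent builds for a job, then fetch each build's archived ara data from its log_url
    # and feed it into an ara-api instance out-of-band
    curl -s 'https://zuul.opendev.org/api/tenant/openstack/builds?job_name=openstack-ansible-deploy-aio-ubuntu-focal&limit=10' \
      | jq -r '.[] | "\(.uuid) \(.result) \(.log_url)"'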
noonedeadpunk | I think the biggest problem with leveraging zuul is that we need to run our "own" ansible with ansible... | 12:33 |
fungi | that service model has the added benefit that it wouldn't increase job runtime if the ara-api interface is slow or gets bogged down under extreme load | 12:33 |
noonedeadpunk | not sure I fully got this tbh... | 12:34 |
noonedeadpunk | probably because I have no idea about how ci-log-processing is set up | 12:35 |
Clark[m] | Re using zuul's renderer, yes I think a small amount of work would be necessary to update zuul to optionally look at a different Ansible json output file. | 12:35 |
fungi | ci-log-processing is the solution dpawlik developed to query the zuul builds table, retrieve build logs for each of the builds it's interested in, and then feed them into an opensearch backend | 12:36 |
Clark[m] | And maybe that doesn't want to live in zuul directly but get extracted into something you can include in your job logs. I'm not sure what the best approach is there. I just know that part of the reason this exists in zuul is this very issue with ara | 12:36 |
noonedeadpunk | fungi: ah, you mean if we host ara-api somewhere else I guess? | 12:36 |
fungi | right, exactly how the new opensearch is working | 12:37 |
fungi | and doing it asynchronously from the job running has the added benefit that it won't impact the jobs themselves if it stops working for some reason | 12:37 |
noonedeadpunk | Well, we can only host that within someone's company, and I don't really like to make project logs dependent on any company... | 12:37 |
noonedeadpunk | And I kind of like Clark[m] idea to leverage zuul renderer | 12:38 |
fungi | for ci-log-processing we provided a plain unconfigured server instance in opendev's control plane tenant on one of our donor providers and gave the admin of the service access to ssh into it | 12:38 |
noonedeadpunk | as it has basically everything we need | 12:39 |
fungi | but yes, reusing zuul's tast renderer would be awesome, maybe even possible to turn it into a library which zuul consumes in order to reduce the collective maintenance burden | 12:39 |
fungi | s/tast/task/ | 12:40 |
noonedeadpunk | it sounds simpler and quite useful overall | 12:40 |
fungi | and yeah, the idea is you take the ansible json output, and then interpret it browser-side with javascript | 12:41 |
fungi | though i do wonder if it would scale well performance-wise to a dataset as large as what openstack-ansible builds produce | 12:41 |
frickler | do we want to limit log item count in addition to log volume on the executor side? not sure though what a reasonable limit would look like | 12:42 |
noonedeadpunk | that is a good question; if it's done in the browser I totally can see where things can go south... | 12:42 |
* dpawlik reading | 12:42 | |
fungi | frickler: i was thinking the same thing, it's harder to know what a sane file limit might be, but also the reason we limit log size has more to do with not filling up the executor's disk and less to do with making sure builds will be able to upload results reliably | 12:43 |
fungi | noonedeadpunk: though maybe there's a reasonable way to shard by play or something | 12:43 |
Clark[m] | The disk limit is more to protect the executor than to ensure jobs pass. I think we have much less concern about inodes | 12:44 |
fungi | noonedeadpunk: and have it only fetch the json for the play if the user expands it (you'd be trading responsiveness to user interaction in that case i guess) | 12:44 |
frickler | fungi: well we also have an inode limit on /var/lib/zuul. though for example we are at 22% inodes on ze01 currently vs. 33% space | 12:44 |
Clark[m] | fungi: noonedeadpunk: you should be able to test it via web browser debugging tools and changing the json path location to load real osa data | 12:45 |
frickler | but in theory a job with a huge number of tiny files could likely exhaust inodes | 12:45 |
noonedeadpunk | fungi: yes, I should totally look at how that's done. I'm not that good in JS but hope I can figure out something | 12:45 |
fungi | yes, i buy the inode argument. we could set a file count governor to limit to a percentage of our inode capacity similar to the percentage of our block capacity | 12:45 |
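A hedged sketch of checking that headroom on an executor; the mount point matches what frickler quotes above, while the per-build path is an assumption about the executor's work dir layout:

    df -i /var/lib/zuul                                    # inode usage (the 22% figure)
    df -h /var/lib/zuul                                    # block usage (the 33% figure)
    # roughly how many inodes a single build's staged logs would consume
    find /var/lib/zuul/builds/<build-uuid>/work/logs -type f | wc -l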
Clark[m] | frickler: ok so less inside concern but not much less as I thought | 12:45 |
Clark[m] | *less inode. Mobile keyboards try to be too helpful sometimes | 12:46 |
noonedeadpunk | ok, thanks for ideas! | 12:46 |
fungi | i need to step away from the keyboard for a few minutes, but will be back shortly and try to dig into today's tripleo post_failure build | 12:46 |
noonedeadpunk | I have smth to work on now :) | 12:46 |
*** rlandy|dr_appt is now known as rlandy | 12:51 | |
*** ysandeep|afk is now known as ysandeep | 12:58 | |
*** undefined__ is now known as rcastillo | 13:08 | |
*** undefined_ is now known as rcastillo | 13:09 | |
noonedeadpunk | hm, there's also elastic callback plugin to log directly into elasticsearch.... | 13:11 |
*** dasm|off is now known as dasm | 13:33 | |
*** ysandeep is now known as ysandeep|out | 15:00 | |
*** dviroel is now known as dviroel|lunch | 15:19 | |
clarkb | noonedeadpunk: that would require direct access to ES and that isn't available with the current setup aiui. But even if you had that our experience has shown that it is a good idea to decouple jobs from writing to ES as that can be quite slow and occasionally have errors. Also you probably need something to render the data out of ES once it is there | 15:55 |
*** jpena is now known as jpena|off | 16:02 | |
fungi | clarkb: revisiting the lists.o.o outage and grub chainloading solution from earlier, does that mean we can get away with no longer decompressing the kernels and letting grub do it in the second stage? | 16:02 |
fungi | obviously it's something we would test in a scheduled window, but curious if it simplifies further kernel upgrades | 16:03 |
clarkb | fungi: I think it does open that possibility. However, if you look in grub.conf there was that if condition for xen where it adds grub modules for things like xz; I think maybe we have to add lz4 too? | 16:03 |
clarkb | because xen grub is reading our basic menu.lst which points to the grub2 shim loader elf which xen can read. Then that loads grub2 proper aiui which in theory can do the lz4 | 16:04 |
clarkb | that said I know I tested this before, but I think I was testing it with a compressed kernel and it didn't work then. So there is likely something like adding the extra mod that we need to do? | 16:05 |
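A hedged sketch for checking which decompressor a stock kernel image actually needs, so the right insmod could sit next to the xzio/lzopio lines quoted above. The magic byte sequences are the ones the kernel's own extract-vmlinux script searches for, and this has to be run against a freshly installed vmlinuz rather than the already-extracted copy currently in /boot:

    # prints a non-zero count if the corresponding compression magic appears in the image
    LC_ALL=C grep -c -a $'\x02\x21\x4c\x18' /boot/vmlinuz-5.4.0-121-generic       # lz4 (legacy) magic
    LC_ALL=C grep -c -a $'\xfd\x37\x7a\x58\x5a' /boot/vmlinuz-5.4.0-121-generic   # xz magic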
fungi | yeah, needing to add another module in the second stage grub config wouldn't surprise me | 16:11 |
fungi | akahat|ruck: i looked at the executor's debug log for https://zuul.opendev.org/t/openstack/build/ffb35b4d680d48a8bd21b1964964019d and it seems superficially similar to the problem noonedeadpunk has been running down in the openstack-ansible-deploy jobs... TASK [upload-logs-swift : Upload logs to swift] ends in "Ansible timeout exceeded: 3600" | 16:15 |
fungi | i wonder how trivial it would be to put another task just before that which says how many files will be uploaded, so we can start to get a feel for what the upper bound is on these | 16:16 |
fungi | and whether it's related to file count at all (even if that's only one of the variables feeding into the problem) | 16:16 |
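A hedged sketch of the kind of task fungi means, placed just before the upload in the post playbook; zuul.executor.log_root is the standard variable for the staged logs, but everything else here is illustrative rather than the actual zuul-jobs change:

    - name: Count files staged for log upload
      command: "find {{ zuul.executor.log_root }} -type f"
      register: staged_log_files

    - name: Report staged log file count
      debug:
        msg: "{{ staged_log_files.stdout_lines | length }} files staged for upload"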
clarkb | when successful they are all listed already. This doesn't help if the problem is build specific due to some exceptional case, but I don't think that is the tripleo situation. They already log excessively and could probably take a critical eye to that | 16:17 |
clarkb | using known information without changing anything about the jobs to collect more info | 16:17 |
*** dviroel|lunch is now known as dviroel | 16:18 | |
fungi | yeah, looking at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py i think we're relying on the sdk to explode paths/globs anyway, so there would be a bit of code involved to guess how the sdk will enumerate the input | 16:21 |
fungi | but yes, my thought was to be able to tell whether the builds hitting upload timeouts had significantly more files/chunks/whatever than their successful counterparts | 16:22 |
clarkb | oh interesting, they use their own log collection playbook, so the one we provide via zuul-jobs (which logs better) doesn't log anything :( | 16:23 |
clarkb | so the response to my first suggestion is that it isn't as easy as I first thought because tripleo is special | 16:24 |
fungi | another thought would be to split the uploading into two separate tasks, the first only uploading the console log, console json, and zuul info files | 16:24 |
fungi | that way if the problem is uploading way too many log files, at least there's the basic logs and a working result dashboard | 16:24 |
clarkb | oh interesting I like that | 16:25 |
fungi | it does mean more than one upload task, but hopefully the overhead of splitting it that way would be small | 16:26 |
fungi | however i'm not sure how to untangle it since a lot of that logic is in zuul-jobs | 16:27 |
clarkb | and for many jobs they don't log any more than that | 16:27 |
clarkb | I think the swift-upload-logs would have to be run against two different sets of inputs | 16:27 |
clarkb | so we may have to write those zuul specific log files to a new location and upload that then upload the normal logs afterwards from the existing location (this would be most backward compatible) | 16:28 |
fungi | or the upload-logs role could grow a priority list, uploading only those matches in the first task and then excluding them from the list expanded in the second task | 16:28 |
fungi | that might be unnecessarily complicated though | 16:28 |
clarkb | fungi: looking at tripleo logs one thing that they have is fairly deep dir structures and each dir is a swift object | 16:30 |
clarkb | several actually as you need the index too | 16:30 |
clarkb | they may see improvements flattening the log structure | 16:30 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html for example | 16:30 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/skyline-console/docs/zh/develop/images/form/index.html | 16:30 |
clarkb | there are also a couple of places where they seem to upload the same files | 16:31 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/tmp/dnf-zuul-gcsew2fm/index.html | 16:32 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/usr/tmp/dnf-zuul-gcsew2fm/index.html | 16:32 |
clarkb | that is unnecessary duplication which can be trimmed | 16:32 |
clarkb | I'm also not sure we need to copy what feels like all of /etc https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/etc/index.html | 16:33 |
fungi | akahat|ruck: ^ see above for some actionable suggestions which may help | 16:33 |
clarkb | much of what is in /etc is consistent job to job because it is either distro defaults or because we set it the same way via job configs or dib for all jobs | 16:34 |
clarkb | you could selectively log information if it becomes necessary rather than uploading, in every job, things that no one ever looks at | 16:34 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/workspace/.quickstart/usr/local/share/ansible/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html thats a much deeper nesting example | 16:35 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/src/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html seems to be a duplicate with one of the /opt/git/ paths above too so more | 16:36 |
clarkb | duplicates to clean up | 16:36 |
clarkb | I suspect that things like https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/environments/index.html are known or can be generated locally too? Not sure we need to log that? I mean we don't log | 16:38 |
clarkb | all our ansible that we are about to run. We know that it matches what is in our repo. But I would start with things like deep nesting, duplicates, and excessive /etc content that never changes | 16:38 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/tools/index.html looks like another good trim | 16:49 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/lib/neutron/.cache/python-entrypoints/index.html anything like that too | 16:50 |
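A hedged sketch of auditing a downloaded copy of a job's log tree for the problems pointed out above:

    # deepest directory paths (every level adds more swift objects, including the generated indexes)
    find logs/ -type d | awk -F/ '{print NF, $0}' | sort -rn | head
    # identical files stored more than once (candidates for de-duplication)
    find logs/ -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate | head -40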
akahat|ruck | fungi, clarkb thank you for taking a look into this. | 17:52 |
akahat|ruck | TripleO collects lots of logs and there is some duplication also. | 17:52 |
akahat|ruck | We are also not sure about the file count.. adding a file count +1. | 17:53 |
fungi | and that seems to be resulting in an increasing number of job failures, so needs to be reined in | 17:53 |
akahat|ruck | and also i liked your idea about collect logs, where we can add log playbooks which collect specific files from the host. | 17:53 |
fungi | we can get file counts for the successful jobs, but we don't know what the file counts might be for the failing ones which are unable to upload all their logs before zuul gets bored of waiting and cuts them off | 17:54 |
akahat|ruck | +1 for uploading console log, json and zuul info files at first and rest later | 17:54 |
fungi | if you would like to help with improving the upload-logs roles in zuul-jobs that would be awesome, but at least please look into trimming the fat in tripleo's job log collection | 17:55 |
akahat|ruck | fungi, yeah.. sure. I'll discuss this topic with my team and will check what specific files we can collect. | 17:57 |
clarkb | keep in mind the nested makes each dir level count as another "file" | 17:57 |
clarkb | s/nested/nesting/ | 17:57 |
clarkb | so reducing nesting can also help | 17:57 |
akahat|ruck | clarkb, noted. | 17:59 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 18:51 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 18:56 |
*** dviroel is now known as dviroel|out | 20:50 | |
opendevreview | Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments https://review.opendev.org/c/openstack/project-config/+/847621 | 21:45 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:46 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:47 |
opendevreview | Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments https://review.opendev.org/c/openstack/project-config/+/847621 | 21:56 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:56 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:56 |