BlaisePabon[m] | How much work are we talking about? | 00:38 |
---|---|---|
BlaisePabon[m] | I have a fedora nested kvm server with 20 cores sitting in my closet? | 00:38 |
BlaisePabon[m] | You're welcome to throw jobs at it. | 00:38 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: [wip] test real import of graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 00:38 |
ianw | BlaisePabon[m]: we do obviously gratefully accept resources. but it's also non-zero cost to setup and maintain them, so we have to find the balance | 00:39 |
ianw | we have a control-plane and a CI tenant, in the control plane we run a cloud-local mirror (so CI nodes don't have to go over the network). | 00:40 |
ianw | the best situation is where somebody is motivated to maintain the cloud for other reasons ($$$$, probably :) and provides us resources. so we help develop the software they use, and they contribute resources in return. everyone wins when it works out | 00:42 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: [wip] test real import of graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 00:45 |
fungi | i'm going to perform a controlled reboot of lists.o.o now to make sure the latest unpacked kernel is viable | 01:29 |
*** rlandy|bbl is now known as rlandy | 01:33 | |
fungi | ugh, it's going straight into shutoff state | 01:37 |
fungi | i'm going to have to perform a rescue boot with it | 01:39 |
*** rlandy is now known as rlandy|out | 01:40 | |
ianw | :/ do you think it's the kernel or something else that's happened in the install? | 01:41 |
Clark[m] | In the past it's been either xen can't find the kernel or the kernel wasn't extracted properly | 01:41 |
Clark[m] | You should be able to edit the menu.lst to put an old kernel back ? | 01:42 |
clarkb | fungi: it was the vmlinuz that you extracted right? that's the file that needs to be uncompressed. Otherwise I got nothing | 01:48 |
fungi | yeah, i edited the menu.lst and moved the entry for the kernel we were previously booted on to the top of the list, then unrescued, but it still immediately goes to shutoff state when i try to start it | 01:49 |
fungi | it just picks the first one listed, right? | 01:50 |
clarkb | yes that was what it seemed to do in the past | 01:50 |
clarkb | I wonder if it is looking at grub.conf instead? | 01:50 |
clarkb | though I thought for us to do that we had to put an entry in menu.lst to boot the shim instead of the normal vmlinuz to chain load grub | 01:50 |
clarkb | and I think I tried to get that to work back when and couldn't get it to happen | 01:51 |
fungi | i'll try rolling it back in grub.cfg, yeah | 01:55 |
clarkb | it's a lot more complicated there unfortunately | 01:56 |
clarkb | I wonder if the console log shows anything that might indicate which kernel it is finding and attempting? | 01:57 |
clarkb | (that seems like a long shot but may offer clues if it does) | 01:57 |
*** ysandeep|out is now known as ysandeep | 01:59 | |
corvus | oh hi | 01:59 |
clarkb | hello | 01:59 |
corvus | i was just noticing mailing list accessibility probs, and i see you're looking at it... catching up now. | 01:59 |
fungi | just trying to test a reboot on a new unpacked kernel when it's less likely to impact anyone. glad i didn't do it at peak time | 02:00 |
clarkb | corvus: fungi was doing a controlled reboot to bump the kernel after extracting it (as is necessary because xen), but something is unhappy with that; almost certainly xen finding the unpacked kernel is the problem | 02:00 |
corvus | anything another set of eyes/hands can do? | 02:01 |
fungi | unfortunately switching the primary profile in grub.cfg to the old kernel doesn't seem to have solved it either | 02:02 |
clarkb | I'm not sure. I'm still feeling pretty blah and not firing all cylinders so will defer to fungi on that | 02:02 |
fungi | i'll see if i can get a console log out of it, but my luck with getting boot console output in rackspace has not been good | 02:03 |
clarkb | fungi: what is the suffix of the old kernel? -96? | 02:03 |
fungi | yes | 02:03 |
fungi | the new one is 121 | 02:03 |
clarkb | ya that matches my memory from when I looked a few days ago | 02:03 |
corvus | i've got a console up in my browser | 02:05 |
corvus | i'll stand by and wait to see if fungi gets a working console; otherwise, we can reboot it and i can watch it for clues | 02:06 |
clarkb | https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy is the chainloading thing I mentioned earlier should we want to try that | 02:07 |
clarkb | basically you edit menu.lst so that xen's grub1 built in stuff chainloads grub2 which reads the grub2 configs | 02:07 |
clarkb | But if I remember correctly I was never able to get that to work as expected | 02:08 |
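A minimal sketch of the chainload approach clarkb describes, based on the PvGrub2 wiki page linked above. The /boot/xen-shim path assumes the distro's grub-xen packaging provides a grub2 PV image there, and the root spec may need to be (hd0) or (hd0,0) as debated below:

    # menu.lst read by xen's legacy pvgrub; its only job is to load grub2,
    # which then reads the normal grub.cfg on the filesystem
    default 0
    timeout 5

    title Chainload GRUB 2
        root   (hd0,0)
        kernel /boot/xen-shim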
fungi | i've put the configs back to what they were for the first reboot and unrescued | 02:08 |
corvus | hrm, console disconnected and i didn't get it back, so i guess the browser console doesn't work for this situation. | 02:10 |
clarkb | I think it may disconnect you when you reboot which makes this sort of debugging more difficult? | 02:10 |
fungi | i've started the server now | 02:11 |
corvus | web ui says it's in "error" state | 02:11 |
fungi | oh, right after unrescue it seems to go into error | 02:12 |
fungi | i'll stop and start it again, but previously it would immediately go into shutoff when i tried to start it again | 02:12 |
fungi | yeah, trying to start the server it remains in "shutoff" state according to the api | 02:13 |
clarkb | Just off the top of my head things to consider: double check that vmlinuz was the extracted file. Double check that reextracting it results in the same file. Double check the default entry in menu.lst is the first (0th) item? It's possible xen is doing more parsing of the content than simply finding the first entry. Try the chainload grub2 thing from above | 02:15 |
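A hedged sketch of what the first of those checks might look like from the rescue environment once the rootfs is mounted at /mnt (paths follow the ones used later in this session):

    # is the installed kernel still the extracted/uncompressed copy?
    file /mnt/boot/vmlinuz-5.4.0-121-generic
    sha256sum /mnt/boot/vmlinuz-5.4.0-121-generic /mnt/root/kernel-stuff/vmlinuz-5.4.0-121-generic.extracted
    # confirm the default entry really is the first (0th) title
    grep -n '^default' /mnt/boot/grub/menu.lst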
fungi | yeah, i think we're stuck blindly trying things from rescue mode | 02:15 |
fungi | i also shouldn't have tried this so late in my day, but i expected rolling back to the working kernel would be a viable way out if the new one wasn't working | 02:16 |
fungi | putting it back into a rescue boot again now | 02:17 |
clarkb | fungi: do you want to add my pubkey to the rescue node and get a second set of eyes on it? | 02:20 |
fungi | sure, just a sec | 02:21 |
fungi | i've added the public keys for you, corvus and ianw to the root account | 02:23 |
fungi | the server's rootfs is mounted at /mnt | 02:23 |
fungi | checksum of /mnt/root/kernel-stuff/vmlinuz-5.4.0-121-generic.extracted matches /mnt/boot/vmlinuz-5.4.0-121-generic which is roughly the same size as /mnt/boot/vmlinuz-5.4.0-96-generic | 02:24 |
clarkb | menu.lst default is 0 so that is confirmed that the first entry should be used | 02:28 |
fungi | and /mnt/boot/vmlinuz-5.4.0-96-generic still has the same checksum as the extracted copy in kernel-stuff and has a january last modified date | 02:32 |
clarkb | ya I'm thinking it likely isn't an issue with the kernels but with finding them if that makes sense | 02:35 |
clarkb | since the old kernel should still boot if it is findable | 02:35 |
clarkb | talking out loud here: maybe we try a super simplified menu.lst to remove possibility that xen is being confused by something else in the file. And if that doesn't work try to chainload also using a super simplified version of the file? | 02:37 |
corvus | menu.lst_backup_by_grub2_prerm is interesting; it looks very similar to menu.lst | 02:37 |
corvus | if that's a legit old working menu.lst, it would suggest that nothing significant changed other than version numbers | 02:38 |
corvus | regardless, i do like clarkb's plan | 02:38 |
corvus | step 1: menu.lst with only the old kernel; step 2 https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy | 02:38 |
fungi | yes, that sounds solid | 02:39 |
corvus | i think (hd0,0) should be right for us (assuming those menu.lst files are roughly correct) | 02:39 |
clarkb | corvus: I think it may just be hd0? | 02:39 |
clarkb | our existing and old menu.lsts all use hd0 not hd0,0 at least | 02:39 |
fungi | though also that may be irrelevant if xen is reading the fs anyway | 02:40 |
corvus | well, the pvgrub example has a partition... and our device has a partition table and / is the first partition, so i'd assume (hd0,0) for that... | 02:41 |
clarkb | corvus: good point | 02:41 |
clarkb | /root/clarkb-menu.lst on the rescue side has a simplified version of what I think it may want to look like, though that still uses (hd0) | 02:41 |
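A hedged guess at what such a simplified single-entry menu.lst looks like; the kernel version and root device come from this session, while the exact kernel command line (console=hvc0 and so on) is an assumption:

    default 0
    timeout 5

    title Ubuntu, kernel 5.4.0-121-generic
        root   (hd0)
        kernel /boot/vmlinuz-5.4.0-121-generic root=/dev/xvda1 ro console=hvc0
        initrd /boot/initrd.img-5.4.0-121-generic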
clarkb | maybe if that looks correct to the others we move aside the current menu.lst and copy that in place? | 02:42 |
corvus | (actually (and still just thinking ahead to pvgrub) should it be (hd0,1)?) | 02:42 |
fungi | the current menu.lst and the menu.lst_backup_by_grub2_prerm from a year ago both use (hd0) in the root entries | 02:42 |
clarkb | corvus: because it is xvda1 ? | 02:43 |
corvus | yeah i'm trying to remember whether the part arg is base 0 or 1 | 02:43 |
fungi | the example in the comment block shows a dual-boot scenario with windows on 0,0 and linux on 0,1 | 02:44 |
corvus | legacy is base0 for partitions | 02:44 |
fungi | and equates root=/dev/hda2 with (hd0,1) | 02:44 |
corvus | grub2 is base1 (but also has things like "(hd0, msdos1)") | 02:45 |
clarkb | I guess the other thing to consider is if we have to try both new and old kernel with the slimmed down menu.lst as it may be the new kernel that has problems too? But one step at a time | 02:46 |
corvus | so clarkb's file lgtm except do we want to do 96 instead of 121? | 02:46 |
clarkb | jinx! | 02:46 |
corvus | :) | 02:46 |
clarkb | I'm happy to update to that version instead | 02:46 |
corvus | step1: simplified 121; step2: simplified 96; step3: chainload? that sounds good to me | 02:46 |
fungi | probably try 96 and if it works plan another window to retry an upgraded boot | 02:46 |
clarkb | ok updating | 02:47 |
corvus | ++ | 02:47 |
fungi | but i'm okay trying 121 first if people don't mind extending the unscheduled window | 02:47 |
clarkb | hows that | 02:47 |
clarkb | fungi: I'll let you put it in place since you have control of the unrescue | 02:47 |
corvus | lgtm | 02:48 |
corvus | i've logged out | 02:48 |
clarkb | I have too | 02:48 |
fungi | yeah, lgtm | 02:48 |
fungi | i've moved the current menu.lst into /mnt/boot/grub/menu.lst_broken_2022-06-24 | 02:50 |
fungi | and swapped in clarkb's | 02:50 |
fungi | unrescuing now | 02:50 |
fungi | went into error state after unrescuing | 02:52 |
fungi | i'll stop and start the server | 02:52 |
fungi | still staying in shutoff state | 02:52 |
fungi | shall i put it back into rescue? | 02:53 |
clarkb | ya I think the other thing to try is the chainload and we need rescue for that | 02:53 |
clarkb | and if that doesn't work maybe we need to see if rax will give us some error logs so that we can divine where it is failing? that old kernel worked before and hasn't changed which really makes me suspect the grub interaction not the kernel itself | 02:55 |
ianw | sorry stepped out for lunch at an exciting time; back now | 02:57 |
fungi | okay, i have everyone's keys back on the rescue root | 02:58 |
clarkb | I made another clarkb-menu.lst this time for chainloading (it uses hd0,0; can change to hd0 if we prefer) if that is what we want to try next | 02:59 |
fungi | lgtm, i can put that in place next and see what happens | 03:02 |
fungi | it's copied in | 03:02 |
clarkb | if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi <- just noticing that in grub.conf | 03:03 |
clarkb | makes me wonder if there are other things that need extracting | 03:03 |
fungi | looks like you probably have an open file handle on /mnt | 03:03 |
clarkb | yup I've dropped it | 03:03 |
clarkb | was looking at grub.conf and am done now | 03:04 |
fungi | umounted and unrescuing | 03:04 |
fungi | it went into an active state instead of error | 03:05 |
clarkb | it is pinging for me too | 03:06 |
clarkb | and I can log in | 03:06 |
fungi | novnc shows a login prompt | 03:06 |
clarkb | and it is running the kernel | 03:07 |
clarkb | wow so it was the grub stuff | 03:07 |
clarkb | the kernel was fine | 03:07 |
fungi | indeed | 03:07 |
* fungi grumbles | 03:07 | |
clarkb | if we can continue to chainload I think that is preferable fwiw, then we can use grub2 almost like everything else | 03:07 |
fungi | thanks for working that out! | 03:08 |
clarkb | but we should probably double check that menu.lst isn't updated the wrong way with the next kernel update | 03:08 |
fungi | yep | 03:08 |
corvus | i wonder if a kernel failure would leave us in an active state; so it showing up in error points toward "grub issues" | 03:08 |
clarkb | looks like 6 mailmanctls and some exim processes are running which implies to me that services started ok too | 03:08 |
fungi | `ps auxww|grep -c ^list` returns 54 | 03:09 |
fungi | (9*6) | 03:09 |
fungi | so that looks sane | 03:09 |
clarkb | if corvus isn't ready to call it a day, sending that held-up email through might be a good test? | 03:10 |
corvus | i released ianw's zuul-announce message | 03:10 |
clarkb | jinx again! :) | 03:10 |
corvus | jinx | 03:10 |
clarkb | I see it in my inbox too | 03:10 |
fungi | i've copied /boot/grub/menu.lst to /boot/grub/menu.lst_working_2022-06-24 just in case | 03:11 |
fungi | yep, i too received the ensure-pip announcement | 03:11 |
clarkb | ++ on making that copy | 03:12 |
clarkb | ok I'm going to pop off now as it seems like things are likely working | 03:12 |
fungi | yep thanks, and sorry about the fire drill! i anticipated that the reboot might not work, but did not expect to be unable to roll it back | 03:12 |
clarkb | ya not being able to rollback was definitely unexpected. I wonder what about the menu.lst we had before wasn't working. Could it be the (hd0)? | 03:13 |
clarkb | not the sort of thing I want to spend all day rescuing and unrescuing to bisect though | 03:13 |
clarkb | anyway good night! | 03:13 |
fungi | thanks again! | 03:13 |
corvus | g'night! | 03:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: grafana: import graphs and take screenshots https://review.opendev.org/c/openstack/project-config/+/847129 | 03:56 |
opendevreview | Ian Wienand proposed opendev/grafyaml master: Test with project-config graphs https://review.opendev.org/c/opendev/grafyaml/+/847421 | 04:00 |
*** undefined_ is now known as Guest3112 | 04:09 | |
*** akahat is now known as akahat|ruck | 04:39 | |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit docs: cleanup and use shell-session https://review.opendev.org/c/opendev/system-config/+/845066 | 05:38 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove https://review.opendev.org/c/opendev/system-config/+/845853 | 05:38 |
*** undefined_ is now known as Guest3115 | 05:51 | |
opendevreview | Merged opendev/system-config master: gerrit docs: cleanup and use shell-session https://review.opendev.org/c/opendev/system-config/+/845066 | 05:56 |
opendevreview | Merged opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove https://review.opendev.org/c/opendev/system-config/+/845853 | 06:03 |
*** arxcruz|rover is now known as arxcruz | 06:55 | |
*** jpena|off is now known as jpena | 07:06 | |
*** ysandeep is now known as ysandeep|afk | 07:11 | |
*** ysandeep|afk is now known as ysandeep | 09:00 | |
noonedeadpunk | hey there! I was wondering if you folks don't accidentally run ara-api somewhere for opendev? | 09:46 |
noonedeadpunk | As I bet that ara-report is part of why we get POST_FAILURES from one of the providers | 09:47 |
noonedeadpunk | Basically I downloaded 1 ara report for random job and got 4805 files worth of 179M | 09:48 |
noonedeadpunk | and am trying to search for ways to optimize that | 09:50 |
noonedeadpunk | seems like posting records to a remote API server is the most efficient way of all the available ones. But then looking at the demo deployment I'm not sure how to search for specific job results... | 09:51 |
noonedeadpunk | but it's probably more a question for dmsimard regarding how to mark a specific job/deployment to make it searchable and relate all the playbooks that were run within that job | 09:52 |
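For reference, a hedged sketch of the remote-API callback setup being described here; the server URL and labels are made up for illustration, and using ARA_DEFAULT_LABELS as the way to tag runs per job is an assumption about ara's settings:

    # enable the ara callback and point it at a remote API server instead of a local sqlite db
    export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
    export ARA_API_CLIENT=http
    export ARA_API_SERVER=https://ara.example.org             # hypothetical server
    export ARA_DEFAULT_LABELS="osa-deploy-job,build-1234abcd" # hypothetical labels to make runs searchable per job
    ansible-playbook site.yml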
*** rlandy|out is now known as rlandy | 10:21 | |
*** ysandeep is now known as ysandeep|brb | 11:01 | |
*** ysandeep|brb is now known as ysandeep | 11:18 | |
*** dviroel|out is now known as dviroel | 11:19 | |
akahat|ruck | Hello | 11:27 |
akahat|ruck | I'm seeing POST_FAILURES in the jobs | 11:28 |
akahat|ruck | https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=0 | 11:28 |
akahat|ruck | can someone tell me why this is happening? | 11:28 |
*** rlandy is now known as rlandy|dr_appt | 11:48 | |
fungi | noonedeadpunk: yes, we looked into the new centralized ara model, but if memory serves (it's been a while) it's not well-adapted to separate out test results, and it would be yet another service we'd need to care for and feed | 11:55 |
fungi | akahat|ruck: are you referring specifically to tripleo jobs? because the only post_failure result in the openstack tenant's gate pipeline today was for a tripleo-ci-centos-9-containers-multinode build roughly 5 hours ago | 11:58 |
noonedeadpunk | fungi: we have tons of them today fwiw | 11:59 |
fungi | noonedeadpunk: not in the gate pipeline | 11:59 |
fungi | though there were 6 yesterday, most of which were for openstack-ansible-deploy jobs | 12:00 |
noonedeadpunk | nah, in check | 12:00 |
noonedeadpunk | that's why I again started looking at how to reduce the pressure our logs put on swift | 12:00 |
fungi | harder to reason about post_failure results in check because there's no way to know that the change is ever able to pass its jobs, while in the gate pipeline it has at least succeeded once | 12:01 |
noonedeadpunk | and ara is an outstanding leader... | 12:01 |
noonedeadpunk | I bet we had some in gate as well | 12:01 |
noonedeadpunk | let me look for it | 12:02 |
noonedeadpunk | https://zuul.opendev.org/t/openstack/build/128f27979b424d4f825036f6301053df | 12:02 |
fungi | as soon as i get a little more coffee in me, i'll try to run down the causes in the executor logs | 12:03 |
noonedeadpunk | but I bet for us it's swift upload timeouts again | 12:03 |
fungi | probably, but i'm more curious to see if it's the same thing for the tripleo jobs or something different | 12:04 |
*** ysandeep is now known as ysandeep|afk | 12:12 | |
noonedeadpunk | yup, I just caught that in the console | 12:12 |
fungi | the upload timeout? | 12:14 |
Clark[m] | Note some tripleo jobs do more than half an hour of log uploads. It's possible they ride right along the edge of what times out and what doesn't | 12:14 |
Clark[m] | (we've noticed this when doing zuul executor restarts and waiting on that one last job to finish) | 12:15 |
fungi | yeah, you may be onto something with ara reports, they can create a massive number of files, so may significantly inflate swift upload times (since the files are uploaded one-by-one to the api) without exceeding the disk space limit we enforce on the workspace | 12:16 |
noonedeadpunk | Clark[m]: so we also see log uploads >30m now. But on average it takes <10m... | 12:16 |
noonedeadpunk | I'm not sure if tripleo _always_ uploads that long though | 12:16 |
Clark[m] | Ya I think tripleo long uploads are consistent. But if things slow down then they may be very likely to exceed timeouts | 12:17 |
fungi | i wonder if it's impacted by rtt? jobs run in na providers uploading to rackspace and ovh-bhs1 go quickly, jobs run in eu providers uploading to ovh-gra1 go quickly... | 12:17 |
noonedeadpunk | yah, ara spawns ridiculously many files | 12:17 |
Clark[m] | Yes, it was a regression and part of why we stopped running it for jobs by default | 12:18 |
fungi | so it could be the latency crossing the atlantic inflating the upload times to explain the 3x discrepancy you're observing | 12:18 |
Clark[m] | fungi: and that could be made worse by internet activity | 12:18 |
fungi | absolutely | 12:18 |
fungi | or choked peering points between backbone providers along a preferred route... | 12:19 |
fungi | the sorts of things that tend to silently (from our perspective) come and go | 12:19 |
noonedeadpunk | well, for projects that have ansible as a base, ara is quite an important source of information :( | 12:21 |
fungi | could you tar it up? | 12:24 |
Clark[m] | Or talk to dmsimard about adding the functionality back that was removed to store it in a SQLite db | 12:25 |
Clark[m] | Or use a different tool like zuul did | 12:25 |
Clark[m] | I wonder how hard it would be to point zuul's renderer at a different json log | 12:26 |
fungi | though i get the impression that the file-backed implementation in ara is considered "legacy" and the intent is that users put multiple runs in a single database now | 12:26 |
noonedeadpunk | fungi: well, then you need to download the tar locally for each job, unpack it, and browse locally | 12:26 |
fungi | noonedeadpunk: yes, that's exactly what i'm suggesting | 12:26 |
noonedeadpunk | Clark[m]: oh well we had some talk back then and it was hardly achievable unfortunately... | 12:26 |
fungi | i know it would be less convenient than consuming it directly over the web | 12:26 |
noonedeadpunk | perfect situation would be having ara-api :) | 12:27 |
fungi | have you tried feeding test results for multiple jobs into an ara-api server? | 12:27 |
noonedeadpunk | given that dmsimard can help out with implementing some filtering/tagging per job | 12:27 |
noonedeadpunk | yeah... dmsimard uploads kolla-ansible and ours results https://demo.recordsansible.org/ | 12:28 |
noonedeadpunk | quite impossible to understand where's what | 12:28 |
Clark[m] | I'm going to say upfront that I don't think we should deploy another service for this. There are alternatives like zuul's that work with minimal overhead and no additional service work on our part. Adding tons of little services like this has only bitten us down the road when it needs maintaining | 12:28 |
fungi | if you (or someone) wanted to run an ara-api service similar to how dpawlik is running a log indexing service for openstack, you could adjust jobs to submit results to that api | 12:28 |
fungi | but yes, i agree with Clark[m] that the bar to add it to opendev's current service set would be pretty high | 12:29 |
fungi | or you could do a scraper similar to how ci-log-processing is set up, to query zuul and pull archived ara datasets from the recorded job logs/artifacts and feed those into an ara-api instance asynchronously | 12:32 |
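A hedged sketch of that scraper model: the zuul builds API endpoint is real, but the job name is only an example and the layout of the archived ara data under each build's log_url is an assumption:

    # list recent builds for a job, then fetch each build's archived ara data from its log_url
    # and feed it into an ara-api instance out-of-band
    curl -s 'https://zuul.opendev.org/api/tenant/openstack/builds?job_name=openstack-ansible-deploy-aio-ubuntu-focal&limit=10' \
      | jq -r '.[] | "\(.uuid) \(.result) \(.log_url)"'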
noonedeadpunk | I think the biggest problem with leveraging zuul is that we need to run our "own" ansible with ansible... | 12:33 |
fungi | that service model has the added benefit that it wouldn't increase job runtime if the ara-api interface is slow or gets bogged down under extreme load | 12:33 |
noonedeadpunk | not sure I fully got this tbh... | 12:34 |
noonedeadpunk | probably because I have no idea about how ci-log-processing is set up | 12:35 |
Clark[m] | Re using zuul's renderer, yes I think a small amount of work would be necessary to update zuul to optionally look at a different Ansible json output file. | 12:35 |
fungi | ci-log-processing is the solution dpawlik developed to query the zuul builds table, retrieve build logs for each of the builds it's interested in, and then feed them into an opensearch backend | 12:36 |
Clark[m] | And maybe that doesn't want to live in zuul directly but get extracted into something you can include in your job logs. I'm not sure what the best approach is there. I just know that part of the reason this exists in zuul is this very issue with ara | 12:36 |
noonedeadpunk | fungi: ah, you mean if we host ara-api somewhere else I guess? | 12:36 |
fungi | right, exactly how the new opensearch is working | 12:37 |
fungi | and doing it asynchronously from the job running has the added benefit that it won't impact the jobs themselves if it stops working for some reason | 12:37 |
noonedeadpunk | Well, we can only host that within someone's company, and I don't really like to make project logs dependent on any company... | 12:37 |
noonedeadpunk | And I kind of like Clark[m] idea to leverage zuul renderer | 12:38 |
fungi | for ci-log-processing we provided a plain unconfigured server instance in opendev's control plane tenant on one of our donor providers and gave the admin of the service access to ssh into it | 12:38 |
noonedeadpunk | as it has basically everything we need | 12:39 |
fungi | but yes, reusing zuul's tast renderer would be awesome, maybe even possible to turn it into a library which zuul consumes in order to reduce the collective maintenance burden | 12:39 |
fungi | s/tast/task/ | 12:40 |
noonedeadpunk | it sounds simpler and quite useful overall | 12:40 |
fungi | and yeah, the idea is you take the ansible json output, and then interpret it browser-side with javascript | 12:41 |
fungi | though i do wonder if it would scale well performance-wise to a dataset as large as what openstack-ansible builds produce | 12:41 |
frickler | do we want to limit log item count in addition to log volume on the executor side? not sure though what a reasonable limit would look like | 12:42 |
noonedeadpunk | that is a good question; if it's done in the browser I totally can see where things can go south... | 12:42 |
* dpawlik reading | 12:42 | |
fungi | frickler: i was thinking the same thing, it's harder to know what a sane file limit might be, but also the reason we limit log size has more to do with not filling up the executor's disk and less to do with making sure builds will be able to upload results reliably | 12:43 |
fungi | noonedeadpunk: though maybe there's a reasonable way to shard by play or something | 12:43 |
Clark[m] | The disk limit is more to protect the executor than to ensure jobs pass. I think we have much less concern about inodes | 12:44 |
fungi | noonedeadpunk: and have it only fetch the json for the play if the user expands it (you'd be trading responsiveness to user interaction in that case i guess) | 12:44 |
frickler | fungi: well we also have an inode limit on /var/lib/zuul. though for example we are at 22% inodes on ze01 currently vs. 33% space | 12:44 |
Clark[m] | fungi: noonedeadpunk: you should be able to test it via web browser debugging tools and changing the json path location to load real osa data | 12:45 |
frickler | but in theory a job with a huge number of tiny files could likely exhaust inodes | 12:45 |
noonedeadpunk | fungi: yes, I should totally look at how that's done. I'm not that good in JS but hope I can figure out something | 12:45 |
fungi | yes, i buy the inode argument. we could set a file count governor to limit to a percentage of our inode capacity similar to the percentage of our block capacity | 12:45 |
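A hedged sketch of checking that headroom on an executor; the mount point matches what frickler quotes above, while the per-build path is an assumption about the executor's work dir layout:

    df -i /var/lib/zuul                                    # inode usage (the 22% figure)
    df -h /var/lib/zuul                                    # block usage (the 33% figure)
    # roughly how many inodes a single build's staged logs would consume
    find /var/lib/zuul/builds/<build-uuid>/work/logs -type f | wc -l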
Clark[m] | frickler: ok so less inside concern but not much less as I thought | 12:45 |
Clark[m] | *less inode. Mobile keyboards try to be too helpful sometimes | 12:46 |
noonedeadpunk | ok, thanks for ideas! | 12:46 |
fungi | i need to step away from the keyboard for a few minutes, but will be back shortly and try to dig into today's tripleo post_failure build | 12:46 |
noonedeadpunk | I have smth to work on now :) | 12:46 |
*** rlandy|dr_appt is now known as rlandy | 12:51 | |
*** ysandeep|afk is now known as ysandeep | 12:58 | |
*** undefined__ is now known as rcastillo | 13:08 | |
*** undefined_ is now known as rcastillo | 13:09 | |
noonedeadpunk | hm, there's also elastic callback plugin to log directly into elasticsearch.... | 13:11 |
*** dasm|off is now known as dasm | 13:33 | |
*** ysandeep is now known as ysandeep|out | 15:00 | |
*** dviroel is now known as dviroel|lunch | 15:19 | |
clarkb | noonedeadpunk: that would require direct access to ES and that isn't available with the current setup aiui. But even if you had that our experience has shown that it is a good idea to decouple jobs from writing to ES as that can be quite slow and occasionally have errors. Also you probably need something to render the data out of ES once it is there | 15:55 |
*** jpena is now known as jpena|off | 16:02 | |
fungi | clarkb: revisiting the lists.o.o outage and grub chainloading solution from earlier, does that mean we can get away with no longer decompressing the kernels and letting grub do it in the second stage? | 16:02 |
fungi | obviously it's something we would test in a scheduled window, but curious if it simplifies further kernel upgrades | 16:03 |
clarkb | fungi: I think it does open that possibility. However, if you look in grub.conf there was that if condition for xen where it adds grub modules for things like xz; I think maybe we have to add lz4 too? | 16:03 |
clarkb | because xen grub is reading our basic menu.lst which points to the grub2 shim loader elf which xen can read. Then that loads grub2 proper aiui which in theory can do the lz4 | 16:04 |
clarkb | that said I know I tested this before, but I think I was testing it with a compressed kernel and it didn't work then. So there is likely something like adding the extra mod that we need to do? | 16:05 |
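A hedged sketch for checking which decompressor a stock kernel image actually needs, so the right insmod could sit next to the xzio/lzopio lines quoted above. The magic byte sequences are the ones the kernel's own extract-vmlinux script searches for, and this has to be run against a freshly installed vmlinuz rather than the already-extracted copy currently in /boot:

    # prints a non-zero count if the corresponding compression magic appears in the image
    LC_ALL=C grep -c -a $'\x02\x21\x4c\x18' /boot/vmlinuz-5.4.0-121-generic       # lz4 (legacy) magic
    LC_ALL=C grep -c -a $'\xfd\x37\x7a\x58\x5a' /boot/vmlinuz-5.4.0-121-generic   # xz magic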
fungi | yeah, needing to add another module in the second stage grub config wouldn't surprise me | 16:11 |
fungi | akahat|ruck: i looked at the executor's debug log for https://zuul.opendev.org/t/openstack/build/ffb35b4d680d48a8bd21b1964964019d and it seems superficially similar to the problem noonedeadpunk has been running down in the openstack-ansible-deploy jobs... TASK [upload-logs-swift : Upload logs to swift] ends in "Ansible timeout exceeded: 3600" | 16:15 |
fungi | i wonder how trivial it would be to put another task just before that which says how many files will be uploaded, so we can start to get a feel for what the upper bound is on these | 16:16 |
fungi | and whether it's related to file count at all (even if that's only one of the variables feeding into the problem) | 16:16 |
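A hedged sketch of the kind of task fungi means, placed just before the upload in the post playbook; zuul.executor.log_root is the standard variable for the staged logs, but everything else here is illustrative rather than the actual zuul-jobs change:

    - name: Count files staged for log upload
      command: "find {{ zuul.executor.log_root }} -type f"
      register: staged_log_files

    - name: Report staged log file count
      debug:
        msg: "{{ staged_log_files.stdout_lines | length }} files staged for upload"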
clarkb | when successful they are all listed already. This doesn't help if the problem is build specific due to some exceptional case, but I don't think that is the tripleo situation. They already log excessively and could probably take a critical eye to that | 16:17 |
clarkb | using known information without changing anything about the jobs to collect more info | 16:17 |
*** dviroel|lunch is now known as dviroel | 16:18 | |
fungi | yeah, looking at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py i think we're relying on the sdk to explode paths/globs anyway, so there would be a bit of code involved to guess how the sdk will enumerate the input | 16:21 |
fungi | but yes, my thought was to be able to tell whether the builds hitting upload timeouts had significantly more files/chunks/whatever than their successful counterparts | 16:22 |
clarkb | oh interesting, they use their own log collection playbook, so the one we provide via zuul-jobs (which logs better) doesn't log anything :( | 16:23 |
clarkb | so the response to my first suggestion is that it isn't as easy as I first thought because tripleo is special | 16:24 |
fungi | another thought would be to split the uploading into two separate tasks, the first only uploading the console log, console json, and zuul info files | 16:24 |
fungi | that way if the problem is uploading way too many log files, at least there's the basic logs and a working result dashboard | 16:24 |
clarkb | oh interesting I like that | 16:25 |
fungi | it does mean more than one upload task, but hopefully the overhead of splitting it that way would be small | 16:26 |
fungi | however i'm not sure how to untangle it since a lot of that logic is in zuul-jobs | 16:27 |
clarkb | and for many jobs they don't log any more than that | 16:27 |
clarkb | I think the swift-upload-logs would have to be run against two different sets of inputs | 16:27 |
clarkb | so we may have to write those zuul specific log files to a new location and upload that then upload the normal logs afterwards from the existing location (this would be most backward compatible) | 16:28 |
fungi | or the upload-logs role could grow a priority list, uploading only those matches in the first task and then excluding them from the list expanded in the second task | 16:28 |
fungi | that might be unnecessarily complicated though | 16:28 |
clarkb | fungi: looking at tripleo logs one thing that they have is fairly deep dir structures and each dir is a swift object | 16:30 |
clarkb | several actually as you need the index too | 16:30 |
clarkb | they may see improvements flattening the log structure | 16:30 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html for example | 16:30 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/skyline-console/docs/zh/develop/images/form/index.html | 16:30 |
clarkb | there are also a couple of places where they seem to upload the same files | 16:31 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/tmp/dnf-zuul-gcsew2fm/index.html | 16:32 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/usr/tmp/dnf-zuul-gcsew2fm/index.html | 16:32 |
clarkb | that is unnecessary duplication which can be trimmed | 16:32 |
clarkb | I'm also not sure we need to copy what feels like all of /etc https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/etc/index.html | 16:33 |
fungi | akahat|ruck: ^ see above for some actionable suggestions which may help | 16:33 |
clarkb | much of what is in /etc is consistent job to job because it is either distro defaults or because we set it the same way via job configs or dib for all jobs | 16:34 |
clarkb | you could selectively log information if it becomes necessary rather than uploading, in every job, things that no one ever looks at | 16:34 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/workspace/.quickstart/usr/local/share/ansible/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html thats a much deeper nesting example | 16:35 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/src/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html seems to be a duplicate with one of the /opt/git/ paths above too so more | 16:36 |
clarkb | duplicates to clean up | 16:36 |
clarkb | I suspect that things like https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/environments/index.html are known or can be generated locally too? Not sure we need to log that? I mean we don't log | 16:38 |
clarkb | all our ansible that we are about to run. We know that it matches what is in our repo. But I would start with things like deep nesting, duplicates, and excessive /etc content that never changes | 16:38 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/tools/index.html looks like another good trim | 16:49 |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/lib/neutron/.cache/python-entrypoints/index.html anything like that too | 16:50 |
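A hedged sketch of auditing a downloaded copy of a job's log tree for the problems pointed out above:

    # deepest directory paths (every level adds more swift objects, including the generated indexes)
    find logs/ -type d | awk -F/ '{print NF, $0}' | sort -rn | head
    # identical files stored more than once (candidates for de-duplication)
    find logs/ -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate | head -40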
akahat|ruck | fungi, clarkb thank you for taking a look into this. | 17:52 |
akahat|ruck | TripleO collects lots of logs and there is some duplication also. | 17:52 |
akahat|ruck | We are also not sure about the file count.. adding a file count +1. | 17:53 |
fungi | and that seems to be resulting in an increasing number of job failures, so needs to be reined in | 17:53 |
akahat|ruck | and also i liked your idea about collect logs, where we can add log playbooks which collect specific files from the host. | 17:53 |
fungi | we can get file counts for the successful jobs, but we don't know what the file counts might be for the failing ones which are unable to upload all their logs before zuul gets bored of waiting and cuts them off | 17:54 |
akahat|ruck | +1 for uploading console log, json and zuul info files at first and rest later | 17:54 |
fungi | if you would like to help with improving the upload-logs roles in zuul-jobs that would be awesome, but at least please look into trimming the fat in tripleo's job log collection | 17:55 |
akahat|ruck | fungi, yeah.. sure. I'll discuss this topic with my team and will check what specific files we can collect. | 17:57 |
clarkb | keep in mind the nested makes each dir level count as another "file" | 17:57 |
clarkb | s/nested/nesting/ | 17:57 |
clarkb | so reducing nesting can also help | 17:57 |
akahat|ruck | clarkb, noted. | 17:59 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 18:51 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 18:56 |
*** dviroel is now known as dviroel|out | 20:50 | |
opendevreview | Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments https://review.opendev.org/c/openstack/project-config/+/847621 | 21:45 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:46 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:47 |
opendevreview | Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments https://review.opendev.org/c/openstack/project-config/+/847621 | 21:56 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:56 |
opendevreview | Gage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 21:56 |