opendevreview | Merged openstack/diskimage-builder master: Revert "Temporarily stop running OpenSUSE functtests" https://review.opendev.org/c/openstack/diskimage-builder/+/840329 | 00:14 |
---|---|---|
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 01:55 |
*** ysandeep|out is now known as ysandeep | 02:04 | |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 02:21 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 02:38 |
BlaisePabon[m]1 | No worries. | 03:08 |
BlaisePabon[m]1 | I hang out with the openfaas folks and those guys have no resources, so I'm always happy to help where I can. | 03:08 |
BlaisePabon[m]1 | It looks like there's more corporate largesse to go around in this group. | 03:08 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 03:18 |
ianw | https://pypi.org/project/diskimage-builder/ is showing 3.21.1 as the latest, but https://pypi.org/project/diskimage-builder/3.22.0/ is there (and the job worked: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_85f/787dcaac0a9a446f2a5b3b442601baaf634d11a0/release/release-openstack-python/85fcaf3/job-output.txt) | 03:28 |
ianw | someone else has mentioned a similar issue on libera #pypa | 03:32 |
ianw | looks like it comes back to https://github.com/pypa/warehouse/issues/10390 | 03:36 |
ianw | i seem to remember the -X purge thing working, but it doesn't seem to now | 03:37 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 03:40 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions https://review.opendev.org/c/opendev/infra-manual/+/845593 | 03:45 |
BlaisePabon[m]1 | Sorry for the delay. I'm getting the hang of this, and rST is more familiar than Yaml. | 03:48 |
BlaisePabon[m]1 | This is a good doc, thank you Clark . | 03:55 |
BlaisePabon[m]1 | Would you like me to prepare a change with `s/OpenStack/OpenDev/g`? | 03:55 |
opendevreview | Ian Wienand proposed opendev/system-config master: Allow git to operate in checked-out source directories https://review.opendev.org/c/opendev/system-config/+/847700 | 04:16 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Convert OpenStack terms over to OpenDev https://review.opendev.org/c/opendev/infra-manual/+/847710 | 04:21 |
*** ysandeep is now known as ysandeep|afk | 05:15 | |
*** ysandeep|afk is now known as ysandeep | 05:56 | |
frickler | frickler@bridge:~$ cat /var/log/ansible/zuul_reboot.log.1 | 06:28 |
frickler | flock: cannot open lock file /var/run/ansible/zuul_reboot.lock: No such file or directory | 06:28 |
*** frenzyfriday|PTO is now known as frenzy_friday | 07:13 | |
*** jpena|off is now known as jpena | 07:43 | |
opendevreview | likui proposed openstack/diskimage-builder master: Drop python3.6/3.7 support in testing runtime https://review.opendev.org/c/openstack/diskimage-builder/+/847718 | 08:15 |
opendevreview | likui proposed openstack/diskimage-builder master: Stop testing lower-constraints https://review.opendev.org/c/openstack/diskimage-builder/+/847719 | 08:17 |
opendevreview | Merged zuul/zuul-jobs master: [tox] Update readme regarding default values https://review.opendev.org/c/zuul/zuul-jobs/+/847656 | 08:59 |
*** rlandy|out is now known as rlandy | 09:29 | |
rlandy | fungi: hi - when you come online - we're still seeing a lot of post_failures and not just in tripleo jobs. This is really impacting our ability to merge patches | 09:46 |
rlandy | I saw your discussion with akahat|ruck - but it's not just tripleo jobs | 09:47 |
rlandy | we do have a deep log collection structure - but it would seem there is some new issue? | 09:47 |
*** ysandeep is now known as ysandeep|afk | 09:50 | |
*** ysandeep|afk is now known as ysandeep | 10:38 | |
jrosser_ | rlandy: we see the same thing on openstack-ansible jobs | 10:50 |
rlandy | jrosser_: ack - killing our gate | 10:50 |
jrosser_ | for now we have disabled generating the HTML ARA report which has helped a lot, but still there is a low level of POST_FAILURE which means we also can't really merge anything | 10:50 |
jrosser_ | there seem to be periods (days to weeks) when this doesn't happen, and then similar periods when it's pretty regular | 10:51 |
rlandy | jrosser_: we suspected it was one provider at first - but now it seems to be on multiple | 11:00 |
jrosser_ | do you have a way to tell where the logs are going? | 11:01 |
rlandy | jrosser_: no - no way when logs don't have a link | 11:01 |
rlandy | arxcruz: ^^ anything we can remove from our log collection to help here? | 11:02 |
rlandy | we are going to have to force merge some patches if we can't get them through gate | 11:02 |
fungi | rlandy: jrosser_: the vast majority do still seem to be for tripleo and openstack-ansible changes looking at https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&voting=1&skip=0 | 11:35 |
rlandy | looking at what we can cut from collect logs - but this is a recent issue | 11:36 |
fungi | things i suggested might help would be if someone wants to work on splitting log uploads so that we push up the console log/json and zuul-info files with a separate task first so that we have *some* logs and can see what swift providers the builds are destined to, and then adjust the log upload role to provide a count and size of what's going to be uploaded before proceeding with the upload | 11:36 |
fungi | that will help us ascertain whether 1. there's any correlation to a specific swift provider and 2. whether there's something different about the size or quantity of logs in the builds which timeout their uploads | 11:37 |
jrosser_ | i made a patch to openstack-ansible to check there was nothing silly happening in the log structure https://zuul.opendev.org/t/openstack/build/c449b1162be6482c858eab60aff183c8/log/job-output.txt#19443 | 11:38 |
jrosser_ | though getting size/file count at the same time with `tree` didn't look possible | 11:38 |
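For reference, a minimal sketch of the pre-upload size/count report fungi suggests above. Assumptions: it runs in a post playbook on the executor, where `zuul.executor.log_root` is defined; the task names are illustrative and this is not the actual role change.

```yaml
- hosts: localhost
  tasks:
    # Count files and total size of the collected logs before the upload
    # task runs, so slow or timed-out uploads can be correlated with volume.
    - name: Gather log file count and total size
      shell: |
        echo "files: $(find {{ zuul.executor.log_root }} -type f | wc -l)"
        echo "size: $(du -sh {{ zuul.executor.log_root }} | cut -f1)"
      register: log_stats

    - name: Report log stats in the console log
      debug:
        var: log_stats.stdout_lines
```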
rlandy | chandankumar added https://review.opendev.org/c/openstack/tripleo-ci/+/847771 to try to reduce log capture size | 11:47 |
rlandy | may help us get in under the radar | 11:47 |
jrosser_ | fungi: do you know where the code for this module is? https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-swift/tasks/main.yaml#L21 | 11:52 |
fungi | jrosser_: it's in zuul/zuul-jobs: https://zuul-ci.org/docs/zuul-jobs/log-roles.html#role-upload-logs-swift | 11:58 |
fungi | oh, the module, just a sec | 11:58 |
fungi | jrosser_: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base | 11:58 |
fungi | that has the python modules | 11:59 |
*** dviroel|out is now known as dviroel | 12:01 | |
fungi | for splitting the uploads, i expect we'd add a new task in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml before the existing upload-logs-swift inclusion and specify the important subset we want uploaded first, then either delete those from the local tree or find a way to exclude them from the general upload | 12:01 |
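A hedged sketch of the split fungi describes, not the merged implementation. It assumes 847775 exposes a role variable restricting which paths get uploaded (the name `upload_logs_swift_files` is hypothetical), and it omits the credentials and other parameters the real role needs.

```yaml
- hosts: localhost
  roles:
    # First pass: upload only the small essential files so a log link, the
    # console log and zuul-info exist even if the bulk upload times out.
    - role: upload-logs-swift
      vars:
        # Hypothetical variable name; the real override from 847775 may differ.
        upload_logs_swift_files:
          - "{{ zuul.executor.log_root }}/job-output.txt"
          - "{{ zuul.executor.log_root }}/job-output.json"
          - "{{ zuul.executor.log_root }}/zuul-info"
    # Second pass: the existing role inclusion uploads everything else (or
    # the whole tree, accepting some duplication).
    - role: upload-logs-swift
```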
jrosser_ | you don't mind that the important things would be duplicated? | 12:02 |
jrosser_ | oh well right | 12:02 |
jrosser_ | that's why i was attempting to look at the code | 12:02 |
jrosser_ | and i'm immediately completely out of my depth unfortunately | 12:03 |
fungi | much of the logic in the lib is delegated to openstacksdk, from what i saw looking over the weekend | 12:03 |
opendevreview | Jonathan Rosser proposed zuul/zuul-jobs master: Allow upload-logs-swift files list to be overriden https://review.opendev.org/c/zuul/zuul-jobs/+/847775 | 12:20 |
fungi | ianw: seems the pypi cache issue may be more widespread, i saw people referring to something similar on the python discourse today: https://discuss.python.org/t/python-package-not-updating/16856 | 12:25 |
*** arxcruz is now known as arxcruz|brb | 12:42 | |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:50 |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:52 |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:55 |
fungi | ianw: also https://discuss.python.org/t/cannot-make-the-new-release-to-be-the-latest-version-for-installation/16859 | 13:02 |
*** arxcruz|brb is now known as arxcruz | 13:48 | |
Clark[m] | fungi: jrosser_: it does look like the roles assume zuul.executor.log_root is the sole source of logs and we don't make that configurable via the role vars. I think that means the first step would be making that a parameter. | 13:50 |
jrosser_ | Clark[m]: did you see my patch? | 13:51 |
jrosser_ | i took a first stab at it here https://review.opendev.org/c/zuul/zuul-jobs/+/847775 | 13:51 |
Clark[m] | Yes, that is a different problem | 13:52 |
jrosser_ | oooh | 13:52 |
Clark[m] | Oh wait there are two patches? | 13:52 |
Clark[m] | I saw the random choice move | 13:52 |
jrosser_ | right | 13:52 |
jrosser_ | that was a bit of a precursor to making two uploads | 13:53 |
jrosser_ | as it would need to send them to the same place | 13:53 |
Clark[m] | Right. I missed the 775 change. Ya I think that is basically what is needed. Then in the base job we switch to uploading two different sources of content | 13:55 |
jrosser_ | yeah, i wasn't really sure what the best approach was to do two uploads | 13:55 |
Clark[m] | The other issue is whether we'll partition deterministically, but I'm pretty sure that is based on build uuid so we should | 13:55 |
jrosser_ | as the files: parameter just grabs everything beneath it as far as i can see | 13:56 |
Clark[m] | Oh except we have to generate index files too. That might be the biggest issue | 13:58 |
Clark[m] | I think what will happen is the second upload will overwrite the previous index effectively hiding the first set of files (they will be there just not in the index.html) | 13:58 |
Clark[m] | And maybe similarly for the zuul manifest which the zuul log browser uses | 13:59 |
Clark[m] | Ya this isn't super straightforward now that I think about the details | 13:59 |
Clark[m] | We would need the role to treat it as a single tree with multiple passes, rather than two separate passes | 14:00 |
jrosser_ | is it only the top level index that is difficult? | 14:01 |
jrosser_ | upload 1 -> zuul-info, upload 2 -> everything else | 14:01 |
jrosser_ | so the only inconsistency would be right at the top there | 14:01 |
Clark[m] | Yes, it depends on the structure of the data but currently it would just be for the top level (both in index.html and the manifest) | 14:03 |
Clark[m] | Thinking out loud here: I wonder if the data is actually present behind those failures, but since the manifest is likely the last thing we upload zuul is ignorant of what is there | 14:05 |
Clark[m] | fungi: it may be worth trying to browse directly to the paths and see what if anything is there. We may not need to change anything to have access to sufficient info if we're failing after uploading much content but before the manifest | 14:06 |
fungi | Clark[m]: well, longer term i think it's still very useful to have the console log/json and zuul info available from the build results page even if something goes wrong uploading everything else the job has said to archive | 14:07 |
fungi | but yes i expect the files are there in swift, at least the ones which got uploaded | 14:08 |
jrosser_ | there is a more brute force approach, which is to upload essential stuff, then upload everything and accept some duplication | 14:08 |
jrosser_ | then at least there is something rather than nothing if a large upload fails, and avoids the index issue | 14:08 |
Clark[m] | fungi: yes, but it is very complicated and unlikely to be refactored quickly | 14:08 |
Clark[m] | jrosser_: considering we suspect too many files to be the source or at least part of the issue I'd be wary of that approach. We'll just make it worse if this is the problem | 14:09 |
fungi | Clark[m]: i mean in order to make it not necessary to get a zuul admin to dig the upload url out of the executor logs, since the current build dashboard doesn't supply a log link at all | 14:09 |
fungi | making these sorts of failures at least a little more possible for users to debug would save our limited sysadmins some time | 14:10 |
jrosser_ | thats really it - i'm completely blind to whats happening here | 14:10 |
fungi | i can do it for a few representative builds, but doing it at a larger scale is much more time-consuming | 14:11 |
Clark[m] | fungi: it's there somewhere because the js has to load the manifest | 14:11 |
Clark[m] | I don't think admin is required though it may require opening the js debug console | 14:12 |
fungi | yeah, but the dashboard gives you "no log available" | 14:12 |
clarkb | hrm it looks like zuul-web proxies the manifest request then after it has the manifest the data is fetched directly from swift for everything else? | 14:15 |
jrosser_ | if i look in the zuul API for one of these failed jobs i get "log_url": null | 14:16 |
fungi | yeah, i'm trying to fish the log url for one of these out of the executor log | 14:16 |
clarkb | ya I think we may not record the log_url because that happens after the upload, since the upload provides the location back | 14:17 |
clarkb | the data very likely is there, we just aren't exposing it because the upload process was cut short :/ | 14:17 |
rlandy | fungi: clarkb: we are going to need to force promote two patches as we can't get them through the gate due to post_failures ... I checked with slagle (tripleo PTL) and he will approve. Can you force merge: | 14:17 |
rlandy | https://review.opendev.org/c/openstack/tripleo-common/+/847437/ | 14:17 |
rlandy | https://review.opendev.org/c/openstack/tripleo-upgrade/+/847680/ | 14:17 |
rlandy | both backports | 14:18 |
clarkb | rlandy: we usually only force merge fixes for the gate issues. | 14:18 |
fungi | and generally only when they're unable to merge because of the problem they're fixing | 14:20 |
rlandy | https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-scenario001-standalone&branch=stable%2Fwallaby&skip=0 | 14:20 |
rlandy | only pass is on https://review.opendev.org/c/openstack/tripleo-common/+/847437/ | 14:20 |
rlandy | if we can fix the post_failure | 14:21 |
rlandy | then we "should" be able to get these through | 14:21 |
fungi | you can fix the post failure by not running the jobs which are unstable because of post_failure results | 14:21 |
jrosser_ | imho it is not quite like that for my failing jobs | 14:22 |
jrosser_ | it seems random | 14:23 |
rlandy | all multinode - checking if we can merge https://review.opendev.org/c/openstack/tripleo-ci/+/847771/1/toci-quickstart/config/collect-logs.yml | 14:23 |
fungi | rlandy: work on making those easier to diagnose is most welcome, and we've been discussing options here, but please don't mistake your desire to ignore the current spate of ci problems you're experiencing as being any sort of priority for me. if you want to ignore some jobs in order to merge changes you can already do that by disabling those jobs | 14:24 |
clarkb | jrosser_: I doubt it is random. It is more likely some combo of test node cloud and/or swift provider (the way log collection works is everything is copied from test nodes to executors, then from executors to swift in one of 5 locations. if either or both of those is slow then you can run up against the timeout) | 14:24 |
jrosser_ | clarkb: oh absolutely - i should have qualified random better | 14:24 |
clarkb | jrosser_: one approach may be to increase the timeout and see if we can collect the logs that way and identify the source of the slowness | 14:24 |
jrosser_ | what i really meant was it wouldn't be possible for me to not run a failing job because i can't tell which one will fail | 14:25 |
jrosser_ | clarkb: is that something i can do? | 14:25 |
clarkb | jrosser_: yes, I think so. Let me dig up docs | 14:26 |
clarkb | jrosser_: https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.post-timeout there may be an upper bound on that too looking further | 14:26 |
clarkb | our default post-timeout is 1800 seconds (half an hour) | 14:27 |
clarkb | I'm not finding where we might set a max timeout. Maybe bump that to an hour and a half in your jobs and collect what we can and take it from there? | 14:28 |
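For example, a sketch of the temporary bump being suggested. The job name is one of the affected openstack-ansible jobs from this log and stands in for whichever variants actually get the change.

```yaml
- job:
    name: openstack-ansible-deploy-aio_lxc-centos-8-stream
    # The base job default is 1800s; raise it so slow log uploads complete
    # and their real duration can be measured. The tenant max-job-timeout
    # (10800s by default) is the effective ceiling.
    post-timeout: 5400
```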
jrosser_ | sure i can try that, thanks for the pointer | 14:29 |
clarkb | oh we must rely on the default zuul max-job-timeout of 10800? | 14:29 |
clarkb | which is 3 hours which seems to line up | 14:29 |
clarkb | and I think it applies to both the regular job timeout and the post timeout as two separate blocks. So max post timeout is 3 hours as well? | 14:30 |
fungi | revisiting https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&skip=0 all the gate post_failures so far today (since utc midnight) have been tripleo multinode jobs, though there was a tripleo "standalone" job (i guess that means single-node?) which caught one just before utc midnight | 14:30 |
rlandy | correct - most are multinode | 14:31 |
clarkb | jrosser_: that should theoretically give us better info on what actual max runtime for the uploads is, where the time cost is (pulling logs to executor or pushing to swift or both), and how close we are to the timeout. It may just be that you've been riding the edge for a while and a few new log files are tipping you over or something (though removal of ara not completely fixing this | 14:31 |
clarkb | indicates this is unlikely) | 14:31 |
jrosser_ | tbh i was surprised it failed at all after removing ARA | 14:32 |
clarkb | rlandy: are there changes to stop uploading /etc content that never changes, and stop uploading duplicated content, and changes to reduce the nesting? Those sorts of things I think we could force merge given our current understanding of the issue | 14:32 |
jrosser_ | i will have to check again but i think that was the largest count of files by a long way | 14:33 |
fungi | that may indicate that ara files weren't a significant contributing factor to the upload time | 14:33 |
rlandy | clarkb: there is one patch in flight - on call now will discuss with team more options and dirs we can remove | 14:34 |
clarkb | frickler: I was going to look into that. Will flock -n not create a file if it doesn't exist? Anyway thanks for checking I'll try to followup a bit later after breakfast | 14:34 |
fungi | i'm trying to see how much of the logs from https://zuul.opendev.org/t/openstack/build/bae8ea4276ad495792e320631486cbf9 got uploaded. in the executor debug log i can see that zuul_log_path is bae/847362/2/gate/openstack-ansible-deploy-aio_lxc-centos-8-stream/bae8ea4 so i can probably try to fetch the console log from random providers and brute-force discovery of the actual url | 14:36 |
clarkb | fungi: that won't work with rax because the urls are not consistent | 14:36 |
clarkb | fungi: but I thought it logged where it was uploading earlier in the logs? If not that may be something we can land to zuul-jobs quickly (explicitly log the full log path prefix at the beginning of uploading) | 14:37 |
clarkb | anyway I need to boot my morning. Back in a bit | 14:37 |
fungi | i think it may only log that if the task is successful? but anyway not having luck finding information on the actual log url in the executor debug log | 14:38 |
fungi | yeah, "TASK [upload-logs-swift : Upload logs to swift]" includes a "url" parameter in the ansible output normally, but not if the task times out | 14:40 |
fungi | so the executor doesn't actually even log the url in these cases | 14:41 |
jrosser_ | fungi: i never see that even if that task doesnt time out | 14:43 |
jrosser_ | isn't it chicken/egg with what is in the log file at the point it is uploaded? | 14:44 |
jrosser_ | oh - you mean the log on the executor i guess, not the uploaded one | 14:46 |
fungi | yes | 14:48 |
fungi | i'm digging in the executor's debug log | 14:48 |
fungi | but have basically confirmed that even from the current debug logs we can't reliably work out where the build logs ended up | 14:49 |
fungi | so yes, increasing the upload timeout temporarily and then looking for successful jobs which exceed the default post-logs playbook timeout seems like the next logical step | 14:51 |
jrosser_ | i've made a set of patches which should hopefully catch some of those | 14:51 |
jrosser_ | unfortunately this does seem to come and go somewhat | 14:52 |
*** pojadhav is now known as pojadhav|afk | 15:04 | |
clarkb | fungi: I think we can add a logging.info() for this value https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L144 | 15:18 |
clarkb | fungi: separately we do log the file list https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L283-L285 but only at a debug level so I think we need some amount of ansible -v's to emit that | 15:19 |
clarkb | (also I'm not sure if ansible logs INFO level by default; it may be that logging.info() will need more verbosity too) | 15:19 |
fungi | clarkb: but also we no_log it in order to avoid leaking credentials: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml#L12 | 15:24 |
clarkb | oh ha ok | 15:25 |
clarkb | In that case it really does seem like we need to work from successful runs with longer timeouts and go backward from there to try and identify problems :/ | 15:25 |
*** dviroel is now known as dviroel|lunch | 15:27 | |
fungi | https://status.python.org/ "14:20 UTC: A delay in our job queue is affecting CDN purges causing public pages to be stale and delays in email sending for notifications, email validation, password resets, etc. 15:05 UTC: We've identified specific jobs in our queue that are causing processing delays and are working to remove them." | 15:36 |
fungi | looks like they finally noticed | 15:36 |
*** marios is now known as marios|out | 15:39 | |
clarkb | talking out loud again. Another option we might consider is temporarily increasing the base jobs post-timeout | 15:50 |
clarkb | to cast a wider net and hopefully make it easier to debug. The problem with that is the revert back to half an hour is likely to be painful | 15:51 |
fungi | i think we'd need a good means of identifying which builds exceeded the old timeout under the new temporary one | 16:00 |
clarkb | frickler: ok the issue is there is no /var/run/ansible so the file can't be created in the non-existent dir. It appears flock will create the file if the dir exists. I think I may just stick it under /var/run then | 16:02 |
clarkb | change incoming | 16:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix zuul reboot playbook flock location https://review.opendev.org/c/opendev/system-config/+/847808 | 16:05 |
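A rough sketch of the kind of fix 847808 presumably makes; the playbook path, schedule and task name here are assumptions, only the lock and log locations come from the discussion above.

```yaml
- name: Schedule the Zuul reboot playbook under a lock
  cron:
    name: zuul_reboot
    user: root
    weekday: "6"
    hour: "0"
    minute: "0"
    # flock -n creates the lock file as long as its directory exists, which
    # is why the missing /var/run/ansible directory broke the old location.
    job: >-
      flock -n /var/run/zuul_reboot.lock
      ansible-playbook /path/to/playbooks/zuul_reboot.yaml
      >> /var/log/ansible/zuul_reboot.log 2>&1
```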
*** ysandeep is now known as ysandeep|PTO | 16:09 | |
clarkb | one crazy idea I had that won't work is to write the log url to a file then have a followup task read and echo it. That followup task wouldn't need to be no_log. But that won't work because we are timing out and exiting in the middle of the upload task | 16:12 |
jrosser_ | i was also thinking about that | 16:13 |
jrosser_ | looking at the uploader code it might be possible to call the role with an empty filelist | 16:13 |
jrosser_ | and the url should be returned - though i was pretty unclear if the url would be constant | 16:14 |
clarkb | I think the url is going to be constant for a specific build uuid. The build uuid determines the sharded container location and the rax cdn different urls are consistent per container | 16:14 |
clarkb | basically as long as that uuid is fixed we should end up with the same url for logs across multiple uploads | 16:15 |
BlaisePabon[m] | Oh! I wonder if that is why I didn't get an email when I added another address to my Pypi profile....? | 16:15 |
BlaisePabon[m] | (rhetorical, no reply required) | 16:15 |
fungi | BlaisePabon[m]: it does say "pypi.org - Email: Degraded Performance" | 16:16 |
fungi | so maybe so | 16:16 |
clarkb | but ya I'm still trying to brainstorm better debugging methods here. It is a "fun" challenge. Back when we had the jenkins scp log uploader we had similar chicken and egg problems. Eventually we forked a separate java thread to upload the console log so that it could run beyond the job thread lifetime | 16:16 |
fungi | BlaisePabon[m]: and the incident name is "Queue delay affecting CDN purges and emails" | 16:17 |
fungi | so i guess the two are related | 16:17 |
clarkb | jrosser_: I'm shifting tactics to look at successful runs to see if I can infer anything from them. I still see ara reports on them for openstack-ansible-deploy-infra_lxc-ubuntu-focal on stable/wallaby fwiw | 16:22 |
clarkb | (stable/wallaby of that job was the most recent POST_FAILURE) | 16:22 |
jrosser_ | yes, we have only managed to merge the change to disable ARA on master | 16:23 |
clarkb | got it | 16:23 |
jrosser_ | ironically the cherry-pick for that to yoga failed..... due to POST_FAILURE | 16:23 |
clarkb | There is also some larger binary journal content. You might consider exporting that like devstack does. That allows you to export the range you want (and potentially other bits you care about) to a more text-friendly format that compresses with e.g. xz very well | 16:26 |
clarkb | The apt trusted keys are also logged many times from many locations (host and all the containers) | 16:27 |
jrosser_ | yep - we place a lot of those ourselves | 16:27 |
jrosser_ | dropping the binary log journals is a good idea | 16:28 |
jrosser_ | i've never really had a lot of success grabbing them from one system and parsing them on another | 16:28 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 16:29 |
clarkb | jrosser_: ya journalctl has a specific export command that serializes the info and allows you to import and view it elsewhere. Copying the binary data directly isn't meant to work that way aiui | 16:29 |
clarkb | let me see if I can find a link to how devstack does/did it | 16:29 |
jrosser_ | we have this to post-process them https://github.com/openstack/openstack-ansible/blob/master/scripts/journal_dump.py | 16:30 |
clarkb | https://opendev.org/openstack/devstack/src/branch/master/roles/export-devstack-journal/tasks/main.yaml#L34-L48 Looks like they switched to using gz :/ | 16:31 |
clarkb | we did xz back when because it was like 10x better and 99% of the time people don't need the raw journals oh well | 16:31 |
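A hedged sketch of that export approach, modelled loosely on devstack's export-devstack-journal role rather than copied from it; `log_dir` is an assumed variable and the task would normally be scoped to the units of interest.

```yaml
- name: Export the systemd journal for the current boot and compress it
  # `journalctl -o export` serializes the journal so it can be imported and
  # viewed on another system; xz compresses the export format very well,
  # unlike copying the raw binary journal files.
  shell: |
    journalctl -b -o export | xz -c > {{ log_dir }}/journal.export.xz
  become: true
```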
jrosser_ | it's ok, we already have a list of "interesting services" and dump out what you would have had with journalctl | 16:31 |
jrosser_ | the binary ones are pretty useless now | 16:31 |
clarkb | anyway other than the journals and ara content in the wallaby job I'm not seeing much that stands out to me. There is some nesting but it isn't too bad. And some duplication (of the apt trusted keys). But on the whole far less content than tripleo. Though the timeout is shorter | 16:32 |
clarkb | The other thing that just occurred to me is that the post-timeout is for all the post-run playbooks. It could be that the uploads are fine and the timeouts are just fallout | 16:33 |
clarkb | ara itself takes almost 3 minutes to generate the report in the successful job | 16:34 |
clarkb | about 1/10th the timeout value | 16:34 |
clarkb | Then journal collection from the containers takes about 20 seconds, and journal collection and eth interface info collection another 20 seconds. Nothing too crazy yet | 16:35 |
clarkb | Copying logs to the executor another 15 seconds | 16:36 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions https://review.opendev.org/c/opendev/infra-manual/+/845593 | 16:36 |
clarkb | the upload of logs to swift took about 9 minutes | 16:37 |
*** dviroel|lunch is now known as dviroel | 16:37 | |
clarkb | In total about 13 minutes of post-run time. About 43% of the timeout value. | 16:38 |
clarkb | Anyway that does point to the upload to swift being the likely issue just based on successful time spent | 16:38 |
clarkb | And that upload occurs from the executor to swift. Which means the source side of that is pretty consistent. One of 12 instances in a single cloud | 16:39 |
clarkb | Now the successful case is the one I would have expected to have the most issues as it is across the atlantic | 16:39 |
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_554/periodic/opendev.org/openstack/openstack-ansible-repo_server/stable/wallaby/openstack-ansible-deploy-infra_lxc-ubuntu-focal/554c3a2/ is what I'm looking at fwiw | 16:39 |
clarkb | That doesn't rule out some other internet/routing/ipv6/etc problem though | 16:40 |
clarkb | I also see uploads to both ovh swifts from tripleo jobs that are successful as well as uploads that are hosted by the rax cdn. More difficult to determine which backend those rax cdn uploads targeted though. But that seems to rule out a consistently bad location | 16:42 |
fungi | i suppose something like bandwidth contention or significant packet loss on an executor resulting in much longer upload times | 16:47 |
fungi | would be another possibility | 16:48 |
clarkb | yup it could be network contention between jobs too | 16:48 |
clarkb | I got excited when I saw that the system-config-run job that runs against latest ansible had a POST_FAILURE and thought maybe it was related to latest ansible. but no this error is a regular error in post-run not a timeout | 16:59 |
clarkb | it has logs and all that | 16:59 |
*** Guest3279 is now known as rcastillo | 17:20 | |
jrosser_ | clarkb: fungi i have a few jobs taking suspiciously long to upload logs right now, it may be nothing unusual but i can give you the details if it's helpful | 17:23 |
clarkb | jrosser_: ya can you link to the live console? | 17:23 |
clarkb | I'd be happy to take a look at that | 17:23 |
jrosser_ | https://zuul.openstack.org/stream/d29af85f2a144f68b2eeb849ccb8e8ad?logfile=console.log | 17:24 |
clarkb | Looks like that one just succeeded | 17:25 |
jrosser_ | yeah i figured that would happen | 17:25 |
clarkb | which is good because it means we can go look at the executor logs and gather more info about it | 17:25 |
jrosser_ | that was about ~20 mins | 17:26 |
clarkb | First thing I notice is the pre-upload tasks seem to take about 3 minutes there which is consistent with the other successful uploads. Either the size of the data doubled causing it to take twice as long as other uploads or there is some other issue | 17:26 |
jrosser_ | this one is similar https://zuul.openstack.org/stream/62be39711bb6402d90d27f99fd4a96c3?logfile=console.log | 17:26 |
jrosser_ | about 20 mins in as well | 17:26 |
clarkb | https://e08e29492ebe0e6f5609-f5e06fd9232d0500a75b50f2bf1ad7e1.ssl.cf5.rackcdn.com/847797/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/d29af85/ is the first one. It uploaded to one of the rax backends (have to check the executor log to get more specific than that I suspect) | 17:28 |
clarkb | It ran on ze09. | 17:29 |
jrosser_ | i think that they might have both completed now | 17:29 |
clarkb | https://a482a10c2a931478cad4-f031700b2c1819893a02a03206c89279.ssl.cf5.rackcdn.com/847668/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/62be397/ is the second it ran on ze12 | 17:29 |
clarkb | yup they both ran successfully but took a bit more than 20 minutes to upload | 17:29 |
jrosser_ | but that seems to be the pattern we get, that there's ~15 or so long-running jobs on that change and just one or maybe two of those would POST_FAILURE | 17:30 |
jrosser_ | it certainly wasn't all of them at all | 17:30 |
opendevreview | Merged opendev/system-config master: Fix zuul reboot playbook flock location https://review.opendev.org/c/opendev/system-config/+/847808 | 17:30 |
clarkb | Both jobs ran in iweb. I suppose it is theoretically possible that the test node cloud provider is impacting logging in some way that causes this (producing much larger system logs due to something in the provider?) | 17:30 |
clarkb | just checking the journal logs they do seem larger in that one. But I would need to add them all up to be sure of that | 17:31 |
clarkb | each of them has a 96MB journal file (in addition to all the other journal files) | 17:32 |
fungi | yeah, doing a side-by-side comparison of a complete download of the saved logs from one that took longer and one that didn't for the same job would be helpful | 17:32 |
clarkb | https://zuul.opendev.org/t/openstack/build/fdd24d45c34a43fc8171baf79e5e5323/logs was a short one | 17:34 |
clarkb | https://zuul.opendev.org/t/openstack/build/ce8c876dca6b49e0adc05eb098fb0b61/log/job-output.txt was longer but still short (~4 minutes, first short example was about 2 minutes) | 17:35 |
jrosser_ | oh wow the log upload time was only 2 mins? | 17:36 |
clarkb | one of the long ones has an ara report. The other long one doesn't have an ara report | 17:36 |
clarkb | jrosser_: double check me on that but I'm looking at the timestamp for start of upload and comparing it against the job completion time in the zuul build dashboard | 17:36 |
jrosser_ | yeah, thats the only way i've found to get a handle on the upload time | 17:36 |
*** pojadhav|afk is now known as pojadhav | 17:36 | |
clarkb | at least one quick one had a larger journal file too | 17:37 |
jrosser_ | this is slightly unscientific as the patches we're looking at are trying to reduce the log upload burden | 17:38 |
jrosser_ | so comparisons are tricky | 17:38 |
clarkb | ah | 17:38 |
jrosser_ | it would be reasonable to look at different jobs in the same change though | 17:38 |
clarkb | yup. And probably the post-timeout increase jobs are the best to look at. Recheck them a few times then compare runtimes | 17:39 |
clarkb | since in theory we're least likely to actually hit a post_failure and not be able to compare there | 17:40 |
clarkb | one other thing to note. A short run ran on ze09 as well | 17:41 |
clarkb | but I agree hard to say how useful that info is until we're comparing across the same changes | 17:41 |
fungi | yeah, things to possibly correlate: consistency for builds run from the same executor, for builds run on nodes from the same provider, for builds uploading logs to the same swift endpoint | 17:43 |
*** rlandy is now known as rlandy|afk | 18:40 | |
*** jpena is now known as jpena|off | 19:06 | |
*** rlandy|afk is now known as rlandy | 19:23 | |
TheJulia | anyone around to put a VM hold in real quick? | 19:40 |
clarkb | TheJulia: I can do it, just need job name and change ideally | 19:41 |
TheJulia | 847601,3 in flight, ironic-tempest-wholedisk-bios-snmp-pxe | 19:41 |
TheJulia | should fail shortly.... | 19:41 |
clarkb | the autohold command has been run | 19:43 |
* TheJulia crosses fingers | 19:43 | |
clarkb | ianw: ^ in doing that I notice your commands in history there aren't prefixed with the timestamp. I wonder why | 19:43 |
TheJulia | oh, it is *soooo* close | 19:43 |
TheJulia | it is getting logs now | 19:43 |
TheJulia | clarkb: looks like the job just exited | 19:46 |
clarkb | but it was successful? the hold should be good for your recheck | 19:47 |
TheJulia | failed | 19:47 |
clarkb | oh /me looks again | 19:47 |
TheJulia | which is what I'm hoping to capture the state of | 19:47 |
clarkb | TheJulia: https://zuul.opendev.org/t/openstack/build/67ac58dd2a2443eaa76061b884e61faf it passed | 19:48 |
TheJulia | gaaaah | 19:48 |
TheJulia | :( | 19:48 |
* TheJulia goes off in the corner and cries :) | 19:48 | |
clarkb | the hold is still good though and should catch it if/when it does fail | 19:49 |
TheJulia | clarkb: off to issue a recheck then, thanks | 19:49 |
clarkb | I'm getting our meeting agenda together for tomorrow. Anything missing? | 20:09 |
clarkb | ianw: I +2'd the grafana testing changes but didn't approve them as I wasn't sure if fungi wanted to look at your response prior to approval too | 21:06 |
fungi | thanks, i did see the responses. i was +2 on it anyway | 21:24 |
fungi | just wishing i had the bandwidth to move more of our opendev jobs out of ozj | 21:25 |
*** dviroel is now known as dviroel|out | 21:40 | |
ianw | yeah, it feels like maybe a sprint thing where we sit down and just bang it out, dedicate a day where things might get broken but just work iteratively | 21:41 |
ianw | clarkb: umm, commands on zuul02? | 21:42 |
ianw | openstack/project-config is probably the other thing that stands out a bit too in terms of naming | 21:43 |
clarkb | ianw: re zuul02 if you check root's `history`, commands are prefixed with a timestamp there. But the autohold commands you have run do not have that. I think you must be su'ing somehow without getting a new login shell? But I thought the history bit was system wide so unsure how it gets omitted | 21:45 |
ianw | for some reason i seem to have been doing "sudo bash ./zuul-client autohold ..." where "zuul-client" is a small script in my homedir | 21:47 |
ianw | from jan 17. maybe we had quoting issues with that wrapper script or something at the time? | 21:47 |
clarkb | oh ya there were quoting issues once upon a time. I believe they are resolved now | 21:47 |
ianw | heh, that same day i proposed https://review.opendev.org/c/opendev/system-config/+/825019 | 21:47 |
clarkb | ok last call on meeting agenda content. I'll send that out at the top of the hour or so | 21:48 |
ianw | so that might explain it. | 21:48 |
ianw | oh i was going to add something about bridge. will do in a sec | 21:48 |
clarkb | k | 21:49 |
clarkb | ianw: let me know when it is posted and I'll get this sent out | 22:00 |
ianw | clarkb: thanks, done | 22:01 |
ianw | reloading the dib page shows the right release now, so hopefully pypi has fixed that | 22:03 |
clarkb | oh I got an email about it too. alright, working on sending that out | 22:03 |
fungi | ianw: i agree moving more out of openstack/project-config to opendev/project-config would be great as long as it's stuff that doesn't still need to be in the openstack zuul tenant in order for us to test it | 22:04 |
clarkb | ianw: seemed like it was a queue problem. Not sure if it is just that they got to dib in the queue or if the issue itself is corrected | 22:04 |
opendevreview | Amy Marrich proposed opendev/irc-meetings master: Updated meeting info for Diversity WG https://review.opendev.org/c/opendev/irc-meetings/+/847859 | 22:04 |
clarkb | and sent | 22:05 |
clarkb | I'm starting to look at the work in progress == merge conflict issue and https://review.opendev.org/q/is:wip does seem to show this equivalence is universal | 22:12 |
*** dasm is now known as dasm|off | 22:16 | |
opendevreview | Merged opendev/irc-meetings master: Updated meeting info for Diversity WG https://review.opendev.org/c/opendev/irc-meetings/+/847859 | 22:17 |
clarkb | after fiddling with it via the rest api I strongly suspect that the issue is mergeable:false and work_in_progress:true are both set for these changes | 22:23 |
clarkb | and I suspect that mergeable:false is picked up by the web ui as meaning merge conflict | 22:23 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Fix BLS entries for /boot partitions https://review.opendev.org/c/openstack/diskimage-builder/+/846838 | 22:27 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Parse block device lvm lvs size attributes https://review.opendev.org/c/openstack/diskimage-builder/+/839829 | 22:28 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: WIP Support LVM thin provisioning https://review.opendev.org/c/openstack/diskimage-builder/+/840144 | 22:28 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback https://review.opendev.org/c/openstack/diskimage-builder/+/847860 | 22:28 |
clarkb | yup I think that is exactly the issue | 22:30 |
clarkb | https://gerrit.googlesource.com/gerrit/+/refs/tags/v3.5.2/polygerrit-ui/app/utils/change-util.ts#160 | 22:31 |
opendevreview | Merged openstack/project-config master: grafana: import graphs and take screenshots https://review.opendev.org/c/openstack/project-config/+/847129 | 22:32 |
clarkb | Looking at the history of that ts file I don't think that logic has changed recently though which makes me suspect that it is the mergeable state that has changed. I wonder if we need to send a null value in the case of being mergeable but wip to distinguish from unmergeable wip | 22:33 |
ianw | clarkb: ahh, you were correct -> Unable to freeze job graph: Job project-config-grafana depends on opendev-buildset-registry which was not run. :/ | 22:37 |
TheJulia | clarkb: looks like the job failed so there should be a held node | 22:37 |
ianw | the project-config gate might be borked now, as i don't think we can commit around that | 22:37 |
clarkb | TheJulia: looking | 22:38 |
clarkb | ianw: we should be able to force merge a revert at least? | 22:38 |
clarkb | TheJulia: yup I'm on the host. have a pubkey you can share for me to add? | 22:39 |
ianw | maybe it should be a soft dependency in the o-z-j side? | 22:39 |
TheJulia | clarkb: one moment | 22:40 |
clarkb | ianw: that would also fix it I think | 22:40 |
ianw | clarkb: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/847862 | 22:42 |
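For reference, a sketch of what the soft-dependency option looks like in Zuul job syntax; it is abbreviated and the actual 847862 change may differ.

```yaml
- job:
    name: project-config-grafana
    dependencies:
      # A soft dependency only orders the two jobs when both are present in
      # the buildset; it does not fail the job graph when the registry job
      # is not run, which is the error that broke the project-config gate.
      - name: opendev-buildset-registry
        soft: true
```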
clarkb | TheJulia: root@158.69.68.154 | 22:42 |
clarkb | hrm the mergeop check for mergeability that checks work in progress seems to have not changed since 2018 either. Do we know if this behavior is new? | 22:46 |
corvus | clarkb: (just a reminder that google added the thing where they don't do merge checks by default any more; so it seems conceivable issues may go unnoticed for longer upstream) | 22:52 |
corvus | (could maybe somehow be related) | 22:52 |
clarkb | corvus: yup | 22:55 |
clarkb | I'm finding evidence that they do attempt to set mergeable to null in the wip case | 22:56 |
clarkb | so the mystery deepens | 22:56 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 23:02 |
ianw | TheJulia: ^ is this suggesting a glean issue? | 23:03 |
TheJulia | ianw: it is suggesting networkmanager may have a fairly nasty bug in it | 23:03 |
TheJulia | which causes stream-9 to break network connectivity shortly after booting | 23:04 |
ianw | :/ ... i'd probably prefer to debug glean, and that's saying something :) | 23:04 |
TheJulia | yeah :) | 23:04 |
TheJulia | unfortunately this is in the IPA ramdisks on boot | 23:04 |
opendevreview | Merged openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 23:05 |
TheJulia | so things get started... and a little while later all network connectivity seems to go *poof* | 23:05 |
opendevreview | Merged openstack/project-config master: Add Cinder Dell EMC PowerStore charm https://review.opendev.org/c/openstack/project-config/+/846888 | 23:05 |
ianw | urgh. our ipv6 RA kernel v NetworkManager race was somewhat similar ... but only for ipv6 | 23:07 |
ianw | i'm struggling to think of what else might jump in there | 23:08 |
clarkb | it looks like if you try to submit a change then gerrit will more explicitly check mergeability. Except you can't submit wip changes so I can't really test if that is a viable method to unsetting the stale data | 23:08 |
ianw | ohhh, that job has use-buildset-registry role in it | 23:18 |
ianw | which fails without a buildset registry. | 23:19 |
ianw | should that role skip if no registry, or should we only run the role if we see the buildset registry var, or is this an incorrect usage of the whole thing and we should unconditionally have a buildset registry? | 23:20 |
clarkb | ianw: I think it may be an unexpected use of the role. The idea is if you are using it then you want to do the depends on support that I described in my review. But I wasn't sure it would break so badly | 23:28 |
clarkb | I've filed https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 which outlines where I've debugged and a couple of ideas for a fix. I'll see if I can manage a change that we can test tomorrow using one of those approaches. | 23:28 |
ianw | yeah, i had in my mind the job would work either way, but i guess that's not really how to write it. otherwise we have to conditionally include the use-buildset-registry on the registry variables, which kind of breaks the abstraction | 23:30 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847866 | 23:32 |
ianw | clarkb: ^ | 23:32 |
clarkb | ianw: one minor clarification but ya I think that will solve it | 23:34 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847866 | 23:35 |
ianw | yep, check: :) | 23:35 |
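And a hedged sketch of the alternative 847866 takes (project stanza abbreviated, and only the check pipeline shown; gate would get the same treatment): run the registry job alongside the grafana job so the dependency is always satisfied.

```yaml
- project:
    check:
      jobs:
        # Running the registry job in the same pipeline means the grafana
        # job's dependency can always be resolved when the graph is frozen.
        - opendev-buildset-registry
        - project-config-grafana:
            dependencies:
              - opendev-buildset-registry
```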
TheJulia | ianw: well, it looks like they removed a configuration option at some point, which is not helping... and behavior changed | 23:35 |
clarkb | I'm going to go help with dinner now. | 23:35 |
TheJulia | *and* its behavior is not ideal if you have physical switches which can make "bad things happen" | 23:36 |
ianw | sigh. if i can help beyond "random probably unhelpful suggestions" lmn :) | 23:39 |
TheJulia | thanks :) | 23:42 |
TheJulia | I suspect I'm going to have to bang my head against it a little since there is inherently a race with the dhcp timeout occurring... and now is not a good time to try :\ | 23:43 |
TheJulia | ugh, looks like I'm going to need a noisy neighbor to fully simulate this :\ | 23:54 |