opendevreview | Merged openstack/diskimage-builder master: Revert "Temporarily stop running OpenSUSE functtests" https://review.opendev.org/c/openstack/diskimage-builder/+/840329 | 00:14 |
---|---|---|
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 01:55 |
*** ysandeep|out is now known as ysandeep | 02:04 | |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 02:21 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 02:38 |
BlaisePabon[m]1 | No worries. | 03:08 |
BlaisePabon[m]1 | I hang out with the openfaas folks and those guys have no resources, so I'm always happy to help where I can. | 03:08 |
BlaisePabon[m]1 | It looks like there's more corporate largesse to go around in this group. | 03:08 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 03:18 |
ianw | https://pypi.org/project/diskimage-builder/ is showing 3.21.1 as the latest, but https://pypi.org/project/diskimage-builder/3.22.0/ is there (and the job worked: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_85f/787dcaac0a9a446f2a5b3b442601baaf634d11a0/release/release-openstack-python/85fcaf3/job-output.txt) | 03:28 |
ianw | someone else has mentioned a similar issue on libera #pypa | 03:32 |
ianw | looks like it comes back to https://github.com/pypa/warehouse/issues/10390 | 03:36 |
ianw | i seem to remember the -X purge thing working, but it doesn't seem to now | 03:37 |
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] devel job git perms https://review.opendev.org/c/opendev/system-config/+/847700 | 03:40 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions https://review.opendev.org/c/opendev/infra-manual/+/845593 | 03:45 |
BlaisePabon[m]1 | Sorry for the delay. I'm getting the hang of this, and rST is more familiar than Yaml. | 03:48 |
BlaisePabon[m]1 | This is a good doc, thank you Clark . | 03:55 |
BlaisePabon[m]1 | Would you like me to prepare a change with `s/OpenStack/OpenDev/g`? | 03:55 |
opendevreview | Ian Wienand proposed opendev/system-config master: Allow git to operate in checked-out source directories https://review.opendev.org/c/opendev/system-config/+/847700 | 04:16 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Convert OpenStack terms over to OpenDev https://review.opendev.org/c/opendev/infra-manual/+/847710 | 04:21 |
*** ysandeep is now known as ysandeep|afk | 05:15 | |
*** ysandeep|afk is now known as ysandeep | 05:56 | |
frickler | frickler@bridge:~$ cat /var/log/ansible/zuul_reboot.log.1 | 06:28 |
frickler | flock: cannot open lock file /var/run/ansible/zuul_reboot.lock: No such file or directory | 06:28 |
*** frenzyfriday|PTO is now known as frenzy_friday | 07:13 | |
*** jpena|off is now known as jpena | 07:43 | |
opendevreview | likui proposed openstack/diskimage-builder master: Drop python3.6/3.7 support in testing runtime https://review.opendev.org/c/openstack/diskimage-builder/+/847718 | 08:15 |
opendevreview | likui proposed openstack/diskimage-builder master: Stop testing lower-constraints https://review.opendev.org/c/openstack/diskimage-builder/+/847719 | 08:17 |
opendevreview | Merged zuul/zuul-jobs master: [tox] Update readme regarding default values https://review.opendev.org/c/zuul/zuul-jobs/+/847656 | 08:59 |
*** rlandy|out is now known as rlandy | 09:29 | |
rlandy | fungi: hi - when you come online - we're still seeing a lot of post_failures and not just in tripleo jobs. This is really impacting our ability to merge patches | 09:46 |
rlandy | I saw your discussion with akahat|ruck - but it's not just tripleo jobs | 09:47 |
rlandy | we do have a deep log collection structure - but it would seem there is some new issue? | 09:47 |
*** ysandeep is now known as ysandeep|afk | 09:50 | |
*** ysandeep|afk is now known as ysandeep | 10:38 | |
jrosser_ | rlandy: we see the same thing on openstack-ansible jobs | 10:50 |
rlandy | jrosser_: ack - killing our gate | 10:50 |
jrosser_ | for now we have disabled generating the HTML ARA report which has helped a lot, but still there is a low level of POST_FAILURE which means we also can't really merge anything | 10:50 |
jrosser_ | there seem to be periods (days to weeks) when this doesn't happen, and then similar periods when it's pretty regular | 10:51 |
rlandy | jrosser_: we suspected it was one provider at first - but now it seems to be on multiple | 11:00 |
jrosser_ | do you have a way to tell where the logs are going? | 11:01 |
rlandy | jrosser_: no - no way when logs don't have a link | 11:01 |
rlandy | arxcruz: ^^ anything we can remove from our log collection to help here? | 11:02 |
rlandy | we are going to have to force merge some patches if we can't get them through gate | 11:02 |
fungi | rlandy: jrosser_: the vast majority do still seem to be for tripleo and openstack-ansible changes looking at https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&voting=1&skip=0 | 11:35 |
rlandy | looking at what we can cut from collect logs - but this is a recent issue | 11:36 |
fungi | things i suggested might help would be if someone wants to work on splitting log uploads so that we push up the console log/json and zuul-info files with a separate task first so that we have *some* logs and can see what swift providers the builds are destined to, and then adjust the log upload role to provide a count and size of what's going to be uploaded before proceeding with the upload | 11:36 |
fungi | that will help us ascertain whether 1. there's any correlation to a specific swift provider and 2. whether there's something different about the size or quantity of logs in the builds which timeout their uploads | 11:37 |
jrosser_ | i made a patch to openstack-ansible to check there was nothing silly happening in the log structure https://zuul.opendev.org/t/openstack/build/c449b1162be6482c858eab60aff183c8/log/job-output.txt#19443 | 11:38 |
jrosser_ | though getting size/file count at the same time with `tree` didn't look possible | 11:38 |
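For reference, a minimal sketch of the pre-upload size/count report fungi suggests above. Assumptions: it runs in a post playbook on the executor, where `zuul.executor.log_root` is defined; the task names are illustrative and this is not the actual role change.

```yaml
- hosts: localhost
  tasks:
    # Count files and total size of the collected logs before the upload
    # task runs, so slow or timed-out uploads can be correlated with volume.
    - name: Gather log file count and total size
      shell: |
        echo "files: $(find {{ zuul.executor.log_root }} -type f | wc -l)"
        echo "size: $(du -sh {{ zuul.executor.log_root }} | cut -f1)"
      register: log_stats

    - name: Report log stats in the console log
      debug:
        var: log_stats.stdout_lines
```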
rlandy | chandankumar added https://review.opendev.org/c/openstack/tripleo-ci/+/847771 to try to reduce log capture size | 11:47 |
rlandy | may help us get in under the radar | 11:47 |
jrosser_ | fungi: do you know where the code for this module is? https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-swift/tasks/main.yaml#L21 | 11:52 |
fungi | jrosser_: it's in zuul/zuul-jobs: https://zuul-ci.org/docs/zuul-jobs/log-roles.html#role-upload-logs-swift | 11:58 |
fungi | oh, the module, just a sec | 11:58 |
fungi | jrosser_: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base | 11:58 |
fungi | that has the python modules | 11:59 |
*** dviroel|out is now known as dviroel | 12:01 | |
fungi | for splitting the uploads, i expect we'd add a new task in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml before the existing upload-logs-swift inclusion and specify the important subset we want uploaded first, then either delete those from the local tree or find a way to exclude them from the general upload | 12:01 |
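A hedged sketch of the split fungi describes, not the merged implementation. It assumes 847775 exposes a role variable restricting which paths get uploaded (the name `upload_logs_swift_files` is hypothetical), and it omits the credentials and other parameters the real role needs.

```yaml
- hosts: localhost
  roles:
    # First pass: upload only the small essential files so a log link, the
    # console log and zuul-info exist even if the bulk upload times out.
    - role: upload-logs-swift
      vars:
        # Hypothetical variable name; the real override from 847775 may differ.
        upload_logs_swift_files:
          - "{{ zuul.executor.log_root }}/job-output.txt"
          - "{{ zuul.executor.log_root }}/job-output.json"
          - "{{ zuul.executor.log_root }}/zuul-info"
    # Second pass: the existing role inclusion uploads everything else (or
    # the whole tree, accepting some duplication).
    - role: upload-logs-swift
```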
jrosser_ | you don't mind that the important things would be duplicated? | 12:02 |
jrosser_ | oh well right | 12:02 |
jrosser_ | that's why i was attempting to look at the code | 12:02 |
jrosser_ | and i'm immediately completely out of my depth unfortunately | 12:03 |
fungi | much of the logic in the lib is delegated to openstacksdk, from what i saw looking over the weekend | 12:03 |
opendevreview | Jonathan Rosser proposed zuul/zuul-jobs master: Allow upload-logs-swift files list to be overriden https://review.opendev.org/c/zuul/zuul-jobs/+/847775 | 12:20 |
fungi | ianw: seems the pypi cache issue may be more widespread, i saw people referring to something similar on the python discourse today: https://discuss.python.org/t/python-package-not-updating/16856 | 12:25 |
*** arxcruz is now known as arxcruz|brb | 12:42 | |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:50 |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:52 |
opendevreview | Jonathan Rosser proposed opendev/base-jobs master: Separate swift provider selection from the swift log upload task https://review.opendev.org/c/opendev/base-jobs/+/847780 | 12:55 |
fungi | ianw: also https://discuss.python.org/t/cannot-make-the-new-release-to-be-the-latest-version-for-installation/16859 | 13:02 |
*** arxcruz|brb is now known as arxcruz | 13:48 | |
Clark[m] | fungi: jrosser_: it does look like the roles assume zuul.executor.log_root is the sole source of logs and we don't make that configurable via the role vars. I think that means the first step would be making that a parameter. | 13:50 |
jrosser_ | Clark[m]: did you see my patch? | 13:51 |
jrosser_ | i took a first stab at it here https://review.opendev.org/c/zuul/zuul-jobs/+/847775 | 13:51 |
Clark[m] | Yes, that is a different problem | 13:52 |
jrosser_ | oooh | 13:52 |
Clark[m] | Oh wait there are two patches? | 13:52 |
Clark[m] | I saw the random choice move | 13:52 |
jrosser_ | right | 13:52 |
jrosser_ | that was a bit of a precursor to making two uploads | 13:53 |
jrosser_ | as it would need to send them to the same place | 13:53 |
Clark[m] | Right. I missed the 775 change. Ya I think that is basically what is needed. Then in the base job we switch to uploading two different sources of content | 13:55 |
jrosser_ | yeah, i wasn't really sure what the best approach was to do two uploads | 13:55 |
Clark[m] | The other issue is whether we'll partition deterministically, but I'm pretty sure that is based on build uuid so we should | 13:55 |
jrosser_ | as the files: parameter just grabs everything beneath it as far as i can see | 13:56 |
Clark[m] | Oh except we have to generate index files too. That might be the biggest issue | 13:58 |
Clark[m] | I think what will happen is the second upload will overwrite the previous index effectively hiding the first set of files (they will be there just not in the index.html) | 13:58 |
Clark[m] | And maybe similarly for the zuul manifest which the zuul log browser uses | 13:59 |
Clark[m] | Ya this isn't super straightforward now that I think about the details | 13:59 |
Clark[m] | We would need the role to treat it as a single tree with multiple passes, rather than two separate passes | 14:00 |
jrosser_ | is it only the top level index that is difficult? | 14:01 |
jrosser_ | upload 1 -> zuul-info, upload 2 -> everything else | 14:01 |
jrosser_ | so the only inconsistency would be right at the top there | 14:01 |
Clark[m] | Yes, it depends on the structure of the data but currently it would just be for the top level (both in index.html and the manifest) | 14:03 |
Clark[m] | Thinking out loud here: I wonder if the data is actually present behind those failures, but since the manifest is likely the last thing we upload zuul is ignorant of what is there | 14:05 |
Clark[m] | fungi: it may be worth trying to browse directly to the paths and see what if anything is there. We may not need to change anything to have access to sufficient info if we're failing after uploading much content but before the manifest | 14:06 |
fungi | Clark[m]: well, longer term i think it's still very useful to have the console log/json and zuul info available from the build results page even if something goes wrong uploading everything else the job has said to archive | 14:07 |
fungi | but yes i expect the files are there in swift, at least the ones which got uploaded | 14:08 |
jrosser_ | there is a more brute force approach, which is to upload essential stuff, then upload everything and accept some duplication | 14:08 |
jrosser_ | then at least there is something rather than nothing if a large upload fails, and avoids the index issue | 14:08 |
Clark[m] | fungi: yes, but it is very complicated and unlikely to be refactored quickly | 14:08 |
Clark[m] | jrosser_: considering we suspect too many files to be the source or at least part of the issue I'd be wary of that approach. We'll just make it worse if this is the problem | 14:09 |
fungi | Clark[m]: i mean in order to make it not necessary to get a zuul admin to dig the upload url out of the executor logs, since the current build dashboard doesn't supply a log link at all | 14:09 |
fungi | making these sorts of failures at least a little more possible for users to debug would save our limited sysadmins some time | 14:10 |
jrosser_ | thats really it - i'm completely blind to whats happening here | 14:10 |
fungi | i can do it for a few representative builds, but doing it at a larger scale is much more time-consuming | 14:11 |
Clark[m] | fungi: it's there somewhere because the js has to load the manifest | 14:11 |
Clark[m] | I don't think admin is required though it may require opening the js debug console | 14:12 |
fungi | yeah, but the dashboard gives you "no log available" | 14:12 |
clarkb | hrm it looks like zuul-web proxies the manifest request then after it has the manifest the data is fetched directly from swift for everything else? | 14:15 |
jrosser_ | if i look in the zuul API for one of these failed jobs i get "log_url": null | 14:16 |
fungi | yeah, i'm trying to fish the log url for one of these out of the executor log | 14:16 |
clarkb | ya I think we may not record the log_url because that happens after the upload, since the upload provides the location back | 14:17 |
clarkb | the data very likely is there, we just aren't exposing it because the upload process was cut short :/ | 14:17 |
rlandy | fungi: clarkb: we are going to need to force promote two patches as we can't get them through the gate due to post_failures ... I checked with slagle (tripleo PTL) and he will approve. Can you force merge: | 14:17 |
rlandy | https://review.opendev.org/c/openstack/tripleo-common/+/847437/ | 14:17 |
rlandy | https://review.opendev.org/c/openstack/tripleo-upgrade/+/847680/ | 14:17 |
rlandy | both backports | 14:18 |
clarkb | rlandy: we usually only force merge fixes for the gate issues. | 14:18 |
fungi | and generally only when they're unable to merge because of the problem they're fixing | 14:20 |
rlandy | https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-scenario001-standalone&branch=stable%2Fwallaby&skip=0 | 14:20 |
rlandy | only pass is on https://review.opendev.org/c/openstack/tripleo-common/+/847437/ | 14:20 |
rlandy | if we can fix the post_failure | 14:21 |
rlandy | then we "should" be able to get these through | 14:21 |
fungi | you can fix the post failure by not running the jobs which are unstable because of post_failure results | 14:21 |
jrosser_ | imho it is not quite like that for my failing jobs | 14:22 |
jrosser_ | it seems random | 14:23 |
rlandy | all multinode - checking if we can merge https://review.opendev.org/c/openstack/tripleo-ci/+/847771/1/toci-quickstart/config/collect-logs.yml | 14:23 |
fungi | rlandy: work on making those easier to diagnose is most welcome, and we've been discussing options here, but please don't mistake your desire to ignore the current spate of ci problems you're experiencing as being any sort of priority for me. if you want to ignore some jobs in order to merge changes you can already do that by disabling those jobs | 14:24 |
clarkb | jrosser_: I doubt it is random. It is more likely some combo of test node cloud and/or swift provider (the way log collection works is everything is copied from test nodes to executors, then from executors to swift in one of 5 locations. if either or both of those is slow then you can run up against the timeout) | 14:24 |
jrosser_ | clarkb: oh absolutely - i should have qualified random better | 14:24 |
clarkb | jrosser_: one approach may be to increase the timeout and see if we can collect the logs that way and identify the source of the slowness | 14:24 |
jrosser_ | what i really meant was it wouldn't be possible for me to not run a failing job because i can't tell which one will fail | 14:25 |
jrosser_ | clarkb: is that something i can do? | 14:25 |
clarkb | jrosser_: yes, I think so. Let me dig up docs | 14:26 |
clarkb | jrosser_: https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.post-timeout there may be an upper bound on that too looking further | 14:26 |
clarkb | our default post-timeout is 1800 seconds (half an hour) | 14:27 |
clarkb | I'm not finding where we might set a max timeout. Maybe bump that to an hour and a half in your jobs and collect what we can and take it from there? | 14:28 |
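For example, a sketch of the temporary bump being suggested. The job name is one of the affected openstack-ansible jobs from this log and stands in for whichever variants actually get the change.

```yaml
- job:
    name: openstack-ansible-deploy-aio_lxc-centos-8-stream
    # The base job default is 1800s; raise it so slow log uploads complete
    # and their real duration can be measured. The tenant max-job-timeout
    # (10800s by default) is the effective ceiling.
    post-timeout: 5400
```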
jrosser_ | sure i can try that, thanks for the pointer | 14:29 |
clarkb | oh we must rely on the default zuul max-job-timeout of 10800? | 14:29 |
clarkb | which is 3 hours which seems to line up | 14:29 |
clarkb | and I think it applies to both the regular job timeout and the post timeout as two separate blocks. So max post timeout is 3 hours as well? | 14:30 |
fungi | revisiting https://zuul.opendev.org/t/openstack/builds?pipeline=gate&result=POST_FAILURE&skip=0 all the gate post_failures so far today (since utc midnight) have been tripleo multinode jobs, though there was a tripleo "standalone" job (i guess that means single-node?) which caught one just before utc midnight | 14:30 |
rlandy | correct - most are multinode | 14:31 |
clarkb | jrosser_: that should theoretically give us better info on what actual max runtime for the uploads is, where the time cost is (pulling logs to executor or pushing to swift or both), and how close we are to the timeout. It may just be that you've been riding the edge for a while and a few new log files are tipping you over or something (though removal of ara not completely fixing this | 14:31 |
clarkb | indicates this is unlikely) | 14:31 |
jrosser_ | tbh i was surprised it failed at all after removing ARA | 14:32 |
clarkb | rlandy: are there changes to stop uploading /etc content that never changes, and stop uploading duplicated content, and changes to reduce the nesting? Those sorts of things I think we could force merge given our current understanding of the issue | 14:32 |
jrosser_ | i will have to check again but i think that was the largest count of files by a long way | 14:33 |
fungi | that may indicate that ara files weren't a significant contributing factor to the upload time | 14:33 |
rlandy | clarkb: there is one patch in flight - on call now will discuss with team more options and dirs we can remove | 14:34 |
clarkb | frickler: I was going to look into that. Will flock -n not create a file if it doesn't exist? Anyway thanks for checking I'll try to followup a bit later after breakfast | 14:34 |
fungi | i'm trying to see how much of the logs from https://zuul.opendev.org/t/openstack/build/bae8ea4276ad495792e320631486cbf9 got uploaded. in the executor debug log i can see that zuul_log_path is bae/847362/2/gate/openstack-ansible-deploy-aio_lxc-centos-8-stream/bae8ea4 so i can probably try to fetch the console log from random providers and brute-force discovery of the actual url | 14:36 |
clarkb | fungi: that won't work with rax because the urls are not consistent | 14:36 |
clarkb | fungi: but I thought it logged where it was uploading earlier in the logs? If not that may be something we can land to zuul-jobs quickly (explicitly log the full log path prefix at the beginning of uploading) | 14:37 |
clarkb | anyway I need to boot my morning. Back in a bit | 14:37 |
fungi | i think it may only log that if the task is successful? but anyway not having luck finding information on the actual log url in the executor debug log | 14:38 |
fungi | yeah, "TASK [upload-logs-swift : Upload logs to swift]" includes a "url" parameter in the ansible output normally, but not if the task times out | 14:40 |
fungi | so the executor doesn't actually even log the url in these cases | 14:41 |
jrosser_ | fungi: i never see that even if that task doesnt time out | 14:43 |
jrosser_ | isn't it chicken/egg with what is in the log file at the point it is uploaded? | 14:44 |
jrosser_ | oh - you mean the log on the executor i guess, not the uploaded one | 14:46 |
fungi | yes | 14:48 |
fungi | i'm digging in the executor's debug log | 14:48 |
fungi | but have basically confirmed that even from the current debug logs we can't reliably work out where the build logs ended up | 14:49 |
fungi | so yes, increasing the upload timeout temporarily and then looking for successful jobs which exceed the default post-logs playbook timeout seems like the next logical step | 14:51 |
jrosser_ | i've made a set of patches which should hopefully catch some of those | 14:51 |
jrosser_ | unfortunately this does seem to come and go somewhat | 14:52 |
*** pojadhav is now known as pojadhav|afk | 15:04 | |
clarkb | fungi: I think we can add a logging.info() for this value https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L144 | 15:18 |
clarkb | fungi: separately we do log the file list https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L283-L285 but only at a debug level so I think we need some amount of ansible -v's to emit that | 15:19 |
clarkb | (also I'm not sure if ansible logs INFO level by default; it may be that logging.info() will need more verbosity too) | 15:19 |
fungi | clarkb: but also we no_log it in order to avoid leaking credentials: https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/post-logs.yaml#L12 | 15:24 |
clarkb | oh ha ok | 15:25 |
clarkb | In that case it really does seem like we need to work from successful runs with longer timeouts and go backward from there to try and identify problems :/ | 15:25 |
*** dviroel is now known as dviroel|lunch | 15:27 | |
fungi | https://status.python.org/ "14:20 UTC: A delay in our job queue is affecting CDN purges causing public pages to be stale and delays in email sending for notifications, email validation, password resets, etc. 15:05 UTC: We've identified specific jobs in our queue that are causing processing delays and are working to remove them." | 15:36 |
fungi | looks like they finally noticed | 15:36 |
*** marios is now known as marios|out | 15:39 | |
clarkb | talking out loud again. Another option we might consider is temporarily increasing the base jobs post-timeout | 15:50 |
clarkb | to cast a wider net and hopefully make it easier to debug. The problem with that is the revert back to half an hour is likely to be painful | 15:51 |
fungi | i think we'd need a good means of identifying which builds exceeded the old timeout under the new temporary one | 16:00 |
clarkb | frickler: ok the issue is there is no /var/run/ansible so the file can't be created in the non-existent dir. It appears flock will create the file if the dir exists. I think I may just stick it under /var/run then | 16:02 |
clarkb | change incoming | 16:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix zuul reboot playbook flock location https://review.opendev.org/c/opendev/system-config/+/847808 | 16:05 |
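A rough sketch of the kind of fix 847808 presumably makes; the playbook path, schedule and task name here are assumptions, only the lock and log locations come from the discussion above.

```yaml
- name: Schedule the Zuul reboot playbook under a lock
  cron:
    name: zuul_reboot
    user: root
    weekday: "6"
    hour: "0"
    minute: "0"
    # flock -n creates the lock file as long as its directory exists, which
    # is why the missing /var/run/ansible directory broke the old location.
    job: >-
      flock -n /var/run/zuul_reboot.lock
      ansible-playbook /path/to/playbooks/zuul_reboot.yaml
      >> /var/log/ansible/zuul_reboot.log 2>&1
```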
*** ysandeep is now known as ysandeep|PTO | 16:09 | |
clarkb | one crazy idea I had that won't work is to write the log url to a file then have a followup task read and echo it. That followup task wouldn't need to be no_log. But that won't work because we are timing out and exiting in the middle of the upload task | 16:12 |
jrosser_ | i was also thinking about that | 16:13 |
jrosser_ | looking at the uploader code it might be possible to call the role with an empty filelist | 16:13 |
jrosser_ | and the url should be returned - though i was pretty unclear if the url would be constant | 16:14 |
clarkb | I think the url is going to be constant for a specific build uuid. The build uuid determines the sharded container location and the rax cdn different urls are consistent per container | 16:14 |
clarkb | basically as long as that uuid is fixed we should end up with the same url for logs across multiple uploads | 16:15 |
BlaisePabon[m] | Oh! I wonder if that is why I didn't get an email when I added another address to my Pypi profile....? | 16:15 |
BlaisePabon[m] | (rhetorical, no reply required) | 16:15 |
fungi | BlaisePabon[m]: it does say "pypi.org - Email: Degraded Performance" | 16:16 |
fungi | so maybe so | 16:16 |
clarkb | but ya I'm still trying to brainstorm better debugging methods here. It is a "fun" challenge. Back when we had the jenkins scp log uploader we had similar chicken and egg problems. Eventually we forked a separate java thread to upload the console log so that it could run beyond the job thread lifetime | 16:16 |
fungi | BlaisePabon[m]: and the incident name is "Queue delay affecting CDN purges and emails" | 16:17 |
fungi | so i guess the two are related | 16:17 |
clarkb | jrosser_: I'm shifting tactics to look at successful runs to see if I can infer anything from them. I still see ara reports on them for openstack-ansible-deploy-infra_lxc-ubuntu-focal on stable/wallaby fwiw | 16:22 |
clarkb | (stable/wallaby of that job was the most recent POST_FAILURE) | 16:22 |
jrosser_ | yes, we have only managed to merge the change to disable ARA on master | 16:23 |
clarkb | got it | 16:23 |
jrosser_ | ironically the cherry-pick for that to yoga failed..... due to POST_FAILURE | 16:23 |
clarkb | There is also some larger binary journal content. You might consider exporting that like devstack does. That allows you to export the range you want (and potentially other bits you care about) to a more text-friendly format that compresses with e.g. xz very well | 16:26 |
clarkb | The apt trusted keys are also logged many times from many locations (host and all the containers) | 16:27 |
jrosser_ | yep - we place a lot of those ourselves | 16:27 |
jrosser_ | dropping the binary log journals is a good idea | 16:28 |
jrosser_ | i've never really had a lot of success grabbing them from one system and parsing them on another | 16:28 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 16:29 |
clarkb | jrosser_: ya journalctl has a specific export command that serializes the info and allows you to import and view it elsewhere. Copying the binary data directly isn't meant to work that way aiui | 16:29 |
clarkb | let me see if I can find a link to how devstack does/did it | 16:29 |
jrosser_ | we have this to post-process them https://github.com/openstack/openstack-ansible/blob/master/scripts/journal_dump.py | 16:30 |
clarkb | https://opendev.org/openstack/devstack/src/branch/master/roles/export-devstack-journal/tasks/main.yaml#L34-L48 Looks like they switched to using gz :/ | 16:31 |
clarkb | we did xz back when because it was like 10x better and 99% of the time people don't need the raw journals oh well | 16:31 |
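A hedged sketch of that export approach, modelled loosely on devstack's export-devstack-journal role rather than copied from it; `log_dir` is an assumed variable and the task would normally be scoped to the units of interest.

```yaml
- name: Export the systemd journal for the current boot and compress it
  # `journalctl -o export` serializes the journal so it can be imported and
  # viewed on another system; xz compresses the export format very well,
  # unlike copying the raw binary journal files.
  shell: |
    journalctl -b -o export | xz -c > {{ log_dir }}/journal.export.xz
  become: true
```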
jrosser_ | it's ok, we already have a list of "interesting services" and dump out what you would have had with journalctl | 16:31 |
jrosser_ | the binary ones are pretty useless now | 16:31 |
clarkb | anyway other than the journals and ara content in the wallaby job I'm not seeing much that stands out to me. There is some nesting but it isn't too bad. And some duplication (of the apt trusted keys). But on the whole far less content than tripleo. Though the timeout is shorter | 16:32 |
clarkb | The other thing that just occurred to me is that the post-timeout is for all the post-run playbooks. It could be that the uploads are fine and the timeouts are just fallout | 16:33 |
clarkb | ara itself takes almost 3 minutes to generate the report in the successful job | 16:34 |
clarkb | about 1/10th the timeout value | 16:34 |
clarkb | Then journal collection from the containers takes about 20 seconds, and journal collection and eth interface info collection another 20 seconds. Nothing too crazy yet | 16:35 |
clarkb | Copying logs to the executor another 15 seconds | 16:36 |
opendevreview | Blaise R Pabon proposed opendev/infra-manual master: Update the sandbox instructions https://review.opendev.org/c/opendev/infra-manual/+/845593 | 16:36 |
clarkb | the upload of logs to swift took about 9 minutes | 16:37 |
*** dviroel|lunch is now known as dviroel | 16:37 | |
clarkb | In total about 13 minutes of post-run time. About 43% of the timeout value. | 16:38 |
clarkb | Anyway that does point to the upload to swift being the likely issue just based on successful time spent | 16:38 |
clarkb | And that upload occurs from the executor to swift. Which means the source side of that is pretty consistent. One of 12 instances in a single cloud | 16:39 |
clarkb | Now the successful case is the one I would have expected to have the most issues as it is across the atlantic | 16:39 |
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_554/periodic/opendev.org/openstack/openstack-ansible-repo_server/stable/wallaby/openstack-ansible-deploy-infra_lxc-ubuntu-focal/554c3a2/ is what I'm looking at fwiw | 16:39 |
clarkb | That doesn't rule out some other internet/routing/ipv6/etc problem though | 16:40 |
clarkb | I also see uploads to both ovh swifts from tripleo jobs that are successful as well as uploads that are hosted by the rax cdn. More difficult to determine which backend those rax cdn uploads targeted though. But that seems to rule out a consistently bad location | 16:42 |
fungi | i suppose something like bandwidth contention or significant packet loss on an executor resulting in much longer upload times | 16:47 |
fungi | would be another possibility | 16:48 |
clarkb | yup it could be network contention between jobs too | 16:48 |
clarkb | I got excited when I saw that the system-config-run job that runs against latest ansible had a POST_FAILURE and thought maybe it was related to latest ansible. but no this error is a regular error in post-run not a timeout | 16:59 |
clarkb | it has logs and all that | 16:59 |
*** Guest3279 is now known as rcastillo | 17:20 | |
jrosser_ | clarkb: fungi i have a few jobs taking suspiciously long to upload logs right now, it may be nothing unusual but i can give you the details if it's helpful | 17:23 |
clarkb | jrosser_: ya can you link to the live console? | 17:23 |
clarkb | I'd be happy to take a look at that | 17:23 |
jrosser_ | https://zuul.openstack.org/stream/d29af85f2a144f68b2eeb849ccb8e8ad?logfile=console.log | 17:24 |
clarkb | Looks like that one just succeeded | 17:25 |
jrosser_ | yeah i figured that would happen | 17:25 |
clarkb | which is good because it means we can go look at the executor logs and gather more info about it | 17:25 |
jrosser_ | that was about ~20 mins | 17:26 |
clarkb | First thing I notice is the pre-upload tasks seem to take about 3 minutes there which is consistent with the other successful uploads. Either the size of the data doubled causing it to take twice as long as other uploads or there is some other issue | 17:26 |
jrosser_ | this one is similar https://zuul.openstack.org/stream/62be39711bb6402d90d27f99fd4a96c3?logfile=console.log | 17:26 |
jrosser_ | about 20 mins in as well | 17:26 |
clarkb | https://e08e29492ebe0e6f5609-f5e06fd9232d0500a75b50f2bf1ad7e1.ssl.cf5.rackcdn.com/847797/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/d29af85/ is the first one. It uploaded to one of the rax backends (have to check the executor log to get more specific than that I suspect) | 17:28 |
clarkb | It ran on ze09. | 17:29 |
jrosser_ | i think that they might have both completed now | 17:29 |
clarkb | https://a482a10c2a931478cad4-f031700b2c1819893a02a03206c89279.ssl.cf5.rackcdn.com/847668/1/check/openstack-ansible-deploy-aio_lxc-centos-8-stream/62be397/ is the second it ran on ze12 | 17:29 |
clarkb | yup they both ran successfully but took a bit more than 20 minutes to upload | 17:29 |
jrosser_ | but that seems to be the pattern we get, that there's ~15 or so long-running jobs on that change and just one or maybe two of those would POST_FAILURE | 17:30 |
jrosser_ | it certainly wasn't all of them at all | 17:30 |
opendevreview | Merged opendev/system-config master: Fix zuul reboot playbook flock location https://review.opendev.org/c/opendev/system-config/+/847808 | 17:30 |
clarkb | Both jobs ran in iweb. I suppose it is theoretically possible that the test node cloud provider is impacting logging in some way that causes this (producing much larger system logs due to something in the provider?) | 17:30 |
clarkb | just checking the journal logs they do seem larger in that one. But I would need to add them all up to be sure of that | 17:31 |
clarkb | each of them has a 96MB journal file (in addition to all the other journal files) | 17:32 |
fungi | yeah, doing a side-by-side comparison of a complete download of the saved logs from one that took longer and one that didn't for the same job would be helpful | 17:32 |
clarkb | https://zuul.opendev.org/t/openstack/build/fdd24d45c34a43fc8171baf79e5e5323/logs was a short one | 17:34 |
clarkb | https://zuul.opendev.org/t/openstack/build/ce8c876dca6b49e0adc05eb098fb0b61/log/job-output.txt was longer but still short (~4 minutes, first short example was about 2 minutes) | 17:35 |
jrosser_ | oh wow the log upload time was only 2 mins? | 17:36 |
clarkb | one of the long ones has an ara report. The other long one doesn't have an ara report | 17:36 |
clarkb | jrosser_: double check me on that but I'm looking at the timestamp for start of upload and comparing it against the job completion time in the zuul build dashboard | 17:36 |
jrosser_ | yeah, thats the only way i've found to get a handle on the upload time | 17:36 |
*** pojadhav|afk is now known as pojadhav | 17:36 | |
clarkb | at least one quick one had a larger journal file too | 17:37 |
jrosser_ | this is slightly unscientific as the patches we're looking at are trying to reduce the log upload burden | 17:38 |
jrosser_ | so comparisons are tricky | 17:38 |
clarkb | ah | 17:38 |
jrosser_ | it would be reasonable to look at different jobs in the same change though | 17:38 |
clarkb | yup. And probably the post-timeout increase jobs are the best to look at. Recheck them a few times then compare runtimes | 17:39 |
clarkb | since in theory we're least likely to actually hit a post_failure and not be able to compare there | 17:40 |
clarkb | one other thing to note. A short run ran on ze09 as well | 17:41 |
clarkb | but I agree hard to say how useful that info is until we're comparing across the same changes | 17:41 |
fungi | yeah, things to possibly correlate: consistency for builds run from the same executor, for builds run on nodes from the same provider, for builds uploading logs to the same swift endpoint | 17:43 |
*** rlandy is now known as rlandy|afk | 18:40 | |
*** jpena is now known as jpena|off | 19:06 | |
*** rlandy|afk is now known as rlandy | 19:23 | |
TheJulia | anyone around to put a VM hold in real quick? | 19:40 |
clarkb | TheJulia: I can do it, just need job name and change ideally | 19:41 |
TheJulia | 847601,3 in flight, ironic-tempest-wholedisk-bios-snmp-pxe | 19:41 |
TheJulia | should fail shortly.... | 19:41 |
clarkb | the autohold command has been run | 19:43 |
* TheJulia crosses fingers | 19:43 | |
clarkb | ianw: ^ in doing that I notice your commands in history there aren't prefixed with the timestamp. I wonder why | 19:43 |
TheJulia | oh, it is *soooo* close | 19:43 |
TheJulia | it is getting logs now | 19:43 |
TheJulia | clarkb: looks like the job just exited | 19:46 |
clarkb | but it was successful? the hold should be good for your recheck | 19:47 |
TheJulia | failed | 19:47 |
clarkb | oh /me looks again | 19:47 |
TheJulia | which is what I'm hoping to capture the state of | 19:47 |
clarkb | TheJulia: https://zuul.opendev.org/t/openstack/build/67ac58dd2a2443eaa76061b884e61faf it passed | 19:48 |
TheJulia | gaaaah | 19:48 |
TheJulia | :( | 19:48 |
* TheJulia goes off in the corner and cries :) | 19:48 | |
clarkb | the hold is still good though and should catch it if/when it does fail | 19:49 |
TheJulia | clarkb: off to issue a recheck then, thanks | 19:49 |
clarkb | I'm getting our meeting agenda together for tomorrow. Anything missing? | 20:09 |
clarkb | ianw: I +2'd the grafana testing changes but didn't approve them as I wasn't sure if fungi wanted to look at your response prior to approval too | 21:06 |
fungi | thanks, i did see the responses. i was +2 on it anyway | 21:24 |
fungi | just wishing i had the bandwidth to move more of our opendev jobs out of ozj | 21:25 |
*** dviroel is now known as dviroel|out | 21:40 | |
ianw | yeah, it feels like maybe a sprint thing where we sit down and just bang it out, dedicate a day where things might get broken but just work iteratively | 21:41 |
ianw | clarkb: umm, commands on zuul02? | 21:42 |
ianw | openstack/project-config is probably the other thing that stands out a bit too in terms of naming | 21:43 |
clarkb | ianw: re zuul02 if you check root's `history`, commands are prefixed with a timestamp there. But the autohold commands you have run do not have that. I think you must be su'ing somehow without getting a new login shell? But I thought the history bit was system wide so unsure how it gets omitted | 21:45 |
ianw | for some reason i seem to have been doing "sudo bash ./zuul-client autohold ..." where "zuul-client" is a small script in my homedir | 21:47 |
ianw | from jan 17. maybe we had quoting issues with that wrapper script or something at the time? | 21:47 |
clarkb | oh ya there were quoting issues once upon a time. I believe they are resolved now | 21:47 |
ianw | heh, that same day i proposed https://review.opendev.org/c/opendev/system-config/+/825019 | 21:47 |
clarkb | ok last call on meeting agenda content. I'll send that out at the top of the hour or so | 21:48 |
ianw | so that might explain it. | 21:48 |
ianw | oh i was going to add something about bridge. will do in a sec | 21:48 |
clarkb | k | 21:49 |
clarkb | ianw: let me know when it is posted and I'll get this sent out | 22:00 |
ianw | clarkb: thanks, done | 22:01 |
ianw | reloading the dib page shows the right release now, so hopefully pypi has fixed that | 22:03 |
clarkb | oh I got an email about it too. alright, working on sending that out | 22:03 |
fungi | ianw: i agree moving more out of openstack/project-config to opendev/project-config would be great as long as it's stuff that doesn't still need to be in the openstack zuul tenant in order for us to test it | 22:04 |
clarkb | ianw: seemed like it was a queue problem. Not sure if it is just that they got to dib in the queue or if the issue itself is corrected | 22:04 |
opendevreview | Amy Marrich proposed opendev/irc-meetings master: Updated meeting info for Diversity WG https://review.opendev.org/c/opendev/irc-meetings/+/847859 | 22:04 |
clarkb | and sent | 22:05 |
clarkb | I'm starting to look at the work in progress == merge conflict issue and https://review.opendev.org/q/is:wip does seem to show this equivalence is universal | 22:12 |
*** dasm is now known as dasm|off | 22:16 | |
opendevreview | Merged opendev/irc-meetings master: Updated meeting info for Diversity WG https://review.opendev.org/c/opendev/irc-meetings/+/847859 | 22:17 |
clarkb | after fiddling with it via the rest api I strongly suspect that the issue is mergeable:false and work_in_progress:true are both set for these changes | 22:23 |
clarkb | and I suspect that mergeable:false is picked up by the web ui as meaning merge conflict | 22:23 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Fix BLS entries for /boot partitions https://review.opendev.org/c/openstack/diskimage-builder/+/846838 | 22:27 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Parse block device lvm lvs size attributes https://review.opendev.org/c/openstack/diskimage-builder/+/839829 | 22:28 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: WIP Support LVM thin provisioning https://review.opendev.org/c/openstack/diskimage-builder/+/840144 | 22:28 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback https://review.opendev.org/c/openstack/diskimage-builder/+/847860 | 22:28 |
clarkb | yup I think that is exactly the issue | 22:30 |
clarkb | https://gerrit.googlesource.com/gerrit/+/refs/tags/v3.5.2/polygerrit-ui/app/utils/change-util.ts#160 | 22:31 |
opendevreview | Merged openstack/project-config master: grafana: import graphs and take screenshots https://review.opendev.org/c/openstack/project-config/+/847129 | 22:32 |
clarkb | Looking at the history of that ts file I don't think that logic has changed recently though which makes me suspect that it is the mergeable state that has changed. I wonder if we need to send a null value in the case of being mergeable but wip to distinguish from unmergeable wip | 22:33 |
ianw | clarkb: ahh, you were correct -> Unable to freeze job graph: Job project-config-grafana depends on opendev-buildset-registry which was not run. :/ | 22:37 |
TheJulia | clarkb: looks like the job failed so there should be a held node | 22:37 |
ianw | the project-config gate might be borked now, as i don't think we can commit around that | 22:37 |
clarkb | TheJulia: looking | 22:38 |
clarkb | ianw: we should be able to force merge a revert at least? | 22:38 |
clarkb | TheJulia: yup I'm on the host. have a pubkey you can share for me to add? | 22:39 |
ianw | maybe it should be a soft dependency in the o-z-j side? | 22:39 |
TheJulia | clarkb: one moment | 22:40 |
clarkb | ianw: that would also fix it I think | 22:40 |
ianw | clarkb: https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/847862 | 22:42 |
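For reference, a sketch of what the soft-dependency option looks like in Zuul job syntax; it is abbreviated and the actual 847862 change may differ.

```yaml
- job:
    name: project-config-grafana
    dependencies:
      # A soft dependency only orders the two jobs when both are present in
      # the buildset; it does not fail the job graph when the registry job
      # is not run, which is the error that broke the project-config gate.
      - name: opendev-buildset-registry
        soft: true
```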
clarkb | TheJulia: root@158.69.68.154 | 22:42 |
clarkb | hrm the mergeop check for mergeability that checks work in progress seems to have not changed since 2018 either. Do we know if this behavior is new? | 22:46 |
corvus | clarkb: (just a reminder that google added the thing where they don't do merge checks by default any more; so it seems conceivable issues may go unnoticed for longer upstream) | 22:52 |
corvus | (could maybe somehow be related) | 22:52 |
clarkb | corvus: yup | 22:55 |
clarkb | I'm finding evidence that they do attempt to set mergeable to null in the wip case | 22:56 |
clarkb | so the mystery deepens | 22:56 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging https://review.opendev.org/c/openstack/diskimage-builder/+/847600 | 23:02 |
ianw | TheJulia: ^ is this suggesting a glean issue? | 23:03 |
TheJulia | ianw: it is suggesting networkmanager may have a fairly nasty bug in it | 23:03 |
TheJulia | which causes stream-9 to break network connectivity shortly after booting | 23:04 |
ianw | :/ ... i'd probably prefer to debug glean, and that's saying something :) | 23:04 |
TheJulia | yeah :) | 23:04 |
TheJulia | unfortunately this is in the IPA ramdisks on boot | 23:04 |
opendevreview | Merged openstack/project-config master: Retire openstack-helm-deployments repo https://review.opendev.org/c/openstack/project-config/+/847414 | 23:05 |
TheJulia | so things get started... and a little while later all network connectivity seems to go *poof* | 23:05 |
opendevreview | Merged openstack/project-config master: Add Cinder Dell EMC PowerStore charm https://review.opendev.org/c/openstack/project-config/+/846888 | 23:05 |
ianw | urgh. our ipv6 RA kernel v NetworkManager race was somewhat similar ... but only for ipv6 | 23:07 |
ianw | i'm struggling to think of what else might jump in there | 23:08 |
clarkb | it looks like if you try to submit a change then gerrit will more explicitly check mergeability. Except you can't submit wip changes so I can't really test if that is a viable method to unsetting the stale data | 23:08 |
ianw | ohhh, that job has use-buildset-registry role in it | 23:18 |
ianw | which fails without a buildset registry. | 23:19 |
ianw | should that role skip if no registry, or should we only run the role if we see the buildset registry var, or is this an incorrect usage of the whole thing and we should unconditionally have a buildset registry? | 23:20 |
clarkb | ianw: I think it may be an unexpected use of the role. The idea is if you are using it then you want to do the depends on support that I described in my review. But I wasn't sure it would break so badly | 23:28 |
clarkb | I've filed https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 which outlines where I've debugged and a couple of ideas for a fix. I'll see if I can manage a change that we can test tomorrow using one of those approaches. | 23:28 |
ianw | yeah, i had in my mind the job would work either way, but i guess that's not really how to write it. otherwise we have to conditionally include the use-buildset-registry on the registry variables, which kind of breaks the abstraction | 23:30 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847866 | 23:32 |
ianw | clarkb: ^ | 23:32 |
clarkb | ianw: one minor clarification but ya I think that will solve it | 23:34 |
opendevreview | Ian Wienand proposed openstack/project-config master: project-config-grafana: add opendev-buildset-registry https://review.opendev.org/c/openstack/project-config/+/847866 | 23:35 |
ianw | yep, check: :) | 23:35 |
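And a hedged sketch of the alternative 847866 takes (project stanza abbreviated, and only the check pipeline shown; gate would get the same treatment): run the registry job alongside the grafana job so the dependency is always satisfied.

```yaml
- project:
    check:
      jobs:
        # Running the registry job in the same pipeline means the grafana
        # job's dependency can always be resolved when the graph is frozen.
        - opendev-buildset-registry
        - project-config-grafana:
            dependencies:
              - opendev-buildset-registry
```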
TheJulia | ianw: well, it looks like they removed a configuration option at some point, which is not helping... and behavior changed | 23:35 |
clarkb | I'm going to go help with dinner now. | 23:35 |
TheJulia | *and* its behavior is not ideal if you have physical switches which can make "bad things happen" | 23:36 |
ianw | sigh. if i can help beyond "random probably unhelpful suggestions" lmn :) | 23:39 |
TheJulia | thanks :) | 23:42 |
TheJulia | I suspect I'm going to have to bang my head against it a little since there is inherently a race with the dhcp timeout occurring... and now is not a good time to try :\ | 23:43 |
TheJulia | ugh, looks like I'm going to need a noisy neighbor to fully simulate this :\ | 23:54 |