*** jamesmcarthur has joined #openstack-infra | 00:00 | |
*** Lucas_Gray has quit IRC | 00:09 | |
*** Lucas_Gray has joined #openstack-infra | 00:12 | |
*** jamesmcarthur_ has joined #openstack-infra | 00:18 | |
*** hamalq_ has quit IRC | 00:20 | |
*** jamesmca_ has joined #openstack-infra | 00:21 | |
*** jamesmcarthur has quit IRC | 00:22 | |
*** jamesmcarthur_ has quit IRC | 00:23 | |
*** jamesmca_ has quit IRC | 00:24 | |
*** tetsuro has joined #openstack-infra | 00:25 | |
*** yamamoto has joined #openstack-infra | 00:36 | |
*** ricolin has joined #openstack-infra | 00:37 | |
*** jamesmcarthur has joined #openstack-infra | 00:39 | |
*** yamamoto has quit IRC | 00:41 | |
*** jamesmcarthur has quit IRC | 00:42 | |
*** jamesden_ has joined #openstack-infra | 00:42 | |
*** ricolin has quit IRC | 00:42 | |
*** jamesden_ has quit IRC | 00:42 | |
*** jamesden_ has joined #openstack-infra | 00:43 | |
*** jamesdenton has quit IRC | 00:43 | |
*** jamesmcarthur has joined #openstack-infra | 00:44 | |
*** armax has quit IRC | 00:50 | |
*** markvoelker has joined #openstack-infra | 00:54 | |
*** ociuhandu has joined #openstack-infra | 00:56 | |
*** armax has joined #openstack-infra | 00:58 | |
*** ociuhandu has quit IRC | 00:59 | |
*** markvoelker has quit IRC | 00:59 | |
*** ociuhandu has joined #openstack-infra | 00:59 | |
*** ociuhandu has quit IRC | 01:03 | |
*** markvoelker has joined #openstack-infra | 01:09 | |
*** yamamoto has joined #openstack-infra | 01:12 | |
*** markvoelker has quit IRC | 01:14 | |
*** yamamoto has quit IRC | 01:38 | |
*** ricolin has joined #openstack-infra | 01:46 | |
*** jamesmcarthur has quit IRC | 01:58 | |
*** jamesmcarthur has joined #openstack-infra | 01:59 | |
*** rlandy|ruck|bbl is now known as rlandy|ruck | 02:03 | |
*** Lucas_Gray has quit IRC | 02:04 | |
*** jamesmcarthur has quit IRC | 02:04 | |
*** rfolco has quit IRC | 02:09 | |
*** rlandy|ruck has quit IRC | 02:21 | |
*** vishalmanchanda has joined #openstack-infra | 02:29 | |
*** jamesmcarthur has joined #openstack-infra | 02:32 | |
*** yamamoto has joined #openstack-infra | 02:42 | |
*** jamesmcarthur has quit IRC | 02:42 | |
*** yamamoto has quit IRC | 02:43 | |
*** yamamoto has joined #openstack-infra | 02:43 | |
*** hongbin has joined #openstack-infra | 02:59 | |
*** ricolin has quit IRC | 03:02 | |
*** yolanda has quit IRC | 03:02 | |
*** yolanda has joined #openstack-infra | 03:03 | |
*** smarcet has quit IRC | 03:07 | |
*** jamesmcarthur has joined #openstack-infra | 03:17 | |
*** jamesmcarthur has quit IRC | 03:22 | |
*** psachin has joined #openstack-infra | 03:28 | |
*** hongbin has quit IRC | 03:33 | |
*** jamesmcarthur has joined #openstack-infra | 03:50 | |
*** armax has quit IRC | 04:05 | |
*** ykarel|away is now known as ykarel | 04:27 | |
*** evrardjp has quit IRC | 04:33 | |
*** evrardjp has joined #openstack-infra | 04:33 | |
*** ysandeep|away is now known as ysandeep | 04:40 | |
*** jamesmcarthur has quit IRC | 04:52 | |
*** jamesmcarthur has joined #openstack-infra | 04:52 | |
*** jtomasek has joined #openstack-infra | 04:56 | |
*** jtomasek has quit IRC | 04:56 | |
*** jtomasek has joined #openstack-infra | 04:57 | |
*** jamesmcarthur has quit IRC | 04:58 | |
*** matt_kosut has joined #openstack-infra | 05:00 | |
AJaeger | config-core, could you review https://review.opendev.org/737987 and https://review.opendev.org/737995 , please? - further retirement changes | 05:05 |
*** jamesmcarthur has joined #openstack-infra | 05:26 | |
*** jamesmcarthur has quit IRC | 05:33 | |
*** udesale has joined #openstack-infra | 05:40 | |
*** lmiccini has joined #openstack-infra | 05:45 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/738150 | 06:10 |
*** danpawlik is now known as dpawlik|EoD | 06:15 | |
*** ralonsoh has joined #openstack-infra | 06:17 | |
*** dklyle has quit IRC | 06:20 | |
*** rpittau|afk is now known as rpittau | 06:29 | |
openstackgerrit | Merged openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1 https://review.opendev.org/737987 | 06:50 |
*** slaweq has joined #openstack-infra | 06:57 | |
*** ysandeep is now known as ysandeep|afk | 07:04 | |
*** marcosilva has joined #openstack-infra | 07:17 | |
*** jcapitao has joined #openstack-infra | 07:18 | |
*** hashar has joined #openstack-infra | 07:20 | |
*** ysandeep|afk is now known as ysandeep | 07:23 | |
*** bhagyashris|afk is now known as bhagyashris | 07:27 | |
*** amoralej|off is now known as amoralej | 07:31 | |
*** jpena|off is now known as jpena | 07:31 | |
*** jtomasek has quit IRC | 07:32 | |
*** dtantsur|afk is now known as dtantsur | 07:33 | |
*** jtomasek has joined #openstack-infra | 07:35 | |
*** tosky has joined #openstack-infra | 07:35 | |
*** ociuhandu has joined #openstack-infra | 07:37 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Remove legacy-tempest-dsvm-networking-onos https://review.opendev.org/737995 | 07:38 |
*** marcosilva has quit IRC | 07:50 | |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Run openafs promote job only if gate job run https://review.opendev.org/738155 | 07:50 |
*** jtomasek has quit IRC | 08:10 | |
*** jtomasek has joined #openstack-infra | 08:14 | |
*** hashar_ has joined #openstack-infra | 08:21 | |
*** hashar has quit IRC | 08:22 | |
*** hashar_ is now known as hashar | 08:29 | |
*** pkopec has quit IRC | 08:31 | |
*** derekh has joined #openstack-infra | 08:43 | |
*** ykarel is now known as ykarel|lunch | 08:48 | |
*** markvoelker has joined #openstack-infra | 08:49 | |
openstackgerrit | Carlos Goncalves proposed openstack/project-config master: Add nested-virt-centos-8 label https://review.opendev.org/738161 | 08:50 |
*** jistr has quit IRC | 08:53 | |
*** markvoelker has quit IRC | 08:54 | |
*** jistr has joined #openstack-infra | 08:54 | |
*** gfidente has joined #openstack-infra | 09:02 | |
*** kaiokmo has joined #openstack-infra | 09:06 | |
*** ysandeep is now known as ysandeep|lunch | 09:06 | |
*** udesale has quit IRC | 09:07 | |
openstackgerrit | Shivanand Tendulker proposed openstack/project-config master: Removes py35 and py27 jobs for proliantutils https://review.opendev.org/738168 | 09:09 |
*** ramishra has quit IRC | 09:09 | |
*** xek has joined #openstack-infra | 09:10 | |
openstackgerrit | Carlos Goncalves proposed openstack/project-config master: Add nested-virt-centos-8 label https://review.opendev.org/738161 | 09:13 |
openstackgerrit | Shivanand Tendulker proposed openstack/project-config master: Removes py35 and py27 jobs for proliantutils https://review.opendev.org/738168 | 09:14 |
*** udesale has joined #openstack-infra | 09:15 | |
*** pkopec has joined #openstack-infra | 09:17 | |
*** ysandeep|lunch is now known as ysandeep | 09:22 | |
*** eolivare has joined #openstack-infra | 09:26 | |
*** Lucas_Gray has joined #openstack-infra | 09:41 | |
*** ramishra has joined #openstack-infra | 09:52 | |
*** ykarel|lunch is now known as ykarel | 09:55 | |
*** priteau has joined #openstack-infra | 10:03 | |
*** rpittau is now known as rpittau|bbl | 10:04 | |
*** tetsuro has quit IRC | 10:08 | |
*** pkopec has quit IRC | 10:09 | |
*** jcapitao has quit IRC | 10:21 | |
*** jcapitao has joined #openstack-infra | 10:23 | |
*** jcapitao is now known as jcapitao_lunch | 10:34 | |
*** slaweq has quit IRC | 10:40 | |
*** ccamacho has quit IRC | 10:42 | |
*** slaweq has joined #openstack-infra | 10:42 | |
*** markvoelker has joined #openstack-infra | 10:50 | |
*** markvoelker has quit IRC | 10:54 | |
*** Lucas_Gray has quit IRC | 11:14 | |
openstackgerrit | Thierry Carrez proposed zuul/zuul-jobs master: upload-git-mirror: use retries to avoid races https://review.opendev.org/738187 | 11:21 |
*** jaicaa has quit IRC | 11:23 | |
zbr | what is happening with "Web Listing Disabled" on log servers? | 11:24 |
*** Lucas_Gray has joined #openstack-infra | 11:26 | |
*** jpena is now known as jpena|lunch | 11:33 | |
*** ryohayakawa has quit IRC | 11:35 | |
*** tinwood has quit IRC | 11:37 | |
*** kopecmartin is now known as kopecmartin|pto | 11:37 | |
*** tinwood has joined #openstack-infra | 11:38 | |
*** jaicaa has joined #openstack-infra | 11:49 | |
frickler | dirk: cmurphy: would one of you be interested in fixing opensuse for stable/stein in devstack? see https://review.opendev.org/735640 , the other option would be to just drop that job until someone cares or has time | 12:01 |
dirk | frickler: iirc AJaeger was looking at it | 12:04 |
dirk | there has been a short conversation about it | 12:04 |
dirk | frickler: I'll poke people internally so that you'll get a colleague looking at it | 12:04 |
AJaeger | dirk: I was looking and failed ;( | 12:07 |
*** jcapitao_lunch is now known as jcapitao | 12:07 | |
AJaeger | dirk: so, we were able to fix train but stein is a different beast | 12:08 |
dirk | AJaeger: ok, I'll ask internally further, thanks | 12:08 |
AJaeger | thanks, dirk! | 12:10 |
*** rpittau|bbl is now known as rpittau | 12:12 | |
*** rlandy has joined #openstack-infra | 12:13 | |
*** rlandy is now known as rlandy|ruck | 12:13 | |
*** ociuhandu has quit IRC | 12:14 | |
*** rfolco has joined #openstack-infra | 12:18 | |
*** udesale has quit IRC | 12:29 | |
*** derekh has quit IRC | 12:32 | |
*** ociuhandu has joined #openstack-infra | 12:34 | |
*** rlandy|ruck is now known as rlandy|ruck|mtg | 12:35 | |
*** smarcet has joined #openstack-infra | 12:37 | |
fungi | zbr: you'll have to be more specific, though that usually is an indication that no index was uploaded for the url you're visiting (possibly nothing at all). have an example build which links to a listing error? | 12:43 |
fungi | zbr: also possible you're following an old link and the logs have already expired and been deleted? | 12:44 |
*** jpena|lunch is now known as jpena | 12:44 | |
zbr | fungi: i did a recheck, my guess is that the current retention is too small. sometimes we need logs available for a long time before we make a decision | 12:45 |
zbr | also, i observed that updating the commit message on my controversial wrap/unwrap change reset votes, so we cannot rely on gerrit to track support for that change. | 12:46 |
fungi | yep, however we generate something like 2-3tb of compressed logs every month | 12:46 |
zbr | any idea what we can use to track it? | 12:46 |
fungi | last i looked anyway | 12:46 |
zbr | probably we need different rules based on project or based on size of logs per job | 12:46 |
zbr | why scrap logs from jobs that produce few logs because of ones that are heavy on them? | 12:47 |
fungi | i'm not sure i want to be in the position of deciding what project is more important than what other project and who deserves to be able to monopolize our ci log storage | 12:47 |
zbr | probably a rule based on size would be unbiased | 12:47 |
zbr | any log > X is scrapped at 30 days, any log > Y is scrapped at 3 months. | 12:48 |
fungi | that might be doable, expiration is decided at upload time, though it's also much easier to communicate a single retention period | 12:48 |
zbr | a rotating rule would make more sense to me, start to scrap old stuff, instead of guessing at upload time | 12:50 |
zbr | i am not sure we can always make an informed decision about removal date when we upload, yep we should have a default. | 12:51 |
fungi | well, the expiration is a swift feature. when we had a log server we used to have a running process go through all the old logs to decide what to get rid of, and turns out the amount of logs we're keeping is so large that if you run something like that continuously you still can't keep up with the upload rate | 12:51 |
zbr | can we compute total log size before uploading them? | 12:51 |
*** markvoelker has joined #openstack-infra | 12:51 | |
zbr | if so, we could make the expiration bit longer for small logs. | 12:51 |
zbr | this could play well with less active projects too | 12:52 |
fungi | that's why i was saying it might be possible since in theory we know the aggregate log quantity for any build | 12:52 |
zbr | and avoid extra "rechecks" | 12:52 |
zbr | lets see what others think about it | 12:52 |
fungi | though based on previous numbers, tripleo would wind up with a one-week expiration for most of its job logs ;) | 12:52 |
zbr | fungi: (me hiding) | 12:53 |
fungi | we expire lots of smaller logs to make room for those | 12:53 |
fungi | currently | 12:53 |
fungi | but yeah, i don't know what the real upshot would be, or whether we could just increase our retention period across the board, depending on what our overall utilization and donated object storage quotas look like | 12:54 |
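A minimal sketch of what size-based expiration at upload time could look like; the thresholds and container name are made up, it assumes the build's logs are staged locally before upload, and it relies on Swift's X-Delete-After header (which the swift CLI can set with --header):

    # hypothetical thresholds: bigger builds expire sooner
    LOG_DIR=logs/
    SIZE_MB=$(du -sm "$LOG_DIR" | cut -f1)
    if [ "$SIZE_MB" -gt 1024 ]; then
        EXPIRY=$((7 * 24 * 3600))      # very large builds: one week
    elif [ "$SIZE_MB" -gt 100 ]; then
        EXPIRY=$((30 * 24 * 3600))     # large builds: 30 days
    else
        EXPIRY=$((90 * 24 * 3600))     # small builds: roughly three months
    fi
    # the expiry is fixed per object at upload time, as fungi notes above
    swift upload example_logs_container "$LOG_DIR" --header "X-Delete-After: $EXPIRY"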
*** markvoelker has quit IRC | 12:56 | |
*** jamesden_ is now known as jamesdenton | 12:59 | |
*** amoralej is now known as amoralej|lunch | 13:04 | |
*** derekh has joined #openstack-infra | 13:04 | |
*** gfidente has quit IRC | 13:06 | |
*** ykarel is now known as ykarel|afk | 13:14 | |
*** smarcet has quit IRC | 13:16 | |
*** gfidente has joined #openstack-infra | 13:16 | |
*** dtantsur is now known as dtantsur|afk | 13:21 | |
mwhahaha | hey i'm trying to look into the RETRY_LIMIT crashing issue but I don't seem to be able to figure out when it might have started. Is there a way to get more history out of zuul or some other tool? | 13:21 |
mwhahaha | I don't seem to be able to find them in logstash (probably because there are no logs) | 13:22 |
mwhahaha | I have a feeling it's a bug in centos8 because it's occurring on different jobs/branches but I don't know when it started | 13:24 |
EmilienM | infra-root: we're having a gate issue and we have a mitigation to reduce the failures with a revert: https://review.opendev.org/#/c/738025/ - would it be possible to force push that patch please? | 13:28 |
*** smarcet has joined #openstack-infra | 13:29 | |
rlandy|ruck|mtg | mwhahaha: we had a discussion about the RETRY_LIMIT on wednesday - if it's the same issue | 13:30 |
mwhahaha | yea it's that one | 13:30 |
mwhahaha | i think it's happening when we run container image prepare which relies on multiprocessing because it seems to be happening ~30-40 mins into the job consistently | 13:30 |
AJaeger | EmilienM: you mean: promote to head of queue? | 13:30 |
rlandy|ruck|mtg | we got as far as finding out that the failure hits in tripleo-ci/toci_gate_test.sh | 13:31 |
rlandy|ruck|mtg | and mostly leaves no logs | 13:31 |
rlandy|ruck|mtg | any test running toci can basically hit it | 13:31 |
mwhahaha | rlandy|ruck|mtg: I don't think it's the shell script tho, it's more likely what we're running inside it | 13:31 |
mwhahaha | that's just our entry point into quickstart | 13:31 |
rlandy|ruck|mtg | we didn't trace it back to any particular provider | 13:31 |
mwhahaha | yea i think it's a centos8 bug | 13:32 |
mwhahaha | because i think it started about the time we got 8.2 | 13:32 |
mwhahaha | https://zuul.opendev.org/t/openstack/builds?result=RETRY_LIMIT&project=openstack%2Ftripleo-heat-templates | 13:32 |
mwhahaha | points to something durring the job | 13:32 |
EmilienM | AJaeger: no, force merge | 13:32 |
mwhahaha | rlandy|ruck|mtg: the timing indicates it's during the standalone deploy itself because we start it ~30 mins into a job | 13:33 |
rlandy|ruck|mtg | AJaeger: here is the related bug ... https://bugs.launchpad.net/tripleo/+bug/1885279 | 13:33 |
openstack | Launchpad bug 1885279 in tripleo "TestVolumeBootPattern.test_volume_boot_pattern tests on master are failing on updating to cirros-0.5.1 image" [Critical,In progress] - Assigned to Ronelle Landy (rlandy) | 13:33 |
frickler | rlandy|ruck|mtg: you need > 64MB for cirros 0.5.1. devstack uses 128MB | 13:34 |
*** mordred has quit IRC | 13:34 | |
rlandy|ruck|mtg | chandankumar: ^^ | 13:35 |
rlandy|ruck|mtg | frickler: that may be - but we need more qualification here and we'd like to revert to do that | 13:35 |
chandankumar | rlandy|ruck|mtg, let me check the default size | 13:36 |
AJaeger | EmilienM: I don't have those permissions, just asking. Why do you think a force-merge is needed? | 13:36 |
rlandy|ruck|mtg | AJaeger: it's taking time to get the patch through the gate | 13:36 |
rlandy|ruck|mtg | in the mean time, other jobs are failing | 13:36 |
EmilienM | AJaeger: our CI stats isn't good. We're dealing with multiple issues at this time and we think this one is one of them | 13:37 |
AJaeger | Normally, we just promote them to head of gate to speed up... | 13:37 |
*** dave-mccowan has quit IRC | 13:37 | |
EmilienM | yeah things haven't been normal for us this week :-/ | 13:37 |
AJaeger | bbl | 13:37 |
rlandy|ruck|mtg | I guess we'll take what we can get - top of the queue then - pls | 13:38 |
*** dave-mccowan has joined #openstack-infra | 13:38 | |
chandankumar | rlandy|ruck|mtg, https://opendev.org/osf/python-tempestconf/src/branch/master/config_tempest/constants.py#L30 | 13:38 |
chandankumar | it is 64 | 13:38 |
chandankumar | we need to increase that | 13:38 |
rlandy|ruck|mtg | chandankumar: let's discuss back on our channels | 13:39 |
chandankumar | yes | 13:39 |
corvus | EmilienM, AJaeger, infra-root: hi, i can promote 738025 | 13:40 |
EmilienM | thanks corvus | 13:41 |
corvus | EmilienM: if it does improve things, then of course changes behind it in the gate will automatically receive the benefit of that improvement; you can make other changes in check benefit from it before it lands by adding a Depends-On | 13:43 |
EmilienM | right | 13:44 |
corvus | it's at the top now | 13:44 |
EmilienM | I saw, thanks a lot | 13:44 |
rlandy|ruck|mtg | corvus: thank you | 13:45 |
corvus | no problem :) hth | 13:45 |
*** udesale has joined #openstack-infra | 13:47 | |
*** amoralej|lunch is now known as amoralej | 13:47 | |
*** jamesmcarthur has joined #openstack-infra | 13:49 | |
rlandy|ruck|mtg | mwhahaha: https://bugs.launchpad.net/tripleo/+bug/1885286 - so we have a place to track the investigation of RETRY_LIMIT errors | 13:53 |
openstack | Launchpad bug 1885286 in tripleo "Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate" [Critical,Triaged] - Assigned to Ronelle Landy (rlandy) | 13:53 |
*** rlandy|ruck|mtg is now known as rlandy|ruck | 13:54 | |
mwhahaha | rlandy|ruck: yea i think it's happening during container-image-prepare based on the timings ~30-40 mins | 13:54 |
*** yamamoto has quit IRC | 13:54 | |
rlandy|ruck | mwhahaha: so then the failure is later than ... | 13:55 |
mwhahaha | yea | 13:55 |
rlandy|ruck | on Wed we were looking at failures around 10-15 mins | 13:55 |
mwhahaha | https://zuul.opendev.org/t/openstack/builds?result=RETRY_LIMIT&project=openstack%2Ftripleo-heat-templates | 13:55 |
rlandy|ruck | and no logs | 13:55 |
mwhahaha | check the failures for tripleo jobs | 13:55 |
mwhahaha | i think it's a multiprocessing bug in python in centos8 | 13:55 |
mwhahaha | we've seen stack traces in ansible with it too previously | 13:55 |
mwhahaha | i'll try and dig up my logs later, i asked about it in #ansible-devel like 2 weeks ago | 13:56 |
*** yamamoto has joined #openstack-infra | 13:56 | |
rlandy|ruck | mwhahaha: k, thanks | 13:56 |
mwhahaha | rlandy|ruck: http://paste.openstack.org/show/794407/ was the ansible crash i saw | 14:03 |
rlandy|ruck | at least we have some trace to go on now | 14:04 |
mwhahaha | may not be related but i saw it shortly after we got 2.9.9 | 14:04 |
mwhahaha | but since the failure is in python itself, i'm wondering if there's another issue | 14:05 |
*** dklyle has joined #openstack-infra | 14:08 | |
fungi | something seems to be breaking in such a way that at least ssh from the executor ceases working... whether that's a kernel panic, network interface getting unconfigured, sshd hanging... can't really tell | 14:09 |
fungi | you could open a bunch of log streams for jobs you think are more likely to hit that condition, and see where they stop | 14:10 |
fungi | (if you do it with finger protocol you could probably record them to separate local files pretty easily, and not have to depend on browser websockets) | 14:10 |
mwhahaha | it doesn't happen enough :/ | 14:11 |
rlandy|ruck | mwhahaha: on that paste the date logged is Jun 05 09:44:15 | 14:11 |
mwhahaha | yea that's not from one of these | 14:11 |
rlandy|ruck | twenty days ago? | 14:11 |
mwhahaha | that's just something i noticed that was happening | 14:11 |
fungi | well, the retry_limit doesn't happen that often, because the job has to hit a similar condition three builds in a row... i imagine isolated instances of this which aren't hit on a second or third rebuild may be much more common | 14:12 |
mwhahaha | where python was segfaulting in the multiprocessing bits in ansible. since container image prepare uses multiprocessing it might be a similar root cause | 14:12 |
rlandy|ruck | it's happening often enough now to impact the rate jobs get through gates | 14:12 |
openstackgerrit | Shivanand Tendulker proposed openstack/project-config master: Removes py35, tox and cover jobs for proliantutils https://review.opendev.org/738168 | 14:22 |
*** armax has joined #openstack-infra | 14:24 | |
*** ykarel|afk is now known as ykarel | 14:37 | |
*** priteau has quit IRC | 14:44 | |
*** priteau has joined #openstack-infra | 14:47 | |
*** markvoelker has joined #openstack-infra | 14:52 | |
*** priteau has quit IRC | 14:52 | |
*** markvoelker has quit IRC | 14:57 | |
clarkb | mwhahaha: rlandy|ruck: may also want to add logging of the individual steps as they happen in that script | 14:57 |
mwhahaha | it's not that script and we do | 14:57 |
*** lmiccini has quit IRC | 14:57 | |
mwhahaha | but since we don't get any logs we have no idea what's happening | 14:57 |
mwhahaha | that script just invokes other things that do log, but no logs are captured | 14:57 |
clarkb | mwhahaha: I know it's not the script but it's something the script runs, isn't it? and getting that emitted to the console log would be useful rather than trying to infer based on time to failure | 14:57 |
clarkb | right I'm saying write the logs to the console and then you'll get them | 14:58 |
mwhahaha | we don't seem to be recording the console anywhere | 14:58 |
clarkb | it's available while the job runs | 14:58 |
mwhahaha | which is not helpful | 14:58 |
clarkb | permanent storage requires that the host be available at the end of the job for archival | 14:58 |
clarkb | why isn't that helpful? you can start a number of them, open the logs (via browser or finger), wait for one to fail, save logs, debug from there | 14:59 |
mwhahaha | zuul doesn't have a call back to write out the console and always ship that off the executor? | 14:59 |
clarkb | mwhahaha: the console log is from the test node not the executor | 14:59 |
clarkb | if the test node is gone there is no more console log to copy | 14:59 |
openstackgerrit | Merged openstack/project-config master: Removes py35, tox and cover jobs for proliantutils https://review.opendev.org/738168 | 14:59 |
mwhahaha | maybe i'm missing how zuul is invoking ansible on that, but shouldn't there be a way to ship the output off the node w/o needing the node such that it can be captured even if the node dies | 15:00 |
clarkb | no because those logs are on the disk of the node | 15:00 |
* mwhahaha shrugs | 15:00 | |
clarkb | we could potentially set up a hold and keep nodepool from deleting the instance (though I'm not sure that would trigger on a retry failure? that may be a hold bug), then reboot --hard the instance via the nova api and hope it comes back | 15:01 |
mwhahaha | it's not a single job that hits this, it's like any of them. so trying to open up something that can capture all the console output all the time and then figure out which one RETRY_LIMITs isn't as simple as you make it seem | 15:01 |
weshay_pto | there has to be some amount of tracking jobs that hit RETRY_LIMIT right? | 15:02 |
*** jcapitao has quit IRC | 15:02 | |
clarkb | we track it | 15:02 |
clarkb | the problem is in accessing the logs after the fact | 15:02 |
clarkb | mwhahaha: you could run a bunch of fingers with their output tee'd to files | 15:02 |
clarkb | I'm not saying its ideal, but this particular class of failure is difficult to deal with | 15:04 |
fungi | mwhahaha: also you don't need to wait for a retry_limit result, as i keep saying, the retry_limit happens when a particular job fails in a similar way three builds in a row, so the odds there are failures of these builds happening only once or twice in a row is likely much higher, statistically | 15:04 |
mwhahaha | i think the issue is identifying that | 15:04 |
mwhahaha | while it's running | 15:04 |
mwhahaha | anyway i'll look at it later | 15:04 |
clarkb | re holding a node I'm 99% certain we won't hold if the job is retried. Whether or not the 3rd pass failing would trigger the hold I'm not sure | 15:06 |
fungi | and yeah, i also suggested using finger protocol and redirecting it to a file... you could grab a snapshot of the zuul status.json, parse out the list of any running builds which are likely to hit that issue, spawn individual netcats in the background to each of the finger addresses for them and redirect those to local files... later grep those dump files for an indication the job did not succeed | 15:06 |
fungi | (perhaps lacks the success message) and that narrows the pool significantly | 15:06 |
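A rough sketch of that approach, untested: it assumes the running build UUIDs have already been pulled from a status.json snapshot into a file, and that the Zuul finger gateway answers on the standard finger port of the zuul host:

    for uuid in $(cat running-tripleo-builds.txt); do
        # finger protocol: send the build UUID, read the streamed console back
        printf '%s\r\n' "$uuid" | nc zuul.opendev.org 79 > "console-$uuid.log" &
    done
    wait
    # afterwards, grep the dump files for builds that never reached the job's
    # normal end-of-run output, per fungi's suggestion above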
clarkb | hold_list = ["FAILURE", "RETRY_LIMIT", "POST_FAILURE", "TIMED_OUT"] | 15:07 |
clarkb | we would hold the third failure | 15:07 |
clarkb | so that is another option, though relies on a reboot producing a working instance after the fact | 15:08 |
clarkb | I'm happy to add a hold if we can give a rough set of criteria for it | 15:08 |
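A sketch of what such a hold and the follow-up recovery could look like; the job, project, count, and reason here are illustrative rather than the exact commands used:

    # on the scheduler: hold the next matching failure for inspection
    zuul autohold --tenant openstack \
        --project openstack/tripleo-heat-templates \
        --job tripleo-ci-centos-8-standalone \
        --count 1 --reason "debug RETRY_LIMIT network drop"
    # once a node is held: try a hard reboot and pull the console via the cloud API
    openstack server reboot --hard <instance-uuid>
    openstack console log show <instance-uuid>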
fungi | i doubt the hold would help, because the set of jobs it's impacting is fairly large, and the frequency with which one of those builds hits retry_failure is likely statistical noise compared to other failure modes | 15:09 |
clarkb | I guess even if reboot fails we can ask the cloud for the instance console and that may give clues | 15:10 |
clarkb | fungi: ya, it may require several attempts to get one. I wonder, can we tell hold we only want RETRY_LIMIT jobs? | 15:10 |
clarkb | looks like no | 15:11 |
fungi | we do have code we can switch on to grab nova console from build failures right? or was that nodepool launch failures? | 15:11 |
clarkb | fungi: that is nodepool launch failures | 15:11 |
clarkb | but if we get a hold we can manually run it | 15:11 |
fungi | ahh, right, executors lack the credentials for that anyway | 15:11 |
mwhahaha | so i just got a crashed one | 15:15 |
mwhahaha | http://paste.openstack.org/show/795271/ | 15:16 |
mwhahaha | it just disappears | 15:16 |
mwhahaha | no console output | 15:16 |
fungi | that was fast | 15:16 |
mwhahaha | so it's like it crashed | 15:16 |
fungi | yep, that's all i was finding in the executor debug logs too | 15:16 |
clarkb | mwhahaha: is standalone deploy a task that runs ansible? | 15:17 |
mwhahaha | it's an ansible task that invokes shell | 15:17 |
mwhahaha | to run python/ansible stuff | 15:17 |
clarkb | is that nested ansible or zuul's top level ansible? | 15:17 |
fungi | and that crash wasn't in the same script i was seeing before | 15:17 |
mwhahaha | doesn't really matter because the node went poof | 15:17 |
mwhahaha | nested | 15:18 |
clarkb | mwhahaha: well what would be potentially useful is figuring out where in that 11 minutes the script breaks | 15:18 |
mwhahaha | zuul -> toci sh -> quickstart -> ansible -> shell -> python -> ansible | 15:18 |
clarkb | hrm nested means we aren't able to get streaming shell out of ansible (at least not easily) | 15:18 |
mwhahaha | i know where it likely breaks based on timing (as previously mentioned) | 15:18 |
mwhahaha | which is a python process that uses multiprocess to do container fetching/processing | 15:18 |
mwhahaha | hence i think there's a bug in either python or the kernel | 15:18 |
mwhahaha | but w/o any other info on the node it's going to be impossible to track down at the moment | 15:19 |
mwhahaha | same deal with this one http://paste.openstack.org/show/795272/ | 15:19 |
fungi | previously i was seeing it happen while toci_gate_test.sh was running, so it could be anything common between what that does and what the tripleo.operator.tripleo_deploy : Standalone deploy task does | 15:19 |
* mwhahaha knows what it does | 15:20 | |
mwhahaha | what i need is the vm console output | 15:20 |
mwhahaha | to see if it's kernel panicing | 15:20 |
fungi | oh, and that one happened during tripleo.operator.tripleo_undercloud_install : undercloud install | 15:20 |
clarkb | mwhahaha: yes and as mentioend above I've suggested how we might get that | 15:20 |
mwhahaha | https://zuul.opendev.org/t/openstack/stream/bf54d4fdf5c040d590372a7cbfbd3c53?logfile=console.log will likely crash | 15:21 |
mwhahaha | it's on 2. attempt at the moment | 15:21 |
mwhahaha | 737774,2 tripleo-ci-centos-8-standalone (2. attempt) | 15:21 |
clarkb | mwhahaha: k, autohold is in place for that one, if the 3rd attempt fails we'll get it. Separately we can try grabbing the console log for it ahead of time while it is running the job | 15:23 |
clarkb | mwhahaha: any idea how far away from failing it would be now? | 15:23 |
mwhahaha | it fails ~30 mins in | 15:23 |
mwhahaha | i don't know how long it's running let me look at the console | 15:23 |
mwhahaha | maybe 10 mins | 15:23 |
mwhahaha | oh no probably like 5-10 from now | 15:24 |
mwhahaha | it just started the deploy | 15:24 |
clarkb | a0c808c9-481e-44bc-8e64-9cfe8b90e1f2 is the instance uid in inap | 15:24 |
fungi | centos-8-inap-mtl01-0017417464 | 15:24 |
*** priteau has joined #openstack-infra | 15:26 | |
clarkb | I've managed to console log show it, nothing exciting yet | 15:26 |
fungi | i can also paste the web console url, i know that would normally be sensitive but this is a throwaway vm | 15:27 |
fungi | the vnc token should be unique, right? | 15:27 |
clarkb | fungi: I have no idea (and assuming that about openstack seems potentially dangerous) | 15:28 |
clarkb | fungi: but I guess if you open that locally you'll get the running console log and won't have to time it like my console log show | 15:28 |
clarkb | so maybe just open it locally and see if we catch anything? | 15:28 |
fungi | i do have it open locally, but am also polling console log show out to a local file just in case | 15:28 |
fungi | so far it's just iptables drop logs though | 15:29 |
clarkb | and selinux bookkeeping | 15:29 |
fungi | yep | 15:29 |
fungi | device br-ctlplane entered promiscuous mode | 15:30 |
*** priteau has quit IRC | 15:30 | |
fungi | in case anyone wondered | 15:30 |
*** priteau has joined #openstack-infra | 15:30 | |
fungi | ooh, "loop: module loaded" | 15:31 |
fungi | yeah, very much not exciting so far | 15:31 |
mwhahaha | like watching paint dry | 15:31 |
fungi | i really wish openstack console log show had something like --follow but that's probably tough to implement | 15:32 |
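A crude stand-in for a --follow option, roughly what the polling described above amounts to (assumes credentials for the provider are sourced; the uuid is the instance noted later in this discussion and is only an example):

    SERVER=a0c808c9-481e-44bc-8e64-9cfe8b90e1f2
    # keep timestamped snapshots so the last output survives if the instance
    # is deleted between polls
    while sleep 30; do
        openstack console log show "$SERVER" > "console-$(date +%s).log" || break
    done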
mwhahaha | it might succeed at this point, let me see if there's another one | 15:35 |
*** mordred has joined #openstack-infra | 15:35 | |
mwhahaha | crashed | 15:35 |
clarkb | the console log doesn't show any panic | 15:36 |
mwhahaha | hrm | 15:36 |
fungi | yeah, it's still logging iptables drops too | 15:36 |
mwhahaha | i wonder if it's an ovs bug | 15:36 |
mwhahaha | so if it's still up but isn't reachable via zuul that's weird? | 15:37 |
*** jtomasek has quit IRC | 15:37 | |
clarkb | I can confirm it doesn't seem to ping or be sshable from there | 15:38 |
fungi | yeah, network seems to be dead, dead, deadski | 15:38 |
clarkb | we've already determined this isn't cloud specific so unlikely that we're colliding a specific network range | 15:38 |
fungi | the node's ipv4 addy is/was 104.130.253.140 | 15:39 |
AJaeger | clarkb, fungi, can either of you add me to the ACLs of openstack-ux and solum-infra-guestagent so that I can retire these two repos, or do you want to abandon changes and approve the retirement change, please? | 15:39 |
fungi | and the instance just got deleted | 15:39 |
*** ricolin has joined #openstack-infra | 15:39 | |
fungi | want me to stick the recorded console log somewhere? | 15:39 |
clarkb | fungi: 198.72.124.67 is what I had according to the job log | 15:40 |
fungi | oh, yep nevermind, i grabbed that address out of the wrong console window | 15:40 |
clarkb | 104.130.253.140 looks like a rax IP but this was an inap node | 15:40 |
fungi | where i was troubleshooting something unrelated | 15:40 |
mwhahaha | 2020-06-26 15:37:59.536990 | primary | "msg": "Failed to connect to the host via ssh: ssh: connect to host 198.72.124.67 port 22: Connection timed out", | 15:41 |
fungi | anyway, i have the console log from the correct instance | 15:41 |
clarkb | fungi: probably doesn't hurt to share just in case there is some clue possibly in the iptables logging | 15:41 |
clarkb | if ping was working I'd suspect something crashed sshd | 15:41 |
clarkb | but ping doesn't seem to work either so more likely the network stack under sshd is having trouble | 15:41 |
clarkb | lack of kernel panic in the log implies it isn't a catastrophic failure | 15:42 |
fungi | grumble, it's slightly too long for paste.o.o | 15:42 |
fungi | i'll trim the first few lines from boot | 15:42 |
clarkb | also if the third pass of that job fails we should get a node hold and we can try a reboot and see if any of the logs on the host give us clues | 15:42 |
*** yamamoto has quit IRC | 15:44 | |
fungi | okay, i split the log in half between http://paste.openstack.org/show/795276 and http://paste.openstack.org/show/795277 | 15:44 |
fungi | mwhahaha: ^ | 15:45 |
mwhahaha | yea nothing out of the ordinary | 15:45 |
fungi | i wonder if we should stick something in systemd to klog the system time so we have something to generate log line offsets against | 15:47 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: [DNM] Define maintain-github-mirror job https://review.opendev.org/738228 | 15:47 |
clarkb | fungi: you might be able to find some other reference point like ssh user log in compared to zuul logs? | 15:47 |
fungi | oh, i know, but thinking for the future it might be nice not to have to | 15:48 |
fungi | in this case there wasn't anything worth calibrating anyway, but if we'd snagged a kernel panic we'd be able to tell how long after a particular job log line that happened | 15:49 |
fungi | which in some cases could help narrow things down to a smaller set of operations | 15:49 |
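fungi's idea, sketched as a simple loop rather than a proper systemd unit (assumes root on the test node; anything written to /dev/kmsg shows up in the nova console log):

    # stamp the kernel ring buffer with the wall clock every five minutes so
    # console log lines can be correlated against job log timestamps
    while sleep 300; do
        echo "wallclock: $(date --utc +%FT%TZ)" > /dev/kmsg
    done &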
fungi | AJaeger: openstack-ux-core and i guess... solum-core? | 15:51 |
clarkb | mwhahaha: (this is me just thinking crazy ideas) Do you know if you have these failures on rackspace? Rackspace gives us two interfaces a public and a private interface. Most other clouds give us a single interface where we have public only, or private ipv4 that is NAT'd with a fip. Now for the crazy idea. A lot of jobs use the "private_ip" which is actually the public_ip in clouds without a private | 15:51 |
clarkb | ip to do things like set up multinode networking. On rax that would be on a completely separate interface so anything that may break that would be isolated from breaking zuul's connectivity via the public interface. However on basically all other clouds breaking that interface would also break Zuul's connectivity | 15:51 |
mwhahaha | no idea | 15:52 |
mwhahaha | can you query zuul for the RETRY_LIMIT stuff? | 15:52 |
mwhahaha | it's not in logstash | 15:52 |
fungi | AJaeger: i've added you to those, let me know if that wasn't what you needed | 15:52 |
mwhahaha | w/o the logs i don't know where these are running | 15:53 |
clarkb | mwhahaha: ya I think we can ask zuul for that. It's not in logstash because there were no log files to index :/ | 15:53 |
AJaeger | fungi: thanks, let me check | 15:54 |
clarkb | hrm zuul build records don't have nodeset info | 15:54 |
clarkb | I guess we have to look at zuul logs | 15:55 |
fungi | clarkb: yeah, i'm seeing what i can parse out of the scheduler debug log first | 15:55 |
clarkb | fungi: thanks | 15:55 |
fungi | yeah, we'll have to glue scheduler and executor logs together | 15:59 |
fungi | the scheduler doesn't log node info, and the executor doesn't know when the result is retry_limit | 15:59 |
clarkb | fungi: we could just ask the executor for ssh connection problems | 15:59 |
*** xek has quit IRC | 15:59 | |
*** rpittau is now known as rpittau|afk | 15:59 | |
clarkb | and assume that is close enough | 15:59 |
fungi | yep, that's what i'm doing now | 16:00 |
fungi | RESULT_UNREACHABLE is what the executor has, which will actually be a lot more hits anyway | 16:00 |
AJaeger | fungi: I removed myself again from solum-core. Now waiting for slaweq to +A the networking-onos change and then those three repos can finish retiring | 16:02 |
*** ykarel is now known as ykarel|away | 16:04 | |
fungi | i've worked out a shell one-liner to pull the node names for each RESULT_UNREACHABLE failure, running this against all our executors now | 16:07 |
fungi | oh, right, this won't work on ze01 because containery, but i'll just snag the other 11 | 16:12 |
fungi | 1514 result_unreachable builds across ze02-12 in today's debug log | 16:12 |
*** ricolin has quit IRC | 16:13 | |
*** vishalmanchanda has quit IRC | 16:15 | |
*** psachin has quit IRC | 16:18 | |
*** yamamoto has joined #openstack-infra | 16:20 | |
fungi | the distribution looks like it may favor inap a lot more than the proportional quotas would account for: http://paste.openstack.org/show/795279 | 16:20 |
fungi | a little under a third of the unreachable failures occurred there, when they account for a lot less than a third of our aggregate quota | 16:21 |
fungi | the node label distribution indicates we see more on ubuntu than centos too: http://paste.openstack.org/show/795280 | 16:22 |
*** yamamoto has quit IRC | 16:27 | |
*** Lucas_Gray has quit IRC | 16:28 | |
clarkb | fungi: we probably want to filter for centos to isolate the tripleo case since it seems consistent | 16:31 |
fungi | yeah, i can also try to filter for tripleo jobs, i suppose | 16:32 |
*** mordred has quit IRC | 16:33 | |
*** gyee has joined #openstack-infra | 16:35 | |
*** ociuhandu_ has joined #openstack-infra | 16:36 | |
*** mordred has joined #openstack-infra | 16:38 | |
*** hamalq has joined #openstack-infra | 16:38 | |
*** ociuhandu has quit IRC | 16:38 | |
*** ociuhandu_ has quit IRC | 16:40 | |
*** hamalq_ has joined #openstack-infra | 16:40 | |
*** amoralej is now known as amoralej|off | 16:40 | |
fungi | grep $(grep $(grep '\[e: .* result RESULT_UNREACHABLE ' /var/log/zuul/executor-debug.log | sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Beginning.job.*tripleo/') /var/log/zuul/executor-debug.log | sed 's/.*\[e: \([0-9a-f]\+\)\].*/-e \1.*Provider:/') /var/log/zuul/executor-debug.log | sed 's/.*\\\\nProvider: \(.*\)\\\\nLabel: \(.*\)\\\\nInterface .*/\1 \2/' > nodes | 16:40 |
fungi | in case you wondered | 16:40 |
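The same pipeline unrolled for readability (an untested rewrite of the one-liner above; it keeps unreachable builds whose job name contains "tripleo" and writes "provider label" pairs to ./nodes):

    LOG=/var/log/zuul/executor-debug.log
    # 1. event ids of builds that ended with RESULT_UNREACHABLE
    grep '\[e: .* result RESULT_UNREACHABLE ' "$LOG" |
        sed 's/.*\[e: \([0-9a-f]\+\)\].*/\1/' | sort -u > unreachable-ids
    # 2. of those, keep the ids whose "Beginning job" line mentions tripleo
    grep -F -f unreachable-ids "$LOG" | grep 'Beginning.job.*tripleo' |
        sed 's/.*\[e: \([0-9a-f]\+\)\].*/\1/' | sort -u > tripleo-ids
    # 3. extract the Provider/Label details logged for those ids
    grep -F -f tripleo-ids "$LOG" | grep -F 'Provider:' |
        sed 's/.*\\\\nProvider: \(.*\)\\\\nLabel: \(.*\)\\\\nInterface .*/\1 \2/' > nodes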
*** smarcet has quit IRC | 16:40 | |
*** jpena is now known as jpena|off | 16:42 | |
fungi | 861 tripleo jobs with unreachable results in the logs today so far | 16:43 |
fungi | breakdown by provider-region: http://paste.openstack.org/show/795283 | 16:44 |
*** hamalq has quit IRC | 16:44 | |
fungi | and by node label: http://paste.openstack.org/show/795284 | 16:44 |
fungi | this was a crude match for any result_unreachable builds with "tripleo" in the job name | 16:45 |
weshay_pto | fungi, afaict.. there was a large event on 6/14 where this peaked and has been an issue since.. not as many hits as 6/14 though.. | 16:45 |
weshay_pto | you seeing anything similar? | 16:46 |
weshay_pto | w/ when this started | 16:46 |
fungi | i only analyzed today's debug log | 16:46 |
AJaeger | regarding these numbers: How does that compare to all runs? I mean: Do we run 3 times as many CentOS8 jobs as bionic for tripleo - and therefore the 3 times higher failure count is not significant? | 16:46 |
fungi | AJaeger: yeah, that's likely the case. i don't think these ratios are telling us much on the node label side. on the provider-region side it suggests that inap is getting a disproportionately larger number of these, i think | 16:47 |
*** priteau has quit IRC | 16:48 | |
fungi | interesting though that i'm getting some airship-kna1 node hits in here for tripleo jobs. that may mean i'm not filtering the way i thought. investigating | 16:49 |
mwhahaha | weshay_pto: we updated openvswitch on 6/16, perhaps that's the issue? | 16:50 |
mwhahaha | openvswitch-2.12.0-1.1.el8.x86_64.rpm  2020-06-16 07:57  2.0M | 16:50 |
fungi | ahh, nevermind, airship-kna1 also hosts a small percentage of normal node labels | 16:50 |
fungi | so that's expected | 16:50 |
weshay_pto | mwhahaha, could be part of the issue for sure.. given it's openvswitch.. but it would not explain the spike on 6/14 | 16:51 |
mwhahaha | i don't know if that's related | 16:51 |
rlandy|ruck | could we revert that upgrade? | 16:52 |
mwhahaha | given that the networking goes poof, how we configure the interfaces on the nodes, and the lack of like a kernel panic, it seems to be openvswitch | 16:52 |
mwhahaha | we don't see retry_failure on centos7 jobs right? | 16:53 |
mwhahaha | that didn't get updated | 16:53 |
weshay_pto | 6/14 spike looks like the mirror outtage 2020-06-14 06:42:29.732260 | primary | Cannot download 'https://mirror.kna1.airship-citycloud.opendev.org/centos/8/AppStream/x86_64/os/': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried. | 16:53 |
*** markvoelker has joined #openstack-infra | 16:53 | |
weshay_pto | so we can ignore that spike | 16:53 |
mwhahaha | yea that was the day of the mirror outages i think | 16:53 |
mwhahaha | i really think it's openvswitch | 16:53 |
fungi | i'm putting together some trends for RESULT_UNREACHABLE on "tripleo" named jobs per day over the past month | 16:54 |
mwhahaha | centos 8.2 came out on 6-11 | 16:55 |
weshay_pto | ya.. mwhahaha looking at each day.. we start to see network go down after openvswitch update | 16:56 |
mwhahaha | do we know when the first infra image switched to it? | 16:56 |
weshay_pto | 6/18 is the first day.. I see network going down | 16:56 |
weshay_pto | 1 hit on 6/17.. so not sure how long mirrors take to update.. | 16:57 |
fungi | normally they update every 2 hours | 16:57 |
*** markvoelker has quit IRC | 16:58 | |
fungi | but we pull from a mirror to make our mirror, so can be delayed by however long the mirror we're pulling from takes to reflect updates too | 16:58 |
*** derekh has quit IRC | 17:00 | |
mwhahaha | hrm the openvswitch release was a build w/o dpdk | 17:00 |
mwhahaha | so maybe not | 17:00 |
fungi | parsing 30 days of compressed executor logs is taking a bit of time, but i should have something soonish | 17:06 |
mwhahaha | my only other thought is that there is a known issue with iptables on the 8.2 kernel and we end up configuring the network/iptables about the time it fails | 17:09 |
*** smarcet has joined #openstack-infra | 17:09 | |
mwhahaha | though i would have thought there to be a stack trace on the console if that was the case | 17:10 |
*** ociuhandu has joined #openstack-infra | 17:11 | |
*** smarcet has quit IRC | 17:17 | |
*** kaiokmo has quit IRC | 17:17 | |
*** ociuhandu has quit IRC | 17:18 | |
openstackgerrit | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/738150 | 17:23 |
*** jamesmcarthur has quit IRC | 17:23 | |
*** jamesmcarthur has joined #openstack-infra | 17:23 | |
*** jamesmcarthur has quit IRC | 17:27 | |
fungi | ~14k unreachable results in the past month across 11 of our 12 executors | 17:38 |
*** udesale has quit IRC | 17:39 | |
fungi | mwhahaha: weshay_pto: here's what the hourly breakdown looks like for the past month: http://paste.openstack.org/show/795290 | 17:42 |
fungi | and here's the daily breakdown: http://paste.openstack.org/show/795291 | 17:42 |
fungi | note these are not scaled by the number of jobs run, these are simply counts of result_unreachable builds for jobs with "tripleo" in their names | 17:43 |
*** gfidente has quit IRC | 17:44 | |
fungi | er, fixed daily aggregates paste, the previous one had a bit of cruft at the beginning: http://paste.openstack.org/show/795292 | 17:45 |
*** jamesmcarthur has joined #openstack-infra | 17:50 | |
*** mtreinish has quit IRC | 18:04 | |
*** mtreinish has joined #openstack-infra | 18:05 | |
*** jamesmcarthur has quit IRC | 18:08 | |
clarkb | fungi: re airship kna we did that to help ensure things were running normally there | 18:13 |
clarkb | fungi: something we learned doing the tripleo clouds: having the resources dedicated to a specific purpose makes it harder to understand what is going on there when things break | 18:14 |
rlandy|ruck | from the pastes above, it looks like we have better days and worse days | 18:15 |
mwhahaha | probably related to the number of patches, though those numbers seem really high | 18:16 |
clarkb | fungi: also re ze01 the container should log to the same location as the non container runs | 18:16 |
rlandy|ruck | 819 2020-06-23 | 18:16 |
rlandy|ruck | 800 2020-06-24 | 18:16 |
rlandy|ruck | 846 2020-06-25 | 18:16 |
fungi | i concur, those could also indicate days where you simply had higher change activity | 18:16 |
rlandy|ruck | ^^ consistently bad though | 18:16 |
fungi | as i said, that's not scaled by the overall build count for those jobs | 18:16 |
rlandy|ruck | mwhahaha: do we have a next step here? something we can try on our end? | 18:20 |
mwhahaha | it's kinda hard because the layers of logging here | 18:21 |
mwhahaha | we really need to either reproduce it or get a node that it failed on | 18:21 |
* fungi checks to see if clarkb's hold caught anything | 18:22 |
clarkb | fungi: I was just checking and I don't think it did but double check as I'm trying to eat a sandwich too :) | 18:22 |
fungi | nope, it's still set | 18:23 |
mwhahaha | sudo make me a sandwich | 18:23 |
*** ralonsoh has quit IRC | 18:25 | |
clarkb | I think we can set up some holds on jobs likely to hit the issue then see if we catch any. Another approach could be to try and reproduce it outside of nodepool and zuul's control with a VM (or three) in inap | 18:32 |
*** markvoelker has joined #openstack-infra | 18:49 | |
*** markvoelker has quit IRC | 18:54 | |
*** smarcet has joined #openstack-infra | 19:04 | |
*** lbragstad_ has joined #openstack-infra | 19:04 | |
*** lbragstad has quit IRC | 19:06 | |
*** eolivare has quit IRC | 19:10 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Finish retirement of openstack-ux,solum-infra-guestagent https://review.opendev.org/737992 | 19:16 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Finish retirement of networking-onos https://review.opendev.org/738263 | 19:16 |
AJaeger | config-core, 737992 is ready to merge - onos is waiting for final approval. I thus split these, please review 992. | 19:16 |
clarkb | +2 | 19:17 |
AJaeger | thanks, clarkb | 19:17 |
EmilienM | rlandy|ruck: https://review.opendev.org/#/c/738025/ failed again :/ | 19:22 |
EmilienM | infra-core: I would really request a force merge if it's possible for you | 19:22 |
rlandy|ruck | EmilienM: don't worry about it - it may fail on the retry_limit | 19:24 |
clarkb | EmilienM: does it fix a gate bug? it isn't clear from the commit message why that would be a priority | 19:24 |
clarkb | rlandy|ruck: it did fail on retry limit but also something else | 19:25 |
rlandy|ruck | clarkb: no - it failed because we updated the cirros image for tempest but not the space requirements. our fault | 19:25 |
rlandy|ruck | it's a revert - it does fail gates though | 19:26 |
rlandy|ruck | but it could fail again on retry_limit so I'll just try the regular route of getting patches in | 19:26 |
clarkb | right but usually when we force merge something it's something that will fix gate failures. There is no indication in the commit message that it does this (note I don't expect it to fix the retry limits, but an indication that it fixes a testing bug, hence the bypass of testing, would be nice) | 19:26 |
rlandy|ruck | clarkb: yeah - updating the commit message with the bug details | 19:27 |
rlandy|ruck | ok - patch updated - but let's let it run through the regular channels | 19:32 |
clarkb | ok, that helps. Let us know if force merge is appropriate after it tries the normal route | 19:33 |
rlandy|ruck | clarkb: thanks | 19:33 |
*** smarcet has quit IRC | 19:47 | |
*** smarcet has joined #openstack-infra | 19:56 | |
*** smarcet has quit IRC | 20:01 | |
*** slaweq has quit IRC | 20:02 | |
*** smarcet has joined #openstack-infra | 20:05 | |
*** slaweq has joined #openstack-infra | 20:06 | |
*** slaweq has quit IRC | 20:10 | |
*** yamamoto has joined #openstack-infra | 20:25 | |
*** yamamoto has quit IRC | 20:30 | |
*** markvoelker has joined #openstack-infra | 20:35 | |
*** markvoelker has quit IRC | 20:39 | |
*** smarcet has quit IRC | 20:40 | |
*** hashar has quit IRC | 21:02 | |
*** armax has quit IRC | 21:10 | |
*** armax has joined #openstack-infra | 21:26 | |
*** paladox has quit IRC | 21:33 | |
*** paladox has joined #openstack-infra | 21:37 | |
*** lbragstad_ has quit IRC | 21:38 | |
*** markvoelker has joined #openstack-infra | 22:35 | |
*** markvoelker has quit IRC | 22:40 | |
*** tosky has quit IRC | 23:00 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Run openafs promote job only if gate job run https://review.opendev.org/738155 | 23:51 |