*** ryohayakawa has joined #opendev | 00:02 | |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_261/738714/2/check/system-config-run-gitea/261d1a6/gitea99.opendev.org/logs/access.log we have ports there | 00:06 |
clarkb | fungi: corvus ianw ^ https://review.opendev.org/#/c/738714/ could use rereview, though I need to call it a day then can restart things with that tomorrow | 00:06 |
ianw | i can restart if we like; but overall i'm not sure | 00:15 |
ianw | the UAs look a lot like those listed in https://www.informationweek.com/pdf_whitepapers/approved/1370027144_VRSNDDoSMalware.pdf, a 2013 article about a 2011 ddos tool, russkill | 00:16 |
ianw | however, https://amionrails.wordpress.com/2020/02/27/list-of-user-agent-used-in-ddos-attack-to-website/ is another one that links to | 00:19 |
ianw | https://github.com/mythsman/weiboCrawler/blob/master/opener.py | 00:19 |
ianw | that has the very specific | 00:20 |
ianw | ua's we see -- compare and contrast to http://paste.openstack.org/show/795414/ | 00:21 |
ianw | like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" ... that's no co-incidence | 00:22 |
ianw | https://github.com/mythsman/weiboCrawler is also suspicious | 00:23 |
ianw | "Opener.py independently encapsulates some anti-reptile header information." | 00:33 |
*** rchurch has quit IRC | 00:34 | |
ianw | kevinz: ^ i hope i'm not being rude asking but maybe you could translate more of what the intent is? | 00:35 |
*** Dmitrii-Sh has quit IRC | 00:38 | |
*** Dmitrii-Sh has joined #opendev | 00:43 | |
*** diablo_rojo has quit IRC | 00:44 | |
*** DSpider has quit IRC | 00:57 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 01:48 |
corvus | ianw: an alternate google translate of that phrase is "some anti-anti crawlers" | 02:16 |
ianw | corvus: yeah, it's pretty much a smoking gun for what's hitting us ... whether it is malicious or a university project gone wrong is probably debatable ... the result is the same anyway | 02:17 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 02:20 |
corvus | ianw: the code is clearly intended to avoid detection as a crawler. it looks like it's designed to crawl weibo without weibo detecting that it's a crawler | 02:27 |
corvus | perhaps repurposed | 02:29 |
*** sgw1 has quit IRC | 02:49 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 02:54 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 03:23 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 03:23 |
*** sgw1 has joined #opendev | 03:27 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 03:51 |
openstackgerrit | yatin proposed openstack/diskimage-builder master: Revert "Make ipa centos8 job non-voting" https://review.opendev.org/738728 | 03:57 |
*** ykarel|away is now known as ykarel | 04:24 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 04:28 |
*** iurygregory has quit IRC | 04:31 | |
*** sgw1 has quit IRC | 04:40 | |
*** mugsie has quit IRC | 04:53 | |
*** mugsie has joined #opendev | 04:57 | |
*** ysandeep|away is now known as ysandeep | 05:09 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: Add reverse proxy option https://review.opendev.org/738721 | 05:36 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: crawler UA reject rules https://review.opendev.org/738725 | 05:36 |
openstackgerrit | Federico Ressi proposed openstack/project-config master: Create a new repository for Tobiko DevStack plugin https://review.opendev.org/738378 | 05:48 |
*** factor has quit IRC | 06:03 | |
*** factor has joined #opendev | 06:03 | |
*** icarusfactor has joined #opendev | 06:05 | |
*** factor has quit IRC | 06:06 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: crawler UA reject rules https://review.opendev.org/738725 | 06:16 |
*** icarusfactor has quit IRC | 06:22 | |
*** bhagyashris is now known as bhagyashris|brb | 06:59 | |
*** hashar has joined #opendev | 07:09 | |
*** iurygregory has joined #opendev | 07:09 | |
*** sorin-mihai_ has joined #opendev | 07:15 | |
*** bhagyashris|brb is now known as bhagyashris | 07:15 | |
*** sorin-mihai_ has quit IRC | 07:16 | |
*** sorin-mihai_ has joined #opendev | 07:16 | |
*** sorin-mihai has quit IRC | 07:17 | |
*** sorin-mihai_ has quit IRC | 07:19 | |
*** sorin-mihai_ has joined #opendev | 07:19 | |
*** sorin-mihai_ has quit IRC | 07:21 | |
*** sorin-mihai_ has joined #opendev | 07:21 | |
*** sorin-mihai__ has joined #opendev | 07:23 | |
*** sorin-mihai_ has quit IRC | 07:26 | |
*** dtantsur|afk is now known as dtantsur | 07:35 | |
*** tosky has joined #opendev | 07:40 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** DSpider has joined #opendev | 08:12 | |
*** hashar has quit IRC | 08:17 | |
*** hashar has joined #opendev | 08:18 | |
openstackgerrit | Daniel Bengtsson proposed openstack/diskimage-builder master: Update the tox minversion parameter. https://review.opendev.org/738754 | 08:19 |
*** ysandeep is now known as ysandeep|lunch | 08:21 | |
*** ysandeep|lunch is now known as ysandeep | 09:21 | |
*** hashar has quit IRC | 09:25 | |
*** ryohayakawa has quit IRC | 09:31 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 09:34 |
*** hashar has joined #opendev | 09:42 | |
*** hashar is now known as hasharAway | 09:47 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 09:53 |
*** hasharAway has quit IRC | 09:57 | |
*** hashar has joined #opendev | 09:58 | |
*** tkajinam has quit IRC | 09:59 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 10:13 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 10:26 |
*** ysandeep is now known as ysandeep|brb | 10:27 | |
*** dtantsur is now known as dtantsur|brb | 10:27 | |
openstackgerrit | Slawek Kaplonski proposed openstack/project-config master: Update Neutron Grafana dashboard https://review.opendev.org/738784 | 10:28 |
*** ysandeep|brb is now known as ysandeep | 10:37 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:00 |
*** priteau has joined #opendev | 11:02 | |
*** kevinz has quit IRC | 11:11 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:12 |
sshnaidm|afk | just fyi, I saw a failure of the cleanup playbook in the console today. It didn't affect the job results or anything else, but in case you weren't aware: http://paste.openstack.org/show/795427/ | 11:19 |
*** sshnaidm|afk is now known as sshnaidm|ruck | 11:19 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:25 |
*** owalsh_ has joined #opendev | 11:55 | |
*** owalsh has quit IRC | 11:58 | |
*** ysandeep is now known as ysandeep|afk | 12:45 | |
*** dtantsur|brb is now known as dtantsur | 12:45 | |
*** sgw1 has joined #opendev | 12:55 | |
clarkb | sshnaidm|ruck: the way we are using the cleanup playbook is as a last effort to produce debug data at the end of jobs. The unreachable there is expected in the case of job failures that would otherwise break the job. In this case the job looks mostly happy at the end though. Is it possible that is the centos8 issue showing up later than usual? Maybe that points to the ssh keys being modified on the host by | 12:57 |
clarkb | the job somehow? | 12:57 |
clarkb | sshnaidm|ruck: long story short we expect it to fail, but really only when the job node was left that unhappy by the job | 12:57 |
sshnaidm|ruck | clarkb, ok, that's fine then | 13:05 |
sshnaidm|ruck | clarkb, I'm not sure any centos issue exists btw | 13:05 |
sshnaidm|ruck | clarkb, we tried the same image and job on third party CI and never got retry_limit | 13:06 |
sshnaidm|ruck | clarkb, even today there are far fewer retry_limits than in the previous 2 days, and I think I saw multiple attempts in non-tripleo jobs as well | 13:06 |
clarkb | yes, I've offered now like 5 times to attempt to hold nodes on our CI system or boot test nodes in our clouds, but no one will take me up on it | 13:07 |
clarkb | the issue is clearly centos8 related in that it happens there | 13:07 |
clarkb | it may require specific "hardware" or timing races to be triggered though | 13:07 |
sshnaidm|ruck | clarkb, if it was a centos issue, I | 13:07 |
openstackgerrit | Lance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server https://review.opendev.org/738842 | 13:07 |
sshnaidm|ruck | 'd expect it to be more consistent | 13:07 |
sshnaidm|ruck | 30% retry_limits yesterday and only a few today | 13:08 |
clarkb | sshnaidm|ruck: is the job load different though? | 13:08 |
clarkb | tripleo's queue looks very small right now | 13:08 |
clarkb | (we can be our own noisy neighbor, etc) | 13:09 |
sshnaidm|ruck | clarkb, during these 2 days it was usual, maybe yesterday more because of rechecks | 13:09 |
sshnaidm|ruck | clarkb, usually it's growing to US afternoon | 13:10 |
clarkb | also I think its still happening if you look at the queue there are many retry attempts | 13:10 |
clarkb | you're just getting lucky in that its not failing 3 in a row consistently | 13:10 |
clarkb | we should be careful about treating this as fixed if we're relying on it passing on attempt 2 or 3 | 13:12 |
clarkb | sshnaidm|ruck: ^ you may also want to double check that your third party CI checks weren't doing the same thing | 13:13 |
sshnaidm|ruck | clarkb, I set attempts:1 in both patches, on upstream and third party | 13:13 |
clarkb | ok, I'm just calling it out because I see the current jobs with a bunch of retries | 13:14 |
sshnaidm|ruck | clarkb, never got retry limit on 3party though, and today is much better in upstream too: https://review.opendev.org/#/c/738557/ | 13:14 |
sshnaidm|ruck | oh, got one | 13:14 |
fungi | also when i analyzed the per-provider breakdown, it was disproportionately more likely in some than others (not proportional to our quotas in each) suggesting that the timing/load/noisy-neighbor influence could be greater in some places than others | 13:15 |
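The per-provider breakdown fungi describes can be approximated with something like the sketch below; the failure tuples and quota numbers here are placeholders, not the real data:

```python
"""Sketch: count unreachable/retry failures per provider and normalize by
quota so disproportionately affected providers stand out.  All values
below are placeholders."""
from collections import Counter

failures = [  # (build, provider) pairs as extracted from logs/elasticsearch
    ("build-a", "rax-iad"),
    ("build-b", "inap-mtl01"),
    ("build-c", "rax-iad"),
]
quota = {"rax-iad": 100, "inap-mtl01": 50, "ovh-bhs1": 80}  # placeholder max-servers

counts = Counter(provider for _, provider in failures)
for provider, max_servers in sorted(quota.items()):
    print(f"{provider}: {counts[provider]} failures, "
          f"{counts[provider] / max_servers:.3f} per quota slot")
```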
sshnaidm|ruck | fungi, the problem is that I don't have logs, usually it's only a build id | 13:18 |
clarkb | sshnaidm|ruck: yes, this is why I've offered to set holds or boot test nodes | 13:18 |
clarkb | the logs are on the zuul test node and zuul can't talk to them to get the logs | 13:18 |
sshnaidm|ruck | clarkb, let's do it, I'm only for it | 13:18 |
clarkb | which means its difficult for zuul to do anything useful when we get in this situation. But if we set a hold maybe we can reboot the test node and log in or boot it on a rescue host or something | 13:19 |
clarkb | sshnaidm|ruck: is there a specific job + change combo we should set a hold that is likely to have this problem? | 13:19 |
sshnaidm|ruck | clarkb, all jobs here for example: https://review.opendev.org/#/c/738557/ | 13:20 |
clarkb | thats the big problem I have with setting a hold. I don't know what to set it on | 13:20 |
sshnaidm|ruck | all of them with attempts:1 | 13:20 |
clarkb | sshnaidm|ruck: do the -standalone jobs meet that criteria? Those would be good simply because they are single node which keeps the cost of holds down | 13:21 |
sshnaidm|ruck | clarkb, I think it's fine, the whole list is: http://paste.openstack.org/show/795437/ | 13:22 |
*** jhesketh has quit IRC | 13:22 | |
sshnaidm|ruck | clarkb, you can just remove "multinode" from there | 13:22 |
*** jhesketh has joined #opendev | 13:24 | |
clarkb | sshnaidm|ruck: I've set it on tripleo-ci-centos-8-standalone tripleo-ci-centos-8-scenario001-standalone and tripleo-ci-centos-8-scenario002-standalone. The first two are running in check so maybe we'll catch one shortly | 13:24 |
clarkb | I'll do 003 and 010 | 13:24 |
sshnaidm|ruck | clarkb, great | 13:25 |
clarkb | thats a good spread based on the jobs that are running I think | 13:25 |
sshnaidm|ruck | clarkb, can we pull any statistics which jobs have more attempts? | 13:25 |
sshnaidm|ruck | looking at this patch, seems like not only centos have more attempts: https://zuul.opendev.org/t/openstack/status/change/737983,2 | 13:26 |
clarkb | sshnaidm|ruck: sort of. I think we may only really record that if/when an attempt succeeds because we're logging that in elasticsearch. There is/was work to put it in the zuul db which may get it in all cases but I'm not sure if that has landed | 13:26 |
sshnaidm|ruck | "puppet-openstack-lint-ubuntu-bionic (2. attempt)" | 13:26 |
clarkb | ya retries aren't abnormal | 13:26 |
fungi | reattempts can be for a number of reasons | 13:26 |
fungi | failed to download a package during a pre phase playbook? retry | 13:27 |
fungi | in this case we specifically care about retries from the node becoming unreachable | 13:27 |
sshnaidm|ruck | I see, though I still suspect it's not centos, I couldn't reproduce any problem with the same image on a different ci | 13:29 |
clarkb | I don't think any of the jobs in this pass will hit it. Looking at console logs they seem to be further along than the toci quickstart | 13:30 |
clarkb | we'll just have to keep trying until one trips over the hold | 13:30 |
*** ysandeep|afk is now known as ysandeep | 13:31 | |
sshnaidm|ruck | clarkb, ok, so I will leave standalone, 001-003, 010 in this patch and will keep rechecking? | 13:32 |
sshnaidm|ruck | clarkb, please also add tripleo-ci-centos-8-scenario010-ovn-provider-standalone, seems like it may actually have a problem with the network: http://paste.openstack.org/show/795439/ | 13:33 |
clarkb | sshnaidm|ruck: yup, then let us know if one of those hits the issue and we'll see if we can reboot the host and get you ssh'd in | 13:33 |
sshnaidm|ruck | clarkb, great | 13:33 |
clarkb | sshnaidm|ruck: if that reboot doesn't fix things we can try to boot from snapshot or use a rescue instance too | 13:33 |
clarkb | added that hold | 13:34 |
fungi | in the past we've also seen job nodes which go unreachable suddenly become reachable again soon after the job ends | 13:34 |
sshnaidm|ruck | clarkb, now the different problem, what can be an issue with post_failure in finger://ze05.openstack.org/7be6bcfbd9274abba5caf69c65a3d519 ? No logs from job, it just builds containers, no tripleo there | 13:34 |
clarkb | sshnaidm|ruck: thats the same failure mode. If the job fails during the run step then it doesn't upload logs and you get the finger url | 13:35 |
clarkb | does the tripleo quickstart script build images too? | 13:35 |
sshnaidm|ruck | clarkb, no | 13:35 |
clarkb | perhaps this is a networking issue with docker/podman? | 13:35 |
sshnaidm|ruck | clarkb, completely only build containers job | 13:36 |
clarkb | and doing container things trips it | 13:36 |
sshnaidm|ruck | clarkb, can not be | 13:36 |
sshnaidm|ruck | like completely nothing touches network there | 13:36 |
clarkb | containers do though | 13:36 |
clarkb | including container image builds | 13:36 |
fungi | 2020-07-01 12:53:36,912 DEBUG zuul.AnsibleJob: [e: 7f319ac2c4284af3b8d5381a995ee25d] [build: 7be6bcfbd9274abba5caf69c65a3d519] Ansible complete, result RESULT_UNREACHABLE code None | 13:36 |
clarkb | because containers, if they are namespacing the network, need networking | 13:36 |
sshnaidm|ruck | a-ha, so node disappeared | 13:37 |
sshnaidm|ruck | clarkb, it doesn't run containers, just build them | 13:37 |
clarkb | iirc building a container image happens in a container | 13:37 |
fungi | https://review.opendev.org/738668 should help get more public info for those cases | 13:37 |
clarkb | I believe that is true for buildah and docker | 13:38 |
sshnaidm|ruck | clarkb, sorry, but it is not realistic that buildah does something to the node's network | 13:38 |
clarkb | why? | 13:38 |
clarkb | the container for the image build needs network access | 13:38 |
clarkb | it has to do something | 13:38 |
clarkb | I'm not sure what exactly that is, but it is doing something if I understand container image builds properly | 13:39 |
clarkb | basically I'm trying to not rule anything out | 13:39 |
corvus | clarkb, sshnaidm|ruck: the patch to store retried builds in the db has landed; but it's not fully deployed in opendev's zuul | 13:40 |
clarkb | let's see if we can recover system logs and go from there | 13:40 |
clarkb | but ruling things out before we have any logs is not very helpful | 13:40 |
clarkb | fungi: I've rechecked https://review.opendev.org/#/c/738710/1 and plan to approve the gitea side as the conference winds down | 13:42 |
*** bhagyashris is now known as bhagyashris|afk | 13:42 | |
fungi | i was considering pasting the executor loglines for build 7be6bcfbd9274abba5caf69c65a3d519 but it's 9302 lines and each is so long that i can only fit about 200 in a paste, so not wanting to make a series of 50 pastes out of that | 13:43 |
sshnaidm|ruck | clarkb, let's add also tripleo-build-containers-centos-8 tripleo-build-containers-centos-8-ussuri | 13:43 |
sshnaidm|ruck | clarkb, they're pretty short jobs | 13:44 |
sshnaidm|ruck | maybe we could catch it | 13:44 |
clarkb | sshnaidm|ruck: ok thats done. Lets get some of these jobs running again. Maybe push a new ps that disables the other jobs and then start iterating to see if we catch some? | 13:45 |
sshnaidm|ruck | clarkb, ack, retriggered now | 13:45 |
sshnaidm|ruck | clarkb, will try to figure out how to disable others, not so simple now.. | 13:46 |
sshnaidm|ruck | fungi, clarkb from this last post_failure: | 13:48 |
sshnaidm|ruck | 2020-07-01 11:51:13.001938 | TASK [upload-logs-swift : Upload logs to swift] | 13:48 |
sshnaidm|ruck | 2020-07-01 12:51:00.211178 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master] | 13:48 |
clarkb | sshnaidm|ruck: that means the job was uploading logs for about an hour and didn't finish in time | 13:49 |
clarkb | we have seen that happen due to sheer volume of log uploads in a job. | 13:49 |
clarkb | either a massive log file or many many files etc | 13:50 |
fungi | though we tended to see it a lot more before we started limiting the amount of log data a job could save | 13:50 |
clarkb | we do have a watchdog for that but it relies on checking after the fact iirc so isn't perfect | 13:50 |
fungi | yep | 13:50 |
clarkb | ya | 13:50 |
sshnaidm|ruck | this job has a few logs, unlike usual tripleo jobs, I believe it's network hiccup | 13:51 |
sshnaidm|ruck | Provider: vexxhost-ca-ymq-1 | 13:51 |
clarkb | sshnaidm|ruck: it could be, though I would expect ssh to complain in that case | 13:51 |
clarkb | because we rsync the logs from the test node to the executor then upload to swift | 13:52 |
clarkb | if it was a network issue on the test node the rsync should've failed. It's possible it's a bandwidth limitation between the executor and the swift endpoint I guess? | 13:52 |
sshnaidm|ruck | these jobs timed out to upload logs: http://paste.openstack.org/show/795442/ | 13:53 |
clarkb | different zuul executors | 13:54 |
sshnaidm|ruck | and clouds, and jobs.. | 13:54 |
clarkb | sshnaidm|ruck: did they both fail during upload-logs-swift? | 13:54 |
sshnaidm|ruck | clarkb, yes | 13:54 |
clarkb | ok, that step is swift upload from executor to one of 5 swift regions. 3 in rax and 2 in ovh | 13:54 |
sshnaidm|ruck | and this one as well http://paste.openstack.org/show/795443/ | 13:58 |
clarkb | the logs for successful runs of that job aren't particularly large. Not tiny either. Looks like ~20MB which is reasonable. I've also found successful builds with uploads to both ovh regions | 13:58 |
*** mlavalle has joined #opendev | 13:58 | |
sshnaidm|ruck | maybe the node also dies when uploading logs, but.. such timing seems weird | 13:58 |
fungi | it's more likely the count of individual log files being uploaded has a bigger impact on upload time than the aggregate byte count | 13:59 |
clarkb | sshnaidm|ruck: also the swift upload is not from the node to swift. Its executor to swift | 13:59 |
clarkb | fungi: the count is not bad either, though there are a few files | 13:59 |
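Since both aggregate size and file count influence the swift upload time, a quick way to check a collected log tree is a walk like this sketch (the path is a placeholder):

```python
"""Sketch: report file count and total bytes under a log directory."""
import os
import sys

def tally(root):
    files = total = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
                files += 1
            except OSError:
                pass
    return files, total

if __name__ == "__main__":
    count, size = tally(sys.argv[1] if len(sys.argv) > 1 else "logs/")
    print(f"{count} files, {size / (1024 * 1024):.1f} MiB")
```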
sshnaidm|ruck | clarkb, well, then even dying node doesn't explain.. | 13:59 |
clarkb | sshnaidm|ruck: the process is rsync the logs from node to executor then swift upload from executor to swift | 13:59 |
clarkb | what could be happening is we spend a large amount of time doing the earlier steps like that rsync | 14:00 |
sshnaidm|ruck | clarkb, ack | 14:00 |
clarkb | then by the time we do the swift upload we fail because there isn't much time left | 14:00 |
fungi | yeah, we'd need to profile the start/end times of the individual tasks | 14:01 |
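One way to do that profiling, as a rough sketch: parse the timestamps on a job-output style console log (the format seen in the paste above) and print how long each task/phase line ran before the next one started:

```python
"""Sketch: profile task durations from a job-output.txt style console log."""
import re
import sys
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \| "
    r"(?P<name>(PRE-RUN|RUN|POST-RUN|TASK).*)")

def profile(path):
    entries = []
    with open(path, errors="replace") as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
                entries.append((ts, m.group("name").strip()))
    # duration of each entry = gap until the next marker line
    for (start, name), (end, _next_name) in zip(entries, entries[1:]):
        print(f"{(end - start).total_seconds():8.1f}s  {name}")

if __name__ == "__main__":
    profile(sys.argv[1])
```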
clarkb | ya though looking at the timestamps sshnaidm|ruck posted here its spending an hour doing the upload logs to swift step | 14:02 |
clarkb | full context for the post-run stage would be good though | 14:03 |
clarkb | sshnaidm|ruck: can you paste that somewhere? | 14:03 |
clarkb | but also my ability to debug things deteriorates as the number of simultaneous issues increases | 14:03 |
sshnaidm|ruck | clarkb, yeah, preparing.. | 14:03 |
clarkb | the build uuid is useful too because we can use that to find the logs on the executor to see if there is any hints there | 14:04 |
*** hashar has quit IRC | 14:06 | |
corvus | clarkb: you're saying there's a second issue related to log uploading? | 14:08 |
clarkb | sshnaidm|ruck: to stop running jobs you want to edit this file: https://review.opendev.org/#/c/738557/3/zuul.d/layout.yaml. Under templates remove everything but the container builds and standalone template. Under check remove everything I think. Set check to [] or remove the check key entirely | 14:08 |
clarkb | corvus: yes, or perhaps its the same issue with different symptoms (like maybe the network issues are on the executors) | 14:08 |
clarkb | I don't think that is the case because I was able to reproduce failure to connect to test nodes from home on monday | 14:08 |
clarkb | my hunch is it is two separate issues | 14:09 |
corvus | clarkb: would you like me to look into http://paste.openstack.org/show/795443/ ? (is that a good example)? | 14:09 |
clarkb | corvus: I haven't been able to look at those more closely since they don't have build uuids and sshnaidm|ruck's examples pasted directly into irc weren't directly attributed to anything, but those are what we've got until sshnaidm|ruck pastes more context somewhere | 14:10 |
clarkb | basiclly I think that is the breadcrumb we've got but I don't know how good an example they are | 14:10 |
corvus | it has an event id and a job, i can track it down | 14:10 |
fungi | i concur, the node we witnessed go unreachable was really unreachable from everywhere, not just from the executor (i couldn't reach it from home either) | 14:10 |
sshnaidm|ruck | clarkb, https://pastebin.com/Zn30H9sj https://pastebin.com/4nz8V1W7 | 14:10 |
corvus | sshnaidm|ruck: what are those pastes? | 14:11 |
sshnaidm|ruck | corvus, consoles where upload logs times out | 14:11 |
corvus | cool -- since we're tracking multiple issues, it's good to be explicit :) | 14:11 |
sshnaidm|ruck | clarkb, corvus one build id is 0e70e37c7f8948a581483123aaab98a2 | 14:12 |
sshnaidm|ruck | second is bf81f447b4924b56a613c3449a474d30 | 14:13 |
clarkb | thanks | 14:13 |
corvus | i'll track down the executor logs for d30 | 14:14 |
clarkb | 0e70e37c7f8948a581483123aaab98a2 seems to have been unreachable when the cleanup playbook ran | 14:15 |
sshnaidm|ruck | yeah, second one was more stable | 14:16 |
corvus | same is true for d30 -- it spent 1 hour trying to upload logs, timed out, then unreachable for cleanup; that sounds like a hung ssh connection during the 1 hour upload playbook, then no further connections? | 14:17 |
corvus | when the final playbook fails, it logs the json output from the playbook; i'm looking through that | 14:18 |
corvus | it only has the json output from the previous playbook, likely because it killed the process running the final playbook, so that's not much help | 14:22 |
sshnaidm|ruck | clarkb, tripleo-ci-centos-8-scenario003-standalone has retry limit | 14:22 |
sshnaidm|ruck | \o/ | 14:22 |
clarkb | sshnaidm|ruck: cool let me see what we can find | 14:22 |
corvus | clarkb: however, the second-to-last playbook removes the build ssh key -- why does the cleanup playbook work at all? | 14:23 |
corvus | clarkb: (are we relying on a persistent connection lasting through that?) | 14:23 |
clarkb | corvus: oh we may with the control persistent thing | 14:24 |
sshnaidm|ruck | OMG almost all of them start to fail | 14:24 |
clarkb | 23.253.159.202 | 14:24 |
clarkb | it actually pings | 14:24 |
clarkb | and I can ssh in | 14:25 |
corvus | clarkb: i think we should take any unreachable errors on the cleanup playbook with a grain of salt. especially if we sat there for an hour doing the swift upload, it's likely controlpersist will timeout and we won't re-establish. so i think that's a red herring. | 14:25 |
clarkb | which suddenly has me thinking: zk connections dropping maybe? | 14:25 |
clarkb | sshnaidm|ruck: do you have an ssh key somewhere that I can put on the host? You'd be better able to look at tripleo logs to check them for oddities | 14:25 |
corvus | there are no zk errors in the scheduler log | 14:25 |
sshnaidm|ruck | clarkb, https://github.com/sshnaidm.keys | 14:26 |
fungi | and the scheduler isn't running out of memory (or even close) | 14:26 |
clarkb | sshnaidm|ruck: your key has been added. Let's not change anything on the server just yet as we try and observe what may be happening | 14:27 |
corvus | clarkb: do you have a build id handy for that? | 14:27 |
clarkb | corvus: not yet that was going to be my next thing | 14:27 |
sshnaidm|ruck | clarkb, thanks | 14:27 |
clarkb | corvus: finger://ze05.openstack.org/603b84def27f47ee99585d58436cdc72 I think | 14:27 |
sshnaidm|ruck | clarkb, which user? | 14:27 |
fungi | sshnaidm|ruck: root | 14:28 |
sshnaidm|ruck | I'm in | 14:28 |
clarkb | now this is interesting | 14:28 |
clarkb | its an ipv6 host | 14:29 |
clarkb | but ifconfig doesn't show me the ipv6 addr? | 14:29 |
clarkb | ya I think its ipv6 stack is gone | 14:29 |
clarkb | and we'd be using that for the job I'm pretty sure | 14:29 |
clarkb | sshnaidm|ruck: ^ | 14:29 |
sshnaidm|ruck | clarkb, hmm.. not sure why | 14:30 |
clarkb | I'm double checking that zuul would use that IP now | 14:30 |
fungi | looks like the first task to end in unreachable was prepare-node : Assure src folder has safe permissions | 14:30 |
clarkb | 2001:4802:7803:104:be76:4eff:fe20:3ee4 is the ipv6 addr nodepool knows about | 14:31 |
sshnaidm|ruck | fungi, this task takes a lot of time from what I saw in jobs.. | 14:31 |
clarkb | we don't seem to record either ip in the executor log | 14:31 |
clarkb | so need to do more digging to see what ip was used | 14:31 |
corvus | fungi: for build 603b84def27f47ee99585d58436cdc72 ? that looks like it succeeded to me. | 14:31 |
fungi | corvus: oh, that was the last task of 11 for the play where one task was unreachable | 14:32 |
clarkb | corvus: 603b84def27f47ee99585d58436cdc72 hit unreachable | 14:32 |
corvus | clarkb: yes, i was saying i disagreed with fungi about what task was running at the time | 14:33 |
clarkb | ah | 14:33 |
clarkb | sorry, I read it as successful job; you meant successful task | 14:33 |
corvus | yep, i'll take my own advice and use more words | 14:33 |
corvus | we do have all of the command output logs available in /tmp on the host | 14:34 |
corvus | /tmp/console-* | 14:34 |
fungi | i'm still having trouble digesting ansible output. the way it aggregates results makes it hard to spot which task failed in that play | 14:34 |
clarkb | ooh should we copy those off so that a reboot doesn't clear tmp? | 14:34 |
corvus | yes | 14:34 |
corvus | note that the most recent one is console-bc764e04-a4bf-cd3a-2382-000000000016-primary.log and exited 0 | 14:35 |
clarkb | k I'll do that copy once I've confirmed the ip used was the ipv6 addr | 14:35 |
clarkb | the copy of the logs | 14:35 |
corvus | clarkb: save timestamps | 14:35 |
corvus | since that's about the only way to line them up with tasks | 14:35 |
corvus | oh they have timestamps in them :) | 14:36 |
corvus | so not critical, but still helpful :) | 14:36 |
clarkb | rgr | 14:36 |
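For reference, one way to stash the console logs somewhere a reboot won't clear while keeping their timestamps (shutil.copy2 preserves mtime); the destination path here is arbitrary, not what was actually used:

```python
"""Sketch: copy /tmp/console-* off to a safer location, preserving mtimes."""
import glob
import os
import shutil

DEST = os.path.expanduser("~/console-logs")  # arbitrary destination
os.makedirs(DEST, exist_ok=True)
for path in sorted(glob.glob("/tmp/console-*")):
    shutil.copy2(path, DEST)  # copy2 keeps mtime/atime
    print(f"copied {path}")
```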
fungi | this is a rax node, so would have gotten its ipv6 address via glean reading it from configdrive, right? | 14:37 |
clarkb | fungi: oh yup. Also I've just found other rax-iad jobs and they use the ipv4 addr as ansible_host | 14:38 |
fungi | weird | 14:38 |
clarkb | so ipv6 may just be a shiny nuisance | 14:38 |
clarkb | I think glean may not configure ipv6 on red hat distros | 14:38 |
clarkb | with static config drive config | 14:38 |
clarkb | we are swapping quite a bit on ze05 as well | 14:39 |
clarkb | copying logs now | 14:39 |
sshnaidm|ruck | clarkb, node 23.253.159.202 - from which jobs is it? | 14:42 |
sshnaidm|ruck | fungi, clarkb, do you have console logs for it just in case? | 14:42 |
clarkb | sshnaidm|ruck: tripleo-ci-centos-8-scenario003-standalone | 14:43 |
sshnaidm|ruck | ack | 14:43 |
*** diablo_rojo has joined #opendev | 14:43 | |
clarkb | sshnaidm|ruck: the console logs are all in /tmp/console-* on that host | 14:43 |
corvus | clarkb: i'm having trouble lining this up with the executor log | 14:43 |
clarkb | I've pulled them onto my desktop bceause /tmp there could be cleared out on a restart | 14:43 |
corvus | can we double check that the ip and uuid match | 14:43 |
clarkb | corvus: oh I can do that | 14:43 |
corvus | clarkb: my understanding is we're looking at 603b84def27f47ee99585d58436cdc72 on ze05, right? | 14:44 |
clarkb | | 0017542272 | rax-iad | centos-8 | e7bd207a-87d2-4394-877e-696b381c34f2 | 23.253.159.202 | 2001:4802:7803:104:be76:4eff:fe20:3ee4 | hold | 00:00:45:56 | unlocked | main | centos-8-rax-iad-0017542272 | 10.176.193.212 | None | 22 | nl01-10-PoolWorker.rax-iad-main | 300-0009721443 | openstack | 14:44 |
clarkb | opendev.org/openstack/tripleo-ci tripleo-ci-centos-8-scenario003-standalone refs/changes/57/738557/.* | tripleo with sshnaidm debug retry failures | | 14:44 |
clarkb | that is the nodepool hold | 14:44 |
clarkb | and that tripleo-ci-centos-8-scenario003-standalone job had the finger url I posted above. But I'm double checking again | 14:44 |
sshnaidm|ruck | wow, so many console logs in /tmp/ | 14:44 |
clarkb | 2020-07-01 13:49:50,678 INFO zuul.AnsibleJob: [e: e0622c34b7944cfcadddb92079a3e537] [build: 603b84def27f47ee99585d58436cdc72] Beginning job tripleo-ci-centos-8-scenario003-standalone for ref refs/changes/57/738557/3 (change https://review.opendev.org/738557) | 14:45 |
clarkb | the job and change are correct at least | 14:45 |
clarkb | still trying to be sure on the build. Not sure what key we use between nodepool and executor logs | 14:45 |
clarkb | ok thats weird | 14:46 |
corvus | should be able to correlate node id to build on scheduler | 14:46 |
corvus | 2020-07-01 11:27:56,980 INFO zuul.ExecutorClient: [e: 7f319ac2c4284af3b8d5381a995ee25d] Execute job tripleo-ci-centos-8-scenario003-standalone (uuid: 1f2db893d451453bacfb80face66ec92) on nodes <NodeSet single-centos-8-node [<Node 001754227 ('primary',):centos-8>]> for change <Change 0x7fab39611990 openstack/tripleo-ci 738557,2> with dependent changes [{'project': {'name': 'openstack/tripleo-ci', | 14:47 |
corvus | 'short_name': 'tripleo-ci', 'canonical_hostname': 'opendev.org', 'canonical_name': 'opendev.org/openstack/tripleo-ci', 'src_dir': 'src/opendev.org/openstack/tripleo-ci'}, 'branch': 'master', 'change': '738557', 'change_url': 'https://review.opendev.org/738557', 'patchset': '2'}] | 14:47 |
clarkb | the ansible facts for 603b84def27f47ee99585d58436cdc72 show the hostname is centos-8-inap-mtl01-0017550604 | 14:47 |
corvus | clarkb: the scheduler log i just pasted is for the node id you pasted, but it's a different uuid | 14:47 |
corvus | it's also several hours old | 14:47 |
clarkb | its like we held a rax node but the job ran on inap? | 14:48 |
fungi | 23.253.159.202 is centos-8-rax-iad-0017542272 if that's what everyone's still looking at | 14:48 |
sshnaidm|ruck | clarkb, do you have other nodes? | 14:49 |
fungi | huh, yeah that's extra weird | 14:49 |
corvus | clarkb: how did you arrive at the uuid 603b84def27f47ee99585d58436cdc72 ? | 14:49 |
*** mwhahaha has joined #opendev | 14:49 | |
clarkb | sshnaidm|ruck: yes all of the standalone jobs ended up being held | 14:49 |
clarkb | corvus: I went to the zuul web ui and retrieved the finger url for the job that triggered the hold | 14:50 |
clarkb | corvus: the jobs have been modified on that change to set attempts: 1 and so won't retry beyond the first failure | 14:50 |
corvus | clarkb: uuid 603b84def27f47ee99585d58436cdc72 ran on node 001754227 | 14:51 |
corvus | is that one held? | 14:51 |
weshay_ruck | sshnaidm|ruck, mwhahaha https://review.opendev.org/#/c/738557/ | 14:51 |
clarkb | corvus: yes | 0017542272 | rax-iad | centos-8 | e7bd207a-87d2-4394-877e-696b381c34f2 | 23.253.159.202 is the held node | 14:52 |
clarkb | oh yours is short a digit? is that a mispaste? | 14:52 |
corvus | checking | 14:52 |
clarkb | [build: 603b84def27f47ee99585d58436cdc72] Ansible output: b' "ansible_hostname": "centos-8-inap-mtl01-0017550604",' <- from the executor logs | 14:54 |
clarkb | I would expect the ansible_hostname fact to be centos-8-rax-iad-0017542272 | 14:54 |
corvus | yes -- i think that whole number was a mispaste; let's start over | 14:54 |
corvus | 603b84def27f47ee99585d58436cdc72 ran on 0017550604 | 14:55 |
corvus | 1f2db893d451453bacfb80face66ec92 ran on 0017542272 | 14:55 |
clarkb | ok that matches the ansible host value we see in the build so that's good. That doesn't explain why the finger url on the web ui seems to be wrong | 14:56 |
corvus | which one are we going to look at? | 14:56 |
clarkb | the hold is for 0017542272 so we want 1f2db893d451453bacfb80face66ec92 | 14:56 |
clarkb | oh you know what | 14:56 |
sshnaidm|ruck | clarkb, seems like 23.253.159.202 is the wrong node | 14:56 |
clarkb | I think I know what happened. sshnaidm|ruck abandoned the change and restored it to rerun jobs | 14:57 |
clarkb | the hold was already in place when that happened. Do we hold when jobs are cancelled for that state change? | 14:57 |
clarkb | 603b84def27f47ee99585d58436cdc72 is what we want a hold for, but not what we got a hold for | 14:57 |
clarkb | the other holds we got for the other jobs are much newer and all in inap | 14:58 |
clarkb | I'm thinking lets ignore standalone003 and look at the other jobs? | 14:58 |
clarkb | | 0017550599 | inap-mtl01 | centos-8 | d4bf117d-ef6d-427e-9081-fb37cd069594 | 198.72.124.39 | | hold | 00:00:26:07 | unlocked | main | centos-8-inap-mtl01-0017550599 | 198.72.124.39 | nova | 22 | nl03-7-PoolWorker.inap-mtl01-main | 300-0009723644 | openstack | 14:58 |
clarkb | opendev.org/openstack/tripleo-ci tripleo-ci-centos-8-standalone refs/changes/57/738557/.* | tripleo with sshnaidm debug retry failures | | 14:58 |
clarkb | That hold looks like a proper hold | 14:58 |
corvus | clarkb: the autohold info may be the most concise way to correlate them | 14:59 |
sshnaidm|ruck | clarkb, 198.72.124.39 ? | 14:59 |
clarkb | and I think finger://ze08.openstack.org/b5b5835abca34522933ee496c49514a6 is related to 0017550599 but I'm double checking that now before I do anything else | 14:59 |
clarkb | sshnaidm|ruck: yes, but let me double check my values :) | 14:59 |
sshnaidm|ruck | clarkb, sure | 14:59 |
corvus | clarkb: what autohold number is that? | 15:00 |
clarkb | [build: b5b5835abca34522933ee496c49514a6] Ansible output: b' "ansible_hostname": "centos-8-inap-mtl01-0017550599",' | 15:00 |
corvus | autohold-info 0000000160 Held Nodes: [{'build': 'b5b5835abca34522933ee496c49514a6', 'nodes': ['0017550599']}] | 15:00 |
corvus | clarkb: is that correct ^ ? | 15:01 |
clarkb | corvus: 0000000160 yup | 15:01 |
clarkb | that server does not ping | 15:01 |
clarkb | which is more like what we expected | 15:01 |
corvus | build b5b5835abca34522933ee496c49514a6 ran on ze08 | 15:02 |
corvus | it failed 24 minutes into 2020-07-01 13:57:59,664 DEBUG zuul.AnsibleJob.output: [e: e0622c34b7944cfcadddb92079a3e537] [build: b5b5835abca34522933ee496c49514a6] Ansible output: b'TASK [run-test : run toci_gate_test.sh executable=/bin/bash, _raw_params=set -e' | 15:03 |
clarkb | server show on the server instance shows no errors from nova | 15:03 |
clarkb | I can try a reboot and see if it comes back | 15:03 |
fungi | TASK: oooci-build-images : Run build-images.sh | 15:03 |
clarkb | but I'll hold off on that to ensure we've done all the debugging we can | 15:04 |
clarkb | mgagne: ^ if you're around you may be able to help as well as this is an inap instance | 15:04 |
fungi | oh, again, i'm looking at the last task in the play which had the first unreachable result... how do you identify the task which actually was unreachable? | 15:05 |
corvus | fungi: i don't see that line at all. are you looking at build b5b5835abca34522933ee496c49514a6 on ze08? | 15:05 |
fungi | oh, i know what i'm doing wrong | 15:06 |
fungi | i resolved build b5b5835abca34522933ee496c49514a6 to event e0622c34b7944cfcadddb92079a3e537 so i'm looking at the entire buildset | 15:06 |
fungi | now that i'm not looking at interleaved logs from multiple builds, yes this is easier to follow | 15:08 |
fungi | i agree it was run-test : run toci_gate_test.sh during which the node became unreachable | 15:08 |
clarkb | the neutron port for that IP address shows it is active and I don't see any errors | 15:08 |
clarkb | I'm now trying to confirm that that port is actually attached to the instance | 15:09 |
clarkb | the port device id matches our instance id | 15:09 |
clarkb | now I'm going to double check security groups | 15:10 |
clarkb | the security group is the expected default group with all ingress and egress allowed as expected | 15:11 |
fungi | unfortunately the nova console log buffer has been overrun by network traffic logging | 15:11 |
clarkb | fungi: the time range is about 45 minutes though | 15:12 |
clarkb | we can find when the instance booted to see if any crashes should show up in that | 15:12 |
clarkb | | created | 2020-07-01T13:48:50Z | 15:13 |
fungi | yep, current console log covers 50 minutes of time and ends at 4827 seconds | 15:14 |
clarkb | 4827seconds + 2020-07-01T13:48:50Z is ~= 10 minutes ago? | 15:14 |
clarkb | maybe a bit more recent? | 15:14 |
fungi | so basically an hour and twenty minutes after 13:48:50 | 15:14 |
fungi | ~15:08z yep | 15:15 |
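Spelling out that arithmetic: the instance creation time plus the console-log uptime puts the last console output at roughly 15:09 UTC:

```python
from datetime import datetime, timedelta

created = datetime.strptime("2020-07-01T13:48:50Z", "%Y-%m-%dT%H:%M:%SZ")
print(created + timedelta(seconds=4827))  # -> 2020-07-01 15:09:17
```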
clarkb | 2020-07-01 14:21:08,075 is when failure occurred | 15:15 |
fungi | so maybe 7 minutes ago | 15:15 |
*** ysandeep is now known as ysandeep|away | 15:15 | |
clarkb | its possible a crash occurred earlier and ansible didn't know about it I guess | 15:15 |
fungi | it's continuing to log | 15:15 |
clarkb | I'm running out of things to consider from the api side of things | 15:15 |
clarkb | anything else people want to try before we ask for a reboot? | 15:16 |
fungi | so yes the instance is still running and outputting network traffic loggnig on its console | 15:16 |
fungi | looks like it's logging its own dhcp requests | 15:16 |
fungi | oh, actually these look like other machines's dhcp requests maybe | 15:17 |
fungi | i suppose they could be virtual machines running on this instance | 15:17 |
corvus | fungi: any identifying info like mac? | 15:17 |
clarkb | fa:16:3e:fa:da:5e is the port mac according to the openstack api | 15:18 |
fungi | MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00 | 15:18 |
fungi | MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:18 |
fungi | i keep seeing those same two logged on ens3 | 15:18 |
*** _mlavalle_1 has joined #opendev | 15:19 | |
fungi | udp datagrams from 0.0.0.0:68 to 255.255.255.255:67 so definitely dhcp discovery | 15:19 |
corvus | oh you know what, i'm silly -- that's not going to tell us anything, because of course the internal vms are openstack, and the real vm is also openstack, so they're all going to be fa:16:3e :) | 15:19 |
clarkb | centos-8-1593208195 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:20 |
clarkb | corvus: ya | 15:20 |
* clarkb figures out the other one | 15:20 | |
fungi | yup. also if they're other machines and not this one but somehow still showing up on our port (because they're broadcast) they'll also likely be openstack | 15:20 |
clarkb | er thats a mispaste image name | 15:21 |
clarkb | centos-8-inap-mtl01-0017550603 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:21 |
clarkb | centos-8-inap-mtl01-0017550600 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00 | 15:21 |
clarkb | those also happen to be two held nodes for other jobs that failed | 15:22 |
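The MAC= field in those netfilter log lines is destination MAC (6 bytes), source MAC (6 bytes), then the ethertype, which is how the source MACs above were pulled out and matched to the neighbouring nodes; a small sketch:

```python
def split_nflog_mac(mac_field):
    """Split the MAC= field from a kernel netfilter log line."""
    parts = mac_field.split(":")
    return (":".join(parts[:6]),    # destination (ff:ff:... = broadcast)
            ":".join(parts[6:12]),  # source
            ":".join(parts[12:]))   # ethertype (08:00 = IPv4)

print(split_nflog_mac("ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00"))
# -> ('ff:ff:ff:ff:ff:ff', 'fa:16:3e:e9:0c:f4', '08:00')
```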
*** mlavalle has quit IRC | 15:22 | |
corvus | so there's a dhcp process running on our failed node which is continuning to receive and log requests from other inap vms nearby? | 15:22 |
clarkb | corvus: I think its just the iptables logging from the kernel that is receiving them | 15:23 |
clarkb | what that implies to me is l2 is fine | 15:23 |
clarkb | its l3 that is failing | 15:23 |
corvus | ack | 15:23 |
clarkb | and those other two hosts also don't ping | 15:23 |
clarkb | so ya they all seem to have working links and ethernet is functioning. IP is not functioning | 15:24 |
clarkb | I'm ready to try a reboot if no one objects | 15:24 |
corvus | ++ i think we learned a lot. reboot ++ | 15:24 |
fungi | no objection here | 15:24 |
fungi | we also have a couple more candidates to compare, sounds like | 15:24 |
fungi | it's also possible these are reachable from other nodes in the same network but not from outside (lost their default route?) | 15:25 |
fungi | we can check that with one of the others | 15:25 |
clarkb | it pings now | 15:26 |
clarkb | and I'm in | 15:26 |
clarkb | ssh root@198.72.124.39 for the others (I'll get sshnaidm|ruck's key shortly) | 15:26 |
clarkb | the console logs are still in /tmp so reboot didn't blast that away | 15:26 |
sshnaidm|ruck | clarkb, did you restart it? | 15:26 |
clarkb | sshnaidm|ruck: yes | 15:26 |
sshnaidm|ruck | ack | 15:26 |
fungi | rebooted via nova api | 15:26 |
clarkb | sshnaidm|ruck: your key should be working now | 15:27 |
sshnaidm|ruck | I'm in | 15:27 |
sshnaidm|ruck | clarkb, which job is it? | 15:27 |
clarkb | Jul 1 14:16:56 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <warn> [1593613016.1172] dhcp4 (ens3): request timed out | 15:28 |
clarkb | sshnaidm|ruck: tripleo-ci-centos-8-standalone | 15:28 |
fungi | i wonder, could something be putting a firewall rule in place on the public interface which blocks dhcp requests? | 15:29 |
clarkb | fungi: I was just going to mention that earlier in syslog there is a bunch of iptables updates :) | 15:29 |
clarkb | I don't know that it is breaking dhcp yet but that definitely seems to be an order of ops | 15:30 |
fungi | that would explain the random behavior... depends on if it needs to renew a lease during that timeframe | 15:30 |
clarkb | ansible-iptables does a bunch of work then soon after dhcp stops | 15:30 |
clarkb | fungi: ya | 15:30 |
fungi | i wonder if i can find the dhcp requests getting blocked in the console log | 15:31 |
fungi | checking | 15:31 |
fungi | fa:16:3e:fa:da:5e is the mac for ens3 | 15:31 |
clarkb | sshnaidm|ruck: ^ maybe you can look into that? I'm not sure what tripleo's iptables rulesets are intended to do. Also you could cross check against successful jobs to see if they fail to renew a lease (usually you renew at 1/2 the lease time so you'd possibly still fail to renew but have a working IP until a bit later) | 15:31 |
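A tiny illustration of that renewal-timing point, with placeholder times: a client normally renews at half the lease time, so an address survives for a while after renewals start failing, which would make any breakage look random:

```python
from datetime import datetime, timedelta

lease_acquired = datetime(2020, 7, 1, 13, 50)  # placeholder
lease_time = timedelta(hours=2)                # placeholder

renew_at = lease_acquired + lease_time / 2     # first renewal attempt
expires_at = lease_acquired + lease_time       # address dropped if never renewed
print(f"renew from {renew_at}, address lost at {expires_at}")
```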
sshnaidm|ruck | mwhahaha, weshay_ruck ^ | 15:31 |
sshnaidm|ruck | brb | 15:32 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 15:32 | |
*** factor has joined #opendev | 15:32 | |
clarkb | that host has rebooted with no iptables rules ? | 15:32 |
clarkb | which is I guess good for debugging but surprising since we write out a ruleset I thought | 15:32 |
fungi | indeed, iptables -L and ip6tables -L are both entirely empty | 15:33 |
clarkb | maybe that doesn't work on centos8 | 15:34 |
clarkb | something to investigate | 15:34 |
fungi | i'm not finding any blocked dhcp requests logged by iptables, but i have a feeling the iptables logging is not comprehensive (probably does not include egress, for example) | 15:37 |
clarkb | I'm also having a hard time finding the lease details for our current lease | 15:37 |
clarkb | oh neat I think beacuse ens3 config is static now | 15:38 |
clarkb | the network type in config drive is ipv4 not ipv4_dhcp or whatever the value is | 15:40 |
clarkb | so I think static config is actually what we expect | 15:40 |
mwhahaha | so the mtu is bad i believe | 15:40 |
clarkb | is something kicking it over to dhcp which is then failing because there is no dhcp server? | 15:40 |
mwhahaha | it shouldn't be 1500 | 15:40 |
mwhahaha | ? | 15:40 |
clarkb | mwhahaha: why not? | 15:40 |
mwhahaha | most clouds it's not 1500 | 15:40 |
mwhahaha | because of tenant networking | 15:40 |
clarkb | most of ours are | 15:40 |
mwhahaha | you sure? | 15:40 |
clarkb | I think openedge is our only cloud with a smaller mtu | 15:40 |
clarkb | mwhahaha: ya they do jumbo frames to carry tenant networks allowing tenant traffic to have a proper 1500 mtu | 15:41 |
mwhahaha | k | 15:41 |
fungi | Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.5251] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) | 15:41 |
fungi | that's almost an hour after the node booted | 15:41 |
fungi | could the job be reconfiguring ens3 to dhcp? | 15:41 |
clarkb | fungi: cool I think thats the smoking gun. This should be statically configured based on the config drive stuff | 15:41 |
mwhahaha | centos did have a bug with an ens3 existing always btw | 15:41 |
* mwhahaha digs up bug | 15:41 | |
mwhahaha | the cloud image that is | 15:42 |
clarkb | it switches to dhcp, fails to dhcp because neutron isn't doing dhcp there and then network manager helpfully unconfigures our interface | 15:42 |
clarkb | we reboot and go back to our existing glean config and are statically configured again | 15:42 |
fungi | that may also explain why the dhcp discovery datagrams we were seeing logged by the node's iptables drop rules were from other nodes experiencing the same issue | 15:42 |
mwhahaha | https://bugs.launchpad.net/tripleo/+bug/1866202 | 15:42 |
openstack | Launchpad bug 1866202 in tripleo "OVB on centos8 fails because of networking failures" [Critical,Fix released] - Assigned to wes hayutin (weshayutin) | 15:42 |
clarkb | fungi: oh ya they're all asking for dhcp and no one can respond | 15:42 |
clarkb | and this would happen in any other cloud that is statically configured too | 15:42 |
mwhahaha | so if /etc/sysconfig/network-scripts/ifcfg-en3 exists, when we restart legacy networking it'll nuke the address | 15:42 |
fungi | because that cloud doesn't have a dhcpd to answer | 15:42 |
clarkb | rax is at least | 15:42 |
clarkb | mwhahaha: that will do it | 15:43 |
clarkb | /etc/sysconfig/network-scripts/ifcfg-en3 is how glean configures the interface | 15:43 |
clarkb | ens3 but ya | 15:43 |
mwhahaha | though on that node it's configured static | 15:43 |
mwhahaha | so seems weird | 15:43 |
clarkb | mwhahaha: it is statically configured via the network script and NM's sysconfig compat layer | 15:43 |
mwhahaha | ovs and networkmanager don't play nicely so i wonder if there's an issue around that | 15:44 |
clarkb | and this could explain why your third party ci doesn't see it. If that cloud is configured to use dhcp then it will work fine | 15:44 |
fungi | also explains why we saw disproportionately more of these in some providers... those are likely the ones with no dhcpd | 15:45 |
clarkb | fungi: ya I think rax and inap are the non dhcp cases for us | 15:45 |
clarkb | everyone else does dhcp I think | 15:45 |
mwhahaha | hrm no, /etc/sysconfig/network-scripts is empty in the image i pulled from nb02 | 15:45 |
mwhahaha | so we're not hitting that bug where the left over thing is there | 15:45 |
clarkb | mwhahaha: its written by glean on boot based on config drive information | 15:45 |
mwhahaha | yea but we shouldn't be changing it to dhcp (and it's not dhcp on that node) | 15:46 |
clarkb | mwhahaha: well something is telling NM to dhcp | 15:46 |
clarkb | per the syslog fungi pasted above | 15:46 |
mwhahaha | clearly | 15:46 |
clarkb | and its happening about an hour after the node boots so it isn't glean | 15:46 |
* mwhahaha goes digging | 15:46 | |
clarkb | (at least it would be super weird for boot units to fire that late) | 15:47 |
clarkb | mwhahaha: we can add your ssh key to this node if it helps to dig with logs | 15:47 |
clarkb | I just need a copy of the pubkey | 15:47 |
mordred | wow. I pop in to see how things are going and I see a conversation about NM randomly dhcping an hour after boot | 15:48 |
mwhahaha | i think sshnaidm|afk added me if you're looking at centos-8-inap-mtl01-0017550599 | 15:48 |
clarkb | mwhahaha: yup thats the one | 15:48 |
clarkb | mwhahaha: not sure if you had all the background on it. We caught the held node. Observed its link layer was working based on iptables drop logging in the console log but it did not ping or ssh | 15:49 |
fungi | i'll correlate the timestamp to the job | 15:49 |
clarkb | mwhahaha: after double checking the cloud hadn't error'd with openstack apis we rebooted the instance and it came back | 15:49 |
clarkb | currently it appears to be using the glean static config as expected which is why it is working | 15:49 |
clarkb | syslog shows at some point NM switched to dhcp which failed and unconfigured the interface | 15:49 |
clarkb | we've also got ~2 more instances in the unpingable state that we can probably reboot if necessary but for now keeping them in that state might be good to have in our back pocket | 15:51 |
clarkb | note the container image build jobs seem to hit this too. I wonder if podman/buildah are doing NM config? | 15:51 |
fungi | looks like the "run toci_gate_test.sh" task starts at 13:57:59 and then dhcp4 is activated on ens3 by NetworkManager at 14:15:25 we log the unreachable state for the node at 14:21:07 | 15:52 |
clarkb | and with that I'm going to find breakfast and maybe do a bike ride since we're somewhere this is deuggable | 15:52 |
clarkb | fungi: mwhahaha sshnaidm|afk /tmp/console-* will have console logs generated by the host too | 15:52 |
clarkb | those may offer a more exact correlation to whatever did the thing | 15:52 |
mwhahaha | so we run os-net-config right before networkmanager starts doing stuff | 15:53 |
mwhahaha | let me trouble shoot this | 15:53 |
fungi | there is a bit of time between dhcp starting to try to get a lease and giving up/unconfiguring the interface (more likely picking a v4 linklocal address to replace the prior static one with) and then a bit more delay before ansible decides ssh access is timing out | 15:55 |
mwhahaha | had glean always used network manager? | 15:55 |
fungi | i believe it had to for centos-8 and newer fedora | 15:55 |
mwhahaha | k | 15:55 |
clarkb | we switched centos7 to it too I think | 15:55 |
mwhahaha | we aren't touching ens3 in our os-net-config configuration | 15:55 |
clarkb | but would need to double check that | 15:55 |
fungi | ianw knows a lot more of the history there, he had to struggle mightily to find something consistent for everything | 15:56 |
mwhahaha | we're adding a bridge br-ex to br-ctlplane which we already configured | 15:56 |
mwhahaha | so let me see if i can figure out why networkmanager tries to dhcp ens3 | 15:56 |
clarkb | mwhahaha: fungi we could rerun suspect things on that host and reboot to bring it back if it breaks | 15:56 |
clarkb | that may help narrow it down quickly | 15:56 |
mwhahaha | Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.4888] device (ens3): state change: activated -> deactivating (reason 'connection-removed', sys-iface-state: 'managed') | 15:57 |
mwhahaha | is likely the cause but i don't know why that's occurring | 15:57 |
mwhahaha | so we do a systemctl restart network which is the legacy network stuff and it seems to mess with networkmanager | 15:57 |
fungi | neat | 15:58 |
clarkb | you could try it on that held node and see if it breaks | 15:59 |
mwhahaha | so we restart networking, network manager bounces ens3 and it tries to reconnect it using dhcp | 15:59 |
clarkb | I wonder if that is an 8.2 change | 15:59 |
mwhahaha | even though ifcfg-ens3 is configured as static | 15:59 |
mwhahaha | i bet it is but not certain | 15:59 |
*** shtepanie has joined #opendev | 15:59 | |
mwhahaha | can you point me to the glean config code | 16:00 |
clarkb | https://opendev.org/opendev/glean/src/branch/master/glean/cmd.py | 16:00 |
mwhahaha | thx | 16:00 |
*** sshnaidm|afk is now known as sshnaidm|ruck | 16:01 | |
clarkb | that is the bulk of it and there should be a systemd unit that calls that for ens3 on the host | 16:01 |
clarkb | though it also uses udev so it is a parameterized unit | 16:01 |
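For context, a very rough sketch of the end result of glean's static configuration: an ifcfg file that NetworkManager consumes through its sysconfig compat plugin. The real logic lives in glean/cmd.py linked above; the netmask/gateway values and the exact key set here are assumptions, not what glean literally writes:

```python
IFCFG_TEMPLATE = """\
DEVICE={device}
BOOTPROTO=static
ONBOOT=yes
IPADDR={address}
NETMASK={netmask}
GATEWAY={gateway}
"""

def write_ifcfg(device, address, netmask, gateway,
                path="/etc/sysconfig/network-scripts"):
    # Write the static config the sysconfig compat layer will pick up.
    with open(f"{path}/ifcfg-{device}", "w") as f:
        f.write(IFCFG_TEMPLATE.format(
            device=device, address=address, netmask=netmask, gateway=gateway))

# e.g. write_ifcfg("ens3", "198.72.124.39", "255.255.255.0", "198.72.124.1")
# (netmask/gateway above are placeholders, not the real inap values)
```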
mwhahaha | you folks rebooted this to get the node back right? | 16:02 |
clarkb | yes | 16:02 |
mwhahaha | yea so that points to networkmanager doing silly stuff | 16:03 |
mwhahaha | i think we can work around it by just disabling networkmanager in a pre task | 16:04 |
mwhahaha | for now | 16:04 |
clarkb | will that unconfigure the static config NM sets? | 16:05 |
clarkb | worth a try I guess | 16:05 |
fungi | yeah, easy enough to test | 16:06 |
mwhahaha | it shouldn't | 16:06 |
mwhahaha | stopping the networkmanager service should still leave the networking configured | 16:07 |
mwhahaha | it should prevent networkmanager from waking up and touching the interface | 16:07 |
mwhahaha | weshay_ruck, sshnaidm|ruck: we should be able to reproduce this by launching a vm on a network w/o DHCP and configuring the interface statically via network manager. then 'service restart network' | 16:11 |
mwhahaha | on a centos8.2 vm | 16:11 |
*** etp has quit IRC | 16:16 | |
fungi | looking at cacti graphs for gitea-lb01, the background noise from our rogue distributed crawler is still ongoing | 16:19 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 16:20 | |
clarkb | fungi: ya I'm trying to get out the door for a bike ride but afterwards I'd like to land and apply our logging updates for haproxy and gitea | 16:21 |
clarkb | and see if the post failures for swift uploads are persitent and dig into that more | 16:21 |
fungi | sounds good. i'm still trying to catch up since half of every day this week has been conference | 16:24 |
clarkb | fungi: also did you see ianw found the likely tool that is hitting us? | 16:25 |
clarkb | based on the UA values | 16:25 |
fungi | yup | 16:25 |
fungi | i fiddled with machine translating the readme | 16:25 |
fungi | seems a very likely suspect | 16:26 |
*** hashar has joined #opendev | 16:29 | |
*** ykarel is now known as ykarel|away | 16:34 | |
openstackgerrit | Merged openstack/project-config master: Update Neutron Grafana dashboard https://review.opendev.org/738784 | 16:39 |
*** dtantsur is now known as dtantsur|afk | 16:44 | |
*** factor has quit IRC | 16:51 | |
*** icarusfactor has joined #opendev | 16:51 | |
*** hashar has quit IRC | 16:58 | |
*** moppy has quit IRC | 17:15 | |
*** corvus has quit IRC | 17:15 | |
*** bolg has quit IRC | 17:15 | |
*** guillaumec has quit IRC | 17:15 | |
*** andreykurilin has quit IRC | 17:15 | |
*** factor has joined #opendev | 17:16 | |
*** icarusfactor has quit IRC | 17:16 | |
*** hashar has joined #opendev | 17:17 | |
*** corvus has joined #opendev | 17:18 | |
*** guillaumec has joined #opendev | 17:18 | |
*** moppy has joined #opendev | 17:18 | |
*** andreykurilin has joined #opendev | 17:19 | |
*** slittle1 has quit IRC | 17:26 | |
*** weshay_ruck has quit IRC | 17:26 | |
*** mtreinish has quit IRC | 17:26 | |
*** cloudnull has quit IRC | 17:26 | |
*** hrw has quit IRC | 17:26 | |
*** AJaeger has quit IRC | 17:26 | |
*** slittle1 has joined #opendev | 17:29 | |
*** weshay_ruck has joined #opendev | 17:29 | |
*** mtreinish has joined #opendev | 17:29 | |
*** cloudnull has joined #opendev | 17:29 | |
*** hrw has joined #opendev | 17:29 | |
*** AJaeger has joined #opendev | 17:29 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx https://review.opendev.org/737315 | 17:30 |
hrw | tarballs.opendev.org have PROJECTNAME-stable-BRANCH.tar.gz tarballs for most of projects. What defines which projects get them? | 17:39 |
*** bolg has joined #opendev | 17:41 | |
fungi | hrw: those are created by the publish-openstack-python-branch-tarball job included in the post pipeline by lots of project-templates and also directly by some projects | 17:42 |
fungi | mostly it'll be projects which use the openstack-python-jobs project-template or one of its dependency-specific variants | 17:43 |
hrw | fungi: thanks. | 17:45 |
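For reference, hooking a project into that job is normally just a matter of including the template in its zuul configuration; a minimal sketch (the project name here is hypothetical):

    - project:
        name: example/some-project
        templates:
          # pulls in publish-openstack-python-branch-tarball via the post pipeline
          - openstack-python-jobs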
*** hashar is now known as hasharAway | 17:47 | |
openstackgerrit | Merged zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx https://review.opendev.org/737315 | 18:19 |
*** hasharAway is now known as hashar | 18:25 | |
openstackgerrit | Lance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server https://review.opendev.org/738842 | 18:28 |
*** chandankumar is now known as raukadah | 18:39 | |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-pip debian: update package lists https://review.opendev.org/737529 | 18:39 |
clarkb | I've approved https://review.opendev.org/#/c/738714/ | 18:41 |
clarkb | fungi: I'm thinking maybe we single core approve https://review.opendev.org/#/c/738710/ ? I think having that extra logging will be important to continue monitoring this ddos | 18:43 |
fungi | wfm, it's been tested | 18:44 |
corvus | +3 | 18:44 |
fungi | i was only hesitating to self-approve in case folks disagreed with the approach there | 18:44 |
clarkb | corvus: thanks! | 18:44 |
fungi | since it's a general haproxy role we might use for other things with different logging styles | 18:45 |
fungi | so i could have set it per-backend instead of in the default section | 18:45 |
clarkb | fungi: ya I think we can sort that out if we end up adding different backend types | 18:45 |
corvus | i bet if we used it for others we'd want consistency | 18:45 |
corvus | (oh, yeah, if we used different backend types that might be different.) | 18:45 |
fungi | though the way we're generating the backends from a template now would still need reworking to support other forwarders | 18:45 |
corvus | anyway, that's a future problem | 18:46 |
corvus | and i'm in the present | 18:46 |
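For context, the gist of the change being discussed is dropping 'option tcplog' in favour of an explicit log-format in the defaults section so the selected backend server's ip:port is recorded alongside the client's. A rough sketch only, not the exact reviewed config:

    defaults
        log global
        mode tcp
        # roughly the stock tcplog fields plus the server address:port (%si:%sp)
        log-format "%ci:%cp [%t] %ft %b/%s %si:%sp %Tw/%Tc/%Tt %B %ts"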
*** factor has quit IRC | 18:46 | |
*** factor has joined #opendev | 18:46 | |
clarkb | mwhahaha: if you all end up sorting out the underlying issue it would be great to hear details so that we're aware of those issues generally (since others do use those images too) | 16:48 |
mwhahaha | will do | 18:48 |
mwhahaha | i don't think it affects others because most folks integrate with networkmanager; os-net-config's NM support is still WIP so we still use the legacy network scripts | 16:48 |
clarkb | gotcha | 18:49 |
mwhahaha | but if i can figure out the RCA i'll get a bz and a ML post together | 18:49 |
fungi | that would be awesome | 18:49 |
fungi | we sort of expect the situation with glean+nm to be a little fragile after everything we discovered about the interactions between ipv6 slaac autoconf in the kernel and nm fighting over interfaces | 18:50 |
clarkb | unfortunately it's the system rhel and friends are committed to, so we're trying to play nice | 16:50 |
mwhahaha | yea the best thing to do would be NM_managed=No | 18:50 |
mwhahaha | it looked like it was still yes for ens3 | 18:51 |
mwhahaha | anyway time to try and reproduce :D | 18:51 |
mwhahaha | at least now we have a direction, thanks for your efforts | 18:51 |
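For reference, the usual spelling of that knob in initscripts-style ifcfg files is NM_CONTROLLED=no; an illustrative static config (addresses are placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-ens3
    DEVICE=ens3
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=203.0.113.10
    NETMASK=255.255.255.0
    GATEWAY=203.0.113.1
    NM_CONTROLLED=no    # keep NetworkManager's hands off this interface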
*** factor has quit IRC | 19:04 | |
*** factor has joined #opendev | 19:04 | |
*** factor has quit IRC | 19:10 | |
*** _mlavalle_1 has quit IRC | 19:10 | |
clarkb | I'm going to clean up some of those holds now. In particular the one that had us confused as it isn't useful | 19:12 |
clarkb | corvus: fungi is the proper way to do that via zuul autohold commands or nodepool delete? | 19:12 |
clarkb | I think I've been doing nodepool delete in the past but realized when corvus asked for a hold id earlier today that maybe I can do that via zuul? | 19:13 |
clarkb | ya I think I may have been doing this wrong | 19:15 |
*** _mlavalle_1 has joined #opendev | 19:15 | |
mwhahaha | i'm still poking at centos-8-inap-mtl01-0017550599 so if you could keep that around for a few that'd be great | 19:16 |
mwhahaha | i did grab logs/configs but i want to make sure i got everything | 19:16 |
fungi | clarkb: i had been doing it through nodepool delete, but recently learned that removing the autohold is better since that will still cause nodepool to clean up the nodes | 19:18 |
clarkb | mwhahaha: yup will leave that one alone | 19:18 |
mwhahaha | thanks | 19:18 |
clarkb | fungi: cool I was just about to test that | 19:18 |
fungi | yeah, better because it doesn't leave orphaned autohold entries around | 19:18 |
clarkb | ya I've got a couple to clean up | 19:19 |
clarkb | now to double check which autohold id is the one mwhahaha is on | 19:21 |
clarkb | id 0000000160 is the one mwhahaha is on and I kept 0000000161 too in case we need a second one but cleaned up the others | 19:22 |
clarkb | the gitea and haproxy things should be landing soon too. I'll be sure to restart giteas for the new format once that config is applied | 19:24 |
mwhahaha | is there an easy way to manually run glean to configure the network? | 19:25 |
mwhahaha | trying to reproduce how it would configure it vs how the installer does | 19:25 |
clarkb | mwhahaha: you can invoke the script that the unit runs either directly or by triggering the unit. Let me take a quick look to be more specific | 19:26 |
clarkb | mwhahaha: the unit is glean@.service and the @ means it takes a parameter, in this case the interface name. I think you trigger that with systemctl start glean@ens3 ? But also I notice the unit has a condition where it won't run if the sysconfig files exist | 19:27 |
clarkb | in this case it may be easiest to run it directly, and for that you want Environment="ARGS=--interface %I" and ExecStart=/usr/local/bin/glean.sh --use-nm --debug $ARGS | 19:28 |
clarkb | %I is magical systemd interpolation for the argument and in this case it would be ens3 | 19:28 |
mwhahaha | yea that's what i'm looking for, thanks | 19:28 |
mwhahaha | i'll figure it out from there | 19:28 |
clarkb | and if you want to know what triggers it with the parameter I think it is a udev rule | 19:29 |
clarkb | ya /etc/udev/rules.d/99-glean.rules SUBSYSTEM=="net", ACTION=="add", ATTR{addr_assign_type}=="0", TAG+="systemd", ENV{SYSTEMD_WANTS}+="glean@$name.service" | 19:29 |
clarkb | glean@ens3 is the name to use with systemctl | 19:29 |
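Putting that together, roughly how the pieces fit (a sketch paraphrasing the lines quoted above rather than the exact files shipped on the image):

    # glean@.service is a parameterized (templated) oneshot unit, abridged:
    #   [Service]
    #   Environment="ARGS=--interface %I"
    #   ExecStart=/usr/local/bin/glean.sh --use-nm --debug $ARGS

    # trigger it for ens3 via systemd (subject to the unit's condition that the
    # sysconfig file does not already exist):
    systemctl start glean@ens3

    # or bypass the unit and run the script directly:
    /usr/local/bin/glean.sh --use-nm --debug --interface ens3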
openstackgerrit | Merged opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 19:33 |
*** yoctozepto7 has joined #opendev | 19:37 | |
openstackgerrit | Merged opendev/system-config master: Remove the tcplog option from haproxy configs https://review.opendev.org/738710 | 19:40 |
corvus | clarkb, fungi: yes, delete via zuul autohold delete | 19:42 |
corvus | one command instead of 2 | 19:42 |
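So the cleanup flow is roughly as follows (command spellings are from memory and may vary slightly between zuul client versions; the tenant is a placeholder, the hold id is the one mentioned below):

    # find the hold id
    zuul autohold-list --tenant <tenant>

    # deleting the hold lets nodepool reclaim the node on its own
    zuul autohold-delete --tenant <tenant> 0000000160

    # older approach: delete the node directly, which leaves the autohold entry behind
    nodepool delete <node-id>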
*** yoctozepto has quit IRC | 19:45 | |
*** yoctozepto7 is now known as yoctozepto | 19:45 | |
openstackgerrit | Sean McGinnis proposed openstack/project-config master: update-constraints: Install pip for all versions https://review.opendev.org/738926 | 19:53 |
clarkb | I've restarted gitea on gitea01 to pick up the access log format change | 19:55 |
clarkb | it is happy so I'm working through the others then will double check the lb is similarly updated | 19:55 |
weshay_ruck | clarkb, fungi thanks guys! | 19:59 |
*** sorin-mihai__ has quit IRC | 20:01 | |
mwhahaha | so it looks like the ens3 config gets lost | 20:01 |
mwhahaha | glean writes out the file and it's named 'System ens3': NetworkManager[1023]: <info> [1593611369.2978] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085) | 20:02 |
clarkb | infra-root we now have ip:port recorded in haproxy and gitea logs so we can map between them | 20:02 |
mwhahaha | but when we restart networking, it doesn't have this so it creates a Wired connection 1 | 20:02 |
clarkb | mwhahaha: so when I rebooted it, it didn't just reread the existing config; instead glean reran and rewrote the config? interesting | 20:04 |
mwhahaha | maybe? | 20:04 |
mwhahaha | because now it's back to being 'System ens3' | 20:04 |
clarkb | ianw should be waking up soon and may have NM thoughts | 20:05 |
fungi | oh, i'm sure he has nm thoughts, but most are probably angry ones | 20:07 |
mwhahaha | ha | 20:08 |
mwhahaha | let me see if i can see where we might be removing this file all of the sudden | 20:08 |
mwhahaha | seems weird still | 20:08 |
mwhahaha | yea it's like ifcfg-ens3 goes missing. the behaviour is that of a node where ifcfg-ens3 is removed and when you restart networkmanager it creates the 'Wired connection 1' | 20:11 |
mwhahaha | we shouldn't be touching that | 20:11 |
openstackgerrit | Sean McGinnis proposed openstack/project-config master: Use python3 for update_constraints https://review.opendev.org/738931 | 20:13 |
mwhahaha | sweet, it's os-net-config | 20:14 |
mwhahaha | could you do me a favor and restart that node | 20:16 |
* mwhahaha nukes the network | 20:16 | |
clarkb | mwhahaha: I can | 20:16 |
mwhahaha | thank you | 20:16 |
clarkb | mwhahaha: ready for that now? | 20:16 |
mwhahaha | yes plz | 20:16 |
clarkb | reboot issued, will probably be a minute before ssh is accessible again | 20:17 |
*** sgw1 has quit IRC | 20:35 | |
mwhahaha | yea i don't think it's coming back, oh well | 20:38 |
clarkb | mwhahaha: we have another held node, and now that we know what is going on we can boot out-of-band centos8 images in inap too | 20:53 |
clarkb | assuming we need them let me know | 20:53 |
mwhahaha | no i think we have some direction. you should be able to release them for now | 20:53 |
mwhahaha | we're going to disable networkmanager as a starting point while we investigate what's happening to that config | 20:54 |
clarkb | k I'll clean those up in a bit | 20:56 |
clarkb | tomorrow I'll plan to land https://review.opendev.org/#/c/737885/ as that should make the gitea api interactions a bit more correct | 21:02 |
clarkb | but will want to watch it afterwards and it's already "late" in my day relative to my early start | 21:02 |
clarkb | the follow on to ^ needs work though | 21:02 |
clarkb | also I've just noticed that ianw implemented the apache proxy for gitea with UA filtering | 21:03 |
clarkb | so reviewing that now too | 21:03 |
clarkb | those changes actually look good. fungi corvus https://review.opendev.org/#/c/738721/4 the way it's done we don't cut over haproxy to it, so we could land those changes, deploy the apache, test it, then switch haproxy. Should be very safe | 21:06 |
clarkb | ianw: ^ thanks for doing that I like the approach being able to be a measured transition in prod too | 21:07 |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed opendev/system-config master: Add copr-lxc3 to list of mirrors https://review.opendev.org/738942 | 21:15 |
corvus | clarkb, ianw: +2, but i hope we don't have to use it | 21:20 |
*** priteau has quit IRC | 21:21 | |
*** factor has joined #opendev | 21:29 | |
*** factor has quit IRC | 21:33 | |
*** factor has joined #opendev | 21:33 | |
*** hashar has quit IRC | 21:49 | |
*** factor has quit IRC | 21:49 | |
*** factor has joined #opendev | 21:50 | |
fungi | well, the traffic we saw yesterday seems to be continuing, so i expect it may come down to either that or leaving every customer of the largest isp in china blocked with iptables | 21:54 |
fungi | no idea when (or if) it will ever subside | 21:55 |
ianw | hey, around now | 21:59 |
clarkb | ianw: no rush, was just looking at your change to apache filter UAs in front of gitea. I think we can roll that out if we decide it is necessary | 22:01 |
ianw | i figure with the proxy we could just watch the logs for 301's until they disappear, and then turn it off | 22:01 |
clarkb | ya | 22:01 |
clarkb | though aren't you doing 403? | 22:01 |
ianw | sorry 403 yeah | 22:01 |
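A 403-on-matching-user-agent rule in Apache generally looks something like this (illustrative only; the pattern shown is one of the crawler UA fragments from the earlier discussion, not the exact list in ianw's change):

    # mod_setenvif tags the offending UAs, mod_authz_core denies them with a 403
    SetEnvIfNoCase User-Agent "SE 2\.X MetaSr 1\.0" bad_crawler
    <Location "/">
        <RequireAll>
            Require all granted
            Require not env bad_crawler
        </RequireAll>
    </Location>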
ianw | if anyone is feeling containerish and wants too look over https://review.opendev.org/#/q/topic:grafana-container+status:open that would be great too | 22:03 |
ianw | really the only non-standard thing in there is in graphite where i've used ipv4 and ipv6 socat on 8125 to send into the graphite container | 22:04 |
clarkb | ianw: also we managed to catch a centos8 node that lost networking during a tripleo job. Rebooting it brought it back. Turns out that something to do with restarting networking in os-net-config or similar caused our static config for the interface to be wiped out and NM switched to dhcp | 22:05 |
clarkb | tripleo is going to try and workaround it by disabling NM but also looking into possibility that centos8.2 changed that and broke things | 22:05 |
clarkb | ianw: small thing on https://review.opendev.org/#/c/737406/18 | 22:11 |
clarkb | and question on https://review.opendev.org/#/c/738125/7 | 22:15 |
*** DSpider has quit IRC | 22:18 | |
ianw | clarkb: yeah, the host network thing is the ipv6 thing, which is addressed in https://review.opendev.org/#/c/738125/7/playbooks/roles/graphite/tasks/main.yaml @ 63 | 22:20 |
ianw | basically, i found that when setting up port 8125 to go into the container, docker would bind to ipv6 | 22:20 |
ianw | but the container has no ipv6 handling at all | 22:21 |
clarkb | ianw: what about host networking? | 22:21 |
clarkb | is the issue that with host networking the graphite services only bind on 0.0.0.0 ? | 22:21 |
clarkb | maybe it would be better to use host networking with the proxy (if we can configure graphite services to bind on other ports then proxy to them), that way we can be consistent? | 22:22 |
ianw | hrm, yeah what i tried first was having host with 8125 and hoping i could run a socat proxy for ipv6 8125 but found docker took it over | 22:24 |
ianw | yeah, i guess if it's different ports it doesn't matter, let me try | 22:25 |
clarkb | ya in my suggestion we'd put statsd on 8126 and then the proxy listens on 8125 and forwards | 22:25 |
clarkb | at least that way it's weird, but consistently weird with our normal setup :) | 22:25 |
ianw | systemd actually has a nice socket activated forwarding service | 22:26 |
ianw | ... that doesn't support udp | 22:26 |
clarkb | ha | 22:26 |
fungi | "nice" | 22:29 |
clarkb | I guess they do that because it just wants to stop at the syn ack syn | 22:31 |
fungi | it's synful | 22:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Graphite container deployment https://review.opendev.org/738125 | 22:31 |
ianw | clarkb: ^ ok, so that has socat listening on ipv4&ipv6 8125 which forwards to 8825 (8126 is taken by statsd admin) which docker should map to 8125 in the container ... phew! | 22:32 |
clarkb | hrm I'm not sure docker will do the port mapping with host mode | 22:33 |
clarkb | (it might) | 22:33 |
clarkb | we may need to configure the services to use a different port if that doesn't work | 22:33 |
ianw | OOOHHHHH yeah THAT's right! | 22:33 |
ianw | that's why i did it | 22:33 |
*** icarusfactor has joined #opendev | 22:33 | |
ianw | docker silently says "oh btw i'm not doing the port mapping" but continues on | 22:34 |
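The socat arrangement being described is roughly the following (a sketch; exact options in the review may differ, and as just noted the 8825-to-8125 remap only happens if docker is actually doing port mapping, i.e. not in host network mode):

    # relay the host's IPv4 and IPv6 statsd port to the container's alternate port
    socat -u UDP4-RECVFROM:8125,fork UDP4-SENDTO:127.0.0.1:8825 &
    socat -u UDP6-RECVFROM:8125,ipv6only,fork UDP4-SENDTO:127.0.0.1:8825 &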
*** factor has quit IRC | 22:36 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Handle multi-arch docker manifests in promote https://review.opendev.org/738945 | 22:39 |
ianw | i wonder if with host networking ipv6 gets into the container | 22:41 |
fungi | i would expect so | 22:43 |
fungi | we have other containerized services listening on v6 sockets | 22:43 |
fungi | our gitea haproxy, for example | 22:44 |
ianw | the problem is i'll have to re-write the upstream container to have ipv6 bindings | 22:45 |
ianw | maybe it's worth the pain | 22:45 |
clarkb | they don't make it configurable? that's too bad if so | 22:45 |
ianw | they did take one of my patches fairly quickly | 22:45 |
*** tkajinam has joined #opendev | 22:46 | |
corvus | why are there port bindings with host network mode? | 22:50 |
clarkb | corvus: they were there from an earlier ps which did not use host networking | 22:51 |
corvus | ok, so next ps will remove those? | 22:51 |
corvus | (cause ps8 has both) | 22:51 |
ianw | corvus: this is the question ... i'm trying to avoid having to hack the pre-built container for ipv6 | 22:52 |
clarkb | another option could be to drop our AAAA record from dns for that service | 22:52 |
ianw | yeah, i feel like that's the worst option, it's a regression on what we have | 22:53 |
openstackgerrit | Merged zuul/zuul-jobs master: Handle multi-arch docker manifests in promote https://review.opendev.org/738945 | 22:53 |
corvus | i'd love to stick with host networking like all the others, so if we could change the bind that would be great | 22:58 |
ianw | there's two things to deal with in the container; gunicorn presenting graphite and statsd | 22:59 |
clarkb | it looks like we can mount in configs for both | 23:01 |
clarkb | we'd then need to write the configs but I think that will allow us to change the bind? | 23:01 |
ianw | statsd maybe, that launches from a config file | 23:01 |
clarkb | as an alternative we could change the bind upstream since ipv6 listening at :: will also work for ipv4 | 23:01 |
ianw | it looks like gunicorn is started from the script with "-b" arguments | 23:02 |
ianw | urgh, the other thing is it has no support for ssl | 23:05 |
ianw | it runs nginx | 23:08 |
*** _mlavalle_1 has quit IRC | 23:11 | |
corvus | wonder why they didn't just use uwsgi | 23:16 |
*** tosky has quit IRC | 23:16 | |
ianw | if we map in the keys, a nginx config, a statsd config it might just work | 23:20 |
ianw | ... assuming upstream never changes anything | 23:20 |
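For illustration, the kind of bind overrides being floated might look like this (hedged: the statsd config keys are the standard etsy/statsd ones, but the file locations inside the upstream image and the wsgi module name are assumptions):

    // statsd config: bind to the IPv6 wildcard, which also accepts IPv4 on a dual-stack host
    { "address": "::", "port": 8125, "graphiteHost": "127.0.0.1", "graphitePort": 2003 }

    # gunicorn takes the same idea via -b (module name is a placeholder)
    gunicorn -b '[::]:8080' wsgi:application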