*** cloudnull has quit IRC | 00:07 | |
*** cloudnull has joined #opendev | 01:01 | |
*** brinzhang0 has joined #opendev | 01:21 | |
*** brinzhang_ has quit IRC | 01:24 | |
*** ysandeep|away is now known as ysandeep | 02:24 | |
*** DSpider has quit IRC | 02:39 | |
*** ykarel has joined #opendev | 04:40 | |
*** ykarel has quit IRC | 04:45 | |
*** ykarel has joined #opendev | 05:04 | |
*** hamalq has quit IRC | 05:37 | |
*** marios has joined #opendev | 06:15 | |
*** slaweq has quit IRC | 07:18 | |
*** slaweq has joined #opendev | 07:20 | |
*** ralonsoh has joined #opendev | 07:28 | |
*** ralonsoh_ has joined #opendev | 08:17 | |
*** ralonsoh has quit IRC | 08:20 | |
*** ralonsoh_ has quit IRC | 08:29 | |
*** hashar has joined #opendev | 08:51 | |
*** ralonsoh has joined #opendev | 09:25 | |
*** lpetrut has joined #opendev | 09:31 | |
*** ralonsoh has quit IRC | 09:31 | |
*** danpawlik has quit IRC | 09:32 | |
*** danpawlik0 has joined #opendev | 09:32 | |
*** otherwiseguy has quit IRC | 10:04 | |
*** otherwiseguy has joined #opendev | 10:05 | |
*** ralonsoh has joined #opendev | 10:11 | |
*** dtantsur|afk is now known as dtantsur | 10:21 | |
*** TheJulia has joined #opendev | 10:28 | |
*** rpittau|afk has joined #opendev | 10:30 | |
*** johnsom has joined #opendev | 10:33 | |
*** ralonsoh has quit IRC | 10:35 | |
*** ralonsoh has joined #opendev | 10:48 | |
*** ralonsoh has quit IRC | 11:00 | |
*** ralonsoh has joined #opendev | 11:12 | |
*** ralonsoh has quit IRC | 11:15 | |
*** ralonsoh has joined #opendev | 11:15 | |
*** ralonsoh_ has joined #opendev | 11:22 | |
*** ralonsoh has quit IRC | 11:25 | |
*** tosky has joined #opendev | 12:02 | |
*** icey has quit IRC | 12:04 | |
*** icey has joined #opendev | 12:04 | |
*** hashar is now known as hasharLunch | 12:14 | |
*** DSpider has joined #opendev | 12:54 | |
*** Oriz has joined #opendev | 13:00 | |
*** hasharLunch is now known as hashar | 13:09 | |
*** cloudnull has quit IRC | 13:16 | |
*** cloudnull has joined #opendev | 13:17 | |
*** tkajinam_ has quit IRC | 13:21 | |
*** ykarel_ has joined #opendev | 13:53 | |
*** ykarel has quit IRC | 13:56 | |
*** ykarel_ is now known as ykarel | 14:07 | |
*** tkajinam has joined #opendev | 14:18 | |
*** lpetrut has quit IRC | 15:22 | |
*** codecapde has joined #opendev | 15:27 | |
*** codecapde has left #opendev | 15:27 | |
*** hashar has quit IRC | 15:35 | |
*** ykarel has quit IRC | 16:16 | |
*** zer0def has joined #opendev | 16:24 | |
*** Oriz has quit IRC | 16:24 | |
*** ysandeep is now known as ysandeep|away | 16:29 | |
*** stephenfin has quit IRC | 16:53 | |
*** hamalq has joined #opendev | 16:55 | |
*** hamalq_ has joined #opendev | 16:56 | |
*** hashar has joined #opendev | 17:00 | |
*** hamalq has quit IRC | 17:00 | |
*** stephenfin has joined #opendev | 17:04 | |
*** zer0def has quit IRC | 17:05 | |
*** zer0def has joined #opendev | 17:10 | |
*** marios is now known as marios|out | 17:16 | |
*** andrii_ostapenko has joined #opendev | 17:26 | |
corvus | clarkb: are we meeting today? | 17:38 |
clarkb | corvus: I don't think so | 17:38
corvus | k, i thought that was the case, just dbl checking | 17:38 |
clarkb | I wasn't planning on it at least as nothing super urgent has come up yesterday or today | 17:39 |
andrii_ostapenko | Hello! I have a periodic job stuck for 59 hrs on 'queued'. Is it something I can get help with on this channel? https://zuul.openstack.org/status#openstack/openstack-helm-images | 17:48
clarkb | andrii_ostapenko: do those jobs require the extra large instances from the citycloud airship tenant? | 17:50
andrii_ostapenko | they don't | 17:50 |
clarkb | do they have other dependency relationships between each other? basically what it looks like is we're starved for resources possibly coupled with some sort of relationship that is making that worse (but that is just my quickly looking at the status) | 17:52 |
clarkb | I believe waiting means waiting on resources, and queued is I have resources and am just waiting my turn? I should double check on that (but you've got a number in a waiting state) | 17:53 |
clarkb | https://opendev.org/openstack/openstack-helm-images/src/branch/master/zuul.d/base.yaml#L308-L316 I think that confirms at least part of the suspicion but not necessarily that the suspicion is at fault | 17:54 |
andrii_ostapenko | clarkb: jobs in 'waiting' status are waiting on the ones that are in 'queued' status currently | 17:55
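For context, a minimal sketch of how this kind of waiting/queued chain is expressed in a Zuul project definition: a child job lists its parents under "dependencies", and Zuul will not start the child until every listed parent has completed. The job names below are hypothetical, not the actual openstack-helm-images jobs.

    # zuul.d/project.yaml (hypothetical job names, illustrative only)
    - project:
        periodic:
          jobs:
            - build-image-stein-ubuntu-bionic
            - upload-image-stein-ubuntu-bionic:
                dependencies:
                  - build-image-stein-ubuntu-bionic
            - test-cinder-stein-ubuntu-bionic:
                dependencies:
                  - upload-image-stein-ubuntu-bionic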
clarkb | andrii_ostapenko: yes and at least two of them have run and failed and are being retried | 17:57 |
clarkb | that does make me wonder if there is possibly a retry bug in periodic pipelines | 17:57 |
clarkb | corvus: ^ are you aware of anything like that? | 17:57
corvus | clarkb: unaware of anything like that | 17:58 |
clarkb | openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic in particular seems to be holding up openstack-helm-images-cinder-stein-ubuntu_bionic and openstack-helm-images-horizon-stein-ubuntu_bionic | 17:58 |
clarkb | it did run once, but zuul reports that it is queued for a second retry | 17:58 |
fungi | i think periodic has the lowest of the low priority, could it really just be waiting for zuul to catch its breath in higher-priority pipelines? | 17:59 |
corvus | i'm a little confused by a retried job that's skipped | 18:00 |
clarkb | fungi: that is what I thought initially but we have normal node capacity which is why I asked about special nodes being used | 18:00
fungi | and https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 indicates we've not been backlogged with node requests for the last 60 hours | 18:03 |
andrii_ostapenko | bottom 3 jobs are not holding anything but also stuck in queued | 18:03 |
clarkb | logstash doesn't seem to have logs for the first attempt at openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic | 18:04 |
clarkb | I've found the logs for the previous periodic attempt which succeeded though | 18:05
clarkb | checking to see if I can find the fluentd logs | 18:06 |
clarkb | my current hunch is that the zuul state is wedged somehow because zuul is not able to satisfy dependencies between all the retries | 18:07 |
corvus | | 299-0012184365 | 0 | requested | zuul01.openstack.org | ubuntu-bionic | | nl02-9-PoolWorker.airship-kna1-airship | 18:07 |
clarkb | oh special nodes are in play? | 18:08 |
corvus | that's the nodepool request for openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic | 18:08 |
clarkb | oh wait no that's ubuntu-bionic | 18:08
clarkb | but that cloud is still struggling to provide the normal label type, got it | 18:08 |
corvus | that's the declined by column | 18:08 |
corvus | so i think that's the only cloud that has weighed in on the request so far | 18:09 |
clarkb | fwiw the fluentd job's first attempt doesn't appear to be in logstash either | 18:09 |
clarkb | corvus: huh, looking at grafana we should have plenty of capacity for other clouds to see and service that job | 18:09
fungi | we have lots of ubuntu-bionic nodes in use according to grafana, so it's not failing to boot them | 18:09 |
*** ralonsoh_ has quit IRC | 18:10 | |
corvus | hrm, how come we don't say if a request is locked? that would be useful | 18:11 |
fungi | in launcher debug logs we do | 18:12 |
corvus | oh... another thing we omit from the request list is if there's a provider preference | 18:13 |
corvus | 'provider': 'airship-kna1' | 18:13 |
clarkb | corvus: does that come from the zuul configuration? | 18:13 |
corvus | does airship-kna1 provide regular ubuntu-bionic nodes? | 18:14 |
clarkb | corvus: yes, it has two pools. One with a small number of normal nodes and the other with the larger nodes. The idea there was that we'd keep images up to date and exercise them even if the other pool went idle for a while | 18:15 |
andrii_ostapenko | afaik yes | 18:15 |
corvus | i think i understand; gimme a sec to check some things | 18:15 |
corvus | my hypothesis is that an earlier job in the job DAG ran on airship-kna1, and now this job, which depends on that one, is asking for a node in airship-kna1. the 'special' pool has already declined it, leaving the 'regular' pool as the only possible provider that can satisfy the request | 18:17 |
corvus | so we should investigate the state of the 'regular' kna1 pool | 18:17 |
clarkb | aha | 18:17 |
corvus | the pool names are 'main' and 'airship' | 18:18 |
fungi | that would make sense, there's nowhere we set an explicit provider preference to airship-kna1 according to codesearch at least | 18:18 |
corvus | fungi: yeah, it's an automatic affinity based on the job dependency | 18:19 |
clarkb | ya the preference comes from where the parent job ran as an implicit runtime thing rather than a config thing | 18:19 |
corvus | the main pool has 'max-servers: 10' | 18:19 |
corvus | so it's very constrained | 18:19 |
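For reference, a rough sketch of the shape of that provider configuration in nodepool. Only the 'main' pool's max-servers of 10 and the pool names come from the discussion above; the label names, flavors, and the second pool's limit are assumptions, not the real airship-kna1 settings.

    # nodepool.yaml provider excerpt (illustrative values only)
    providers:
      - name: airship-kna1
        pools:
          - name: main              # small pool of regular nodes
            max-servers: 10
            labels:
              - name: ubuntu-bionic
                diskimage: ubuntu-bionic
                flavor-name: standard          # assumed flavor
          - name: airship           # pool for the expanded/special node types
            max-servers: 16                     # assumed value
            labels:
              - name: ubuntu-bionic-expanded    # assumed label name
                diskimage: ubuntu-bionic
                flavor-name: large              # assumed flavor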
clarkb | grafana shows a pretty consistent 8 in use in that cloud | 18:19 |
fungi | grafana looks weird for that cloud, to be honest | 18:19 |
clarkb | maybe those are held/leaked and we're otherwise basically at quota? | 18:20 |
clarkb | fungi: ya | 18:20 |
fungi | no launch attempts in the past 6 hours | 18:20 |
*** marios|out has quit IRC | 18:20 | |
clarkb | so maybe the cloud is reporting we're at quota? | 18:20 |
clarkb | nodepool will respect that and periodically check the quota directly iirc | 18:20 |
corvus | nodepool thinks it's at quota there | 18:21 |
clarkb | cool I think that explains it | 18:21 |
*** hashar has quit IRC | 18:21 | |
clarkb | we shouldn't be at quota though according to grafana so this is probably the nova thing where quotas get out of sync | 18:21 |
corvus | Current pool quota: {'compute': {'cores': inf, 'instances': 0, 'ram': inf}} | 18:21 |
corvus | well, i think that's the internal calculation, not nova? | 18:22 |
fungi | nodepool list has some nodes locked there for ~2.5 days | 18:22 |
clarkb | corvus: it incorporates both internal data and the nova data iirc | 18:22 |
corvus | clarkb: i think instances in pool quota is entirely internal? | 18:23 |
corvus | (since a pool is a nodepool construct) | 18:23 |
corvus | at any rate, nodepool list shows 10 entries for airship-kna1, none of which are large types, so they should all be in the 'main' pool | 18:23 |
clarkb | corvus: hrm ya looking at nodepool code really quickly the only place we seem to do math on instances is where we check the number of instances for a request against quota and where we estimate nodepool used quota | 18:24 |
clarkb | and estimated used quota is not driver specific so ya that must be internal | 18:25 |
clarkb | and if we're using 10 instances in the main pool then that is at quota. Do any appear leaked? | 18:25 |
andrii_ostapenko | this particular buildset is occupying 8 nodes from airship-kna1 with jobs in paused state | 18:25 |
corvus | andrii_ostapenko: there are 2 that are deleting right now | 18:26 |
corvus | so it sounds like that accounts for all 10 nodes | 18:26 |
clarkb | so I guess part of the problem here is having a ton of jobs that all pause in the same cloud if clouds can have limited resources | 18:26 |
andrii_ostapenko | these 2 would save the day | 18:27 |
clarkb | why are those jobs all pausing if there is a buildset registry to act as the central repository for these images | 18:27 |
clarkb | seems like we should only have the one paused job? | 18:27 |
clarkb | oh maybe they aren't using the central registry and are acting as their own registries too? | 18:28 |
andrii_ostapenko | to have a conditional promote after testing is done. image builder does the promotion after testing is done | 18:28 |
corvus | the two deleting nodes have been in a delete loop for days; they may require manual cloud admin intervention to clear | 18:29 |
clarkb | andrii_ostapenko: I think you can do that without the pausing using a normal job dependency as long as the resources are in the central registry | 18:30 |
fungi | fwiw, there are 6 nodes locked for ~60 hours and 2 more locked for around 22-23 hours at the moment, so there's just the observed two which are probably running active non-paused builds | 18:30 |
clarkb | andrii_ostapenko: then the promote job will only run if its parents pass and it can shuffle the bits around via the central registry | 18:30 |
clarkb | pausing should only be required if you need the job to be running when its child job is also running which isn't the case here if you use the central registry job in that paused state | 18:31
andrii_ostapenko | i agree it can be implemented this way. i'll think on details | 18:32 |
corvus | i think this is what we've learned so far: 1) use a central registry to avoid having too many simultaneous jobs in a job graph; 2) a provider-pool in nodepool needs to at least have enough capacity to run all of the simultaneous jobs in a job graph | 18:32 |
*** dtantsur is now known as dtantsur|afk | 18:33 | |
clarkb | corvus: and for 2) that number will vary over time. I wonder if we can have zuul predict those values then restrict where it makes requests? | 18:33 |
corvus | holistically, we have a job graph that requires >8 simultaneous nodes, and we have a provider which currently provides those nodes but can't provide > 8. | 18:33 |
corvus | clarkb: potentially, yes | 18:33 |
clarkb | for the current periodic jobs do we need to dequeue them and let it try again? since the cloud they are currently assigned to is unable to fulfill the requests currently? | 18:34 |
fungi | were there parent jobs which needed node types only supplied by that provider though? | 18:34 |
clarkb | fungi: no it was just the luck of the draw | 18:35 |
fungi | ahh, okay | 18:35 |
clarkb | fungi: we know this because the main pool provides generic resources not special ones | 18:35
fungi | oh, and parents can't use nodes from other pools in the same provider? | 18:35 |
clarkb | correct | 18:35 |
clarkb | at least I'm pretty sure of that | 18:35 |
corvus | no they can use other pools in the same provider | 18:35 |
clarkb | oh til | 18:36 |
corvus | but only one pool in this provider provides ubuntu-bionic | 18:36 |
clarkb | right | 18:36 |
corvus | (if the 'airship' pool provided ubuntu-bionic, it could use it) | 18:36 |
fungi | just wondering if there was a parent/child build relationship where a parent used one of the special nodes types but the children were using ubuntu-bionic... that would cause it to basically always try to select them from that citycloud provider since it's the only one which provides those special nodes | 18:37 |
corvus | anyway, the zuul change to hint to nodepool that it should only fulfill a request if it could also fulfill a future request for X nodes is probably not a trivial change | 18:37 |
clarkb | fungi: I don't think so because my logstash info shows older jobs running in ovh too | 18:37 |
fungi | got it | 18:37 |
clarkb | corvus: ya I'm kinda thinking we should file this away as a known issue for now, maybe dequeue the current buildset, then look at this in the new year? | 18:38 |
clarkb | and andrii_ostapenko can hopefully reduce the number of jobs that pause too | 18:38
andrii_ostapenko | yes i'll do it | 18:38 |
andrii_ostapenko | but apparently it's bigger than my issue | 18:39 |
andrii_ostapenko | and thank you so much for figuring this out | 18:39 |
corvus | andrii_ostapenko: take a look at the opendev/system-config repo, particularly the jobs that depend on the 'opendev-buildset-registry' job for examples of how to use a central registry | 18:39 |
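A rough sketch of the pattern corvus is pointing at: a single opendev-buildset-registry job pauses and serves images to the whole buildset, while the build, test, and upload jobs chain through ordinary dependencies, so only the registry job stays running and holding a node. All job names other than opendev-buildset-registry are made up for illustration.

    # hypothetical project pipeline using a shared buildset registry
    - project:
        periodic:
          jobs:
            - opendev-buildset-registry        # the only job that pauses
            - build-helm-image:
                dependencies:
                  - opendev-buildset-registry
            - test-helm-image:
                dependencies:
                  - build-helm-image
            - upload-helm-image:               # conditional promote: only runs if the test passed
                dependencies:
                  - test-helm-image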
clarkb | I'm not in a good spot to do the dequeue but I expect that this will stay stuck until we do that or somehow get the cloud to clean up those two deleting instances | 18:40
andrii_ostapenko | i tried to avoid using the intermediate registry - it adds too much time to the job run | 18:40
andrii_ostapenko | it's really needed only when you want to share artifacts between buildsets | 18:40 |
corvus | andrii_ostapenko: 60 hours is the time to beat! :) | 18:40 |
andrii_ostapenko | lol | 18:41 |
clarkb | I don't think you need the intermediate registry though since this is periodic and not sharing between changes | 18:41 |
clarkb | right in this case you want to use the buildset registry to share within a buildset and that is cloud local | 18:41 |
clarkb | (that is why we have this problem because they all go to the same provider) | 18:41 |
corvus | clarkb: correct | 18:41 |
andrii_ostapenko | i remember having issues trying to implement it this way. but i'll definitely give it another try | 18:41 |
corvus | andrii_ostapenko: i assume you meant to say you tried to avoid the buildset registry? | 18:42 |
* clarkb needs to add some food to the slow cooker. But dequeuing seems reasonable if someone is able to do that. I can try and do it later today if it doesn't happen sooner | 18:42 |
andrii_ostapenko | no. had issues splitting image build and image upload into 2 jobs. i require a buildset registry | 18:42
corvus | the buildset registry does take a bit of extra time (especially since it starts first, pauses, then the build jobs only start once it's paused). | 18:43 |
corvus | the intermediate registry is used for sharing between builds, but it's not something you run in your jobs, it's always running | 18:43 |
corvus | (it's a single host named insecure-ci-registry.opendev.org) | 18:43 |
corvus | the buildset registry roles automatically push and pull from the intermediate registry, but that should happen regardless of whether there's a single shared buildset registry, or the build jobs have their own individual buildset registry jobs | 18:44 |
andrii_ostapenko | yes I'm aware. i excluded intermediate registry intentionally to save some time. i now need to do a conditional upload in separate job after test job is done, not in the same image build job | 18:44 |
andrii_ostapenko | the question is what to do with this particular buildset. Are you able to abort it or we need to fix airship cloud so it goes further | 18:46 |
fungi | the aborting is a manual `zuul dequeue ...` cli command which needs to be issued by a zuul admin, i'll take care of it | 18:47 |
fungi | just need to pull up the relevant details first | 18:47 |
fungi | i've run this locally on the scheduler: sudo zuul dequeue --tenant openstack --pipeline periodic --project openstack/openstack-helm-images --ref refs/heads/master | 18:50 |
fungi | it hasn't returned control to my shell yet, so it's presumably working on it | 18:50 |
fungi | and done. looks like i caught it in the middle of a reconfiguration event | 18:51 |
fungi | #status log dequeued refs/heads/master of openstack/openstack-helm-images from the periodic pipeline of the openstack zuul tenant after determining that it was wedged due to capacity issues in the selected node provider | 18:52 |
openstackstatus | fungi: finished logging | 18:52 |
andrii_ostapenko | fungi: thank you! | 18:54 |
andrii_ostapenko | corvus, clarkb: thank you for your help! | 18:57 |
fungi | you're welcome | 19:08 |
fungi | thanks for giving us a heads up about it! | 19:09 |
*** slaweq has quit IRC | 22:51 | |
clarkb | es05 seems to have gone to lunch some time last week which has backed up the ansibles on bridge | 23:03 |
clarkb | I'm cleaning up the ansibles on bridge then will reboot es05 | 23:04 |
clarkb | if anyone knows how to make ansible timeouts work properly when a host is not responding to ssh that info would be great | 23:08 |
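One possible approach to the timeout question (untested here, values arbitrary): set SSH client timeouts via the ansible_ssh_common_args connection variable so an unresponsive host fails fast instead of hanging the play, e.g. in an inventory group_vars file:

    # group_vars/all.yaml (illustrative values)
    ansible_ssh_common_args: "-o ConnectTimeout=30 -o ServerAliveInterval=15 -o ServerAliveCountMax=3"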
clarkb | es06 is up but its elasticsearch was not running, I'm rebooting it too then will ensure shard cleanup happens and then we should just need to wait for it to rebalance the cluster | 23:16
clarkb | the cluster reports it is green now and it is relocating shards (that is the rebalancing that was expected) | 23:27 |
*** tosky has quit IRC | 23:57 |