Tuesday, 2020-12-22

*** cloudnull has quit IRC00:07
*** cloudnull has joined #opendev01:01
*** brinzhang0 has joined #opendev01:21
*** brinzhang_ has quit IRC01:24
*** ysandeep|away is now known as ysandeep02:24
*** DSpider has quit IRC02:39
*** ykarel has joined #opendev04:40
*** ykarel has quit IRC04:45
*** ykarel has joined #opendev05:04
*** hamalq has quit IRC05:37
*** marios has joined #opendev06:15
*** slaweq has quit IRC07:18
*** slaweq has joined #opendev07:20
*** ralonsoh has joined #opendev07:28
*** ralonsoh_ has joined #opendev08:17
*** ralonsoh has quit IRC08:20
*** ralonsoh_ has quit IRC08:29
*** hashar has joined #opendev08:51
*** ralonsoh has joined #opendev09:25
*** lpetrut has joined #opendev09:31
*** ralonsoh has quit IRC09:31
*** danpawlik has quit IRC09:32
*** danpawlik0 has joined #opendev09:32
*** otherwiseguy has quit IRC10:04
*** otherwiseguy has joined #opendev10:05
*** ralonsoh has joined #opendev10:11
*** dtantsur|afk is now known as dtantsur10:21
*** TheJulia has joined #opendev10:28
*** rpittau|afk has joined #opendev10:30
*** johnsom has joined #opendev10:33
*** ralonsoh has quit IRC10:35
*** ralonsoh has joined #opendev10:48
*** ralonsoh has quit IRC11:00
*** ralonsoh has joined #opendev11:12
*** ralonsoh has quit IRC11:15
*** ralonsoh has joined #opendev11:15
*** ralonsoh_ has joined #opendev11:22
*** ralonsoh has quit IRC11:25
*** tosky has joined #opendev12:02
*** icey has quit IRC12:04
*** icey has joined #opendev12:04
*** hashar is now known as hasharLunch12:14
*** DSpider has joined #opendev12:54
*** Oriz has joined #opendev13:00
*** hasharLunch is now known as hashar13:09
*** cloudnull has quit IRC13:16
*** cloudnull has joined #opendev13:17
*** tkajinam_ has quit IRC13:21
*** ykarel_ has joined #opendev13:53
*** ykarel has quit IRC13:56
*** ykarel_ is now known as ykarel14:07
*** tkajinam has joined #opendev14:18
*** lpetrut has quit IRC15:22
*** codecapde has joined #opendev15:27
*** codecapde has left #opendev15:27
*** hashar has quit IRC15:35
*** ykarel has quit IRC16:16
*** zer0def has joined #opendev16:24
*** Oriz has quit IRC16:24
*** ysandeep is now known as ysandeep|away16:29
*** stephenfin has quit IRC16:53
*** hamalq has joined #opendev16:55
*** hamalq_ has joined #opendev16:56
*** hashar has joined #opendev17:00
*** hamalq has quit IRC17:00
*** stephenfin has joined #opendev17:04
*** zer0def has quit IRC17:05
*** zer0def has joined #opendev17:10
*** marios is now known as marios|out17:16
*** andrii_ostapenko has joined #opendev17:26
corvusclarkb: are we meeting today?17:38
clarkbcorvus: I don't think so17:38
corvusk, i thought that was the case, just dbl checking17:38
clarkbI wasn't planning on it at least as nothing super urgent has come up yesterday or today17:39
andrii_ostapenkoHello! I have periodic job stuck for 59 hrs on 'queued'. Is it something I can get help with on this channel? https://zuul.openstack.org/status#openstack/openstack-helm-images17:48
clarkbandrii_ostapenko: do those jobs require the extra large instances from the citycloud airship tenant?17:50
andrii_ostapenkothey don't17:50
clarkbdo they have other dependency relationships between each other? basically what it looks like is we're starved for resources possibly coupled with some sort of relationship that is making that worse (but that is just my quickly looking at the status)17:52
clarkbI believe waiting means waiting on resources, and queued is I have resources and am just waiting my turn? I should double check on that (but you've got a number in a waiting state)17:53
clarkbhttps://opendev.org/openstack/openstack-helm-images/src/branch/master/zuul.d/base.yaml#L308-L316 I think that confirms at least part of the suspicion but not necessarily that the suspicion is at fault17:54
andrii_ostapenkoclarkb: jobs in 'waiting' status are waiting on the ones that are in 'queued' status currently17:55
clarkbandrii_ostapenko: yes and at least two of them have run and failed and are being retried17:57
clarkbthat does make me wonder if there is possibly a retry bug in periodic pipelines17:57
clarkbcorvus: ^ are you aware of anything like that?17:57
corvusclarkb: unaware of anything like that17:58
clarkbopenstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic in particular seems to be holding up openstack-helm-images-cinder-stein-ubuntu_bionic and openstack-helm-images-horizon-stein-ubuntu_bionic17:58
clarkbit did run once, but zuul reports that it is queued for a second retry17:58
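The dependency relationship described above might look roughly like the following sketch (illustrative only; the real definitions are in zuul.d/base.yaml of openstack/openstack-helm-images linked earlier and may differ):

```yaml
# Hedged sketch of the job graph implied by the status page:
# the child jobs sit in "waiting" until the upload job finishes,
# while the upload job itself is stuck in "queued" awaiting a node.
- job:
    name: openstack-helm-images-cinder-stein-ubuntu_bionic
    dependencies:
      - openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic

- job:
    name: openstack-helm-images-horizon-stein-ubuntu_bionic
    dependencies:
      - openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic
```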
fungii think periodic has the lowest of the low priority, could it really just be waiting for zuul to catch its breath in higher-priority pipelines?17:59
corvusi'm a little confused by a retried job that's skipped18:00
clarkbfungi: that is what I thought initially but we have normal node capacity which is why I asked about special nodes being used18:00
fungiand https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 indicates we've not been backlogged with node requests for the last 60 hours18:03
andrii_ostapenkobottom 3 jobs are not holding anything but also stuck in queued18:03
clarkblogstash doesn't seem to have logs for the first attempt at openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic18:04
clarkbI've found the logs for the previous periodic attempt which succeeded though18:05
clarkbchecking to see if I can find the fluentd logs18:06
clarkbmy current hunch is that the zuul state is wedged somehow because zuul is not able to satisfy dependencies between all the retries18:07
corvus| 299-0012184365 | 0        | requested | zuul01.openstack.org | ubuntu-bionic                          |       | nl02-9-PoolWorker.airship-kna1-airship18:07
clarkboh special nodes are in play?18:08
corvusthat's the nodepool request for openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic18:08
clarkboh wait no thats ubuntu-bionic18:08
clarkbbut that cloud is still struggling to provide the normal label type, got it18:08
corvusthat's the declined by column18:08
corvusso i think that's the only cloud that has weighed in on the request so far18:09
clarkbfwiw the fluentd job's first attempt doesn't appear to be in logstash either18:09
clarkbcorvus: huh, looking at grafana we should have plenty of capacity for other clouds to see and service that job18:09
fungiwe have lots of ubuntu-bionic nodes in use according to grafana, so it's not failing to boot them18:09
*** ralonsoh_ has quit IRC18:10
corvushrm, how come we don't say if a request is locked?  that would be useful18:11
fungiin launcher debug logs we do18:12
corvusoh... another thing we omit from the request list is if there's a provider preference18:13
corvus'provider': 'airship-kna1'18:13
clarkbcorvus: does that come from the zuul configuration?18:13
corvusdoes airship-kna1 provide regular ubuntu-bionic nodes?18:14
clarkbcorvus: yes, it has two pools. One with a small number of normal nodes and the other with the larger nodes. The idea there was that we'd keep images up to date and exercise them even if the other pool went idle for a while18:15
andrii_ostapenkoafaik yes18:15
corvusi think i understand; gimme a sec to check some things18:15
corvusmy hypothesis is that an earlier job in the job DAG ran on airship-kna1, and now this job, which depends on that one, is asking for a node in airship-kna1.  the 'special' pool has already declined it, leaving the 'regular' pool as the only possible provider that can satisfy the request18:17
corvusso we should investigate the state of the 'regular' kna1 pool18:17
corvusthe pool names are 'main' and 'airship'18:18
fungithat would make sense, there's nowhere we set an explicit provider preference to airship-kna1 according to codesearch at least18:18
corvusfungi: yeah, it's an automatic affinity based on the job dependency18:19
clarkbya the preference comes from where the parent job ran as an implicit runtime thing rather than a config thing18:19
corvusthe main pool has 'max-servers: 10'18:19
corvusso it's very constrained18:19
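For reference, the airship-kna1 provider layout being discussed would look roughly like this in the nodepool configuration (a sketch based on the discussion above: the pool names 'main' and 'airship' and 'max-servers: 10' come from the log, everything else, including the large-flavor label name, is a hypothetical placeholder):

```yaml
providers:
  - name: airship-kna1
    # cloud/region settings elided
    pools:
      - name: main
        max-servers: 10          # very constrained; the whole paused
                                 # job graph has to fit in these 10 nodes
        labels:
          - name: ubuntu-bionic  # the only pool in this provider
                                 # offering the regular label
      - name: airship
        # extra-large instances for airship jobs; since it does not
        # offer ubuntu-bionic, it declines those node requests
        labels:
          - name: airship-extra-large   # hypothetical label name
```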
clarkbgrafana shows a pretty consistent 8 in use in that cloud18:19
fungigrafana looks weird for that cloud, to be honest18:19
clarkbmaybe those are held/leaked and we're otherwise basically at quota?18:20
clarkbfungi: ya18:20
fungino launch attempts in the past 6 hours18:20
*** marios|out has quit IRC18:20
clarkbso maybe the cloud is reporting we're at quota?18:20
clarkbnodepool will respect that and periodically check the quota directly iirc18:20
corvusnodepool thinks it's at quota there18:21
clarkbcool I think that explains it18:21
*** hashar has quit IRC18:21
clarkbwe shouldn't be at quota though according to grafana so this is probably the nova thing where quotas get out of sync18:21
corvus Current pool quota: {'compute': {'cores': inf, 'instances': 0, 'ram': inf}}18:21
corvuswell, i think that's the internal calculation, not nova?18:22
funginodepool list has some nodes locked there for ~2.5 days18:22
clarkbcorvus: it incorporates both internal data and the nova data iirc18:22
corvusclarkb: i think instances in pool quota is entirely internal?18:23
corvus(since a pool is a nodepool construct)18:23
corvusat any rate, nodepool list shows 10 entries for airship-kna1, none of which are large types, so they should all be in the 'main' pool18:23
clarkbcorvus: hrm ya looking at nodepool code really quickly the only place we seem to do math on instances is where we check the number of instances for a request against quota and where we estimate nodepool used quota18:24
clarkband estimated used quota is not driver specific so ya that must be internal18:25
clarkband if we're using 10 instances in the main pool then that is at quota. Do any appear leaked?18:25
andrii_ostapenkothis particular buildset is occupying 8 nodes from airship-kna1 with jobs in paused state18:25
corvusandrii_ostapenko: there are 2 that are deleting right now18:26
corvusso it sounds like that accounts for all 10 nodes18:26
clarkbso I guess part of the problem here is having a ton of jobs that all pause in the same cloud if clouds can have limited resources18:26
andrii_ostapenkothese 2 would save the day18:27
clarkbwhy are those jobs all pausing if there is a buildset registry to act as the central repository for these images18:27
clarkbseems like we should only have the one paused job?18:27
clarkboh maybe they aren't using the central registry and are acting as their own registries too?18:28
andrii_ostapenkoto have a conditional promote after testing is done. image builder does the promotion after testing is done18:28
corvusthe two deleting nodes have been in a delete loop for days; they may require manual cloud admin intervention to clear18:29
clarkbandrii_ostapenko: I think you can do that without the pausing using a normal job dependency as long as the resources are in the central registry18:30
fungifwiw, there are 6 nodes locked for ~60 hours and 2 more locked for around 22-23 hours at the moment, so there's just the observed two which are probably running active non-paused builds18:30
clarkbandrii_ostapenko: then the promote job will only run if its parents pass and it can shuffle the bits around via the central registry18:30
clarkbpausing should only be required if you need the job to be running when its child job is also running which isn't the case here if you use the central registry job in that paused state18:31
andrii_ostapenkoi agree it can be implemented this way. i'll think on details18:32
corvusi think this is what we've learned so far: 1) use a central registry to avoid having too many simultaneous jobs in a job graph; 2) a provider-pool in nodepool needs to at least have enough capacity to run all of the simultaneous jobs in a job graph18:32
*** dtantsur is now known as dtantsur|afk18:33
clarkbcorvus: and for 2) that number will vary over time. I wonder if we can have zuul predict those values then restrict where it makes requests?18:33
corvusholistically, we have a job graph that requires >8 simultaneous nodes, and we have a provider which currently provides those nodes but can't provide > 8.18:33
corvusclarkb: potentially, yes18:33
clarkbfor the current periodic jobs do we need to dequeue them and let it try again? since the cloud they are currently assigned to is unable to fulfill the requests currently?18:34
fungiwere there parent jobs which needed node types only supplied by that provider though?18:34
clarkbfungi: no it was just the luck of the draw18:35
fungiahh, okay18:35
clarkbfungi: we know this because the main pool provides generic resources not special ones18:35
fungioh, and parents can't use nodes from other pools in the same provider?18:35
clarkbat least I'm pretty sure of that18:35
corvusno they can use other pools in the same provider18:35
clarkboh til18:36
corvusbut only one pool in this provider provides ubuntu-bionic18:36
corvus(if the 'airship' pool provided ubuntu-bionic, it could use it)18:36
fungijust wondering if there was a parent/child build relationship where a parent used one of the special nodes types but the children were using ubuntu-bionic... that would cause it to basically always try to select them from that citycloud provider since it's the only one which provides those special nodes18:37
corvusanyway, the zuul change to hint to nodepool that it should only fulfill a request if it could also fulfill a future request for X nodes is probably not a trivial change18:37
clarkbfungi: I don't think so because my logstash info shows older jobs running in ovh too18:37
fungigot it18:37
clarkbcorvus: ya I'm kinda thinking we should file this away as a known issue for now, maybe dequeue the current buildset, then look at this in the new year?18:38
clarkband andrii_ostapenko can hopefully reduce the number of jobs that pause too18:38
andrii_ostapenkoyes i'll do it18:38
andrii_ostapenkobut apparently it's bigger than my issue18:39
andrii_ostapenkoand thank you so much for figuring this out18:39
corvusandrii_ostapenko: take a look at the opendev/system-config repo, particularly the jobs that depend on the 'opendev-buildset-registry' job for examples of how to use a central registry18:39
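A minimal sketch of that pattern (the job names other than opendev-buildset-registry are hypothetical; see opendev/system-config for real examples):

```yaml
# One shared registry job pauses and serves the whole buildset;
# only this single job holds a node while paused.
- job:
    name: my-image-build            # hypothetical
    dependencies:
      - opendev-buildset-registry   # the one paused job

# The promote job runs only if its parents succeed, and shuffles
# the images via the buildset registry instead of pausing builders.
- job:
    name: my-image-promote          # hypothetical
    dependencies:
      - my-image-build
```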
clarkbI'm not in a good spot to do the dequeue but I expect that this will stay stuck until we do that or somehow get the cloud to cleanup those two deleting instances18:40
andrii_ostapenkoi tried to avoid using intermediate registry - it adds too much time to job run18:40
andrii_ostapenkoit's really needed only when you want to share artifacts between buildsets18:40
corvusandrii_ostapenko: 60 hours is the time to beat!  :)18:40
clarkbI don't think you need the intermediate registry though since this is periodic and not sharing between changes18:41
clarkbright in this case you want to use the buildset registry to share within a buildset and that is cloud local18:41
clarkb(that is why we have this problem because they all go to the same provider)18:41
corvusclarkb: correct18:41
andrii_ostapenkoi remember having issues trying to implement it this way. but i'll definitely give it another try18:41
corvusandrii_ostapenko: i assume you meant to say you tried to avoid the buildset registry?18:42
* clarkb needs to add some food to the slow cooker. But dequeing seems reasonable if someone is able to do that. I can try and do it later today if it doesn't happen sooner18:42
andrii_ostapenkono. had issues splitting image build and image upload into 2 jobs. i require a buildset registry18:42
corvusthe buildset registry does take a bit of extra time (especially since it starts first, pauses, then the build jobs only start once it's paused).18:43
corvusthe intermediate registry is used for sharing between builds, but it's not something you run in your jobs, it's always running18:43
corvus(it's a single host named insecure-ci-registry.opendev.org)18:43
corvusthe buildset registry roles automatically push and pull from the intermediate registry, but that should happen regardless of whether there's a single shared buildset registry, or the build jobs have their own individual buildset registry jobs18:44
andrii_ostapenkoyes I'm aware. i excluded intermediate registry intentionally to save some time. i now need to do a conditional upload in separate job after test job is done, not in the same image build job18:44
andrii_ostapenkothe question is what to do with this particular buildset. Are you able to abort it or we need to fix airship cloud so it goes further18:46
fungithe aborting is a manual `zuul dequeue ...` cli command which needs to be issued by a zuul admin, i'll take care of it18:47
fungijust need to pull up the relevant details first18:47
fungii've run this locally on the scheduler: sudo zuul dequeue --tenant openstack --pipeline periodic --project openstack/openstack-helm-images --ref refs/heads/master18:50
fungiit hasn't returned control to my shell yet, so it's presumably working on it18:50
fungiand done. looks like i caught it in the middle of a reconfiguration event18:51
fungi#status log dequeued refs/heads/master of openstack/openstack-helm-images from the periodic pipeline of the openstack zuul tenant after determining that it was wedged due to capacity issues in the selected node provider18:52
openstackstatusfungi: finished logging18:52
andrii_ostapenkofungi: thank you!18:54
andrii_ostapenkocorvus, clarkb: thank you for your help!18:57
fungiyou're welcome19:08
fungithanks for giving us a heads up about it!19:09
*** slaweq has quit IRC22:51
clarkbes05 seems to have gone to lunch some time last week which has backed up the ansibles on bridge23:03
clarkbI'm cleaning up the ansibles on bridge then will reboot es0523:04
clarkbif anyone knows how to make ansible timeouts work properly when a host is not responding to ssh that info would be great23:08
clarkbes06 is up but its elasticsearch was not running, I'm rebooting it too then will ensure shard cleanup happens and then we should just need to wait for it to rebalance the cluster23:16
clarkbthe cluster reports it is green now and it is relocating shards (that is the rebalancing that was expected)23:27
*** tosky has quit IRC23:57

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!