*** cloudnull has quit IRC | 00:07 | |
*** cloudnull has joined #opendev | 01:01 | |
*** brinzhang0 has joined #opendev | 01:21 | |
*** brinzhang_ has quit IRC | 01:24 | |
*** ysandeep|away is now known as ysandeep | 02:24 | |
*** DSpider has quit IRC | 02:39 | |
*** ykarel has joined #opendev | 04:40 | |
*** ykarel has quit IRC | 04:45 | |
*** ykarel has joined #opendev | 05:04 | |
*** hamalq has quit IRC | 05:37 | |
*** marios has joined #opendev | 06:15 | |
*** slaweq has quit IRC | 07:18 | |
*** slaweq has joined #opendev | 07:20 | |
*** ralonsoh has joined #opendev | 07:28 | |
*** ralonsoh_ has joined #opendev | 08:17 | |
*** ralonsoh has quit IRC | 08:20 | |
*** ralonsoh_ has quit IRC | 08:29 | |
*** hashar has joined #opendev | 08:51 | |
*** ralonsoh has joined #opendev | 09:25 | |
*** lpetrut has joined #opendev | 09:31 | |
*** ralonsoh has quit IRC | 09:31 | |
*** danpawlik has quit IRC | 09:32 | |
*** danpawlik0 has joined #opendev | 09:32 | |
*** otherwiseguy has quit IRC | 10:04 | |
*** otherwiseguy has joined #opendev | 10:05 | |
*** ralonsoh has joined #opendev | 10:11 | |
*** dtantsur|afk is now known as dtantsur | 10:21 | |
*** TheJulia has joined #opendev | 10:28 | |
*** rpittau|afk has joined #opendev | 10:30 | |
*** johnsom has joined #opendev | 10:33 | |
*** ralonsoh has quit IRC | 10:35 | |
*** ralonsoh has joined #opendev | 10:48 | |
*** ralonsoh has quit IRC | 11:00 | |
*** ralonsoh has joined #opendev | 11:12 | |
*** ralonsoh has quit IRC | 11:15 | |
*** ralonsoh has joined #opendev | 11:15 | |
*** ralonsoh_ has joined #opendev | 11:22 | |
*** ralonsoh has quit IRC | 11:25 | |
*** tosky has joined #opendev | 12:02 | |
*** icey has quit IRC | 12:04 | |
*** icey has joined #opendev | 12:04 | |
*** hashar is now known as hasharLunch | 12:14 | |
*** DSpider has joined #opendev | 12:54 | |
*** Oriz has joined #opendev | 13:00 | |
*** hasharLunch is now known as hashar | 13:09 | |
*** cloudnull has quit IRC | 13:16 | |
*** cloudnull has joined #opendev | 13:17 | |
*** tkajinam_ has quit IRC | 13:21 | |
*** ykarel_ has joined #opendev | 13:53 | |
*** ykarel has quit IRC | 13:56 | |
*** ykarel_ is now known as ykarel | 14:07 | |
*** tkajinam has joined #opendev | 14:18 | |
*** lpetrut has quit IRC | 15:22 | |
*** codecapde has joined #opendev | 15:27 | |
*** codecapde has left #opendev | 15:27 | |
*** hashar has quit IRC | 15:35 | |
*** ykarel has quit IRC | 16:16 | |
*** zer0def has joined #opendev | 16:24 | |
*** Oriz has quit IRC | 16:24 | |
*** ysandeep is now known as ysandeep|away | 16:29 | |
*** stephenfin has quit IRC | 16:53 | |
*** hamalq has joined #opendev | 16:55 | |
*** hamalq_ has joined #opendev | 16:56 | |
*** hashar has joined #opendev | 17:00 | |
*** hamalq has quit IRC | 17:00 | |
*** stephenfin has joined #opendev | 17:04 | |
*** zer0def has quit IRC | 17:05 | |
*** zer0def has joined #opendev | 17:10 | |
*** marios is now known as marios|out | 17:16 | |
*** andrii_ostapenko has joined #opendev | 17:26 | |
corvus | clarkb: are we meeting today? | 17:38 |
clarkb | corvus: I don't think so | 17:38
corvus | k, i thought that was the case, just dbl checking | 17:38 |
clarkb | I wasn't planning on it at least as nothing super urgent has come up yesterday or today | 17:39 |
andrii_ostapenko | Hello! I have a periodic job stuck for 59 hrs on 'queued'. Is it something I can get help with on this channel? https://zuul.openstack.org/status#openstack/openstack-helm-images | 17:48
clarkb | andrii_ostapenko: do those jobs require the extra large instances from the citycloud airship tenant? | 17:50
andrii_ostapenko | they don't | 17:50 |
clarkb | do they have other dependency relationships between each other? basically what it looks like is we're starved for resources possibly coupled with some sort of relationship that is making that worse (but that is just my quickly looking at the status) | 17:52 |
clarkb | I believe waiting means waiting on resources, and queued is I have resources and am just waiting my turn? I should double check on that (but you've got a number in a waiting state) | 17:53 |
clarkb | https://opendev.org/openstack/openstack-helm-images/src/branch/master/zuul.d/base.yaml#L308-L316 I think that confirms at least part of the suspicion but not necessarily that the suspicion is at fault | 17:54 |
andrii_ostapenko | clarkb: jobs in 'waiting' status are waiting on the ones that are in 'queued' status currently | 17:55
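For context, a minimal sketch of how this kind of waiting/queued chain is expressed in a Zuul project definition: a child job lists its parents under "dependencies", and Zuul will not start the child until every listed parent has completed. The job names below are hypothetical, not the actual openstack-helm-images jobs.

    # zuul.d/project.yaml (hypothetical job names, illustrative only)
    - project:
        periodic:
          jobs:
            - build-image-stein-ubuntu-bionic
            - upload-image-stein-ubuntu-bionic:
                dependencies:
                  - build-image-stein-ubuntu-bionic
            - test-cinder-stein-ubuntu-bionic:
                dependencies:
                  - upload-image-stein-ubuntu-bionic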
clarkb | andrii_ostapenko: yes and at least two of them have run and failed and are being retried | 17:57 |
clarkb | that does make me wonder if there is possibly a retry bug in periodic pipelines | 17:57 |
clarkb | corvus: ^ are you aware of anything like that? | 17:57
corvus | clarkb: unaware of anything like that | 17:58 |
clarkb | openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic in particular seems to be holding up openstack-helm-images-cinder-stein-ubuntu_bionic and openstack-helm-images-horizon-stein-ubuntu_bionic | 17:58 |
clarkb | it did run once, but zuul reports that it is queued for a second retry | 17:58 |
fungi | i think periodic has the lowest of the low priority, could it really just be waiting for zuul to catch its breath in higher-priority pipelines? | 17:59 |
corvus | i'm a little confused by a retried job that's skipped | 18:00 |
clarkb | fungi: that is what I thought initially but we have normal node capacity which is why I asked about special nodes being used | 18:00
fungi | and https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 indicates we've not been backlogged with node requests for the last 60 hours | 18:03 |
andrii_ostapenko | bottom 3 jobs are not holding anything but also stuck in queued | 18:03 |
clarkb | logstash doesn't seem to have logs for the first attempt at openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic | 18:04 |
clarkb | I've found the logs for the previous periodic attempt which succeeded though | 18:05
clarkb | checking to see if I can find the fluentd logs | 18:06 |
clarkb | my current hunch is that the zuul state is wedged somehow because zuul is not able to satisfy dependencies between all the retries | 18:07 |
corvus | | 299-0012184365 | 0 | requested | zuul01.openstack.org | ubuntu-bionic | | nl02-9-PoolWorker.airship-kna1-airship | 18:07 |
clarkb | oh special nodes are in play? | 18:08 |
corvus | that's the nodepool request for openstack-helm-images-upload-openstack-loci-stein-ubuntu_bionic | 18:08 |
clarkb | oh wait no that's ubuntu-bionic | 18:08
clarkb | but that cloud is still struggling to provide the normal label type, got it | 18:08 |
corvus | that's the declined by column | 18:08 |
corvus | so i think that's the only cloud that has weighed in on the request so far | 18:09 |
clarkb | fwiw the fluentd job's first attempt doesn't appear to be in logstash either | 18:09 |
clarkb | corvus: huh, looking at grafana we should have plenty of capacity for other clouds to see and service that job | 18:09
fungi | we have lots of ubuntu-bionic nodes in use according to grafana, so it's not failing to boot them | 18:09 |
*** ralonsoh_ has quit IRC | 18:10 | |
corvus | hrm, how come we don't say if a request is locked? that would be useful | 18:11 |
fungi | in launcher debug logs we do | 18:12 |
corvus | oh... another thing we omit from the request list is if there's a provider preference | 18:13 |
corvus | 'provider': 'airship-kna1' | 18:13 |
clarkb | corvus: does that come from the zuul configuration? | 18:13 |
corvus | does airship-kna1 provide regular ubuntu-bionic nodes? | 18:14 |
clarkb | corvus: yes, it has two pools. One with a small number of normal nodes and the other with the larger nodes. The idea there was that we'd keep images up to date and exercise them even if the other pool went idle for a while | 18:15 |
andrii_ostapenko | afaik yes | 18:15 |
corvus | i think i understand; gimme a sec to check some things | 18:15 |
corvus | my hypothesis is that an earlier job in the job DAG ran on airship-kna1, and now this job, which depends on that one, is asking for a node in airship-kna1. the 'special' pool has already declined it, leaving the 'regular' pool as the only possible provider that can satisfy the request | 18:17 |
corvus | so we should investigate the state of the 'regular' kna1 pool | 18:17 |
clarkb | aha | 18:17 |
corvus | the pool names are 'main' and 'airship' | 18:18 |
fungi | that would make sense, there's nowhere we set an explicit provider preference to airship-kna1 according to codesearch at least | 18:18 |
corvus | fungi: yeah, it's an automatic affinity based on the job dependency | 18:19 |
clarkb | ya the preference comes from where the parent job ran as an implicit runtime thing rather than a config thing | 18:19 |
corvus | the main pool has 'max-servers: 10' | 18:19 |
corvus | so it's very constrained | 18:19 |
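For reference, a rough sketch of the shape of that provider configuration in nodepool. Only the 'main' pool's max-servers of 10 and the pool names come from the discussion above; the label names, flavors, and the second pool's limit are assumptions, not the real airship-kna1 settings.

    # nodepool.yaml provider excerpt (illustrative values only)
    providers:
      - name: airship-kna1
        pools:
          - name: main              # small pool of regular nodes
            max-servers: 10
            labels:
              - name: ubuntu-bionic
                diskimage: ubuntu-bionic
                flavor-name: standard          # assumed flavor
          - name: airship           # pool for the expanded/special node types
            max-servers: 16                     # assumed value
            labels:
              - name: ubuntu-bionic-expanded    # assumed label name
                diskimage: ubuntu-bionic
                flavor-name: large              # assumed flavor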
clarkb | grafana shows a pretty consistent 8 in use in that cloud | 18:19 |
fungi | grafana looks weird for that cloud, to be honest | 18:19 |
clarkb | maybe those are held/leaked and we're otherwise basically at quota? | 18:20 |
clarkb | fungi: ya | 18:20 |
fungi | no launch attempts in the past 6 hours | 18:20 |
*** marios|out has quit IRC | 18:20 | |
clarkb | so maybe the cloud is reporting we're at quota? | 18:20 |
clarkb | nodepool will respect that and periodically check the quota directly iirc | 18:20 |
corvus | nodepool thinks it's at quota there | 18:21 |
clarkb | cool I think that explains it | 18:21 |
*** hashar has quit IRC | 18:21 | |
clarkb | we shouldn't be at quota though according to grafana so this is probably the nova thing where quotas get out of sync | 18:21 |
corvus | Current pool quota: {'compute': {'cores': inf, 'instances': 0, 'ram': inf}} | 18:21 |
corvus | well, i think that's the internal calculation, not nova? | 18:22 |
fungi | nodepool list has some nodes locked there for ~2.5 days | 18:22 |
clarkb | corvus: it incorporates both internal data and the nova data iirc | 18:22 |
corvus | clarkb: i think instances in pool quota is entirely internal? | 18:23 |
corvus | (since a pool is a nodepool construct) | 18:23 |
corvus | at any rate, nodepool list shows 10 entries for airship-kna1, none of which are large types, so they should all be in the 'main' pool | 18:23 |
clarkb | corvus: hrm ya looking at nodepool code really quickly the only place we seem to do math on instances is where we check the number of instances for a request against quota and where we estimate nodepool used quota | 18:24 |
clarkb | and estimated used quota is not driver specific so ya that must be internal | 18:25 |
clarkb | and if we're using 10 instances in the main pool then that is at quota. Do any appear leaked? | 18:25 |
andrii_ostapenko | this particular buildset is occupying 8 nodes from airship-kna1 with jobs in paused state | 18:25 |
corvus | andrii_ostapenko: there are 2 that are deleting right now | 18:26 |
corvus | so it sounds like that accounts for all 10 nodes | 18:26 |
clarkb | so I guess part of the problem here is having a ton of jobs that all pause in the same cloud if clouds can have limited resources | 18:26 |
andrii_ostapenko | these 2 would save the day | 18:27 |
clarkb | why are those jobs all pausing if there is a buildset registry to act as the central repository for these images | 18:27 |
clarkb | seems like we should only have the one paused job? | 18:27 |
clarkb | oh maybe they aren't using the central registry and are acting as their own registries too? | 18:28 |
andrii_ostapenko | to have a conditional promote after testing is done. image builder does the promotion after testing is done | 18:28 |
corvus | the two deleting nodes have been in a delete loop for days; they may require manual cloud admin intervention to clear | 18:29 |
clarkb | andrii_ostapenko: I think you can do that without the pausing using a normal job dependency as long as the resources are in the central registry | 18:30 |
fungi | fwiw, there are 6 nodes locked for ~60 hours and 2 more locked for around 22-23 hours at the moment, so there's just the observed two which are probably running active non-paused builds | 18:30 |
clarkb | andrii_ostapenko: then the promote job will only run if its parents pass and it can shuffle the bits around via the central registry | 18:30 |
clarkb | pausing should only be required if you need the job to be running when its child job is also running which isn't the case here if you use the central registry job in that paused state | 18:31
andrii_ostapenko | i agree it can be implemented this way. i'll think on details | 18:32 |
corvus | i think this is what we've learned so far: 1) use a central registry to avoid having too many simultaneous jobs in a job graph; 2) a provider-pool in nodepool needs to at least have enough capacity to run all of the simultaneous jobs in a job graph | 18:32 |
*** dtantsur is now known as dtantsur|afk | 18:33 | |
clarkb | corvus: and for 2) that number will vary over time. I wonder if we can have zuul predict those values then restrict where it makes requests? | 18:33 |
corvus | holistically, we have a job graph that requires >8 simultaneous nodes, and we have a provider which currently provides those nodes but can't provide > 8. | 18:33 |
corvus | clarkb: potentially, yes | 18:33 |
clarkb | for the current periodic jobs do we need to dequeue them and let it try again? since the cloud they are currently assigned to is unable to fulfill the requests currently? | 18:34 |
fungi | were there parent jobs which needed node types only supplied by that provider though? | 18:34 |
clarkb | fungi: no it was just the luck of the draw | 18:35 |
fungi | ahh, okay | 18:35 |
clarkb | fungi: we know this because the main pool provides generic resources not special ones | 18:35
fungi | oh, and parents can't use nodes from other pools in the same provider? | 18:35 |
clarkb | correct | 18:35 |
clarkb | at least I'm pretty sure of that | 18:35 |
corvus | no they can use other pools in the same provider | 18:35 |
clarkb | oh til | 18:36 |
corvus | but only one pool in this provider provides ubuntu-bionic | 18:36 |
clarkb | right | 18:36 |
corvus | (if the 'airship' pool provided ubuntu-bionic, it could use it) | 18:36 |
fungi | just wondering if there was a parent/child build relationship where a parent used one of the special nodes types but the children were using ubuntu-bionic... that would cause it to basically always try to select them from that citycloud provider since it's the only one which provides those special nodes | 18:37 |
corvus | anyway, the zuul change to hint to nodepool that it should only fulfill a request if it could also fulfill a future request for X nodes is probably not a trivial change | 18:37 |
clarkb | fungi: I don't think so because my logstash info shows older jobs running in ovh too | 18:37 |
fungi | got it | 18:37 |
clarkb | corvus: ya I'm kinda thinking we should file this away as a known issue for now, maybe dequeue the current buildset, then look at this in the new year? | 18:38 |
clarkb | and andrii_ostapenko can hopefully reduce the number of jobs that pause too | 18:38
andrii_ostapenko | yes i'll do it | 18:38 |
andrii_ostapenko | but apparently it's bigger than my issue | 18:39 |
andrii_ostapenko | and thank you so much for figuring this out | 18:39 |
corvus | andrii_ostapenko: take a look at the opendev/system-config repo, particularly the jobs that depend on the 'opendev-buildset-registry' job for examples of how to use a central registry | 18:39 |
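A rough sketch of the pattern corvus is pointing at: a single opendev-buildset-registry job pauses and serves images to the whole buildset, while the build, test, and upload jobs chain through ordinary dependencies, so only the registry job stays running and holding a node. All job names other than opendev-buildset-registry are made up for illustration.

    # hypothetical project pipeline using a shared buildset registry
    - project:
        periodic:
          jobs:
            - opendev-buildset-registry        # the only job that pauses
            - build-helm-image:
                dependencies:
                  - opendev-buildset-registry
            - test-helm-image:
                dependencies:
                  - build-helm-image
            - upload-helm-image:               # conditional promote: only runs if the test passed
                dependencies:
                  - test-helm-image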
clarkb | I'm not in a good spot to do the dequeue but I expect that this will stay stuck until we do that or somehow get the cloud to clean up those two deleting instances | 18:40
andrii_ostapenko | i tried to avoid using the intermediate registry - it adds too much time to the job run | 18:40
andrii_ostapenko | it's really needed only when you want to share artifacts between buildsets | 18:40 |
corvus | andrii_ostapenko: 60 hours is the time to beat! :) | 18:40 |
andrii_ostapenko | lol | 18:41 |
clarkb | I don't think you need the intermediate registry though since this is periodic and not sharing between changes | 18:41 |
clarkb | right in this case you want to use the buildset registry to share within a buildset and that is cloud local | 18:41 |
clarkb | (that is why we have this problem because they all go to the same provider) | 18:41 |
corvus | clarkb: correct | 18:41 |
andrii_ostapenko | i remember having issues trying to implement it this way. but i'll definitely give it another try | 18:41 |
corvus | andrii_ostapenko: i assume you meant to say you tried to avoid the buildset registry? | 18:42 |
* clarkb needs to add some food to the slow cooker. But dequeuing seems reasonable if someone is able to do that. I can try and do it later today if it doesn't happen sooner | 18:42 |
andrii_ostapenko | no. had issues splitting image build and image upload into 2 jobs. i require a buildset registry | 18:42
corvus | the buildset registry does take a bit of extra time (especially since it starts first, pauses, then the build jobs only start once it's paused). | 18:43 |
corvus | the intermediate registry is used for sharing between builds, but it's not something you run in your jobs, it's always running | 18:43 |
corvus | (it's a single host named insecure-ci-registry.opendev.org) | 18:43 |
corvus | the buildset registry roles automatically push and pull from the intermediate registry, but that should happen regardless of whether there's a single shared buildset registry, or the build jobs have their own individual buildset registry jobs | 18:44 |
andrii_ostapenko | yes I'm aware. i excluded intermediate registry intentionally to save some time. i now need to do a conditional upload in separate job after test job is done, not in the same image build job | 18:44 |
andrii_ostapenko | the question is what to do with this particular buildset. Are you able to abort it or we need to fix airship cloud so it goes further | 18:46 |
fungi | the aborting is a manual `zuul dequeue ...` cli command which needs to be issued by a zuul admin, i'll take care of it | 18:47 |
fungi | just need to pull up the relevant details first | 18:47 |
fungi | i've run this locally on the scheduler: sudo zuul dequeue --tenant openstack --pipeline periodic --project openstack/openstack-helm-images --ref refs/heads/master | 18:50 |
fungi | it hasn't returned control to my shell yet, so it's presumably working on it | 18:50 |
fungi | and done. looks like i caught it in the middle of a reconfiguration event | 18:51 |
fungi | #status log dequeued refs/heads/master of openstack/openstack-helm-images from the periodic pipeline of the openstack zuul tenant after determining that it was wedged due to capacity issues in the selected node provider | 18:52 |
openstackstatus | fungi: finished logging | 18:52 |
andrii_ostapenko | fungi: thank you! | 18:54 |
andrii_ostapenko | corvus, clarkb: thank you for your help! | 18:57 |
fungi | you're welcome | 19:08 |
fungi | thanks for giving us a heads up about it! | 19:09 |
*** slaweq has quit IRC | 22:51 | |
clarkb | es05 seems to have gone to lunch some time last week which has backed up the ansibles on bridge | 23:03 |
clarkb | I'm cleaning up the ansibles on bridge then will reboot es05 | 23:04 |
clarkb | if anyone knows how to make ansible timeouts work properly when a host is not responding to ssh that info would be great | 23:08 |
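One possible approach to the timeout question (untested here, values arbitrary): set SSH client timeouts via the ansible_ssh_common_args connection variable so an unresponsive host fails fast instead of hanging the play, e.g. in an inventory group_vars file:

    # group_vars/all.yaml (illustrative values)
    ansible_ssh_common_args: "-o ConnectTimeout=30 -o ServerAliveInterval=15 -o ServerAliveCountMax=3"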
clarkb | es06 is up but its elasticsearch was not running, I'm rebooting it too then will ensure shard cleanup happens and then we should just need to wait for it to rebalance the cluster | 23:16
clarkb | the cluster reports it is green now and it is relocating shards (that is the rebalancing that was expected) | 23:27 |
*** tosky has quit IRC | 23:57 |