opendevreview | tianyutong proposed openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 01:41 |
opendevreview | Merged openstack/project-config master: Allow tag creation for heterogeneous-distributed-training-framework https://review.opendev.org/c/openstack/project-config/+/953069 | 12:31 |
*** haleyb|out is now known as haleyb | 13:12 |
*** mmagr__ is now known as mmagr | 14:28 |
priteau | Hello. I have a job which has been queued for 5+ hours: https://zuul.opendev.org/t/openstack/status?change=952983 | 15:33 |
priteau | kayobe-seed-vm-ubuntu-noble towards the bottom of the list | 15:33 |
clarkb | looks like that job uses this nodeset: https://opendev.org/openstack/kayobe/src/branch/master/zuul.d/nodesets.yaml#L14-L18 which uses a standard ubuntu-noble label (so not arm or nested virt etc) | 15:42 |
fungi | and just a single node | 15:42 |
clarkb | 2025-06-23 15:42:55,713 DEBUG nodepool.PoolWorker.raxflex-sjc3-main: Active requests: ['200-0027294972'] | 15:43 |
clarkb | that's the nodepool provider that has the request in its todo list. It reports there isn't enough quota to fulfill the request. It is possible we have leaked fips there again. I'll check | 15:43 |
clarkb | yes, based on a floating ip listing I believe this is the case. I'll do what I did a week or two ago and delete all the fips that are not attached to anything | 15:44 |
clarkb | dfw3 is in the same situation, so I'll do the same there as well | 15:45 |
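For context, a minimal sketch of the kind of cleanup described above (deleting floating IPs that are not attached to any port), using openstacksdk. The cloud names and the exact filter are assumptions drawn from the discussion, not the commands that were actually run:

```python
# Minimal sketch, assuming openstacksdk and clouds.yaml entries named
# "raxflex-dfw3" and "raxflex-sjc3" (names assumed from the discussion).
# Deletes floating IPs that are not attached to anything.
import openstack

for cloud in ("raxflex-dfw3", "raxflex-sjc3"):
    conn = openstack.connect(cloud=cloud)
    for fip in conn.network.ips():  # list floating IPs in the project
        if fip.port_id is None:  # leaked: not attached to any port
            print(f"deleting unattached fip {fip.floating_ip_address} in {cloud}")
            conn.network.delete_ip(fip)
```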
clarkb | sjc3 api responses are a bit slow so I started with dfw3 | 15:50 |
clarkb | dfw3 is done. sjc3 is in progress but slow due to the api response timing. Hopefully things will be happy in the next 5-10 minutes | 15:53 |
priteau | Thank you! | 16:20 |
priteau | Should I have posted this to #opendev instead? | 16:20 |
clarkb | either is fine, but this is an opendev CI system issue, nothing specific to openstack | 16:21 |
clarkb | looking at grafana graphs I think we also have a number of nodes stuck in a deleting state in sjc3. Possibly due to the api slowness I've observed | 16:25 |
clarkb | I can try to manually delete a node there and see what happens | 16:26 |
clarkb | ok, manually deleting ~3 nodes seemed to get things moving. Possibly, given the api response times, nodepool was hitting some error that short-circuited things until I reduced the total size of the list? I'm not sure. Either way this also seems to have helped | 16:31 |
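A hedged sketch of how one might look for instances that appear stuck in a transient state from the cloud side, again with openstacksdk. The status values, the "deleting" task state, and the age cutoff are illustrative assumptions; in this case the actual cleanup was handled by nodepool and the cloud operator:

```python
# Minimal sketch, assuming openstacksdk; the ERROR status, "deleting"
# task state, and one-hour cutoff are assumptions for illustration.
import datetime
import openstack

conn = openstack.connect(cloud="raxflex-sjc3")  # cloud name assumed
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)

for server in conn.compute.servers(details=True):
    created = datetime.datetime.fromisoformat(server.created_at.replace("Z", "+00:00"))
    if created < cutoff and (server.status == "ERROR" or server.task_state == "deleting"):
        print(f"possibly stuck: {server.id} {server.name} "
              f"status={server.status} task_state={server.task_state}")
```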
clarkb | the job has a node now too | 16:32 |
fungi | agreed, i see it started running | 16:34 |
clarkb | actually I think my manual deletion just happened to coincide with james denton making a fix on the cloud side | 16:40 |
clarkb | so anyway the cloud helped us out; once that was sorted, nodepool could clean things up normally (except for the fips, since we have fip cleanup disabled) | 16:40 |
clarkb | priteau: thank you for the heads up and I think your change has reported now | 16:41 |
priteau | It did, thanks. | 16:46 |