*** __ministry is now known as Guest3875 | 01:28 | |
tkajinam | Where can I find the specs of each VM in openstack-two-node-jammy? | 05:55 |
tkajinam | the reason why I'm asking is that we've seen a few failures in nova jobs caused by a mismatch between the expected vCPUs (8) and the actual vCPUs (4) | 05:55 |
tkajinam | https://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/controller/logs/screen-n-cpu.txt#8121 | 05:55 |
tkajinam | openstack-two-node-jammy is the nodeset used in that job, IIUC | 05:57 |
tonyb | tkajinam: Let me look. My first thought is that that isn't visible anywhere (apart from opendev sys admins) | 06:00 |
tonyb | tkajinam: If the job works sometimes it could be an issue with certain providers | 06:05 |
gibi | tonyb: I think it works intermittently but I haven't had time yet to look deeper | 06:13 |
tonyb | A spot check of https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration&skip=0 shows the two jobs in post_failure were both run on the new raxflex provider | 06:18 |
tonyb | openstack-two-node-jammy is defined here: https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L124 | 06:18 |
tonyb | which will just use 2 ubuntu-jammy nodes; to see what they are you can find the appropriate provider/pool | 06:20 |
tonyb | In this case raxflex uses gp.0.4.8 for general nodes | 06:21 |
tonyb | https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L236 | 06:21 |
tonyb | which is a 4 vCPU, 8GB RAM, 80GB disk flavor | 06:25 |
tonyb | which explains the failure given the job expects 8 vCPUs | 06:28 |
tonyb | https://opendev.org/openstack/nova/src/branch/master/.zuul.yaml#L142 | 06:28 |
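For reference, a provider pool label of roughly this shape is what ties the generic ubuntu-jammy label to that flavor. This is an illustrative sketch, not a copy of the linked nl01 config; only the label name and flavor come from the discussion above.

```yaml
# Hypothetical excerpt of a nodepool provider pool. The label name and the
# raxflex flavor are from the discussion above; the other values are invented.
labels:
  - name: ubuntu-jammy
    diskimage: ubuntu-jammy
    flavor-name: 'gp.0.4.8'   # 4 vCPU / 8GB RAM / 80GB disk
    key-name: infra-root-keys
```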
tonyb | I can do more "digging" to see how the raxflex flavor compares to other providers | 06:35 |
tonyb | You could also update the job to look at ansible_facts for processor information rather than hard-coding/assuming 8 VCPUs | 06:37 |
tonyb | eg look at: https://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/zuul-info/host-info.compute1.yaml#533-537 | 06:37 |
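A minimal sketch of that idea, assuming a pre-run playbook in the job; `ansible_facts['processor_vcpus']` is a standard gathered fact, while the host, task, and variable names here are hypothetical:

```yaml
# Derive the CPU count from gathered facts instead of assuming 8 vCPUs.
- hosts: compute1
  tasks:
    - name: Record how many vCPUs this node actually has
      set_fact:
        node_vcpus: "{{ ansible_facts['processor_vcpus'] }}"

    - name: Fail early if the node is smaller than the job can tolerate
      assert:
        that:
          - node_vcpus | int >= 4
        fail_msg: "Job needs at least 4 vCPUs, node has {{ node_vcpus }}"
```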
tkajinam | yeah that may be an option though I'm unsure 4 cores can satisfy the requirements of that job (especially because it separates cores into shared and non-shared to test that separation behavior) | 06:51 |
tkajinam | it'd be helpful if we could get a nodeset with two nodes with strictly 8-core VMs, but I'm not sure if that's something the team may want | 06:52 |
tonyb | Yeah it can be done, but it impacts quota etc, and I don't think there is a way to say "don't pick this provider for this job" | 06:53 |
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook https://review.opendev.org/c/openstack/ci-log-processing/+/929881 | 07:22 |
*** bauzas_ is now known as bauzas | 07:25 | |
opendevreview | Merged openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook https://review.opendev.org/c/openstack/ci-log-processing/+/929881 | 07:58 |
sean-k-mooney | tkajinam: tonyb from my perspective if a provider is not providing 8 CPUs then it's not meeting the minimum requirement we have written down and the nodeset should be separate | 10:40 |
sean-k-mooney | it should not be used by any of the default jobs | 10:40 |
sean-k-mooney | I'm trying to find where this is written down but all VMs are required to have 80GB of disk, 8 CPUs and 8GB RAM as our minimum requirements | 10:43 |
tonyb | sean-k-mooney: Sure. We can probably make a flavor that meets that but at this point I don't have enough data to say one way or another | 10:51 |
sean-k-mooney | tonyb: I'm currently trying to find the doc where we specified the minimum requirements for all cloud providers contributing resources to opendev | 10:56 |
sean-k-mooney | I know it exists because I considered doing that myself a few years ago | 10:56 |
sean-k-mooney | we detailed that all providers must be capable of hosting 100 VMs, with 8G of RAM, 8 CPUs and 80G of disk that could be split across multiple disks | 10:57 |
tonyb | Well the 100 VMs thing isn't a requirement | 10:58 |
sean-k-mooney | it was the one that tripped me up | 10:59 |
sean-k-mooney | i could do 50-80 at the time | 10:59 |
sean-k-mooney | so it was in the document at one point | 10:59 |
tonyb | Well we have clouds that provide 50 VMs | 10:59 |
tonyb | sean-k-mooney: You can look at the various providers in the nl* files eg https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L98 | 11:01 |
sean-k-mooney | sure, the reason given at the time was that the infra team did not want to manage a lot of small providers, but I'm sure that changed as time went on | 11:02 |
tonyb | sean-k-mooney: I've found the doc I think you're referring to ... I just need to find a link | 11:02 |
sean-k-mooney | I'm more concerned that the document that was meant to make sure all our providers are consistent has vanished from the internet | 11:03 |
tonyb | It hasn't | 11:03 |
tonyb | https://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst#L194 | 11:03 |
sean-k-mooney | yep that's it | 11:04 |
sean-k-mooney | I thought it was in the infra-manual for some reason | 11:04 |
tonyb | anyway that's the source | 11:04 |
sean-k-mooney | https://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L63-L64 | 11:04 |
sean-k-mooney | that's the 100 instance requirement | 11:05 |
tonyb | Well "would be helpful" | 11:05 |
sean-k-mooney | and for this conversation https://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L51-L54 is where the 8 vCPUs are required for a standard flavor | 11:05 |
tonyb | Yup I see that | 11:05 |
sean-k-mooney | to be fair it happens infrequently enough that adding a 50 VM provider every couple of months is probably fine | 11:06 |
tonyb | I'll chat with the other opendev sysadmins | 11:06 |
sean-k-mooney | the 4 vCPU nodeset is probably ok for a lot of things like tox jobs | 11:06 |
tonyb | I don't know the state of quotas on raxflex, but we can probably create a custom flavor that supplies 8 vCPUs | 11:07 |
sean-k-mooney | it just should use a different nodepool label so we can align that with a different nodeset | 11:07 |
sean-k-mooney | it's really only the devstack jobs that care about the vCPU count for our tempest concurrency requirement | 11:07 |
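Roughly what that separation would look like: an extra nodepool label backed by an 8 vCPU flavor, paired with its own zuul nodeset that only the jobs needing 8 vCPUs opt into. The label, flavor, and nodeset names below are invented for illustration:

```yaml
# Hypothetical nodepool label backed by an 8 vCPU flavor (names invented):
labels:
  - name: ubuntu-jammy-8core
    diskimage: ubuntu-jammy
    flavor-name: 'gp.0.8.8'   # assumed flavor name, not confirmed to exist

# ...and a matching zuul nodeset that jobs with the 8 vCPU assumption use:
- nodeset:
    name: openstack-two-node-jammy-8core
    nodes:
      - name: controller
        label: ubuntu-jammy-8core
      - name: compute1
        label: ubuntu-jammy-8core
```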
tonyb | Perhaps, that creates different kinds of work | 11:08 |
sean-k-mooney | we used to have the extended label for the 16G vexxhost ones | 11:09 |
tonyb | I think we still do | 11:09 |
sean-k-mooney | but yeah I can see that being annoying for the current tox usage | 11:09 |
sean-k-mooney | I think so too but we don't use them in nova currently | 11:10 |
sean-k-mooney | we experimented with them for some NUMA/hugepage testing | 11:10 |
tonyb | We now know it's an issue with the cloud. I'll talk with the other infra-roots to figure out the best fix | 11:10 |
sean-k-mooney | no worries | 11:10 |
sean-k-mooney | the rendered version is here by the way https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html - I think it's not in my history because it moved from openstack to opendev at some point | 11:15 |
sean-k-mooney | probably a few years ago | 11:15 |
tonyb | Makes sense | 11:16 |
sean-k-mooney | by the way if 50 or 25 nodes were acceptable as a provider I may reconsider that in the future. I'm currently redoing all my home infra, including considering replacing some of my older systems and upgrading to a 10G core with 2.5G to clients. I did have a third party CI running for a time but it will be a while before I'm in a position to think about that seriously | 11:29 |
tonyb | I'd say 50 VMs would be a viable minimum, unless they're "funky" in some way (architecture/config etc) | 11:31 |
tonyb | So yeah if you do redo your home lab that could be cool | 11:32 |
sean-k-mooney | before looking at adding them as a provider I think I would want to benchmark it as a third-party CI for a bit to see if it's actually viable under load | 11:33 |
tonyb | Sounds good | 11:33 |
sean-k-mooney | like if I can host 50 VMs but only when they are idle, or the disk IO etc is too low, there's little point | 11:34 |
tonyb | Yup makes sense | 11:34 |
fungi | tkajinam: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region." | 12:15 |
fungi | sean-k-mooney: our nodesets don't specify cpu count. we scale them based on performance and available flavors | 12:15 |
sean-k-mooney | fungi: the requirements for a CI provider specify the minimum that the flavors can provide | 12:16 |
fungi | also i don't think "we" have the necessary access to add arbitrary flavors to public cloud providers, unless that's a feature i'm not aware of (though we could request some different flavors) | 12:16 |
sean-k-mooney | they can provide more but we established the baseline so that jobs would be able to depend on that for their config | 12:17 |
sean-k-mooney | that provider is currently using flavor-name: 'gp.0.4.8' | 12:18 |
sean-k-mooney | which is not valid for the default labels | 12:19 |
fungi | we used a 4 vcpu flavor in osic for years | 12:19 |
sean-k-mooney | so it should not be exposed to zuul as ubuntu-jammy | 12:19 |
fungi | because 8vcpu was "too fast" (the cpus were much newer) | 12:20 |
fungi | anyway, i don't think we're going to change this, the guidance to cloud providers wanting to contribute resources was not meant as a guarantee for what users should expect | 12:20 |
sean-k-mooney | so there are ways to address that without breaking what's exposed to the guest | 12:20 |
fungi | you should not write your jobs based on an assumption of how many cpus a node has | 12:21 |
sean-k-mooney | we have to | 12:21 |
fungi | we expressly document *that* in the user-facing document in our infra manual | 12:21 |
fungi | please find an alternative | 12:21 |
sean-k-mooney | I don't think that is an acceptable solution | 12:21 |
sean-k-mooney | to have this be dynamic | 12:21 |
sean-k-mooney | to me this is a contract that we use to reason about all our jobs | 12:22 |
fungi | it's not a contract you have with opendev | 12:22 |
fungi | it's a contract you have invented for yourself | 12:22 |
sean-k-mooney | it was in the past | 12:22 |
sean-k-mooney | no | 12:22 |
fungi | it never was | 12:22 |
fungi | see above re: the osic cloud | 12:22 |
sean-k-mooney | we have talked about this at length in the past in the context of NUMA testing and nested virt | 12:22 |
fungi | maybe you wrote that job after osic was closed down | 12:22 |
sean-k-mooney | I changed the job recently to depend on this, yes | 12:23 |
sean-k-mooney | but we indirectly depended on this in the past for calculating the concurrency | 12:23 |
fungi | we document specifically that cpu count can differ between providers. please don't assume it will be the same everywhere | 12:23 |
sean-k-mooney | the document defined the minimums | 12:23 |
sean-k-mooney | we are not assuming it's the same | 12:24 |
sean-k-mooney | just that its at least 8 | 12:24 |
fungi | minimums as guidance to providers. that's not a user-facing document | 12:24 |
fungi | we negotiate those based on performance metrics | 12:24 |
sean-k-mooney | that's never been communicated to projects | 12:24 |
fungi | it has, in the document i linked | 12:25 |
sean-k-mooney | we were always told that that was the minimum we could depend on | 12:25 |
sean-k-mooney | we had a long discussion about this when vexxhost upgraded and started providing 16G VMs too | 12:25 |
fungi | what we document for users is "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region." | 12:25 |
sean-k-mooney | this explains some of the random "no valid host" issues we occasionally hit | 12:26 |
sean-k-mooney | the tempest jobs have never been written to handle this changing properly | 12:27 |
fungi | we promise in the document i linked that standard nodes provide 8gb ram and at least 80gb of disk (though that may be in aggregate between the rootfs and ephemeral disk), but say cpu count can vary | 12:27 |
sean-k-mooney | I need to test something | 12:27 |
sean-k-mooney | I think we can present a consistent cpu topology to the guest even if the host cpus differ | 12:28 |
fungi | tempest jobs used to scale the number of parallel tests by the number of available cpus. has that changed? | 12:28 |
sean-k-mooney | well we can, I'm just not sure if nova allows it to be more than flavor.vcpu | 12:28 |
fungi | and tempest's scaling was precisely because cpu count is expected to vary | 12:29 |
sean-k-mooney | we scale differently in some jobs vs others and we also pin some | 12:29 |
sean-k-mooney | because we were getting random test failures in some jobs so we had to reduce it | 12:30 |
sean-k-mooney | most jobs now expect to run with concurrency 6, some 4 and others 3 I think | 12:30 |
fungi | yeah, i remember when we introduced 16vcpu nodes for a while in one provider and 4vcpu nodes in another, it helped shake out some otherwise unnoticed bugs in tempest | 12:30 |
sean-k-mooney | that's assuming we have 8 vcpus | 12:30 |
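That pinning lives in the job definitions. A sketch of the shape (a fragment only; `tempest_concurrency` is the variable the devstack/tempest job family consumes, though the job name and value shown here are purely illustrative):

```yaml
# Fragment of a job variant that pins tempest concurrency on the assumption
# of 8 vCPU nodes; on a 4 vCPU node this oversubscribes the CPUs.
- job:
    name: nova-multinode-example   # hypothetical job name
    vars:
      tempest_concurrency: 6
```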
sean-k-mooney | so for the 11 years I have been working on openstack the fact that it was 8 vCPUs was always set in stone in my mind, so to me the fact that you expect it to differ is very surprising and I'm sure many other projects would be surprised to hear that too | 12:32 |
sean-k-mooney | I always expected any other cpu count or ram count would be a different label/nodeset | 12:32 |
fungi | in rackspace flex specifically, they don't have an 8gb ram 8vcpu flavor, but testing indicated that the cpus were ~2x the speed of those in our other donors | 12:33 |
sean-k-mooney | right but that does not help us test cpu pinning etc | 12:33 |
sean-k-mooney | I can make the job calculate the config dynamically based on the cpus that are available | 12:34 |
sean-k-mooney | but that's less stable | 12:34 |
sean-k-mooney | in the image we upload to glance via nodepool we can describe the vcpu topology to present to the guest | 12:35 |
sean-k-mooney | what I need to check is whether we can say 1 socket, 8 cores, 1 thread and have flavor.vcpu=4 | 12:36 |
sean-k-mooney | the virtual topology is meant to be decoupled from the host topology, but I don't think we allow the virtual topology to exceed the host topology, but we might | 12:37 |
sean-k-mooney | that could allow us to normalise across providers if it works | 12:37 |
sean-k-mooney | I just don't know if that would work for both kvm and xen | 12:37 |
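The shape of that experiment would be fixed guest CPU topology properties on the uploaded image. `hw_cpu_sockets`, `hw_cpu_cores` and `hw_cpu_threads` are standard Nova image properties, and nodepool can attach metadata to image uploads; whether Nova accepts a topology larger than flavor.vcpus is exactly the open question above, so this is a sketch of the test, not a known-working config:

```yaml
# Hypothetical nodepool provider diskimage entry attaching a fixed guest CPU
# topology via image metadata (whether this may exceed flavor.vcpus is the
# open question being tested).
diskimages:
  - name: ubuntu-jammy
    meta:
      hw_cpu_sockets: '1'
      hw_cpu_cores: '8'
      hw_cpu_threads: '1'
```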
opendevreview | Stephen Finucane proposed openstack/project-config master: Start including team as release meta field https://review.opendev.org/c/openstack/project-config/+/929914 | 12:44 |
fungi | i suppose some good news here is that, over the long term, we expect the rackspace flex donation to replace our use of rackspace classic, at which point we'll be kvm everywhere (and xen is unpopular enough at this point that i don't see that changing) | 13:05 |
sean-k-mooney | perhaps. I'm not convinced this dynamic flavor sizing is not more problematic | 13:08 |
sean-k-mooney | if you're saying that going forward we could use nested kvm as a result of this move | 13:09 |
sean-k-mooney | then yes it would be a large performance boost | 13:09 |
sean-k-mooney | disk IO has historically been more of a problem than cpu performance from my perspective | 13:09 |
sean-k-mooney | I don't think xen was the culprit there, more so the age of the hardware | 13:10 |
fungi | we have vanishingly few providers where we don't also add nested virt capable node labels, so yes barring future regressions in nested kvm capabilities that would be one outcome | 13:11 |
fungi | but we've observed enough instability with linux nested kvm using random combinations of kernel versions between host/guest/nested that i'm hesitant to make that claim boldly | 13:13 |
sean-k-mooney | it's definitely improved with newer kernels but yes | 13:14 |
sean-k-mooney | AMD hosts are less stable from what I have observed than Intel | 13:14 |
sean-k-mooney | but I think that's changing | 13:14 |
clarkb | fungi covered it well, but I also want to add that we try to be good stewards of donated resources and find a balance between demanding more than is necessary and making good effective use of the resources. As fungi mentioned we selected a flavor that was already available (so we don't have to demand a public cloud bend to our will and add new flavors), testing indicated | 14:55 |
clarkb | performance was on par with flavors in other clouds with more cpu, and we never promise the count will be consistent across clouds for this reason | 14:55 |
clarkb | essentially we're doing our best here to balance inputs from a hundred different directions | 14:55 |
*** bauzas_ is now known as bauzas | 22:13 |