Thursday, 2024-09-19

*** __ministry is now known as Guest387501:28
tkajinamWhere can I find the specs of each VM in openstack-two-node-jammy?05:55
tkajinamthe reason I'm asking is that we've seen a few failures in nova jobs caused by a mismatch between expected vcpus (8) and actual vcpus (4)05:55
tkajinamhttps://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/controller/logs/screen-n-cpu.txt#812105:55
tkajinamopenstack-two-node-jammy is the nodeset used in that job, IIUC05:57
tonybtkajinam: Let me look.  My first thought is that that isn't visible anywhere (apart from opendev sys admins)06:00
tonybtkajinam: If the job works sometimes it could be an issue with certain providers06:05
gibitonyb: I think it works intermittently but I haven't had time yet to look deeper06:13
tonybA spot check of https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration&skip=0 shows the two jobs in post_failure were both run on the new raxflex provider06:18
tonybopenstack-two-node-jammy is defined here: https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L12406:18
tonybwhich will just use 2 ubuntu-jammy nodes. To see what they are you can find the appropriate provider/pool06:20
tonybIn this case raxflex uses gp.0.4.8 for general nodes06:21
tonybhttps://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L23606:21
tonybwhich is a 4 VCPU 8GB RAM 80GB disk flavor06:25
tonybwhich explains the failure given the job expects 8VCPUs06:28
tonybhttps://opendev.org/openstack/nova/src/branch/master/.zuul.yaml#L14206:28
tonybI can do more "digging" to see how the raxflex flavor compares to other providers06:35
tonybYou could also update the job to look at ansible_facts for processor information rather than hard-coding/assuming 8 VCPUs06:37
tonybeg look at: https://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/zuul-info/host-info.compute1.yaml#533-53706:37
tkajinamyeah that may be an option though I'm unsure 4 cores can satisfy the requirements of that job (especially because it separates cores into shared and non-shared sets to test that separation behavior)06:51
tkajinamit'd be helpful if we could get a nodeset with two nodes with strictly 8-core vms, but I'm not sure if that's something the team would want06:52
tonybYeah it can be done, but impacts quota etc, I don't think there is a way to say "don't pick this provider for this job"06:53
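A minimal sketch (not from the actual nova job) of the dynamic approach tonyb suggests above: derive the shared/dedicated CPU split from the node's detected CPU count instead of assuming 8 vCPUs. The helper name, the halving policy, and the minimum-core floor below are illustrative assumptions only.

    # Hypothetical sketch: derive the dedicated/shared CPU split from the
    # node's actual CPU count instead of assuming 8 vCPUs.
    import os

    def cpu_split(min_total=4):
        """Return (dedicated, shared) CPU id lists for the detected CPUs."""
        total = os.cpu_count() or 1
        if total < min_total:
            raise RuntimeError(
                "node has %d CPUs, need at least %d for the shared/dedicated "
                "separation test" % (total, min_total))
        # Give the first half of the CPUs to the dedicated set, the rest to shared.
        dedicated = list(range(total // 2))
        shared = list(range(total // 2, total))
        return dedicated, shared

    if __name__ == "__main__":
        print(cpu_split())

A job built this way could fail fast (or skip the separation tests) on small nodes rather than hitting the vcpu mismatch seen in the n-cpu log above.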
opendevreviewdaniel.pawlik proposed openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook  https://review.opendev.org/c/openstack/ci-log-processing/+/92988107:22
*** bauzas_ is now known as bauzas07:25
opendevreviewMerged openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook  https://review.opendev.org/c/openstack/ci-log-processing/+/92988107:58
sean-k-mooneytkajinam: tonyb  from my perspective if a provider is not providing 8 CPUs then it's not meeting the minimum requirement we have written down, and the nodeset should be separate10:40
sean-k-mooneyit should not be used by any of the default jobs10:40
sean-k-mooneyI'm trying to find where this is written down, but all vms are required to have 80GB of disk, 8 cpus and 8GB ram as our minimum requirements10:43
tonybsean-k-mooney: Sure.  We can probably make a flavor that meets that but at this point I don't have enough data to say one way or another10:51
sean-k-mooneytonyb: I'm currently trying to find the doc where we specified the minimum requirements for all cloud providers contributing resources to opendev10:56
sean-k-mooneyI know it exists because I considered doing that myself a few years ago10:56
sean-k-mooneywe detailed that all providers must be capable of hosting 100 vms, with 8G of ram, 8 cpus and 80G of disk that could be split across multiple disks10:57
tonybWell the 100Vms thing isn't a requirement10:58
sean-k-mooneyit was the one that tripped me up10:59
sean-k-mooneyi could do 50-80 at the time10:59
sean-k-mooneyso it was in the document at one point10:59
tonybWell we have clouds that provide 50VMs10:59
tonybsean-k-mooney: You can look at the various providers in the nl* files eg https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L9811:01
sean-k-mooneysure, the reason given at the time is that the infra team did not want to manage a lot of small providers, but I'm sure that changed as time went on11:02
tonybsean-k-mooney: I've found the doc I think you're referring to ... I just need to find a link11:02
sean-k-mooneyI'm more concerned that the document that was meant to make sure all our providers are consistent has vanished from the internet11:03
tonybIt hasn't11:03
tonybhttps://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst#L19411:03
sean-k-mooneyyep that's it11:04
sean-k-mooneyI thought it was in the infra-manual for some reason11:04
tonybanyway that's the source11:04
sean-k-mooneyhttps://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L63-L6411:04
sean-k-mooneythat's the 100 instance requirement11:05
tonybWell "would be helpful"11:05
sean-k-mooneyand for this conversation https://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L51-L54 is where the 8 vcpu is required for a standard flavor11:05
tonybYup I see that11:05
sean-k-mooneyto be fair it happens infrequently enough that adding a 50 vm provider every couple of months is probably fine11:06
tonybI'll chat with the other opendev sysadmins11:06
sean-k-mooneythe 4vcpu nodeset is probably ok for a lot of things like tox jobs11:06
tonybI don't know the state of quotas on raxflex, but we can probably create a custom flavor that supplies 8vcpus11:07
sean-k-mooneyit should just use a different nodepool label so we can align that with a different nodeset11:07
sean-k-mooneyit's really only the devstack jobs that care about the vcpu count for our tempest concurrency requirement11:07
tonybPerhaps, that creates different kinds of work11:08
sean-k-mooneywe used to have the extended label for the 16G vexhost ones11:09
tonybI think we still do11:09
sean-k-mooneybut yeah I can see that being annoying for the current tox usage11:09
sean-k-mooneyI think so too but we don't use them in nova currently11:10
sean-k-mooneywe experimented with them for some numa/hugepage testing11:10
tonybWe now know it's an issue with the cloud. I'll talk with the other infra-roots to figure out the best fix11:10
sean-k-mooneyno worries11:10
sean-k-mooneythe rendered version is here by the way https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html I think it's not in my history because it moved from openstack to opendev at some point11:15
sean-k-mooneyprobably a few years ago11:15
tonybMakes sense11:16
sean-k-mooneyby the way, if 50 or 25 nodes were acceptable as a provider I may reconsider that in the future. I'm currently redoing all my home infra, including considering replacing some of my older systems and upgrading to a 10G core with 2.5G to clients. I did have a third party ci running for a time, but it will be a while before I'm in a position to think about that seriously11:29
tonybI'd say 50VMs would be a viable minimum, unless they're "funky" in some way architecture/config etc etc11:31
tonybSo yeah if you do redo your home lab that could be cool11:32
sean-k-mooneybefore looking at adding them as a provider I think I would want to benchmark it as a third party for a bit to see if it's actually viable under load11:33
tonybSounds good11:33
sean-k-mooneylike if I can host 50 vms but only when they are idle, or the disk io etc is too low, there's little point11:34
tonybYup makes sense11:34
fungitkajinam: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region."12:15
fungisean-k-mooney: our nodesets don't specify cpu count. we scale them based on performance and available flavors12:15
sean-k-mooneyfungi: the requirements for a ci provider specify the minimum that the flavors can provide12:16
fungialso i don't think "we" have the necessary access to add arbitrary flavors to public cloud providers, unless that's a feature i'm not aware of (though we could request some different flavors)12:16
sean-k-mooneythey can provide more, but we established the baseline so that jobs would be able to depend on that for their config12:17
sean-k-mooneythat provider is currently using flavor-name: 'gp.0.4.8'12:18
sean-k-mooneywhich is not valid for the default labels12:19
fungiwe used a 4 vpcu flavor in osic for years12:19
sean-k-mooneyso it should not be exposed to zuul as ubuntu-jammy12:19
fungibecause 8vcpu was "too fast" (the cpus were much newer)12:20
fungianyway, i don't think we're going to change this, the guidance to cloud providers wanting to contribute resources was not meant as a guarantee for what users should expect12:20
sean-k-mooneyso there are ways to address that without breaking what's exposed to the guest12:20
fungiyou should not write your jobs based on an assumption of how many cpus a node has12:21
sean-k-mooneywe have to 12:21
fungiwe expressly document *that* in the user-facing document in our infra manual12:21
fungiplease find an alternative12:21
sean-k-mooneyI don't think that is an acceptable solution12:21
sean-k-mooneyto have this be dynamic12:21
sean-k-mooneyto me this is a contract that we use to reason about all our jobs12:22
fungiit's not a contract you have with opendev12:22
fungiit's a contract you have invented for yourself12:22
sean-k-mooneyit was in the past12:22
sean-k-mooneyno12:22
fungiit never was12:22
fungisee above re: the osic cloud12:22
sean-k-mooneywe have talked about this at length in the past in the context of numa testing and nested virt12:22
fungimaybe you wrote that job after osic was closed down12:22
sean-k-mooneyI changed the job recently to depend on this, yes12:23
sean-k-mooneybut we indirectly depended on this in the past for calculating the concurrency12:23
fungiwe document specifically that cpu count can differ between providers. please don't assume it will be the same everywhere12:23
sean-k-mooneythe document defined the minimums12:23
sean-k-mooneywe are not assuming it's the same12:24
sean-k-mooneyjust that it's at least 812:24
fungiminimums as guidance to providers. that's not a user-facing document12:24
fungiwe negotiate those based on performance metrics12:24
sean-k-mooneythat's never been communicated to the projects12:24
fungiit has, in the document i linked12:25
sean-k-mooneywe were always told that that was the minimum we could depend on12:25
sean-k-mooneywe had a long discussion about this when vexhost upgraded and started providing 16G vms too12:25
fungiwhat we document for users is "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region."12:25
sean-k-mooneythis explains some of the random no valid host issues we occasionally hit12:26
sean-k-mooneythe tempest jobs have never been written to handle this changing properly12:27
fungiwe promise in the document i linked that standard nodes provide 8gb ram and at least 80gb of disk (though that may be in aggregate between the rootfs and ephemeral disk), but say cpu count can vary12:27
sean-k-mooneyI need to test something12:27
sean-k-mooneyI think we can present a consistent cpu topology to the guest even if the host cpus differ12:28
fungitempest jobs used to scale the number of parallel tests by the number of available cpus. has that changed?12:28
sean-k-mooneywell we can, I'm just not sure if nova allows it to be more than flavor.vcpu12:28
fungiand tempest's scaling was precisely because cpu count is expected to vary12:29
sean-k-mooneywe scale differently in some jobs vs others and we also pin some12:29
sean-k-mooneybecause we were getting random test failures in some jobs so we had to reduce it12:30
sean-k-mooneymost jobs now expect to run with concurrency 6, some 4 and others 3 I think12:30
fungiyeah, i remember when we introduced 16vcpu nodes for a while in one provider and 4vcpu nodes in another, it helped shake out some otherwise unnoticed bugs in tempest12:30
sean-k-mooneythat's assuming we have 8 vcpus12:30
sean-k-mooneyso for the 11 years I have been working on openstack the fact that it was 8 vcpus was always set in stone in my mind, so to me the fact you expect it to differ is very surprising, and I'm sure many other projects would be surprised to hear that too12:32
sean-k-mooneyI always expected any other cpu count or ram count would be a different label/nodeset12:32
fungiin rackspace flex specifically, they don't have an 8gb ram 8vcpu flavor, but testing indicated that the cpus were ~2x the speed of those in our other donors12:33
sean-k-mooneyright but that does not help us test cpu pinning etc12:33
sean-k-mooneyI can make the job calculate the config dynamically based on the cpus that are available12:34
sean-k-mooneybut that's less stable12:34
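A rough illustration of "calculate the config dynamically": scale the tempest concurrency from the detected CPU count with a cap. The reservation ratio and the cap of 6 are assumptions made up for the example, not values any job actually uses.

    # Hypothetical sketch: pick a tempest concurrency from the detected CPU
    # count rather than hard-coding a value that assumes 8 vCPUs.
    import os

    def tempest_concurrency(max_concurrency=6):
        """Scale concurrency with available CPUs, leaving headroom for services."""
        total = os.cpu_count() or 1
        # Reserve roughly a quarter of the CPUs for devstack services; the
        # ratio and the cap are illustrative assumptions only.
        return max(1, min(max_concurrency, (total * 3) // 4))

    if __name__ == "__main__":
        print(tempest_concurrency())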
sean-k-mooneyin the image we upload to glance via nodepool we can describe the vcpu topology to present to the guest12:35
sean-k-mooneywhat I need to check is whether we can say 1 socket, 8 cores, 1 thread and have flavor.vcpu=412:36
sean-k-mooneythe virtual topology is meant to be decoupled from the host topology, but I don't think we allow the virtual topology to exceed the host topology; we might though12:37
sean-k-mooneythat could allow us to normalise across providers if it works12:37
sean-k-mooneyI just don't know if that would work for both kvm and xen12:37
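For reference, a hedged sketch of what describing the guest CPU topology on the uploaded image could look like with openstacksdk, via the standard hw_cpu_* image metadata properties. The cloud and image names are placeholders, and whether nova accepts a topology larger than flavor.vcpus is exactly the open question above, so this is illustrative only.

    # Hypothetical sketch: pin a fixed guest CPU topology on the image that
    # nodepool uploads, using the hw_cpu_* image metadata properties.
    import openstack

    conn = openstack.connect(cloud="raxflex")  # cloud name is a placeholder
    image = conn.image.find_image("ubuntu-jammy")  # image name is a placeholder
    conn.image.update_image(
        image,
        hw_cpu_sockets="1",
        hw_cpu_cores="8",   # untested whether this may exceed flavor.vcpus
        hw_cpu_threads="1",
    )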
opendevreviewStephen Finucane proposed openstack/project-config master: Start including team as release meta field  https://review.opendev.org/c/openstack/project-config/+/92991412:44
fungii suppose some good news here is that, over the long term, we expect the rackspace flex donation to replace our use of rackspace classic, at which point we'll be kvm everywhere (and xen is unpopular enough at this point that i don't see that changing)13:05
sean-k-mooneyperhaps. I'm not convinced this dynamic flavor sizing isn't more problematic13:08
sean-k-mooneyif you're saying that going forward we could use nested kvm as a result of this move13:09
sean-k-mooneythen yes it would be a large performance boost13:09
sean-k-mooneydisk io has historically been more of a problem than cpu performance from my perspective13:09
sean-k-mooneyI don't think xen was the culprit there, more so the age of the hardware13:10
fungiwe have vanishingly few providers where we don't also add nested virt capable node labels, so yes barring future regressions in nested kvm capabilities that would be one outcome13:11
fungibut we've observed enough instability with linux nested kvm using random combinations of kernel versions between host/guest/nested that i'm hesitant to make that claim boldly13:13
sean-k-mooneyit's definitely improved with newer kernels but yes13:14
sean-k-mooneyamd hosts are less stable than intel from what I have observed13:14
sean-k-mooneybut I think that's changing13:14
clarkbfungi covered it well, but I also want to add that we try to be good stewards of donated resources and find a balance between demanding more than is necessary and making good effective use of the resources. As fungi mentioned we selected a flavor that was already available (so we don't have to demand a public cloud bend to our will and add new flavors), testing indicated14:55
clarkbperformance was on par with flavors in other clouds with more cpu, and we never promise the count will be consistent across clouds for this reason14:55
clarkbessentially we're doing our best here to balance inputs from a hundred different directions14:55
*** bauzas_ is now known as bauzas22:13
