Thursday, 2024-09-19

*** __ministry is now known as Guest387501:28
tkajinamWhere can I find the specs of each VM in openstack-two-node-jammy?05:55
tkajinamthe reason I'm asking is that we've seen a few failures in nova jobs caused by a mismatch between expected vcpus (8) and actual vcpus (4)05:55
tkajinamhttps://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/controller/logs/screen-n-cpu.txt#812105:55
tkajinamopenstack-two-node-jammy is the nodeset used in that job, IIUC05:57
tonybtkajinam: Let me look.  My first thought is that that isn't visible anywhere (apart from opendev sys admins)06:00
tonybtkajinam: If the job works sometimes it could be an issue with certain providers06:05
gibitonyb: I think it works intermittently but I haven't had time yet to look deeper06:13
tonybA spot check of https://zuul.opendev.org/t/openstack/builds?job_name=nova-live-migration&skip=0 shows the two jobs in post_failure were both run on the new raxflex provider06:18
tonybopenstack-two-node-jammy is defined here: https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L12406:18
tonybwhich will just use 2 ubuntu-jammy nodes. To see what they are you can find the appropriate provider/pool06:20
tonybIn this case raxflex uses gp.0.4.8 for general nodes06:21
tonybhttps://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L23606:21
tonybwhich is a 4 VCPU 8GB RAM 80GB disk flavor06:25
tonybwhich explains the failure given the job expects 8VCPUs06:28
tonybhttps://opendev.org/openstack/nova/src/branch/master/.zuul.yaml#L14206:28
tonybI can do more "digging" to see how the raxflex flavor compares to other providers06:35
tonybYou could also update the job to look at ansible_facts for processor information rather than hard-coding/assuming 8 VCPUs06:37
tonybeg look at: https://zuul.opendev.org/t/openstack/build/8c45cc40964e4d94b7bf92f8b6c65b1f/log/zuul-info/host-info.compute1.yaml#533-53706:37
tkajinamyeah that may be an option though I'm unsure 4 cores can satisfy the requirements of that job (especially because it separates cores into shared and non-shared sets to test that separation behavior)06:51
tkajinamit'd be helpful if we could get a nodeset with two nodes with strictly 8-core vms, but I'm not sure if that's something the team would want06:52
tonybYeah it can be done, but impacts quota etc, I don't think there is a way to say "don't pick this provider for this job"06:53
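A minimal sketch (not from the actual nova job) of the dynamic approach tonyb suggests above: derive the shared/dedicated CPU split from the node's detected CPU count instead of assuming 8 vCPUs. The helper name, the halving policy, and the minimum-core floor below are illustrative assumptions only.

    # Hypothetical sketch: derive the dedicated/shared CPU split from the
    # node's actual CPU count instead of assuming 8 vCPUs.
    import os

    def cpu_split(min_total=4):
        """Return (dedicated, shared) CPU id lists for the detected CPUs."""
        total = os.cpu_count() or 1
        if total < min_total:
            raise RuntimeError(
                "node has %d CPUs, need at least %d for the shared/dedicated "
                "separation test" % (total, min_total))
        # Give the first half of the CPUs to the dedicated set, the rest to shared.
        dedicated = list(range(total // 2))
        shared = list(range(total // 2, total))
        return dedicated, shared

    if __name__ == "__main__":
        print(cpu_split())

A job built this way could fail fast (or skip the separation tests) on small nodes rather than hitting the vcpu mismatch seen in the n-cpu log above.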
opendevreviewdaniel.pawlik proposed openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook  https://review.opendev.org/c/openstack/ci-log-processing/+/92988107:22
*** bauzas_ is now known as bauzas07:25
opendevreviewMerged openstack/ci-log-processing master: Hide sensitive data on executing Ansible playbook  https://review.opendev.org/c/openstack/ci-log-processing/+/92988107:58
sean-k-mooneytkajinam: tonyb  from my perspective if a provider is not providing 8 CPUs then it's not meeting the minimum requirement we have written down, and the nodeset should be separate10:40
sean-k-mooneyit should not be used by any of the default jobs10:40
sean-k-mooneyI'm trying to find where this is written down, but all vms are required to have 80GB of disk, 8 cpus and 8GB ram as our minimum requirements10:43
tonybsean-k-mooney: Sure.  We can probably make a flavor that meets that but at this point I don't have enough data to say one way or another10:51
sean-k-mooneytonyb: I'm currently trying to find the doc where we specified the minimum requirements for all cloud providers contributing resources to opendev10:56
sean-k-mooneyI know it exists because I considered doing that myself a few years ago10:56
sean-k-mooneywe detailed that all providers must be capable of hosting 100 vms, with 8G of ram, 8 cpus and 80G of disk that could be split across multiple disks10:57
tonybWell the 100Vms thing isn't a requirement10:58
sean-k-mooneyit was the one that tripped me up10:59
sean-k-mooneyi could do 50-80 at the time10:59
sean-k-mooneyso it was in the document at one point10:59
tonybWell we have clouds that provide 50VMs10:59
tonybsean-k-mooney: You can look at the various providers in the nl* files eg https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L9811:01
sean-k-mooneysure, the reason given at the time is that the infra team did not want to manage a lot of small providers, but I'm sure that changed as time went on11:02
tonybsean-k-mooney: I've found the doc I think you're referring to ... I just need to find a link11:02
sean-k-mooneyI'm more concerned that the document that was meant to make sure all our providers are consistent has vanished from the internet11:03
tonybIt hasn't11:03
tonybhttps://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst#L19411:03
sean-k-mooneyyep that's it11:04
sean-k-mooneyI thought it was in the infra-manual for some reason11:04
tonybanyway that's the source11:04
sean-k-mooneyhttps://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L63-L6411:04
sean-k-mooneythat's the 100 instance requirement11:05
tonybWell "would be helpful"11:05
sean-k-mooneyand for this conversation https://opendev.org/opendev/system-config/src/branch/master/doc/source/contribute-cloud.rst?display=source#L51-L54 is where the 8 vcpu is required for a standard flavor11:05
tonybYup I see that11:05
sean-k-mooneyto be fair it happens infrequently enough that adding a 50 vm provider every couple of months is probably fine11:06
tonybI'll chat with the other opendev sysadmins11:06
sean-k-mooneythe 4vcpu nodeset is probably ok for a lot of things like tox jobs11:06
tonybI don't know the state of quotas on raxflex, but we can probably create a custom flavor that supplies 8vcpus11:07
sean-k-mooneyit should just use a different nodepool label so we can align that with a different nodeset11:07
sean-k-mooneyit's really only the devstack jobs that care about the vcpu count for our tempest concurrency requirement11:07
tonybPerhaps, that creates different kinds of work11:08
sean-k-mooneywe used to have the extended label for the 16G vexhost ones11:09
tonybI think we still do11:09
sean-k-mooneybut yeah I can see that being annoying for the current tox usage11:09
sean-k-mooneyI think so too but we don't use them in nova currently11:10
sean-k-mooneywe experimented with them for some numa/hugepage testing11:10
tonybWe now know it's an issue with the cloud. I'll talk with the other infra-roots to figure out the best fix11:10
sean-k-mooneyno worries11:10
sean-k-mooneythe rendered version is here by the way https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html I think it's not in my history because it moved from openstack to opendev at some point11:15
sean-k-mooneyprobably a few years ago11:15
tonybMakes sense11:16
sean-k-mooneyby the way, if 50 or 25 nodes were acceptable as a provider I may reconsider that in the future. I'm currently redoing all my home infra, including considering replacing some of my older systems and upgrading to a 10G core with 2.5G to clients. I did have a third party ci running for a time, but it will be a while before I'm in a position to think about that seriously11:29
tonybI'd say 50VMs would be a viable minimum, unless they're "funky" in some way architecture/config etc etc11:31
tonybSo yeah if you do redo your home lab that could be cool11:32
sean-k-mooneybefore looking at adding them as a provider I think I would want to benchmark it as a third party for a bit to see if it's actually viable under load11:33
tonybSounds good11:33
sean-k-mooneylike if I can host 50 vms but only when they are idle, or the disk io etc is too low, there's little point11:34
tonybYup makes sense11:34
fungitkajinam: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region."12:15
fungisean-k-mooney: our nodesets don't specify cpu count. we scale them based on performance and available flavors12:15
sean-k-mooneyfungi: the requirements for a ci provider specify the minimum that the flavors can provide12:16
fungialso i don't think "we" have the necessary access to add arbitrary flavors to public cloud providers, unless that's a feature i'm not aware of (though we could request some different flavors)12:16
sean-k-mooneythey can provide more, but we established the baseline so that jobs would be able to depend on that for their config12:17
sean-k-mooneythat provider is currently using flavor-name: 'gp.0.4.8'12:18
sean-k-mooneywhich is not valid for the default labels12:19
fungiwe used a 4 vpcu flavor in osic for years12:19
sean-k-mooneyso it should not be exposed to zuul as ubuntu-jammy12:19
fungibecause 8vcpu was "too fast" (the cpus were much newer)12:20
fungianyway, i don't think we're going to change this, the guidance to cloud providers wanting to contribute resources was not meant as a guarantee for what users should expect12:20
sean-k-mooneyso there are ways to address that without breaking what's exposed to the guest12:20
fungiyou should not write your jobs based on an assumption of how many cpus a node has12:21
sean-k-mooneywe have to 12:21
fungiwe expressly document *that* in the user-facing document in our infra manual12:21
fungiplease find an alternative12:21
sean-k-mooneyI don't think that is an acceptable solution12:21
sean-k-mooneyto have this be dynamic12:21
sean-k-mooneyto me this is a contract that we use to reason about all our jobs12:22
fungiit's not a contract you have with opendev12:22
fungiit's a contract you have invented for yourself12:22
sean-k-mooneyit was in the past12:22
sean-k-mooneyno12:22
fungiit never was12:22
fungisee above re: the osic cloud12:22
sean-k-mooneywe have talked about this at length in the past in the context of numa testing and nested virt12:22
fungimaybe you wrote that job after osic was closed down12:22
sean-k-mooneyI changed the job recently to depend on this, yes12:23
sean-k-mooneybut we indirectly depended on this in the past for calculating the concurrency12:23
fungiwe document specifically that cpu count can differ between providers. please don't assume it will be the same everywhere12:23
sean-k-mooneythe document defined the minimums12:23
sean-k-mooneywe are not assuming it's the same12:24
sean-k-mooneyjust that it's at least 812:24
fungiminimums as guidance to providers. that's not a user-facing document12:24
fungiwe negotiate those based on performance metrics12:24
sean-k-mooneythat's never been communicated to the projects12:24
fungiit has, in the document i linked12:25
sean-k-mooneywe were always told that that was the minimum we could depend on12:25
sean-k-mooneywe had a long discussion about this when vexhost upgraded and started providing 16G vms too12:25
fungiwhat we document for users is "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region."12:25
sean-k-mooneythis explains some of the random no valid host issues we occasionally hit12:26
sean-k-mooneythe tempest jobs have never been written to handle this changing properly12:27
fungiwe promise in the document i linked that standard nodes provide 8gb ram and at least 80gb of disk (though that may be in aggregate between the rootfs and ephemeral disk), but say cpu count can vary12:27
sean-k-mooneyI need to test something12:27
sean-k-mooneyI think we can present a consistent cpu topology to the guest even if the host cpus differ12:28
fungitempest jobs used to scale the number of parallel tests by the number of available cpus. has that changed?12:28
sean-k-mooneywell we can, I'm just not sure if nova allows it to be more than flavor.vcpu12:28
fungiand tempest's scaling was precisely because cpu count is expected to vary12:29
sean-k-mooneywe scale differently in some jobs vs others and we also pin some12:29
sean-k-mooneybecause we were getting random test failures in some jobs so we had to reduce it12:30
sean-k-mooneymost jobs now expect to run with concurrency 6, some 4 and others 3 I think12:30
fungiyeah, i remember when we introduced 16vcpu nodes for a while in one provider and 4vcpu nodes in another, it helped shake out some otherwise unnoticed bugs in tempest12:30
sean-k-mooneythat's assuming we have 8 vcpus12:30
sean-k-mooneyso for the 11 years I have been working on openstack the fact that it was 8 vcpus was always set in stone in my mind, so to me the fact you expect it to differ is very surprising, and I'm sure many other projects would be surprised to hear that too12:32
sean-k-mooneyI always expected any other cpu count or ram count would be a different label/nodeset12:32
fungiin rackspace flex specifically, they don't have an 8gb ram 8vcpu flavor, but testing indicated that the cpus were ~2x the speed of those in our other donors12:33
sean-k-mooneyright but that does not help us test cpu pinning etc12:33
sean-k-mooneyI can make the job calculate the config dynamically based on the cpus that are available12:34
sean-k-mooneybut that's less stable12:34
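A rough illustration of "calculate the config dynamically": scale the tempest concurrency from the detected CPU count with a cap. The reservation ratio and the cap of 6 are assumptions made up for the example, not values any job actually uses.

    # Hypothetical sketch: pick a tempest concurrency from the detected CPU
    # count rather than hard-coding a value that assumes 8 vCPUs.
    import os

    def tempest_concurrency(max_concurrency=6):
        """Scale concurrency with available CPUs, leaving headroom for services."""
        total = os.cpu_count() or 1
        # Reserve roughly a quarter of the CPUs for devstack services; the
        # ratio and the cap are illustrative assumptions only.
        return max(1, min(max_concurrency, (total * 3) // 4))

    if __name__ == "__main__":
        print(tempest_concurrency())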
sean-k-mooneyin the image we upload to glance via nodepool we can describe the vcpu topology to present to the guest12:35
sean-k-mooneywhat I need to check is whether we can say 1 socket, 8 cores, 1 thread and have flavor.vcpu=412:36
sean-k-mooneythe virtual topology is meant to be decoupled from the host topology, but I don't think we allow the virtual topology to exceed the host topology; we might though12:37
sean-k-mooneythat could allow us to normalise across providers if it works12:37
sean-k-mooneyI just don't know if that would work for both kvm and xen12:37
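For reference, a hedged sketch of what describing the guest CPU topology on the uploaded image could look like with openstacksdk, via the standard hw_cpu_* image metadata properties. The cloud and image names are placeholders, and whether nova accepts a topology larger than flavor.vcpus is exactly the open question above, so this is illustrative only.

    # Hypothetical sketch: pin a fixed guest CPU topology on the image that
    # nodepool uploads, using the hw_cpu_* image metadata properties.
    import openstack

    conn = openstack.connect(cloud="raxflex")  # cloud name is a placeholder
    image = conn.image.find_image("ubuntu-jammy")  # image name is a placeholder
    conn.image.update_image(
        image,
        hw_cpu_sockets="1",
        hw_cpu_cores="8",   # untested whether this may exceed flavor.vcpus
        hw_cpu_threads="1",
    )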
opendevreviewStephen Finucane proposed openstack/project-config master: Start including team as release meta field  https://review.opendev.org/c/openstack/project-config/+/92991412:44
fungii suppose some good news here is that, over the long term, we expect the rackspace flex donation to replace our use of rackspace classic, at which point we'll be kvm everywhere (and xen is unpopular enough at this point that i don't see that changing)13:05
sean-k-mooneyperhaps. I'm not convinced this dynamic flavor sizing isn't more problematic13:08
sean-k-mooneyif you're saying that going forward we could use nested kvm as a result of this move13:09
sean-k-mooneythen yes it would be a large performance boost13:09
sean-k-mooneydisk io has historically been more of a problem than cpu performance from my perspective13:09
sean-k-mooneyI don't think xen was the culprit there, more so the age of the hardware13:10
fungiwe have vanishingly few providers where we don't also add nested virt capable node labels, so yes barring future regressions in nested kvm capabilities that would be one outcome13:11
fungibut we've observed enough instability with linux nested kvm using random combinations of kernel versions between host/guest/nested that i'm hesitant to make that claim boldly13:13
sean-k-mooneyit's definitely improved with newer kernels but yes13:14
sean-k-mooneyamd hosts are less stable than intel from what I have observed13:14
sean-k-mooneybut I think that's changing13:14
clarkbfungi covered it well, but I also want to add that we try to be good stewards of donated resources and find a balance between demanding more than is necessary and making good effective use of the resources. As fungi mentioned we selected a flavor that was already available (so we don't have to demand a public cloud bend to our will and add new flavors), testing indicated14:55
clarkbperformance was on par with flavors in other clouds with more cpu, and we never promise the count will be consistent across clouds for this reason14:55
clarkbessentially we're doing our best here to balance inputs from a hundred different directions14:55
*** bauzas_ is now known as bauzas22:13
