opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : add submit requirements to NoBlock labels https://review.opendev.org/c/openstack/project-config/+/875993 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements https://review.opendev.org/c/openstack/project-config/+/875996 | 03:42 |
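[Editor's note: for context on the series of reviews above, here is a hedged sketch of the general label-to-submit-requirement migration pattern in a Gerrit project.config ACL. The label name, values, and predicates are illustrative assumptions, not the exact contents of those changes.]

```
# Hypothetical project.config excerpt. Deprecated label functions
# (NoOp/NoBlock/AnyWithBlock) are replaced by an explicit
# submit-requirement section; the label keeps function = NoBlock so the
# old function mechanism no longer gates submission.
[label "Review-Priority"]
    function = NoBlock
    value = -1 Branch freeze
    value = 0 No priority
    value = +1 Priority review
[submit-requirement "Review-Priority"]
    description = A Review-Priority -1 vote blocks submission
    submittableIf = -label:Review-Priority=MIN
```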
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 09:27 |
dpawlik | dansmith: hey, let me know if it is fine for you: https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 09:28 |
*** odyssey4me is now known as odyssey4me__ | 10:58 |
*** odyssey4me__ is now known as odyssey4me | 10:58 |
*** jpena|off is now known as jpena | 11:00 |
*** odyssey4me is now known as odyssey4me__ | 11:06 |
*** odyssey4me__ is now known as odyssey4me | 11:06 |
*** odyssey4me is now known as odyssey4me__ | 12:02 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3"" https://review.opendev.org/c/openstack/project-config/+/876365 | 14:23 |
ade_lee_ | fungi, clarkb gotta ask to hold a node yet again to figure out why the fips ubuntu tests are failing | 14:56 |
dansmith | dpawlik: questions inline, but yeah, sounds like that's what we need, thanks a lot :) | 14:57 |
ade_lee_ | it looks like there is some failure to do iscsi things - specifically with chap algorithms | 14:57 |
ade_lee_ | https://zuul.opendev.org/t/openstack/build/44e7d0b4a565456893f1c096f6b9da61/logs | 14:58 |
ade_lee_ | fungi, clarkb ^^ | 14:58 |
ade_lee_ | we should be setting the chap algorithms correctly, but maybe that doesn't work in the same way for ubuntu | 14:59 |
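[Editor's note: one plausible place to look, based on the symptom described: the CHAP digest list is configured via open-iscsi, and support for that option varies by packaged open-iscsi version, which could behave differently on Ubuntu than elsewhere. A hedged sketch of the relevant setting follows; whether it applies depends on the distro's open-iscsi.]

```
# /etc/iscsi/iscsid.conf (sketch). chap_algs is a relatively new
# open-iscsi option (FIPS mode disallows the MD5 default), so whether the
# Ubuntu-packaged open-iscsi honors it is exactly the kind of distro
# difference worth checking on the held node.
node.session.auth.authmethod = CHAP
node.session.auth.chap_algs = SHA3-256,SHA256,SHA1
```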
dpawlik | dansmith: all right | 15:02 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3"" https://review.opendev.org/c/openstack/project-config/+/876365 | 15:03 |
dpawlik | dansmith: I applied the change on the logscraper. You should now get more details for TIMED_OUT jobs | 15:04 |
dpawlik | dansmith: to see how opensearch keeps the field in the index, you can just click on "json" when you click on some document (Expanded document) | 15:05 |
dansmith | dpawlik: right. but are there any other list fields? | 15:06 |
dpawlik | you mean whether there are already some fields that use a list? | 15:07 |
dansmith | yeah I was just curious how that's going to look in the query interface | 15:07 |
dansmith | obviously being able to see it in json is something.. I'm trying to get my search to refresh | 15:08 |
dpawlik | https://paste.openstack.org/show/bcm3M0hsQgJNLyrKGCuq/ | 15:08 |
dpawlik | so there are a few fields that contain a list or dict | 15:08 |
dansmith | ah okay, tags for example | 15:08 |
dansmith | cool, yeah, that looks good then | 15:08 |
dpawlik | dansmith: try this one: https://opensearch.logs.openstack.org/_dashboards/app/visualize#/edit/21f18650-b9d6-11ed-a277-139f56dc2b08?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30m,to:now))&_a=(filters:!(),linked:!f,query:(language:kuery,query:''),uiState:(),vis:(aggs:!((enabled:!t,id:'1',params:(field:build_uuid.keyword),schema:metric,type:cardinality),(enabled:!t,id:'3',params:(field:hosts_region.keyword,missingBucket:!f,missingBucketLabel:Missing,order:desc,orderBy:'1',otherBucket:!f,otherBucketLabel:Other,size:5),schema:segment,type:terms),(enabled:!t,id:'4',params:(filters:!((input:(language:kuery,query:'build_status:%22TIMED_OUT%22'),label:''))),schema:split,type:filters)),params:(addLegend:!t,addTimeMarker:!f,addTooltip:!t,categoryAxes:!((id:CategoryAxis-1,labels:(filter:!t,show:!t,truncate:100),position:bottom,scale:(type:linear),show:!t,style:(),title:(),type:category)),grid:(categoryLines:!f),labels:(show:!f),legendPosition:right,row:!f,seriesParams:!((data:(id:'1',label:'Unique%20count%20of%20build_uuid.keyword'),drawLinesBetweenPoints:!t,lineWidth:2,mode:stacked,show:!t,showCircles:!t,type:histogram,valueAxis:ValueAxis-1)),thresholdLine:(color:%23E7664C,show:!f,style:full,value:10,width:1),times:!(),type:histogram,valueAxes:!((id:ValueAxis-1,labels:(filter:!f,rotate:0,show:!t,truncate:100),name:LeftAxis-1,position:left,scale:(mode:normal,type:linear),show:!t,style:(),title:(text:'Unique%20count%20of%20build_uuid.keyword'),type:value))),title:TIME_OUT-builds-region,type:histogram)) | 15:15 |
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 15:15 |
dansmith | dpawlik: it's not merged yet so we're not actually using the new rules yet right? | 15:16 |
dpawlik | dansmith: I still haven't had enough time to automate service deployment after a change is merged | 15:19 |
dpawlik | one day when I have a few minutes, I will finally finish the release process and release logscraper v1.0.0 | 15:19 |
dpawlik | and automate service deployment, etc. | 15:20 |
dpawlik | So far, it is a manual job.... | 15:20 |
dansmith | okay I'm not sure what you're saying.. so you're just manually applying the changes and this is already applied? | 15:20 |
dpawlik | so the container is recreated; I just changed the service container image. That's it. | 15:21 |
dpawlik | merging change | 15:22 |
dpawlik | I would be really happy if there were more people to handle that | 15:22 |
dansmith | so, currently, every TIMED_OUT job I see is coming from rax-IAD | 15:29 |
dansmith | presumably this has to soak a bit to get a better view, | 15:29 |
dansmith | but I also wonder if the IAD hardware is all much older than others and our timeout problems come from longer jobs landing on those nodes | 15:30 |
dansmith | fungi: any idea what the distribution is of nodes in regions? | 15:31 |
JayF | dansmith: in #opendev, they just pushed some kind of change to remove RAX-iad from rotation, a mirror died or something like that? Not sure if you're tuned into that or not, but it might be related | 15:31 |
dansmith | I wish this discussion didn't have to be so fragmented | 15:31 |
dansmith | but mirror issues are probably not related to job timeouts | 15:32 |
dansmith | also looks like maybe that's not rax-IAD | 15:32 |
dpawlik | dansmith: did you check the visualization? | 15:33 |
dpawlik | https://paste.openstack.org/show/bRXYvgTvs3SPrDlZYk1G/ | 15:33 |
dpawlik | as I see it, it is too early to say whether it's rax or ovh | 15:34 |
fungi | dansmith: it's not assumed that all the hardware in a given provider region is even the same, we document that here: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region." | 15:34 |
dansmith | fungi: oh I know | 15:35 |
fungi | JayF: that was inmotion-iad3 not rax-iad which we disabled, totally different cloud | 15:35 |
JayF | fungi: I see that now, I think I might have merged two things in my head, thank you for the correction | 15:35 |
dansmith | fungi: I've just been trying to determine why we're suddenly hitting a ton of job timeouts, and if we slowly grew past the amount of things we can test on our slowest set of nodes, all the timeouts landing on one set of hardware would be an indicator of that | 15:35 |
dansmith | fungi: I'm just looking for clues | 15:36 |
fungi | yeah, i don't actually know how the hardware in different providers compares | 15:36 |
dansmith | fungi: yeah, I don't really expect we would be able to know that | 15:36 |
fungi | i don't think anyone's tried to do a survey, but because we can't even expect all hardware in a particular provider to be consistent, it would be a nontrivial exercise | 15:37 |
dansmith | fungi: fwiw, what i was asking above was if you knew something like "80% of our quota is in rax-IAD" | 15:37 |
fungi | oh that. i think we have more effective quota in ovh than rax, but we have a dashboard with numbers, just a sec | 15:37 |
dansmith | okay | 15:37 |
dansmith | I think what dpawlik is trying to show is that all the timeouts we have recorded (so far) are spread between one ovh and one rax region | 15:38 |
dpawlik | after a few days the visualization "would say something more" | 15:40 |
dansmith | yeah | 15:40 |
fungi | these are the rackspace utilization charts: https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1 | 15:41 |
fungi | and these are ovh: https://grafana.opendev.org/d/2b4dba9e25/nodepool-ovh?orgId=1 | 15:41 |
fungi | so yes it looks like we have more quota in rackspace than ovh after all | 15:42 |
dansmith | ack | 15:42 |
dansmith | fungi: going back to your knowing things about the hardware in a region, | 15:43 |
dansmith | if we take a single fat job and can show that, if it times out, it almost always does so in a given region, then we can probably make some assumption about the speed of those nodes (either raw, or "throughput" with noisy neighbors) | 15:44 |
fungi | i suppose, with the caveat that "speed" is a multi-faceted thing. you can at least extrapolate it to "slower at running the same kinds of jobs as the ones which time out" | 15:45 |
fungi | lots of job timeouts are second-order symptoms of something like memory exhaustion, so you could actually be measuring "how well does this provider's disk handle swap thrash" | 15:46 |
dansmith | assuming a composite job mix, of course | 15:46 |
dansmith | but yeah | 15:46 |
fungi | unfortunately the different resources and underlying hardware aren't usually adjustable in isolation from one another, so while we do have some larger-memory flavors we could try to run the same jobs on for comparison, they're also going to be in a different provider on different hardware (where the preferred memory-to-cpu ratio is chosen by the provider to more efficiently pack their servers) | 15:50 |
fungi | so even if it ran faster on nodes with more memory, we'd be hard pressed to say for sure that the additional memory is why it ran faster | 15:51 |
opendevreview | Merged openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 15:51 |
dansmith | in isolation sure, | 15:51 |
ade_lee_ | fungi, clarkb ? | 15:51 |
fungi | ade_lee_: yep, pulling up the build info so i can set an autohold for it | 15:51 |
ade_lee_ | fungi, thanks | 15:51 |
dansmith | fungi: but if you run tens of thousands of jobs all well-distributed across the nodes, and you see a strong correlation of timeouts on one provider for one job, I think you can make the conclusion about those nodes being "slower" for that workload | 15:52 |
dansmith | if you don't have a strong correlation then you can't of course | 15:52 |
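[Editor's note: a toy Python sketch of the correlation check being described, assuming build records have been exported as dicts with job_name, build_status, and hosts_region keys; the field names follow the discussion above, and hosts_region is treated as a list per the earlier mapping discussion. Where the records come from is up to the reader.]

```python
from collections import Counter

def timeout_regions(builds, job_name):
    """Distribution of one job's TIMED_OUT builds across regions.

    `builds`: iterable of dicts with `job_name`, `build_status`, and
    `hosts_region` (a list, per the mapping discussed above) keys.
    Returns {region: fraction of the job's timeouts seen there}.
    """
    counts = Counter()
    for b in builds:
        if b.get("job_name") == job_name and b.get("build_status") == "TIMED_OUT":
            for region in b.get("hosts_region") or []:
                counts[region] += 1
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}
```

[As dansmith notes, a skew here is only meaningful against a baseline: computing the same histogram over all of the job's builds shows whether timeouts are merely tracking where the job runs.]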
fungi | zuul-client autohold --tenant=openstack --project=opendev.org/openstack/tempest --job=tempest-all-fips-focal --ref='refs/changes/97/873697/.*' --reason='ade_lee looking into fips iscsi chap errors' | 15:54 |
fungi | that's set now | 15:54 |
fungi | dansmith: yes, of course | 15:54 |
fungi | though not necessarily what to change in order to address the slowness | 15:55 |
dansmith | no, not unless you know stuff about what makes that job special (which we do in some cases) | 15:55 |
fungi | (for example, in some cases we're the reason the nodes seem "slow" thanks to being in an overcommit configuration that isn't tuned for our worloads) | 15:56 |
dansmith | workloads or warlords? :) | 15:56 |
fungi | s/worloads/workloads/ | 15:56 |
fungi | both | 15:56 |
fungi | warring openstates | 15:57 |
dansmith | my theory is more that we continue to grow our list of tests (and our server software is probably also getting slower) and we're getting closer to the limit of what we can test in two hours | 15:57 |
dansmith | so I'm just looking for clues that suggest that's the case, and if not, maybe suggest what else might be the problem | 15:57 |
fungi | also not new. we had the same sort of discussion when devstack jobs started taking longer than 45 minutes ;) | 15:57 |
dansmith | and I don't know what else to do other than look at the data along different axes until I see something that correlates | 15:58 |
fungi | agreed | 15:58 |
fungi | or ask chatgpt. it can probably give you an explanation (not a correct one, but it will totally sound plausible) | 15:58 |
dansmith | I thought chatgpt has feelings now and we're not supposed to ask it hard questions that might cause it to need to seek therapy? | 15:59 |
dansmith | or is that bing? | 15:59 |
dansmith | maybe chatgpt could be the therapist for bing... | 15:59 |
fungi | oh, right, the bing-chat ai was the one they had to "lobotomize" after it started threatening users | 16:02 |
dansmith | yeah, I totally love that it took about two weeks for the "good AI" to get too creepy for human comfort | 16:18 |
ade_lee_ | fungi, thanks -- I'll kick off a recheck now | 16:19 |
*** jpena is now known as jpena|off | 17:21 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 17:56 |
ade_lee_ | fungi, looks like the node already failed | 17:59 |
ade_lee_ | fungi, https://zuul.opendev.org/t/openstack/build/aa9d89ea073a40a6b84895a019707d90 | 18:00 |
fungi | ade_lee_: what ssh key do you want authorized for it? | 18:00 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 18:00 |
ade_lee_ | fungi, https://paste.openstack.org/show/bADwloaOalEwhkusIxjF/ | 18:01 |
fungi | ade_lee_: ssh root@173.231.255.77 | 18:02 |
ade_lee_ | fungi, thanks - in | 18:03 |
fungi | cool, let us know when you're done and we'll clean up the hold | 18:03 |
ade_lee_ | fungi, will do | 18:03 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 18:09 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 18:11 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 18:11 |