opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : add submit requirements to NoBlock labels https://review.opendev.org/c/openstack/project-config/+/875993 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 03:42 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements https://review.opendev.org/c/openstack/project-config/+/875996 | 03:42 |
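[Editor's note: for context on the series of reviews above, here is a hedged sketch of the general label-to-submit-requirement migration pattern in a Gerrit project.config ACL. The label name, values, and predicates are illustrative assumptions, not the exact contents of those changes.]

```
# Hypothetical project.config excerpt. Deprecated label functions
# (NoOp/NoBlock/AnyWithBlock) are replaced by an explicit
# submit-requirement section; the label keeps function = NoBlock so the
# old function mechanism no longer gates submission.
[label "Review-Priority"]
    function = NoBlock
    value = -1 Branch freeze
    value = 0 No priority
    value = +1 Priority review
[submit-requirement "Review-Priority"]
    description = A Review-Priority -1 vote blocks submission
    submittableIf = -label:Review-Priority=MIN
```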
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 09:27 |
dpawlik | dansmith: hey, let me know if it is fine for you: https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 09:28 |
*** odyssey4me is now known as odyssey4me__ | 10:58 |
*** odyssey4me__ is now known as odyssey4me | 10:58 |
*** jpena|off is now known as jpena | 11:00 |
*** odyssey4me is now known as odyssey4me__ | 11:06 |
*** odyssey4me__ is now known as odyssey4me | 11:06 |
*** odyssey4me is now known as odyssey4me__ | 12:02 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3"" https://review.opendev.org/c/openstack/project-config/+/876365 | 14:23 |
ade_lee_ | fungi, clarkb gotta ask to hold a node yet again to figure out why the fips ubuntu tests are failing | 14:56 |
dansmith | dpawlik: questions inline, but yeah, sounds like that's what we need, thanks a lot :) | 14:57 |
ade_lee_ | it looks like there is some failure to do iscsi things - specifically with chap algorithms | 14:57 |
ade_lee_ | https://zuul.opendev.org/t/openstack/build/44e7d0b4a565456893f1c096f6b9da61/logs | 14:58 |
ade_lee_ | fungi, clarkb ^^ | 14:58 |
ade_lee_ | we should be setting the chap algorithms correctly, but maybe that doesn't work in the same way for ubuntu | 14:59 |
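[Editor's note: one plausible place to look, based on the symptom described: the CHAP digest list is configured via open-iscsi, and support for that option varies by packaged open-iscsi version, which could behave differently on Ubuntu than elsewhere. A hedged sketch of the relevant setting follows; whether it applies depends on the distro's open-iscsi.]

```
# /etc/iscsi/iscsid.conf (sketch). chap_algs is a relatively new
# open-iscsi option (FIPS mode disallows the MD5 default), so whether the
# Ubuntu-packaged open-iscsi honors it is exactly the kind of distro
# difference worth checking on the held node.
node.session.auth.authmethod = CHAP
node.session.auth.chap_algs = SHA3-256,SHA256,SHA1
```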
dpawlik | dansmith: all right | 15:02 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3"" https://review.opendev.org/c/openstack/project-config/+/876365 | 15:03 |
dpawlik | dansmith: I applied the change on the logscraper. You should now get more details for TIMED_OUT jobs | 15:04 |
dpawlik | dansmith: to see how opensearch keeps the field in the index, you can just click on "json" when you click on some document (Expanded document) | 15:05 |
dansmith | dpawlik: right. but are there any other list fields? | 15:06 |
dpawlik | you mean whether there are already some fields that use a list? | 15:07 |
dansmith | yeah I was just curious how that's going to look in the query interface | 15:07 |
dansmith | obviously being able to see it in json is something.. I'm trying to get my search to refresh | 15:08 |
dpawlik | https://paste.openstack.org/show/bcm3M0hsQgJNLyrKGCuq/ | 15:08 |
dpawlik | so there are a few fields that contain a list or dict | 15:08 |
dansmith | ah okay, tags for example | 15:08 |
dansmith | cool, yeah, that looks good then | 15:08 |
dpawlik | dansmith: try this one: https://opensearch.logs.openstack.org/_dashboards/app/visualize#/edit/21f18650-b9d6-11ed-a277-139f56dc2b08?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30m,to:now))&_a=(filters:!(),linked:!f,query:(language:kuery,query:''),uiState:(),vis:(aggs:!((enabled:!t,id:'1',params:(field:build_uuid.keyword),schema:metric,type:cardinality),(enabled:!t,id:'3',params:(field:hosts_region.keyword,missingBucket:!f,missingBucketLabel:Missing,order:desc,orderBy:'1',otherBucket:!f,otherBucketLabel:Other,size:5),schema:segment,type:terms),(enabled:!t,id:'4',params:(filters:!((input:(language:kuery,query:'build_status:%22TIMED_OUT%22'),label:''))),schema:split,type:filters)),params:(addLegend:!t,addTimeMarker:!f,addTooltip:!t,categoryAxes:!((id:CategoryAxis-1,labels:(filter:!t,show:!t,truncate:100),position:bottom,scale:(type:linear),show:!t,style:(),title:(),type:category)),grid:(categoryLines:!f),labels:(show:!f),legendPosition:right,row:!f,seriesParams:!((data:(id:'1',label:'Unique%20count%20of%20build_uuid.keyword'),drawLinesBetweenPoints:!t,lineWidth:2,mode:stacked,show:!t,showCircles:!t,type:histogram,valueAxis:ValueAxis-1)),thresholdLine:(color:%23E7664C,show:!f,style:full,value:10,width:1),times:!(),type:histogram,valueAxes:!((id:ValueAxis-1,labels:(filter:!f,rotate:0,show:!t,truncate:100),name:LeftAxis-1,position:left,scale:(mode:normal,type:linear),show:!t,style:(),title:(text:'Unique%20count%20of%20build_uuid.keyword'),type:value))),title:TIME_OUT-builds-region,type:histogram)) | 15:15 |
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 15:15 |
dansmith | dpawlik: it's not merged yet so we're not actually using the new rules yet right? | 15:16 |
dpawlik | dansmith: I still haven't had enough time to automate service deployment after a change is merged | 15:19 |
dpawlik | one day when I have a few minutes, I will finally finish the release process and release logscraper v1.0.0 | 15:19 |
dpawlik | and automate service deployment, etc. | 15:20 |
dpawlik | So far, it is a manual job.... | 15:20 |
dansmith | okay I'm not sure what you're saying.. so you're just manually applying the changes and this is already applied? | 15:20 |
dpawlik | so the container is recreated; I just changed the service container image. That's it. | 15:21 |
dpawlik | merging change | 15:22 |
dpawlik | I would be really happy if there were more people to handle that | 15:22 |
dansmith | so, currently, every TIMED_OUT job I see is coming from rax-IAD | 15:29 |
dansmith | presumably this has to soak a bit to get a better view, | 15:29 |
dansmith | but I also wonder if the IAD hardware is all much older than others and our timeout problems come from longer jobs landing on those nodes | 15:30 |
dansmith | fungi: any idea what the distribution is of nodes in regions? | 15:31 |
JayF | dansmith: in #opendev, they just pushed some kind of change to remove RAX-iad from rotation, a mirror died or something like that? Not sure if you're tuned into that or not, but it might be related | 15:31 |
dansmith | I wish this discussion didn't have to be so fragmented | 15:31 |
dansmith | but mirror issues are probably not related to job timeouts | 15:32 |
dansmith | also looks like maybe that's not rax-IAD | 15:32 |
dpawlik | dansmith: did you check the visualization? | 15:33 |
dpawlik | https://paste.openstack.org/show/bRXYvgTvs3SPrDlZYk1G/ | 15:33 |
dpawlik | as I see it, it is too early to say whether it's rax or ovh | 15:34 |
fungi | dansmith: it's not assumed that all the hardware in a given provider region is even the same, we document that here: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region." | 15:34 |
dansmith | fungi: oh I know | 15:35 |
fungi | JayF: that was inmotion-iad3 not rax-iad which we disabled, totally different cloud | 15:35 |
JayF | fungi: I see that now, I think I might have merged two things in my head, thank you for the correction | 15:35 |
dansmith | fungi: I've just been trying to determine why we're suddenly hitting a ton of job timeouts, and if we slowly grew past the amount of things we can test on our slowest set of nodes, all the timeouts landing on one set of hardware would be an indicator of that | 15:35 |
dansmith | fungi: I'm just looking for clues | 15:36 |
fungi | yeah, i don't actually know how the hardware in different providers compares | 15:36 |
dansmith | fungi: yeah, I don't really expect we would be able to know that | 15:36 |
fungi | i don't think anyone's tried to do a survey, but because we can't even expect all hardware in a particular provider to be consistent, it would be a nontrivial exercise | 15:37 |
dansmith | fungi: fwiw, what i was asking above was if you knew something like "80% of our quota is in rax-IAD" | 15:37 |
fungi | oh that. i think we have more effective quota in ovh than rax, but we have a dashboard with numbers, just a sec | 15:37 |
dansmith | okay | 15:37 |
dansmith | I think what dpawlik is trying to show is that all the timeouts we have recorded (so far) are spread between one ovh and one rax region | 15:38 |
dpawlik | after a few days the visualization "would say something more" | 15:40 |
dansmith | yeah | 15:40 |
fungi | these are the rackspace utilization charts: https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1 | 15:41 |
fungi | and these are ovh: https://grafana.opendev.org/d/2b4dba9e25/nodepool-ovh?orgId=1 | 15:41 |
fungi | so yes it looks like we have more quota in rackspace than ovh after all | 15:42 |
dansmith | ack | 15:42 |
dansmith | fungi: going back to your knowing things about the hardware in a region, | 15:43 |
dansmith | if we take a single fat job and can show that, if it times out, it almost always does so in a given region, then we can probably make some assumption about the speed of those nodes (either raw, or "throughput" with noisy neighbors) | 15:44 |
fungi | i suppose, with the caveat that "speed" is a multi-faceted thing. you can at least extrapolate it to "slower at running the same kinds of jobs as the ones which time out" | 15:45 |
fungi | lots of job timeouts are second-order symptoms of something like memory exhaustion, so you could actually be measuring "how well does this provider's disk handle swap thrash" | 15:46 |
dansmith | assuming a composite job mix, of course | 15:46 |
dansmith | but yeah | 15:46 |
fungi | unfortunately the different resources and underlying hardware aren't usually adjustable in isolation from one another, so while we do have some larger-memory flavors we could try to run the same jobs on for comparison, they're also going to be in a different provider on different hardware (where the preferred memory-to-cpu ratio is chosen by the provider to more efficiently pack their servers) | 15:50 |
fungi | so even if it ran faster on nodes with more memory, we'd be hard pressed to say for sure that the additional memory is why it ran faster | 15:51 |
opendevreview | Merged openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info https://review.opendev.org/c/openstack/ci-log-processing/+/876260 | 15:51 |
dansmith | in isolation sure, | 15:51 |
ade_lee_ | fungi, clarkb ? | 15:51 |
fungi | ade_lee_: yep, pulling up the build info so i can set an autohold for it | 15:51 |
ade_lee_ | fungi, thanks | 15:51 |
dansmith | fungi: but if you run tens of thousands of jobs all well-distributed across the nodes, and you see a strong correlation of timeouts on one provider for one job, I think you can make the conclusion about those nodes being "slower" for that workload | 15:52 |
dansmith | if you don't have a strong correlation then you can't of course | 15:52 |
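[Editor's note: a toy Python sketch of the correlation check being described, assuming build records have been exported as dicts with job_name, build_status, and hosts_region keys; the field names follow the discussion above, and hosts_region is treated as a list per the earlier mapping discussion. Where the records come from is up to the reader.]

```python
from collections import Counter

def timeout_regions(builds, job_name):
    """Distribution of one job's TIMED_OUT builds across regions.

    `builds`: iterable of dicts with `job_name`, `build_status`, and
    `hosts_region` (a list, per the mapping discussed above) keys.
    Returns {region: fraction of the job's timeouts seen there}.
    """
    counts = Counter()
    for b in builds:
        if b.get("job_name") == job_name and b.get("build_status") == "TIMED_OUT":
            for region in b.get("hosts_region") or []:
                counts[region] += 1
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}
```

[As dansmith notes, a skew here is only meaningful against a baseline: computing the same histogram over all of the job's builds shows whether timeouts are merely tracking where the job runs.]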
fungi | zuul-client autohold --tenant=openstack --project=opendev.org/openstack/tempest --job=tempest-all-fips-focal --ref='refs/changes/97/873697/.*' --reason='ade_lee looking into fips iscsi chap errors' | 15:54 |
fungi | that's set now | 15:54 |
fungi | dansmith: yes, of course | 15:54 |
fungi | though not necessarily what to change in order to address the slowness | 15:55 |
dansmith | no, not unless you know stuff about what makes that job special (which we do in some cases) | 15:55 |
fungi | (for example, in some cases we're the reason the nodes seem "slow" thanks to being in an overcommit configuration that isn't tuned for our worloads) | 15:56 |
dansmith | workloads or warlords? :) | 15:56 |
fungi | s/worloads/workloads/ | 15:56 |
fungi | both | 15:56 |
fungi | warring openstates | 15:57 |
dansmith | my theory is more that we continue to grow our list of tests (and our server software is probably also getting slower) and we're getting closer to the limit of what we can test in two hours | 15:57 |
dansmith | so I'm just looking for clues that suggest that's the case, and if not, maybe suggest what else might be the problem | 15:57 |
fungi | also not new. we had the same sort of discussion when devstack jobs started taking longer than 45 minutes ;) | 15:57 |
dansmith | and I don't know what else to do other than look at the data along different axes until I see something that correlates | 15:58 |
fungi | agreed | 15:58 |
fungi | or ask chatgpt. it can probably give you an explanation (not a correct one, but it will totally sound plausible) | 15:58 |
dansmith | I thought chatgpt has feelings now and we're not supposed to ask it hard questions that might cause it to need to seek therapy? | 15:59 |
dansmith | or is that bing? | 15:59 |
dansmith | maybe chatgpt could be the therapist for bing... | 15:59 |
fungi | oh, right, the bing-chat ai was the one they had to "lobotomize" after it started threatening users | 16:02 |
dansmith | yeah, I totally love that it took about two weeks for the "good AI" to get too creepy for human comfort | 16:18 |
ade_lee_ | fungi, thanks -- I'll kick off a recheck now | 16:19 |
*** jpena is now known as jpena|off | 17:21 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 17:56 |
ade_lee_ | fungi, looks like the node already failed | 17:59 |
ade_lee_ | fungi, https://zuul.opendev.org/t/openstack/build/aa9d89ea073a40a6b84895a019707d90 | 18:00 |
fungi | ade_lee_: what ssh key do you want authorized for it? | 18:00 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 18:00 |
ade_lee_ | fungi, https://paste.openstack.org/show/bADwloaOalEwhkusIxjF/ | 18:01 |
fungi | ade_lee_: ssh root@173.231.255.77 | 18:02 |
ade_lee_ | fungi, thanks - in | 18:03 |
fungi | cool, let us know when you're done and we'll clean up the hold | 18:03 |
ade_lee_ | fungi, will do | 18:03 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 18:09 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 18:11 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 18:11 |