Tuesday, 2019-09-17

weshaynkinder, latest error looks like Certmonger param12:20
weshay"puppet-user: Error: Could not find resource 'Certmonger_certificate[ovn_controller]' in parameter 'require' (file: /etc/puppet/modules/tripleo/manifests/certmonger/ovn_controller.pp, line: 6412:20
ksamborweshay: fix is here https://review.opendev.org/#/c/682586/12:21
ksamborand it is working12:22
cloudnullif folks have a moment, can I get a couple reviews on https://review.opendev.org/#/c/682057/12:25
jaosoriorweshay: this looks like the fix https://review.opendev.org/#/c/682586/1/manifests/certmonger/ovn_controller.pp12:29
weshayjaosorior,  thanks! including that patch in the test :)12:31
*** sanjayu_ has joined #tripleo12:33
*** jpena|lunch is now known as jpena12:37
panda|ruckmmmhhh 20 minutes in containers-multinode to perform a single step12:58
panda|ruck2019-09-17 10:00:53 | command: overcloud deploy -> tripleoclient.v1.overcloud_deploy.DeployOvercloud (auth=True)12:58
panda|ruck2019-09-17 10:00:53 | Using auth plugin: password12:58
panda|ruck2019-09-17 10:00:56 | No stack found, will be doing a stack create12:58
panda|ruck2019-09-17 10:19:22 | Performing Heat stack create12:58
panda|ruck2019-09-17 10:19:45 | Removing the current plan files12:58
EmilienMpanda|ruck: hey13:07
EmilienMpanda|ruck: is tripleo-ci-centos-7-scenario009-multinode-oooq-container broken on stable/rocky?13:07
EmilienMmandre: ^ ping13:08
EmilienMTASK [openshift_node : Install node, clients, and conntrack packages]13:08
EmilienMit's failing on that13:08
mandreEmilienM: it looks like mirror issue, https://de339b72b688803b5a43-80f76e47f83bff14c764cd8b2b7f1a08.ssl.cf2.rackcdn.com/682506/1/check/tripleo-ci-centos-7-scenario009-multinode-oooq-container/8352d58/logs/undercloud/var/lib/mistral/config-download-latest/openshift/playbook.log.txt.gz13:11
openstackgerritEmilien Macchi proposed openstack/paunch stable/rocky: Add --cpuset-cpus support  https://review.opendev.org/68249113:12
EmilienMmandre: right but still failing after 2 recheck :/13:13
EmilienMtrying again13:13
EmilienMmandre: thx for looking. Just to confirm the job is supposed to be stable13:13
panda|ruckit has failed quite a bit in the last few days, but every time it looks like a temporary failure13:14
EmilienMpanda|ruck: do we know why the gate is so backed up?13:14
panda|ruckEmilienM: containers-multinode jobs are timing out during deploy and resetting the queue13:15
panda|ruckEmilienM: there are a few places that seem to take more time than usual during deploy13:16
EmilienMthat's not good13:16
panda|ruckEmilienM: one I pasted above13:16
panda|ruckEmilienM: the other I've seen is 20 minutes around containers prep, but I'm checking if it's consistent13:16
mandreEmilienM: it was green on https://review.opendev.org/#/c/681782/2 a few days ago13:17
EmilienMmandre: thanks for confirming13:19
weshaywe need to fix the ara reports upstream... broken since the log server change13:20
* weshay interested in finding out what is taking longer than usual13:20
panda|rucknot consistent, I see another 10 minutes on that step, and a gap of 15 minutes around TASK [check if libvirt is installed]13:21
EmilienMgchamoul: great start on https://review.opendev.org/#/c/682377/ - I commented13:21
gchamoulEmilienM: thanks but yes it's still not perfect13:22
*** mcornea has joined #tripleo13:23
weshaytripleo-ci community call is happening now13:31
mwhahaha#startmeeting tripleo14:00
mwhahaha#topic agenda14:00
mwhahaha* Review past action items14:00
mwhahaha* One off agenda items14:00
mwhahaha* Squad status14:00
mwhahaha* Bugs & Blueprints14:00
mwhahaha* Projects releases or stable backports14:00
mwhahaha* Specs14:00
mwhahaha* open discussion14:00
mwhahahaAnyone can use the #link, #action and #info commands, not just the moderator!14:00
mwhahahaHi everyone! who is around today?14:00
*** openstack changes topic to "agenda (Meeting topic: tripleo)"14:00
mwhahahaalright let's get going14:04
mwhahaha#topic review past action items14:04
mwhahahamwhahaha to provide weshay with launchpad overview & scripts - NOT DONE14:04
mwhahaha#mwhahaha sync with weshay on ptl things14:04
migirfolco: hey, so let's go back to the nested virt question. Is it enabled in TripleO jobs upstream? I know it's relying on what's provided, but is there a kvm-enabled one or is it not really executed?14:04
*** openstack changes topic to "review past action items (Meeting topic: tripleo)"14:04
*** suuuper has quit IRC14:04
rfolcomigi, tripleo mtg here, lets jump to #oooq if not in agenda for this mtg14:05
mwhahahaso i'll try and steal some time from weshay this week14:05
mwhahahathat's it on the past action items14:05
weshaymwhahaha, I can buy some beers near you if needed :)14:05
mwhahaha#topic one off agenda items14:06
mwhahaha#link https://etherpad.openstack.org/p/tripleo-meeting-items14:06
*** openstack changes topic to "one off agenda items (Meeting topic: tripleo)"14:06
mwhahaha(weshay) centos 8 scheduled to be released on sept 24 https://twitter.com/CentOSProject/status/117365299630517043214:06
weshayaye.. zbr will be working w/ infra folks on the nodepool / dib updates14:07
mwhahahaso there's a question about centos8 support in the upstream14:08
mwhahahadoes the tripleo project need to be concerned about centos-8 support upstream for ussuri?14:08
mwhahahait's likely that we'll just have to carry any support burden14:08
weshayaye.. zbr ^14:08
mwhahahawe'll definitely support it, but the question is the impact on the wider services14:08
weshayk. .thanks.. just double checking etc14:09
*** rpittau is now known as rpittau|afk14:10
weshayone unrelated thing.. train ml3 release has a +1 finally https://review.opendev.org/#/c/681897/14:10
mwhahahaany other comments on this item?14:10
weshaynothing else on centos-8 from me14:11
mwhahahak thanks14:11
mwhahaha#topic Squad status14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-ci-squad-meeting14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-upgrade-squad-status14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-edge-squad-status14:11
*** openstack changes topic to "Squad status (Meeting topic: tripleo)"14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-integration-squad-status14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-validations-squad-status14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-networking-squad-status14:11
mwhahaha#link https://etherpad.openstack.org/p/tripleo-ansible-agenda14:11
mwhahahaany status related highlights?14:11
weshay3rd party rhel 8 jobs are done for now.. ended up w/ ovb fs001, standalone, standalone001/002;  003/004 are blocked and escalated14:13
weshaythat will complete the rhel8 work for now14:13
mwhahahaanything else?14:17
mwhahahasounds like nope, moving on14:18
mwhahaha#topic bugs & blueprints14:18
mwhahaha#link https://launchpad.net/tripleo/+milestone/train-rc114:18
mwhahahaFor Train we currently have 27 blueprints and 506 (-3) open Launchpad Bugs. 14 train-3, 5 train-rc1, 487 ussuri-1.  166 (+0) open Storyboard bugs.14:18
mwhahaha#link https://storyboard.openstack.org/#!/project_group/7614:18
mwhahaha#action mwhahaha to close out train-m314:18
*** openstack changes topic to "bugs & blueprints (Meeting topic: tripleo)"14:18
mwhahahaonce we merge the release change, i'll close out the milestone and move everything forward14:19
mwhahahaany comments/concerns/etc?14:19
mwhahaha#topic projects releases or stable backports14:21
*** openstack changes topic to "projects releases or stable backports (Meeting topic: tripleo)"14:21
mwhahahawe'll cut m3 soonish, so we'll be pushing for rc114:21
mwhahahaany comments on backports/releases?14:22
*** lbragstad_ has joined #tripleo14:26
mwhahaha#topic specs14:26
mwhahaha#link https://review.openstack.org/#/q/project:openstack/tripleo-specs+status:open14:26
*** openstack changes topic to "specs (Meeting topic: tripleo)"14:26
mwhahahaplease update any open specs to target ussuri14:27
mwhahaha#topic open discussion14:28
*** openstack changes topic to "open discussion (Meeting topic: tripleo)"14:28
mwhahahaanything else?14:28
openstackgerritEmilien Macchi proposed openstack/tripleo-heat-templates master: nova-libvirt: set 'cpuset_cpus' to 'all'  https://review.opendev.org/68266514:41
mwhahahathanks everyone14:42
weshaymwhahaha,  we may want to flip gate status to orange/red15:26
*** mwhahaha changes topic to "CI Status: REDish RDOCloud Status: MEHish | community irc meeting Tues@1400 UTC - tripleo-ci-community meeting Tues@1330 UTC | https://docs.openstack.org/tripleo-docs/latest/"15:27
mwhahahawhat did you break?15:27
cloudnullpushups !15:29
weshaymwhahaha, as far as we can tell.. we're hitting timeouts15:29
weshaytrying now to get our ara reports back15:30
mwhahahadid we get 7.7'd15:30
weshayhrm.. /me checks15:31
weshayit did release15:33
weshaypanda|ruck,  fyi ^15:33
mwhahahaso if it's the updating, we need 7.7 containers15:34
weshayya... ok.. we'll focus on promotion of master then..15:34
weshaycontainers are 3 days old15:34
* weshay looks at update log15:34
*** aakarsh|2 is now known as aakarsh15:38
weshaynot much worse today.. yet15:41
*** jcoufal has quit IRC15:42
weshayhrm.. I don't think the centos base updates for 7.7 have hit the openstack mirrors yet15:45
openstackgerritwes hayutin proposed openstack/tripleo-quickstart master: [DNM] Test Centos 7.7 with CR repos  https://review.opendev.org/61883215:45
*** ykarel|away has quit IRC15:50
openstackgerritGabriele Cerami proposed openstack/tripleo-ci master: Reenable ARA html generation from collect logs  https://review.opendev.org/68267916:02
cgoncalvesbogdando, thanks for the quick reviews!16:10
EmilienMweshay: AFAIK cloud images aren't ready yet16:19
EmilienMnot sure if we pull them16:19
EmilienMor build them from source16:19
weshayprobably tonight16:19
EmilienMholser: ^ doing backports for you17:24
EmilienMsaw the warning on osp1517:24
holserEmilienM thanks a lot17:25
cloudnullanyone around to do a couple reviews on https://review.opendev.org/#/c/682057/17:30
dpatersoncloudnull, will take a look17:32
cloudnullthanks dpaterson!17:33
zbrweshay: i was able to get 7.7 after doing a clean all17:39
weshayre: libselinux?17:40
zbrweshay: i bet not, no replies on https://bugs.centos.org/view.php?id=1638917:41
zbrif it will happen, we will first see some replies on that ticket.17:41
zbrweshay: it may be even more problematic. if ansible decides to prefer python3 for some reason (internal or from us), it will fail.17:42
weshayzbr, if that's the case.. I don't think upstream infra would move to 7.717:43
weshaybah.. but they would have to for updates17:43
zbrafaik ansible picks version based on os version, so we should be safe.17:43
weshaykind of hate how centos works17:43
mwhahahaactually it uses /usr/lib/system-python or something now17:43
mwhahahawhich exists as of 7.617:43
zbryou are not alone, memo is quite ho these days17:44
openstackgerritAlex Schultz proposed openstack/tripleo-common master: Split template override files  https://review.opendev.org/68245517:46
openstackgerritBrent Eagles proposed openstack/tripleo-heat-templates master: Add Octavia driver agent service  https://review.opendev.org/65811817:53
zbrweshay: we are "safe" with 7.7, ansible still uses py27 regardless if python3 is installed.18:08
zbrother than this upgrade went nice and quick, but I was already using "cr" repos.18:09
weshayok.. good, thanks for checking that out18:09
weshayhrm.. jobs timing out in tempest18:09
weshayha. .the queue that is18:11
EmilienMis there any patch which needs to land to stabilize the situation?18:16
EmilienMif yes, then we can reset the gate18:16
weshayEmilienM,  I don't have anything for you re: a common pattern.. seems to be just slow.. panda|ruck is getting our ara reports for the overcloud now18:17
weshayit's not 7.718:17
weshayit's not 1 cloud18:17
weshayand tomorrow is probably going to be worse18:18
weshayovb jobs are completing more quickly than containers-multinode18:19
weshaytripleo-ci-centos-7-ovb-1ctlr_1comp-featureset001SUCCESS in 2h 38m 05s18:19
EmilienMweshay: can you give me a link to a job that timed out?18:20
weshaytripleo-ci-centos-7-containers-multinodeSUCCESS in 3h 16m 57s18:20
EmilienMi'll look at it18:20
weshayEmilienM, your patch just timed out in tempest.. but others are here http://dashboard-ci.tripleo.org/d/YRJtmtNWk/cockpit?orgId=1&fullscreen&panelId=6118:20
EmilienMweshay: "my" patch? which one lol18:21
weshaynevermind.. urs is still going18:21
weshaytop patch in the queue18:21
weshaythat's going to merge18:21
EmilienMtripleo-ci-centos-7-containers-multinode took 2h38 2 w ago18:24
EmilienMnow it takes more than 3h18:24
EmilienMlet's take a look18:24
openstackgerritMerged openstack/tripleo-specs master: fix the spelling mistakes  https://review.opendev.org/68171718:26
EmilienMtripleo-container-image-prepare seems slower (at first look)18:27
EmilienMwhich can be caused by: more containers, more rpms, mirror issue18:27
* EmilienM digs deeper18:27
weshayya.. compared to stein.. major time diff http://zuul.openstack.org/builds?job_name=tripleo-ci-centos-7-containers-multinode-stein18:27
EmilienMi'm looking at stein jobs btw18:27
EmilienMnot even master18:27
weshayEmilienM, there are only about 10 rpms getting patched in a few jobs I looked at..18:28
EmilienMbtw, do we see rocky/jobs timing out? if yes: likely due to some infra issue (networking, mirror, etc)18:28
EmilienMrocky/queens sorry ^18:28
*** ricolin has quit IRC18:28
weshaythere are a few stein jobs timing out http://zuul.openstack.org/builds?job_name=tripleo-ci-centos-7-containers-multinode-stein&result=TIMED_OUT18:29
EmilienMi can't find it in cockpit18:29
EmilienMcan you figure out if rocky/queens has timeouts too?18:29
weshaythe branch is listed on the left18:29
* weshay looks18:29
EmilienMif no, then our issue is maybe in tripleo itself. Still digging18:30
*** holser has quit IRC18:30
*** Vorrtex has joined #tripleo18:30
weshaynothing since june on rocky http://zuul.openstack.org/builds?job_name=tripleo-ci-centos-7-containers-multinode-rocky&result=TIMED_OUT18:31
EmilienMok thanks18:31
EmilienMgive me a few, i'm digging now18:31
weshaynothing in sept. from queens http://zuul.openstack.org/builds?job_name=tripleo-ci-centos-7-containers-multinode-queens&result=TIMED_OUT18:31
EmilienMbtw i already see it being podman/buildah thing but I won't spoil18:32
weshaythis is good info http://dashboard-ci.tripleo.org/d/si1tipHZk/jobs-exploration?orgId=1&fullscreen&panelId=718:33
weshayEmilienM,  how can you tell?18:33
EmilienMcan you add the branch on that link?18:34
EmilienMweshay: how can I tell? containers are the root cause of many of our problems. If you need examples, let me find some ;-)18:34
weshay2019-09-10tripleo-ci-centos-7-containers-multinode-stein931 week87%18:35
weshay2019-09-10tripleo-ci-centos-7-containers-multinode-rocky931 week92%18:35
weshayso.. I'll try get these split by branch too.. but there are branchful jobs there to compare to18:35
weshaywould be nice to add the average run time18:35
*** brault has joined #tripleo18:42
weshayman.. is the bot on here?18:42
EmilienMI think I found it18:50
EmilienMand it's clearly during container image prepare18:50
*** xek_ has joined #tripleo18:54
EmilienMthe job used to take 2h10 1 month ago :D18:55
*** xek has quit IRC18:57
EmilienMlook on this one: https://review.opendev.org/#/c/676218/19:01
EmilienM2h12 :D19:01
* EmilienM continues to dig logs19:01
*** lucasagomes has quit IRC19:01
EmilienMI'm not sure why we spit all the facts in https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_42c/682276/1/gate/tripleo-ci-centos-7-containers-multinode-stein/42c709b/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz19:08
*** pierreprinetti has quit IRC19:09
*** pierreprinetti has joined #tripleo19:09
EmilienMi suspect https://review.opendev.org/#/c/676387/19:13
openstackgerritEmilien Macchi proposed openstack/tripleo-common master: Revert "Close the http sessions of registry on image prepare"  https://review.opendev.org/68271719:13
openstackgerritEmilien Macchi proposed openstack/tripleo-common master: Revert "Close the http sessions of registry on image prepare"  https://review.opendev.org/68271719:14
EmilienMjust testing now19:14
EmilienMI looked at Zuul and all tripleo-ci-centos-7-containers-multinode jobs increased by 30 min19:15
openstackgerritEmilien Macchi proposed openstack/tripleo-common stable/stein: Revert "Close the http sessions of registry on image prepare"  https://review.opendev.org/68271919:16
openstackgerritEmilien Macchi proposed openstack/tripleo-common stable/stein: Revert "Close the http sessions of registry on image prepare"  https://review.opendev.org/68271919:16
*** brault has quit IRC19:16
EmilienMso the patch that I'm reverting landed on Aug 21st19:19
EmilienMgo on https://review.opendev.org/#/q/status:merged+tripleo+age:3week19:19
EmilienMand look at a patch which landed before the 21st19:19
EmilienMtripleo-ci-centos-7-containers-multinode SUCCESS in 2h 15m 05s19:19
EmilienMnow let's look at let's say 24th19:20
EmilienMtripleo-ci-centos-7-containers-multinode SUCCESS in 2h 48m 09s19:20
EmilienMmwhahaha: ^19:20
EmilienMin the middle, we have the patch i'm reverting19:20
EmilienM(i confirmed with a bunch of other patches)19:20
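The before/after comparison above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool used in the channel: the helper names `parse_duration` and `format_duration` are hypothetical, and the two duration strings are the Zuul results quoted just above.

```python
import re

# Matches Zuul-style durations as they appear in this log, e.g. "2h 15m 05s".
_DURATION = re.compile(r"(?:(\d+)h)?\s*(?:(\d+)m)?\s*(?:(\d+)s)?")


def parse_duration(text):
    """Convert a string such as '2h 15m 05s' into a number of seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_duration(total_seconds):
    """Render a number of seconds back into the 'Xh YYm ZZs' form."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


# Run times quoted above for tripleo-ci-centos-7-containers-multinode.
before = parse_duration("2h 15m 05s")  # a patch that landed before Aug 21
after = parse_duration("2h 48m 09s")   # a patch that landed around Aug 24

slowdown = after - before
print(format_duration(slowdown))       # 0h 33m 04s
print(f"{slowdown / before:.0%}")      # 24%
```

The roughly 33 minute difference is in line with the "increased by 30 min" observation made earlier from the Zuul build history.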
mwhahahai'm uncertain but we can revert in the mean time19:21
EmilienMbecause we'll re-open the bug that he closed?19:21
*** brault has joined #tripleo19:26
EmilienMI think we need to reset the gate19:28
EmilienMwe have identified the problem19:28
EmilienMwe should not send any patch into gate from now19:28
EmilienMuntil we have a solution, revert or not revert (fix)19:29
EmilienMthat is merged19:29
EmilienM(and backported to stein)19:29
*** brault has quit IRC19:32
*** mmethot_ has quit IRC19:37
*** mmethot_ has joined #tripleo19:37
*** mmethot_ has quit IRC19:40
*** mmethot_ has joined #tripleo19:41
weshaycool w/ me19:41
EmilienMhttps://review.opendev.org/#/c/674919/ is suspect as well19:42
EmilienMweshay: I just read logs and zuul results for that job19:43
EmilienMand bisected19:43
EmilienMfirst i figured that it was fast early august and slow end of august19:43
EmilienMthen i looked at undercloud and figured out it was during container image prepare19:44
EmilienMthen i looked at code changes in tripleo-common, regarding container images around mid August19:44
EmilienMand found these 2 patches19:44
EmilienMSaravanan's patch landed before cloudnull's one19:46
EmilienMbut it took 2h30 when it landed, which isn't that bad19:47
EmilienMwhile kevin's patch took 2h5019:47
EmilienMI think that's kevin's patch in fact19:47
weshaysame patch is on stein right? re.. the close19:47
EmilienMyeah he backported it19:47
*** holser has joined #tripleo19:49
weshaywe landed this.. https://review.opendev.org/#/c/674919/ when we were getting all those ovh errors I think19:49
EmilienMand we didn't realize the perf issue19:49
EmilienMI have to hard stop in 15min or so, will be back later19:49
weshayEmilienM,  you going to kill the gate?19:51
EmilienMwe have to19:51
EmilienMlike I suggested19:51
EmilienMok let me do it...19:51
EmilienMI was rebasing the revert on master19:51
EmilienMthen I'll reset the gate19:51
EmilienMweshay: please send email19:51
EmilienMcore: no +A plz19:51
weshayseen a handful of rpm download failures today too19:52
EmilienMnot that harmful vs this timeout imho19:52
weshayEmilienM, totally.. timeouts are the worst19:52
weshayEmilienM, things had been merging on a fast clip .. pretty much the whole cycle until this hit19:54
EmilienMmy hope is cloudnull to be ready for the pull ups19:55
weshayha ha.. he mentioned pushups earlier19:55
* weshay posts on his twitter ;)19:55
EmilienMnot sure where he is now lol19:55
weshayEmilienM, but.. to be fair.. that ovh was killer, turned out not to be us/tripleo but some good debug came out of that19:56
openstackgerritEmilien Macchi proposed openstack/tripleo-common master: Revert "Log exceptions when checking status"  https://review.opendev.org/68272919:56
weshayin that patch did we add more retries?19:56
* weshay looks19:56
EmilienMok reverts proposed & rebased19:57
weshayEmilienM, thank you!19:58
weshayI don't see retries in that patch19:58
weshayjust logging19:58
EmilienM        adapter = HTTPAdapter(19:59
EmilienM            max_retries=8,19:59
EmilienM            pool_connections=24,19:59
EmilienM            pool_maxsize=24,19:59
EmilienM            pool_block=False19:59
EmilienM        )19:59
EmilienMpool_connections=24 might be too much19:59
EmilienMnot sure how he picked these numbers19:59
weshaykill the gate :)20:00
EmilienMthese params are useful for multi-thread but i'm unsure we use it correctly later20:00
EmilienMlet's try 520:02
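For readers unfamiliar with the snippet pasted above, here is a minimal sketch of how an `HTTPAdapter` with a smaller pool is typically attached to a `requests` session. It is not the tripleo-common code: the `build_session` helper and the example registry URL are hypothetical, and the values (8 retries, pool size 5) are simply the numbers mentioned in the chat.

```python
import requests
from requests.adapters import HTTPAdapter


def build_session(pool_size=5, max_retries=8):
    """Hypothetical helper: return a Session with a bounded connection pool."""
    session = requests.Session()
    adapter = HTTPAdapter(
        max_retries=max_retries,     # retries for failed connections
        pool_connections=pool_size,  # number of per-host pools to cache
        pool_maxsize=pool_size,      # connections kept alive in each pool
        pool_block=False,            # do not block when a pool is exhausted
    )
    # Every http:// and https:// request on this session now uses the adapter.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()
# The URL is a placeholder; get_adapter() only does a prefix lookup and
# does not open a connection.
adapter = session.get_adapter("https://registry.example.invalid/v2/")
print(adapter.max_retries.total)
```

With `pool_block=False`, requests beyond `pool_maxsize` still open extra connections that are discarded afterwards, so a lower pool size limits how many connections are reused rather than capping concurrency outright.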
EmilienMweshay: do we have LP?20:04
* weshay looks20:04
EmilienMweshay: potential fix ^20:04
* weshay opens20:05
openstackLaunchpad bug 1844446 in tripleo "multiple tripleo jobs timing out upstream causing gate resets in train" [Critical,Triaged]20:08
weshayEmilienM,  ^20:08
* cloudnull wonders what he missed 20:08
EmilienMok patch updated20:08
weshaycloudnull, no good deed goes unpunished20:09
EmilienMbut again unsure it's the exact root cause. The only thing I can tell is cloudnull's patch causing the timeouts20:09
weshaycloudnull, the patches you put up for ovh container debug may have degraded perf20:09
* cloudnull reading 20:09
cloudnullthe connection pool ?20:09
weshayEmilienM,  go kill the gate20:10
cloudnullis it only in one region again ?20:10
weshaybefore you have to go20:10
weshaycloudnull, no.. perf regression is all regions20:10
EmilienMyes I'm doing it now!20:10
weshayfirst gate kill of train20:10
EmilienMweshay: send email20:10
cloudnullEmilienM did https://review.opendev.org/#/c/682731/ fix the performance ?20:12
weshayhe just put it up.. we should see that job complete in roughly 2.5 hours20:12
weshaycloudnull,  we were about to send you a tweet.. PUSH UPS / PULL UPS20:12
EmilienMcloudnull: I don't know if it does20:14
EmilienMI literally spent 10 min investigating that patch so not sure20:14
EmilienMnow I need to take off for 1h20:14
EmilienMweshay: change topic please20:14
EmilienMno more +A20:14
EmilienMgate RED20:14
* weshay needs ops20:15
weshayops me cause I'm lame20:15
EmilienMI can't20:15
EmilienMmwhahaha please ops me and weshay when you come back20:15
EmilienMgate is clear20:15
* EmilienM afk 1h20:15
weshayEmilienM,  thank you sir!!20:16
weshayheh.. cloudnull check it out http://dashboard-ci.tripleo.org/d/YRJtmtNWk/cockpit?orgId=1&fullscreen&panelId=39820:17
weshaythanks mwhahaha20:18
*** weshay changes topic to "CI Status: RED ( no rechecks, wait for https://review.opendev.org/#/c/682729/ ) | community irc meeting Tues@1400 UTC - tripleo-ci-community meeting Tues@1330 UTC | https://docs.openstack.org/tripleo-docs/latest/"20:19
cloudnullweshay looking at some of the timeout logs, there are LOTS of retries, however, it seems like those bits are working, albeit super slow:20:26
cloudnullinterestingly enough the job that keeps timing out is the "multinode" job20:26
cloudnullthe rest seem to complete in ~2.5 hours20:27
*** holser has quit IRC20:29
*** panda|ruck is now known as panda|ruck|off20:29
weshaywell.. that's one of the few multinode jobs that deploy most of the services20:30
*** ansmith has quit IRC20:30
weshaycloudnull, ^20:30
weshaytripleo-ci-centos-7-scenario000-multinode-oooq-container-updatesSUCCESS in 2h 49m 47s20:31
weshaytripleo-ci-centos-7-scenario000-multinode-oooq-container-upgradesSUCCESS in 2h 35m 44s (non-voting)20:31
weshayare keystone only20:31
weshayand still takes that long .. lolz20:31
weshayeverything else is one node20:31
*** weshay is now known as weshay|ruck20:32
weshay|ruckrhel 8 standalone running in 1 hour 11min on that patch20:35
cloudnullweshay|ruck on https://review.opendev.org/#/c/682731/ ?20:40
weshay|ruckcloudnull,  https://review.opendev.org/#/c/682265/20:40
*** bdodd_ has joined #tripleo20:41
cloudnullyeah , everything but that job seems to complete in time20:41
weshay|ruckcloudnull, but it makes sense20:41
weshay|ruckthe others are either single node or only deploying a few services20:41
weshay|rucktripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001SUCCESS in 2h 56m 11s20:41
weshay|ruckis the same number of nodes, and running introspection20:42
weshay|ruckbut I think rdo has more horse power than some of the providers now20:42
weshay|ruckat least in our benchmarking20:42
weshay|ruckcloudnull,  make more sense now?20:43
cloudnullwhat was our average run-time of those jobs before the timeouts started ?20:45
cloudnullwere they always on the edge ?20:45
weshay|ruckwe can look at stein20:46
*** Goneri has quit IRC20:46
weshay|ruckcloudnull,  http://zuul.openstack.org/builds?job_name=tripleo-ci-centos-7-containers-multinode-stein&result=TIMED_OUT20:46
weshay|ruckcloudnull, was trying to get the same info20:47
weshay|ruckgoing to try and add it to http://dashboard-ci.tripleo.org/d/si1tipHZk/jobs-exploration?orgId=1&fullscreen&panelId=720:47
*** brault has joined #tripleo20:48
*** mmethot_ has quit IRC20:48
cloudnulllooks like that job in stein, when it completed, finished in ~9-11k seconds (2.5-3 hours).20:50
cloudnullin master it looks like the average is 10-12k seconds (2.7-3.5 hours)20:51
cloudnullwhen it succeeds20:51
weshay|ruckaye.. agree20:51
cloudnulldo we have profiling for the various components ?20:52
weshay|ruckcloudnull,  we didn't have that many timeouts.. just enough to kill the queue http://dashboard-ci.tripleo.org/d/YRJtmtNWk/cockpit?orgId=1&fullscreen&panelId=6120:52
cloudnullcan we compare where slowdowns are from stein to master?20:52
*** brault has quit IRC20:52
weshay|ruckcloudnull, we had ara for the overcloud.. but infra changed the log servers and blew it up20:52
weshay|ruckcloudnull, I tried to get the perf team's jobs upstream.. but they disappeared20:53
*** gfidente|afk has quit IRC20:53
*** pcaruana has quit IRC20:58
nkinderweshay|ruck, I got a bit further on the fs039 failure for https://review.opendev.org/#/c/68034520:59
*** gbarros has joined #tripleo20:59
nkinderweshay|ruck, it is getting certs now via certmonger for all of the OVN stuff, but the deploy is breaking elsewhere now20:59
nkinderweshay|ruck, I'm having trouble finding any actionable errors in the log, but my theory is that there is potentially a cert trust issue or something with the OVN components21:00
weshay|rucknkinder,  k21:00
weshay|ruckhad a chat w/ your guys about a standalone w/ an ipa container.. sounds like they thought it was a good idea21:01
nkinderweshay|ruck, do we grab the logs from the containers on overcloud nodes?  I don't see a /var/log/containers in zuul for any of the overcloud nodes21:01
weshay|ruckwe should have them.. although I saw a patch about that a few days ago21:01
* weshay|ruck looks21:01
nkinderweshay|ruck, here is the run I'm looking at - https://logs.rdoproject.org/45/680345/21/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039/2b3c233/logs/overcloud-controller-0/var/log/21:02
weshay|ruckya.. they are not there :(21:03
* weshay|ruck opens a bug21:03
weshay|ruckfun day21:03
nkinderweshay|ruck, I've been pitching the standalone idea for TLS with my team as well21:04
weshay|rucknkinder,  I see the logs upstream, just not third party21:05
weshay|rucknkinder, https://bugs.launchpad.net/tripleo/+bug/184445421:06
openstackLaunchpad bug 1844454 in tripleo "overcloud controller containers are missing from ci logs" [Critical,Triaged]21:06
* weshay|ruck looks at the collect code21:07
nkinderweshay|ruck, I do see /var/log/containers in the logs for the OC nodes here - https://logs.rdoproject.org/45/680345/21/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039/2b3c233/logs/quickstart_collect_logs.log21:07
*** slaweq has quit IRC21:10
*** slaweq has joined #tripleo21:11
weshay|rucknkinder, looks like it was just an error  .. I see container logs on previous runs http://logs.rdoproject.org/72/21672/2/check/tripleo-ci-centos-7-ovb-1ctlr_2comp_1supp-featureset039/478d25c/logs/overcloud-controller-0/var/log/containers/21:11
*** raildo has quit IRC21:12
nkinderweshay|ruck, I can try a recheck21:12
*** gbarros has quit IRC21:12
*** mmethot has quit IRC21:13
*** mmethot has joined #tripleo21:14
*** ansmith has joined #tripleo21:15
*** slaweq has quit IRC21:16
*** rfolco|dentist is now known as rfolco21:23
*** gbarros has joined #tripleo21:23
*** ade_lee_ has quit IRC21:28
weshay|ruckcloudnull, https://snapshot.raintank.io/dashboard/snapshot/iZgzUjP3T29q5bhGrYlzyLSNEY4yeXVf21:39
cloudnulllooks like the data starts sep 721:47
weshay|ruckEmilienM, cloudnull, sshnaidm|pto https://review.rdoproject.org/r/#/c/22339/21:47
weshay|ruckcloudnull, dev environment21:47
sshnaidm|ptoweshay|ruck, total time is more important to know where we waste resources, not time of one job21:49
weshay|rucksshnaidm|pto,  ok.. you are supposed to be gone :) but I'll update it so we have total time too21:49
weshay|rucksshnaidm|pto, ok?21:50
sshnaidm|ptoweshay|ruck, let's add one mean or median, no need to have both or max21:50
weshay|rucksshnaidm|pto, throw me a bone :)21:50
*** gbarros has quit IRC21:51
* weshay|ruck adds total time21:51
weshay|ruckfor the record, sshnaidm|pto  never likes my ideas21:51
sshnaidm|ptowhy? I like them!21:51
weshay|ruckthey like me.. they really really like me21:52
weshay|rucksshnaidm|pto,  there is a moving average, but I couldn't get it to work21:53
EmilienMsometimes I wonder if we should build our own docker registry and host it22:01
EmilienMrather than relying on docker.io22:02
EmilienMmany issues around docker.io we have seen are worked around with retries, which of course makes deployments longer22:02
cloudnullEmilienM couldn't we use redhat.registery.io ?22:04
cloudnullI did get infra to create a proxy for us, so that should be available to us ?22:05
EmilienMwe would need push on a namespace22:06
EmilienMand ensure they have enough storage for us22:06
EmilienMwe have a ton of data22:06
cloudnullsomething like https://registry.access.redhat.com/tripleo/22:07
* cloudnull doesn't think that exists 22:07
cloudnullbut I guess it could ?22:07
EmilienMI'm clearing the check queue as well for tripleo jobs22:10
cloudnullEmilienM I wonder if we could get the folks that run https://registry.access.redhat.com to mirror https://hub.docker.com/u/tripleomaster/ ?22:11
EmilienMmwhahaha: do you remember how the proxy thing works22:18
EmilienMwhen we look at https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_42c/682276/1/gate/tripleo-ci-centos-7-containers-multinode-stein/42c709b/logs/undercloud/home/zuul/containers-prepare-parameter.yaml.txt.gz22:18
EmilienMit doesn't seem we're using any proxy22:19
mwhahahait uses the python proxy22:19
EmilienMI remember with docker it was in the docker registry config22:19
EmilienMright, but are we doing it really?22:19
mwhahahait also uses the one out of the podman settings22:19
mwhahaha(i think)22:19
EmilienMif you look at logs:22:19
mwhahahagimme a few and i'll try and find it22:19
EmilienMyou can see a bunch of docker.io22:19
mwhahahait's using the proxy22:20
mwhahahathe docker.io shit is for the auth stuff22:20
mwhahahato get a token22:20
mwhahahathat's not proxiable22:20
EmilienMi wonder if our problem is a mix between the fact we now close http sessions properly AND the retries AND the pools22:21
mwhahahawhen we close the sessions we have to reauth22:28
mwhahahaso we'd go back to docker.io more22:28
cloudnullI think you may be on the right track with lowering the max pool size.22:33
cloudnullgiven the use of futures it just may all be too much22:33
cloudnullwe have a mix of multi-processing functions and not, with a mix of requests connections and sessions.22:35
mwhahahaso we should probably revert the close22:35
mwhahahaand lower the counts22:35
cloudnull+1 sounds sensible to me22:35
mwhahahain theory that'll reduce the docker.io call22:36
EmilienMmwhahaha, cloudnull : do you want me to push over https://review.opendev.org/#/c/682731/2/tripleo_common/image/image_uploader.py to use the defaults from requests or wait for the (running) CI job to end?22:36
mwhahahai'd just leave it for now22:37
EmilienMthe close revert is already proposed: https://review.opendev.org/#/c/682717/22:37
cloudnullThe default is 10, however I think 4 might be a better option22:37
EmilienMi've put 5, let's see22:37
* cloudnull likes powers of 2 22:38
* cloudnull does more pushups 22:38
EmilienMI wonder if the new centos7 changed the version of pacemaker so we could run with podman?22:52
*** bfournie has joined #tripleo22:52
mwhahahai think it's a major rev22:56
mwhahahain 822:56
