Wednesday, 2020-07-08

*** ryohayakawa has joined #opendev00:06
openstackgerritMerged opendev/system-config master: Don't install the track-upstream cron on review-test  https://review.opendev.org/73984000:26
gouthamrfungi: o/ sorry - i think it's been an issue since last week, https://zuul.opendev.org/t/openstack/builds?job_name=manila-tempest-plugin-lvm00:29
gouthamri now see only rax-ord failures, so maybe i am confused, let me double check and compile some findings00:32
ianwhttps://zuul.opendev.org/t/openstack/build/60671a686fec409e9f9226707ab03916/log/job-output.txt seems to be a dfw failure00:32
gouthamrianw: ack, just looked at the last ten runs and they've failed 5 times in rax-iad, 2 times in rax-ord and once in rax-dfw, and two other failures are test failures that occur sporadically; this job was quite reliable prior to this00:42
gouthamra reboot event in the logs looks like this: https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-n-api.txt#338200:45
ianwgouthamr: tbh i wouldn't think it was rax specific; we really haven't changed any settings00:47
gouthamrianw: ty, good to know - i'll check if i can twiddle some node and test settings00:48
ianw ansible_memfree_mb: 774700:48
ianwthat's about where i'd expect it00:48
gouthamr^^ i was looking for something to indicate that, thanks ianw :)00:49
ianwhttps://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/zuul-info/host-info.controller.yaml00:49
gouthamrdoes that get collected prior to devstack though?00:49
ianwduring devstack you've got00:50
ianwhttps://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-dstat.txt00:50
ianwright before the "reboot" there you've got00:52
ianw4608k 8186M00:52
ianwso it doesn't seem the host is under memory pressure at that point00:52
gouthamrianw: it does, no? (used  free  buff  cach) 5755M  126M  148M 1954M00:56
ianwthat's still ~2gb in cache there, the number before was swap so it's not heavily into swap00:57
ianwi wouldn't say it's *not* OOM, but it doesn't feel like a smoking gun at that point of the job00:58
gouthamrianw: ah, thanks; it's consistently the same set of tests failing, i'll check if there's something weird we're doing..01:00
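
(Confirming or ruling out the OOM killer on a node like this usually comes down to the kernel log; a minimal sketch, with log paths assumed for the Ubuntu-based test nodes rather than taken from this job:)

    # did the kernel OOM killer fire at any point?
    dmesg -T | grep -iE 'out of memory|oom-kill'
    # same check against the persisted logs; paths are Debian/Ubuntu defaults
    sudo grep -i oom /var/log/kern.log /var/log/syslog
    # current memory and swap usage, to compare against the dstat columns above
    free -m
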
*** dtroyer has quit IRC01:03
openstackgerritMerged opendev/system-config master: gitea: install proxy  https://review.opendev.org/73986501:08
openstackgerritIan Wienand proposed opendev/base-jobs master: promote-deployment: use upload-afs-synchronize  https://review.opendev.org/73987602:30
ianwchange to deploy the reverse proxy is queued now02:43
*** dtroyer has joined #opendev02:58
ianwhrm, did i not open port 3081? ...03:27
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add host keys on bridge  https://review.opendev.org/73941403:46
openstackgerritIan Wienand proposed opendev/system-config master: gitea: open port 3081  https://review.opendev.org/73988403:55
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add host keys on bridge  https://review.opendev.org/73941403:57
openstackgerritIan Wienand proposed opendev/system-config master: gita-lb : update listeners to reverse proxy  https://review.opendev.org/73988904:07
ianwmordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/console feels like a real problem with 726263/04:09
ianwraise Exception("you need a C compiler to build uWSGI")04:10
ianwi wonder if there's a new uwsgi release and it's not a wheel?04:10
ianw... no that's not it04:10
ianwit almost looks like a race condition, like uwsgi was building before dpkg finished04:13
*** ykarel|away is now known as ykarel04:14
mordredWeird04:14
mordredI'll investigate in the morning if you haven't figured it out yet04:15
openstackgerritIan Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables  https://review.opendev.org/73989205:03
*** raukadah is now known as chandankumar05:09
openstackgerritIan Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script  https://review.opendev.org/73941205:10
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add host keys on bridge  https://review.opendev.org/73941405:10
*** marios has joined #opendev05:17
*** ryo_hayakawa has joined #opendev05:19
*** ryo_hayakawa has quit IRC05:19
*** ryo_hayakawa has joined #opendev05:19
*** tobiash has joined #opendev05:20
*** ryohayakawa has quit IRC05:21
*** DSpider has joined #opendev05:54
openstackgerritMerged opendev/system-config master: gitea: open port 3081  https://review.opendev.org/73988405:57
ianwclarkb / fungi : ^ i've run out of time to put that into production -- you should be able to manually choose a host to redirect on gitea-lb for initial testing, then https://review.opendev.org/739889 would redirect all06:38
*** ianw is now known as ianw_pto06:38
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add host keys on bridge  https://review.opendev.org/73941406:51
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add host keys on bridge  https://review.opendev.org/73941407:17
openstackgerritIan Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables  https://review.opendev.org/73989207:21
*** tosky has joined #opendev07:33
*** fressi has joined #opendev07:34
*** zbr has joined #opendev07:35
*** hashar has joined #opendev07:39
openstackgerritMerged zuul/zuul-jobs master: fetch-coverage-output: direct link to coverage data  https://review.opendev.org/73909907:59
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:02
*** dtantsur|afk is now known as dtantsur08:10
*** AJaeger has quit IRC08:13
*** jaicaa_ has quit IRC08:14
*** tkajinam has quit IRC08:23
*** bolg has joined #opendev08:27
*** zbr7 has joined #opendev08:30
*** diablo_rojo has quit IRC08:41
*** jaicaa has joined #opendev09:00
*** AJaeger has joined #opendev09:00
openstackgerritIan Wienand proposed opendev/system-config master: Add host keys on bridge  https://review.opendev.org/73941409:28
ianw_ptomordred: ^ might be of interest to you as it ties in with launching nodes.  with the small stack of ^, we won't forget to add host keys when creating new servers, and also keep host keys committed, which seems good too09:30
ianw_ptohttps://review.opendev.org/739412 adds the host keys for all our existing servers09:30
*** ryo_hayakawa has quit IRC09:33
*** rpittau has joined #opendev09:53
openstackgerritMerged openstack/diskimage-builder master: Fix DIB scripts python version  https://review.opendev.org/73984410:06
*** bhagyashris is now known as bhagyashris|brb10:42
*** bhagyashris|brb is now known as bhagyashris10:59
*** ysandeep is now known as ysandeep|brb11:11
*** ysandeep|brb is now known as ysandeep11:25
*** hashar is now known as hasharNap12:02
*** xiaolin has joined #opendev12:19
*** ysandeep is now known as ysandeep|mtg12:29
*** roman_g has joined #opendev12:29
roman_gHello team. Could someone look at the NODE_FAILUREs we are experiencing with 16GB and 32GB nodes, please? Example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test12:32
roman_gAnd here is another example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-gate-test12:34
*** hasharNap is now known as hashar12:57
*** rpittau has quit IRC12:57
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997712:58
*** rpittau has joined #opendev13:00
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997713:01
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997713:07
*** fressi has quit IRC13:08
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997713:09
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997713:14
AJaegerroman_g: one of our clouds is down - see https://review.opendev.org/73960513:16
*** lpetrut_ has joined #opendev13:21
roman_gAJaeger all right, thank you. I'm turning off voting on our jobs, as we still get some successful runs. Where could I keep an eye out for the cloud coming back up again? Mailing list?13:27
*** ysandeep|mtg is now known as ysandeep13:30
AJaegerroman_g: I don't think we announce this normally, let's see whether fungi has more info ^13:33
openstackgerritThierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job  https://review.opendev.org/73822813:40
openstackgerritThierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job  https://review.opendev.org/73822813:42
AJaegerinfra-root, if the change above by ttx merges, the closing of open github issues can be stopped - do we have a change for that open? Or can anybody do one, please?13:48
mordredAJaeger: I've got a phone call in about 10 minutes - but I can work on a change for that after the call13:50
ttxAJaeger: fungi told me it's already ripped out, just needs a comment on the running server (won't be automatically put back)13:50
mordredyeah - I agree13:51
mordredit's no longer there13:51
*** ysandeep is now known as ysandeep|afk13:52
ttxAlso does not hurt to have some overlap. I'm pretty sure the script will fail for one reason or another13:52
AJaegerttx: overlap is fine - I wanted to avoid we forget to switch it off ;)13:52
*** rpittau_ has joined #opendev13:56
*** rpittau has quit IRC13:58
*** ysandeep|afk is now known as ysandeep14:00
*** bhagyashris is now known as bhagyashris|afk14:07
*** bhagyashris|afk is now known as bhagyashris14:17
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997714:22
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997714:25
*** lpetrut_ has quit IRC14:28
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host  https://review.opendev.org/73997714:35
*** fressi has joined #opendev14:36
fungiAJaeger: roman_g: sorry, delayed start today (ironically, dealing with hvac contractors at the house)... the openedge cloud which was providing those nonstandard node flavors is offline, yes. we make sure to have plenty of redundant sources for our standard 8gb flavor, but finding more donors to supply those 16gb and 32gb flavors would be necessary for better availability. no eta on openedge returning, could14:37
fungibe as long as october14:37
fungi(the reason my dealing with hvac contractors is ironic is that openedge is down because the air handler for that privately-run environment gave up the ghost)14:38
openstackgerritTobias Henkel proposed zuul/zuul-jobs master: DNM: Add unified synchronize-repos role that works with linux and windows  https://review.opendev.org/74000514:39
*** dmsimard3 has joined #opendev14:41
*** dmsimard has quit IRC14:41
*** dmsimard3 is now known as dmsimard14:41
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman  https://review.opendev.org/74000614:41
*** fressi has quit IRC14:46
*** priteau has joined #opendev14:48
*** mlavalle has joined #opendev14:50
*** fressi has joined #opendev14:51
*** ysandeep is now known as ysandeep|away14:55
*** fressi has quit IRC15:00
AJaegerfungi, could you review 738228 , please?15:02
*** ykarel is now known as ykarel|away15:04
*** priteau has quit IRC15:04
openstackgerritSorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman  https://review.opendev.org/74000615:11
fungiAJaeger: you bet15:13
roman_gfungi all right, thanks.15:13
mordredfungi: the other provider of those right now is city, yes?15:14
roman_gAnd kna15:14
fungikna is the region in citycloud, yes15:16
roman_gOh, yes15:17
roman_g>> as long as october - anything I can help with?15:18
mordredroman_g: I think for openedge it's mostly about getting his HVAC system replaced, so I don't think there's much we can do to help - unless someone wants to buy him a new HVAC :)15:20
roman_gAll right =)15:20
AJaegerso, if we have two more regions for those nodes - should roman_g really see NODE_FAILURES?15:21
mordredwe might want to search out ways to get large flavors from other providers ... perhaps there's a source of funding somewhere that could be given to vexxhost or one of the other clouds to provide some. not 100% sure what's the best path there15:21
roman_gStill seeing NODE_FAILUREs, yes.15:21
roman_gProbably citycloud can't schedule that many large instances.15:21
roman_gAny logs I can check in this regard myself?15:22
roman_gSo that I could have something in hand to go with it to Ericsson/CityCloud.15:22
corvusnode_failure should never be seen; it means an unexpected error occurred15:23
corvusi can look into that in a few minutes15:24
fungiyeah, i want to say the last time this came up, citycloud has available quota for the instances but they can't place them due to unavailability of suitable hypervisor host capacity (something like that)15:24
roman_gfungi, yep, I remember15:24
roman_gcorvus thank you15:24
fungiand yeah, if that's (still) the case then our nodepool launcher logs should indicate nova api errors15:25
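
(A sketch of how those launcher logs would typically be checked; the log path and the exact error string are assumptions, and the node request id is a placeholder:)

    # on the nodepool launcher host
    grep 'Detailed node error' /var/log/nodepool/launcher-debug.log | tail -n 20
    # narrow down to a single node request once you have its id
    grep '<node-request-id>' /var/log/nodepool/launcher-debug.log
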
*** Eighth_Doctor is now known as Conan_Kudo15:28
*** Conan_Kudo is now known as Eighth_Doctor15:28
smcginnisFolks probably have seen this, but Gerrit has been donated to the new Open Usage Commons.15:30
smcginnishttps://openusage.org/news/introducing-the-open-usage-commons/15:30
corvuszuul scheduler is out of disk space15:34
mordredoh look - wendar is on that board15:34
mordredcorvus: :(15:34
corvuslooks like a spike at 6:2515:34
mordredcorvus: 23G in /var/log/zuul - current debug log is 5G15:36
corvusmordred: did it get rotated successfully?15:38
mordredcorvus: yes15:38
mordredcorvus: -rw-rw-rw- 1 zuuld zuuld  906501839 Jul  8 06:25 debug.log.1.gz15:38
mordred-rw-rw-rw- 1 zuuld zuuld 5386439154 Jul  8 15:38 debug.log15:38
corvus/root/.bup is very large; 25G ?15:39
mordredhrm. are we missing an exclusion?15:40
corvusmordred: the logs are on their own partition, so we can discount them15:41
mordrednod15:41
mordredcorvus: /var/log is not in bup-excludes15:41
corvusseems like the 25G of bup out of 39G is a big fish15:42
mordredyeah15:42
fungiwe recently removed a huge backup of old zuul from ~root there, right? maybe bup was tracking that previously?15:43
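
(The numbers being quoted here come from ordinary disk-usage checks; roughly the following, with nothing assumed beyond the paths already mentioned in the discussion:)

    df -h /
    sudo du -sh /var/log/zuul /root/.bup
    ls -lh /var/log/zuul/debug.log*
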
fungi#status log commented out old githubclosepull cronjob for github user on review.o.o15:48
openstackstatusfungi: finished logging15:48
corvusi'm confused; "bup ls" shows nothing15:50
fungi`bup ls` also returns nothing on the review server15:51
corvusgit log returns fatal: your current branch 'master' does not have any commits yet15:52
fungisame on review.o.o, so maybe that's normal?15:53
corvus'git log root' also doesn't work15:53
corvusfungi: or none of our backups work at all.  it's probably been 5 years since we did a restore test?15:53
fungiyeah, and `git ls-files` is similarly empty15:53
fungitrying to remember, i had to restore something fairly recently... what was it15:53
fungioh, was a config backup on lists.o.o15:54
fungichecking15:54
corvusalso, the backup server is close to full15:54
corvus'git log root' on the backup server works15:55
corvusi have no idea what the .bup directory on the client is for15:55
fungiand yeah, same observed behaviors on lists.o.o, so either this has broken in the last few months (since i did a restore there successfully) or it's how bup works15:56
mordredcorvus: ~/.bup/index-cache is where all of the data is in that15:57
corvusi'm not sure if we can prune that and still have it work, or if we'd need to rotate the remote backup too15:58
openstackgerritjacky06 proposed openstack/diskimage-builder master: Remove glance-registry  https://review.opendev.org/73979615:58
*** hashar has quit IRC15:58
fungiaha, `bup ls -al`15:59
fungimanpage ftw15:59
corvusthat's not making anything more clear :(15:59
fungistrangely it contains empty .commit and .tag dirs15:59
fungiyeah16:00
fungiand nothing else?16:00
mordredI agree - and also don't know what that means16:01
*** dtantsur is now known as dtantsur|afk16:01
corvusi'm inclined to just blow away the client side of this, and see if backups still work16:02
corvusand probably it's also time to rotate the server side anyway, so if they don't work that would fix it16:02
mordredcorvus: the client side of this doesn't seem to be providing any value16:02
corvusmordred: well, i don't know what it does16:02
corvusit may be part of how it deduplicates stuff16:03
corvusor it may be nothing16:03
mordrednod16:03
fungi`bup index -p` provides the file index16:03
fungiand seems to work16:03
*** diablo_rojo has joined #opendev16:04
corvusfungi: that produces no output for me on zuul0116:04
fungiwell, works on lists.o.o16:04
fungibut yeah, empty for me on zuul.o.o16:04
fungiahh, empty on review.o.o too16:05
corvuswe don't usually use index, we just do split and join, right?16:05
fungiso i think it only prints paths because `bup index` was manually run there previously16:05
corvusyeah, that makes sense16:05
fungiand yes, our docs rely on bup join for restoration16:05
corvusso given our typical split/join usage of bup, i still don't know what the client side .bup dir is for16:05
fungiand it's how i managed to restore successfully on lists.o.o previously16:05
*** marios is now known as marios|out16:06
fungii concur, it doesn't seem like it's providing anything16:06
fungii wonder if it's providing some local cache to speed up incremental backup16:06
openstackgerritMerged openstack/project-config master: Define maintain-github-openstack-mirror job  https://review.opendev.org/73822816:07
corvushttps://groups.google.com/forum/#!topic/bup-list/EP19diB3ADk16:10
corvusapparently we're not supposed to notice it, and we can delete it16:11
corvusfungi, mordred: i think i should rm -rf ~/.bup on zuul01 and then run bup init16:12
mordredI agree16:13
mordred++16:13
fungicorvus: we've failed to not notice it i suppose. i concur with this plan16:13
corvus#status log removed zuul01:/root/.bup and ran 'bup init' to clear the client-side bup cache which was very large (25G)16:14
openstackstatuscorvus: finished logging16:14
mordredcorvus: there is much more spaces now16:14
corvusso many spaces16:15
fungiall the spaces16:15
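
(For context, the split/join style of bup usage described above looks roughly like the following; the backup server name is a placeholder and this is a sketch rather than the exact production backup script, with the excludes file being the one referenced earlier:)

    # one-time client setup (creates ~/.bup, which also holds the index-cache discussed above)
    bup init
    # backup: stream a tar of the filesystem to the remote repository
    tar -X /etc/bup-excludes -cPf - / | bup split -r backup01.example.opendev.org: -n root
    # restore: reassemble the saved stream and unpack it
    bup join -r backup01.example.opendev.org: root | tar -xPf -
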
fungido we need to restart the scheduler, or is it not impacted16:15
corvusi think the only impact would be failing to generate a key for a new project; i think we can just restart if someone complains16:16
corvusthe time database may be out of date, but that should recover without a restart16:19
*** roman_g has quit IRC16:22
*** gouthamr has quit IRC16:26
corvus2020-07-08 14:39:59,163 ERROR nodepool.NodeLauncher: [e: 7d1d1ee806274202b8fba3d14450c712] [node_request: 300-0009826852] [node: 0017913446] Detailed node error: No valid host was found. There are not enough hosts available.16:27
corvusfungi: ^ that's what you were expecting?16:27
fungiyep, that's what we were getting from them in the past16:28
corvusroman_g has left16:29
fungii believe the explanation was they either lacked actual capacity on the compute nodes, or lacked sufficient capacity to place instances of the requested flavor on any16:29
corvusanyway, i think that's it for data collection; i'm not sure what we should do about that16:29
corvusother that perhaps remove the node types if we don't have any clouds that can reliably provide them16:30
fungii think it's always been that way, which is why openedge started providing capacity for that size instances so at least they could be filled from somewhere16:30
corvusif we take a high level step back: i think our choices are: 1) find a way to reliably provide the nodes; 2) don't offer the nodes; 3) keep retry-bashing the one provider that does offer them and mask the high failure rate from end users16:32
corvusexactly how we achieve either of those is tactics, but i think those are the strategies16:32
openstackgerritMerged openstack/project-config master: Add nested-virt-centos-8 label  https://review.opendev.org/73816116:33
fungiany indication as to whether we are actually ever getting any of them at all in citycloud?16:33
corvusiiuc, this link from roman_g earlier suggests "yes sometimes" https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test16:33
corvus49% of the time16:34
fungioh, okay. i suspect that artificially lowering the quota won't help, it's probably a binpacking problem on the compute nodes16:39
fungiin theory there's sufficient resources, but finding a single compute node with the necessary capacity fails on placement16:40
mordrednot enough hosts available seems like a form of a quota issue16:40
mordredlike - from an end-user perspective - except with a worse error message16:40
mordredbecause it's not "this request is impossible" it's "this request is impossible right now"16:41
mordredI think that's me arguing for a form of 316:41
fungiwell, if they're running at 50% utilization in the host aggregate but evenly distributing the load across all their compute nodes and the nodes themselves can only handle up to 64gb ram, then they could be way underutilized but still unable to find a single compute node which can host a 32gb vm16:41
mordredtotally16:42
mordredwhich is why it isn't Actually a quota issue nor would it necessarily be fixed by adjusting quota16:42
fungianyway, no clue if that's the exact problem, lots of ifs in there, but we've reported it to them previously16:42
mordredbut as an API consumer it's similar - a valid request for resources is rejected by the cloud16:42
mordredand that same request, retried later, might work16:43
fungii suspect the 16gb requests are succeeding more often than the 32gb requests16:43
mordredI'd also suspect that16:43
*** chandankumar has quit IRC16:44
*** ykarel|away has quit IRC16:46
corvusif we want to retry-bash (3), then what tactic should we follow?16:46
corvuswe currently try to predict whether a request will succeed, only try it if we think it will, and if it fails 3 times, give up16:47
corvuswe can't predict this16:47
corvusdo we want to special case the error message: Detailed node error: No valid host was found. There are not enough hosts available.16:47
corvusand discount that so it never counts against the retry attempts?16:47
*** hashar has joined #opendev16:47
corvusand are we sure that's never a permanent error?16:48
corvus(like, what if we get that error for some other reason (data center loses power); are we okay sitting in a retry loop for that?)16:48
fungii have a feeling there are cases where it is an effectively permanent error16:50
fungilike provider has a flavor which can never actually be placed on any of their infrastructure16:50
fungii suppose the question is whether node_failure and a manual recheck is preferable to having builds and their respective changes sit indefinitely in the queue16:51
fungias roman_g speaks for the primary consumer of this flavor, i'm inclined to leave it this way until we can get his feedback on it16:53
fungithough more generally, i don't really know if it's better to have changes stuck in queue effectively forever, or to emit node_failure results when clouds are having placement issues16:54
corvusthere is a launch-retries option, so we could bump it up to 1000 attempts or something, but unfortunately, that just means we're going to sit there trying the same api request over and over until it succeeds16:57
corvusgiven some of our jobs take hours, we might have to bump it up to 10,000 retry attempts to mask it.16:58
corvusit's possible our providers may notice that.16:58
*** fressi has joined #opendev17:00
corvusmaybe we could do something where if we detect that error, we say: (a) don't count this as a retry attempt; and (b) don't retry this until another server has been deleted17:00
corvusbasically, don't count it as a retry and pause the provider17:00
corvusi think i could see that working and be sustainable.  it may not be a simple change to nodepool though.17:00
mordredyeah17:01
fungiprovider pausing does sound interesting17:05
fungithough i guess it would be a label-specific pause?17:05
fungiin situations like this the provider may have no trouble placing 8gb requests but still fail a majority of 32gb requests17:06
*** chandankumar has joined #opendev17:11
*** marios|out has quit IRC17:12
corvusthat would be counter-productive for us though, so i think we would want to completely pause and wait for it to drain17:13
corvus(this is generally how nodepool providers work -- one request at a time)17:14
corvusand of course, we would only want to do this if we knew we were the last remaining provider that could handle that request17:14
*** rh-jelabarre has quit IRC17:17
*** roman_g has joined #opendev17:28
roman_gfungi corvus I read the logs. Thanks for the analysis. I've gotten rid of the 32GB jobs, and we only have 16GB jobs left as of now (waiting for Zuul to merge the PS with WF +1, and other PSes need to be rebased on the future master).17:37
*** hashar has quit IRC17:43
corvusroman_g: ok.  that seems likely to reduce the number of incidents; will be interesting to confirm17:59
roman_gWe impose quite a load on nodepool in the airshipctl project. For example, only today 28 patch sets have been updated, many of them more than once. We used 6x 8GB, 2x 32GB, 2x 16GB. Now we are down to 6x 8GB & 1x 16GB. I think I can also squash 2 of the 8GB jobs into one.18:04
*** sshnaidm|ruck is now known as sshnaidm|afk18:25
mordredhttps://www.perl.com/article/announcing-perl-7/18:34
*** hashar has joined #opendev18:36
fungibut i had only just gotten used to perl 5?18:43
*** gouthamr has joined #opendev18:53
*** rpittau_ has quit IRC19:01
*** roman_g has quit IRC19:07
mordredfungi: perl7 is apparently perl519:13
mordred(but with a few of the common pragmas turned on by default)19:13
mordredpoor perl 619:13
mordredcorvus: I've got to AFK for a bit - but it looks like https://review.opendev.org/#/c/726263/ has hit the nefarious "incorrect arch" issue19:14
fungiperl6 was a fun experiment, i suppose19:14
mordredcorvus: you can see it in the console error log with the amd64 apt repos being used in the arm64 builder19:14
*** rh-jelabarre has joined #opendev19:15
mordred(the symptom which causes it to fail is going to be that we build wheels for one arch in the builder image and then copy them to the wrong arch causing pip to not find wheels and want to build them - except it's the final image so there's no compiler - but the root cause is going to be the arch mismatch)19:16
noonedeadpunkhey everyone... need some help with zuul multinode jobs. For some reason I don't get ovs ip pingable here https://zuul.opendev.org/t/vexxhost/build/ebd87dde2ffb4bf2835c21e53e5be32e/log/job-output.txt#73019:17
*** gmann_ has joined #opendev19:17
*** gmann_ is now known as gmann19:18
funginoonedeadpunk: what's the overlay there? vxlan or gre?19:24
fungii think we've generally defaulted to vxlan in opendev because a number of our providers lack the conntrack plugin to allow gre to cross between compute nodes19:25
corvusmordred: i think i need an eli5 for that19:26
noonedeadpunkfungi: vxlan I guess - it's default one19:27
noonedeadpunkyeah, previous job has ip a output. https://zuul.opendev.org/t/vexxhost/build/28580dda752044e695992546e11171e3/log/job-output.txt#76519:28
fungiahh, okay19:28
corvusmordred: i'm having particular trouble with "copy them to the wrong arch"19:28
mordredcorvus: it's possible I said too many words19:29
mordredcorvus: look at https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1994-2000 for the actual issue19:29
mordredthe other words were a description of why the c compiler errors are misleading19:29
mordredbut see on 1994 we're in linux/arm64 and then on line 2000 we're pulling from buster amd6419:29
*** donnyd has joined #opendev19:31
mordredcorvus: I don't have the whole picture - but the mismatch there matches the reported behavior from ianw from a few weeks ago19:31
corvusmordred: the most recent from line is "FROM docker.io/opendevorg/python-base:${PYTHON_VERSION}" are you saying that pulled the wrong image?19:31
*** aannuusshhkkaa has joined #opendev19:32
corvusmordred: (i'm assuming "install-from-bindep" doesn't have any arch stuff on it, it's just saying "apt-get update", so an arch mismatch like that must be because the current image is wrong, yeah?)19:32
mordredcorvus: that's the hypothesis - that somehow that's pulling the incorrect arch19:32
*** mnaser has joined #opendev19:32
mordredwe have yet to duplicate that behavior though19:32
mordredcorvus: yeah.19:32
mordredcorvus: I wonder if there is an issue at play with builder images here19:33
mordredsince a multi-stage build dockerfile is a more complicated construct19:33
mordredbut - honestly - since we've never seen this mistake from buildx in our other testing - I honestly have no clue19:34
corvusthis output is exceedingly difficult to read with exceeding difficulty19:35
corvusit looks like it's constantly reprinting entire stanzas just to make incremental changes19:35
mordredooh!19:36
mordredcorvus: look at this:19:36
mordredcorvus: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1484-148519:36
mordredthat's in the builder image stage19:36
corvusoh good find, i was focused on the python-base image19:36
mordredwhich is ostensibly an amd64 build which is deciding to install arm64 packages19:37
mordredcorvus: and then the arm builder is also building arm: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1492-149319:37
corvusmordred: so we've seen it crossed in both directions?  arm->amd and amd->arm?19:38
mordredso - it seems like docker is using an arm version of python-builder for both arm and amd versions of the build stage - and then it's using the amd version of python-base in the second stage for both arm and amd19:39
mordredyup19:39
*** rpittau has joined #opendev19:40
corvusmordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1847-186119:41
corvusmordred: that series appears to show the same shas being used for both arches19:41
corvusmordred: moreover https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1087-1093 (and the items above it) appears to show only one download for each image19:42
corvusall on arm6419:42
mordredcorvus: I agree19:42
mordredcorvus: https://zuul.opendev.org/t/openstack/build/700f3fa8cb5a49f1b75df928e61004e7/log/job-output.txt#1073 is from a working build that also shows same shas19:42
funginoonedeadpunk: so the failure there is in the frrouting role you've added, i guess?19:43
noonedeadpunkyep... so I'm trying to add bgp, but need internal network for that...19:43
*** jentoio has joined #opendev19:43
corvusmordred: i'm a little confused about how, if we downloaded only the arm images, we ended up with amd python-base19:43
funginoonedeadpunk: oh, i see, it's tasks after frrouting19:44
mordredcorvus: yah19:44
mordredcorvus: well - are we sure we only downloaded arm for each?  we seem to certainly have only downloaded one arch for each19:45
mordredbut maybe what looks like arm downloading is actually downloading an amd?19:45
corvusmordred: not at all; i assumed that the arm64 thingy would download the arm64 arch19:45
mordred(because it certainly seems like we have amd python-base when we use it)19:46
noonedeadpunkfungi: when I launch 2 instances in our public cloud (which is added for CI as well), it works nicely19:47
noonedeadpunkSo maybe there're some extra variables set in environment that makes things work differently...19:48
*** jrosser has joined #opendev19:48
corvusmordred: i'm going back to the python-base build for this: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#1438-144719:48
corvusmordred: 5f98 is the python-base we got19:48
corvusmordred: the way i read that is that we build one arch (5f98), then another arch (eebf), then the manifest list to bind them together (e4bd)19:48
corvusmordred: if that's right, then 5f98 indicates which arch we got (ie, it's not merely the manifest list)19:49
mordredI read that that way too19:50
mordredcorvus: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#109319:50
mordredpush arch specific layers one at a time19:50
*** hillpd has joined #opendev19:51
funginoonedeadpunk: a network diagram for this would help, but based on the get routes and get ip a task outputs from 28580dda752044e695992546e11171e3 i infer that "primary" has both 172.24.4.1/23 and 172.24.4.2/23 bound to br-infra while "secondary" has 172.24.4.3 on its br-infra interface, then in ebd87dde2ffb4bf2835c21e53e5be32e "primary" is trying to ping 172.24.4.3 on "secondary" across br-infra and seems to19:51
fungihave no entry in its arp table for that? (dumping the arp table might help clarify here)19:51
*** zbr_ has joined #opendev19:51
corvusmordred: i think that tells us that 5f98 is amd because we pushed it first, yeah?  that jibes with your observed error that we got amd for -base19:52
mordredyeah19:52
mordredcorvus: I wonder if it's as simple as taking out that one-at-a-time behavior, which we shouldn't need any more because you added locking in zuul-registry19:52
corvusi'm not there yet19:53
mordredkk19:53
*** rm_work has joined #opendev19:53
corvusmordred: moving to the -builder image; we saw sha bb4f in our error build.  this looks like it was the second one pushed: https://zuul.opendev.org/t/openstack/build/86101c17f7fe4a23840142ed44ad0fab/log/job-output.txt#275619:54
corvusmordred: that should mean bb4f is the arm image, which also matches what we observed19:54
mordredyah19:55
corvusmordred: so i think we've confirmed that we did indeed download the amd base and arm builder images19:55
*** jbryce has joined #opendev19:55
corvusmordred: if we assume that the job is using only one of the arch images in the multi-stage build, how does it ever work?19:56
corvusmordred: shouldn't there always be an erroneous case?19:56
noonedeadpunkfungi: ok, I think multinode role makes some weird things....19:56
mordredcorvus: not if there is a matching wheel for the package on pypi - it might always be "broken" but we might not notice19:58
noonedeadpunkfungi: this port is incorrect.. https://zuul.opendev.org/t/vexxhost/build/7e604332e91b4e38a418efc4414c20b0/log/job-output.txt#85819:58
noonedeadpunkwondering why it's created...19:58
mordrednevermind. there are no pypi wheels for uwsgi19:58
funginoonedeadpunk: also worth noting, broadcast traffic (including arp who-has) across vxlan requires multicast be passed through the underlying infrastructure, which at least some of our providers disallow19:59
mordredcorvus: so - yeah- I'm confused as to why all of the builds didn't fail :(19:59
corvusmordred: i feel like we have to entertain the theory that maybe there are 4 images on the builder despite us not having output indicating that?20:00
noonedeadpunkoh, but in sandbox I have the same port...20:00
noonedeadpunkso yeah, maybe problem is that such config is not allowed by some providers...20:00
corvusmordred: (and sometimes those 4 are [base-amd, base-arm, builder-amd, builder-arm] and sometimes they are [base-amd, base-amd, builder-arm, builder-arm]?)20:01
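
(One way to settle which architecture a given digest really is would be to ask the registry and the local daemon directly; a sketch against the published images, with the tag as an example value - the intermediate registry used inside the job would need its own image reference:)

    # show the manifest list and the per-platform digests behind a tag
    docker buildx imagetools inspect docker.io/opendevorg/python-base:3.7
    # for an image that has already been pulled locally, report what arch it actually is
    docker image inspect --format '{{.Os}}/{{.Architecture}}' docker.io/opendevorg/python-base:3.7
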
noonedeadpunkfungi: can I ask for a hold? I guess I've rechecked enough times and I should probably get at least one run on a provider that allows multicast...20:03
funginoonedeadpunk: i want to say that the multi-node-bridge role establishes all ports as point-to-point to avoid needing to use broadcast traffic, so i'm not sure if it's really arp not making it through for that reason, but sure a hold would be a good way to explore this20:03
funginoonedeadpunk: so an autohold for the vexxhost tenant, vexxhost/ansible-role-frrouting project, ffrouting-deploy job, 739717 change?20:05
noonedeadpunkyes, exactly)20:05
*** knikolla has joined #opendev20:05
funginoonedeadpunk: it's set now, recheck at your convenience and let me know when you've got a failure so i can find the nodes and add your ssh key to them20:07
fungithis fails early enough i expect that should be pretty quick20:07
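
(For reference, an autohold like this is added on the scheduler with the zuul client; roughly the following, with the full project path and option spelling assumed:)

    zuul autohold --tenant vexxhost \
        --project opendev.org/vexxhost/ansible-role-frrouting \
        --job ffrouting-deploy \
        --change 739717 \
        --reason 'noonedeadpunk debugging multi-node-bridge/ovs' \
        --count 1
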
corvusmordred: why are you thinking the 'push one at a time' isn't relevant?20:09
*** johnsom_ has joined #opendev20:10
*** mwhahaha has joined #opendev20:10
corvusmordred: (shouldn't the end state end up correct? and the job that fails isn't going to start until everything is pushed including the manifest list)20:10
corvusmordred: or rather, why *is* it relevant20:10
mordredcorvus: I think I was thinking that maybe the subsequent pushes weren't actually overwriting things properly?20:11
mordredyeah20:11
mordred(I read isn't as is : ))20:11
mordredcorvus: but - I don't have any specific output to indicate that - it was more a hunch from the sequence in how we built them matching up with our failure case shas20:12
mordredbut - if that were actually the case I'd expect all of them to fail the same way20:12
corvusack20:13
corvusmordred: do you think it would help to hold a node and see if we can poke at the builders and see what images they have?20:14
corvusmaybe we can confirm whether there's 2 or 4 or... i'm not familiar with how much we can poke around inside the buildx builders20:14
mordredcorvus: probably? although now I'm a little worried about what happens if we don't hit it - so maybe we should remove the approval just in case - and might have to recheck it a couple of times20:15
corvusmordred: yes, i'll write some notes in a review and WIP it20:15
mordredultimately those builders should just be docker containers - so I'd hope we can exec into them20:15
mordredcool20:15
*** wendallkaters has joined #opendev20:16
*** seongsoocho has joined #opendev20:17
corvusmordred: take a look at my comments on https://review.opendev.org/72626320:21
noonedeadpunkfungi: it has already failed:) my key is here https://launchpad.net/~noonedeadpunk/+sshkeys20:21
corvuskevinz: are you still using these held nodes from march? http://paste.openstack.org/show/795676/20:24
funginoonedeadpunk: ssh root@158.69.65.199 and root@158.69.70.22120:26
noonedeadpunkmany thanks20:26
fungisure, and please let me know once you're done with them20:26
corvusmordred: autohold in place and rechecked20:26
mordredcorvus: lgtm20:27
mordredcorvus: does the +A from ianw need to be removed? or will the -W be good enough?20:27
fungi-w will block merge20:27
mordredcool20:28
mordredcorvus: I'm truly fascinated to find out what the situation is here :)20:29
corvusayup20:29
corvuswe're goin' fishin'20:29
*** Open10K8S has joined #opendev20:30
noonedeadpunkfungi: so yeah, arp is not passing... `172.24.4.2                       (incomplete)                              br-infra`20:31
fungiglad to know all those years as a network engineer weren't for nothing20:32
funginoonedeadpunk: if you set up static arp entries can they communicate?20:33
noonedeadpunkfungi: nope20:35
noonedeadpunkso I guess it's smth in vxlan and ovs..20:35
funginoonedeadpunk: so... we're using this role successfully in devstack multinode jobs, though i don't know if we have ovs versions (i think the default one is using linuxbridge?)20:36
noonedeadpunkbtw in ovs log there's this `jsonrpc|WARN|unix#6: receive error: Connection reset by peer`20:37
noonedeadpunkok. need to dig deeper into that20:37
fungibut given neutron's default for victoria is ml2/ovs i would be surprised if there isn't one20:37
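
(A sketch of the sort of checks that are useful on the held nodes at this point; the bridge name and addresses come from the job output above, while the MAC and the underlay interface name are placeholders:)

    # does ovs have the bridge and a vxlan tunnel port towards the peer?
    sudo ovs-vsctl show
    # neighbour (arp) table for the overlay bridge
    ip neigh show dev br-infra
    # pin a static entry to test whether only arp resolution is being lost
    sudo ip neigh replace 172.24.4.3 lladdr 00:00:00:00:00:01 dev br-infra nud permanent
    # watch the underlay for encapsulated traffic (vxlan's default port is udp/4789)
    sudo tcpdump -ni ens3 udp port 4789
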
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Update synchronize-repos  https://review.opendev.org/74011020:49
*** mnaser is now known as mnaser|ic20:52
*** hashar has quit IRC20:55
*** mnaser|ic has quit IRC21:08
*** mnaser|ic has joined #opendev21:08
*** mnaser|ic has quit IRC21:08
*** mnaser|ic has joined #opendev21:08
*** mnaser|ic is now known as vexxhost21:08
*** gouthamr has quit IRC21:09
*** gouthamr has joined #opendev21:11
*** vexxhost is now known as mnaser21:14
*** mnaser is now known as mnaser|ic21:14
*** gouthamr_ has joined #opendev21:15
*** johnsom_ is now known as johnsom21:19
*** johnsom has joined #opendev21:19
*** sshnaidm|afk has quit IRC22:03
*** slittle1 has quit IRC22:17
*** tosky has quit IRC22:30
*** slittle1 has joined #opendev22:31
*** mnaser has joined #opendev22:38
*** mlavalle has quit IRC22:54
*** slittle1 has quit IRC22:56
*** slittle1 has joined #opendev22:58
*** tkajinam has joined #opendev22:58
kevinzcorvus: Not yet, please help to unlock it, thanks23:10
*** Dmitrii-Sh has quit IRC23:14
*** Dmitrii-Sh has joined #opendev23:21
*** DSpider has quit IRC23:35
*** mnaser has quit IRC23:38
*** mnaser|ic has quit IRC23:38
*** mnaser|ic has joined #opendev23:38
*** mnaser|ic has quit IRC23:38
*** mnaser|ic has joined #opendev23:38
*** mnaser|ic is now known as mnaser23:38
