*** ryohayakawa has joined #opendev | 00:06 | |
openstackgerrit | Merged opendev/system-config master: Don't install the track-upstream cron on review-test https://review.opendev.org/739840 | 00:26 |
gouthamr | fungi: o/ sorry - i think it's been an issue since last week, https://zuul.opendev.org/t/openstack/builds?job_name=manila-tempest-plugin-lvm | 00:29 |
gouthamr | i now see only rax-ord failures, so maybe i am confused, let me double check and compile some findings | 00:32 |
ianw | https://zuul.opendev.org/t/openstack/build/60671a686fec409e9f9226707ab03916/log/job-output.txt seems to be a dfw failure | 00:32 |
gouthamr | ianw: ack, just looked at the last ten runs and they've failed 5 times in rax-iad, 2 times in rax-ord and once in rax-dfw, and two other failures are test failures that occur sporadically; this job was quite reliable prior to this | 00:42 |
gouthamr | a reboot event in the logs looks like this: https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-n-api.txt#3382 | 00:45 |
ianw | gouthamr: tbh i wouldn't think it was rax specific; we really haven't changed any settings | 00:47 |
gouthamr | ianw: ty, good to know - i'll check if i can twiddle some node and test settings | 00:48 |
ianw | ansible_memfree_mb: 7747 | 00:48 |
ianw | that's about where i'd expect it | 00:48 |
gouthamr | ^^ i was looking for something to indicate that, thanks ianw :) | 00:49 |
ianw | https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/zuul-info/host-info.controller.yaml | 00:49 |
gouthamr | does that get collected prior to devstack though? | 00:49 |
ianw | during devstack you've got | 00:50 |
ianw | https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-dstat.txt | 00:50 |
ianw | right before the "reboot" there you've got | 00:52 |
ianw | 4608k 8186M | 00:52 |
ianw | so it doesn't seem the host is under memory pressure at that point | 00:52 |
gouthamr | ianw: it does, no? (used free buff cach) 5755M 126M 148M 1954M | 00:56 |
ianw | that's still ~2gb in cache there, the number before was swap so it's not heavily into swap | 00:57 |
ianw | i wouldn't say it's *not* OOM, but it doesn't feel like a smoking gun at that point of the job | 00:58 |
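A rough sketch of how those dstat columns line up (the exact flags used by devstack's screen-dstat service may differ; the figures are the ones quoted above):

    $ dstat --time --mem --swap 5
    ----system---- ------memory-usage----- ----swap---
         time     | used  free  buff  cach| used  free
    08-07 00:52:00|5755M  126M  148M 1954M|4608k 8186M

i.e. roughly 2GB still sitting in page cache and only ~4.6MB of swap in use, which is why this doesn't read as memory pressure at that point.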
gouthamr | ianw: ah, thanks; it's consistently the same set of tests failing, i'll check if there's something weird we're doing.. | 01:00
*** dtroyer has quit IRC | 01:03 | |
openstackgerrit | Merged opendev/system-config master: gitea: install proxy https://review.opendev.org/739865 | 01:08 |
openstackgerrit | Ian Wienand proposed opendev/base-jobs master: promote-deployment: use upload-afs-synchronize https://review.opendev.org/739876 | 02:30 |
ianw | change to deploy the reverse proxy is queued now | 02:43 |
*** dtroyer has joined #opendev | 02:58 | |
ianw | hrm, did i not open port 3081? ... | 03:27 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 03:46 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: open port 3081 https://review.opendev.org/739884 | 03:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 03:57 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gita-lb : update listeners to reverse proxy https://review.opendev.org/739889 | 04:07 |
ianw | mordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/console feels like a real problem with 726263/ | 04:09 |
ianw | raise Exception("you need a C compiler to build uWSGI") | 04:10 |
ianw | i wonder if there's a new uwsgi release and it's not a wheel? | 04:10 |
ianw | ... no that's not it | 04:10 |
ianw | it almost looks like a race condition, like uwsgi was building before dpkg finished | 04:13 |
*** ykarel|away is now known as ykarel | 04:14 | |
mordred | Weird | 04:14 |
mordred | I'll investigate in the morning if you haven't figured it out yet | 04:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables https://review.opendev.org/739892 | 05:03 |
*** raukadah is now known as chandankumar | 05:09 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 05:10 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 05:10 |
*** marios has joined #opendev | 05:17 | |
*** ryo_hayakawa has joined #opendev | 05:19 | |
*** ryo_hayakawa has quit IRC | 05:19 | |
*** ryo_hayakawa has joined #opendev | 05:19 | |
*** tobiash has joined #opendev | 05:20 | |
*** ryohayakawa has quit IRC | 05:21 | |
*** DSpider has joined #opendev | 05:54 | |
openstackgerrit | Merged opendev/system-config master: gitea: open port 3081 https://review.opendev.org/739884 | 05:57 |
ianw | clarkb / fungi : ^ i've run out of time to put that into production -- you should be able to manually choose a host to redirect on gitea-lb for initial testing, then https://review.opendev.org/739889 would redirect all | 06:38 |
*** ianw is now known as ianw_pto | 06:38 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 06:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 07:17 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables https://review.opendev.org/739892 | 07:21 |
*** tosky has joined #opendev | 07:33 | |
*** fressi has joined #opendev | 07:34 | |
*** zbr has joined #opendev | 07:35 | |
*** hashar has joined #opendev | 07:39 | |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-coverage-output: direct link to coverage data https://review.opendev.org/739099 | 07:59 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** dtantsur|afk is now known as dtantsur | 08:10 | |
*** AJaeger has quit IRC | 08:13 | |
*** jaicaa_ has quit IRC | 08:14 | |
*** tkajinam has quit IRC | 08:23 | |
*** bolg has joined #opendev | 08:27 | |
*** zbr7 has joined #opendev | 08:30 | |
*** diablo_rojo has quit IRC | 08:41 | |
*** jaicaa has joined #opendev | 09:00 | |
*** AJaeger has joined #opendev | 09:00 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 09:28 |
ianw_pto | mordred: ^ might be of interest to you as it ties in with launching nodes. with the small stack of ^, we won't forget to add host keys when creating new servers, and it also keeps host keys committed, which seems good too | 09:30
ianw_pto | https://review.opendev.org/739412 adds the host keys for all our existing servers | 09:30 |
*** ryo_hayakawa has quit IRC | 09:33 | |
*** rpittau has joined #opendev | 09:53 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix DIB scripts python version https://review.opendev.org/739844 | 10:06 |
*** bhagyashris is now known as bhagyashris|brb | 10:42 | |
*** bhagyashris|brb is now known as bhagyashris | 10:59 | |
*** ysandeep is now known as ysandeep|brb | 11:11 | |
*** ysandeep|brb is now known as ysandeep | 11:25 | |
*** hashar is now known as hasharNap | 12:02 | |
*** xiaolin has joined #opendev | 12:19 | |
*** ysandeep is now known as ysandeep|mtg | 12:29 | |
*** roman_g has joined #opendev | 12:29 | |
roman_g | Hello team. Could someone look at NODE_FAILURE's we are experiencing with 16GB and 32GB nodes, please? Example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test | 12:32 |
roman_g | And here is another example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-gate-test | 12:34 |
*** hasharNap is now known as hashar | 12:57 | |
*** rpittau has quit IRC | 12:57 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 12:58 |
*** rpittau has joined #opendev | 13:00 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:01 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:07 |
*** fressi has quit IRC | 13:08 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:09 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:14 |
AJaeger | roman_g: one of our clouds is down - see https://review.opendev.org/739605 | 13:16 |
*** lpetrut_ has joined #opendev | 13:21 | |
roman_g | AJaeger all right, thank you. I'm turning off voting on our jobs, as some runs still succeed. Where could I keep an eye on the cloud getting back up and running? Mailing list? | 13:27
*** ysandeep|mtg is now known as ysandeep | 13:30 | |
AJaeger | roman_g: I don't think we announce this normally, let's see whether fungi has more info ^ | 13:33 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 13:40 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 13:42 |
AJaeger | infra-root, if the change above by ttx merges, the closing of open github issues can be stopped - do we have a change for that open? Or can anybody do one, please? | 13:48 |
mordred | AJaeger: I've got a phone call in about 10 minutes - but I can work on a change for that after the call | 13:50 |
ttx | AJaeger: fungi told me it's already ripped out, it just needs to be commented out on the running server (won't be automatically put back) | 13:50
mordred | yeah - I agree | 13:51 |
mordred | it's no longer there | 13:51 |
*** ysandeep is now known as ysandeep|afk | 13:52 | |
ttx | Also does not hurt to have some overlap. I'm pretty sure the script will fail for one reason or another | 13:52 |
AJaeger | ttx: overlap is fine - I wanted to make sure we don't forget to switch it off ;) | 13:52
*** rpittau_ has joined #opendev | 13:56 | |
*** rpittau has quit IRC | 13:58 | |
*** ysandeep|afk is now known as ysandeep | 14:00 | |
*** bhagyashris is now known as bhagyashris|afk | 14:07 | |
*** bhagyashris|afk is now known as bhagyashris | 14:17 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:22 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:25 |
*** lpetrut_ has quit IRC | 14:28 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:35 |
*** fressi has joined #opendev | 14:36 | |
fungi | AJaeger: roman_g: sorry, delayed start today (ironically, dealing with hvac contractors at the house)... the openedge cloud which was providing those nonstandard node flavors is offline, yes. we make sure to have plenty of redundant sources for our standard 8gb flavor, but finding more donors to supply those 16gb and 32gb flavors would be necessary for better availability. no eta on openedge returning, could | 14:37 |
fungi | be as long as october | 14:37 |
fungi | (the reason my dealing with hvac contractors is ironic is that openedge is down because the air handler for that privately-run environment gave up the ghost) | 14:38
openstackgerrit | Tobias Henkel proposed zuul/zuul-jobs master: DNM: Add unified synchronize-repos role that works with linux and windows https://review.opendev.org/740005 | 14:39 |
*** dmsimard3 has joined #opendev | 14:41 | |
*** dmsimard has quit IRC | 14:41 | |
*** dmsimard3 is now known as dmsimard | 14:41 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman https://review.opendev.org/740006 | 14:41 |
*** fressi has quit IRC | 14:46 | |
*** priteau has joined #opendev | 14:48 | |
*** mlavalle has joined #opendev | 14:50 | |
*** fressi has joined #opendev | 14:51 | |
*** ysandeep is now known as ysandeep|away | 14:55 | |
*** fressi has quit IRC | 15:00 | |
AJaeger | fungi, could you review 738228 , please? | 15:02 |
*** ykarel is now known as ykarel|away | 15:04 | |
*** priteau has quit IRC | 15:04 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman https://review.opendev.org/740006 | 15:11 |
fungi | AJaeger: you bet | 15:13 |
roman_g | fungi all right, thanks. | 15:13 |
mordred | fungi: the other provider of those right now is city, yes? | 15:14 |
roman_g | And kna | 15:14 |
fungi | kna is the region in citycloud, yes | 15:16 |
roman_g | Oh, yes | 15:17 |
roman_g | >> as long as october - anything I can help with? | 15:18 |
mordred | roman_g: I think for openedge it's mostly about getting his HVAC system replaced, so I don't think there's much we can do to help - unless someone wants to buy him a new HVAC :) | 15:20 |
roman_g | All right =) | 15:20 |
AJaeger | so, if we have two more regions for those nodes - should roman_g really see NODE_FAILURES? | 15:21 |
mordred | we might want to search out ways to get large flavors from other providers ... perhaps there's a source of funding somewhere that could be given to vexxhost or one of the other clouds to provide some. not 100% sure what's the best path there | 15:21 |
roman_g | Still seeing NODE_FAILUREs, yes. | 15:21 |
roman_g | Probably citycloud can't schedule that many large instances. | 15:21 |
roman_g | Any logs I can check on this regard by myself? | 15:22 |
roman_g | So that I could have something in hand to go with it to Ericsson/CityCloud. | 15:22 |
corvus | node_failure should never be seen; it means an unexpected error occurred | 15:23 |
corvus | i can look into that in a few minutes | 15:24 |
fungi | yeah, i want to say the last time this came up, citycloud has available quota for the instances but they can't place them due to unavailability of suitable hypervisor host capacity (something like that) | 15:24
roman_g | fungi, yep, I remember | 15:24 |
roman_g | corvus thank you | 15:24 |
fungi | and yeah, if that's (still) the case then our nodepool launcher logs should indicate nova api errors | 15:25
*** Eighth_Doctor is now known as Conan_Kudo | 15:28 | |
*** Conan_Kudo is now known as Eighth_Doctor | 15:28 | |
smcginnis | Folks probably have seen this, but Gerrit has been donated to the new Open Usage Commons. | 15:30 |
smcginnis | https://openusage.org/news/introducing-the-open-usage-commons/ | 15:30 |
corvus | zuul scheduler is out of disk space | 15:34 |
mordred | oh look - wendar is on that board | 15:34 |
mordred | corvus: :( | 15:34 |
corvus | looks like a spike at 6:25 | 15:34 |
mordred | corvus: 23G in /var/log/zuul - current debug log is 5G | 15:36 |
corvus | mordred: did it get rotated successfully? | 15:38 |
mordred | corvus: yes | 15:38 |
mordred | corvus: -rw-rw-rw- 1 zuuld zuuld 906501839 Jul 8 06:25 debug.log.1.gz | 15:38 |
mordred | -rw-rw-rw- 1 zuuld zuuld 5386439154 Jul 8 15:38 debug.log | 15:38 |
corvus | /root/.bup is very large; 25G ? | 15:39 |
mordred | hrm. are we missing an exclusion? | 15:40 |
corvus | mordred: the logs are on their own partition, so we can discount them | 15:41 |
mordred | nod | 15:41 |
mordred | corvus: /var/log is not in bup-excludes | 15:41 |
corvus | seems like the 25G of bup out of 39G is a big fish | 15:42 |
mordred | yeah | 15:42 |
fungi | we recently removed a huge backup of old zuul from ~root there, right? maybe bup was tracking that previously? | 15:43 |
fungi | #status log commented out old githubclosepull cronjob for github user on review.o.o | 15:48 |
openstackstatus | fungi: finished logging | 15:48 |
corvus | i'm confused; "bup ls" shows nothing | 15:50 |
fungi | `bup ls` also returns nothing on the review server | 15:51 |
corvus | git log returns fatal: your current branch 'master' does not have any commits yet | 15:52 |
fungi | same on review.o.o, so maybe that's normal? | 15:53 |
corvus | 'git log root' also doesn't work | 15:53 |
corvus | fungi: or none of our backups work at all. it's probably been 5 years since we did a restore test? | 15:53 |
fungi | yeah, and `git ls-files` is similarly empty | 15:53 |
fungi | trying to remember, i had to restore something fairly recently... what was it | 15:53 |
fungi | oh, was a config backup on lists.o.o | 15:54 |
fungi | checking | 15:54 |
corvus | also, the backup server is close to full | 15:54 |
corvus | 'git log root' on the backup server works | 15:55 |
corvus | i have no idea what the .bup directory on the client is for | 15:55 |
fungi | and yeah, same observed behaviors on lists.o.o, so either this has broken in the last few months (since i did a restore there successfully) or it's how bup works | 15:56 |
mordred | corvus: ~/.bup/index-cache is where all of the data is in that | 15:57 |
corvus | i'm not sure if we can prune that and still have it work, or if we'd need to rotate the remote backup too | 15:58 |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Remove glance-registry https://review.opendev.org/739796 | 15:58 |
*** hashar has quit IRC | 15:58 | |
fungi | aha, `bup ls -al` | 15:59 |
fungi | manpage ftw | 15:59 |
corvus | that's not making anything more clear :( | 15:59 |
fungi | strangely it contains empty .commit and .tag dirs | 15:59 |
fungi | yeah | 16:00 |
fungi | and nothing else? | 16:00 |
mordred | I agree - and also don't know what that means | 16:01 |
*** dtantsur is now known as dtantsur|afk | 16:01 | |
corvus | i'm inclined to just blow away the client side of this, and see if backups still work | 16:02 |
corvus | and probably it's also time to rotate the server side anyway, so if they don't work that would fix it | 16:02 |
mordred | corvus: the client side of this doesn't seem to be providing any value | 16:02
corvus | mordred: well, i don't know what it does | 16:02 |
corvus | it may be part of how it deduplicates stuff | 16:03 |
corvus | or it may be nothing | 16:03 |
mordred | nod | 16:03 |
fungi | `bup index -p` provides the file index | 16:03 |
fungi | and seems to work | 16:03 |
*** diablo_rojo has joined #opendev | 16:04 | |
corvus | fungi: that produces no output for me on zuul01 | 16:04
fungi | well, works on lists.o.o | 16:04 |
fungi | but yeah, empty for me on zuul.o.o | 16:04 |
fungi | ahh, empty on review.o.o too | 16:05 |
corvus | we don't usually use index, we just do split and join, right? | 16:05 |
fungi | so i think it only prints paths because `bup index` was manually run there previously | 16:05 |
corvus | yeah, that makes sense | 16:05 |
fungi | and yes, our docs rely on bup join for restoration | 16:05 |
corvus | so given our typical split/join usage of bup, i still don't know what the client side .bup dir is for | 16:05 |
fungi | and it's how i managed to restore successfully on lists.o.o previously | 16:05 |
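For reference, the split/join flow being referred to looks roughly like this (hostnames, repo path and branch name here are illustrative, not the exact ones in our backup scripts):

    # backup: stream a tarball into bup, stored under a named branch on the backup server
    tar -czf - /etc /var/lib/important | bup split -r bup@backup01.example.org:/opt/backups/zuul01 -n zuul01-filesystem

    # restore: reassemble the stream from the backup server and unpack it
    bup join -r bup@backup01.example.org:/opt/backups/zuul01 zuul01-filesystem | tar -xzf -

With only split/join in use, the client-side ~/.bup mostly holds an index cache, which would be consistent with `bup ls` coming back empty locally.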
*** marios is now known as marios|out | 16:06 | |
fungi | i concur, it doesn't seem like it's providing anything | 16:06 |
fungi | i wonder if it's providing some local cache to speed up incremental backup | 16:06 |
openstackgerrit | Merged openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 16:07 |
corvus | https://groups.google.com/forum/#!topic/bup-list/EP19diB3ADk | 16:10 |
corvus | apparently we're not supposed to notice it, and we can delete it | 16:11 |
corvus | fungi, mordred: i think i should rm -rf ~/.bup on zuul01 and then run bup init | 16:12 |
mordred | I agree | 16:13 |
mordred | ++ | 16:13 |
fungi | corvus: we've failed to not notice it i suppose. i concur with this plan | 16:13 |
corvus | #status log removed zuul01:/root/.bup and ran 'bup init' to clear the client-side bup cache which was very large (25G) | 16:14 |
openstackstatus | corvus: finished logging | 16:14 |
mordred | corvus: there is much more spaces now | 16:14 |
corvus | so many spaces | 16:15 |
fungi | all the spaces | 16:15 |
fungi | do we need to restart the scheduler, or is it not impacted | 16:15 |
corvus | i think the only impact would be failing to generate a key for a new project; i think we can just restart if someone complains | 16:16 |
corvus | the time database may be out of date, but that should recover without a restart | 16:19 |
*** roman_g has quit IRC | 16:22 | |
*** gouthamr has quit IRC | 16:26 | |
corvus | 2020-07-08 14:39:59,163 ERROR nodepool.NodeLauncher: [e: 7d1d1ee806274202b8fba3d14450c712] [node_request: 300-0009826852] [node: 0017913446] Detailed node error: No valid host was found. There are not enough hosts available. | 16:27 |
corvus | fungi: ^ that's what you were expecting? | 16:27
fungi | yep, that's what we were getting from them in the past | 16:28 |
corvus | roman_g has left | 16:29 |
fungi | i believe the explanation was they either lacked actual capacity on the compute nodes, or lacked sufficient capacity to place instances of the requested flavor on any | 16:29 |
corvus | anyway, i think that's it for data collection; i'm not sure what we should do about that | 16:29 |
corvus | other that perhaps remove the node types if we don't have any clouds that can reliably provide them | 16:30 |
fungi | i think it's always been that way, which is why openedge started providing capacity for that size instances so at least they could be filled from somewhere | 16:30 |
corvus | if we take a high level step back: i think our choices are either: 1) find a way to reliably provide the nodes; 2) don't offer the nodes; 3) keep retry-bashing the one provider that does offer them and mask the high failure rate from end users | 16:32 |
corvus | exactly how we achieve any of those is tactics, but i think those are the strategies | 16:32
openstackgerrit | Merged openstack/project-config master: Add nested-virt-centos-8 label https://review.opendev.org/738161 | 16:33 |
fungi | any indication as to whether we are actually ever getting any of them at all in citycloud? | 16:33 |
corvus | iiuc, this link from roman_g earlier suggests "yes sometimes" https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test | 16:33 |
corvus | 49% of the time | 16:34 |
fungi | oh, okay. i suspect that artificially lowering the quota won't help, it's probably a binpacking problem on the compute nodes | 16:39 |
fungi | in theory there's sufficient resources, but finding a single compute node with the necessary capacity fails on placement | 16:40 |
mordred | not enough hosts available seems like a form of a quota issue | 16:40 |
mordred | like - from an end-user perspective - except with a worse error message | 16:40 |
mordred | because it's not "this request is impossible" it's "this request is impossible right now" | 16:41 |
mordred | I think that's me arguing for a form of 3 | 16:41 |
fungi | well, if they're running at 50% utilization in the host aggregate but evenly distributing the load across all their compute nodes and the nodes themselves can only handle up to 64gb ram, then they could be way underutilized but still unable to find a single compute node which can host a 32gb vm | 16:41 |
mordred | totally | 16:42 |
mordred | which is why it isn't Actually a quota issue nor would it necessarily be fixed by adjusting quota | 16:42 |
fungi | anyway, no clue if that's the exact problem, lots of ifs in there, but we've reported it to them previously | 16:42 |
mordred | but as an API consumer it's similar - a valid request for resources is rejected by the cloud | 16:42 |
mordred | and that same request, retried later, might work | 16:43
fungi | i suspect the 16gb requests are succeeding more often than the 32gb requests | 16:43 |
mordred | I'd also suspect that | 16:43 |
*** chandankumar has quit IRC | 16:44 | |
*** ykarel|away has quit IRC | 16:46 | |
corvus | if we want to retry-bash (3), then what tactic should we follow? | 16:46 |
corvus | we currently try to predict whether a request will succeed, only try it if we think it will, and if it fails 3 times, give up | 16:47 |
corvus | we can't predict this | 16:47 |
corvus | do we want to special case the error message: Detailed node error: No valid host was found. There are not enough hosts available. | 16:47 |
corvus | and discount that so it never counts against the retry attempts? | 16:47 |
*** hashar has joined #opendev | 16:47 | |
corvus | and are we sure that's never a permanent error? | 16:48 |
corvus | (like, what if we get that error for some other reason (data center loses power); are we okay sitting in a retry loop for that?) | 16:48 |
fungi | i have a feeling there are cases where it is an effectively permanent error | 16:50 |
fungi | like provider has a flavor which can never actually be placed on any of their infrastructure | 16:50 |
fungi | i suppose the question is whether node_error and a manual recheck is preferable to having builds and their respective changes sit indefinitely in the queue | 16:51 |
fungi | as roman_g speaks for the primary consumer of this flavor, i'm inclined to leave it this way until we can get his feedback on it | 16:53 |
fungi | though more generally, i don't really know if it's better to have changes stuck in queue effectively forever, or to emit node_failure results when clouds are having placement issues | 16:54 |
corvus | there is a launch-retries option, so we could bump it up to 1000 attempts or something, but unfortunately, that just means we're going to sit there trying the same api request over and over until it succeeds | 16:57 |
corvus | given some of our jobs take hours, we might have to bump it up to 10,000 retry attempts to mask it. | 16:58 |
corvus | it's possible our providers may notice that. | 16:58 |
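For context, launch-retries is a per-provider knob in the nodepool launcher config; raising it looks roughly like this (provider, label and flavor names here are illustrative):

    providers:
      - name: citycloud-kna1            # hypothetical provider entry
        driver: openstack
        cloud: citycloud
        launch-retries: 3               # bumping this just replays the same server-create call
        pools:
          - name: main
            max-servers: 10
            labels:
              - name: airship-32GB
                flavor-name: 32GB-flavor
                diskimage: ubuntu-bionic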
*** fressi has joined #opendev | 17:00 | |
corvus | maybe we could do something where if we detect that error, we say: (a) don't count this as a retry attempt; and (b) don't retry this until another server has been deleted | 17:00 |
corvus | basically, don't count it as a retry and pause the provider | 17:00 |
corvus | i think i could see that working and being sustainable. it may not be a simple change to nodepool though. | 17:00
mordred | yeah | 17:01 |
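Roughly the decision being proposed, as a hypothetical sketch (nodepool's real launcher code is not structured like this):

    # Hypothetical sketch of the "don't charge a retry, pause instead" idea.
    NO_VALID_HOST = "No valid host was found. There are not enough hosts available."

    def on_launch_error(error_message, attempts_used, max_retries=3):
        """Return (count_this_attempt, pause_provider, give_up)."""
        if NO_VALID_HOST in error_message:
            # Capacity/placement failure: don't burn a retry; pause the
            # provider until another server is deleted, then try again.
            return False, True, False
        give_up = attempts_used + 1 >= max_retries
        return True, False, give_up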
fungi | provider pausing does sound interesting | 17:05 |
fungi | though i guess it would be a label-specific pause? | 17:05 |
fungi | in situations like this the provider may have no trouble placing 8gb requests but still fail a majority of 32gb requests | 17:06 |
*** chandankumar has joined #opendev | 17:11 | |
*** marios|out has quit IRC | 17:12 | |
corvus | that would be counter-productive for us though, so i think we would want to completely pause and wait for it to drain | 17:13 |
corvus | (this is generally how nodepool providers work -- one request at a time) | 17:14 |
corvus | and of course, we would only want to do this if we knew we were the last remaining provider that could handle that request | 17:14 |
*** rh-jelabarre has quit IRC | 17:17 | |
*** roman_g has joined #opendev | 17:28 | |
roman_g | fungi corvus I read the logs. Thanks for the analysis. I've got rid of the 32GB jobs, and we only have 16GB jobs left as of now (waiting for Zuul to merge the PS with WF +1; other PSes need to be rebased on the future master). | 17:37
*** hashar has quit IRC | 17:43 | |
corvus | roman_g: ok. that seems likely to reduce the number of incidents; will be interesting to confirm | 17:59 |
roman_g | We impose quite a load on nodepool in the airshipctl project. For example, today alone 28 patch sets have been updated, many of them more than once. We used 6x 8GB, 2x 32GB, 2x 16GB. Now we are down to 6x 8GB & 1x 16GB. I think I can also squash 2 of the 8GB jobs into one. | 18:04
*** sshnaidm|ruck is now known as sshnaidm|afk | 18:25 | |
mordred | https://www.perl.com/article/announcing-perl-7/ | 18:34 |
*** hashar has joined #opendev | 18:36 | |
fungi | but i had only just gotten used to perl 5? | 18:43 |
*** gouthamr has joined #opendev | 18:53 | |
*** rpittau_ has quit IRC | 19:01 | |
*** roman_g has quit IRC | 19:07 | |
mordred | fungi: perl7 is apparently perl5 | 19:13 |
mordred | (but with a few of the common pragmas turned on by default) | 19:13 |
mordred | poor perl 6 | 19:13 |
mordred | corvus: I've got to AFK for a bit - but it looks like https://review.opendev.org/#/c/726263/ has hit the nefarious "incorrect arch" issue | 19:14 |
fungi | perl6 was a fun experiment, i suppose | 19:14 |
mordred | corvus: you can see it in the console error log with the amd64 apt repos being used in the arm64 builder | 19:14 |
*** rh-jelabarre has joined #opendev | 19:15 | |
mordred | (the symptom which causes it to fail is going to be that we build wheels for one arch in the builder image and then copy them to the wrong arch causing pip to not find wheels and want to build them - except it's the final image so there's no compiler - but the root cause is going to be the arch mismatch) | 19:16 |
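The two-stage image pattern being discussed is roughly the following (approximate; the actual Dockerfile in 726263 may differ in detail):

    ARG PYTHON_VERSION=3.7
    # stage 1: has a compiler, builds wheels for the project and its dependencies
    FROM docker.io/opendevorg/python-builder:${PYTHON_VERSION} as builder
    COPY . /tmp/src
    RUN assemble

    # stage 2: no compiler; only installs the wheels produced above
    FROM docker.io/opendevorg/python-base:${PYTHON_VERSION}
    COPY --from=builder /output/ /output
    RUN /output/install-from-bindep

If buildx pairs an arm64 builder stage with an amd64 final stage (or vice versa), the copied wheels are for the wrong architecture, pip falls back to building from sdist, and the "you need a C compiler to build uWSGI" error is what surfaces.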
noonedeadpunk | hey everyone... need some help with zuul multinode jobs. For some reason the ovs ip isn't pingable here https://zuul.opendev.org/t/vexxhost/build/ebd87dde2ffb4bf2835c21e53e5be32e/log/job-output.txt#730 | 19:17
*** gmann_ has joined #opendev | 19:17 | |
*** gmann_ is now known as gmann | 19:18 | |
fungi | noonedeadpunk: what's the overlay there? vxlan or gre? | 19:24 |
fungi | i think we've generally defaulted to vxlan in opendev because a number of our providers lack the conntrack plugin to allow gre to cross between compute nodes | 19:25 |
corvus | mordred: i think i need an eli5 for that | 19:26 |
noonedeadpunk | fungi: vxlan I guess - it's the default one | 19:27
noonedeadpunk | yeah, previous job has ip a output. https://zuul.opendev.org/t/vexxhost/build/28580dda752044e695992546e11171e3/log/job-output.txt#765 | 19:28 |
fungi | ahh, okay | 19:28 |
corvus | mordred: i'm having particular trouble with "copy them to the wrong arch" | 19:28 |
mordred | corvus: it's possible I said too many words | 19:29 |
mordred | corvus: look at https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1994-2000 for the actual issue | 19:29 |
mordred | the other words were a description of why the c compiler errors are misleading | 19:29 |
mordred | but see on 1994 we're in linux/arm64 and then on line 2000 we're pulling from buster amd64 | 19:29 |
*** donnyd has joined #opendev | 19:31 | |
mordred | corvus: I don't have the whole picture - but the mismatch there matches the reported behavior from ianw from a few weeks ago | 19:31 |
corvus | mordred: the most recent from line is "FROM docker.io/opendevorg/python-base:${PYTHON_VERSION}" are you saying that pulled the wrong image? | 19:31 |
*** aannuusshhkkaa has joined #opendev | 19:32 | |
corvus | mordred: (i'm assuming "install-from-bindep" doesn't have any arch stuff on it, it's just saying "apt-get update", so an arch mismatch like that must be because the current image is wrong, yeah?) | 19:32 |
mordred | corvus: that's the hypothesis - that somehow that's pulling the incorrect arch | 19:32 |
*** mnaser has joined #opendev | 19:32 | |
mordred | we have yet to duplicate that behavior though | 19:32 |
mordred | corvus: yeah. | 19:32 |
mordred | corvus: I wonder if there is an issue at play with builder images here | 19:33 |
mordred | since a multi-stage build dockerfile is a more complicated construct | 19:33 |
mordred | but - honestly - since we've never seen this mistake from buildx in our other testing - I honestly have no clue | 19:34 |
corvus | this output is exceedingly difficult to read with exceeding difficulty | 19:35 |
corvus | it looks like it's constantly reprinting entire stanzas just to make incremental changes | 19:35 |
mordred | ooh! | 19:36 |
mordred | corvus: look at this: | 19:36 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1484-1485 | 19:36 |
mordred | that's in the builder image stage | 19:36 |
corvus | oh good find, i was focused on the python-base image | 19:36 |
mordred | which is ostensibly an amd64 build which is deciding to install arm64 packages | 19:37 |
mordred | corvus: and then the arm builder is also building arm: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1492-1493 | 19:37 |
corvus | mordred: so we've seen it crossed in both directions? arm->amd and amd->arm? | 19:38 |
mordred | so - it seems like docker is using an arm version of python-builder for both arm and amd versions of the build stage - and then it's using the amd version of python-base in the second stage for both arm and amd | 19:39
mordred | yup | 19:39 |
*** rpittau has joined #opendev | 19:40 | |
corvus | mordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1847-1861 | 19:41 |
corvus | mordred: that series appears to show the same shas being used for both arches | 19:41 |
corvus | mordred: moreover https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1087-1093 (and the items above it) appears to show only one download for each image | 19:42 |
corvus | all on arm64 | 19:42 |
mordred | corvus: I agree | 19:42 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/700f3fa8cb5a49f1b75df928e61004e7/log/job-output.txt#1073 is from a working build that also shows same shas | 19:42 |
fungi | noonedeadpunk: so the failure there is in the frrouting role you've added, i guess? | 19:43 |
noonedeadpunk | yep... so I'm trying to add bgp, but need internal network for that... | 19:43 |
*** jentoio has joined #opendev | 19:43 | |
corvus | mordred: i'm a little confused about how, if we downloaded only the arm images, we ended up with amd python-base | 19:43 |
fungi | noonedeadpunk: oh, i see, it's tasks after frrouting | 19:44 |
mordred | corvus: yah | 19:44 |
mordred | corvus: well - are we sure we only downloaded arm for each? we seem to certainly have only downloaded one arch for each | 19:45 |
mordred | but maybe what looks like arm downloading is actually downloading an amd? | 19:45 |
corvus | mordred: not at all; i assumed that the arm64 thingy would download the arm64 arch | 19:45 |
mordred | (because it certainly seems like we have amd python-base when we use it) | 19:46 |
noonedeadpunk | fungi: when I launch 2 instances in our public cloud (which is added for CI as well), it works nicely | 19:47 |
noonedeadpunk | So maybe there are some extra variables set in the environment that make things work differently... | 19:48
*** jrosser has joined #opendev | 19:48 | |
corvus | mordred: i'm going back to the python-base build for this: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#1438-1447 | 19:48 |
corvus | mordred: 5f98 is the python-base we got | 19:48 |
corvus | mordred: the way i read that is that we build one arch (5f98), then another arch (eebf), then the manifest list to bind them together (e4bd) | 19:48
corvus | mordred: if that's right, then 5f98 indicates which arch we got (ie, it's not merely the manifest list) | 19:49 |
mordred | I read that that way too | 19:50 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#1093 | 19:50 |
mordred | push arch specific layers one at a time | 19:50 |
*** hillpd has joined #opendev | 19:51 | |
fungi | noonedeadpunk: a network diagram for this would help, but based on the get routes and get ip a task outputs from 28580dda752044e695992546e11171e3 i infer that "primary" has both 172.24.4.1/23 and 172.24.4.2/23 bound to br-infra while "secondary" has 172.24.4.3 on its br-infra interface, then in ebd87dde2ffb4bf2835c21e53e5be32e "primary" is trying to ping 172.24.4.3 on "secondary" across br-int and seems to | 19:51 |
fungi | have no entry in its arp table for that? (dumping the arp table might help clarify here) | 19:51 |
*** zbr_ has joined #opendev | 19:51 | |
fungi | er, across br-infra not br-int | 19:52 |
corvus | mordred: i think that tells us that 5f98 is amd because we pushed it first, yeah? that jives with your observed error that we got amd for -base | 19:52 |
mordred | yeah | 19:52 |
mordred | corvus: I wonder if it's as simple as taking out that one-at-a-time behavior which we shouldn't need any more because you added locking in zuul-registry | 19:52
corvus | i'm not there yet | 19:53 |
mordred | kk | 19:53 |
*** rm_work has joined #opendev | 19:53 | |
corvus | mordred: moving to the -builder image; we saw sha bb4f in our error build. this looks like it was the second one pushed: https://zuul.opendev.org/t/openstack/build/86101c17f7fe4a23840142ed44ad0fab/log/job-output.txt#2756 | 19:54 |
corvus | mordred: that should mean bb4f is the arm image, which also matches what we observed | 19:54 |
mordred | yah | 19:55 |
corvus | mordred: so i think we've confirmed that we did indeed download the amd base and arm builder images | 19:55 |
*** jbryce has joined #opendev | 19:55 | |
corvus | mordred: if we assume that the job is using only one of the arch images in the multi-stage build, how does it ever work? | 19:56 |
corvus | mordred: shouldn't there always be an erroneous case? | 19:56 |
noonedeadpunk | fungi: ok, I think the multinode role is doing some weird things.... | 19:56
mordred | corvus: not if there is a matching wheel for the package on pypi - it might always be "broken" but we might not notice | 19:58
noonedeadpunk | fungi: this port is incorrect.. https://zuul.opendev.org/t/vexxhost/build/7e604332e91b4e38a418efc4414c20b0/log/job-output.txt#858 | 19:58 |
noonedeadpunk | wondering why it's created... | 19:58 |
mordred | nevermind. there are no pypi wheels for uwsgi | 19:58 |
fungi | noonedeadpunk: also worth noting, broadcast traffic (including arp who-has) across vxlan requires multicast be passed through the underlying infrastructure, which at least some of our providers disallow | 19:59 |
mordred | corvus: so - yeah- I'm confused as to why all of the builds didn't fail :( | 19:59 |
corvus | mordred: i feel like we have to entertain the theory that maybe there are 4 images on the builder despite us not having output indicating that? | 20:00 |
noonedeadpunk | oh, but in sandbox I have the same port... | 20:00 |
noonedeadpunk | so yeah, maybe the problem is that such a config is not allowed by some providers... | 20:00
corvus | mordred: (and sometimes those 4 are [base-amd, base-arm, builder-amd, builder-arm] and sometimes they are [base-amd, base-amd, builder-arm, builder-arm]?) | 20:01
noonedeadpunk | fungi: can I ask for a hold? I guess I've rechecked enough times and I should probably land at least once on a provider that allows multicast... | 20:03
fungi | noonedeadpunk: i want to say that the multi-node-bridge role establishes all ports as point-to-point to avoid needing to use broadcast traffic, so i'm not sure if it's really arp not making it through for that reason, but sure a hold would be a good way to explore this | 20:03 |
fungi | noonedeadpunk: so an autohold for the vexxhost tenant, vexxhost/ansible-role-frrouting project, ffrouting-deploy job, 739717 change? | 20:05 |
noonedeadpunk | yes, exactly) | 20:05 |
*** knikolla has joined #opendev | 20:05 | |
fungi | noonedeadpunk: it's set now, recheck at your convenience and let me know when you've got a failure so i can find the nodes and add your ssh key to them | 20:07 |
fungi | this fails early enough i expect that should be pretty quick | 20:07 |
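For the record, a hold like that is created with something along these lines (the reason string and count are illustrative; tenant, project, job and change are the values mentioned above):

    zuul autohold --tenant vexxhost \
        --project vexxhost/ansible-role-frrouting \
        --job ffrouting-deploy \
        --change 739717 \
        --reason "noonedeadpunk debugging multinode vxlan/arp failure" \
        --count 1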
corvus | mordred: why are you thinking the 'push one at a time' isn't relevant? | 20:09 |
*** johnsom_ has joined #opendev | 20:10 | |
*** mwhahaha has joined #opendev | 20:10 | |
corvus | mordred: (shouldn't the end state end up correct? and the job that fails isn't going to start until everything is pushed including the manifest list) | 20:10 |
corvus | mordred: or rather, why *is* it relevant | 20:10 |
mordred | corvus: I think I was thinking that maybe the subsequent pushes weren't actually overwriting things properly? | 20:11 |
mordred | yeah | 20:11 |
mordred | (I read isn't as is : )) | 20:11 |
mordred | corvus: but - I don't have any specific output to indicate that - it was more a hunch from the sequence in how we built them matching up with our failure case shas | 20:12 |
mordred | but - if that were actually the case I'd expect all of them to fail the same way | 20:12
corvus | ack | 20:13 |
corvus | mordred: do you think it would help to hold a node and see if we can poke at the builders and see what images they have? | 20:14 |
corvus | maybe we can confirm whether there's 2 or 4 or... i'm not familiar with how much we can poke around inside the buildx builders | 20:14 |
mordred | corvus: probably? although now I'm a little worried about what happens if we don't hit it - so maybe we should remove the approval just in case - and might have to recheck it a couple of times | 20:15 |
corvus | mordred: yes, i'll write some notes in a review and WIP it | 20:15 |
mordred | ultimately those builders should just be docker containers - so I'd hope we can exec into them | 20:15
mordred | cool | 20:15 |
*** wendallkaters has joined #opendev | 20:16 | |
*** seongsoocho has joined #opendev | 20:17 | |
corvus | mordred: take a look at my comments on https://review.opendev.org/726263 | 20:21 |
noonedeadpunk | fungi: it has already failed:) my key is here https://launchpad.net/~noonedeadpunk/+sshkeys | 20:21 |
corvus | kevinz: are you still using these held nodes from march? http://paste.openstack.org/show/795676/ | 20:24 |
fungi | noonedeadpunk: ssh root@158.69.65.199 and root@158.69.70.221 | 20:26 |
noonedeadpunk | many thanks | 20:26 |
fungi | sure, and please let me know once you're done with them | 20:26 |
corvus | mordred: autohold in place and rechecked | 20:26 |
mordred | corvus: lgtm | 20:27 |
mordred | corvus: does the +A from ianw need to be removed? or will the -W be good enough? | 20:27 |
fungi | -w will block merge | 20:27 |
mordred | cool | 20:28 |
mordred | corvus: I'm truly fascinated to find out what the situation is here :) | 20:29 |
corvus | ayup | 20:29 |
corvus | we're goin' fishin' | 20:29 |
*** Open10K8S has joined #opendev | 20:30 | |
noonedeadpunk | fungi: so yeah, arp is not passing... `172.24.4.2 (incomplete) br-infra` | 20:31 |
fungi | glad to know all those years as a network engineer weren't for nothing | 20:32 |
fungi | noonedeadpunk: if you set up static arp entries can they communicate? | 20:33 |
noonedeadpunk | fungi: nope | 20:35 |
noonedeadpunk | so I guess it's smth in vxlan and ovs.. | 20:35 |
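A few things to check on the held nodes to narrow down where the traffic stops (interface and addresses as used in this job; the MAC is a placeholder):

    # neighbour table for the overlay bridge ("(incomplete)" means ARP never resolved)
    ip neigh show dev br-infra

    # static neighbour entry to separate ARP problems from tunnel problems
    sudo ip neigh replace 172.24.4.3 lladdr 00:11:22:33:44:55 dev br-infra nud permanent
    ping -c 3 172.24.4.3

    # inspect the vxlan port and flows ovs actually set up
    sudo ovs-vsctl show
    sudo ovs-ofctl dump-flows br-infra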
fungi | noonedeadpunk: so... we're using this role successfully in devstack multinode jobs, though i don't know if we have ovs versions (i think the default one is using linuxbridge?) | 20:36 |
noonedeadpunk | btw in ovs log there's this `jsonrpc|WARN|unix#6: receive error: Connection reset by peer` | 20:37 |
noonedeadpunk | ok. need to dig deeper with that | 20:37
fungi | but given neutron's default for victoria is ml2/ovs i would be surprised if there isn't one | 20:37 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Update synchronize-repos https://review.opendev.org/740110 | 20:49 |
*** mnaser is now known as mnaser|ic | 20:52 | |
*** hashar has quit IRC | 20:55 | |
*** mnaser|ic has quit IRC | 21:08 | |
*** mnaser|ic has joined #opendev | 21:08 | |
*** mnaser|ic has quit IRC | 21:08 | |
*** mnaser|ic has joined #opendev | 21:08 | |
*** mnaser|ic is now known as vexxhost | 21:08 | |
*** gouthamr has quit IRC | 21:09 | |
*** gouthamr has joined #opendev | 21:11 | |
*** vexxhost is now known as mnaser | 21:14 | |
*** mnaser is now known as mnaser|ic | 21:14 | |
*** gouthamr_ has joined #opendev | 21:15 | |
*** johnsom_ is now known as johnsom | 21:19 | |
*** johnsom has joined #opendev | 21:19 | |
*** sshnaidm|afk has quit IRC | 22:03 | |
*** slittle1 has quit IRC | 22:17 | |
*** tosky has quit IRC | 22:30 | |
*** slittle1 has joined #opendev | 22:31 | |
*** mnaser has joined #opendev | 22:38 | |
*** mlavalle has quit IRC | 22:54 | |
*** slittle1 has quit IRC | 22:56 | |
*** slittle1 has joined #opendev | 22:58 | |
*** tkajinam has joined #opendev | 22:58 | |
kevinz | corvus: Not yet, please help to unlock it, thanks | 23:10
*** Dmitrii-Sh has quit IRC | 23:14 | |
*** Dmitrii-Sh has joined #opendev | 23:21 | |
*** DSpider has quit IRC | 23:35 | |
*** mnaser has quit IRC | 23:38 | |
*** mnaser|ic has quit IRC | 23:38 | |
*** mnaser|ic has joined #opendev | 23:38 | |
*** mnaser|ic has quit IRC | 23:38 | |
*** mnaser|ic has joined #opendev | 23:38 | |
*** mnaser|ic is now known as mnaser | 23:38 |