Nick | Message | Time |
---|---|---|
ianw | it would be. i'm waiting but i think there's a chance this reboot will not happen :/ | 00:00 |
clarkb | now I'm trying to figure out in my head what happens if mount fails on a specific disk (it should mount it ro or at least attempt to and finish the boot right?) | 00:01 |
ianw | ok, it looks like it's coming up, i'm probably too pessimistic :) | 00:02 |
clarkb | ya I think if it isn't where /boot and other important things hanging under / are mounted then it should come up | 00:03 |
clarkb | it may still have a sad, but should get sshd running | 00:03 |
ianw | ok, it's back | 00:04 |
clarkb | fwiw citynetwork pushed back the fix estimate to 0300CET :/ | 00:04 |
clarkb | I guess we should all ponder the static inventory a bit more then maybe plan to convert tomorrow or early next week? | 00:04 |
*** dave-mccowan has quit IRC | 00:08 | |
*** mgutehal_ has quit IRC | 00:09 | |
*** mgutehall has joined #openstack-infra | 00:09 | |
*** wolverineav has quit IRC | 00:09 | |
*** flaper87 has quit IRC | 00:09 | |
*** wolverineav has joined #openstack-infra | 00:10 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver https://review.openstack.org/535558 | 00:10 |
*** wolverineav has quit IRC | 00:12 | |
*** wolverineav has joined #openstack-infra | 00:13 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible https://review.openstack.org/616578 | 00:13 |
*** flaper87 has joined #openstack-infra | 00:14 | |
*** threestrands has joined #openstack-infra | 00:18 | |
openstackgerrit | Adam Coldrick proposed openstack-infra/storyboard master: Fix the stories relation in StoryTag https://review.openstack.org/621045 | 00:26 |
openstackgerrit | Adam Coldrick proposed openstack-infra/storyboard master: Add a popularity measurement to tags https://review.openstack.org/621046 | 00:26 |
*** tosky has quit IRC | 00:27 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Remove nodeid argument from updateNode https://review.openstack.org/621047 | 00:32 |
clarkb | hrm there are a bunch of db migration errors in cinder and nova | 00:33 |
clarkb | I wonder if oslo.db made a release | 00:33 |
ianw | #status log manual reboot of mirror01.nrt1.arm64ci.openstack.org after a lot of i/o failures | 00:34 |
openstackstatus | ianw: finished logging | 00:34 |
*** rh-jelabarre has joined #openstack-infra | 00:41 | |
*** yamamoto has joined #openstack-infra | 00:48 | |
*** hamzy_ has joined #openstack-infra | 01:07 | |
*** sthussey has quit IRC | 01:11 | |
*** verdurin has quit IRC | 01:33 | |
clarkb | apparently docker 1.12 and newer has a config option to prevent containers from restarting when you restart docker | 01:36 |
clarkb | that is nice | 01:36 |
*** verdurin has joined #openstack-infra | 01:38 | |
*** wolverineav has quit IRC | 01:38 | |
*** bhavikdbavishi has joined #openstack-infra | 01:43 | |
*** yamamoto has quit IRC | 01:49 | |
*** dave-mccowan has joined #openstack-infra | 01:50 | |
*** gyee has quit IRC | 01:52 | |
openstackgerrit | Merged openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 01:58 |
*** dklyle has joined #openstack-infra | 01:59 | |
*** dklyle has quit IRC | 02:05 | |
*** mriedem_afk has quit IRC | 02:19 | |
*** mrsoul has quit IRC | 02:22 | |
*** psachin has joined #openstack-infra | 02:39 | |
*** eernst_ has joined #openstack-infra | 02:45 | |
*** dave-mccowan has quit IRC | 02:46 | |
*** eernst_ has quit IRC | 02:50 | |
*** dave-mccowan has joined #openstack-infra | 02:56 | |
*** hongbin has joined #openstack-infra | 03:02 | |
*** dklyle has joined #openstack-infra | 03:11 | |
*** apetrich has quit IRC | 03:15 | |
*** dklyle has quit IRC | 03:17 | |
openstackgerrit | Merged openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201 | 03:18 |
*** ykarel|away has joined #openstack-infra | 03:22 | |
*** rlandy has quit IRC | 03:31 | |
*** dave-mccowan has quit IRC | 03:33 | |
*** dave-mccowan has joined #openstack-infra | 03:35 | |
*** psachin has quit IRC | 03:43 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add missing ws separator between words https://review.openstack.org/619169 | 03:53 |
*** hongbin has quit IRC | 03:56 | |
*** udesale has joined #openstack-infra | 04:01 | |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 04:03 |
*** janki has joined #openstack-infra | 04:03 | |
*** ramishra has joined #openstack-infra | 04:29 | |
*** dave-mccowan has quit IRC | 04:30 | |
*** markvoelker has quit IRC | 04:32 | |
*** eernst has joined #openstack-infra | 04:35 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: upload-logs-swift: Cleanup temporary directories https://review.openstack.org/592340 | 04:49 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: upload-logs-swift: Make indexer more generic https://review.openstack.org/592852 | 04:50 |
*** wolverineav has joined #openstack-infra | 04:52 | |
*** markvoelker has joined #openstack-infra | 05:02 | |
*** eernst has quit IRC | 05:04 | |
*** wolverineav has quit IRC | 05:15 | |
*** threestrands_ has joined #openstack-infra | 05:17 | |
*** threestrands has quit IRC | 05:20 | |
*** ykarel|away has quit IRC | 05:24 | |
*** pcaruana has quit IRC | 05:35 | |
*** ykarel|away has joined #openstack-infra | 05:39 | |
*** ykarel|away is now known as ykarel | 05:39 | |
*** yamamoto has joined #openstack-infra | 05:46 | |
*** yamamoto has quit IRC | 05:51 | |
openstackgerrit | Merged openstack-infra/zuul master: Add support for zones in executors https://review.openstack.org/549197 | 06:00 |
*** diablo_rojo has quit IRC | 06:12 | |
openstackgerrit | Kartikeya Jain proposed openstack/diskimage-builder master: Adding support for SLES 15 in element 'sles' https://review.openstack.org/619186 | 06:17 |
*** e0ne has joined #openstack-infra | 06:32 | |
*** flaper87 has quit IRC | 06:32 | |
*** quiquell|off is now known as quiquell | 07:03 | |
*** hamzy__ has joined #openstack-infra | 07:11 | |
*** hamzy_ has quit IRC | 07:12 | |
*** threestrands_ has quit IRC | 07:13 | |
*** apetrich has joined #openstack-infra | 07:16 | |
*** stakeda has joined #openstack-infra | 07:22 | |
*** pcaruana has joined #openstack-infra | 07:22 | |
*** e0ne has quit IRC | 07:31 | |
*** slaweq has joined #openstack-infra | 07:34 | |
*** florianf|afk is now known as florianf | 07:37 | |
*** kjackal has joined #openstack-infra | 07:39 | |
*** bhavikdbavishi has quit IRC | 07:42 | |
*** ykarel is now known as ykarel|lunch | 07:43 | |
openstackgerrit | Merged openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment https://review.openstack.org/620969 | 07:56 |
openstackgerrit | Merged openstack-infra/zuul master: Add gearman stats reference https://review.openstack.org/620192 | 07:57 |
*** jpena|off is now known as jpena | 08:01 | |
*** ginopc has joined #openstack-infra | 08:04 | |
*** bhavikdbavishi has joined #openstack-infra | 08:05 | |
*** jtomasek has joined #openstack-infra | 08:06 | |
*** rcernin has quit IRC | 08:06 | |
*** dpawlik has joined #openstack-infra | 08:07 | |
*** ralonsoh has joined #openstack-infra | 08:17 | |
*** aojea has joined #openstack-infra | 08:18 | |
*** roman_g has joined #openstack-infra | 08:22 | |
*** shardy has joined #openstack-infra | 08:23 | |
*** shardy has quit IRC | 08:24 | |
*** shardy has joined #openstack-infra | 08:24 | |
*** yamamoto has joined #openstack-infra | 08:30 | |
*** ykarel|lunch is now known as ykarel | 08:38 | |
*** xek has joined #openstack-infra | 08:38 | |
*** ginopc has quit IRC | 08:40 | |
*** tosky has joined #openstack-infra | 08:42 | |
*** dpawlik has quit IRC | 08:46 | |
*** ccamacho has joined #openstack-infra | 08:46 | |
*** jpich has joined #openstack-infra | 08:53 | |
*** kjackal has quit IRC | 08:58 | |
*** dpawlik has joined #openstack-infra | 09:20 | |
frickler | infra-root: does /etc/ansible/hosts/group_vars/all.yaml get edited manually on bridge? I wasn't aware that my acc had been commented out there, just curious when the amount of mail I was receiving was slowly decreasing. | 09:21 |
frickler | I edited it now with a different address that is hopefully going to be bouncing less things, please let me know if you see any | 09:22 |
frickler | Shrews: ^^ you are blocked there too, in case you were wondering | 09:23 |
*** dpawlik has quit IRC | 09:24 | |
*** e0ne has joined #openstack-infra | 09:27 | |
*** stakeda has quit IRC | 09:33 | |
*** derekh has joined #openstack-infra | 09:33 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.3 https://review.openstack.org/617218 | 09:38 |
*** gfidente has joined #openstack-infra | 09:40 | |
*** takamatsu has quit IRC | 09:41 | |
*** ginopc has joined #openstack-infra | 09:41 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Categorize missing /etc/heat/policy.json https://review.openstack.org/621128 | 10:01 |
*** olivierbourdon38 has joined #openstack-infra | 10:09 | |
*** tpsilva has joined #openstack-infra | 10:31 | |
*** slaweq has quit IRC | 10:32 | |
*** ramishra has quit IRC | 10:43 | |
*** ramishra has joined #openstack-infra | 10:43 | |
*** electrofelix has joined #openstack-infra | 10:44 | |
*** pcaruana has quit IRC | 10:44 | |
*** quite has left #openstack-infra | 10:46 | |
*** pcaruana has joined #openstack-infra | 10:50 | |
*** udesale has quit IRC | 10:59 | |
*** dtantsur|mtg is now known as dtantsur|afk | 11:00 | |
*** rfolco is now known as rfolco_doctor | 11:09 | |
*** bhavikdbavishi has quit IRC | 11:11 | |
*** ramishra has quit IRC | 11:20 | |
*** ramishra has joined #openstack-infra | 11:26 | |
*** hwoarang has quit IRC | 11:32 | |
*** slaweq has joined #openstack-infra | 11:51 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 11:52 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 12:07 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Remove nodeid argument from updateNode https://review.openstack.org/621047 | 12:11 |
*** pcaruana has quit IRC | 12:11 | |
*** hwoarang has joined #openstack-infra | 12:15 | |
*** lucasagomes is now known as lucas-hungry | 12:17 | |
*** lucas-hungry is now known as lucasagomes | 12:17 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306 | 12:24 |
*** slaweq has quit IRC | 12:24 | |
*** udesale has joined #openstack-infra | 12:26 | |
*** kjackal has joined #openstack-infra | 12:26 | |
*** rfolco_doctor is now known as rfolco | 12:34 | |
*** eharney has quit IRC | 12:34 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: devstack: Didn't find service registered by hostname after 60 seconds https://review.openstack.org/621150 | 12:34 |
*** Serhii_Rusin has joined #openstack-infra | 12:36 | |
*** Serhii_Rusin has quit IRC | 12:40 | |
*** jpena is now known as jpena|lunch | 12:41 | |
*** kjackal has quit IRC | 12:41 | |
*** xek has quit IRC | 12:46 | |
Shrews | frickler: yeah. i need to migrate away from gmail | 12:46 |
*** xek has joined #openstack-infra | 12:47 | |
*** e0ne has quit IRC | 12:54 | |
*** yamamoto has quit IRC | 12:59 | |
*** boden has joined #openstack-infra | 13:06 | |
*** zul has joined #openstack-infra | 13:06 | |
*** yamamoto has joined #openstack-infra | 13:15 | |
*** trown|outtypewww has quit IRC | 13:17 | |
*** trown|brb has joined #openstack-infra | 13:18 | |
*** dave-mccowan has joined #openstack-infra | 13:19 | |
pabelanger | frickler: that file should be under control of git, so you should be able to look at history | 13:26 |
pabelanger | if not, then I am not sure | 13:26 |
pabelanger | I don't believe we should have manually edited files on bridge.o.o | 13:26 |
*** annp has quit IRC | 13:29 | |
frickler | pabelanger: ah, local git repo, nice. /me can blame mordred now :-D | 13:29 |
pabelanger | frickler: cool, so any changes need to be committed to git there too | 13:32 |
frickler | pabelanger: did that for my earlier change now | 13:33 |
pabelanger | ack | 13:35 |
*** rlandy has joined #openstack-infra | 13:36 | |
*** e0ne has joined #openstack-infra | 13:37 | |
*** jpena|lunch is now known as jpena | 13:40 | |
*** sthussey has joined #openstack-infra | 13:40 | |
*** kgiusti has joined #openstack-infra | 13:45 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787 | 13:46 |
*** udesale has quit IRC | 13:48 | |
*** udesale has joined #openstack-infra | 13:49 | |
*** hwoarang has quit IRC | 13:54 | |
*** takamatsu has joined #openstack-infra | 13:55 | |
*** slaweq has joined #openstack-infra | 13:56 | |
*** jcoufal has joined #openstack-infra | 13:57 | |
*** EmilienM is now known as EvilienM | 13:58 | |
*** ykarel is now known as ykarel|away | 14:07 | |
*** roman_g has quit IRC | 14:08 | |
*** roman_g has joined #openstack-infra | 14:08 | |
*** hwoarang has joined #openstack-infra | 14:12 | |
*** mriedem has joined #openstack-infra | 14:19 | |
*** olivierbourdon38 has quit IRC | 14:20 | |
*** olivierbourdon38 has joined #openstack-infra | 14:20 | |
*** jamesmcarthur has joined #openstack-infra | 14:24 | |
ssbarnea|rover | clarkb: are you working on https://review.openstack.org/#/c/621038/1 ? i can fix it myself. | 14:25 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038 | 14:28 |
openstackgerrit | Merged openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota https://review.openstack.org/621040 | 14:29 |
openstackgerrit | Merged openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes https://review.openstack.org/621043 | 14:29 |
*** janki has quit IRC | 14:30 | |
pabelanger | clarkb: corvus: mordred: are we thinking of doing nodepool / zuul restarts this morning to pick up new noderequest logic? | 14:31 |
*** ykarel|away has quit IRC | 14:33 | |
*** lbragstad is now known as elbragstad | 14:36 | |
openstackgerrit | Merged openstack-infra/project-config master: Create airship-spyglass repo https://review.openstack.org/619493 | 14:37 |
*** bhavikdbavishi has joined #openstack-infra | 14:39 | |
*** eharney has joined #openstack-infra | 14:42 | |
openstackgerrit | Merged openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012 | 14:44 |
*** bhavikdbavishi has quit IRC | 14:47 | |
*** bhavikdbavishi has joined #openstack-infra | 14:47 | |
*** takamatsu has quit IRC | 14:48 | |
*** ykarel|away has joined #openstack-infra | 14:52 | |
*** ykarel|away is now known as ykarel | 14:52 | |
*** dayou_ has joined #openstack-infra | 14:59 | |
*** dayou has quit IRC | 15:00 | |
*** dayou_ has quit IRC | 15:10 | |
*** dayou_ has joined #openstack-infra | 15:11 | |
*** chandan_kumar is now known as chkumar|off | 15:14 | |
*** jcoufal has quit IRC | 15:16 | |
openstackgerrit | Merged openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 15:24 |
*** takamatsu has joined #openstack-infra | 15:29 | |
*** jamesmcarthur has quit IRC | 15:33 | |
*** bnemec is now known as beekneemech | 15:34 | |
*** roman_g has quit IRC | 15:36 | |
*** eharney has quit IRC | 15:36 | |
*** cdent has joined #openstack-infra | 15:37 | |
cdent | Can someone point me to a job that does a webhook post merge? Basically I want to trigger a build on dockerhub after a placement change merges | 15:38 |
*** ramishra has quit IRC | 15:38 | |
*** quiquell is now known as quiquell|off | 15:39 | |
pabelanger | cdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n1128 does it for readthedocs.org | 15:42 |
cdent | pabelanger: awesome, thanks | 15:42 |
pabelanger | np! | 15:42 |
*** dansmith is now known as SteelyDan | 15:43 | |
*** therve has joined #openstack-infra | 15:44 | |
therve | Hi | 15:44 |
therve | We're getting some issues in the heat gate for the past couple of days | 15:45 |
therve | It would seem there is an issue with ovh, is that a known problem? | 15:45 |
*** jamesmcarthur has joined #openstack-infra | 15:46 | |
corvus | pabelanger, clarkb: yes i'd like to restart things today | 15:48 |
*** yamamoto has quit IRC | 15:48 | |
*** jamesmcarthur has quit IRC | 15:49 | |
*** eernst has joined #openstack-infra | 15:49 | |
*** jamesmcarthur has joined #openstack-infra | 15:49 | |
*** sthussey has quit IRC | 15:50 | |
*** mriedem has quit IRC | 15:50 | |
*** eharney has joined #openstack-infra | 15:51 | |
*** munimeha1 has joined #openstack-infra | 15:51 | |
*** takamatsu has quit IRC | 15:53 | |
pabelanger | +1, happy to assist | 15:59 |
openstackgerrit | James E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194 | 15:59 |
mordred | cdent: fwiw, you could also just push images built in zuul to dockerhub | 16:00 |
cdent | mordred: yeah, I thought about that too, but the dockerfile being used is "mine", not an openstack thing, so I was exploring the options | 16:00 |
mordred | nod | 16:00 |
cdent | I think it is probably better to make the Dockerfile a real placement thing | 16:01 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195 | 16:01 |
cdent | mordred: can you point me to an example? | 16:01 |
mordred | it's all good either way - mostly just wanted to mention | 16:01 |
mordred | cdent: yup - one sec | 16:01 |
corvus | pabelanger, mordred: can you review those 2 changes ^ (since they require a scheduler restart, would be good to go ahead and get them in place) | 16:01 |
corvus | gah, bad parent | 16:02 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195 | 16:02 |
corvus | there we go | 16:02 |
mordred | cdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n1260 | 16:03 |
pabelanger | +2 | 16:03 |
cdent | thanks | 16:03 |
mordred | cdent: also - if you're just publishing images of placement, you might want to check out what the loci folks are doing | 16:03 |
mordred | corvus: +A to both | 16:04 |
cdent | mordred: yeah, on that too | 16:05 |
corvus | given what those changes are intended to do, i'm inclined to direct-enqueue them | 16:05 |
mordred | corvus: ++ | 16:05 |
cdent | mordred: what I'm trying to do here is incrementally unwind a bunch of the external side stuff I've built alongside placement before it was extracted | 16:06 |
mordred | cdent: that sounds like a very worthwhile thing to do | 16:06 |
mordred | cdent: while I'm pointing you at 20 things in addition to what you were asking about ... you should check out pbrx :) | 16:06 |
cdent | in this case the image I create is fairly purpose built for testing | 16:06 |
mordred | ah- yes, well in that case neither loci nor pbrx are going to be very helpful :) | 16:06 |
corvus | we're running a 3 for 1 special on answers here today :) | 16:07 |
cdent | pbrx looks pretty interesting, though | 16:07 |
cdent | mordred: the container in question is an example of "just how cdentish can I make this thing". In this case that means zero config in the container and as small as possible and uwsgi instead of apache, etc | 16:08 |
cdent | a lot of which I'm hoping to eventually get back to loci and kolla, but there's only so much time | 16:08 |
mordred | cdent: I'm very much in the 'no config in container' camp | 16:09 |
mordred | cdent: I think putting config into the containers defeats the whole benefit of the containers | 16:10 |
cdent | ayup | 16:10 |
cdent | it's this: https://github.com/cdent/placedock | 16:10 |
mordred | like - a container should contain a single process- and things like config should be mounted in or otherwise provided, and things like apache should be network proxies, etc | 16:10 |
cdent | yup, we sound of like minds | 16:11 |
cdent | (not all that surprising) | 16:11 |
mordred | cdent: if you haven't already, you might want to consider python:alpine as a base image - you can skip your pip and python install steps | 16:12 |
cdent | I've been back and forth on the image a few different times depending on what seems to be working that day | 16:12 |
mordred | :) | 16:12 |
mordred | tell me about it | 16:12 |
cdent | yesterday lots of stuff was not working, so gave up on alpine:edge | 16:12 |
mordred | we've been using python:alpine for the pbrx-built images and I'm fairly happy with it so far- there was one moment where something stopped working that we thought was alpine's fault, but I think it wound up being something else | 16:13 |
cdent | i'll add that to the list, thanks | 16:15 |
*** dayou_ has quit IRC | 16:19 | |
*** sthussey has joined #openstack-infra | 16:20 | |
*** mriedem has joined #openstack-infra | 16:21 | |
*** pcaruana has joined #openstack-infra | 16:26 | |
*** gyee has joined #openstack-infra | 16:27 | |
fungi | therve: did anyone get back to you on your "issues in the heat gate" specific to ovh yet? can you elaborate? is it a particular error or something general like slow performance leading to job timeouts? is it in both ovh regions (bhs1 and gra1) or only one? | 16:27 |
therve | fungi: No | 16:28 |
therve | fungi: It looks like slow performance | 16:28 |
therve | I haven't checked the region | 16:28 |
therve | http://logs.openstack.org/57/620457/3/check/heat-functional-convg-mysql-lbaasv2/934e143/ is a recent example | 16:29 |
therve | It spent 90mins trying to setup devstack (that's usually our whole runtime) | 16:29 |
*** boden has quit IRC | 16:34 | |
*** ccamacho has quit IRC | 16:35 | |
*** ccamacho has joined #openstack-infra | 16:35 | |
*** adriancz has quit IRC | 16:36 | |
clarkb | pabelanger: frickler: correct, it's a local git repo. You edit and commit in place | 16:36 |
fungi | therve: yeah, that looks like ovh-bhs1 which is where we've had a lot of reports of job timeouts. at first i thought it might be because of the 2:1 cpu oversubscription in our dedicated host aggregate for that region but temporarily halving the number of servers we were running didn't move the needle | 16:36 |
*** xek has quit IRC | 16:36 | |
clarkb | fungi: therve: I wonder if some of the hypervisors are not using virt? | 16:36 |
openstackgerrit | Monty Taylor proposed openstack-infra/project-config master: Create promstat project https://review.openstack.org/621225 | 16:37 |
openstackgerrit | Monty Taylor proposed openstack-infra/project-config master: Add promstat project to Zuul https://review.openstack.org/621226 | 16:37 |
corvus | 2018-11-30 16:27:30.828209 | primary | manifests/init.pp - WARNING: quoted boolean value found on line 35 | 16:37 |
*** yamamoto has joined #openstack-infra | 16:37 | |
corvus | okay, so, in puppet, if i want to pass the literal word "true" around.... what? just don't do it? | 16:37 |
fungi | therve: mnaser was just commenting in #openstack-tc about surprisingly slow swapfile creation/preallocation in ovh, so i'm starting to wonder if it could be attributed to disk i/o | 16:37 |
clarkb | fungi: oh interesting | 16:38 |
clarkb | fungi: that said you should totally use fallocate on ext4 and it shouldn't be slow even with bad disk io :P | 16:38 |
therve | fungi: Anecdotally I've found db migrations to be slow | 16:38 |
clarkb | mnaser: ^ | 16:38 |
mnaser | clarkb: http://logs.openstack.org/36/619636/1/gate/openstack-ansible-deploy-aio_metal-ubuntu-bionic/72c540f/logs/ara-report/result/f6ed9f8a-419a-41b8-8d81-19d6e5aac6cc/ | 16:38 |
*** jamesmcarthur has quit IRC | 16:38 | |
mnaser | i mean yes, but you cant fallocate on xfs so really we've only worked around the job | 16:39 |
corvus | therve, fungi, clarkb: wow, a bunch of zuul sql tests have started timing out recently; i wonder if they ran there | 16:39 |
mnaser | 5.9 MB/s disk write speed will be bad regardless | 16:39 |
mnaser | if you avoid it in swap, it'll bite you later somewhere else :) | 16:39 |
clarkb | mnaser: our centos7 instances are ext4 not xfs | 16:40 |
mnaser | oh shoot really | 16:40 |
mnaser | TIL | 16:40 |
mnaser | but anyways, still, 6 MB/s disk writes are painful :P | 16:40 |
clarkb | mnaser: yup not saying it fixes the underlying issue. Just pointing out that you likely want to avoid this cost anyway | 16:40 |
fungi | corvus: while we were running at max-servers=79 for both bhs1 and gra1 i ran the e-r logstash query for job timeouts and there were 20x as many (no joke) in bhs1 compared to gra1 | 16:40 |
mnaser | yeah well we decided to just drop swap entirely | 16:40 |
mnaser | because we rarely ever used it | 16:40 |
mnaser | it was like 1-2MB of swap with some 4.something gig of memory being used by caches/etc | 16:41 |
*** jamesmcarthur has joined #openstack-infra | 16:42 | |
openstackgerrit | James E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194 | 16:42 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195 | 16:42 |
corvus | pabelanger: mordred: ^ | 16:42 |
*** dayou_ has joined #openstack-infra | 16:42 | |
pabelanger | looking | 16:44 |
pabelanger | +3 | 16:44 |
clarkb | corvus: is zuul restarted to accept that? or we put config in place first then restart? | 16:44 |
*** bhavikdbavishi has quit IRC | 16:45 | |
openstackgerrit | Tobias Urdin proposed openstack-infra/system-config master: Mirror Stein on Ubuntu from Cloud Archive https://review.openstack.org/621231 | 16:45 |
clarkb | as for ovh IO, I know on my personal instance I seem capped at 1kiops so lots of really small writes perform poorly but large writes perform just as well as anything else. This seems more extreme than that though | 16:45 |
*** bhavikdbavishi has joined #openstack-infra | 16:45 | |
corvus | clarkb: config first then restart, since the scheduler needs a restart to see that, but will ignore extra options. | 16:46 |
fungi | mnaser: looks like the example linked above was also in bhs1, so i'm starting to wonder if disk is waaaay slower there than gra1. could explain the disproportionate number of job timeouts in bhs1 if so | 16:46 |
*** ginopc has quit IRC | 16:47 | |
pabelanger | corvus: clarkb: we also need to do nodepool-launcher too right? I can help with that if needed | 16:47 |
clarkb | pabelanger: for it to work yes, I think it will be a noop if zuul sets it until then (but otherwise work as is today) | 16:47 |
corvus | pabelanger: yeah. in fact, do you want to get started on restarting those now? | 16:48 |
corvus | or do we want to restart everything at once? | 16:48 |
*** yamamoto has quit IRC | 16:48 | |
pabelanger | I'm fine with both options | 16:48 |
*** udesale has quit IRC | 16:49 | |
corvus | i think launcher restarts are only minorly disruptive; i say go ahead and do 'em | 16:49 |
pabelanger | ok | 16:49 |
clarkb | corvus: ++ | 16:50 |
clarkb | usually I restart one and make sure its happy then do the others ~5 minutes later | 16:50 |
clarkb | perceived impact is incredibly low | 16:50 |
pabelanger | nl01 doesn't seem to have latest tip of nodepool master, let me see what is happening | 16:52 |
fungi | clarkb: i noticed yesterday you added static.o.o in the emergency disable list. is that still in progress or forgotten? | 16:53 |
clarkb | fungi: semi forgotten. I think we want to get dmsimard wsgi resource increases in place. But I don't think anyone else was reviewing those changes | 16:54 |
* clarkb digs them up | 16:54 | |
*** trown|brb is now known as regain | 16:55 | |
pabelanger | I think ansible on bridge.o.o is broken, but not 100%. I haven't interacted much with this server | 16:55 |
pabelanger | ERROR! Completely failed to parse inventory source /opt/system-config/inventory/openstack.yaml | 16:55 |
pabelanger | that is in /var/log/ansible/run_all_cron.log | 16:55 |
*** regain is now known as trown | 16:55 | |
pabelanger | and looks like 12 hours ago was the last run | 16:55 |
clarkb | fungi: https://review.openstack.org/#/c/616297/ | 16:55 |
*** trown is now known as trown|lunch | 16:56 | |
clarkb | pabelanger: I think that got fixed because my cert signing request went through (which depended on dns updates) | 16:56 |
corvus | that's the last message in the log | 16:56 |
pabelanger | okay, I might be looking in the wrong log | 16:56 |
clarkb | or maybe it stopped and then started again | 16:56 |
fungi | perhaps citycloud is still busted? | 16:57 |
corvus | clarkb, pabelanger, mordred: http://paste.openstack.org/show/736489/ | 16:57 |
pabelanger | thanks, that is what I am seeing | 16:57 |
pabelanger | there was a new release of ansible yesterday, could be related | 16:58 |
clarkb | oh interesting. I wonder if citycloud fixed, then we updated ansible then we broke | 16:58 |
clarkb | ya | 16:58 |
fungi | when it rains it pours | 16:58 |
*** jpena is now known as jpena|off | 16:59 | |
*** eharney has quit IRC | 16:59 | |
pabelanger | looking to see if ansible upgraded now | 16:59 |
corvus | ansible 2.7.0 | 16:59 |
Shrews | pabelanger: corvus: if we're restarting launchers, we'll need to watch those carefully. lots of rather big changes (mostly around caching) have gone in | 16:59 |
*** jpich has quit IRC | 17:00 | |
clarkb | fungi: I've discovered that nova takes over the dmi info for instances so you can't really tell if they are qemu or kvm (trying to double check that bhs1 isn't emulated as that could explain io as well as other slowness) | 17:00 |
clarkb | ok systemd-detect-virt says it is kvm proper | 17:02 |
clarkb | fungi: re static I think we should either approve dmsimard's change above or we can safely remove static from the emergency file then watch it for slowness | 17:02 |
clarkb | its safe either way, its just that we think adding the workers helped with the log downloading slowness that was seen a while back | 17:03 |
pabelanger | okay, 2.7.0 seems to be the version we are pinned to in system-config, so don't believe that has changed | 17:04 |
*** pcaruana has quit IRC | 17:04 | |
corvus | mordred: can we put in your static inventory stuff asap? | 17:05 |
*** aojea has quit IRC | 17:06 | |
fungi | clarkb: only a data point, but our mirror instances in ovh don't show a significant discrepancy in i/o performance. 124 MB/s in gra1 vs 131 MB/s in bhs1 when i tried the same dd as mnaser's job example. could be due to them having a different flavor, or not actually being in the same host aggregate, or maybe only some instances in bhs1 are hitting an i/o tarpit while others land on more performant disk | 17:07 |
clarkb | corvus: mordred fwiw I don't think the static inventory will fix this | 17:07 |
clarkb | this is a different error than what we had with citycloud (that was a url parsing failure due to http 502). This appears to be a python bug with a Proxy object in the openstack plugin not having a servers attribute | 17:07 |
clarkb | possibly openstacksdk updated and not ansible? | 17:07 |
corvus | clarkb: not using the openstack plugin as inventory *will* fix that problem | 17:07 |
corvus | clarkb: ansible has been broken for two days straight because of two different problems with the openstack inventory plugin. | 17:08 |
clarkb | corvus: it will work around it yes. Mostly pointing it out because mordred is maintainer for that plugin and all of that in ansible. It doesn't work right now. that is probably important info for mordred | 17:08 |
clarkb | for infra we can work around it | 17:08 |
pabelanger | I am not sure what yamlgroup plugin is, is that something we wrote? | 17:08 |
corvus | clarkb: i agree. i'm not the maintainer for that plugin. i just want our systems to work so i can go back to doing my work. | 17:09 |
clarkb | pabelanger: yes it is how we take our host listing and group them into groups | 17:09 |
pabelanger | okay, sorry, not up to speed on recent changes | 17:09 |
*** mriedem is now known as mriedem_lunch | 17:11 | |
pabelanger | could it be openstacksdk that is breaking here? there was a new release 19hrs ago | 17:11 |
corvus | pabelanger: almost certainly | 17:11 |
clarkb | pabelanger: yes that is my current hunch | 17:11 |
corvus | feel free to revert or whatever, but i'm working on switching to static inventory | 17:11 |
pabelanger | let me see how we install it | 17:12 |
clarkb | I think for us static inventory is a good move since we have had multiple issues with not static inventory in a couple days. I think for mordred and sdk and ansible it should be clear this isn't just a failing cloud anymore and there is a bug to fix there | 17:12 |
corvus | clarkb: agreed | 17:12 |
pabelanger | okay, so try rolling back to 0.19.0, confirm inventory works. Then, work towards static inventory | 17:14 |
*** eharney has joined #openstack-infra | 17:14 | |
clarkb | fungi: on a random ready xenial VM I hopped on in bhs1 3.6MB/s seems to be peak for write sizes of 128 bytes and 512 bytes >100MB total | 17:16 |
fungi | granted, if that node was actively running a job then you may not know how much other contention you have for activity on the same node | 17:17 |
clarkb | fungi: ya, though it seems consistent over multiple writes. We can always boot a node external to nodepool if we need better numbers. Mostly this seems to point that io is consistently slow when artificially tested with dd | 17:17 |
pabelanger | http://paste.openstack.org/show/736491/ | 17:18 |
pabelanger | that is when openstacksdk was upgraded, then ansible ran 1 more time afterwards | 17:18 |
fungi | clarkb: any chance you can repeat the same experiment on a random gra1 node for comparison? | 17:18 |
pabelanger | going to downgrade it manually now to confirm inventory works again | 17:18 |
clarkb | bumping up the bs to 2048 shows roughly the same throughput (3.4MB/s) | 17:19 |
clarkb | fungi: ya I can do that | 17:19 |
clarkb | that implies to me that this isn't iops caps we are running into. We should've seen difference in throughput with different block sizes (up to the disk block size iirc) if it were just iops | 17:20 |
fungi | if disk access is an order of magnitude slower for (at least some) instances in bhs1 than in gra1, we probably have our smoking gun for the difference in timeouts | 17:20 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Switch to static inventory https://review.openstack.org/621247 | 17:20 |
corvus | clarkb, pabelanger, fungi, mordred: ^ | 17:21 |
corvus | i quickly processed mordred's file from yesterday to include only the attributes he thought important | 17:21 |
*** ykarel is now known as ykarel|away | 17:22 | |
corvus | oh, i wonder if we need ansible_host ? | 17:22 |
*** eharney has quit IRC | 17:22 | |
*** shardy has quit IRC | 17:22 | |
pabelanger | corvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it | 17:22 |
clarkb | corvus: mordred already pushed that change yesterday | 17:22 |
pabelanger | looking at patch now | 17:22 |
* clarkb +2'd it but no one else reviewed it... | 17:23 | |
clarkb | https://review.openstack.org/#/c/621031/ if we want to use that change instead | 17:23 |
pabelanger | corvus: +2, how did you generate that by chance? | 17:23 |
corvus | clarkb: approved | 17:23 |
clarkb | fungi: ah ok so I had a minor derp in there I should've used /dev/zero. But ya bhs1 is 16.6MB/s at 512 bs and 9ish MB/s at 2048 bs vs >250MB/s on gra1 | 17:24 |
openstackgerrit | Merged openstack-infra/zuul master: Clarify executor zone documentation https://review.openstack.org/620989 | 17:24 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Disable openstack inventory plugin https://review.openstack.org/621247 | 17:25 |
corvus | clarkb, pabelanger, mordred: ^ that's the other thing from my change that wasn't in mordred's | 17:25 |
clarkb | pabelanger: mordred generated it by taking the ansible openstack plugin output (from the cache?) and filtering out all the extra data we don't need | 17:25 |
fungi | clarkb: thanks. ironically disk writes are ~20x faster on gra1 and jobs time out ~20x more often on bhs1 | 17:25 |
clarkb | corvus: oh interesting I wonder if the yaml vs yamlgroup ordering there is important | 17:26 |
corvus | we might have actually needed my change to get past today's bug. | 17:26 |
corvus | clarkb: i don't know, but i re-ordered it to match our actual order assuming it is | 17:26 |
clarkb | corvus: ya, also may need yaml to process before yamlgroup | 17:26 |
pabelanger | clarkb: ack, thanks | 17:26 |
*** efried is now known as fried_rolls | 17:27 | |
fungi | corvus: just to confirm, one of the tox-py36 job timeouts i ran across on a zuul change a few minutes ago was indeed in ovh-bhs1 as well. likely related | 17:27 |
pabelanger | so, until bridge.o.o runs again, ansible works with old config. Next run, openstacksdk will update, but think we are protected a little with cached version, which should be enough time for these new patches to land | 17:27 |
corvus | fungi: thanks; i checked a few too, though one of the sql-only timeouts wasn't bhs1 (it was limestone) | 17:27 |
clarkb | amorin: ^ fyi we think we have narrowed down the ovh-bhs1 issues to disk performance. Seeing writes for zeros to disk with dd on the order of 10-15MB/s on bhs1 when the same writes are >250MB/s on gra1 | 17:28 |
corvus | pabelanger: well, if we direct-enqueue the patches. otherwise we're going to spend 8 hours just getting to the point where we can begin the day's work. | 17:28 |
pabelanger | ++ | 17:28 |
corvus | so i will direct-enqueue the static inventory patches now | 17:28 |
clarkb | corvus: pabelanger I think we may want them to go in together so that the config update is in place the first time we run with the static inventory | 17:28 |
corvus | clarkb: i'll do both | 17:29 |
clarkb | (otherwise yamlgroup may not be able to group any hosts into groups and we'll still run ansible in an unproductive manner) | 17:29 |
fungi | i wonder if we should disable ovh-bhs1 for now. it's a 159-node hit to our capacity, but the random timeouts are probably resulting in rechecks and gate resets which waste even more than that | 17:29 |
clarkb | fungi: ya likely | 17:29 |
pabelanger | fungi: actually, 1 sec, can I check something in ovh | 17:29 |
fungi | pabelanger: check whatever you like | 17:30 |
clarkb | I'm going to dig up breakfast then will update things with the opendev cert info | 17:30 |
pabelanger | fungi: okay, thanks. Finished, I wanted to check to see if we had any leaked nodes there, doesn't look like it (unrelated to the current issue). | 17:31 |
pabelanger | +1 to disable if jobs are being affected | 17:31 |
*** weshay is now known as he_hates_me | 17:35 | |
openstackgerrit | Merged openstack-infra/puppet-zuul master: Add relative_priority scheduler option https://review.openstack.org/621194 | 17:35 |
openstackgerrit | Merged openstack-infra/system-config master: Enable relative_priority in zuul https://review.openstack.org/621195 | 17:35 |
*** he_hates_me is now known as weshay | 17:36 | |
*** kjackal has joined #openstack-infra | 17:36 | |
openstackgerrit | Merged openstack-infra/zuul master: Remove STATE_PENDING https://review.openstack.org/620284 | 17:38 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool https://review.openstack.org/621250 | 17:39 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Revert "Temporarily disable ovh-bhs1 in nodepool" https://review.openstack.org/621251 | 17:39 |
openstackgerrit | Merged openstack-infra/system-config master: Bump amount of mod_wsgi processes for static vhosts to 16 https://review.openstack.org/616297 | 17:42 |
*** eernst has quit IRC | 17:42 | |
*** eernst has joined #openstack-infra | 17:43 | |
*** eernst has quit IRC | 17:44 | |
*** eernst has joined #openstack-infra | 17:44 | |
*** e0ne has quit IRC | 17:45 | |
clarkb | corvus: fungi opendev cert and key and friends are all in the usual location on bridge. Do we have hiera/ansible var keys set up for that yet? | 17:47 |
corvus | clarkb: expected key names are here: https://review.openstack.org/620979 | 17:48 |
clarkb | thanks | 17:48 |
fungi | aha, i thought i'd already reviewed that but looks like i have not | 17:48 |
pabelanger | it is happening | 17:48 |
fungi | opendev happens | 17:49 |
openstackgerrit | Merged openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986 | 17:51 |
*** openstackgerrit has quit IRC | 17:51 | |
clarkb | corvus: ok should be in place under those keys in hiera | 17:54 |
corvus | ok approved | 17:55 |
corvus | clarkb: thanks | 17:55 |
*** hamerins has joined #openstack-infra | 17:55 | |
*** openstackgerrit has joined #openstack-infra | 17:56 | |
openstackgerrit | Merged openstack-infra/system-config master: Switch to a static inventory https://review.openstack.org/621031 | 17:56 |
corvus | second change is about 8 minutes out | 17:56 |
clarkb | should we disable the cron on bridge so it doesn't run without second change in ~3 minutes? | 17:57 |
clarkb | (I'm worried our group info won't be correct which could break firewall rules) | 17:57 |
corvus | clarkb: yes | 17:57 |
clarkb | corvus: are you doing that or should I? | 17:57 |
corvus | clarkb: you | 17:58 |
clarkb | #*/15 * * * * flock -n /var/run/ansible/run_all.lock bash /opt/system-config/run_all.sh -c >> /var/log/ansible/run_all_cron.log 2>&1 is in the crontab now | 17:59 |
clarkb | (I also commented out the cloud launcher script as I'm not sure if that one will be happy either) | 17:59 |
*** derekh has quit IRC | 18:00 | |
corvus | clarkb, pabelanger: i'm going to afk for 30m | 18:00 |
clarkb | hrm it seems like ansible was running before though? | 18:00 |
clarkb | pabelanger: did you downgrade sdk? | 18:01 |
clarkb | I didn't expect ansible to be working at all, but if someone downgraded the openstacksdk package that might explain it | 18:01 |
corvus | clarkb: i believe pabelanger said he did that | 18:02 |
clarkb | ok | 18:02 |
corvus | 17:22 < pabelanger> corvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it | 18:02 |
clarkb | thanks | 18:02 |
*** eernst has quit IRC | 18:02 | |
*** munimeha1 has quit IRC | 18:02 | |
* mordred poking to see if I can figure out what the actual issue is | 18:03 | |
*** gfidente has quit IRC | 18:04 | |
*** Swami has joined #openstack-infra | 18:09 | |
Shrews | mordred: i've been poking as well but not having much luck locally so far | 18:09 |
mordred | Shrews: I've got a script in /root/mttest.py on bridge.openstack.org that exhibits the issue - it can be run with /root/mtvenv/bin/python mttest.py if you wanna look at it | 18:10 |
mordred | something is unhappy with rackspace | 18:11 |
openstackgerrit | Merged openstack-infra/system-config master: Disable openstack inventory plugin https://review.openstack.org/621247 | 18:11 |
clarkb | mordred: ^ can you double check that does what we want before we reenable the ansible cron on bridge? | 18:12 |
clarkb | mordred: rax had their old compute api in the catalog for a long time is sdk trying to make sense of it? | 18:14 |
mordred | maybe? | 18:14 |
*** jamesmcarthur has quit IRC | 18:15 | |
pabelanger | clarkb: corvus: yes, downgrade was the fix | 18:15 |
mordred | oh. my. dear. god | 18:16 |
*** ykarel|away has quit IRC | 18:17 | |
*** wolverineav has joined #openstack-infra | 18:17 | |
* fungi can't wait to hear this one | 18:20 | |
fungi | such suspense | 18:20 |
mordred | well - discovery is incorrectly finding the project id as the version when doing discovery | 18:20 |
mordred | so is then not matching version 610275 against v2 | 18:21 |
mordred | ultimately it's because rackspace blocks access to the discovery document so we have to fallback to parsing the URL | 18:21 |
mordred | which SHOULD have a provision for stripping the trailing project id from the endpoint: https://dfw.servers.api.rackspacecloud.com/v2/610275 | 18:21 |
mordred | but for some reason is not doing that | 18:21 |
amorin | clarkb: that's weird (the bhs1 io issue) | 18:22 |
amorin | I am currently at home and cant test that right now | 18:23 |
amorin | can it wait monday? | 18:23 |
clarkb | amorin: yes I think you should enjoy your weekend | 18:23 |
clarkb | and thank you for checking in! | 18:24 |
amorin | :p | 18:24 |
amorin | that's still weird, we are supposed to have same hardware config with SSD disk | 18:24 |
Shrews | mordred: my that's fun | 18:24 |
amorin | but anyway, I am writing that up in my checklist for monday morning | 18:24 |
clarkb | amorin: thank you! | 18:24 |
fungi | amorin: awesome. i was hesitant to reach out to you until monday anyway since i know it's late already there | 18:25 |
fungi | we think this has been going on at least a week, possibly several, could even have started before the summit | 18:26 |
fungi | so a couple more days aren't going to hurt | 18:26 |
*** diablo_rojo has joined #openstack-infra | 18:29 | |
*** kjackal has quit IRC | 18:30 | |
*** jamesmcarthur has joined #openstack-infra | 18:31 | |
pabelanger | clarkb: corvus: Shrews: okay, seems bridge.o.o got out a pulse and updated nodepool-launcher. Will hold off on restarting until everybody is back | 18:32 |
clarkb | pabelanger: we may also want to prioritize the various inflight tasks. There is nodepool restarts, zuul restart, static inventory switch (and maybe others) | 18:33 |
pabelanger | clarkb: agree, holding off for now | 18:34 |
corvus | clarkb, pabelanger: here | 18:36 |
pabelanger | also here | 18:36 |
clarkb | me three | 18:37 |
corvus | clarkb, pabelanger: so next we should manually update system-config, then uncomment the cron and watch it run? | 18:37 |
clarkb | corvus: ya I think that is a sane next step. Do we want mordred to review the changes to the ansible config too? (since mordred probably groks yamlgroup and friends best?) | 18:37 |
pabelanger | wfm | 18:37 |
mordred | Shrews: ok. I have a 'fix' | 18:38 |
mordred | Shrews: but I'm not 100% sure of the 'right' way to do it | 18:38 |
corvus | mordred: can you review https://review.openstack.org/621247 or should we just muddle along? | 18:38 |
openstackgerrit | Merged openstack-infra/system-config master: Serve opendev.org website from files.o.o https://review.openstack.org/620979 | 18:38 |
mordred | corvus: yes - that looks good | 18:39 |
corvus | mordred: thx | 18:39 |
clarkb | corvus: pabelanger also maybe after updating system-config but before uncommenting cron we run the command to list group membership (I forget what it is) and do a kick.sh? | 18:39 |
corvus | clarkb, pabelanger: system-config is updated. | 18:39 |
pabelanger | +1 for kick.sh | 18:40 |
clarkb | I'm looking up the group listing command now | 18:40 |
corvus | google says: ansible localhost -m debug -a 'var=groups' | 18:40 |
mordred | Shrews: remote: https://review.openstack.org/621257 WIP Fix version discovery for rackspace public cloud is the hacky version | 18:40 |
Shrews | mordred: looking | 18:40 |
pabelanger | do we need to delete the inventory cache, or was removing yamlgroup enough? | 18:40 |
corvus | the output of that lgtm | 18:40 |
clarkb | corvus: ansible-playbook --list-hosts zookeeper | 18:40 |
mordred | Shrews: tl;dr - rackspace project ids are integers, so they parse as version numbers | 18:40 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Retire the interop-wg mailing list https://review.openstack.org/619056 | 18:40 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Shut down openstack general, dev, ops and sigs mls https://review.openstack.org/621258 | 18:40 |
Shrews | mordred: vomit | 18:40 |
*** bhavikdbavishi has quit IRC | 18:41 | |
mordred | so when discovery doesn't work there and we fall back to parsing the version from the url, we do so by popping url segments to see if something is a version | 18:41 |
mordred | and we match the project id | 18:41 |
clarkb | hrm that command isn't quite what I expected | 18:41 |
*** ralonsoh has quit IRC | 18:41 | |
mordred | Shrews: I *think* what we actually want to do is pass the project_id along into that function (we do a similar thing in other places) so we can do a test of "does this url segment match the current project id" | 18:41 |
mordred | but that'll take a few seconds to sort out | 18:42 |
corvus | clarkb: i don't know the answer to your question about caching | 18:42 |
corvus | pabelanger: er i mean yours | 18:42 |
Shrews | mordred: if we have that info, yes, i agree | 18:42 |
clarkb | ok I want ansible --list-hosts $group not ansible-playbook | 18:42 |
clarkb | checking a few groups they look correct to me | 18:43 |
pabelanger | corvus: /var/cache/ansible/inventory should we also delete that now, I wasn't sure if removing yamlgroup was enough | 18:43 |
pabelanger | or maybe that is a mordred question | 18:43 |
corvus | i'll delete it | 18:43 |
mordred | we can delete /var/cache/ansible/inventory - but it shouldn't be being touched by anything | 18:43 |
mordred | so it should be a no-op | 18:43 |
pabelanger | ack | 18:44 |
clarkb | so ya things look good to me. I think next step is a kick.sh then turn cron back on | 18:44 |
corvus | deleted. after doing that ansible localhost -m debug -a 'var=groups' still looks good | 18:44 |
pabelanger | ++ | 18:44 |
clarkb | ansible --list-hosts $group still looking good too | 18:44 |
corvus | i'll let someone else kick | 18:45 |
clarkb | I'll kick.sh the logstash server since that has firewall rules that matter but if we break them impact is low | 18:45 |
clarkb | that is running now (though I didn't start screen) | 18:46 |
pabelanger | tail -f /var/log/ansible/ansible.log is what I am watching | 18:47 |
clarkb | TASK [base-server : Set ssh key for managment] and TASK [puppet-install : Remove server] changed in the base server playbook | 18:47 |
clarkb | the second I'm not worried about. The first one is a little odd | 18:48 |
clarkb | and then puppet run says changed in the puppet run. But overall this looks sane | 18:48 |
clarkb | ya the puppet changes are fine | 18:48 |
pabelanger | agree, ansible looks to have run properly | 18:49 |
clarkb | and authorized keys don't look wrong | 18:49 |
clarkb | unless there is another server or group people want to run against first I think we can reenable the cron | 18:49 |
corvus | the authorized_keys file contents are the same as on another host | 18:49 |
Shrews | mordred: hrm, we could determine it by place in the url (when split out by '/') too | 18:51 |
mordred | we could - but most of the time there is no project_id there | 18:51 |
Shrews | i think? not sure how standard that url is | 18:51 |
clarkb | pabelanger: corvus: should I reenable the two cron jobs now? I'll wait until I get other confirmation | 18:51 |
mordred | nova made it optional a few years ago | 18:51 |
mordred | and it's now pretty much never there | 18:51 |
corvus | clarkb: ++ | 18:51 |
pabelanger | clarkb: ++ | 18:52 |
clarkb | done | 18:52 |
clarkb | next run for both is top of the hour | 18:52 |
pabelanger | k, going to fresh coffee | 18:52 |
* Shrews afk for 15m | 18:57 | |
pabelanger | back | 18:58 |
*** boden has joined #openstack-infra | 18:59 | |
*** electrofelix has quit IRC | 18:59 | |
clarkb | I'm watching tail of the run all log file | 18:59 |
clarkb | running now | 19:00 |
clarkb | base server is running now | 19:01 |
clarkb | the thing to watch for here is iptables imo | 19:01 |
clarkb | and if that is happy I think we are good | 19:02 |
*** wolverineav has quit IRC | 19:02 | |
openstackgerrit | Jeremy Stanley proposed openstack-dev/cookiecutter master: Update contact address to openstack-discuss ML https://review.openstack.org/621266 | 19:04 |
openstackgerrit | Chris Dent proposed openstack-infra/project-config master: Set placement's gate queue to integrated https://review.openstack.org/621267 | 19:05 |
*** olivierbourdon38 has quit IRC | 19:05 | |
*** wolverineav has joined #openstack-infra | 19:06 | |
openstackgerrit | Jeremy Stanley proposed openstack-dev/specs-cookiecutter master: Update contact address to openstack-discuss ML https://review.openstack.org/621269 | 19:09 |
mnaser | fungi: glad my random digging into a timeout has resulted in finding a timeout root cause D: | 19:10 |
mordred | infra-root: https://review.openstack.org/621257 is a keystoneauth patch to fix the discovery issue. the openstacksdk update exposed the issue, which has been lurking there all along | 19:10 |
fungi | mnaser: well, we don't know the *root* cause yet, but yes your report of slow disk writes in ovh was a huuuuge help in turning the corner on that ongoing investigation. thanks!!! | 19:10 |
mnaser | everyone wins D: | 19:11 |
clarkb | ansible still clean from where I'm sitting | 19:11 |
mordred | if we want, we can protect ourselves from this issue between now and a ksa release by putting in compute_endpoint_override settings in our clouds.yaml - or we can ping openstacksdk on the nodepool launchers | 19:11 |
fungi | mnaser: "my jobs run slow and don't complete within the timeout" was a lot harder to track down | 19:11 |
mordred | s/ping/pin/ | 19:11 |
corvus | mordred: oh, so we had best not restart the launchers without doing one of those, eh? | 19:12 |
mordred | corvus: yes. it would be bad - we'd lose all rackspace quota | 19:12 |
corvus | mordred: maybe we should put the pin in nodepool? | 19:13 |
tobiash | ++ | 19:13 |
*** mriedem_lunch is now known as mriedem | 19:13 | |
corvus | mordred: maybe you can propose that change since you know what numbers to type? | 19:13 |
mordred | corvus: yup. on it | 19:14 |
clarkb | iptable files being installed now. Lots of Oks and no changed so far | 19:15 |
clarkb | so that looks good | 19:15 |
kmalloc | clarkb, fungi: confirming. gerrit does openID not OIDC (as far as I can tell) | 19:16 |
fungi | kmalloc: i think it's just openid still, yes | 19:17 |
kmalloc | bahg | 19:17 |
kmalloc | ok no problem | 19:17 |
clarkb | iptables has completed and I see nothing amiss there. I think this is working as expected | 19:17 |
fungi | kmalloc: on a related note, have you seen lemonldap-ng? | 19:17 |
kmalloc | working to ensure i have a working example SP for the ipsilon thing | 19:17 |
kmalloc | fungi: no, looking now | 19:17 |
fungi | https://lemonldap-ng.org/ | 19:17 |
clarkb | it will also do oauth? whatever google does | 19:17 |
corvus | kmalloc: but related is https://github.com/davido/gerrit-oauth-provider | 19:17 |
kmalloc | oh neat | 19:18 |
fungi | seems to have come out of a project to create a full sso suite for the french government | 19:18 |
kmalloc | corvus: yeah i see | 19:18 |
kmalloc | corvus: i saw that* | 19:18 |
kmalloc | fungi: that is pretty darn cool | 19:18 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool master: Block 0.19.0 of openstacksdk https://review.openstack.org/621272 | 19:18 |
fungi | kmalloc: it just got packaged in debian this week, which is how it came to my attention | 19:18 |
fungi | and it came up in discussions about replacement sso for debian.org | 19:18 |
* kmalloc nods | 19:19 | |
Shrews | mordred: did that show up in 0.19 or 0.20? | 19:20 |
clarkb | now on to git and gerrit servers. I'm going to stop staring at it like a hawk now that base is done and seems to be happy. Also git and gerrit seems to actually just be git and gerrit as expected | 19:20 |
mordred | Shrews: gah. am I stupid? | 19:20 |
* Shrews refrains | 19:20 | |
*** jamesmcarthur has quit IRC | 19:20 | |
clarkb | pabelanger: ^ see also here :P | 19:21 |
kmalloc | fungi: I'll poke at that thing as well. It is super interesting. | 19:21 |
*** jamesmcarthur has joined #openstack-infra | 19:21 | |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool master: Block 0.20.0 of openstacksdk https://review.openstack.org/621272 | 19:21 |
mordred | Shrews: thanks | 19:21 |
fungi | kmalloc: yeah, i haven't delved deeply yet, but it looked like it could be relevant to our situation | 19:21 |
mriedem | clarkb: fwiw i'm seeing e-r commenting on failed changes again | 19:24 |
mriedem | did you tickle it into submission? | 19:24 |
clarkb | mriedem: fungi restarted it to reset the backlog timeout queue | 19:24 |
mriedem | ah | 19:24 |
clarkb | mriedem: I also pushed a change to try and address it directly in the code by tying the timeout to the event timestamp and not current time | 19:25 |
fungi | also +2'd clarkb's proposed solution | 19:25 |
fungi | but it could stand some more thorough review | 19:25 |
mriedem | i saw but haven't reviewed that yet, | 19:25 |
mriedem | will tab queue it | 19:25 |
clarkb | so once we get behind we won't wait for the whole timeout we'll just check once if the files are there, if yes yay if not move on | 19:25 |
fungi | i'm not as familiar with that codebase as my +2 permissions might imply | 19:25 |
mriedem | maybe al calavicci will let mtreinish come back to help look for a sec too | 19:26 |
*** jamesmcarthur has quit IRC | 19:28 | |
*** jamesmcarthur has joined #openstack-infra | 19:28 | |
corvus | al calavicci has an impressively detailed wikipedia bio | 19:29 |
*** jamesmcarthur_ has joined #openstack-infra | 19:29 | |
clarkb | corvus: pabelanger I've not seen any unexpected behavior from inventory change. I think we can continue with nodepool/zuul work as soon as the sdk thing is handled | 19:32 |
pabelanger | agree | 19:32 |
fungi | some of stockwell's best work, though to me he'll always be wilbur whatley from the dunwich horror | 19:33 |
*** jamesmcarthur has quit IRC | 19:33 | |
corvus | handled means that nodepool change lands and we verify that we've downgraded sdk | 19:33 |
corvus | calavicci and cavil are strikingly similar names | 19:33 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376 | 19:34 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038 | 19:34 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739 | 19:35 |
mtreinish | mriedem: link? | 19:36 |
clarkb | mtreinish: https://review.openstack.org/#/c/621038/ | 19:37 |
*** wolverineav has quit IRC | 19:37 | |
*** wolverineav has joined #openstack-infra | 19:41 | |
*** wolverineav has quit IRC | 19:42 | |
*** wolverineav has joined #openstack-infra | 19:42 | |
*** wolverineav has quit IRC | 19:42 | |
*** wolverineav has joined #openstack-infra | 19:43 | |
*** hamerins has quit IRC | 19:43 | |
*** wolverineav has quit IRC | 19:43 | |
*** wolverineav has joined #openstack-infra | 19:43 | |
*** wolverineav has quit IRC | 19:44 | |
clarkb | https://www.opendev.org/ and https://opendev.org/ have working ssl now. Just need to add content there | 19:44 |
*** hamerins has joined #openstack-infra | 19:44 | |
clarkb | corvus: ^ fyi and thanks! | 19:45 |
pabelanger | awesome! | 19:45 |
fungi | content schmontent | 19:45 |
*** wolverineav has joined #openstack-infra | 19:45 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376 | 19:49 |
mtreinish | mriedem, clarkb: +2 | 19:50 |
mordred | clarkb, fungi, corvus: if you have a sec, mind popping some +A on https://review.openstack.org/#/c/621225/ and https://review.openstack.org/#/c/621226 ? | 19:51 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376 | 19:52 |
*** wolverineav has quit IRC | 19:52 | |
clarkb | mordred: I'm kind of surprised such a thing doesn't exist yet | 19:52 |
*** wolverineav has joined #openstack-infra | 19:53 | |
mordred | right? | 19:53 |
mordred | from what I can tell, most people focus on using one or the other in their code and then using an exporter or translation service | 19:53 |
mordred | clarkb: initial code sketch is here: https://review.openstack.org/620990 | 19:53 |
corvus | they're vastly different in terms of how you model data; so you need a real abstraction layer if you're going to try to do both | 19:54 |
mordred | yah | 19:54 |
clarkb | mordred: ya in that space and tracing too it seems everyone uses one thing and then has a ton of very specific tooling built around that | 19:54 |
mordred | yup | 19:54 |
mordred | which is great if you're writing a service to run in one and only one place | 19:55 |
corvus | (and even so, the abstraction layer mordred proposes largely helps you think about both at the same time. you still can't just forget about one or the other) | 19:55 |
corvus | at least, that's how it registers in my brain | 19:55 |
mordred | corvus: yah, that's the idea | 19:55 |
mordred | "this is a metric I want to collect. this is how it goes to prometheus. this is how it goes to statsd" | 19:56 |
* mordred waves hands | 19:56 | |
Shrews | mordred: if you add a new thing, do you have to change the project name? :) | 19:56 |
clarkb | promstatcchi | 19:56 |
mordred | hah | 19:57 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739 | 19:58 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support https://review.openstack.org/529376 | 19:59 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS https://review.openstack.org/528739 | 19:59 |
pabelanger | clarkb: corvus: mordred: given the issues around bridge.o.o today, and tailing ansible.log where did the topic of running ara web on bridge end up? Do people see that actually live on bridge.o.o or some other server hosting the web for that? I know we talked about putting that into trove also. | 19:59 |
*** wolverineav has quit IRC | 19:59 | |
dmsimard | pabelanger: last I know we were waiting because bridge.o.o needed to be rebuilt on a larger instance or something along those lines | 20:00 |
clarkb | I think we were going to run it with mostly the plan as is, then if we ever rebuild the bridge we can move it all internal | 20:00 |
clarkb | fungi and ianw were far more on top of that than me though iirc | 20:00 |
pabelanger | okay, yah. I might be able to find a little time to help with that. I was struggling a little today to look at ansible logs on bridge | 20:01 |
fungi | the idea was we could run it on trove temporarily but when we get around to enlarging bridge.o.o we should make sure it's big enough to host its own database for that instead | 20:01 |
pabelanger | assuming we did an audit of ARA for the public | 20:01 |
fungi | s/enlarging/rebuilding/ | 20:01 |
dmsimard | just 2 cores and 2gb of ram on bridge.o.o indeed | 20:02 |
dmsimard | is there anything stopping us from resizing it to a larger flavor ? | 20:02 |
corvus | it's not actually large enough to do what it's already doing | 20:03 |
fungi | dmsimard: mainly the lack of a "resize" option in that cloud provider is what's stopping us | 20:03 |
*** rh-jelabarre has quit IRC | 20:03 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306 | 20:03 |
fungi | dmsimard: we need to rebuild it on a larger flavor instead, and i gather it might not be 100% under configuration management yet? | 20:03 |
pabelanger | is that because we think ARA web will need more resources? | 20:04 |
fungi | or maybe it is by now but nobody's tried to build a new one yet outside integration tests | 20:04 |
clarkb | fungi: ya the base server -> running ansible isn't managed yet? mordred is that the case? | 20:04 |
dmsimard | pabelanger: it's not well sized to begin with | 20:04 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats https://review.openstack.org/616306 | 20:04 |
corvus | clarkb: well, the tests certainly suggest it's mostly there :) | 20:04 |
pabelanger | I'm unsure what issues we are hitting today | 20:05 |
*** rh-jelabarre has joined #openstack-infra | 20:05 | |
corvus | bootstrapping it may still be a bit tricky. we also need to copy all the non-managed stuff over (passwords, secrets, etc) | 20:05 |
clarkb | pabelanger: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=65004&rra_id=all I think biggest concern is memory | 20:05 |
clarkb | pabelanger: there are spikes that make it appear we should have more breathing room. Particularly to run a mysql | 20:06 |
*** eharney has joined #openstack-infra | 20:07 | |
pabelanger | ack | 20:07 |
openstackgerrit | Merged openstack-infra/nodepool master: Block 0.20.0 of openstacksdk https://review.openstack.org/621272 | 20:08 |
corvus | i'm going to grab lunch now; feel free to restart launchers after ^ is applied. | 20:09 |
clarkb | I should get lunch too | 20:09 |
pabelanger | clarkb: corvus: okay, I'll start with nl01 first | 20:09 |
pabelanger | and confirm working | 20:09 |
pabelanger | fungi: clarkb: have you seen the zoom flaw making the rounds? I think some of the openstack foundation staff are using it for meetings: https://threatpost.com/critical-zoom-flaw-lets-hackers-hijack-conference-meetings/139489/ | 20:11 |
*** jistr has quit IRC | 20:11 | |
clarkb | I hadn't but that's neat | 20:11 |
clarkb | mrhillsman: ^ uses a bunch of zoom too | 20:11 |
pabelanger | a public exploit in the wild also | 20:11 |
fungi | nifty | 20:11 |
fungi | i've only ever dialled into it from a telephone line, fwiw | 20:12 |
pabelanger | same, figured I'd share in case you wanted to pass along to presenters | 20:12 |
clarkb | pabelanger: I've done so, thanks! | 20:12 |
fungi | yep, thanks! | 20:12 |
*** jistr has joined #openstack-infra | 20:13 | |
*** jamesmcarthur_ has quit IRC | 20:14 | |
*** jamesmcarthur has joined #openstack-infra | 20:15 | |
pabelanger | clarkb: corvus: 621272 has landed on nl01, and confirmed openstacksdk 0.19.0 is installed | 20:18 |
pabelanger | going to proceed with restart in a moment | 20:18 |
mordred | woot! | 20:19 |
*** xek has joined #openstack-infra | 20:19 | |
mordred | clarkb: yeah - what corvus said earlier - bridge itself is *mostly* managed, but the secrets and passwords file is manual | 20:20 |
*** jamesmcarthur has quit IRC | 20:20 | |
mordred | so the bootstrap is ... fun | 20:20 |
pabelanger | restarted | 20:20 |
pabelanger | but there looks to be an issue | 20:20 |
pabelanger | kubernetes.config.config_exception.ConfigException: Invalid kube-config file. Expected object with name default in kube-config/contexts list | 20:20 |
pabelanger | guess we landed a nodepool change? | 20:20 |
tobiash | pabelanger: either that or you landed a kube config | 20:21 |
mordred | pabelanger: yah - we have a test kubernetes in vexxhost and a kube config for it | 20:21 |
openstackgerrit | Merged openstack-infra/project-config master: Create promstat project https://review.openstack.org/621225 | 20:21 |
openstackgerrit | Merged openstack-infra/project-config master: Add promstat project to Zuul https://review.openstack.org/621226 | 20:21 |
kmalloc | fungi: so looking at lemonldap-ng. it's cool. it is also very very very very perl-ism focused. | 20:21 |
kmalloc | fungi: we might be able to use it for our case over ipsilon | 20:21 |
fungi | yeah, i saw it was perlish | 20:22 |
kmalloc | fungi: it's URL-regex and blob-o-json configured | 20:22 |
fungi | ahh | 20:22 |
kmalloc | so it's per URL, it's a little weird. | 20:22 |
mordred | pabelanger: https://review.openstack.org/#/c/620756/ didn't land yet | 20:22 |
Ng | really? another -ng project? :( | 20:22 |
kmalloc | but i can see it's very feature rich | 20:22 |
kmalloc | Ng: yep. | 20:22 |
* mordred waves to the Ng | 20:22 | |
Ng | dangit | 20:22 |
fungi | Ng: i thought you wrote all of those | 20:22 |
Ng | hey mordred | 20:22 |
pabelanger | mordred: yah, I think this is something with default config, I don't see anything in nodepool.yaml | 20:22 |
mordred | pabelanger: https://review.openstack.org/#/c/620755/ added the kube config | 20:22 |
kmalloc | that said, we can totally do something more usable / api driven | 20:23 |
Ng | fdegir: haha | 20:23 |
kmalloc | long term | 20:23 |
pabelanger | oh | 20:23 |
kmalloc | i'll consider ipsilon vs lemonldap for the transition stuff | 20:23 |
fungi | kmalloc: yep, as a canned thing i thought it might at least be an interesting alternative which is actually actively maintained still | 20:23 |
kmalloc | since we have minimal things that need to be protected. | 20:23 |
kmalloc | exactly | 20:23 |
pabelanger | mordred: http://paste.openstack.org/show/736524/ is traceback | 20:23 |
kmalloc | though the concept / mission of ipsilon maps more directly to what we do (even unmaintained) | 20:24 |
Shrews | oh weird. i just liked one of Ng's twitter posts today. crazy coincidence | 20:24 |
pabelanger | mordred: I am going to try and remove the file manually for now to see if nodepool-launcher starts | 20:24 |
*** eernst has joined #openstack-infra | 20:24 | |
Ng | Shrews: oh yeah, I saw that. I think I'd forgotten that IFTTT was posting those for me ;) | 20:24 |
pabelanger | mordred: okay, renaming it to config.bak causes nodepool-launcher to start properly | 20:25 |
Shrews | pabelanger: i see an issue | 20:28 |
Shrews | pabelanger: related to tobiash's nodecachelistener | 20:28 |
*** eernst has quit IRC | 20:29 | |
pabelanger | Shrews: yes, was just about to say it doesn't look happy | 20:29 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Remove updating stats debug log https://review.openstack.org/621283 | 20:29 |
Shrews | http://paste.openstack.org/show/736525/ | 20:29 |
Shrews | tobiash: ^^ | 20:29 |
pabelanger | yah, we haven't launched a new node yet since restarting | 20:30 |
pabelanger | I believe we should roll back to the previous version that was running | 20:30 |
*** eernst has joined #openstack-infra | 20:30 | |
tobiash | Shrews: is that exception recurring? | 20:31 |
pabelanger | okay, we appear to be launching nodes now: http://grafana.openstack.org/dashboard/db/nodepool-rackspace | 20:32 |
mordred | corvus: ping just in case you didn't see the k8s config related traceback | 20:32 |
*** wolverineav has joined #openstack-infra | 20:33 | |
*** slaweq has quit IRC | 20:34 | |
pabelanger | 2018-11-30 20:33:05,859 INFO nodepool.driver.NodeRequestHandler[nl01-13954-PoolWorker.rax-ord-main]: Not enough quota remaining to satisfy request 200-0000583773 | 20:34 |
pabelanger | I am unsure if that is correct or not, looking at grafana, I do see space to launch more nodes | 20:34 |
Shrews | tobiash: i'm not seeing more atm | 20:34 |
*** eernst has quit IRC | 20:35 | |
pabelanger | so far, we haven't launched a new node in rax-ord / rax-iad since the restart, but rax-dfw is now bringing nodes online | 20:36 |
tobiash | Shrews: maybe we should catch exceptions there and log a warning/error together with the event data and path | 20:36 |
tobiash | Shrews: I think this might be an event we actually don't want to process | 20:37 |
*** eernst has joined #openstack-infra | 20:37 | |
tobiash | pabelanger: do you have mode context around the quota log? | 20:37 |
tobiash | s/mode/more | 20:38 |
pabelanger | tobiash: not yet, still looking into logs | 20:38 |
Shrews | tobiash: you need the numbers? | 20:39 |
pabelanger | tobiash: for example: http://paste.openstack.org/show/736526/ | 20:39 |
openstackgerrit | Merged openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool https://review.openstack.org/621250 | 20:40 |
*** eernst has quit IRC | 20:41 | |
tobiash | pabelanger, Shrews: there seems to be something off in the quota calculation | 20:41 |
tobiash | pabelanger, Shrews: maybe one of the patches that landed today | 20:41 |
clarkb | one of corvus' changes modified how we account for nodes outside of nodepool | 20:42 |
clarkb | (to handle leaks better) | 20:42 |
mrhillsman | thx clarkb | 20:42 |
clarkb | (I'm not really here yet still finishing up lunch) | 20:42 |
tobiash | Shrews, pabelanger: probably this one: https://review.openstack.org/621040 | 20:42 |
tobiash | that would explain that we get predicted quota of -50 when launching one node | 20:42 |
*** eernst has joined #openstack-infra | 20:43 | |
*** hwoarang has quit IRC | 20:44 | |
pabelanger | hmm, let me check if we have leaked nodes | 20:44 |
Shrews | tobiash: pabelanger: corvus: hrm, we should be using zk.getNodes() instead of the nodeIterator in that change | 20:45 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123 https://review.openstack.org/621285 | 20:45 |
openstack | bug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/1806123 | 20:45 |
tobiash | Shrews: I think the negation here is wrong: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n197 | 20:45 |
Shrews | but that's unrelated to the quota thing | 20:46 |
Shrews | (my suggestion, that is) | 20:46 |
*** yamamoto has joined #openstack-infra | 20:46 | |
tobiash | Shrews: no, if that's wrong, it increases the unmanaged usage | 20:46 |
tobiash | Shrews: so nodepool thinks it has -50 instances left and the rest is blocked with instances it doesn't manage | 20:47 |
pabelanger | well, rax-ord does have a few leaked nodes | 20:47 |
pabelanger | but, unsure why nodepool isn't deleting them | 20:47 |
*** eernst has quit IRC | 20:48 | |
pabelanger | for example | 20:48 |
pabelanger | http://paste.openstack.org/show/736527/ | 20:48 |
*** eernst has joined #openstack-infra | 20:49 | |
pabelanger | that one is missing metadata | 20:49 |
*** openstackgerrit has quit IRC | 20:50 | |
*** yamamoto has quit IRC | 20:50 | |
*** openstackgerrit has joined #openstack-infra | 20:50 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation https://review.openstack.org/621286 | 20:50 |
*** hwoarang has joined #openstack-infra | 20:51 | |
*** fried_rolls is now known as fried_rice | 20:51 | |
fungi | okay, i'm taking a break from mailing list combining work to go obtain sustenance. will return asap | 20:51 |
tobiash | Shrews, pabelanger: I think this should fix it ^ | 20:51 |
tobiash | corvus: ^ | 20:53 |
pabelanger | oh | 20:53 |
*** takamatsu has joined #openstack-infra | 20:53 | |
*** eernst has quit IRC | 20:53 | |
tobiash | we also could revert 621040, fix that and re-propose it with a test case | 20:54 |
Shrews | tobiash: corvus: pabelanger: if the instance isn't getting the meta properties set correctly, nodepool will never delete it. | 20:55 |
pabelanger | so we have 2 issues right now with nl01: the quota calculation is off, and the kube config is broken | 20:55 |
Shrews | https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n474 | 20:55 |
pabelanger | Shrews: yah, I don't know why that is right now. For now, I'm going to manually clean them up after current issue is addressed | 20:55 |
pabelanger | the missing metadata I mean | 20:55 |
Shrews | pabelanger: you'll probably want to back off the upgrade for now | 20:57 |
mordred | pabelanger: "Expected object with name default in kube-config/contexts list" seems to indicate something is unhappy with the "current-context: default" line | 20:57 |
pabelanger | Shrews: yah, I can stop and roll back | 20:57 |
pabelanger | let me check which version the others are running | 20:57 |
corvus | back/catching up | 20:59 |
clarkb | I'm back now too | 21:00 |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Update the current-context to valid context https://review.openstack.org/621287 | 21:01 |
mordred | corvus, pabelanger: ^^ I **think** that should fix the kube config error | 21:01 |
mordred | but I'm also sort of just shooting at pickles in a barrel | 21:01 |
mordred | that was based on reading https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/ fwiw | 21:01 |
pabelanger | Shrews: corvus: clarkb: I think we are safe to revert back to 3.3.1, but waiting for somebody to confirm | 21:02 |
pabelanger | I also have to duck out for 10mins to wait for school bus, but will be back online to assist | 21:02 |
clarkb | pabelanger: ya I don't think any of the changes required the updates to the zk schema | 21:02 |
clarkb | so we should be safe to go back to older version | 21:02 |
clarkb | pabelanger: maybe just do that manually by hand on nl01 then we get tobiash's fix in and try again later today or monday | 21:03 |
pabelanger | clarkb: okay, can I pass the revert step to you now? | 21:03 |
clarkb | pabelanger: on the server or getting the code changes in gerrit side? I can do either but want to make sure I know what you are doing too :) | 21:03 |
clarkb | *either or both | 21:04 |
pabelanger | clarkb: yes, the revert of nl01.o.o to 3.3.1. | 21:04 |
pabelanger | I can do it, but will need about 10mins | 21:04 |
corvus | regarding the current-context -- that's really disappointing that the kubernetes lib requires that even though we don't use it. maybe we should default a bogus default context -- because default-context makes no sense in our multi-cloud case. | 21:04 |
pabelanger | on server, I should say | 21:04 |
tobiash | corvus: sounds valid | 21:05 |
clarkb | pabelanger: I'm ssh'ing in now | 21:05 |
pabelanger | okay, afk for 10mins... | 21:05 |
mordred | corvus: I think we can set it to '' | 21:05 |
*** kjackal has joined #openstack-infra | 21:06 | |
clarkb | puppet just ran so I have ~30 minutes to get this done. | 21:06 |
clarkb | I am pip installing 3.3.1 now | 21:06 |
mordred | corvus: which seems to be a valid option based on that document | 21:06 |
clarkb | that sound right to everyone? | 21:06 |
clarkb | based on nl02 that looks right to me | 21:07 |
clarkb | hahahaha ok | 21:07 |
clarkb | we don't have the openstacksdk exclusion on 3.3.1 | 21:08 |
corvus | i will direct-enqueue both of those changes | 21:08 |
clarkb | so I've got to manually install sdk afterwards | 21:08 |
clarkb | (just a note for anyone else trying to do similar later) | 21:08 |
*** rlandy has quit IRC | 21:08 | |
*** wolverineav has quit IRC | 21:08 | |
*** hongbin has joined #openstack-infra | 21:08 | |
clarkb | nl01 restarted running nodepool==3.3.1 and openstacksdk==0.19.0 | 21:09 |
corvus | both changes are in gate | 21:10 |
hongbin | hi folks, i have a patch to modify the zuul job: https://review.openstack.org/#/c/619642/ but consistently getting 'RETRY_LIMIT' error for some jobs, would like some help resolving it | 21:10 |
clarkb | hongbin: RETRY_LIMIT happens when the job fails in the job pre run stage | 21:10 |
hongbin | clarkb: i see, any idea about how to get the logs to see what is wrong? | 21:11 |
clarkb | hongbin: failures in pre-run are retried up to three times until we report back RETRY_LIMIT. The reason for this is that we expect those things to always pass as they shouldn't test code, just be setup | 21:11 |
tobiash | clarkb: with finger links that fails very early | 21:11 |
clarkb | tobiash: ya interesting | 21:11 |
clarkb | hongbin: in this case I think this means it's failing very early so we don't get logs. One way to debug that is to catch them with the streaming logs from the status page while they happen | 21:12 |
clarkb | another is we can go look in the zuul logs and see what that tells us | 21:12 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126 https://review.openstack.org/621288 | 21:12 |
corvus | it can mean that the post playbook fails (possibly because the host is hosed) | 21:12 |
openstack | bug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/1806126 | 21:12 |
tobiash | clarkb: what we did to improve that is to add a base-logs job as a parent to the job 'base' which only contains the log upload post playbook | 21:12 |
corvus | tobiash: you don't need to use inheritance for that; post-playbooks are separable | 21:13 |
clarkb | corvus: considering that this is making changes to networking in devstack that wouldn't surprise me | 21:13 |
tobiash | corvus: does it now run all post playbooks even if the pre playbook of the same job failed? | 21:13 |
corvus | tobiash: so as long as the last post playbook in base collects the logs, it will run regardless of anything before it | 21:13 |
corvus | tobiash: i think so | 21:14 |
tobiash | ah cool | 21:14 |
hongbin | ok, let me recheck it and try to catch the streaming logs, thanks for the hint | 21:14 |
clarkb | hongbin: it might help us debug, if we understand what you are trying to do with all of those api extensions | 21:15 |
AJaeger_ | fungi, I see you updating openstack-discuss - want to review https://review.openstack.org/619216 to handle infra-manual, please? | 21:15 |
clarkb | hongbin: my best guess at this point is one of those extensions results in either a broken network stack or firewall rules that prevent zuul from talking to the test node | 21:15 |
corvus | tobiash: yeah, just double checked. run won't run, but post will. | 21:15 |
tobiash | corvus: cool, thx | 21:15 |
AJaeger_ | hongbin: make a smaller change first - only change YAML with the goal to do exactly the same setup as before for a single job. And then iterate on it | 21:16 |
hongbin | clarkb: i have two lists of extensions, and in the zuul job config, combine those two lists and write the result to the devstack config file | 21:16 |
tobiash | maybe that has changed when we separated it, or I just misunderstood that | 21:16 |
clarkb | hongbin: right, but what do those extensions do? | 21:16 |
hongbin | those are just two lists of strings in yaml | 21:17 |
clarkb | hongbin: but they change devstack behavior somehow right? | 21:17 |
clarkb | (and that changes neutron's behavior) | 21:17 |
hongbin | the neutron behavior is not changed, since we should pass exactly the same list to devstack | 21:18 |
AJaeger_ | hongbin: so, I would do it as follows: 1) Use yaml anchors and create exactly same job as today. 2) Update the values | 21:18 |
pabelanger | and back, sorry about that. #dadops | 21:18 |
pabelanger | clarkb: thanks for reverting | 21:18 |
clarkb | pabelanger: no worries. want to check on nl01? I think it should be running again | 21:18 |
pabelanger | sure | 21:19 |
AJaeger_ | hongbin: oh, I see - that list is long and you copy it over | 21:19 |
AJaeger_ | hongbin: so, looks like something is wrong in that setup. | 21:19 |
pabelanger | clarkb: grafana.o.o looks much better | 21:19 |
hongbin | NETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}" | 21:19 |
hongbin | AJaeger_: yes, possibly | 21:20 |
clarkb | ya I wasn't sure that would work when it was first suggested because this side of the config is in zuul not in ansible | 21:20 |
clarkb | zuul doesn't jinja2 | 21:20 |
hongbin | for jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions | join(',') }}" , it succeeded | 21:21 |
hongbin | for jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}" , it failed | 21:21 |
hongbin | so i guess the usage of "+" to combine lists won't work? | 21:21 |
pabelanger | corvus: clarkb: tobiash: Shrews: do we have an idea what is happening with http://paste.openstack.org/show/736525/ | 21:22 |
logan- | hongbin: "{{ (network_api_extensions_common + network_api_extensions_tempest) | join(',') }}" | 21:22 |
hongbin | logan-: ack, let me try that | 21:22 |
clarkb | pabelanger: looks like the string was '' and not valid json | 21:22 |
hongbin | logan-: thanks for the advice | 21:22 |
clarkb | pabelanger: I don't know why though | 21:23 |
pabelanger | clarkb: that was the only other thing I noticed on startup | 21:23 |
logan- | clarkb: yep, it works, because zuul just dumps the uninterpreted jinja into the job inventory, which ansible then consumes and interprets :) | 21:24 |
*** kgiusti has left #openstack-infra | 21:25 | |
clarkb | pabelanger: I think we want to look for any cases we might write an empty string to zk | 21:25 |
corvus | pabelanger, clarkb, tobiash: we could look in zk to see if there are any node records with empty data | 21:26 |
clarkb | but I haven't followed any of the nodepool cache stuff | 21:26 |
clarkb | so this is all new to me | 21:26 |
tobiash | clarkb, pabelanger: we should catch this exception and log better data in this case, maybe it's just an event we don't want to process but didn't filter correctly | 21:26 |
pabelanger | corvus: good idea, since this happened on startup, possible existing node-requests are missing data for some reason? | 21:28 |
pabelanger | tobiash: +1 | 21:28 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292 | 21:28 |
corvus | tobiash, clarkb, pabelanger: ^ | 21:28 |
clarkb | tobiash: these act as watches on the "filesystem" tree? and we reconcile our local datastructures when they change? making sure I understand the basics here | 21:28 |
corvus | clarkb: yep | 21:29 |
tobiash | pabelanger: it's nodes, not node-requests in the exception | 21:29 |
clarkb | then we don't need to read from zk every time we need data we just trust the cache is up to date because it is being reconciled. Got it | 21:29 |
corvus | clarkb: correct. we *do* read an extra time after we lock nodes, just to make sure everything is in sync. | 21:29 |
pabelanger | tobiash: thanks! | 21:29 |
corvus | i've direct-enqueued that too | 21:30 |
clarkb | reading the code we treat node added and node updated the same. Wouldn't surprise me if we create an empty node then write to it later | 21:31 |
clarkb | but the listener then races its cache update to the update happening? | 21:31 |
clarkb | though I thought storeNode would write json with keys just empty values if there is no data | 21:32 |
clarkb | so maybe this isn't that | 21:32 |
clarkb | in any case logs ++ | 21:32 |
pabelanger | and rax looks back at capacity now | 21:33 |
corvus | i'm hoping the event structure converts to strings in a useful way | 21:33 |
corvus | (ie, i hope it looks like "<Event path:foo/bar data:...>" | 21:33 |
openstackgerrit | Merged openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation https://review.openstack.org/621286 | 21:34 |
*** kjackal has quit IRC | 21:34 | |
clarkb | corvus: and hopefully ADDED/UPDATED/DELETED types are included too | 21:34 |
corvus | oh, heh, it's a subclass of tuple | 21:35 |
corvus | https://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/cache.html#TreeEvent | 21:35 |
corvus | so we should be able to see what we need to | 21:35 |
*** wolverineav has joined #openstack-infra | 21:35 | |
openstackgerrit | Merged openstack-infra/system-config master: Update the current-context to valid context https://review.openstack.org/621287 | 21:35 |
corvus | so next question is, what's the format of event_data | 21:36 |
tobiash | NodeData inherits from tuple | 21:38 |
*** kjackal has joined #openstack-infra | 21:38 | |
corvus | the log is the only outstanding fix now | 21:38 |
corvus | it's probably worth waiting for before we attempt another restart | 21:38 |
tobiash | yes | 21:38 |
clarkb | corvus: ya I think so. we'll just end up restarting it again to debug that anyway (likely) | 21:39 |
clarkb | also suggestion: we use nl04 for next restart since we've disabled bhs1 anyway. That will have low impact | 21:39 |
pabelanger | ++ | 21:39 |
tobiash | now that I searched for it I see the same exceptions in my log | 21:39 |
openstackgerrit | Monty Taylor proposed openstack-infra/project-config master: Update promstat to use storyboard https://review.openstack.org/621293 | 21:40 |
corvus | drat, pep8 failure | 21:47 |
*** boden has quit IRC | 21:47 | |
clarkb | ah the exception indentation? | 21:47 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292 | 21:48 |
corvus | clarkb: nope, a legit thing. omitted a _ | 21:48 |
clarkb | oh ya diff shows that. | 21:48 |
clarkb | I single approved it if you want to reenqueue | 21:48 |
clarkb | (it's a trivial fix) | 21:49 |
corvus | done | 21:49 |
clarkb | (fwiw I'm really excited about the relative priority stuff) | 21:49 |
corvus | me too! | 21:50 |
corvus | combine that with splitting zuul into its own tenant, and we should be able to merge changes at least this fast with no admin access needed | 21:50 |
*** tpsilva has quit IRC | 21:51 | |
tobiash | corvus, clarkb: 2018-11-30 21:47:58,587 ERROR nodepool.zk.ZooKeeper: Exception in node cache update for event: (0, ('/nodepool/nodes/0000235218', b'', ZnodeStat(czxid=8628450713, mzxid=8628450713, ctime=1543562070559, mtime=1543562070559, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=0, numChildren=1, pzxid=8628450714))) | 21:51 |
corvus | (because we can drop the clean-check pipeline requirement) | 21:51 |
tobiash | so we get empty data from some nodes | 21:51 |
tobiash | checking now if this is a cache setup thing or really like this in zk | 21:51 |
corvus | 0 is node_added | 21:51 |
clarkb | tobiash: any idea what the 0 there is? is that the type? | 21:51 |
clarkb | oh cool so my hunch was maybe right? | 21:51 |
tobiash | 0 is the event type | 21:51 |
clarkb | in that case we probably want to say if type == added and data or type == updated and data | 21:52 |
clarkb | or similar there | 21:52 |
tobiash | 0 means node added | 21:52 |
tobiash | confirmed, that node is really empty in zk | 21:53 |
tobiash | no idea yet what this means | 21:53 |
corvus | looking at storeNode, there should be no case where we set empty data | 21:53 |
pabelanger | corvus: drop clean-check pipeline requirement? Could you explain more? | 21:54 |
clarkb | corvus: tobiash could it be that creating and writing data in zk is not atomic? | 21:54 |
clarkb | corvus: tobiash so as events go we get one first for the node creation then for the update? | 21:54 |
tobiash | clarkb: no, I restarted nodepool with the patch and it read the nodes that were already in the system | 21:54 |
corvus | pabelanger: in the good old days, you used to be able to approve a change and have it gated before check results arrived. i believe that's a way better way to run a system, if you can trust people to behave. we can not, in openstack, so we require check results before approving in gate. | 21:55 |
pabelanger | corvus: interesting | 21:56 |
pabelanger | looking forward to seeing it in action :) | 21:57 |
corvus | it's like all these direct-enqueues i'm doing, but anyone can do them | 21:57 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038 | 21:57 |
pabelanger | corvus: yah | 21:57 |
pabelanger | cool | 21:57 |
*** wolverineav has quit IRC | 21:59 | |
clarkb | tobiash: so maybe in this case we can just ignore those events? | 21:59 |
clarkb | as they aren't currently processed and lack useful info/data | 21:59 |
tobiash | yes | 21:59 |
corvus | tobiash: but if the node is still empty in zk... what does that mean? | 22:00 |
tobiash | corvus, clarkb: I just looked into our zk data and we have many such nodes | 22:00 |
corvus | if it were as clarkb suggests: a two-phase process -- create then write, you would expect to have data in there by the time you got around to looking at it | 22:00 |
corvus | but to still have an empty string is puzzling | 22:01 |
tobiash | and judging by the numbers really old ones | 22:01 |
corvus | tobiash: any mention of that node id in logs? | 22:01 |
tobiash | corvus: I need to find such a node that is still in my logs | 22:01 |
corvus | i'm setting up zk_shell so i can poke too | 22:02 |
clarkb | fwiw our node count is in the 15k range which is about where it was when I monitored the new zk cluster transition | 22:03 |
clarkb | I don't think we are leaking nodes at an appreciable rate | 22:03 |
corvus | we do have lots of old nodes with empty data | 22:03 |
corvus | latest node id is 0000844533. but 0000470788 exists and is empty | 22:04 |
clarkb | huh | 22:04 |
*** wolverineav has joined #openstack-infra | 22:04 | |
corvus | 0000819605 is recent-ish and empty | 22:05 |
tobiash | corvus: http://paste.openstack.org/show/736529/ | 22:06 |
tobiash | corvus: nothing special here | 22:06 |
clarkb | could it be a quorum thing? | 22:06 |
corvus | tobiash: ditto: http://paste.openstack.org/show/736530/ | 22:06 |
clarkb | I would expect that to be fairly quick cleanup though not long term | 22:07 |
clarkb | corvus: maybe check by connecting to other zk nodes to see if they have the same empty structure? | 22:07 |
corvus | clarkb: zk01 and zk02 report the same | 22:08 |
tobiash | corvus: is znode deletion recursive? | 22:08 |
corvus | tobiash: i don't know -- are you noting that there is a lock file under it too? | 22:08 |
tobiash | corvus: the node I looked at still has an empty lock child node | 22:08 |
corvus | this seems like a very likely possibility | 22:09 |
clarkb | would that imply the node is still locked? nodepool list should show us that info | 22:09 |
clarkb | | 0000844533 | inap-mtl01 | ubuntu-xenial | 8067566c-765c-4165-a90e-0adf29e80f1b | 198.72.124.158 | | in-use | 00:00:05:31 | locked | | 22:10 |
clarkb | seems like yes | 22:10 |
tobiash | hrm, we do a recursive delete | 22:11 |
clarkb | er wait I got the wrong node to grep for there | 22:11 |
clarkb | ugh friday | 22:11 |
clarkb | 0000819605 is what we want to check for | 22:11 |
clarkb | that one does not show up | 22:12 |
*** wolverineav has quit IRC | 22:12 | |
tobiash | clarkb: yes, because nodepool has an if data: clause when getting the node | 22:12 |
clarkb | ha | 22:13 |
*** wolverineav has joined #openstack-infra | 22:13 | |
tobiash | so nodepool hid that before the caching patches | 22:13 |
clarkb | ya | 22:14 |
corvus | someone still holds the lock on 0000819605 | 22:14 |
clarkb | do the lockers drop an id of some sort on the node to indicate who has it? | 22:14 |
corvus | it's /nodepool/launchers/nl03-13009-PoolWorker.vexxhost-sjc1-main | 22:14 |
clarkb | fungi: mriedem fyi I'm going to restart elastic-recheck bot now and we can see if my change works | 22:15 |
tobiash | at least on my leaked node there is no lock anymore, but still the empty lock child | 22:16 |
clarkb | mriedem: fungi nevermind Nov 30 22:13:30 status puppet-user[17804]: (/Stage[main]/Elastic_recheck/Exec[install_elastic-recheck]/returns) ImportError: No module named docutils.core | 22:17 |
corvus | tobiash: how did you determine there's no lock? | 22:17 |
clarkb | that prevents us from installing my change | 22:17 |
tobiash | dump with grep | 22:17 |
corvus | tobiash: hrm. that's what i did. | 22:17 |
tobiash | or is that not enough? | 22:17 |
openstackgerrit | Merged openstack-infra/nodepool master: Log exceptions in cache listener events https://review.openstack.org/621292 | 22:17 |
corvus | tobiash: that should be enough | 22:17 |
*** hwoarang has quit IRC | 22:18 | |
*** wolverineav has quit IRC | 22:18 | |
corvus | tobiash: i see: http://paste.openstack.org/show/736531/ | 22:19 |
*** wolverineav has joined #openstack-infra | 22:19 | |
tobiash | corvus: ok, so if there is a lock the lock child has another child | 22:19 |
tobiash | that's not the case in my example, but it might have been the case earlier | 22:20 |
tobiash | I restarted my launcher to pickup the logging change so the launcher cannot have any lock now | 22:20 |
tobiash | (I have only one launcher atm) | 22:21 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Log exceptions deleting ZK nodes https://review.openstack.org/621301 | 22:21 |
corvus | tobiash: ah, so the ephemeral lock may just turn into an empty dir after restarts | 22:22 |
clarkb | https://pagure.io/python-daemon/issue/18 <- that is really frustrating | 22:22 |
tobiash | yes | 22:22 |
corvus | tobiash, clarkb: there's a code path where we can end up without an exception report; see change ^ | 22:22 |
tobiash | +2 | 22:22 |
clarkb | corvus: looking | 22:22 |
clarkb | corvus: looking at your paste, that was a dump of ephemeral nodes in zk? and the two nodes being under the same item means that that launcher holds that lock? Seems like maybe we want to record that info in the lock itself? | 22:23 |
corvus | tobiash, clarkb: i think we've established this bug as "annoying but mostly harmless" and can probably proceed with restarts and debug this in parallel... | 22:24 |
clarkb | corvus: I agree | 22:24 |
tobiash | ++ | 22:24 |
pabelanger | ++ | 22:24 |
corvus | clarkb: your understanding of the ephemeral nodes stuff is correct, but i don't want to mess with the kazoo lock recipe; this is workable enough i think. | 22:25 |
corvus | (*any more -- i've already undertaken one improvement to kazoo locks) | 22:25 |
clarkb | corvus: ok, this might be a thing we add to operational docs then | 22:25 |
clarkb | (this is how you find who owns a lock type thing) | 22:26 |
corvus | clarkb: as long as we bury it deep, deep inside the nodepool developer docs. i've been fighting very hard to avoid the perception that a nodepool operator needs to know anything about zk. this is something only a nodepool dev should ever have to do, and only when nodepool is broke. | 22:27 |
clarkb | corvus: thats fair. Things like why did zuul leak this lock etc | 22:27 |
corvus | pabelanger: you want to go ahead and restart a launcher? | 22:28 |
*** wolverineav has quit IRC | 22:28 | |
clarkb | nl04 is my suggestion | 22:28 |
pabelanger | I can yes, nl04 this time? | 22:28 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Don't update caches with empty zNodes https://review.openstack.org/621305 | 22:28 |
clarkb | and double check they have the fix for the quota calculations installed | 22:28 |
pabelanger | ack, checking now | 22:29 |
pabelanger | (nl04) | 22:29 |
*** wolverineav has joined #openstack-infra | 22:29 | |
pabelanger | f8d20d6 is current version of nodepool | 22:30 |
pabelanger | I think we want the next 2 right | 22:30 |
corvus | pabelanger: f8d20d6 is okay | 22:30 |
pabelanger | k | 22:30 |
clarkb | ya we don't need those logs in place since tobiash was helpful and got them for us | 22:31 |
pabelanger | okay, restarting nl04 now | 22:31 |
corvus | yep, and next debug step there is changes that haven't merged yet | 22:31 |
pabelanger | nodepool-launcher started | 22:32 |
pabelanger | more exceptions at beginning, expected (json) | 22:32 |
tobiash | I have around 200 of these leaked znodes | 22:33 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123 https://review.openstack.org/621285 | 22:33 |
openstack | bug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/1806123 | 22:33 |
pabelanger | I can see we are launching nodes in ovh-gra21 | 22:33 |
pabelanger | ovh-gra1* | 22:33 |
clarkb | fungi mriedem I think newer pip may fix this? I'm going to try updating pip on that host to see | 22:35 |
pabelanger | clarkb: corvus: tobiash: nl04 looks to be working as expected now | 22:35 |
tobiash | :) | 22:35 |
pabelanger | I no longer see no quota | 22:36 |
pabelanger | and confirmed we've launched a few nodes | 22:36 |
*** slaweq has joined #openstack-infra | 22:36 | |
pabelanger | I think that means we can proceed with other nodepool-launchers, waiting for confirmation | 22:37 |
clarkb | mriedem: fungi ok that didn't change it, but everything else should be the same between local system and remote. So I'm confused | 22:37 |
clarkb | pabelanger: I usually like to wait long enough for deletes to happen if they don't happen on startup | 22:37 |
pabelanger | clarkb: sure, let me confirm | 22:37 |
clarkb | pabelanger: basically make sure we handle create and delete, but ya if both those look good I think you can restart the others | 22:37 |
*** xek has quit IRC | 22:37 | |
clarkb | mriedem: fungi maybe a setuptools thing, checking that next | 22:38 |
clarkb | that was it | 22:39 |
clarkb | restarting elastic-recheck now | 22:39 |
pabelanger | 2018-11-30 22:39:14,760 DEBUG nodepool.StatsWorker: Updating stats | 22:39 |
pabelanger | that doesn't really seem helpful | 22:39 |
pabelanger | and looks newish | 22:39 |
tobiash | pabelanger: https://review.openstack.org/621283 | 22:40 |
pabelanger | there we go, just deleted some nodes | 22:40 |
pabelanger | tobiash: +3, thanks | 22:40 |
pabelanger | clarkb: corvus: okay, I think we are good to proceed to other launchers | 22:41 |
corvus | pabelanger: ++ | 22:41 |
tobiash | that was useful during development | 22:41 |
clarkb | pabelanger: go for it | 22:41 |
mriedem | clarkb: i knew you could do it | 22:42 |
*** slaweq has quit IRC | 22:43 | |
clarkb | mriedem: heh, in any case can you keep your eye out for it leaving comments/reports on irc? | 22:43 |
pabelanger | okay, all launchers have been restarted | 22:43 |
mriedem | clarkb: not sure which channel those are in, except qa i guess | 22:43 |
pabelanger | looking at logs now to confirm things are working properly | 22:43 |
*** hwoarang has joined #openstack-infra | 22:44 | |
pabelanger | corvus: clarkb: okay, I don't see any obvious issues. We are still creating / deleting nodes properly now | 22:45 |
corvus | ready for the zuul part of this? | 22:45 |
clarkb | corvus: I am if you are | 22:45 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126 https://review.openstack.org/621288 | 22:46 |
openstack | bug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/1806126 | 22:46 |
pabelanger | ++ | 22:46 |
tobiash | corvus: just found this in my new nodepool: http://paste.openstack.org/show/736532/ | 22:46 |
tobiash | corvus: related to the openstacksdk release? | 22:46 |
*** hamerins has quit IRC | 22:46 | |
tobiash | the flavor used to be a munch, now it seems to be a dict | 22:46 |
tobiash | mordred: ^ | 22:46 |
clarkb | tobiash: is that with sdk 0.19.0? | 22:46 |
clarkb | tobiash: or 0.20.0? | 22:47 |
pabelanger | tobiash: we blocked 0.20.0 openstacksdk in nodepool not that long ago, which version | 22:47 |
tobiash | checking, but I guess 0.20.0 as I only pulled in the exception change | 22:47 |
pabelanger | tobiash: https://review.openstack.org/621272/ | 22:47 |
*** apetrich has quit IRC | 22:47 | |
fungi | several hundred lines of scrollback in this channel while i was at dinner. ftr i'm just going to check nick highlights | 22:47 |
*** hamerins has joined #openstack-infra | 22:47 | |
corvus | should i just do a scheduler restart or a full restart? | 22:48 |
tobiash | yes 0.20.0 | 22:48 |
corvus | we did executors a couple days ago | 22:48 |
tobiash | so another incompatibility | 22:48 |
clarkb | corvus: executors slow things down so I'm good with just scheduler if you want to make it easier | 22:48 |
clarkb | that gets us relative priorities. And infra doesn't currently need executor zones | 22:49 |
clarkb | if we had an immediate need for executor zones I'd say do it all, but I can't think of one | 22:49 |
fungi | AJaeger_: thanks for the heads-up on 619216. i'm mostly trying to fix up cookiecutter templates and important documentation, so that's totally in scope | 22:49 |
pabelanger | what ever is easier to get the next release of zuul with executor zone support is my vote | 22:49 |
*** slaweq has joined #openstack-infra | 22:50 | |
clarkb | pabelanger: in that case probably restarting the executors too, but we won't exercise those code paths. Probably want you to run it in beta form? | 22:50 |
pabelanger | clarkb: I am okay with skipping executors today | 22:51 |
pabelanger | yah, I'll test zones locally to be sure | 22:51 |
corvus | i'm already down the scheduler-only path | 22:51 |
*** kjackal has quit IRC | 22:52 | |
clarkb | corvus: wfm. I don't think relying on infra to ensure executor zones are ready is prudent | 22:52 |
pabelanger | ++ | 22:52 |
clarkb | we don't have the constraints in place that made it a feature so we wouldn't be able to test it for general functionality well | 22:52 |
clarkb | (even if we could pretend and check it doesn't completely break) | 22:52 |
*** slaweq has quit IRC | 22:52 | |
pabelanger | clarkb: I was only thinking the arm cloud in china might be a good zone to do | 22:53 |
clarkb | pabelanger: I don't think we have trouble with the ssh connections. Only http | 22:53 |
pabelanger | ah | 22:53 |
clarkb | (well and zk, but executor zones don't help that either) | 22:53 |
clarkb | we moved the builder into the london region to deal with the zk issue and switched to https to deal with http problems | 22:54 |
pabelanger | if only infracloud was still around, we had bandwidth issues there :) | 22:54 |
*** mriedem has quit IRC | 22:54 | |
*** wolverineav has quit IRC | 22:54 | |
*** slaweq has joined #openstack-infra | 22:54 | |
corvus | i'm pretty opposed to adding spofs to zuul, so if we were to use zones, we'd have to double up on executors. at least. | 22:54 |
clarkb | ya I just don't see us needing it with current resources | 22:55 |
corvus | (even if we had network restricted resources, i'd suggest we talk about vpns first) | 22:55 |
*** wolverineav has joined #openstack-infra | 22:55 | |
clarkb | we'd have an excuse to use that new in kernel vpn system | 22:56 |
corvus | scheduler restarted | 22:57 |
clarkb | looks like check is enque(ing|ed) | 22:57 |
pabelanger | confirmed, I can see nodepool-launcher creating nodes | 22:58 |
clarkb | corvus: and we are already running with relative priority right? | 22:58 |
clarkb | seems so check is running jobs ahead of gate | 22:58 |
clarkb | neat | 22:58 |
*** wolverineav has quit IRC | 23:00 | |
clarkb | or at least it appeared so a minute ago. Now I'm not completely sure | 23:00 |
*** wolverineav has joined #openstack-infra | 23:00 | |
corvus | the field is being set: http://paste.openstack.org/show/736533/ | 23:00 |
pabelanger | yay | 23:00 |
clarkb | cool I don't have evidence against it working, more lack of evidence it is working (since now all the gate things are happening) | 23:01 |
fungi | and the playing field is being leveled? | 23:01 |
corvus | theoretically | 23:01 |
corvus | gate still has priority over check | 23:01 |
corvus | but we should see, for example nodepool change 621286 get nodes before nova change 620861 | 23:02 |
*** apetrich has joined #openstack-infra | 23:02 | |
corvus | in check | 23:02 |
pabelanger | corvus: that is because there are more nova changes in check right? | 23:02 |
corvus | yeah, there's a bunch right now | 23:02 |
pabelanger | kk, keeping an eye out | 23:03 |
pabelanger | oh, I think it is working 621011 has jobs running now | 23:03 |
pabelanger | and there is nova before it with none | 23:04 |
corvus | similarly the requirements change 620563 is doing well | 23:04 |
pabelanger | yah, that is pretty cool | 23:05 |
*** Miouge- has quit IRC | 23:05 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Add relative priority to request list https://review.openstack.org/621314 | 23:06 |
corvus | i assume that's going to fail tests, i just don't know which | 23:06 |
pabelanger | now I better understand 'level playing field' comment from fungi :) | 23:06 |
*** cdent has quit IRC | 23:07 | |
clarkb | pabelanger: ya this should allow lower resource usage projects "to get a word in" then once done give the higher usage projects a turn | 23:07 |
*** Miouge has joined #openstack-infra | 23:08 | |
clarkb | s/lower resource usage/less active/ | 23:08 |
clarkb | it's based on changes not nodeset size | 23:08 |
pabelanger | clarkb: yah, that is really cool, and fair. | 23:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Only count live items for relative priority https://review.openstack.org/621315 | 23:09 |
corvus | or, uh, nearly fair ^ :) | 23:10 |
clarkb | it still won't be a strict ordering because different providers can grab requests and fulfill them at different speeds | 23:10 |
pabelanger | It seems 614035 got nodes before 614012, but 614012 is parent | 23:10 |
clarkb | corvus: oh hah | 23:10 |
corvus | yeah, it's all best-effort | 23:10 |
openstackgerrit | Merged openstack-infra/project-config master: Update promstat to use storyboard https://review.openstack.org/621293 | 23:11 |
pabelanger | +2 | 23:11 |
corvus | 614012 has a relpri of 3 | 23:12 |
mordred | tobiash: hrm. flavors should not have changed. they should still be munch ... or at the very least munch-like | 23:14 |
corvus | 603930 has relpri 39 | 23:14 |
tobiash | mordred: I just confirmed that the downgrade fixed the issue | 23:14 |
mordred | tobiash: AWESOME | 23:14 |
mordred | sorry for the breakage .. I'm ... quite sad about that | 23:15 |
corvus | i don't know what the ones that got nodes are | 23:15 |
clarkb | corvus: all those non live changes bumping it up | 23:15 |
*** hamerins has quit IRC | 23:15 | |
corvus | ya | 23:15 |
tobiash | mordred: nothing happened, it's night here and I just tried that exception change and that pulled in the new sdk | 23:15 |
corvus | my guess is the others all grabbed nodes as the system was starting up and there were no other requests | 23:16 |
tobiash | so I noticed that before it had an impact | 23:16 |
mordred | tobiash: oh good | 23:18 |
clarkb | corvus: are you wanting to restart again today with the liveness check? I'm thinking that may be good to get in on friday vs monday | 23:18 |
clarkb | (and i can hang around this afternoon to help with that) | 23:18 |
corvus | clarkb: yeah, at least sometime before monday | 23:19 |
pabelanger | I still have some time today to support | 23:19 |
clarkb | the other thing we should probably do is send an email to the dev list explaining the change so that we can avoid all the questions (or at least some of them) next week :) | 23:19 |
clarkb | corvus: is that something you'd like to draft up? (you came up with the idea and did most of the work to implement it) | 23:20 |
clarkb | I'm happy to write a note too if you'd rather not | 23:20 |
corvus | clarkb: i can do it | 23:20 |
mordred | tobiash: oh! it's server.flavor.id ... so the sub-dict isn't munched | 23:21 |
clarkb | while we are on a improve all the things streak. https://review.openstack.org/#/c/611920/ is a fairly straightforward gear change to make it more ipv6 friendly | 23:21 |
tobiash | yes | 23:21 |
clarkb | if anyone wants to review that one. It causes gear to listen on ipv6 if available | 23:21 |
clarkb | pabelanger: ^ I want to say you've pushed changes around similar stuff there | 23:21 |
pabelanger | looking | 23:22 |
*** slaweq has quit IRC | 23:22 | |
pabelanger | Ah, yah. +2 | 23:22 |
*** Swami has quit IRC | 23:27 | |
openstackgerrit | Merged openstack-infra/nodepool master: Remove updating stats debug log https://review.openstack.org/621283 | 23:28 |
pabelanger | clarkb: corvus: there we go, 621286 (nodepool) just got allocated nodes | 23:29 |
clarkb | ya and it's been skipping changes earlier as expected | 23:29 |
pabelanger | way before existing keystone / nova patches | 23:29 |
pabelanger | ++ | 23:29 |
pabelanger | awesome | 23:29 |
mordred | tobiash: remote: https://review.openstack.org/621316 Transform server with munch before normalizing should fix it | 23:30 |
corvus | clarkb, pabelanger: btw this change from tobias https://review.openstack.org/610029 will help things as well | 23:30 |
corvus | (that's probably the reason that 614012, at the top of the list, doesn't have nodes yet) | 23:31 |
pabelanger | ah, cool | 23:31 |
corvus | it was unlucky enough to be locked by a bunch of handlers which couldn't fulfill it, and is now waiting for the ones that can to come back to the top of the loop | 23:31 |
pabelanger | corvus: do we want to restart launchers today to pick it up? | 23:32 |
tobiash | mordred: thanks :) | 23:32 |
corvus | pabelanger: yeah, either today or this weekend i think | 23:38 |
clarkb | TIL about time.monotonic. There are some nice things in python3 | 23:38 |
corvus | be very careful about monotonic. it's a trap in unit tests. | 23:38 |
clarkb | corvus: oh? | 23:38 |
corvus | i keep having to swap it out for time.time. | 23:38 |
corvus | yeah, the base value isn't guaranteed | 23:38 |
clarkb | tobiash's change to timeout assigning handlers uses monotonic | 23:39 |
corvus | yeah, i think it's okay there | 23:39 |
clarkb | corvus: right its not a unix timestamp? I guess that could cause problems in places | 23:39 |
clarkb | its a time that may start at zero when process starts? | 23:39 |
corvus | clarkb: https://review.openstack.org/529173 | 23:40 |
corvus | i think as long as you always initialize to time.monotonic, you're okay | 23:41 |
clarkb | got it | 23:41 |
corvus | but the common pattern of initializing to 0 and assuming that now()-then will work doesn't work with monotonic | 23:42 |
*** wolverineav has quit IRC | 23:42 | |
*** tosky has quit IRC | 23:42 | |
corvus | or rather, it may or may not work depending on $random | 23:42 |
*** wolverineav has joined #openstack-infra | 23:45 | |
corvus | clarkb, pabelanger, fungi, mordred: how's this? https://etherpad.openstack.org/p/QHQuh57TiG | 23:45 |
*** pbourke has quit IRC | 23:47 | |
pabelanger | corvus: wfm | 23:48 |
clarkb | corvus: looks great, particularly for calling out the out of order behavior | 23:48 |
clarkb | I expect people will notice that and get curious | 23:48 |
*** wolverineav has quit IRC | 23:48 | |
*** wolverineav has joined #openstack-infra | 23:49 | |
fungi | corvus: ship it! | 23:49 |
*** pbourke has joined #openstack-infra | 23:49 | |
pabelanger | clarkb: looking forward to the stats next week for node break down | 23:50 |
clarkb | pabelanger: I don't expect that will change much | 23:50 |
clarkb | we are changing where in the pipeline you get resources not how many you get | 23:50 |
clarkb | (so over a long period like a month or a week it should all be roughly the same) | 23:50 |
pabelanger | I guess it depends if more smaller projects are pushing up more patches | 23:51 |
fungi | hint: they're not | 23:52 |
fungi | that's what makes them "small projects" | 23:52 |
pabelanger | cool, it only took my patch 21mins to get nodes, compared to the existing ones that are pushing 45mins | 23:52 |
pabelanger | very nice | 23:53 |
pabelanger | as it was the first one | 23:53 |
fungi | also we generally do catch up over the weekend, so the week's worth of node requests still all get satisfied within that week | 23:53 |
clarkb | ya once upon a time I tried to look at it yoy and we use ~30% of our total resources when viewed at that scale | 23:53 |
clarkb | weekends and the apac timezones tend to be quieter iirc | 23:54 |
*** slaweq has joined #openstack-infra | 23:54 | |
clarkb | so while we are flat out when people are watching, it is much less over long periods of time | 23:54 |
clarkb | we'll need to watch it and see how it goes though | 23:58 |
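
The `time.monotonic` caveat raised in the 23:38 to 23:42 exchange can be illustrated with a small sketch. This is not code from nodepool or zuul; `IntervalTimer`, `FakeClock` and `naive_expired` are hypothetical names invented for the illustration. The point is only that the monotonic clock has an undefined reference point, so a stored timestamp should be initialised from the same clock rather than from 0.

```python
import time


class IntervalTimer:
    """Reports whether `interval` seconds have passed since the last reset."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        # Initialise from the same clock that is read later, never from 0:
        # the reference point of time.monotonic() is undefined.
        self.last = clock()

    def expired(self):
        return self.clock() - self.last >= self.interval

    def reset(self):
        self.last = self.clock()


class FakeClock:
    """Test double with an arbitrary base value (hypothetical helper)."""

    def __init__(self, base):
        self.now = base

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def naive_expired(clock, interval, last=0):
    """The pattern warned about: 'last' starts at 0 and now - last is trusted."""
    return clock() - last >= interval


# The naive version gives different answers depending on the clock's base.
print(naive_expired(FakeClock(3.0), 10))        # False
print(naive_expired(FakeClock(1_000_000.0), 10))  # True
```

With `IntervalTimer`, the result depends only on how far the clock has advanced since construction or the last `reset()`, so it behaves the same whatever the base value happens to be, which is also what makes it predictable in unit tests.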
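
The relative-priority behaviour observed after the scheduler restart (23:01 to 23:10) can be sketched roughly as follows. This is a simplified model and not Zuul's implementation: it assumes each request is ordered first by pipeline precedence (gate ahead of check) and then by how many live changes from the same project are already ahead of it in that pipeline. The function name `order_requests`, the `PIPELINE_PRECEDENCE` table and the placeholder change names `nova-a`, `nova-b` and `nova-gate` are hypothetical; only 620861 and 621286 come from the log.

```python
from collections import defaultdict

# Assumed precedence: lower sorts first, so gate is served before check.
PIPELINE_PRECEDENCE = {"gate": 0, "check": 1}


def order_requests(items):
    """Order node requests by (pipeline precedence, relative priority).

    items: list of (pipeline, project, change) tuples in enqueue order.
    Only live items are assumed to be present in the list.
    """
    ahead = defaultdict(int)  # changes already seen per (pipeline, project)
    keyed = []
    for position, (pipeline, project, change) in enumerate(items):
        relpri = ahead[(pipeline, project)]
        ahead[(pipeline, project)] += 1
        # Enqueue position breaks ties between equal relative priorities.
        keyed.append((PIPELINE_PRECEDENCE[pipeline], relpri, position, change))
    return [change for _, _, _, change in sorted(keyed)]


queue = [
    ("check", "nova", "nova-a"),       # hypothetical earlier nova change
    ("check", "nova", "nova-b"),       # hypothetical earlier nova change
    ("check", "nova", "620861"),
    ("check", "nodepool", "621286"),
    ("gate", "nova", "nova-gate"),     # hypothetical gate item
]
print(order_requests(queue))
# ['nova-gate', 'nova-a', '621286', 'nova-b', '620861']
```

In this model the nodepool change is served ahead of the third nova change even though it was enqueued later, which matches the expectation stated in the log. As noted in the discussion, the real ordering is best-effort, because different providers grab and fulfil requests at different speeds.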