Friday, 2018-11-30

ianwit would be.  i'm waiting but i think there's a chance this reboot will not happen :/00:00
clarkbnow I'm trying to figure out in my head what happens if mount fails on a specific disk (it should mount it ro or at least attempt to and finish the boot right?)00:01
ianwok, it looks like it's coming up, i'm probably too pessimistic :)00:02
clarkbya I think if it isn't where /boot and other important things hanging under / are mounted then it should come up00:03
clarkbit may still have a sad, but should get sshd running00:03
ianwok, it's back00:04
clarkbfwiw citynetwork pushed back the fix estimate to 0300CET :/00:04
clarkbI guess we should all ponder the static inventory a bit more then maybe plan to convert tomorrow or early next week?00:04
*** dave-mccowan has quit IRC00:08
*** mgutehal_ has quit IRC00:09
*** mgutehall has joined #openstack-infra00:09
*** wolverineav has quit IRC00:09
*** flaper87 has quit IRC00:09
*** wolverineav has joined #openstack-infra00:10
openstackgerritClint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver  https://review.openstack.org/53555800:10
*** wolverineav has quit IRC00:12
*** wolverineav has joined #openstack-infra00:13
openstackgerritMerged openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible  https://review.openstack.org/61657800:13
*** flaper87 has joined #openstack-infra00:14
*** threestrands has joined #openstack-infra00:18
openstackgerritAdam Coldrick proposed openstack-infra/storyboard master: Fix the stories relation in StoryTag  https://review.openstack.org/62104500:26
openstackgerritAdam Coldrick proposed openstack-infra/storyboard master: Add a popularity measurement to tags  https://review.openstack.org/62104600:26
*** tosky has quit IRC00:27
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Remove nodeid argument from updateNode  https://review.openstack.org/62104700:32
clarkbhrm there are a bunch of db migration errors in cinder and nova00:33
clarkbI wonder if oslo.db made a release00:33
ianw#status log manual reboot of mirror01.nrt1.arm64ci.openstack.org after a lot of i/o failures00:34
openstackstatusianw: finished logging00:34
*** rh-jelabarre has joined #openstack-infra00:41
*** yamamoto has joined #openstack-infra00:48
*** hamzy_ has joined #openstack-infra01:07
*** sthussey has quit IRC01:11
*** verdurin has quit IRC01:33
clarkbapparently docker 1.12 and newer has a config option to prevent containers from restarting when you restart docker01:36
clarkbthat is nice01:36
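
A minimal sketch of what that amounts to, assuming the option clarkb means is Docker's "live-restore" daemon setting (in daemon.json, available since 1.12), which keeps containers running while dockerd itself restarts:

    import json
    import os

    DAEMON_JSON = "/etc/docker/daemon.json"

    def enable_live_restore(path=DAEMON_JSON):
        # Merge the flag into the existing daemon config rather than
        # overwriting other settings.
        config = {}
        if os.path.exists(path):
            with open(path) as f:
                config = json.load(f)
        config["live-restore"] = True  # containers survive dockerd restarts
        with open(path, "w") as f:
            json.dump(config, f, indent=2)

    if __name__ == "__main__":
        enable_live_restore()

dockerd still needs to pick up the change afterwards (restart or config reload).
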
*** verdurin has joined #openstack-infra01:38
*** wolverineav has quit IRC01:38
*** bhavikdbavishi has joined #openstack-infra01:43
*** yamamoto has quit IRC01:49
*** dave-mccowan has joined #openstack-infra01:50
*** gyee has quit IRC01:52
openstackgerritMerged openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095401:58
*** dklyle has joined #openstack-infra01:59
*** dklyle has quit IRC02:05
*** mriedem_afk has quit IRC02:19
*** mrsoul has quit IRC02:22
*** psachin has joined #openstack-infra02:39
*** eernst_ has joined #openstack-infra02:45
*** dave-mccowan has quit IRC02:46
*** eernst_ has quit IRC02:50
*** dave-mccowan has joined #openstack-infra02:56
*** hongbin has joined #openstack-infra03:02
*** dklyle has joined #openstack-infra03:11
*** apetrich has quit IRC03:15
*** dklyle has quit IRC03:17
openstackgerritMerged openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting"  https://review.openstack.org/62020103:18
*** ykarel|away has joined #openstack-infra03:22
*** rlandy has quit IRC03:31
*** dave-mccowan has quit IRC03:33
*** dave-mccowan has joined #openstack-infra03:35
*** psachin has quit IRC03:43
openstackgerritMerged openstack/diskimage-builder master: Add missing ws separator between words  https://review.openstack.org/61916903:53
*** hongbin has quit IRC03:56
*** udesale has joined #openstack-infra04:01
openstackgerritBrendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema  https://review.openstack.org/62083804:03
*** janki has joined #openstack-infra04:03
*** ramishra has joined #openstack-infra04:29
*** dave-mccowan has quit IRC04:30
*** markvoelker has quit IRC04:32
*** eernst has joined #openstack-infra04:35
openstackgerritMerged openstack-infra/zuul-jobs master: upload-logs-swift: Cleanup temporary directories  https://review.openstack.org/59234004:49
openstackgerritMerged openstack-infra/zuul-jobs master: upload-logs-swift: Make indexer more generic  https://review.openstack.org/59285204:50
*** wolverineav has joined #openstack-infra04:52
*** markvoelker has joined #openstack-infra05:02
*** eernst has quit IRC05:04
*** wolverineav has quit IRC05:15
*** threestrands_ has joined #openstack-infra05:17
*** threestrands has quit IRC05:20
*** ykarel|away has quit IRC05:24
*** pcaruana has quit IRC05:35
*** ykarel|away has joined #openstack-infra05:39
*** ykarel|away is now known as ykarel05:39
*** yamamoto has joined #openstack-infra05:46
*** yamamoto has quit IRC05:51
openstackgerritMerged openstack-infra/zuul master: Add support for zones in executors  https://review.openstack.org/54919706:00
*** diablo_rojo has quit IRC06:12
openstackgerritKartikeya Jain proposed openstack/diskimage-builder master: Adding support for SLES 15 in element 'sles'  https://review.openstack.org/61918606:17
*** e0ne has joined #openstack-infra06:32
*** flaper87 has quit IRC06:32
*** quiquell|off is now known as quiquell07:03
*** hamzy__ has joined #openstack-infra07:11
*** hamzy_ has quit IRC07:12
*** threestrands_ has quit IRC07:13
*** apetrich has joined #openstack-infra07:16
*** stakeda has joined #openstack-infra07:22
*** pcaruana has joined #openstack-infra07:22
*** e0ne has quit IRC07:31
*** slaweq has joined #openstack-infra07:34
*** florianf|afk is now known as florianf07:37
*** kjackal has joined #openstack-infra07:39
*** bhavikdbavishi has quit IRC07:42
*** ykarel is now known as ykarel|lunch07:43
openstackgerritMerged openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment  https://review.openstack.org/62096907:56
openstackgerritMerged openstack-infra/zuul master: Add gearman stats reference  https://review.openstack.org/62019207:57
*** jpena|off is now known as jpena08:01
*** ginopc has joined #openstack-infra08:04
*** bhavikdbavishi has joined #openstack-infra08:05
*** jtomasek has joined #openstack-infra08:06
*** rcernin has quit IRC08:06
*** dpawlik has joined #openstack-infra08:07
*** ralonsoh has joined #openstack-infra08:17
*** aojea has joined #openstack-infra08:18
*** roman_g has joined #openstack-infra08:22
*** shardy has joined #openstack-infra08:23
*** shardy has quit IRC08:24
*** shardy has joined #openstack-infra08:24
*** yamamoto has joined #openstack-infra08:30
*** ykarel|lunch is now known as ykarel08:38
*** xek has joined #openstack-infra08:38
*** ginopc has quit IRC08:40
*** tosky has joined #openstack-infra08:42
*** dpawlik has quit IRC08:46
*** ccamacho has joined #openstack-infra08:46
*** jpich has joined #openstack-infra08:53
*** kjackal has quit IRC08:58
*** dpawlik has joined #openstack-infra09:20
fricklerinfra-root: does /etc/ansible/hosts/group_vars/all.yaml get edited manually on bridge? I wasn't aware that my account had been commented out there, I just got curious when the amount of mail I was receiving was slowly decreasing.09:21
fricklerI edited it now with a different address that is hopefully going to be bouncing less things, please let me know if you see any09:22
fricklerShrews: ^^ you are blocked there too, in case you were wondering09:23
*** dpawlik has quit IRC09:24
*** e0ne has joined #openstack-infra09:27
*** stakeda has quit IRC09:33
*** derekh has joined #openstack-infra09:33
openstackgerritIan Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.3  https://review.openstack.org/61721809:38
*** gfidente has joined #openstack-infra09:40
*** takamatsu has quit IRC09:41
*** ginopc has joined #openstack-infra09:41
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Categorize missing /etc/heat/policy.json  https://review.openstack.org/62112810:01
*** olivierbourdon38 has joined #openstack-infra10:09
*** tpsilva has joined #openstack-infra10:31
*** slaweq has quit IRC10:32
*** ramishra has quit IRC10:43
*** ramishra has joined #openstack-infra10:43
*** electrofelix has joined #openstack-infra10:44
*** pcaruana has quit IRC10:44
*** quite has left #openstack-infra10:46
*** pcaruana has joined #openstack-infra10:50
*** udesale has quit IRC10:59
*** dtantsur|mtg is now known as dtantsur|afk11:00
*** rfolco is now known as rfolco_doctor11:09
*** bhavikdbavishi has quit IRC11:11
*** ramishra has quit IRC11:20
*** ramishra has joined #openstack-infra11:26
*** hwoarang has quit IRC11:32
*** slaweq has joined #openstack-infra11:51
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535611:52
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002912:07
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Remove nodeid argument from updateNode  https://review.openstack.org/62104712:11
*** pcaruana has quit IRC12:11
*** hwoarang has joined #openstack-infra12:15
*** lucasagomes is now known as lucas-hungry12:17
*** lucas-hungry is now known as lucasagomes12:17
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats  https://review.openstack.org/61630612:24
*** slaweq has quit IRC12:24
*** udesale has joined #openstack-infra12:26
*** kjackal has joined #openstack-infra12:26
*** rfolco_doctor is now known as rfolco12:34
*** eharney has quit IRC12:34
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: devstack: Didn't find service registered by hostname after 60 seconds  https://review.openstack.org/62115012:34
*** Serhii_Rusin has joined #openstack-infra12:36
*** Serhii_Rusin has quit IRC12:40
*** jpena is now known as jpena|lunch12:41
*** kjackal has quit IRC12:41
*** xek has quit IRC12:46
Shrewsfrickler: yeah. i need to migrate away from gmail12:46
*** xek has joined #openstack-infra12:47
*** e0ne has quit IRC12:54
*** yamamoto has quit IRC12:59
*** boden has joined #openstack-infra13:06
*** zul has joined #openstack-infra13:06
*** yamamoto has joined #openstack-infra13:15
*** trown|outtypewww has quit IRC13:17
*** trown|brb has joined #openstack-infra13:18
*** dave-mccowan has joined #openstack-infra13:19
pabelangerfrickler: that file should be under control of git, so you should be able to look  at history13:26
pabelangerif not, then I am not sure13:26
pabelangerI don't believe we should have manually edited files on bridge.o.o13:26
*** annp has quit IRC13:29
fricklerpabelanger: ah, local git repo, nice. /me can blame mordred now :-D13:29
pabelangerfrickler: cool, so any changes need to be committed to git there too13:32
fricklerpabelanger: did that for my earlier change now13:33
pabelangerack13:35
*** rlandy has joined #openstack-infra13:36
*** e0ne has joined #openstack-infra13:37
*** jpena|lunch is now known as jpena13:40
*** sthussey has joined #openstack-infra13:40
*** kgiusti has joined #openstack-infra13:45
openstackgerritMerged openstack/diskimage-builder master: Add an element to configure iBFT network interfaces  https://review.openstack.org/39178713:46
*** udesale has quit IRC13:48
*** udesale has joined #openstack-infra13:49
*** hwoarang has quit IRC13:54
*** takamatsu has joined #openstack-infra13:55
*** slaweq has joined #openstack-infra13:56
*** jcoufal has joined #openstack-infra13:57
*** EmilienM is now known as EvilienM13:58
*** ykarel is now known as ykarel|away14:07
*** roman_g has quit IRC14:08
*** roman_g has joined #openstack-infra14:08
*** hwoarang has joined #openstack-infra14:12
*** mriedem has joined #openstack-infra14:19
*** olivierbourdon38 has quit IRC14:20
*** olivierbourdon38 has joined #openstack-infra14:20
*** jamesmcarthur has joined #openstack-infra14:24
ssbarnea|roverclarkb: are you working on https://review.openstack.org/#/c/621038/1 ? i can fix it myself.14:25
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Better event checking timeouts  https://review.openstack.org/62103814:28
openstackgerritMerged openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota  https://review.openstack.org/62104014:29
openstackgerritMerged openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes  https://review.openstack.org/62104314:29
*** janki has quit IRC14:30
pabelangerclarkb: corvus: mordred: are we thinking of doing nodepool / zuul restarts this morning to pick up new noderequest logic?14:31
*** ykarel|away has quit IRC14:33
*** lbragstad is now known as elbragstad14:36
openstackgerritMerged openstack-infra/project-config master: Create airship-spyglass repo  https://review.openstack.org/61949314:37
*** bhavikdbavishi has joined #openstack-infra14:39
*** eharney has joined #openstack-infra14:42
openstackgerritMerged openstack-infra/project-config master: Add openstack/arch-design  https://review.openstack.org/62101214:44
*** bhavikdbavishi has quit IRC14:47
*** bhavikdbavishi has joined #openstack-infra14:47
*** takamatsu has quit IRC14:48
*** ykarel|away has joined #openstack-infra14:52
*** ykarel|away is now known as ykarel14:52
*** dayou_ has joined #openstack-infra14:59
*** dayou has quit IRC15:00
*** dayou_ has quit IRC15:10
*** dayou_ has joined #openstack-infra15:11
*** chandan_kumar is now known as chkumar|off15:14
*** jcoufal has quit IRC15:16
openstackgerritMerged openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535615:24
*** takamatsu has joined #openstack-infra15:29
*** jamesmcarthur has quit IRC15:33
*** bnemec is now known as beekneemech15:34
*** roman_g has quit IRC15:36
*** eharney has quit IRC15:36
*** cdent has joined #openstack-infra15:37
cdentCan someone point me to a job that does a webhook post merge? Basically I want to trigger a build on dockerhub after a placement change merges15:38
*** ramishra has quit IRC15:38
*** quiquell is now known as quiquell|off15:39
pabelangercdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n1128 does it for readthedocs.org15:42
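
For reference, the readthedocs job above is essentially just an authenticated POST to a webhook URL; a Docker Hub version would look roughly like the sketch below. The trigger URL/token and payload fields here are placeholders, not the actual job: the real token comes from Docker Hub's "Build Triggers" page and would belong in a Zuul secret.

    import requests

    # Placeholder URL: the real namespace/repo/token come from Docker Hub's
    # "Build Triggers" settings and should live in a Zuul secret.
    TRIGGER_URL = "https://registry.hub.docker.com/u/<namespace>/<repo>/trigger/<token>/"

    def trigger_dockerhub_build(branch="master"):
        # Payload fields are illustrative; Docker Hub rebuilds the named branch/tag.
        resp = requests.post(TRIGGER_URL,
                             json={"source_type": "Branch", "source_name": branch})
        resp.raise_for_status()
        return resp.status_code

    if __name__ == "__main__":
        print(trigger_dockerhub_build())
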
cdentpabelanger: awesome, thanks15:42
pabelangernp!15:42
*** dansmith is now known as SteelyDan15:43
*** therve has joined #openstack-infra15:44
therveHi15:44
therveWe're getting some issues in the heat gate for the past couple of days15:45
therveIt would seem there is an issue with ovh, is that a known problem?15:45
*** jamesmcarthur has joined #openstack-infra15:46
corvuspabelanger, clarkb: yes i'd like to restart things today15:48
*** yamamoto has quit IRC15:48
*** jamesmcarthur has quit IRC15:49
*** eernst has joined #openstack-infra15:49
*** jamesmcarthur has joined #openstack-infra15:49
*** sthussey has quit IRC15:50
*** mriedem has quit IRC15:50
*** eharney has joined #openstack-infra15:51
*** munimeha1 has joined #openstack-infra15:51
*** takamatsu has quit IRC15:53
pabelanger+1, happy to assist15:59
openstackgerritJames E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option  https://review.openstack.org/62119415:59
mordredcdent: fwiw, you could also just push images built in zuul to dockerhub16:00
cdentmordred: yeah, I thought about that too, but the dockerfile being used is "mine", not an openstack thing, so I was exploring the options16:00
mordrednod16:00
cdentI think it is probably better to make the Dockerfile a real placement thing16:01
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul  https://review.openstack.org/62119516:01
cdentmordred: can you point me to an example?16:01
mordredit's all good either way - mostly just wanted to mention16:01
mordredcdent: yup - one sec16:01
corvuspabelanger, mordred: can you review those 2 changes ^  (since they require a scheduler restart, would be good to go ahead and get them in place)16:01
corvusgah, bad parent16:02
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul  https://review.openstack.org/62119516:02
corvusthere we go16:02
mordredcdent: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n126016:03
pabelanger+216:03
cdentthanks16:03
mordredcdent: also - if you're just publishing images of placement, you might want to check out what the loci folks are doing16:03
mordredcorvus: +A to both16:04
cdentmordred: yeah, on that too16:05
corvusgiven what those changes are intended to do, i'm inclined to direct-enqueue them16:05
mordredcorvus: ++16:05
cdentmordred: what I'm trying to do here is incrementally unwind a bunch of the external side stuff I've built alongside placement before it was extracted16:06
mordredcdent: that sounds like a very worthwhile thing to do16:06
mordredcdent: while I'm pointing you at 20 things in addition to what you were asking about ... you should check out pbrx :)16:06
cdentin this case the image I create is fairly purpose built for testing16:06
mordredah- yes, well in that case neither loci nor pbrx are going to be very helpful :)16:06
corvuswe're running a 3 for 1 special on answers here today :)16:07
cdentpbrx looks pretty interesting, though16:07
cdentmordred: the container in question is an example of "just how cdentish can I make this thing". In this case that means zero config in the container and as small as possible and uwsgi instead of apache, etc16:08
cdenta lot of which I'm hoping to eventually get back to loci and kolla, but there's only so much time16:08
mordredcdent: I'm very much in the 'no config in container' camp16:09
mordredcdent: I think putting config into the containers defeats the whole benefit of the containers16:10
cdentayup16:10
cdentit's this: https://github.com/cdent/placedock16:10
mordredlike - a container should contain a single process- and things like config should be mounted in or otherwise provided, and things like apache should be network proxies, etc16:10
cdentyup, we sound of like minds16:11
cdent(not all that surprising)16:11
mordredcdent: if you haven't already, you might want to consider python:alpine as a base image - you can skip your pip and python install steps16:12
cdentI've been back and forth on the image a few different times depending on what seems to be working that day16:12
mordred:)16:12
mordredtell me about it16:12
cdentyesterday lots of stuff was not working, so gave up on alpine:edge16:12
mordredwe've been using python:alpine for the pbrx-built images and I'm fairly happy with it so far- there was one moment where something stopped working that we thought was alpine's fault, but I think it wound up being something else16:13
cdenti'll add that to the list, thanks16:15
*** dayou_ has quit IRC16:19
*** sthussey has joined #openstack-infra16:20
*** mriedem has joined #openstack-infra16:21
*** pcaruana has joined #openstack-infra16:26
*** gyee has joined #openstack-infra16:27
fungitherve: did anyone get back to you on your "issues in the heat gate" specific to ovh yet? can you elaborate? is it a particular error or something general like slow performance leading to job timeouts? is it in both ovh regions (bhs1 and gra1) or only one?16:27
thervefungi: No16:28
thervefungi: It looks like slow performance16:28
therveI haven't checked the region16:28
thervehttp://logs.openstack.org/57/620457/3/check/heat-functional-convg-mysql-lbaasv2/934e143/ is a recent example16:29
therveIt spent 90mins trying to setup devstack (that's usually our whole runtime)16:29
*** boden has quit IRC16:34
*** ccamacho has quit IRC16:35
*** ccamacho has joined #openstack-infra16:35
*** adriancz has quit IRC16:36
clarkbpabelanger: frickler correct its a local git repo. You edit and commit in place16:36
fungitherve: yeah, that looks like ovh-bhs1 which is where we've had a lot of reports of job timeouts. at first i thought it might be because of the 2:1 cpu oversubscription in our dedicated host aggregate for that region but temporarily halving the number of servers we were running didn't move the needle16:36
*** xek has quit IRC16:36
clarkbfungi: therve: I wonder if some of the hypervisors are not using virt?16:36
openstackgerritMonty Taylor proposed openstack-infra/project-config master: Create promstat project  https://review.openstack.org/62122516:37
openstackgerritMonty Taylor proposed openstack-infra/project-config master: Add promstat project to Zuul  https://review.openstack.org/62122616:37
corvus2018-11-30 16:27:30.828209 | primary | manifests/init.pp - WARNING: quoted boolean value found on line 3516:37
*** yamamoto has joined #openstack-infra16:37
corvusokay, so, in puppet, if i want to pass the literal word "true" around.... what?  just don't do it?16:37
fungitherve: mnaser was just commenting in #openstack-tc about surprisingly slow swapfile creation/preallocation in ovh, so i'm starting to wonder if it could be attributed to disk i/o16:37
clarkbfungi: oh interesting16:38
clarkbfungi: that said you should totally use fallocate on ext4 and it shouldn't be slow even with bad disk io :P16:38
thervefungi: Anecdotally I've found db migrations to be slow16:38
clarkbmnaser: ^16:38
mnaserclarkb: http://logs.openstack.org/36/619636/1/gate/openstack-ansible-deploy-aio_metal-ubuntu-bionic/72c540f/logs/ara-report/result/f6ed9f8a-419a-41b8-8d81-19d6e5aac6cc/16:38
*** jamesmcarthur has quit IRC16:38
mnaseri mean yes, but you can't use a fallocated file for swap on xfs so really we've only worked around the job16:39
corvustherve, fungi, clarkb: wow, a bunch of zuul sql tests have started timing out recently; i wonder if they ran there16:39
mnaser5.9 MB/s disk write speed will be bad regardless16:39
mnaserif you avoid it in swap, it'll bite you later somewhere else :)16:39
clarkbmnaser: our centos7 instances are ext4 not xfs16:40
mnaseroh shoot really16:40
mnaserTIL16:40
mnaserbut anyways, still, 6 MB/s disk writes are painful :P16:40
clarkbmnaser: yup not saying it fixes the underlying issue. Just pointing out that you likely want to avoid this cost anyway16:40
fungicorvus: while we were running at max-servers=79 for both bhs1 and gra1 i ran the e-r logstash query for job timeouts and there were 20x as many (no joke) in bhs1 compared to gra116:40
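
That comparison is just an aggregation over the logstash data; a rough sketch of the idea, assuming the elasticsearch client, with the endpoint, index pattern, field names and query string all being assumptions rather than the exact e-r query:

    from elasticsearch import Elasticsearch

    # Endpoint, index, fields and query string are assumptions, not the
    # literal elastic-recheck query.
    es = Elasticsearch(["http://logstash.example.org:9200"])

    body = {
        "query": {"query_string": {
            "query": 'message:"RUN END RESULT_TIMED_OUT" AND tags:"console"'}},
        "aggs": {"by_provider": {"terms": {"field": "node_provider", "size": 20}}},
        "size": 0,
    }

    result = es.search(index="logstash-*", body=body)
    for bucket in result["aggregations"]["by_provider"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])
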
mnaseryeah well we decided to just drop swap entirely16:40
mnaserbecause we rarely ever used it16:40
mnaserit was like 1-2MB of swap with some 4.something gig of memory being used by caches/etc16:41
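
The fallocate point above, as a sketch: on ext4 a swapfile can be preallocated in one metadata operation instead of streaming zeros through a slow disk (on xfs, swapon tends to reject fallocated files, hence the workaround being discussed). Path and size are illustrative:

    import os

    SWAPFILE = "/swapfile"            # illustrative path
    SIZE = 8 * 1024 * 1024 * 1024     # 8 GiB, illustrative

    # Allocate the blocks up front; this is fast even when raw write
    # throughput is only a few MB/s, unlike dd'ing zeros into the file.
    fd = os.open(SWAPFILE, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, SIZE)
    finally:
        os.close(fd)

    # mkswap/swapon still happen outside this sketch.
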
*** jamesmcarthur has joined #openstack-infra16:42
openstackgerritJames E. Blair proposed openstack-infra/puppet-zuul master: Add relative_priority scheduler option  https://review.openstack.org/62119416:42
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Enable relative_priority in zuul  https://review.openstack.org/62119516:42
corvuspabelanger: mordred: ^16:42
*** dayou_ has joined #openstack-infra16:42
pabelangerlooking16:44
pabelanger+316:44
clarkbcorvus: is zuul restarted to accept that? or we put config in place first then restart?16:44
*** bhavikdbavishi has quit IRC16:45
openstackgerritTobias Urdin proposed openstack-infra/system-config master: Mirror Stein on Ubuntu from Cloud Archive  https://review.openstack.org/62123116:45
clarkbas for ovh IO, I know on my personal instance I seem capped at 1kiops so lots of really small writes perform poorly but large writes perform just as well as anything else. This seems more extreme than that though16:45
*** bhavikdbavishi has joined #openstack-infra16:45
corvusclarkb: config first then restart, since the scheduler needs a restart to see that, but will ignore extra options.16:46
fungimnaser: looks like the example linked above was also in bhs1, so i'm starting to wonder if disk is waaaay slower there than gra1. could explain the disproportionate number of job timeouts in bhs1 if so16:46
*** ginopc has quit IRC16:47
pabelangercorvus: clarkb: we also need to do nodepool-launcher too right? I can help with that if needed16:47
clarkbpabelanger: for it to work yes, I think it will be a noop if zuul sets it until then (but otherwise work as is today)16:47
corvuspabelanger: yeah.  in fact, do you want to get started on restarting those now?16:48
corvusor do we want to restart everything at once?16:48
*** yamamoto has quit IRC16:48
pabelangerI'm fine with both options16:48
*** udesale has quit IRC16:49
corvusi think launcher restarts are only minorly disruptive; i say go ahead and do 'em16:49
pabelangerok16:49
clarkbcorvus: ++16:50
clarkbusually I restart one and make sure its happy then do the others ~5 minutes later16:50
clarkbperceived impact is incredibly low16:50
pabelangernl01 doesn't seem to have latest tip of nodepool master, let me see what is happening16:52
fungiclarkb: i noticed yesterday you added static.o.o in the emergency disable list. is that still in progress or forgotten?16:53
clarkbfungi: semi forgotten. I think we want to get dmsimard wsgi resource increases in place. But I don't think anyone else was reviewing those changes16:54
* clarkb digs them up16:54
*** trown|brb is now known as regain16:55
pabelangerI think ansible on bridge.o.o is broken, but I'm not 100% sure. I haven't interacted much with this server16:55
pabelangerERROR! Completely failed to parse inventory source /opt/system-config/inventory/openstack.yaml16:55
pabelangerthat is in /var/log/ansible/run_all_cron.log16:55
*** regain is now known as trown16:55
pabelangerand looks like 12 hours ago was the last run16:55
clarkbfungi: https://review.openstack.org/#/c/616297/16:55
*** trown is now known as trown|lunch16:56
clarkbpabelanger: I think that got fixed because my cert signing request went through (which depended on dns updates)16:56
corvusthat's the last message in the log16:56
pabelangerokay, I might be looking in the wrong log16:56
clarkbor maybe it stopped and then started again16:56
fungiperhaps citycloud is still busted?16:57
corvusclarkb, pabelanger, mordred: http://paste.openstack.org/show/736489/16:57
pabelangerthanks, that is what I am seeing16:57
pabelangerthere was a new release of ansible yesterday, could be related16:58
clarkboh interesting. I wonder if citycloud fixed, then we updated ansible then we broke16:58
clarkbya16:58
fungiwhen it rains it pours16:58
*** jpena is now known as jpena|off16:59
*** eharney has quit IRC16:59
pabelangerlooking to see if ansible upgraded now16:59
corvusansible 2.7.016:59
Shrewspabelanger: corvus: if we're restarting launchers, we'll need to watch those carefully. lots of rather big changes (mostly around caching) have gone in16:59
*** jpich has quit IRC17:00
clarkbfungi: I've discovered that nova takes over the dmi info for instances so you can't really tell if they are qemu or kvm (trying to double check that bhs1 isn't emulated as that could explain io as well as other slowness)17:00
clarkbok systemd-detect-virt says it is kvm proper17:02
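
For reference, the check behind that statement (systemd-detect-virt reports "kvm" for hardware-accelerated guests and "qemu" for plain emulation):

    import subprocess

    # Non-zero exit status means no virtualization was detected.
    result = subprocess.run(["systemd-detect-virt"], stdout=subprocess.PIPE,
                            universal_newlines=True)
    print(result.stdout.strip() or "none")
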
clarkbfungi: re static I think we should either approve dmsimard's change above or we can safely remove static from the emergency file then watch it for slowness17:02
clarkbits safe either way, its just that we think adding the workers helped with the log downloading slowness that was seen a while back17:03
pabelangerokay, 2.7.0 seems to be the version we are pinned to in system-config, so don't believe that has changed17:04
*** pcaruana has quit IRC17:04
corvusmordred: can we put in your static inventory stuff asap?17:05
*** aojea has quit IRC17:06
fungiclarkb: only a data point, but our mirror instances in ovh don't show a significant discrepancy in i/o performance. 124 MB/s in gra1 vs 131 MB/s in bhs1 when i tried the same dd as mnaser's job example. could be due to them having a different flavor, or not actually being in the same host aggregate, or maybe only some instances in bhs1 are hitting an i/o tarpit while others land on more performant17:07
fungidisk17:07
clarkbcorvus: mordred fwiw I don't think the static inventory will fix this17:07
clarkbthis is a different error than what we had with citycloud (that was a url parsing failure due to http 502). This appears to be a python bug with a Proxy object in the openstack plugin not having a servers attribute17:07
clarkbpossibly openstacksdk updated and not ansible?17:07
corvusclarkb: not using the openstack plugin as inventory *will* fix that problem17:07
corvusclarkb: ansible has been broken for two days straight because of two different problems with the openstack inventory plugin.17:08
clarkbcorvus: it will work around it yes. Mostly pointing it out because mordred is maintainer for that plugin and all of that in ansible. It doesn't work right now. that is probably important info for mordred17:08
clarkbfor infra we can work around it17:08
pabelangerI am not sure what yamlgroup plugin is, is that something we wrote?17:08
corvusclarkb: i agree.  i'm not the maintainer for that plugin.  i just want our systems to work so i can go back to doing my work.17:09
clarkbpabelanger: yes it is how we take our host listing and group them into groups17:09
pabelangerokay, sorry, not up to speed on recent changes17:09
*** mriedem is now known as mriedem_lunch17:11
pabelangercould it be openstacksdk that is breaking here? there was a new release 19hrs ago17:11
corvuspabelanger: almost certainly17:11
clarkbpabelanger: yes that is my current hunch17:11
corvusfeel free to revert or whatever, but i'm working on switching to static inventory17:11
pabelangerlet me see how we install it17:12
clarkbI think for us static inventory is a good move since we have had multiple issues with the non-static inventory in a couple of days. I think for mordred and sdk and ansible it should be clear this isn't just a failing cloud anymore and there is a bug to fix there17:12
corvusclarkb: agreed17:12
pabelangerokay, so try rolling back to 0.19.0, confirm inventory works. Then, work towards static inventory17:14
*** eharney has joined #openstack-infra17:14
clarkbfungi: on a random ready xenial VM I hopped on in bhs1, 3.6MB/s seems to be the peak for write sizes of 128 bytes and 512 bytes, >100MB total17:16
fungigranted, if that node was actively running a job then you may not know how much other contention you have for activity on the same node17:17
clarkbfungi: ya, though it seems consistent over multiple writes. We can always boot a node external to nodepool if we need better numbers. Mostly this seems to point that io is consistently slow when artificially tested with dd17:17
pabelangerhttp://paste.openstack.org/show/736491/17:18
pabelangerthat is when openstacksdk was upgraded, then ansible ran 1 more time afterwards17:18
fungiclarkb: any chance you can repeat the same experiment on a random gra1 node for comparison?17:18
pabelangergoing to downgrade it manually now to confirm inventory works again17:18
clarkbbumping up the bs to 2048 shows roughly the same throughput (3.4MB/s)17:19
clarkbfungi: ya I can do that17:19
clarkbthat implies to me that this isn't an iops cap we are running into. We should've seen a difference in throughput with different block sizes (up to the disk block size iirc) if it were just iops17:20
fungiif disk access is an order of magnitude slower for (at least some) instances in bhs1 than in gra1, we probably have our smoking gun for the difference in timeouts17:20
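
A rough Python equivalent of the dd probes being compared here (write zeros in small blocks, flush, report MB/s; the path and sizes are illustrative):

    import os
    import time

    def write_throughput(path="/tmp/ddtest", bs=512, total=100 * 1024 * 1024):
        # Roughly dd if=/dev/zero of=path bs=bs count=total/bs conv=fdatasync
        buf = b"\0" * bs
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
        start = time.time()
        written = 0
        try:
            while written < total:
                written += os.write(fd, buf)
            os.fsync(fd)  # include the flush, like conv=fdatasync
        finally:
            os.close(fd)
        os.unlink(path)
        return written / (time.time() - start) / 1e6  # MB/s

    for bs in (512, 2048, 1024 * 1024):
        print("bs=%d: %.1f MB/s" % (bs, write_throughput(bs=bs)))
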
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Switch to static inventory  https://review.openstack.org/62124717:20
corvusclarkb, pabelanger, fungi, mordred: ^17:21
corvusi quickly processed mordred's file from yesterday to include only the attributes he thought important17:21
*** ykarel is now known as ykarel|away17:22
corvusoh, i wonder if we need ansible_host ?17:22
*** eharney has quit IRC17:22
*** shardy has quit IRC17:22
pabelangercorvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it17:22
clarkbcorvus: mordred already pushed that change yesterday17:22
pabelangerlooking at patch now17:22
* clarkb +2'd it but no one else reviewed it...17:23
clarkbhttps://review.openstack.org/#/c/621031/ if we want to use that change instead17:23
pabelangercorvus: +2, how did you generate that by chance?17:23
corvusclarkb: approved17:23
clarkbfungi: ah ok so I had a minor derp in there I should've used /dev/zero. But ya bhs1 is 16.6MB/s at 512 bs and 9ish MB/s at 2048 bs vs >250MB/s on gra117:24
openstackgerritMerged openstack-infra/zuul master: Clarify executor zone documentation  https://review.openstack.org/62098917:24
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Disable openstack inventory plugin  https://review.openstack.org/62124717:25
corvusclarkb, pabelanger, mordred: ^ that's the other thing from my change that wasn't in mordred's17:25
clarkbpabelanger: mordred generated it by taking the ansible openstack plugin output (from the cache?) and filtering out all the extra data we don't need17:25
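
Roughly what that filtering amounts to, as a sketch; the cache path and the list of attributes kept are assumptions, not the exact ones used:

    import json
    import yaml

    KEEP = ("ansible_host", "ansible_user", "ansible_port")   # illustrative list

    # The dynamic inventory cache uses the usual _meta/hostvars JSON layout.
    with open("/var/cache/ansible/inventory/openstack.json") as f:  # path illustrative
        data = json.load(f)

    hostvars = data.get("_meta", {}).get("hostvars", {})
    static = {"all": {"hosts": {}}}
    for name, host_vars in sorted(hostvars.items()):
        static["all"]["hosts"][name] = {k: host_vars[k] for k in KEEP if k in host_vars}

    print(yaml.safe_dump(static, default_flow_style=False))
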
fungiclarkb: thanks. ironically disk writes are ~20x faster on gra1 and jobs time out ~20x more often on bhs117:25
clarkbcorvus: oh interesting I wonder if the yaml vs yamlgroup ordering there is important17:26
corvuswe might have actually needed my change to get past today's bug.17:26
corvusclarkb: i don't know, but i re-ordered it to match our actual order assuming it is17:26
clarkbcorvus: ya, also may need yaml to process before yamlgroup17:26
pabelangerclarkb: ack, thanks17:26
*** efried is now known as fried_rolls17:27
fungicorvus: just to confirm, one of the tox-py36 job timeouts i ran across on a zuul change a few minutes ago was indeed in ovh-bhs1 as well. likely related17:27
pabelangerso, until bridge.o.o runs again, ansible works with old config. Next run, openstacksdk will update, but think we are protected a little with cached version, which should be enough time for these new patches to land17:27
corvusfungi: thanks; i checked a few too, though one of the sql-only timeouts wasn't bhs1 (it was limestone)17:27
clarkbamorin: ^ fyi we think we have narrowed down the ovh-bhs1 issues to disk performance. Seeing writes for zeros to disk with dd on the order of 10-15MB/s on bhs1 when the same writes are >250MB/s on gra117:28
corvuspabelanger: well, if we direct-enqueue the patches.  otherwise we're going to spend 8 hours just getting to the point where we can begin the day's work.17:28
pabelanger++17:28
corvusso i will direct-enqueue the static inventory patches now17:28
clarkbcorvus: pabelanger I think we may want them to go in together so that the config update is in place the first time we run with the static inventory17:28
corvusclarkb: i'll do both17:29
clarkb(otherwise yamlgroup may not be able to group any hosts into groups and we'll still run ansible in an unproductive manner)17:29
fungii wonder if we should disable ovh-bhs1 for now. it's a 159-node hit to our capacity, but the random timeouts are probably resulting in rechecks and gate resets which waste even more than that17:29
clarkbfungi: ya likely17:29
pabelangerfungi: actually, 1 sec, can I check something in ovh17:29
fungipabelanger: check whatever you like17:30
clarkbI'm going to dig up breakfast then will update things with the opendev cert info17:30
pabelangerfungi: okay, thanks. Finished, I wanted to check to see if we had any leaked nodes there, doesn't look like it (unrelated to the current issue).17:31
pabelanger+1 to disable if jobs are being affected17:31
*** weshay is now known as he_hates_me17:35
openstackgerritMerged openstack-infra/puppet-zuul master: Add relative_priority scheduler option  https://review.openstack.org/62119417:35
openstackgerritMerged openstack-infra/system-config master: Enable relative_priority in zuul  https://review.openstack.org/62119517:35
*** he_hates_me is now known as weshay17:36
*** kjackal has joined #openstack-infra17:36
openstackgerritMerged openstack-infra/zuul master: Remove STATE_PENDING  https://review.openstack.org/62028417:38
openstackgerritJeremy Stanley proposed openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool  https://review.openstack.org/62125017:39
openstackgerritJeremy Stanley proposed openstack-infra/project-config master: Revert "Temporarily disable ovh-bhs1 in nodepool"  https://review.openstack.org/62125117:39
openstackgerritMerged openstack-infra/system-config master: Bump amount of mod_wsgi processes for static vhosts to 16  https://review.openstack.org/61629717:42
*** eernst has quit IRC17:42
*** eernst has joined #openstack-infra17:43
*** eernst has quit IRC17:44
*** eernst has joined #openstack-infra17:44
*** e0ne has quit IRC17:45
clarkbcorvus: fungi opendev cert and key and friends are all in the usual location on bridge. Do we have hiera/ansible var keys set up for that yet?17:47
corvusclarkb: expected key names are here: https://review.openstack.org/62097917:48
clarkbthanks17:48
fungiaha, i thought i'd already reviewed that but looks like i have not17:48
pabelangerit is happening17:48
fungiopendev happens17:49
openstackgerritMerged openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record"  https://review.openstack.org/62098617:51
*** openstackgerrit has quit IRC17:51
clarkbcorvus: ok should be in place under those keys in hiera17:54
corvusok approved17:55
corvusclarkb: thanks17:55
*** hamerins has joined #openstack-infra17:55
*** openstackgerrit has joined #openstack-infra17:56
openstackgerritMerged openstack-infra/system-config master: Switch to a static inventory  https://review.openstack.org/62103117:56
corvussecond change is about 8 minutes out17:56
clarkbshould we disable the cron on bridge so it doesn't run without second change in ~3 minutes?17:57
clarkb(I'm worried our group info won't be correct which could break firewall rules)17:57
corvusclarkb: yes17:57
clarkbcorvus: are you doing that or should I?17:57
corvusclarkb: you17:58
clarkb#*/15 * * * * flock -n /var/run/ansible/run_all.lock bash /opt/system-config/run_all.sh -c >> /var/log/ansible/run_all_cron.log 2>&1 is in the crontab now17:59
clarkb(I also commented out the cloud launcher script as I'm not sure if that one will be happy either)17:59
*** derekh has quit IRC18:00
corvusclarkb, pabelanger: i'm going to afk for 30m18:00
clarkbhrm it seems like ansible was running before though?18:00
clarkbpabelanger: did you downgrade sdk?18:01
clarkbI didn't expect ansible to be working atall, but if someone downgraded the openstacksdk package that might explain it18:01
corvusclarkb: i believe pabelanger said he did that18:02
clarkbok18:02
corvus17:22 < pabelanger> corvus: clarkb: downgrading to openstacksdk 0.19.0 seems to fix it18:02
clarkbthanks18:02
*** eernst has quit IRC18:02
*** munimeha1 has quit IRC18:02
* mordred poking to see if I can figure out what the actual issue is18:03
*** gfidente has quit IRC18:04
*** Swami has joined #openstack-infra18:09
Shrewsmordred: i've been poking as well but not having much luck locally so far18:09
mordredShrews: I've got a script in /root/mttest.py on bridge.openstack.org that exhibits the issue - it can be run with /root/mtvenv/bin/python mttest.py if you wanna look at it18:10
mordredsomething is unhappy with rackspace18:11
openstackgerritMerged openstack-infra/system-config master: Disable openstack inventory plugin  https://review.openstack.org/62124718:11
clarkbmordred: ^ can you double check that does what we want before we reenable the ansible cron on bridge?18:12
clarkbmordred: rax had their old compute api in teh catalog for a long time is sdk trying to make sense of it?18:14
mordredmaybe?18:14
*** jamesmcarthur has quit IRC18:15
pabelangerclarkb: corvus: yes, downgrade was the fix18:15
mordredoh. my. dear. god18:16
*** ykarel|away has quit IRC18:17
*** wolverineav has joined #openstack-infra18:17
* fungi can't wait to hear this one18:20
fungisuch suspense18:20
mordredwell - discovery is incorrectly finding the project id as the version when doing discovery18:20
mordredso is then not matching version 610275 against v218:21
mordredultimately it's because rackspace blocks access to the discovery document so we have to fall back to parsing the URL18:21
mordredwhich SHOULD have a provision for stripping the trailing project id from the endpoint: https://dfw.servers.api.rackspacecloud.com/v2/61027518:21
mordredbut for some reason is not doing that18:21
amorinclarkb: that's weird (the bhs1 io issue)18:22
amorinI am currently at home and cant test that right now18:23
amorincan it wait monday?18:23
clarkbamorin: yes I think you should enjoy your weekend18:23
clarkband thank you for checking in!18:24
amorin:p18:24
amorinthat's still weird, we are supposed to have same hardware config with SSD disk18:24
Shrewsmordred: my that's fun18:24
amorinbut anyway, I am writing that up in my checklist for monday morning18:24
clarkbamorin: thank you!18:24
fungiamorin: awesome. i was hesitant to reach out to you until monday anyway since i know it's late already there18:25
fungiwe think this has been going on at least a week, possibly several, could even have started before the summit18:26
fungiso a couple more days aren't going to hurt18:26
*** diablo_rojo has joined #openstack-infra18:29
*** kjackal has quit IRC18:30
*** jamesmcarthur has joined #openstack-infra18:31
pabelangerclarkb: corvus: Shrews: okay, seems bridge.o.o got out a pulse and updated nodepool-launcher. Will hold off on restarting until everybody is back18:32
clarkbpabelanger: we may also want to prioritize the various inflight tasks. There is nodepool restarts, zuul restart, static inventory switch (and maybe others)18:33
pabelangerclarkb: agree, holding off for now18:34
corvusclarkb, pabelanger: here18:36
pabelangeralso here18:36
clarkbme three18:37
corvusclarkb, pabelanger: so next we should manually update system-config, then uncomment the cron and watch it run?18:37
clarkbcorvus: ya I think that is a sane next step. Do we want mordred to review the changes to the ansible config too? (since mordred probably groks yamlgroup and friends best?)18:37
pabelangerwfm18:37
mordredShrews: ok. I have a 'fix'18:38
mordredShrews: but I'm not 100% sure of the 'right' way to do it18:38
corvusmordred: can you review  https://review.openstack.org/621247 or should we just muddle along?18:38
openstackgerritMerged openstack-infra/system-config master: Serve opendev.org website from files.o.o  https://review.openstack.org/62097918:38
mordredcorvus: yes - that looks good18:39
corvusmordred: thx18:39
clarkbcorvus: pabelanger also maybe after updating system-config but before uncommenting cron we run the command to list group membership (I forget what it is) and do a kick.sh?18:39
corvusclarkb, pabelanger: system-config is updated.18:39
pabelanger+1 for kick.sh18:40
clarkbI'm looking up the group listing command now18:40
corvusgoogle says:  ansible localhost -m debug -a 'var=groups'18:40
mordredShrews: remote:   https://review.openstack.org/621257 WIP Fix version discovery for rackspace public cloud        is the hacky version18:40
Shrewsmordred: looking18:40
pabelangerdo we need to delete the inventory cache, or was removing yamlgroup enough?18:40
corvusthe output of that lgtm18:40
clarkbcorvus: ansible-playbook --list-hosts zookeeper18:40
mordredShrews: tl;dr - rackspace project ids are integers, so they parse as version numbers18:40
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Retire the interop-wg mailing list  https://review.openstack.org/61905618:40
openstackgerritJeremy Stanley proposed openstack-infra/system-config master: Shut down openstack general, dev, ops and sigs mls  https://review.openstack.org/62125818:40
Shrewsmordred: vomit18:40
*** bhavikdbavishi has quit IRC18:41
mordredso when discovery doesn't work there and we fall back to parsing the version from the url, we do so by popping url segments to see if something is a version18:41
mordredand we match the project id18:41
clarkbhrm that command isn't quite what I expected18:41
*** ralonsoh has quit IRC18:41
mordredShrews: I *think* what we actually want to do is pass the project_id along into that function (we do a similar thing in other places) so we can do a test of "does this url segment match the current project id"18:41
mordredbut that'll take a few seconds to sort out18:42
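
The failure mode being described, boiled down to a standalone sketch (not the actual keystoneauth code): with the discovery document blocked, the version is guessed by popping URL segments, and a purely numeric project id like 610275 looks enough like a version number to win; passing the project id in lets the parser skip it.

    import re

    VERSION_RE = re.compile(r"^v?\d+(\.\d+)*$")

    def guess_version_from_url(url, project_id=None):
        # Walk segments from the right and return the first one that looks
        # like a version; optionally skip a trailing project id first.
        for segment in reversed(url.rstrip("/").split("/")):
            if project_id and segment == project_id:
                continue
            if VERSION_RE.match(segment):
                return segment
        return None

    url = "https://dfw.servers.api.rackspacecloud.com/v2/610275"
    print(guess_version_from_url(url))                       # 610275 -- the bug
    print(guess_version_from_url(url, project_id="610275"))  # v2 -- the intended answer
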
corvusclarkb: i don't know the answer to your question about caching18:42
corvuspabelanger: er i mean yours18:42
Shrewsmordred:  if we have that info, yes, i agree18:42
clarkbok I want ansible --list-hosts $group not ansible-playbook18:42
clarkbchecking a few groups they look correct to me18:43
pabelangercorvus: /var/cache/ansible/inventory should we also delete that now, I wasn't sure if removing yamlgroup was enough18:43
pabelangeror maybe that is a mordred question18:43
corvusi'll delete it18:43
mordredwe can delete /var/cache/ansible/inventory - but it shouldn't be being touched by anything18:43
mordredso it should be a no-op18:43
pabelangerack18:44
clarkbso ya things look good to me. I think next step is a kick.sh then turn cron back on18:44
corvusdeleted.  after doing that ansible localhost -m debug -a 'var=groups' still looks good18:44
pabelanger++18:44
clarkbansible --list-hosts $group still looking good too18:44
corvusi'll let someone else kick18:45
clarkbI'll kick.sh the logstash server since that has firewall rules that matter but if we break them impact is low18:45
clarkbthat is running now (though I didn't start screen)18:46
pabelangertail -f /var/log/ansible/ansible.log is what I am watching18:47
clarkbTASK [base-server : Set ssh key for managment] and TASK [puppet-install : Remove server] changed in the base server playbook18:47
clarkbthe second I'm not worried about. The first one is a little odd18:48
clarkband then the puppet run itself says changed. But overall this looks sane18:48
clarkbya the puppet changes are fine18:48
pabelangeragree, ansible looks to have run properly18:49
clarkband authorized keys don't look wrong18:49
clarkbunless there is another server or group people want to run against first I think we can reenable the cron18:49
corvusthe authorized_keys file contents are the same as on another host18:49
Shrewsmordred: hrm, we could determine it by place in the url (when split out by '/') too18:51
mordredwe could - but most of the time there is no project_id there18:51
Shrewsi think? not sure how standard that url is18:51
clarkbpabelanger: corvus: should I reenable the two cron jobs now? I'll wait until I get other confirmation18:51
mordrednova made it optional a few years ago18:51
mordredand it's now pretty much never there18:51
corvusclarkb: ++18:51
pabelangerclarkb: ++18:52
clarkbdone18:52
clarkbnext run for both is top of the hour18:52
pabelangerk, going to get fresh coffee18:52
* Shrews afk for 15m18:57
pabelangerback18:58
*** boden has joined #openstack-infra18:59
*** electrofelix has quit IRC18:59
clarkbI'm watching tail of the run all log file18:59
clarkbrunning now19:00
clarkbbase server is running now19:01
clarkbthe thing to watch for here is iptables imo19:01
clarkband if that is happy I think we are good19:02
*** wolverineav has quit IRC19:02
openstackgerritJeremy Stanley proposed openstack-dev/cookiecutter master: Update contact address to openstack-discuss ML  https://review.openstack.org/62126619:04
openstackgerritChris Dent proposed openstack-infra/project-config master: Set placement's gate queue to integrated  https://review.openstack.org/62126719:05
*** olivierbourdon38 has quit IRC19:05
*** wolverineav has joined #openstack-infra19:06
openstackgerritJeremy Stanley proposed openstack-dev/specs-cookiecutter master: Update contact address to openstack-discuss ML  https://review.openstack.org/62126919:09
mnaserfungi: glad my random digging into a timeout has resulted in finding a timeout root cause D:19:10
mordredinfra-root: https://review.openstack.org/621257 is a keystoneauth patch to fix the discovery issue. the openstacksdk update exposed the issue, which has been lurking there all along19:10
fungimnaser: well, we don't know the *root* cause yet, but yes your report of slow disk writes in ovh was a huuuuge help in turning the corner on that ongoing investigation. thanks!!!19:10
mnasereveryone wins D:19:11
clarkbansible still clean from where I'm sitting19:11
mordredif we want, we can protect ourselves from this issue between now and a ksa release by putting in compute_endpoint_override settings in our clouds.yaml - or we can ping openstacksdk on the nodepool launchers19:11
fungimnaser: "my jobs run slow and don't complete within the timeout" was a lot harder to track down19:11
mordreds/ping/pin/19:11
corvusmordred: oh, so we had best not restart the launchers without doing one of those, eh?19:12
mordredcorvus: yes. it would be bad - we'd lose all rackspace quota19:12
corvusmordred: maybe we should put the pin in nodepool?19:13
tobiash++19:13
*** mriedem_lunch is now known as mriedem19:13
corvusmordred: maybe you can propose that change since you know what numbers to type?19:13
mordredcorvus: yup. on it19:14
clarkbiptable files being installed now. Lots of Oks and no changed so far19:15
clarkbso that looks good19:15
kmallocclarkb, fungi: confirming. gerrit does openID not OIDC (as far as I can tell)19:16
fungikmalloc: i think it's just openid still, yes19:17
kmallocbahg19:17
kmallocok no problem19:17
clarkbiptables has completed and I see nothing amiss there. I think this is working as expected19:17
fungikmalloc: on a related note, have you seen lemonldap-ng?19:17
kmallocworking to ensure i have a working example SP for the ipsilon thing19:17
kmallocfungi: no, looking now19:17
fungihttps://lemonldap-ng.org/19:17
clarkbit will also do oauth? whatever google does19:17
corvuskmalloc: but related is https://github.com/davido/gerrit-oauth-provider19:17
kmallocoh neat19:18
fungiseems to have come out of a project to create a full sso suite for the french government19:18
kmalloccorvus: yeah i see19:18
kmalloccorvus: i saw that*19:18
kmallocfungi: that is pretty darn cool19:18
openstackgerritMonty Taylor proposed openstack-infra/nodepool master: Block 0.19.0 of openstacksdk  https://review.openstack.org/62127219:18
fungikmalloc: it just got packaged in debian this week, which is how it came to my attention19:18
fungiand it came up in discussions about replacement sso for debian.org19:18
* kmalloc nods19:19
Shrewsmordred: did that show up in 0.19 or 0.20?19:20
clarkbnow on to git and gerrit servers. I'm going to stop staring at it like a hawk now that base is done and seems to be happy. Also git and gerrit seem to actually just be git and gerrit as expected19:20
mordredShrews: gah. am I stupid?19:20
* Shrews refrains19:20
*** jamesmcarthur has quit IRC19:20
clarkbpabelanger: ^ see also here :P19:21
kmallocfungi: I'll poke at that thing as well. It is super interesting.19:21
*** jamesmcarthur has joined #openstack-infra19:21
openstackgerritMonty Taylor proposed openstack-infra/nodepool master: Block 0.20.0 of openstacksdk  https://review.openstack.org/62127219:21
mordredShrews: thanks19:21
fungikmalloc: yeah, i haven't delved deeply yet, but it looked like it could be relevant to our situation19:21
mriedemclarkb: fwiw i'm seeing e-r commenting on failed changes again19:24
mriedemdid you tickle it into submission?19:24
clarkbmriedem: fungi restarted it to reset the backlog timeout queue19:24
mriedemah19:24
clarkbmriedem: I also pushed a change to try and address it directly in the code by tieing the timeout to the event timestamp and not current time19:25
fungialso +2'd clarkb's proposed solution19:25
fungibut it could stand some more thorough review19:25
mriedemi saw but haven't reviewed that yet,19:25
mriedemwill tab queue it19:25
clarkbso once we get behind we won't wait for the whole timeout we'll just check once if the files are there, if yes yay if not move on19:25
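
The gist of that change as a sketch (not the literal elastic-recheck code): anchor the deadline to the event's own timestamp, so working through a backlog of old events degrades to a single check per event instead of a fresh full wait each time.

    import time

    TIMEOUT = 40 * 60  # seconds allowed for an event's logs to be indexed

    def wait_for_logs(event_timestamp, files_indexed, poll=60):
        # files_indexed is any zero-argument callable returning True once the
        # logs show up. The deadline is relative to when the event happened,
        # not to when we started looking at it.
        deadline = event_timestamp + TIMEOUT
        while True:
            if files_indexed():
                return True
            if time.time() >= deadline:
                return False
            time.sleep(min(poll, deadline - time.time()))
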
fungii'm not as familiar with that codebase as my +2 permissions might imply19:25
mriedemmaybe al calavicci will let mtreinish come back to help look for a sec too19:26
*** jamesmcarthur has quit IRC19:28
*** jamesmcarthur has joined #openstack-infra19:28
corvusal calavicci has an impressively detailed wikipedia bio19:29
*** jamesmcarthur_ has joined #openstack-infra19:29
clarkbcorvus: pabelanger I've not seen any unexpected behavior from inventory change. I think we can continue with nodepool/zuul work as soon as the sdk thing is handled19:32
pabelangeragree19:32
fungisome of stockwell's best work, though to me he'll always be wilbur whatley from the dunwich horror19:33
*** jamesmcarthur has quit IRC19:33
corvushandled means that nodepool change lands and we verify that we've downgraded sdk19:33
corvuscalavicci and cavil are strikingly similar names19:33
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support  https://review.openstack.org/52937619:34
openstackgerritMatt Riedemann proposed openstack-infra/elastic-recheck master: Better event checking timeouts  https://review.openstack.org/62103819:34
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS  https://review.openstack.org/52873919:35
mtreinishmriedem: link?19:36
clarkbmtreinish: https://review.openstack.org/#/c/621038/19:37
*** wolverineav has quit IRC19:37
*** wolverineav has joined #openstack-infra19:41
*** wolverineav has quit IRC19:42
*** wolverineav has joined #openstack-infra19:42
*** wolverineav has quit IRC19:42
*** wolverineav has joined #openstack-infra19:43
*** hamerins has quit IRC19:43
*** wolverineav has quit IRC19:43
*** wolverineav has joined #openstack-infra19:43
*** wolverineav has quit IRC19:44
clarkbhttps://www.opendev.org/ and https://opendev.org/ have working ssl now. Just need to add content there19:44
*** hamerins has joined #openstack-infra19:44
clarkbcorvus: ^ fyi and thanks!19:45
pabelangerawesome!19:45
fungicontent schmontent19:45
*** wolverineav has joined #openstack-infra19:45
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support  https://review.openstack.org/52937619:49
mtreinishmriedem, clarkb: +219:50
mordredclarkb, fungi, corvus: if you have a sec, mind popping some +A on https://review.openstack.org/#/c/621225/ and https://review.openstack.org/#/c/621226 ?19:51
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support  https://review.openstack.org/52937619:52
*** wolverineav has quit IRC19:52
clarkbmordred: I'm kind of surprised such a thing doesn't exist yet19:52
*** wolverineav has joined #openstack-infra19:53
mordredright?19:53
mordredfrom what I can tell, most people focus on using one or the other in their code and then using an exporter or translation service19:53
mordredclarkb: initial code sketch is here: https://review.openstack.org/62099019:53
corvusthey're vastly different in terms of how you model data; so you need a real abstraction layer if you're going to try to do both19:54
mordredyah19:54
clarkbmordred: ya in that space and tracing too it seems everyone uses one thing and then has a ton of very specific toling built around that19:54
mordredyup19:54
mordredwhich is great if you're writing a service to run in one and only one place19:55
corvus(and even so, the abstraction layer mordred proposes largely helps you think about both at the same time.  you still can't just forget about one or the other)19:55
corvusat least, that's how it registers in my brain19:55
mordredcorvus: yah, that's the idea19:55
mordred"this is a metric I want to collect. this is how it goes to prometheus. this is how it goes to statsd"19:56
* mordred waves hands19:56
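
A toy version of that idea (not the actual code in 620990): declare the metric once and fan it out to both backends, assuming the prometheus_client and statsd libraries.

    from prometheus_client import Counter
    import statsd

    class DualCounter:
        # Declare a counter once; emit to both Prometheus and statsd.
        def __init__(self, name, description, statsd_host="localhost"):
            self._prom = Counter(name, description)         # scraped by Prometheus
            self._statsd = statsd.StatsClient(statsd_host)  # pushed to statsd
            self._name = name

        def inc(self, amount=1):
            self._prom.inc(amount)
            self._statsd.incr(self._name, amount)

    requests_total = DualCounter("zuul_requests_total", "Requests handled")
    requests_total.inc()
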
Shrewsmordred: if you add a new thing, do you have to change the project name?  :)19:56
clarkbpromstatcchi19:56
mordredhah19:57
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS  https://review.openstack.org/52873919:58
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support  https://review.openstack.org/52937619:59
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-openstackci master: Add AFS mirror support for RHEL/CentOS  https://review.openstack.org/52873919:59
pabelangerclarkb: corvus: mordred: given the issues around bridge.o.o today and tailing ansible.log, where did the topic of running ara web on bridge end up? Do people see that actually living on bridge.o.o or some other server hosting the web for that? I know we talked about putting that into trove also.19:59
*** wolverineav has quit IRC19:59
dmsimardpabelanger: last I know we were waiting because bridge.o.o needed to be rebuilt on a larger instance or something along those lines20:00
clarkbI think we were going to run it with mostly the plan as is, then if we ever rebuild the bridge we can move it all internal20:00
clarkbfungi and ianw were far more on top of that than me though iirc20:00
pabelangerokay, yah. I might be able to find a little time to help with that. I was struggling a little today to look at ansible logs on bridge20:01
fungithe idea was we could run it on trove temporarily but when we get around to enlarging bridge.o.o we should make sure it's big enough to host its own database for that instead20:01
pabelangerassuming we did an audit of ARA for the public20:01
fungis/enlarging/rebuilding/20:01
dmsimardjust 2 cores and 2gb of ram on bridge.o.o indeed20:02
dmsimardis there anything stopping us from resizing it to a larger flavor ?20:02
corvusit's not actually large enough to do what it's already doing20:03
fungidmsimard: mainly the lack of a "resize" option in that cloud provider is what's stopping us20:03
*** rh-jelabarre has quit IRC20:03
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats  https://review.openstack.org/61630620:03
fungidmsimard: we need to rebuild it on a larger flavor instead, and i gather it might not be 100% under configuration management yet?20:03
pabelangeris that because we think ARA web will need more resources?20:04
fungior maybe it is by now but nobody's tried to build a new one yet outside integration tests20:04
clarkbfungi: ya the base server -> running ansible isn't managed yet? mordred is that the case?20:04
dmsimardpabelanger: it's not well sized to begin with20:04
openstackgerritTobias Henkel proposed openstack-infra/zuul master: Report tenant and project specific resource usage stats  https://review.openstack.org/61630620:04
corvusclarkb: well, the tests certainly suggest it's mostly there :)20:04
pabelangerI'm unsure what issues we are hitting today20:05
*** rh-jelabarre has joined #openstack-infra20:05
corvusbootstrapping it may still be a bit tricky.  we also need to copy all the non-managed stuff over (passwords, secrets, etc)20:05
clarkbpabelanger: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=65004&rra_id=all I think biggest concern is memory20:05
clarkbpabelanger: there are spikes that make it appear we should have more breathing room. Particularly to run a mysql20:06
*** eharney has joined #openstack-infra20:07
pabelangerack20:07
openstackgerritMerged openstack-infra/nodepool master: Block 0.20.0 of openstacksdk  https://review.openstack.org/62127220:08
corvusi'm going to grab lunch now; feel free to restart launchers after ^ is applied.20:09
clarkbI should get lunch too20:09
pabelangerclarkb: corvus: okay, I'll start with nl01 first20:09
pabelangerand confirm working20:09
pabelangerfungi: clarkb: have you seen the zoom flaw making the rounds? I think some of the openstack foundation staff use it for meetings: https://threatpost.com/critical-zoom-flaw-lets-hackers-hijack-conference-meetings/139489/20:11
*** jistr has quit IRC20:11
clarkbI hadn't but thats neat20:11
clarkbmrhillsman: ^ uses a bunch of zoom too20:11
pabelangera public exploit in the wild also20:11
funginifty20:11
fungii've only ever dialled into it from a telephone line, fwiw20:12
pabelangersame, figured I'd share it in case you wanted to pass it along to presenters20:12
clarkbpabelanger: I've done so, thanks!20:12
fungiyep, thanks!20:12
*** jistr has joined #openstack-infra20:13
*** jamesmcarthur_ has quit IRC20:14
*** jamesmcarthur has joined #openstack-infra20:15
pabelangerclarkb: corvus: 621272 has landed on nl01, and confirmed openstacksdk 0.19.0 is installed20:18
pabelangergoing to proceed with restart in a moment20:18
mordredwoot!20:19
*** xek has joined #openstack-infra20:19
mordredclarkb: yeah - what corvus said earlier - bridge itself is *mostly* managed, but the secrets and passwords file is manual20:20
*** jamesmcarthur has quit IRC20:20
mordredso the bootstrap is ... fun20:20
pabelangerrestarted20:20
pabelangerbut there looks to be an issue20:20
pabelangerkubernetes.config.config_exception.ConfigException: Invalid kube-config file. Expected object with name default in kube-config/contexts list20:20
pabelangerguess we landed a nodepool change?20:20
tobiashpabelanger: either that or you landed a kube config20:21
mordredpabelanger: yah - we have a test kubernetes in vexxhost and a kube config for it20:21
openstackgerritMerged openstack-infra/project-config master: Create promstat project  https://review.openstack.org/62122520:21
openstackgerritMerged openstack-infra/project-config master: Add promstat project to Zuul  https://review.openstack.org/62122620:21
kmallocfungi: so looking at lemonldap-ng. it's cool. it is also very very very very perl-ism focused.20:21
kmallocfungi: we might be able to use it for our case over ipsilon20:21
fungiyeah, i saw it was perlish20:22
kmallocfungi: it's URL-regex and blob-o-json configured20:22
fungiahh20:22
kmallocso it's per URL, it's a little weird.20:22
mordredpabelanger: https://review.openstack.org/#/c/620756/ didn't land yet20:22
Ngreally? another -ng project? :(20:22
kmallocbut i can see it's very feature rich20:22
kmallocNg: yep.20:22
* mordred waves to the Ng20:22
Ngdangit20:22
fungiNg: i thought you wrote all of those20:22
Nghey mordred20:22
pabelangermordred: yah, I think this is something with default config, I don't see anything in nodepool.yaml20:22
mordredpabelanger: https://review.openstack.org/#/c/620755/ added the kube config20:22
kmallocthat said, we can totally do something more usable / api driven20:23
Ngfdegir: haha20:23
kmalloclong term20:23
pabelangeroh20:23
kmalloci'll consider ipsilon vs lemonldap for the transition stuff20:23
fungikmalloc: yep, as a canned thing i thought it might at least be an interesting alternative which is actually actively maintained still20:23
kmallocsince we have minimal things that need to be protected.20:23
kmallocexactly20:23
pabelangermordred: http://paste.openstack.org/show/736524/ is traceback20:23
kmallocthough the concept / mission of ipsilon maps more directly to what we do (even unmaintained)20:24
Shrewsoh weird. i just liked one of Ng's twitter posts today. crazy coincidence20:24
pabelangermordred: I am going to try and remove the file manually for now to see if nodepool-launcher starts20:24
*** eernst has joined #openstack-infra20:24
NgShrews: oh yeah, I saw that. I think I'd forgotten that IFTTT was posting those for me ;)20:24
pabelangermordred: okay, renaming it to config.bak causes nodepool-launcher to start properly20:25
Shrewspabelanger: i see an issue20:28
Shrewspabelanger: related to tobiash's nodecachelistener20:28
*** eernst has quit IRC20:29
pabelangerShrews: yes, was just about to say it doesn't look happy20:29
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Remove updating stats debug log  https://review.openstack.org/62128320:29
Shrewshttp://paste.openstack.org/show/736525/20:29
Shrewstobiash: ^^20:29
pabelangeryah, we haven't launched a new node yet since restarting20:30
pabelangerI believe we should roll back to the previous version that was running20:30
*** eernst has joined #openstack-infra20:30
tobiashShrews: is that exception recurring?20:31
pabelangerokay, we appear to be launching nodes now: http://grafana.openstack.org/dashboard/db/nodepool-rackspace20:32
mordredcorvus: ping just in case you didn't see the k8s config related traceback20:32
*** wolverineav has joined #openstack-infra20:33
*** slaweq has quit IRC20:34
pabelanger2018-11-30 20:33:05,859 INFO nodepool.driver.NodeRequestHandler[nl01-13954-PoolWorker.rax-ord-main]: Not enough quota remaining to satisfy request 200-000058377320:34
pabelangerI am unsure if that is correct or not, looking at grafana, I do see space to launch more nodes20:34
Shrewstobiash: i'm not seeing more atm20:34
*** eernst has quit IRC20:35
pabelangerso far, we haven't launched a new node in rax-ord / rax-iad since the restart, but rax-dfw is now bringing nodes online20:36
tobiashShrews: maybe we should catch exceptions there and log a warning/error together with the event data and path20:36
tobiashShrews: I think this might be an event we actually don't want to process20:37
*** eernst has joined #openstack-infra20:37
tobiashpabelanger: do you have mode context around the quota log?20:37
tobiashs/mode/more20:38
pabelangertobiash: not yet, still looking into logs20:38
Shrewstobiash: you need the numbers?20:39
pabelangertobiash: for example: http://paste.openstack.org/show/736526/20:39
openstackgerritMerged openstack-infra/project-config master: Temporarily disable ovh-bhs1 in nodepool  https://review.openstack.org/62125020:40
*** eernst has quit IRC20:41
tobiashpabelanger, Shrews: there seems to be something off in the quota calculation20:41
tobiashpabelanger, Shrews: maybe one of the patches that landed today20:41
clarkbone of corvus' changes modified how we account for nodes outside of nodepool20:42
clarkb(to handle leaks better)20:42
mrhillsmanthx clarkb20:42
clarkb(I'm not really here yet still finishing up lunch)20:42
tobiashShrews, pabelanger: probably this one: https://review.openstack.org/62104020:42
tobiashthat would explain why we get a predicted quota of -50 when launching one node20:42
*** eernst has joined #openstack-infra20:43
*** hwoarang has quit IRC20:44
pabelangerhmm, let me check if we have leaked nodes20:44
Shrewstobiash: pabelanger: corvus: hrm, we should be using zk.getNodes() instead of the nodeIterator in that change20:45
openstackgerritMatt Riedemann proposed openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123  https://review.openstack.org/62128520:45
openstackbug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/180612320:45
tobiashShrews: I think the negation here is wrong: https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n19720:45
Shrewsbut that's unrelated to the quota thing20:46
Shrews(my suggestion, that is)20:46
*** yamamoto has joined #openstack-infra20:46
tobiashShrews: no, if that's wrong, it increases the unmanaged usage20:46
tobiashShrews: so nodepool thinks it has -50 instances left and the rest is blocked with instances it doesn't manage20:47
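A simplified illustration of the effect being described (this is not the actual provider.py code, and the metadata key name is assumed): inverting the membership test makes every nodepool-managed server count as unmanaged, which consumes the whole quota and leaves a negative prediction.

    def unmanaged_instance_count(all_servers, nodepool_node_ids):
        unmanaged = 0
        for server in all_servers:
            node_id = server.get('metadata', {}).get('nodepool_node_id')
            # Correct behaviour: skip servers nodepool already accounts for.
            # The buggy variant skips the servers it does NOT know about,
            # so every managed instance lands in the unmanaged bucket.
            if node_id and node_id in nodepool_node_ids:
                continue
            unmanaged += 1
        return unmanaged

    servers = [{'metadata': {'nodepool_node_id': '0000844533'}},
               {'metadata': {}}]                    # one genuinely leaked server
    print(unmanaged_instance_count(servers, {'0000844533'}))  # -> 1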
pabelangerwell, rax-ord does have a few leaked nodes20:47
pabelangerbut, unsure why nodepool isn't deleting them20:47
*** eernst has quit IRC20:48
pabelangerfor example20:48
pabelangerhttp://paste.openstack.org/show/736527/20:48
*** eernst has joined #openstack-infra20:49
pabelangerthat one is missing metadata20:49
*** openstackgerrit has quit IRC20:50
*** yamamoto has quit IRC20:50
*** openstackgerrit has joined #openstack-infra20:50
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation  https://review.openstack.org/62128620:50
*** hwoarang has joined #openstack-infra20:51
*** fried_rolls is now known as fried_rice20:51
fungiokay, i'm taking a break from mailing list combining work to go obtain sustenance. will return asap20:51
tobiashShrews, pabelanger: I think this should fix it ^20:51
tobiashcorvus: ^20:53
pabelangeroh20:53
*** takamatsu has joined #openstack-infra20:53
*** eernst has quit IRC20:53
tobiashwe also could revert 621040, fix that and re-propose it with a test case20:54
Shrewstobiash: corvus: pabelanger: if the instance isn't getting the meta properties set correctly, nodepool will never delete it.20:55
pabelangerso we have 2 issues right now with nl01: quota calculation is off, and the kube config is broken20:55
Shrewshttps://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/provider.py#n47420:55
pabelangerShrews: yah, I don't know why that is right now.  For now, I'm going to manually clean them up after current issue is addressed20:55
pabelangerthe missing metadata I mean20:55
Shrewspabelanger: you'll probably want to back off the upgrade for now20:57
mordredpabelanger: "Expected object with name default in kube-config/contexts list" seems to indicate something is unhappy with the "current-context: default" line20:57
pabelangerShrews: yah, I can stop and roll back20:57
pabelangerlet me check which version the others are running20:57
corvusback/catching up20:59
clarkbI'm back now too21:00
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Update the current-context to valid context  https://review.openstack.org/62128721:01
mordredcorvus, pabelanger: ^^ I **think** that should fix the kube config error21:01
mordredbut I'm also sort of just shooting at pickles in a barrel21:01
mordredthat was based on reading https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/ fwiw21:01
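A quick local check of the same constraint the kubernetes client is enforcing, without restarting anything; the config path below is a placeholder:

    import yaml

    KUBE_CONFIG = '/path/to/.kube/config'   # placeholder path

    with open(KUBE_CONFIG) as f:
        cfg = yaml.safe_load(f)

    names = [c.get('name') for c in cfg.get('contexts') or []]
    current = cfg.get('current-context')
    if current and current not in names:
        print('current-context %r not found in contexts %r' % (current, names))
    else:
        print('current-context is consistent (or unset/empty)')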
pabelangerShrews: corvus: clarkb: I think we are safe to revert back to 3.3.1, but waiting for somebody to confirm21:02
pabelangerI also have to duck out for 10mins to wait for school bus, but will be back online to assist21:02
clarkbpabelanger: ya I don't think any of the changes required the updates to the zk schema21:02
clarkbso we should be safe to go back to older version21:02
clarkbpabelanger: maybe just do that manually by hand on nl01 then we get tobiash's fix in and try again later today or monday21:03
pabelangerclarkb: okay, can I pass the revert step to you now?21:03
clarkbpabelanger: on the server or getting the code changes in gerrit side? I can do either but want to make sure I know what you are doing too :)21:03
clarkb*either or both21:04
pabelangerclarkb: yes, the revert of nl01.o.o to 3.3.1.21:04
pabelangerI can do it, but will need about 10mins21:04
corvusregarding the current-context -- that's really disappointing that the kubernetes lib requires that even though we don't use it.  maybe we should default a bogus default context -- because default-context makes no sense in our multi-cloud case.21:04
pabelangeron server, I should say21:04
tobiashcorvus: sounds valid21:05
clarkbpabelanger: I'm ssh'ing in now21:05
pabelangerokay, afk for 10mins...21:05
mordredcorvus: I think we can set it to ''21:05
*** kjackal has joined #openstack-infra21:06
clarkbpuppet just ran so I have ~30 minutes to get this done.21:06
clarkbI am pip installing 3.3.1 now21:06
mordredcorvus: which seems to be a valid option based on that document21:06
clarkbthat sound right to everyone?21:06
clarkbbased on nl02 that looks right to me21:07
clarkbhahahaha ok21:07
clarkbwe don't have the openstacksdk exclusion on 3.3.121:08
corvusi will direct-enqueue both of those changes21:08
clarkbso I've got to manually install sdk afterwards21:08
clarkb(just a note for anyone else trying to do similar later)21:08
*** rlandy has quit IRC21:08
*** wolverineav has quit IRC21:08
*** hongbin has joined #openstack-infra21:08
clarkbnl01 restarted running nodepool==3.3.1 and openstacksdk==0.19.021:09
corvusboth changes are in gate21:10
hongbinhi folks, i have a patch to modify the zuul job: https://review.openstack.org/#/c/619642/ but consistently getting 'RETRY_LIMIT' error for some jobs, want to get helps to resolve it21:10
clarkbhongbin: RETRY_LIMIT happens when the job fails in the job pre run stage21:10
hongbinclarkb: i see, any idea about how to get the logs to see what is wrong?21:11
clarkbhongbin: failures in pre-run are retried up to three times until we report back RETRY_LIMIT. The reason for this is that we expect those things to always pass as they shouldn't test code, just be setup21:11
tobiashclarkb: with finger links that fails very early21:11
clarkbtobiash: ya interesting21:11
clarkbhongbin: in this case I think this means it's failing very early so we don't get logs. One way to debug that is to catch them with the streaming logs from the status page while they happen21:12
clarkbanother is we can go look in the zuul logs and see what that tells us21:12
openstackgerritMatt Riedemann proposed openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126  https://review.openstack.org/62128821:12
corvusit can mean that the post playbook fails (possibly because the host is hosed)21:12
openstackbug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/180612621:12
tobiashclarkb: what we did to improve that is to add a base-logs job as a parent to the job 'base' which only contains the log upload post playbook21:12
corvustobiash: you don't need to use inheritance for that; post-playbooks are separable21:13
clarkbcorvus: considering that this is making changes to networking in devstack that wouldn't surprise me21:13
tobiashcorvus: does it now run all post playbooks even if the pre playbook of the same job failed?21:13
corvustobiash: so as long as the last post playbook in base collects the logs, it will run regardless of anything before it21:13
corvustobiash: i think so21:14
tobiashah cool21:14
hongbinok, let me recheck it and try to catch the streaming logs, thanks for the hint21:14
clarkbhongbin: it might help us debug, if we understand what you are trying to do with all of those api extensions21:15
AJaeger_fungi, I see you updating openstack-discuss - want to review https://review.openstack.org/619216 to handle infra-manual, please?21:15
clarkbhongbin: my best guess at this point is one of those extensions results in either a broken network stack or firewall rules that prevent zuul from talking to the test node21:15
corvustobiash: yeah, just double checked.  run won't run, but post will.21:15
tobiashcorvus: cool, thx21:15
AJaeger_hongbin: make a smaller change first - only change YAML with the goal to do exactly the same setup as before for a single job. And then iterate on it21:16
hongbinclarkb: i have two lists of extensions, and in the zuul job config, combine those two lists and write them to the devstack config file21:16
tobiashmaybe that has changed when we separated it, or I just misunderstood that21:16
clarkbhongbin: right, but what do those extensions do?21:16
hongbinthose are just two list of string in yaml21:17
clarkbhongbin: but they change devstack behavior somehow right?21:17
clarkb(and that changes neutron's behavior)21:17
hongbinthe neutron behavior is not changed, since we should pass exactly the same list to devstack21:18
AJaeger_hongbin: so, I would do it as follows: 1) Use yaml anchors and create exactly the same job as today. 2) Update the values21:18
pabelangerand back, sorry about that. #dadops21:18
pabelangerclarkb: thanks for reverting21:18
clarkbpabelanger: no worries. want to check on nl01? I think it should be running again21:18
pabelangersure21:19
AJaeger_hongbin: oh, I see - that list is long and you copy it over21:19
AJaeger_hongbin: so, looks like something is wrong in that setup.21:19
pabelangerclarkb: grafana.o.o looks much better21:19
hongbinNETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}"21:19
hongbinAJaeger_: yes, possibly21:20
clarkbya I wasn't sure that would work when it was first suggested because this side of the config is in zuul not in ansible21:20
clarkbzuul doesn't do jinja221:20
hongbinfor jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions | join(',') }}" , it succeeded21:21
hongbinfor jobs that are doing NETWORK_API_EXTENSIONS: "{{ network_api_extensions_common + network_api_extensions_tempest | join(',') }}" , it failed21:21
hongbinso i guess the usage of "+" to combine lists won't work?21:21
pabelangercorvus: clarkb: tobiash: Shrews: do we have an idea what is happening with http://paste.openstack.org/show/736525/21:22
logan-hongbin: "{{ (network_api_extensions_common + network_api_extensions_tempest) | join(',') }}"21:22
hongbinlogan-: ack, let me try that21:22
clarkbpabelanger: looks like the string was '' and not valid json21:22
hongbinlogan-: thanks for the advice21:22
clarkbpabelanger: I don't know why though21:23
pabelangerclarkb: that was the only other thing I noticed on startup21:23
logan-clarkb: yep, it works, because zuul just dumps the uninterpreted jinja into the job inventory, which ansible then consumes and interprets :)21:24
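A minimal reproduction of the precedence issue, assuming the variables are plain lists by the time Ansible renders the inventory (the extension names here are just example strings):

    from jinja2 import Template

    common = ['address-scope', 'agent']          # example values only
    tempest = ['dns-integration']

    good = Template("{{ (network_api_extensions_common +"
                    " network_api_extensions_tempest) | join(',') }}")
    print(good.render(network_api_extensions_common=common,
                      network_api_extensions_tempest=tempest))
    # -> address-scope,agent,dns-integration

    # Without the parentheses the filter binds only to the second list, so
    # Jinja ends up evaluating list + string and raises TypeError at render.
    bad = Template("{{ network_api_extensions_common +"
                   " network_api_extensions_tempest | join(',') }}")
    try:
        bad.render(network_api_extensions_common=common,
                   network_api_extensions_tempest=tempest)
    except TypeError as exc:
        print('render failed:', exc)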
*** kgiusti has left #openstack-infra21:25
clarkbpabelanger: I think we want to look for any cases we might write an empty string to zk21:25
corvuspabelanger, clarkb, tobiash: we could look in zk to see if there are any node records with empty data21:26
clarkbbut I haven't followed any of the nodepool cache stuff21:26
clarkbso this is all new to me21:26
tobiashclarkb, pabelanger: we should catch this exception and log better data in this case, maybe it's just an event we don't want to process but didn't filter correctly21:26
pabelangercorvus: good idea, since this happened on startup, possibly existing node-requests are missing data for some reason?21:28
pabelangertobiash: +121:28
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events  https://review.openstack.org/62129221:28
corvustobiash, clarkb, pabelanger: ^21:28
clarkbtobiash: these act as watches on the "filesystem" tree? and we reconcile our local datastructures when they change? making sure I understand the basics here21:28
corvusclarkb: yep21:29
tobiashpabelanger: it's nodes, not node-requests in the exception21:29
clarkbthen we don't need to read from zk every time we need data; we just trust the cache is up to date because it is being reconciled. Got it21:29
corvusclarkb: correct. we *do* read an extra time after we lock nodes, just to make sure everything is in sync.21:29
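For orientation, a bare-bones version of that pattern using kazoo's TreeCache, including exception logging like the fix just proposed (621292) and a guard against empty znode data; the ZooKeeper host and tree path are placeholders:

    import json
    import logging

    from kazoo.client import KazooClient
    from kazoo.recipe.cache import TreeCache, TreeEvent

    log = logging.getLogger('example.zk')
    nodes = {}   # local cache: node id -> parsed node data

    def node_cache_listener(event):
        try:
            if event.event_type not in (TreeEvent.NODE_ADDED,
                                        TreeEvent.NODE_UPDATED,
                                        TreeEvent.NODE_REMOVED):
                return
            path, data, stat = event.event_data
            node_id = path.rsplit('/', 1)[-1]
            if event.event_type == TreeEvent.NODE_REMOVED:
                nodes.pop(node_id, None)
            elif not data:
                # bare znode with no JSON payload; nothing useful to cache
                return
            else:
                nodes[node_id] = json.loads(data.decode('utf8'))
        except Exception:
            log.exception("Exception in cache update for event: %s", event)

    client = KazooClient(hosts='zk01.example.com:2181')
    client.start()
    cache = TreeCache(client, '/nodepool/nodes')
    cache.listen(node_cache_listener)
    cache.start()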
pabelangertobiash: thanks!21:29
corvusi've direct-enqueued that too21:30
clarkbreading the code we treat node added and node updated the same. Wouldn't surprise me if we create an empty node then write to it later21:31
clarkbbut the listener then races its cache update against the update happening?21:31
clarkbthough I thought storeNode would write json with keys just empty values if there is no data21:32
clarkbso maybe this isn't that21:32
clarkbin any case logs ++21:32
pabelangerand rax looks to be back at capacity now21:33
corvusi'm hoping the event structure converts to strings in a useful way21:33
corvus(ie, i hope it looks like "<Event path:foo/bar data:...>")21:33
openstackgerritMerged openstack-infra/nodepool master: Fix leak detection in unmanaged quota calculation  https://review.openstack.org/62128621:34
*** kjackal has quit IRC21:34
clarkbcorvus: and hopefully ADDED/UPDATED/DELETED types are included too21:34
corvusoh, heh, it's a subclass of tuple21:35
corvushttps://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/cache.html#TreeEvent21:35
corvusso we should be able to see what we need to21:35
*** wolverineav has joined #openstack-infra21:35
openstackgerritMerged openstack-infra/system-config master: Update the current-context to valid context  https://review.openstack.org/62128721:35
corvusso next question is, what's the format of event_data21:36
tobiashNodeData inherits from tuple21:38
*** kjackal has joined #openstack-infra21:38
corvusthe log is the only outstanding fix now21:38
corvusit's probably worth waiting for before we attempt another restart21:38
tobiashyes21:38
clarkbcorvus: ya I think so. we'll just end up restarting it again to debug that anyway (likely)21:39
clarkbalso suggestion: we use nl04 for next restart since we've disabled bhs1 anyway. That will have low impact21:39
pabelanger++21:39
tobiashnow that I searched for it I see the same exceptions in my log21:39
openstackgerritMonty Taylor proposed openstack-infra/project-config master: Update promstat to use storyboard  https://review.openstack.org/62129321:40
corvusdrat, pep8 failure21:47
*** boden has quit IRC21:47
clarkbah the exception indentation?21:47
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Log exceptions in cache listener events  https://review.openstack.org/62129221:48
corvusclarkb: nope, a legit thing.  omitted a _21:48
clarkboh ya diff shows that.21:48
clarkbI single approved it if you want to reenqueue21:48
clarkb(its a trivial fix)21:49
corvusdone21:49
clarkb(fwiw I'm really excited about the relative priority stuff)21:49
corvusme too!21:50
corvuscombine that with splitting zuul into its own tenant, and we should be able to merge changes at least this fast with no admin access needed21:50
*** tpsilva has quit IRC21:51
tobiashcorvus, clarkb: 2018-11-30 21:47:58,587 ERROR nodepool.zk.ZooKeeper: Exception in node cache update for event: (0, ('/nodepool/nodes/0000235218', b'', ZnodeStat(czxid=8628450713, mzxid=8628450713, ctime=1543562070559, mtime=1543562070559, version=0, cversion=1, aversion=0, ephemeralOwner=0, dataLength=0, numChildren=1, pzxid=8628450714)))21:51
corvus(because we can drop the clean-check pipeline requirement)21:51
tobiashso we get empty data from some nodes21:51
tobiashchecking now if this is a cache setup thing or really like this in zk21:51
corvus0 is node_added21:51
clarkbtobiash: any idea what the 0 there is? is that the type?21:51
clarkboh cool so my hunch was maybe right?21:51
tobiash0 is the event type21:51
clarkbin that case we probably want to say if type == added and data or type == updated and data21:52
clarkbor similar there21:52
tobiash0 means node added21:52
tobiashconfirmed, that node is really empty in zk21:53
tobiashno idea yet what this means21:53
corvuslooking at storeNode, there should be no case where we set empty data21:53
pabelangercorvus: drop clean-check pipeline requirement? Could you explain more?21:54
clarkbcorvus: tobiash could it be that creating and writing data in zk is not atomic?21:54
clarkbcorvus: tobiash so as events go we get one first for the node creation then for the update?21:54
tobiashclarkb: no, I restarted nodepool with the patch and it read the nodes that were already in the system21:54
corvuspabelanger: in the good old days, you used to be able to approve a change and have it gated before check results arrived.  i believe that's a way better way to run a system, if you can trust people to behave.  we can not, in openstack, so we require check results before approving in gate.21:55
pabelangercorvus: interesting21:56
pabelangerlooking forward to seeing it in action :)21:57
corvusit's like all these direct-enqueues i'm doing, but anyone can do them21:57
openstackgerritMerged openstack-infra/elastic-recheck master: Better event checking timeouts  https://review.openstack.org/62103821:57
pabelangercorvus: yah21:57
pabelangercool21:57
*** wolverineav has quit IRC21:59
clarkbtobiash: so maybe in this case we can just ignore those events?21:59
clarkbas they aren't currently processed and lack useful info/data21:59
tobiashyes21:59
corvustobiash: but if the node is still empty in zk... what does that mean?22:00
tobiashcorvus, clarkb: I just looked into our zk data and we have many such nodes22:00
corvusif it were as clarkb suggests: a two-phase process -- create then write, you would expect to have data in there by the time you got around to looking at it22:00
corvusbut to still have an empty string is puzzling22:01
tobiashand judging by the numbers really old ones22:01
corvustobiash: any mention of that node id in logs?22:01
tobiashcorvus: I need to find such a node that is still in my logs22:01
corvusi'm setting up zk_shell so i can poke too22:02
clarkbfwiw our node count is in the 15k range which is about where it was when I monitored the new zk cluster transition22:03
clarkbI don't think we are leaking nodes at an appreciable rate22:03
corvuswe do have lots of old nodes with empty data22:03
corvuslatest node id is 0000844533.  but 0000470788 exists and is empty22:04
clarkbhuh22:04
*** wolverineav has joined #openstack-infra22:04
corvus0000819605 is recent-ish and empty22:05
tobiashcorvus: http://paste.openstack.org/show/736529/22:06
tobiashcorvus: nothing special here22:06
clarkbcould it be a quorum thing?22:06
corvustobiash: ditto: http://paste.openstack.org/show/736530/22:06
clarkbI would expect that to be fairly quick cleanup though, not long term22:07
clarkbcorvus: maybe check by connecting to other zk nodes to see if they have the same empty structure?22:07
corvusclarkb: zk01 and zk02 report the same22:08
tobiashcorvus: is znode deletion recursive?22:08
corvustobiash: i don't know -- are you noting that there is a lock file under it too?22:08
tobiashcorvus: the node I looked at still has an empty lock child node22:08
corvusthis seems like a very likely possibility22:09
clarkbwould that imply the node is still locked? nodepool list should show us that info22:09
clarkb| 0000844533 | inap-mtl01          | ubuntu-xenial          | 8067566c-765c-4165-a90e-0adf29e80f1b | 198.72.124.158  |                                        | in-use   | 00:00:05:31 | locked   |22:10
clarkbseems like yes22:10
tobiashhrm, we do a recursive delete22:11
clarkber wait I got the wrong node to grep for there22:11
clarkbugh friday22:11
clarkb0000819605 is what we want to check for22:11
clarkbthat one does not show up22:12
*** wolverineav has quit IRC22:12
tobiashclarkb: yes, because nodepool has an if data: clause when getting the node22:12
clarkbha22:13
*** wolverineav has joined #openstack-infra22:13
tobiashso nodepool hid that before the caching patches22:13
clarkbya22:14
corvussomeone still holds the lock on 000081960522:14
clarkbdo the lockers drop an id of some sort on the node to indicate who has it?22:14
corvusit's /nodepool/launchers/nl03-13009-PoolWorker.vexxhost-sjc1-main22:14
clarkbfungi: mriedem fyi I'm going to restart elastic-recheck bot now and we can see if my change works22:15
tobiashat least on my leaked node there is no lock anymore, but still the empty lock child22:16
clarkbmriedem: fungi nevermind Nov 30 22:13:30 status puppet-user[17804]: (/Stage[main]/Elastic_recheck/Exec[install_elastic-recheck]/returns)     ImportError: No module named docutils.core22:17
corvustobiash: how did you determine there's no lock?22:17
clarkbthat prevents us from installing my change22:17
tobiashdump with grep22:17
corvustobiash: hrm.  that's what i did.22:17
tobiashor is that not enough?22:17
openstackgerritMerged openstack-infra/nodepool master: Log exceptions in cache listener events  https://review.openstack.org/62129222:17
corvustobiash: that should be enough22:17
*** hwoarang has quit IRC22:18
*** wolverineav has quit IRC22:18
corvustobiash: i see: http://paste.openstack.org/show/736531/22:19
*** wolverineav has joined #openstack-infra22:19
tobiashcorvus: ok, so if there is a lock the lock child has another child22:19
tobiashthat's not the case in my example, but it might have been the case earlier22:20
tobiashI restarted my launcher to pickup the logging change so the launcher cannot have any lock now22:20
tobiash(I have only one launcher atm)22:21
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Log exceptions deleting ZK nodes  https://review.openstack.org/62130122:21
corvustobiash: ah, so the ephemeral lock may just turn into an empty dir after restarts22:22
clarkbhttps://pagure.io/python-daemon/issue/18 <- that is really frustrating22:22
tobiashyes22:22
corvustobiash, clarkb: there's a code path where we can end up without an exception report; see change ^22:22
tobiash+222:22
clarkbcorvus: looking22:22
clarkbcorvus: looking at your paste, that was a dump of ephemeral nodes in zk? and the two nodes being under the same item means that that launcher holds that lock? Seems like maybe we want to record that info in the lock itself?22:23
corvustobiash, clarkb: i think we've established this bug as "annoying but mostly harmless" and can probably proceed with restarts and debug this in parallel...22:24
clarkbcorvus: I agree22:24
tobiash++22:24
pabelanger++22:24
corvusclarkb: your understanding of the ephemeral nodes stuff is correct, but i don't want to mess with the kazoo lock recipe; this is workable enough i think.22:25
corvus(*any more -- i've already undertaken one improvement to kazoo locks)22:25
clarkbcorvus: ok, this might be a thing we add to operational docs then22:25
clarkb(this is how you find who owns a lock type thing)22:26
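Roughly, the check being described, using kazoo directly (host and node id are examples): a locked node has one or more ephemeral children under its lock/ znode, and each child's data carries the contender's identifier.

    from kazoo.client import KazooClient

    client = KazooClient(hosts='zk01.example.com:2181')
    client.start()

    lock_path = '/nodepool/nodes/0000819605/lock'
    contenders = client.get_children(lock_path)
    if not contenders:
        print('unlocked; the empty lock/ znode is just a leftover')
    for child in sorted(contenders):
        data, stat = client.get('%s/%s' % (lock_path, child))
        print('contender: %s' % data.decode('utf8'))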
corvusclarkb: as long as we bury it deep, deep inside the nodepool developer docs.  i've been fighting very hard to avoid the perception that a nodepool operator needs to know anything about zk.  this is something only a nodepool dev should ever have to do, and only when nodepool is broke.22:27
clarkbcorvus: that's fair. Things like why did zuul leak this lock etc22:27
corvuspabelanger: you want to go ahead and restart a launcher?22:28
*** wolverineav has quit IRC22:28
clarkbnl04 is my suggestion22:28
pabelangerI can yes, nl04 this time?22:28
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Don't update caches with empty zNodes  https://review.openstack.org/62130522:28
clarkband double check they have the fix for the quota calculations installed22:28
pabelangerack, checking now22:29
pabelanger(nl04)22:29
*** wolverineav has joined #openstack-infra22:29
pabelangerf8d20d6 is current version of nodepool22:30
pabelangerI think we want the next 2 right22:30
corvuspabelanger: f8d20d6 is okay22:30
pabelangerk22:30
clarkbya we don't need those logs in place since tobiash was helpful and got them for us22:31
pabelangerokay, restarting nl04 now22:31
corvusyep, and next debug step there is changes that haven't merged yet22:31
pabelangernodepool-launcher started22:32
pabelangermore exceptions at the beginning, expected (json)22:32
tobiashI have around 200 of these leaked znodes22:33
openstackgerritMerged openstack-infra/elastic-recheck master: Add query for nova i/o semaphore test race bug 1806123  https://review.openstack.org/62128522:33
openstackbug 1806123 in OpenStack Compute (nova) "i/o concurrency semaphore test changes are racy" [High,Confirmed] https://launchpad.net/bugs/180612322:33
pabelangerI can see we are launching nodes in ovh-gra2122:33
pabelangerovh-gra1*22:33
clarkbfungi mriedem I think newer pip may fix this? I'm going to try updating pip on that host to see22:35
pabelangerclarkb: corvus: tobiash: nl04 looks to be working as expected now22:35
tobiash:)22:35
pabelangerI no longer see no quota22:36
pabelangerand confirmed we've launched a few nodes22:36
*** slaweq has joined #openstack-infra22:36
pabelangerI think that means we can proceed with other nodepool-launchers, waiting for confirmation22:37
clarkbmriedem: fungi ok that didn't change it, but everything else should be the same between local system and remote. So I'm confused22:37
clarkbpabelanger: I usually like to wait long enough for deletes to happen if they don't happen on startup22:37
pabelangerclarkb: sure, let me confirm22:37
clarkbpabelanger: basically make sure we handle create and delete, but ya if both those look good I think you can restart the others22:37
*** xek has quit IRC22:37
clarkbmriedem: fungi maybe a setuptools thing, checking that next22:38
clarkbthat was it22:39
clarkbrestarting elastic-recheck now22:39
pabelanger2018-11-30 22:39:14,760 DEBUG nodepool.StatsWorker: Updating stats22:39
pabelangerthat doesn't really seem helpful22:39
pabelangerand looks newish22:39
tobiashpabelanger: https://review.openstack.org/62128322:40
pabelangerthere we go, just deleted some nodes22:40
pabelangertobiash: +3, thanks22:40
pabelangerclarkb: corvus: okay, I think we are good to proceed to other launchers22:41
corvuspabelanger: ++22:41
tobiashthat was useful during development22:41
clarkbpabelanger: go for it22:41
mriedemclarkb: i knew you could do it22:42
*** slaweq has quit IRC22:43
clarkbmriedem: heh, in any case can you keep your eye out for it leaving comments/reports on irc?22:43
pabelangerokay, all launchers have been restarted22:43
mriedemclarkb: not sure which channel those are in, except qa i guess22:43
pabelangerlooking at logs now to confirm things are working properly22:43
*** hwoarang has joined #openstack-infra22:44
pabelangercorvus: clarkb: okay, I don't see any obvious issues. We are still creating / deleting nodes properly now22:45
corvusready for the zuul part of this?22:45
clarkbcorvus: I am if you are22:45
openstackgerritMerged openstack-infra/elastic-recheck master: Add query for libvirt functional evacuate test bug 1806126  https://review.openstack.org/62128822:46
openstackbug 1806126 in OpenStack Compute (nova) "LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests race fail" [High,Confirmed] https://launchpad.net/bugs/180612622:46
pabelanger++22:46
tobiashcorvus: just found this in my new nodepool: http://paste.openstack.org/show/736532/22:46
tobiashcorvus: related to the openstacksdk release?22:46
*** hamerins has quit IRC22:46
tobiashthe flavor used to be a munch, now it seems to be a dict22:46
tobiashmordred: ^22:46
clarkbtobiash: is that with sdk 0.19.0?22:46
clarkbtobiash: or 0.20.0?22:47
pabelangertobiash: we blocked 0.20.0 openstacksdk in nodepool not that long ago, which version22:47
tobiashchecking, but I guess 0.20.0 as I only pulled in the exception change22:47
pabelangertobiash: https://review.openstack.org/621272/22:47
*** apetrich has quit IRC22:47
fungiseveral hundred lines of scrollback in this channel while i was at dinner. ftr i'm just going to check nick highlights22:47
*** hamerins has joined #openstack-infra22:47
corvusshould i just do a scheduler restart or a full restart?22:48
tobiashyes 0.20.022:48
corvuswe did executors a couple days ago22:48
tobiashso another incompatibility22:48
clarkbcorvus: executors slow things down so I'm good with just scheduler if you want to make it easier22:48
clarkbthat gets us relative priorities. And infra doesn't currently need executor zones22:48
clarkbif we had an immediate need for executor zones I'd say do it all, but I can't think of one22:49
fungiAJaeger_: thanks for the heads-up on 619216. i'm mostly trying to fix up cookiecutter templates and important documentation, so that's totally in scope22:49
pabelangerwhatever is easier to get the next release of zuul with executor zone support is my vote22:49
*** slaweq has joined #openstack-infra22:50
clarkbpabelanger: in that case probably restarting the executors too, but we won't exercise those code paths. Probably want you to run it in beta form?22:50
pabelangerclarkb: I am okay with skipping executors today22:51
pabelangeryah, I'll test zones locally to be sure22:51
corvusi'm already down the scheduler-only path22:51
*** kjackal has quit IRC22:52
clarkbcorvus: wfm. I don't think relying on infra to ensure executor zones are ready is prudent22:52
pabelanger++22:52
clarkbwe don't have the constraints in place that made it a feature so we wouldn't be able to test it for general functionality well22:52
clarkb(even if we could pretend and check it doesn't completely break)22:52
*** slaweq has quit IRC22:52
pabelangerclarkb: I was only thinking the arm cloud in china might be a good zone to do22:53
clarkbpabelanger: I don't think we have trouble with the ssh connections. Only http22:53
pabelangerah22:53
clarkb(well and zk, but executor zones don't help that either)22:53
clarkbwe moved the builder into the london region to deal with the zk issue and switched to https to deal with http problems22:54
pabelangerif only infracloud was still around, we had bandwidth issues there :)22:54
*** mriedem has quit IRC22:54
*** wolverineav has quit IRC22:54
*** slaweq has joined #openstack-infra22:54
corvusi'm pretty opposed to adding spofs to zuul, so if we were to use zones, we'd have to double up on executors.  at least.22:54
clarkbya I just don't see us needing it with current resources22:55
corvus(even if we had network restricted resources, i'd suggest we talk about vpns first)22:55
*** wolverineav has joined #openstack-infra22:55
clarkbwe'd have an excuse to use that new in-kernel vpn system22:56
corvusscheduler restarted22:57
clarkblooks like check is enque(ing|ed)22:57
pabelangerconfirmed, I can see nodepool-launcher creating nodes22:58
clarkbcorvus: and we are already running with relative priority right?22:58
clarkbseems so check is running jobs ahead of gate22:58
clarkbneat22:58
*** wolverineav has quit IRC23:00
clarkbor at least it appeared so a minute ago. Now I'm not completely sure23:00
*** wolverineav has joined #openstack-infra23:00
corvusthe field is being set: http://paste.openstack.org/show/736533/23:00
pabelangeryay23:00
clarkbcool I don't have evidence against it working, more lack of evidence it is working (since now all the gate things are happening)23:01
fungiand the playing field is being leveled?23:01
corvustheoretically23:01
corvusgate still has priority over check23:01
corvusbut we should see, for example nodepool change 621286 get nodes before nova change 62086123:02
*** apetrich has joined #openstack-infra23:02
corvusin check23:02
pabelangercorvus: that is because there are more nova changes in check right?23:02
corvusyeah, there's a bunch right now23:02
pabelangerkk, keeping an eye out23:03
pabelangeroh, I think it is working 621011 has jobs running now23:03
pabelangerand there is nova before it with none23:04
corvussimilarly the requirements change 620563 is doing well23:04
pabelangeryah, that is pretty cool23:05
*** Miouge- has quit IRC23:05
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Add relative priority to request list  https://review.openstack.org/62131423:06
corvusi assume that's going to fail tests, i just don't know which23:06
pabelangernow I better understand the 'level playing field' comment from fungi :)23:06
*** cdent has quit IRC23:07
clarkbpabelanger: ya this should allow lower resource usage projects "to get a word in" then once done give the higher usage projects a turn23:07
*** Miouge has joined #openstack-infra23:08
clarkbs/lower resource usage/less active/23:08
clarkbits based on changes not nodeset size23:08
pabelangerclarkb: yah, that is really cool, and fair.23:08
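Roughly how the relative priority can be thought of, based on the behaviour described here and the live-items-only follow-up (621315); this is an approximation, not the actual scheduler code:

    from collections import namedtuple

    Item = namedtuple('Item', 'project live')

    def relative_priority(item, items_ahead):
        """0 is best; each live item ahead from the same project adds one."""
        return sum(1 for other in items_ahead
                   if other.live and other.project == item.project)

    queue = [Item('nova', True), Item('nova', True), Item('nodepool', True)]
    print(relative_priority(queue[2], queue[:2]))   # nodepool's first change -> 0
    print(relative_priority(queue[1], queue[:1]))   # nova's second change    -> 1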
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Only count live items for relative priority  https://review.openstack.org/62131523:09
corvusor, uh, nearly fair ^  :)23:10
clarkbit still won't be a strict ordering because different providers can grab requests and fulfill them at different speeds23:10
pabelangerIt seems 614035 got nodes before 614012, but 614012 is parent23:10
clarkbcorvus: oh hah23:10
corvusyeah, it's all best-effort23:10
openstackgerritMerged openstack-infra/project-config master: Update promstat to use storyboard  https://review.openstack.org/62129323:11
pabelanger+223:11
corvus614012 has a relpri of 323:12
mordredtobiash: hrm. flavors should not have changed. they should still be munch ... or at the very least munch-like23:14
corvus603930 has relpri 3923:14
tobiashmordred: I just confirmed that the downgrade fixed the issue23:14
mordredtobiash: AWESOME23:14
mordredsorry for the breakage .. I'm ... quite sad about that23:15
corvusi don't know what the ones that got nodes are23:15
clarkbcorvus: all those non live changes bumping it up23:15
*** hamerins has quit IRC23:15
corvusya23:15
tobiashmordred: nothing happened, it's night here and I just tried that exception change and that pulled in the new sdk23:15
corvusmy guess is the others all grabbed nodes as the system was starting up and there were no other requests23:16
tobiashso I noticed that before it had an impact23:16
mordredtobiash: oh good23:18
clarkbcorvus: are you wanting to restart again today with the liveness check? I'm thinking that may be good to get in on friday vs monday23:18
clarkb(and i can hang around this afternoon to help with that)23:18
corvusclarkb: yeah, at least sometime before monday23:19
pabelangerI still have some time today to support23:19
clarkbthe other thing we should probably do is send an email to the dev list explaining the change so that we can avoid all the questions (or at least some of them) next week :)23:19
clarkbcorvus: is that something you'd like to draft up? (you came up with the idea and did most of the work to implement it)23:20
clarkbI'm happy to write a note too if you'd rather not23:20
corvusclarkb: i can do it23:20
mordredtobiash: oh! it's server.flavor.id ... so the sub-dict isn't munched23:21
clarkbwhile we are on an 'improve all the things' streak. https://review.openstack.org/#/c/611920/ is a fairly straightforward gear change to make it more ipv6 friendly23:21
tobiashyes23:21
clarkbif anyone wants to review that one. It causes gear to listen on ipv6 if available23:21
clarkbpabelanger: ^ I want to say you've pushed changes around similar stuff there23:21
pabelangerlooking23:22
*** slaweq has quit IRC23:22
pabelangerAh, yah. +223:22
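For context, the general shape of the technique such a change uses (this is not gear's actual code): ask getaddrinfo for a passive address, prefer an IPv6 result, and clear IPV6_V6ONLY so the one socket also accepts IPv4 clients via mapped addresses.

    import socket

    def listen_dual_stack(host=None, port=4730):      # 4730: gearman default
        addrs = socket.getaddrinfo(host, port, socket.AF_UNSPEC,
                                   socket.SOCK_STREAM, 0, socket.AI_PASSIVE)
        # Prefer an IPv6 address so a single socket can serve both families.
        addrs.sort(key=lambda a: a[0] != socket.AF_INET6)
        family, socktype, proto, _, sockaddr = addrs[0]
        sock = socket.socket(family, socktype, proto)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if family == socket.AF_INET6 and hasattr(socket, 'IPV6_V6ONLY'):
            sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        sock.bind(sockaddr)
        sock.listen(128)
        return sock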
*** Swami has quit IRC23:27
openstackgerritMerged openstack-infra/nodepool master: Remove updating stats debug log  https://review.openstack.org/62128323:28
pabelangerclarkb: corvus: there we go, 621286 (nodepool) just got allocated nodes23:29
clarkbya and it's been skipping changes earlier as expected23:29
pabelangerway before existing keystone / nova patches23:29
pabelanger++23:29
pabelangerawesome23:29
mordredtobiash: remote:   https://review.openstack.org/621316 Transform server with munch before normalizing         should fix it23:30
corvusclarkb, pabelanger: btw this change from tobias https://review.openstack.org/610029 will help things as well23:30
corvus(that's probably the reason that 614012, at the top of the list, doesn't have nodes yet)23:31
pabelangerah, cool23:31
corvusit was unlucky enough to be locked by a bunch of handlers which couldn't fulfill it, and is now waiting for the ones that can to come back to the top of the loop23:31
pabelangercorvus: do we want to restart launchers today to pick it up?23:32
tobiashmordred: thanks :)23:32
corvuspabelanger: yeah, either today or this weekend i think23:38
clarkbTIL about time.monotonic. There are some nice things in python323:38
corvusbe very careful about monotonic.  it's a trap in unit tests.23:38
clarkbcorvus: oh?23:38
corvusi keep having to swap it out for time.time.23:38
corvusyeah, the base value isn't guaranteed23:38
clarkbtobiash's change to timeout assigning handlers uses monotonic23:39
corvusyeah, i think it's okay there23:39
clarkbcorvus: right its not a unix timestamp? I guess that could cause problems in places23:39
clarkbits a time that may start at zero when process starts?23:39
corvusclarkb: https://review.openstack.org/52917323:40
corvusi think as long as you always initialize to time.monotonic, you're okay23:41
clarkbgot it23:41
corvusbut the common pattern of initializing to 0 and assuming that now()-then will work doesn't work with monotonic23:42
*** wolverineav has quit IRC23:42
*** tosky has quit IRC23:42
corvusor rather, it may or may not work depending on $random23:42
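A sketch of the trap: with time.time() an uninitialized last_run of 0 just means "run immediately", but time.monotonic() has an arbitrary reference point (often seconds since boot), so the same pattern behaves differently depending on the machine or test environment.

    import time

    INTERVAL = 300   # seconds between runs

    class PeriodicTask:
        def __init__(self):
            # Bug-prone with monotonic: 0 is not a meaningful timestamp on
            # this clock, so "due" depends on how long the host has been up.
            self.last_run = 0

        def due(self):
            return time.monotonic() - self.last_run > INTERVAL

        def run(self):
            self.last_run = time.monotonic()

    class SaferPeriodicTask(PeriodicTask):
        def __init__(self):
            # Always initialize from the same clock you compare against.
            self.last_run = time.monotonic()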
*** wolverineav has joined #openstack-infra23:45
corvusclarkb, pabelanger, fungi, mordred: how's this?  https://etherpad.openstack.org/p/QHQuh57TiG23:45
*** pbourke has quit IRC23:47
pabelangercorvus: wfm23:48
clarkbcorvus: looks great, particularly for calling out the out of order behavior23:48
clarkbI expect people will notice that and get curious23:48
*** wolverineav has quit IRC23:48
*** wolverineav has joined #openstack-infra23:49
fungicorvus: ship it!23:49
*** pbourke has joined #openstack-infra23:49
pabelangerclarkb: looking forward to the stats next week for the node breakdown23:50
clarkbpabelanger: I don't expect that will change much23:50
clarkbwe are changing where in the pipeline you get resources not how many you get23:50
clarkb(so over a long period like a month or a week it should all be roughly the same)23:50
pabelangerI guess it depends on whether more of the smaller projects push up more patches23:51
fungihint: they're not23:52
fungithat's what makes them "small projects"23:52
pabelangercool, it only took my patch 21mins to get nodes, compared to the existing ones that are pushing 45mins23:52
pabelangervery nice23:53
pabelangeras it was the first one23:53
fungialso we generally do catch up over the weekend, so the week's worth of node requests still all get satisfied within that week23:53
clarkbya once upon a time I tried to look at it yoy and we use ~30% of our total resources when viewed at that scale23:53
clarkbweekends and the apac timezones tend to be quieter iirc23:54
*** slaweq has joined #openstack-infra23:54
clarkbso while we are flat out when people are watching, it is much less over long periods of time23:54
clarkbwe'll need to watch it and see how it goes though23:58
