*** lathiat has joined #openstack-infra | 00:03 | |
*** mtreinish has quit IRC | 00:12 | |
*** jamesmcarthur has quit IRC | 00:14 | |
*** ijw has quit IRC | 00:14 | |
clarkb | corvus: we can try the hwe kernel (it is 4.13 on xenial) I run it on my local nas box because the cpu is too new for the old kernel | 00:21 |
*** edmondsw has joined #openstack-infra | 00:21 | |
clarkb | corvus: I haven't had any issues with it other than waiting a little longer on the meltdown kpti patching, which was annoying, but at that point it was already a week behind | 00:21 |
*** gongysh has joined #openstack-infra | 00:21 | |
clarkb | considering these servers are largely disposable and rebuildable I would be on board with trying that | 00:21 |
*** bobh has quit IRC | 00:22 | |
*** Goneri has quit IRC | 00:23 | |
*** iyamahat has joined #openstack-infra | 00:24 | |
*** aeng has joined #openstack-infra | 00:25 | |
clarkb | that may also simplify bwrap for us and not require the setuid? | 00:25 |
tristanC | dmsimard: there was swift authentication settings in zuul.conf, then zuul would generate a temp url key per job | 00:25 |
*** edmondsw has quit IRC | 00:25 | |
clarkb | tristanC: dmsimard ya that part mostly worked (the only real issue we had there was the tempurl key was generated on job scheduling and could expire by the time a job ran iirc) the bigger issue was some dev going I want to say last weeks periodic job logs | 00:26 |
clarkb | s/say/see/ | 00:27 |
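For context on the temp URL mechanism tristanC and clarkb describe: Swift temporary URLs are signed with an HMAC-SHA1 over the HTTP method, an expiry timestamp, and the object path, using a temp URL key (here, the per-job key zuul generated). A minimal sketch of that signing step, with an illustrative key and object path; the `ttl_seconds` argument is exactly where the "key generated at scheduling time can expire before the job runs" problem came from:

```python
import hmac
import time
from hashlib import sha1

def swift_temp_url(key, method, path, ttl_seconds):
    """Sign a Swift temp URL; path is e.g. /v1/AUTH_acct/container/object."""
    expires = int(time.time()) + ttl_seconds
    body = "{}\n{}\n{}".format(method, expires, path).encode()
    sig = hmac.new(key.encode(), body, sha1).hexdigest()
    return "{}?temp_url_sig={}&temp_url_expires={}".format(path, sig, expires)

# Example: a PUT URL valid for one hour from *now* -- if the job only starts
# two hours after this was generated, the signature has already expired.
print(swift_temp_url("per-job-secret", "PUT", "/v1/AUTH_demo/logs/job-output.txt", 3600))
```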
tristanC | clarkb: which is now possible the the zuul-web builds.json controller | 00:27 |
clarkb | ya that was mordreds point | 00:27 |
tristanC | with the* | 00:27 |
*** dingyichen has joined #openstack-infra | 00:27 | |
*** xarses_ has quit IRC | 00:28 | |
clarkb | there are still some downsides to that approach but none would be a regression when compared against the current system | 00:28 |
*** r-daneel has quit IRC | 00:29 | |
corvus | clarkb: yeah, i'll put executor oom on tomorrow's meeting agenda | 00:30 |
*** mtreinish has joined #openstack-infra | 00:37 | |
*** ijw has joined #openstack-infra | 00:38 | |
clarkb | in other interesting kernel news apparently 4.15 kernel performance with kpti is only a percent or two slower than 4.11 without kpti based on some benchmarks | 00:40 |
clarkb | it is unfortunate that hard work on performance improvements just got negated by kpti though | 00:40 |
clarkb | but at least we aren't taking a long term massive regression | 00:40 |
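A quick, hedged way to see which kernel a node is running and whether KPTI is active: the vulnerabilities sysfs file is only present on kernels new enough to expose it (4.15 or distro backports), which is itself useful information here.

```python
import platform

print("kernel:", platform.release())
try:
    with open("/sys/devices/system/cpu/vulnerabilities/meltdown") as f:
        # Typically "Mitigation: PTI" when KPTI is enabled, or "Vulnerable".
        print("meltdown:", f.read().strip())
except FileNotFoundError:
    print("this kernel does not expose the vulnerabilities sysfs files")
```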
*** tosky has quit IRC | 00:41 | |
*** dave-mccowan has joined #openstack-infra | 00:42 | |
openstackgerrit | Merged openstack-infra/zuul master: Allow a few more starting builds https://review.openstack.org/540965 | 00:43 |
*** rloo has left #openstack-infra | 00:45 | |
*** slaweq has joined #openstack-infra | 00:46 | |
*** caphrim007_ has joined #openstack-infra | 00:48 | |
*** caphrim00_ has joined #openstack-infra | 00:49 | |
*** caphrim007_ has quit IRC | 00:49 | |
*** slaweq has quit IRC | 00:51 | |
*** caphrim007 has quit IRC | 00:51 | |
*** Swami has quit IRC | 00:53 | |
*** caphrim00_ has quit IRC | 00:54 | |
*** claudiub has quit IRC | 00:58 | |
*** camunoz has quit IRC | 01:00 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Add block-device defaults https://review.openstack.org/539375 | 01:02 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 01:02 |
*** aeng has quit IRC | 01:02 | |
*** gongysh has quit IRC | 01:09 | |
*** aeng has joined #openstack-infra | 01:19 | |
*** jamesmcarthur has joined #openstack-infra | 01:20 | |
*** stakeda has joined #openstack-infra | 01:22 | |
*** jamesmcarthur has quit IRC | 01:25 | |
*** ihrachys has quit IRC | 01:26 | |
*** ihrachys_ has joined #openstack-infra | 01:26 | |
clarkb | the falcon heavy is scheduled to launch 30 minutes before the infra meeting tomorrow | 01:26 |
*** jamesmcarthur has joined #openstack-infra | 01:28 | |
*** abelur_ has joined #openstack-infra | 01:34 | |
*** liujiong has joined #openstack-infra | 01:34 | |
*** dougwig has quit IRC | 01:41 | |
*** ijw has quit IRC | 01:42 | |
ianw | i noticed that, might have to get up extra early :) | 01:42 |
*** jamesmcarthur has quit IRC | 01:43 | |
ianw | my kids are pretty bored with them now ... not another rocket launch ... crazy world they live in :) | 01:43 |
clarkb | but this one is going to mars | 01:44 |
*** salv-orlando has joined #openstack-infra | 01:45 | |
*** hongbin has joined #openstack-infra | 01:49 | |
*** slaweq has joined #openstack-infra | 01:50 | |
*** jamesmcarthur has joined #openstack-infra | 01:54 | |
*** slaweq has quit IRC | 01:55 | |
*** aeng has quit IRC | 01:57 | |
*** aeng has joined #openstack-infra | 01:57 | |
*** larainema has joined #openstack-infra | 02:01 | |
*** caphrim007 has joined #openstack-infra | 02:03 | |
openstackgerrit | Merged openstack-infra/gerritbot master: Add unit test framework and one unit test https://review.openstack.org/499377 | 02:03 |
*** caphrim007 has quit IRC | 02:07 | |
corvus | it's either sending elon musk's tesla roadster to mars, or it's blowing up. apparently musk gives it even odds. | 02:08 |
*** edmondsw has joined #openstack-infra | 02:09 | |
*** liujiong has quit IRC | 02:10 | |
*** liujiong has joined #openstack-infra | 02:11 | |
*** bobh has joined #openstack-infra | 02:13 | |
*** edmondsw has quit IRC | 02:14 | |
*** esberglu has quit IRC | 02:14 | |
*** bobh has quit IRC | 02:15 | |
prometheanfire | whatever happens it'll be glorious | 02:16 |
*** gongysh has joined #openstack-infra | 02:18 | |
*** harlowja has quit IRC | 02:18 | |
*** shu-mutou-AWAY is now known as shu-mutou | 02:19 | |
*** askb has quit IRC | 02:21 | |
*** markvoelker has joined #openstack-infra | 02:22 | |
*** slaweq has joined #openstack-infra | 02:24 | |
*** askb has joined #openstack-infra | 02:24 | |
*** markvoelker has quit IRC | 02:24 | |
*** mriedem has quit IRC | 02:25 | |
*** mriedem has joined #openstack-infra | 02:28 | |
*** slaweq has quit IRC | 02:29 | |
*** askb has quit IRC | 02:29 | |
*** askb has joined #openstack-infra | 02:29 | |
*** markvoelker has joined #openstack-infra | 02:31 | |
*** askb has quit IRC | 02:31 | |
*** askb has joined #openstack-infra | 02:31 | |
*** askb has quit IRC | 02:31 | |
*** askb has joined #openstack-infra | 02:32 | |
*** salv-orlando has quit IRC | 02:33 | |
*** salv-orlando has joined #openstack-infra | 02:33 | |
*** askb has quit IRC | 02:36 | |
*** salv-orlando has quit IRC | 02:38 | |
*** greghaynes has quit IRC | 02:40 | |
*** greghaynes has joined #openstack-infra | 02:41 | |
*** greghaynes has quit IRC | 02:44 | |
*** greghaynes has joined #openstack-infra | 02:44 | |
*** askb has joined #openstack-infra | 02:45 | |
*** olaph1 has joined #openstack-infra | 02:46 | |
*** olaph has quit IRC | 02:47 | |
*** dave-mccowan has quit IRC | 02:48 | |
*** abelur_ has quit IRC | 02:49 | |
*** askb has quit IRC | 02:49 | |
*** abelur__ has joined #openstack-infra | 02:49 | |
*** caphrim007 has joined #openstack-infra | 02:50 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 02:50 |
*** bobh has joined #openstack-infra | 02:51 | |
*** abelur__ has quit IRC | 02:51 | |
*** askb has joined #openstack-infra | 02:51 | |
*** askb_ has joined #openstack-infra | 02:52 | |
*** jamesmcarthur has quit IRC | 02:52 | |
*** askb has quit IRC | 02:52 | |
*** askb has joined #openstack-infra | 02:53 | |
*** bobh has quit IRC | 02:53 | |
smcginnis | Another release job timeout on a git fetch. | 02:54 |
smcginnis | http://logs.openstack.org/c3/c3ae8a1d87084bcee73e96e2f9e2d174ee610ed4/release-post/tag-releases/48a1f5a/job-output.txt.gz#_2018-02-05_23_47_27_034616 | 02:54 |
smcginnis | Just dropping a note for now. Will follow up tomorrow. | 02:54 |
*** markvoelker has quit IRC | 02:56 | |
*** markvoelker has joined #openstack-infra | 02:59 | |
*** slaweq has joined #openstack-infra | 03:00 | |
*** askb_ has quit IRC | 03:01 | |
*** slaweq has quit IRC | 03:04 | |
*** caphrim007 has quit IRC | 03:08 | |
prometheanfire | pip mirrors busted? | 03:09 |
*** dave-mccowan has joined #openstack-infra | 03:09 | |
prometheanfire | smcginnis: related I imagine? | 03:09 |
clarkb | prometheanfire: http://mirror.dfw.rax.openstack.org/pypi/last-modified its there and was last modified just under 3 hours ago | 03:10 |
clarkb | can you be more specific on how pip mirrors are busted? | 03:10 |
*** markvoelker has quit IRC | 03:11 | |
prometheanfire | clarkb: I was getting 'no distribution found', I'll retry soon | 03:12 |
prometheanfire | http://logs.openstack.org/29/541029/1/check/requirements-tox-py35-check-uc/b098a8f/job-output.txt.gz#_2018-02-06_01_51_40_046774 | 03:13 |
*** mriedem has quit IRC | 03:14 | |
*** markvoelker has joined #openstack-infra | 03:14 | |
clarkb | http://mirror.dfw.rax.openstack.org/pypi/simple/sushy/ ya looks like they are missing | 03:15 |
clarkb | bandersnatch behind or having problems again is my best guess | 03:15 |
*** d0ugal_ has joined #openstack-infra | 03:16 | |
clarkb | nothing about sushy in the bandersnatch logs | 03:18 |
clarkb | for the 4th 5th and 6th | 03:18 |
*** d0ugal has quit IRC | 03:19 | |
clarkb | something wrong with upstream pypi's mirroring stuff? | 03:19 |
prometheanfire | not sure | 03:22 |
clarkb | looks like bandersnatch did error due to failed package retrieval early today PST | 03:22 |
clarkb | I wonder if that has it confused as to what serial it is on; maybe it thinks it is done | 03:23 |
clarkb | this may require another full resync :/ | 03:23 |
clarkb | I'm not going to be able to watch that tonight but can pick up in the morning if necessary | 03:23 |
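A throwaway check along the lines of what clarkb did by hand above: compare the mirror's simple index against upstream PyPI for a package a job failed to find. The mirror hostname and package name are the ones from the log; adjust both as needed.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

MIRROR = "http://mirror.dfw.rax.openstack.org/pypi/simple"
UPSTREAM = "https://pypi.org/simple"

def has_index(base, name):
    """True if the simple index page for `name` exists under `base`."""
    try:
        with urlopen("{}/{}/".format(base, name), timeout=10):
            return True
    except HTTPError as err:
        if err.code == 404:
            return False
        raise

for pkg in ("sushy",):
    print(pkg, "mirror:", has_index(MIRROR, pkg), "upstream:", has_index(UPSTREAM, pkg))
```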
*** markvoelker has quit IRC | 03:25 | |
*** gcb has joined #openstack-infra | 03:26 | |
*** gyee has quit IRC | 03:29 | |
*** markvoelker has joined #openstack-infra | 03:30 | |
tonyb | jhesketh: random question ... what do OpenSuse and SLES use for init? systemd? | 03:31 |
clarkb | tonyb: tumbleweed is systemd at least | 03:32 |
*** dave-mccowan has quit IRC | 03:33 | |
*** gongysh has quit IRC | 03:33 | |
clarkb | looks like SLES 12 is systemd | 03:33 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 03:33 |
tonyb | clarkb: ah cool | 03:37 |
prometheanfire | tonyb: ya, they use systemd, for some reason I thought they went with openrc or eudev, but guess not | 03:39 |
*** bobh has joined #openstack-infra | 03:39 | |
*** slaweq has joined #openstack-infra | 03:40 | |
tonyb | prometheanfire: Yeah. Seems like gentoo and debian are the last distros in the camp of 'default init' != 'systemd' | 03:41 |
*** markvoelker has quit IRC | 03:42 | |
prometheanfire | tonyb: debian uses systemd :P | 03:42 |
prometheanfire | I actually don't have much of a problem with systemd, run it on my laptop | 03:42 |
tonyb | prometheanfire: by default? I thought it was sysv with an option for systemd? I don't mind being wrong | 03:43 |
*** ramishra has joined #openstack-infra | 03:43 | |
*** slaweq has quit IRC | 03:44 | |
tonyb | prometheanfire: Yeah, for me it isn't anything other than "$change seems to introduce a regression on systems that don't use systemd" Which systems are they and do we consider that a problem? | 03:44 |
tonyb | prometheanfire: by no means is it a value judgement on systemd | 03:45 |
*** coolsvap has joined #openstack-infra | 03:45 | |
*** cshastri has joined #openstack-infra | 03:46 | |
prometheanfire | we do have a sysv init flag, but that was just to map 'shutdown' to systemctl shutdown and the like | 03:47 |
prometheanfire | it can understand sysv init scripts iirc, but not sure | 03:47 |
fungi | tonyb: debian stable release before last (jessie) switched to systemd by default (except on non-linux arches like kfreebsd and hurd) but debian still mostly works if you install sysvinit as your default | 03:47 |
prometheanfire | funny thing is, I was part of the group that forked udev | 03:48 |
fungi | though the upgrade to jessie would keep your previous init default | 03:48 |
*** rossella_s has quit IRC | 03:48 | |
ianw | tonyb: presume relates to https://review.openstack.org/#/c/529976/2 ? | 03:48 |
tonyb | fungi: Ahh cool. I'm clearly out of date. | 03:48 |
prometheanfire | ya, there was a big hubbub about it and some debian people forked into another non-systemd distro | 03:48 |
tonyb | ianw: Yeah | 03:48 |
ianw | we've dropped centos6 era stuff ... that was also python2.6 which we don't code for | 03:48 |
ianw | my thinking was there isn't really a non-systemd case? | 03:49 |
*** askb_ has joined #openstack-infra | 03:49 | |
*** bobh has quit IRC | 03:49 | |
fungi | systemd is still at least partly a no-go on !linux kernels because upstream has in the past been adamant they rely on linux-kernel-specific features | 03:49 |
*** sree has joined #openstack-infra | 03:50 | |
prometheanfire | it's also a no-go on non-glibc systems | 05:50 |
fungi | prometheanfire: the non-systemd debian derivative you're thinking of is https://devuan.org/ | 03:50 |
prometheanfire | ya, that's it | 03:51 |
prometheanfire | I knew how to pronounce it but not spell it | 03:51 |
tonyb | ianw: I'm happy to drop my objection in that case. Was trusty systemd? or upstart? | 03:51 |
prometheanfire | embedded tends to do a bit of non-glibc (uclibc, musl) | 03:51 |
prometheanfire | which is why we forked udev into eudev | 03:51 |
tonyb | if trusty is systemd then my objection seems to be largely theoretical ;p | 03:52 |
prometheanfire | trusty is not systemd, xenial is | 03:52 |
prometheanfire | it might have some small systemd parts iirc, but mainly upstart | 03:52 |
fungi | yeah, trusty was/is very much upstart | 03:52 |
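For the "which systems don't use systemd" question above, a small detection sketch (Linux-only assumptions): systemd creates /run/systemd/system at boot, and otherwise the command name of PID 1 gives a hint (upstart and sysvinit both show up as "init").

```python
import os

def init_system():
    if os.path.isdir("/run/systemd/system"):   # documented systemd boot marker
        return "systemd"
    try:
        with open("/proc/1/comm") as f:
            return f.read().strip()            # e.g. "init" for upstart/sysvinit
    except OSError:
        return "unknown"

print(init_system())
```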
tonyb | prometheanfire, fungi: thanks. | 03:54 |
tonyb | ianw: so I guess that gives us a line in the sand. | 03:55 |
*** hongbin has quit IRC | 03:55 | |
*** askb_ has quit IRC | 03:55 | |
*** askb_ has joined #openstack-infra | 03:56 | |
tonyb | ianw: I guess that probably means it boils down to: is trusty too old? | 03:56 |
prometheanfire | 18.04 comes out 'soon' too | 03:56 |
tonyb | ianw: I don't really mind what the answer is :) | 03:56 |
ianw | hmm, yeah trusty has been pretty unloved for some time | 03:56 |
tonyb | oooh /me notices etcd 3.2 (x86, arm and ppc64el) in bionic :) | 03:57 |
*** edmondsw has joined #openstack-infra | 03:58 | |
fungi | also openafs 1.8! | 03:58 |
tonyb | fungi: :) | 03:58 |
*** olaph1 has quit IRC | 03:59 | |
*** olaph has joined #openstack-infra | 04:00 | |
*** gongysh has joined #openstack-infra | 04:01 | |
*** edmondsw has quit IRC | 04:02 | |
*** s-shiono has joined #openstack-infra | 04:03 | |
tonyb | Does this http://logs.openstack.org/22/535622/10/check/puppet-openstack-unit-4.8-centos-7/84927af/job-output.txt.gz#_2018-02-06_02_58_30_732623 look like a real problem or just a 'run recheck' problem? | 04:04 |
ianw | tonyb: fatal: could not create leading directories of '/etc/puppet/modules/powerdns': Permission denied | 04:05 |
ianw | i'd say that needs some help... | 04:05 |
tonyb | ianw: I was afraid you'd say that ;P | 04:06 |
*** askb_ has quit IRC | 04:06 | |
*** askb has quit IRC | 04:06 | |
*** askb has joined #openstack-infra | 04:06 | |
* tonyb does some research | 04:06 | |
*** askb has quit IRC | 04:07 | |
ianw | maybe it just needs a "become: " line for the task ...? | 04:07 |
*** askb has joined #openstack-infra | 04:07 | |
*** askb has quit IRC | 04:09 | |
*** askb has joined #openstack-infra | 04:09 | |
*** abelur__ has joined #openstack-infra | 04:11 | |
gongysh | hi, | 04:11 |
gongysh | I want to set up a multinode ci job for my project tacker | 04:11 |
*** gongysh has quit IRC | 04:11 | |
*** askb has quit IRC | 04:12 | |
*** askb has joined #openstack-infra | 04:13 | |
*** gongysh has joined #openstack-infra | 04:14 | |
*** caphrim007 has joined #openstack-infra | 04:15 | |
*** psachin has joined #openstack-infra | 04:26 | |
*** dsariel has quit IRC | 04:28 | |
*** gongysh has quit IRC | 04:30 | |
*** yamamoto has joined #openstack-infra | 04:35 | |
*** rossella_s has joined #openstack-infra | 04:38 | |
*** iyamahat has quit IRC | 04:41 | |
*** slaweq has joined #openstack-infra | 04:44 | |
*** harlowja has joined #openstack-infra | 04:46 | |
*** slaweq has quit IRC | 04:48 | |
*** olaph1 has joined #openstack-infra | 04:49 | |
*** olaph has quit IRC | 04:50 | |
*** sree_ has joined #openstack-infra | 04:53 | |
*** rosmaita has quit IRC | 04:53 | |
*** sree_ is now known as Guest77194 | 04:54 | |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:54 |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:55 |
openstackgerrit | John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo https://review.openstack.org/541125 | 04:56 |
*** sree has quit IRC | 04:57 | |
*** zhurong has quit IRC | 05:07 | |
*** dhajare has joined #openstack-infra | 05:10 | |
*** harlowja has quit IRC | 05:11 | |
*** links has joined #openstack-infra | 05:12 | |
*** abelur__ has quit IRC | 05:12 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Pass at cleaning up the bootloader https://review.openstack.org/541129 | 05:13 |
*** pgadiya has joined #openstack-infra | 05:14 | |
*** HawK3r has joined #openstack-infra | 05:15 | |
*** janki has joined #openstack-infra | 05:15 | |
HawK3r | I just installed https://www.rdoproject.org/install/packstack/ packstack | 05:17 |
HawK3r | first time messing with openstack so I thought this would be a good way to get started | 05:17 |
HawK3r | I have a pretty beefy server I was using as an ESX host that would now be an OpenStack node | 05:17 |
HawK3r | one issue I am having is that I installed it with a DHCP-given IP address on the server. CentOS 7 to be specific | 05:17 |
HawK3r | I need to change the IP address and I did it from the 15-horizon_vhost.conf file | 05:17 |
HawK3r | I am able to load horizon in the browser but every time I try to log in I get 'Unable to establish connection to keystone endpoint.' | 05:17 |
HawK3r | If I go back to the original IP address everything works fine | 05:17 |
HawK3r | I'm thinking there is another place where the IP needs to be changed, perhaps another config file I'm missing | 05:17 |
HawK3r | Also, coming from VMware I am not getting how the management network and the whole network stack works on OpenStack, to be honest with you | 05:17 |
*** askb has quit IRC | 05:18 | |
*** askb has joined #openstack-infra | 05:18 | |
*** gongysh has joined #openstack-infra | 05:19 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device https://review.openstack.org/539731 | 05:20 |
AJaeger | HawK3r: #openstack is the proper channel for such questions, this channel is for the infrastructure OpenStack uses to run CI etc. | 05:22 |
HawK3r | thanks | 05:22 |
*** HawK3r has left #openstack-infra | 05:23 | |
*** eumel8 has joined #openstack-infra | 05:25 | |
*** armaan has quit IRC | 05:25 | |
*** daidv has quit IRC | 05:26 | |
AJaeger | ianw, frickler, could you put https://review.openstack.org/541009 https://review.openstack.org/540971 https://review.openstack.org/540971 and https://review.openstack.org/540595 on your review queue, please? | 05:26 |
*** armaan has joined #openstack-infra | 05:28 | |
AJaeger | ianw, frickler, duplicate in there - I meant https://review.openstack.org/538344 as well, please | 05:28 |
*** olaph has joined #openstack-infra | 05:29 | |
*** sdake_ is now known as sdake | 05:30 | |
*** olaph1 has quit IRC | 05:30 | |
*** sree has joined #openstack-infra | 05:31 | |
*** Guest77194 has quit IRC | 05:34 | |
openstackgerrit | Merged openstack-infra/project-config master: add git timeout setting for clone_repo.sh https://review.openstack.org/541050 | 05:34 |
openstackgerrit | Merged openstack-infra/project-config master: add a retry loop to clone_repo.sh https://review.openstack.org/541051 | 05:37 |
*** dhajare has quit IRC | 05:38 | |
*** dhajare has joined #openstack-infra | 05:40 | |
*** wolverineav has joined #openstack-infra | 05:46 | |
*** edmondsw has joined #openstack-infra | 05:46 | |
*** abelur__ has joined #openstack-infra | 05:50 | |
*** edmondsw has quit IRC | 05:50 | |
*** wolverineav has quit IRC | 05:50 | |
*** abelur__ has quit IRC | 05:51 | |
*** abelur__ has joined #openstack-infra | 05:51 | |
*** sree_ has joined #openstack-infra | 05:54 | |
*** sree_ is now known as Guest54253 | 05:54 | |
*** wolverineav has joined #openstack-infra | 05:57 | |
*** sree has quit IRC | 05:58 | |
*** wolverin_ has joined #openstack-infra | 06:00 | |
*** gcb has quit IRC | 06:00 | |
*** wolverineav has quit IRC | 06:02 | |
*** slaweq has joined #openstack-infra | 06:02 | |
*** wolverin_ has quit IRC | 06:03 | |
*** wolverineav has joined #openstack-infra | 06:03 | |
*** abelur__ has quit IRC | 06:03 | |
*** abelur__ has joined #openstack-infra | 06:03 | |
*** e0ne has joined #openstack-infra | 06:04 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/541136 | 06:05 |
*** jchhatbar has joined #openstack-infra | 06:06 | |
*** janki has quit IRC | 06:06 | |
*** wolverineav has quit IRC | 06:07 | |
*** slaweq has quit IRC | 06:07 | |
*** aeng has quit IRC | 06:09 | |
*** askb has quit IRC | 06:09 | |
*** abelur__ has quit IRC | 06:09 | |
*** askb has joined #openstack-infra | 06:10 | |
*** ihrachys_ has quit IRC | 06:13 | |
*** d0ugal_ has quit IRC | 06:14 | |
*** wolverineav has joined #openstack-infra | 06:15 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove legacy-devstack-dsvm-py35-updown from devstack https://review.openstack.org/540707 | 06:18 |
*** wolverineav has quit IRC | 06:19 | |
*** d0ugal_ has joined #openstack-infra | 06:23 | |
openstackgerrit | Merged openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/541136 | 06:30 |
*** dhill__ has joined #openstack-infra | 06:31 | |
*** dhill_ has quit IRC | 06:33 | |
*** claudiub has joined #openstack-infra | 06:37 | |
*** threestrands has quit IRC | 06:40 | |
*** shu-mutou has quit IRC | 06:43 | |
*** yolanda has joined #openstack-infra | 06:43 | |
*** ianychoi_ has quit IRC | 06:47 | |
*** ianychoi_ has joined #openstack-infra | 06:48 | |
*** snapiri has joined #openstack-infra | 06:49 | |
*** rwsu has joined #openstack-infra | 07:01 | |
*** liujiong has quit IRC | 07:03 | |
*** liujiong has joined #openstack-infra | 07:03 | |
*** sree has joined #openstack-infra | 07:04 | |
*** HeOS has joined #openstack-infra | 07:07 | |
*** Guest54253 has quit IRC | 07:07 | |
*** zhurong has joined #openstack-infra | 07:09 | |
AJaeger | ianw: thanks for reviewing - any reason not to +A the nodepool change https://review.openstack.org/#/c/540595/ ? | 07:09 |
openstackgerrit | Merged openstack-infra/project-config master: Remove windmill-buildimages https://review.openstack.org/541009 | 07:10 |
*** gcb has joined #openstack-infra | 07:13 | |
openstackgerrit | Merged openstack-infra/openstack-zuul-jobs master: Remove legacy infra-ansible job https://review.openstack.org/540971 | 07:13 |
*** andreas_s has joined #openstack-infra | 07:14 | |
*** khappone has quit IRC | 07:17 | |
*** jchhatba_ has joined #openstack-infra | 07:20 | |
*** e0ne has quit IRC | 07:22 | |
*** jchhatbar has quit IRC | 07:23 | |
openstackgerrit | Merged openstack-infra/openstack-zuul-jobs master: Zuul: Remove project name https://review.openstack.org/541078 | 07:25 |
*** rcernin has quit IRC | 07:37 | |
*** khappone has joined #openstack-infra | 07:39 | |
*** iyamahat has joined #openstack-infra | 07:41 | |
ianw | AJaeger: not really, just haven't been super active in nodepool changes lately | 07:47 |
jlvillal | Not sure if just me but http://zuul.openstack.org/ isn't loading | 07:47 |
ianw | jlvillal: wfm ... | 07:47 |
ianw | try a hard refresh ... there was something we did with redirects this morning | 07:47 |
jlvillal | ianw, Ah! I thought CTRL-F5 was hard-refresh but I guess not. | 07:48 |
jlvillal | shift-click on reload icon did it in Firefox | 07:48 |
*** slaweq has joined #openstack-infra | 07:49 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Remove TripleO pipelines from grafana https://review.openstack.org/541165 | 07:51 |
AJaeger | ianw: we have 0 executors right now ;( | 07:51 |
AJaeger | ianw: check http://grafana.openstack.org/dashboard/db/zuul-status | 07:51 |
AJaeger | Did they all die? | 07:52 |
vivsoni_ | hi team, i am trying to create devstack newton | 07:52 |
AJaeger | infra-root ^ | 07:52 |
AJaeger | vivsoni_: devstack is a QA project, best ask on #openstack-qa | 07:52 |
vivsoni_ | AJaeger: ok.. thanks | 07:52 |
*** jpena|off is now known as jpena | 07:53 | |
jlvillal | ianw, Is http://zuul.openstack.org/ still working for you? Now it stopped for me :( | 07:55 |
jlvillal | AJaeger, would your '0 executors' thing impact http://zuul.openstack.org/ ? | 07:56 |
AJaeger | jlvillal: it's working for me - but something else is going on ;( | 07:56 |
jlvillal | AJaeger, Good reason for me to go to sleep :) Almost midnight here. | 07:57 |
AJaeger | jlvillal: it depends on what the problem is - and that needs an infra-root to investigate | 07:57 |
jlvillal | AJaeger, Thanks | 07:57 |
AJaeger | jlvillal: 9am here ;) Good night! | 07:57 |
*** sree has quit IRC | 07:59 | |
*** slaweq has quit IRC | 08:02 | |
*** alexchadin has joined #openstack-infra | 08:02 | |
*** pcichy has joined #openstack-infra | 08:03 | |
*** slaweq has joined #openstack-infra | 08:03 | |
*** kjackal has quit IRC | 08:10 | |
*** kjackal has joined #openstack-infra | 08:11 | |
*** jchhatba_ has quit IRC | 08:12 | |
*** pcichy has quit IRC | 08:12 | |
*** jchhatba_ has joined #openstack-infra | 08:12 | |
*** pcaruana has joined #openstack-infra | 08:14 | |
*** s-shiono has quit IRC | 08:15 | |
*** dhill_ has joined #openstack-infra | 08:17 | |
*** dhill__ has quit IRC | 08:18 | |
*** iyamahat has quit IRC | 08:18 | |
*** florianf has joined #openstack-infra | 08:19 | |
*** askb has quit IRC | 08:19 | |
*** askb_ has joined #openstack-infra | 08:20 | |
*** tesseract has joined #openstack-infra | 08:22 | |
*** ralonsoh has joined #openstack-infra | 08:23 | |
*** hashar has joined #openstack-infra | 08:24 | |
*** ianychoi_ has quit IRC | 08:27 | |
*** ianychoi_ has joined #openstack-infra | 08:28 | |
*** askb_ has quit IRC | 08:31 | |
*** abelur_ has joined #openstack-infra | 08:32 | |
*** iyamahat has joined #openstack-infra | 08:32 | |
*** kjackal has quit IRC | 08:34 | |
AJaeger | infra-root, looking at grafana, half of our ze0's are using 90+ % of memory. | 08:35 |
*** kjackal has joined #openstack-infra | 08:36 | |
*** armaan has quit IRC | 08:37 | |
*** armaan has joined #openstack-infra | 08:37 | |
*** iyamahat has quit IRC | 08:38 | |
*** zhurong has quit IRC | 08:38 | |
*** yamahata has quit IRC | 08:38 | |
*** dingyichen has quit IRC | 08:39 | |
jhesketh | AJaeger: taking a look | 08:40 |
AJaeger | thanks, jhesketh. I'm wondering whether we have a network or connection problem | 08:41 |
*** gongysh has quit IRC | 08:53 | |
AJaeger | jhesketh, infra-root, did we lose inap? http://grafana.openstack.org/dashboard/db/nodepool-inap | 08:53 |
jhesketh | AJaeger: that does look possible | 08:54 |
*** abelur_ has quit IRC | 08:55 | |
*** abelur_ has joined #openstack-infra | 08:55 | |
*** amoralej|off is now known as amoralej | 08:56 | |
*** priteau has joined #openstack-infra | 08:56 | |
*** d0ugal_ has quit IRC | 08:57 | |
*** alexchadin has quit IRC | 08:57 | |
*** d0ugal has joined #openstack-infra | 08:57 | |
*** d0ugal has quit IRC | 08:57 | |
*** d0ugal has joined #openstack-infra | 08:57 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Temporary disable inap https://review.openstack.org/541188 | 08:58 |
AJaeger | jhesketh: ^ | 08:58 |
*** alexchadin has joined #openstack-infra | 08:58 | |
AJaeger | jhesketh: want to force merge? | 08:58 |
AJaeger | waiting for nodes might take ages ;( | 08:58 |
*** jpich has joined #openstack-infra | 08:58 | |
ianw | 2018-02-06 08:59:09,862 INFO nodepool.CleanupWorker: ZooKeeper suspended. Waiting | 08:59 |
ianw | 2018-02-06 08:59:16,062 INFO nodepool.DeletedNodeWorker: ZooKeeper suspended. Waiting | 08:59 |
ianw | what's that mean? | 08:59 |
jhesketh | sorry, I'm still looking through logs, will check if inap is down then submit that... I'm not sure if we'll need to restart the executors stuck on those nodes | 08:59 |
*** rpittau has joined #openstack-infra | 09:01 | |
ianw | jhesketh / AJaeger : i've restarted nl03 ... i think that error is related to auth timing out maybe? | 09:02 |
*** zhurong has joined #openstack-infra | 09:03 | |
ianw | there was definitely some sort of blip, but it is now seeming to sync up inap | 09:03 |
*** askb has joined #openstack-infra | 09:03 | |
AJaeger | ianw: I have no idea what it could be ;( | 09:04 |
*** gongysh has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:04 | |
ianw | if i had to guess, there was an inap blip, and it made nl03 unhappy and it lost its connection to zk | 09:04 |
AJaeger | ianw: but now grafana shows different graphs for inap, so restarting seems to have helped | 09:04 |
* ianw handy-wavy ... | 09:05 | |
*** askb has quit IRC | 09:05 | |
*** e0ne has joined #openstack-infra | 09:05 | |
*** askb_ has joined #openstack-infra | 09:05 | |
* AJaeger hopes the executors recover... | 09:05 | |
ianw | they all seem to be processing? | 09:06 |
*** threestrands has joined #openstack-infra | 09:06 | |
*** jaosorior has quit IRC | 09:06 | |
ianw | but i agree ... why are they not reporting | 09:06 |
* jhesketh is glad ianw is here :-) | 09:08 | |
jhesketh | ianw: some of the executors aren't picking up work so they might need restarting | 09:10 |
AJaeger | Reading #zuul: Shrews wanted to restart executors today to pick up new changes... | 09:12 |
*** sree has joined #openstack-infra | 09:12 | |
* AJaeger needs to step out a bit | 09:13 | |
ianw | jhesketh: which ones? | 09:14 |
jhesketh | 1&3 at least, I haven't checked them all | 09:14 |
jhesketh | at a guess from grafana 6,7,8,9 | 09:14 |
*** dsariel has joined #openstack-infra | 09:15 | |
jhesketh | ianw: I don't have any evidence for why that might be an effect of inap though | 09:17 |
jhesketh | and therefore if it'll make a difference | 09:18 |
ianw | i think the problem with the stats might be graphite ... seems the disk is full | 09:18 |
ianw | /var/log/graphite/carbon-cache-a is full | 09:19 |
jhesketh | oh, good find | 09:19 |
*** threestrands has quit IRC | 09:21 | |
*** edmondsw has joined #openstack-infra | 09:22 | |
*** wxy has quit IRC | 09:22 | |
*** rossella_s has quit IRC | 09:23 | |
*** kashyap has joined #openstack-infra | 09:24 | |
*** rossella_s has joined #openstack-infra | 09:24 | |
*** edmondsw has quit IRC | 09:27 | |
ianw | i'm trying to copy it to the storage volume | 09:27 |
ianw | alright, i'm going to quickly reboot it, because things running out of disk ... i'm not sure what state it is in | 09:28 |
*** jaosorior has joined #openstack-infra | 09:29 | |
jhesketh | ack | 09:29 |
*** sshnaidm|bbl is now known as sshnaidm|rover | 09:30 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul-jobs master: role: Inject public keys in case of failure https://review.openstack.org/535803 | 09:30 |
ianw | #status log graphite.o.o disk full. move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06 | 09:31 |
ianw | #status log graphite.o.o disk full. move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06 and server rebooted | 09:32 |
openstackstatus | ianw: finished logging | 09:32 |
ianw | clarkb: ^ are you the expert on graphite? i feel like we need to do something there; there is still a 4gb console.log file in there that we need to manage i guess | 09:34 |
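A small sketch of the sort of check that finds the offender when a partition fills up like this; the path is the carbon-cache log directory named above, and the "top ten files" cutoff is just an example.

```python
import os
import shutil

path = "/var/log/graphite/carbon-cache-a"
total, used, free = shutil.disk_usage(path)
print("{:.0%} used, {:.1f} GiB free".format(used / total, free / 2**30))

# List the largest files under the log directory to decide what to
# rotate or move off to the cinder volume.
sizes = []
for root, _dirs, files in os.walk(path):
    for name in files:
        full = os.path.join(root, name)
        sizes.append((os.path.getsize(full), full))
for size, full in sorted(sizes, reverse=True)[:10]:
    print("{:8.1f} MiB  {}".format(size / 2**20, full))
```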
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Clean held nodes automatically after configurable timeout https://review.openstack.org/536295 | 09:36 |
*** olaph1 has joined #openstack-infra | 09:36 | |
*** kopecmartin has joined #openstack-infra | 09:36 | |
*** olaph has quit IRC | 09:38 | |
*** florianf has quit IRC | 09:43 | |
jhesketh | ianw: the queued jobs are still climbing, any objections to restarting zuul on 01 to see if it picks up new work? | 09:43 |
*** derekh has joined #openstack-infra | 09:45 | |
*** pcichy has joined #openstack-infra | 09:46 | |
openstackgerrit | Rico Lin proposed openstack-infra/irc-meetings master: Move Magnum irc meeting to #openstack-containers https://review.openstack.org/541210 | 09:49 |
ianw | jhesketh: not really, i'm looking at the logs and it seems to have taken a few jobs | 09:49 |
ianw | ahh, nahh it's not doing anything is it | 09:50 |
jhesketh | ianw: ze01? | 09:50 |
jhesketh | right | 09:50 |
jhesketh | ianw: I'll do a graceful restart | 09:50 |
*** dizquierdo has joined #openstack-infra | 09:54 | |
*** stakeda has quit IRC | 09:54 | |
*** florianf has joined #openstack-infra | 09:55 | |
jhesketh | oh, apparently graceful isn't implemented in v3? | 09:56 |
jhesketh | ianw: sending a stop, and apparently it had to abort a bunch of jobs: | 09:57 |
jhesketh | 2018-02-06 09:56:49,228 DEBUG zuul.AnsibleJob: [build: a956d5c2e21741298eecbc23da2a3443] Abort: no process is running | 09:57 |
jhesketh | so it looks like ansible died somehow? | 09:57 |
*** liujiong has quit IRC | 09:59 | |
tobiash | we maybe leak job workers under some circumstances | 09:59 |
tobiash | leaked not yet started job workers would be counted towards starting builds and thus explain the current behavior I see in the stats | 09:59 |
jhesketh | yep, I'd agree with that | 10:01 |
ianw | tobiash: oohhh, interesting. so basically a big global slowdown? | 10:01 |
ianw | that's what was a bit weird, things seemed to be happening ... slowly | 10:01 |
tobiash | ianw: either that or we leak job workers in the executor: http://grafana.openstack.org/dashboard/db/zuul-status?panelId=25&fullscreen | 10:01 |
tobiash | because what I expect is that starting jobs shouldn't increase monotonically | 10:02 |
jhesketh | the stop command appears to be hanging... If I kill the process what state will those builds end up in with the scheduler? | 10:03 |
jhesketh | I suspect they'll be stuck and we might need to do a scheduler restart | 10:03 |
tobiash | so either something is hanging the preparation (hanging git?) or we leaked job worker threads which are counted wrongly towards starting builds | 10:03 |
ianw | jhesketh: it can take a while to shutdown cleanly | 10:03 |
tobiash | jhesketh: a restart should at least get us going | 10:03 |
tobiash | I'm not sure if rescheduling of failed jobs is already fixed | 10:04 |
jhesketh | ianw: okay, I'll wait for the shutdown for a while | 10:04 |
*** amrith has left #openstack-infra | 10:04 | |
tobiash | but at least the queued up jobs should be running fine | 10:04 |
jhesketh | fyi, ze01 is in stopped state | 10:05 |
ianw | jhesketh: yeah, otherwise you can just kill it manually and restart, it *shouldn't* require scheduler restart | 10:05 |
tobiash | the scheduler shouldn't need a restart | 10:05 |
jhesketh | the scheduler will pick up that ze01 has lost builds if we kill it? | 10:05 |
tobiash | jhesketh: I'm not sure about lost builds, it should reschedule them, if not that's a bug | 10:06 |
*** janki has joined #openstack-infra | 10:06 | |
tobiash | and I think there was such a bug, but I don't know if that has been fixed | 10:06 |
jhesketh | ack | 10:06 |
tobiash | but at least the 700 queued jobs should run fine then | 10:07 |
tobiash | ;) | 10:07 |
jhesketh | I'll let the executor try and shut down cleanly still for a bit | 10:07 |
jhesketh | (it's still doing repo updates) | 10:07 |
*** jchhatba_ has quit IRC | 10:07 | |
ianw | jhesketh: sorry i gotta disappear; i'll leave it in your capable hands :) | 10:08 |
jhesketh | no worries (although not sure how capable!) | 10:08 |
jhesketh | thanks heaps for your help! | 10:08 |
*** hjensas has joined #openstack-infra | 10:12 | |
*** sree has quit IRC | 10:13 | |
*** sree has joined #openstack-infra | 10:14 | |
*** sree has quit IRC | 10:18 | |
*** nmathew has joined #openstack-infra | 10:19 | |
*** sree has joined #openstack-infra | 10:19 | |
*** alexchadin has quit IRC | 10:22 | |
*** sree has quit IRC | 10:23 | |
openstackgerrit | caishan proposed openstack-infra/irc-meetings master: Move Barbican irc meeting to #openstack-barbican https://review.openstack.org/541230 | 10:24 |
*** dtantsur|afk is now known as dtantsur | 10:25 | |
*** alexchadin has joined #openstack-infra | 10:25 | |
AJaeger | jhesketh: should we send #status notice Our Zuul infrastructure is currently experiencing some problems, we're investigating. Please do not approve or recheck changes for now. ? | 10:25 |
jhesketh | AJaeger: yep, perhaps a note that it'll be a little slow but should get through changes in time | 10:26 |
jhesketh | AJaeger: do you have status perms? | 10:26 |
AJaeger | yes, I have permissions | 10:27 |
jhesketh | AJaeger: I'm happy for you to do it unless you prefer me to | 10:28 |
AJaeger | #status notice Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating. Please do not approve or recheck changes for now. | 10:28 |
openstackstatus | AJaeger: sending notice | 10:28 |
AJaeger | jhesketh: done myself - thanks | 10:28 |
-openstackstatus- NOTICE: Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating. Please do not approve or recheck changes for now. | 10:29 | |
*** alexchadin has quit IRC | 10:30 | |
*** alexchadin has joined #openstack-infra | 10:30 | |
*** alexchadin has quit IRC | 10:30 | |
jhesketh | I think it looks like ze01 is slowly shutting down, so I'm going to give it some time (but it is /very/ slow) | 10:31 |
*** alexchadin has joined #openstack-infra | 10:31 | |
openstackstatus | AJaeger: finished sending notice | 10:31 |
*** alexchadin has quit IRC | 10:31 | |
jhesketh | after that, if restarting fixes things I can do the other executors | 10:31 |
*** kashyap has left #openstack-infra | 10:32 | |
*** alexchadin has joined #openstack-infra | 10:32 | |
*** alexchadin has quit IRC | 10:32 | |
*** alexchadin has joined #openstack-infra | 10:33 | |
*** alexchadin has quit IRC | 10:33 | |
*** dtruong has quit IRC | 10:35 | |
*** dtruong has joined #openstack-infra | 10:35 | |
AJaeger | wow, that takes long ;( | 10:36 |
jhesketh | I can do the others in parallel if it is the cause | 10:38 |
AJaeger | yep | 10:40 |
*** kjackal has quit IRC | 10:40 | |
*** ldnunes has joined #openstack-infra | 10:41 | |
*** kjackal has joined #openstack-infra | 10:41 | |
*** wolverineav has joined #openstack-infra | 10:42 | |
*** andreas_s has quit IRC | 10:43 | |
*** andreas_s has joined #openstack-infra | 10:47 | |
jhesketh | so it's stuck processing the update_queue (for getting git changes etc) but it's going very slowly | 10:48 |
jhesketh | unless it's doing whole clones (which it shouldn't) I can't see why that might be the case | 10:48 |
jhesketh | tobiash: any thoughts why the update_queue might be going so slow? ^ | 10:50 |
tobiash | jhesketh: currently lunching | 10:51 |
*** fverboso has joined #openstack-infra | 10:51 | |
jhesketh | no worries | 10:51 |
*** lucas-afk is now known as lucasagomes | 10:52 | |
*** sambetts|afk is now known as sambetts | 10:54 | |
*** andreas_s has quit IRC | 10:57 | |
*** andreas_s has joined #openstack-infra | 10:58 | |
*** yamamoto has quit IRC | 10:59 | |
AJaeger | jhesketh: I'm getting timeouts from jobs that finish ;( | 10:59 |
* AJaeger joins tobiash for virtual ;) lunch now | 11:00 | |
jhesketh | AJaeger: oh, link please? | 11:01 |
jhesketh | (when you return) | 11:01 |
*** gfidente has quit IRC | 11:02 | |
*** yamamoto has joined #openstack-infra | 11:03 | |
*** eyalb has joined #openstack-infra | 11:05 | |
*** gfidente has joined #openstack-infra | 11:05 | |
*** gfidente has joined #openstack-infra | 11:05 | |
tobiash | jhesketh: hrm, the update_queue does only resetting, fetching and cleaning repos | 11:06 |
tobiash | is the connection to gerrit slow? | 11:07 |
*** andreas_s has quit IRC | 11:07 | |
*** andreas_s has joined #openstack-infra | 11:08 | |
*** tosky has joined #openstack-infra | 11:08 | |
tobiash | jhesketh: hasn't some connection firewall limit been merged recently for gerrit? | 11:08 |
tobiash | maybe that's a side effect | 11:09 |
jhesketh | connection seems fine, I can clone a repo very fast on the executor so it's probably also not a firewall | 11:10 |
*** edmondsw has joined #openstack-infra | 11:10 | |
* jhesketh will be back shortly | 11:10 | |
tobiash | then I'm currently out of ideas | 11:11 |
*** alexchadin has joined #openstack-infra | 11:11 | |
*** edmondsw has quit IRC | 11:15 | |
*** rfolco|off is now known as rfolco|ruck | 11:16 | |
*** andreas_s has quit IRC | 11:22 | |
*** andreas_s has joined #openstack-infra | 11:23 | |
*** nicolasbock has joined #openstack-infra | 11:26 | |
*** andreas_s has quit IRC | 11:27 | |
*** andreas_s has joined #openstack-infra | 11:27 | |
* jhesketh returns | 11:31 | |
jhesketh | so it's still processing the update_queue. I think I'm going to kill the process now | 11:38 |
*** nmathew has quit IRC | 11:38 | |
*** larainema has quit IRC | 11:40 | |
*** links has quit IRC | 11:41 | |
*** pcichy has quit IRC | 11:48 | |
jhesketh | it's gone back to having a very slow process_queue (although actually accepting builds afaict since I started it again) | 11:49 |
jhesketh | git-upload-pack's are taking minutes | 11:49 |
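One way to quantify the "git-upload-pack is taking minutes" observation is to time a remote ref listing from a suspect executor and a healthy one and compare; `git ls-remote` drives git-upload-pack on the server side. The Gerrit URL form and the repo names here are only examples.

```python
import subprocess
import time

GERRIT = "https://review.openstack.org"
REPOS = ["openstack-infra/zuul", "openstack/nova"]   # example projects

for repo in REPOS:
    start = time.monotonic()
    subprocess.run(
        ["git", "ls-remote", "{}/{}".format(GERRIT, repo)],
        stdout=subprocess.DEVNULL, check=True,
    )
    print("{}: {:.1f}s".format(repo, time.monotonic() - start))
```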
*** sree has joined #openstack-infra | 11:51 | |
*** sree_ has joined #openstack-infra | 11:51 | |
*** sree_ is now known as Guest61310 | 11:52 | |
openstackgerrit | Merged openstack-infra/nodepool master: Convert nodepool-zuul-functional job https://review.openstack.org/540595 | 11:53 |
*** links has joined #openstack-infra | 11:54 | |
*** tpsilva has joined #openstack-infra | 11:55 | |
*** sree has quit IRC | 11:55 | |
*** gongysh has quit IRC | 11:57 | |
*** maciejjozefczyk has joined #openstack-infra | 11:59 | |
tobiash | jhesketh: git-upload-pack is server side, so it might be worth checking gerrit as well | 12:00 |
AJaeger | jhesketh: https://review.openstack.org/538508 has the timeout | 12:01 |
d0ugal | Is there any way to search logs for a particular CI job? I want to find when an exception first started | 12:05 |
jhesketh | tobiash: yep, a cursory glance doesn't show anything though and the other ze's appear to be working okay | 12:05 |
* jhesketh wonders if that's since changed | 12:05 | |
jhesketh | AJaeger: thanks... looks like there might be some more network issues... maybe these are affecting process queue | 12:06 |
AJaeger | jhesketh: https://review.openstack.org/539854 has one timeout as well - and post failures | 12:06 |
AJaeger | d0ugal: which job? | 12:06 |
jhesketh | d0ugal: http://logstash.openstack.org/ might help | 12:06 |
AJaeger | http://zuul.openstack.org/jobs.html helps | 12:06 |
AJaeger | and then click on Builds - and search for project | 12:07 |
d0ugal | AJaeger: a tripleo one - tripleo-ci-centos-7-scenario003-multinode-oooq-container | 12:07 |
d0ugal | cool, I shall try both of these | 12:07 |
d0ugal | I don't want to know how much time I have wasted manually looking through logs... I should have asked before! | 12:07 |
AJaeger | http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container8 | 12:08 |
AJaeger | without the 8 at then end - http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container | 12:09 |
d0ugal | Thanks | 12:09 |
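The builds listing AJaeger links is also served as JSON by zuul-web (the builds.json controller mentioned earlier), which makes "when did this job first start failing" questions scriptable. The exact endpoint and the field names are assumptions that may differ per deployment, so this sketch just dumps the first record for inspection:

```python
import json
from urllib.request import urlopen

url = ("http://zuul.openstack.org/builds.json"
       "?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container")
builds = json.loads(urlopen(url, timeout=30).read().decode())
print(len(builds), "builds returned")
if builds:
    print(json.dumps(builds[0], indent=2))   # inspect which fields are available
```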
*** electrical has quit IRC | 12:13 | |
*** electrical has joined #openstack-infra | 12:13 | |
*** Guest61310 has quit IRC | 12:14 | |
d0ugal | I guess I still need to manually look at the files :) | 12:15 |
*** sree has joined #openstack-infra | 12:15 | |
AJaeger | you can run queries with logstash | 12:16 |
d0ugal | I am trying to figure out how to do that :) | 12:18 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:19 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:19 |
*** sree has quit IRC | 12:19 | |
AJaeger | d0ugal: can't help with that - you might want to ask qa team | 12:20 |
*** edwarnicke has quit IRC | 12:21 | |
*** edwarnicke has joined #openstack-infra | 12:22 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp https://review.openstack.org/535562 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp https://review.openstack.org/535563 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params https://review.openstack.org/536301 | 12:25 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add a separate module for node management commands https://review.openstack.org/536303 | 12:25 |
*** fkautz has quit IRC | 12:25 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 12:26 |
*** HeOS has quit IRC | 12:26 | |
*** sshnaidm|rover is now known as sshnaidm|afk | 12:27 | |
*** hashar is now known as hasharAway | 12:28 | |
*** rossella_s has quit IRC | 12:29 | |
*** zoli is now known as zoli\ | 12:30 | |
*** zoli\ is now known as zoli | 12:30 | |
*** fverboso has quit IRC | 12:31 | |
*** rossella_s has joined #openstack-infra | 12:32 | |
*** rossella_s has quit IRC | 12:38 | |
*** rossella_s has joined #openstack-infra | 12:39 | |
*** jpena is now known as jpena|lunch | 12:51 | |
*** rossella_s has quit IRC | 12:52 | |
*** rossella_s has joined #openstack-infra | 12:52 | |
*** coolsvap has quit IRC | 12:54 | |
*** gongysh has joined #openstack-infra | 12:54 | |
*** links has quit IRC | 12:55 | |
*** jlabarre has joined #openstack-infra | 12:55 | |
*** panda|off is now known as panda | 12:55 | |
*** eharney has joined #openstack-infra | 12:58 | |
*** zhurong_ has joined #openstack-infra | 13:01 | |
*** rossella_s has quit IRC | 13:02 | |
*** larainema has joined #openstack-infra | 13:03 | |
*** rossella_s has joined #openstack-infra | 13:05 | |
*** zhurong has quit IRC | 13:08 | |
*** links has joined #openstack-infra | 13:08 | |
*** jbadiapa has joined #openstack-infra | 13:09 | |
*** edmondsw has joined #openstack-infra | 13:13 | |
*** jbadiapa has quit IRC | 13:14 | |
*** olaph has joined #openstack-infra | 13:16 | |
*** cshastri has quit IRC | 13:16 | |
*** olaph1 has quit IRC | 13:17 | |
*** amoralej is now known as amoralej|lunch | 13:19 | |
*** alexchadin has quit IRC | 13:22 | |
*** alexchadin has joined #openstack-infra | 13:23 | |
*** jaosorior has quit IRC | 13:23 | |
Shrews | AJaeger: I'm not going to restart the launchers now given the current situation. Don't want to add fuel to the fire. | 13:25 |
*** dayou has quit IRC | 13:26 | |
*** dave-mccowan has joined #openstack-infra | 13:29 | |
*** dhajare has quit IRC | 13:30 | |
*** rossella_s has quit IRC | 13:30 | |
AJaeger | Shrews: no idea what's going on ;( | 13:31 |
AJaeger | jhesketh: are you still around? Anything to share here? | 13:31 |
AJaeger | Shrews: looks like debugging/investigation is needed to move us forward - help welcome I guess | 13:32 |
jhesketh | AJaeger: still around but about to head off | 13:32 |
*** rossella_s has joined #openstack-infra | 13:32 | |
jhesketh | Shrews, infra-root: unfortunately I've been unable to make any more progress (see scrollback) and have to head off. As best as I can tell on some ze's they are taking a very long time to perform git operations | 13:34 |
jhesketh | I've restarted ze01 with no effect | 13:34 |
jhesketh | Some executors don't appear affected and resource usage on the common hosts (zuul, nodepool, gerrit etc) all look normal | 13:34 |
*** dave-mccowan has quit IRC | 13:35 | |
*** dave-mcc_ has joined #openstack-infra | 13:36 | |
*** hemna_ has joined #openstack-infra | 13:36 | |
*** pgadiya has quit IRC | 13:40 | |
*** rossella_s has quit IRC | 13:41 | |
fungi | https://rackspace.service-now.com/system_status/ doesn't indicate any widespread issues they're aware of | 13:42 |
fungi | still catching up (is there a summary?) but resource utilization on ze01 looks normal or even low | 13:43 |
*** rossella_s has joined #openstack-infra | 13:44 | |
*** janki has quit IRC | 13:44 | |
AJaeger | fungi: see jhesketh's last lines - and review http://grafana.openstack.org/dashboard/db/zuul-status . None of the ze's is accepting ;( | 13:45 |
jhesketh | Actually I think they are. Just very very very slowly | 13:46 |
AJaeger | so, slower than a snail ;( | 13:46 |
fungi | first guess is maybe this is the effect of the new throttle | 13:46 |
fungi | does the pace match the rate corvus described? | 13:47 |
jhesketh | fungi: git-upload-pack is taking minutes on a few executor hosts, causing the jobs to take forever to prepare and never keeping up with the work | 13:47 |
*** jpena|lunch is now known as jpena | 13:47 | |
fungi | ahh | 13:47 |
fungi | is that a subprocess of ansible pushing prepared repositories to the job nodes? any particular repos? | 13:47 |
jhesketh | Well that's my analysis of it. It's very likely I'm wrong and chasing a red herring | 13:48 |
jhesketh | No I think it's the executor mergers preparing the repos for config changes and cache etc | 13:48 |
fungi | not (yet) at a machine where i can pull up graphs easily. is there a list of which executors are exhibiting this behavior and which aren't? | 13:49 |
jhesketh | You can see the process in ps. The time it is running for seems extreme but maybe it's correct | 13:49 |
fungi | are any of the standalone mergers also having similar trouble? | 13:49 |
jhesketh | I've stepped away so I can't give you a list sorry. ze01 is affected and for comparison ze02 isn't | 13:50 |
jhesketh | From memory I think 3,5,6,7,8 are also affected | 13:50 |
jhesketh | fungi: great question, I didn't think to check sorry | 13:50 |
fungi | no problem. that helps, thanks! | 13:51 |
*** dbecker has joined #openstack-infra | 13:51 | |
jhesketh | fwiw git checkout on a suffering host (from gerrit) worked just fine | 13:51 |
*** rossella_s has quit IRC | 13:52 | |
*** zhurong_ has quit IRC | 13:52 | |
jhesketh | No worries, sorry I couldn't get further or stick around | 13:52 |
*** eumel8 has quit IRC | 13:52 | |
*** dizquierdo has quit IRC | 13:52 | |
*** sshnaidm|afk is now known as sshnaidm|rover | 13:52 | |
fungi | yeah, for now i'm not going to assume long-running git-upload-pack processes are atypical | 13:52 |
*** jcoufal has joined #openstack-infra | 13:52 | |
fungi | the governor changes also added some statsd counters/gauges | 13:53 |
*** esberglu has joined #openstack-infra | 13:53 | |
fungi | in a few minutes i should be in a better place to start looking into those if no other infra-root beats me to it | 13:53 |
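The statsd counters/gauges fungi mentions use the plain statsd wire protocol: one UDP datagram per metric. A minimal emitter, with a made-up metric name just to show the format:

```python
import socket

def send_gauge(name, value, host="localhost", port=8125):
    payload = "{}:{}|g".format(name, value).encode()   # "|c" would be a counter
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

send_gauge("zuul.example.starting_builds", 42)   # hypothetical metric name
```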
*** rossella_s has joined #openstack-infra | 13:54 | |
Shrews | well, i'm not sure what to look for. so far on ze01, i see a lot of ssh unreachable errors for inap nodes, but not sure if that's related | 13:56 |
AJaeger | http://grafana.openstack.org/dashboard/db/nodepool-inap - inap might have had some hiccups | 13:57 |
AJaeger | Shrews: so ianw restarted nl03 and then it continued to work. | 13:58 |
AJaeger | There's - according to graphs - a correlation to the inap hiccup (not sure whether at inap, or the network to it) | 13:59 |
*** hrw has quit IRC | 13:59 | |
pabelanger | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 14:00 |
*** jamesmcarthur has joined #openstack-infra | 14:00 | |
pabelanger | we are also just about out of memory on zuul.o.o. | 14:00 |
AJaeger | pabelanger: wasn't when this started - but with this long backlog (going on since around 7:00 UTC)... | 14:01 |
*** hrw has joined #openstack-infra | 14:01 | |
*** oidgar has joined #openstack-infra | 14:03 | |
*** rossella_s has quit IRC | 14:04 | |
*** jaosorior has joined #openstack-infra | 14:04 | |
*** psachin has quit IRC | 14:04 | |
AJaeger | pabelanger: that should reduce once we process jobs in "normal" speed | 14:05 |
*** rossella_s has joined #openstack-infra | 14:06 | |
fungi | well, it won't really "reduce" visibly since the python interpreter won't really free memory allocations back to the system | 14:06 |
AJaeger | mnaser: let's not approve changes for now, please - we have a really slow Zuul right now... | 14:06 |
fungi | but it'll stop growing, probably for a very long time until the next time it needs more than that amount of memory | 14:06 |
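For watching that "stops growing" behaviour from inside a Python process, the peak resident set size is available from the stdlib resource module (on Linux, ru_maxrss is reported in kilobytes):

```python
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: {:.1f} MiB".format(peak_kb / 1024))
```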
mnaser | AJaeger: sorry, ack | 14:07 |
AJaeger | fungi: yeah, we might do a restart at the end | 14:07 |
pabelanger | sadly, I'm not in a good spot to help debug this morning. | 14:07 |
pabelanger | Hope to be back into things in next 90mins or so | 14:08 |
*** jrist has quit IRC | 14:11 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 14:11 |
*** rossella_s has quit IRC | 14:11 | |
*** jrist has joined #openstack-infra | 14:12 | |
*** rossella_s has joined #openstack-infra | 14:12 | |
fungi | infra-root: ovh has also asked us to stop using any of their regions (due to customer impact) and check back with them in q3 | 14:13 |
Shrews | fwiw, i'm not seeing any unusual nodepool behavior, even on the launcher ianw restarted (though i did see a minor bug in our logs) | 14:13 |
Shrews | fungi: :( | 14:13 |
fungi | looks like they e-mailed that to me and clarkb just now | 14:14 |
*** andreas_s has quit IRC | 14:15 | |
pabelanger | fungi: ack, sad to see them turned off but understand. | 14:15 |
*** andreas_s has joined #openstack-infra | 14:15 | |
*** efried has joined #openstack-infra | 14:15 | |
*** Goneri has joined #openstack-infra | 14:16 | |
*** andreas_s has quit IRC | 14:18 | |
*** mriedem has joined #openstack-infra | 14:18 | |
*** andreas_s has joined #openstack-infra | 14:18 | |
*** rosmaita has joined #openstack-infra | 14:20 | |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 14:20 |
*** oidgar has quit IRC | 14:22 | |
*** amoralej|lunch is now known as amoralej | 14:24 | |
openstackgerrit | Merged openstack-infra/project-config master: Revert "base-test: test new mirror-workspace role" https://review.openstack.org/540949 | 14:28 |
*** jamesmcarthur has quit IRC | 14:29 | |
*** tiswanso has joined #openstack-infra | 14:30 | |
*** jamesmcarthur has joined #openstack-infra | 14:30 | |
*** mihalis68 has joined #openstack-infra | 14:30 | |
openstackgerrit | Liam Young proposed openstack-infra/project-config master: Add vault charm to gerrit https://review.openstack.org/541287 | 14:31 |
fungi | infra-root: sorry, i misread. not ovh but citycloud, so not nearly as big of a hit | 14:32 |
*** lucasagomes is now known as lucas-hungry | 14:34 | |
*** david-lyle has quit IRC | 14:34 | |
*** fverboso has joined #openstack-infra | 14:35 | |
openstackgerrit | Liam Young proposed openstack-infra/project-config master: Add vault charm to gerrit https://review.openstack.org/541287 | 14:35 |
*** jamesmcarthur has quit IRC | 14:35 | |
*** dtantsur is now known as dtantsur|bbl | 14:37 | |
*** rossella_s has quit IRC | 14:37 | |
*** links has quit IRC | 14:38 | |
dmsimard | fungi: it's not the first time citycloud has asked us to hold back, right? | 14:39 |
dmsimard | fungi: I mean, is it no-op or do we have to tune some things down still ? | 14:39 |
*** rossella_s has joined #openstack-infra | 14:40 | |
fungi | dmsimard: there are still a couple regions at max-servers: 50 and also we should probably pause image uploads since they say they want us to stop until "q3" which i take to mean check with them again on july 1 | 14:40 |
dmsimard | depends if it's q3 calendar or q3 fiscal, by default I guess it's calendar ? | 14:41 |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 14:46 |
*** rossella_s has quit IRC | 14:47 | |
*** alexchadin has quit IRC | 14:48 | |
*** rossella_s has joined #openstack-infra | 14:50 | |
*** dbecker has quit IRC | 14:51 | |
*** olaph1 has joined #openstack-infra | 14:51 | |
*** olaph has quit IRC | 14:53 | |
*** sweston has quit IRC | 14:54 | |
*** sweston has joined #openstack-infra | 14:54 | |
*** derekjhyang has quit IRC | 14:55 | |
*** davidlenwell has quit IRC | 14:55 | |
*** dbecker has joined #openstack-infra | 14:56 | |
*** derekjhyang has joined #openstack-infra | 14:56 | |
*** davidlenwell has joined #openstack-infra | 14:56 | |
*** kiennt26 has joined #openstack-infra | 14:57 | |
*** hasharAway is now known as hashar | 15:00 | |
*** olaph1 is now known as olaph | 15:00 | |
*** jamesmcarthur has joined #openstack-infra | 15:02 | |
*** ihrachys has joined #openstack-infra | 15:03 | |
fungi | i'm getting lost in the maze of nodepool configs and yaml anchors now. checking the nodepool docs to figure out where i need to add pause directives | 15:04 |
Shrews | fungi: the diskimage section | 15:07 |
fungi | sure, just trying to figure out which one(s) | 15:07 |
*** HeOS has joined #openstack-infra | 15:07 | |
*** fkautz has joined #openstack-infra | 15:07 | |
fungi | to pause all image uploads for citycloud, do i need to add pause to each of the citycloud diskimages, and does that need to be done both in nodepool.yaml and nl02.yaml? | 15:08 |
fungi | or do the image configurations on the launchers not matter to the builders i guess? | 15:08 |
Shrews | fungi: image uploads only happen on nb0* | 15:09 |
fungi | well, that wasn't my question. i'll check puppet to see if the nl0*.yaml configs get installed on the nb0* hosts | 15:10 |
Shrews | doesn't matter what the launchers have | 15:10 |
Shrews | fungi: oh sorry | 15:10 |
fungi | mainly trying to figure out which specific config files i need to update for this | 15:10 |
fungi | and whether this means undoing a bunch of the yaml anchor business in nodepool.yaml (i'm betting it does) | 15:11 |
*** mnaser has quit IRC | 15:11 | |
*** mnaser has joined #openstack-infra | 15:12 | |
*** dtantsur|bbl is now known as dtantsur | 15:14 | |
*** mgkwill has quit IRC | 15:14 | |
Shrews | fungi: oh hrm... i'm not familiar with how the anchors work | 15:15 |
dmsimard | infra-root: FYI there will be a 0.14.6 release of ARA to workaround an issue where Ansible sometimes passes ignore_errors as a non-boolean value | 15:15 |
*** mgkwill has joined #openstack-infra | 15:15 | |
*** rossella_s has quit IRC | 15:16 | |
*** jamesmca_ has joined #openstack-infra | 15:18 | |
*** rossella_s has joined #openstack-infra | 15:19 | |
*** hrybacki has quit IRC | 15:19 | |
Shrews | fungi: if it helps, the nl0*.yaml configs are not installed on the nb0* instances. so i think we only need to deal with nodepool.yaml with the anchor business | 15:19 |
*** hrybacki has joined #openstack-infra | 15:19 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Disable citycloud in nodepool https://review.openstack.org/541307 | 15:20 |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: Add separate modules for management commands https://review.openstack.org/536303 | 15:20 |
fungi | Shrews: thanks for confirming! that's what i figured as well | 15:20 |
fungi | see 541307 | 15:20 |
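As background for 541307: image uploads run only on the nb0* builders, so pausing them means setting `pause: true` on each citycloud diskimage entry in nodepool.yaml; the launcher configs (nl0*.yaml) are left alone. A minimal sketch of the shape of such a stanza, with illustrative provider/image names rather than the real layout:

```shell
# Illustrative only -- not the actual nodepool.yaml anchors or provider names.
cat <<'EOF'
providers:
  - name: citycloud-sto2
    diskimages:
      - name: ubuntu-xenial
        pause: true    # builder keeps the image but stops uploading it here
EOF
```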
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint https://review.openstack.org/536319 | 15:20 |
pabelanger | okay, ready to help now | 15:21 |
pabelanger | we still seeing slowness in zuul? | 15:21 |
fungi | pabelanger: i expect so. also 541307 is a quick review but semi-urgent so i can reply to citycloud | 15:22 |
pabelanger | yah, just looking now | 15:23 |
*** bobh has joined #openstack-infra | 15:23 | |
*** trown is now known as trown|outtypewww | 15:23 | |
pabelanger | +2 | 15:24 |
*** iyamahat has joined #openstack-infra | 15:25 | |
*** caphrim007 has quit IRC | 15:25 | |
Shrews | fungi: +2'd. can approve now if you wish, or wait for others to look | 15:26 |
*** olaph1 has joined #openstack-infra | 15:27 | |
fungi | Shrews: let's go ahead and approve if you're okay with the configuration. the sooner it merges, the sooner i can let the provider know | 15:27 |
pabelanger | http://grafana.openstack.org/dashboard/db/zuul-status is a little confusing to read, but looking at ze03 now | 15:27 |
*** olaph has quit IRC | 15:27 | |
pabelanger | I do see route issues on ze03.o.o at 2018-02-06 06:47:13,066 stderr: 'ssh: connect to host review.openstack.org port 29418: No route to host | 15:28 |
AJaeger | pabelanger: check http://grafana.openstack.org/dashboard/db/zuul-status?from=1517844541494&to=1517930941494 | 15:29 |
pabelanger | I also see ansible SSH errors | 15:29 |
pabelanger | 2018-02-06 14:35:07,097 DEBUG zuul.AnsibleJob: [build: d1d72c14bb2b47eca956568b37af9a49] Ansible output: b' "msg": "SSH Error: data could not be sent to remote host \\"104.130.72.49\\". Make sure this host can be reached over ssh",' | 15:29 |
AJaeger | that nicely shows the executors not accepting jobs since 7:22 | 15:30 |
pabelanger | however, I think we might have leaked jobs because | 15:30 |
pabelanger | 2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.0 | 15:30 |
pabelanger | however, there are no ansible-playbooks process running | 15:30 |
*** lucas-hungry is now known as lucasagomes | 15:30 | |
AJaeger | fungi: it might take *hours* to merge that change with the current slowness | 15:30 |
fungi | pabelanger: on ze03 i see that telnet to any open port on review.o.o hangs when using ipv6, but works over ipv4 | 15:31 |
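A sketch of the manual reachability check being described, assuming nc and ping6 are available on the executors:

```shell
# From an executor, compare IPv4 vs IPv6 reachability of Gerrit's SSH port.
nc -4 -zv -w 5 review.openstack.org 29418   # succeeds everywhere
nc -6 -zv -w 5 review.openstack.org 29418   # hangs/times out on the broken executors
ping6 -c 3 review.openstack.org             # also fails where v6 routing is broken
```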
fungi | AJaeger: i know | 15:31 |
pabelanger | fungi: k | 15:31 |
*** kiennt26 has quit IRC | 15:31 | |
pabelanger | I'm going to look at zuul source code for a moment | 15:31 |
fungi | AJaeger: but if it's at least approved then as soon as we get zuul back on track i can enqueue it into the gate to expedite it | 15:31 |
AJaeger | fungi: sure! | 15:32 |
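The expedite step fungi mentions would look roughly like this on the scheduler; the tenant name matches this deployment, but the patchset number is an assumption:

```shell
# Promote the already-approved change straight into the gate pipeline.
zuul enqueue --tenant openstack --trigger gerrit \
    --pipeline gate --project openstack-infra/project-config --change 541307,1
```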
fungi | pabelanger: from ze10, on the other hand, i can connect to review.openstack.org via both ipv4 and ipv6 | 15:32 |
mordred | fungi: what can I do to help? | 15:32 |
mordred | oh jeez. we're having routing issues? | 15:33 |
fungi | mordred: see what oddities you can spot with the misbehaving executors, i think | 15:33 |
fungi | mordred: that looks like one possible explanation | 15:33 |
pabelanger | yah, think so | 15:33 |
pabelanger | but, also at 20 running builds, at least the executor thinks that | 15:33 |
pabelanger | but no ansible-playbook processes are running | 15:34 |
tobiash | pabelanger: 20 starting builds | 15:34 |
tobiash | that means pre-ansible git and workspace preparation | 15:34 |
tobiash | so that could be explained by the routing issues maybe | 15:35 |
pabelanger | tobiash: where did you see that? | 15:35 |
mordred | oh - yah - if the mergers are having trouble cloning from gerrit due to routing issues | 15:35 |
*** jungleboyj has quit IRC | 15:36 | |
tobiash | pabelanger: wasn't that the starting builds metric earlier today? | 15:36 |
*** jungleboyj has joined #openstack-infra | 15:36 | |
*** sree has joined #openstack-infra | 15:36 | |
tobiash | pabelanger: 2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.0 | 15:36 |
*** aviau has quit IRC | 15:36 | |
pabelanger | yah | 15:37 |
*** aviau has joined #openstack-infra | 15:37 | |
tobiash | this means 20 builds in pre-ansible phase which is triggered by the slow start change | 15:37 |
fungi | ze01, ze03, ze06, ze07, ze08 and ze09 can't reach tcp services on review.o.o over ipv6. ze02, ze04, ze05 and ze10 can | 15:37 |
*** med_ has quit IRC | 15:38 | |
pabelanger | mordred: tobiash: I wonder if we are not resetting starting_builds in that case | 15:38 |
fungi | i'm checking routing for mergers next | 15:38 |
corvus | maybe we should switch the executors to ipv4 only? | 15:38 |
corvus | do we have any ipv6-only clouds right now? | 15:38 |
*** yamamoto has quit IRC | 15:39 | |
mordred | not to my knowledge | 15:39 |
fungi | that might be a viable stop-gap. we could just unconfigure the v6 addresses on them for the moment | 15:39 |
tobiash | pabelanger: that would mean we can leak job_workers in the executor which would be a zuul bug | 15:39 |
pabelanger | tobiash: I haven't confirmed yet, trying to understand how the new logic works | 15:40 |
*** armaan has quit IRC | 15:40 | |
fungi | worth noting, none of the standalone mergers (zm01-zm08) are exhibiting this ipv6 routing issue | 15:40 |
*** armaan has joined #openstack-infra | 15:41 | |
tobiash | pabelanger: that's entirely possible but I didn't spot such a code path in my 5min code scraping this morning | 15:41 |
pabelanger | that is good news | 15:41 |
*** sree has quit IRC | 15:41 | |
*** yamamoto has joined #openstack-infra | 15:42 | |
mordred | fungi: I wonder if reconfiguring the interfaces on one of the broken executors would have any impact ... oh, wait - these are rackspace so have static IP information ... | 15:42 |
fungi | i wonder if rebooting them may "fix" it | 15:42 |
mordred | I was thinking something might be wrong with RA... but since it's static config I think that's less likely | 15:43 |
fungi | i've seen traffic cease making it to instances in rackspace in the past, and a hard reboot gets routing reestablished in whatever their network gear is | 15:43 |
fungi | though in those cases it's usually been both v4 and v6 at the same time | 15:43 |
*** eyalb has left #openstack-infra | 15:44 | |
mordred | the ipv6 routing tables on a working and a non-working node are the same | 15:44 |
corvus | let's go ahead and stop ze01 and reboot it, that's an easy test. who wants to stop it? | 15:44 |
dmsimard | I have time to give a hand now | 15:44 |
fungi | yeah, i have a feeling this is due to something outside the instance itself | 15:44 |
fungi | i | 15:44 |
fungi | can stop it now | 15:45 |
*** vhosakot has joined #openstack-infra | 15:45 | |
fungi | it's stopping now | 15:45 |
corvus | fungi: cool, you're stopping ze01 | 15:45 |
fungi | yeah, even ping6 isn't making it through | 15:45 |
*** dbecker has quit IRC | 15:45 | |
dmsimard | The test node graph is super weird but you probably know that already http://grafana.openstack.org/dashboard/db/zuul-status?panelId=20&fullscreen | 15:45 |
corvus | while that's going on, let's try to deconfigure ipv6 on ze03 | 15:45 |
* dmsimard catches up | 15:45 | |
mordred | fwiw - http://paste.openstack.org/show/663521/ <-- that's got routing and address info for a working and a non-working node. both look the same to me | 15:46 |
fungi | what's strange is that (so far from my testing) i can ping6 instances in iad from ze01 but not instances in dfw | 15:46 |
fungi | i take that back. i can ping6 etherpad.o.o from ze01 successfully | 15:47 |
mordred | oh! | 15:47 |
fungi | so this may be some flow-based issue in their routing layer | 15:47 |
mordred | ip -6 neigh shows differences | 15:47 |
mordred | http://paste.openstack.org/show/663523/ | 15:47 |
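What that paste compares, roughly: the neighbor-cache state of the IPv6 default gateway on a working executor versus a broken one.

```shell
# Run on each executor and compare the gateway's neighbor entry state.
ip -6 route show default   # note the gateway address
ip -6 neigh show           # REACHABLE vs STALE/FAILED for that gateway
```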
fungi | i still see a few git remote operations as child processes of zuul-executor on ze01 | 15:48 |
fungi | wonder if i should just kill those | 15:48 |
corvus | ip -6 addr del 2001:4800:7818:104:be76:4eff:fe04:a4bf/64 dev eth0 | 15:48 |
corvus | how's that look? | 15:49 |
*** yamahata has joined #openstack-infra | 15:49 | |
corvus | fungi: killing git should be okay | 15:49 |
corvus | and i'll do the same for the fe80 addrs? | 15:49 |
*** dtantsur is now known as dtantsur|bbl | 15:49 | |
dmsimard | mordred: curious to try an arping on ze08 on that stale router | 15:50 |
fungi | shouldn't need to do it for any fe80:: addresses as those are just linklocal | 15:50 |
fungi | it's only the scope global address(es) which matter in this case | 15:50 |
corvus | ok, wasn't sure if that would affect anything. i'll just do it for that global address on ze03 then? | 15:50 |
dmsimard | mordred: ip -6 neigh shows the router as reachable now (didn't do anything btw) | 15:50 |
fungi | corvus: yeah, that should be plenty | 15:50 |
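Put together, the workaround applied on ze03 (and later on the other affected executors) amounts to the following; the address is the one quoted above and is specific to that host.

```shell
# Drop only the global-scope IPv6 address so traffic falls back to IPv4;
# the fe80:: link-local address is left in place.
ip -6 addr show dev eth0 scope global
sudo ip -6 addr del 2001:4800:7818:104:be76:4eff:fe04:a4bf/64 dev eth0
```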
*** camunoz has joined #openstack-infra | 15:50 | |
corvus | done on ze03 | 15:51 |
*** david-lyle has joined #openstack-infra | 15:51 | |
*** pcaruana has quit IRC | 15:51 | |
corvus | if that works, presumably we'll see the effect after the current git op completes | 15:51 |
dmsimard | mordred: some flapping going on ? http://paste.openstack.org/raw/663529/ | 15:51 |
corvus | which is now, and it's running very quickly | 15:51 |
*** samueldmq has quit IRC | 15:52 | |
*** samueldmq has joined #openstack-infra | 15:52 | |
corvus | now i see a lot of ssh unreachable errors from ansible | 15:52 |
corvus | they are to ipv4 addresses | 15:52 |
fungi | ugh, ze01 keeps spawning new git operations | 15:52 |
corvus | fungi: yeah, it probably will. known inefficiency in shutdown: it doesn't short-circuit the update queue | 15:53 |
corvus | also, it now retries failed git operations 3 times. | 15:53 |
*** yamamoto has quit IRC | 15:53 | |
*** xarses has joined #openstack-infra | 15:54 | |
corvus | weird, i'm attempting to manually connect to the hosts that ansible says it can't connect to, and it's working | 15:54 |
fungi | okay, so could also be some random v4 connectivity issues from that instance? | 15:55 |
*** xarses_ has joined #openstack-infra | 15:55 | |
*** claudiub|2 has joined #openstack-infra | 15:55 | |
*** claudiub|2 has quit IRC | 15:55 | |
fungi | that's sounding more and more like some sort of address-based flow balancing sending some connections to a dead device | 15:55 |
openstackgerrit | Merged openstack-infra/zuul master: Fix github connection for standalone debugging https://review.openstack.org/540772 | 15:56 |
corvus | i've yet to fail to connect to one of those hosts manually, and pings are working fine | 15:56 |
*** claudiub|2 has joined #openstack-infra | 15:56 | |
Shrews | corvus: the few addresses i saw that happening for were all inap | 15:57 |
*** salv-orlando has joined #openstack-infra | 15:57 | |
corvus | the other hosts have an increase in those errors too | 15:58 |
dmsimard | Just curious... are we graphing conntrack ? Don't see it in cacti.. it requires some amount of RAM to keep iptables/ip6tables going with conntrack. Could our RAM starvation problems end up impacting that ? I don't see any conntrack messages in dmesg.. I know it would print obvious errors if we're reaching nf conntrack's max number but I don't know if it complains about other things. | 15:58 |
*** claudiub has quit IRC | 15:58 | |
*** olaph has joined #openstack-infra | 15:58 | |
*** xarses has quit IRC | 15:58 | |
dmsimard | Obviously right now the current conntrack numbers are low because nothing's running | 15:58 |
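A quick way to check the conntrack theory dmsimard raises, using standard procfs paths:

```shell
# Is the connection-tracking table anywhere near its limit?
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg | grep -i conntrack   # "table full, dropping packet" would appear here
```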
corvus | ze02 had 7 instances of ssh connection errors yesterday and 678 today. ze02 is one of the hosts that is 'working'. | 15:59 |
dmsimard | ok I'll look at ze02 | 15:59 |
*** olaph1 has quit IRC | 16:00 | |
*** r-daneel has joined #openstack-infra | 16:01 | |
fungi | still playing whack-a-mole with git processes on ze01 waiting for it to complete zuul-executor service shutdown | 16:02 |
corvus | Shrews: this is a list of ip addrs that ze03 has recently failed to connect to: http://paste.openstack.org/show/663545/ | 16:02 |
*** salv-orlando has quit IRC | 16:03 | |
*** salv-orlando has joined #openstack-infra | 16:03 | |
corvus | the 104's in there are rax | 16:03 |
Shrews | yeah, those seem different enough to be multiple providers | 16:04 |
corvus | the 23. are rax | 16:04 |
corvus | the 166. are rax... | 16:04 |
corvus | hrm | 16:04 |
*** v1k0d3n has quit IRC | 16:04 | |
corvus | 37. is citycloud | 16:05 |
dmsimard | We probably want to tweak net.ipv4.tcp_keepalive_intvl... /proc/sys/net/ipv4/tcp_keepalive_intvl is showing 75 right now | 16:05 |
*** v1k0d3n has joined #openstack-infra | 16:05 | |
*** icey has quit IRC | 16:05 | |
dmsimard | That's quite a large interval :/ | 16:05 |
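For reference, the keepalive knobs under discussion; 75 seconds is the kernel default interval, and the lower value below is only an example, not a recommendation:

```shell
# Inspect, then tentatively tighten, TCP keepalive behaviour.
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=15   # example value only
```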
*** icey has joined #openstack-infra | 16:05 | |
*** jaosorior has quit IRC | 16:06 | |
fungi | https://rackspace.service-now.com/system_status/ still doesn't indicate any known widespread issue | 16:07 |
*** armaan has quit IRC | 16:07 | |
*** armaan has joined #openstack-infra | 16:08 | |
Shrews | 86 is citycloud too | 16:10 |
Shrews | 146 rax | 16:10 |
clarkb | the way rax does network throttling is to inject resets iirc | 16:10 |
*** neiljerram has joined #openstack-infra | 16:10 | |
clarkb | do the cacti graphs look like we are hitting the limit and getting throttled? | 16:11 |
fungi | worth noting citycloud has mentioned we're causing issues in their regions, 541307 will disable them in our nodepool configuration for now. we might want to consider whether the citycloud issues are unrelated | 16:11 |
Shrews | the first one, 213, is ovh | 16:11 |
fungi | clarkb: not seeing reset-type disconnects. more like no response/timeout | 16:12 |
neiljerram | Hi all. I just mistakenly did 'git review -y' on a long chain of commits, instead of the squashed one that I wanted to push to review.openstack.org. This can be seen under 'Submitted Together' at https://review.openstack.org/#/c/541365/. Is there a way that I can abandon all of those mistakenly submitted changes? | 16:12 |
corvus | the unreachable errors seem to be happening during the ansible setup phase. so basically, the first connection is failing, but my subsequent manual telnet connections are succeeding. | 16:12 |
fungi | neiljerram: if you use gertty, you can process-mark them and mass abandon | 16:13 |
clarkb | fungi: ya and cacti doesn't seem to support that theory either but graph is spotty (I'm guessing because udp packets are being blackholed too) | 16:13 |
neiljerram | fungi, I don't at the moment. But if that's the best approach, I'll set that up. | 16:13 |
fungi | neiljerram: or you can script something against one of the gerrit apis (ssh api, rest api) to abandon a list of changes | 16:13 |
*** clayg has quit IRC | 16:13 | |
fungi | i don't know of any mass review operations available in the gerrit webui | 16:13 |
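A minimal sketch of the scripted option, using Gerrit's SSH API; the change numbers are placeholders for the mistakenly pushed series:

```shell
# Abandon a list of changes over the Gerrit SSH API (numbers are illustrative).
for change in 541339 541340 541341; do
  ssh -p 29418 review.openstack.org \
      gerrit review --abandon --message "'pushed by mistake, please ignore'" "$change,1"
done
```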
neiljerram | fungi, Or I guess just some manual clicking.... | 16:14 |
AJaeger | neiljerram: 43 changes? Argh ;( | 16:14 |
dmsimard | neiljerram: depends how long it takes to automate it vs doing it by hand | 16:14 |
*** clayg has joined #openstack-infra | 16:14 | |
neiljerram | AJaeger, indeed, and sorry! | 16:14 |
*** beisner has quit IRC | 16:14 | |
clarkb | fungi: on the citycloud front we are waiting for the change to merge which is being affected by this outage too right? should we consider merging that more forcefully? | 16:14 |
AJaeger | neiljerram: unfortunate timing, please abandon quickly... | 16:14 |
*** beisner has joined #openstack-infra | 16:14 | |
fungi | clarkb: we can bypass ci with it, i'd support that | 16:15 |
neiljerram | AJaeger, doing that as quickly as I can right now. | 16:15 |
corvus | neiljerram: i'll do it | 16:15 |
*** andreas_s has quit IRC | 16:17 | |
*** andreas_s has joined #openstack-infra | 16:17 | |
neiljerram | corvus, OK, thanks | 16:17 |
corvus | neiljerram: done (via gertty) | 16:17 |
pabelanger | we're at less than 1GB (854M) of available memory before we start swapping on zuul.o.o too. | 16:17 |
*** yamamoto has joined #openstack-infra | 16:18 | |
corvus | i can't think of a next step for the ipv4 connection issues. i don't know what would cause the first connection to fail but not subsequent ones. | 16:18 |
openstackgerrit | Doug Hellmann proposed openstack-infra/openstack-zuul-jobs master: [WIP] docs sitemap generation automation https://review.openstack.org/524862 | 16:19 |
AJaeger | neiljerram, corvus looks like some are not abandoned - neiljerram could you double check, please? e.g. https://review.openstack.org/#/c/541339/1 | 16:19 |
neiljerram | AJaeger, I'm doing that | 16:19 |
*** yamamoto has quit IRC | 16:19 | |
clarkb | fungi: ok I'm going to go ahead and do that so that we can focus on fixing zuul without also worrying about making citycloud happy | 16:19 |
neiljerram | AJaeger, they're all gone now | 16:20 |
neiljerram | corvus, many thanks for your help! | 16:20 |
corvus | i also don't really know how to confirm that. it's pretty tricky to find a node that an executor is about to use before it uses it. | 16:20 |
*** yamamoto has joined #openstack-infra | 16:20 | |
corvus | neiljerram: np. you should take a look at gertty anyway :) | 16:20 |
neiljerram | corvus, I did in the past, but didn't quite get into it; I'll have another go. | 16:21 |
openstackgerrit | Merged openstack-infra/project-config master: Disable citycloud in nodepool https://review.openstack.org/541307 | 16:21 |
*** kopecmartin has quit IRC | 16:21 | |
corvus | neiljerram: let me know (later) what got in your way so i can improve it :) | 16:22 |
fungi | thanks clarkb | 16:22 |
clarkb | fungi: ^ do you want me to respond to them as well? | 16:22 |
*** e0ne has quit IRC | 16:22 | |
fungi | clarkb: i have a half-written reply awaiting merger of the change. i can go ahead and finish it | 16:22 |
clarkb | fungi: ok | 16:22 |
fungi | once we see it take effect on the np servers | 16:23 |
fungi | i mean, once it's np-complete? ;) | 16:23 |
*** pcichy has joined #openstack-infra | 16:23 | |
dmsimard | corvus: is it normal to have these stuck git processes on zuul01? http://paste.openstack.org/raw/663572/ | 16:23 |
corvus | dmsimard: that'll be puppet | 16:23 |
corvus | dmsimard: it's erroneous, couldn't say if it's abnormal. probably safe to kill. | 16:23 |
dmsimard | looks like puppet has successfully updated the repository about an hour ago so I'll go ahead and kill those | 16:25 |
*** yamamoto has quit IRC | 16:25 | |
fungi | still killing git processes and waiting for zuul-executor to stop on ze01 | 16:26 |
*** andreas_s has quit IRC | 16:26 | |
corvus | i think ze03 shows us that if we ifdown the ipv6 interfaces, we'll get things moving again, and more jobs will be accepted. *except* that a good portion of them (unsure what %) will hit connection errors as soon as they start. those jobs will be retried 3x. tbh, i'm not sure whether it's a net win to fix the ipv6 issue. | 16:27 |
AJaeger | btw. should we send a #status alert? I sent a #status notice earlier... | 16:28 |
dmsimard | corvus, pabelanger: looking at the zuul scheduler memory.. seeing http://paste.openstack.org/raw/663579/ which somewhat reminds me of the ze issues we saw the other day with Shrews about the log spam | 16:28 |
* AJaeger will be offline for a bit now, so can't do it | 16:28 | |
*** rossella_s has quit IRC | 16:28 | |
corvus | dmsimard: log spam? | 16:28 |
dmsimard | corvus: let me find the issue and the patch, sec | 16:29 |
dmsimard | corvus: but tail -f /var/log/zuul/zuul.log is not pretty right now | 16:29 |
*** rossella_s has joined #openstack-infra | 16:30 | |
corvus | that's a very serious error | 16:30 |
*** yamamoto has joined #openstack-infra | 16:30 | |
Shrews | dmsimard: you're thinking of the repeated request decline by nodepool, i think, which was something totally different | 16:32 |
dmsimard | corvus: okay so the nodepool thing was different -- we were seeing this: http://paste.openstack.org/raw/653420/ and Shrews submitted https://review.openstack.org/#/c/537932/ -- he mentioned that https://review.openstack.org/#/c/533372/ may have caused the issue | 16:32 |
dmsimard | Shrews: yeah | 16:32 |
corvus | okay, i'm no longer looking into executor issues, i'm looking into scheduler-nodepool issues | 16:33 |
*** gmann has quit IRC | 16:33 | |
*** gmann has joined #openstack-infra | 16:33 | |
dmsimard | corvus: this locked exception is likely contributing to the ram usage we're seeing | 16:33 |
corvus | dmsimard: in what way? | 16:33 |
dmsimard | corvus: instinct, but I'm trying to get data to back that up right now -- the exceptions started happening all of a sudden and at first glance it might correlate with the spike in memory usage http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 16:35 |
fungi | or the memory utilization increase and those exceptions could both be symptoms of a common problem. correlation != causation | 16:37 |
Shrews | that, btw, was one of the bugs that made me want to restart the launchers today, but i'm holding off on that to limit changes | 16:37 |
dmsimard | It seems we started getting exceptions after a "kazoo.exceptions.NoNodeError" | 16:38 |
dmsimard | Did we lose zookeeper at some point ? | 16:38 |
dmsimard | paste of what seems to be the first occurrence (with extra lines for context) http://paste.openstack.org/raw/663600/ | 16:39 |
pabelanger | dmsimard: should be able to grep kazoo.client in logs to see | 16:39 |
fungi | how long should i be giving the executor daemon to stop on ze01 before i need to start considering other solutions? it's _still_ starting new git operations and we're coming up on an hour since i issued the service stop | 16:39 |
pabelanger | but zookeeper (nodepool.o.o) is still running last I checked this morning | 16:39 |
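The grep pabelanger suggests, plus ZooKeeper's own four-letter health check, would look something like this (log path as used on the scheduler; assumes port 2181 is reachable from where you run it):

```shell
# Did the scheduler lose its ZooKeeper session, and is ZooKeeper itself healthy?
grep 'kazoo.client' /var/log/zuul/zuul.log | grep -i 'connection dropped' | tail -n 5
echo ruok | nc nodepool.openstack.org 2181   # "imok" means the server is fine
```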
corvus | fungi: if all the other jobs have stopped, you can probably note the current repo it's working on, kill it, then delete that repo. | 16:40 |
fungi | it's lots of different repos | 16:40 |
fungi | not sure how to tell whether the jobs have stopped on it | 16:40 |
fungi | i'm just looking at ps | 16:40 |
corvus | hrm, should only be like one at a time | 16:41 |
Shrews | corvus: theory... one of the launchers was restarted by ianw this morning, and has the pro-active delete solution. If the scheduler is so busy that it doesn't get around to locking its node set immediate, we could be deleting nodes out from under it. | 16:41 |
Shrews | s/immediate/immediately | 16:42 |
dmsimard | pabelanger: /var/log/zookeeper/zookeeper.log dates back to october 2 2017 :/ | 16:42 |
corvus | Shrews, dmsimard: there was a zk disconnection event at 10:15:02 | 16:42 |
*** priteau has quit IRC | 16:42 | |
*** dsariel has quit IRC | 16:42 | |
pabelanger | fungi: I haven't killed a zuul-executor process myself, if we are not in a rush, maybe just let it shut itself down? | 16:42 |
corvus | fungi: are you killing git processes as they pop up? | 16:42 |
fungi | yes | 16:42 |
fungi | i've manually killed hundreds (thousands?) of git operations for countless repos on ze01 over the past hour. right now in the process list i see it performing upload-pack operations for cinder and bifrost | 16:42 |
corvus | fungi: so if you kill the executor now, just delete the bifrost and cinder repos | 16:43 |
corvus | from /var/lib/zuul/executor-git | 16:43 |
fungi | got it | 16:43 |
*** hamzy has quit IRC | 16:43 | |
fungi | okay, all that's left is a few dozen ssh-agent processes | 16:45 |
fungi | deleted /var/lib/zuul/executor-git/git.openstack.org/openstack/bifrost and /var/lib/zuul/executor-git/git.openstack.org/openstack/cinder | 16:45 |
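Generalized, the cleanup on ze01 boils down to the following; the pkill pattern is approximate and the repo paths are the two that were in flight when the daemon was killed:

```shell
# After killing the executor mid-update, remove the repos whose git operations
# were interrupted so they are re-cloned cleanly on the next start.
pkill -f 'git-upload-pack|git fetch' || true
rm -rf /var/lib/zuul/executor-git/git.openstack.org/openstack/bifrost
rm -rf /var/lib/zuul/executor-git/git.openstack.org/openstack/cinder
```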
fungi | anything else i need to do cleanup-wise before rebooting? | 16:46 |
corvus | fungi: not that i can think of | 16:46 |
fungi | going to issue a poweroff on the instance and then hard reboot it from the nova api | 16:46 |
dmsimard | corvus: how did you find that disconnection event ? client side or server side ? not sure how to check on nodepool.o.o, I was trying to find a correlation but there's nothing in syslog/dmesg and /var/log/zookeeper is useless | 16:46 |
dmsimard | corvus: your timestamp also doesn't match that kazoo nonode exception | 16:46 |
corvus | dmsimard: 2018-02-06 10:15:02,480 WARNING kazoo.client: Connection dropped: socket connection broken | 16:47 |
corvus | scheduler log | 16:47 |
dmsimard | zk is using a bunch of cpu but doing a strace on the process shows no activity.. just FUTEX_WAIT | 16:48 |
dmsimard | not sure if it's doing anything | 16:49 |
*** sree has joined #openstack-infra | 16:49 | |
*** caphrim007 has joined #openstack-infra | 16:49 | |
clarkb | I'm still trying to catch up on the zuul stuff. Sounds like we've decided that going ipv4 only won't actually fix us? do we think the reboot fungi is doing will address networking? | 16:49 |
mordred | clarkb: maybe | 16:50 |
mordred | clarkb: I think we're in an 'it's worth a shot' mode on that one | 16:50 |
mordred | clarkb: but I do not believe we actually understand the underlying issue | 16:50 |
dmsimard | clarkb: there's different issues going on I think, I'll start a pad.. let's move this to incident ? | 16:50 |
corvus | Shrews, dmsimard: i think i understand the zk errors. there is a bug when a node request is fulfilled before a zk disconnection but the scheduler processes the event after the disconnection. | 16:51 |
fungi | clarkb: reboot fungi has _done_ now, but it's as much an experiment as anything. i've definitely seen wonky packet delivery issues in rackspace go away after hard rebooting an instance, though in my opinion this looks more like they have some misbehaving switch gear | 16:52 |
*** thingee has quit IRC | 16:52 | |
fungi | and the hard reboot of ze01 indeed doesn't seem to have solved this | 16:53 |
fungi | i'll delete its ipv6 address for the moment | 16:53 |
pabelanger | kk | 16:53 |
*** iyamahat has quit IRC | 16:54 | |
corvus | Shrews: it's pretty close to the case in 94e95886e2179f4a6aeecad687509bc7b1ab7fd3 i wonder if we just need to extend our testing of that a bit | 16:54 |
*** salv-orlando has quit IRC | 16:54 | |
fungi | heh, of course that killed my ssh connection to it ;) | 16:55 |
dmsimard | I have to step away ~10 minutes | 16:55 |
corvus | Shrews: oh, i see -- that test is only a narrow unit test... we don't have a test of the scheduler's response to that situation. we need a full scale functional test. | 16:55 |
*** gfidente has quit IRC | 16:55 | |
*** salv-orlando has joined #openstack-infra | 16:55 | |
dmsimard | infra-root: started a pad to follow the different issues we're tracking https://etherpad.openstack.org/p/HRUjBTyabM | 16:55 |
*** dsariel has joined #openstack-infra | 16:57 | |
*** eumel8 has joined #openstack-infra | 16:59 | |
*** salv-orlando has quit IRC | 16:59 | |
*** gfidente has joined #openstack-infra | 17:00 | |
*** gfidente has quit IRC | 17:00 | |
*** gfidente has joined #openstack-infra | 17:00 | |
*** ramishra has quit IRC | 17:01 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 17:02 |
*** gongysh has quit IRC | 17:02 | |
*** bobh has quit IRC | 17:02 | |
*** pbourke has quit IRC | 17:02 | |
Shrews | corvus: i think ^^ will solve the situation i theorized earlier | 17:02 |
*** gyee has joined #openstack-infra | 17:02 | |
corvus | Shrews: oh yes! that should definitely be a thing! :) | 17:03 |
*** pbourke has joined #openstack-infra | 17:04 | |
*** salv-orlando has joined #openstack-infra | 17:04 | |
Shrews | now we definitely need a restart since nl03 is running without that :( | 17:04 |
*** slaweq has quit IRC | 17:05 | |
*** slaweq has joined #openstack-infra | 17:05 | |
*** derekjhyang has quit IRC | 17:05 | |
clarkb | corvus: and for the zk connection issue in the scheduler we can't recover from that without a restart? | 17:06 |
corvus | clarkb: correct | 17:06 |
*** gongysh has joined #openstack-infra | 17:06 | |
corvus | it looks a lot like the executor queue is running down to the point where that's going to block us pretty soon. | 17:07 |
*** zoli is now known as zoli|gone | 17:07 | |
*** zoli|gone is now known as zoli | 17:07 | |
corvus | so we should plan on doing a scheduler restart in the next 30m or so | 17:07 |
corvus | we know how to deal with the ipv6 issue. i think the main unanswered question is about the ipv4 connection problems. | 17:07 |
pabelanger | when we switch to a zookeeper cluster, we'll have 3 connections to try and use from the scheduler, is that right? | 17:07 |
*** rossella_s has quit IRC | 17:08 | |
*** gongysh has quit IRC | 17:08 | |
corvus | pabelanger: yes it may be more robust in some cases, but if the problem is closer to the scheduler, then it can still go wrong. | 17:08 |
corvus | does anyone have any ideas how to further diagnose the ipv4 issues? | 17:08 |
dmsimard | infra-root: AJaeger suggested we send out an update on the situation, does this sound okay ? #status notice We are actively investigating different issues impacting the execution of Zuul jobs but have no ETA for the time being. Please stand by for updates or follow #openstack-infra while we work this out. Thanks. | 17:09 |
*** slaweq has quit IRC | 17:10 | |
pabelanger | corvus: understood | 17:10 |
dmsimard | corvus: I need to catch up on what we have found out so far about ipv4 and I'll try to look | 17:10 |
corvus | i'm not a fan of zero-content notices. would prefer a single alert. but i think we should send the next thing, whatever it is, after we restart. | 17:10 |
dmsimard | corvus: we did not send an alert previously, only a notice | 17:11 |
dmsimard | corvus: I'm also not a fan of zero content :( | 17:11 |
mordred | corvus: I am stumped on further diagnosing the ipv4 issue | 17:11 |
*** fverboso has quit IRC | 17:11 | |
pabelanger | fungi: are there other executors we need to reboot? (looking for something to do) | 17:11 |
clarkb | the underlying issue is networking conenctivity | 17:11 |
corvus | dmsimard: yeah, so let's either send an "alert:still not working" or "notice:maybe things are better" in a little bit. | 17:11 |
clarkb | I would say something along those lines | 17:11 |
fungi | pabelanger the reboot seems not to have helped | 17:11 |
clarkb | zk connection broke beacuse networking (likely) | 17:12 |
pabelanger | fungi: okay | 17:12 |
clarkb | ze can't talk to review.o.o and test nodes becaues networking (likely) | 17:12 |
dmsimard | I'll take a few minutes to dig into the connectivity issues now. | 17:12 |
clarkb | I need to grab caffeine before I get too comfortable in this chair | 17:12 |
clarkb | work fuel! | 17:13 |
corvus | i think the best thing we can do at the moment is: restart scheduler, remove the ipv6 addresses from all executors and mergers, then monitor for ipv4 ssh errors and hope enough jobs get through to make headway. | 17:13 |
dmsimard | I'm not even able to connect by SSH to ze01 or ze08 over ipv6, I had to force-resolve it to ipv4 | 17:13 |
corvus | also, file a trouble ticket. | 17:13 |
pabelanger | fungi: Hmm, we do seem to be accepting new jobs on ze01.o.o now. Is that related to removal of ipv6? | 17:14 |
mordred | corvus: I agree with your suggested plan of action | 17:15 |
pabelanger | +1 | 17:15 |
corvus | oh, all the executors just changed behavior | 17:15 |
fungi | pabelanger: it should be, yes | 17:15 |
fungi | corvus: all the executors or just the problem ones? i deleted the ipv6 global scope addresses from all executors which were unable to connect to review.openstack.org over v6 | 17:16 |
corvus | that may suggest that the ipv6 issue is resolved | 17:16 |
corvus | fungi: oh! i missed that | 17:16 |
corvus | heh | 17:16 |
dmsimard | looks like the builds just started picking up on grafana yeah | 17:16 |
corvus | fungi: i thought you only did ze01 | 17:16 |
*** olaph1 has joined #openstack-infra | 17:17 | |
corvus | anyway, we still need to restart scheduler, shall i do that now? | 17:17 |
*** iyamahat has joined #openstack-infra | 17:17 | |
*** olaph has quit IRC | 17:17 | |
*** jpena is now known as jpena|off | 17:18 | |
fungi | yeah, and you had previously done ze03, i ended up doing 06-09 based on discussion with "unknown purple person" in the etherpad, but i should have mentioned it in here (i probably didn't) | 17:18 |
fungi | now seems like a good time for the scheduler restart, agreed | 17:18 |
dmsimard | corvus: assuming we dump queues first ? there's also a good deal of post and periodic jobs still queued | 17:18 |
corvus | dmsimard: we don't usually save post or periodic | 17:19 |
fungi | we've generally considered post and periodic "best effort" | 17:19 |
dmsimard | I don't suppose losing periodic is that big of a deal but post might, I don't know | 17:19 |
fungi | most post jobs are idempotent and will be rerun on the next commit to merge to whatever repo | 17:19 |
pabelanger | ++ | 17:19 |
dmsimard | fair | 17:19 |
*** rwsu has quit IRC | 17:19 | |
*** rossella_s has joined #openstack-infra | 17:19 | |
*** rwsu has joined #openstack-infra | 17:20 | |
fungi | and for the ones which actually present a problem, we can reenqueue the branch tip into post for them later | 17:20 |
corvus | okay, restarting scheduler now | 17:20 |
fungi | if they don't have anything new to approve | 17:20 |
smcginnis | Are issues resolved with removing ipv6 and this pending scheduler restart? | 17:22 |
*** r-daneel has quit IRC | 17:22 | |
* smcginnis is trying to discern current status | 17:22 | |
*** slaweq has joined #openstack-infra | 17:22 | |
fungi | if things seem back on track after the scheduler restart, we could consider stopping the executor on ze01 again, rebooting it again, and then using it as the test system for a ticket with rackspace | 17:22 |
*** r-daneel has joined #openstack-infra | 17:22 | |
fungi | smcginnis: not resolved, no. possibly worked around, but we're also dealing with recovery from a zookeeper disconnection | 17:23 |
smcginnis | fungi: ack, thanks | 17:23 |
fungi | and executors failing to connect to far more job nodes than usual | 17:23 |
dmsimard | 515 *changes* in check queue ? really ? | 17:26 |
*** slaweq has quit IRC | 17:26 | |
corvus | re-enqueuing | 17:26 |
fungi | dmsimard: some had been in check for 9 hours when i looked earlier, so quite the backup | 17:27 |
dmsimard | I think OVH is not working right now. Time to ready for their nodes is flat http://grafana.openstack.org/dashboard/db/nodepool and seeing a lot of errors on nl04 -- example: http://paste.openstack.org/raw/663687/ | 17:30 |
dmsimard | Looking at this. | 17:30 |
pabelanger | yes, looks like they are having message queue issues | 17:31 |
pabelanger | http://status.ovh.net/?do=details&id=14073&PHPSESSID=e77302e17361afc822024180e2449e1f | 17:31 |
pabelanger | they do a good job keeping status.ovh.net updated | 17:31 |
dmsimard | can we temporarily set OVH to 0 so we don't needlessly hammer them ? | 17:32 |
pabelanger | no, should be fine | 17:32 |
dmsimard | pabelanger: that notice is from september 2016 ? | 17:32 |
*** bobh has joined #openstack-infra | 17:32 | |
pabelanger | oh, it is... odd | 17:33 |
*** sree has quit IRC | 17:33 | |
pabelanger | dmsimard: if you click on cloud box, at top, you can see open tickets | 17:33 |
dmsimard | nl04 is completely nuts at creating/deleting VMs non stop in the two OVH regions | 17:33 |
*** jlabarre has quit IRC | 17:34 | |
*** sree has joined #openstack-infra | 17:34 | |
pabelanger | dmsimard: yah, eventually it will start working again. | 17:34 |
dmsimard | we're not very nice :/ | 17:34 |
pabelanger | we don't usually turn down clouds until the provider asks. They all know how to contact us | 17:35 |
fungi | clarkb: i've replied to daniel at citycloud now, after confirming image uploads are stopped and node counts are at 0 in all their regions | 17:35 |
pabelanger | that way, we don't need to be constantly updating max-servers | 17:36 |
clarkb | fungi: I see the email, thanks! | 17:36 |
clarkb | looks like it was the removal of ipv6 that ended up "fixing" things? | 17:37 |
*** jlabarre has joined #openstack-infra | 17:37 | |
dmsimard | pabelanger: maybe we could have a soft disable or something ? a bit like we can enable/disable keep without restarting zuul executor | 17:37 |
mordred | clarkb: "yes" | 17:37 |
dmsimard | Did we remove ipv6 on all executors ? or just one ? Everything seemed to start working again all of a sudden | 17:37 |
*** dsariel has quit IRC | 17:37 | |
clarkb | dmsimard: pabelanger we can increase the time between api requests | 17:37 |
dmsimard | I'm not sure ipv6 has anything to do with it | 17:37 |
mordred | dmsimard: fungi took care of all the ones with broken ipv6 | 17:37 |
dmsimard | mordred: the non-broken ones were not really accepting any builds | 17:38 |
mordred | dmsimard: (disabled ipv6 on them that is) | 17:38 |
dmsimard | mordred: http://grafana.openstack.org/dashboard/db/zuul-status | 17:38 |
pabelanger | dmsimard: why? I'd rather not have to toggle nodepool-launcher each time we have an API error. Would mean spending more time monitoring it | 17:38 |
*** jbadiapa has joined #openstack-infra | 17:38 | |
*** gfidente is now known as gfidente|afk | 17:38 | |
*** sree has quit IRC | 17:38 | |
*** bobh has quit IRC | 17:38 | |
mordred | dmsimard: that's not surprising though - the executors will only accept builds up to a point - and with the others out of commission, the running executors were well past their max | 17:38 |
dmsimard | pabelanger: what do you mean why ? the cloud is broken and we're creating/deleting hundreds of VMs a minute, it's not very helpful to the nodepool sponsors imo | 17:38 |
pabelanger | clarkb: sure, that too | 17:38 |
fungi | clarkb: for very disappointing definitions of "fixing" anyway | 17:39 |
fungi | dmsimard: see your etherpad for details? | 17:39 |
dmsimard | fungi: details for what, sorry ? a bit multithreaded | 17:40 |
dmsimard | fungi: oh, you mean where we removed v6 | 17:40 |
fungi | dmsimard: you asked whether we removed ipv6 addresses from all executors | 17:40 |
fungi | details on which are in the pad | 17:40 |
pabelanger | dmsimard: sure, providers often don't notify us of outages either. I like the fact that nodepool just keeps doing its thing until the cloud comes back online | 17:40 |
dmsimard | fungi: but what I'm saying is that it didn't seem like the "non-ipv6 broken" executors were really doing anything, at least according to the graphs | 17:41 |
dmsimard | fungi: hmm, I was relying on the available executors graph and none of them were available because they were loaded.. nevermind that | 17:41 |
fungi | dmsimard: depended on the graphs. they were bogged down enough to stop accepting many new jobs | 17:41 |
*** bobh has joined #openstack-infra | 17:42 | |
pabelanger | dmsimard: OVH is pretty responsive to emails, we can reach out to them and see if they are aware of the issue. Would prefer we did that before shutting down nodepool-launcher. | 17:42 |
dmsimard | pabelanger: can we ? both bhs1 and gra1 are currently not working | 17:43 |
fungi | so http://status.ovh.net/?do=details&id=14073&PHPSESSID=e77302e17361afc822024180e2449e1f doesn't indicate they're aware? | 17:43 |
*** mtreinish has quit IRC | 17:43 | |
pabelanger | dmsimard: yes, I suggest we then email OVH about the issue we are seeing. I'm unsure if you have their contact, but I can forward it to you if you want to reach out. | 17:44 |
pabelanger | however, http://grafana.openstack.org/dashboard/db/nodepool-ovh seems to show nodes coming online now | 17:45 |
pabelanger | at least for BHS1 | 17:45 |
dmsimard | http://paste.openstack.org/raw/663704/ is an excerpt of the errors we're seeing for bhs1/gra1 | 17:45 |
*** jbadiapa has quit IRC | 17:45 | |
openstackgerrit | Pino de Candia proposed openstack-infra/project-config master: Add new project for Tatu (SSH as a Service) Horizon Plugin. https://review.openstack.org/537653 | 17:46 |
*** apetrich has quit IRC | 17:46 | |
pabelanger | dmsimard: yah, I usually just email them an example error message and few UUIDs of nodes. They offer to help debug | 17:46 |
pabelanger | I'll forward you the last emails I sent | 17:46 |
*** apetrich has joined #openstack-infra | 17:46 | |
dmsimard | ok I'll reach out | 17:46 |
dmsimard | Going to log a few things for our roots from other timezones | 17:47 |
fungi | oh, yeah, i'm seeing freshness issues with their status tracking i guess | 17:47 |
*** bobh has quit IRC | 17:47 | |
dmsimard | #status log (dmsimard) CityCloud asked us to disable nodepool usage with them until July: https://review.openstack.org/#/c/541307/ | 17:49 |
corvus | fyi: ze01 will behave differently than the others because it will accept jobs at a slightly higher rate | 17:49 |
openstackstatus | dmsimard: finished logging | 17:49 |
pabelanger | dmsimard: email forwarded | 17:49 |
fungi | corvus: out of curiosity, why is ze01 "special" like that? | 17:50 |
corvus | fungi: a patch landed and it rebooted | 17:50 |
Shrews | it goes to 11 | 17:50 |
fungi | ahh, got it | 17:50 |
fungi | yes, this one goes to 11 | 17:50 |
*** mtreinish has joined #openstack-infra | 17:50 | |
clarkb | corvus: this is the 4 vs 1 patch from yesterday? | 17:50 |
corvus | yep | 17:50 |
fungi | i think in earlier troubleshooting jhesketh may also have restarted an executor? before i woke up | 17:50 |
*** dsariel has joined #openstack-infra | 17:50 | |
fungi | seeing if i can find it in scrollback | 17:50 |
clarkb | fungi: I think he said ianw did the restart? and it was 01 too? | 17:51 |
fungi | not rebooted, just restarted the daemon | 17:51 |
fungi | oh, perfect | 17:51 |
* fungi cancels scrollback spelunking | 17:51 | |
corvus | ah, well that would have been masked by ipv6 issues till now | 17:51 |
fungi | yeah | 17:51 |
dmsimard | #status log (dmsimard) Different Zuul issues relative to ipv4/ipv6 connectivity, some executors have had their ipv6 removed: https://etherpad.openstack.org/p/HRUjBTyabM | 17:51 |
openstackstatus | dmsimard: finished logging | 17:51 |
*** olaph1 is now known as olaph | 17:52 | |
clarkb | do we have a ticket in for the ipv6 connectivity issue? | 17:53 |
dmsimard | #status log (dmsimard) zuul-scheduler issues with zookeeper ( kazoo.exceptions.NoNodeError / Exception: Node is not locked / kazoo.client: Connection dropped: socket connection broken ): https://etherpad.openstack.org/p/HRUjBTyabM | 17:53 |
openstackstatus | dmsimard: finished logging | 17:53 |
*** tbachman has joined #openstack-infra | 17:53 | |
dmsimard | #status log (dmsimard) High nodepool failure rates (500 errors) against OVH BHS1 and GRA1: http://paste.openstack.org/raw/663704/ | 17:53 |
openstackstatus | dmsimard: finished logging | 17:53 |
*** panda is now known as panda|off | 17:54 | |
fungi | clarkb: not yet because we need a system we can point to for troubleshooting. i'm proposing we stop the daemon on one of the "problem" executors and then reboot it with its executor service temporarily disabled so it just comes up doing nothing but with its v6 networking correctly configured | 17:54 |
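The plan for the ticket host, sketched out; this assumes a systemd unit named zuul-executor and would need adjusting if the service is still managed by a sysvinit script:

```shell
# Leave ze09 idle, but with its IPv6 configuration intact, for the provider to debug.
sudo systemctl stop zuul-executor
sudo systemctl disable zuul-executor   # don't start again on boot
sudo reboot
```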
*** bobh has joined #openstack-infra | 17:54 | |
clarkb | fungi: ++ | 17:54 |
fungi | we could also consider launching some more (replacement) executors | 17:55 |
*** iyamahat has quit IRC | 17:55 | |
*** iyamahat has joined #openstack-infra | 17:56 | |
dmsimard | pabelanger: ack, I'm sending an email to them nwo. | 17:56 |
dmsimard | now* | 17:56 |
*** bobh has quit IRC | 17:59 | |
*** yamamoto has quit IRC | 17:59 | |
*** yamamoto has joined #openstack-infra | 18:00 | |
corvus | i have a test-case reproduction of the scheduler/zk lock bug | 18:00 |
*** tesseract has quit IRC | 18:00 | |
*** jamesmca_ has quit IRC | 18:00 | |
*** jamesmcarthur has quit IRC | 18:01 | |
*** jamesmcarthur has joined #openstack-infra | 18:01 | |
*** sambetts is now known as sambetts|afk | 18:01 | |
*** derekh has quit IRC | 18:02 | |
pabelanger | great! | 18:02 |
openstackgerrit | Merged openstack-infra/zuul master: Enhance github debugging script for apps https://review.openstack.org/540774 | 18:03 |
dmsimard | sent an email to ovh, have to step away again. | 18:03 |
*** jamesmcarthur has quit IRC | 18:05 | |
*** xarses_ has quit IRC | 18:06 | |
pabelanger | dmsimard: thanks | 18:06 |
*** xarses has joined #openstack-infra | 18:06 | |
fungi | wow, monasca-thresh is removing mysql-connector from their java archive (due to license concerns) and relying on the drizzle jdbc? not sure whether that's amusing to former drizzlites: http://lists.openstack.org/pipermail/openstack-dev/2018-February/127027.html | 18:06 |
Shrews | O.o | 18:06 |
*** salv-orlando has quit IRC | 18:07 | |
Shrews | Drizzle says "We're not dead yet!" | 18:07 |
Shrews | "I'm feeling much better" | 18:07 |
mordred | hehe | 18:07 |
*** salv-orlando has joined #openstack-infra | 18:07 | |
mordred | that's truly awesome | 18:08 |
*** bobh has joined #openstack-infra | 18:10 | |
clarkb | fungi: should we just go ahead and pick a server and turn off the executor now? I guess when we have a backlog of jobs that isn't the happiest thing to do (but neither is remaining ipv4 only) | 18:10 |
fungi | i was giving things a few more minutes to see what other fallout we might spot | 18:11 |
fungi | and hoping to garner additional input on that plan | 18:11 |
*** salv-orlando has quit IRC | 18:12 | |
corvus | we may want to restart the other executors to pick up the 1->4 slow start change, it will let them be utilized better. | 18:12 |
*** apetrich has quit IRC | 18:13 | |
*** slaweq has joined #openstack-infra | 18:13 | |
corvus | maybe if we do that, we'll have a surplus compared to our backlog, and can turn one down with less effect. at the moment, we have more work than the executors can handle. | 18:13 |
pabelanger | wfm | 18:13 |
*** hamzy has joined #openstack-infra | 18:13 | |
*** apetrich has joined #openstack-infra | 18:13 | |
fungi | i like that idea | 18:13 |
corvus | stopping the others should be faster than ze01, but still, obviously, not super quick. | 18:13 |
corvus | (i'm still fixing the scheduler bug, so i'll leave that to others) | 18:14 |
fungi | do we have volunteers for a rolling restart of, say, ze02-ze08 and ze10? and then we can stop ze09 and reboot it with the executor disabled? | 18:14 |
pabelanger | I can help here in a moment, just getting more coffee | 18:14 |
*** hamzy_ has joined #openstack-infra | 18:15 | |
fungi | ze01, as mentioned, has already been restarted on the new code, and the instances impacted by the ipv6 situation seem to be 01, 03 and 06-09 | 18:15 |
*** myoung is now known as myoung|food | 18:15 | |
*** bobh has quit IRC | 18:16 | |
fungi | so the idea would be to restart all executors except 01 (which already got that treatment) and only stop one of the v6-problem-affected ones (like 09) without restarting the service on it again | 18:16 |
*** sshnaidm|rover is now known as sshnaidm|bbl | 18:16 | |
fungi | also be aware you probably want ssh -4 if you have working ipv6 connectivity | 18:17 |
*** slaweq has quit IRC | 18:17 | |
fungi | i need to eat something before the meeting | 18:17 |
*** hamzy has quit IRC | 18:17 | |
fungi | but can help with some of it | 18:17 |
*** mihalis68 has quit IRC | 18:20 | |
*** david-lyle has quit IRC | 18:20 | |
corvus | clarkb: launch delayed to 19:20, so we may need to have an agenda item for it | 18:21 |
*** jamesmcarthur has joined #openstack-infra | 18:21 | |
clarkb | I think I just saw they pushed it all the way to 3:05 EST which is 20:05 UTC? | 18:22 |
clarkb | weather needs to cooperate | 18:22 |
*** dtantsur|bbl is now known as dtantsur | 18:22 | |
pabelanger | fungi: okay, ready to get started on zuul-executors | 18:25 |
pabelanger | I'll do ze02.o.o first | 18:25 |
*** tosky has quit IRC | 18:26 | |
pabelanger | clarkb: I saw that too | 18:26 |
clarkb | not to completely distract from the zuul stuff but I think that sometime yesterday our bandersnatch may have gotten out of sync | 18:26 |
clarkb | sushy 1.3.1 never made it onto our mirror despite having mirror updates | 18:26 |
dansmith | I'm guessing it's known that zuul is down, but didn't see a status bot topic change so wanted to confirm? | 18:28 |
clarkb | dansmith: it should be up now | 18:28 |
clarkb | there are networking issues we've worked around by disabling ipv6 that were preventing things from functioning earlier | 18:29 |
* tbachman isn’t seeing zuul | 18:29 | |
pabelanger | where? | 18:29 |
dansmith | I haven't been able to hit it for like 30 minutes | 18:29 |
dansmith | just times out for me | 18:29 |
dhellmann | http://zuul.openstack.org is extremely slow or not responding | 18:29 |
AJaeger | clarkb: http://zuul.openstack.org/ is not reachable right now | 18:29 |
*** ralonsoh has quit IRC | 18:29 | |
tbachman | dansmith: same here | 18:29 |
pabelanger | looking | 18:30 |
pabelanger | I'm having issues SSHing into zuul.o.o | 18:30 |
*** bobh has joined #openstack-infra | 18:30 | |
pabelanger | checking cacti | 18:30 |
clarkb | pabelanger: I can ssh into it ok | 18:31 |
clarkb | but ya http is sad | 18:31 |
clarkb | zuul-web is running | 18:31 |
pabelanger | okay, SSH now works | 18:31 |
clarkb | it is listening on port 9000 and apache is up and configured to proxy to it | 18:32 |
pabelanger | clarkb: I see some exceptions in web.log, trying to see if related | 18:34 |
clarkb | hrm why does web-debug.log have less info in it than web.log? | 18:34 |
fungi | wonder if the restart picked up a new regression | 18:35 |
pabelanger | ze02 back online | 18:35 |
pabelanger | moving to ze03.o.o | 18:35 |
*** harlowja has joined #openstack-infra | 18:35 | |
clarkb | looks like web.log ends with it starting | 18:35 |
clarkb | web-debug.log doesn't show anything sad since it started | 18:35 |
clarkb | if I curl localhost:9000 I get html back | 18:38 |
clarkb | so the server is up and responding to requests | 18:38 |
clarkb | now to see if I can get a status.json from it | 18:38 |
fungi | one other possibility is the sheer hugeness of the status.json may simply be too much | 18:39 |
clarkb | AH00898: Error reading from remote server returned by /status.json | 18:39 |
pabelanger | maybe | 18:39 |
clarkb | ya I think apache is unable to get the status.json back | 18:39 |
pabelanger | we do have 587 jobs in check | 18:40 |
dhellmann | it sounds like that api needs a pagination feature :-) | 18:40 |
dmsimard | I was stepping in momentarily before going away again, but the status.json backups might help (or not, and make things worse, though iirc I set them up with flock/curl timeouts) | 18:40 |
Shrews | pabelanger: back from eating. can help you out | 18:41 |
pabelanger | Shrews: I'm currently doing ze03 / ze04, if you want to get others. I think we want to keep ze09 as is | 18:41 |
clarkb | `curl --verbose localhost:9000/openstack/status.json` works and shows a status of 200 | 18:41 |
Shrews | pabelanger: ack. on ze04 | 18:41 |
Shrews | pabelanger: err, ze05 | 18:41 |
clarkb | data is small enough to fit on my scrollback buffer :P | 18:42 |
Shrews | pabelanger: just zuul-executor restart, yeah? | 18:42 |
fungi | pabelanger: Shrews: stopping ze09 would be great, but that's the one we want to disable the service on and then reboot so it comes back up with ipv6 addresses | 18:42 |
Shrews | fungi: ze09 is all yours :) | 18:42 |
clarkb | its only 2.3MB | 18:42 |
clarkb | "only" | 18:42 |
*** dougwig has joined #openstack-infra | 18:43 | |
fungi | Shrews: pabelanger: i'll refrain from stopping ze09 until the others have been cleanly restarted so we don't have too many out of rotation at a time | 18:43 |
*** gmann has quit IRC | 18:43 | |
Shrews | fungi: to be clear, we're just restarting the zuul-executor process, right? | 18:43 |
fungi | yes | 18:43 |
fungi | which, as i understand it, can't currently be done as a typical service restart because the stop takes so long to complete | 18:44 |
clarkb | working my way back up the stack, those status.json errors were actually from 40 minutes or so ago | 18:44 |
fungi | Shrews: so is a stop and then watch the process list until it can be safely started again | 18:44 |
fungi | (or tail logs, or whatever) | 18:44 |
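The per-executor restart being coordinated here reduces to roughly this loop (a sketch; in practice the operators watched ps and the logs by hand):

```shell
# Stop, wait for the old executor and its git/ansible children to drain, restart.
sudo systemctl stop zuul-executor
while pgrep -f zuul-executor > /dev/null; do sleep 30; done
sudo systemctl start zuul-executor
```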
clarkb | I do not see requests from my desktop hitting apache | 18:45 |
clarkb | possibly this is related to the networking issues we've had? | 18:45 |
clarkb | I'm ipv4 only here | 18:45 |
dhellmann | I get "connection reset by peer" http://paste.openstack.org/show/663792/ | 18:45 |
dhellmann | I'm also ipv4-only | 18:45 |
*** caphrim007_ has joined #openstack-infra | 18:46 | |
fungi | doesn't appear we're overloading its network: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64796&rra_id=all | 18:47 |
clarkb | if I make a new connection from my home I see it in a SYN_RECV state on the server | 18:47 |
clarkb | and it's been that way for quite a few seconds now | 18:48 |
fungi | in fact, that's probably status.json we see dying on the graph at 17:20 | 18:48 |
clarkb | then it eventually dies client side | 18:48 |
fungi | which does correspond with the restart | 18:48 |
fungi | scheduler and web daemons were restarted at 17:22 and 17:23 respectively | 18:49 |
* clarkb breaks out the tcpdump | 18:49 | |
*** caphrim007 has quit IRC | 18:50 | |
Shrews | still waiting for ze05. doing ze06 now too | 18:51 |
*** suhdood has joined #openstack-infra | 18:51 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/nodepool master: [WIP] webapp: add optional admin endpoint https://review.openstack.org/536319 | 18:54 |
clarkb | aha! [Tue Feb 06 18:55:38.214934 2018] [mpm_event:error] [pid 13325:tid 140299712509824] AH00485: scoreboard is full, not at MaxRequestWorkers -- I mean, I should've checked there earlier, but now we know | 18:55 |
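For context, AH00485 is mpm_event reporting that every scoreboard slot is occupied (often by connections stuck in a closing state) even though MaxRequestWorkers hasn't been reached, so new connections get turned away. One way to watch for that condition is the machine-readable mod_status output; a rough sketch, assuming mod_status is enabled and /server-status is reachable on localhost:

```python
# Sketch: poll Apache mod_status (?auto) and print worker/scoreboard usage.
# Assumes mod_status is enabled and /server-status is reachable locally.
import time
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # placeholder

while True:
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        fields = dict(
            line.split(":", 1)
            for line in resp.read().decode().splitlines()
            if ":" in line
        )
    busy = fields.get("BusyWorkers", "?").strip()
    idle = fields.get("IdleWorkers", "?").strip()
    board = fields.get("Scoreboard", "").strip()
    # '.' entries are open slots; anything else is a worker in some state.
    print("busy=%s idle=%s open_slots=%d" % (busy, idle, board.count(".")))
    time.sleep(30)
```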
*** bobh has quit IRC | 18:55 | |
fungi | so apache restart time? | 18:56 |
clarkb | so tcp connections are getting to apache but apache is saying go away | 18:56 |
clarkb | ya | 18:56 |
clarkb | should I go ahead and do that? | 18:56 |
* clarkb gives everyone a minute before doing that | 18:56 | |
fungi | possible the inrush of reloads from people's browsers after the restart coupled with the giant status.json caused it, in which case we may see it come right back | 18:56 |
fungi | but worth a try | 18:56 |
Shrews | pabelanger: fungi: ze06 restarted. ze05 still ongoing. moving to ze07 | 18:57 |
dmsimard | status.json is rendered client-side right ? | 18:57 |
fungi | yes | 18:57 |
fungi | for now at least | 18:58 |
dmsimard | so it's probably getting downloaded just fine but then has to do all that javascript dancing | 18:58 |
clarkb | dmsimard: no it wasn't downloading anything | 18:58 |
tbachman | fwiw, I can hit it now | 18:58 |
fungi | dmsimard: well, no, apache is super unhappy at the moment | 18:58 |
clarkb | apache was closing incoming tcp connections because it had no workers to put them on | 18:58 |
dmsimard | hmm | 18:58 |
clarkb | status page is working now after apache restart | 18:58 |
fungi | cool, we should keep an eye on it | 18:58 |
clarkb | now we see if that lasts | 18:58 |
fungi | (i almost said "keep tabs on it" and then realized...) | 18:58 |
openstackgerrit | Ruby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches https://review.openstack.org/541027 | 18:59 |
dmsimard | fungi: that would've been appropriate | 18:59 |
dmsimard | are we doing an infra meeting ? | 18:59 |
fungi | in 45 seconds | 18:59 |
clarkb | and now it's meeting time and I'm not super prepared :P | 18:59 |
cmurphy | do we need to recheck? | 18:59 |
clarkb | cmurphy: you shouldn't need to, you can check now if your changes are in the queues though | 18:59 |
* fungi is entirely unprepared, but also not chairing | 18:59 | |
clarkb | (at least it loads for me currently) | 18:59 |
fungi | cmurphy: unless you had failures from earlier (retry_limit, post_failure) and those will need a recheck | 19:00 |
dmsimard | Because sometimes we can have good news too, it looks like OVH BHS1 has fully recovered and there's blips showing signs of recovery for GRA1 as well | 19:00 |
clarkb | I'll keep my tail of the error log open so I'll see if it happens again | 19:00 |
clarkb | and with that /me sorts out meeting | 19:00 |
cmurphy | nope it's in the queue | 19:00 |
*** tbachman has left #openstack-infra | 19:01 | |
*** lucasagomes is now known as lucas-afk | 19:02 | |
*** suhdood has quit IRC | 19:02 | |
*** myoung|food is now known as myoung | 19:03 | |
dmsimard | ze01 is effectively picking up more builds than the others | 19:04 |
dmsimard | 4.4GB swap :( | 19:04 |
Shrews | pabelanger: fungi: ze07 restarted. picking up ze08... still waiting for ze05 | 19:04 |
ianw | just catching up ... i know we've all moved on, but about 8 hours ago AJaeger noticed inap was offline; that led me to notice that nl03 had dropped out of zookeeper, so I restarted that. then jhesketh restarted ze01, as it was not picking up jobs | 19:05 |
ianw | and i found the red herring that graphite.o.o had a full root partition -- which i thought might have led to some of the graph dropouts (bad stat collection) but was actually unrelated in the end | 19:06 |
fungi | dmsimard: the others should catch up now that they're getting restarted on newer code | 19:06 |
pabelanger | Shrews: sorry, I lost network access. back online now | 19:06 |
pabelanger | looking at ze03 / ze04 now | 19:06 |
Shrews | pabelanger: summary is ze06 and ze07 are done. doing ze08 now. waiting on ze05 | 19:06 |
dmsimard | ianw: I summarized some of the things on the infra log https://wiki.openstack.org/wiki/Infrastructure_Status | 19:06 |
*** r-daneel_ has joined #openstack-infra | 19:08 | |
*** david-lyle has joined #openstack-infra | 19:08 | |
pabelanger | Shrews: ze03 online, waiting for ze04 | 19:08 |
*** yamamoto has quit IRC | 19:08 | |
*** david-lyle has quit IRC | 19:08 | |
*** dsariel has quit IRC | 19:08 | |
*** david-lyle has joined #openstack-infra | 19:09 | |
*** r-daneel has quit IRC | 19:09 | |
*** r-daneel_ is now known as r-daneel | 19:09 | |
clarkb | dmsimard: yes, because it is running more aggressive scheduling code than the others (or was, while we restart things) | 19:09 |
clarkb | dmsimard: it should stabilize hopefully | 19:09 |
ianw | dmsimard: interesting ... yeah i couldn't determine any reason why nl03 seemed to detach itself from zookeeper; as mentioned in the etherpad there wasn't any smoking gun logs | 19:09 |
*** HeOS has quit IRC | 19:10 | |
Shrews | pabelanger: ze08 restarted. ze05 is taking FOREVER | 19:10 |
pabelanger | ze04 is online, but hasn't accepted any jobs yet | 19:11 |
pabelanger | unsure why | 19:11 |
pabelanger | there we go | 19:11 |
SamYaple | can anyone review this bindep patch? its been sitting 2+ months. I dont know how to get traction on it, but really need the patch | 19:16 |
SamYaple | https://review.openstack.org/#/c/517105/ | 19:16 |
*** Goneri has quit IRC | 19:19 | |
*** vhosakot_ has joined #openstack-infra | 19:20 | |
*** vhosakot has quit IRC | 19:23 | |
*** jpich has quit IRC | 19:26 | |
Shrews | pabelanger: i'm not convinced that ze05 is actually trying to stop. is there a good indicator that it's trying to wind down? maybe number of active ansible-playbook processes or something? | 19:28 |
pabelanger | Shrews: what command did you use? | 19:28 |
clarkb | Shrews: pabelanger I watch ps -elf | grep zuul | wc -l | 19:28 |
clarkb | that number should generally trend down over time | 19:28 |
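A slightly more convenient version of the same check: a small watcher that polls the process table and prints the zuul/ansible process count, so the downward trend is easy to see. Just a sketch; it counts anything with 'zuul' or 'ansible-playbook' in its command line.

```python
# Sketch: poll the process table and print how many zuul/ansible-playbook
# processes remain, to watch an executor wind down after a stop.
import subprocess
import time

def count_processes():
    out = subprocess.check_output(["ps", "-eo", "args"],
                                  universal_newlines=True).splitlines()
    return sum(1 for line in out
               if "zuul" in line or "ansible-playbook" in line)

while True:
    print(time.strftime("%H:%M:%S"), count_processes())
    time.sleep(30)
```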
pabelanger | Shrews: should see socket accepting stopped in logs | 19:28 |
*** salv-orlando has joined #openstack-infra | 19:28 | |
pabelanger | 2018-02-06 18:35:54,142 DEBUG zuul.CommandSocket: Received b'stop' from socket for example | 19:29 |
fungi | though there will quite probably be tons of orphaned ssh-agent processes which never get cleaned up | 19:29 |
Shrews | pabelanger: i actually messed up on 05 and issued a restart before doing a stop | 19:29 |
pabelanger | that is on ze03 FWIW | 19:29 |
Shrews | which is why i'm not sure of its state | 19:29 |
pabelanger | ah | 19:30 |
* AJaeger spent so much time looking at grafana that he doesn't like to see N/A for TripleO anymore - https://review.openstack.org/541165 | 19:30 |
pabelanger | wonder if we should remove restart from init scripts, haven't had much luck with it | 19:30 |
Shrews | pabelanger: yes, we should :) | 19:30 |
pabelanger | Shrews: you might be able to issue stop again, if the command socket is still open | 19:30 |
Shrews | pabelanger: lemme try that | 19:30 |
*** salv-orlando has quit IRC | 19:30 | |
Shrews | the number of ansible processes seems to be dropping faster now | 19:32 |
*** slaweq has joined #openstack-infra | 19:33 | |
*** olaph1 has joined #openstack-infra | 19:36 | |
dmsimard | AJaeger: added a comment on https://review.openstack.org/#/c/541165/ | 19:37 |
AJaeger | dmsimard: which file? http://git.openstack.org/cgit/openstack-infra/project-config/tree/grafana | 19:38 |
*** slaweq has quit IRC | 19:38 | |
AJaeger | dmsimard: grafana.openstack.org does not delete removed files - so manual deletion is needed. I think the file you mean is removed from project-config already. If it's still in-tree, tell me and I'll update it | 19:38 |
dmsimard | AJaeger: oh, you're right the file is already removed | 19:39 |
*** olaph has quit IRC | 19:39 | |
AJaeger | dmsimard: if you want to manually remove dead files from grafana - great ;) There're a couple ;( | 19:39 |
dmsimard | AJaeger: if we can write down somewhere which ones are stale I can eventually look at it | 19:40 |
AJaeger | dmsimard: I'll make a list... | 19:40 |
*** baoli has joined #openstack-infra | 19:41 | |
ianw | ok, so re hwe kernel -- yeah afs doesn't build surprise surprise. it looks like 1.6.21.1 is the earliest that supports 4.13 ... but I'll make a custom OpenAFS 1.6.22 pkg i think | 19:41 |
*** baoli has quit IRC | 19:42 | |
*** tosky has joined #openstack-infra | 19:42 | |
*** jamesmca_ has joined #openstack-infra | 19:44 | |
dmsimard | T-60 minutes | 19:46 |
*** peterlisak has quit IRC | 19:46 | |
*** onovy has quit IRC | 19:47 | |
fungi | ianw: i wonder if the 1.8 beta packages in bionic build successfully on xenial ;) | 19:48 |
openstackgerrit | Andreas Jaeger proposed openstack-infra/project-config master: Remove unused grafana/nodepool files https://review.openstack.org/541421 | 19:49 |
AJaeger | dmsimard: two more unneeded ones ^ | 19:49 |
ianw | fungi: they probably would, at least the arm64 ones do on arm64 xenial ... but i only want to change one thing at a time as much as possible, so i think sticking with 1.6 branch for this test at this point? | 19:49 |
pabelanger | ianw: haven't been following, is fedora-27 ready to use now or still an issue with python? | 19:51 |
fungi | sure, i wasn't suggesting it's something we should run just now | 19:51 |
fungi | having newer 1.6 point release packages for all our systems would be good anyway, for previously-discussed reasons | 19:51 |
*** HeOS has joined #openstack-infra | 19:51 | |
AJaeger | dmsimard: http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards | 19:52 |
*** HeOS has quit IRC | 19:56 | |
dhellmann | do things seem stable now? is it safe to approve some releases? | 19:58 |
clarkb | dhellmann: I think they are stable in as much as we've worked around the ipv6 issues. The ipv6 problems haven't been corrected but shouldn't currently affect jobs | 19:59 |
dhellmann | clarkb : thanks | 19:59 |
AJaeger | dhellmann: just very heavy load - long backlog | 20:00 |
dhellmann | smcginnis : ^^ | 20:00 |
smcginnis | Just read that. | 20:00 |
clarkb | fungi: "kAFS supports IPv6 which OpenAFS does not; and it implements some of the AuriStorFS services so that IPv6 can be used not only with the Location service but with file services as well." | 20:00 |
smcginnis | So we can approve things, it just might take awhile. | 20:00 |
*** jtomasek has quit IRC | 20:00 | |
dhellmann | it looks like the aodh and panko releases were approved but not run. maybe remove your W+1 and re-apply it? | 20:01 |
corvus | ianw, fungi, clarkb: i think it was kerberos 5 support (which is needed for the sha256 stuff we use) which kafs was weak on last time i looked. perhaps that's changed -- but that's a thing to watch out for. | 20:01 |
smcginnis | dhellmann: Just about to do that. ;) | 20:01 |
ianw | auristor: ^^ i bet you'd know about that? | 20:02 |
*** owalsh has quit IRC | 20:02 | |
*** slaweq has joined #openstack-infra | 20:02 | |
clarkb | corvus: https://www.infradead.org/~dhowells/kafs/ looks like it supports kerberos 5 under AF_RXRPC | 20:02 |
*** owalsh has joined #openstack-infra | 20:02 | |
*** amoralej is now known as amoralej|off | 20:02 | |
*** dtantsur is now known as dtantsur|afk | 20:03 | |
*** onovy has joined #openstack-infra | 20:03 | |
*** caphrim007_ has quit IRC | 20:03 | |
clarkb | still no apache errors in the error log for zuul status so I'm closing my tail of that and making lunch | 20:03 |
smcginnis | dhellmann: Still not picking up my reapproval. Either things are really slow, or might need you to do it? | 20:03 |
*** caphrim007 has joined #openstack-infra | 20:03 | |
smcginnis | dhellmann: Oh, they're stacked up! | 20:04 |
dhellmann | sigh. we keep telling them not to do that. | 20:04 |
*** HeOS has joined #openstack-infra | 20:06 | |
dmsimard | corvus: I'm trying to see if there would be a way to highlight if a particular executor seems to be failing/timeouting jobs more than the others .. in graphite we have data for jobs but it doesn't link them back to a particular executor. I guess logstash allows me to query by job status and executor but we're much more aggressive in pruning that data out. Am I missing something ? | 20:06 |
*** caphrim007 has quit IRC | 20:07 | |
*** caphrim007 has joined #openstack-infra | 20:07 | |
*** rfolco|ruck is now known as rfolco|off | 20:07 | |
*** onovy has quit IRC | 20:07 | |
dmsimard | I understand that job failures/timeouts are not necessarily a result of bad behavior but instead legitimate job/patch issues but it might be worthwhile to see if there's a difference as we start to tweak some knobs here and there | 20:08 |
corvus | dmsimard: nope. right now you could grep logs. if you want to add stats counters to executors that'd be ok | 20:08 |
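dmsimard's actual patch shows up later in the log (https://review.openstack.org/541452); purely as an illustration of the idea, a sketch of per-executor counters using the python statsd client. The metric names here are made up, not zuul's real statsd keys.

```python
# Illustration only: per-executor statsd counters keyed by hostname, so
# graphite can show whether one executor fails/timeouts more than the rest.
# Metric names below are hypothetical, not zuul's actual statsd keys.
import socket

import statsd  # the 'statsd' client library

client = statsd.StatsClient("localhost", 8125, prefix="zuul.executor")
hostname = socket.gethostname().split(".")[0]

def record_build_result(result):
    # result would be something like SUCCESS, FAILURE, TIMED_OUT, ...
    client.incr("%s.builds.%s" % (hostname, result.lower()))

record_build_result("TIMED_OUT")
```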
Shrews | i know ze05 seems to be timing out jobs like crazy. waiting for it to shut down now and it's taking forever b/c the jobs seem to go to timeout | 20:08 |
dmsimard | corvus: okay, I'll see if I can figure out a patch. | 20:08 |
dmsimard | Shrews: I'll take a look | 20:09 |
dmsimard | Shrews: absolutely nothing going on in logs, huh | 20:09 |
dmsimard | last log entry ~4 minutes ago | 20:09 |
dmsimard | that's unusual | 20:10 |
Shrews | dmsimard: that's because it's shutting down i think. not accepting new jobs | 20:10 |
dmsimard | Shrews: yeah but even then the ongoing jobs ought to be printing things | 20:10 |
Shrews | pabelanger: fungi: number of ansible processes on ze05 into single digits now. i'm hoping that means it's almost done | 20:12 |
Shrews | it was in the 70's when i started watching it | 20:12 |
dmsimard | yeah it seems like it's wrapping up the remaining jobs | 20:12 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 20:13 |
*** ldnunes has quit IRC | 20:16 | |
*** r-daneel_ has joined #openstack-infra | 20:16 | |
*** wolverineav has quit IRC | 20:17 | |
*** r-daneel has quit IRC | 20:17 | |
*** r-daneel_ is now known as r-daneel | 20:17 | |
*** slaweq has quit IRC | 20:18 | |
*** wolverineav has joined #openstack-infra | 20:18 | |
*** slaweq has joined #openstack-infra | 20:19 | |
*** onovy has joined #openstack-infra | 20:20 | |
pabelanger | Shrews: ack | 20:22 |
*** Goneri has joined #openstack-infra | 20:24 | |
*** e0ne has joined #openstack-infra | 20:24 | |
pabelanger | clarkb: spacex stream live now! | 20:24 |
*** onovy has quit IRC | 20:25 | |
smcginnis | For those interested: http://www.spacex.com/webcast | 20:25 |
dmsimard | T-20 | 20:25 |
* fungi crosses fingers that the up goer five turns elon's scrap car into an asteroid | 20:26 | |
clarkb | ya watching | 20:26 |
clarkb | I'm sad that heavy gave up on asparagus | 20:26 |
*** florianf has quit IRC | 20:28 | |
*** florianf has joined #openstack-infra | 20:28 | |
*** florianf has quit IRC | 20:28 | |
*** jamesdenton has joined #openstack-infra | 20:29 | |
*** peterlisak has joined #openstack-infra | 20:32 | |
*** onovy has joined #openstack-infra | 20:33 | |
*** olaph has joined #openstack-infra | 20:38 | |
*** olaph1 has quit IRC | 20:40 | |
*** kjackal has quit IRC | 20:43 | |
*** dsariel has joined #openstack-infra | 20:44 | |
fungi | whoosh | 20:45 |
mordred | clarkb: where do they launch these from? | 20:45 |
fungi | this one's going up from kennedy | 20:46 |
clarkb | mordred: same pad saturn 5 used | 20:46 |
clarkb | but they are going to mars not the moon | 20:47 |
fungi | and there go the boosters | 20:47 |
openstackgerrit | Merged openstack-infra/nodepool master: Fix for age calculation on unused nodes https://review.openstack.org/541281 | 20:48 |
fungi | i still don't have the audio going, but... this is a promisingly long time with no explosion | 20:48 |
smcginnis | Lots of cheering. | 20:48 |
smcginnis | And David Bowie. | 20:49 |
fungi | appropriate | 20:49 |
smcginnis | Haha, that was pretty cool. | 20:50 |
*** hongbin has joined #openstack-infra | 20:50 | |
*** gfidente|afk has quit IRC | 20:51 | |
*** olaph1 has joined #openstack-infra | 20:52 | |
smcginnis | Dang that was cool. | 20:53 |
fungi | wow! | 20:53 |
Shrews | that's science well done | 20:53 |
clarkb | its like synchronized swimming | 20:53 |
clarkb | but with fire and rockets and space | 20:53 |
fungi | looks like they didn't topple this time either? | 20:54 |
smcginnis | clarkb: Glad I'm not the only one that thought of that. ;) | 20:54 |
*** olaph has quit IRC | 20:54 | |
clarkb | fungi: ya boosters appeared to land safely | 20:54 |
fungi | that was just the first two, right? third is on its way down later? | 20:54 |
smcginnis | Yep | 20:55 |
corvus | yeah, it's either landed on the 'of course i still love you' drone ship at sea, or ... not. | 20:55 |
fungi | heh | 20:55 |
fungi | okay, the "don't panic" on the dash got me | 20:57 |
smcginnis | I loved that detail. :) | 20:57 |
pabelanger | okay, I was jumping up and down in living room with excitement | 20:57 |
pabelanger | way awesome | 20:57 |
mordred | amazing | 20:58 |
*** e0ne has quit IRC | 21:01 | |
*** camunoz has quit IRC | 21:02 | |
Shrews | pabelanger: i have to afk for a bit. can you watch ze05 and restart when it's done? it looks really close... maybe 1 or 2 jobs | 21:03 |
fungi | pabelanger: Shrews: looks like we're still waiting on ze05 to get started and then ze10 to get stopped/started before i stop ze09? | 21:03 |
Shrews | fungi: oh, i didn't know there was a ze10 | 21:03 |
fungi | i've confirmed the rest have a /var/run/zuul/executor.pid from today | 21:03 |
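That pid-file check can be scripted too; a rough sketch which, run on an executor, reports whether /var/run/zuul/executor.pid was written today and whether the recorded pid is still alive.

```python
# Sketch: confirm an executor was restarted today by checking the pid file's
# mtime and whether the recorded pid still refers to a live process.
import datetime
import os

PID_FILE = "/var/run/zuul/executor.pid"

mtime = datetime.date.fromtimestamp(os.path.getmtime(PID_FILE))
pid = int(open(PID_FILE).read().strip())

try:
    os.kill(pid, 0)  # signal 0: existence check only, sends nothing
    alive = True
except ProcessLookupError:
    alive = False
except PermissionError:
    alive = True  # process exists but is owned by another user

print("pid file written today:", mtime == datetime.date.today())
print("pid %d alive: %s" % (pid, alive))
```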
pabelanger | Shrews: sure | 21:03 |
Shrews | pabelanger: thx | 21:03 |
pabelanger | yah, I haven't done ze10 yet. I can do that now | 21:04 |
fungi | Shrews: yeah, we have 10 total | 21:04 |
pabelanger | okay, ze10 stopping | 21:05 |
*** tiswanso has quit IRC | 21:06 | |
*** tiswanso has joined #openstack-infra | 21:07 | |
pabelanger | ze10 started | 21:08 |
AJaeger | config-core, please consider to review later some grafana updates: Remove tripleo remains from grafana https://review.openstack.org/541165 - and cleanup of unused nodepool providers https://review.openstack.org/541421 | 21:13 |
fungi | pabelanger: awesome, so we're just waiting on 05 now? | 21:13 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 21:13 |
mnaser | can we approve things now? :> | 21:16 |
AJaeger | mnaser: yes, should be fine AFAIU | 21:17 |
*** hemna_ has quit IRC | 21:17 | |
pabelanger | fungi: yah, think we are down to last job running | 21:17 |
*** e0ne has joined #openstack-infra | 21:17 | |
AJaeger | fungi, pabelanger http://grafana.openstack.org/dashboard/db/zuul-status shows a constant 19 "starting builds" on ze05 | 21:17 |
pabelanger | only see 1 ansible-playbook process | 21:17 |
AJaeger | that looks strange... | 21:17 |
pabelanger | AJaeger: expected, statsd has stopped running I believe | 21:18 |
clarkb | for 541421 we'll have to manually delete the dashboards once we stop having them in the config | 21:18 |
AJaeger | clarkb, http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards | 21:18 |
clarkb | AJaeger: the first list can already be deleted? if so cool I can help get those all cleared out | 21:19 |
AJaeger | clarkb: yes, the first ones can be deleted already. | 21:19 |
AJaeger | dmsimard: ^ | 21:19 |
AJaeger | clarkb: just the last two not. and if you delete too many, I hope puppet creates them again ;) | 21:20 |
*** tpsilva has quit IRC | 21:20 | |
* AJaeger waves good night | 21:21 | |
clarkb | AJaeger: it should! | 21:22 |
dmsimard | (catching up on spacex).. the two side boosters landing simultaneously was awesome. | 21:25 |
fungi | that actually _is_ rocket science | 21:26 |
fungi | the rocket surgery comes later, when they have to put it all back together | 21:26 |
pabelanger | okay, we have no more ansible processes on ze05, but zuul-executor still running | 21:26 |
dmsimard | It must be so hard. I mean my day to day job is kinda hard.. but that is some crazy level of hard :) | 21:26 |
*** jamesmca_ has quit IRC | 21:26 | |
pabelanger | do we want to debug or just kill off? | 21:26 |
fungi | pabelanger: nothing else getting appended to the debug log? | 21:27 |
*** jamesmca_ has joined #openstack-infra | 21:27 | |
fungi | maybe attempt to trigger a thread dump (though i expect it may get ignored because it's already in the sigterm handler) | 21:27 |
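For background, that thread dump is the usual signal-triggered stack dump; a generic sketch of the mechanism (not zuul's exact code), writing every thread's stack to the log on SIGUSR2.

```python
# Generic sketch of a signal-triggered thread dump (not zuul's exact code):
# on SIGUSR2, write a stack trace for every live thread to the log.
import logging
import signal
import sys
import threading
import traceback

log = logging.getLogger("threaddump")

def dump_threads(signum, frame):
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        log.error("Thread %s (%s):\n%s",
                  ident, names.get(ident, "?"),
                  "".join(traceback.format_stack(stack)))

signal.signal(signal.SIGUSR2, dump_threads)
```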
pabelanger | fungi: no, I think using service zuul-executor restart was the reason we got into this state; I've seen odd things happen if you start another executor while one is already running | 21:28 |
pabelanger | fungi: k | 21:28 |
fungi | sounds possible | 21:28 |
*** salv-orlando has joined #openstack-infra | 21:30 | |
*** felipemonteiro has joined #openstack-infra | 21:31 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove TripleO pipelines from grafana https://review.openstack.org/541165 | 21:32 |
pabelanger | fungi: okay, I dumped threads, they are in debug logs now | 21:32 |
*** threestrands has joined #openstack-infra | 21:32 | |
*** threestrands has quit IRC | 21:32 | |
*** threestrands has joined #openstack-infra | 21:32 | |
pabelanger | fungi: I'll kill off processes now and we can inspect logs | 21:33 |
fungi | after you kill the processes, may also be good to save a trimmed copy of just the thread dump so we don't have to dig them out of the log later, and then start everything back up again | 21:33 |
*** salv-orlando has quit IRC | 21:33 | |
pabelanger | sure | 21:34 |
dmsimard | Still no news on the center core, hope it's just a video feed issue :/ | 21:34 |
fungi | 2 of 3 recovered is still a pretty good result at this point | 21:34 |
pabelanger | okay, ze05 started again | 21:34 |
fungi | given the team was only about 50% confident the payload would even make it to orbit | 21:35 |
pabelanger | cleaning up copied logs | 21:35 |
dmsimard | Not arguing otherwise, they often get video feed issues when landing on barges | 21:35 |
fungi | and considering the number of recovery failures they've had up to now | 21:35 |
*** jbadiapa has joined #openstack-infra | 21:36 | |
dmsimard | I bet they're pulling a bunch of telemetry off the car too, it's not just a dumb car :D | 21:36 |
openstackgerrit | Merged openstack-infra/project-config master: gerritbot: Add queens and rocky to ironic IRC notifications https://review.openstack.org/540986 | 21:37 |
*** jbadiapa has quit IRC | 21:40 | |
*** dave-mcc_ has quit IRC | 21:41 | |
*** ijw has joined #openstack-infra | 21:42 | |
*** hemna_ has joined #openstack-infra | 21:42 | |
fungi | stopping the executor on ze09 now in preparation for using it as a demo/canary for the v6 routing issues we've seen | 21:44 |
*** jamesmca_ has quit IRC | 21:44 | |
*** peterlisak has quit IRC | 21:44 | |
*** tiswanso has quit IRC | 21:45 | |
*** onovy has quit IRC | 21:45 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/zuul master: Add Executor Merger and Ansible execution statsd counters https://review.openstack.org/541452 | 21:46 |
jhesketh | Morning | 21:46 |
*** agopi__ has joined #openstack-infra | 21:46 | |
ijw | Is there something funky with the check jobs? I've had one sat on the queue for 2 hours, which is highly unusual | 21:46 |
*** jamesmca_ has joined #openstack-infra | 21:47 | |
*** hemna_ has quit IRC | 21:48 | |
clarkb | ijw: we had networking problems in the cloud hosting the zuul control plane which led to a large backlog | 21:48 |
clarkb | ijw: we've worked around that but result is large backlog takes time to get through | 21:48 |
*** agopi_ has joined #openstack-infra | 21:49 | |
*** agopi_ is now known as agopi | 21:49 | |
*** agopi__ has quit IRC | 21:52 | |
*** tiswanso has joined #openstack-infra | 21:52 | |
*** tiswanso has quit IRC | 21:52 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 21:52 |
*** eumel8 has quit IRC | 21:57 | |
fungi | hrm, still some 26 zuul processes on ze09 15 minutes after stopping | 21:57 |
fungi | most look like: | 21:57 |
fungi | ssh: /var/lib/zuul/builds/ed57db43d883405e9665eac1e1cba797/.ansible/cp/ba8cd30f56 [mux] | 21:58 |
fungi | are those likely to just hang around indefinitely, or should i wait for them to clear up? | 21:58 |
*** ijw has quit IRC | 21:58 | |
fungi | the count hasn't changed in the ~5 minutes i've been checking it | 21:58 |
pabelanger | there seems to be a single job for 526171,3 that has been queued for some time in integrated gate. openstack-tox-pep8, unsure why it hasn't got a node yet | 21:59 |
fungi | looks like the zuul-executor daemons themselves are no longer running, so i assume these are just leaked orphan processes | 21:59 |
dmsimard | Did we talk about openstack summit presentations at the infra meeting ? I wasn't paying a lot of attention. Deadline is thursday. | 22:00 |
fungi | oh, right these are the ssh persistent sockets which time out after an hour of inactivity or something | 22:00 |
fungi | dmsimard: clarkb mentioned the cfp submission deadline, yeah | 22:00 |
dmsimard | I'm writing a draft for a talk about ARA (and Zuul) and another which is about the "life" of a commit from git review to RDO stable release (and a magic black box where the productized version appears) | 22:01 |
dmsimard | happy to participate in anything you'd like to throw together, I'm a frequent speaker at local meetups :D | 22:01 |
*** shrircha has joined #openstack-infra | 22:02 | |
*** jamesmca_ has quit IRC | 22:02 | |
*** vhosakot has joined #openstack-infra | 22:03 | |
*** shrircha has quit IRC | 22:03 | |
*** jamesmca_ has joined #openstack-infra | 22:04 | |
pabelanger | having issues with paste.o.o | 22:04 |
pabelanger | however | 22:04 |
pabelanger | https://pastebin.com/7aB9uvn0 | 22:04 |
pabelanger | corvus: I am seeing that error currently in zuul debug.log | 22:04 |
*** salv-orlando has joined #openstack-infra | 22:05 | |
pabelanger | seems to be looping | 22:05 |
dmsimard | that's the same error as this morning | 22:05 |
fungi | `sudo systemctl disable zuul-executor.service` seems to have properly disabled zuul-executor on ze09, so going to reboot it now | 22:06 |
pabelanger | fungi: ack | 22:06 |
*** vhosakot_ has quit IRC | 22:07 | |
*** ijw has joined #openstack-infra | 22:07 | |
fungi | pabelanger: hrm, yeah, paste.o.o is taking its sweet time responding to my browser's requests as well | 22:07 |
pabelanger | dmsimard: oh, maybe this is fixed by 541454 then. Hopefully corvus will confirm | 22:08 |
ijw | clarkb: thanks. I'm in no rush, I just wanted to make sure it wasn't our fault | 22:08 |
pabelanger | but, looks like integrated change queue is wedged | 22:08 |
*** vhosakot has quit IRC | 22:09 | |
fungi | restarting openstack-paste on paste.o.o seems to have fixed whatever was hanging requests indefinitely | 22:09 |
pabelanger | fungi: great, thanks | 22:09 |
pabelanger | I need to step away and help prepare dinner | 22:10 |
corvus | that is not the error from this morning. the error from this morning was http://paste.openstack.org/raw/663579/ | 22:10 |
*** hemna_ has joined #openstack-infra | 22:12 | |
*** e0ne has quit IRC | 22:15 | |
*** openstackgerrit has quit IRC | 22:16 | |
*** salv-orlando has quit IRC | 22:17 | |
corvus | Shrews: you win a cookie -- the error that pabelanger just found is an unused node deletion between assignment and lock | 22:17 |
fungi | okay, ze09 is back up after a reboot, has no executor service running, has a v6 global address configured again, and is still unable to reach the review.openstack.org ipv6 address but can via ipv4 | 22:18 |
fungi | working on a ticket with rackspace now | 22:18 |
corvus | Shrews, pabelanger: node 0002403946 if you're curious | 22:18 |
corvus | we can probably just pop that change out of the queue, or ignore it (it's in check) | 22:19 |
*** jamesmca_ has quit IRC | 22:19 | |
ianw | ok ... i've installed the hwe 4.13 kernel on ze02, and built some custom 1.6.22.2 afs modules (packages in my homedir). any objections to stopping executor and rebooting? | 22:22 |
*** peterlisak has joined #openstack-infra | 22:22 | |
*** openstackgerrit has joined #openstack-infra | 22:22 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/project-config master: Remove legacy-tempest-dsvm-neutron-nova-next-full usage https://review.openstack.org/541477 | 22:22 |
clarkb | ianw: now is probably as good a time as any | 22:23 |
openstackgerrit | Matt Riedemann proposed openstack-infra/openstack-zuul-jobs master: Remove legacy-tempest-dsvm-neutron-nova-next-full job https://review.openstack.org/541479 | 22:24 |
fungi | through the magic of tcpdump i've also narrowed the v6 traffic issue down to being unidirectional | 22:24 |
openstackgerrit | Merged openstack-infra/nodepool master: Do not delete unused but allocated nodes https://review.openstack.org/541375 | 22:24 |
fungi | v6 traffic can _reach_ ze09 from review, but v6 traffic cannot reach review from ze09 | 22:25 |
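The tcpdump on both ends is what actually pins down the direction; for a quick ongoing check of whether the path has recovered, a sketch of an IPv6-only TCP connect probe with a short timeout (host and port are placeholders).

```python
# Sketch: attempt an IPv6-only TCP connect with a short timeout, to test
# reachability from this end. Host and port below are placeholders.
import socket

HOST = "review.openstack.org"  # target for this direction of the test
PORT = 443

infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
family, socktype, proto, _, sockaddr = infos[0]

sock = socket.socket(family, socktype, proto)
sock.settimeout(10)
try:
    sock.connect(sockaddr)
    print("ipv6 connect to %s:%d ok" % (sockaddr[0], PORT))
except OSError as exc:
    print("ipv6 connect to %s:%d failed: %s" % (sockaddr[0], PORT, exc))
finally:
    sock.close()
```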
*** onovy has joined #openstack-infra | 22:25 | |
dmsimard | any oddities in ip6tables ? | 22:25 |
*** hashar has quit IRC | 22:26 | |
fungi | well, tcpdump should be listening to a bpf on the interface, in which case ip(6)tables doesn't matter | 22:26 |
fungi | do we have an approximate time for when this all started? | 22:27 |
*** Goneri has quit IRC | 22:27 | |
corvus | fungi: looks like 06:43 based on graphs | 22:28 |
ianw | what's the current thinking re timing of "zuul-executor graceful" | 22:28 |
fungi | thanks corvus | 22:28 |
corvus | ianw: doesn't work, just use 'zuul-executor stop' | 22:28 |
*** rcernin has joined #openstack-infra | 22:28 | |
corvus | (not implemented yet) | 22:29 |
ianw | ahhhhh, that would explain a lot | 22:29 |
corvus | (also, since retries are automatic, probably almost never worth using) | 22:29 |
ianw | right, was hoping it cleaned up the builds better or something | 22:30 |
corvus | stop should be a clean shutdown | 22:30 |
corvus | even if it worked, graceful would be the one to use only if you wanted to wait 4 hours for it to stop | 22:31 |
*** jamesmca_ has joined #openstack-infra | 22:33 | |
dmsimard | Seeing a sudden increase in ram utilization on zuul.o.o http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 22:36 |
dmsimard | Seems like we're headed for a repeat of this morning's spike | 22:36 |
*** jamesmca_ has quit IRC | 22:37 | |
*** edmondsw has quit IRC | 22:37 | |
*** edmondsw has joined #openstack-infra | 22:38 | |
dmsimard | corvus: do we need to land https://review.openstack.org/#/c/541454/ and restart the scheduler again ? /var/log/zuul/zuul.log is getting plenty of kazoo.exceptions.NoNodeError. | 22:39 |
*** edmondsw has quit IRC | 22:42 | |
corvus | dmsimard: see above, nonodeerror is not because of the bug in 541454 | 22:42 |
corvus | we need to restart all launchers with https://review.openstack.org/541375 to fix that error | 22:42 |
dmsimard | corvus: ok I'll add that info to the pad, thanks | 22:43 |
ianw | #status log ze02.o.o rebooted with xenial 4.13 hwe kernel ... will monitor performance | 22:43 |
openstackstatus | ianw: finished logging | 22:43 |
*** tbachman has joined #openstack-infra | 22:45 | |
* tbachman wonders if others have noticed zuul to be down again | 22:45 | |
fungi | #status log provider ticket 180206-iad-0005440 has been opened to track ipv6 connectivity issues between some hosts in dfw; ze09.openstack.org has its zuul-executor process disabled so it can serve as an example while they investigate | 22:45 |
openstackstatus | fungi: finished logging | 22:45 |
pabelanger | corvus: Shrews: that is good news | 22:46 |
fungi | tbachman: meaning the status page for it? | 22:46 |
tbachman | fungi: ack | 22:46 |
tbachman | oh | 22:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging https://review.openstack.org/541434 | 22:46 |
tbachman | it’s up :) | 22:46 |
tbachman | sorry | 22:46 |
tbachman | looked down | 22:46 |
* tbachman goes and hides in shame | 22:46 | |
fungi | may take a minute to load because the current status.json payload is pretty huge | 22:46 |
cmurphy | it looked down for me too for a second | 22:47 |
dmsimard | Getting pulled to dinner, will be back later -- the trend for the zuul.o.o ram usage is not good but I was not able to get to the bottom of it. | 22:47 |
dmsimard | If it keeps up, we're going to be maxing ram before long | 22:47 |
*** agopi has quit IRC | 22:47 | |
*** salv-orlando has joined #openstack-infra | 22:47 | |
ianw | so ... would anyone say they know anything about graphite's logs? do we need the stuff in /var/log/graphite/carbon-cache-a ? | 22:47 |
corvus | ianw: i don't think we need any of that. pretty much ever. | 22:48 |
* jhesketh has skimmed most of the scrollback... are we still having any network issues? | 22:48 | |
jhesketh | (specifically v4) | 22:49 |
pabelanger | dmsimard: 20GB ram usage was the norm before this morning, leaving a 10GB buffer for zuul.o.o | 22:49 |
pabelanger | lets see what happens | 22:49 |
fungi | jhesketh: not sure whether the v4 connectivity issues we were seeing between executors and job nodes have persisted | 22:50 |
pabelanger | Shrews: corvus: I can start launcher restarts in 30mins or hold off until the morning, what ever is good for everybody | 22:50 |
auristor | ianw, fungi, clarkb, corvus: the rx security class supported by kafs and openafs only supports fcrypt (a weaker than 56-bit DES) wire integrity protection and encryption. In order to produce rxkad_k5+kdf tokens to support Kerberos v5 AES256-CTS-HMAC-SHA-1 enctypes for authentication you need a proper version of aklog and supporting libraries. | 22:52 |
fungi | auristor: thanks for the update. that's definitely useful info | 22:53 |
* jhesketh nods | 22:53 | |
*** myoung is now known as myoung|off | 22:53 | |
auristor | ianw, fungi, clarkb, corvus: AuriStorFS supports the yfs-rxgk security class which uses GSS-API (Krb5 only at the moment) for auth and AES256-CTS-HMAC-SHA1-96. The support for AES256-CTS-HMAC-SHA256-384 was added to Heimdal Kerberos and once it is in MIT Kerberos we will enable it in yfs-rxgk. | 22:54 |
auristor | yfs-rxgk will be added to kAFS this year | 22:55 |
fungi | that's awesome news | 22:55 |
pabelanger | corvus: I'm showing 526171 in gate (and blocking jobs), I'm not sure we can ignore it. Mind confirming, and I'll abandon / restore. | 22:56 |
auristor | AuriStorFS servers, clients and all admin tooling support IPv6. clients, fileservers and admin tools can be IPv6 only. ubik servers have to be assigned unique IPv4 addresses but they don't have to be reachable via IPv4 | 22:56 |
corvus | pabelanger: yep, go ahead and pop it | 22:57 |
*** ijw has quit IRC | 22:57 | |
corvus | is someone restarting the launchers, or should i do that now? | 22:58 |
pabelanger | corvus: go for it | 22:59 |
corvus | auristor: rxgk hasn't made it into openafs yet, right? | 22:59 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Make timeout value apply to entire job https://review.openstack.org/541485 | 23:02 |
corvus | #status log all nodepool launchers restarted to pick up https://review.openstack.org/541375 | 23:02 |
openstackstatus | corvus: finished logging | 23:02 |
auristor | corvus: yfs-rxgk != rxgk. and no rxgk is not in OpenAFS. There is a long history and I would be happy to share if you are interested. But after Feb 2012 the openafs leadership concluded that we couldn't accomplish our goals for the technology within openafs and created a new suite of rx services that mirror the AFS3 architecture that we could rapidly extend and innovate upon. | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Sync when doing disk accountant testing https://review.openstack.org/541486 | 23:03 |
*** slaweq has quit IRC | 23:04 | |
corvus | auristor: ah i didn't pick up that yfs-rxgk != rxgk, thanks | 23:04 |
auristor | corvus: think of AuriStorFS and now kafs as supporting two different file system protocols, afs3 and auristorfs. the protocol that is selected depends on the behavior of the peer at the rx layer. | 23:06 |
Shrews | corvus: pabelanger: yay! i iz teh winner! | 23:06 |
*** felipemonteiro has quit IRC | 23:06 | |
Shrews | corvus: pabelanger: did the fix to nodepool merge? | 23:07 |
*** hemna_ has quit IRC | 23:07 | |
corvus | Shrews: yep, should be in prod now | 23:07 |
*** felipemonteiro has joined #openstack-infra | 23:07 | |
Shrews | corvus: awesome. sorry i had to afk for a bit | 23:07 |
auristor | corvus: being dual-headed permits maximum compatibility. it was a major goal of AuriStorFS to provide a zero flag day upgrade and zero data loss | 23:07 |
pabelanger | okay, 526171 finally popped from gate pipeline | 23:08 |
*** wolverineav has quit IRC | 23:12 | |
*** r-daneel has quit IRC | 23:12 | |
*** r-daneel has joined #openstack-infra | 23:13 | |
*** wolverineav has joined #openstack-infra | 23:13 | |
*** aeng has joined #openstack-infra | 23:16 | |
*** wolverineav has quit IRC | 23:17 | |
*** felipemonteiro has quit IRC | 23:18 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Use nested tempfile fixture for cleanups https://review.openstack.org/541487 | 23:20 |
*** sshnaidm|bbl has quit IRC | 23:21 | |
*** M4g1c5t0rM has joined #openstack-infra | 23:26 | |
*** armaan has quit IRC | 23:27 | |
*** rossella_s has quit IRC | 23:28 | |
*** rossella_s has joined #openstack-infra | 23:29 | |
*** slaweq has joined #openstack-infra | 23:31 | |
*** M4g1c5t0rM has quit IRC | 23:31 | |
openstackgerrit | Ian Wienand proposed openstack-infra/puppet-graphite master: Fix up log rotation https://review.openstack.org/541488 | 23:34 |
*** slaweq has quit IRC | 23:37 | |
openstackgerrit | Merged openstack-infra/project-config master: Add ansible-role-k8s-cinder to zuul.d https://review.openstack.org/534608 | 23:38 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection https://review.openstack.org/541454 | 23:40 |
*** jcoufal has quit IRC | 23:42 | |
ianw | http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64155&rra_id=0&view_type=tree&graph_start=1517874206&graph_end=1517960606 <-- is this so choppy because of network, or something else? | 23:47 |
clarkb | ianw: ya I think cacti is using the AAAA records to get snmp and it's not reliable on some of our executors | 23:48 |
*** edgewing_ has joined #openstack-infra | 23:48 | |
clarkb | ianw: also we've just turned it off at this point by removing the ip address config for ipv6 on the executors | 23:49 |
*** caphrim007_ has joined #openstack-infra | 23:49 | |
corvus | that's been choppy for a while though. i think there's something unique about that host. i don't know what. | 23:50 |
*** dingyichen has joined #openstack-infra | 23:50 | |
clarkb | I'm going to look at cleaning up grafana now | 23:50 |
* clarkb reads grafana api docs | 23:51 | |
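The cleanup can be scripted against Grafana's HTTP API; a sketch assuming an admin API key and the slug-based delete endpoint available in Grafana of that era -- the slug list is a placeholder.

```python
# Sketch: delete stale dashboards via the Grafana HTTP API. Assumes an admin
# API key and the slug-based delete endpoint; slugs here are placeholders.
import requests

GRAFANA_URL = "https://grafana.openstack.org"
API_KEY = "REPLACE_ME"  # admin API key, kept out of version control
STALE_SLUGS = ["example-old-dashboard"]  # placeholder list

headers = {"Authorization": "Bearer %s" % API_KEY}
for slug in STALE_SLUGS:
    resp = requests.delete(
        "%s/api/dashboards/db/%s" % (GRAFANA_URL, slug),
        headers=headers,
    )
    print(slug, resp.status_code)
```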
*** caphrim007 has quit IRC | 23:52 | |
*** caphrim007_ has quit IRC | 23:53 | |
*** stakeda has joined #openstack-infra | 23:54 | |
clarkb | ok, the first portion of AJaeger's list is cleaned up in grafana; looks like we are still waiting for the second portion's change to merge? | 23:58 |
pabelanger | mgagne: it looks like we might have a quota mismatch again in inap. Do you mind confirming when you have a moment | 23:59 |