Tuesday, 2018-02-06

<clarkb> corvus: we can try the hwe kernel (it is 4.13 on xenial) I run it on my local nas box because cpu is too new for old kernel  00:21
<clarkb> corvus: I haven't had any issues with it other than waiting a little longer on the meltdown kpti patching which was annoying but at that point already a week behind  00:21
<clarkb> considering these servers are largely disposable and rebuildable I would be on board with trying that  00:21
<clarkb> that may also simplify bwrap for us and not require the setuid?  00:25
<tristanC> dmsimard: there were swift authentication settings in zuul.conf, then zuul would generate a temp url key per job  00:25
<clarkb> tristanC: dmsimard ya that part mostly worked (the only real issue we had there was the tempurl key was generated on job scheduling and could expire by the time a job ran iirc) the bigger issue was some dev going I want to say last week's periodic job logs  00:26
<tristanC> clarkb: which is now possible with the zuul-web builds.json controller  00:27
<clarkb> ya that was mordreds point  00:27
<clarkb> there are still some downsides to that approach but none would be a regression when compared against the current system  00:28
<corvus> clarkb: yeah, i'll put executor oom on tomorrow's meeting agenda  00:30
<clarkb> in other interesting kernel news apparently 4.15 kernel performance with kpti is only a percent or two slower than 4.11 without kpti based on some benchmarks  00:40
<clarkb> it is unfortunate that hard work on performance improvements just got negated by kpti though  00:40
<clarkb> but at least we aren't taking a long term massive regression  00:40
<openstackgerrit> Merged openstack-infra/zuul master: Allow a few more starting builds  https://review.openstack.org/540965  00:43
<openstackgerrit> Ian Wienand proposed openstack/diskimage-builder master: [WIP] Add block-device defaults  https://review.openstack.org/539375  01:02
<openstackgerrit> Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device  https://review.openstack.org/539731  01:02
<clarkb> the falcon heavy is scheduled to launch 30 minutes before the infra meeting tomorrow  01:26
<ianw> i noticed that, might have to get up extra early :)  01:42
<ianw> my kids are pretty bored with them now ... not another rocket launch ... crazy world they live in :)  01:43
<clarkb> but this one is going to mars  01:44
<openstackgerrit> Merged openstack-infra/gerritbot master: Add unit test framework and one unit test  https://review.openstack.org/499377  02:03
<corvus> it's either sending elon musk's tesla roadster to mars, or it's blowing up.  apparently musk gives it even odds.  02:08
<prometheanfire> whatever happens it'll be glorious  02:16
<openstackgerrit> Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device  https://review.openstack.org/539731  02:50
<smcginnis> Another release job timeout on a git fetch.  02:54
<smcginnis> Just dropping a note for now. Will follow up tomorrow.  02:54
<prometheanfire> pip mirrors busted?  03:09
<prometheanfire> smcginnis: related I imagine?  03:09
<clarkb> prometheanfire: http://mirror.dfw.rax.openstack.org/pypi/last-modified it's there and was last modified just under 3 hours ago  03:10
<clarkb> can you be more specific on how pip mirrors are busted?  03:10
<prometheanfire> clarkb: I was getting no distribution found, I'll retry soon  03:12
<clarkb> http://mirror.dfw.rax.openstack.org/pypi/simple/sushy/ ya looks like they are missing  03:15
<clarkb> bandersnatch behind or having problems again is my best guess  03:15
<clarkb> nothing about sushy in the bandersnatch logs  03:18
<clarkb> for the 4th, 5th and 6th  03:18
<clarkb> something wrong with upstream pypi's mirroring stuff?  03:19
<prometheanfire> not sure  03:22
<clarkb> looks like bandersnatch did error due to failed package retrieval early today PST  03:22
<clarkb> I wonder if that has it confused as to what serial it is on; maybe it thinks it is done  03:23
<clarkb> this may require another full resync :/  03:23
<clarkb> I'm not going to be able to watch that tonight but can pick up in the morning if necessary  03:23
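The staleness check clarkb walks through above boils down to fetching a package's page from the mirror's PEP 503 "simple" index and seeing whether any release files are listed. A minimal sketch of the parsing half (the HTTP fetch is omitted, and the function names are illustrative, not part of bandersnatch):

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect the link texts (filenames) from a PEP 503 'simple' index page."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Anchor text on a simple index page is the release filename.
        if self._in_anchor and data.strip():
            self.files.append(data.strip())

def files_on_mirror(index_html):
    """Return the filenames listed in the HTML of a /pypi/simple/<pkg>/ page."""
    parser = SimpleIndexParser()
    parser.feed(index_html)
    return parser.files
```

Feeding it the HTML of a page like http://mirror.dfw.rax.openstack.org/pypi/simple/sushy/ would return an empty list while the package is missing from the mirror.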
<tonyb> jhesketh: random question ... what do OpenSuse and SLES use for init? systemd?  03:31
<clarkb> tonyb: tumbleweed is systemd at least  03:32
<clarkb> looks like SLES 12 is systemd  03:33
<openstackgerrit> Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device  https://review.openstack.org/539731  03:33
<tonyb> clarkb: ah cool  03:37
<prometheanfire> tonyb: ya, they use systemd, for some reason I thought they went with openrc or eudev, but guess not  03:39
<tonyb> prometheanfire: Yeah.  Seems like gentoo and debian are the last distros in the camp of 'default init' != 'systemd'  03:41
<prometheanfire> tonyb: debian uses systemd :P  03:42
<prometheanfire> I actually don't have much of a problem with systemd, run it on my laptop  03:42
<tonyb> prometheanfire: by default?  I thought it was sysv with an option for systemd?  I don't mind being wrong  03:43
<tonyb> prometheanfire: Yeah, for me it isn't anything other than "$change seems to introduce a regression on systems that don't use systemd"  Which systems are they and do we consider that a problem?  03:44
<tonyb> prometheanfire: by no means is it a value judgement on systemd  03:45
<prometheanfire> we do have a sysv initflag, but that was just to map 'shutdown' to systemctl shutdown and the like  03:47
<prometheanfire> it can understand sysv init scripts iirc, but not sure  03:47
<fungi> tonyb: debian stable release before last (jessie) switched to systemd by default (except on non-linux arches like kfreebsd and hurd) but debian still mostly works if you install sysvinit as your default  03:47
<prometheanfire> funny thing is, I was part of the group that forked udev  03:48
<fungi> though the upgrade to jessie would keep your previous init default  03:48
<ianw> tonyb: presume relates to https://review.openstack.org/#/c/529976/2 ?  03:48
<tonyb> fungi: Ahh cool.  I'm clearly out of date.  03:48
<prometheanfire> ya, there was a big hubbub about it and some debian people forked into another non-systemd distro  03:48
<tonyb> ianw: Yeah  03:48
<ianw> we've dropped centos6 era stuff ... that was also python2.6 which we don't code for  03:48
<ianw> my thinking was there isn't really a non-systemd case?  03:49
<fungi> systemd is still at least partly a no-go on !linux kernels because upstream has in the past been adamant they rely on linux-kernel-specific features  03:49
<prometheanfire> it's also a nogo on non-glibc systems  03:50
<fungi> prometheanfire: the non-systemd debian derivative you're thinking of is https://devuan.org/  03:50
<prometheanfire> ya, that's it  03:51
<prometheanfire> I knew how to pronounce it but not spell it  03:51
<tonyb> ianw: I'm happy to drop my objection in that case.  Was trusty systemd? or upstart?  03:51
<prometheanfire> embedded tends to do a bit of non-glibc (uclibc, musl)  03:51
<prometheanfire> which is why we forked udev into eudev  03:51
<tonyb> if trusty is systemd then my objection seems to be largely theoretical ;p  03:52
<prometheanfire> trusty is not systemd, xenial is  03:52
<prometheanfire> it might have some small systemd parts iirc, but mainly upstart  03:52
<fungi> yeah, trusty was/is very much upstart  03:52
<tonyb> prometheanfire, fungi: thanks.  03:54
<tonyb> ianw: so I guess that gives us a line in the sand.  03:55
<tonyb> ianw: I guess that probably means it boils down to is trusty too old?  03:56
<prometheanfire> 18.04 comes out 'soon' too  03:56
<tonyb> ianw: I don't really mind what the answer is :)  03:56
<ianw> hmm, yeah trusty has been pretty unloved for some time  03:56
<tonyb> oooh /me notices etcd 3.2 (x86, arm and ppc64el) in bionic :)  03:57
<fungi> also openafs 1.8!  03:58
<tonyb> fungi: :)  03:58
<tonyb> Does this http://logs.openstack.org/22/535622/10/check/puppet-openstack-unit-4.8-centos-7/84927af/job-output.txt.gz#_2018-02-06_02_58_30_732623 look like a real problem or a "just run recheck" problem?  04:04
<ianw> tonyb: fatal: could not create leading directories of '/etc/puppet/modules/powerdns': Permission denied  04:05
<ianw> i'd say that needs some help...  04:05
<tonyb> ianw: I was afraid you'd say that ;P  04:06
* tonyb does some research  04:06
<ianw> maybe it just needs a "become: " line for the task ...?  04:07
<gongysh> I want to set up a multinode ci job for my project tacker  04:11
<openstackgerrit> John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo  https://review.openstack.org/541125  04:54
<openstackgerrit> John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo  https://review.openstack.org/541125  04:55
<openstackgerrit> John L. Villalovos proposed openstack-infra/project-config master: zuul.d: gerritbot: Remove check and gate jobs as now in repo  https://review.openstack.org/541125  04:56
<HawK3r> I just installed https://www.rdoproject.org/install/packstack/ packstack  05:17
<HawK3r> first time messing with openstack so I thought this would be a good way to get started  05:17
<HawK3r> I have a pretty beefy server I was using as an ESX host, now would be an OS node  05:17
<HawK3r> one issue I am having is that I installed it with a dhcp-given IP address on the server. CentOS 7 to be specific  05:17
<HawK3r> I need to change the IP address and I did it from the 15-horizon_vhost.conf file  05:17
<HawK3r> I am able to load horizon in the browser but every time I try to login I get Unable to establish connection to keystone endpoint.  05:17
<HawK3r> If i go back to the original IP address everything works fine  05:17
<HawK3r> I'm thinking there is another place where the IP needs to be changed. perhaps another config file I'm missing  05:17
<HawK3r> Also, coming from VMware I am not getting how the management network and the whole network stack works on OpenStack to be honest with you  05:17
<openstackgerrit> Ian Wienand proposed openstack/diskimage-builder master: [WIP] Choose appropriate bootloader for block-device  https://review.openstack.org/539731  05:20
<AJaeger> HawK3r: #openstack is the proper channel for such questions, this channel is for the infrastructure OpenStack uses to run CI etc.  05:22
<AJaeger> ianw, frickler, could you put https://review.openstack.org/541009 https://review.openstack.org/540971 https://review.openstack.org/540971 and https://review.openstack.org/540595 on your review queue, please?  05:26
<AJaeger> ianw, frickler, duplicate in there - I meant https://review.openstack.org/538344 as well, please  05:28
<openstackgerrit> Merged openstack-infra/project-config master: add git timeout setting for clone_repo.sh  https://review.openstack.org/541050  05:34
<openstackgerrit> Merged openstack-infra/project-config master: add a retry loop to clone_repo.sh  https://review.openstack.org/541051  05:37
*** dhajare has quit IRC05:38
*** dhajare has joined #openstack-infra05:40
*** wolverineav has joined #openstack-infra05:46
*** edmondsw has joined #openstack-infra05:46
*** abelur__ has joined #openstack-infra05:50
*** edmondsw has quit IRC05:50
*** wolverineav has quit IRC05:50
*** abelur__ has quit IRC05:51
*** abelur__ has joined #openstack-infra05:51
*** sree_ has joined #openstack-infra05:54
*** sree_ is now known as Guest5425305:54
*** wolverineav has joined #openstack-infra05:57
*** sree has quit IRC05:58
*** wolverin_ has joined #openstack-infra06:00
*** gcb has quit IRC06:00
*** wolverineav has quit IRC06:02
*** slaweq has joined #openstack-infra06:02
*** wolverin_ has quit IRC06:03
*** wolverineav has joined #openstack-infra06:03
*** abelur__ has quit IRC06:03
*** abelur__ has joined #openstack-infra06:03
*** e0ne has joined #openstack-infra06:04
<openstackgerrit> OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml  https://review.openstack.org/541136  06:05
<openstackgerrit> Merged openstack-infra/project-config master: Remove legacy-devstack-dsvm-py35-updown from devstack  https://review.openstack.org/540707  06:18
<openstackgerrit> Merged openstack-infra/project-config master: Normalize projects.yaml  https://review.openstack.org/541136  06:30
<AJaeger> ianw: thanks for reviewing - any reason not to +A the nodepool change https://review.openstack.org/#/c/540595/ ?  07:09
<openstackgerrit> Merged openstack-infra/project-config master: Remove windmill-buildimages  https://review.openstack.org/541009  07:10
<openstackgerrit> Merged openstack-infra/openstack-zuul-jobs master: Remove legacy infra-ansible job  https://review.openstack.org/540971  07:13
<openstackgerrit> Merged openstack-infra/openstack-zuul-jobs master: Zuul: Remove project name  https://review.openstack.org/541078  07:25
<ianw> AJaeger: not really, just haven't been super active in nodepool changes lately  07:47
<jlvillal> Not sure if just me but http://zuul.openstack.org/ isn't loading  07:47
<ianw> jlvillal: wfm ...  07:47
<ianw> try a hard refresh ... there was something we did with redirects this morning  07:47
<jlvillal> ianw, Ah! I thought CTRL-F5 was hard-refresh but I guess not.  07:48
<jlvillal> shift-click on reload icon did it in Firefox  07:48
<openstackgerrit> Andreas Jaeger proposed openstack-infra/project-config master: Remove TripleO pipelines from grafana  https://review.openstack.org/541165  07:51
<AJaeger> ianw: we have 0 executors right now ;(  07:51
<AJaeger> ianw: check http://grafana.openstack.org/dashboard/db/zuul-status  07:51
<AJaeger> Did they all die?  07:52
<vivsoni_> hi team, i am trying to create devstack newton  07:52
<AJaeger> infra-root ^  07:52
<AJaeger> vivsoni_: devstack is a QA project, best ask on #openstack-qa  07:52
<vivsoni_> AJaeger: ok.. thanks  07:52
<jlvillal> ianw, Is http://zuul.openstack.org/ still working for you? Now it stopped for me :(  07:55
<jlvillal> AJaeger, would your '0 executors' thing impact http://zuul.openstack.org/ ?  07:56
<AJaeger> jlvillal: it's working for me - but something else is going on ;(  07:56
<jlvillal> AJaeger, Good reason for me to go to sleep :) Almost midnight here.  07:57
<AJaeger> jlvillal: it depends on what the problem is - and that needs an infra-root to investigate  07:57
<jlvillal> AJaeger, Thanks  07:57
<AJaeger> jlvillal: 9am here ;) Good night!  07:57
<AJaeger> infra-root, looking at grafana, half of our ze0's are using 90+ % of memory.  08:35
<jhesketh> AJaeger: taking a look  08:40
<AJaeger> thanks, jhesketh. I'm wondering whether we have a network or connection problem  08:41
<AJaeger> jhesketh, infra-root, did we lose inap? http://grafana.openstack.org/dashboard/db/nodepool-inap  08:53
<jhesketh> AJaeger: that does look possible  08:54
<openstackgerrit> Andreas Jaeger proposed openstack-infra/project-config master: Temporary disable inap  https://review.openstack.org/541188  08:58
<AJaeger> jhesketh: ^  08:58
<AJaeger> jhesketh: want to force merge?  08:58
<AJaeger> waiting for nodes might take ages ;(  08:58
<ianw> 2018-02-06 08:59:09,862 INFO nodepool.CleanupWorker: ZooKeeper suspended. Waiting  08:59
<ianw> 2018-02-06 08:59:16,062 INFO nodepool.DeletedNodeWorker: ZooKeeper suspended. Waiting  08:59
<ianw> what's that mean?  08:59
<jhesketh> sorry, I'm still looking through logs, will check if inap is down then submit that... I'm not sure if we'll need to restart the executors stuck on those nodes  08:59
<ianw> jhesketh / AJaeger : i've restarted nl03 ... i think that error is related to auth timing out maybe?  09:02
<ianw> there was definitely some sort of blip, but it is now seeming to sync up inap  09:03
<AJaeger> ianw: I have no idea what it could be ;(  09:04
<ianw> if i had to guess, there was an inap blip, and it made nl03 unhappy and it lost its connection to zk  09:04
<AJaeger> ianw: but now grafana shows different graphs for inap, so restarting seems to have helped  09:04
* ianw handy-wavy ...  09:05
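For context on the "ZooKeeper suspended. Waiting" lines quoted above: kazoo-based services typically register a session-state listener and pause work while the connection is suspended, resetting when the session is lost for good. A minimal, self-contained sketch of that pattern (the state names mirror kazoo's KazooState; the return values are hypothetical worker actions, not nodepool's actual code):

```python
def on_zk_state_change(state, log):
    """Decide what a worker loop should do on a ZooKeeper session state change.

    SUSPENDED means the connection dropped but the session may survive,
    so the safe move is to pause and wait (the behavior the nodepool log
    lines above show).  LOST means the session is gone and ephemeral
    state must be rebuilt.  Anything else is treated as connected.
    """
    if state == "SUSPENDED":
        log.append("ZooKeeper suspended. Waiting")
        return "pause"
    if state == "LOST":
        log.append("ZooKeeper session lost. Resetting")
        return "reset"
    log.append("ZooKeeper connected. Resuming")
    return "resume"
```

A real service would register a function like this with the client (kazoo exposes an add_listener hook) rather than call it directly.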
* AJaeger hopes the executors recover...  09:05
<ianw> they all seem to be processing?  09:06
<ianw> but i agree ... why are they not reporting  09:06
* jhesketh is glad ianw is here :-)  09:08
<jhesketh> ianw: some of the executors aren't picking up work so they might need restarting  09:10
<AJaeger> Reading #zuul: Shrews wanted to restart executors today to pick up new changes...  09:12
* AJaeger needs to step out a bit  09:13
<ianw> jhesketh: which ones?  09:14
<jhesketh> 1&3 at least, I haven't checked them all  09:14
<jhesketh> at a guess from grafana 6,7,8,9  09:14
<jhesketh> ianw: I don't have any evidence for why that might be an effect of inap though  09:17
<jhesketh> and therefore if it'll make a difference  09:18
<ianw> i think the problem with the stats might be graphite ... seems the disk is full  09:18
<ianw> /var/log/graphite/carbon-cache-a is full  09:19
<jhesketh> oh, good find  09:19
<ianw> i'm trying to copy it to the storage volume  09:27
<ianw> alright, i'm going to quickly reboot it, because things running out of disk ... i'm not sure what state it is in  09:28
<openstackgerrit> Matthieu Huin proposed openstack-infra/zuul-jobs master: role: Inject public keys in case of failure  https://review.openstack.org/535803  09:30
<ianw> #status log graphite.o.o disk full.  move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06  09:31
<ianw> #status log graphite.o.o disk full.  move /var/log/graphite/carbon-cache-a/*201[67]* to cinder-volume-based /var/lib/graphite/storage/carbon-cache-a.backup.2018-02-06 and server rebooted  09:32
<openstackstatus> ianw: finished logging  09:32
<ianw> clarkb: ^ are you the expert on graphite?  i feel like we need to do something there; there is still a 4gb console.log file in there that we need to manage i guess  09:34
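A quick way to spot the kind of full-disk condition ianw found above, sketched with the standard library (the threshold and helper names are illustrative, not part of any infra tooling):

```python
import shutil

def disk_usage_pct(path):
    """Return the used-space percentage of the filesystem containing *path*."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def nearly_full(paths, threshold=90.0):
    """Return the paths whose filesystem is at or above *threshold* percent used."""
    return [p for p in paths if disk_usage_pct(p) >= threshold]
```

Run periodically against mounts like /var/log/graphite, this would have flagged the partition well before carbon-cache filled it completely.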
<openstackgerrit> Matthieu Huin proposed openstack-infra/nodepool master: Clean held nodes automatically after configurable timeout  https://review.openstack.org/536295  09:36
<jhesketh> ianw: the queued jobs are still climbing, any objections to restarting zuul on 01 to see if it picks up new work?  09:43
<openstackgerrit> Rico Lin proposed openstack-infra/irc-meetings master: Move Magnum irc meeting to #openstack-containers  https://review.openstack.org/541210  09:49
<ianw> jhesketh: not really, i'm looking at the logs and it seems to have taken a few jobs  09:49
<ianw> ahh, nahh it's not doing anything is it  09:50
<jhesketh> ianw: ze01?  09:50
<jhesketh> ianw: I'll do a graceful restart  09:50
<jhesketh> oh, apparently graceful isn't implemented in v3?  09:56
<jhesketh> ianw: sending a stop, and apparently it had to abort a bunch of jobs:  09:57
<jhesketh> 2018-02-06 09:56:49,228 DEBUG zuul.AnsibleJob: [build: a956d5c2e21741298eecbc23da2a3443] Abort: no process is running  09:57
<jhesketh> so it looks like ansible died somehow?  09:57
<tobiash> we maybe leak job workers under some circumstances  09:59
<tobiash> leaked not-yet-started job workers would be counted towards starting builds and thus explain the current behavior I see in the stats  09:59
<jhesketh> yep, I'd agree with that  10:01
<ianw> tobiash: oohhh, interesting.  so basically a big global slowdown?  10:01
<ianw> that's what was a bit weird, things seemed to be happening ... slowly  10:01
<tobiash> ianw: either that or we leak job workers in the executor: http://grafana.openstack.org/dashboard/db/zuul-status?panelId=25&fullscreen  10:01
<tobiash> because what I expect is that starting jobs shouldn't increase monotonically  10:02
<jhesketh> the stop command appears to be hanging... If I kill the process what state will those builds end up in with the scheduler?  10:03
<jhesketh> I suspect they'll be stuck and we might need to do a scheduler restart  10:03
<tobiash> so either something is hanging the preparation (hanging git?) or we leaked job worker threads which are counted wrongly towards starting builds  10:03
<ianw> jhesketh: it can take a while to shut down cleanly  10:03
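tobiash's leak hypothesis above is easy to reason about: if a "starting builds" gauge is derived from live worker threads, then workers that are never cleaned up make the gauge climb monotonically even though no work is progressing. A small sketch of such a thread-derived count (the thread-name prefix is an assumption for illustration, not Zuul's actual metric code):

```python
import threading

def count_job_workers(prefix="AnsibleJob"):
    """Count live threads whose name starts with *prefix*.

    A gauge computed this way only ever reflects threads that still
    exist; if finished or stuck workers are never joined and removed,
    the number keeps climbing -- the leak pattern described above.
    """
    return sum(1 for t in threading.enumerate()
               if t.name.startswith(prefix))
```

Comparing such a count against the number of builds the scheduler believes are running is one way to confirm a worker leak.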
tobiashjhesketh: a restart should at least get us going10:03
tobiashI'm not sure if rescheduling of failed jobs is already fixed10:04
jheskethianw: okay, I'll wait for the shutdown for a while10:04
*** amrith has left #openstack-infra10:04
tobiashbut at least the queued up jobs should be running fine10:04
jheskethfyi, ze01 is in stopped state10:05
ianwjhesketh: yeah, otherwise you can just kill it manually and restart, it *shouldn't* require scheduler restart10:05
tobiashscheduler shouldn't be need a restart10:05
jheskeththe scheduler will pick up that ze01 has lost builds if we kill it?10:05
tobiashjhesketh: I'm not sure about lost builds, it should reschedule them, if not that's a bug10:06
*** janki has joined #openstack-infra10:06
tobiashand I think there was such a bug, but I don't know if that has been fixed10:06
tobiashbut at least the 700 queued jobs should run fine then10:07
jheskethI'll let the executor try and shut down cleanly still for a bit10:07
jhesketh(it's still doing repo updates)10:07
*** jchhatba_ has quit IRC10:07
ianwjhesketh: sorry i gotta disappear; i'll leave it in your capable hands :)10:08
jheskethno worries (although not sure how capable!)10:08
jheskeththanks heaps for your help!10:08
*** hjensas has joined #openstack-infra10:12
*** sree has quit IRC10:13
*** sree has joined #openstack-infra10:14
*** sree has quit IRC10:18
*** nmathew has joined #openstack-infra10:19
*** sree has joined #openstack-infra10:19
*** alexchadin has quit IRC10:22
*** sree has quit IRC10:23
openstackgerritcaishan proposed openstack-infra/irc-meetings master: Move Barbican irc meeting to #openstack-barbican  https://review.openstack.org/54123010:24
*** dtantsur|afk is now known as dtantsur10:25
*** alexchadin has joined #openstack-infra10:25
AJaegerjhesketh: should we send #status notice Our Zuul infrastructure is currently experiencing some problems, we're investigating. Please do not approve or recheck changes for now. ?10:25
jheskethAJaeger: yep, perhaps a note that it'll be a little slow but should get through changes in time10:26
jheskethAJaeger: do you have status perms?10:26
AJaegeryes, I have permissions10:27
jheskethAJaeger: I'm happy for you to do it unless you prefer me to10:28
AJaeger#status notice Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating.  Please do not approve or recheck changes for now.10:28
openstackstatusAJaeger: sending notice10:28
AJaegerjhesketh: done myself - thanks10:28
-openstackstatus- NOTICE: Our Zuul infrastructure is currently experiencing some problems and processing jobs very slowly, we're investigating. Please do not approve or recheck changes for now.10:29
*** alexchadin has quit IRC10:30
*** alexchadin has joined #openstack-infra10:30
*** alexchadin has quit IRC10:30
jheskethI think it looks like ze01 is slowly shutting down, so I'm going to give it some time (but it is /very/ slow)10:31
*** alexchadin has joined #openstack-infra10:31
openstackstatusAJaeger: finished sending notice10:31
*** alexchadin has quit IRC10:31
jheskethafter that, if restarting fixes things I can do the other executors10:31
*** kashyap has left #openstack-infra10:32
*** alexchadin has joined #openstack-infra10:32
*** alexchadin has quit IRC10:32
*** alexchadin has joined #openstack-infra10:33
*** alexchadin has quit IRC10:33
*** dtruong has quit IRC10:35
*** dtruong has joined #openstack-infra10:35
AJaegerwow, that takes long ;(10:36
jheskethI can do the others in parallel if it is the cause10:38
*** kjackal has quit IRC10:40
*** ldnunes has joined #openstack-infra10:41
*** kjackal has joined #openstack-infra10:41
*** wolverineav has joined #openstack-infra10:42
*** andreas_s has quit IRC10:43
*** andreas_s has joined #openstack-infra10:47
jheskethso it's stuck processing the update_queue (for getting git changes etc) but it's going very slowly10:48
jheskethunless it's doing whole clones (which it shouldn't) I can't see why that might be the case10:48
jheskethtobiash: any thoughts why the update_queue might be going so slow? ^10:50
tobiashjhesketh: currently lunching10:51
*** fverboso has joined #openstack-infra10:51
jheskethno worries10:51
*** lucas-afk is now known as lucasagomes10:52
*** sambetts|afk is now known as sambetts10:54
*** andreas_s has quit IRC10:57
*** andreas_s has joined #openstack-infra10:58
*** yamamoto has quit IRC10:59
AJaegerjhesketh: I'm getting timeouts from jobs that finish ;(10:59
* AJaeger joins tobiash for virtual ;) lunch now11:00
jheskethAJaeger: oh, link please?11:01
jhesketh(when you return)11:01
*** gfidente has quit IRC11:02
*** yamamoto has joined #openstack-infra11:03
*** eyalb has joined #openstack-infra11:05
*** gfidente has joined #openstack-infra11:05
*** gfidente has joined #openstack-infra11:05
tobiashjhesketh: hrm, the update_queue only does repo resetting, fetching and cleaning11:06
tobiashis the connection to gerrit slow?11:07
*** andreas_s has quit IRC11:07
*** andreas_s has joined #openstack-infra11:08
*** tosky has joined #openstack-infra11:08
tobiashjhesketh: hasn't some connection firewall limit for gerrit been merged recently?11:08
tobiashmaybe that's a side effect11:09
jheskethconnection seems fine, I can clone a repo very fast on the executor so it's probably also not a firewall11:10
*** edmondsw has joined #openstack-infra11:10
* jhesketh will be back shortly11:10
tobiashthen I'm currently out of ideas11:11
*** alexchadin has joined #openstack-infra11:11
*** edmondsw has quit IRC11:15
*** rfolco|off is now known as rfolco|ruck11:16
*** andreas_s has quit IRC11:22
*** andreas_s has joined #openstack-infra11:23
*** nicolasbock has joined #openstack-infra11:26
*** andreas_s has quit IRC11:27
*** andreas_s has joined #openstack-infra11:27
* jhesketh returns11:31
jheskethso it's still processing the update_queue. I think I'm going to kill the process now11:38
*** nmathew has quit IRC11:38
*** larainema has quit IRC11:40
*** links has quit IRC11:41
*** pcichy has quit IRC11:48
jheskethit's gone back to having a very slow process_queue (although actually accepting builds afaict since I started it again)11:49
jheskethgit-upload-packs are taking minutes11:49
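A quick way to put numbers on that from an executor is to time the fetch the way the merger would run it. This is a diagnostic sketch, not Zuul code; the commented-out repo URL is only an example of the kind of fetch under discussion:

```python
import subprocess
import time

def timed_run(cmd, timeout=600):
    """Run a command, returning (elapsed_seconds, returncode).

    Raises subprocess.TimeoutExpired if the command exceeds the
    timeout, which is itself a useful signal when fetches hang.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    return time.monotonic() - start, proc.returncode

# Example (illustrative URL): run the same fetch from a "slow" and a
# "fast" executor to see whether the delay is host-specific.
# elapsed, rc = timed_run(
#     ["git", "fetch", "https://review.openstack.org/openstack-infra/zuul"])
```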
*** sree has joined #openstack-infra11:51
*** sree_ has joined #openstack-infra11:51
*** sree_ is now known as Guest6131011:52
openstackgerritMerged openstack-infra/nodepool master: Convert nodepool-zuul-functional job  https://review.openstack.org/54059511:53
*** links has joined #openstack-infra11:54
*** tpsilva has joined #openstack-infra11:55
*** sree has quit IRC11:55
*** gongysh has quit IRC11:57
*** maciejjozefczyk has joined #openstack-infra11:59
tobiashjhesketh: git-upload-pack is server side, so it might be worth checking gerrit as well12:00
AJaegerjhesketh: https://review.openstack.org/538508 has the timeout12:01
d0ugalIs there any way to search logs for a particular CI job? I want to find when an exception first started12:05
jheskethtobiash: yep, a cursory glance doesn't show anything though and the other ze's appear to be working okay12:05
* jhesketh wonders if that's since changed12:05
jheskethAJaeger: thanks... looks like there might be some more network issues... maybe these are affecting process queue12:06
AJaegerjhesketh:  https://review.openstack.org/539854 has one timeout as well - and post failures12:06
AJaegerd0ugal: which job?12:06
jheskethd0ugal: http://logstash.openstack.org/ might help12:06
AJaegerhttp://zuul.openstack.org/jobs.html helps12:06
AJaegerand then click on Builds - and search for project12:07
d0ugalAJaeger: a tripleo one - tripleo-ci-centos-7-scenario003-multinode-oooq-container12:07
d0ugalcool, I shall try both of these12:07
d0ugalI don't want to know how much time I have wasted manually looking through logs... I should have asked before!12:07
AJaegerwithout the 8 at the end - http://zuul.openstack.org/builds.html?job_name=tripleo-ci-centos-7-scenario003-multinode-oooq-container12:09
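The builds page AJaeger links takes the filter as a query parameter, so building such URLs programmatically is just query-string assembly. A small sketch (the parameter name `job_name` is taken from the URL above; any other filters are assumptions):

```python
from urllib.parse import urlencode

def builds_url(base, **filters):
    """Assemble a zuul builds URL filtered by e.g. job_name."""
    return base + "?" + urlencode(filters)

url = builds_url(
    "http://zuul.openstack.org/builds.html",
    job_name="tripleo-ci-centos-7-scenario003-multinode-oooq-container")
```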
*** electrical has quit IRC12:13
*** electrical has joined #openstack-infra12:13
*** Guest61310 has quit IRC12:14
d0ugalI guess I still need to manually look at the files :)12:15
*** sree has joined #openstack-infra12:15
AJaegeryou can run queries with logstash12:16
d0ugalI am trying to figure out how to do that :)12:18
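For the logstash side, queries are Lucene syntax against fields the CI log pipeline indexes. A hedged example (the field names here match what elastic-recheck queries typically use, but are worth verifying against the Kibana field list):

```
message:"SomeException" AND tags:"console"
  AND build_name:"tripleo-ci-centos-7-scenario003-multinode-oooq-container"
```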
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp  https://review.openstack.org/53556212:19
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp  https://review.openstack.org/53556312:19
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params  https://review.openstack.org/53630112:19
*** sree has quit IRC12:19
AJaegerd0ugal: can't help with that - you might want to ask qa team12:20
*** edwarnicke has quit IRC12:21
*** edwarnicke has joined #openstack-infra12:22
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add /node-list to the webapp  https://review.openstack.org/53556212:25
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add /label-list to the webapp  https://review.openstack.org/53556312:25
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Refactor status functions, add web endpoints, allow params  https://review.openstack.org/53630112:25
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add a separate module for node management commands  https://review.openstack.org/53630312:25
*** fkautz has quit IRC12:25
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint  https://review.openstack.org/53631912:26
*** HeOS has quit IRC12:26
*** sshnaidm|rover is now known as sshnaidm|afk12:27
*** hashar is now known as hasharAway12:28
*** rossella_s has quit IRC12:29
*** zoli is now known as zoli\12:30
*** zoli\ is now known as zoli12:30
*** fverboso has quit IRC12:31
*** rossella_s has joined #openstack-infra12:32
*** rossella_s has quit IRC12:38
*** rossella_s has joined #openstack-infra12:39
*** jpena is now known as jpena|lunch12:51
*** rossella_s has quit IRC12:52
*** rossella_s has joined #openstack-infra12:52
*** coolsvap has quit IRC12:54
*** gongysh has joined #openstack-infra12:54
*** links has quit IRC12:55
*** jlabarre has joined #openstack-infra12:55
*** panda|off is now known as panda12:55
*** eharney has joined #openstack-infra12:58
*** zhurong_ has joined #openstack-infra13:01
*** rossella_s has quit IRC13:02
*** larainema has joined #openstack-infra13:03
*** rossella_s has joined #openstack-infra13:05
*** zhurong has quit IRC13:08
*** links has joined #openstack-infra13:08
*** jbadiapa has joined #openstack-infra13:09
*** edmondsw has joined #openstack-infra13:13
*** jbadiapa has quit IRC13:14
*** olaph has joined #openstack-infra13:16
*** cshastri has quit IRC13:16
*** olaph1 has quit IRC13:17
*** amoralej is now known as amoralej|lunch13:19
*** alexchadin has quit IRC13:22
*** alexchadin has joined #openstack-infra13:23
*** jaosorior has quit IRC13:23
ShrewsAJaeger: I'm not going to restart the launchers now given the current situation. Don't want to add fuel to the fire.13:25
*** dayou has quit IRC13:26
*** dave-mccowan has joined #openstack-infra13:29
*** dhajare has quit IRC13:30
*** rossella_s has quit IRC13:30
AJaegerShrews: no idea what's going on ;(13:31
AJaegerjhesketh: are you still around? Anything to share here?13:31
AJaegerShrews: looks like debugging/investigation is needed to move us forward - help welcome I guess13:32
jheskethAJaeger: still around but about to head off13:32
*** rossella_s has joined #openstack-infra13:32
jheskethShrews, infra-root: unfortunately I've been unable to make any more progress (see scrollback) and have to head off. As best as I can tell on some ze's they are taking a very long time to perform git operations13:34
jheskethI've restarted ze01 with no effect13:34
jheskethSome executors don't appear affected and resource usage on the common hosts (zuul, nodepool, gerrit etc) all look normal13:34
*** dave-mccowan has quit IRC13:35
*** dave-mcc_ has joined #openstack-infra13:36
*** hemna_ has joined #openstack-infra13:36
*** pgadiya has quit IRC13:40
*** rossella_s has quit IRC13:41
fungihttps://rackspace.service-now.com/system_status/ doesn't indicate any widespread issues they're aware of13:42
fungistill catching up (is there a summary?) but resource utilization on ze01 looks normal or even low13:43
*** rossella_s has joined #openstack-infra13:44
*** janki has quit IRC13:44
AJaegerfungi: see jhesketh'S last lines - and review http://grafana.openstack.org/dashboard/db/zuul-status . None of the ze's is accepting ;(13:45
jheskethActually I think they are. Just very very very slowly13:46
AJaegerso, slower than a snail ;(13:46
fungifirst guess is maybe this is the effect of the new throttle13:46
fungidoes the pace match the rate corvus described?13:47
jheskethfungi: git-upload-pack is taking minutes on a few executor hosts causing the jobs to take forever to prepare and never keeping up with the work13:47
*** jpena|lunch is now known as jpena13:47
fungiis that a subprocess of ansible pushing prepared repositories to the job nodes? any particular repos?13:47
jheskethWell that's my analysis of it. It's very likely I'm wrong and chasing a red herring13:48
jheskethNo I think it's the executor mergers preparing the repos for config changes and cache etc13:48
funginot (yet) at a machine where i can pull up graphs easily. is there a list of which executors are exhibiting this behavior and which aren't?13:49
jheskethYou can see the process in ps. The time it is running for seems extreme but maybe it's correct13:49
fungiare any of the standalone mergers also having similar trouble?13:49
jheskethI've stepped away so I can't give you a list sorry. ze01 is affected and for comparison ze02 isn't13:50
jheskethFrom memory I think 3,5,6,7,8 are also affected13:50
jheskethfungi: great question, I didn't think to check sorry13:50
fungino problem. that helps, thanks!13:51
*** dbecker has joined #openstack-infra13:51
jheskethfwiw git checkout on a suffering host (from gerrit) worked just fine13:51
*** rossella_s has quit IRC13:52
*** zhurong_ has quit IRC13:52
jheskethNo worries, sorry I couldn't get further or stick around13:52
*** eumel8 has quit IRC13:52
*** dizquierdo has quit IRC13:52
*** sshnaidm|afk is now known as sshnaidm|rover13:52
fungiyeah, for now i'm not going to assume long-running git-upload-pack processes are atypical13:52
*** jcoufal has joined #openstack-infra13:52
fungithe governor changes also added some statsd counters/gauges13:53
*** esberglu has joined #openstack-infra13:53
fungiin a few minutes i should be in a better place to start looking into those if no other infra-root beats me to it13:53
*** rossella_s has joined #openstack-infra13:54
Shrewswell, i'm not sure what to look for. so far on ze01, i see a lot of ssh unreachable errors for inap nodes, but not sure if that's related13:56
AJaegerhttp://grafana.openstack.org/dashboard/db/nodepool-inap - inap might have had some hiccups13:57
AJaegerShrews: so ianw restarted nl03 and then it continued to work.13:58
AJaegerThere's - according to graphs - a correlation to the inap hiccup (not sure whether at inap, or network to it)13:59
*** hrw has quit IRC13:59
*** jamesmcarthur has joined #openstack-infra14:00
pabelangerwe are also just about out of memory on zuul.o.o.14:00
AJaegerpabelanger: wasn't when this started - but with this long backlog (going on since around 7:00 UTC)...14:01
*** hrw has joined #openstack-infra14:01
*** oidgar has joined #openstack-infra14:03
*** rossella_s has quit IRC14:04
*** jaosorior has joined #openstack-infra14:04
*** psachin has quit IRC14:04
AJaegerpabelanger: that should reduce once we process jobs in "normal" speed14:05
*** rossella_s has joined #openstack-infra14:06
fungiwell, it won't really "reduce" visibly since the python interpreter won't really free memory allocations back to the system14:06
AJaegermnaser: let's not approve changes for now, please - we have a really slow Zuul right now...14:06
fungibut it'll stop growing, probably for a very long time until the next time it needs more than that amount of memory14:06
mnaserAJaeger: sorry, ack14:07
AJaegerfungi: yeah, we might do a restart at the end14:07
pabelangersadly, I'm not in a good spot to help debug this morning.14:07
pabelangerHope to be back into things in next 90mins or so14:08
*** jrist has quit IRC14:11
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Fix for age calculation on unused nodes  https://review.openstack.org/54128114:11
*** rossella_s has quit IRC14:11
*** jrist has joined #openstack-infra14:12
*** rossella_s has joined #openstack-infra14:12
fungiinfra-root: ovh has also asked us to stop using any of their regions (due to customer impact) and check back with them in q314:13
Shrewsfwiw, i'm not seeing any unusual nodepool behavior, even on the launcher ianw restarted (though i did see a minor bug in our logs)14:13
Shrewsfungi: :(14:13
fungilooks like they e-mailed that to me and clarkb just now14:14
*** andreas_s has quit IRC14:15
pabelangerfungi: ack, sad to see them turned off but understand.14:15
*** andreas_s has joined #openstack-infra14:15
*** efried has joined #openstack-infra14:15
*** Goneri has joined #openstack-infra14:16
*** andreas_s has quit IRC14:18
*** mriedem has joined #openstack-infra14:18
*** andreas_s has joined #openstack-infra14:18
*** rosmaita has joined #openstack-infra14:20
openstackgerritRuby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches  https://review.openstack.org/54102714:20
*** oidgar has quit IRC14:22
*** amoralej|lunch is now known as amoralej14:24
openstackgerritMerged openstack-infra/project-config master: Revert "base-test: test new mirror-workspace role"  https://review.openstack.org/54094914:28
*** jamesmcarthur has quit IRC14:29
*** tiswanso has joined #openstack-infra14:30
*** jamesmcarthur has joined #openstack-infra14:30
*** mihalis68 has joined #openstack-infra14:30
openstackgerritLiam Young proposed openstack-infra/project-config master: Add vault charm to gerrit  https://review.openstack.org/54128714:31
fungiinfra-root: sorry, i misread. not ovh but citycloud, so not nearly as big of a hit14:32
*** lucasagomes is now known as lucas-hungry14:34
*** david-lyle has quit IRC14:34
*** fverboso has joined #openstack-infra14:35
openstackgerritLiam Young proposed openstack-infra/project-config master: Add vault charm to gerrit  https://review.openstack.org/54128714:35
*** jamesmcarthur has quit IRC14:35
*** dtantsur is now known as dtantsur|bbl14:37
*** rossella_s has quit IRC14:37
*** links has quit IRC14:38
dmsimardfungi: it's not the first time citycloud asks us to hold back right ?14:39
dmsimardfungi: I mean, is it no-op or do we have to tune some things down still ?14:39
*** rossella_s has joined #openstack-infra14:40
fungidmsimard: there are still a couple regions at max-servers: 50 and also we should probably pause image uploads since they say they want us to stop until "q3" which i take to mean check with them again on july 114:40
dmsimarddepends if it's q3 calendar or q3 fiscal, by default I guess it's calendar ?14:41
openstackgerritRuby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches  https://review.openstack.org/54102714:46
*** rossella_s has quit IRC14:47
*** alexchadin has quit IRC14:48
*** rossella_s has joined #openstack-infra14:50
*** dbecker has quit IRC14:51
*** olaph1 has joined #openstack-infra14:51
*** olaph has quit IRC14:53
*** sweston has quit IRC14:54
*** sweston has joined #openstack-infra14:54
*** derekjhyang has quit IRC14:55
*** davidlenwell has quit IRC14:55
*** dbecker has joined #openstack-infra14:56
*** derekjhyang has joined #openstack-infra14:56
*** davidlenwell has joined #openstack-infra14:56
*** kiennt26 has joined #openstack-infra14:57
*** hasharAway is now known as hashar15:00
*** olaph1 is now known as olaph15:00
*** jamesmcarthur has joined #openstack-infra15:02
*** ihrachys has joined #openstack-infra15:03
fungii'm getting lost in the maze of nodepool configs and yaml anchors now. checking the nodepool docs to figure out where i need to add pause directives15:04
Shrewsfungi: the diskimage section15:07
fungisure, just trying to figure out which one(s)15:07
*** HeOS has joined #openstack-infra15:07
*** fkautz has joined #openstack-infra15:07
fungito pause all image uploads for citycloud, do i need to add pause to each of the citycloud diskimages, and does that need to be done both in nodepool.yaml and nl02.yaml?15:08
fungior do the image configurations on the launchers not matter to the builders i guess?15:08
Shrewsfungi: image uploads only happen on nb0*15:09
fungiwell, that wasn't my question. i'll check puppet to see if the nl0*.yaml configs get installed on the nb0* hosts15:10
Shrewsdoesn't matter what the launchers have15:10
Shrewsfungi: oh sorry15:10
fungimainly trying to figure out which specific config files i need to update for this15:10
fungiand whether this means undoing a bunch of the yaml anchor business in nodepool.yaml (i'm betting it does)15:11
*** mnaser has quit IRC15:11
*** mnaser has joined #openstack-infra15:12
*** dtantsur|bbl is now known as dtantsur15:14
*** mgkwill has quit IRC15:14
Shrewsfungi: oh hrm... i'm not familiar with how the anchors work15:15
dmsimardinfra-root: FYI there will be a 0.14.6 release of ARA to workaround an issue where Ansible sometimes passes ignore_errors as a non-boolean value15:15
*** mgkwill has joined #openstack-infra15:15
*** rossella_s has quit IRC15:16
*** jamesmca_ has joined #openstack-infra15:18
*** rossella_s has joined #openstack-infra15:19
*** hrybacki has quit IRC15:19
Shrewsfungi: if it helps, the nl0*.yaml configs are not installed on the nb0* instances. so i think we only need to deal with nodepool.yaml with the anchor business15:19
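For reference, pausing uploads per provider in the builder's config means adding `pause: true` to each diskimage entry under that provider. A hedged sketch only (the provider and image names here are illustrative, not the real nodepool.yaml contents):

```yaml
providers:
  - name: citycloud-sto2          # illustrative provider name
    diskimages:
      - name: ubuntu-xenial
        pause: true               # builder skips uploads of this image
      - name: centos-7
        pause: true
```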
*** hrybacki has joined #openstack-infra15:19
openstackgerritJeremy Stanley proposed openstack-infra/project-config master: Disable citycloud in nodepool  https://review.openstack.org/54130715:20
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: Add separate modules for management commands  https://review.openstack.org/53630315:20
fungiShrews: thanks for confirming! that's what i figured as well15:20
fungisee 54130715:20
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: webapp: add optional admin endpoint  https://review.openstack.org/53631915:20
pabelangerokay, ready to help now15:21
pabelangerwe still seeing slowness in zuul?15:21
fungipabelanger: i expect so. also 541307 is a quick review but semi-urgent so i can reply to citycloud15:22
pabelangeryah, just looking now15:23
*** bobh has joined #openstack-infra15:23
*** trown is now known as trown|outtypewww15:23
*** iyamahat has joined #openstack-infra15:25
*** caphrim007 has quit IRC15:25
Shrewsfungi: +2'd. can approve now if you wish, or wait for others to look15:26
*** olaph1 has joined #openstack-infra15:27
fungiShrews: let's go ahead and approve if you're okay with the configuration. the sooner it merges, the sooner i can let the provider know15:27
pabelangerhttp://grafana.openstack.org/dashboard/db/zuul-status is a little confusing to read, but looking at ze03 now15:27
*** olaph has quit IRC15:27
pabelangerI do see route issues on ze03.o.o at 2018-02-06 06:47:13,066   stderr: 'ssh: connect to host review.openstack.org port 29418: No route to host15:28
AJaegerpabelanger: check http://grafana.openstack.org/dashboard/db/zuul-status?from=1517844541494&to=151793094149415:29
pabelangerI also see ansible SSH errors15:29
pabelanger2018-02-06 14:35:07,097 DEBUG zuul.AnsibleJob: [build: d1d72c14bb2b47eca956568b37af9a49] Ansible output: b'    "msg": "SSH Error: data could not be sent to remote host \\"\\". Make sure this host can be reached over ssh",'15:29
AJaegerthat nicely shows the executors not accepting since 7:2215:30
pabelangerhowever, I think we might have leaked jobs because15:30
pabelanger2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.015:30
pabelangerhowever, there are no ansible-playbooks process running15:30
*** lucas-hungry is now known as lucasagomes15:30
AJaegerfungi: it might take *hours* to merge that change with current slowness15:30
fungipabelanger: on ze03 i see that telnet to any open port on review.o.o hangs from ze03 when using ipv6, but works over ipv415:31
fungiAJaeger: i know15:31
pabelangerfungi: k15:31
*** kiennt26 has quit IRC15:31
pabelangerI'm going to look at zuul source code for a moment15:31
fungiAJaeger: but if it's at least approved then as soon as we get zuul back on track i can enqueue it into the gate to expedite it15:31
AJaegerfungi: sure!15:32
fungipabelanger: from ze10, on the other hand, i can connect to review.openstack.org via both ipv4 and ipv615:32
mordredfungi: what can I do to help?15:32
mordredoh jeez. we're having routing issues?15:33
fungimordred: see what oddities you can spot with the misbehaving executors, i think15:33
fungimordred: that looks like one possible explanation15:33
pabelangeryah, think so15:33
pabelangerbut, also at 20 running builds, at least the executor thinks that15:33
pabelangerbut no ansible-playbook are running15:34
tobiashpabelanger: 20 starting builds15:34
tobiashthat means pre-ansible git and workspace preparation15:34
tobiashso that could be explained by the routing issues maybe15:35
pabelangertobiash: where did you see that?15:35
mordredoh - yah - if the mergers are having trouble cloning from gerrit due to routing issues15:35
*** jungleboyj has quit IRC15:36
tobiashpabelanger: wasn't that the starting builds metric earlier today?15:36
*** jungleboyj has joined #openstack-infra15:36
*** sree has joined #openstack-infra15:36
tobiashpabelanger: 2018-02-06 14:35:13,814 INFO zuul.ExecutorServer: Unregistering due to too many starting builds 20 >= 20.015:36
*** aviau has quit IRC15:36
*** aviau has joined #openstack-infra15:37
tobiashthis means 20 builds in pre-ansible phase which is triggered by the slow start change15:37
fungize01, ze03, ze06, ze07, ze08 and ze09 can't reach tcp services on review.o.o over ipv6. ze02, ze04, ze05 and ze10 can15:37
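The per-host reachability checks above boil down to a TCP connect restricted to one address family. A minimal sketch (the host and port in the comment are the ones under discussion, used only as an example):

```python
import socket

def can_connect(host, port, family, timeout=5):
    """Try a TCP connect to host:port restricted to one address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    for af, socktype, proto, _, addr in infos:
        try:
            with socket.socket(af, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)
                return True
        except OSError:
            continue
    return False

# e.g. compare, from each executor:
#   can_connect("review.openstack.org", 29418, socket.AF_INET)
#   can_connect("review.openstack.org", 29418, socket.AF_INET6)
```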
*** med_ has quit IRC15:38
pabelangermordred: tobiash: I wonder if we are not resetting starting_builds in that case15:38
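The suspected failure mode is easy to model: a toy sketch of the slow-start accounting (this is not Zuul's actual code; the point is that a worker which increments the counter but never decrements it pins the executor at its limit forever, matching the "Unregistering due to too many starting builds 20 >= 20.0" log line with no ansible-playbook running):

```python
class ExecutorGovernor:
    """Toy model of the executor's 'starting builds' governor."""

    def __init__(self, max_starting_builds=20):
        self.max_starting_builds = max_starting_builds
        self.starting_builds = 0

    def accepting(self):
        # Unregister when starting_builds >= max_starting_builds.
        return self.starting_builds < self.max_starting_builds

    def start_build(self):
        self.starting_builds += 1

    def build_running(self):
        # If this decrement is ever skipped (a leaked job worker),
        # starting_builds only grows and accepting() stays False.
        self.starting_builds -= 1
```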
fungii'm checking routing for mergers next15:38
corvusmaybe we should switch the executors to ipv4 only?15:38
corvusdo we have any ipv6-only clouds right now?15:38
*** yamamoto has quit IRC15:39
mordrednot to my knowledge15:39
fungithat might be a viable stop-gap. we could just unconfigure the v6 addresses on them for the moment15:39
tobiashpabelanger: that would mean we can leak job_workers in the executor which would be a zuul bug15:39
pabelangertobiash: I haven't confirmed yet, trying to understand how the new logic works15:40
*** armaan has quit IRC15:40
fungiworth noting, none of the standalone mergers (zm01-zm08) are exhibiting this ipv6 routing issue15:40
*** armaan has joined #openstack-infra15:41
tobiashpabelanger: that's entirely possible but I didn't spot such a code path in my 5min code scraping this morning15:41
pabelangerthat is good news15:41
*** sree has quit IRC15:41
*** yamamoto has joined #openstack-infra15:42
mordredfungi: I wonder if reconfiguring the interfaces on one of the broken executors would have any impact ... oh, wait - these are rackspace so have static IP information ...15:42
fungii wonder if rebooting them may "fix" it15:42
mordredI was thinking something might be wrong with RA... but since it's static config I think that's less likely15:43
fungii've seen traffic cease making it to instances in rackspace in the past, and a hard reboot gets routing reestablished in whatever their network gear is15:43
fungithough in those cases it's usually been both v4 and v6 at the same time15:43
*** eyalb has left #openstack-infra15:44
mordredthe ipv6 routing tables on a working and a non-working node are the same15:44
corvuslet's go ahead and stop ze01 and reboot it, that's an easy test.  who wants to stop it?15:44
dmsimardhave time to give a hand now15:44
fungiyeah, i have a feeling this is due to something outside the instance itself15:44
fungican stop it now15:45
*** vhosakot has joined #openstack-infra15:45
fungiit's stopping now15:45
corvusfungi:  cool, you're stopping ze0115:45
fungiyeah, even ping6 isn't making it through15:45
*** dbecker has quit IRC15:45
dmsimardThe test node graph is super weird but you probably know that already http://grafana.openstack.org/dashboard/db/zuul-status?panelId=20&fullscreen15:45
corvuswhile that's going on, let's try to deconfigure ipv6 on ze0315:45
* dmsimard catches up15:45
mordredfwiw - http://paste.openstack.org/show/663521/ <-- that's got routing and address info for a working and a non-working node. both look the same to me15:46
fungiwhat's strange is that (so far from my testing) i can ping6 instances in iad from ze01 but not instances in dfw15:46
fungii take that back. i can ping6 etherpad.o.o from ze01 successfully15:47
fungiso this may be some flow-based issue in their routing layer15:47
mordredip -6 neigh shows differences15:47
fungii still see a few git remote operations as child processes of zuul-executor on ze0115:48
fungiwonder if i should just kill those15:48
corvusip -6 addr del 2001:4800:7818:104:be76:4eff:fe04:a4bf/64 dev eth015:48
corvushow's that look?15:49
*** yamahata has joined #openstack-infra15:49
corvusfungi: killing git should be okay15:49
corvusand i'll do the same for the fe80 addrs?15:49
*** dtantsur is now known as dtantsur|bbl15:49
dmsimardmordred: curious to try an arping on ze08 on that stale router15:50
fungishouldn't need to do it for any fe80:: addresses as those are just link-local15:50
fungiit's only the scope global address(es) which matter in this case15:50
corvusok, wasn't sure if that would affect anything.  i'll just do it for that global address on ze03 then?15:50
dmsimardmordred: ip -6 neigh shows the router as reachable now (didn't do anything btw)15:50
fungicorvus: yeah, that should be plenty15:50
*** camunoz has joined #openstack-infra15:50
corvusdone on ze0315:51
*** david-lyle has joined #openstack-infra15:51
*** pcaruana has quit IRC15:51
corvusif that works, presumably we'll see the effect after the current git op completes15:51
dmsimardmordred: some flapping going on ? http://paste.openstack.org/raw/663529/15:51
corvuswhich is now, and it's running very quickly15:51
*** samueldmq has quit IRC15:52
*** samueldmq has joined #openstack-infra15:52
corvusnow i see a lot of ssh unreachable errors from ansible15:52
corvusthey are to ipv4 addresses15:52
fungiugh, ze01 keeps spawning new git operations15:52
corvusfungi: yeah, it probably will.  known inefficiency in shutdown: it doesn't short-circuit the update queue15:53
corvusalso, it now retries failed git operations 3 times.15:53
*** yamamoto has quit IRC15:53
*** xarses has joined #openstack-infra15:54
corvusweird, i'm attempting to manually connect to the hosts that ansible says it can't connect to, and it's working15:54
fungiokay, so could also be some random v4 connectivity issues from that instance?15:55
*** xarses_ has joined #openstack-infra15:55
*** claudiub|2 has joined #openstack-infra15:55
*** claudiub|2 has quit IRC15:55
fungithat's sounding more and more like some sort of address-based flow balancing sending some connections to a dead device15:55
openstackgerritMerged openstack-infra/zuul master: Fix github connection for standalone debugging  https://review.openstack.org/54077215:56
corvusi've yet to fail to connect to one of those hosts manually, and pings are working fine15:56
*** claudiub|2 has joined #openstack-infra15:56
Shrewscorvus: the few addresses i saw that happening for were all inap15:57
*** salv-orlando has joined #openstack-infra15:57
corvusthe other hosts have an increase in those errors too15:58
dmsimardJust curious... are we graphing conntrack ? Don't see it in cacti.. it requires some amount of RAM to keep iptables/ip6tables going with conntrack. Could our RAM starvation problems end up impacting that ? I don't see any conntrack messages in dmesg.. I know it would print obvious errors if we're reaching nf conntrack's max number but I don't know if it complains about other things.15:58
*** claudiub has quit IRC15:58
*** olaph has joined #openstack-infra15:58
*** xarses has quit IRC15:58
dmsimardObviously right now the current conntrack numbers are low because nothing's running15:58
corvusze02 had 7 instances of ssh connection errors yesterday and 678 today.  ze02 is one of the hosts that is 'working'.15:59
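Per-day counts like these can be tallied straight from the executor debug logs. A sketch that assumes the usual `YYYY-MM-DD HH:MM:SS,ms ...` timestamp prefix (the log path and exact format are assumptions):

```python
from collections import Counter

def ssh_errors_per_day(lines, needle="SSH Error"):
    """Tally matching log lines by their leading date stamp."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line.split(" ", 1)[0]] += 1
    return counts

# e.g. ssh_errors_per_day(open("/var/log/zuul/executor-debug.log"))
```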
dmsimardok I'll look at ze0215:59
*** olaph1 has quit IRC16:00
*** r-daneel has joined #openstack-infra16:01
fungistill playing whack-a-mole with git processes on ze01 waiting for it to complete zuul-executor service shutdown16:02
corvusShrews: this is a list of ip addrs that ze03 has recently failed to connect to: http://paste.openstack.org/show/663545/16:02
*** salv-orlando has quit IRC16:03
*** salv-orlando has joined #openstack-infra16:03
corvusthe 104's in there are rax16:03
Shrewsyeah, those seem different enough to be from multiple providers16:04
corvusthe 23. are rax16:04
corvusthe 166. are rax...16:04
*** v1k0d3n has quit IRC16:04
corvus37. is citycloud16:05
dmsimardWe probably want to tweak net.ipv4.tcp_keepalive_intvl... /proc/sys/net/ipv4/tcp_keepalive_intvl is showing 75 right now16:05
*** v1k0d3n has joined #openstack-infra16:05
*** icey has quit IRC16:05
dmsimardThat's quite a large interval :/16:05
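If we did want to tighten keepalives, an illustrative /etc/sysctl.d fragment would look like this (the values are made up for illustration, not a recommendation from the discussion; the three knobs are the standard kernel tcp_keepalive sysctls):

```ini
# Hypothetical /etc/sysctl.d/60-tcp-keepalive.conf -- lower the 75s
# probe interval dmsimard quotes above; apply with "sysctl --system".
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
```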
*** icey has joined #openstack-infra16:05
*** jaosorior has quit IRC16:06
fungihttps://rackspace.service-now.com/system_status/ still doesn't indicate any known widespread issue16:07
*** armaan has quit IRC16:07
*** armaan has joined #openstack-infra16:08
Shrews86 is citycloud too16:10
Shrews146 rax16:10
clarkbthe way rax does network throttling is to inject resets iirc16:10
*** neiljerram has joined #openstack-infra16:10
clarkbdo the cacti graphs look like we are hitting the limit and getting throttled?16:11
fungiworth noting citycloud has mentioned we're causing issues in their regions, 541307 will disable them in our nodepool configuration for now. we might want to consider whether the citycloud issues are unrelated16:11
Shrewsthe first one, 213, is ovh16:11
fungiclarkb: not seeing reset-type disconnects. more like no response/timeout16:12
neiljerramHi all.  I just mistakenly did 'git review -y' on a long chain of commits, instead of the squashed one that I wanted to push to review.openstack.org.  This can be seen under 'Submitted Together' at https://review.openstack.org/#/c/541365/.  Is there a way that I can abandon all of those mistakenly submitted changes?16:12
corvusthe unreachable errors seem to be happening during the ansible setup phase.  so basically, the first connection is failing, but my subsequent manual telnet connections are succeeding.16:12
fungineiljerram: if you use gertty, you can process-mark them and mass abandon16:13
clarkbfungi: ya and cacti doesn't seem to support that theory either but graph is spotty (I'm guessing because udp packets are being blackholed too)16:13
neiljerramfungi, I don't at the moment.  But if that's the best approach, I'll set that up.16:13
fungineiljerram: or you can script something against one of the gerrit apis (ssh api, rest api) to abandon a list of changes16:13
*** clayg has quit IRC16:13
fungii don't know of any mass review operations available in the gerrit webui16:13
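A minimal sketch of the SSH-API approach fungi mentions, as a dry run: it only prints the commands rather than executing them, the username and change numbers are placeholders, and patchset 1 is assumed in the `<change>,1` argument. `gerrit review --abandon` is the standard Gerrit SSH command.

```shell
# Print (not run) one abandon command per change; remove the leading
# "echo" to actually abandon the changes.
GERRIT_USER=${GERRIT_USER:-someuser}
abandon_changes() {
    for change in "$@"; do
        echo ssh -p 29418 "${GERRIT_USER}@review.openstack.org" \
            gerrit review --abandon --message "'pushed by mistake'" "${change},1"
    done
}
abandon_changes 541339 541340 541341
```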
neiljerramfungi, Or I guess just some manual clicking....16:14
AJaegerneiljerram: 43 changes? Argh ;(16:14
dmsimardneiljerram: depends how long it takes to automate it vs doing it by hand16:14
*** clayg has joined #openstack-infra16:14
neiljerramAJaeger, indeed, and sorry!16:14
*** beisner has quit IRC16:14
clarkbfungi: on the citycloud front we are waiting for the change to merge which is being affected by this outage too right? should we consider merging that more forcefully?16:14
AJaegerneiljerram: unfortunate timing, please abandon quickly...16:14
*** beisner has joined #openstack-infra16:14
fungiclarkb: we can bypass ci with it, i'd support that16:15
neiljerramAJaeger, doing that as quickly as I can right now.16:15
corvusneiljerram: i'll do it16:15
*** andreas_s has quit IRC16:17
*** andreas_s has joined #openstack-infra16:17
neiljerramcorvus, OK, thanks16:17
corvusneiljerram: done (via gertty)16:17
pabelangerwe're less than 1GB (854M) of available memory away from swapping on zuul.o.o too.16:17
*** yamamoto has joined #openstack-infra16:18
corvusi can't think of a next step for the ipv4 connection issues.  i don't know what would cause the first connection to fail but not subsequent ones.16:18
openstackgerritDoug Hellmann proposed openstack-infra/openstack-zuul-jobs master: [WIP] docs sitemap generation automation  https://review.openstack.org/52486216:19
AJaegerneiljerram, corvus looks like some are not abandoned - neiljerram could you double check, please? e.g. https://review.openstack.org/#/c/541339/116:19
neiljerramAJaeger, I'm doing that16:19
*** yamamoto has quit IRC16:19
clarkbfungi: ok I'm going to go ahead and do that so that we can focus on fixing zuul without also worrying about making citycloud happy16:19
neiljerramAJaeger, they're all gone now16:20
neiljerramcorvus, many thanks for your help!16:20
corvusi also don't really know how to confirm that.  it's pretty tricky to find a node that an executor is about to use before it uses it.16:20
*** yamamoto has joined #openstack-infra16:20
corvusneiljerram: np.  you should take a look at gertty anyway :)16:20
neiljerramcorvus, I did in the past, but didn't quite get into it; I'll have another go.16:21
openstackgerritMerged openstack-infra/project-config master: Disable citycloud in nodepool  https://review.openstack.org/54130716:21
*** kopecmartin has quit IRC16:21
corvusneiljerram: let me know (later) what got in your way so i can improve it :)16:22
fungithanks clarkb16:22
clarkbfungi: ^ do you want me to respond to them as well?16:22
*** e0ne has quit IRC16:22
fungiclarkb: i have a half-written reply awaiting merger of the change. i can go ahead and finish it16:22
clarkbfungi: ok16:22
fungionce we see it take effect on the np servers16:23
fungii mean, once it's np-complete? ;)16:23
*** pcichy has joined #openstack-infra16:23
dmsimardcorvus: is it normal to have these stuck git processes on zuul01? http://paste.openstack.org/raw/663572/16:23
corvusdmsimard: that'll be puppet16:23
corvusdmsimard: it's erroneous, couldn't say if it's abnormal.  probably safe to kill.16:23
dmsimardlooks like puppet has successfully updated the repository about an hour ago so I'll go ahead and kill those16:25
*** yamamoto has quit IRC16:25
fungistill killing git processes and waiting for zuul-executor to stop on ze0116:26
*** andreas_s has quit IRC16:26
corvusi think ze03 shows us that if we ifdown the ipv6 interfaces, we'll get things moving again, and more jobs will be accepted.  *except* that a good portion of them (unsure what %) will hit connection errors as soon as they start.  those jobs will be retried 3x.  tbh, i'm not sure whether it's a net win to fix the ipv6 issue.16:27
AJaegerbtw. should we send a #status alert? I sent a #status notice earlier...16:28
dmsimardcorvus, pabelanger: looking at the zuul scheduler memory.. seeing http://paste.openstack.org/raw/663579/ which somewhat reminds me of the ze issues we saw the other day with Shrews about the log spam16:28
* AJaeger will be offline for a bit now, so can't do it16:28
*** rossella_s has quit IRC16:28
corvusdmsimard: log spam?16:28
dmsimardcorvus: let me find the issue and the patch, sec16:29
dmsimardcorvus: but tail -f /var/log/zuul/zuul.log is not pretty right now16:29
*** rossella_s has joined #openstack-infra16:30
corvusthat's a very serious error16:30
*** yamamoto has joined #openstack-infra16:30
Shrewsdmsimard: you're thinking of the repeated request decline by nodepool, i think, which was something totally different16:32
dmsimardcorvus: okay so the nodepool thing was different -- we were seeing this: http://paste.openstack.org/raw/653420/ and Shrews submitted https://review.openstack.org/#/c/537932/ -- he mentioned that https://review.openstack.org/#/c/533372/ may have caused the issue16:32
dmsimardShrews: yeah16:32
corvusokay, i'm no longer looking into executor issues, i'm looking into scheduler-nodepool issues16:33
*** gmann has quit IRC16:33
*** gmann has joined #openstack-infra16:33
dmsimardcorvus: this locked exception is likely contributing to the ram usage we're seeing16:33
corvusdmsimard: in what way?16:33
dmsimardcorvus: instinct, but I'm trying to get data to back that up right now -- the exceptions started happening all of a sudden and at first glance it might correlate with the spike in memory usage http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all16:35
fungior the memory utilization increase and those exceptions could both be symptoms of a common problem. correlation != causation16:37
Shrewsthat, btw, was one of the bugs that made me want to restart the launchers today, but i'm holding off on that to limit changes16:37
dmsimardIt seems we started getting exceptions after a "kazoo.exceptions.NoNodeError"16:38
dmsimardDid we lose zookeeper at some point ?16:38
dmsimardpaste of what seems to be the first occurrence (with extra lines for context) http://paste.openstack.org/raw/663600/16:39
pabelangerdmsimard: should be able to grep kazoo.client in logs to see16:39
fungihow long should i be giving the executor daemon to stop on ze01 before i need to start considering other solutions? it's _still_ starting new git operations and we're coming up on an hour since i issued the service stop16:39
pabelangerbut zookeeper (nodepool.o.o) is still running last I checked this morning16:39
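To illustrate pabelanger's suggestion, grepping the scheduler log for kazoo session-state events looks like this; the sample lines below are fabricated to mirror the messages seen in this incident, while the real file is /var/log/zuul/zuul.log on the scheduler:

```shell
# Build a small sample log and count ZooKeeper session-state lines in it.
log=$(mktemp)
cat > "$log" <<'EOF'
2018-02-06 10:15:02,480 WARNING kazoo.client: Connection dropped: socket connection broken
2018-02-06 10:15:03,112 WARNING kazoo.client: Transition to CONNECTING
2018-02-06 10:16:10,020 INFO kazoo.client: Zookeeper connection established
EOF
hits=$(grep -c 'kazoo.client' "$log")
echo "kazoo session events: $hits"   # 3 in this sample
rm -f "$log"
```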
corvusfungi: if all the other jobs have stopped, you can probably note the current repo it's working on, kill it, then delete that repo.16:40
fungiit's lots of different repos16:40
funginot sure how to tell whether the jobs have stopped on it16:40
fungii'm just looking at ps16:40
corvushrm, should only be like one at a time16:41
Shrewscorvus: theory... one of the launchers was restarted by ianw this morning, and has the pro-active delete solution. If the scheduler is so busy that it doesn't get around to locking its node set immediately, we could be deleting nodes out from under it.16:41
dmsimardpabelanger: /var/log/zookeeper/zookeeper.log dates back to october 2 2017 :/16:42
corvusShrews, dmsimard: there was a zk disconnection event at 10:15:0216:42
*** priteau has quit IRC16:42
*** dsariel has quit IRC16:42
pabelangerfungi: I haven't killed a zuul-executor process myself; if we are not in a rush, maybe just let it shut itself down?16:42
corvusfungi: are you killing git processes as they pop up?16:42
fungii've manually killed hundreds (thousands?) of git operations for countless repos on ze01 over the past hour. right now in the process list i see it performing upload-pack operations for cinder and bifrost16:42
corvusfungi: so if you kill the executor now, just delete the bifrost and cinder repos16:43
corvusfrom /var/lib/zuul/executor-git16:43
fungigot it16:43
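The recovery steps above, parameterized as a sketch (the cache path comes from corvus's comment; run this only while the executor daemon is stopped so it re-clones the repos fresh on the next start):

```shell
# Remove a cached repo from the executor's on-disk git cache.
EXECUTOR_GIT=${EXECUTOR_GIT:-/var/lib/zuul/executor-git}
clean_repo() {
    # :? guards against an empty variable expanding to "rm -rf /..."
    rm -rf "${EXECUTOR_GIT:?}/git.openstack.org/$1"
}
clean_repo openstack/bifrost
clean_repo openstack/cinder
```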
*** hamzy has quit IRC16:43
fungiokay, all that's left is a few dozen ssh-agent processes16:45
fungideleted /var/lib/zuul/executor-git/git.openstack.org/openstack/bifrost and /var/lib/zuul/executor-git/git.openstack.org/openstack/cinder16:45
fungianything else i need to do cleanup-wise before rebooting?16:46
corvusfungi: not that i can think of16:46
fungigoing to issue a poweroff on the instance and then hard reboot it from the nova api16:46
dmsimardcorvus: how did you find that disconnection event ? client side or server side ? not sure how to check on nodepool.o.o, I was trying to find a correlation but there's nothing in syslog/dmesg and /var/log/zookeeper is useless16:46
dmsimardcorvus: your timestamp also doesn't match that kazoo nonode exception16:46
corvusdmsimard: 2018-02-06 10:15:02,480 WARNING kazoo.client: Connection dropped: socket connection broken16:47
corvusscheduler log16:47
dmsimardzk is using a bunch of cpu but doing a strace on the process shows no activity.. just FUTEX_WAIT16:48
dmsimardnot sure if it's doing anything16:49
*** sree has joined #openstack-infra16:49
*** caphrim007 has joined #openstack-infra16:49
clarkbI'm still trying to catch up on the zuulstuff. Sounds like we've decided that going ipv4 only won't actually fix us? do we think the reboot fungi is doing will address networking?16:49
mordredclarkb: maybe16:50
mordredclarkb: I think we're in an 'it's worth a shot' mode on that one16:50
mordredclarkb: but I do not believe we actually understand the underlying issue16:50
dmsimardclarkb: there's different issues going on I think, I'll start a pad.. let's move this to incident ?16:50
corvusShrews, dmsimard: i think i understand the zk errors.  there is a bug when a node request is fulfilled before a zk disconnection but the scheduler processes the event after the disconnection.16:51
fungiclarkb: reboot fungi has _done_ now, but it's as much an experiment as anything. i've definitely seen wonky packet delivery issues in rackspace go away after hard rebooting an instance, though in my opinion this looks more like they have some misbehaving switch gear16:52
*** thingee has quit IRC16:52
fungiand the hard reboot of ze01 indeed doesn't seem to have solved this16:53
fungii'll delete its ipv6 address for the moment16:53
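The workaround being applied, sketched as a dry run: the interface name is an assumption, and the function only prints the `ip -6 addr del` commands so they can be reviewed before being run as root (doing it for real drops active v6 sessions, including your own ssh connection).

```shell
# Print the deletion command for every global-scope v6 address on $IFACE;
# remove the "echo" (and run as root) to actually delete them.
IFACE=${IFACE:-eth0}
drop_v6() {
    ip -6 addr show dev "$IFACE" scope global 2>/dev/null |
        awk '/inet6/ {print $2}' |
        while read -r addr; do
            echo ip -6 addr del "$addr" dev "$IFACE"
        done
}
drop_v6
```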
*** iyamahat has quit IRC16:54
corvusShrews: it's pretty close to the case in 94e95886e2179f4a6aeecad687509bc7b1ab7fd3  i wonder if we just need to extend our testing of that a bit16:54
*** salv-orlando has quit IRC16:54
fungiheh, of course that killed my ssh connection to it ;)16:55
dmsimardI have to step away ~10 minutes16:55
corvusShrews: oh, i see -- that test is only a narrow unit test... we don't have a test of the schedulers response to that situation.  we need a full scale functional test.16:55
*** gfidente has quit IRC16:55
*** salv-orlando has joined #openstack-infra16:55
dmsimardinfra-root: started a pad to follow the different issues we're tracking https://etherpad.openstack.org/p/HRUjBTyabM16:55
*** dsariel has joined #openstack-infra16:57
*** eumel8 has joined #openstack-infra16:59
*** salv-orlando has quit IRC16:59
*** gfidente has joined #openstack-infra17:00
*** gfidente has quit IRC17:00
*** gfidente has joined #openstack-infra17:00
*** ramishra has quit IRC17:01
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Do not delete unused but allocated nodes  https://review.openstack.org/54137517:02
*** gongysh has quit IRC17:02
*** bobh has quit IRC17:02
*** pbourke has quit IRC17:02
Shrewscorvus: i think ^^ will solve the situation i theorized earlier17:02
*** gyee has joined #openstack-infra17:02
corvusShrews: oh yes!  that should definitely be a thing!  :)17:03
*** pbourke has joined #openstack-infra17:04
*** salv-orlando has joined #openstack-infra17:04
Shrewsnow we definitely need a restart since nl03 is running without that  :(17:04
*** slaweq has quit IRC17:05
*** slaweq has joined #openstack-infra17:05
*** derekjhyang has quit IRC17:05
clarkbcorvus: and for the zk connection issue in the scheduler we can't recover from that without a restart?17:06
corvusclarkb: correct17:06
*** gongysh has joined #openstack-infra17:06
corvusit looks a lot like the executor queue is running down to the point where that's going to block us pretty soon.17:07
*** zoli is now known as zoli|gone17:07
*** zoli|gone is now known as zoli17:07
corvusso we should plan on doing a scheduler restart in the next 30m or so17:07
corvuswe know how to deal with the ipv6 issue.  i think the main unanswered question is about the ipv4 connection problems.17:07
pabelangerwhen we switch to zookeeper cluster, we'll have 3 connections to try and use from the scheduler, is that right?17:07
*** rossella_s has quit IRC17:08
*** gongysh has quit IRC17:08
corvuspabelanger: yes it may be more robust in some cases, but if the problem is closer to the scheduler, then it can still go wrong.17:08
corvusdoes anyone have any ideas how to further diagnose the ipv4 issues?17:08
dmsimardinfra-root: AJaeger suggested we send out an update on the situation, does this sound okay ? #status notice We are actively investigating different issues impacting the execution of Zuul jobs but have no ETA for the time being. Please stand by for updates or follow #openstack-infra while we work this out. Thanks.17:09
*** slaweq has quit IRC17:10
pabelangercorvus: understood17:10
dmsimardcorvus: I need to catch up on what we have found out so far about ipv4 and I'll try to look17:10
corvusi'm not a fan of zero-content notices.  would prefer a single alert.  but i think we should send the next thing, whatever it is, after we restart.17:10
dmsimardcorvus: we did not send an alert previously, only a notice17:11
dmsimardcorvus: I'm also not a fan of zero content :(17:11
mordredcorvus: I am stumped on further diagnosing the ipv4 issue17:11
*** fverboso has quit IRC17:11
pabelangerfungi: are there other executors we need to reboot? (looking for something to do)17:11
clarkbthe underlying issue is networking conenctivity17:11
corvusdmsimard: yeah, so let's either send an "alert:still not working" or "notice:maybe things are better" in a little bit.17:11
clarkbI would say something along those lines17:11
fungi pabelanger the reboot seems not to have helped17:11
clarkbzk connection broke beacuse networking (likely)17:12
pabelangerfungi: okay17:12
clarkbze can't talk to review.o.o and test nodes becaues networking (likely)17:12
dmsimardI'll take a few minutes to dig into the connectivity issues now.17:12
clarkbI need to grab caffeine before I get too comfortable in this chair17:12
clarkbwork fuel!17:13
corvusi think the best thing we can do at the moment is: restart scheduler, remove ipv6 addres from all executors and mergers, then monitor for ipv4 ssh errors and hope enough jobs get through to make headway.17:13
dmsimardI'm not even able to connect by SSH to ze01 or ze08 over ipv6, I had to force-resolve it to ipv417:13
corvusalso, file a trouble ticket.17:13
pabelangerfungi: Hmm, we do seem to be accepting new jobs on ze01.o.o now. Is that related to removal of ipv6?17:14
mordredcorvus: I agree with your suggested plan of action17:15
corvusoh, all the executors just changed behavior17:15
fungipabelanger: it should be, yes17:15
fungicorvus: all the executors or just the problem ones? i deleted the ipv6 global scope addresses from all executors which were unable to connect to review.openstack.org over v617:16
corvusthat may suggest that the ipv6 issue is resolved17:16
corvusfungi: oh! i missed that17:16
dmsimardlooks like the builds just started picking up on grafana yeah17:16
corvusfungi: i thought you only did ze0117:16
*** olaph1 has joined #openstack-infra17:17
corvusanyway, we still need to restart scheduler, shall i do that now?17:17
*** iyamahat has joined #openstack-infra17:17
*** olaph has quit IRC17:17
*** jpena is now known as jpena|off17:18
fungiyeah, and you had previously done ze03, i ended up doing 06-09 based on discussion with "unknown purple person" in the etherpad, but i should have mentioned it in here (i probably didn't)17:18
funginow seems like a good time for the scheduler restart, agreed17:18
dmsimardcorvus: assuming we dump queues first ? there's also a good deal of post and periodic jobs still queued17:18
corvusdmsimard: we don't usually save post or periodic17:19
fungiwe've generally considered post and periodic "best effort"17:19
dmsimardI don't suppose losing periodic is that big of a deal but post might, I don't know17:19
fungimost post jobs are idempotent and will be rerun on the next commit to merge to whatever repo17:19
*** rwsu has quit IRC17:19
*** rossella_s has joined #openstack-infra17:19
*** rwsu has joined #openstack-infra17:20
fungiand for the ones which actually present a problem, we can reenqueue the branch tip into post for them later17:20
corvusokay, restarting scheduler now17:20
fungiif they don't have anything new to approve17:20
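For the record, re-enqueueing a branch tip into post later looks roughly like this on the scheduler host (a dry-run sketch: the project and newrev values are placeholders, and the flags follow the zuul v3 `enqueue-ref` client command):

```shell
# Print the enqueue command; drop the "echo" to run it for real on zuul.o.o.
reenqueue_post() {
    project=$1
    newrev=$2
    echo zuul enqueue-ref --tenant openstack --trigger gerrit \
        --pipeline post --project "$project" \
        --ref refs/heads/master --newrev "$newrev"
}
reenqueue_post openstack/example-repo 1234567890abcdef1234567890abcdef12345678
```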
smcginnisAre issues resolved with removing ipv6 and this pending scheduler restart?17:22
*** r-daneel has quit IRC17:22
* smcginnis is trying to discern current status17:22
*** slaweq has joined #openstack-infra17:22
fungiif things seem back on track after the scheduler restart, we could consider stopping the executor on ze01 again, rebooting it again, and then using it as the test system for a ticket with rackspace17:22
*** r-daneel has joined #openstack-infra17:22
fungismcginnis: not resolved, no. possibly worked around, but we're also dealing with recovery from a zookeeper disconnection17:23
smcginnisfungi: ack, thanks17:23
fungiand executors failing to connect to far more job nodes than usual17:23
dmsimard515 *changes* in check queue ? really ?17:26
*** slaweq has quit IRC17:26
fungidmsimard: some had been in check for 9 hours when i looked earlier, so quite the backup17:27
dmsimardI think OVH is not working right now. Time to ready for their nodes is flat http://grafana.openstack.org/dashboard/db/nodepool and seeing a lot of errors on nl04 -- example: http://paste.openstack.org/raw/663687/17:30
dmsimardLooking at this.17:30
pabelangeryes, looks like they are having message queue issues17:31
pabelangerthey do a good job keeping status.ovh.net updated17:31
dmsimardcan we temporarily set OVH to 0 so we don't needlessly hammer them ?17:32
pabelangerno, should be fine17:32
dmsimardpabelanger: that notice is from september 2016 ?17:32
*** bobh has joined #openstack-infra17:32
pabelangeroh, it is... odd17:33
*** sree has quit IRC17:33
pabelangerdmsimard: if you click on cloud box, at top, you can see open tickets17:33
dmsimardnl04 is completely nuts at creating/deleting VMs non stop in the two OVH regions17:33
*** jlabarre has quit IRC17:34
*** sree has joined #openstack-infra17:34
pabelangerdmsimard: yah, eventually it will start working again.17:34
dmsimardwe're not very nice :/17:34
pabelangerwe don't usually turn down clouds until the provider asks. They all know how to contact us17:35
fungiclarkb: i've replied to daniel at citycloud now, after confirming image uploads are stopped and node counts are at 0 in all their regions17:35
pabelangerthat way, we don't need to be constantly updating max-servers17:36
clarkbfungi: I see the email, thanks!17:36
clarkblooks like it was the removal of ipv6 that ended up "fixing" things?17:37
*** jlabarre has joined #openstack-infra17:37
dmsimardpabelanger: maybe we could have a soft disable or something ? a bit like we can enable/disable keep without restarting zuul executor17:37
mordredclarkb: "yes"17:37
dmsimardDid we remove ipv6 on all executors ? or just one ? Everything seemed to start working again all of a sudden17:37
*** dsariel has quit IRC17:37
clarkbdmsimard: pabelanger we can increase the time between api requests17:37
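clarkb's suggestion in config form, illustratively (this assumes the per-provider `rate` setting, the delay in seconds between API calls, is honored by this nodepool version; the provider name and value here are made up):

```yaml
# Hypothetical nodepool.yaml fragment: slow down API calls to a provider
# that is returning errors, instead of disabling it outright.
providers:
  - name: ovh-bhs1
    rate: 2.0   # wait 2s between API operations instead of the default
```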
dmsimardI'm not sure ipv6 has something to do with it17:37
mordreddmsimard: fungi took care of all the ones with broken ipv617:37
dmsimardmordred: the non-broken ones were not really accepting any builds17:38
mordreddmsimard: (disabled ipv6 on them that is)17:38
dmsimardmordred: http://grafana.openstack.org/dashboard/db/zuul-status17:38
pabelangerdmsimard: why? I'd rather not have to toggle nodepool-launcher each time we have an API error. Would mean spending more time monitoring it17:38
*** jbadiapa has joined #openstack-infra17:38
*** gfidente is now known as gfidente|afk17:38
*** sree has quit IRC17:38
*** bobh has quit IRC17:38
mordreddmsimard: that's not surprising though - the executors will only accept builds up until a point - and with the others out of commission, the running executors were well past their max17:38
dmsimardpabelanger: what do you mean why ? the cloud is broken and we're creating/deleting hundreds of VMs a minute, it's not very helpful to the nodepool sponsors imo17:38
pabelangerclarkb: sure, that too17:38
fungiclarkb: for very disappointing definitions of "fixing" anyway17:39
fungidmsimard: see your etherpad for details?17:39
dmsimardfungi: details for what, sorry ? a bit multithreaded17:40
dmsimardfungi: oh, you mean where we removed v617:40
fungidmsimard: you asked whether we removed ipv6 addresses from all executors17:40
fungidetails on which are in the pad17:40
pabelangerdmsimard: sure, providers often don't notify us of outages too. I like the fact that nodepool just keeps doing its thing, until the cloud comes back online17:40
dmsimardfungi: but what I'm saying is that it didn't seem like the "non-ipv6 broken" executors were really doing anything, at least according to the graphs17:41
dmsimardfungi: hmm, I was relying on the available executors graph and none of them were available because they were loaded.. nevermind that17:41
fungidmsimard: depended on the graphs. they were bogged down enough to stop accepting many new jobs17:41
*** bobh has joined #openstack-infra17:42
pabelangerdmsimard: OVH is pretty responsive to emails, we can reach out to them and see if they are aware of the issue. Would prefer we did that before shutting down nodepool-launcher.17:42
dmsimardpabelanger: can we ? both bhs1 and gra1 are currently not working17:43
fungiso http://status.ovh.net/?do=details&id=14073&PHPSESSID=e77302e17361afc822024180e2449e1f doesn't indicate they're aware?17:43
*** mtreinish has quit IRC17:43
pabelangerdmsimard: yes, I suggest we then email OVH the issue we are seeing. I'm unsure if you have their contact, but I can forward it to you if you wanted to reach out.17:44
pabelangerhowever, http://grafana.openstack.org/dashboard/db/nodepool-ovh seems to show nodes coming online now17:45
pabelangerat least for BHS117:45
dmsimardhttp://paste.openstack.org/raw/663704/ is an excerpt of the errors we're seeing for bhs1/gra117:45
*** jbadiapa has quit IRC17:45
openstackgerritPino de Candia proposed openstack-infra/project-config master: Add new project for Tatu (SSH as a Service) Horizon Plugin.  https://review.openstack.org/53765317:46
*** apetrich has quit IRC17:46
pabelangerdmsimard: yah, I usually just email them an example error message and a few UUIDs of nodes. They offer to help debug17:46
pabelangerI'll forward you last emails I sent17:46
*** apetrich has joined #openstack-infra17:46
dmsimardok I'll reach out17:46
dmsimardGoing to log a few things for our roots from other timezones17:47
fungioh, yeah, i'm seeing freshness issues with their status tracking i guess17:47
*** bobh has quit IRC17:47
dmsimard#status log (dmsimard) CityCloud asked us to disable nodepool usage with them until July: https://review.openstack.org/#/c/541307/17:49
corvusfyi: ze01 will behave differently than the others because it will accept jobs at a slightly higher rate17:49
openstackstatusdmsimard: finished logging17:49
pabelangerdmsimard: email forwarded17:49
fungicorvus: out of curiosity, why is ze01 "special" like that?17:50
corvusfungi: a patch landed and it rebooted17:50
Shrewsit goes to 1117:50
fungiahh, got it17:50
fungiyes, this one goes to 1117:50
*** mtreinish has joined #openstack-infra17:50
clarkbcorvus: this is the 4 vs 1 patch from yesterday?17:50
fungii think in earlier troubleshooting jhesketh may also have restarted an executor? before i woke up17:50
*** dsariel has joined #openstack-infra17:50
fungiseeing if i can find it in scrollback17:50
clarkbfungi: I think he said ianw did the restart? and it was 01 too?17:51
funginot rebooted, just restarted the daemon17:51
fungioh, perfect17:51
* fungi cancels scrollback spelunking17:51
corvusah, well that would have been masked by ipv6 issues till now17:51
dmsimard#status log (dmsimard) Different Zuul issues relative to ipv4/ipv6 connectivity, some executors have had their ipv6 removed: https://etherpad.openstack.org/p/HRUjBTyabM17:51
openstackstatusdmsimard: finished logging17:51
*** olaph1 is now known as olaph17:52
clarkbdo we have a ticket in for the ipv6 connectivity issue?17:53
dmsimard#status log (dmsimard) zuul-scheduler issues with zookeeper ( kazoo.exceptions.NoNodeError / Exception: Node is not locked / kazoo.client: Connection dropped: socket connection broken ): https://etherpad.openstack.org/p/HRUjBTyabM17:53
openstackstatusdmsimard: finished logging17:53
*** tbachman has joined #openstack-infra17:53
dmsimard#status log (dmsimard) High nodepool failure rates (500 errors) against OVH BHS1 and GRA1: http://paste.openstack.org/raw/663704/17:53
openstackstatusdmsimard: finished logging17:53
*** panda is now known as panda|off17:54
fungiclarkb: not yet because we need a system we can point to for troubleshooting. i'm proposing we stop the daemon on one of the "problem" executors and then reboot it with its executor service temporarily disabled so it just comes up doing nothing but with its v6 networking correctly configured17:54
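fungi's plan as a command sketch (dry run: the wrapper only prints each step; the `zuul-executor` systemd unit name is an assumption about how the service is managed on these hosts):

```shell
run() { echo "+ $*"; }   # dry-run wrapper; change the body to "$@" to execute
run systemctl stop zuul-executor      # stop accepting jobs, let it wind down
run systemctl disable zuul-executor   # keep it down across the reboot
run reboot                            # come back up idle, with v6 configured
# once the provider ticket is resolved:
run systemctl enable --now zuul-executor
```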
*** bobh has joined #openstack-infra17:54
clarkbfungi: ++17:54
fungiwe could also consider launching some more (replacement) executors17:55
*** iyamahat has quit IRC17:55
*** iyamahat has joined #openstack-infra17:56
dmsimardpabelanger: ack, I'm sending an email to them now.17:56
*** bobh has quit IRC17:59
*** yamamoto has quit IRC17:59
*** yamamoto has joined #openstack-infra18:00
corvusi have a test-case reproduction of the scheduler/zk lock bug18:00
*** tesseract has quit IRC18:00
*** jamesmca_ has quit IRC18:00
*** jamesmcarthur has quit IRC18:01
*** jamesmcarthur has joined #openstack-infra18:01
*** sambetts is now known as sambetts|afk18:01
*** derekh has quit IRC18:02
openstackgerritMerged openstack-infra/zuul master: Enhance github debugging script for apps  https://review.openstack.org/54077418:03
dmsimardsent an email to ovh, have to step away again.18:03
*** jamesmcarthur has quit IRC18:05
*** xarses_ has quit IRC18:06
pabelangerdmsimard: thanks18:06
*** xarses has joined #openstack-infra18:06
fungiwow, monasca-thresh is removing mysql-connector from their java archive (due to license concerns) and relying on the drizzle jdbc? not sure whether that's amusing to former drizzlites: http://lists.openstack.org/pipermail/openstack-dev/2018-February/127027.html18:06
*** salv-orlando has quit IRC18:07
ShrewsDrizzle says "We're not dead yet!"18:07
Shrews"I'm feeling much better"18:07
*** salv-orlando has joined #openstack-infra18:07
mordredthat's truly awesome18:08
*** bobh has joined #openstack-infra18:10
clarkbfungi: should we just go ahead and pick a server and turn off the executor now? I guess when we have a backlog of jobs that isn't the happiest thing to do (but neither is remaining ipv4 only)18:10
fungii was giving things a few more minutes to see what other fallout we might spot18:11
fungiand hoping to garner additional input on that plan18:11
*** salv-orlando has quit IRC18:12
corvuswe may want to restart the other executors to pick up the 1->4 slow start change, it will let them be utilized better.18:12
*** apetrich has quit IRC18:13
*** slaweq has joined #openstack-infra18:13
corvusmaybe if we do that, we'll have a surplus compared to our backlog, and can turn one down with less effect.  at the moment, we have more work than the executors can handle.18:13
*** hamzy has joined #openstack-infra18:13
*** apetrich has joined #openstack-infra18:13
fungii like that idea18:13
corvusstopping the others should be faster than ze01, but still, obviously, not super quick.18:13
corvus(i'm still fixing the scheduler bug, so i'll leave that to others)18:14
fungido we have volunteers for a rolling restart of, say, ze02-ze08 and ze10? and then we can stop ze09 and reboot it with the executor disabled?18:14
pabelangerI can help here in a moment, just getting more coffee18:14
*** hamzy_ has joined #openstack-infra18:15
fungize01, as mentioned, has already been restarted on the new code, and the instances impacted by the ipv6 situation seem to be 01, 03 and 06-0918:15
*** myoung is now known as myoung|food18:15
*** bobh has quit IRC18:16
fungiso the idea would be to restart all executors except 01 (which already got that treatment) and only stop one of the v6-problem-affected ones (like 09) without restarting the service on it again18:16
*** sshnaidm|rover is now known as sshnaidm|bbl18:16
fungialso be aware you probably want ssh -4 if you have working ipv6 connectivity18:17
*** slaweq has quit IRC18:17
fungii need to eat something before the meeting18:17
*** hamzy has quit IRC18:17
fungibut can help with some of it18:17
*** mihalis68 has quit IRC18:20
*** david-lyle has quit IRC18:20
corvusclarkb: launch delayed to 19:20, so we may need to have an agenda item for it18:21
*** jamesmcarthur has joined #openstack-infra18:21
clarkbI think I just saw they pushed it all the way to 3:05 EST which is 20:05 UTC?18:22
clarkbweather needs to cooperate18:22
*** dtantsur|bbl is now known as dtantsur18:22
pabelangerfungi: okay, ready to get started on zuul-executors18:25
pabelangerI'll do ze02.o.o first18:25
*** tosky has quit IRC18:26
pabelangerclarkb: I saw that too18:26
clarkbnot to completely distract from the zuul stuff but I think that sometime yesterday our bandersnatch may have gotten out of sync18:26
clarkbsushy 1.3.1 never made it onto our mirror despite having mirror updates18:26
dansmithI'm guessing it's known that zuul is down, but didn't see a status bot topic change so wanted to confirm?18:28
clarkbdansmith: it should be up now18:28
clarkbthere are networking issues we've worked around by disabling ipv6 that were preventing things from functioning earlier18:29
* tbachman isn’t seeing zuul18:29
dansmithI haven't been able to hit it for like 30 minutes18:29
dansmithjust times out for me18:29
dhellmannhttp://zuul.openstack.org is extremely slow or not responding18:29
AJaegerclarkb: http://zuul.openstack.org/ is not reachable right now18:29
*** ralonsoh has quit IRC18:29
tbachmandansmith: same here18:29
pabelangerI'm having issues SSHing into zuul.o.o18:30
*** bobh has joined #openstack-infra18:30
pabelangerchecking cacti18:30
clarkbpabelanger: I can ssh into it ok18:31
clarkbbut ya http is sad18:31
clarkbzuul-web is running18:31
pabelangerokay, SSH now works18:31
clarkbit is listening on port 9000 and apache is up and configured to proxy to it18:32
pabelangerclarkb: I see some exceptions in web.log, trying to see if related18:34
clarkbhrm why does web-debug.log have less info in it than web.log?18:34
fungiwonder if the restart picked up a new regression18:35
pabelangerze02 back online18:35
pabelangermoving to ze03.o.o18:35
*** harlowja has joined #openstack-infra18:35
clarkblooks like web.log ends with it starting18:35
clarkbweb-debug.log doesn't show anything sad since it started18:35
clarkbif I curl localhost:9000 I get html back18:38
clarkbso the server is up and responding to requests18:38
clarkbnow to see if I can get a status.json from it18:38
fungione other possibility is the sheer hugeness of the status.json may simply be too much18:39
clarkbAH00898: Error reading from remote server returned by /status.json18:39
clarkbya I think apache is unable to get the status.json back18:39
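(Editorial aside: the bisection above — curl zuul-web on :9000 directly, then compare what the Apache reverse proxy returns — reduces to a tiny decision table. A minimal sketch with hypothetical names:)

```python
def locate_fault(backend_ok, proxy_ok):
    """Given whether a direct request to the backend (e.g. zuul-web on
    localhost:9000) and a request through the reverse proxy succeeded,
    name the layer most likely at fault."""
    if not backend_ok:
        return "zuul-web"   # the backend itself isn't answering
    if not proxy_ok:
        return "apache"     # backend answers, proxy doesn't (AH00898 etc.)
    return "ok"
```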
pabelangerwe do have 587 jobs in check18:40
dhellmannit sounds like that api needs a pagination feature :-)18:40
dmsimardWas stepping in momentarily before going away again but the status.json backups might work (or not and exacerbate but iirc I set it up with flock/curl timeout)18:40
Shrewspabelanger: back from eating. can help you out18:41
pabelangerShrews: I'm currently doing ze03 / ze04, if you want to get others. I think we want to keep ze09 as is18:41
clarkb`curl --verbose localhost:9000/openstack/status.json` works and shows a status of 20018:41
Shrewspabelanger: ack. on ze0418:41
Shrewspabelanger: err, ze0518:41
clarkbdata is small enough to fit on my scrollback buffer :P18:42
Shrewspabelanger: just zuul-executor restart, yeah?18:42
fungipabelanger: Shrews: stopping ze09 would be great, but that's the one we want to disable the service on and then reboot so it comes back up with ipv6 addresses18:42
Shrewsfungi: ze09 is all yours  :)18:42
clarkbit's only 2.3MB18:42
*** dougwig has joined #openstack-infra18:43
fungiShrews: pabelanger: i'll refrain from stopping ze09 until the others have been cleanly restarted so we don't have too many out of rotation at a time18:43
*** gmann has quit IRC18:43
Shrewsfungi: to be clear, we're just restarting the zuul-executor process, right?18:43
fungiwhich, as i understand it, means a typical service restart isn't currently possible because the stop takes so long to complete18:44
clarkbworking my way back up the stack those status.json errors were actually 40 minutes or so ago18:44
fungiShrews: so is a stop and then watch the process list until it can be safely started again18:44
fungi(or tail logs, or whatever)18:44
clarkbI do not see requests from my desktop hitting apache18:45
clarkbpossibly this is related to the networking issues we've had?18:45
clarkbI'm ipv4 only here18:45
dhellmannI get "connection reset by peer" http://paste.openstack.org/show/663792/18:45
dhellmannI'm also ipv4-only18:45
*** caphrim007_ has joined #openstack-infra18:46
fungidoesn't appear we're overloading its network: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64796&rra_id=all18:47
clarkbif I make a new connection from my home I see it in a SYN_RECV state on the server18:47
clarkband it's been that way for quite a few seconds now18:48
fungiin fact, that's probably status.json we see dying on the graph at 17:2018:48
clarkbthen it eventually dies client side18:48
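(Editorial aside: a connection stuck server-side in SYN_RECV means the kernel has the handshake in flight but the server never accept()s it — consistent with Apache having no free workers. One way to spot a pileup is to tally states from `ss -tan` output; a minimal sketch over sample text, not live sockets:)

```python
from collections import Counter

def tcp_state_counts(ss_output):
    """Tally TCP states from `ss -tan`-style output (first header line
    skipped).  A growing SYN-RECV count suggests connections complete
    the kernel handshake but are never accept()ed by the server."""
    states = Counter()
    for line in ss_output.strip().splitlines()[1:]:
        fields = line.split()
        if fields:
            states[fields[0]] += 1  # first column is the state
    return states
```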
fungiwhich does correspond with the restart18:48
fungischeduler and web daemons were restarted at 17:22 and 17:23 respectively18:49
* clarkb breaks out the tcpdump18:49
*** caphrim007 has quit IRC18:50
Shrewsstill waiting for ze05. doing ze06 now too18:51
*** suhdood has joined #openstack-infra18:51
openstackgerritMatthieu Huin proposed openstack-infra/nodepool master: [WIP] webapp: add optional admin endpoint  https://review.openstack.org/53631918:54
clarkbaha! [Tue Feb 06 18:55:38.214934 2018] [mpm_event:error] [pid 13325:tid 140299712509824] AH00485: scoreboard is full, not at MaxRequestWorkers I mean I should've checked there earlier but now we know18:55
*** bobh has quit IRC18:55
fungiso apache restart time?18:56
clarkbso tcp connections are getting to apache but apache is saying go away18:56
clarkbshould I go ahead and do that?18:56
* clarkb gives everyone a minute before doing that18:56
fungipossible the inrush of reloads from people's browsers after the restart coupled with the giant status.json caused it, in which case we may see it come right back18:56
fungibut worth a try18:56
Shrewspabelanger: fungi: ze06 restarted. ze05 still ongoing. moving to ze0718:57
dmsimardstatus.json is rendered client-side right ?18:57
fungifor now at least18:58
dmsimardso it's probably getting downloaded just fine but then has to do all that javascript dancing18:58
clarkbdmsimard: no it wasn't downloading anything18:58
tbachmanfwiw, I can hit it now18:58
fungidmsimard: well, no, apache is super unhappy at the moment18:58
clarkbapache was closing incoming tcp connections because it had no workers to put them on18:58
clarkbstatus page is working now after apache restart18:58
fungicool, we should keep an eye on it18:58
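(Editorial aside: the smoking gun was Apache's `AH00485: scoreboard is full` error. Recurrences are easy to watch for by pulling the module and AH code out of error-log lines; a sketch against the standard Apache 2.4 error-log shape, using the exact line found above as the test case:)

```python
import re

# Apache 2.4 error-log shape:
# [timestamp] [module:level] [pid N:tid M] AHxxxxx: message
_ERRLOG = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] "
    r"\[pid [^\]]+\] (?P<code>AH\d+): (?P<msg>.*)")

def parse_apache_error(line):
    """Return a dict of timestamp, module, level, AH code and message
    for an Apache error-log line, or None if it doesn't match."""
    m = _ERRLOG.match(line)
    return m.groupdict() if m else None
```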
clarkbnow we see if that lasts18:58
fungi(i almost said "keep tabs on it" and then realized...)18:58
openstackgerritRuby Loo proposed openstack-infra/project-config master: Add description to update_upper_constraints patches  https://review.openstack.org/54102718:59
dmsimardfungi: that would've been appropriate18:59
dmsimardare we doing an infra meeting ?18:59
fungiin 45 seconds18:59
clarkband now its meeting time and I'm not super prepared :P18:59
cmurphydo we need to recheck?18:59
clarkbcmurphy: you shouldn't need to, you can check now if your changes are in the queues though18:59
* fungi is entirely unprepared, but also not chairing18:59
clarkb(at least it loads for me currently)18:59
fungicmurphy: unless you had failures from earlier (retry_limit, post_failure) and those will need a recheck19:00
dmsimardBecause sometimes we can have good news too, it looks like OVH BHS1 has fully recovered and there's blips showing signs of recovery for GRA1 as well19:00
clarkbI'll keep my tail on the error log open so I'll see if it happens again19:00
clarkband with that /me sorts out meeting19:00
cmurphynope it's in the queue19:00
*** tbachman has left #openstack-infra19:01
*** lucasagomes is now known as lucas-afk19:02
*** suhdood has quit IRC19:02
*** myoung|food is now known as myoung19:03
dmsimardze01 is effectively picking up more builds than the others19:04
dmsimard4.4GB swap :(19:04
Shrewspabelanger: fungi: ze07 restarted. picking up ze08... still waiting for ze0519:04
ianwjust catching up ... i know we've all moved on, but about 8 hours ago AJaeger noticed inap was offline; that led me to noticing that nl03 had dropped out of zookeeper, so restarted that.  then jhesketh restarted ze01, as it was not picking up jobs19:05
ianwand i found the red-herring that graphite.o.o had a full root partition -- which i thought might have led to some of the graph dropouts (bad stat collecting) but actually was unrelated in the end19:06
fungidmsimard: the others should catch up now that they're getting restarted on newer code19:06
pabelangerShrews: sorry, I lost network access. back online now19:06
pabelangerlooking at ze03 / ze04 now19:06
Shrewspabelanger: summary is ze06 and ze07 are done. doing ze08 now. waiting on ze0519:06
dmsimardianw: I summarized some of the things on the infra log https://wiki.openstack.org/wiki/Infrastructure_Status19:06
*** r-daneel_ has joined #openstack-infra19:08
*** david-lyle has joined #openstack-infra19:08
pabelangerShrews: ze03 online, waiting for ze0419:08
*** yamamoto has quit IRC19:08
*** david-lyle has quit IRC19:08
*** dsariel has quit IRC19:08
*** david-lyle has joined #openstack-infra19:09
*** r-daneel has quit IRC19:09
*** r-daneel_ is now known as r-daneel19:09
clarkbdmsimard: yes because it is running more aggressive scheduling code than the others (or was while we restarted things)19:09
clarkbdmsimard: it should stabilize hopefully19:09
ianwdmsimard: interesting ... yeah i couldn't determine any reason why nl03 seemed to detach itself from zookeeper; as mentioned in the etherpad there wasn't any smoking gun logs19:09
*** HeOS has quit IRC19:10
Shrewspabelanger: ze08 restarted. ze05 is taking FOREVER19:10
pabelangerze04 is online, but hasn't accepted any jobs yet19:11
pabelangerunsure why19:11
pabelangerthere we go19:11
SamYaplecan anyone review this bindep patch? it's been sitting 2+ months. I don't know how to get traction on it, but really need the patch19:16
*** Goneri has quit IRC19:19
*** vhosakot_ has joined #openstack-infra19:20
*** vhosakot has quit IRC19:23
*** jpich has quit IRC19:26
Shrewspabelanger: i'm not convinced that ze05 is actually trying to stop. is there a good indicator that it's trying to wind down? maybe number of active ansible-playbook processes or something?19:28
pabelangerShrews: what command did you use?19:28
clarkbShrews: pabelanger I watch ps -elf | grep zuul | wc -l19:28
clarkbthat number should generally trend down over time19:28
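(Editorial aside: clarkb's `ps -elf | grep zuul | wc -l` can be done a little more carefully — the naive pipeline also counts the grep process itself. A pure helper over ps output text, easy to loop on while watching an executor wind down; a sketch:)

```python
def count_processes(ps_output, pattern):
    """Count processes whose ps line mentions `pattern`, skipping the
    grep process itself (which `ps | grep` would otherwise count)."""
    return sum(
        1 for line in ps_output.strip().splitlines()
        if pattern in line and "grep" not in line)
```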
pabelangerShrews: should see socket accepting stopped in logs19:28
*** salv-orlando has joined #openstack-infra19:28
pabelanger2018-02-06 18:35:54,142 DEBUG zuul.CommandSocket: Received b'stop' from socket for example19:29
fungithough there will quite probably be tons of orphaned ssh-agent processes which never get cleaned up19:29
Shrewspabelanger: i actually messed up on 05 and issued a restart before doing a stop19:29
pabelangerthat is on ze03 FWIW19:29
Shrewswhich is why i'm not sure of its state19:29
* AJaeger spent so much time looking at grafana that he doesn't like to see N/A for TripleO anymore - https://review.openstack.org/54116519:30
pabelangerwonder if we should remove restart from init scripts, haven't had much luck with it19:30
Shrewspabelanger: yes, we should  :)19:30
pabelangerShrews: you might be able to issue stop again, if the command socket is still open19:30
Shrewspabelanger: lemme try that19:30
*** salv-orlando has quit IRC19:30
Shrewsthe number of ansible process seem to be dropping faster now19:32
*** slaweq has joined #openstack-infra19:33
*** olaph1 has joined #openstack-infra19:36
dmsimardAJaeger: added a comment on https://review.openstack.org/#/c/541165/19:37
AJaegerdmsimard: which file? http://git.openstack.org/cgit/openstack-infra/project-config/tree/grafana19:38
*** slaweq has quit IRC19:38
AJaegerdmsimard: grafana.openstack.org does not delete removed files - so manual deletion is needed. I think the file you mean is removed from project-config already. If it's still in-tree, tell me and I'll update it19:38
dmsimardAJaeger: oh, you're right the file is already removed19:39
*** olaph has quit IRC19:39
AJaegerdmsimard: if you want to manually remove dead files from grafana - great ;) There're a couple ;(19:39
dmsimardAJaeger: if we can write down somewhere which ones are stale I can eventually look at it19:40
AJaegerdmsimard: I'll make a list...19:40
*** baoli has joined #openstack-infra19:41
ianwok, so re hwe kernel -- yeah afs doesn't build, surprise surprise.  it looks like 1.6.22 is the earliest that supports 4.13 ... so I'll make a custom OpenAFS 1.6.22 pkg i think19:41
*** baoli has quit IRC19:42
*** tosky has joined #openstack-infra19:42
*** jamesmca_ has joined #openstack-infra19:44
dmsimardT-60 minutes19:46
*** peterlisak has quit IRC19:46
*** onovy has quit IRC19:47
fungiianw: i wonder if the 1.8 beta packages in bionic build successfully on xenial ;)19:48
openstackgerritAndreas Jaeger proposed openstack-infra/project-config master: Remove unused grafana/nodepool files  https://review.openstack.org/54142119:49
AJaegerdmsimard: two more unneeded ones ^19:49
ianwfungi: they probably would, at least the arm64 ones do on arm64 xenial ... but i only want to change one thing at a time as much as possible, so i think sticking with 1.6 branch for this test at this point?19:49
pabelangerianw: haven't been following, is fedora-27 ready to use now or still an issue with python?19:51
fungisure, i wasn't suggesting it's something we should run just now19:51
fungihaving newer 1.6 point release packages for all our systems would be good anyway, for previously-discussed reasons19:51
*** HeOS has joined #openstack-infra19:51
AJaegerdmsimard: http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards19:52
*** HeOS has quit IRC19:56
dhellmanndo things seem stable now? is it safe to approve some releases?19:58
clarkbdhellmann: I think they are stable in as much as we've worked around the ipv6 issues. The ipv6 problems haven't been corrected but shouldn't currently affect jobs19:59
dhellmannclarkb : thanks19:59
AJaegerdhellmann: just very heavy load - long backlog20:00
dhellmannsmcginnis : ^^20:00
smcginnisJust read that.20:00
clarkbfungi: "kAFS supports IPv6 which OpenAFS does not; and it implements some of the AuriStorFS services so that IPv6 can be used not only with the Location service but with file services as well."20:00
smcginnisSo we can approve things, it just might take awhile.20:00
*** jtomasek has quit IRC20:00
dhellmannit looks like the aodh and panko releases were approved but not run. maybe remove your W+1 and re-apply it?20:01
corvusianw, fungi, clarkb: i think it was kerberos 5 support (which is needed for the sha256 stuff we use) which kafs was weak on last time i looked.  perhaps that's changed -- but that's a thing to watch out for.20:01
smcginnisdhellmann: Just about to do that. ;)20:01
ianwauristor: ^^ i bet you'd know about that?20:02
*** owalsh has quit IRC20:02
*** slaweq has joined #openstack-infra20:02
clarkbcorvus: https://www.infradead.org/~dhowells/kafs/ looks like it supports kerberos 5 under AF_RXRPC20:02
*** owalsh has joined #openstack-infra20:02
*** amoralej is now known as amoralej|off20:02
*** dtantsur is now known as dtantsur|afk20:03
*** onovy has joined #openstack-infra20:03
*** caphrim007_ has quit IRC20:03
clarkbstill no apache errors in the error log for zuul status so I'm closing my tail of that and making lunch20:03
smcginnisdhellmann: Still not picking up my reapproval. Either things are really slow, or might need you to do it?20:03
*** caphrim007 has joined #openstack-infra20:03
smcginnisdhellmann: Oh, they're stacked up!20:04
dhellmannsigh. we keep telling them not to do that.20:04
*** HeOS has joined #openstack-infra20:06
dmsimardcorvus: I'm trying to see if there would be a way to highlight if a particular executor seems to be failing/timeouting jobs more than the others .. in graphite we have data for jobs but it doesn't link them back to a particular executor. I guess logstash allows me to query by job status and executor but we're much more aggressive in pruning that data out. Am I missing something ?20:06
*** caphrim007 has quit IRC20:07
*** caphrim007 has joined #openstack-infra20:07
*** rfolco|ruck is now known as rfolco|off20:07
*** onovy has quit IRC20:07
dmsimardI understand that job failures/timeouts are not necessarily a result of bad behavior but instead legitimate job/patch issues but it might be worthwhile to see if there's a difference as we start to tweak some knobs here and there20:08
corvusdmsimard: nope.  right now you could grep logs.  if you want to add stats counters to executors that'd be ok20:08
Shrewsi know ze05 seems to be timing out jobs like crazy. waiting for it to shutdown now and it's taking forever b/c the jobs seem to go to timeout20:08
dmsimardcorvus: okay, I'll see if I can figure out a patch.20:08
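(Editorial aside: for the per-executor result counters dmsimard wants, the simplest statsd shape is a counter keyed by sanitized executor hostname and build result. The key layout below is purely illustrative — the real metric names would be whatever the eventual Zuul patch defines:)

```python
def build_result_counter(hostname, result):
    """Construct a statsd counter metric string for one completed build.
    Dots in the hostname are escaped because statsd uses '.' as its
    namespace separator.  Key layout here is a hypothetical example."""
    node = hostname.replace(".", "_")
    return "zuul.executor.%s.builds.%s:1|c" % (node, result.lower())
```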
dmsimardShrews: I'll take a look20:09
dmsimardShrews: absolutely nothing going on in logs, huh20:09
dmsimardlast log entry ~4 minutes ago20:09
dmsimardthat's unusual20:10
Shrewsdmsimard: that's because it's shutting down i think. not accepting new jobs20:10
dmsimardShrews: yeah but even then the ongoing jobs ought to be printing things20:10
Shrewspabelanger: fungi: number of ansible processes on ze05 into single digits now. i'm hoping that means it's almost done20:12
Shrewsit was in the 70's when i started watching it20:12
dmsimardyeah it seems like it's wrapping up the remaining jobs20:12
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging  https://review.openstack.org/54143420:13
*** ldnunes has quit IRC20:16
*** r-daneel_ has joined #openstack-infra20:16
*** wolverineav has quit IRC20:17
*** r-daneel has quit IRC20:17
*** r-daneel_ is now known as r-daneel20:17
*** slaweq has quit IRC20:18
*** wolverineav has joined #openstack-infra20:18
*** slaweq has joined #openstack-infra20:19
*** onovy has joined #openstack-infra20:20
pabelangerShrews: ack20:22
*** Goneri has joined #openstack-infra20:24
*** e0ne has joined #openstack-infra20:24
pabelangerclarkb: spacex stream live now!20:24
*** onovy has quit IRC20:25
smcginnisFor those interested: http://www.spacex.com/webcast20:25
* fungi crosses fingers that the up goer five turns elon's scrap car into an asteroid20:26
clarkbya watching20:26
clarkbI'm sad that heavy gave up on asparagus20:26
*** florianf has quit IRC20:28
*** florianf has joined #openstack-infra20:28
*** florianf has quit IRC20:28
*** jamesdenton has joined #openstack-infra20:29
*** peterlisak has joined #openstack-infra20:32
*** onovy has joined #openstack-infra20:33
*** olaph has joined #openstack-infra20:38
*** olaph1 has quit IRC20:40
*** kjackal has quit IRC20:43
*** dsariel has joined #openstack-infra20:44
mordredclarkb: where do they launch these from?20:45
fungithis one's going up from kennedy20:46
clarkbmordred: same pad saturn 5 used20:46
clarkbbut they are going to mars not the moon20:47
fungiand there go the boosters20:47
openstackgerritMerged openstack-infra/nodepool master: Fix for age calculation on unused nodes  https://review.openstack.org/54128120:48
fungii still don't have the audio going, but... this is a promisingly long time with no explosion20:48
smcginnisLots of cheering.20:48
smcginnisAnd David Bowie.20:49
smcginnisHaha, that was pretty cool.20:50
*** hongbin has joined #openstack-infra20:50
*** gfidente|afk has quit IRC20:51
*** olaph1 has joined #openstack-infra20:52
smcginnisDang that was cool.20:53
Shrewsthat's science well done20:53
clarkbits like synchronized swimming20:53
clarkbbut with fire and rockets and space20:53
fungilooks like they didn't topple this time either?20:54
smcginnisclarkb: Glad I'm not the only one that thought of that. ;)20:54
*** olaph has quit IRC20:54
clarkbfungi: ya boosters appeared to land safely20:54
fungithat was just the first two, right? third is on its way down later?20:54
corvusyeah, it's either landed on the 'of course i still love you' drone ship at sea, or ... not.20:55
fungiokay, the "don't panic" on the dash got me20:57
smcginnisI loved that detail. :)20:57
pabelangerokay, I was jumping up and down in living room with excitement20:57
pabelangerway awesome20:57
*** e0ne has quit IRC21:01
*** camunoz has quit IRC21:02
Shrewspabelanger: i have to afk for a bit. can you watch ze05 and restart when it's done? it looks really close... maybe 1 or 2 jobs21:03
fungipabelanger: Shrews: looks like we're still waiting on ze05 to get started and then ze10 to get stopped/started before i stop ze09?21:03
Shrewsfungi: oh, i didn't know there was a ze1021:03
fungii've confirmed the rest have a /var/run/zuul/executor.pid from today21:03
pabelangerShrews: sure21:03
Shrewspabelanger: thx21:03
pabelangeryah, I haven't done ze10 yet. I can do that now21:04
fungiShrews: yeah, we have 10 total21:04
pabelangerokay, ze10 stopping21:05
*** tiswanso has quit IRC21:06
*** tiswanso has joined #openstack-infra21:07
pabelangerze10 started21:08
AJaegerconfig-core, please consider to review later some grafana updates: Remove tripleo remains from grafana https://review.openstack.org/541165 - and cleanup of unused nodepool providers https://review.openstack.org/54142121:13
fungipabelanger: awesome, so we're just waiting on 05 now?21:13
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging  https://review.openstack.org/54143421:13
mnasercan we approve things now? :>21:16
AJaegermnaser: yes, should be fine AFAIU21:17
*** hemna_ has quit IRC21:17
pabelangerfungi: yah, think we are down to last job running21:17
*** e0ne has joined #openstack-infra21:17
AJaegerfungi, pabelanger http://grafana.openstack.org/dashboard/db/zuul-status shows a constant 19 "starting builds" on ze0521:17
pabelangeronly see 1 ansible-playbook process21:17
AJaegerthat looks strange...21:17
pabelangerAJaeger: expected, statsd has stopped running I believe21:18
clarkbfor 541421 we'll have to manually delete the dashboards once we stop having them in the config21:18
AJaegerclarkb, http://paste.openstack.org/show/663889/ contains list of obsolete grafana boards21:18
clarkbAJaeger: the first list can already be deleted? if so cool I can help get those all cleared out21:19
AJaegerclarkb: yes, the first ones can be deleted already.21:19
AJaegerdmsimard: ^21:19
AJaegerclarkb: just the last two not. and if you delete too many, I hope puppet creates them again ;)21:20
*** tpsilva has quit IRC21:20
* AJaeger waves good night21:21
clarkbAJaeger: it should!21:22
dmsimard(catching up on spacex).. the two side boosters landing simultaneously was awesome.21:25
fungithat actually _is_ rocket science21:26
fungithe rocket surgery comes later, when they have to put it all back together21:26
pabelangerokay, we have no more ansible processes on ze05, but zuul-executor still running21:26
dmsimardIt must be so hard. I mean my day to day job is kinda hard.. but that is some crazy level of hard :)21:26
*** jamesmca_ has quit IRC21:26
pabelangerdo we want to debug or just kill off?21:26
fungipabelanger: nothing else getting appended to the debug log?21:27
*** jamesmca_ has joined #openstack-infra21:27
fungimaybe attempt to trigger a thread dump (though i expect it may get ignored because it's already in the sigterm handler)21:27
pabelangerfungi: no, I think using service zuul-executor restart was the reason we got into this state; I've seen odd things happen if you start another executor while one is already running21:28
pabelangerfungi: k21:28
fungisounds possible21:28
*** salv-orlando has joined #openstack-infra21:30
*** felipemonteiro has joined #openstack-infra21:31
openstackgerritMerged openstack-infra/project-config master: Remove TripleO pipelines from grafana  https://review.openstack.org/54116521:32
pabelangerfungi: okay, I dumped threads, they are in debug logs now21:32
*** threestrands has joined #openstack-infra21:32
*** threestrands has quit IRC21:32
*** threestrands has joined #openstack-infra21:32
pabelangerfungi: I'll kill off processes now and we can inspect logs21:33
fungiafter you kill the processes, may also be good to save a trimmed copy of just the thread dump so we don't have to dig them out of the log later, and then start everything back up again21:33
*** salv-orlando has quit IRC21:33
dmsimardStill no news on the center core, hope it's just a video feed issue :/21:34
fungi2 of 3 recovered is still a pretty good result at this point21:34
pabelangerokay, ze05 started again21:34
fungigiven the team was only about 50% confident the payload would even make it to orbit21:35
pabelangercleaning up copied logs21:35
dmsimardNot arguing otherwise, they often get video feed issues when landing on barges21:35
fungiand considering the number of recovery failures they've had up to now21:35
*** jbadiapa has joined #openstack-infra21:36
dmsimardI bet they're pulling a bunch of telemetry off the car too, it's not just a dumb car :D21:36
openstackgerritMerged openstack-infra/project-config master: gerritbot: Add queens and rocky to ironic IRC notifications  https://review.openstack.org/54098621:37
*** jbadiapa has quit IRC21:40
*** dave-mcc_ has quit IRC21:41
*** ijw has joined #openstack-infra21:42
*** hemna_ has joined #openstack-infra21:42
fungistopping the executor on ze09 now in preparation for using it as a demo/canary for the v6 routing issues we've seen21:44
*** jamesmca_ has quit IRC21:44
*** peterlisak has quit IRC21:44
*** tiswanso has quit IRC21:45
*** onovy has quit IRC21:45
openstackgerritDavid Moreau Simard proposed openstack-infra/zuul master: Add Executor Merger and Ansible execution statsd counters  https://review.openstack.org/54145221:46
*** agopi__ has joined #openstack-infra21:46
ijwIs there something funky with the check jobs?  I've had one sat on the queue for 2 hours, which is highly unusual21:46
*** jamesmca_ has joined #openstack-infra21:47
*** hemna_ has quit IRC21:48
clarkbijw: we had networking problems in the cloud hosting the zuul control plane which led to a large backlog21:48
clarkbijw: we've worked around that but result is large backlog takes time to get through21:48
*** agopi_ has joined #openstack-infra21:49
*** agopi_ is now known as agopi21:49
*** agopi__ has quit IRC21:52
*** tiswanso has joined #openstack-infra21:52
*** tiswanso has quit IRC21:52
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection  https://review.openstack.org/54145421:52
*** eumel8 has quit IRC21:57
fungihrm, still some 26 zuul processes on ze09 15 minutes after stopping21:57
fungimost look like:21:57
fungissh: /var/lib/zuul/builds/ed57db43d883405e9665eac1e1cba797/.ansible/cp/ba8cd30f56 [mux]21:58
fungiare those likely to just hang around indefinitely, or should i wait for them to clear up?21:58
*** ijw has quit IRC21:58
fungithe count hasn't changed in the ~5 minutes i've been checking it21:58
pabelangerthere seems to be a single job for 526171,3 that has been queued for some time in integrated gate. openstack-tox-pep8, unsure why it hasn't got a node yet21:59
fungilooks like the zuul-executor daemons themselves are no longer running, so i assume these are just leaked orphan processes21:59
dmsimardDid we talk about openstack summit presentations at the infra meeting ? I wasn't paying a lot of attention. Deadline is thursday.22:00
fungioh, right these are the ssh persistent sockets which time out after an hour of inactivity or something22:00
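(Editorial aside: those leftover `[mux]` entries are OpenSSH ControlMaster processes holding a ControlPersist socket; they exit on their own once the persist timeout expires. Picking their socket paths out of a process listing is straightforward; a sketch, tested against the exact line seen above:)

```python
def find_ssh_muxes(ps_output):
    """Return control-socket paths of leftover OpenSSH ControlMaster
    ([mux]) processes, e.g. the ones Zuul's ansible runs leave behind
    under /var/lib/zuul/builds/<uuid>/.ansible/cp/."""
    paths = []
    for line in ps_output.splitlines():
        if line.strip().endswith("[mux]") and "ssh:" in line:
            # line looks like: "ssh: /path/to/control/socket [mux]"
            path = line.split("ssh:", 1)[1].rsplit("[mux]", 1)[0].strip()
            paths.append(path)
    return paths
```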
fungidmsimard: clarkb mentioned the cfp submission deadline, yeah22:00
dmsimardI'm writing a draft for a talk about ARA (and Zuul) and another which is about the "life" of a commit from git review to RDO stable release (and a magic black box where the productized version appears)22:01
dmsimardhappy to participate in anything you'd like to throw together, I'm a frequent speaker at local meetups :D22:01
*** shrircha has joined #openstack-infra22:02
*** jamesmca_ has quit IRC22:02
*** vhosakot has joined #openstack-infra22:03
*** shrircha has quit IRC22:03
*** jamesmca_ has joined #openstack-infra22:04
pabelangerhaving issues with paste.o.o22:04
pabelangercorvus: I am seeing that error currently in zuul debug.log22:04
*** salv-orlando has joined #openstack-infra22:05
pabelangerseems to be looping22:05
dmsimardthat's the same error as this morning22:05
fungi`sudo systemctl disable zuul-executor.service` seems to have properly disabled zuul-executor on ze09, so going to reboot it now22:06
pabelangerfungi: ack22:06
*** vhosakot_ has quit IRC22:07
*** ijw has joined #openstack-infra22:07
fungipabelanger: hrm, yeah, paste.o.o is taking its sweet time responding to my browser's requests as well22:07
pabelangerdmsimard: oh, maybe this is fixed by 541454 then. Hopefully corvus will confirm22:08
ijwclarkb: thanks.  I'm in no rush, I just wanted to make sure it wasn't our fault22:08
pabelangerbut, looks like integrated change queue is wedged22:08
*** vhosakot has quit IRC22:09
fungirestarting openstack-paste on paste.o.o seems to have fixed whatever was hanging requests indefinitely22:09
pabelangerfungi: great, thanks22:09
pabelangerI need to step away and help prepare dinner22:10
corvusthat is not the error from this morning.  the error from this morning was http://paste.openstack.org/raw/663579/22:10
*** hemna_ has joined #openstack-infra22:12
*** e0ne has quit IRC22:15
*** openstackgerrit has quit IRC22:16
*** salv-orlando has quit IRC22:17
corvusShrews: you win a cookie -- the error that pabelanger just found is an unused node deletion between assignment and lock22:17
fungiokay, ze09 is back up after a reboot, has no executor service running, has a v6 global address configured again and is still unable to reach the review.openstack.org ipv6 address but can via ipv422:18
fungiworking on a ticket with rackspace now22:18
corvusShrews, pabelanger: node 0002403946 if you're curious22:18
corvuswe can probably just pop that change out of the queue, or ignore it (it's in check)22:19
*** jamesmca_ has quit IRC22:19
ianwok ... i've installed the hwe 4.13 kernel on ze02, and built some custom afs modules (packages in my homedir).  any objections to stopping executor and rebooting?22:22
*** peterlisak has joined #openstack-infra22:22
*** openstackgerrit has joined #openstack-infra22:22
openstackgerritMatt Riedemann proposed openstack-infra/project-config master: Remove legacy-tempest-dsvm-neutron-nova-next-full usage  https://review.openstack.org/54147722:22
clarkbianw: now is probably as good a time as any22:23
openstackgerritMatt Riedemann proposed openstack-infra/openstack-zuul-jobs master: Remove legacy-tempest-dsvm-neutron-nova-next-full job  https://review.openstack.org/54147922:24
fungithrough the magic of tcpdump i've also narrowed the v6 traffic issue down to being unidirectional22:24
openstackgerritMerged openstack-infra/nodepool master: Do not delete unused but allocated nodes  https://review.openstack.org/54137522:24
fungiv6 traffic can _reach_ ze09 from review, but v6 traffic cannot reach review from ze0922:25
*** onovy has joined #openstack-infra22:25
dmsimardany oddities in ip6tables ?22:25
*** hashar has quit IRC22:26
fungiwell, tcpdump should be listening to a bpf on the interface, in which case ip(6)tables doesn't matter22:26
fungido we have an approximate time for when this all started?22:27
*** Goneri has quit IRC22:27
corvusfungi: looks like 06:43 based on graphs22:28
ianwwhat's the current thinking re timing of "zuul-executor graceful"22:28
fungithanks corvus22:28
corvusianw: doesn't work, just use 'zuul-executor stop'22:28
*** rcernin has joined #openstack-infra22:28
corvus(not implemented yet)22:29
ianwahhhhh, that would explain a lot22:29
corvus(also, since retries are automatic, probably almost never worth using)22:29
ianwright, was hoping it cleaned up the builds better or something22:30
corvusstop should be a clean shutdown22:30
corvuseven if it worked, graceful would be the one to use only if you wanted to wait 4 hours for it to stop22:31
*** jamesmca_ has joined #openstack-infra22:33
dmsimardSeeing a sudden increase in ram utilization on zuul.o.o http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all22:36
dmsimardSeems like we're headed for a repeat of this morning's spike22:36
*** jamesmca_ has quit IRC22:37
*** edmondsw has quit IRC22:37
*** edmondsw has joined #openstack-infra22:38
dmsimardcorvus: do we need to land https://review.openstack.org/#/c/541454/ and restart the scheduler again ? /var/log/zuul/zuul.log is getting plenty of kazoo.exceptions.NoNodeError.22:39
*** edmondsw has quit IRC22:42
corvusdmsimard: see above, nonodeerror is not because of the bug in 54145422:42
corvuswe need to restart all launchers with https://review.openstack.org/541375 to fix that error22:42
dmsimardcorvus: ok I'll add that info to the pad, thanks22:43
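(Editor's note: a minimal sketch of the log triage described above — counting the kazoo.exceptions.NoNodeError occurrences in the scheduler log. The log path is from the discussion; the fallback to /dev/null is only so the sketch runs on hosts without that file.)

```shell
# Count NoNodeError tracebacks in the zuul scheduler log.
log=/var/log/zuul/zuul.log
[ -r "$log" ] || log=/dev/null   # fall back so the sketch runs anywhere
# grep -c exits non-zero on zero matches, so guard it
count=$(grep -c 'kazoo.exceptions.NoNodeError' "$log" || true)
echo "NoNodeError occurrences: $count"
```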
ianw#status log ze02.o.o rebooted with xenial 4.13 hwe kernel ... will monitor performance22:43
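(Editor's note: for reference, a sketch of moving a xenial host to the HWE kernel series as done on ze02. The package name is Ubuntu's standard HWE metapackage; the install lines are commented out since they need root, and the post-reboot check is illustrative.)

```shell
# Install the xenial hardware-enablement kernel (4.13 series at the time):
# apt-get update && apt-get install -y linux-generic-hwe-16.04
# ... then reboot ...
# Afterwards, confirm which kernel series is actually running:
kernel="$(uname -r)"
case "$kernel" in
  4.13.*) echo "running xenial HWE kernel: $kernel" ;;
  *)      echo "different kernel series: $kernel" ;;
esac
```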
openstackstatusianw: finished logging22:43
*** tbachman has joined #openstack-infra22:45
* tbachman wonders if others have noticed zuul to be down again22:45
fungi#status log provider ticket 180206-iad-0005440 has been opened to track ipv6 connectivity issues between some hosts in dfw; ze09.openstack.org has its zuul-executor process disabled so it can serve as an example while they investigate22:45
openstackstatusfungi: finished logging22:45
pabelangercorvus: Shrews: that is good news22:46
fungitbachman: meaning the status page for it?22:46
tbachmanfungi: ack22:46
openstackgerritMonty Taylor proposed openstack-infra/zuul master: Rework log streaming to use python logging  https://review.openstack.org/54143422:46
tbachmanit’s up :)22:46
tbachmanlooked down22:46
* tbachman goes and hides in shame22:46
fungimay take a minute to load because the current status.json payload is pretty huge22:46
cmurphyit looked down for me too for a second22:47
dmsimardGetting pulled to dinner, will be back later -- the trend for the zuul.o.o ram usage is not good but I was not able to get to the bottom of it.22:47
dmsimardIf it keeps up, we're going to be maxing ram before long22:47
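(Editor's note: a minimal sketch of the kind of check that could watch for this. The 10GB threshold is an assumption matching the buffer figure discussed in this conversation, and reading /proc/meminfo is Linux-specific.)

```shell
# Warn when available memory on the host drops below a threshold.
threshold_kb=$((10 * 1024 * 1024))   # 10 GB, an assumed headroom value
avail_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
if [ "$avail_kb" -lt "$threshold_kb" ]; then
  echo "WARNING: only ${avail_kb} kB available"
else
  echo "OK: ${avail_kb} kB available"
fi
```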
*** agopi has quit IRC22:47
*** salv-orlando has joined #openstack-infra22:47
ianwso ... would anyone say they know anything about graphite's logs?  do we need the stuff in /var/log/graphite/carbon-cache-a ?22:47
corvusianw: i don't think we need any of that.  pretty much ever.22:48
* jhesketh has skimmed most of the scrollback... are we still having any network issues?22:48
jhesketh(specifically v4)22:49
pabelangerdmsimard: 20GB ram usage was the norm before this morning, leaving a 10GB buffer for zuul.o.o22:49
pabelangerlets see what happens22:49
fungijhesketh: not sure whether the v4 connectivity issues we were seeing between executors and job nodes have persisted22:50
pabelangerShrews: corvus: I can start launcher restarts in 30mins or hold off until the morning, whatever is good for everybody22:50
auristorianw, fungi, clarkb, corvus: the rx security class supported by kafs and openafs only supports fcrypt (weaker than 56-bit DES) wire integrity protection and encryption.  In order to produce rxkad_k5+kdf tokens to support Kerberos v5 AES256-CTS-HMAC-SHA-1 enctypes for authentication you need a proper version of aklog and supporting libraries.22:52
fungiauristor: thanks for the update. that's definitely useful info22:53
* jhesketh nods22:53
*** myoung is now known as myoung|off22:53
auristorianw, fungi, clarkb, corvus: AuriStorFS supports the yfs-rxgk security class which uses GSS-API (Krb5 only at the moment) for auth and AES256-CTS-HMAC-SHA1-96.   The support for AES256-CTS-HMAC-SHA256-384 was added to Heimdal Kerberos and once it is in MIT Kerberos we will enable it in yfs-rxgk.22:54
auristoryfs-rxgk will be added to kAFS this year22:55
fungithat's awesome news22:55
pabelangercorvus: I'm showing 526171 in gate (and blocking jobs), I'm not sure we can ignore it. Mind confirming, and I'll abandon / restore.22:56
auristorAuriStorFS servers, clients and all admin tooling support IPv6.  clients, fileservers and admin tools can be IPv6 only.   ubik servers have to be assigned unique IPv4 addresses but they don't have to be reachable via IPv422:56
corvuspabelanger: yep, go ahead and pop it22:57
*** ijw has quit IRC22:57
corvusis someone restarting the launchers, or should i do that now?22:58
pabelangercorvus: go for it22:59
corvusauristor: rxgk hasn't made it into openafs yet, right?22:59
openstackgerritClark Boylan proposed openstack-infra/zuul master: Make timeout value apply to entire job  https://review.openstack.org/54148523:02
corvus#status log all nodepool launchers restarted to pick up https://review.openstack.org/54137523:02
openstackstatuscorvus: finished logging23:02
auristorcorvus: yfs-rxgk != rxgk.   and no rxgk is not in OpenAFS.   There is a long history and I would be happy to share if you are interested.  But after Feb 2012 the openafs leadership concluded that we couldn't accomplish our goals for the technology within openafs and created a new suite of rx services that mirror the AFS3 architecture that we could rapidly extend and innovate upon.23:02
openstackgerritClark Boylan proposed openstack-infra/zuul master: Sync when doing disk accountant testing  https://review.openstack.org/54148623:03
*** slaweq has quit IRC23:04
corvusauristor: ah i didn't pick up that yfs-rxgk!=rxgk, thanks23:04
auristorcorvus: think of AuriStorFS and now kafs as supporting two different file system protocols, afs3 and auristorfs.  the protocol that is selected depends on the behavior of the peer at the rx layer.23:06
Shrewscorvus: pabelanger: yay! i iz teh winner!23:06
*** felipemonteiro has quit IRC23:06
Shrewscorvus: pabelanger: did the fix to nodepool merge?23:07
*** hemna_ has quit IRC23:07
corvusShrews: yep, should be in prod now23:07
*** felipemonteiro has joined #openstack-infra23:07
Shrewscorvus: awesome. sorry i had to afk for a bit23:07
auristorcorvus: being dual-headed permits maximum compatibility.   it was a major goal of AuriStorFS to provide a zero flag day upgrade and zero data loss23:07
pabelangerokay, 526171 finally popped from gate pipeline23:08
*** wolverineav has quit IRC23:12
*** r-daneel has quit IRC23:12
*** r-daneel has joined #openstack-infra23:13
*** wolverineav has joined #openstack-infra23:13
*** aeng has joined #openstack-infra23:16
*** wolverineav has quit IRC23:17
*** felipemonteiro has quit IRC23:18
openstackgerritClark Boylan proposed openstack-infra/zuul master: Use nested tempfile fixture for cleanups  https://review.openstack.org/54148723:20
*** sshnaidm|bbl has quit IRC23:21
*** M4g1c5t0rM has joined #openstack-infra23:26
*** armaan has quit IRC23:27
*** rossella_s has quit IRC23:28
*** rossella_s has joined #openstack-infra23:29
*** slaweq has joined #openstack-infra23:31
*** M4g1c5t0rM has quit IRC23:31
openstackgerritIan Wienand proposed openstack-infra/puppet-graphite master: Fix up log rotation  https://review.openstack.org/54148823:34
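(Editor's note: the fix in review 541488 lands via puppet-graphite, but the underlying logrotate stanza would look roughly like this sketch. The log path is taken from the discussion above; the rotation counts and options are assumptions, not the actual content of the review.)

```
# Rotate carbon-cache logs; copytruncate since carbon keeps files open
/var/log/graphite/carbon-cache-a/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
```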
*** slaweq has quit IRC23:37
openstackgerritMerged openstack-infra/project-config master: Add ansible-role-k8s-cinder to zuul.d  https://review.openstack.org/53460823:38
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Fix stuck node requests across ZK reconnection  https://review.openstack.org/54145423:40
*** jcoufal has quit IRC23:42
ianwhttp://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64155&rra_id=0&view_type=tree&graph_start=1517874206&graph_end=1517960606 <-- is this so choppy because of network, or something else?23:47
clarkbianw: ya I think cacti is using the AAAA records to get snmp and it's not reliable on some of our executors23:48
*** edgewing_ has joined #openstack-infra23:48
clarkbianw: also we've just turned it off at this point by removing the ip address config for ipv6 on the executors23:49
*** caphrim007_ has joined #openstack-infra23:49
corvusthat's been choppy for a while though.  i think there's something unique about that host.  i don't know what.23:50
*** dingyichen has joined #openstack-infra23:50
clarkbI'm going to look at cleaning up grafana now23:50
* clarkb reads grafana api docs23:51
*** caphrim007 has quit IRC23:52
*** caphrim007_ has quit IRC23:53
*** stakeda has joined #openstack-infra23:54
clarkbok, the first portion of AJaeger's list is cleaned up in grafana; looks like we are still waiting for the second portion's change to merge?23:58
pabelangermgagne: it looks like we might have a quota mismatch again in inap. Do you mind confirming when you have a moment23:59

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!