ianw | let me bring it up... | 00:00 |
---|---|---|
mnaser | corvus: oops that's probably a horizon bug that should be fixed | 00:00 |
*** wolverineav has joined #openstack-infra | 00:04 | |
*** slaweq has quit IRC | 00:05 | |
ianw | clarkb: root@38.145.34.8 ... that is a f29 vm from the .qcow2 of a failed job that i copied off a held node. it never got network during the job, i moved it and it booted just fine :/ | 00:05 |
ianw | then i tunneled port 6080 into the held node, got into devstack's horizon, broke into the bootloader and reset the root password, rebooted it and ... it was alive. so couldn't even replicate it in the same place it failed | 00:06 |
clarkb | ianw: did you delete the interface file before rebooting in that case? | 00:07 |
clarkb | chances are glean got it configured enough that on next startup it worked? | 00:07 |
clarkb | looking at this f29 instance the glean unit file is a bit different than the one in gerrit | 00:08 |
clarkb | it doesn't have the before network manager service thing | 00:08 |
ianw | yeah, this was just before i switched it to the local-fs.target -- which made it work in CI | 00:08 |
ianw | unfortunately, the logs aren't recorded because i was watching it, and when it failed i uploaded a new change so zuul canceled it | 00:10 |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.2 https://review.openstack.org/617218 | 00:13 |
clarkb | ianw: maybe this is something. systemctl list-units shows glean@lo.service but not glean@ens3.service on that f29 instance | 00:13 |
*** mriedem has quit IRC | 00:13 | |
clarkb | ianw: is it possible glean isn't managing this interface at all? | 00:13 |
ianw | hrm, it's in nmcli ... | 00:14 |
clarkb | it certainly seems like there is an ifcfg-ens3 file that looks like it came from glean though | 00:14 |
ianw | oh, hrm i've rebooted this, it's second boot so it would have skipped it | 00:15 |
clarkb | oh ya | 00:15 |
ianw | can delete the file and reboot | 00:15 |
clarkb | lets do that, I think it will help to better understand what it is doing on boot. Do you also want to update the unit to match the current version? | 00:15 |
clarkb | (you can reboot it under me, I don't have anything running on that connection I care about currently) | 00:15 |
ianw | ok, let me try | 00:17 |
*** kjackal has joined #openstack-infra | 00:17 | |
clarkb | should glean configure an lo ifcfg too for consistency? (I don't actually know, maybe linux just does that for us?) | 00:19 |
ianw | i think we skip it on purpose, but not sure | 00:20 |
ianw | rebooting. i've left the defaultdependencies=no in, for now | 00:20 |
ianw | it's going to be a while before the gate reports if it's happier with that | 00:20 |
clarkb | I think we skip it in glean ya, but that means we end up with the unit running in systemd (not a huge deal but we don't run the unit for everything else after first boot) | 00:20 |
ianw | without that | 00:20 |
ianw | yeah, ens3 service there now | 00:21 |
ianw | hrm, the dot file from systemd-analyze creates a 21mb png | 00:24 |
*** hamerins has quit IRC | 00:24 | |
clarkb | it made a 6.3kb svg for me doing systemd-analyze dot glean@ens3.service | dot -Tsvg > systemd.svg | 00:25 |
clarkb | did you run it without specifying a unit? that must be for all the things? | 00:25 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add kube config to nodepool servers https://review.openstack.org/620755 | 00:25 |
clarkb | in any case it seems to confirm that NetworkManager happens after glean@ens3 | 00:25 |
clarkb | and journalctl -u glean@ens3 -u NetworkManager seems to confirm as well | 00:26 |
*** hamerins has joined #openstack-infra | 00:27 | |
ianw | yeah, https://imgur.com/a/2VBZ11G is a more restricted set | 00:28 |
clarkb | http://paste.openstack.org/show/736341/ logging with very precise timestamps | 00:28 |
*** hamerins has quit IRC | 00:30 | |
clarkb | -rw-r--r--. 1 root root 134 2018-11-29 00:20:04.439000000 +0000 ifcfg-ens3 | 00:30 |
clarkb | that lines up with when selinux was restored. Further evidence it isn't likely a sync/flush | 00:30 |
openstackgerrit | James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider https://review.openstack.org/620756 | 00:31 |
*** kjackal has quit IRC | 00:32 | |
ianw | i guess the big difference is that during the job it's a binary translated nested vm | 00:32 |
clarkb | ianw: if it continues to not work, it would be curious if adding a sleep(5) after the selinux restore makes it work. Like maybe we just need to go slower because qemu | 00:33 |
corvus | clarkb: can you +3 https://review.openstack.org/620704 and https://review.openstack.org/620646 ? | 00:34 |
clarkb | corvus: yes | 00:35 |
ianw | clarkb: yep, good idea. | 00:36 |
ianw | clarkb: to summarise -- setting "After= & Wants= local-fs.target" empirically works, but is theoretically wrong. setting "Before=NetworkManager.service network-pre.target" is theoretically right, but empirically does not work in the gate | 00:39 |
ianw | currently i'm testing before=networkmanager but dropping "defaultdependencies=no" (which we've just always had) to see if that makes a difference | 00:40 |
ianw | if not, i'll try again with a sleep() and sync() in glean to see if it's some sort of qemu race in the gate between getting the file out and starting networkmanager | 00:41 |
ianw | if not that, we'll just go back to local-fs.target and call it a day i guess | 00:41 |
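For readers following along, here is a sketch of the two unit-file orderings being compared above (the glean@.service template name is from the discussion; the exact production unit file may differ):

```ini
# Variant 1: order after local filesystems -- empirically works in CI,
# but theoretically wrong (nothing guarantees it runs before the
# network comes up):
[Unit]
After=local-fs.target
Wants=local-fs.target
DefaultDependencies=no

# Variant 2: order before NetworkManager -- theoretically right
# (glean writes the ifcfg files before NetworkManager reads them),
# but was failing in the gate:
[Unit]
Before=NetworkManager.service network-pre.target
DefaultDependencies=no
```

`systemd-analyze dot glean@ens3.service` (as used earlier in the log) is one way to verify which ordering actually took effect on a booted node.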
clarkb | ianw: note that case seems to matter so its NetworkManager not networkmanager | 00:41 |
ianw | yep, it's using the CamelCase name in the system files | 00:42 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Nodepool group no longer hosts zookeeper https://review.openstack.org/620760 | 00:47 |
clarkb | corvus: fyi ^ is a cleanup I noticed when reviewing your change | 00:48 |
*** tosky has quit IRC | 00:50 | |
*** jistr has quit IRC | 01:00 | |
*** jistr has joined #openstack-infra | 01:01 | |
clarkb | ianw: alright I've still got nothing. I am going to go rake leaves and take a better look at my retaining wall that fell over. Here's to hoping there is understanding when I'm back tomorrow :) | 01:05 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource https://review.openstack.org/570668 | 01:07 |
*** markvoelker has joined #openstack-infra | 01:23 | |
*** markvoelker has quit IRC | 01:28 | |
ianw | clarkb: yeah, me either :) i'm on a half-day today so will check it all out later | 01:30 |
*** dklyle has joined #openstack-infra | 01:32 | |
*** sthussey has quit IRC | 01:35 | |
openstackgerrit | Merged openstack-infra/project-config master: Add opendev-website project to Zuul https://review.openstack.org/620704 | 01:48 |
*** gyee has quit IRC | 01:54 | |
*** hamerins has joined #openstack-infra | 02:00 | |
*** mgutehall has quit IRC | 02:19 | |
*** wolverineav has quit IRC | 02:21 | |
*** mrsoul has joined #openstack-infra | 02:21 | |
*** wolverineav has joined #openstack-infra | 02:24 | |
*** hongbin has joined #openstack-infra | 02:28 | |
*** wolverineav has quit IRC | 02:28 | |
*** markvoelker has joined #openstack-infra | 02:30 | |
*** hamerins has quit IRC | 02:33 | |
*** sthussey has joined #openstack-infra | 02:36 | |
*** bhavikdbavishi has joined #openstack-infra | 02:42 | |
*** ramishra has joined #openstack-infra | 02:53 | |
*** rcernin has quit IRC | 02:58 | |
*** jamesmcarthur has joined #openstack-infra | 03:00 | |
*** jamesmcarthur has quit IRC | 03:04 | |
*** ramishra has quit IRC | 03:04 | |
*** ramishra has joined #openstack-infra | 03:06 | |
*** diablo_rojo has quit IRC | 03:11 | |
*** jamesmcarthur has joined #openstack-infra | 03:14 | |
*** apetrich has quit IRC | 03:16 | |
*** rlandy|bbl is now known as rlandy | 03:18 | |
*** hamerins has joined #openstack-infra | 03:20 | |
*** jamesmcarthur has quit IRC | 03:30 | |
*** hamerins has quit IRC | 03:34 | |
*** bobh has joined #openstack-infra | 03:35 | |
*** rcernin has joined #openstack-infra | 03:38 | |
*** bobh has quit IRC | 03:41 | |
*** EmLOveAnh has joined #openstack-infra | 03:42 | |
*** roman_g has quit IRC | 03:49 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix unit tests for elements https://review.openstack.org/619387 | 03:53 |
*** EmLOveAnh has quit IRC | 03:54 | |
*** dklyle has quit IRC | 03:58 | |
openstackgerrit | Merged openstack-infra/zuul master: Remove uneeded if statement https://review.openstack.org/617984 | 04:02 |
*** diablo_rojo has joined #openstack-infra | 04:17 | |
*** jamesmcarthur has joined #openstack-infra | 04:20 | |
*** armax has quit IRC | 04:24 | |
*** jamesmcarthur has quit IRC | 04:25 | |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver https://review.openstack.org/535558 | 04:29 |
*** hongbin has quit IRC | 04:34 | |
*** dave-mccowan has quit IRC | 04:34 | |
*** dave-mccowan has joined #openstack-infra | 04:35 | |
*** auristor has quit IRC | 04:38 | |
*** eernst has joined #openstack-infra | 04:45 | |
*** auristor has joined #openstack-infra | 04:45 | |
*** dave-mccowan has quit IRC | 04:49 | |
*** auristor has quit IRC | 04:50 | |
*** auristor has joined #openstack-infra | 04:53 | |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 04:54 |
*** eernst has quit IRC | 04:55 | |
openstackgerrit | Merged openstack-infra/project-config master: Add opendev-website jobs https://review.openstack.org/620646 | 05:00 |
*** yboaron_ has joined #openstack-infra | 05:01 | |
*** sthussey has quit IRC | 05:05 | |
*** udesale has joined #openstack-infra | 05:06 | |
*** janki has joined #openstack-infra | 05:09 | |
*** bhavikdbavishi has quit IRC | 05:18 | |
*** bhavikdbavishi has joined #openstack-infra | 05:18 | |
*** ykarel|away has joined #openstack-infra | 05:20 | |
*** bhavikdbavishi has quit IRC | 05:30 | |
*** ykarel|away has quit IRC | 06:02 | |
*** bhavikdbavishi has joined #openstack-infra | 06:12 | |
*** mgutehall has joined #openstack-infra | 06:14 | |
*** ykarel|away has joined #openstack-infra | 06:16 | |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support https://review.openstack.org/618964 | 06:19 |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems https://review.openstack.org/620420 | 06:19 |
*** ykarel|away is now known as ykarel|lunch | 06:20 | |
*** ccamacho has quit IRC | 06:21 | |
*** armax has joined #openstack-infra | 06:44 | |
*** kjackal has joined #openstack-infra | 06:56 | |
*** bhavikdbavishi has quit IRC | 07:01 | |
*** ahosam has joined #openstack-infra | 07:02 | |
*** slaweq has joined #openstack-infra | 07:03 | |
*** quiquell|off is now known as quiquell | 07:07 | |
*** e0ne has joined #openstack-infra | 07:10 | |
*** ccamacho has joined #openstack-infra | 07:12 | |
*** chkumar|away has quit IRC | 07:13 | |
*** chandan_kumar has joined #openstack-infra | 07:15 | |
*** rascasoft has quit IRC | 07:25 | |
*** apetrich has joined #openstack-infra | 07:26 | |
*** ahosam has quit IRC | 07:26 | |
*** ahosam has joined #openstack-infra | 07:26 | |
*** armax has quit IRC | 07:27 | |
*** aojea has joined #openstack-infra | 07:32 | |
*** dpawlik has joined #openstack-infra | 07:32 | |
*** rcernin has quit IRC | 07:35 | |
*** ccamacho has quit IRC | 07:39 | |
*** rascasoft has joined #openstack-infra | 07:40 | |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 07:53 |
*** florianf|afk is now known as florianf | 07:57 | |
*** ralonsoh has joined #openstack-infra | 07:58 | |
*** ccamacho has joined #openstack-infra | 08:06 | |
*** ccamacho has quit IRC | 08:07 | |
*** ccamacho has joined #openstack-infra | 08:08 | |
*** ginopc has joined #openstack-infra | 08:11 | |
*** quiquell is now known as quiquell|brb | 08:13 | |
*** roman_g has joined #openstack-infra | 08:18 | |
*** bhavikdbavishi has joined #openstack-infra | 08:18 | |
*** olivierbourdon38 has joined #openstack-infra | 08:27 | |
*** diablo_rojo has quit IRC | 08:29 | |
*** jpena|off is now known as jpena | 08:31 | |
*** ykarel|lunch is now known as ykarel | 08:31 | |
*** takamatsu has quit IRC | 08:31 | |
*** dims has quit IRC | 08:32 | |
*** dims has joined #openstack-infra | 08:33 | |
*** quiquell|brb is now known as quiquell | 08:37 | |
*** bhavikdbavishi1 has joined #openstack-infra | 08:37 | |
*** bhavikdbavishi has quit IRC | 08:38 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 08:38 | |
*** ginopc has quit IRC | 08:39 | |
*** ginopc has joined #openstack-infra | 08:39 | |
*** pcaruana has joined #openstack-infra | 08:48 | |
*** markvoelker has quit IRC | 08:50 | |
*** apetrich has quit IRC | 08:52 | |
openstackgerrit | Merged openstack-infra/irc-meetings master: Update Meeting Chairs https://review.openstack.org/620685 | 08:53 |
*** rossella_s has joined #openstack-infra | 08:54 | |
openstackgerrit | Merged openstack-infra/irc-meetings master: rpm-packaging: Adjust meeting time https://review.openstack.org/620616 | 08:57 |
*** tosky has joined #openstack-infra | 09:03 | |
*** shardy has joined #openstack-infra | 09:04 | |
*** gfidente has joined #openstack-infra | 09:06 | |
*** ahosam has quit IRC | 09:10 | |
*** ahosam has joined #openstack-infra | 09:10 | |
*** takamatsu has joined #openstack-infra | 09:13 | |
*** shardy has quit IRC | 09:13 | |
*** jpich has joined #openstack-infra | 09:15 | |
*** derekh has joined #openstack-infra | 09:22 | |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 09:25 |
*** kaiokmo has quit IRC | 09:38 | |
*** apetrich has joined #openstack-infra | 09:43 | |
*** ssbarnea has quit IRC | 09:45 | |
*** ssbarnea|rover has joined #openstack-infra | 09:45 | |
*** yboaron_ has quit IRC | 09:46 | |
*** yboaron_ has joined #openstack-infra | 09:46 | |
openstackgerrit | Merged openstack-infra/nodepool master: Update node request during locking https://review.openstack.org/618807 | 09:50 |
*** shardy has joined #openstack-infra | 09:51 | |
*** markvoelker has joined #openstack-infra | 09:51 | |
*** rossella_s has quit IRC | 09:52 | |
*** bhavikdbavishi has quit IRC | 09:55 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache of nodes https://review.openstack.org/619025 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Add second level cache to node requests https://review.openstack.org/619069 | 09:59 |
openstackgerrit | Merged openstack-infra/nodepool master: Only setup zNode caches in launcher https://review.openstack.org/619440 | 09:59 |
*** jchhatbar has joined #openstack-infra | 10:02 | |
*** janki has quit IRC | 10:05 | |
*** ahosam has quit IRC | 10:06 | |
*** kashyap has left #openstack-infra | 10:10 | |
ianw | clarkb: http://logs.openstack.org/71/618671/14/experimental/nodepool-functional-py35-redhat-src/0067296/controller/logs/screen-nodepool-launcher.txt.gz#_Nov_29_08_02_15_155672 | 10:23 |
ianw | http://paste.openstack.org/show/736383/ | 10:23 |
ianw | even with two syncs() and a pause something still goes wrong | 10:23 |
openstackgerrit | BenoƮt Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs https://review.openstack.org/613261 | 10:23 |
*** markvoelker has quit IRC | 10:24 | |
*** udesale has quit IRC | 10:27 | |
*** ahosam has joined #openstack-infra | 10:31 | |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support https://review.openstack.org/618964 | 10:31 |
openstackgerrit | Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems https://review.openstack.org/620420 | 10:31 |
*** AJaeger_ has quit IRC | 10:32 | |
*** rossella_s has joined #openstack-infra | 10:36 | |
*** jchhatba_ has joined #openstack-infra | 10:37 | |
*** lpetrut has joined #openstack-infra | 10:38 | |
*** jchhatbar has quit IRC | 10:39 | |
*** gfidente has quit IRC | 10:39 | |
*** mgutehall has quit IRC | 10:40 | |
*** shardy has quit IRC | 10:41 | |
*** yamamoto has quit IRC | 10:43 | |
*** shardy has joined #openstack-infra | 10:43 | |
*** rossella_s has quit IRC | 10:49 | |
*** rossella_s has joined #openstack-infra | 10:53 | |
*** mgutehall has joined #openstack-infra | 10:56 | |
*** electrofelix has joined #openstack-infra | 11:04 | |
*** AJaeger_ has joined #openstack-infra | 11:13 | |
*** markvoelker has joined #openstack-infra | 11:21 | |
*** takamatsu has quit IRC | 11:25 | |
*** jchhatbar has joined #openstack-infra | 11:30 | |
*** takamatsu has joined #openstack-infra | 11:31 | |
*** jchhatba_ has quit IRC | 11:33 | |
*** rossella_s has quit IRC | 11:39 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix a typo in the help message of disk-image-create https://review.openstack.org/619679 | 11:42 |
*** jchhatbar has quit IRC | 11:43 | |
*** ahosam has quit IRC | 11:45 | |
*** markvoelker has quit IRC | 11:55 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 11:57 |
*** jpena is now known as jpena|lunch | 11:59 | |
*** jamesmcarthur has joined #openstack-infra | 12:00 | |
*** jamesmcarthur has quit IRC | 12:04 | |
*** takamatsu has quit IRC | 12:06 | |
*** yamamoto has joined #openstack-infra | 12:08 | |
*** xek_ has joined #openstack-infra | 12:12 | |
*** gfidente has joined #openstack-infra | 12:12 | |
*** tpsilva has joined #openstack-infra | 12:15 | |
*** dave-mccowan has joined #openstack-infra | 12:16 | |
*** jcoufal has joined #openstack-infra | 12:19 | |
*** xek_ has quit IRC | 12:21 | |
chandan_kumar | odyssey4me: Hello | 12:23 |
chandan_kumar | odyssey4me: https://review.openstack.org/#/c/620800/ and https://review.openstack.org/#/c/619986/4 both do different tasks | 12:24 |
chandan_kumar | odyssey4me: I am not getting how they are similar | 12:24 |
chandan_kumar | odyssey4me: need help here | 12:24 |
chandan_kumar | sorry wrong channel | 12:24 |
*** jcoufal has quit IRC | 12:33 | |
*** rh-jelabarre has joined #openstack-infra | 12:37 | |
*** panda|pto is now known as panda | 12:41 | |
*** sshnaidm|afk is now known as sshnaidm | 12:51 | |
*** markvoelker has joined #openstack-infra | 12:52 | |
openstackgerrit | Erno Kuvaja proposed openstack-infra/project-config master: Add Review Priority column to glance repos https://review.openstack.org/620904 | 12:53 |
*** e0ne has quit IRC | 12:53 | |
*** boden has joined #openstack-infra | 12:55 | |
*** shardy has quit IRC | 13:03 | |
*** shardy has joined #openstack-infra | 13:10 | |
*** rlandy has joined #openstack-infra | 13:11 | |
*** rossella_s has joined #openstack-infra | 13:12 | |
*** kgiusti has joined #openstack-infra | 13:14 | |
*** yamamoto has quit IRC | 13:19 | |
*** yamamoto has joined #openstack-infra | 13:19 | |
*** markvoelker has quit IRC | 13:24 | |
*** rh-jelabarre has quit IRC | 13:27 | |
*** jpena|lunch is now known as jpena | 13:31 | |
*** kaiokmo has joined #openstack-infra | 13:31 | |
*** e0ne has joined #openstack-infra | 13:31 | |
Dobroslaw | Hello again zuul masters | 13:32 |
Dobroslaw | what I want: in the `release` step, create a docker image with a tag containing the release version | 13:32 |
Dobroslaw | question: does zuul create some env variable on the machine when creating a new release so that I could catch it with a bash script? | 13:32 |
Dobroslaw | or is there any other way for getting this value? | 13:32 |
Dobroslaw | I can't find anything useful in docs or zuul code | 13:32 |
pabelanger | Dobroslaw: you can look for zuul.tag in the inventory | 13:33 |
pabelanger | then check zuul.pipeline | 13:34 |
pabelanger | to know you are in the release pipeline | 13:34 |
*** janki has joined #openstack-infra | 13:35 | |
Dobroslaw | pabelanger: something like this?: https://github.com/openstack/kolla/blob/master/tests/templates/kolla-build.conf.j2#L5 | 13:35 |
*** yboaron_ has quit IRC | 13:36 | |
pabelanger | yup | 13:36 |
pabelanger | https://zuul-ci.org/docs/zuul/user/jobs.html#tag-items | 13:36 |
pabelanger | for more info | 13:36 |
*** yboaron_ has joined #openstack-infra | 13:36 | |
Dobroslaw | pabelanger: great, checking, thank you | 13:37 |
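The approach pabelanger describes can be sketched roughly like this in a release job's shell step (the environment variable names here are illustrative, not something Zuul exports itself; in a real job the values would be templated in from `{{ zuul.tag }}` and `{{ zuul.pipeline }}` in the inventory, as in the kolla-build.conf.j2 example linked above):

```shell
# Hypothetical: assume the playbook exported zuul.pipeline and zuul.tag
# into the environment before running this script.
ZUUL_PIPELINE="${ZUUL_PIPELINE:-release}"
ZUUL_TAG="${ZUUL_TAG:-}"

if [ "$ZUUL_PIPELINE" = "release" ] && [ -n "$ZUUL_TAG" ]; then
    IMAGE_TAG="$ZUUL_TAG"      # e.g. a git tag such as 7.0.1
else
    IMAGE_TAG="latest"         # non-release builds fall back to latest
fi

echo "would build: myimage:${IMAGE_TAG}"
# the real job would then run something like:
#   docker build -t "myimage:${IMAGE_TAG}" .
```

Note that `zuul.tag` only exists for tag-triggered items, hence the guard on both the pipeline and the tag being non-empty.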
*** jamesmcarthur has joined #openstack-infra | 13:48 | |
*** jcoufal has joined #openstack-infra | 13:48 | |
*** jcoufal has quit IRC | 13:50 | |
*** takamatsu has joined #openstack-infra | 13:56 | |
*** zul has quit IRC | 14:00 | |
*** dpawlik has quit IRC | 14:01 | |
*** dpawlik has joined #openstack-infra | 14:03 | |
*** jamesmcarthur has quit IRC | 14:04 | |
*** dpawlik has quit IRC | 14:04 | |
*** bobh has joined #openstack-infra | 14:09 | |
efried | Hey folks, I think I brought this up a few weeks ago, but then lost track of it. | 14:11 |
efried | http://ci-watch.tintri.com/ <== seems to be down. Did we figure out who had been maintaining it, if it was going to be fixed or replaced with something equivalent, etc? | 14:11 |
efried | The nova team (at least) used to get a lot of use out of it. | 14:11 |
mordred | efried: I don't know that we know anything about it. what did it do? | 14:13 |
efried | mordred: It was like a summary table of all the CIs, including 3rd party. You could filter (by things like project, date range, etc) with queryparams. It had nice big green checkmark or red X to indicate whether a particular run passed or failed, with links to the run. | 14:14 |
efried | made it easy to tell at a glance whether a particular CI was really dead (lots of red in a row) etc. | 14:15 |
fungi | it's referenced in this infra spec: https://specs.openstack.org/openstack-infra/infra-specs/specs/deploy-ci-dashboard.html#proposed-change | 14:16 |
*** zul has joined #openstack-infra | 14:16 | |
mordred | gotcha | 14:16 |
mordred | ah - so we have the source code for it at least | 14:17 |
*** roman_g has quit IRC | 14:17 | |
fungi | looks like krtaylor and mmedvede were involved in the drafting of that spec, so maybe they know who was running the poc | 14:17 |
mordred | with the update to config management stuff, it might be an easier task for someone to pick up now | 14:18 |
*** roman_g has joined #openstack-infra | 14:19 | |
fungi | in unrelated news, looks like we're getting a lot of pip install errors in vexxhost-sjc1 so i'm checking out the proxy host now | 14:19 |
ttx | infra-core: we are holding off on releases until the pypi access issue affecting some regions is solved (http://status.openstack.org/elastic-recheck/#1449136) -- if you notice things are working correctly again please let us know ! | 14:19 |
fungi | ttx: the issue in rax-dfw seemed to clear up late yesterday but now we have a problem in another provider i'm just starting to check into | 14:20 |
fungi | mirror.sjc1.vexxhost.o.o seems to be completely unreachable for me | 14:20 |
ttx | hmm, yeah, that graph could be explained by two different issues | 14:20 |
fungi | right, if you tick on the node_provider field in one of the relevant logstash queries you'll see it's a different issue | 14:21 |
*** sthussey has joined #openstack-infra | 14:22 | |
fungi | nova claims the instance is active | 14:24 |
mordred | fungi: anything exciting in the nova console log? | 14:24 |
fungi | console log show is empty | 14:24 |
fungi | nova reboot? | 14:25 |
fungi | er, server reboot | 14:25 |
fungi | unless we want mnaser to see if there's a network-related explanation for why it's unreachable | 14:25 |
fungi | in which case we can turn down that region | 14:26 |
fungi | but we're already running full-out with some ~3 hours to get node assignments in check at this point, so further reduction in capacity is probably not going to help that situation | 14:27 |
fungi | judgement call... i'm going to reboot it via the api and see if i can get any sort of post-mortem from whatever system logs it managed to write (if any) | 14:28 |
mmedvede | fungi: the person who was running the ci-watch.tintri.com poc no longer works there. He left a contact email which is not responding so far. I did deploy the same service on http://ciwatch.mmedvede.net | 14:28 |
fungi | #status log rebooted mirror01.sjc1.vexxhost.openstack.org via api as it seems to have been unreachable since ~02:30z | 14:29 |
*** mriedem has joined #openstack-infra | 14:29 | |
openstackstatus | fungi: finished logging | 14:29 |
fungi | mmedvede: thanks! efried: see mmedvede's comment above | 14:30 |
fungi | A start job is running for Raise ne...k interfaces (2min 5s / 5min 1s) | 14:31 |
fungi | that's not very promising | 14:31 |
efried | mmedvede, fungi: Thanks! | 14:31 |
fungi | if it can't bring up the nic in the next couple minutes, i'll get the region temporarily disabled | 14:31 |
*** eharney has joined #openstack-infra | 14:33 | |
*** dpawlik has joined #openstack-infra | 14:33 | |
*** yboaron_ has quit IRC | 14:35 | |
*** yboaron_ has joined #openstack-infra | 14:35 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 14:36 |
*** fuentess has joined #openstack-infra | 14:37 | |
*** dpawlik has quit IRC | 14:38 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/project-config master: Temporarily disable vexxhost-sjc1 in nodepool https://review.openstack.org/620924 | 14:38 |
fungi | config-core: ^ | 14:38 |
fungi | i'll see about adding nl03 to the emergency disable list and manually applying that diff while we wait for the change to merge | 14:39 |
mordred | ++ | 14:39 |
pabelanger | +3 | 14:39 |
*** graphene has joined #openstack-infra | 14:40 | |
pabelanger | At one point in time, we quickly discussed the idea of running multiple mirrors, so if one went down we wouldn't have to disable it in nodepool. logan- actually rehashed the discussion in berlin for another reason | 14:41 |
fungi | after editing nodepool.yaml, do i need to reload the config with the nodepool rpc cli? | 14:41 |
pabelanger | fungi: no, it will be live on save | 14:42 |
fungi | perfect | 14:42 |
fungi | #status log temporarily added nl03.o.o to the emergency disable list and manually applied https://review.openstack.org/620924 in advance of it merging | 14:42 |
openstackstatus | fungi: finished logging | 14:42 |
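The manually applied diff amounts to zeroing max-servers for the affected region in the launcher's config; with nl03 in the emergency list puppet won't revert the hand edit, and per pabelanger the launcher picks the file up on save with no reload needed. A sketch of the fragment (pool name and the previous value are illustrative, not the exact production config):

```yaml
# /etc/nodepool/nodepool.yaml (fragment)
providers:
  - name: vexxhost-sjc1
    pools:
      - name: main
        max-servers: 0   # was some nonzero value; 0 drains the region
```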
*** rh-jelabarre has joined #openstack-infra | 14:45 | |
*** jcoufal has joined #openstack-infra | 14:54 | |
*** bhavikdbavishi has joined #openstack-infra | 14:56 | |
*** udesale has joined #openstack-infra | 14:58 | |
*** lpetrut has quit IRC | 14:58 | |
*** xek has joined #openstack-infra | 15:00 | |
*** trown has quit IRC | 15:03 | |
*** trown has joined #openstack-infra | 15:04 | |
fungi | we're down to 0 nodes in the vexxhost-sjc1 main pool now | 15:08 |
*** ykarel is now known as ykarel|away | 15:11 | |
*** ykarel|away has quit IRC | 15:15 | |
*** zul has quit IRC | 15:18 | |
*** hamerins has joined #openstack-infra | 15:23 | |
*** roman_g has quit IRC | 15:24 | |
mnaser | hi | 15:24 |
* mnaser looks | 15:24 | |
mnaser | is it all vms or just mirror, fungi ? | 15:25 |
cmurphy | clarkb: mordred any idea why this failed https://review.openstack.org/602380 http://logs.openstack.org/80/602380/3/gate/infra-puppet-apply-3-ubuntu-trusty/a3a1e0c/job-output.txt.gz#_2018-11-27_21_19_01_265126 and if i recheck is someone around to babysit in case it makes it through? | 15:26 |
mnaser | sigh | 15:26 |
mnaser | ok i know what sgoing on | 15:26 |
mnaser | fungi: it should be back | 15:28 |
*** udesale has quit IRC | 15:29 | |
*** mriedem is now known as mriedem_afk | 15:29 | |
*** udesale has joined #openstack-infra | 15:29 | |
*** hamerins has quit IRC | 15:30 | |
mnaser | config-core: feel free to propose a revert of that patch | 15:30 |
mnaser | the issue should be fixed now | 15:30 |
*** yboaron_ has quit IRC | 15:31 | |
ianychoi | Hello infra team, would some system-config cores kindly review https://review.openstack.org/#/c/620661/ and +A? so many spams (>400 in just the latest ~12 hours..) really hurt my mail box.. | 15:31 |
*** hamerins has joined #openstack-infra | 15:31 |
*** ykarel|away has joined #openstack-infra | 15:32 | |
fungi | mnaser: okay, thanks for finding the issue! | 15:32 |
fungi | i confirm i can ping it now | 15:32 |
fungi | and ssh into it | 15:33 |
*** ykarel|away is now known as ykarel | 15:34 | |
*** panda is now known as panda|pto | 15:34 | |
fungi | i've abandoned the change since it hadn't merged yet, and will unroll the emergency disablement | 15:35 |
mnaser | fungi: thanks | 15:38 |
evrardjp | a non-important patch among my open changes just needs a simple review from someone here: https://review.openstack.org/#/c/619216/ | 15:39 |
*** ahosam has joined #openstack-infra | 15:39 | |
*** ccamacho has quit IRC | 15:39 | |
*** ramishra has quit IRC | 15:39 | |
openstackgerrit | Merged openstack-infra/system-config master: Add kube config to nodepool servers https://review.openstack.org/620755 | 15:44 |
openstackgerrit | Merged openstack-infra/system-config master: Nodepool group no longer hosts zookeeper https://review.openstack.org/620760 | 15:44 |
*** ccamacho has joined #openstack-infra | 15:49 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 15:50 |
*** quiquell is now known as quiquell|off | 15:55 | |
*** aojea has quit IRC | 15:59 | |
fungi | we're back up running with the original max-servers count in vexxhost-sjc1 | 16:00 |
*** dklyle has joined #openstack-infra | 16:01 | |
*** graphene has quit IRC | 16:03 | |
*** graphene has joined #openstack-infra | 16:04 | |
*** jamesmcarthur has joined #openstack-infra | 16:05 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 16:07 |
*** jamesmcarthur has quit IRC | 16:10 | |
openstackgerrit | Merged openstack-infra/system-config master: Blackhole messages to openstack-ko-owner@l.o.o https://review.openstack.org/620661 | 16:13 |
frickler | mnaser: can you share what the issue was? (just being curious from an ops perspective) | 16:19 |
*** udesale has quit IRC | 16:20 | |
*** mriedem_afk is now known as mriedem | 16:22 | |
mnaser | frickler: an openstack-ansible bug, seems like it didn't cleanly restart neutron-openvswitch-agent on reboot | 16:23 |
*** janki has quit IRC | 16:24 | |
*** boden has quit IRC | 16:25 | |
*** e0ne has quit IRC | 16:26 | |
frickler | mnaser: ah, ok, that certainly causes network issues ;) luckily should be pretty easy to spot. thx | 16:26 |
*** jcoufal has quit IRC | 16:31 | |
*** jcoufal has joined #openstack-infra | 16:31 | |
ssbarnea|rover | anyone looking at zuul , it seems unresponsive | 16:31 |
ssbarnea|rover | http://zuul.openstack.org/status .... not loading. | 16:31 |
*** ykarel is now known as ykarel|away | 16:33 | |
mnaser | loads for me ssbarnea|rover | 16:33 |
mnaser | there is a lot of things in queue so maybe it takes a little while before it comes up | 16:33 |
ssbarnea|rover | mnaser: yep, it did load for me like 1-2 minutes later.... | 16:34 |
*** ccamacho has quit IRC | 16:34 | |
pabelanger | yes, status json file is pretty large, but does load | 16:34 |
*** hamerins has quit IRC | 16:34 | |
pabelanger | wow, tripleo queue is 45hrs | 16:34 |
pabelanger | wonder what is going on there | 16:35 |
*** hamerins has joined #openstack-infra | 16:36 | |
*** boden has joined #openstack-infra | 16:36 | |
ssbarnea|rover | pabelanger: i do have the impression that this was caused by pip : http://status.openstack.org/elastic-recheck/ for which I raised a CR yesterday which was not approved due to risks. | 16:36 |
*** armax has joined #openstack-infra | 16:37 | |
ssbarnea|rover | pabelanger: https://review.openstack.org/#/c/620630/ if you remember well | 16:37 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 16:37 |
pabelanger | ssbarnea|rover: yah, looks like vexxhost had a large impact, but that seems to just be from this morning. Do you know if rax is still having an issue? | 16:38 |
ssbarnea|rover | now, I didn't see the failure rate going down since yesterday, so I suspect the problem is still present. | 16:38 |
*** e0ne has joined #openstack-infra | 16:38 | |
pabelanger | ssbarnea|rover: http://status.openstack.org/elastic-recheck/data/others.html#tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates | 16:39 |
*** roman_g has joined #openstack-infra | 16:39 | |
fungi | pabelanger: rax networking issues cleared up around 1800z yesterday | 16:39 |
pabelanger | looks to be unclassified failures; if we add them to elastic-recheck we'll get a better idea of what is happening too | 16:39 |
pabelanger | fungi: ack | 16:39 |
fungi | ssbarnea|rover: it's not that i pushed back on those changes due to "risks" but that you seem to not understand how extra-index-url works. it won't result in fewer failures (if anything it'll result in more) | 16:40 |
fungi | it adds indexes, all of which will be queried, and then pip will nondeterministically pick one at random to pull packages from if there are duplicate entries | 16:41 |
ssbarnea|rover | i know about its weird implementation, i do not question the downvote ;) | 16:41 |
fungi | it's not a "fallback" mechanism | 16:41 |
mordred | it would be so nice if it was a fallback mechanism | 16:41 |
ssbarnea|rover | but based on my local tests it should make it more reliable | 16:41 |
ssbarnea|rover | if i were not under pressure i would have tried to fix pip | 16:42 |
fungi | it will likely result in ~half of our packages skipping the mirror and pulling from pypi.org anyway | 16:42 |
ssbarnea|rover | fungi: and is that a big issue? | 16:42 |
ssbarnea|rover | the idea is when one of the sources fails the other ones should still be able to serve it , right? | 16:43 |
fungi | yes, that's why we have a local cache in each region. reduces the calls to outside services which, because of the unreliability of the internet, fail at random more often | 16:43 |
fungi | that's not how pip works | 16:43 |
fungi | if the source it decides to pick fails, pip will return an error and exit | 16:44 |
mordred | it never ceases to boggle my mind that this is how it works :) | 16:44 |
ssbarnea|rover | fungi: not if it times out; if it times out, the other source should win. | 16:44 |
ssbarnea|rover | mordred++ :D | 16:45 |
fungi | it doesn't pull from both. it picks one of the entries exclusively | 16:45 |
*** ahosam has quit IRC | 16:46 | |
ssbarnea|rover | fungi: btw, why are we not just using a local http proxy? | 16:46 |
fungi | first it retrieves all the indices, and then it decides which entry from which index will satisfy the stated requirement, and then it tries to retrieve that. and if it gets an error, it fails (and if you have it set to retry, it just retries that same one multiple times) | 16:46 |
fungi | ssbarnea|rover: we _are_ using a local http proxy | 16:46 |
mordred | yeah. that's what the mirrors are | 16:46 |
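The distinction fungi is making can be sketched with pip's configuration. The difference between what the jobs use today and what the rejected change would have added is one line; the hostnames below are invented for illustration, not the real mirror names:

```ini
# Roughly what the configure-mirror role writes: a single index, so
# every package request goes through the regional proxy/cache.
[global]
index-url = https://mirror.example-region.openstack.org/pypi/simple

# What --extra-index-url would have added. pip queries ALL configured
# indexes and picks whichever entry satisfies the requirement -- this
# is not a priority-ordered fallback, so roughly half the fetches
# could bypass the mirror entirely, and an error from the chosen
# source still fails the install.
# extra-index-url = https://pypi.org/simple
```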
*** hamzy has quit IRC | 16:46 | |
fungi | ssbarnea|rover: furthermore, the errors we saw in rackspace's dfw region yesterday (which prompted you to write that change) weren't because the proxy was broken but because networking from that provider region to the pypi.org cdn was broken. bypassing the proxy wouldn't have made any difference there | 16:47 |
pabelanger | in the case of vexxhost this morning, the mirror was down for unrelated pypi reasons. So, other things outside of pip were also affected, eg: apt / rpm. In the past we talked about the idea of standing up a 2nd mirror to load balance requests, maybe we should look back to that and do something like round robin DNS. As I type this, I am also not sure how our validate-hosts roles didn't catch the | 16:47 |
pabelanger | issue and abort the job | 16:47 |
fungi | pabelanger: then the load balancer becomes a single point of failure instead of the proxy | 16:47 |
*** gfidente has quit IRC | 16:48 | |
pabelanger | agreed | 16:48 |
logan- | pabelanger fungi: my thinking is we should handle the mirror selection using a random list of eligible mirrors in the pre-run. select a random one until a health check passes, then use it for the job | 16:48 |
fungi | unless we set up some sort of distributed lb with address takeover anyway | 16:48 |
*** e0ne has quit IRC | 16:48 | |
pabelanger | right, we could make our configure-mirror role a little smarter in that way | 16:49 |
fungi | but that could be an interesting challenge in providers who block multicast | 16:49 |
logan- | there is no need for a lb or spof that way, the zuul pre-run can select a mirror from a list of 1 or more eligibles | 16:49 |
pabelanger | get IP from dns, validate online, then use | 16:49 |
*** trown is now known as trown|lunch | 16:49 | |
pabelanger | that is kinda how validate-host role should work | 16:49 |
fungi | logan-: yeah, if the job node performs the health check, then that at least gets us a solution for situations where one of the servers has died completely, maybe not for when one is having intermittent issues and not the other | 16:50 |
ssbarnea|rover | ... just to recapitulate: in less than 24h we were hit by two different mirrors going down: rax, and later vexxhost. | 16:51 |
fungi | however, most of the time when there's a problem with the mirror/proxy servers it's either a global issue or it's an issue impacting an entire region in a provider, so multiple servers is really only a solution for a fraction of these situations | 16:51 |
fungi | ssbarnea|rover: correct | 16:51 |
pabelanger | Odd, validate-host does not actually check our mirrors | 16:51 |
pabelanger | I thought it did | 16:51 |
ssbarnea|rover | oh,... next time someone tells me that mirrors are reliable I will send them a link to irc logs. | 16:52 |
fungi | ssbarnea|rover: i don't dispute that these problems occurred, i'm saying they were different problems entirely and there's no single solution here | 16:52 |
ssbarnea|rover | does pip respect http_proxy env value or not really? | 16:52 |
fungi | ssbarnea|rover: they weren't "mirror problems" (in both cases they were "network problems") | 16:52 |
pabelanger | ssbarnea|rover: fungi: logan-: At a minimum, I think we could update our pre-run playbook in base to do a health check of both git.o.o and the regional mirror; if either fails, the job will abort and hopefully rerun on another provider | 16:53 |
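A pre-run health check along the lines being discussed could look something like this. The function names are hypothetical and a real implementation would live in the configure-mirror / validate-host roles, with the mirror URL coming from the job's region:

```shell
#!/bin/bash
# Sketch of a pre-run mirror health check (names are illustrative).

check_mirror() {
    local url=$1
    # --max-time bounds the whole transfer so a hung mirror fails fast
    # instead of eating into the job's timeout.
    curl --silent --fail --max-time 10 --output /dev/null "$url"
}

pick_mirror() {
    # Try each candidate in order; print the first healthy one.
    local m
    for m in "$@"; do
        if check_mirror "$m"; then
            echo "$m"
            return 0
        fi
    done
    # No mirror healthy: fail so the job aborts in pre and can be
    # retried on another provider.
    return 1
}
```

A job's pre playbook would then call something like `pick_mirror "http://mirror.$REGION.example.org/"` and abort if it returns non-zero, which is roughly the behavior logan- describes for selecting from a list of eligible mirrors.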
fungi | they were noticed as errors from the mirror servers because that's where jobs were trying to hit the outside through | 16:53 |
ssbarnea|rover | fungi: sure, a mirror can go down for various reasons, but if pip knew how to fall back, this could be one solution covering both outages. | 16:53 |
fungi | that's like saying that you have a "foot problem" because your leg fell off | 16:53 |
fungi | ssbarnea|rover: it wouldn't have solved both outages, no | 16:54 |
ssbarnea|rover | and the irony is that pip would have worked without custom mirror in both cases | 16:54 |
fungi | ssbarnea|rover: yesterday, the inability of rackspace's dfw region to reach pypi was the problem. the only solution would have been to stop running jobs there | 16:55 |
fungi | which we nearly did, but then their network issues in that region cleared up | 16:56 |
fungi | (either a network problem within that region or a problem with the nearest endpoint for the fastly cdn pypi.org uses) | 16:56 |
fungi | if nodes in that region had tried to connect directly to pypi.org the failure rate would have been identical | 16:57 |
fungi | no amount of patching pip or its configuration will solve that | 16:57 |
ssbarnea|rover | fungi: ohh, are you sure? from what i heard this issue was limited to ipv6, and if the nodes had been using ipv4, it would have worked. | 16:59 |
fungi | ssbarnea|rover: yes, an alternative would have been to figure out how to disable ipv6 on our test nodes in that region. also a drastic enough solution that you're not going to just work around it | 17:00 |
ssbarnea|rover | what's the status with vex, is it sorted? i am worried about gate queue which seems to only go up. | 17:01 |
fungi | ssbarnea|rover: it's sorted, yes | 17:01 |
fungi | the gate queue is only going up because tripleo monopolizes it. sorry, i have to point that out. complaints from the team who consume most of our resources are not really making this a fun thing to spend my valuable time maintaining today | 17:02 |
ssbarnea|rover | the gate queue is still at 1.9, didn't see it going down at all. | 17:02 |
ssbarnea|rover | 1.9days, not hours :D | 17:02 |
fungi | i think you mean tripleo's gate queue | 17:03 |
ssbarnea|rover | yeah | 17:03 |
*** eernst has joined #openstack-infra | 17:03 | |
fungi | yeah | 17:03 |
*** david-lyle has joined #openstack-infra | 17:04 | |
*** ginopc has quit IRC | 17:05 | |
*** dklyle has quit IRC | 17:05 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 17:06 |
fungi | ssbarnea|rover: ^ that should help get everyone else's changes moving faster at least | 17:06 |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: base-test: Check that regional mirror is online https://review.openstack.org/620961 | 17:08 |
pabelanger | ssbarnea|rover: fungi: logan-: ^is something we can test, to offer some protection to jobs | 17:08 |
*** shardy has quit IRC | 17:09 | |
clarkb | ssbarnea|rover: fungi is right re the proxy, our mirror is a proxy | 17:09 |
clarkb | setting up a different proxy wouldnt help | 17:09 |
*** gyee has joined #openstack-infra | 17:09 | |
clarkb | also we shouldnt set up a transparent proxy as the potential for abuse on that skyrockets | 17:10 |
mordred | pabelanger: I think that would be helpful | 17:10 |
clarkb | this is why we reverse proxy specific services from our mirrors as we do | 17:10 |
pabelanger | mordred: yah, I could have sworn we did that in validate-host before, but seems to be just git.o.o | 17:10 |
pabelanger | also I think we could maybe make validate-hosts take a list of hosts too, and have it validate each of them | 17:11 |
fungi | i'd be good with either solution | 17:11 |
clarkb | we always knew this would be a risk with using a proxy instead of a proper mirror | 17:12 |
clarkb | but pypi's growth just isn't manageable anymore to mirror properly | 17:12 |
clarkb | blame cuda linking | 17:12 |
*** bhavikdbavishi has quit IRC | 17:15 | |
*** bhavikdbavishi has joined #openstack-infra | 17:16 | |
pabelanger | clarkb: https://review.openstack.org/620961/ should help with vexxhost mirror outage this moring, if you'd like to review. We can do a bit of testing before deciding to move it to our base job | 17:19 |
clarkb | pabelanger: I'm not sure it will change anything. We already run bindep and things like devstack in pre | 17:20 |
clarkb | these will look for the distro package mirrors and fail in pre if that is down | 17:20 |
clarkb | we can certainly explicitly check things but I dont expect a major change in job retry behavior | 17:21 |
*** agopi is now known as agopi|food | 17:21 | |
pabelanger | tripleo isn't using devstack, so they don't have coverage. That said, we could just move that check into their jobs | 17:21 |
pabelanger | but I figured, since the mirrors are our infra servers, having the check in base might help protect all jobs | 17:22 |
clarkb | pabelanger: ya we can explicitly check. Maybe this also points to tripleo needing to move stuff into pre? I don't know, I get lost every time I try to trace through a job there | 17:23 |
*** jamesmcarthur has joined #openstack-infra | 17:24 | |
pabelanger | clarkb: yah, that is fair | 17:24 |
*** dpawlik has joined #openstack-infra | 17:25 | |
*** dpawlik has quit IRC | 17:26 | |
clarkb | generally, low-churn bootstrapping steps that are expected to succeed should be in pre | 17:26 |
clarkb | for most of our jobs this means install distro packages is in pre | 17:26 |
*** dpawlik has joined #openstack-infra | 17:26 | |
clarkb | maybe not exclusively, as tripleo is a deployment project and wants to test those steps, but tripleo must have deps itself? | 17:27 |
clarkb | pabelanger: because the next thing we'll run into is mirror is up but we rsynced a bad state and so its broken when you install | 17:28 |
clarkb | then we'll go through all of this again | 17:28 |
*** jpich has quit IRC | 17:29 | |
*** hamerins has quit IRC | 17:29 | |
pabelanger | clarkb: Yah, that was my thought behind adding this check to validate-host; there we only do a traceroute to git.o.o. If that fails, we assume a network issue. We could update it to support a list of hosts, and fail if we cannot traceroute both git.o.o and the mirror | 17:30 |
*** hamerins has joined #openstack-infra | 17:31 | |
*** lujinluo has joined #openstack-infra | 17:33 | |
fungi | i thought we did a ping. the traceroute was more so that we could perform post-mortem analysis on reachability issues | 17:36 |
clarkb | it's all to git.o.o though; that host is no longer a good one actually, since zuul pushes all git state into jobs now | 17:39 |
clarkb | maybe change git.o.o to the in-region mirror and check that host instead | 17:39 |
pabelanger | yah, we could do that too | 17:40 |
pabelanger | I think the idea with git.o.o is to confirm we can route outside the provider network | 17:40 |
pabelanger | so, might be good to also keep that | 17:40 |
fungi | yes, granted it doesn't actually confirm that in rax-dfw since that's where git.o.o resides | 17:42 |
clarkb | in the past all jobs had to talk to that node | 17:42 |
clarkb | so not being able to talk to that load balancer would result in job failure | 17:43 |
clarkb | this is no longer true, but I think that is why it was chosen | 17:43 |
fungi | but having the node try to reach across the internet to some host we expect to always be up can be a good canary to keep | 17:43 |
clarkb | yup | 17:43 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment https://review.openstack.org/620969 | 17:44 |
fungi | clarkb: is there an "easy" way to export the events data from a logstash query in kibana? | 17:52 |
*** ykarel|away has quit IRC | 17:53 | |
fungi | i have a query with 449 results and don't feel like stitching together 5 pages of copy+paste | 17:53 |
*** jpena is now known as jpena|off | 17:53 | |
fungi | though that's what i did in the end | 17:55 |
clarkb | there should be a csv export option somewhere | 17:56 |
fungi | i found where to adjust the pagination at least | 17:57 |
*** wolverineav has joined #openstack-infra | 17:59 | |
openstackgerrit | Merged openstack-infra/system-config master: docs: add info on generating DS records https://review.openstack.org/619334 | 18:00 |
fungi | config-core: we can add another 80 nodes back to the pool with https://review.openstack.org/619750 | 18:03 |
clarkb | mordred: pabelanger and the actual http GET runs on all of the ansible test nodes not the ansible process on the executor? | 18:03 |
clarkb | fungi: we aren't concerned that it will go back to being unstable? | 18:04 |
fungi | halving our utilization in ovh-bhs1 didn't solve the excessive timeouts there | 18:04 |
clarkb | roger | 18:04 |
fungi | see the comment i just added | 18:04 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Add support for zones in executors https://review.openstack.org/549197 | 18:05 |
*** derekh has quit IRC | 18:06 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 18:06 |
*** trown|lunch is now known as trown | 18:08 | |
pabelanger | clarkb: yah, all nodes should curl the mirror | 18:08 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors) https://review.openstack.org/620950 | 18:08 |
sshnaidm | can somebody tell why ara-report is from Sep in todays job? http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/ | 18:09 |
sshnaidm | is some server not time synced? | 18:09 |
sshnaidm | ara-report/2018-09-04 17:54 | 18:09 |
fungi | that's definitely weird | 18:09 |
clarkb | that ara report is generated by the zuul executor I think. Could be that one of them got confused? | 18:10 |
fungi | the timestamps inside the ara report look like they're from today | 18:10 |
pabelanger | ara-report folder will contain a sqlite.db | 18:10 |
corvus | fungi, clarkb: can one of you please obtain an opendev.org cert? | 18:10 |
pabelanger | I think that is created from the executor | 18:10 |
fungi | i think apache is confused | 18:10 |
clarkb | corvus: yes, I can do that | 18:11 |
fungi | ls -l /srv/static/logs/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/ | 18:11 |
fungi | drwxr-xr-x 2 jenkins jenkins 4096 Nov 29 11:30 ara-report | 18:11 |
clarkb | corvus: it will likely require we edit DNS to verify ownership of the domain though now that gdpr means whois is useless | 18:11 |
*** hamzy has joined #openstack-infra | 18:12 | |
pabelanger | lol, jenkins | 18:12 |
corvus | clarkb: the automation should be in place | 18:12 |
clarkb | corvus: cool, I'll push up a change for that when I've got the details from namecheap | 18:12 |
fungi | pabelanger: yeah, we haven't renamed the account on static.o.o | 18:12 |
*** bobh has quit IRC | 18:12 | |
*** bhavikdbavishi has quit IRC | 18:13 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:13 |
*** bhavikdbavishi1 has joined #openstack-infra | 18:13 | |
ssbarnea|rover | fungi: clarkb : if you could help with recent CRs on https://review.openstack.org/#/q/project:openstack-infra/elastic-recheck+status:open it would be great | 18:13 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Serve opendev.org website from files.o.o https://review.openstack.org/620979 | 18:14 |
*** dpawlik has quit IRC | 18:14 | |
pabelanger | fungi: sshnaidm: http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/job-output.txt.gz#_2018-11-29_11_30_44_008080 I think is when we create the directory | 18:14 |
*** david-lyle has quit IRC | 18:15 | |
TheJulia | so the maximum zuul job length is 3 hours? | 18:15 |
*** agopi|food is now known as agopi | 18:15 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently https://review.openstack.org/610029 | 18:15 |
pabelanger | TheJulia: yes | 18:15 |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 18:15 | |
clarkb | corvus: fungi: a one-year cert as per normal? I expect we'll maybe be using free certs in the not too distant future, so that makes sense to me | 18:15 |
corvus | clarkb: ++ | 18:16 |
sshnaidm | pabelanger, so is it apache's fault..? | 18:16 |
fungi | clarkb: yeah, that's what i'd do | 18:16 |
*** wolverineav has quit IRC | 18:16 | |
openstackgerrit | Purnendu Ghosh proposed openstack-infra/project-config master: Create airship-spyglass repo https://review.openstack.org/619493 | 18:17 |
clarkb | sshnaidm: that is fungi's theory | 18:17 |
fungi | sshnaidm: yeah, must be? i can't for the life of me figure out where apache is getting that timestamp: http://paste.openstack.org/show/736434/ | 18:17 |
TheJulia | pabelanger: I guess I should try and see what I can prune out of a grenade job :( | 18:18 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: DNS: replace ip addresses with names https://review.openstack.org/620980 | 18:18 |
*** wolverineav has joined #openstack-infra | 18:18 | |
corvus | TheJulia: for context: the idea was that 3 hours is plenty of headroom for jobs that were designed to be 1 hour long | 18:19 |
*** jistr has quit IRC | 18:19 | |
*** jistr has joined #openstack-infra | 18:19 | |
TheJulia | except grenade has always been 2-2.5 | 18:19 |
*** bhavikdbavishi has quit IRC | 18:19 | |
TheJulia | hit a slow node, and you pass 3 hours | 18:20 |
fungi | always? | 18:20 |
sshnaidm | fungi, yeah, weird..maybe some network storage bug | 18:20 |
TheJulia | as far as I can remember | 18:20 |
fungi | i vaguely remember grenade jobs taking ~1.25 hours when devstack jobs took ~0.75 hours | 18:20 |
TheJulia | You have to keep in mind the whole fake baremetal deployment boot/deploy cycle for ironic adds a chunk of time | 18:20 |
clarkb | corvus: fungi: do we want an alt name of www.opendev.org? | 18:20 |
fungi | sshnaidm: i have a feeling it's something weird in apache's caching layer | 18:20 |
corvus | clarkb: yeah i think so | 18:21 |
*** roman_g has quit IRC | 18:22 | |
fungi | clarkb: yes, that would then be consistent with the one for zuul-ci.org | 18:22 |
fungi | X509v3 Subject Alternative Name: DNS:zuul-ci.org, DNS:www.zuul-ci.org | 18:22 |
fungi | echo|openssl s_client -connect zuul-ci.org:443|openssl x509 -text | 18:23 |
fungi | so "do whatever you did for that domain" | 18:23 |
clarkb | fungi: I don't think I did that domain | 18:23 |
*** jamesmcarthur has quit IRC | 18:24 | |
openstackgerrit | James E. Blair proposed openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org https://review.openstack.org/620982 | 18:25 |
fungi | clarkb: it's in bridge.o.o:~root/certs/2018-03-26/ so one of us did anyway | 18:25 |
fungi | oh, wait, that's git.zuul-ci.org | 18:26 |
fungi | in 2018-01-19 instead | 18:26 |
*** lujinluo has quit IRC | 18:26 | |
corvus | fungi, clarkb: i set "topic:opendev" on all related changes | 18:27 |
fungi | thanks!!! | 18:27 |
clarkb | I'll model it off of the static cert and create a new opendev.org cnf file and generate with that | 18:27 |
openstackgerrit | Merged openstack-infra/project-config master: Revert "Halve ovh-bhs1 max-servers temporarily" https://review.openstack.org/619750 | 18:28 |
fungi | clarkb: i think it adds www automagically as a san, but i could be wrong | 18:29 |
clarkb | fungi: it being openssl? | 18:29 |
fungi | clarkb: namecheap | 18:29 |
fungi | X509v3 Subject Alternative Name: DNS:git.zuul-ci.org, DNS:www.git.zuul-ci.org | 18:30 |
clarkb | oh huh | 18:30 |
fungi | X509v3 Subject Alternative Name: DNS:zuul.openstack.org, DNS:www.zuul.openstack.org | 18:30 |
clarkb | ya same for git.openstack.org. In that case I'll just use our normal process and it should just work (tm) | 18:31 |
clarkb | (I hope) | 18:31 |
clarkb | easy enough if so | 18:31 |
fungi | so if you just ask for a cert for opendev.org they should end up giving you www.opendev.org as a san | 18:31 |
*** hamzy has quit IRC | 18:31 | |
*** diablo_rojo has joined #openstack-infra | 18:31 | |
clarkb | yup | 18:31 |
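For the case where the CA does not add the www name automatically, a CSR can request it explicitly. This is only a sketch, assuming OpenSSL 1.1.1's `-addext` flag (older releases need the SAN spelled out in a config file, which is what the cnf-file approach mentioned above does); file names are illustrative:

```shell
# Generate a key and a CSR that explicitly asks for both names.
openssl req -new -newkey rsa:2048 -nodes \
    -keyout opendev.org.key -out opendev.org.csr \
    -subj "/CN=opendev.org" \
    -addext "subjectAltName=DNS:opendev.org,DNS:www.opendev.org"

# Confirm the SAN made it into the request before submitting it.
openssl req -in opendev.org.csr -noout -text \
    | grep -A1 "Subject Alternative Name"
```

The same `openssl x509 -text` inspection fungi pasted earlier can then be used against the issued cert to confirm the CA kept both names.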
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:38 |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986 | 18:38 |
clarkb | fungi: corvus ^ double check me on that its been years since I hand edited a bind zone file :) | 18:39 |
*** eernst has quit IRC | 18:40 | |
clarkb | I didn't increment the serial | 18:41 |
clarkb | let me fix that real quick | 18:41 |
* fungi was about to point that out ;) | 18:41 | |
fungi | make sure to increment it again in the "revert" too | 18:41 |
corvus | emacs'll do it for you | 18:41 |
fungi | emagic | 18:41 |
*** eernst has joined #openstack-infra | 18:42 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:43 |
openstackgerrit | Clark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record" https://review.openstack.org/620986 | 18:43 |
clarkb | corvus: yup got both of them | 18:43 |
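For reference, the kind of hand edit being reviewed looks roughly like the fragment below. This is a hedged sketch only: the record name, token, and serial values are invented, not the actual opendev.org zone contents:

```zone
; Excerpt of a hand-edited zone file. The SOA serial must be bumped
; on every change (including the later revert) or secondaries will
; not pick up the new data.
@   IN  SOA  ns1.opendev.org. hostmaster.opendev.org. (
        2018112902  ; serial -- incremented from 2018112901
        3600        ; refresh
        600         ; retry
        864000      ; expire
        300 )       ; minimum

; Temporary record the CA checks to verify domain control;
; removed again (with another serial bump) once the cert is issued.
_example-validation  IN  TXT  "token-provided-by-the-ca"
```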
fungi | you might want to wip 620986 until you're done, just to be safe | 18:44 |
*** gfidente has joined #openstack-infra | 18:45 | |
clarkb | ++ | 18:45 |
frickler | $ ls -l /usr/local/bin/ara-wsgi-sqlite | 18:45 |
frickler | -rwxr-xr-x 1 root root 1818 Sep 4 17:54 /usr/local/bin/ara-wsgi-sqlite | 18:45 |
*** dklyle has joined #openstack-infra | 18:45 | |
frickler | together with WSGIScriptAliasMatch ^.*/ara-report(?!/ansible.sqlite) /usr/local/bin/ara-wsgi-sqlite | 18:45 |
frickler | is what makes that timestamp | 18:46 |
fungi | frickler: oh! so that's the timestamp of the cgi | 18:46 |
fungi | sshnaidm: ^ mystery solved | 18:46 |
frickler | bit of a weird behaviour of apache I'd say | 18:46 |
sshnaidm | interesting | 18:46 |
clarkb | corvus: fungi: ok I've approved the dns updates so now we wait for that to merge and apply and comodo to notice. I'll get the cert data into hiera/ansiblevars as soon as I have it | 18:49 |
*** gfidente has quit IRC | 18:51 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation https://review.openstack.org/620989 | 18:51 |
*** hamzy has joined #openstack-infra | 18:52 | |
clarkb | ssbarnea|rover: pabelanger out of curiosity, where is the rdo/rpm packaging line vs the pypi line drawn for tripleo testing? is there a general rule there? | 18:52 |
openstackgerrit | Merged openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org https://review.openstack.org/620982 | 18:52 |
*** hamzy_ has joined #openstack-infra | 18:57 | |
openstackgerrit | Merged openstack-infra/zone-opendev.org master: Add SSL Cert verification record https://review.openstack.org/620985 | 18:57 |
*** lpetrut has joined #openstack-infra | 18:58 | |
*** hamzy has quit IRC | 18:58 | |
pabelanger | clarkb: I'm not sure myself, but I believe anything that is openstack (and its dependencies) gets built as an RPM. | 18:59 |
fungi | i think they have some tox jobs too though? | 19:01 |
*** mtreinish has joined #openstack-infra | 19:03 | |
*** wolverineav has quit IRC | 19:04 | |
*** wolverineav has joined #openstack-infra | 19:05 | |
weshay | thanks clarkb! | 19:05 |
weshay | for the elastic recheck reviews | 19:05 |
*** wolverineav has quit IRC | 19:06 | |
clarkb | weshay: np. I'm happy to see people using it for this :) | 19:06 |
*** wolverineav has joined #openstack-infra | 19:06 | |
clarkb | weshay: I figure if mriedem or fungi or some other current root doesn't review in the next day or so I can approve the stack (except maybe the py3 support change since ianw was reviewing that one already and its a bit bigger than adding queries) | 19:06 |
ssbarnea|rover | clarkb: yep, the main rule is to use rpm whenever possible, but there are a few exceptions, like tox testing. | 19:06 |
weshay | k | 19:07 |
*** eernst has quit IRC | 19:07 | |
*** wolverineav has quit IRC | 19:07 | |
*** dklyle has quit IRC | 19:07 | |
mriedem | huh? | 19:07 |
*** wolverineav has joined #openstack-infra | 19:07 | |
clarkb | mriedem: e-r reviews | 19:07 |
ssbarnea|rover | i've seen other workarounds too, but they are usually only temporary, like installing a package from pip until we get an rpm for it. | 19:07 |
ssbarnea|rover | clarkb: exceptions apply only to non-shipping code, like test-related things. anything that ships / installs on production must be rpm based. | 19:08 |
clarkb | ssbarnea|rover: got it, thanks | 19:08 |
*** fuentess has quit IRC | 19:08 | |
*** electrofelix has quit IRC | 19:12 | |
*** eernst has joined #openstack-infra | 19:13 | |
clarkb | ssbarnea|rover: comment on https://review.openstack.org/#/c/620950/3 | 19:16 |
mtreinish | clarkb, fungi: so I've got kind of a random question, do you have a pointer to the script used to build wheels? | 19:17 |
clarkb | mtreinish: ya I'll dig it up | 19:17 |
mtreinish | I got a request to upload wheels for stestr to pypi, and I've been using the old tarball script to upload that to pypi | 19:17 |
mtreinish | and reading the twine docs was not at all helpful | 19:17 |
mtreinish | clarkb: cool, thanks | 19:18 |
*** eernst has quit IRC | 19:18 | |
clarkb | mtreinish: https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/build-wheels/files/wheel-build.sh | 19:18 |
clarkb | darn I typed that out wrong | 19:19 |
*** wolverineav has quit IRC | 19:19 | |
*** eernst has joined #openstack-infra | 19:19 | |
clarkb | mtreinish: https://git.openstack.org/cgit/openstack-infra/project-config/tree/roles/build-wheels/files/wheel-build.sh | 19:19 |
clarkb | there | 19:19 |
fungi | you can do it in the same command where you also build an sdist, i.e. `python setup.py bdist_wheel sdist` | 19:20 |
*** wolverineav has joined #openstack-infra | 19:20 | |
clarkb | oh wait you are wanting the publish side of wheel building | 19:20 |
clarkb | that is the build all the wheels for mirroring script | 19:20 |
fungi | aha, yes | 19:20 |
fungi | sorry, i too misread | 19:20 |
clarkb | but ya you basically pip wheel ./ or python setup.py bdist_wheel | 19:21 |
fungi | `twine upload dist/*` | 19:21 |
mtreinish | thanks | 19:21 |
mtreinish | hmm, that's what I tried | 19:21 |
*** eernst has quit IRC | 19:21 | |
fungi | if you want to test against the dev pypi, do `twine upload --repository-url https://test.pypi.org/legacy/ dist/*` | 19:21 |
mtreinish | well I've got a bug fix release to push, so I'll give it a try on a fresh tag | 19:21 |
mtreinish | oh, that's good to know | 19:22 |
fungi | before doing that, i also recommend running `twine check dist/*` | 19:22 |
clarkb | corvus: root email says ns2.opendev.org requires a reboot to complete package upgrades. Maybe we should do that once dns is verified with comodo? | 19:22 |
fungi | at the moment i think that only checks the long description to make sure pypi will render it successfully, but in the future the expectation is that will grow additional checks for things like invalid trove classifiers | 19:22 |
clarkb | odd that ns1 wouldn't require it, but then i remember we likely use different base images in the two different clouds | 19:23 |
*** xek has quit IRC | 19:23 | |
ssbarnea|rover | weshay: please read https://review.openstack.org/#/c/620950/3 and add your input, i am not sure if voting:1 should be in or not. | 19:23 |
*** wolverineav has quit IRC | 19:24 | |
*** wolverineav has joined #openstack-infra | 19:24 | |
ssbarnea|rover | mriedem: fungi : i am not sure if you are also aware of the "twine check" command, which proved to be VERY useful as it also lints the readme that goes to pypi and assures it renders well. | 19:24 |
ssbarnea|rover | mtreinish: ^^ this was for you. :) | 19:25 |
clarkb | ssbarnea|rover: the openstack release process uses that command to check things before we make releases | 19:25 |
clarkb | it is indeed quite helpful | 19:25 |
ssbarnea|rover | clarkb: the problem is that most of the projects do not run it as part of their tox targets, so we find out about breakage late in the process. | 19:26 |
ssbarnea|rover | my personal preference is to include "twine check" as part of tox-linters ... as this is what it does, mostly. | 19:26 |
clarkb | ssbarnea|rover: ya, though at least before we try to publish now | 19:26 |
mtreinish | heh, yeah it looks like it will be useful | 19:26 |
fungi | ssbarnea|rover: well, twine check doesn't technically check the readme, it checks the long description field of the built packages (which in our case are embedded copies of a readme) | 19:26 |
mtreinish | but there is a typo in the warning message about a missing dep | 19:27 |
mtreinish | told me to install 'readme_render[md]' but it meant 'readme_renderer[md]' | 19:27 |
ssbarnea|rover | yeah, i know about this. i think I made a PR to fix it, or wanted to. | 19:28 |
fungi | though if you don't use markdown you can ignore that | 19:28 |
fungi | restructuredtext is supported by default | 19:28 |
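A minimal sketch of wiring `twine check` into a lint step, as the tox-linters idea above suggests (the helper name is invented; it just builds the real `twine check` command over whatever built artifacts exist):

```python
import glob
import shutil
import subprocess


def twine_check_cmd(dist_dir="dist"):
    """Build the `twine check` invocation for every built artifact in dist_dir."""
    return ["twine", "check"] + sorted(glob.glob(f"{dist_dir}/*"))


cmd = twine_check_cmd()
if shutil.which("twine") and len(cmd) > 2:
    # twine validates the long description of each sdist/wheel it is given
    subprocess.run(cmd)
else:
    print("nothing to check; would run:", " ".join(cmd))
```

Run after `python setup.py sdist bdist_wheel` (or equivalent) so there is something in `dist/` to validate.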
mtreinish | yeah the README for stestr is rst, but I have other projects that use md so I figured better to have it | 19:28 |
ssbarnea|rover | mtreinish: i had the same impression, better to have it. | 19:29 |
ssbarnea|rover | clarkb: wes replied, on https://review.openstack.org/#/c/620950/ -- you can make a decision. I think in this case it is better to have voting on. | 19:31 |
ssbarnea|rover | to eliminate expected noise | 19:31 |
*** graphene has quit IRC | 19:32 | |
*** graphene has joined #openstack-infra | 19:33 | |
clarkb | ssbarnea|rover: ok I'm ok if we choose voting, just wanted to make sure it was explicit | 19:34 |
ssbarnea|rover | clarkb: now we only need to find someone else to workflow these. slowly we improve the categorization rate. | 19:36 |
*** markvoelker has joined #openstack-infra | 19:36 | |
clarkb | ssbarnea|rover: as I mentioned I'm happy to approve the query changes with my +2 if no one else reviews them today. The py3 port should get more eyes though | 19:36 |
*** jamesmcarthur has joined #openstack-infra | 19:37 | |
ssbarnea|rover | i wonder if we track its value over time, so we can see how it goes. it would be nice to have an alarm: when it goes above x% we start working on it until we bring it to y%. | 19:38 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 19:38 |
ssbarnea|rover | i do find elastic-recheck extremly useful, especially for those doing ruck/rovering. | 19:38 |
fungi | clarkb: i think in the past we've only expected a single +2 on e-r query additions/removals | 19:39 |
*** dpawlik has joined #openstack-infra | 19:39 | |
fungi | mriedem can correct me there | 19:39 |
clarkb | fungi: yes, though typically those reviews were from mriedem and mtreinish who know how to review the changes :) | 19:39 |
clarkb | I'm quite rusty :) | 19:39 |
mriedem | it's one thing if it's a query | 19:40 |
mriedem | these are py3 changes right? | 19:40 |
*** markvoelker has quit IRC | 19:40 | |
clarkb | mriedem: one change is a py3 porting. That one should have multiple reviews. The others are all queries which I've +2'd and can approve if that is what we want | 19:40 |
mriedem | oh ok let me review | 19:42 |
corvus | clarkb: ns2 reboot post comodo wfm | 19:43 |
ssbarnea|rover | mriedem: thanks, i am here to answer your question. the only tricky part was around a few lp modules managed by canonical, which were initially not py3 ready, but they made a new release.... those libraries do not even have a *CI*... :D | 19:43 |
mtreinish | ooh, an updated review on: https://github.com/ansible/ansible/pull/23769 maybe we won't have to carry a local version soon | 19:48 |
fungi | that would be swell | 19:49 |
fungi | latest state just got two shipits. i think you're set? | 19:50 |
*** jamesmcarthur has quit IRC | 19:50 | |
* fungi has no idea how the review process for ansible works | 19:50 | |
mtreinish | nor do I | 19:51 |
ssbarnea|rover | mtreinish: i can help you with a few hints: ping key people on #ansible-devel -- bcoca helped me many times. | 19:51 |
ssbarnea|rover | mtreinish: ok, now you got feedback on it, be sure you address it. | 19:52 |
clarkb | ok ansible + puppet are not running | 19:52 |
*** jamesmcarthur has joined #openstack-infra | 19:52 | |
clarkb | this explains why I'm waiting on dns records for longer than I expected | 19:52 |
clarkb | Failed to discover available identity versions when contacting https://La1.citycloud.com:5000/v3/. Attempting to parse version from URL. | 19:53 |
clarkb | hrm is that a cloud outage? we should've fixed the citycloud per region keystone thing | 19:54 |
clarkb | and I get 502 bad gateway if I try to talk to that url | 19:54 |
mtreinish | ssbarnea|rover: thanks, it might be a while though. I don't have a lot of bandwidth for it right now. It sat idle for a long time and is low on my prio list right now | 19:54 |
clarkb | mordred: ^ any thoughts? | 19:54 |
*** wolverineav has quit IRC | 19:55 | |
ssbarnea|rover | clarkb: mtreinish regarding elastic-recheck I observed that in many cases, before a CR is reviewed, logstash has already recycled the logs, so we should aim to review changes while they are fresh. I aim to review all in 24-48h to avoid this. | 19:56 |
*** wolverineav has joined #openstack-infra | 19:56 | |
*** sshnaidm is now known as sshnaidm|afk | 19:56 | |
*** markvoelker has joined #openstack-infra | 20:00 | |
*** jamesmcarthur has quit IRC | 20:00 | |
mriedem | not sure why specific tests are called out in this https://review.openstack.org/#/c/617579/ | 20:00 |
*** wolverineav has quit IRC | 20:00 | |
clarkb | http://cnstatus.com/?p=4413 maybe the issue is that | 20:00 |
mriedem | as generic ssh failures can hit most of tempest now since it runs with validation in tempest-full jobs | 20:00 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize ImageNotFoundException on tripleo jobs https://review.openstack.org/620114 | 20:02 |
mordred | clarkb: looking | 20:04 |
*** wolverineav has joined #openstack-infra | 20:04 | |
mordred | clarkb: yeah - I think it's likely that | 20:05 |
mordred | clarkb: sto2.citycloud.com is working | 20:05 |
*** wolverineav has quit IRC | 20:10 | |
mriedem | ssbarnea|rover: https://review.openstack.org/#/c/616578/9 | 20:10 |
ssbarnea|rover | sure, taking care of these now. | 20:11 |
*** slaweq has quit IRC | 20:11 | |
*** irdr has quit IRC | 20:12 | |
*** jtomasek has quit IRC | 20:12 | |
mriedem | clarkb: btw, i haven't seen e-r commenting on failures lately | 20:13 |
mriedem | i wonder if the log index workers are overwhelmed with tripleo console log indexing? | 20:13 |
mriedem | one of the comments when you brought this up as a goal in berlin was that it'd be nice if we had some kind of status page / dashboard for the logstash workers and/or e-r bot to know if it's off the rails | 20:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Add arbitrary node attributes config option https://review.openstack.org/620691 | 20:14 |
clarkb | mriedem: we do sort of have one for the logstash workers. http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 the logstash job queue graph there is part of the zuul status | 20:14 |
clarkb | mriedem: it looks like it's keeping up, though individual files in the pipeline may be lagging more than say 20 minutes | 20:15 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible https://review.openstack.org/616578 | 20:15 |
fungi | possible the bot has crashed/hung? | 20:15 |
mriedem | a lot of the time, e-r is dead or something | 20:16 |
mriedem | b/c of a bad query or something like that | 20:16 |
mriedem | although i'd think a bad query would also break the graph | 20:16 |
*** ralonsoh has quit IRC | 20:16 | |
fungi | in the middle of making fried rice, but can take a look once i'm done | 20:17 |
mriedem | oh i didn't know efried was there | 20:17 |
clarkb | mriedem: ya I would expect that to be the case too | 20:17 |
efried | :P | 20:17 |
clarkb | fungi: thanks! | 20:17 |
* clarkb returns to reviewing relative priority support in zuul | 20:17 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: fix tox python3 overrides https://review.openstack.org/605618 | 20:18 |
fungi | my kitchen would be a lot more awesome if efried were running it, i'm sure | 20:18 |
ianw | infra-root: can we look at ansible 2.7.2 install for bridge with -> https://review.openstack.org/#/c/617218/ . the other version didn't get reviews, and the cloud-launcher is still broken. i know we're not rolling out cloud changes, but evidence shows it tends to bitrot easily | 20:18 |
*** eernst has joined #openstack-infra | 20:19 | |
efried | I make a mean fried rice. Though I'm better at curries. | 20:19 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Include query results in graph https://review.openstack.org/260188 | 20:19 |
*** florianf is now known as florianf|afk | 20:19 | |
*** irdr has joined #openstack-infra | 20:19 | |
ianw | infra-root: and actually, now i look at http://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=now-12h&to=now ... clearly something has gone wrong | 20:20 |
*** e0ne has joined #openstack-infra | 20:20 | |
pabelanger | Nice, didn't know it was hooked up to grafana | 20:21 |
clarkb | ianw: see above, citycloud outage is preventing us from generating inventory | 20:21 |
*** eernst has quit IRC | 20:21 | |
clarkb | ianw: http://cnstatus.com/?p=4413 is the issue I think (sparse on details though) | 20:21 |
ianw | clarkb: oh, oh cool, if we have a reason good :) | 20:21 |
ianw | mordred: if around, would be great if you could review at least the glean bits of https://review.openstack.org/#/q/status:open+topic:fedora29 to enable networkmanager support | 20:23 |
*** eernst has joined #openstack-infra | 20:25 | |
*** jamesmcarthur has joined #openstack-infra | 20:25 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize ovs crash bug #1805176 https://review.openstack.org/620105 | 20:26 |
openstack | bug 1805176 in tripleo "tripleo jobs failing to setup bridge: fatal_signal|WARN|terminating with signal 14 (Alarm clock)" [High,Triaged] https://launchpad.net/bugs/1805176 | 20:26 |
clarkb | mriedem: fwiw on the bug that is limited to those specific tests I also left a note that maybe the qa team wants to be involved since many jobs seem to match | 20:26 |
clarkb | ssbarnea|rover: as far as alerting goes, we've generally tried to avoid any semblance of on call, must-react-now type behavior. Instead we present the data so that it can be consumed by individuals as they have time/ability | 20:27 |
clarkb | ssbarnea|rover: so we generate graphs. We could also maybe light a batsignal if thresholds are reached that requires you to "look to the sky" or wherever for that rather than it hitting your laptop/phone | 20:28 |
*** eernst has quit IRC | 20:29 | |
*** jamesmcarthur has quit IRC | 20:29 | |
*** e0ne has quit IRC | 20:30 | |
ssbarnea|rover | clarkb: sure. | 20:30 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize error mounting image volumes due to libpod bug[1] https://review.openstack.org/619059 | 20:31 |
clarkb | ssbarnea|rover: I think in the past the first (less explicit) signal has been the categorization rate being under like 80% | 20:31 |
clarkb | since the first order of business is tracking the issues, then gaining understanding, then fixing them | 20:31 |
clarkb | we can probably have some metric we call out on the stuff we understand too. Like X failures in a day | 20:32 |
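The threshold idea above can be sketched in a few lines: compute the categorization rate and flag when it drops below the ~80% floor mentioned earlier (function names are illustrative, not part of elastic-recheck):

```python
def categorization_rate(categorized, total):
    """Fraction of observed failures matched by an elastic-recheck query."""
    if total == 0:
        return 1.0  # nothing to categorize counts as fully categorized
    return categorized / total


def needs_attention(categorized, total, floor=0.80):
    """Light the 'batsignal' when the rate drops below the floor."""
    return categorization_rate(categorized, total) < floor


print(needs_attention(70, 100))   # 70% categorized -> True, below the 80% floor
print(needs_attention(95, 100))   # 95% categorized -> False
```

The same shape works for the "X failures in a day" metric: swap the rate for a count and flip the comparison.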
*** slaweq has joined #openstack-infra | 20:32 | |
ssbarnea|rover | clarkb: practical question, this is uncategorized: http://logs.openstack.org/04/618604/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/e892675/job-output.txt.gz -- any idea on how to categorize it? | 20:35 |
ssbarnea|rover | i guess i found the string: "PRE-RUN END RESULT_UNREACHABLE" | 20:36 |
clarkb | ssbarnea|rover: I think we already have a query for that one. Its a cloud level issue with using duplicate IPs :/ | 20:36 |
clarkb | however because that ran in pre the job will be retried | 20:36 |
clarkb | hrm not seeing a query for it anymore. maybe it was cleaned up? its a known issue we've engaged rax on. But I'm not sure if they know what causes it or if there is a fix | 20:37 |
ssbarnea|rover | clarkb: i was not able to find any bug or query with "PRE-RUN END RESULT_UNREACHABLE" in it, so maybe this one was missed. I will create a new one, unless someone knows an existing one that can be adapted. | 20:42 |
ssbarnea|rover | https://bugs.launchpad.net/openstack-gate/+bug/1805900 | 20:45 |
openstack | Launchpad bug 1805900 in OpenStack-Gate "PRE-RUN END RESULT_UNREACHABLE" [Undecided,New] - Assigned to Sorin Sbarnea (ssbarnea) | 20:45 |
*** jcoufal has quit IRC | 20:46 | |
fungi | recheck 21279 0.2 14.0 964108 569948 ? Sl Nov16 40:13 /usr/bin/python /usr/local/bin/elastic-recheck /etc/elastic-recheck/elastic-recheck.conf | 20:46 |
fungi | looks like the recheck bot is running, at least | 20:46 |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify POST-RUN END RESULT_UNREACHABLE https://review.openstack.org/621004 | 20:50 |
*** wolverineav has joined #openstack-infra | 20:51 | |
*** wolverineav has quit IRC | 20:53 | |
*** wolverineav has joined #openstack-infra | 20:53 | |
*** jamesmcarthur has joined #openstack-infra | 20:53 | |
*** olivierbourdon38 has quit IRC | 20:57 | |
corvus | mordred, clarkb: any news on the ansible issue? | 21:00 |
openstackgerrit | Merged openstack-infra/nodepool master: Asynchronously update node statistics https://review.openstack.org/619589 | 21:00 |
openstackgerrit | Merged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list" https://review.openstack.org/620995 | 21:00 |
clarkb | corvus: pretty sure its that outage I linked to on citycloud status page. We can disable those regions if it persists | 21:01 |
clarkb | I'm eating lunch now. back in a bit | 21:01 |
corvus | mordred, clarkb: our theory with nodepool is that clouds are unreliable so we handle them disappearing gracefully. but we have the same clouds now blocking our operations whenever there's an error. is there a way to mitigate this, or should we just switch to static inventories? | 21:03 |
corvus | clarkb: this can wait till after lunch :) | 21:03 |
*** hamerins has quit IRC | 21:05 | |
mordred | corvus: I'm honestly torn on that question | 21:05 |
mordred | corvus: there's a config flag that can be given to the inventory to not let cloud errors bomb the whole thing out | 21:05 |
fungi | what other alternatives are there? not update the cached inventory for a given provider if shade hits an error trying to query it? | 21:06 |
mordred | but I think we've been reluctant to set it in the past out of fear we'd silently ignore a chunk of our inventory | 21:06 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides https://review.openstack.org/621008 | 21:07 |
mordred | fungi: it's a single inventory and the caching is at the inventory level, so there's not really a way to selectively update or not update portions of the cache | 21:07 |
mordred | we could also switch to static/generated inventories - it's not like the vms we're running in our clouds are terribly dynamic | 21:07 |
mordred | so MOST of the time it's the same servers | 21:07 |
fungi | yeah, i get that would likely entail an overhaul of the dynamic inventory generation | 21:07 |
mordred | but we'd need to think through workflows for that with new server creation and old server deletion | 21:08 |
corvus | mordred: yeah, i'd worry about losing the inventory. i think the biggest problem is we'd have to think about that possibility every time we make a change ("will this role work if half the servers are gone?") | 21:08 |
fungi | perhaps if shade errors aborted inventory generation we could just proceed with the previous cache? | 21:08 |
corvus | previous cache or pre-generation sound better to me | 21:09 |
fungi | as you say, it's the same servers 99.9% of the time anyway | 21:09 |
mordred | fungi: maybe? I'm not sure if we have that possibility - the caching layer is actually handled inside of ansible itself now | 21:09 |
fungi | worst case it goes unnoticed until we need to add or remove a server and notice the inventory's not updating | 21:09 |
corvus | previous cache and pre-generation are functionally the same thing; it sounds like just doing pre-generation would be the simplest way of getting the result | 21:10 |
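The previous-cache / pre-generation approach being settled on here could look roughly like this: attempt generation, atomically replace the inventory file only on success, otherwise serve the last good copy (all names hypothetical):

```python
import json
import os


def refresh_inventory(generate, path):
    """Run `generate` (a callable returning the inventory as a dict); on
    success, atomically replace the file at `path`; on any error, fall back
    to the previously written inventory instead of failing the whole run."""
    try:
        data = generate()
    except Exception:
        # cloud API hiccup: reuse the last successful inventory
        with open(path) as f:
            return json.load(f)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file
    return data
```

A cron job would call this at the start of each run and simply continue with whatever inventory the function returns.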
*** dklyle has joined #openstack-infra | 21:10 | |
mordred | yeah. I mean - I was thinking generating the inventory and shoving it in to git | 21:10 |
corvus | mordred: oh, i was imagining we just run a script at the start of the cron job, and if it errors, continue anyway... | 21:11 |
fungi | right, only copy the generated inventory into place if the pregen script succeeds, otherwise leave the old one in place | 21:11 |
mordred | corvus: I was thinking basically just update the inventory when we create or delete servers | 21:11 |
mordred | and just remove it from being in the cronjob execution path completely | 21:12 |
corvus | auto-proposed changes to git is interesting | 21:12 |
corvus | i don't object to that, but it sounds like extra work and i don't know what we'd gain (but maybe that's lack of imagination on my part) | 21:13 |
*** dpawlik has quit IRC | 21:13 | |
mordred | corvus: I think the main thing I was thinking we'd gain is less moving parts at runtime - and I wasn't thinking auto-proposed as much as "when you're done running launch-node, run this script and submit a patch to git" ... but you're right, at that point I don't know that there's much benefit to having the data in git - other than visibility of the otherwise hidden data | 21:14 |
mordred | corvus: I'm not sure I'm *actually* advocating that we do that- it's just been a thought in the back of my head when we have inventory issues | 21:15 |
corvus | mordred: yeah. that'd work too. | 21:16 |
clarkb | mordred: ansible handling the caching is why I think we are noticing this now | 21:21 |
fungi | if we ever want more automation in server launching though, that's one more wait-for-a-human step | 21:21 |
clarkb | I mean I'm sure it was an issue before but we just used the last version right? | 21:21 |
fungi | granted, we've already got the "submit a patch to add dns records" step which needs to wait for reviewers | 21:22 |
fungi | unless we decide there's a subdomain we want to run completely off an autogenerated zonefile | 21:23 |
clarkb | another option could be to put all of the mirror nodes in their own ansible cron run thing and stop generating inventories for them in the main control plane run | 21:23 |
clarkb | for the main control plane run we only care about vexxhost and rax today | 21:23 |
clarkb | doesn't fix the issue but limits the scope of it | 21:24 |
*** agopi is now known as agopi|pto | 21:24 | |
clarkb | citynetwork says the issue I found is estimated to be fixed at 2400 CET | 21:25 |
clarkb | which is 35 minutes from now? | 21:25 |
clarkb | or is CET only +1? | 21:25 |
*** agopi|pto has quit IRC | 21:26 | |
*** yamamoto has quit IRC | 21:26 | |
clarkb | mordred: would it be worthwhile to suggest to ansible (or implement for ansible) a fallback to the prior cache? I mean what is the cache actually buying us if we can't use it without hitting the clouds? | 21:26 |
*** dpawlik has joined #openstack-infra | 21:28 | |
corvus | clarkb: i was thinking about splitting, but we have some control-plane nodes in non-rax clouds. i don't think i want to inhibit more of that in the future, so i prefer the idea of making it more robust. | 21:29 |
fungi | clarkb: utc+1 | 21:29 |
clarkb | corvus: ya I'm beginning to think the most generally robust thing would be for ansible caching to act as a cache that doesn't need refreshing every run | 21:30 |
fungi | cest (their dst) is utc+2 | 21:30 |
mordred | I've got a patch locally with a static copy of the inventory that we can look at for sake of argument. I'm 99% sure it's safe to push up - does anyone want to double-check it somewhere private before I do? | 21:30 |
clarkb | corvus: but that is also likely the longest fix time wise | 21:30 |
clarkb | mordred: rax sets passwords, if that is done via metadata that might leak out in the inventory? | 21:31 |
mordred | all the adminPass fields are null | 21:31 |
corvus | i'll give it a look, you want to put it on bridge? | 21:31 |
mordred | they only show it to you the one time in the initial server creation response | 21:31 |
mordred | corvus: /home/mordred/static-inventory.yaml on bridge | 21:31 |
clarkb | mordred: ah | 21:31 |
corvus | that's a restless api | 21:31 |
mordred | also - we could write a friendlier generation script than what I have there - we don't actually use 99% of those variables | 21:32 |
clarkb | if we go the static inventory route do we want to use the machine generated inventory applied against our groups.yaml file? or should we just write an inventory that accommodates both things for human and machine consumption | 21:32 |
clarkb | mordred: ya that | 21:32 |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012 | 21:33 |
mordred | clarkb: easiest first step is just to have our current groups.yaml plus a really simple file with a server_name: ansible_host: ip_address list | 21:33 |
*** dpawlik has quit IRC | 21:33 | |
mordred | I mean- we could even leave out the ansible_host thing and just rely on dns | 21:33 |
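The simple server_name / ansible_host file mordred describes could be emitted with a few lines of code (the output follows the standard Ansible YAML inventory shape; the host name below is a placeholder):

```python
def render_static_inventory(hosts):
    """Render a minimal Ansible YAML inventory mapping each server name
    to an ansible_host IP, built by hand to avoid extra dependencies."""
    lines = ["all:", "  hosts:"]
    for name, ip in sorted(hosts.items()):
        lines.append(f"    {name}:")
        lines.append(f"      ansible_host: {ip}")
    return "\n".join(lines) + "\n"


print(render_static_inventory({"mirror01.example.org": "203.0.113.5"}))
```

A launch-node wrapper could call this after server creation and write the result to the checked-in (or on-disk) inventory file.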
*** hamerins has joined #openstack-infra | 21:34 | |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design https://review.openstack.org/621012 | 21:34 |
*** wolverineav has quit IRC | 21:34 | |
*** wolverineav has joined #openstack-infra | 21:35 | |
fungi | i don't think we want to rely on dns for this | 21:35 |
fungi | multiple instances with the same names, being able to update configuration before dns is in place... | 21:36 |
fungi | also allows things to keep working even if dns won't resolve for unrelated (or related!) reasons | 21:37 |
*** wolverineav has quit IRC | 21:38 | |
*** wolverineav has joined #openstack-infra | 21:38 | |
*** markvoelker has quit IRC | 21:38 | |
*** markvoelker has joined #openstack-infra | 21:38 | |
corvus | is our rax user id at all sensitive? | 21:39 |
*** dklyle has quit IRC | 21:39 | |
clarkb | I don't think so. | 21:40 |
corvus | i kind of doubt it. but that's the only thing i can think to question. | 21:40 |
corvus | the file lgtm. | 21:40 |
*** markvoelker has quit IRC | 21:43 | |
mordred | ok. I've actually got a slimmed down version | 21:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Switch to a static inventory https://review.openstack.org/621031 | 21:44 |
*** rlandy is now known as rlandy|biab | 21:44 | |
clarkb | mordred: the location: block there isn't gonna cause ansible to do any lookups that would fail similarly to inventory generation right? (I don't think so) | 21:45 |
clarkb | just double checking that it is info only | 21:45 |
mordred | nope. it's just a piece of metadata from the shade record that I thought might be useful to us as humans looking at a record | 21:45 |
clarkb | ++ | 21:45 |
*** kjackal has quit IRC | 21:46 | |
clarkb | I'm willing to give ^ a go. It will change how we launch new servers, which might be a little weird until we get into the practice of that | 21:46 |
*** dpawlik has joined #openstack-infra | 21:46 | |
mordred | now - if we decided to go this route - we probably want to make a script that generates that file decently - I pulled that one from the json in the ansible inventory cache and then did some transforms on it | 21:46 |
mordred | so consider it 'hand made' | 21:46 |
clarkb | mordred: and maybe we check in the script not the file? | 21:47 |
clarkb | like it could just run the regular inventory generation, and if that failed use the last successful result? | 21:47 |
mordred | well - we should _definitely_ check in the script ... but I think running it regularly as part of runs doesn't gain much value | 21:47 |
*** hamzy_ has quit IRC | 21:47 | |
mordred | if we don't check the inventory in - we should at most just run it after launch-node | 21:47 |
mordred | (if we're not going to be fully dynamic) | 21:48 |
*** jamesmcarthur has quit IRC | 21:48 | |
mordred | but honestly - I still don't know what I think about this :) | 21:48 |
*** jamesmcarthur has joined #openstack-infra | 21:48 | |
clarkb | ya I mostly want to avoid needing to launch node without any inventory (so you only get base server), then push a git change, wait for two people to approve it, then be able to run ansible/puppet/docker on your new server | 21:49 |
clarkb | if we have to do that temporarily for a bit thats fine, and maybe we discover its not that painful | 21:49 |
clarkb | mordred: what does the ansible cache actually cache? | 21:50 |
clarkb | I think understanding ^ may help us formulate a plan too. Like maybe its a matter of using the cache more effectively? | 21:50 |
mordred | clarkb: if you look in ./playbooks/roles/install-ansible/files/inventory_plugins/openstack.py | 21:51 |
mordred | around line 193 | 21:52 |
mordred | that's where the generation sets the cache data | 21:52 |
mordred | clarkb: maybe what we need is to be able to tell if fail_on_errors caused anything to be skipped | 21:53 |
mordred | clarkb: and if so, skip the cache.set step | 21:53 |
mordred | so that we can run during that period on partial data | 21:55 |
mordred | but not cache the partial data, so we're sure to get full data once it comes back? | 21:55 |
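mordred's run-on-partial-data-but-don't-cache-it idea, as a toy sketch (class and method names are invented; a plain dict stands in for ansible's cache plugin):

```python
class InventorySource:
    """Serve partial data when a cloud errors mid-listing, but only
    persist to the cache when every cloud listed cleanly, so the next
    run retries the full listing instead of trusting partial data."""

    def __init__(self, clouds, cache):
        self.clouds = clouds  # mapping: cloud name -> callable returning servers
        self.cache = cache    # dict standing in for ansible's cache layer

    def refresh(self):
        servers, had_errors = [], False
        for name, list_servers in self.clouds.items():
            try:
                servers.extend(list_servers())
            except Exception:
                had_errors = True  # keep going: partial inventory this run
        if not had_errors:
            self.cache["inventory"] = servers  # cache only complete listings
        return servers
```

This mirrors the `fail_on_errors`-style tradeoff discussed above: the run proceeds, but nothing incomplete sticks around in the cache.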
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests https://review.openstack.org/615356 | 21:57 |
*** hamerins has quit IRC | 21:57 | |
clarkb | mriedem: fungi looking at the bot logs it seems that the bot is behind on querying console logs? like it's looking for a failure in neutron-grenade from the 26th against today's index so that has no results | 21:57 |
*** hamerins has joined #openstack-infra | 21:57 | |
clarkb | mriedem: fungi: I think this is a bug in how we check for current results in e-r, not an issue with indexing? | 21:57 |
*** rcernin has joined #openstack-infra | 21:57 | |
mriedem | hmm | 21:58 |
clarkb | the data is there in the index from a few days ago. My guess is over time we get further and further behind then we start querying newer indexes for older data and never get results and then at that point are forever behind on the bot side | 21:59 |
clarkb | mordred: ++ | 21:59 |
fungi | oh, yeah i looked at the log but didn't notice the time on those events | 21:59 |
*** markvoelker has joined #openstack-infra | 22:01 | |
*** hamerins has quit IRC | 22:02 | |
*** jamesmcarthur has quit IRC | 22:03 | |
clarkb | basically we have such a long timeout that things in the queue pile up for a relatively short period of time and we'll be backlogged long enough that future queries stop working | 22:04 |
clarkb | it's almost like we want a more global timeout rather than timing out per event | 22:04 |
clarkb | event comes in, mark that time, then if after 20 minutes from then (regardless of how quickly anything before it went) we don't have results, move on | 22:05 |
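clarkb's per-event deadline could be sketched like this: each event carries its own arrival time, and anything older than the deadline is dropped rather than queried, regardless of how long earlier events took (names are invented, not the actual elastic-recheck code):

```python
import time


def process_events(queue, handle, deadline=20 * 60, now=time.monotonic):
    """queue is an iterable of (arrival_time, event) pairs. Events whose
    age exceeds `deadline` seconds are dropped instead of handled, since
    their results have likely aged out of the daily index by then."""
    handled, dropped = [], []
    for arrival, event in queue:
        if now() - arrival > deadline:
            dropped.append(event)   # too old: skip the query entirely
        else:
            handled.append(handle(event))
    return handled, dropped
```

Injecting `now` as a parameter keeps the deadline logic testable without real clock time.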
*** trown is now known as trown|outtypewww | 22:05 | |
*** yamamoto has joined #openstack-infra | 22:06 | |
fungi | i can restart it for now i guess | 22:10 |
*** xek has joined #openstack-infra | 22:11 | |
clarkb | ya that should reset things | 22:11 |
*** manjeets has quit IRC | 22:15 | |
fungi | #status log manually restarted elastic-recheck service on status.openstack.org to clear event backlog | 22:16 |
openstackstatus | fungi: finished logging | 22:16 |
*** yamamoto has quit IRC | 22:18 | |
*** dklyle has joined #openstack-infra | 22:19 | |
*** rlandy|biab is now known as rlandy | 22:21 | |
*** graphene has quit IRC | 22:23 | |
*** dklyle has quit IRC | 22:24 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests https://review.openstack.org/620954 | 22:26 |
*** manjeets has joined #openstack-infra | 22:28 | |
*** rh-jelabarre has quit IRC | 22:29 | |
*** manjeets has quit IRC | 22:29 | |
*** manjeets has joined #openstack-infra | 22:29 | |
*** dklyle has joined #openstack-infra | 22:32 | |
*** mriedem is now known as mriedem_afk | 22:33 | |
*** sshnaidm|afk is now known as sshnaidm|off | 22:37 | |
*** agopi has joined #openstack-infra | 22:38 | |
*** slaweq has quit IRC | 22:41 | |
openstackgerrit | Clark Boylan proposed openstack-infra/elastic-recheck master: Better event checking timeouts https://review.openstack.org/621038 | 22:42 |
clarkb | mriedem_afk: fungi ^ something like that should help | 22:42 |
*** tpsilva has quit IRC | 22:43 | |
*** dklyle has quit IRC | 22:44 | |
*** xek has quit IRC | 22:44 | |
clarkb | ssbarnea|rover: ^ you may be interested in that too (though it doesn't affect the dashboard generation of elastic-recheck, just the IRC and gerrit commenting) | 22:45 |
*** jonher has joined #openstack-infra | 22:45 | |
*** dpawlik has quit IRC | 22:47 | |
*** kgiusti has left #openstack-infra | 22:47 | |
*** dpawlik has joined #openstack-infra | 22:48 | |
*** dpawlik has quit IRC | 22:48 | |
*** slaweq has joined #openstack-infra | 22:53 | |
ianw | dmsimard: thanks for review :) i just removed the +w as explained, sorry i should have had it marked as wip as it requires a glean release | 22:53 |
*** rkukura_ has joined #openstack-infra | 22:54 | |
*** rkukura has quit IRC | 22:57 | |
*** rkukura_ is now known as rkukura | 22:57 | |
*** slaweq has quit IRC | 22:57 | |
openstackgerrit | Merged openstack-infra/project-config master: Remove ansible-role-redhat-subscription from central repo https://review.openstack.org/617974 | 22:58 |
openstackgerrit | Merged openstack-infra/project-config master: add jobs to publish library from governance repo https://review.openstack.org/619347 | 22:58 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var https://review.openstack.org/619119 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support https://review.openstack.org/619120 | 23:00 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting" https://review.openstack.org/620201 | 23:03 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var https://review.openstack.org/619119 | 23:03 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support https://review.openstack.org/619120 | 23:03 |
*** boden has quit IRC | 23:07 | |
*** lpetrut has quit IRC | 23:07 | |
*** mgutehal_ has joined #openstack-infra | 23:08 | |
*** mgutehall has quit IRC | 23:09 | |
*** agopi has quit IRC | 23:16 | |
*** eernst has joined #openstack-infra | 23:17 | |
*** eernst has quit IRC | 23:22 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota https://review.openstack.org/621040 | 23:22 |
*** jamesdenton has quit IRC | 23:23 | |
*** dhellmann_ has joined #openstack-infra | 23:26 | |
*** eernst has joined #openstack-infra | 23:26 | |
*** dhellmann has quit IRC | 23:26 | |
*** eernst has quit IRC | 23:27 | |
*** dhellmann_ is now known as dhellmann | 23:30 | |
*** jamesdenton has joined #openstack-infra | 23:32 | |
ianw | http://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=1543271530069&to=1543377490234 this does seem to be a consistent increase in bridge runtime from around 2018-11-27 | 23:32 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes https://review.openstack.org/621043 | 23:38 |
*** pbourke has quit IRC | 23:46 | |
*** manjeets has quit IRC | 23:46 | |
*** manjeets has joined #openstack-infra | 23:46 | |
*** pbourke has joined #openstack-infra | 23:47 | |
ianw | looks like the missing 1/2 hour is in here -> http://paste.openstack.org/show/736459/ | 23:48 |
clarkb | puppet unhappy on the arm node? | 23:49 |
ianw | i'm guessing so, logs on host look weird | 23:55 |
ianw | Nov 29 11:30:19 mirror01 puppet-user[31959]: Compiled catalog for mirror01.nrt1.arm64ci.openstack.org in environment production in 4.82 seconds | 23:55 |
ianw | like it starts but then nothing? | 23:55 |
ianw | oh, wow, a lot of stuck processes | 23:55 |
ianw | interesting, attach strace to one and now strace is dead | 23:56 |
ianw | oh dear, i think we have a smoking gun | 23:57 |
ianw | 1582391.992063] Call trace: | 23:57 |
ianw | [1582391.998695] afs_linux_raw_open+0x114/0x158 [openafs] | 23:57 |
ianw | [1582392.008571] osi_UFSOpen+0xa4/0x1d8 [openafs] | 23:57 |
ianw | afs has got stuck | 23:57 |
ianw | i'm rebooting the host | 23:57 |
ianw | actually | 23:58 |
ianw | [1241534.846289] print_req_error: I/O error, dev sdb, sector 314934816 | 23:58 |
ianw | [1241534.854136] Aborting journal on device dm-1-8. | 23:58 |
ianw | [1241534.964702] sd 0:0:0:1: [sdb] tag#79 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE | 23:58 |
ianw | seems it's not happy in multiple ways | 23:58 |
clarkb | ouch that is a cinder volume? | 23:59 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!