*** markvoelker has quit IRC | 00:02 | |
*** Goneri has quit IRC | 00:06 | |
ianw | fedora mirroring timed out. i'm rerunning on afs01 under localauth ... | 00:07 |
ianw | lock still held on mirror-update | 00:08 |
*** bobh has joined #openstack-infra | 00:11 | |
*** bobh has quit IRC | 00:17 | |
*** rcernin has quit IRC | 00:30 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add fedora-30 testing to gate https://review.opendev.org/680531 | 00:32 |
*** ccamacho has quit IRC | 00:51 | |
ianw | pabelanger: ^ yay, i think that's it for a dib release with f30 support. will cycle back on it and get it going | 00:53 |
johnsom | FYI, even a patch with no changes in it (extra carriage return) is failing and retrying. So it's definitely not that one patch. | 00:59 |
johnsom | https://www.irccloud.com/pastebin/LBgx6Mhh/ | 00:59 |
ianw | so it's like 213.32.79.175 just disappeared on you? | 01:03 |
johnsom | That is the end of the console stream, then zuul retries it. Sometimes not even one tempest test reports before that happens. | 01:05 |
ianw | johnsom: i have debugged such things before, it is a pain :) in one case we tickled a kernel oops on fedora | 01:06 |
johnsom | That ssh isn't part of our test to my knowledge. I think it comes from ansible | 01:06 |
ianw | johnsom: i don't think it has been ported to !devstack-gate, but https://opendev.org/openstack/devstack-gate/src/branch/master/devstack-vm-gate-wrap.sh#L446 might be a good thing to try | 01:06 |
ianw | https://opendev.org/openstack/devstack-gate/src/branch/master/functions.sh#L972 is the actual setup | 01:07 |
ianw | if you have something listening for that, it might let you catch why the host just dies | 01:07 |
johnsom | Hmm, that would be handy actually. I could just hijack the syslog from the nodepool level. I would have to come up with a target. I don't have any cloud instances anymore. | 01:09 |
johnsom | ianw Thanks for the idea! | 01:09 |
ianw | i should just pull that into a quick role ... | 01:10 |
ianw | johnsom: you should be able to pull one with a public address via rdocloud, but i can create one for you if you like | 01:11 |
johnsom | Yeah, I haven't dipped my toe in those waters yet. I should learn the process. | 01:12 |
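What ianw points at boils down to loading the netconsole module on the test node so kernel messages stream over UDP to a listener that outlives the host. A minimal sketch, assuming root access and a reachable listener; the addresses, interface, MAC and ports below are placeholders, not values from the jobs discussed here:

```bash
# Sender (the test node): stream kernel messages over UDP.
# Parameter format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
sudo modprobe netconsole \
  netconsole=6665@198.51.100.5/eth0,6666@203.0.113.10/aa:bb:cc:dd:ee:ff

# Listener (e.g. a cloud instance you keep around): capture whatever the
# dying kernel manages to emit before the host goes away.
nc -u -l 6666 | tee netconsole.log
```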
*** sgw has quit IRC | 01:22 | |
*** exsdev has quit IRC | 01:26 | |
*** exsdev0 has joined #openstack-infra | 01:26 | |
*** exsdev0 is now known as exsdev | 01:27 | |
*** exsdev has quit IRC | 01:39 | |
*** exsdev has joined #openstack-infra | 01:40 | |
*** sgw has joined #openstack-infra | 01:44 | |
*** rcernin has joined #openstack-infra | 01:45 | |
*** yamamoto has joined #openstack-infra | 01:46 | |
*** gregoryo has joined #openstack-infra | 01:50 | |
*** yamamoto has quit IRC | 02:26 | |
*** yamamoto has joined #openstack-infra | 02:26 | |
*** bobh has joined #openstack-infra | 02:37 | |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 02:40 |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 02:42 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 02:46 |
*** yamamoto has quit IRC | 03:11 | |
*** ramishra has joined #openstack-infra | 03:13 | |
*** bobh has quit IRC | 03:14 | |
*** rh-jelabarre has joined #openstack-infra | 03:27 | |
*** dklyle has quit IRC | 03:29 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 03:40 |
*** yamamoto has joined #openstack-infra | 03:47 | |
*** ykarel has joined #openstack-infra | 03:49 | |
*** udesale has joined #openstack-infra | 03:53 | |
*** yamamoto has quit IRC | 04:11 | |
*** soniya29 has joined #openstack-infra | 04:12 | |
*** ykarel has quit IRC | 04:17 | |
*** ykarel has joined #openstack-infra | 04:36 | |
*** yamamoto has joined #openstack-infra | 04:40 | |
*** eernst has joined #openstack-infra | 04:54 | |
*** markvoelker has joined #openstack-infra | 04:55 | |
*** markvoelker has quit IRC | 05:00 | |
*** jaosorior has joined #openstack-infra | 05:03 | |
*** ricolin has joined #openstack-infra | 05:08 | |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Use promote job for releasenotes https://review.opendev.org/678430 | 05:08 |
AJaeger_ | config-core, please review https://review.opendev.org/677158 for cleanup and https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and extra tests for zuul-jobs: https://review.opendev.org/678573 | 05:10 |
openstackgerrit | Jan Kubovy proposed zuul/zuul master: Evaluate CODEOWNERS settings during canMerge check https://review.opendev.org/644557 | 05:15 |
*** eernst has quit IRC | 05:29 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 05:29 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 05:44 |
*** raukadah is now known as chandankumar | 05:46 | |
*** kopecmartin|off is now known as kopecmartin | 05:56 | |
*** jtomasek has joined #openstack-infra | 06:02 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 06:02 |
*** e0ne has joined #openstack-infra | 06:09 | |
*** jbadiapa has joined #openstack-infra | 06:13 | |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt https://review.opendev.org/628466 | 06:13 |
*** e0ne has quit IRC | 06:18 | |
*** jtomasek has quit IRC | 06:22 | |
openstackgerrit | Merged openstack/project-config master: Add file matcher to releasenotes promote job https://review.opendev.org/679856 | 06:22 |
openstackgerrit | Merged zuul/zuul-jobs master: Switch releasenotes to fetch-sphinx-tarball https://review.opendev.org/678429 | 06:28 |
*** snapiri has quit IRC | 06:35 | |
*** snapiri has joined #openstack-infra | 06:36 | |
*** snapiri has quit IRC | 06:36 | |
*** jtomasek has joined #openstack-infra | 06:40 | |
*** ccamacho has joined #openstack-infra | 06:42 | |
*** mattw4 has joined #openstack-infra | 06:44 | |
*** slaweq_ has joined #openstack-infra | 06:49 | |
*** snapiri has joined #openstack-infra | 06:49 | |
*** pgaxatte has joined #openstack-infra | 06:55 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add Fedora 30 nodes https://review.opendev.org/680919 | 06:56 |
*** shachar has joined #openstack-infra | 06:58 | |
*** mattw4 has quit IRC | 06:59 | |
*** yamamoto has quit IRC | 07:00 | |
*** snapiri has quit IRC | 07:00 | |
ianw | infra-root: appreciate eyes on https://review.opendev.org/680895 to fix testinfra testing for good; after my pebkac issues with how to depends-on: for github it works | 07:01 |
*** rcernin has quit IRC | 07:02 | |
*** slaweq_ has quit IRC | 07:06 | |
*** slaweq has joined #openstack-infra | 07:08 | |
*** yamamoto has joined #openstack-infra | 07:09 | |
*** trident has quit IRC | 07:10 | |
*** tesseract has joined #openstack-infra | 07:13 | |
*** hamzy_ has quit IRC | 07:19 | |
*** trident has joined #openstack-infra | 07:21 | |
*** hamzy_ has joined #openstack-infra | 07:29 | |
*** gfidente has joined #openstack-infra | 07:31 | |
*** ykarel is now known as ykarel|lunch | 07:31 | |
*** owalsh has quit IRC | 07:32 | |
*** owalsh has joined #openstack-infra | 07:33 | |
*** aluria has quit IRC | 07:33 | |
*** jpena|off is now known as jpena | 07:35 | |
*** aluria has joined #openstack-infra | 07:38 | |
*** kjackal has joined #openstack-infra | 07:45 | |
*** kaiokmo has joined #openstack-infra | 07:46 | |
*** pcaruana has joined #openstack-infra | 07:49 | |
*** markvoelker has joined #openstack-infra | 07:50 | |
*** sshnaidm|afk is now known as sshnaidm | 07:51 | |
*** sshnaidm is now known as sshnaidm|ruck | 07:51 | |
*** markvoelker has quit IRC | 07:55 | |
*** rpittau|afk is now known as rpittau | 07:55 | |
*** xenos76 has joined #openstack-infra | 07:56 | |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems https://review.opendev.org/680937 | 08:04 |
*** apetrich has joined #openstack-infra | 08:05 | |
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occuring on fast systems https://review.opendev.org/680937 | 08:06 |
*** dchen has quit IRC | 08:07 | |
*** panda has quit IRC | 08:09 | |
*** panda has joined #openstack-infra | 08:11 | |
*** rascasoft has quit IRC | 08:11 | |
*** rascasoft has joined #openstack-infra | 08:12 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Add a netconsole role https://review.opendev.org/680901 | 08:14 |
ianw | johnsom: ^ working for me | 08:16 |
*** ralonsoh has joined #openstack-infra | 08:18 | |
*** e0ne has joined #openstack-infra | 08:19 | |
jrosser | ianw: doesnt ansible_default_ipv4 contain the fields you need there rather than a bunch of shell commands? | 08:19 |
*** ykarel|lunch is now known as ykarel | 08:23 | |
*** tkajinam has quit IRC | 08:42 | |
*** derekh has joined #openstack-infra | 08:43 | |
*** gregoryo has quit IRC | 08:48 | |
*** kaiokmo has quit IRC | 08:48 | |
ianw | jrosser: yes, quite likely! it was just more or less a straight port from the old code that has been working for a long time | 08:56 |
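For reference, jrosser's suggestion is easy to check against a live host; the host alias below is hypothetical:

```bash
# Dump the fact jrosser mentions (host alias "testnode" is hypothetical).
ansible testnode -m setup -a 'filter=ansible_default_ipv4'
# The returned fact includes address, interface, gateway and macaddress,
# which could stand in for the ip/route shell pipeline ported from devstack-gate.
```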
openstackgerrit | Simon Westphahl proposed zuul/zuul master: Fix timestamp race occurring on fast systems https://review.opendev.org/680937 | 08:59 |
*** hamzy_ has quit IRC | 09:07 | |
*** ociuhandu has joined #openstack-infra | 09:08 | |
*** ociuhandu has quit IRC | 09:10 | |
*** ociuhandu has joined #openstack-infra | 09:10 | |
*** ociuhandu has quit IRC | 09:14 | |
*** ociuhandu_ has joined #openstack-infra | 09:14 | |
*** ociuhandu_ has quit IRC | 09:18 | |
zbr | ianw: do you know who can help with a new bindep release? last one was more than a year ago. | 09:18 |
*** soniya29 has quit IRC | 09:23 | |
*** ociuhandu has joined #openstack-infra | 09:27 | |
*** arxcruz_pto is now known as arxcruz | 09:27 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 09:35 |
ianw | zbr: i would ask fungi first, i think he is back today his time. just to see if there was anything in the queue. but if he doesn't have time, etc. ping me back and i imagine i could tag one tomorrow too | 09:36 |
zbr | ianw: thanks. i think that there are a few more patches I could fix, and a bug that prevents running the tests on macos (mocking fails to work), and after this we can make a release. | 09:37 |
*** yamamoto has quit IRC | 09:40 | |
*** kaisers has quit IRC | 09:48 | |
*** kaisers has joined #openstack-infra | 09:50 | |
*** markvoelker has joined #openstack-infra | 09:51 | |
*** shachar has quit IRC | 09:53 | |
*** snapiri has joined #openstack-infra | 09:53 | |
*** ricolin_ has joined #openstack-infra | 09:55 | |
*** lpetrut has joined #openstack-infra | 09:55 | |
*** ricolin has quit IRC | 09:57 | |
*** markvoelker has quit IRC | 10:00 | |
*** rcernin has joined #openstack-infra | 10:08 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin https://review.opendev.org/680962 | 10:13 |
*** ricolin_ has quit IRC | 10:15 | |
*** spsurya has joined #openstack-infra | 10:21 | |
*** soniya29 has joined #openstack-infra | 10:26 | |
*** gfidente has quit IRC | 10:26 | |
*** panda is now known as panda|rover | 10:42 | |
*** kaisers has quit IRC | 10:44 | |
*** ociuhandu has quit IRC | 10:45 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Remove openSUSE 42.3 https://review.opendev.org/677158 | 10:47 |
*** dave-mccowan has joined #openstack-infra | 10:52 | |
*** ricolin_ has joined #openstack-infra | 10:53 | |
*** dciabrin_ has joined #openstack-infra | 10:53 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support https://review.opendev.org/674092 | 10:54 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: add-build-sshkey: add centos/rhel-8 support https://review.opendev.org/674092 | 10:54 |
*** dciabrin has quit IRC | 10:54 | |
*** shachar has joined #openstack-infra | 11:02 | |
*** nicolasbock has joined #openstack-infra | 11:02 | |
*** snapiri has quit IRC | 11:03 | |
*** pgaxatte has quit IRC | 11:05 | |
*** dchen has joined #openstack-infra | 11:05 | |
*** yamamoto has joined #openstack-infra | 11:06 | |
*** rpittau is now known as rpittau|bbl | 11:12 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Fix promote-stx-api-ref https://review.opendev.org/680977 | 11:13 |
*** udesale has quit IRC | 11:15 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 11:15 | |
AJaeger_ | config-core, simple fix, please review ^ | 11:15 |
*** ricolin_ is now known as ricolin | 11:15 | |
*** ociuhandu has joined #openstack-infra | 11:17 | |
*** gfidente has joined #openstack-infra | 11:18 | |
*** ociuhandu has quit IRC | 11:22 | |
*** kaisers has joined #openstack-infra | 11:28 | |
*** pgaxatte has joined #openstack-infra | 11:34 | |
*** jpena is now known as jpena|lunch | 11:35 | |
openstackgerrit | Merged opendev/system-config master: Recognize DISK_FULL failure messages (review_dev) https://review.opendev.org/673893 | 11:37 |
*** ociuhandu has joined #openstack-infra | 11:41 | |
*** dchen has quit IRC | 11:44 | |
*** gfidente has quit IRC | 11:46 | |
*** shachar has quit IRC | 11:48 | |
*** gfidente has joined #openstack-infra | 11:53 | |
*** goldyfruit has quit IRC | 11:56 | |
*** markvoelker has joined #openstack-infra | 11:56 | |
*** snapiri has joined #openstack-infra | 12:02 | |
*** markvoelker has quit IRC | 12:02 | |
*** rlandy has joined #openstack-infra | 12:03 | |
*** yamamoto has quit IRC | 12:05 | |
*** rfolco has joined #openstack-infra | 12:06 | |
*** yamamoto has joined #openstack-infra | 12:09 | |
*** rosmaita has joined #openstack-infra | 12:10 | |
*** hamzy_ has joined #openstack-infra | 12:11 | |
*** markvoelker has joined #openstack-infra | 12:12 | |
*** ociuhandu has quit IRC | 12:21 | |
*** iurygregory has joined #openstack-infra | 12:22 | |
*** rcernin has quit IRC | 12:22 | |
ralonsoh | hello folks, I have one question if you can help me | 12:27 |
ralonsoh | this is about Neutron rally-tasks | 12:27 |
ralonsoh | for example: https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/ | 12:30 |
ralonsoh | there are no logs or results, and the task is failing | 12:30 |
*** jpena|lunch is now known as jpena | 12:31 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 12:32 | |
frickler | ralonsoh: rally seems to be complicating access to logs by generating an index.html file. there are logs showing what I think is the error, see the end of https://d4b9765f6ab6e1413c28-81a8be848ef91b58aa974b4cb791a408.ssl.cf5.rackcdn.com/680427/2/check/neutron-rally-task/01b2c1c/controller/logs/devstacklog.txt.gz | 12:36 |
*** rosmaita has quit IRC | 12:38 | |
*** happyhemant has joined #openstack-infra | 12:39 | |
ralonsoh | frickler, thanks! I'll take a look at this | 12:40 |
frickler | ralonsoh: this is where the index file comes from, probably for our new logging setup it would work better to place this into a subdir and register the result as an artifact in zuul https://opendev.org/openstack/rally-openstack/src/branch/master/tests/ci/playbooks/roles/fetch-rally-task-results/tasks/main.yaml#L52-L64 | 12:43 |
*** ociuhandu has joined #openstack-infra | 12:43 | |
*** aaronsheffield has joined #openstack-infra | 12:44 | |
ralonsoh | frickler, or rename this index file to keep the browser from reading it by default | 12:45 |
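The relocation frickler suggests could look roughly like the following inside the fetch-rally-task-results role; the paths are assumptions rather than what the role actually uses, and registering the subdirectory as an artifact via zuul_return would be the second half of the change:

```bash
# Sketch only: keep the generated report out of the top-level log listing
# so it no longer shadows the directory index. Paths are assumptions.
mkdir -p "$WORKSPACE/logs/rally-report"
mv "$WORKSPACE/rally_html/index.html" "$WORKSPACE/logs/rally-report/index.html"
```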
*** Goneri has joined #openstack-infra | 12:47 | |
*** ociuhandu has quit IRC | 12:48 | |
*** rosmaita has joined #openstack-infra | 12:50 | |
*** rpittau|bbl is now known as rpittau | 12:55 | |
*** mriedem has joined #openstack-infra | 12:57 | |
*** roman_g has joined #openstack-infra | 13:00 | |
*** soniya29 has quit IRC | 13:06 | |
AJaeger_ | clarkb: I think all Fedora 28 changes are merged with exception of devstack... Open changes that I'm aware of are https://review.opendev.org/#/q/topic:fedora-latest+is:open | 13:08 |
*** e0ne has quit IRC | 13:19 | |
*** udesale has joined #openstack-infra | 13:20 | |
*** KeithMnemonic has joined #openstack-infra | 13:24 | |
*** gfidente has quit IRC | 13:30 | |
*** gfidente has joined #openstack-infra | 13:32 | |
*** ociuhandu has joined #openstack-infra | 13:34 | |
*** Goneri has quit IRC | 13:34 | |
*** ociuhandu has quit IRC | 13:39 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: Add prepare-workspace-openshift role https://review.opendev.org/631402 | 13:45 |
*** ykarel is now known as ykarel|afk | 13:45 | |
*** gfidente has quit IRC | 13:46 | |
*** gfidente has joined #openstack-infra | 13:48 | |
*** prometheanfire has quit IRC | 13:49 | |
*** e0ne has joined #openstack-infra | 13:50 | |
*** prometheanfire has joined #openstack-infra | 13:50 | |
*** beekneemech is now known as bnemec | 13:54 | |
*** priteau has joined #openstack-infra | 13:54 | |
*** AJaeger_ is now known as AJaeger | 13:54 | |
AJaeger | config-core, please review https://review.opendev.org/680977 to fix stx API promote job, https://review.opendev.org/#/q/topic:promote-docs+is:open to use promote jobs for releasenotes, and for adding extra tests for zuul-jobs: https://review.opendev.org/678573 | 13:55 |
*** ykarel|afk is now known as ykarel | 13:56 | |
*** zzehring has joined #openstack-infra | 13:57 | |
*** yamamoto has quit IRC | 14:02 | |
*** eharney has joined #openstack-infra | 14:04 | |
*** bdodd has joined #openstack-infra | 14:07 | |
*** ociuhandu has joined #openstack-infra | 14:13 | |
openstackgerrit | Merged zuul/nodepool master: Fix Kubernetes driver documentation https://review.opendev.org/680879 | 14:13 |
openstackgerrit | Merged zuul/nodepool master: Add extra spacing to avoid monospace rendering https://review.opendev.org/680880 | 14:13 |
openstackgerrit | Merged zuul/nodepool master: Fix chroot type https://review.opendev.org/680881 | 14:13 |
*** Goneri has joined #openstack-infra | 14:15 | |
*** jamesmcarthur has joined #openstack-infra | 14:15 | |
*** ociuhandu has quit IRC | 14:17 | |
*** yamamoto has joined #openstack-infra | 14:21 | |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix test execution failure on Darwin https://review.opendev.org/680962 | 14:25 |
dtantsur | hi folks! the release notes job on instack-undercloud stable/queens started failing like this: https://e29864d8017a43b5dd67-658295aea286d472989a3acc71bbfe02.ssl.cf5.rackcdn.com/680698/1/check/build-openstack-releasenotes/fae0512/job-output.txt | 14:25 |
dtantsur | do you have any ideas? | 14:25 |
*** dklyle has joined #openstack-infra | 14:29 | |
*** ykarel is now known as ykarel|afk | 14:30 | |
*** lpetrut has quit IRC | 14:30 | |
*** yamamoto has quit IRC | 14:30 | |
*** yamamoto has joined #openstack-infra | 14:34 | |
*** yamamoto has quit IRC | 14:34 | |
*** yamamoto has joined #openstack-infra | 14:34 | |
AJaeger | dtantsur: let me check, we just switched the implementation... | 14:35 |
*** ociuhandu has joined #openstack-infra | 14:35 | |
dtantsur | AJaeger: it may be that you're trying to do something on master. the master branch is deprecated and purged of contents. | 14:35 |
AJaeger | dtantsur: that explains it... | 14:36 |
AJaeger | dtantsur: we check out master branch for releasenotes. So, if master is dead, kill the releasenotes job. | 14:36 |
dtantsur | AJaeger: is there anything to be done except for removing the releasenotes job? | 14:36 |
dtantsur | ah | 14:36 |
dtantsur | mwhahaha: ^^^ | 14:36 |
AJaeger | dtantsur: releasenotes does basically: git checkout master;tox -e releasenotes | 14:37 |
AJaeger | So, master needs to work for this... | 14:37 |
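AJaeger's description translates into a quick local reproduction; this assumes the usual reno/tox wiring and is not the exact job definition:

```bash
# Roughly what the releasenotes job does; on instack-undercloud this fails
# because the master branch has been emptied.
git clone https://opendev.org/openstack/instack-undercloud
cd instack-undercloud
git checkout master
tox -e releasenotes
```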
dtantsur | this does seem a bit inconvenient since the stable branches are supported and receive changes. | 14:37 |
mwhahaha | just propose the job removal | 14:37 |
*** armax has joined #openstack-infra | 14:37 | |
AJaeger | dtantsur: I don't expect you to make many changes to releasenotes on a released branch anymore... | 14:38 |
AJaeger | dtantsur: to change this, reno would need to be redesigned completely ;( | 14:39 |
dtantsur | fair enough | 14:39 |
mwhahaha | release notes include bug fixes | 14:39 |
*** yamamoto has quit IRC | 14:39 | |
mwhahaha | so i would assume every change should have one | 14:39 |
dtantsur | ++ | 14:39 |
*** ociuhandu has quit IRC | 14:39 | |
AJaeger | normally you do the bug fix on master and then backport ;) - and have master working... | 14:40 |
mwhahaha | ANYWAY let's just remove the job | 14:40 |
AJaeger | agreed | 14:40 |
dtantsur | mwhahaha: want me to do it? | 14:40 |
mwhahaha | yes plz | 14:41 |
AJaeger | dtantsur: best on all active branches | 14:41 |
dtantsur | ack | 14:41 |
dtantsur | thanks AJaeger | 14:41 |
*** markvoelker has quit IRC | 14:41 | |
*** mtreinish has quit IRC | 14:41 | |
AJaeger | glad that we could solve that so quickly - you had me shocked for a moment ;) | 14:41 |
*** mtreinish has joined #openstack-infra | 14:42 | |
AJaeger | This is the change that merged earlier: https://review.opendev.org/#/c/678429/ | 14:42 |
AJaeger | config-core, we can switch releasenotes to promote jobs now: Please review https://review.opendev.org/#/c/678430/ | 14:43 |
dtantsur | aha, so rocky has already been fixed by another patch. but not queens. | 14:43 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:50 |
openstackgerrit | Matt McEuen proposed openstack/project-config master: Add per-project Airship groups https://review.opendev.org/680717 | 14:52 |
*** ociuhandu has joined #openstack-infra | 14:52 | |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:52 |
frickler | AJaeger: do you want to amend the comment on 678430? otherwise just +A | 14:53 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul master: Fix: prevent usage of hashi_vault https://review.opendev.org/681041 | 14:53 |
AJaeger | frickler: what do you propose? | 14:53 |
AJaeger | "Build and publish" -> "build and promote"? | 14:54 |
frickler | AJaeger: sounds more appropriate I'd think, yes | 14:54 |
AJaeger | frickler: let me do a followup - and update the other templates as well... | 14:55 |
* AJaeger will +A | 14:55 | |
frickler | AJaeger: ok | 14:56 |
AJaeger | thanks! | 14:56 |
AJaeger | frickler: could you review https://review.opendev.org/680977 as well, please? | 14:56 |
frickler | AJaeger: I did this morning | 14:58 |
frickler | well, before noon anyway ;) | 14:59 |
*** jroll has joined #openstack-infra | 15:00 | |
*** ykarel|afk is now known as ykarel | 15:00 | |
openstackgerrit | Andreas Jaeger proposed openstack/openstack-zuul-jobs master: Mention promote in template description https://review.opendev.org/681044 | 15:01 |
AJaeger | frickler: ah, thanks. Then I need another core ;) | 15:01 |
AJaeger | frickler: here's the change - is that what you had in mind? ^ | 15:01 |
*** markmcclain has joined #openstack-infra | 15:02 | |
AJaeger | thanks, clarkb ! | 15:03 |
AJaeger | clarkb: care to review https://review.opendev.org/678430 as well? That will switch releasenotes to promote jobs as well... | 15:04 |
clarkb | AJaeger: ya I'll be working through your scrollback list between meeting stuff | 15:04 |
AJaeger | clarkb: thanks | 15:04 |
*** diablo_rojo__ has joined #openstack-infra | 15:05 | |
AJaeger | clarkb: ignore 678430, just approved ;) | 15:05 |
zbr | ianw: clarkb : bindep: please have a look at https://review.opendev.org/#/c/680954/ and let me know if it is ok. had to find a way to pass unittests on all platforms so I can test other incoming patches. | 15:05 |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Use promote job for releasenotes https://review.opendev.org/678430 | 15:06 |
AJaeger | clarkb: https://review.opendev.org/676430 and https://review.opendev.org/681044 are the two remaining ones I care about, ignore the backscroll ;) | 15:06 |
AJaeger | clarkb: these are ready as well https://review.opendev.org/680919 and https://review.opendev.org/680830 | 15:06 |
AJaeger | config-core as is https://review.opendev.org/#/c/680717/ | 15:07 |
*** gyee has joined #openstack-infra | 15:07 | |
AJaeger | no, not yet ;( | 15:07 |
*** goldyfruit has joined #openstack-infra | 15:08 | |
AJaeger | sorry, it IS ready | 15:08 |
openstackgerrit | Merged openstack/project-config master: Fix promote-stx-api-ref https://review.opendev.org/680977 | 15:10 |
*** markmcclain has quit IRC | 15:12 | |
*** jaosorior has quit IRC | 15:12 | |
*** ykarel is now known as ykarel|away | 15:13 | |
*** ykarel|away has quit IRC | 15:22 | |
donnyd | Just an update on FN. I am currently waiting on a part to come in the mail (should be here in the next few hours), and also I am trying to get someone out to do the propane installation today or tomorrow as well. | 15:30 |
*** ociuhandu has quit IRC | 15:31 | |
donnyd | I don't really want to do the generator tests with FN at full tilt when the fuel is hooked up... so if it is going to be done in the next 48 hours I will leave it out of nodepool... if it's gonna take a week... then I will put it back in and pull it again when the time is right... That is *if* that makes sense to everyone here | 15:32 |
roman_g | Hello team. Could I ask to force-refresh OpenSUSE mirror synk, please? This one: [repo-update|http://mirror.dfw.rax.opendev.org/opensuse/update/leap/15.0/oss/] Valid metadata not found at specified URL | 15:33 |
roman_g | OpenSUSE Leap 15.0, Updates, OSS | 15:33 |
roman_g | synk -> sync | 15:33 |
clarkb | donnyd: wfm | 15:34 |
fungi | roman_g: any idea if the official obs mirrors ever got fixed after the reorg? they're what's been blocking our automatic opensuse mirroring | 15:34 |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Pagure - handle Pull Request tags (labels) metadata https://review.opendev.org/681050 | 15:34 |
AJaeger | ricolin: could you help merge https://review.opendev.org/678353 , please? | 15:34 |
fungi | donnyd: are you planning to load-test the generator with a resistor bank or something? otherwise you probably do want to test it out at full tilt | 15:34 |
clarkb | fungi: maybe we should just remove the obs mirroring for now | 15:35 |
roman_g | fungi: yes, zypper works perfectly with original repo at http://download.opensuse.org/update/leap/15.0/oss/ | 15:35 |
fungi | roman_g: i said obs | 15:35 |
clarkb | roman_g: the problem is with obs not the main distro mirrors | 15:35 |
clarkb | roman_g: we have a list of obs repos to mirror and they are not present in the upstream we were pulling from any longer | 15:35 |
AJaeger | dirk, can you help, please? ^ | 15:35 |
clarkb | unfortunately we do all the suse mirroring together so if obs fails we don't publish main distro updates | 15:35 |
donnyd | fungi well first I have to function check the system. Make sure the auto transfer switch doesn | 15:35 |
roman_g | what is obs? fungi clarkb | 15:35 |
fungi | we have a script which mirrors opensuse distribution files and some select obs package trees. the latter have been blocking the script succeeding | 15:35 |
clarkb | roman_g: packaging that isn't part of the main distro release. In this case its probably similar to epel | 15:36 |
donnyd | doesn't blow a hole in my wall or anything.. so it will be initially tested unloaded | 15:36 |
fungi | or similar to uca | 15:36 |
roman_g | open build service? | 15:36 |
clarkb | roman_g: yes | 15:36 |
fungi | yes | 15:36 |
donnyd | And that is like a 1 hour cycle | 15:36 |
donnyd | then I have to test it loaded (at least what I can do) | 15:37 |
roman_g | so we mirror not from download.opensuse.org, but from some other (obs) location? | 15:37 |
fungi | donnyd: sounds like a fun day! | 15:37 |
roman_g | what is that location? I could check if repo is in good condition there | 15:37 |
donnyd | the UPS places a 17 amp load on the system at idle... with everything at full tilt its a 21 amp load | 15:37 |
donnyd | plus the rest of my house circuits | 15:37 |
donnyd | So you are correct I want to test loaded... but usually you start small and make sure it works.. and the only way that is done is to keep FN out of the pool for now | 15:38 |
donnyd | I should know more later today on propane install | 15:38 |
clarkb | roman_g: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L82-L96 | 15:39 |
clarkb | roman_g: they all originate from suse | 15:39 |
clarkb | but they are not all equally mirrored | 15:39 |
clarkb | the upstreams we use for OBS no longer have all of the obs repos we want. And after some looking I was unable to find one that did | 15:40 |
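Before switching the MIRROR= upstream in that script, a candidate second-level mirror can be probed for both the distro tree and the OBS trees; the host name and module paths below are placeholders, not a recommendation:

```bash
# Check whether a candidate upstream carries what the script expects
# (host and paths are placeholders).
rsync --list-only rsync://mirror.example.org/opensuse/update/leap/15.0/oss/repodata/ | head
rsync --list-only rsync://mirror.example.org/opensuse/repositories/ | head
```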
zbr | is the os-loganalyze project orphaned or in need of more cores? it seems to have reviews lingering for years, like https://review.opendev.org/#/c/468762/ which was clearly not controversial in any way :p | 15:40 |
slittle1 | RE: starlingx/utilities ... we had one major package that failed to migrate to the new repo. We would like an admin to help us import the missing content. | 15:41 |
clarkb | zbr: the move to swift likely did orphan it. | 15:41 |
clarkb | zbr: in general though for changes like that we all have much higher priority items than removing pypy from tox.ini where it harms nothing | 15:42 |
zbr | clarkb: maybe we should find a better way to address these. I am asking because once I review a change it stays on my queue until it merges (or is abandoned) as I feel the need not to waste the time already invested. | 15:43 |
zbr | what can I do in such cases? sometimes I remove my vote, but the change will still be lingering there. | 15:44 |
AJaeger | slittle1: what is the problem? https://opendev.org/starlingx/utilities was imported, wasn't it? | 15:44 |
clarkb | zbr: probably the best way to address things like that case specifically is to coordinate repo updates when we remove pypy | 15:44 |
zbr | i would find it better to just abandon such changes, reducing the noise. | 15:44 |
clarkb | zbr: but I'm guessing the global pypy removal wasn't that thorough | 15:44 |
clarkb | zbr: if you click the 'x' next to your name you should be removed from it | 15:45 |
slittle1 | yes, but one of the packages that I was supposed to relocate into starlingx/utilities failed to move | 15:45 |
zbr | clarkb: that was just one example, as you can imagine I do not care much about pypy. in fact not at all. | 15:45 |
zbr | i was more curious about which approach should I take | 15:45 |
slittle1 | That package has a very long commit history. I stopped counting at 100 | 15:46 |
*** ykarel|away has joined #openstack-infra | 15:46 | |
AJaeger | slittle1: I don't understand "failed to move" - let's ask differently: Was the repo you prepared not what you expected, and therefore you want to set it up again from scratch? | 15:46 |
slittle1 | Yes, that's what I'm proposing | 15:46 |
AJaeger | slittle1: so, do you have a new git repo ready? Then an admin - if he has time - can force update it for you... | 15:47 |
AJaeger | (note I'm not an admin, just trying to figure out next steps) | 15:48 |
slittle1 | I'm preparing the new git repo now | 15:48 |
slittle1 | Might take a couple hours to test | 15:48 |
clarkb | zbr: personally I track my priorities external to Gerrit, when those intersect with Gerrit I do reviews | 15:48 |
AJaeger | slittle1: then please come back once you're ready ;) | 15:49 |
AJaeger | slittle1: so, prepare the repo, and then ask for somebody here to force push this to starlingx/utilities. | 15:50 |
AJaeger | slittle1: and better tell starlingx team to keep the repo frozen and not merge anything - and everything proposed to it might need a rebase... | 15:50 |
AJaeger | (and devs might need to check from scratch) - all depending on the changes you do (if they can be fast forwarded, it should be fine) | 15:51 |
roman_g | clarkb: thank you for your help. I'm not sure I fully understand the code you are referring to, as it mentions repositories I don't use. A few lines above https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/opensuse-mirror-update#L26 there is a MIRROR="rsync://mirror.us.leaseweb.net/opensuse", and this one has broken repository metadata -> seems that we | 15:51 |
roman_g | mirror a broken leaseweb repository. | 15:51 |
clarkb | roman_g: our mirrors are only as good as what we pull from | 15:51 |
clarkb | roman_g: I think what we need is for someone to find a reliable upstream and we'll switch to it. If that means we can also continue to mirror OBS then we'll do that otherwise maybe we just turn off OBS mirroring | 15:52 |
fungi | one possibility would be to split opensuse and obs to separate afs volumes and update them with separate scripts, but that would need someone to do the work | 15:52 |
*** rlandy is now known as rlandy|brb | 15:52 | |
*** chandankumar is now known as raukadah | 15:53 | |
clarkb | fungi: and still requires finding a working obs upstream | 15:53 |
paladox | corvus also you'll want to add redirecting /p/ to the upgrade list (only needed for 2.16+) | 15:53 |
roman_g | clarkb: could leaseweb be changed to something else (original opensuse mirror)? http://provo-mirror.opensuse.org/ this one, for example. or http://download.opensuse.org/ this one. | 15:53 |
paladox | i only just remembered | 15:53 |
clarkb | roman_g: we don't use the main one because they limit rsync connections aggressively and we never update | 15:53 |
clarkb | roman_g: but we can point to any other second level mirror | 15:53 |
*** rpittau is now known as rpittau|afk | 15:54 | |
*** igordc has joined #openstack-infra | 15:54 | |
zbr | clarkb: ok, i will start removing myself from such reviews, but I will also post them here with recommendation to abandon, like: https://review.opendev.org/#/c/385217/ | 15:55 |
*** mattw4 has joined #openstack-infra | 15:58 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 16:00 | |
*** diablo_rojo__ is now known as diablo_rojo | 16:00 | |
*** mattw4 has quit IRC | 16:01 | |
*** mattw4 has joined #openstack-infra | 16:01 | |
*** mattw4 has quit IRC | 16:05 | |
*** mattw4 has joined #openstack-infra | 16:06 | |
*** tesseract has quit IRC | 16:07 | |
*** igordc has quit IRC | 16:09 | |
clarkb | zbr: another strategy we could probably get better at is trusting cores to single approve simple changes like that (but we've had a long history of two core reviewers so that probably merits more communication to tell people its ok in the simple case) | 16:09 |
*** e0ne has quit IRC | 16:10 | |
*** mattw4 has quit IRC | 16:10 | |
clarkb | infra-root I'd like to restart the nodepool launcher on nl02 to rule out bad config loads with the numa flavor problem that sean-k-mooney has had. FN is currently set to max-servers of zero so we won't be able to check it until back in the rotation but this way that is done and out of the way | 16:11 |
clarkb | that said the problems nodepool has are lack of ssh access | 16:11 |
clarkb | according to the logs anyway | 16:11 |
clarkb | and if we got that far then the config should be working | 16:11 |
zbr | not a bad idea, especially for "downsized" projects where maybe there is not even a single core still active. same applies to abandons: cores should not be worried about pressing that button, it is not like "delete", anyone can still restore them if needed. | 16:11 |
clarkb | but testing a boot with that flavor and our images myself by hand I had no issues getting onto the host via ssh in a relatively quick amount of time | 16:12 |
Shrews | clarkb: could you go ahead and make sure that openstacksdk is at the latest (i think it was last i checked) and restart nl01 as well? it contains a swift feature that auto-deletes the image objects after a day | 16:12 |
clarkb | Shrews: I can | 16:12 |
Shrews | ossum | 16:13 |
clarkb | I may as well rotate through all 4 so they are running the same code. I'll do that if 02 looks happy | 16:13 |
clarkb | Shrews: openstacksdk==0.35.0 is what is on 02 and that appears to be current | 16:14 |
*** virendra-sharma has joined #openstack-infra | 16:14 | |
sean-k-mooney | clarkb: the ci runs with the old label passed on saturday before FN was scaled down by the way https://zuul.opendev.org/t/openstack/build/dd0f5dad770d40a2afb3c506327d1b3e so it is just the new label that is broken | 16:14 |
clarkb | Shrews: nodepool==3.8.1.dev10 # git sha f2a80ef is the nodepool version there | 16:14 |
Shrews | clarkb: yes, that's the version. it just sets an auto-expire header on the objects (so not really a "feature" but a good thing) | 16:14 |
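The swift side of what Shrews describes is just an expiry header on the uploaded object; a hedged sketch, with the container and object names being hypothetical and 86400 seconds standing in for "a day":

```bash
# Equivalent of the auto-expire behaviour, applied by hand (names hypothetical).
swift post -H 'X-Delete-After: 86400' images some-image-object
```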
clarkb | sean-k-mooney: ya I saw that. That is what made me suspect maybe config loading | 16:15 |
clarkb | sean-k-mooney: once FN is back in the rotation we should be able to confirm or deny | 16:15 |
sean-k-mooney | ack | 16:15 |
sean-k-mooney | i probably should tell people FN is offline/disabled currently too | 16:15 |
Shrews | clarkb: nodepool version is good. i don't see anything of significance that might cause issues | 16:15 |
Shrews | since 3.8.0 release, that is | 16:16 |
openstackgerrit | Mohammed Naser proposed zuul/zuul master: spec: use operator-sdk for kubernetes operator https://review.opendev.org/681058 | 16:16 |
zbr | fungi: there are only 4 more pending reviews on bindep, after we address them, can we make an anniversary release? | 16:17 |
clarkb | Shrews: nl02 has been restarted. Limestone is having trouble (but appeared to have trouble prior to the restart too) going to understand that before doing 01 03 and 04 | 16:17 |
clarkb | logan-: fyi ^ "openstack.exceptions.SDKException: Error in creating the server (no further information available)" is what we get from the sdk | 16:17 |
clarkb | logan-: I'm going to try a manual boot now | 16:17 |
*** virendra-sharma has quit IRC | 16:20 | |
clarkb | logan-: {'message': 'No valid host was found. There are not enough hosts available.', 'code': 500, 'created': '2019-09-09T16:19:42Z'} | 16:20 |
clarkb | Shrews: ^ seems like sdk should be able to bubble that message up? maybe that error isn't available initially? | 16:21 |
Shrews | clarkb: did that come from the node.fault.message? | 16:21 |
clarkb | Shrews: yes | 16:21 |
*** markvoelker has joined #openstack-infra | 16:21 | |
Shrews | clarkb: https://review.opendev.org/671704 should begin logging that | 16:22 |
Shrews | clarkb: the issue is you need to either list servers with detailed info (which nodepool does not), or refetch the server after failure | 16:22 |
clarkb | oh right cool | 16:22 |
clarkb | I shall rereview it when this restart expedition is complete | 16:22 |
*** dtantsur is now known as dtantsur|afk | 16:23 | |
*** virendra-sharma has joined #openstack-infra | 16:23 | |
Shrews | clarkb: yeah, i think it's good to go now, but i can't +2 it since I've added to the review | 16:24 |
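Until that change is in, the same fault detail can be pulled by hand, which is what "refetch the server after failure" amounts to; the server name is an example:

```bash
# Re-fetch the failed server to see nova's fault message (name is an example).
openstack server show clarkb-test-boot | grep -i fault
```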
AJaeger | promotion of releasenotes is working fine so far - note that we now only update releasenotes if those are changed and not after each merge. | 16:25 |
logan- | clarkb: yep, thats expected. i disabled the host aggregate on our end out of caution since we had a transit carrier acting up last week. i am planning to re-enable soon, later today or tomorrow, when i am around for a bit to monitor job results | 16:25 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 16:26 |
logan- | thats how i typically stop nodepool from launching nodes while allowing it to finish up and delete already running jobs. does it mess anything up when I do it that way? | 16:26 |
clarkb | logan-: thanks for confirming. As a side note you can update the instance quota to zero and nodepool will stop launching against that cloud | 16:26 |
clarkb | it doesn't mess anything up, just noisy logs | 16:26 |
*** spsurya has quit IRC | 16:27 | |
logan- | gotcha | 16:27 |
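For completeness, the cloud-side knob clarkb mentions looks roughly like this; the project name is a placeholder:

```bash
# Stop new launches while existing nodes drain (project name is a placeholder).
openstack quota set --instances 0 opendev-ci-project
# The nodepool-side equivalent is setting max-servers: 0 for the provider.
```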
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Fix tox python3 overrides https://review.opendev.org/605613 | 16:32 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator https://review.opendev.org/676458 | 16:32 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add some more default help https://review.opendev.org/570721 | 16:33 |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant https://review.opendev.org/681063 | 16:36 |
mnaser | ^ if anyone is around for a trivial patch.. | 16:37 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Create zookeeper operator https://review.opendev.org/676458 | 16:40 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-operator master: Deploy Zuul cluster using operator https://review.opendev.org/681065 | 16:40 |
*** rlandy|brb is now known as rlandy | 16:41 | |
*** pgaxatte has quit IRC | 16:42 | |
openstackgerrit | Merged openstack/openstack-zuul-jobs master: Mention promote in template description https://review.opendev.org/681044 | 16:43 |
openstackgerrit | Merged zuul/zuul-jobs master: Switch to fetch-sphinx-tarball for tox-docs https://review.opendev.org/676430 | 16:46 |
*** ykarel|away has quit IRC | 16:46 | |
*** jpena is now known as jpena|off | 16:47 | |
AJaeger | infra-root, we still have 6 hold nodes - are they all needed? | 16:47 |
mnaser | AJaeger: if any are mine, i dont need them | 16:48 |
*** ociuhandu has joined #openstack-infra | 16:48 | |
clarkb | I can take a look as part of my nodepool restarts. I've just done 03 and am confirming things work there since 02 only has offline clouds currently | 16:48 |
*** diablo_rojo has quit IRC | 16:49 | |
Shrews | wow, one has been in hold for 49 days | 16:49 |
Shrews | AJaeger: i suspect mordred has simply forgotten about that one :) | 16:50 |
AJaeger | ;) | 16:50 |
Shrews | didn't pabelanger say we could delete his held node the other day? | 16:51 |
mnaser | please dont interrupt that one, that's their bitcoin miner :) | 16:51 |
Shrews | mnaser: must be a deep mine | 16:52 |
AJaeger | mwhahaha, EmilienM, I see at http://zuul.opendev.org/t/openstack/status that https://review.opendev.org/680780 runs a non-voting job in gate - could you remove that one, please? | 16:52 |
AJaeger | Shrews: he did, so go ahead and delete it... | 16:52 |
mwhahaha | weshay: -^ | 16:52 |
AJaeger | mwhahaha, EmilienM, weshay, job is "tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates | 16:52 |
clarkb | ok there are nodes that went from building to in-use on nl03 now. Going to do 04 | 16:53 |
AJaeger | weshay, mwhahaha, and https://review.opendev.org/674065 has tripleo-build-containers-centos-7-buildah as non-voting in gate | 16:53 |
weshay | hrm.. k.. thanks for letting me know | 16:53 |
*** ykarel|away has joined #openstack-infra | 16:53 | |
Shrews | mnaser: you have a held node that i'll go ahead and delete now | 16:53 |
pabelanger | Shrews: yes, it can be deleted. Thanks | 16:54 |
*** mattw4 has joined #openstack-infra | 16:56 | |
*** bdodd has quit IRC | 16:56 | |
Shrews | AJaeger: since that change that mordred's held node is based on has merged, i'm going to assume that is not needed anymore either | 16:56 |
Shrews | ianw: you now have the only 2 held nodes. will let you decide if they are needed or not | 16:57 |
clarkb | Shrews: for the case where there are ready nodes locked for many many days is there a way to break the lock without restarting the zuul scheduler? | 16:58 |
clarkb | Shrews: 0010553302 for example | 16:58 |
*** trident has quit IRC | 16:58 | |
Shrews | clarkb: we should first verify who owns the lock. let me look | 16:58 |
Shrews | there's been a long standing zuul bug around that | 16:59 |
*** derekh has quit IRC | 17:00 | |
clarkb | #status log Restarted nodepool launchers with OpenstackSDK 0.35.0 and nodepool==3.8.1.dev10 # git sha f2a80ef | 17:00 |
openstackstatus | clarkb: finished logging | 17:00 |
roman_g | clarkb: I've asked SUSE guys if they could reach out to their OpenSUSE community and find a good CDN to rsync OpenSUSE from. | 17:01 |
clarkb | roman_g: fwiw AJaeger asked dirk here to do that too in scrollback (though he is CEST so may not happen today) | 17:01 |
clarkb | roman_g: let us know what you find out | 17:01 |
*** diablo_rojo has joined #openstack-infra | 17:02 | |
ricolin | AJaeger, done | 17:03 |
roman_g | clarkb, AJaeger, just for reference, I was replying to Jean-Philippe Evrard (SUSE) here https://review.opendev.org/#/c/672678/ in comment to patch set 4. | 17:03 |
Shrews | clarkb: yeah, zuul owns that lock. there is an out-of-band way to do it (just delete the znode) but that's not recommended | 17:03 |
clarkb | Shrews: ya it should go away if we restart the scheduler but now is a bad time for that due to job backlogs | 17:04 |
clarkb | (and I was thinking freeing up ~4 more nodes will help a bit with the backlog) | 17:04 |
Shrews | i just freed up 4 held nodes, so that might help a small bit | 17:05 |
clarkb | johnsom: fungi ianw re job network problems if it is rax specific it could be that we are seeing duplicate IPs there again? | 17:05 |
clarkb | in the past we would get job failures for those errors because ansible didn't return the correct rc code for network trouble but that has been fixed and so now we retry | 17:05 |
clarkb | however if it happens on more than just rax and it is always triggered during the start of tempest tests I would suspect something in the tempest tests | 17:06 |
fungi | clarkb: if we can manage to correlate the failures to a specific subset of recurring ip addresses, that can explain it | 17:06 |
clarkb | frickler: figured out worlddumping which will help with the dns debugging too I think | 17:06 |
johnsom | I have seen it fail after 4-5 tests have passed, it's not at the start | 17:06 |
*** trident has joined #openstack-infra | 17:06 | |
clarkb | johnsom: ok then during tempest | 17:06 |
clarkb | both ipv6 only clouds are offline currently so the dns issues shouldn't be happening for a bit | 17:07 |
clarkb | hopefully long enough to get worlddump fix merged in devstack | 17:07 |
clarkb | johnsom: the point was more that if it is duplicate IP problem we should see that during any point of time in a job | 17:08 |
clarkb | and it would be cloud/region specific | 17:08 |
clarkb | if however it happens during a common job location and on variety of clouds we probably don't have that problem | 17:08 |
johnsom | Yes, it is a definite possibility | 17:08 |
donnyd | ok electrical is done for FN and it looks like nova may need to use it.. So how about we put it back in service and work around the gas install | 17:08 |
clarkb | donnyd: I will defer to you on that | 17:09 |
donnyd | I am thinking only compute jobs, maybe hold off on swift logs | 17:09 |
clarkb | wfm | 17:09 |
*** ramishra has quit IRC | 17:09 | |
clarkb | https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e47/680797/1/check/openstack-tox-py36/e477716/ is still working | 17:09 |
clarkb | AJaeger: ^ should we turn ovh back on for swift logs? | 17:09 |
fungi | clarkb: last word from amorin was that the suspendbot was done for yesterday, but no idea when it would come back today and whether it would be fixed | 17:10 |
clarkb | fungi: ya its been just over 24 hours | 17:12 |
AJaeger | it's more than 24 hours since we talked - i would expect it to run again in that time. | 17:12 |
AJaeger | clarkb: let me remove my WIP from https://review.opendev.org/#/c/680855/ | 17:12 |
AJaeger | we might want to wait another 24 hours - or just go for it... | 17:12 |
clarkb | johnsom: fungi http://zuul.openstack.org/build/74d1733dd44f4233a37ef43039e338e8/log/job-output.txt#36386 that may be a clue (that job was a retry_limit) | 17:13 |
AJaeger | thanks, ricolin | 17:14 |
clarkb | johnsom: fungi ze04 (where that build ran) has disk now and doesn't have leaked builds | 17:14 |
openstackgerrit | Donny Davis proposed openstack/project-config master: Bringing FN back online for CI Jobs https://review.opendev.org/681075 | 17:15 |
fungi | clarkb: johnsom: oh! so another thing worth noting is that nodes in rackspace have a smallish rootfs with more space mounted at /opt | 17:15 |
AJaeger | roman_g: yeah, evrardjp is also a good contact - thanks | 17:15 |
clarkb | fungi: ya though in this case the thing that ran out of disk was the executor | 17:15 |
fungi | ahh, okay, there goes that idea | 17:15 |
AJaeger | config-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review | 17:18 |
clarkb | AJaeger: fwiw I asked airship if they are ready for their global acl updates to go in and haven't gotten an answer yet | 17:19 |
clarkb | it will prevent them from approving changes while new groups are configured so I figured I would ask | 17:20 |
AJaeger | clarkb: looking at the commit message: If we add the airship-core group everywhere, it should work just fine, wouldn't it? | 17:20 |
AJaeger | clarkb: but double checking is better ;) Thanks | 17:20 |
donnyd | clarkb: fungi so even though FN doesn't have any jobs running, the images should have still been loaded from nodepool.. Correct? | 17:21 |
fungi | donnyd: correct | 17:21 |
donnyd | and the mirror never actually went down | 17:21 |
fungi | nodepool continues performing uploads even if max-servers is 0 | 17:21 |
donnyd | i did have enough juice to keep it up this whole time | 17:22 |
fungi | awesome. thanks! | 17:22 |
clarkb | AJaeger: ya though it won't have the effect they want (they will still need to make changes) | 17:22 |
donnyd | so it should also be up to date with whatever has been done... | 17:22 |
clarkb | 23GB is executor-git and 7.9GB is builds in /var/lib/zuul | 17:23 |
clarkb | git will hardlink so that 23GB seems large but should be reasonable as its shared across builds | 17:23 |
clarkb | there are 36GB available on ze04, which represents 53% free on that fs | 17:24 |
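Those numbers come from the usual du/df pass over the executor state directory; the paths follow clarkb's description above:

```bash
# Run on the executor; directory names per the figures quoted above.
sudo du -sh /var/lib/zuul/executor-git /var/lib/zuul/builds
df -h /var/lib/zuul
```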
*** gfidente has quit IRC | 17:25 | |
clarkb | AJaeger: fungi maybe you can reach out to amorin tomorrow CEST time and see if amorin has any more input? then we can enable early tomorrow if amorin is happy with it at that point? | 17:26 |
corvus | paladox: can you elaborate on redirecting /p/ ? do we need to do something like /p/(.*) -> /$1 ? | 17:27 |
paladox | yup | 17:27 |
paladox | polygerrit takes over /p/ | 17:27 |
paladox | so cloning over /p/ breaks | 17:27 |
fungi | clarkb: hopefully my availability tomorrow morning is better than today was. i've added a reminder for it and will see what i can do | 17:27 |
paladox | corvus https://www.gerritcodereview.com/2.16.html#legacy-p-prefix-for-githttp-projects-is-removed | 17:28 |
*** jamesmcarthur has quit IRC | 17:28 | |
paladox | corvus https://github.com/wikimedia/puppet/commit/4a2a25f3cbcbabd03b6291459941304e67bbd1c5#diff-4d7f1c048cc827721ef9298a98d1f5d9 is what i did | 17:28 |
paladox | to prevent breaking clones && to prevent breaking PolyGerrit project dashboard support. | 17:29 |
corvus | paladox: gotcha, so that's targeting just cloning then | 17:29 |
corvus | that regex in your change | 17:29 |
paladox | yup | 17:29 |
clarkb | http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1 vexxhost has some weird looking node status graphs | 17:30 |
paladox | we will be upgrading soon :) | 17:30 |
paladox | spoke with one of the releng members and with 2.15 going EOL, we will want to be going to 2.16 very soon. | 17:30 |
corvus | paladox: cool thanks! i added it to our list :) | 17:30 |
paladox | :) | 17:30 |
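The shape of the redirect corvus describes is a single rewrite of /p/(.*) to /$1; paladox's linked puppet change is narrower (git traffic only) so PolyGerrit's own /p/ dashboard URLs keep working, and that nuance is not reproduced in this sketch:

```bash
# Rough shape only -- the real rule should be restricted to git clone traffic.
sudo tee /etc/apache2/conf-available/gerrit-p-redirect.conf <<'EOF'
RewriteEngine On
RewriteRule ^/p/(.*)$ /$1 [R=301,L]
EOF
```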
* clarkb figures out boot from volume to test vexxhost | 17:31 | |
paladox | corvus you'll also want to convert any of your its-* rules to soy too! | 17:32 |
clarkb | I think we may have leaked volumes | 17:33 |
clarkb | I'm digging in | 17:33 |
fungi | paladox: oh, fun... what's soy? | 17:34 |
*** ociuhandu has quit IRC | 17:34 | |
paladox | Soy is https://github.com/google/closure-templates | 17:34 |
Shrews | clarkb: i would not doubt leaked volumes. there is something weird there that i couldn't quite figure out (i thought i had the bug down at one time, but alas, not so much) | 17:34 |
paladox | https://github.com/wikimedia/puppet/tree/production/modules/gerrit/files/homedir/review_site/etc/its is how it looks under soy :) | 17:35 |
*** ociuhandu has joined #openstack-infra | 17:36 | |
clarkb | Shrews: well volume list shows we have ~37 volumes at 80GB each. The quota error we got says we are using 6080GB which is enough for 76 instances | 17:36 |
Shrews | yowza | 17:36 |
clarkb | Shrews: so the volume leak isn't the only thing going on here from what I can tell (we do appear to have leaked a small number of volumes which I'm trying to trim now since that is actionable on our end) | 17:36 |
paladox | fungi it's also what is used for PolyGerrit's index.html :) | 17:37 |
paladox | it helped me to introduce support for base url's | 17:37 |
clarkb | Shrews: its basically 2 times our consumption | 17:37 |
fungi | thanks paladox. luckily we don't have a bunch of its rules: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L213-L234 | 17:37 |
Shrews | clarkb: can you capture that info for the ones you're deleting for me? at least a few of the most recent ones | 17:37 |
paladox | ah! | 17:37 |
paladox | ok | 17:37 |
paladox | i had to do the support for soy in its-base for security reasons :( | 17:38 |
clarkb | Shrews: ya I'll get a list and then get full volume shows for each of them (thats basically how I'm combing through to decide they are really leaked) | 17:38 |
Shrews | cool | 17:38 |
openstackgerrit | Merged openstack/project-config master: Bringing FN back online for CI Jobs https://review.opendev.org/681075 | 17:38 |
corvus | paladox, fungi: if we're using its-storyboard, do we still need to use soy? | 17:39 |
paladox | nope | 17:39 |
paladox | you don't use velocity it seems per the link fungi gave | 17:39 |
fungi | corvus: i thought its rules were how we configured its-storyboard (well, that and commentlinks, which seem to be integrated with how its does rule lookups) | 17:39 |
*** ricolin has quit IRC | 17:40 | |
fungi | is soy replacing the commentlinks feature too? | 17:40 |
paladox | though, its rules can now be configured from the UI :) | 17:40 |
paladox | fungi nope | 17:40 |
paladox | it's replacing the velocity feature | 17:40 |
*** ociuhandu has quit IRC | 17:40 | |
fungi | ahh, i have no idea what velocity is anyway | 17:42 |
fungi | just as well, i suppose | 17:42 |
paladox | https://velocity.apache.org | 17:42 |
clarkb | Shrews: bridge.o.o:/home/clarkb/vexxhost-leaked-volumes.txt | 17:42 |
clarkb | Shrews: I'm going to delete them now | 17:43 |
fungi | paladox: okay, so that was providing some sort of macro/templating capability i guess | 17:43 |
paladox | yup | 17:43 |
clarkb | Shrews: they are all from august 7th | 17:43 |
Shrews | clarkb: thx | 17:43 |
fungi | clarkb: last time i embarked on a vexxhost leaked image cleanup adventure, i similarly discovered a bunch of images hanging around that all leaked within a few minutes of each other on the same day, weeks prior | 17:44 |
fungi | something happens briefly there to cause this, i guess | 17:44 |
paladox | do you set javaOpts through puppet (for gerrit.config) | 17:44 |
Shrews | clarkb: fungi: my last exploration of this found they seemed to be associated with the inability to delete a node because of "volume in use" errors, iirc | 17:45 |
fungi | Shrews: correct, same for me. something caused server instances to go away ungracefully and left cinder thinking the volumes for them were still in use | 17:46 |
Shrews | maybe this new batch will have some key that i missed before | 17:46 |
fungi | paladox: not to my knowledge. i | 17:46 |
fungi | er, i'm not even sure what javaopts are | 17:46 |
paladox | ok, i think you'll want to set https://github.com/wikimedia/puppet/commit/01bf99d8c72886e876878ade7e99f9081dc313d5#diff-1145a1f82a8b6b5ee2c3238b41d39601 | 17:46 |
clarkb | paladox: fungi we set heap memory size | 17:46 |
clarkb | but that gets loaded via the defaults file in the init script | 17:47 |
clarkb | so ya we set things but puppet is only partially related there | 17:47 |
fungi | ahh, here: https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/manifests/review.pp#L119 | 17:47 |
paladox | ahh, ok. | 17:47 |
fungi | https://opendev.org/opendev/puppet-gerrit/src/branch/master/templates/gerrit.config.erb#L62-L64 | 17:48 |
fungi | is where it gets plumbed through to gerrit.config | 17:49 |
*** udesale has quit IRC | 17:49 | |
*** ralonsoh has quit IRC | 17:49 | |
clarkb | mnaser: if you have a moment, vexxhost sjc1 claims we are using ~6080GB of volume quota but volume list shows ~3040GB (curious that it is 2x). Not sure if that is a known thing or something we can help debug further but thought I would point it out | 17:51 |
*** goldyfruit_ has joined #openstack-infra | 17:51 | |
*** virendra-sharma has quit IRC | 17:53 | |
*** goldyfruit has quit IRC | 17:53 | |
Shrews | clarkb: hrm, aug 7th logs for nodepool no longer exist. that makes investigating near impossible :( | 17:53 |
*** ociuhandu has joined #openstack-infra | 17:53 | |
clarkb | ya we keep a month of logs iirc | 17:54 |
clarkb | and that is ~2 days past | 17:54 |
Shrews | we only go back to aug 30 | 17:54 |
clarkb | oh just 10 days then | 17:54 |
clarkb | re the retrying jobs the top of the check queue is an ironic job that disappeared in tempest ~2 hours ago on rax-ord and zuul/ansible haven't caught up on that yet | 18:00 |
clarkb | I am able to hit the ssh port via home and the executor that was running the job there | 18:01 |
clarkb | so if it is the duplicate ip problem we've got the IP currently | 18:01 |
fungi | well, it's rarely just *one* duplicate ip address | 18:02 |
fungi | but it's usually an identifiably small percent of the addresses used there overall | 18:03 |
openstackgerrit | Sorin Sbarnea proposed opendev/bindep master: Add OracleLinux support https://review.opendev.org/536355 | 18:03 |
clarkb | ya it's just hard to identify because we don't get logs | 18:03 |
openstackgerrit | Merged openstack/project-config master: Add pravega/zookeeper-operator to zuul tenant https://review.opendev.org/681063 | 18:03 |
clarkb | I've put an autohold on that job | 18:03 |
clarkb | tox/tempest is still running there | 18:08 |
fungi | previously when i've had to hunt these down i've built up lists of build ids which hit the retry_limit or whatever, and then programmatically tracked those back through zuul executor logs to corresponding node requests and then into nodepool launcher logs | 18:09 |
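A rough sketch of that chase; the log paths and message formats here are assumptions rather than the exact procedure:

    # executor side: find the build and note which node request it used
    grep "$BUILD_UUID" /var/log/zuul/executor-debug.log

    # launcher side: follow the node request to see which cloud/node served it
    grep "$NODE_REQUEST_ID" /var/log/nodepool/launcher-debug.log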
clarkb | 23.253.173.73 is the host if anyone else wants to take a look | 18:09 |
fungi | my access is still a bit crippled while my broadband is out | 18:10 |
clarkb | http://paste.openstack.org/show/774344/ is the last thing stestr logged | 18:11 |
clarkb | about 37 minutes after ansible stopped getting info | 18:11 |
*** e0ne has joined #openstack-infra | 18:12 | |
johnsom | I am going to try Ian's netconsole role plus redirect the rsyslog. See if that catches anything. | 18:14 |
clarkb | right around the time ansible loses connectivity dmesg says there was an sda gpt block device. This device has GPT errors and does not exist currently | 18:14 |
clarkb | dtantsur|afk: TheJulia ^ do you know if ironic-standalone is creating that sda device and giving it a GPT table? if so it seems to be buggy | 18:15 |
*** jtomasek_ has joined #openstack-infra | 18:17 | |
clarkb | journalctl is empty for that block of time too | 18:18 |
fungi | oh, i wonder if something to do with block device management is temporarily hanging the kernel long enough to cause ansible to give up | 18:19 |
*** jtomasek has quit IRC | 18:19 | |
fungi | that behavior could certainly be provider/hypervisor dependent | 18:19 |
clarkb | fungi: ya journalctl having missing data around then seems to point at sadness | 18:19 |
clarkb | syslog has the data though so not universally a problem | 18:19 |
clarkb | but could be just bad enough potentially | 18:19 |
*** priteau has quit IRC | 18:23 | |
*** kjackal has quit IRC | 18:23 | |
clarkb | fungi: wins /dev/xvda1 38G 36G 0 100% / | 18:25 |
clarkb | running du to track down what filled it up | 18:25 |
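Something along these lines, staying on the root filesystem so /opt and other mounts don't skew the result (a sketch, not the exact invocation):

    df -h /
    # -x keeps du on one filesystem; show the 20 largest directories
    du -xh --max-depth=3 / 2>/dev/null | sort -h | tail -n 20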
fungi | probably the job needs adjusting to write stuff in /opt | 18:27 |
fungi | how full is /opt on that node? | 18:27 |
*** yamamoto has joined #openstack-infra | 18:27 | |
clarkb | http://paste.openstack.org/show/774346/ | 18:28 |
clarkb | opt has 57GB | 18:28 |
clarkb | johnsom: ^ is it possible that your jobs are using fat images too and running into the same problem? | 18:28 |
clarkb | that could explain why it is rax and ironic + octavia hitting this a lot | 18:28 |
clarkb | because you both run real workloads in your VMs | 18:28 |
clarkb | and don't just boot cirros | 18:28 |
johnsom | Not likely, we just shaved another 250MB off them a month or two ago. It should be only one 350MB image | 18:29 |
johnsom | We are logging more than we were however, as we added log offloading a few months back. Seems unlikely that it would be failing at the start of the tempest job though. | 18:30 |
johnsom | Is the DF log in the output at the start or the end of the job? | 18:31 |
*** yamamoto has quit IRC | 18:31 | |
clarkb | that's me sshing in after the node broke | 18:31 |
fungi | so what is e.g. /var/lib/libvirt/images/node-1.qcow2 there which is 12gb? | 18:32 |
johnsom | This was a run that passed: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/df.txt.gz | 18:32 |
clarkb | ironics test image | 18:32 |
clarkb | fungi: ^ | 18:32 |
fungi | aha | 18:32 |
fungi | that's fairly massive | 18:32 |
fungi | especially since it's one of a dozen images in that same directory | 18:33 |
clarkb | ironic at least is clearly broken in this way | 18:33 |
clarkb | gives us something to fix and check on after | 18:33 |
fungi | but also it should almost certainly put that stuff in /opt | 18:33 |
fungi | so as not to fill the rootfs | 18:33 |
johnsom | Yeah, our new logging is only 250K and 2.3K for a full run, so not a big change. | 18:33 |
*** e0ne has quit IRC | 18:34 | |
clarkb | note as suspected that is a side effect of running tempest I think | 18:35 |
* clarkb makes a bug for ironic | 18:37 | |
clarkb | johnsom: "new logging" ? | 18:37 |
clarkb | johnsom: also keep in mind the original image size may grow up to the flavor disk size when booted and used | 18:37 |
*** jamesmcarthur has joined #openstack-infra | 18:37 | |
clarkb | johnsom: even if your qcow2 is, say, only 1GB when handed to glance, once booted and blocks start changing it can grow up to that limit | 18:38 |
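qemu-img makes that distinction visible: "virtual size" is the cap the guest sees, while "disk size" is what the qcow2 currently occupies on the host and creeps toward that cap as the guest writes blocks. For example, against the image fungi spotted:

    # compare "virtual size" (the cap) with "disk size" (current on-host usage)
    qemu-img info /var/lib/libvirt/images/node-1.qcow2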
johnsom | New logging == offloading the logging from inside the amphora instances. Looking at the passing runs it's only 250K uncompressed, so certainly not eating the disk. | 18:39 |
johnsom | Our qcow is ~300MB, the nova flavor it gets is 2GB. Other than getting smaller, it hasn't changed in years. | 18:39 |
clarkb | ok. I'm pointing it out because I've just shown that one of your example jobs is likely failing due to this problem. | 18:41 |
clarkb | I think it is worthwhile to be able to specifically rule it in or out in the other jobs we know are failing similarly | 18:41 |
*** ociuhandu has quit IRC | 18:41 | |
johnsom | Yeah, that seems very ironic job related. | 18:41 |
clarkb | oh is ironic in storyboard? | 18:42 |
johnsom | I hope this netconsole trick sheds light on our job. | 18:43 |
fungi | clarkb: yes | 18:43 |
fungi | and i agree, seems like there's a good chance that the ironic and octavia jobs are failing in similarly opaque ways for distinct reasons | 18:44 |
johnsom | clarkb This test run is going now: https://review.opendev.org/680894 assuming I didn't fumble the Ansible somehow, which is likely given I use it just enough to forget it. | 18:44 |
johnsom | Yep, nevermind, broken ansible. lol | 18:44 |
clarkb | fungi: yup could be distinct but we should rule out known causes first | 18:45 |
clarkb | rather than assuming they are fine | 18:45 |
clarkb | octavia like ironic is failing when tempest runs | 18:45 |
clarkb | octavia like ironic is being retried because ansible fails | 18:45 |
clarkb | we know ironic is filling its root disk on rax (which also explains why it may be cloud specific) | 18:46 |
clarkb | why not simply check "is octavia filling the disk"; if yes we solved it, if no we look elsewhere | 18:46 |
fungi | and also, predominantly in the same provider regions right? one where the available space on the rootfs is fairly limited compared to other providers | 18:46 |
fungi | er, what you said | 18:47 |
fungi | what's a good way to go about reporting the disk utilization on those nodes just before they fall over? the successful run johnsom linked earlier had fairly minimal disk usage reported by df at the end of the job | 18:48 |
fungi | but maybe disk utilization spikes during the tempest run and then the resources consuming disk get cleaned up? | 18:49 |
clarkb | fungi: ya it could be the test ordering that gets us if multiple tests with big disks run concurrently | 18:49 |
clarkb | fungi: as for how to get the data off, if ansible is failing (likely because it can't write data to /tmp) then we may need to use ansible raw? | 18:50 |
clarkb | pabelanger: ^ do you know if the raw module avoids needing to copy files to /tmp on the remote host? | 18:50 |
*** factor has joined #openstack-infra | 18:50 | |
*** vesper11 has quit IRC | 18:52 | |
pabelanger | clarkb: I am not sure, would have to look | 18:52 |
pabelanger | IIRC, raw is pretty minimal | 18:53 |
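For reference, a hedged sketch of poking a wedged node with the raw module, which just runs a command over ssh rather than shipping a module payload to the remote temp dir; the zuul user name and the trailing-comma ad-hoc inventory are assumptions about how you'd reach the held node:

    # run plain commands on the suspect host without the normal module copy step
    ansible all -i "23.253.173.73," -u zuul -m raw -a "df -h / && ls -la /tmp"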
clarkb | https://storyboard.openstack.org/#!/story/2006520 has been filed and I just shared it with the ironic channel | 18:56 |
*** diablo_rojo has quit IRC | 18:56 | |
*** diablo_rojo has joined #openstack-infra | 18:56 | |
fungi | i wonder if we could better trap/report ansible remote write failures like that | 18:58 |
AJaeger | config-core, https://review.opendev.org/678356 and https://review.opendev.org/678357 are ready for review, dependencies merged - those remove now unused publish jobs and update promote jobs. Please review | 18:58 |
*** jbadiapa has quit IRC | 19:00 | |
clarkb | fungi: looking at the log on ze04 for bb2716ef3894465cb1cfbf1b22d7736c the way it manifests for ironic is the run playbook times out, then the post run attempts to ssh in and do post run things but that fails with exit code 4 which is the "networking failed me" problem. In this case I don't think networking at layer 3 or below failed but instead at 7 because ansible/ssh needs /tmp to do things | 19:01 |
clarkb | johnsom: is there a preexisting octavia change that we can recheck without causing problems for you and then if the job lands on say rax-* we can hold it/investigate? | 19:03 |
clarkb | johnsom: if not I can push a no change change | 19:03 |
*** ykarel|away has quit IRC | 19:03 | |
johnsom | I am using this one for the netconsole test: https://review.opendev.org/#/c/680894/ | 19:03 |
johnsom | I just can't say if I got the ansible right this time. | 19:03 |
clarkb | k | 19:03 |
*** e0ne has joined #openstack-infra | 19:03 | |
johnsom | Looks like it got OVH | 19:04 |
fungi | hrm, yeah if ansible is overloading the same exit code for both network connection failure and an inability to write to the remote node's filesystem, then discerning which it is could be tough | 19:07 |
*** e0ne has quit IRC | 19:09 | |
clarkb | johnsom: if you notice a run on rax-* ping me and I'll happily add an autohold for it | 19:10 |
johnsom | Ok, sounds good | 19:10 |
*** e0ne has joined #openstack-infra | 19:10 | |
fungi | or we can set an autohold with a count >1 | 19:11 |
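Something like the following on the scheduler, with the project and job names as placeholders; --count lets the hold apply to more than one matching failure:

    zuul autohold --tenant openstack \
        --project openstack/octavia --job octavia-v2-dsvm-scenario \
        --reason "debugging retry_limit failures" --count 2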
*** jbadiapa has joined #openstack-infra | 19:13 | |
*** goldyfruit_ has quit IRC | 19:15 | |
*** eharney has quit IRC | 19:20 | |
*** bnemec has quit IRC | 19:27 | |
*** bnemec has joined #openstack-infra | 19:32 | |
*** goldyfruit has joined #openstack-infra | 19:33 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner https://review.opendev.org/677971 | 19:33 |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Improve job and node information banner https://review.opendev.org/677971 | 19:34 |
zbr | fungi: if it's not too late, please take a look at https://review.opendev.org/#/q/project:opendev/bindep+status:open thanks. | 19:36 |
*** vesper11 has joined #openstack-infra | 19:41 | |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Add cleanup phase to base(-test) https://review.opendev.org/681100 | 19:46 |
clarkb | infra-root ^ that might help us debug these problems better in the future | 19:46 |
*** ociuhandu has joined #openstack-infra | 19:51 | |
*** igordc has joined #openstack-infra | 20:00 | |
*** jtomasek_ has quit IRC | 20:01 | |
*** ociuhandu has quit IRC | 20:03 | |
*** ociuhandu has joined #openstack-infra | 20:03 | |
*** michael-beaver has joined #openstack-infra | 20:04 | |
clarkb | corvus: have a quick second for 681100? once that is in I can test it | 20:09 |
*** markmcclain has joined #openstack-infra | 20:10 | |
clarkb | sean-k-mooney: I think fn is enabled again if you want to retry your label checks (restarted the nodepool service so that will rule out config problems if it persists) | 20:10 |
clarkb | or at least configs due to weird reloading of pools | 20:10 |
sean-k-mooney | clarkb: ack | 20:10 |
sean-k-mooney | i'll kick off the new label test and let you know if that works | 20:11 |
*** Lucas_Gray has joined #openstack-infra | 20:11 | |
*** nicolasbock has quit IRC | 20:11 | |
sean-k-mooney | related question vexxhost has nested virt as does FN. did packet ever turn it back on | 20:11 |
fungi | we haven't had a usable packethost environment in... over a year? | 20:12 |
openstackgerrit | Merged opendev/bindep master: Fix emerge testcases https://review.opendev.org/460217 | 20:12 |
sean-k-mooney | well i think that answers my question | 20:12 |
clarkb | ya there was a plan to redeploy using osa but I don't think that ever happened | 20:13 |
sean-k-mooney | my hope is that if we have 2+ providers that can support the multi numa / single numa labels with nested virt (that would be FN and maybe vexxhost) then we could replace the now defunct intel nfv ci with a first party version that at a minimum did a nightly build | 20:15 |
sean-k-mooney | if that is stable then maybe we can promote it to non-voting or even voting in check | 20:15 |
*** trident has quit IRC | 20:15 | |
clarkb | limestone may also be able to help | 20:15 |
sean-k-mooney | yes perhaps. now that we have 1 provider i can continue to refine the job and we can explore other options as they become available | 20:16 |
sean-k-mooney | ignoring the label issue the FN job seems to be working well | 20:17 |
clarkb | I think the biggest gotcha is going to be those periodic updates to kernels that break it and needing to update the various layers to cope | 20:18 |
clarkb | but if the job is nonvoting and we communicate when that happens its probably fine? | 20:18 |
sean-k-mooney | hopefully | 20:18 |
*** nicolasbock has joined #openstack-infra | 20:18 | |
sean-k-mooney | the intel nfv ci was always using nested virt and it was, at least for the first 2 years when it was staffed, fairly stable | 20:19 |
sean-k-mooney | most of the issues we had were not related to nested virt | 20:19 |
sean-k-mooney | but rather dns or our proxies/mirrors | 20:19 |
sean-k-mooney | i kicked off the new label job on 680738 | 20:20 |
sean-k-mooney | so we will see if that fails | 20:20 |
sean-k-mooney | or passes | 20:20 |
clarkb | ya we just know it breaks periodically | 20:21 |
clarkb | logan-: has given us a bit of insight into that and aiui something in the middle kernel will update, then we need to update the hypervisors to accommodate and it starts working again | 20:21 |
sean-k-mooney | it's on by default now in kernels from 4.19 | 20:21 |
*** ociuhandu has quit IRC | 20:22 | |
clarkb | ya so only about 2 years away :) | 20:22 |
*** ociuhandu has joined #openstack-infra | 20:22 | |
sean-k-mooney | it really bugs me that rhel8 is based on 4.18 | 20:22 |
sean-k-mooney | because it's an lts distro on a non-lts kernel | 20:22 |
*** iurygregory has quit IRC | 20:23 | |
sean-k-mooney | well it's a redhat kernel so the version number doesn't really mean anything in relation to the upstream kernel anyway | 20:23 |
*** ociuhandu has quit IRC | 20:27 | |
*** trident has joined #openstack-infra | 20:27 | |
*** trident has quit IRC | 20:32 | |
*** notmyname has quit IRC | 20:32 | |
*** trident has joined #openstack-infra | 20:33 | |
openstackgerrit | Merged openstack/project-config master: Add per-project Airship groups https://review.opendev.org/680717 | 20:35 |
*** notmyname has joined #openstack-infra | 20:42 | |
*** ociuhandu has joined #openstack-infra | 20:44 | |
*** kopecmartin is now known as kopecmartin|off | 20:45 | |
johnsom | clarkb rax-iad https://zuul.openstack.org/stream/cd166c6ecde942259f27abaeef8b89fa?logfile=console.log | 20:45 |
johnsom | I am getting remote logs for this one too | 20:46 |
clarkb | johnsom: placed a hold on it | 20:47 |
clarkb | I'll also ssh in now to see if my connection dies | 20:47 |
johnsom | I am going to run grab lunch as it will be at least 20 minutes for devstack. | 20:48 |
clarkb | k I'll keep an eye on it out of the corner of my eye | 20:48 |
openstackgerrit | Merged opendev/bindep master: Fix apk handling of warnings/output problems https://review.opendev.org/602433 | 20:50 |
*** markvoelker has quit IRC | 21:02 | |
*** ociuhandu has quit IRC | 21:09 | |
johnsom | clarkb Tempest is just starting | 21:11 |
clarkb | I'm still ssh'd in | 21:12 |
*** Goneri has quit IRC | 21:14 | |
clarkb | I don't think this is the problem but I wonder if neutron wants to clean it up: netlink: 'ovs-vswitchd': attribute type 5 has an invalid length. | 21:15 |
johnsom | Yeah I am seeing a lot of those too | 21:16 |
johnsom | Neutron does seem to do a lot of stuff at the start of tempest. Its log was scrolling pretty fast | 21:16 |
*** markvoelker has joined #openstack-infra | 21:16 | |
fungi | sean-k-mooney: is that ovs-vswitchd attribute length error something you're aware of? | 21:17 |
clarkb | I'm running watch against ip addr show eth0 ; df -h ; w ; ps -elf | grep tempest | 21:19 |
fungi | good call | 21:22 |
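That is, one loop watching the address, the disk, logins and the tempest workers together, roughly:

    watch -n 5 'ip addr show eth0; df -h; w; ps -elf | grep tempest'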
johnsom | boom kernel oops | 21:23 |
fungi | so we are triggering a panic? | 21:23 |
johnsom | https://www.irccloud.com/pastebin/7VeZaVsl/ | 21:23 |
clarkb | I didn't get that with my watch | 21:23 |
fungi | and i guess it's hypervisor-specific | 21:23 |
fungi | so probably only happening on xen guests | 21:23 |
clarkb | looks like an ovs issue | 21:24 |
clarkb | exciting | 21:24 |
fungi | neat. sean-k-mooney may be interested in that too | 21:24 |
johnsom | "exciting" Yes, yes it has been given it's freeze week | 21:25 |
fungi | any idea what might have changed around that in the relevant timeframe where the behavior seems to have emerged? | 21:26 |
johnsom | Nothing in our project, that is why this has been a pain. | 21:26 |
clarkb | could be a kernel update in bionic | 21:26 |
clarkb | or a xen update in the cloud(s) | 21:27 |
fungi | neutron changes? new ovs version? is it just happening on ubuntu bionic or do we have suspected failures on other distros/versions? | 21:27 |
fungi | but yeah, at least it's happening around feature freeze week, not release week | 21:27 |
clarkb | getting centos version of this test to run on rax to generate a bt (assuming it fails at all) would be a great next step I expect | 21:28 |
fungi | getting a minimum reproducer might make that easier too | 21:29 |
*** jamesmcarthur has quit IRC | 21:29 | |
clarkb | also could it be related to the invalid length error above? | 21:29 |
fungi | a distinct possibility | 21:29 |
johnsom | We started seeing it on the 3rd, so something around that time. | 21:30 |
johnsom | Looking to see if the centos-7 version of this has ever done this. (though centos has had its own issues) | 21:30 |
fungi | if the ovs error messages show up in success logs too, then that seems unlikely to be related (though still possible i suppose) | 21:30 |
johnsom | Well, I don't see them in this successful run: https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz assuming they go into syslog too | 21:31 |
johnsom | Ok, for one patch, I have not seen it on the centos7 version. | 21:33 |
clarkb | johnsom: they are in that log you linked | 21:33 |
clarkb | https://zuul.opendev.org/t/openstack/build/065acaa6544342348b3eef799e627696/log/controller/logs/syslog.txt.gz#2399 for example | 21:33 |
johnsom | Oh, I must have typo'd | 21:33 |
clarkb | so ya likely not related but could potentially still be if xen somehow tickles a corner case around the handling of that | 21:34 |
clarkb | either way cleaning it up would aid debugging because it's one less thing to catch your eye and be confused about | 21:34 |
fungi | yeah, maybe that's still a bug but only leads to a fatal condition on xen guests? | 21:34 |
fungi | wild guesses at this stage though | 21:35 |
clarkb | https://patchwork.ozlabs.org/patch/1012045/ I think it may just be noise | 21:36 |
fungi | seems that way | 21:37 |
fungi | so do we know which tempest test seems to have triggered the panic? | 21:37 |
clarkb | if the node did end up being held I may be able to restart the server and get the stestr log | 21:38 |
clarkb | I'll work on that now | 21:38 |
* fungi shakes fist at broadband provider... nearly 7 hours so far on repairing a fiber cut somewhere on the mainland | 21:39 | |
clarkb | does not look like holds apply in this case (because the job is retried?) | 21:39 |
fungi | you were doing a ps though... do you have the last one from before it died? can we infer tests from that? | 21:40 |
fungi | or i guess the console log stream also mentioned tests | 21:40 |
*** e0ne has quit IRC | 21:40 | |
clarkb | the ps doesn't tell you what tests are running | 21:41 |
clarkb | just that there were 4 processes running tests | 21:41 |
*** snapiri has quit IRC | 21:41 | |
johnsom | I was a bit surprised to see it running 4 for the concurrency, but I also tried locking it down to 2 with no luck | 21:42 |
fungi | ahh, yep, you used the f option (so included child processes) but then omitted those lines with the grep for "tempest" | 21:42 |
clarkb | fungi: we don't start new processes for each test though | 21:43 |
clarkb | the 4 child processes run their 1/4 of the tests loaded from stdin iirc | 21:44 |
fungi | sure, thought it might be inferrable from whatever shell commands were appearing in child process command lines, but also probably too short-lived for the actual culprit to be captured by watch | 21:44 |
fungi | i guess if we can reproduce it with tempest concurrency set to 1 that might allow narrowing it down | 21:45 |
fungi | it happens fairly early in the testset right? | 21:46 |
clarkb | ya I guess that depends on whether or not neutron uses the api or cli commands to interact with ovs there. However that said it almost looks like the failure was in frame processing | 21:46 |
clarkb | entry_SYSCALL_64_after_hwframe+0x3d/0xa2 | 21:46 |
clarkb | and crashes in do_output | 21:46 |
clarkb | so probably not easy to track that back to a specific test | 21:46 |
fungi | yeah, maybe ethernet frame forwarding? | 21:47 |
fungi | in which case perhaps any test which produces a fair amount of guest traffic might tickle it | 21:48 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Add support for the Gerrit checks plugin https://review.opendev.org/680778 | 21:49 |
openstackgerrit | James E. Blair proposed zuul/zuul master: WIP: Add enqueue reporter action https://review.opendev.org/681132 | 21:49 |
clarkb | fungi: ya | 21:49 |
clarkb | thinking out loud here I wonder if having zuul retry jobs that blow up like this actually makes it harder to debug | 21:50 |
clarkb | fewer people notice because most of the time the retried probably do work by running on another cloud | 21:50 |
clarkb | and you can't hold the test nodes | 21:50 |
clarkb | etc | 21:50 |
clarkb | https://launchpad.net/ubuntu/+source/linux/4.15.0-60.67 went out on the third fwiw | 21:53 |
fungi | potential culprit, yep. is that what our images have? | 21:54 |
clarkb | 4.15.0-60-generic #67-Ubuntu os from the panic so ya I think they are | 21:55 |
clarkb | http://launchpadlibrarian.net/439948149/linux_4.15.0-58.64_4.15.0-60.67.diff.gz is the diff between the code that went out on the 3rd and what was in ubuntu prior | 21:56 |
sean-k-mooney | what does the waiting state mean in the zuul dashboard. how is it different from queued | 21:57 |
sean-k-mooney | does that mean its waiting for quota/nodepool? | 21:58 |
sean-k-mooney | but zuul has selected the job to run when the nodeset has been fulfilled? | 21:58 |
johnsom | https://www.irccloud.com/pastebin/Tnfw12ug/ | 21:58 |
johnsom | So, this is fixed in -62 | 21:58 |
*** markvoelker has quit IRC | 21:59 | |
johnsom | Wonder why the instance wasn't using 62. | 21:59 |
* johnsom feels like the canary project for *world* | 22:00 | |
clarkb | sean-k-mooney: from tobiash a few days ago 04:25:11* tobiash | pabelanger, clarkb, SpamapS: waiting means waiting for a dependency or semaphore (basically no node request created yet), queued means waiting for nodes | 22:00 |
clarkb | johnsom: 62 was only published 7 hours ago | 22:01 |
*** markvoelker has joined #openstack-infra | 22:01 | |
johnsom | Ha, so changelog is much earlier than availability... lol | 22:01 |
sean-k-mooney | ok so it's waiting in the merger? | 22:01 |
*** goldyfruit has quit IRC | 22:01 | |
ianw | infra-root: could i get a +w on https://review.opendev.org/#/c/680895/ so 3rd party testinfra is working please, thanks | 22:02 |
clarkb | ianw trade you https://review.opendev.org/#/c/681100/ | 22:02 |
johnsom | ianw Thanks for the role. It found the kernel panic | 22:03 |
sean-k-mooney | it went from queued to waiting but i don't see a node in http://zuul.openstack.org/nodes that matches the label. so it's unclear why it transitioned state | 22:03 |
pabelanger | sean-k-mooney: if multinode job, likely waiting for node in nodeset | 22:04 |
ianw | johnsom: oh nice! is it a known issue or unique? | 22:04 |
pabelanger | otherwise, nodepool is running at capacity and no nodes available | 22:04 |
johnsom | ianw Evidently the fixed kernel went out 7 hours ago | 22:05 |
clarkb | johnsom: note that bug is specific to fragmented packets which is somethign we do on our ovs bridges because nesting. So ya seems like really good match and likely is the fix | 22:05 |
ianw | johnsom: haha well nothing like living on the edge! | 22:05 |
sean-k-mooney | pabelanger: it is a multi node job but neither node has been created. could it be waiting for an io operation on the cloud such as uploading an image? | 22:05 |
pabelanger | sean-k-mooney: http://grafana.openstack.org/dashboard/db/zuul-status gives a good overview of what zuul is doing | 22:05 |
clarkb | sean-k-mooney: no we never wait for images to upload unless there are no images | 22:05 |
johnsom | clarkb Path forward? How soon will that hit the mirrors? | 22:05 |
pabelanger | sean-k-mooney: no, image uploads shouldn't be the issue here. Sounds like maybe a quota issue | 22:05 |
clarkb | sean-k-mooney: waiting is "normal" I wouldn't worry about it | 22:05 |
*** markvoelker has quit IRC | 22:06 | |
pabelanger | sean-k-mooney: cannot boot node on cloud, because no space | 22:06 |
sean-k-mooney | clarkb: ok | 22:06 |
johnsom | clarkb matches to the line number in the launchpad bug | 22:06 |
sean-k-mooney | ya i can see FN is at capacity more or less http://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=1 | 22:06 |
pabelanger | there are some odd interactions with multinode jobs and quota on clouds. It is possible for a job to boot 1 of 2 nodes fine, but with 2 of 2 boots, there is no quota | 22:06 |
pabelanger | then you are stuck on the cloud waiting | 22:06 |
pabelanger | we have that issue in zuul.ansible.com | 22:07 |
clarkb | johnsom: we update our mirrors every 4 hours and build images every 24 ish hours | 22:07 |
clarkb | johnsom: I've just checked the mirror and don't see it at http://mirror.dfw.rax.openstack.org/ubuntu/dists/bionic-updates/main/binary-amd64/Packages | 22:08 |
sean-k-mooney | in this case neither node is running (it uses a special label so i can tell from http://zuul.openstack.org/nodes) but it looks like it's waiting for space | 22:08 |
clarkb | johnsom: so I would guess anywhere from the next 4-24 hours ish | 22:08 |
pabelanger | sean-k-mooney: yah, that would be my guess too | 22:08 |
*** goldyfruit has joined #openstack-infra | 22:08 | |
pabelanger | you feel it much more, when running specific labels in a single cloud | 22:08 |
clarkb | http://grafana.openstack.org/d/3Bwpi5SZk/nodepool-fortnebula?orgId=1 you can see that cloud is at capacity | 22:09 |
sean-k-mooney | pabelanger: well this might still be a good thing. previously that label was going to node_error so this might actually be an improvement | 22:09 |
clarkb | the sun just came out so I'm going to sneak out for a quick bike ride before the rain returns | 22:09 |
clarkb | back in a bit | 22:10 |
*** ociuhandu has joined #openstack-infra | 22:10 | |
fungi | johnsom: moral of this story is if you'd just put off debugging another day it would have fixed itself? ;) | 22:13 |
johnsom | Yeah, that probably would have happened if it wasn't freeze week | 22:14 |
* fungi suspects that's a terrible moral for a story anyway | 22:14 | |
*** ociuhandu has quit IRC | 22:15 | |
johnsom | Yeah, burned a good week trying to figure out why we couldn't merge anything | 22:15 |
fungi | what's especially interesting is octavia's testing was thorough enough to hit this bug when neutron's was not | 22:16 |
johnsom | I also wonder if no one is actually testing with floating IPs.... Given this is tied to NAT. It's not like we are load testing either.... | 22:16 |
johnsom | Yeah, not the first time | 22:17 |
*** slaweq has quit IRC | 22:18 | |
*** goldyfruit_ has joined #openstack-infra | 22:21 | |
*** goldyfruit has quit IRC | 22:24 | |
adriant | Is this the irc channel to talk devstack issues? | 22:25 |
adriant | or does devstack have its own channel? | 22:25 |
johnsom | adriant General devstack issues are in #openstack-qa | 22:26 |
fungi | adriant: you're looking for #openstacl-qa | 22:26 |
johnsom | What he said.... grin | 22:27 |
fungi | er, the one johnsom said. you know, with the accurate typing | 22:27 |
adriant | johnsom wins :P | 22:27 |
adriant | ty | 22:27 |
pabelanger | ianw: in devstack, is there an option to enable nested virt for centos? Like is it modifying modprobe.d configs? | 22:28 |
pabelanger | not that I am asking to enable it, want to see how it is done | 22:29 |
pabelanger | cause I need to modprobe -r kvm && modprobe kvm to toggle it | 22:29 |
pabelanger | not sure why RPM isn't reading modprobe config | 22:30 |
Roamer` | pabelanger, judging by the fact that devstack's doc/source/guides/devstack-with-nested-kvm.rst explains how to do the rmmod/modprobe dance for different CPU types, and from the fact that there is no mention of rmmod in devstack itself other than this file, I'd say most probably not :/ | 22:34 |
openstackgerrit | Merged opendev/system-config master: Set zuul_work_dir for tox testing https://review.opendev.org/680895 | 22:36 |
*** markvoelker has joined #openstack-infra | 22:36 | |
pabelanger | Roamer`: thanks! I should read the manual next time. that is exactly what I am doing too | 22:36 |
pabelanger | rmmod | 22:37 |
pabelanger | above was a typo :) | 22:37 |
*** bobh has joined #openstack-infra | 22:37 | |
johnsom | pabelanger export DEVSTACK_GATE_LIBVIRT_TYPE=kvm is the setting for devstack to setup nova for it. | 22:39 |
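In a plain devstack run the same knob is LIBVIRT_TYPE in local.conf (DEVSTACK_GATE_LIBVIRT_TYPE is the devstack-gate wrapper around it); a minimal sketch:

    [[local|localrc]]
    LIBVIRT_TYPE=kvm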
*** EmilienM is now known as little_script | 22:39 | |
pabelanger | ty | 22:39 |
Roamer` | pabelanger, what do you mean RPM is not reading the modprobe config though? is it possible that the module has been already loaded, maybe even at boot time, and you modifying the config later has no effect without, well, reloading it using rmmod/modprobe? :) | 22:39 |
pabelanger | Roamer`: so, if I setup /etc/modprobe.d/kvm.conf with 'options kvm_intel nested=1' then yum install qemu-kvm, nested isn't enabled. I need to bounce the module, for it to work | 22:41 |
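What that looks like in practice, as a sketch (kvm_intel assumed; kvm_amd on AMD hosts):

    echo 'options kvm_intel nested=1' > /etc/modprobe.d/kvm.conf
    # the option only takes effect when the module is (re)loaded
    modprobe -r kvm_intel && modprobe kvm_intel
    cat /sys/module/kvm_intel/parameters/nested   # Y (or 1 on older kernels)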
*** factor has quit IRC | 22:41 | |
pabelanger | not an issue, but surprised it wasn't loaded properly | 22:41 |
*** factor has joined #openstack-infra | 22:41 | |
*** little_script is now known as EmilienM | 22:42 | |
*** bobh has quit IRC | 22:45 | |
Roamer` | pabelanger, well, sorry if I'm being obtuse and asking for things that may be obvious and you may have already checked, but still, are you really sure that the kvm_intel module has not been already loaded even before you installed qemu-kvm? I can't really think of anything that would want to load it, it's certainly not loaded by default, but it's usually part of the kernel package, so it will | 22:45 |
Roamer` | have been *installed* before you install qemu-kvm | 22:45 |
*** xenos76 has quit IRC | 22:46 | |
*** markvoelker has quit IRC | 22:46 | |
pabelanger | good point, I should check that | 22:46 |
pabelanger | I assumed qemu-kvm was pulling it in | 22:47 |
clarkb | pabelanger: note you only set nested=1 on the first hypervisor | 22:48 |
clarkb | on your second you can consume the nested virt without enabling it for the next layer | 22:48 |
*** happyhemant has quit IRC | 22:49 | |
*** rlandy is now known as rlandy|bbl | 22:50 | |
*** icarusfactor has joined #openstack-infra | 22:51 | |
*** mriedem has quit IRC | 22:51 | |
*** goldyfruit_ has quit IRC | 22:51 | |
clarkb | pabelanger: also note you cannot enable nested virt from the middle hypervisor if the first does not have it enabled | 22:52 |
clarkb | and note that it crashes a lot | 22:52 |
clarkb | but in linux 4.19 it is finally enabled by default for intel cpus so in theory its a lot better past that point in time | 22:53 |
*** factor has quit IRC | 22:53 | |
*** tkajinam has joined #openstack-infra | 22:55 | |
pabelanger | clarkb: Yup agree! Not going to run this in production, mostly wanted to understand how people enabled it for jobs | 22:56 |
*** Lucas_Gray has quit IRC | 22:57 | |
*** rcernin has joined #openstack-infra | 22:59 | |
ianw | clarkb: :/ ubuntu mirroring failed @ 2019-09-05T22:38:09,181762752+00:00 ish ... http://paste.openstack.org/show/774655/ | 23:11 |
clarkb | ianw: similar problems to fedora? | 23:11 |
*** owalsh has quit IRC | 23:11 | |
clarkb | auth expired because vos release took too long? | 23:11 |
ianw | yeah, seems likely. note that's mirror-update.openstack.org so the old server, and also rsync isn't involved there | 23:12 |
*** Lucas_Gray has joined #openstack-infra | 23:14 | |
*** owalsh has joined #openstack-infra | 23:14 | |
*** threestrands has joined #openstack-infra | 23:14 | |
ianw | istr from scrollback some sort of issues, and davolserver was restarted @ "Sep 6 18:16 /proc/4595" | 23:18 |
clarkb | ya afs02.dfw was out to lunch (really high load and ssh not responding) the console log showed it had a bunch of kernel timeouts for disks | 23:18 |
clarkb | and processes so we rebooted it | 23:18 |
clarkb | other vos releases were working but we didn't check all of them, it's possible we should've checked all of them | 23:18 |
ianw | hrm, so that failure happened before the reboot, and could be explained by afs02 being in bad state then | 23:21 |
clarkb | oh I didn't notice the date on the failure | 23:21 |
ianw | my only concern though is that if i unlock the volume, r/o needs to be completely recreated and maybe *that* will timeout now | 23:21 |
clarkb | but ya likely explained by that | 23:21 |
clarkb | ianw: ya in the past when I've manually unlocked I've done a manual sync with the lock held then run the vos release from screen on afs01.dfw using localauth | 23:22 |
clarkb | the localauth doesn't timeout | 23:22 |
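The shape of that recovery, using mirror.ubuntu as an example volume name and assuming it runs as root on afs01 inside screen (localauth tokens are built from the server's own key, so there is no Kerberos ticket to expire):

    screen -S afs-release
    vos unlock mirror.ubuntu -localauth
    vos release -v mirror.ubuntu -localauth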
ianw | right, it just seems like most volumes are in this state :/ | 23:22 |
clarkb | aspiers: following up on the zuul backlog. I've tracked at least part of the problem back to ironic retrying tempest jobs multiple times due to filling up disks on rax instances | 23:25 |
clarkb | aspiers: that means we get 3 attempts * 3 hour timeouts we have to wait for | 23:25 |
aspiers | clarkb: ouch, nice find! | 23:25 |
clarkb | so those ironic changes hang out in the queue for forever | 23:26 |
sean-k-mooney | clarkb the jobs still ended up in node failure so it looks like the nodepool restart was not the issue. | 23:27 |
clarkb | sean-k-mooney: ok good to rule out the config loading then | 23:27 |
clarkb | sean-k-mooney: let me check if we are still having ssh errors | 23:27 |
sean-k-mooney | this was the patch that failed if that helps but it's the only job that uses the numa labels | 23:28 |
sean-k-mooney | https://review.opendev.org/#/c/680738/ | 23:28 |
clarkb | sean-k-mooney: same http error | 23:28 |
sean-k-mooney | ok | 23:28 |
clarkb | sean-k-mooney: I'm going to dig through the logs for a specific instance to see if I can napkin math verify we are hitting the timeout | 23:28 |
sean-k-mooney | for now i'm just going to rework the job to use labels we know work. | 23:29 |
sean-k-mooney | i would like to have this working before FF to hopefully help merge the numa migration feature it's testing | 23:29 |
*** dchen has joined #openstack-infra | 23:29 | |
clarkb | looks like nova says the instance is active in about 35 seconds then we timeout after 120 seconds | 23:30 |
fungi | slow to boot then? | 23:30 |
sean-k-mooney | so it's hitting the ssh timeout | 23:30 |
sean-k-mooney | fungi: it should not be. it has a numa topology but that normally makes it faster | 23:31 |
clarkb | sean-k-mooney: ya the traceback is definitely the ssh timeout. I just wanted to make sure the math wasn't coming up that short, but the logging timestamps seem to indicate it hasn't | 23:31 |
sean-k-mooney | it also has more ram at least on the controller | 23:31 |
fungi | grabbing a nova console log from a boot attempt ought to help | 23:31 |
sean-k-mooney | 16G instead of 8 | 23:31 |
clarkb | specifically it is trying to scan the ssh hostkeys | 23:31 |
clarkb | brainstorming: we could be failing to get entropy to generate host keys? | 23:31 |
sean-k-mooney | it's really strange | 23:31 |
sean-k-mooney | oh | 23:32 |
clarkb | and sshd won't start until it has generated them | 23:32 |
sean-k-mooney | yes we could | 23:32 |
clarkb | does numa affect entropy in VMs? also we run haveged | 23:32 |
sean-k-mooney | we could enable the hardware random number gen in the guest | 23:32 |
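Two quick checks for that theory, plus the standard nova knobs for handing a virtio RNG to guests; the image and flavor names are placeholders and whether FN actually needs this is only a guess:

    # inside a slow guest: is the entropy pool starved while host keys generate?
    cat /proc/sys/kernel/random/entropy_avail

    # cloud side: request a virtio-rng device for instances built from this image/flavor
    openstack image set --property hw_rng_model=virtio ubuntu-bionic
    openstack flavor set --property hw_rng:allowed=True multi-numa-expanded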
fungi | or it could be timing out waiting for device scanning to settle or a particular block device to appear or any number of other things that it waits for during boot | 23:32 |
clarkb | fungi: ya | 23:32 |
sean-k-mooney | clarkb: it should not | 23:32 |
clarkb | fungi: when I booted by hand though we had none of those issues | 23:32 |
clarkb | fungi: we can try booting by hand more | 23:33 |
* clarkb tries now | 23:33 | |
sean-k-mooney | fungi: also we have other labels that provide identical flavor with a different name that work | 23:33 |
fungi | oh, that's wacky | 23:33 |
sean-k-mooney | ya | 23:33 |
clarkb | the hostids for all three attempts were different | 23:34 |
clarkb | donnyd: 71a9bf7925f98026e8a268d2cda4f8623c4812e3025626a2b395e7b0 is the hostid and it tried booting f5708f13-1918-44ce-8c2f-3676712d12a0 | 23:34 |
clarkb | if you want to double check the hyervisor for anything funny | 23:34 |
fungi | could nodepool be trying the wrong interface's address maybe? | 23:34 |
sean-k-mooney | i think it's booting with just one interface and it seems to be accessing the ipv6 ip | 23:35 |
clarkb | nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 2001:470:e045:8000:f816:3eff:fe57:3b6c on port 22 | 23:35 |
sean-k-mooney | well it might have more interfaces | 23:35 |
ianw | clarkb: so basically every volume of interest is locked ... http://paste.openstack.org/show/774658/ | 23:35 |
clarkb | that is 387fa4f4-f69a-4654-9c39-c9fd3443bd6a on 247bdf6d16fe8bfc4c76cbe3a03e5933b4215077c60e516b1a26fbd8 | 23:35 |
sean-k-mooney | that's a publicly routable ipv6 address | 23:35 |
clarkb | ianw: bah ok | 23:35 |
ianw | at this point, i think the best option is to shutdown the two mirror-updates to stop things getting any worse, and do localauth releases of those volumes | 23:36 |
clarkb | ianw: don't we need to run their respective syncs before vos releasing? | 23:36 |
clarkb | I suppose if we are happy with the RW state then your plan will work | 23:36 |
donnyd | I am afk atm | 23:37 |
donnyd | I can take a look when I get back | 23:37 |
sean-k-mooney | fungi: these are the labels/flavors/images/keys that we are using | 23:37 |
sean-k-mooney | https://github.com/openstack/project-config/blob/master/nodepool/nl02.openstack.org.yaml#L343-L357 | 23:37 |
sean-k-mooney | ubuntu-bionic-expanded works fine | 23:38 |
ianw | clarkb: yeah, for mine i think we get the volumes back in sync, and then let the normal mirroring process run | 23:38 |
*** Lucas_Gray has quit IRC | 23:38 | |
sean-k-mooney | when i use multi-numa-ubuntu-bionic-expanded or multi-numa-ubuntu-bionic it does not work | 23:38 |
fungi | sean-k-mooney: okay, so they're using different flavors | 23:38 |
sean-k-mooney | fungi: the flavors are identical https://www.irccloud.com/pastebin/FWxMEIqc/ | 23:38 |
clarkb | fungi: the flavors are named differently but have the same attributes. However maybe there are attributes not exposed by flavor show that are different | 23:39 |
sean-k-mooney | well the non-expanded one has 8G instead of 16G of ram but otherwise they are the same | 23:39 |
clarkb | I'm able to get right in on the node I just booted with multi-numa-expanded | 23:40 |
sean-k-mooney | clarkb: could you test a multi-numa instance instead of the expanded one. i mean ssh runs fine on a 64mb cirros image but just in case | 23:41 |
*** icarusfactor has quit IRC | 23:41 | |
clarkb | 23:39:05 is first entry in dmesg -T, 23:39:11 is sshd starting, 23:39:45 is me logging in | 23:41 |
clarkb | sean-k-mooney: ya though the one I have failure for is the expanded label | 23:42 |
sean-k-mooney | oh ok | 23:42 |
*** exsdev has quit IRC | 23:42 | |
*** exsdev0 has joined #openstack-infra | 23:42 | |
*** exsdev0 is now known as exsdev | 23:42 | |
sean-k-mooney | well honestly i have done 90% of my openstack dev in multi numa vms for the last 6 years and i have never seen it have any effect on boot time or time to ssh working | 23:43 |
clarkb | I can also hit it from nl02 via ipv6 to port 22 with telnet | 23:43 |
clarkb | just ruling out networking problems between nl02 and fn | 23:43 |
* clarkb tests the non expanded just to be safe | 23:43 | |
clarkb | both flavors seem to work manually | 23:45 |
clarkb | I did get a brief no route to host then host finished booting and ssh worked | 23:45 |
clarkb | all within a minute, well under the timeout | 23:46 |
sean-k-mooney | donnyd: how do you get your ipv6 routing? | 23:46 |
sean-k-mooney | could there be a propogation dely in the route being acessable ? | 23:46 |
sean-k-mooney | although that woudl not explain why the other lable seams to work at least most of the time | 23:47 |
sean-k-mooney | i think i had some node_failures with ubuntu-bionic-expanded but i think we tracked thoes to quota | 23:47 |
sean-k-mooney | i assme we have not seen other node_failres with FN? | 23:48 |
sean-k-mooney | its just this spefic set of labels? | 23:48 |
clarkb | not that I know of | 23:49 |
clarkb | these are the only labels that are fn specific | 23:50 |
clarkb | other clouds can pick up the slack if it happens then we don't see a node failure, just another cloud serviing the request | 23:50 |
clarkb | let me see if I can grep for this happening in fn on other labels | 23:50 |
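i.e. something along the lines of the following on the launcher, with the log path being an assumption:

    grep "Launch failed" /var/log/nodepool/launcher-debug.log | grep -i fortnebula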
sean-k-mooney | well the ubuntu-bionic-expanded and centos-7-expanded are also FN specific | 23:51 |
sean-k-mooney | but nothing was previously using them | 23:51 |
clarkb | 2019-09-09 18:36:39,267 ERROR nodepool.NodeLauncher: [node: 0011030429] Launch failed for node opensuse-tumbleweed-fortnebula-regionone-0011030429 | 23:51 |
clarkb | it is happening for other labels too | 23:51 |
sean-k-mooney | so maybe in other cases it's retrying on a different provider | 23:52 |
clarkb | ya | 23:52 |
sean-k-mooney | this is not a host key thing right like when people reused ipv4 addresses | 23:53 |
sean-k-mooney | it's not able to connect rather than being rejected | 23:53 |
clarkb | sean-k-mooney: correct my read of it is tcp is failing to connect to port 22 | 23:55 |
clarkb | so ya a networking issue in the cloud could explain it | 23:55 |
sean-k-mooney | well we do have that recurring issue in the tempest test where ssh sometimes does not work... | 23:55 |
ianw | #status log mirror-update.<opendev|openstack>.org shutdown during volume recovery. script running in root screen on afs01 unlocking and doing localauth releases on all affected volumes | 23:56 |
openstackstatus | ianw: finished logging | 23:56 |
sean-k-mooney | which i believe is somehow related to the l3 agent. i wonder if there's a race with neutron setting up routing of the vm | 23:56 |
fungi | so if it's just a sometimes network issue, retrying that job should work sometimes and return node_error other times | 23:58 |
clarkb | fungi: ya though with the way queues have been it might be a while :/ | 23:58 |
clarkb | I've not heard anything from ironic re the disk issues yet. here is hoping that is because of CEST timezones | 23:59 |
sean-k-mooney | fungi: ya so it could be bad luck that the old ubuntu expanded label seems to work and this one does not. | 23:59 |