*** Tim_ok has quit IRC | 00:02 | |
*** sthussey has quit IRC | 00:22 | |
*** felipemonteiro has joined #openstack-infra | 00:24 | |
openstackgerrit | Merged openstack-infra/zuul-jobs master: use find instead of ls to list interfaces https://review.openstack.org/604677 | 00:26 |
*** longkb has joined #openstack-infra | 00:31 | |
*** jamesdenton has quit IRC | 00:36 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Allow debootstrap to cleanup without a kernel https://review.openstack.org/604692 | 00:42 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add Gentoo iptables handling https://review.openstack.org/604688 | 00:50 |
Shrews | clarkb: not sure if you're still seeing it, but if nodes are locked after restarting all launchers, then zuul has the lock and you shouldn't try to undo that | 00:51 |
*** hashar has joined #openstack-infra | 00:56 | |
*** felipemonteiro has quit IRC | 00:57 | |
clarkb | Shrews: even after 14 days? | 00:59 |
clarkb | maybe it is a hold? | 00:59 |
ianw | clarkb: there are a couple of held bhs1 nodes ... that was what i was looking at for the old configs | 01:02 |
clarkb | ah maybe we can clean those up now? | 01:03 |
*** rlandy has joined #openstack-infra | 01:05 | |
ianw | clarkb: hrm, none of the bhs1 nodes currently seem to have a comment | 01:06 |
ianw | hrm, well maybe it never had a comment. 0001975067 is the node i'm talking about from https://review.openstack.org/#/c/603988/ ... that was one of the few nodes that remained after the region was shutdown | 01:10 |
*** graphene has joined #openstack-infra | 01:12 | |
prometheanfire | just one more review is needed for https://review.openstack.org/#/c/602439/ to get gentoo as a usable image | 01:15 |
prometheanfire | :( | 01:17 |
*** rlandy has quit IRC | 01:19 | |
*** mrsoul has joined #openstack-infra | 01:19 | |
*** harlowja has quit IRC | 01:22 | |
*** hashar has quit IRC | 01:26 | |
openstackgerrit | Ian Wienand proposed openstack-infra/puppet-graphite master: [wip] rpsec: check service running https://review.openstack.org/605286 | 01:26 |
*** graphene has quit IRC | 01:26 | |
*** owalsh_ has joined #openstack-infra | 01:29 | |
Shrews | clarkb: I don't think nodes in hold are locked. There is a bug in zuul somewhere that keeps nodes locked, but I forget what triggers it. | 01:31 |
Shrews | That might be what you're seeing | 01:31 |
*** owalsh has quit IRC | 01:33 | |
*** rh-jelabarre has quit IRC | 01:34 | |
pabelanger | IIRC, it happens when a dynamic reload happens during a noderequest, if the job is removed from zuul, we leak the noderequest and will never unlock | 01:35 |
*** hongbin has joined #openstack-infra | 01:39 | |
*** jamesdenton has joined #openstack-infra | 01:46 | |
*** annp has joined #openstack-infra | 01:47 | |
openstackgerrit | Matthew Thode proposed openstack-infra/openstack-zuul-jobs master: add Gentoo jobs and vars and also fix install test https://review.openstack.org/602439 | 01:58 |
*** rkukura has quit IRC | 02:00 | |
*** adriant has quit IRC | 02:05 | |
*** adriant has joined #openstack-infra | 02:14 | |
*** adriant has quit IRC | 02:15 | |
*** adriant has joined #openstack-infra | 02:17 | |
*** adriant has quit IRC | 02:31 | |
*** psachin has joined #openstack-infra | 02:39 | |
*** graphene has joined #openstack-infra | 02:40 | |
*** apetrich has quit IRC | 02:42 | |
*** Bhujay has joined #openstack-infra | 02:49 | |
*** imacdonn has quit IRC | 02:50 | |
*** imacdonn has joined #openstack-infra | 02:50 | |
*** felipemonteiro has joined #openstack-infra | 03:02 | |
*** adriant has joined #openstack-infra | 03:07 | |
*** Bhujay has quit IRC | 03:07 | |
*** felipemonteiro has quit IRC | 03:19 | |
*** ijw has joined #openstack-infra | 03:19 | |
*** diablo_rojo has quit IRC | 03:19 | |
*** ijw has quit IRC | 03:24 | |
*** ramishra has joined #openstack-infra | 03:26 | |
*** jamesmcarthur has joined #openstack-infra | 03:42 | |
*** dave-mccowan has quit IRC | 03:44 | |
*** dave-mccowan has joined #openstack-infra | 03:46 | |
*** jamesmcarthur has quit IRC | 03:48 | |
*** jamesmcarthur has joined #openstack-infra | 03:49 | |
*** armax has quit IRC | 03:53 | |
*** udesale has joined #openstack-infra | 03:53 | |
openstackgerrit | Merged openstack-infra/project-config master: Fix not working kolla graphs https://review.openstack.org/605026 | 03:54 |
*** ykarel has joined #openstack-infra | 03:59 | |
*** hongbin has quit IRC | 03:59 | |
*** felipemonteiro has joined #openstack-infra | 04:06 | |
*** pcaruana has joined #openstack-infra | 04:14 | |
*** jamesmcarthur has quit IRC | 04:21 | |
*** jamesmcarthur has joined #openstack-infra | 04:23 | |
*** jamesmcarthur has quit IRC | 04:27 | |
*** yamamoto has quit IRC | 04:31 | |
*** yamamoto has joined #openstack-infra | 04:31 | |
*** pcaruana has quit IRC | 04:38 | |
openstackgerrit | Merged openstack-infra/zuul master: Web: don't update the status cache more than once https://review.openstack.org/605243 | 04:57 |
*** auristor has quit IRC | 05:03 | |
*** jamesmcarthur has joined #openstack-infra | 05:05 | |
*** e0ne has joined #openstack-infra | 05:08 | |
*** jamesmcarthur has quit IRC | 05:10 | |
*** jamesmcarthur has joined #openstack-infra | 05:25 | |
*** Bhujay has joined #openstack-infra | 05:26 | |
*** jamesmcarthur has quit IRC | 05:31 | |
cloudnull | fungi clarkb been afk - still need anything with that instance? | 05:32 |
*** Bhujay has quit IRC | 05:32 | |
*** rkukura has joined #openstack-infra | 05:33 | |
cloudnull | going to bed finally; however, if it needs digging into let me know and i'll tackle it first thing in the morning | 05:35 |
*** auristor has joined #openstack-infra | 05:39 | |
*** dave-mccowan has quit IRC | 05:39 | |
*** quique|rover|off is now known as quiquell|rover | 05:40 | |
prometheanfire | yarp | 05:40 |
*** pcaruana has joined #openstack-infra | 05:43 | |
*** apetrich has joined #openstack-infra | 05:46 | |
*** jamesmcarthur has joined #openstack-infra | 05:47 | |
*** e0ne has quit IRC | 05:49 | |
*** Bhujay has joined #openstack-infra | 05:49 | |
*** felipemonteiro has quit IRC | 05:50 | |
*** rkukura has quit IRC | 05:50 | |
*** jamesmcarthur has quit IRC | 05:51 | |
*** jistr has quit IRC | 05:55 | |
*** jistr has joined #openstack-infra | 05:56 | |
*** jamesmcarthur has joined #openstack-infra | 06:08 | |
*** jamesmcarthur has quit IRC | 06:12 | |
*** jtomasek has joined #openstack-infra | 06:13 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: [wip] port graphite setup to ansible https://review.openstack.org/605336 | 06:14 |
*** aojea has joined #openstack-infra | 06:22 | |
*** dpawlik has joined #openstack-infra | 06:23 | |
*** diablo_rojo has joined #openstack-infra | 06:27 | |
*** jamesmcarthur has joined #openstack-infra | 06:28 | |
*** jamesmcarthur has quit IRC | 06:33 | |
*** quiquell|rover is now known as quique|rover|brb | 06:43 | |
*** chkumar|off is now known as chkumar|ruck | 06:44 | |
*** ijw has joined #openstack-infra | 06:46 | |
*** jamesmcarthur has joined #openstack-infra | 06:50 | |
*** graphene has quit IRC | 06:50 | |
*** ijw has quit IRC | 06:50 | |
*** graphene has joined #openstack-infra | 06:51 | |
*** jamesmcarthur has quit IRC | 06:54 | |
*** ginopc has joined #openstack-infra | 06:57 | |
*** quique|rover|brb is now known as quiquell|rover | 07:00 | |
*** rcernin has quit IRC | 07:02 | |
*** alexchadin has joined #openstack-infra | 07:03 | |
*** hashar has joined #openstack-infra | 07:06 | |
*** ijw has joined #openstack-infra | 07:09 | |
*** jamesmcarthur has joined #openstack-infra | 07:10 | |
*** olivierb has joined #openstack-infra | 07:13 | |
*** ijw has quit IRC | 07:13 | |
*** jamesmcarthur has quit IRC | 07:15 | |
*** olivierb has quit IRC | 07:17 | |
*** olivierb has joined #openstack-infra | 07:17 | |
openstackgerrit | Andreas Jaeger proposed openstack-infra/openstack-zuul-jobs master: Remove tricircle dsvm jobs https://review.openstack.org/605344 | 07:19 |
*** psachin has quit IRC | 07:21 | |
*** shardy has joined #openstack-infra | 07:23 | |
*** alexchadin has quit IRC | 07:25 | |
*** psachin has joined #openstack-infra | 07:26 | |
*** jamesmcarthur has joined #openstack-infra | 07:31 | |
*** jamesmcarthur has quit IRC | 07:37 | |
*** ijw has joined #openstack-infra | 07:44 | |
*** jpich has joined #openstack-infra | 07:48 | |
*** ijw has quit IRC | 07:50 | |
*** alexchadin has joined #openstack-infra | 07:51 | |
*** jpena|off is now known as jpena | 07:51 | |
*** jamesmcarthur has joined #openstack-infra | 07:53 | |
*** rcernin has joined #openstack-infra | 07:56 | |
*** alexchadin has quit IRC | 07:57 | |
*** jamesmcarthur has quit IRC | 07:58 | |
*** ykarel is now known as ykarel|lunch | 07:59 | |
*** alexchadin has joined #openstack-infra | 07:59 | |
*** Guest42266 has joined #openstack-infra | 08:05 | |
*** hashar has quit IRC | 08:06 | |
*** hashar has joined #openstack-infra | 08:06 | |
*** jamesmcarthur has joined #openstack-infra | 08:09 | |
*** e0ne has joined #openstack-infra | 08:10 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 08:11 |
*** jamesmcarthur has quit IRC | 08:13 | |
*** Emine has joined #openstack-infra | 08:17 | |
*** e0ne has quit IRC | 08:20 | |
*** ykarel|lunch is now known as ykarel | 08:21 | |
*** jamesmcarthur has joined #openstack-infra | 08:24 | |
*** noama has joined #openstack-infra | 08:25 | |
*** jamesmcarthur has quit IRC | 08:29 | |
*** jistr has quit IRC | 08:30 | |
*** jistr has joined #openstack-infra | 08:31 | |
*** derekh has joined #openstack-infra | 08:33 | |
*** derekh has quit IRC | 08:33 | |
*** derekh has joined #openstack-infra | 08:34 | |
*** shardy has quit IRC | 08:35 | |
*** shardy has joined #openstack-infra | 08:36 | |
*** jamesmcarthur has joined #openstack-infra | 08:40 | |
*** jamesmcarthur has quit IRC | 08:45 | |
*** owalsh_ is now known as owalsh | 08:45 | |
*** olivier__ has joined #openstack-infra | 08:46 | |
*** olivierb has quit IRC | 08:46 | |
*** tosky has joined #openstack-infra | 08:48 | |
*** apetrich has quit IRC | 08:49 | |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Doc: executor operations document pause, remove graceful https://review.openstack.org/602455 | 08:55 |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: Doc: executor operations - explain jobs will be restarted at restart https://review.openstack.org/603136 | 08:55 |
*** jamesmcarthur has joined #openstack-infra | 08:56 | |
*** chkumar|ruck has quit IRC | 08:58 | |
*** chandankumar has joined #openstack-infra | 08:59 | |
*** chandankumar is now known as chkumar|ruck | 09:00 | |
*** jamesmcarthur has quit IRC | 09:01 | |
*** ykarel is now known as ykarel|away | 09:01 | |
*** e0ne has joined #openstack-infra | 09:04 | |
*** ykarel|away has quit IRC | 09:05 | |
*** alexchadin has quit IRC | 09:06 | |
*** dtantsur|afk is now known as dtantsur | 09:10 | |
*** jamesmcarthur has joined #openstack-infra | 09:12 | |
*** rcernin has quit IRC | 09:16 | |
*** jamesmcarthur has quit IRC | 09:17 | |
*** alexchadin has joined #openstack-infra | 09:20 | |
*** pbourke has quit IRC | 09:22 | |
*** pbourke has joined #openstack-infra | 09:23 | |
*** alexchadin has quit IRC | 09:25 | |
*** electrofelix has joined #openstack-infra | 09:26 | |
*** ssbarnea|bkp has quit IRC | 09:28 | |
*** jamesmcarthur has joined #openstack-infra | 09:28 | |
*** jamesmcarthur has quit IRC | 09:33 | |
*** priteau has joined #openstack-infra | 09:41 | |
*** alexchadin has joined #openstack-infra | 09:42 | |
*** jamesmcarthur has joined #openstack-infra | 09:44 | |
*** shardy is now known as shardy_mtg | 09:48 | |
*** jamesmcarthur has quit IRC | 09:49 | |
*** hashar is now known as hasharAway | 09:51 | |
*** gfidente has joined #openstack-infra | 09:53 | |
*** jamesmcarthur has joined #openstack-infra | 10:00 | |
*** diablo_rojo has quit IRC | 10:02 | |
*** jamesmcarthur has quit IRC | 10:04 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 10:06 |
*** longkb has quit IRC | 10:08 | |
*** shardy_mtg has quit IRC | 10:11 | |
*** jamesmcarthur has joined #openstack-infra | 10:15 | |
*** jamesmcarthur has quit IRC | 10:22 | |
*** e0ne has quit IRC | 10:27 | |
*** jamesmcarthur has joined #openstack-infra | 10:33 | |
*** yamamoto has quit IRC | 10:35 | |
*** jamesmcarthur has quit IRC | 10:38 | |
*** graphene has quit IRC | 10:40 | |
*** graphene has joined #openstack-infra | 10:41 | |
*** jamesmcarthur has joined #openstack-infra | 10:49 | |
*** alexchadin has quit IRC | 10:49 | |
*** felipemonteiro has joined #openstack-infra | 10:49 | |
*** jamesmcarthur has quit IRC | 10:53 | |
*** felipemonteiro has quit IRC | 10:56 | |
*** alexchadin has joined #openstack-infra | 10:58 | |
*** yamamoto has joined #openstack-infra | 11:00 | |
*** alexchadin has quit IRC | 11:03 | |
*** jamesmcarthur has joined #openstack-infra | 11:05 | |
*** ijw has joined #openstack-infra | 11:09 | |
*** jamesmcarthur has quit IRC | 11:09 | |
*** udesale has quit IRC | 11:12 | |
*** psachin has quit IRC | 11:13 | |
*** ijw has quit IRC | 11:13 | |
*** pcaruana has quit IRC | 11:15 | |
*** jamesmcarthur has joined #openstack-infra | 11:20 | |
*** jpena is now known as jpena|lunch | 11:21 | |
*** jamesmcarthur has quit IRC | 11:25 | |
*** psachin has joined #openstack-infra | 11:27 | |
*** olivier__ has quit IRC | 11:28 | |
*** olivierb has joined #openstack-infra | 11:29 | |
*** roman_g has quit IRC | 11:30 | |
*** oanson has quit IRC | 11:35 | |
*** jamesmcarthur has joined #openstack-infra | 11:36 | |
*** emerson has quit IRC | 11:39 | |
*** jamesmcarthur has quit IRC | 11:41 | |
*** emerson has joined #openstack-infra | 11:49 | |
*** roman_g has joined #openstack-infra | 11:49 | |
*** jamesmcarthur has joined #openstack-infra | 11:52 | |
*** e0ne has joined #openstack-infra | 11:53 | |
*** alexchadin has joined #openstack-infra | 11:57 | |
*** jamesmcarthur has quit IRC | 11:57 | |
*** rh-jelabarre has joined #openstack-infra | 11:58 | |
*** dpawlik has quit IRC | 11:58 | |
*** trown|outtypewww is now known as trown | 11:59 | |
*** jamesmcarthur has joined #openstack-infra | 12:08 | |
*** kgiusti has joined #openstack-infra | 12:11 | |
*** apetrich has joined #openstack-infra | 12:11 | |
*** jamesmcarthur has quit IRC | 12:13 | |
*** ijw has joined #openstack-infra | 12:15 | |
*** jamesmcarthur has joined #openstack-infra | 12:15 | |
*** alexchadin has quit IRC | 12:17 | |
*** dtantsur is now known as dtantsur|brb | 12:18 | |
*** rlandy has joined #openstack-infra | 12:18 | |
*** shardy has joined #openstack-infra | 12:19 | |
*** ijw has quit IRC | 12:19 | |
*** quiquell|rover is now known as quique|rover|lch | 12:20 | |
*** panda|off is now known as panda | 12:21 | |
*** agopi has quit IRC | 12:24 | |
*** alexchadin has joined #openstack-infra | 12:25 | |
*** jpena|lunch is now known as jpena | 12:28 | |
*** udesale has joined #openstack-infra | 12:30 | |
*** jamesmcarthur has quit IRC | 12:31 | |
*** yamamoto has quit IRC | 12:35 | |
AJaeger | fungi, do we need groups always when storyboard is used for new projects? See https://docs.openstack.org/infra/manual/creators.html#add-the-project-to-the-master-projects-list and https://review.openstack.org/#/c/605193/1/gerrit/projects.yaml , please | 12:40 |
*** psachin has quit IRC | 12:41 | |
*** quique|rover|lch is now known as quiquell|rover | 12:44 | |
*** jamesmcarthur has joined #openstack-infra | 12:46 | |
dmsimard | Just sharing: Ansible to adopt molecule and ansible-lint projects, https://groups.google.com/forum/m/#!topic/ansible-project/ehrb6AEptzA | 12:48 |
*** jcoufal has joined #openstack-infra | 12:49 | |
*** jamesmcarthur has quit IRC | 12:52 | |
*** janki has joined #openstack-infra | 12:55 | |
*** jamesmcarthur has joined #openstack-infra | 12:55 | |
fungi | AJaeger: groups are a convenience option, not mandatory. if a team only has one project then a group may not be warranted | 12:55 |
fungi | if a team has more than one project, they may want to put them in a project group together so they can query them as a set | 12:55 |
*** boden has joined #openstack-infra | 12:58 | |
*** Guest42266 is now known as florianf | 13:01 | |
fungi | our creators guide also doesn't suggest they're always needed, simply explains what they're used for | 13:02 |
*** gfidente has quit IRC | 13:02 | |
*** alexchadin has quit IRC | 13:02 | |
*** mriedem has joined #openstack-infra | 13:03 | |
AJaeger | fungi: ah, ok - thanks | 13:07 |
*** sshnaidm is now known as sshnaidm|mtg | 13:07 | |
*** yamamoto has joined #openstack-infra | 13:13 | |
*** alexchadin has joined #openstack-infra | 13:17 | |
*** ijw has joined #openstack-infra | 13:17 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 13:18 |
openstackgerrit | Merged openstack-infra/project-config master: Add new project: sardonic https://review.openstack.org/605193 | 13:20 |
*** yamamoto has quit IRC | 13:21 | |
*** yamamoto has joined #openstack-infra | 13:21 | |
*** ijw has quit IRC | 13:22 | |
dulek | Everything okay with http://zuul.openstack.org/status.html ? Running times (especially in post) seem a bit scary? | 13:22 |
fungi | dulek: http://lists.openstack.org/pipermail/openstack-dev/2018-September/134867.html | 13:26 |
*** chkumar|ruck is now known as chandankumar | 13:26 | |
dulek | fungi: Thanks! | 13:26 |
fungi | we're back up to capacity now but i think various openstack bugs are still causing a lot of churn in the gate pipeline starving other lower-priority pipelines | 13:26 |
openstackgerrit | Merged openstack-infra/irc-meetings master: Change congress meeting time https://review.openstack.org/605274 | 13:26 |
*** mrsoul has quit IRC | 13:26 | |
fungi | helping the community/qa team identify and fix bugs is probably the best place to focus on improving it | 13:27 |
AJaeger | fungi, should we kill the periodic jobs? We haven't run any of them for two days now... | 13:27 |
*** agopi has joined #openstack-infra | 13:27 | |
*** smarcet has joined #openstack-infra | 13:27 | |
fungi | they won't really use that much capacity once they do finally run | 13:29 |
dulek | fungi: I'll prioritize looking at kuryr-kubernetes CI flakiness then. Thanks for info! | 13:29 |
AJaeger | fungi, I fear we run them only at the weekend... | 13:32 |
*** panda has quit IRC | 13:32 | |
*** dtantsur|brb is now known as dtantsur | 13:33 | |
AJaeger | fungi: oh, we have them only once in the queue - not multiple times as I feared (if I interpret zuul.o.o correctly) - then this is fine. | 13:33 |
*** alexchadin has quit IRC | 13:44 | |
*** lbragstad has quit IRC | 13:45 | |
*** slaweq has quit IRC | 13:45 | |
*** alexchadin has joined #openstack-infra | 13:46 | |
*** jamesmcarthur has quit IRC | 13:49 | |
*** gfidente has joined #openstack-infra | 13:50 | |
*** lbragstad has joined #openstack-infra | 13:50 | |
*** alexchadin has quit IRC | 13:50 | |
*** jamesmcarthur has joined #openstack-infra | 13:51 | |
*** sthussey has joined #openstack-infra | 13:52 | |
*** ijw has joined #openstack-infra | 13:53 | |
*** jamesmcarthur has quit IRC | 13:56 | |
*** ijw has quit IRC | 13:58 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 13:58 |
fungi | yeah, if memory serves zuul doesn't enqueue multiples in periodic | 13:59 |
mordred | fungi: I think it will - but we could discuss changing its pipeline manager to supercedent so that it wouldn't | 14:01 |
mordred | also - good morning | 14:02 |
AJaeger | mordred: I'm fine with not enqueueing multiples ;) | 14:02 |
AJaeger | good morning, mordred | 14:02 |
*** yamamoto has quit IRC | 14:03 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787 | 14:03 |
chandankumar | fungi: AJaeger mordred https://review.openstack.org/#/c/605096/ please have a look , thanks :-) | 14:03 |
*** yamamoto has joined #openstack-infra | 14:04 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Add an element to configure iBFT network interfaces https://review.openstack.org/391787 | 14:07 |
*** bobh has joined #openstack-infra | 14:08 | |
*** yamamoto has quit IRC | 14:09 | |
mordred | dtantsur: ^^ wow, that patch is old | 14:10 |
dtantsur | mordred: yeah, I hoped that nobody would come again with this problem to me.. I was proved wrong. | 14:10 |
dtantsur | it's complicated by the fact that I have no idea what I'm typing :) | 14:11 |
*** jamesmcarthur has joined #openstack-infra | 14:12 | |
*** janki has quit IRC | 14:12 | |
*** bobh has quit IRC | 14:13 | |
*** udesale has quit IRC | 14:15 | |
*** jamesmcarthur has quit IRC | 14:16 | |
mordred | dtantsur: my favorite problems! | 14:18 |
*** olivierb has quit IRC | 14:18 | |
*** panda has joined #openstack-infra | 14:21 | |
mordred | chandankumar, fungi: commented on https://review.openstack.org/#/c/605096/ | 14:22 |
fungi | i thought we were running that in check/gate? | 14:23 |
*** olivierb has joined #openstack-infra | 14:23 | |
fungi | at least AJaeger had pointed to the fact that we're testing sdist and wheel builds now as part of the standard template for projects participating in release management | 14:24 |
AJaeger | mordred: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/jobs.yaml#n255 is the job | 14:26 |
AJaeger | mordred: part of publish-to-pypi template | 14:26 |
*** jamesmcarthur has joined #openstack-infra | 14:28 | |
*** eernst has joined #openstack-infra | 14:31 | |
*** jamesmcarthur has quit IRC | 14:32 | |
openstackgerrit | sebastian marcet proposed openstack-infra/openstackid-resources master: added new endpoint delete my presentation https://review.openstack.org/604130 | 14:32 |
njohnston | AJaeger: I have a quick question to clarify your feedback on https://review.openstack.org/#/c/605126/1/.zuul.yaml@a125 "remove these, they are not needed anymore now." Do you mean that the ansible playbooks are not consulted anymore for jobs depending on legacy-dsvm-base? | 14:33 |
*** bobh has joined #openstack-infra | 14:35 | |
AJaeger | njohnston: you remove the job using the roles, so remove the roles as well. | 14:38 |
*** armax has joined #openstack-infra | 14:39 | |
AJaeger | njohnston: I think you're confused by the diff ;) | 14:39 |
mordred | AJaeger, fungi: yes - but we also use those playbooks in the actual release job | 14:40 |
mordred | ah - wait | 14:41 |
mordred | playbooks/pti-python-tarball/check.yaml is the one that should get updated with the twine commands | 14:41 |
chandankumar | mordred: I will update that | 14:41 |
mordred | chandankumar: thanks! | 14:42 |
*** kashyap has left #openstack-infra | 14:42 | |
*** jamesmcarthur has joined #openstack-infra | 14:43 | |
*** Bhujay has quit IRC | 14:44 | |
jroll | hi friends, per https://review.openstack.org/#/c/605193/ could I please be added to the core and release groups? easiest to search for jim@jimrollenhagen.com https://review.openstack.org/#/admin/groups/1947,members https://review.openstack.org/#/admin/groups/1948,members | 14:45 |
njohnston | AJaeger: Ah! I thought you meant to remove the lines, not the files. *facepalm* Thanks for the clarification! | 14:46 |
AJaeger | njohnston: sorry for the confusion | 14:46 |
*** jamesmcarthur has quit IRC | 14:47 | |
*** yamamoto has joined #openstack-infra | 14:52 | |
pabelanger | mnaser: clarkb: sjc1 doesn't look happy right now: http://grafana.openstack.org/dashboard/db/nodepool-vexxhost | 14:55 |
mnaser | i know it was unhappy yesterday | 14:55 |
mnaser | let me double check | 14:55 |
pabelanger | nothing in nodepool log except cannot create server | 14:56 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: rewrite interface in react https://review.openstack.org/591604 | 14:58 |
*** jamesmcarthur has joined #openstack-infra | 14:59 | |
*** pcaruana has joined #openstack-infra | 15:01 | |
*** jamesmcarthur has quit IRC | 15:03 | |
*** bobh has quit IRC | 15:03 | |
quiquell|rover | Hello any infra-root here ? | 15:11 |
quiquell|rover | We have a review to fix timeouts https://review.openstack.org/#/c/605377/ | 15:12 |
quiquell|rover | Need to go over the gates | 15:12 |
clarkb | quiquell|rover: you mean that change needs to be promoted to the top of the gate? | 15:13 |
*** jamesmcarthur has joined #openstack-infra | 15:14 | |
fungi | i've just about given up trying to correct people on terminology. everybody seems to want to call everything "gates" even when they mean "gating jobs" or "changes in the gate pipeline" or whatever | 15:14 |
quiquell|rover | Hello | 15:14 |
clarkb | quiquell|rover: are you asking us to promote that change to the top of the gate? | 15:15 |
clarkb | this will restart testing for all of the other changes in the tripleo gate pipeline. So want to be sure we are doing that intentionally before we do it | 15:16 |
jaosorior | clarkb: restarting the testing of the other patches is fine. The patch he's trying to promote is meant to help out with the timeouts. | 15:16 |
*** bobh has joined #openstack-infra | 15:17 | |
*** janki has joined #openstack-infra | 15:17 | |
clarkb | jaosorior: quiquell|rover another thing I notice when looking at this is that tripleo changes have multiple non voting jobs in the gate. Can we remove those from the gate since they are non voting? | 15:17 |
clarkb | will help get things through more quickly for you (fewer jobs to wait on) and gives more resources to other changes | 15:18 |
*** jamesmcarthur has quit IRC | 15:18 | |
*** aojea has quit IRC | 15:19 | |
jaosorior | clarkb: yes, we're removing those https://review.openstack.org/#/c/603419/ | 15:19 |
clarkb | ok I am going to enqueue ^ to the gate, then promote the first change to the top and the second behind that change | 15:21 |
quiquell|rover | clarkb: non voting jobs running in gates ? will look into that | 15:22 |
quiquell|rover | clarkb, jaosorior: thanks | 15:22 |
* prometheanfire doesn't like fedora-multinode :| | 15:23 | |
jaosorior | clarkb: thank you! | 15:24 |
clarkb | quiquell|rover: jaosorior it is done | 15:25 |
*** chandankumar is now known as chkumar|off | 15:26 | |
*** quiquell|rover is now known as quique|rover|off | 15:27 | |
jaosorior | yay :D | 15:27 |
*** quique|rover|off is now known as quique|off | 15:28 | |
mnaser | infra-root: is it ok to just delete a vm that nodepool is trying to launch if its stuck? | 15:29 |
*** sshnaidm|mtg is now known as sshnaidm | 15:29 | |
*** jamesmcarthur has joined #openstack-infra | 15:29 | |
mnaser | (on my side) | 15:29 |
clarkb | mnaser: it should be, nodepool won't use the VM unless ssh works. And if the api shows it as error'd out then it should handle that fine | 15:30 |
openstackgerrit | Merged openstack-infra/openstackid-resources master: added new endpoint delete my presentation https://review.openstack.org/604130 | 15:30 |
clarkb | mnaser: if the node completely disappears I think nodepool will treat that as an error on its side too, but if it doesn't we should add that functionality | 15:30 |
fungi | if nodepool is repeatedly issuing "delete" calls for it to the api and it suddenly disappears, i think nodepool will treat that like business as usual? | 15:31 |
clarkb | fungi: ya | 15:32 |
clarkb | I expect it will be fine as is | 15:32 |
*** dpawlik has joined #openstack-infra | 15:32 | |
smarcet | fungi: afternoon hope that everything is fine on your side :) i have an issue on openstackid zuul jobs, https://review.openstack.org/#/c/604172/ seems that it failed due to a temporary error in the script, could you re-trigger the job ? | 15:33 |
clarkb | smarcet: leave comment that says "recheck" and it will reenqueue for you | 15:33 |
smarcet | oh ok | 15:34 |
smarcet | thx u ! | 15:34 |
fungi | yep, that will cause jobs to rerun automatically | 15:34 |
*** jamesmcarthur has quit IRC | 15:34 | |
*** jtomasek has quit IRC | 15:34 | |
smarcet | fungi: clarkb: thx u 4 info :) | 15:35 |
*** dpawlik has quit IRC | 15:36 | |
*** janki has quit IRC | 15:36 | |
*** eernst has quit IRC | 15:36 | |
fungi | though it may take some time. we're under a bit of a backlog this week | 15:37 |
pabelanger | mnaser: thrashing in sjc1 seems to have stopped, guess you are still looking into it | 15:40 |
mnaser | yup.. working on it.. | 15:40 |
pabelanger | ++ | 15:40 |
* prometheanfire thinks the requirements proposal bot update is failing again | 15:41 | |
clarkb | prometheanfire: failing or not running because of the backlog? | 15:42 |
*** Tim_ok has joined #openstack-infra | 15:42 | |
prometheanfire | backlog is 5 hours, I think it's past that by a bit now (double) | 15:43 |
prometheanfire | and it's been a couple days I think | 15:43 |
clarkb | prometheanfire: the post backlog is 53 hours | 15:44 |
clarkb | it has a lower priority than check and gate | 15:44 |
prometheanfire | wat, wow, ok | 15:44 |
*** jtomasek has joined #openstack-infra | 15:44 | |
prometheanfire | guess I wait | 15:44 |
*** jamesmcarthur has joined #openstack-infra | 15:45 | |
*** dklyle has joined #openstack-infra | 15:45 | |
prometheanfire | is there just more activity now? | 15:45 |
*** Bhujay has joined #openstack-infra | 15:46 | |
pabelanger | provider issues i think, ovh looks to also be having an outage: http://grafana.openstack.org/dashboard/db/nodepool-ovh | 15:46 |
prometheanfire | k | 15:46 |
pabelanger | and packethost, some launch errors too: http://grafana.openstack.org/dashboard/db/nodepool-packethost | 15:46 |
clarkb | pabelanger: prometheanfire it's easy to blame provider issues but I think the vast majority of it is that we have a lot of unreliable tests | 15:47 |
pabelanger | so, less VMs to service jobs | 15:47 |
clarkb | tripleo timing out with a gate queue of like 50 changes | 15:47 |
clarkb | neutron functional doesn't work | 15:47 |
pabelanger | clarkb: yup, gate resets too | 15:47 |
clarkb | glance can't pass unittests | 15:47 |
clarkb | tempest also has problems | 15:47 |
prometheanfire | clarkb: ya, seen some of that | 15:47 |
pabelanger | sorry, wasn't blaming, just noting things adding to the backlog | 15:47 |
clarkb | pabelanger: I just want to get away from the attitude that it is someone elses problem | 15:47 |
clarkb | openstack testing is bad right now | 15:47 |
*** xyang has joined #openstack-infra | 15:48 | |
clarkb | and openstack should work to fix that | 15:48 |
fungi | horizon only just merged fixes for their completely broken gating jobs | 15:48 |
clarkb | (and not think infra will fix it by adding capacity) | 15:48 |
pabelanger | Yes, I think that is a fair statement | 15:48 |
clarkb | the packethost issue is likely leaked ports | 15:48 |
clarkb | we should see if someone from neutron land (slaweq maybe?) can sync up with studarus on debugging that | 15:49 |
fungi | do we yet know what's going on with ovh-bhs1? | 15:49 |
mnaser | hmm | 15:49 |
*** jamesmcarthur has quit IRC | 15:49 | |
mnaser | it looks like nodepool isnt issuing creates anymore | 15:50 |
clarkb | fungi: no that is new to me | 15:50 |
mnaser | after deleting those stuck instances | 15:50 |
tosky | at least it's a nice stress test for zuul | 15:50 |
* tosky hides | 15:50 | |
fungi | looks like things started going sideways in bhs1 around 08:30 utc | 15:51 |
mnaser | i see vms spawning again | 15:51 |
clarkb | {"forbidden": {"message": "The number of defined ports: 360 is over the limit: 300", "code": 403}} | 15:51 |
mnaser | so if someone can kick off nodepool | 15:52 |
mnaser | oh | 15:52 |
mnaser | is that in sjc1? | 15:52 |
clarkb | no that is ovh bhs1 sorry | 15:52 |
*** panda is now known as panda|bbl | 15:52 | |
mnaser | after deleting the vms that were stuck | 15:52 |
mnaser | nodepool hasnt issued any creates | 15:52 |
mnaser | but it's working ok now | 15:52 |
fungi | so neutron has likely leaked ports in ovh-bhs1 for some reason | 15:52 |
fungi | i can try to manually delete them | 15:52 |
clarkb | sure enough we have many DOWN ports there | 15:52 |
prometheanfire | fungi: they were waiting on reqs a bit for that (horizon) | 15:53 |
clarkb | fungi: care to only delete the ones that are DOWN | 15:53 |
prometheanfire | they had to cap something :( | 15:53 |
pabelanger | mnaser: clarkb: I can look at nodepool-launcher, 1 sec | 15:53 |
clarkb | I wonder if all the clouds have upgraded to a buggy version of neutron/nova and now we leak ports all over | 15:53 |
clarkb | we should have a port cleanup thing in nodepool though, let me see if we can run that in ovh | 15:53 |
clarkb | or maybe that is only FIPs | 15:54 |
*** kopecmartin|ruck has joined #openstack-infra | 15:55 | |
*** kopecmartin|ruck has left #openstack-infra | 15:55 | |
*** kopecmartin|ruck has joined #openstack-infra | 15:55 | |
fungi | deleting all the ports marked as DOWN now | 15:56 |
*** jamesmcarthur has joined #openstack-infra | 15:56 | |
*** eglute has joined #openstack-infra | 15:56 | |
fungi | starting out with 357 down | 15:56 |
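The cleanup being run here is essentially a loop of `openstack port delete` calls over every port in DOWN state. A roughly equivalent sketch using openstacksdk is below; the cloud name is an assumption, and this mirrors the manual loop rather than any built-in nodepool feature.

```python
import openstack

# Cloud name is an assumption; use whatever clouds.yaml entry points at OVH BHS1.
conn = openstack.connect(cloud='ovh-bhs1')

# Leaked ports show up with status DOWN once the server that owned them is gone.
down_ports = list(conn.network.ports(status='DOWN'))
print('found %d DOWN ports' % len(down_ports))

for port in down_ports:
    try:
        conn.network.delete_port(port, ignore_missing=True)
    except openstack.exceptions.SDKException as exc:
        # The neutron API here was intermittently answering with 504s, so log
        # the failure and keep going rather than aborting the whole loop.
        print('failed to delete %s: %s' % (port.id, exc))
```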
clarkb | fwiw this looks very similar to the problems we have in packethost | 15:57 |
fungi | might be a bigger issue there... my first `openstack port delete ...` in the loop is hanging | 15:57 |
*** jamesmcarthur has quit IRC | 15:57 | |
clarkb | hrm they don't hang in packethost, but they aren't very fast | 15:57 |
pabelanger | 2018-09-26 15:56:36,487 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-vexxhost-specific]: Declining node request 100-0006323378 because node type(s) [ubuntu-trusty] not available | 15:57 |
*** jamesmcarthur has joined #openstack-infra | 15:57 | |
pabelanger | is sjc1 not running all images? | 15:57 |
fungi | well, i say "hanging" but i don't know that for sure. it's only been ~60 seconds so far | 15:57 |
fungi | without returning | 15:57 |
clarkb | this might be one of those situations where we need to get neutron and nova and our clouds all talking together to figure out why this is painful all of a sudden | 15:58 |
clarkb | pabelanger: it should be but maybe the config there is buggy? | 15:58 |
fungi | ahh, i guess my deletes aren't hanging, they're just "slow" (for fairly extreme definitions of the word) | 15:59 |
fungi | we're down to 340 now | 15:59 |
clarkb | dpawlik, amorin, studarus, mlavalle get together in a room and fix neutron | 15:59 |
fungi | and the port delete command in osc doesn't provide any output, so i thought it was still on the first in the set | 16:00 |
*** dave-mccowan has joined #openstack-infra | 16:00 | |
clarkb | it looks like something in the background may clean them up in ovh based on usage graphs | 16:01 |
clarkb | we'll spike then go quiet then spike again | 16:01 |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: Simplify vexxhost nodepool configuration https://review.openstack.org/605469 | 16:01 |
pabelanger | mnaser: clarkb: fungi: not a fix, but should reduce some copypasta for vexxhost ^ | 16:02 |
pabelanger | 2018-09-26 15:56:30,317 INFO nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Not enough quota remaining to satisfy request 200-0006317635 | 16:02 |
pabelanger | seems nodepool thinks it is at quota for sjc1 | 16:03 |
ssbarnea | clarkb: can you help with https://review.openstack.org/#/c/603061/ ? it is already 8 days old and without extra pings I doubt it will get merged. thanks. | 16:03 |
pabelanger | 2018-09-26 15:56:30,317 DEBUG nodepool.driver.NodeRequestHandler[nl03-25326-PoolWorker.vexxhost-sjc1-main]: Current pool quota: {'compute': {'ram': inf, 'instances': 0, 'cores': inf}} | 16:03 |
pabelanger | don't know why that is 0 | 16:03 |
pabelanger | mnaser: the instances you deleted, were they in ERROR state? | 16:04 |
*** jamesmcarthur has quit IRC | 16:04 | |
*** jamesmcarthur has joined #openstack-infra | 16:05 | |
clarkb | pabelanger: it is calculating min(quota, max-servers) - number of servers running in nodepool | 16:07 |
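A minimal sketch of the arithmetic described here (not nodepool's actual implementation); the numbers are illustrative, but with 46 nodes already counted against the pool the remaining instance quota works out to 0, matching the `'instances': 0` line pasted above.

```python
def remaining_instances(cloud_quota, max_servers, servers_in_pool):
    # Usable headroom for a pool: min(cloud quota, max-servers) minus the
    # servers nodepool already counts against this pool.
    return min(cloud_quota, max_servers) - servers_in_pool

# Illustrative numbers only: 46 nodes recorded as building against a
# max-servers of 46 leaves nothing, hence "Not enough quota remaining".
print(remaining_instances(cloud_quota=50, max_servers=46, servers_in_pool=46))  # -> 0
```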
*** sthussey has quit IRC | 16:07 | |
frickler | jroll: seems your request got overlooked, done now | 16:08 |
*** dtantsur is now known as dtantsur|afk | 16:08 | |
pabelanger | clarkb: okay, nodepool thinks there are 46 nodes building | 16:09 |
pabelanger | let me check openstack api | 16:09 |
*** ramishra has quit IRC | 16:09 | |
*** noama has quit IRC | 16:10 | |
pabelanger | clarkb: mnaser: okay, I see movement now. I think we just needed to wait for the launch-timeout to trigger | 16:12 |
clarkb | amorin: if you are around, we've noticed that we appear to leak neutron ports in ovh bhs1 now. Manually deleting them appears to work. Thought you may find this information useful as you continue to operate newer openstack | 16:14 |
clarkb | amorin: let us know if we can provide additional information to help debug or understand the problem | 16:14 |
fungi | amorin: from our graphs, it looks like the leak may have started around 08:30 utc | 16:16 |
*** Emine has quit IRC | 16:16 | |
fungi | down port deletion is slowing considerably... i have a feeling we're continuing to leak new ports as we start to be able to boot new nodes after i delete previous ones | 16:17 |
*** florianf is now known as florianf|afk | 16:17 | |
clarkb | fungi: would not surprise me | 16:17 |
pabelanger | clarkb: mnaser: I see jobs running in sjc1 now | 16:17 |
pabelanger | thanks! | 16:17 |
clarkb | ssbarnea: we don't maintain stackalytics, it is a third party service | 16:18 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: WIP Extract pep8 messages for inline comments https://review.openstack.org/589634 | 16:18 |
clarkb | I do not know why I have approval rights to that repo, but it isn't something we support | 16:18 |
clarkb | fungi: I hate to suggest it but we could add a leaked port cleaner like we do for floating ips to nodepool | 16:19 |
*** tpsilva has joined #openstack-infra | 16:19 | |
clarkb | this specific type of problem seems important enough to want the clouds to address it though | 16:19 |
fungi | yeah, i've hit the inflection point in bhs1 now where the count of down ports is rising faster than i'm deleting them | 16:20 |
*** kopecmartin|ruck is now known as kopecmartin|off | 16:20 | |
pabelanger | openstack.exceptions.SDKException: Error in creating the server: Build of instance 305569df-0b6b-4da8-9e85-a2c2273e34a5 aborted: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota. Requested 80G, quota is 5120G and 5120G has been consumed. | 16:20 |
pabelanger | mnaser: think we might have leaked volumes ^ | 16:20 |
*** jpich has quit IRC | 16:22 | |
ssbarnea | clarkb: infra-core is still a member of stackalytics-core which makes me believe it is able to perform reviews in the absence of the main maintainers, right? | 16:23 |
clarkb | ssbarnea: yes I have +2 on the repo. I have no desire to use it | 16:23 |
clarkb | it isn't a service or a repo we support | 16:23 |
AJaeger | clarkb: infra-core is now in stackalytics group ;( | 16:26 |
clarkb | fungi: rereading ovh bhs1 graphs I don't think the port issue is the only problem? we don't seem to have ever transitioned to in use nodes | 16:27 |
*** che-arne has joined #openstack-infra | 16:27 | |
* clarkb looks at logs again | 16:27 | |
AJaeger | sorry for duplicate - too slow in reading scrollback | 16:27 |
clarkb | fungi: or maybe we never got it below the 300 magic number? | 16:27 |
fungi | possible | 16:28 |
jroll | frickler: thanks! | 16:28 |
ssbarnea | clarkb: sure. i was curious more about the process in general. what happens when a project has only one or two people in the core team and none of them is available any more? may not be the case here, but should CRs be stalling forever? | 16:28 |
fungi | clarkb: i mean, my delete loop is still going but we're back up above 300 again as of a few minutes ago | 16:28 |
clarkb | Error in creating the server: Exceeded maximum number of retries. Exceeded max scheduling attempts 6 for instance 8f95c8e4-e70a-4744-b87c-ae7c6cdc57cd. Last exception: Maximum number of ports exceeded | 16:29 |
clarkb | seems we transitioned to that after some time | 16:29 |
fungi | ssbarnea: if it's an official openstack project the answer is different than for an unofficial project | 16:29 |
*** Bhujay has quit IRC | 16:29 | |
fungi | for official openstack projects the team in charge of it can be required by the tc to add more reviewers or retire that deliverable | 16:30 |
fungi | as stackalytics is not and never was an official openstack project, the tc has no jurisdiction over it | 16:30 |
ssbarnea | indeed, i was referring more to unofficial/side/infra projects because this is the only place where I encounter this dilemma; on official ones it is (kinda) easy to find someone. | 16:30 |
clarkb | ssbarnea: in this case there have been multiple discussions of various groups taking it over then no one does | 16:31 |
ssbarnea | haha, indeed, a good description of reality :D | 16:31 |
clarkb | it was a mirantis project and continues to be one as far as I know | 16:32 |
clarkb | if there is renewed interest in taking it over I would reach out to the previous maintainers and see if they will hand over the reins | 16:32 |
dmsimard | mnaser: looks like we're running some nodes on sjc1 again, thanks | 16:32 |
ssbarnea | i will send another email to the two committers, at some point they will find my email, hopefully. | 16:33 |
fungi | we've had contributors in the past interested in (and even getting pretty far on) making various improvements to stackalytics like fixing the persistent analysis store so it doesn't go offline for hours when you need to restart it or ripping out the affiliation static config in favor of querying the foundation profile api | 16:33 |
fungi | but the team in charge of the project and running the server for it seem to lack the bandwidth for such contribution (or even to discuss it) | 16:34 |
*** panda|bbl is now known as panda | 16:35 | |
clarkb | fungi: I added nl04 to the emergency file | 16:35 |
corvus | a minor correction: *infra* projects *are* official openstack projects :) | 16:35 |
clarkb | puppet just ran on it ~4 minutes ago so I think I am safe to go ahead and set max-servers to zero in bhs1 | 16:35 |
clarkb | then we can rerun the port cleanup. Set max-servers to ~5 and see if it ends up in a happier place or not | 16:36 |
ssbarnea | corvus: with the side note that not everything under infra/ in gerrit is an infra project :D | 16:36 |
fungi | well, not everything under openstack/ in gerrit is an official openstack project either | 16:36 |
clarkb | fungi: ok max-servers is set to 0 | 16:36 |
ssbarnea | now, I want to create a new infra project, probably named openstack-helpers, that would be used to host multiple greasemonkey scripts that are useful for openstack devs. i did read the (very) long docs page but it is not clear to me who creates the repository in the first place. | 16:36 |
fungi | ssbarnea: a script creates the repository when it sees a new entry appear in the gerrit/projects.yaml file in openstack-infra/project-config | 16:37 |
fungi | automation | 16:37 |
clarkb | I'm going to grab breakfast while waiting on that port cleanup to happen | 16:37 |
clarkb | fungi: ^ assuming it is still running? | 16:38 |
fungi | it just finished. i can start another pass | 16:38 |
clarkb | please do | 16:38 |
openstackgerrit | James E. Blair proposed openstack-infra/project-config master: Add zone-opendev.org project https://review.openstack.org/605095 | 16:38 |
fungi | we're currently at 295 leaked, so it may be cleaning itself up? | 16:38 |
*** jamesmcarthur has quit IRC | 16:38 | |
clarkb | fungi: do we want to wait ~10 minutes and see if that number moves? | 16:39 |
fungi | i'll give it a few, yes | 16:39 |
clarkb | max-servers is set to 0 so any servers we are using should clean up (and their ports too I hope) | 16:39 |
ssbarnea | fungi: thanks for this magic hint! i made a note. Before making the CR, does anyone have something against using the "openstack-helpers" name? I could go for "monkeys" if you don't like it. | 16:39 |
fungi | and then start deleting if it doesn't seem to be doing the cleanup on its own | 16:39 |
fungi | openstack-monkeys seems mildly offensive | 16:40 |
fungi | (or could be taken that way) | 16:40 |
*** trown is now known as trown|lunch | 16:40 | |
AJaeger | ssbarnea: why not remove openstack from the name? Or use "os"? | 16:40 |
mordred | os-greasemonkey would be descriptive | 16:41 |
ssbarnea | i do not want to use the full "grease*" name because it refers to a specific extension and there are multiple ones: tampermonkey, greasemonkey, and even others with different names. | 16:42 |
mordred | I did not know that - nod | 16:42 |
ssbarnea | that is why i was considering "helpers" as a more multi-purpose name to use that is not directly linked to a browser extension. | 16:42 |
fungi | browser-helpers? | 16:43 |
fungi | or are they useful outside a web browser context? | 16:43 |
*** e0ne has quit IRC | 16:43 | |
*** jamesmcarthur has joined #openstack-infra | 16:44 | |
ssbarnea | fungi: yep, for the moment web-helpers would be ok but i have some regex patterns inside that are used to highlight console logs. Still, in the future I want to also make a CLI tool to parse logs, one that uses the same patterns as the web extension ... this would make the "web" part confusing in the repo name. | 16:45 |
ssbarnea | i find the inclusion of web in the repo name ... limiting. | 16:45 |
*** dave-mccowan has quit IRC | 16:49 | |
*** rkukura has joined #openstack-infra | 16:50 | |
corvus | you could call it "ssbarnea's scripts, browser, and regex network emporium area". maybe abbreviate that somehow. | 16:50 |
fungi | yeah, useful names for catch-all repos are hard to come up with | 16:50 |
*** evrardjp has quit IRC | 16:50 | |
*** ginopc has quit IRC | 16:51 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate openstack* to github https://review.openstack.org/605486 | 16:53 |
*** olivierb has quit IRC | 16:54 | |
*** gfidente has quit IRC | 16:55 | |
mordred | corvus, fungi: I'm pretty sure that's all we need to let the zone-opendev.org be in opendev/ instead of openstack-infra, should we choose to do such a thing | 16:56 |
corvus | mordred: sounds reasonable; clarkb, mordred, fungi: should i rework that back to opendev/ ? or leave it to be (presumably) moved with the rest? | 16:57 |
fungi | reviewing | 16:57 |
mordred | https://gerrit.googlesource.com/plugins/replication/+doc/master/src/main/resources/Documentation/config.md <-- is what I was looking at - search for "remote.NAME.projects" | 16:58 |
mordred | we could alternately do [ | 16:58 |
mordred | we could alternately do ['openstack/*', 'openstack-dev/*', 'openstack-infra/*'] to be clearer | 16:59 |
clarkb | should double check against our gerrit docs | 16:59 |
*** rkukura has quit IRC | 16:59 | |
mordred | clarkb: I tried looking at the docs for our gerrit - but with them having been moved to plugins ... | 16:59 |
pabelanger | clarkb: I've deleted the leaked volumes in sjc1, but think mnaser will need to debug the openstack side, some are stuck deleting. Think I'll work on nodepool at ansiblefest to also try and clean up leaked volumes, I can see there is metadata in the volumes for nodepool_build_id | 16:59 |
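A hedged sketch of the kind of volume cleanup described here: list volumes, look for nodepool metadata (the nodepool_build_id key mentioned above), and flag unattached "available" ones as leak candidates. The cloud name is an assumption and the actual delete is left commented out.

```python
import openstack

# Cloud name is an assumption for the vexxhost sjc1 region.
conn = openstack.connect(cloud='vexxhost-sjc1')

for volume in conn.block_storage.volumes(details=True):
    meta = volume.metadata or {}
    # A volume carrying nodepool metadata that is available and unattached
    # is a likely leak candidate.
    if 'nodepool_build_id' in meta and volume.status == 'available' and not volume.attachments:
        print('possible leaked volume %s (build %s)' % (volume.id, meta['nodepool_build_id']))
        # conn.block_storage.delete_volume(volume)  # uncomment to actually delete
```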
mordred | clarkb: I got to that doc from https://review.openstack.org/Documentation/config-plugins.html#replication | 16:59 |
clarkb | ah | 16:59 |
mordred | clarkb: https://gerrit.googlesource.com/plugins/replication/+/stable-2.13/src/main/resources/Documentation/config.md | 17:01 |
mordred | there we go - there's the 2.13 docs - and it saysthe same thing | 17:01 |
*** derekh has quit IRC | 17:01 | |
fungi | so the leaked ports in bhs1 are still dropping, but slower than when i was deleting them i think. i'll go ahead and augment it with a loop of explicit deletes to speed things up further there | 17:01 |
fungi | we're down to 288 leaked so far | 17:01 |
mordred | fungi: glorious | 17:02 |
*** fuentess has joined #openstack-infra | 17:03 | |
mordred | clarkb: so I think if you're ok with that - it's got 2x+2 | 17:06 |
clarkb | do we want to test it on review-dev first? will require a gerrit restart too iirc | 17:07 |
mordred | clarkb: good points both of those | 17:07 |
mordred | clarkb: lemme make a review-dev patch | 17:07 |
*** jpena is now known as jpena|off | 17:09 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate openstack namespaces to github https://review.openstack.org/605486 | 17:11 |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Only replicate gtest-org and kdc https://review.openstack.org/605490 | 17:11 |
*** slaweq has joined #openstack-infra | 17:11 | |
mordred | clarkb, corvus, fungi: ^^ there ya go | 17:11 |
*** slaweq has quit IRC | 17:15 | |
*** dpawlik has joined #openstack-infra | 17:16 | |
*** dpawlik has quit IRC | 17:16 | |
*** jamesmcarthur has quit IRC | 17:16 | |
*** dpawlik has joined #openstack-infra | 17:17 | |
clarkb | fungi: only down to 212 ports so far | 17:18 |
fungi | yeah, but also hitting some like | 17:19 |
fungi | Failed to delete port with name or ID '3c864749-1664-4af9-8aab-d6dacaba24a4': HttpException: 504: Server Error for url: https://network.compute.bhs1.cloud.ovh.net/v2.0/ports/3c864749-1664-4af9-8aab-d6dacaba24a4, <html><body><h1>504 Gateway Time-out</h1>The server didn't respond in time.</body></html> 1 of 1 ports failed to delete. | 17:19 |
*** shardy has quit IRC | 17:20 | |
clarkb | I would not be surprised if that is part of why we are leaking them in the first place | 17:20 |
clarkb | nova asks neutron to delete them, neutron 504s, and server is deleted now we have a leaked port | 17:20 |
fungi | sounds remarkably familiar | 17:20 |
* fungi has an overwhelming sense of deja vu | 17:21 | |
*** jamesmcarthur has joined #openstack-infra | 17:21 | |
*** harlowja has joined #openstack-infra | 17:23 | |
*** rkukura has joined #openstack-infra | 17:26 | |
clarkb | ya I want to say we saw this issue with the gate and tempest | 17:28 |
clarkb | and initially a lot of the blame was pointed at the apache proxy that was terminating tls | 17:28 |
clarkb | I don't know if it was ever fixed though | 17:28 |
clarkb | oh actually it was a client thing with holding connections open | 17:28 |
fungi | down to 205 leaked ports in ovh-bhs1 now but it's been stuck there for a few minutes | 17:28 |
clarkb | apache by default allows for connections to be reused | 17:29 |
clarkb | python requests is buggy in the situation where the server closes a connection but the client races trying to reuse it for a new request | 17:29 |
clarkb | we fixed it by telling requests to use a new connection each request iirc | 17:29 |
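For context, one common way to make python-requests open a fresh connection per request is to send a `Connection: close` header, as sketched below; this illustrates the general workaround rather than the exact fix that was applied at the time.

```python
import requests

# Ask both ends not to keep the connection alive, so every request opens a
# new TCP connection instead of racing to reuse one the server may have
# already timed out.
session = requests.Session()
session.headers['Connection'] = 'close'

resp = session.get('https://example.com/')
print(resp.status_code)
```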
clarkb | fungi: now 204 | 17:30 |
fungi | indeed | 17:30 |
clarkb | and now 203. Not very quick. That could also explain the leaks (not being able to delete fast enough) | 17:33 |
fungi | yeah, i'm not seeing too many timeouts | 17:34 |
mordred | clarkb: amusingly enough I recently set session.keep_alive = False in the openstacksdk functional test suite because of tons of log spam due to "dropped connection, retrying" | 17:34 |
fungi | i have a feeling it's more that we're recycling instance quota faster than the neutron backend can clean up | 17:34 |
clarkb | mordred: ya python requests isn't great about handling when those keep-alive connections are killed due to timeouts | 17:35 |
clarkb | fungi: could be | 17:36 |
*** annp has quit IRC | 17:39 | |
fungi | getting soooo slooooow | 17:41 |
fungi | 200 now | 17:41 |
fungi | only hit 5 timeouts so far | 17:42 |
pabelanger | is it a load issue on nodes? | 17:42 |
AJaeger | config-core, two quick cleanup reviews, please: https://review.openstack.org/605076 https://review.openstack.org/605344 | 17:42 |
fungi | pabelanger: not sure what you're asking about | 17:42 |
pabelanger | we had a neutron issue in infra-cloud when CPU was pinned converting qcow2 images to raw | 17:42 |
pabelanger | fungi: sorry, just jumping in, was asking if the requests were slow due to the remote node not responding fast enough | 17:43 |
fungi | and that caused port deletion to be slow? | 17:43 |
pabelanger | fungi: creation | 17:43 |
fungi | this is just leaked ports. trying to remove them | 17:43 |
pabelanger | VMs would timeout on network getting created, and fail to boot because of it | 17:43 |
*** trown|lunch is now known as trown | 17:44 | |
clarkb | hand-wavy guess: similar to how we have to force dhcp in OVH because the neutron config isn't actually meant to be used, when cleaning up network related resources neutron is talking to something external to do the cleanups and this is slow | 17:46 |
fungi | yeah, probably | 17:47 |
*** dpawlik has quit IRC | 17:47 | |
clarkb | nova/neutron may actually have all of the port deletes queued up they just don't happen very fast | 17:49 |
clarkb | then on top of that somehow we ended up above the quota limit of 300 so when things caught up a little we still weren't able to boot new instances | 17:50 |
clarkb | if we can get it to zero we can bump max-servers to say 5 and see if we leak again | 17:51 |
fungi | yeah, it's just not happening any time soon | 17:53 |
fungi | 196 ports left to delete | 17:53 |
Shrews | that's, like, painfully slow | 17:54 |
fungi | we're up to 9 deletion timeouts now | 17:56 |
*** electrofelix has quit IRC | 17:56 | |
clarkb | out of ~100 ? | 17:56 |
openstackgerrit | Merged openstack-infra/project-config master: Move glare legacy jobs in-repo https://review.openstack.org/605076 | 17:57 |
fungi | yeah, something like that | 17:57 |
fungi | so maybe 10% | 17:57 |
mnaser | http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-3h&to=now fyi i notice only 15 nodes used in sjc1? do we know why? | 18:03 |
*** jcoufal has quit IRC | 18:03 | |
clarkb | mnaser: possibly the volume leaks pabelanger was talking about? | 18:04 |
openstackgerrit | sebastian marcet proposed openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow https://review.openstack.org/605497 | 18:04 |
clarkb | mnaser: apparently some of them are stuck in deleting | 18:04 |
mnaser | oh | 18:04 |
mnaser | let me see | 18:04 |
mnaser | deleting seems to fluctuate up and down | 18:04 |
mnaser | "Requested 80G, quota is 5120G and 5120G has been consumed." ok cool let me investigate | 18:04 |
mnaser | ok, let me see | 18:05 |
*** graphene has quit IRC | 18:07 | |
openstackgerrit | Merged openstack-infra/openstackid-resources master: Fixed bugs on Submit presentation flow https://review.openstack.org/605497 | 18:10 |
* mnaser is in a call, will review shortly | 18:10 | |
*** slaweq has joined #openstack-infra | 18:12 | |
clarkb | fungi: now down to 76 (seems like it sped up) | 18:13 |
clarkb | 68 | 18:13 |
fungi | weird | 18:16 |
fungi | i wonder if they're working on it | 18:16 |
fungi | 33 now | 18:16 |
*** diablo_rojo has joined #openstack-infra | 18:18 | |
fungi | and now 0 | 18:19 |
mordred | yay! | 18:19 |
fungi | i guess let's try to start ramping it back up again, but i have a feeling the deletion speedup coincided with them fixing some problem in their backend | 18:19 |
clarkb | ok I'm going to set max-servers to 5 | 18:20 |
prometheanfire | lol | 18:20 |
clarkb | and we can watch if we leak from there | 18:20 |
*** dpawlik has joined #openstack-infra | 18:21 | |
fungi | wfm | 18:21 |
*** slaweq has quit IRC | 18:22 | |
clarkb | Shrews: following up on zuul potentially leaking nodes in nodepool (locked for ~14 days) any idea on how to clear those out? | 18:22 |
clarkb | Shrews: 0001950609 0001975058 0001975067 are the three nodes if you want to take a look | 18:22 |
Shrews | clarkb: would require a zuul restart to delete the locks | 18:23 |
Shrews | clarkb: but let me poke around zk a bit | 18:23 |
clarkb | Shrews: could we manually delete the lock? | 18:24 |
*** pcaruana has quit IRC | 18:24 | |
Shrews | clarkb: we'd have to manually delete zk nodes. i'm hesitant to do that | 18:24 |
*** lbragstad has quit IRC | 18:24 | |
clarkb | Shrews: ok | 18:25 |
Shrews | but i guess we could. they'd be seen as leaked nodes to nodepool. not sure what the zuul impact would be | 18:25 |
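A hedged sketch of how one might inspect (not delete) the lock znodes being discussed, using kazoo; the ZooKeeper host and the `/nodepool/nodes/<id>/lock` path layout are assumptions and should be verified against the running cluster before acting on anything.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk01.openstack.org:2181')  # hostname is illustrative
zk.start()

for node_id in ('0001950609', '0001975058', '0001975067'):
    lock_path = '/nodepool/nodes/%s/lock' % node_id  # assumed layout
    if zk.exists(lock_path):
        # The children of the lock znode are the current lock contenders.
        print(node_id, 'lock held by', zk.get_children(lock_path))
    else:
        print(node_id, 'is not locked')

zk.stop()
```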
clarkb | fungi: due to the 3 ready nodes that zuul has locked (leaked above) max-servers 5 really means only 2 new instances. I am going to bump to 8 to get 5 | 18:25 |
*** dpawlik has quit IRC | 18:25 | |
*** lbragstad has joined #openstack-infra | 18:25 | |
fungi | k | 18:26 |
*** yamamoto has quit IRC | 18:26 | |
*** yamamoto has joined #openstack-infra | 18:26 | |
Shrews | clarkb: confirmed zuul is holding the locks :( | 18:31 |
clarkb | Shrews: zuul should handle if the znode goes away though right? | 18:35 |
clarkb | | fault | {'message': 'Build of instance 960aff55-3795-43cb-ad73-e58816444355 aborted: Failed to allocate the network(s), not rescheduling.', 'code': 500, 'created': '2018-09-26T18:36:13Z'} one of the nodes failed to build | 18:37 |
clarkb | seems potentially related to our inability to clean up ports | 18:37 |
Shrews | clarkb: i'm not sure how zuul would handle that | 18:41 |
Shrews | we expect some zk nodes to disappear, but a Node isn't one of them | 18:41 |
Shrews | good question for corvus | 18:42 |
mordred | corvus knows everything | 18:42 |
*** jamesmcarthur has quit IRC | 18:42 | |
Shrews | we need a zuul restart at some point anyway for some fixes mentioned earlier (yesterday?). perhaps we should just schedule a time for that | 18:44 |
Shrews | that sql optimization at least | 18:45 |
clarkb | that probably depends on whether or not we will declare bankruptcy on the backlog or not | 18:45 |
clarkb | fungi: fwiw it seems that some nodes error as above and others are just really slow to boot. None have successfully booted yet | 18:46 |
corvus | with those node ids, we can trace them in the zuul logs and figure out why they leaked | 18:46 |
corvus | that should not preclude us restarting either nodepool or zuul whenever we wish | 18:47 |
*** jamesmcarthur has joined #openstack-infra | 18:47 | |
corvus | however, i'm in need of a sandwich so am not going to trace them now :) | 18:47 |
clarkb | fungi: I'll let it go a little longer but the oldest node i was watching was deleted by nodepool. Doesn't appear to have errored just taken too long | 18:50 |
fungi | okay, so may be that bhs1 is still just plain unusable at the moment | 18:50 |
clarkb | http://logs.openstack.org/25/604925/2/check/system-config-run-base/3a474c9/job-output.txt.gz#_2018-09-25_01_31_28_665890 is a fun ci bug. I think that happens because I am trying to change the uid of the zuul user on the test nodes | 18:54 |
clarkb | mordred: corvus ^ thoughts on creating a zuulcd user instead? | 18:54 |
mordred | clarkb: oh. HAH | 18:54 |
clarkb | then zuul the test node user can coexist with zuulcd the cd user | 18:54 |
mordred | clarkb: or else - maybe make the creation use the same uid for nodepool and prod and make the creation idempotent? | 18:54 |
mordred | I haven't thought long enough to have thoughts of whether that's a bad idea or not | 18:55 |
clarkb | mordred: ya I think we could set the uid to 1000? I'm not sure how difficult it will be to keep that in sync over time | 18:55 |
*** jcoufal has joined #openstack-infra | 18:55 | |
clarkb | (I did the last uid + 1 process for normal users which resulted in 203X) | 18:56 |
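A minimal sketch of the uid-pinning option (the zuulcd rename would sidestep the collision entirely); uid 1000 is the value mentioned above for the test images, and the existence check is only there to keep the creation idempotent:

```bash
if ! getent passwd zuul >/dev/null; then
  sudo useradd --uid 1000 --create-home --shell /bin/bash zuul
elif [ "$(id -u zuul)" != "1000" ]; then
  # roughly the situation in the CI failure linked above: the user already
  # exists with a different uid, so don't try to change it in place
  echo "zuul already exists with uid $(id -u zuul)" >&2
fi
```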
clarkb | fungi: ya I'm going to set max-servers back to 0, this isn't making any progress | 18:57 |
*** smarcet has quit IRC | 18:58 | |
clarkb | the gate just reset again | 18:58 |
clarkb | I think I need to step away from the computer for a bit | 18:58 |
* fungi is struggling to whittle down a forum session proposal to fit in the requisite 1k character matchbox | 18:59 | |
clarkb | fungi: you get three of those boxes :) | 19:00 |
fungi | yeah, but only one goes on the schedule | 19:01 |
fungi | (i think?) | 19:01 |
mordred | fungi: "gonna talk about stuff" | 19:02 |
mordred | fungi: who needs more words than that? | 19:02 |
fungi | "i will wear a funny shirt and make people talk to each other" | 19:03 |
fungi | perfect! | 19:03 |
clarkb | "if this session gets enough upvotes I will wear the orange cantina shirt" | 19:04 |
*** rtjure has quit IRC | 19:05 | |
*** dave-mccowan has joined #openstack-infra | 19:05 | |
AJaeger | "And the green one if not" - give us a choice to vote for one ;) | 19:05 |
fungi | grr... i'm still 95 characters over | 19:06 |
AJaeger | Remove all punctations ;) | 19:07 |
clarkb | the down ports list in ovh bhs1 fell to 1 after setting max servers to 0 | 19:07 |
*** rtjure has joined #openstack-infra | 19:08 | |
clarkb | jaosorior: https://review.openstack.org/#/c/603419/ fyi that didn't pass (it also failed the rdo third party check) | 19:12 |
*** e0ne has joined #openstack-infra | 19:15 | |
*** jcoufal has quit IRC | 19:21 | |
*** Tim_ok has quit IRC | 19:24 | |
*** slaweq has joined #openstack-infra | 19:24 | |
*** Emine has joined #openstack-infra | 19:26 | |
fuentess | clarkb: hi, quick question? what is the best way to know if I'm running under a Zuul slave? is there any environment variable that I can check? | 19:27 |
*** e0ne has quit IRC | 19:30 | |
fungi | fuentess: jobs can set any environment variables they like when executing a shell or command process... can you be more specific about what you're trying to do and where? | 19:31 |
mordred | yah - there is a zuul ansible host_var your playbooks can do things with | 19:32 |
*** jamesmcarthur has quit IRC | 19:32 | |
mordred | but once you're out of ansible and into some shell that the ansible has spawned, that's all job specific | 19:32 |
fuentess | fungi, mordred: I will add a change in one of our scripts to run the cri-o tests using overlay (instead of devicemapper) when in a Zuul slave, so I want to check using something like: if [ "$ZUUL_CI" == true ] | 19:35 |
mordred | yeah - in order for that to work, we'd need to update the job that calls your script to set a variable like ZUUL_CI | 19:35 |
mordred | alternately, maybe we should just have the zuul job pass something like --overlay to the script? | 19:36 |
fuentess | adding it here, right? https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/kata-setup/tasks/main.yaml#n37 | 19:36 |
Shrews | corvus: attempted to trace node 0001975067 through zuul logs (debug.log.14.gz, btw). For that request (200-0006054530), I see that it gets completed, but the nodes are never set in-use (i.e., no "Setting nodeset ... in use" log entry) | 19:37 |
slaweq | clarkb: mordred: with big help from both of You I finally managed to do patch to migrate dvr multinode job in neutron to zuulv3 syntax: https://review.openstack.org/#/c/578796/ - thx a lot guys :) | 19:37 |
Shrews | corvus: i'm a bit stumped as to why | 19:37 |
mordred | fuentess: yes- and it looks like CI=true is already being set there | 19:37 |
mordred | fuentess: so you could just add another line just like that | 19:38 |
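On the script side, the check fuentess describes would look something like this (ZUUL_CI is the variable name proposed above and only works if the kata-setup role exports it next to the existing CI=true; the storage_driver variable is purely illustrative):

```bash
if [ "${ZUUL_CI:-false}" = "true" ]; then
    storage_driver="overlay"
else
    storage_driver="devicemapper"
fi
echo "cri-o tests will use storage driver: ${storage_driver}"
```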
mordred | slaweq: \o/ woohoo! | 19:38 |
fuentess | mordred: cool, I am not sure if I can submit changes there... or how can I do it, any guidance? | 19:39 |
mordred | fuentess: you totally can - have you submitted patches to the openstack gerrit before, or will this be your first one? | 19:40 |
fuentess | mordred: this will be my first one | 19:40 |
*** jamesmcarthur has joined #openstack-infra | 19:41 | |
mordred | fuentess: excellent. well - we have a doc here: https://docs.openstack.org/infra/manual/developers.html#accout-setup ... I don't think we require signing the CLA for infra projects (do we clarkb fungi?) so you can probably skip that part | 19:42 |
fungi | checking on ozj there | 19:42 |
mordred | fuentess: tl;dr is "make sure you have a launchpad/ubuntu sso account", "log in to review.openstack.org", "add your ssh key to your profile in gerrit", "pip install git-review" ... then in a git clone of openstack-zuul-jobs, once you've made your commit, run "git review" | 19:43 |
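The same tl;dr as commands, for reference (the account setup in the gerrit web UI still comes first, per the doc):

```bash
pip install --user git-review
git clone https://git.openstack.org/openstack-infra/openstack-zuul-jobs
cd openstack-zuul-jobs
# edit roles/kata-setup/tasks/main.yaml, commit, then push the review
git commit -a
git review
```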
fuentess | mordred: great, thanks, I'll follow the doc | 19:43 |
mordred | fuentess: but the doc is more complete than messages from me in IRC :) | 19:43 |
fungi | https://review.openstack.org/#/admin/projects/openstack-infra/openstack-zuul-jobs says "Require a valid contributor agreement to upload: INHERIT (false)" so no cla required | 19:44 |
mordred | yay! | 19:44 |
fuentess | great, thanks | 19:44 |
mordred | let us know if you have any issues - the initial account setup is more onerous than we'd like, but such is the world we live in | 19:44 |
fungi | things we're (painfully slowly it seems like) changing for the better over time | 19:45 |
*** jamesmcarthur has quit IRC | 19:46 | |
*** jamesmcarthur has joined #openstack-infra | 19:48 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 19:49 |
corvus | Shrews: thanks, i'll start there and see if i can trace further | 19:50 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: web: add tenant and project scoped, JWT-protected actions https://review.openstack.org/576907 | 19:51 |
*** jamesmcarthur has quit IRC | 19:52 | |
mnaser | infra-root: http://grafana.openstack.org/d/nuvIH5Imk/nodepool-vexxhost?orgId=1&from=now-1h&to=now&var-region=All is this starting to look..healthy? | 19:52 |
mnaser | it looks like for some reason a lot of instances accumulate as ready and then all get used up at once | 19:52 |
*** ijw has joined #openstack-infra | 19:53 | |
*** jamesmcarthur has joined #openstack-infra | 19:55 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Added twine check functionality to python-tarball playbook https://review.openstack.org/605096 | 19:57 |
corvus | clarkb, Shrews: the job that node request was for (openstack-ansible-ironic-ssl-nv) was removed between the time the node request was issued and fulfilled: http://git.openstack.org/cgit/openstack/openstack-ansible-os_ironic/commit/?id=9a2f843dc15750cddba10db73afe381ee3785250 | 20:01 |
corvus | i think that's our smoking gun :) | 20:01 |
Shrews | corvus: w00t | 20:01 |
clarkb | mnaser: I think that may be lag induced by the zuul executors throttling themselves. You can compare with the active executor count on the zuul status page | 20:02 |
* mnaser clarkb: so it looks like we're at 90ish in-use which seems to tell me things are healthy | 20:02 | |
mnaser | that shouldn't have been a /me but okay | 20:03 |
mnaser | sorry for the noise/annoyance/problems/etc | 20:03 |
corvus | that behavior could also be zuul doing a bunch of reconfigs, or a gate reset | 20:03 |
clarkb | mnaser: was that a one off thing? eg we don't need nodepool to manage nodes/volumes better? | 20:04 |
mnaser | clarkb: yes, that was a one-off at our side thing, but i noticed that when i deleted the vms from under nodepool it was a bit confused | 20:04 |
mnaser | not sure what was done there | 20:04 |
*** agopi has quit IRC | 20:06 | |
clarkb | mnaser: I think pabelanger mentioned that it ended up waiting for the api request to time out or similar | 20:07 |
*** bnemec has quit IRC | 20:10 | |
*** evrardjp has joined #openstack-infra | 20:11 | |
clarkb | looks like inap might also be unhappy (though less unhappy than some of the other clouds) | 20:14 |
clarkb | (a lot of deleting nodes there and higher error rate in recent hours) | 20:15 |
*** bnemec has joined #openstack-infra | 20:15 | |
mgagne | clarkb: how can I make it happy? | 20:15 |
clarkb | mgagne: reading the nodepool logs the errors appear to be "timeout waiting for node to delete" | 20:17 |
mgagne | clarkb: could it be that same issue we had a couple of months ago? | 20:17 |
clarkb | mgagne: I think nodepool is not booting new nodes until it frees up capacity (so slow deletes are preventing new boots) | 20:17 |
clarkb | mgagne: you'll have to remind me what that was (sorry) | 20:18 |
mgagne | maybe I should have documented that SQL query somewhere.... | 20:18 |
clarkb | nodepool.exceptions.ServerDeleteException: Timeout waiting for server 3cc96068-e398-4c2c-a908-ebea018af044 deletion | 20:18 |
clarkb | is the full message and includes an instance uuid | 20:18 |
clarkb | if that helps | 20:18 |
mgagne | clarkb: I think delete task gets killed by restarting conductor or something. So you can't delete it again because database says it's already happening somewhere. | 20:19 |
mgagne | but it's stuck in BUILD too so might be something related to build time being higher than expected. | 20:20 |
mgagne | I'll see what I can do to unstick that mess ;) | 20:21 |
mgagne | I see that some instances are successfully deleted without me taking any actions. | 20:22 |
clarkb | mgagne: so maybe it is just slowness? | 20:23 |
mgagne | I haven't figured out what's going on. I know that some stuck instances are getting deleted. | 20:24 |
pabelanger | clarkb: mnaser: I don't think it is zuul-executors, our executor queue looks healthy: http://grafana.openstack.org/dashboard/db/zuul-status | 20:24 |
mgagne | what I would *really* love is a way to know what tasks are running on conductor or compute =) | 20:25 |
clarkb | mnaser: pabelanger http://paste.openstack.org/show/730964/ is a list of volume ids that claim to be attached to those server ids. At last check none of those server ids actually exist | 20:27 |
clarkb | mnaser: pabelanger I think that may be part of our leaked volume story | 20:27 |
clarkb | the oldest volume is from about two weeks ago and the newest from an hour ago | 20:27 |
*** smarcet has joined #openstack-infra | 20:27 | |
clarkb | mnaser: pabelanger I haven't tried to delete any of them yet but I figure that is my next step if mnaser doesn't say otherwise | 20:27 |
pabelanger | clarkb: Hmm, it is possible I may have deleted those volumes. I cleaned up some an hour so ago | 20:28 |
pabelanger | let me check history | 20:28 |
clarkb | pabelanger: well they are still there if you tried to delete them :) | 20:28 |
pabelanger | clarkb: the ones I deleted were available, but unattached | 20:28 |
clarkb | these all claim to be attached to those servers but those servers do not exist | 20:29 |
*** smarcet has quit IRC | 20:29 | |
*** smarcet has joined #openstack-infra | 20:30 | |
*** smarcet has quit IRC | 20:30 | |
mgagne | so I restarted a nova-compute and some instances are getting deleted on that node. | 20:30 |
pabelanger | clarkb: http://paste.openstack.org/show/730965/ | 20:30 |
pabelanger | that was a few hours ago when I started clean up | 20:30 |
pabelanger | 22f56d13-c67c-4aac-a6aa-cf58fe57b177 | 20:31 |
pabelanger | is in both pastebins | 20:31 |
clarkb | ya all of mine are those that look like attached to $uuid | 20:31 |
pabelanger | yah | 20:31 |
clarkb | there are a couple extras in mine too | 20:31 |
pabelanger | I'd expect nodepool to use hostname | 20:32 |
clarkb | pabelanger: it does, see the centos ones in your example | 20:32 |
clarkb | I think it shows uuid there because the instances don't exist (so it cannot look up the name) | 20:32 |
pabelanger | right | 20:32 |
pabelanger | think so too | 20:32 |
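The cross-check being described, roughly (a sketch; the openstackclient column names and the format of the attachment string are assumptions):

```bash
openstack volume list --long -f value -c ID -c "Attached to" \
  | while read -r volume attachment; do
      server=$(echo "$attachment" | grep -oE '[0-9a-f-]{36}' | head -1)
      [ -z "$server" ] && continue
      # if nova no longer knows the server, the volume is a leak candidate
      openstack server show "$server" >/dev/null 2>&1 \
        || echo "volume $volume attached to missing server $server"
    done
```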
*** priteau has quit IRC | 20:33 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix node leak on job removal https://review.openstack.org/605527 | 20:34 |
clarkb | mgagne: looks like in the last 10 minutes or so things may be improving (possibly related to that restart?) | 20:34 |
corvus | clarkb, Shrews: ^ fix for that leak | 20:34 |
fungi | oh, that's a fun bug | 20:34 |
mgagne | I restarted 1 stuck nova-compute service. Could it be it consumed all rpc workers on nova-conductor? ¯\_(ツ)_/¯ | 20:36 |
mgagne | I don't have much time in front of me. lets hope this improves the situation. | 20:36 |
*** hasharAway has quit IRC | 20:37 | |
clarkb | mgagne: thank you for looking | 20:38 |
mgagne | clarkb: will nodepool attempt to retry deletion if instance is now in error state? | 20:39 |
clarkb | mgagne: yes it should retry every 5 (or is it 15 minutes) | 20:39 |
mgagne | cool so restarting nova-compute should put the instance in error and nodepool can continue its cleanup from there. | 20:40 |
Shrews | corvus: sweet | 20:40 |
corvus | clarkb: there's definitely a 5 in there. but i think it's every 5 *seconds* now :) | 20:43 |
*** rkukura has quit IRC | 20:43 | |
clarkb | persistent | 20:43 |
fungi | somethingsomething5something | 20:43 |
clarkb | if anyone is wondering the tempest-slow job is really slow | 20:44 |
clarkb | but we might actually merge some changes shortly | 20:44 |
clarkb | corvus: http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=now-3h&to=now is it normal for zuul to not assign those ready nodes for this significant amount of time? | 20:45 |
clarkb | the zuul queue lengths are short | 20:46 |
clarkb | all 11 executors are online and accepting | 20:46 |
clarkb | this must be slowness in the scheduler? | 20:46 |
corvus | clarkb: yeah, if there's no executor queue, then it's the scheduler not getting around to dispatching the jobs | 20:46 |
corvus | the scheduler is not currently behind on work | 20:47 |
clarkb | oh wow refresh the graph | 20:48 |
clarkb | in the last minute or two almost all of those nodes went to in use | 20:48 |
corvus | i guess we looked just a bit too late | 20:48 |
pabelanger | could be zookeeper slowing down? | 20:48 |
pabelanger | (just a guess) | 20:48 |
Shrews | nodepool.o.o seems very idle | 20:49 |
corvus | pabelanger: based on what? | 20:49 |
Shrews | oh, wait. | 20:49 |
Shrews | java 170% cpu | 20:50 |
Shrews | neat | 20:50 |
Shrews | 320% | 20:50 |
pabelanger | corvus: there have been issues in base with nodepool not allocating nodes if zookeeper was laggy, based on comments SpamapS has made in past | 20:50 |
corvus | pabelanger: issues in base? | 20:50 |
pabelanger | sorry, past* | 20:51 |
corvus | pabelanger: the current issue is that nodepool allocated nodes and zuul did not immediately use them | 20:51 |
clarkb | spamaps problems were related to disk io I think. SpamapS and tobiash both run on top of tmpfs now | 20:51 |
pabelanger | corvus: right, that is what SpamapS said in the past | 20:51 |
corvus | Shrews: java, unlike python, is really good at using multiple processor | 20:51 |
corvus | s | 20:51 |
Shrews | iostat shows mostly idle disk | 20:51 |
clarkb | I'm booting a clarkb-test instance in bhs1 using the current xenial image. I want to see if it will ever start up without nodepool timing out on it | 20:54 |
ianw | clarkb: catching up ... lmn if i can help | 20:55 |
clarkb | ianw: mostly at a loss as to why bhs1 stopped working, but it's basically unusable for us now. Instances don't seem to boot at all and we leaked a bunch of ports prior to that | 20:55 |
clarkb | ianw: we think we have the ports cleaned up but now trying to see if anything will boot | 20:55 |
clarkb | ianw: likely our next step is to follow up with amorin during European hours tomorrow | 20:56 |
*** agopi has joined #openstack-infra | 20:56 | |
clarkb | other than that I think vexxhost is largely working again (though there are some weird volumes there that would be good to have mnaser glance at before we delete them) | 20:56 |
clarkb | http://paste.openstack.org/show/730964/ is those details | 20:56 |
clarkb | and mgagne just did a thing to make inap happier | 20:57 |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: CLI: add create-web-token command https://review.openstack.org/605386 | 20:57 |
clarkb | ianw: other than general cloud capacity items like ^ I think the continuing issues are largely related to gate flakiness and the cost of resets | 20:58 |
clarkb | tripleo has a very deep gate queue and resets are common. Openstack integrated gate isn't quite so large but has also shown signs of flakiness | 20:58 |
*** trown is now known as trown|outtypewww | 20:59 | |
*** bobh has quit IRC | 20:59 | |
ianw | ok, thanks for the update :) | 20:59 |
*** colonwq has joined #openstack-infra | 20:59 | |
ianw | quick question; i started looking at graphite.o.o upgrade ... the puppet is very procedural (install stuff, template, start service). i've started looking at making it all ansible running on bionic ... see any particular issues with this? | 21:00 |
*** rkukura has joined #openstack-infra | 21:00 | |
corvus | ianw: maybe we can use containers? | 21:00 |
*** fuentess has quit IRC | 21:04 | |
ianw | corvus: yeah, i started looking at that; i don't know if i'm sold really ... say i make a job to make a statsd container, a carbon container, etc, then a playbook to plug it all together. what advantage do we have over just having a playbook putting this stuff on a host? | 21:04 |
ianw | and instead of unattended-upgrades, we now have to manage the dependencies in all those sub-containers | 21:04 |
corvus | ianw: it's a good question, but i think the spec addresses it. in short, os-independence. | 21:05 |
*** slaweq has quit IRC | 21:06 | |
corvus | Shrews, clarkb: i looked at node 0002342552 assigned to request 200-0006321477 which has been ready for 7 minutes in inap. the request is still 'pending'. so nodepool-launcher hasn't marked it fulfilled yet, even though the only node in the request is ready. | 21:07 |
corvus | that's on nl03 | 21:07 |
*** smarcet has joined #openstack-infra | 21:11 | |
fungi | is it really os-independent given that there is an os inside every container (albeit a hopefully minimal one)? | 21:11 |
corvus | Shrews, clarkb: nodepool is very busy there; perhaps this is thread starvation | 21:11 |
pabelanger | corvus: oh, interesting, we have 5 active providers there. Maybe getting close to splitting that out to another launcher? | 21:12 |
corvus | fungi: i should have said no more than "see the spec" | 21:12 |
clarkb | we could rebalance the providers. I think nl02 is particularly quiet recently | 21:12 |
fungi | heh | 21:12 |
mgagne | I really have to go. But to summarize, I restarted nova-compute on most of the compute nodes. (not all) It put "BUILD/deleting" instances in ERROR state so Nodepool could clean them up. Hopefully new deletions won't get stuck anymore. | 21:12 |
clarkb | mgagne: thanks, I think it did get things moving | 21:12 |
mgagne | +1 | 21:12 |
*** panda is now known as panda|off | 21:13 | |
fungi | cpu on nl01 looks pretty much pegged (much of that claimed for system) | 21:15 |
corvus | in the nicest way possible i'd like to convey the idea that i don't really enjoy having "container-vs-not-container" debates and i think it's important that we go with the consensus we achieved in the spec after much work and deliberation rather than having more container debates. if we truly want to re-open the discussion we had in the spec (without having even really begun on actual | 21:15 |
corvus | implementation) that's certainly a choice, but let's make that choice clearly. | 21:15 |
fungi | i'm good with the consensus, i just don't even know where to start with trying to containerize something | 21:16 |
clarkb | corvus: while I agree I think we also said in the spec we would move there gradually and build up the tooling to make this happen | 21:16 |
clarkb | and so maybe upgrading graphite can move in parallel to getting tooling in place to build container images so that we can deploy graphite with containers? | 21:16 |
fungi | like, i'd like to work out how to containerize mailman3 but i feel pretty out of my depth on container standards and paradigms to know what that should look like | 21:17 |
corvus | clarkb: true. i read that as "keep puppeting" rather than "convert puppet to ansible then maybe someone will container" | 21:17 |
*** jamesmcarthur has quit IRC | 21:18 | |
clarkb | fungi: the big step 0 which we've started for nodepool and zuul is building images in a repeatable manner to keep up with updates | 21:18 |
fungi | i'm still struggling to come to grips with ansible instead of puppet, but since we have people who want to do the ansible and container legwork i'm cool with the direction we settled on | 21:18 |
corvus | (and to be fair, many services will be easier to run in ansible rather than containers, especially if we're just running os packages. but graphite is a bunch of pypi packages, so seems like an ideal candidate for containers) | 21:19 |
*** jamesmcarthur has joined #openstack-infra | 21:19 | |
fungi | ahh, yeah, my main thought exercise was to try and figure out how to containerize gerrit and the various java libs it needs | 21:20 |
corvus | fungi: there's already a gerrit container image | 21:20 |
fungi | but maybe approaching python stuff first will be easier to grasp | 21:20 |
fungi | i thought we said reusing existing containers was a no-go and we were going to insist on building all ours from scratch instead? | 21:20 |
fungi | now i'm confused | 21:20 |
corvus | fungi: well, if that's the case, we at least presumably have a build script | 21:21 |
fungi | i probably misunderstood the answer when i asked that question earlier | 21:21 |
fungi | got the impression we didn't want to reuse anyone else's container build tooling | 21:21 |
fungi | but if it's just that we want to regenerate containers using existing tools provided by the upstream maintainers, that's not so tough to grasp | 21:22 |
clarkb | there is the build tooling and then the resulting images. I think we should reuse images if possible, but we'll likely have to see how reusable these images are | 21:22 |
corvus | anyway, i merely suggest that given the spec, perhaps leaving graphite as puppet or working on containerizing it may be better choices at the momemnt than translating the puppet to ansible | 21:23 |
corvus | ianw: thanks for asking :) | 21:23 |
ianw | say i make a job using buildah to produce an image that contains statsd (that's probably the very simplest case, no secrets, nodejs + statsd all required) | 21:23 |
ianw | as a first step, where does that image go after build? | 21:23 |
fungi | maybe also a terminology gap for me. when i hear "build tooling" i think the makefiles provided by upstream for installing things into a container image | 21:23 |
clarkb | ianw: dockerhub? | 21:24 |
fungi | i.e. the things we currently have pupet exec | 21:24 |
clarkb | fungi: there is also a severe case of NIH when it comes to container image build tools. Basically everyone has one | 21:24 |
clarkb | I think the spec suggests we start with dockerfiles and docker build because it is the most common tool | 21:25 |
fungi | so, like, would we still use pip for installing python-based packages into a container image or do we need to do something lower-level? is it okay to install pip temporarily into the container and then uninstall it before creating the image? | 21:26 |
fungi | do the images get created under a chroot on a loopback device and then copied? | 21:26 |
fungi | or do we tar up the chroot after it's tied off? | 21:27 |
fungi | i guess a dockerfile is like a makefile that contains the steps for writing things inside the chroot? | 21:28 |
clarkb | you would use pip or any other package manager (or make etc) as part of the image build to create the image contents. It is ok to have intermediate steps that don't show through to the final image (though if you want to keep image size down you have to take additional steps to clean those out) | 21:28 |
clarkb | fungi: yes | 21:28 |
clarkb | I'm not actually sure what docker uses as a transport format to and from dockerhub, likely something like a tarball | 21:29 |
ianw | hrm, there's an official graphite+statsd+carbon docker image | 21:29 |
ianw | https://github.com/graphite-project/docker-graphite-statsd#secure-the-django-admin | 21:29 |
fungi | as opposed to whatever the thing is that tells you how to stack multiple image layers? | 21:29 |
ianw | is a little worrying "First login at: http://localhost/account/login Then update the root user's profile at: http://localhost/admin/auth/user/1/" | 21:29 |
fungi | docker does layered images, right? | 21:29 |
fungi | or i guess we can just decide to squash them all into a single layer if we want? | 21:30 |
clarkb | fungi: yes it does layered images, that is one of the reasons that cleaning out the intermediate stuff you don't need in the end result can be tricky (e.g. pip being installed just to run pip install and then deleted afterwards; by default you'd keep the layer that had pip installed) | 21:31 |
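A small illustration of that point, assuming a Dockerfile-based build; the package choice is arbitrary:

```bash
cat > Dockerfile <<'EOF'
FROM debian:stretch-slim
# cleanup only helps if it happens in the same RUN that created the cruft;
# a later "RUN rm -rf /var/lib/apt/lists/*" would just add another layer on
# top of the one that still contains the apt cache
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
EOF
docker build -t layer-demo .
docker history layer-demo
```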
ianw | oh, i just thought of one big issue ... ipv6 | 21:31 |
fungi | why is ipv6 an issue? | 21:32 |
clarkb | shouldn't be with docker itself, k8s doesn't really support ipv6 yet though | 21:32 |
ianw | fungi: well, something like https://github.com/graphite-project/docker-graphite-statsd, which at first glance looks like a lot of work is done for you, well it's not really | 21:32 |
*** jamesmcarthur has quit IRC | 21:33 | |
clarkb | hrm my ubuntu-xenial test in bhs1 booted just fine looks like | 21:33 |
fungi | but also for web-based services we can still proxy from an apache running outside the container on the host server to the v4 loopback if we wanted, right? | 21:33 |
clarkb | let me try centos too | 21:33 |
clarkb | fungi: yes | 21:34 |
clarkb | fungi: possibly in another container | 21:34 |
ianw | fungi: if we're in a world where we're running things in containers for simplicity, and then running external apache to proxy ipv6 into containers instead of processes running on the host listening on ipv6 ... i'm not sure i'd count that as winning :/ | 21:35 |
fungi | well, it was suggested that we don't have to containerize all the things if it winds up being convenient to, say, run a gerrit container and leave apache in the host system context | 21:35 |
*** bobh has joined #openstack-infra | 21:35 | |
clarkb | ya not required to do so, but could be done | 21:35 |
fungi | (in cases where we already run services on the loopback and proxy them from apache anyway) | 21:35 |
ianw | oh right, yeah that's not the case for say statsd currently, but is for gerrit | 21:36 |
*** graphene has joined #openstack-infra | 21:36 | |
clarkb | ok bhs1 is working now that I manually boot stuff | 21:36 |
clarkb | I am going to turn max-servers back up to 8 again | 21:37 |
clarkb | perhaps this is nodepool specific somehow? we'll have to see if the behavior persists | 21:37 |
fungi | yeah, maybe they fixed it, or maybe there's some deeper problem there | 21:38 |
*** bobh has quit IRC | 21:40 | |
clarkb | | 0002343764 | ovh-bhs1 | ubuntu-xenial | d427d8d8-4a31-4f02-bab3-eb857e0fcf9b | 158.69.64.222 | 2607:5300:201:2000::26 | in-use | 00:00:00:06 | locked | | 21:40 |
clarkb | that looks promising | 21:40 |
clarkb | ianw: ^ fyi nl04 is in the emergency file so that we can tweak the max-servers value there to help debug bhs1 | 21:40 |
ianw | ++ | 21:42 |
*** ijw has quit IRC | 21:46 | |
mordred | ianw: yah - for some (possibly many) of the services, there is likely a container already built we can use | 21:50 |
mordred | ianw: for python things we need to build, I've been using the python:slim or python:alpine base images, which already have pip and stuff installed - so you can do really simple dockerfiles like "pip install ." or something | 21:51 |
mordred | and similarly, thinking about things like gerrit, I would expect to use a "java" base image and then just install the war file in there | 21:51 |
mordred | or something | 21:51 |
mordred | but that's also all just in my imagination of course :) | 21:52 |
mordred | ianw: also - for anything we need to make a container of that uses pbr and bindep, we can use pbrx to build images - so like when we start making storyboard containers, for instance | 21:53 |
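The sort of "really simple dockerfile" being described, sketched for an arbitrary pip-installable service (the image tag and console-script name are placeholders):

```bash
cat > Dockerfile <<'EOF'
FROM python:3.6-alpine
COPY . /src
RUN pip install --no-cache-dir /src
# "my-service" stands in for whatever entry point the package provides
CMD ["my-service"]
EOF
docker build -t opendev-ci/my-service .
```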
*** jamesmcarthur has joined #openstack-infra | 21:54 | |
ianw | mordred: so as step one, we need an ansible role to install & configure docker on a host, right? that's not done? | 21:54 |
clarkb | still only the one DOWN port in BHS1 too | 21:54 |
ianw | apt-get install docker seems fine ... reading up on making sure it talks ipv6 seems a little harder | 21:54 |
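The manual equivalent of such a role on ubuntu, as a sketch; the daemon.json keys are docker's documented ipv6 switches and the ULA prefix is a placeholder:

```bash
sudo apt-get update
sudo apt-get install -y docker.io
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:1::/64"
}
EOF
sudo systemctl restart docker
# containers on the default bridge should now get v6 addresses
docker network inspect bridge | grep -i ipv6
```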
clarkb | I've got to go pick up a box of vegetables, any objection to me increasing max-servers to say 80 on bhs1 to see how that goes? | 21:54 |
clarkb | I won't be able to check on it for about 45 minutes likely | 21:55 |
ianw | clarkb: just keep an eye on ports? | 21:56 |
clarkb | ianw: ports in a DOWN state and whether or not instances are actually booting | 21:57 |
clarkb | ianw: earlier today after we cleaned up the ports in a DOWN state we ended up not being able to boot anything | 21:57 |
*** jamesmcarthur has quit IRC | 21:58 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Don't report non-live items in stats https://review.openstack.org/605540 | 22:00 |
clarkb | ianw: fungi: you all ok with me making that max-server edit? | 22:01 |
ianw | clarkb: ok, i have a window up with pre-warmed bash cache for monitoring ovh-bhs1 openstackjenkins | 22:01 |
*** dklyle has quit IRC | 22:02 | |
clarkb | ok max servers set to 80 | 22:02 |
ianw | what's the 3 servers in there that don't have an "image"? | 22:03 |
ianw | possibly we just refreshed images? | 22:03 |
gouthamr | hi infra experts, i switched the nodeset on a set of jobs to ubuntu-bionic, but got a weird side-effect - the post run playbook is ignored - https://review.openstack.org/#/c/604929 - is this a known issue? or pbkac on my end? | 22:04 |
gouthamr | the playbook's there in the ARA report, but not executed | 22:04 |
gouthamr | sample: http://logs.openstack.org/29/604929/2/check/manila-tempest-dsvm-mysql-generic/ab0b8cf/ | 22:04 |
clarkb | gouthamr: I think the devstack base job may expect certain groups in your nodest | 22:05 |
clarkb | gouthamr: you'll need to update the nodeset and include those if you haven't | 22:05 |
*** boden has quit IRC | 22:06 | |
gouthamr | clarkb: oh.. you mean the "legacy-dsvm-base" job? | 22:06 |
*** kgiusti has left #openstack-infra | 22:06 | |
clarkb | hrm that job should be named legacy-manila-tempest-dsvm-mysql-generic if it isn't a v3 native job | 22:07 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Speed up build list query under mysql https://review.openstack.org/605170 | 22:07 |
clarkb | no I meant the native v3 devstack job base | 22:07 |
clarkb | I don't know if legacy-dsvm-base needs anything like that | 22:08 |
mordred | ianw: we have a role for that ... | 22:09 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Improve docs for inventory_file https://review.openstack.org/602665 | 22:09 |
gouthamr | clarkb: true! that isn't v3 native; dunno why we changed the name on that | 22:09 |
clarkb | gouthamr: the legacy jobs use the same devstack-single-node nodeset that the non legacy jobs use | 22:09 |
mordred | ianw: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/install-docker | 22:09 |
clarkb | gouthamr: and that sets a node to be named controller. look in openstack-dev/devstack/.zuul.yaml | 22:10 |
mordred | ianw: it might be worth us rearranging that a little bit - based on the discussion we had about roles in system-config | 22:10 |
mordred | ianw: this might be another of the ones that wants to live in system-config? | 22:10 |
corvus | mordred: why system-config? | 22:10 |
mordred | corvus: if we're going to use it in production ansible to install docker on a system we want to run services in containers on - then it would be a bit weird for system-config to need to add the zuul-jobs repo to bridge.o.o in order to get the role? | 22:11 |
corvus | that's currently zuul-jobs, which is our widest applicable source of roles, so moving to system-config is a scope reduction | 22:11 |
corvus | mordred: maybe that should be an independent repo then | 22:12 |
mordred | corvus: maybe so - maybe this is the first one where being an independent repo has a benefit | 22:12 |
clarkb | openstack.exceptions.SDKException: Error in creating the server: Build of instance 7bbfc561-5a4a-4fc7-8f5c-02e60bc61511 aborted: Request to https://network.compute.bhs1.cloud.ovh.net/v2.0/ports.json timed out | 22:12 |
corvus | mordred: though, *that* role sets up CI-specific mirrors -- so we may need 2 roles | 22:12 |
clarkb | maybe all our troubles are not behind us :/ | 22:12 |
mordred | corvus: oh -that's a great point | 22:12 |
ianw | clarkb: i also noticed a 500 error when i tried to list the servers, but only once | 22:13 |
ianw | clarkb: more general intermittent issues? | 22:13 |
ianw | yeah, lot of ERRORs | 22:14 |
*** jamesmcarthur has joined #openstack-infra | 22:14 | |
clarkb | I'll set it back to 8 and then go pick up my groceries | 22:15 |
fungi | sorry, stepped away for dinner. clarkb: increasing max-servers again seems worth trying, i agree | 22:16 |
ianw | fungi: heh, tried it and it didn't work :) | 22:16 |
*** smarcet has quit IRC | 22:16 | |
mordred | ianw: corvus has brought me back around to your way of thinking - we need an install-docker role for our production ansible | 22:16 |
mordred | ianw: I'd argue that it could be _very_ similar to the install-ansible role in zuul-jobs though - just maybe not with ci mirrors | 22:16 |
clarkb | fungi: ya was worth trying, but still broken, I undid it | 22:17 |
mordred | ianw, corvus: actually - sorry - now I'm changing my mind again | 22:17 |
mordred | how about I go play with something for a second and get back to the two of you | 22:17 |
fungi | i'm also figuring that rather than completing the mediawiki puppeting now i should be looking at containerizing it instead. is there a base php container it should be built on? | 22:18 |
ianw | mordred: sure, well knowing how to get docker onto the host seems like the first step in using this. | 22:19 |
*** jamesmcarthur has quit IRC | 22:19 | |
fungi | like the alpine python base? | 22:19 |
mordred | fungi: https://hub.docker.com/_/mediawiki/ | 22:20 |
mordred | fungi: there's even a mediawiki container | 22:20 |
fungi | yeah, just found https://docs.docker.com/samples/library/mediawiki/ | 22:20 |
mordred | \o/ | 22:20 |
fungi | as well as a lot of discussion about the complexity of adding extensions | 22:21 |
mordred | fungi: but if we want to roll our own - their dockerfile shows: FROM php:7.2-apache | 22:21 |
mordred | so that seems to maybe be a good base image | 22:21 |
clarkb | mordred: in the openstacksdk exception above is sdk/shade directly trying to manage the ports? | 22:21 |
mordred | fungi: https://github.com/wikimedia/mediawiki-docker/blob/41edcc8020aa47823d30c1b35f216b0a2834b2b6/stable/Dockerfile | 22:21 |
clarkb | I wonder if that is part of the problem. when I boot manually I use openstackclient which probably just talks boring nova api | 22:22 |
mordred | clarkb: in some cases it might be | 22:22 |
mordred | clarkb: we query ports after booting a server to find the port id | 22:22 |
mordred | clarkb: so that we can pass the port_id of the server's fixed ip to the floating_ip create call, creating the fip on the server's port | 22:22 |
clarkb | mordred: but we don't use floating IPs here | 22:22 |
mordred | oh - then something is wrong - we shouldn't be listing ports on a non fip cloud | 22:23 |
mordred | we MIGHT be trying to list them to get meta info about the server's interfaces | 22:23 |
mordred | but we should only ever be listing | 22:23 |
clarkb | ya this looks like its part of listing the server details | 22:23 |
clarkb | http://paste.openstack.org/show/730970/ is full traceback | 22:24 |
clarkb | if we go back to hunches I wouldn't be surprised if ovh doesn't expect you to talk to neutron so much since they don't really enable any sdn features | 22:24 |
fungi | mordred: aha so we would just have a periodic job slurp down a dockerfile like that and then push the resulting image to a server? | 22:24 |
mordred | oh - wait | 22:24 |
mordred | clarkb: that's not the sdk interacting with ports | 22:24 |
clarkb | ok really need to grab veggies now, but maybe that makes sense to you mordred | 22:25 |
clarkb | mordred: basically anything that seems to directly mess with ports is unreliable | 22:25 |
mordred | clarkb: "Error in creating the server: {reason}".format( | 22:25 |
mordred | reason=server['fault']['message']), | 22:25 |
clarkb | oh so that is from nova? | 22:25 |
mordred | that's us printing the fault.message in the server payload from nova | 22:25 |
clarkb | gotit | 22:26 |
mordred | we should clarify that in the error message | 22:26 |
mordred | "Error in creating the server, nova reports error: {reason}" | 22:26 |
mordred | would be clearer, yeah? | 22:26 |
clarkb | ++ | 22:28 |
mordred | clarkb: remote: https://review.openstack.org/605544 Clarify error message is from nova | 22:28 |
mordred | fungi: yes, that's what I'm thinking | 22:29 |
fungi | so this is probably similar to the errors i was getting from openstackclient in that case (nova reporting a problem talking to neutron)? | 22:29 |
mordred | fungi: or, a periodic job that builds an image based on a dockerfile like that - then publishes the image to either dockerhub or a docker hub we run | 22:29 |
fungi | i guess if we want to fork the dockerfile we can do that to add things like different configuration or additional non-core extension and theme bundles? | 22:30 |
mordred | fungi: then in the ansible, instead of "apt-get install mediawiki" we'd just do "docker run opendev-ci/mediawiki" or something | 22:30 |
mordred | fungi: yes. | 22:30 |
fungi | or is that where layered images come into play? | 22:30 |
mordred | fungi: we could also just build a new image based on the mediawiki image | 22:30 |
mordred | fungi: so make a dockerfile that says "from mediawiki" at the top, then plop our extensions and themes in | 22:31 |
mordred | fungi: I think for config, we just put it on the server like normal, and bind-mount the config dir in as a docker volume | 22:31 |
fungi | ahh, so dockerfiles have an inheritance concept. that's what the FROM php:7.2-apache at the top means? | 22:31 |
mordred | fungi: so like "docker run -v /etc/mediawiki:/etc/mediawiki opendev-ci/mediawiki" | 22:32 |
mordred | fungi: yes | 22:32 |
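Putting those two ideas together as a sketch (the extension path inside the image is an assumption about the upstream layout):

```bash
cat > Dockerfile <<'EOF'
FROM mediawiki
COPY extensions/SomeExtension /var/www/html/extensions/SomeExtension
EOF
docker build -t opendev-ci/mediawiki .
# config stays on the host and is bind-mounted in at run time
docker run -d -v /etc/mediawiki:/etc/mediawiki opendev-ci/mediawiki
```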
ianw | mordred: it would be more a .service file that does the docker run though? | 22:32 |
mordred | ianw: yah - I'm just spitting commandline docker commands in channel for conversation | 22:32 |
mordred | just saying - zuul job publishes to dockerhub (or a private dockerhub) - then on the server we just pull/run the image | 22:32 |
fungi | the fact that the mediawiki dockerfile runs `apt-get install ...` suggests that the php:7.2-apache image is debianish | 22:33 |
mordred | fungi: yes indeed it does | 22:33 |
fungi | and not some stripped-down thing like alpine or coreos i guess? | 22:33 |
mordred | fungi: if you want - you should be able to run "docker run -it --rm mediawiki /bin/bash" and get a shell in a container running that image and see what it's based on and what's there | 22:34 |
mordred | clarkb, fungi, ianw corvus : speaking of - opendev is taken on dockerhub - what say I grab opendevorg as an account to push things to? | 22:35 |
corvus | mordred: ++ | 22:35 |
*** jamesmcarthur has joined #openstack-infra | 22:35 | |
*** rcernin has joined #openstack-infra | 22:35 | |
fungi | the deeper down this rabbit hole i go, the more it seems like running containerized services means 1. understanding conventions of multiple distros since you won't know which ones a given upstream image is based on, and 2. sort of also being responsible for your own distribution since versions of stuff is hard-coded all over the place needing you to bump them to get security fixes? surely this can't | 22:36 |
fungi | be the actual reality of containerization | 22:36 |
fungi | opendevorg sgtm | 22:36 |
*** eernst has joined #openstack-infra | 22:37 | |
*** tosky has quit IRC | 22:37 | |
mordred | fungi: yup. that is, in fact, the reality of containerization | 22:37 |
fungi | at least for the mediawiki image, it's just mediawiki versions themselves which are hard-coded | 22:38 |
mordred | fungi: however, due to the idea of microservices and service teams - it's frequently only one service and one base image you're ever poking at | 22:38 |
fungi | so no worse than our (incomplete) configuration management in that regard | 22:38 |
mordred | fungi: yah | 22:38 |
mordred | fungi: also - I've been finding that the majority of container images are actually debuntu based - unless someone wants to make it smaller in which case there is also frequently an alpine variation | 22:39 |
*** jamesmcarthur has quit IRC | 22:40 | |
SpamapS | FYI, zookeeper's disk IO is *VERY* bursty | 22:40 |
fungi | mailman3 seems to use docker-compose to build their images https://github.com/maxking/docker-mailman/blob/master/docker-compose.yaml | 22:40 |
SpamapS | it wants to sync all of its writes to disk periodically, and if you did a lot.... | 22:40 |
SpamapS | it gets mad | 22:40 |
SpamapS | It never shows as heavy disk IO, it just shows as heavy latency on the syncs | 22:41 |
SpamapS | and no I don't run on tmpfs, I just run on a dedicated ZK node. | 22:41 |
SpamapS | previously it was battling with other processes for that IO latency and that made it miss checkpoints | 22:42 |
fungi | aha, FROM python:3.6-alpine https://github.com/maxking/docker-mailman/blob/master/core/Dockerfile | 22:42 |
mordred | fungi: yeah. that docker compose file is a way to say "please launch this set of images for me" | 22:43 |
fungi | so probably not too dissimilar from whatever we'll use for other python services | 22:43 |
mordred | fungi: yah - that's what we're using in the zuul and nodepool images | 22:43 |
fungi | oh, so docker-compose is not a higher-level image build language, it's a deployment language? | 22:43 |
mordred | ah | 22:43 |
mordred | yah | 22:43 |
fungi | and again, at least for the mm3 core image, it's only hard-coding mailman package versions and i guess taking whatever latest is on pypi otherwise | 22:45 |
fungi | so not too onerous | 22:45 |
mordred | in fact, jesse keating had a docker compose file a while back for booting a zuul on a machine - so you can say "docker compose up" and docker compose will launch all of the processes in containers needed connected together to produce the zuul service | 22:45 |
mordred | fungi: yah | 22:45 |
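For the mailman images linked above, the deployment side is then roughly this (still a sketch; the stack also expects some environment/configuration per the project's docs):

```bash
git clone https://github.com/maxking/docker-mailman
cd docker-mailman
docker-compose up -d
docker-compose ps
docker-compose logs
```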
fungi | i already worked out how to get mm3 services deployed from debian packages, but translating it to alpine will probably take some time | 22:46 |
fungi | i have a feeling we might stick with distro-packaged exim similarly to how we might stick with distro-packaged apache for some of these things | 22:47 |
mordred | fungi: well, you could maybe just use the docker mailman images? | 22:47 |
mordred | fungi: yup. totally agree | 22:47 |
fungi | i mean from a configuring it standpoint | 22:47 |
mordred | fungi: I don't think it's a win for us for the forseeable future for things like apache | 22:47 |
mordred | fungi: oh - gotcha | 22:47 |
fungi | the debian packages preconfigure an awful lot of the wsgi bits and stuff | 22:48 |
*** dpawlik has joined #openstack-infra | 22:48 | |
fungi | it's django and wsgi and... | 22:48 |
mordred | fungi: https://hub.docker.com/r/maxking/mailman-core/ seems to be the main image | 22:49 |
fungi | and database setup | 22:49 |
mordred | yeah - seems to be a lot of stuff | 22:49 |
mordred | fungi: well - if there are well maintained debian packages maybe it's one that we just install with ansible directly? I see the container stuff as more of a win when we're installing wars or installing python stuff from source or anything we're building ourselves | 22:51 |
fungi | https://asynchronous.in/docker-mailman/ is the container install walkthrough linked from the mm3 docs | 22:51 |
mordred | but - I guess if they have a full walkthrough of that- maybe it's a good learning example? | 22:51 |
mordred | fungi: ooh - the security section of that is nice - we should do that with our images | 22:52 |
*** dpawlik has quit IRC | 22:52 | |
mordred | (and make sure we appropriately sign our images too) | 22:52 |
fungi | yeah, maybe no need to container mm3, but also if we stick with the debian packages filtered through ubuntu we're stuck dealing with a mailman that's still rapidly changing (there are a _lot_ of bugs in mm3 being worked through still, from what i can see) | 22:53 |
*** rcernin has quit IRC | 22:53 | |
mordred | fungi: yah. whereas if you just follow these instructions | 22:54 |
*** eernst has quit IRC | 22:54 | |
*** rcernin has joined #openstack-infra | 22:55 | |
mordred | and maybe even use docker-compose for it since that's how they're recommending - and that way we can follow the stuff they're publishing and do it in a way that would enhance our ability to interact with them? | 22:55 |
*** jamesmcarthur has joined #openstack-infra | 22:56 | |
fungi | seems worth trying to redo the current poc following my notes from https://etherpad.openstack.org/p/mm3poc and adapting them to their container walkthrough | 22:56 |
mordred | ++ | 22:56 |
mordred | would at the very least be a learning case where you've got a clear set of instructions for the one way | 22:56 |
mordred | in the mean time - I will get an install-docker role done by morning | 22:57 |
ianw | mordred: sorry, where did we end up on "install docker to a host"? is there something i can do? i would like to get this bit nailed down (with ipv6) before i start looking at it | 22:57 |
ianw | oh, jinx | 22:57 |
mordred | ianw: :) | 22:57 |
mordred | ianw: I think I nerd-sniped myself into helping on that bit | 22:57 |
openstackgerrit | Merged openstack-infra/openstackid master: Updated user profile UI https://review.openstack.org/604172 | 22:57 |
*** jamesmcarthur has quit IRC | 22:57 | |
*** jamesmcarthur has joined #openstack-infra | 22:57 | |
ianw | mordred: i'm happy to take an initial stab, working from the existing role, if you're thinking of moving it into system-config? | 22:58 |
corvus | s/move/copy/ | 22:58 |
mordred | ianw: well - I think corvus made a great point, which is that the one in zuul-jobs has ci specific things | 22:58 |
mordred | so yeah, I think the next thought was copy it | 22:58 |
mordred | but then I was thinking - the ci mirror stuff is just generic "use this mirror" and could still be useful - we just need to remove the defaults of zuul_docker_mirror_host | 22:59 |
mordred | which is what I wanted to poke at to see what it looked like | 22:59 |
mordred | if that does seem reasonable - then I think moving it into its own repo and having the jobs add a roles-path - and adding it to the galaxy install in system-config - could be nice | 22:59 |
clarkb | I've set bhs1 back to max-servers 0 | 22:59 |
fungi | what is our story around testing this stuff? should we keep most of the zuul-side automation reusable for check/gate jobs on our container configs? | 23:00 |
mordred | corvus, ianw does that above stream-of-consciousness make sense? | 23:00 |
mordred | or - should we just copy the role to system-config for now | 23:00 |
mordred | and maybe make the reusable shared role for later? | 23:00 |
ianw | mordred: sort of, though i'm not sure what advantage having it in a separate repo affords? | 23:00 |
fungi | er, i guess that's basically what you're already discussing ;) | 23:00 |
mordred | ianw: to be able to use it in both zuul-jobs and system-config without having zuul-jobs need system-config or system-config need zuul-jobs | 23:01 |
ianw | presumably other people have done this, and we're not using *their* roles | 23:01 |
mordred | totally - and it might be a waste of energy :) | 23:01 |
ianw | given that it's not that complex, and we might want to do things in production that we don't want in zuul-jobs without having to worry about it being generic, i'm feeling like putting it in system-config wins, for mine | 23:01 |
mordred | so maybe we should just start with a copy of install-docker in system-config but without the mirror config? | 23:02 |
mordred | ianw: ++ | 23:02 |
fungi | we can also run roles from system-config in jobs that test things in or related to system-config if we want, right? | 23:02 |
mordred | fungi: totally | 23:02 |
ianw | ok, i'll be happy to spin that up with some basic test-infra as step 1 | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Add zuul user to bridge.openstack.org https://review.openstack.org/604925 | 23:02 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Manage user ssh keys from urls https://review.openstack.org/604932 | 23:02 |
mordred | fungi: but for ci and for docker specifically, we should use the install-docker role in zuul-jobs | 23:03 |
mordred | fungi: since it properly configures ci mirrors | 23:03 |
fungi | oh, sure | 23:03 |
mordred | but yes, in general :) | 23:03 |
mordred | ianw: sweet! | 23:03 |
fungi | okay, now i get it | 23:03 |
fungi | so for deployment we have a shadow equivalent of install-docker which doesn't do the ci-specific setup | 23:03 |
fungi | and stick that in system-config | 23:04 |
ianw | fungi: yes, and quite possibly might do things production-specific, if we find the need | 23:04 |
mordred | yah | 23:04 |
clarkb | we did leak a bunch of ports out of that short time with max servers 80 | 23:04 |
mordred | like - we might decide we want some settings in the docker daemon.json | 23:04 |
fungi | clarkb: yech. are you still worried this is something recently regressed in nodepool/shade? | 23:05 |
mordred | ianw: also - we can simplify that role for production and just keep the 'use_upstream_docker' codepath | 23:05 |
ianw | yes, like the ipv6 settings i keep harping on about | 23:05 |
clarkb | fungi: no mordred pointed out that message came from nova itself and shade was just passing it through | 23:05 |
mordred | fungi: no - I believe it's an openstack-side issue | 23:05 |
mordred | ianw: ++ | 23:05 |
fungi | yeah, makes sense that the errors nodepool reported are the equivalent of some of what i was seeing from osc | 23:05 |
mordred | ianw: so maybe it's "copy install-docker, start deleting, completely change daemon.json.j2" :) | 23:05 |
ianw | sounds about right | 23:07 |
ianw | clarkb: am i understanding that at lower limits, no port leaks and everything was working, but then at 80 it started going wrong? | 23:08 |
clarkb | ianw: yes, though turning it back down to 8 it was unhappy still. (but plenty of ports leaked so maybe have to clean those up again?) | 23:09 |
clarkb | I am composing an email to amorin since our timezones don't line up great. I will cc you and fungi as people who have looked at this so far | 23:10 |
ianw | right, yeah saw those errors. but individual boots didn't leak ports | 23:10 |
ianw | anyway, ++ on the email. it does seem like the error is coming from the other end | 23:11 |
fungi | thanks clarkb! | 23:11 |
*** jamesmcarthur has quit IRC | 23:17 | |
clarkb | email sent | 23:18 |
*** jamesmcarthur has joined #openstack-infra | 23:19 | |
*** tpsilva has quit IRC | 23:19 | |
ianw | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz | 23:24 |
ianw | traceroute_v4_output": "git.openstack.org: Name or service not known\nCannot handle \"host\" cmdline arg `git.openstack.org' on position 1 (argc 2)\n", | 23:24 |
*** jlvillal has joined #openstack-infra | 23:24 | |
ianw | that's interesting, did we know dns was the cause of those intermittent errors? | 23:25 |
ianw | and how fascinating that was a multinode job and they *both* failed. that suggests it wasn't just a fluke | 23:26 |
*** rh-jelabarre has quit IRC | 23:26 | |
clarkb | I think prometheanfire had pointed it out (they were failing when setting up gentoo) | 23:26 |
*** dklyle has joined #openstack-infra | 23:26 | |
clarkb | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/zuul-info/host-info.primary.yaml says localhost is the resolver so must be something with unbound? | 23:27 |
clarkb | (grep dns in that file) | 23:28 |
*** graphene has quit IRC | 23:28 | |
prometheanfire | ya, it's been happening a bunch | 23:29 |
prometheanfire | silly fedora | 23:29 |
*** mriedem is now known as mriedem_away | 23:30 | |
mwhahaha | any chance on getting this project-config change merged? https://review.openstack.org/#/c/603489/ | 23:31 |
ianw | clarkb: but unbound isn't setup yet, right, on these integration jobs? | 23:33 |
mwhahaha | gracias | 23:33 |
*** graphene has joined #openstack-infra | 23:33 | |
clarkb | ianw: it should be but using our default config (which is ipv4 and ipv6 resolvers) | 23:33 |
clarkb | unless we don't do that config on fedora | 23:34 |
ianw | but why would it only sometimes not work ... | 23:34 |
ianw | clarkb: where did you see the resolver in info yaml? | 23:34 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Update Zuul service documentation https://review.openstack.org/605556 | 23:35 |
clarkb | ianw: it is under ansible_dns | 23:36 |
clarkb | says nameservers: - 127.0.0.1 | 23:36 |
prometheanfire | is the service running? | 23:38 |
ianw | hrm, right, so that would be pre reconfiguration | 23:38 |
ianw | a working log is http://logs.openstack.org/39/602439/11/check/openstack-infra-base-integration-fedora-latest/08950a4/zuul-info/host-info.fedora-28.yaml and it has the same | 23:38 |
clarkb | prometheanfire: I don't think we have that information, that would certainly be something to check. Maybe it isn't starting on boot on fedora reliably | 23:39 |
prometheanfire | maybe provider related? iirc, 2 or 3 out of every 4 of my rechecks for adding gentoo are for fedora failing | 23:40 |
ianw | we should capture unbound.log | 23:40 |
ianw | and a ps dump at least | 23:41 |
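Capturing that in a job could look roughly like the following (illustrative commands only; LOG_DIR stands in for wherever the job collects artifacts, and the unbound log path differs between our images and distros):

    mkdir -p "$LOG_DIR"
    ps auxww > "$LOG_DIR/ps.txt"
    cp /var/log/unbound.log "$LOG_DIR/" 2>/dev/null || \
        journalctl -u unbound > "$LOG_DIR/unbound.log"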
*** rcernin_ has joined #openstack-infra | 23:41 | |
pabelanger | prometheanfire: yes, it will fail on ipv6 hosts I believe | 23:43 |
*** rcernin has quit IRC | 23:43 | |
clarkb | pabelanger: it shouldn't | 23:43 |
pabelanger | clarkb: unbound needs to be configured before validate-hosts | 23:44 |
pabelanger | http://logs.openstack.org/39/602439/11/check/openstack-infra-multinode-integration-fedora-latest/2fa2d84/job-output.txt.gz | 23:44 |
clarkb | it is configured | 23:44 |
prometheanfire | it's that race thing? | 23:44 |
pabelanger | that was the whole reason fedora-28 nodes were flakey in the gate | 23:44 |
openstackgerrit | Merged openstack-infra/project-config master: Add ansible-role-chrony project https://review.openstack.org/603489 | 23:44 |
pabelanger | clarkb: right, but configure-unbound hasn't run yet for base-minimal jobs | 23:44 |
ianw | pabelanger: it should be minimally configured in dib -> https://git.openstack.org/cgit/openstack-infra/project-config/tree/nodepool/elements/nodepool-base/finalise.d/89-unbound#n37 | 23:44 |
clarkb | yes, but the unbound config that is there should work | 23:45 |
pabelanger | ianw: yup, agree. for some reason, it is broken on ipv6. The fix was to first do configure-unbound and reload the unbound service, and then it started working | 23:45 |
clarkb | we boot with working dns, if we don't that is a bug | 23:45 |
clarkb | we are only optimizing it from there on the jobs | 23:45 |
pabelanger | http://git.openstack.org/cgit/openstack-infra/project-config/tree/playbooks/base/pre.yaml#n6 | 23:45 |
clarkb | ya I wonder if unbound is just not running at boot there | 23:46 |
pabelanger | the only difference I can think of is that for DIB we put in both ipv6/ipv4 forwarders, while for the configure-unbound role we choose the right one based on ipv6 / ipv4 | 23:46 |
clarkb | and restarting it is what made it happy | 23:46 |
ianw | ok, this seems like something i can test ... if we boot a fedora node in rax the suggestion is that unbound isn't immediately working, right? | 23:47 |
pabelanger | clarkb: should be able to manually boot a fedora-28 to confirm | 23:47 |
pabelanger | in inap | 23:47 |
clarkb | ianw: yes | 23:47 |
pabelanger | but, i think I did manually test this in rax or ipv6 | 23:47 |
pabelanger | but happy to see ianw try | 23:48 |
pabelanger | couldn't reproduce the issue | 23:48 |
clarkb | pabelanger: the error above was on rax is why ianw is looking at rax I think | 23:48 |
ianw | yeah ... but we have rax logs where it passes and where it fails | 23:48 |
pabelanger | ++ | 23:49 |
*** openstackgerrit has quit IRC | 23:49 | |
ianw | let me try creating a rax node manually and we can boot-cycle it and see if we hit something | 23:49 |
pabelanger | maybe add port tcp/53 localhost check to validate-host too? | 23:50 |
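One shape such a check could take (a sketch, not what the validate-host role actually does): fail early if nothing answers TCP on the loopback resolver.

    if ! timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/53'; then
        echo "nothing listening on 127.0.0.1:53" >&2
        exit 1
    fi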
*** jamesmcarthur has quit IRC | 23:50 | |
ianw | a hold on this job might help, although i think i tried that and got distracted because we never hit it to hold the node | 23:50 |
ianw | and then zuul got restarted etc | 23:51 |
pabelanger | ianw: yah, I think that is what I did, autohold. then when I finally went back to check, everything was working. | 23:51 |
pabelanger | oh, I remember, maybe it was a race starting unbound, but the journald logs were too old to confirm | 23:52 |
ianw | classic heisenbug | 23:52 |
ianw | we do start unbound from rc.local right ... | 23:52 |
ianw | maybe we should drop in a .service file | 23:53 |
ianw | no, what we do is "echo 'nameserver 127.0.0.1' > /etc/resolv.conf" in rc.local | 23:55 |
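A minimal sketch of the .service approach (the unit name and contents are assumptions, not something that exists in project-config today): a oneshot unit dropped into /etc/systemd/system/ and enabled, ordered after unbound so the resolv.conf rewrite only happens once the local resolver is up.

    [Unit]
    Description=Point resolv.conf at the local unbound resolver
    After=unbound.service
    Wants=unbound.service

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 'nameserver 127.0.0.1' > /etc/resolv.conf"

    [Install]
    WantedBy=multi-user.target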