corvus | clarkb: oh; er, do you have the whole sequence handy? (or a pointer to docs?) | 00:00 |
---|---|---|
clarkb | ya let me get the email that was sent to the list | 00:00 |
clarkb | corvus: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000489.html is what I started with | 00:00 |
*** yamamoto has quit IRC | 00:01 | |
clarkb | I performed the pull and update steps in that email for each of the kubernetes containers that showed up as running according to `atomic containers list --no-trunc` | 00:02
clarkb | the only thing outside of that I did was free up disk space so that the pull would work. I did that via `atomic images prune` and `journalctl --vacuum-size=100M` | 00:02
corvus | clarkb: do we need to restart anything after it's done? | 00:03 |
clarkb | corvus: the update command restarts the containers for us (you can confirm with the containers list command) | 00:03 |
corvus | ok i think i grok now | 00:04 |
clarkb | the pull brings down the image, the update restarts services as necessary aiui | 00:04 |
clarkb | corvus: note that the minion was updated too (it runs a subset of the services) | 00:08 |
clarkb | corvus: thinking out loud, does the json change need an executor restart? | 00:10 |
corvus | clarkb: yep. i did your foreach kube*; pull, update on all | 00:10 |
clarkb | that's part of the ansible processes which are forked on demand and should've seen the change as soon as we installed it? | 00:10
corvus | clarkb: i think it will need a restart because it's one of the things we copy into place on startup | 00:11 |
clarkb | ah | 00:11 |
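Pulling the steps above together, a minimal sketch of the update sequence clarkb describes (the kubelet image name, tag, and `--rebase` usage are assumptions drawn from the referenced openstack-discuss post, not from this log; repeat the pull/update pair for each running container):

```sh
# Free disk space first so the new images fit, as mentioned above.
sudo atomic images prune
sudo journalctl --vacuum-size=100M

# See which kubernetes system containers are running.
sudo atomic containers list --no-trunc

# For each container, pull the new image and rebase the container onto it
# (hypothetical image/tag shown; take the real ones from the mailing list post).
sudo atomic pull --storage ostree docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1
sudo atomic containers update --rebase \
    docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1 kubelet

# "update" restarts the container, which can be confirmed with:
sudo atomic containers list --no-trunc
```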
*** rlandy has quit IRC | 00:11 | |
corvus | infra-root: i've created another k8s cluster via magnum in vexxhost to experiment with running gitea in it | 00:12 |
corvus | clarkb: ^ and i just upgraded it, because i forgot to specify the version on create | 00:12
fungi | okay, i'm fed/inebriated and catching up | 00:12 |
clarkb | corvus: it was actually pretty painless once I figured out how to get around the disk constraints | 00:13
corvus | if this works, i'll be able to push some changes to system-config for it. but for now i'd just like to experiment and see what works | 00:13 |
corvus | clarkb: agreed; i'm done now :) | 00:13 |
clarkb | well that and knowing what to do in the first place but thankfully people sent email to the discuss list about it | 00:13 |
corvus | oh, the cluster is called "opendev" btw | 00:14 |
fungi | we're ~ready to do executor restarts nowish i guess? | 00:16 |
fungi | (unless i've misread) | 00:16 |
clarkb | I am ready if you are | 00:16 |
fungi | ze01 seemed to have the right (new) version installed when i looked a moment ago | 00:19 |
fungi | i'll double-check them all real fast | 00:19 |
corvus | wfm. i'd recommend an all-stop, then update fstab on ze12, then all start. | 00:20 |
corvus | does someone else want to do the all-stop, and i'll fixup ze12 when they're stopped? | 00:20 |
fungi | huh, suddenly i've lost (ipv6 only?) access to ze01 | 00:21 |
clarkb | corvus: I'm happy to do it, though maybe fungi would like to if he hasn't done one before? | 00:21 |
clarkb | fungi: ansible ze*.openstack.org -m shell -a 'sytemctl stop zuul-executor' should work iirf | 00:22 |
clarkb | if you fix the typos :/ | 00:22 |
clarkb | then we wait for all of them to stop and switch stop to start | 00:22 |
fungi | seems i'm suddenly able to connect again and finishing double-checking installed versions | 00:23 |
fungi | okay, ze01-ze12 report the expected versions via pbr freeze | 00:25 |
fungi | ze12 doesn't actually seem to reflect any zuul installed according to pbr freeze so digging deeper | 00:25 |
fungi | ze01-ze11 did | 00:25 |
clarkb | fungi: python3 $(which pbr) freeze ? | 00:26 |
clarkb | chances are pbr is currently installed under python2 so it looks there by default | 00:26 |
fungi | pip3 freeze works, so yeah | 00:26 |
*** bobh has joined #openstack-infra | 00:26 | |
fungi | well, chances are pbr is installed under _both_ (otherwise zuul wouldn't install under python3) but the executable entrypoint is the python2 version | 00:27 |
clarkb | right the "binary" executable | 00:27 |
fungi | pbr freeze indicates "zuul==3.3.2.dev32 # git sha e2520b9" on the others and pip3 freeze on ze12 reports "zuul==3.3.2.dev32" so i take that as a match | 00:28 |
fungi | unfortunately `python3 -m pbr freeze` isn't a thing | 00:29 |
fungi | perhaps mordred knows what needs to be added to make that also work | 00:29 |
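A quick sketch of the two version checks being compared here; the sample output is taken from the log above, and which interpreter owns the `pbr` entry point varies by host:

```sh
# The pbr console script may belong to python2, so run it under python3 explicitly.
python3 $(which pbr) freeze | grep -i zuul
# zuul==3.3.2.dev32  # git sha e2520b9

# pip3 freeze answers the same question without pbr, but drops the git sha.
pip3 freeze | grep -i zuul
# zuul==3.3.2.dev32
```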
fungi | anyway, i think we're good for executor restarts if we want | 00:30 |
corvus | i'm ready | 00:30 |
clarkb | I'm still here | 00:31 |
*** jmorgan1 has quit IRC | 00:31 | |
*** bobh has quit IRC | 00:31 | |
*** yamamoto has joined #openstack-infra | 00:32 | |
fungi | okay, so should we be manually restarting executors one-by-one or use the zuul_restart playbook in system-config? | 00:33 |
corvus | fungi: not that one -- it does a full system restart | 00:33 |
corvus | fungi: i'd just do what clarkb suggested above | 00:33 |
clarkb | fungi: I put an example command above that should work `ansible ze*.openstack.org -m shell -a 'systemctl stop zuul-executor'` | 00:34 |
clarkb | that probably needs sudo on bridge.o.o | 00:34 |
fungi | so `ansible ze*.openstack.org -m shell -a 'sytemctl stop zuul-executor'` (modulo typos) | 00:34
clarkb | ya | 00:34 |
fungi | i'll do that now | 00:35 |
fungi | obviously then we start too | 00:35 |
clarkb | fungi: no you wait before starting | 00:35 |
fungi | k | 00:35 |
clarkb | the executors will take some time to completely stop (so we can check ps -elf | grep zuul-executor or whatever incantation you prefer for that sort of thing) | 00:35
corvus | it takes like 15 minutes to stop | 00:35 |
clarkb | I do something like ps -elf | grep zuul | wc -l to get a countdown metric | 00:36 |
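Putting clarkb's pieces together before they are run below, a rough sketch of the whole stop/wait/start cycle from bridge.openstack.org (host pattern and service name come from the discussion; the polling step is illustrative):

```sh
# Stop the executor service everywhere.
sudo ansible 'ze*.openstack.org' -m shell -a 'systemctl stop zuul-executor'

# Poll until the processes drain; this can take ~15 minutes. The grep and its
# shell wrapper keep the count from ever reaching zero, as noted further down.
sudo ansible 'ze*.openstack.org' -m shell -a 'ps -elf | grep zuul- | wc -l'

# Once every host is down to the baseline count, start them back up.
sudo ansible 'ze*.openstack.org' -m shell -a 'systemctl start zuul-executor'
```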
fungi | on bridge.o.o i've run `sudo ansible ze*.openstack.org -m shell -a 'systemctl stop zuul-executor'` | 00:36 |
fungi | it reported "CHANGED" for all 12 executors, which i take as a good sign | 00:37 |
corvus | yep | 00:37 |
corvus | i see ze12 stopping | 00:37 |
corvus | i have already rsynced the necessary data, so the ze12 switcheroo shouldn't take long | 00:38 |
corvus | ze12 has stopped | 00:38 |
fungi | awesome | 00:39 |
clarkb | ze01 is down to 82 processes by my earlier command | 00:39 |
clarkb | now 80, so trending in the expected direction | 00:39 |
fungi | yeah, `sudo ansible ze*.openstack.org -m shell -a 'ps -elf | grep zuul | 00:40 |
fungi | | wc -l'` | 00:40 |
*** tosky has quit IRC | 00:40 | |
corvus | ze12 is ready to go | 00:40 |
fungi | is returning pretty high numbers (disregard the stray newline) | 00:40 |
fungi | returning 52 for ze12 | 00:41 |
clarkb | fungi: it likely won't go to zero fwiw due to ssh control persist processes being stubborn | 00:41
fungi | k | 00:41 |
corvus | grep for "zuul-" instead | 00:41 |
fungi | ahh | 00:41 |
corvus | as in "zuul-executor" | 00:41 |
fungi | much more reasonable | 00:41 |
fungi | 4 on all but ze12 which returns 2 | 00:42 |
fungi | 2 is the new 0? | 00:42 |
clarkb | fungi: according to puppet: yes | 00:42 |
corvus | with grep, i think so :) | 00:42 |
fungi | k | 00:42 |
corvus | ze12 has been stopped for a while | 00:42 |
fungi | i get it ;) | 00:42 |
clarkb | ze01 looks stopped | 00:43 |
fungi | several of them are returning 2 now, yes | 00:43 |
fungi | , 05, 08 and 10 still going | 00:44 |
fungi | now just 04 and 05 | 00:44 |
*** yamamoto has quit IRC | 00:44 | |
*** mriedem has quit IRC | 00:45 | |
fungi | and now just 05 left | 00:46 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: doc: new testing-behind-proxy.rst; tox.ini: passenv = http[s]_proxy https://review.openstack.org/623361 | 00:47 |
*** yamamoto has joined #openstack-infra | 00:47 | |
fungi | 100% stopped now | 00:48 |
corvus | i cleaned up all old build dirs | 00:48 |
fungi | ready to start, or wait? | 00:48 |
corvus | fungi: ready | 00:48 |
fungi | finger hovering over the button | 00:48 |
fungi | clarkb: all clear? | 00:48 |
clarkb | fungi: ya | 00:49 |
fungi | running | 00:49 |
corvus | ze12 seems happy | 00:49 |
clarkb | I see new executor on ze01 | 00:49 |
fungi | should all be starting up now | 00:49 |
*** Swami has quit IRC | 00:49 | |
fungi | `ps -elf | grep zuul- | wc -l` is returning 4 for all | 00:49 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: CONTRIBUTING.rst, HACKING.rst: fix broken link, minor flow updates https://review.openstack.org/623362 | 00:50 |
clarkb | swap has been stable so far | 00:51 |
corvus | looks like the restart did clear out swap usage too, so it should be easy to compare | 00:52 |
corvus | (ie, swap went to 0 at restart) | 00:52 |
mwhahaha | if anyone is around to promote https://review.openstack.org/#/c/623293/ in the tripleo gate that would be helpful (to stop resets due to nested virt crashes) | 00:54 |
corvus | looks pretty good; i'm going to eod now | 00:54 |
* mwhahaha wanders off | 00:54 | |
fungi | mwhahaha: i've promoted it now | 00:57 |
mwhahaha | Thanks | 00:58 |
*** rkukura has quit IRC | 00:59 | |
clarkb | there are no more queued jobs now | 01:06 |
*** bobh has joined #openstack-infra | 01:06 | |
clarkb | no more executor queued jobs | 01:07 |
clarkb | there may be jobs the scheduler is waiting for nodes on | 01:07 |
*** gyee has quit IRC | 01:09 | |
*** bobh has quit IRC | 01:10 | |
pabelanger | ze12.o.o looks to be running a different kernel for some reason | 01:11 |
pabelanger | Linux ze12 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 01:11 |
pabelanger | this is from ze01 | 01:11 |
pabelanger | Linux ze01 4.15.0-42-generic #45~16.04.1-Ubuntu SMP Mon Nov 19 13:02:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 01:11 |
clarkb | hrm how do we apply the hwe kernel? | 01:12 |
clarkb | I thought we puppeted that | 01:12 |
clarkb | maybe we haven't rebooted since puppet ran | 01:12 |
clarkb | I bet that is it | 01:12 |
pabelanger | yah, maybe | 01:12 |
clarkb | since now puppet is a step after launch | 01:12 |
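If the difference really is just a missed reboot after the HWE kernel landed, the usual fix on a Xenial host looks roughly like this (a sketch, assuming Ubuntu 16.04 and that the executor has been stopped first):

```sh
# Compare the running kernel with what is installed.
uname -r
dpkg -l | grep linux-generic

# Install the 16.04 hardware-enablement kernel if it is missing.
sudo apt-get update && sudo apt-get install linux-generic-hwe-16.04

# Reboot to pick up the new kernel.
sudo reboot
```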
pabelanger | but, it seems ze12.o.o is running more jobs right now | 01:12
pabelanger | and using less ram | 01:12 |
pabelanger | unsure if related to kernel or just smaller jobs | 01:13 |
clarkb | pabelanger: the jobs definitely have an impact on executor memory | 01:13 |
clarkb | so could be distribution of "expensive" jobs | 01:13 |
pabelanger | yah, the HDD usage on ze12 is low also | 01:13 |
pabelanger | so, likely smaller projects | 01:13 |
*** kjackal has quit IRC | 01:14 | |
clarkb | ze01 swap use is still pretty stable | 01:18 |
clarkb | ze12 is apparently swapping | 01:18 |
clarkb | which is maybe not surprising since the old kernel has/had issues with swap | 01:18 |
pabelanger | yah, we should stop it and reboot if we wanted | 01:19 |
pabelanger | but ze01.o.o swap looks good | 01:20 |
clarkb | I've got to go to a birthday dinner so I can't do that now but can help in the morning if we want to do that | 01:20 |
pabelanger | wfm | 01:20 |
*** jamesmcarthur has joined #openstack-infra | 01:21 | |
*** jamesmcarthur has quit IRC | 01:24 | |
*** jamesmcarthur has joined #openstack-infra | 01:25 | |
*** rkukura has joined #openstack-infra | 01:40 | |
*** yamamoto has quit IRC | 01:49 | |
*** bobh has joined #openstack-infra | 01:51 | |
*** wolverineav has quit IRC | 01:55 | |
*** bobh has quit IRC | 01:55 | |
*** betherly has joined #openstack-infra | 01:59 | |
*** dave-mccowan has quit IRC | 02:00 | |
*** jamesmcarthur has quit IRC | 02:01 | |
*** betherly has quit IRC | 02:04 | |
*** bobh has joined #openstack-infra | 02:10 | |
*** mrsoul has joined #openstack-infra | 02:12 | |
*** bobh has quit IRC | 02:15 | |
lbragstad | awesome write up to the mailing list clarkb | 02:26 |
*** hongbin has joined #openstack-infra | 02:41 | |
*** bobh has joined #openstack-infra | 02:47 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 02:49 |
*** bobh has quit IRC | 02:51 | |
*** psachin has joined #openstack-infra | 02:53 | |
*** imacdonn has quit IRC | 02:53 | |
*** imacdonn has joined #openstack-infra | 02:53 | |
*** betherly has joined #openstack-infra | 03:01 | |
*** bhavikdbavishi has joined #openstack-infra | 03:02 | |
*** betherly has quit IRC | 03:05 | |
*** bobh has joined #openstack-infra | 03:06 | |
*** rh-jelabarre has quit IRC | 03:08 | |
*** bobh has quit IRC | 03:10 | |
*** dave-mccowan has joined #openstack-infra | 03:11 | |
*** dave-mccowan has quit IRC | 03:19 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 03:20 |
*** jamesmcarthur has joined #openstack-infra | 03:23 | |
*** diablo_rojo has quit IRC | 03:25 | |
*** jamesmcarthur has quit IRC | 03:41 | |
*** bobh has joined #openstack-infra | 03:43 | |
*** bobh has quit IRC | 03:48 | |
*** ramishra has joined #openstack-infra | 03:50 | |
*** ykarel|away has joined #openstack-infra | 03:52 | |
*** bobh has joined #openstack-infra | 04:01 | |
*** jamesmcarthur has joined #openstack-infra | 04:02 | |
*** bobh has quit IRC | 04:06 | |
*** jamesmcarthur has quit IRC | 04:07 | |
*** neilsun has joined #openstack-infra | 04:15 | |
*** lbragstad has quit IRC | 04:23 | |
openstackgerrit | Jea-Min, Lim proposed openstack-infra/project-config master: add new project called ku.stella this project is unofficial openstack project https://review.openstack.org/623396 | 04:30 |
*** bobh has joined #openstack-infra | 04:37 | |
*** psachin has quit IRC | 04:41 | |
*** bobh has quit IRC | 04:42 | |
*** wolverineav has joined #openstack-infra | 04:44 | |
*** ykarel|away has quit IRC | 04:46 | |
*** janki has joined #openstack-infra | 04:46 | |
mrhillsman | is there a dib element for running a custom script? | 04:49 |
*** bobh has joined #openstack-infra | 04:56 | |
*** psachin has joined #openstack-infra | 04:58 | |
*** bobh has quit IRC | 05:01 | |
*** ykarel|away has joined #openstack-infra | 05:04 | |
ianw | mrhillsman: umm, every dib element is a custom script? | 05:05 |
ianw | that's kind of the point of them :) | 05:05 |
*** bobh has joined #openstack-infra | 05:14 | |
*** bobh has quit IRC | 05:19 | |
*** wolverineav has quit IRC | 05:21 | |
*** wolverineav has joined #openstack-infra | 05:32 | |
*** bobh has joined #openstack-infra | 05:32 | |
*** wolverineav has quit IRC | 05:36 | |
*** bobh has quit IRC | 05:36 | |
*** bobh has joined #openstack-infra | 05:51 | |
*** bobh has quit IRC | 05:55 | |
prometheanfire | I'm guessing it's been a busy day/week? | 06:02 |
*** bobh has joined #openstack-infra | 06:10 | |
*** bobh has quit IRC | 06:14 | |
*** bhavikdbavishi has quit IRC | 06:18 | |
*** adam_zhang has joined #openstack-infra | 06:22 | |
*** bobh has joined #openstack-infra | 06:28 | |
*** hongbin has quit IRC | 06:32 | |
*** bobh has quit IRC | 06:33 | |
*** kjackal has joined #openstack-infra | 06:42 | |
*** adam_zhang has quit IRC | 06:44 | |
*** bobh has joined #openstack-infra | 06:46 | |
*** bobh has quit IRC | 06:51 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 06:52 |
*** hwoarang has quit IRC | 06:54 | |
*** hwoarang has joined #openstack-infra | 06:56 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 07:01 |
*** rcernin has quit IRC | 07:01 | |
*** bobh has joined #openstack-infra | 07:03 | |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor status page to use a reducer https://review.openstack.org/621395 | 07:04 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor jobs page to use a reducer https://review.openstack.org/621396 | 07:06 |
*** yamamoto has joined #openstack-infra | 07:07 | |
*** bobh has quit IRC | 07:07 | |
*** betherly has joined #openstack-infra | 07:08 | |
*** dklyle has quit IRC | 07:09 | |
*** dklyle has joined #openstack-infra | 07:10 | |
*** yamamoto has quit IRC | 07:11 | |
*** betherly has quit IRC | 07:13 | |
*** pgaxatte has joined #openstack-infra | 07:19 | |
*** bobh has joined #openstack-infra | 07:21 | |
*** dpawlik has joined #openstack-infra | 07:24 | |
*** bobh has quit IRC | 07:26 | |
*** kjackal has quit IRC | 07:28 | |
*** jtomasek has joined #openstack-infra | 07:28 | |
*** ykarel|away is now known as ykarel | 07:35 | |
*** alexchadin has joined #openstack-infra | 07:35 | |
*** bobh has joined #openstack-infra | 07:38 | |
*** bobh has quit IRC | 07:43 | |
*** dims has quit IRC | 07:44 | |
*** dims has joined #openstack-infra | 07:47 | |
*** kjackal has joined #openstack-infra | 07:54 | |
*** bobh has joined #openstack-infra | 07:56 | |
*** bobh has quit IRC | 08:01 | |
*** ykarel is now known as ykarel|lunch | 08:02 | |
*** bobh has joined #openstack-infra | 08:14 | |
*** bobh has quit IRC | 08:19 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Use combined status for Github status checks https://review.openstack.org/623417 | 08:23 |
*** bobh has joined #openstack-infra | 08:30 | |
*** bobh has quit IRC | 08:35 | |
tobias-urdin | tonyb: thank you tony! | 08:38 |
*** jpena|off is now known as jpena | 08:43 | |
*** bobh has joined #openstack-infra | 08:48 | |
*** shardy has joined #openstack-infra | 08:52 | |
*** bobh has quit IRC | 08:53 | |
*** jpich has joined #openstack-infra | 08:57 | |
*** ykarel|lunch is now known as ykarel | 09:00 | |
*** yamamoto has joined #openstack-infra | 09:04 | |
*** bobh has joined #openstack-infra | 09:07 | |
*** ccamacho has joined #openstack-infra | 09:09 | |
*** ccamacho has quit IRC | 09:09 | |
*** bobh has quit IRC | 09:11 | |
tobiash | clarkb, corvus: when I look at grafana I think the starting builds graph looks odd. During the times when all executors deregistered it's constantly at 5. So maybe something changed in the starting jobs phase. | 09:14 |
*** ccamacho has joined #openstack-infra | 09:15 | |
tobiash | clarkb, corvus: to me it looks like job starting (maybe repo setup or gathering facts) is slower. So the swapping could be a red herring. | 09:16 |
*** bobh has joined #openstack-infra | 09:23 | |
*** alexchadin has quit IRC | 09:24 | |
*** bobh has quit IRC | 09:27 | |
*** bobh has joined #openstack-infra | 09:37 | |
*** gfidente has joined #openstack-infra | 09:37 | |
*** bobh has quit IRC | 09:41 | |
*** bobh has joined #openstack-infra | 09:51 | |
*** bobh has quit IRC | 09:55 | |
*** e0ne has joined #openstack-infra | 09:59 | |
*** electrofelix has joined #openstack-infra | 10:02 | |
*** jamesmcarthur has joined #openstack-infra | 10:03 | |
*** verdurin has quit IRC | 10:04 | |
*** jamesmcarthur has quit IRC | 10:07 | |
*** verdurin has joined #openstack-infra | 10:07 | |
*** bobh has joined #openstack-infra | 10:09 | |
*** bobh has quit IRC | 10:13 | |
stephenfin | I've noticed that I seem to be getting signed out of Gerrit each day. Has something changed in the past ~3 weeks? | 10:19 |
*** yamamoto has quit IRC | 10:22 | |
*** dpawlik has quit IRC | 10:23 | |
*** dpawlik has joined #openstack-infra | 10:23 | |
*** bobh has joined #openstack-infra | 10:24 | |
*** bhavikdbavishi has joined #openstack-infra | 10:27 | |
*** bobh has quit IRC | 10:29 | |
*** ccamacho has quit IRC | 10:31 | |
*** agopi is now known as agopi-pto | 10:38 | |
frickler | stephenfin: no changes that I know of, and I seem to stay logged in as long as I'm active once per day | 10:40 |
*** ccamacho has joined #openstack-infra | 10:41 | |
stephenfin | frickler: Ack. Must be a client side issue so. I'll investigate. Thanks! :) | 10:41 |
frickler | amorin: infra-root: I just saw a gate failure with three simultaneous timed_out jobs all on bhs1, so it seems that something is still not good there, even with the reduced load | 10:43 |
*** agopi-pto has quit IRC | 10:44 | |
*** bobh has joined #openstack-infra | 10:49 | |
*** yamamoto has joined #openstack-infra | 10:50 | |
*** pbourke has quit IRC | 10:50 | |
*** bobh has quit IRC | 10:53 | |
*** yamamoto has quit IRC | 10:56 | |
*** eernst has joined #openstack-infra | 10:58 | |
*** wolverineav has joined #openstack-infra | 11:00 | |
*** bobh has joined #openstack-infra | 11:01 | |
*** wolverineav has quit IRC | 11:05 | |
*** bobh has quit IRC | 11:06 | |
*** eernst has quit IRC | 11:08 | |
*** kjackal has quit IRC | 11:15 | |
*** bobh has joined #openstack-infra | 11:20 | |
*** bobh has quit IRC | 11:24 | |
*** pbourke has joined #openstack-infra | 11:27 | |
*** kjackal has joined #openstack-infra | 11:33 | |
*** sshnaidm|afk is now known as sshnaidm|off | 11:33 | |
*** yamamoto has joined #openstack-infra | 11:36 | |
*** bobh has joined #openstack-infra | 11:37 | |
*** bobh has quit IRC | 11:41 | |
*** yamamoto has quit IRC | 11:46 | |
openstackgerrit | Jens Harbott (frickler) proposed openstack-infra/project-config master: Disable ovh bhs1 and gra1 https://review.openstack.org/623457 | 11:46 |
frickler | amorin: infra-root: ^^ this would be my measure of last resort, unless you have a better idea. but gate queues seem to effectively be stuck due to the large number of timeouts and queue resets | 11:48
*** gfidente has quit IRC | 11:49 | |
*** bobh has joined #openstack-infra | 11:49 | |
*** tosky has joined #openstack-infra | 11:52 | |
*** bobh has quit IRC | 11:53 | |
*** dtantsur|afk is now known as dtantsur\ | 11:54 | |
*** dtantsur\ is now known as dtantsur | 11:54 | |
*** slaweq has joined #openstack-infra | 12:03 | |
*** yamamoto has joined #openstack-infra | 12:14 | |
*** bobh has joined #openstack-infra | 12:15 | |
*** bobh has quit IRC | 12:20 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add timer for starting_builds https://review.openstack.org/623468 | 12:24 |
tobiash | corvus, clarkb: having metrics about the job startup times could be helpful ^ | 12:24 |
*** e0ne has quit IRC | 12:27 | |
fungi | stephenfin: there's a long standing (mis?)behavior with gerrit where if you have multiple tabs open and you try to use one you've gotten inadvertently signed out of and sign it back in, that will invalidate the session token used by other gerrit tabs in your browser. workaround is to make sure one is signed in and working, then go around and force refresh all your other gerrit tabs _before_ trying to | 12:38
fungi | click anything in them | 12:38 |
*** tobiash has quit IRC | 12:38 | |
*** rh-jelabarre has joined #openstack-infra | 12:39 | |
*** bobh has joined #openstack-infra | 12:39 | |
stephenfin | fungi: Ahhhh, I've seen that. It's probably the fact I have tabs open from before vacation that's throwing me so | 12:39 |
* stephenfin goes and refreshes everything manually | 12:39 | |
*** kaisers has quit IRC | 12:39 | |
fungi | frickler: are you sure the gate's stuck? we seem to be merging 10-15 changes an hour. looks to me like we're approving changes faster than we can get them through | 12:40 |
panda | is zuul in infra deployed using container images build with pbrx ? | 12:40 |
fungi | panda: not yet, no | 12:40 |
fungi | still deployed with the puppet-zuul module | 12:40 |
*** jpena is now known as jpena|lunch | 12:40 | |
*** bhavikdbavishi has quit IRC | 12:41 | |
*** psachin has quit IRC | 12:41 | |
panda | fungi: but is planned to be ? | 12:41 |
fungi | panda: i believe so, yes. https://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html outlines the transition plan | 12:43 |
fungi | the containers section mentions "For our Python services, a new tool is in work, pbrx, which has a command for making single-process containers from pbr setup.cfg files and bindep.txt." | 12:44 |
panda | fungi: ah, the link I was looking for. Thanks! | 12:48 |
*** ahosam has joined #openstack-infra | 12:49 | |
fungi | always happy when i can point someone to actual documentation! | 12:49 |
*** tobiash_ has joined #openstack-infra | 12:52 | |
*** tobiash has joined #openstack-infra | 12:53 | |
*** kaisers has joined #openstack-infra | 12:56 | |
*** bobh has quit IRC | 12:56 | |
*** bobh has joined #openstack-infra | 12:57 | |
*** tobiash has quit IRC | 13:02 | |
*** tobiash has joined #openstack-infra | 13:02 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add timer for starting_builds https://review.openstack.org/623468 | 13:03 |
frickler | fungi: most changes that merge seem to be outside of integrated or tripleo queues. still things seem to work a bit better currently, will wait for the next set of results from integrated queue | 13:05 |
fungi | clarkb was helping track down a lot of failures which weren't related to performance in ovh regions | 13:08 |
*** boden has joined #openstack-infra | 13:08 | |
fungi | which were impacting the tripleo and integrated gate queues | 13:08 |
*** gfidente has joined #openstack-infra | 13:08 | |
frickler | I added the timeouts that I saw today at the bottom of https://etherpad.openstack.org/p/bhs1-test-node-slowness and they all were in ovh | 13:09 |
frickler | but I'm also fine with waiting to see how things evolve over the weekend | 13:10 |
fungi | i can run some more analysis of timeouts seen by logstash and break them down by provider weighted by max-servers in each | 13:11 |
*** chandan_kumar is now known as chkumar|off | 13:12 | |
*** e0ne has joined #openstack-infra | 13:13 | |
*** jamesmcarthur has joined #openstack-infra | 13:18 | |
*** bobh has quit IRC | 13:22 | |
*** EmilienM is now known as EvilienM | 13:22 | |
*** bobh has joined #openstack-infra | 13:24 | |
*** janki has quit IRC | 13:24 | |
frickler | fungi: https://ethercalc.openstack.org/jg8f4p7jow5o , seems to point to bhs1 still being bad, so maybe disable only that region | 13:25 |
frickler | that's from the results for the last 12h on http://logstash.openstack.org/#/dashboard/file/logstash.json?query=(message:%20%5C%22FAILED%20with%20status:%20137%5C%22%20OR%20message:%20%5C%22FAILED%20with%20status:%20143%5C%22%20OR%20message:%20%5C%22RUN%20END%20RESULT_TIMED_OUT%5C%22)%20AND%20tags:%20%5C%22console%5C%22%20AND%20voting:1&from=864000s | 13:26 |
*** bobh has quit IRC | 13:29 | |
fungi | frickler: thanks, and yeah that should only be since the most recent max-servers change to bhs1 so presumably fairly accurate | 13:31 |
fungi | i agree no need to drop gra1, it's doing better on this than all the rackspace regions according to your analysis | 13:31 |
fungi | hard to know if vexxhost-ca-ymq-1 is statistically significant there | 13:33 |
fungi | so i wouldn't go drawing any conclusions from that one | 13:34 |
*** jamesmcarthur has quit IRC | 13:35 | |
*** neilsun has quit IRC | 13:36 | |
*** jpena|lunch is now known as jpena | 13:37 | |
*** sshnaidm|off has quit IRC | 13:37 | |
*** bobh has joined #openstack-infra | 13:39 | |
*** jcoufal has joined #openstack-infra | 13:41 | |
*** eumel8 has joined #openstack-infra | 13:42 | |
*** edmondsw has quit IRC | 13:42 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Re-activate translation job for Trove https://review.openstack.org/623492 | 13:42 |
openstackgerrit | Frank Kloeker proposed openstack-infra/openstack-zuul-jobs master: Add Trove to project doc translation https://review.openstack.org/623493 | 13:43 |
*** bobh has quit IRC | 13:43 | |
*** rlandy has joined #openstack-infra | 13:44 | |
*** tobiash has left #openstack-infra | 13:44 | |
*** tobiash_ is now known as tobiash | 13:45 | |
*** alexchadin has joined #openstack-infra | 13:47 | |
*** jamesmcarthur has joined #openstack-infra | 13:49 | |
frickler | fungi: vexxhost-ca-ymq-1 seems to be only special nodes for kata, so that shouldn't be relevant. | 13:50 |
*** bobh has joined #openstack-infra | 13:53 | |
fungi | agreed | 13:56 |
*** bobh has quit IRC | 13:58 | |
*** jpich has quit IRC | 14:02 | |
*** jpich has joined #openstack-infra | 14:03 | |
*** kgiusti has joined #openstack-infra | 14:04 | |
*** dave-mccowan has joined #openstack-infra | 14:05 | |
*** lbragstad has joined #openstack-infra | 14:06 | |
*** eharney has quit IRC | 14:06 | |
openstackgerrit | Jens Harbott (frickler) proposed openstack-infra/project-config master: Disable ovh bhs1 https://review.openstack.org/623457 | 14:06 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Add nodepool.host_id variable to inventory file https://review.openstack.org/623496 | 14:07 |
*** dave-mccowan has quit IRC | 14:10 | |
openstackgerrit | Merged openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107 | 14:10 |
*** edmondsw has joined #openstack-infra | 14:12 | |
mnaser | morning infra-root | 14:19 |
mnaser | we're having massive issues with our centos-7 builds with 7.6 being out | 14:19 |
mnaser | a lot of timeouts and really slow operations | 14:19 |
mordred | mnaser: how spectacular | 14:19 |
mnaser | could i request a node hold? | 14:19 |
mnaser | mordred: stable enterprise os they said | 14:20 |
mnaser | almost everything is timing out, it looks like slow io.. or slow network.. something is slow | 14:20 |
mnaser | i dont know what | 14:20 |
mordred | mnaser: yeah. value in not changing they said | 14:20 |
dmsimard | mnaser: which job for which project you'd like to hold ? | 14:20 |
mnaser | dmsimard: can we have it hold based on job? we have `openstack-ansible-functional-centos-7` or `openstack-ansible-functional-distro_install-centos-7` on any project | 14:21 |
mnaser | all roles are broken | 14:21 |
dmsimard | mnaser: yes, it can be based on job | 14:21 |
mnaser | dmsimard: awesome, any of those two should be ok. | 14:21 |
dmsimard | mnaser: what project? openstack/openstack-ansible ? | 14:22 |
dmsimard | I know you said any project, but I need one :p | 14:23 |
mnaser | dmsimard: it is affecting all of our roles so anything openstack/openstack-ansible-* .. the integrated is broken so i can find a job from openstack/openstack-ansible | 14:23 |
mnaser | ah if its any project then lets just do openstack/openstack-ansible but i have different job name for you | 14:23 |
dmsimard | ok | 14:23 |
mnaser | openstack-ansible-deploy-aio_lxc-centos-7 | 14:23 |
mnaser | and/or openstack-ansible-deploy-aio_metal-centos-7 | 14:23 |
dmsimard | mnaser: the holds are set, next time there's a failure they will be available | 14:24 |
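For context, an autohold like the ones just set is created on the Zuul scheduler with something along these lines (a sketch; the reason string and count are illustrative):

```sh
# Hold the nodes of the next failing build of this job for debugging.
sudo zuul autohold --tenant openstack \
    --project openstack/openstack-ansible \
    --job openstack-ansible-deploy-aio_lxc-centos-7 \
    --reason "mnaser debugging centos 7.6 slowness" \
    --count 1
```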
mnaser | dmsimard: cool, will they catch an already running job that might fail | 14:25
mnaser | or will it have to be one started now? | 14:25
dmsimard | not 100% sure | 14:25 |
mnaser | one literally just failed now, i wonder if were quick enough =P | 14:25 |
dmsimard | let me see | 14:25 |
mnaser | https://review.openstack.org/#/c/618711/ 1 minute ago | 14:25 |
*** mriedem has joined #openstack-infra | 14:25 | |
*** lbragstad has quit IRC | 14:25 | |
dmsimard | yeah it triggered the autohold, one min | 14:25 |
mnaser | perfect timing woo | 14:27 |
dmsimard | mnaser: root@104.130.117.58 | 14:28 |
dmsimard | mnaser: it's in rax-ord | 14:28 |
mnaser | awesome, works, thank you | 14:28 |
mnaser | ssh taking ages to log me in | 14:28 |
mnaser | so thats a good start | 14:28 |
*** lbragstad has joined #openstack-infra | 14:29 | |
dmsimard | looks like ansible-playbook is still running | 14:29 |
mnaser | yeah, the job didn't fail but timed out (it really shouldn't take this long) | 14:29 |
fungi | mnaser: any chance you're seeing these slow runs just in ovh-bhs1? i think we're about to disable it again because of a disproportionate amount of job timeouts there | 14:30 |
mnaser | fungi: i've seen some but in this case it's actually been centos and i've noticed it on rax-ord | 14:30 |
mnaser | which in my experience is a stable region but i dunno if 7.6 made weird things | 14:30 |
fungi | okay, so some more systemic issue i suppose | 14:30 |
dmsimard | mnaser: setup-infrastructure is 56 minutes in your patch: http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/ | 14:30 |
mnaser | thing is even log collection | 14:31 |
dmsimard | it's 28 minutes on opensuse | 14:31 |
mnaser | times out after 30 minutes | 14:31 |
mnaser | so something is strange on centos machines | 14:31 |
*** jamesmcarthur has quit IRC | 14:31 | |
mnaser | we cant even collect logs because we timeout | 14:31 |
mnaser | http://logs.openstack.org/42/614342/4/check/openstack-ansible-functional-centos-7/517947d/job-output.txt.gz#_2018-12-06_17_57_16_163687 | 14:32 |
mnaser | have a look at that | 14:32 |
dmsimard | mnaser: looking at the ara report for that timeout'd job, there's definitely something going on | 14:33 |
guilhermesp | yeah mostly yesterday, around 3 rechecks with timeout collecting logs http://logs.openstack.org/20/618820/48/check/openstack-ansible-functional-centos-7/ca0f625/job-output.txt.gz#_2018-12-07_03_13_17_201631 | 14:33 |
mnaser | http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/result/143042ae-27a8-4601-8506-1f4f7bea56a6/ | 14:33 |
dmsimard | mnaser: templating a file shouldn't take >5 minutes | 14:34 |
dmsimard | creating files/directories too | 14:34 |
mnaser | yeah.. | 14:34 |
mnaser | fungi: what was the dd command you've been using? | 14:34 |
mordred | wow. the systemd log verification failure is nice | 14:34 |
mnaser | :D | 14:34 |
mnaser | i cant remember if hwoarang or cloudnull worked on that | 14:35 |
fungi | mnaser: i took it from your swapfile setup example log: sudo dd if=/dev/zero of=/foo bs=1M count=4096 | 14:35 |
mnaser | but one is dealing with a new born and the other is enjoying far east asia | 14:35 |
mnaser | so dont think they can comment too much :) | 14:35 |
mnaser | ok running that to see what sort of numbers i get | 14:35 |
*** wolverineav has joined #openstack-infra | 14:36 | |
dmsimard | taking 1m30s to create a directory http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/file/aee893b3-c5f4-4c21-943a-367431ef584c/#line-46 ... | 14:36 |
mnaser | 4294967296 bytes (4.3 GB) copied, 9.86048 s, 436 MB/s | 14:37 |
mnaser | hmm | 14:37 |
mnaser | see whats interesting is | 14:37 |
mnaser | when you login, it takes a while to get a terminal | 14:37 |
mnaser | im wondering if that has to do with it.. | 14:37 |
mnaser | pam_systemd(sshd:session): Failed to create session: Failed to activate service 'org.freedesktop.login1': timed out | 14:37 |
mordred | wow | 14:38 |
*** eernst has joined #openstack-infra | 14:38 | |
fungi | looks like we lost a merger around 09:00z | 14:38 |
*** bobh has joined #openstack-infra | 14:38 | |
logan- | following up on the nested virt issues we were seeing in limestone over the past 3-4 days: all of the HVs have been upgraded from xenial hwe kernel 4.15.0.34.56 to 4.15.0.42.63 now and my jobs that require nested virt are happy again, so I assume others are also. | 14:38
fungi | i'll see if i can figure out which merger is missing | 14:38 |
*** wolverineav has quit IRC | 14:41 | |
fungi | zuul-merger is running on all the dedicated zm hosts | 14:42 |
dmsimard | mnaser: not sure if related, but watching the processes, sudo commands seem to get stuck a lot? | 14:42
dmsimard | i.e, http://paste.openstack.org/raw/736820/ | 14:42 |
mnaser | dmsimard: that 100% is related to that logind stuff | 14:42 |
mnaser | journalctl | grep login1 | 14:43 |
dmsimard | oh yeah it's obvious now | 14:43 |
*** takamatsu has joined #openstack-infra | 14:43 | |
*** alexchadin has quit IRC | 14:43 | |
mnaser | now i dont know if we're breaking it or if 7.6 broke it | 14:43 |
fungi | actually, have we lost 4 mergers? we should have 20, right? 8 dedicated and 12 accessory to the executors? | 14:44 |
fungi | http://grafana.openstack.org/d/T6vSHcSik/zuul-status?panelId=30&fullscreen&orgId=1&from=now%2Fd&to=now%2Fd | 14:44 |
fungi | yep | 14:44 |
dmsimard | mnaser: from http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 | 14:45 |
dmsimard | mnaser: wrong name, meant fungi sorry :p | 14:45 |
mnaser | figured :) | 14:45 |
dmsimard | mnaser: in systemd-logind journal there's "Failed to abandon session scope: Transport endpoint is not connected" which leads to https://github.com/systemd/systemd/issues/2925 | 14:45 |
mnaser | dmsimard: thats super useful | 14:46 |
mnaser | it looks like something happened at 12:14:52 which restarted dbus | 14:46 |
mnaser | and since then it never came back | 14:46 |
frickler | "Restarting dbus is not supported generally. That disconnects all clients, and the system generally cannot recover from that. This is a dbus limitation." - poettering | 14:48 |
fungi | ze12 seems to have registered an oom-killer event in dmesg at 23:21:17z | 14:48 |
mnaser | looking here https://github.com/openstack/openstack-ansible-lxc_hosts/blob/6eee41f123dd49d73ad2851b878c11efd6cfffa2/tasks/lxc_cache_preparation_systemd_old.yml | 14:48 |
mnaser | ill move this to #openstack-ansible | 14:48 |
*** yamamoto has quit IRC | 14:49 | |
fungi | the other executors don't indicate any recent oom events, but then again they turn over their dmesg ring buffers rather rapidly | 14:49 |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Add translation job for storyboard https://review.openstack.org/623508 | 14:50 |
fungi | not super urgent as the mergers seem to be keeping up. occasional spikes of 10-20 queued but they clear quickly | 14:50 |
eumel8 | fungi, mordred: ^^ We want to start with storyboard translation in that cycle. | 14:51 |
fungi | but a little worried that the merger threads on the executors may be dying inexplicably | 14:51 |
*** sshnaidm has joined #openstack-infra | 14:52 | |
*** jcoufal has quit IRC | 14:54 | |
*** jamesmcarthur has joined #openstack-infra | 14:57 | |
*** jcoufal has joined #openstack-infra | 14:58 | |
*** diablo_rojo has joined #openstack-infra | 14:58 | |
*** ccamacho has quit IRC | 15:00 | |
*** armstrong has joined #openstack-infra | 15:00 | |
*** sshnaidm has quit IRC | 15:03 | |
*** slaweq has quit IRC | 15:04 | |
*** eharney has joined #openstack-infra | 15:05 | |
*** ramishra has quit IRC | 15:06 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Add translation job for storyboard https://review.openstack.org/623508 | 15:09 |
*** ykarel is now known as ykarel|away | 15:11 | |
pabelanger | fungi: ze12 needs to be rebooted as it is running a non HWE kernel. The only executor that is different | 15:12 |
AJaeger | eumel8: please ask storyboard cores to review that change and +1 | 15:12 |
*** dpawlik has quit IRC | 15:12 | |
fungi | pabelanger: yep, i assume that plays into the oom situation on that server | 15:13 |
eumel8 | AJaeger: thx | 15:14 |
mnaser | mriedem: i see you do this often, but how do you take a bug and list it across affecting different releases? | 15:15 |
mnaser | inside launchpad | 15:15 |
fungi | mnaser: you have to be in the bug supervisor group for those projects, i think | 15:16 |
mriedem | right | 15:16 |
mriedem | "Target to series" | 15:16 |
fungi | and if you are, there's a little icon where you can get into the project details for a given bugtask and add specific series | 15:16 |
mriedem | and the series has to be managed properly in launchpad | 15:17 |
mnaser | ok i see that, alright, looks like we need to add the new releases because it looks like the last series we have there is newton | 15:17
mriedem | yup https://launchpad.net/nova/+series | 15:17 |
mriedem | lots of projects in lp don't track the series | 15:17 |
fungi | yeah, it's rather a bit of setup so unless it's something you expect to make a lot of use of i'm not sure i'd bother | 15:17 |
mnaser | ack | 15:17 |
*** ykarel|away has quit IRC | 15:18 | |
*** ccamacho has joined #openstack-infra | 15:18 | |
dmsimard | For the record, we have identified what was causing the OSA CentOS timeouts and they're testing a fix right now. A task ended up restarting dbus and this apparently causes issues with systemd-logind which leads to 25s timeouts for every ssh, sudo (and ansible!) command | 15:20
dmsimard | It seems like a well documented bug and the consensus appears to be that dbus isn't meant to be restarted, it should only be reloaded if need be | 15:20 |
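A minimal sketch of the checks and the reload-instead-of-restart approach being described; the symptoms and the restart-vs-reload distinction come from the discussion above, while whether a reload is actually sufficient for the role's original purpose is an assumption:

```sh
# Symptoms: every ssh/sudo stalls ~25s and the journal shows logind failures.
journalctl -u systemd-logind --since=-1h | grep -i 'Failed to'
journalctl _COMM=sshd | grep 'Failed to create session'

# dbus generally cannot be restarted on a live system without breaking clients
# such as systemd-logind; reload its configuration instead.
sudo systemctl reload dbus    # rather than: sudo systemctl restart dbus
```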
mordred | dmsimard: awesome news! | 15:23 |
mordred | dmsimard: yay for debugging and finding errors | 15:23 |
dmsimard | mnaser: do you still need those two nodes ? | 15:23 |
mnaser | dmsimard: oh can i have access to the baremetal one? | 15:23 |
dmsimard | mnaser: yeah | 15:23 |
dmsimard | sec | 15:23 |
mnaser | we shouldn't run lxc_hosts on the metal jobs | 15:24 |
dmsimard | actually, 25secs :P | 15:24 |
dmsimard | mnaser: root@104.239.173.194 | 15:24 |
mnaser | aw man | 15:25 |
mnaser | i hope we're not running lxc_hosts on non-container jobs | 15:25 |
*** yamamoto has joined #openstack-infra | 15:26 | |
*** yamamoto has quit IRC | 15:29 | |
*** sshnaidm has joined #openstack-infra | 15:29 | |
*** yamamoto has joined #openstack-infra | 15:29 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Re-activate translation job for Trove https://review.openstack.org/623492 | 15:30 |
fungi | if any infra-puppet-core has a moment to review https://review.openstack.org/623290 the storyboard team would appreciate it | 15:31 |
fungi | eumel8: i've mentioned https://review.openstack.org/623508 in #storyboard and given a couple of our primary maintainers a heads up about it | 15:33 |
*** ykarel|away has joined #openstack-infra | 15:33 | |
*** e0ne has quit IRC | 15:34 | |
mnaser | ok we found the root cause, we can lose those vms dmsimard | 15:34 |
mnaser | thank you | 15:34 |
dmsimard | \o/ | 15:34 |
*** eharney has quit IRC | 15:37 | |
*** dpawlik has joined #openstack-infra | 15:41 | |
*** sshnaidm is now known as sshnaidm|off | 15:42 | |
*** dpawlik has quit IRC | 15:46 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 15:46 |
*** eharney has joined #openstack-infra | 15:52 | |
*** e0ne has joined #openstack-infra | 15:55 | |
*** janki has joined #openstack-infra | 16:00 | |
*** pgaxatte has quit IRC | 16:06 | |
clarkb | frickler: fungi: I also identified ~6 e-r bugs that correlated strongly to disabling ovh bhs1 | 16:13 |
clarkb | we should be able to use that to see if things are improving too | 16:13
*** tosky has quit IRC | 16:14 | |
*** e0ne has quit IRC | 16:23 | |
openstackgerrit | Merged openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 16:25 |
*** gyee has joined #openstack-infra | 16:25 | |
*** adriancz has quit IRC | 16:26 | |
fungi | clarkb: did hits for those reduce by half when we halved the max-servers too? | 16:28 |
clarkb | fungi: its hard to tell because they don't seem to run at a constant rate either? and looking at multiple graphs we'd need to merge them to answer that I think | 16:29 |
clarkb | fwiw they seem to have continued since yesterday's halving | 16:29 |
fungi | but disappeared when bhs1 was offline? | 16:30 |
clarkb | yes | 16:32 |
clarkb | fungi: http://status.openstack.org/elastic-recheck/index.html#1805176 is a really good example of that | 16:32 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor job page to use a reducer https://review.openstack.org/623156 | 16:32 |
clarkb | there is this giant hole there, and at first I was really concerned we had broken log indexing but I'm fairly positive its the bhs1 disabling instead | 16:32 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor tenants page to use a reducer https://review.openstack.org/623157 | 16:35 |
clarkb | fungi: but you'll also notice that it hasn't stopped since yesterday's halving | 16:37
fungi | that's "neat" | 16:39 |
*** bhavikdbavishi has joined #openstack-infra | 16:39 | |
clarkb | ya I started going down the hole of "where did we break indexing" | 16:40
clarkb | and after 15 minutes light bulb went off when I couldn't find anything obviously broken | 16:40 |
clarkb | unfortunately halving our use to curb being our own noisy neighbor was my last good idea :) I don't really know what we'd look at next particularly from our end. | 16:41 |
clarkb | maybe go back to the idea its unhappy hypervisors and maybe there are more of them than we expected? | 16:41 |
clarkb | we do know that there are jobs that run perfectly fine on bhs1 | 16:42 |
clarkb | dstat doesn't show the cpu sys and wai signature of the jobs that fail. These happy jobs run in a reasonable amount of time successfully | 16:42
pabelanger | https://review.openstack.org/623496/ is atleast ready to start exposing host_id into the inventory files, for when we next restart zuul | 16:42 |
clarkb | this implies to me it's not a consistent issue like "cpus are slow" or "disks are slow", deal with it. I think it also rules out overhead due to meltdown/spectre/l1tf | 16:43
pabelanger | as another data point for logstash | 16:43 |
clarkb | pabelanger: ++ | 16:43 |
clarkb | added to my review queue | 16:43 |
clarkb | Maybe a next step is to disable the region, then boot a couple VMs on every hypervisor (may need amorin help with this depending on how the scheduler places things) and run benchmarks on each VM and see if that shows a pattern | 16:45 |
clarkb | (I would just run devstack as a first pass benchmark) | 16:45 |
*** cgoncalves has quit IRC | 16:45 | |
clarkb | we've seen the db migrations and service startup for eg nova api trigger things in devstack | 16:45 |
fungi | i wonder if we could take bhs1 out of the line of fire for real jobs, but then load it with identical representative workloads we could use to benchmark instances and map them back to their respective hosts to identify any correlation | 16:45 |
clarkb | ya | 16:45 |
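A rough sketch of the kind of per-hypervisor survey being proposed; image, flavor, and key names are placeholders, and actually steering instances onto specific hypervisors would still need provider-side help as noted above:

```sh
# Boot a batch of identical probe instances and record which hypervisor each
# one landed on, then run the same benchmark (devstack, or the dd test used
# earlier in this log) on every probe and compare.
for i in $(seq 1 20); do
    openstack server create --image ubuntu-xenial --flavor probe-flavor \
        --key-name infra-root --wait "bhs1-probe-$i" >/dev/null
    echo "bhs1-probe-$i $(openstack server show bhs1-probe-$i -f value -c hostId)"
done
```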
clarkb | also maybe the nova team has ideas around how to make nova VMs happier | 16:47
*** bobh has quit IRC | 16:50 | |
*** bobh has joined #openstack-infra | 16:51 | |
*** cgoncalves has joined #openstack-infra | 16:53 | |
*** kjackal has quit IRC | 16:53 | |
*** boden has quit IRC | 16:54 | |
*** kjackal has joined #openstack-infra | 16:55 | |
clarkb | considering it is friday and we are entering a quieter period for many, maybe the next action is, as frickler suggests, to disable bhs1. Then we can regroup while things are slow and, rather than treat this like a fire, work through it carefully? | 16:56
clarkb | unfortunately my timezone doesn't overlap well with CET. Thank you to all who do have some overlap for helping with that. Curious what you think about ^ if we were to try and do more measured debugging and take our time (which likely requires being awake when amorin et al are) | 16:57
fungi | i concur, we should likely disable it again. frickler's analysis was pretty eye-opening | 16:59 |
clarkb | infra-root https://review.openstack.org/#/c/623457/2 is frickler's change to do ^. I've +2'd it and if anyone else would like to chime in now is a good time :) | 16:59 |
mordred | clarkb: lgtm - I think the plan above to disable it and then run synthetic tests sounds good | 17:00 |
fungi | i do think the frequent timeouts there are likely costing us more capacity than the nodes we lose by dropping it for now | 17:00 |
clarkb | ya the cascading effect of gate resets is painful | 17:02 |
*** ginopc has quit IRC | 17:02 | |
fungi | i think you can approve it. we have a lot of +2s on there now | 17:05 |
clarkb | done | 17:05 |
clarkb | then to be extra confident in those e-r holes being related to bhs1 we can watch for a new hole :) | 17:08 |
*** jpich has quit IRC | 17:10 | |
*** mriedem is now known as mriedem_lunch | 17:19 | |
clarkb | pabelanger: +2'd https://review.openstack.org/#/c/623496/1 but didn't approve as I'm always wary around the addition of reliance on new zk data. I'm not quite sure if the release note needs to add any additional info about updates or if that will just work if your nodepool is older | 17:19
clarkb | Shrews: ^ that might be a question for you? | 17:19 |
*** gfidente has quit IRC | 17:20 | |
pabelanger | clarkb: in older nodepool, I'd expect nodepool.host_id to be None | 17:20
Shrews | looking | 17:21 |
*** dims has quit IRC | 17:21 | |
Shrews | looks like that should just work | 17:22 |
openstackgerrit | Merged openstack-infra/project-config master: Disable ovh bhs1 https://review.openstack.org/623457 | 17:22 |
*** dtantsur is now known as dtantsur|afk | 17:22 | |
Shrews | older nodepool won't have host_id, so it should just remain None, as pabelanger says | 17:23 |
*** rlandy is now known as rlandy|brb | 17:23 | |
clarkb | pabelanger: ya I guess since you are just setting yaml/ansible var values and not acting on the attribute that should be fine | 17:23 |
*** wolverineav has joined #openstack-infra | 17:23 | |
clarkb | if the jobs don't handle that properly the job will have issues, but zuul itself will be fine | 17:23 |
Shrews | the gotchas for zk schema changes tend to be adding new states to pre-existing fields | 17:23 |
Shrews | pabelanger: just a reminder, but launchers will need a restart if you intend to use that new field | 17:25 |
Shrews | last restart didn't get that update | 17:25 |
clarkb | 'The recent tripleo gate reset was not bhs1 related, at least jobs didn't run there. Instead delorean is unhappy with DEBUG: cp: cannot stat 'README.md': No such file or directory' | 17:25
clarkb | er, that first ' should be just before DEBUG | 17:25
pabelanger | Shrews: ++ | 17:26 |
clarkb | Trying to call those out to dispel the idea that bhs1 is our only issue | 17:26
EvilienM | we're trying to figure how tripleo-ci-centos-7-undercloud-containers is not in our gate anymore | 17:26 |
EvilienM | it's in our layout | 17:27 |
corvus | EvilienM: give me a change id that should have run it and i can tell you | 17:27 |
*** wolverineav has quit IRC | 17:27 | |
corvus | (you can also use the debug:true feature, but this will save 3 hours :) | 17:27 |
EvilienM | corvus: I6754da1142e2ec865ef8c60a7e09df00300f791e | 17:27 |
EvilienM | yeah | 17:27 |
Shrews | "debug: corvus" > "debug: true" | 17:28 |
corvus | lol | 17:28 |
EvilienM | yes that^ | 17:28 |
mordred | all debug messages should just be "corvus: | 17:28 |
corvus | (i wonder if we can attach the debug info to buildsets in the sql db so folks could look this up post-facto through the web) | 17:29 |
EvilienM | corvus: it started 3 days ago | 17:29 |
EvilienM | (lgostash says) | 17:29 |
fungi | that sounds like a great idea. it doesn't really mean that much additional data in the db, and it's not like that db gets huge anyway | 17:29 |
fungi | plus, people often come asking for an explanation of why something didn't run | 17:30 |
*** janki has quit IRC | 17:33 | |
corvus | EvilienM: i haven't found that zuul even considered running that job on that change... where is it attached to the tripleo-ci project gate pipeline? | 17:33 |
corvus | EvilienM: http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/zuul.d/undercloud-jobs.yaml#n6 has it attached to check but not gate | 17:36 |
corvus | (note that because of the clean-check gate requirement, a voting job in check but not in gate can wedge a co-gating project system) | 17:37 |
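A quick way to confirm that sort of thing from a clone of the repo, without waiting for a debug run (a sketch; the file location comes from the cgit link above):

```sh
git clone https://git.openstack.org/openstack-infra/tripleo-ci
cd tripleo-ci

# Show where the job is attached; the surrounding context makes it clear it
# only appears under the check pipeline, not gate.
git grep -n -B10 'tripleo-ci-centos-7-undercloud-containers' zuul.d/
```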
*** rlandy|brb is now known as rlandy | 17:41 | |
*** bhavikdbavishi has quit IRC | 17:48 | |
*** sean-k-mooney has joined #openstack-infra | 17:49 | |
*** ahosam has quit IRC | 17:50 | |
*** jpena is now known as jpena|off | 17:50 | |
*** boden has joined #openstack-infra | 17:50 | |
EvilienM | corvus: weird, let me see | 17:51 |
EvilienM | I think it's a leftover | 17:51 |
EvilienM | when we had to deal with fires | 17:51 |
EvilienM | I2753393cd7cdd64720d39360bebe9ddea2f20efc | 17:51 |
EvilienM | mwhahaha, corvus : https://review.openstack.org/#/c/623555/ | 17:52 |
EvilienM | damn, sorry for noise | 17:52 |
EvilienM | mwhahaha: are we missing other jobs in your finding? | 17:52 |
mwhahaha | no just that one | 17:52 |
mwhahaha | he removed the wrong one | 17:52 |
*** shardy has quit IRC | 17:57 | |
*** Swami has joined #openstack-infra | 18:13 | |
*** wolverineav has joined #openstack-infra | 18:13 | |
*** jamesmcarthur has quit IRC | 18:14 | |
*** ykarel|away has quit IRC | 18:17 | |
clarkb | integrated gate just merged ~9 changes. tripleo gate also merged a stack not too long ago. I think things are moving | 18:18 |
openstackgerrit | Merged openstack-infra/zuul master: Add nodepool.host_id variable to inventory file https://review.openstack.org/623496 | 18:21 |
*** udesale has joined #openstack-infra | 18:26 | |
*** udesale has quit IRC | 18:27 | |
sean-k-mooney | hi o/ | 18:28 |
sean-k-mooney | i ask this every 6 months or so but looking at https://docs.openstack.org/infra/manual/testing.html is the policy on nested virt still the same. | 18:28 |
sean-k-mooney | e.g. we can't guarantee it's available and it's usually disabled if detected | 18:29
*** jamesmcarthur has joined #openstack-infra | 18:31 | |
clarkb | sean-k-mooney: yes, in fact just yesterday we found it crashing and rebooting VMs | 18:34
clarkb | sean-k-mooney: seems to be just as unreliable as ever unfortunately | 18:34 |
sean-k-mooney | clarkb: its now on by default in the linux kernel going forward | 18:34 |
clarkb | sean-k-mooney: ya but none of our clouds or guests are running 4.19 yet (except our tumbleweed images and maybe f29) | 18:35 |
sean-k-mooney | clarkb: my experience is it's pretty reliable if you set the vms to host-passthrough but it's flaky if you set a custom cpu model or host-model | 18:35
*** jamesmcarthur has quit IRC | 18:35 | |
clarkb | and quite literally yesterday we found unexpected reboots in tests from centos7 crashing with nested guests and virt enabled. I wish it were reliable but it isn't | 18:36 |
clarkb | logan-: ^ not sure if you set host passthrough or not but that may be something to check if you don't need live migration in that region or have homogeneous cpus | 18:36
sean-k-mooney | ok the reason im asking is the intel nfv ci has broken again and im trying to figure out if i can set up a ci to replace it upstream or else where | 18:36 |
logan- | yes, we use host passthrough | 18:37 |
*** bobh has quit IRC | 18:38 | |
sean-k-mooney | huh strange, i guess i was just lucky; when we ran the intel nfv ci with it in our dev lab we never had any issues | 18:38
fungi | sean-k-mooney: in particular, over the past couple years we've seen that it's very sensitive to the combination of guest and host kernel used | 18:38 |
fungi | if you control both then you can get it to work | 18:38 |
fungi | _probably_ | 18:38 |
sean-k-mooney | fungi: we were using ubuntu 14.04/16.04 hosts with centos7 and ubuntu 14.04/16.04 guests | 18:39
sean-k-mooney | depending on the year | 18:39
fungi | so for "private cloud" scenarios it's likely viable. in our situation relying mostly on public cloud providers it's more like russian roulette | 18:39 |
sean-k-mooney | ya i get that | 18:39 |
*** ccamacho has quit IRC | 18:40 | |
sean-k-mooney | ok well that continues to rule out ovs-dpdk testing in the gate so | 18:40
sean-k-mooney | im going to set up a tiny cluster at home to do some third party ci for now. | 18:41 |
fungi | we get to periods where there is some working combination of guest and host kernels in our images and some of our providers, and then our images get a minor kernel update and suddenly it all goes sideways until the provider gets a newer kernel onto their hosts | 18:41 |
logan- | the base xenial 4.4 seemed to break a lot more often, so we run the 4.15 xenial hwe on the nodepool hvs starting a few months ago, and even with that we still see things like this week where cent7 and xenial guests running on these hvs were hard rebooting when they attempted to launch a nested virt guest | 18:41 |
sean-k-mooney | logan-: did you have host-passthrough on both the hosting cloud and the vms launched by tempest | 18:42
sean-k-mooney | i should have mentioned that you need it for both | 18:43
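For reference, a sketch of what enabling host-passthrough "on both levels" typically involves on an Intel libvirt host and again inside the test VM's nova; crudini and the modprobe option are common approaches, and the exact paths are assumptions for any particular deployment:

```sh
# On the hypervisor: pass the host CPU (including vmx) through to guests...
sudo crudini --set /etc/nova/nova.conf libvirt cpu_mode host-passthrough

# ...and make sure nested KVM is enabled for the kvm_intel module.
cat /sys/module/kvm_intel/parameters/nested            # expect Y (or 1)
echo 'options kvm_intel nested=1' | sudo tee /etc/modprobe.d/kvm-nested.conf

# Inside the test VM, the nova deployed by the job needs the same cpu_mode so
# the nested guest also sees vmx.
sudo crudini --set /etc/nova/nova.conf libvirt cpu_mode host-passthrough
```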
logan- | i don't control all of the guest jobs, but for my test jobs that were rebooting on xenial, yes passthrough was enabled on the guest also | 18:43 |
sean-k-mooney | oh well as i said i guess we were lucky with our configs | 18:43 |
sean-k-mooney | logan-: do you work for one of the cloud providers? | 18:44 |
logan- | i run the limestone cloud | 18:44 |
clarkb | mgagne_: had found that the cpu itself seemed to make a difference too iirc. But unsure if that was a bug in cpus or linux kernel issues or something else | 18:44 |
sean-k-mooney | ah ok not familiar with that one. | 18:44
clarkb | if 4.19 is indeed stable enough to have it turned on by default and evidence shows that to be the case I'll happily use it when we get there | 18:45 |
clarkb | but that is likely still a ways off | 18:45 |
clarkb | 4.19 has fixes for the fs corruption now too | 18:46 |
clarkb | which I should actually reboot for | 18:46 |
*** kjackal has quit IRC | 18:47 | |
*** e0ne has joined #openstack-infra | 18:47 | |
sean-k-mooney | clarkb: ya if it changes in the future ill maybe see about setting up some upstream nfv testing but for now ill look at third party solutions | 18:48 |
sean-k-mooney | that or review the effort to allow ovs-dpdk testing to work without nested virt but nova dont want to merge code that would only be used for testing | 18:49 |
*** e0ne has quit IRC | 18:49 | |
sean-k-mooney | *revive | 18:49 |
*** bobh has joined #openstack-infra | 18:49 | |
*** mriedem_lunch is now known as mriedem | 18:49 | |
fungi | i don't suppose that behavior could be made pluggable | 18:49 |
clarkb | another avenue which we've considered in the past but never got very far with is baremetal test resources | 18:50 |
mnaser | fyi i think our nested virt is working well | 18:50 |
*** dims has joined #openstack-infra | 18:50 | |
mnaser | i know its pretty stable for kata afaik | 18:50 |
clarkb | mnaser: have they recently tested with centos 7.6? | 18:50 |
mnaser | i dont know if kata uses 7.6 | 18:50 |
clarkb | mnaser: that seemed to be the recent trigger for us I think | 18:50 |
clarkb | mnaser: it was the new kernel in the host VM | 18:51 |
mnaser | we also run the latest 7.6 in sjc1 in the host | 18:51 |
sean-k-mooney | fungi: well it basically comes down to the fact that nova enables cpu pinning when you request hugepages and qemu does not support cpu pinning without kvm | 18:51 |
mnaser | and that resolved a lot of nested virt issues | 18:51 |
clarkb | mnaser: ah in that case it may work out | 18:51 |
mnaser | worth a shot. kata has a lot of resources in terms of nested virt folks so maybe working with them might be beneficial | 18:51 |
sean-k-mooney | fungi: hugepages and numa work with qemu but since i cant disable the pinning i cant do the testing upstream. | 18:52 |
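As a rough illustration of the flavor setup being discussed (flavor name and sizes are placeholders; the pinning side effect is per sean-k-mooney's explanation above, not something configured here):

    # create a small tempest flavor and request hugepages-backed guest memory
    # via the standard hw:mem_page_size extra spec
    openstack flavor create nfv.tiny --ram 1024 --vcpus 1 --disk 5
    openstack flavor set nfv.tiny --property hw:mem_page_size=large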
clarkb | mnaser: logan- found the new ubuntu hwe kernel seems to have stabilized it too. The bigger issue on my side is that it stopped working in tripleo. The tests were rebooting halfway through and no one noticed until days later when I dug into failures | 18:52 |
sean-k-mooney | maybe that will change when rhel8/centos8 come out next year | 18:53 |
sean-k-mooney | that said we will still have centos7 jobs for a few releases | 18:53 |
*** bobh has quit IRC | 18:54 | |
*** bobh has joined #openstack-infra | 18:57 | |
fungi | we will, but they'll probably be run a lot less frequently as most of the development activity goes on in master anyway and we'd be using 7 on (fewer and older over time) stable branches | 18:58 |
*** bobh has quit IRC | 19:01 | |
*** eernst has quit IRC | 19:06 | |
*** electrofelix has quit IRC | 19:11 | |
*** bobh has joined #openstack-infra | 19:16 | |
*** wolverineav has quit IRC | 19:19 | |
*** bobh has quit IRC | 19:20 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: CONTRIBUTING.rst, HACKING.rst: fix broken link, minor flow updates https://review.openstack.org/623362 | 19:28 |
*** armstrong has quit IRC | 19:30 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: doc: new testing-behind-proxy.rst; tox.ini: passenv = http[s]_proxy https://review.openstack.org/623361 | 19:31 |
*** bobh has joined #openstack-infra | 19:34 | |
clarkb | I'm caught up on email and the gate is looking relatively happy. I think I'm going to take this as an opportunity to reboot for kernel fixes (so my fs doesn't corrupt) and apply some patches to my local router. I expect I'll not be gone long but will let you know via the phone network if I managed to break something :) | 19:36 |
*** bobh has quit IRC | 19:39 | |
tobiash | clarkb: good luck :) | 19:39 |
clarkb | that was actually far less painful than I expected | 19:46 |
*** bobh has joined #openstack-infra | 19:53 | |
*** bobh has quit IRC | 19:57 | |
*** jcoufal has quit IRC | 20:10 | |
*** jamesmcarthur has joined #openstack-infra | 20:10 | |
*** jamesmcarthur has quit IRC | 20:11 | |
*** bobh has joined #openstack-infra | 20:11 | |
*** eernst has joined #openstack-infra | 20:13 | |
*** kjackal has joined #openstack-infra | 20:15 | |
*** bobh has quit IRC | 20:16 | |
fungi | huh, a bunch of failing changes near the front of the integrated gate queue now | 20:19 |
*** sean-k-mooney has quit IRC | 20:20 | |
fungi | all seem to be failures and timeouts for unit tests? weird | 20:20 |
fungi | the neutron change with the three unit test timeouts all ran in limestone-regionone | 20:21 |
fungi | cinder change failed a volume creation/deletion unit test, ran in ovh-gra1 | 20:24 |
*** mriedem has quit IRC | 20:25 | |
fungi | the nova change had two unit test jobs fail on assorted database migration/sync tests, and both those jobs ran in limestone-regionone | 20:25 |
*** kjackal has quit IRC | 20:25 | |
fungi | logan-: any chance the kernel updates have caused us to chew up more resources there? | 20:25 |
*** sean-k-mooney has joined #openstack-infra | 20:28 | |
*** mriedem has joined #openstack-infra | 20:28 | |
*** bobh has joined #openstack-infra | 20:29 | |
*** kgiusti has left #openstack-infra | 20:30 | |
*** bobh has quit IRC | 20:34 | |
*** eharney has quit IRC | 20:39 | |
*** ahosam has joined #openstack-infra | 20:41 | |
fungi | according to logstash, job timeouts for limestone-regionone really started picking up around 18:00z | 20:46 |
*** ralonsoh has quit IRC | 20:47 | |
*** bobh has joined #openstack-infra | 20:47 | |
fungi | that's something like 3.5 hours after logan- mentioned all the hypervisors had been upgraded | 20:48 |
fungi | so maybe not connected? | 20:48 |
*** bobh has quit IRC | 20:52 | |
mriedem | so, on that new zuul queueing behavior, is there any way that could negatively affect some changes from landing? | 20:54 |
mriedem | clarkb's email mentioned it could mean things in nova taking longer, | 20:54 |
mriedem | but i'm just wondering if we're re-enqueuing some approved nova changes that take about 16 hours just to fail on some slow node, then take extra long now to go back through | 20:54 |
mriedem | b/c it's taking us days to land code | 20:55 |
mriedem | i.e. it sounds like nova changes are deprioritized because there could be a lot of nova changes queued up at any given time right? | 20:56 |
pabelanger | each patch now enters the check pipeline at the same priority, across all projects. So the first patch set of each project gets nodes now, even if nova submitted 10 different patches lets say | 20:58 |
mnaser | mriedem: is this happening in big stacks? | 20:58 |
mriedem | define big | 20:58 |
mriedem | https://review.openstack.org/#/c/602804/15 | 20:58 |
pabelanger | this is amplified, if a large stack of patches are submitted together, as patches behind that won't get nodes until the patch series has nodes | 20:58 |
mnaser | yeah what pabelanger said.. i think this is something we should somehow address | 20:59 |
dansmith | define large? | 20:59 |
dansmith | and is this just stacks of patches? | 20:59 |
mnaser | well, large is relative to how many patches are in the nova queue right now | 20:59 |
mnaser | i.e. if i push a 30 stack change, then i've just effectively slowed down all of nova's development till all 30 are tested | 21:00 |
mnaser | which isn't ideal tbh | 21:00 |
dansmith | meaning it slows down all of nova, while trying to avoid slowing down, say, neutron? | 21:00 |
mnaser | pretty much dansmith | 21:00 |
mnaser | because then the 'queue' for nova is long and neutron is short so.. neutron gets a node while nova waits | 21:01 |
pabelanger | mnaser: well, not really. it is the same behavior to nova, but other projects get priority over nodes now | 21:01 |
dansmith | doesn't that kinda not work if one project has three people working on it and another has 30? | 21:01 |
dansmith | pabelanger: I'm not sure I see the difference | 21:01 |
mnaser | dansmith: yeah. unfortunately the busier projects somehow get *less* results and the more idle ones get more results quicker | 21:01 |
pabelanger | for node requests, now large and small project both have equal weight to the pool of resources | 21:02 |
mnaser | because $some_small_project always has 2-3 changes in queue, vs nova that might have 40 | 21:02 |
mnaser | so if i'm working on a project alone, i can almost always get a node right away.. but if i'm in nova, it might not be for a while | 21:02 |
dansmith | mnaser: okay, yeah, that seems frustrating.. because we've been waiting longer than a day to get a single read on a fix.. like, to even get logs to evaluate something | 21:02 |
pabelanger | the 4th paragraph in http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000482.html should give a little more detail | 21:02 |
mriedem | i just replied, | 21:03 |
mnaser | dansmith: i agree. i'm kinda on both sides of this, now our openstack-ansible roles get super quick feedback.. but i see working with other big projects that now it takes forever to get check responses | 21:03 |
mriedem | wondering if it would be possible to weigh nova changes in the queue differently based on their previous number of times through | 21:03 |
fungi | node assignments are round-robined between repositories, so if three nova changes are queued at the same time as two cinder changes and one neutron change then nodes will be doled out to the first nova, neutron and cinder changes, then the next two nova and neutron changes, and then the third nova change in that order | 21:03 |
pabelanger | dansmith: right now I would say the gate resets / reduced nodes from providers is also impacting the time here too | 21:03 |
dansmith | mnaser: honestly the last few days I'll make one change in the morning and just do other stuff for the whole day, then check on the patch the next morning | 21:03 |
dansmith | mnaser: which pretty much kills one's will to live and motivation to iterate on something | 21:04 |
mnaser | dansmith: i agree with you 100%. | 21:04 |
dansmith | pabelanger: I'm sure yeah | 21:04 |
mnaser | i do think at times a smaller project can iterate a lot faster than say.. nova | 21:04 |
fungi | a lot of this queuing model was driven by the desire to make tripleo changes wait instead of starving the ci system of resources to the point where most other projects were waiting forever | 21:04 |
mnaser | because they'll always get a highest priority if they only have one change in a queue | 21:04 |
mriedem | so, just curious, are the new non-openstack foundation projects contributing resources to node pool for CI resources? | 21:04 |
mriedem | is that just a bucket of money the foundation doles out for CI? | 21:05 |
fungi | the foundation doesn't purchase ci resources | 21:05 |
mnaser | mriedem: none do, but afaik, their utilization is very small | 21:05 |
dansmith | fungi: yeah, and I'm sure tripleo causes a lot of trouble for the pool | 21:05 |
mriedem | fungi: don't get me wrong, i'd like to see tripleo go full 3rd party CI at this point :) | 21:05 |
mriedem | has the idea of quotas ever come up? | 21:05 |
fungi | and yeah, the other osf projects use miniscule amounts of ci resources, as mnaser notes | 21:05 |
*** bobh has joined #openstack-infra | 21:06 | |
mriedem | e.g. each project gets some kind of quota so one project can't just add 30 jobs to run per change | 21:06 |
mriedem | or never clean up their old redundant jobs | 21:06 |
fungi | mriedem: yes, i think the idea is that we'd potentially implement quotas in zuul/nodepool per tenant, but we're still missing some of the mechanisms we'd need to do that | 21:07 |
mriedem | tenant == code repo/project requesting CI resources? | 21:08 |
*** jtomasek has quit IRC | 21:08 | |
mriedem | maybe project under governance | 21:08 |
*** rlandy has quit IRC | 21:08 | |
mriedem | i just know it's very easy today, especially with zuulv3 where jobs are defined per project, for a project to be like, "oh we should test knob x, let's copy job A, call it job B and set knob x to True even though 95% of the tests run will be the same" | 21:09 |
sean-k-mooney | fungi: out of interest what is missing for quotas. i played with having multiple zuul tenants using the same cloud resources provided by a shared node pool. | 21:09 |
sean-k-mooney | i did not have it deployed for long but had thought everything was supported to do that | 21:10 |
corvus | well from my pov what's missing is the will to set quotas. does anyone have a suggestion as to how we should set them? | 21:10 |
*** bobh has quit IRC | 21:10 | |
clarkb | seems I've missed the fun conversation running errands and finding lunch | 21:11 |
clarkb | my last email included a paste that breaks down usage | 21:11 |
*** jamesmcarthur has joined #openstack-infra | 21:11 | |
clarkb | tripleo is ~42% with nova and neutron around 10-15% iirc. All openstack official projects together are 98.2% iirc | 21:11 |
dansmith | I definitely get the intent here, and I think it's good, | 21:12 |
clarkb | I really want to get away from the idea that its new projects using all the resources, the data does not back that up | 21:12 |
dansmith | clarkb: we get it. it's not new projects, it's tripleo :) | 21:12 |
sean-k-mooney | corvus: well i dont know what the actual capacity is but you could start with one tenant per official team in the governance repo and oversubscribe the actual cloud resources by giving each tenant a quota of 100 instances and then adjust it down | 21:13 |
dansmith | 42% when the fat pig that is nova is a pretty huge scaling factor | 21:13 |
fungi | sean-k-mooney: sharing job configuration across tenants is inconvenient, i think we were looking at tenant per osf top-level project | 21:14 |
*** jamesmcarthur has quit IRC | 21:14 | |
*** jamesmcarthur has joined #openstack-infra | 21:14 | |
mriedem | but nova waiting a day for results when zaqar is prioritized higher is kind of....weird | 21:14 |
clarkb | right the biggest impact we can have is improving tripleo's fail rate and overall high demand | 21:15 |
corvus | dansmith: it's a little hard for me to separate whether you're seeing the effect of the new queueing behavior or just the general backlog. because while you may wait a day to get results on the change you push up this morning, if we turned off the new behavior the same would still be true. | 21:15 |
mriedem | i would think quota would be doled out by some kind of deployed / activity metrci | 21:15 |
mriedem | *metric | 21:15 |
corvus | yesterday, mid-day, nova was waiting 5 hours for results | 21:15 |
dansmith | er, I meant 42% when something as big as nova is only 10% is a big scaling factor | 21:15 |
mriedem | taking a different angle here, | 21:15 |
dansmith | corvus: 5 hours for one patch or five hours for the last patch in a small series? | 21:16 |
mriedem | are we aware of any voting jobs that have a fail rate of 50% or more? | 21:16 |
dansmith | corvus: and yeah, I know things are in shambles right now | 21:16 |
sean-k-mooney | why does tripleo use so many resources anyway. is its minimum deployment size just quite big or does it have a lot of jobs? or both i guess? | 21:16 |
mriedem | shitload of jobs | 21:16 |
*** tpsilva has quit IRC | 21:16 | |
mriedem | from what i could see | 21:16 |
corvus | sean-k-mooney: and they run very long | 21:16 |
mriedem | baremetal | 21:16 |
mriedem | it's like the worst kind of ci requirements | 21:16 |
sean-k-mooney | em, could we have a separate tenant for tripleo? | 21:17 |
corvus | we don't have any bare-metal resources, so they all run virtualized | 21:17 |
clarkb | they are big long running jobs that fail a lot | 21:17 |
mriedem | so back to my other question, do we know of voting jobs that are failing at too high a rate and should make them non-voting to avoid resets? | 21:18 |
clarkb | the fail a lot is important with how the gate works | 21:18 |
clarkb | mriedem: non voting jobs dont cause resets and shouldnt run in the gate | 21:18 |
mriedem | we used to make the ceph job non-voting if it failed consistently at something like 25% | 21:18 |
dansmith | maybe we have one queue for tripleo and one for everything else? | 21:18 |
mriedem | clarkb: that's my point, | 21:18 |
mriedem | do we have *voting* jobs with high fail rates that should be made non-voting | 21:18 |
mriedem | until they are sorted out | 21:18 |
dansmith | if they're currently 50% of the load, that would seem reasonable, | 21:18 |
clarkb | mriedem: I am not sure, you'd have to check graphite | 21:19 |
clarkb | but also that doesnt fix the issue of broken software | 21:19 |
sean-k-mooney | dansmith: ya that was basically the logic i had with suggesting making them be a separate tenant from the rest | 21:19 |
corvus | dansmith: they do have their own queue -- or do you mean quota? or... i may not understand... | 21:19 |
mriedem | clarkb: i realize that doesn't fix broken software, but holding the rest of openstack hostage while one project figures out why it's jobs are always failing is wrong | 21:20 |
dansmith | corvus: like, the per-project queue thing you have now, but put tripleo in one queue and everyone else in the same queue, | 21:20 |
dansmith | so that nova and neutron fight only against tripleo :) | 21:20 |
corvus | dansmith: gotcha -- so only test one tripleo change at a time across the whole system? | 21:20 |
clarkb | mriedem: except Im fairly certain some of the issue here is breakage rolling downhill | 21:20 |
mriedem | as i said, if the ceph job had a high failure rate and someone wasn't fixing it, we'd make it non-voting | 21:20 |
clarkb | tripleo takes nova and if it doesnt work, tests break | 21:20 |
clarkb | then clouds take tripleo and host the tests and break | 21:21 |
clarkb | its not as simple as x y and z are bad stop them | 21:21 |
dansmith | corvus: is that what the current queuing is doing? or you mean just because their jobs are so big, that they'd realistically only get one going at a time? | 21:21 |
corvus | dansmith: that's actually not that hard to do -- we can adjust the new priority system to operate based on the gate pipeline's shared changes queues, even in check. | 21:21 |
corvus | dansmith: it's not what the queueing is currently doing, because each tripleo project gets its own priority queue in check (in gate, they do share one queue). | 21:22 |
dansmith | corvus: oh cripes! | 21:23 |
dansmith | corvus: yeah, seems like they get priority inflation because they have a billion projects, amirite? | 21:23 |
fungi | again though, that's in check. in gate the tripleo queue and the integrated queue are basically on equal footing | 21:24 |
corvus | dansmith: right | 21:24 |
dansmith | presumably gate is less of an issue right? | 21:24 |
clarkb | well gate is what eats all the nodes | 21:24 |
corvus | well, gate has priority over check | 21:24 |
*** bobh has joined #openstack-infra | 21:24 | |
clarkb | especially when thing are flaky because a reset stops all jobs, then restarts them taking nodes from check | 21:24 |
dansmith | really? gate is always smaller it seems like | 21:24 |
clarkb | which is why I keep telling people please fix the tests | 21:24 |
dansmith | yeah I know gate resets kill everything | 21:24 |
corvus | so a gate reset (integrated or tripleo) starves check | 21:24 |
clarkb | you will merge features faster if we all just spend a little time fixing bugs | 21:25 |
sean-k-mooney | clarkb: well to get to gate they would have to go through check so check should be the limiting factor no? | 21:25 |
dansmith | sean-k-mooney: it's flaky tests | 21:25 |
dansmith | not failboat tests | 21:25 |
clarkb | right its using nested virt, crashing the VM, and failing the test that does it | 21:25 |
clarkb | and other not 100% failure issues | 21:25 |
fungi | yeah, tests with nondeterministic behaviors (or tests exercising nondeterministic features of the software) | 21:26 |
pabelanger | maybe in check, group the priority based on the project deliverables? I think that is what dansmith might be getting at | 21:26 |
dansmith | so, gate resets suck, we know that, and a very small section of people work on fixing those which sucks | 21:26 |
dansmith | but yeah, | 21:26 |
sean-k-mooney | so like 4 years ago we used to have both recheck and reverify to allow running check and gate independently | 21:26 |
dansmith | ffs put tripleo in one bucket at least, | 21:26 |
sean-k-mooney | do we have reverify anymore | 21:26 |
dansmith | and maybe put everyone else in another bucket together | 21:26 |
clarkb | dansmith: if the TC tells us to we will. But so far OpenStack the project seems to tolerate their behavior | 21:26 |
fungi | pabelanger: challenge there is probably in integrating governance data. zuul knows about a flat set of repositories, and knows about specific queues | 21:27 |
pabelanger | fungi: agree | 21:27 |
clarkb | fwiw I don't think tolerating is necessarily a bad thing | 21:27 |
pabelanger | sean-k-mooney: no, just recheck | 21:27 |
clarkb | tripleo testing has to deal with all the bugs in the rest of openstack | 21:27 |
dansmith | clarkb: really? Is the TC in charge of how to allocate infra resources? I didn't realize | 21:27 |
corvus | [since i'm entering the convesation in the middle, let me add a late preface -- the new queing behavior is an attempt to make things suck equally for all projects. it's just an idea and i'm very happy to talk about and try other ideas, or turn it off if it's terrible. it's behind a feature flag] | 21:27 |
clarkb | dansmith: the TC is in charge of saying nova is more important that tripleo | 21:27 |
dansmith | clarkb: but osa is the same and they use a much smaller fraction right? | 21:27 |
sean-k-mooney | pabelanger: ok could we bring back reverify so that if check passes and gate fails they could just rerun gate jobs and maybe reduce load that way | 21:28 |
dansmith | clarkb: we know they're never going to say that.. this doesn't seem like a politics thing, but rather a making the best use of the resources we all share | 21:28 |
clarkb | dansmith: yes, I pointed this out to the TC when I started sharing this data | 21:28 |
clarkb | dansmith: all of the other deployment projects use about 5% or less resources each | 21:28 |
*** bobh has quit IRC | 21:28 | |
dansmith | clarkb: if you make it about "okay who is prettier" you'll never get an answer from them and we'll all just keep sucking | 21:28 |
dansmith | clarkb: right, and they all deal with "all the bugs of openstack" in the same way right? | 21:29 |
corvus | sean-k-mooney: that's called the 'clean-check' requirement and it was instituted because too many people were approving changes with failing check jobs, and therefore causing gate resets or, if they merged, then merging flaky tests. | 21:29 |
clarkb | dansmith: sort of. I think what makes tripleo "weird" is that they have stronger dependencies on openstack working for tripleo to actually deploy anything | 21:29 |
clarkb | dansmith: ansible runs mistral which runs heat which runs ansible which runs openstack type of deal | 21:29 |
clarkb | whereas puppet and ansible and so on are just the last bit | 21:29 |
corvus | sean-k-mooney: sdague thought clean-check was very important and effective for eliminating that. | 21:29 |
dansmith | clarkb: yeah I get that their design has made them fragile :) | 21:30 |
sean-k-mooney | corvus: im suggesting that reverify would only work if check was clean and gate jobs failed; it would rerun the gate jobs only, keeping the clean check results | 21:30 |
corvus | sean-k-mooney: oh, when the gate runs we forget the check result | 21:30 |
*** eharney has joined #openstack-infra | 21:31 | |
clarkb | the good news is tripleo is reducing their footprint | 21:31 |
corvus | sean-k-mooney: so basically the issue is, how do you enqueue a change in gate with a verified=-2. i guess we could permit that.... | 21:31 |
fungi | sean-k-mooney: also, before clean-check one of the problems was reviewers approving changes which had all-green results from check jobs run 6 months ago | 21:31 |
clarkb | I think we are seeing positive change there, unfortunately it is still slow movement and I end up debugging a bunch of gate failures | 21:31 |
clarkb | Ideally tripleo would be doing that more actively | 21:31 |
clarkb | (as would openstack with its flaky queue) | 21:31 |
corvus | (verified=-2 would mean "this failed in gate, but it previously passed in check, so allow it to go back into gate) | 21:31 |
fungi | sean-k-mooney: which clean-check doesn't solve obviously, but at least forcing the change which caused a gate reset to re-obtain check results keeps it from continuing to get reapproved | 21:32 |
sean-k-mooney | fungi: ya i know it would reduce the quality of the gate in some respects | 21:32 |
sean-k-mooney | it would be harder to do but even if reverify would conditionally skip the check if it was run in the last day that might help | 21:33 |
fungi | sean-k-mooney: less about reducing the quality of the gate, it would actually slow development (these changes went in specifically because our ability to gate changes ground to a halt with nondeterministic jobs and people approving broken changes) | 21:33 |
mriedem | yeah i don't want to go back to that | 21:33 |
mriedem | it was changed for good reasons | 21:33 |
sean-k-mooney | fair enough | 21:34 |
mriedem | sdague isn't in his grave yet, but i can hear him rolling | 21:34 |
sean-k-mooney | i know there were issue with it | 21:34 |
clarkb | the reason I push on gate failures is the way the gate should work is that 20 changes go in, they grab X nodes to test those changes and do that for 2.5 hours. Then they all merge and the gate is empty leaving all the other resources for everyone else | 21:34 |
clarkb | what happens instead is we use N nodes for 20 minutes, reset then use N nodes for 20 minutes and reset and on and on | 21:35 |
corvus | s/2.5/1 :) | 21:35 |
fungi | yep, while forcing a change which fails in the gate pipeline to reobtain check results is painful, not doing it is significantly more painful in the long run | 21:35 |
clarkb | and never free up resources for check | 21:35 |
clarkb | unfortunately the ideal behavior sort of assumes we operate under the assumption that broken flaky code is bad | 21:35 |
pabelanger | clarkb: given how relative priority works now, if the gate window was smaller again, that would mean more nodes in check right? Meaning more feedback, however it does mean potentially longer for things to gate and merge | 21:36 |
sean-k-mooney | well we do in most projects | 21:36 |
pabelanger | right now, there is a lot of nodes servicing gate | 21:36 |
fungi | and assumes that we catch a vast majority of the failures in check, and that our jobs pass reliably | 21:36 |
clarkb | sean-k-mooney: there are certainly pockets that do, but overall from my perspective as being the person debugging things and beating this drum very few do | 21:36 |
corvus | pabelanger: relative_priority doesn't really change the gate/check balance. and clarkb shrunk the window a few weeks ago and observed no noticeable change in overall throughput. | 21:37 |
corvus | (so the window is back at 20) | 21:37 |
clarkb | I do think this feedback is worthwhile though. One thing that comes to mind is maybe the priority should be based on time rather than count (though nova would still suffer under that) | 21:37 |
corvus | shrinking the window should alter the check/gate balance though. so we might see some check results faster if we shrunk it again. | 21:37 |
clarkb | perhaps there should be a relief valve in the priority? | 21:37 |
fungi | yeah, i think when things get really turbulent, resetting a 10-change queue over and over is about as bad as a 20 | 21:38 |
clarkb | perhaps we do need to allocate quotas to projects given some importance value | 21:38 |
clarkb | my concern with this is I really don't want to be the person that says tripleo gets less resources than nova | 21:38 |
fungi | i have a feeling there are a lot of times where zuul never gets nodes allocated to more than 10 changes in the queue before another reset tanks them all anyway | 21:38 |
sean-k-mooney | clarkb: ya to be honest, ignoring the hassle of being involved in running a third party ci, one thing i did like about it is the random edge cases it exposed that i was then able to go report or fix upstream | 21:38 |
clarkb | I think we have an elected body over all the projects for that | 21:38 |
dansmith | clarkb: to some degree, it's currently being said that tripleo gets as many resources as all the rest of openstack | 21:39 |
pabelanger | clarkb: wasn't there also a suggestion for priority based on number of nodes too? | 21:39 |
clarkb | dansmith: yup, and we've asked tripleo nicely to change that and they have started doing so | 21:39 |
dansmith | and by them having a crapton of projects, they're N times more important than any one other single-repo project, if I understand correctly | 21:39 |
clarkb | pabelanger: ya node time I think I mentioned once | 21:39 |
clarkb | dansmith: ya using gate queues to determine allocations (I think corvus mentioned that above) may be an important improvement here | 21:40 |
clarkb | or the governance data? or some sort of aggregation | 21:40 |
dansmith | clarkb: that means treating tripleo as one thing instead of N things? definitely seems like an improvement | 21:40 |
*** bobh has joined #openstack-infra | 21:41 | |
clarkb | dansmith: ya and osa and others that are organized similarly | 21:41 |
* dansmith nods | 21:41 | |
sean-k-mooney | clarkb: how hard would it be to have a check queue per governance team | 21:41 |
pabelanger | I still like the suggestion from dansmith that project deliverables are counted together, not as individuals, for priority. The current setup does seem to give an advantage to a project with code spread over more repos versus a single-repo one | 21:41 |
clarkb | sean-k-mooney: we already do a check queue per change. I think this is less about the queue and more prioritizing how nodes are assigned | 21:41 |
sean-k-mooney | check queue per change? did you mean project? | 21:42 |
clarkb | sean-k-mooney: no per change is how its logically implemented in zuul | 21:42 |
clarkb | to represent dependencies | 21:42 |
fungi | each change in an independent pipeline (e.g., "check") gets its own queue | 21:43 |
fungi | changes in a dependent pipeline (e.g., "gate") get changes queued together based on jobs they share in common, or via explicit queue declaration | 21:43 |
sean-k-mooney | fungi: ya i realised when i said queue before i meant to say pipeline | 21:43 |
corvus | ideas so far: https://etherpad.openstack.org/p/QxXuCSdAoF | 21:43 |
fungi | or in v3 did we drop the "jobs in common" criteria for automatic queue determination? is it all explicit now? | 21:44 |
corvus | fungi: all explicit | 21:44 |
fungi | okay, that then ;) | 21:44 |
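For reference, a minimal sketch of what an explicit shared-queue declaration looked like in a repo's zuul config at the time (project layout and job name are illustrative):

    # .zuul.yaml sketch: gate changes for this repo share the "integrated"
    # queue with every other repo declaring the same queue name
    - project:
        gate:
          queue: integrated
          jobs:
            - openstack-tox-pep8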
fungi | unfortunately my next idea is to go get dinner, and beer | 21:45 |
*** bobh has quit IRC | 21:45 | |
fungi | or perhaps that's fortunately | 21:45 |
corvus | it sounds like grouping projects in check by the same criteria we use in gate for setting relative_priority (ie, group all tripleo projects together in check) seems popular, fair, and not-difficult-to-implement. | 21:45 |
*** jamesmcarthur has quit IRC | 21:45 | |
corvus | should i work on that? | 21:45 |
clarkb | corvus: and if not difficult to implement we can at least try it out and if it doesn't work well probably not a huge deal? | 21:46 |
clarkb | corvus: sounds like you should :) | 21:46 |
fungi | yes, my concern with trying to merge it with governance data is the complexity. zuul already has knowledge of queues | 21:46 |
corvus | yeah, i consider this whole thing an experiment and we should change it however we want :) | 21:46 |
pabelanger | +1 to experiment | 21:46 |
clarkb | dansmith: ^ does that seem like a reasonable place to start? That should treat aggregate tripleo as a unit rather than individual repos | 21:47 |
pabelanger | clarkb: my question would be, does that ignore the 'integrated' queue in gate, or include it? | 21:47 |
dansmith | I would think that'd be an improvement worth doing at least yeah | 21:47 |
fungi | should there be a multi-tiered prioritization decision within the queuing set too? or just treat all changes for that group equally even if there are 10 nova changes and 1 mistral change (assuming mistral is in integrated too) | 21:48 |
corvus | (oh, i added one more idea -- pabelanger suggested that we could treat the second and later patch in a patch series with a lower priority) | 21:48 |
corvus | fungi: that's a good point, this would lump nova/cinder/etc together | 21:49 |
clarkb | pabelanger: ya I guess thats the other question is if it will result in much change if nova + neutron + cinder + glance + swift are all together | 21:49 |
sean-k-mooney | clarkb: you know there is one other thing we could try | 21:49 |
sean-k-mooney | could we split the short zull jobs from the long ones | 21:49 |
clarkb | sean-k-mooney: thats sort of what we've done with the current priority | 21:50 |
clarkb | we are running a lot more short jobs because the less active projects tend to have less involved testing | 21:50 |
sean-k-mooney | e.g. run the unit,function,pep8,docs and release notes in one set and comment back and the tempest ones in another bucket | 21:50 |
clarkb | ah split on that axis | 21:50 |
sean-k-mooney | ya | 21:51 |
clarkb | ya though maybe thats an indication we all need to run `tox` before pushing more often :P | 21:51 |
sean-k-mooney | so that you get develop feed back quickly on at least the non integration tests | 21:51 |
clarkb | (I'm bad at it myself, but it does make a difference) | 21:51 |
pabelanger | clarkb: yah, i think for the impact, we'd need to group specific queues in check (via configuration?) and keep current behavior of per project realitive priority | 21:51 |
sean-k-mooney | clarkb: well i normally do tox -e py27,pep8,docs | 21:52 |
sean-k-mooney | but i rarely run py3 or functional tests locally | 21:52 |
sean-k-mooney | i do if i touch them but nova takes a while | 21:52 |
sean-k-mooney | clarkb: you could even wait to kick off the tempest test until the other test job passed | 21:53 |
corvus | it's worth keeping in mind that nothing we've discussed (including this idea) changes the fact that during the north-american day, we are trying to run about 2x the number of jobs at a time than we can support. | 21:53 |
clarkb | http://logs.openstack.org/37/611137/1/gate/grenade-py3/e454703/job-output.txt.gz#_2018-12-07_21_48_57_091229 just reset the integrated gate | 21:53 |
clarkb | sean-k-mooney: ^ its specifically test failures like that that I'm talking about needing more eyeballs fixing | 21:54 |
corvus | if we run 20,000 jobs in a day now, with any of these changes, we'll still run 20,000 jobs in a day. it's just re-ordering when we run them. the only thing that will change that is to run fewer jobs, run them more quickly, or have them fail less often. | 21:54 |
clarkb | that change modifies cinder unit tests | 21:54 |
clarkb | so shouldn't be anywhere near what grenade runs | 21:55 |
clarkb | and yet it fails >0% but <100% of the time | 21:55 |
*** boden has quit IRC | 21:55 | |
clarkb | corvus: ya I think what dansmith and mriedem were getting at is the turnaround time for a specific patch impacts their ability to fix and re-review today or wait for tomorrow | 21:55 |
clarkb | corvus: in the fifo system we had before that turnaround time was the same for everyone roughly (we'd get so far behind then everyone is waiting all day sort of thing) | 21:56 |
clarkb | in the current system only some subset of people are waiting for tomorrow | 21:56 |
dansmith | yep | 21:56 |
clarkb | which is an improvement for some and not for others | 21:56 |
mriedem | it's also impacting our ability to try and work on things to fix the gate | 21:56 |
mriedem | like the n-api slow start times | 21:56 |
dansmith | we just proposed removing the cellsv1 job from our regular run, btw.. I'm constantly thinking about what we can run less of, fwiw | 21:56 |
clarkb | ya nova is a fairly responsible project | 21:56 |
sean-k-mooney | hum the content type was application/octet-stream which i would have expected to work | 21:56 |
mriedem | i'm trying to merge the nova-multiattach job into tempest so we can kill that job as well | 21:56 |
dansmith | mriedem: right, this actually came up while trying to get results from gate-fixing patches | 21:56 |
*** wolverineav has joined #openstack-infra | 21:57 | |
mriedem | clarkb: can you say that again, but this time into my lapel? | 21:57 |
dansmith | clarkb: FAIRLY RESPONSIBLE | 21:57 |
*** wolverineav has quit IRC | 21:57 | |
*** wolverineav has joined #openstack-infra | 21:57 | |
clarkb | dansmith: mriedem heh I just mean in comparison to others yall seem to jump on your bugs without prompting | 21:57 |
clarkb | and dive in if prompted | 21:57 |
dansmith | clarkb: yeah, arguing the "fairly" part :) | 21:57 |
* mriedem lights a cigarette | 21:57 | |
fungi | our previous high-profile gate-fixers were mostly nova core reviewers too, historically | 21:58 |
dansmith | mriedem jumps without prompting, and I jump when prompted by mriedem | 21:58 |
dansmith | it's a good system. | 21:58 |
corvus | clarkb: yes, though one thing that wasn't pointed out was that a large project has more changes with results waiting at the end of the turnaround time, whereas a smaller project may only have the one change. so if you're in a large project and can work on multiple things, it's less of an issue. of course, if you're focused on one change in a large project, it's worse now. | 21:58 |
clarkb | corvus: ya | 21:58 |
corvus | i'll go look into the combine-stuff-in-check idea now | 21:58 |
clarkb | corvus: ok, thanks | 21:58 |
mriedem | dansmith: working on this shit allows me to procrastinate from working on cross-cell resize | 21:58 |
dansmith | mriedem: and you're generally a noble sumbitch to boot. | 21:59 |
* dansmith has to run | 21:59 | |
clarkb | sean-k-mooney: ya these bugs tend to be difficult to debug (though not always) which is one reason I think we have so few people that dig into them | 21:59 |
sean-k-mooney | so in general do people think there would be merit in a precheck pipeline for running all the non-tempest tests (pep8, py27...) and only kicking off the dsvm tests if the precheck jobs passed | 21:59 |
clarkb | but that digging is quite valuable | 21:59 |
clarkb | sean-k-mooney: about 5 years ago we did do that, and what we found was we had more round trips per patch as a result | 21:59 |
clarkb | sean-k-mooney: that doesn't mean we shouldn't try it again | 22:00 |
clarkb | but is something to keep in mind, the current thought is providing complete results in one go makes it easier to fix the complete set of bugs in a change before it goes through again | 22:00 |
sean-k-mooney | more round trips with a shorter latency for results until the quick jobs passed might be a saving on gate time overall | 22:00 |
clarkb | we should see about measuring that along with any more lag time on throughput | 22:00 |
pabelanger | Right, this was also the idea of doing fast-fail too, if one job fails, they all do. | 22:01 |
pabelanger | but means less results | 22:01 |
sean-k-mooney | clarkb: do we have visibility or an easy way to categorize what percentage of failures on a patch are from tempest jobs versus the rest? | 22:01 |
clarkb | sean-k-mooney: I think that is something to keep in our back pocket as an option if the reorg of priority aggregation continues to be sadness. I do want to avoid changing too many things at once | 22:01 |
fungi | okay, really running off to dinner now. *might* pop back on when i return, but... it is friday night | 22:01 |
pabelanger | sean-k-mooney: you could actually test that today, with using job dependencies in your zuul.yaml file | 22:01 |
clarkb | fungi: enjoy your evening and weekened | 22:02 |
fungi | thanks! | 22:02 |
sean-k-mooney | pabelanger: you could actually test it today. i could test it in a week after i read up on how this all works again :) | 22:02 |
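A rough sketch of the job-dependencies approach pabelanger mentions, assuming illustrative job names in a project's .zuul.yaml:

    # run the quick jobs first; the devstack/tempest job only starts (and
    # only consumes its nodes) once the listed jobs have succeeded
    - project:
        check:
          jobs:
            - openstack-tox-pep8
            - openstack-tox-py36
            - tempest-full:
                dependencies:
                  - openstack-tox-pep8
                  - openstack-tox-py36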
clarkb | sean-k-mooney: pabelanger I added it to https://etherpad.openstack.org/p/QxXuCSdAoF for completeness | 22:02 |
pabelanger | clarkb: +1 | 22:03 |
sean-k-mooney | clarkb: this is something we could maybe test on a per project basis too | 22:03 |
corvus | my changes will definitely use more nodes overall if we fast-fail :( | 22:04 |
sean-k-mooney | e.g. if we had a precheck pipeline we could try it on nova or something by changing just our zuul file | 22:04 |
clarkb | corvus: ya mine too | 22:04 |
*** wolverineav has quit IRC | 22:04 | |
clarkb | but maybe that is ok if the aggregate doesn't | 22:04 |
corvus | (typically my changes use 2x nodes because i always get something wrong the first time. expect x! nodes if we do fast-fail. :) | 22:05 |
clarkb | sean-k-mooney: re that test I called out. It failed due to a 502 from apache talking to cinder. Apache says AH01102: error reading status line from remote server 127.0.0.1:60999 at http://logs.openstack.org/37/611137/1/gate/grenade-py3/e454703/logs/apache/error.txt.gz | 22:06 |
clarkb | cinder api log doesn't immediately show me anything that would indicate why | 22:07 |
sean-k-mooney | clarkb: ya i saw that apache is proxying to mod_wsgi i am assuming | 22:07 |
clarkb | sean-k-mooney: I think only for apache, the rest of the services run uwsgi standalone and we just tcp to them? | 22:07 |
clarkb | apache is just terminating ssl for us | 22:07 |
sean-k-mooney | oh ok | 22:07 |
*** wolverineav has joined #openstack-infra | 22:08 | |
clarkb | stack exchange says set proxy-initial-not-pooled in apache | 22:08 |
clarkb | this will degrade performance but make things more reliable as it avoids a race between pooled connection being closed and new connection to frontend | 22:09 |
clarkb | I thought we had something like this in the apache config already | 22:09 |
clarkb | oh I remember it was the backends and apache not allowing connection reuse by the python clients because python requests has the same race | 22:10 |
sean-k-mooney | ya just reading https://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html | 22:10 |
clarkb | chances are we do want this sort of thing added to devstack | 22:10 |
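A sketch of the sort of apache tweak being discussed, assuming a devstack-style vhost proxying to a local uwsgi backend over tcp (the path and port here are illustrative):

    # mod_proxy_http honours this per-request env var: don't hand the first
    # request on a new client connection an already-pooled backend
    # connection, avoiding the race where the backend just closed it
    SetEnv proxy-initial-not-pooled 1
    ProxyPass "/volume" "http://127.0.0.1:60999/" retry=0
    ProxyPassReverse "/volume" "http://127.0.0.1:60999/"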
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/project-config master: Add gate job for Slack notifier in zuul-jobs https://review.openstack.org/623593 | 22:11 |
*** slaweq has joined #openstack-infra | 22:11 | |
sean-k-mooney | that or we use something other than apache for ssl termination | 22:11 |
clarkb | ya we used apache because keystone already depped on it | 22:12 |
clarkb | avoided adding a dep | 22:12 |
sean-k-mooney | keystone can run under uwsgi now right? | 22:12 |
sean-k-mooney | i know glance still has some issues | 22:12 |
clarkb | maybe? its a good question | 22:12 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add a slack-notify role https://review.openstack.org/623594 | 22:13 |
sean-k-mooney | haproxy, nginx and caddy are all lighter weight solutions for ssl termination than apache but that option is probably a good place to start | 22:13 |
clarkb | ya devstack had broken support for some lightweight terminator that ended up being EOL'd and removed from the distros | 22:17 |
clarkb | and it was at that point stuff moved to apache because it was already a hard dep for keystone | 22:17 |
*** eernst has quit IRC | 22:17 | |
clarkb | it can certainly be updated again if it makes sense, though configuring apache is likely easier short term | 22:17 |
* clarkb updates devstack repo | 22:18 | |
sean-k-mooney | ya, i have several other experiments i want to do with devstack but i might add that to the list | 22:18 |
jrosser | is it right i seem to see a mix of centos 7.5 & 7.6 nodes? | 22:19 |
clarkb | jrosser: as of yesterday all but inap should have an up to date 7.6 image | 22:20 |
clarkb | I haven't checked yet today if inap image managed to get pushed | 22:21 |
jrosser | ok - i'll check they're all from there | 22:21 |
*** dmellado has quit IRC | 22:22 | |
*** stevebaker has quit IRC | 22:23 | |
*** gouthamr has quit IRC | 22:23 | |
*** bobh has joined #openstack-infra | 22:24 | |
jrosser | clarkb: looking at a few the 7.5 do indeed look to be inap nodes | 22:26 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 22:28 |
clarkb | jrosser: ya I just checked and inap is still not getting a successful upload | 22:28 |
clarkb | we'll need to debug that | 22:28 |
clarkb | jrosser: does that cause problems? maybe the package updates take a long time? | 22:29 |
*** bobh has quit IRC | 22:29 | |
corvus | clarkb, pabelanger, sean-k-mooney, dansmith, mriedem, fungi: ^ i think https://review.openstack.org/623595 is our tweak (combined with a change to project-config to establish the queues in check) | 22:29 |
jrosser | ok thanks - it'll trip up osa jobs where we just fixed 7.6 host + 7.6 docker image | 22:29 |
jrosser | mismatching those doesnt work for us | 22:29 |
clarkb | oh right the venv thing mnaser mentioned | 22:30 |
jrosser | yes thats it | 22:30 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 22:30 |
mriedem | corvus: ok, but i'll admit the wording in that commit message is greek to me | 22:31 |
mriedem | maybe you want to poke that out in the ML thread | 22:31 |
clarkb | openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://image.api.mtl01.cloud.iweb.com/v2/images/7cc7c423-dea6-4efb-b36d-bc3b7fbdee5e/file, Image status transition from saving to saving is not allowed: 409 Conflict | 22:33 |
clarkb | jrosser: ^ that is why our uploads are failing I think that is an openstacksdk bug | 22:33 |
clarkb | mordred: ^ fyi | 22:33 |
corvus | mriedem: it's greek for "lump tripleo together in check" :) | 22:33 |
clarkb | jrosser: I'm not in a great spot to debug that myself right now, but let me get a full traceback pasted so that someone can look if they have time | 22:34 |
clarkb | mordred: jrosser http://paste.openstack.org/show/736850/ | 22:35 |
mriedem | corvus: ok, but still probably good to say in that thread "this is what happens today and this is what's proposed after getting brow beaten for an hour in irc" | 22:35 |
jrosser | clarkb: thankyou | 22:35 |
clarkb | I want to finish up the train of thought with devstack apache, then review corvus' change, then maybe I'll get to that sdk thing | 22:35 |
mriedem | maybe leave out that last part | 22:35 |
corvus | mriedem: yeah, i'll reply, but it'll take me a minute because now i have to write about your suggestion ;) | 22:36 |
mriedem | oh weights on changes within a given project? | 22:37 |
mriedem | this all sounds like stuff people want nova to be doing all the time | 22:37 |
mriedem | now you know how i feel | 22:37 |
mriedem | oh what if we just used zuul in the nova-scheduler... | 22:37 |
corvus | well, i'm reading it as weigh higher changes which have failed alot, but yeah | 22:37 |
corvus | mriedem: they both have schedulers, they must be the same thing | 22:38 |
mriedem | correct | 22:38 |
clarkb | does that mean our work here is done? | 22:38 |
corvus | i'm pretty sure the zuul-scheduler is so called because that was the only word that came to mind after spending a year listening to people talk about nova | 22:39 |
clarkb | there are times that I wish devstack was more config managementy, this is one of them :P | 22:41 |
mriedem | clarkb: btw, please take it easy on me on monday after your boy russell prances all over my team this weekend | 22:41 |
clarkb | mriedem: I had them going 5-11 this year. I'm both happy and mad they proved me wrong | 22:41 |
clarkb | mriedem: whats crazy is the packers could be that 5-10-1 team | 22:41 |
mriedem | not crazy, great | 22:42 |
*** bobh has joined #openstack-infra | 22:43 | |
*** mriedem is now known as mriedem_afk | 22:46 | |
*** bobh has quit IRC | 22:47 | |
clarkb | sean-k-mooney: https://review.openstack.org/623597 fyi should set that env var | 22:52 |
*** slaweq has quit IRC | 22:52 | |
clarkb | I spent more time figuring out what the opensuse envvars file is than writing the patch :P | 22:52 |
clarkb | corvus: I think your change must've got caught by a pyyaml release? pep8 is complaining that the "safe" methods don't exist anymore | 22:57 |
corvus | clarkb: i don't see a new pyyaml release | 22:58 |
clarkb | I'm guessing the change to make safe the default and make unsafe explicit opt in hit? | 22:58 |
clarkb | hrm | 22:58 |
clarkb | http://logs.openstack.org/95/623595/2/check/tox-pep8/b4c0a8b/job-output.txt.gz#_2018-12-07_22_36_31_796233 | 22:58 |
corvus | there was a new mypy release though | 22:58 |
clarkb | oh | 22:58 |
clarkb | that must be it then | 22:58 |
corvus | i expect the unit test failures are separate from that, so i'll look into that before pushing up fixes for both | 22:59 |
clarkb | ok | 23:00 |
clarkb | corvus: the other thing I notice is that we'll set specific check queues which are different than those in gate (or could be at least?) | 23:00 |
clarkb | that seems like a good feature | 23:00 |
corvus | clarkb: yep; it'd be really messy to implement it otherwise | 23:00 |
clarkb | I think you only get the CSafeLoader attributes if the libyaml-dev headers are available | 23:01 |
clarkb | I wonder if mypy can be convinced to allow either type | 23:01 |
clarkb | another option is to install that package via bindep | 23:01 |
*** irdr has quit IRC | 23:03 | |
clarkb | " Image status transition from saving to saving is not allowed" | 23:04 |
clarkb | it only just occurred to me its mad that the state is transitioning to the same state | 23:04 |
*** gouthamr has joined #openstack-infra | 23:06 | |
*** wolverineav has quit IRC | 23:08 | |
*** wolverin_ has joined #openstack-infra | 23:08 | |
clarkb | this is the two step upload process, We create an image record, this first step is the one that gets passed all the image property data. Then we PUT the actual image file data to foo_image/file url | 23:09 |
clarkb | its the second one that is failing, I don't think we supply the property data there, it should just be the content type header and the content of the image itself | 23:09 |
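For context, the two-step image API v2 flow being described looks roughly like this (endpoint, token, and image values are placeholders):

    # step 1: POST creates the image record and carries all the properties
    curl -X POST "$GLANCE/v2/images" \
        -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
        -d '{"name": "example", "disk_format": "qcow2", "container_format": "bare"}'
    # step 2: PUT uploads the bytes; only the octet-stream content type
    # header and the image data itself are sent
    curl -X PUT "$GLANCE/v2/images/$IMAGE_ID/file" \
        -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/octet-stream" \
        --data-binary @example.qcow2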
*** slaweq has joined #openstack-infra | 23:09 | |
*** dmellado has joined #openstack-infra | 23:11 | |
*** slaweq has quit IRC | 23:14 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 23:15 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Cap mypy https://review.openstack.org/623598 | 23:15 |
clarkb | ya it looks like if we use the sdk native interface it may remember some object attributes but we consume things on the shade side which is just doing a pretty boring POST then PUT without any information about state at that level (shade manages the state, the client is clueless beyond the session from what I can tell) | 23:19 |
clarkb | possibly a change on the cloud side? | 23:19 |
*** slaweq has joined #openstack-infra | 23:19 | |
clarkb | mgagne_: ^ if you are around do you see a similar traceback? I can't tell if its the lib/sdk that is buggy or if maybe the server is? | 23:20 |
clarkb | mordred: chances are you just know off the top of your head if you have a sec too | 23:23 |
*** slaweq has quit IRC | 23:24 | |
*** stevebaker has joined #openstack-infra | 23:25 | |
sean-k-mooney | clarkb: im kind of surprised that we would set env vars in /etc/sysconfig, i would have assumed it would have either been in /etc/default/<apache name> or some systemd folder | 23:26 |
clarkb | sean-k-mooney: apparently on rhel/centos and suse this is how you do it | 23:28 |
clarkb | debuntu use /etc/apache2/envvars | 23:28 |
sean-k-mooney | ok i probably should know that ... | 23:28 |
clarkb | then the init system applies that data (via apachectl on debuntu?) | 23:28 |
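To illustrate the distro convention being described (EXAMPLE_VAR is purely a placeholder variable):

    # RHEL/CentOS: the httpd systemd unit reads /etc/sysconfig/httpd
    # SUSE: apache2 reads /etc/sysconfig/apache2
    # Debian/Ubuntu: apachectl sources /etc/apache2/envvars instead
    echo 'EXAMPLE_VAR=1' | sudo tee -a /etc/sysconfig/httpd
    sudo systemctl restart httpd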
*** apetrich has quit IRC | 23:47 | |
mnaser | hm | 23:47 |
mnaser | in nested virt situations, i assume folks have seen really slow io or traffic? | 23:48 |
mnaser | trying to debug functional tests for k8s .. http://logs.openstack.org/75/623575/1/check/magnum-functional-k8s/463ae1a/logs/cluster-nodes/master-test-172.24.5.203/cloud-init-output.txt.gz | 23:48 |
mnaser | it looks like its downloading quite slowly | 23:48 |
sean-k-mooney | mnaser: no i havent seen that before | 23:48 |
mnaser | i mean it could also not be using nested virt | 23:49 |
clarkb | mnaser: chances are its not nested virt | 23:49 |
mnaser | darn :( | 23:49 |
mnaser | i'm trying to make magnum functional tests for k8s actually start working again | 23:49 |
mnaser | but i think it might just be totally impossible without some sort of nested virt guarantee :( | 23:49 |
clarkb | http://logs.openstack.org/75/623575/1/check/magnum-functional-k8s/463ae1a/logs/etc/nova/nova-cpu.conf.txt.gz virt type qemu | 23:49 |
clarkb | which we set in devstack by default due to issues like centos 7 crashing under it :( | 23:50 |
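A minimal local.conf sketch for a job that wants to opt in to kvm instead of devstack's qemu default (only sensible where the provider reliably exposes nested virt):

    [[local|localrc]]
    # devstack passes this through to nova's [libvirt]/virt_type
    LIBVIRT_TYPE=kvm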
mnaser | it'd be nice if we can have not-so-third-party-third-party-ci | 23:50 |
sean-k-mooney | mnaser: so nested virt is on but kvm is not used so the nested vms are just slow | 23:50 |
mnaser | as in like "here's some credentials, can we use those nodesets in this project only please" | 23:50 |
mnaser | rather than us deploying a fully fledged zuul to do third party ci | 23:51 |
clarkb | sean-k-mooney: it may or not be on depending on the cloud that it is scheduled to | 23:52 |
* sean-k-mooney may or may not be deploying openstack and zuul at home to set up a third party ci... | 23:52 | |
clarkb | which is the other issue | 23:52 |
clarkb | mnaser: we actually can do that, no one has offered as far as I know. But thats roughly what we are doing with kata | 23:53 |
clarkb | the key thing is we can't gate in that setup (because its a spof) | 23:53 |
clarkb | but can provide informational results | 23:53 |
sean-k-mooney | clarkb: ya that is true, its not that hard to check if you can enable kvm, you just have to modprobe it and see if /dev/kvm is there | 23:53 |
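A quick sketch of that check, as it might be run inside a test node:

    # load the kvm module for the cpu vendor; this only succeeds when the
    # host actually exposes the virtualization extensions to the guest
    sudo modprobe kvm_intel || sudo modprobe kvm_amd
    # if /dev/kvm exists, nested kvm can be used; otherwise fall back to qemu
    test -c /dev/kvm && echo "kvm available" || echo "no kvm, plain qemu"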
mnaser | clarkb: .. so can we do that for magnum :D | 23:54 |
clarkb | mnaser: maybe? adrian otto actually brought it up a while back and then it went nowhere (idea was to use rax onmetal at the time) | 23:54 |
clarkb | sean-k-mooney: ya it even works if you've hidden vmx from the instance (which is why we don't bother doing that) | 23:54 |
mnaser | i mean i can do the work and provide the infra (..magnum is important for us, and its functional jobs are pretty dysfunctional because of this) | 23:54 |
sean-k-mooney | is barbican still doing terrible things in their ci jobs? | 23:55 |
*** rkukura_ has joined #openstack-infra | 23:55 | |
sean-k-mooney | before osic died they had a ci job that tried to enable kvm and powered off the host if it failed so zuul would reschedule them. | 23:56 |
clarkb | sean-k-mooney: a few projects try to use nested virt if its there (octavia and tripleo are/were doing this, its how we ran into those issues with centos) | 23:56 |
sean-k-mooney | at least i think it was barbican that had that job | 23:56 |
*** rkukura has quit IRC | 23:57 | |
*** rkukura_ is now known as rkukura | 23:57 | |
*** jamesmcarthur has joined #openstack-infra | 23:58 | |
sean-k-mooney | mnaser: by the way do you know with vexxhost if i create an account can i set a limit on my usage per month | 23:59 |
mnaser | sean-k-mooney: unfortunately the only way that's possible is by enforcing a quota on your account, we don't have a "cost" quota ;x | 23:59 |
sean-k-mooney | ok i assumed that would be the answer | 23:59 |