*** jascott1 has joined #openstack-infra | 00:00 | |
*** edmondsw has quit IRC | 00:00 | |
*** iyamahat_ has quit IRC | 00:01 | |
*** dougwig has quit IRC | 00:03 | |
*** jascott1 has quit IRC | 00:03 | |
*** jascott1 has joined #openstack-infra | 00:03 | |
*** iyamahat_ has joined #openstack-infra | 00:05 | |
*** iyamahat_ has quit IRC | 00:06 | |
*** jascott1 has quit IRC | 00:07 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add skipped CRD tests https://review.openstack.org/531887 | 00:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies https://review.openstack.org/530806 | 00:08 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 00:08 |
*** wolverineav has joined #openstack-infra | 00:16 | |
*** felipemonteiro has joined #openstack-infra | 00:19 | |
*** sbra has joined #openstack-infra | 00:22 | |
*** yamamoto has quit IRC | 00:24 | |
*** yamamoto has joined #openstack-infra | 00:27 | |
*** erlon has quit IRC | 00:27 | |
*** armax has joined #openstack-infra | 00:31 | |
openstackgerrit | Ian Wienand proposed openstack-infra/project-config master: Revert "Pause builds for dib 2.10 release" https://review.openstack.org/532701 | 00:36 |
*** abelur_ has quit IRC | 00:38 | |
*** yamamoto has quit IRC | 00:39 | |
*** abelur_ has joined #openstack-infra | 00:39 | |
*** claudiub has quit IRC | 00:43 | |
*** sree has joined #openstack-infra | 00:43 | |
*** rmcall has quit IRC | 00:46 | |
*** rmcall has joined #openstack-infra | 00:46 | |
*** sbra has quit IRC | 00:46 | |
*** sree has quit IRC | 00:48 | |
*** wolverin_ has joined #openstack-infra | 00:49 | |
*** wolverineav has quit IRC | 00:50 | |
*** wolverin_ has quit IRC | 00:53 | |
*** cuongnv has joined #openstack-infra | 00:58 | |
*** cuongnv has quit IRC | 00:59 | |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla https://review.openstack.org/532705 | 01:01 |
*** pcrews has quit IRC | 01:01 | |
*** threestrands has quit IRC | 01:03 | |
clarkb | ianw: ^ do you want to review that too? | 01:03 |
openstackgerrit | Kendall Nelson proposed openstack-infra/storyboard master: [WIP]Migration Error with Suspended User https://review.openstack.org/532706 | 01:04 |
*** pcrews has joined #openstack-infra | 01:05 | |
*** ilpianista has quit IRC | 01:05 | |
ianw | are we 100% sure the gate is moving? | 01:09 |
*** ricolin has joined #openstack-infra | 01:10 | |
clarkb | ianw: I am not, no | 01:12 |
*** stakeda has joined #openstack-infra | 01:12 | |
clarkb | ze01 at least is running stuff | 01:12 |
pabelanger | issues? | 01:12 |
clarkb | zuul scheduler is running | 01:12 |
clarkb | and zm01 has a zuul-merger | 01:13 |
ianw | 532395,1 ... my change of course ... i would call stuck ... just looking at some of the jobs | 01:13 |
ianw | everything on the grafana page seems to be going | 01:13 |
*** cuongnv has joined #openstack-infra | 01:15 | |
ianw | zuul-executor on ze04 seems to have pretty high load ... maybe it's normal | 01:15 |
ianw | http://paste.openstack.org/show/642527/ in the logs, odd one | 01:15 |
ianw | git.exc.GitCommandError: Cmd('git') failed due to: exit code(128) | 01:15 |
ianw | stderr: 'Host key verification failed. | 01:16 |
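For context on the paste above: "Host key verification failed" is git (via GitPython on the executor) shelling out to ssh and refusing an unknown host key. A minimal stdlib-only sketch that reproduces the same failure mode outside of Zuul (the repository URL here is hypothetical) would be:

```python
# Sketch: reproduce "Host key verification failed" outside Zuul.
# The repo URL is hypothetical; only the standard library is used.
import os
import subprocess

env = dict(os.environ)
# Strict host key checking against an empty known_hosts is the condition
# that yields "Host key verification failed" on the executor.
env["GIT_SSH_COMMAND"] = "ssh -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/dev/null"

result = subprocess.run(
    ["git", "clone", "ssh://review.example.org:29418/openstack/nova", "/tmp/nova-test"],
    env=env, capture_output=True, text=True)
print(result.returncode)  # 128 on a fatal git error, matching the paste above
print(result.stderr)      # ends with "Host key verification failed."
```

If that is what is happening here, the fix is presumably making sure the right host keys are present in the known_hosts used by the account the executor runs git as.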
clarkb | I've got to afk to take care of kids. I did approve pabelanger's max-servers: 0 for infracloud | 01:16 |
clarkb | ianw: ya that was brought up earlier today too | 01:16 |
pabelanger | yah, if we loaded from queues, it is possible one executor grabbed more than normal | 01:16 |
clarkb | I think tomorrow once meltdown is behind us we need to do some zuul stabilization and investigation | 01:16 |
pabelanger | cool | 01:16 |
clarkb | and for you that may be today >_> | 01:16 |
ianw | do we have 11 executors now? | 01:16 |
clarkb | ianw: just 10, ze01-10 | 01:16 |
ianw | grafana seems to think we have 11 | 01:17 |
*** kiennt26 has joined #openstack-infra | 01:17 | |
clarkb | could be stale connection to geard? | 01:17 |
pabelanger | I can't seem to get any of the stream.html pages working | 01:18 |
pabelanger | finger 779ce43930ab4f48aa41ad6f9e422734@zuulv3.openstack.org also returns Internal streaming error | 01:19 |
ianw | 2018-01-11 00:10:20,511 DEBUG zuul.AnsibleJob: [build: ac4dbd3dd3eb4dbbbbed8ab3171500b2] Ansible complete, result RESULT_NORMAL code 0 | 01:19 |
ianw | that was about an hour ago, and the status page hasn't picked up the job returned | 01:19 |
*** slaweq has joined #openstack-infra | 01:19 | |
pabelanger | tcp6 0 0 :::7900 :::* LISTEN | 01:20 |
pabelanger | that is new | 01:20 |
*** slaweq_ has joined #openstack-infra | 01:21 | |
pabelanger | did we land new tcp 7900 changes? | 01:21 |
pabelanger | corvus: Shrews: ^ | 01:21 |
pabelanger | yah, we must have | 01:21 |
pabelanger | so, we have a mix of executors on tcp/79 and others on tcp/7900 | 01:22 |
ianw | my other job, 0aee65504f2d491ea497064898f2ad8e, maybe hasn't been picked up by an executor | 01:22 |
pabelanger | 2018-01-11 01:18:11,940 DEBUG zuul.web.LogStreamingHandler: Connecting to finger server ze09:79 | 01:23 |
pabelanger | socket.gaierror: [Errno -2] Name or service not known | 01:23 |
pabelanger | in zuul-web debug log | 01:24 |
*** slaweq has quit IRC | 01:24 | |
ianw | the job at the top of the integrated gate queue | 01:25 |
ianw | http://zuulv3.openstack.org/stream.html?uuid=a379a36040e047cebfbc4b3e2e9d79a3&logfile=console.log | 01:25 |
ianw | 2018-01-10 22:53:10,886 DEBUG zuul.AnsibleJob: [build: a379a36040e047cebfbc4b3e2e9d79a3] Ansible complete, result RESULT_NORMAL code 0 | 01:25 |
*** slaweq_ has quit IRC | 01:25 | |
pabelanger | okay, so I think just ze09 is having web streaming issues | 01:26 |
pabelanger | due to hostname | 01:26 |
ianw | i think my amateur analysis is ... something's not right | 01:26 |
*** ilpianista has joined #openstack-infra | 01:26 | |
pabelanger | yup, hostname on ze09 is ze09 again | 01:27 |
pabelanger | while ze01.o.o is ze01.openstack.org | 01:27 |
clarkb | yes and the plan is to switch everything to be like ze09 | 01:28 |
clarkb | at least that was mordred's idea and no one objected | 01:28 |
*** masayukig_ has quit IRC | 01:28 | |
pabelanger | I guess zuulv3.o.o cannot resolve ze09 right now | 01:28 |
pabelanger | guess we need to append domain | 01:28 |
clarkb | or use IP addrs | 01:29 |
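A quick way to see both halves of this (the short name that zuul-web cannot resolve, and whether an executor's finger daemon is on 79 or the new 7900) is a small resolve-and-connect check; the hostnames and ports below are just the ones being discussed here:

```python
# Sketch: check whether an executor name resolves and which finger port answers.
# Hostnames and ports are the ones discussed above; adjust to taste.
import socket

def check(host, ports=(79, 7900)):
    try:
        addr = socket.getaddrinfo(host, None)[0][4][0]
    except socket.gaierror as exc:
        print("%s: does not resolve (%s)" % (host, exc))  # what zuul-web hits for 'ze09'
        return
    for port in ports:
        try:
            with socket.create_connection((addr, port), timeout=5):
                print("%s (%s): finger daemon answering on %d" % (host, addr, port))
        except OSError:
            print("%s (%s): nothing listening on %d" % (host, addr, port))

for host in ("ze09", "ze09.openstack.org", "ze01.openstack.org"):
    check(host)
```

Appending the domain (or pointing zuul-web at IP addresses) would make the first lookup succeed either way.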
pabelanger | http://paste.openstack.org/show/642528/ | 01:29 |
ianw | infra-root: ^ do we want to agree zuul isn't currently making progress? restart? | 01:30 |
pabelanger | yah, something to deal with in the morning | 01:30 |
*** masayukig has joined #openstack-infra | 01:30 | |
clarkb | ianw: I'll have to defer to your judgement, am on phone and feeding kids | 01:30 |
pabelanger | ianw: no, it is inap; we have been having issues there today | 01:30 |
pabelanger | I'd wait until the job times out | 01:31 |
pabelanger | ianw: we should see if we can SSH into node and see what network looks like | 01:31 |
ianw | pabelanger: but look at something like a379a36040e047cebfbc4b3e2e9d79a3 ? it appears to have finished but zuul hasn't noticed? | 01:32 |
pabelanger | ianw: what does load look like on the executor? | 01:32 |
ianw | that job went to ze03 | 01:32 |
ianw | zuul-executor is busy there, but probably not more than normal. load ~1.5 | 01:33 |
pabelanger | looks like a379a36040e047cebfbc4b3e2e9d79a3 is still running | 01:33 |
*** zhurong has joined #openstack-infra | 01:33 | |
*** felipemonteiro has quit IRC | 01:35 | |
pabelanger | ianw: I think we need to wait for a379a36040e047cebfbc4b3e2e9d79a3 to timeout | 01:35 |
ianw | yeah ok, that has gone to internap | 01:36 |
*** caphrim007_ has joined #openstack-infra | 01:36 | |
ianw | ubuntu-xenial-inap-mtl01-0001803996 | 01:36 |
ianw | ac4dbd3dd3eb4dbbbbed8ab3171500b2 is a docs job that should have finished ages ago however ... | 01:36 |
ianw | that one went to ovh | 01:37 |
*** caphrim007 has quit IRC | 01:38 | |
pabelanger | ianw: do you know which executor ac4dbd3dd3eb4dbbbbed8ab3171500b2 was on? | 01:39 |
pabelanger | ze04.o.o, according to scheduler | 01:39 |
ianw | yes, ze04 ... it went to 158.69.75.101 | 01:39 |
*** yamamoto has joined #openstack-infra | 01:39 | |
ianw | it's got everything cloned, so something happened | 01:40 |
pabelanger | wow | 01:40 |
pabelanger | I think ze04.o.o was stopped and started | 01:40 |
pabelanger | or killed | 01:40 |
pabelanger | 2018-01-11 01:00:06,180 DEBUG zuul.log_streamer: LogStreamer starting on port 7900 | 01:41 |
*** askb has quit IRC | 01:41 | |
ianw | hmm, this might explain the 11 executors | 01:41 |
pabelanger | something happened 40mins ago | 01:41 |
ianw | puppet run? | 01:41 |
*** caphrim007_ has quit IRC | 01:41 | |
pabelanger | checking | 01:42 |
*** rcernin has quit IRC | 01:42 | |
*** askb has joined #openstack-infra | 01:43 | |
*** edmondsw has joined #openstack-infra | 01:44 | |
pabelanger | we rebooted 40mins ago | 01:44 |
pabelanger | which makes sense | 01:44 |
pabelanger | is that when we did meltdown reboots? | 01:44 |
ianw | no, i rebooted them all yesterday! | 01:44 |
pabelanger | I think we did it again | 01:44 |
pabelanger | uptime is 1hr:44mins | 01:44 |
ianw | actually only 44 minutes (it's 1:44) ... | 01:45 |
pabelanger | okay, so I think the reboot in logs was from server reboot, so executor didn't crash | 01:46 |
pabelanger | however | 01:46 |
ianw | afaics there was no login around the time of the reboot. was it externally triggered? | 01:46 |
*** shu-mutou-AWAY is now known as shu-mutou | 01:46 | |
pabelanger | ac4dbd3dd3eb4dbbbbed8ab3171500b2 was running just before server reboot | 01:47 |
*** yamamoto has quit IRC | 01:47 | |
pabelanger | maybe? | 01:47 |
clarkb | we didn't do ze's unless jeblair did them? | 01:47 |
pabelanger | was the server live migrated? | 01:47 |
ianw | it must have been, or something. it happened at exactly 01:00 | 01:47 |
pabelanger | because executor was killed when ac4dbd3dd3eb4dbbbbed8ab3171500b2 was running | 01:48 |
pabelanger | so, in that case, zuul does think it is running | 01:48 |
*** askb has quit IRC | 01:48 | |
pabelanger | when it is in fact done | 01:48 |
pabelanger | which means, I do think we need to dump queues and restart scheduler | 01:48 |
*** edmondsw has quit IRC | 01:48 | |
ianw | yah :/ | 01:48 |
pabelanger | okay, then I agree | 01:49 |
*** askb has joined #openstack-infra | 01:49 | |
pabelanger | that also explains why it is running as tcp/7900 | 01:49 |
pabelanger | since daemon restarted | 01:49 |
*** markvoelker has joined #openstack-infra | 01:50 | |
*** rcernin has joined #openstack-infra | 01:50 | |
ianw | is anyone logged into the rax console where they send those notifications? | 01:50 |
Shrews | pabelanger: hrm, yes, it seems my executor change recently landed | 01:51 |
ianw | is this finger thing something we want to consider before reloading? | 01:51 |
Shrews | possibly? we may not have considered all of the things that need to change for it | 01:53 |
*** bobh has joined #openstack-infra | 01:53 | |
Shrews | there's this: https://review.openstack.org/532594 | 01:53 |
Shrews | and we might need iptables rules for the new port? | 01:53 |
ianw | with the gate the way it is, unless we start force pushing, not sure we can do too much changing | 01:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now. https://review.openstack.org/532709 | 01:57 |
*** Apoorva has quit IRC | 01:58 | |
Shrews | 532594 and 532709 are going to be needed since that change landed. Then restart the executors. I'm not sure what the recently restarted ze* servers are going to need since they're probably running as root and not zuul now. Might run into file permission issues? Can't speak to that. | 02:00 |
Shrews | ianw: pabelanger: I wasn't expecting that change to land so quickly. My fault for not marking it WIP or -2 before we discussed coordination. | 02:03 |
Shrews | ianw: pabelanger: what can i do to assist here? | 02:04 |
ianw | well only ze04 has probably restarted with the new code | 02:04 |
ianw | is the only consequence that live streaming doesn't work? | 02:04 |
Shrews | right | 02:05 |
Shrews | i see the processes there running as root, not zuul. i'm not sure of the implications of that | 02:05 |
Shrews | corvus would know | 02:06 |
pabelanger | we could manually downgrade ze04.o.o for tonight | 02:06 |
pabelanger | then work on rolling out new code | 02:07 |
ianw | or just turn it off? | 02:07 |
pabelanger | or that | 02:07 |
pabelanger | maybe best for tonight | 02:07 |
pabelanger | I'm going to have to run shortly | 02:07 |
ianw | yeah, given it's run as root now, turning it back to zuul seems dangerous | 02:07 |
ianw | so, let's stop ze04 | 02:08 |
ianw | then i'll dump & restart zuul scheduler | 02:08 |
pabelanger | wfm | 02:08 |
ianw | and cross fingers none of the other ze's restart | 02:08 |
*** xarses has joined #openstack-infra | 02:08 | |
ianw | and the a-team can pick this up tomorrow | 02:08 |
pabelanger | agree | 02:08 |
ianw | ok | 02:08 |
Shrews | wfm x 2. should I -2 532594 and 532709 for now? | 02:08 |
Shrews | landing 709 w/o restarting the executors would break streaming on all executors until restart | 02:09 |
ianw | i think so, just for safety | 02:09 |
pabelanger | we could make 709 support both 79,7900 then remove 79 in a follow-up | 02:10 |
*** gouthamr has joined #openstack-infra | 02:10 | |
*** ramishra has joined #openstack-infra | 02:10 | |
pabelanger | might be safer approach | 02:10 |
Shrews | -2'd for now | 02:11 |
pabelanger | kk | 02:11 |
*** annp has joined #openstack-infra | 02:11 | |
ianw | #status log zuul-executor stopped on ze04.o.o and it is placed in the emergency file, due to an external reboot applying https://review.openstack.org/#/c/532575/. we will need to more carefully consider the rollout of this code | 02:13 |
openstackstatus | ianw: finished logging | 02:13 |
*** Apoorva has joined #openstack-infra | 02:14 | |
*** gouthamr_ has joined #openstack-infra | 02:15 | |
*** gouthamr has quit IRC | 02:15 | |
pabelanger | ianw: okay, I'm EOD, good luck | 02:17 |
ianw | i'm not sure i like the look of this | 02:18 |
ianw | http://paste.openstack.org/show/642531/ | 02:18 |
*** namnh has joined #openstack-infra | 02:18 | |
*** yamahata has joined #openstack-infra | 02:18 | |
Shrews | ianw: where's that from? | 02:18 |
Shrews | i think dmsimard pasted something similar a couple of days ago | 02:19 |
*** gouthamr_ has quit IRC | 02:19 | |
*** yamamoto has joined #openstack-infra | 02:19 | |
dmsimard | yeah | 02:19 |
dmsimard | did you restart zuul ? | 02:19 |
*** Apoorva has quit IRC | 02:19 | |
dmsimard | iirc I saw that being spammed (like hell) in zuulv3.o.o logs after starting the scheduler/web | 02:20 |
ianw | yes, just restarted | 02:20 |
Shrews | dmsimard: did anyone look into that? | 02:20 |
ianw | it seems to be going ... | 02:20 |
ianw | ok, i'm going to requeue | 02:20 |
*** rmcall has quit IRC | 02:20 | |
dmsimard | Shrews: no one jumped on it when I pointed it out, no | 02:21 |
ianw | if its job is to scare the daylights out of the person restarting, then it has worked :) | 02:22 |
dmsimard | ianw: iirc there's the same errors in the zuul-web logs too | 02:22 |
dmsimard | not just zuul-scheduler | 02:22 |
dmsimard | yeah, zuul web: http://paste.openstack.org/raw/642533/ | 02:23 |
*** gouthamr has joined #openstack-infra | 02:23 | |
Shrews | looks like it might be a bug in the /{tenant}/status route? | 02:30 |
Shrews | tristanC: know anything about that? ^^^^ | 02:30 |
dmsimard | I was talking to him about that just now :D | 02:30 |
tristanC | Shrews: yep, i'm currently writing better exception handling | 02:30 |
Shrews | ianw: my instinct says it's not critical | 02:30 |
Shrews | w00t! | 02:31 |
dmsimard | He says it's because when the scheduler restarts, the layouts are not yet available, so when you try to load the webpage it tries to seek them out | 02:31 |
dmsimard | or something along those lines | 02:31 |
tristanC | dmsimard: ianw: this happens when the requested tenant layout isn't ready in the scheduler | 02:31 |
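The shape of the guard tristanC is describing (sketched here from the conversation, not quoted from the actual 532718 patch; all names are illustrative) is simply to treat a tenant whose layout has not been loaded yet as a temporary condition rather than letting the status formatter raise:

```python
# Sketch of the guard described above; illustrative names, not the zuul patch.
from typing import Optional


class Tenant:
    def __init__(self, name: str, layout: Optional[object] = None):
        self.name = name
        self.layout = layout


def format_status(tenants: dict, tenant_name: str) -> dict:
    tenant = tenants.get(tenant_name)
    if tenant is None or tenant.layout is None:
        # Right after a scheduler restart the layout is not (re)loaded yet.
        return {"code": 503, "message": "Tenant %s is still loading" % tenant_name}
    return {"code": 200, "pipelines": []}  # a real formatter would walk tenant.layout


print(format_status({"openstack": Tenant("openstack")}, "openstack"))
# -> {'code': 503, 'message': 'Tenant openstack is still loading'}
```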
Shrews | dmsimard: tristanC: thx for the explanation | 02:32 |
ianw | tristanC: ok :) crisis averted | 02:33 |
ianw | still requeueing | 02:33 |
*** Qiming has quit IRC | 02:33 | |
*** markvoelker has quit IRC | 02:36 | |
*** yamahata has quit IRC | 02:38 | |
ianw | #status log zuul restarted due to the unexpected loss of ze04; jobs requeued | 02:38 |
openstackstatus | ianw: finished logging | 02:38 |
*** yamahata has joined #openstack-infra | 02:39 | |
ianw | ok, we're back to 9 executors in grafana, which seems right | 02:39 |
*** Qiming has joined #openstack-infra | 02:39 | |
*** gouthamr has quit IRC | 02:40 | |
ianw | will just leave things for a bit, will send a summary email to avoid people having to pull apart chat logs | 02:40 |
*** caphrim007 has joined #openstack-infra | 02:41 | |
*** hongbin has joined #openstack-infra | 02:44 | |
dmsimard | we have 10, what other executor is missing ? | 02:45 |
*** caphrim007 has quit IRC | 02:45 | |
*** threestrands has joined #openstack-infra | 02:47 | |
*** yamahata has quit IRC | 02:54 | |
*** mriedem has quit IRC | 02:56 | |
*** wolverineav has joined #openstack-infra | 03:01 | |
*** fultonj has quit IRC | 03:01 | |
*** masber has quit IRC | 03:02 | |
*** fultonj has joined #openstack-infra | 03:02 | |
*** masber has joined #openstack-infra | 03:05 | |
*** wolverineav has quit IRC | 03:05 | |
*** ijw has joined #openstack-infra | 03:06 | |
*** aeng has quit IRC | 03:08 | |
*** aeng has joined #openstack-infra | 03:08 | |
*** katkapilatova1 has quit IRC | 03:08 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error https://review.openstack.org/532718 | 03:10 |
*** rmcall has joined #openstack-infra | 03:10 | |
*** yamamoto_ has joined #openstack-infra | 03:12 | |
ianw | dmsimard: 4 is turned off, see prior, and grafana was saying there were 11 at one point, so something had gone wrong there | 03:14 |
*** yamamoto has quit IRC | 03:16 | |
*** gyee has quit IRC | 03:16 | |
*** felipemonteiro has joined #openstack-infra | 03:18 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error https://review.openstack.org/532718 | 03:21 |
*** slaweq has joined #openstack-infra | 03:21 | |
*** harlowja_ has quit IRC | 03:21 | |
*** slaweq_ has joined #openstack-infra | 03:21 | |
*** ijw has quit IRC | 03:22 | |
*** slaweq has quit IRC | 03:25 | |
*** ramishra has quit IRC | 03:25 | |
*** slaweq_ has quit IRC | 03:26 | |
*** ramishra has joined #openstack-infra | 03:27 | |
*** wolverineav has joined #openstack-infra | 03:33 | |
*** wolverineav has quit IRC | 03:38 | |
*** ijw has joined #openstack-infra | 03:39 | |
*** sbra has joined #openstack-infra | 03:42 | |
*** ramishra has quit IRC | 03:45 | |
*** ijw has quit IRC | 03:47 | |
*** rlandy|bbl is now known as rlandy | 03:50 | |
mtreinish | clarkb: I just did a quick test and afs works fine on one of my raspberry pis running arch | 03:54 |
*** sbra has quit IRC | 03:54 | |
*** verdurin_ has joined #openstack-infra | 03:55 | |
*** verdurin has quit IRC | 03:56 | |
*** felipemonteiro has quit IRC | 03:57 | |
*** caphrim007 has joined #openstack-infra | 03:58 | |
*** ameliac has quit IRC | 04:00 | |
*** ramishra has joined #openstack-infra | 04:01 | |
*** verdurin has joined #openstack-infra | 04:03 | |
*** verdurin_ has quit IRC | 04:04 | |
*** xarses has quit IRC | 04:05 | |
openstackgerrit | Merged openstack/diskimage-builder master: Revert "Dont install python-pip for py3k" https://review.openstack.org/532395 | 04:05 |
*** esberglu has quit IRC | 04:10 | |
*** bobh has quit IRC | 04:11 | |
*** zhurong has quit IRC | 04:12 | |
*** xarses has joined #openstack-infra | 04:13 | |
*** xarses_ has joined #openstack-infra | 04:14 | |
*** xarses_ has quit IRC | 04:14 | |
*** xarses_ has joined #openstack-infra | 04:15 | |
*** felipemonteiro has joined #openstack-infra | 04:15 | |
*** daidv has joined #openstack-infra | 04:16 | |
*** xarses has quit IRC | 04:18 | |
*** jamesmcarthur has joined #openstack-infra | 04:18 | |
*** jamesmcarthur has quit IRC | 04:18 | |
*** jamesmcarthur has joined #openstack-infra | 04:19 | |
*** sree has joined #openstack-infra | 04:21 | |
*** armax has quit IRC | 04:24 | |
*** links has joined #openstack-infra | 04:25 | |
*** spzala has joined #openstack-infra | 04:31 | |
*** andreas_s has joined #openstack-infra | 04:32 | |
*** ykarel|away has joined #openstack-infra | 04:33 | |
*** rosmaita has quit IRC | 04:34 | |
*** bhavik1 has joined #openstack-infra | 04:35 | |
*** shu-mutou has quit IRC | 04:35 | |
*** shu-mutou has joined #openstack-infra | 04:36 | |
*** xarses_ has quit IRC | 04:37 | |
*** andreas_s has quit IRC | 04:37 | |
*** felipemonteiro has quit IRC | 04:39 | |
*** spzala has quit IRC | 04:42 | |
*** dingyichen has quit IRC | 04:43 | |
*** rlandy has quit IRC | 04:53 | |
*** rmcall has quit IRC | 04:54 | |
*** dingyichen has joined #openstack-infra | 04:55 | |
*** links has quit IRC | 04:56 | |
*** links has joined #openstack-infra | 04:57 | |
*** janki has joined #openstack-infra | 05:07 | |
*** dingyichen has quit IRC | 05:09 | |
*** dingyichen has joined #openstack-infra | 05:09 | |
*** nibalizer has joined #openstack-infra | 05:14 | |
*** claudiub has joined #openstack-infra | 05:15 | |
*** sree has quit IRC | 05:19 | |
*** edmondsw has joined #openstack-infra | 05:20 | |
*** psachin has joined #openstack-infra | 05:21 | |
*** edmondsw has quit IRC | 05:24 | |
*** gema has quit IRC | 05:26 | |
*** gema has joined #openstack-infra | 05:27 | |
*** psachin has quit IRC | 05:27 | |
*** zhurong has joined #openstack-infra | 05:28 | |
*** rcernin has quit IRC | 05:29 | |
*** CrayZee has joined #openstack-infra | 05:29 | |
*** harlowja has joined #openstack-infra | 05:29 | |
*** sree has joined #openstack-infra | 05:29 | |
*** gcb has quit IRC | 05:30 | |
*** gcb has joined #openstack-infra | 05:31 | |
*** hongbin has quit IRC | 05:32 | |
*** shu-mutou is now known as shu-mutou-AWAY | 05:32 | |
*** wolverineav has joined #openstack-infra | 05:34 | |
*** psachin has joined #openstack-infra | 05:35 | |
*** wolverin_ has joined #openstack-infra | 05:35 | |
*** jamesmcarthur has quit IRC | 05:36 | |
*** wolveri__ has joined #openstack-infra | 05:38 | |
*** wolverineav has quit IRC | 05:39 | |
*** wolverin_ has quit IRC | 05:39 | |
*** bhavik1 has quit IRC | 05:42 | |
*** wolveri__ has quit IRC | 05:43 | |
*** markvoelker has joined #openstack-infra | 05:47 | |
*** coolsvap has joined #openstack-infra | 05:47 | |
*** dhajare has joined #openstack-infra | 05:47 | |
*** jamesmcarthur has joined #openstack-infra | 05:52 | |
*** xinliang has quit IRC | 05:56 | |
*** sandanar has joined #openstack-infra | 05:58 | |
*** ykarel|away is now known as ykarel | 05:59 | |
*** jamesmcarthur has quit IRC | 06:00 | |
*** jbadiapa has quit IRC | 06:06 | |
*** xinliang has joined #openstack-infra | 06:08 | |
*** xinliang has quit IRC | 06:08 | |
*** xinliang has joined #openstack-infra | 06:08 | |
CrayZee | Hi, is there anything going on with zuul? I see over 300 tasks in the queue, with tasks running over 3.5 hours that were probably requeued? | 06:08 |
CrayZee | e.g. 532631... | 06:09 |
*** liujiong has joined #openstack-infra | 06:09 | |
*** slaweq has joined #openstack-infra | 06:09 | |
*** slaweq has quit IRC | 06:13 | |
*** kjackal has joined #openstack-infra | 06:14 | |
*** armaan has quit IRC | 06:17 | |
ianw | CrayZee: I think we're just flat out ATM. we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish | 06:17 |
*** Hunner has quit IRC | 06:19 | |
*** bmjen has quit IRC | 06:19 | |
CrayZee | ianw: thanks | 06:21 |
*** liujiong has quit IRC | 06:28 | |
*** jamesmcarthur has joined #openstack-infra | 06:28 | |
*** jamesmcarthur has quit IRC | 06:33 | |
*** liujiong has joined #openstack-infra | 06:33 | |
*** jamesmcarthur has joined #openstack-infra | 06:35 | |
*** annp has quit IRC | 06:38 | |
*** markvoelker has quit IRC | 06:40 | |
*** markvoelker has joined #openstack-infra | 06:42 | |
eumel8 | ianw: you're still around? | 06:44 |
*** kzaitsev_pi has quit IRC | 06:45 | |
*** yamamoto has joined #openstack-infra | 06:45 | |
ianw | eumel8: a little ... wrapping up | 06:45 |
*** alexchadin has joined #openstack-infra | 06:46 | |
*** harlowja has quit IRC | 06:47 | |
eumel8 | ianw: morning :) can you please take a look into https://review.openstack.org/#/c/531736/ and plan the next upgrade of Zanata? | 06:47 |
*** yamamoto_ has quit IRC | 06:49 | |
*** kzaitsev_pi has joined #openstack-infra | 06:52 | |
ianw | eumel8: ok, i can probably try that out tomorrow my time. basically 1) pause puppet 2) do the table drop 3) commit that 4) restart puppet? | 06:54 |
eumel8 | ianw: drop database, not table :) | 06:55 |
*** aeng has quit IRC | 06:56 | |
ianw | ahh, ok. how about i stop puppet for it now & we get it in the queue | 06:56 |
eumel8 | ianw: sounds good, thx | 06:57 |
*** threestrands has quit IRC | 07:00 | |
*** sbra has joined #openstack-infra | 07:00 | |
*** threestrands has joined #openstack-infra | 07:01 | |
*** threestrands has quit IRC | 07:02 | |
*** threestrands has joined #openstack-infra | 07:03 | |
*** threestrands has quit IRC | 07:03 | |
*** threestrands has joined #openstack-infra | 07:03 | |
*** threestrands has quit IRC | 07:04 | |
*** threestrands has joined #openstack-infra | 07:04 | |
*** edmondsw has joined #openstack-infra | 07:08 | |
*** jamesmcarthur has quit IRC | 07:12 | |
*** edmondsw has quit IRC | 07:13 | |
*** armaan has joined #openstack-infra | 07:13 | |
*** dhill_ has quit IRC | 07:13 | |
*** dhill__ has joined #openstack-infra | 07:13 | |
*** jamesmcarthur has joined #openstack-infra | 07:19 | |
masayukig | mtreinish: clarkb: yeah, I don't have a lot of stuff installed on my machines. But I also tested it on a very slow hdd. Files might be on disk cache because the size is not so big. | 07:19 |
*** pcaruana has joined #openstack-infra | 07:21 | |
*** dims_ has quit IRC | 07:21 | |
openstackgerrit | Merged openstack-infra/infra-manual master: Clarify the point of a repo in a zuulv3 job name https://review.openstack.org/532686 | 07:21 |
*** hemna_ has quit IRC | 07:21 | |
*** jamesmcarthur has quit IRC | 07:24 | |
*** dims has joined #openstack-infra | 07:25 | |
openstackgerrit | Merged openstack-infra/system-config master: Upgrade translate-dev.o.o to Zanata 4.3.3 https://review.openstack.org/531736 | 07:25 |
openstackgerrit | Merged openstack-infra/project-config master: Switch grafana neutron board to non-legacy jobs https://review.openstack.org/532632 | 07:26 |
*** afred312 has quit IRC | 07:27 | |
*** abelur_ has quit IRC | 07:29 | |
*** vsaienk0 has joined #openstack-infra | 07:29 | |
*** jamesmcarthur has joined #openstack-infra | 07:30 | |
*** afred312 has joined #openstack-infra | 07:30 | |
AJaeger | config-core, could you review some changes on https://etherpad.openstack.org/p/Nvt3ovbn5x , please? I put up ready changes there for easier review | 07:34 |
*** jamesmcarthur has quit IRC | 07:35 | |
*** slaweq has joined #openstack-infra | 07:38 | |
*** jtomasek has joined #openstack-infra | 07:39 | |
*** jamesmcarthur has joined #openstack-infra | 07:39 | |
*** jtomasek has joined #openstack-infra | 07:39 | |
*** armaan has quit IRC | 07:41 | |
*** armaan has joined #openstack-infra | 07:41 | |
*** jamesmcarthur has quit IRC | 07:44 | |
openstackgerrit | Merged openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla https://review.openstack.org/532705 | 07:45 |
*** jamesmcarthur has joined #openstack-infra | 07:46 | |
*** armaan has quit IRC | 07:46 | |
*** armaan has joined #openstack-infra | 07:47 | |
*** dciabrin has quit IRC | 07:49 | |
AJaeger | frickler: could you review https://review.openstack.org/#/c/523018/ to remove an unused template so that it won't get used again, please? | 07:50 |
*** andreas_s has joined #openstack-infra | 07:53 | |
*** jamesmcarthur has quit IRC | 07:55 | |
gema | mtreinish: thanks for trying it, good to know afs can be built on arm :) | 07:57 |
AJaeger | infra-root, looking at http://grafana.openstack.org/dashboard/db/nodepool I see a nearly constant number of "deleting nodes". Do we have a problem with deletion somewhere? | 07:57 |
*** annp has joined #openstack-infra | 08:01 | |
*** jamesmcarthur has joined #openstack-infra | 08:04 | |
*** evin has joined #openstack-infra | 08:04 | |
*** sbra has quit IRC | 08:06 | |
*** threestrands has quit IRC | 08:08 | |
*** jamesmcarthur has quit IRC | 08:09 | |
*** slaweq_ has joined #openstack-infra | 08:10 | |
*** apetrich has quit IRC | 08:13 | |
*** florianf has joined #openstack-infra | 08:14 | |
*** slaweq_ has quit IRC | 08:14 | |
*** jamesmcarthur has joined #openstack-infra | 08:15 | |
*** pcichy has joined #openstack-infra | 08:17 | |
*** tesseract has joined #openstack-infra | 08:17 | |
*** ramishra has quit IRC | 08:18 | |
*** aviau has quit IRC | 08:20 | |
*** aviau has joined #openstack-infra | 08:20 | |
*** ramishra has joined #openstack-infra | 08:20 | |
*** jamesmcarthur has quit IRC | 08:21 | |
*** jamesmcarthur has joined #openstack-infra | 08:22 | |
*** shardy has joined #openstack-infra | 08:22 | |
*** jamesmcarthur has quit IRC | 08:26 | |
openstackgerrit | Vasyl Saienko proposed openstack-infra/project-config master: Add networking-generic-switch-tempest-plugin repo https://review.openstack.org/532542 | 08:27 |
openstackgerrit | Vasyl Saienko proposed openstack-infra/project-config master: Add jobs for n-g-s tempest-plugin https://review.openstack.org/532543 | 08:27 |
*** alexchadin has quit IRC | 08:28 | |
*** tosky has joined #openstack-infra | 08:29 | |
*** cshastri has joined #openstack-infra | 08:31 | |
*** jamesmcarthur has joined #openstack-infra | 08:31 | |
*** jpena|off is now known as jpena | 08:34 | |
*** jamesmcarthur has quit IRC | 08:36 | |
*** jamesmcarthur has joined #openstack-infra | 08:37 | |
*** jbadiapa has joined #openstack-infra | 08:40 | |
*** alexchadin has joined #openstack-infra | 08:41 | |
*** jamesmcarthur has quit IRC | 08:42 | |
*** pcaruana has quit IRC | 08:44 | |
*** links has quit IRC | 08:46 | |
openstackgerrit | Dmitrii Shcherbakov proposed openstack-infra/project-config master: add charm-panko project-config https://review.openstack.org/532769 | 08:47 |
*** jaosorior has joined #openstack-infra | 08:47 | |
*** b_bezak has joined #openstack-infra | 08:47 | |
*** armaan has quit IRC | 08:47 | |
*** armaan has joined #openstack-infra | 08:48 | |
*** jamesmcarthur has joined #openstack-infra | 08:48 | |
*** ralonsoh has joined #openstack-infra | 08:50 | |
*** jamesmcarthur has quit IRC | 08:53 | |
*** dingyichen has quit IRC | 08:54 | |
*** edmondsw has joined #openstack-infra | 08:56 | |
*** b_bezak has quit IRC | 08:57 | |
*** hashar has joined #openstack-infra | 08:57 | |
*** links has joined #openstack-infra | 08:59 | |
*** jamesmcarthur has joined #openstack-infra | 09:01 | |
*** edmondsw has quit IRC | 09:01 | |
*** jpich has joined #openstack-infra | 09:02 | |
*** jamesmcarthur has quit IRC | 09:05 | |
*** pblaho has quit IRC | 09:06 | |
*** sree_ has joined #openstack-infra | 09:10 | |
*** sree_ is now known as Guest37375 | 09:11 | |
*** sree has quit IRC | 09:12 | |
*** sree has joined #openstack-infra | 09:12 | |
*** Guest37375 has quit IRC | 09:15 | |
*** dbecker has joined #openstack-infra | 09:17 | |
*** jamesmcarthur has joined #openstack-infra | 09:17 | |
*** flaper87 has quit IRC | 09:17 | |
*** pcichy has quit IRC | 09:18 | |
*** dsariel has joined #openstack-infra | 09:19 | |
*** pblaho has joined #openstack-infra | 09:19 | |
*** jamesmcarthur has quit IRC | 09:19 | |
*** flaper87 has joined #openstack-infra | 09:21 | |
*** flaper87 has quit IRC | 09:21 | |
*** e0ne has joined #openstack-infra | 09:21 | |
*** flaper87 has joined #openstack-infra | 09:23 | |
*** lucas-afk is now known as lucasagomes | 09:26 | |
*** kopecmartin has joined #openstack-infra | 09:27 | |
*** armaan has quit IRC | 09:27 | |
*** armaan has joined #openstack-infra | 09:27 | |
*** dsariel has quit IRC | 09:27 | |
*** pblaho has quit IRC | 09:36 | |
*** armaan has quit IRC | 09:36 | |
*** dtantsur|afk is now known as dtantsur | 09:38 | |
ykarel | Jobs are in the queue for long, some jobs for 7+ hr; is some issue going on or is it just that infra is out of nodes? | 09:43 |
*** pblaho has joined #openstack-infra | 09:48 | |
*** links has quit IRC | 09:51 | |
*** apetrich has joined #openstack-infra | 09:53 | |
*** vsaienk0 has quit IRC | 09:55 | |
*** stakeda has quit IRC | 09:57 | |
CrayZee | ykarel: This is the answer I got a few hours ago: "(06:17:58 UTC) ianw: CrayZee: I think we're just flat out ATM. we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish" | 09:59 |
*** CrayZee has quit IRC | 10:00 | |
*** armaan has joined #openstack-infra | 10:03 | |
*** pcaruana has joined #openstack-infra | 10:04 | |
*** links has joined #openstack-infra | 10:05 | |
ykarel | CrayZee Thanks | 10:06 |
*** slaweq_ has joined #openstack-infra | 10:11 | |
*** markvoelker has quit IRC | 10:13 | |
*** ankkumar has joined #openstack-infra | 10:13 | |
*** sambetts|afk is now known as sambetts | 10:14 | |
*** namnh has quit IRC | 10:14 | |
*** slaweq_ has quit IRC | 10:15 | |
ankkumar | Hi | 10:16 |
ankkumar | I am trying to build openstack/ironic CI using zuul v3 | 10:16 |
ankkumar | Can anyone tell me how it would read my customized zuul.yaml file, because the openstack/ironic repo has their own? | 10:16 |
*** cuongnv has quit IRC | 10:16 | |
ankkumar | How would my gate run, I mean using which zuul yaml files? | 10:16 |
*** pbourke has joined #openstack-infra | 10:16 | |
*** liujiong has quit IRC | 10:23 | |
*** sree_ has joined #openstack-infra | 10:25 | |
*** vsaienk0 has joined #openstack-infra | 10:25 | |
*** sree_ is now known as Guest88600 | 10:26 | |
*** ricolin has quit IRC | 10:28 | |
*** sree has quit IRC | 10:29 | |
*** rosmaita has joined #openstack-infra | 10:32 | |
*** alexchadin has quit IRC | 10:34 | |
*** cshastri has quit IRC | 10:35 | |
*** zhurong has quit IRC | 10:35 | |
*** ldnunes has joined #openstack-infra | 10:37 | |
*** abelur_ has joined #openstack-infra | 10:37 | |
*** e0ne has quit IRC | 10:38 | |
*** e0ne has joined #openstack-infra | 10:39 | |
*** sandanar has quit IRC | 10:39 | |
*** sandanar has joined #openstack-infra | 10:40 | |
*** edmondsw has joined #openstack-infra | 10:44 | |
*** pcichy has joined #openstack-infra | 10:46 | |
*** slaweq_ has joined #openstack-infra | 10:46 | |
*** danpawlik_ has joined #openstack-infra | 10:46 | |
*** maciejjo1 has joined #openstack-infra | 10:47 | |
*** andreas_s has quit IRC | 10:48 | |
*** slaweq_ has quit IRC | 10:48 | |
*** maciejjozefczyk has quit IRC | 10:48 | |
*** danpawlik has quit IRC | 10:48 | |
*** slaweq_ has joined #openstack-infra | 10:48 | |
*** cshastri has joined #openstack-infra | 10:48 | |
*** slaweq has quit IRC | 10:48 | |
*** edmondsw has quit IRC | 10:49 | |
*** pcichy has quit IRC | 10:50 | |
*** pcichy has joined #openstack-infra | 10:50 | |
*** numans has quit IRC | 10:54 | |
*** numans has joined #openstack-infra | 10:55 | |
*** andreas_s has joined #openstack-infra | 10:58 | |
frickler | infra-root: AJaeger: ack, seeing lots of deletion-related exceptions on nl02, will try to dig a bit more | 11:10 |
*** andreas_s has quit IRC | 11:19 | |
*** shardy has quit IRC | 11:22 | |
*** shardy has joined #openstack-infra | 11:24 | |
*** panda|rover|afk has quit IRC | 11:24 | |
*** derekh has joined #openstack-infra | 11:25 | |
*** eyalb has joined #openstack-infra | 11:26 | |
*** maciejjo1 is now known as maciejjozefczyk | 11:27 | |
eyalb | any news on zuul? our jobs are queuing for a long time | 11:28 |
frickler | eyalb: due to various issues over the last days we have a pretty large backlog currently, please be patient. if you have specific jobs that seem stuck, please post them here so we can take a look | 11:30 |
eyalb | frickler: thanks | 11:31 |
*** tpsilva_ has joined #openstack-infra | 11:31 | |
frickler | infra-root: seems we have multiple issues currently: a) nodes stuck in deleting state, this seems to be on all providers, all since about the time zuul seems to have been restarted (around 22:30) | 11:32 |
frickler | infra-root: b) some timeouts when deleting current nodes and c) some more issues with infracloud-vanilla | 11:33 |
frickler | for a) I think a restart of nodepool might help, but I don't want to make the situation worse. this seems to be just blocking about 30% of our quota, so not optimal, but also not catastrophic I'd say, as the remaining quota seems to get used without issues | 11:34 |
*** tpsilva_ is now known as tpsilva | 11:34 | |
frickler | b) might actually be normal behaviour. c) infracloud-vanilla seems to have been taken offline anyway, so also not critical | 11:35 |
*** andreas_s has joined #openstack-infra | 11:36 | |
*** panda has joined #openstack-infra | 11:38 | |
frickler | although it may be that the leaked instance cleanup is not working due to c) | 11:38 |
*** apetrich has quit IRC | 11:38 | |
*** sdague has joined #openstack-infra | 11:39 | |
*** wolverineav has joined #openstack-infra | 11:39 | |
*** andreas_s has quit IRC | 11:40 | |
*** wolverineav has quit IRC | 11:44 | |
*** andreas_s has joined #openstack-infra | 11:50 | |
*** florianf has quit IRC | 11:52 | |
*** florianf has joined #openstack-infra | 11:53 | |
*** rhallisey has joined #openstack-infra | 11:54 | |
*** andreas_s has quit IRC | 11:55 | |
Shrews | frickler: AJaeger: ugh, not what i want to see when i first wake up. i'll take a look at nodepool | 11:56 |
*** e0ne has quit IRC | 11:58 | |
*** janki has quit IRC | 11:59 | |
Shrews | oh, this is new: http://paste.openstack.org/show/642553/ | 11:59 |
*** panda has quit IRC | 11:59 | |
Shrews | a restart of nodepool would not help with that | 12:00 |
Shrews | frickler: you said "more" issues with vanilla. what were the previous issues? | 12:00 |
*** davidlenwell has quit IRC | 12:01 | |
*** yamamoto has quit IRC | 12:01 | |
*** davidlenwell has joined #openstack-infra | 12:01 | |
*** yamamoto has joined #openstack-infra | 12:01 | |
*** electrical has quit IRC | 12:01 | |
*** icey has quit IRC | 12:02 | |
*** pcaruana has quit IRC | 12:02 | |
*** pcaruana|afk| has joined #openstack-infra | 12:02 | |
*** icey has joined #openstack-infra | 12:02 | |
*** electrical has joined #openstack-infra | 12:02 | |
Shrews | oh, the meltdown patches causing issues | 12:03 |
Shrews | we might should disable all of vanilla until someone can look into that infrastructure a bit more | 12:04 |
*** jpena is now known as jpena|lunch | 12:04 | |
*** jkilpatr has joined #openstack-infra | 12:05 | |
*** annp has quit IRC | 12:05 | |
Shrews | oh, it IS disabled in nodepool.yaml | 12:06 |
*** kjackal has quit IRC | 12:07 | |
Shrews | oh geez, happening across other providers too. *sigh* | 12:08 |
Shrews | mordred: was there a new shade release recently? | 12:09 |
*** dsariel has joined #openstack-infra | 12:10 | |
*** arxcruz|ruck is now known as arxcruz|rover | 12:11 | |
*** slaweq has joined #openstack-infra | 12:11 | |
frickler | Shrews: the failure in your paste is happening because keystone isn't responding. but I'm also seeing timeouts when deleting nodes from other providers | 12:12 |
*** panda has joined #openstack-infra | 12:13 | |
Shrews | frickler: yeah, looking at those on the other launcher now | 12:13 |
Shrews | seems to be affecting rax, ovh, and chocolate too | 12:13 |
Shrews | no idea what's up with those | 12:13 |
*** markvoelker has joined #openstack-infra | 12:14 | |
frickler | Shrews: http://paste.openstack.org/show/642563/ seems to show the reason for infra-chocolate | 12:15 |
AJaeger | Shrews: https://review.openstack.org/532705 disabled vanilla already. | 12:16 |
frickler | Shrews: nodepool seems always to retry deleting the same node before deleting others | 12:16 |
*** slaweq has quit IRC | 12:16 | |
AJaeger | frickler: thanks for looking into this. | 12:16 |
AJaeger | and good morning, Shrews ! | 12:17 |
*** sandanar has quit IRC | 12:18 | |
frickler | Shrews: I'm wondering whether we should delete infracloud-vanilla from nodepool.yaml completely for the time being. that should at least allow the leaked instance cleanup to run for the other clouds | 12:18 |
*** smatzek has joined #openstack-infra | 12:21 | |
Shrews | frickler: iirc, what should be happening is a new thread is started for each node delete. so it should cycle through all deleted instances rather quickly, but the actual delete threads are the things timing out | 12:22 |
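The pattern Shrews describes, one short-lived thread per instance delete so that a single hung provider call cannot serialize the rest of the cleanup pass, looks roughly like the sketch below; this is an illustration of the pattern, not nodepool's actual code:

```python
# Illustration of the thread-per-delete pattern described above,
# not nodepool's actual implementation.
import threading


def delete_instance(provider, server_id):
    # A real worker would call the cloud API here; if the provider hangs,
    # only this one thread is stuck, not the main cleanup loop.
    provider.delete_server(server_id)


def cleanup_pass(provider, deleted_ids):
    threads = []
    for server_id in deleted_ids:
        t = threading.Thread(target=delete_instance,
                             args=(provider, server_id), daemon=True)
        t.start()
        threads.append(t)
    # The caller moves on immediately; individual deletes time out on their own.
    return threads
```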
*** sdoran has quit IRC | 12:24 | |
*** mrmartin has quit IRC | 12:24 | |
*** madhuvishy has quit IRC | 12:24 | |
*** tomhambleton_ has quit IRC | 12:24 | |
*** madhuvishy has joined #openstack-infra | 12:24 | |
*** lucasagomes is now known as lucas-hungry | 12:25 | |
*** pcichy has quit IRC | 12:25 | |
*** abelur has quit IRC | 12:27 | |
*** abelur has joined #openstack-infra | 12:27 | |
*** kjackal has joined #openstack-infra | 12:28 | |
*** e0ne has joined #openstack-infra | 12:30 | |
*** pcichy has joined #openstack-infra | 12:31 | |
Shrews | looks like we lost connection to our zookeeper server yesterday | 12:32 |
*** edmondsw has joined #openstack-infra | 12:33 | |
*** e0ne has quit IRC | 12:33 | |
Shrews | ok, this is weird. i'm going to restart the launchers to see what happens | 12:33 |
*** rosmaita has quit IRC | 12:34 | |
*** abelur_ has quit IRC | 12:35 | |
AJaeger | Shrews: wait, please... | 12:35 |
AJaeger | Shrews: first read http://lists.openstack.org/pipermail/openstack-infra/2018-January/005774.html - ianw ran into a problem with ze04 that we should check before restarting | 12:36 |
Shrews | AJaeger: already restarted nl01, which seems to be cleaning up the instances. i worked with ianw last night on ze04, so aware of that | 12:37 |
Shrews | unrelated things | 12:37 |
*** edmondsw has quit IRC | 12:38 | |
AJaeger | Shrews: glad to hear. | 12:38 |
*** mrmartin has joined #openstack-infra | 12:39 | |
*** e0ne has joined #openstack-infra | 12:42 | |
Shrews | nl02 restarted which seems to have cleaned up the enabled providers there | 12:42 |
*** bobh has joined #openstack-infra | 12:42 | |
Shrews | AJaeger: frickler: not sure what happened there. I'm going to have to spend the day sorting through tons of logs. | 12:43 |
Shrews | can one of you #status the things while I finally have a cup of coffee? | 12:43 |
*** vsaienk0 has quit IRC | 12:44 | |
AJaeger | #status log nl01 and nl02 restarted to recover nodes in deletion | 12:47 |
openstackstatus | AJaeger: finished logging | 12:47 |
*** markvoelker has quit IRC | 12:48 | |
AJaeger | Shrews: done - or do you want to write more - or on a broader scale? Enjoy your coffee! | 12:48 |
*** pgadiya has joined #openstack-infra | 12:49 | |
*** pcaruana|afk| has quit IRC | 12:49 | |
*** krtaylor_ has quit IRC | 12:50 | |
Shrews | AJaeger: thx. good enough for now i guess. i don't yet know enough about the problem to say more | 12:50 |
mpeterson | hello, I was wondering if you know what might be causing a job not to go into the RUN phase at all? it does PRE and then POST all successful but not RUN | 12:51 |
*** e0ne has quit IRC | 12:52 | |
*** e0ne has joined #openstack-infra | 12:53 | |
*** andreas_s has joined #openstack-infra | 12:58 | |
pabelanger | we'll be deleting infracloud-vanilla, and moving the compute nodes to chocolate. We lost the controller yesterday to a bad HDD | 12:58 |
pabelanger | https://review.openstack.org/532705/ should have disabled it | 12:59 |
*** ankkumar has quit IRC | 12:59 | |
*** jpena|lunch is now known as jpena | 13:00 | |
*** pcaruana|afk| has joined #openstack-infra | 13:02 | |
*** andreas_s has quit IRC | 13:04 | |
*** vsaienk0 has joined #openstack-infra | 13:04 | |
*** dprince has joined #openstack-infra | 13:08 | |
*** jaosorior has quit IRC | 13:09 | |
AJaeger | mpeterson: do you have logs? | 13:10 |
*** olaph has quit IRC | 13:13 | |
*** olaph has joined #openstack-infra | 13:14 | |
*** janki has joined #openstack-infra | 13:14 | |
mpeterson | AJaeger: yes, patches 530642 and 517359 | 13:16 |
mpeterson | AJaeger: first one in devstack-tempest, second one in networking-odl-functional-carbon | 13:16 |
*** pcichy has quit IRC | 13:17 | |
AJaeger | mpeterson: we have on https://wiki.openstack.org/wiki/Infrastructure_Status item "another set of broken images has been in use from about 06:00-11:00 UTC, reverted once more to the previous one". | 13:19 |
*** e0ne has quit IRC | 13:20 | |
AJaeger | mpeterson: that was at 14:00 and your run was at that time - and would explain it. I'm just wondering whether the time span is accurate. | 13:20 |
frickler | pabelanger: the issue seems to have been that nodepool still wanted to delete nodes on vanilla and that may have blocked other deletes | 13:22 |
AJaeger | mpeterson: wait, those are changes to the Zuul v3 config itself. I wonder whether there's a bug in them that lets Zuul skip the run playbooks. | 13:22 |
AJaeger | mpeterson: please discuss with rest of the team later... | 13:22 |
pabelanger | frickler: ouch, okay. We should be able to write a unit test for that in nodepool at least and see what happens with a bad provider | 13:23 |
mpeterson | AJaeger: yeah, I think it is probably an issue in layer 8 (aka me) rather than the infra | 13:23 |
mpeterson | AJaeger: sure, when would the rest of the team be in approx? | 13:24 |
*** katkapilatova1 has joined #openstack-infra | 13:24 | |
AJaeger | mpeterson: US best - so during the next three hours... | 13:24 |
AJaeger | US based I mean | 13:24 |
openstackgerrit | Merged openstack-infra/project-config master: Remove Neutron legacy jobs definition https://review.openstack.org/530500 | 13:24 |
*** e0ne has joined #openstack-infra | 13:24 | |
*** katkapilatova1 has left #openstack-infra | 13:25 | |
*** bobh has quit IRC | 13:25 | |
mpeterson | AJaeger: okey I'm EMEA based, I hope I'll still be around, thanks | 13:25 |
frickler | pabelanger: Shrews: fyi there are still errors about this happening now on nl02, not sure about the impact | 13:26 |
pabelanger | frickler: specific to a cloud? | 13:27 |
Shrews | pabelanger: you'll keep seeing them for vanilla, unless we remove that provider entirely | 13:28 |
*** kiennt26_ has joined #openstack-infra | 13:28 | |
Shrews | i don't see any deleted instances that aren't getting cleaned up in zookeeper except for vanilla | 13:29 |
*** panda is now known as panda|ruck | 13:29 | |
Shrews | pabelanger: were there any issues with the zookeeper server yesterday evening? looks like we lost connection to it around 2018-01-10 22:27:18,481 UTC (5:27pm eastern) | 13:30 |
*** e0ne has quit IRC | 13:30 | |
Shrews | i think that's about when this delete problem started | 13:31 |
*** eyalb has left #openstack-infra | 13:31 | |
Shrews | hrm, we probably didn't notice since it recovered about a second later | 13:33 |
*** e0ne has joined #openstack-infra | 13:33 | |
*** apetrich has joined #openstack-infra | 13:33 | |
*** e0ne has quit IRC | 13:34 | |
frickler | "nodepool list" still shows about 50 nodes in infracloud-vanilla in state deleting | 13:35 |
mgagne | pabelanger: any known/pending issue with inap? I keep seeing inap mentioned and I'm starting to wonder if there is an underlying issue on our side that needs to be addressed. | 13:35 |
pabelanger | Shrews: yah, working on patches to remove vanilla now | 13:36 |
pabelanger | Shrews: also, not aware of any issues with zookeeper | 13:36 |
*** rlandy has joined #openstack-infra | 13:36 | |
*** trown|outtypewww is now known as trown | 13:37 | |
Shrews | pabelanger: fyi, if we never re-enable vanilla, we'll need to manually delete its instances. might even need to cleanup zookeeper nodes for it. | 13:38 |
Shrews | i guess we can't do the instances without a working controller though | 13:39 |
openstackgerrit | Zara proposed openstack-infra/storyboard-webclient master: Fix reference to angular-bootstrap in package.json https://review.openstack.org/532819 | 13:40 |
pabelanger | Shrews: Agree | 13:41 |
openstackgerrit | Paul Belanger proposed openstack-infra/project-config master: Remove infracloud-vanilla from nodepool https://review.openstack.org/532820 | 13:41 |
pabelanger | Shrews: frickler: ^ | 13:41 |
*** rosmaita has joined #openstack-infra | 13:41 | |
pabelanger | Shrews: it might be a good idea to come up with a tool or docs to help users do the cleanup in the case of a dead cloud. | 13:42 |
pabelanger | before, we'd have to purge the database; would be good to figure out the syntax in zookeeper | 13:42 |
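Until such a tool or doc exists, the manual inspection would look something like this sketch (using kazoo, and assuming nodepool keeps JSON node records under /nodepool/nodes/<id>; the zookeeper host is illustrative and the paths should be double-checked before deleting anything):

```python
# Sketch: list zookeeper node records for a dead provider so they can be
# cleaned up by hand. Assumes JSON node records under /nodepool/nodes/<id>;
# verify the layout before deleting anything.
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper.example.org:2181")  # illustrative host
zk.start()

for node_id in zk.get_children("/nodepool/nodes"):
    path = "/nodepool/nodes/%s" % node_id
    data, _stat = zk.get(path)
    record = json.loads(data.decode("utf8"))
    if record.get("provider") == "infracloud-vanilla":
        print(path, record.get("state"), record.get("hostname"))
        # zk.delete(path, recursive=True)  # only once the instance is really gone

zk.stop()
```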
vsaienk0 | AJaeger: pabelanger please add to your review queue https://review.openstack.org/#/c/531375/ - adds releasenotes job to n-g-s | 13:44 |
*** lucas-hungry is now known as lucasagomes | 13:44 | |
*** markvoelker has joined #openstack-infra | 13:44 | |
*** wolverineav has joined #openstack-infra | 13:46 | |
Shrews | pabelanger: ++ Having someone do anything manual in zookeeper is not a friendly user experience | 13:47 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension https://review.openstack.org/532823 | 13:48 |
Shrews | ^^^ might help with debugging this situation in the future | 13:49 |
*** tosky has quit IRC | 13:50 | |
frickler | Shrews: pabelanger: different issue, would this be correct if I want to hold a node for debugging? zuul autohold --tenant openstack --project openstack/cookbook-openstack-common --job openstack-chef-repo-integration --reason "frickler: debug 523030" --count 1 | 13:51 |
*** masber has quit IRC | 13:52 | |
Shrews | frickler: i believe that's correct | 13:52 |
pabelanger | yah | 13:52 |
pabelanger | looks right | 13:52 |
Shrews | frickler: just make sure to delete the node when you're done with it | 13:52 |
frickler | Shrews: sure | 13:53 |
*** tosky has joined #openstack-infra | 13:54 | |
frickler | hmm, is it expected that "zuul autohold-list" takes like forever? or is that maybe also blocked trying to access vanilla? | 13:56 |
*** pgadiya has quit IRC | 13:56 | |
*** e0ne has joined #openstack-infra | 13:58 | |
frickler | my autohold command also doesn't proceed. maybe I'll just wait with that until nodepool issues are sorted out | 13:58 |
*** alexchadin has joined #openstack-infra | 13:59 | |
*** shardy has quit IRC | 14:02 | |
pabelanger | did nodepool restart? | 14:03 |
pabelanger | grafana.o.o says we have zero nodes online | 14:03 |
*** kiennt26_ has quit IRC | 14:04 | |
pabelanger | http://grafana.openstack.org/dashboard/db/nodepool | 14:04 |
pabelanger | frickler: Shrews ^ | 14:04 |
*** kiennt26_ has joined #openstack-infra | 14:05 | |
pabelanger | oh, maybe we just had a gate reset | 14:05 |
pabelanger | but ~500 nodes just deleted | 14:06 |
Shrews | pabelanger: yes, restarted launchers earlier | 14:06 |
frickler | pabelanger: Shrews: https://review.openstack.org/532303 failed at the head of integrated gate | 14:06 |
pabelanger | Shrews: this just happened a minute ago | 14:06 |
pabelanger | yah, must be a gate reset | 14:07 |
*** eharney has joined #openstack-infra | 14:07 | |
pabelanger | all our nodes tied up in gate pipeline | 14:07 |
*** efried has joined #openstack-infra | 14:08 | |
frickler | http://logs.openstack.org/03/532303/1/gate/openstack-tox-py35/c7bad8c/job-output.txt.gz has a timeout and ssh host id changed warning | 14:08 |
pabelanger | looks like citycloud | 14:09 |
*** dave-mccowan has joined #openstack-infra | 14:10 | |
*** slaweq has joined #openstack-infra | 14:12 | |
*** dave-mccowan has quit IRC | 14:14 | |
rosmaita | howdy infra folks ... quick question: when a gate job seems to be stuck in queued for a long time, is there any action i can take (other than be patient or go grab a coffee)? | 14:14 |
frickler | rosmaita: you can tell us which job it is so we can take a closer look | 14:15 |
*** dave-mccowan has joined #openstack-infra | 14:15 | |
frickler | rosmaita: but also we have a pretty large backlog currently due to various issues during the last couple of days | 14:16 |
rosmaita | frickler thanks! the one i was interested in has actually been picked up | 14:16 |
*** abhishekk has joined #openstack-infra | 14:17 | |
*** slaweq has quit IRC | 14:17 | |
*** markvoelker has quit IRC | 14:18 | |
*** Goneri has joined #openstack-infra | 14:18 | |
*** edmondsw has joined #openstack-infra | 14:21 | |
*** tomhambleton_ has joined #openstack-infra | 14:21 | |
*** sdoran has joined #openstack-infra | 14:21 | |
*** evin has quit IRC | 14:24 | |
*** edmondsw_ has joined #openstack-infra | 14:24 | |
*** psachin has quit IRC | 14:24 | |
*** esberglu has joined #openstack-infra | 14:28 | |
openstackgerrit | Chandan Kumar proposed openstack-infra/project-config master: Switch to tempest-plugin-jobs for ec2api-tempest-plugin https://review.openstack.org/532835 | 14:28 |
*** edmondsw has quit IRC | 14:28 | |
chandankumar | AJaeger: regarding uploading packages to pypi, the respective team will take care of that | 14:29 |
chandankumar | for their tempest plugins | 14:29 |
mordred | Shrews: morning! anything I should be looking at? | 14:29 |
*** ijw has joined #openstack-infra | 14:31 | |
Shrews | mordred: no, don't think so | 14:32 |
*** rosmaita has quit IRC | 14:34 | |
*** jkilpatr has quit IRC | 14:39 | |
*** jkilpatr has joined #openstack-infra | 14:45 | |
*** edmondsw_ is now known as edmondsw | 14:45 | |
*** abhishekk has quit IRC | 14:46 | |
openstackgerrit | Tytus Kurek proposed openstack-infra/project-config master: Add charm-interface-designate project https://review.openstack.org/529379 | 14:46 |
*** hongbin has joined #openstack-infra | 14:48 | |
*** Swami has joined #openstack-infra | 14:52 | |
AJaeger | chandankumar: once all are created, tell me in the patch and we can merge... | 14:55 |
chandankumar | AJaeger: sure! | 14:56 |
*** hoangcx_ has joined #openstack-infra | 14:57 | |
hoangcx_ | AJaeger: Hi, do you know why https://docs.openstack.org/neutron-vpnaas/latest/ is not live even though we merged a patch in our repos recently | 14:58 |
*** eharney has quit IRC | 15:00 | |
hoangcx_ | AJaeger: I mean https://review.openstack.org/#/c/522695/ merged in project-config and the we merged one patch in vpnaas to reflect the change. But I don't see the link alive yet. | 15:00 |
*** gouthamr has joined #openstack-infra | 15:01 | |
hoangcx_ | s/the we/then we | 15:01 |
*** kopecmartin has quit IRC | 15:02 | |
*** markvoelker has joined #openstack-infra | 15:02 | |
*** markvoelker has quit IRC | 15:02 | |
*** mriedem has joined #openstack-infra | 15:02 | |
AJaeger | hoangcx_: check http://zuulv3.openstack.org/ - the post queue should contain your job | 15:03 |
*** eharney has joined #openstack-infra | 15:03 | |
AJaeger | hoangcx_: we have a backlog of 11+ hours, and your change only merged 1h ago | 15:04 |
openstackgerrit | sebastian marcet proposed openstack-infra/openstackid-resources master: Added new endpoint merge speakers https://review.openstack.org/532844 | 15:04 |
*** ihrachys has quit IRC | 15:05 | |
*** ihrachys has joined #openstack-infra | 15:05 | |
*** kgiusti has joined #openstack-infra | 15:08 | |
*** shardy has joined #openstack-infra | 15:08 | |
frickler | infra-root: I've seen multiple gate failures with "node_failure" now, but don't know how to debug these further, zuul.log just says "node request failed". http://paste.openstack.org/show/642701/ and also 526061,2 | 15:08 |
hoangcx_ | AJaeger: The point is clear now. I will check it tomorrow. Sorry for the trouble and thank you for pointing it out. | 15:11 |
openstackgerrit | Merged openstack-infra/openstackid-resources master: Added new endpoint merge speakers https://review.openstack.org/532844 | 15:11 |
*** hoangcx_ has quit IRC | 15:12 | |
*** Apoorva has joined #openstack-infra | 15:12 | |
*** hjensas has quit IRC | 15:12 | |
*** jkilpatr has quit IRC | 15:13 | |
*** jkilpatr has joined #openstack-infra | 15:13 | |
*** jkilpatr has quit IRC | 15:13 | |
*** jkilpatr has joined #openstack-infra | 15:13 | |
*** kopecmartin has joined #openstack-infra | 15:15 | |
*** bmjen has joined #openstack-infra | 15:15 | |
*** Hunner has joined #openstack-infra | 15:16 | |
*** Hunner has quit IRC | 15:16 | |
*** Hunner has joined #openstack-infra | 15:16 | |
*** hamzy has quit IRC | 15:17 | |
*** edmondsw has quit IRC | 15:18 | |
*** hashar is now known as hasharAway | 15:19 | |
Shrews | frickler: i'm thinking that is likely related to the vanilla failure. the vanilla worker thread in the launcher is getting assigned the request, but it is throwing an exception while trying to determine if it has quota to handle it | 15:20 |
Shrews | frickler: i think we need to clean up exception handling there. it's newish code | 15:20 |
Shrews | oh, the exception is handled correctly. it fails the request | 15:21 |
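A rough sketch of the failure mode Shrews is describing, with hypothetical names (this is not nodepool's actual code): the pool worker asks the provider for quota before accepting a request, and if that query raises because the cloud is unreachable, the exception path marks the whole request FAILED instead of declining it so another provider can try.

    # Hedged sketch only -- method and attribute names are hypothetical.
    def run_handler(self, request):
        try:
            # Querying a dead cloud for limits/usage can raise here.
            if not self._has_quota_for(request.node_types):
                self._decline(request, 'insufficient quota')
                return
        except Exception:
            # Current behaviour: the error surfaces as a FAILED request
            # (NODE_FAILURE in zuul) even though another provider could
            # have served it; declining here instead would let the
            # request move on.
            self._fail(request)
            return
        self._launch_nodes(request)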
sshnaidm|afk | pabelanger, hi, we have connection errors to nodes since yesterday, is it a known issue? http://logs.openstack.org/63/531563/3/gate/tripleo-ci-centos-7-scenario004-multinode-oooq-container/357414a/job-output.txt.gz#_2018-01-11_14_06_38_648182 | 15:21 |
*** sshnaidm|afk is now known as sshnaidm | 15:21 | |
*** kopecmartin has quit IRC | 15:21 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Log request ID on request failure https://review.openstack.org/532857 | 15:23 |
*** markvoelker has joined #openstack-infra | 15:24 | |
Shrews | pabelanger: approved https://review.openstack.org/532820 | 15:24 |
Shrews | frickler: ^^^ should help | 15:24 |
*** edmondsw has joined #openstack-infra | 15:24 | |
*** bobh has joined #openstack-infra | 15:24 | |
*** rmcall has joined #openstack-infra | 15:26 | |
*** pcichy has joined #openstack-infra | 15:27 | |
*** pcichy has joined #openstack-infra | 15:27 | |
*** alexchad_ has joined #openstack-infra | 15:28 | |
dmsimard | FYI there are github unicorns going around | 15:30 |
dmsimard | in case some jobs fail because of that | 15:30 |
*** slaweq_ has quit IRC | 15:30 | |
dmsimard | https://status.github.com/messages confirmed outage | 15:31 |
*** slaweq has joined #openstack-infra | 15:31 | |
*** alexchadin has quit IRC | 15:31 | |
*** afred312 has quit IRC | 15:31 | |
*** afred312 has joined #openstack-infra | 15:32 | |
pabelanger | Shrews: thanks | 15:32 |
*** kopecmartin has joined #openstack-infra | 15:33 | |
*** Swami has quit IRC | 15:33 | |
pabelanger | dmsimard: shouldn't be much of an impact, minus replication | 15:34 |
*** janki has quit IRC | 15:35 | |
*** shardy has quit IRC | 15:35 | |
*** slaweq has quit IRC | 15:35 | |
*** yamamoto has quit IRC | 15:37 | |
*** felipemonteiro has joined #openstack-infra | 15:37 | |
*** alexchad_ has quit IRC | 15:37 | |
*** links has quit IRC | 15:38 | |
*** yamamoto has joined #openstack-infra | 15:38 | |
*** felipemonteiro_ has joined #openstack-infra | 15:38 | |
*** caphrim007 has quit IRC | 15:40 | |
*** caphrim007 has joined #openstack-infra | 15:40 | |
corvus | Shrews: i think we can approve 532594 now and if you make the change i suggested in 532709 we can merge it too. | 15:42 |
*** yamamoto has quit IRC | 15:43 | |
*** felipemonteiro has quit IRC | 15:43 | |
mwhahaha | getting post failures again today | 15:44 |
openstackgerrit | David Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now. https://review.openstack.org/532709 | 15:45 |
Shrews | corvus: done | 15:45 |
mwhahaha | http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22SSH%20Error%3A%20data%20could%20not%20be%20sent%20to%20remote%20host%5C%22 | 15:45 |
*** xarses has joined #openstack-infra | 15:45 | |
*** caphrim007 has quit IRC | 15:45 | |
dmsimard | mwhahaha: we're already tracking it in elastic-recheck | 15:45 |
dmsimard | I've spent some amount of time investigating this week without being able to pinpoint a specific issue that wouldn't be related to load | 15:46 |
pabelanger | corvus: mordred: dmsimard: this doesn't look healthy: http://paste.openstack.org/show/642816/ | 15:46 |
pabelanger | first time I've seen DB warnings on ARA | 15:46 |
dmsimard | I've discussed that issue with upstream Ansible recently | 15:47 |
dmsimard | It's an Ansible bug | 15:47 |
*** kiennt26_ has quit IRC | 15:48 | |
*** hamzy has joined #openstack-infra | 15:48 | |
dmsimard | tl;dr, in some circumstances Ansible can pass a non-boolean ignore_errors down to callbacks (breaking the "contract") | 15:48 |
* dmsimard searches logs | 15:51 | |
*** esberglu has quit IRC | 15:52 | |
dmsimard | 2018-01-05 18:16:46 @sivel dmsimard: I just did a little testing, your `yes` bug was fixed by bcoca in https://github.com/ansible/ansible/commit/aa54a3510f6f14491808291a0300da609b42753d | 15:52 |
dmsimard | That's landed in 2.4.2 | 15:52 |
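For context, a defensive coercion along these lines is what a callback can do until 2.4.2 is rolled out; this is only a sketch, not ARA's actual code. The bug is that older Ansible could hand the callback the raw, untemplated string (e.g. 'yes') for ignore_errors instead of a boolean.

    # Hedged sketch of a defensive callback; not ARA's actual implementation.
    from ansible.plugins.callback import CallbackBase

    TRUTHY = frozenset(['yes', 'on', 'true', '1'])

    def to_bool(value):
        """Coerce ignore_errors to a real boolean before persisting it."""
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in TRUTHY

    class CallbackModule(CallbackBase):
        def v2_runner_on_failed(self, result, ignore_errors=False):
            ignore_errors = to_bool(ignore_errors)
            # ... store the result with a clean boolean value ...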
Shrews | dmsimard: sounds like we need https://review.openstack.org/531009 | 15:53 |
dmsimard | Shrews: yeah I was just about to mention that | 15:53 |
dmsimard | 2.4.2 also contains other fixes we need | 15:53 |
*** alexchadin has joined #openstack-infra | 15:53 | |
Shrews | dmsimard: that might require coordination with infra-root about the upgrade path | 15:54 |
pabelanger | yah, maybe even PTG? | 15:54 |
pabelanger | since we are all in a room | 15:54 |
dmsimard | Shrews: right, but it also means we'll be upgrading all the jobs, playbooks, roles, etc... to 2.4.2 | 15:54 |
pabelanger | s/all/most all/ | 15:54 |
dmsimard | and you know what it means to upgrade from a minor ansible release to the next | 15:54 |
Shrews | i'm going to -2 that, lest someone approve before we discussed it (based on last night's experience) | 15:54 |
pabelanger | dmsimard: yah, part of my concern as well, some playbook / role some place might not be 2.4.2 compatible | 15:55 |
pabelanger | so, we'll need to be on deck to help migrate that | 15:55 |
pabelanger | maybe we need to bring zuulv3-dev.o.o online as thirdparty to do some testing | 15:55 |
pabelanger | would be a good way to get some code coverage on 2.4.2 | 15:56 |
dmsimard | pabelanger: and it's not just z-j and o-z-j, it's every project's playbooks/roles, across branches, etc. | 15:56 |
pabelanger | yes | 15:56 |
*** ykarel is now known as ykarel|away | 15:57 | |
*** derekh has quit IRC | 15:59 | |
Shrews | i have to step away for a bit now for an appointment. bbl | 16:01 |
*** Guest88600 has quit IRC | 16:03 | |
tosky | dmsimard: re those POST_FAILURE you discussed earlier, should I just recheck or wait? | 16:03 |
*** armaan has quit IRC | 16:04 | |
*** sree has joined #openstack-infra | 16:04 | |
dmsimard | tosky: there's a lot of load in general right now | 16:04 |
*** ricolin has joined #openstack-infra | 16:04 | |
dmsimard | corvus: I know you want to wait for the RAM governor but the executors aren't keeping up.. | 16:04 |
dmsimard | do you have any suggestions ? | 16:05 |
*** ykarel|away has quit IRC | 16:05 | |
dmsimard | the graphs are fairly thundering herd-ish | 16:05 |
*** sbra has joined #openstack-infra | 16:05 | |
tosky | I guess I can wait a bit until people leave the offices in the US :) | 16:06 |
corvus | dmsimard: fix the broken executor? | 16:06 |
corvus | dmsimard: and then implement the ram governor? perhaps that should be considered a high priority? | 16:07 |
dmsimard | Shrews: should we rebuild ze04 ? I'm not sure what's the takeaway from your email | 16:08 |
*** jbadiapa has quit IRC | 16:08 | |
*** sree has quit IRC | 16:09 | |
*** slaweq has joined #openstack-infra | 16:09 | |
pabelanger | I don't think we need a rebuild, we just need to finish landing tcp/7900 changes in puppet-zuul | 16:09 |
pabelanger | then, schedule restarts | 16:09 |
corvus | don't schedule, just do. but there will be permissions that need fixing. | 16:09 |
pabelanger | +3 on https://review.openstack.org/532709/ for firewall | 16:10 |
corvus | https://review.openstack.org/532594 is the other | 16:10 |
pabelanger | great, +3 | 16:11 |
pabelanger | checking permissions | 16:11 |
corvus | while those are landing, someone can go on ze04 and fix filesystem permissions | 16:11 |
openstackgerrit | Merged openstack-infra/project-config master: Remove infracloud-vanilla from nodepool https://review.openstack.org/532820 | 16:12 |
*** slaweq has quit IRC | 16:13 | |
*** slaweq_ has joined #openstack-infra | 16:13 | |
pabelanger | corvus: any objections to also having puppet-zuul check them? | 16:15 |
corvus | pabelanger: why? | 16:15 |
*** eumel8 has quit IRC | 16:16 | |
corvus | let's just fix this and move on. it was a one time mistake. | 16:16 |
pabelanger | ok | 16:16 |
*** armaan has joined #openstack-infra | 16:17 | |
*** slaweq_ has quit IRC | 16:17 | |
pabelanger | actually, we'll need to change these on the other executors too, right? Currently /var/log/zuul is root:root | 16:18 |
pabelanger | actually | 16:18 |
pabelanger | zuul:root | 16:18 |
*** kopecmartin has quit IRC | 16:19 | |
pabelanger | okay, we're safe there | 16:19 |
*** kopecmartin has joined #openstack-infra | 16:19 | |
efried | qq, does 'recheck' kick a currently-running check out of the queue? With how long stuff is taking, I want to be able to recheck as soon as I see any failure rather than waiting for the job to finish. | 16:19 |
*** armaan has quit IRC | 16:21 | |
pabelanger | odd, /var/log/zuul/executor.log is root:root; I would have thought it was zuul:zuul or zuul:root on ze02 | 16:22 |
*** e0ne has quit IRC | 16:24 | |
*** alexchadin has quit IRC | 16:24 | |
*** alexchadin has joined #openstack-infra | 16:25 | |
*** esberglu has joined #openstack-infra | 16:26 | |
AJaeger | efried: no, it does not. one way to stop: rebase | 16:26 |
pabelanger | okay, I think we'll also need to fix permissions on /var/lib/zuul for all executors after we stop | 16:26 |
efried | AJaeger Okay, thanks. I was getting tempted to do that anyway to see if it knocked anything loose :) | 16:27 |
mriedem | what does NODE_FAILURE as a job result generally mean? | 16:28 |
*** alexchadin has quit IRC | 16:29 | |
clarkb | mriedem: shrews was saying earlier that he thought it was related to the vanilla infracloud going away and quota calculations not working | 16:29 |
*** sree has joined #openstack-infra | 16:30 | |
clarkb | Shrews: does ^ mean max-servers: 0 is no longer effective? or maybe we didn't get that merged quickly enough to avoid problems? | 16:30 |
mriedem | so it's like creating a vm fails ? | 16:30 |
mriedem | just suspicious because it's on a new job definition i'm working on and it's in the experimental queue, and zuul never posted results on it last night, and now it failed this morning with node_failure | 16:30 |
clarkb | mriedem: that was my understanding of it, due to a bug in new code that tries to more dynamically determine available quota, rather than statically assuming that what we have told nodepool about its quota is correct | 16:30 |
*** rosmaita has joined #openstack-infra | 16:31 | |
mriedem | dynamically determine available quota, good luck | 16:31 |
mriedem | upgrade to pike where there are no more quota reservations to screw everything up | 16:31 |
clarkb | Shrews: pabelanger I'm not sure that removing vanilla entirely should be our fix in this case. Nodepool should handle cloud outages gracefully (and always has in the past) | 16:32 |
*** shardy has joined #openstack-infra | 16:32 | |
*** rwellum has left #openstack-infra | 16:32 | |
mriedem | here is an unrelated question, i see a change that's been approved for a few days now is sitting in the gate queue for 9 hours, and it's also sitting in the check queue | 16:32 |
*** ijw has quit IRC | 16:32 | |
mriedem | should i recheck it? or just wait for it to drop from the queues altogether? | 16:32 |
pabelanger | clarkb: agree, it sounds like there was some bug preventing nodes from deleting. Something we should fix for sure | 16:33 |
pabelanger | clarkb: I removed vanilla to stop image uploads by nodepool-builder | 16:33 |
dmsimard | mriedem: might be side effect of people re-checking the patches and the few zuul restarts that have happened where we re-enqueued jobs | 16:33 |
*** pblaho1 has joined #openstack-infra | 16:33 | |
mriedem | is there a timeout where a change just gets kicked out altogether? like 10 hours or something? | 16:33 |
mriedem | given the reboots and timeouts and such, i assume people are just rechecking like mad to try and get anything in | 16:34 |
openstackgerrit | Merged openstack-infra/puppet-zuul master: Start executor as 'zuul' user https://review.openstack.org/532594 | 16:34 |
*** sree has quit IRC | 16:34 | |
*** ramishra has quit IRC | 16:34 | |
*** vsaienk0 has quit IRC | 16:34 | |
*** maciejjozefczyk has quit IRC | 16:35 | |
clarkb | pabelanger: but deleting nodes should be independent of handling node requests, right? In any case it's something we should make sure nodepool handles gracefully so that losing a cloud doesn't affect jobs | 16:35 |
*** pblaho has quit IRC | 16:35 | |
*** HenryG has quit IRC | 16:36 | |
dmsimard | mriedem: the rechecks are exacerbating the load issues that we have right now so we have a bit of a thundering herd effect going on. | 16:36 |
pabelanger | clarkb: agree, I'd have to defer to Shrews on why that didn't work. But I would guess we can reproduce in a unit test to see why | 16:36 |
mriedem | that's what i was wondering - if it's just better to not recheck | 16:36 |
dmsimard | mriedem: things are slowly stabilizing and we also have a zuul executor that isn't running right now (which is not helping) so our current focus is getting that executor back in line (and resolving the root cause of why it went out in the first place) | 16:38 |
mriedem | ack | 16:38 |
dmsimard | mriedem: a good idea might be to keep an eye on the zuul-executor graphs here: http://grafana.openstack.org/dashboard/db/zuul-status | 16:38 |
mriedem | dmsimard: what's a healthy graph look like? | 16:39 |
mriedem | not maxed out? | 16:39 |
dmsimard | mriedem: one where the load of the executors is not 100 :) | 16:39 |
mriedem | heh is that all | 16:40 |
mriedem | ok will do - thanks for the link btw, i always forget where that is (bookmarked now) | 16:40 |
fungi | AJaeger: efried: you may be able to abort running check jobs and restart them by abandoning and restoring the change, so you don't need gratuitous rebases | 16:40 |
dmsimard | fungi: I'm not sure restoring the change triggers the jobs by itself, it might need a recheck after restore but I'm not sure | 16:41 |
efried | fungi That's a nice tip. Though wouldn't that require re+2 as well as re+W? | 16:41 |
efried | fungi In any case, rebase wasn't gonna hurt this particular one. | 16:41 |
*** HenryG has joined #openstack-infra | 16:41 | |
fungi | efried: as opposed to rebasing? abandon doesn't remove any votes | 16:41 |
dmsimard | efried: a rebase is likely to require re-voting as well FWIW (though I'm not entirely sure on the criteria about what votes are removed when) | 16:41 |
efried | dmsimard In my recent experience, code reviews stick around on a (non-manual) rebase, but +Ws are cleared. | 16:42 |
fungi | dmsimard: it's configurable in gerrit, but generally for our projects it's set to keep code review votes (and clear verified and workflow votes) when the diff of the new change basically matches the diff of the old change. used to use git patch-id to achieve that but i haven't looked at the code in gerrit recently to know whether that's still the case | 16:43 |
*** kopecmartin has quit IRC | 16:43 | |
fungi | (`git patch-id` is a sha-1 checksum of the diff's contents, not to be confused with the commit id which covers additional metadata like the commit message and timestamps) | 16:45 |
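For illustration, the comparison fungi is describing can be reproduced by hand; the git commands are standard, and the small Python wrapper and its function names are just for this example.

    # Illustrative only: compare `git patch-id` output for two revisions
    # to see whether a rebase actually changed the diff contents.
    import subprocess

    def patch_id(rev, repo='.'):
        """Return the patch-id (sha-1 of the diff contents) of a commit."""
        show = subprocess.run(['git', '-C', repo, 'show', rev],
                              stdout=subprocess.PIPE, check=True)
        pid = subprocess.run(['git', '-C', repo, 'patch-id', '--stable'],
                             input=show.stdout, stdout=subprocess.PIPE,
                             check=True)
        # Output is "<patch-id> <commit-id>"; keep the first field.
        return pid.stdout.split()[0]

    def same_diff(old_rev, new_rev, repo='.'):
        # Equal patch-ids => code-review votes can safely be carried over.
        return patch_id(old_rev, repo) == patch_id(new_rev, repo)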
dmsimard | fungi: yeah I assumed it had something to do along those lines -- I've modified commit messages before without having to re-run jobs for example | 16:47 |
*** dtantsur is now known as dtantsur|afk | 16:47 | |
fungi | hopefully not in our gerrit. a new patch set should clear the verified vote and cause new jobs to run | 16:48 |
dmsimard | maybe it wasn't in the upstream gerrit, I don't remember | 16:49 |
fungi | also clears workflow votes so that we don't automatically send it into the gate without an explicit reapproval | 16:49 |
fungi | but yes, that's configurable per label, so other gerrits may preserve verified votes on rebase, for example | 16:49 |
clarkb | (we've had to do this because tests whose results are affected by the commit message are or were common) | 16:49 |
fungi | for similar reasons, we clear all votes (including code review) when a commit message is changed | 16:50 |
fungi | but that's also configurable | 16:50 |
fungi | i love logging into the rax dashboard. you never know what new tickets will be waiting | 16:52 |
fungi | "This message is to inform you that the host your cloud server 'ze04.openstack.org' resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts." | 16:52 |
pabelanger | okay, think I've cleaned up all permissions on ze04 | 16:53 |
pabelanger | will start it up here in a moment once puppet runs | 16:53 |
clarkb | pabelanger: that is to pick up the init script change? | 16:53 |
clarkb | (puppet that is) | 16:53 |
pabelanger | clarkb: yah | 16:53 |
pabelanger | but it does look like logging on the executor is being written as root, so we also need to set up permissions on the others after we stop | 16:54 |
*** armaan has joined #openstack-infra | 16:55 | |
fungi | #status log previously mentioned trove maintenance activities in rackspace have been postponed/cancelled and can be ignored | 16:56 |
openstackstatus | fungi: finished logging | 16:56 |
*** pcaruana|afk| has quit IRC | 16:57 | |
*** felipemonteiro_ has quit IRC | 16:58 | |
*** florianf has quit IRC | 17:00 | |
*** dsariel has quit IRC | 17:00 | |
fungi | ttx: are you still using odsreg.openstack.org or can we delete that instance? | 17:00 |
ttx | fungi: I thought we cleared it some time ago, I remember someone asking me about it | 17:01 |
fungi | that may have been forumtopics | 17:01 |
ttx | these days we only use ptgbot on one side and forumtopics.o.o | 17:01 |
fungi | i thought i remembered asking you about odsreg but scoured my irc logs and came up with nothing so thought it best to ask again | 17:02 |
fungi | i'm happy to delete odsreg.openstack.org now if you don't object | 17:02 |
ttx | fungi: go for it! | 17:02 |
fungi | done. thanks for confirming ttx! | 17:03 |
fungi | #status log deleted old odsreg.openstack.org instance | 17:03 |
openstackstatus | fungi: finished logging | 17:03 |
*** iyamahat has joined #openstack-infra | 17:03 | |
pabelanger | okay, ze04 is coming online | 17:10 |
*** pcichy has quit IRC | 17:10 | |
*** iyamahat has quit IRC | 17:11 | |
pabelanger | and running jobs | 17:14 |
pabelanger | still waiting for firewall to open up I think | 17:14 |
pabelanger | finger dd7cbca51ff543389eeb43dac537557f@zuulv3.openstack.org | 17:14 |
*** electrofelix has left #openstack-infra | 17:16 | |
*** links has joined #openstack-infra | 17:17 | |
*** links has quit IRC | 17:17 | |
*** jpich has quit IRC | 17:17 | |
*** lucasagomes is now known as lucas-afk | 17:19 | |
clarkb | I'm really confused by this node failure stuff | 17:20 |
clarkb | I'm looking at a node request that went in at about 1420 UTC, got rejected by citycloud and vexxhost, then appears to go idle. Then 2 hours later infracloud picks it up and the status is marked as failed? | 17:21 |
pabelanger | I haven't looked, is there a log? | 17:21 |
clarkb | pabelanger: ya I'm trying to put it together for this example I'll have a paste up soon | 17:21 |
pabelanger | I did see some citycloud failures this morning about SSH hostkey changing, wonder if we had some ghost instances | 17:22 |
pabelanger | but haven't looked more into it | 17:22 |
clarkb | this is way before any ssh can happen | 17:23 |
clarkb | it's all in the "please boot me a node" negotiation | 17:23 |
*** amotoki has quit IRC | 17:24 | |
*** dhajare has quit IRC | 17:24 | |
pabelanger | Shrews: do you have a moment to look at fingergw.log? Seeing some exceptions around routing; possibly we need to improve logging, or it's a bug | 17:24 |
*** electrofelix has joined #openstack-infra | 17:26 | |
*** Apoorva has quit IRC | 17:28 | |
*** hjensas has joined #openstack-infra | 17:28 | |
fungi | clarkb: possible when we restarted the launchers we upgraded to a regression of some sort? | 17:28 |
Shrews | pabelanger: i just got back from the appointment i mentioned earlier. let me catch up on backscroll | 17:29 |
pabelanger | okay, I think it might be firewall | 17:29 |
pabelanger | confirming that is open | 17:29 |
pabelanger | ah, yup: https://review.openstack.org/532709/ isn't merged | 17:30 |
pabelanger | Umm | 17:31 |
pabelanger | [Thu Jan 11 17:26:37 2018] zuul-scheduler[19039]: segfault at a9 ip 0000000000513ef4 sp 00007f2411437ea8 error 4 in python3.5[400000+3a9000] | 17:31 |
*** trown is now known as trown|outtypewww | 17:31 | |
pabelanger | that doesn't look good | 17:31 |
pabelanger | http://zuulv3.openstack.org/ is also down | 17:31 |
fungi | does that correspond to another puppet exec upgrading zuul or its deps? | 17:32 |
Shrews | clarkb: the new quota checks from tristanC were failing (b/c the query to the provider wasn't working). i think we should short-circuit that if we've said max-servers is 0. it worked before on a failed provider b/c it didn't query that provider, and just went off max-servers | 17:32 |
pabelanger | fungi: let me check | 17:32 |
clarkb | pabelanger: Shrews corvus http://paste.openstack.org/show/642927/ (not to distract from the segfault) thats what I can find about infracloud vanilla | 17:32 |
*** armax has joined #openstack-infra | 17:32 | |
pabelanger | fungi: yes | 17:33 |
clarkb | Shrews: ya I see that but why did it take 2 hours to process that request in the first place? | 17:33 |
pabelanger | Jan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events | 17:33 |
Shrews | and by "i think we should short-circuit", i mean "should fix it so that it short-circuits" | 17:33 |
*** slaweq has joined #openstack-infra | 17:33 | |
fungi | so far all the segfault events have happened right on the heels (within a minute) of a puppet exec upgrading zuul or a dependency on the server where it happened | 17:33 |
clarkb | Shrews: but also I'm not sure its sufficient to short circuit, nodepool should handle cloud failures gracefully | 17:33 |
clarkb | Shrews: if the cloud is not there then we shouldn't mark the request failed and instead let some other cloud service it | 17:33 |
Shrews | clarkb: i dunno. that is new info to me | 17:34 |
corvus | the segfault killed the gearman server | 17:34 |
Shrews | clarkb: i am not disagreeing with you. | 17:34 |
corvus | this is not recoverable | 17:34 |
pabelanger | I also do not see zuul-web | 17:35 |
pabelanger | corvus: happy to pass to you to drive if you'd like | 17:35 |
clarkb | Shrews: cool, just pointing out a short circuit on max-servers isn't sufficnet to get that behavior | 17:35 |
corvus | pabelanger: i don't think there's anything to do but to restart with empty queues. | 17:35 |
corvus | pabelanger: you continue driving | 17:35 |
pabelanger | okay | 17:36 |
clarkb | Shrews: having a hard time with the logs because we don't seem to log the point where we mark it failed on the nodepool side with the request id (that may be because we've bubbled up super far after the cloud exception) | 17:36 |
Shrews | clarkb: https://review.openstack.org/532857 | 17:36 |
pabelanger | okay, stopping zuul-scheduler per corvus' recommendation | 17:36 |
*** coolsvap has quit IRC | 17:36 | |
*** slaweq has quit IRC | 17:37 | |
pabelanger | scheduler starting | 17:37 |
Shrews | clarkb: pabelanger: fwiw, there were 2 separate issues i found this morning in the firefight. the vanilla issue is not related to the stuck delete issue | 17:39 |
pabelanger | okay, I've started and stopped zuul-web. It seems to be raising exceptions in the scheduler | 17:39 |
pabelanger | AttributeError: 'NoneType' object has no attribute 'layout' | 17:39 |
pabelanger | cat jobs now running | 17:39 |
Shrews | the vanilla issue not being handled well we can fix. i don't have an answer on the delete thing | 17:40 |
corvus | pabelanger: you can start zuul-web, that's harmless | 17:40 |
pabelanger | corvus: okay | 17:40 |
pabelanger | zuul-web started | 17:40 |
*** vsaienk0 has joined #openstack-infra | 17:41 | |
corvus | is there any place where we have the full output from the pip install of zuul? | 17:41 |
*** cshastri has quit IRC | 17:41 | |
corvus | ie, puppet reports or anything? | 17:41 |
Shrews | pabelanger: did you still need me to look at something? | 17:42 |
pabelanger | okay, scheduler back online | 17:42 |
corvus | pabelanger: i suggest sending a notice | 17:42 |
pabelanger | Shrews: I don't think so, firewall change hasn't landed yet | 17:42 |
pabelanger | corvus: agree | 17:42 |
*** jamesmcarthur has joined #openstack-infra | 17:43 | |
fungi | corvus: i believe pip will log in the homedir of the account running that exec, so presumably ~root/ | 17:43 |
fungi | looking | 17:43 |
clarkb | Shrews: fwiw I think the deleted issue is the smaller concern of the two since it doesn't directly affect job results | 17:43 |
fungi | though i'm not finding it yet | 17:44 |
clarkb | Shrews: comment on https://review.openstack.org/#/c/532857/1 | 17:44 |
pabelanger | how does this sound: #status log Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets | 17:44 |
clarkb | now to catch up on the zuul scheduler thing | 17:45 |
fungi | pabelanger: lgtm | 17:45 |
pabelanger | #status notice Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets | 17:45 |
openstackstatus | pabelanger: sending notice | 17:45 |
clarkb | I'm guessing its too late now but maybe we should turn on core dumps? | 17:46 |
corvus | it's never too late to turn on core dumps | 17:46 |
fungi | Jan 11 17:16:40 zuulv3 kernel: [67122.422307] zuul-web[19121]: segfault at a9 ip 0000000000513ef4 sp 00007fb4116b98a8 error 4 in python3.5[400000+3a9000] | 17:46 |
clarkb | (might be able to modify that ulimit for a running process somehow?) | 17:46 |
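On Linux that is doable without a restart: prlimit can raise RLIMIT_CORE for an already-running PID. A minimal sketch using Python's resource.prlimit (Python 3.4+, Linux only); the PID is hypothetical and this needs to run as root or as the process owner within the existing hard limit.

    # Minimal sketch: allow core dumps for an already-running process.
    import resource

    def enable_core_dumps(pid):
        soft, hard = resource.prlimit(pid, resource.RLIMIT_CORE)
        print('current RLIMIT_CORE: soft=%r hard=%r' % (soft, hard))
        resource.prlimit(pid, resource.RLIMIT_CORE,
                         (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

    # enable_core_dumps(12345)  # hypothetical zuul-scheduler PID
    # Where the dump lands is still governed by /proc/sys/kernel/core_pattern.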
fungi | Jan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events | 17:46 |
pabelanger | I am unsure where pip install would be logged now | 17:46 |
fungi | that's pretty tight correlation :/ | 17:46 |
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets | 17:47 | |
corvus | fungi: the scheduler segfault (which was the geard process) was 10m later? | 17:47 |
fungi | 17:26:38 yeah | 17:48 |
corvus | i think it would be really useful to know if a certain dependency was updated, or if it was just zuul | 17:48 |
corvus | thus my query about logs | 17:48 |
openstackstatus | pabelanger: finished sending notice | 17:48 |
fungi | this is at least the third time we've seen this happen (last couple times it took out all the executors and maybe the mergers too) | 17:49 |
fungi | and yeah, i'm looking to see if explicit pip logging needs to be enabled in that exec | 17:49 |
pabelanger | I did run kick.sh on zuulv3.o.o to attempt to pick up firewall changes, I don't see how but maybe it's related? | 17:50 |
Shrews | clarkb: i think that log thing is totally separate. there's a bug there, i think, with a handler being present when the provider has been removed | 17:50 |
*** vsaienk0 has quit IRC | 17:51 | |
fungi | pabelanger: at what time? | 17:51 |
Shrews | clarkb: i think we should land the log improvement, then I can look at the other thing | 17:51 |
clarkb | Shrews: run_handler is what creates the self.log object and so if it crashes in run_handler then we have no logger | 17:51 |
clarkb | Shrews: I'm fine with that but I also don't know that it will actually give us the logs we want | 17:51 |
*** ricolin has quit IRC | 17:51 | |
pabelanger | fungi: I do not have timestamps on my terminal, but it was before I noticed an issue with http://zuulv3.o.o | 17:51 |
clarkb | (but merging the change isn't a regression in that case, it just also won't fix things?) | 17:51 |
fungi | pabelanger: was the kick happening roughly concurrent with the 17:26:38 segfault for the gearman process or the 17:16:40 segfault for the zuul-web daemon? | 17:52 |
clarkb | Shrews: I'll approve it | 17:52 |
pabelanger | the puppet run was a success with no unknown errors | 17:52 |
pabelanger | fungi: it was after 2018-01-11 17:15:05,152 so it could have been the reason for the puppet run | 17:52 |
corvus | clarkb, Shrews: erm, let's fix it right? | 17:52 |
clarkb | corvus: yes we should fix it | 17:53 |
fungi | pabelanger: yeah, i don't see any puppet activity around 17:26:38 just 17:16:40 | 17:53 |
fungi | pabelanger: so sounds likely to be the earlier one | 17:53 |
clarkb | corvus: it's unrelated enough to the existing change and if the exception happens after configuring the logger we will be fine | 17:53 |
clarkb | so I think we can approve the existing change then swing around and fix the logger | 17:53 |
*** Apoorva has joined #openstack-infra | 17:53 | |
corvus | clarkb, Shrews: i'd hate to restart nodepool just to end up without the extra info we need because of that bug | 17:53 |
pabelanger | fungi: in fact, I only see puppet-user from my kick.sh attempt | 17:53 |
pabelanger | fungi: I don't see any runs of our wheel for some reason | 17:54 |
*** sambetts is now known as sambetts|afk | 17:54 | |
fungi | pabelanger: yeah, i wonder if something about our meltdown upgrades yesterday has broken puppeting | 17:55 |
*** jistr is now known as jistr|afk | 17:55 | |
clarkb | corvus: I can remove the +W if you want to see it fixed all at once | 17:55 |
pabelanger | git.o.o is failing puppet, i think that is blocking zuulv3.o.o from running | 17:55 |
pabelanger | looking at git.o.o now | 17:55 |
clarkb | went ahead and did that | 17:55 |
pabelanger | oh, yum-crontab fix didn't land | 17:56 |
Shrews | clarkb: corvus: run_handler() is re-entrant, executed multiple times for a request. most of the time, self.log is available. but that race i mentioned is what we need to fix, and is rare | 17:56 |
pabelanger | that is odd | 17:56 |
pabelanger | https://review.openstack.org/532331/ | 17:57 |
pabelanger | I thought that merged, apparently not | 17:57 |
*** ijw has joined #openstack-infra | 17:57 | |
pabelanger | so, it is possible that was the first puppet run on zuulv3.o.o in a few days | 17:57 |
*** yamamoto has joined #openstack-infra | 18:00 | |
*** nicolasbock has joined #openstack-infra | 18:00 | |
pabelanger | Apologies, it does seem my kick.sh was the reason for puppet run | 18:00 |
openstackgerrit | Dirk Mueller proposed openstack/diskimage-builder master: Add SUSE Mapping https://review.openstack.org/532925 | 18:00 |
corvus | pabelanger: well, i mean, puppet runs happen. it's hardly your fault. i don't think there's anything about kick.sh that should cause a segfault. | 18:02 |
corvus | clarkb, pabelanger, fungi: this segfault thing is pretty critical -- we'll sink if zuul just randomly gets killed. what do we need to do to track it down? | 18:03 |
*** ralonsoh has quit IRC | 18:04 | |
*** jbadiapa has joined #openstack-infra | 18:04 | |
*** yamamoto has quit IRC | 18:04 | |
fungi | corvus: looking through pip --help i see options for directing it where to log and how verbosely, but we'll probably want to set that in /etc/pip.conf given the mix of explicit pip calls and pip puppet package resources | 18:04 |
fungi | so i'm looking up the corresponding config options | 18:05 |
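One hedged possibility (paths and filename are only illustrative): pip's command-line options map to config keys, so a system-wide pip.conf could point every install at an appending verbose log, e.g.:

    # /etc/pip.conf -- illustrative sketch; the "log" key matches pip's
    # --log option (a path to a verbose, appending log file).
    [global]
    log = /var/log/pip/pip.log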
corvus | maybe we can figure out what was updated by filesytem timestamps | 18:05 |
corvus | fungi, pabelanger, clarkb: http://paste.openstack.org/show/642940/ | 18:05 |
fungi | maybe, though with the gearman crash coming after a 10-minute delay that may indicate that the correlation isn't tight enough to be able to match them | 18:05 |
corvus | was msgpack the thing from earlier? | 18:05 |
fungi | yes, it's the one which changed its name i think? | 18:05 |
johnsom | FYI, we are still seeing "retry_limit" errors that are bogus. | 18:06 |
johnsom | zuul patch 531514 for example | 18:06 |
fungi | corvus: also, deploying pip.conf updates widely will depend on getting puppet working again, so also need to look into why that broke | 18:06 |
clarkb | corvus: probably a good idea to get core dumps turned on too? | 18:06 |
pabelanger | I agree we should consider coredumps for zuul, is there any impact to that? | 18:07 |
clarkb | pabelanger: could fill up our disk (but we can watch it closely) | 18:07 |
corvus | so does this happen every time msgpack updates, or just once? | 18:07 |
pabelanger | I believe msgpack is what dmsimard saw for the executor issue also | 18:07 |
fungi | corvus: we've had several crashes across all zuul daemons (launchers, mergers) i think, so maybe each time msgpack updates? | 18:08 |
dmsimard | everything crashed *again* ? sorry I've been in meetings all morning | 18:08 |
dmsimard | FWIW we're not the only ones having issues with this... https://github.com/msgpack/msgpack-python/issues/268 && https://github.com/msgpack/msgpack-python/issues/266 | 18:09 |
fungi | dmsimard: so far just the daemons on the zuulv3.o.o server, but puppet isn't working (pabelanger ran it manually against that server) so that may be why it hasn't crashed any others yet | 18:09 |
dmsimard | We might have to manually uninstall both msgpack-python and msgpack, make sure there are no more files in /usr/local/python3.5/dist-packages (stale distinfo, eggs, etc.) and then re-run the (zuul?) reinstallation or just install msgpack by itself, I don't know. | 18:10 |
fungi | though interestingly, pip3 list says zuulv3 and ze01 both have the same versions of msgpack and msgpack-python | 18:10 |
dmsimard | fungi: yeah, that's mentioned in the issues I linked | 18:10 |
fungi | maybe zuulv3 upgrade was delayed somehow? | 18:11 |
corvus | okay, so if puppet wasn't running on zuulv3.o.o, then maybe this is the same issue, just delayed, and it's not a systemic problem? | 18:11 |
dmsimard | well, when it first happened last sunday, only the ZE and ZM nodes were impacted .. and then monday (I think?) zuulv3.o.o crashed because of msgpack as well | 18:11 |
corvus | oh, so it has happened before | 18:11 |
clarkb | pabelanger: yes msgpack is what I think we tracked the executors dying to | 18:12 |
corvus | in that case, since there was just a msgpack release, i wonder if we can expect the z* servers to all crash shortly? | 18:12 |
fungi | corvus: yeah, it's tanked all the zuul servers at least twice now, but there may have been two versions of msgpack triggering this | 18:12 |
pabelanger | dmsimard: when on Monday? We had a swapping issue due to memory, but I wasn't aware of a crash | 18:12 |
dmsimard | fungi: the correlation can be found in /usr/local/lib/python3.5/dist-packages (see timestamps where modules were last updated) and the dmesg timestamps of the general protection fault | 18:13 |
fungi | corvus: i think what crashed it most recently on the other servers is what just now crashed zuulv3 but was delayed because puppet hasn't run on that server in a while until pabelanger did so manually | 18:13 |
clarkb | could it be that the msgpack builds are replacing compiled so files under zuul? | 18:13 |
dmsimard | that's how I ended up finding out what was the issue | 18:13 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Delete the pool thread when a provider is removed https://review.openstack.org/532931 | 18:13 |
clarkb | except we should load everything into memory right? so probably not that | 18:13 |
corvus | i'm worried that msgpack_python (the deprecated package) was updated, but not msgpack (the replacement) | 18:14 |
dmsimard | pabelanger: there was two separate incidents for msgpack, one sunday and one monday | 18:14 |
dmsimard | 2018-01-07 20:55:48 UTC (dmsimard) all zuul-mergers and zuul-executors stopped simultaneously after what seems to be a msgpack update which did not get installed correctly: http://paste.openstack.org/raw/640474/ everything is started after reinstalling msgpack properly. | 18:14 |
pabelanger | fungi: corvus: I believe so, I am going to see if I can reproduce it locally here by installing an old version of msgpack, then upgrading it while zuul is running | 18:14 |
dmsimard | 2018-01-08 20:33:57 UTC (dmsimard) the msgpack issue experienced yesterday on zm and ze nodes propagated to zuulv3.o.o and crashed zuul-web and zuul-scheduler with the same python general protection fault. They were started after re-installing msgpack but the contents of the queues were lost. | 18:14 |
*** weshay is now known as weshay_interview | 18:14 | |
pabelanger | dmsimard: okay | 18:15 |
dmsimard | pabelanger: it might require something to hit that region of the memory for the GPF to trigger | 18:15 |
fungi | pypi release history shows msgpack and msgpack-python releases on january 6 and january 9 | 18:15 |
fungi | one release each on each of those two dates | 18:15 |
corvus | https://pypi.python.org/pypi/msgpack/0.5.1 is worth a read | 18:15 |
pabelanger | let me see when the last time we ran puppet on zuulv3.o.o was | 18:16 |
dmsimard | why the hell is there two *package names* using the same *module name* ? | 18:16 |
fungi | 0.5.0 for both on the 6th and 0.5.1 on the 9th | 18:16 |
*** jpena is now known as jpena|off | 18:16 | |
pabelanger | Jan 9 20:33:36 zuulv3 puppet-user[17648]: Finished catalog run in 7.59 seconds | 18:16 |
pabelanger | was the last puppet run on zuulv3, before the kick.sh | 18:17 |
dmsimard | that's weird, it's almost an exact match from the 01-08 timestamp | 18:17 |
fungi | anybody happen to know off the top of their head what zuul dep is dragging in msgpack? | 18:17 |
pabelanger | Jan 8 21:49:06 zuulv3 puppet-user[10102]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events | 18:17 |
pabelanger | was our last install_zuul before kick.sh | 18:18 |
fungi | (or msgpack-python i guess) | 18:18 |
corvus | fungi: no, but i'd like to find out | 18:18 |
fungi | yeah, was going to track that down next | 18:18 |
fungi | just didn't know if anyone already had | 18:18 |
dmsimard | is there something other than puppet that could update python modules? here are the timestamps from dist-packages: http://paste.openstack.org/raw/642949/ | 18:18 |
*** dhajare has joined #openstack-infra | 18:18 | |
corvus | fungi: cachecontrol | 18:19 |
dmsimard | fungi, corvus: the stack trace from msgpack should give us an idea of where it's imported from http://paste.openstack.org/raw/640474/ | 18:19 |
fungi | dmsimard: potentially unattended-upgrades if installed from distro packages | 18:19 |
*** SumitNaiksatam has joined #openstack-infra | 18:19 | |
fungi | corvus: thanks | 18:19 |
corvus | dmsimard: confirmed :) | 18:19 |
pabelanger | drwxr-sr-x 2 root staff 4096 Jan 11 17:16 msgpack_python-0.5.1.dist-info I think that is what we just installed | 18:20 |
dmsimard | fungi: doesn't look like it'd be from a package http://paste.openstack.org/raw/642953/ | 18:20 |
fungi | latest release of cachecontrol is what started using msgpack, it seems | 18:21 |
dmsimard | fungi: but from that last paste (dist-packages), you can see that it's not just msgpack that was updated.. there's zuul there as well | 18:21 |
fungi | introduced in 0.12.0, while 0.11.2 and earlier don't use it | 18:21 |
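A quick hedged way to see, on any given host, which of the two distributions (msgpack vs. the deprecated msgpack-python) is installed and which files the imported module actually comes from -- handy for comparing zuulv3 against the executors:

    # Diagnostic sketch: list installed msgpack distributions and the
    # module that actually gets imported.
    import pkg_resources

    for dist in pkg_resources.working_set:
        if dist.project_name.lower() in ('msgpack', 'msgpack-python'):
            print(dist.project_name, dist.version, dist.location)

    import msgpack
    print('imported from:', msgpack.__file__)
    print('module version:', getattr(msgpack, 'version', 'unknown'))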
*** rmcall has quit IRC | 18:21 | |
fungi | dmsimard: right, it happens when we pip install zuul from source each time a new commit lands | 18:21 |
dmsimard | fungi: so that's not driven by puppet then, it's automatic somehow ? | 18:22 |
fungi | dmsimard: it's driven by puppet. there's a vcsrepo resource for teh zuul git repo which triggers an exec to upgrade zuul from that | 18:22 |
dmsimard | ok, yeah, there's a puppet run that matches the timestamps where the modules have been updated | 18:23 |
dmsimard | BTW, if this happens again (where we lose the full scheduler) -- there is a short window where the status.json appears still populated while the scheduler reloads everything and we might be able to dump everything during that short window | 18:24 |
dmsimard | but I haven't tried | 18:24 |
dmsimard | (only realized when it was too late) | 18:24 |
*** ldnunes has quit IRC | 18:27 | |
corvus | dmsimard: i believe we were alerted to this because zuul-web died | 18:27 |
dmsimard | corvus: yeah but for some reason, (monday) when I started zuul-scheduler and zuul-web again, I recall seeing the whole status page and wondering why it wasn't empty | 18:28 |
corvus | dmsimard: that'll be the apache cache. it was gone. | 18:28 |
dmsimard | corvus: likely | 18:29 |
corvus | dmsimard: i'm certain; i checked. | 18:29 |
dmsimard | corvus: status.json is generated dynamically right ? it's not dumped to disk periodically (even for cache purposes) ? | 18:29 |
corvus | dmsimard: it's in-memory in zuul. apache may put it on disk. | 18:30 |
dmsimard | corvus: should we consider dumping it to disk periodically even if just for backup purposes so we can re-queue if need be ? though now that I think about it, this can be out of band -- like a cron that just gets it every minute or something | 18:31 |
corvus | dmsimard: i don't object to a cron | 18:32 |
corvus | in the mean time, what should we do about this? | 18:32 |
dmsimard | I believe pabelanger mentioned he'd like to try and reproduce it locally | 18:32 |
dmsimard | I'll send a patch for a cron which could at least help prevent us losing the entire queues -- check and gate aren't so bad, but losing post/release/tag kind of sucks | 18:33 |
corvus | dmsimard: post/release/tag aren't safe to automatically re-enqueue | 18:33 |
*** dprince has quit IRC | 18:34 | |
dmsimard | right, but we have no visibility on what jobs might not have run | 18:34 |
corvus | it's fine to save them, just pointing out that we can't restore them without consideration. | 18:34 |
* dmsimard nods | 18:34 | |
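A minimal sketch of the out-of-band backup dmsimard is proposing (URL, output directory and naming are illustrative; as corvus notes, anything captured from post/release/tag would still need manual review before re-enqueuing):

    #!/usr/bin/env python3
    # Illustrative cron job: snapshot status.json once a minute so queues
    # can be inspected (and selectively re-enqueued) after a crash.
    import datetime
    import os
    import urllib.request

    STATUS_URL = 'http://zuulv3.openstack.org/status.json'
    BACKUP_DIR = '/var/lib/zuul/status-backups'

    def main():
        os.makedirs(BACKUP_DIR, exist_ok=True)
        stamp = datetime.datetime.utcnow().strftime('%Y%m%dT%H%M%SZ')
        data = urllib.request.urlopen(STATUS_URL, timeout=30).read()
        path = os.path.join(BACKUP_DIR, 'status-%s.json' % stamp)
        with open(path, 'wb') as f:
            f.write(data)

    if __name__ == '__main__':
        main()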
*** ldnunes has joined #openstack-infra | 18:41 | |
*** dprince has joined #openstack-infra | 18:42 | |
*** sree has joined #openstack-infra | 18:43 | |
*** iyamahat has joined #openstack-infra | 18:43 | |
*** tesseract has quit IRC | 18:45 | |
*** jamesmcarthur has quit IRC | 18:46 | |
*** sree has quit IRC | 18:47 | |
AJaeger | mordred: your change https://review.openstack.org/#/c/532304/ fail in http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/configure-unbound/tasks/main.yaml#n44 . Is role_path not set correctly? | 18:48 |
*** fultonj has quit IRC | 18:48 | |
*** beagles has quit IRC | 18:49 | |
*** b3nt_pin has joined #openstack-infra | 18:49 | |
*** b3nt_pin is now known as beagles | 18:49 | |
openstackgerrit | Merged openstack-infra/system-config master: Fix typo with yum-cron package / service https://review.openstack.org/532331 | 18:51 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Capture and report errors in sibling installation https://review.openstack.org/532216 | 18:51 |
AJaeger | mordred: or is that a problem of the base-minimal parent? | 18:51 |
*** yamahata has joined #openstack-infra | 18:53 | |
AJaeger | mordred: left comments on 532304 | 18:54 |
*** felipemonteiro has joined #openstack-infra | 18:55 | |
*** felipemonteiro_ has joined #openstack-infra | 18:56 | |
*** hemna_ has joined #openstack-infra | 18:59 | |
*** sree has joined #openstack-infra | 19:00 | |
*** jkilpatr_ has joined #openstack-infra | 19:00 | |
*** felipemonteiro has quit IRC | 19:00 | |
*** jkilpatr has quit IRC | 19:01 | |
*** shardy has quit IRC | 19:01 | |
*** caphrim007 has joined #openstack-infra | 19:02 | |
*** sree has quit IRC | 19:05 | |
*** fultonj has joined #openstack-infra | 19:06 | |
*** sree has joined #openstack-infra | 19:06 | |
AJaeger | mlavalle, just commented on https://review.openstack.org/#/c/531496 - do you know what to do or do you have further questions? | 19:07 |
*** jkilpatr_ has quit IRC | 19:07 | |
corvus | fungi, pabelanger, clarkb: based on what i reported in #zuul, i *think* we can expect msgpack not to break us again unless they do another rename. further version upgrades shouldn't break us. i'm assuming we have 0.5.1 installed everywhere now. if we want to be extra safe, we might want to uninstall msgpack and msgpack-python everywhere and reinstall msgpack-python 0.5.1 just to clean up, but | 19:09 |
corvus | shouldn't be necessary. | 19:09 |
corvus | dmsimard: ^ | 19:09 |
pabelanger | okay, puppet is running again on zuulv3.o.o now that git servers are running puppet again | 19:09 |
clarkb | corvus: good to know thanks for digging in | 19:09 |
corvus | Shrews: what do we need to do for nodepool? | 19:10 |
*** sree has quit IRC | 19:10 | |
pabelanger | I am also still waiting for 532709 to land and confirm console logging is working on ze04.o.o before continuing with reboots of other executors | 19:12 |
*** sshnaidm is now known as sshnaidm|afk | 19:15 | |
*** weshay_interview is now known as weshay | 19:16 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider https://review.openstack.org/532931 | 19:16 |
Shrews | corvus: bunches of things? | 19:18 |
Shrews | corvus: sort of sinking in the changes at the moment and can't cover them all | 19:19 |
Shrews | and i've totally missed lunch b/c of it, so gonna grab a quick bite. brb | 19:19 |
*** slaweq has joined #openstack-infra | 19:20 | |
corvus | Shrews: okay. it looks like the segfault thing is mostly sorted, so when you get a chance to sort out what needs to happen, let me know. | 19:20 |
*** vsaienk0 has joined #openstack-infra | 19:20 | |
*** panda|ruck is now known as panda|ruck|afk | 19:20 | |
*** jkilpatr_ has joined #openstack-infra | 19:21 | |
smcginnis | Looking into some stable branch failures. Anyone know how this could have happened? http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/kolla/stable/pike/build-openstack-sphinx-docs/60e2946/job-output.txt.gz#_2018-01-09_06_16_06_651553 | 19:23 |
smcginnis | Must be something local: http://git.openstack.org/cgit/openstack/kolla/tree/doc/source?h=stable/pike | 19:24 |
fungi | where did we end up defining zuul nodeset names? | 19:24 |
corvus | fungi: ozj i think | 19:26 |
*** dprince has quit IRC | 19:26 | |
openstackgerrit | Ihar Hrachyshka proposed openstack-infra/openstack-zuul-jobs master: Switched all jobs from q-qos to neutron-qos https://review.openstack.org/532948 | 19:26 |
fungi | corvus: thanks! i thought it was in pc for some reason | 19:26 |
*** smatzek has quit IRC | 19:27 | |
AJaeger | team, is our pypi mirror working? We pushed https://pypi.python.org/pypi/openstackdocstheme 2 hours ago and it's not yet used in new jobs | 19:27 |
*** florianf has joined #openstack-infra | 19:27 | |
*** smatzek has joined #openstack-infra | 19:28 | |
openstackgerrit | Merged openstack-infra/system-config master: Zuul executor needs to open port 7900 now. https://review.openstack.org/532709 | 19:28 |
*** alexchadin has joined #openstack-infra | 19:29 | |
pabelanger | AJaeger: let me check | 19:30 |
*** sree has joined #openstack-infra | 19:30 | |
*** vsaienk0 has quit IRC | 19:30 | |
pabelanger | AJaeger: bandersnatch is running | 19:31 |
AJaeger | pabelanger: and it's at http://mirror.ca-ymq-1.vexxhost.openstack.org/pypi/simple/openstackdocstheme/ - let me recheck then. | 19:31 |
AJaeger | thanks | 19:31 |
*** smatzek has quit IRC | 19:33 | |
*** sree has quit IRC | 19:34 | |
Shrews | corvus: so nodepool, we have A) new quota handling does not fail as gracefully since we do more than just check max-servers now B) it has been suggested that instead of failing a request that gets an exception in request handling, that we instead let another provider try (which is a good idea IMO) and C) some zookeeper wonkiness caused us to not be able to delete 'deleting' znodes even though the | 19:39 |
Shrews | actual instance was deleted. I have no idea on this one atm | 19:39 |
Shrews | corvus: and the other thing clarkb was concerned about is hopefully handled in https://review.openstack.org/532931 | 19:40 |
Shrews | but i can't test that one locally b/c linux and the world hates me | 19:40 |
pabelanger | nice | 19:41 |
pabelanger | finger 8ee380a2a3ec4b1698ccd4fe6e6d5ecb@zuulv3.openstack.org | 19:42 |
pabelanger | works | 19:42 |
pabelanger | that should be using tcp/7900 now | 19:42 |
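For reference, the finger lookup above is just a bare TCP exchange: connect, send the build UUID plus CRLF, and read the streamed console log back. A hedged sketch (the UUID is the example from above; per the discussion the client talks to the gateway on the standard finger port, while the executors now serve streams on tcp/7900):

    # Hedged sketch of a raw finger query against the zuul finger gateway.
    import socket

    def finger(host, query, port=79):
        with socket.create_connection((host, port), timeout=30) as sock:
            sock.sendall(query.encode('ascii') + b'\r\n')
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b''.join(chunks).decode('utf-8', 'replace')

    # print(finger('zuulv3.openstack.org',
    #              '8ee380a2a3ec4b1698ccd4fe6e6d5ecb'))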
*** eharney has quit IRC | 19:42 | |
pabelanger | I think we can proceed with rolling restarts of zuul-executors to drop root permissions | 19:43 |
*** harlowja has joined #openstack-infra | 19:43 | |
pabelanger | along with /var/log/zuul permission fix | 19:43 |
*** eharney has joined #openstack-infra | 19:44 | |
*** edmondsw_ has joined #openstack-infra | 19:48 | |
*** edmonds__ has joined #openstack-infra | 19:48 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json https://review.openstack.org/532955 | 19:49 |
dmsimard | ^ as per my suggestion | 19:49 |
dmsimard | corvus: ^ | 19:49 |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json https://review.openstack.org/532955 | 19:51 |
*** alexchadin has quit IRC | 19:51 | |
clarkb | Shrews: it would be good to get A) and B) sorted out so that we can safely reboot the chocolate infracloud controller | 19:51 |
clarkb | I'm sort of all over the place today taking care of sick family and getting new glasses and doing travel paperwork | 19:51 |
clarkb | but happy to help where I can (will review that one change shortly) | 19:51 |
pabelanger | clarkb: are we okay to proceed with zuul-executor restarts? This is to drop root permissions and change the finger port to tcp/7900. Confirmed to be working on ze04 | 19:52 |
*** edmondsw has quit IRC | 19:52 | |
*** edmondsw_ has quit IRC | 19:52 | |
clarkb | pabelanger: I think as long as it isn't expected to affect job results we should be fine | 19:52 |
clarkb | executor stops result in jobs being rerun right? | 19:52 |
clarkb | (its been a couple rough days so doing everything we can to make it less rough is nice) | 19:53 |
pabelanger | clarkb: yah, jobs will abort and requeue | 19:53 |
pabelanger | just means people wait a little longer for stuff to merge | 19:53 |
clarkb | and we are at capacity ya? | 19:53 |
clarkb | might be best to wait for things to cool off a bit for that? | 19:54 |
pabelanger | yah, we are maxed out right now | 19:54 |
dmsimard | there's a lot of compounding issues right now | 19:54 |
dmsimard | the restarts, the loaded executors, leading people to recheck a significant backlog of things | 19:55 |
dmsimard | need to afk food | 19:55 |
pabelanger | I think what might happen is: if an executor is stopped and started again with /var/log/zuul still owned by root, it might not properly start again | 19:56 |
pabelanger | eg: live migration | 19:56 |
clarkb | dmsimard: right, I think it may be best to let things settle in a bit | 19:56 |
clarkb | and see where we are since the executor restarts aren't urgent | 19:56 |
clarkb | pabelanger: will it not fail to start at all or will it start and be broken? | 19:57 |
*** jistr|afk is now known as jistr | 19:57 | |
*** smatzek has joined #openstack-infra | 19:57 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider https://review.openstack.org/532957 | 19:57 |
pabelanger | let me check something | 19:57 |
pabelanger | okay, we might be good | 19:58 |
pabelanger | -rw-rw-rw- 1 root root 171389851 Jan 11 19:58 /var/log/zuul/executor-debug.log | 19:58 |
pabelanger | zuul user will be able to write | 19:58 |
*** sree has joined #openstack-infra | 19:59 | |
corvus | Shrews, clarkb: i think (B) is compatible with the algorithm and should be okay to implement. i think that would manifest as: when all of the node launch retries have been exhausted, we decline the request. the final provider which handles that request would still cause it to fail, in the normal way that requests that are universally declined are failed. | 19:59 |
corvus | pabelanger: it's world writable? | 20:01 |
Shrews | corvus: it would be more than node launch retries (in the vanilla case, it was throwing an exception when trying to query the provider for quota info), but... yeah | 20:01 |
*** Apoorva has quit IRC | 20:01 | |
corvus | Shrews: sounds good | 20:01 |
corvus | Shrews: is there more detail about (A)? | 20:02 |
openstackgerrit | Sean McGinnis proposed openstack-infra/openstack-zuul-jobs master: Add commit irrelevant files to tempest-full https://review.openstack.org/532959 | 20:02 |
pabelanger | corvus: yes | 20:02 |
*** armaan has quit IRC | 20:02 | |
corvus | pabelanger: let's fix that after the restarts :) | 20:03 |
pabelanger | agree | 20:03 |
*** armaan has joined #openstack-infra | 20:03 | |
Shrews | corvus: such as? https://review.openstack.org/532957 re-enables that short-circuit | 20:03 |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Remove irrelevant-files for tempest-full https://review.openstack.org/532960 | 20:03 |
Shrews | corvus: i think you know as much as i do at this point.... happy to clarify anything though | 20:04 |
corvus | Shrews: that sounds good, but what do you mean by "new quota handling does not fail as gracefully since we do more than just check max-servers now" ? in what cases does it fail not gracefully? | 20:04 |
*** sree has quit IRC | 20:04 | |
Shrews | corvus: http://paste.openstack.org/show/643045/ | 20:05 |
ianw | there's a lot of scroll-back for us antipodeans ... are we at a point we want to merge https://review.openstack.org/#/c/532701/ to start new image builds, or is there too much else going on? | 20:06 |
corvus | Shrews: or are A and B really related -- if we fix B, we'll handle the case where we can't calculate quota because the provider is broken by declining the request. and then when we come along and set max-servers to 0 because we know the cloud is broken, change 532957 will short circuit that and make things cleaner? | 20:06 |
*** gema has quit IRC | 20:06 | |
Shrews | corvus: oh, not failing gracefully... the new quota stuff adds the dimension of exceptions from a wonky provider during the "should I decline this?" checks | 20:07 |
Shrews | A and B are sort of related, yes | 20:07 |
*** hasharAway is now known as hashar | 20:08 | |
corvus | Shrews: okay, i think i grok. my understanding is: 532957 should fix the most immediate thing and let us restart infracloud without node failures, then solving (B) will let infracloud break in the future without spewing lots of node_failure messages.. correct? | 20:09 |
Shrews | corvus: 532957 would have hidden the provider failure, but doesn't fix the exception stuff that can still occur when max-servers >0 | 20:09 |
corvus | Shrews: yep, that jives with my understanding | 20:09 |
Shrews | because vanilla was disabled, but we were still getting the exceptions | 20:09 |
pabelanger | ianw: +3 | 20:10 |
Shrews | corvus: yes, with all my current changes that I have up, we can restart and run for a while while I sort B out | 20:11 |
pabelanger | ianw: also, ze04.o.o is back online and running as zuul user again | 20:11 |
pabelanger | ianw: we have not done any other executors yet, we'll likely do so in another day | 20:11 |
fungi | ianw: probably safe as long as we don't expect that updates to the images will bring yet more new regressions... we're somewhat backlogged and maxed out on capacity at the moment | 20:11 |
corvus | Shrews: i've +2 532931 and 532957 | 20:11 |
corvus | Shrews: any others? | 20:11 |
pabelanger | ianw: but all patches for puppet-zuul and firewall landed | 20:11 |
Shrews | corvus: clarkb: what were the concerns about chocolate? i'm afraid i've been too heads down in firecoding to have noticed | 20:12 |
pabelanger | fungi: ianw: actually, lets hold off until tomorrow then, just to be safe | 20:12 |
pabelanger | and time for some patches to merge | 20:12 |
*** sree has joined #openstack-infra | 20:12 | |
ianw | ok, that puts it into my weekend, so maybe leave it with +2's until my monday | 20:13 |
fungi | Shrews: we need to reboot some of the chocolate control plane for meltdown patching still | 20:13 |
Shrews | corvus: no, i think you got them all | 20:13 |
ianw | that way a) it's usually quiet(er) and b) i can monitor | 20:13 |
pabelanger | ianw: good idea, might want to -2 then | 20:13 |
fungi | Shrews: and are concerned that any prolonged outage of the api will cause heartbreak for nodepool | 20:13 |
Shrews | fungi: is chocolate disabled in nodepool? | 20:13 |
corvus | clarkb: +3 532931 and 532957 please | 20:13 |
fungi | Shrews: no, only vanilla is disabled because we were unable to get it back into operation reliably | 20:13 |
Shrews | fungi: has chocolate been otherwise functional? | 20:14 |
fungi | i _think_ so (inasmuch as it ever is anyway) | 20:15 |
corvus | Shrews: i think the extent of what i was saying is that being able to reliably zero max-servers for a cloud gives us the room to enable/disable as needed without worrying about errors related to (B) | 20:15 |
Shrews | corvus: yeah. i think we should disable chocolate with max-servers=0 if we're concerned about it working | 20:15 |
Shrews | 957 gives us that leeway | 20:16 |
fungi | but if we set it to max-servers=0 are we still going to cause the same problems that vanilla was/is causing with max-servers=0 already? | 20:16 |
Shrews | fungi: not with 957 | 20:16 |
Shrews | (in theory) | 20:16 |
fungi | aha, right, that's the missing piece thanks | 20:16 |
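For readers following along, here is a rough Python sketch of the two behaviours under discussion: the 532957 short circuit (a pool with max-servers <= 0 declines requests immediately, without ever touching the cloud API) and the "(B)" idea of declining gracefully when quota cannot even be computed because the provider is broken. The class and method names are invented for illustration and are not the real nodepool code.

    # Illustrative sketch only: PoolWorker, ProviderConfig and the quota call
    # are hypothetical names, not the real nodepool classes.

    class ProviderConfig:
        def __init__(self, max_servers, cloud_client=None):
            self.max_servers = max_servers
            self.cloud_client = cloud_client  # None simulates a broken/unreachable cloud


    class PoolWorker:
        def __init__(self, provider):
            self.provider = provider

        def estimate_available_quota(self):
            # Stand-in for the real quota calculation, which talks to the cloud API.
            if self.provider.cloud_client is None:
                raise RuntimeError("provider unreachable")
            return self.provider.cloud_client.available_instances()

        def should_decline(self, node_count):
            # 532957: a pool with max-servers <= 0 is administratively disabled,
            # so decline straight away and never call the (possibly broken) API.
            if self.provider.max_servers <= 0:
                return True
            # The "B" idea: if quota cannot even be computed because the provider
            # is unhealthy, decline rather than raise and cause NODE_FAILUREs.
            try:
                available = self.estimate_available_quota()
            except Exception:
                return True
            return node_count > available


    # A disabled provider declines without calling the cloud at all.
    print(PoolWorker(ProviderConfig(max_servers=0)).should_decline(1))   # True
    # An unreachable provider with max-servers > 0 also declines gracefully.
    print(PoolWorker(ProviderConfig(max_servers=10)).should_decline(1))  # True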
*** sree has quit IRC | 20:16 | |
* Shrews has missing pieces scattered everywhere today | 20:17 | |
*** cody-somerville has joined #openstack-infra | 20:23 | |
*** dprince has joined #openstack-infra | 20:26 | |
clarkb | corvus: looking now | 20:27 |
clarkb | fwiw 957 isn't really the concern with restarting the controller | 20:28 |
clarkb | we shouldn't have node failures if a cloud goes away even if max servers is > 0 | 20:29 |
*** SumitNaiksatam has quit IRC | 20:29 | |
clarkb | this requires us to know in advance when clouds will have outages which isn't always the case | 20:29 |
clarkb | I've approved 957 because its a good improvement either way. Also left a comment on it for a followup | 20:30 |
clarkb | Shrews: ^ | 20:30 |
*** fultonj has quit IRC | 20:31 | |
*** cody-somerville has quit IRC | 20:31 | |
Shrews | clarkb: yeah, not saying that fixes the B part. it's a stop-gap until i can fix the other thing | 20:32 |
Shrews | but it's a good stop-gap that should stay since it prevents unnecessary provider calls | 20:33 |
*** gema has joined #openstack-infra | 20:33 | |
clarkb | Shrews: yup | 20:33 |
clarkb | Shrews: in https://review.openstack.org/#/c/532931/2 why are we removing the extra logging that you just added? | 20:33 |
clarkb | oh wait I misread that nevermind | 20:33 |
clarkb | corvus: Shrews both changes have been approved, just the one minor improvement idea on 957 | 20:36 |
clarkb | I need to grab lunch now | 20:36 |
Shrews | clarkb: can I +A the logging change? | 20:37 |
*** hrubi has quit IRC | 20:37 | |
*** hrubi has joined #openstack-infra | 20:39 | |
*** eharney has quit IRC | 20:41 | |
clarkb | Shrews: ya I think so | 20:42 |
efried | Don't want to recheck unnecessarily and worsen the problem, so: if my change isn't showing up on the zuulv3.o.o dashboard, do I need to recheck it? | 20:49 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider https://review.openstack.org/532931 | 20:49 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider https://review.openstack.org/532957 | 20:49 |
efried | Specifically https://review.openstack.org/#/c/518633/ | 20:50 |
*** eharney has joined #openstack-infra | 20:51 | |
AJaeger | efried: yes. But in your case: Ask mriedem to toggle the +W, so it goes directly into gate. A recheck would run it first through check. | 20:52 |
*** eharney has quit IRC | 20:52 | |
efried | AJaeger Oh, that's useful to know. Thank you. | 20:52 |
AJaeger | efried: or have another nova core just add an additional +W to trigger that | 20:52 |
AJaeger | efried: see also top entry at https://wiki.openstack.org/wiki/Infrastructure_Status | 20:52 |
efried | AJaeger What about the other already+W'd patches behind that guy? Will they go to the gate automatically once the first one clears? | 20:53 |
efried | or do they need to be +W-twiddled too? | 20:53 |
AJaeger | efried: hope so ;) Give it a try | 20:53 |
efried | AJaeger Thanks. | 20:53 |
mriedem | what do i need to do? | 20:55 |
AJaeger | mriedem: toggle +W on https://review.openstack.org/#/c/518633 | 20:55 |
*** e0ne has joined #openstack-infra | 20:55 | |
mriedem | toggle == remove +W and add it back? | 20:55 |
AJaeger | mriedem: yes | 20:56 |
mriedem | consider it toggled | 20:56 |
* mriedem blushes | 20:56 | |
AJaeger | thanks, mriedem | 20:56 |
efried | Thanks mriedem | 20:56 |
mriedem | my pleasure | 20:56 |
efried | It appeared in the gate right away, woohoo! | 20:56 |
efried | Now we'll see if it gets through. Last time it sat there for about 12 before disappearing mysteriously without a trace... | 20:57 |
AJaeger | efried: https://review.openstack.org/#/c/518982/19 will not go into gate - there's no Zuul +1 vote. So, that one needs a recheck. | 20:57 |
mriedem | it hit the bermuda triangle | 20:57 |
AJaeger | efried: I suggest you recheck those in the stack that don't have Zuul +1. | 20:58 |
efried | AJaeger Thanks, will do. | 20:58 |
AJaeger | mriedem: it tried to hide but you found it ;) | 20:59 |
fungi | one of zuul's dependencies made a botched attempt at transitioning between package names, and that has resulted in a couple of rather sudden outages where we ended up unable to reenqueue previously running jobs once we got it back online | 20:59 |
fungi | thankfully, we think this shouldn't be a recurrent problem now that they've fixed their transitional package on pypi | 20:59 |
fungi | the joys of continuous delivery ;) | 21:00 |
mtreinish | fungi: which package? | 21:00 |
fungi | msgpack-python -> msgpack | 21:00 |
mtreinish | ah, ok | 21:00 |
fungi | a transitive dep via cachecontrol | 21:00 |
fungi | msgpack-python 0.5.0 was basically empty, and so in-place upgrading it caused segfaults for running zuul processes | 21:01 |
*** krtaylor has joined #openstack-infra | 21:01 | |
*** Apoorva has joined #openstack-infra | 21:02 | |
fungi | and upgrading from it had similarly disastrous impact once they released a fixed version | 21:02 |
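As a hedged illustration only (not what was actually run), this is the kind of sanity check one might do on a host before restarting long-lived services; it relies only on the standard msgpack packb/unpackb API:

    # Verify that the installed msgpack distribution actually contains a
    # working module before restarting processes that import it.
    import importlib.metadata

    import msgpack

    print("msgpack version:", importlib.metadata.version("msgpack"))

    # Round-trip a small payload; an empty transitional package would fail
    # here instead of after the service restart.
    payload = {"queue": "gate", "items": 3}
    packed = msgpack.packb(payload, use_bin_type=True)
    assert msgpack.unpackb(packed, raw=False) == payload
    print("msgpack round-trip OK")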
*** david-lyle has quit IRC | 21:03 | |
*** krtaylor has quit IRC | 21:03 | |
*** david-lyle has joined #openstack-infra | 21:04 | |
*** krtaylor has joined #openstack-infra | 21:04 | |
ianw | fungi: ahh, so that's the suspect in our "all the executors went bye-bye" scenario the other day? | 21:06 |
*** Apoorva has quit IRC | 21:06 | |
corvus | ianw: yep; there's a bit more detail about versions, etc, in #zuul | 21:06 |
*** gouthamr has quit IRC | 21:07 | |
*** edmonds__ is now known as edmondsw | 21:07 | |
*** e0ne has quit IRC | 21:09 | |
*** masber has joined #openstack-infra | 21:10 | |
*** sree has joined #openstack-infra | 21:12 | |
efried | mriedem Well, that worked so well for that patch, would you mind doing the same for https://review.openstack.org/#/c/521686/ ? | 21:14 |
*** gouthamr has joined #openstack-infra | 21:15 | |
mriedem | efried: bauzas is still awake, i'm sure he can do it | 21:15 |
efried | Hum, that one might be different, since it's still in the check queue (six times ??) | 21:15 |
*** olaph1 has joined #openstack-infra | 21:15 | |
ianw | eumel8 / ianychoi : i dropped the db (made a backup) and ran puppet on translate-dev ... it doesn't seem to have redeployed zanata. looking into it | 21:16 |
*** olaph has quit IRC | 21:16 | |
*** sree has quit IRC | 21:17 | |
AJaeger | config-core, I updated the list of reviews to review at https://etherpad.openstack.org/p/Nvt3ovbn5x - the backlog is growing; just in case somebody has time left after all the updating and fixing... | 21:21 |
AJaeger | efried: 521686 has no Zuul +1 | 21:21 |
fungi | smcginnis: pasted a job log in #-release where a job failed to push content via rsync to static.o.o, connection unexpectedly closed | 21:23 |
fungi | er, smcginnis pasted | 21:23 |
fungi | http://logs.openstack.org/a5/a52aa0b2ad06a52e50be8879f9256576ceceb91c/release-post/publish-static/cee3a5f/job-output.txt.gz#_2018-01-11_21_08_06_844641 | 21:23 |
efried | AJaeger Perhaps you can help me understand what I'm seeing on the dashboard. 518633,23 showed up in the gate and was trucking along nicely. Then while it was going, three copies of it showed up in the check queue. They're still there. And now the one in the gate doesn't have the sub-job thingies anymore - just the one line labeled 518633,23 | 21:23 |
fungi | i can connect to the server, at least | 21:23 |
efried | AJaeger What does it mean when a change shows up that way (with just the one status line labeled with the change number, that never seems to move)? | 21:24 |
fungi | efried: does hovering over the dot next to it tell you anything in a pop-up tooltip? | 21:24 |
efried | fungi Oo, "dependent change required for testing" -- howzat? | 21:24 |
AJaeger | efried: I see 518633 as bottom of stack, so that is fine. You rechecked some changes that are stacked on top of it. so, this is fine | 21:25 |
fungi | efried: looks like the cinder change ahead of it failed a voting job, so your change has been restarted to no longer include testing with that failing cinder change | 21:25 |
efried | fungi When you say "ahead of it"... | 21:26 |
fungi | it took a moment to get new node assignments for the jobs | 21:26 |
fungi | "above" it in the gate pipeline | 21:26 |
efried | There's no dependency between them, is there? | 21:26 |
*** ldnunes has quit IRC | 21:26 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension https://review.openstack.org/532823 | 21:26 |
fungi | efried: only insofar as they share some jobs and there's a chance that the cinder change could break the ability of your change to pass jobs (or vice versa) so we test them together to make sure we don't race in merging an interdependency bug | 21:27 |
*** jkilpatr_ has quit IRC | 21:27 | |
fungi | but if we realize the cinder change isn't going to merge, we have to retest your nova change without it applied | 21:27 |
efried | fungi I see. And that helps me understand how the whole queue thrashing problem can compound as it gets more full. | 21:28 |
fungi | yep. we have a sliding stability window where we'll try to test as many changes at a time as possible, but if we fail too often we scale the window down to a minimum of 20 changes at a time in a shared queue | 21:28 |
fungi | just to keep the thrash manageable | 21:29 |
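A rough sketch of the sliding-window behaviour fungi describes, using the floor of 20 mentioned above; the real Zuul implementation and its tunables differ, so treat this purely as illustration:

    # Hypothetical stability window: grow slowly on success, shrink quickly
    # on failure, never below a configured floor.
    class StabilityWindow:
        def __init__(self, start=20, floor=20, increase=1, decrease_factor=2):
            self.size = start
            self.floor = floor
            self.increase = increase
            self.decrease_factor = decrease_factor

        def on_change_merged(self):
            # Reward stability: test one more change in parallel next time.
            self.size += self.increase

        def on_change_failed(self):
            # Punish thrash: shrink the window sharply, but keep testing at
            # least `floor` changes at a time in the shared queue.
            self.size = max(self.floor, self.size // self.decrease_factor)


    w = StabilityWindow()
    for _ in range(10):
        w.on_change_merged()
    print(w.size)        # 30 after a run of successes
    w.on_change_failed()
    print(w.size)        # 20 -- clamped to the floor after a failure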
lbragstad | i have a quick question, we started noticing this on stable/pike http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/keystone/stable/pike/build-openstack-sphinx-docs/a5a6cb8/job-output.txt.gz#_2018-01-09_06_13_52_297432 | 21:29 |
lbragstad | we did see that on master a while ago but I don't think we fixed it with a requirements bump | 21:29 |
lbragstad | i'm wondering if that jogs anyone's memory here | 21:30 |
fungi | lbragstad: smells like a dependency updating in a backward-incompatible fashion | 21:30 |
smcginnis | lbragstad: Is that package python-ldap? | 21:30 |
cmurphy | lbragstad: fungi let me dig up AJaeger's patch that fixed it - the problem was autodoc wasn't loading all the libs that were declared in setup.cfg and not in requirements.txt | 21:31 |
lbragstad | https://github.com/openstack/keystone/blob/master/setup.cfg#L31 | 21:31 |
smcginnis | lbragstad: Ah, it can't be in setup.cfg anymore since the source isn't installed to run the docs job. | 21:32 |
lbragstad | aha | 21:32 |
smcginnis | lbragstad: Probably need to backport the change cmurphy is thinking of. | 21:32 |
*** e0ne has joined #openstack-infra | 21:32 | |
ianw | fungi: any thoughts on what we can do about translate-dev getting "451 4.7.1 Greylisting in action, please come back in 00:04:59" when sending confirm emails? how do we make it look more legit? | 21:32 |
smcginnis | I've been doing that in a few projects to get stable jobs working. Let me know if there's any questions about it. | 21:32 |
fungi | ianw: make sure it's not covered by a listing in the spamhaus pbl, and apply for an exception for that ip address if it is | 21:33 |
fungi | ianw: rackspace has added basically all of their ip assignments to the pbl, i guess as a way to cut down on abuse reports, so you have to explicitly poke holes in those blanket listings for systems you want to be able to send e-mail to popular domains which may make filtering decisions based on pbl lookups | 21:34 |
fungi | cmurphy: great memory, i didn't even consider this might be fallout from the docs pti compliance work | 21:35 |
lbragstad | cmurphy: wasn't it this one? https://review.openstack.org/#/c/530087/ | 21:35 |
*** numans has quit IRC | 21:36 | |
*** numans has joined #openstack-infra | 21:36 | |
fungi | ianw: i usually start by looking at https://talosintelligence.com/reputation_center/lookup?search=translate-dev.openstack.org and that's suggesting there aren't any matching blacklist entries so it's probably not pbl impact at least | 21:37 |
cmurphy | lbragstad: it was either that one or this one https://review.openstack.org/#/c/528960 | 21:37 |
smcginnis | lbragstad: It was probably a change in openstack/keystone. | 21:37 |
cmurphy | if it's that one ^ then we just need to backport | 21:37 |
smcginnis | cmurphy: Yep, that looks like what you'd need. | 21:37 |
ianw | fungi: yeah ... also digging into the logs, there's not much email traffic out, but it might just be @redhat.com that's being too picky | 21:37 |
fungi | ianw: up side, if it's just greylisting, is that usually subsides as popular domains get used to receiving e-mail from it | 21:38 |
smcginnis | lbragstad, cmurphy: Only tricky thing I've run into when backporting these is the differences in global-requirements with the stable branch. | 21:38 |
cmurphy | smcginnis: oh hrm | 21:38 |
corvus | ianw: is it putting translate-dev01.openstack.org in the envelope sender address, or translate-dev? | 21:38 |
fungi | ooh, excellent point | 21:38 |
*** sbra has quit IRC | 21:38 | |
smcginnis | cmurphy, lbragstad: But actually, doesn't look like that patch was complete. Should switch s/python setup.py build_sphinx/sphinx-build .../. But I don't think that's necessary as far as fixing pike. | 21:39 |
ianw | corvus: hmm, possibly "noreply@openstack.org" http://paste.openstack.org/show/643127/ | 21:40 |
lbragstad | smcginnis: ack - let me get a backport proposed quick | 21:40 |
fungi | ianw: if the message is still in exim's queue, you should be able to find the message (body and headers) in /var/spool/exim4/input/ | 21:40 |
ianw | fungi: yep, that's what i just did :) anyway, let's see if it's an issue with the real server & people not getting their auth emails and then dig into it | 21:41 |
*** aeng has joined #openstack-infra | 21:41 | |
fungi | and yeah, looking at it i agree it seems to be using noreply@openstack.org for sender | 21:42 |
fungi | which at least matches the from header | 21:42 |
*** smatzek has quit IRC | 21:42 | |
ianw | there are some interesting options. i wonder if i want to turn my "maximum gravatar rating shown" up to "X" | 21:42 |
fungi | it doesn't go to 11? | 21:43 |
fungi | er, i mean, xi? | 21:43 |
corvus | noreply@openstack.org does not accept bounce messages, so it's a bad choice for an envelope sender. that's a legit greylist trigger. | 21:43 |
*** jtomasek has quit IRC | 21:43 | |
*** sree has joined #openstack-infra | 21:43 | |
fungi | yep, we usually configure an alias of the infra-root address as sender for other systems | 21:44 |
*** olaph1 has quit IRC | 21:44 | |
*** olaph has joined #openstack-infra | 21:45 | |
corvus | http://paste.openstack.org/show/643135/ | 21:45 |
*** eharney has joined #openstack-infra | 21:46 | |
*** rcernin has joined #openstack-infra | 21:46 | |
fungi | right, not gonna get accepted by any mta which enforces sender verification callbacks | 21:46 |
ianw | i wonder where it comes from. not immediately obvious from http://codesearch.openstack.org/?q=noreply%40openstack.org&i=nope&files=&repos= | 21:47 |
*** sree has quit IRC | 21:48 | |
ianw | well the server gui configuration has a "from email address" durr | 21:48 |
fungi | i was gonna say, may be manually configured | 21:48 |
*** eharney has quit IRC | 21:48 | |
ianw | i just dropped the db for it to repopulate from scratch though | 21:49 |
corvus | it's fine as a from header, just not great as an envelope sender | 21:49 |
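The distinction corvus is drawing shows up directly in smtplib: the envelope sender (MAIL FROM) is whatever the caller passes at send time and is what greylisting and sender verification callbacks act on, while the From: header is only what the recipient sees. A minimal sketch with placeholder addresses, assuming a local MTA on localhost:

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "noreply@openstack.org"          # fine as a display header
    msg["To"] = "user@example.com"
    msg["Subject"] = "Please confirm your account"
    msg.set_content("Click the link below to confirm your address...")

    with smtplib.SMTP("localhost") as smtp:
        # The envelope sender should be a real, bounce-accepting address
        # (e.g. an infra-root style alias), not noreply@, or strict MTAs
        # will defer or reject the message.
        smtp.send_message(msg, from_addr="infra-root@example.org",
                          to_addrs=["user@example.com"])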
lbragstad | smcginnis: not sure if i'm on the right track here https://review.openstack.org/#/c/532984/ | 21:49 |
lbragstad | but there was a merge conflict with the test-requirements.txt file | 21:49 |
fungi | ianw: maybe it doesn't store that in the remote db? | 21:50 |
lbragstad | and i'm not quite sure if what i proposed breaks process or not (if i had to change requirement versions because of the conflict) | 21:50 |
ianw | fungi: yeah, i guess. thanks, i think it's some good info if it's a problem with other people not getting emails, especially on the live server | 21:52 |
smcginnis | lbragstad: No, that looks right to me. Let's see how it goes in the gate. | 21:52 |
*** jkilpatr has joined #openstack-infra | 21:52 | |
ianw | clarkb / ianychoi / eumel8 : i'll drop some notes in the change, but i think translate-dev should be back with the fresh db, i can log in and poke around at least. let me know if issues | 21:53 |
lbragstad | smcginnis: cool - thanks for the sanity check | 21:53 |
lbragstad | cmurphy: fyi ^ | 21:53 |
lbragstad | smcginnis: fwiw - if that does fix things, you might have to kick that one through (keystone doesn't have two active stable cores) | 21:54 |
*** slaweq has quit IRC | 21:55 | |
*** slaweq has joined #openstack-infra | 21:56 | |
*** e0ne has quit IRC | 21:57 | |
openstackgerrit | David Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json https://review.openstack.org/532955 | 21:57 |
*** threestrands has joined #openstack-infra | 21:58 | |
*** threestrands has quit IRC | 21:58 | |
*** threestrands has joined #openstack-infra | 21:58 | |
*** hamzy has quit IRC | 21:58 | |
*** threestrands has quit IRC | 21:59 | |
*** threestrands has joined #openstack-infra | 21:59 | |
*** threestrands has quit IRC | 22:00 | |
smcginnis | lbragstad: Really? | 22:00 |
smcginnis | Oh, stable. Sure, I will +2 once things pass. | 22:00 |
*** threestrands has joined #openstack-infra | 22:01 | |
*** jascott1 has joined #openstack-infra | 22:02 | |
*** threestrands has quit IRC | 22:02 | |
*** gyee has joined #openstack-infra | 22:02 | |
*** threestrands has joined #openstack-infra | 22:02 | |
*** Apoorva has joined #openstack-infra | 22:03 | |
*** sree has joined #openstack-infra | 22:06 | |
*** tpsilva has quit IRC | 22:09 | |
*** nicolasbock has quit IRC | 22:10 | |
*** sree has quit IRC | 22:10 | |
smcginnis | lbragstad: Looked up the stable/pike g-r values. | 22:11 |
clarkb | not really back yet but looks like nodepool changes did merge, next step is restarting the launchers or is that done? | 22:16 |
*** dave-mccowan has quit IRC | 22:16 | |
*** dave-mccowan has joined #openstack-infra | 22:17 | |
*** olaph has quit IRC | 22:17 | |
*** Goneri has quit IRC | 22:20 | |
*** Apoorva_ has joined #openstack-infra | 22:23 | |
*** hashar has quit IRC | 22:24 | |
*** dave-mccowan has quit IRC | 22:25 | |
*** Apoorva has quit IRC | 22:26 | |
corvus | clarkb: i don't think it's been done | 22:27 |
*** Apoorva_ has quit IRC | 22:30 | |
*** Apoorva has joined #openstack-infra | 22:31 | |
*** dprince has quit IRC | 22:31 | |
*** sree has joined #openstack-infra | 22:32 | |
*** bobh has quit IRC | 22:32 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies https://review.openstack.org/530806 | 22:35 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests https://review.openstack.org/532699 | 22:35 |
*** sree has quit IRC | 22:36 | |
*** esberglu has quit IRC | 22:36 | |
*** olaph has joined #openstack-infra | 22:38 | |
*** sree has joined #openstack-infra | 22:38 | |
*** kjackal has quit IRC | 22:39 | |
*** kjackal has joined #openstack-infra | 22:40 | |
*** sree has quit IRC | 22:43 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Delete stale jobdirs at startup https://review.openstack.org/531510 | 22:44 |
*** slaweq has quit IRC | 22:46 | |
*** jbadiapa has quit IRC | 22:47 | |
*** markvoelker has quit IRC | 22:49 | |
ianw | corvus: one thing i'd note about deleting job dirs is that on the cinder-mounted volumes it actually takes quite a while | 22:49 |
*** markvoelker has joined #openstack-infra | 22:49 | |
ianw | when i restarted the executors the other day i cleared out the old stuff, and it was like upwards of 10 minutes | 22:50 |
ianw | it made me think maybe a cron job that removes X day old dirs might work | 22:50 |
corvus | ianw: well, if we need that, we should build it into zuul. | 22:50 |
corvus | trouble is, it's hard to tell if the admin intended to keep the dir | 22:51 |
corvus | so if we do that, we'd need to drop a flag in the dirs indicating they were 'kept' and not have the 'cron' delete them | 22:51 |
*** niska has quit IRC | 22:51 | |
*** ruhe has quit IRC | 22:51 | |
ianw | oh, from the keep var in config you mean? | 22:52 |
*** rlandy is now known as rlandy|bbl | 22:52 | |
*** niska has joined #openstack-infra | 22:52 | |
corvus | ianw: it's a run-time toggle, so can change at any time | 22:52 |
*** ruhe has joined #openstack-infra | 22:52 | |
ianw | yeah, a stamp-file in the directory might be a good interface | 22:53 |
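A minimal sketch of that stamp-file idea (the jobdir root, stamp name and age threshold are all assumptions, and this is not anything zuul-executor actually implements): a periodic sweep removes old build directories unless an operator has dropped a marker file in them.

    import shutil
    import time
    from pathlib import Path

    JOBDIR_ROOT = Path("/var/lib/zuul/builds")   # assumed location
    KEEP_STAMP = ".zuul-keep"                    # operator drops this to keep a dir
    MAX_AGE = 2 * 24 * 3600                      # two days, in seconds


    def sweep_stale_jobdirs(root=JOBDIR_ROOT, max_age=MAX_AGE):
        now = time.time()
        for jobdir in root.iterdir():
            if not jobdir.is_dir():
                continue
            if (jobdir / KEEP_STAMP).exists():
                # Explicitly kept by an admin (e.g. while debugging a job);
                # the periodic sweep must never touch these.
                continue
            if now - jobdir.stat().st_mtime > max_age:
                shutil.rmtree(jobdir, ignore_errors=True)


    if __name__ == "__main__":
        sweep_stale_jobdirs()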
*** Jeffrey4l has quit IRC | 22:53 | |
*** markvoelker has quit IRC | 22:54 | |
*** andreas_s has joined #openstack-infra | 22:54 | |
*** sree has joined #openstack-infra | 22:54 | |
*** bandini has quit IRC | 22:54 | |
*** edmondsw has quit IRC | 22:54 | |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add zuul mailing lists https://review.openstack.org/533006 | 22:54 |
ianw | it's probably fine to clear out on start, but at least in our case, with the slower remote storage for the dirs, having the slate almost clean when it starts would be helpful | 22:54 |
*** wolverineav has quit IRC | 22:55 | |
ianw | especially because you're usually restarting them in a pressure situation :) | 22:55 |
*** edmondsw has joined #openstack-infra | 22:55 | |
*** abelur_ has joined #openstack-infra | 22:55 | |
*** vsaienk0 has joined #openstack-infra | 22:55 | |
*** felipemonteiro_ has quit IRC | 22:55 | |
*** wolverineav has joined #openstack-infra | 22:55 | |
corvus | yep. i think both things are incremental improvements. | 22:55 |
ianw | i'll add it to my todo list :) | 22:55 |
*** Jeffrey4l has joined #openstack-infra | 22:56 | |
*** bandini has joined #openstack-infra | 22:56 | |
*** markvoelker has joined #openstack-infra | 22:56 | |
clarkb | ok back with new glasses. I can see again. Except now everything looks funny | 22:58 |
*** sree has quit IRC | 22:58 | |
clarkb | corvus: nl02 is the launcher for chocolate and has the new nodepool code installed. thoughts on restarting that one now? | 22:58 |
*** edmondsw has quit IRC | 22:59 | |
*** markvoelker has quit IRC | 22:59 | |
*** wolverineav has quit IRC | 22:59 | |
corvus | clarkb: i'm around for a bit longer and can help with probs, i say go for it | 23:00 |
*** andreas_s has quit IRC | 23:00 | |
clarkb | corvus: ok, restarting it now then | 23:00 |
*** markvoelker has joined #openstack-infra | 23:00 | |
clarkb | it is running again and appears to be acting normally (declining requests that are at quota, and just saw a node go ready) | 23:03 |
clarkb | corvus: Shrews what is the request handler behavior when all clouds are at quota? I seem to recall reading that code at some point but forget the behavior | 23:03 |
*** vsaienk0 has quit IRC | 23:04 | |
*** markvoelker has quit IRC | 23:05 | |
corvus | clarkb: every provider will grab one more request than it can handle and then block on that request, not accepting any more until it completes. | 23:05 |
*** sree has joined #openstack-infra | 23:05 | |
corvus | (strictly speaking, when a single provider is at quota, it grabs one more request and blocks. providers act independently, so if they are all at quota, they simply all do that) | 23:06 |
*** jascott1 has quit IRC | 23:06 | |
clarkb | corvus: Pausing request handling to satisfy request appears to be the logged message for that? | 23:07 |
*** jascott1 has joined #openstack-infra | 23:07 | |
corvus | clarkb: i believe so | 23:07 |
corvus | clarkb: once it gets enough available quota, it should 'unpause', finish that request, and as long as we are backlogged, grab another one and go right back to being paused. | 23:08 |
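A hedged pseudocode-style sketch of that pause/unpause loop; the provider and request objects are hypothetical stand-ins, not the real nodepool handler:

    import logging
    import time

    log = logging.getLogger("sketch")


    def run_provider_loop(provider, request_queue, poll_interval=1.0):
        paused_request = None
        while True:
            if paused_request is not None:
                # Paused: keep retrying only the request we are blocked on,
                # accepting nothing new in the meantime.
                if provider.available_quota() >= paused_request.node_count:
                    log.debug("Unpausing to satisfy %s", paused_request)
                    provider.launch(paused_request)
                    paused_request = None
                else:
                    time.sleep(poll_interval)
                continue

            request = request_queue.get()  # next outstanding node request
            if provider.available_quota() >= request.node_count:
                provider.launch(request)
            else:
                # Grab one request beyond quota and block on it; this is where
                # "Pausing request handling to satisfy request" gets logged.
                log.debug("Pausing request handling to satisfy %s", request)
                paused_request = request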
*** jascott1 has quit IRC | 23:09 | |
*** jascott1 has joined #openstack-infra | 23:09 | |
*** sree has quit IRC | 23:10 | |
clarkb | corvus: 2018-01-11 23:01:28,873 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl02-7970-PoolWorker.citycloud-la1-main]: Declining node request 100-0002001054 because it would exceed quota lots of messages like that which I would expect wouldn't happen if the handler was paused? | 23:11 |
*** jascott1 has quit IRC | 23:11 | |
openstackgerrit | Merged openstack-infra/system-config master: Add note on how to talk to zuul's gearman https://review.openstack.org/531522 | 23:11 |
*** jascott1 has joined #openstack-infra | 23:11 | |
*** jascott1 has quit IRC | 23:12 | |
*** jascott1 has joined #openstack-infra | 23:13 | |
*** jbadiapa has joined #openstack-infra | 23:14 | |
*** jtomasek has joined #openstack-infra | 23:15 | |
openstackgerrit | Merged openstack-infra/puppet-lodgeit master: Systemd: start lodgeit after network https://review.openstack.org/527729 | 23:16 |
corvus | clarkb: that should only happen if it's impossible for the provider to ever satisfy that (ie, a request for 10 nodes in a provider where our absolute limit is 5) | 23:16 |
clarkb | huh I wonder if someone is requesting 50 nodes for a job | 23:17 |
*** jascott1 has quit IRC | 23:17 | |
clarkb | (that could also explain why we are at quota so much) | 23:17 |
corvus | clarkb: the example you cited was 1 node | 23:18 |
clarkb | corvus: oh wait citycloud has a couple regions that are turned off | 23:19 |
clarkb | and la1 is one of them | 23:19 |
clarkb | ok that explains it | 23:19 |
corvus | looks like it was also declined by rh1-main | 23:20 |
corvus | er rh1 | 23:20 |
clarkb | corvus: have you seen anything re nl02 that would indicate we shouldn't restart nl01 as well? I haven't yet | 23:20 |
*** jtomasek has quit IRC | 23:21 | |
corvus | clarkb: nope, i say go. | 23:22 |
clarkb | done | 23:24 |
*** jascott1 has joined #openstack-infra | 23:29 | |
clarkb | Shrews: corvus related to the max-servers: 0 short circuit from earlier today we may want to avoid doing all the extra logging and pause the handler entirely (eg stop polling requests) | 23:30 |
*** jbadiapa has quit IRC | 23:32 | |
clarkb | corvus: do you want to review https://review.openstack.org/#/c/523951/ I think it is ready and will allow us to merge zuulv3 branches into master | 23:33 |
corvus | clarkb: when a provider is paused, it should stop polling for new requests | 23:36 |
corvus | clarkb: i'll take a look | 23:36 |
corvus | +3 and reviewed children as well | 23:40 |
clarkb | cool | 23:40 |
*** kgiusti has left #openstack-infra | 23:41 | |
*** smatzek has joined #openstack-infra | 23:43 | |
*** sree has joined #openstack-infra | 23:43 | |
openstackgerrit | Clark Boylan proposed openstack-infra/project-config master: Set infracloud chocolate to max-servers: 0 https://review.openstack.org/533012 | 23:45 |
clarkb | ^^ is not a rush, don't think I will have time to do controller00 patching reboots today but will try for tomorrow | 23:45 |
*** smatzek has quit IRC | 23:47 | |
*** sree has quit IRC | 23:47 | |
*** hongbin has quit IRC | 23:53 | |
*** armaan has quit IRC | 23:53 | |
clarkb | zxiiro: is https://review.openstack.org/#/c/194497/ still valid? | 23:54 |
clarkb | corvus: I think https://review.openstack.org/#/c/163637/ ended up being replaced by the zuulv3 spec? | 23:55 |
corvus | clarkb: yeah, i think it covered everything there except the ip whitelist. | 23:57 |
zxiiro | clarkb: not sure. I can bring it up on our meeting tomorrow though | 23:57 |
clarkb | zxiiro: thanks | 23:57 |