*** heyongli has quit IRC | 00:00 | |
*** tosky has quit IRC | 00:00 | |
*** heyongli has joined #openstack-infra | 00:00 | |
auristor | mirror.pypi is currently at 1.738TB | 00:00 |
---|---|---|
ianw | i think we have quite good taste, and the graphite queries are quite powerful. stupid things like bouncing because you go one byte over are mitigated by sensible queries like ensuring average values are too high for a long period | 00:00 |
clarkb | we need to rebuild pypi with the new blacklist in bandersnatch to see the true disk saving | 00:04 |
*** felipemonteiro has joined #openstack-infra | 00:05 | |
clarkb | right now we aren't adding to it quite as quickly but need to delete the older stuff we don't want | 00:05 |
ianw | anyway we are better tracking things now | 00:06 |
ianw | http://grafana02.openstack.org/d/lFKIH5Smk/afs?panelId=8&orgId=1&from=now-7d&to=now&tab=general | 00:06 |
*** felipemonteiro_ has quit IRC | 00:07 | |
corvus | we have a year's worth of that data in graphite | 00:08 |
corvus | er cacti | 00:08 |
corvus | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=3171&rra_id=all | 00:08 |
ianw | yeah but not volume data | 00:08 |
ianw | http://grafana02.openstack.org/d/lFKIH5Smk/afs?panelId=10&fullscreen&orgId=1&from=now-7d&to=now | 00:09 |
corvus | ianw: where's the volume data? | 00:09 |
auristor | I'm just seeing partition data. I was going to ask about volume | 00:10 |
ianw | corvus: where's it come from? | 00:10 |
*** heyongli has quit IRC | 00:10 | |
corvus | ianw: i feel like you're trying to show me a graph of volume data but i'm not seeing one | 00:10 |
ianw | this is the new tracking i've been working on | 00:10 |
*** heyongli has joined #openstack-infra | 00:10 | |
corvus | i see partitions here: https://screenshots.firefox.com/rsosIaN3G4csM5ax/grafana02.openstack.org | 00:10 |
corvus | and this is what i get on the second link: https://screenshots.firefox.com/lSBlYJytN4Tyc1ib/grafana02.openstack.org | 00:11 |
auristor | Now I see pypi growth | 00:11 |
ianw | oh, try now. it's the same problem as the alert i left, i forgot to click save | 00:11 |
corvus | better :) you might want to set the min-y to 0 | 00:11 |
ianw | i'm just fiddling with the graphs. obviously this needs to be put into grafyaml | 00:12 |
corvus | i think the volume graphs will be a nice addition :) | 00:13 |
ianw | what i'm hoping i can figure out, and i think grafana 5.x has some features to help with "holding" last values, is if i can see when the last vos release was done in some sane way | 00:13 |
ianw | it would be very handy to know "debian mirror hasn't released in the last 24 hours" | 00:13 |
*** r-daneel has quit IRC | 00:14 | |
corvus | ianw: grafana and graphite support events; if you use that, the graphs can have little annotations on them | 00:14 |
corvus | ianw: or you could report the last release time as a gauge with the unix timestamp as a value, and have a graph perform math on it | 00:15 |
clarkb | if we did want to replace the pypi contents, is deleting what is there and then rebuilding it reasonable, since we can wait to vos release, or will the delete take forever? | 00:16 |
clarkb | I guess another option is to use a new volume and switch to that then delete the old | 00:16 |
ianw | corvus: yeah, gauge idea was roughly what i was thinking, but i also like stamping it as events. i haven't looked but i'm not sure about getting the last release time post-fact from "vos list" type tools | 00:17 |
corvus | ianw: the docs release cron job does it | 00:18 |
auristor | if openstack.org was auristorfs then "vos splitvolume vol-name relative-path" would be the answer. | 00:19 |
corvus | ianw: oh, hrm, no that's last update it does | 00:19 |
ianw | sends a stat? i'd like to incorporate it into the server/partition/volume polling if possible for overall consistency | 00:19 |
corvus | ianw: no, i mean parsing vos examine | 00:19 |
*** heyongli has quit IRC | 00:20 | |
*** heyongli has joined #openstack-infra | 00:20 | |
ianw | listvol says "last update" | 00:21 |
corvus | ah, the last update time of the readonly volume is the last time it was released | 00:21 |
auristor | vos examine mirror.pypi.readonly and then take the Creation time. That is the time the .readonly was last created. | 00:21 |
corvus | even better :) | 00:21 |
auristor | The Last Update time is the most recent update of the RW before the release. | 00:22 |
corvus | i guess last update would be the last file change to the read-write volume before the release, then create... yeah that :) | 00:22 |
auristor | compare the Last Update time of the RW to the RO to determine if a release is required | 00:23 |
corvus | that's what the docs release script does here: https://git.openstack.org/cgit/openstack-infra/system-config/tree/modules/openstack_project/files/openafs/release-volumes.py#n1 | 00:23 |
ianw | auristor: in this case what we're mostly interested in is if the release happened. because if it didn't, that means the mirroring job failed | 00:24 |
ianw | which is the "real" problem | 00:24 |
ianw | i.e. reprepro corrupted itself again | 00:24 |
corvus | ianw: another approach would be to report how long ago the release happened as a gauge. you're doing the math ahead of time there, so that makes some things easier. | 00:25 |
ianw | anyway, good info, and step 1 for me is to grab the creation date anyway, as i'm not picking that up right now | 00:27 |
*** vtapia has quit IRC | 00:27 | |
ianw | then i think some gnarly graphite queries can probably collate things | 00:27 |
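
The two ideas in this thread (take the Creation time of the `.readonly` volume as the last release time, and compare Last Update on the RW and RO volumes to decide whether a release is pending) can be sketched roughly as below. This is only an illustration, not what afsmon or release-volumes.py actually does: the sample text is hand-written in the general shape of `vos examine` output, the gauge name is hypothetical, timestamps are assumed to be UTC, and the date parsing assumes an English/C locale.

```python
import re
from datetime import datetime, timezone

# Hand-written text in the general shape of `vos examine` output. The volume
# names come from the discussion; every number and date here is invented.
SAMPLE_RO = """\
mirror.pypi.readonly 536870932 RO 1800000000 K On-line
    Creation    Thu Jun  7 22:10:05 2018
    Last Update Thu Jun  7 21:58:40 2018
"""
SAMPLE_RW = """\
mirror.pypi 536870931 RW 1800000000 K On-line
    Creation    Mon Jan  4 10:00:00 2016
    Last Update Thu Jun  7 23:30:00 2018
"""

# Matches the two ctime-style fields we care about.
FIELD_RE = re.compile(
    r"^\s*(Creation|Last Update)\s+"
    r"(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s*$",
    re.M,
)


def parse_times(text):
    """Return {'Creation': datetime, 'Last Update': datetime} found in text."""
    times = {}
    for name, stamp in FIELD_RE.findall(text):
        parsed = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        # Assumption: treat the server's timestamps as UTC.
        times[name] = parsed.replace(tzinfo=timezone.utc)
    return times


def release_age_gauge(ro_text, now, name="afs.mirror_pypi.release_age"):
    """Format seconds since the RO volume was created as a statsd gauge line.

    The metric name is a placeholder, not one afsmon is known to emit.
    """
    age = int((now - parse_times(ro_text)["Creation"]).total_seconds())
    return f"{name}:{age}|g"


def needs_release(rw_text, ro_text):
    """True when the RW volume changed after the data held by the RO copy."""
    return parse_times(rw_text)["Last Update"] > parse_times(ro_text)["Last Update"]
```

Reporting the age rather than the raw timestamp does the math ahead of time, so a "hasn't released in the last 24 hours" check becomes a plain threshold on the gauge (age > 86400).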
*** heyongli has quit IRC | 00:30 | |
*** heyongli has joined #openstack-infra | 00:31 | |
*** rwsu has quit IRC | 00:32 | |
*** claudiub has quit IRC | 00:33 | |
*** yamamoto has joined #openstack-infra | 00:34 | |
*** claudiub has joined #openstack-infra | 00:35 | |
*** yamamoto has quit IRC | 00:39 | |
*** shardy has quit IRC | 00:41 | |
*** heyongli has quit IRC | 00:41 | |
*** heyongli has joined #openstack-infra | 00:41 | |
*** shardy has joined #openstack-infra | 00:41 | |
*** felipemonteiro_ has joined #openstack-infra | 00:42 | |
*** hongbin has joined #openstack-infra | 00:43 | |
*** felipemonteiro has quit IRC | 00:43 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add .gitreview https://review.openstack.org/573471 | 00:47 |
*** heyongli has quit IRC | 00:51 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add .gitreview https://review.openstack.org/573471 | 00:51 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add basic zuul jobs https://review.openstack.org/573472 | 00:51 |
*** heyongli has joined #openstack-infra | 00:51 | |
*** masayukig has quit IRC | 00:52 | |
*** rlandy|rover has quit IRC | 00:52 | |
*** masayukig has joined #openstack-infra | 00:53 | |
*** rfolco has quit IRC | 00:54 | |
auristor | ianw: for the volume disk usage graph it might be useful to plot the volume quota in addition to disk usage | 00:54 |
auristor | ianw: and partition free space | 00:54 |
*** rwsu has joined #openstack-infra | 00:56 | |
*** dhill_ has quit IRC | 00:56 | |
ianw | auristor: yep i have that. you can see what i'm sending in at http://graphite.openstack.org/ metrics->stats->gauges->afs | 00:57 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: scheduler: add job's parent name to the rpc job_list method https://review.openstack.org/573473 | 01:01 |
*** heyongli has quit IRC | 01:01 | |
*** heyongli has joined #openstack-infra | 01:01 | |
*** r-daneel has joined #openstack-infra | 01:04 | |
*** vtapia has joined #openstack-infra | 01:04 | |
*** gyee has quit IRC | 01:05 | |
ianw | auristor: it's slightly less dramatic when graphed to zero against the quota | 01:07 |
ianw | http://grafana02.openstack.org/d/lFKIH5Smk/afs?panelId=10&fullscreen&orgId=1&from=now-7d&to=now | 01:07 |
ianw | that's with an alert "average usage > 90% for more than 24 hours" set up on it | 01:07 |
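
The "average usage > 90% for more than 24 hours" rule, and the earlier point about not bouncing when a volume goes one byte over, can be approximated outside Grafana with a small check like the one below. It is a sketch under assumptions: the sample tuples `(unix_ts, used_kb, quota_kb)` and the function name are made up for illustration, and the real alert is evaluated by Grafana against graphite data, not by code like this.

```python
def sustained_over(samples, threshold_pct, window_s, now):
    """Return True if average usage over the trailing window exceeds the threshold.

    samples: iterable of (unix_ts, used_kb, quota_kb) tuples (hypothetical shape).
    Averaging over the window means one brief spike does not trip the alert.
    """
    recent = [
        (used, quota)
        for ts, used, quota in samples
        if now - window_s <= ts <= now and quota > 0
    ]
    if not recent:
        # No data in the window: say nothing rather than alert on a gap.
        return False
    average_pct = sum(100.0 * used / quota for used, quota in recent) / len(recent)
    return average_pct > threshold_pct
```

For example, 24 hourly samples at 85% with a single sample at 101% average to roughly 85.7%, so the check stays quiet, while 24 samples at 95% trip it.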
*** heyongli has quit IRC | 01:11 | |
*** heyongli has joined #openstack-infra | 01:12 | |
openstackgerrit | Merged openstack-infra/afsmon master: Add basic zuul jobs https://review.openstack.org/573472 | 01:18 |
openstackgerrit | Merged openstack-infra/afsmon master: Add .gitreview https://review.openstack.org/573471 | 01:18 |
*** yamahata has quit IRC | 01:18 | |
*** heyongli has quit IRC | 01:22 | |
*** heyongli has joined #openstack-infra | 01:22 | |
*** mriedem_afk is now known as mriedem | 01:23 | |
*** namnh has joined #openstack-infra | 01:25 | |
*** jesslampe has quit IRC | 01:31 | |
*** jesslampe has joined #openstack-infra | 01:31 | |
*** jesslampe has quit IRC | 01:31 | |
*** jesslampe has joined #openstack-infra | 01:32 | |
*** heyongli has quit IRC | 01:32 | |
*** jesslampe has quit IRC | 01:32 | |
*** heyongli has joined #openstack-infra | 01:32 | |
*** jesslampe has joined #openstack-infra | 01:32 | |
*** jesslampe has quit IRC | 01:33 | |
*** jesslampe has joined #openstack-infra | 01:34 | |
*** mriedem is now known as mriedem_afk | 01:37 | |
*** zhangfei has joined #openstack-infra | 01:42 | |
*** heyongli has quit IRC | 01:42 | |
*** heyongli has joined #openstack-infra | 01:42 | |
*** heyongli has quit IRC | 01:52 | |
*** heyongli has joined #openstack-infra | 01:53 | |
*** mriedem_afk is now known as mriedem | 01:54 | |
*** VW has joined #openstack-infra | 01:54 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add /{tenant}/job/{job_name} route https://review.openstack.org/550978 | 02:02 |
*** heyongli has quit IRC | 02:03 | |
*** heyongli has joined #openstack-infra | 02:03 | |
*** owalsh_ has joined #openstack-infra | 02:08 | |
*** VW has quit IRC | 02:09 | |
*** VW has joined #openstack-infra | 02:09 | |
*** mriedem has quit IRC | 02:10 | |
*** owalsh has quit IRC | 02:12 | |
*** heyongli has quit IRC | 02:13 | |
*** heyongli has joined #openstack-infra | 02:13 | |
*** VW has quit IRC | 02:14 | |
*** lifeless_ has joined #openstack-infra | 02:22 | |
*** heyongli has quit IRC | 02:23 | |
*** heyongli has joined #openstack-infra | 02:23 | |
*** lifeless has quit IRC | 02:23 | |
*** rh-jelabarre has quit IRC | 02:25 | |
*** hongbin has quit IRC | 02:26 | |
*** rh-jelabarre has joined #openstack-infra | 02:30 | |
*** heyongli has quit IRC | 02:33 | |
*** heyongli has joined #openstack-infra | 02:34 | |
*** psachin has joined #openstack-infra | 02:35 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Run pep8 https://review.openstack.org/573483 | 02:41 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add creation date, report RO volumes https://review.openstack.org/573484 | 02:41 |
*** heyongli has quit IRC | 02:44 | |
*** heyongli has joined #openstack-infra | 02:44 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Run pep8 https://review.openstack.org/573483 | 02:45 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add creation date, report RO volumes https://review.openstack.org/573484 | 02:45 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add /{tenant}/projects and /{tenant}/project/{project} routes https://review.openstack.org/550979 | 02:48 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add empty bindep.txt https://review.openstack.org/573486 | 02:49 |
*** jcoufal has joined #openstack-infra | 02:49 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add empty bindep.txt https://review.openstack.org/573486 | 02:50 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Run pep8 https://review.openstack.org/573483 | 02:50 |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add creation date, report RO volumes https://review.openstack.org/573484 | 02:50 |
*** jcoufal has quit IRC | 02:54 | |
*** heyongli has quit IRC | 02:54 | |
*** heyongli has joined #openstack-infra | 02:54 | |
*** rosmaita has quit IRC | 02:56 | |
openstackgerrit | Merged openstack-infra/afsmon master: Add empty bindep.txt https://review.openstack.org/573486 | 02:58 |
*** markvoelker has quit IRC | 03:00 | |
*** markvoelker has joined #openstack-infra | 03:02 | |
*** heyongli has quit IRC | 03:04 | |
*** heyongli has joined #openstack-infra | 03:04 | |
*** ramishra has joined #openstack-infra | 03:06 | |
*** markvoelker has quit IRC | 03:07 | |
*** markvoelker has joined #openstack-infra | 03:12 | |
*** heyongli has quit IRC | 03:14 | |
*** heyongli has joined #openstack-infra | 03:15 | |
openstackgerrit | Ian Wienand proposed openstack-infra/afsmon master: Add creation date, report RO volumes https://review.openstack.org/573484 | 03:16 |
openstackgerrit | Merged openstack-infra/afsmon master: Run pep8 https://review.openstack.org/573483 | 03:21 |
*** felipemonteiro has joined #openstack-infra | 03:24 | |
*** heyongli has quit IRC | 03:25 | |
*** heyongli has joined #openstack-infra | 03:25 | |
*** yamamoto has joined #openstack-infra | 03:26 | |
*** felipemonteiro_ has quit IRC | 03:26 | |
*** yamamoto has quit IRC | 03:26 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 03:27 |
*** Bhujay has joined #openstack-infra | 03:28 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 03:29 |
mnaser | i'm seeing some packet loss and high latency to zuul.o.o | 03:31 |
mnaser | anyone see the same? | 03:31 |
*** yamamoto has joined #openstack-infra | 03:31 | |
mnaser | http://paste.openstack.org/show/722942/ | 03:32 |
mnaser | mtr shows packet loss at zayo | 03:33 |
clarkb | my connection is also over zayo but not seeing loss (ipv6) | 03:33 |
mnaser | http://paste.openstack.org/show/722944/ | 03:34 |
mnaser | https://tranzact.zayo.com/#!/networkStatus | 03:34 |
*** heyongli has quit IRC | 03:35 | |
*** heyongli has joined #openstack-infra | 03:35 | |
clarkb | my path went through ord | 03:38 |
*** tpsilva has quit IRC | 03:41 | |
mnaser | i guess it is the nyc fiber issue causing congestion | 03:42 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: angular6 fix attempt https://review.openstack.org/573494 | 03:43 |
*** dave-mcc_ has quit IRC | 03:44 | |
*** heyongli has quit IRC | 03:45 | |
*** lpetrut has joined #openstack-infra | 03:45 | |
*** heyongli has joined #openstack-infra | 03:45 | |
*** udesale has joined #openstack-infra | 03:48 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 03:52 |
*** germs has joined #openstack-infra | 03:54 | |
*** germs has quit IRC | 03:54 | |
*** germs has joined #openstack-infra | 03:54 | |
*** heyongli has quit IRC | 03:55 | |
*** heyongli has joined #openstack-infra | 03:56 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 03:58 |
*** felipemonteiro has quit IRC | 04:03 | |
*** heyongli has quit IRC | 04:06 | |
*** heyongli has joined #openstack-infra | 04:06 | |
*** rh-jelabarre has quit IRC | 04:14 | |
*** heyongli has quit IRC | 04:16 | |
*** heyongli has joined #openstack-infra | 04:16 | |
*** germs has quit IRC | 04:22 | |
*** rh-jelabarre has joined #openstack-infra | 04:23 | |
*** lpetrut has quit IRC | 04:23 | |
*** heyongli has quit IRC | 04:26 | |
*** heyongli has joined #openstack-infra | 04:26 | |
openstackgerrit | Artem Goncharov proposed openstack-infra/zuul master: fill `delta`, `start`, `end` for skipped `creates` and `removes` command. https://review.openstack.org/567864 | 04:33 |
*** heyongli has quit IRC | 04:36 | |
*** heyongli has joined #openstack-infra | 04:37 | |
*** pgadiya has joined #openstack-infra | 04:37 | |
*** pgadiya has quit IRC | 04:37 | |
*** links has joined #openstack-infra | 04:41 | |
*** ianychoi has quit IRC | 04:44 | |
*** heyongli has quit IRC | 04:47 | |
*** heyongli has joined #openstack-infra | 04:47 | |
*** heyongli has quit IRC | 04:57 | |
*** heyongli has joined #openstack-infra | 04:57 | |
*** heyongli has quit IRC | 05:07 | |
*** heyongli has joined #openstack-infra | 05:07 | |
*** aeng has quit IRC | 05:10 | |
*** heyongli has quit IRC | 05:17 | |
*** heyongli has joined #openstack-infra | 05:18 | |
*** Bhujay has quit IRC | 05:19 | |
*** lifeless_ has quit IRC | 05:26 | |
*** lifeless has joined #openstack-infra | 05:26 | |
*** heyongli has quit IRC | 05:28 | |
*** heyongli has joined #openstack-infra | 05:28 | |
openstackgerrit | Ian Wienand proposed openstack-infra/grafyaml master: Add transparent https://review.openstack.org/573527 | 05:31 |
ianw | umm, i'm pretty sure the jobs launched per hour has dropped to zero | 05:32 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Refactor load sensors into drivers https://review.openstack.org/549275 | 05:33 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Add cgroup support to ram sensor https://review.openstack.org/549506 | 05:33 |
ianw | corvus: the type of thing it would be good to alert for ;) | 05:33 |
ianw | there's a lot of | 05:34 |
ianw | 2018-06-08 04:33:45,465 DEBUG zuul.RPCListener: Received job zuul:status_get | 05:34 |
ianw | 2018-06-08 04:33:45,559 DEBUG zuul.RPCListener: Received job zuul:tenant_sql_connection | 05:34 |
ianw | in the logs | 05:34 |
ianw | launchers seem to be running, but seem to have nothing to do | 05:37 |
*** heyongli has quit IRC | 05:38 | |
*** heyongli has joined #openstack-infra | 05:38 | |
ianw | i can not see any smoking guns in the scheduler logs | 05:39 |
ianw | the executors all seem up | 05:40 |
ianw | it's like it has stopped receiving events from gerrit | 05:41 |
amotoki | all finished jobs continue to remain in check/gate queues too | 05:41 |
*** gfidente has joined #openstack-infra | 05:42 | |
ianw | i wonder, with prior reports of packet drops, if we did somehow drop the connection? | 05:43 |
ianw | gerrit memory usage @11,461 Mb / 43,691 Mb | 05:44 |
ianw | so stable, cpu load is no different than usual | 05:44 |
ianw | i'm going to restart zuul-scheduler | 05:45 |
*** heyongli has quit IRC | 05:48 | |
*** heyongli has joined #openstack-infra | 05:48 | |
ianw | ok, i think we are starting to see things come in again | 05:53 |
ianw | ok, i have re-queued what was in the gate, but my strong suspicion here is that events from gerrit were not coming through | 05:55 |
ianw | hence i can't requeue what wasn't in the queue | 05:56 |
Tengu | erf | 05:56 |
*** iranzo has joined #openstack-infra | 05:57 | |
*** pcaruana has joined #openstack-infra | 05:57 | |
*** heyongli has quit IRC | 05:58 | |
*** heyongli has joined #openstack-infra | 05:59 | |
ianw | #status notice Zuul stopped receiving gerrit events around 04:00UTC; any changes submitted between then and now will probably require a "recheck" comment to be requeued. Thanks! | 05:59 |
openstackstatus | ianw: sending notice | 05:59 |
-openstackstatus- NOTICE: Zuul stopped receiving gerrit events around 04:00UTC; any changes submitted between then and now will probably require a "recheck" comment to be requeued. Thanks! | 06:01 | |
ianw | infra-root: ^ that's my best guess :/ with that suggestion log spelunking might show up something more useful. i couldn't see any obvious errors or exceptions | 06:02 |
openstackstatus | ianw: finished sending notice | 06:03 |
ianw | i'm afk for a while, will check later | 06:03 |
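
The failure described above (everything up, nothing erroring, but no events arriving) is the kind of thing a silence watchdog can catch. The sketch below is purely hypothetical: the class and method names are invented, and nothing here reflects how Zuul or its Gerrit connection is implemented. It only shows the idea of alerting on "no event seen for too long".

```python
import time


class EventWatchdog:
    """Track the time since the last event and flag long silences."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_event = clock()

    def record_event(self):
        """Call whenever an event (e.g. from an event stream) is received."""
        self._last_event = self._clock()

    def silence(self):
        """Seconds elapsed since the last recorded event."""
        return self._clock() - self._last_event

    def is_stale(self):
        """True when the silence exceeds the configured maximum."""
        return self.silence() > self.max_silence_s
```

The threshold would need tuning to the normal event rate; on a busy system even a few minutes of silence is suspicious, while on a quiet one it would produce false alarms.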
*** armaan has joined #openstack-infra | 06:04 | |
AJaeger | thanks, ianw ! | 06:05 |
ianw | corvus / auristor : in other news, the dashboard is looking about how i'd like it now -> http://grafana02.openstack.org/d/ACtl1JSmz/afs?orgId=1 | 06:06 |
openstackgerrit | Ian Wienand proposed openstack-infra/project-config master: Fix up AFS dashboard https://review.openstack.org/573537 | 06:07 |
*** e0ne has joined #openstack-infra | 06:09 | |
*** heyongli has quit IRC | 06:09 | |
*** heyongli has joined #openstack-infra | 06:09 | |
*** pcaruana has quit IRC | 06:14 | |
*** pcaruana has joined #openstack-infra | 06:15 | |
*** Bhujay has joined #openstack-infra | 06:16 | |
*** heyongli has quit IRC | 06:19 | |
*** heyongli has joined #openstack-infra | 06:19 | |
*** armaan has quit IRC | 06:20 | |
*** armaan has joined #openstack-infra | 06:20 | |
*** cshastri has joined #openstack-infra | 06:22 | |
*** lpetrut has joined #openstack-infra | 06:22 | |
*** jbadiapa has joined #openstack-infra | 06:22 | |
*** dhajare has quit IRC | 06:24 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/573592 | 06:27 |
*** heyongli has quit IRC | 06:29 | |
*** heyongli has joined #openstack-infra | 06:29 | |
*** dhajare has joined #openstack-infra | 06:39 | |
*** heyongli has quit IRC | 06:39 | |
*** heyongli has joined #openstack-infra | 06:40 | |
*** zoli has quit IRC | 06:41 | |
*** dulek has quit IRC | 06:41 | |
*** Bhujay has quit IRC | 06:43 | |
*** dklyle has quit IRC | 06:44 | |
*** e0ne has quit IRC | 06:44 | |
*** armaan has quit IRC | 06:44 | |
*** armaan has joined #openstack-infra | 06:45 | |
*** zoli has joined #openstack-infra | 06:48 | |
*** heyongli has quit IRC | 06:50 | |
*** heyongli has joined #openstack-infra | 06:50 | |
*** alexchadin has joined #openstack-infra | 06:56 | |
*** dhajare has quit IRC | 06:57 | |
*** jaosorior has quit IRC | 06:58 | |
*** caphrim007_ has quit IRC | 06:59 | |
*** caphrim007 has joined #openstack-infra | 07:00 | |
*** heyongli has quit IRC | 07:00 | |
*** heyongli has joined #openstack-infra | 07:00 | |
*** Bhujay has joined #openstack-infra | 07:02 | |
*** d0ugal has joined #openstack-infra | 07:02 | |
*** hashar has joined #openstack-infra | 07:03 | |
*** dhajare has joined #openstack-infra | 07:09 | |
*** heyongli has quit IRC | 07:10 | |
*** heyongli has joined #openstack-infra | 07:10 | |
*** rcernin has quit IRC | 07:13 | |
*** ramishra has quit IRC | 07:14 | |
*** diablo_rojo has joined #openstack-infra | 07:16 | |
*** slaweq has joined #openstack-infra | 07:17 | |
*** jesslampe has quit IRC | 07:17 | |
*** jesslampe has joined #openstack-infra | 07:18 | |
*** jesslampe has quit IRC | 07:18 | |
*** jesslampe has joined #openstack-infra | 07:18 | |
*** heyongli has quit IRC | 07:20 | |
*** heyongli has joined #openstack-infra | 07:21 | |
*** amoralej|off is now known as amoralej | 07:23 | |
*** jesslampe has quit IRC | 07:23 | |
*** jesslampe has joined #openstack-infra | 07:23 | |
*** jesslampe has quit IRC | 07:24 | |
*** jesslampe has joined #openstack-infra | 07:24 | |
*** jesslampe has quit IRC | 07:25 | |
*** jesslampe has joined #openstack-infra | 07:25 | |
*** jistr is now known as jistr|reloc | 07:25 | |
*** jesslampe has quit IRC | 07:25 | |
*** jesslampe has joined #openstack-infra | 07:26 | |
*** e0ne has joined #openstack-infra | 07:26 | |
*** jesslampe has quit IRC | 07:26 | |
*** jesslampe has joined #openstack-infra | 07:26 | |
*** jesslampe has quit IRC | 07:27 | |
*** jesslampe has joined #openstack-infra | 07:27 | |
*** jesslampe has quit IRC | 07:28 | |
openstackgerrit | Artem Goncharov proposed openstack-infra/project-config master: Add openstack-service-broker project https://review.openstack.org/573459 | 07:29 |
*** tosky has joined #openstack-infra | 07:30 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 07:30 |
*** heyongli has quit IRC | 07:31 | |
*** heyongli has joined #openstack-infra | 07:31 | |
*** hongbin has joined #openstack-infra | 07:32 | |
*** jbadiapa has quit IRC | 07:32 | |
openstackgerrit | Artem Goncharov proposed openstack-infra/nodepool master: Use openstacksdk instead of os-client-config https://review.openstack.org/566158 | 07:33 |
*** aojea has joined #openstack-infra | 07:33 | |
*** jbadiapa has joined #openstack-infra | 07:33 | |
*** hongbin has quit IRC | 07:35 | |
*** bauzas is now known as PapaOurs | 07:38 | |
*** lyarwood is now known as lyaaaaaaaarwood | 07:38 | |
*** annp has quit IRC | 07:39 | |
*** salv-orlando has joined #openstack-infra | 07:41 | |
*** heyongli has quit IRC | 07:41 | |
*** heyongli has joined #openstack-infra | 07:41 | |
*** jcoufal has joined #openstack-infra | 07:42 | |
*** ramishra has joined #openstack-infra | 07:44 | |
*** salv-orlando has quit IRC | 07:46 | |
*** amotoki has quit IRC | 07:47 | |
*** jcoufal has quit IRC | 07:47 | |
*** jesslampe has joined #openstack-infra | 07:47 | |
*** roman_g has joined #openstack-infra | 07:47 | |
*** amotoki has joined #openstack-infra | 07:48 | |
*** jesslampe has quit IRC | 07:48 | |
*** rcernin has joined #openstack-infra | 07:48 | |
*** heyongli has quit IRC | 07:51 | |
*** heyongli has joined #openstack-infra | 07:51 | |
*** rwsu has quit IRC | 07:52 | |
*** jpena|off is now known as jpena | 07:53 | |
hwoarang | ianw: new leap-150 build (478) is working fine! awesome! howeer, 477 is still being present on various nl0* so can you wipe that because every other build hits the problematic nodes i think | 07:54 |
hwoarang | *however | 07:54 |
*** dulek has joined #openstack-infra | 07:54 | |
*** Bhujay has quit IRC | 07:57 | |
*** shardy has quit IRC | 07:59 | |
*** shardy has joined #openstack-infra | 08:01 | |
*** heyongli has quit IRC | 08:01 | |
*** heyongli has joined #openstack-infra | 08:02 | |
*** Adri2000 has quit IRC | 08:03 | |
*** rwsu has joined #openstack-infra | 08:05 | |
*** jpich has joined #openstack-infra | 08:05 | |
*** dklyle has joined #openstack-infra | 08:06 | |
*** yamamoto has quit IRC | 08:08 | |
*** lifeless has quit IRC | 08:09 | |
*** lifeless has joined #openstack-infra | 08:10 | |
*** ramishra has quit IRC | 08:11 | |
*** Adri2000 has joined #openstack-infra | 08:11 | |
*** heyongli has quit IRC | 08:12 | |
*** heyongli has joined #openstack-infra | 08:12 | |
*** ramishra has joined #openstack-infra | 08:13 | |
*** shardy has quit IRC | 08:13 | |
*** annp has joined #openstack-infra | 08:14 | |
*** slaweq has quit IRC | 08:14 | |
*** slaweq has joined #openstack-infra | 08:14 | |
*** slaweq has quit IRC | 08:14 | |
*** slaweq has joined #openstack-infra | 08:15 | |
*** alexchadin has quit IRC | 08:16 | |
openstackgerrit | Merged openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/573592 | 08:17 |
*** jistr|reloc is now known as jistr | 08:19 | |
*** heyongli has quit IRC | 08:22 | |
*** yamamoto has joined #openstack-infra | 08:22 | |
*** heyongli has joined #openstack-infra | 08:22 | |
*** slaweq has quit IRC | 08:25 | |
*** slaweq has joined #openstack-infra | 08:25 | |
*** alexchadin has joined #openstack-infra | 08:27 | |
*** shardy has joined #openstack-infra | 08:31 | |
*** heyongli has quit IRC | 08:32 | |
*** heyongli has joined #openstack-infra | 08:33 | |
ianw | hwoarang: hmm, it shouldn't be picking that up if it's not the most recent | 08:33 |
ianw | unless some providers are out of sync | 08:33 |
hwoarang | ah i saw jobs queuing a lot on opensuse-150 so i thought they were just hitting a bad node. | 08:35 |
hwoarang | in the end, a node was found and the job worked fine. | 08:35 |
hwoarang | so let's say it's working then | 08:35 |
ianw | yeah, i deleted it anyway since it doesn't work and i was in there | 08:36 |
*** markvoelker has quit IRC | 08:37 | |
hwoarang | thank you | 08:37 |
*** markvoelker has joined #openstack-infra | 08:38 | |
*** derekh has joined #openstack-infra | 08:41 | |
*** salv-orlando has joined #openstack-infra | 08:42 | |
*** markvoelker has quit IRC | 08:42 | |
*** heyongli has quit IRC | 08:42 | |
*** heyongli has joined #openstack-infra | 08:43 | |
*** salv-orlando has quit IRC | 08:46 | |
*** heyongli has quit IRC | 08:53 | |
*** heyongli has joined #openstack-infra | 08:53 | |
*** d0ugal_ has joined #openstack-infra | 09:00 | |
*** d0ugal has quit IRC | 09:00 | |
*** d0ugal_ has quit IRC | 09:00 | |
*** d0ugal has joined #openstack-infra | 09:01 | |
*** lifeless_ has joined #openstack-infra | 09:02 | |
*** lifeless has quit IRC | 09:03 | |
*** heyongli has quit IRC | 09:03 | |
*** heyongli has joined #openstack-infra | 09:03 | |
openstackgerrit | Olivier Bourdon proposed openstack/diskimage-builder master: Fix CentOS image build failure when dib runs on debian based system https://review.openstack.org/559485 | 09:06 |
*** dtantsur|afk is now known as dtantsur | 09:09 | |
openstackgerrit | Masayuki Igawa proposed openstack/os-testr master: Deprecate ostestr command https://review.openstack.org/573636 | 09:12 |
*** heyongli has quit IRC | 09:13 | |
*** lifeless_ is now known as lifeless | 09:13 | |
*** heyongli has joined #openstack-infra | 09:13 | |
*** sambetts|afk is now known as sambetts | 09:17 | |
*** alexchadin has quit IRC | 09:17 | |
*** ramishra has quit IRC | 09:17 | |
ltomasbo | ping AJaeger: again gate issue for another patch https://review.openstack.org/#/c/564148 | 09:17 |
ltomasbo | AJaeger, I rebased it and I see on Zuul some jobs (not yet failing but) with the message: merger_failure | 09:18 |
*** ramishra has joined #openstack-infra | 09:19 | |
*** heyongli has quit IRC | 09:23 | |
*** heyongli has joined #openstack-infra | 09:24 | |
*** edmondsw has joined #openstack-infra | 09:26 | |
*** alexchadin has joined #openstack-infra | 09:28 | |
*** edmondsw has quit IRC | 09:30 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Don't use GRANT to create new MySQL users https://review.openstack.org/573641 | 09:30 |
*** lifeless_ has joined #openstack-infra | 09:30 | |
*** lifeless has quit IRC | 09:32 | |
*** owalsh_ is now known as owalsh | 09:33 | |
*** heyongli has quit IRC | 09:34 | |
*** heyongli has joined #openstack-infra | 09:34 | |
*** jaosorior has joined #openstack-infra | 09:36 | |
*** dtantsur is now known as dtantsur|brb | 09:38 | |
*** pbourke has quit IRC | 09:42 | |
*** pbourke has joined #openstack-infra | 09:42 | |
*** salv-orlando has joined #openstack-infra | 09:42 | |
*** heyongli has quit IRC | 09:44 | |
frickler | ltomasbo: infra-root: I'm seeing MERGER_FAILURE for lots of jobs, but haven't found a cause for it yet. might be related to the network issues mentioned earlier | 09:44 |
*** heyongli has joined #openstack-infra | 09:44 | |
ltomasbo | frickler, ahh ok, I was not aware! thanks! | 09:45 |
*** salv-orlando has quit IRC | 09:46 | |
jokke_ | abigerrit seems to be belly up as well | 09:47 |
jokke_ | -abi | 09:47 |
ianw | frickler / ltomasbo : that job went in around 06:30 | 09:51 |
ianw | 2018-06-08 06:30:54,012 DEBUG zuul.Pipeline.openstack.check: Scheduling merge for item <QueueItem 0x7fadc2296ef0 for <Change 0x7fadccebd2e8 564148,15> | 09:51 |
ianw | at about the same time in the merger | 09:51 |
ianw | 2018-06-08 06:26:23,180 ... git.exc.GitCommandError: Cmd('git') failed due to: exit code(128) | 09:52 |
ianw | cmdline: git fetch origin refs/pull/41155/head | 09:52 |
ianw | stderr: 'error: RPC failed; curl 18 transfer closed with outstanding read data remaining | 09:52 |
ianw | a different change, but clearly about that time there were some networking issues | 09:52 |
ianw | i think it's in the same realm of issues | 09:52 |
ltomasbo | ianw, so, should I recheck it? | 09:52 |
ianw | ltomasbo: i probably would. i don't have time to track that change to the exact merger that picked it up, but i would think it's likely the same sort of thing | 09:53 |
ltomasbo | ok | 09:53 |
ltomasbo | thanks! | 09:53 |
*** evrardjp has quit IRC | 09:53 | |
ianw | frickler: my examination of review.o.o & gerrit seemed to show it was roughly ok, but maybe check again? | 09:54 |
ianw | we don't have any email from rax about known networking issues at this point | 09:54 |
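
The log spelunking above (look for merger errors close to the time a change was scheduled) can be sketched as a small filter over timestamped log lines. The sample lines are the ones quoted in this log, including the elided `...`, which is left as-is; the function name and the choice of window are assumptions for illustration.

```python
import re
from datetime import datetime

# Log lines as quoted in the channel; the "..." elision is kept verbatim.
SAMPLE_LINES = [
    "2018-06-08 04:33:45,465 DEBUG zuul.RPCListener: Received job zuul:status_get",
    "2018-06-08 06:26:23,180 ... git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)",
    "2018-06-08 06:30:54,012 DEBUG zuul.Pipeline.openstack.check: Scheduling merge for item",
]

# Leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp as seen in the quoted lines.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")


def lines_in_window(lines, start, end, needle):
    """Return lines stamped within [start, end] that contain needle."""
    hits = []
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # continuation lines (tracebacks, stderr) carry no stamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if start <= stamp <= end and needle in line:
            hits.append(line)
    return hits
```

Calling `lines_in_window(SAMPLE_LINES, datetime(2018, 6, 8, 6, 20), datetime(2018, 6, 8, 6, 35), "GitCommandError")` picks out only the fetch failure, which is how the 06:26 error was tied to the 06:30 merge request as "the same realm of issues".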
*** heyongli has quit IRC | 09:54 | |
*** heyongli has joined #openstack-infra | 09:54 | |
*** evrardjp has joined #openstack-infra | 09:56 | |
*** lifeless has joined #openstack-infra | 09:58 | |
*** lifeless_ has quit IRC | 09:58 | |
*** evrardjp_ has joined #openstack-infra | 10:01 | |
*** evrardjp has quit IRC | 10:01 | |
*** namnh has quit IRC | 10:03 | |
*** alexchadin has quit IRC | 10:03 | |
*** heyongli has quit IRC | 10:04 | |
*** heyongli has joined #openstack-infra | 10:05 | |
*** evrardjp_ has quit IRC | 10:05 | |
*** alexchadin has joined #openstack-infra | 10:06 | |
*** evrardjp has joined #openstack-infra | 10:06 | |
*** annp has quit IRC | 10:09 | |
*** heyongli has quit IRC | 10:15 | |
*** heyongli has joined #openstack-infra | 10:15 | |
*** lifeless_ has joined #openstack-infra | 10:23 | |
*** lifeless has quit IRC | 10:24 | |
*** vivsoni_ has joined #openstack-infra | 10:24 | |
*** vivsoni has quit IRC | 10:24 | |
*** heyongli has quit IRC | 10:25 | |
*** heyongli has joined #openstack-infra | 10:25 | |
*** udesale_ has joined #openstack-infra | 10:27 | |
*** udesale__ has joined #openstack-infra | 10:28 | |
*** udesale has quit IRC | 10:29 | |
*** udesale_ has quit IRC | 10:31 | |
*** dtantsur|brb is now known as dtantsur | 10:31 | |
*** boden has joined #openstack-infra | 10:35 | |
*** heyongli has quit IRC | 10:35 | |
*** heyongli has joined #openstack-infra | 10:35 | |
*** rcernin has quit IRC | 10:37 | |
*** slaweq has quit IRC | 10:38 | |
*** slaweq_ has joined #openstack-infra | 10:38 | |
*** markvoelker has joined #openstack-infra | 10:38 | |
*** lifeless_ has quit IRC | 10:39 | |
*** stephenfin is now known as finucannot | 10:44 | |
*** heyongli has quit IRC | 10:45 | |
*** shardy has quit IRC | 10:46 | |
*** heyongli has joined #openstack-infra | 10:46 | |
*** quiquell has joined #openstack-infra | 10:47 | |
quiquell | Hello | 10:47 |
quiquell | I see some "MERGER_FAILURE" here https://review.openstack.org/#/c/572096/ | 10:47 |
quiquell | What does it mean? | 10:48 |
*** shardy has joined #openstack-infra | 10:48 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add --check-config option to zuul scheduler https://review.openstack.org/542160 | 10:55 |
*** lifeless has joined #openstack-infra | 10:55 | |
*** heyongli has quit IRC | 10:56 | |
*** heyongli has joined #openstack-infra | 10:56 | |
openstackgerrit | Merged openstack-dev/cookiecutter master: Fix quotes and undefined variable https://review.openstack.org/569755 | 10:57 |
Tengu | quiquell: have the same on one of my reviews: https://review.openstack.org/570913 | 11:01 |
Tengu | funky part: it's not shown in the table - have to dig in the mail/messages. | 11:02 |
Tengu | maybe it's a new bunch of tests? | 11:02 |
*** heyongli has quit IRC | 11:06 | |
*** heyongli has joined #openstack-infra | 11:06 | |
frickler | Tengu: quiquell: it seems there were some issues earlier, not sure yet what the exact reason was, but hopefully it will be better when you do a recheck | 11:07 |
*** elod has joined #openstack-infra | 11:08 | |
Tengu | frickler: ok :). | 11:08 |
Tengu | as it's not an urgent thing, I'll leave it as-is until my next refresh round. | 11:09 |
*** jpena is now known as jpena|lunch | 11:09 | |
*** vtapia has quit IRC | 11:11 | |
*** alexchadin has quit IRC | 11:11 | |
*** zoli is now known as zoli|afk | 11:13 | |
*** markvoelker has quit IRC | 11:13 | |
*** heyongli has quit IRC | 11:16 | |
*** heyongli has joined #openstack-infra | 11:16 | |
auristor | ianw: the dashboard is looking good | 11:19 |
*** shardy has quit IRC | 11:19 | |
*** dhajare has quit IRC | 11:19 | |
*** shardy has joined #openstack-infra | 11:20 | |
*** heyongli has quit IRC | 11:26 | |
*** heyongli has joined #openstack-infra | 11:27 | |
*** auristor has quit IRC | 11:27 | |
frickler | ianw: still seeing sporadic failures on jobs less than 2h old, I'll go through all the zm* now and check their logs | 11:28 |
*** udesale__ has quit IRC | 11:29 | |
*** Bhujay has joined #openstack-infra | 11:33 | |
*** auristor has joined #openstack-infra | 11:35 | |
*** ldnunes has joined #openstack-infra | 11:35 | |
*** heyongli has quit IRC | 11:37 | |
*** heyongli has joined #openstack-infra | 11:37 | |
*** vtapia has joined #openstack-infra | 11:37 | |
openstackgerrit | Merged openstack-dev/pbr master: Add leading 0 on alpha release in semver doc https://review.openstack.org/558181 | 11:37 |
ianw | frickler: hmm, example? | 11:38 |
ianw | ok 570913 ... | 11:39 |
ianw | 2018-06-08 07:18:05,671 DEBUG zuul.Merger: Processing ref refs/changes/13/570913/5 for project gerrit/openstack/tripleo-quickstart-extras / master uuid 99a6605fba0a4f89b955d54dda2157f0 | 11:39 |
*** jcoufal has joined #openstack-infra | 11:40 | |
ianw | 2018-06-08 07:18:06,190 DEBUG zuul.Repo: Checking out 90ab17980bc793b4920bc72f4d3a318442d3bd1b ... that's the right hash | 11:40 |
*** alexchadin has joined #openstack-infra | 11:42 | |
*** vtapia has quit IRC | 11:43 | |
ianw | 2018-06-08 07:18:06,896 INFO zuul.MergeClient: Merge <gear.Job 0x7fadcd3a50b8 handle: b'H:127.0.0.1:22420' name: merger:merge unique: e2309f8ba26f4a748da647a8f87e9dd0> complete, merged: True, updated: False, commit: 4bcef77e29658cf512f2ca4194641928573389f9 | 11:43 |
ianw | zuul gets response back | 11:43 |
*** dklyle has quit IRC | 11:46 | |
*** rosmaita has joined #openstack-infra | 11:46 | |
ianw | ok, here's the error on the executor | 11:47 |
ianw | 2018-06-08 11:43:47,985 ERROR zuul.AnsibleJob: [build: f197ad36a1904208894f04b10269cbf7] Retry 1: Fetch /var/lib/zuul/builds/f197ad36a1904208894f04b10269cbf7/work/src/git.openstack.org/openstack/airship-drydock origin None | 11:47 |
*** heyongli has quit IRC | 11:47 | |
*** heyongli has joined #openstack-infra | 11:47 | |
frickler | ianw: which executor is that? | 11:52 |
ianw | if you look for that "retry [1|2]" it's basically across all the executors | 11:52 |
ianw | no i tell a lie | 11:53 |
ianw | ze01 ze07 ze09 ze10 do *not* have that in the logs at all | 11:53 |
ianw | ze02 ze03 ze04 ze05 ze06 ze08 *do have* the retry messages | 11:54 |
ianw | so ... something different between the two sets? | 11:54 |
frickler | the ones I looked at all seem to have "ssh: Could not resolve hostname review.openstack.org: Temporary failure in name resolution" as reason | 11:55 |
ianw | but yeah, that fetch failure was from /var/lib ? | 11:56 |
*** dklyle has joined #openstack-infra | 11:56 | |
*** rfolco has joined #openstack-infra | 11:56 | |
frickler | hmm unbound seems dead. on ze02 at least: active (exited) since Fri 2018-06-08 06:39:21 UTC; 5h 17min ago | 11:56 |
*** heyongli has quit IRC | 11:57 | |
fungi | that timing seems suspiciously close to when system cronjobs fire | 11:57 |
*** heyongli has joined #openstack-infra | 11:57 | |
frickler | after a restart it seems to be working again | 11:58 |
frickler | that is what the journal says: http://paste.openstack.org/show/722969/ | 11:58 |
fungi | yeah, not running on ze03 either | 11:58 |
fungi | this may be the difference between the broken and working executors | 11:59 |
*** vtapia has joined #openstack-infra | 11:59 | |
ianw | frickler: good eye! | 11:59 |
ianw | not running on 3 5 8 6 4 | 12:00 |
ianw | it is running on 2 ... maybe we restarted it. but it matches exactly | 12:00 |
ianw | otherwise | 12:00 |
frickler | ianw: yeah, I just restarted it on 2, that matches completely | 12:00 |
ianw | that's what i call a smoking gun! :) | 12:01 |
*** dklyle has quit IRC | 12:02 | |
*** jpena|lunch is now known as jpena | 12:02 | |
fungi | looks like the unbound restarts were related to unattended upgrades | 12:02 |
fungi | looking at /var/log/dpkg.log | 12:03 |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 12:05 |
fungi | yeah, the problem executors mention unbound upgrades in the current /var/log/dpkg.log and the executors which were still working have no mention of unbound in their dpkg.log | 12:05 |
frickler | also it looks like systemd was too fast when restarting. look at the paste I posted, pid 1681 is the old unbound process. and it outputs its info block after the new process has been started. so the new one probably fails to bind to port 53 | 12:06 |
fungi | basically, the problem executors have the newer unbound 1.5.8-1ubuntu1.1 and the ones which remained working are still on older unbound 1.5.8-1ubuntu1 | 12:07 |
*** heyongli has quit IRC | 12:07 | |
fungi | so we should probably do controlled upgrades of unbound on the rest and make sure it remains running afterward | 12:08 |
fungi | starting it again manually if needed | 12:08 |
*** heyongli has joined #openstack-infra | 12:08 | |
frickler | yeah, that's a security update that got published tonight https://bugs.launchpad.net/ubuntu/+source/unbound/1.5.8-1ubuntu1.1 | 12:08 |
openstack | Launchpad bug 1 in Ubuntu Malaysia LoCo Team "Microsoft has a majority market share" [Critical,In progress] - Assigned to MFauzilkamil Zainuddin (apogee) | 12:08 |
fungi | otherwise they're just going to break like the others did when unattended-upgrades fires on them | 12:08 |
ianw | haha i think that bug parsing is wrong :) | 12:09 |
*** hemna_ has joined #openstack-infra | 12:09 | |
fungi | just slightly | 12:10 |
*** rosmaita has quit IRC | 12:11 | |
*** yamamoto has quit IRC | 12:11 | |
*** Bhujay has quit IRC | 12:11 | |
ianw | but yet strangely relevant | 12:12 |
tosky | uhm, are you talking about the reasons for all the MERGER_FAILURE errors that I'm seeing all around? | 12:12 |
frickler | tosky: indirectly, but yes | 12:12 |
* frickler goes to open a bug report for ubuntu | 12:12 | |
ianw | tosky: yes, i think frickler found the reason and should probably be cleared up soon | 12:12 |
*** nicolasbock has joined #openstack-infra | 12:13 | |
ianw | is someone doing the manual upgrades/restarts? | 12:13 |
*** mugsie_ is now known as mugsie | 12:14 | |
*** yamamoto has joined #openstack-infra | 12:15 | |
*** markvoelker has joined #openstack-infra | 12:16 | |
fungi | i'm not yet in a spot where i can, but can probably start on them in 30-60 minutes if nobody else beats me to it | 12:16 |
frickler | I'll run the restarts now | 12:16 |
*** yamamoto has quit IRC | 12:16 | |
*** psachin has quit IRC | 12:16 | |
*** dhill_ has joined #openstack-infra | 12:17 | |
*** heyongli has quit IRC | 12:18 | |
*** heyongli has joined #openstack-infra | 12:18 | |
fungi | thanks frickler! | 12:19 |
frickler | also confirmed that the update breaks unbound by running it manually on ze09. will update the other nodes now | 12:20 |
frickler | created https://bugs.launchpad.net/ubuntu/+source/unbound/+bug/1775833 in the meantime | 12:20 |
openstack | Launchpad bug 1775833 in unbound (Ubuntu) "unbound not running after automatic update" [Undecided,New] | 12:20 |
boden | hi clarkb corvus as per our chat yesterday on tox siblings, I added a job for lower constraints and also added neutron as required project in https://review.openstack.org/#/c/573429/ however based on my testing neutron master still isn’t getting installed http://logs.openstack.org/29/573429/1/check/vmware-tox-lower-constraints/0b6888a/job-output.txt.gz#_2018-06-07_20_34_48_060474 | 12:21 |
boden | any ideas? | 12:21 |
*** alexchadin has quit IRC | 12:21 | |
boden | sorry, the job updates are in https://review.openstack.org/#/c/573386/ | 12:22 |
*** armaan has quit IRC | 12:25 | |
*** yamamoto has joined #openstack-infra | 12:27 | |
ianw | ++ thanks frickler. i'm out, have a good day all! | 12:28 |
*** heyongli has quit IRC | 12:28 | |
*** yamamoto has quit IRC | 12:28 | |
*** heyongli has joined #openstack-infra | 12:28 | |
*** sthussey has joined #openstack-infra | 12:28 | |
frickler | ianw: thanks to you, have a nice and quiet weekend | 12:31 |
*** rlandy has joined #openstack-infra | 12:31 | |
*** rlandy is now known as rlandy|rover | 12:31 | |
*** jcoufal_ has joined #openstack-infra | 12:31 | |
*** alexchadin has joined #openstack-infra | 12:32 | |
smcginnis | Have things been restarted? Safe to recheck patches now? | 12:33 |
*** jcoufal has quit IRC | 12:34 | |
frickler | smcginnis: yes, we should be fine again now | 12:37 |
smcginnis | frickler: OK, thank you. | 12:37 |
*** heyongli has quit IRC | 12:38 | |
*** heyongli has joined #openstack-infra | 12:38 | |
*** armaan has joined #openstack-infra | 12:41 | |
*** armaan has quit IRC | 12:41 | |
*** armaan has joined #openstack-infra | 12:41 | |
*** armaan has quit IRC | 12:46 | |
*** heyongli has quit IRC | 12:48 | |
*** heyongli has joined #openstack-infra | 12:49 | |
frickler | infra-root: ze01-10 should be fine now. some zm* were already updated without issues, so it only seems to happen under load. not sure which other nodes might be affected, maybe someone can later update all of them just to be on the safe side | 12:49 |
*** alexchadin has quit IRC | 12:51 | |
fungi | frickler: thanks, looks like they're all upgraded to 1.5.8-1ubuntu1.1 now | 12:52 |
frickler | clarkb: corvus: I found this error in the zuul debug.log, but the node did indeed get held successfully. I debugged that node and deleted it though before noticing this message. http://paste.openstack.org/show/722971/ | 12:54 |
*** tpsilva has joined #openstack-infra | 12:54 | |
*** VW has joined #openstack-infra | 12:54 | |
*** zoli|afk is now known as zoli | 12:55 | |
*** zoli is now known as zoli|wfh | 12:55 | |
*** zoli|wfh is now known as zoli | 12:55 | |
*** dklyle has joined #openstack-infra | 12:58 | |
*** heyongli has quit IRC | 12:59 | |
*** myoung|off is now known as myoung | 12:59 | |
*** heyongli has joined #openstack-infra | 12:59 | |
tosky | frickler: is it safe to recheck then? | 13:01 |
*** edmondsw has joined #openstack-infra | 13:02 | |
*** amoralej is now known as amoralej|lunch | 13:02 | |
frickler | tosky: I pretty much hope so. we do have quite some backlog, though, so don't expect fast results | 13:04 |
tosky | sure | 13:04 |
* tosky rechecks | 13:04 | |
*** dbecker has joined #openstack-infra | 13:04 | |
tosky | frickler: is there some specific message/tag that I should add to recheck in order to track this? | 13:04 |
*** edmondsw has quit IRC | 13:06 | |
*** d0ugal_ has joined #openstack-infra | 13:08 | |
*** d0ugal has quit IRC | 13:08 | |
*** d0ugal_ has quit IRC | 13:08 | |
*** d0ugal has joined #openstack-infra | 13:08 | |
*** d0ugal has quit IRC | 13:08 | |
*** d0ugal has joined #openstack-infra | 13:08 | |
*** heyongli has quit IRC | 13:09 | |
*** heyongli has joined #openstack-infra | 13:09 | |
*** dhill_ has quit IRC | 13:12 | |
*** VW has quit IRC | 13:14 | |
*** VW has joined #openstack-infra | 13:14 | |
frickler | tosky: well you could mention either "MERGER_FAILURE" or the unbound bug I quoted above, but I don't think that that is necessary from an infra pov | 13:14 |
tosky | ack | 13:17 |
*** Goneri has joined #openstack-infra | 13:17 | |
tosky | now let's see | 13:17 |
*** caphrim007 has quit IRC | 13:18 | |
*** owalsh has quit IRC | 13:18 | |
*** pblaho has quit IRC | 13:19 | |
*** owalsh has joined #openstack-infra | 13:19 | |
*** VW has quit IRC | 13:19 | |
*** heyongli has quit IRC | 13:19 | |
*** caphrim007 has joined #openstack-infra | 13:19 | |
*** heyongli has joined #openstack-infra | 13:19 | |
*** efried is now known as fried_rice | 13:21 | |
*** psachin has joined #openstack-infra | 13:21 | |
*** eharney has joined #openstack-infra | 13:23 | |
*** caphrim007 has quit IRC | 13:25 | |
*** owalsh_ has joined #openstack-infra | 13:28 | |
*** jaosorior has quit IRC | 13:29 | |
*** yamamoto has joined #openstack-infra | 13:29 | |
*** dklyle has quit IRC | 13:29 | |
*** heyongli has quit IRC | 13:29 | |
*** heyongli has joined #openstack-infra | 13:30 | |
fungi | given that we run unbound on most of our servers, we probably want to keep an eye out for other similar breakage in random places. i've spot-checked some more important systems and all have it still running, so i agree this seems to have been isolated to the zuul executors | 13:30 |
*** yamamoto has quit IRC | 13:31 | |
*** owalsh has quit IRC | 13:31 | |
openstackgerrit | Javier Peña proposed openstack-infra/openstack-zuul-jobs master: Update version of CentOS OpenAFS packages to 1.6.22.3 https://review.openstack.org/573694 | 13:31 |
*** dklyle has joined #openstack-infra | 13:31 | |
*** mriedem has joined #openstack-infra | 13:32 | |
*** yamamoto has joined #openstack-infra | 13:32 | |
*** zhangfei has quit IRC | 13:34 | |
*** owalsh_ has quit IRC | 13:34 | |
*** ramishra has quit IRC | 13:38 | |
*** owalsh has joined #openstack-infra | 13:39 | |
*** heyongli has quit IRC | 13:40 | |
*** heyongli has joined #openstack-infra | 13:40 | |
fungi | #status notice A misapplied distro security package update caused many jobs to fail with a MERGER_FAILURE error between ~06:30-12:30 UTC; these can be safely rechecked now that the problem has been addressed | 13:45 |
openstackstatus | fungi: sending notice | 13:45 |
*** links has quit IRC | 13:46 | |
-openstackstatus- NOTICE: A misapplied distro security package update caused many jobs to fail with a MERGER_FAILURE error between ~06:30-12:30 UTC; these can be safely rechecked now that the problem has been addressed | 13:46 | |
fungi | #status log unbound was manually restarted on many zuul executors following the 1.5.8-1ubuntu1.1 security update, due to https://launchpad.net/bugs/1775833 | 13:46 |
openstack | Launchpad bug 1775833 in unbound (Ubuntu) "unbound not running after automatic update" [Undecided,New] | 13:46 |
*** zhangfei has joined #openstack-infra | 13:47 | |
openstackstatus | fungi: finished sending notice | 13:48 |
openstackstatus | fungi: finished logging | 13:48 |
fungi | thanks to giblet in #openstack-nova for pointing out that we hadn't logged that | 13:49 |
*** heyongli has quit IRC | 13:50 | |
*** slaweq_ is now known as slaweq | 13:50 | |
*** heyongli has joined #openstack-infra | 13:50 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Template credentials for Packet Host/Platform 9 https://review.openstack.org/573021 | 13:52 |
dhellmann | is the ever-growing event queue length in zuul a side-effect of those failures that openstackstatus just mentioned? | 13:53 |
*** amoralej|lunch is now known as amoralej | 13:54 | |
dhellmann | ah, and there's the reset | 13:54 |
fungi | that's simply how zuul breathes | 13:54 |
*** edmondsw has joined #openstack-infra | 13:54 | |
fungi | inhale events, exhale results | 13:54 |
dhellmann | yeah, I haven't seen that up to 600+ in a long time | 13:55 |
*** dklyle has quit IRC | 13:55 | |
dhellmann | somehow it always clears up as soon as I come to ask about it. I wonder if zuul is also watching for questions like that on irc? | 13:55 |
dhellmann | "uh oh, someone noticed, better get back to work!" | 13:56 |
fungi | but yeah, even before the ansible upgrade and the network connectivity incident and the broken unbound restarts from unattended-upgrades, we were maxed out on node capacity as of early yesterday | 13:56 |
dhellmann | ah | 13:56 |
fungi | i'm betting this is the rush of r2 milestone procrastinators ;) | 13:56 |
dhellmann | probably | 13:56 |
dhellmann | although I like my image of zuul taking a smoke break better | 13:56 |
fungi | granted, maintenance and unrelated outages aren't helping matters | 13:57 |
*** lpetrut has quit IRC | 13:58 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Upgrade from angularjs (v1) to angular (v6) https://review.openstack.org/551989 | 14:00 |
*** heyongli has quit IRC | 14:00 | |
*** heyongli has joined #openstack-infra | 14:01 | |
*** ianychoi has joined #openstack-infra | 14:04 | |
*** hongbin has joined #openstack-infra | 14:05 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Upgrade from angularjs (v1) to angular (v6) https://review.openstack.org/551989 | 14:06 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Upgrade from angularjs (v1) to angular (v6) https://review.openstack.org/551989 | 14:09 |
*** heyongli has quit IRC | 14:10 | |
*** heyongli has joined #openstack-infra | 14:11 | |
*** felipemonteiro has joined #openstack-infra | 14:12 | |
*** shardy has quit IRC | 14:18 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul master: Hide queue headers for empty queues when filtering https://review.openstack.org/572588 | 14:19 |
openstackgerrit | Thierry Carrez proposed openstack-infra/project-config master: Remove direct branching/tagging ACL for Chef https://review.openstack.org/573712 | 14:20 |
*** heyongli has quit IRC | 14:21 | |
*** shardy has joined #openstack-infra | 14:21 | |
frickler | oh, nice, the mails about the broken unattended upgrades also weren't sent out earlier because of broken dns ... at least they mention that a reboot might be required, which would also solve that issue | 14:21 |
*** heyongli has joined #openstack-infra | 14:21 | |
*** lpetrut has joined #openstack-infra | 14:22 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Document an example for deleting content from AFS https://review.openstack.org/572821 | 14:24 |
fungi | indeed, an interesting catch-22 when your broken upgrade also breaks your ability to notify the sysadmin that some action may be required | 14:24 |
*** jistr is now known as jistr|mtg | 14:28 | |
*** VW has joined #openstack-infra | 14:29 | |
*** alkhodos_ has quit IRC | 14:30 | |
*** alkhodos_ has joined #openstack-infra | 14:30 | |
*** heyongli has quit IRC | 14:31 | |
*** heyongli has joined #openstack-infra | 14:31 | |
*** rpioso|afk is now known as rpioso | 14:32 | |
*** quiquell is now known as quiquell|off | 14:32 | |
*** r-daneel has quit IRC | 14:37 | |
*** udesale has joined #openstack-infra | 14:38 | |
*** felipemonteiro_ has joined #openstack-infra | 14:39 | |
*** TheJulia is now known as needssleep | 14:40 | |
*** heyongli has quit IRC | 14:41 | |
*** heyongli has joined #openstack-infra | 14:41 | |
*** felipemonteiro has quit IRC | 14:43 | |
openstackgerrit | yolanda.robla proposed openstack/diskimage-builder master: Fix bootloader packages for rhel https://review.openstack.org/573726 | 14:46 |
openstackgerrit | Thierry Carrez proposed openstack-infra/project-config master: Fix ACLs for tripleo-ci and dib-utils https://review.openstack.org/573728 | 14:46 |
openstackgerrit | Olivier Bourdon proposed openstack/diskimage-builder master: Fix CentOS image build failure when dib runs on debian based system https://review.openstack.org/559485 | 14:51 |
*** neiloy has joined #openstack-infra | 14:51 | |
*** heyongli has quit IRC | 14:51 | |
*** heyongli has joined #openstack-infra | 14:52 | |
Shrews | ooh, did we restart nl03? i'm pleasantly surprised it's working since the multi-label stuff is merged | 14:52 |
* Shrews goes to inspect zk data | 14:52 | |
*** lpetrut has quit IRC | 14:54 | |
Shrews | "type": ["ubuntu-xenial"] | 14:54 |
Shrews | neat | 14:54 |
*** hashar is now known as hasharAway | 14:54 | |
*** jamesmcarthur has joined #openstack-infra | 14:57 | |
*** myoung is now known as myoung|biaf | 14:57 | |
mnaser | infra-root: is there a way to check if things have been healthy on the new vms we added? | 15:00 |
*** r-daneel has joined #openstack-infra | 15:01 | |
fungi | i think the answer is that it depends on how you expect to measure "healthy" | 15:01 |
fungi | we can look in logstash to see if any jobs which ran there succeeded | 15:02 |
*** heyongli has quit IRC | 15:02 | |
*** zhangfei has quit IRC | 15:02 | |
mnaser | fungi: i guess just knowing that the overall state of things is okay, no weird mirror issues or other unrelated things | 15:02 |
*** shardy has quit IRC | 15:02 | |
*** heyongli has joined #openstack-infra | 15:02 | |
Shrews | mnaser: which region & node label? | 15:03 |
mordred | Shrews: \o/ | 15:03 |
mnaser | Shrews: vexxhost-ca-ymq-1 i think would be the node_provider | 15:03 |
*** r-daneel_ has joined #openstack-infra | 15:05 | |
mnaser | http://logstash.openstack.org/#dashboard/file/logstash.json?query=node_provider%3Avexxhost-ca-ymq-1%20AND%20message%3A%5C%22%5ERUN%20END%20RESULT_NORMAL%5C%22 | 15:05 |
mnaser | it looks like things are ok | 15:05 |
Shrews | mnaser: well, i see that one actively doing things on nl03. no obvious errors in nodepool log | 15:05 |
*** armaan has joined #openstack-infra | 15:05 | |
*** r-daneel has quit IRC | 15:06 | |
*** r-daneel_ is now known as r-daneel | 15:06 | |
Shrews | mnaser: and 10 in-use nodes | 15:06 |
Shrews | centos-7 and ubuntu-xenial | 15:07 |
Shrews | looks ok? | 15:07 |
openstackgerrit | Mohammed Naser proposed openstack-infra/project-config master: Bump vexxhost to 25 servers https://review.openstack.org/573738 | 15:07 |
mnaser | Shrews: yep, going to slowly bump it more | 15:07 |
mnaser | ^ | 15:07 |
Shrews | i'm all for increasing capacity | 15:11 |
mnaser | maybe if fungi is around ^ | 15:11 |
clarkb | I'm sort of around at this point. fwiw I did keep an eye on vexxhost yesterday before I disappeared and it looked ok | 15:12 |
*** heyongli has quit IRC | 15:12 | |
fungi | i am around though trying to pay attention in the release team meeting | 15:12 |
fungi | but will review that | 15:12 |
corvus | we only seem to be using 750 nodes out of our ~1000 node capacity | 15:13 |
*** d0ugal has quit IRC | 15:14 | |
openstackgerrit | yolanda.robla proposed openstack/diskimage-builder master: Fix bootloader packages for rhel https://review.openstack.org/573726 | 15:14 |
fungi | indeed, what's going on there? | 15:15 |
*** heyongli has joined #openstack-infra | 15:15 | |
corvus | Shrews, fungi, clarkb, mordred, mnaser: ^ | 15:15 |
corvus | http://grafana.openstack.org/dashboard/db/zuul-status | 15:15 |
clarkb | possible the dynamic quota calculations are reflecting reality better than our max servers values? | 15:15 |
fungi | yeah, wonder if the dynamic quota stuff is causing that to be lower than however we set the max there? | 15:15 |
fungi | er, what clarkb said | 15:16 |
*** jistr|mtg is now known as jistr | 15:16 | |
corvus | we need a nodepool dashboard with the node graphs of each provider on it | 15:16 |
*** caphrim007 has joined #openstack-infra | 15:17 | |
mnaser | we have a dashboard for each but not for each provider in the same one | 15:17 |
mnaser | limestone seems to never go over 13 | 15:17 |
fungi | looks like it spiked up after the scheduler restart at 06:00 related to what was reported as a network connectivity problem, but quickly flattened out at around 750 nodes | 15:17 |
mnaser | oh | 15:17 |
mnaser | ovh has max of 159 | 15:17 |
mnaser | but only 3 available and 1 in use and 1 deleting | 15:17 |
mnaser | in bhs1 | 15:17 |
mnaser | gra1 has a max of 79 and only a few are being used | 15:18 |
mnaser | http://grafana.openstack.org/dashboard/db/nodepool-ovh | 15:18 |
fungi | maybe our old quota calculation bug in ovh has re-emerged | 15:18 |
corvus | yeah, that looks like the anomalous one | 15:18 |
mnaser | i'll defer to the roots who have access and can check actual quotas | 15:18 |
*** caphrim007 has quit IRC | 15:18 | |
*** caphrim007 has joined #openstack-infra | 15:19 | |
fungi | they seem to semi-regularly have trouble with whatever mechanism they use to sync quota utilization between regions to support their global quota implementation | 15:19 |
*** dtroyer has quit IRC | 15:19 | |
*** caphrim007 has quit IRC | 15:19 | |
*** dtroyer has joined #openstack-infra | 15:19 | |
*** dklyle has joined #openstack-infra | 15:20 | |
corvus | gra1 dropped off starting at 23:00... bhs1... hard to say; it's more complicated. | 15:20 |
*** bnemec is now known as beekneemech | 15:20 | |
Shrews | i count 56 ready+unlocked nodes | 15:20 |
corvus | both are served by nl04 | 15:20 |
openstackgerrit | Clark Boylan proposed openstack-infra/openstack-zuul-jobs master: Improve kata-runsh job https://review.openstack.org/573748 | 15:21 |
*** caphrim007 has joined #openstack-infra | 15:21 | |
corvus | did we, by any chance, restart all of the launchers except nl04? | 15:21 |
clarkb | nl03 was the only one I restarted | 15:21 |
corvus | TypeError: unhashable type: 'list' | 15:21 |
corvus | that's why i ask ^ | 15:22 |
*** heyongli has quit IRC | 15:22 | |
*** cshastri has quit IRC | 15:22 | |
corvus | weird. nl04 and nl01 have both been running since april 26 | 15:22 |
Shrews | corvus: all but nl03 have Apr something start times | 15:22 |
*** heyongli has joined #openstack-infra | 15:22 | |
*** ccamacho has quit IRC | 15:23 | |
*** eernst has joined #openstack-infra | 15:23 | |
corvus | okay there are similar errors in nl01's logs | 15:24 |
corvus | maybe it's related to the order in which they handle requests | 15:24 |
corvus | how about i restart the launcher on 04 | 15:24 |
mordred | ++ | 15:25 |
fungi | worth a try | 15:25 |
*** dave-mccowan has joined #openstack-infra | 15:25 | |
corvus | okay, it's handling requests now, which makes me realize that it probably should have been doing more work earlier. it's possible there's more to the failure than just the list change | 15:26 |
corvus | (like, i wonder if a thread died somewhere) | 15:27 |
corvus | we should also restart nl01 and nl02 | 15:27 |
corvus | but i'm about to be running late; can someone else handle that? | 15:28 |
*** dklyle has quit IRC | 15:28 | |
*** hjensas has joined #openstack-infra | 15:29 | |
Shrews | corvus: i'll do it | 15:30 |
corvus | Shrews: thx | 15:30 |
fungi | thanks corvus and Shrews! | 15:31 |
*** dklyle has joined #openstack-infra | 15:32 | |
Shrews | nl01 and nl02 launchers restarted | 15:32 |
*** dave-mccowan has quit IRC | 15:32 | |
*** iranzo has quit IRC | 15:32 | |
*** heyongli has quit IRC | 15:32 | |
*** heyongli has joined #openstack-infra | 15:33 | |
Shrews | citycloud seems to be rejecting requests because "images not available" | 15:34 |
Shrews | is that a known issue? | 15:34 |
clarkb | citycloud should have max-servers set to 0 | 15:34 |
clarkb | and we stopped pushing images to them | 15:34 |
clarkb | probably unknown issue that something is trying to talk to their apis still | 15:35 |
Shrews | ah | 15:35 |
Shrews | maybe we should move the max-servers check before the images check | 15:36 |
*** myoung|biaf is now known as myoung | 15:37 | |
*** dklyle has quit IRC | 15:39 | |
clarkb | seems reasonable | 15:40 |
openstackgerrit | Ian Y. Choi proposed openstack-infra/project-config master: [translation] doc generatepot jobs for 3 projects https://review.openstack.org/545377 | 15:40 |
*** e0ne has quit IRC | 15:41 | |
*** kamren has quit IRC | 15:41 | |
clarkb | there isn't a way to dump the periodic queue without restarting the zuul scheduler right? | 15:41 |
clarkb | just thinking out loud here that dumping last night's periodic jobs may be helpful for moving check/gate along | 15:42 |
clarkb | though periodic is low priority iirc so that may not have an effect | 15:42 |
AJaeger | infra-root, ianw, https://review.openstack.org/#/c/573728/ removes direct tagging from diskimage-builder (commit speaks only about dib-utils but they are shared) to use tagging via the releases repository by the release team. Is that ok? | 15:42 |
clarkb | AJaeger: no we've asked the release team to keep it outside of that as an infra project that getting releases out indirectly depends on | 15:42 |
clarkb | that said I don't know what dib-utils is | 15:43 |
*** heyongli has quit IRC | 15:43 | |
clarkb | definitely ^ is true for dib itself | 15:43 |
AJaeger | clarkb: then ttx needs to rework the change ;( | 15:43 |
*** heyongli has joined #openstack-infra | 15:43 | |
AJaeger | clarkb: dib-utils shares ACL with diskimage-builder. | 15:43 |
fungi | clarkb: the most dumping periodic will buy us is freeing up resources for post, but we don't generally reenqueue post on a scheduler restart either | 15:43 |
clarkb | dib-utils is still a tripleo repo according to governance | 15:44 |
clarkb | maybe it should be part of release if that is the case | 15:44 |
*** lbragstad is now known as elbragstad | 15:44 | |
boden | clarkb hi… did you see my ping earlier regarding https://review.openstack.org/#/c/573386 | 15:44 |
fungi | we could split the acls apart for those repos too | 15:44 |
AJaeger | fungi, that's what I proposed... | 15:44 |
fungi | oh, cool ;) | 15:45 |
AJaeger | (as review comment with my -1 now) | 15:45 |
*** robled has quit IRC | 15:46 | |
clarkb | boden: no | 15:46 |
*** robled has joined #openstack-infra | 15:47 | |
clarkb | AJaeger: fungi ok reading dib element readmes and release notes, dib-utils is a copy of dib-run-parts for use in places that want it without dib. dib itself vendors its own copy. I think that basically means infra probably doesn't care as dib itself is fine | 15:47 |
*** sambetts is now known as sambetts|afk | 15:47 | |
ttx | AJaeger: hah! | 15:47 |
*** kamren has joined #openstack-infra | 15:47 | |
clarkb | AJaeger: fungi: happy for it to be a tripleo project and be managed by release team, but we should update the acl I think | 15:47 |
boden | clarkb: based on our chat yesterday regarding siblings, I thought https://review.openstack.org/#/c/573386 would address the issue, but it appears neutron/master is still not installed.. see latest comment in 573386 | 15:47 |
clarkb | ttx: ^ fyi I think its ok to manage dib-utils | 15:48 |
openstackgerrit | kaka proposed openstack/diskimage-builder master: fix tox python3 overrides https://review.openstack.org/573759 | 15:48 |
fungi | fwiw, dib-utils looks basically dead activity-wise | 15:49 |
ttx | clarkb: we need a separate ACL file for dib-utils ? | 15:49 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fail quickly for disabled provider pools https://review.openstack.org/573762 | 15:49 |
clarkb | boden: we don't dynamically load config out of project config because it is a repo that contains secrets and stuff (it is trusted and must be reviewed before used) | 15:49 |
clarkb | boden: so that change will have to merge first | 15:49 |
boden | clarkb ah ok, I wasn’t aware | 15:49 |
fungi | or extremely stable... the only util in it is bin/dib-run-parts? | 15:49 |
fungi | maybe it should be considered part of dib, just a very stable part? | 15:49 |
*** robled has quit IRC | 15:50 | |
clarkb | ttx: ya I think we should have tripleo apply a tripleo acl to it | 15:50 |
clarkb | fungi: well dib has its own copy | 15:50 |
boden | clarkb perhaps I can ask you to peek at https://review.openstack.org/#/c/573386 when you get a min so we can help land it and I can verify… thanks | 15:50 |
ttx | clarkb: ok will fix | 15:50 |
fungi | ahh, so if dib doesn't use it then seems fine to stay under tripleo | 15:50 |
clarkb | fungi: I think what happened was rhel didn't have a run parts that would work for dib so dib made their own run parts, then later rhelians found usage for it outside of dib but didn't want to install dib for it so they made a copy in its own repo | 15:50 |
*** robled has joined #openstack-infra | 15:50 | |
clarkb | dib appears to have its own run parts implementation internally | 15:50 |
clarkb | diskimage_builder/lib/dib-run-parts specifically | 15:51 |
fungi | the test nodes graph shows we now have a glut of available nodes not getting assigned to jobs? | 15:52 |
fungi | basically everything built since the launcher restart | 15:52 |
clarkb | zuul schedule should tell us what is going on related to that since it makes the node requests iirc | 15:53 |
*** heyongli has quit IRC | 15:53 | |
fungi | rather, everything above and beyond what we were previously using. so our node capacity came back but our in use didn't increase | 15:53 |
*** heyongli has joined #openstack-infra | 15:53 | |
fungi | and looks like the starting builds graph indicates no executors have started builds for the past hour? | 15:54 |
Shrews | i'm seeing many in-use nodes | 15:54 |
clarkb | 2018-06-08 15:42:43,117 INFO zuul.nodepool: Node request <NodeRequest 200-0004363196 <NodeSet devstack-single-node OrderedDict([(('primary',), <Node 0004362489 ('primary',):ubuntu-xenial>)])OrderedDict()>> fulfilled was the last fulfilled request | 15:55 |
Shrews | 646 | 15:55 |
fungi | yeah, we have ~700 in use and ~300 available according to the graph | 15:55 |
clarkb | plenty of requests are being updated though | 15:55 |
openstackgerrit | Thierry Carrez proposed openstack-infra/project-config master: Fix ACLs for tripleo-ci and dib-utils https://review.openstack.org/573728 | 15:56 |
Shrews | 3075 requests in the nodepool queue | 15:56 |
clarkb | you know thinking about ^ the gearman priority may not affect node allocation? | 15:57 |
clarkb | because we make the node requests first in the scheduler then ask for an executor to run the job via gearman after | 15:57 |
clarkb | I think | 15:57 |
clarkb | It's early in the morning and I am pre-caffeine but I think gearman priority may be far less meaningful now? | 15:58 |
corvus | clarkb: pipeline priority determines node request priority | 15:59 |
corvus | clarkb: https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/model.py#n628 | 15:59 |
clarkb | ah | 16:00 |
clarkb | so it's still a layer removed but we attempt to respect it | 16:00 |
clarkb | since a high priority node request may take longer to boot than the next low priority one it isn't perfect but in general should be good | 16:00 |
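A minimal sketch of the point corvus makes about pipeline priority driving node request priority. It assumes the numeric prefix on a request id (the `200-` in `200-0004363196` above) encodes the pipeline precedence and that lower numbers are served first. Only the value 200 appears in this log; the values 100 and 300 and the helper names `request_id` and `service_order` are assumptions for illustration, not a copy of what zuul/model.py does.

```python
# Hypothetical mapping from pipeline precedence to node request priority.
# Only 200 is visible in the log above; 100 and 300 are assumed values.
PRIORITY_MAP = {"high": 100, "normal": 200, "low": 300}


def request_id(precedence, sequence):
    """Build an id shaped like '200-0004363196' (priority prefix, sequence)."""
    return "%03d-%010d" % (PRIORITY_MAP[precedence], sequence)


def service_order(request_ids):
    """Order requests by priority number first, then by age (sequence)."""
    def key(rid):
        priority, sequence = rid.split("-")
        return (int(priority), int(sequence))
    return sorted(request_ids, key=key)


requests = [
    request_id("low", 4363190),     # e.g. a low-precedence pipeline such as periodic
    request_id("normal", 4363196),  # the precedence seen in the log line above
    request_id("high", 4363200),    # submitted last, but sorts first
]
ordered = service_order(requests)
```

Under these assumptions an older low-priority request still sorts behind a newer high-priority one, which matches clarkb's caveat: ordering the requests says nothing about which node finishes booting first.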
openstackgerrit | Jeremy Stanley proposed openstack-infra/system-config master: Document an example for deleting content from AFS https://review.openstack.org/572821 | 16:01 |
*** heyongli has quit IRC | 16:03 | |
*** heyongli has joined #openstack-infra | 16:04 | |
corvus | Shrews, clarkb: nodepool list suggests that most of the ready nodes are recently built nodes from ovh | 16:04 |
*** jcoufal_ has quit IRC | 16:05 | |
corvus | i have to run now, sorry i can't help more | 16:05 |
clarkb | is it possible that we are at executor load limits? | 16:06 |
clarkb | that would prevent executors from taking new jobs and marking nodes in use right? | 16:06 |
corvus | clarkb: not even close | 16:06 |
corvus | clarkb: http://grafana.openstack.org/dashboard/db/zuul-status | 16:06 |
corvus | the executors are becoming more and more idle | 16:06 |
corvus | i would seriously consider reverting the nodepool changes | 16:07 |
corvus | this is not a great time to have the system stuck | 16:07 |
clarkb | ya, though I'm not caught up on what changes were made /me looks at git | 16:07 |
clarkb | ah the driver changes | 16:08 |
corvus | (there are currently only about 100 builds running) | 16:09 |
*** r-daneel has quit IRC | 16:09 | |
*** r-daneel_ has joined #openstack-infra | 16:09 | |
fungi | https://review.openstack.org/568704 Simplify driver API | 16:09 |
fungi | and i'm guessing the multi-label stuff depends on that too | 16:09 |
*** camunoz has joined #openstack-infra | 16:09 | |
clarkb | revert to this commit maybe 9a03c679e32c13c0401dccaa9d93479229147801 | 16:10 |
corvus | gotta run now. Shrews knows the story | 16:10 |
clarkb | it is still after when the nodepool launchers were last restarted but before the driver updates | 16:10 |
Shrews | well, ovh-gra1 doesn't seem to be able to launch nodes | 16:10 |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Remove DragonFlow tagging ACL https://review.openstack.org/573772 | 16:11 |
fungi | clarkb: 9e5df7325b863e8bc4718b32d0ff93ce27c5530b looks like the last state before the merges from 4 days ago | 16:11 |
clarkb | ah a couple ahead of the one I picked, that wfm | 16:11 |
corvus | Shrews: it launched 72 of them in the past 40 minutes. that's a red herring. | 16:11 |
*** r-daneel_ is now known as r-daneel | 16:11 | |
clarkb | I'm going to add nl01-4 to the emergency file so that we can do this without puppet fighting us | 16:12 |
fungi | sounds good, thanks | 16:12 |
*** neiloy has quit IRC | 16:12 | |
clarkb | that is done | 16:13 |
clarkb | ovh is on 04 right? why don't we start there since that will in theory free up available ready nodes for use | 16:13 |
*** heyongli has quit IRC | 16:13 | |
fungi | i agree that seems like a good place to start | 16:14 |
*** heyongli has joined #openstack-infra | 16:14 | |
clarkb | I'm going to checkout 9e5d and pip install then restart | 16:14 |
clarkb | on nl04 | 16:14 |
fungi | k | 16:14 |
clarkb | it helps to remember to use python3 not 2 | 16:15 |
mordred | infra-root: mails sent to Shrews and frickler by the system keep getting bounced due to google being a bad-actor on the internet - is there any fix we're aware of? or should we maybe take their emails out of the sysadmin list? | 16:16 |
*** armaan has quit IRC | 16:16 | |
openstackgerrit | Anita Kuno proposed openstack-infra/system-config master: Survey Documentation https://review.openstack.org/571536 | 16:16 |
clarkb | mordred: my fix for that was to get a fastmail account | 16:16 |
*** armaan has joined #openstack-infra | 16:16 | |
clarkb | nl04 is running 9e5df7325b863e8bc4718b32d0ff93ce27c5530b now | 16:16 |
clarkb | TypeError: unhashable type: 'list' is an issue | 16:17 |
clarkb | which is where we started this morning right? | 16:17 |
fungi | mordred: which system? | 16:18 |
clarkb | I think this may have been a nop | 16:18 |
clarkb | we went from one not working system to another not working system | 16:18 |
clarkb | do we understand the unhashable type problem? | 16:18 |
Shrews | clarkb: nope. haven't had time to look | 16:18 |
Shrews | clarkb: wait. | 16:19 |
Shrews | don't restart any more | 16:19 |
Shrews | we have a problem | 16:19 |
Shrews | we now have zk nodes with a list for node.type, but the revert doesn't know how to handle that | 16:20 |
clarkb | ok should I go ahead and just stop the daemon on nl04 then? | 16:20 |
Shrews | clarkb: yes | 16:20 |
clarkb | also I bet this was the original issue with me restarting nl03 | 16:20 |
clarkb | and only nl03 | 16:20 |
fungi | ohhhh | 16:20 |
clarkb | it "polluted" the zk data | 16:20 |
Shrews | clarkb: possibly | 16:21 |
clarkb | nl04 is not running a launcher anymore | 16:21 |
clarkb | any thoughts on our next step? we can debug the problem with master, or do a full system restart which should clear out the requests and bring up the launchers on the old code | 16:23 |
clarkb | by full I mean scheduler and launcher restarts | 16:23 |
fungi | is that sufficient or do we also have to do something extra to clear out zk? | 16:23 |
*** heyongli has quit IRC | 16:24 | |
Shrews | i think we'd have to clear out zk if we don't push forward with finding the problem in master | 16:24 |
clarkb | my understanding is they are all connection dependent nodes so once the connection timeout occurs they are deleted. We may have to wait >1 minute to have it delete them though | 16:24 |
*** heyongli has joined #openstack-infra | 16:24 | |
clarkb | we can also manually clear out the db | 16:24 |
Shrews | can someone confirm if *any* requests are being handled now? | 16:25 |
clarkb | 15:42:43,117 remains the last fulfilled request according to the scheduler | 16:25 |
*** fried_rice is now known as fried_rolls | 16:29 | |
*** lpetrut has joined #openstack-infra | 16:29 | |
clarkb | zuul appears to really only be processing external events like gerrit and zuul web rpc | 16:30 |
clarkb | so ya I don't think there are any requests being handled | 16:30 |
*** jesslampe has joined #openstack-infra | 16:30 | |
clarkb | lots of retrying node requests in the launchers but not seeing tracebacks to tell me why | 16:31 |
openstackgerrit | Oliver Walsh proposed openstack/diskimage-builder master: WIP: Add DIB element to install NVIDIA GPU drivers https://review.openstack.org/573223 | 16:31 |
clarkb | ah that comes from the self.paused handling | 16:32 |
frickler | mordred: can you show a copy of the bounce? I'm regularly seeing bounces for some other accounts http://paste.openstack.org/show/722989/ | 16:32 |
Shrews | so yeah, the "unhashable type" issue was because of zk pollution | 16:33 |
anteaya | is it worth sending out a status notification about the current situation yet? | 16:33 |
anteaya | just to pre-empt any questions? | 16:34 |
*** heyongli has quit IRC | 16:34 | |
clarkb | 2018-06-08 15:08:24,139 INFO nodepool.NodeLauncher-0004362219: Node id 0004362219 is ready is the last case of a node going ready on nl01 | 16:34 |
fungi | frickler: the bounces from the zk servers are because those still haven't gotten their hostnames corrected so they're helo'ing with unresolvable hostnames | 16:34 |
*** heyongli has joined #openstack-infra | 16:34 | |
*** jesslampe has quit IRC | 16:34 | |
clarkb | anteaya: probably? things have been on fire for like 9 hours or something | 16:35 |
anteaya | shows you how behind I am then | 16:35 |
fungi | at this point it's just a matter of builds queuing and not starting, i think, so depending on how we correct it there may be no visible impact other than just more delayed results | 16:36 |
clarkb | Shrews: I don't see any tracebacks in the log after 1400ish UTC either | 16:36 |
clarkb | Shrews: my hunch is that our openstack driver isn't really being used and we're just calling the launch on the base handler that passes | 16:36 |
clarkb | or something similar to that based on lack of openstack specific logging (either successes or failures) | 16:36 |
anteaya | I don't see anything I can do that is helpful, I have to run away and complete an errand, may the force be with you | 16:38 |
*** caphrim007_ has joined #openstack-infra | 16:39 | |
clarkb | I too have an 11am doctors visit with the kids so I'm on a bit of a time crunch. But not up against it just yet | 16:39 |
Shrews | ok, i hate to say it, but the action i'm thinking we may need to do now is shutdown all of zuul, remove all nodes from zk, restart the launchers in reverted state (they should then begin cleaning up leaked instances and building new nodes), then start zuul | 16:39 |
Shrews | unless someone else has a better idea? | 16:39 |
fungi | no, that sounds reasonable at this stage | 16:40 |
clarkb | I don't think all of zuul has to be restarted, just the scheduler | 16:40 |
clarkb | maybe web if it doesn't like the scheduler going away | 16:40 |
Shrews | i mean, i could spend time trying to fix master, but i have no idea what the problem is | 16:40 |
Shrews | i think just the scheduler since it holds the node locks in zk | 16:40 |
clarkb | fwiw I don't see the driver load error messages in the log file | 16:40 |
clarkb | * launcher log file | 16:41 |
Shrews | what driver load error? | 16:41 |
*** Guest14735 is now known as sdake | 16:41 | |
*** caphrim007 has quit IRC | 16:41 | |
clarkb | no implementation found | 16:41 |
Shrews | where is that? | 16:42 |
clarkb | nodepool/driver/__init__.py | 16:42 |
Shrews | i mean what log? | 16:42 |
clarkb | but I don't see that in the log so no positive verification that is the issue. launcher debug log | 16:42 |
Shrews | wait, rewind please.... did you see this driver error in a log? | 16:43 |
clarkb | no | 16:43 |
fungi | speculation based on lack of any logged error? | 16:43 |
*** caphrim007 has joined #openstack-infra | 16:43 | |
clarkb | I was just following up on my theory above that we aren't making any openstack related calls since there are no positive or negative openstack log messages | 16:43 |
Shrews | someone want to stop zuul scheduler? | 16:44 |
Shrews | (assuming we are agreed this is the plan?) | 16:44 |
*** heyongli has quit IRC | 16:44 | |
clarkb | I'm fine with proceeding with that plan. I can stop the scheduler and dump its queues | 16:44 |
*** heyongli has joined #openstack-infra | 16:44 | |
*** panda is now known as panda|off | 16:45 | |
fungi | you got into place faster than i | 16:45 |
clarkb | oh I'm not into place :) | 16:45 |
fungi | are we still good with 9e5df7325b863e8bc4718b32d0ff93ce27c5530b as the rollback for the launchers? | 16:45 |
Shrews | fungi: yes | 16:45 |
clarkb | before we stop the scheduler why don't we prep all of the launchers with ^ | 16:45 |
fungi | do we also need to roll back the builders? | 16:45 |
Shrews | fungi: builders should be fine | 16:46 |
*** jpich has quit IRC | 16:46 | |
clarkb | and don't forget to use pip3 install instead of pip install | 16:46 |
fungi | working on nl01 now | 16:46 |
Shrews | i'm going to prep the zk cleanup | 16:46 |
clarkb | I'll do 03 | 16:46 |
*** caphrim007_ has quit IRC | 16:47 | |
fungi | Successfully uninstalled nodepool-3.0.2.dev51 | 16:47 |
fungi | Successfully installed nodepool-3.0.2.dev29 | 16:47 |
fungi | presumably that's what we want to see everywhere | 16:47 |
fungi | i'll work on 02 next | 16:48 |
clarkb | ya that looks like what I've got on 04 and 03 | 16:48 |
Shrews | are all launchers stopped? | 16:48 |
clarkb | Shrews: no | 16:48 |
fungi | not yet | 16:48 |
clarkb | and scheduler is still running too | 16:48 |
fungi | okay, nl02 is upgraded now too | 16:48 |
Shrews | let's do that too | 16:48 |
fungi | so the idea is that we stop the launchers next? | 16:48 |
clarkb | yes I think launchers, then scheduler, then zk cleanup, then start scheduler and launchers again | 16:49 |
Shrews | yes. i don't want anything accessing zk | 16:49 |
fungi | i'll do 01 and 02 | 16:49 |
clarkb | I'll make sure 03 and 04 are stopped then do scheduler and confirm they are all done when done | 16:49 |
Shrews | ack | 16:49 |
fungi | 01 and 02 are stopped now | 16:49 |
clarkb | 03 and 04 are stopped, moving to the scheduler | 16:50 |
mordred | frickler, fungi : http://paste.openstack.org/show/722990/ | 16:50 |
clarkb | I have asked the scheduler to stop and am waiting for it to do so | 16:50 |
clarkb | the builders use zk too, do you want them stopped or you'll only worry about the noderequests? | 16:51 |
clarkb | Shrews: ^ | 16:51 |
fungi | mordred: frickler: we could probably "fix" that by whitelisting those server ip addresses in the pbl | 16:51 |
Shrews | clarkb: just things that access nodes | 16:52 |
Shrews | so builders are fine to leave alone | 16:52 |
clarkb | zuul scheduler is refusing to stop quickly again. I'll kill it manually in a minute or two if it doesn't stop on its own | 16:52 |
*** jtomasek has quit IRC | 16:52 | |
*** jpena is now known as jpena|off | 16:53 | |
clarkb | ok I'm going to manually kill it | 16:53 |
Shrews | ok | 16:53 |
clarkb | scheduler is stopped | 16:53 |
clarkb | Shrews: I think you are good from a zk perspective | 16:54 |
*** felipemonteiro_ has quit IRC | 16:54 | |
*** heyongli has quit IRC | 16:54 | |
*** heyongli has joined #openstack-infra | 16:55 | |
Shrews | clarkb: are you certain we shouldn't stop the executors? once i restart the launchers, they'll attempt to delete the leaked instances which could be running jobs | 16:55 |
Shrews | that may make the executors go haywire | 16:55 |
clarkb | I think stopping the scheduler already effectively does/did that | 16:56 |
clarkb | we can stop them too if we want, I don't think it will hurt | 16:56 |
Shrews | ok, then my plan is to 1) remove zk data in /nodepool/nodes/* 2) start nl04 launcher and verify it's doing what i'd expect 3) restart the other launchers if #2 is ok | 16:57 |
Shrews | 4) we can restart scheduler | 16:57 |
clarkb | ok, when should we start the scheduler? perfect | 16:57 |
mordred | Shrews: ++ | 16:57 |
fungi | sounds great | 16:57 |
Shrews | doing #1 now | 16:57 |
Shrews | done. starting nl04 launcher | 16:58 |
*** lpetrut has quit IRC | 16:59 | |
clarkb | looks like it marked a bunch of stuff for deletion as expected | 16:59 |
Shrews | yes. | 16:59 |
Shrews | new instances launching too | 16:59 |
*** derekh has quit IRC | 17:00 | |
Shrews | actually, no new instances | 17:00 |
*** r-daneel has quit IRC | 17:00 | |
clarkb | the only new instances should be for min ready but we manage that on nl01 only | 17:01 |
clarkb | I think this is expected. maybe do nl01 next and ensure min ready launch? | 17:01 |
Shrews | oh we have no min-ready | 17:01 |
Shrews | cool | 17:01 |
clarkb | ya we made that change so that we'd stop doing min ready * number of launchers | 17:01 |
Shrews | ok, ovh looks to be cleaned up | 17:01 |
Shrews | going to do nl01 now | 17:01 |
*** pcaruana has quit IRC | 17:02 | |
clarkb | fungi: since I may have to pop out before we are done. ~root/check.sh and ~root/gate.sh on zuul01 are our saved check and gate queues. You'll want to ensure the scheduler is fully started before enqueuing otherwise it just errors at you a lot. Also zuul-web may need a restart after zuul-scheduler is up again | 17:02 |
fungi | noted, thanks! | 17:03 |
*** udesale has quit IRC | 17:03 | |
*** hemna_ has quit IRC | 17:03 | |
Shrews | deleting leaked instances. i see new ones building | 17:03 |
*** zoli is now known as zoli|gne | 17:04 | |
*** zoli|gne is now known as zoli|gone | 17:04 | |
Shrews | going to let that cleanup finish before doing nl02 | 17:04 |
*** zoli|gone is now known as zoli | 17:04 | |
clarkb | ok | 17:04 |
clarkb | (I have about 5 minutes left before I need to look at getting kids out the door for an on time doctors visit) | 17:04 |
*** heyongli has quit IRC | 17:05 | |
clarkb | but I think it's mostly mechanical now and a matter of just watching things as we start up | 17:05 |
*** heyongli has joined #openstack-infra | 17:05 | |
fungi | i'm all queued up to work on the scheduler startup and requeuing | 17:05 |
fungi | er, cued up | 17:05 |
Shrews | fungi: thx | 17:06 |
Shrews | doing nl02 now | 17:06 |
Shrews | nl03 now | 17:08 |
*** Swami has joined #openstack-infra | 17:08 | |
Shrews | fungi: i think we're about ready. let nl03 clean up resources a bit.... | 17:09 |
fungi | k | 17:09 |
fungi | standing by | 17:10 |
Shrews | whole lot of inap instances to clear out | 17:10 |
clarkb | alright I'm just about out of time. the scheduler startup will probably take around 5 minutes iirc | 17:11 |
clarkb | it's a pretty clear change in the logging content: it goes from doing a ton of git operations to build a config to listening to events and stuff | 17:11 |
fungi | thanks! | 17:12 |
Shrews | fungi: good to start scheduler | 17:13 |
fungi | on it | 17:13 |
*** dhill_ has joined #openstack-infra | 17:13 | |
Shrews | jobs may be delayed by node building | 17:13 |
fungi | tailing /var/log/zuul/debug.log to catch when it's safe to reenqueue check/gate builds | 17:14 |
*** heyongli has quit IRC | 17:15 | |
*** heyongli has joined #openstack-infra | 17:15 | |
Shrews | stuff is happening | 17:17 |
fungi | so we think events from ~16:50 to ~17:15 will have been missed, needing rechecks/reapprovals? | 17:17 |
fungi | basically from when clarkb stopped the scheduler to when i started it? | 17:17 |
fungi | also looks like it's finished starting up so i'll get to reenqueuing now | 17:17 |
Shrews | i believe so | 17:18 |
Shrews | and let's wait until next week to make any more changes :) | 17:18 |
*** eharney has quit IRC | 17:19 | |
fungi | i'll restart zuul-web too. i'm not getting any content on the status page | 17:20 |
smcginnis | Was waiting to point that out if it didn't come up. :) | 17:20 |
smcginnis | There it is. | 17:20 |
fungi | yeah, i just wanted to give it a few | 17:21 |
*** iranzo has joined #openstack-infra | 17:21 | |
fungi | there it goes | 17:21 |
smcginnis | All the pretty status bars. | 17:21 |
fungi | apparently does still need zuul-web restarted after a scheduler restart | 17:21 |
fungi | changes are still being enqueued, but i'll go ahead and send a status notice | 17:21 |
Shrews | fungi: thx | 17:22 |
smcginnis | Did we end up needing to manually recheck patches, or did things get captured OK for reenqueue? | 17:22 |
fungi | #status notice The Zuul scheduler was offline briefly to clean up from debugging a nodepool issue, so changes uploaded or approved between 16:50 and 17:15 UTC may need to be rechecked or reapproved (all already queued changes are in the process of being reenqueued now) | 17:22 |
openstackstatus | fungi: sending notice | 17:22 |
smcginnis | Hah, guess that answers it. Thanks. | 17:22 |
Shrews | fungi: infra-root: let's leave the launchers in the emergency file for now | 17:23 |
fungi | Shrews: yes, let's | 17:23 |
*** psachin has quit IRC | 17:23 | |
Shrews | i've grabbed the nl04 debug log file, so will need to spend some time digging through that | 17:23 |
-openstackstatus- NOTICE: The Zuul scheduler was offline briefly to clean up from debugging a nodepool issue, so changes uploaded or approved between 16:50 and 17:15 UTC may need to be rechecked or reapproved (all already queued changes are in the process of being reenqueued now) | 17:24 | |
*** jamesmcarthur has quit IRC | 17:25 | |
*** heyongli has quit IRC | 17:25 | |
*** yamahata has joined #openstack-infra | 17:25 | |
*** heyongli has joined #openstack-infra | 17:25 | |
openstackstatus | fungi: finished sending notice | 17:26 |
Shrews | fungi: agree that things look normal'ish enough for me to step out for lunch for a bit? | 17:26 |
openstackgerrit | Sean McGinnis proposed openstack-infra/project-config master: Remove DragonFlow tagging ACL https://review.openstack.org/573772 | 17:26 |
*** electrofelix has quit IRC | 17:27 | |
fungi | Shrews: i believe so, yes | 17:27 |
fungi | i'm about to go cook something up myself | 17:27 |
Shrews | http://grafana.openstack.org/dashboard/db/nodepool seems sane | 17:28 |
*** dtantsur is now known as dtantsur|afk | 17:30 | |
fungi | yeah, looks like lots of jobs are running and quite a few have already succeeded | 17:30 |
fungi | reenqueue is still underway | 17:31 |
clarkb | Shrews: as a datapoint before I forget nl03 was restarted yesterday at about 3pm pacific ish and was successfully launching nodes | 17:32 |
clarkb | I don't know when all the new stuff merged but possible it's not a 100% always on failure | 17:32 |
*** aojea has quit IRC | 17:32 | |
Shrews | clarkb: thx | 17:33 |
*** heyongli has quit IRC | 17:35 | |
*** heyongli has joined #openstack-infra | 17:36 | |
*** myoung is now known as myoung|bbl | 17:36 | |
*** camunoz has quit IRC | 17:37 | |
*** camunoz has joined #openstack-infra | 17:38 | |
*** lpetrut has joined #openstack-infra | 17:41 | |
*** XP_2600 has joined #openstack-infra | 17:43 | |
*** amoralej is now known as amoralej|off | 17:44 | |
*** hemna_ has joined #openstack-infra | 17:45 | |
*** heyongli has quit IRC | 17:46 | |
*** heyongli has joined #openstack-infra | 17:46 | |
*** XP_2600 has quit IRC | 17:47 | |
*** jcoufal has joined #openstack-infra | 17:53 | |
*** jcoufal has quit IRC | 17:53 | |
*** jcoufal has joined #openstack-infra | 17:54 | |
*** heyongli has quit IRC | 17:56 | |
*** heyongli has joined #openstack-infra | 17:56 | |
openstackgerrit | Merged openstack-infra/project-config master: Add magnum to storyboard https://review.openstack.org/572440 | 18:02 |
*** heyongli has quit IRC | 18:06 | |
*** heyongli has joined #openstack-infra | 18:07 | |
*** rajinir has joined #openstack-infra | 18:13 | |
*** neiloy has joined #openstack-infra | 18:14 | |
*** heyongli has quit IRC | 18:16 | |
*** heyongli has joined #openstack-infra | 18:17 | |
*** kgiusti has quit IRC | 18:18 | |
*** kjackal has joined #openstack-infra | 18:20 | |
*** yamamoto has quit IRC | 18:25 | |
*** yamamoto has joined #openstack-infra | 18:26 | |
*** heyongli has quit IRC | 18:27 | |
*** heyongli has joined #openstack-infra | 18:27 | |
*** rmcall has joined #openstack-infra | 18:30 | |
*** yamamoto has quit IRC | 18:31 | |
*** hasharAway is now known as hashar | 18:31 | |
*** heyongli has quit IRC | 18:37 | |
*** heyongli has joined #openstack-infra | 18:37 | |
*** gfidente has quit IRC | 18:40 | |
*** dhill_ has quit IRC | 18:40 | |
*** jcoufal has quit IRC | 18:41 | |
*** fried_rolls is now known as fried_rice | 18:43 | |
*** heyongli has quit IRC | 18:47 | |
*** heyongli has joined #openstack-infra | 18:47 | |
Shrews | welp, been totally quiet since lunch. i'm assuming things are still good | 18:49 |
fungi | seems so | 18:50 |
fungi | though i just emerged from a huge wok full of fried rice myself so still checking | 18:50 |
*** heyongli has quit IRC | 18:57 | |
*** heyongli has joined #openstack-infra | 18:58 | |
*** rmcall has quit IRC | 19:02 | |
fungi | yeah, nodes is pretty much maxed out in use, node requests and check pipeline length are steadily (if slowly) dropping | 19:07 |
Shrews | ok, i *think* i have a theory on what happened. | 19:07 |
* fungi pulls up a chair | 19:07 | |
fungi | will i need popcorn? | 19:07 |
Shrews | nah | 19:07 |
*** heyongli has quit IRC | 19:08 | |
*** heyongli has joined #openstack-infra | 19:08 | |
Shrews | because we were running a mix of zk data with node.type of str and node.type of list, our min-ready code (running on nl01, so old code) did not recognize the new nodes as ready | 19:08 |
Shrews | so we were rapidly creating new nodes which were never seen as ready by the min-ready code | 19:09 |
fungi | reasonable | 19:09 |
Shrews | so we maxed out our capacity | 19:09 |
Shrews | launchers paused | 19:09 |
Shrews | things were sad | 19:09 |
fungi | and accumulating those at the expense of usable nodes | 19:09 |
Shrews | i'm going to keep looking, but that seems like at least part of the issue | 19:10 |
Shrews | the solution would have been to stop all launchers, then restart one-at-a-time | 19:10 |
Shrews | so that we never ran a mix | 19:10 |
Shrews | this was not easy to predict or foresee | 19:10 |
Shrews | (which is why changing the zk schema *always* scares me) | 19:11 |
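A minimal sketch of the failure mode Shrews describes, assuming node records are plain dicts with a `state` and a `type` field. The record layout and the function names `count_ready_old` and `count_ready_new` are hypothetical and not taken from nodepool; the sketch only shows why code that expects a string label stops recognizing nodes whose type was written as a list.

```python
def count_ready_old(nodes, label):
    """Old-style check: expects node['type'] to be a single label string."""
    return sum(
        1 for n in nodes
        if n["state"] == "ready" and n["type"] == label
    )


def count_ready_new(nodes, label):
    """New-style check: accepts either a string or a list of labels."""
    def labels(n):
        t = n["type"]
        return [t] if isinstance(t, str) else list(t)
    return sum(
        1 for n in nodes
        if n["state"] == "ready" and label in labels(n)
    )


nodes = [
    {"state": "ready", "type": "ubuntu-xenial"},    # written by old code
    {"state": "ready", "type": ["ubuntu-xenial"]},  # written by new code
]

# The old check never sees the list-typed node as ready, so a min-ready
# loop built on it would keep asking for more nodes.
old_count = count_ready_old(nodes, "ubuntu-xenial")
new_count = count_ready_new(nodes, "ubuntu-xenial")

# Using a list-typed label as a dict key raises the same kind of error
# reported earlier in the log.
try:
    {}[nodes[1]["type"]] = 1
    error = None
except TypeError as exc:
    error = str(exc)
```

In this toy model the old check undercounts as soon as one list-typed record exists, which is consistent with the advice to stop every launcher before bringing any of them back on the new code.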
Shrews | assuming that was the only issue, we'll need a release note for admins | 19:12 |
Shrews | well, regardless, we'll need that | 19:13 |
Shrews | tobiash: hopefully your not running nodepool from master ;) | 19:14 |
Shrews | s/your/you're/ | 19:14 |
clarkb | ++ to a release note | 19:14 |
tobiash | Shrews: ? | 19:14 |
fungi | yes, i'd rather nobody else needed to trip over this | 19:14 |
* tobiash reads backlog | 19:14 | |
Shrews | tobiash: multilabel support requires total shutdown of nodepool, then restart | 19:15 |
fungi | though maybe this is only likely to be a problem for deployments with multiple launchers? | 19:15 |
clarkb | Shrews: does it require clearing all existing noderequests too? | 19:15 |
Shrews | tobiash: err, all nodepool launchers that is | 19:15 |
Shrews | clarkb: no | 19:15 |
tobiash | Shrews: total shutdown means also deleting all nodes in zk? | 19:15 |
Shrews | tobiash: shouldn't need to | 19:16 |
tobiash | I'm currently having only one launcher | 19:16 |
Shrews | tobiash: oh, then you're fine | 19:16 |
tobiash | so I guess that is a full restart :) | 19:16 |
clarkb | probably worth a status log too with path forward | 19:17 |
Shrews | clarkb: i'll throw something together | 19:17 |
*** heyongli has quit IRC | 19:18 | |
*** heyongli has joined #openstack-infra | 19:18 | |
tobiash | so that was the reason for that massive queueing today? | 19:19 |
Shrews | status log Nodepool issue from earlier today seems to have been caused by nl03 launcher restart. Mixed, incompatible versions of code caused us to create min-ready nodes continually until we reached full capacity. | 19:19 |
Shrews | something like ^^ ? | 19:20 |
Shrews | tobiash: believe so | 19:21 |
clarkb | Shrews: ya and maybe note that a full restart of the launchers is necessary to go forward | 19:21 |
fungi | rca lgtm | 19:22 |
Shrews | status log Nodepool issue from earlier today seems to have been caused by nl03 launcher restart. Mixed, incompatible versions of code caused us to create min-ready nodes continually until we reached full capacity. A full restart of nodepool launchers is necessary to prevent this going forward. | 19:23 |
Shrews | status log Nodepool issue from earlier today seems to have been caused by nl03 launcher restart. Mixed, incompatible versions of code caused us to create min-ready nodes continually until we reached full capacity. A full rhutddown and restart of nodepool launchers is necessary to prevent this going forward. | 19:23 |
clarkb | ++ | 19:23 |
Shrews | rhutddown... gah | 19:23 |
*** camunoz has quit IRC | 19:23 | |
* Shrews edits in vim | 19:23 | |
fungi | you just need a vim-based irc client and you'll be all set | 19:24 |
Shrews | #status log Nodepool issue from earlier today seems to have been caused by nl03 launcher restart. Mixed, incompatible versions of code caused us to create min-ready nodes continually until we reached full capacity. A full shutdown and restart of nodepool launchers is necessary to prevent this going forward. | 19:25 |
*** eharney has joined #openstack-infra | 19:25 | |
openstackstatus | Shrews: finished logging | 19:25 |
fungi | thanks for figuring that out, Shrews! | 19:27 |
*** heyongli has quit IRC | 19:28 | |
*** heyongli has joined #openstack-infra | 19:28 | |
*** yamamoto has joined #openstack-infra | 19:29 | |
Shrews | i'm not quite sure how to test and validate my shutdown/restart suggestion | 19:31 |
*** vivsoni_ has quit IRC | 19:31 | |
Shrews | i mean, in my head... it works | 19:32 |
*** vivsoni has joined #openstack-infra | 19:32 | |
*** dklyle has joined #openstack-infra | 19:32 | |
clarkb | I think we can run two launchers against a devstack and see they work? | 19:32 |
Shrews | internal debugger is efficient, but limited | 19:32 |
clarkb | maybe even modify our existing job to do it | 19:32 |
Shrews | clarkb: but it requires different code versions. that'd be hard to do with existing job | 19:34 |
*** armaan has quit IRC | 19:34 | |
*** armaan has joined #openstack-infra | 19:34 | |
Shrews | clone old version; run nodepool; build nodes; shutdown nodepool; upgrade nodepool; run nodepool; send node requests | 19:35 |
Shrews | might have to do this manually, somehow | 19:36 |
tobiash | Shrews: do you do sharding of the clouds or does every launcher talk with every cloud? | 19:36 |
Shrews | tobiash: we shard | 19:36 |
*** myoung|bbl is now known as myoung | 19:36 | |
*** yamamoto has quit IRC | 19:37 | |
tobiash | ah, had to reread your explanation to match that with sharding in my head | 19:37 |
*** dklyle has quit IRC | 19:37 | |
*** heyongli has quit IRC | 19:38 | |
Shrews | i'm still investigating how it affected the other launchers. i know the min-ready requests would at least put additional pressure on them due to shared labels | 19:39 |
*** heyongli has joined #openstack-infra | 19:39 | |
*** slaweq has quit IRC | 19:39 | |
*** kjackal has quit IRC | 19:40 | |
*** slaweq has joined #openstack-infra | 19:41 | |
*** kjackal has joined #openstack-infra | 19:41 | |
*** lifeless has quit IRC | 19:46 | |
*** heyongli has quit IRC | 19:49 | |
*** heyongli has joined #openstack-infra | 19:49 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Fix 'satisfy' spelling errors https://review.openstack.org/573823 | 19:54 |
*** VW has quit IRC | 19:58 | |
*** heyongli has quit IRC | 19:59 | |
*** heyongli has joined #openstack-infra | 19:59 | |
sthussey | Is the post pipeline queue still executing? | 20:01 |
*** kjackal has quit IRC | 20:08 | |
*** heyongli has quit IRC | 20:09 | |
*** heyongli has joined #openstack-infra | 20:10 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add release note about upgrading for multi-label https://review.openstack.org/573827 | 20:11 |
*** jesslampe has joined #openstack-infra | 20:12 | |
*** armaan has quit IRC | 20:12 | |
*** armaan has joined #openstack-infra | 20:13 | |
*** heyongli has quit IRC | 20:19 | |
*** heyongli has joined #openstack-infra | 20:20 | |
*** lifeless has joined #openstack-infra | 20:22 | |
fungi | sthussey: we reenqueued everything that had been enqueued before the restart, so if you didn't get job results and don't see your change pending/running at https://zuul.openstack.org/ then you'll need to recheck it (or have its approval vote reapplied if it's supposed to be in the gate) | 20:25 |
sthussey | It is in the queue, I just haven't seen the queue change in 45m | 20:26 |
sthussey | Well, change aside from grow. | 20:26 |
*** e0ne has joined #openstack-infra | 20:27 | |
fungi | http://grafana.openstack.org/dashboard/db/zuul-status indicates the check and gate pipelines are steadily (if slowly) shrinking since we brought everything back up about 3 hours ago | 20:28 |
*** heyongli has quit IRC | 20:30 | |
*** heyongli has joined #openstack-infra | 20:30 | |
sthussey | Right, this is the post pipeline | 20:31 |
*** armaan has quit IRC | 20:31 | |
*** armaan has joined #openstack-infra | 20:31 | |
*** iranzo has quit IRC | 20:35 | |
fungi | oh, post and periodic run at the lowest priority, so don't get node allocations until gate and check (primarily) are satisfied | 20:35 |
clarkb | ok back now | 20:36 |
clarkb | Shrews: as for restarting again for infra maybe we plan to do that monday? | 20:36 |
fungi | wfm | 20:36 |
Shrews | wfm2. we can always do what we did today if that doesn't work more betterer | 20:38 |
clarkb | ya we now have a rollback process that is known to work so less scary | 20:40 |
*** agopi has joined #openstack-infra | 20:40 | |
*** heyongli has quit IRC | 20:40 | |
*** heyongli has joined #openstack-infra | 20:40 | |
*** dhill_ has joined #openstack-infra | 20:41 | |
fungi | well, and we actually know what's going on so can keep it brief | 20:42 |
*** hemna_ has quit IRC | 20:42 | |
*** Swimingly has quit IRC | 20:44 | |
*** Swimingly has joined #openstack-infra | 20:45 | |
*** Swimingly has quit IRC | 20:45 | |
*** Swimingly has joined #openstack-infra | 20:47 | |
*** Swimingly has quit IRC | 20:47 | |
*** Swimingly has joined #openstack-infra | 20:47 | |
*** heyongli has quit IRC | 20:50 | |
*** heyongli has joined #openstack-infra | 20:50 | |
*** heyongli has quit IRC | 21:00 | |
*** heyongli has joined #openstack-infra | 21:01 | |
*** kjackal has joined #openstack-infra | 21:03 | |
*** r-daneel has joined #openstack-infra | 21:05 | |
*** neiloy has quit IRC | 21:07 | |
*** r-daneel_ has joined #openstack-infra | 21:08 | |
*** r-daneel has quit IRC | 21:09 | |
*** r-daneel_ is now known as r-daneel | 21:09 | |
*** heyongli has quit IRC | 21:11 | |
*** heyongli has joined #openstack-infra | 21:11 | |
sthussey | Ah, thanks. That explains it. We've chosen the wrong pipeline. | 21:13 |
clarkb | not necessarily. Depends on what the job's function is and when it needs to run | 21:14 |
*** lpetrut has quit IRC | 21:15 | |
*** ldnunes has quit IRC | 21:15 | |
fungi | since the same jobs get run for each commit which merges to the post pipeline and are generally idempotent, it's been deemed non-time-sensitive and possibly lossy (i.e., we don't make any effort to preserve and restore it on scheduler restarts) | 21:16 |
mnaser | https://review.openstack.org/#/c/573738/ | 21:18 |
mnaser | if you wanna bump that up to help speed things up | 21:18 |
clarkb | I can enqueue it to the gate directly | 21:19 |
clarkb | fungi: ^ any reason to not do that? | 21:19 |
mnaser | it's not much but i figured it would help | 21:19 |
clarkb | better than nothing | 21:20 |
clarkb | oh except we have puppet disabled on those nodes | 21:20 |
clarkb | I'll just manually increase it to 25 then when puppet is running again it will match | 21:20 |
sthussey | this is our job to publish docker images | 21:21 |
*** heyongli has quit IRC | 21:21 | |
clarkb | #status log Manually applied https://review.openstack.org/#/c/573738/ to nl03 as nl* are disabled in puppet until we sort out the migration to no zk schema | 21:21 |
openstackstatus | clarkb: finished logging | 21:21 |
*** heyongli has joined #openstack-infra | 21:21 | |
fungi | clarkb: right, hand-patching is the way for now. i guess it's just needed on one launcher | 21:21 |
sthussey | probably should be in the promotion pipeline which I wasn't aware of when I was writing the CI stuff | 21:21 |
clarkb | sthussey: promote and post have the same priority fwiw. The difference is the gerrit event they trigger off of, one is tied to generic ref updates the other to changes merging | 21:23 |
fungi | sthussey: normally it wouldn't be a bad choice... we simply had a couple of disruptive incidents strike in the middle of the mad rush to land patches for milestone 2 | 21:23 |
fungi | so things are a lot more backed up than usual | 21:23 |
sthussey | Okay, post is probably fine. We want this after merges. | 21:23 |
sthussey | Thanks | 21:23 |
clarkb | fungi: do we know why unbound's package update broke the daemons? | 21:24 |
clarkb | fungi: and should we expect that to happen next time they update the package (and why were none of our other servers affected, or are they)? sorry, I'm dumping all the questions now that fires are mostly out and I'm back from doctor visit | 21:24 |
fungi | clarkb: suspicions but no smoking gun... it seems like the zuul executors, being under heavy load, exposed some sort of race in the daemon restart triggered by the deb maintscripts at package upgrade time | 21:26 |
*** dklyle has joined #openstack-infra | 21:26 | |
fungi | possibly the old process hadn't fully unlinked the listening socket? | 21:26 |
fungi | the logs were not helpful | 21:26 |
clarkb | weird, we've got unbound all over the place would've expected a more universal problem (also I run it at home and don't have problems like that either) | 21:27 |
fungi | agreed | 21:27 |
fungi | none of our other servers seem to have been impacted | 21:27 |
*** heyongli has quit IRC | 21:31 | |
*** heyongli has joined #openstack-infra | 21:32 | |
*** myoung is now known as myoung|off | 21:34 | |
*** boden has quit IRC | 21:34 | |
*** heyongli has quit IRC | 21:41 | |
*** armaan has quit IRC | 21:41 | |
*** heyongli has joined #openstack-infra | 21:42 | |
*** dhill_ has quit IRC | 21:45 | |
*** felipemonteiro has joined #openstack-infra | 21:51 | |
*** heyongli has quit IRC | 21:52 | |
*** heyongli has joined #openstack-infra | 21:52 | |
*** lifeless_ has joined #openstack-infra | 21:58 | |
*** e0ne has quit IRC | 22:00 | |
*** lifeless has quit IRC | 22:00 | |
*** heyongli has quit IRC | 22:02 | |
*** heyongli has joined #openstack-infra | 22:02 | |
*** jamesmcarthur has joined #openstack-infra | 22:02 | |
*** r-daneel has quit IRC | 22:03 | |
*** hashar has quit IRC | 22:04 | |
*** nicolasbock has quit IRC | 22:06 | |
*** jamesmcarthur has quit IRC | 22:07 | |
*** ykarel has joined #openstack-infra | 22:09 | |
*** rfolco_ has joined #openstack-infra | 22:11 | |
*** rfolco has quit IRC | 22:12 | |
*** heyongli has quit IRC | 22:12 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/storyboard master: use required enums to validate 'type' args https://review.openstack.org/545170 | 22:12 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/storyboard master: mark worklist filter_criteria as a required field https://review.openstack.org/545405 | 22:12 |
openstackgerrit | Jeremy Stanley proposed openstack-infra/storyboard master: mark FilterCriterion title as a mandatory field https://review.openstack.org/545406 | 22:12 |
*** heyongli has joined #openstack-infra | 22:12 | |
*** diablo_rojo has quit IRC | 22:13 | |
openstackgerrit | Zane Bitter proposed openstack-dev/cookiecutter master: Convert to new docs PTI https://review.openstack.org/573294 | 22:13 |
*** ykarel_ has joined #openstack-infra | 22:14 | |
*** florianf has quit IRC | 22:14 | |
*** ykarel has quit IRC | 22:16 | |
openstackgerrit | Merged openstack-infra/storyboard master: Make notification driver configurable https://review.openstack.org/538574 | 22:19 |
*** heyongli has quit IRC | 22:22 | |
*** heyongli has joined #openstack-infra | 22:23 | |
*** agopi has quit IRC | 22:24 | |
ianw | clarkb: (scrollback) yep dib from 2.0 doesn't actually use dib-utils ... dib-run-parts was the only thing in it | 22:24 |
clarkb | ianw: thank you for confirming | 22:24 |
*** lbragstad has joined #openstack-infra | 22:25 | |
*** elbragstad has quit IRC | 22:25 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: mirror-update: install afsmon and run from cron https://review.openstack.org/573493 | 22:26 |
ianw | was zuul-scheduler ok since its weird gerrit dropout yesterday? | 22:26 |
ianw | thinking about it, it may have had to do with the unbound thing too | 22:26 |
*** rlandy|rover has quit IRC | 22:27 | |
clarkb | it's fine now | 22:27 |
clarkb | we had to surgery things | 22:27 |
clarkb | largely nodepool related but had to take scheduler offline too to avoid it talking to zk | 22:27 |
ianw | ok, put that one down to a glitch in the matrix | 22:29 |
*** slaweq has quit IRC | 22:30 | |
*** r-daneel has joined #openstack-infra | 22:30 | |
*** slaweq has joined #openstack-infra | 22:30 | |
*** harlowja has joined #openstack-infra | 22:30 | |
*** heyongli has quit IRC | 22:33 | |
*** heyongli has joined #openstack-infra | 22:33 | |
*** fried_rice is now known as efried | 22:35 | |
*** lifeless_ is now known as lifeless | 22:41 | |
*** tpsilva has quit IRC | 22:42 | |
*** heyongli has quit IRC | 22:43 | |
*** hongbin has quit IRC | 22:43 | |
*** heyongli has joined #openstack-infra | 22:43 | |
*** felipemonteiro has quit IRC | 22:44 | |
openstackgerrit | Merged openstack-infra/storyboard master: Add MQTT notification publisher https://review.openstack.org/538575 | 22:45 |
openstackgerrit | Merged openstack-infra/storyboard master: Add configurable notification subscriber and mqtt driver https://review.openstack.org/540958 | 22:45 |
openstackgerrit | Merged openstack-infra/storyboard master: Make it impossible to create a userless private story https://review.openstack.org/416070 | 22:46 |
*** ykarel_ has quit IRC | 22:49 | |
openstackgerrit | Merged openstack-infra/storyboard master: Document some usage instructions for a freshly deployed dev instance https://review.openstack.org/556018 | 22:50 |
*** heyongli has quit IRC | 22:53 | |
*** heyongli has joined #openstack-infra | 22:53 | |
*** felipemonteiro has joined #openstack-infra | 22:57 | |
*** rpioso is now known as rpioso|afk | 23:00 | |
*** r-daneel_ has joined #openstack-infra | 23:01 | |
*** caphrim007_ has joined #openstack-infra | 23:01 | |
*** caphrim007_ has quit IRC | 23:03 | |
*** r-daneel has quit IRC | 23:03 | |
*** r-daneel_ is now known as r-daneel | 23:03 | |
*** heyongli has quit IRC | 23:03 | |
*** heyongli has joined #openstack-infra | 23:04 | |
*** ykarel_ has joined #openstack-infra | 23:04 | |
*** caphrim007 has quit IRC | 23:04 | |
*** ykarel_ has quit IRC | 23:09 | |
*** kjackal has quit IRC | 23:12 | |
*** heyongli has quit IRC | 23:14 | |
*** heyongli has joined #openstack-infra | 23:14 | |
*** claudiub has quit IRC | 23:21 | |
*** felipemonteiro has quit IRC | 23:23 | |
*** heyongli has quit IRC | 23:24 | |
*** heyongli has joined #openstack-infra | 23:24 | |
*** r-daneel has quit IRC | 23:25 | |
*** sthussey has quit IRC | 23:28 | |
*** Swami has quit IRC | 23:31 | |
*** r-daneel has joined #openstack-infra | 23:33 | |
*** heyongli has quit IRC | 23:34 | |
*** heyongli has joined #openstack-infra | 23:34 | |
openstackgerrit | Merged openstack-infra/project-config master: Bump vexxhost to 25 servers https://review.openstack.org/573738 | 23:36 |
*** felipemonteiro has joined #openstack-infra | 23:40 | |
*** XP_2600 has joined #openstack-infra | 23:44 | |
*** heyongli has quit IRC | 23:44 | |
*** heyongli has joined #openstack-infra | 23:45 | |
*** XP_2600 has quit IRC | 23:48 | |
clarkb | down to 338 node requests | 23:53 |
*** heyongli has quit IRC | 23:55 | |
*** heyongli has joined #openstack-infra | 23:55 | |
*** r-daneel has quit IRC | 23:58 |
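
Editor's note: fungi's explanation at 20:35, that post and periodic run at the lowest priority and only get node allocations once gate and check are satisfied, can be illustrated with a small sketch. This is not Zuul's or Nodepool's actual code. The numeric precedence values and the `allocation_order` helper are hypothetical, chosen only to show why a post job can sit untouched while higher-priority queues drain.

```python
import heapq
from itertools import count

# Hypothetical precedence values (an assumption for illustration only);
# a lower number is served first. post and periodic share the lowest priority.
PRECEDENCE = {"gate": 0, "check": 1, "post": 2, "periodic": 2}


def allocation_order(requests):
    """Return node requests in the order they would be served.

    requests: iterable of (pipeline, change) tuples in arrival order.
    Requests with equal precedence keep their arrival order.
    """
    arrival = count()
    heap = [(PRECEDENCE[p], next(arrival), p, c) for p, c in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, pipeline, change = heapq.heappop(heap)
        order.append((pipeline, change))
    return order


if __name__ == "__main__":
    backlog = [("post", "A"), ("check", "B"), ("gate", "C"),
               ("periodic", "D"), ("check", "E")]
    for pipeline, change in allocation_order(backlog):
        print(pipeline, change)
```

Under these assumptions, the post request that arrived first is still served after every gate and check request, which matches what sthussey observed while the backlog was large.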