*** ysandeep|away is now known as ysandeep | 05:45 | |
*** ykarel_ is now known as ykarel|afk | 06:54 | |
*** jpena|off is now known as jpena | 07:27 | |
lourot | o/ I think I need another +2 on https://review.opendev.org/c/openstack/project-config/+/808190 - thanks! | 07:41 |
*** rpittau|afk is now known as rpittau | 07:55 | |
*** ykarel|afk is now known as ykarel | 07:58 | |
*** ykarel_ is now known as ykarel | 08:09 | |
*** ysandeep is now known as ysandeep|lunch | 08:53 | |
*** ykarel_ is now known as ykarel|lunch | 10:10 | |
*** ysandeep|lunch is now known as ysandeep | 10:35 | |
*** ykarel|lunch is now known as ykarel | 11:12 | |
*** jpena is now known as jpena|lunch | 11:29 | |
*** dviroel|out is now known as dviroel | 11:36 | |
*** ysandeep is now known as ysandeep|afk | 11:37 | |
*** jpena|lunch is now known as jpena | 12:22 | |
lourot | thanks fungi :) | 12:38 |
fungi | yw | 12:39 |
fungi | sorry we missed that in the prior review | 12:39 |
opendevreview | Merged openstack/project-config master: Fix missing charm-interface-ironic-api https://review.opendev.org/c/openstack/project-config/+/808190 | 12:47 |
*** ysandeep|afk is now known as ysandeep | 12:47 | |
opendevreview | Yuriy Shyyan proposed openstack/project-config master: Scaling down proposed instance count. https://review.opendev.org/c/openstack/project-config/+/810009 | 13:03 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin https://review.opendev.org/c/opendev/system-config/+/810013 | 13:26 |
opendevreview | Merged openstack/project-config master: Scaling down proposed instance count. https://review.opendev.org/c/openstack/project-config/+/810009 | 13:33 |
opendevreview | daniel.pawlik proposed opendev/puppet-log_processor master: Add capability with python3; add log request cert verify https://review.opendev.org/c/opendev/puppet-log_processor/+/809424 | 13:43 |
opendevreview | daniel.pawlik proposed opendev/base-jobs master: Submit a log processing job to the RDO infra logstash workers https://review.opendev.org/c/opendev/base-jobs/+/809765 | 14:00 |
*** ykarel is now known as ykarel|away | 14:39 | |
ricolin | got a 502 error when trying to generate content on https://paste.opendev.org/ | 15:13 |
fungi | i've seen that when its database connection times out... is it consistent when you retry? | 15:18 |
noonedeadpunk | seems like it is | 15:44 |
noonedeadpunk | and the db connection issue looked different | 15:45 |
noonedeadpunk | not a proxy error... but dunno | 15:45 |
clarkb | we got an email ~40 minutes ago from the cloud provider saying that the backing volume had an issue? | 15:46 |
clarkb | I suspect it is related to that | 15:46 |
clarkb | fungi: ^ fyi. | 15:46 |
clarkb | they suggest a fsck. Hopefully mariadb doesn't need more than that. /me needs breakfast but can help look at that after | 15:46 |
fungi | i'll take a look now, thanks, i hadn't seen that ticket from the provider yet | 15:50 |
fungi | yeah, i guess we added a cinder volume for /var/lib/lodgeit/mariadb ? | 15:51 |
fungi | i'll stop the containers, then umount and fsck that | 15:51 |
fungi | looks like the server got rebooted 1.75 hours ago | 15:52 |
fungi | and ~0.25 hours after that it started logging i/o errors for that volume | 15:52 |
fungi | fsck ran clean. i'll reboot the server again too for good measure, since we were still logging i/o errors for the device as recently as a few minutes ago | 15:54 |
fungi | noonedeadpunk: ricolin: is it working better now? | 15:56 |
fungi | #status log Ran filesystem checks on database volume for the paste.opendev.org service, and rebooted the server in order to attempt to clear any lingering I/O error state | 16:00 |
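For reference, the recovery sequence fungi walks through above (stop the containers, umount the filesystem, fsck it, reboot, bring the containers back) would look roughly like the sketch below. The compose directory and block device names are illustrative assumptions; only the mount point /var/lib/lodgeit/mariadb comes from the log.

```shell
# Minimal sketch of the paste recovery steps described above.
# /etc/lodgeit-docker and /dev/vdb are assumptions for illustration.
cd /etc/lodgeit-docker
docker-compose down                     # stop the lodgeit + mariadb containers
umount /var/lib/lodgeit/mariadb         # release the cinder-backed filesystem
fsck /dev/vdb                           # check the backing device (reported clean here)
reboot                                  # clear any lingering I/O error state
# after the server comes back up:
cd /etc/lodgeit-docker && docker-compose up -d
```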
opendevstatus | fungi: finished logging | 16:00 |
noonedeadpunk | works for me | 16:00 |
fungi | thanks for confirming | 16:00 |
*** rpittau is now known as rpittau|afk | 16:00 | |
fungi | it was because clouds | 16:01 |
clarkb | a random paste loads for me now too, thanks for doing the cleanup | 16:04 |
fungi | i'll get the ticket closed out too | 16:06 |
*** marios is now known as marios|out | 16:09 | |
fungi | now that lists.o.o has been running upgraded for a while, it's clear memory utilization is improved and more stable: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=219&rra_id=all | 16:11 |
clarkb | fungi: over the weekend it occurred to me that an easy way to replace the server would be to boot a new instance and have ansible configure it in test mode, so no one is notified but all the lists still get created. | 16:15 |
clarkb | Then we can just copy over the membership and archives? | 16:15 |
clarkb | However, it may make more sense to deploy the new server as mm3 and migrate into that? | 16:15 |
fungi | yes, i wanted to push up an mm3 migration spec late last week but other fires keep coming up | 16:16 |
fungi | i've got it half drafted | 16:16 |
clarkb | cool | 16:16 |
fungi | okay, so in addition to two tickets about paste volume outages, there was a hypervisor host issue with paste.o.o (that explains the short uptime when i first logged in), and also one for zm06 | 16:19 |
fungi | zm06 looks like it's got a merger running on it now, and grafana shows a full complement of mergers as well as the short dip while zm06 was offline | 16:21 |
fungi | so i think there's nothing to worry about there | 16:21 |
*** ysandeep is now known as ysandeep|dinner | 16:25 | |
clarkb | and zuul should be resilient to losing that node when the issue occurred (though a job or two may have failed?) | 16:26 |
fungi | hopefully not, but who knows | 16:26 |
clarkb | looks like the integrated openstack gate isn't doing any work | 16:28 |
clarkb | does anyone understand what might be happening there? corvus ^ fyi | 16:28 |
clarkb | I'm going to start looking in logs | 16:29 |
clarkb | I see it started ~12 builds at around 12:07-12:12 UTC | 16:32 |
clarkb | then around 12:50 UTC the builds are canceled/aborted (I assume due to a gate reset) | 16:33 |
clarkb | No new jobs are started after that point in time | 16:33 |
clarkb | corvus: what is intended to submit new build requests after the previous ones get canceled? | 16:34 |
corvus | clarkb: o/ catching up | 16:34 |
fungi | hopefully it's not related to the tripleo-ci change in check with 44 depends-on footers | 16:34 |
fungi | that's an interesting stress-test | 16:35 |
corvus | clarkb: if jobs are canceled for a gate reset, they should be immediately replaced by new jobs from the same run of the main scheduler loop | 16:35 |
clarkb | corvus: it doesn't look like that is happening based on the logs | 16:36 |
fungi | it looks similar to when mergers haven't reported back on changes; the entries don't yet expand to any jobs | 16:36 |
clarkb | It goes very quiet on that event/change after the cancels happen | 16:36 |
clarkb | fungi: or perhaps zm06 restarting tripped this up? Do you know when zm06 rebooted? | 16:36 |
clarkb | I think ~12:50 +/- a minute or two is the window we're looking at here | 16:37 |
clarkb | 8bb2e0f74b1144cca9ba41ddea58a1dc is the event id fwiw | 16:37 |
fungi | reboot system boot Mon Sep 20 13:20 | 16:37 |
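That line reads like output from the standard wtmp query; a quick sketch of the kind of check likely used here (my assumption, the exact commands are not shown in the log):

```shell
last reboot | head -1   # most recent boot record from wtmp
uptime                  # how long the server has currently been up
```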
corvus | it's plausible there's an outstanding merge job that it's waiting on | 16:37 |
clarkb | Submitting job request to ZooKeeper <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1 that one maybe? | 16:38 |
corvus | i'm going to look for the first merge job uuid submitted after the reset; let me know if you find it first | 16:38 |
corvus | ok thx :) | 16:38 |
clarkb | er no, it says it removed that request later | 16:38 |
clarkb | and I don't see any requests logged after that one | 16:39 |
fungi | the tripleo-ci change with all the deps would have been enqueued around 12:53z when it was last rechecked | 16:39 |
clarkb | I wonder why they keep rechecking that change when a bunch of the depends-on are not mergeable | 16:40 |
fungi | looks like zm06 might have died around 12:30z and came back at 13:20z | 16:41 |
fungi | trying to correlate cacti against syslog entries now | 16:42 |
fungi | yeah, the gap in cacti graphs bears that out | 16:42 |
clarkb | 578fe1e26cf84ba7a86454315c1b4eb1 claims it completed, though I'm not sure we record its status in the scheduler logs | 16:42 |
corvus | 2021-09-20 13:12:08,815 DEBUG zuul.MergerApi: [e: 8bb2e0f74b1144cca9ba41ddea58a1dc] [build: 578fe1e26cf84ba7a86454315c1b4eb1] Updating request <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1, job_type=merge, state=completed, path=/zuul/merger/requests/578fe1e26cf84ba7a86454315c1b4eb1> | 16:43 |
corvus | state=completed | 16:43 |
clarkb | `grep 8bb2e0f74b1144cca9ba41ddea58a1dc /var/log/zuul/debug.log | grep -v 'Checking for changes needed by' | grep -v 'No changes needed' | less` is the sort of thing I'm looking at it and shows the event went very idle after this stuff | 16:43 |
corvus | so i think that job is good | 16:43 |
clarkb | corvus: ya | 16:43 |
corvus | it'll be cool when we have the pipeline state in zk; will be a lot easier to inspect | 16:44 |
corvus | it looks like this was kicked off when the change ahead of 809336 failed, and 336 became the head of the queue | 16:45 |
corvus | that was the event at 12:50 that caused the jobs to cancel and spawned the 578 merge job | 16:46 |
clarkb | based on the log it appears we're stuck between having completed the merge to determine new speculative state and starting jobs? | 16:47 |
corvus | yes -- somewhat complicated by the fact that since then there has been another re-enqueue event (13:47:14,916); i don't know the details of that | 16:47 |
corvus | that appears to be a tenant reconfig event | 16:49 |
corvus | that probably should just be a noop | 16:51 |
corvus | okay, i'm going to fire up the repl and see if i can inspect the queue item state | 16:51 |
clarkb | ok | 16:51 |
clarkb | fungi: unrelated to ^ https://review.opendev.org/c/opendev/system-config/+/809483 should be safe to land as we pull and restart on new nodepool builder images when available and https://review.opendev.org/c/zuul/nodepool/+/805743 has merged | 16:53 |
corvus | <BuildSet item: <QueueItem a31bac49a9a0474a8094b4313bb307e1 for <Change 0x7f829f42e430 openstack/placement 809366,1> in gate> #builds: 0 merge state: PENDING> | 16:54 |
corvus | i'm suspecting that the merge request was deleted by the cleanup process | 16:59 |
corvus | working on confirming that now | 16:59 |
clarkb | interesting. The log message did indicate they overlapped each other | 17:00 |
corvus | yeah, we need a 'finished' log msg for the cleanup :/ | 17:00 |
corvus | clarkb: did you find where 578fe1e26cf84ba7a86454315c1b4eb1 ran? | 17:01 |
clarkb | corvus: I did not | 17:02 |
corvus | i'll look | 17:02 |
corvus | apparently nowhere | 17:04 |
corvus | sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/merger-debug.log' 'zm*' | 17:04 |
corvus | sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/executor-debug.log' 'ze*' | 17:04 |
corvus | both return nothing | 17:04 |
*** ysandeep|dinner is now known as ysandeep | 17:05 | |
clarkb | I agree I grepped the same uuid across mergers and executors and got no results | 17:05 |
clarkb | need to double check file rotation though | 17:05 |
clarkb | looks like merger's .log.1 is from yesterday UTC so wouldn't have it | 17:06 |
clarkb | same thing with executors | 17:06 |
corvus | i'm pretty sure we log the job id right as we pull it off the queue | 17:07 |
fungi | unrelated, looks like mirror.iad3.inmotion.opendev.org is down or unreachable, i see multiple jobs retried due to pre-run failures trying to update package lists | 17:08 |
fungi | can't ping it at all at the moment | 17:08 |
corvus | perhaps we've had other merger connectivity/uptime issues | 17:08 |
clarkb | fungi: is the server down? it may have crashed when we overloaded the hypervisors? If so doing a hard reboot is probably what it needs | 17:09 |
fungi | clarkb: yeah, not sure, i'm checking the nova api now | 17:09 |
fungi | to see what state it thinks the instance is in | 17:09 |
clarkb | corvus: is it possible that the cancel happened such that the completion occurred all on the scheduler side before a merger ever saw it? | 17:09 |
fungi | aha, it's in shutoff state, starting it and hopefully it will boot | 17:10 |
corvus | clarkb: i don't think we canceled 578fe1e26cf84ba7a86454315c1b4eb1 | 17:11 |
clarkb | got it | 17:11 |
fungi | #status log Started mirror02.iad3.inmotion.opendev.org via Nova API as it was in a SHUTOFF state | 17:11 |
opendevstatus | fungi: finished logging | 17:11 |
fungi | er, yes that seems to have gotten it back online again | 17:12 |
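A rough sketch of the Nova API interaction described above, using the standard OpenStack CLI; the --os-cloud name is an assumption, the server name is from the log:

```shell
# See what state nova thinks the instance is in
openstack --os-cloud openstackci-inmotion server show \
    mirror02.iad3.inmotion.opendev.org -c status

# It reported SHUTOFF, so start it again
openstack --os-cloud openstackci-inmotion server start \
    mirror02.iad3.inmotion.opendev.org
```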
corvus | only zm06 has uptime < x days, so i wouldn't expect log entry data loss from anyone else | 17:13 |
corvus | oh, zm06 has log entries from as late as 12:47; our merge job was enqueued at 12:50... that's pretty close | 17:14 |
corvus | like, close enough it may have still been semi-functional at 12:50? | 17:14 |
fungi | entirely possible | 17:14 |
fungi | 12:25 was the last i saw anything in syslog but i didn't check the merger log | 17:14 |
fungi | and it's hard to tell from cacti since i haven't been able to zoom the timescale on the graphs for a while now | 17:15 |
corvus | functional enough to pull a job off the queue, update its state to running, and then crash before finishing the job or flushing out log buffer pages to disk | 17:15 |
corvus | 12:47:30,773 is the last merger log entry | 17:15 |
clarkb | that seems plausible | 17:16 |
corvus | it's not conclusive, but it fits the facts we have... | 17:16 |
clarkb | why would it be marked completed though? | 17:16 |
corvus | that's part of what the cleanup does... | 17:16 |
clarkb | ah | 17:16 |
corvus | so i think we've learned 2 things: 1) the cleanup routine works (yay). 2) we need a way to poke the scheduler and tell it that a merge job it was expecting failed. | 17:17 |
corvus | i think we may need to do something similar to what we do with build jobs and put a fake complete event into the queue | 17:17 |
clarkb | corvus: have the cleanup routine send that event you mean? | 17:18 |
clarkb | that makes sense to me | 17:18 |
corvus | yep; that's what the build cleanup does | 17:18 |
corvus | clarkb: i think we have the info we need and at this point remediation is dequeue/enqueue. | 17:18 |
clarkb | corvus: cool I can do the dequeue enqueue | 17:18 |
corvus | i'll dig into fixing | 17:19 |
*** jpena is now known as jpena|off | 17:19 | |
fungi | would promote probably solve it? | 17:19 |
fungi | just reorder the queue (might even not actually have to reorder it if you tell it to simply promote the first item) | 17:20 |
fungi | solve as in get builds going again i mean | 17:20 |
fungi | clarkb: ^ | 17:20 |
clarkb | it may have. I've done the dequeue enqueue and stuck it at the end of the queue | 17:21 |
clarkb | or am attempting to anyway, it hasn't finished the enqueue yet | 17:21 |
clarkb | I think that is because it is busy handling the events for the other changes that suddenly became viable | 17:22 |
fungi | yeah | 17:22 |
clarkb | yup there it goes all done | 17:22 |
fungi | thanks! | 17:22 |
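For reference, the remediation discussed above (dequeue/enqueue the stuck gate item, or alternatively promote it) maps onto the zuul client roughly as sketched below. This assumes the client is run on the scheduler; the change number is taken from the repl output earlier, and the exact project spelling and option details are assumptions.

```shell
# Drop the stuck item from the gate and re-add it, forcing a fresh merge and job launch
zuul dequeue --tenant openstack --pipeline gate \
    --project opendev.org/openstack/placement --change 809366,1
zuul enqueue --tenant openstack --pipeline gate \
    --project opendev.org/openstack/placement --change 809366,1

# Alternative mentioned above: promote the item, which reorders the queue and
# also re-triggers job launches for it
zuul promote --tenant openstack --pipeline gate --changes 809366,1
```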
opendevreview | Merged opendev/system-config master: Run daily backups of nodepool zk image data https://review.opendev.org/c/opendev/system-config/+/809483 | 17:34 |
*** ysandeep is now known as ysandeep|away | 18:04 | |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/809488/ is the next step in the logo asset management process. I think the earlier changes in that stack are likely to be safe to land? Definitely review them for correctness and potential release impacts I guess | 18:22 |
fungi | sure, i could use a break anyway, will take a look while i have queries underway | 18:26 |
opendevreview | Merged opendev/system-config master: gitea: cleanup logo assets https://review.opendev.org/c/opendev/system-config/+/809488 | 19:42 |
opendevreview | Merged opendev/system-config master: docs: use logo from assets https://review.opendev.org/c/opendev/system-config/+/809489 | 19:42 |
clarkb | corvus: fungi: it just occurred to me that putting the nodepool image data backups in /var/log/nodepool means we'll serve them with the http server on those hosts. There isn't anything sensitive in that though right? Should we move them in case that changes? | 19:55 |
clarkb | I'm running one of the exports into /tmp in the container to look at it | 19:57 |
clarkb | fungi: corvus: ianw: ya looking at the data it has info like when it was built, the python path, the username associated with the image, the cloud side uuid and name, and the image format | 20:01 |
clarkb | I think that is ok to have exposed. But if you all can double check that that would be great | 20:01 |
clarkb | oh and we may only serve the image build logs not the actual service logs so this may be all for nothing anyway | 20:03 |
corvus | clarkb: i think we should not serve it. i don't expect there to be any sensitive info, but i don't want to commit to that | 20:04 |
clarkb | corvus: ok I'm double checking what we actually serve now | 20:04 |
corvus | (for example, i'd like to move more authn info into zk for nodes; that might end up doing the same for images too (probably not, but it's conceivable)) | 20:04 |
clarkb | we serve content of /var/log/nodepool/builds and /opt/nodepool_dib. Currently the backups go to /var/log/nodepool/nodepool-image-backup.json | 20:05 |
clarkb | I think that means we are good | 20:05 |
corvus | kk | 20:05 |
fungi | yeah, seems fine then | 20:25 |
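A minimal sketch of how the backup and the served paths relate, assuming the export is done with nodepool's export-image-data subcommand inside the builder container (the container name and exact invocation are assumptions on my part; the paths are from the discussion above):

```shell
# Export the ZooKeeper image metadata to the backup location
# (container name is an assumption for illustration)
docker exec nodepool-builder \
    nodepool export-image-data /var/log/nodepool/nodepool-image-backup.json

# Served by the builder's webserver: /var/log/nodepool/builds and /opt/nodepool_dib
# Backup destination (under neither served path): /var/log/nodepool/nodepool-image-backup.json
```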
clarkb | One of the things on my todo list is restarting gerrit to pick up the replication timeouts. We have a fair number of leaked replications again which I'll clean up now. Any sense for when a good time to restart gerrit would be? | 20:26 |
clarkb | I think we should wait until we've got caught up replication queues too | 20:28 |
clarkb | also I'm noticing they always seem to happen around 1300-1800 UTC | 20:32 |
clarkb | either I'm just checking hours after they got stuck and timed out, or something specific about those times is causing us issues? | 20:32 |
clarkb | I'm thinking it might be best to do the restart when we can monitor gerrit and ideally through that period of time. Maybe do a restart early tomorrow as a result? | 20:35 |
clarkb | the backlog is cleared now fwiw | 20:36 |
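A rough sketch of how the replication backlog can be inspected before picking a restart window, using Gerrit's standard show-queue SSH command (the account used here is an assumption):

```shell
# Leaked replication pushes show up as long-lived entries in the work queues
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
```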
*** dviroel is now known as dviroel|out | 21:09 | |
clarkb | Note it appears the paste server is having trouble again | 21:32 |
clarkb | cloud reports they are investigating and will let us know when we can expect to recover things | 21:33 |
fungi | thanks | 21:34 |
clarkb | Last time they sent us a nice email suggesting fsck and a related recovery path, so I expect we'll get something similar this time around too | 21:35 |
fungi | yeah, basic steps are 1. down the containers, 2. umount the fs, 3. fsck it, 4. reboot, 5. log back in and up the containers again | 21:55 |
clarkb | ok email saying it should be happy has arrived | 22:12 |
clarkb | fungi: just `fsck $path` after umounting? | 22:14 |
fungi | yup | 22:14 |
fungi | it'll probably report clean | 22:14 |
fungi | fsck $devpath | 22:14 |
clarkb | yup it did | 22:14 |
clarkb | I'll reboot now then up the containers again | 22:15 |
clarkb | seems to be working. I can load an older paste and create a new one | 22:17 |
fungi | perfect | 22:17 |
TheJulia | out of curiosity, does opendev track/trend cpu utilization/iowait hit on test vms? | 23:26 |
ianw | TheJulia: devstack runs a dstat service in the background whose output can be graphed, but we don't track/trend it generally | 23:26 |
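As a generic illustration of the kind of data such a dstat service collects (this is not the exact devstack service definition, just an assumed example invocation):

```shell
# Sample CPU, disk, I/O, memory and load every 10 seconds and also write a CSV
# that can be graphed afterwards
dstat --time --cpu --disk --io --mem --load --output dstat.csv 10
```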
TheJulia | i ask because I've got some cases where I've spotted unit tests taking 1000 seconds where my 10 year old desktop runs them in <200 seconds | 23:27 |
fungi | were you able to correlate those runs to a particular provider? | 23:27 |
fungi | i get the impression some of our providers may sort of oversubscribe our disk i/o bandwidth in particular, and when our test volumes get really high all our test nodes in that location wind up fighting one another over limited disk bandwidth | 23:29 |
TheJulia | one moment | 23:29 |
TheJulia | inmotion-iad3, 600 seconds on one job, i think the other one was like a thousand seconds | 23:31 |
TheJulia | examples are slim since one is a gate fix with neutron I'm trying to get merged and...well.. yeah :( | 23:33 |
TheJulia | at least, a workaround for ironic to keep moving forward | 23:33 |
fungi | we did just ramp up the usage on that provider over the weekend, so it's untested territory. early today we were oom'ing bits of the infrastructure there but scaled it back as of roughly 14:00 utc | 23:34 |
fungi | it seemed more stable after that at least | 23:34 |
fungi | what time did the really long test duration happen? | 23:34 |
TheJulia | uhh, with-in the last 2 hours | 23:37 |
TheJulia | https://zuul.opendev.org/t/openstack/status#809895 | 23:38 |
fungi | okay, thanks that helps | 23:39 |
* TheJulia steps away for a little while to ponder dinner | 23:39 |