Monday, 2021-09-20

*** ysandeep|away is now known as ysandeep  05:45
*** ykarel_ is now known as ykarel|afk  06:54
*** jpena|off is now known as jpena  07:27
<lourot> o/ I think I need another +2 on - thanks!  07:41
*** rpittau|afk is now known as rpittau  07:55
*** ykarel|afk is now known as ykarel  07:58
*** ykarel_ is now known as ykarel  08:09
*** ysandeep is now known as ysandeep|lunch  08:53
*** ykarel_ is now known as ykarel|lunch  10:10
*** ysandeep|lunch is now known as ysandeep  10:35
*** ykarel|lunch is now known as ykarel  11:12
*** jpena is now known as jpena|lunch  11:29
*** dviroel|out is now known as dviroel  11:36
*** ysandeep is now known as ysandeep|afk  11:37
*** jpena|lunch is now known as jpena  12:22
<lourot> thanks fungi :)  12:38
<fungi> sorry we missed that in the prior review  12:39
<opendevreview> Merged openstack/project-config master: Fix missing charm-interface-ironic-api
*** ysandeep|afk is now known as ysandeep  12:47
<opendevreview> Yuriy Shyyan proposed openstack/project-config master: Scaling down proposed instance count.
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin
<opendevreview> Merged openstack/project-config master: Scaling down proposed instance count.
<opendevreview> daniel.pawlik proposed opendev/puppet-log_processor master: Add capability with python3; add log request cert verify
<opendevreview> daniel.pawlik proposed opendev/base-jobs master: Submit a log processing job to the RDO infra logstash workers
*** ykarel is now known as ykarel|away  14:39
<ricolin> got a 502 error when trying to generate contents on
<fungi> i've seen that when its database connection times out... is it consistent when you retry?  15:18
<noonedeadpunk> seems like it is  15:44
<noonedeadpunk> and the db connection issue looked different  15:45
<noonedeadpunk> not a proxy error... but dunno  15:45
<clarkb> we got an email ~40 minutes ago from the cloud provider saying that the backing volume had an issue?  15:46
<clarkb> I suspect it is related to that  15:46
<clarkb> fungi: ^ fyi.  15:46
<clarkb> they suggest a fsck. Hopefully mariadb doesn't need more than that. /me needs breakfast but can help look at that after  15:46
<fungi> i'll take a look now, thanks, i hadn't seen that ticket from the provider yet  15:50
<fungi> yeah, i guess we added a cinder volume for /var/lib/lodgeit/mariadb ?  15:51
<fungi> i'll stop the containers, then umount and fsck that  15:51
<fungi> looks like the server got rebooted 1.75 hours ago  15:52
<fungi> and ~0.25 hours after that it started logging i/o errors for that volume  15:52
<fungi> fsck ran clean. i'll reboot the server again too for good measure, since we were still logging i/o errors for the device as recently as a few minutes ago  15:54
<fungi> noonedeadpunk: ricolin: is it working better now?  15:56
<fungi> #status log Ran filesystem checks on database volume for the service, and rebooted the server in order to attempt to clear any lingering I/O error state  16:00
<opendevstatus> fungi: finished logging  16:00
<noonedeadpunk> works for me  16:00
<fungi> thanks for confirming  16:00
*** rpittau is now known as rpittau|afk  16:00
<fungi> it was because clouds  16:01
<clarkb> a random paste loads for me now too, thanks for doing the cleanup  16:04
<fungi> i'll get the ticket closed out too  16:06
*** marios is now known as marios|out  16:09
<fungi> now that lists.o.o has been running upgraded for a while, it's clear memory utilization is improved and more stable:
<clarkb> fungi: over the weekend it occurred to me that an easy way to replace the server would be to boot a new instance and have ansible configure it in test mode so no one is notified but all the lists should be made.  16:15
<clarkb> Then we can just copy over the membership and archives?  16:15
<clarkb> However, it may make more sense to deploy the new server as mm3 and migrate into that?  16:15
<fungi> yes, i wanted to push up an mm3 migration spec late last week but other fires keep coming up  16:16
<fungi> i've got it half drafted  16:16
<fungi> okay, so in addition to two tickets about paste volume outages, there was a hypervisor host issue with paste.o.o (that explains the short uptime when i first logged in), and also for zm06  16:19
<fungi> zm06 looks like it's got a merger running on it now, and grafana shows a full complement of mergers as well as the short dip while zm06 was offline  16:21
<fungi> so i think there's nothing to worry about there  16:21
*** ysandeep is now known as ysandeep|dinner  16:25
<clarkb> and zuul should be resilient to losing that node when the issue occurred (though a job or two may have failed?)  16:26
<fungi> hopefully not, but who knows  16:26
<clarkb> looks like the integrated openstack gate isn't doing any work  16:28
<clarkb> does anyone understand what might be happening there? corvus ^ cyi  16:28
<clarkb> er fyi  16:28
<clarkb> I'm going to start looking in logs  16:29
<clarkb> I see it started ~12 builds at around 12:07-12:12 UTC  16:32
<clarkb> then around 12:50 UTC the builds are canceled/aborted (I assume due to a gate reset)  16:33
<clarkb> No new jobs are started after that point in time  16:33
<clarkb> corvus: what is intended to submit new build requests after the previous ones get canceled?  16:34
<corvus> clarkb: o/ catching up  16:34
<fungi> hopefully it's not related to the tripleo-ci change in check with 44 depends-on footers  16:34
<fungi> that's an interesting stress-test  16:35
<corvus> clarkb: if jobs are canceled for a gate reset, they should be immediately replaced by new jobs from the same run of the main scheduler loop  16:35
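A toy model of the dependent-pipeline invariant corvus describes here; this is not Zuul's actual implementation, and the change numbers below are illustrative (809300 is made up). The point is that cancellation and relaunch happen together, so the queue should never sit idle after a reset.

```python
# Toy model of a gate (dependent pipeline) reset -- NOT Zuul's code.
# When a change fails, the failed item drops out of the queue, and every
# item behind it has its jobs canceled and restarted in the same pass.

def gate_reset(queue, failed_change):
    """Return (new_queue, canceled, restarted) after a gate reset."""
    idx = queue.index(failed_change)
    behind = queue[idx + 1:]
    new_queue = queue[:idx] + behind   # failed item is removed
    canceled = list(behind)            # running jobs behind it are canceled...
    restarted = list(behind)           # ...and immediately relaunched against
    return new_queue, canceled, restarted  # the new speculative state

new_queue, canceled, restarted = gate_reset(["809300", "809336", "809366"], "809300")
```

If no new jobs appear after the cancels, something between these two steps was lost, which is exactly what the rest of this debugging session tracks down.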
<clarkb> corvus: it doesn't look like that is happening based on the logs  16:36
<fungi> it looks similar to when mergers haven't reported back on changes, the entries don't yet expand to any jobs  16:36
<clarkb> It goes very quiet on that event/change after the cancels happen  16:36
<clarkb> fungi: or perhaps zm06 restarting tripped this up? Do you know when zm06 rebooted?  16:36
<clarkb> I think ~12:50 +/- a minute or two is the window we're looking at here  16:37
<clarkb> 8bb2e0f74b1144cca9ba41ddea58a1dc is the event id fwiw  16:37
<fungi> reboot   system boot  Mon Sep 20 13:20  16:37
<corvus> it's plausible there's an outstanding merge job that it's waiting on  16:37
<clarkb> "Submitting job request to ZooKeeper <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1" that one maybe?  16:38
<corvus> i'm going to look for the first merge job uuid submitted after the reset; let me know if you find it first  16:38
<corvus> ok thx :)  16:38
<clarkb> er no it says it removed that request later  16:38
<clarkb> and I don't see any requests logged after that one  16:39
<fungi> the tripleo-ci change with all the deps would have been enqueued around 12:53z when it was last rechecked  16:39
<clarkb> I wonder why they keep rechecking that change when a bunch of the depends-on are not mergeable  16:40
<fungi> looks like zm06 might have died around 12:30z and came back at 13:20z  16:41
<fungi> trying to correlate cacti against syslog entries now  16:42
<fungi> yeah, the gap in cacti graphs bears that out  16:42
<clarkb> 578fe1e26cf84ba7a86454315c1b4eb1 claims it completed, though I'm not sure we record its status in the scheduler logs  16:42
<corvus> 2021-09-20 13:12:08,815 DEBUG zuul.MergerApi: [e: 8bb2e0f74b1144cca9ba41ddea58a1dc] [build: 578fe1e26cf84ba7a86454315c1b4eb1] Updating request <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1, job_type=merge, state=completed, path=/zuul/merger/requests/578fe1e26cf84ba7a86454315c1b4eb1>  16:43
<clarkb> `grep 8bb2e0f74b1144cca9ba41ddea58a1dc /var/log/zuul/debug.log | grep -v 'Checking for changes needed by' | grep -v 'No changes needed' | less` is the sort of thing I'm looking at and shows the event went very idle after this stuff  16:43
<corvus> so i think that job is good  16:43
<clarkb> corvus: ya  16:43
<corvus> it'll be cool when we have the pipeline state in zk; will be a lot easier to inspect  16:44
<corvus> it looks like this was kicked off when the change ahead of 809336 failed, and 336 became the head of the queue  16:45
<corvus> that was the event at 12:50 that caused the jobs to cancel and spawned the 578 merge job  16:46
<clarkb> based on the log it appears we're stuck between having completed the merge to determine new speculative state and starting jobs?  16:47
<corvus> yes -- somewhat complicated by the fact that since then there has been another re-enqueue event (13:47:14,916) i don't know details of that  16:47
<corvus> that appears to be a tenant reconfig event  16:49
<corvus> that probably should just be a noop  16:51
<corvus> okay, i'm going to fire up the repl and see if i can inspect the queue item state  16:51
<clarkb> fungi: unrelated to ^ should be safe to land as we pull and restart on new nodepool builder images when available and has merged  16:53
<corvus> <BuildSet item: <QueueItem a31bac49a9a0474a8094b4313bb307e1 for <Change 0x7f829f42e430 openstack/placement 809366,1> in gate> #builds: 0 merge state: PENDING>  16:54
<corvus> i'm suspecting that the merge request was deleted by the cleanup process  16:59
<corvus> working on confirming that now  16:59
<clarkb> interesting. The log message did indicate they overlapped each other  17:00
<corvus> yeah, we need a 'finished' log msg for the cleanup :/  17:00
<corvus> clarkb: did you find where 578fe1e26cf84ba7a86454315c1b4eb1 ran?  17:01
<clarkb> corvus: I did not  17:02
<corvus> i'll look  17:02
<corvus> apparently nowhere  17:04
<corvus> sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/merger-debug.log' 'zm*'  17:04
<corvus> sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/executor-debug.log' 'ze*'  17:04
<corvus> both return nothing  17:04
*** ysandeep|dinner is now known as ysandeep  17:05
<clarkb> I agree I grepped the same uuid across mergers and executors and got no results  17:05
<clarkb> need to double check file rotation though  17:05
<clarkb> looks like merger's .log.1 is from yesterday UTC so wouldn't have it  17:06
<clarkb> same thing with executors  17:06
<corvus> i'm pretty sure we log the job id right as we pull it off the queue  17:07
<fungi> unrelated, looks like is down or unreachable, i see multiple jobs retried due to pre-run failures trying to update package lists  17:08
<fungi> can't ping it at all at the moment  17:08
<corvus> perhaps we've had other merger connectivity/uptime issues  17:08
<clarkb> fungi: is the server down? it may have crashed when we overloaded the hypervisors? If so doing a hard reboot is probably what it needs  17:09
<fungi> clarkb: yeah, not sure, i'm checking the nova api now  17:09
<fungi> to see what state it thinks the instance is in  17:09
<clarkb> corvus: is it possible that the cancel happened such that the completion occurred all on the scheduler side before a merger ever saw it?  17:09
<fungi> aha, it's in shutoff state, starting it and hopefully it will boot  17:10
<corvus> clarkb: i don't think we canceled 578fe1e26cf84ba7a86454315c1b4eb1  17:11
<clarkb> got it  17:11
<fungi> #status log Started via Nova API as it was in a SHUTOFF state  17:11
<opendevstatus> fungi: finished logging  17:11
<fungi> er, yes that seems to have gotten it back online again  17:12
<corvus> only zm06 has uptime < x days, so i wouldn't expect log entry data loss from anyone else  17:13
<corvus> oh, zm06 has log entries from as late as 12:47; our merge job was enqueued at 12:50... that's pretty close  17:14
<corvus> like, close enough it may have still been semi-functional at 12:50?  17:14
<fungi> entirely possible  17:14
<fungi> 12:25 was the last i saw anything in syslog but i didn't check the merger log  17:14
<fungi> and it's hard to tell from cacti since i've ceased to be able to zoom the timescale on the graphs as of a while back  17:15
<corvus> functional enough to pull a job off the queue, update its state to running, and then crash before finishing the job or flushing out log buffer pages to disk  17:15
<corvus> 12:47:30,773 is the last merger log entry  17:15
<clarkb> that seems plausible  17:16
<corvus> it's not conclusive, but it fits the facts we have...  17:16
<clarkb> why would it be marked completed though?  17:16
<corvus> that's part of what the cleanup does...  17:16
<corvus> so i think we've learned 2 things: 1) the cleanup routine works (yay).  2) we need a way to poke the scheduler and tell it that a merge job it was expecting failed.  17:17
<corvus> i think we may need to do something similar to what we do with build jobs and put a fake complete event into the queue  17:17
<clarkb> corvus: have the cleanup routine send that event you mean?  17:18
<clarkb> that makes sense to me  17:18
<corvus> yep; that's what the build cleanup does  17:18
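A hypothetical sketch of the fix corvus proposes: the cleanup routine already marks a lost merge request completed, but it should also inject a synthetic result event so the scheduler stops waiting. Class and field names here are illustrative, not Zuul's real API.

```python
import queue

# Sketch only: when the periodic cleanup finds a merge request whose
# merger died, marking the request completed is not enough -- it should
# also enqueue a fake completion event, like the build cleanup does.

class MergeRequest:
    def __init__(self, uuid):
        self.uuid = uuid
        self.state = "running"

def cleanup_lost_requests(lost, result_events):
    for req in lost:
        req.state = "completed"
        # The missing piece in this incident: tell the scheduler the
        # merge job failed so its queue item can move on.
        result_events.put({"uuid": req.uuid, "merged": False, "lost": True})

events = queue.Queue()
req = MergeRequest("578fe1e26cf84ba7a86454315c1b4eb1")
cleanup_lost_requests([req], events)
```

Without the synthetic event, the queue item stays in merge state PENDING forever, which matches the stuck BuildSet observed above.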
<corvus> clarkb: i think we have the info we need and at this point remediation is dequeue/enqueue.  17:18
<clarkb> corvus: cool I can do the dequeue enqueue  17:18
<corvus> i'll dig into fixing  17:19
*** jpena is now known as jpena|off  17:19
<fungi> would promote probably solve it?  17:19
<fungi> just reorder the queue (might even not actually have to reorder it if you tell it to simply promote the first item)  17:20
<fungi> solve as in get builds going again i mean  17:20
<fungi> clarkb: ^  17:20
<clarkb> it may have. I've done the dequeue enqueue and stuck it at the end of the queue  17:21
<clarkb> or am attempting to anyway, it hasn't finished the enqueue yet  17:21
<clarkb> I think that is because it is busy handling the events for the other changes that suddenly became viable  17:22
<clarkb> yup there it goes all done  17:22
<opendevreview> Merged opendev/system-config master: Run daily backups of nodepool zk image data
*** ysandeep is now known as ysandeep|away  18:04
<clarkb> fungi: is the next step in the logo asset management process. I think the earlier changes in that stack are likely to be safe to land? Definitely review them for correctness and potential release impacts I guess  18:22
<fungi> sure, i could use a break anyway, will take a look while i have queries underway  18:26
<opendevreview> Merged opendev/system-config master: gitea: cleanup logo assets
<opendevreview> Merged opendev/system-config master: docs: use logo from assets
<clarkb> corvus: fungi: it just occurred to me that putting the nodepool image data backups in /var/log/nodepool means we'll serve them with the http server on those hosts. There isn't anything sensitive in that though right? Should we move them in case that changes?  19:55
<clarkb> I'm running one of the exports into /tmp in the container to look at it  19:57
<clarkb> fungi: corvus: ianw: ya looking at the data it has info like when it was built, the python path, the username associated with the image, the cloud side uuid and name, and the image format  20:01
<clarkb> I think that is ok to have exposed. But if you all can double check that that would be great  20:01
<clarkb> oh and we may only serve the image build logs not the actual service logs so this may be all for nothing anyway  20:03
<corvus> clarkb: i think we should not serve it.  i don't expect there to be any sensitive info, but i don't want to commit to that  20:04
<clarkb> corvus: ok I'm double checking what we actually serve now  20:04
<corvus> (for example, i'd like to move more authn info into zk for nodes; that might end up doing the same for images too (probably not, but it's conceivable))  20:04
<clarkb> we serve content of /var/log/nodepool/builds and /opt/nodepool_dib. Currently the backups go to /var/log/nodepool/nodepool-image-backup.json  20:05
<clarkb> I think that means we are good  20:05
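The conclusion above reduces to a simple path check: the backup file is only exposed if it lives under one of the directories the web server serves. The two served roots are the ones clarkb lists; the helper itself is just an illustration.

```python
from pathlib import PurePosixPath

# The directories actually served, per the discussion above.
SERVED_ROOTS = [PurePosixPath("/var/log/nodepool/builds"),
                PurePosixPath("/opt/nodepool_dib")]

def is_served(path):
    """True if path is one of the served roots or lies beneath one."""
    p = PurePosixPath(path)
    return any(root == p or root in p.parents for root in SERVED_ROOTS)

backup_exposed = is_served("/var/log/nodepool/nodepool-image-backup.json")
build_log_exposed = is_served("/var/log/nodepool/builds/some-image.log")
# backup_exposed is False, build_log_exposed is True
```

The backup sits in /var/log/nodepool but not under /var/log/nodepool/builds, so it is not reachable through the web server.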
<fungi> yeah, seems fine then  20:25
<clarkb> One of the things on my todo list is restarting gerrit to pick up the replication timeouts. We have a fair number of leaked replications again which I'll clean up now. Any sense for when a good time to restart gerrit would be?  20:26
<clarkb> I think we should wait until we've got caught up replication queues too  20:28
<clarkb> also I'm noticing they always seem to happen around 1300-1800 UTC  20:32
<clarkb> either that means I'm checking after they have timed out hours later from getting stuck earlier or something specific about those times is causing us issues?  20:32
<clarkb> I'm thinking it might be best to do the restart when we can monitor gerrit and ideally through that period of time. Maybe do a restart early tomorrow as a result?  20:35
<clarkb> the backlog is cleared now fwiw  20:36
*** dviroel is now known as dviroel|out  21:09
<clarkb> Note it appears the paste server is having trouble again  21:32
<clarkb> cloud reports they are investigating and will let us know when we can expect to recover things  21:33
<clarkb> Last time they sent us a nice email suggesting fsck and the related recovery path so I expect we'll get similar this time around too  21:35
<fungi> yeah, basic steps are 1. down the containers, 2. umount the fs, 3. fsck it, 4. reboot, 5. log back in and up the containers again  21:55
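The five steps fungi lists, captured as a pure function that builds the command sequence without running anything. The device path, mountpoint, and compose file name are illustrative assumptions (the log only mentions /var/lib/lodgeit/mariadb living on the affected volume).

```python
# Sketch of the recovery runbook described above; builds argv lists
# only, it does not execute anything. Paths are assumptions.

def recovery_plan(device, mountpoint, compose_file):
    return [
        ["docker-compose", "-f", compose_file, "down"],      # 1. stop containers
        ["umount", mountpoint],                              # 2. unmount the fs
        ["fsck", "-y", device],                              # 3. check it
        ["reboot"],                                          # 4. reboot
        ["docker-compose", "-f", compose_file, "up", "-d"],  # 5. containers back up
    ]

plan = recovery_plan("/dev/xvdb", "/var/lib/lodgeit", "docker-compose.yaml")
```

As the exchange below confirms, fsck here usually just reports clean; the reboot is what clears any lingering i/o error state.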
<clarkb> ok email saying it should be happy has arrived  22:12
<clarkb> fungi: just `fsck $path` after umounting?  22:14
<fungi> it'll probably report clean  22:14
<fungi> fsck $devpath  22:14
<clarkb> yup it did  22:14
<clarkb> I'll reboot now then up the containers again  22:15
<clarkb> seems to be working. I can load an older paste and create a new one  22:17
<TheJulia> out of curiosity, does opendev track/trend cpu utilization/iowait hit on test vms?  23:26
<ianw> TheJulia: devstack runs a dstat service in the background that can be graphed, but not generally  23:26
<TheJulia> i ask because I've got some cases where I've spotted unit tests taking 1000 seconds where my 10 year old desktop runs them in <200 seconds  23:27
<fungi> were you able to correlate those runs to a particular provider?  23:27
<fungi> i get the impression some of our providers may sort of oversubscribe our disk i/o bandwidth in particular, and when our test volumes get really high all our test nodes in that location wind up fighting one another over limited disk bandwidth  23:29
<TheJulia> one moment  23:29
<TheJulia> inmotion-iad3, 600 seconds on one job, i think the other one was like a thousand seconds  23:31
<TheJulia> examples are slim since one is a gate fix with neutron I'm trying to get merged and... well.. yeah :(  23:33
<TheJulia> at least, a workaround for ironic to keep moving forward  23:33
<fungi> we did just ramp up the usage on that provider over the weekend, so it's untested territory. early today we were oom'ing bits of the infrastructure there but scaled it back as of roughly 14:00 utc  23:34
<fungi> it seemed more stable after that at least  23:34
<fungi> what time did the really long test duration happen?  23:34
<TheJulia> uhh, within the last 2 hours  23:37
<fungi> okay, thanks that helps  23:39
* TheJulia steps away for a little while to ponder dinner  23:39

Generated by 2.17.2 by Marius Gedminas - find it at!