19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Jun 29 19:01:06 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-June/000262.html Our Agenda
19:01:22 <clarkb> #topic Announcements
19:01:54 <clarkb> No real announcements other than my life is returning to its normally scheduled day-to-day, so I'll be around at typical times now
19:02:06 <clarkb> The one exception to that is Monday, which is apparently an observed holiday here
19:02:31 <fungi> yes, the one where citizens endeavor to celebrate the independence of their nation by blowing up a small piece of it
19:02:51 <diablo_rojo> o/
19:02:51 <fungi> always a fun occasion
19:02:53 <clarkb> fungi: yup, but also this year I think we are declaring the pandemic is over here and we should remove all precautions
19:03:07 <fungi> blowing up in more ways than one, in that case
19:03:35 <clarkb> But I'll be around Tuesday and we'll have a meeting as usual
19:03:41 <clarkb> #topic Actions from last meeting
19:03:47 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-06-15-19.01.txt minutes from last meeting
19:04:30 <clarkb> I have not brought up the ELK situation with openstack leadership yet. diablo_rojo fyi I intend to do that when I find time in the near future. Mostly just to plan out what we are doing next as far as wind down goes
19:04:40 <clarkb> #action clarkb Followup with OpenStack on ELK retirement
19:04:50 <clarkb> ianw: have ppc packages been cleaned up from centos mirrors?
19:04:58 <fungi> though they're presenting a call to action for it to the board of directors tomorrow
19:05:07 <diablo_rojo> makes sense to me.
19:05:16 <fungi> (the elastic recheck support request, i mean)
19:05:30 <clarkb> fungi: yup, I don't think we have to say "it's turning off tomorrow"; more of a "we are doing these things, you are doing those things, when is a reasonable time to say it's dead or not"
19:05:44 <clarkb> and start to create the longer term expectations
19:05:46 <ianw> clarkb: yep, that was done https://review.opendev.org/c/opendev/system-config/+/797365
19:06:04 <clarkb> ianw: excellent thanks!
19:06:34 <clarkb> and I don't think a spec for a Prometheus replacement for Cacti has been written yet either. I'm mostly keeping this on the list because i think it is a good idea and keeping it visible can only help make it happen :)
19:06:44 <clarkb> #action someone write spec to replace Cacti with Prometheus
19:07:02 <fungi> also, while it didn't get flagged as an action item it effectively was one:
19:07:05 <fungi> #link https://review.opendev.org/797990 Stop updating Gerrit RDBMS for repo renames
19:07:33 <fungi> now i can stop forgetting to remember to do that
19:07:42 <clarkb> fungi: great, I'll have to give that a review (I've been on a review push myself the last few days trying to catch up on all the awesome work everyone has been doing)
19:08:15 <clarkb> #topic Topics
19:08:22 <clarkb> #topic Eavesdrop and Limnoria
19:08:47 <clarkb> We discovered there was a bug in the channel log conversion from raw text logs to html that may have explained the lag people noticed in those files
19:09:01 <clarkb> basically we ran the conversion once an hour instead of every 15 minutes. Fungi wrote a fix for that.
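The interval bug described above comes down to a cron schedule. A minimal before/after sketch, with a placeholder script name (the real job is managed from system-config):

    # hypothetical script name; the buggy entry ran only at minute 0 of every hour
    0 * * * * /usr/local/bin/convert-irc-logs-to-html
    # the intended schedule runs every 15 minutes
    */15 * * * * /usr/local/bin/convert-irc-logs-to-html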
19:09:08 <fungi> and it merged
19:09:20 <fungi> so should be back to behaving normally now
19:09:29 <clarkb> Would be good to keep an eye out for any new reports of lag in those logs, but I think we can call it fixed now based on what we saw timestamp wise yesterday
19:09:32 <clarkb> ++
19:09:57 <ianw> sorry about that, missed a * from the old job :/
19:10:00 <fungi> that was the new lag, by the way, the old lag before that was related to flushing files
19:10:12 <fungi> so we actually had two lag sources playing off one another
19:10:44 <clarkb> ah cool I wasn't sure if we saw lag in the text files previously or only html
19:10:55 <clarkb> the text files seemed happy yesterday when we looked, at least, and then we fixed the html side
19:11:33 <clarkb> #topic Gerrit Account Cleanup
19:11:51 <clarkb> I'm hoping to find time for this among everything else and deactivate those accounts whose external ids we'll delete later
19:12:07 <clarkb> fungi: you started to look at that more closely, have you had a chance to do a sufficient sampling to be comfortable with the list?
19:13:02 <fungi> yes, my spot-checking didn't turn up any concerns
19:13:36 <clarkb> great, I'll try to pencil this in for the end of the week then and do the account retirement/deactivation; then in a few weeks we can do the external id deletions for all those that don't complain (and none should)
19:14:04 <clarkb> #topic Review Upgrade
19:14:15 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-2021 Upgrade Checklist
19:14:26 <clarkb> The agenda says this document is ready for review. infra-root please take a look at it
19:14:42 <clarkb> ianw: does the ipv6 problem that recently happened put a pause on this while we sort that out?
19:15:46 <ianw> i'm not sure, i rebooted the host and it came back
19:16:09 <ianw> i know there was some known issue at some point that required a reboot prior, it might have been broken then
19:16:24 <clarkb> ianw: the issue that happened on the cloud side?
19:16:58 <clarkb> Considering that we do have the option of removing the AAAA record from DNS temporarily if necessary, I suspect this isn't critical. But others may feel more strongly about ipv6
19:17:19 <fungi> there was a host outage and reboot/migration forced at one point, but i don't recall how long ago
19:17:40 <fungi> and probably didn't track it closely since the server was not yet in production
19:17:41 <ianw> right, that feels like the sort of thing that duplicate addresses might pop up in
19:17:46 <clarkb> it happened the weekend after I did all those focal reboots
19:17:58 <clarkb> I remember because I delayed review02's reboot and then vexxhost took care of it for me :)
19:18:12 <fungi> ahh, right, mnaser let us know about it, could find a more precise time in irc logs
19:18:18 <clarkb> and ya that seems like a possibility if there was a migration with two instances out there fighting over arp
19:18:38 <clarkb> (or even just not properly flushing the router's tables first)
19:19:08 <fungi> well, DAD (duplicate address detection) operates on the server seeing evidence of a conflict
19:19:29 <fungi> so presumably there really were two systems trying to use the same v6 address at the same moment
19:19:32 <clarkb> got it
19:19:40 <ianw> anyway, if we can work on that checklist, i'm happy to maybe do this on a .au monday morning. that's usually a very quiet time
19:20:07 <ianw> i'm not sure if we could be ready for the 5th, but that would be even quieter
19:20:08 <fungi> will do, thanks!
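On the review02 IPv6 question above: DAD failures are recorded against the interface, so a quick way to see whether the kernel actually flagged a conflict is to look for the dadfailed flag. A sketch only; the interface name is a placeholder:

    # interface name is a placeholder; look for addresses marked "dadfailed"
    ip -6 addr show dev ens3 | grep -i dadfailed
    # the kernel usually logs the conflict as well
    dmesg | grep -i "duplicate address"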
19:20:18 <clarkb> ianw: yup I'll need to add that to my list of reviews for today. And I can usually do .au morning as well, since that overlaps with my afternoon/evening without too much pain
19:20:37 <clarkb> ianw: I think your suggested date of the 19th is probably reasonable
19:20:57 <clarkb> that way we can announce it with a couple of weeks of notice too (so that firewall rules can be updated in various places if necessary)
19:21:12 <clarkb> maybe plan to send that out in a couple of days after we have a chance to double check your checklist
19:21:17 <ianw> the 12th maybe too, although i'll be out a day or two before that (still deciding on plans wrt. lockdowns, etc.)
19:22:31 <clarkb> I like giving a bit of notice for this and the 19th feels like a good balance between too little and too much
19:22:41 <clarkb> infra-root ^ feel free to weigh in though
19:23:07 <fungi> in the past we've announced the new ip addresses somewhat in advance
19:23:33 <clarkb> yes in the past we've tried to do ~4 weeks iirc
19:23:37 <fungi> since a number of companies maintain firewall exceptions allowing their employees or ci systems to connect
19:23:48 <clarkb> but we also had more companies with strict firewall rules than we have today (or at least they don't complain as much anymore)
19:23:56 <ianw> ok, i can construct a notification for that soon then, as i don't see any reason we'll change the ip and reverse dns is set up too
19:24:04 <fungi> right, i do think 4 weeks is probably excessive today
19:24:26 <fungi> but if we can give them a heads up, sooner would be better than later
19:24:30 <clarkb> ++
19:24:58 <clarkb> we could even advertise the new IPs with a no sooner than X date
19:25:07 <clarkb> then they can add firewall rules and keep the old one in place until we do the switch
19:25:17 <clarkb> but the 19th seems like a good option to me.
19:25:48 <clarkb> should cross check against release schedules for various projects but I think that is a relatively quiet time
19:25:56 <clarkb> Anything else on the review upgrade topic?
19:26:26 <ianw> not really, i just want to get the checklist as detailed as possible
19:26:26 <fungi> i got nothin'
19:26:45 <fungi> thanks for organizing this, ianw!
19:26:46 <clarkb> #topic Listserv upgrades
19:26:49 <clarkb> ++ thanks!
19:27:04 <clarkb> I've somewhat stalled out on this and worry I've got a number of other tasks that are just as or more important fighting for time
19:27:29 <clarkb> If anyone else wants to boot the test node and run through an upgrade on it, I've already started notes on an etherpad somewhere I should dig up again. But if not I'll keep this on my list and try to get to it when I can
19:28:00 <clarkb> Mostly this is a heads up that I'm probably not getting to it this week. Hopefully next
19:28:26 <clarkb> #topic Draft matrix spec
19:28:36 <clarkb> #link https://review.opendev.org/796156 Draft matrix spec
19:28:51 <clarkb> I reached out to EMS (Element Matrix Services) today through a contact that corvus had
19:29:10 <clarkb> Their day was largely already over but they said they will try to schedule a call with me tomorrow.
19:29:56 <clarkb> I suspect that corvus would be interested in being on that call. Is anyone else interested too? We'll be overlapping with the Pacific timezone and Europe so the window for that isn't very large
19:30:40 <corvus> thanks! i'm hoping we can narrow the options down and revise the spec with something more concrete there
19:30:41 <clarkb> I suspect this initial conversation will be super high level and not incredibly important for everyone to be on. But I'm happy to include others if there is interest
19:30:52 <clarkb> corvus: ++
19:31:03 <fungi> i can be on the call, but am happy to entrust the discussion to the two of you
19:32:08 <clarkb> alright I'll see what they say schedule wise tomorrow
19:32:12 <clarkb> #topic gitea01 backups
19:32:28 <clarkb> Not sure if anyone has looked into this yet but gitea01 seems to be failing to back up to one of our two backup targets
19:32:54 <ianw> is it somewhat random?
19:33:05 <clarkb> Thought I would bring it up here to ensure it wasn't forgotten. I don't think this is super urgent as we haven't made any recent project renames (which would update the db tables that we want to back up)
19:33:08 <fungi> i haven't checked the logs, just noticed the notifications to the root inbox
19:33:11 <clarkb> ianw: no it seems to happen consistently each day
19:33:13 <fungi> seems like it's consistently every day
19:33:24 <clarkb> the consistency is why I believe only one backup target is affected
19:33:30 <clarkb> (otherwise we'd see multiple timestamps?)
19:33:39 <ianw> i'm sure it's mysql dropping right?
19:33:44 <fungi> appears to have started on 2021-06-12
19:34:34 <clarkb> ianw: I haven't even dug in that far, but probably a good guess
19:34:45 <mordred> clarkb: (sorry, I'd also love to be on the matrix call, but obviously don't block on me)
19:34:53 <ianw> http://paste.openstack.org/show/807046/
19:35:28 <fungi> socket timeouts maybe?
19:35:41 <clarkb> mordred: noted
19:35:43 <fungi> i wonder if the connection goes idle waiting on the query to complete
19:35:56 <ianw> but only to the vexxhost backup
19:36:12 <fungi> which implies some router in that path dropping state prematurely
19:36:27 <fungi> or nat if we're doing a vip
19:36:29 <ianw> and this runs in vexxhost, right? so the external further-away rax backup is working
19:36:40 <clarkb> yup gitea01 is in sjc vexxhost
19:36:44 <clarkb> and the mysql is localhost
19:37:11 <fungi> oh, it's vexx-to-vexx dropping? hmm... yeah that's strange
19:37:23 <fungi> and same region presumably
19:37:36 <ianw> 64 bytes from 2604:e100:1:0:f816:3eff:fe83:a5e5 (2604:e100:1:0:f816:3eff:fe83:a5e5): icmp_seq=6 ttl=47 time=72.0 ms
19:37:44 <ianw> 64 bytes from backup01.ord.rax.opendev.org (2001:4801:7825:103:be76:4eff:fe10:1b1): icmp_seq=3 ttl=52 time=49.9 ms
19:37:51 <ianw> the ping to rax seems lower
19:37:57 <fungi> also surprising
19:38:16 <clarkb> if the backup server is in montreal then that would make sense
19:38:25 <clarkb> since ord is slightly closer to sjc than montreal
19:38:47 <clarkb> anyway we don't have to do live debugging in the meeting. I just wanted to bring it up as a not super urgent issue, but one that should probably be addressed
19:39:00 <clarkb> (the db backups in both sites should be complete until we do a project rename)
19:39:05 <fungi> i thought he was saying that higher rtt was locally within vexxhost
19:39:15 <fungi> but yeah, we can dig into it after the meeting
19:39:21 <clarkb> as it is project renames that update the redirects which live in the db
19:39:27 <ianw> this streams the output of mysqldump directly to the server
19:39:52 <clarkb> #topic Scheduling Project Renames
19:40:04 <ianw> so if anyone knows any timeout options for that, let me know :)
19:40:08 <clarkb> Let's move on and then we can discuss further at the end or eat lunch/breakfast/dinner :)
19:40:21 <fungi> in theory we can "just do it" now that the rename playbook no longer tries to update the nonexistent mysql db
19:40:44 <clarkb> For project renames do we want to try and incorporate that into the server move? My preference would be that maybe we do the renames the week after, once we're settled into the new server, and not try to overdo it
19:40:57 <fungi> i don't think we had any other pending blockers besides actual scheduling anyway
19:40:59 <clarkb> fungi linked one of the changes we need in order to do renames
19:41:04 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/797990/
19:41:20 <fungi> yeah, once that merges i mean
19:41:47 <clarkb> Anyone have a concern with doing the renames a week after the move?
19:42:05 <clarkb> That should probably be enough time to be settled in on the new server and if not we can always reschedule
19:42:11 <ianw> ++
19:42:12 <fungi> wfm
19:42:15 <clarkb> but that gives us a time frame to tell people to get their requests in by
19:42:23 <clarkb> great
19:42:39 <fungi> and also a window to do any non-urgent post-move config tweaks
19:42:44 <clarkb> ++
19:42:58 <fungi> in case we spot things which need adjusting
19:43:47 <clarkb> #topic Open Discussion
19:44:21 <clarkb> Anything else to bring up?
19:44:41 <diablo_rojo> I think I have the container mostly set up for the ptgbot?
19:44:53 <clarkb> diablo_rojo: oh cool are there changes that need review?
19:44:54 <diablo_rojo> oh. failing zuul though.
19:45:23 <fungi> on the oftc migration wrap-up, i have an infra manual change which needs reviewing:
19:45:24 <diablo_rojo> clarkb, just the one kinda? I haven't written the role yet for it. Started with setting up the container
19:45:25 <fungi> #link https://review.opendev.org/797531 Switch docs from referencing Freenode to OFTC
19:45:42 <clarkb> diablo_rojo: have a link?
19:45:59 <diablo_rojo> https://review.opendev.org/c/openstack/ptgbot/+/798025
19:46:29 <clarkb> great I'll try to take a look at that change too. Feel free to reach out about the failures too
19:47:10 <clarkb> fungi: that looks like a good one to get in ASAP to avoid any additional confusion that it may be causing
19:47:36 <fungi> there was some discussion between other reviewers about adjustments, so more feedback on those preferences would be appreciated
19:48:04 <ianw> diablo_rojo: i think you've got an openstack that should be an opendev at first glance: FileNotFoundError: [Errno 2] No such file or directory: '/home/zuul/src/openstack.org/opendev/ptgbot'
19:48:46 <diablo_rojo> Oh I thought I had that as opendev originally.
19:48:53 <diablo_rojo> I can change that back
19:49:15 <ianw> i think it has a high chance of working with that
19:49:21 <diablo_rojo> Sweet.
19:49:24 <diablo_rojo> Will do that now.
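On ianw's earlier gitea01 backup question about timeout options for the streamed mysqldump: the MySQL server-side network timeouts are one knob to look at when a long-running dump connection drops mid-stream. A sketch only; these are real MySQL variables, but the values are arbitrary and the actual backup wrapper lives in system-config:

    # show the current network timeouts on the gitea01 database
    mysql -e "SHOW VARIABLES LIKE 'net_%_timeout';"
    # raise them if the remote end stalls long enough for the connection to drop
    mysql -e "SET GLOBAL net_write_timeout = 600; SET GLOBAL net_read_timeout = 600;"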
19:49:36 <ianw> speaking of building images for external projects
19:49:38 <ianw> #link https://review.opendev.org/c/openstack/project-config/+/798413
19:50:03 <ianw> is there a reason lodgeit isn't in openstack? i can't reference its image build jobs from system-config jobs, so can't do a speculative build of the image
19:50:09 <fungi> yeah, the ptgbot repo is openstack/ptgbot
19:50:25 <fungi> the puppet-ptgbot repo we'll be retiring is opendev/puppet-ptgbot
19:50:29 <fungi> different namespaces
19:50:32 <ianw> yeah, i think "opendev.org/openstack/ptgbot" is the path
19:50:32 <clarkb> ianw: no I think it was one of the very first moves out to opendev and we probably just figured it was fine to be completely separate
19:50:45 <clarkb> ianw: we've learned some stuff since then
19:51:13 <ianw> ok, if we could add it with that review that would be helpful :)
19:51:15 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/798400
19:51:16 <clarkb> ianw: you may need a null include for that repo though
19:51:26 <clarkb> ianw: since its jobs are expected to be handled in the opendev tenant
19:51:45 <clarkb> include: [] is what we do for gerrit just above in your change
19:51:57 <clarkb> corvus: ^ can probably confirm that
19:52:00 <fungi> yeah, i think the expectation was that the rest would be moving to the opendev tenant in time, and then we could interlink them
19:52:16 <ianw> https://104.130.239.208/ is a held node that is working
19:52:35 <fungi> i've managed to move some more leaf repos into the opendev tenant, but things heavily integrated with system-config or openstack/project-config are harder
19:53:24 <ianw> but there is some sort of db timeout weirdness. when you submit, you can see in the network window it gets redirected to the new paste but then it seems to take 60s for the query to return
19:53:43 <ianw> i'm not yet sure if it's my janky hand-crafted server there or something systematic
19:53:52 <ianw> suggestions welcome
19:54:00 <clarkb> ianw: if you hack /etc/hosts locally wouldn't that avoid any redirect problems?
19:54:27 <clarkb> might help isolate things a bit. But I doubt that is a solution
19:55:04 <ianw> i don't think it is name resolution; it really seems like the db, or something in sqlalchemy, takes that long to return
19:55:17 <ianw> but then it does, and further queries work fine
19:55:34 <clarkb> it only happens the first time?
19:56:43 <clarkb> We are just about at time. I need lunch and then I have a large stack of changes and etherpads to review :) Thank you everyone! We'll be back here same time and place next week. As always feel free to reach out to us anytime on the mailing list or in #opendev
19:56:55 <ianw> when you paste a new ... paste. anyway, yeah, chat in #opendev
19:57:06 <fungi> thanks clarkb!
19:57:26 <clarkb> ya sorry, realized we should move along (not going to lie, in part because I am now very hungry :) )
19:57:29 <clarkb> #endmeeting
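For reference, the "include: []" pattern mentioned in the lodgeit discussion is Zuul tenant configuration telling a tenant to load no job configuration from that repo, so its jobs stay defined in the opendev tenant. A rough sketch only; the tenant name, connection name, and surrounding structure are assumed here, and the real tenant config lives in project-config:

    # abbreviated sketch of a Zuul tenant entry with a null include
    - tenant:
        name: openstack
        source:
          gerrit:
            untrusted-projects:
              - opendev/lodgeit:
                  include: []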