19:01:11 <clarkb> #startmeeting infra
19:01:12 <openstack> Meeting started Tue Aug 20 19:01:11 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:12 <ianw> o/
19:01:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:15 <openstack> The meeting name has been set to 'infra'
19:01:20 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2019-August/006452.html Our Agenda
19:01:41 <clarkb> thank you ianw for running the meeting last week
19:02:07 <clarkb> #topic Announcements
19:02:31 <clarkb> This wasn't on the agenda, but you have a handful of hours left to vote on the openstack U naming poll if you would like to do so before it ends
19:03:06 <clarkb> #topic Actions from last meeting
19:03:12 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-08-13-19.01.txt minutes from last meeting
19:03:32 <clarkb> I didn't see any actions in the meeting notes
19:03:39 <clarkb> ianw: ^ anything to point out here before we move on?
19:03:56 <ianw> no, it was fairly quiet
19:04:22 <clarkb> #topic Priority Efforts
19:04:34 <clarkb> #topic OpenDev
19:04:54 <clarkb> I've made some minor progress on having gitea time out requests.
19:05:00 <clarkb> #link https://github.com/cboylan/gitea/commit/d11d4dab34f769f3ba4589bb938a2dbd09ff8b3a
19:05:34 <clarkb> It turns out that gitea's http framework is not directly compatible with golang's http lib because they use a context type that doesn't conform to the standard
19:05:55 <clarkb> They do drag along the underlying http stdlib request's context though, so we can update that and get it to do things
19:06:44 <corvus> oh "neat"
19:06:47 <clarkb> However the http.TimeoutHandler is a bit more robust than what I have there and I'm not sure how much of that I can replicate within the macaron framework, so this might get clunky (probably largely due to my lack of go knowledge)
19:07:21 <clarkb> in any case that now builds and seems to work in the job we have. Next I need to verify that it times out long requests as expected
19:07:41 * diablo_rojo sneaks in late
19:08:07 <clarkb> corvus: ya my understanding of the correct way to implement that is to have a context type that implements the standard interface while adding the bits you want in addition to that
19:08:20 <clarkb> corvus: then you can use stdlib handlers like the timeout handler but also track application specific info
19:08:42 <clarkb> instead they track the request as an attribute of the context object, and that request object has the standard matching context object
19:09:50 <clarkb> I was also thinking that I might want to file an issue with them and share what I have so far and see if they can point me in a better direction if one exists
19:10:33 <clarkb> any other opendev specific items to talk about before we move on?
19:10:47 <fungi> too bad github doesn't have a "wip" flag for pull requests
19:11:29 <corvus> the gitea folks use [wip] in the summary (much like we do)
19:11:35 <fungi> ahh, heh
19:12:08 <fungi> then yeah, seeking their input with what you already have sounds like a great option
19:14:24 <clarkb> #topic Update Config Management
19:15:04 <clarkb> corvus: should we talk about the intermediate registry here?
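(For reference on the gitea timeout discussion above: a minimal sketch of the pattern clarkb describes, assuming a plain net/http handler stands in for gitea's macaron handlers. The helper name wrapWithDeadline, the route, and the durations are hypothetical; only http.TimeoutHandler and the per-request context come from the discussion.)

    package main

    import (
        "context"
        "net/http"
        "time"
    )

    // wrapWithDeadline attaches a per-request deadline to the stdlib request's
    // context, roughly what the linked commit does inside the macaron layer.
    // Handlers still have to notice ctx.Err() themselves.
    func wrapWithDeadline(next http.Handler, d time.Duration) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), d)
            defer cancel()
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
            select {
            case <-time.After(30 * time.Second): // stand-in for a long git operation
                w.Write([]byte("finished\n"))
            case <-r.Context().Done(): // deadline from wrapWithDeadline fired
                return
            }
        })
        // http.TimeoutHandler is the more robust stdlib option mentioned above:
        // it cuts off the ResponseWriter and replies 503 on its own after the
        // duration elapses, whether or not the inner handler checks its context.
        handler := http.TimeoutHandler(wrapWithDeadline(mux, 10*time.Second), 10*time.Second, "request timed out")
        http.ListenAndServe(":8080", handler)
    }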
19:15:20 <clarkb> aiui the swift backend for the intermediate docker registry loses blobs
19:15:35 <clarkb> and then our jobs that rely on working docker images fail because they can't get the layer(s) they need
19:15:49 <corvus> oh yeah
19:16:06 <corvus> we're seeing this even in moderate use of the intermediate registry
19:16:16 <corvus> especially if there's a patch series
19:16:29 <corvus> (it could be happening with image promotion too, we may just not notice as much)
19:16:53 <corvus> the logical thing to do would be intensive debugging of the problem resulting in a patch to docker
19:17:08 <fungi> so the registry itself is losing track of the swift objects? i'd be surprised if swift is losing those itself
19:17:24 <corvus> afaict, the registry is inserting 0-byte objects in swift
19:17:30 <fungi> ouch
19:17:33 <corvus> no idea what's up with that
19:17:55 <corvus> the nice thing is it's easy to verify we're seeing the same problem (all zero byte objects have the same sha256sum value :)
19:18:15 <fungi> indeed
19:18:40 <corvus> there's a lot of things we would like the registry to do which it doesn't -- authentication only for writes, pass-through to dockerhub, support for pass-through to multiple registries...
19:19:17 <corvus> so i'm seriously inclined to solve this by writing a new registry shadowing system from scratch
19:19:43 <clarkb> I remember pulp saying they support docker image registries as one of their archives
19:19:50 <clarkb> (that might be another option to look at)
19:20:07 <corvus> ooh, well remembered
19:20:23 <clarkb> https://pulpproject.org/ for those that may not be familiar
19:21:06 <clarkb> https://docs.pulpproject.org/plugins/crane/index.html is something pulp points at
19:21:16 <clarkb> that may be too simple for what we want though (crane, not pulp)
19:21:26 <corvus> yeah, we do need to write to it
19:22:43 <clarkb> if anyone else knows of alternative options they are probably worth sharing. Like what does openshift run?
19:24:05 <corvus> i believe we learned that running the openshift container registry does require running it in openshift
19:24:10 <clarkb> ah
19:24:46 <corvus> https://docs.openshift.com/container-platform/3.5/install_config/install/stand_alone_registry.html#install-config-installing-stand-alone-registry
19:25:16 <clarkb> Alright, we probably won't solve that problem in the meeting. But I wanted to call it out as it seems like the kind of problem where someone might already know of a preexisting solution (surely we aren't the only people that want to run a docker registry that reliably serves data)
19:25:44 <corvus> i mean, it'd be better for everyone if docker registry did support swift without flaws
19:25:50 <clarkb> ++
19:26:13 <corvus> but my enthusiasm for fixing that bug without addressing all the other things is limited
19:27:03 <corvus> (also, to be fair, i'm not sure we've eliminated the possibility that skopeo is to blame)
19:27:15 <corvus> (it seems highly unlikely though)
19:27:46 <corvus> i don't think the logging available is adequate
19:29:19 <clarkb> Anything else on this topic or should we move on?
19:30:16 <clarkb> sounds like that is it.
19:30:20 <clarkb> #topic Storyboard
19:30:45 <clarkb> diablo_rojo: fungi: you were both distracted with meetings all last week, but anything to bring up re storyboard?
19:31:07 <fungi> i got nuthin'
19:32:06 <diablo_rojo> I remembered to bother mordred twice about db stuff but I don't think he had time to do anything yet
19:32:11 <diablo_rojo> mordred, shall I keep poking?
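(On the zero-byte blobs corvus describes under the registry topic above: every empty object hashes to the same digest, which is what makes the broken layers easy to spot when listing the registry's swift container. A small sketch of that check; nothing here beyond the standard library.)

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    func main() {
        // All zero-byte blobs share this digest, so spotting the bad layers is
        // just a matter of looking for it in the container listing.
        sum := sha256.Sum256(nil)
        fmt.Printf("sha256 of an empty blob: %x\n", sum)
        // e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    }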
19:32:46 <diablo_rojo> I didn't have anything else
19:33:06 <clarkb> #topic General Topics
19:33:20 <clarkb> #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:33:29 <clarkb> job logs are now in swift
19:34:05 <clarkb> I think that leaves tarballs on static.o.o?
19:34:20 <clarkb> corvus had set up afs based tarballs.opendev.org
19:34:22 <fungi> well, it also leaves a bunch of our static sites content
19:34:30 <fungi> security, governance, releases, and so on
19:34:32 <clarkb> fungi: oh I thought that was all on files.openstack.org now
19:34:44 <clarkb> DNS says I'm wrong
19:34:50 <fungi> governance.openstack.org is an alias for static.openstack.org.
19:34:57 <fungi> et cetera
19:35:29 <fungi> also we still have some logs on static.o.o until they age out
19:35:38 <clarkb> /srv/static has ci election governance logs lost+found mirror old-docs-draft old-pypi release release.new releases reviewday security service-types sigs specs status tarballs tc uc
19:36:07 <clarkb> fungi: ya ~4 weeks iirc
19:36:40 <clarkb> One option is to upgrade the server with a much simpler lvm setup as we'll not need to worry about massive impact to job results
19:36:51 <clarkb> or we can try to push that content onto files.o.o instead
19:37:13 <fungi> some of those are dead (mirror, old-pypi, old-docs-draft, ...), some are just redirects (ci), some are mapped into subtrees of the same vhosts (governance, sigs, tc, uc)
19:37:34 <fungi> so the list looks more daunting than it really is
19:37:59 <ianw> i don't mind taking an action item to audit it all and report back with a list of work?
19:38:05 <fungi> also, yeah, we can pvmove everything left onto a single volume, for starters, and then swap it for a smaller volume if we want
19:38:06 <clarkb> ianw: that would be great
19:38:21 <fungi> thanks ianw!
19:38:32 <corvus> also, i think all the mechanicas are worked out, so moving them to afs probably wouldn't be too difficult
19:38:41 <corvus> mechanics even
19:38:58 <clarkb> #action ianw audit static.openstack.org webserver content and create a list of work to either get off the server or upgrade the server now that job logs are not hosted there (or won't be in 4 weeks)
19:39:33 <ianw> ++ that was what i thought when i had a quick poke a couple of weeks ago, but will take a more systematic look
19:39:34 <fungi> right, the fiddly bits will be redirecting or mapping openstack tarballs since they're published without a namespace prefix, doing something akin to root-marker with stuff like the governance site which is published by stitching together multiple repos...
19:40:36 <clarkb> The other related item on this is getting wiki-dev working
19:40:47 <clarkb> fungi: ^ I doubt you had much time for that last week. Anything to point out there?
19:40:52 <fungi> ianw: if you want to start by just tossing it all into an etherpad i'm happy to flag some stuff in the list too
19:41:07 <fungi> yeah, there's some wiki-dev updates actually
19:41:46 <fungi> first, i've updated the cname for wiki-dev.openstack.org to point to the new wiki-dev03 server for ease of testing. it's not like the old one was in perfect shape either
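(A quick way to check where these names actually point, matching the DNS observations above -- governance.openstack.org resolving as an alias for static.openstack.org, and the wiki-dev.openstack.org cname fungi just updated. The host names come from the discussion; the rest is a sketch, and the answers reflect whatever DNS says when you run it, not necessarily the 2019 layout.)

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Names taken from the meeting discussion above.
        for _, name := range []string{"governance.openstack.org", "wiki-dev.openstack.org"} {
            cname, err := net.LookupCNAME(name)
            if err != nil {
                fmt.Println(name, "lookup failed:", err)
                continue
            }
            fmt.Println(name, "->", cname)
        }
    }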
19:42:10 <fungi> (also because i got tired of editing my /etc/hosts on multiple clients)
19:42:28 <fungi> i thought openid was broken but it's actually working better than it did on the old wiki-dev01
19:43:14 <fungi> the reason i didn't realize that is that up in the top-right corner of https://wiki-dev.openstack.org/ the drop-down is doing language selection instead of login
19:43:31 <fungi> so need to figure out how to get that to point to the right thing
19:43:44 <clarkb> oh, a theming bug I guess?
19:43:48 <fungi> #link https://wiki-dev.openstack.org/wiki/Special:OpenIDLogin
19:43:59 <fungi> well, so it may be related to the next thing
19:44:25 <fungi> #link https://review.opendev.org/675713 Put image data in a parallel path to source code
19:44:53 <fungi> the old wiki-dev deployment was puppeting the installation onto a cinder volume we'd attached
19:45:21 <fungi> and then having mediawiki store its static content (image uploads and so on) into subtrees of that installation path
19:45:26 <ianw> ahh, was going to say i had a few broken images, so that fixes that?
19:45:38 <fungi> well, it's the first step in fixing it, yeah
19:45:56 <fungi> we were essentially mixing configuration-managed stuff with mw-managed stuff in the same tree
19:47:01 <fungi> so with the proposed change i want mediawiki to start managing its persistent file content into a separate path (which we can put in a cinder volume) and not have to cart around crufy from configuration-managed bits previously on other machines
19:47:14 <clarkb> sounds like a great idea
19:47:14 <fungi> er, cruft
19:47:29 <fungi> i've already got the new volume in place and formatted/mounted
19:47:35 <fungi> on the new wiki-dev03
19:47:53 <fungi> so once that change lands i'll rsync over the images tree and whatever else needs to go there
19:48:25 <fungi> and that will allow us to completely blow away the config-managed stuff any time we want and redeploy without risking loss of precious files
19:48:43 <fungi> also a related change...
19:49:00 <fungi> #link https://review.opendev.org/675733 Update to 1.28.x branch
19:49:33 <fungi> we'd already manually upgraded production wiki.o.o to 1.28
19:49:44 <clarkb> seems like config management should reflect that then
19:49:46 <fungi> so this brings the wiki-dev configuration management in line with production
19:50:03 <fungi> both of which badly need to be upgraded some more, but...
19:50:08 <clarkb> as a timecheck we have about 10 minutes left and a few more items on the agenda. Anything else urgent on this subject before we move on?
19:50:23 <fungi> having -dev at least on the same version will facilitate that
19:50:25 <fungi> yeah, we can move on
19:50:34 <fungi> anyway, reviews of those two changes most welcome
19:50:42 <clarkb> We've been having flaky afs publishing from mirror-update.opendev.org
19:50:58 <clarkb> ianw: it occurred to me that that host may be running the bionic openafs build which we know is broken?
19:51:06 <clarkb> ianw: maybe the proper fix there is to install from our ppa if we haven't already?
19:52:05 <ianw> it's worth checking but i think it is using the latest
19:52:32 <clarkb> ok, for those that might not be aware, we've had vos releases fail or run for weeks at a time, resulting in slow updates to the rsync'd mirrors we have
19:52:45 <clarkb> ianw and I were working on it yesterday. I need to catch up on that after the meeting
19:53:00 <ianw> ii openafs-client 1.8.3-1~bionic
19:53:15 <fungi> not the broken prerelease then
19:53:51 <clarkb> rules that out then
19:54:00 <clarkb> Other items really quickly:
19:54:03 <clarkb> #link https://review.opendev.org/#/c/675537 New backup server
19:54:04 <ianw> i think we keep an eye on it at this point; it might have been around the time i was restarting to get some audit data
19:54:20 <clarkb> ianw ^ has a new backup server ready to go if we can get reviews on that
19:54:36 <ianw> yeah, just wanted eyes on it because of the non-regular semantics with that host being kept out of ansible, as we discussed
19:54:53 <clarkb> ianw also updated dib with what we think will fix the limestone ipv4 issues
19:54:59 <clarkb> #link https://review.opendev.org/#/c/677410/ DIB fix for limestone ipv4 issues
19:55:11 <ianw> i can do a release and rebuild in a bit for that
19:55:19 <clarkb> ianw: note there was a tripleo job failing on that change
19:55:25 <clarkb> and apparently it was a known failure they had yet to fix
19:55:31 <clarkb> we might consider making that job non voting?
19:55:57 <ianw> if it doesn't work, there are options, but i think they'll require us updating how glean writes config in one way or another, comments in that change
19:56:10 <clarkb> And finally, feel free to start adding ideas for PTG topics on the etherpad
19:56:12 <clarkb> #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019
19:56:24 <ianw> (other than, you know, intense debugging of legacy areas of networkmanager)
19:56:25 <clarkb> I expect planning for that will start in earnest much closer to the event
19:56:51 <clarkb> ianw: it is odd that the bug has been around for so long too
19:57:02 <clarkb> it's clearly a fairly major problem they've had for a long time
19:57:15 <clarkb> I guess if the kernel is managing the interface NM doesn't want to step on its toes
19:57:24 <clarkb> #topic Open Discussion
19:57:29 <clarkb> we have about 2.5 minutes for anything else
19:57:29 <ianw> speaking of the afs thing before
19:57:32 <ianw> #link https://review.opendev.org/#/q/status:open+topic:openafs-reccomends
19:58:17 <shadiakiki> o/ hey there. I had sent a few emails on the mailing list about server sizing
19:58:18 <ianw> that was related to us not installing the correct openafs packages on new servers
19:59:01 <clarkb> shadiakiki: hello. I've been trying to keep up with responses (as has ianw, by the looks of it)
19:59:02 <shadiakiki> Just want to ask if it's a subject that's of interest for you in terms of cost savings
19:59:37 <clarkb> shadiakiki: as ianw pointed out, I think we tend to end up undersizing servers more than we oversize them
19:59:39 <shadiakiki> Thanks Clark. You guys have been very responsive. It's fantastic
19:59:56 <ianw> shadiakiki: hello! for mine, i'd say we're always open to contribution :)
20:00:31 <clarkb> our gitea backends and zuul executors could all be bigger probably. Thinking out loud here, it might be more useful for us to see where we should run larger instances
20:00:31 <ianw> in this respect, i think that with our new trend towards containerising, that might be the best place to start looking at this
20:00:45 <shadiakiki> Awesome! I founded my startup a few weeks ago to solve the issue of sizing for large infra. It'll be great if I can communicate with you guys from time to time
20:01:00 <clarkb> and we are at time
20:01:03 <clarkb> Thank you everyone
20:01:08 <fungi> it's good to note that while we try to be careful and cost-wise in how we use donated server resources, we're also a very small team with limited time to invest in complex solutions if the return on investment is minor
20:01:13 <fungi> thanks clarkb!
20:01:14 <clarkb> feel free to continue discussion in #openstack-infra
20:01:17 <clarkb> #endmeeting