Thursday, 2020-10-29

*** tosky has quit IRC00:06
*** hamalq has quit IRC00:18
ianwok, i think everything is running smoothly on the reprepro front now00:27
ianwi've got a change to convert it to releasing using ssh but i think leave that for a bit while it settles in00:27
ianw#status log reprepro mirroring moved to mirror-update.opendev.org.  logs now available at https://static.opendev.org/mirror/logs/00:29
openstackstatusianw: finished logging00:29
*** johnsom has quit IRC02:00
*** rpittau|afk has quit IRC02:00
*** TheJulia has quit IRC02:00
*** TheJulia has joined #opendev02:00
*** rpittau|afk has joined #opendev02:01
*** johnsom has joined #opendev02:01
*** ykarel|away has joined #opendev03:02
*** ykarel|away has quit IRC03:09
openstackgerritIan Wienand proposed opendev/system-config master: reprepro: fix apt-puppetlabs volume name  https://review.opendev.org/76027203:10
openstackgerritMerged opendev/system-config master: reprepro: fix apt-puppetlabs volume name  https://review.opendev.org/76027204:18
*** ykarel|away has joined #opendev04:45
*** ykarel|away is now known as ykarel04:47
openstackgerritMerged opendev/system-config master: Cleanup grafana.openstack.org  https://review.opendev.org/73962505:15
openstackgerritMerged openstack/diskimage-builder master: Allow processing 'focal' ubuntu release in lvm  https://review.opendev.org/76015605:30
*** marios has joined #opendev06:00
openstackgerritRico Lin proposed opendev/irc-meetings master: move heat meeting time to  Wednesday 1400 UTC  https://review.opendev.org/76028606:25
fricklerianw: do you also want to clean up the dns records for grafana02.openstack.org ?06:35
*** ysandeep|away is now known as ysandeep|ruck06:43
*** ralonsoh has joined #opendev06:47
*** DSpider has joined #opendev06:51
*** jaicaa has quit IRC07:00
*** jaicaa has joined #opendev07:01
*** sshnaidm|afk is now known as sshnaidm|rover07:01
sshnaidm|roverhi, is there a process how and what I can upload to tarballs.o.o?07:14
*** sboyron has joined #opendev07:18
*** eolivare has joined #opendev07:25
openstackgerritLajos Katona proposed openstack/project-config master: Add publish-to-pypi template to tap-as-a-service  https://review.opendev.org/76029307:50
*** slaweq has joined #opendev08:04
*** andrewbonney has joined #opendev08:09
*** marios has quit IRC08:09
*** ysandeep|ruck is now known as ysandeep|lunch08:17
AJaegerinfra-root, config-core, I'm on vacation the next 10 days - will only be occasionally online and will shutdown IRC now.08:28
AJaegersshnaidm|rover: tarballs.o.o gets uploaded as part of jobs, best explain what you're considering and we see whether it fits. It was established to have tarballs of releases and after each merge.08:29
fricklerAJaeger: enjoy your vacation08:29
sshnaidm|roverI'd like to upload there containers tarballs which will be downloaded in every job08:33
sshnaidm|rovercontainers tarballs will be uploaded once/twice a day max08:33
ianwAJaeger: have fun and stay covid safe!08:34
ianwsshnaidm|rover: it's more or less an append-only service.  it doesn't seem like the right place to put ephemeral things.  perhaps job artifacts would work?08:36
sshnaidm|roverianw, I see.. I was looking more for something that can serve multiple jobs, like "images server", where images will be updated periodically, but not too often08:39
sshnaidm|roverianw, not after each job08:39
ianwfrickler: yes, we should clear out that record too.  grafana.openstack.org though i think is a CNAME (and the cert covers it)08:40
*** rpittau|afk is now known as rpittau08:40
ianwsshnaidm|rover: well, are you familiar with the intermediate container our docker build jobs use?  perhaps that is sufficient to serve the same container to multiple jobs?08:41
*** AJaeger has quit IRC08:42
sshnaidm|roverianw, you mean passing registry to dependent jobs08:46
sshnaidm|rover?08:46
ianwsshnaidm|rover: yes, basically ; see https://zuul-ci.org/docs/zuul-jobs/docker-image.html08:46
ianwi have to head off, but i'm sure others can help.  see the system-config repo for examples08:47
sshnaidm|roverianw, I need all jobs in various repos and branches to use these images, in all patches08:47
sshnaidm|roverianw, ack, thanks08:47
*** marios has joined #opendev08:50
*** tosky has joined #opendev08:51
*** ysandeep|lunch is now known as ysandeep|ruck08:57
*** webmariner has quit IRC09:03
*** webmariner has joined #opendev09:03
*** webmariner has quit IRC09:08
*** DSpider has quit IRC09:45
fricklerslaweq: regarding https://bugs.launchpad.net/bugs/1902002 , I've seen this multiple times myself, but only for our limestone provider which has public v6 but only private v4 for instances and sometimes they even fail to get v4 completely09:48
openstackLaunchpad bug 1902002 in devstack "Fail to get default route device in CI jobs" [Undecided,New]09:48
fricklermaybe it is time to teach devstack to give v6 connectivity to instances instead of v4?09:49
fricklerhmm, from logstash it also seems to happen for other providers, but less frequently09:52
*** DSpider has joined #opendev09:54
frickleranother option would be to make sure that tests all work without external connectivity and then drop this part from devstack09:54
fricklerotherwise we could check for this situation in the pre phase and fail early, allowing for a retry and hopefully finding a better node09:55
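Editor's sketch: the "check in the pre phase and fail early" idea could look roughly like the following. It is a minimal sketch, not devstack's actual logic; it assumes a Linux node where the IPv4 routing table is readable at `/proc/net/route`, and the function names `default_route_device` and `check_node` are hypothetical.

```python
from typing import Optional

# Route flag bits as used in the Flags column of /proc/net/route.
RTF_UP = 0x0001
RTF_GATEWAY = 0x0002


def default_route_device(route_table: str) -> Optional[str]:
    """Return the interface of the first IPv4 default route, or None."""
    lines = route_table.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, destination, _gateway, flags = fields[:4]
        flag_bits = int(flags, 16)
        # A default route has destination 0.0.0.0 and goes through a gateway.
        if destination == "00000000" and flag_bits & RTF_UP and flag_bits & RTF_GATEWAY:
            return iface
    return None


def check_node(route_table_path: str = "/proc/net/route") -> None:
    """Exit non-zero when the node has no IPv4 default route."""
    with open(route_table_path) as handle:
        device = default_route_device(handle.read())
    if device is None:
        raise SystemExit("no IPv4 default route found; failing early")
    print(f"default route via {device}")
```

Calling `check_node()` from a pre-run step would make the job fail before any tests start, which is the point at which a retry on another node is cheapest.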
slaweqfrickler: yes, my first thought was that it may be related to one provider but from logstash it seems like it's not10:00
slaweqand it didn't happen (according to logstash) before 23.1010:00
slaweqbut that could be just missing older logs in logstash, idk10:01
slaweqalso it seems from devstack that this check of default gw is also needed for jobs where linuxbridge agent is used, maybe we can make it conditional in devstack?10:01
slaweqthat would workaround the issue for most of the jobs I think10:01
*** slaweq has quit IRC10:05
*** hashar has joined #opendev10:07
*** slaweq has joined #opendev10:12
*** ykarel has quit IRC10:17
*** ykarel has joined #opendev10:19
fricklerslaweq: that's a good idea for a first iteration workaround with low regression potential, I'm creating a patch for that now10:38
slaweqfrickler: ok, let me check that and propose patch10:38
slaweqthx for looking at this issue10:38
*** ykarel_ has joined #opendev10:52
*** ykarel has quit IRC10:55
*** ykarel_ is now known as ykarel11:06
*** lpetrut has joined #opendev11:59
openstackgerritBalazs Gibizer proposed opendev/elastic-recheck master: Add query for bug 1901739  https://review.opendev.org/75996712:33
openstackbug 1901739 in OpenStack Compute (nova) " libvirt.libvirtError: internal error: missing block job data for disk 'vda'" [High,Confirmed] https://launchpad.net/bugs/190173912:33
fungisshnaidm|rover: would you overwrite the images, or would they accumulate? we don't really have a pruning solution for the files on tarballs.o.o. also it's served from afs which we've tuned to handle mirrors of things like python wheels and distribution packages, but putting very large files in there could represent challenges. the tarballs volume specifically has around 170gb of files in it currently and a13:16
fungi400gb quota, to give you some idea of the capacity and content there right now (we can increase the quota on it if needed, but it's not unbounded)13:16
sshnaidm|roverfungi, I think we'll need like 40-50 GB and it's better to prune old files when uploading new ones, we definitely don't want them to append there13:28
openstackgerritMerged openstack/project-config master: Add publish-to-pypi template to tap-as-a-service  https://review.opendev.org/76029313:28
fungisshnaidm|rover: overwriting the old files each time you upload new ones wouldn't suffice, i guess?13:30
fungiit's worth noting that afs volume updates are effectively atomic at least, so you shouldn't have files getting served in a half-written state if a download happens during an overwrite13:30
sshnaidm|roverfungi, it will have different names, so just removing them would be fine13:31
sshnaidm|roverfungi, if I did this, I'd upload files to a new folder, change symlink and delete old folder13:31
sshnaidm|roverkind of13:31
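Editor's sketch: the "upload to a new folder, change symlink, delete old folder" pattern described above could be written roughly as below. This is a minimal sketch on a local POSIX filesystem only; it has not been checked against AFS or against the tarballs publication jobs, and `publish_release`, the `current` link name, and the directory layout are all hypothetical.

```python
import os
import shutil


def publish_release(base_dir: str, source_dir: str, release_name: str,
                    link_name: str = "current") -> str:
    """Copy source_dir to base_dir/release_name, repoint the link, drop the old copy."""
    new_path = os.path.join(base_dir, release_name)
    link_path = os.path.join(base_dir, link_name)

    # Remember the directory the link points at now so it can be removed later.
    old_path = os.path.realpath(link_path) if os.path.islink(link_path) else None

    # Step 1: upload into a fresh directory (release names are assumed unique).
    shutil.copytree(source_dir, new_path)

    # Step 2: create the new link under a temporary name and rename it into place.
    # The rename replaces the old link in one step on POSIX filesystems.
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_name, tmp_link)
    os.replace(tmp_link, link_path)

    # Step 3: delete the previous directory once nothing points at it.
    if old_path and os.path.isdir(old_path) and old_path != os.path.realpath(new_path):
        shutil.rmtree(old_path)
    return new_path
```

Consumers would always fetch through the `current` link, so they only ever see a complete old or a complete new directory.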
fungiwell, it's the "removing" that's the problem. we don't have anything which removes files from that site, though we do have common job patterns where the file being uploaded overwrites a previous version (notably, branch tip tarballs)13:31
sshnaidm|roverfungi, can we delete just a whole folder?13:32
fungithe uploads to it are indirect... the executor pulls files from the build node with rsync, and then uses rsync to copy those files into a path in afs. the rsync used in the second phase would probably need to include the --delete flag13:34
fungiso that it removes any unrecognized files in the destination path13:34
sshnaidm|roverI see.. not sure we can use it then13:35
fungibut file deletion isn't a distinct step, it would in theory need to be done with some feature of rsync during the copy phase13:35
fungithis is why overwrites are easier to deal with13:36
sshnaidm|roverfungi, and it's possible from jobs only, right?13:36
fungifor docs publication we have particularly complex rsync invocations which look for and ignore subtrees with special root marker files in them to avoid removing content handled by different builds13:36
sshnaidm|roverdo you have example for those? ^13:37
fungisshnaidm|rover: yeah, we don't really have a mechanism/workflow in place to grant kerberos credentials except for our automation and occasional systems administration work13:37
fungii'm looking for where the docs upload role is now, just a sec13:38
sshnaidm|roverfungi, so if I need to do this periodically, I can use only periodic jobs and it will be no more often than once a day?13:38
fungiyeah, periodic jobs could be used, in theory13:39
sshnaidm|rovermaybe I'm going in the wrong direction trying to use tarballs.o.o as a containers registry..13:39
fungirelease management ptg session is slowing down my browser, this will take time13:39
sshnaidm|roverfungi, sure, np13:40
sshnaidm|roverdon't you have a caching registry mirror? If not, I can help to set it up13:40
fungiwe do have caching proxies for dockerhub and maybe quay, is that what you're asking about?13:41
sshnaidm|roverfungi, no, I mean caching container images themselves: https://docs.docker.com/registry/recipes/mirror/13:41
fungisshnaidm|rover: other than the intermediate registry we're running, no we don't host any container registries at the moment13:46
sshnaidm|roverfungi, is there any interest to have it? I can work on ansible role to set it up and configure13:46
fungiit's worth discussing with the rest of the team. we don't have any more ptg sessions scheduled for opendev but maybe start a thread on the service-discuss ml or something13:47
fungisshnaidm|rover: so the role which does the fancy footwork for afs docs publication with root marker files is part of zuul's stdlib: https://zuul-ci.org/docs/zuul-jobs/afs-roles.html#role-upload-afs-roots13:48
fungiand the jobs we use that in are documented here: https://docs.opendev.org/opendev/base-jobs/latest/documentation-jobs.html13:48
sshnaidm|roverfungi, cool, thanks13:48
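Editor's sketch: the combination fungi describes, deleting unrecognized files during the copy while skipping subtrees that carry a root marker file, can be illustrated in plain Python. This is not the upload-afs-roots role and does not call rsync; the marker filename `.root-marker` and the function `sync_with_delete` are assumptions made for the example, and only a marker at the top level of an unrecognized directory protects it.

```python
import os
import shutil

ROOT_MARKER = ".root-marker"  # assumed marker filename, for illustration only


def sync_with_delete(src: str, dst: str) -> list:
    """Mirror src into dst and return the paths removed from dst.

    Entries that exist only in dst are deleted, except directories that
    contain ROOT_MARKER at their top level, which are left untouched.
    """
    os.makedirs(dst, exist_ok=True)
    deleted = []
    src_names = set(os.listdir(src))

    # Deletion pass: drop anything the source no longer provides.
    for name in sorted(os.listdir(dst)):
        if name in src_names:
            continue
        target = os.path.join(dst, name)
        if os.path.isdir(target) and not os.path.islink(target):
            if os.path.exists(os.path.join(target, ROOT_MARKER)):
                continue  # subtree is published by a different build
            shutil.rmtree(target)
        else:
            os.remove(target)
        deleted.append(target)

    # Copy pass: overwrite files and recurse into directories.
    for name in sorted(src_names):
        src_path = os.path.join(src, name)
        dst_path = os.path.join(dst, name)
        if os.path.isdir(src_path):
            if os.path.lexists(dst_path) and not os.path.isdir(dst_path):
                os.remove(dst_path)
            deleted.extend(sync_with_delete(src_path, dst_path))
        else:
            if os.path.isdir(dst_path) and not os.path.islink(dst_path):
                shutil.rmtree(dst_path)
            shutil.copy2(src_path, dst_path)
    return deleted
```

The returned list makes it easy to log what a publication run removed, which matters when several builds share one destination tree.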
*** mlavalle has joined #opendev14:17
openstackgerritJeremy Stanley proposed openstack/project-config master: Replace old Victoria cycle signing key with Wallaby  https://review.opendev.org/76036414:23
*** hamalq has joined #opendev14:41
*** ykarel has quit IRC14:54
*** lpetrut has quit IRC14:56
*** tkajinam has quit IRC15:05
*** ysandeep|ruck is now known as ysandeep|away15:12
openstackgerritMerged openstack/project-config master: Create telemetry group and include  https://review.opendev.org/76019015:20
mordredfungi, sshnaidm|rover : it might become a more important thing to consider with the new dockerhub rate limiting15:24
mordred(running an actual registry that is)15:24
clarkbdid they ever publish their promised "what CI systems should do" doc?15:25
clarkbbecause they promised a thing like that and I tried to periodically look for it then got distracted15:25
fungii haven't seen one mentioned15:25
mordredI haven't seen one mentioned either15:25
mordredbut 100 container image pulls every six hours is ... yeah15:26
fungii assume that's 100 pulls of a given image from any anonymous source? or is that 100 pulls of any container image from the same ip address? or something else entirely?15:28
clarkbfungi: I don't think they've said specifically. However given the earlier rate limiting I expect it's 100 pulls of any container image from the same ip15:30
clarkbalso note that they are no longer rate limiting by blob but now by manifest15:30
clarkbwhich is what completely breaks our caching system (and it also fails to rate limit the thing that is actually expensive. I think this is what irritates me most about their changes, they have explicitly broken what we did to mitigate our impact and ignored the fact that we were already good citizens)15:31
clarkbaccording to their blog post rate limiting the large blob requests was confusing to users so instead we'll do something that makes zero sense :/15:31
clarkb"However, CircleCI has partnered with Docker to ensure that our users can continue to access Docker Hub without rate limits."15:32
fricklerthere was a link for CI owners to register and fill out some questions in order to get more information15:34
fungimaybe we just need to write some high-profile blog posts/news articles about how dockerhub is useless for free and open source software15:35
* clarkb remembers where we last got re caching this. We need a smarter cache than apache because apache won't cache the manifest data since it is fetched with authentication (the headers are set such that a smarter proxy could cache them though)15:44
*** olaph has quit IRC15:44
clarkbthe other thought was to stop proxy caching entirely15:45
clarkbbecause that would diversify our IP addresses15:45
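Editor's sketch: clarkb's reading of the limit (100 pulls of any image per source IP every six hours) can be modelled with a small sliding-window counter. Docker had not confirmed the exact semantics at this point in the log, so the per-IP keying is an assumption, and `PullRateLimiter` is a hypothetical name, not anything Docker Hub exposes.

```python
from collections import defaultdict, deque


class PullRateLimiter:
    """Allow at most `limit` pulls per source IP within a sliding time window."""

    def __init__(self, limit: int = 100, window_seconds: float = 6 * 60 * 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self._pulls = defaultdict(deque)  # source IP -> timestamps of recent pulls

    def allow(self, source_ip: str, now: float) -> bool:
        """Record a pull at time `now` and report whether it is within the limit."""
        pulls = self._pulls[source_ip]
        # Forget pulls that have aged out of the window.
        while pulls and now - pulls[0] >= self.window_seconds:
            pulls.popleft()
        if len(pulls) >= self.limit:
            return False
        pulls.append(now)
        return True
```

Under this model every job behind one proxy shares that proxy's single budget, while jobs pulling directly each draw on the budget of their own node address, which is the reasoning behind dropping the proxy to diversify IP addresses.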
clarkbthere are also "open source project plans" which don't get published limits15:48
clarkbhttps://www.docker.com/blog/best-practices-for-using-docker-hub-for-ci-cd/ I think their advice may be "upgrade your account"15:54
fungithough that would also require auth right?15:56
clarkbyes, which we technically already do (just in most cases as the boring anonymous user auth)15:56
clarkbI guess the issue then becomes securing the tokens15:58
*** hashar has quit IRC15:59
*** hashar has joined #opendev15:59
*** hashar_ has joined #opendev16:04
*** hashar has quit IRC16:04
sshnaidm|rovermordred, yeah, that's exactly the reason I raised this..16:24
clarkbI don't think we should use tarballs as a docker registry standin16:24
clarkbwe should address the problem more directly (not sure what the best solution is though)16:24
*** hashar_ is now known as hashar16:24
*** hashar has quit IRC16:25
*** hashar has joined #opendev16:25
sshnaidm|rovertbh I'm not sure if mirroring dockerhub will be fine too, because it's enough for someone to request 100 new containers that are not in the cache - and cache machine is blocked for 6 hours to pull from dockerhub16:26
clarkbs/mirror/proxy cache/16:27
clarkbif we mirrored it we could presumably use an account without rate limits and download all the things if we had sufficient disk16:28
clarkb(note we don't have sufficient disk)16:28
sshnaidm|roverI mean a real mirror:  https://docs.docker.com/registry/recipes/mirror/16:28
sshnaidm|roverwe can configure pruning as frequent as we want16:28
clarkbwell a real mirror shouldn't have to worry about the rate limits?16:28
sshnaidm|roverclarkb, only if it doesn't have container in it and need to pull it from dockerhub16:28
clarkbyour example there is for a pull through cache16:28
clarkbwhich is different than a mirror (and very similar to what we already run)16:29
sshnaidm|roverah, ok16:29
sshnaidm|roverI keep calling it mirror :)16:29
sshnaidm|roverclarkb, do you run this pull-through-cache?16:29
clarkbopendev runs an apache based pull through cache in every test node region16:30
clarkbthe issue is that they are changing how they apply rate limits to essentially break that setup16:30
sshnaidm|roveryeah :/16:30
clarkbshifting from blob rate limits which apache can cache to manifest rate limits which I can't figure out how to get apache to cache16:31
clarkbalso that doc doesn't seem to have anything on pruning?16:31
clarkbit just says it will somehow magically do it itself16:31
clarkb(can we configure filesystem limits or frequency of pruning?)16:31
sshnaidm|roverclarkb, pruning is available in config for registry container16:39
sshnaidm|roverstorage: maintenance:16:40
sshnaidm|rover    uploadpurging:16:40
sshnaidm|rover      enabled: true16:40
sshnaidm|rover      age: 168h16:40
sshnaidm|rover      interval: 24h16:40
*** sboyron has quit IRC16:40
sshnaidm|roverhttps://docs.docker.com/registry/configuration/#list-of-configuration-options16:41
sshnaidm|roverbut it's when you run registry as a container16:41
*** marios is now known as marios|out16:41
clarkbsshnaidm|rover: that is for purging uploads to the registry which you wouldn't do in this case16:42
clarkb(though maybe the docs are just overly specific and it does apply to the cache scenario too?)16:42
clarkbhttps://docs.docker.com/registry/garbage-collection/#more-details-about-garbage-collection16:43
clarkbsshnaidm|rover: ^ basically those docs continue to say you can't garbage collect a live registry, it has to be read only (which I think means no new cached content but could serve existing content) or it has to be off16:43
*** slaweq has quit IRC16:44
clarkbit is possible some setup could be devised from this, but it needs someone to actually spin one up and test it and figure out these details that the docs make ambiguous or confusing16:44
fricklerI did fill out https://www.docker.com/community/open-source/application in the name of opendev now, waiting to see how they respond16:45
clarkbfrickler: thank you16:45
sshnaidm|roverI don't see where it can't garbage collect on live registry16:50
sshnaidm|roverfrom this page16:50
clarkbsshnaidm|rover: "You should ensure that the registry is in read-only mode or not running at all. If you were to upload an image while garbage collection is running, there is the risk that the image’s layers are mistakenly deleted leading to a corrupted image."16:51
clarkbagain it's not clear how that interacts with pull through usage16:51
clarkb(the docs are really lacking unfortunately)16:51
sshnaidm|roversigh..16:51
clarkbif someone is able to sort out how pruning functions so that we can evaluate it properly as a pull through cache that would be excellent16:53
clarkbbut I don't think the docs give us a clear enough picture and it requires actual testing16:53
*** mwhahaha has quit IRC16:55
*** mwhahaha has joined #opendev16:55
*** rpittau is now known as rpittau|afk16:57
openstackgerritClark Boylan proposed opendev/system-config master: Stop using the in region docker mirrors for system-config jobs  https://review.opendev.org/76041717:04
clarkbfrickler: sshnaidm|rover fungi ^ something like that may also serve as a reasonable temporary mitigation17:05
openstackgerritClark Boylan proposed openstack/project-config master: Don't use the local docker caches  https://review.opendev.org/76041917:09
clarkbcorvus: mordred ^ you may also be interested in those17:09
clarkbI think it would be good to hear back from docker on frickler's request. Maybe they'll say we can give them a list of IPs to rate limit less aggressively or something but we've now got a couple changes we can land next week if we think it will help17:11
*** hamalq has quit IRC17:13
*** hamalq has joined #opendev17:14
*** ralonsoh has quit IRC17:15
*** ykarel has joined #opendev17:35
*** ykarel has quit IRC17:37
*** ykarel has joined #opendev17:38
corvusclarkb: i originally designed zuul-registry as a smart pull-through cache, but later determined i didn't need it and removed that functionality.  if we feel like we want a pull-through cache with behavior that we know and control, we can resurrect that.17:40
corvus(to be clear, simply using zuul-registry as a pull-through cache to replace the apache mirrors; not trying to incorporate any of the zuul shadowing functionality)17:41
clarkbya that may be useful using a filesystem driver17:43
clarkbespecially if we can incorporate auto pruning against some size limit17:43
fungithough it stores the images in swift right? so having the z-r frontend in every provider may not be any more efficient than having a single one close to the swift backend?17:44
clarkbit has multiple drivers17:44
clarkbincluding local filesystem17:45
fungiahh, okay, so we have the option of local storage and multiple frontends then, got it17:45
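Editor's sketch: "auto pruning against some size limit" for a cache kept on a local filesystem could be approximated by removing the least recently accessed files first. This is a minimal sketch and not zuul-registry behaviour; `prune_to_size` is hypothetical, it treats the cache as a plain tree of independent files, and it relies on access times, which some mount options update only lazily.

```python
import os


def prune_to_size(cache_dir: str, max_bytes: int) -> list:
    """Delete least recently accessed files until the cache fits in max_bytes."""
    entries = []
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            entries.append((stat.st_atime, path, stat.st_size))
            total += stat.st_size

    removed = []
    # Oldest access time first, so recently served content survives longest.
    for _atime, path, size in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A real registry cache would also have to keep manifests and the blobs they reference consistent, which ties back to the earlier concern about garbage collecting a live registry.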
*** eolivare has quit IRC17:47
fungithough as pointed out previously, if a job suddenly requests 100 additional images we're probably also stuck until the next day17:52
clarkbya I think the other step in proxying is maybe using credentials that arent rate limited between the proxy and upstream17:53
*** ykarel has quit IRC17:54
*** webmariner has joined #opendev18:03
*** hashar has quit IRC18:04
*** marios|out has quit IRC18:10
*** andrewbonney has quit IRC18:13
*** Green_Bird has joined #opendev18:21
*** tosky has quit IRC18:29
TheJuliao/21:35
TheJuliaOut of curiosity, more gerrit crawling evil? proxy 502 + slowness/hanging21:36
johnsomYeah, it has been slow for me too21:37
fungistuffing my face but should be able to take a look in a few21:38
TheJuliathanks fungi21:39
* johnsom thinks TheJulia has earned one of the Octavia team's canary stickers for catching issues about the same time we do.21:42
TheJulia\o/21:49
*** tosky has joined #opendev21:57
ianwthere is something walking all changes, but it looks different to the prior requests22:03
ianwinterestingly this time it comes from an engineering school in quebec22:04
TheJuliasounds like it might be time to dig up a network operations center phone number...22:19
TheJuliaBecause this is sounding remarkably suspicious.22:19
fungiwell, i've suspected all along this could be related to someone's compsci research project22:20
fungibut finding a way to reach them and ask them to go slower was challenging while it was just random ip addresses in google cloud22:20
johnsomSomeone is training the ML on our code, our jobs are doomed. grin22:21
TheJuliaor there is a compromised machine, but the simplest possibility is most likely22:21
fungiif a machine wants my job, it can have it ;)22:21
TheJuliajohnsom: Or maybe not and the AI goes "NOOOOO!!!!!" and curls up in a ball and cries having realized the nature of the universe22:22
johnsomDo you still have HAProxy in front of it? If so you could consider rate limiting based on source IP, etc.22:22
johnsom^^^ Probably a more likely situation.22:22
fungiwe haven't had an haproxy in front of it at all (that's the gitea cluster), but yes we've been considering that as a possible avenue22:23
fungian apache module would probably be easier though since there is an apache layer proxying to the gerrit service22:23
TheJuliais it new tcp connections or is over a kept alive connection?22:23
johnsomOk, I can help give pointers on how to set that up if/when you decide you want to do it.22:23
*** slaweq has joined #opendev22:27
*** slaweq has quit IRC22:42
*** stevebaker has joined #opendev22:42
*** rchurch has quit IRC22:43
*** rchurch has joined #opendev22:45
*** hamalq has quit IRC22:57
fungiclarkb: oh, also there's https://review.opendev.org/760220 up to readd inap-mtl01 to nodepool if you have time23:12
clarkbcool lets try it23:12
*** Green_Bird has quit IRC23:17
sean-k-mooneyby the way is https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html still accurate in terms of 1st party ci requirements23:18
openstackgerritMerged openstack/project-config master: Revert "Temporarily stop booting nodes in inap-mtl01"  https://review.opendev.org/76022023:20
clarkbsean-k-mooney: yes I think so. We've been a bit more flexible on the total number of test node requirement but otherwise ya23:20
sean-k-mooneyya that is the one i was debating about. im considering expanding my home cloud in the next few months23:21
sean-k-mooneyim debating if i will invest enough to possibly be able to meet those requirements23:21
*** slaweq has joined #opendev23:22
sean-k-mooneyi currently have 4 ironic nodes and one vm server with 256GB of ram and 48 cores. im thinking of adding a few more servers for vms in the next few months after i see how my thirdparty ci is going.23:25
*** mlavalle has quit IRC23:30
sean-k-mooneyits funny that one of these https://www.ebay.ie/itm/4x-Node-Dell-PowerEdge-C6220-2U-Server-8x-E5-2670-OCTA-CORE-1024GB-RAM-Rails/402497036993?hash=item5db6b162c1:g:DfoAAOSwS5tfibmo can now actually provide that23:31
*** slaweq has quit IRC23:55
