Friday, 2021-12-17

Clark[m]That looks like the pypi fallback backend issue00:03
Clark[m]It's a known issue. Basically pypi serves you stale content from the CDN if their primary CDN backend fails and the CDN falls back to their backup. The backup has old data and the install fails because constraints request a version that is newer than the backup contains00:04
Clark[m]As mentioned in the TC meeting today I think it would be appropriate for openstack to reach out to pypa about it. Openstack is primarily affected due to the use of constraints00:04
rlandy|ruckoh - back to that joy00:09
rlandy|ruckthanks for the pointer00:10
rlandy|ruckwill raise it with the dev teams00:10
*** rlandy|ruck is now known as rlandy|out00:12
fungiis it cropping up again today?00:26
fungione thing frickler observed which seems to make things worse is that the broken/stale backend serves pages with a very long cache ttl compared to the primary backend00:27
fungiso we end up caching and serving up the bad indices for far longer than the good ones in that case00:28
fungibasically, our use of a caching proxy amplifies the problem, which could be another reason we seem to notice it more00:29
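[editor's note: a rough illustration of the failure mode described above — hypothetical package versions, not OpenDev's actual tooling]

```python
# Sketch of why constraints make this failure hard: the pin requests an
# exact version, so a stale index that predates the pinned release fails
# outright rather than falling back to an older release.

def install_would_fail(index_versions, pinned_version):
    """Return True if a constraints-pinned install fails against this index."""
    return pinned_version not in index_versions

# Primary CDN backend: up to date, includes the pinned release.
fresh_index = {"1.0.0", "1.1.0", "1.2.0"}
# Fallback backend: stale snapshot taken before 1.2.0 was released.
stale_index = {"1.0.0", "1.1.0"}

assert not install_would_fail(fresh_index, "1.2.0")
assert install_would_fail(stale_index, "1.2.0")
```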
wxy-xiyuan_fungi: clarkb thanks very much for the quick debug and fix.00:49
fungiof course!00:50
fungiit's just a workaround, we still need to improve nodepool so it doesn't get confused by '.' in image names00:51
fungigood news is the builder cleaned up the image records00:53
fungithough it does seem to have left behind the files on disk00:54
fungiclarkb: is it safe to delete those out of /opt/nodepool_dib/ once they're no longer mentioned by nodepool dib-image-list?00:56
Clark[m]fungi: yes should be safe since nothing will try to use those files once the db records are gone01:01
fungithanks, done!01:03
fungithere were none on nb01, just 02 (amd64) and 03 (arm64)01:04
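[editor's note: a hedged sketch of the cleanup described above — the naming convention and comparison logic are assumptions for illustration, not nodepool's actual on-disk layout]

```python
# Find files under the dib image directory whose names are no longer
# referenced by any image in `nodepool dib-image-list` output.
from pathlib import Path

def orphaned_files(files, active_images):
    """Return files whose name doesn't start with any active image name."""
    return [f for f in files
            if not any(Path(f).name.startswith(img) for img in active_images)]

on_disk = ["ubuntu-focal-0000001.qcow2", "openeuler-20-0000003.raw"]
active = {"ubuntu-focal"}  # e.g. parsed from `nodepool dib-image-list`
assert orphaned_files(on_disk, active) == ["openeuler-20-0000003.raw"]
```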
corvusif anyone feels like a zuul review, i believe is what is causing the login button to fail to appear01:14
corvus(on opendev's zuul)01:15
fungisure thing01:32
*** ysandeep|out is now known as ysandeep01:39
fungiokay, this is an odd behavior02:07
fungii can't connect to with firefox, it insists on going to https. not the case for other virtual domains on that same server02:08
fungichromium works though02:11
*** sshnaidm is now known as sshnaidm|afk02:45
fungiwow how has this never come to my attention before? .dev is unilaterally included in the hsts preload list. you're actually not supposed to do http with any site on a .dev domain03:00
fungithis throws a new wrench into plans03:00
fungii'll need to sleep on this03:01
wxy-xiyuan_openEuler arm64 node works now, while X86 ones still raise node_failure error. I assume it needs more time to upload the x86 image to cloud provider?03:37
wxy-xiyuan_X86 works now!06:09
*** ysandeep is now known as ysandeep|brb06:19
*** ysandeep|brb is now known as ysandeep06:57
opendevreviewwangxiyuan proposed openstack/project-config master: Add cache-devstack element for openEuler
opendevreviewMerged openstack/project-config master: Add cache-devstack element for openEuler
ykarelpypi proxies are affected today as well07:53
mnaserinfra-root: it seems that the routes are hard-coded for ipv6 on some of the nodes running in our mtl dc08:30
mnaserappreciate changing them to `2604:e100:1::1/64` and `2604:e100:1::2/64` for us to move forward with our dc migration :)08:31
fricklermnaser: was that due to the issues with rogue RAs? do you have an indication which nodes might be doing that?08:48
mnaserfrickler: nothing on that yet unfortunately, we're hoping to bring the cloud up to latest release and see if that helps08:55
*** ysandeep is now known as ysandeep|lunch09:11
*** ysandeep|lunch is now known as ysandeep09:42
*** redrobot6 is now known as redrobot10:37
*** sboyron_ is now known as sboyron10:52
fricklerhmm, gbot seems to be off for the holidays, too10:53
fricklerinfra-root: Set CacheMaxExpire to 1h10:53
fricklerthat would be my attempt at reducing the impact of bad cdn responses10:54
fricklerykarel mentioned a database upgrade on pypi last night
fricklerthat might explain why the impact seems to have been larger than usual this time10:54
fricklerin an attempt to clean up things, I'm running this on mirrors now manually:10:55
fricklerhtcacheclean -D -a -v -p /var/cache/apache2/proxy/|grep simple|xargs -i -n1 htcacheclean -v -p /var/cache/apache2/proxy/ {}10:55
dulekThere's another instance of deps issues, this time happening at ovh-bh1:
*** rlandy_ is now known as rlandy|ruck11:11
*** dviroel|out is now known as dviroel|rover11:14
*** frenzy_friday is now known as anbanerj|ruck11:27
dulekfrickler: ^11:35
*** ysandeep is now known as ysandeep|afk11:51
*** ysandeep|afk is now known as ysandeep12:13
*** frenzy_friday is now known as anbanerj|ruck12:21
*** tobias-urdin9 is now known as tobias-urdin13:02
*** pojadhav is now known as pojadhav|brb13:24
*** jpena|off is now known as jpena13:49
*** ysandeep is now known as ysandeep|dinner14:08
*** pojadhav|brb is now known as pojadhav14:19
*** ysandeep|dinner is now known as ysandeep14:52
fricklerinfra-root: gerritbot seems to have stopped logging today at 08:12 and has been quiet since then. is there anything we could debug or shall we just restart it?14:56
fricklerdulek: please recheck and let us know if the issue persists14:56
funginothing in its debug log?14:56
fricklerfungi: which debug log? the docker log is what I was talking about14:57
dulekfrickler: Already did, it happened a few times, but seems better now.14:58
fungifrickler: yeah, it says to log to syslog, but doesn't seem to actually write anything into /var/log/syslog15:04
fungii agree docker-compose logs looks like it died 7 hours ago15:04
fungiit's still in channel so my guess is it's somehow lost the gerrit event stream15:05
fungiand hasn't realized, so it doesn't try to reconnect15:05
fungifrickler: so yes, probably nothing left to do but stop/start or down/up -d the container15:06
fricklerfungi: right, will do that now15:07
*** rlandy|ruck is now known as rlandy|dr_appt15:07
*** dviroel|rover is now known as dviroel|rover|lunch15:09
fricklerseems to be doing fine again
fungiexcellent. likely we have a blind spot with some sorts of network disruption impacting the event stream15:19
fricklerfungi: mnaser mentioned something with ipv6 default routes earlier, that might have affected review02, too15:26
fungioh, yep great point15:27
fungiand yes, i think we ended up with routes hard-coded into the review server in order to stop it from acting on bogus route announcements it was receiving15:27
fricklerfungi: yes, in netplan15:28
fricklerso it seems we should change those from the LL addresses we are currently using to the ones mnaser mentioned above15:29
fungilooks like we added those in /etc/netplan/50-cloud-init.yaml15:29
fungiapparently we deploy it via playbooks/service-review.yaml15:30
fricklerI'll propose a patch15:30
fungiperfect, i was about to ask if there was already one i could review15:30
fungithat seems to be the only server we're doing it for15:31
fungiat least the only server we're doing it that way for, in the system-config ansible15:31
Clark[m]I think the mirror in that region also hardcodes routes via netplan15:32
fungipossible we did something different for the mirror15:32
fungimaybe not with system-config15:32
fungiwhich region is that? ca-ymq-1 or sjc1?15:32
Clark[m]Also if we break networking on that host we should have a recovery plan. ca-ymq-115:32
Clark[m]Maybe do the mirror first to ensure it all works15:33
fungii think the recovery plan for review is to make the change locally, and if necessary we can reboot it via the nova api15:33
Clark[m]Re gerritbot syslog we have syslog rules to write their output to /var/log/containers iirc15:33
fungithen merge the fix in system-config if all is well15:33
fungiclarkb: oh, yep! i see a /var/log/containers/docker-gerritbot.log on eavesdrop0115:34
Clark[m]fungi: you mean not via netplan but via route or whatever the command is now?15:34
fungiclarkb: yeah, we can use ip -6 route add/del sure15:35
fungiobviously editing the netplan config wouldn't get reset on a reboot15:35
fungigood to call that out15:36
fricklerI can do the changes while connected via v4, that should be safe enough15:36
fungioh, that's also a great point, even if a reboot breaks v6 connectivity, we should be able to get to it over v4 anyway15:37
opendevreviewDr. Jens Harbott proposed opendev/system-config master: Switch router addresses for review02 to global
Clark[m]Ah yup. It is too early in the morning and I'm having a slow start but that is an excellent point :)15:38
Clark[m]I'm not sure how to trigger netplan to reapply without a reboot either15:39
Clark[m]But maybe editing the file and manually updating the routes is good enough15:40
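[editor's note: a hedged sketch of what the static-route override in /etc/netplan/50-cloud-init.yaml might look like — the interface name and layout are assumptions, only the gateway address is taken from mnaser's message above]

```yaml
network:
  version: 2
  ethernets:
    ens3:                       # interface name is an assumption
      accept-ra: false          # don't act on (possibly rogue) RAs
      routes:
        - to: "::/0"
          via: "2604:e100:1::1" # gateway from mnaser's message
# An edit can be tested with `sudo netplan try` (auto-reverts if
# connectivity is lost) or applied with `sudo netplan apply`, without
# needing a full reboot.
```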
* frickler needs a break, will proceed in half an hour or so15:41
gmannIt seems all my patches are facing the pypi CDN issue, but I did recheck, let's see15:42
Clark[m]I really think openstack needs to reach out to pypa over this. It isn't something we can fix (even lowering the max cache time won't help a ton) ourselves. And it primarily affects openstack due to openstack's use of constraints15:44
Clark[m]Bringing it up in here isn't going to change much. Other than for me to continue to suggest it be brought up with pypa :)15:45
Clark[m]I can start a draft of an issue in an etherpad once I've booted my morning. But would appreciate it if someone else submits it given responses I have received from pypa in the past15:55
*** gibi is now known as gibi_pto_back_on_10th16:02
fricklerso I changed the routing manually on , can still reach the outside world from it and back16:11
Clark[m]frickler fungi I just remembered that I think pip sends cache control to tell the servers it doesn't care about cached values. It may be the case that lowering TTLs won't help16:11
fricklerClark[m]: we set config in apache to ignore that16:15
fricklersee the comment after my change in
fricklerbut that may also be a reason why noone else is really affected by this16:17
fricklerso maybe if 822095 doesn't help much either, the next step could be to try and drop CacheIgnoreCacheControl16:18
clarkbmakes sense.16:32
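[editor's note: for reference, a sketch of the mod_cache directives under discussion — values reflect this conversation, not necessarily the exact system-config mirror vhost templates]

```apache
CacheRoot /var/cache/apache2/proxy
CacheEnable disk /pypi
CacheIgnoreCacheControl On   # ignore pip's Cache-Control request headers
CacheMaxExpire 3600          # cap cached responses at 1h (value in seconds)
```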
clarkbfrickler: did you update the netplan on the mirror node too?16:32
*** dviroel|rover|lunch is now known as dviroel|rover16:33
clarkbfrickler: I've approved the pypi proxy cache change and +2'd the review02 netplan update16:33
clarkbchecking the mirror node the netplan file and ip -6 route output seem to be in agreement and use those newer addresses16:34
clarkbfrickler: fungi: I guess we proceed with review then?16:35
clarkbdid we want to approve and land the netplan file update before doing the manual update? if so can you review that change fungi?16:35
fricklerclarkb: I edited the netplan file for the mirror, yes. if you or fungi want to do review, go ahead, not sure how urgent this is for mnaser. I would continue tomorrow otherwise16:36
clarkbya I think we should land the netplan update for review02 then manually update the routes. I'm sure fungi or I can do the route update once the file is updated16:37
clarkbthank you for getting that started for us and testing it works happily on the mirror node16:37
clarkbfrickler: how long are the TTLs on the backup backend indexes from pypi?16:37
clarkbthe primary produces 10 minute ttls iirc. But knowing how long the backup backend TTLs are may be useful for the issue draft I'll start momentarily16:38
*** ysandeep is now known as ysandeep|out16:39
fricklerclarkb: well what I saw in the cache was 1d, which is also the old MaxExpire value, so we don't really know what the response was, could be anything >=1d16:40
clarkbfrickler: thanks that is still helpful16:40
fungisorry, stepped away for a few, catching back up now16:42
fungii've approved frickler's netplan patch for review16:44
fungisince we have ipv4 as a fallback connectivity option, i'd be fine waiting for that to deploy and then performing a quick controlled reboot of the server just to make sure everything's fine16:45
clarkbok, I guess that is the other method of applying it16:49
clarkbit is friday and holidays so should be reasonably safe16:49
fungirevisiting my discovery from last night, i think the most straightforward course of action to get working is to make the addition of letsencrypt for the vhosts a preliminary step in the mailman3 spec and get https set up for the existing v2 servers16:50
fungiwe know we want https for mm3, and the openinfra ml migrations were basically the last major work we wanted to do on the mm2 deployments anyway16:51
fungiotherwise, most users aren't going to be able to get to the list archives and admin/moderation interface for that site in the interim16:52
clarkbfungi: wfm16:52
fungibut that aside, i'm still floored that it's actually forbidden to use plain http with .dev domains16:53
fungi(also questioning the foundation's decision to use a domain for which google owns the tld and serves as the name authority, but that's a discussion for another day)16:53
clarkb if someone wants to file that feel free (or edit as necessary before filing)16:54
clarkbgmann: ^16:54
opendevreviewMerged opendev/system-config master: Set CacheMaxExpire to 1h
clarkbTo be clear I don't intend on filing that issue. My last experience with pypa has left me unwilling to directly interact with them. Someone else should file that if we want to go that route17:03
clarkb(I was told my pull request against pip wasn't even worthy of code review. Probably the least constructive interaction I've ever had with an open source project)17:04
gmannclarkb: ack, thanks. 17:09
*** jpena is now known as jpena|off17:13
opendevreviewMerged opendev/system-config master: Switch router addresses for review02 to global
clarkbfungi: frickler: mnaser: 2604:e100:1:0::1 == 2604:e100:1::1 because :: means replace everything with 0s?17:18
clarkbjust double checking ^ against what mnaser said and what the mirror host shows in its ip -6 route output compared to it and reviews netplan edits17:18
*** rlandy|dr_appt is now known as rlandy|ruck17:34
fungiyes, exactly17:35
fungi:0: can be replaced by just :: as can any arbitrarily long run of :0:0: so long as only one :: appears in the address17:36
fungimore than one :: in an address would be ambiguous of course, as then you wouldn't know how many null bytes are being elided17:37
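[editor's note: the equivalence can be checked mechanically — Python's ipaddress module normalizes the zero-run compression]

```python
import ipaddress

a = ipaddress.ip_address("2604:e100:1:0::1")
b = ipaddress.ip_address("2604:e100:1::1")
assert a == b                      # same 128-bit address
assert str(a) == "2604:e100:1::1"  # canonical form compresses the zero run

# More than one "::" is rejected as ambiguous:
try:
    ipaddress.ip_address("2604::e100::1")
except ValueError:
    pass  # raised: at most one '::' is permitted
```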
clarkbthe netplan file on review has updated though the review job isn't quite finished yet17:49
clarkbfungi: I'll defer to you on manually updating the routes or doing a reboot (though I can help) as I've got plenty of distractions at home today17:50
fungisure, i'll wait for the deploy buildset to report first17:51
jentoioSorry, I've not been following the log4j updates here. I saw this today... The importance of being at log4j v2.16 just got a bunch higher, CVSS score is raised from 3.7 to 9.7. 17:56
fungithanks for the link!17:57
opendevreviewJeremy Stanley proposed opendev/ master: Add letsencrypt record for lists site
opendevreviewJeremy Stanley proposed opendev/ master: Add letsencrypt record for lists site
opendevreviewJeremy Stanley proposed opendev/system-config master: Generate HTTPS certs for Mailman sites
clarkbfungi: ^ left a note on that one about a missing piece. It isn't important if you want to get the certs generated first though19:32
fungiclarkb: yep, i'm almost done with that change as a separate step19:39
fungijust writing tests for it19:40
fungigot sidetracked dealing with dns in all the other domains19:42
opendevreviewJeremy Stanley proposed opendev/system-config master: Add HTTPS vhosts to mailman servers
fungii'm slightly confused as to how playbooks/roles/install-ansible/files/inventory_plugins/ knows what groups something *isn't* supposed to be in19:53
clarkbI think it may do an equality match? I always have to look at it19:57
fungiaha, we've just hard-coded a reference in playbooks/roles/install-ansible/files/inventory_plugins/test-fixtures/results.yaml20:01
clarkbfungi: for the path to the certs is that based on the first name listed in the cert?20:06
fungii... think so?20:06
fungitesting should confirm20:06
fungialso is there any additional job i need to add as required by the lists deploy job when using le?20:07
fungiinfra-prod-service-codesearch uses certs and doesn't require infra-prod-letsencrypt so i guess not20:08
opendevreviewJeremy Stanley proposed opendev/system-config master: Generate HTTPS certs for Mailman sites
opendevreviewJeremy Stanley proposed opendev/system-config master: Add HTTPS vhosts to mailman servers
clarkbyou don't, the testing is mocked out if you add the right groups iirc20:09
clarkband it sets up the self signed certs as if they came from le20:09
fungiclarkb: the reason i split the vhost change out separate is because i had to touch so much dns to set up the acme-challenge cnames that i'm worried some certs may fail to get generated, and i wouldn't want to cause apache to fail to load configs for some of the vhosts20:15
fungionce i see that the certs really do wind up on the servers, then i'll feel more comfortable merging the followup to use them20:16
opendevreviewMerged opendev/ master: Add letsencrypt record for lists site
opendevreviewMerged opendev/ master: Add letsencrypt record for lists site
fungiclarkb: the second patchset on 822196 simply adjusts our test fixtures to say to expect that server in the new group, and the job the first revision failed has already succeeded (otherwise same as the one you already +2'd)20:31
fungiwe're a few minutes out from those two dns changes deploying, and i've hand-tested all the other acme-challenge cnames are already in place for 822196 to be able to work20:31
*** dviroel|rover is now known as dviroel|rover|brb20:34
fungizuul is pretty quiet, so i'll get ready to reboot gerrit here soon20:35
Clark[m]Sorry finishing up lunch can look in a bit20:45
fungithose remaining two acme-challenge cnames are resolving publicly now20:46
fungiso 822196 ought to be safe to merge if its remaining jobs pass20:47
fungithe vhost addition is broken, but we're not collecting apache logs from those nodes so it's hard to know quite why. i'll add that20:51
opendevreviewJeremy Stanley proposed opendev/system-config master: Add HTTPS vhosts to mailman servers
clarkb822196 lgtm21:05
fungithanks, looks like it's a minute or two from passing check21:07
*** dviroel|rover|brb is now known as dviroel|rover21:20
*** dviroel|rover is now known as dviroel|out21:29
fungiyeah, as suspected, i've apparently guessed the ssl cert paths wrong in the vhost addition21:39
clarkbI think it will use the first name in the list of certs21:41
clarkbmaybe reorder to put at the top since that is the host name?21:41
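[editor's note: if the on-disk path really is keyed off the first name in the cert's domain list (an assumption worth confirming against the letsencrypt role), the reorder would look roughly like this hypothetical fragment — names and variable shape are placeholders, not the actual system-config values]

```yaml
letsencrypt_certs:
  lists-site-main:
    - lists.example.org      # first entry decides the cert's directory name
    - lists.example.dev      # remaining SANs follow in any order
```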
opendevreviewJeremy Stanley proposed opendev/system-config master: Generate HTTPS certs for Mailman sites
opendevreviewJeremy Stanley proposed opendev/system-config master: Add HTTPS vhosts to mailman servers
fungidone, i also set it to collect the logs since it should tell us the path21:47
clarkbI thought we were already collecting those but ya we should21:48
fungiwe weren't21:48
funginot in the run-lists job anyway21:48
fungistatus notice The server is being rebooted to validate a routing configuration update, and should return to service shortly21:51
fungithat ^ look reasonable?21:51
clarkbmaybe down the docker-compose stuff first? then up -d it when it is back?21:51
fungisure, can do21:56
fungithere's a few things in the openstack tenant status page which look about to report, so i'll give those a little longer21:57
clarkblooks like those may have cleared out?22:16
fungistill waiting for that tripleo-heat-templates failure at the top of the gate to report, its last build is copying logs22:18
fungiis there some playbook i need to add somewhere to generate the ssl certs in the run-lists job? looking at there's no indication was run22:19
fungiwhich could explain the lack of certs apache's finding22:20
clarkbyou might need a different parent job let me see22:20
clarkblooks like you add playbooks/letsencrypt.yaml to the playbooks listing on the job22:21
fungii only see that being added in the infra-prod-letsencrypt job22:21
clarkbits on the system-config-run jobs22:21
clarkbin the other file22:21
clarkbzuul.d/system-config-run.yaml that one22:22
fungioh! yes i'm looking in the wrong jobs... *sigh* thanks22:22
opendevreviewJeremy Stanley proposed opendev/system-config master: Generate HTTPS certs for Mailman sites
opendevreviewJeremy Stanley proposed opendev/system-config master: Add HTTPS vhosts to mailman servers
fungiokay, everything's reported that's going to report any time soon22:27
fungi#status notice The server is being rebooted to validate a routing configuration update, and should return to service shortly22:27
opendevstatusfungi: sending notice22:27
-opendevstatus- NOTICE: The server is being rebooted to validate a routing configuration update, and should return to service shortly22:27
fungistopping the container22:27
fungiand rebooting the server now that it's done22:27
fungiit's responding to pings over ipv6 already22:28
fungii can ssh into it22:28
fungi`ip -6 ro sh` gives the expected new routes22:28
fungistarting the container again22:29
clarkbit can ping6 google too22:29
clarkband ya I concur the ip -6 route gives me what I expect22:29
fungii'm ssh'd in over ipv6 btw22:29
fungiwebui is up and working for me22:30
clarkbsame here, maybe a bit slow but that should get better as the caches rebuild22:31
clarkb(this is a known issue and apparently some libmodule plugin to swap out the cache types and make them persistent can help with this)22:31
fungiadding the le playbook got things much closer, now it's just the https testinfra asserts which are failing. i may set an autohold on the vhost addition and recheck23:06
fungifor some reason curl is getting "SSL routines:ssl3_get_record:wrong version number"23:10
fungii suspect something's not quite right with the vhost config23:11

Generated by 2.17.3 by Marius Gedminas - find it at!