*** Dmitrii-Sh9 has joined #opendev | 00:40 | |
*** Dmitrii-Sh has quit IRC | 00:41 | |
*** Dmitrii-Sh9 is now known as Dmitrii-Sh | 00:41 | |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Remove install unnecessary packages https://review.opendev.org/751926 | 01:00 |
ianw | [2020-09-15 01:16:46.308] "GET /api/status HTTP/1.1" 200 68963 "cache hit" "https://zuul.openstack.org/status" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0" | 01:17 |
ianw | that's better | 01:17 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64789&rra_id=all seems to show the host isn't under too much pressure | 01:22 |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Remove install unnecessary packages https://review.opendev.org/751926 | 02:06 |
*** iurygregory has quit IRC | 02:08 | |
ianw | ERROR: No matching distribution found for oslo.log===4.4.0 (from -c /opt/stack/requirements/upper-constraints.txt (line 265)) | 02:19 |
ianw | seem to hit a bunch of dib jobs too :/ | 02:20 |
fungi | in ovh-gra1 again? | 02:23 |
ianw | seems so, putting it together | 02:23 |
ianw | fungi: iirc, you had to run a loop to occasionally get bad returns right? | 02:23 |
fungi | yeah, just stuck date and wget in a for loop with a sleep 1 | 02:26 |
fungi | while :;do sleep 10;echo `wget -SO- https://pypi.org/simple/libvirt-python/ 2>&1|grep -e 'X-Served-By: .*' -e '>libvirt-python-6\.6\.0\.tar\.gz<'|sed -e 's/.*X-Served-By: //' -e 's/.*pythonhosted.*>\(libvirt-python-6\.6\.0\.tar\.gz\)<.*/\1/'`;done|tee results | 02:27 |
fungi | that's an example from when i was seeing libvirt-python-6.6.0.tar.gz sometimes missing | 02:28 |
fungi | from the simple api index | 02:28 |
fungi | i guess it was every 10 seconds | 02:29 |
fungi | and i didn't timestamp them after all | 02:29 |
fungi | i was just logging which fastly backends returned what | 02:31 |
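For readability, fungi's one-liner above expands to roughly the following; the URL, package name, and 10-second interval are taken straight from it.

```bash
# Poll the PyPI simple index for libvirt-python every 10 seconds, recording
# which fastly backends served each response (X-Served-By) and whether the
# 6.6.0 sdist link was present in the index body.
while :; do
    sleep 10
    # echo $(...) joins the grep/sed output onto one line per poll
    echo $(wget -SO- https://pypi.org/simple/libvirt-python/ 2>&1 \
        | grep -e 'X-Served-By: .*' -e '>libvirt-python-6\.6\.0\.tar\.gz<' \
        | sed -e 's/.*X-Served-By: //' \
              -e 's/.*pythonhosted.*>\(libvirt-python-6\.6\.0\.tar\.gz\)<.*/\1/')
done | tee results
```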
ianw | got it running, so far it's returning the right thing but we'll see | 02:36 |
ianw | sigh, and something up with fedora mirror ... it's not a good day for the dib gate :/ | 02:42 |
prometheanfire | heh, 6 seconds off | 02:44 |
*** hashar has joined #opendev | 02:48 | |
ianw | the results are something like "cache-bwi5141-BWI, cache-cdg20768-CDG" | 02:59 |
ianw | where the first part is constant, and the second part seems to choose one of 58 different instances | 02:59 |
ianw | it's just looping around the same 58 results now, and all are returning the correct thing it seems | 03:17 |
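A quick way to count the distinct second-hop caches in that results file, assuming each line ends with the "cache-bwi5141-BWI, cache-cdg20768-CDG"-style pair shown above:

```bash
# keep only the token after the comma (the cache instance we talked to),
# drop anything after it, and count unique values
sed -e 's/.*, *//' -e 's/ .*//' results | sort -u | wc -l
```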
ianw | for reference the failing results were in https://review.opendev.org/#/c/747878/ | 03:19 |
ianw | 2020-09-15 01:21:01.448075 | ubuntu-bionic | ERROR: No matching distribution found for oslo.log===4.4.0 (from -c /opt/stack/requirements/upper-constraints.txt (line 265)) | 03:19 |
ianw | at or around that time | 03:19 |
hashar | ianw: looks like airport codes. bwi baltimore, cdg for paris ;) | 03:46 |
hashar | Paris, in France | 03:46 |
ianw | yeah, so still 58 different backends, all responding correctly | 04:13 |
ianw | 2020-09-14 22:22:08.327035 | controller | ERROR: Could not find a version that satisfies the requirement oslo.log===4.4.0 | 04:15 |
*** ykarel|away has joined #opendev | 04:17 | |
*** ykarel_ has joined #opendev | 04:21 | |
*** ykarel|away has quit IRC | 04:24 | |
*** Topner has joined #opendev | 04:32 | |
*** fressi has joined #opendev | 04:35 | |
*** Topner has quit IRC | 04:36 | |
*** fressi has quit IRC | 04:44 | |
*** hashar has quit IRC | 05:13 | |
*** ysandeep|away is now known as ysandeep | 05:13 | |
*** ykarel_ has quit IRC | 05:22 | |
*** ykarel_ has joined #opendev | 05:23 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 05:48 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 05:48 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 05:48 |
*** ykarel_ has quit IRC | 06:10 | |
openstackgerrit | Rico Lin proposed zuul/zuul-jobs master: Allow skip files when download logs https://review.opendev.org/751973 | 06:11 |
*** ykarel_ has joined #opendev | 06:11 | |
*** lpetrut has joined #opendev | 06:14 | |
*** qchris has quit IRC | 06:20 | |
*** ykarel_ is now known as ykarel | 06:21 | |
ianw | infra-root: ^ found https://58aca06c0a1750aa55ba-25dc1a1096eaf82a51392114c0a4a973.ssl.cf2.rackcdn.com/732435/7/check/heat-functional/852a5bf/controller/logs/screen-barbican-keystone-listener.txt | 06:22 |
ianw | in a browser that gives me a "content encoding error", when I curl it I get back a 500 page | 06:23 |
ianw | An error occurred while processing your request.<p> | 06:23 |
ianw | Reference #112.9b052017.1600151007.18aba199 | 06:23 |
ianw | probably something to keep an eye on | 06:24 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Retire Fedora 31 for 32 https://review.opendev.org/751975 | 06:27 |
frickler | ianw: I've seen these content errors before and attributed them to upload errors. as long as they are only sporadic I guess that would be fine | 06:28 |
*** diablo_rojo has quit IRC | 06:29 | |
*** qchris has joined #opendev | 06:34 | |
ianw | frickler: ok, thanks; i guess with the amount we upload something is going to go wrong | 06:36 |
*** iurygregory has joined #opendev | 06:58 | |
*** slaweq has joined #opendev | 07:05 | |
*** andrewbonney has joined #opendev | 07:14 | |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed zuul/zuul-jobs master: Add support to use stow for ensure-python https://review.opendev.org/751611 | 07:15 |
*** hashar has joined #opendev | 07:24 | |
*** fressi has joined #opendev | 07:27 | |
*** tosky has joined #opendev | 07:34 | |
*** priteau has joined #opendev | 07:39 | |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: Use only noop jobs for openstack/monasca-transform https://review.opendev.org/751983 | 07:58 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** sshnaidm|afk is now known as sshnaidm | 08:04 | |
*** DSpider has joined #opendev | 08:06 | |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: End project gating for openstack/monasca-analytics https://review.opendev.org/751987 | 08:20 |
openstackgerrit | Rico Lin proposed zuul/zuul-jobs master: Allow skip files when download logs https://review.opendev.org/751973 | 08:21 |
*** ricolin has joined #opendev | 08:26 | |
*** ykarel is now known as ykarel|lunch | 08:28 | |
ricolin | ianw, please review again if you got time:) https://review.opendev.org/#/c/751973 | 08:33 |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: End project gating for openstack/monasca-analytics https://review.opendev.org/751987 | 08:39 |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: Remove openstack/monasca-analytics https://review.opendev.org/751993 | 08:46 |
*** ysandeep is now known as ysandeep|lunch | 08:55 | |
*** dtantsur|afk is now known as dtantsur | 09:00 | |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed zuul/zuul-jobs master: Add support to use stow for ensure-python https://review.opendev.org/751611 | 09:09 |
*** ykarel|lunch is now known as ykarel | 09:12 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Made parse_jenkins_failure a non static https://review.opendev.org/751999 | 09:19 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Sorted statusbot nick https://review.opendev.org/752008 | 09:48 |
*** tkajinam has quit IRC | 09:49 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Allow zbr to use statusbot https://review.opendev.org/752009 | 09:53 |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Update imp module to importlib https://review.opendev.org/752011 | 10:02 |
*** ysandeep|lunch is now known as ysandeep | 10:10 | |
*** slaweq has quit IRC | 10:13 | |
*** slaweq has joined #opendev | 10:21 | |
*** stephenfin has joined #opendev | 10:39 | |
*** fressi has quit IRC | 11:16 | |
frickler | infra-root: mirror.kna1.airship-citycloud.opendev.org has no /afs mount, leading to lots of 403s. there was a process stuck trying to do "rmmod openafs" since the node was rebooted on friday. trying another reboot now | 11:22 |
*** fressi has joined #opendev | 11:32 | |
*** Goneri has joined #opendev | 11:44 | |
fungi | i don't recall us rebooting it | 11:52 |
fungi | last manual reboot we logged for it at https://wiki.openstack.org/wiki/Infrastructure_Status was 2020-05-20 | 11:53 |
*** lpetrut has quit IRC | 11:54 | |
frickler | my reboot sadly hasn't helped, either | 11:54 |
frickler | so I'm deferring to someone with more experience debugging afs | 11:54 |
frickler | oh, now there's a backtrace in dmesg | 11:55 |
frickler | http://paste.openstack.org/show/797875/ | 11:56 |
*** rosmaita has joined #opendev | 11:57 | |
fungi | yeah, not super useful without the ksyms | 12:08 |
fungi | looks from /var/log/dpkg.log like unattended-upgrades replaced the kernel package 5 days ago and may have suffered a failure | 12:13 |
fungi | i'm trying to manually reinstall the kernel package and headers and rebuild the openafs module | 12:15 |
fungi | and rebooting it | 12:17 |
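A rough sketch of that recovery, assuming the mirror uses stock Ubuntu packaging with a dkms-built openafs module (the exact commands fungi ran are not in the log):

```bash
# reinstall the running kernel and matching headers, reinstall the openafs
# client, rebuild the out-of-tree module for this kernel, then reboot
# (package names are assumptions, not copied from the actual session)
apt-get install --reinstall "linux-image-$(uname -r)" "linux-headers-$(uname -r)"
apt-get install --reinstall openafs-client openafs-modules-dkms
dkms autoinstall   # rebuild any missing dkms modules against the running kernel
reboot
```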
*** Goneri has quit IRC | 12:23 | |
*** sshnaidm has quit IRC | 12:25 | |
*** slaweq has quit IRC | 12:30 | |
fungi | #status log reinstalled the kernel, kernel headers, and openafs-client on mirror.kna1.airship-citycloud.opendev.org and rebooted it, as it seems to have possibly been previously rebooted after an incomplete package update | 12:34 |
openstackstatus | fungi: finished logging | 12:34 |
fungi | frickler: seems to be working for me now, but i don't have time to test thoroughly just yet | 12:34 |
*** slaweq has joined #opendev | 12:35 | |
*** priteau has quit IRC | 12:42 | |
*** sshnaidm has joined #opendev | 12:50 | |
*** priteau has joined #opendev | 12:55 | |
fungi | i'm heading out to run errands, but should be back by 15:00 utc | 12:57 |
*** lpetrut has joined #opendev | 13:03 | |
*** priteau has quit IRC | 13:15 | |
*** hashar has quit IRC | 13:16 | |
auristor | fungi frickler ianw: RIP: 0010:afs_CellNumValid+0x4f/0xd0 [openafs] not the result of an incomplete package update. its due to a race during the startup of the openafs kernel module. if another process attempts to read from openafs /proc data or issue an afs pioctl like the "fs" command or attempts to set tokens using "aklog" while afsd is configuring the kernel module, this crash can occur. | 13:26 |
zbr | so basically there was nothing wrong with pypi? | 13:47 |
*** fressi_ has joined #opendev | 13:51 | |
*** fressi has quit IRC | 13:51 | |
*** fressi_ is now known as fressi | 13:51 | |
*** qchris has quit IRC | 13:53 | |
*** bbezak_ has joined #opendev | 13:57 | |
*** ShadowJonathan_ has joined #opendev | 13:57 | |
*** jentoio_ has joined #opendev | 13:57 | |
*** davidlenwell_ has joined #opendev | 13:57 | |
*** gouthamr__ has joined #opendev | 13:57 | |
*** priteau has joined #opendev | 13:57 | |
*** gmann_ has joined #opendev | 13:57 | |
*** cgoncalves has quit IRC | 13:58 | |
*** jbryce_ has joined #opendev | 13:58 | |
*** mrunge_ has joined #opendev | 13:59 | |
*** gouthamr has quit IRC | 14:00 | |
*** ShadowJonathan has quit IRC | 14:00 | |
*** jbryce has quit IRC | 14:00 | |
*** davidlenwell has quit IRC | 14:00 | |
*** bbezak has quit IRC | 14:00 | |
*** fungi has quit IRC | 14:00 | |
*** Eighth_Doctor has quit IRC | 14:00 | |
*** logan- has quit IRC | 14:00 | |
*** gmann has quit IRC | 14:00 | |
*** mrunge has quit IRC | 14:00 | |
*** jentoio has quit IRC | 14:00 | |
*** jroll has quit IRC | 14:00 | |
*** knikolla has quit IRC | 14:00 | |
*** dtantsur has quit IRC | 14:00 | |
*** jbryce_ is now known as jbryce | 14:00 | |
*** bbezak_ is now known as bbezak | 14:00 | |
*** ShadowJonathan_ is now known as ShadowJonathan | 14:00 | |
*** gouthamr__ is now known as gouthamr | 14:00 | |
*** davidlenwell_ is now known as davidlenwell | 14:00 | |
*** jentoio_ is now known as jentoio | 14:00 | |
*** logan_ has joined #opendev | 14:00 | |
*** gmann_ is now known as gmann | 14:00 | |
*** logan_ is now known as logan- | 14:00 | |
*** qchris has joined #opendev | 14:04 | |
*** knikolla has joined #opendev | 14:07 | |
*** fungi has joined #opendev | 14:07 | |
*** Eighth_Doctor has joined #opendev | 14:08 | |
*** ysandeep is now known as ysandeep|away | 14:08 | |
*** dtantsur has joined #opendev | 14:08 | |
*** jroll has joined #opendev | 14:09 | |
*** Tengu has joined #opendev | 14:17 | |
*** Tengu has quit IRC | 14:25 | |
*** cgoncalves has joined #opendev | 14:26 | |
*** Tengu has joined #opendev | 14:27 | |
clarkb | zbr: pypi is proxied and does not involve afs. Also, the pypi issues were in another cloud; I believe these are separate problems | 14:28 |
*** Topner has joined #opendev | 14:30 | |
zbr | clarkb: thanks for clarifying. it didn't make any sense for me to use a distributed fs for a mirror. | 14:32 |
zbr | s/mirror/proxy | 14:32 |
zbr | clarkb: was anyone able to reproduce the pypi issue? see https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219/3 | 14:34 |
zbr | without a curl -v ... output from a specific address, nobody will start working on it. | 14:35 |
*** Topner has quit IRC | 14:35 | |
clarkb | ianw discovered there are 58 backends | 14:36 |
clarkb | and each likely has multiple servers | 14:36 |
clarkb | previously the issue was attributed to fastly errors | 14:37 |
clarkb | and oslo.log 4.4.0 was updated around when those fastly errors happened last month | 14:37 |
clarkb | I assume they just never fully resynced after that and there is 1/1000 cdn nodes that has a problem | 14:37 |
zbr | seems like a time to write a script that tests an URL with all backends | 14:37 |
zbr | i bet it is more than this, the number of failures I've seen with my changes was more like 1/5 failed. | 14:38 |
clarkb | zbr: but only in the last day | 14:39 |
zbr | was the whl or the tar.gz that failed to download? | 14:39 |
clarkb | zbr: neither, its the index missing the latest package version which constraints requires | 14:39 |
clarkb | the index is old (like many weeks old) | 14:40 |
clarkb | if we didn't use constraints we'd install the older version and be fine (likely why most people don't notice) | 14:41 |
clarkb | because I'm sure other packages have the same problem | 14:42 |
Tengu | consistency issue in mirror/cdn? that's.... great. | 14:43 |
lourot | o/ we're getting apparently random POST_FAILUREs, e.g. here twice in a row: https://review.opendev.org/#/c/748650/ - is it known / related to what is being discussed here? thanks! | 14:46 |
clarkb | lourot: likely unrelated | 14:47 |
clarkb | mirrors tend to be used in pre and run not post | 14:47 |
clarkb | zbr: ianw tracked this down previously and this is the logs of what it was attributed to http://kafka.dcpython.org/day/pypa-dev/2020-08-25#00.11.33.PSFSlack | 14:48 |
clarkb | zbr: note that oslo.log was released around the time of that incident | 14:48 |
clarkb | I don't know if the issue is in fastly or pypi but its like they simply never recovered from that | 14:48 |
Tengu | though it's marked as "resolved" 22 days ago for fastly at least. | 14:49 |
Tengu | clarkb: any link to those 58 backends? | 14:51 |
clarkb | Tengu: sure they may think it was resolved :) but they are still serving stale oslo.log indexes several weeks later | 14:51 |
clarkb | Tengu: you have to do the requests and see the headers | 14:51 |
clarkb | do enough of them and you build up a large collection | 14:52 |
Tengu | clarkb: what request exactly? I came late to the party ^^' | 14:52 |
clarkb | I don't know that they publish them anywhere | 14:52 |
clarkb | Tengu: https://pypi.org/simple/oslo.log is the source material. Then many of the failures were happening in our ovh gra1 region which has a caching proxy to that url at https://mirror.gra1.ovh.opendev.org/pypi/simple/oslo.log | 14:53 |
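A single spot check against the proxy looks roughly like this; the interesting response headers are X-Served-By (the fastly chain) and X-Last-Serial, assuming the proxy passes them through unchanged:

```bash
# dump the response headers and the index body for oslo.log via the
# ovh-gra1 proxy, then check whether the pinned 4.4.0 release is listed
curl -s -D /tmp/oslo.log.headers -o /tmp/oslo.log.html \
    https://mirror.gra1.ovh.opendev.org/pypi/simple/oslo.log/
grep -iE 'x-served-by|x-last-serial' /tmp/oslo.log.headers
grep -c '4\.4\.0' /tmp/oslo.log.html   # 0 matches means the index is stale
```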
clarkb | lourot: your errors are related to a new swift backend we added yesterday to store logs | 14:53 |
*** user_19173783170 has quit IRC | 14:53 | |
Tengu | yup. I'm getting 404 on that ovh proxy. | 14:53 |
clarkb | Tengu: it shouldn't 404 | 14:54 |
Tengu | clarkb: maybe purging some cache... ? | 14:54 |
clarkb | (it was never 404ing before) | 14:54 |
clarkb | the issue is that the index.html served by pypi (and cached by our proxy) was stale and lacked the 4.4.0 release | 14:54 |
Tengu | oh. funky. lemme recheck the URL I used | 14:54 |
clarkb | note that url does not 404 and the 4.4.0 release is currently present if I load it | 14:55 |
Tengu | indeed. | 14:55 |
Tengu | oslo.log-4.4.0.tar.gz#sha256= | 14:55 |
fungi | clarkb: zbr: to be exact, that's 58 endpoints which respond for requests from ovh-gra1, they won't be the same endpoints which respond from other parts of the world necessarily | 14:55 |
Tengu | ah. the 404 was on https://mirror.gra1.ovh.opendev.org/simple/oslo-log/ apparently missing the "pypi" prefix. | 14:56 |
Tengu | that URL comes from the CI job I've checked. | 14:56 |
*** Topner has joined #opendev | 14:56 | |
clarkb | fungi: maybe, I'm not sure how ianw generated the list | 14:58 |
clarkb | Tengu: can you share links to that? that appears to be a different issue | 14:58 |
fungi | clarkb: a while loop over wget -S grepping out the X-whatever header which lists the fastly chain | 14:58 |
Tengu | 1s | 14:58 |
clarkb | fungi: wgeting pypi.org directly? | 14:58 |
clarkb | from the mirror? | 14:59 |
*** TheJulia has joined #opendev | 14:59 | |
fungi | yep, run on a vm in that cloud region (i assume on our mirror server for convenience) | 14:59 |
fungi | now that i'm back from errands i can try to continue the data collection experiment while multitasking with meetings | 14:59 |
Tengu | clarkb: for instance, it hits https://mirror.gra1.ovh.opendev.org/pypi/simple which loads as expected. but every link in there is apparently /simple/<module-name>, missing the "pypi" prefix. | 14:59 |
clarkb | also can someone push a opendev/base-jobs change to remove the oe cloud from playbooks/base/post-logs.yaml and playbooks/base-minimal/post-logs.yaml? Its erroring (see lourot's link above) and my local disk is full so I can't do git things right now and I have a meeting to dial into :/ | 14:59 |
Tengu | not sure if it has anything to do though. No pypi/pip expert here, just a bit wondering. | 15:00 |
Tengu | clarkb: for instance, in this job: https://e9bcfc635397bc0a3785-9ea232b51691ad27ed2d3f8993736155.ssl.cf2.rackcdn.com/751828/8/check/tripleo-build-containers-ubi-8/b164baf/job-output.txt | 15:00 |
clarkb | Tengu: but none of the links there should point at simple | 15:00 |
Tengu | clarkb: sooooo.... that's the issue? | 15:00 |
clarkb | Tengu: if you load up the index page and inspect source you'll see they are all hosted under /pypifiles | 15:00 |
fungi | probably *an* issue but not the same issue we're digging into | 15:01 |
clarkb | Tengu: its possible they are serving stale and also different data | 15:01 |
fungi | or could be a related issue i guess | 15:01 |
clarkb | but different symptoms | 15:01 |
clarkb | Tengu: I don't think you are reading that error correctly | 15:02 |
clarkb | Tengu: what it is saying is that the index lists a bunch of versions but not 4.4.0 | 15:02 |
clarkb | I don't see where it says https://mirror.gra1.ovh.opendev.org/simple/oslo-log/ is a 404 | 15:02 |
Tengu | clarkb: so. go here: https://mirror.gra1.ovh.opendev.org/pypi/simple you get everything listed right? click on ANY link there, and it ends 404. | 15:02 |
clarkb | Tengu: that isn't what pip is doing | 15:03 |
Tengu | still weird the links on that index page are just plain wrong. | 15:03 |
clarkb | Tengu: pip doesn't rely on the top level index (so if thats broken its a problem maybe but not the problem) | 15:04 |
Tengu | 'k. | 15:04 |
clarkb | pip instead will directly lookup package names like /pypi/simple/oslo.log | 15:04 |
clarkb | that gives it an html index with a list of versions and some metadata about the versions of python those packages support | 15:04 |
clarkb | The error you're seeing is that it sees a bunch of versions like 4.3.0 but not 4.4.0 which is what openstack's constraints requires | 15:04 |
clarkb | because constraints requires a very specific version that does not exist pip then errors | 15:05 |
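Reduced to one command, the failing jobs are effectively doing something like this (constraints path and mirror URL as seen in the job logs above); with a stale index the pinned version simply isn't offered and pip errors out:

```bash
# install oslo.log pinned by upper-constraints through the ovh-gra1 proxy;
# a stale index lists 4.3.x and older but not 4.4.0, so pip reports
# "No matching distribution found for oslo.log===4.4.0"
pip install \
    -c /opt/stack/requirements/upper-constraints.txt \
    -i https://mirror.gra1.ovh.opendev.org/pypi/simple/ \
    oslo.log
```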
Tengu | ok. so that top-level index is just misleading basic humans hoping to just click'n'follow links then. | 15:05 |
fungi | or more specifically, which isn't included in the index it's returning for that package name | 15:05 |
clarkb | the last time we looked into this pypi said there were fastly errors, see http://kafka.dcpython.org/day/pypa-dev/2020-08-25#00.11.33.PSFSlack and those happened around the time when oslo.log was released. That makes me suspect that they never fully resynced the stale data after that issue | 15:05 |
Tengu | so yeah, I do understand it doesnt find the right version - I was just surprised ending in a stupid 404 when just browsing the ovh host. | 15:06 |
clarkb | I suspect if we released a oslo.log 4.4.1 this issue would correct | 15:06 |
clarkb | I also suspect if pypi took their users at their word and resynced any packages published around august 24/25 we'd be good too | 15:06 |
clarkb | but without doing a billion requests to find the needle in the haystack I bet we won't get that | 15:07 |
Tengu | so getting a dummy 4.4.1 might just do the trick | 15:07 |
Tengu | and would probably be faster | 15:07 |
Tengu | and we'd get 4.4.0 (the one we want in the requirements) and that new 4.4.1 | 15:07 |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Disable OE swift for log storage https://review.opendev.org/752066 | 15:08 |
yoctozepto | morning infra | 15:08 |
clarkb | I found a VM image I could rm :) infra-root donnyd fyi ^ we should land that then debug why we get errors on some OE swift uploads | 15:08 |
clarkb | Tengu: that is my hunch | 15:09 |
fungi | i've got a loop going every 10 seconds testing https://pypi.org/simple/oslo.log/ looking for whether it includes oslo.log-4.4.0.tar.gz and logging the X-Served-By: header values in the responses | 15:09 |
fungi | running on mirror.gra1.ovh | 15:09 |
yoctozepto | I see you already know how bad the situation is | 15:09 |
clarkb | Tengu: the reason you get a 404 at the root index is because we're proxying pypi under a subpath in this case /pypi and we aren't doing rewrite of the response content (just the headers) | 15:10 |
fungi | yoctozepto: yes, i'm hoping we can collect sufficient data to convince the pypi admins there's an issue in fastly so they can report it or at least try to get the problem indices recirculated | 15:10 |
clarkb | Tengu: we can fix this by moving pypi proxy caching onto a separate port (then we can drop the subpath) | 15:11 |
Tengu | clarkb: ahhh | 15:11 |
clarkb | Tengu: but pip doesn't care so we haven't bothered | 15:11 |
*** ykarel is now known as ykarel|away | 15:12 | |
fungi | odds are there was a fastly endpoint in france which was down/disconnected during the incident around the time oslo.log 4.4.0 got uploaded, so missed the memo, and has recently been brought back online in the last day or so | 15:12 |
yoctozepto | fungi: /me keeping fingers crossed for the infra team :-) | 15:12 |
smcginnis | Of course, all the oslo deliverables were released around the same time, so we would probably have to refresh all of them. | 15:14 |
yoctozepto | smcginnis: ++, I saw different oslos failing indeed | 15:14 |
*** slaweq has quit IRC | 15:19 | |
*** slaweq has joined #opendev | 15:23 | |
clarkb | yoctozepto: can you link to the other similar failures? we should double check that they aren't python version mismatches and that they exhibit the same basic behavior (may help in tracking it down) | 15:27 |
clarkb | fungi: fyi https://review.opendev.org/#/c/752066/ | 15:28 |
clarkb | donnyd: re ^ we are getting HTTP 400 errors for container urls looks like | 15:28 |
*** ysandeep|away is now known as ysandeep | 15:28 | |
clarkb | donnyd: there isn't a whole lot of detail. Maybe you can see more on your side? zuul_opendev_logs_d0d is one that had this problem | 15:29 |
donnyd | That is very strange | 15:29 |
clarkb | donnyd: happened around 2020-09-15 14:38:16 from ze09.openstack.org (it should have ipv4 and ipv6 addrs if you need to check both) | 15:30 |
donnyd | Well that would be because container d0d was never created | 15:35 |
donnyd | donny@office:~> openstack container list |grep d0d | 15:35 |
donnyd | donny@office:~> | 15:35 |
yoctozepto | clarkb: sure, I'll create a paste of them | 15:37 |
donnyd | now that is strange | 15:39 |
donnyd | When I checked at 6am there were 1004 containers... when I check just now... there are 1004 containers | 15:39 |
donnyd | there should be more | 15:39 |
donnyd | maybe there is a quota limit on the # of containers by default | 15:40 |
yoctozepto | clarkb: done: http://paste.openstack.org/show/797891/ | 15:41 |
johnsom | Oh joy, the packages are still not loading... Let me know if I can collect any info for you. | 15:42 |
donnyd | annnnd I found the issue | 15:45 |
donnyd | https://www.irccloud.com/pastebin/P6b5ZSGC/ | 15:46 |
donnyd | clarkb: ^^^ | 15:46 |
clarkb | donnyd: we need to increase max buckets? | 15:46 |
clarkb | donnyd: let me do some math to give you an idea of how many we use | 15:46 |
clarkb | (its capped) | 15:46 |
*** lpetrut has quit IRC | 15:47 | |
donnyd | yep | 15:48 |
donnyd | I would rather just disable it | 15:48 |
donnyd | https://www.irccloud.com/pastebin/nvyGm5h7/ | 15:50 |
clarkb | donnyd: ah ok I was having a hard time finding where we generated it (I know I edited in the past and its capped) | 15:51 |
clarkb | removing the limit works too | 15:51 |
donnyd | should be good to go now | 15:51 |
clarkb | donnyd: infra-root I WIP'd https://review.opendev.org/#/c/752066/1 as we think ^ should fix it | 15:51 |
donnyd | can I get someone with the openstackzuul user to try to create a bucket? | 15:52 |
clarkb | donnyd: ya I can create a test bucket | 15:52 |
clarkb | donnyd: Unauthorized (HTTP 401) | 15:54 |
clarkb | I can list them though | 15:54 |
donnyd | so I was getting the same error and I thought it was maybe just me | 15:55 |
donnyd | ok, I will get it fixed and get back to you | 15:55 |
clarkb | donnyd: should we land the disabling change for now? | 15:55 |
clarkb | I assume that error will hit jobs if they try to create containers too | 15:55 |
donnyd | yes | 15:55 |
donnyd | I would assume until I get this sorted it will fail jobs | 15:56 |
clarkb | ok I've approved the change | 15:56 |
clarkb | thank you for looking at it | 15:56 |
donnyd | so setting that value to -1 does not work | 15:57 |
donnyd | it makes things come back as unauth | 15:57 |
donnyd | "max_buckets": 999999999, | 15:58 |
donnyd | try it now | 15:58 |
clarkb | donnyd: that worked it created clarkb_test | 15:59 |
clarkb | can you double check you see ^ | 15:59 |
clarkb | removed my approval from the base-jobs update now :) | 15:59 |
donnyd | donny@office:~> openstack container list |grep clark | 15:59 |
donnyd | | clarkb_test | | 15:59 |
donnyd | yep, I can see | 15:59 |
donnyd | I will have to figure out how to just disable that entirely | 16:00 |
donnyd | don't want to get to 1000000000 and have it fail again | 16:01 |
donnyd | should be good for now though | 16:01 |
clarkb | donnyd: we shard with the first 3 chars of the build uuid. build uuids can have a-f and 0-9 in them which is a total of 16 * 15 * 14 containers I think | 16:03 |
clarkb | donnyd: so ~3360 | 16:03 |
donnyd | so we should be good with 999999999 for the foreseeable future | 16:03 |
clarkb | ya | 16:03 |
fungi | yeah, i guess the default was just only enough for roughly a third of what we needed | 16:04 |
clarkb | er no its 16 * 3 because you can have duplicates | 16:04 |
clarkb | sorry I was thinking of deck of cards | 16:04 |
clarkb | er 16 ** 3 | 16:04 |
clarkb | math I can do it :) | 16:04 |
clarkb | 4096 | 16:04 |
clarkb | if you want to set it to 8192 it should be plenty | 16:04 |
fungi | s/third/quarter/ then ;) | 16:04 |
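For reference, the shard math and a quick way to see how many log containers actually exist (the name filter is a guess based on the zuul_opendev_logs_d0d example above):

```bash
# maximum number of log containers: the shard is the first 3 hex characters
# of the build uuid, so 16^3 possible values
echo $((16 ** 3))   # 4096
# count the zuul log containers currently present in the cloud
openstack container list -f value -c Name | grep -c '^zuul_opendev_logs_'
```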
donnyd | ok, 24576 is what I set it to | 16:05 |
donnyd | I can still create container | 16:06 |
donnyd | I think this is resolved | 16:06 |
donnyd | sorry for that | 16:06 |
donnyd | should have thought about it before | 16:06 |
donnyd | exit | 16:06 |
donnyd | wrong terminal | 16:07 |
fungi | exeunt! | 16:07 |
*** ykarel|away has quit IRC | 16:09 | |
-openstackstatus- NOTICE: Our PyPI caching proxies are serving stale package indexes for some packages. We think because PyPI's CDN is serving stale package indexes. We are sorting out how we can either fix or workaround that. In the meantime updating requirements is likely the wrong option. | 16:09 | |
openstackgerrit | Merged opendev/elastic-recheck master: Create elastic-recheck container image https://review.opendev.org/750958 | 16:14 |
*** ysandeep is now known as ysandeep|away | 16:16 | |
auristor | no | 16:30 |
auristor | oops | 16:30 |
*** Topner has quit IRC | 16:33 | |
*** fressi has left #opendev | 16:35 | |
*** dtantsur is now known as dtantsur|afk | 16:46 | |
*** slaweq has quit IRC | 16:47 | |
clarkb | ugh I've run out of local disk space again. That 3.2GB went quickly. I must have a misbehaving daemon chewing up log space or something | 17:15 |
clarkb | and now I'm deep in the bowels of btrfs balancing | 17:30 |
fungi | my sympathies | 17:35 |
fungi | i'm back to trying to bang my head against the storyboard continuous deployment being noncontinuous | 17:36 |
fungi | and also getting openstack wallaby cycle elections underway | 17:36 |
*** paladox has quit IRC | 17:37 | |
*** paladox has joined #opendev | 17:38 | |
*** tosky has quit IRC | 17:40 | |
clarkb | the good news is I think I've sorted my problem out | 17:42 |
clarkb | its basically defragmenting btrfs | 17:42 |
clarkb | I thought we left those problems behind | 17:43 |
smcginnis | Does btrfs have a pleasing UI to watch it defrag? :) | 17:44 |
clarkb | smcginnis: btrfs balance status -v /path | 17:44 |
clarkb | up to you if you find that pleasing | 17:44 |
smcginnis | I want to watch all the little blocks move around. Was my favorite part about defrag.exe. | 17:44 |
*** diablo_rojo has joined #opendev | 17:44 | |
clarkb | fedora users beware :) | 17:45 |
clarkb | I'm going to ensure both of my btrfs filesystems are happy before doing anything else today as I've hit problems with it a couple times in a short period and its super frustrating | 17:48 |
clarkb | ok I think I'm set now with plenty of free space | 17:57 |
clarkb | back to trying to grok the pypi things | 17:57 |
johnsom | I had to dump btrfs. The VMs were killing it with painful CoW chains. | 18:00 |
fungi | Error: /Stage[main]/Storyboard::Application/Exec[install-storyboard]: 'pip install /opt/storyboard' returned 1 instead of one of [0] | 18:02 |
fungi | well, doesn't give me the output from pip, but at least it's somewhere to start from | 18:02 |
clarkb | fungi: I'm pulling up recent fails from the zuul status page a few within the last ~6 minutes. Anyway out of my ~3 examples all of them have the versions in the index files if I request them through the mirror in my browser | 18:02 |
clarkb | that does make me wonder if pip is doing something with headers or maybe fetching some other path that is breaking | 18:03 |
fungi | yeah, and i still haven't caught any wgets to pypi from the mirrors returning stale content | 18:03 |
fungi | what packages/releases/regions were those? anything besides oslo.log 4.4.0 in ovh-gra1 and oslo.service 2.4.0 in rax-dfw? | 18:04 |
clarkb | ya https://mirror.ord.rax.opendev.org/pypi/simple/python-neutronclient/ from https://ace7c3bbea1b346d61ae-b7ce26267fa7738cf72e86f3ba4b5e8c.ssl.cf5.rackcdn.com/751040/3/check/openstack-tox-py38/20a4543/job-output.txt | 18:04 |
clarkb | https://mirror.mtl01.inap.opendev.org/pypi/simple/reno/ from https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_66f/751040/3/check/openstack-tox-py36/66f5320/job-output.txt and https://ee2d68f892c305f185f6-70dee07df320095fd5d268c658f557da.ssl.cf2.rackcdn.com/749748/5/check/openstack-tox-docs/3a08312/job-output.txt | 18:05 |
clarkb | also thats two different hosts in the same region seeing the same content which makes it extra weird that if I fetch it I can't reproduce | 18:05 |
fungi | so i could also start similarly polling in rax-ord for python-neutronclient <someversion> and... | 18:05 |
clarkb | granted I may be fetching after the ttl has expired on that | 18:05 |
fungi | were any of the failures you observed for anything you already tried to flush in the cdn? | 18:06 |
clarkb | no I've only tried to flush oslo.log | 18:06 |
fungi | just wondering whether we have positive evidence of that workaround not helping | 18:07 |
clarkb | another curiousity is that its always packages we control it seems like | 18:07 |
clarkb | and all the things have releases since end of august | 18:08 |
fungi | okay, i've confirmed the fact that we stopped deploying storyboard after the change to drop python2.7 was no coincidence: storyboard-api requires Python '>=3.6' but the running Python is 2.7.12 | 18:08 |
clarkb | kind of makes me wonder if it is something about pushing newer packages | 18:08 |
clarkb | fungi: pip sets cache-control: max-age=0 when fetching indexes eg /pypi/simple/reno | 18:22 |
fungi | neat, so we're always going to cache-miss those regardless | 18:22 |
*** andrewbonney has quit IRC | 18:23 | |
clarkb | the server then sets cache-control: max-age=600 | 18:23 |
clarkb | I think max-age=0 from the client means the proxy has to recheck the content is hasn't changed | 18:24 |
clarkb | but that is a simple check and maybe we're breaking in that somehopw | 18:24 |
clarkb | except if that was the case I'd expect it to serve me the broken version when I load it 6 minutes later | 18:24 |
fungi | it does also debunk the theory that we're somehow magnifying the problem by caching and continuing to serve stale responses for multiple consecutive requests | 18:25 |
clarkb | fungi: fwiw I sorted out pip request headers via tcpdump and installing reno via http://mirror.mtl01.inap.opendev.org | 18:27 |
clarkb | the index contents themselves are gzip'd so I'll need to do better packet capture than tcpdump -A to verify those contents however I figure the install is working for reno 3.2.0 so the index contents likely won't show anything useful | 18:27 |
clarkb | fungi: for reproducing though we may need to set max-age=0 like pip does | 18:30 |
clarkb | because that may also effect the fastly caches | 18:31 |
clarkb | your wgets are probably not setting that so can use good cached data, but maybe this breaks when fastly has to refresh or something | 18:31 |
clarkb | in the case of our jobs talking through our caches pip will make a request with max-age=0. That will force our cache to check with fastly if things have been updated. That check may not actually include the max-age=0 though its probably implied when you are checking updated timestamps. If fastly says we are stale then we send the request to reup the value to fastly which will include the max-age=0 which | 18:34 |
clarkb | should force fastly to repopulate data? | 18:34 |
clarkb | if fastly says we are good then we just serve the cached content | 18:34 |
clarkb | fungi: I wonder if we shouldn't try and trace this out and draw it up | 18:37 |
clarkb | just thinking through it in my head it does seem like we'd get stale indexes more often this way if there are stale indexes in fastly | 18:38 |
clarkb | because pip is going to always check and if they are stale compared to what we've got we'll be told to "update" | 18:38 |
clarkb | then we fetch the bad data and have a sad. Then the next request that comes by checks and sees we are actually stale and updates again and gets good results | 18:38 |
clarkb | which can explain why this is so difficult to diagnose | 18:39 |
fungi | so should i adjust wget to send a max-age=0 with the request to get a pull through the cache? | 18:40 |
clarkb | yes I think so in order to mimic what pip is doing | 18:43 |
clarkb | (and I wonder if it is that behavior that increases the odds of getting the stale content) | 18:43 |
clarkb | https://868d9d181f45f3dc08e4-0ff94bb25fd1b2b22d207854a15b8765.ssl.cf5.rackcdn.com/744883/16/check/openstack-tox-py36/96380ed/job-output.txt just complained about https://mirror.ord.rax.opendev.org/pypi/simple/taskflow/ but loading it in my browser its fine | 18:45 |
fungi | --header='Cache-Control: max-age=0' i guess? | 18:49 |
clarkb | yes that looks correct to me | 18:49 |
*** rosmaita has quit IRC | 18:49 | |
fungi | okay, i've incorporated that into my test loops | 18:50 |
*** Topner has joined #opendev | 19:08 | |
*** Topner has quit IRC | 19:13 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Retire Fedora 31 for 32 https://review.opendev.org/751975 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 19:15 |
*** priteau has quit IRC | 19:16 | |
clarkb | fungi: for example https://702b7e8f253d29e679a6-2fe3f6c342189909aad5220492fb4721.ssl.cf1.rackcdn.com/743189/10/check/openstack-tox-py38/2c678fd/job-output.txt complains about octavia-lib==2.2.0 but says 2.1.1 is valid. If you view source on https://mirror.mtl01.inap.opendev.org/pypi/simple/octavia-lib/ they appear to have the same interpreter requirements | 19:59 |
clarkb | it really does seem like we're getting the previous index | 19:59 |
clarkb | rather than breaking with python versions | 19:59 |
clarkb | https://etherpad.opendev.org/debugging-pypi-index-problems I'm going to start taking notes there | 20:00 |
johnsom | Which package are you using for the cache? squid? | 20:03 |
fungi | apache mod_cache | 20:03 |
clarkb | but as frickler reported this has been observed locally running devstack without our mirrors | 20:05 |
clarkb | and we have seen this before talking directly to fastly too | 20:05 |
clarkb | (strongly implying it isn't apache that is at fault) | 20:05 |
johnsom | I assume you have tried a flush on an apache instance that is behaving strange too. That would also rule out mod_cache issue | 20:09 |
clarkb | johnsom: that is what we just discussed in our meeting | 20:09 |
clarkb | I think we want to observe what we've cached first | 20:09 |
fungi | we talked about it in the meeting. if it continues after we flush the cache then we can say it's likely not the cache, but if it stops after we flush the cache we don't know that it was the cache flushing which resolved it or just coincidentally happened to clear up on its own at the same time | 20:10 |
clarkb | (made difficult by the structure of the cache data, but should be doable) | 20:10 |
johnsom | Yes, fair enough | 20:10 |
fungi | also doesn't explain why its cropped up in different provider regions each with its own cache in the span of a day or so | 20:11 |
fungi | the problem would have had to start impacting multiple caches on different virtual machines in different providers all at approximately the same time | 20:11 |
johnsom | Well, we don't know it's a problem until upper-constraints update, then we all start looking for new versions | 20:11 |
fungi | i suppose it couldn't hurt to check when the problem entries were added to the constraints lists, but i understand we're seeing it complain about missing releases which were tagged several weeks ago | 20:12 |
clarkb | johnsom: ya though some of these packages like oslo.log are old | 20:12 |
clarkb | did we not update constraints for oslo.log until recently maybe? | 20:12 |
clarkb | fungi: correct | 20:12 |
johnsom | 20 days ago, but the new rush for this is probably the switch to focal and needing py3.x support | 20:14 |
fungi | folks started reporting the problem yesterday | 20:15 |
fungi | i think | 20:15 |
clarkb | fungi: ya | 20:15 |
johnsom | Yeah, but it's happened at least twice before in the past | 20:15 |
fungi | oh, it's happened many times in recent years | 20:15 |
clarkb | the joys of cdn | 20:15 |
fungi | but more recently we've looked into incidents on august 19 and august 25 | 20:16 |
clarkb | we've definitely tracked it back to cdn returning stale data before | 20:16 |
clarkb | the difficulty now is that its persisting and not reproducing as easily | 20:16 |
clarkb | and last time it came up pypi pointed at a fastly outage | 20:16 |
clarkb | maybe its the same thing this time around and fastly hasn't noticed yet | 20:16 |
* clarkb finds lunch | 20:16 | |
smcginnis | Nearly all projects should have been running in py3.x for quite some time now. | 20:17 |
smcginnis | And yeah, some of these were updated in upper-constraints for weeks now. | 20:17 |
johnsom | Yeah, most have been, but the focal switch caused new problems that forced most everyone to bump lower-constraints | 20:17 |
fungi | are these mostly lower-constraints jobs we're seeing? | 20:19 |
johnsom | No | 20:19 |
johnsom | I have seen it across tempest and unit tests. It seems more regional than test related. | 20:20 |
ianw | i don't exactly know what this means yet, but i've found 3 instances of what looks like the pypi python-designateclient page on mtl-1 | 20:22 |
ianw | http://paste.openstack.org/show/797902/ | 20:22 |
ianw | yes! this has headers | 20:24 |
ianw | http://paste.openstack.org/show/797903/ | 20:26 |
ianw | fungi / clarkb: ^ double check my work here, but this looks like a cached version of the pypi page *without* designateclient 4.1.0 | 20:26 |
ianw | Date: Tue, 15 Sep 2020 07:24:38 GMT | 20:27 |
clarkb | with no serial and no interpreter requirements | 20:28 |
clarkb | when did we release 4.1.0? | 20:28 |
clarkb | not today ya? | 20:28 |
smcginnis | No, not today. | 20:28 |
smcginnis | Checking... | 20:28 |
clarkb | if so that points to fastly I think given the date | 20:28 |
*** Topner has joined #opendev | 20:28 | |
smcginnis | August 3 is the merge commit. | 20:28 |
ianw | clarkb: yeah, you're right, no serial on that result, i didn't accidentally cut it off | 20:28 |
ianw | double check mtl-01 mirror /var/cache/apache2/proxy/91/pq/qD/Xf/1O/Cmdy2okLRNwQ.header.vary/BJ/1z/1Z/_7/jD/HNLd0wzgcbGA.data | 20:29 |
smcginnis | And python-designateclient also updated in upper-constraints on Aug 3. | 20:29 |
fungi | okay, i've revamped the test loop to use curl, pass similar compression headers, the cache-control max-age=0 and also a lifelike pip ua i scraped from one of our logs | 20:30 |
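That revamped probe presumably looks something like the request below; the point is to send the same Cache-Control, Accept-Encoding, and User-Agent that pip does so fastly handles it the same way (the UA string here is a shortened stand-in, not the one fungi scraped):

```bash
# mimic a pip index request and log the interesting response headers;
# best run from the mirror host so it leaves from the same provider network
curl -s -D- -o /dev/null \
    -H 'Cache-Control: max-age=0' \
    -H 'Accept-Encoding: gzip' \
    -A 'pip/20.2.3' \
    https://pypi.org/simple/oslo.log/ \
    | grep -iE 'x-served-by|x-last-serial|last-modified|etag'
```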
ianw | there's now 3 other hits from walking the cache; they all have 4.1.0 and a serila | 20:30 |
ianw | serial even | 20:30 |
fungi | which is the mtl-01 mirror? ca-ymq-1.vexxhost? | 20:31 |
clarkb | fungi: mtl01.inap | 20:31 |
fungi | ahh | 20:31 |
fungi | did we get failures there too? | 20:31 |
clarkb | ianw makena copy so that htcacheclean doesnt accidentally get it | 20:31 |
clarkb | fungi: yes all over | 20:32 |
ianw | that was the one mentioned in the meeting just now | 20:32 |
fungi | got it | 20:32 |
*** Topner has quit IRC | 20:33 | |
fungi | and yeah, that cache example was modified today, probably created today but stat doesn't report creation time | 20:34 |
clarkb | ianw: if I zcat that I see no 4.1.0 but don't see the headers | 20:34 |
ianw | headers are in ./91/pq/qD/Xf/1O/Cmdy2okLRNwQ.header.vary/BJ/1z/1Z/_7/jD/HNLd0wzgcbGA.header | 20:34 |
clarkb | I guess that is a separate file | 20:34 |
clarkb | thanks | 20:34 |
clarkb | and ya I agree with what yuo've found | 20:35 |
fungi | X-Served-By: cache-lcy19265-LCY, cache-yul8920-YUL | 20:35 |
clarkb | note the index.html is completetly different too | 20:35 |
clarkb | the sdists and wheels are in two separate lists rather than a shared list sorted by version | 20:36 |
clarkb | thee is no python interpreter requirements either | 20:36 |
clarkb | I think thats what we needed to go to pypi. Though we should note we see similar all over the place | 20:37 |
clarkb | its just this is what we managed to catch | 20:37 |
clarkb | https://702b7e8f253d29e679a6-2fe3f6c342189909aad5220492fb4721.ssl.cf1.rackcdn.com/743189/10/check/openstack-tox-py38/2c678fd/job-output.txt is octavia-lib 2.2.0 on the same mirror | 20:38 |
clarkb | ianw: ^ maybe you want to run the same find on that and we can see if the same x-served-by shows up? | 20:38 |
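Roughly how those entries can be dug out of mod_cache's store on a mirror (the find arguments are a guess at what ianw ran; the layout of paired .header/.data files matches the path he pasted above):

```bash
# look for recently written cache entries whose stored request URL mentions
# the package; headers live in *.header, the body is the gzipped *.data file
cd /var/cache/apache2/proxy
find . -name '*.header' -newermt '1 day ago' -print0 \
    | xargs -0 grep -la 'octavia-lib' 2>/dev/null
# then inspect a hit: cat the .header file and zcat the matching .data file
```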
ianw | http://paste.openstack.org/show/797905/ | 20:39 |
ianw | comparison of the two headers for the designate index page; one good, the pip one bad | 20:40 |
ianw | ok, i'll kill and see if we can find similar for octavia-lib | 20:41 |
ianw | http://paste.openstack.org/show/797906/ would seem to be it | 20:42 |
clarkb | there is no x-last-serial in the bad one either | 20:43 |
ianw | Date: Tue, 15 Sep 2020 07:22:02 GMT .. same time | 20:43 |
ianw | ~ same time, anyway | 20:43 |
clarkb | LCY again | 20:43 |
clarkb | too | 20:43 |
clarkb | LCY is it you!? | 20:44 |
ianw | but, apache must be deciding that it can serve this data based on the headers coming back from pypi/fastly, right? | 20:44 |
fungi | i guess lcy is london? | 20:44 |
clarkb | ianw: yes, I think the way it works with the client setting max-age=0 is apache will ask the backend if the data it has is fresh | 20:44 |
ianw | hashar i think pointed out they're airport codes | 20:44 |
johnsom | There is an "I love Lucy" joke in there | 20:45 |
fungi | right, lcy is the iata code for london city airport | 20:45 |
clarkb | ianw: so what must've happened there is we asked LCY again if it was fresh and it said yes and we reused it? | 20:45 |
ianw | that's going to be the e-tag matching? | 20:45 |
* ianw goes to read exactly what etag is | 20:45 | |
clarkb | ianw: I'm not entirely sure how apache checks but I would expect something like etags yes | 20:45 |
clarkb | fungi: fwiw I found fastly docs that confirmed what you said about that header. The first region is the one that talks to pypi directly and the second is the one that we talk to | 20:48 |
ianw | so theory is apache does a GET request, gets back headers, the returned ETag: must match what we have in the cache there, and we serve up the old data | 20:48 |
clarkb | ianw: ya | 20:48 |
clarkb | ianw: and the other times when it works we get the other etag and we match that to the other content we've cached | 20:48 |
clarkb | if fastly stops serving the bad etag'd data we should stop using what we've cached | 20:49 |
clarkb | (because it won't be considered valid anymore) | 20:49 |
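That theory can be checked by hand: take the ETag out of one of the cached .header files and revalidate it against pypi the way mod_cache does; an HTTP 304 means fastly is telling the proxy its stale copy is still good (the ETag value below is a placeholder):

```bash
# conditional revalidation of the octavia-lib index: send the cached ETag
# back with If-None-Match and print only the status code
curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'Cache-Control: max-age=0' \
    -H 'If-None-Match: "PLACEHOLDER-ETAG-FROM-CACHED-HEADER"' \
    https://pypi.org/simple/octavia-lib/
```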
clarkb | I expect this is sufficient info to take to them | 20:50 |
clarkb | while also noting that we're seeing it in a fairly globally distributed manner so it may affect more than just LCY | 20:50 |
clarkb | rax dfw and ord have hit it, ovh gra1, and inap mtl01 at least | 20:51 |
clarkb | with various packages | 20:51 |
johnsom | The Last-Modified header seems suspicious as well. In the bad examples, it's 9 of April. | 20:51 |
clarkb | johnsom: ya and the content itself is all different too | 20:51 |
clarkb | johnsom: current content is a list sorted by version. Old content is two lists, one for sdists and one for wheels | 20:51 |
clarkb | current content has a serial in the html, old content doesn't | 20:51 |
ianw | when did <!--SERIAL 8050067--> get added? | 20:51 |
clarkb | ianw: I don't know | 20:52 |
ianw | was that a flag day that pypa regenerated every index file? | 20:52 |
clarkb | (I just know that its there now :) ) | 20:52 |
ianw | that might be a data point | 20:52 |
ianw | i think we have quite a bit of data ... the question is ... who can do anything about it :) | 20:53 |
clarkb | pypi I think. The warehouse issue tracker is where they ask for feedback and issues | 20:53 |
clarkb | https://github.com/pypa/warehouse/issues | 20:53 |
fungi | or ping dstufft and di in #pypa-dev on freenode | 20:53 |
*** slaweq has joined #opendev | 20:54 | |
clarkb | and zbr posted https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219/3 which is related | 20:54 |
clarkb | also note that it isn't 404s... | 20:54 |
*** slaweq has quit IRC | 20:55 | |
clarkb | I'm happy to write the issue but don't want to step on toes (ianw did the magic that sorted it out I think) | 20:55 |
clarkb | let me know how I can help and I'll gladly do what I can :) | 20:55 |
ianw | i think we should get maybe a few more results from some of the other regions to amass a bigger picture | 20:55 |
clarkb | ianw: sounds good. Let me look through my job links to find some from not mtl01 | 20:56 |
clarkb | https://60ce53b5f45ed0ef7bd3-e873feb845d99f2e0685947947034235.ssl.cf1.rackcdn.com/751040/3/check/openstack-tox-pep8/1921a1e/job-output.txt oslo.serialization from rax dfw | 20:56 |
clarkb | https://868d9d181f45f3dc08e4-0ff94bb25fd1b2b22d207854a15b8765.ssl.cf5.rackcdn.com/744883/16/check/openstack-tox-py36/96380ed/job-output.txt taskflow from rax ord | 20:56 |
ianw | i need to get some breakfast :) maybe we should etherpad up a few results with headers/timestamps and then we can write a good bug | 20:56 |
ianw | bib | 20:56 |
clarkb | https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_158/743189/10/check/openstack-tox-pep8/158d6e1/job-output.txt octavia-lib on ovh bhs1 | 20:57 |
fungi | i can sympathize with their desire not to maintain an "issue tracker for operational problems" any more than we want to, they want help tracking these problems down and assistance solving them, not some black box where people can drop their complaints | 20:57 |
clarkb | https://etherpad.opendev.org/p/debugging-pypi-index-problems lets keep working there | 20:58 |
clarkb | I'll incorporate what we've already found on mtl01 on that etherpad | 20:58 |
clarkb | then start on the list above | 20:58 |
*** slaweq has joined #opendev | 20:58 | |
fungi | wow, bug 1449136 is so very full of misinformation | 21:01 |
openstack | bug 1449136 in tripleo "Pip fails to find distribution for package" [Critical,Triaged] https://launchpad.net/bugs/1449136 | 21:01 |
fungi | outright blaming it on our caches, conflating the swift log upload issue with the stale pypi indices... | 21:02 |
fungi | i'm just going to pretend i didn't see that rather than rage comment in it | 21:02 |
clarkb | yes, I've basically decided I'll continue to set the record straight here :) | 21:02 |
fungi | i have openstack elections to finish setting up, and related tooling bugs to hotfix | 21:05 |
*** slaweq has quit IRC | 21:05 | |
clarkb | I'm working on the rax dfw oslo.serialization grepping now | 21:07 |
clarkb | dfw is lcy too | 21:13 |
clarkb | putting details on the etherpad | 21:13 |
ianw | clarkb: i can pull up bhs1 if you haven't already? | 21:15 |
clarkb | ianw: go for it. I was going to add more info on the others and copy the files on dfw to my home dir so that htcacheclean doesn't rm them | 21:16 |
fungi | yeah, i think the only first-level caches i've seen appear in responses are either bwi (baltimore) or lcy (london) | 21:18 |
clarkb | ianw: I'm just copying them into the root of my homedir on each of the mirrors; that way we've got the exact file for later if we need it | 21:19 |
clarkb | and thats done now for mtl01 and dfw. I'll do rax ord now | 21:19 |
*** slaweq has joined #opendev | 21:20 | |
ianw | i've asked in #pypa-dev about the index file format, and if anyone knows around what time it changed, as i think this might be a clue to where stale entries might be coming from | 21:21 |
ianw | it's not generally a very active channel though, so we'll see | 21:21 |
clarkb | if we note it in the warehouse issue I'm sure thats something they'll track down | 21:21 |
ianw | yeah for sure, that's the path forward | 21:21 |
clarkb | another interesting thing is the format of the etag is different | 21:23 |
clarkb | its almost like they've got a stale warehouse backend for fastly | 21:23 |
clarkb | and its looking at an old db or something | 21:23 |
ianw | clarkb: mirror.bhs1.ovh.opendev.org:/var/cache/apache2/proxy/4r/Kl/1x/Zd/Wx/DOmpJToA1chw.header.vary/Ht/iP/SP/ng/Z2/v1_4jNm08zWA.<header|data> is the one to scp | 21:25 |
clarkb | ianw: oh I'm not scpin'g I'm just cp'ing to my homedir on the mirror | 21:26 |
clarkb | but I'll do that now | 21:26 |
clarkb | was mostly worried about htcacheclean | 21:26 |
clarkb | also LCY again | 21:27 |
clarkb | on ord I've only found what looks like a good index for taskflow | 21:27 |
clarkb | still waiting for find to find more | 21:27 |
clarkb | my draft is bad :? | 21:31 |
clarkb | and ya feel free to hack it up :) | 21:34 |
ianw | ok, relaying pypa-dev, but there is somehow an old bandersnatch instance involved behind pypi/cdn | 21:40 |
clarkb | ianw: do they have a public log somewhere? | 21:41 |
clarkb | http://kafka.dcpython.org/day/pypa-dev/2020-09-15 found it | 21:42 |
fungi | oh fun | 21:42 |
clarkb | fungi: ianw you think I should file an issue with what we've got? and there shouldn't be any concern attaching the files verbatim ya? | 21:58 |
clarkb | I'll gunzip the data too | 21:58 |
*** slaweq has quit IRC | 21:59 | |
ianw | clarkb: yeah, i think pypa is narrowing in on the issue but an issue will be good | 22:00 |
ianw | if nothing else to point people to | 22:00 |
clarkb | ya | 22:00 |
clarkb | I'll work on that next | 22:00 |
fungi | yeah, that would be a good next step | 22:01 |
ianw | i get the feeling there's going to be a super-out-of-date-failing-bandersnatch behind this ... i will certainly not throw stones because it's a glass house we have been in :) | 22:05 |
clarkb | ya | 22:09 |
fungi | i used to live in that glass house. it had a lovely view, but eventually ran out of space | 22:12 |
clarkb | https://github.com/pypa/warehouse/issues/8568 for those following along | 22:25 |
clarkb | dstufft has also found that their fallback bandersnatch mirror (what fastly talks to if it fails to talk to pypi proper) has filled its 12TB of disk | 22:25 |
clarkb | and so it is stale and this is likely the issue | 22:25 |
clarkb | we only see this when fastly falls back to the bandersnatch mirror | 22:25 |
clarkb | they are discussing how to fix it now | 22:26 |
*** DSpider has quit IRC | 22:26 | |
fungi | i'm guessing there's a separate issue which has caused fastly's first level cache in london to start having problems reaching warehouse proper, causing it to fall back to the 2-month-stale bandersnatch mirror more often | 22:29 |
fungi | which would explain it just starting to get bad again in the last day or two | 22:30 |
ianw | https://mirror.dub1.pypi.io ... i wonder if that is dub1 too, which seems close to london ... | 22:30 |
ianw | sorry i mean dublin? | 22:30 |
ianw | mirror.dub1.pypi.io is an alias for ec2-34-244-193-164.eu-west-1.compute.amazonaws.com. | 22:30 |
clarkb | sounds like they are removing the bandersnatch fallback from fastly | 22:31 |
clarkb | I think that means we may see more failures but pip retries those so should make our jobs happier | 22:31 |
fungi | since it looks like their disk filled up in early august, that would also cover the somewhat recent flare-ups we saw on august 19 and august 25 | 22:31 |
clarkb | anyone want to update the lp bug that isn't me? >_> I'll do it if not | 22:31 |
fungi | the lp bug is for tripleo anyway, i guess it can be full of wrong if that's what they prefer | 22:32 |
clarkb | also I'm really glad we didn't try to keep up with bandersnatch ourselves | 22:33 |
fungi | >12tb now! yeesh | 22:33 |
clarkb | smcginnis: johnsom TheJulia ^ fyi I would expect things to start getting better around now | 22:34 |
clarkb | assuming that the issue has been properly identified and the mitigations are effective | 22:34 |
johnsom | Yeah, I have been following along and sent a canary | 22:34 |
clarkb | I'll probably send a #status notice in an hour or so if zuul agrees | 22:34 |
TheJulia | Awesome | 22:35 |
clarkb | ianw: maybe its network issues from london to wherever pypi proper is located | 22:36 |
clarkb | ianw: that fails due to transatlantic problems then it falls back to bandersnatch? I dunno | 22:36 |
ianw | i blame brexit | 22:37 |
johnsom | lol, they cut the lines early! | 22:37 |
johnsom | Sad trombones, ovh-bhs1 couldn't get octavia-lib: https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_a7f/751918/1/check/openstack-tox-pep8/a7f5e55/job-output.txt | 22:47 |
fungi | 22:45:10 | 22:49 |
clarkb | johnsom: I wonder if they have to flush fastly caches too | 22:49 |
johnsom | It was one of the two canary patches I rechecked | 22:49 |
fungi | we may have also cached a response for up to... 600 seconds is it? | 22:49 |
ianw | i'll ask dstufft | 22:49 |
fungi | we can probably find the cachefile on our mirror for that | 22:50 |
fungi | since we can narrow to find -mtime-10 or something | 22:50 |
ianw | seems we should -XPURGE ... looking into that | 22:51 |
*** tkajinam has joined #opendev | 22:51 | |
clarkb | we should maybe do that for everything in requirements | 22:52 |
ianw | i did curl -XPURGE https://pypi.org/simple/octavia-lib and got an OK | 22:55 |
ianw | johnsom: maybe try again | 22:55 |
johnsom | I can send another recheck, but it's a roll of the dice here it will land. | 22:56 |
fungi | roll them bones | 22:57 |
clarkb | maybe we purge everything in https://opendev.org/openstack/requirements/src/branch/master/upper-constraints.txt after removing the === and clearing duplicates | 22:59 |
* clarkb makes that list | 22:59 | |
fungi | remember to dereference the index urls too | 23:00 |
fungi | . and _ get rewritten to - looks like? | 23:00 |
fungi | also lower-cased | 23:00 |
fungi | and i think runs of multiple punctuation are coalesced | 23:00 |
fungi | though we shouldn't have any of those | 23:01 |
clarkb | http://paste.openstack.org/show/797908/ I think constraints already normalized | 23:02 |
fungi | but not necessarily normalized to pypi's rules | 23:02 |
clarkb | also are we sure its - and not . | 23:02 |
clarkb | ah yup seems to redirect | 23:03 |
ianw | i have "cat upper-constraints.txt | sed 's/===.*//' | tr '.' '-' | tr '[:upper:]' '[:lower:]'" | 23:03 |
clarkb | so we sjust have to s/./-/ | 23:03 |
clarkb | ianw: ya that looks better than what I did | 23:03 |
ianw | running "cat upper-constraints.txt | sed 's/===.*//' | tr '.' '-' | tr '[:upper:]' '[:lower:]' | xargs -I "PROJ" curl -XPURGE https://pypi.org/simple/PROJ" | 23:07 |
ianw | it's returning OK but i think it might just do that for everything? | 23:07 |
clarkb | I wonder too if we have to do it from the mirrors so that it hits the right fastly endpoints | 23:09 |
fungi | xargs doesn't need you to do https://pypi.org/simple/$PROJ i guess? | 23:09 |
clarkb | fungi: not with -I | 23:09 |
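A slightly expanded version of that purge loop; the extra sed for underscores follows fungi's note above about pypi's name normalization, and whether the PURGE has to be issued from the affected region is the open question discussed just below:

```bash
# normalize every project name in upper-constraints the way pypi does
# (strip the pin, lowercase, dots/underscores -> dashes, dedupe), then
# send fastly a PURGE for each simple index page
sed -e 's/===.*//' upper-constraints.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[._]/-/g' \
    | sort -u \
    | xargs -I PROJ curl -s -o /dev/null -X PURGE "https://pypi.org/simple/PROJ/"
```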
johnsom | No luck, ovh-bhs1 missing octavia-lib: https://cac8a4551058cbabfd7f-de2a4e9610e68e853bfff1f1436a242e.ssl.cf5.rackcdn.com/751925/4/check/openstack-tox-py36/957de24/job-output.txt | 23:10 |
fungi | ahh, awesome, one more way to avoid for loops | 23:10 |
clarkb | ianw: where did you run the purge from? | 23:11 |
ianw | just my local machine | 23:11 |
clarkb | I guess next thing would be to do it from the mirrors? | 23:11 |
clarkb | and if that still doesn't work then we may need to defer to pypa again | 23:11 |
johnsom | Sorry to say I need to step away soon. I can post a DNM patch that has only the faster running jobs if you would like it for testing before I go. | 23:12 |
ianw | johnsom: i think this might be a rare case of just coming back tomorrow will make it work :) | 23:12 |
johnsom | Yeah, that is my perspective as well | 23:13 |
johnsom | Though that didn't work yesterday... grin | 23:13 |
clarkb | I think they said the fastly caches are 24 hours? | 23:14 |
johnsom | Yeah, that is what I saw | 23:14 |
clarkb | so maybe if we do nothing else it will work again tomorrow this time | 23:14 |
johnsom | Too bad it's not a good time to take a few days and go to the coast. | 23:16 |
ianw | trying to read about purging, i feel like it's a global thing | 23:16 |
clarkb | johnsom: my brothers went fishing at barview jetty a couple days ago. I told them they were stupid | 23:16 |
clarkb | johnsom: they did get a keeper ling cod though so I don't think they cared | 23:16 |
clarkb | ianw: rgr | 23:16 |
fungi | we could declare tomorrow international write stuff in rust day | 23:17 |
fungi | or whatever the hip language is this week | 23:17 |
johnsom | fungi I am down with that. I have been playing with it in my spare time | 23:17 |
*** smcginnis has quit IRC | 23:20 | |
*** smcginnis has joined #opendev | 23:22 | |
ianw | perhaps pip pulls some variant of that url | 23:32 |
ianw | https://pypi.org:443 "GET /simple/octavia-lib/ HTTP/1.1" 304 0 | 23:37 |
ianw | doesn't seem so | 23:37 |