*** Dmitrii-Sh9 has joined #opendev | 00:40 | |
*** Dmitrii-Sh has quit IRC | 00:41 | |
*** Dmitrii-Sh9 is now known as Dmitrii-Sh | 00:41 | |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Remove install unnecessary packages https://review.opendev.org/751926 | 01:00 |
ianw | [2020-09-15 01:16:46.308] "GET /api/status HTTP/1.1" 200 68963 "cache hit" "https://zuul.openstack.org/status" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0" | 01:17 |
ianw | that's better | 01:17 |
ianw | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64789&rra_id=all seems to show the host isn't under too much pressure | 01:22 |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Remove install unnecessary packages https://review.opendev.org/751926 | 02:06 |
*** iurygregory has quit IRC | 02:08 | |
ianw | ERROR: No matching distribution found for oslo.log===4.4.0 (from -c /opt/stack/requirements/upper-constraints.txt (line 265)) | 02:19 |
ianw | seem to hit a bunch of dib jobs too :/ | 02:20 |
fungi | in ovh-gra1 again? | 02:23 |
ianw | seems so, putting it together | 02:23 |
ianw | fungi: iirc, you had to run a loop to occasionally get bad returns right? | 02:23 |
fungi | yeah, just stuck date and wget in a for loop with a sleep 1 | 02:26 |
fungi | while :;do sleep 10;echo `wget -SO- https://pypi.org/simple/libvirt-python/ 2>&1|grep -e 'X-Served-By: .*' -e '>libvirt-python-6\.6\.0\.tar\.gz<'|sed -e 's/.*X-Served-By: //' -e 's/.*pythonhosted.*>\(libvirt-python-6\.6\.0\.tar\.gz\)<.*/\1/'`;done|tee results | 02:27 |
fungi | that's an example from when i was seeing libvirt-python-6.6.0.tar.gz sometimes missing | 02:28 |
fungi | from the simple api index | 02:28 |
fungi | i guess it was every 10 seconds | 02:29 |
fungi | and i didn't timestamp them after all | 02:29 |
fungi | i was just logging which fastly backends returned what | 02:31 |
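For readability, fungi's one-liner above expands to roughly the following; the URL, package name, and 10-second interval are taken straight from it.

```bash
# Poll the PyPI simple index for libvirt-python every 10 seconds, recording
# which fastly backends served each response (X-Served-By) and whether the
# 6.6.0 sdist link was present in the index body.
while :; do
    sleep 10
    # echo $(...) joins the grep/sed output onto one line per poll
    echo $(wget -SO- https://pypi.org/simple/libvirt-python/ 2>&1 \
        | grep -e 'X-Served-By: .*' -e '>libvirt-python-6\.6\.0\.tar\.gz<' \
        | sed -e 's/.*X-Served-By: //' \
              -e 's/.*pythonhosted.*>\(libvirt-python-6\.6\.0\.tar\.gz\)<.*/\1/')
done | tee results
```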
ianw | got it running, so far it's returning the right thing but we'll see | 02:36 |
ianw | sigh, and something up with fedora mirror ... it's not a good day for the dib gate :/ | 02:42 |
prometheanfire | heh, 6 seconds off | 02:44 |
*** hashar has joined #opendev | 02:48 | |
ianw | the results are something like "cache-bwi5141-BWI, cache-cdg20768-CDG" | 02:59 |
ianw | where the first part is constant, and the second part seems to choose one of 58 different instances | 02:59 |
ianw | it's just looping around the same 58 results now, and all are returning the correct thing it seems | 03:17 |
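A quick way to count the distinct second-hop caches in that results file, assuming each line ends with the "cache-bwi5141-BWI, cache-cdg20768-CDG"-style pair shown above:

```bash
# keep only the token after the comma (the cache instance we talked to),
# drop anything after it, and count unique values
sed -e 's/.*, *//' -e 's/ .*//' results | sort -u | wc -l
```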
ianw | for reference the failing results were in https://review.opendev.org/#/c/747878/ | 03:19 |
ianw | 2020-09-15 01:21:01.448075 | ubuntu-bionic | ERROR: No matching distribution found for oslo.log===4.4.0 (from -c /opt/stack/requirements/upper-constraints.txt (line 265)) | 03:19 |
ianw | at or around that time | 03:19 |
hashar | ianw: looks like airport codes. bwi baltimore, cdg for paris ;) | 03:46 |
hashar | Paris, in France | 03:46 |
ianw | yeah, so still 58 different backends, all responding correctly | 04:13 |
ianw | 2020-09-14 22:22:08.327035 | controller | ERROR: Could not find a version that satisfies the requirement oslo.log===4.4.0 | 04:15 |
*** ykarel|away has joined #opendev | 04:17 | |
*** ykarel_ has joined #opendev | 04:21 | |
*** ykarel|away has quit IRC | 04:24 | |
*** Topner has joined #opendev | 04:32 | |
*** fressi has joined #opendev | 04:35 | |
*** Topner has quit IRC | 04:36 | |
*** fressi has quit IRC | 04:44 | |
*** hashar has quit IRC | 05:13 | |
*** ysandeep|away is now known as ysandeep | 05:13 | |
*** ykarel_ has quit IRC | 05:22 | |
*** ykarel_ has joined #opendev | 05:23 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 05:48 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 05:48 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 05:48 |
*** ykarel_ has quit IRC | 06:10 | |
openstackgerrit | Rico Lin proposed zuul/zuul-jobs master: Allow skip files when download logs https://review.opendev.org/751973 | 06:11 |
*** ykarel_ has joined #opendev | 06:11 | |
*** lpetrut has joined #opendev | 06:14 | |
*** qchris has quit IRC | 06:20 | |
*** ykarel_ is now known as ykarel | 06:21 | |
ianw | infra-root: ^ found https://58aca06c0a1750aa55ba-25dc1a1096eaf82a51392114c0a4a973.ssl.cf2.rackcdn.com/732435/7/check/heat-functional/852a5bf/controller/logs/screen-barbican-keystone-listener.txt | 06:22 |
ianw | in a browser that gives me a "content encoding error", when I curl it I get back a 500 page | 06:23 |
ianw | An error occurred while processing your request.<p> | 06:23 |
ianw | Reference #112.9b052017.1600151007.18aba199 | 06:23 |
ianw | probably something to keep an eye on | 06:24 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 06:27 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Retire Fedora 31 for 32 https://review.opendev.org/751975 | 06:27 |
frickler | ianw: I've seen these content errors before and attributed them to upload errors. as long as they are only sporadic I guess that would be fine | 06:28 |
*** diablo_rojo has quit IRC | 06:29 | |
*** qchris has joined #opendev | 06:34 | |
ianw | frickler: ok, thanks; i guess with the amount we upload something is going to go wrong | 06:36 |
*** iurygregory has joined #opendev | 06:58 | |
*** slaweq has joined #opendev | 07:05 | |
*** andrewbonney has joined #opendev | 07:14 | |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed zuul/zuul-jobs master: Add support to use stow for ensure-python https://review.opendev.org/751611 | 07:15 |
*** hashar has joined #opendev | 07:24 | |
*** fressi has joined #opendev | 07:27 | |
*** tosky has joined #opendev | 07:34 | |
*** priteau has joined #opendev | 07:39 | |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: Use only noop jobs for openstack/monasca-transform https://review.opendev.org/751983 | 07:58 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** sshnaidm|afk is now known as sshnaidm | 08:04 | |
*** DSpider has joined #opendev | 08:06 | |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: End project gating for openstack/monasca-analytics https://review.opendev.org/751987 | 08:20 |
openstackgerrit | Rico Lin proposed zuul/zuul-jobs master: Allow skip files when download logs https://review.opendev.org/751973 | 08:21 |
*** ricolin has joined #opendev | 08:26 | |
*** ykarel is now known as ykarel|lunch | 08:28 | |
ricolin | ianw, please review again if you got time:) https://review.opendev.org/#/c/751973 | 08:33 |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: End project gating for openstack/monasca-analytics https://review.opendev.org/751987 | 08:39 |
openstackgerrit | Witold Bedyk proposed openstack/project-config master: Remove openstack/monasca-analytics https://review.opendev.org/751993 | 08:46 |
*** ysandeep is now known as ysandeep|lunch | 08:55 | |
*** dtantsur|afk is now known as dtantsur | 09:00 | |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed zuul/zuul-jobs master: Add support to use stow for ensure-python https://review.opendev.org/751611 | 09:09 |
*** ykarel|lunch is now known as ykarel | 09:12 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/elastic-recheck master: Made parse_jenkins_failure a non static https://review.opendev.org/751999 | 09:19 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Sorted statusbot nick https://review.opendev.org/752008 | 09:48 |
*** tkajinam has quit IRC | 09:49 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Allow zbr to use statusbot https://review.opendev.org/752009 | 09:53 |
openstackgerrit | wu.shiming proposed openstack/diskimage-builder master: Update imp module to importlib https://review.opendev.org/752011 | 10:02 |
*** ysandeep|lunch is now known as ysandeep | 10:10 | |
*** slaweq has quit IRC | 10:13 | |
*** slaweq has joined #opendev | 10:21 | |
*** stephenfin has joined #opendev | 10:39 | |
*** fressi has quit IRC | 11:16 | |
frickler | infra-root: mirror.kna1.airship-citycloud.opendev.org has no /afs mount, leading to lots of 403s. there was a process stuck trying to do "rmmod openafs" since the node was rebooted on friday. trying another reboot now | 11:22 |
*** fressi has joined #opendev | 11:32 | |
*** Goneri has joined #opendev | 11:44 | |
fungi | i don't recall us rebooting it | 11:52 |
fungi | last manual reboot we logged for it at https://wiki.openstack.org/wiki/Infrastructure_Status was 2020-05-20 | 11:53 |
*** lpetrut has quit IRC | 11:54 | |
frickler | my reboot sadly hasn't helped, either | 11:54 |
frickler | so I'm deferring to someone with more experience debugging afs | 11:54 |
frickler | oh, now there's a backtrace in dmesg | 11:55 |
frickler | http://paste.openstack.org/show/797875/ | 11:56 |
*** rosmaita has joined #opendev | 11:57 | |
fungi | yeah, not super useful without the ksyms | 12:08 |
fungi | looks from /var/log/dpkg.log like unattended-upgrades replaced the kernel package 5 days ago and may have suffered a failure | 12:13 |
fungi | i'm trying to manually reinstall the kernel package and headers and rebuild the openafs module | 12:15 |
fungi | and rebooting it | 12:17 |
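A rough sketch of that recovery, assuming the mirror uses stock Ubuntu packaging with a dkms-built openafs module (the exact commands fungi ran are not in the log):

```bash
# reinstall the running kernel and matching headers, reinstall the openafs
# client, rebuild the out-of-tree module for this kernel, then reboot
# (package names are assumptions, not copied from the actual session)
apt-get install --reinstall "linux-image-$(uname -r)" "linux-headers-$(uname -r)"
apt-get install --reinstall openafs-client openafs-modules-dkms
dkms autoinstall   # rebuild any missing dkms modules against the running kernel
reboot
```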
*** Goneri has quit IRC | 12:23 | |
*** sshnaidm has quit IRC | 12:25 | |
*** slaweq has quit IRC | 12:30 | |
fungi | #status log reinstalled the kernel, kernel headers, and openafs-client on mirror.kna1.airship-citycloud.opendev.org and rebooted it, as it seems to have possibly been previously rebooted after an incomplete package update | 12:34 |
openstackstatus | fungi: finished logging | 12:34 |
fungi | frickler: seems to be working for me now, but i don't have time to test thoroughly just yet | 12:34 |
*** slaweq has joined #opendev | 12:35 | |
*** priteau has quit IRC | 12:42 | |
*** sshnaidm has joined #opendev | 12:50 | |
*** priteau has joined #opendev | 12:55 | |
fungi | i'm heading out to run errands, but should be back by 15:00 utc | 12:57 |
*** lpetrut has joined #opendev | 13:03 | |
*** priteau has quit IRC | 13:15 | |
*** hashar has quit IRC | 13:16 | |
auristor | fungi frickler ianw: RIP: 0010:afs_CellNumValid+0x4f/0xd0 [openafs] not the result of an incomplete package update. its due to a race during the startup of the openafs kernel module. if another process attempts to read from openafs /proc data or issue an afs pioctl like the "fs" command or attempts to set tokens using "aklog" while afsd is configuring the kernel module, this crash can occur. | 13:26 |
zbr | so basically there was nothing wrong with pypi? | 13:47 |
*** fressi_ has joined #opendev | 13:51 | |
*** fressi has quit IRC | 13:51 | |
*** fressi_ is now known as fressi | 13:51 | |
*** qchris has quit IRC | 13:53 | |
*** bbezak_ has joined #opendev | 13:57 | |
*** ShadowJonathan_ has joined #opendev | 13:57 | |
*** jentoio_ has joined #opendev | 13:57 | |
*** davidlenwell_ has joined #opendev | 13:57 | |
*** gouthamr__ has joined #opendev | 13:57 | |
*** priteau has joined #opendev | 13:57 | |
*** gmann_ has joined #opendev | 13:57 | |
*** cgoncalves has quit IRC | 13:58 | |
*** jbryce_ has joined #opendev | 13:58 | |
*** mrunge_ has joined #opendev | 13:59 | |
*** gouthamr has quit IRC | 14:00 | |
*** ShadowJonathan has quit IRC | 14:00 | |
*** jbryce has quit IRC | 14:00 | |
*** davidlenwell has quit IRC | 14:00 | |
*** bbezak has quit IRC | 14:00 | |
*** fungi has quit IRC | 14:00 | |
*** Eighth_Doctor has quit IRC | 14:00 | |
*** logan- has quit IRC | 14:00 | |
*** gmann has quit IRC | 14:00 | |
*** mrunge has quit IRC | 14:00 | |
*** jentoio has quit IRC | 14:00 | |
*** jroll has quit IRC | 14:00 | |
*** knikolla has quit IRC | 14:00 | |
*** dtantsur has quit IRC | 14:00 | |
*** jbryce_ is now known as jbryce | 14:00 | |
*** bbezak_ is now known as bbezak | 14:00 | |
*** ShadowJonathan_ is now known as ShadowJonathan | 14:00 | |
*** gouthamr__ is now known as gouthamr | 14:00 | |
*** davidlenwell_ is now known as davidlenwell | 14:00 | |
*** jentoio_ is now known as jentoio | 14:00 | |
*** logan_ has joined #opendev | 14:00 | |
*** gmann_ is now known as gmann | 14:00 | |
*** logan_ is now known as logan- | 14:00 | |
*** qchris has joined #opendev | 14:04 | |
*** knikolla has joined #opendev | 14:07 | |
*** fungi has joined #opendev | 14:07 | |
*** Eighth_Doctor has joined #opendev | 14:08 | |
*** ysandeep is now known as ysandeep|away | 14:08 | |
*** dtantsur has joined #opendev | 14:08 | |
*** jroll has joined #opendev | 14:09 | |
*** Tengu has joined #opendev | 14:17 | |
*** Tengu has quit IRC | 14:25 | |
*** cgoncalves has joined #opendev | 14:26 | |
*** Tengu has joined #opendev | 14:27 | |
clarkb | zbr: pypi is proxied and does not involve afs. Also, the pypi issues were in another cloud; I believe these are separate problems | 14:28 |
*** Topner has joined #opendev | 14:30 | |
zbr | clarkb: thanks for clarifying. it didn't make any sense for me to use a distributed fs for a mirror. | 14:32 |
zbr | s/mirror/proxy | 14:32 |
zbr | clarkb: was anyone able to reproduce the pypi issue? see https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219/3 | 14:34 |
zbr | without a curl -v ... output from a specific address, nobody will start working on it. | 14:35 |
*** Topner has quit IRC | 14:35 | |
clarkb | ianw discovered there are 58 backends | 14:36 |
clarkb | and each likely has multiple servers | 14:36 |
clarkb | previously the issue was attributed to fastly errors | 14:37 |
clarkb | and oslo.log 4.4.0 was updated around when those fastly errors happened last month | 14:37 |
clarkb | I assume they just never fully resynced after that and there is 1/1000 cdn nodes that has a problem | 14:37 |
zbr | seems like a time to write a script that tests an URL with all backends | 14:37 |
zbr | i bet it is more than this, the number of failures I've seen with my changes was more like 1/5 failed. | 14:38 |
clarkb | zbr: but only in the last day | 14:39 |
zbr | was the whl or the tar.gz that failed to download? | 14:39 |
clarkb | zbr: neither, its the index missing the latest package version which constraints requires | 14:39 |
clarkb | the index is old (like many weeks old) | 14:40 |
clarkb | if we didn't use constraints we'd install the older version and be fine (likely why most people don't notice) | 14:41 |
clarkb | because I'm sure other packages have the same problem | 14:42 |
Tengu | consistency issue in mirror/cdn? that's.... great. | 14:43 |
lourot | o/ we're getting apparently random POST_FAILUREs, e.g. here twice in a row: https://review.opendev.org/#/c/748650/ - is it known / related to what is being discussed here? thanks! | 14:46 |
clarkb | lourot: likely unrelated | 14:47 |
clarkb | mirrors tend to be used in pre and run not post | 14:47 |
clarkb | zbr: ianw tracked this down previously and this is the logs of what it was attributed to http://kafka.dcpython.org/day/pypa-dev/2020-08-25#00.11.33.PSFSlack | 14:48 |
clarkb | zbr: note that oslo.log was released around the time of that incident | 14:48 |
clarkb | I don't know if the issue is in fastly or pypi but its like they simply never recovered from that | 14:48 |
Tengu | though it's marked as "resolved" 22 days ago for fastly at least. | 14:49 |
Tengu | clarkb: any link to those 58 backends? | 14:51 |
clarkb | Tengu: sure they may think it was resolved :) but they are still serving stale oslo.log indexes several weeks later | 14:51 |
clarkb | Tengu: you have to do the requests and see the headers | 14:51 |
clarkb | do enough of them and you build up a large collection | 14:52 |
Tengu | clarkb: what request exactly? I came late to the party ^^' | 14:52 |
clarkb | I don't know that they publish them anywhere | 14:52 |
clarkb | Tengu: https://pypi.org/simple/oslo.log is the source material. Then many of the failures were happening in our ovh gra1 region which has a caching proxy to that url at https://mirror.gra1.ovh.opendev.org/pypi/simple/oslo.log | 14:53 |
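A single spot check against the proxy looks roughly like this; the interesting response headers are X-Served-By (the fastly chain) and X-Last-Serial, assuming the proxy passes them through unchanged:

```bash
# dump the response headers and the index body for oslo.log via the
# ovh-gra1 proxy, then check whether the pinned 4.4.0 release is listed
curl -s -D /tmp/oslo.log.headers -o /tmp/oslo.log.html \
    https://mirror.gra1.ovh.opendev.org/pypi/simple/oslo.log/
grep -iE 'x-served-by|x-last-serial' /tmp/oslo.log.headers
grep -c '4\.4\.0' /tmp/oslo.log.html   # 0 matches means the index is stale
```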
clarkb | lourot: your errors are related to a new swift backend we added yesterday to store logs | 14:53 |
*** user_19173783170 has quit IRC | 14:53 | |
Tengu | yup. I'm getting 404 on that ovh proxy. | 14:53 |
clarkb | Tengu: it shouldn't 404 | 14:54 |
Tengu | clarkb: maybe purging some cache... ? | 14:54 |
clarkb | (it was never 404ing before) | 14:54 |
clarkb | the issue is that the index.html served by pypi (and cached by our proxy) was stale and lacked the 4.4.0 release | 14:54 |
Tengu | oh. funky. lemme recheck the URL I used | 14:54 |
clarkb | note that url does not 404 and the 4.4.0 release is currently present if I load it | 14:55 |
Tengu | indeed. | 14:55 |
Tengu | oslo.log-4.4.0.tar.gz#sha256= | 14:55 |
fungi | clarkb: zbr: to be exact, that's 58 endpoints which respond for requests from ovh-gra1, they won't be the same endpoints which respond from other parts of the world necessarily | 14:55 |
Tengu | ah. the 404 was on https://mirror.gra1.ovh.opendev.org/simple/oslo-log/ apparently missing the "pypi" prefix. | 14:56 |
Tengu | that URL comes from the CI job I've checked. | 14:56 |
*** Topner has joined #opendev | 14:56 | |
clarkb | fungi: maybe, I'm not sure how ianw generated the list | 14:58 |
clarkb | Tengu: can you share links to that? that appears to be a different issue | 14:58 |
fungi | clarkb: a while loop over wget -S grepping out the X-whatever header which lists the fastly chain | 14:58 |
Tengu | 1s | 14:58 |
clarkb | fungi: wgeting pypi.org directly? | 14:58 |
clarkb | from the mirror? | 14:59 |
*** TheJulia has joined #opendev | 14:59 | |
fungi | yep, run on a vm in that cloud region (i assume on our mirror server for convenience) | 14:59 |
fungi | now that i'm back from errands i can try to continue the data collection experiment while multitasking with meetings | 14:59 |
Tengu | clarkb: for instance, it hits https://mirror.gra1.ovh.opendev.org/pypi/simple which loads as expected. but every link in there is apparently /simple/<module-name>, missing the "pypi" prefix. | 14:59 |
clarkb | also can someone push a opendev/base-jobs change to remove the oe cloud from playbooks/base/post-logs.yaml and playbooks/base-minimal/post-logs.yaml? Its erroring (see lourot's link above) and my local disk is full so I can't do git things right now and I have a meeting to dial into :/ | 14:59 |
Tengu | not sure if it has anything to do though. No pypi/pip expert here, just a bit wondering. | 15:00 |
Tengu | clarkb: for instance, in this job: https://e9bcfc635397bc0a3785-9ea232b51691ad27ed2d3f8993736155.ssl.cf2.rackcdn.com/751828/8/check/tripleo-build-containers-ubi-8/b164baf/job-output.txt | 15:00 |
clarkb | Tengu: but none of the links there should point at simple | 15:00 |
Tengu | clarkb: sooooo.... that's the issue? | 15:00 |
clarkb | Tengu: if you load up the index page and inspect source you'll see they are all hosted under /pypifiles | 15:00 |
fungi | probably *an* issue but not the same issue we're digging into | 15:01 |
clarkb | Tengu: its possible they are serving stale and also different data | 15:01 |
fungi | or could be a related issue i guess | 15:01 |
clarkb | but different symptoms | 15:01 |
clarkb | Tengu: I don't think you are reading that error correctly | 15:02 |
clarkb | Tengu: what it is saying is that the index lists a bunch of versions but not 4.4.0 | 15:02 |
clarkb | I don't see where it says https://mirror.gra1.ovh.opendev.org/simple/oslo-log/ is a 404 | 15:02 |
Tengu | clarkb: so. go here: https://mirror.gra1.ovh.opendev.org/pypi/simple you get everything listed right? click on ANY link there, and it ends 404. | 15:02 |
clarkb | Tengu: that isn't what pip is doing | 15:03 |
Tengu | still weird the links on that index page are just plain wrong. | 15:03 |
clarkb | Tengu: pip doesn't rely on the top level index (so if thats broken its a problem maybe but not the problem) | 15:04 |
Tengu | 'k. | 15:04 |
clarkb | pip instead will directly lookup package names like /pypi/simple/oslo.log | 15:04 |
clarkb | that gives it an html index with a list of versions and some metadata about the versions of python those packages support | 15:04 |
clarkb | The error you're seeing is that it sees a bunch of versions like 4.3.0 but not 4.4.0 which is what openstack's constraints requires | 15:04 |
clarkb | because constraints requires a very specific version that does not exist pip then errors | 15:05 |
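Reduced to one command, the failing jobs are effectively doing something like this (constraints path and mirror URL as seen in the job logs above); with a stale index the pinned version simply isn't offered and pip errors out:

```bash
# install oslo.log pinned by upper-constraints through the ovh-gra1 proxy;
# a stale index lists 4.3.x and older but not 4.4.0, so pip reports
# "No matching distribution found for oslo.log===4.4.0"
pip install \
    -c /opt/stack/requirements/upper-constraints.txt \
    -i https://mirror.gra1.ovh.opendev.org/pypi/simple/ \
    oslo.log
```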
Tengu | ok. so that top-level index is just misleading basic humans hoping to just click'n'follow links then. | 15:05 |
fungi | or more specifically, which isn't included in the index it's returning for that package name | 15:05 |
clarkb | the last time we looked into this pypi said there were fastly errors, see http://kafka.dcpython.org/day/pypa-dev/2020-08-25#00.11.33.PSFSlack and those happened around the time when oslo.log was released. That makes me suspect that they never fully resynced the stale data after that issue | 15:05 |
Tengu | so yeah, I do understand it doesnt find the right version - I was just surprised ending in a stupid 404 when just browsing the ovh host. | 15:06 |
clarkb | I suspect if we released a oslo.log 4.4.1 this issue would correct | 15:06 |
clarkb | I also suspect if pypi took their users at their word and resynced any packages published around august 24/25 we'd be good too | 15:06 |
clarkb | but without doing a billion requests to find the needle in the haystack I bet we won't get that | 15:07 |
Tengu | so getting a dummy 4.4.1 might just do the trick | 15:07 |
Tengu | and would probably be faster | 15:07 |
Tengu | and we'd get 4.4.0 (the one we want in the requirements) and that new 4.4.1 | 15:07 |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Disable OE swift for log storage https://review.opendev.org/752066 | 15:08 |
yoctozepto | morning infra | 15:08 |
clarkb | I found a VM image I could rm :) infra-root donnyd fyi ^ we should land that then debug why we get errors on some OE swift uploads | 15:08 |
clarkb | Tengu: that is my hunch | 15:09 |
fungi | i've got a loop going every 10 seconds testing https://pypi.org/simple/oslo.log/ looking for whether it includes oslo.log-4.4.0.tar.gz and logging the X-Served-By: header values in the responses | 15:09 |
fungi | running on mirror.gra1.ovh | 15:09 |
yoctozepto | I see you already know how bad the situation is | 15:09 |
clarkb | Tengu: the reason you get a 404 at the root index is because we're proxying pypi under a subpath in this case /pypi and we aren't doing rewrite of the response content (just the headers) | 15:10 |
fungi | yoctozepto: yes, i'm hoping we can collect sufficient data to convince the pypi admins there's an issue in fastly so they can report it or at least try to get the problem indices recirculated | 15:10 |
clarkb | Tengu: we can fix this by moving pypi proxy caching onto a separate port (then we can drop the subpath) | 15:11 |
Tengu | clarkb: ahhh | 15:11 |
clarkb | Tengu: but pip doesn't care so we haven't bothered | 15:11 |
*** ykarel is now known as ykarel|away | 15:12 | |
fungi | odds are there was a fastly endpoint in france which was down/disconnected during the incident around the time oslo.log 4.4.0 got uploaded, so missed the memo, and has recently been brought back online in the last day or so | 15:12 |
yoctozepto | fungi: /me keeping fingers crossed for the infra team :-) | 15:12 |
smcginnis | Of course, all the oslo deliverables were released around the same time, so we would probably have to refresh all of them. | 15:14 |
yoctozepto | smcginnis: ++, I saw different oslos failing indeed | 15:14 |
*** slaweq has quit IRC | 15:19 | |
*** slaweq has joined #opendev | 15:23 | |
clarkb | yoctozepto: can you link to the other similar failures? we should double check that they aren't python version mismatches and that they exhibit the same basic behavior (may help in tracking it down) | 15:27 |
clarkb | fungi: fyi https://review.opendev.org/#/c/752066/ | 15:28 |
clarkb | donnyd: re ^ we are getting HTTP 400 errors for container urls looks like | 15:28 |
*** ysandeep|away is now known as ysandeep | 15:28 | |
clarkb | donnyd: there isn't a whole lot of detail. Maybe you can see more on your side? zuul_opendev_logs_d0d is one that had this problem | 15:29 |
donnyd | That is very strange | 15:29 |
clarkb | donnyd: happened around 2020-09-15 14:38:16 from ze09.openstack.org (it should have ipv4 and ipv6 addrs if you need to check both) | 15:30 |
donnyd | Well that would be because container d0d was never created | 15:35 |
donnyd | donny@office:~> openstack container list |grep d0d | 15:35 |
donnyd | donny@office:~> | 15:35 |
yoctozepto | clarkb: sure, I'll create a paste of them | 15:37 |
donnyd | now that is strange | 15:39 |
donnyd | When I checked at 6am there were 1004 containers... when I check just now... there are 1004 containers | 15:39 |
donnyd | there should be more | 15:39 |
donnyd | maybe there is a quota limit on the # of containers by default | 15:40 |
yoctozepto | clarkb: done: http://paste.openstack.org/show/797891/ | 15:41 |
johnsom | Oh joy, the packages are still not loading... Let me know if I can collect any info for you. | 15:42 |
donnyd | annnnd I found the issue | 15:45 |
donnyd | https://www.irccloud.com/pastebin/P6b5ZSGC/ | 15:46 |
donnyd | clarkb: ^^^ | 15:46 |
clarkb | donnyd: we need to increase max buckets? | 15:46 |
clarkb | donnyd: let me do some math to give you an idea of how many we use | 15:46 |
clarkb | (its capped) | 15:46 |
*** lpetrut has quit IRC | 15:47 | |
donnyd | yep | 15:48 |
donnyd | I would rather just disable it | 15:48 |
donnyd | https://www.irccloud.com/pastebin/nvyGm5h7/ | 15:50 |
clarkb | donnyd: ah ok I was having a hard time finding where we generated it (I know I edited in the past and its capped) | 15:51 |
clarkb | removing the limit works too | 15:51 |
donnyd | should be good to go now | 15:51 |
clarkb | donnyd: infra-root I WIP'd https://review.opendev.org/#/c/752066/1 as we think ^ should fix it | 15:51 |
donnyd | can I get someone with the openstackzuul user to try to create a bucket? | 15:52 |
clarkb | donnyd: ya I can create a test bucket | 15:52 |
clarkb | donnyd: Unauthorized (HTTP 401) | 15:54 |
clarkb | I can list them though | 15:54 |
donnyd | so I was getting the same error and I thought it was maybe just me | 15:55 |
donnyd | ok, I will get it fixed and get back to you | 15:55 |
clarkb | donnyd: should we land the disabling change for now? | 15:55 |
clarkb | I assume that error will hit jobs if they try to create containers too | 15:55 |
donnyd | yes | 15:55 |
donnyd | I would assume until I get this sorted it will fail jobs | 15:56 |
clarkb | ok I've approved the change | 15:56 |
clarkb | thank you for looking at it | 15:56 |
donnyd | so setting that value to -1 does not work | 15:57 |
donnyd | it makes things come back as unauth | 15:57 |
donnyd | "max_buckets": 999999999, | 15:58 |
donnyd | try it now | 15:58 |
clarkb | donnyd: that worked it created clarkb_test | 15:59 |
clarkb | can you double check you see ^ | 15:59 |
clarkb | removed my approval from the base-jobs update now :) | 15:59 |
donnyd | donny@office:~> openstack container list |grep clark | 15:59 |
donnyd | | clarkb_test | | 15:59 |
donnyd | yep, I can see | 15:59 |
donnyd | I will have to figure out how to just disable that entirely | 16:00 |
donnyd | don't want to get to 1000000000 and have it fail again | 16:01 |
donnyd | should be good for now though | 16:01 |
clarkb | donnyd: we shard with the first 3 chars of the build uuid. build uuids can have a-f and 0-9 in them which is a total of 16 * 15 * 14 containers I think | 16:03 |
clarkb | donnyd: so ~3360 | 16:03 |
donnyd | so we should be good with 999999999 for the foreseeable future | 16:03 |
clarkb | ya | 16:03 |
fungi | yeah, i guess the default was just only enough for roughly a third of what we needed | 16:04 |
clarkb | er no its 16 * 3 because you can have duplicates | 16:04 |
clarkb | sorry I was thinking of deck of cards | 16:04 |
clarkb | er 16 ** 3 | 16:04 |
clarkb | math I can do it :) | 16:04 |
clarkb | 4096 | 16:04 |
clarkb | if you want to set it to 8192 it should be plenty | 16:04 |
fungi | s/third/quarter/ then ;) | 16:04 |
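For reference, the shard math and a quick way to see how many log containers actually exist (the name filter is a guess based on the zuul_opendev_logs_d0d example above):

```bash
# maximum number of log containers: the shard is the first 3 hex characters
# of the build uuid, so 16^3 possible values
echo $((16 ** 3))   # 4096
# count the zuul log containers currently present in the cloud
openstack container list -f value -c Name | grep -c '^zuul_opendev_logs_'
```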
donnyd | ok, 24576 is what I set it to | 16:05 |
donnyd | I can still create container | 16:06 |
donnyd | I think this is resolved | 16:06 |
donnyd | sorry for that | 16:06 |
donnyd | should have thought about it before | 16:06 |
donnyd | exit | 16:06 |
donnyd | wrong terminal | 16:07 |
fungi | exeunt! | 16:07 |
*** ykarel|away has quit IRC | 16:09 | |
-openstackstatus- NOTICE: Our PyPI caching proxies are serving stale package indexes for some packages. We think because PyPI's CDN is serving stale package indexes. We are sorting out how we can either fix or workaround that. In the meantime updating requirements is likely the wrong option. | 16:09 | |
openstackgerrit | Merged opendev/elastic-recheck master: Create elastic-recheck container image https://review.opendev.org/750958 | 16:14 |
*** ysandeep is now known as ysandeep|away | 16:16 | |
auristor | no | 16:30 |
auristor | oops | 16:30 |
*** Topner has quit IRC | 16:33 | |
*** fressi has left #opendev | 16:35 | |
*** dtantsur is now known as dtantsur|afk | 16:46 | |
*** slaweq has quit IRC | 16:47 | |
clarkb | ugh I've run out of local disk space again. That 3.2GB went quickly. I must have a misbehaving daemon chewing up log space or something | 17:15 |
clarkb | and now I'm deep in the bowels of btrfs balancing | 17:30 |
fungi | my sympathies | 17:35 |
fungi | i'm back to trying to bang my head against the storyboard continuous deployment being noncontinuous | 17:36 |
fungi | and also getting openstack wallaby cycle elections underway | 17:36 |
*** paladox has quit IRC | 17:37 | |
*** paladox has joined #opendev | 17:38 | |
*** tosky has quit IRC | 17:40 | |
clarkb | the good news is I think I've sorted my problem out | 17:42 |
clarkb | its basically defragmenting btrfs | 17:42 |
clarkb | I thought we left those problems behind | 17:43 |
smcginnis | Does btrfs have a pleasing UI to watch it defrag? :) | 17:44 |
clarkb | smcginnis: btrfs balance status -v /path | 17:44 |
clarkb | up to you if you find that pleasing | 17:44 |
smcginnis | I want to watch all the little blocks move around. Was my favorite part about defrag.exe. | 17:44 |
*** diablo_rojo has joined #opendev | 17:44 | |
clarkb | fedora users beware :) | 17:45 |
clarkb | I'm going to ensure both of my btrfs filesystems are happy before doing anything else today as I've hit problems with it a couple times in a short period and its super frustrating | 17:48 |
clarkb | ok I think I'm set now with plenty of free space | 17:57 |
clarkb | back to trying to grok the pypi things | 17:57 |
johnsom | I had to dump btrfs. The VMs were killing it with painful CoW chains. | 18:00 |
fungi | Error: /Stage[main]/Storyboard::Application/Exec[install-storyboard]: 'pip install /opt/storyboard' returned 1 instead of one of [0] | 18:02 |
fungi | well, doesn't give me the output from pip, but at least it's somewhere to start from | 18:02 |
clarkb | fungi: I'm pulling up recent fails from the zuul status page a few within the last ~6 minutes. Anyway out of my ~3 examples all of them have the versions in the index files if I request them through the mirror in my browser | 18:02 |
clarkb | that does make me wonder if pip is doing something with headers or maybe fetching some other path that is breaking | 18:03 |
fungi | yeah, and i still haven't caught any wgets to pypi from the mirrors returning stale content | 18:03 |
fungi | what packages/releases/regions were those? anything besides oslo.log 4.4.0 in ovh-gra1 and oslo.service 2.4.0 in rax-dfw? | 18:04 |
clarkb | ya https://mirror.ord.rax.opendev.org/pypi/simple/python-neutronclient/ from https://ace7c3bbea1b346d61ae-b7ce26267fa7738cf72e86f3ba4b5e8c.ssl.cf5.rackcdn.com/751040/3/check/openstack-tox-py38/20a4543/job-output.txt | 18:04 |
clarkb | https://mirror.mtl01.inap.opendev.org/pypi/simple/reno/ from https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_66f/751040/3/check/openstack-tox-py36/66f5320/job-output.txt and https://ee2d68f892c305f185f6-70dee07df320095fd5d268c658f557da.ssl.cf2.rackcdn.com/749748/5/check/openstack-tox-docs/3a08312/job-output.txt | 18:05 |
clarkb | also thats two different hosts in the same region seeing the same content which makes it extra weird that if I fetch it I can't reproduce | 18:05 |
fungi | so i could also start similarly polling in rax-ord for python-neutronclient <someversion> and... | 18:05 |
clarkb | granted I may be fetching after the ttl has expired on that | 18:05 |
fungi | were any of the failures you observed for anything you already tried to flush in the cdn? | 18:06 |
clarkb | no I've only tried to flush oslo.log | 18:06 |
fungi | just wondering whether we have positive evidence of that workaround not helping | 18:07 |
clarkb | another curiousity is that its always packages we control it seems like | 18:07 |
clarkb | and all the things have releases since end of august | 18:08 |
fungi | okay, i've confirmed the fact that we stopped deploying storyboard after the change to drop python2.7 was no coincidence: storyboard-api requires Python '>=3.6' but the running Python is 2.7.12 | 18:08 |
clarkb | kind of makes me wonder if it is something about pushing newer packages | 18:08 |
clarkb | fungi: pip sets cache-control: max-age=0 when fetching indexes eg /pypi/simple/reno | 18:22 |
fungi | neat, so we're always going to cache-miss those regardless | 18:22 |
*** andrewbonney has quit IRC | 18:23 | |
clarkb | the server then sets cache-control: max-age=600 | 18:23 |
clarkb | I think max-age=0 from the client means the proxy has to recheck the content is hasn't changed | 18:24 |
clarkb | but that is a simple check and maybe we're breaking in that somehopw | 18:24 |
clarkb | except if that was the case I'd expect it to serve me the broken version when I load it 6 minutes later | 18:24 |
fungi | it does also debunk the theory that we're somehow magnifying the problem by caching and continuing to serve stale responses for multiple consecutive requests | 18:25 |
clarkb | fungi: fwiw I sorted out pip request headers via tcpdump and installing reno via http://mirror.mtl01.inap.opendev.org | 18:27 |
clarkb | the index contents themselves are gzip'd so I'll need to do better packet capture than tcpdump -A to verify those contents however I figure the install is working for reno 3.2.0 so the index contents likely won't show anything useful | 18:27 |
clarkb | fungi: for reproducing though we may need to set max-age=0 like pip does | 18:30 |
clarkb | because that may also effect the fastly caches | 18:31 |
clarkb | your wgets are probably not setting that so can use good cached data, but maybe this breaks when fastly has to refresh or something | 18:31 |
clarkb | in the case of our jobs talking through our caches pip will make a request with max-age=0. That will force our cache to check with fastly if things have been updated. That check may not actually include the max-age=0 though its probably implied when you are checking updated timestamps. If fastly says we are stale then we send the request to reup the value to fastly which will include the max-age=0 which | 18:34 |
clarkb | should force fastly to repopulate data? | 18:34 |
clarkb | if fastly says we are good then we just serve the cached content | 18:34 |
clarkb | fungi: I wonder if we shouldn't try and trace this out and draw it up | 18:37 |
clarkb | just thinking through it in my head it does seem like we'd get stale indexes more often this way if there are stale indexes in fastly | 18:38 |
clarkb | because pip is going to always check and if they are stale compared to what we've got we'll be told to "update" | 18:38 |
clarkb | then we fetch the bad data and have a sad. Then the next request that comes by checks and sees we are actually stale and updates again and gets good results | 18:38 |
clarkb | which can explain why this is so difficult to diagnose | 18:39 |
fungi | so should i adjust wget to send a max-age=0 with the request to get a pull through the cache? | 18:40 |
clarkb | yes I think so in order to mimic what pip is doing | 18:43 |
clarkb | (and I wonder if it is that behavior that increases the odds of getting the stale content) | 18:43 |
clarkb | https://868d9d181f45f3dc08e4-0ff94bb25fd1b2b22d207854a15b8765.ssl.cf5.rackcdn.com/744883/16/check/openstack-tox-py36/96380ed/job-output.txt just complained about https://mirror.ord.rax.opendev.org/pypi/simple/taskflow/ but loading it in my browser its fine | 18:45 |
fungi | --header='Cache-Control: max-age=0' i guess? | 18:49 |
clarkb | yes that looks correct to me | 18:49 |
*** rosmaita has quit IRC | 18:49 | |
fungi | okay, i've incorporated that into my test loops | 18:50 |
*** Topner has joined #opendev | 19:08 | |
*** Topner has quit IRC | 19:13 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Retire Fedora 31 for 32 https://review.opendev.org/751975 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: only run docker-setup.yaml when installed https://review.opendev.org/747062 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: update-json-file: add role to combine values into a .json https://review.opendev.org/746834 | 19:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-docker: Linaro MTU workaround https://review.opendev.org/747063 | 19:15 |
*** priteau has quit IRC | 19:16 | |
clarkb | fungi: for example https://702b7e8f253d29e679a6-2fe3f6c342189909aad5220492fb4721.ssl.cf1.rackcdn.com/743189/10/check/openstack-tox-py38/2c678fd/job-output.txt complains about octavia-lib==2.2.0 but says 2.1.1 is valid. If you view source on https://mirror.mtl01.inap.opendev.org/pypi/simple/octavia-lib/ they appear to have the same interpreter requirements | 19:59 |
clarkb | it really does seem like we're getting the previous index | 19:59 |
clarkb | rather than breaking with python versions | 19:59 |
clarkb | https://etherpad.opendev.org/debugging-pypi-index-problems I'm going to start taking notes there | 20:00 |
johnsom | Which package are you using for the cache? squid? | 20:03 |
fungi | apache mod_cache | 20:03 |
clarkb | but as frickler reported this has been observed locally running devstack without our mirrors | 20:05 |
clarkb | and we have seen this before talking directly to fastly too | 20:05 |
clarkb | (strongly implying it isn't apache that is at fault) | 20:05 |
johnsom | I assume you have tried a flush on an apache instance that is behaving strange too. That would also rule out mod_cache issue | 20:09 |
clarkb | johnsom: that is what we just discussed in our meeting | 20:09 |
clarkb | I think we want to observe what we've cached first | 20:09 |
fungi | we talked about it in the meeting. if it continues after we flush the cache then we can say it's likely not the cache, but if it stops after we flush the cache we don't know that it was the cache flushing which resolved it or just coincidentally happened to clear up on its own at the same time | 20:10 |
clarkb | (made difficult by the structure of the cache data, but should be doable) | 20:10 |
johnsom | Yes, fair enough | 20:10 |
fungi | also doesn't explain why its cropped up in different provider regions each with its own cache in the span of a day or so | 20:11 |
fungi | the problem would have had to start impacting multiple caches on different virtual machines in different providers all at approximately the same time | 20:11 |
johnsom | Well, we don't know it's a problem until upper-constraints update, then we all start looking for new versions | 20:11 |
fungi | i suppose it couldn't hurt to check when the problem entries were added to the constraints lists, but i understand we're seeing it complain about missing releases which were tagged several weeks ago | 20:12 |
clarkb | johnsom: ya though some of these packages like oslo.log are old | 20:12 |
clarkb | did we not update constraints for oslo.log until recently maybe? | 20:12 |
clarkb | fungi: correct | 20:12 |
johnsom | 20 days ago, but the new rush for this is probably the switch to focal and needing py3.x support | 20:14 |
fungi | folks started reporting the problem yesterday | 20:15 |
fungi | i think | 20:15 |
clarkb | fungi: ya | 20:15 |
johnsom | Yeah, but it's happened at least twice before in the past | 20:15 |
fungi | oh, it's happened many times in recent years | 20:15 |
clarkb | the joys of cdn | 20:15 |
fungi | but more recently we've looked into incidents on august 19 and august 25 | 20:16 |
clarkb | we've definitely tracked it back to cdn returning stale data before | 20:16 |
clarkb | the difficulty now is that its persisting and not reproducing as easily | 20:16 |
clarkb | and last time it came up pypi pointed at a fastly outage | 20:16 |
clarkb | maybe its the same thing this time around and fastly hasn't noticed yet | 20:16 |
* clarkb finds lunch | 20:16 | |
smcginnis | Nearly all projects should have been running in py3.x for quite some time now. | 20:17 |
smcginnis | And yeah, some of these were updated in upper-constraints for weeks now. | 20:17 |
johnsom | Yeah, most have been, but the focal switch caused new problems that forced most everyone to bump lower-constraints | 20:17 |
fungi | are these mostly lower-constraints jobs we're seeing? | 20:19 |
johnsom | No | 20:19 |
johnsom | I have seen it across tempest and unit tests. It seems more regional than test related. | 20:20 |
ianw | i don't exactly know what this means yet, but i've found 3 instances of what looks like the pypi python-designateclient page on mtl-1 | 20:22 |
ianw | http://paste.openstack.org/show/797902/ | 20:22 |
ianw | yes! this has headers | 20:24 |
ianw | http://paste.openstack.org/show/797903/ | 20:26 |
ianw | fungi / clarkb: ^ double check my work here, but this looks like a cached version of the pypi page *without* designateclient 4.1.0 | 20:26 |
ianw | Date: Tue, 15 Sep 2020 07:24:38 GMT | 20:27 |
clarkb | with no serial and no interpreter requirements | 20:28 |
clarkb | when did we release 4.1.0? | 20:28 |
clarkb | not today ya? | 20:28 |
smcginnis | No, not today. | 20:28 |
smcginnis | Checking... | 20:28 |
clarkb | if so that points to fastly I think given the date | 20:28 |
*** Topner has joined #opendev | 20:28 | |
smcginnis | August 3 is the merge commit. | 20:28 |
ianw | clarkb: yeah, you're right, no serial on that result, i didn't accidentally cut it off | 20:28 |
ianw | double check mtl-01 mirror /var/cache/apache2/proxy/91/pq/qD/Xf/1O/Cmdy2okLRNwQ.header.vary/BJ/1z/1Z/_7/jD/HNLd0wzgcbGA.data | 20:29 |
smcginnis | And python-designateclient also updated in upper-constraints on Aug 3. | 20:29 |
fungi | okay, i've revamped the test loop to use curl, pass similar compression headers, the cache-control max-age=0 and also a lifelike pip ua i scraped from one of our logs | 20:30 |
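That revamped probe presumably looks something like the request below; the point is to send the same Cache-Control, Accept-Encoding, and User-Agent that pip does so fastly handles it the same way (the UA string here is a shortened stand-in, not the one fungi scraped):

```bash
# mimic a pip index request and log the interesting response headers;
# best run from the mirror host so it leaves from the same provider network
curl -s -D- -o /dev/null \
    -H 'Cache-Control: max-age=0' \
    -H 'Accept-Encoding: gzip' \
    -A 'pip/20.2.3' \
    https://pypi.org/simple/oslo.log/ \
    | grep -iE 'x-served-by|x-last-serial|last-modified|etag'
```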
ianw | there's now 3 other hits from walking the cache; they all have 4.1.0 and a serila | 20:30 |
ianw | serial even | 20:30 |
fungi | which is the mtl-01 mirror? ca-ymq-1.vexxhost? | 20:31 |
clarkb | fungi: mtl01.inap | 20:31 |
fungi | ahh | 20:31 |
fungi | did we get failures there too? | 20:31 |
clarkb | ianw makena copy so that htcacheclean doesnt accidentally get it | 20:31 |
clarkb | fungi: yes all over | 20:32 |
ianw | that was the one mentioned in the meeting just now | 20:32 |
fungi | got it | 20:32 |
*** Topner has quit IRC | 20:33 | |
fungi | and yeah, that cache example was modified today, probably created today but stat doesn't report creation time | 20:34 |
clarkb | ianw: if I zcat that I see no 4.1.0 but don't see the headers | 20:34 |
ianw | headers are in ./91/pq/qD/Xf/1O/Cmdy2okLRNwQ.header.vary/BJ/1z/1Z/_7/jD/HNLd0wzgcbGA.header | 20:34 |
clarkb | I guess that is a separate file | 20:34 |
clarkb | thanks | 20:34 |
clarkb | and ya I agree with what yuo've found | 20:35 |
fungi | X-Served-By: cache-lcy19265-LCY, cache-yul8920-YUL | 20:35 |
clarkb | note the index.html is completetly different too | 20:35 |
clarkb | the sdists and wheels are in two separate lists rather than a shared list sorted by version | 20:36 |
clarkb | thee is no python interpreter requirements either | 20:36 |
clarkb | I think thats what we needed to go to pypi. Though we should note we see similar all over the place | 20:37 |
clarkb | its just this is what we managed to catch | 20:37 |
clarkb | https://702b7e8f253d29e679a6-2fe3f6c342189909aad5220492fb4721.ssl.cf1.rackcdn.com/743189/10/check/openstack-tox-py38/2c678fd/job-output.txt is octavia-lib 2.2.0 on the same mirror | 20:38 |
clarkb | ianw: ^ maybe you want to run the same find on that and we can see if the same x-served-by shows up? | 20:38 |
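Roughly how those entries can be dug out of mod_cache's store on a mirror (the find arguments are a guess at what ianw ran; the layout of paired .header/.data files matches the path he pasted above):

```bash
# look for recently written cache entries whose stored request URL mentions
# the package; headers live in *.header, the body is the gzipped *.data file
cd /var/cache/apache2/proxy
find . -name '*.header' -newermt '1 day ago' -print0 \
    | xargs -0 grep -la 'octavia-lib' 2>/dev/null
# then inspect a hit: cat the .header file and zcat the matching .data file
```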
ianw | http://paste.openstack.org/show/797905/ | 20:39 |
ianw | comparison of the two headers for the designate index page; one good, the pip one bad | 20:40 |
ianw | ok, i'll kill and see if we can find similar for octavia-lib | 20:41 |
ianw | http://paste.openstack.org/show/797906/ would seem to be it | 20:42 |
clarkb | there is no x-last-serial in the bad one either | 20:43 |
ianw | Date: Tue, 15 Sep 2020 07:22:02 GMT .. same time | 20:43 |
ianw | ~ same time, anyway | 20:43 |
clarkb | LCY again | 20:43 |
clarkb | too | 20:43 |
clarkb | LCY is it you!? | 20:44 |
ianw | but, apache must be deciding that it can serve this data based on the headers coming back from pypi/fastly, right? | 20:44 |
fungi | i guess lcy is london? | 20:44 |
clarkb | ianw: yes, I think the way it works with the client setting max-age=0 is apache will ask the backend if the data it has is fresh | 20:44 |
ianw | hashar i think pointed out they're airport codes | 20:44 |
johnsom | There is an "I love Lucy" joke in there | 20:45 |
fungi | right, lcy is the iata code for london city airport | 20:45 |
clarkb | ianw: so what must've happened there is we asked LCY again if it was fresh and it said yes and we reused it? | 20:45 |
ianw | that's going to be the e-tag matching? | 20:45 |
* ianw goes to read exactly what etag is | 20:45 | |
clarkb | ianw: I'm not entirely sure how apache checks but I would expect something like etags yes | 20:45 |
clarkb | fungi: fwiw I found fastly docs that confirmed what you said about that header. The first region is the one that talks to pypi directly and the second is the one that we talk to | 20:48 |
ianw | so theory is apache does a GET request, gets back headers, the returned ETag: must match what we have in the cache there, and we serve up the old data | 20:48 |
clarkb | ianw: ya | 20:48 |
clarkb | ianw: and the other times when it works we get the other etag and we match that to the other content we've cached | 20:48 |
clarkb | if fastly stops serving the bad etag'd data we should stop using what we've cached | 20:49 |
clarkb | (because it won't be considered valid anymore) | 20:49 |
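That theory can be checked by hand: take the ETag out of one of the cached .header files and revalidate it against pypi the way mod_cache does; an HTTP 304 means fastly is telling the proxy its stale copy is still good (the ETag value below is a placeholder):

```bash
# conditional revalidation of the octavia-lib index: send the cached ETag
# back with If-None-Match and print only the status code
curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'Cache-Control: max-age=0' \
    -H 'If-None-Match: "PLACEHOLDER-ETAG-FROM-CACHED-HEADER"' \
    https://pypi.org/simple/octavia-lib/
```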
clarkb | I expect this is sufficient info to take to them | 20:50 |
clarkb | while also noting that we're seeing it in a fairly globally distributed manner so it may affect more than just LCY | 20:50 |
clarkb | rax dfw and ord have hit it, ovh gra1, and inap mtl01 at least | 20:51 |
clarkb | with various packages | 20:51 |
johnsom | The Last-Modified header seems suspicious as well. In the bad examples, it's 9 of April. | 20:51 |
clarkb | johnsom: ya and the content itself is all different too | 20:51 |
clarkb | johnsom: current content is a list sorted by version. Old content is two lists, one for sdists and one for wheels | 20:51 |
clarkb | current content has a serial in the html, old content doesn't | 20:51 |
ianw | when did <!--SERIAL 8050067--> get added? | 20:51 |
clarkb | ianw: I don't know | 20:52 |
ianw | was that a flag day that pypa regenerated every index file? | 20:52 |
clarkb | (I just know that its there now :) ) | 20:52 |
ianw | that might be a data point | 20:52 |
ianw | i think we have quite a bit of data ... the question is ... who can do anything about it :) | 20:53 |
clarkb | pypi I think. The warehouse issue tracker is where they ask for feedback and issues | 20:53 |
clarkb | https://github.com/pypa/warehouse/issues | 20:53 |
fungi | or ping dstufft and di in #pypa-dev on freenode | 20:53 |
*** slaweq has joined #opendev | 20:54 | |
clarkb | and zbr posted https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219/3 which is related | 20:54 |
clarkb | also note that it isn't 404s... | 20:54 |
*** slaweq has quit IRC | 20:55 | |
clarkb | I'm happy to write the issue but don't want to step on toes (ianw did the magic that sorted it out I think) | 20:55 |
clarkb | let me know how I can help and I'll gladly do what I can :) | 20:55 |
ianw | i think we should get maybe a few more results from some of the other regions to amass a bigger picture | 20:55 |
clarkb | ianw: sounds good. Let me look through my job links to find some from not mtl01 | 20:56 |
clarkb | https://60ce53b5f45ed0ef7bd3-e873feb845d99f2e0685947947034235.ssl.cf1.rackcdn.com/751040/3/check/openstack-tox-pep8/1921a1e/job-output.txt oslo.serialization from rax dfw | 20:56 |
clarkb | https://868d9d181f45f3dc08e4-0ff94bb25fd1b2b22d207854a15b8765.ssl.cf5.rackcdn.com/744883/16/check/openstack-tox-py36/96380ed/job-output.txt taskflow from rax ord | 20:56 |
ianw | i need to get some breakfast :) maybe we should etherpad up a few results with headers/timestamps and then we can write a good bug | 20:56 |
ianw | bib | 20:56 |
clarkb | https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_158/743189/10/check/openstack-tox-pep8/158d6e1/job-output.txt octavia-lib on ovh bhs1 | 20:57 |
fungi | i can sympathize with their desire not to maintain an "issue tracker for operational problems" any more than we want to, they want help tracking these problems down and assistance solving them, not some black box where people can drop their complaints | 20:57 |
clarkb | https://etherpad.opendev.org/p/debugging-pypi-index-problems lets keep working there | 20:58 |
clarkb | I'll incorporate what we've already found on mtl01 on that etherpad | 20:58 |
clarkb | then start on the list above | 20:58 |
*** slaweq has joined #opendev | 20:58 | |
fungi | wow, bug 1449136 is so very full of misinformation | 21:01 |
openstack | bug 1449136 in tripleo "Pip fails to find distribution for package" [Critical,Triaged] https://launchpad.net/bugs/1449136 | 21:01 |
fungi | outright blaming it on our caches, conflating the swift log upload issue with the stale pypi indices... | 21:02 |
fungi | i'm just going to pretend i didn't see that rather than rage comment in it | 21:02 |
clarkb | yes, I've basically decided I'll continue to set the record straight here :) | 21:02 |
fungi | i have openstack elections to finish setting up, and related tooling bugs to hotfix | 21:05 |
*** slaweq has quit IRC | 21:05 | |
clarkb | I'm working on the rax dfw oslo.serialization grepping now | 21:07 |
clarkb | dfw is lcy too | 21:13 |
clarkb | putting details on the etherpad | 21:13 |
ianw | clarkb: i can pull up bhs1 if you haven't already? | 21:15 |
clarkb | ianw: go for it. I was going to add more info on the others and copy the files on dfw to my home dir so that htcacheclean doesn't rm them | 21:16 |
fungi | yeah, i think the only first-level caches i've seen appear in responses are either bwi (baltimore) or lcy (london) | 21:18 |
clarkb | ianw: I'm just copying them into the root of my homedir on each of the mirrors; that way we've got the exact file for later if we need it | 21:19 |
clarkb | and thats done now for mtl01 and dfw. I'll do rax ord now | 21:19 |
*** slaweq has joined #opendev | 21:20 | |
ianw | i've asked in #pypa-dev about the index file format, and if anyone knows around what time it changed, as i think this might be a clue to where stale entries might be coming from | 21:21 |
ianw | it's not generally a very active channel though, so we'll see | 21:21 |
clarkb | if we note it in the warehouse issue I'm sure thats something they'll track down | 21:21 |
ianw | yeah for sure, that's the path forward | 21:21 |
clarkb | another interesting thing is the format of the etag is different | 21:23 |
clarkb | its almost like they've got a stale warehouse backend for fastly | 21:23 |
clarkb | and its looking at an old db or something | 21:23 |
ianw | clarkb: mirror.bhs1.ovh.opendev.org:/var/cache/apache2/proxy/4r/Kl/1x/Zd/Wx/DOmpJToA1chw.header.vary/Ht/iP/SP/ng/Z2/v1_4jNm08zWA.<header|data> is the one to scp | 21:25 |
clarkb | ianw: oh I'm not scpin'g I'm just cp'ing to my homedir on the mirror | 21:26 |
clarkb | but I'll do that now | 21:26 |
clarkb | was mostly worried about htcacheclean | 21:26 |
clarkb | also LCY again | 21:27 |
clarkb | on ord I've only found what looks like a good index for taskflow | 21:27 |
clarkb | still waiting for find to find more | 21:27 |
clarkb | my draft is bad :? | 21:31 |
clarkb | and ya feel free to hack it up :) | 21:34 |
ianw | ok, relaying pypa-dev, but there is somehow an old bandersnatch instance involved behind pypi/cdn | 21:40 |
clarkb | ianw: do they have a public log somewhere? | 21:41 |
clarkb | http://kafka.dcpython.org/day/pypa-dev/2020-09-15 found it | 21:42 |
fungi | oh fun | 21:42 |
clarkb | fungi: ianw you think I should file an issue with what we've got? and there shouldn't be any concern attaching the files verbatim ya? | 21:58 |
clarkb | I'll gunzip the data too | 21:58 |
*** slaweq has quit IRC | 21:59 | |
ianw | clarkb: yeah, i think pypa is narrowing in on the issue but an issue will be good | 22:00 |
ianw | if nothing else to point people to | 22:00 |
clarkb | ya | 22:00 |
clarkb | I'll work on that next | 22:00 |
fungi | yeah, that would be a good next step | 22:01 |
ianw | i get the feeling there's going to be a super-out-of-date-failing-bandersnatch behind this ... i will certainly not throw stones because it's a glass house we have been in :) | 22:05 |
clarkb | ya | 22:09 |
fungi | i used to live in that glass house. it had a lovely view, but eventually ran out of space | 22:12 |
clarkb | https://github.com/pypa/warehouse/issues/8568 for those following along | 22:25 |
clarkb | dstufft has also found that their fallback bandersnatch mirror (what fastly talks to if it fails to talk to pypi proper) has filled its 12TB of disk | 22:25 |
clarkb | and so it is stale and this is likely the issue | 22:25 |
clarkb | we only see this when fastly falls back to the bandersnatch mirror | 22:25 |
clarkb | they are discussing how to fix it now | 22:26 |
*** DSpider has quit IRC | 22:26 | |
fungi | i'm guessing there's a separate issue which has caused fastly's first level cache in london to start having problems reaching warehouse proper, causing it to fall back to the 2-month-stale bandersnatch mirror more often | 22:29 |
fungi | which would explain it just starting to get bad again in the last day or two | 22:30 |
ianw | https://mirror.dub1.pypi.io ... i wonder if that is dub1 too, which seems close to london ... | 22:30 |
ianw | sorry i mean dublin? | 22:30 |
ianw | mirror.dub1.pypi.io is an alias for ec2-34-244-193-164.eu-west-1.compute.amazonaws.com. | 22:30 |
clarkb | sounds like they are removing the bandersnatch fallback from fastly | 22:31 |
clarkb | I think that means we may see more failures but pip retries those so should make our jobs happier | 22:31 |
fungi | since it looks like their disk filled up in early august, that would also cover the somewhat recent flare-ups we saw on august 19 and august 25 | 22:31 |
clarkb | anyone want to update the lp bug that isn't me? >_> I'll do it if not | 22:31 |
fungi | the lp bug is for tripleo anyway, i guess it can be full of wrong if that's what they prefer | 22:32 |
clarkb | also I'm really glad we didn't try to keep up with bandersnatch ourselves | 22:33 |
fungi | >12tb now! yeesh | 22:33 |
clarkb | smcginnis: johnsom TheJulia ^ fyi I would expect things to start getting better around now | 22:34 |
clarkb | assuming that the issue has been properly identified and the mitigations are effective | 22:34 |
johnsom | Yeah, I have been following along and sent a canary | 22:34 |
clarkb | I'll probably send a #status notice in an hour or so if zuul agrees | 22:34 |
TheJulia | Awesome | 22:35 |
clarkb | ianw: maybe its network issues from london to wherever pypi proper is located | 22:36 |
clarkb | ianw: that fails due to transatlantic problems then it falls back to bandersnatch? I dunno | 22:36 |
ianw | i blame brexit | 22:37 |
johnsom | lol, they cut the lines early! | 22:37 |
johnsom | Sad trombones, ovh-bhs1 couldn't get octavia-lib: https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_a7f/751918/1/check/openstack-tox-pep8/a7f5e55/job-output.txt | 22:47 |
fungi | 22:45:10 | 22:49 |
clarkb | johnsom: I wonder if they have to flush fastly caches too | 22:49 |
johnsom | It was one of the two canary patches I rechecked | 22:49 |
fungi | we may have also cached a response for up to... 600 seconds is it? | 22:49 |
ianw | i'll ask dstufft | 22:49 |
fungi | we can probably find the cachefile on our mirror for that | 22:50 |
fungi | since we can narrow to find -mtime-10 or something | 22:50 |
ianw | seems we should -XPURGE ... looking into that | 22:51 |
*** tkajinam has joined #opendev | 22:51 | |
clarkb | we should maybe do that for everything in requirements | 22:52 |
ianw | i did curl -XPURGE https://pypi.org/simple/octavia-lib and got an OK | 22:55 |
ianw | johnsom: maybe try again | 22:55 |
johnsom | I can send another recheck, but it's a roll of the dice here it will land. | 22:56 |
fungi | roll them bones | 22:57 |
clarkb | maybe we purge everything in https://opendev.org/openstack/requirements/src/branch/master/upper-constraints.txt after removing the === and clearing duplicates | 22:59 |
* clarkb makes that list | 22:59 | |
fungi | remember to dereference the index urls too | 23:00 |
fungi | . and _ get rewritten to - looks like? | 23:00 |
fungi | also lower-cased | 23:00 |
fungi | and i think runs of multiple punctuation are coalesced | 23:00 |
fungi | though we shouldn't have any of those | 23:01 |
clarkb | http://paste.openstack.org/show/797908/ I think constraints already normalized | 23:02 |
fungi | but not necessarily normalized to pypi's rules | 23:02 |
clarkb | also are we sure its - and not . | 23:02 |
clarkb | ah yup seems to redirect | 23:03 |
ianw | i have "cat upper-constraints.txt | sed 's/===.*//' | tr '.' '-' | tr '[:upper:]' '[:lower:]'" | 23:03 |
clarkb | so we sjust have to s/./-/ | 23:03 |
clarkb | ianw: ya that looks better than what I did | 23:03 |
ianw | running "cat upper-constraints.txt | sed 's/===.*//' | tr '.' '-' | tr '[:upper:]' '[:lower:]' | xargs -I "PROJ" curl -XPURGE https://pypi.org/simple/PROJ" | 23:07 |
ianw | it's returning OK but i think it might just do that for everything? | 23:07 |
clarkb | I wonder too if we have to do it from the mirrors so that it hits the right fastly endpoints | 23:09 |
fungi | xargs doesn't need you to do https://pypi.org/simple/$PROJ i guess? | 23:09 |
clarkb | fungi: not with -I | 23:09 |
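A slightly expanded version of that purge loop; the extra sed for underscores follows fungi's note above about pypi's name normalization, and whether the PURGE has to be issued from the affected region is the open question discussed just below:

```bash
# normalize every project name in upper-constraints the way pypi does
# (strip the pin, lowercase, dots/underscores -> dashes, dedupe), then
# send fastly a PURGE for each simple index page
sed -e 's/===.*//' upper-constraints.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[._]/-/g' \
    | sort -u \
    | xargs -I PROJ curl -s -o /dev/null -X PURGE "https://pypi.org/simple/PROJ/"
```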
johnsom | No luck, ovh-bhs1 missing octavia-lib: https://cac8a4551058cbabfd7f-de2a4e9610e68e853bfff1f1436a242e.ssl.cf5.rackcdn.com/751925/4/check/openstack-tox-py36/957de24/job-output.txt | 23:10 |
fungi | ahh, awesome, one more way to avoid for loops | 23:10 |
clarkb | ianw: where did you run the purge from? | 23:11 |
ianw | just my local machine | 23:11 |
clarkb | I guess next thing would be to do it from the mirrors? | 23:11 |
clarkb | and if that still doesn't work then we may need to defer to pypa again | 23:11 |
johnsom | Sorry to say I need to step away soon. I can post a DNM patch that has only the faster running jobs if you would like it for testing before I go. | 23:12 |
ianw | johnsom: i think this might be a rare case of just coming back tomorrow will make it work :) | 23:12 |
johnsom | Yeah, that is my perspective as well | 23:13 |
johnsom | Though that didn't work yesterday... grin | 23:13 |
clarkb | I think they said the fastly caches are 24 hours? | 23:14 |
johnsom | Yeah, that is what I saw | 23:14 |
clarkb | so maybe if we do nothing else it will work again tomorrow this time | 23:14 |
johnsom | Too bad it's not a good time to take a few days and go to the coast. | 23:16 |
ianw | trying to read about purging, i feel like it's a global thing | 23:16 |
clarkb | johnsom: my brothers went fishing at barview jetty a couple days ago. I told them they were stupid | 23:16 |
clarkb | johnsom: they did get a keeper ling cod though so I don't think they cared | 23:16 |
clarkb | ianw: rgr | 23:16 |
fungi | we could declare tomorrow international write stuff in rust day | 23:17 |
fungi | or whatever the hip language is this week | 23:17 |
johnsom | fungi I am down with that. I have been playing with it in my spare time | 23:17 |
*** smcginnis has quit IRC | 23:20 | |
*** smcginnis has joined #opendev | 23:22 | |
ianw | perhaps pip pulls some variant of that url | 23:32 |
ianw | https://pypi.org:443 "GET /simple/octavia-lib/ HTTP/1.1" 304 0 | 23:37 |
ianw | doesn't seem so | 23:37 |