Thursday, 2020-06-25

clarkb	looks like my fix is failing now, its the same error but in the openstack org not x org	00:14
fungi	in production it was failing on a variety of different namespaces	00:15
clarkb	I would expect us to process the list of projects in order but maybe we don't	00:16
clarkb	fungi: does anything about the change I wrote look wrong?	00:18
clarkb	maybe we need to respect the link headers because it does some out of order pagination?	00:19
clarkb	rather than assuming we can iterate one by one until the end	00:20
clarkb	oh maybe urlencoding is a problem	00:23
clarkb	no I don't think that is it	00:25
clarkb	looking at the gitea logs from the job it doesn't appear we are looping. we're just doing the first fetch	00:28
openstackgerrit	Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882	00:56
openstackgerrit	Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885	00:56
clarkb	I don't think ^ will fix it but I wanted to make those cleanups anyway	00:56
clarkb	I'm not having any better ideas right now. Will have to pick this up in the morning. (Also feel free to update if you think you see it)	01:01
clarkb	actually the first of my changes may be doing the correct thing but not the followup I was looking at the wrong log file	01:04
clarkb	the first is still failing though which makes me wonder if there is a second issue to address	01:05
ianw	sorry i had to run out this morning but am back ... i'm a bit lost but let me know if i can help	01:13
clarkb	ianw: basically the manage-projects job isn't running at all because we very quickly hit an http 409 error from gitea. The root cause seems to be an addition of pagination listing repos in gitea. We list all the gitea repos then use that list to check if we have to create new repos	01:14
clarkb	ianw: but since we do an incomplete listing we try to create repos that already exist and get the 409 conflict	01:14
*** mrunge has quit IRC		01:14
clarkb	ianw: https://review.opendev.org/737882 aims to fix this but is still erroring with the same error implying we aren't listing things properly	01:14
*** mrunge has joined #opendev		01:14
clarkb	looking at the gitea logs for the previous patchset of that change we are doing the looping of requests to get all of the repos	01:15
clarkb	I'm assuming the bug now is in the internal datastructure representing those lists of repos (which we check against to see if a project already exists)	01:15
clarkb	but I just don't see it	01:16
clarkb	and its getting late and I have cranky kids so hard to think	01:16
clarkb	ianw: to be clear there isn't an immediate emergency. We just can't add or update projects right now	01:16
clarkb	if you want to poke at it feel free. Its all tested in that stack because the base of the stack sets up the job to run manage projects twice	01:17
clarkb	first time creates all the repos then second pass should noop successfully but it doesn't currently	01:17
ianw	ok cool, i'm fresh eyes on all this so not sure much help but will have a poke	01:18
clarkb	probably the next thing is to figure out how to get that ansible library to emit more logging of what the gitea repos it saw were and what repo it tried to create	01:23
ianw	you read my mind :)	01:23
clarkb	cool I'll leave you to it then	01:24
clarkb	also its really neat how easy it is to test this stuff	01:24
*** DSpider has quit IRC		01:31
*** cloudnull has joined #opendev		01:42
ianw	looks like ps3 fixed it	01:44
clarkb	oh really?	02:01
clarkb	maybe it was a parameter issue then	02:01
fungi	clarkb: sorry, i had turned in for the evening, i can try to take a look in the morning if you haven't already worked it out	02:02
fungi	skimming, sounds like maybe you worked it out after all	02:03
*** diablo_rojo has quit IRC		03:15
*** shtepanie has quit IRC		03:28
openstackgerrit	Merged opendev/grafyaml master: Drop Python 2 support https://review.opendev.org/737667	03:54
*** pmacdonnell has quit IRC		03:56
openstackgerrit	Merged opendev/grafyaml master: Remove query variable refresh deprecation https://review.opendev.org/737664	04:00
*** ykarel|away is now known as ykarel		04:21
*** ysandeep|away is now known as ysandeep		04:43
openstackgerrit	Ian Wienand proposed opendev/grafyaml master: Add import of json files https://review.opendev.org/737900	05:04
ianw	glarkb/fungi: ^ so that gets us to something we talked about, where you can run a local grafana in a container, make your changes via UI and save the json to project-config for review/version control	05:05
ianw	clarkb even ^ :)	05:05
ianw	i just need to write the instructions for the grafana side now	05:05
*** jaicaa has quit IRC		05:22
*** jaicaa has joined #opendev		05:23
*** ysandeep is now known as ysandeep|afk		05:48
*** cloudnull has quit IRC		06:14
*** rpittau|afk is now known as rpittau		06:20
*** cloudnull has joined #opendev		06:27
*** ysandeep|afk is now known as ysandeep		06:44
openstackgerrit	Ian Wienand proposed opendev/system-config master: Grafana container deployment https://review.opendev.org/737406	06:44
openstackgerrit	Andreas Jaeger proposed openstack/project-config master: Add pep8 jobs to grafyaml https://review.opendev.org/737915	06:55
*** hashar has joined #opendev		06:59
openstackgerrit	Ian Wienand proposed openstack/project-config master: Add all python versions to bindep tox testing https://review.opendev.org/735284	07:00
frickler	I haven't looked at that in some time, so don't know when it may have started, but I'm now seeing too large select buttons on https://review.opendev.org/#/admin/projects/openstack/neutron-dynamic-routing,access using firefox, leading to an overlap effect similar to what we had on etherpad. it may be an effect of my local settings, though	07:14
*** sgw1 has quit IRC		07:22
*** tosky has joined #opendev		07:42
*** moppy has quit IRC		08:01
*** moppy has joined #opendev		08:01
openstackgerrit	Javier Peña proposed opendev/system-config master: Make the base role and playbook compatible with CentOS https://review.opendev.org/737043	08:14
*** hashar has quit IRC		08:16
*** corvus has quit IRC		08:17
*** hashar has joined #opendev		08:22
*** corvus has joined #opendev		08:30
*** ykarel is now known as ykarel|lunch		08:39
*** hrw has joined #opendev		08:46
hrw	morning	08:46
yoctozepto	hey infra - got a question about meetpad - does it support recording?	08:47
openstackgerrit	Javier Peña proposed opendev/system-config master: Support CentOS for AFS mirror https://review.opendev.org/736996	09:13
*** sorin-mihai has joined #opendev		09:28
*** aannuusshhkkaa has quit IRC		09:33
*** DSpider has joined #opendev		09:35
*** ysandeep is now known as ysandeep|afk		09:39
*** bhagyashris is now known as bhagyashris|afk		09:55
*** hashar has quit IRC		09:57
*** ykarel|lunch is now known as ykarel		09:58
openstackgerrit	Donny Davis proposed openstack/project-config master: Slowly Scale OE back up https://review.opendev.org/737941	09:59
*** ysandeep|afk is now known as ysandeep		10:05
*** rpittau is now known as rpittau|bbl		10:20
*** tkajinam has quit IRC		10:22
openstackgerrit	Merged openstack/project-config master: Slowly Scale OE back up https://review.opendev.org/737941	10:27
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/system-config master: Recognize LP urls for footer bugs https://review.opendev.org/737960	10:29
openstackgerrit	Thierry Carrez proposed openstack/project-config master: Removing missed tripleo-ui references https://review.opendev.org/737961	10:31
frickler	yoctozepto: currently not. jitsi does have a recoding component but we haven't deployed that afaik	10:35
frickler	recording	10:35
yoctozepto	frickler: ack, thanks	10:42
*** bhagyashris|afk is now known as bhagyashris		11:00
*** ysandeep is now known as ysandeep|break		11:17
*** sorin-mihai has quit IRC		11:25
*** ysandeep|break is now known as ysandeep		11:47
*** dpawlik6 has quit IRC		11:54
openstackgerrit	Merged openstack/project-config master: Removing missed tripleo-ui references https://review.opendev.org/737961	12:08
openstackgerrit	Andreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1 https://review.opendev.org/737987	12:11
*** dpawlik6 has joined #opendev		12:19
*** hashar has joined #opendev		12:22
openstackgerrit	Andreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1 https://review.opendev.org/737987	12:26
openstackgerrit	Andreas Jaeger proposed openstack/project-config master: Finish retirement of networking-onos,openstack-ux,solum-infra-guestagent https://review.opendev.org/737992	12:26
*** rpittau|bbl is now known as rpittau		12:42
*** hashar has quit IRC		13:16
fungi	yoctozepto: i've heard that with the right software you can locally record the browser window	13:34
fungi	though i don't personally know who's done that	13:34
fungi	and i expect gpu acceleration makes that complicated to capture	13:34
openstackgerrit	Oleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024	13:39
openstackgerrit	Oleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024	13:44
Open10K8S	Hi Team	13:45
Open10K8S	Please check this PS	13:45
Open10K8S	https://review.opendev.org/738024	13:45
Open10K8S	Waiting review on other PSs	13:45
*** dpawlik6 is now known as dpawlik-2		13:48
*** dpawlik-2 is now known as danpawlik		13:48
*** sgw has joined #opendev		13:52
openstackgerrit	Ghanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin https://review.opendev.org/738030	13:57
ttx	fungi, clarkb: we should sync on when to approve https://review.opendev.org/#/c/737533/ so that I can watch a few runs and check everything behaves as expected	14:06
fungi	ttx: i can approve now if you're around to check results	14:09
ttx	I'm in a meeting, but my brain is not used at 100%, so yes	14:09
openstackgerrit	Ghanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin https://review.opendev.org/738030	14:10
ttx	fungi: ^	14:12
fungi	"ignore_errors: zuul.newrev is defined" seems backwards to me, but that's probably just me not understanding ansible's backwards logic	14:12
*** ryohayakawa has quit IRC		14:12
fungi	i thought the idea was to ignore errors from mirroring if there is no zuul.newrev (because it was triggered from something other than a ref-updated event)	14:13
ttx	zuul.newrev is always defined in the post pipeline, so that's equivalent to ignore_errors = true	14:14
ttx	the goal being to ignore mirror failures as long as the reference is up	14:14
mnaser	fungi: is it ok to +W project-config changes given the current state of manage-projects?	14:14
fungi	but if you run it outside the post pipeline, say in check, "zuul.newrev is defined" evaluates false	14:14
fungi	so you're telling it not to ignore errors if run in check?	14:15
ttx	yes, it basically behaves as it currently does, if tested in check	14:15
ttx	(so the check test does not really test the new code... but it can't since the job actually runs under different conditions in post pipeline)	14:16
openstackgerrit	Ghanshyam Mann proposed openstack/project-config master: Final step for networking-l2gw and networking-l2gw-tempest-plugin retirement https://review.opendev.org/738040	14:17
fungi	ttx: oh, got it, we won't hit this race condition in check/gate pipelines anyway	14:18
ttx	exactly	14:18
fungi	now if gertty will stop hanging for a moment i can approve :/	14:18
ttx	fungi: how fast are zuul-jobs deployed once the change merges ? Should I just watch the promote job?	14:23
corvus	ttx: that change will take effect immediately upon merge; so the next run of the job starting after the merge will use it	14:23
ttx	noted! Will stand by	14:24
openstackgerrit	Merged openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024	14:28
*** ysandeep is now known as ysandeep|away		14:29
*** mlavalle has joined #opendev		14:30
fungi	yeah, zuul takes its job configuration from the git state on branches and knows as soon as they merge that it should start using them instead of the prior state	14:31
openstackgerrit	Merged zuul/zuul-jobs master: upload-git-mirror: check after mirror operation https://review.opendev.org/737533	14:32
*** ykarel is now known as ykarel|away		14:36
ttx	now waiting for something openstacky to actually merge	14:43
fungi	rackspace just opened tickets letting us know about host outages impacting logstash-worker02 and nl02	14:48
fungi	#status log logstash-worker02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage	14:53
openstackstatus	fungi: finished logging	14:53
fungi	#status log nl02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage	14:54
openstackstatus	fungi: finished logging	14:54
ttx	looking good so far	14:56
openstackgerrit	Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882	15:01
openstackgerrit	Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885	15:01
clarkb	fungi: https://review.opendev.org/#/c/737883/ got squashed into 737882 which will allow us to land that stack (I set you as co author on the commit). I think that means you can abandon 737883	15:02
clarkb	infra-root ^ Those two changes are ready for review and landing now I guess. The 737882 parent is the one we need for our current problems and 737885 is future proofing	15:02
clarkb	for the previous patchsets you can see things appear to work properly starting at https://zuul.opendev.org/t/openstack/build/6c45d6e883454129be037b48e3f714a2/log/job-output.txt#18188 and https://zuul.opendev.org/t/openstack/build/31ccae26537e4ec2835d23e28e8e1d3f/log/job-output.txt#18223 for each of those changes	15:05
mordred	clarkb: nice	15:10
ttx	fungi: so the new playbook works well in the nominal case. Now I have to wait for the race condition to happen to see if it really solves it	15:11
fungi	yeah, that's always the hard part	15:11
*** bhagyashris is now known as bhagyashris|afk		15:19
AJaeger	ttx, push three changes and approve them together?	15:20
ttx	AJaeger: it's hit and miss, depends how fast they enqueue into the post pipeline	15:20
AJaeger	fun	15:20
fungi	is there anything we need to check in the wake of the nl02 outage? i suppose keeping all the state in zk mostly shields us from hung/leaked operations when a launcher is suddenly rebooted?	15:21
ttx	I'll see by tomorrow :) We usually have a couple issues per day	15:21
AJaeger	looking forward to hear the results	15:22
clarkb	fungi: ya should be fine re nl02	15:24
*** aannuusshhkkaa has joined #opendev		15:57
*** diablo_rojo has joined #opendev		16:02
openstackgerrit	Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882	16:10
openstackgerrit	Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885	16:10
clarkb	mordred: fungi ^ ianw's suggestion was important enough that I thought a new set of ps's would be a good idea	16:11
clarkb	and this generates even more test data (and confidence!)	16:11
*** rpittau is now known as rpittau|afk		16:11
mordred	clarkb: ++	16:12
*** hashar has joined #opendev		16:14
fungi	cool	16:17
openstackgerrit	Clark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation https://review.opendev.org/738064	16:43
clarkb	mordred: corvus ^ I noticed that those TODOs could be cleaned up while working on the other thing. I'm not sure if I got that quite right and that is even less of an emergency but your input on it since you dealt with the original pass would be good	16:43
corvus	clarkb: seems legit; i don't recall details about your question in the TODO though.	16:48
mordred	me either - but also seem legit	16:50
corvus	i'm afk for a bit	16:53
clarkb	thanks I've WIP'd it just to be sure the other stuff lands first and we can stabilize before worrying about optimizing	16:54
mordred	clarkb: ok. I think it's time to try deploying an executor with docker	16:59
mordred	clarkb: ze01 is already stopped - so I'm going to start with it - sound ok?	16:59
clarkb	wfm	16:59
mordred	clarkb: I've disabled ansible - will wait for current playbooks to stop	17:00
mordred	clarkb: there are a couple of old ansible playbooks runs	17:00
clarkb	mordred: ok keep in mind the manage-projects backlog will be stopped up against that when the changes land, but we can also run that manually	17:01
clarkb	mordred: I think job timeouts are doing that	17:01
clarkb	but not completely positive of that	17:01
mordred	clarkb: yeah - need to figure out what's going on there	17:02
mordred	running against ze01	17:04
clarkb	mordred: we have to manually stop the executor first then run the use docker playbook update?	17:07
clarkb	we didn't encode the transition in the playbooks	17:07
mordred	that's right	17:07
mordred	clarkb: actually - we seem to have landed the "disable old service" patch	17:08
mordred	clarkb: so - the playbook will turn off executor - but will not run docker compose up	17:09
clarkb	https://zuul.opendev.org/t/openstack/build/72dedbb571374ccbbc7c9cc14e10f209/log/job-output.txt#18246 that was the latest pass of the pagination change which was just an update to add a comment in the zuul config	17:09
clarkb	thats making me think this fix is incomplete or buggy or racy	17:10
clarkb	fungi: ianw: ^ I think that must've been why it failed for me last night and I got all confused	17:10
clarkb	thinking out loud here, it could be a race for listing repos after creating all the repos?	17:10
clarkb	except we seem to have ~25 seconds between runs there	17:11
mordred	clarkb: oh. bong. zuul-executor is running on ze01	17:11
mordred	zuul-executor stop did not stop it	17:12
clarkb	mordred: well it should stop it in about 10 minutes	17:12
clarkb	it waits for all the ansible to stop running	17:13
mordred	ah	17:13
mordred	nod	17:13
* mordred thought this was one off - but forgot we did that whole big restart		17:13
* mordred waits		17:13
clarkb	mordred: what I do is watch something like `ps -elf | grep zuul | wc -l` and that number should generally trend down	17:14
clarkb	I pulled the +W off of https://review.opendev.org/#/c/737882/5 and rechecked it in order to generate more data	17:15
clarkb	if anyone else has ideas for ^ they are more than welcome	17:18
clarkb	mordred: in particular doing better logging of the tool execution in ansible somehow would be useful	17:18
clarkb	but I'm not sure how to expose that in ansible. Maybe just start writing to stdout and ansible captures that or?	17:19
clarkb	I guess we can hold the nodes too and try to rerun manually and see what happens	17:20
* clarkb puts a hold on that change		17:20
clarkb	alright thats in place	17:21
mordred	clarkb: no - definitely don't just write to stdout from an ansible module	17:23
mordred	I think there is a log method now	17:23
mordred	on the module object	17:23
clarkb	mordred: cool if I catch one with the hold I can fiddle with finding that and using it in the python module	17:23
mordred	clarkb: cool	17:24
clarkb	but also I think I can run it outside of the ansible context on the held setup	17:24
mordred	clarkb: ok - ze01 is running in docker	17:24
clarkb	and then do normal python logging/tracebacks/etc	17:24
clarkb	mordred: now we want to see jobs on ze01 use afs properly?	17:24
mordred	clarkb: yeah. I've put ze* back in the emergency file and have removed DISABLE-ANSIBLE	17:25
mordred	so we can watch ze01 for a bit and make sure we're happy with it and everything	17:25
mordred	#status log ze01 is running via docker now, ze* is still in emergency so we can watch ze01	17:26
openstackstatus	mordred: finished logging	17:26
mnaser	is it ok to merge project-config changes that touch manage-projects?	17:28
fungi	mnaser: should be, they just aren't taking effect yet	17:28
clarkb	and the fix is still failing occasionally for unknown reasons	17:29
mnaser	:(	17:29
frickler	infra-root: something seems wrong, likely related to nl02 reboot, our used nodes dropped and nl01 logs a lot of quota failures	17:34
frickler	maybe the reboot left orphaned nodes?	17:35
frickler	HttpException: 403: Client Error for url: https://iad.servers.api.rackspacecloud.com/v2/637776/servers, Quota exceeded for ram: Requested 8	17:36
frickler	192, but already used 1441792 of 1536000 ram	17:36
frickler	that looks more like rackspace might have messed up their quota calculation	17:36
openstackgerrit	Merged openstack/project-config master: Add Neutron Arista plugin charm to OpenStack charms https://review.opendev.org/737791	17:42
clarkb	ya nova can get out of sync	17:48
openstackgerrit	Merged openstack/project-config master: Refresh openstack-ansible grafana dashboards https://review.opendev.org/737742	17:48
* frickler needs to eod, maybe someone can contact them		17:48
fungi	frickler: a quick server list for iad shows we have 186 instances booted there	17:54
fungi	so we may have leaked nodes?	17:54
openstackgerrit	Merged openstack/project-config master: Add pep8 jobs to grafyaml https://review.opendev.org/737915	18:06
openstackgerrit	Merged openstack/project-config master: Add all python versions to bindep tox testing https://review.opendev.org/735284	18:06
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Test multiarch release builds https://review.opendev.org/737315	18:16
openstackgerrit	Clark Boylan proposed opendev/system-config master: Add more logging to gitea project creation https://review.opendev.org/738083	18:23
clarkb	mordred: ^ maybe something like that?	18:23
*** Open10K8S has quit IRC		18:24
*** Open10K8S has joined #opendev		18:25
clarkb	interestingly it has failed twice in a row now	18:25
clarkb	ok rerunning the playbook on my held node causes failure	18:29
clarkb	which is good because it likely rules out a race	18:29
* clarkb tries the extra logging there now		18:29
openstackgerrit	Rafael Folco proposed openstack/diskimage-builder master: Enable py3 on dib release 7 https://review.opendev.org/736421	18:31
fungi	ahh, looks like we have a number of nodes for rax-iad in error and shutoff states	18:33
AJaeger	infra-root, https://zuul.opendev.org/t/openstack/build/c4948797c7994937bfa632105d06af93 fails with "No such file or directory: 'kinit': 'kinit'" - this is a promote docs job. Is kinit suddenly missing?	18:40
clarkb	mordred: ^ we may need to rollback ze01 docker container deployment	18:40
clarkb	AJaeger: I think it is likely that is related to running ze01 in a docker container	18:40
fungi	yeah, may have missed installing krb5 in the image	18:40
corvus	it was supposed to be in the image; we landed a change to add it	18:41
corvus	but i agree, immediate resolution should be to stop that ze	18:42
corvus	i will do that now	18:42
AJaeger	thanks	18:43
corvus	i have issued the 'zuul-executor stop' command	18:43
fungi	i have confirmed none of the 21 shutdown or error status instances in rax-iad appear in our `nodepool list` output, so i'll work on manually deleting them with osc	18:44
clarkb	I think I'm making progress on the gitea thing. in my test case its failing because openstack/telemetry-tempest-plugin isn't in the repo listing	18:44
clarkb	but it was supposedly created on the first pass	18:44
mordred	corvus: we only added openafs-krb5 - we didn't add krb5-user	18:45
clarkb	and that seems consistent across multiple runs of the playbook on this test setup	18:45
clarkb	curling the page for that repo seems to show it	18:47
corvus	fungi: do they lack the metadata that would tell nodepool to delete them?	18:48
fungi	corvus: i should have grabbed some samples, however i see there's one in dfw too, so i'll dig into it more closely	18:50
corvus	maybe they lost their md due to whatever error happened	18:50
fungi	i was at least able to `openstack server delete` all 21 strays in iad successfully	18:50
fungi	corvus: the example in dfw does lack the additional nodepool_* properties	18:52
fungi	its properties field (according to openstack server show) is completely empty	18:53
fungi	looks like it's from 19 days ago	18:53
fungi	oh, actually, it was created 2018-04-18 but updated 2020-06-06 for some reason	18:53
fungi	so yeah, this looks like it could be a source of infrequent node leaks	18:54
corvus	booo :(	18:55
fungi	anything else i should look at before i delete this one in dfw?	19:02
fungi	it claims to be over 2 years old, though i have a tough time believing it's been in our openstackjenkins tenant server list for dfw that entire time	19:03
clarkb	well thats curious. I checked the cardinality of the gitea repo list and it matches the input list size. But then I convert to a set and now I'm off by one	19:05
fungi	cardinality vs ordinality maybe?	19:06
clarkb	well you wouldn't expect duplicates	19:08
clarkb	somehow openstack/tempest is listed twice. now to see if that is consistent	19:08
clarkb	(I'm wondering if this is a page boundary bug in gitea)	19:08
fungi	oh, yeah, maybe they have problems with calculating offsets correctly	19:10
clarkb	confirmed that it straddles pages	19:14
fungi	ew	19:14
clarkb	page 2 element 50 and page 3 element 1	19:14
clarkb	openstack/tempest	19:14
clarkb	howdy	19:14
fungi	i guess the list can be deduplicated as a workaround?	19:14
fungi	or are we also missing entries because of this?	19:15
fungi	i guess for each page we get a duplicate and lose an entry too	19:15
clarkb	fungi: that doesn't help because the problem is we don't have openstack/telemetry-tempest-plugin in the list and that causes us to try and recreate openstack/telemetry-tempest-plugin and that fails	19:15
clarkb	no this is the only duplicate	19:15
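Editor's sketch of the check described above: the listing has the expected length, yet converting it to a set comes up one short, which means one name appears twice and another is absent. The data is invented for illustration and `find_duplicates` / `find_missing` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicates(names: Iterable[str]) -> List[str]:
    """Names that appear more than once in a listing."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def find_missing(expected: Iterable[str], names: Iterable[str]) -> List[str]:
    """Expected names that never appear in the listing."""
    return sorted(set(expected) - set(names))


# Invented example: same length as the input list, but one entry is
# repeated (as at a page boundary) and another is dropped.
expected = [
    "openstack/nova",
    "openstack/tempest",
    "openstack/telemetry-tempest-plugin",
]
listing = [
    "openstack/nova",
    "openstack/tempest",
    "openstack/tempest",
]

same_length = len(listing) == len(expected)      # cardinality matches
off_by_one = len(expected) - len(set(listing))   # but the set is short
duplicates = find_duplicates(listing)
missing = find_missing(expected, listing)
```

Comparing lengths alone hides the problem; deduplicating the listing removes the repeat but cannot bring back the missing entry.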
fungi	yeah. we could run through the list twice with different prime-numbered page sizes	19:15
clarkb	we can also fetch the repo page and see if it exists rather than using the api to list them all	19:16
clarkb	but I want to see if I can figure out why this happens in the first place and ya maybe I'll try some different page sizes	19:16
clarkb	is 17 prime?	19:16
fungi	yes	19:16
clarkb	and maybe 31?	19:17
fungi	43 and 47 are the two largest primes under the 50 max	19:17
clarkb	17 changed the dup	19:18
clarkb	still only one duplicate though	19:18
clarkb	I'm guessing the next step in debugging this is looking at the db and the paging code and figuring out the bug	19:18
clarkb	separately we can do a double check and see if https://localhost:3000/org/project is a 404 or a 2XX	19:18
clarkb	and only try to create if it isn't 2XX	19:18
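Editor's sketch of the double check proposed above: before creating a repo that is absent from the listing, look at the HTTP status of its page and skip creation on any 2XX. `page_status` is a hypothetical callable standing in for the real request; nothing here is the actual ansible module code.

```python
from typing import Callable, Set


def should_create(repo: str, listed: Set[str],
                  page_status: Callable[[str], int]) -> bool:
    """Create only if the repo is not listed AND its page is not a 2XX."""
    if repo in listed:
        return False
    status = page_status(repo)
    return not (200 <= status < 300)


# Hypothetical responses: one repo exists but was dropped from the
# listing, the other genuinely does not exist yet.
_STATUSES = {
    "openstack/telemetry-tempest-plugin": 200,
    "openstack/brand-new-project": 404,
}


def fake_page_status(repo: str) -> int:
    """Stand-in for fetching the repo page and returning its status code."""
    return _STATUSES.get(repo, 404)


listed = {"openstack/tempest"}
```

The extra request costs one round trip per unlisted repo, but it keeps an incomplete listing from turning into a create attempt.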
fungi	yeah, the reason to run through twice with two different prime numbered page sizes is they're guaranteed not to share a period	19:19
corvus	we don't have any repo creation happening during this right?	19:19
corvus	the data set is supposed to be static during our queries?	19:19
clarkb	corvus: correct, we do all the queries upfront	19:20
clarkb	fungi: ya and we could then combine them all and dedup the result	19:20
fungi	granted, if the mistaken offset is >1 you basically need n+1 different passes with different page sizes	19:21
clarkb	except in this case we seem to only ever get one dup for some reason	19:21
clarkb	probably shouldn't rely on that behavior until we understand it though	19:21
fungi	just the other day someone asked me where prime numbers are useful in computer science. this would have made a great example	19:22
clarkb	43, 47, and 50 produce the same duplicate: openstack/tempest	19:23
fungi	neat. is it the first or last entry?	19:23
clarkb	or wait I might have a bug in duplicate logging	19:23
clarkb	yup I do	19:24
clarkb	so ignore that observation	19:24
clarkb	(there are definitely duplicates and it changes based on page size and it is sometimes tempest)	19:25
clarkb	43 and 47 don't produce duplicates	19:25
clarkb	and the play succeeds	19:25
clarkb	https://github.com/go-gitea/gitea/pull/11827	19:32
clarkb	it seems like its querying the db for repos where the owner id matches the provided id	19:33
clarkb	but there is an ordered by that is "updated_unix DESC"	19:34
clarkb	mordred: ^ we're not changing order but maybe if that is a common value the order isn't stable?	19:35
clarkb	I guess my next step is to check the database	19:35
clarkb	but it is lunch time. My hunch is that ordered by isn't stable	19:35
clarkb	and if that is the case we can make a change to something more stable and in the mean time do a secondary fetch for https://localhost:3000/org/project and check that status	19:36
fungi	another possibility... off-by-one in the 50 max limit? maybe any page size <50 but not ==50 works correctly?	19:36
clarkb	fungi: setting it to 17 also fails. But maybe there is an off by one relative to any value	19:37
clarkb	I'll check the db directly after lunch	19:37
AJaeger	clarkb, fungi, https://review.opendev.org/737791 has added today a new repo - but I think your off-by-one happened already before that, didn't it?	19:43
clarkb	AJaeger: ya its a bug since we upgraded gitea	19:45
clarkb	we should stop adding new projects for now	19:45
clarkb	http://paste.openstack.org/show/795231/ I think that mostly confirms the bug	19:54
clarkb	Internet says ordered by is not stable on successive requests	19:54
clarkb	I can't reproduce that via mysql client yet, but I'd be really surprised if there was a different issue	19:54
mordred	clarkb: yeah - so - if we're ordering only by updated unix - that's only seconds	19:55
clarkb	mordred: yup exactly. I think the fix in gitea is to order by id and updated_unix	19:55
clarkb	id is a proper key	19:55
clarkb	and should make it stable	19:55
mordred	yes. that would be the right choice	19:55
clarkb	I'm going to fiddle with the mysql client to make sure I get it right but I'll make a PR for gitea	19:56
clarkb	and then we can maybe deploy that ourselves and then my fix for the pagination should work?	19:56
clarkb	or we can hack up our ansible to check if the page for the repo exists as a fallback	19:56
*** olaph has quit IRC		19:57
mordred	yah	19:58
mordred	clarkb: it should be ordered by updated_unix, id desc i believe	19:58
mordred	(since you want to order by updated_unix and then by id)	19:58
*** hashar has quit IRC		19:59
clarkb	mordred: ya I think we need two DESC's though? ORDER BY updated_unix DESC , id DESC;	20:00
mordred	yeah, I think that's right	20:01
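Editor's sketch of the tie-break ordering discussed above, using an in-memory SQLite database in place of the real one. Only `updated_unix` and `id` come from the log; the table name, the `name` column, and the LIMIT/OFFSET paging are assumptions made for illustration. Many rows share the same `updated_unix`, so sorting on that column alone leaves the order of ties unspecified, while adding the unique `id` gives every row a fixed position.

```python
import sqlite3
from typing import List


def list_pages(conn: sqlite3.Connection, page_size: int) -> List[str]:
    """Page through all repos using a fully determined sort order."""
    names: List[str] = []
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT name FROM repository "
            "ORDER BY updated_unix DESC, id DESC "
            "LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        names.extend(row[0] for row in rows)
        offset += page_size
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_unix INTEGER)"
)
# 120 hypothetical repos; groups of 40 share one timestamp (whole seconds).
conn.executemany(
    "INSERT INTO repository (name, updated_unix) VALUES (?, ?)",
    [(f"openstack/repo-{i}", 1593000000 + i // 40) for i in range(120)],
)

names = list_pages(conn, 50)
```

Because `id` is unique, no two rows compare equal, so a row cannot slide across a page boundary between requests as long as the table itself does not change.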
clarkb	cool I'm going to figure out two different things to push to github. One will be against master I can do a PR for then the other will be 1.12.0 + the fix that we can update our image to and then redo all the testing with that	20:03
clarkb	this is easier said than done :)	20:03
fungi	sounds great	20:03
fungi	but yeah, githubz	20:03
mordred	++	20:04
clarkb	hrm actually	20:05
clarkb	this won't fix us in production will it?	20:05
clarkb	assuming that gerrit replication updates that timestamp we'll never have a full listing through that api	20:06
fungi	is there not an autoincrement index for the projects table?	20:07
mordred	are they doing this with order by limit?	20:07
clarkb	fungi: there is	20:07
mordred	it would be much better to just have this listing be sorted by id	20:08
fungi	yeah, that	20:08
clarkb	mordred: https://github.com/go-gitea/gitea/pull/11827/commits/8cc1b15245f06145c267f59146f4cb74c6330a1b first bit of diff there is the order by	20:08
clarkb	they are not doing order by limit, its listing everything then chunking later I think	20:08
clarkb	What we can do is pass in opts from the api side to order by id and not updated_unix	20:09
clarkb	and override the default	20:09
* clarkb changes commit to do that		20:09
mordred	yeah	20:09
mordred	we don't care about "most recent" for our use case	20:10
mordred	we just want the list	20:10
fungi	and more importantly, the entire list	20:14
mordred	fungi: yeah - this is one of those cases where it would be really nice to be able to say "hi, please to not paginate"	20:16
clarkb	mordred: ya as I'm trying to figure out all the places this may be a bug in gitea I'm feeling the same way	20:16
clarkb	but I'm giving up on that for now	20:17
clarkb	because my brain is melting	20:17
mordred	yup	20:17
clarkb	it kinda makes me think we should solve this differently in ansible	20:20
clarkb	but let me push up change to try this version	20:21
openstackgerrit	Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882	20:25
openstackgerrit	Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885	20:25
openstackgerrit	Clark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation https://review.opendev.org/738064	20:25
clarkb	infra-root ^ I'm not sure we want to merge it like that just yet, but that should exercise my gitea fix	20:26
clarkb	I'm filing a gitea bug now	20:27
clarkb	and will do my PR and see if they think it's sufficient or not	20:27
clarkb	https://github.com/go-gitea/gitea/issues/12056	20:36
clarkb	https://github.com/go-gitea/gitea/pull/12057	20:38
*** icarusfactor has joined #opendev		20:50
openstackgerrit	Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/738109	20:51
clarkb	infra-root ^ alternative approach with double checking	20:51
*** factor has quit IRC		20:52
clarkb	I'm going to take a break now. I've got to do the zuul community update tonight so need to prep for that but also don't want to work all afternoon if I'm working tonight :)	20:52
clarkb	I think that gives us ~2 options to address this and people can feel free to update/fix/etc as necessary	20:53
clarkb	also I think we need to sort id ascending so that if we get new repos we don't change the ordering	21:03
clarkb	but that shouldn't be as big of an issue because you'll get dups but still have complete data	21:03
fungi	yeah, agreed, ascending id sort is more robust than descending if new repos are added while iterating (not that we expect that in production)	21:07
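A minimal sketch of the failure mode discussed above, not gitea's actual code: a toy server re-sorts its table on every request and slices out one page. The helper names (`fetch_page`, `list_all`, `touch_a_after_first_page`) and the sample data are made up for illustration. When the sort key is `updated_unix` and a repo is updated between requests, the walk both skips and repeats entries; sorting by ascending `id` is unaffected by that update.

```python
import copy


def fetch_page(repos, page, limit, sort_key, reverse=False):
    """Simulate offset pagination: sort the whole table, then slice one page."""
    ordered = sorted(repos, key=sort_key, reverse=reverse)
    start = (page - 1) * limit
    return [r["name"] for r in ordered[start:start + limit]]


def list_all(repos, limit, sort_key, reverse=False, touch=None):
    """Walk pages until an empty one comes back.

    `touch`, if given, runs after each page to mimic activity
    (for example replication) happening between requests.
    """
    seen = []
    page = 1
    while True:
        names = fetch_page(repos, page, limit, sort_key, reverse)
        if not names:
            break
        seen.extend(names)
        if touch:
            touch(repos, page)
        page += 1
    return seen


# Hypothetical table: six repos whose timestamps grow with their ids.
REPOS = [
    {"id": i, "name": n, "updated_unix": i * 10}
    for i, n in enumerate("abcdef", start=1)
]


def touch_a_after_first_page(repos, page):
    """Repo "a" gets updated right after the client fetched page 1."""
    if page == 1:
        repos[0]["updated_unix"] = 100


by_time = list_all(copy.deepcopy(REPOS), 2, lambda r: r["updated_unix"],
                   reverse=True, touch=touch_a_after_first_page)
by_id = list_all(copy.deepcopy(REPOS), 2, lambda r: r["id"],
                 touch=touch_a_after_first_page)

print(by_time)  # "a" never shows up and "e" is listed twice
print(by_id)    # every repo exactly once
```

The same toy can be used to try the descending-id case with a repo added mid-walk, which yields duplicates but no gaps in the pre-existing data.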
clarkb	and thinking about it more what they really should do is provide a next url that has enough of a seed to reproduce the original list and index into it properly	21:09
fungi	that probably requires caching some additional state, and then having to decide how long you keep it for	21:09
clarkb	ya, I'm hoping this issue I've filed sparks a discussion on getting it right overall	21:10
clarkb	for that reason I'm kinda leaning towards our solution being https://review.opendev.org/738109 for now	21:11
clarkb	basically accept the pagination is flawed and work around it	21:11
clarkb	rather than rely on another likely flawed pagination system	21:11
clarkb	though I've just realized that will make creating an initial set of repos very slow. I guess we're about to find out how slow via testing	21:15
clarkb	fungi: another thought just occurred to me. If we clear out new project additions from projects.yaml we could clear out the gitea management temporarily in order to get gerrit things updated	21:21
clarkb	fungi: that may be worth doing if this prolongs itself due to the pain of dealing with it	21:21
clarkb	oh also just had a thought. We could check for http 409 and ignore those errors	21:22
clarkb	instead of doing the GET beforehand to see if project exists which will slow down initial load out	21:22
fungi	oh, yep, lbyl is biting us basically, we could just eafp	21:23
fungi	especially if the only effective error 409 represents is "already exists"	21:24
clarkb	ya it's a conflict, which is basically "it's there, you can't do this"	21:24
clarkb	aiui	21:24
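A rough sketch of the "ask forgiveness" (EAFP) idea, assuming that 409 really only ever means "already exists", as fungi notes. `ensure_repo`, `Response` and `FakeGitea` are hypothetical names, and the URL path is an assumption about the gitea API rather than something taken from the role; the fake server exists only so the example runs without a network.

```python
from collections import namedtuple

# Stand-in for an HTTP response; only the status code matters here.
Response = namedtuple("Response", ["status_code"])


def ensure_repo(post, org, name):
    """Try to create the repo and treat 409 Conflict as "already there".

    `post` is any callable taking (url, json=...) and returning an object
    with a `status_code` attribute (a requests session's post method has
    that shape). Returns True if created, False if it already existed.
    """
    # Assumed endpoint path; check it against the deployed gitea version.
    resp = post(f"/api/v1/orgs/{org}/repos", json={"name": name})
    if resp.status_code == 201:
        return True
    if resp.status_code == 409:
        return False
    raise RuntimeError(f"creating {org}/{name} failed: HTTP {resp.status_code}")


class FakeGitea:
    """In-memory fake used only to exercise ensure_repo."""

    def __init__(self, existing):
        self.repos = set(existing)

    def post(self, url, json):
        org = url.split("/")[4]
        key = (org, json["name"])
        if key in self.repos:
            return Response(409)
        self.repos.add(key)
        return Response(201)


gitea = FakeGitea({("opendev", "system-config")})
print(ensure_repo(gitea.post, "opendev", "system-config"))  # already exists
print(ensure_repo(gitea.post, "opendev", "new-repo"))       # gets created
```

The trade-off is that no listing or pre-flight GET is needed at all, at the cost of hiding any other condition the server might report as 409.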
clarkb	ok really popping out for a bit now	21:25
clarkb	I'll roll back in a bit later and see if anyone has a preference of the ~3 options that have been brainstormed	21:25
clarkb	the gitea change I wrote doesn't seem to fix it. https://zuul.opendev.org/t/openstack/build/11bade7c1229425a916a04c505ada62e failed. I think that means we should ask permission or forgiveness. For permission https://review.opendev.org/#/c/738109/	21:45
clarkb	I'm rechecking though since a single data point is insufficient	21:45
clarkb	maybe we recheck that continuously through the end of today and if it continues to work we go with it tomorrow?	21:45
clarkb	then I can rebase the other stuff on top of that if we go that route	21:45
ianw	clarkb: huh, the jury is still out on how to get any logging out of the module? TypeError: __init__() got an unexpected keyword argument 'log'	22:23
ianw	i couldn't see anything, other than bunching stuff up to return in the json	22:23
clarkb	ianw: ya that change is broken but I fixed the argument thing and it still didn't work	22:23
clarkb	ianw: I ended up using the built in logging of the module which is really clunky but it worked	22:23
clarkb	ianw: and traced it to bugs in gitea pagination so now I'm thinking something like https://review.opendev.org/#/c/738109/ is our best bet or similar to that but asking forgiveness instead of permission	22:24
clarkb	I also filed a bug with gitea and pushed a PR that doesn't seem to be working	22:24
clarkb	ianw: I think we recheck 738109 a few times then if people are happy enough with it we can try and land it and get manage projects running	22:25
clarkb	manage projects has a fair bit of backlog now though so not sure you want to be on the hook for that overnight (can wait until tomorrow morning)	22:25
clarkb	and I've held some nodes if people want to interact with gitea though I've hacked up the gitea-git-repos role there so may need to restore to known state if expecting that to make sense	22:26
ianw	so it's not that get_org_repo_list may return duplicates; it's more that it may also not return all the projects?	22:28
clarkb	yes	22:29
clarkb	because pagination orders things by timestamp and collisions would not be sorted stably by mariadb	22:29
fungi	basically, page offsets seem to start and end at the wrong place	22:29
clarkb	but also in production that timestamp can update frequently so you'd lose things in the listing that way as well	22:30
ianw	and we don't want to just probe for projects all the time and ditch the walk, because that makes it much much slower even in CI when we're starting fresh?	22:30
clarkb	yes however the 109 change above will do it in CI because we start from nothing so check each repo in that case	22:31
clarkb	and it doesn't seem to be too slow	22:31
clarkb	but ya checking after listing is an optimization but we could always check and drop the listing too	22:31
ianw	yeah, that's my only thought; simplify it by just checking the project directly -- reusing the session it seems like it should be as low overhead as possible	22:32
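A small sketch of the "ask permission" variant being weighed here, loosely modelled on the double-checking idea and not copied from 738109: trust the paginated listing for what it does contain, and fall back to a direct per-repo lookup only for names it lacks. `repos_to_create` and the `exists` callable are hypothetical.

```python
def repos_to_create(wanted, listed, exists):
    """Return the wanted repos that actually need creating.

    wanted: iterable of repo names we expect to have.
    listed: set of names from the (possibly incomplete) paginated listing.
    exists: callable doing a direct lookup for one repo, for example a GET
            on the repo's API URL through a shared session.
    """
    missing = []
    for repo in wanted:
        if repo in listed:
            continue  # the listing already vouches for it, no request needed
        if exists(repo):
            continue  # the listing skipped it, but it is really there
        missing.append(repo)
    return missing


# Toy data: the server has a, b and c, but the listing dropped "b".
on_server = {"a", "b", "c"}
lookups = []


def exists(repo):
    lookups.append(repo)
    return repo in on_server


print(repos_to_create(["a", "b", "c", "d"], {"a", "c"}, exists))
print(lookups)  # only the names absent from the listing were probed
```

Passing an empty `listed` set turns this into the simpler "always probe each project" approach, at one lookup per wanted repo.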
clarkb	also orgs, teams and all that are paginated too and I would be amazed if they don't have similar issues	22:33
clarkb	I expect we'll end up doing incremental improvements	22:33
clarkb	starting with repos to make things work with the current dataset	22:33
ianw	just as written also makes sense, and is more easily revertable when it's fixed, so no problem with that either really :)	22:33
mordred	yeah - ultimately this is a fundamental issue in the api that would surely be good to sort out	22:33
clarkb	mordred: ya I'm hoping my github issue helps gitea move towards the fixing	22:34
clarkb	but ya I think maybe start with 109 once we are confident in it then continue to make incremental improvements from there	22:34
*** tosky has quit IRC		22:40
*** mlavalle has quit IRC		22:58
clarkb	109 has succeeded now twice. I've rechecked it again	23:02
mordred	clarkb, ianw: I left a +2 on 109 but not a +A	23:05
mordred	because - you know - it's EOD here	23:05
mordred	ianw: oh - also - we tried rolling out ze* on docker but were missing krb5-user from the images. zuul is stopped on ze01 and they're all in emergency	23:06
mordred	we'll try again tomorrow - but just an fyi	23:06
ianw	cool, yeah i think my best bet is to not touch anything :)	23:06
ianw	mordred: not sure if you saw but grafana came together well as a container in https://review.opendev.org/#/c/737406/ and https://review.opendev.org/#/c/737397/5	23:07
ianw	i'm going to have a look at graphite and see if it is as amenable ... that would be two more ticked off the xenial list	23:08
mordred	ianw: that looks great!	23:12
mordred	ianw: I +Ad the first, +2d the second	23:12
ianw	thanks ... i found https://hub.docker.com/r/graphiteapp/docker-graphite-statsd/ yesterday and it looks very promising as pretty much a drop-in	23:13
ianw	devil will be in the details	23:13
openstackgerrit	Monty Taylor proposed opendev/system-config master: Make bindep installs non-interactive https://review.opendev.org/738121	23:19
mordred	ianw, corvus: ^^ if you got a sec	23:20
corvus	++ all around	23:20
mordred	ianw: the devil is always in the details	23:20
mordred	corvus: awesome - thanks	23:20
*** DSpider has quit IRC		23:22
openstackgerrit	Merged opendev/system-config master: Add a grafana/grafyaml image https://review.opendev.org/737397	23:33
*** rchurch has quit IRC		23:43
*** rchurch has joined #opendev		23:43
*** ryohayakawa has joined #opendev		23:56
*** cloudnull has quit IRC		23:57
*** cloudnull has joined #opendev		23:58