clarkb | looks like my fix is failing now, it's the same error but in the openstack org not x org | 00:14 |
---|---|---|
fungi | in production it was failing on a variety of different namespaces | 00:15 |
clarkb | I would expect us to process the list of projects in order but maybe we don't | 00:16 |
clarkb | fungi: does anything about the change I wrote look wrong? | 00:18 |
clarkb | maybe we need to respect the link headers because it does some out of order pagination? | 00:19 |
clarkb | rather than assuming we can iterate one by one until the end | 00:20 |
clarkb | oh maybe urlencoding is a problem | 00:23 |
clarkb | no I don't think that is it | 00:25 |
clarkb | looking at the gitea logs from the job it doesn't appear we are looping. we're just doing the first fetch | 00:28 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882 | 00:56 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 00:56 |
clarkb | I don't think ^ will fix it but I wanted to make those cleanups anyway | 00:56 |
clarkb | I'm not having any better ideas right now. Will have to pick this up in the morning. (Also feel free to update if you think you see it) | 01:01 |
clarkb | actually the first of my changes may be doing the correct thing but not the followup; I was looking at the wrong log file | 01:04 |
clarkb | the first is still failing though which makes me wonder if there is a second issue to address | 01:05 |
ianw | sorry i had to run out this morning but am back ... i'm a bit lost but let me know if i can help | 01:13 |
clarkb | ianw: basically the manage-projects job isn't running at all because we very quickly hit an http 409 error from gitea. The root cause seems to be an addition of pagination listing repos in gitea. We list all the gitea repos then use that list to check if we have to create new repos | 01:14 |
clarkb | ianw: but since we do an incomplete listing we try to create repos that already exist and get the 409 conflict | 01:14 |
*** mrunge has quit IRC | 01:14 | |
clarkb | ianw: https://review.opendev.org/737882 aims to fix this but is still erroring with the same error implying we aren't listing things properly | 01:14 |
*** mrunge has joined #opendev | 01:14 | |
clarkb | looking at the gitea logs for the previous patchset of that change we are doing the looping of requests to get all of the repos | 01:15 |
clarkb | I'm assuming the bug now is in the internal datastructure representing those lists of repos (which we check against to see if a project already exists) | 01:15 |
clarkb | but I just don't see it | 01:16 |
clarkb | and it's getting late and I have cranky kids so hard to think | 01:16 |
clarkb | ianw: to be clear there isn't an immediate emergency. We just can't add or update projects right now | 01:16 |
clarkb | if you want to poke at it feel free. Its all tested in that stack because the base of the stack sets up the job to run manage projects twice | 01:17 |
clarkb | first time creates all the repos then second pass should noop successfully but it doesn't currently | 01:17 |
ianw | ok cool, i'm fresh eyes on all this so not sure much help but will have a poke | 01:18 |
clarkb | probably the next thing is to figure out how to get that ansible library to emit more logging of what the gitea repos it saw were and what repo it tried to create | 01:23 |
ianw | you read my mind :) | 01:23 |
clarkb | cool I'll leave you to it then | 01:24 |
clarkb | also its really neat how easy it is to test this stuff | 01:24 |
*** DSpider has quit IRC | 01:31 | |
*** cloudnull has joined #opendev | 01:42 | |
ianw | looks like ps3 fixed it | 01:44 |
clarkb | oh really? | 02:01 |
clarkb | maybe it was a parameter issue then | 02:01 |
fungi | clarkb: sorry, i had turned in for the evening, i can try to take a look in the morning if you haven't already worked it out | 02:02 |
fungi | skimming, sounds like maybe you worked it out after all | 02:03 |
*** diablo_rojo has quit IRC | 03:15 | |
*** shtepanie has quit IRC | 03:28 | |
openstackgerrit | Merged opendev/grafyaml master: Drop Python 2 support https://review.opendev.org/737667 | 03:54 |
*** pmacdonnell has quit IRC | 03:56 | |
openstackgerrit | Merged opendev/grafyaml master: Remove query variable refresh deprecation https://review.opendev.org/737664 | 04:00 |
*** ykarel|away is now known as ykarel | 04:21 | |
*** ysandeep|away is now known as ysandeep | 04:43 | |
openstackgerrit | Ian Wienand proposed opendev/grafyaml master: Add import of json files https://review.opendev.org/737900 | 05:04 |
ianw | glarkb/fungi: ^ so that gets us to something we talked about, where you can run a local grafana in a container, make your changes via UI and save the json to project-config for review/version control | 05:05 |
ianw | clarkb even ^ :) | 05:05 |
ianw | i just need to write the instructions for the grafana side now | 05:05 |
*** jaicaa has quit IRC | 05:22 | |
*** jaicaa has joined #opendev | 05:23 | |
*** ysandeep is now known as ysandeep|afk | 05:48 | |
*** cloudnull has quit IRC | 06:14 | |
*** rpittau|afk is now known as rpittau | 06:20 | |
*** cloudnull has joined #opendev | 06:27 | |
*** ysandeep|afk is now known as ysandeep | 06:44 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Grafana container deployment https://review.opendev.org/737406 | 06:44 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Add pep8 jobs to grafyaml https://review.opendev.org/737915 | 06:55 |
*** hashar has joined #opendev | 06:59 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add all python versions to bindep tox testing https://review.opendev.org/735284 | 07:00 |
frickler | I haven't looked at that in some time, so don't know when it may have started, but I'm now seeing too large select buttons on https://review.opendev.org/#/admin/projects/openstack/neutron-dynamic-routing,access using firefox, leading to an overlap effect similar to what we had on etherpad. it may be an effect of my local settings, though | 07:14 |
*** sgw1 has quit IRC | 07:22 | |
*** tosky has joined #opendev | 07:42 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
openstackgerrit | Javier Peña proposed opendev/system-config master: Make the base role and playbook compatible with CentOS https://review.opendev.org/737043 | 08:14 |
*** hashar has quit IRC | 08:16 | |
*** corvus has quit IRC | 08:17 | |
*** hashar has joined #opendev | 08:22 | |
*** corvus has joined #opendev | 08:30 | |
*** ykarel is now known as ykarel|lunch | 08:39 | |
*** hrw has joined #opendev | 08:46 | |
hrw | morning | 08:46 |
yoctozepto | hey infra - got a question about meetpad - does it support recording? | 08:47 |
openstackgerrit | Javier Peña proposed opendev/system-config master: Support CentOS for AFS mirror https://review.opendev.org/736996 | 09:13 |
*** sorin-mihai has joined #opendev | 09:28 | |
*** aannuusshhkkaa has quit IRC | 09:33 | |
*** DSpider has joined #opendev | 09:35 | |
*** ysandeep is now known as ysandeep|afk | 09:39 | |
*** bhagyashris is now known as bhagyashris|afk | 09:55 | |
*** hashar has quit IRC | 09:57 | |
*** ykarel|lunch is now known as ykarel | 09:58 | |
openstackgerrit | Donny Davis proposed openstack/project-config master: Slowly Scale OE back up https://review.opendev.org/737941 | 09:59 |
*** ysandeep|afk is now known as ysandeep | 10:05 | |
*** rpittau is now known as rpittau|bbl | 10:20 | |
*** tkajinam has quit IRC | 10:22 | |
openstackgerrit | Merged openstack/project-config master: Slowly Scale OE back up https://review.opendev.org/737941 | 10:27 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Recognize LP urls for footer bugs https://review.opendev.org/737960 | 10:29 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Removing missed tripleo-ui references https://review.opendev.org/737961 | 10:31 |
frickler | yoctozepto: currently not. jitsi does have a recoding component but we haven't deployed that afaik | 10:35 |
frickler | recording | 10:35 |
yoctozepto | frickler: ack, thanks | 10:42 |
*** bhagyashris|afk is now known as bhagyashris | 11:00 | |
*** ysandeep is now known as ysandeep|break | 11:17 | |
*** sorin-mihai has quit IRC | 11:25 | |
*** ysandeep|break is now known as ysandeep | 11:47 | |
*** dpawlik6 has quit IRC | 11:54 | |
openstackgerrit | Merged openstack/project-config master: Removing missed tripleo-ui references https://review.opendev.org/737961 | 12:08 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1 https://review.opendev.org/737987 | 12:11 |
*** dpawlik6 has joined #opendev | 12:19 | |
*** hashar has joined #opendev | 12:22 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Retire networking-onos, openstack-ux, solum-infra-guest-agent: Step 1 https://review.opendev.org/737987 | 12:26 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Finish retirement of networking-onos,openstack-ux,solum-infra-guestagent https://review.opendev.org/737992 | 12:26 |
*** rpittau|bbl is now known as rpittau | 12:42 | |
*** hashar has quit IRC | 13:16 | |
fungi | yoctozepto: i've heard that with the right software you can locally record the browser window | 13:34 |
fungi | though i don't personally know who's done that | 13:34 |
fungi | and i expect gpu acceleration makes that complicated to capture | 13:34 |
openstackgerrit | Oleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024 | 13:39 |
openstackgerrit | Oleksandr Kozachenko proposed openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024 | 13:44 |
Open10K8S | Hi Team | 13:45 |
Open10K8S | Please check this PS | 13:45 |
Open10K8S | https://review.opendev.org/738024 | 13:45 |
Open10K8S | Waiting review on other PSs | 13:45 |
*** dpawlik6 is now known as dpawlik-2 | 13:48 | |
*** dpawlik-2 is now known as danpawlik | 13:48 | |
*** sgw has joined #opendev | 13:52 | |
openstackgerrit | Ghanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin https://review.opendev.org/738030 | 13:57 |
ttx | fungi, clarkb: we should sync on when to approve https://review.opendev.org/#/c/737533/ so that I can watch a few runs and check everything behaves as expected | 14:06 |
fungi | ttx: i can approve now if you're around to check results | 14:09 |
ttx | I'm in a meeting, but my brain is not used at 100%, so yes | 14:09 |
openstackgerrit | Ghanshyam Mann proposed openstack/project-config master: Retire networking-l2gw and networking-l2gw-tempest-plugin https://review.opendev.org/738030 | 14:10 |
ttx | fungi: ^ | 14:12 |
fungi | "ignore_errors: zuul.newrev is defined" seems backwards to me, but that's probably just me not understanding ansible's backwards logic | 14:12 |
*** ryohayakawa has quit IRC | 14:12 | |
fungi | i thought the idea was to ignore errors from mirroring if there is no zuul.newrev (because it was triggered from something other than a ref-updated event) | 14:13 |
ttx | zuul.newrev is always defined in the post pipeline, so that's equivalent to ignore_errors = true | 14:14 |
ttx | the goal being to ignore mirror failures as long as the reference is up | 14:14 |
mnaser | fungi: is it ok to +W project-config changes given the current state of manage-projects? | 14:14 |
fungi | but if you run it outside the post pipeline, say in check, "zuul.newrev is defined" evaluates false | 14:14 |
fungi | so you're telling it not to ignore errors if run in check? | 14:15 |
ttx | yes, it basically behaves as it currently does, if tested in check | 14:15 |
ttx | (so the check test does not really test the new code... but it can't since the job actually runs under different conditions in post pipeline) | 14:16 |
openstackgerrit | Ghanshyam Mann proposed openstack/project-config master: Final step for networking-l2gw and networking-l2gw-tempest-plugin retirement https://review.opendev.org/738040 | 14:17 |
fungi | ttx: oh, got it, we won't hit this race condition in check/gate pipelines anyway | 14:18 |
ttx | exactly | 14:18 |
fungi | now if gertty will stop hanging for a moment i can approve :/ | 14:18 |
ttx | fungi: how fast are zuul-jobs deployed once the change merges ? Should I just watch the promote job? | 14:23 |
corvus | ttx: that change will take effect immediately upon merge; so the next run of the job starting after the merge will use it | 14:23 |
ttx | noted! Will stand by | 14:24 |
openstackgerrit | Merged openstack/project-config master: Add openstack/tempest-horizon in required project https://review.opendev.org/738024 | 14:28 |
*** ysandeep is now known as ysandeep|away | 14:29 | |
*** mlavalle has joined #opendev | 14:30 | |
fungi | yeah, zuul takes its job configuration from the git state on branches and knows as soon as they merge that it should start using them instead of the prior state | 14:31 |
openstackgerrit | Merged zuul/zuul-jobs master: upload-git-mirror: check after mirror operation https://review.opendev.org/737533 | 14:32 |
*** ykarel is now known as ykarel|away | 14:36 | |
ttx | now waiting for something openstacky to actually merge | 14:43 |
fungi | rackspace just opened tickets letting us know about host outages impacting logstash-worker02 and nl02 | 14:48 |
fungi | #status log logstash-worker02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage | 14:53 |
openstackstatus | fungi: finished logging | 14:53 |
fungi | #status log nl02.openstack.org rebooted by provider at 14:46z due to a hypervisor host outage | 14:54 |
openstackstatus | fungi: finished logging | 14:54 |
ttx | looking good so far | 14:56 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882 | 15:01 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 15:01 |
clarkb | fungi: https://review.opendev.org/#/c/737883/ got squashed into 737882 which will allow us to land that stack (I set you as co author on the commit). I think that means you can abandon 737883 | 15:02 |
clarkb | infra-root ^ Those two changes are ready for review and landing now I guess. The 737882 parent is the one we need for our current problems and 737885 is future proofing | 15:02 |
clarkb | for the previous patchsets you can see things appear to work properly starting at https://zuul.opendev.org/t/openstack/build/6c45d6e883454129be037b48e3f714a2/log/job-output.txt#18188 and https://zuul.opendev.org/t/openstack/build/31ccae26537e4ec2835d23e28e8e1d3f/log/job-output.txt#18223 for each of those changes | 15:05 |
mordred | clarkb: nice | 15:10 |
ttx | fungi: so the new playbook works well in the nominal case. Now I have to wait for the race condition to happen to see if it really solves it | 15:11 |
fungi | yeah, that's always the hard part | 15:11 |
*** bhagyashris is now known as bhagyashris|afk | 15:19 | |
AJaeger | ttx, push three changes and approve them together? | 15:20 |
ttx | AJaeger: it's hit and miss, depends how fast they enqueue into the post pipeline | 15:20 |
AJaeger | fun | 15:20 |
fungi | is there anything we need to check in the wake of the nl02 outage? i suppose keeping all the state in zk mostly shields us from hung/leaked operations when a launcher is suddenly rebooted? | 15:21 |
ttx | I'll see by tomorrow :) We usually have a couple issues per day | 15:21 |
AJaeger | looking forward to hear the results | 15:22 |
clarkb | fungi: ya should be fine re nl02 | 15:24 |
*** aannuusshhkkaa has joined #opendev | 15:57 | |
*** diablo_rojo has joined #opendev | 16:02 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882 | 16:10 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 16:10 |
clarkb | mordred: fungi ^ ianw's suggestion was important enough that I thought a new set of ps's would be a good idea | 16:11 |
clarkb | and this generates even more test data (and confidence!) | 16:11 |
*** rpittau is now known as rpittau|afk | 16:11 | |
mordred | clarkb: ++ | 16:12 |
*** hashar has joined #opendev | 16:14 | |
fungi | cool | 16:17 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation https://review.opendev.org/738064 | 16:43 |
clarkb | mordred: corvus ^ I noticed that those TODOs could be cleaned up while working on the other thing. I'm not sure if I got that quite right and that is even less of an emergency but your input on it since you dealt with the original pass would be good | 16:43 |
corvus | clarkb: seems legit; i don't recall details about your question in the TODO though. | 16:48 |
mordred | me either - but also seem legit | 16:50 |
corvus | i'm afk for a bit | 16:53 |
clarkb | thanks I've WIP'd it just to be sure the other stuff lands first and we can stabilize before worrying about optimizing | 16:54 |
mordred | clarkb: ok. I think it's time to try deploying an executor with docker | 16:59 |
mordred | clarkb: ze01 is already stopped - so I'm going to start with it - sound ok? | 16:59 |
clarkb | wfm | 16:59 |
mordred | clarkb: I've disabled ansible - will wait for current playbooks to stop | 17:00 |
mordred | clarkb: there are a couple of old ansible playbooks runs | 17:00 |
clarkb | mordred: ok keep in mind the manage-projects backlog will be stopped up against that when the changes land, but we can also run that manually | 17:01 |
clarkb | mordred: I think job timeouts are doing that | 17:01 |
clarkb | but not completely positive of that | 17:01 |
mordred | clarkb: yeah - need to figure out what's going on there | 17:02 |
mordred | running against ze01 | 17:04 |
clarkb | mordred: we have to manually stop the executor first then run the use docker playbook update? | 17:07 |
clarkb | we didn't encode the transition in the playbooks | 17:07 |
mordred | that's right | 17:07 |
mordred | clarkb: actually - we seem to have landed the "disable old service" patch | 17:08 |
mordred | clarkb: so - the playbook will turn off executor - but will not run docker compose up | 17:09 |
clarkb | https://zuul.opendev.org/t/openstack/build/72dedbb571374ccbbc7c9cc14e10f209/log/job-output.txt#18246 that was the latest pass of the pagination change which was just an update to add a comment in the zuul config | 17:09 |
clarkb | thats making me think this fix is incomplete or buggy or racy | 17:10 |
clarkb | fungi: ianw: ^ I think that must've been why it failed for me last night and I got all confused | 17:10 |
clarkb | thinking out loud here, it could be a race for listing repos after creating all the repos? | 17:10 |
clarkb | except we seem to have ~25 seconds between runs there | 17:11 |
mordred | clarkb: oh. bong. zuul-executor is running on ze01 | 17:11 |
mordred | zuul-executor stop did not stop it | 17:12 |
clarkb | mordred: well it should stop it in about 10 minutes | 17:12 |
clarkb | it waits for all the ansible to stop running | 17:13 |
mordred | ah | 17:13 |
mordred | nod | 17:13 |
* mordred thought this was one off - but forgot we did that whole big restart | 17:13 | |
* mordred waits | 17:13 | |
clarkb | mordred: what I do is watch something like `ps -elf | grep zuul | wc -l` and that number should generally trend down | 17:14 |
clarkb | I pulled the +W off of https://review.opendev.org/#/c/737882/5 and rechecked it in order to generate more data | 17:15 |
clarkb | if anyone else has ideas for ^ they are more than welcome | 17:18 |
clarkb | mordred: in particular doing better logging of the tool execution in ansible somehow would be useful | 17:18 |
clarkb | but I'm not sure how to expose that in ansible. Maybe just start writing to stdout and ansible captures that or? | 17:19 |
clarkb | I guess we can hold the nodes too and try to rerun manually and see what happens | 17:20 |
* clarkb puts a hold on that change | 17:20 | |
clarkb | alright thats in place | 17:21 |
mordred | clarkb: no - definitely don't just write to stdout from an ansible module | 17:23 |
mordred | I think there is a log method now | 17:23 |
mordred | on the module object | 17:23 |
clarkb | mordred: cool if I catch one with the hold I can fiddle with finding that and using it in the python module | 17:23 |
mordred | clarkb: cool | 17:24 |
clarkb | but also I think I can run it outside of the ansible context on the held setup | 17:24 |
mordred | clarkb: ok - ze01 is running in docker | 17:24 |
clarkb | and then do normal python logging/tracebacks/etc | 17:24 |
clarkb | mordred: now we want to see jobs on ze01 use afs properly? | 17:24 |
mordred | clarkb: yeah. I've put ze* back in the emergency file and have removed DISABLE-ANSIBLE | 17:25 |
mordred | so we can watch ze01 for a bit and make sure we're happy with it and everything | 17:25 |
mordred | #status log ze01 is running via docker now, ze* is still in emergency so we can watch ze01 | 17:26 |
openstackstatus | mordred: finished logging | 17:26 |
mnaser | is it ok to merge project-config changes that touch manage-projects? | 17:28 |
fungi | mnaser: should be, they just aren't taking effect yet | 17:28 |
clarkb | and the fix is still failing occasionally for unknown reasons | 17:29 |
mnaser | :( | 17:29 |
frickler | infra-root: something seems wrong, likely related to nl02 reboot, our used nodes dropped and nl01 logs a lot of quota failures | 17:34 |
frickler | maybe the reboot left orphaned nodes? | 17:35 |
frickler | HttpException: 403: Client Error for url: https://iad.servers.api.rackspacecloud.com/v2/637776/servers, Quota exceeded for ram: Requested 8192, but already used 1441792 of 1536000 ram | 17:36 |
frickler | that looks more like rackspace might have messed up their quota calculation | 17:36 |
openstackgerrit | Merged openstack/project-config master: Add Neutron Arista plugin charm to OpenStack charms https://review.opendev.org/737791 | 17:42 |
clarkb | ya nova can get out of sync | 17:48 |
openstackgerrit | Merged openstack/project-config master: Refresh openstack-ansible grafana dashboards https://review.opendev.org/737742 | 17:48 |
* frickler needs to eod, maybe someone can contact them | 17:48 | |
fungi | frickler: a quick server list for iad shows we have 186 instances booted there | 17:54 |
fungi | so we may have leaked nodes? | 17:54 |
openstackgerrit | Merged openstack/project-config master: Add pep8 jobs to grafyaml https://review.opendev.org/737915 | 18:06 |
openstackgerrit | Merged openstack/project-config master: Add all python versions to bindep tox testing https://review.opendev.org/735284 | 18:06 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Test multiarch release builds https://review.opendev.org/737315 | 18:16 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add more logging to gitea project creation https://review.opendev.org/738083 | 18:23 |
clarkb | mordred: ^ maybe something like that? | 18:23 |
*** Open10K8S has quit IRC | 18:24 | |
*** Open10K8S has joined #opendev | 18:25 | |
clarkb | interestingly it has failed twice in a row now | 18:25 |
clarkb | ok rerunning the playbook on my held node causes failure | 18:29 |
clarkb | which is good because it likely rules out a race | 18:29 |
* clarkb tries the extra logging there now | 18:29 | |
openstackgerrit | Rafael Folco proposed openstack/diskimage-builder master: Enable py3 on dib release 7 https://review.opendev.org/736421 | 18:31 |
fungi | ahh, looks like we have a number of nodes for rax-iad in error and shutoff states | 18:33 |
AJaeger | infra-root, https://zuul.opendev.org/t/openstack/build/c4948797c7994937bfa632105d06af93 fails with "No such file or directory: 'kinit': 'kinit'" - this is a promote docs job. Is kinit suddenly missing? | 18:40 |
clarkb | mordred: ^ we may need to rollback ze01 docker container deployment | 18:40 |
clarkb | AJaeger: I think it is likely that is related to running ze01 in a docker container | 18:40 |
fungi | yeah, may have missed installing krb5 in the image | 18:40 |
corvus | it was supposed to be in the image; we landed a change to add it | 18:41 |
corvus | but i agree, immediate resolution should be to stop that ze | 18:42 |
corvus | i will do that now | 18:42 |
AJaeger | thanks | 18:43 |
corvus | i have issued the 'zuul-executor stop' command | 18:43 |
fungi | i have confirmed none of the 21 shutdown or error status instances in rax-iad appear in our `nodepool list` output, so i'll work on manually deleting them with osc | 18:44 |
clarkb | I think I'm making progress on the gitea thing. in my test case its failing because openstack/telemetry-tempest-plugin isn't in the repo listing | 18:44 |
clarkb | but it was supposedly created on the first pass | 18:44 |
mordred | corvus: we only added openafs-krb5 - we didn't add krb5-user | 18:45 |
clarkb | and that seems consistent across multiple runs of the playbook on this test setup | 18:45 |
clarkb | curling the page for that repo seems to show it | 18:47 |
corvus | fungi: do they lack the metadata that would tell nodepool to delete them? | 18:48 |
fungi | corvus: i should have grabbed some samples, however i see there's one in dfw too, so i'll dig into it more closely | 18:50 |
corvus | maybe they lost their md due to whatever error happened | 18:50 |
fungi | i was at least able to `openstack server delete` all 21 strays in iad successfully | 18:50 |
fungi | corvus: the example in dfw does lack the additional nodepool_* properties | 18:52 |
fungi | its properties field (according to openstack server show) is completely empty | 18:53 |
fungi | looks like it's from 19 days ago | 18:53 |
fungi | oh, actually, it was created 2018-04-18 but updated 2020-06-06 for some reason | 18:53 |
fungi | so yeah, this looks like it could be a source of infrequent node leaks | 18:54 |
corvus | booo :( | 18:55 |
fungi | anything else i should look at before i delete this one in dfw? | 19:02 |
fungi | it claims to be over 2 years old, though i have a tough time believing it's been in our openstackjenkins tenant server list for dfw that entire time | 19:03 |
clarkb | well thats curious. I checked the cardinality of the gitea repo list and it matches the input list size. But then I convert to a set and now I'm off by one | 19:05 |
fungi | cardinality vs ordinality maybe? | 19:06 |
clarkb | well you wouldn't expect duplicates | 19:08 |
clarkb | somehow openstack/tempest is listed twice. now to see if that is consistent | 19:08 |
clarkb | (I'm wondering if this is a page boundary bug in gitea) | 19:08 |
fungi | oh, yeah, maybe they have problems with calculating offsets correctly | 19:10 |
clarkb | confirmed that it straddles pages | 19:14 |
fungi | ew | 19:14 |
clarkb | page 2 element 50 and page 3 element 1 | 19:14 |
clarkb | openstack/tempest | 19:14 |
clarkb | howdy | 19:14 |
fungi | i guess the list can be deduplicated as a workaround? | 19:14 |
fungi | or are we also missing entries because of this? | 19:15 |
fungi | i guess for each page we get a duplicate and lose an entry too | 19:15 |
clarkb | fungi: that doesn't help because the problem is we don't have openstack/telemetry-tempest-plugin in the list and that causes us to try and recreate openstack/telemetry-tempest-plugin and that fails | 19:15 |
clarkb | no this is the only duplicate | 19:15 |
fungi | yeah. we could run through the list twice with different prime-numbered page sizes | 19:15 |
clarkb | we can also fetch the repo page and see if it exists rather than using the api to list them all | 19:16 |
clarkb | but I want to see if I can figure out why this happens in the first place and ya maybe I'll try some different page sizes | 19:16 |
clarkb | is 17 prime? | 19:16 |
fungi | yes | 19:16 |
clarkb | and maybe 31? | 19:17 |
fungi | 43 and 47 are the two largest primes under the 50 max | 19:17 |
clarkb | 17 changed the dup | 19:18 |
clarkb | still only one duplicate though | 19:18 |
clarkb | I'm guessing the next step in debugging this is looking at the db and the paging code and figuring out the bug | 19:18 |
clarkb | separately we can do a double check and see if https://localhost:3000/org/project is a 404 or a 2XX | 19:18 |
clarkb | and only try to create if it isn't 2XX | 19:18 |
fungi | yeah, the reason to run through twice with two different prime numbered page sizes is they're guaranteed not to share a period | 19:19 |
corvus | we don't have any repo creation happening during this right? | 19:19 |
corvus | the data set is supposed to be static during our queries? | 19:19 |
clarkb | corvus: correct, we do all the queries upfront | 19:20 |
clarkb | fungi: ya and we could then combine them all and dedup the result | 19:20 |
fungi | granted, if the mistaken offset is >1 you basically need n+1 different passes with different page sizes | 19:21 |
clarkb | except in this case we seem to only ever get one dup for some reason | 19:21 |
clarkb | probably shouldn't rely on that behavior until we understand it though | 19:21 |
fungi | just the other day someone asked me where prime numbers are useful in computer science. this would have made a great example | 19:22 |
clarkb | 43, 47, and 50 produce the same duplicate: openstack/tempest | 19:23 |
fungi | neat. is it the first or last entry? | 19:23 |
clarkb | or wait I might have a bug in duplicate logging | 19:23 |
clarkb | yup I do | 19:24 |
clarkb | so ignore that observation | 19:24 |
clarkb | (there are definitely duplicates and it changes based on page size and it is sometimes tempest) | 19:25 |
clarkb | 43 and 47 don't produce duplicates | 19:25 |
clarkb | and the play succeeds | 19:25 |
clarkb | https://github.com/go-gitea/gitea/pull/11827 | 19:32 |
clarkb | it seems like its querying the db for repos where the owner id matches the provided id | 19:33 |
clarkb | but there is an ordered by that is "updated_unix DESC" | 19:34 |
clarkb | mordred: ^ we're not changing order but maybe if that is a common value the order isn't stable? | 19:35 |
clarkb | I guess my next step is to check the database | 19:35 |
clarkb | but it is lunch time. My hunch is that ordered by isn't stable | 19:35 |
clarkb | and if that is the case we can make a change to something more stable and in the mean time do a secondary fetch for https://localhost:3000/org/project and check that status | 19:36 |
fungi | another possibility... off-by-one in the 50 max limit? maybe any page size <50 but not ==50 works correctly? | 19:36 |
clarkb | fungi: setting it to 17 also fails. But maybe there is an off by one relative to any value | 19:37 |
clarkb | I'll check the db directly after lunch | 19:37 |
AJaeger | clarkb, fungi, https://review.opendev.org/737791 has added today a new repo - but I think your off-by-one happened already before that, didn't it? | 19:43 |
clarkb | AJaeger: ya its a bug since we upgraded gitea | 19:45 |
clarkb | we should stop adding new projects for now | 19:45 |
clarkb | http://paste.openstack.org/show/795231/ I think that mostly confirms the bug | 19:54 |
clarkb | Internet says ordered by is not stable on successive requests | 19:54 |
clarkb | I can't reproduce that via mysql client yet, but I'd be really surprised if there was a different issue | 19:54 |
mordred | clarkb: yeah - so - if we're ordering only by updated unix - that's only seconds | 19:55 |
clarkb | mordred: yup exactly. I think the fix in gitea is to order by id and updated_unix | 19:55 |
clarkb | id is a proper key | 19:55 |
clarkb | and should make it stable | 19:55 |
mordred | yes. that would be the right choice | 19:55 |
clarkb | I'm going to fiddle with the mysql client to make sure I get it right but I'll make a PR for gitea | 19:56 |
clarkb | and then we can maybe deploy that ourselves and then my fix for the pagination should work? | 19:56 |
clarkb | or we can hack up our ansible to check if the page for the repo exists as a fallback | 19:56 |
*** olaph has quit IRC | 19:57 | |
mordred | yah | 19:58 |
mordred | clarkb: it should be ordered by updated_unix, id desc i believe | 19:58 |
mordred | (since you want to order by updated_unix and then by id) | 19:58 |
*** hashar has quit IRC | 19:59 | |
clarkb | mordred: ya I think we need two DESC's though? ORDER BY updated_unix DESC , id DESC; | 20:00 |
mordred | yeah, I think that's right | 20:01 |
clarkb | cool, I'm going to figure out two different things to push to github. One will be against master, which I can do a PR for; the other will be 1.12.0 + the fix, which we can update our image to and then redo all the testing with | 20:03 |
clarkb | this is easier said than done :) | 20:03 |
fungi | sounds great | 20:03 |
fungi | but yeah, githubz | 20:03 |
mordred | ++ | 20:04 |
clarkb | hrm actually | 20:05 |
clarkb | this won't fix us in production will it? | 20:05 |
clarkb | assuming that gerrit replication updates that timestamp we'll never have a full listing through that api | 20:06 |
fungi | is there not an autoincrement index for the projects table? | 20:07 |
mordred | are they doing this with order by limit? | 20:07 |
clarkb | fungi: there is | 20:07 |
mordred | it would be much better to just have this listing be sorted by id | 20:08 |
fungi | yeah, that | 20:08 |
clarkb | mordred: https://github.com/go-gitea/gitea/pull/11827/commits/8cc1b15245f06145c267f59146f4cb74c6330a1b first bit of diff there is the order by | 20:08 |
clarkb | they are not doing order by limit, it's listing everything then chunking later I think | 20:08 |
clarkb | What we can do is pass in opts from the api side to order by id and not updated_unix | 20:09 |
clarkb | and override the default | 20:09 |
* clarkb changes commit to do that | 20:09 | |
mordred | yeah | 20:09 |
mordred | we don't care about "most recent" for our use case | 20:10 |
mordred | we just want the list | 20:10 |
fungi | and more importantly, the entire list | 20:14 |
mordred | fungi: yeah - this is one of those cases where it would be really nice to be able to say "hi, please to not paginate" | 20:16 |
clarkb | mordred: ya as I'm trying to figure out all the places this may be a bug in gitea I'm feeling the same way | 20:16 |
clarkb | but I'm giving up on that for now | 20:17 |
clarkb | because my brain is melting | 20:17 |
mordred | yup | 20:17 |
clarkb | it kinda makes me think we should solve this differently in ansible | 20:20 |
clarkb | but let me push up change to try this version | 20:21 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/737882 | 20:25 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Paginate all the gitea get requests https://review.opendev.org/737885 | 20:25 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Increase parallelism of gitea project creation https://review.opendev.org/738064 | 20:25 |
clarkb | infra-root ^ I'm not sure we want to merge it like that just yet, but that should exercise my gitea fix | 20:26 |
clarkb | I'm filing a gitea bug now | 20:27 |
clarkb | and will do my PR and see if they think it's sufficient or not | 20:27 |
clarkb | https://github.com/go-gitea/gitea/issues/12056 | 20:36 |
clarkb | https://github.com/go-gitea/gitea/pull/12057 | 20:38 |
*** icarusfactor has joined #opendev | 20:50 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Deal with gitea pagination of repo lists https://review.opendev.org/738109 | 20:51 |
clarkb | infra-root ^ alternative approach with double checking | 20:51 |
*** factor has quit IRC | 20:52 | |
clarkb | I'm going to take a break now. I've got to do the zuul community update tonight so need to prep for that but also don't want to work all afternoon if I'm working tonight :) | 20:52 |
clarkb | I think that gives us ~2 options to address this and people can feel free to update/fix/etc as necessary | 20:53 |
clarkb | also I think we need to sort id ascending so that if we get new repos we don't change the ordering | 21:03 |
clarkb | but that shouldn't be as big of an issue because you'll get dups but still have complete data | 21:03 |
fungi | yeah, agreed, ascending id sort is more robust than descending if new repos are added while iterating (not that we expect that in production) | 21:07 |
clarkb | and thinking about it more what they really should do is provide a next url that has enough of a seed to reproduce the original list and index into it properly | 21:09 |
fungi | that probably requires caching some additional state, and then having to decide how long you keep it for | 21:09 |
clarkb | ya, I'm hoping this issue I've filed sparks a discussion on getting it right overall | 21:10 |
clarkb | for that reason I'm kinda leaning towards our solution being https://review.opendev.org/738109 for now | 21:11 |
clarkb | basically accept the pagination is flawed and work around it | 21:11 |
clarkb | rather than rely on another likely flawed pagination system | 21:11 |
clarkb | though I've just realized that will make creating an initial set of repos very slow. I guess we're about to find out how slow via testing | 21:15 |
clarkb | fungi: another thought just occurred to me. If we clear out new project additions from projects.yaml we could clear out the gitea management temporarily in order to get gerrit things updated | 21:21 |
clarkb | fungi: that may be worth doing if this prolongs itself due to the pain of dealing with it | 21:21 |
clarkb | oh also just had a thought. We could check for http 409 and ignore those errors | 21:22 |
clarkb | instead of doing the GET beforehand to see if the project exists, which will slow down the initial load | 21:22 |
fungi | oh, yep, lbyl is biting us basically, we could just eafp | 21:23 |
fungi | especially if the only effective error 409 represents is "already exists" | 21:24 |
clarkb | ya it's a conflict, which is basically "it's there, you can't do this" | 21:24 |
clarkb | aiui | 21:24 |
clarkb | ok really popping out for a bit now | 21:25 |
clarkb | I'll roll back in a bit later and see if anyone has a preference of the ~3 options that have been brainstormed | 21:25 |
clarkb | the gitea change I wrote doesn't seem to fix it. https://zuul.opendev.org/t/openstack/build/11bade7c1229425a916a04c505ada62e failed. I think that means we should ask permission or forgiveness. For permission https://review.opendev.org/#/c/738109/ | 21:45 |
clarkb | I'm rechecking though since a single data point is insufficient | 21:45 |
clarkb | maybe we recheck that continuously through the end of today and if it continues to work we go with it tomorrow? | 21:45 |
clarkb | then I can rebase the other stuff on top of that if we go that route | 21:45 |
ianw | clarkb: huh, the jury is still out on how to get any logging out of the module? TypeError: __init__() got an unexpected keyword argument 'log' | 22:23 |
ianw | i couldn't see anything, other than bunching stuff up to return in the json | 22:23 |
clarkb | ianw: ya that change is broken but I fixed the argument thing and it still didn't work | 22:23 |
clarkb | ianw: I ended up using the built in logging of the module which is really clunky but it worked | 22:23 |
clarkb | ianw: and traced it to bugs in gitea pagination so now I'm thinking something like https://review.opendev.org/#/c/738109/ is our best bet or similar to that but asking forgiveness instead of permission | 22:24 |
clarkb | I also filed a bug with gitea and pushed a PR that doesn't seem to be working | 22:24 |
clarkb | ianw: I think we recheck 738109 a few times then if people are happy enough with it we can try and land it and get manage projects running | 22:25 |
clarkb | manage projects has a fair bit of backlog now though so not sure you want to be on the hook for that overnight (can wait until tomorrow morning) | 22:25 |
clarkb | and I've held some nodes if people want to interact with gitea, though I've hacked up the gitea-git-repos role there so it may need to be restored to a known state if expecting that to make sense | 22:26 |
ianw | so it's not that get_org_repo_list may return duplicates; it's more that it may *also* not return all the projects? | 22:28 |
clarkb | yes | 22:29 |
clarkb | because pagination orders things by timestamp and collisions would not be sorted stably by mariadb | 22:29 |
fungi | basically, page offsets seem to start and end at the wrong place | 22:29 |
clarkb | but also in production that timestamp can update frequently so you'd lose things in the listing that way as well | 22:30 |
ianw | and we don't want to just probe for projects all the time and ditch the walk, because that makes it much much slower even in CI when we're starting fresh? | 22:30 |
clarkb | yes however the 109 change above will do it in CI because we start from nothing so check each repo in that case | 22:31 |
clarkb | and it doesn't seem to be too slow | 22:31 |
clarkb | but ya checking after listing is an optimization but we could always check and drop the listing too | 22:31 |
ianw | yeah, that's my only thought; simplify it by just checking the project directly -- reusing the session it seems like it should be as low overhead as possible | 22:32 |
clarkb | also orgs, teams and all that are paginated too and I would be amazed if they don't have similar issues | 22:33 |
clarkb | I expect we'll end up doing incremental improvements | 22:33 |
clarkb | starting with repos to make things work with the current dataset | 22:33 |
ianw | just as written also makes sense, and is more easily revertable when it's fixed, so no problem with that either really :) | 22:33 |
mordred | yeah - ultimately this is a fundamental issue in the api that would surely be good to sort out | 22:33 |
clarkb | mordred: ya I'm hoping my github issue helps gitea move towards the fixing | 22:34 |
clarkb | but ya I think maybe start with 109 once we are confident in it then continue to make incremental improvements from there | 22:34 |
*** tosky has quit IRC | 22:40 | |
*** mlavalle has quit IRC | 22:58 | |
clarkb | 109 has succeeded twice now. I've rechecked it again | 23:02 |
mordred | clarkb, ianw: I left a +2 on 109 but not a +A | 23:05 |
mordred | because - you know - it's EOD here | 23:05 |
mordred | ianw: oh - also - we tried rolling out ze* on docker but were missing krb5-user from the images. zuul is stopped on ze01 and they're all in emergency | 23:06 |
mordred | we'll try again tomorrow - but just an fyi | 23:06 |
ianw | cool, yeah i think my best bet is to not touch anything :) | 23:06 |
ianw | mordred: not sure if you saw but grafana came together well as a container in https://review.opendev.org/#/c/737406/ and https://review.opendev.org/#/c/737397/5 | 23:07 |
ianw | i'm going to have a look at graphite and see if it is as amenable ... that would be two more ticked off the xenial list | 23:08 |
mordred | ianw: that looks great! | 23:12 |
mordred | ianw: I +Ad the first, +2d the second | 23:12 |
ianw | thanks ... i found https://hub.docker.com/r/graphiteapp/docker-graphite-statsd/ yesterday and it looks very promising as pretty much a drop-in | 23:13 |
ianw | devil will be in the details | 23:13 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Make bindep installs non-interactive https://review.opendev.org/738121 | 23:19 |
mordred | ianw, corvus: ^^ if you got a sec | 23:20 |
corvus | ++ all around | 23:20 |
mordred | ianw: the devil is always in the details | 23:20 |
mordred | corvus: awesome - thanks | 23:20 |
*** DSpider has quit IRC | 23:22 | |
openstackgerrit | Merged opendev/system-config master: Add a grafana/grafyaml image https://review.opendev.org/737397 | 23:33 |
*** rchurch has quit IRC | 23:43 | |
*** rchurch has joined #opendev | 23:43 | |
*** ryohayakawa has joined #opendev | 23:56 | |
*** cloudnull has quit IRC | 23:57 | |
*** cloudnull has joined #opendev | 23:58 |
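
Editor's sketch of the ordering hypothesis discussed around 19:35 to 20:00: if a paginated listing is sorted only by a non-unique column such as `updated_unix`, the database is free to return tied rows in a different order on each request, so offset-based pages can repeat some repos and skip others. The Python below is a pure simulation under stated assumptions, not gitea's code: `REPOS`, `fetch_page`, and `list_all` are hypothetical names, and the "flip ties on even pages" rule is just one deterministic way to model an unstable sort. Adding the unique `id` as a tiebreaker gives a total order and a complete listing in this model. It does not model the second problem raised in the log, where timestamps change between requests in production.

```python
from typing import List, Tuple

# Hypothetical rows: (id, updated_unix). Four repos share one timestamp,
# which is what makes ORDER BY updated_unix alone ambiguous.
REPOS: List[Tuple[int, int]] = [(1, 100), (2, 100), (3, 100), (4, 100), (5, 90)]


def fetch_page(page: int, limit: int, tiebreak_on_id: bool) -> List[int]:
    """Simulate one paginated request (pages are 1-indexed)."""
    if tiebreak_on_id:
        # Models: ORDER BY updated_unix DESC, id DESC (a total order).
        ordered = sorted(REPOS, key=lambda r: (r[1], r[0]), reverse=True)
    else:
        # Models: ORDER BY updated_unix DESC only. Tied rows may come back
        # in any order; here we flip the tie order on even pages (assumption).
        flip = page % 2 == 0
        ordered = sorted(REPOS, key=lambda r: (-r[1], -r[0] if flip else r[0]))
    start = (page - 1) * limit
    return [repo_id for repo_id, _ in ordered[start:start + limit]]


def list_all(limit: int, tiebreak_on_id: bool) -> List[int]:
    """Walk pages until an empty page is returned, collecting repo ids."""
    ids: List[int] = []
    page = 1
    while True:
        chunk = fetch_page(page, limit, tiebreak_on_id)
        if not chunk:
            break
        ids.extend(chunk)
        page += 1
    return ids


if __name__ == "__main__":
    print("unstable:", list_all(2, tiebreak_on_id=False))
    print("stable:  ", list_all(2, tiebreak_on_id=True))
```

In this model the unstable walk returns five entries but only three distinct repos, which matches the symptom ianw summarizes at 22:28: duplicates and missing projects at the same time.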
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!
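
Editor's sketch of the "ask forgiveness" (EAFP) option raised at 21:22 to 21:24: instead of listing or probing repos first, attempt the create and treat HTTP 409 as "already exists". This is a minimal Python sketch with a fake in-memory server so it runs without a gitea instance. `make_fake_gitea` and `ensure_repo` are hypothetical names, the 201 success code is an assumption of the fake, and whether 409 can only mean "already exists" in gitea is the open question fungi raises in the log.

```python
from typing import Callable, Set


def make_fake_gitea(existing: Set[str]) -> Callable[[str], int]:
    """Return a fake 'create repo' call that yields an HTTP status code.

    Assumption for illustration: 201 means created, 409 means conflict.
    """
    def post(full_name: str) -> int:
        if full_name in existing:
            return 409
        existing.add(full_name)
        return 201
    return post


def ensure_repo(post: Callable[[str], int], full_name: str) -> str:
    """Try to create the repo; treat a 409 conflict as success (EAFP)."""
    status = post(full_name)
    if status == 201:
        return "created"
    if status == 409:
        # Only safe if 409 really is limited to "repo already exists".
        return "exists"
    raise RuntimeError(f"unexpected status {status} creating {full_name}")


if __name__ == "__main__":
    post = make_fake_gitea({"opendev/system-config"})
    for name in ("opendev/system-config", "opendev/new-project"):
        print(name, ensure_repo(post, name))
```

The trade-off compared with the double-check approach in 738109 is one request per repo instead of a GET followed by a POST, at the cost of relying on the meaning of a single status code.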