Tuesday, 2021-10-05

fungii hadn't looked at the backup errors yet00:05
fungiand we could probably upgrade its-base independent of gerrit00:05
fungithough also, storyboard task updating is still working as intended, even in the current state. i guess we just get errors too00:06
clarkbfungi: we can update its-base. We select the version to use in our build job definitions00:06
fungiright, i meant we could just do that00:10
fungiand it would probably be compatible00:10
*** dviroel is now known as dviroel|out00:15
opendevreviewIan Wienand proposed opendev/system-config master: [wip] export ptgbot web  https://review.opendev.org/c/opendev/system-config/+/81241902:13
opendevreviewIan Wienand proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/80479002:54
opendevreviewMerged opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/80479003:02
opendevreviewIan Wienand proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319003:07
opendevreviewIan Wienand proposed opendev/system-config master: [wip] export ptgbot web  https://review.opendev.org/c/opendev/system-config/+/81241903:07
opendevreviewIan Wienand proposed opendev/system-config master: [wip] export ptgbot web  https://review.opendev.org/c/opendev/system-config/+/81241903:48
opendevreviewIan Wienand proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479105:05
opendevreviewIan Wienand proposed opendev/system-config master: [wip] export ptgbot web  https://review.opendev.org/c/opendev/system-config/+/81241905:05
*** ykarel|away is now known as ykarel05:18
*** ysandeep|out is now known as ysandeep05:37
opendevreviewIan Wienand proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319006:08
opendevreviewIan Wienand proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479106:08
opendevreviewIan Wienand proposed opendev/system-config master: [wip] export ptgbot web  https://review.opendev.org/c/opendev/system-config/+/81241906:08
opendevreviewIan Wienand proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479106:10
opendevreviewIan Wienand proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319006:10
opendevreviewIan Wienand proposed opendev/system-config master: ptgbot: setup web interface  https://review.opendev.org/c/opendev/system-config/+/81241906:10
opendevreviewyatin proposed openstack/diskimage-builder master: Drop lower version requirement for networkx  https://review.opendev.org/c/openstack/diskimage-builder/+/81245307:26
*** jpena|off is now known as jpena07:33
ianwfungi: ^ i think that's ready to go, but the gate is unhappy with letsencrypt07:54
ianw"type": "urn:ietf:params:acme:error:serverInternal",07:54
ianw  "detail": "Error creating new order",07:54
ianw  "status": 50007:54
ianwis pretty much the error that's hitting lots of jobs that use LE07:55
*** ykarel is now known as ykarel|lunch08:18
*** ysandeep is now known as ysandeep|lunch09:00
*** ysandeep|lunch is now known as ysandeep09:48
*** ykarel|lunch is now known as ykarel10:24
*** jpena is now known as jpena|lunch11:24
*** dviroel|out is now known as dviroel12:07
opendevreviewdaniel.pawlik proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703112:13
*** jpena|lunch is now known as jpena12:22
*** ysandeep is now known as ysandeep|afk12:34
fungiianw: yeah, i saw one of those 500 errors yesterday from the test api, i guess it's gotten worse today12:38
*** artom_ is now known as artom13:18
*** ysandeep|afk is now known as ysandeep13:26
*** dpawlik2 is now known as dpawlik13:31
fungiianw: and it's still failing... i wonder if it's time we considered deploying pebble in our jobs: https://letsencrypt.org/docs/staging-environment/#continuous-integration-development-testing13:51
fungilooks like letsencrypt provides images and a docker-compose file13:53
*** artom_ is now known as artom13:55
fungiacme.sh readme mentions support for "Pebble strict Mode" as a cs14:00
Clark[m]In the past LE has been good about fixing the staging env. I agree if that doesn't happen then running our own seems reasonable. We could add it to bridge in the test jobs to avoid adding a node14:01
*** ykarel is now known as ykarel|away14:03
fungidocs indicate acme.sh will take full api urls in its --server parameter, not just the short aliases, so we could run pebble on a high-numbered port on the loopback and just point acme.sh at that14:06
clarkblooks like fungi already rechecked the change having LE trouble. I can sip tea and wait for my meeting then :)14:51
fungiyeah, would like to get ptgbot up and running some time today if le cooperates14:54
zigoclarkb: https://salsa.debian.org/python-team/packages/simplejson/-/commit/320ef98575debcbc056768e19e37fdc2a583b62314:57
zigoHopefully, p1otr will upload it soonish ...14:57
zigoThough IMO very little hope for having this corrected in already released suites ...14:58
clarkbzigo: yes, that is why when this first came up ~3 years ago we brought it up knowing it would really only get fixed in newer releases :)14:58
clarkbthank you for the update on simplejson14:58
clarkbI wish that the pypa crowd had done a better job with coordinating with packagers too. It seems that latest pip on addressing some of these issues is their recognition they needed to do that15:00
clarkbbut a bit late for those of us caught in the middle15:00
*** dtantsur_ is now known as dtantsur15:10
clarkbfungi: https://zuul.opendev.org/t/openstack/build/61c56144d5d648ac8aa8020995f7da3f/log/static01.opendev.org/acme.sh/acme.sh.log#557-561 I think we hit the issue again.  I do wonder if we need to run the static job on that update?15:30
clarkbThat host has a ton of certs and I bet our chances of success go up by not running it15:30
fungii think it's run because we have to update the handlers list every time we add a new site15:33
fungibut yeah maybe we could pick a less intensive job to exercise that15:34
clarkbah ya that is probably the reason15:34
clarkbfungi: I wonder if we can split up the handlers into multiple files then run only when we update the handler for the specific service15:35
clarkbinstead of handlers/main.yaml have a handlers/static.yaml and so on (but I'm not sure how to make ansible see all of those)15:35
fungimaybe it just sees any .yaml file in that directory?15:36
fungilike, automagically?15:36
clarkbbut if we can make that work maybe we can have a handlers/main.yaml and a handlers/eavesdrop.yaml to start15:36
clarkbfungi: maybe?15:36
clarkb"Handler names and listen topics live in a global namespace." <- that implies ya it may be that easy15:37
fungii can try splitting it out in that change once the call i'm on wraps up15:37
opendevreviewJeremy Stanley proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479116:00
opendevreviewJeremy Stanley proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319016:00
fungiclarkb: diablo_rojo__: ^16:00
fungiand zuul has correctly not added a system-config-run-static build this time16:03
fungialso one thing i wondered about, reading up on pebble, it apparently by default rejects some percentage of negotiated nonces in order to confirm that clients properly retry... i wonder if their staging api does the same and acme.sh is choking on that?16:04
clarkboh interesting16:12
clarkbcurrently we don't fail on the acme.sh run and the job continues until it tries to start apache which does fail16:12
clarkbbut maybe we should check the results properly and retry a few times16:12
clarkbfungi: I think the trick there is capturing acme.sh return codes in our fancy driver.sh16:13
clarkbI've got an update to the driver.sh and ansible to try retries in the works16:21
fungilooks like acme.sh does retry those: https://github.com/acmesh-official/acme.sh/blob/master/acme.sh#L2197-L220216:23
fungialso more generally, if you look for the word "retry" you'll find the script is littered with a variety of aggressive retries, so retrying in our driver may not make any difference16:24
clarkbah ok. I guess that makes sense since acme tries to do it all16:24
clarkband ya I agree we shouldn't double up the retries as we'll just send more traffic to an already potentially sad system16:25
yuriysclarkb: fungi: This week I am planning to deploy some placement/nova-scheduler config updates, as well as add a few beefy boys (hardware nodes) to your inmotion cloud. Will need to set to 0 workers for the process. Is there a preferred day/time and do you guys want to do a meets like last time (I'm thinking Fridayish)16:29
clarkbyuriys: The biggest thing is to avoid the openstack release which is happening between now and 1500UTC tomorrow (or about 23 hours from now)16:30
clarkbyuriys: I would say once that is done you can just go for it. Also it might be easiest to modify the quota of max instances to 0 for that work since it should be over quickly16:30
*** ysandeep is now known as ysandeep|dinner16:30
yuriysSounds good.16:30
*** jpena is now known as jpena|off16:32
fungiclarkb: looks like ~2 weeks ago, the system-config-run-base-ansible-devel job broke, and the error seems to be that ubuntu-bionic's default python3 is too old for it. any ideas how we should approach that? work on upgrading/replacing bridge.o.o? switch to using a non-default python3 on it? something else?16:35
fungiERROR: Package 'ansible-core' requires a different Python: 3.6.9 not in '>=3.8'16:35
clarkbfungi: maybe update that job to run on focal giving us info about whether or not we can upgrade bridge to focal and update ansible there. But not necessarily intend on doing that immediately16:36
fungiworth testing, yep16:36
fungii'll push that up now16:36
clarkbThe idea behind that job was to be forward looking and catch future issues. It has done that here and the fix is apparently to update to focal and then we can find the next issue :)16:36
clarkbIt is really interesting to me that ansible has decided to stop supporting rhel 8?16:40
clarkbor maybe they use some other python installation on that platform?16:40
opendevreviewJeremy Stanley proposed opendev/system-config master: Test ansible-devel with an ubuntu-focal bridge.o.o  https://review.opendev.org/c/opendev/system-config/+/81252716:42
clarkbfungi: looks like the eavesdrop job failed on that change but it didn't run acme.sh at all? or at least we didn't collect acme.sh logs16:42
fungiclarkb: i think you can install newer python on rhel 816:42
fungiERROR! the playbook: playbooks/roles/letsencrypt-create-certs/handlers/eavesdrop.yaml could not be found16:44
fungiyeah that's strange16:44
clarkbhttps://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html hrm handlers/main.yaml might be more special than we were hoping :/16:47
clarkbI think to make this work we would have to have main.yaml include all the handlers from the other files16:48
clarkbbut then we are largely back where we started so it doesn't help as much16:48
clarkbarg I guess our best option is to go back to the old setup. We could maybe drop static triggering on updates to that file temporarily but it is probably correct to keep it generally triggering?16:50
opendevreviewJeremy Stanley proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319016:54
opendevreviewJeremy Stanley proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479116:55
opendevreviewJeremy Stanley proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319016:55
fungiapparently if you roll back a parent change to a previous patchset, git-review only ends up pushing the child changes because the commit id for the rolled-back revision is already in gerrit even though it's non-current16:56
clarkbcrazy idea: we could stop relying on ansible handlers and instead have specific tasks in the service roles that check certificate file age and restart based on some accounting of that16:56
clarkbthat completely does an end around ansible's tool for handling this, but it seems like we constantly fight that tool ...16:57
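A rough sketch of the file-age idea floated above, not anything taken from system-config: compare the certificate's modification time against a marker file touched after each restart. The function name and the marker-file convention (`stamp_path`) are hypothetical and exist only for illustration.

```python
import os


def should_restart(cert_path, stamp_path):
    """Decide whether a service restart looks due.

    cert_path:  certificate file written by the letsencrypt tooling.
    stamp_path: hypothetical marker file touched after each restart.
    """
    if not os.path.exists(stamp_path):
        # No record of a previous restart, so err on the side of restarting.
        return True
    # Restart only if the certificate is newer than the last recorded restart.
    return os.path.getmtime(cert_path) > os.path.getmtime(stamp_path)
```

A service role could run a check like this in place of a handler notification, at the cost of keeping its own record of when it last restarted.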
fungiapparently airship's rtd builds are broken because of https://github.com/readthedocs/readthedocs.org/issues/8555 (letsencrypt root cert situation)16:59
*** ysandeep|dinner is now known as ysandeep17:33
fungithis time system-config-run-static succeeded but system-config-run-mirror-x86 broke on cert issuing18:04
fungiand system-config-run-review-3.2 as well, same root cause18:06
fungihttps://letsencrypt.status.io/ says everything's fine nothing to see here18:11
fungifailures are happening across iweb, rax, and ovh, so it's not provider-specific at least18:16
clarkbhttps://letsencrypt.status.io/pages/55957a99e800baa4470002da doesn't report issues18:18
clarkbacme-staging.api.letsencrypt.org is marked deprecated. Any chance we're using an old staging api that isn't getting the same level of attention?18:19
fungiit'll be whatever --staging gets routed to in acme.sh18:21
clarkbhttps://github.com/acmesh-official/acme.sh/blob/master/acme.sh#L26 seems to be the newer v2 api18:30
clarkbwoo I think I figured out my DAC issues. It is a usb power problem? I can get it to work if I plug in the aux usb power port. But only with the A to C on the aux and C to C on the data port18:39
clarkbI'm guessing linux has been doing updates to power control over usb18:40
*** ysandeep is now known as ysandeep|out19:20
clarkbianw: fungi https://etherpad.opendev.org/p/gerrit-3.3-upgrade-prep has notes on the tested gerrit 3.3 -> 3.2 revert process19:57
clarkbI need to eat lunch then I'll start working on the various things I need to review like changes and then work on drafting up that email announcing stuff20:01
opendevreviewIan Wienand proposed opendev/system-config master: [wip] letsencrypt : don't hit staging in the gate  https://review.opendev.org/c/opendev/system-config/+/81261020:37
clarkbianw: what time were you thinking for the gerrit upgrade monday? fungi pencilled in 20:00-22:00 UTC in the newsletter entry we are putting together. Is that too early for you? I think that is very early20:41
fungii was guessing20:44
fungihappy to move it later20:44
clarkbfungi: ianw: https://etherpad.opendev.org/p/J6WEMSZvaklcW_YCQujI how does that look for announcing things? I shifted the time an hour later as I expect 8am is much better than 7am?20:51
clarkbfungi: also does that time work for renaming? I listed it as 15:00-16:00UTC20:52
clarkbianw: feel free to update the etherpad with the time range you were considering20:52
clarkbfungi: oh I see you listed 1800-1900 UTC on the 15th for renames. That works better for me if it works better for you :)20:53
clarkbI'll update the announcement email to match 1800 - 1900 UTC20:53
*** dviroel is now known as dviroel|out20:55
fungiyeah, i figured something west-coast friendly for the renames would be better21:00
clarkbI do have to go do a school run in a few minutes. If ya'll happen to sort out timing for the 10th prior to me getting back feel free to send the email or I'll send it when I get back21:01
fungishould the announcement say to reach out to us about other renames?21:01
clarkboh good idea as we want to ensure we are aware of the requests21:01
clarkbfungi: how does that edit look21:01
clarkbianw: left a comment on https://review.opendev.org/c/opendev/system-config/+/81261021:14
ianwsorry, breakfast, back now :)21:15
clarkbno worries I have to pop out in ~3 minutes :)21:16
ianwannouncement looks good, thanks21:16
clarkbianw: those times are good then?21:16
clarkbI've got to be out the door in 3 minutes but can send the email when I get back if no one does it for me :)21:16
ianwyep, that time is fine, gives a bit more overlap 21:18
ianwclarkb: iirc the issue is it's like "-d domain.com -d alias.com" "-d domain2.com -d alias2.com" ... i think.  i know it's a weird quoting situation21:20
ianwi will make it clearer21:22
clarkbemail sent21:57
clarkbianw: your zuul playbook detector change had me very confused for a bit. I was wondering how did this change get such a low change number. Then it hit me. 2020 not 2021. sorry I haven't reviewed this sooner :/22:14
ianwnp :)22:18
ianwthe letsencrypt local only change is making what looks like a good list of TXT keys22:19
ianwbut somehow one of them is missing in the dns zone when it looks22:19
opendevreviewIan Wienand proposed opendev/system-config master: [wip] letsencrypt : don't hit staging in the gate  https://review.opendev.org/c/opendev/system-config/+/81261022:30
clarkbinfra-root I will abandon https://review.opendev.org/c/opendev/system-config/+/811749 since we haven't seen any need for this different less old android friendly chain22:44
clarkbfungi: maybe you can review https://review.opendev.org/c/opendev/system-config/+/809269 and https://review.opendev.org/c/opendev/system-config/+/809286 then tomorrow after openstack release things we can plan to land both. The giteas will automatically restart, but we'll have to plan a gerrit restart for that image22:46
opendevreviewIan Wienand proposed opendev/system-config master: [wip] letsencrypt : don't hit staging in the gate  https://review.opendev.org/c/opendev/system-config/+/81261022:51
clarkbianw: I left some notes on https://review.opendev.org/c/opendev/system-config/+/80767223:00
clarkbany idea what the inconsistency for wiki backups was? Doesn't seem to have persisted23:05
clarkbinfra-root I'm doing some spring cleaning in my change list and noticed https://review.opendev.org/c/opendev/system-config/+/791832 never got reviews. We don't make new instances often but addressing that might be a good thing23:11
clarkbalso it's a bit dark magic for me why that works23:11
ianwclarkb: thanks, will rework23:16
ianwi didn't get a chance to look at wiki23:16
clarkbweird, it seems happier now.23:17
ianwevery log there has 'rc 0'23:18
ianwohhh, actually that's from the weekly consistency checker23:19
clarkbthat would explain why it hasn't complained today23:20
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup-server/files/verify-borg-backups.sh this one23:20
ianwSun Oct  3 05:32:37 UTC 2021 Verifying /opt/backups/borg-wiki-update-test/backup ...23:21
ianwFailed to create/acquire the lock /opt/backups/borg-wiki-update-test/backup/lock.exclusive (timeout).23:21
clarkboooh a side effect of making the backup times more raendom?23:22
ianwthat's ... good.  at least it hasn't found corruption23:22
clarkbhrm no because wiki doesn't ansible so that changed times wouldnlt have affect it23:22
clarkbalso I can't type23:22
ianwit still could be a conflict, though23:23
ianwwas it running at 05:3223:24
ianwSun Oct  3 05:30:01 UTC 2021 Starting backup23:24
ianwthe answer would be yes23:24
ianwseriously, talk about murphy's law.  of all the backups, chance of these two overlapping ...23:25
ianwahh, borg has a "with-lock" command.  maybe we want that?23:27
ianwhrm, no, that's more if you want to run rsync or something on the underlying data23:28
clarkbhttps://borgbackup.readthedocs.io/en/stable/usage/lock.html#id3 ya it says you should use it carefully and only to break locks23:28
clarkboh wait I'm a derop23:29
ianwyeah that's "break-lock" (the page is a bit confusing)23:29
clarkbthey have two commands on the same change23:29
clarkbya I think with-lock is so that you can run things outside of borgs command set while holding the lock23:30
clarkbrsync as you mention for example23:30
ianwlooks like there is a "--lock-wait"23:30
clarkbputting that on the consistency checker seems reasonable23:31
clarkbthen it can wait until backups complete23:31
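A minimal sketch of what waiting on the lock could look like, assuming the consistency checker runs `borg check` against each repository (the log does not show the actual command). `--lock-wait SECONDS` is a borg common option. The function names and the 3600-second default are illustrative assumptions, not values from the discussion.

```python
import subprocess


def borg_check_command(repo_path, lock_wait_seconds=3600):
    """Build a `borg check` argv that waits for the repository lock.

    lock_wait_seconds is an assumed value; it should comfortably exceed
    the time a normal backup run holds the lock.
    """
    return ["borg", "check", "--lock-wait", str(lock_wait_seconds), repo_path]


def verify_repo(repo_path, lock_wait_seconds=3600):
    """Run the check and return borg's exit code (0 means success)."""
    result = subprocess.run(borg_check_command(repo_path, lock_wait_seconds))
    return result.returncode
```

With a wait in place, a verify that starts while a backup is still running should block until the lock is released, rather than failing with the lock timeout quoted earlier.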
opendevreviewIan Wienand proposed opendev/system-config master: borg-backup-server: wait for lock in verify  https://review.opendev.org/c/opendev/system-config/+/81262223:35
ianwok, still confused on this LE change23:38
ianwthe zone file has 6 entries23:38
ianwthe "dig -t txt" we do in testinfra sees 5 23:39
clarkband we don't update the zone file once for each record instead we do a single update with all the info iirc23:40
clarkbis it doing a round robin? if you request again you get a different set of 5?23:40
clarkbnot sure if TXT records and A records differ in their behavior there23:41
ianwooohhh, i might have caused a hash collision23:41
clarkboh yes I see it23:41
ianwi'm generating the TXT record as a sha256 of the hostname23:41
clarkbianw: what if you just do a /dev/urandom string sent through tr so that it is alphanum23:42
clarkbor filter out !alphanum23:42
ianwyeah, it's specified as the base64url encoding of a sha256 sum.  i'll just do some entropy on the input.  23:42
ianwi mean the TXT record is specified as ...23:43
clarkboh that is how the acme protocol generates them?23:43
clarkbI figured there was a bit more shared secret involved. But maybe they salt them23:44
ianwoh, it's a sha256 of magic including a jwt that's then used as a check; so yeah the protocol has entropy.  i was just trying to make an output that was similar23:45
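To illustrate the collision discussed above: hashing only the hostname is deterministic, so two requests for the same name yield the same TXT value and can collapse into a single record. The sketch below mixes random bytes into the input and emits an unpadded base64url SHA-256 digest, the same shape as a real dns-01 value. It is not the code from the change under review; the function name and the 16-byte salt are assumptions.

```python
import base64
import hashlib
import os


def fake_challenge_txt(hostname, salt=None):
    """Return a TXT value shaped like an ACME dns-01 challenge response.

    A random salt (hypothetical, 16 bytes) is mixed in so that repeated
    calls for the same hostname do not produce identical values.
    """
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + hostname.encode("utf-8")).digest()
    # base64url without '=' padding: 43 characters for a 32-byte digest.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Passing a fixed `salt` keeps the output reproducible, which may be convenient if a test needs to predict the record value.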
opendevreviewIan Wienand proposed opendev/system-config master: [wip] letsencrypt : don't hit staging in the gate  https://review.opendev.org/c/opendev/system-config/+/81261023:48
