*** gibi_pto is now known as gibi | 06:08 | |
*** iurygregory_ is now known as iurygregory | 06:31 | |
*** jpena|off is now known as jpena | 07:42 | |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespace https://review.opendev.org/c/openstack/project-config/+/804669 | 09:39 |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespace https://review.opendev.org/c/openstack/project-config/+/804669 | 10:01 |
*** sshnaidm|pto is now known as sshnaidm | 10:30 | |
*** sshnaidm is now known as sshnaidm|pto | 10:31 | |
*** jpena is now known as jpena|lunch | 11:16 | |
*** dviroel|out is now known as dviroel|ruck | 11:26 | |
*** diablo_rojo is now known as Guest4491 | 11:39 | |
*** jpena|lunch is now known as jpena | 12:16 | |
clarkb | yoctozepto: fwiw I cannot reproduce the behavior clicking on the cherry picks link when logged in on firefox | 15:34 |
clarkb | I wonder if it has to do with being the owner for the change | 15:34 |
*** ysandeep is now known as ysandeep|away | 15:40 | |
*** diablo_rojo__ is now known as diablo_rojo | 15:43 | |
yoctozepto | clarkb: ack, no problem; it is the first time I have had such an issue | 15:48
*** jpena is now known as jpena|off | 15:58 | |
*** marios is now known as marios|out | 16:01 | |
clarkb | our meeting agenda is surprisingly empty after taking a first pass at updating it this morning. I guess good news there is it means we've just put a bunch of work behind us :) | 16:28 |
clarkb | let me know if I'm missing anything obvious that should be on there though. | 16:28 |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot https://review.opendev.org/c/opendev/system-config/+/803190 | 16:49 |
clarkb | fungi: re the lists.kc.io snapshot, I'll try to boot that after lunch; that seems to be the most likely scheduling for it. Then upgrade it all the way through to focal, taking notes? | 16:53
clarkb | fungi: ^ are there any gotchas or things you think we should keep an eye out for on that? | 16:53 |
clarkb | one thing is the esm registration on that snapshot I guess | 16:54 |
clarkb | Maybe step zero is to disable that? | 16:54 |
clarkb | though we're under our quota for that so having a test server boot up with it isn't the end of the world I guess | 16:54 |
clarkb | though maybe it is safest to disable it to prevent the other server getting unregistered | 16:55 |
* clarkb is not really sure how that gets accounted | 16:55 | |
fungi | will disabling it disable the production server too? i guess we can check | 16:57 |
fungi | wondering if it loads a unique key onto the machine on registering | 16:57 |
clarkb | ya I have no idea | 17:00 |
clarkb | and ya I guess we can check it and resolve manually if necessary | 17:01 |
clarkb | fungi: do you think it would be safer to leave it as is on the new instance or disable esm on the new instance? | 17:01 |
fungi | i would disable, and then re-register the production server if necessary | 17:02 |
clarkb | ok | 17:05 |
opendevreview | Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790 | 17:08 |
clarkb | diablo_rojo: ^ note on that one. I need to page more of the plan back in to be sure of my comment on that, but wanted to point out the issue either way | 17:11
opendevreview | Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site https://review.opendev.org/c/opendev/system-config/+/804791 | 17:16 |
*** timburke__ is now known as timburke | 17:16 | |
opendevreview | Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790 | 17:18 |
opendevreview | Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790 | 17:19 |
diablo_rojo | clarkb, makes sense. Hopefully I fixed it correctly. | 17:22 |
diablo_rojo | I also figure the letsencrypt cert had to be setup first? and that this should be dependent on that? | 17:22 |
diablo_rojo | But I can remove that if it's wrong | 17:22
clarkb | in testing we use the staging LE servers and I'm not sure whether they properly verify against DNS or not | 17:24
diablo_rojo | Okay so your guess is only marginally better than mine lol. | 17:27 |
clarkb | diablo_rojo: what I'm not sure about looking at these changes is where the apache config is. I think you may need a "run the ptg site" change somewhere? | 17:31
clarkb | ya I think https://review.opendev.org/c/opendev/system-config/+/780942 was that but then the puppet got removed. | 17:32 |
clarkb | diablo_rojo: that means your letsencrypt change likely needs to also configure the apache config as well | 17:32 |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site https://review.opendev.org/c/opendev/system-config/+/804791 | 17:47 |
corvus | fungi, clarkb: i think the issue with the semaphore is that we didn't choose one CD strategy, we chose two, and they are not working well together | 17:48 |
clarkb | corvus: I'm not sure I completely agree with that. Having periodic catch ups seems like a reasonable safety net even if you want to do direct deploys. | 17:49 |
*** dviroel|ruck is now known as dviroel|out | 17:49 | |
clarkb | Yes their approaches are different, but I don't think that users should be forced into only one or the other | 17:49 |
corvus | if we can really deploy things when changes merge, that should be the primary strategy, and the periodic should be a backup. and it should run less often and be quicker so it doesn't interfere with the first. | 17:49 |
clarkb | corvus: fwiw I believe the reason we have an hourly deploy in addition to a daily is that some services like zuul and nodepool get image updates we want to apply more quickly than daily | 17:49
clarkb | we could address that by building our own zuul images when necessary similar to how we do for other services. But I think that the zuul produced images work well and redoing that effort seems wrong | 17:51
corvus | clarkb: well, in practice, we always manually pull zuul images when restarting anyway because that can't be relied upon. | 17:51 |
corvus | clarkb: i also think our semaphore is too coarse | 17:52 |
corvus | we should be able to run all those jobs at the same time | 17:52 |
clarkb | The other big thing in hourly is remote-puppet-else but I think we can configure that job to run whenever puppet-related files change in system-config rather than blindly doing an hourly update | 17:52
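A hedged sketch of what that could look like in the deploy project-pipeline; the job name is close to the real one but the file matchers are invented here for illustration, not copied from system-config:

```yaml
- project:
    deploy:
      jobs:
        - infra-prod-remote-puppet-else:
            # run on changes to puppet-related content instead of every hour
            files:
              - playbooks/remote_puppet_else.yaml
              - ^modules/.*
              - ^manifests/.*
```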
clarkb | It would also make a huge difference if ansible addressed their performance regressions around forking tasks. It is unfortunately quite slow now :( but mordred says upstream isn't interested in reverting or changing that behavior (I can appreciate that it is likely complicated and changes there could produce worse unexpected side effects) | 17:53
corvus | yep | 17:53 |
corvus | here's my thinking: zuul should provide tools to help people CD, but our case is not a good one to model -- we have conflicting requirements that just plain cannot be satisfied. we should resolve that before we try to ask for more complexity from zuul. | 17:55 |
clarkb | I worry that opendev's situation is more common than that assertion expects though and we're likely to produce similar problems for other CD users | 17:56 |
corvus | it's possible, but we know that our playbooks are not designed for this. we haven't even finished implementing the system we originally designed. | 17:56 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795 | 17:57 |
clarkb | I think ^ is an easy change we can make. | 17:57 |
clarkb | It won't solve everything but that job isn't fast and we run it hourly when we almost never actually need the updates encoded in it (and it runs in the deploy pipeline when we do need it) | 17:57 |
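For context, 804795 is essentially a one-job move between the two periodic project-pipelines. A rough sketch, with pipeline and job names that may not match system-config exactly:

```yaml
- project:
    opendev-prod-hourly:
      jobs:
        # infra-prod-run-cloud-launcher removed here (~20 minutes per run)
        - infra-prod-service-bridge      # other hourly jobs stay as-is
    opendev-prod-daily:
      jobs:
        - infra-prod-run-cloud-launcher  # daily catch-up; deploy still runs it on changes
```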
corvus | the fact that the periodic run takes >1hr is just not a good starting point -- honestly, if we're okay with that, we should drop deploy anyway and go back to "it will be deployed within 0-2 hours". | 17:59 |
corvus | if we want to make immediate deployment primary, then we need to get the reconciliation path out of the way | 17:59 |
clarkb | I don't think we're ok with it, but to fix it we either need to stop using ansible, run jobs in parallel, or run fewer jobs | 17:59 |
clarkb | run jobs in parallel was the original expectation iirc | 18:00 |
corvus | yes, i agree. i think that's the starting point though. | 18:00 |
clarkb | yup definitely improving hourly throughput would make a huge difference | 18:00 |
clarkb | 804795 above should make a good starting dent in that | 18:00 |
corvus | i'm not sure why all the other jobs can't run after base? is it because we don't want them to run at the same time as any deployment pipeline job, and the semaphore doesn't let us express that? | 18:01 |
clarkb | corvus: I think there is some ordering implied as well. Like nodepool should update before zuul (and the registry as well?) | 18:02 |
corvus | clarkb: maybe it's just a matter of making a new parent job for the periodic pipeline, have that one hold the semaphore, run base, then run everything else. then also have the deployment pipeline jobs individually hold the semaphore so they exclude each other and the entire periodic pipeline (which is now faster)? | 18:02 |
clarkb | Eavesdrop should be able to run whenever I expect | 18:03 |
corvus | we should still be able to accommodate that | 18:03
clarkb | but ya I don't think it is as easy as letting everything run in parallel there is some implied ordering in services. Gitea before gerrit (for replication), nodepool before zuul for image changes, and so on | 18:04 |
clarkb | I think puppet runs last because in the past we had a bunch of stuff doing puppet that wanted to be in the ordering. But now we may be able to run puppet whenever | 18:04 |
clarkb | If order doesn't matter for storyboard (I suspect it doesn't) then we can run puppet in any order based on my read of the site.pp | 18:05
corvus | actually, the deploy pipeline should probably hold the lock for the entire buildset too | 18:05 |
corvus | clarkb: basically like this: https://paste.opendev.org/show/808121/ | 18:06 |
clarkb | and lock-holding-job is paused? | 18:06 |
corvus | clarkb: yes | 18:06 |
corvus | zero-node job | 18:06 |
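The paste contents aren't captured in the log; a minimal sketch of the shape being described, with invented job, semaphore, and playbook names: a zero-node job takes the semaphore, pauses itself via zuul_return so it keeps holding the lock while its children run, and everything else in the pipeline hangs off it.

```yaml
- job:
    name: infra-prod-lock
    nodeset:
      nodes: []                       # zero-node job, runs on the executor only
    semaphore: infra-prod-deployment  # the buildset-wide lock
    run: playbooks/zuul/lock.yaml     # playbook issues zuul_return with zuul.pause: true

- project:
    opendev-prod-hourly:
      jobs:
        - infra-prod-lock
        - infra-prod-base:
            dependencies: [infra-prod-lock]
        - infra-prod-service-zuul:
            dependencies: [infra-prod-base]  # the remaining jobs fan out after base
```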
clarkb | ya I think at the very least that will help us express the dependencies properly which will allow us to optimize further | 18:07 |
clarkb | basically that might not end up being faster but it will help us understand better to then make things faster | 18:07 |
corvus | yeah, it should theoretically be faster ;) | 18:07 |
corvus | we should be able to use the same job tree in periodic and deploy | 18:07 |
clarkb | corvus: when you say periodic do you mean daily or hourly or both? | 18:08 |
clarkb | (we have two periodic pipelines currently and they have different jobs, see https://review.opendev.org/c/opendev/system-config/+/804795 for an example) | 18:08 |
corvus | hrm, i wasn't aware of the difference | 18:09 |
clarkb | basically hourly is there for things we want to update quickly because we may not have a good trigger for them | 18:10 |
corvus | i wonder why eavesdrop is in there? | 18:10 |
clarkb | like zuul and nodepool image updates | 18:10 |
clarkb | corvus: eavesdrop also consumes images from other repos (gerritbot for example) | 18:10 |
corvus | clarkb: but it's pinned | 18:10 |
corvus | it won't update without a corresponding system-config update | 18:11 |
clarkb | corvus: matrix gerritbot is but not irc gerritbot iirc | 18:11 |
corvus | where does the gerritbot image come from? | 18:11 |
clarkb | corvus: from the gerritbot repo | 18:11 |
clarkb | https://opendev.org/opendev/gerritbot/src/branch/master/Dockerfile | 18:12 |
corvus | would we be sad if it took 24h to update? | 18:12 |
corvus | anyway, parallelism should help there | 18:12 |
clarkb | If we are trying to fix a bug we can always manually pull | 18:12 |
corvus | i'm surprised the hourly takes >1 hour with those jobs | 18:13 |
clarkb | corvus: ~20 minutes is the cloud launcher which is why I have proposed moving it. But also ansible is really really slow :/ | 18:13 |
clarkb | A big part of the cloud launcher slowness is processing all of that native ansible task stuff to interact with the clouds | 18:13 |
clarkb | it would probably take a minute or two if written as a python script | 18:14 |
corvus | clarkb: looking at a recent buildset, if we parallelize that (after merging your cloud launcher move), we would have 4m for bridge + 8m for zuul (the longest playbook) | 18:14 |
corvus | so we should be able to get an hourly run down to 12m with this approach -- without doing any deeper optimization | 18:15
clarkb | corvus: but nodepool and zuul registry and zuul would need to run serially? I agree the puppet and the eavesdrop jobs can run in parallel | 18:15 |
corvus | clarkb: i'm not sure they do? | 18:15 |
clarkb | corvus: it's possible they don't. I thought that order was intentional for the zuul and nodepool services though, to ensure that labels show up in the right order or similar | 18:16
clarkb | but I guess that is all happening in zookeeper now and can be lazy? | 18:16 |
corvus | clarkb: (and incidentally, the cumulative runtime of all the current hourly jobs is 55m - so that assumption we've been working from is correct) | 18:16 |
corvus | clarkb: if we're talking about adding a nodepool label, i don't think we expect them to be immediately available anyway (image build/upload time, etc) | 18:17 |
corvus | clarkb: i think typically nodepool-provides-label and zuul-uses-label would be different changes anyway | 18:17 |
clarkb | Another tricky thing is that we use the same jobs in deploy and the hourly and periodic pipelines so we can't just convert the hourly pipeline without converting everything? | 18:17 |
clarkb | though maybe we can use a variant of the job to override semaphores in pipelines so we can do a bit at a time | 18:18 |
corvus | clarkb: by convert do you mean change the semaphore usage? i think the right thing to do is to apply this to all 3 pipelines. | 18:18 |
clarkb | corvus: yes. That is the right thing to do but I'm concerned that the scope of it is quite large and we can't minimize risk for out of order problems if we do all three at once | 18:19 |
corvus | all 3 should have a lock-holding-job to make the semaphore apply to the whole buildset so interruptions don't happen. then, if it's okay to parallelize in one, it should be okay to parallelize in all. | 18:19 |
clarkb | corvus: yes, except that periodic and deploy run many many more jobs than hourly. Which means we would have to sort out all of those ordering concerns at the same time (much more risk) | 18:19 |
clarkb | making the hourly deploy parallel is much smaller in scope as far as determining what the order graph is | 18:20 |
clarkb | But maybe we start by trying to do all 3 together and if it gets unwieldy then we can attempt something smaller in scope | 18:21 |
fungi | the eavesdrop deploy job was also handling meeting time changes at one point, right? but very recently that's been switched to use a more typical static site publication job? | 18:21 |
corvus | clarkb: my understanding is that the original intent was that each of these jobs (aside from base and bridge) should be independent, so i hope that there aren't many instances of us assuming the opposite. but you could stage this by using 2 semaphores. one as the new lock-holding-job for the buildset -- all 3 pipelines need to use it. then a second semaphore to make the jobs within a pipeline mutually exclusive. that will keep things slow | 18:22 |
corvus | (like today) until the second one is removed. | 18:22 |
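A rough illustration of that staging approach, with made-up semaphore and job names: the buildset lock is what all three pipelines share, while the second max-1 semaphore keeps jobs serial exactly as today and can simply be dropped once the dependency graph is trusted.

```yaml
- semaphore:
    name: infra-prod-buildset-lock   # held by the paused lock job in all 3 pipelines
    max: 1
- semaphore:
    name: infra-prod-serialize       # transitional: keeps jobs within a buildset serial
    max: 1

- job:
    name: infra-prod-service-nodepool   # repeated for every infra-prod job
    semaphore: infra-prod-serialize
```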
clarkb | fungi: yes, and I guess those meeting times were listed in a different repo too so would rely on hourly updates | 18:22 |
fungi | er, i guess it wasn't the eavesdrop deploy job before that, it was puppet | 18:22 |
clarkb | corvus: I'm pretty sure we still have ordering between the jobs. I don't know that we sufficiently untangled that yet | 18:23
clarkb | things like gitea-lb running before gitea | 18:24 |
clarkb | (maybe that should be one job?) | 18:24 |
fungi | but it would be good to at least identify and codify specific cases like that using dependencies | 18:24 |
fungi | or making it one job | 18:24 |
clarkb | nameserver before letsencrypt (though we don't encode that order today) | 18:25 |
clarkb | because we need the nameserver job to create the zones before letsencrypt attempts to add records to them | 18:25 |
fungi | is there a specific letsencrypt job though? | 18:25 |
clarkb | fungi: yes | 18:25 |
corvus | clarkb: yeah, gitea sounds like maybe that should be one playbook | 18:25 |
fungi | ahh, okay, i'm likely thinking of the individual handlers in cert management of various services | 18:26 |
clarkb | letsencrypt before all the webservers | 18:26 |
clarkb | They definitely exist, and once we've bootstrapped things sufficiently the order tends to matter less | 18:26
fungi | though also the nameserver/letsencrypt ordering is primarily a zone bootstrapping problem right? if we're not adding a new domain it doesn't matter? | 18:26 |
clarkb | but if you bootstrap from scratch the other is important and not encoded beyond the order of the jobs and running serially | 18:27 |
fungi | oh, as far as making sure cname records are deployed | 18:27 |
fungi | okay, i've got it | 18:27 |
fungi | so any time we're bootstrapping a new service really | 18:27 |
clarkb | fungi: no I'm talking about the server bootstrapping in this case not the CNAME addition (though that order also matters) | 18:27
clarkb | basically if we went fully parallel we couldn't safely deploy a new name server and attempt to update LE certs | 18:28 |
clarkb | 99% of that time that doesn't matter, until it does | 18:28 |
clarkb | similarly with the various webservers that all need LE certs. We rely on the le job running early before they happen to properly bootstrap new webservers | 18:29 |
clarkb | Currently that is only encoded in the pipeline def order with the assumption things will run serially | 18:29 |
clarkb | All of this is fixable, but it isn't as simple as making it parallel | 18:29 |
fungi | we wouldn't add the nameserver into production (list it through the domain registrars) until it was serving the right records though, yeah? | 18:29 |
fungi | that's a manual step | 18:29 |
fungi | so letsencrypt shouldn't try to use it for lookups anyway | 18:30 |
corvus | that was an unfortunate oversight; the deploy pipeline is supposed to have explicit dependencies (after all, it does a lot of file matchers) | 18:30 |
corvus | several jobs have an explicit dep on letsencrypt | 18:30 |
corvus | (codesearch, etherpad, grafana) | 18:30 |
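As an illustration of those explicit dependency annotations (job names approximate, not copied from the actual deploy pipeline definition):

```yaml
- project:
    deploy:
      jobs:
        - infra-prod-service-nameserver
        - infra-prod-letsencrypt:
            dependencies:
              - infra-prod-service-nameserver  # zones must exist before TXT records are added
        - infra-prod-service-etherpad:
            dependencies:
              - infra-prod-letsencrypt         # webserver needs its cert handled first
```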
clarkb | fungi: the issue isn't on the resolution side it is on the ansible side. We run the playbooks on both but it will fail to add records to a zone which doesn't exist if you get the order wrong | 18:30 |
clarkb | corvus: ah yup looks like some do have the right annotations but not all | 18:31 |
corvus | (wow codesearch is listed in the pipeline twice :/ ) | 18:31 |
clarkb | fungi: basically the ansible will fail and then we won't get new certs for anything. It's not that the dns lookup will fail; we never get to that point | 18:31
fungi | clarkb: we make the zone exist by adding records to it though, right? | 18:31 |
fungi | we're just installing zonefiles into the fs i thought, not using an api like dynamic dns updates or something? | 18:32 |
corvus | if it's a case we want to handle, letsencrypt depending on nameserver is reasonable (though right now, nameserver runs after letsencrypt) | 18:33 |
corvus | sorry i have to run | 18:34 |
fungi | looks like we inject lines into existing files like /var/lib/bind/zones/acme.opendev.org/zone.db so maybe the problem is that we're not writing out the whole file (because we want to avoid invalidating other entries in it which might still be in use) | 18:34 |
clarkb | fungi: sort of. the acme stuff is dynamic but I'm not sure what triggers it yet looking deeper | 18:34 |
clarkb | fungi: but https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/master-nameserver/tasks/main.yaml is part of service-nameserver and that installs bind which ensures there are dirs to write to | 18:35 |
fungi | if that's the actual place it's breaking down, we could just make sure to always create the file before writing to it | 18:35 |
clarkb | fungi: but then you don't get file perms right because bind isn't even installed yet | 18:35
clarkb | and then https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/letsencrypt-install-txt-record/tasks/main.yaml runs in the letsencrypt job | 18:36 |
clarkb | there is an implicit dependency between them today that the service-nameserver playbook and job run before we try to set up any LE certs | 18:36 |
clarkb | it hasn't been a problem because we haven't tried to replace any nameservers recently | 18:37
clarkb | and if we are careful when replacing the server we don't need to encode the dependency, but it does exist | 18:37 |
fungi | so in theory we could just include the nameserver setup role there before trying to create files | 18:37 |
fungi | which ideally should be a quick no-op if the server is already configured | 18:37 |
clarkb | except ansible is slow so not quite, but yes that would be an option | 18:38 |
fungi | "quick" in ansible terms then ;) | 18:38 |
clarkb | that becomes similar to the gitea-lb vs gitea job though | 18:38 |
clarkb | we would basically collapse into a single job to encode the dependency and delete the other job | 18:38 |
clarkb | happy to do that, but we still need to make changes like that before parallelizing things is safe | 18:39 |
clarkb | I'm beginning to think we have a couple of first tasks we need to do here. We can do some trivial updates like my cloud launcher change to speed things up immediately. But we should also do a large scale graphing exercise of dependencies in a human readable format if possible. | 18:40 |
clarkb | Then once we've got that graph we can either encode deps as zuul job deps or collapse jobs together as it makes sense | 18:40 |
clarkb | then we can switch to being parallel | 18:40 |
fungi | i concur | 18:41 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795 | 18:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove extra service-codesearch job in deploy https://review.opendev.org/c/opendev/system-config/+/804796 | 18:43 |
clarkb | There are a couple of easy cleanups based on some of what we discovered during this discussion | 18:43 |
clarkb | I'll add this as a meeting agenda item as I think it deserves discussion around my next steps above and then actually tackling them | 18:43 |
fungi | thanks! | 18:46 |
clarkb | ok the wiki should have my condensation of all that ^ in it now. Feel free to add additional notes or edits | 18:49 |
clarkb | I'm going to grab lunch now. But plan to work on the kata lists test server next | 18:51 |
clarkb | fungi: re disabling esm I'll have to do that after the snapshot boots unless we want to disable it on prod, snapshot again, then reenable on prod. I suspect the safest thing is after boot since the daily package updates won't run for a while anyway | 18:52 |
fungi | yeah, i assumed after boot | 18:53 |
fungi | at one point we also had problems with too many ansible runs at the same time causing oom events on bridge.o.o, right? | 18:57 |
fungi | thinking back to the parallel deploy jobs thing | 18:57 |
clarkb | fungi: yes, but that is tied to connectivity errors piling up ansible processes | 18:57 |
clarkb | I agree that is a risk though. We might need a semaphore with say only 5 jobs allowed to run at once | 18:57 |
clarkb | to temper that | 18:58 |
fungi | and i guess if we did run into that... yeah | 18:58 |
fungi | exactly what i was thinking, since semaphores can be limited to more than 1 | 18:58 |
clarkb | the risk there is we would want the semaphores to all live within one pipeline I think | 18:58 |
clarkb | exclusive to a pipeline but >1 job in a pipeline | 18:58
corvus | If we have a builder semaphore one secondary semaphore is ok | 19:00 |
corvus | The other pipes can't get it | 19:00 |
corvus | * If we have a buildset semaphore one secondary semaphore is ok | 19:00 |
fungi | yeah, coupling those sounds reasonable | 19:01 |
corvus | One semaphore for pipeline exclusion (max 1). One semaphore for job exclusion (max N). | 19:02 |
clarkb | aha | 19:02 |
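Put as configuration, a hedged sketch with invented names: the max-1 semaphore is what the paused lock job holds for buildset/pipeline exclusion, and the counting semaphore caps how many playbooks run against bridge at once.

```yaml
- semaphore:
    name: infra-prod-pipeline-lock
    max: 1      # pipeline exclusion: only one buildset's lock job at a time
- semaphore:
    name: infra-prod-job-slots
    max: 5      # job exclusion: at most 5 concurrent deploy playbooks, to avoid OOM on bridge

- job:
    name: infra-prod-service-gitea   # representative deploy job
    semaphore: infra-prod-job-slots
```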
opendevreview | Merged opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795 | 19:22 |
clarkb | fungi: can you check my comment on line 195 of https://etherpad.opendev.org/p/listserv-inplace-upgrade-testing-2021 before I boot the test instance? I assume there is some way to do what I describe there but I'm not sure I know what it is | 19:51
clarkb | fungi: also do you know what you called the snapshot? I'm not able to list the snapshot currently but am trying to figure that out | 19:59 |
clarkb | --os-volume-api 1 volume list using that works to get the volumes | 20:01 |
clarkb | but not snapshots | 20:01 |
clarkb | aha because it isn't a volume snapshot, it is a server image | 20:03
clarkb | I need image list. Found it | 20:03 |
clarkb | fungi: ok I think I'm ready to boot off the snapshotted image if you can take a look at line 195 on the etherpad above | 20:05 |
fungi | clarkb: yep, sorry, was out taming the yard, looking at the pad now | 20:18 |
fungi | and sorry i wasn't clear earlier, it's definitely a server image not a volume snapshot (the latter would assume cinder and bfv i think?) | 20:18 |
fungi | clarkb: so... package maintainer scripts are supposed to obey disabled state for services | 20:22 |
fungi | it's maybe possible they'll break, but i think the restarts are supposed to be a successful no-op when services are disabled, per debian policy (and these packages are generally inherited from debian anyway) | 20:23 |
clarkb | fungi: testing with the zuul test node servers indicated this wasn't the case. I suspect because with an upgrade it is different? | 20:24 |
clarkb | fungi: the services were definitely running after the upgrade | 20:24 |
clarkb | fungi: the statement there about it starting services is based on previous experimental data | 20:24
fungi | mmm, how were those services disabled? | 20:25 |
fungi | systemctl disable or some other way? | 20:25 |
clarkb | ya using systemctl disable | 20:25 |
*** artom_ is now known as artom | 20:26 | |
clarkb | looking at my notes it seems mailman started after the first upgrade but maybe not exim? then maybe exim started after the second one. I wish I had better specifics about that now | 20:27 |
clarkb | fungi: after we boot the snapshot can we check for anything spooled up and clear that out if it exists? | 20:27
clarkb | if so I can boot the snapshot now and we can take a look and see if this is even a concern | 20:27 |
fungi | yeah, worth double-checking them | 20:29 |
clarkb | ok should I boot the server now then? | 20:29 |
fungi | can always clear out the dirs under /var/spool/exim | 20:29 |
fungi | yeah, i'm on as long a break from yardwork as we need, go ahead and i'll check it | 20:30 |
clarkb | ok proceeding | 20:30 |
fungi | lmk the ip address it gives you | 20:30 |
fungi | and then i'll start a root screen session on it | 20:31 |
clarkb | fungi: did you want to run through the upgrades with me too? Not sure how interested in that bit you are. My plan was to record the steps like I did with the zuul test servers on that same etherpad | 20:31 |
fungi | i can, sure | 20:32 |
fungi | server is responding to ping but ssh doesn't work yet | 20:37 |
clarkb | ya | 20:37 |
fungi | though the server has a static ip configuration in /etc/network/interfaces... i wonder if that's getting properly reset | 20:39 |
clarkb | oh it does? | 20:40 |
clarkb | hrm I want to say we've run into this before and had to do a rescue instance? ugh this gets really annoying really fast | 20:41 |
clarkb | ya I suspect that may be the issue | 20:41 |
clarkb | because we uninstall cloud init | 20:41 |
clarkb | :( | 20:41 |
clarkb | maybe we should stop doing that | 20:42 |
fungi | well, to be fair, we almost never do server images | 20:42 |
fungi | (or in-place upgrades) | 20:42 |
clarkb | ya but when we do it is because the other options are bad | 20:42 |
clarkb | I'm not really sure how the whole rescue instance thing works. Is that something I can do via osc? | 20:43 |
fungi | i've only ever tried it through the dashboard | 20:43 |
fungi | but in essence it boots a replacement vm from some standard image and then attaches the server's volume as another block device | 20:44 |
clarkb | and from that we can mount and edit /etc/network/interfaces | 20:44 |
clarkb | looks like there is an openstack server rescue command | 20:44 |
clarkb | I'll try that | 20:45 |
clarkb | hrm I didn't set a key name on the original boot so I'm not sure there will be keys set on the rescue vm | 20:46 |
clarkb | why isn't that part of what the rescue api takes | 20:46 |
clarkb | fungi: I suspect that I may need to unrescue then delete my test instance | 20:47 |
clarkb | then start over and set key name appropriately | 20:47 |
clarkb | yup I get an ssh connection now but it wants to fallback to password | 20:47 |
clarkb | I'll unrescue, delete the instance, boot again with a key set then rescue again | 20:47 |
fungi | normally when you do a rescue through the dashboard it tells you the root password | 20:48 |
clarkb | huh it did not do that here | 20:48 |
clarkb | though it looks like I can set a rescue password | 20:48 |
clarkb | I guess I can try that before deleting and starting over | 20:48 |
clarkb | yup that seems to have worked. The good news is the rescue image has an /etc/network/interfaces that I can refer to as well | 20:55
fungi | pita but at least it's a way forward to get the thing usable. i suppose you could chroot into it and disable esm too if you wanted | 20:56 |
clarkb | but now I'm confused because it seems the /etc/network/interfaces on the other device is the same as the one on this device | 20:58 |
clarkb | they are clearly different devices if I look at /etc/hostname and /mnt/etc/hostname or /etc/hosts etc | 20:59 |
fungi | so maybe on boot the rax-agent or whatever it is does config file injection then? | 20:59 |
fungi | in which case there could be other reasons for sshd not accepting connections, i guess | 20:59 |
fungi | maybe it was taking a long time gathering enough entropy to generate new host keys? | 20:59 |
clarkb | I see a .bak file with the old content I expected. I half suspect that it just didn't restart networking and that if I unrescue at this point maybe it will work? | 21:00 |
clarkb | from 20:34 today | 21:00 |
clarkb | I'll try that I guess. I haven't changed anything via the rescue so if that works then hax | 21:00 |
fungi | oh yeah maybe | 21:00 |
fungi | does it have host keys? | 21:00 |
clarkb | oh I've already dropped out I didn't check | 21:01 |
clarkb | I guess thats the next thing to check if it continues to fail | 21:01 |
fungi | no worries, something to check if it happens again, yeah | 21:01 |
fungi | it's pinging again | 21:03 |
clarkb | but still no ssh. I'm finding this very confusing. Also doesn't debian's ssh init generate ssh host keys? | 21:04 |
clarkb | what if sshd is not starting at all for some reason? | 21:04 |
fungi | yes | 21:05 |
fungi | which needs entropy, which is hard to come by on a vm | 21:05 |
fungi | and looks like haveged is not installed on it | 21:05 |
clarkb | ah I see what you mean earlier it could just be waiting on that? | 21:05 |
fungi | also quite likely no compatible host entropy kernel module | 21:05 |
fungi | yeah | 21:05 |
clarkb | do you think it is worth waiting to see if this changes or should I rescue it again? | 21:06
fungi | it likely says on the console if it's generating keys | 21:06 |
clarkb | ya but for the console i have to login to the dashboard :P I was hoping to avoid that, but maybe I should | 21:06 |
clarkb | *shouldn't | 21:06 |
fungi | openstack console url show <uuid> | 21:07 |
fungi | it seems to want the root password for maintenance | 21:08 |
clarkb | does that work with rax? I know just dumping the console doesn't. I'm logged in | 21:08
fungi | maybe fsck of the rootfs failed? | 21:08 |
fungi | yeah, console url show works with rac | 21:08 |
fungi | rax | 21:08 |
clarkb | til | 21:09 |
fungi | i wouldn't be surprised if there are fs inconsistencies since it was imaged while mounted and likely writing | 21:09 |
clarkb | I agree it wants a root login. Should I delete this instance and try again? Then check the console of the new instance? maybe the snapshot isn't so happy? | 21:09 |
clarkb | if it fails again we do over? | 21:09 |
fungi | i would rescue boot and fsck the block device while unmounted | 21:09 |
clarkb | ok I'll try that | 21:09 |
fungi | i suspect imaging any running server could result in this situation | 21:10 |
fungi | i've had more luck imaging servers while they're shutdown | 21:10 |
fungi | more frightening though if rackspace did file injection on a filesystem which wasn't clean | 21:11 |
fungi | the power of cloud | 21:11 |
clarkb | fungi: just `fsck /dev/xvdb` ? | 21:13 |
fungi | sure, though you might have to hit 'y' a lot | 21:13 |
fungi | you could add -y | 21:13 |
fungi | it's unlikely you'd tell it to do anything other than try to repair anyway | 21:13 |
clarkb | my manpage doesn't show -y as a valid option | 21:14 |
fungi | i may be confusing with bsd ;) | 21:14 |
clarkb | actually it probably wants xvdb1 since that is the partition with an fs on it | 21:14
fungi | my fsck manpage has -y documented | 21:14 |
clarkb | huh mine does not | 21:14 |
clarkb | maybe I need to look at fsck.ext3 | 21:15
fungi | the fsck manpage on lists.k.i also has it | 21:15 |
tosky | maybe it's specific to the fsck.foo you use | 21:15 |
clarkb | ya has to be a specific fsck | 21:15 |
clarkb | just fsck doesn't document it | 21:15 |
fungi | it is specific to the fsck.foo, but the general fsck manpage also says that about it | 21:15 |
clarkb | neat mine doesn't on suse nor does the debian rescue image I'm on | 21:16 |
clarkb | I'm running that fsck now | 21:16 |
fungi | "-y r some filesystem-specific checkers, the -y option will cause the fs-specific fsck to always attempt to fix..." | 21:16 |
clarkb | that was quick it is done | 21:16 |
fungi | did it say it repaired anything? | 21:16 |
clarkb | it updated free inode and block counts | 21:16 |
tosky | both Debian 11 (ok, testing) and Fedora 34 don't document it (same manpage apparently, last update February 2009, from util-linux) | 21:16 |
clarkb | cloudimg-rootfs: clean, 303705/5120000 files, 1247631/10485499 blocks | 21:16 |
clarkb | doesn't seem to have done much | 21:17 |
clarkb | I'll mount it now and check for ssh host keys | 21:17 |
clarkb | it has its preexisting keys from the snapshot | 21:17 |
clarkb | fungi: ^ anything else you want me to try before giving it another reboot? | 21:18 |
fungi | tosky: interestingly, the ubuntu 16.04 fsck manpage i quoted from above claims to be from "util-linux February 2009" | 21:18 |
fungi | maybe they patched it | 21:18 |
fungi | clarkb: nah, i strongly suspect it was fsck failing at boot which caused the behavior we saw | 21:18 |
fungi | give it another try now | 21:18 |
clarkb | ok | 21:18 |
fungi | tosky: i agree my newer debian machines don't document -y in the general fsck manpage but do in the manpage for, e.g., fsck.ext2 and so on | 21:20 |
tosky | weird | 21:21 |
clarkb | the console shows the boot splash. I wish bootsplashes went away on server images | 21:21 |
clarkb | I suspect this may fail as it is taking a long time again | 21:22 |
fungi | tosky: i guess they decided to axe a number of entries in the general manpage which just said "this normally does x but behavior depends on the backend chosen" | 21:22 |
clarkb | yup it is in emergency mode again | 21:22 |
clarkb | but the boot splash prevented us from seeing why | 21:22 |
fungi | did it say why? | 21:22 |
clarkb | no because boot splash | 21:22 |
fungi | ugh | 21:22 |
clarkb | I guess if I rescue it again there might be something in the kernel or syslog? | 21:23 |
fungi | maybe, and worst case we can disable the bootsplash in the grub.config or whatever | 21:23 |
clarkb | fungi: ya but doesn't grub require complicated reconfiguration when you do that now? | 21:23
clarkb | I wonder if that will work using debian tools against the ubuntu image | 21:23 |
fungi | not if you're editing the config in /boot | 21:23 |
clarkb | ah ok let me rescue it again | 21:24 |
fungi | there's a fancy run-parts tree in /etc which you can use to build a grub.cfg file, but that's all it really is. you can edit the built config directly as long as you don't care that rerunning the config builder will undo your changes in it | 21:25 |
clarkb | or that if you get it wrong grub may fail? which isn't much worse than what we're doing now | 21:25 |
fungi | should just be able to edit /boot/grub/grub.cfg and take out the "splash quiet" from the kernel command line | 21:26 |
fungi | and yes, splashscreens on virtual machines (or servers in general) are beyond silly | 21:28 |
clarkb | doing that now | 21:28 |
clarkb | ok that is done. going to quickly check kern.log and syslog type logs | 21:29 |
clarkb | those don't have anything from today in them implying we aren't getting that far | 21:30 |
fungi | right, if they don't get far enough to mount the rootfs, that's what i'd expect | 21:30 |
clarkb | fungi: I think it might possibly be the swap device in fstab | 21:31 |
clarkb | I'm going to comment those out | 21:31 |
fungi | oh, quite likely | 21:31 |
fungi | in fact, yes, i bet we set swap to a partition on the ephemeral disk which on the new server isn't partitioned/formatted | 21:32 |
fungi | good call | 21:32 |
clarkb | yup | 21:32 |
clarkb | it is unrescuing now | 21:32 |
fungi | of course, if it weren't for the splashscreen, we'd have known that an hour ago ;) | 21:33 |
clarkb | indeed | 21:33 |
clarkb | alright it is up finally | 21:34 |
fungi | and i can ssh in | 21:34 |
fungi | i have a root screen session established on it | 21:34 |
fungi | exim4 and mailman did not start on boot, that's good | 21:35 |
clarkb | yup I didn't expect them to start booting from the snapshot, just after the upgrade(s) | 21:35 |
clarkb | fungi: looks like exim may have some stuff spooled | 21:35 |
fungi | agreed, just getting a baseline so we know later | 21:35 |
fungi | okay, cleared out the exim4 spooldirs | 21:37 |
fungi | that way if it does start, it shouldn't try to deliver any duplicate messages | 21:37 |
clarkb | thanks. you just rm'd the dir contents for input etc? | 21:38 |
clarkb | I'm getting my notes together on the etherpad then will proceed with upgrade things | 21:38 |
fungi | i did, yes | 21:39 |
fungi | /var/spool/exim4/*/* basically | 21:39 |
clarkb | alright the next step on my etherpad is to unenroll from esm. I'll do that | 21:40 |
fungi | and then we want to check esm status on the production server | 21:40 |
clarkb | yup I just did that and it says it is still enrolled. I'll make a note to check it again tomorrow in case it takes time for that accounting to happen | 21:41
clarkb | as expected upgrading is a noop | 21:41 |
clarkb | my notes say I should reboot, but I think I can skip that because no package updates occurred | 21:42
fungi | yes | 21:42 |
fungi | that's fine | 21:42 |
fungi | effectively we did *just* "reboot" it anyway | 21:42 |
clarkb | yup exactly. | 21:42 |
clarkb | fungi: the next step is actually doing an upgrade. I'm going to take a short break here as I need some water. But will get back to it and start a root screen and update my notes on the etherpad as I go through answering questions if you want to follow along | 21:43 |
fungi | are you using the root screen session i started (and if not, do you want to?) | 21:43 |
clarkb | oh not yet sorry I didn't realize there was one but you said you would start one | 21:43 |
clarkb | I'll use that one going forward | 21:43 |
fungi | and yeah, i'll do a quick round with the leaf blower, brb | 21:43 |
clarkb | I'll start; the beginning of this isn't too interesting | 21:48
clarkb | fungi: you ready for me to accept the new packages? This is where it gets fun and you have to sort out keeping old files or accepting new ones. Though I have a bit of a cheatsheet from earlier testing | 21:53
* clarkb goes for it. we can always redo testing again if necessary | 21:55 | |
fungi | i'm back, sorry | 21:56 |
clarkb | fungi: if you look at the etherpad we are at line 216 | 21:56 |
fungi | and yeah, the choices are less interesting than the results | 21:56 |
clarkb | fungi: so this step is one that wasn't curses before it was just a prompt but also one I had questions about | 21:58 |
clarkb | currently we only support a subset of languages in our lists, but I figure the safest thing is to select all of them? | 21:59
clarkb | fungi: do you think it is worthwhile to only do the subset? | 21:59 |
fungi | yeah, just take the default there. it won't really chew up that much additional space nor time generating locales for things | 21:59 |
clarkb | well the default was none iirc | 21:59 |
fungi | not even english? interesting | 21:59 |
clarkb | and ya it is relatively tiny and not that much time so I figured lets just pick all of them | 21:59 |
clarkb | I'll proceed with all selected | 22:00 |
fungi | that may turn out to be important for multilingual support in mailman anyway | 22:00
clarkb | I'm going to select no for saving the iptables rules because I added the 1022 rule that we want to go away later | 22:01 |
fungi | oh, what was rule 1022 for? | 22:01 |
fungi | i see, temp ssh server | 22:02 |
clarkb | yup | 22:02 |
fungi | as long as you're okay with that vanishing on reboot | 22:02 |
clarkb | fungi: it was going to anyway as the upgrade doesn't restart that sshd aiui | 22:04 |
fungi | and yeah, the logind.conf change looks non-impacting for us | 22:05 |
fungi | is the current plan just to upgrade from xenial to bionic, or are we going all the way through to focal? | 22:08 |
clarkb | I think we can go all the way to focal | 22:08 |
clarkb | also I just added a question to the etherpad. Do we need to uninstall puppet first or can we do that after. | 22:08 |
clarkb | We'll test if uninstalling it after works on this run I guess | 22:08 |
fungi | if we're not puppeting anything on those servers any longer, i'd uninstall puppet first before doing any upgrading | 22:09 |
clarkb | that works too | 22:09 |
clarkb | and yes we are not puppeting anymore with the move to ansible for lists | 22:09
fungi | fewer unknowns that way | 22:09 |
clarkb | wfm | 22:09 |
fungi | otherwise there's every chance the puppet packages will conflict with distro packages somewhere | 22:10 |
fungi | like by insisting on a capped version of some dependency, or something | 22:10 |
clarkb | fungi: the current popup is unexpected as we should have that list, shouldn't we? | 22:10
fungi | yes, we should probably investigate it | 22:10 |
clarkb | there is a /var/lib/mailman/lists/mailman list | 22:11 |
clarkb | I wonder if this is just dpkg being extra verbose | 22:11 |
clarkb | fungi: I'm going to select 'OK' to continue if you think that is good considering /var/lib/mailman/lists/mailman exists | 22:12 |
fungi | yeah, i'm not sure why it's not finding that, but maybe we have a path getting overridden somewhere | 22:13 |
clarkb | did you want to check anything else before I hit ok? | 22:13 |
fungi | no, go ahead and okay it | 22:13 |
fungi | we technically don't rely on the mailman list anyway, as we disable password reminders | 22:14 |
fungi | we'll want to keep the modified templates i think? because we install them with ansible (though we may need to upgrade the ones in system-config once we're done upgrading everything) | 22:15 |
clarkb | yup I think this is a keep | 22:15 |
fungi | the maintainer versions will be added to the directory with a .dpkg-new extension on them for reference anyway | 22:16 |
fungi | so we can compare later after the upgrade is done | 22:16 |
clarkb | ok I guess just keep all of these for now then | 22:17 |
fungi | especially if they're files in our ansible, since for the production upgrade ansible will just replace them anyway | 22:17
fungi | looks like the new ntp.conf is ~equivalent to the one we were installing with puppet. do we no longer do that? | 22:20 |
clarkb | fungi: ya we don't use puppet for that anymore and no puppet on this server. This is an "install the package maintainer's version" case iirc | 22:20
fungi | sounds good | 22:21 |
clarkb | ya that is what I have on my notes from using the zuul test nodes. I'm going to select 'Y' here | 22:21 |
fungi | ahh, okay, that looked like the daily ntp cronjob in your notes | 22:21 |
fungi | we do manage the sshd config with ansible though, right? | 22:22 |
clarkb | yes we do | 22:23 |
clarkb | this is a keep ours | 22:23 |
fungi | ahh, yeah your notes say yes | 22:23 |
clarkb | fungi: the etherpad has the notes ya | 22:24
fungi | before the reboot step, i want to open a second window in screen and check the exim4/mailman service states | 22:24 |
clarkb | ok | 22:25 |
clarkb | fungi: you are clear to check things. | 22:27 |
fungi | opening a new screen window now, if you're ready | 22:27 |
clarkb | yup I'm ready | 22:28 |
fungi | it did indeed start mailman but not exim | 22:28 |
fungi | what's sad is there's still /etc/rc2.d/K01mailman | 22:29 |
fungi | as added by `systemctl disable mailman` earlier | 22:29 |
clarkb | they are units too. Maybe we should do another systemctl stop mailman && systemctl disable mailman ? | 22:29 |
clarkb | s/units too/units now/ | 22:29 |
fungi | also /etc/rc2.d/K01exim4 | 22:29 |
clarkb | I think it may ignore the compat stuff when there are valid units | 22:30 |
clarkb | that was based on testing at some point | 22:30 |
fungi | Removed /etc/systemd/system/mailman-qrunner.service. | 22:30 |
fungi | so that's why indeed | 22:30 |
clarkb | I added that step to my notes. Do you want to do the same with exim4 to see if it removes anything? | 22:31 |
fungi | xenial to bionic upgrade switched from sysv-compat to systemd and didn't honor the existing service state | 22:31 |
clarkb | looks like you just did; I'll put that in the notes too | 22:31
fungi | i did exim4 and it didn't remove anything | 22:31 |
clarkb | yup | 22:31 |
clarkb | doesn't hurt to run it again though | 22:31 |
fungi | exim4.service is not a native service, redirecting to systemd-sysv-install. | 22:31 |
clarkb | should I select y to reboot now? | 22:32 |
fungi | when the exim4 service switches from sysv-compat to systemd i expect we'll see the same behavior for it | 22:32 |
fungi | yep, go for it | 22:32 |
clarkb | ya that may happen between bionic and focal and explain why I remember it causing trouble when tested previously | 22:32
clarkb | it is rebooting now | 22:32 |
clarkb | it is up | 22:33 |
fungi | logged in and root screen session started again | 23:34
clarkb | ok we can do the sanity checks there | 22:34 |
clarkb | then continue to the focal upgrade | 22:34 |
fungi | it did not restart those services on reboot | 22:34 |
clarkb | yup | 22:35 |
fungi | puppet uninstall command lgtm | 22:35 |
clarkb | fungi: does that puppet removal look correct to you? we didn't do it prior to upgrading to bionic but can do that on the prod server. I figure we should purge it out now | 22:35
clarkb | cool running it | 22:36 |
fungi | do we want to autoremove as well? | 22:36 |
fungi | yes please | 22:36 |
clarkb | ++ | 22:36 |
fungi | you can also do a clean after that to clear out the previous downloads and free up some space in /var | 22:37 |
clarkb | like that? | 22:38 |
clarkb | oh I guess autoclean != clean. Do you think we should do a clean here? | 22:38
fungi | sure, clean would also remove packages which still exist on the mirrors | 22:38 |
clarkb | doesn't seem to make a difference here? | 22:39 |
fungi | autoclean only removes local downloaded copies of things which are no longer on the mirrors | 22:39 |
fungi | (in theory anyway) | 22:39 |
fungi | also possible the do-release-upgrade script already did that for us | 22:39 |
clarkb | ya it may have done so | 22:39 |
clarkb | Alright the next step on the etherpad is upgrading to focal unless there is other sanity checking you want to do | 22:40
fungi | nothing i can think of | 22:40 |
fungi | hopefully we'll get fewer debconf prompts from this upgrade | 22:44 |
clarkb | fungi: from my previous notes I selected yes here but do you think we should select no to maybe avoid things like exim and mailman getting started? | 22:45 |
fungi | well, i think it's worth trying to see. if they do start it's not the end of the world since we cleaned things out | 22:45 |
clarkb | fungi: select yes then? | 22:45 |
fungi | and in production upgrade scenarios it won't matter | 22:45 |
fungi | yeah, go for the yes i guess | 22:45 |
clarkb | ok I'll select yes | 22:45 |
fungi | ideally restarts would only be triggered for already running services | 22:46 |
clarkb | I think the text reported it was restarting exim but I don't see it running | 22:46 |
clarkb | so it must be smart about that | 22:46 |
clarkb | for server upgrade planning I'm thinking we can do something like thursday: Stop services on lists.kc.io then shut it down and snapshot it. Then start it again without services running and go through this upgrade process. If that all checks out do similar for lists in a week or two? | 22:50 |
clarkb | *do similar for lists.o.o in a week or two | 22:50 |
clarkb | keeping snmpd.conf because it is ansible managed | 22:52 |
fungi | yeah, noting that lists.o.o imaging will take hours even if shut down | 22:52 |
clarkb | hrm ya maybe we need to think about that first | 22:52 |
clarkb | shouldn't be a problem to proceed with lists.kc.io as it snapshots quickly | 22:52 |
fungi | we have backups of lists.o.o | 22:52 |
fungi | for the most part these debconf prompts should be the same ones we kept our versions of in the previous upgrade | 22:53 |
clarkb | yup just fewer of them if previous testing is an indication | 22:54
fungi | that was reasonably quick | 22:57 |
fungi | want to pause again at the reboot step so i can check running exim and mailman services | 22:58 |
clarkb | yup | 22:58 |
fungi | neither is running nor enabled | 22:59 |
fungi | exim4 is still via sysv-compat too | 22:59 |
fungi | should be safe to reboot | 22:59 |
clarkb | ok it must have only been mailman that was a problem last time | 22:59 |
clarkb | rebooting | 22:59 |
fungi | i'm ssh'd in again with a new root screen started | 23:01 |
clarkb | joining | 23:01 |
fungi | i'm paranoid about iptables rules getting wiped out due to circular symlinks after that one time where the wiki server ended up with an exposed elasticsearch | 23:04 |
fungi | but lgtm | 23:04 |
clarkb | if you talk to apache it says there are no mailing lists | 23:04
clarkb | but the lists are listed in /var/lib/mailman/lists | 23:05 |
fungi | also `list_lists` spits them out | 23:06 |
fungi | but seems /usr/lib/cgi-bin/mailman/listinfo isn't finding them | 23:06 |
fungi | however, this part we can probably troubleshoot tomorrow, if you want a break | 23:06 |
opendevreview | Merged opendev/system-config master: Remove extra service-codesearch job in deploy https://review.opendev.org/c/opendev/system-config/+/804796 | 23:09 |
clarkb | ya I'm thinking this has been a number of hours of mailman upgrade stuff so far. Definitely want to figure this out but maybe we can do that tomorrow | 23:09 |
clarkb | I need to get our meeting agenda out today too | 23:09 |
fungi | cool, in that case i'm a get back to yardwork before i run out of sunlight | 23:09 |
clarkb | fungi: do you think we should stop apache or shutdown the test server for now? | 23:09 |
clarkb | or just leave it be? | 23:09 |
ianw | clarkb/fungi: did you have any thoughts on restricting the redirect for paste to specific UAs in https://review.opendev.org/c/opendev/system-config/+/804539 ? | 23:15 |
ianw | i don't really mind if we want to just leave the http site up, just seemed like an option | 23:15 |
clarkb | ianw: does the paste command use a UA that we can key off of | 23:17 |
clarkb | ? | 23:17 |
ianw | clarkb: i linked to the change that i think implements it, added in ~2014 | 23:19 |
clarkb | I'd be ok with redirecting for everything else | 23:20 |
fungi | ianw: yeah, saw the comment, seems like a fine idea, i just haven't had time to work out the details and test today | 23:21 |
ianw | no probs, i can have a look since i broke it :) | 23:21 |
clarkb | I think the issue has to do with mailman's vhosting. lists know what url they belong to and if you look up from the wrong url it doesn't work | 23:27
clarkb | using /etc/hosts to override locally seems to fix it for me. fungi you can confirm tomorrow | 23:27 |
clarkb | I'm going to context switch to meeting agenda stuff now | 23:28 |
clarkb | and can pick this up tomorrow | 23:28 |
fungi | clarkb: oh, yep, great guess, that does seem to be the answer | 23:48 |
clarkb | another thing I found when digging into that is you can set the python version explicitly which might need to be done more carefully with our multisite mailman since we do config things there | 23:50
clarkb | something to check out | 23:50 |
fungi | my bigger concern with the multi-site mailman is adapting the systemd service unit to it | 23:52 |
clarkb | I don't think that is an initial issue as the sysv stuff should keep working | 23:53 |
clarkb | it did on the zuul test nodes | 23:53 |
clarkb | we leave the systemd unit as disabled, then can followup with unit conversions if we like | 23:53 |
clarkb | agenda is sent | 23:56 |