Monday, 2021-08-16

*** gibi_pto is now known as gibi		06:08
*** iurygregory_ is now known as iurygregory		06:31
*** jpena|off is now known as jpena		07:42
opendevreview	chzhang8 proposed openstack/project-config master: bring tricircle under x namespace https://review.opendev.org/c/openstack/project-config/+/804669	09:39
opendevreview	chzhang8 proposed openstack/project-config master: bring tricircle under x namespace https://review.opendev.org/c/openstack/project-config/+/804669	10:01
*** sshnaidm|pto is now known as sshnaidm		10:30
*** sshnaidm is now known as sshnaidm|pto		10:31
*** jpena is now known as jpena|lunch		11:16
*** dviroel|out is now known as dviroel|ruck		11:26
*** diablo_rojo is now known as Guest4491		11:39
*** jpena|lunch is now known as jpena		12:16
clarkb	yoctozepto: fwiw I cannot reproduce the behavior clicking on the cherry picks link when logged in on firefox	15:34
clarkb	I wonder if it has to do with being the owner for the change	15:34
*** ysandeep is now known as ysandeep|away		15:40
*** diablo_rojo__ is now known as diablo_rojo		15:43
yoctozepto	clarkb: ack, no problem; it is the first time I have had such an issue	15:48
*** jpena is now known as jpena|off		15:58
*** marios is now known as marios|out		16:01
clarkb	our meeting agenda is surprisingly empty after taking a first pass at updating it this morning. I guess good news there is it means we've just put a bunch of work behind us :)	16:28
clarkb	let me know if I'm missing anything obvious that should be on there though.	16:28
opendevreview	Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot https://review.opendev.org/c/opendev/system-config/+/803190	16:49
clarkb	fungi: re the lists.kc.io snapshot I'll try to boot that after lunch, which seems to be the most likely scheduling for that. Then upgrade it all the way through to focal taking notes?	16:53
clarkb	fungi: ^ are there any gotchas or things you think we should keep an eye out for on that?	16:53
clarkb	one thing is the esm registration on that snapshot I guess	16:54
clarkb	Maybe step zero is to disable that?	16:54
clarkb	though we're under our quota for that so having a test server boot up with it isn't the end of the world I guess	16:54
clarkb	though maybe it is safest to disable it to prevent the other server getting unregistered	16:55
* clarkb is not really sure how that gets accounted		16:55
fungi	will disabling it disable the production server too? i guess we can check	16:57
fungi	wondering if it loads a unique key onto the machine on registering	16:57
clarkb	ya I have no idea	17:00
clarkb	and ya I guess we can check it and resolve manually if necessary	17:01
clarkb	fungi: do you think it would be safer to leave it as is on the new instance or disable esm on the new instance?	17:01
fungi	i would disable, and then re-register the production server if necessary	17:02
clarkb	ok	17:05
opendevreview	Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790	17:08
clarkb	diablo_rojo: ^ note on that one. I need to page more of the plan there in to be sure of my comment on that but wanted to point out the issue either way	17:11
opendevreview	Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site https://review.opendev.org/c/opendev/system-config/+/804791	17:16
*** timburke__ is now known as timburke		17:16
opendevreview	Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790	17:18
opendevreview	Kendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/804790	17:19
diablo_rojo	clarkb, makes sense. Hopefully I fixed it correctly.	17:22
diablo_rojo	I also figure the letsencrypt cert had to be set up first? and that this should be dependent on that?	17:22
diablo_rojo	But I can remove that if it's wrong	17:22
clarkb	in testing we use the staging LE servers and I'm not sure whether they properly verify against DNS	17:24
diablo_rojo	Okay so your guess is only marginally better than mine lol.	17:27
clarkb	diablo_rojo: what I'm not sure about looking at these changes is where the apache config is. I think you may need a "run the ptg site" change somewhere?	17:31
clarkb	ya I think https://review.opendev.org/c/opendev/system-config/+/780942 was that but then the puppet got removed.	17:32
clarkb	diablo_rojo: that means your letsencrypt change likely needs to also configure the apache config as well	17:32
opendevreview	Kendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site https://review.opendev.org/c/opendev/system-config/+/804791	17:47
corvus	fungi, clarkb: i think the issue with the semaphore is that we didn't choose one CD strategy, we chose two, and they are not working well together	17:48
clarkb	corvus: I'm not sure I completely agree with that. Having periodic catch ups seems like a reasonable safety net even if you want to do direct deploys.	17:49
*** dviroel|ruck is now known as dviroel|out		17:49
clarkb	Yes their approaches are different, but I don't think that users should be forced into only one or the other	17:49
corvus	if we can really deploy things when changes merge, that should be the primary strategy, and the periodic should be a backup. and it should run less often and be quicker so it doesn't interfere with the first.	17:49
clarkb	corvus: fwiw I believe the reason we have an hourly deploy in addition to a daily is that some services like zuul and nodepool get image updates we want to apply more quickly than daily	17:49
clarkb	we could address that by building our own zuul images when necessary similar to how we do for other services. But I think that the zuul produced images work well and redoing that effort seems wrong	17:51
corvus	clarkb: well, in practice, we always manually pull zuul images when restarting anyway because that can't be relied upon.	17:51
corvus	clarkb: i also think our semaphore is too coarse	17:52
corvus	we should be able to run all those jobs at the same time	17:52
clarkb	The other big thing in hourly is remote-puppet-else but I think we can configure that job to run whenever puppet related files change in system-config rather than blindly doing an hourly update	17:52
clarkb	It would also make a huge difference if ansible addressed their performance regressions around forking tasks. It is unfortunately quite slow now :( but mordred says upstream isn't interested in reverting or changing that behavior (I can appreciate that it is likely complicated and changes there could produce worse unexpected side effects)	17:53
corvus	yep	17:53
corvus	here's my thinking: zuul should provide tools to help people CD, but our case is not a good one to model -- we have conflicting requirements that just plain cannot be satisfied. we should resolve that before we try to ask for more complexity from zuul.	17:55
clarkb	I worry that opendev's situation is more common than that assertion expects though and we're likely to produce similar problems for other CD users	17:56
corvus	it's possible, but we know that our playbooks are not designed for this. we haven't even finished implementing the system we originally designed.	17:56
opendevreview	Clark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795	17:57
clarkb	I think ^ is an easy change we can make.	17:57
clarkb	It won't solve everything but that job isn't fast and we run it hourly when we almost never actually need the updates encoded in it (and it runs in the deploy pipeline when we do need it)	17:57
corvus	the fact that the periodic run takes >1hr is just not a good starting point -- honestly, if we're okay with that, we should drop deploy anyway and go back to "it will be deployed within 0-2 hours".	17:59
corvus	if we want to make immediate deployment primary, then we need to get the reconciliation path out of the way	17:59
clarkb	I don't think we're ok with it, but to fix it we either need to stop using ansible, run jobs in parallel, or run fewer jobs	17:59
clarkb	run jobs in parallel was the original expectation iirc	18:00
corvus	yes, i agree. i think that's the starting point though.	18:00
clarkb	yup definitely improving hourly throughput would make a huge difference	18:00
clarkb	804795 above should make a good starting dent in that	18:00
corvus	i'm not sure why all the other jobs can't run after base? is it because we don't want them to run at the same time as any deployment pipeline job, and the semaphore doesn't let us express that?	18:01
clarkb	corvus: I think there is some ordering implied as well. Like nodepool should update before zuul (and the registry as well?)	18:02
corvus	clarkb: maybe it's just a matter of making a new parent job for the periodic pipeline, have that one hold the semaphore, run base, then run everything else. then also have the deployment pipeline jobs individually hold the semaphore so they exclude each other and the entire periodic pipeline (which is now faster)?	18:02
clarkb	Eavesdrop should be able to run whenever I expect	18:03
corvus	we should still be able to accommodate that	18:03
clarkb	but ya I don't think it is as easy as letting everything run in parallel; there is some implied ordering in services. Gitea before gerrit (for replication), nodepool before zuul for image changes, and so on	18:04
clarkb	I think puppet runs last because in the past we had a bunch of stuff doing puppet that wanted to be in the ordering. But now we may be able to run puppet whenever	18:04
clarkb	If order doesn't matter for storyboard (I suspect it doesn't) then we can run puppet in any order based on my read of the site.pp	18:05
corvus	actually, the deploy pipeline should probably hold the lock for the entire buildset too	18:05
corvus	clarkb: basically like this: https://paste.opendev.org/show/808121/	18:06
clarkb	and lock-holding-job is paused?	18:06
corvus	clarkb: yes	18:06
corvus	zero-node job	18:06
clarkb	ya I think at the very least that will help us express the dependencies properly which will allow us to optimize further	18:07
clarkb	basically that might not end up being faster but it will help us understand better to then make things faster	18:07
corvus	yeah, it should theoretically be faster ;)	18:07
corvus	we should be able to use the same job tree in periodic and deploy	18:07
clarkb	corvus: when you say periodic do you mean daily or hourly or both?	18:08
clarkb	(we have two periodic pipelines currently and they have different jobs, see https://review.opendev.org/c/opendev/system-config/+/804795 for an example)	18:08
corvus	hrm, i wasn't aware of the difference	18:09
clarkb	basically hourly is there for things we want to update quickly because we may not have a good trigger for them	18:10
corvus	i wonder why eavesdrop is in there?	18:10
clarkb	like zuul and nodepool image updates	18:10
clarkb	corvus: eavesdrop also consumes images from other repos (gerritbot for example)	18:10
corvus	clarkb: but it's pinned	18:10
corvus	it won't update without a corresponding system-config update	18:11
clarkb	corvus: matrix gerritbot is but not irc gerritbot iirc	18:11
corvus	where does the gerritbot image come from?	18:11
clarkb	corvus: from the gerritbot repo	18:11
clarkb	https://opendev.org/opendev/gerritbot/src/branch/master/Dockerfile	18:12
corvus	would we be sad if it took 24h to update?	18:12
corvus	anyway, parallelism should help there	18:12
clarkb	If we are trying to fix a bug we can always manually pull	18:12
corvus	i'm surprised the hourly takes >1 hour with those jobs	18:13
clarkb	corvus: ~20 minutes is the cloud launcher which is why I have proposed moving it. But also ansible is really really slow :/	18:13
clarkb	A big part of the cloud launcher slowness is processing all of that native ansible task stuff to interact with the clouds	18:13
clarkb	it would probably take a minute or two if written as a python script	18:14
corvus	clarkb: looking at a recent buildset, if we parallelize that (after merging your cloud launcher move), we would have 4m for bridge + 8m for zuul (the longest playbook)	18:14
corvus	so we should be able to get an hourly run down to 12m with this approach -- without doing any deeper optimization	18:15
clarkb	corvus: but nodepool and zuul registry and zuul would need to run serially? I agree the puppet and the eavesdrop jobs can run in parallel	18:15
corvus	clarkb: i'm not sure they do?	18:15
clarkb	corvus: it's possible they don't. I thought that order was intentional for the zuul and nodepool services though, to ensure that labels show up in the right order or similar	18:16
clarkb	but I guess that is all happening in zookeeper now and can be lazy?	18:16
corvus	clarkb: (and incidentally, the cumulative runtime of all the current hourly jobs is 55m - so that assumption we've been working from is correct)	18:16
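The runtime estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the discussion (4m for bridge, 8m for zuul as the longest playbook, 55m cumulative); the function name is hypothetical, and it assumes every job other than bridge can run at the same time once bridge has finished.

```python
def parallel_estimate(first_job_minutes, longest_remaining_minutes):
    """Wall time when one job runs first and all others then run in parallel.

    Assumption: the remaining jobs have no ordering among themselves, so the
    slowest one alone determines how long the parallel phase takes.
    """
    return first_job_minutes + longest_remaining_minutes


serial_minutes = 55                          # cumulative runtime quoted for the hourly jobs
parallel_minutes = parallel_estimate(4, 8)   # bridge, then zuul as the longest playbook
minutes_saved = serial_minutes - parallel_minutes
```

Any ordering that has to be kept between the remaining jobs would lengthen the parallel phase, so 12m is a lower bound under these assumptions.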
corvus	clarkb: if we're talking about adding a nodepool label, i don't think we expect them to be immediately available anyway (image build/upload time, etc)	18:17
corvus	clarkb: i think typically nodepool-provides-label and zuul-uses-label would be different changes anyway	18:17
clarkb	Another tricky thing is that we use the same jobs in deploy and the hourly and periodic pipelines so we can't just convert the hourly pipeline without converting everything?	18:17
clarkb	though maybe we can use a variant of the job to override semaphores in pipelines so we can do a bit at a time	18:18
corvus	clarkb: by convert do you mean change the semaphore usage? i think the right thing to do is to apply this to all 3 pipelines.	18:18
clarkb	corvus: yes. That is the right thing to do but I'm concerned that the scope of it is quite large and we can't minimize risk for out of order problems if we do all three at once	18:19
corvus	all 3 should have a lock-holding-job to make the semaphore apply to the whole buildset so interruptions don't happen. then, if it's okay to parallelize in one, it should be okay to parallelize in all.	18:19
clarkb	corvus: yes, except that periodic and deploy run many many more jobs than hourly. Which means we would have to sort out all of those ordering concerns at the same time (much more risk)	18:19
clarkb	making the hourly deploy parallel is much smaller in scope as far as determining what the order graph is	18:20
clarkb	But maybe we start by trying to do all 3 together and if it gets unwieldy then we can attempt something smaller in scope	18:21
fungi	the eavesdrop deploy job was also handling meeting time changes at one point, right? but very recently that's been switched to use a more typical static site publication job?	18:21
corvus	clarkb: my understanding is that the original intent was that each of these jobs (aside from base and bridge) should be independent, so i hope that there aren't many instances of us assuming the opposite. but you could stage this by using 2 semaphores. one as the new lock-holding-job for the buildset -- all 3 pipelines need to use it. then a second semaphore to make the jobs within a pipeline mutually exclusive. that will keep things slow	18:22
corvus	(like today) until the second one is removed.	18:22
clarkb	fungi: yes, and I guess those meeting times were listed in a different repo too so would rely on hourly updates	18:22
fungi	er, i guess it wasn't the eavesdrop deploy job before that, it was puppet	18:22
clarkb	corvus: I'm pretty sure we still have ordering between the jobs. I don't know that we sufficiently untangled that yet	18:23
clarkb	things like gitea-lb running before gitea	18:24
clarkb	(maybe that should be one job?)	18:24
fungi	but it would be good to at least identify and codify specific cases like that using dependencies	18:24
fungi	or making it one job	18:24
clarkb	nameserver before letsencrypt (though we don't encode that order today)	18:25
clarkb	because we need the nameserver job to create the zones before letsencrypt attempts to add records to them	18:25
fungi	is there a specific letsencrypt job though?	18:25
clarkb	fungi: yes	18:25
corvus	clarkb: yeah, gitea sounds like maybe that should be one playbook	18:25
fungi	ahh, okay, i'm likely thinking of the individual handlers in cert management of various services	18:26
clarkb	letsencrypt before all the webservers	18:26
clarkb	They definitely exist, and once we've bootstrapped things sufficiently the order tends to matter less	18:26
fungi	though also the nameserver/letsencrypt ordering is primarily a zone bootstrapping problem right? if we're not adding a new domain it doesn't matter?	18:26
clarkb	but if you bootstrap from scratch the order is important and not encoded beyond the order of the jobs and running serially	18:27
fungi	oh, as far as making sure cname records are deployed	18:27
fungi	okay, i've got it	18:27
fungi	so any time we're bootstrapping a new service really	18:27
clarkb	fungi: no I'm talking about the server bootstrapping in this case not the CNAME addition (though that order also matters)	18:27
clarkb	basically if we went fully parallel we couldn't safely deploy a new name server and attempt to update LE certs	18:28
clarkb	99% of the time that doesn't matter, until it does	18:28
clarkb	similarly with the various webservers that all need LE certs. We rely on the le job running early before they happen to properly bootstrap new webservers	18:29
clarkb	Currently that is only encoded in the pipeline def order with the assumption things will run serially	18:29
clarkb	All of this is fixable, but it isn't as simple as making it parallel	18:29
fungi	we wouldn't add the nameserver into production (list it through the domain registrars) until it was serving the right records though, yeah?	18:29
fungi	that's a manual step	18:29
fungi	so letsencrypt shouldn't try to use it for lookups anyway	18:30
corvus	that was an unfortunate oversight; the deploy pipeline is supposed to have explicit dependencies (after all, it does a lot of file matchers)	18:30
corvus	several jobs have an explicit dep on letsencrypt	18:30
corvus	(codesearch, etherpad, grafana)	18:30
clarkb	fungi: the issue isn't on the resolution side it is on the ansible side. We run the playbooks on both but it will fail to add records to a zone which doesn't exist if you get the order wrong	18:30
clarkb	corvus: ah yup looks like some do have the right annotations but not all	18:31
corvus	(wow codesearch is listed in the pipeline twice :/ )	18:31
clarkb	fungi: basically the ansible will fail then we won't get new certs for anything. It's not that the dns lookup will fail; we never get to that point	18:31
fungi	clarkb: we make the zone exist by adding records to it though, right?	18:31
fungi	we're just installing zonefiles into the fs i thought, not using an api like dynamic dns updates or something?	18:32
corvus	if it's a case we want to handle, letsencrypt depending on nameserver is reasonable (though right now, nameserver runs after letsencrypt)	18:33
corvus	sorry i have to run	18:34
fungi	looks like we inject lines into existing files like /var/lib/bind/zones/acme.opendev.org/zone.db so maybe the problem is that we're not writing out the whole file (because we want to avoid invalidating other entries in it which might still be in use)	18:34
clarkb	fungi: sort of. the acme stuff is dynamic but I'm not sure what triggers it yet looking deeper	18:34
clarkb	fungi: but https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/master-nameserver/tasks/main.yaml is part of service-nameserver and that installs bind which ensures there are dirs to write to	18:35
fungi	if that's the actual place it's breaking down, we could just make sure to always create the file before writing to it	18:35
clarkb	fungi: but then you don't get file perms right because bind isn't even installed yet	18:35
clarkb	and then https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/letsencrypt-install-txt-record/tasks/main.yaml runs in the letsencrypt job	18:36
clarkb	there is an implicit dependency between them today that the service-nameserver playbook and job run before we try to set up any LE certs	18:36
clarkb	it hasn't been a problem because we haven't tried to replace any nameservers recently	18:37
clarkb	and if we are careful when replacing the server we don't need to encode the dependency, but it does exist	18:37
fungi	so in theory we could just include the nameserver setup role there before trying to create files	18:37
fungi	which ideally should be a quick no-op if the server is already configured	18:37
clarkb	except ansible is slow so not quite, but yes that would be an option	18:38
fungi	"quick" in ansible terms then ;)	18:38
clarkb	that becomes similar to the gitea-lb vs gitea job though	18:38
clarkb	we would basically collapse into a single job to encode the dependency and delete the other job	18:38
clarkb	happy to do that, but we still need to make changes like that before parallelizing things is safe	18:39
clarkb	I'm beginning to think we have a couple of first tasks we need to do here. We can do some trivial updates like my cloud launcher change to speed things up immediately. But we should also do a large scale graphing exercise of dependencies in a human readable format if possible.	18:40
clarkb	Then once we've got that graph we can either encode deps as zuul job deps or collapse jobs together as it makes sense	18:40
clarkb	then we can switch to being parallel	18:40
fungi	i concur	18:41
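The graphing exercise proposed above could start from something like the following sketch. It records only the orderings mentioned in this discussion (nameserver before letsencrypt, letsencrypt before the webservers, gitea-lb before gitea, gitea before gerrit, nodepool before zuul) and groups jobs into stages that could run in parallel. The job names are shorthand labels rather than the real job names, the nodepool/zuul edge was questioned in the conversation, and the edge list is certainly incomplete.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each job to the set of jobs that must finish before it starts.
# Labels are illustrative shorthand, not the actual zuul job names.
deps = {
    "letsencrypt": {"nameserver"},
    "codesearch": {"letsencrypt"},
    "etherpad": {"letsencrypt"},
    "grafana": {"letsencrypt"},
    "gitea": {"gitea-lb"},
    "gerrit": {"gitea"},
    "zuul": {"nodepool"},   # disputed above; drop the edge if it is not needed
    "eavesdrop": set(),     # "should be able to run whenever"
}


def parallel_stages(graph):
    """Group jobs into stages; every job in a stage can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the orderings conflict
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages(deps)
for number, jobs in enumerate(stages):
    print(f"stage {number}: {', '.join(jobs)}")
```

Each edge in a graph like this would then become either an explicit zuul job dependency or a reason to collapse two jobs into one, as suggested for gitea-lb and gitea.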
opendevreview	Clark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795	18:43
opendevreview	Clark Boylan proposed opendev/system-config master: Remove extra service-codesearch job in deploy https://review.opendev.org/c/opendev/system-config/+/804796	18:43
clarkb	There are a couple of easy cleanups based on some of what we discovered during this discussion	18:43
clarkb	I'll add this as a meeting agenda item as I think it deserves discussion around my next steps above and then actually tackling them	18:43
fungi	thanks!	18:46
clarkb	ok the wiki should have my condensation of all that ^ in it now. Feel free to add additional notes or edits	18:49
clarkb	I'm going to grab lunch now. But plan to work on the kata lists test server next	18:51
clarkb	fungi: re disabling esm I'll have to do that after the snapshot boots unless we want to disable it on prod, snapshot again, then reenable on prod. I suspect the safest thing is after boot since the daily package updates won't run for a while anyway	18:52
fungi	yeah, i assumed after boot	18:53
fungi	at one point we also had problems with too many ansible runs at the same time causing oom events on bridge.o.o, right?	18:57
fungi	thinking back to the parallel deploy jobs thing	18:57
clarkb	fungi: yes, but that is tied to connectivity errors piling up ansible processes	18:57
clarkb	I agree that is a risk though. We might need a semaphore with say only 5 jobs allowed to run at once	18:57
clarkb	to temper that	18:58
fungi	and i guess if we did run into that... yeah	18:58
fungi	exactly what i was thinking, since semaphores can be limited to more than 1	18:58
clarkb	the risk there is we would want the semaphores to all live within one pipeline I think	18:58
clarkb	exclusive to a pipeline but >1 job in a pipeline	18:58
corvus	If we have a builder semaphore one secondary semaphore is ok	19:00
corvus	The other pipes can't get it	19:00
corvus	* If we have a buildset semaphore one secondary semaphore is ok	19:00
fungi	yeah, coupling those sounds reasonable	19:01
corvus	One semaphore for pipeline exclusion (max 1). One semaphore for job exclusion (max N).	19:02
clarkb	aha	19:02
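The two-semaphore arrangement described above can be modelled with ordinary threads. This is a toy simulation, not zuul configuration: the pipeline and job names are hypothetical, `time.sleep` stands in for a playbook run, and `max_jobs` plays the role of N.

```python
import threading
import time


def simulate(pipelines, max_jobs):
    """Run each pipeline's buildset in its own thread.

    Returns (peak concurrent buildsets, peak concurrent jobs) as observed.
    """
    buildset_lock = threading.Semaphore(1)     # pipeline exclusion (max 1)
    job_slots = threading.Semaphore(max_jobs)  # job exclusion (max N)
    stats_lock = threading.Lock()
    stats = {"buildsets": 0, "jobs": 0, "peak_buildsets": 0, "peak_jobs": 0}

    def enter(kind):
        with stats_lock:
            stats[kind] += 1
            stats["peak_" + kind] = max(stats["peak_" + kind], stats[kind])

    def leave(kind):
        with stats_lock:
            stats[kind] -= 1

    def run_job():
        with job_slots:            # at most max_jobs jobs run at once
            enter("jobs")
            time.sleep(0.01)       # stand-in for a playbook run
            leave("jobs")

    def run_buildset(job_names):
        with buildset_lock:        # held for the entire buildset
            enter("buildsets")
            workers = [threading.Thread(target=run_job) for _ in job_names]
            for worker in workers:
                worker.start()
            for worker in workers:
                worker.join()
            leave("buildsets")

    threads = [threading.Thread(target=run_buildset, args=(jobs,))
               for jobs in pipelines.values()]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stats["peak_buildsets"], stats["peak_jobs"]


# Hypothetical pipelines and job names, for illustration only.
pipelines = {
    "deploy": ["nameserver", "letsencrypt", "gitea", "zuul"],
    "hourly": ["nodepool", "zuul", "eavesdrop"],
    "periodic": ["nameserver", "letsencrypt", "gitea", "gerrit", "zuul"],
}
peak_buildsets, peak_jobs = simulate(pipelines, max_jobs=2)
```

Because only the thread holding the buildset lock can start jobs, the job semaphore never has to coordinate across pipelines, which matches the point that "the other pipes can't get it".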
opendevreview	Merged opendev/system-config master: Run the cloud launcher daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804795	19:22
clarkb	fungi: can you check my comment on line 195 of https://etherpad.opendev.org/p/listserv-inplace-upgrade-testing-2021 before I boot the test instance? I assume there is some way to do what I describe there but I'm not sure I know what it is	19:51
clarkb	fungi: also do you know what you called the snapshot? I'm not able to list the snapshot currently but am trying to figure that out	19:59
clarkb	--os-volume-api 1 volume list using that works to get the volumes	20:01
clarkb	but not snapshots	20:01
clarkb	aha because it isn't a volume snapshot, it is a server image	20:03
clarkb	I need image list. Found it	20:03
clarkb	fungi: ok I think I'm ready to boot off the snapshotted image if you can take a look at line 195 on the etherpad above	20:05
fungi	clarkb: yep, sorry, was out taming the yard, looking at the pad now	20:18
fungi	and sorry i wasn't clear earlier, it's definitely a server image not a volume snapshot (the latter would assume cinder and bfv i think?)	20:18
fungi	clarkb: so... package maintainer scripts are supposed to obey disabled state for services	20:22
fungi	it's maybe possible they'll break, but i think the restarts are supposed to be a successful no-op when services are disabled, per debian policy (and these packages are generally inherited from debian anyway)	20:23
clarkb	fungi: testing with the zuul test node servers indicated this wasn't the case. I suspect because with an upgrade it is different?	20:24
clarkb	fungi: the services were definitely running after the upgrade	20:24
clarkb	fungi: the statement there about it starting services is based on previous experimental data	20:24
fungi	mmm, how were those services disabled?	20:25
fungi	systemctl disable or some other way?	20:25
clarkb	ya using systemctl disable	20:25
*** artom_ is now known as artom		20:26
clarkb	looking at my notes it seems mailman started after the first upgrade but maybe not exim? then maybe exim started after the second one. I wish I had better specifics about that now	20:27
clarkb	fungi: after we boot the snapshot can we check for anything spooled up and clear that out if it exists?	20:27
clarkb	if so I can boot the snapshot now and we can take a look and see if this is even a concern	20:27
fungi	yeah, worth double-checking them	20:29
clarkb	ok should I boot the server now then?	20:29
fungi	can always clear out the dirs under /var/spool/exim	20:29
fungi	yeah, i'm on as long a break from yardwork as we need, go ahead and i'll check it	20:30
clarkb	ok proceeding	20:30
fungi	lmk the ip address it gives you	20:30
fungi	and then i'll start a root screen session on it	20:31
clarkb	fungi: did you want to run through the upgrades with me too? Not sure how interested in that bit you are. My plan was to record the steps like I did with the zuul test servers on that same etherpad	20:31
fungi	i can, sure	20:32
fungi	server is responding to ping but ssh doesn't work yet	20:37
clarkb	ya	20:37
fungi	though the server has a static ip configuration in /etc/network/interfaces... i wonder if that's getting properly reset	20:39
clarkb	oh it does?	20:40
clarkb	hrm I want to say we've run into this before and had to do a rescue instance? ugh this gets really annoying really fast	20:41
clarkb	ya I suspect that may be the issue	20:41
clarkb	because we uninstall cloud init	20:41
clarkb	:(	20:41
clarkb	maybe we should stop doing that	20:42
fungi	well, to be fair, we almost never do server images	20:42
fungi	(or in-place upgrades)	20:42
clarkb	ya but when we do it is because the other options are bad	20:42
clarkb	I'm not really sure how the whole rescue instance thing works. Is that something I can do via osc?	20:43
fungi	i've only ever tried it through the dashboard	20:43
fungi	but in essence it boots a replacement vm from some standard image and then attaches the server's volume as another block device	20:44
clarkb	and from that we can mount and edit /etc/network/interfaces	20:44
clarkb	looks like there is an openstack server rescue command	20:44
clarkb	I'll try that	20:45
clarkb	hrm I didn't set a key name on the original boot so I'm not sure there will be keys set on the rescue vm	20:46
clarkb	why isn't that part of what the rescue api takes	20:46
clarkb	fungi: I suspect that I may need to unrescue then delete my test instance	20:47
clarkb	then start over and set key name appropriately	20:47
clarkb	yup I get an ssh connection now but it wants to fallback to password	20:47
clarkb	I'll unrescue, delete the instance, boot again with a key set then rescue again	20:47
fungi	normally when you do a rescue through the dashboard it tells you the root password	20:48
clarkb	huh it did not do that here	20:48
clarkb	though it looks like I can set a rescue password	20:48
clarkb	I guess I can try that before deleting and starting over	20:48
clarkb	yup that seems to have worked. The good news is the rescue image has an /etc/network/interfaces that I can refer to as well	20:55
fungi	pita but at least it's a way forward to get the thing usable. i suppose you could chroot into it and disable esm too if you wanted	20:56
clarkb	but now I'm confused because it seems the /etc/network/interfaces on the other device is the same as the one on this device	20:58
clarkb	they are clearly different devices if I look at /etc/hostname and /mnt/etc/hostname or /etc/hosts etc	20:59
fungi	so maybe on boot the rax-agent or whatever it is does config file injection then?	20:59
fungi	in which case there could be other reasons for sshd not accepting connections, i guess	20:59
fungi	maybe it was taking a long time gathering enough entropy to generate new host keys?	20:59
clarkb	I see a .bak file with the old content I expected. I half suspect that it just didn't restart networking and that if I unrescue at this point maybe it will work?	21:00
clarkb	from 20:34 today	21:00
clarkb	I'll try that I guess. I haven't changed anything via the rescue so if that works then hax	21:00
fungi	oh yeah maybe	21:00
fungi	does it have host keys?	21:00
clarkb	oh I've already dropped out I didn't check	21:01
clarkb	I guess that's the next thing to check if it continues to fail	21:01
fungi	no worries, something to check if it happens again, yeah	21:01
fungi	it's pinging again	21:03
clarkb	but still no ssh. I'm finding this very confusing. Also doesn't debian's ssh init generate ssh host keys?	21:04
clarkb	what if sshd is not starting at all for some reason?	21:04
fungi	yes	21:05
fungi	which needs entropy, which is hard to come by on a vm	21:05
fungi	and looks like haveged is not installed on it	21:05
clarkb	ah I see what you meant earlier; it could just be waiting on that?	21:05
fungi	also quite likely no compatible host entropy kernel module	21:05
fungi	yeah	21:05
clarkb	do you think it is worth waiting to see if this changes or should I rescue it again?	21:06
fungi	it likely says on the console if it's generating keys	21:06
clarkb	ya but for the console i have to login to the dashboard :P I was hoping to avoid that, but maybe I should	21:06
clarkb	*shouldn't	21:06
fungi	openstack console url show <uuid>	21:07
fungi	it seems to want the root password for maintenance	21:08
clarkb	does that work with rax? I know just dumping the console doesn't. I'm logged in	21:08
fungi	maybe fsck of the rootfs failed?	21:08
fungi	yeah, console url show works with rac	21:08
fungi	rax	21:08
clarkb	til	21:09
fungi	i wouldn't be surprised if there are fs inconsistencies since it was imaged while mounted and likely writing	21:09
clarkb	I agree it wants a root login. Should I delete this instance and try again? Then check the console of the new instance? maybe the snapshot isn't so happy?	21:09
clarkb	if it fails again we do over?	21:09
fungi	i would rescue boot and fsck the block device while unmounted	21:09
clarkb	ok I'll try that	21:09
fungi	i suspect imaging any running server could result in this situation	21:10
fungi	i've had more luck imaging servers while they're shutdown	21:10
fungi	more frightening though if rackspace did file injection on a filesystem which wasn't clean	21:11
fungi	the power of cloud	21:11
clarkb	fungi: just `fsck /dev/xvdb` ?	21:13
fungi	sure, though you might have to hit 'y' a lot	21:13
fungi	you could add -y	21:13
fungi	it's unlikely you'd tell it to do anything other than try to repair anyway	21:13
clarkb	my manpage doesn't show -y as a valid option	21:14
fungi	i may be confusing with bsd ;)	21:14
clarkb	actually it probably wants xvdb1 since that is the partition with an fs on it	21:14
fungi	my fsck manpage has -y documented	21:14
clarkb	huh mine does not	21:14
clarkb	maybe I need to look at fsck.ext3	21:15
fungi	the fsck manpage on lists.k.i also has it	21:15
tosky	maybe it's specific to the fsck.foo you use	21:15
clarkb	ya has to be a specific fsck	21:15
clarkb	just fsck doesn't document it	21:15
fungi	it is specific to the fsck.foo, but the general fsck manpage also says that about it	21:15
clarkb	neat mine doesn't on suse nor does the debian rescue image I'm on	21:16
clarkb	I'm running that fsck now	21:16
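For reference, the rescue-mode check being discussed can be sketched as a small shell helper. This is only an illustration: it assumes the filesystem is ext3/ext4, so that `e2fsck` applies (its `-y` flag answers yes to every repair prompt and `-f` forces a check), and the function names `is_mounted` and `check_unmounted` are made up here rather than taken from the channel.

```shell
# Sketch only: refuse to fsck a device that appears in a mounts table.
# is_mounted DEVICE [MOUNTS_FILE]; MOUNTS_FILE defaults to /proc/mounts.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# check_unmounted DEVICE: run a forced, auto-repairing check (needs root).
check_unmounted() {
    if is_mounted "$1"; then
        echo "$1 is mounted; unmount it first" >&2
        return 1
    fi
    e2fsck -f -y "$1"
}

# Example (from the rescue image, using the partition named in the channel):
#   check_unmounted /dev/xvdb1
```

Checking the mounts table first matches the advice to fsck the block device only while it is unmounted.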
fungi	"-y For some filesystem-specific checkers, the -y option will cause the fs-specific fsck to always attempt to fix..."	21:16
clarkb	that was quick it is done	21:16
fungi	did it say it repaired anything?	21:16
clarkb	it updated free inode and block counts	21:16
tosky	both Debian 11 (ok, testing) and Fedora 34 don't document it (same manpage apparently, last update February 2009, from util-linux)	21:16
clarkb	cloudimg-rootfs: clean, 303705/5120000 files, 1247631/10485499 blocks	21:16
clarkb	doesn't seem to have done much	21:17
clarkb	I'll mount it now and check for ssh host keys	21:17
clarkb	it has its preexisting keys from the snapshot	21:17
clarkb	fungi: ^ anything else you want me to try before giving it another reboot?	21:18
fungi	tosky: interestingly, the ubuntu 16.04 fsck manpage i quoted from above claims to be from "util-linux February 2009"	21:18
fungi	maybe they patched it	21:18
fungi	clarkb: nah, i strongly suspect it was fsck failing at boot which caused the behavior we saw	21:18
fungi	give it another try now	21:18
clarkb	ok	21:18
fungi	tosky: i agree my newer debian machines don't document -y in the general fsck manpage but do in the manpage for, e.g., fsck.ext2 and so on	21:20
tosky	weird	21:21
clarkb	the console shows the boot splash. I wish bootsplashes went away on server images	21:21
clarkb	I suspect this may fail as it is taking a long time again	21:22
fungi	tosky: i guess they decided to axe a number of entries in the general manpage which just said "this normally does x but behavior depends on the backend chosen"	21:22
clarkb	yup it is in emergency mode again	21:22
clarkb	but the boot splash prevented us from seeing why	21:22
fungi	did it say why?	21:22
clarkb	no because boot splash	21:22
fungi	ugh	21:22
clarkb	I guess if I rescue it again there might be something in the kernel or syslog?	21:23
fungi	maybe, and worst case we can disable the bootsplash in the grub.config or whatever	21:23
clarkb	fungi: ya but doesn't grub require complicated reconfiguration when you do that now?	21:23
clarkb	I wonder if that will work using debian tools against the ubuntu image	21:23
fungi	not if you're editing the config in /boot	21:23
clarkb	ah ok let me rescue it again	21:24
fungi	there's a fancy run-parts tree in /etc which you can use to build a grub.cfg file, but that's all it really is. you can edit the built config directly as long as you don't care that rerunning the config builder will undo your changes in it	21:25
clarkb	or that if you get it wrong grub may fail? which isn't much worse than what we're doing now	21:25
fungi	should just be able to edit /boot/grub/grub.cfg and take out the "splash quiet" from the kernel command line	21:26
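A minimal sketch of that edit, for anyone repeating it later. It assumes GNU sed and that the kernel entries in grub.cfg start with a plain `linux` keyword (variants such as `linuxefi` are not handled); the helper name `strip_splash` and the `/mnt` mount point in the usage comment are hypothetical.

```shell
# Sketch only: print grub.cfg with "quiet" and "splash" removed from kernel lines.
# strip_splash FILE
strip_splash() {
    sed -E '/^[[:space:]]*linux[[:space:]]/ {
        s/[[:space:]]+quiet([[:space:]]|$)/\1/g
        s/[[:space:]]+splash([[:space:]]|$)/\1/g
    }' "$1"
}

# Example (hypothetical mount point for the rescued root filesystem):
#   cp /mnt/boot/grub/grub.cfg /mnt/boot/grub/grub.cfg.bak
#   strip_splash /mnt/boot/grub/grub.cfg.bak > /mnt/boot/grub/grub.cfg
```

Writing from a backup copy keeps the original config around in case the edited file does not boot. As noted in the channel, rerunning the config builder will regenerate grub.cfg and undo the change.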
fungi	and yes, splashscreens on virtual machines (or servers in general) are beyond silly	21:28
clarkb	doing that now	21:28
clarkb	ok that is done. going to quickly check kern.log and syslog type logs	21:29
clarkb	those don't have anything from today in them implying we aren't getting that far	21:30
fungi	right, if they don't get far enough to mount the rootfs, that's what i'd expect	21:30
clarkb	fungi: I think it might possibly be the swap device in fstab	21:31
clarkb	I'm going to comment those out	21:31
fungi	oh, quite likely	21:31
fungi	in fact, yes, i bet we set swap to a partition on the ephemeral disk which on the new server isn't partitioned/formatted	21:32
fungi	good call	21:32
clarkb	yup	21:32
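The fstab change can be sketched roughly as below. This is an assumption-laden illustration, not what was actually typed on the server: it treats any uncommented entry whose third field is `swap` as a swap entry, and the helper name `comment_out_swap` and the `/mnt` path in the usage comment are hypothetical.

```shell
# Sketch only: print an fstab with every active swap entry commented out.
# comment_out_swap FILE
comment_out_swap() {
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
}

# Example (hypothetical mount point for the rescued root filesystem):
#   cp /mnt/etc/fstab /mnt/etc/fstab.bak
#   comment_out_swap /mnt/etc/fstab.bak > /mnt/etc/fstab
```

Commenting the lines out rather than deleting them leaves a record of the original swap device for later comparison with the production server.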
clarkb	it is unrescuing now	21:32
fungi	of course, if it weren't for the splashscreen, we'd have known that an hour ago ;)	21:33
clarkb	indeed	21:33
clarkb	alright it is up finally	21:34
fungi	and i can ssh in	21:34
fungi	i have a root screen session established on it	21:34
fungi	exim4 and mailman did not start on boot, that's good	21:35
clarkb	yup I didn't expect them to start booting from the snapshot, just after the upgrade(s)	21:35
clarkb	fungi: looks like exim may have some stuff spooled	21:35
fungi	agreed, just getting a baseline so we know later	21:35
fungi	okay, cleared out the exim4 spooldirs	21:37
fungi	that way if it does start, it shouldn't try to deliver any duplicate messages	21:37
clarkb	thanks. you just rm'd the dir contents for input etc?	21:38
clarkb	I'm getting my notes together on the etherpad then will proceed with upgrade things	21:38
fungi	i did, yes	21:39
fungi	/var/spool/exim4// basically	21:39
clarkb	alright the next step on my etherpad is to unenroll from esm. I'll do that	21:40
fungi	and then we want to check esm status on the production server	21:40
clarkb	yup I just did that and it says it is still enrolled. I'll make a note to check it again tomorrow in case it takes time for that accounting to happen	21:41
clarkb	as expected upgrading is a noop	21:41
clarkb	my notes say I should reboot, but I think I can skip that because no package updates occurred	21:42
fungi	yes	21:42
fungi	that's fine	21:42
fungi	effectively we did just "reboot" it anyway	21:42
clarkb	yup exactly.	21:42
clarkb	fungi: the next step is actually doing an upgrade. I'm going to take a short break here as I need some water. But will get back to it and start a root screen and update my notes on the etherpad as I go through answering questions if you want to follow along	21:43
fungi	are you using the root screen session i started (and if not, do you want to?)	21:43
clarkb	oh not yet sorry I didn't realize there was one but you said you would start one	21:43
clarkb	I'll use that one going forward	21:43
fungi	and yeah, i'll do a quick round with the leaf blower, brb	21:43
clarkb	I'll start; the beginning of this isn't too interesting	21:48
clarkb	fungi: you ready for me to accept the new packages? This is where it gets fun and you have to sort out keeping old files or accepting new ones. Though I have a bit of a cheatsheet from earlier testing	21:53
* clarkb goes for it. we can always redo testing again if necessary		21:55
fungi	i'm back, sorry	21:56
clarkb	fungi: if you look at the etherpad we are at line 216	21:56
fungi	and yeah, the choices are less interesting than the results	21:56
clarkb	fungi: so this step is one that wasn't curses before it was just a prompt but also one I had questions about	21:58
clarkb	currently we only support a subset of languages in our lists, but I figure the safest thing is to select all of them?	21:59
clarkb	fungi: do you think it is worthwhile to only do the subset?	21:59
fungi	yeah, just take the default there. it won't really chew up that much additional space nor time generating locales for things	21:59
clarkb	well the default was none iirc	21:59
fungi	not even english? interesting	21:59
clarkb	and ya it is relatively tiny and not that much time so I figured let's just pick all of them	21:59
clarkb	I'll proceed with all selected	22:00
fungi	that may come in important for multilingual support in mailman anyway	22:00
clarkb	I'm going to select no for saving the iptables rules because I added the 1022 rule that we want to go away later	22:01
fungi	oh, what was rule 1022 for?	22:01
fungi	i see, temp ssh server	22:02
clarkb	yup	22:02
fungi	as long as you're okay with that vanishing on reboot	22:02
clarkb	fungi: it was going to anyway as the upgrade doesn't restart that sshd aiui	22:04
fungi	and yeah, the logind.conf change looks non-impacting for us	22:05
fungi	is the current plan just to upgrade from xenial to bionic, or are we going all the way through to focal?	22:08
clarkb	I think we can go all the way to focal	22:08
clarkb	also I just added a question to the etherpad. Do we need to uninstall puppet first or can we do that after.	22:08
clarkb	We'll test if uninstalling it after works on this run I guess	22:08
fungi	if we're not puppeting anything on those servers any longer, i'd uninstall puppet first before doing any upgrading	22:09
clarkb	that works too	22:09
clarkb	and yes we are not puppeting anymore with the move to ansible for lists	22:09
fungi	fewer unknowns that way	22:09
clarkb	wfm	22:09
fungi	otherwise there's every chance the puppet packages will conflict with distro packages somewhere	22:10
fungi	like by insisting on a capped version of some dependency, or something	22:10
clarkb	fungi: the current pop up is unexpected as we should have that list shouldn't we?	22:10
fungi	yes, we should probably investigate it	22:10
clarkb	there is a /var/lib/mailman/lists/mailman list	22:11
clarkb	I wonder if this is just dpkg being extra verbose	22:11
clarkb	fungi: I'm going to select 'OK' to continue if you think that is good considering /var/lib/mailman/lists/mailman exists	22:12
fungi	yeah, i'm not sure why it's not finding that, but maybe we have a path getting overridden somewhere	22:13
clarkb	did you want to check anything else before I hit ok?	22:13
fungi	no, go ahead and okay it	22:13
fungi	we technically don't rely on the mailman list anyway, as we disable password reminders	22:14
fungi	we'll want to keep the modified templates i think? because we install them with ansible (though we may need to upgrade the ones in system-config once we're done upgrading everything)	22:15
clarkb	yup I think this is a keep	22:15
fungi	the maintainer versions will be added to the directory with a .dpkg-new extension on them for reference anyway	22:16
fungi	so we can compare later after the upgrade is done	22:16
clarkb	ok I guess just keep all of these for now then	22:17
fungi	especially if they're files in our ansible, since for the production upgrade ansible will just replace them anyway	22:17
fungi	looks like the new ntp.conf is ~equivalent to the one we were installing with puppet. do we no longer do that?	22:20
clarkb	fungi: ya we don't use puppet for that anymore and no puppet on this server. This is an install package manager version iirc	22:20
fungi	sounds good	22:21
clarkb	ya that is what I have on my notes from using the zuul test nodes. I'm going to select 'Y' here	22:21
fungi	ahh, okay, that looked like the daily ntp cronjob in your notes	22:21
fungi	we do manage the sshd config with ansible though, right?	22:22
clarkb	yes we do	22:23
clarkb	this is a keep ours	22:23
fungi	ahh, yeah your notes say yes	22:23
clarkb	fungi: the etherpad has the notes ya	22:24
fungi	before the reboot step, i want to open a second window in screen and check the exim4/mailman service states	22:24
clarkb	ok	22:25
clarkb	fungi: you are clear to check things.	22:27
fungi	opening a new screen window now, if you're ready	22:27
clarkb	yup I'm ready	22:28
fungi	it did indeed start mailman but not exim	22:28
fungi	what's sad is there's still /etc/rc2.d/K01mailman	22:29
fungi	as added by `systemctl disable mailman` earlier	22:29
clarkb	they are units too. Maybe we should do another systemctl stop mailman && systemctl disable mailman ?	22:29
clarkb	s/units too/units now/	22:29
fungi	also /etc/rc2.d/K01exim4	22:29
clarkb	I think it may ignore the compat stuff when there are valid units	22:30
clarkb	that was based on testing at some point	22:30
fungi	Removed /etc/systemd/system/mailman-qrunner.service.	22:30
fungi	so that's why indeed	22:30
clarkb	I added that step to my notes. Do you want to do the same with exim4 to see if it removes anything?	22:31
fungi	xenial to bionic upgrade switched from sysv-compat to systemd and didn't honor the existing service state	22:31
clarkb	looks like you just did I'll put that in the notes too	22:31
fungi	i did exim4 and it didn't remove anything	22:31
clarkb	yup	22:31
clarkb	doesn't hurt to run it again though	22:31
fungi	exim4.service is not a native service, redirecting to systemd-sysv-install.	22:31
clarkb	should I select y to reboot now?	22:32
fungi	when the exim4 service switches from sysv-compat to systemd i expect we'll see the same behavior for it	22:32
fungi	yep, go for it	22:32
clarkb	ya that may happen between bionic and focal and explain why I remember it causing trouble when tested previously	22:32
clarkb	it is rebooting now	22:32
clarkb	it is up	22:33
fungi	logged in and root screen session started again	22:34
clarkb	ok we can do the sanity checks there	22:34
clarkb	then continue to the focal upgrade	22:34
fungi	it did not restart those services on reboot	22:34
clarkb	yup	22:35
fungi	puppet uninstall command lgtm	22:35
clarkb	fungi: does that puppet removal look correct to you? we didn't do it prior to upgrading to bionic but can do that on the prod server. I figure we should purge it out now	22:35
clarkb	cool running it	22:36
fungi	do we want to autoremove as well?	22:36
fungi	yes please	22:36
clarkb	++	22:36
fungi	you can also do a clean after that to clear out the previous downloads and free up some space in /var	22:37
clarkb	like that?	22:38
clarkb	oh I guess autoclean != clean. Do you think we should do a clean here?	22:38
fungi	sure, clean would also remove packages which still exist on the mirrors	22:38
clarkb	doesn't seem to make a difference here?	22:39
fungi	autoclean only removes local downloaded copies of things which are no longer on the mirrors	22:39
fungi	(in theory anyway)	22:39
fungi	also possible the do-release-upgrade script already did that for us	22:39
clarkb	ya it may have done so	22:39
clarkb	Alright the next step on the etherpad is upgrading to focal unless there is other sanity checking you want to do	22:40
fungi	nothing i can think of	22:40
fungi	hopefully we'll get fewer debconf prompts from this upgrade	22:44
clarkb	fungi: from my previous notes I selected yes here but do you think we should select no to maybe avoid things like exim and mailman getting started?	22:45
fungi	well, i think it's worth trying to see. if they do start it's not the end of the world since we cleaned things out	22:45
clarkb	fungi: select yes then?	22:45
fungi	and in production upgrade scenarios it won't matter	22:45
fungi	yeah, go for the yes i guess	22:45
clarkb	ok I'll select yes	22:45
fungi	ideally restarts would only be triggered for already running services	22:46
clarkb	I think the text reported it was restarting exim but I don't see it running	22:46
clarkb	so it must be smart about that	22:46
clarkb	for server upgrade planning I'm thinking we can do something like thursday: Stop services on lists.kc.io then shut it down and snapshot it. Then start it again without services running and go through this upgrade process. If that all checks out do similar for lists in a week or two?	22:50
clarkb	*do similar for lists.o.o in a week or two	22:50
clarkb	keeping snmpd.conf because it is ansible managed	22:52
fungi	yeah, noting that lists.o.o imaging will take hours even if shut down	22:52
clarkb	hrm ya maybe we need to think about that first	22:52
clarkb	shouldn't be a problem to proceed with lists.kc.io as it snapshots quickly	22:52
fungi	we have backups of lists.o.o	22:52
fungi	for the most part these debconf prompts should be the same ones we kept our versions of in the previous upgrade	22:53
clarkb	yup just fewer of them if previous testing is an indication	22:54
fungi	that was reasonably quick	22:57
fungi	want to pause again at the reboot step so i can check running exim and mailman services	22:58
clarkb	yup	22:58
fungi	neither is running nor enabled	22:59
fungi	exim4 is still via sysv-compat too	22:59
fungi	should be safe to reboot	22:59
clarkb	ok it must have only been mailman that was a problem last time	22:59
clarkb	rebooting	22:59
fungi	i'm ssh'd in again with a new root screen started	23:01
clarkb	joining	23:01
fungi	i'm paranoid about iptables rules getting wiped out due to circular symlinks after that one time where the wiki server ended up with an exposed elasticsearch	23:04
fungi	but lgtm	23:04
clarkb	if you talk to apache it says there are no mailing lists	23:04
clarkb	but the lists are listed in /var/lib/mailman/lists	23:05
fungi	also `list_lists` spits them out	23:06
fungi	but seems /usr/lib/cgi-bin/mailman/listinfo isn't finding them	23:06
fungi	however, this part we can probably troubleshoot tomorrow, if you want a break	23:06
opendevreview	Merged opendev/system-config master: Remove extra service-codesearch job in deploy https://review.opendev.org/c/opendev/system-config/+/804796	23:09
clarkb	ya I'm thinking this has been a number of hours of mailman upgrade stuff so far. Definitely want to figure this out but maybe we can do that tomorrow	23:09
clarkb	I need to get our meeting agenda out today too	23:09
fungi	cool, in that case i'm a get back to yardwork before i run out of sunlight	23:09
clarkb	fungi: do you think we should stop apache or shutdown the test server for now?	23:09
clarkb	or just leave it be?	23:09
ianw	clarkb/fungi: did you have any thoughts on restricting the redirect for paste to specific UAs in https://review.opendev.org/c/opendev/system-config/+/804539 ?	23:15
ianw	i don't really mind if we want to just leave the http site up, just seemed like an option	23:15
clarkb	ianw: does the paste command use a UA that we can key off of	23:17
clarkb	?	23:17
ianw	clarkb: i linked to the change that i think implements it, added in ~2014	23:19
clarkb	I'd be ok with redirecting for everything else	23:20
fungi	ianw: yeah, saw the comment, seems like a fine idea, i just haven't had time to work out the details and test today	23:21
ianw	no probs, i can have a look since i broke it :)	23:21
clarkb	I think the issue has to do with mailman's vhosting. lists know what url they belong to and if you lookup from the wrong url it doesn't work	23:27
clarkb	using /etc/hosts to override locally seems to fix it for me. fungi you can confirm tomorrow	23:27
clarkb	I'm going to context switch to meeting agenda stuff now	23:28
clarkb	and can pick this up tomorrow	23:28
fungi	clarkb: oh, yep, great guess, that does seem to be the answer	23:48
clarkb	another thing I found when digging into that is you can set the python version explicitly which might need to be done more carefully with our multisite mailman since we do config things there	23:50
clarkb	something to check out	23:50
fungi	my bigger concern with the multi-site mailman is adapting the systemd service unit to it	23:52
clarkb	I don't think that is an initial issue as the sysv stuff should keep working	23:53
clarkb	it did on the zuul test nodes	23:53
clarkb	we leave the systemd unit as disabled, then can followup with unit conversions if we like	23:53
clarkb	agenda is sent	23:56
