Thursday, 2020-05-21

ianw	you can have multiple certs; we take the first name listed in the cert and add that to the list to be checked	00:00
ianw	so, e.g. in the gitea case, we check gitea0X:3000, but don't check opendev.org	00:01
ianw	now the dib python35 job is showing "ERROR: No matching distribution found for oslotest===4.2.0 "	00:02
fungi	i guess it's a question of whether le updates can break for some certs on a host and not others?	00:02
ianw	fungi: so all certs should get an entry; just if that cert covers multiple names, we only take the first one. so like "mirror01.x" and "mirror.x" we will take the first, mirror01 and put that in the list	00:05
ianw	but if the cert @ mirror01.x is valid, that implies that the same cert which is used for mirror.x is ok?	00:06
fungi	got it. so we're still testing each individual cert	00:06
fungi	that wfm	00:06
ianw	to be concrete, https://zuul.opendev.org/t/openstack/build/6d1c8cf7ba95499496910f2f5bd2b97e/log/bridge.openstack.org/certcheck/ssldomains	00:10
ianw	compare to https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/templates/host_vars/letsencrypt01.opendev.org.yaml.j2	00:11
ianw	we end up checking letsencrypt01.opendev.org and someotherservice.opendev.org	00:12
fungi	yep, cool	00:12
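The selection rule ianw describes (one entry per certificate, always the first name listed) can be sketched as below. This is only an illustration: the `certs` mapping, its keys, and `alias.opendev.org` are hypothetical stand-ins, not the actual certcheck code or host_vars layout.

```python
def first_names(certs):
    """Return one hostname to check per certificate: the first name listed."""
    return [names[0] for names in certs.values() if names]


# Hypothetical data; only the two checked hostnames come from the discussion.
certs = {
    "main-cert": ["letsencrypt01.opendev.org", "alias.opendev.org"],
    "secondary-cert": ["someotherservice.opendev.org"],
}

print(first_names(certs))
```

Every certificate still gets checked; only the extra names covered by the same certificate are skipped.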
openstackgerrit	Jeremy Stanley proposed openstack/project-config master: Replace old Ussuri cycle signing key with Victoria https://review.opendev.org/729804	00:17
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: Drop support for python2 https://review.opendev.org/728889	00:32
*** dzho has quit IRC		00:40
*** markmcclain has quit IRC		02:25
*** ysandeep|away is now known as ysandeep		02:33
*** Eighth_Doctor is now known as Conan_Kudo		02:46
*** Conan_Kudo is now known as Eighth_Doctor		02:46
*** markmcclain has joined #opendev		02:49
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: package-installs: allow when filter to be a list https://review.opendev.org/727049	04:04
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: fix HWE install for focal https://review.opendev.org/727050	04:04
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal : only install 16.04 HWE kernel on xenial https://review.opendev.org/726996	04:04
openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: Add Ubuntu Focal test build https://review.opendev.org/725752	04:04
*** Meiyan has joined #opendev		04:15
*** ykarel|away is now known as ykarel		04:23
*** tkajinam has quit IRC		04:26
*** tkajinam has joined #opendev		04:26
*** raukadah is now known as chandankumar		05:32
*** dpawlik has joined #opendev		06:01
*** ianw has quit IRC		06:30
*** ianw has joined #opendev		06:33
*** slaweq has joined #opendev		06:57
*** dpawlik has quit IRC		07:05
*** dpawlik has joined #opendev		07:18
*** dpawlik has quit IRC		07:24
*** dpawlik has joined #opendev		07:25
*** tosky has joined #opendev		07:31
slaweq	frickler: hi	07:34
*** dpawlik has quit IRC		07:34
slaweq	frickler: may I ask You about one infra and CI related test?	07:34
*** dpawlik has joined #opendev		07:34
slaweq	frickler: we recently introduced in neutron-tempest-plugin test which is pinging some external IP address to check external connectivity is really ok, see https://review.opendev.org/#/c/727764/	07:35
slaweq	frickler: it's skipped by default, but would infra-root have anything against if we would configure some IP address (I don't know what would be the best one) to run this test in u/s gate?	07:36
slaweq	frickler: something like ping 8.8.8.8 or similar	07:36
*** dpawlik has quit IRC		07:44
*** dpawlik has joined #opendev		07:45
*** dpawlik has quit IRC		07:49
*** dpawlik has joined #opendev		07:50
*** moppy has quit IRC		08:01
*** moppy has joined #opendev		08:01
*** DSpider has joined #opendev		08:11
*** dpawlik has quit IRC		08:11
*** dpawlik has joined #opendev		08:11
*** lpetrut has joined #opendev		08:15
*** dpawlik has quit IRC		08:16
*** dpawlik has joined #opendev		08:16
*** jaicaa has quit IRC		08:17
*** jaicaa has joined #opendev		08:20
*** ysandeep is now known as ysandeep|lunch		08:20
*** yuri has joined #opendev		08:45
*** iurygregory has quit IRC		08:51
*** ysandeep|lunch is now known as ysandeep		08:57
*** ykarel is now known as ykarel|lunch		09:08
*** dpawlik has quit IRC		09:10
*** dpawlik has joined #opendev		09:10
*** dpawlik has quit IRC		09:14
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684	09:26
*** iurygregory has joined #opendev		09:39
*** ykarel|lunch is now known as ykarel		09:56
*** smcginnis has quit IRC		11:26
*** dpawlik has joined #opendev		11:29
*** smcginnis has joined #opendev		11:33
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/system-config master: Switch prep-apply.sh to use python3 https://review.opendev.org/729543	11:45
*** ysandeep is now known as ysandeep|afk		12:03
*** rosmaita has left #opendev		12:11
*** ysandeep|afk is now known as ysandeep		12:29
fungi	slaweq: i thought we already had a similar role in devstack to ping the git farm or something... checking	12:32
*** ysandeep is now known as ysandeep|afk		12:32
fungi	slaweq: i must have been thinking of this: https://opendev.org/openstack/devstack-gate/src/branch/master/playbooks/roles/network_sanity_check/tasks/main.yaml#L18-L19	12:51
fungi	i guess we didn't implement anything similar in devstack	12:51
fungi	but it seems like a reasonable idea to have a network sanity test which performs a quick ping of some bits of our infrastructure	12:52
fungi	(i would not recommend pinging 8.8.8.8 though)	12:52
fungi	ahh, i see, so for the specific bug you linked, pinging anything that's not directly on the job node ought to be sufficient. the local mirror service in the provider is your best bet, since your jobs should rely on that server being reachable from the job node anyway so if it isn't the job will have failed before it gets to the point of creating nested instances	12:56
*** ysandeep|afk is now known as ysandeep		12:58
fungi	slaweq: the identity of the local mirror server should be available from /etc/ci/mirror_info.sh on each node, there will be a "export NODEPOOL_MIRROR_HOST=..." line in there which you can resolve to an ip address if you need a raw address and not a dns name	13:03
fungi	also it looks like we set a zuul_site_mirror_fqdn ansible fact, which you could use to pipe into your script instead	13:05
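A minimal sketch of fungi's suggestion: pull `NODEPOOL_MIRROR_HOST` out of `/etc/ci/mirror_info.sh` and resolve it to an address. The helper names are hypothetical, and the parsing assumes a plain `export NODEPOOL_MIRROR_HOST=<hostname>` line; if the real file uses shell expansions, sourcing it from a shell would be more reliable.

```python
import re
import socket


def mirror_host(mirror_info_text):
    """Extract the NODEPOOL_MIRROR_HOST value from mirror_info.sh contents."""
    match = re.search(
        r"^export NODEPOOL_MIRROR_HOST=(\S+)", mirror_info_text, re.MULTILINE
    )
    return match.group(1).strip("'\"") if match else None


def mirror_address(path="/etc/ci/mirror_info.sh"):
    """Resolve the mirror hostname to an IPv4 address, or None if not found."""
    with open(path) as handle:
        host = mirror_host(handle.read())
    return socket.gethostbyname(host) if host else None
```

On a job node, `mirror_address()` would give a raw IP to hand to the ping test instead of an external address such as 8.8.8.8.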
*** ykarel is now known as ykarel|afk		13:14
mordred	fungi, slaweq we should check with clarkb about being forward-compatible with the intended new system for communicating mirrors to jobs	13:24
mordred	(I agree, pinging the mirror is definitely the right choice)	13:24
mordred	fungi, AJaeger: new promote-javascript-deployment job did not work for zuul (it's ok that it failed, we don't happen to use the results of it currently)	13:26
*** lpetrut has quit IRC		13:49
zbr	who can help me with gerritlib? i have a couple of patches and i also need a new release for the yesterday bugfix.	13:52
zbr	https://review.opendev.org/#/c/729734/	13:52
*** sgw has quit IRC		13:57
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966	14:03
*** roman_g has quit IRC		14:05
corvus	zbr: how was the POLLIN check failing?	14:06
openstackgerrit	Merged opendev/gerritlib master: Enable py36-py38 testing https://review.opendev.org/729734	14:06
corvus	zbr: (did you get an event with POLLIN and some other bit set?)	14:06
zbr	i got returned 3	14:06
zbr	which is valid result	14:06
zbr	i guess that https://code.woboq.org/gcc/include/sys/epoll.h.html#_M/EPOLLIN should be self-explanatory	14:07
corvus	so it was pollin and pollpri ?	14:08
zbr	yep	14:08
corvus	though that's for epoll	14:08
corvus	but it's the same for regular poll	14:09
zbr	macos + py38, i suspect is py38 specific.	14:09
corvus	so there is input, and an error	14:09
mordred	corvus: or, input and a flag indicating it's "priority" input, no?	14:10
corvus	mordred: pollpri means "an exceptional condition" the poll manpage includes some	14:10
corvus	i'm curious what it would be in this case	14:11
mordred	ah - nod	14:11
corvus	because i'm not sure the original code was wrong	14:11
corvus	(the original code was "read if there is data and no exceptional condition")	14:11
corvus	i'm open to changing it, but i'd like to understand why it's okay to read when there's an exceptional condition, what the exceptional condition we intend to handle is, and why it's caused and how we should handle it	14:12
mordred	corvus: https://stackoverflow.com/questions/10681624/epollpri-when-does-this-case-happen says some words	14:12
zbr	urgent, likely telling you process that data faster or i will start dropping it	14:13
*** sgw has joined #opendev		14:14
zbr	we could think to improve it in the future, but clearly we need to read.	14:14
corvus	i don't think that's what that means; i don't actually know how the standard out channel of an ssh connection could even send oob data.	14:14
mordred	corvus: most of the other docs I'm finding (other than the man page) - like the rust api docs - all describe it as "there is urgent data" instead of "there is an exceptional condition" - but I still don't understand what that means - so I agree with you as to wanting to understand how we got to that state	14:14
*** ykarel|afk is now known as ykarel		14:15
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974	14:15
mordred	corvus: https://lore.kernel.org/netdev/CAD56B7fCUyWG8d-OT__B3SbEfY=AdiZGJVdjSZ7qurqLUufvgg@mail.gmail.com/T/ has some discussion around POLLPRI and ssh	14:17
openstackgerrit	Merged opendev/gerritlib master: Replace testrepository with stestr https://review.opendev.org/729742	14:19
zbr	for me is a no brainer, event is a bitmask, and POLLIN declares that there is data to be read. In that case i do not care about any other flag, my loop goal being to read.	14:20
zbr	it gets more interesting when you do not have data to read	14:20
corvus	on the contrary, i want the loop to exit if there is an error	14:20
corvus	your patch will change that	14:20
corvus	(with your change, it could fail to exit on a loop if there is an error and data to read)	14:21
fungi	yes, continuing to read from a socket which has raised an error is not guaranteed safe, and there are plenty of scenarios where we could end up with a hung process indefinitely reading from a dead socket that way	14:21
zbr	aimho, it should read until there is nothing else to read.	14:21
zbr	a dead socket producing data?	14:22
fungi	zbr: what if that read call never returns?	14:22
zbr	fungi: same would apply even without this condition.	14:22
corvus	zbr: a poll can return with data and an error at the same time. your code would read the data, ignore the error, then go right back to the poll again. what happens after that would be undefined.	14:22
corvus	zbr: or possibly read the data and fail.	14:23
mordred	right - which is why we need to understand under what conditions the PRI flag is set and what it's trying to communicate	14:23
zbr	i think it would be better than now, where 0x3 produces an error.	14:23
mordred	zbr: 0x3 producing an error may be correct behavior	14:23
fungi	if we don't ignore errors we can try to handle them, say by closing out and reopening the socket (depending on the error)	14:24
fungi	there are options other than ignoring possible error conditions or dying on the spot	14:25
corvus	if there were oob data, then since we don't set so_oobinline, we would need to set the msg_oob flag to recv to retrieve it	14:28
corvus	(though, like i said, i doubt the issue is oob data)	14:29
zbr	How about logging when the event has anything else than just POLLIN, but reading.	14:29
corvus	mordred: i've skimmed that link, but i haven't found anything there i can apply to our situation	14:29
*** Meiyan has quit IRC		14:30
zbr	one random logic example: https://github.com/kdart/pycopia/blob/master/core/pycopia/asyncio.py#L197-L200	14:30
zbr	as seen here, other flags should not prevent reading data.	14:31
zbr	even if we do not implement them	14:31
corvus	zbr: that's a different approach than i think we should take.	14:32
corvus	i think we should understand the issue and handle it	14:32
corvus	zbr: since it only seems to be reproducible in your environment, maybe you can determine whether it's oob data or something else causing the pri flag to be set?	14:33
zbr	i need some hints on how to figure it out	14:33
zbr	did any of you tried to run elastic-recheck with py38?	14:34
zbr	i wonder if the issue is specific to macos, py38 or both.	14:34
corvus	zbr: i'd try calling recv with MSG_OOB and see if you get data	14:34
mordred	zbr: my hunch is going to be specific to macos - I would doubt py38 has anything to do with it since it would be something setting that flag in the networking stack	14:35
fungi	i'm still not finding any indication that tcp/urg is commonly used for ssh sockets	14:35
fungi	(tcp/psh definitely is, but urg is surprising there)	14:35
zbr	does any of you have an example on how to call recv from there?	14:44
zbr	yep, i do get OOB data, probably as json. stdout.channel.recv(socket.MSG_OOB)	14:49
corvus	zbr: i was going to suggest recv(stdout.channel.fileno()) but that looks plausible too :)	14:50
corvus	zbr: i'm really curious what data you get	14:50
mordred	me too	14:50
mordred	corvus: I feel like OOB packets are a thing we don't use nearly enough ;)	14:51
corvus	mordred: i know	14:51
mordred	there are whole sets of socket flags we're not making use of to their fullest!	14:51
corvus	mordred: speaking of which, i had a dream last night about finally adding in that "this job will fail but it's still running" flag to zuul.	14:52
mordred	corvus: (honestly, gearman job cancelling seems like a good use-case for those)	14:52
mordred	corvus: what a fun dream!	14:52
corvus	(the gearman warning result; which, at this point, we probably wouldn't handle with gearman, but hey)	14:53
corvus	(i'm reminded of this because we need an oob ansible->executor signal to accomplish this)	14:54
zbr	how do i read the entire buffer?	14:54
corvus	zbr: recv takes a length arg, so it's actually socket.recv(4096, socket.MSG_OOB)	14:56
corvus	zbr: (socket.recv(socket.MSG_OOB) will read one byte because socket.MSG_OOB == 1)	14:56
corvus	zbr: then you do that inside the poll loop until POLLPRI is clear	14:56
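What corvus describes applies to a real TCP socket object. A minimal sketch, with `read_urgent` as a hypothetical helper name; it is not gerritlib code, and it assumes the socket does not have `SO_OOBINLINE` set:

```python
import socket


def read_urgent(sock, bufsize=4096):
    """Return pending out-of-band bytes from a TCP socket, or b'' if none."""
    try:
        # The buffer size comes first and the flags second. Passing only
        # socket.MSG_OOB would be treated as a buffer size instead.
        return sock.recv(bufsize, socket.MSG_OOB)
    except OSError:
        # Typically raised when no urgent data is pending.
        return b""
```

Inside a poll loop this would be called whenever `POLLPRI` is reported, until the flag clears.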
zbr	not this one, TypeError: recv() takes 2 positional arguments but 3 were given	14:57
corvus	oh, then that's going to be a paramiko thing	14:57
corvus	http://docs.paramiko.org/en/stable/api/channel.html#paramiko.channel.Channel.recv	14:58
corvus	that only lists one arg :/	14:58
zbr	the api is socket-like, but is not a real socket	15:00
corvus	you could use the socket methods on the underlying fd	15:02
zbr	so we may not have a OOB in the end	15:02
zbr	not any progress, so far we have no proof that that is a OOB message.	15:11
zbr	but documentation tells us that it can be, so not sure why we would keep the code crashing for this case.	15:12
corvus	zbr: what documentation tells us to expect pri to be set when we connect to gerrit stream events over ssh?	15:12
zbr	i patched it locally and it seems to be happy to process data, so that is priority is not an error	15:12
zbr	for me https://code.woboq.org/gcc/include/sys/epoll.h.html#EPOLL_EVENTS is enough.	15:14
zbr	testing a bitmask with == is almost never correct.	15:15
corvus	zbr: it is correct if you want to test that only one bit is set, which is the intent here	15:15
corvus	i thought we covered that already	15:15
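The two positions in this exchange correspond to two different tests on the poll event bitmask. A small sketch using the standard `select` constants (the function names are hypothetical):

```python
import select


def only_pollin(event):
    """Strict test: POLLIN is set and no other bit is."""
    return event == select.POLLIN


def has_pollin(event):
    """Permissive test: POLLIN is set, whatever else is set alongside it."""
    return bool(event & select.POLLIN)


# The value zbr reported (3) corresponds to POLLIN combined with POLLPRI.
event = select.POLLIN | select.POLLPRI
print(only_pollin(event), has_pollin(event))
```

The strict form treats any extra bit as a reason not to read, which is the conservative behaviour corvus wants to keep; the permissive form reads whenever data is available, including when `POLLERR` or `POLLHUP` is also set.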
*** tkajinam has quit IRC		15:16
zbr	is http://man7.org/linux/man-pages/man2/poll.2.html enough?	15:16
zbr	search for POLLPRI	15:17
zbr	i do not see any of the documented cases as a good enough reason to raise an exception	15:17
fungi	it's suggesting there is urgent out of band data which should be read, and we're not (yet) reading that, so it sounds like a potential problem to me	15:18
zbr	to log something yes, but not to quit (and likely lose data too)	15:18
*** ykarel is now known as ykarel|away		15:18
fungi	without knowing what the urgent message is, hard to say whether it's safe to just log	15:18
zbr	where is the proof that there is a message?	15:19
corvus	zbr: the man page explains what the different values mean. they don't tell us why you alone are receiving that value now, and our production systems have not hit this in 8 years of continuous use.	15:19
corvus	the man page doesn't tell us how to handle the case, only that the case may happen	15:19
zbr	good that we do not write air-control software...	15:20
fungi	atc software would not be allowed to ignore error conditions ;)	15:20
fungi	just ask clarkb	15:20
corvus	and indeed, this program ignores no error conditions. your patch does.	15:20
zbr	because for me, even without knowing the nature of the urgent notification, i prefer to continue processing data (and eventually log something)	15:21
corvus	so yeah, this program is written in a very conservative way as far as errors go	15:21
clarkb	fungi: yes, though setting the reboot routine as your interrupt handler for major errors is apparently acceptable :)	15:21
zbr	what if your phone crashes on a specific text message? (oops, that already happened)	15:21
zbr	please show me where is written that any bitmask other than POLLIN is an error?	15:22
clarkb	(we put the interrupt routine at address 0)	15:22
clarkb	er reboot interrupt	15:22
fungi	zbr: it's not necessarily an error, but it could indicate an error so not handling it is potentially a bug	15:23
zbr	in fact even the presence of POLLOUT would break the current logic.	15:23
clarkb	I think the big thing here is as corvus said we've not seen this in almost a decade of using these tools. The fact that it shows up is noteworthy and worth understanding in order to address properly	15:23
*** dpawlik has quit IRC		15:24
zbr	i would add a logging.error for this purpose.	15:24
zbr	at least we would know if it happens and how often.	15:24
clarkb	zbr: re pollout there are no writes on that connection iirc	15:27
zbr	sadly searching for paramiko and POLLPRI brings zero results	15:27
clarkb	ya its a stdout channel, so we should never write to it	15:28
zbr	do you want me to setup a tmate with debugger?	15:28
clarkb	zbr: have you tried recv on the fd?	15:28
clarkb	(basically bypass whatever higher level apis are in front of the actual file descriptor and read directly)	15:29
zbr	if i only can find the fd, no variable with similar name so far.	15:30
corvus	zbr: did you try stdout.channel.fileno() ?	15:31
*** dirk has quit IRC		15:38
zbr	ok, i have `stdout.channel.fileno` which looks like one, but i cannot get a socket if there is no socket to start with.	15:40
*** priteau has joined #opendev		15:40
zbr	i cannot create a socket out of a fd.	15:40
clarkb	zbr: I think you can call os.read() directly against a fd	15:41
zbr	socket and paramiko.Channel are very similar but not identical. I cannot give any flags to os.read() to ask about OOB stuff.	15:42
zbr	i am also inclined to believe that OOB stuff is specific to socket, so it may not work with paramiko	15:43
clarkb	zbr: looking at https://www.gnu.org/software/libc/manual/html_node/Out_002dof_002dBand-Data.html I think you can use a combo of read() and ioctl	15:44
clarkb	basically pretending to be C via python	15:44
corvus	i have to run some errands now, but i ask that if we decide that we want to ignore pollpri, then we do so in such a way that doesn't change the error handling structure of that code. in other words, just ignore pollpri.	15:45
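One way to read corvus's request is to mask out `POLLPRI` before the existing strict comparison, so that every other unexpected bit still stops the loop. A sketch under that assumption; `should_read` is a hypothetical name, not the gerritlib implementation:

```python
import select


def should_read(event):
    """True when POLLIN is set and no other bit is, ignoring POLLPRI."""
    return (event & ~select.POLLPRI) == select.POLLIN
```

With this test, `POLLIN | POLLPRI` is read like plain `POLLIN`, while `POLLIN | POLLERR` or `POLLHUP` still fall through to the error path as before.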
clarkb	zbr: basically you do that discard_until_mark thing, then read after that to get the OOB data? Since OOB data automatically marks the stream	15:46
clarkb	also if we decide the OOB data is important I think that is roughly what gerritlib will need to do anyway	15:47
*** mlavalle has joined #opendev		15:52
zbr	for the moment I raised https://github.com/paramiko/paramiko/issues/1694	16:00
clarkb	slaweq: mordred fungi every job should already do external connectivity tests	16:01
* clarkb looking that up now		16:02
fungi	clarkb: well, in this case what they want to test is that a nested instance can reach beyond neutron	16:02
clarkb	I see, does that include off of the middle hypervisor?	16:03
mordred	(which is a nice test of north/south traffic)	16:03
clarkb	because devstack networking doesn't actually set that up	16:03
fungi	and i guess pinging a fabricated virtual interface elsewhere on the job node might mask some failure cases? not really sure	16:03
clarkb	there is no double nat for nested instance FIPs to allow return traffic to find the middle hypervisors interface. packets will go out sourced from the nested instance fip then poof disappear	16:04
fungi	yeah, i expect they will have to do some masquerading on the job node to make that work	16:04
fungi	seems doable though, i don't really know what their test scenario looks like	16:05
clarkb	if the second layer of nat were set up it would probably just work (tm)	16:05
fungi	right	16:05
zbr	i find it interesting that gerritlib implementation is apparently the only one that does not give the select.POLLIN to poll.register()	16:06
zbr	http://codesearch.openstack.org/?q=select.POLLIN&i=nope&files=.*py&repos=	16:06
zbr	there are lots of examples in openstack, but that's the only register without filtering that I see.	16:07
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974	16:12
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974	16:12
*** ysandeep is now known as ysandeep|afk		16:14
*** hashar has joined #opendev		16:14
openstackgerrit	Merged openstack/project-config master: Replace old Ussuri cycle signing key with Victoria https://review.opendev.org/729804	16:15
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966	16:27
clarkb	mordred: if you have time could you look over https://review.opendev.org/#/c/729659/ for the gitea 1.12.0-rc1 testing? You normally do those updates so would be good for you to check I didn't miss anything obvious. I'm not sure we want to land that and run the rc code, but we probably could if people want to go for it	16:31
*** priteau has quit IRC		16:33
mordred	clarkb: what do you need gpg for?	16:35
mordred	(just curious - I looked at it yesterday and it looks good to me - +2 - but I'm about to run an errand, so didn't +A - feel free to though)	16:37
clarkb	mordred: not sure, the upstream image added it though. If I had to guess to work with signed commits	16:37
clarkb	infra-root https://review.opendev.org/#/c/729619/3 is at the bottom of corvus' stack to do more zuul things if you have a moment	16:37
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/elastic-recheck master: WIP: Create elastic-recheck docker image https://review.opendev.org/729623	16:43
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336	17:21
clarkb	fungi: do you have time for https://review.opendev.org/#/c/729619/ ? you reviewed its parent	17:24
*** ysandeep|afk is now known as ysandeep		17:25
fungi	sure, i can take a quick look	17:25
clarkb	also if anyone else wants to look at https://etherpad.opendev.org/p/XRyf4UliAKI9nRGstsP4 for advisory board bootstrapping now is the time to do that. I'm hoping to send that out soonish	17:26
clarkb	I believe I've incorporated/addressed all the existing feedback (thank you for that)	17:26
fungi	your updates lgtm	17:32
openstackgerrit	Merged zuul/zuul-jobs master: Remove failovermethod from fedora dnf repo configs https://review.opendev.org/729774	17:33
*** ysandeep is now known as ysandeep|away		17:40
openstackgerrit	James E. Blair proposed opendev/system-config master: Use sqlite with Zuul in the gate https://review.opendev.org/729786	17:43
openstackgerrit	Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336	17:48
openstackgerrit	Merged opendev/system-config master: Save zuul and nodepool logs from gate test jobs https://review.opendev.org/729619	17:54
openstackgerrit	Merged opendev/system-config master: Vendor the apt repo gpg keys used for Zuul https://review.opendev.org/729401	18:28
*** hashar has quit IRC		18:40
clarkb	ok email about advisory board has been sent	18:41
fungi	thanks!	18:58
smcginnis	hrw: Looks like your entire mail spool is dumping into #openstack-requirements	19:01
smcginnis	halp	19:02
clarkb	smcginnis: can you kick hrw (maybe even ban if rejoined)	19:02
smcginnis	Looks like he killed it.	19:02
hrw	sorry for that	19:02
fungi	anything sensitive? do i need to perform surgery on our channel log archives?	19:03
hrw	no, twitter feed	19:04
smcginnis	twitter2irc probably has limited use. :)	19:04
fungi	i hear there is such a thing	19:04
fungi	at least the message length is compatible ;)	19:05
hrw	;D	19:05
smcginnis	Hah	19:05
hrw	everything2irc more or less exists	19:05
hrw	just at different levels of usability	19:05
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Do not interpolate values from tox --showconfig https://review.opendev.org/729520	19:23
openstackgerrit	Merged opendev/system-config master: Run Zuul as the zuuld user https://review.opendev.org/726958	19:30
*** dpawlik has joined #opendev		20:08
*** dpawlik has quit IRC		20:52
*** hrw has quit IRC		20:56
corvus	infra-prod-base failed on that ^	20:58
clarkb	that was expected though right?	20:59
corvus	not really?	20:59
corvus	i'm looking at /var/log/ansible/base.yaml.log to try to ascertain the error	20:59
corvus	but all i see is several mb of json	21:00
corvus	okay, after scanning the list several times, i see this: review-dev01.opendev.org : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0	21:01
*** hrw has joined #opendev		21:01
corvus	so i guess something failed on review-dev01?	21:01
corvus	An exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root']	21:01
corvus	No space left on device	21:02
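Scanning several MB of log for the one bad host can be automated by filtering the recap lines. A rough sketch, assuming the recap appears one host per line in the format quoted above; `problem_hosts` is a hypothetical helper, and the real `base.yaml.log` may need its JSON unwrapped first:

```python
import re

# Matches lines such as "host.example.org : ok=0 changed=0 ... ignored=0".
RECAP = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def problem_hosts(log_text):
    """Map each host with failed or unreachable tasks to its recap counters."""
    problems = {}
    for line in log_text.splitlines():
        match = RECAP.match(line.strip())
        if not match:
            continue
        counters = {
            key: int(value)
            for key, value in (
                pair.split("=") for pair in match.group("counters").split()
            )
        }
        if counters.get("failed", 0) or counters.get("unreachable", 0):
            problems[match.group("host")] = counters
    return problems
```

Checking `unreachable` as well as `failed` matters here, since unreachable hosts can also cause the job to fail.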
clarkb	track_upstream.log is 23GB	21:04
clarkb	infra-root should we just cat /dev/null > /var/log/track_upstream.log for now?	21:05
corvus	yeah. it's throwing host key errors	21:06
corvus	so something is broken but i don't think we care right now	21:06
corvus	we'll get new errors real quick	21:06
clarkb	ok I'll run that now	21:06
clarkb	thats done and ya the file is already filling quick	21:07
clarkb	but we should be good for a while	21:07
corvus	re-enqueing 726958	21:07
clarkb	I've confirmed that the ssh key it doesn't like is the one reported by the server	21:11
clarkb	the track upstream command is using roots ssh configs not gerrit2's	21:13
clarkb	I'm going to see if I can ssh as root and accept the key	21:14
clarkb	ok its talking to port 22 not 29418	21:17
clarkb	I wonder if this is related to zbr's change	21:22
clarkb	2020-05-21 21:21:31,733: gerrit.GerritConnection - ERROR - Exception connecting to review-dev.opendev.org:29418 it thinks it is connecting to 29418 but if I look at the keys it seems to be getting the port 22 key?	21:22
corvus	clarkb: what change?	21:22
clarkb	corvus: https://review.opendev.org/#/c/729699/ or rather if its related to the sort of thing that addressed	21:25
*** DSpider has quit IRC		21:26
corvus	i thought from the log it was going on longer than that	21:27
clarkb	corvus: ya I don't think that change has caused it, but maybe there is a similar bug in gerritlib that is?	21:27
clarkb	also that change is not in the images on review-dev for gerrit because we consume gerritlib from releases (just double checked that)	21:28
openstackgerrit	Monty Taylor proposed openstack/project-config master: Allow ansible-collections-openstack-release team to push tags https://review.opendev.org/730129	21:29
clarkb	if we look at /root/.ssh/known_hosts the host key there is correct for port 29418	21:29
clarkb	if I ssh as root to the jeepyb user on port 29418 that works with no error (because the host key is correct)	21:30
clarkb	the key that it seems to be getting is the port 22 hostkey	21:30
clarkb	jeepyb defaults to 29418 if not set (I can't find us setting a different port)	21:30
clarkb	this is quite weird	21:30
clarkb	I'm going to try sshing now within the container	21:31
clarkb	manual ssh from within the container also works	21:31
clarkb	mordred: ^ any ideas?	21:32
corvus	separately, the infra-prod-base job failed again. this time there are no failures reported in the log, so i don't know why it "failed" -- https://zuul.opendev.org/t/openstack/build/83b7df5203064553a66b0a293f36c228	21:33
corvus	paste01.openstack.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0	21:33
corvus	could it be that?	21:33
clarkb	corvus: yes unreachable hosts cause it problems. ianw has a change up to deal with that by ignoring them (but I'm not sure that is correct either)	21:34
corvus	i think it had an ssh host key change?	21:34
corvus	hrm	21:35
corvus	well it said Warning: Permanently added the RSA host key for IP address '23.253.235.223' to the list of known hosts.	21:35
corvus	but i guess it had the key already with a different address?	21:35
corvus	i've enqueued the change again	21:36
clarkb	corvus: possible that ipv6 stopped working between those hosts so it fell back to ipv4?	21:36
corvus	maybe	21:36
corvus	it also said Last login: Thu Aug 16 16:59:31 2018 from 2001:4800:7818:101:3c21:a454:23ed:4072	21:36
*** Dmitrii-Sh has quit IRC		21:38
mordred	I just tried sshing to paste01 and I'm getting an ssh host key differs	21:40
mordred	AND	21:40
mordred	and it wants to do 23.253.235.223	21:40
mordred	and I can't ssh to the ipv6	21:41
clarkb	looking at the review-dev thing more I think what is happening is paramiko is finding a known hosts entry for the port 22 host key and applying it to port 29418. I cannot see how that is happening though	21:42
clarkb	and I see a bunch of 29418 connections so I think it is connecting to 29418 properly, just using the wrong known_hosts value	21:43
*** Dmitrii-Sh has joined #opendev		21:43
mordred	clarkb: blink blink	21:43
mordred	oh - hrm. I can ssh to paste.openstack.org just fine - it's paste01.openstack.org that has a conflicting hostkey for me - so that might just be old cruft somehow	21:44
mordred	still can't connect to ipv6 though	21:44
clarkb	there are no sshfp records that I can see	21:45
clarkb	(which was one idea for why it might've gotten mixed up)	21:45
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684	21:50
clarkb	I'm stumped	21:56
clarkb	ssh-keyscan, manual ssh, and manually reading configs shows me that everything should be happy	21:57
clarkb	anyone have other ideas for what we should look at ? I think the implication is that it isn't using the known_hosts file we bind mount in. But I have no idea what other known_hosts info it could be using (which is why I checked sshfp)	21:59
mordred	clarkb: no - because you tried from inside the container right?	21:59
clarkb	mordred: ya inside and out	22:00
mordred	clarkb: is it a user thing perhaps? are we running track-upstream as a different user inside the container than you're exec-ing in as?	22:00
corvus	what container?	22:00
* mordred trying to think of left-field things		22:00
mordred	corvus: the gerrit container	22:01
mordred	or - the track-upstream container - which uses the gerrit image	22:01
corvus	ah, sorry, gotcha	22:01
clarkb	corvus: mordred ya I'm replacing track-upstream command with bash and using the same docker mounts	22:01
clarkb	I did have to add -it though	22:01
mordred	yeah - that's normal	22:01
clarkb	mordred: externally the container is started as root and it runs as root in the container I think	22:01
clarkb	mordred: though maybe it can't read the known_hosts file?	22:01
corvus	there are a lot of containers running	22:02
corvus	with random names	22:02
clarkb	corvus: yes, they ar erung by cron and I think they are in a restart loop for a while before they give up	22:02
clarkb	*are run by	22:02
*** slaweq has quit IRC		22:02
mordred	we should maybe kill some of those	22:03
mordred	like: docker ps | grep -v gerritcompose_gerrit_1 | grep 'Up 5 weeks' | awk '{print $1}' | xargs -n1 docker stop	22:05
clarkb	oh could it be these are stale	22:05
mordred	clarkb: is it possible that this is not a problem anymore but was 5 weeks ago	22:05
clarkb	and new runs would be happy	22:05
clarkb	ya	22:05
mordred	and yeah - that's why you can't reproduce it	22:05
mordred	how about I run that?	22:05
corvus	1 sec	22:05
mordred	kk	22:05
corvus	mordred: ok clear from me	22:06
clarkb	ya I think that should be fine too	22:06
corvus	(i just wanted to verify they were track-upstream procs, and they are)	22:06
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684	22:06
mordred	k. stopped	22:06
clarkb	the next track upstream runs in about 35 minutes	22:06
clarkb	I guess we can see if it is happy then	22:06
mordred	yeah	22:07
mordred	I'm guessing it will be because _all_ of these were 5 weeks old	22:07
mordred	and we run cron hourly	22:07
clarkb	ya	22:07
mordred	phew	22:07
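The cleanup pipeline mordred ran above can be sketched in Python. This is only an illustration under assumptions: it takes tab-separated `ID<TAB>NAME<TAB>STATUS` lines (such as `docker ps --format '{{.ID}}\t{{.Names}}\t{{.Status}}'` would produce), the function name `stale_container_ids` is hypothetical, and the sample container IDs and names are made up. Unlike `grep -v`, it skips the long-running gerrit container by exact name rather than by substring.

```python
def stale_container_ids(ps_lines, keep_name="gerritcompose_gerrit_1",
                        status_prefix="Up 5 weeks"):
    """Return IDs of containers whose status starts with status_prefix,
    skipping the one container that must keep running."""
    ids = []
    for line in ps_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # ignore lines that are not ID, NAME, STATUS
        container_id, name, status = parts
        if name == keep_name:
            continue  # same intent as `grep -v gerritcompose_gerrit_1`
        if status.startswith(status_prefix):
            ids.append(container_id)  # same intent as grep + awk '{print $1}'
    return ids


# Hypothetical sample data, not taken from the real host.
sample = [
    "aaa111\tgerritcompose_gerrit_1\tUp 5 weeks",
    "bbb222\tpedantic_hopper\tUp 5 weeks",
    "ccc333\tzen_morse\tUp 2 minutes",
]
print(stale_container_ids(sample))
```

Each returned ID could then be passed to `docker stop`, which is what the `xargs -n1 docker stop` step does in the shell version.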
fungi	sorry, catching back up, so summary is that we filled up review's rootfs because there were a bunch of track-upstream invocations from 5 weeks ago which were continually filling the log with ssh host key mismatch errors due to incorrectly deployed containers from the same timeframe?	22:11
clarkb	fungi: yes, except it was review-dev not review	22:12
corvus	okay this time infra-prod-base failed here: mirror01.kna1.airship-citycloud.opendev.org : ok=25 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0	22:12
fungi	aha, okay, review-dev, got it	22:12
corvus	and also paste01 is still unreachable	22:12
corvus	i don't think the current situation is tenable	22:12
clarkb	I've reproduced the ipv6 connectivity issue to paste over port 22	22:13
corvus	clarkb: oh you've identified an issue?	22:13
clarkb	it doesn't ping either	22:13
clarkb	corvus: ya its looking like ipv6 to paste is not working	22:14
corvus	clarkb: but v4 is fine; so why is ansible unhappy?	22:14
clarkb	corvus: probably because ansible is preferring v6 since we set a specific ansible_host addr?	22:14
clarkb	corvus: I bet if we used names rather than addrs it would fallback from ipv6 to ipv4	22:15
clarkb	but we set a specific address instead iirc	22:15
corvus	indeed, paste01 is ansible_host: 2001:4800:7817:103:be76:4eff:fe05:176e	22:15
donnyd	so I started capturing some metrics on how many jobs are run per day and per week on OE	22:15
clarkb	donnyd: we've got some of that data in graphite too if you need it	22:16
clarkb	corvus: for now we can switch ansible_host to ipv4 address maybe?	22:16
corvus	the issue on the kna1 mirror seems disk related: -bash: /etc/profile: Input/output error	22:16
corvus	it seems possible that host may be about to start bombing jobs	22:17
clarkb	corvus: thats weird I'm able to open /etc/profile on that host	22:17
corvus	clarkb: mirror01.kna1.airship-citycloud.opendev.org ?	22:17
clarkb	corvus: ya	22:17
clarkb	and nothing in dmesg since it last rebooted	22:18
clarkb	(9 days ago)	22:18
clarkb	er wait	22:18
clarkb	I was looking at wrong terminal /me double checks	22:18
corvus	when i ssh in, i get immediately disconnected after the io error	22:18
clarkb	ya it looks like ssh succeeds but it just errors out after motd	22:18
clarkb	corvus: ya I missed that it was failing as it printed out the motd and everything and then you get your old shell prompt back	22:19
corvus	i'm assuming the disk is gone	22:19
clarkb	corvus: we can put it in the emergency file for now	22:19
clarkb	(that should solve the ansible side problem)	22:20
mordred	corvus: I agree - this isn't super tenable - but I'm not sure which thing would be better - if we put the base role in each service playbook, then we theoretically run a bunch of stuff each time we run a service that isn't useful, but at the cost of services being decoupled. we could also not make a service playbook dependent on base running - at the potential cost that something from base that's important	22:20
mordred	for a service doesn't get applied. but maybe shifting base roles into service playbooks is better (I think you advocated for that before)	22:20
clarkb	mordred: ianw's change does the second thing you suggested	22:20
clarkb	re that mirror should we try a reboot?	22:21
mordred	yeah - I wasn't super crazy about that change because I was worried we'd still miss important updates in base	22:21
mordred	clarkb: ++	22:21
mordred	clarkb: I don't think we can debug it any further in its current state	22:21
corvus	yeah, it just seems that threading the needle on having every host succeed is not working well for us today (we've got 3 moles needing whacking so far)	22:21
corvus	++reboot	22:21
clarkb	I'll work on the reboot of kna1's mirror	22:21
ianw	hrm, i think i had problems with that the other day	22:22
corvus	for paste -- should we just prefer ipv4 addrs always?	22:22
mordred	corvus: I think what I really want is a per-service base playbook that runs base and only gets run if base stuff has changed - but my brain can't quite wrap its head around implementing that	22:22
ianw	clarkb: 2020-05-20 08:00:46 UTC rebooted mirror.kna1.airship-citycloud.opendev.org ; it was refusing a few connection and had some old hung processes lying around	22:22
ianw	is it doing the same thing again?	22:22
clarkb	ianw: ya	22:22
clarkb	corvus: I think ipv4 is likely to be more reliable in general	22:22
mordred	ianw: we don't know - we can't get in to see if it has hung processes lying around	22:22
corvus	i think we wanted ip addrs for ansible_host so that we wouldn't mess up a prod server when booting a replacement	22:22
mordred	corvus: yes, that is correct	22:23
clarkb	but thats based on having comcast provided ipv6 in the past where things would randomly just not work	22:23
mordred	clarkb: I think ipv6 has been mostly stable for us	22:23
corvus	mostly -- but i think we've still seen more issues with, say, gerrit than ipv4. <--anecdata	22:23
clarkb	mordred: we've seen similar flakiness fwiw	22:23
mordred	I bet if we reboot paste (or just did a restart of the ipv6 networking stack) ipv6 would come back there	22:23
clarkb	mordred: some of the logstash worker nodes can't talk to the gearman server via ipv6	22:23
clarkb	so they are slow to start up as they fall back to ipv4	22:23
clarkb	mordred: thats possible	22:24
corvus	mordred: oh you reckon that's client side	22:24
corvus	i'll reboot paste	22:24
mordred	corvus: yeah - the host seems to think it has an ipv6 address, but I can't get to it - so since it's been up for >800 days I wouldn't be surprised if something got borked in rax routing tables	22:24
clarkb	I've asked the mirror in kna1 to reboot	22:25
ianw	i'm pretty sure i still have an open ticket about rax hosts not being able to do symmetrical ipv6	22:25
fungi	catching back up again, yes rebooting instances in rackspace has sometimes solved the sudden ipv6 connectivity breakage	22:25
fungi	ianw: i think that ticket was eventually closed	22:26
clarkb	mirrors up	22:26
fungi	if memory serves, the resolution was that "all" you need to do is delete and recreate the port/interface	22:26
fungi	(in which case you may as well delete and redeploy the instance)	22:27
corvus	paste is back up, but v6 isn't any better	22:27
ianw	yeah, https://portal.rackspace.com/610275/tickets/details/190621-ord-0000027	22:27
corvus	oh, well, it can at least ping outbound v6 now	22:27
clarkb	nothing in syslog or kern.log	22:27
clarkb	on the mirror for ^	22:28
fungi	i suppose we could switch the inventory to the v4 address? should probably also take the aaaa out of dns for it	22:28
fungi	at least until we can deploy a replacement	22:28
corvus	mordred: are you on paste from a v6 addr?	22:28
clarkb	also mirror had ansible things in syslog from 21:40ish	22:29
fungi	i'm able to ssh to paste over ipv6	22:29
clarkb	so it seemed to be a relatively recent occurrence	22:29
fungi	maybe it's fixed?	22:29
ianw	PING paste.openstack.org(paste01.openstack.org (2001:4800:7817:103:be76:4eff:fe05:176e)) 56 data bytes	22:29
ianw	64 bytes from paste01.openstack.org (2001:4800:7817:103:be76:4eff:fe05:176e): icmp_seq=1 ttl=46 time=280 ms	22:29
ianw	from .au :)	22:29
corvus	but not from bridge.o.o	22:29
corvus	(though bridge can ping6 google.com)	22:30
fungi	yeah, when we see these connectivity issues, it's typically between specific instances in rackspace	22:30
mordred	corvus: yes	22:31
mordred	corvus: I am in over ipv6 - although I could not get in from my laptop over ipv6 before you rebooted	22:31
fungi	suggesting there's a dead route somewhere (and usually the packets only go missing in one direction according to packet captures)	22:31
mordred	well - I will say that gives weight to clarkb's suggestion of using ipv4 for ansible hosts	22:32
mordred	actually, I think that was a suggestion to hostname - but that does make replacements harder	22:32
fungi	also possible there's an incorrectly cached flow somewhere in their network which will age out now that the instance has been rebooted	22:32
clarkb	mordred: I was suggesting ipv4 as ime its more reliable	22:32
ianw	yep, https://portal.rackspace.com/610275/tickets/details/180206-iad-0005440 is the original issue about one-way ipv6 ... i bet pinging from paste you see the packets on bridge, but they never make it back	22:32
mordred	yeah	22:32
clarkb	ipv6 is almost as reliable but weirdness happens more often ime	22:32
mordred	if we occasionally get unexplained ipv6 issues between hosts in rax - then as much as I'd prefer v6 on general principle, it seems like we're opening ourselves up to more likely ansible woes	22:33
fungi	ipv6 is plenty reliable, it's that providers have deployed it in so many unreliable ways due to complex migration plans	22:33
clarkb	and that happens at the isp level (comcast had a bitbucket in denver back in the day) and in cloud providers (this thing with rax and vexxhost not being able to google dns)	22:33
mordred	fungi: yah	22:33
fungi	and also them not noticing bugs because so few customers tried to use it with any seriousness until relatively recently	22:34
fungi	s/bugs/misconfigurations/	22:34
mordred	corvus: I have an idea for base splitting that MIGHT not be 100% terrible (sort of the thing I suggested above but didn't know how to do)	22:37
corvus	mordred: cool :)	22:38
corvus	i'll see about switching our ansible_hosts to v4	22:38
mordred	corvus: what if we make child-jobs of infra-prod-run-base for each service that don't do anything different but add like an "ansible_limit_hosts" variable	22:38
mordred	corvus: and then make each service soft-depend on the host-specific version of base	22:38
mordred	when we DO change base it'll be a larger number of smaller jobs that have to run - but we won't have service-zuul not running because run-base couldn't talk to paste	22:38
mordred	s/host-specific/service-specific/	22:39
corvus	mordred: what would cause the service-specific jobs to run?	22:39
corvus	file matchers that are the same as the service?	22:39
mordred	no - file matchers that are specific to "base needs to run" - we'd ditch "run a single base playbook" completely - so if base doesn't need to run, it doesn't and we skip it	22:40
clarkb	mordred: if we did that wouldn't we want to fail if base failed?	22:41
mordred	yes - but then we're only failing if base fails for the specific service in question	22:41
mordred	so if base fails on a zuul host, then not running service-zuul is the right choice	22:41
corvus	ooh	22:41
clarkb	gotcha	22:41
clarkb	but gitea or whatever could still run	22:42
mordred	yeah	22:42
mordred	but we still get to use zuul file matchers to avoid running useless things	22:42
mordred	we could also do service-base specific file matchers if needed - like if we wanted to also trigger a base run for a service if host vars changed (because iptables)	22:42
mordred	but - probably just triggering all the base jobs if host_vars/* changes isn't a terrible idea	22:43
donnyd	clarkb: Oh, I didn't know we tracked that kind of data	22:43
corvus	mordred: so a change that touches something in base would cause all the little base jobs to run, and then, say, also the zuul service job (if the change was a zuul change). and the zuul service job only depends on its little bit of base.	22:43
mordred	corvus: yes	22:43
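The split-base idea discussed above can be modelled roughly as follows. This is a sketch under assumptions, not the real Zuul configuration: the function names, file paths and service names are hypothetical, and base job failures are simulated with a set. It assumes every per-service base job shares one set of file matchers, and that a service job is held back only when its own base job was triggered and failed. If base was not triggered at all, the soft dependency lets the service job run.

```python
from fnmatch import fnmatch


def matches(changed_files, patterns):
    """True if any changed file matches any glob pattern."""
    return any(fnmatch(f, p) for f in changed_files for p in patterns)


def services_to_run(changed_files, base_patterns, service_patterns, failed_base):
    """Return the services whose service job would run.

    base_patterns:    globs that trigger every per-service base job
    service_patterns: {service: [globs]} triggering each service job
    failed_base:      services whose base job fails (simulated)
    """
    base_triggered = matches(changed_files, base_patterns)
    runnable = []
    for service, patterns in service_patterns.items():
        if not matches(changed_files, patterns):
            continue  # file matchers skip this service entirely
        if base_triggered and service in failed_base:
            continue  # only this service is blocked by its own base failure
        runnable.append(service)
    return runnable


# Hypothetical paths and services for illustration only.
base_patterns = ["playbooks/roles/base/*"]
service_patterns = {
    "zuul": ["playbooks/service-zuul.yaml"],
    "paste": ["playbooks/service-paste.yaml"],
}
change = [
    "playbooks/roles/base/tasks/main.yaml",
    "playbooks/service-zuul.yaml",
    "playbooks/service-paste.yaml",
]
print(services_to_run(change, base_patterns, service_patterns, {"paste"}))
```

In this model a base failure on the paste host stops only the paste service job, while the zuul service job still runs, which is the decoupling the single run-base job lacks.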
clarkb	donnyd: I'm not sure we specifically track jobs per provider but do have resources per provider	22:43
*** yuri has quit IRC		22:43
clarkb	donnyd: but wanted to point out we have some related info	22:43
mordred	corvus: we could also do it with a bunch of little service-base playbooks and not with an ansible_limit_hosts variable - the downside to a limit_hosts var is that we're keeping host lists in 2 different places	22:44
mordred	but the idea would be the same	22:44
corvus	mordred: fwiw, base is gonna run on a lot of changes (any hostvars change)	22:44
donnyd	I was honestly just curious about the actual number of jobs launched per day/week	22:44
mordred	corvus: yeah. I mean - we could get more specific on those hostvars changes if we did individual playbooks	22:45
donnyd	1300-1500 per day and from the week total about 8500 on my crappy little cloud	22:45
donnyd	Not too shabby	22:46
mordred	donnyd: ++	22:46
donnyd	i wish i could make it go faster, which i am, but it will be a little while yet	22:46
corvus	mordred: what's the right way to run generate_inventory.py?	22:47
donnyd	I am replacing the 8 big compute servers with 32 little ones	22:47
mordred	corvus: I don't know that it's designed to be re-run - I think I just used it the first time - but let me look	22:48
donnyd	So i am curious to see if more == faster	22:48
corvus	mordred: to get the right clouds.yaml	22:48
mordred	corvus: yeah - that's just it - we don't have a split clouds.yaml any more	22:48
mordred	so that is not designed for our current situation (We should probably delete it)	22:48
mordred	corvus: I think you'd be better off doing a quick yaml transform on the existing inventory	22:49
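A quick transform of that kind might look like the sketch below. It assumes the inventory is already loaded into a dict shaped like `all -> hosts -> <hostname> -> vars`, and that each host carries a `public_v4` entry alongside `ansible_host`; both are assumptions about the inventory layout, the function name `prefer_ipv4` is hypothetical, and the addresses other than paste01's v6 address quoted earlier are documentation placeholders.

```python
import copy


def prefer_ipv4(inventory):
    """Return a copy of the inventory where each host's ansible_host is
    replaced by its public_v4 address, when one is recorded."""
    result = copy.deepcopy(inventory)
    hosts = result.get("all", {}).get("hosts", {})
    for hostvars in hosts.values():
        v4 = (hostvars or {}).get("public_v4")
        if v4:
            hostvars["ansible_host"] = v4
    return result


# Assumed inventory shape; 203.0.113.10 and 2001:db8::1 are placeholders.
sample = {"all": {"hosts": {
    "paste01.openstack.org": {
        "ansible_host": "2001:4800:7817:103:be76:4eff:fe05:176e",
        "public_v4": "203.0.113.10",
        "public_v6": "2001:4800:7817:103:be76:4eff:fe05:176e",
    },
    # No public_v4 recorded, so this host is left unchanged.
    "example01.opendev.org": {"ansible_host": "2001:db8::1"},
}}}

updated = prefer_ipv4(sample)
print(updated["all"]["hosts"]["paste01.openstack.org"]["ansible_host"])
```

Hosts without a recorded v4 address keep their existing `ansible_host`, so a missed host shows up as an unchanged v6 entry rather than an error.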
*** tkajinam has joined #opendev		22:49
openstackgerrit	James E. Blair proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144	22:52
corvus	or emacs	22:52
*** Dmitrii-Sh has quit IRC		22:53
clarkb	corvus: any idea why ask01 didn't update?	22:54
mordred	corvus, clarkb: so - do y'all like the split-base idea enough for me to work up a patch? (I don't want to work up that patch if we don't provisionally like the idea)	23:00
clarkb	mordred: I think my biggest concern with it is it will make understanding when and how to run jobs even more complicated	23:01
clarkb	I like the idea though. If we can make it simple enough that it itself doesn't break constantly I think it could be worthwhile	23:01
mordred	yeah - that's my concern too	23:10
mordred	clarkb: I think we would need to do a pile of base playbooks	23:10
mordred	instead of zuul job host limit strings	23:11
mordred	we'd never understand the limit strings	23:11
*** tkajinam_ has joined #opendev		23:18
*** tkajinam has quit IRC		23:20
*** Dmitrii-Sh has joined #opendev		23:21
openstackgerrit	Merged opendev/system-config master: Add iptables_extra_allowed_groups https://review.opendev.org/726475	23:23
openstackgerrit	Merged opendev/system-config master: Add support for multiple jvbs behind meetpad https://review.opendev.org/729008	23:23
mordred	corvus: don't know if you saw from clarkb on your inventory patch - but a host was missed	23:33
mordred	corvus: ask01.openstack.org	23:34
mordred	it otherwise looks great	23:34
mordred	(I suppose I could just fix that real quick)	23:34
openstackgerrit	Monty Taylor proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144	23:35
mordred	clarkb: ^^	23:35
corvus	ah thx, sorry	23:35
mordred	corvus: no worries - I just also realized that it's a one-line thing, so you know, quicker to just fix it than poke :)	23:37
ianw	i've been through and removed old servers/volumes/dns entries for all active clouds in https://etherpad.opendev.org/p/openstack.org-mirror-be-gone ... so that's everything gone and cleaned up for the openstack.org mirrors	23:37
mordred	ianw: wow. nice	23:37
ianw	clarkb: did you end up thinking the inventory fixture in https://review.opendev.org/729418 might be helpful?	23:43
openstackgerrit	Ian Wienand proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144	23:51
openstackgerrit	Ian Wienand proposed opendev/system-config master: Use ipv4 in server launch inventory output https://review.opendev.org/730149	23:51
ianw	argghhh sorry i cherry-picked that and thus ended up rebasing it	23:52
ianw	i didn't mean to reupload 730144	23:52
