Thursday, 2020-05-21

ianwyou can have multiple certs; we take the first name listed in the cert and add that to the list to be checked00:00
ianwso, e.g. in the gitea case, we check gitea0X:3000, but don't check opendev.org00:01
ianwnow the dib python35 job is showing "ERROR: No matching distribution found for oslotest===4.2.0 "00:02
fungii guess it's a question of whether le updates can break for some certs on a host and not others?00:02
ianwfungi: so all certs should get an entry; just if that cert covers multiple names, we only take the first one.  so like "mirror01.x" and "mirror.x" we will take the first, mirror01 and put that in the list00:05
ianwbut if the cert @ mirror01.x is valid, that implies that the same cert which is used for mirror.x is ok?00:06
fungigot it. so we're still testing each individual cert00:06
fungithat wfm00:06
ianwto be concrete,
ianwcompare to
ianwwe end up checking and someotherservice.opendev.org00:12
fungiyep, cool00:12
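The rule ianw describes (one check entry per cert, using only the first name that cert lists) can be sketched roughly as follows. This is an illustration only: the function name, the dict layout, and the default port 443 are assumptions, not the actual system-config code.

```python
# Hypothetical sketch of "take the first name listed in each cert".
# The data layout and the default port are assumptions for illustration.
def hosts_to_check(certs, port=443):
    """certs maps a cert label to the ordered list of names it covers."""
    checks = []
    for names in certs.values():
        if not names:
            continue  # a cert with no names contributes nothing
        entry = f"{names[0]}:{port}"  # only the first name is checked
        if entry not in checks:
            checks.append(entry)
    return checks


example = {
    "mirror": ["mirror01.x", "mirror.x"],
    "other": ["someotherservice.opendev.org"],
}
print(hosts_to_check(example))
```

Every cert still gets an entry, so each individual cert is tested even though the extra names on it (mirror.x here) are not checked separately.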
openstackgerritJeremy Stanley proposed openstack/project-config master: Replace old Ussuri cycle signing key with Victoria
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Drop support for python2
*** dzho has quit IRC00:40
*** markmcclain has quit IRC02:25
*** ysandeep|away is now known as ysandeep02:33
*** Eighth_Doctor is now known as Conan_Kudo02:46
*** Conan_Kudo is now known as Eighth_Doctor02:46
*** markmcclain has joined #opendev02:49
openstackgerritIan Wienand proposed openstack/diskimage-builder master: package-installs: allow when filter to be a list
openstackgerritIan Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: fix HWE install for focal
openstackgerritIan Wienand proposed openstack/diskimage-builder master: ubuntu-minimal : only install 16.04 HWE kernel on xenial
openstackgerritIan Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: Add Ubuntu Focal test build
*** Meiyan has joined #opendev04:15
*** ykarel|away is now known as ykarel04:23
*** tkajinam has quit IRC04:26
*** tkajinam has joined #opendev04:26
*** raukadah is now known as chandankumar05:32
*** dpawlik has joined #opendev06:01
*** ianw has quit IRC06:30
*** ianw has joined #opendev06:33
*** slaweq has joined #opendev06:57
*** dpawlik has quit IRC07:05
*** dpawlik has joined #opendev07:18
*** dpawlik has quit IRC07:24
*** dpawlik has joined #opendev07:25
*** tosky has joined #opendev07:31
slaweqfrickler: hi07:34
*** dpawlik has quit IRC07:34
slaweqfrickler: may I ask You about one infra and CI related test?07:34
*** dpawlik has joined #opendev07:34
slaweqfrickler: we recently introduced in neutron-tempest-plugin test which is pinging some external IP address to check external connectivity is really ok, see
slaweqfrickler: it's skipped by default, but would infra-root have anything against if we would configure some IP address (I don't know what would be the best one) to run this test in u/s gate?07:36
slaweqfrickler: something like ping or similar07:36
*** dpawlik has quit IRC07:44
*** dpawlik has joined #opendev07:45
*** dpawlik has quit IRC07:49
*** dpawlik has joined #opendev07:50
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
*** DSpider has joined #opendev08:11
*** dpawlik has quit IRC08:11
*** dpawlik has joined #opendev08:11
*** lpetrut has joined #opendev08:15
*** dpawlik has quit IRC08:16
*** dpawlik has joined #opendev08:16
*** jaicaa has quit IRC08:17
*** jaicaa has joined #opendev08:20
*** ysandeep is now known as ysandeep|lunch08:20
*** yuri has joined #opendev08:45
*** iurygregory has quit IRC08:51
*** ysandeep|lunch is now known as ysandeep08:57
*** ykarel is now known as ykarel|lunch09:08
*** dpawlik has quit IRC09:10
*** dpawlik has joined #opendev09:10
*** dpawlik has quit IRC09:14
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner
*** iurygregory has joined #opendev09:39
*** ykarel|lunch is now known as ykarel09:56
*** smcginnis has quit IRC11:26
*** dpawlik has joined #opendev11:29
*** smcginnis has joined #opendev11:33
openstackgerritSorin Sbarnea (zbr) proposed opendev/system-config master: Switch to use python3
*** ysandeep is now known as ysandeep|afk12:03
*** rosmaita has left #opendev12:11
*** ysandeep|afk is now known as ysandeep12:29
fungislaweq: i thought we already had a similar role in devstack to ping the git farm or something... checking12:32
*** ysandeep is now known as ysandeep|afk12:32
fungislaweq: i must have been thinking of this:
fungii guess we didn't implement anything similar in devstack12:51
fungibut it seems like a reasonable idea to have a network sanity test which performs a quick ping of some bits of our infrastructure12:52
fungi(i would not recommend pinging though)12:52
fungiahh, i see, so for the specific bug you linked, pinging anything that's not directly on the job node ought to be sufficient. the local mirror service in the provider is your best bet, since your jobs should rely on that server being reachable from the job node anyway so if it isn't the job will have failed before it gets to the point of creating nested instances12:56
*** ysandeep|afk is now known as ysandeep12:58
fungislaweq: the identity of the local mirror server should be available from /etc/ci/ on each node, there will be a "export NODEPOOL_MIRROR_HOST=..." line in there which you can resolve to an ip address if you need a raw address and not a dns name13:03
fungialso it looks like we set a zuul_site_mirror_fqdn ansible fact, which you could use to pipe into your script instead13:05
*** ykarel is now known as ykarel|afk13:14
mordredfungi, slaweq we should check with clarkb about being forward-compatible with the intended new system for communicating mirrors to jobs13:24
mordred(I agree, pinging the mirror is definitely the right choice)13:24
mordredfungi, AJaeger: new promote-javascript-deployment job did not work for zuul (it's ok that it failed, we don't happen to use the results of it currently)13:26
*** lpetrut has quit IRC13:49
zbrwho can help me with gerritlib? i have a couple of patches and i also need a new release for the yesterday bugfix.13:52
*** sgw has quit IRC13:57
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check
*** roman_g has quit IRC14:05
corvuszbr: how was the POLLIN check failing?14:06
openstackgerritMerged opendev/gerritlib master: Enable py36-py38 testing
corvuszbr: (did you get an event with POLLIN and some other bit set?)14:06
zbri got returned 314:06
zbrwhich is valid result14:06
zbri guess that should be self-explanatory14:07
corvusso it was pollin and pollpri ?14:08
corvusthough that's for epoll14:08
corvusbut it's the same for regular poll14:09
zbrmacos + py38, i suspect is py38 specific.14:09
corvusso there is input, and an error14:09
mordredcorvus: or, input and a flag indicating it's "priority" input, no?14:10
corvusmordred: pollpri means "an exceptional condition" the poll manpage includes some14:10
corvusi'm curious what it would be in this case14:11
mordredah - nod14:11
corvusbecause i'm not sure the original code was wrong14:11
corvus(the original code was "read if there is data and no exceptional condition")14:11
corvusi'm open to changing it, but i'd like to understand why it's okay to read when there's an exceptional condition, what the exceptional condition we intend to handle is, and why it's caused and how we should handle it14:12
mordredcorvus: says some words14:12
zbrurgent, likely telling you process that data faster or i will start dropping it14:13
*** sgw has joined #opendev14:14
zbrwe could think to improve it in the future, but clearly we need to read.14:14
corvusi don't think that's what that means; i don't actually know how the standard out channel of an ssh connection could even send oob data.14:14
mordredcorvus: most of the other docs I'm finding (other than the man page) - like the rust api docs - all describe it as "there is urgent data" instead of "there is an exceptional condition" - but I still don't understand what that means - so I agree with you as to wanting to understand how we got to that state14:14
*** ykarel|afk is now known as ykarel14:15
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues
mordredcorvus: has some discussion around POLLPRI and ssh14:17
openstackgerritMerged opendev/gerritlib master: Replace testrepository with stestr
zbrfor me is a no brainer, event is a bitmask, and POLLIN declares that there is data to be read. In that case i do not care about any other flag, my loop goal being to read.14:20
zbrit gets more interesting when you do not have data to read14:20
corvuson the contrary, i want the loop to exit if there is an error14:20
corvusyour patch will change that14:20
corvus(with your change, it could fail to exit on a loop if there is an error and data to read)14:21
fungiyes, continuing to read from a socket which has raised an error is not guaranteed safe, and there are plenty of scenarios where we could end up with a hung process indefinitely reading from a dead socket that way14:21
zbraimho, it should read until there is nothing else to read.14:21
zbra dead socket producing data?14:22
fungizbr: what if that read call never returns?14:22
zbrfungi: same would apply even without this condition.14:22
corvuszbr: a poll can return with data and an error at the same time.  your code would read the data, ignore the error, then go right back to the poll again.  what happens after that would be undefined.14:22
corvuszbr: or possibly read the data and fail.14:23
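The two positions in this exchange come down to two different tests on the poll event mask. A minimal sketch, with hypothetical function names (this is not the gerritlib code itself):

```python
import select


def strict_is_readable(event):
    # corvus's reading of the original check:
    # "read if there is data and no exceptional condition"
    return event == select.POLLIN


def loose_is_readable(event):
    # zbr's proposal: read whenever the POLLIN bit is set,
    # regardless of any other flags in the mask
    return bool(event & select.POLLIN)


# The case zbr hit: POLLIN and POLLPRI set together.
event = select.POLLIN | select.POLLPRI
print(strict_is_readable(event), loose_is_readable(event))
```

Note that the loose test also accepts a mask such as POLLIN | POLLERR, which is the "data and an error at the same time" case corvus is worried about.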
mordredright - which is why we need to understand under what conditions the PRI flag is set and what it's trying to communicate14:23
zbri think it would be better than now, where 0x3 produces an error.14:23
mordredzbr: 0x3 producing an error may be correct behavior14:23
fungiif we don't ignore errors we can try to handle them, say by closing out and reopening the socket (depending on the error)14:24
fungithere are options other than ignoring possible error conditions or dying on the spot14:25
corvusif there were oob data, then since we don't set so_oobinline, we would need to set the msg_oob flag to recv to retrieve it14:28
corvus(though, like i said, i doubt the issue is oob data)14:29
zbrHow about logging when the event has anything else than just POLLIN, but reading.14:29
corvusmordred: i've skimmed that link, but i haven't found anything there i can apply to our situation14:29
*** Meiyan has quit IRC14:30
zbrone random logic example:
zbras seen here, other flags should not prevent reading data.14:31
zbreven if we do not implement them14:31
corvuszbr: that's a different approach than i think we should take.14:32
corvusi think we should understand the issue and handle it14:32
corvuszbr: since it only seems to be reproducible in your environment, maybe you can determine whether it's oob data or something else causing the pri flag to be set?14:33
zbri need some hints on how to figure it out14:33
zbrdid any of you tried to run elastic-recheck with py38?14:34
zbri wonder if the issue is specific to macos, py38 or both.14:34
corvuszbr: i'd try calling recv with MSG_OOB and see if you get data14:34
mordredzbr: my hunch is going to be specific to macos - I would doubt py38 has anything to do with it since it would be something setting that flag in the networking stack14:35
fungii'm still not finding any indication that tcp/urg is commonly used for ssh sockets14:35
fungi(tcp/psh definitely is, but urg is surprising there)14:35
zbrdoes any of you have an example on how to call recv from there?14:44
zbryep, i do get OOB data, probably as json.
corvuszbr: i was going to suggest recv( but that looks plausible too :)14:50
corvuszbr: i'm really curious what data you get14:50
mordredme too14:50
mordredcorvus: I feel like OOB packets are a thing we don't use nearly enough ;)14:51
corvusmordred: i know14:51
mordredthere are whole sets of socket flags we're not making use of to their fullest!14:51
corvusmordred: speaking of which, i had a dream last night about finally adding in that "this job will fail but it's still running" flag to zuul.14:52
mordredcorvus: (honestly, gearman job cancelling seems like a good use-case for those)14:52
mordredcorvus: what a fun dream!14:52
corvus(the gearman warning result; which, at this point, we probably wouldn't handle with gearman, but hey)14:53
corvus(i'm reminded of this because we need an oob ansible->executor signal to accomplish this)14:54
zbrhow to i read the entire buffer?14:54
corvuszbr: recv takes a length arg, so it's actually socket.recv(4096, socket.MSG_OOB)14:56
corvuszbr: (socket.recv(socket.MSG_OOB) will read one byte because socket.MSG_OOB == 1)14:56
corvuszbr: then you do that inside the poll loop until POLLPRI is clear14:56
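A rough sketch of what corvus describes, assuming a real socket.socket and a select.poll object that already has the socket registered. The helper names are hypothetical, and this does not apply to objects that are only socket-like and do not accept a flags argument.

```python
import select
import socket


def read_urgent(sock, bufsize=4096):
    # Buffer size first, flags second. Passing only socket.MSG_OOB
    # would be interpreted as a buffer size, not as a flag.
    return sock.recv(bufsize, socket.MSG_OOB)


def drain_urgent(sock, poller, timeout_ms=0):
    """Read out-of-band data until poll() stops reporting POLLPRI."""
    chunks = []
    while True:
        events = dict(poller.poll(timeout_ms))
        mask = events.get(sock.fileno(), 0)
        if not (mask & select.POLLPRI):
            break  # POLLPRI is clear, nothing urgent left to read
        chunks.append(read_urgent(sock))
    return b"".join(chunks)
```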
zbrnot this one,  TypeError: recv() takes 2 positional arguments but 3 were given14:57
corvusoh, then that's going to be a paramiko thing14:57
corvusthat only lists one arg :/14:58
zbrthe api is socket-like, but is not a real socket15:00
corvusyou could use the socket methods on the underlying fd15:02
zbrso we may not have a OOB in the end15:02
zbrnot any progress, so far we have no proof that that is a OOB message.15:11
zbrbut documentation tells us that it can be, so not sure why we would keep the code crashing for this case.15:12
corvuszbr: what documentation tells us to expect pri to be set when we connect to gerrit stream events over ssh?15:12
zbri patched it locally and it seems to be happy to process data, so that is priority is not an error15:12
zbrfor me is enough.15:14
zbrtesting a bitmask with == is almost never correct.15:15
corvuszbr: it is correct if you want to test that only one bit is set, which is the intent here15:15
corvusi thought we covered that already15:15
*** tkajinam has quit IRC15:16
zbris enough?15:16
zbrsearch for POLLPRI15:17
zbri do not see any of the documented cases as a good enough reason to raise an exception15:17
fungiit's suggesting there is urgent out of band data which should be read, and we're not (yet) reading that, so it sounds like a potential problem to me15:18
zbrto log something yes, but not to quit (and likely lose data too)15:18
*** ykarel is now known as ykarel|away15:18
fungiwithout knowing what the urgent message is, hard to say whether it's safe to just log15:18
zbrwhere is the proof that there is a message?15:19
corvuszbr: the man page explains what the different values mean.  they don't tell us why you alone are receiving that value now, and our production systems have not hit this in 8 years of continuous use.15:19
corvusthe man page doesn't tell us how to handle the case, only that the case may happen15:19
zbrgood that we do not write air-control software...15:20
fungiatc software would not be allowed to ignore error conditions ;)15:20
fungijust ask clarkb15:20
corvusand indeed, this program ignores no error conditions.  your patch does.15:20
zbrbecause for me, even without knowing the nature of the urgent notification, i prefer to continue processing data (and eventually log something)15:21
corvusso yeah, this program is written in a *very* conservative way as far as errors go15:21
clarkbfungi: yes, though setting the reboot routine as your interrupt handler for major errors is apparently acceptable :)15:21
zbrwhat if your phone crashes on a specific text message? (oops, that already happened)15:21
zbrplease show me where is written that any bitmask other than POLLIN is an error?15:22
clarkb(we put the interrupt routine at address 0)15:22
clarkber reboot interrupt15:22
fungizbr: it's not necessarily an error, but it could indicate an error so not handling it is potentially a bug15:23
zbrin fact even the presence of POLLOUT would break the current logic.15:23
clarkbI think the big thing here is as corvus said we've not seen this in almost a decade of using these tools. The fact that it shows up is noteworthy and worth understanding in order to address properly15:23
*** dpawlik has quit IRC15:24
zbri would add a logging.error for this purpose.15:24
zbrat least we would know if it happens and how often.15:24
clarkbzbr: re pollout there are no writes on that connection iirc15:27
zbrsadly searching for paramiko and POLLPRI brings zero results15:27
clarkbya its a stdout channel, so we should never write to it15:28
zbrdo you want me to setup a tmate with debugger?15:28
clarkbzbr: have you tried recv on the fd?15:28
clarkb(basically bypass whatever higher level apis are in front of the actual file descriptor and read directly15:29
zbrif i only can find the fd, no variable with similar name so far.15:30
corvuszbr: did you try ?15:31
*** dirk has quit IRC15:38
zbrok, i have `` which looks like one, but i cannot get a socket if there is no socket to start with.15:40
*** priteau has joined #opendev15:40
zbri cannot create a socket out of a fd.15:40
clarkbzbr: I think you can call directly against a fd15:41
zbrsocket and paramiko.Channel are very similar but not identical. I cannot give any flags to ask about OOB stuff.15:42
zbri am also inclined to believe that OOB stuff is specific to socket, so it may not work with paramiko15:43
clarkbzbr: looking at I think you can use a combo of read() and ioctl15:44
clarkbbasically pretending to be C via python15:44
corvusi have to run some errands now, but i ask that if we decide that we want to ignore pollpri, then we do so in such a way that doesn't change the error handling structure of that code.  in other words, just ignore pollpri.15:45
clarkbzbr: basically you do that discard_until_mark thing, then read after that to get the OOB data? Since OOB data automatically marks the stream15:46
clarkbalso if we decide the OOB data is important I think that is roughly what gerritlib will need to do anyway15:47
*** mlavalle has joined #opendev15:52
zbrfor the moment I raised
clarkbslaweq: mordred fungi every job should already do external connectivity tests16:01
* clarkb looking that up now16:02
fungiclarkb: well, in this case what they want to test is that a nested instance can reach beyond neutron16:02
clarkbI see, does that include off of the middle hypervisor?16:03
mordred(which is a nice test of north/south traffic)16:03
clarkbbecause devstack networking doesn't actually set that up16:03
fungiand i guess pinging a fabricated virtual interface elsewhere on the job node might mask some failure cases? not really sure16:03
clarkbthere is no double nat for nested instance FIPs to allow return traffic to find the middle hypervisors interface. packets will go out sourced from the nested instance fip then poof disappear16:04
fungiyeah, i expect they will have to do some masquerading on the job node to make that work16:04
fungiseems doable though, i don't really know what their test scenario looks like16:05
clarkbif the second layer of nat were set up it would probably just work (tm)16:05
zbri find it interesting that gerritlib implementation is apparently the only one that does not give the select.POLLIN to poll.register()16:06
zbrthere are lots of examples in openstack, but that's the only register without filtering that I see.16:07
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues
*** ysandeep is now known as ysandeep|afk16:14
*** hashar has joined #opendev16:14
openstackgerritMerged openstack/project-config master: Replace old Ussuri cycle signing key with Victoria
openstackgerritSorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check
clarkbmordred: if you have time could you look over for the gitea 1.12.0-rc1 testing? You normally do those updates so would be good for you to check I didn't miss anything obvious. I'm not sure we want to land that and run the rc code, but we probably could if people want to go for it16:31
*** priteau has quit IRC16:33
mordredclarkb: what do you need gpg for?16:35
mordred(just curious - I looked at it yesterday and it looks good to me - +2 - but I'm about to run an errand, so didn't +A - feel free to though)16:37
clarkbmordred: not sure, the upstream image added it though. If I had to guess to work with signed commits16:37
clarkbinfra-root is at the bottom of corvus' stack to do more zuul things if you have a moment16:37
openstackgerritSorin Sbarnea (zbr) proposed opendev/elastic-recheck master: WIP: Create elastic-recheck docker image
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck
clarkbfungi: do you have time for ? you reviewed its parent17:24
*** ysandeep|afk is now known as ysandeep17:25
fungisure, i can take a quick look17:25
clarkbalso if anyone else wants to look at for advisory board bootstrapping now is the time to do that. I'm hoping to send that out soonish17:26
clarkbI believe I've incorporated/addressed all the existing feedback (thank you for that)17:26
fungiyour updates lgtm17:32
openstackgerritMerged zuul/zuul-jobs master: Remove failovermethod from fedora dnf repo configs
*** ysandeep is now known as ysandeep|away17:40
openstackgerritJames E. Blair proposed opendev/system-config master: Use sqlite with Zuul in the gate
openstackgerritSorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck
openstackgerritMerged opendev/system-config master: Save zuul and nodepool logs from gate test jobs
openstackgerritMerged opendev/system-config master: Vendor the apt repo gpg keys used for Zuul
*** hashar has quit IRC18:40
clarkbok email about advisory board has been sent18:41
smcginnishrw: Looks like your entire mail spool is dumping into #openstack-requirements19:01
clarkbsmcginnis: can you kick hrw (maybe even ban if rejoined)19:02
smcginnisLooks like he killed it.19:02
hrwsorry for that19:02
fungianything sensitive? do i need to perform surgery on our channel log archives?19:03
hrwno, twitter feed19:04
smcginnistwitter2irc probably has limited use. :)19:04
fungii hear there is such a thing19:04
fungiat least the message length is compatible ;)19:05
hrweverything2irc more or less exists19:05
hrwjust at different levels of usability19:05
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Do not interpolate values from tox --showconfig
openstackgerritMerged opendev/system-config master: Run Zuul as the zuuld user
*** dpawlik has joined #opendev20:08
*** dpawlik has quit IRC20:52
*** hrw has quit IRC20:56
corvusinfra-prod-base failed on that ^20:58
clarkbthat was expected though right?20:59
corvusnot really?20:59
corvusi'm looking at /var/log/ansible/base.yaml.log to try to ascertain the error20:59
corvusbut all i see is several mb of json21:00
corvusokay, after scanning the list several times, i see this:   : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=021:01
*** hrw has joined #opendev21:01
corvusso i guess something failed on review-dev01?21:01
corvusAn exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root']21:01
corvus No space left on device21:02
clarkbtrack_upstream.log is 23GB21:04
clarkbinfra-root should we just cat /dev/null > /var/log/track_upstream.log for now?21:05
corvusyeah.  it's throwing host key errors21:06
corvusso something is broken but i don't think we care right now21:06
corvuswe'll get new errors real quick21:06
clarkbok I'll run that now21:06
clarkbthats done and ya the file is already filling quick21:07
clarkbbut we should be good for a while21:07
corvusre-enqueing 72695821:07
clarkbI've confirmed that the ssh key it doesn't like is the one reported by the server21:11
clarkbthe track upstream command is using roots ssh configs not gerrit2's21:13
clarkbI'm going to see if I can ssh as root and accept the key21:14
clarkbok its talking to port 22 not 2941821:17
clarkbI wonder if this is related to zbr's change21:22
clarkb2020-05-21 21:21:31,733: gerrit.GerritConnection - ERROR - Exception connecting to it thinks it is connecting to 29418 but if I look at the keys it seems to be getting the port 22 key?21:22
corvusclarkb: what change?21:22
clarkbcorvus: or rather if its related to the sort of thing that addressed21:25
*** DSpider has quit IRC21:26
corvusi thought from the log it was going on longer than that21:27
clarkbcorvus: ya I don't think that change has caused it, but maybe there is a similar bug in gerritlib that is?21:27
clarkbalso that change is not in the images on review-dev for gerrit because we consume gerritlib from releases (just double checked that)21:28
openstackgerritMonty Taylor proposed openstack/project-config master: Allow ansible-collections-openstack-release team to push tags
clarkbif we look at /root/.ssh/known_hosts the host key there is correct for port 2941821:29
clarkbif I ssh as root to the jeepyb user on port 29418 that works with no error (because the host key is correct)21:30
clarkbthe key that it seems to be getting is the port 22 hostkey21:30
clarkbjeepyb defaults to 29418 if not set (I can't find us setting a different port)21:30
clarkbthis is quite weird21:30
clarkbI'm going to try sshing now within the container21:31
clarkbmanual ssh from within the container also works21:31
clarkbmordred: ^ any ideas?21:32
corvusseparately, the infra-prod-base job failed again.  this time there are no failures reported in the log, so i don't know why it "failed" --      : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=021:33
corvuscould it be that?21:33
clarkbcorvus: yes unreachable hosts cause it problems. ianw has a change up to deal with that by ignoring them (but I'm not sure that is correct either)21:34
corvusi think it had an ssh host key change?21:34
corvuswell it said Warning: Permanently added the RSA host key for IP address '' to the list of known hosts.21:35
corvusbut i guess it had the key already with a different address?21:35
corvusi've enqueued the change again21:36
clarkbcorvus: possible that ipv6 stopped working between those hosts so it fell back to ipv4?21:36
corvusit also said Last login: Thu Aug 16 16:59:31 2018 from 2001:4800:7818:101:3c21:a454:23ed:407221:36
*** Dmitrii-Sh has quit IRC21:38
mordredI just tried sshing to paste01 and I'm getting an ssh host key differs21:40
mordredan it wants to do
mordredand I can't ssh to the ipv621:41
clarkblooking at the review-dev thing more I think what is happening is paramiko is finding a known hosts entry for the port 22 host key and applying it to port 29418. I cannot see how that is happening though21:42
clarkband I see a bunch of 29418 connections so I think it is connecting to 29418 properly, just using the wrong known_hosts value21:43
*** Dmitrii-Sh has joined #opendev21:43
mordredclarkb: blink blink21:43
mordredoh - hrm. I can ssh to just fine - it's that has a conflicting hostkey for me - so that might just be old cruft somehow21:44
mordredstill can't connect to ipv6 though21:44
clarkbthere are no sshfp records that I can see21:45
clarkb(which was one idea for why it might've gotten mixed up)21:45
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner
clarkbI'm stumped21:56
clarkbssh-keyscan, manual ssh, and manually reading configs shows me that everything should be happy21:57
clarkbanyone have other ideas for what we should look at ? I think the implication is that it isn't using the known_hosts file we bind mount in. But I have no idea what other known_hosts info it could be using (which is why I checked sshfp)21:59
mordredclarkb: no - because you tried from inside the container right?21:59
clarkbmordred: ya inside and out22:00
mordredclarkb: is it a user thing perhaps? are we running track-upstream as a different user inside the container than you're exec-ing in as?22:00
corvuswhat container?22:00
* mordred trying to think of left-field things22:00
mordredcorvus: the gerrit container22:01
mordredor - the track-upstream container - which uses the gerrit image22:01
corvusah, sorry, gotcha22:01
clarkbcorvus: mordred ya I'm replacing track-upstream command with bash and using the same docker mounts22:01
clarkbI did have to add -it though22:01
mordredyeah - that's normal22:01
clarkbmordred: externally the container is started as root and it runs as root in the container I think22:01
clarkbmordred: though maybe it can't read the known_hosts file?22:01
corvusthere are a lot of containers running22:02
corvuswith random names22:02
clarkbcorvus: yes, they ar erung by cron and I think they are in a restart loop for a while before they give up22:02
clarkb*are run by22:02
*** slaweq has quit IRC22:02
mordredwe should maybe kill some of those22:03
mordredlike: docker ps  | grep -v gerritcompose_gerrit_1 | grep 'Up 5 weeks' | awk '{print $1}' | xargs -n1 docker stop22:05
clarkboh could it be these are stale22:05
mordredclarkb: is it possible that this is not a problem anymore but was 5 weeks ago22:05
clarkband new runs would be happy22:05
mordredand yeah - that's why you can't reproduce it22:05
mordredhow about I run that?22:05
corvus1 sec22:05
corvusmordred: ok clear from me22:06
clarkbya I think that should be fine too22:06
corvus(i just wanted to verify they were track-upstream procs, and they are)22:06
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner
mordredk. stopped22:06
clarkbthe next track upstream runs in about 35 minutes22:06
clarkbI guess we can see if it is happy then22:06
mordredI'm guessing it will be because _all_ of these were 5 weeks old22:07
mordredand we run cron hourly22:07
fungisorry, catching back up, so summary is that we filled up review's rootfs because there were a bunch of track-upstream invocations from 5 weeks ago which were continually filling the log with ssh host key mismatch errors due to incorrectly deployed containers from the same timeframe?22:11
clarkbfungi: yes, except it was review-dev not review22:12
corvusokay this time infra-prod-base failed here: : ok=25   changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=022:12
fungiaha, okay, review-dev, got it22:12
corvusand also paste01 is still unreachable22:12
corvusi don't think the current situation is tenable22:12
clarkbI've reproduced the ipv6 connectivity issue to paste over port 2222:13
corvusclarkb: oh you've identified an issue?22:13
clarkbit doesn't ping either22:13
clarkbcorvus: ya its looking like ipv6 to paste is not working22:14
corvusclarkb: but v4 is fine; so why is ansible unhappy?22:14
clarkbcorvus: probably because ansible is preferring v6 since we set a specific ansible_host addr?22:14
clarkbcorvus: I bet if we used names rather than addrs it would fallback from ipv6 to ipv422:15
clarkbbut we set a specific address instead iirc22:15
corvusindeed, paste01 is       ansible_host: 2001:4800:7817:103:be76:4eff:fe05:176e22:15
donnydso I started capturing some metrics on how many jobs are run per day and per week on OE22:15
clarkbdonnyd: we've got some of that data in graphite too if you need it22:16
clarkbcorvus: for now we can switch ansible_host to ipv4 address maybe?22:16
corvusthe issue on the kna1 mirror seems disk related: -bash: /etc/profile: Input/output error22:16
corvusit seems possible that host may be about to start bombing jobs22:17
clarkbcorvus: thats weird I'm able to open /etc/profile on that host22:17
corvusclarkb:  ?22:17
clarkbcorvus: ya22:17
clarkband nothing in dmesg since it last rebooted22:18
clarkb(9 days ago)22:18
clarkber wait22:18
clarkbI was looking at wrong terminal /me double checks22:18
corvuswhen i ssh in, i get immediately disconnected after the io error22:18
clarkbya it looks like ssh succeeds but it just errors out after motd22:18
clarkbcorvus: ya I missed that it was failing as it printed out the motd and everything and then you get your old shell prompt back22:19
corvusi'm assuming the disk is gone22:19
clarkbcorvus: we can put it in the emergency file for now22:19
clarkb(that should solve the ansible side problem)22:20
mordredcorvus: I agree - this isn't super tenable - but I'm not sure which thing would be better - if we put the base role in each service playbook, then we theoretically run a bunch of stuff each time we run a service that isn't useful, but at the cost of services being decoupled. we could also not make a service playbook dependent on base running - at the potential cost that something from base that's important22:20
mordredfor a service doesn't get applied. but maybe shifting base roles into service playbooks is better (I think you advocated for that before)22:20
clarkbmordred: ianw's change does the second thing you suggested22:20
clarkbre that mirror should we try a reboot?22:21
mordredyeah - I wasn't super crazy about that change because I was worried we'd still miss important updates in base22:21
mordredclarkb: ++22:21
mordredclarkb: I don't think we can debug it any further in its current state22:21
corvusyeah, it just seems that threading the needle on having every host succeed is not working well for us today (we've got 3 moles needing whacking so far)22:21
clarkbI'll work on the reboot of kna1's mirror22:21
ianwhrm, i think i had problems with that the other day22:22
corvusfor paste -- should we just prefer ipv4 addrs always?22:22
mordredcorvus: I think what I *really* want is a per-service base playbook that runs base and only gets run if base stuff has changed - but my brain can't quite wrap its head around implementing that22:22
ianwclarkb: 2020-05-20 08:00:46 UTC rebooted ; it was refusing a few connection and had some old hung processes lying around22:22
ianwis it doing the same thing again?22:22
clarkbianw: ya22:22
clarkbcorvus: I think ipv4 is likely to be more reliable in general22:22
mordredianw: we don't know - we can't get in to see if it has hung processes lying around22:22
corvusi think we wanted ip addrs for ansible_host so that we wouldn't mess up a prod server when booting a replacement22:22
mordredcorvus: yes, that is correct22:23
clarkbbut thats based on having comcast provided ipv6 in the past where things would randomly just not work22:23
mordredclarkb: I think ipv6 has been mostly stable for us22:23
corvusmostly -- but i think we've still seen more issues with, say, gerrit than ipv4.  <--anecdata22:23
clarkbmordred: we've seen similar flakyness fwiw22:23
mordredI bet if we reboot paste (or just did a restart of the ipv6 networking stack) ipv6 would come back there22:23
clarkbmordred: some of the logstash worker nodes can't talk to the gearman server via ipv622:23
clarkbso they are slow to start up as they fall back to ipv422:23
clarkbmordred: thats possible22:24
corvusmordred: oh you reckon that's client side22:24
corvusi'll reboot paste22:24
mordredcorvus: yeah - the host seems to think it has an ipv6 address, but I can't get to it - so since it's been up for >800 days I wouldn't be surprised if something got borked in rax routing tables22:24
clarkbI've asked the mirror in kna1 to reboot22:25
ianwi'm pretty sure i still have an open ticket about rax hosts not being able to do symmetrical ipv622:25
fungicatching back up again, yes rebooting instances in rackspace has sometimes solved the sudden ipv6 connectivity breakage22:25
fungiianw: i think that ticket was eventually closed22:26
clarkbmirrors up22:26
fungiif memory serves, the resolution was that "all" you need to do is delete and recreate the port/interface22:26
fungi(in which case you may as well delete and redeploy the instance)22:27
corvuspaste is back up, but v6 isn't any better22:27
corvusoh, well, it can at least ping outbound v6 now22:27
clarkbnothing in syslog or kern.log22:27
clarkbon the mirror for ^22:28
fungii suppose we could switch the inventory to the v4 address? should probably also take the aaaa out of dns for it22:28
fungiat least until we can deploy a replacement22:28
corvusmordred: are you on paste from a v6 addr?22:28
clarkbalso mirror had ansible things in syslog from 21:40ish22:29
fungii'm able to ssh to paste over ipv622:29
clarkbso it seemed to be a relatively recent occurrence22:29
fungimaybe it's fixed?22:29
ianwING (2001:4800:7817:103:be76:4eff:fe05:176e)) 56 data bytes22:29
ianw64 bytes from (2001:4800:7817:103:be76:4eff:fe05:176e): icmp_seq=1 ttl=46 time=280 ms22:29
ianwfrom .au :)22:29
corvusbut not from bridge.o.o22:29
corvus(though bridge can ping6
fungiyeah, when we see these connectivity issues, it's typically between specific instances in rackspace22:30
mordredcorvus: yes22:31
mordredcorvus: I am in over ipv6 - although I could not get in from my laptop over ipv6 before you rebooted22:31
fungisuggesting there's a dead route somewhere (and usually the packets only go missing in one direction according to packet captures)22:31
mordredwell - I will say that gives weight to clarkb's suggestion of using ipv4 for ansible hosts22:32
mordredactually, I think that was a suggestion to hostname - but that does make replacements harder22:32
fungialso possible there's an incorrectly cached flow somewhere in their network which will age out now that the instance has been rebooted22:32
clarkbmordred: I was suggesting ipv4 as ime its more reliable22:32
ianwyep, is the original issue about one-way ipv6 ... i bet pinging from paste you see the packets on bridge, but they never make it back22:32
clarkbipv6 is almost as reliable but weirdness happens more often ime22:32
mordredif we occasionally get unexplained ipv6 issues between hosts in rax - then as much as I'd prefer v6 on general principle, it seems like we're opening ourselves up to more likely ansible woes22:33
fungiipv6 is plenty reliable, it's that providers have deployed it in so many unreliable ways due to complex migration plans22:33
clarkband that happens at the isp level (comcast had a bitbucket in denver back in the day) and in cloud providers (this thing with rax and vexxhost not being able to google dns)22:33
mordredfungi: yah22:33
fungiand also them not noticing bugs because so few customers tried to use it with any seriousness until relatively recently22:34
mordredcorvus: I have an idea for base splitting that MIGHT not be *100%* terrible (sort of the thing I suggested above but didn't know how to do)22:37
corvusmordred: cool :)22:38
corvusi'll see about switching our ansible_hosts to v422:38
mordredcorvus: what if we make child-jobs of infra-prod-run-base for each service that don't do anything different but add like an "ansible_limit_hosts" variable22:38
mordredcorvus: and then make each service soft-depend on the host-specific version of base22:38
mordredwhen we DO change base it'll be a larger number of smaller jobs that have to run - but we won't have service-zuul not running because run-base couldn't talk to paste22:38
corvusmordred: what would cause the service-specific jobs to run?22:39
corvusfile matchers that are the same as the service?22:39
mordredno - file matchers that are specific to "base needs to run" - we'd ditch "run a single base playbook" completely - so if base doesn't need to run, it doesn't and we skip it22:40
clarkbmordred: if we did that wouldn't we want to fail if base failed?22:41
mordredyes - but then we're only failing if base fails for the specific service in question22:41
mordredso if base fails on a zuul host, then not running service-zuul is the right choice22:41
clarkbbut gitea or whatever could still run22:42
mordredbut we still get to use zuul file matchers to avoid running useless things22:42
mordredwe could also do service-base specific file matchers if needed - like if we wanted to also trigger a base run for a service if host vars changed (because iptables)22:42
mordredbut - probably just triggering all the base jobs if host_vars/* changes isn't a terrible idea22:43
donnydclarkb: Oh, I didn't know we tracked that kind of data22:43
corvusmordred: so a change that touches something in base would cause all the little base jobs to run, and then, say, also the zuul service job (if the change was a zuul change).  and the zuul service job only depends on its little bit of base.22:43
mordredcorvus: yes22:43
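[editor's note] The split-base idea summarized above can be modelled in a few lines. This is a simplified sketch of file matchers plus soft dependencies, not Zuul's implementation. The job names (`base-zuul`, `service-paste` and so on) and the file patterns are hypothetical. A job is selected when a changed file matches one of its patterns. A soft dependency only matters if the dependency was itself selected. A dependent job is skipped when a selected dependency does not succeed.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    files: list                      # regexes acting as file matchers
    soft_deps: list = field(default_factory=list)


def matched_jobs(jobs, changed_files):
    """Names of jobs with at least one file matcher hitting a changed file."""
    return [
        job.name
        for job in jobs
        if any(re.search(pattern, path) for pattern in job.files for path in changed_files)
    ]


def simulate(jobs, changed_files, failing=()):
    """Map each selected job to 'success', 'failure' or 'skipped'."""
    by_name = {job.name: job for job in jobs}
    selected = set(matched_jobs(jobs, changed_files))
    results = {}

    def resolve(name):
        if name not in results:
            blocked = any(
                dep in selected and resolve(dep) != "success"
                for dep in by_name[name].soft_deps
            )
            if blocked:
                results[name] = "skipped"
            else:
                results[name] = "failure" if name in failing else "success"
        return results[name]

    for name in sorted(selected):
        resolve(name)
    return results


# Hypothetical layout: one small base job per service, same "base" matchers.
BASE = [r"^playbooks/roles/base/", r"^inventory/host_vars/"]
JOBS = [
    Job("base-zuul", BASE),
    Job("base-paste", BASE),
    Job("service-zuul", [r"^playbooks/roles/zuul/"], soft_deps=["base-zuul"]),
    Job("service-paste", [r"^playbooks/roles/paste/"], soft_deps=["base-paste"]),
]

change = ["playbooks/roles/base/tasks/main.yaml", "playbooks/roles/zuul/tasks/main.yaml"]
print(simulate(JOBS, change, failing={"base-paste"}))
```

In this model a base failure on paste no longer blocks the zuul service job, and a change touching only service files runs no base job at all.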
clarkbdonnyd: I'm not sure we specifically track jobs per provider but do have resources per provider22:43
*** yuri has quit IRC22:43
clarkbdonnyd: but wanted to point out we have some related info22:43
mordredcorvus: we could also do it with a bunch of little service-base playbooks and not with an ansible_limit_hosts variable - the downside to a limit_hosts var is that we're keeping host lists in 2 different places22:44
mordredbut the idea would be the same22:44
corvusmordred: fwiw, base is gonna run on a lot of changes (any hostvars change)22:44
donnydI was honestly just curious about the actual number of jobs launched per day/week22:44
mordredcorvus: yeah. I mean - we could get more specific on those hostvars changes if we did individual playbooks22:45
donnyd1300-1500 per day and from the week total about 8500 on my crappy little cloud22:45
donnydNot too shabby22:46
mordreddonnyd: ++22:46
donnydi wish i could make it go faster, which i am, but it will be a little while yet22:46
corvusmordred: what's the right way to run
donnydI am replacing the 8 big compute servers with 32 little ones22:47
mordredcorvus: I don't know that it's designed to be re-run - I think I just used it the first time - but let me look22:48
donnydSo i am curious to see if more == faster22:48
corvusmordred: to get the right clouds.yaml22:48
mordredcorvus: yeah - that's just it - we don't have a split clouds.yaml any more22:48
mordredso that is not designed for our current situation (We should probably delete it)22:48
mordredcorvus: I think you'd be better off doing a quick yaml transform on the existing inventory22:49
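[editor's note] A quick transform of the kind mordred suggests might look like the sketch below. It assumes, without confirmation from this log, that each host entry carries a `public_v4` key next to `ansible_host` and that hosts live under `all.hosts`. It works on an already-parsed inventory (a plain dict) so that no YAML library is needed. The sample host names and the addresses from the documentation ranges are made up.

```python
import copy
import ipaddress


def is_ipv6_literal(value):
    """True if value parses as an IPv6 address; hostnames return False."""
    try:
        return ipaddress.ip_address(value).version == 6
    except ValueError:
        return False


def prefer_ipv4(inventory):
    """Return (new_inventory, hosts_left_on_v6) without mutating the input."""
    result = copy.deepcopy(inventory)
    hosts = result.get("all", {}).get("hosts", {})
    left_on_v6 = []
    for name, hostvars in hosts.items():
        if not is_ipv6_literal(str(hostvars.get("ansible_host", ""))):
            continue
        if hostvars.get("public_v4"):
            hostvars["ansible_host"] = hostvars["public_v4"]
        else:
            # No v4 address recorded; report it instead of guessing.
            left_on_v6.append(name)
    return result, left_on_v6


# Hypothetical sample data using documentation-range addresses.
sample = {
    "all": {
        "hosts": {
            "paste.example.org": {
                "ansible_host": "2001:db8::10",
                "public_v4": "203.0.113.10",
                "public_v6": "2001:db8::10",
            },
            "old.example.org": {"ansible_host": "2001:db8::20"},
        }
    }
}

updated, skipped = prefer_ipv4(sample)
print(updated["all"]["hosts"]["paste.example.org"]["ansible_host"], skipped)
```

Returning the list of hosts that stayed on v6 makes it easier to spot an entry the transform could not handle.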
*** tkajinam has joined #opendev22:49
openstackgerritJames E. Blair proposed opendev/system-config master: Use ipv4 in inventory
corvusor emacs22:52
*** Dmitrii-Sh has quit IRC22:53
clarkbcorvus: any idea why ask01 didn't update?22:54
mordredcorvus, clarkb: so - do y'all like the split-base idea enough for me to work up a patch? (I don't want to work up that patch if we don't provisionally like the idea)23:00
clarkbmordred: I think my biggest concern with it is it will make understanding when and how to run jobs even more complicated23:01
clarkbI like the idea though. If we can make it simple enough that it itself doesn't break constantly I think it could be worthwhile23:01
mordredyeah - that's my concern too23:10
mordredclarkb: I think we would need to do a pile of base playbooks23:10
mordredinstead of zuul job host limit strings23:11
mordredwe'd never understand the limit strings23:11
*** tkajinam_ has joined #opendev23:18
*** tkajinam has quit IRC23:20
*** Dmitrii-Sh has joined #opendev23:21
openstackgerritMerged opendev/system-config master: Add iptables_extra_allowed_groups
openstackgerritMerged opendev/system-config master: Add support for multiple jvbs behind meetpad
mordredcorvus: don't know if you saw from clarkb on your inventory patch - but a host was missed23:33
mordredcorvus: ask01.openstack.org23:34
mordredit otherwise looks great23:34
mordred(I suppose I could just fix that real quick)23:34
openstackgerritMonty Taylor proposed opendev/system-config master: Use ipv4 in inventory
mordredclarkb: ^^23:35
corvusah thx, sorry23:35
mordredcorvus: no worries - I just also realized that it's a one-line thing, so you know, quicker to just fix it than poke :)23:37
ianwi've been through and removed old servers/volumes/dns entries for all active clouds in ... so that's everything gone and cleaned up for the mirrors23:37
mordredianw: wow. nice23:37
ianwclarkb: did you end up thinking the inventory fixture in might be helpful?23:43
openstackgerritIan Wienand proposed opendev/system-config master: Use ipv4 in inventory
openstackgerritIan Wienand proposed opendev/system-config master: Use ipv4 in server launch inventory output
ianwargghhh sorry i cherry-picked that and thus ended up rebasing it23:52
ianwi didn't mean to reupload 73014423:52
