ianw | you can have multiple certs; we take the first name listed in the cert and add that to the list to be checked | 00:00 |
ianw | so, e.g. in the gitea case, we check gitea0X:3000, but don't check opendev.org | 00:01 |
ianw | now the dib python35 job is showing "ERROR: No matching distribution found for oslotest===4.2.0 " | 00:02 |
fungi | i guess it's a question of whether le updates can break for some certs on a host and not others? | 00:02 |
ianw | fungi: so all certs should get an entry; just if that cert covers multiple names, we only take the first one. so like "mirror01.x" and "mirror.x" we will take the first, mirror01 and put that in the list | 00:05 |
ianw | but if the cert @ mirror01.x is valid, that implies that the same cert which is used for mirror.x is ok? | 00:06 |
fungi | got it. so we're still testing each individual cert | 00:06 |
fungi | that wfm | 00:06 |
ianw | to be concrete, https://zuul.opendev.org/t/openstack/build/6d1c8cf7ba95499496910f2f5bd2b97e/log/bridge.openstack.org/certcheck/ssldomains | 00:10 |
ianw | compare to https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/templates/host_vars/letsencrypt01.opendev.org.yaml.j2 | 00:11 |
ianw | we end up checking letsencrypt01.opendev.org and someotherservice.opendev.org | 00:12 |
fungi | yep, cool | 00:12 |
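For illustration, a minimal sketch of the kind of per-name check being described: connect to each entry in the ssldomains list, pull the served certificate, and warn when it is close to expiry. The helper name, the 30-day threshold, and the example host:port are assumptions for illustration, not the actual certcheck implementation.

```python
import socket
import ssl
import time

def check_cert(host, port=443, warn_days=30):
    """Fetch the certificate served at host:port and warn if it expires soon."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        # Handshaking with server_hostname also verifies the name matches,
        # so a passing check of mirror01.x covers the shared cert that is
        # also served for mirror.x.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = ssl.cert_time_to_seconds(cert['notAfter']) - time.time()
    if remaining < warn_days * 86400:
        print(f"{host}:{port} expires in {remaining / 86400:.0f} days")

# e.g. the gitea case: check the backend, not opendev.org
check_cert('gitea01.opendev.org', 3000)  # hypothetical host:port
```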
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Replace old Ussuri cycle signing key with Victoria https://review.opendev.org/729804 | 00:17 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Drop support for python2 https://review.opendev.org/728889 | 00:32 |
*** dzho has quit IRC | 00:40 | |
*** markmcclain has quit IRC | 02:25 | |
*** ysandeep|away is now known as ysandeep | 02:33 | |
*** Eighth_Doctor is now known as Conan_Kudo | 02:46 | |
*** Conan_Kudo is now known as Eighth_Doctor | 02:46 | |
*** markmcclain has joined #opendev | 02:49 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: package-installs: allow when filter to be a list https://review.opendev.org/727049 | 04:04 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: fix HWE install for focal https://review.opendev.org/727050 | 04:04 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal : only install 16.04 HWE kernel on xenial https://review.opendev.org/726996 | 04:04 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: ubuntu-minimal: Add Ubuntu Focal test build https://review.opendev.org/725752 | 04:04 |
*** Meiyan has joined #opendev | 04:15 | |
*** ykarel|away is now known as ykarel | 04:23 | |
*** tkajinam has quit IRC | 04:26 | |
*** tkajinam has joined #opendev | 04:26 | |
*** raukadah is now known as chandankumar | 05:32 | |
*** dpawlik has joined #opendev | 06:01 | |
*** ianw has quit IRC | 06:30 | |
*** ianw has joined #opendev | 06:33 | |
*** slaweq has joined #opendev | 06:57 | |
*** dpawlik has quit IRC | 07:05 | |
*** dpawlik has joined #opendev | 07:18 | |
*** dpawlik has quit IRC | 07:24 | |
*** dpawlik has joined #opendev | 07:25 | |
*** tosky has joined #opendev | 07:31 | |
slaweq | frickler: hi | 07:34 |
*** dpawlik has quit IRC | 07:34 | |
slaweq | frickler: may I ask you about one infra- and CI-related test? | 07:34 |
*** dpawlik has joined #opendev | 07:34 | |
slaweq | frickler: we recently introduced in neutron-tempest-plugin a test which pings some external IP address to check that external connectivity is really ok, see https://review.opendev.org/#/c/727764/ | 07:35 |
slaweq | frickler: it's skipped by default, but would infra-root have anything against it if we configured some IP address (I don't know what would be the best one) to run this test in the u/s gate? | 07:36 |
slaweq | frickler: something like ping 8.8.8.8 or similar | 07:36 |
*** dpawlik has quit IRC | 07:44 | |
*** dpawlik has joined #opendev | 07:45 | |
*** dpawlik has quit IRC | 07:49 | |
*** dpawlik has joined #opendev | 07:50 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** DSpider has joined #opendev | 08:11 | |
*** dpawlik has quit IRC | 08:11 | |
*** dpawlik has joined #opendev | 08:11 | |
*** lpetrut has joined #opendev | 08:15 | |
*** dpawlik has quit IRC | 08:16 | |
*** dpawlik has joined #opendev | 08:16 | |
*** jaicaa has quit IRC | 08:17 | |
*** jaicaa has joined #opendev | 08:20 | |
*** ysandeep is now known as ysandeep|lunch | 08:20 | |
*** yuri has joined #opendev | 08:45 | |
*** iurygregory has quit IRC | 08:51 | |
*** ysandeep|lunch is now known as ysandeep | 08:57 | |
*** ykarel is now known as ykarel|lunch | 09:08 | |
*** dpawlik has quit IRC | 09:10 | |
*** dpawlik has joined #opendev | 09:10 | |
*** dpawlik has quit IRC | 09:14 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 09:26 |
*** iurygregory has joined #opendev | 09:39 | |
*** ykarel|lunch is now known as ykarel | 09:56 | |
*** smcginnis has quit IRC | 11:26 | |
*** dpawlik has joined #opendev | 11:29 | |
*** smcginnis has joined #opendev | 11:33 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/system-config master: Switch prep-apply.sh to use python3 https://review.opendev.org/729543 | 11:45 |
*** ysandeep is now known as ysandeep|afk | 12:03 | |
*** rosmaita has left #opendev | 12:11 | |
*** ysandeep|afk is now known as ysandeep | 12:29 | |
fungi | slaweq: i thought we already had a similar role in devstack to ping the git farm or something... checking | 12:32 |
*** ysandeep is now known as ysandeep|afk | 12:32 | |
fungi | slaweq: i must have been thinking of this: https://opendev.org/openstack/devstack-gate/src/branch/master/playbooks/roles/network_sanity_check/tasks/main.yaml#L18-L19 | 12:51 |
fungi | i guess we didn't implement anything similar in devstack | 12:51 |
fungi | but it seems like a reasonable idea to have a network sanity test which performs a quick ping of some bits of our infrastructure | 12:52 |
fungi | (i would not recommend pinging 8.8.8.8 though) | 12:52 |
fungi | ahh, i see, so for the specific bug you linked, pinging anything that's not directly on the job node ought to be sufficient. the local mirror service in the provider is your best bet, since your jobs should rely on that server being reachable from the job node anyway; if it isn't, the job will have failed before it gets to the point of creating nested instances | 12:56 |
*** ysandeep|afk is now known as ysandeep | 12:58 | |
fungi | slaweq: the identity of the local mirror server should be available from /etc/ci/mirror_info.sh on each node; there will be an "export NODEPOOL_MIRROR_HOST=..." line in there which you can resolve to an ip address if you need a raw address and not a dns name | 13:03 |
fungi | also it looks like we set a zuul_site_mirror_fqdn ansible fact, which you could use to pipe into your script instead | 13:05 |
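A rough sketch of how a job-side connectivity check could consume that file; the parsing and the helper names are illustrative, not anything that exists in neutron-tempest-plugin:

```python
import re
import subprocess

def mirror_host(path='/etc/ci/mirror_info.sh'):
    # Pull the mirror name out of the "export NODEPOOL_MIRROR_HOST=..." line.
    with open(path) as f:
        match = re.search(r'export NODEPOOL_MIRROR_HOST=(\S+)', f.read())
    return match.group(1).strip('"\'')

def external_connectivity_ok():
    # ping returns nonzero when no echo replies come back, so this is a
    # cheap stand-in for "can the nested instance reach beyond neutron"
    result = subprocess.run(['ping', '-c', '4', mirror_host()],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0
```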
*** ykarel is now known as ykarel|afk | 13:14 | |
mordred | fungi, slaweq we should check with clarkb about being forward-compatible with the intended new system for communicating mirrors to jobs | 13:24 |
mordred | (I agree, pinging the mirror is definitely the right choice) | 13:24 |
mordred | fungi, AJaeger: new promote-javascript-deployment job did not work for zuul (it's ok that it failed, we don't happen to use the results of it currently) | 13:26 |
*** lpetrut has quit IRC | 13:49 | |
zbr | who can help me with gerritlib? i have a couple of patches and i also need a new release for yesterday's bugfix. | 13:52 |
zbr | https://review.opendev.org/#/c/729734/ | 13:52 |
*** sgw has quit IRC | 13:57 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966 | 14:03 |
*** roman_g has quit IRC | 14:05 | |
corvus | zbr: how was the POLLIN check failing? | 14:06 |
openstackgerrit | Merged opendev/gerritlib master: Enable py36-py38 testing https://review.opendev.org/729734 | 14:06 |
corvus | zbr: (did you get an event with POLLIN and some other bit set?) | 14:06 |
zbr | it returned 3 | 14:06 |
zbr | which is a valid result | 14:06 |
zbr | i guess that https://code.woboq.org/gcc/include/sys/epoll.h.html#_M/EPOLLIN should be self-explanatory | 14:07 |
corvus | so it was pollin and pollpri ? | 14:08 |
zbr | yep | 14:08 |
corvus | though that's for epoll | 14:08 |
corvus | but it's the same for regular poll | 14:09 |
zbr | macos + py38, i suspect it's py38 specific. | 14:09 |
corvus | so there is input, and an error | 14:09 |
mordred | corvus: or, input and a flag indicating it's "priority" input, no? | 14:10 |
corvus | mordred: pollpri means "an exceptional condition" the poll manpage includes some | 14:10 |
corvus | i'm curious what it would be in this case | 14:11 |
mordred | ah - nod | 14:11 |
corvus | because i'm not sure the original code was wrong | 14:11 |
corvus | (the original code was "read if there is data and no exceptional condition") | 14:11 |
corvus | i'm open to changing it, but i'd like to understand why it's okay to read when there's an exceptional condition, what the exceptional condition we intend to handle is, and why it's caused and how we should handle it | 14:12 |
mordred | corvus: https://stackoverflow.com/questions/10681624/epollpri-when-does-this-case-happen says some words | 14:12 |
zbr | urgent, likely telling you: process that data faster or i will start dropping it | 14:13 |
*** sgw has joined #opendev | 14:14 | |
zbr | we could think about improving it in the future, but clearly we need to read. | 14:14 |
corvus | i don't think that's what that means; i don't actually know how the standard out channel of an ssh connection could even send oob data. | 14:14 |
mordred | corvus: most of the other docs I'm finding (other than the man page) - like the rust api docs - all describe it as "there is urgent data" instead of "there is an exceptional condition" - but I still don't understand what that means - so I agree with you as to wanting to understand how we got to that state | 14:14 |
*** ykarel|afk is now known as ykarel | 14:15 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974 | 14:15 |
mordred | corvus: https://lore.kernel.org/netdev/CAD56B7fCUyWG8d-OT__B3SbEfY=AdiZGJVdjSZ7qurqLUufvgg@mail.gmail.com/T/ has some discussion around POLLPRI and ssh | 14:17 |
openstackgerrit | Merged opendev/gerritlib master: Replace testrepository with stestr https://review.opendev.org/729742 | 14:19 |
zbr | for me it's a no-brainer: the event is a bitmask, and POLLIN declares that there is data to be read. in that case i do not care about any other flag, my loop's goal being to read. | 14:20 |
zbr | it gets more interesting when you do not have data to read | 14:20 |
corvus | on the contrary, i want the loop to exit if there is an error | 14:20 |
corvus | your patch will change that | 14:20 |
corvus | (with your change, it could fail to exit on a loop if there is an error and data to read) | 14:21 |
fungi | yes, continuing to read from a socket which has raised an error is not guaranteed safe, and there are plenty of scenarios where we could end up with a hung process indefinitely reading from a dead socket that way | 14:21 |
zbr | imho, it should read until there is nothing else to read. | 14:21 |
zbr | a dead socket producing data? | 14:22 |
fungi | zbr: what if that read call never returns? | 14:22 |
zbr | fungi: same would apply even without this condition. | 14:22 |
corvus | zbr: a poll can return with data and an error at the same time. your code would read the data, ignore the error, then go right back to the poll again. what happens after that would be undefined. | 14:22 |
corvus | zbr: or possibly read the data and fail. | 14:23 |
mordred | right - which is why we need to understand under what conditions the PRI flag is set and what it's trying to communicate | 14:23 |
zbr | i think it would be better than now, where 0x3 produces an error. | 14:23 |
mordred | zbr: 0x3 producing an error may be correct behavior | 14:23 |
fungi | if we don't ignore errors we can try to handle them, say by closing out and reopening the socket (depending on the error) | 14:24 |
fungi | there are options other than ignoring possible error conditions or dying on the spot | 14:25 |
corvus | if there were oob data, then since we don't set so_oobinline, we would need to set the msg_oob flag to recv to retrieve it | 14:28 |
corvus | (though, like i said, i doubt the issue is oob data) | 14:29 |
zbr | how about logging when the event has anything other than just POLLIN, but still reading? | 14:29 |
corvus | mordred: i've skimmed that link, but i haven't found anything there i can apply to our situation | 14:29 |
*** Meiyan has quit IRC | 14:30 | |
zbr | one random logic example: https://github.com/kdart/pycopia/blob/master/core/pycopia/asyncio.py#L197-L200 | 14:30 |
zbr | as seen here, other flags should not prevent reading data. | 14:31 |
zbr | even if we do not implement them | 14:31 |
corvus | zbr: that's a different approach than i think we should take. | 14:32 |
corvus | i think we should understand the issue and handle it | 14:32 |
corvus | zbr: since it only seems to be reproducible in your environment, maybe you can determine whether it's oob data or something else causing the pri flag to be set? | 14:33 |
zbr | i need some hints on how to figure it out | 14:33 |
zbr | did any of you try to run elastic-recheck with py38? | 14:34 |
zbr | i wonder if the issue is specific to macos, py38 or both. | 14:34 |
corvus | zbr: i'd try calling recv with MSG_OOB and see if you get data | 14:34 |
mordred | zbr: my hunch is it's going to be specific to macos - I would doubt py38 has anything to do with it since it would be something setting that flag in the networking stack | 14:35 |
fungi | i'm still not finding any indication that tcp/urg is commonly used for ssh sockets | 14:35 |
fungi | (tcp/psh definitely is, but urg is surprising there) | 14:35 |
zbr | do any of you have an example of how to call recv from there? | 14:44 |
zbr | yep, i do get OOB data, probably as json. stdout.channel.recv(socket.MSG_OOB) | 14:49 |
corvus | zbr: i was going to suggest recv(stdout.channel.fileno()) but that looks plausible too :) | 14:50 |
corvus | zbr: i'm really curious what data you get | 14:50 |
mordred | me too | 14:50 |
mordred | corvus: I feel like OOB packets are a thing we don't use nearly enough ;) | 14:51 |
corvus | mordred: i know | 14:51 |
mordred | there are whole sets of socket flags we're not making use of to their fullest! | 14:51 |
corvus | mordred: speaking of which, i had a dream last night about finally adding in that "this job will fail but it's still running" flag to zuul. | 14:52 |
mordred | corvus: (honestly, gearman job cancelling seems like a good use-case for those) | 14:52 |
mordred | corvus: what a fun dream! | 14:52 |
corvus | (the gearman warning result; which, at this point, we probably wouldn't handle with gearman, but hey) | 14:53 |
corvus | (i'm reminded of this because we need an oob ansible->executor signal to accomplish this) | 14:54 |
zbr | how do i read the entire buffer? | 14:54 |
corvus | zbr: recv takes a length arg, so it's actually socket.recv(4096, socket.MSG_OOB) | 14:56 |
corvus | zbr: (socket.recv(socket.MSG_OOB) will read one byte because socket.MSG_OOB == 1) | 14:56 |
corvus | zbr: then you do that inside the poll loop until POLLPRI is clear | 14:56 |
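To make that concrete, a sketch of the loop corvus describes, against a plain socket (as the log notes shortly after, paramiko's Channel.recv() does not accept flags, so this only applies to a real socket):

```python
import select
import socket

def read_loop(sock):
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN | select.POLLPRI)
    while True:
        for fd, event in poller.poll():
            if event & select.POLLPRI:
                # Pass MSG_OOB as the flags argument; socket.MSG_OOB == 1,
                # so recv(socket.MSG_OOB) alone would just read one byte
                # of normal data.
                print('urgent:', sock.recv(4096, socket.MSG_OOB))
            if event & select.POLLIN:
                data = sock.recv(4096)
                if not data:
                    return  # peer closed the connection
```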
zbr | not this one, TypeError: recv() takes 2 positional arguments but 3 were given | 14:57 |
corvus | oh, then that's going to be a paramiko thing | 14:57 |
corvus | http://docs.paramiko.org/en/stable/api/channel.html#paramiko.channel.Channel.recv | 14:58 |
corvus | that only lists one arg :/ | 14:58 |
zbr | the api is socket-like, but it is not a real socket | 15:00 |
corvus | you could use the socket methods on the underlying fd | 15:02 |
zbr | so we may not have an OOB in the end | 15:02 |
zbr | no progress; so far we have no proof that it is an OOB message. | 15:11 |
zbr | but documentation tells us that it can be, so not sure why we would keep the code crashing for this case. | 15:12 |
corvus | zbr: what documentation tells us to expect pri to be set when we connect to gerrit stream events over ssh? | 15:12 |
zbr | i patched it locally and it seems to be happy to process data, so that priority flag is not an error | 15:12 |
zbr | for me https://code.woboq.org/gcc/include/sys/epoll.h.html#EPOLL_EVENTS is enough. | 15:14 |
zbr | testing a bitmask with == is almost never correct. | 15:15 |
corvus | zbr: it is correct if you want to test that only one bit is set, which is the intent here | 15:15 |
corvus | i thought we covered that already | 15:15 |
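Since the actual gerritlib lines aren't quoted in the log, a reconstruction of the two checks under discussion (on Linux, POLLIN is 0x1 and POLLPRI is 0x2, so POLLIN | POLLPRI is the 0x3 zbr observed):

```python
import select

event = select.POLLIN | select.POLLPRI  # == 3

# the existing check: proceed only when POLLIN is the *only* bit set,
# i.e. treat any exceptional condition as a reason to stop reading
if event == select.POLLIN:
    print('read')

# the proposed check: read whenever POLLIN is among the set bits, which
# silently ignores POLLPRI/POLLERR/POLLHUP arriving alongside the data
if event & select.POLLIN:
    print('read')
```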
*** tkajinam has quit IRC | 15:16 | |
zbr | is http://man7.org/linux/man-pages/man2/poll.2.html enough? | 15:16 |
zbr | search for POLLPRI | 15:17 |
zbr | i do not see that any of the documented cases are a good enough reason to raise an exception | 15:17 |
fungi | it's suggesting there is urgent out of band data which should be read, and we're not (yet) reading that, so it sounds like a potential problem to me | 15:18 |
zbr | to log something yes, but not to quit (and likely lose data too) | 15:18 |
*** ykarel is now known as ykarel|away | 15:18 | |
fungi | without knowing what the urgent message is, hard to say whether it's safe to just log | 15:18 |
zbr | where is the proof that there is a message? | 15:19 |
corvus | zbr: the man page explains what the different values mean. they don't tell us why you alone are receiving that value now, and our production systems have not hit this in 8 years of continuous use. | 15:19 |
corvus | the man page doesn't tell us how to handle the case, only that the case may happen | 15:19 |
zbr | good that we do not write air-traffic-control software... | 15:20 |
fungi | atc software would not be allowed to ignore error conditions ;) | 15:20 |
fungi | just ask clarkb | 15:20 |
corvus | and indeed, this program ignores no error conditions. your patch does. | 15:20 |
zbr | because for me, even without knowing the nature of the urgent notification, i prefer to continue processing data (and eventually log something) | 15:21 |
corvus | so yeah, this program is written in a *very* conservative way as far as errors go | 15:21 |
clarkb | fungi: yes, though setting the reboot routine as your interrupt handler for major errors is apparently acceptable :) | 15:21 |
zbr | what if your phone crashes on a specific text message? (oops, that already happened) | 15:21 |
zbr | please show me where it is written that any bitmask other than POLLIN is an error? | 15:22 |
clarkb | (we put the interrupt routine at address 0) | 15:22 |
clarkb | er reboot interrupt | 15:22 |
fungi | zbr: it's not necessarily an error, but it could indicate an error so not handling it is potentially a bug | 15:23 |
zbr | in fact even the presence of POLLOUT would break the current logic. | 15:23 |
clarkb | I think the big thing here is, as corvus said, we've not seen this in almost a decade of using these tools. The fact that it shows up is noteworthy and worth understanding in order to address properly | 15:23 |
*** dpawlik has quit IRC | 15:24 | |
zbr | i would add a logging.error for this purpose. | 15:24 |
zbr | at least we would know if it happens and how often. | 15:24 |
clarkb | zbr: re pollout, there are no writes on that connection iirc | 15:27 |
zbr | sadly searching for paramiko and POLLPRI brings zero results | 15:27 |
clarkb | ya its a stdout channel, so we should never write to it | 15:28 |
zbr | do you want me to setup a tmate with debugger? | 15:28 |
clarkb | zbr: have you tried recv on the fd? | 15:28 |
clarkb | (basically bypass whatever higher level apis are in front of the actual file descriptor and read directly) | 15:29 |
zbr | if only i can find the fd; no variable with a similar name so far. | 15:30 |
corvus | zbr: did you try stdout.channel.fileno() ? | 15:31 |
*** dirk has quit IRC | 15:38 | |
zbr | ok, i have `stdout.channel.fileno` which looks like one, but i cannot get a socket if there is no socket to start with. | 15:40 |
*** priteau has joined #opendev | 15:40 | |
zbr | i cannot create a socket out of a fd. | 15:40 |
clarkb | zbr: I think you can call os.read() directly against a fd | 15:41 |
zbr | socket and paramiko.Channel are very similar but not identical. I cannot give any flags to os.read() to ask about OOB stuff. | 15:42 |
zbr | i am also inclined to believe that OOB stuff is specific to socket, so it may not work with paramiko | 15:43 |
clarkb | zbr: looking at https://www.gnu.org/software/libc/manual/html_node/Out_002dof_002dBand-Data.html I think you can use a combo of read() and ioctl | 15:44 |
clarkb | basically pretending to be C via python | 15:44 |
corvus | i have to run some errands now, but i ask that if we decide that we want to ignore pollpri, then we do so in such a way that doesn't change the error handling structure of that code. in other words, just ignore pollpri. | 15:45 |
clarkb | zbr: basically you do that discard_until_mark thing, then read after that to get the OOB data? Since OOB data automatically marks the stream | 15:46 |
clarkb | also if we decide the OOB data is important I think that is roughly what gerritlib will need to do anyway | 15:47 |
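A sketch of that mark-based approach, assuming Linux (SIOCATMARK is 0x8905 in asm-generic/sockios.h; the python stdlib doesn't export it). Whether there is a real socket fd to hand here is exactly the open question above:

```python
import fcntl
import os
import struct

SIOCATMARK = 0x8905  # Linux-specific ioctl request number

def discard_until_mark(fd):
    # Throw away normal data until the read pointer sits at the urgent
    # mark, mirroring the discard_until_mark() example in the glibc
    # manual; a subsequent read then returns the OOB data.
    while True:
        buf = fcntl.ioctl(fd, SIOCATMARK, struct.pack('i', 0))
        if struct.unpack('i', buf)[0]:
            return
        os.read(fd, 4096)
```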
*** mlavalle has joined #opendev | 15:52 | |
zbr | for the moment I raised https://github.com/paramiko/paramiko/issues/1694 | 16:00 |
clarkb | slaweq: mordred fungi every job should already do external connectivity tests | 16:01 |
* clarkb looking that up now | 16:02 | |
fungi | clarkb: well, in this case what they want to test is that a nested instance can reach beyond neutron | 16:02 |
clarkb | I see, does that include off of the middle hypervisor? | 16:03 |
mordred | (which is a nice test of north/south traffic) | 16:03 |
clarkb | because devstack networking doesn't actually set that up | 16:03 |
fungi | and i guess pinging a fabricated virtual interface elsewhere on the job node might mask some failure cases? not really sure | 16:03 |
clarkb | there is no double nat for nested instance FIPs to allow return traffic to find the middle hypervisor's interface. packets will go out sourced from the nested instance fip then poof disappear | 16:04 |
fungi | yeah, i expect they will have to do some masquerading on the job node to make that work | 16:04 |
fungi | seems doable though, i don't really know what their test scenario looks like | 16:05 |
clarkb | if the second layer of nat were set up it would probably just work (tm) | 16:05 |
fungi | right | 16:05 |
zbr | i find it interesting that the gerritlib implementation is apparently the only one that does not give select.POLLIN to poll.register() | 16:06 |
zbr | http://codesearch.openstack.org/?q=select.POLLIN&i=nope&files=.*py&repos= | 16:06 |
zbr | there are lots of examples in openstack, but that is the only register() without filtering that I see. | 16:07 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974 | 16:12 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed linting issues https://review.opendev.org/729974 | 16:12 |
*** ysandeep is now known as ysandeep|afk | 16:14 | |
*** hashar has joined #opendev | 16:14 | |
openstackgerrit | Merged openstack/project-config master: Replace old Ussuri cycle signing key with Victoria https://review.opendev.org/729804 | 16:15 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/gerritlib master: Fixed POLLIN event check https://review.opendev.org/729966 | 16:27 |
clarkb | mordred: if you have time could you look over https://review.opendev.org/#/c/729659/ for the gitea 1.12.0-rc1 testing? You normally do those updates so would be good for you to check I didn't miss anything obvious. I'm not sure we want to land that and run the rc code, but we probably could if people want to go for it | 16:31 |
*** priteau has quit IRC | 16:33 | |
mordred | clarkb: what do you need gpg for? | 16:35 |
mordred | (just curious - I looked at it yesterday and it looks good to me - +2 - but I'm about to run an errand, so didn't +A - feel free to though) | 16:37 |
clarkb | mordred: not sure, the upstream image added it though. If I had to guess to work with signed commits | 16:37 |
clarkb | infra-root https://review.opendev.org/#/c/729619/3 is at the bottom of corvus' stack to do more zuul things if you have a moment | 16:37 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/elastic-recheck master: WIP: Create elastic-recheck docker image https://review.opendev.org/729623 | 16:43 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336 | 17:21 |
clarkb | fungi: do you have time for https://review.opendev.org/#/c/729619/ ? you reviewed its parent | 17:24 |
*** ysandeep|afk is now known as ysandeep | 17:25 | |
fungi | sure, i can take a quick look | 17:25 |
clarkb | also if anyone else wants to look at https://etherpad.opendev.org/p/XRyf4UliAKI9nRGstsP4 for advisory board bootstrapping now is the time to do that. I'm hoping to send that out soonish | 17:26 |
clarkb | I believe I've incorporated/addressed all the existing feedback (thank you for that) | 17:26 |
fungi | your updates lgtm | 17:32 |
openstackgerrit | Merged zuul/zuul-jobs master: Remove failovermethod from fedora dnf repo configs https://review.opendev.org/729774 | 17:33 |
*** ysandeep is now known as ysandeep|away | 17:40 | |
openstackgerrit | James E. Blair proposed opendev/system-config master: Use sqlite with Zuul in the gate https://review.opendev.org/729786 | 17:43 |
openstackgerrit | Sorin Sbarnea (zbr) proposed opendev/puppet-elastic_recheck master: WIP: Use py3 with elastic-recheck https://review.opendev.org/729336 | 17:48 |
openstackgerrit | Merged opendev/system-config master: Save zuul and nodepool logs from gate test jobs https://review.opendev.org/729619 | 17:54 |
openstackgerrit | Merged opendev/system-config master: Vendor the apt repo gpg keys used for Zuul https://review.opendev.org/729401 | 18:28 |
*** hashar has quit IRC | 18:40 | |
clarkb | ok email about advisory board has been sent | 18:41 |
fungi | thanks! | 18:58 |
smcginnis | hrw: Looks like your entire mail spool is dumping into #openstack-requirements | 19:01 |
smcginnis | halp | 19:02 |
clarkb | smcginnis: can you kick hrw (maybe even ban if rejoined) | 19:02 |
smcginnis | Looks like he killed it. | 19:02 |
hrw | sorry for that | 19:02 |
fungi | anything sensitive? do i need to perform surgery on our channel log archives? | 19:03 |
hrw | no, twitter feed | 19:04 |
smcginnis | twitter2irc probably has limited use. :) | 19:04 |
fungi | i hear there is such a thing | 19:04 |
fungi | at least the message length is compatible ;) | 19:05 |
hrw | ;D | 19:05 |
smcginnis | Hah | 19:05 |
hrw | everything2irc more or less exists | 19:05 |
hrw | just at different levels of usability | 19:05 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Do not interpolate values from tox --showconfig https://review.opendev.org/729520 | 19:23 |
openstackgerrit | Merged opendev/system-config master: Run Zuul as the zuuld user https://review.opendev.org/726958 | 19:30 |
*** dpawlik has joined #opendev | 20:08 | |
*** dpawlik has quit IRC | 20:52 | |
*** hrw has quit IRC | 20:56 | |
corvus | infra-prod-base failed on that ^ | 20:58 |
clarkb | that was expected though right? | 20:59 |
corvus | not really? | 20:59 |
corvus | i'm looking at /var/log/ansible/base.yaml.log to try to ascertain the error | 20:59 |
corvus | but all i see is several mb of json | 21:00 |
corvus | okay, after scanning the list several times, i see this: review-dev01.opendev.org : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0 | 21:01 |
*** hrw has joined #opendev | 21:01 | |
corvus | so i guess something failed on review-dev01? | 21:01 |
corvus | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root'] | 21:01 |
corvus | No space left on device | 21:02 |
clarkb | track_upstream.log is 23GB | 21:04 |
clarkb | infra-root should we just cat /dev/null > /var/log/track_upstream.log for now? | 21:05 |
corvus | yeah. it's throwing host key errors | 21:06 |
corvus | so something is broken but i don't think we care right now | 21:06 |
corvus | we'll get new errors real quick | 21:06 |
clarkb | ok I'll run that now | 21:06 |
clarkb | thats done and ya the file is already filling quick | 21:07 |
clarkb | but we should be good for a while | 21:07 |
corvus | re-enqueing 726958 | 21:07 |
clarkb | I've confirmed that the ssh key it doesn't like is the one reported by the server | 21:11 |
clarkb | the track upstream command is using roots ssh configs not gerrit2's | 21:13 |
clarkb | I'm going to see if I can ssh as root and accept the key | 21:14 |
clarkb | ok its talking to port 22 not 29418 | 21:17 |
clarkb | I wonder if this is related to zbr's change | 21:22 |
clarkb | 2020-05-21 21:21:31,733: gerrit.GerritConnection - ERROR - Exception connecting to review-dev.opendev.org:29418 - it thinks it is connecting to 29418 but if I look at the keys it seems to be getting the port 22 key? | 21:22 |
corvus | clarkb: what change? | 21:22 |
clarkb | corvus: https://review.opendev.org/#/c/729699/ or rather if its related to the sort of thing that addressed | 21:25 |
*** DSpider has quit IRC | 21:26 | |
corvus | i thought from the log it was going on longer than that | 21:27 |
clarkb | corvus: ya I don't think that change has caused it, but maybe there is a similar bug in gerritlib that is? | 21:27 |
clarkb | also that change is not in the images on review-dev for gerrit because we consume gerritlib from releases (just double checked that) | 21:28 |
openstackgerrit | Monty Taylor proposed openstack/project-config master: Allow ansible-collections-openstack-release team to push tags https://review.opendev.org/730129 | 21:29 |
clarkb | if we look at /root/.ssh/known_hosts the host key there is correct for port 29418 | 21:29 |
clarkb | if I ssh as root to the jeepyb user on port 29418 that works with no error (because the host key is correct) | 21:30 |
clarkb | the key that it seems to be getting is the port 22 hostkey | 21:30 |
clarkb | jeepyb defaults to 29418 if not set (I can't find us setting a different port) | 21:30 |
clarkb | this is quite weird | 21:30 |
clarkb | I'm going to try sshing now within the container | 21:31 |
clarkb | manual ssh from within the container also works | 21:31 |
clarkb | mordred: ^ any ideas? | 21:32 |
corvus | separately, the infra-prod-base job failed again. this time there are no failures reported in the log, so i don't know why it "failed" -- https://zuul.opendev.org/t/openstack/build/83b7df5203064553a66b0a293f36c228 | 21:33 |
corvus | paste01.openstack.org : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0 | 21:33 |
corvus | could it be that? | 21:33 |
clarkb | corvus: yes unreachable hosts cause it problems. ianw has a change up to deal with that by ignoring them (but I'm not sure that is correct either) | 21:34 |
corvus | i think it had an ssh host key change? | 21:34 |
corvus | hrm | 21:35 |
corvus | well it said Warning: Permanently added the RSA host key for IP address '23.253.235.223' to the list of known hosts. | 21:35 |
corvus | but i guess it had the key already with a different address? | 21:35 |
corvus | i've enqueued the change again | 21:36 |
clarkb | corvus: possible that ipv6 stopped working between those hosts so it fell back to ipv4? | 21:36 |
corvus | maybe | 21:36 |
corvus | it also said Last login: Thu Aug 16 16:59:31 2018 from 2001:4800:7818:101:3c21:a454:23ed:4072 | 21:36 |
*** Dmitrii-Sh has quit IRC | 21:38 | |
mordred | I just tried sshing to paste01 and I'm getting an ssh host key differs | 21:40 |
mordred | AND | 21:40 |
mordred | an it wants to do 23.253.235.223 | 21:40 |
mordred | and I can't ssh to the ipv6 | 21:41 |
clarkb | looking at the review-dev thing more I think what is happening is paramiko is finding a known hosts entry for the port 22 host key and applying it to port 29418. I cannot see how that is happening though | 21:42 |
clarkb | and I see a bunch of 29418 connections so I think it is connecting to 29418 properly, just using the wrong known_hosts value | 21:43 |
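One quick way to probe that theory: OpenSSH (and paramiko's HostKeys parser) store non-22 entries under "[host]:port", so a lookup by bare hostname only ever finds the port-22 key. A sketch, assuming the known_hosts path bind-mounted into the container:

```python
import paramiko

keys = paramiko.hostkeys.HostKeys('/root/.ssh/known_hosts')
# bare name -> the port 22 entry (if any)
print(keys.lookup('review-dev.opendev.org'))
# bracketed form -> the port 29418 entry
print(keys.lookup('[review-dev.opendev.org]:29418'))
```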
*** Dmitrii-Sh has joined #opendev | 21:43 | |
mordred | clarkb: blink blink | 21:43 |
mordred | oh - hrm. I can ssh to paste.openstack.org just fine - it's paste01.openstack.org that has a conflicting hostkey for me - so that might just be old cruft somehow | 21:44 |
mordred | still can't connect to ipv6 though | 21:44 |
clarkb | there are no sshfp records that I can see | 21:45 |
clarkb | (which was one idea for why it might've gotten mixed up) | 21:45 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 21:50 |
clarkb | I'm stumped | 21:56 |
clarkb | ssh-keyscan, manual ssh, and manually reading configs shows me that everything should be happy | 21:57 |
clarkb | anyone have other ideas for what we should look at? I think the implication is that it isn't using the known_hosts file we bind mount in. But I have no idea what other known_hosts info it could be using (which is why I checked sshfp) | 21:59 |
mordred | clarkb: no - because you tried from inside the container right? | 21:59 |
clarkb | mordred: ya inside and out | 22:00 |
mordred | clarkb: is it a user thing perhaps? are we running track-upstream as a different user inside the container than you're exec-ing in as? | 22:00 |
corvus | what container? | 22:00 |
* mordred trying to think of left-field things | 22:00 | |
mordred | corvus: the gerrit container | 22:01 |
mordred | or - the track-upstream container - which uses the gerrit image | 22:01 |
corvus | ah, sorry, gotcha | 22:01 |
clarkb | corvus: mordred ya I'm replacing track-upstream command with bash and using the same docker mounts | 22:01 |
clarkb | I did have to add -it though | 22:01 |
mordred | yeah - that's normal | 22:01 |
clarkb | mordred: externally the container is started as root and it runs as root in the container I think | 22:01 |
clarkb | mordred: though maybe it can't read the known_hosts file? | 22:01 |
corvus | there are a lot of containers running | 22:02 |
corvus | with random names | 22:02 |
clarkb | corvus: yes, they are run by cron and I think they are in a restart loop for a while before they give up | 22:02 |
*** slaweq has quit IRC | 22:02 | |
mordred | we should maybe kill some of those | 22:03 |
mordred | like: docker ps | grep -v gerritcompose_gerrit_1 | grep 'Up 5 weeks' | awk '{print $1}' | xargs -n1 docker stop | 22:05 |
clarkb | oh could it be these are stale | 22:05 |
mordred | clarkb: is it possible that this is not a problem anymore but was 5 weeks ago | 22:05 |
clarkb | and new runs would be happy | 22:05 |
clarkb | ya | 22:05 |
mordred | and yeah - that's why you can't reproduce it | 22:05 |
mordred | how about I run that? | 22:05 |
corvus | 1 sec | 22:05 |
mordred | kk | 22:05 |
corvus | mordred: ok clear from me | 22:06 |
clarkb | ya I think that should be fine too | 22:06 |
corvus | (i just wanted to verify they were track-upstream procs, and they are) | 22:06 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: add simple test runner https://review.opendev.org/728684 | 22:06 |
mordred | k. stopped | 22:06 |
clarkb | the next track upstream runs in about 35 minutes | 22:06 |
clarkb | I guess we can see if it is happy then | 22:06 |
mordred | yeah | 22:07 |
mordred | I'm guessing it will be because _all_ of these were 5 weeks old | 22:07 |
mordred | and we run cron hourly | 22:07 |
clarkb | ya | 22:07 |
mordred | *phew* | 22:07 |
fungi | sorry, catching back up, so summary is that we filled up review's rootfs because there were a bunch of track-upstream invocations from 5 weeks ago which were continually filling the log with ssh host key mismatch errors due to incorrectly deployed containers from the same timeframe? | 22:11 |
clarkb | fungi: yes, except it was review-dev not review | 22:12 |
corvus | okay this time infra-prod-base failed here: mirror01.kna1.airship-citycloud.opendev.org : ok=25 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0 | 22:12 |
fungi | aha, okay, review-dev, got it | 22:12 |
corvus | and also paste01 is still unreachable | 22:12 |
corvus | i don't think the current situation is tenable | 22:12 |
clarkb | I've reproduced the ipv6 connectivity issue to paste over port 22 | 22:13 |
corvus | clarkb: oh you've identified an issue? | 22:13 |
clarkb | it doesn't ping either | 22:13 |
clarkb | corvus: ya its looking like ipv6 to paste is not working | 22:14 |
corvus | clarkb: but v4 is fine; so why is ansible unhappy? | 22:14 |
clarkb | corvus: probably because ansible is preferring v6 since we set a specific ansible_host addr? | 22:14 |
clarkb | corvus: I bet if we used names rather than addrs it would fall back from ipv6 to ipv4 | 22:15 |
clarkb | but we set a specific address instead iirc | 22:15 |
corvus | indeed, paste01 is ansible_host: 2001:4800:7817:103:be76:4eff:fe05:176e | 22:15 |
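The reason a literal v6 ansible_host can't degrade gracefully: clients given a name walk every resolver result in order and fall back to v4, roughly like the sketch below, while a hard-coded v6 address skips that loop entirely:

```python
import socket

def connect_with_fallback(host, port, timeout=5):
    # Try each address family getaddrinfo() returns, in order; with a
    # hostname this falls back from AAAA to A records automatically.
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError:
            sock.close()
    raise OSError(f'no address for {host}:{port} was reachable')
```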
donnyd | so I started capturing some metrics on how many jobs are run per day and per week on OE | 22:15 |
clarkb | donnyd: we've got some of that data in graphite too if you need it | 22:16 |
clarkb | corvus: for now we can switch ansible_host to ipv4 address maybe? | 22:16 |
corvus | the issue on the kna1 mirror seems disk related: -bash: /etc/profile: Input/output error | 22:16 |
corvus | it seems possible that host may be about to start bombing jobs | 22:17 |
clarkb | corvus: thats weird I'm able to open /etc/profile on that host | 22:17 |
corvus | clarkb: mirror01.kna1.airship-citycloud.opendev.org ? | 22:17 |
clarkb | corvus: ya | 22:17 |
clarkb | and nothing in dmesg since it last rebooted | 22:18 |
clarkb | (9 days ago) | 22:18 |
clarkb | er wait | 22:18 |
clarkb | I was looking at wrong terminal /me double checks | 22:18 |
corvus | when i ssh in, i get immediately disconnected after the io error | 22:18 |
clarkb | ya it looks like ssh succeeds but it just errors out after motd | 22:18 |
clarkb | corvus: ya I missed that it was failing as it printed out the motd and everything and then you get your old shell prompt back | 22:19 |
corvus | i'm assuming the disk is gone | 22:19 |
clarkb | corvus: we can put it in the emergency file for now | 22:19 |
clarkb | (that shoul dsolve the ansible side problem) | 22:20 |
mordred | corvus: I agree - this isn't super tenable - but I'm not sure which thing would be better - if we put the base role in each service playbook, then we theoretically run a bunch of stuff each time we run a service that isn't useful, but in exchange services stay decoupled. we could also not make a service playbook dependent on base running - at the potential cost that something from base that's important | 22:20 |
mordred | for a service doesn't get applied. but maybe shifting base roles into service playbooks is better (I think you advocated for that before) | 22:20 |
clarkb | mordred: ianw's change does the second thing you suggested | 22:20 |
clarkb | re that mirror should we try a reboot? | 22:21 |
mordred | yeah - I wasn't super crazy about that change because I was worried we'd still miss important updates in base | 22:21 |
mordred | clarkb: ++ | 22:21 |
mordred | clarkb: I don't think we can debug it any further in its current state | 22:21 |
corvus | yeah, it just seems that threading the needle on having every host succeed is not working well for us today (we've got 3 moles needing whacking so far) | 22:21 |
corvus | ++reboot | 22:21 |
clarkb | I'll work on the reboot of kna1's mirror | 22:21 |
ianw | hrm, i think i had problems with that the other day | 22:22 |
corvus | for paste -- should we just prefer ipv4 addrs always? | 22:22 |
mordred | corvus: I think what I *really* want is a per-service base playbook that runs base and only gets run if base stuff has changed - but my brain can't quite wrap its head around implementing that | 22:22 |
ianw | clarkb: 2020-05-20 08:00:46 UTC rebooted mirror.kna1.airship-citycloud.opendev.org ; it was refusing a few connection and had some old hung processes lying around | 22:22 |
ianw | is it doing the same thing again? | 22:22 |
clarkb | ianw: ya | 22:22 |
clarkb | corvus: I think ipv4 is likely to be more reliable in general | 22:22 |
mordred | ianw: we don't know - we can't get in to see if it has hung processes lying around | 22:22 |
corvus | i think we wanted ip addrs for ansible_host so that we wouldn't mess up a prod server when booting a replacement | 22:22 |
mordred | corvus: yes, that is correct | 22:23 |
clarkb | but that's based on having comcast-provided ipv6 in the past where things would randomly just not work | 22:23 |
mordred | clarkb: I think ipv6 has been mostly stable for us | 22:23 |
corvus | mostly -- but i think we've still seen more issues with, say, gerrit than ipv4. <--anecdata | 22:23 |
clarkb | mordred: we've seen similar flakiness fwiw | 22:23 |
mordred | I bet if we reboot paste (or just did a restart of the ipv6 networking stack) ipv6 would come back there | 22:23 |
clarkb | mordred: some of the logstash worker nodes can't talk to the gearman server via ipv6 | 22:23 |
clarkb | so they are slow to start up as they fall back to ipv4 | 22:23 |
clarkb | mordred: thats possible | 22:24 |
corvus | mordred: oh you reckon that's client side | 22:24 |
corvus | i'll reboot paste | 22:24 |
mordred | corvus: yeah - the host seems to think it has an ipv6 address, but I can't get to it - so since it's been up for >800 days I wouldn't be surprised if something got borked in rax routing tables | 22:24 |
clarkb | I've asked the mirror in kna1 to reboot | 22:25 |
ianw | i'm pretty sure i still have an open ticket about rax hosts not being able to do symmetrical ipv6 | 22:25 |
fungi | catching back up again, yes rebooting instances in rackspace has sometimes solved the sudden ipv6 connectivity breakage | 22:25 |
fungi | ianw: i think that ticket was eventually closed | 22:26 |
clarkb | mirror's up | 22:26 |
fungi | if memory serves, the resolution was that "all" you need to do is delete and recreate the port/interface | 22:26 |
fungi | (in which case you may as well delete and redeploy the instance) | 22:27 |
corvus | paste is back up, but v6 isn't any better | 22:27 |
ianw | yeah, https://portal.rackspace.com/610275/tickets/details/190621-ord-0000027 | 22:27 |
corvus | oh, well, it can at least ping outbound v6 now | 22:27 |
clarkb | nothing in syslog or kern.log | 22:27 |
clarkb | on the mirror for ^ | 22:28 |
fungi | i suppose we could switch the inventory to the v4 address? should probably also take the aaaa out of dns for it | 22:28 |
fungi | at least until we can deploy a replacement | 22:28 |
corvus | mordred: are you on paste from a v6 addr? | 22:28 |
clarkb | also mirror had ansible things in syslog from 21:40ish | 22:29 |
fungi | i'm able to ssh to paste over ipv6 | 22:29 |
clarkb | so it seemed to be a relatively recent occurrence | 22:29 |
fungi | maybe it's fixed? | 22:29 |
ianw | PING paste.openstack.org(paste01.openstack.org (2001:4800:7817:103:be76:4eff:fe05:176e)) 56 data bytes | 22:29 |
ianw | 64 bytes from paste01.openstack.org (2001:4800:7817:103:be76:4eff:fe05:176e): icmp_seq=1 ttl=46 time=280 ms | 22:29 |
ianw | from .au :) | 22:29 |
corvus | but not from bridge.o.o | 22:29 |
corvus | (though bridge can ping6 google.com) | 22:30 |
fungi | yeah, when we see these connectivity issues, it's typically between specific instances in rackspace | 22:30 |
mordred | corvus: yes | 22:31 |
mordred | corvus: I am in over ipv6 - although I could not get in from my laptop over ipv6 before you rebooted | 22:31 |
fungi | suggesting there's a dead route somewhere (and usually the packets only go missing in one direction according to packet captures) | 22:31 |
mordred | well - I will say that gives weight to clarkb's suggestion of using ipv4 for ansible hosts | 22:32 |
mordred | actually, I think that was a suggestion to use hostnames - but that does make replacements harder | 22:32 |
fungi | also possible there's an incorrectly cached flow somewhere in their network which will age out now that the instance has been rebooted | 22:32 |
clarkb | mordred: I was suggesting ipv4 as ime its more reliable | 22:32 |
ianw | yep, https://portal.rackspace.com/610275/tickets/details/180206-iad-0005440 is the original issue about one-way ipv6 ... i bet pinging from paste you see the packets on bridge, but they never make it back | 22:32 |
mordred | yeah | 22:32 |
clarkb | ipv6 is almost as reliable but weirdness happens more often ime | 22:32 |
mordred | if we occasionally get unexplained ipv6 issues between hosts in rax - then as much as I'd prefer v6 on general principle, it seems like we're opening ourselves up to more likely ansible woes | 22:33 |
fungi | ipv6 is plenty reliable, it's that providers have deployed it in so many unreliable ways due to complex migration plans | 22:33 |
clarkb | and that happens at the isp level (comcast had a bit bucket in denver back in the day) and in cloud providers (this thing with rax and vexxhost not being able to reach google dns) | 22:33 |
mordred | fungi: yah | 22:33 |
fungi | and also them not noticing bugs because so few customers tried to use it with any seriousness until relatively recently | 22:34 |
fungi | s/bugs/misconfigurations/ | 22:34 |
mordred | corvus: I have an idea for base splitting that MIGHT not be *100%* terrible (sort of the thing I suggested above but didn't know how to do) | 22:37 |
corvus | mordred: cool :) | 22:38 |
corvus | i'll see about switching our ansible_hosts to v4 | 22:38 |
mordred | corvus: what if we make child-jobs of infra-prod-run-base for each service that don't do anything different but add like an "ansible_limit_hosts" variable | 22:38 |
mordred | corvus: and then make each service soft-depend on the host-specific version of base | 22:38 |
mordred | when we DO change base it'll be a larger number of smaller jobs that have to run - but we won't have service-zuul not running because run-base couldn't talk to paste | 22:38 |
mordred | s/host-specific/service-specific/ | 22:39 |
corvus | mordred: what would cause the service-specific jobs to run? | 22:39 |
corvus | file matchers that are the same as the service? | 22:39 |
mordred | no - file matchers that are specific to "base needs to run" - we'd ditch "run a single base playbook" completely - so if base doesn't need to run, it doesn't and we skip it | 22:40 |
clarkb | mordred: if we did that wouldn't we want to fail if base failed? | 22:41 |
mordred | yes - but then we're only failing if base fails for the specific service in question | 22:41 |
mordred | so if base fails on a zuul host, then not running service-zuul is the right choice | 22:41 |
corvus | ooh | 22:41 |
clarkb | gotcha | 22:41 |
clarkb | but gitea or whatever could still run | 22:42 |
mordred | yeah | 22:42 |
mordred | but we still get to use zuul file matchers to avoid running useless things | 22:42 |
mordred | we could also do service-base specific file matchers if needed - like if we wanted to also trigger a base run for a service if host vars changed (because iptables) | 22:42 |
mordred | but - probably just triggering all the base jobs if host_vars/* changes isn't a terrible idea | 22:43 |
donnyd | clarkb: Oh, I didn't know we tracked that kind of data | 22:43 |
corvus | mordred: so a change that touches something in base would cause all the little base jobs to run, and then, say, also the zuul service job (if the change was a zuul change). and the zuul service job only depends on its little bit of base. | 22:43 |
mordred | corvus: yes | 22:43 |
clarkb | donnyd: I'm not sure we specifically track jobs per provider but do have resources per provider | 22:43 |
*** yuri has quit IRC | 22:43 | |
clarkb | donnyd: but wanted to point out we have some related info | 22:43 |
mordred | corvus: we could also do it with a bunch of little service-base playbooks and not with an ansible_limit_hosts variable - the downside to a limit_hosts var is that we're keeping host lists in 2 different places | 22:44 |
mordred | but the idea would be the same | 22:44 |
corvus | mordred: fwiw, base is gonna run on a lot of changes (any hostvars change) | 22:44 |
donnyd | I was honestly just curious about the actual number of jobs launched per day/week | 22:44 |
mordred | corvus: yeah. I mean - we could get more specific on those hostvars changes if we did individual playbooks | 22:45 |
donnyd | 1300-1500 per day and from the week total about 8500 on my crappy little cloud | 22:45 |
donnyd | Not too shabby | 22:46 |
mordred | donnyd: ++ | 22:46 |
donnyd | i wish i could make it go faster, which i am, but it will be a little while yet | 22:46 |
corvus | mordred: what's the right way to run generate_inventory.py? | 22:47 |
donnyd | I am replacing the 8 big compute servers with 32 little ones | 22:47 |
mordred | corvus: I don't know that it's designed to be re-run - I think I just used it the first time - but let me look | 22:48 |
donnyd | So i am curious to see if more == faster | 22:48 |
corvus | mordred: to get the right clouds.yaml | 22:48 |
mordred | corvus: yeah - that's just it - we don't have a split clouds.yaml any more | 22:48 |
mordred | so that is not designed for our current situation (We should probably delete it) | 22:48 |
mordred | corvus: I think you'd be better off doing a quick yaml transform on the existing inventory | 22:49 |
*** tkajinam has joined #opendev | 22:49 | |
openstackgerrit | James E. Blair proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144 | 22:52 |
corvus | or emacs | 22:52 |
*** Dmitrii-Sh has quit IRC | 22:53 | |
clarkb | corvus: any idea why ask01 didn't update? | 22:54 |
mordred | corvus, clarkb: so - do y'all like the split-base idea enough for me to work up a patch? (I don't want to work up that patch if we don't provisionally like the idea) | 23:00 |
clarkb | mordred: I think my biggest concern with it is it will make understanding when and how to run jobs even more complicated | 23:01 |
clarkb | I like the idea though. If we can make it simple enough that it itself doesn't break constantly I think it could be worthwhile | 23:01 |
mordred | yeah - that's my concern too | 23:10 |
mordred | clarkb: I think we would need to do a pile of base playbooks | 23:10 |
mordred | instead of zuul job host limit strings | 23:11 |
mordred | we'd never understand the limit strings | 23:11 |
*** tkajinam_ has joined #opendev | 23:18 | |
*** tkajinam has quit IRC | 23:20 | |
*** Dmitrii-Sh has joined #opendev | 23:21 | |
openstackgerrit | Merged opendev/system-config master: Add iptables_extra_allowed_groups https://review.opendev.org/726475 | 23:23 |
openstackgerrit | Merged opendev/system-config master: Add support for multiple jvbs behind meetpad https://review.opendev.org/729008 | 23:23 |
mordred | corvus: don't know if you saw from clarkb on your inventory patch - but a host was missed | 23:33 |
mordred | corvus: ask01.openstack.org | 23:34 |
mordred | it otherwise looks great | 23:34 |
mordred | (I suppose I could just fix that real quick) | 23:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144 | 23:35 |
mordred | clarkb: ^^ | 23:35 |
corvus | ah thx, sorry | 23:35 |
mordred | corvus: no worries - I just also realized that it's a one-line thing, so you know, quicker to just fix it than poke :) | 23:37 |
ianw | i've been through and removed old servers/volumes/dns entries for all active clouds in https://etherpad.opendev.org/p/openstack.org-mirror-be-gone ... so that's everything gone and cleaned up for the openstack.org mirrors | 23:37 |
mordred | ianw: wow. nice | 23:37 |
ianw | clarkb: did you end up thinking the inventory fixture in https://review.opendev.org/729418 might be helpful? | 23:43 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Use ipv4 in inventory https://review.opendev.org/730144 | 23:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Use ipv4 in server launch inventory output https://review.opendev.org/730149 | 23:51 |
ianw | argghhh sorry i cherry-picked that and thus ended up rebasing it | 23:52 |
ianw | i didn't mean to reupload 730144 | 23:52 |