*** dviroel|out is now known as dviroel | 00:21 | |
*** rlandy is now known as rlandy|out | 00:31 | |
opendevreview | Ian Wienand proposed opendev/glean master: testing: Add ipv6 details to OVH https://review.opendev.org/c/opendev/glean/+/843225 | 00:53 |
---|---|---|
*** dviroel is now known as dviroel|out | 01:24 | |
opendevreview | Ian Wienand proposed opendev/glean master: Revert "Add option to ignore config drive interfaces info" https://review.opendev.org/c/opendev/glean/+/843225 | 02:03 |
*** rcastillo_ is now known as rcastillo | 02:56 | |
*** ysandeep|out is now known as ysandeep|rover | 04:35 | |
*** ysandeep|rover is now known as ysandeep|rover|brb | 05:25 | |
*** ysandeep|rover|brb is now known as ysandeep|rover | 05:32 | |
opendevreview | Ian Wienand proposed opendev/glean master: write_redhat_interfaces: refactor to walk interfaces first https://review.opendev.org/c/opendev/glean/+/843241 | 06:57 |
opendevreview | Ian Wienand proposed opendev/glean master: write_redhat_interfaces: pass multiple networks to output functions https://review.opendev.org/c/opendev/glean/+/843242 | 06:57 |
opendevreview | Ian Wienand proposed opendev/glean master: [wip] write out ipv6 https://review.opendev.org/c/opendev/glean/+/843243 | 06:57 |
ianw | clarkb/corvus: ^ i now understand why we just skipped ipv6. the whole thing is written to more or less assume that one network entry == one ifcfg-* config file. this fails when two network entries (ipv4 & ipv6) == one ifcfg-* | 07:00 |
ianw | i think that maps out a path forward, but it doesn't work yet. i'll keep at it, but fyi | 07:00 |
*** jpena|off is now known as jpena | 07:33 | |
*** ysandeep|rover is now known as ysandeep|rover|lunch | 07:44 | |
frickler | not sure if related, but I'm seeing retries on ovh like https://zuul.opendev.org/t/openstack/build/3af04b3c4db24a01b88605924e5a2f1c | 07:51 |
frickler | actually only a one off it seems, so possibly just coincidence | 07:54 |
*** ysandeep|rover|lunch is now known as ysandeep|rover | 08:33 | |
*** rlandy|out is now known as rlandy | 10:23 | |
*** arxcruz is now known as arxcruz|off | 10:52 | |
*** dviroel|out is now known as dviroel | 11:30 | |
fungi | the up side is, that's exactly why we put that check in validate-host, so it would be caught as early as possible and retry rather than fail outright | 11:44 |
fungi | though the traceroute is actually breaking because of dns resolution issues, according to the output from the task | 11:45 |
fungi | "opendev.org: Temporary failure in name resolution" | 11:45 |
fungi | both the v4 and v6 traceroute failed trying to trace to opendev.org because of that | 11:45 |
fungi | we didn't collect the log from unbound though, so not sure if it was struggling | 11:47 |
frickler | yes, I got red-herringed by ian's glean patches earlier and then the console only telling something about "no valid v4/v6 interface found" | 11:48 |
frickler | seems there are some more hits, but not too many. https://zuul.opendev.org/t/openstack/build/a5a754408ce2485c9a27f50823a940c5 for example | 11:49 |
frickler | might be that ovh is a common factor though | 11:50 |
*** rlandy is now known as rlandy|biab | 12:36 | |
*** ysandeep|rover is now known as ysandeep|rover|afk | 12:41 | |
*** rlandy|biab is now known as rlandy | 12:53 | |
*** ysandeep|rover|afk is now known as ysandeep|rover | 13:31 | |
opendevreview | Jeremy Stanley proposed opendev/glean master: write_redhat_interfaces: pass multiple networks to output functions https://review.opendev.org/c/opendev/glean/+/843242 | 13:32 |
opendevreview | Jeremy Stanley proposed opendev/glean master: [wip] write out ipv6 https://review.opendev.org/c/opendev/glean/+/843243 | 13:32 |
opendevreview | Merged zuul/zuul-jobs master: ensure-podman: Remove kubic from Ubuntu 18.04 and drop 20.04 https://review.opendev.org/c/zuul/zuul-jobs/+/843093 | 13:48 |
opendevreview | Merged zuul/zuul-jobs master: buildset registry: run socat in new session https://review.opendev.org/c/zuul/zuul-jobs/+/843038 | 15:00 |
*** dviroel is now known as dviroel|lunch | 15:09 | |
*** rlandy is now known as rlandy|mtg | 15:25 | |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Check and mount boot volume for data extraction with nouuid https://review.opendev.org/c/openstack/diskimage-builder/+/843297 | 15:37 |
clarkb | The gerrit 3.6.0 upgrade is going to be more difficult than the more recent ones we've done due to the "copy-approvals" command that needs to be run. | 15:41 |
clarkb | Details in a commit message for updated gerrit images shorty | 15:41 |
clarkb | *shortly | 15:41 |
fungi | so they're reworking how votes/labels are recorded and stored then? | 15:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Gerrit images to 3.4.5 and 3.5.2 https://review.opendev.org/c/opendev/system-config/+/843298 | 15:45 |
clarkb | yup | 15:45 |
clarkb | or at least that is the implication I haven't dug into the specifics yet | 15:45 |
clarkb | fungi: I think older gerrit would look at old patchsets to see if any votes needed to be forward ported to current patchsets. But starting in 3.6.0 they don't do that so you have to forward port manually which can be slow (their warning) | 15:45 |
clarkb | But we should be able to punt on all of that until we're ready to start looking at the 3.6 upgrade | 15:46 |
clarkb | for now we just get up to date on our images and be aware of that larger upgrade process to 3.6 when we get there | 15:46 |
fungi | got it, thanks for flagging | 15:48 |
opendevreview | Merged openstack/diskimage-builder master: Make centos reset-bls-entries behave the same as rhel https://review.opendev.org/c/openstack/diskimage-builder/+/839830 | 16:04 |
*** marios is now known as marios|out | 16:08 | |
*** dviroel|lunch is now known as dviroel | 16:16 | |
clarkb | corvus: do schedulers and web components have a graceful stop mechanism? It looks like no but double checking | 16:19 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 16:29 |
clarkb | infra-root ^ I think I've got most of the details there but left TODOs where I wanted feedback | 16:29 |
clarkb | If you can take a look and leave comments on what you think would be appropriate that would be great | 16:29 |
*** rlandy|mtg is now known as rlandy | 16:34 | |
clarkb | once we've gotten those bits cleaned up the next step would be to run it manually in a screen session from bridge. Then if that works we can automate it | 16:34 |
corvus | clarkb: are you aware of https://review.opendev.org/828176 ? | 16:41 |
clarkb | oh no I had completely missed that | 16:42 |
clarkb | I can rebase on that | 16:42 |
corvus | clarkb: i think i like my approach for waiting better -- but maybe we should add a 'down' like you add in yours | 16:43 |
clarkb | corvus: my only concern with that approach to waiting is that I'm not sure it will ever timeout? Or if it does can we control the timeout length? | 16:43 |
corvus | (i like that the wait is a shell one-liner that can be copy/pasted, and is a single task) | 16:43 |
clarkb | looks like you can set a per task timeout regardless of what the task is so we can do that if we want a timeout | 16:44 |
corvus | yeah, though i'm ambivalent on that -- i don't think it's been a problem so far | 16:45 |
clarkb | ya I think we can add it later if it becomes a problem. | 16:45 |
clarkb | Adding the down is probably a good idea though since docker can restart containers in some situations otherwise | 16:45 |
clarkb | corvus: do you want me to rebase and add that or do you want to add it? | 16:46 |
corvus | i can add | 16:47 |
clarkb | corvus: and I guess the ps -q thing avoids any races because we'll wait on no containers if they exit immediately? | 16:48 |
corvus | i believe so | 16:48 |
opendevreview | James E. Blair proposed opendev/system-config master: Add the start of a Zuul rolling restart playbook https://review.opendev.org/c/opendev/system-config/+/828176 | 16:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 16:51 |
*** ysandeep|rover is now known as ysandeep|out | 16:59 | |
corvus | clarkb: re your question -- yes, i'd use the ansible uri module against the api | 17:08 |
clarkb | corvus: one thing I'm quickly noticing is that since we do lists of dicts the parsing is super clunky in ansible | 17:12 |
clarkb | vs if it was a dict of dicts | 17:12 |
corvus | clarkb: well, if we had done dict of dicts, it actually would have had to be dict of lists of dicts -- because we can and do have multiple entries for the same host | 17:13 |
corvus | (so in our playbook, we should probably wait until there is exactly one entry for a host and it is running) | 17:14 |
corvus | (by design, zuul will happily allow you to run more than one component on a host; that doesn't happen in our deployment, but we do still see multiple entries when we stop and start depending on how cleanly and recently the component shut down) | 17:15 |
clarkb | right and ansible's expression on conditions and looping make that super weird. I'm sorting through it but wish ansible had a better way to express waiting on blocks of information | 17:15 |
corvus | clarkb: we could muddle through doing it in jinja -- or we could make a quick ansible module | 17:16 |
clarkb | I don't think zuul needs to change. I think ansible needs to change, but the PR to address this has apparently been open for years :/ | 17:16 |
corvus | just a quick python function that takes the json, a hostname, and a status and returns true when those conditions are met | 17:16 |
corvus | (why can't we just inline a python function in ansible yaml?) | 17:17 |
clarkb | ya there is json_query which is a level above that which I'm currently trying to use to express this | 17:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 17:27 |
clarkb | corvus: ^ something like that maybe | 17:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 17:29 |
clarkb | I should actually test that json_query stuff locally really quickly | 17:32 |
corvus | clarkb: quick comment on that | 17:36 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Add inline_python role https://review.opendev.org/c/opendev/system-config/+/843322 | 17:40 |
corvus | just spitballing here; but that ^ seems handy to me. | 17:40 |
corvus | er, that should say module, not role, but you get the idea | 17:41 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Add inline_python module https://review.opendev.org/c/opendev/system-config/+/843322 | 17:41 |
clarkb | the only way I can get the jinja to work is to get ansible to emit a warning: "[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}". If I don't, then it explodes on the ? in the json_query query :/ | 17:49 |
clarkb | anyway I think I have something that works now just need to port it over | 17:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 17:53 |
clarkb | that seems to work locally and if I set 'running' to 'notrunning' it does retries | 17:54 |
corvus | clarkb: i just left a comment suggesting we give the scheduler/web a few hours timeout on startup | 17:56 |
clarkb | corvus: sure. Do you think that 15 second delay between queries is too short too? | 17:57 |
corvus | (i actually have some backlog tasks to look at startup improvements, but for now, it wouldn't surprise me if it took 30m to come online during a rolling restart, so 45m seems too close for a timeout that shouldn't be necessary) | 17:57 |
corvus | clarkb: probably not necessary to check more than every 30-60 seconds? | 17:57 |
clarkb | ya I guess even if it takes 20 minutes to startup a check every 60 seconds isn't that big of a deal | 17:58 |
corvus | it's a lightweight method, but it's not cached. i'd probably settle at 30s? | 17:59 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 17:59 |
clarkb | oh heh I just pushed at 60 seconds. Updating to 30s | 17:59 |
clarkb | oh I did the wrong component anyway :/ | 18:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 18:01 |
corvus | clarkb: some timeout for web too | 18:02 |
corvus | er -- same | 18:02 |
corvus | (it has to do the same init work as scheduler) | 18:02 |
clarkb | corvus: I thought about that but we are waiting for the scheduler for up to 3 hours then just need to wait a bit for web afterwards since it should have the same three hour block to do what it needs to? | 18:03 |
clarkb | if we want to be extra careful I can set it to 3 hours for web though | 18:03 |
clarkb | I guess it doesn't hurt to be extra careful /me updates | 18:04 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 18:05 |
clarkb | (because the scheduler could be really fast for some reason but not web and then web would potentially trip) | 18:05 |
corvus | oh, yeah, you're right, it's probably not as important as i thought, but still, shouldn't hurt. | 18:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Explicitly install jmespath alongside ansible on bridge https://review.opendev.org/c/opendev/system-config/+/843330 | 18:12 |
clarkb | ^ is something I noticed. The lib is already there so this should work as is but I can't sort out why it is there and this helps make it explicit | 18:13 |
opendevreview | James E. Blair proposed opendev/system-config master: Add inline_python module https://review.opendev.org/c/opendev/system-config/+/843322 | 18:14 |
corvus | i took the wip off of that. we can take it or leave it, i don't feel strongly about it. i at least wanted to explore the idea. | 18:15 |
clarkb | corvus: how does the nested exit_json work in that? | 18:16 |
corvus | clarkb: i believe it literally calls sys.exit, which is why it works nested | 18:16 |
corvus | so if you call exit_json or fail_json in the script, that happens; if you forget, then there's the fallback in the module itself. but if you call the nested functions, control never reaches there. | 18:18 |
clarkb | gotcha | 18:25 |
fungi | clarkb: is the idea to use that as-is manually working around the bits mentioned in the todo comments, then knock those out in a later change? | 18:38 |
clarkb | fungi: I'm thinking it would probably be good to run that by hand in a screen in roughly its current state during a quiet time (over a weekend?) then look at the todos more closely if/when we choose to run that automatically | 18:39 |
clarkb | most of the todos at this point are related to running it automatically in the background and aren't a big concern for manual runs | 18:39 |
fungi | yeah, agreed. just checking before i approve it in the current state | 18:40 |
clarkb | the other thought I had was hacking it up to only run a set of services at a time | 18:42 |
clarkb | then we could be sure the executors are happy before doing the mergers and so on | 18:42 |
opendevreview | Merged opendev/system-config master: Add the start of a Zuul rolling restart playbook https://review.opendev.org/c/opendev/system-config/+/828176 | 18:42 |
clarkb | I guess I should try and make time friday to start that then check in on it over the weekend? | 18:43 |
fungi | or i can start it first thing in my day tomorrow | 18:52 |
clarkb | do we think thursday is quiet enough to run something like that? if so I'm game | 18:52 |
clarkb | I do think that the risk is low until we get to the schedulers | 18:53 |
clarkb | before then worst case is maybe jobs get killed unexpectedly then are restarted | 18:53 |
fungi | yeah, i'm not too worried. things are also relatively quiet this week, or so it's seemed | 18:58 |
opendevreview | Merged opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster https://review.opendev.org/c/opendev/system-config/+/843317 | 19:01 |
corvus | Holiday in many European countries tomorrow and in USA on Monday | 19:02 |
fungi | even better | 19:13 |
fungi | i barely manage to keep track of holidays here any more | 19:13 |
corvus | i'm now in the position of needing to know all of them :) | 19:14 |
fungi | indeed | 19:14 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Check and mount boot volume for data extraction with nouuid https://review.opendev.org/c/openstack/diskimage-builder/+/843297 | 19:40 |
clarkb | it's a nice day outside. Time to park myself in the yard with some code reviews | 20:00 |
corvus | nice! it's finally back to "not hot" here. | 20:02 |
*** dviroel is now known as dviroel|afk | 20:07 | |
*** jpena is now known as jpena|off | 20:17 | |
clarkb | ianw: I've reviewed the glean stack. It isn't clear to me if reverting the ignore interfaces change is necessary to simplify the ipv6 addition (I think it would just have your new code iterate over an empty dict which should be a noop?). But we may want to move it to a followup change instead of being at the base so that we can communicate the removal of that flag | 20:55 |
*** timburke__ is now known as timburke | 20:59 | |
BlaisePabon[m] | It's been hard for me to find resources to explain zuul and gating to my colleagues. This video is a real pleasure and I think it will get me get the point across. https://www.youtube.com/watch?v=apLHQ4DkIHU | 21:46 |
BlaisePabon[m] | I think that the presenter is probably a member of this community in fact (Ian Wienand ?) | 21:46 |
BlaisePabon[m] | s/get/help/ | 21:46 |
clarkb | BlaisePabon[m]: yup ianw | 21:46 |
*** tosky_ is now known as tosky | 21:49 | |
*** rlandy is now known as rlandy|bbl | 21:58 | |
corvus | infra-root: https://review.opendev.org/843034 to bump the zuul tenant default ansible version is ready | 22:34 |
ianw | BlaisePabon[m]: yes, that was my talk -- happy to answer any questions. For reasons unknown some of the overlays didn't work great in that video. if you want the original openoffice presentation files happy to forward on | 23:04 |
clarkb | I need to start converting my notes into slides for the summit. I only have 10-15 minutes though so the real trick is slimming things down sufficiently | 23:06 |
ianw | that is not long! i just scraped that one in at 45m | 23:07 |
fungi | yeah, harder to scale talks down than up ;) | 23:09 |
ianw | clarkb (and fungi) : thanks for the review on the glean bits. sometimes sleeping on it reveals new ideas but so far i don't really have any other than completely refactoring everything | 23:09 |
ianw | the networkd work prometheanfire did is a bit better way of parsing and writing out config file from the metadata | 23:09 |
fungi | i'm not opposed to merging that option removal change first, just think we need to make sure we communicate it ahead of release | 23:10 |
ianw | yeah it was really mostly that i was quite confused as to why the things i was adding to OVH for testing weren't triggering. it is no fault of the original patch, we've named all the testing scenarios a bit obscurely | 23:12 |
fungi | clarkb: so when i get settled in tomorrow, i'll plan to run `ansible-playbook /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml` in a root screen session on bridge.o.o | 23:24 |
fungi | should i add -f10 or anything? not sure if that causes problems with the serialization | 23:25 |
clarkb | fungi: I think you do want -f 20 for the pull. It shouldn't affect the serialization | 23:25 |
fungi | also i guess i need to disable deployments before that to make sure nothing races? | 23:25 |
clarkb | -f says "you can use up to this many forks" and serial: 1 says this runs one at a time regardless | 23:25 |
fungi | okay, cool, so it won't force parallelization overriding what we require inside the playbook | 23:26 |
clarkb | I think racing is ok since we can watch it and catch things up that might end up behind | 23:26 |
clarkb | the race would be pulling newer images halfway through | 23:26 |
fungi | yeah | 23:26 |
clarkb | and ya you want it in screen because it might take a couple days | 23:26 |
clarkb | maybe `time` it so that we can get a sense for the runtime | 23:26 |
opendevreview | Merged openstack/project-config master: Set "zuul" tenant default Ansible version to 5 https://review.opendev.org/c/openstack/project-config/+/843034 | 23:27 |
fungi | okay, `time ansible-playbook -f20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml 2>&1 | tee zuul_reboot.log` | 23:27 |
fungi | i've got that ready to fire in the screen session | 23:27 |
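
A minimal sketch of the grouping step ianw describes at 07:00, where two network entries (ipv4 and ipv6) have to end up in a single ifcfg-* file. This is not glean's code: the function name and the `link` and `type` keys are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_networks_by_interface(networks):
    """Collect every network entry under the interface it belongs to.

    Assumes each entry is a dict with a hypothetical "link" key naming
    the interface; entries keep their original order within a group.
    """
    grouped = defaultdict(list)
    for network in networks:
        grouped[network["link"]].append(network)
    return dict(grouped)


# Hypothetical input: one interface carrying both address families.
networks = [
    {"link": "eth0", "type": "ipv4"},
    {"link": "eth0", "type": "ipv6"},
    {"link": "eth1", "type": "ipv4"},
]

for interface, entries in group_networks_by_interface(networks).items():
    # One config file per interface, written from all of its entries.
    families = ", ".join(entry["type"] for entry in entries)
    print(f"ifcfg-{interface}: {families}")
```

Walking interfaces first and then handing each output function the full list of entries for that interface is one way to drop the "one network entry == one config file" assumption.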
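
A rough sketch of the "quick python function" corvus describes at 17:16, combined with his note at 17:14 that the playbook should wait until there is exactly one entry for a host and it is running. The function name, the `hostname` and `state` keys, and the sample hostnames are assumptions; the real component listing may be shaped differently.

```python
from typing import Iterable, Mapping


def component_ready(components: Iterable[Mapping[str, str]],
                    hostname: str,
                    state: str = "running") -> bool:
    """Return True only when exactly one entry matches hostname
    and that single entry is in the wanted state."""
    matches = [c for c in components if c.get("hostname") == hostname]
    return len(matches) == 1 and matches[0].get("state") == state


# Hypothetical listing: host-b still has a stale entry from its
# previous run next to the freshly started one.
sample = [
    {"hostname": "host-a.example.org", "state": "running"},
    {"hostname": "host-b.example.org", "state": "stopped"},
    {"hostname": "host-b.example.org", "state": "running"},
]

print(component_ready(sample, "host-a.example.org"))
print(component_ready(sample, "host-b.example.org"))
```

Requiring exactly one match means a stale duplicate entry keeps the check false, so a retry loop around it would keep polling until the old entry disappears.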