fungi | maybe an alternative would be borg on whichever fileserver usually serves the read-write volume | 00:07 |
fungi | if we decide it's really unworkable to keep that synced long distance via afs | 00:08 |
ianw | yeah, we could do like a "vos dump" and pipe that through, similar to how we dump databases | 00:08 |
fungi | but also remember openstack's been releasing a lot of release candidates over the past couple of weeks | 00:09 |
ianw | i'm not sure if the format of that is good for incrementals | 00:09 |
*** mlavalle has quit IRC | 00:10 | |
fungi | presumably using the afs client locally on the server as if it were a local filesystem wouldn't perform too badly? | 00:10 |
fungi | we might even be able to do something to pause vos release and then backup from the read-only replica there | 00:11 |
fungi | or use vos backup and then backup files from that $somehow | 00:11 |
ianw | yeah, that's pretty much "vos dump". otherwise vicepa is a pretty big mess of files | 00:14 |
fungi | well, i mean, just tell borg to backup /afs/openstack.org/...whatever on afs01.dfw.o.o | 00:17 |
ianw | oh right :) | 00:32 |
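A minimal sketch of the "vos dump piped through borg" idea from the exchange above, modeled on how the database dumps are handled. The volume name and borg repository path are assumptions, and this presumes a borg new enough to create an archive from stdin:

```sh
# Full dump of the read-write volume, streamed straight into a borg archive.
vos dump -id mirror.example -time 0 -localauth \
  | borg create --stdin-name mirror.example.vdump \
      /opt/backups/afs-borg::mirror.example-{now:%Y-%m-%d} -

# vos dump can also produce an incremental relative to a date via -time,
# which may or may not address the incremental-format concern raised above:
#   vos dump -id mirror.example -time "04/01/2021 00:00" -localauth | ...
```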
ianw | #status log backup prune on vexxhost backup server complete | 00:45 |
openstackstatus | ianw: finished logging | 00:45 |
*** hamalq has quit IRC | 01:21 | |
ianw | ok, got an arm64 builder and built a centos-8 minimal image. checking it out now | 01:27 |
*** spotz has quit IRC | 01:32 | |
ianw | ok, i think i actually got it all right now | 01:49 |
ianw | linux/boot/vmlinuz-4.18.0-240.15.1.el8_3.aarch64 root=LABEL=cloudimg-rootfs ro console=tty0 console=ttyAMA0,115200 no_timer_check nofb nomodeset gfxpayload=text | 01:50 |
ianw | in the final image's /EFI/centos/grub.cfg | 01:50 |
*** brinzhang_ is now known as brinzhang | 02:06 | |
*** avass has quit IRC | 02:22 | |
*** jonher has quit IRC | 03:17 | |
*** fresta has quit IRC | 03:18 | |
*** jonher has joined #opendev | 03:25 | |
*** tkajinam is now known as tkajinam|lunch | 03:35 | |
*** ykarel|away has joined #opendev | 03:54 | |
*** paladox has quit IRC | 04:27 | |
*** paladox has joined #opendev | 04:33 | |
*** whoami-rajat has joined #opendev | 04:35 | |
raukadah | ianw: Hello, can we get this merged https://review.opendev.org/c/openstack/diskimage-builder/+/785138 it is blocking tripleo ovb jobs, thanks :-) | 04:38 |
*** raukadah is now known as chandankumar | 04:39 | |
chandankumar | ianw: thanks :-) | 04:43 |
ianw | np, we should follow on with that noted doc update | 04:43 |
*** paladox has quit IRC | 04:43 | |
chandankumar | ianw: sure, working on that | 04:45 |
*** paladox has joined #opendev | 04:45 | |
*** marios has joined #opendev | 05:19 | |
*** eolivare has joined #opendev | 05:45 | |
openstackgerrit | Merged openstack/diskimage-builder master: Make DIB_DNF_MODULE_STREAMS part of yum element https://review.opendev.org/c/openstack/diskimage-builder/+/785138 | 05:49 |
*** ralonsoh has joined #opendev | 05:53 | |
openstackgerrit | chandan kumar proposed openstack/diskimage-builder master: Improved the documentation for DIB_DNF_MODULE_STREAMS https://review.opendev.org/c/openstack/diskimage-builder/+/785304 | 05:54 |
*** slaweq has joined #opendev | 06:01 | |
*** d34dh0r53 has quit IRC | 06:26 | |
*** d34dh0r53 has joined #opendev | 06:28 | |
*** ykarel|away is now known as ykarel | 06:43 | |
*** tosky has joined #opendev | 06:45 | |
*** tkajinam|lunch is now known as tkajinam | 06:46 | |
*** jpena|off is now known as jpena | 06:50 | |
*** avass has joined #opendev | 06:57 | |
*** ysandeep|away is now known as ysandeep | 06:57 | |
*** fressi has joined #opendev | 06:58 | |
*** rpittau|afk is now known as rpittau | 07:01 | |
*** sboyron has joined #opendev | 07:03 | |
openstackgerrit | Merged openstack/diskimage-builder master: Properly set grub2 root device when using efi https://review.opendev.org/c/openstack/diskimage-builder/+/785247 | 07:05 |
openstackgerrit | Merged openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse. https://review.opendev.org/c/openstack/diskimage-builder/+/778723 | 07:05 |
*** andrewbonney has joined #opendev | 07:06 | |
ianw | clarkb: ^ i just pushed 3.9.0 with that. i'll give it a chance then update the nodepool images | 07:14 |
*** amoralej|off is now known as amoralej | 07:16 | |
*** brinzhang has quit IRC | 07:35 | |
*** brinzhang has joined #opendev | 07:35 | |
*** amorin_ has quit IRC | 07:43 | |
*** amorin has joined #opendev | 07:43 | |
hrw | morning | 07:47 |
*** lpetrut has joined #opendev | 07:47 | |
hrw | great job on dib! | 07:47 |
*** ykarel_ has joined #opendev | 08:00 | |
*** ykarel has quit IRC | 08:02 | |
*** ykarel_ is now known as ykarel | 08:02 | |
*** ykarel is now known as ykarel|lunch | 08:23 | |
*** ykarel|lunch is now known as ykarel | 09:22 | |
*** sshnaidm|afk is now known as sshnaidm | 09:31 | |
openstackgerrit | Hervé Beraud proposed openstack/project-config master: Move away from nodejs4 in an Openstack context https://review.opendev.org/c/openstack/project-config/+/785353 | 09:34 |
*** dtantsur|afk is now known as dtantsur | 09:41 | |
openstackgerrit | Merged openstack/diskimage-builder master: Improved the documentation for DIB_DNF_MODULE_STREAMS https://review.opendev.org/c/openstack/diskimage-builder/+/785304 | 10:34 |
*** sshnaidm has quit IRC | 11:01 | |
*** sshnaidm has joined #opendev | 11:04 | |
*** kopecmartin has quit IRC | 11:26 | |
*** jpena is now known as jpena|lunch | 11:30 | |
*** artom has quit IRC | 11:36 | |
*** sshnaidm has quit IRC | 11:37 | |
*** artom has joined #opendev | 11:37 | |
*** elod is now known as elod_afk | 11:38 | |
*** kopecmartin has joined #opendev | 11:48 | |
*** kopecmartin has quit IRC | 11:49 | |
*** sshnaidm has joined #opendev | 11:50 | |
*** ysandeep is now known as ysandeep|afk | 12:01 | |
*** amoralej is now known as amoralej|lunch | 12:13 | |
*** artom has quit IRC | 12:26 | |
*** fressi has left #opendev | 12:29 | |
*** jpena|lunch is now known as jpena | 12:31 | |
*** ysandeep|afk is now known as ysandeep | 12:58 | |
*** amoralej|lunch is now known as amoralej | 12:59 | |
*** elod_afk is now known as elod | 13:11 | |
*** marios is now known as marios|call | 13:12 | |
*** artom has joined #opendev | 13:32 | |
*** lpetrut has quit IRC | 14:14 | |
openstackgerrit | Merged openstack/project-config master: Move away from nodejs4 in an Openstack context https://review.opendev.org/c/openstack/project-config/+/785353 | 14:29 |
*** ykarel has quit IRC | 14:32 | |
*** ykarel has joined #opendev | 14:32 | |
fungi | ianw: hrw: i'm still catching up, not clear if nb03 is updated for dib 3.9.0 but looking now to figure out how far along we are/what's left to do there | 14:46 |
fungi | looks like nodepool-builder was restarted on it at 08:53 utc today | 14:47 |
fungi | i don't see any centos arm images uploaded to linaro though | 14:55 |
fungi | just ubuntu and debian | 14:55 |
fungi | er, nevermind, i was looking at nodes not images | 14:56 |
fungi | most recent centos-8-arm64 image was built 14.5 hours ago | 14:56 |
fungi | which was before the builder got restarted | 14:57 |
*** ykarel has quit IRC | 14:57 | |
fungi | also i confirmed pbr freeze in the container reports diskimage-builder==3.9.0 | 14:57 |
*** spotz has joined #opendev | 14:58 | |
fungi | okay, i deleted centos-8-arm64-0000036826 from over 1.5 days ago | 14:58 |
*** ricolin has joined #opendev | 14:59 | |
fungi | also centos-8-stream-arm64-0000001564 from the same timeframe | 15:00 |
clarkb | fungi: I think ianw built a separate arm64 builder test node | 15:01 |
clarkb | fungi: but then after 3.9.0 release was made any update to nodepool would have updated the nodepool-builder container image and we would auto deploy that | 15:02 |
fungi | got it | 15:02 |
fungi | well, anyway, the new images can't really be substantially more broken than the old ones | 15:02 |
clarkb | fungi: https://review.opendev.org/c/zuul/nodepool/+/785347 landing that should get us nodepool-builder images if we don't already have them | 15:04 |
fungi | yep, i missed that one thanks, approved just now | 15:05 |
fungi | unfortunately nb03 started building debian-stretch-arm64 just before i deleted the oldest centos images, so we have to wait for that to finish before these can start | 15:09 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Remove arm64 jobs (temporarily) https://review.opendev.org/c/zuul/zuul-jobs/+/785432 | 15:09 |
ricolin | fungi, where can I find the progress for the debian-stretch-arm64 build? | 15:12 |
clarkb | ricolin: https://nb03.opendev.org then find the current log file (they are ordered and also you can sort by timestamp) | 15:13 |
clarkb | ricolin: it won't auto refresh but you can reload periodically (like once a minute or so should give you good updates) | 15:13 |
ricolin | clarkb, thx | 15:15 |
*** marios|call is now known as marios | 15:17 | |
clarkb | fungi: when you get a chance can you respond to ianw's message about the review replacement? I think you probably have the best overview of the issue from a security standpoint? | 15:21 |
fungi | ahh, sure, sorry juggling too many things at once and hadn't gotten to that | 15:21 |
clarkb | no worries, I don't think its urgent to get done before ianw's day starts :) | 15:22 |
clarkb | it just occurred to me that you can probably speak to that best | 15:22 |
clarkb | (and I'm in my start the day get through messages push) | 15:22 |
fungi | my get through messages push hasn't completed yet either, though i did sleep in a bit | 15:34 |
*** mlavalle has joined #opendev | 15:35 | |
ricolin | just out of curiosity, is there a way to stop queued jobs from a pipeline (like check-arm64)? | 15:35 |
clarkb | ricolin: an admin can dequeue the buildsets, but there isn't anything for specific jobs | 15:36 |
fungi | because jobs can have interdependencies, so there's no sane way to remove just one from a buildset | 15:38 |
fungi | but also it's unclear what should be reported in such cases if we could | 15:39 |
fungi | ricolin: also if the change gets abandoned or a new patchset is pushed, the old queue item will be removed automatically | 15:39 |
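For reference, a hedged sketch of the admin-side dequeue mentioned above, using the zuul RPC client; the tenant, pipeline, and change values are illustrative:

```sh
# Dequeue the entire buildset for one change from one pipeline.
zuul dequeue --tenant openstack --pipeline check-arm64 \
  --project openstack/devstack --change 708317,1
```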
hrw | morning | 15:42 |
hrw | do we still have jobs on debian-stretch? | 15:42 |
ricolin | fungi, clarkb got it | 15:48 |
fungi | hrw: i believe the openstack-ansible folks are using it for older branches | 15:49 |
fungi | according to what they said a few weeks ago when i brought up possibly removing our stretch images | 15:50 |
hrw | fungi: thanks | 15:50 |
ricolin | hrw, the devstack job to run full tempest tests on arm64 is all green now, but the run time is really long. https://review.opendev.org/c/openstack/devstack/+/708317 | 15:51 |
hrw | ricolin: good to hear | 15:51 |
ricolin | if you can provide ideas on improving the performance, that will be great :) | 15:52 |
ricolin | currently devstack plans to move to parallel execution, which might help reduce the exec time (https://review.opendev.org/c/openstack/devstack/+/783311) | 15:53 |
hrw | ricolin: my brain shuts down when I look at devstack output | 15:53 |
dtantsur | we have a person hitting what seems to be https://bugs.launchpad.net/git-review/+bug/1332549 | 15:55 |
openstack | Launchpad bug 1332549 in git-review "git review sometimes causes unpacker error for Missing tree" [Undecided,New] | 15:55 |
hrw | ricolin: for sure it is worth using PIP_EXTRA_INDEX_URL=local/infra/wheel/cache | 15:55 |
hrw | ricolin: as you are installing several python packages, that way some of them will be fetched already built as wheels | 15:56 |
clarkb | dtantsur: do you know what version of git is being used? we suspect that using git protocol v2 may also mitigate it | 15:56 |
clarkb | dtantsur: evidence to the contrary would be great if we have it | 15:56 |
ricolin | hrw, good point | 15:57 |
*** rpittau is now known as rpittau|bbl | 15:57 | |
dtantsur | clarkb: 2.25.1 | 15:57 |
clarkb | dtantsur: git 2.18.0 or newer supports protocol v2 and 2.26.0 enables it by default | 15:58 |
*** cenne has joined #opendev | 15:58 | |
dtantsur | hi cenne, we're discussing your issue | 15:58 |
hrw | ricolin: you install 30 Python tarballs. including grcpio, pynacl which take some time to build | 15:58 |
clarkb | dtantsur: you might try having them enable protocol v2 then and see if it helps. `git config protocol.version 2` in the repo | 15:58 |
cenne | yes ^^' thank you. and sorry | 15:58 |
clarkb | (I think that is how you do it anyway) | 15:59 |
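The suggestion in plain commands, for reference (the `gerrit` remote name assumes git-review's default setup):

```sh
# Enable git wire protocol v2 for just this clone:
git config protocol.version 2
# ...or globally for the user:
git config --global protocol.version 2
# Confirm the setting took effect:
git config --get protocol.version
# One-off on a single push, with no config change at all:
git -c protocol.version=2 push gerrit HEAD:refs/for/master
```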
dtantsur | no worries, it's just bad luck of newcomers :) | 15:59 |
hrw | ricolin: you would cut several minutes of time | 15:59 |
fungi | and if the git version is in the >=2.18.0 <2.26.0 range then yeah finding out if switching protocol version solves the problem would also be an excellent data point | 15:59 |
cenne | git version 2.25.1 | 15:59 |
dtantsur | cenne: could you try the command that clarkb suggested? | 15:59 |
ricolin | hrw, yeah around 5-6mins I think:) | 15:59 |
*** jpena is now known as jpena|off | 15:59 | |
clarkb | cenne: yup so not enabled by default but you can enable it explicitly and try again | 15:59 |
cenne | on it | 16:00 |
fungi | ricolin: hrw: we should also double-check that our wheel builders are creating all the arm64 wheels that job needs, since compiling larger packages from sdist could be very slow | 16:00 |
hrw | fungi: wheel build jobs go through whtever is in requirements project | 16:00 |
fungi | hrw: yes, but it doesn't mean they're working | 16:01 |
hrw | fungi: wheels or jobs? | 16:01 |
cenne | btw, ftr, i did get git review to go through once. It's after amending and re-reviewing that it borked | 16:02 |
fungi | the arm64 wheel builder jobs. we should make sure they're successfully generating wheels for the things the devstack job is trying to use (look for evidence in the devstack job log that it built things from sdist instead) | 16:02 |
clarkb | cenne: yes, we think it has to do with the structure of your local pack files and how the remote server and your local client negotiate which objects to transfer | 16:02 |
clarkb | cenne: this is why we suspect that v2 may help address the problem because that negotiation is handled in a different manner (and so far no one reporting this problem has reported a v2 by default client version) | 16:02 |
hrw | fungi: imho if devstack job is using something which is not in openstack/requirements then it is a bug in devstack | 16:03 |
cenne | okay, so i ran the command earlier and redid git review -t topicname | 16:03 |
cenne | same error | 16:03 |
JayF | cenne: just to confirm, you `cat .git/config | grep protocol` and see the config you added earlier taking effect? | 16:04 |
cenne | i can see it in git config --list, but i'll do that too | 16:04 |
clarkb | I've found one doc that says you need to set it with --global but that doesn't seem to be accurate from what I know about git configs | 16:05 |
JayF | cenne++ you did better than me, I didn't even know about `git config --list` | 16:05 |
fungi | hrw: what i'm saying is that if we see evidence devstack is building some things from sdist, that likely implies that our wheel builder jobs aren't successfully building at least some of the arm64 wheels it needs, the job makes a best effort to build things in the upper-constraints.txt list but that doesn't mean it manages to build them all | 16:06 |
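A quick way to look for that evidence in a job log; the log file name here is an assumption:

```sh
# Lines like "Building wheel for grpcio (setup.py)" or "Running setup.py
# install for pynacl" mean pip compiled an sdist instead of fetching a wheel.
grep -E 'Building wheel for|Running setup\.py install' devstacklog.txt
```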
cenne | it's there. I had to do grep -C 2 protocol though | 16:06 |
cenne | it's there basically | 16:06 |
dtantsur | cenne: a silly idea, what if you amend the commit again? e.g. change the message slightly? | 16:06 |
clarkb | dtantsur: cenne ^ that suggestion has been shown to help other people or even just rebasing again | 16:06 |
hrw | fungi: yep | 16:07 |
cenne | i'll change it slightly and try. | 16:07 |
fungi | dtantsur: that would likely make the error "go away" but doesn't give us a much better idea of the underlying problem unfortunately | 16:07 |
hrw | just started a local test of installing all the packages devstack builds, to check whether they are in the cache | 16:07 |
clarkb | ya other examples do `git -c protocol.version=2 push dest ref` which is even less global than setting it in the global config so I suspect doing it in the repo config is fine | 16:07 |
clarkb | good to know that this doesn't seem to have helped. | 16:08 |
dtantsur | fungi: it will rule out a damage to the repo itself | 16:08 |
dtantsur | I mean, the base repo | 16:08 |
fungi | dtantsur: oh, sure it would do that | 16:08 |
cenne | added a dot (.) in the commit message. is that change good enough? | 16:09 |
cenne | it didn't work btw | 16:09 |
clarkb | in that case pushing with no thin is probably your best bet. Someone else went with that recently. I think there may be a config option you can set for that too /me looks | 16:10 |
hrw | fungi, ricolin: all packages which devstack built are in cache | 16:10 |
cenne | if this turns out to be some really silly mistake on my side, i'm sorry in advance | 16:10 |
fungi | hrw: cool, so the devstack setup log doesn't indicate it used sdists for anything? | 16:10 |
fungi | that would have been an easy explanation for the slower performance of the job, oh well | 16:11 |
hrw | fungi: no sdist in log | 16:11 |
fungi | good to know that much is working at least | 16:11 |
hrw | ricolin: nodepool_wheel_mirror: "https://{{ zuul_site_mirror_fqdn }}/wheel/{{ ansible_distribution | lower }}-{{ (ansible_os_family == 'Debian') | ternary(ansible_distribution_version, ansible_distribution_major_version) }}-{{ ansible_architecture | lower }}" | 16:12 |
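As a hedged illustration, that template renders to something like the following on an Ubuntu Focal arm64 node (the region hostname is one example mirror):

```sh
# ansible_distribution=Ubuntu, ansible_distribution_version=20.04,
# ansible_architecture=aarch64 renders to:
export PIP_EXTRA_INDEX_URL=https://mirror.mtl01.inap.opendev.org/wheel/ubuntu-20.04-aarch64
```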
dtantsur | cenne: what does `git show 2ad9fbc7797aeaebfe2dcc613d928afc97e934e6` output? | 16:12 |
dtantsur | (or whatever missing tree ID you have) | 16:13 |
clarkb | cenne: I don't think it's a silly mistake, though I do suspect that some sort of local operations cause the repos to get confused when syncing with jgit. | 16:13 |
clarkb | unfortunately we haven't been able to narrow that down | 16:14 |
hrw | ricolin: https://mirror.mtl01.inap.opendev.org/wheel/ shows you which distros have wheel cache already | 16:14 |
ricolin | that's helpful | 16:15 |
clarkb | basically when you do a push each side figure out what they have and the path from one to the other, from that a pack is made | 16:15 |
clarkb | I think what is happening is that jgit and cgit are not agreeing on what the contents of that pack need to be to have a complete path | 16:15 |
hrw | ricolin: I wonder how much you will cut | 16:16 |
clarkb | this means when jgit receives the pack from cgit it finds that something is missing that it expects to see and it rejects it | 16:16 |
clarkb | when you do the --no-thin flag to push the issue is avoided as both sides are far less discerning and send much more data | 16:16 |
cenne | dtantsur: I think it's showing the base directory | 16:16 |
hrw | ricolin: remember to add cache for both aarch64 and x86-64 archs - both can use cache | 16:16 |
cenne | i can send a paste if you needed | 16:16 |
cenne | *directory files | 16:17 |
clarkb | cenne: I don't know that that will help. I've tried making modifications to the same files that people are modifying and having trouble with, thinking that maybe it is related to that somehow | 16:17 |
clarkb | trying a different version of git may also be interesting, to see if the behavior is consistent across C git versions. But also a lot more work | 16:18 |
clarkb | probably easiest to just do a no-thin push and move on (I am not finding a config option for that so may need to push directly?) | 16:18 |
dtantsur | or clone the repository again, download your patch with `git review -d <number>` and try again | 16:18 |
dtantsur | no-thin is probably more constructive, clarkb do you remember how to push by hand? I've never done it, I think :) | 16:19 |
clarkb | cenne: git-review should print out its git push command something like `git push gerrit HEAD:refs/for/master:topic` you can modify that to `git push --no-thin gerrit HEAD:refs/for/master:topic` | 16:19 |
hrw | ricolin: I am curious how much it can cut | 16:19 |
clarkb | cenne: you may need to run it with git review -v to get that though | 16:19 |
hrw | ricolin: some will be saved due to 'no need to build', some due to fetching from local cache | 16:19 |
ricolin | hrw, you mean the execution time? | 16:20 |
clarkb | ricolin: hrw: we should be setting the wheel mirror for all devstack jobs already? | 16:20 |
clarkb | a very early job setup role should be doing that. | 16:20 |
hrw | ricolin: yes | 16:21 |
clarkb | fwiw I'm not opposed to someone adding --no-thin to git-review as a flag. I didn't realize that had been suggested so long ago | 16:22 |
hrw | clarkb: it should. looks like it does not, or it is ignored somewhere (sudo?) | 16:22 |
cenne | ok. i'll try i guess | 16:22 |
cenne | if it helps add more info. | 16:22 |
ricolin | hrw, I saw you set up PIP_EXTRA_INDEX_URL for kolla already | 16:23 |
clarkb | ricolin: do you have a link to a job run I can look at? | 16:25 |
ricolin | clarkb, https://zuul.openstack.org/builds?job_name=devstack-platform-arm64 | 16:26 |
fungi | looks like the debian-stretch-arm64 image update completed and uploaded, but nb03 still hasn't started building new centos images, i'll see if we paused them or something | 16:26 |
clarkb | ricolin: https://zuul.openstack.org/build/8382a4f0479b4e9c824e9ad2f32c0ea4/console#0/3/20/controller that is the task that should configure the global pip config to use the mirrors including the wheel mirror | 16:27 |
fungi | nothing paused in its nodepool.conf, and dib-image-list doesn't report any pause via cli | 16:28 |
clarkb | fungi: is it building some other image? | 16:29 |
fungi | not that i can see | 16:29 |
clarkb | are the old images still present? I wonder if it doesnt' see them as needing to be rebuilt? | 16:30 |
fungi | yeah, only nb01 and nb02 are building images at the moment | 16:30 |
clarkb | you can also try an explicit build request | 16:30 |
ricolin | clarkb, that job failed because I set the wrong DEFAULT_IMAGE_FILE_NAME, which I fixed in the newest patch set | 16:30 |
cenne | It worked! | 16:30 |
cenne | --no-thin got it through. | 16:31 |
fungi | clarkb: i deleted the oldest centos-8-arm64 and centos-8-stream-arm64 images with dib-image-delete so only the latest one for each remains | 16:31 |
clarkb | cenne: great, and sorry for the trouble. Maybe we need to think about adding a flag to git-review for people to opt into that on the occasions it happens | 16:31 |
clarkb | fungi: I think it may not rebuild in that case | 16:32 |
clarkb | fungi: because it notices it has an up to date image | 16:32 |
fungi | clarkb: oh, so delete the newer images too? will do | 16:32 |
clarkb | fungi: or just do an explicit build request | 16:32 |
fungi | they're equally broken anyway, no loss | 16:32 |
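The builder-side commands involved in this exchange, for reference (the build id shown is illustrative):

```sh
nodepool dib-image-list                              # inspect current builds
nodepool dib-image-delete centos-8-arm64-0000036827  # drop a stale build
nodepool image-build centos-8-arm64                  # or explicitly request a fresh one
```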
cenne | clarkb: np. was fun troubleshooting (not that i understood most of it). Thanks. :) | 16:33 |
clarkb | ricolin: ya I'm just pointing out that the build is attempting to write a pip.conf, but we don't see its contents | 16:33 |
clarkb | cenne: you're welcome | 16:33 |
*** hamalq has joined #opendev | 16:33 | |
clarkb | ricolin: maybe a good next step would be to add /etc/pip.conf to the log files that are collected, then you can see why it may not be working? | 16:33 |
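Roughly what a correctly written /etc/pip.conf should contain on a mirrored node; the hostname and wheel path are assumptions:

```sh
cat /etc/pip.conf
# [global]
# index-url = https://mirror.mtl01.inap.opendev.org/pypi/simple
# extra-index-url = https://mirror.mtl01.inap.opendev.org/wheel/ubuntu-20.04-aarch64
```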
fungi | okay, deleted centos-8-stream-arm64-0000001565 and centos-8-stream-arm64-0000001565 from ~16-17 hours ago | 16:33 |
fungi | clarkb: bingo! centos-8-arm64-0000036828 is building now | 16:34 |
clarkb | great | 16:35 |
*** marios is now known as marios|out | 16:36 | |
* cenne thanks also dtantsur: and JayF: | 16:38 | |
dtantsur | sure thing, I don't think I helped a lot :) | 16:38 |
JayF | \o/ cenne hopefully you have better luck with the rest of your contributions... | 16:38 |
JayF | get all the bad luck outta the way now | 16:38 |
cenne | :P hopefully. | 16:40 |
cenne | so.. should i revert back the protocol.version 2 now? or leave it as is | 16:42 |
*** sshnaidm has quit IRC | 16:43 | |
JayF | It should only do good things to have protocol v2 enabled for that repo | 16:43 |
*** sshnaidm has joined #opendev | 16:44 | |
hrw | ricolin: kolla got cache once we finished getting cache jobs working again | 16:45 |
hrw | fungi: btw - what 0000* numbers are? | 16:46 |
fungi | hrw: the image build increment, so we can identify individual image revisions | 16:47 |
fungi | hrw: nodepool-builder just increments that for each build | 16:47 |
hrw | ok. build numberish. | 16:47 |
fungi | yup | 16:49 |
hrw | so, once we get working centos-8-stream-arm64 images, will there be something blocking centos-8-stream-arm64 node creation? | 16:49 |
*** marios|out has quit IRC | 16:49 | |
ricolin | clarkb, okay, will add log to next update | 16:50 |
fungi | hrw: shouldn't be | 16:51 |
fungi | hrw: basically if the image builds succeed then the builder will upload them automatically to the linaro cloud through glance, after which any new server boot calls should use the new images | 16:52 |
fungi | and (hopefully) boot to a reachable state again | 16:52 |
hrw | fungi: great | 16:52 |
fungi | and at that point we'll see all the queued builds in zuul begin to show moving progress bars | 16:53 |
fungi | though it will probably take some time to work through the pending backlog | 16:53 |
* hrw off | 17:07 | |
*** auristor has quit IRC | 17:08 | |
*** sshnaidm is now known as sshnaidm|afk | 17:10 | |
*** auristor has joined #opendev | 17:12 | |
*** eolivare has quit IRC | 17:23 | |
clarkb | cenne: I would clear it out just to reduce potential variables for the future, but I don't expect it will cause problems if you leave it either | 17:28 |
*** dtantsur is now known as dtantsur|afk | 17:29 | |
*** amoralej is now known as amoralej|off | 17:31 | |
fungi | centos-8-arm64-0000036828 reached a ready state and is uploading now. centos-8-stream-arm64-0000001566 has started building | 17:32 |
fungi | and centos-8-arm64-1617903089 is the ready image in linaro-us now, so we should hopefully start seeing nodes boot with that | 17:36 |
*** rpittau|bbl is now known as rpittau | 17:44 | |
*** andrewbonney has quit IRC | 17:44 | |
cenne | clarkb: ok, i'll leave it on for now. it's just for one repo, and i'm changing os soon anyway | 17:44 |
cenne | clarkb: will I have to use the --no-thin flag every time? I had to use it again when amending and resubmitting the commit | 17:45 |
fungi | i'm concerned that having a min-ready of 1 for ubuntu-focal-arm64-xxxlarge is consuming a significant percentage of our available arm node quota unnecessarily | 17:45 |
fungi | i'll push up a patch | 17:45 |
fungi | we also have a bunch of locked deleting nodes | 17:45 |
fungi | looks like they're all debian-buster-arm64 | 17:46 |
fungi | i wonder if those are also having trouble booting | 17:46 |
openstackgerrit | Merged zuul/zuul-jobs master: Remove arm64 jobs (temporarily) https://review.opendev.org/c/zuul/zuul-jobs/+/785432 | 17:46 |
fungi | yeah, basically we have the following nodes in linaro-us: two debian-buster-arm64 nodes in-use for over an hour, two ubuntu-focal-arm64 nodes in use for half an hour, five debian-buster-arm64 nodes deleting for the past few minutes, and a ready ubuntu-focal-arm64-xxxlarge node that nothing has used for 8.5 hours yet | 17:49 |
openstackgerrit | Merged zuul/zuul-jobs master: Add upload-logs-azure role https://review.opendev.org/c/zuul/zuul-jobs/+/782004 | 17:50 |
clarkb | fungi: ya that cloud might be best served with min-ready: 0 on all labels | 17:51 |
clarkb | then let demand determine allocations otherwise the quota is too low | 17:51 |
fungi | revising my theory on debian-buster-arm64, we do have nodes in use for a while as well as deleting, i may have just caught it at a bad time | 17:52 |
fungi | i think it's more the fact that delete calls take upwards of 10 minutes to complete on that cloud which makes for a lot of deleting state nodes in the list | 17:53 |
fungi | we haven't even tried to launch a new centos-8 node yet, i think due to demand from other labels | 17:54 |
*** klonn has joined #opendev | 18:00 | |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Zero the min-ready for arm64 node labels https://review.opendev.org/c/openstack/project-config/+/785453 | 18:01 |
*** lpetrut has joined #opendev | 18:11 | |
*** lpetrut has quit IRC | 18:14 | |
*** brinzhang_ has joined #opendev | 18:21 | |
*** brinzhang has quit IRC | 18:24 | |
clarkb | fungi: if you need to confirm the new images are working you could boot one out of band | 18:29 |
fungi | maybe later, i'm onto more urgent fires now | 18:30 |
*** ysandeep is now known as ysandeep|away | 18:40 | |
openstackgerrit | Merged openstack/project-config master: Zero the min-ready for arm64 node labels https://review.opendev.org/c/openstack/project-config/+/785453 | 18:46 |
*** whoami-rajat has quit IRC | 19:02 | |
fungi | okay, this is odd... looking at one of the debian-buster-arm64 nodes in a deleting state (0023862017), `nodepool list` reports its time in that state as only a few minutes, but the debug log shows it repeatedly attempting to delete the same node for several hours | 19:02 |
fungi | so i've suddenly lost faith in the state times nodepool is reporting | 19:02 |
clarkb | fungi: I think the reason for that is nodepool may be creating dummy records for leaked instances | 19:03 |
clarkb | and the operations that do that end up resetting the timestamp | 19:03 |
fungi | it looks like the age field is being reset on every delete pass | 19:03 |
clarkb | ya I seem to recall seeing that before | 19:03 |
clarkb | I think when the cleanup passes grab the lock the first thing they do is set the age timer | 19:03 |
fungi | in fact, the log i was looking at was only a few hours old. nodepool has been trying to delete that particular node for several days | 19:05 |
fungi | 2021-04-06 08:56:36,759 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0023862017 (state: deleting, allocated_to: None) | 19:06 |
clarkb | chances are if you try to manually delete it it will fail too | 19:06 |
fungi | yup, working my way up to that | 19:06 |
fungi | the age being reset just makes it much harder to spot which ones are legitimately deleting and which ones have been there for a long time | 19:07 |
clarkb | ya | 19:07 |
fungi | openstack server show indicates it has a status of BUILD | 19:08 |
fungi | created 2021-04-06T08:43:05Z | 19:08 |
fungi | probably nova won't let us delete instances in state BUILD | 19:09 |
fungi | however openstack server delete did not report an error for that uuid | 19:09 |
fungi | status remains BUILD though | 19:09 |
fungi | running with -v says END return value: 0 | 19:10 |
fungi | so i guess nova's all like "thanks i'll get right on that, it'll go into delete as soon as it's done building!" | 19:11 |
clarkb | ya the delete request is more of a "please do this if you can" and the response is "I heard you" | 19:11 |
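The inspect/delete sequence being walked through here, sketched with one of the stuck uuids reported just below:

```sh
UUID=a25e6f3b-5695-4729-b9c7-bb6160d7dd55
openstack server show "$UUID" -c status -c created
openstack server delete "$UUID"          # exits 0: nova merely queues the request
openstack server show "$UUID" -c status  # may well still report BUILD
```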
fungi | kevinz: we have 5 instances in linaro-us which are stuck in either BUILD or ERROR state for the past 2.5 days and would appreciate if you could clean them up so we can use the rest of the quota: a25e6f3b-5695-4729-b9c7-bb6160d7dd55 3ec7f816-52e5-480d-828a-767ebae7cf84 6accab6d-e1bd-4998-85eb-d7764d1ee0f5 a82e3b41-b0f7-4310-a89f-2a84eef8f8b0 15c38603-41c2-484d-8637-62941b871cbe | 19:16 |
fungi | once /etc/nodepool/nodepool.yaml updates on nl03 i'll clean up the lingering xxlarge ready node | 19:18 |
*** d34dh0r53 has quit IRC | 19:21 | |
*** d34dh0r53 has joined #opendev | 19:22 | |
fungi | between those 5 stuck 8gb nodes and the 16gb ready node, our effective quota is reduced to 10 nodes there | 19:25 |
fungi | though there is a building centos-8-stream-arm64 node now! | 19:25 |
fungi | we also have a server instance named "test" booted in there since late october of last year | 19:27 |
fungi | infra-root: ^ any objections to me deleting the server instance named "test" from linaro-us? | 19:28 |
clarkb | fungi: checking | 19:29 |
clarkb | fungi: is that the server whose ipv6 addr ends in 7610? | 19:29 |
fungi | the centos-8-stream-arm64 node went into in-use 5 minutes ago, so looks like it may actually be getting used for a job | 19:29 |
clarkb | if so that was the one ianw was using to test the new dib patches aiui | 19:29 |
fungi | clarkb: nope, 2604:1380:4111:3e56:f816:3eff:fe35:d8dc | 19:30 |
fungi | this is in the node tenant, not the control plane tenant | 19:30 |
fungi | the one you're talking about is test-builder and it's in the control plane tenant | 19:31 |
clarkb | gotcha | 19:31 |
clarkb | I don't recall adding a test node myself so I'm fine with it | 19:31 |
openstackgerrit | Clark Boylan proposed openstack/diskimage-builder master: Support secure boot with Ubuntu if using efi https://review.opendev.org/c/openstack/diskimage-builder/+/785503 | 19:34 |
fungi | i need to take a break to start making dinner, and then the rest of my evening is mostly spoken for between paperwork and openstack election officiating duties | 19:36 |
clarkb | fungi: thank you for getting the centos 8 images going again | 19:37 |
fungi | well, to be clear, you and stevebaker and ianw and hrw did most of the heavy lifting on that | 19:43 |
fungi | i just drove it the last few meters over the finish line | 19:44 |
*** d34dh0r53 has quit IRC | 19:51 | |
*** d34dh0r53 has joined #opendev | 19:51 | |
*** dmsimard6 has joined #opendev | 20:24 | |
*** akahat has quit IRC | 20:24 | |
*** tosky_ has joined #opendev | 20:25 | |
*** janders has quit IRC | 20:26 | |
*** janders8 has joined #opendev | 20:26 | |
*** dmsimard has quit IRC | 20:26 | |
*** dmsimard6 is now known as dmsimard | 20:26 | |
*** tosky has quit IRC | 20:26 | |
*** tosky_ is now known as tosky | 20:28 | |
*** sboyron has quit IRC | 20:36 | |
*** akahat has joined #opendev | 20:37 | |
*** slaweq has quit IRC | 20:41 | |
*** klonn has quit IRC | 20:42 | |
hrw | fungi: every time I look at dib it feels weird. so the moment when I say why I think it is wrong and someone goes 'HA! now I know how to fix it' is great | 20:42 |
clarkb | hrw: the thing about dib is it is weird, but ultimately relatively simple. It's just a series of run-parts processes | 20:43 |
clarkb | once you know that it is relatively easy to inject "breakpoints" into the list of scripts and debug from there. The major issue this time around for me was figuring out how to boot arm :) but as I was figuring that out you pointed out the problem :) | 20:44 |
clarkb | (looks like virt-install does have an arch flag which I'll likely try next time this comes up) | 20:44 |
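The "breakpoints" clarkb mentions use dib's `break` environment variable; the element list here is illustrative:

```sh
# Drop into a shell inside the chroot when any hook script fails:
break=after-error disk-image-create -x centos-minimal vm block-device-efi
# Or stop before a specific hook phase to poke around:
#   break=before-pre-install.d disk-image-create ...
```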
hrw | clarkb: the problem is that people are still not used to how efi systems work | 20:44 |
hrw | clarkb: once you get used to how efi systems work, it does not matter whether it is aarch64 or x86-64 | 20:45 |
clarkb | hrw: ya I mean stevebaker pointed out that ubuntu's packaging doesn't actually install the shim where you need it if you install the package. So there is other magic going on | 20:45 |
clarkb | hrw: thats true, though in this case we weren't sure it was even an efi problem (I was unaware of the change to dib) | 20:45 |
hrw | what is ubuntu? ;D | 20:45 |
clarkb | debian does the same thing fwiw | 20:46 |
clarkb | everything goes in /usr and some other magic seems to be necessary to install that to /boot/efi/EFI/{ubuntu,debian} | 20:47 |
hrw | yeah, distroisms | 20:47 |
stevebaker | I'm writing an ubuntu usb installer now, lets see what it does | 20:47 |
clarkb | one thing I would like to test is if we can switch all our test node images over to the efi build. In theory it will work since legacy bios is still bootable on that path but then we would get efi if clouds supported it? | 20:50 |
clarkb | though maybe you have to set an image metadata flag or nova flavor option or something for that to happen | 20:51 |
hrw | clarkb: q35 can boot uefi for sure. i440fx not sure | 20:51 |
hrw | it is more a matter of having uefi binary available on nova-compute hosts | 20:52 |
hrw | then you set hw_firmware_type=uefi and should boot | 20:53 |
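Setting that property with the openstack client (the image name is illustrative):

```sh
openstack image set --property hw_firmware_type=uefi ubuntu-focal-uefi-test
```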
fungi | also no idea if xen vs kvm complicates things | 20:53 |
hrw | I lack xen experience | 20:53 |
fungi | nearly half of the nodes we boot are on xen | 20:54 |
clarkb | our images should do both though I think | 20:54 |
clarkb | so in theory xen and old kvm woudl just keep doing bios? | 20:54 |
clarkb | (well they do both if you set efi) | 20:54 |
ianw | in theory :) i'd be willing to bet that somewhere between the crappy vhd conversion steps and booting, something wouldn't work :) | 20:55 |
hrw | x86 images iirc have efi grub in the ESP, then grub-pc in the bios partition, then have /boot and / or sth like that | 20:55 |
ianw | we merged some changes that supposedly made dib do dual-boot images | 20:55 |
ianw | https://review.opendev.org/c/openstack/diskimage-builder/+/743243 | 20:56 |
hrw | nice | 20:58 |
* hrw off | 21:00 | |
fungi | yeah, debian has multi-boot installer images with support for several architectures, and mixed efi/bios bootloader, so it's definitely doable | 21:00 |
fungi | though the utility of "i have a single usb stick which can install debian onto x86 and aarch64 and powerpc" is questionable | 21:01 |
fungi | i still don't see 785453 applied after three hours... i wonder if something's wrong | 21:04 |
fungi | infra-prod-service-nodepool TIMED_OUT in 30m 32s | 21:05 |
fungi | i guess that's why | 21:05 |
ianw | ohh, i usually suspect nb03 first if there's something wrong there | 21:13 |
fungi | could be, i haven't pulled up logs yet | 21:16 |
*** rpittau is now known as rpittau|afk | 21:17 | |
*** sshnaidm|afk is now known as sshnaidm|off | 21:24 | |
fungi | /var/log/ansible/service-nodepool.yaml.log on bridge doesn't mention the launchers at all | 21:34 |
fungi | i'm not 100% positive that's what https://zuul.opendev.org/t/openstack/build/695b972e640a4b188175a5ff8f790fbe/ was writing to though, we don't have the task details since the run playbook timed out | 21:36 |
fungi | huh, the /etc/nodepool/nodepool.yaml on nl03 looks nothing like https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml | 21:46 |
fungi | almost as if it's generated yaml instead | 21:46 |
corvus | fungi: it is | 21:46 |
fungi | anybody know why that would be? | 21:46 |
corvus | but it should be generated from that file | 21:46 |
fungi | oh, okay | 21:46 |
corvus | fungi: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L4-L5 is reason | 21:46 |
fungi | oh, yep thanks | 21:46 |
fungi | so still no clue why /var/log/ansible/service-nodepool.yaml.log doesn't mention nl* servers | 21:48 |
fungi | i guess that's an artifact of the timeout not represented in that logfile (yet?) | 21:50 |
fungi | the previous log from 02:21:47 includes them | 21:51 |
fungi | root 1963 0.0 0.0 4636 788 ? S 18:54 0:00 /bin/sh -c ansible-playbook -v -f 5 /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-nodepool.yaml >> /var/log/ansible/service-nodepool.yaml.log | 21:52 |
fungi | aha, so it's still actually running, that's why | 21:52 |
fungi | child process claims to be connecting to 104.130.124.242 which is nb02 | 21:53 |
fungi | i'm able to log into nb02, even over ipv4, and it's been up for weeks... | 21:54 |
fungi | i don't see any ansible processes on nb02 either | 21:54 |
fungi | oh, in fact it looks like the ansible process connecting to nb02 is timestamped 13:35 | 21:56 |
fungi | so something hung earlier today i guess? | 21:56 |
fungi | several distinct runs are all in progress trying to connect to nb02 | 21:57 |
fungi | i can ssh to that address from bridge.o.o as root too | 21:57 |
fungi | infra-root: ^ suggestions? | 21:58 |
ianw | full disk maybe causing python/ansible to hang? | 21:58 |
fungi | doesn't look like it | 21:59 |
* ianw has sound of straws clutching | 21:59 | |
* fungi grasps some more | 21:59 | |
fungi | memory, disk, cpu all look reasonable on nb02 given that it's building images | 22:00 |
fungi | root 29407 0.2 0.4 184100 38372 ? Sl 13:35 1:22 /usr/bin/python3 /usr/local/bin/docker-compose pull | 22:01 |
fungi | there's some later ones in the process table too | 22:01 |
fungi | i bet that's it, ansible is hung because docker-compose pull isn't returning? | 22:02 |
fungi | anyway i've run out of time, i need to switch to other tasks | 22:03 |
fungi | hopefully that helps whoever gets a chance to look at it next | 22:04 |
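A hedged sketch for whoever picks this up next, using the pid from the process listing above:

```sh
# Is the pull still there, and for how long has it been running?
ps -eo pid,etime,stat,cmd | grep '[d]ocker-compose pull'
# Attach to see whether it is blocked waiting on the registry:
sudo strace -f -p 29407 -e trace=network,read
```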
ianw | i can have a look in a bit | 22:05 |
*** hamalq has quit IRC | 22:07 | |
*** hamalq has joined #opendev | 22:07 | |
fungi | thanks! | 22:08 |
fungi | and sorry, i've promised too much on too many things and had to parcel out the remainder of my day to make sure i don't slip on deadlines | 22:09 |
ianw | Thu Apr 1 13:15:11 UTC 2021 found extra address! | 22:20 |
ianw | so we caught the review server getting an extra address. i should have a dump | 22:20 |
fungi | ooh, as in tcpdump of (possible) route announcements? | 22:21 |
fungi | that's exciting indeed | 22:22 |
ianw | hrm, so "_count=$(ip addr | grep 'inet6 2' | wc -l)" returned -gt 1 | 22:22 |
ianw | but currently it only has one address | 22:22 |
fungi | diagnosing network issues in other people's systems is always a bit like being a spirit medium | 22:23 |
ianw | it is a week on | 22:23 |
ianw | from when that triggered though | 22:23 |
fungi | yeah, well past any likely expiration | 22:23 |
ianw | 2021-04-01 13:15:10.262527 IP6 (class 0xc0, hlim 255, next-header ICMPv6 (58) payload length: 56) fe80::ce2d:e0ff:fe5a:d84e > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 56 | 22:27 |
ianw | 2021-04-01 13:15:10.262672 IP6 (class 0xc0, hlim 255, next-header ICMPv6 (58) payload length: 56) fe80::ce2d:e0ff:fe0f:74af > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 56 | 22:27 |
ianw | 2021-04-01 13:15:10.313178 IP6 (class 0xe0, hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::f816:3eff:fec3:34bc > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 64 | 22:27 |
ianw | that is 3 RA's received 1 second before we saw two addresses | 22:27 |
ianw | here is a full packet dump -> http://paste.openstack.org/show/804320/ | 22:30 |
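For anyone reproducing the capture, a filter limited to router advertisements (ICMPv6 type 134); the interface name is an assumption:

```sh
sudo tcpdump -ni eth0 -vv 'icmp6 and ip6[40] == 134'
# add -w /var/tmp/ra.pcap to keep a pcap for later analysis
```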
fungi | our current default there is through fe80::ce2d:e0ff:fe5a:d84e and fe80::ce2d:e0ff:fe0f:74af | 22:34 |
fungi | so fe80::f816:3eff:fec3:34bc is the odd duck | 22:34 |
fungi | ieee has no oui match on f8:16:3e | 22:36 |
fungi | as far as i can tell | 22:36 |
fungi | so guessing that's made up and belongs to a vm | 22:36 |
ianw | it appeared between 2021-04-01 13:15:10.313178 and 2021-04-01 13:16:57.742043 and then was never heard from again | 22:38 |
ianw | (pasting some details in https://etherpad.opendev.org/p/gerrit-upgrade-2021) | 22:38 |
fungi | sounds about right if it was some zuul job | 22:38 |
ianw | it says "valid time 2592000s" which makes 30 days by my calculation | 22:43 |
fungi | neat, i wonder if the kernel reaps those sooner if their gateways fall out of the neighbor table or something | 22:43 |
ianw | i should have dumped the ip addr output too, that was an oversight | 22:44 |
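A hedged reconstruction of the watcher, extended with the `ip addr` dump ianw wished he had; the count check is the snippet quoted above and the log path is an assumption:

```sh
#!/bin/bash
_count=$(ip addr | grep 'inet6 2' | wc -l)
if [ "$_count" -gt 1 ]; then
    {
        echo "$(date -u) found extra address!"
        ip addr            # record the full address state at trigger time
        ip -6 route show   # and the routes, to spot the rogue gateway
    } >> /var/log/extra-v6-address.log
fi
```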
ianw | i'm not sure how much this debugging is going to help. i'm happy to continue it, if we think there is anything to glean | 22:45 |
fungi | probably we should be getting input from operators at vexxhost as to what it would help them for us to collect | 22:49 |
fungi | though if we have a theory that the announcements are originating from our job nodes, maybe collecting enough information to tie a stray announcement to a particular job would enable us to either confirm that and gain some insight into the nature and origin of those announcements, or refute it | 22:50 |
clarkb | infra-root tomorrow morning I intend to run the external id cleanup script against the 224 accounts I retired last week. I haven't seen anyone show up and complain about their account being off and this last batch was the set that showed up due to lack of usernames, ssh keys, and/or valid openids | 23:18 |
clarkb | I'm fairly confident in this group. I suspect the next one will be much tougher :( | 23:18 |
fungi | sounds great, thanks! | 23:18 |
clarkb | if you have seen anyone complain let me know :) | 23:18 |
fungi | i have not, ftr | 23:19 |
clarkb | fungi: ianw maybe stick the details in logstash? | 23:21 |
clarkb | and see if any log lines show that was one of ours? | 23:21 |
fungi | yeah, though in this case the evidence might be too old already | 23:21 |
fungi | worth a shot though | 23:21 |
clarkb | good point and those ipv6 addresses are unlikely to be stable like the ipv4 ones we use in testing | 23:22 |
clarkb | since the macs are probably random? | 23:22 |
fungi | that's what i'm guessing, but really not certain | 23:23 |
clarkb | I wonder if the timestamps might be the important piece of info, like if vexxhost can cross-check against maintenance at that time or other activities along those lines | 23:29 |
*** tosky has quit IRC | 23:38 | |
fungi | oh, yeah, server restarts, et cetera | 23:40 |
fungi | or switchport flaps | 23:40 |
ianw | i'll leave the dump file on disk, https://etherpad.opendev.org/p/gerrit-upgrade-2021 has extracted details from the 1st | 23:42 |