Thursday, 2021-04-08

fungimaybe an alternative would be borg on whichever fileserver usually serves the read-write volume00:07
fungiif we decide it's really unworkable to keep that synced long distance via afs00:08
ianwyeah, we could do like a "vos dump" and pipe that through, similar to how we dump databases00:08
fungibut also remember openstack's been releasing a lot of release candidates over the past couple of weeks00:09
ianwi'm not sure if the format of that is good for incrementals00:09
*** mlavalle has quit IRC00:10
fungipresumably using the afs client locally on the server as if it were a local filesystem wouldn't perform too badly?00:10
fungiwe might even be able to do something to pause vos release and then backup from the read-only replica there00:11
fungior use vos backup and then backup files from that $somehow00:11
ianwyeah, that's pretty much "vos dump".  otherwise vicepa is a pretty big mess of files00:14
fungiwell, i mean, just tell borg to backup /afs/openstack.org/...whatever on afs01.dfw.o.o00:17
ianwoh right :)00:32
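(For reference, a rough sketch of the two approaches being floated, assuming borg is installed on afs01.dfw.o.o, /afs is mounted there via the AFS client, and the backup host, repository path and volume name below are purely illustrative:)
    # treat the AFS mount like any local filesystem and let borg handle dedup/incrementals
    borg create --stats backup01:/opt/backups/afs01::afs-{now} /afs/openstack.org/project
    # or dump a whole volume and stream it in, much like the database dumps
    vos dump -id project.volume -time 0 | borg create backup01:/opt/backups/afs01::voldump-{now} -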
ianw#status log backup prune on vexxhost backup server complete00:45
openstackstatusianw: finished logging00:45
*** hamalq has quit IRC01:21
ianwok, got an arm64 builder and built a centos-8 minimal image.  checking it out now01:27
*** spotz has quit IRC01:32
ianwok, i think i actually got it all right now01:49
ianwlinux/boot/vmlinuz-4.18.0-240.15.1.el8_3.aarch64 root=LABEL=cloudimg-rootfs ro  console=tty0 console=ttyAMA0,115200 no_timer_check nofb nomodeset gfxpayload=text01:50
ianwin the final image's /EFI/centos/grub.cfg01:50
*** brinzhang_ is now known as brinzhang02:06
*** avass has quit IRC02:22
*** jonher has quit IRC03:17
*** fresta has quit IRC03:18
*** jonher has joined #opendev03:25
*** tkajinam is now known as tkajinam|lunch03:35
*** ykarel|away has joined #opendev03:54
*** paladox has quit IRC04:27
*** paladox has joined #opendev04:33
*** whoami-rajat has joined #opendev04:35
raukadahianw: Hello, can we get this merged https://review.opendev.org/c/openstack/diskimage-builder/+/785138 it is blocking tripleo ovb jobs, thanks :-)04:38
*** raukadah is now known as chandankumar04:39
chandankumarianw: thanks :-)04:43
ianwnp, we should follow on with that noted doc update04:43
*** paladox has quit IRC04:43
chandankumarianw: sure, working on that04:45
*** paladox has joined #opendev04:45
*** marios has joined #opendev05:19
*** eolivare has joined #opendev05:45
openstackgerritMerged openstack/diskimage-builder master: Make DIB_DNF_MODULE_STREAMS part of yum element  https://review.opendev.org/c/openstack/diskimage-builder/+/78513805:49
*** ralonsoh has joined #opendev05:53
openstackgerritchandan kumar proposed openstack/diskimage-builder master: Improved the documentation for DIB_DNF_MODULE_STREAMS  https://review.opendev.org/c/openstack/diskimage-builder/+/78530405:54
*** slaweq has joined #opendev06:01
*** d34dh0r53 has quit IRC06:26
*** d34dh0r53 has joined #opendev06:28
*** ykarel|away is now known as ykarel06:43
*** tosky has joined #opendev06:45
*** tkajinam|lunch is now known as tkajinam06:46
*** jpena|off is now known as jpena06:50
*** avass has joined #opendev06:57
*** ysandeep|away is now known as ysandeep06:57
*** fressi has joined #opendev06:58
*** rpittau|afk is now known as rpittau07:01
*** sboyron has joined #opendev07:03
openstackgerritMerged openstack/diskimage-builder master: Properly set grub2 root device when using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78524707:05
openstackgerritMerged openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.  https://review.opendev.org/c/openstack/diskimage-builder/+/77872307:05
*** andrewbonney has joined #opendev07:06
ianwclarkb: ^ i just pushed 3.9.0 with that.  i'll give it a chance then update the nodepool images07:14
*** amoralej|off is now known as amoralej07:16
*** brinzhang has quit IRC07:35
*** brinzhang has joined #opendev07:35
*** amorin_ has quit IRC07:43
*** amorin has joined #opendev07:43
hrwmorning07:47
*** lpetrut has joined #opendev07:47
hrwgreat job on dib!07:47
*** ykarel_ has joined #opendev08:00
*** ykarel has quit IRC08:02
*** ykarel_ is now known as ykarel08:02
*** ykarel is now known as ykarel|lunch08:23
*** ykarel|lunch is now known as ykarel09:22
*** sshnaidm|afk is now known as sshnaidm09:31
openstackgerritHervĂ© Beraud proposed openstack/project-config master: Move away from nodejs4 in an Openstack context  https://review.opendev.org/c/openstack/project-config/+/78535309:34
*** dtantsur|afk is now known as dtantsur09:41
openstackgerritMerged openstack/diskimage-builder master: Improved the documentation for DIB_DNF_MODULE_STREAMS  https://review.opendev.org/c/openstack/diskimage-builder/+/78530410:34
*** sshnaidm has quit IRC11:01
*** sshnaidm has joined #opendev11:04
*** kopecmartin has quit IRC11:26
*** jpena is now known as jpena|lunch11:30
*** artom has quit IRC11:36
*** sshnaidm has quit IRC11:37
*** artom has joined #opendev11:37
*** elod is now known as elod_afk11:38
*** kopecmartin has joined #opendev11:48
*** kopecmartin has quit IRC11:49
*** sshnaidm has joined #opendev11:50
*** ysandeep is now known as ysandeep|afk12:01
*** amoralej is now known as amoralej|lunch12:13
*** artom has quit IRC12:26
*** fressi has left #opendev12:29
*** jpena|lunch is now known as jpena12:31
*** ysandeep|afk is now known as ysandeep12:58
*** amoralej|lunch is now known as amoralej12:59
*** elod_afk is now known as elod13:11
*** marios is now known as marios|call13:12
*** artom has joined #opendev13:32
*** lpetrut has quit IRC14:14
openstackgerritMerged openstack/project-config master: Move away from nodejs4 in an Openstack context  https://review.opendev.org/c/openstack/project-config/+/78535314:29
*** ykarel has quit IRC14:32
*** ykarel has joined #opendev14:32
fungiianw: hrw: i'm still catching up, not clear if nb03 is updated for dib 3.9.0 but looking now to figure out how far along we are/what's left to do there14:46
fungilooks like nodepool-builder was restarted on it at 08:53 utc today14:47
fungii don't see any centos arm images uploaded to linaro though14:55
fungijust ubuntu and debian14:55
fungier, nevermind, i was looking at nodes not images14:56
fungimost recent centos-8-arm64 image was built 14.5 hours ago14:56
fungiwhich was before the builder got restarted14:57
*** ykarel has quit IRC14:57
fungialso i confirmed pbr freeze in the container reports diskimage-builder==3.9.014:57
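(Roughly how that check looks on the builder host; the container name here is hypothetical, whatever docker-compose named the nodepool-builder service:)
    sudo docker ps --format '{{.Names}}'                                   # find the builder container
    sudo docker exec nodepool-builder pbr freeze | grep diskimage-builder  # confirm the dib version inside it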
*** spotz has joined #opendev14:58
fungiokay, i deleted centos-8-arm64-0000036826 from over 1.5 days ago14:58
*** ricolin has joined #opendev14:59
fungialso centos-8-stream-arm64-0000001564 from the same timeframe15:00
clarkbfungi: I think ianw built a separate arm64 builder test node15:01
clarkbfungi: but then after 3.9.0 release was made any update to nodepool would have updated the nodepool-builder container image and we would auto deploy that15:02
fungigot it15:02
fungiwell, anyway, the new images can't really be substantially more broken than the old ones15:02
clarkbfungi: https://review.opendev.org/c/zuul/nodepool/+/785347 landing that should get us nodepool-builder images if we don't already have them15:04
fungiyep, i missed that one thanks, approved just now15:05
fungiunfortunately nb03 started building debian-stretch-arm64 just before i deleted the oldest centos images, so we have to wait for that to finish before these can start15:09
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Remove arm64 jobs (temporarily)  https://review.opendev.org/c/zuul/zuul-jobs/+/78543215:09
ricolinfungi, where can I find the progress of the debian-stretch-arm64 build?15:12
clarkbricolin: https://nb03.opendev.org then find the current log file (they are ordered and also you can sort by timestamp)15:13
clarkbricolin: it won't auto refresh but you can reload periodically (like once a minute or so should give you good updates)15:13
ricolinclarkb, thx15:15
*** marios|call is now known as marios15:17
clarkbfungi: when you get a chance can you respond to ianw's message about the review replacement? I think you probably have the best overview of the issue from a security standpoint?15:21
fungiahh, sure, sorry juggling too many things at once and hadn't gotten to that15:21
clarkbno worries, I don't think it's urgent to get done before ianw's day starts :)15:22
clarkbit just occurred to me that you can probably speak to that best15:22
clarkb(and I'm in my start the day get through messages push)15:22
fungimy get through messages push hasn't completed yet either, though i did sleep in a bit15:34
*** mlavalle has joined #opendev15:35
ricolinjust out of curiosity, is there a way to stop queued jobs from a pipeline (like check-arm64)?15:35
clarkbricolin: an admin can dequeue the buildsets, but there isn't anything for specific jobs15:36
fungibecause jobs can have interdependencies, so there's no sane way to remove just one from a buildset15:38
fungibut also it's unclear what should be reported in such cases if we could15:39
fungiricolin: also if the change gets abandoned or a new patchset is pushed, the old queue item will be removed automatically15:39
hrwmorning15:42
hrwdo we still have jobs on debian-stretch?15:42
ricolinfungi, clarkb got it15:48
fungihrw: i believe the openstack-ansible folks are using it for older branches15:49
fungiaccording to what they said a few weeks ago when i brought up possibly removing our stretch images15:50
hrwfungi: thanks15:50
ricolinhrw, the devstack job to run full tempest tests on arm64 is all green now, but the run time is really long. https://review.opendev.org/c/openstack/devstack/+/70831715:51
hrwricolin: good to hear15:51
ricolinif you can provide ideas on improving the performance that will be great :)15:52
ricolincurrently devstack plans to move to parallel execution, which might help reduce the exec time (https://review.opendev.org/c/openstack/devstack/+/783311)15:53
hrwricolin: my brain shuts down when I look at devstack output15:53
dtantsurwe have a person hitting what seems to be https://bugs.launchpad.net/git-review/+bug/133254915:55
openstackLaunchpad bug 1332549 in git-review "git review sometimes causes unpacker error for Missing tree" [Undecided,New]15:55
hrwricolin: for sure it is worth using PIP_EXTRA_INDEX_URL=local/infra/wheel/cache15:55
hrwricolin: as you are installing several python packages, that way some of them will be fetched already built as wheels15:56
clarkbdtantsur: do you know what version of git is being used? we suspect that using git protocol v2 may also mitigate it15:56
clarkbdtantsur: evidence to the contrary would be great if we can have it15:56
clarkbs/can//15:57
ricolinhrw, good point15:57
*** rpittau is now known as rpittau|bbl15:57
dtantsurclarkb: 2.25.115:57
clarkbdtantsur: git 2.18.0 or newer supports protocol v2 and 2.26.0 enables it by default15:58
*** cenne has joined #opendev15:58
dtantsurhi cenne, we're discussing your issue15:58
hrwricolin: you install 30 Python tarballs, including grpcio and pynacl which take some time to build15:58
clarkbdtantsur: you might try having them enable protocol v2 then and see if it helps. `git config protocol.version 2` in the repo15:58
cenneyes ^^' thank you. and sorry15:58
clarkb(I think that is how you do it anyway)15:59
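(The variants being discussed look roughly like this; the remote and branch names are just placeholders, and for the one-off form the -c goes to git itself, before the subcommand:)
    git config protocol.version 2                                  # this clone only
    git config --global protocol.version 2                         # all repos for this user
    git -c protocol.version=2 push gerrit HEAD:refs/for/master     # one-off push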
dtantsurno worries, it's just bad luck of newcomers :)15:59
hrwricolin: you would cut several minutes of time15:59
fungiand if the git version is in the >=2.18.0 <2.26.0 range then yeah finding out if switching protocol version solves the problem would also be an excellent data point15:59
cennegit version 2.25.115:59
dtantsurcenne: could you try the command that clarkb suggested?15:59
ricolinhrw, yeah around 5-6mins I think:)15:59
*** jpena is now known as jpena|off15:59
clarkbcenne: yup so not enabled by default but you can enable it explicitly and try again15:59
cenneon it16:00
fungiricolin: hrw: we should also double-check that our wheel builders are creating all the arm64 wheels that job needs, since compiling larger packages from sdist could be very slow16:00
hrwfungi: wheel build jobs go through whtever is in requirements project16:00
fungihrw: yes, but it doesn't mean they're working16:01
hrwfungi: wheels or jobs?16:01
cennebtw, ftr, i did get git review to go through once. It's after amending and re-reviewing that it borked16:02
fungithe arm64 wheel builder jobs. we should make sure they're successfully generating wheels for the things the devstack job is trying to use (look for evidence in the devstack job log that it built things from sdist instead)16:02
clarkbcenne: yes, we think it has to do with the structure of your local pack files and how the remote server and your local client negotiate with objects to transfer16:02
clarkbcenne: this is why we suspect that v2 may help address the problem because that negotiation is handled in a different manner (and so far no one reporting this problem has reported a v2 by default client version)16:02
hrwfungi: imho if devstack job is using something which is not in openstack/requirements then it is a bug in devstack16:03
cenneokay, so i ran the command earlier and redid git review -t topicname16:03
cennesame error16:03
JayFcenne: just to confirm, you `cat .git/config | grep protocol` and see the config you added earlier taking effect?16:04
cennei can see it in git config --list, but i'll do that too16:04
clarkbI've found one doc that says you need to set it with --global but that doesn't seem to be accurate from what I know about git configs16:05
JayFcenne++ you did better than me, I didn't even know about `git config --list`16:05
fungihrw: what i'm saying is that if we see evidence devstack is building some things from sdist, that likely implies that our wheel builder jobs aren't successfully building at least some of the arm64 wheels it needs, the job makes a best effort to build things in the upper-constraints.txt list but that doesn't mean it manages to build them all16:06
cenneit's there. I had to do grep -C 2 protocol though16:06
cenneit's there basically16:06
dtantsurcenne: a silly idea, what if you amend the commit again? e.g. change the message slightly?16:06
clarkbdtantsur: cenne  ^ that suggestion has been shown to help other people or even just rebasing again16:06
hrwfungi: yep16:07
cennei'll change it slightly and try.16:07
fungidtantsur: that would likely make the error "go away" but doesn't give us a much better idea of the underlying problem unfortunately16:07
hrwjust started a local test of installing all packages devstack builds to check whether they are in the cache16:07
clarkbya other examples do `git push -c protocol.version=2 dest ref` which is even less global than setting it in the global config so I suspect doing it in the repo config is fine16:07
clarkbgood to know that this doesn't seem to have helped.16:08
dtantsurfungi: it will rule out damage to the repo itself16:08
dtantsurI mean, the base repo16:08
fungidtantsur: oh, sure it would do that16:08
cenneadded a dot (.) in the commit message. that change good enough?16:09
cenneit didnt work btw16:09
clarkbin that case pushing with no thin is probably your best bet. Someone else went with that recently. I think there may be a config option you can set for that too /me looks16:10
hrwfungi, ricolin: all packages which devstack built are in cache16:10
cenneif this turns out to be some really silly mistake on my side, im sorry in advance16:10
fungihrw: cool, so the devstack setup log doesn't indicate it used sdists for anything?16:10
fungithat would have been an easy explanation for the slower performance of the job, oh well16:11
hrwfungi: no sdist in log16:11
fungigood to know that much is working at least16:11
hrwricolin: nodepool_wheel_mirror: "https://{{ zuul_site_mirror_fqdn }}/wheel/{{ ansible_distribution | lower }}-{{ (ansible_os_family == 'Debian') | ternary(ansible_distribution_version, ansible_distribution_major_version) }}-{{ ansible_architecture | lower }}"16:12
dtantsurcenne: what does `git show 2ad9fbc7797aeaebfe2dcc613d928afc97e934e6` output?16:12
dtantsur(or whatever missing tree ID you have)16:13
clarkbcenne: I don't think it's a silly mistake, though I do suspect that some sort of local operations cause the repos to get confused when syncing with jgit.16:13
clarkbunfortunately we haven't been able to narrow that down16:14
hrwricolin: https://mirror.mtl01.inap.opendev.org/wheel/ shows you which distros have wheel cache already16:14
ricolinthat's helpful16:15
clarkbbasically when you do a push each side figures out what it has and the path from one to the other, and from that a pack is made16:15
clarkbI think what is happening is that jgit and cgit are not agreeing on what the contents of that pack need to be to have a complete path16:15
hrwricolin: I wonder how much you will cut16:16
clarkbthis means when jgit receives the pack from cgit it finds that something is missing that it expects to see and it rejects it16:16
clarkbwhen you do the --no-thin flag to push the issue is avoided as both sides are far less discerning and send much more data16:16
cennedtantsur: I think it's showing the base directory16:16
hrwricolin: remember to add cache for both aarch64 and x86-64 archs - both can use cache16:16
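(What hrw is suggesting would look something like this on a node; the mirror host is the example from above and the distro/arch path just follows that jinja pattern, so treat both as illustrative:)
    export PIP_EXTRA_INDEX_URL=https://mirror.mtl01.inap.opendev.org/wheel/ubuntu-20.04-aarch64/
    pip install grpcio pynacl    # now fetched as prebuilt wheels instead of compiling from sdist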
cennei can send a paste if needed16:16
cenne*directory files16:17
clarkbcenne: I don't know that that will help. I've tried making modifications to the same files that people are modifying and having trouble with, thinking that maybe it is related to that somehow16:17
clarkbtrying a different version of git may also be interesting to see if the behavior is consistent across C git versions. But also a lot more work16:18
clarkbprobably easiest to just do a no-thin push and move on (I am not finding a config option for that so may need to push directly?)16:18
dtantsuror clone the repository again, download your patch with `git review -d <number>` and try again16:18
dtantsurno-thin is probably more constructive, clarkb do you remember how to push by hand? I've never done it, I think :)16:19
clarkbcenne: git-review should print out its git push command something like `git push gerrit HEAD:refs/for/master:topic` you can modify that to `git push --no-thin gerrit HEAD:refs/for/master:topic`16:19
hrwricolin: I am curious how much it can cut16:19
clarkbcenne: you may need to run it with git review -v to get that though16:19
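(Putting that together — the remote and branch here are the git-review defaults and the topic is a placeholder:)
    git review -v -t topicname                        # prints the exact git push command it would run
    git push --no-thin gerrit HEAD:refs/for/master    # same push, but without thin packs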
hrwricolin: some will be saved due to 'no need to build', some due to fetching from local cache16:19
ricolinhrw, you mean the execution time?16:20
clarkbricolin: hrw: we should be setting the wheel mirror for all devstack jobs already?16:20
clarkba very early job setup role should be doing that.16:20
hrwricolin: yes16:21
clarkbfwiw I'm not opposed to someone adding --no-thin to git-review as a flag. I didn't realize that had been suggested so long ago16:22
hrwclarkb: should. looks like it does not or it is ignored somewhere (sudo?)16:22
cenneok. i'll try i guess16:22
cenneif it helps add more info.16:22
ricolinhrw, I saw you set up PIP_EXTRA_INDEX_URL for kolla already16:23
clarkbricolin: do you have a link to a job run I can lookat?16:25
ricolinclarkb, https://zuul.openstack.org/builds?job_name=devstack-platform-arm6416:26
fungilooks like the debian-stretch-arm64 image update completed and uploaded, but nb03 still hasn't started building new centos images, i'll see if we paused them or something16:26
clarkbricolin: https://zuul.openstack.org/build/8382a4f0479b4e9c824e9ad2f32c0ea4/console#0/3/20/controller that is the task that should configure the global pip config to use the mirrors including the wheel mirror16:27
funginothing paused in its nodepool.conf, and dib-image-list doesn't report any pause via cli16:28
clarkbfungi: is it building some other image?16:29
funginot that i can see16:29
clarkbare the old images still present? I wonder if it doesn't see them as needing to be rebuilt?16:30
fungiyeah, only nb01 and nb02 are building images at the moment16:30
clarkbyou can also try an explicit build request16:30
ricolinclarkb, that job fail because I set the wrong DEFAULT_IMAGE_FILE_NAME, which I fixed in newest patch set16:30
cenneIt worked!16:30
cenne--no-thin got it through.16:31
fungiclarkb: i deleted the oldest centos-8-arm64 and centos-8-stream-arm64 images with dib-image-delete so only the latest one for each remains16:31
clarkbcenne: great, and sorry for the trouble. Maybe we need to think about adding a flag to git-review for people to opt into that on the occasions it happens16:31
clarkbfungi: I think it may not rebuild in that case16:32
clarkbfungi: because it notices it has an up to date image16:32
fungiclarkb: oh, so delete the newer images too? will do16:32
clarkbfungi: or just do an explicit build request16:32
fungithey're equally broken anyway, no loss16:32
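(For reference, the builder-side commands involved, run on nb03 — these days inside the builder container; the image id is the one mentioned above:)
    nodepool dib-image-list                               # show current builds and ages
    nodepool dib-image-delete centos-8-arm64-0000036826   # drop a stale build
    nodepool image-build centos-8-arm64                   # or explicitly request a fresh build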
cenneclarkb: np. was fun troubleshooting (not that i understood most of it). Thanks. :)16:33
clarkbricolin: ya I'm just pointing out that the build is attempting to write a pip.conf, but we don't see its contents16:33
clarkbcenne: you're welcome16:33
*** hamalq has joined #opendev16:33
clarkbricolin: maybe a good next step would be to add /etc/pip.conf to the log files that are collected, then you can see why it may not be working?16:33
fungiokay, deleted centos-8-stream-arm64-0000001565 and centos-8-stream-arm64-0000001565 from ~16-17 hours ago16:33
fungiclarkb: bingo! centos-8-arm64-0000036828 is building now16:34
clarkbgreat16:35
*** marios is now known as marios|out16:36
* cenne thanks also dtantsur: and JayF:16:38
dtantsursure thing, I don't think I helped a lot :)16:38
JayF\o/ cenne hopefully you have better luck with the rest of your contribution...16:38
JayFget all the bad luck outta the way now16:38
cenne:P hopefully.16:40
cenneso.. should i revert back the protocol.version 2 now? or leave it as is16:42
*** sshnaidm has quit IRC16:43
JayFIt should only do good things to have protocol v2 enabled for that repo16:43
*** sshnaidm has joined #opendev16:44
hrwricolin: kolla got cache once we finished getting cache jobs working again16:45
hrwfungi: btw - what are the 0000* numbers?16:46
fungihrw: the image build increment, so we can identify individual image revisions16:47
fungihrw: nodepool-builder just increments that for each build16:47
hrwok. build numberish.16:47
fungiyup16:49
hrwso, once we get working centos-8-stream-arm64 images, will there be something blocking centos-8-stream-arm64 node creation?16:49
*** marios|out has quit IRC16:49
ricolinclarkb, okay, will add log to next update16:50
fungihrw: shouldn't be16:51
fungihrw: basically if the image builds succeed then the builder will upload them automatically to the linaro cloud through glance, after which any new server boot calls should use the new images16:52
fungiand (hopefully) boot to a reachable state again16:52
hrwfungi: great16:52
fungiand at that point we'll see all the queued builds in zuul begin to show moving progress bars16:53
fungithough it will probably take some time to work through the pending backlog16:53
* hrw off17:07
*** auristor has quit IRC17:08
*** sshnaidm is now known as sshnaidm|afk17:10
*** auristor has joined #opendev17:12
*** eolivare has quit IRC17:23
clarkbcenne: I would clear it out just to reduce potential variables for the future, but I don't expect it will cause problems if you leave it either17:28
*** dtantsur is now known as dtantsur|afk17:29
*** amoralej is now known as amoralej|off17:31
fungicentos-8-arm64-0000036828 reached a ready state and is uploading now. centos-8-stream-arm64-0000001566 has started building17:32
fungiand centos-8-arm64-1617903089 is the ready image in linaro-us now, so we should hopefully start seeing nodes boot with that17:36
*** rpittau|bbl is now known as rpittau17:44
*** andrewbonney has quit IRC17:44
cenneclarkb: ok, i'll leave it on for now. it's just for one repo, and i'm changing os soon anyway17:44
cenneclarkb: will I have to use the --no-thin flag every time? I had to use it again when amending and resubmitting the commit17:45
fungii concerned that having a min-ready of 1 for ubuntu-focal-arm64-xxxlarge is consuming a significant percentage of our available arm node quota unnecessarily17:45
fungier, i'm concerned17:45
fungii'll push up a patch17:45
fungiwe also have a bunch of locked deleting nodes17:45
fungilooks like they're all debian-buster-arm6417:46
fungii wonder if those are also having trouble booting17:46
openstackgerritMerged zuul/zuul-jobs master: Remove arm64 jobs (temporarily)  https://review.opendev.org/c/zuul/zuul-jobs/+/78543217:46
fungiyeah, basically we have the following nodes in linaro-us: two debian-buster-arm64 nodes in-use for over an hour, two ubuntu-focal-arm64 nodes in use for half an hour, five debian-buster-arm64 nodes deleting for the past few minutes, and a ready ubuntu-focal-arm64-xxxlarge node nothing's used for 8.5 hours yet17:49
openstackgerritMerged zuul/zuul-jobs master: Add upload-logs-azure role  https://review.opendev.org/c/zuul/zuul-jobs/+/78200417:50
clarkbfungi: ya that cloud might be best served with min-ready: 0 on all labels17:51
clarkbthen let demand determine allocations otherwise the quota is too low17:51
fungirevising my theory on debian-buster-arm64, we do have nodes in use for a while as well as deleting ones, i may have just caught it at a bad time17:52
fungii think it's more the fact that delete calls take upwards of 10 minutes to complete on that cloud which makes for a lot of deleting state nodes in the list17:53
fungiwe haven't even tried to launch a new centos-8 node yet, i think due to demand from other labels17:54
*** klonn has joined #opendev18:00
openstackgerritJeremy Stanley proposed openstack/project-config master: Zero the min-ready for arm64 node labels  https://review.opendev.org/c/openstack/project-config/+/78545318:01
*** lpetrut has joined #opendev18:11
*** lpetrut has quit IRC18:14
*** brinzhang_ has joined #opendev18:21
*** brinzhang has quit IRC18:24
clarkbfungi: if you need to confirm the new images are working you could boot one out of band18:29
fungimaybe later, i'm onto more urgent fires now18:30
*** ysandeep is now known as ysandeep|away18:40
openstackgerritMerged openstack/project-config master: Zero the min-ready for arm64 node labels  https://review.opendev.org/c/openstack/project-config/+/78545318:46
*** whoami-rajat has quit IRC19:02
fungiokay, this is odd... looking at one of the debian-buster-arm64 nodes in a deleting state (0023862017), `nodepool list` reports its time in that state as only a few minutes, but the debug log shows it repeatedly attempting to delete the same node for several hours19:02
fungiso i've suddenly lost faith in the state times nodepool is reporting19:02
clarkbfungi: I think the reason for that is nodepool may be creating dummy records for leaked instances19:03
clarkband the operations that do that end up resetting the timestamp19:03
fungiit looks like the age field is being reset on every delete pass19:03
clarkbya I seem to recall seeing that before19:03
clarkbI think when the cleanup passes grab the lock the first thing they do is set the age timer19:03
fungiin fact, the log i was looking at was only a few hours old. nodepool has been trying to delete that particular node for several days19:05
fungi2021-04-06 08:56:36,759 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0023862017 (state: deleting, allocated_to: None)19:06
clarkbchances are if you try to manually delete it it will fail too19:06
fungiyup, working my way up to that19:06
fungithe age being reset just makes it much harder to spot which ones are legitimately deleting and which ones have been there for a long time19:07
clarkbya19:07
fungiopenstack server show indicates it has a status of BUILD19:08
fungicreated 2021-04-06T08:43:05Z19:08
fungiprobably nova won't let us delete instances in state BUILD19:09
fungihowever openstack server delete did not report an error for that uuid19:09
fungistatus remains BUILD though19:09
fungirunning with -v says END return value: 019:10
fungiso i guess nova's all like "thanks i'll get right on that, it'll go into delete as soon as it's done building!"19:11
clarkbya the delete request is more of a "please do this if you can" and the response is "I heard you"19:11
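(The checks fungi is describing amount to roughly this, with <uuid> standing in for one of the stuck instances:)
    openstack server show <uuid> -f value -c status -c created   # stuck in BUILD since days ago
    openstack server delete <uuid>                               # exits 0, but nova only queues the request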
fungikevinz: we have 5 instances in linaro-us which are stuck in either BUILD or ERROR state for the past 2.5 days and would appreciate if you could clean them up so we can use the rest of the quota: a25e6f3b-5695-4729-b9c7-bb6160d7dd55 3ec7f816-52e5-480d-828a-767ebae7cf84 6accab6d-e1bd-4998-85eb-d7764d1ee0f5 a82e3b41-b0f7-4310-a89f-2a84eef8f8b0 15c38603-41c2-484d-8637-62941b871cbe19:16
fungionce /etc/nodepool/nodepool.yaml updates on nl03 i'll clean up the lingering xxlarge ready node19:18
*** d34dh0r53 has quit IRC19:21
*** d34dh0r53 has joined #opendev19:22
fungibetween those 5 stuck 8gb nodes and the 16gb ready node, our effective quota is reduced to 10 nodes there19:25
fungithough there is a building centos-8-stream-arm64 node now!19:25
fungiwe also have a server instance named "test" booted in there since late october of last year19:27
fungiinfra-root: ^ any objections to me deleting the server instance named "test" from linaro-us?19:28
clarkbfungi: checking19:29
clarkbfungi: is that the server whose ipv6 addr ends in 7610?19:29
fungithe centos-8-stream-arm64 node went into in-use 5 minutes ago, so looks like it may actually be getting used for a job19:29
clarkbif so that was the one ianw was using to test the new dib patches aiui19:29
fungiclarkb: nope, 2604:1380:4111:3e56:f816:3eff:fe35:d8dc19:30
fungithis is in the node tenant, not the control plane tenant19:30
fungithe one you're talking about is test-builder and it's in the control plane tenant19:31
clarkbgotcha19:31
clarkbI don't recall adding a test node myself so I'm fine with it19:31
openstackgerritClark Boylan proposed openstack/diskimage-builder master: Support secure boot with Ubuntu if using efi  https://review.opendev.org/c/openstack/diskimage-builder/+/78550319:34
fungii need to take a break to start making dinner, and then the rest of my evening is mostly spoken for between paperwork and openstack election officiating duties19:36
clarkbfungi: thank you for getting the centos 8 images going again19:37
fungiwell, to be clear, you and stevebaker and ianw and hrw did most of the heavy lifting on that19:43
fungii just drove it the last few meters over the finish line19:44
*** d34dh0r53 has quit IRC19:51
*** d34dh0r53 has joined #opendev19:51
*** dmsimard6 has joined #opendev20:24
*** akahat has quit IRC20:24
*** tosky_ has joined #opendev20:25
*** janders has quit IRC20:26
*** janders8 has joined #opendev20:26
*** dmsimard has quit IRC20:26
*** dmsimard6 is now known as dmsimard20:26
*** tosky has quit IRC20:26
*** tosky_ is now known as tosky20:28
*** sboyron has quit IRC20:36
*** akahat has joined #opendev20:37
*** slaweq has quit IRC20:41
*** klonn has quit IRC20:42
hrwfungi: every time I look at dib it feels weird. so the moment when I say why I think it is wrong and someone gets 'HA! I know how to go now' is great20:42
clarkbhrw: the thing about dib is it is weird, but ultimately relatively simple. It's just a series of run-parts processes20:43
clarkbonce you know that it is relatively easy to inject "breakpoints" into the list of scripts and debug from there. The major issue this time around was figuring out how to boot arm for me :) but as I was figuring that out you pointed out the problem :)20:44
clarkb(looks like virt-install does have an arch flag which I'll likely try next time this comes up)20:44
hrwclarkb: the problem is that people are still not used to how efi systems work20:44
hrwclarkb: once you get used to how efi systems work it does not matter whether it is aarch64 or x86-6420:45
clarkbhrw: ya I mean stevebaker pointed out that ubuntu's packaging doesn't actually install the shim where you need it if you install the package. So there is other magic going on20:45
clarkbhrw: thats true, though in this case we weren't sure it was even an efi problem (I was unaware of the change to dib)20:45
hrwwhat is ubuntu? ;D20:45
clarkbdebian does the same thing fwiw20:46
clarkbeverything goes in /usr and some other magic seems to be necessary to install that to /boot/efi/EFI/{ubuntu,debian}/20:47
hrwyeah, distroisms20:47
stevebakerI'm writing an ubuntu usb installer now, lets see what it does20:47
clarkbone thing I would like to test is if we can switch all our test node images over to the efi build. In theory it will work since legacy bios is still bootable on that path but then we would get efi if clouds supported it?20:50
clarkbthough maybe you have to set an image metadata flag or nova flavor option or something for that to happen20:51
hrwclarkb: q35 can boot uefi for sure. i440fx not sure20:51
hrwit is more a matter of having uefi binary available on nova-compute hosts20:52
hrwthen you set hw_firmware_type=uefi and should boot20:53
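(As an image property that would be something like the following, assuming the computes actually have the UEFI firmware installed; hw_machine_type is only needed where q35 isn't already the default:)
    openstack image set --property hw_firmware_type=uefi --property hw_machine_type=q35 <image-name-or-uuid>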
fungialso no idea if xen vs kvm complicates things20:53
hrwI lack xen experience20:53
funginearly half of the nodes we boot are on xen20:54
clarkbour images should do both though I think20:54
clarkbso in theory xen and old kvm woudl just keep doing bios?20:54
clarkb(well they do both if you set efi)20:54
ianwin theory :)  i'd be willing to bet that somewhere between the crappy vhd conversion steps and booting something wouldn't work :)20:55
hrwx86 images iirc have efi grub in ESP, then grub-pc in bios partition, then have /boot and / or sth like that20:55
ianwwe merged some changes that supposedly made dib do dual-boot images20:55
ianwhttps://review.opendev.org/c/openstack/diskimage-builder/+/74324320:56
hrwnice20:58
* hrw off21:00
fungiyeah, debian has multi-boot installer images with support for several architectures, and mixed efi/bios bootloader, so it's definitely doable21:00
fungithough the utility of "i have a single usb stick which can install debian onto x86 and aarch64 and powerpc" is questionable21:01
fungii still don't see 785453 applied after three hours... i wonder if something's wrong21:04
fungiinfra-prod-service-nodepool TIMED_OUT in 30m 32s21:05
fungii guess that's why21:05
ianwohh, i usually suspect nb03 first if there's something wrong there21:13
fungicould be, i haven't pulled up logs yet21:16
*** rpittau is now known as rpittau|afk21:17
*** sshnaidm|afk is now known as sshnaidm|off21:24
fungi/var/log/ansible/service-nodepool.yaml.log on bridge doesn't mention the launchers at all21:34
fungii'm not 100% positive that's what https://zuul.opendev.org/t/openstack/build/695b972e640a4b188175a5ff8f790fbe/ was writing to though, we don't have the task details since the run playbook timed out21:36
fungihuh, the /etc/nodepool/nodepool.yaml on nl03 looks nothing like https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml21:46
fungialmost as if it's generated yaml instead21:46
corvusfungi: it is21:46
fungianybody know why that would be?21:46
corvusbut it should be generated from that file21:46
fungioh, okay21:46
corvusfungi: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L4-L5 is reason21:46
fungioh, yep thanks21:46
fungiso still no clue why /var/log/ansible/service-nodepool.yaml.log doesn't mention nl* servers21:48
fungii guess that's an artifact of the timeout not represented in that logfile (yet?)21:50
fungithe previous log from 02:21:47 includes them21:51
fungiroot      1963  0.0  0.0   4636   788 ?        S    18:54   0:00 /bin/sh -c ansible-playbook -v -f 5 /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-nodepool.yaml >> /var/log/ansible/service-nodepool.yaml.log21:52
fungiaha, so it's still actually running, that's why21:52
fungichild process claims to be connecting to 104.130.124.242 which is nb0221:53
fungii'm able to log into nb02, even over ipv4, and it's been up for weeks...21:54
fungii don't see any ansible processes on nb02 either21:54
fungioh, in fact it looks like the ansible process connecting to nb02 is timestamped 13:3521:56
fungiso something hung earlier today i guess?21:56
fungiseveral distinct runs are all in progress trying to connect to nb0221:57
fungii can ssh to that address from bridge.o.o as root too21:57
fungiinfra-root: ^ suggestions?21:58
ianwfull disk maybe causing python/ansible to hang?21:58
fungidoesn't look like it21:59
* ianw has sound of straws clutching21:59
* fungi grasps some more21:59
fungimemory, disk, cpu all look reasonable on nb02 given that it's building images22:00
fungiroot     29407  0.2  0.4 184100 38372 ?        Sl   13:35   1:22 /usr/bin/python3 /usr/local/bin/docker-compose pull22:01
fungithere's some later ones in the process table too22:01
fungii bet that's it, ansible is hung because docker-compose pull isn't returning?22:02
fungianyway i've run out of time, i need to switch to other tasks22:03
fungihopefully that helps whoever gets a chance to look at it next22:04
ianwi can have a look in a bit22:05
*** hamalq has quit IRC22:07
*** hamalq has joined #opendev22:07
fungithanks!22:08
fungiand sorry, i've promised too much on too many things and had to parcel out the remainder of my day to make sure i don't slip on deadlines22:09
ianwThu Apr  1 13:15:11 UTC 2021 found extra address!22:20
ianwso we caught the review server getting an extra address.  i should have a dump22:20
fungiooh, as in tcpdump of (possible) route announcements?22:21
fungithat's exciting indeed22:22
ianwhrm, so "_count=$(ip addr | grep 'inet6 2' | wc -l)" returned -gt 122:22
ianwbut currently it only has one address22:22
fungidiagnosing network issues in other people's systems is always a bit like being a spirit medium22:23
ianwit is a week on22:23
ianwfrom when that triggered though22:23
fungiyeah, well past any likely expiration22:23
ianw2021-04-01 13:15:10.262527 IP6 (class 0xc0, hlim 255, next-header ICMPv6 (58) payload length: 56) fe80::ce2d:e0ff:fe5a:d84e > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 5622:27
ianw2021-04-01 13:15:10.262672 IP6 (class 0xc0, hlim 255, next-header ICMPv6 (58) payload length: 56) fe80::ce2d:e0ff:fe0f:74af > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 5622:27
ianw2021-04-01 13:15:10.313178 IP6 (class 0xe0, hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::f816:3eff:fec3:34bc > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 6422:27
ianwthat is 3 RA's received 1 second before we saw two addresses22:27
ianwhere is a full packet dump -> http://paste.openstack.org/show/804320/22:30
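(A capture along these lines would grab the RAs for later correlation — the interface name is a guess, and ip6[40] == 134 matches the ICMPv6 router-advertisement type only when no extension headers are present:)
    sudo tcpdump -ni eth0 -vv 'icmp6 and ip6[40] == 134'    # add -w ra.pcap to keep the raw capture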
fungiour current default there is through fe80::ce2d:e0ff:fe5a:d84e and fe80::ce2d:e0ff:fe0f:74af22:34
fungiso fe80::f816:3eff:fec3:34bc is the odd duck22:34
fungiieee has no oui match on f8:16:3e22:36
fungias far as i can tell22:36
fungiso guessing that's made up and belongs to a vm22:36
ianwit appeared between 2021-04-01 13:15:10.313178 and 2021-04-01 13:16:57.742043 and then was never heard from again22:38
ianw(pasting some details in https://etherpad.opendev.org/p/gerrit-upgrade-2021)22:38
fungisounds about right if it was some zuul job22:38
ianwit says "valid time 2592000s" which makes 30 days by my calculation22:43
fungineat, i wonder if the kernel reaps those sooner if their gateways fall out of the neighbor table or something22:43
ianwi should have dumped the ip addr output too, that was an oversight22:44
ianwi'm not sure how much this debugging is going to help.  i'm happy to continue it, if we think there is anything to glean22:45
fungiprobably we should be getting input from operators at vexxhost as to what it would help them for us to collect22:49
fungithough if we have a theory that the announcements are originating from our job nodes, maybe collecting enough information to tie a stray announcement to a particular job would enable us to either confirm that and gain some insight into the nature and origin of those announcements, or refute it22:50
clarkbinfra-root tomorrow morning I intend to run the external id cleanup script against the 224 accounts I retired last week. I haven't seen anyone show up and complain about their account being off and this last batch was the set that showed up due to lack of usernames, ssh keys, and/or valid openids23:18
clarkbI'm fairly confident in this group. I suspect the next one will be much tougher :(23:18
fungisounds great, thanks!23:18
clarkbif you have seen anyone complain let me know :)23:18
fungii have not, ftr23:19
clarkbfungi: ianw maybe stick the details in logstash?23:21
clarkband see if any log lines show that was one of ours?23:21
fungiyeah, though in this case the evidence might be too old already23:21
fungiworth a shot though23:21
clarkbgood point and those ipv6 addresses are unlikely to be stable like the ipv4 ones we use in testing23:22
clarkbsince the macs are probably random?23:22
fungithat's what i'm guessing, but really not certain23:23
clarkbI wonder if the timestamps might be the important piece of info, like if vexxhost can cross check against maintenance at that time or other activities along those lines23:29
*** tosky has quit IRC23:38
fungioh, yeah, server restarts, et cetera23:40
fungior switchport flaps23:40
ianwi'll leave the dump file on disk, https://etherpad.opendev.org/p/gerrit-upgrade-2021 has extracted details from the 1st23:42

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!