opendevreview | Shnaidman Sagi (Sergey) proposed openstack/diskimage-builder master: Improve DIB for building CentOS 9 stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 00:22 |
---|---|---|
ianw | clarkb / fungi : I have put in a draft note at the end of https://etherpad.opendev.org/p/gerrit-upgrade-3.3 about dashboards and attention sets as discussed in the meeting. please feel free to edit and send as you see fit | 02:53 |
Clark[m] | ianw: I made two small edits but lgtm if you want to send it | 02:58 |
ianw | Clark[m]: were you thinking openstack-discuss or just service-discuss? | 03:03 |
Clark[m] | service-discuss. More people seem to be watching our lists and getting it out there will hopefully percolate through the places | 03:05 |
ianw | will do. going to take a quick walk before rain and will come back to it :) | 03:06 |
*** ysandeep|out is now known as ysandeep | 04:04 | |
*** ykarel|away is now known as ykarel | 04:32 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 04:53 |
ianw | sshnaidm: i have updated the change with the testing we should be doing, and tried to explain more clearly what's going on in https://review.opendev.org/c/openstack/diskimage-builder/+/806819/comment/87f51fd0_0ee1505c/ | 05:04 |
ianw | hopefully that can get us on the same page | 05:05 |
*** bhagyashris is now known as bhagyashris|rover | 05:21 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 06:49 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 06:50 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 06:59 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 07:09 |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/813034 | 07:18 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 07:25 |
*** jpena|off is now known as jpena | 07:29 | |
lourot | o/ has anyone got a moment for https://github.com/openstack-charmers/test-share/pull/21 ? thanks! | 07:45 |
lourot | wrong channel, sorry | 07:46 |
frickler | infra-root: lots of post_failures. I've heard rumors of OVH having issues, but can't dig right now. might be log uploads failing | 08:05 |
ianw | yep, OVH | 08:20 |
ianw | WARNING:keystoneauth.identity.generic.base:Failed to discover | 08:20 |
ianw | available identity versions when contacting https://auth.cloud.ovh.net/. | 08:20 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Disable log upload to OVH https://review.opendev.org/c/opendev/base-jobs/+/813780 | 08:24 |
ianw | status page gives "Active issue" | 08:24 |
ianw | https://status.us.ovhcloud.com/ | 08:25 |
frickler | didn't we use to have a third log provider? maybe we should actively try to get some more redundancy again | 08:28 |
ttx | Looks like OVH is having network issues right now | 08:29 |
ianw | frickler: think we should fast merge it? | 08:29 |
ttx | I was disconnected for an hour, just came back | 08:29 |
frickler | ianw: actually I think the indentation might be broken with your patch? | 08:29 |
ttx | (my bouncer is on a OVH node) | 08:29 |
ianw | frickler: the lines are commented, i think the other lines remain the same? | 08:30 |
frickler | ianw: but don't comments need to match the indentation of their surroundings? | 08:31 |
ianw | i don't believe so, but if you'd prefer i can delete the lines and we can just put a revert in | 08:32 |
frickler | anyway, it seems to be working again just now, at least accessing the cloud from bridge is now returning things again | 08:32 |
ianw | yeah, i can get to auth.cloud.ovh.net:5000 too | 08:33 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Disable log upload to OVH https://review.opendev.org/c/opendev/base-jobs/+/813780 | 08:33 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Revert "Disable log upload to OVH" https://review.opendev.org/c/opendev/base-jobs/+/813783 | 08:33 |
ianw | there's the stack if we have issues | 08:34 |
ianw | i'm afraid i'm rapidly reaching burnout point here | 08:34 |
frickler | ianw: np, thanks for your help, I can check from time to time and see if it stays stable | 08:35 |
*** arxcruz is now known as arxcruz|rover | 08:48 | |
opendevreview | yatin proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 09:22 |
*** ykarel is now known as ykarel|lunch | 09:23 | |
*** ysandeep is now known as ysandeep|mtg | 09:25 | |
frickler | hmm, still seeing failures, checking logs | 09:48 |
frickler | I expected to see errors in the executor logs, but can't find anything there. also zuul didn't vote on 813780 but I'm also not seeing any job for that | 09:57 |
frickler | and while looking for strange things, https://review.opendev.org/800445 seems to be stuck in check for 44h | 09:58 |
frickler | also some tobiko periodic jobs for even longer | 10:10 |
frickler | I also don't see any current POST_FAILURES, so will leave the upload config as is for now | 10:15 |
opendevreview | Shnaidman Sagi (Sergey) proposed zuul/zuul-jobs master: Include podman installation with molecule https://review.opendev.org/c/zuul/zuul-jobs/+/803471 | 10:18 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Changing no of days for query from 14 to 7 https://review.opendev.org/c/opendev/elastic-recheck/+/813795 | 10:27 |
*** ykarel|lunch is now known as ykarel | 10:37 | |
*** ysandeep|mtg is now known as ysandeep | 10:57 | |
*** dviroel|out is now known as dviroel | 11:17 | |
*** jpena is now known as jpena|lunch | 11:24 | |
*** ysandeep is now known as ysandeep|afk | 11:31 | |
*** ysandeep|afk is now known as ysandeep | 12:01 | |
*** jpena|lunch is now known as jpena | 12:24 | |
*** ykarel__ is now known as ykarel | 12:37 | |
ysandeep | Folks o/ Is there a way to set the hashtag(s) via the CLI? For example we can set the topic with -t <topic> in git-review | 13:06 |
*** dviroel is now known as dviroel|rover | 13:33 | |
*** arxcruz|rover is now known as arxcruz | 13:37 | |
*** bhagyashris|rover is now known as bhagyashris | 13:39 | |
fungi | ysandeep: feel free to push up a change implementing that in git-review, though you can probably also do it as a second command straight to gerrit's ssh api... checking the documentation | 13:40 |
fungi | ysandeep: i'm not finding it, looks like they didn't implement any controls for hashtags in the ssh cli, at least not yet | 13:47 |
ysandeep | fungi: ack, no worries, thank your for checking | 13:48 |
ysandeep | you* | 13:48 |
fungi | ysandeep: looks like it could be set at push, similar to how git-review does topics at push: https://review.opendev.org/Documentation/user-upload.html#hashtag | 13:51 |
Clark[m] | They are part of the push ref options | 13:51 |
fungi | yeah, it's just that the ssh cli also has a set-topic command, so i was hoping there might be a similar set-hashtag | 13:52 |
* ysandeep checking documentation | 13:56 | |
fungi | yeah, in git_review/cmd.py you could probably just extend the command line options with one for hashtags and then append to push_options like happens with --topic | 13:58 |
fungi | oh, interesting, it looks like you can only set one hashtag at push time, not a list of them | 13:58 |
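As a rough illustration of the approach fungi describes (hypothetical option name and wiring, shown as a standalone sketch rather than the actual git_review/cmd.py code):

```python
import argparse

# Hypothetical sketch of a --hashtag option for git-review, modeled on how
# --topic gets appended to the push options; names here are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("-t", "--topic", default=None)
parser.add_argument("--hashtag", default=None,
                    help="hashtag to apply to the change at push time")
options = parser.parse_args(["--hashtag", "my-tag"])

push_options = []
if options.topic:
    push_options.append("topic=%s" % options.topic)
if options.hashtag:
    # Gerrit accepts hashtag=... as a %-option on the push ref; as noted
    # above, only a single hashtag can be set this way per push.
    push_options.append("hashtag=%s" % options.hashtag)

refspec = "HEAD:refs/for/master"
if push_options:
    refspec += "%" + ",".join(push_options)
print(refspec)  # HEAD:refs/for/master%hashtag=my-tag
```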
ysandeep | fungi, thanks i was able to set hashtag with push ref options | 14:09 |
ysandeep | fungi, I will give a shot implementing that in git-review | 14:12 |
Tengu | ysandeep: cool! thanks :) | 14:13 |
Tengu | fungi: thanks as well :) | 14:13 |
fungi | ysandeep: feel free to ping me here when you push up the git-review feature and i'll be happy to review it | 14:21 |
ysandeep | fungi++ thanks! I will try to implement that as my weekend python project.. So will probably ping you in week after PTG | 14:23 |
fungi | whenever, have fun! | 14:23 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream https://review.opendev.org/c/opendev/elastic-recheck/+/813250 | 14:23 |
johnsom | FYI, the ptg page is down. I'm getting a cloudflare error 523 "origin is unreachable" page going to openstack.org/ptg page. | 14:36 |
fungi | johnsom: yes, there's some network incident in vexxhost impacting some systems there, but ptg.opendev.org is still up | 14:39 |
johnsom | Ah, bummer, I wish them luck! | 14:40 |
fungi | i'm sure they'll have it cracked shortly | 14:40 |
*** ysandeep is now known as ysandeep|dinner | 14:55 | |
*** ykarel is now known as ykarel|away | 15:12 | |
clarkb | I guess the OVH stuff corrected itself before we had to worry about landing and then reverting any changes | 15:26 |
fungi | seems so | 15:27 |
clarkb | apparently I'm somehow still identified with oftc too. Neat | 15:29 |
fungi | did you set up cert auth? | 15:29 |
fungi | i never have to identify on reconnect | 15:29 |
fungi | it's just done as part of the tls setup with the client key | 15:30 |
clarkb | I don't think I did based on my nickserv status | 15:30 |
clarkb | but also it seems to have been magically handled by weechat? so meh? | 15:31 |
*** marios is now known as marios|out | 15:33 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream https://review.opendev.org/c/opendev/elastic-recheck/+/813250 | 15:48 |
clarkb | fungi: thinking out loud here should we hold off on https://review.opendev.org/c/opendev/system-config/+/813534/ and children until after renaming is done just to avoid any issues with config when gerrit starts up again in that process? Or go for it and maybe restart gerrit today/tomorrow? | 15:49 |
clarkb | similar question for https://review.opendev.org/c/opendev/system-config/+/813716 | 15:49 |
*** ysandeep|dinner is now known as ysandeep | 16:10 | |
clarkb | fungi: https://review.opendev.org/c/opendev/gerritlib/+/813710 is a super easy review too (adds python39 testing to gerritlib as we run jeepyb which uses gerritlib on python39 now) | 16:11 |
fungi | johnsom: problem seems to have been fixed, if you needed to get to the site for something | 16:13 |
johnsom | fungi Thank you! | 16:13 |
fungi | clarkb: i like the quick restart later today or tomorrow idea, just to make sure we're as prepped as we can be | 16:14 |
gthiemonge | Hey Folks, one of my patches is stuck in zuul https://zuul.openstack.org/status#698450 | 16:14 |
gthiemonge | is there any ways to kill it? | 16:14 |
fungi | also i just realized i booked an appointment for a vehicle inspection friday after the openstack release team meeting, but i should be back well before we're starting the rename maintenance | 16:15 |
clarkb | fungi: in that case feel free to carefully review and approve those changes I guess :) | 16:16 |
clarkb | gthiemonge: hrm we should probably inspect why that happened first | 16:16 |
clarkb | corvus: ^ fyi there are two changes in openstack's check pipeline that have gotten stuck. Likely due to lost node requests? I feel like that is what happened before. I'll start trying to find logs for them | 16:17 |
gthiemonge | octavia-v2-dsvm-scenario-centos-8-stream is still queued, and my patch updates this job | 16:17 |
fungi | clarkb: frickler noted some stuck changes earlier | 16:17 |
clarkb | fungi: I'm guessing it is these changes since they are old enough to have been seen back when frickler's work day was happening | 16:18 |
fungi | 800445,16 has an openstack-tox-py36 build queued for 50 hours and counting | 16:18 |
fungi | also there's a few periodic jobs waiting since days | 16:19 |
clarkb | 810f631ad5494c9ba7bc892d1c3f430f is the event associated with that enqueue I think | 16:21 |
clarkb | there are two other events for child changes | 16:21 |
clarkb | 2021-10-13 07:41:31,801 DEBUG zuul.Pipeline.openstack.check: [e: 810f631ad5494c9ba7bc892d1c3f430f] Adding node request <NodeRequest 300-0015752422 ['nested-virt-centos-8-stream']> for job octavia-v2-dsvm-scenario-centos-8-stream to item <QueueItem 7f41db5ea67847b887881460f0b7b2b5 for <Change 0x7f62645d9e80 openstack/octavia-tempest-plugin 698450,19> in check> <- is the last thing the | 16:22 |
clarkb | scheduler logs for that job on that event | 16:22 |
clarkb | now to hunt down that node requests | 16:22 |
clarkb | nested-virt-centos-8-stream <- the job uses a special node type... | 16:23 |
fungi | i notice all the long-queued builds in periodic are for fedora | 16:24 |
fungi | so we might be looking at multiple causes | 16:24 |
clarkb | ya gthiemonge's change is stuck because only ovh can build nested-virt-centos-8-stream and that request happened during ovh's outage. I think the reason we haven't node failured is we must've leaked launcher registrations in zookeeper again so nodepool thinks there are other providers still to decline that request | 16:27 |
clarkb | let me check on the registrations really quickly, but in gthiemonge's case I think the easiest thing is to abandon/restore and push a new patchset | 16:27 |
clarkb | or wait, I think if I restart the node with the registrations it will notice and node failure it | 16:28 |
fungi | we seem to be unable to launch fedora-34 nodes, but i have a feeling something similar befell the three periodic builds which wanted one | 16:28 |
clarkb | hrm I don't see any extra registrations | 16:29 |
fungi | similarly that openstack-tox-py36 job would have wanted an ubuntu-bionic node and we're probably booting fewer of them these days | 16:29 |
fungi | statistically speaking, as none of the 5 stuck examples use our most popular node label, it's possible they're all representatives of a similar problem | 16:30 |
*** jpena is now known as jpena|off | 16:30 | |
clarkb | the linaro provider has not declined that request yet | 16:32 |
clarkb | it cannot provide that node type so it should decline it. Now to look at why it hasn't yet | 16:33 |
clarkb | linaro reports not enough quota to satisfy the request which will cause it to enter pause mode and not decline requests. | 16:36 |
clarkb | I think that may be starving its ability to get through and decline requests it cannot satisfy at all due to being the wrong label type | 16:36 |
clarkb | there are leaked instances in that cloud which I am trying to clean up now. We'll see what happens | 16:37 |
clarkb | fungi: the tobiko changes have been pathological for a while. I suspect some weird configuration issue as they have a ton of errors iirc | 16:40 |
clarkb | fungi: neutron is probably related to whatever is causing fedora issues | 16:40 |
fungi | i looked at a few and it was suds-jurko failing to build | 16:40 |
clarkb | fungi: but that wouldn't cause them to be stuck in zuul? | 16:40 |
fungi | no, talking about your suggestion that the tobiko jobs have been pathologically going into retry_limit | 16:41 |
clarkb | gthiemonge: I think your best bet may be to push a new patchset or abandon and restore the current patchset. The issue is nodepool isn't declining the request because it can't get to those requests because a cloud is failing very early :/ | 16:41 |
clarkb | fungi: ah | 16:41 |
clarkb | I'm going to write an email to kevinz about cleaning up these leaked instances in the linaro cloud now | 16:42 |
fungi | the fedora-34 situation is a little odd too. there's one in airship-kna1 which has been in a "ready" state for more than 3 days | 16:44 |
fungi | it should have gotten assigned to a build long before now | 16:44 |
clarkb | linaro email sent | 16:45 |
clarkb | I think that we can set the linaro quota to 0 if this persists and we notice more stuck changes due to it | 16:46 |
fungi | i wonder if there's a good way to determine why nl02 hasn't assigned 0026879768 to a node request yet | 16:46 |
clarkb | fungi: that cloud is also probably near or at its quota most of the time so it has a hard time working through requests | 16:48 |
fungi | oh, maybe | 16:48 |
clarkb | nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 149.202.161.123 on port 22 <- seems to be the general issue here | 16:48 |
clarkb | with fedora-34 launches I mean | 16:48 |
clarkb | we'll probably need to launch one out of band and inspect it. /me starts trying to do that | 16:48 |
fungi | yeah, but 0026879768 has been in a ready state for days there | 16:49 |
fungi | and grepping the debug log, the last mention was when it came ready and was unlocked (2021-10-10 08:52:00,927) | 16:49 |
clarkb | fungi: yes, but if the airship cloud is perpetually paused it won't ever get a chance to scan all the requests and find the few fedora requests to assign that node | 16:50 |
clarkb | the process here is: when the provider is not doing an action it proceeds to grab the next request, locks it, checks quota, pauses if at quota, and once no longer at quota attempts to launch the node. | 16:51 |
clarkb | it can shortcut that if it already has the node, but that depends on it finding a random request for a fedora-34 node | 16:51 |
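As simplified pseudocode of the flow clarkb describes (illustrative only, not the actual nodepool launcher code), the starvation comes from the paused worker never walking the queue again, so it neither declines requests it can never serve nor hands out nodes it already has ready:

```python
# Illustrative pseudocode of the pool worker behaviour described above;
# the real nodepool launcher differs in detail.
def run_pool_worker(provider, request_queue):
    paused = False
    while True:
        if paused:
            # While paused the worker only waits for quota to free up; it
            # does not look at further requests, so requests it could just
            # decline (wrong label) or satisfy from a ready node sit idle.
            paused = not provider.quota_available()
            continue
        request = request_queue.next_unhandled()
        if request is None:
            continue
        if request.label not in provider.labels:
            request.decline(provider)          # can never satisfy this
        elif provider.has_ready_node(request.label):
            request.fulfill(provider.pop_ready_node(request.label))
        elif provider.quota_available():
            provider.launch_node_for(request)
        else:
            paused = True                      # at quota: stop processing
```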
gthiemonge | clarkb: ok thanks, i'll try | 16:51 |
fungi | got it, so the problem is that we want the pause to pause cloud provider api interactions, not just everything | 16:52 |
fungi | pausing declining nodes, or assigning nodes which are available and ready, results in a deadlock | 16:53 |
fungi | but i guess nodepool performs a server list, which is a cloud provider api call | 16:53 |
fungi | so is that where it's blocking? | 16:54 |
clarkb | it does an internal wait for a node to be deleted iirc as it knows there will be free quota after that | 16:54 |
opendevreview | Merged opendev/gerritlib master: Add python39 testing https://review.opendev.org/c/opendev/gerritlib/+/813710 | 16:54 |
clarkb | fungi: and ya it seems like we could have it continue to process and decline things it has no hope of ever fulfilling, as well as fulfilling things it already has resources allocated to like the fedora-34 node | 16:55 |
clarkb | clarkb-test-fedora-34 is booting in ovh bhs1. that is a region I noticed fedora-34 boot problems in | 16:56 |
clarkb | fungi: I think it is failing in rax too, otherwise I would blame potentially bad uploads due to ovh's network problems | 16:57 |
clarkb | the byte count for this image seems to match up what we have on nb02 at least | 16:58 |
clarkb | this is spicy the console log is just one giant kernel panic | 16:59 |
fungi | i'm looking at a build which succeeded on a fedora-34 node in airship-citycloud yesterday, so apparently we booted one there for it even though there was one ready for a couple days | 17:00 |
fungi | in the same provider | 17:00 |
clarkb | fungi: were they different pools? | 17:00 |
clarkb | anyway I think fedora-34 is completely hosed based on the console log on the test instance i booted in ovh | 17:00 |
fungi | provider: airship-kna1 | 17:00 |
clarkb | I'm going to try and boot in other clouds and see if we get different results | 17:01 |
clarkb | fungi: ya but that provider has multiple pools. I don't think it will share across pools | 17:01 |
fungi | so same pool | 17:01 |
clarkb | fungi: we can do multiple pools per provider now | 17:03 |
fungi | nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 logged for both nodes 0026879768 (the one that's been ready for days) and 0026901606 (the one booted yesterday for node request 300-0015737905) | 17:05 |
fungi | why would the pool worker build a new node for an incoming node request when it already had one with the same label available? | 17:06 |
clarkb | I do not know. That seems like a bug | 17:06 |
clarkb | the iweb boot doesn't seem to panic. That makes me think potentially corrupt image in ovh. However the iweb test node doesn't seem to allow me in via ssh either | 17:07 |
fungi | could be cpu flags? | 17:07 |
fungi | or something hypervisor-related? | 17:07 |
clarkb | fungi: well the image upload happened around ovh's crisis. I really suspect it is as simple as a bad image upload there | 17:09 |
clarkb | on the iweb side of things it appears that we only configure the lo interface with glean? | 17:09 |
clarkb | at least I don't see anything logged for other interfaces + glean in the console log | 17:09 |
clarkb | I think we should consider pausing fedora-34 image builds then delete today's build | 17:10 |
clarkb | and then hopefully those that understand fedora can look into why glean + network manager seem to be unhappy with it | 17:10 |
opendevreview | Merged zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 17:11 |
opendevreview | Clark Boylan proposed openstack/project-config master: Pause fedora-34 to debug network problems https://review.opendev.org/c/openstack/project-config/+/813876 | 17:12 |
clarkb | I'm going to boot a test on yesterday's image to see if it acts different | 17:12 |
fungi | i guess it's just provider launches we can pause from the cli? i always forget | 17:13 |
clarkb | I didn't realize there is a command line option; I'll check after I test yesterday's image | 17:14 |
fungi | i'm looking it up in the docs now | 17:14 |
clarkb | hrm yesterday's image may be no better | 17:15 |
clarkb | in which case we're in a more roll forward state | 17:15 |
fungi | https://zuul-ci.org/docs/nodepool/operation.html#command-line-tools "image-pause: pause an image" | 17:16 |
clarkb | ya I think rolling back to the previous image isn't going to help us | 17:16 |
opendevreview | Merged opendev/system-config master: Replace testing group vars with host vars for review02 https://review.opendev.org/c/opendev/system-config/+/813534 | 17:16 |
clarkb | fungi: https://review.opendev.org/c/zuul/zuul-jobs/+/813749 that merged which needed a fedora-34 node. So there must be working fedora-34 somewhere /me looks | 17:19 |
clarkb | ha that ran in airship. did it use your old node? | 17:20 |
clarkb | fedora-34-airship-kna1-0026921358 was the hostname | 17:20 |
clarkb | Failed to start Network Manager Wait Online then See 'systemctl status NetworkManager-wait-online.service' for details. on the iweb images | 17:21 |
clarkb | ovh kernel panics but could just be a bad image | 17:21 |
fungi | airship-kna1 for the check pipeline build too | 17:22 |
fungi | so it's like we're only getting fedora-34 nodes in airship-kna1, which also has a fedora-34 ready node it's been ignoring for days | 17:23 |
clarkb | maybe its image is just old | 17:23 |
fungi | i wonder if image uploads have been failing there and it's got an old... | 17:23 |
fungi | yeah | 17:23 |
clarkb | doesn't appear to be old | 17:24 |
clarkb | it could be luck that whatever is causing NM + glean to fail in iweb isn't an issue in airship | 17:24 |
clarkb | I am going to try and delete the image in ovh that is panicking to force a reupload | 17:24 |
clarkb | then maybe we'll see ovh do what iweb is doing or function like airship | 17:24 |
fungi | yeah, airship is currently using image 7900 uploaded 13.5 hours ago | 17:25 |
fungi | er, airship-kna1 is | 17:25 |
fungi | maybe network setup in citycloud is different than everywhere else we try to boot fedora-34? | 17:26 |
fungi | different virtual interface type which the f34 image's kernel is actually recognizing? | 17:27 |
clarkb | I think it uses dhcp like many clouds. rax is static | 17:27 |
clarkb | ya that might explain it | 17:28 |
clarkb | https://bodhi.fedoraproject.org/updates/FEDORA-2021-ffda3d6fa1 is a recent update and they already have https://bodhi.fedoraproject.org/updates/FEDORA-2021-385f3aebfd proposed too | 17:29 |
fungi | ens3 is the detected interface on the successful builds there | 17:29 |
fungi | also citycloud is using rfc-1918 addressing with floating-ip for global access | 17:30 |
clarkb | this is curious: booting the previous image in ovh produces no console log. But I also cannot ssh in | 17:31 |
fungi | and no global ipv6 | 17:31 |
clarkb | at least it isn't kernel panicking? | 17:31 |
clarkb | I'm going to clean up my test instances in ovh and iweb and boot some on rax and vexxhost and see if they are any different | 17:34 |
clarkb | vexxhost can boot the fedora-34 image successfully too, but we don't launch the fedora image there because we only do the special larger instances in vexxhost now | 17:45 |
opendevreview | Merged opendev/system-config master: Switch test gerrit hostname to review99.opendev.org https://review.opendev.org/c/opendev/system-config/+/813671 | 17:47 |
fungi | so something about the image works in vexxhost and citycloud, is unreachable in iweb, crashes during boot in ovh... | 17:49 |
clarkb | fungi: yes though the crashes during boot in ovh may be unrelated and a result of us trying to upload images there during their network crisis | 17:50 |
clarkb | rax test node also appears to be sad. This is noteworthy because rax uses static configuration and not dhcp. Implies the issue is independent of dhcp | 17:51 |
fungi | "sad" as in boots but is unreachable over the network, or crashing like in ovh? | 17:51 |
clarkb | unreachable over network. They don't support cli console logs so I didn't bother to check that | 17:52 |
clarkb | it is possible that it crashes but that requires me to dig out credentials and do more work, but I need to context switch to other stuff | 17:52 |
fungi | yeah, you can console url show and then stick that in a browser | 17:52 |
clarkb | oh neat | 17:52 |
fungi | shouldn't need credentials, the url is just meant to be unguessable | 17:52 |
clarkb | ok let me relaunch in rax and see what it says | 17:53 |
fungi | that's been my fallback on the providers who don't implement console log show | 17:53 |
fungi | annoying, but better than nothing | 17:53 |
clarkb | but I think we're fast approaching the bit where I say "people who care about this platform and understand it should really take a look" because I'm still arguing we should delete all fedora and use stream which seems to be a fair bit more stable for our purposes | 17:54 |
fungi | if memory serves, ianw did a fair bit of work on the f34 networking stuff, so may have a better idea of where we should be looking for root cause | 17:55 |
clarkb | ya my hunch is something to do with ordering of services. Like udev isn't finding the device properly before we run glean or similar | 17:55 |
clarkb | It wouldn't be so problematic if fedora didn't update so frequently with so many big explosions :) | 17:56 |
clarkb | basically the reason we don't do intermediate ubuntu releases | 17:56 |
clarkb | I'm going to abandon the f34 pause change since that won't help | 17:56 |
fungi | i suppose we could temporarily add f34 labels to vexxhost and remove them from everywhere else except citycloud | 17:57 |
clarkb | ya the risk there is the flavor there is huge so the fedora jobs might end up needing that memory | 17:58 |
clarkb | but if it is a short lived change the risk of that should be low | 17:58 |
fungi | also not sure if i should delete this f34 ready node in citycloud which nodepool seems to just be ignoring and wasting quota with | 17:58 |
clarkb | might be worth keeping around if anyone has time to dig into why the launcher isn't using it but instead booting new nodes | 17:58 |
fungi | yeah, also i have a feeling that if i do delete it, a new ready node will be booted and ignored instead | 17:59 |
clarkb | oh ya since it is the only cloud that can satisfy the min-ready of 1 for f34 right now | 17:59 |
clarkb | everything else will fail and eventually the airship cloud should get it | 18:00 |
clarkb | fungi: the rax boot enters an emergency shell and I can't seem to get any scrollback to understand why that happens better | 18:09 |
clarkb | "unknown key released" when I hit page up | 18:10 |
clarkb | certainly seems that something to do with fedora 34 is new or different and causing some clouds problems | 18:10 |
fungi | i wonder when this started | 18:11 |
clarkb | I'll try to emergency boot this and see if the initramfs sos report is present | 18:11 |
clarkb | (I kinda doubt it will be there because I don't think it is on persistent storage but doesn't hurt to check) | 18:12 |
fungi | yeah, if it didn't get far enough to pivot from the initramfs to the real rootfs | 18:14 |
clarkb | LE job failed for https://review.opendev.org/c/opendev/system-config/+/813534 so the jobs behind it didn't run | 18:17 |
*** ysandeep is now known as ysandeep|out | 18:19 | |
clarkb | fatal: unable to access 'https://github.com/Neilpang/acme.sh/': Failed to connect to github.com port 443: Connection timed out <- that repo redirects to https://github.com/acmesh-official/acme.sh now but is generally accessible. I guess this is just a random internet is sad occurrence | 18:19 |
clarkb | but this means that service-review didn't run after that change landed so not sure if we want to manually run it | 18:19 |
fungi | though we could stand to update the url anyway, i guess | 18:20 |
fungi | also retry downloads maybe | 18:20 |
clarkb | ++ | 18:21 |
clarkb | for that ord f34 instance I can't get to the rescue instance | 18:21 |
clarkb | does it rescue with its own image by default? | 18:21 |
fungi | yes | 18:21 |
clarkb | ugh | 18:22 |
fungi | that's probably not going to work | 18:22 |
clarkb | ya | 18:22 |
fungi | for... reasons which should previously have been obvious to me, sorry | 18:22 |
clarkb | I didn't expect rescuing to give us any new info anyway. I'll just unrescue and delete the instance | 18:22 |
clarkb | people who have had trouble booting f34 VMs have pointed to https://fedoraproject.org/wiki/Changes/UnifyGrubConfig on the internets. I'm booting a new instance in vexxhost so that I can check its grub configs | 18:27 |
clarkb | fungi: when the current set of deploy jobs finishes maybe we should run service-review manually to be sure there are no unexpected updates? | 18:28 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning https://review.opendev.org/c/opendev/system-config/+/813880 | 18:28 |
fungi | no idea if that's the way to do it, was reading random examples and trying to piece together from the docs | 18:28 |
clarkb | fungi: I left a comment on it, close, but not quite | 18:29 |
fungi | thanks | 18:30 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning https://review.opendev.org/c/opendev/system-config/+/813880 | 18:32 |
clarkb | [Wed Oct 13 18:29:05 2021] Unknown command line parameters: nofb BOOT_IMAGE=/boot/vmlinuz-5.14.10-200.fc34.x86_64 gfxpayload=text | 18:34 |
clarkb | the vexxhost node's dmesg reports that. I half wonder if it isn't able to find the kernel as a result on some system | 18:34 |
clarkb | though maybe not | 18:35 |
clarkb | since the kernel is already running at this point | 18:35 |
clarkb | and we're just telling the kernel about itself | 18:35 |
fungi | yeah, it's got to be finding the kernel if it's into the initrd | 18:36 |
clarkb | /boot/efi/EFI/fedora/ exists but is empty. We do all of our x86 images as grub images. I half wonder if the other clouds might be seeing the efi dir and attempting efi, failing due to the lack of an efi config and not falling back to grub? | 18:37 |
clarkb | I don't know how that all works with openstack, nova, kvm, and qemu | 18:37 |
clarkb | the grub config and fstab and the device label all lgtm | 18:42 |
clarkb | The actual grub menu entry uses the device uuid not its label, but both the label in the grub /etc/default/grub config and the uuid in the /boot/grub2/grub.cfg menu entry match /dev/vda1 | 18:43 |
clarkb | I don't think vexxhosts's kvm had to do any magic to properly boot this | 18:43 |
clarkb | ok I really need to page out the f34 stuff. I'm going to delete the vexxhost test node as it didn't show me anything super useful other than "it should work". But then i need to do lunch and then we should manually run service-review.yaml, check that didn't make any unexpected changes to gerrit. Then test node exporter on trusty. Then prep stuff for the project renaming | 18:47 |
fungi | looking into the gitea metadata automation, it's the gitea_create_repos.Gitea.update_gitea_project_settings() method we want to call, and that already takes a project as a posarg, we're calling it from a loop in the make_projects() method | 18:48 |
clarkb | fungi: iirc there is a force flag | 18:48 |
clarkb | and if the project is new or the force flag is set then the metadata is updated | 18:48 |
fungi | though looking closer at how we call it from ansible, it may be simpler to add a project filter as a library argument and filter it that way | 18:48 |
clarkb | maybe we can make the force flag a list of names to force? | 18:49 |
clarkb | if force is not empty then if project in force type deal | 18:49 |
fungi | yeah, we already parameterize that | 18:49 |
fungi | always_update: "{{ gitea_always_update }}" | 18:49 |
fungi | right now we just set it to true or let it default to false | 18:50 |
fungi | but we could overload it as a trinary? | 18:50 |
fungi | or make it a regex | 18:50 |
clarkb | well you could set it to structured data like a list | 18:50 |
clarkb | false/[] don't force update or [ foo/bar, bar/foo] force update those projects | 18:51 |
fungi | though if we also still want a way to be able to force it to do all projects, we'd have to list thousands | 19:00 |
fungi | but yeah, a trinary of falsey/[some,list]/true could fit and remain backward-compatible | 19:01 |
fungi | oh, though ansible seems to like these arguments to be declared with only one type | 19:13 |
fungi | oh, maybe not declaring a type in the AnsibleModule argument_spec is fine | 19:14 |
fungi | we don't seem to declare it for all of them | 19:15 |
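A minimal sketch of the falsey/list/true handling being discussed (helper name made up here, not the actual gitea_create_repos module; leaving the type off the AnsibleModule argument_spec, as noted above, would let either a boolean or a list through):

```python
# Sketch only: decide whether to force a gitea metadata update for one
# project when always_update may be a bool or a list of project names.
def should_force_update(project, always_update):
    if always_update is True:
        return True                      # old behaviour: update everything
    if not always_update:                # False, None, or an empty list
        return False
    return project in always_update      # e.g. ["openstack/foo", "x/bar"]


assert should_force_update("openstack/foo", True)
assert not should_force_update("openstack/foo", False)
assert should_force_update("openstack/foo", ["openstack/foo"])
assert not should_force_update("openstack/bar", ["openstack/foo"])
```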
clarkb | I've quickly consumed some food. fungi I'll start a root screen on bridge and run the service-review.yaml playbook? | 19:18 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list https://review.opendev.org/c/opendev/system-config/+/813886 | 19:19 |
fungi | clarkb: sounds good, thanks | 19:19 |
fungi | also there's a start on the metadata update project filtering, though i haven't touched the testing yet | 19:19 |
clarkb | alright starting that playbook now | 19:20 |
fungi | i'm attached to the root screen | 19:20 |
clarkb | review02.opendev.org : ok=66 changed=0 unreachable=0 failed=0 skipped=9 rescued=0 ignored=0 | 19:23 |
fungi | looks good | 19:23 |
clarkb | it looked as we hoped, no changes | 19:23 |
clarkb | yup, I think we're good, the var movement didn't cause any unexpected updates | 19:23 |
clarkb | I'll go ahead and exit the screen? | 19:23 |
fungi | yep, go ahead | 19:23 |
clarkb | cool, I think we should still do a restart because we haven't done one since the quoting changes happened | 19:24 |
clarkb | but this is good news on its own | 19:24 |
fungi | i should be around for a gerrit restart later today if you want | 19:25 |
clarkb | ok, lets see where the day continues to go :) I am still planning to get the rename input file pushed and review the related changes and start an etherpad | 19:26 |
clarkb | oh and I wanted to test node exporter on wiki. | 19:26 |
clarkb | fungi: for ^ it is basically `wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz` then confirm the sha256, then extract and run the binary to see that it starts and doesn't crash | 19:27 |
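For the checksum step, a small helper along these lines could confirm the tarball before it gets run (the expected digest below is a placeholder, not the real node_exporter 1.2.2 value; the published digest comes from the release's sha256sums.txt):

```python
import hashlib

# Compare a downloaded release tarball against its published SHA256 before
# extracting and running the binary. EXPECTED is a placeholder here.
EXPECTED = "replace-with-the-published-sha256"
PATH = "node_exporter-1.2.2.linux-amd64.tar.gz"

sha = hashlib.sha256()
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

if sha.hexdigest() != EXPECTED:
    raise SystemExit("checksum mismatch, refusing to run this binary")
print("checksum ok")
```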
clarkb | fungi: any objections to me doing that on wiki now? | 19:27 |
clarkb | it all runs as my own user | 19:27 |
clarkb | I went ahead and fetched it, checked the hash and extracted it since that is all pretty safe | 19:30 |
clarkb | fungi: ^ I await your ACK before running the binary out of an abundance of caution | 19:31 |
clarkb | fungi: re testing of the gitea project stuff you should be able to hack the existing system-config-run-gitea job for that since it creates projects and then does another pass of them to ensure it noops (but in this case we could hack it to force updates for some projects) | 19:33 |
fungi | clarkb: no objections | 19:34 |
clarkb | cool it ran successfully. I ran it in the foreground and killed it. But if you want to double check it didn't fork into a daemon it was listening on port 9100 and process was called node_exporter | 19:37 |
clarkb | I can mark that done and we should be good to land that spec tomorrow | 19:37 |
fungi | yep, lgtm. nothing listening on 9100 (though it does have listeners on 9200 and 9300/tcp on the loopback) | 19:51 |
clarkb | those ports are for ES that is used to do text search | 19:54 |
clarkb | they are expected iirc | 19:54 |
fungi | yep | 19:59 |
ianw | clarkb: sorry, reading scrollback now | 20:38 |
clarkb | ianw: I don't think it is super urgent but its super weird and going to be a pain to resolve I bet :/ | 20:39 |
ianw | the kernel starting and things going blank could very well be a sign that root=/dev/... is missing, i have seen that before | 20:39 |
ianw | that said, i think it is passing in the devstack boot tests ... it should hit there too if that's it | 20:40 |
clarkb | well it works in citycloud and vexxhost | 20:40 |
clarkb | which is why I suspect this is an odd one | 20:40 |
ianw | hrm, yeah i have no immediate thoughts :/ | 20:44 |
ianw | bib just have to sort out some things | 20:45 |
opendevreview | Andrii Ostapenko proposed zuul/zuul-jobs master: Add retries for docker image upload to buildset registry https://review.opendev.org/c/zuul/zuul-jobs/+/813894 | 20:49 |
clarkb | do we know ^'s IRC nick? | 21:44 |
clarkb | similar to goneri's related update we should make sure there aren't problems with the registry or local networking since that upload should always be local to the cloud | 21:45 |
clarkb | Its a huge warning flag to me that people are retrying those requests and points to an underlying issue that we should probably fix instead | 21:45 |
clarkb | fungi: the openinfra renames are renaming projects like openstackid which need to be retired. I guess we retire them in the new name location? | 21:48 |
fungi | there's a foundation profile associated with the gerrit account e-mail, but it doesn't have any irc field filled in | 21:48 |
fungi | clarkb: yeah, retiring them in the new location is fine, also gets rid of the old namespace that way | 21:49 |
clarkb | well the old namespace will stick around for redirects but it empties it | 21:49 |
fungi | right, that | 21:50 |
fungi | the project list will no longer include the old namespace | 21:50 |
fungi | does https://zuul.opendev.org/t/openstack/build/a758b4b433b7433aa3574ebbd3d77c21 look to anyone else like our conftest is bitrotten? | 21:55 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list https://review.opendev.org/c/opendev/system-config/+/813886 | 21:58 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py https://review.opendev.org/c/opendev/system-config/+/813900 | 21:58 |
opendevreview | Clark Boylan proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787 | 21:58 |
clarkb | Pushed that to resolve a conflict between two of the renaming changes | 21:59 |
fungi | thanks | 21:59 |
clarkb | fungi: looks like pyyaml updated and we need to update to match? | 22:00 |
clarkb | safe flag? | 22:00 |
opendevreview | Clark Boylan proposed opendev/project-config master: Record renames being performed on October 15, 2021 https://review.opendev.org/c/opendev/project-config/+/813902 | 22:03 |
clarkb | and there is our input file and recording of the changes | 22:03 |
fungi | clarkb: yeah, for now i updated the remaining call in that script to match the other one which was already using safe_load | 22:07 |
fungi | but there are probably a bunch more which need changing | 22:07 |
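For context on the safe_load change (a generic illustration, not the conftest.py diff itself): newer PyYAML requires an explicit Loader for yaml.load(), and safe_load only constructs plain Python types rather than arbitrary objects:

```python
import yaml

doc = "run_all: true\nhosts:\n  - review02.opendev.org\n"

# Preferred form: builds only plain Python types (dicts, lists, strings).
data = yaml.safe_load(doc)

# Equivalent explicit form when a Loader argument is needed anyway.
same = yaml.load(doc, Loader=yaml.SafeLoader)

assert data == same == {"run_all": True, "hosts": ["review02.opendev.org"]}
```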
corvus | i'd like to restart zuul now. any objections? | 22:17 |
clarkb | corvus: looking | 22:20 |
fungi | should we time the gerrit restart to coincide? | 22:20 |
clarkb | fungi: if you'd like. I don't see any release activity but will warn the release team. The tripleo team may appreciate waiting that 14 minutes to see if those changes at the top of their queue end up merging | 22:21 |
clarkb | I've warned the release team | 22:21 |
corvus | i can afk for 20 minutes and try again if you like | 22:21 |
fungi | i can handle the gerrit restart in the middle of the zuul down/up | 22:22 |
clarkb | corvus: considering how long their changes can take that might be a good thing | 22:22 |
clarkb | just to avoid another set of 4 hour round trips for each of them | 22:22 |
clarkb | infra-root I've put https://etherpad.opendev.org/p/project-renames-2021-10-15 together for the rename on Friday | 22:22 |
corvus | clarkb: okay. my own thought is that the last time we waited 5 minutes it took an hour. there's no good time and therefore no bad time to restart. | 22:23 |
clarkb | corvus: fair enough, I'm happy to proceed now too | 22:23 |
corvus | but it's fine. i have something else to do that takes 20m so it's no big deal to me. | 22:23 |
clarkb | I'll let you decide if you'd rather do it now or in 20 minutes :) | 22:23 |
clarkb | I'll be around for both | 22:23 |
corvus | let's come back in 20m. (mostly just don't want to establish too much of a precedent :) | 22:24 |
clarkb | fungi: re gerrit restart the big thing it will be checking is the gerrit.config quoting updates | 22:24 |
fungi | yep | 22:25 |
fungi | if i need to hand edit the config to get it to restart, i can do that really quickly too | 22:25 |
clarkb | fungi: /home/gerrit2/tmp/clarkb/gerrit.config.20211013.pre-group-mangling <- is a copy I made of that file on review02 earlier today when those other changes were merging | 22:25 |
clarkb | we can use that to compare delta post restart | 22:25 |
fungi | status notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:40 |
fungi | that work? | 22:40 |
clarkb | lgtm | 22:40 |
fungi | i have a root screen session started on the gerrit server in case we need to coordinate anything there, and the docker-compose down command is queued | 22:41 |
clarkb | I'll join it | 22:41 |
fungi | that tripleo job has been uploading logs for almost 5 minutes, so should end any time hopefully | 22:43 |
clarkb | but also we gave it a chance we can proceed when ready I think | 22:44 |
fungi | yep | 22:44 |
clarkb | corvus: you'll do a stop, then we can restart gerrit, then a start? | 22:44 |
fungi | that's how we did it last time, at least | 22:45 |
fungi | oh, the test job wrapped up, now the paused registry job is closing out | 22:47 |
corvus | sounds good | 22:47 |
corvus | are we waiting still, or calling it good enough? | 22:48 |
clarkb | I'm happy calling it good enough. We gave it a real chance. | 22:48 |
fungi | good enough | 22:48 |
fungi | #status notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:49 |
opendevstatus | fungi: sending notice | 22:49 |
corvus | okay. i'm re-pulling images to get ianw's 400 change | 22:49 |
-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:49 | |
corvus | will take just a few seconds longer | 22:49 |
corvus | stopping zuul | 22:50 |
corvus | fungi: you can proceed with gerrit restart | 22:50 |
fungi | downing gerrit | 22:50 |
fungi | upping | 22:51 |
corvus | waiting for signal from fungi to start zuul | 22:51 |
fungi | webui is loading for me | 22:51 |
clarkb | yup loads for me too and the config diff is empty | 22:52 |
fungi | [2021-10-13T22:51:33.954Z] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.3.6-44-g48c065f8b3-dirty ready | 22:52 |
fungi | corvus: all clear | 22:52 |
clarkb | ++ | 22:52 |
corvus | starting zuul | 22:53 |
clarkb | fungi: I detached from the screen. feel free to close it whenever you like | 22:53 |
fungi | thanks, don | 22:53 |
fungi | e | 22:53 |
clarkb | its ok you can call me Don | 22:54 |
corvus | just don call me shirley | 22:54 |
fungi | i picked the wrong day to stop sniffing glue | 22:55 |
clarkb | what continues to try and get a kata-containers tenant? we removed it right? Maybe the cronjobs to dump queues? | 22:58 |
clarkb | Thats a not today question I think | 22:58 |
opendevreview | Ian Wienand proposed opendev/system-config master: ptgbot: have apache cache backend https://review.opendev.org/c/opendev/system-config/+/813910 | 23:01 |
ianw | fungi: ^ i'd probably consider you domain expert in that -- i'd not really intended the little static server to be demand-facing, so having apache cache it would be good for reliability i think | 23:01 |
fungi | oh, yep | 23:03 |
corvus | re-enqueing | 23:05 |
corvus | #status log restarted all of zuul on commit 3066cbf9d60749ff74c1b1519e464f31f2132114 | 23:05 |
opendevstatus | corvus: finished logging | 23:05 |
clarkb | and in an hour we should see the znode count fall again? | 23:07 |
clarkb | I think we expect it in the 80-90k range? | 23:07 |
corvus | yeah. it's hard for me to say if 110 might be okay though -- so i probably wouldn't assume we have a leak until it's over 120k sustained. | 23:10 |
clarkb | corvus: a zuul/zuul change showed up in the openstack tenant release-approval pipeline briefly. I'm surprised we evaluate zuul changes in openstack at all? | 23:11 |
corvus | it's in the projects list | 23:12 |
clarkb | huh I didn't expect that but that is expected behavior then | 23:12 |
clarkb | oh i bet it is there for the zuul_return testing in system-config/project-config/etc? | 23:12 |
clarkb | we might be able to clean that up now as zuul_return has a mock or something now iirc | 23:12 |
corvus | i think it may have been to try to trigger opendev deployments on zuul changes or something. | 23:13 |
corvus | not sure if currently used | 23:13 |
corvus | but it looks like jobs are loaded too, so may be some job inheritance going on | 23:13 |
corvus | re-enqueue complete | 23:15 |
clarkb | fungi: I approved the safe_load fix | 23:18 |
clarkb | ianw: if we ignore the kernel panic in ovh because maybe that was due to their outage coinciding with our upload we're left with two failure modes. The rax emergency initramfs shell and the iweb failure to get network | 23:24 |
clarkb | It might be easier to debug the iweb failure case first? like maybe do a build that hard codes dhcp without glean or something and see if that boots and work back from that? | 23:24 |
clarkb | and maybe if we get lucky fixing that will give us clues to fixing the rax problem | 23:25 |
ianw | clarkb: yeah, i think all will be revealed if we can get a serial output | 23:28 |
clarkb | fungi: re the gitea metadata. I'm thinking we can just do a copy of playbooks/sync-gitea-projects.yaml but then replace the gitea_always_update var with our list and then be good? You should be able to test this by calling that copy of sync-gitea-projects.yaml in the system-config-run-gitea job | 23:30 |
clarkb | that job runs playbooks/test-gitea.yaml <- should be easy to run the playbook from there? | 23:30 |
clarkb | note the import_playbook in test-gitea.yaml you should be able to run sync-gitea-projects that way | 23:31 |
fungi | yup, will give it a shot tomorrow between meetings | 23:33 |
ianw | clarkb: sorry my attention is slightly divided, i'm just trying to see if we can get these 9-stream image-based builds testing in 806819 | 23:35 |
clarkb | ianw: ya no worries, I don't think this is urgent yet. Worst case we can add f34 to vexxhost like fungi suggested and that will give us enough capacity for the label to limp along while we debug further | 23:36 |
clarkb | ianw: https://zuul.opendev.org/t/openstack/build/e336bb93987042a18a5acc44fb818b1e/log/logs/centos_8-build-succeeds.FAIL.log#933-935 that seems odd considering the other centos builds succeeded in that job. Has epel already started removing centos 8 stuff? | 23:41 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing centos 8 image builds https://review.opendev.org/c/openstack/diskimage-builder/+/813912 | 23:43 |
ianw | clarkb: ^ i hope to find out :) | 23:43 |
clarkb | ha ok | 23:44 |
ianw | i don't like these image-based jobs but clearly people still use them | 23:45 |
clarkb | ianw: ya I'll admit I didn't even consider that that might be what people were trying to do there | 23:45 |
clarkb | the minimal builds are far more reliable because you don't have the upstream image changing daily on you in the case of ubuntu for example | 23:45 |
opendevreview | Merged opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py https://review.opendev.org/c/opendev/system-config/+/813900 | 23:46 |
clarkb | big znode drop from ~140k to 108k. corvus' estimate of ~110k may have been spot on | 23:53 |
corvus | 90 may be idle and 110 may be busy-ish ? | 23:56 |