*** mazzy509881 is now known as mazzy50988 | 00:10 | |
*** mazzy509889 is now known as mazzy50988 | 00:34 | |
corvus | last zuul bugfix change merged; waiting for promote | 01:39 |
corvus | once it does promote, i'll pull the images, then i'll restart zuul01 with the new image, wait for it to settle, then do the same for zuul02 | 01:40 |
corvus | promote is done, pulling images | 01:41 |
corvus | stopping zuul01 | 01:43 |
corvus | we have some work to do on the shutdown sequence, but it's stopped. | 01:45 |
corvus | starting 01 again | 01:45 |
corvus | https://zuul.opendev.org/components is now interesting; you can see the new version # on zuul01 | 01:45 |
corvus | as zuul01 is coming online expect occasional status page errors; that's harmless | 01:46 |
fungi | what needs to be adjusted for shutdown now? i guess to do the rolling scheduler restarts? | 01:48 |
corvus | it looks like we don't wait to finish the run handler before we disconnect from zk | 01:49 |
fungi | ahh, that could be messy i guess | 01:49 |
corvus | (sorry, to be clear, i mean the scheduler program itself, not opendev ops) | 01:49 |
fungi | makes sense, thanks | 01:50 |
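A minimal sketch of the shutdown ordering corvus describes above (not Zuul's actual code; the class and method names are invented for illustration): the run-handler thread is joined before the ZooKeeper session is closed, so in-flight processing never races against the disconnect.

```python
import threading

from kazoo.client import KazooClient


class SchedulerSketch:
    """Toy scheduler that joins its run handler before dropping the ZK session."""

    def __init__(self, zk_hosts):
        self.zk = KazooClient(hosts=zk_hosts)
        self._stop_event = threading.Event()
        self._run_thread = threading.Thread(target=self._run, name="run-handler")

    def _run(self):
        while not self._stop_event.is_set():
            # Placeholder for real pipeline/event processing.
            self._stop_event.wait(1)

    def start(self):
        self.zk.start()
        self._run_thread.start()

    def stop(self):
        self._stop_event.set()
        self._run_thread.join()  # let the run handler finish first...
        self.zk.stop()           # ...then disconnect from ZooKeeper
```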
corvus | i just realized one of the changes was not backwards compat, so i'm going to go ahead and shut down zuul02 | 01:51 |
corvus | otherwise it's going to throw more and more errors | 01:51 |
fungi | too bad | 01:52 |
fungi | we'll get a real rolling restart soon, i'm sure | 01:52 |
corvus | i'm optimistic this can still be a rolling restart, just with a brief pause in processing | 01:52 |
fungi | oh, right the state is still persisted | 01:53 |
corvus | and zuul01 is already processing some tenants | 01:53 |
corvus | they just happen to be the empty ones | 01:54 |
corvus | okay that did not work | 01:58 |
corvus | i'm going to shut it down, clear zk state, and restore queues from backup | 01:59 |
fungi | thanks! | 01:59 |
corvus | (i suspect that the non-backwards-compat change tanked it) | 01:59 |
fungi | i can see how something like that could lead to an unrecoverable/corrupted state | 02:00 |
corvus | starting up now | 02:03 |
corvus | this'll be one of the longer startup times since the zk state is empty | 02:05 |
corvus | re-enqueueing | 02:27 |
corvus | re-enqueue done | 02:39 |
corvus | i'm going to leave zuul01 off for now and start it up tomorrow morning | 02:40 |
fungi | sounds good, thanks again! | 02:53 |
*** ysandeep|out is now known as ysandeep | 04:37 | |
*** ysandeep is now known as ysandeep|brb | 04:46 | |
*** ykarel|away is now known as ykarel | 04:54 | |
opendevreview | chandan kumar proposed opendev/system-config master: Enable mirroring of centos stream 9 contents https://review.opendev.org/c/opendev/system-config/+/817136 | 05:18 |
*** ysandeep|brb is now known as ysandeep | 05:35 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: containerfile: handle errors better https://review.opendev.org/c/openstack/diskimage-builder/+/817139 | 06:24 |
ianw | there is still something wrong with the containerfile build. but hopefully this stops us getting into a broken state | 06:29 |
*** ysandeep is now known as ysandeep|lunch | 07:40 | |
*** sshnaidm is now known as sshnaidm|afk | 07:45 | |
*** ykarel is now known as ykarel|lunch | 08:06 | |
*** pojadhav- is now known as pojadhav | 08:49 | |
*** ysandeep|lunch is now known as ysandeep | 09:06 | |
*** gthiemon1e is now known as gthiemonge | 09:14 | |
*** sshnaidm|afk is now known as sshnaidm | 09:40 | |
*** ykarel|lunch is now known as ykarel | 09:54 | |
*** dviroel|out is now known as dviroel | 11:07 | |
*** ysandeep is now known as ysandeep|afk | 11:12 | |
zbr | is anyone aware of storyboard issues? i wanted to unsubscribe myself from its notifications and I discovered that it does not allow logins anymore. https://sbarnea.com/ss/Screen-Shot-2021-11-09-11-17-31.25.png --- that is what happens AFTER pressing the "login with LP" button. | 11:17 |
*** pojadhav is now known as pojadhav|afk | 11:40 | |
*** ysandeep|afk is now known as ysandeep | 12:05 | |
*** pojadhav- is now known as pojadhav | 14:24 | |
*** ysandeep is now known as ysandeep|dinner | 14:27 | |
*** pojadhav- is now known as pojadhav | 14:39 | |
opendevreview | Martin Kopec proposed opendev/irc-meetings master: Update QA meeting info https://review.opendev.org/c/opendev/irc-meetings/+/817224 | 14:48 |
opendevreview | Martin Kopec proposed opendev/irc-meetings master: Update Interop meeting details https://review.opendev.org/c/opendev/irc-meetings/+/817225 | 14:49 |
*** ykarel_ is now known as ykarel|away | 14:57 | |
clarkb | zbr: I am unable to reproduce but can't dig in further than that for now as I need to do a school run and eat breakfast | 15:14 |
corvus | i'm going to start zuul01 | 15:19 |
corvus | status page blips may occur | 15:19 |
*** ysandeep|dinner is now known as ysandeep | 15:22 | |
fungi | zbr: i'm also able to log into storyboard just fine, and have additionally checked that your account is currently set to enabled | 15:25 |
fungi | if you'd like me to unset your e-mail address from it, i'm happy to do that | 15:25 |
corvus | zuul01 is running | 15:39 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: Create repo for ProxySQL Ansible role https://review.opendev.org/c/openstack/project-config/+/817271 | 16:06 |
opendevreview | Dmitriy Rabotyagov proposed openstack/project-config master: And ansible-role-proxysql repo to zuul jobs https://review.opendev.org/c/openstack/project-config/+/817272 | 16:08 |
zbr | clarkb: fungi no need to do anything. I was able to log in successfully and I think I found the bug in storyboard. The original notification was about a new weird story 1111111 being created this morning, with a link to it: https://storyboard.openstack.org/#!/story/2009670 -- when using this url and trying to log in I got logged out automatically. When logging in from the homepage it did work. I guess that there is something fishy about this story, as it seems it was | 16:17 |
zbr | removed. Still, this should never trigger a user logout. | 16:17 |
zbr | if i had known about this bug i would not have bothered to mention it here, but i only identified the root cause a few minutes ago. | 16:17 |
fungi | interesting, thanks for the heads up about the weird story, i'll see if i can tell where that came from | 16:17 |
fungi | zbr: and i think the webclient problem you observed was reported as https://storyboard.openstack.org/#!/story/2008184 | 16:20 |
fungi | seems to have to do with the openid redirect/return sending you back to an error page | 16:21 |
*** ysandeep is now known as ysandeep|out | 16:21 | |
clarkb | There are no changes in openstack's zuul tenant more than a few hours old, implying we don't have any problems with things getting stuck again | 16:27 |
clarkb | corvus: is it safe to restart zuul-web (I expect it is) in order to get the card sizing fix deployed | 16:56 |
clarkb | I guess we'll probably do a restart for the pipeline path change and that would catch it too. That is probably good enough | 17:00 |
corvus | clarkb: yes to both :) | 17:01 |
*** marios is now known as marios|out | 17:02 | |
clarkb | corvus: one thing I've noticed is that sometimes the status page doesn't give me an estimated completion time. I know this can be related to a lack of data in the database, but I would expect either scheduler to return the data from the database and not be an issue in a multi scheduler setup | 17:31 |
clarkb | just calling that out in case there is potential for a bug here | 17:31 |
clarkb | but for example change 817108,1 is running tripleo jobs I'm fairly certain we've run enough times previously to have runtime data in the db for | 17:32 |
clarkb | but 3 of the 4 running jobs there don't give the hover over tooltip | 17:32 |
clarkb | er 4 out of 5 | 17:32 |
corvus | hrm i'll take a look | 17:33 |
corvus | clarkb: i think i see the issue; more in #zuul | 17:54 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Remove growroot logs dumping in base-test https://review.opendev.org/c/opendev/base-jobs/+/817289 | 18:08 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Remove growroot log dumping from the base job https://review.opendev.org/c/opendev/base-jobs/+/817290 | 18:08 |
clarkb | neither of ^ is particularly urgent but I noticed we were still dumping growroot logs when I was looking at a job today and thought we don't need that anymore and it can be distracting when skimming logs | 18:09 |
opendevreview | Clint Byrum proposed zuul/zuul-jobs master: Remove google_sudoers in revoke-sudo https://review.opendev.org/c/zuul/zuul-jobs/+/817291 | 18:10 |
fungi | zbr: that story you got notified about looks like it was a user testing out story creation and picking a couple of projects at random, then they set it to private once they realized they couldn't delete it (so i deleted it as an admin just now). thanks for bringing it to my attention | 18:27 |
rosmaita | when someone has a few minutes ... zuul is giving me "Unknown configuration error" on this small change, and I can't figure out what I'm doing wrong: https://review.opendev.org/c/openstack/os-brick/+/817111 | 18:34 |
clarkb | rosmaita: we think that was a bug in zuul that has since been corrected (as of like 02:00 UTC today or so) | 18:35 |
rosmaita | \o/ | 18:35 |
clarkb | rosmaita: if you try to recheck it should hopefully either run or give you back a proper error message | 18:35 |
rosmaita | excellent, ty | 18:35 |
clarkb | rosmaita: looks like it queued up jobs | 18:39 |
rosmaita | great, thanks! | 18:40 |
clarkb | if anyone wants details there was an issue trying to serialize too much data into individual zookeeper znode entries when processing zuul configs. When that happened zuul got a database error and that bubbled up to the user as an unknown error. corvus fixed that by sharding data serialized for configs | 18:40 |
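A rough sketch of the sharding idea described above, assuming a kazoo client; this is not the actual Zuul change, and the path layout and shard size are invented for illustration. The payload is split across numbered child znodes on write and reassembled in order on read, keeping each individual znode under ZooKeeper's size limit.

```python
import json

from kazoo.client import KazooClient

SHARD_SIZE = 512 * 1024  # hypothetical shard size, comfortably under the ~1MB znode limit


def write_sharded(zk: KazooClient, base_path: str, data: dict):
    """Split a JSON-serializable payload across numbered child znodes."""
    blob = json.dumps(data).encode("utf-8")
    zk.ensure_path(base_path)
    for offset in range(0, len(blob), SHARD_SIZE):
        # Each shard becomes a zero-padded, numbered child znode.
        zk.create("%s/%010d" % (base_path, offset // SHARD_SIZE),
                  blob[offset:offset + SHARD_SIZE])


def read_sharded(zk: KazooClient, base_path: str) -> dict:
    """Reassemble the shards in order and decode the original payload."""
    children = sorted(zk.get_children(base_path))
    blob = b"".join(zk.get("%s/%s" % (base_path, child))[0] for child in children)
    return json.loads(blob.decode("utf-8"))


# Usage sketch: zk = KazooClient(hosts="zk01:2181"); zk.start()
# write_sharded(zk, "/zuul/config/example", {"example": "payload"})
```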
clarkb | fungi: re the @ username thing we can probably grep external ids for that somehow if we want to audit on our end | 18:50 |
fungi | yeah, i just didn't have time to, and if it's important to the user they can check it | 18:51 |
fungi | i'm finding so many stories which people have simply set to private rather than marking them invalid | 18:53 |
clarkb | That might explain why it is common to only allow the private -> public transition and not the other way around? | 18:54 |
fungi | though also i'm finding a bunch which people opened incorrectly as private when they're just normal bug reports | 18:55 |
clarkb | corvus: really quickly before I've got to run the opendev meeting. elodilles notes that they had some failed openstack release jobs because zuul.tag wasn't set; https://zuul.opendev.org/t/openstack/build/6708011371124d1e92a43a2702343ba2/log/zuul-info/inventory.yaml#47 is an example of that | 18:59 |
clarkb | the job appears to have been triggered by refs/tags/foo | 18:59 |
clarkb | I wouldn't have expected this to be a multi scheduler issue but maybe we aren't serializing that info properly? eg is tag missing in model.py somewhere? | 18:59 |
johnsom | Has the zuul fix been deployed for the openstack instance? We are still seeing invalid configuration errors as of 10:44am (pacific) | 18:59 |
clarkb | (I'd look myself but I really need to do the meeting now) | 19:00 |
clarkb | johnsom: yes it was restarted yesterday evening pacific time with the expected fix for the "Unknown config error" thing | 19:00 |
johnsom | https://review.opendev.org/c/openstack/designate/+/786506 | 19:00 |
clarkb | johnsom: rosmaita just rechecked a change in that situation and it successfully enqueued | 19:00 |
johnsom | Yeah, I'm going to re-spin this patch anyway, but was surprised to see the same error still based on the scroll back. | 19:01 |
clarkb | it's possible there are multiple underlying issues and we only fixed one of them | 19:02 |
clarkb | your error is different fwiw | 19:02 |
johnsom | We saw this yesterday too. A parent patch had the config message, then child had this one. | 19:03 |
clarkb | In this case the parent seems to have always been able to run jobs | 19:04 |
corvus | johnsom: i don't see an issue with that change now? | 19:12 |
jrosser_ | i'm seeing this: Nodeset ubuntu-bionic-2-node already defined <snip...> in "openstack/openstack-zuul-jobs/zuul.d/nodesets.yaml@master", line 2, column 3 | 19:13 |
jrosser_ | on here https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/817219 | 19:13 |
johnsom | corvus, yeah, 20 minutes later, I pushed a change to it, this time it started | 19:13 |
artom | Same here, here's the review: https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/815557/7 | 19:45 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: test reviewed flag add/delete/add cycle https://review.opendev.org/c/opendev/system-config/+/817301 | 19:51 |
clarkb | https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/nodesets.yaml#L2-L12 is where we define ubuntu-bionic-2-node. I wonder if we're deserializing in such a way that we're overlapping the contents of that file somehow | 19:55 |
opendevreview | Merged opendev/system-config master: Retry acme.sh cloning https://review.opendev.org/c/opendev/system-config/+/813880 | 20:22 |
opendevreview | Merged opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin https://review.opendev.org/c/opendev/system-config/+/810013 | 20:36 |
corvus | clarkb: i'm not following the conversation very well; where do i start looking for the nodeset issue? | 20:45 |
clarkb | corvus: https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/817219 there and alternatively https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/815557/7 | 20:45 |
clarkb | they both have errors from zuul complaining that the ubuntu-bionic-2-node nodeset is already defined | 20:46 |
clarkb | neither change seems to directly add a nodeset with a conflicting name, which is what the error would typically indicate | 20:46 |
corvus | thanks | 20:47 |
clarkb | looks like the second change was rechecked and successfully enqueued | 20:47 |
clarkb | so this isn't a consistent failure. That is good for artom but less good for debugging it :) | 20:48 |
sshnaidm | is there a problem with the zuul.tag variable? I don't see it being passed anymore in "tag" pipeline jobs: https://92c4e0b4c5172d297c6a-24da3718548989a700aa54b9db0ff79c.ssl.cf1.rackcdn.com/612c3400c17c7a9d8b0d230383858331cb0cd653/tag/ansible-collections-openstack-release/7bbc0ba/zuul-info/inventory.yaml | 20:48 |
clarkb | sshnaidm: yes, fix is in progress at https://review.opendev.org/c/zuul/zuul/+/817298 and will need a zuul restart to take effect | 20:48 |
sshnaidm | clarkb, oh, thanks! | 20:48 |
clarkb | this was discussed in #openstack-release | 20:49 |
*** dviroel is now known as dviroel|out | 20:51 | |
clarkb | corvus: one thing I notice is that that nodeset is the very first thing defined in that config file. Which means it would be the very first error seen if we had somehow duplicated the file in a config aggregation | 21:03 |
corvus | clarkb: yeah, i was thinking along similar lines -- istr we might only report the first of many errors, so that may be what's going on. and this may be related to our storage issue yesterday (ie, we may have been storing the full set of errors) | 21:04 |
clarkb | oh yup and once you add up the error for all the duplicate nodesets that can easily get large | 21:05 |
corvus | yes, i confirmed we only include the first error in the change message | 21:06 |
corvus | (we will include all the errors in inline comments, but that doesn't apply to this case) | 21:06 |
corvus | still need a hypothesis for what's triggering the errors... :/ | 21:07 |
clarkb | corvus: is it possible for us to use a cached value and a newly merged value? | 21:11 |
corvus | honestly don't know | 21:12 |
clarkb | we do log "Using files from cache for project" when using the cached files | 21:22 |
clarkb | but when that happens we skip ahead in the loop and don't submit a cat job so it shouldn't be possible to append the two together | 21:23 |
corvus | here's a clue: the errors for 817219 happened on zuul01, and that did not perform the most recent reconfiguration of openstack; zuul02 did. zuul01 updated its layout because it detected it was out of date. so there may be a difference between a layout generated via a reconfiguration event, versus noticing the layout is out of date. but it's also possible this is happening merely because it's not the same host as where the reconfiguration | 21:23 |
corvus | happened. i'll see if the other change can narrow it down further. | 21:23 |
corvus | 815557 is the same situation but reversed: it happened on zuul02, but the most recent tenant reconfig happened on zuul01, so zuul02 was updated via the up-to-date check. that doesn't narrow it down much other than to say that it's not host-specific, and it does strongly point toward a cross-host issue | 21:27 |
corvus | i wonder if the configuration error keys can somehow be different in either of those two circumstances | 21:28 |
corvus | but hrm, this isn't an existing error, so that shouldn't matter. | 21:29 |
clarkb | looking at the code I can see how we maybe read from the cache when it isn't fully populated (because we clear the entire cache before updating it with the write lock but when we read we don't seem to use a read lock) | 21:29 |
clarkb | but I cannot see anything that looks like it would allow us to double up a file so far | 21:30 |
clarkb | would zuul.d/nodesets.yaml be considered an extra_config_files? | 21:34 |
corvus | (i just realized we also don't log all of the errors in the debug log, but they are all available in the web, so i checked all the tenants and none of them have an error related to ubuntu-bionic-2-node so i think our assumptions hold) | 21:35 |
corvus | clarkb: no those should only be like zuul-tests.d/ in zuul-jobs | 21:35 |
clarkb | got it | 21:35 |
clarkb | also correction above we do use a read lock that corresponds to the write lock so that should be fine | 21:36 |
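Purely illustrative (not Zuul's cache code), a toy version of what that pairing means: because get() takes the same lock replace() holds while clearing and repopulating, a reader can never observe a half-rebuilt cache.

```python
import threading


class FilesCacheSketch:
    """Toy cache: the writer clears and repopulates under a lock shared with readers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._files = {}

    def replace(self, new_files):
        with self._lock:
            # Clear-and-repopulate is atomic with respect to get().
            self._files.clear()
            self._files.update(new_files)

    def get(self, path):
        with self._lock:
            return self._files.get(path)
```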
corvus | the configuration error keys in the layout loading errors do have different hashes on the two different schedulers. that makes me think that we could incorrectly filter out config errors when constructing layouts. if we knew that the ubuntu-bionic-2-node was an existing config error, then i would say that's the culprit and what needs to be fixed. but i'm troubled by the fact that we don't know where that error is actually coming from. | 21:45 |
corvus | (incidentally, the openstack tenant is up to 134 configuration errors) | 21:51 |
clarkb | corvus: to make sure I understand this process correctly: in _cacheTenantYaml | 21:55 |
clarkb | er, in _cacheTenantYAML we check whether the cache is valid. If it is, we load from the cache without fetching anything. but in your digging above we're hitting the "cached data is invalid" path, so we go through cat jobs and then update the cache with their results? | 21:56 |
corvus | clarkb: no, i'm working on a different end -- i started with "why are we seeing an error which doesn't belong to this change" as opposed to "where did the error come from". i think i have a theory as to the first question, but i don't have any theory or data about the second one. i think you're on the right path and i'll get there. i'm just following the breadcrumbs one at a time :) | 21:58 |
clarkb | gotcha | 21:58 |
corvus | (as an aside, i just saw in the logs a cooperative reconfiguration -- the first scheduler reconfigured the tenant and started re-enqueuing changes in check; the second scheduler updated its configuration to match, saw that check was busy, and started re-enqueuing changes in gate) | 22:00 |
clarkb | I should buy a dry erase board | 22:10 |
corvus | i just extracted all the relevant log lines from a reconfiguration from an event versus one from a layout update, and they are exactly the same except that the first one from the event submits a cat job for the project-branch that triggered the event, and the second one only uses the cache. | 22:10 |
corvus | so that's as expected. | 22:10 |
clarkb | I'm wondering if UnparsedBranchCache's .get() method could potentially be adding things somehow. That's mostly based on the fact that it seems to be where we get the parsed-but-not-yet-processed-for-zuul yaml data from, and it has a fairly complicated set of logic for returning all the files | 22:15 |
clarkb | like maybe https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L7288-L7290 needs a dedup step, but reading through it I haven't seen anything that would indicate generation of duplicates in that list yet | 22:17 |
clarkb | The function starts with a list of unique fns because they are dict keys which must be unique | 22:18 |
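A small sketch of the dedup step suggested above, assuming the input is a plain list of file names; the helper name is hypothetical and not part of Zuul.

```python
def dedup_filenames(fns):
    """Return fns with duplicates removed, preserving first-seen order."""
    seen = set()
    result = []
    for fn in fns:
        if fn not in seen:
            seen.add(fn)
            result.append(fn)
    return result


# A duplicated entry (however it got there) collapses back to one:
assert dedup_filenames(
    ["zuul.d/nodesets.yaml", "zuul.d/jobs.yaml", "zuul.d/nodesets.yaml"]
) == ["zuul.d/nodesets.yaml", "zuul.d/jobs.yaml"]
```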
corvus | another clue: vexxhost tenant defines a nodeset with the same name | 22:20 |
fungi | ooh, cross-tenant config leak? | 22:20 |
clarkb | I think it is possible to set extra-config-paths that overlap with the defaults and double load | 22:22 |
clarkb | but you'd have to explicitly set that config, and then the change shouldn't merge because you'd have duplicates there | 22:22 |
clarkb | Basically it is possible to generate this error via a change that updates extra-config-paths but it shouldn't be mergeable | 22:23 |
corvus | extra-config-paths is a tenant configuration setting (ie main.yaml) | 22:23 |
clarkb | ah, but even then we'd notice separately I think. But ya, since https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L7282-L7289 isn't filtering for duplicates you could accidentally (or intentionally) add them in | 22:24 |
clarkb | and opendev doesn't currently set extra-config-paths dups that I can see | 22:24 |
clarkb | that is a long-winded way of saying I think this is a problem, but only in the general case, and it isn't currently the issue we are observing | 22:25 |
corvus | agreed | 22:28 |
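A hedged sketch of that general-case problem: if a tenant's extra-config-paths (a main.yaml setting) repeated one of Zuul's default in-repo config locations, the same file would be collected twice. The default path list follows Zuul's documented defaults; the duplicate check itself is only an illustration, not existing Zuul code.

```python
# Documented Zuul defaults for in-repo configuration locations.
DEFAULT_CONFIG_PATHS = ["zuul.yaml", "zuul.d/", ".zuul.yaml", ".zuul.d/"]


def effective_config_paths(extra_config_paths):
    """Combine defaults with tenant extra-config-paths, refusing overlaps."""
    paths = DEFAULT_CONFIG_PATHS + list(extra_config_paths)
    duplicates = sorted({p for p in paths if paths.count(p) > 1})
    if duplicates:
        # Without a guard like this, everything under a doubled path (e.g.
        # zuul.d/nodesets.yaml) is parsed twice and every object in it gets
        # reported as "already defined".
        raise ValueError("duplicate config paths: %s" % duplicates)
    return paths


# effective_config_paths(["playbooks/zuul.d/"]) is fine;
# effective_config_paths(["zuul.d/"]) raises, flagging the overlap.
```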
clarkb | I need to take a break. Back in a bit | 22:31 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: containerfile: handle errors better https://review.opendev.org/c/openstack/diskimage-builder/+/817139 | 22:32 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: centos 9-stream: make non-voting for mirror issues https://review.opendev.org/c/openstack/diskimage-builder/+/817312 | 22:32 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Revert "centos 9-stream: make non-voting for mirror issues" https://review.opendev.org/c/openstack/diskimage-builder/+/817313 | 22:32 |
corvus | clarkb: okay, i found a difference between the schedulers... i'm manually running through createDynamicLayout, and in my simulation of this method, i get 386 nodesets on one scheduler, and 755 on the other: https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L2482 | 22:41 |
clarkb | is this through the repl? | 22:42 |
corvus | ya | 22:42 |
corvus | i think we can assume that at any given time, one scheduler has done a real tenant reconfiguration, the other has done the layout-update style reconfig | 22:43 |
clarkb | oh interesting that method does a similar thing to the cache get() method I called out | 22:44 |
corvus | in this case, the scheduler with 755 nodesets is the one that did the layout update. so the scheduler that did the real reconfiguration is in good shape, but the other one has too many nodesets | 22:45 |
corvus | there are 2 ubuntu-bionic-2-node nodesets in the list; they're both from the same source_context, so i don't think we're looking at a cross-tenant issue | 22:51 |
corvus | they are different objects fwiw | 22:51 |
corvus | i think i'm going to start working on a test case now | 22:52 |
clarkb | it's not surprising they are different objects due to config.extend(self.tenant_parser.parseConfig(tenant, incdata, loading_errors, pcontext)) | 22:53 |
clarkb | we create a new object each time we view the file. I strongly suspect that somehow the same file (via same source_context) is getting loaded multiple times via ^ | 22:53 |
clarkb | corvus: maybe double check that we don't have dups in tenant.untrusted_projects? | 22:58 |
corvus | k | 23:01 |
clarkb | and the other check would be tpc.branches doesn't have master listed twice I think | 23:01 |
corvus | negative on both | 23:03 |
clarkb | I guess that is good in that we're giving the method what should be correct inputs | 23:04 |
corvus | the duplicate data is in tpc.branches | 23:04 |
clarkb | corvus: is master listed multiple times? | 23:04 |
corvus | ie, tpc.parsedbranchconfig.get('master').nodesets is long | 23:04 |
corvus | nope | 23:04 |
clarkb | I see, so there's a single master branch for the tpc, but it has duplicated the nodesets info | 23:04 |
corvus | yep | 23:05 |
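For anyone following along in the REPL, a rough helper matching the check above; the attribute names (tpc.parsedbranchconfig, .nodesets, .name) are taken from the discussion and may not line up exactly with Zuul's internals.

```python
from collections import Counter


def duplicated_nodesets(nodesets):
    """Return nodeset names that appear more than once in a parsed config."""
    counts = Counter(ns.name for ns in nodesets)
    return {name: count for name, count in counts.items() if count > 1}


# In the REPL session above this would be invoked roughly as:
#   duplicated_nodesets(tpc.parsedbranchconfig.get("master").nodesets)
# printing something like {'ubuntu-bionic-2-node': 2} on the affected scheduler.
```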
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L2401-L2407 may be the issue then? | 23:06 |
clarkb | hrm but we set config to an empty ParsedConfig at the start which means we should only add things which haven't already been added? | 23:08 |
corvus | i think (unless i messed up in my testing) that this already has the duplicated data: https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L2404 | 23:09 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L1599-L1601 | 23:12 |
clarkb | are we maybe mixing a project,branch specific cache with the entire cache contents? | 23:13 |
clarkb | I think that may be what is happening on line 1601, which pollutes line 2404? | 23:14 |
clarkb | hrm, but we do filter by the project and then branch there. I'm so confused. I should probably let corvus write that test case and figure it out | 23:18 |
fungi | infra-prod-remote-puppet-else has started failing when trying to install pymysql on health01.openstack.org | 23:21 |
clarkb | fungi: maybe we stick it in the emergency file and send an email to the list, following up on the initial thread about turning that stuff off, saying things are starting to break now? | 23:22 |
fungi | not sure why it's just started, but pymysql has required python>=3.6 for some time now (its last release was in january), and that's a xenial host so it only has 3.5 | 23:23 |
clarkb | we wouldn't update the installation unless the repo updated. I'm guessing the openstack-health repo updated? | 23:24 |
fungi | oh, and it's trying to install it with python 2.7 anyway | 23:24 |
fungi | Using cached https://files.pythonhosted.org/packages/2b/c4/3c3e7e598b1b490a2525068c22f397fda13f48623b7bd54fb209cd0ab774/PyMySQL-1.0.0.tar.gz | 23:25 |
fungi | apparently 1.0.0 was yanked from pypi | 23:25 |
fungi | yeah, looks like there's been some occasional updates merged to the openstack-health repo | 23:26 |
fungi | most recent change merged 5 days ago | 23:27 |
fungi | but infra-prod-remote-puppet-else has been succeeding until just now | 23:27 |
ianw | how about i add another mystery | 23:34 |
ianw | https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/containerfile/root.d/08-containerfile#L63 | 23:34 |
ianw | seems to me we have not been setting DIB_CONTAINERFILE_PODMAN_ROOT=1 in production | 23:34 |
fungi | i've moved health01.openstack.org:/root/.cache/pip out of the way to see if it works when rerun | 23:34 |
ianw | meaning that "tar -C $TARGET_ROOT --numeric-owner -xf -" was not running as root, yet seemingly somehow still managing to write everything as root-owned? | 23:35 |
ianw | this is not working now. i can't quite understand how it was ever working | 23:35 |
clarkb | ianw: podman will use user namespacing where you can be root in the container but normal user outside of the container | 23:35 |
clarkb | could that explain it? | 23:35 |
ianw | this is a tar outside any podman context | 23:36 |
clarkb | oh I see it is the tar on the other end of the podman export pipe | 23:37 |
clarkb | and we aren't setting the flag to prepend with sudo hence the confusion | 23:38 |
clarkb | are we running dib with privileges so that any process it forks gets them too? | 23:39 |
ianw | in the gate tests, we set that flag https://opendev.org/openstack/diskimage-builder/src/branch/master/roles/dib-functests/tasks/main.yaml#L69 | 23:39 |
corvus | clarkb: i have been unable to reproduce in a test case so far. i started 2 schedulers, forced a reconfig on one, then let the other scheduler handle a patchset upload of a .zuul.yaml... no duplicate nodesets. :/ | 23:40 |
ianw | ok, it's my fault | 23:41 |
ianw | https://review.opendev.org/c/openstack/diskimage-builder/+/814081/13/diskimage_builder/elements/containerfile/root.d/08-containerfile | 23:41 |
clarkb | ianw: ah it was a hardcoded sudo previously | 23:41 |
ianw | that it got 7 reviews and nobody noticed makes me feel *slightly* better that it was subtle :) | 23:42 |
clarkb | corvus: back to the repl I guess? I don't really have any better ideas beyond that loop somehow injecting extra data because we aren't careful about duplicates | 23:45 |
fungi | spot checks indicate the iptables reject rule change rolled out without problem | 23:50 |
fungi | REJECT all -- anywhere anywhere reject-with icmp-admin-prohibited | 23:50 |
clarkb | cool maybe we can land https://review.opendev.org/c/opendev/system-config/+/816869 tomorrow if frickler and ianw get a chance to review it "overnight" (relative to me) | 23:52 |