Tuesday, 2021-11-09

*** mazzy509881 is now known as mazzy5098800:10
*** mazzy509889 is now known as mazzy5098800:34
corvuslast zuul bugfix change merged; waiting for promote01:39
corvusonce it does promote, i'll pull the images, then i'll restart zuul01 with the new image, wait for it to settle, then do the same for zuul0201:40
corvuspromote is done, pulling images01:41
corvusstopping zuul0101:43
corvuswe have some work to do on the shutdown sequence, but it's stopped.01:45
corvusstarting 01 again01:45
corvushttps://zuul.opendev.org/components is now interesting; you can see the new version # on zuul0101:45
corvusas zuul01 is coming online expect occasional status page errors; that's harmless01:46
fungiwhat needs to be adjusted for shutdown now? i guess to do the rolling scheduler restarts?01:48
corvusit looks like we don't wait to finish the run handler before we disconnect from zk01:49
fungiahh, that could be messy i guess01:49
corvus(sorry, to be clear, i mean the scheduler program itself, not opendev ops)01:49
fungimakes sense, thanks01:50
corvusi just realized one of the changes was not backwards compat, so i'm going to go ahead and shut down zuul0201:51
corvusotherwise it's going to throw more and more errors01:51
fungitoo bad01:52
fungiwe'll get a real rolling restart soon, i'm sure01:52
corvusi'm optimistic this can still be a rolling restart, just with a brief pause in processing01:52
fungioh, right the state is still persisted01:53
corvusand zuul01 is already processing some tenants01:53
corvusthey just happen to be the empty ones01:54
corvusokay that did not work01:58
corvusi'm going to shut it down, clear zk state, and restore queues from backup01:59
corvus(i suspect that the non-backwards-compat change tanked it)01:59
fungii can see how something like that could lead to an unrecoverable/corrupted state02:00
corvusstarting up now02:03
corvusthis'll be one of the longer startup times since the zk state is empty02:05
corvusre-enqueue done02:39
corvusi'm going to leave zuul01 off for now and start it up tomorrow morning02:40
fungisounds good, thanks again!02:53
*** ysandeep|out is now known as ysandeep04:37
*** ysandeep is now known as ysandeep|brb04:46
*** ykarel|away is now known as ykarel04:54
opendevreviewchandan kumar proposed opendev/system-config master: Enable mirroring of centos stream 9 contents  https://review.opendev.org/c/opendev/system-config/+/81713605:18
*** ysandeep|brb is now known as ysandeep05:35
opendevreviewIan Wienand proposed openstack/diskimage-builder master: containerfile: handle errors better  https://review.opendev.org/c/openstack/diskimage-builder/+/81713906:24
ianwthere is still something wrong with the containerfile build.  but hopefully this stops us getting into a broken state06:29
*** ysandeep is now known as ysandeep|lunch07:40
*** sshnaidm is now known as sshnaidm|afk07:45
*** ykarel is now known as ykarel|lunch08:06
*** pojadhav- is now known as pojadhav08:49
*** ysandeep|lunch is now known as ysandeep09:06
*** gthiemon1e is now known as gthiemonge09:14
*** sshnaidm|afk is now known as sshnaidm09:40
*** ykarel|lunch is now known as ykarel09:54
*** dviroel|out is now known as dviroel11:07
*** ysandeep is now known as ysandeep|afk11:12
zbris anyone aware of storyboard issues? i wanted to unsubscribe me from its notifications and I discovered that it does not allow logins anymore. https://sbarnea.com/ss/Screen-Shot-2021-11-09-11-17-31.25.png --- that is what happens AFTER pressing the login with LP button.11:17
*** pojadhav is now known as pojadhav|afk11:40
*** ysandeep|afk is now known as ysandeep12:05
*** pojadhav- is now known as pojadhav14:24
*** ysandeep is now known as ysandeep|dinner14:27
*** pojadhav- is now known as pojadhav14:39
opendevreviewMartin Kopec proposed opendev/irc-meetings master: Update QA meeting info  https://review.opendev.org/c/opendev/irc-meetings/+/81722414:48
opendevreviewMartin Kopec proposed opendev/irc-meetings master: Update Interop meeting details  https://review.opendev.org/c/opendev/irc-meetings/+/81722514:49
*** ykarel_ is now known as ykarel|away14:57
clarkbzbr: I am unable to reproduce but can't dig in further than that for now as I need to do a school run and eat breakfast15:14
corvusi'm going to start zuul0115:19
corvusstatus page blips may occur15:19
*** ysandeep|dinner is now known as ysandeep15:22
fungizbr: i'm also able to log into storyboard just fine, and have additionally checked that your account is currently set to enabled15:25
fungiif you'd like me to unset your e-mail address from it, i'm happy to do that15:25
corvuszuul01 is running15:39
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Create repo for ProxySQL Ansible role  https://review.opendev.org/c/openstack/project-config/+/81727116:06
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: And ansible-role-proxysql repo to zuul jobs  https://review.opendev.org/c/openstack/project-config/+/81727216:08
zbrclarkb: fungi no need to do anything. I was able to login successfully and I think I found the bug in story board. Original notification was about a new weird story 1111111 being created this morning, with link to it https://storyboard.openstack.org/#!/story/2009670 -- when using this url and trying to login I got delogged automatically. When logging from homepage it did work. I guess that there is something fishy about this story, as it seems it was16:17
zbrremoved. Still, this should never trigger user logout.16:17
zbrif i knew about this bug i would have not bothered to mention it here, but only few minutes ago i identified the root cause.16:17
fungiinteresting, thanks for the heads up about the weird story, i'll see if i can tell where that came from16:17
fungizbr: and i think the webclient problem you observed was reported as https://storyboard.openstack.org/#!/story/200818416:20
fungiseems to have to do with the openid redirect/return sending you back to an error page16:21
*** ysandeep is now known as ysandeep|out16:21
clarkbThere are no changes in openstack's zuul tenant more than a few hours old, implying we don't have any problems with things getting stuck again16:27
clarkbcorvus: is it safe to restart zuul-web (I expect it is) in order to get the card sizing fix deployed16:56
clarkbI guess we'll probably do a restart for the pipeline path change and that would catch it too. That is probably good enough17:00
corvusclarkb: yes to both :)17:01
*** marios is now known as marios|out17:02
clarkbcorvus: one thing I've noticed is that sometimes the status page doesn't give me an estimated completion time. I know this can be related to a lack of data in the database, but I would expect either scheduler to return the data from the database and not be an issue in a multi scheduler setup17:31
clarkbjust calling that out in case there is potential for a bug here17:31
clarkbbut for example change 817108,1 is running tripleo jobs I'm fairly certain we've run enough times previously to have runtime data in the db for17:32
clarkbbut 3 of the 4 running jobs there don't give the hover over tooltip17:32
clarkber 4 out of 517:32
corvushrm i'll take a look17:33
corvusclarkb: i think i see the issue; more in #zuul17:54
opendevreviewClark Boylan proposed opendev/base-jobs master: Remove growroot logs dumping in base-test  https://review.opendev.org/c/opendev/base-jobs/+/81728918:08
opendevreviewClark Boylan proposed opendev/base-jobs master: Remove growroot log dumping from the base job  https://review.opendev.org/c/opendev/base-jobs/+/81729018:08
clarkbneither of ^ is particularly urgent but I noticed we were still dumping growroot logs when I was looking at a job today and thought we don't need that anymore and it can be distracting when skimming logs18:09
opendevreviewClint Byrum proposed zuul/zuul-jobs master: Remove google_sudoers in revoke-sudo  https://review.opendev.org/c/zuul/zuul-jobs/+/81729118:10
fungizbr: that story you got notified about looks like it was a user testing out story creation and picking a couple of projects at random, then they set it to private once they realized they couldn't delete it (so i deleted it as an admin just now). thanks for bringing it to my attention18:27
rosmaitawhen someone has a few minutes ... zuul is giving me "Unknown configuration error" on this small change, and I can't figure out what I'm doing wrong: https://review.opendev.org/c/openstack/os-brick/+/81711118:34
clarkbrosmaita: we think that was a bug in zuul that has since been corrected (as of like 02:00 UTC today or so)18:35
clarkbrosmaita: if you try to recheck it should hopefully either run or give you back a proper error message18:35
rosmaitaexcellent, ty18:35
clarkbrosmaita: looks like it queued up jobs18:39
rosmaitagreat, thanks!18:40
clarkbif anyone wants details there was an issue trying to serialize too much data into individual zookeeper znode entries when processing zuul configs. When that happened zuul got a database error and that bubbled up to the user as an unknown error. corvus fixed that by sharding data serialized for configs18:40
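A minimal sketch of the sharding idea described above, assuming a chunk size safely below ZooKeeper's default znode size limit (roughly 1 MiB). The names `SHARD_SIZE`, `shard`, `unshard`, and `shard_paths` are hypothetical and are not Zuul's actual implementation; the sketch only shows splitting one large serialized blob into several smaller values that could each be stored in their own child znode.

```python
from typing import List

# Assumed chunk size; chosen to stay well under ZooKeeper's default
# znode size limit of roughly 1 MiB. Not a value taken from Zuul.
SHARD_SIZE = 512 * 1024


def shard(data: bytes, size: int = SHARD_SIZE) -> List[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def unshard(chunks: List[bytes]) -> bytes:
    """Reassemble chunks (in order) into the original payload."""
    return b"".join(chunks)


def shard_paths(base: str, count: int) -> List[str]:
    """Hypothetical child znode paths, zero-padded so they sort in order."""
    return ["{}/{:010d}".format(base, i) for i in range(count)]
```

Zero-padding the child names keeps a lexical listing of the children in write order, so the reader can join the chunks without extra metadata.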
clarkbfungi: re the @ username thing we can probably grep external ids for that somehow if we want to audit on our end18:50
fungiyeah, i just didn't have time to, and if it's important to the user they can check it18:51
fungii'm finding so many stories which people have simply set to private rather than marking them invalid18:53
clarkbThat might explain why it is common to only allow the private -> public transition and not the other way around?18:54
fungithough also i'm finding a bunch which people opened incorrectly as private when they're just normal bug reports18:55
clarkbcorvus: really quickly before I've got to run the opendev meeting. elodilles notes that they had some failed openstack release jobs because zuul.tag wasn't set https://zuul.opendev.org/t/openstack/build/6708011371124d1e92a43a2702343ba2/log/zuul-info/inventory.yaml#47 is an example of that18:59
clarkbthe job appears to have been triggered by ref/tags/foo18:59
clarkbI wouldn't have expected this to be a multi scheduler issue but maybe we aren't serializing that info properly? eg is tag missing in model.py somewhere?18:59
johnsomHas the zuul fix been deployed for the openstack instance? We are still seeing invalid configuration errors as of 10:44am (pacific)18:59
clarkb(I'd look myself but I really need  to do the meeting now)19:00
clarkbjohnsom: yes it was restarted yesterday evening pacific time with the expected fix for the "Unknown config error" thing19:00
clarkbjohnsom: rosmaita just rechecked a change in that situation and it successfully enqueued19:00
johnsomYeah, I'm going to re-spin this patch anyway, but was surprised to see the same error still based on the scroll back.19:01
clarkbit's possible there are multiple underlying issues and we only fixed one of them19:02
clarkbyour error is different fwiw19:02
johnsomWe saw this yesterday too. A parent patch had the config message, then child had this one.19:03
clarkbIn this case the parent seems to have always been able to run jobs19:04
corvusjohnsom: i don't see an issue with that change now?19:12
jrosser_i'm seeing this: Nodeset ubuntu-bionic-2-node already defined <snip...> in "openstack/openstack-zuul-jobs/zuul.d/nodesets.yaml@master", line 2, column 319:13
jrosser_on here https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/81721919:13
johnsomcorvus, yeah, 20 minutes later, I pushed a change to it, this time it started19:13
artomSame here, here's the review: https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/815557/719:45
opendevreviewIan Wienand proposed opendev/system-config master: gerrit: test reviewed flag add/delete/add cycle  https://review.opendev.org/c/opendev/system-config/+/81730119:51
clarkbhttps://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/nodesets.yaml#L2-L12 is where we define ubuntu-bionic-2-node. I wonder if we're deserializing in such a way that we're overlapping the contents of that file somehow19:55
opendevreviewMerged opendev/system-config master: Retry acme.sh cloning  https://review.opendev.org/c/opendev/system-config/+/81388020:22
opendevreviewMerged opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin  https://review.opendev.org/c/opendev/system-config/+/81001320:36
corvusclarkb: i'm not following the conversation very well; where do i start looking for the nodeset issue?20:45
clarkbcorvus: https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/817219 there and alternatively https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/815557/720:45
clarkbthe both have errors from zuul complaining that the ubuntu-bionic-2-node nodeset is already defined20:46
clarkbneither change seems to directly add a nodeset with a conflicting name, which is what the error would typically indicate20:46
clarkblooks like the second change was rechecked and successfully enqueued20:47
clarkbso this isn't a consistent failure. That is good for artom but less good for debugging it :)20:48
sshnaidmis there a problem with zuul.tag variable? I don't see it's passed anymore in "tag" pipeline jobs: https://92c4e0b4c5172d297c6a-24da3718548989a700aa54b9db0ff79c.ssl.cf1.rackcdn.com/612c3400c17c7a9d8b0d230383858331cb0cd653/tag/ansible-collections-openstack-release/7bbc0ba/zuul-info/inventory.yaml20:48
clarkbsshnaidm: yes, fix is in progress at https://review.opendev.org/c/zuul/zuul/+/817298 and will need a zuul restart to take effect20:48
sshnaidmclarkb, oh, thanks!20:48
clarkbthis was discussed in #openstack-release20:49
*** dviroel is now known as dviroel|out20:51
clarkbcorvus: one thing I notice is that that nodeset is the very first thing defined in that config file. Which means it would be the very first error seen if we had somehow duplicated the file in a config aggregation21:03
corvusclarkb: yeah, i was thinking along similar lines -- istr we might only report the first of many errors, so that may be what's going on.  and this may be related to our storage issue yesterday (ie, we may have been storing the full set of errors)21:04
clarkboh yup and once you add up the error for all the duplicate nodesets that can easily get large21:05
corvusyes, i confirmed we only include the first error in the change message21:06
corvus(we will include all the errors in inline comments, but that doesn't apply to this case)21:06
corvusstill need a hypothesis for what's triggering the errors... :/21:07
clarkbcorvus: is it possible for us to use a cached value and a newly merged value?21:11
corvushonestly don't know21:12
clarkbwe do log "Using files from cache for project" when using the cached files21:22
clarkbbut when that happens we skip ahead in the loop and don't submit a cat job so it shouldn't be possible to append the two together21:23
corvushere's a clue: the errors for 817219 happened on zuul01, and that did not perform the most recent reconfiguration of openstack; zuul02 did.  zuul01 updated its layout because it detected it was out of date.  so there may be a difference between a layout generated via a reconfiguration event, versus noticing the layout is out of date.  but it's also possible this is happening merely because it's not the same host as where the reconfiguration21:23
corvushappened.  i'll see if the other change can narrow it down further.21:23
corvus815557 is the same situation but reversed: it happened on zuul02, but the most recent tenant reconfig happened on zuul01, so zuul02 was updated via the up-to-date check.  that doesn't narrow it down much other than to say that it's not host-specific, and it does strongly point toward a cross-host issue21:27
corvusi wonder if the configuration error keys can somehow be different in either of those two circumstances21:28
corvusbut hrm, this isn't an existing error, so that shouldn't matter.21:29
clarkblooking at the code I can see how we maybe read from the cache when it isn't fully populated (because we clear the entire cache before updating it with the write lock but when we read we don't seem to use a read lock)21:29
clarkbbut I cannot see anything that looks like it would allow us to double up a file so far21:30
clarkbwould zuul.d/nodesets.yaml be considered an extra_config_files?21:34
corvus(i just realized we also don't log all of the errors in the debug log, but they are all available in the web, so i checked all the tenants and none of them have an error related to ubuntu-bionic-2-node so i think our assumptions hold)21:35
corvusclarkb: no those should only be like zuul-tests.d/ in zuul-jobs21:35
clarkbgot it21:35
clarkbalso correction above we do use a read lock that corresponds to the write lock so that should be fine21:36
corvusthe configuration error keys in the layout loading errors do have different hashes on the two different schedulers.  that makes me think that we could incorrectly filter out config errors when constructing layouts.  if we knew that the ubuntu-bionic-2-node was an existing config error, then i would say that's the culprit and what needs to be fixed.  but i'm troubled by the fact that we don't know where that error is actually coming from.21:45
corvus(incidentally, the openstack tenant is up to 134 configuration errors)21:51
clarkbcorvus: to make sure I understand this process correctly: in _cacheTenantYaml21:55
clarkber in _cacheTenantYAML we check if the cache is valid. If it is then we update from the cache without updating anything. but in your digging above we're hitting the cached data is invalid path so we go through cat jobs and then update the cache with the cat jobs?21:56
corvusclarkb: no, i'm working on a different end -- i started with "why are we seeing an error which doesn't belong to this change" as opposed to "where did the error come from".  i think i have a theory as to the first question, but i don't have any theory or data about the second one.  i think you're on the right path and i'll get there.  i'm just following the breadcrumbs one at a time :)21:58
corvus(as an aside, i just saw in the logs a cooperative reconfiguration -- the first scheduler reconfigured the tenant and started re-enqueuing changes in check; the second scheduler updated its configuration to match, saw that check was busy, and started re-enqueuing changes in gate)22:00
clarkbI should buy a dry erase board22:10
corvusi just extracted all the relevant log lines from a reconfiguration from an event versus one from a layout update, and they are exactly the same except that the first one from the event submits a cat job for the project-branch that triggered the event, and the second one only uses the cache.22:10
corvusso that's as expected.22:10
clarkbI'm wondering if UnparsedBranchCache's .get() method could potentially be adding things somehow. But thats mostly based on that seems to be where we get the parsed yaml but unprocessed for zuul data from and it has a fairly complicated set of logic for returning all the files22:15
clarkblike maybe https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L7288-L7290 needs a dedup step, But reading through it I haven't seen anything that would indicate generation of duplicates in that list yet22:17
clarkbThe function starts with a list of unique fns because they are dict keys which must be unique22:18
corvusanother clue: vexxhost tenant defines a nodeset with the same name22:20
fungiooh, cross-tenant config leak?22:20
clarkbI think it is possible to set extra-config-paths that overlap with the defaults and double load22:22
clarkbbut you'd have to explicitly set that config and then the change shouldn't merge because you'd have duplicates there22:22
clarkbBasically it is possible to generate this error via a change that updates extra-config-paths but it shouldn't be mergeable22:23
corvusextra-config-paths is a tenant configuration setting (ie main.yaml)22:23
clarkbah, but even then we'd notice separately I think. But ya since https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L7282-L7289 isn't filtering for duplicates you could accidentally but intentionally add them in22:24
clarkband opendev doesn't currently set extra-config-paths dups that I can see22:24
clarkbthat is a long winded way of saying I think this is a problem but only in the general case and isn't currently the issue we are observing22:25
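A sketch of the kind of dedup step discussed above, assuming the list of config paths is built by appending extra paths to a set of defaults. The function name `dedup_preserving_order` and the example path lists are illustrative only; this is not the code in zuul/model.py, just an order-preserving way to drop repeated entries before files are loaded.

```python
def dedup_preserving_order(paths):
    """Return paths with repeats removed, keeping first-seen order."""
    seen = set()
    result = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            result.append(path)
    return result


# Illustrative inputs: default config locations plus extra paths that
# overlap with one of the defaults.
default_paths = ["zuul.yaml", "zuul.d/", ".zuul.yaml", ".zuul.d/"]
extra_paths = ["zuul.d/", "zuul-tests.d/"]
load_order = dedup_preserving_order(default_paths + extra_paths)
```

Keeping first-seen order matters here because the order in which files are loaded determines which definition is reported as the duplicate.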
clarkbI need to take a break. Back in a bit22:31
opendevreviewIan Wienand proposed openstack/diskimage-builder master: containerfile: handle errors better  https://review.opendev.org/c/openstack/diskimage-builder/+/81713922:32
opendevreviewIan Wienand proposed openstack/diskimage-builder master: centos 9-stream: make non-voting for mirror issues  https://review.opendev.org/c/openstack/diskimage-builder/+/81731222:32
opendevreviewIan Wienand proposed openstack/diskimage-builder master: Revert "centos 9-stream: make non-voting for mirror issues"  https://review.opendev.org/c/openstack/diskimage-builder/+/81731322:32
corvusclarkb: okay, i found a difference between the schedulers... i'm manually running through createDynamicLayout, and in my simulation of this method, i get 386 nodesets on one scheduler, and 755 on the other: https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L248222:41
clarkbis this through the repl?22:42
corvusi think we can assume that at any given time, one scheduler has done a real tenant reconfiguration, the other has done the layout-update style reconfig22:43
clarkboh interesting that method does a similar thing to the cache get() method I called out22:44
corvusin this case, the scheduler with 755 nodesets is the one that did the layout update.  so the scheduler that did the real reconfiguration is in good shape, but the other one has too many nodesets22:45
corvusthere are 2 ubuntu-bionic-2-node nodesets in the list; they're both from the same source_context, so i don't think we're looking at a cross-tenant issue22:51
corvusthey are different objects fwiw22:51
corvusi think i'm going to start working on a test case now22:52
clarkbits not surprising they are different objects due to config.extend(self.tenant_parser.parseConfig(tenant, incdata, loading_errors, pcontext))22:53
clarkbwe create a new object each time we view the file. I strongly suspect that somehow the same file (via same source_context) is getting loaded multiple times via ^22:53
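One way to spot this kind of double loading is to count parsed objects by name and source context rather than by object identity, since each parse creates a new object. The sketch below uses a hypothetical stand-in `NodeSet` class with plain string fields; it is not Zuul's model class, and the source context string is only an example taken from the error message quoted earlier in this log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class NodeSet:
    # Hypothetical stand-in for a parsed nodeset; fields are plain strings.
    name: str
    source_context: str


def find_duplicates(nodesets):
    """Map (name, source_context) to its count, for keys seen more than once."""
    counts = Counter((n.name, n.source_context) for n in nodesets)
    return {key: count for key, count in counts.items() if count > 1}


ctx = "openstack/openstack-zuul-jobs/zuul.d/nodesets.yaml@master"
parsed = [
    NodeSet("ubuntu-bionic-2-node", ctx),
    NodeSet("ubuntu-focal", ctx),
    NodeSet("ubuntu-bionic-2-node", ctx),  # same file parsed a second time
]
duplicates = find_duplicates(parsed)
```

A key that appears twice with the same source context points at the same file being parsed more than once, rather than at two projects defining the same name.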
clarkbcorvus: maybe double check that we don't have dups in tenant.untrusted_projects?22:58
clarkband the other check would be tpc.branches doesn't have master listed twice I think23:01
corvusnegative on both23:03
clarkbI guess that is good in that we're giving the method what should be correct inputs23:04
corvusthe duplicate data is in tpc.branches23:04
clarkbcorvus: is master listed multiple times?23:04
corvusie, tpc.parsedbranchconfig.get('master').nodesets is long23:04
clarkbI see so single master branch for the tpc but that has duplicated the nodesets info23:04
clarkbhttps://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L2401-L2407 may be the issue then?23:06
clarkbhrm but we set config to an empty ParsedConfig at the start which means we should only add things which haven't already been added?23:08
corvusi think (unless i messed up in my testing) that this already has the duplicated data: https://opendev.org/zuul/zuul/src/branch/master/zuul/configloader.py#L240423:09
clarkbare we maybe mixing a project,branch specific cache with the entire cache contents?23:13
clarkbI think that may be what is happening on line 1601, which pollutes line 2404?23:14
clarkbhrm but we do filter by the project and then branch there. I'm so confused. I should probably let corvus write that test case and figure it out23:18
fungiinfra-prod-remote-puppet-else has started failing when trying to install pymysql on health01.openstack.org23:21
clarkbfungi: maybe we stick it in the emergency file and send an email to the list following up with the initial thread on turning that stuff off saying things are starting to break now?23:22
funginot sure why it's just started, but pymysql has required python>=3.6 for some time now (its last release was in january), and that's a xenial host so only has 3.523:23
clarkbwe wouldn't update the installation unless the repo updated. I'm guessing the openstack-health repo updated?23:24
fungioh, and it's trying to install it with python 2.7 anyway23:24
fungiUsing cached https://files.pythonhosted.org/packages/2b/c4/3c3e7e598b1b490a2525068c22f397fda13f48623b7bd54fb209cd0ab774/PyMySQL-1.0.0.tar.gz23:25
fungiapparently 1.0.0 was yanked from pypi23:25
fungiyeah, looks like there's been some occasional updates merged to the openstack-health repo23:26
fungimost recent change merged 5 days ago23:27
fungibut infra-prod-remote-puppet-else has been succeeding until just now23:27
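The version mismatch described above can be illustrated with a simplified minimum-version check. This is only a sketch: real installers evaluate a package's declared Python requirement with full version-specifier matching, whereas the hypothetical helper `satisfies_minimum` below compares (major, minor) tuples against the ">=3.6" floor mentioned in the conversation.

```python
def satisfies_minimum(interpreter, minimum):
    """True if the interpreter (major, minor) is at least the minimum."""
    return tuple(interpreter) >= tuple(minimum)


PYMYSQL_MINIMUM = (3, 6)  # from the "python>=3.6" requirement noted above

xenial_python3_ok = satisfies_minimum((3, 5), PYMYSQL_MINIMUM)
python27_ok = satisfies_minimum((2, 7), PYMYSQL_MINIMUM)
newer_python_ok = satisfies_minimum((3, 8), PYMYSQL_MINIMUM)
```

Under this simplification both interpreters mentioned for the host (3.5 and 2.7) fall below the floor, which is consistent with a cached sdist for a newer release failing to install there.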
ianwhow about i add another mystery23:34
ianwseems to me we have not been setting DIB_CONTAINERFILE_PODMAN_ROOT=1 in production23:34
fungii've moved health01.openstack.org:/root/.cache/pip out of the way to see if it works when rerun23:34
ianwmeaning that "tar -C $TARGET_ROOT --numeric-owner -xf -" was not running as root, but yet seemingly somehow still managing to write everything as root owned?23:35
ianwthis is not working now.  i can't quite understand how it was ever working23:35
clarkbianw: podman will use user namespacing where you can be root in the container but normal user outside of the container23:35
clarkbcould that explain it?23:35
ianwthis is a tar outside any podman context23:36
clarkboh I see it is the tar on the other end of the podman export pipe23:37
clarkband we aren't setting the flag to prepend with sudo hence the confusion23:38
clarkbare we running dib with privileges so that any process it forks gets them too?23:39
ianwin the gate tests, we set that flag https://opendev.org/openstack/diskimage-builder/src/branch/master/roles/dib-functests/tasks/main.yaml#L6923:39
corvusclarkb: i have been unable to reproduce in a test case so far.  i started 2 schedulers, forced a reconfig on one, then let the other scheduler handle a patchset upload of a .zuul.yaml... no duplicate nodesets. :/23:40
ianwok, it's my fault23:41
clarkbianw: ah it was a hardcoded sudo previously23:41
ianwthat it got 7 reviews and nobody noticed makes me feel *slightly* better that it was subtle :)23:42
clarkbcorvus: back to the repl I guess? I don't really have any better ideas than somehow that loop is likely to be injecting extra data because we aren't careful against duplicates23:45
fungispot checks indicate the iptables reject rule change rolled out without problem23:50
fungiREJECT     all  --  anywhere             anywhere             reject-with icmp-admin-prohibited23:50
clarkbcool maybe we can land https://review.opendev.org/c/opendev/system-config/+/816869 tomorrow if frickler and ianw get a chance to review it "overnight" (relative to me)23:52
