Wednesday, 2022-07-20

00:33 *** dviroel|out is now known as dviroel
01:00 *** dviroel is now known as dviroel|out
01:00 *** rlandy|ruck is now known as rlandy|out
01:52 *** ysandeep|out is now known as ysandeep
03:23 *** ysandeep is now known as ysandeep|afk
10:14 *** ysandeep|afk is now known as ysandeep
10:33 *** rlandy|out is now known as rlandy
11:33 *** anbanerj is now known as frenzy_friday
11:34 *** dviroel|out is now known as dviroel
11:58 *** anbanerj is now known as frenzy_friday
12:03 <lajoskatona> Hi, on Ussuri and Victoria we have failing py3x jobs, see: https://zuul.openstack.org/builds?job_name=openstack-tox-py38&job_name=openstack-tox-py36&project=openstack%2Fneutron&branch=stable%2Fvictoria&branch=stable%2Fussuri&pipeline=periodic-stable&skip=0
12:04 <lajoskatona> Locally I can't reproduce the timeout, and at first check nothing special has been merged to these branches of Neutron since 11 July, last week
12:04 <lajoskatona> so if you know whether something has changed in the image or in some mirror or anything like that, that would be helpful
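[editor's note: a minimal sketch, not part of the discussion, of pulling the same build list through Zuul's REST API instead of the web UI, assuming the /api/builds endpoint accepts the same query parameters as the /builds page linked above.]

```python
# Hypothetical sketch: list non-successful builds matching the filters above.
import requests

ZUUL_API = "https://zuul.openstack.org/api/builds"

params = [
    ("job_name", "openstack-tox-py38"),
    ("job_name", "openstack-tox-py36"),
    ("project", "openstack/neutron"),
    ("branch", "stable/victoria"),
    ("branch", "stable/ussuri"),
    ("pipeline", "periodic-stable"),
    ("limit", "50"),
]

builds = requests.get(ZUUL_API, params=params, timeout=30).json()
for build in builds:
    # Filter client-side so we only rely on fields every build record carries.
    if build.get("result") != "SUCCESS":
        print(build["end_time"], build["job_name"], build["result"], build["log_url"])
```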
12:30 <fungi> mmm, lajoskatona seems to have disappeared while i was trying to search through the logs of one of those builds, but it looks like there are eventlet timeout tracebacks. is that what can't be reproduced?
12:30 <fungi> elodilles: have you seen similar issues in u/v branches of other projects in the past 1.5 weeks?
12:45 <lajoskatona> fungi: Hi, it seems I have some net issues, sorry
12:47 <fungi> no worries, did you catch my earlier comments from the channel log?
12:48 <fungi> specifically, it's the 40-second eventlet timeouts you're unable to reproduce?
12:49 <fungi> i thought i saw a thread on openstack-discuss about that. i wonder if there were fixes for whatever it was which only got backported as far as wallaby
12:49 <lajoskatona> fungi: yes, I have the same eventlet locally and no timeout, so that's why I thought there's something which I don't have locally
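[editor's note: an illustrative sketch of the failure mode under discussion — eventlet's Timeout firing after a fixed deadline while a green thread is blocked. The 40-second value mirrors the tracebacks mentioned above; this is not neutron's actual test code.]

```python
# Illustrative only: how an eventlet timeout like the ones in those unit test
# logs is produced when a green thread blocks for too long.
import eventlet

def slow_operation():
    # Stand-in for whatever the test is actually waiting on (DB call, RPC, ...)
    eventlet.sleep(300)

try:
    with eventlet.Timeout(40):
        slow_operation()
except eventlet.Timeout:
    print("eventlet.Timeout fired after 40s - the kind of traceback seen in the failing jobs")
```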
12:49 <lajoskatona> fungi: I'll check, perhaps I missed that thread
12:49 <fungi> i'll see if i can find the ml thread i'm thinking of, or whether i dreamed it
12:50 <lajoskatona> fungi: thanks
12:51 <fungi> i'm not immediately able to spot it in the archive
12:52 <fungi> i may be remembering the os-vif/linuxbridge problem, though that doesn't seem at all similar
12:54 <fungi> nothing jumps out at me from june either
12:55 <lajoskatona> fungi: I found an old thread which pointed to this req bump: https://review.opendev.org/c/openstack/requirements/+/811555
12:56 <lajoskatona> fungi: but I'm not sure we're seeing the same thing here. this is the mail: https://lists.openstack.org/pipermail/openstack-discuss/2021-October/025179.html
12:57 <elodilles> fungi: so far i've only found this in neutron's unit test jobs
12:58 <fungi> odd that it would have just started ~1.5 weeks ago. i don't see any recent constraints changes on those branches at all
12:59 <elodilles> i've checked the pip freeze outputs of the failing job vs the previous passing job (from July 11th) and there is no difference at all
12:59 <fungi> do we capture dpkg -l output?
12:59 <elodilles> neither has requirements' stable/victoria been touched since april
13:00 <elodilles> fungi: i don't think i saw 'dpkg -l' in the logs
13:00 <fungi> yeah, we capture it for devstack jobs but not unit tests
13:01 <elodilles> also interesting that victoria is focal based but ussuri is bionic
13:01 <fungi> maybe we could infer it by grabbing the dpkg -l from devstack jobs on the 10th and 12th or something
13:02 <elodilles> fungi: ok, i'll try to do that
13:02 <fungi> to see what might have updated in focal and in bionic around those dates
13:02 <fungi> could be there was a security fix ubuntu rolled out on the 11th
13:21 <lajoskatona> fungi, elodilles: yeah, pip seems to be the same in the green and red runs
13:25 <dpawlik> clarkb: hey, wanna check https://review.opendev.org/c/openstack/ci-log-processing/+/848218 please?
13:26 <elodilles> i've taken a sample (from bionic, stable/ussuri) dpkg-l.txt diff between Jul 05 and Jul 19: https://paste.opendev.org/show/b2EuM9b16RC6il4G6kHx/
13:27 <elodilles> (these were the closest runs i've found in the neutron repo)
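[editor's note: a rough sketch of the comparison being done here, assuming the passing and failing devstack-style runs each archive a dpkg-l.txt file; the two URLs below are placeholders, not the runs elodilles actually diffed.]

```python
# Hypothetical sketch of the dpkg -l comparison described above.
import requests

GOOD_URL = "https://example.opendev.org/logs/good-run/controller/logs/dpkg-l.txt"  # placeholder
BAD_URL = "https://example.opendev.org/logs/bad-run/controller/logs/dpkg-l.txt"    # placeholder

def package_versions(url):
    """Return {package: version} parsed from a 'dpkg -l' dump."""
    versions = {}
    for line in requests.get(url, timeout=30).text.splitlines():
        # dpkg -l data rows start with a status flag such as 'ii'.
        if line.startswith("ii"):
            fields = line.split()
            versions[fields[1]] = fields[2]
    return versions

good, bad = package_versions(GOOD_URL), package_versions(BAD_URL)
for pkg in sorted(set(good) | set(bad)):
    if good.get(pkg) != bad.get(pkg):
        print(f"{pkg}: {good.get(pkg, '-')} -> {bad.get(pkg, '-')}")
```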
13:34 <fungi> we don't do periodic stable devstack jobs for neutron daily any more?
13:36 <elodilles> as far as i know the stable-periodics are all 'lightweight' unit test jobs
13:37 *** ysandeep is now known as ysandeep|afk
13:41 <elodilles> hmmm, but neutron has extra 'periodic' jobs
13:44 <fungi> even generic periodic stable jobs for devstack might be sufficient to spot what's changed in distro packages, if most of the same packages are getting installed in those jobs
13:45 <lajoskatona> elodilles: we have some, like here: https://zuul.openstack.org/buildsets?project=openstack%2Fneutron&branch=stable%2Fvictoria&pipeline=periodic&skip=0
13:45 <lajoskatona> though it's new to me that there are separate periodic and periodic-stable pipelines...
13:46 <lajoskatona> in my mind they were the same
13:46 <fungi> periodic is usually for master branch testing, and periodic-stable is for stable branch testing. we trigger them at slightly different times to offset the load
13:48 *** dasm|off is now known as dasm|ruck
13:49 <fungi> oh, i guess not really that far apart. https://opendev.org/openstack/project-config/src/branch/master/zuul.d/pipelines.yaml indicates periodic should trigger at 02:00 and periodic-stable at 02:01, just far enough apart to make sure the periodic jobs get some priority for their node requests in case we can't run them all before load on the system picks back up
13:50 <fungi> i was thinking of the periodic-weekly pipeline, which starts at 08:00 on saturdays, hopefully after the daily periodics have wrapped up
13:57 <elodilles> so the difference seems to be: https://paste.opendev.org/show/bwXLDiDuz9mCUEk3OnZn/
13:58 <elodilles> ignore me, i've diffed stable/wallaby :/
14:02 <fungi> though it may be the same
14:02 <fungi> at least the same as for victoria, since they run on the same platform
14:03 <elodilles> yes, they are the same (both are focal)
14:03 <elodilles> so at least the result is the same
14:05 <fungi> so that suggests this situation could be brought on by a kernel or libc update
14:05 <fungi> though i wonder why it doesn't affect wallaby jobs
14:09 <elodilles> yes. (for ussuri / bionic: https://paste.opendev.org/show/b8JwAKL4wMqdxCGOZpp8/ )
14:13 *** ysandeep|afk is now known as ysandeep
14:15 <lajoskatona> elodilles: the upper lines of packages are from a passing run?
14:23 <elodilles> lajoskatona: yes. in both cases the version was bumped by one between July 11th and July 12th
14:24 <elodilles> from 5.4.0-121 to 5.4.0-122 ; from 4.15.0-188 to 4.15.0-189
14:25 <lajoskatona> elodilles: thanks
14:26 <lajoskatona> elodilles: I added this to the bug (https://bugs.launchpad.net/neutron/+bug/1982206 )
14:45 <elodilles> lajoskatona: ++
14:52 <fungi> you should be able to look at the ubuntu package changelogs to find out what "fixes" were included in -122 and -189, and if there's overlap that could be a clue. or maybe this was related to the libc update (the distro package updates could also just be a red herring)
14:54 <fungi> oh, or maybe the kernel package changelogs are effectively useless :/
14:54 <fungi> https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-signed/linux-signed_4.15.0-189.200/changelog
14:58 <fungi> i guess we'd need to figure out what patches were imported into the linux-signed source package between 4.15.0-188.199 and 4.15.0-189.200
14:58 <fungi> there's probably a git repo on lp for that
15:00 <fungi> https://code.launchpad.net/ubuntu/+source/linux-signed seems to be the place
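[editor's note: a small sketch of the changelog comparison fungi describes, reusing the changelogs.ubuntu.com URL pattern pasted above and the two bionic kernel versions named earlier. It assumes the older version's changelog is still published at the same path.]

```python
# Rough sketch: fetch the published changelogs for the two linux-signed
# versions and print their entry headings for a side-by-side look.
import requests

URL = "https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-signed/linux-signed_{version}/changelog"

def changelog_headings(version):
    """Return the entry heading lines (package/version) of a changelog."""
    text = requests.get(URL.format(version=version), timeout=30).text
    return [line for line in text.splitlines() if line.startswith("linux-signed ")]

for version in ("4.15.0-188.199", "4.15.0-189.200"):
    print(version, changelog_headings(version)[:3])
```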
15:01 <clarkb> dpawlik: I've mentioned it before but I wonder why you don't just send the json as is? I don't understand why you have to read the json then reformat it and send it out again
15:02 <fungi> https://git.launchpad.net/ubuntu/+source/linux-signed?h=ubuntu%2Fbionic-security
15:03 <clarkb> dpawlik: but I also don't really have the bandwidth to review that stuff. This is why the opendev team stopped running those services
15:05 <dpawlik> clarkb: there is also a json sent to a separate index
15:05 <dpawlik> clarkb: and for me it seemed obvious that if someone is going to continue working on making graphs based on the values I have prepared in logsender, it would be good to review
15:05 <dpawlik> but if not, ok, we can go as it is.
15:06 <fungi> i can't seem to find the kernel patches, even on the applied version of that branch
15:08 <clarkb> dpawlik: right, sending to a separate index is good (I think that allows you to manage data rotations independently for the different types of information and have longer/shorter retention as necessary). But what confuses me is why you need to deserialize and reserialize the document in a different format. Can you just take what the job is emitting and send it to opensearch? I also agree it is good to have reviews. The problem is, if I were someone who was able to do those reviews we wouldn't have needed to evict these services from opendev. I think you should be looking for help from the openstack project, which aimed to preserve this functionality
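[editor's note: for illustration only — not the ci-log-processing code — the "send the json as is" idea clarkb describes could look roughly like this with the opensearch-py client. The host, credentials, index name, and file path are made-up placeholders.]

```python
# Illustrative sketch: push a job's JSON artifact unchanged into its own
# OpenSearch index. All connection details below are placeholders.
import json
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "opensearch.example.org", "port": 9200}],
                    http_auth=("logsender", "secret"), use_ssl=True)

with open("performance.json") as f:   # hypothetical per-job JSON artifact
    document = json.load(f)

# Indexing the document as-is; a dedicated index lets its retention policy
# be managed independently from the regular log index.
client.index(index="performance-2022.07.20", body=document)
```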
15:08 <clarkb> fungi: iirc ubuntu does log them somewhere, but it is somewhere weird
15:08 <clarkb> or weird to me because I don't understand all the different branches and packages for the ubuntu kernels
15:10 <clarkb> fungi: http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.4.0-122.138/changelog
15:11 <fungi> https://ubuntu.com/security/notices/USN-5515-1
15:11 <fungi> i came at it from another angle
15:22 <clarkb> fungi: lajoskatona: also, I'm on a bit of a campaign to remind everyone who asks about failures without linking to a specific failure log to please do so :)
15:23 <lajoskatona> clarkb: ack, I'll keep it in mind
15:23 <clarkb> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bb1/periodic-stable/opendev.org/openstack/neutron/stable/victoria/openstack-tox-py38/bb13920/tmpkify9opp the truncated subunit log might be helpful too; it will show you which tests ran and in what order
15:24 <clarkb> that might single out a specific problematic test, or class of tests
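[editor's note: a sketch, under stated assumptions, of reading that (possibly truncated) subunit v2 stream with python-subunit and testtools to list tests in the order they ran. The file name is copied from the URL above; with a truncated stream, the last tests printed are the ones in flight when the job timed out.]

```python
# Sketch of listing the tests recorded in a subunit v2 stream.
import subunit
import testtools

tests = []

def record(test):
    # StreamToDict hands us one dict per test with 'id' and 'status' keys.
    tests.append((test["id"], test["status"]))

with open("tmpkify9opp", "rb") as stream:
    case = subunit.ByteStreamToStreamResult(stream, non_subunit_name="stdout")
    result = testtools.StreamToDict(record)
    result.startTestRun()
    case.run(result)
    result.stopTestRun()

for test_id, status in tests:
    print(status, test_id)
```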
15:26 <clarkb> looking at the console log, the last thing logged to the console was about 16 minutes prior to the timeout. Are your unit tests still running with internal test timeouts? If so, this implies whatever it is breaks that
15:26 <clarkb> if not, then maybe you should reenable those timeouts to see if they can help catch what is breaking
15:26 <clarkb> etc
15:30 <clarkb> looks like neutron may have actually removed the test timeout by default ...
15:30 <clarkb> on master it is only applied to the db migration tests?
15:30 <clarkb> this is why those timeouts exist: so that the code that creates the problem can be more readily identified
15:34 <lajoskatona> clarkb: you mean OS_TEST_TIMEOUT ?
15:35 <clarkb> lajoskatona: yes
15:35 <clarkb> but it seems like that is only applied to the db migration tests?
15:36 <clarkb> the original intent way back when was that it be applied globally, to catch test cases that locked up and hopefully provide some sort of feedback into where the lockup was
15:37 <clarkb> I don't know that it would help here, but the idea behind the global test timeouts is that it would
15:38 <lajoskatona> clarkb: I see it in tox.ini on master, and for functional/test_migrations, so yes
15:46 <clarkb> lajoskatona: one option to try may be setting a global test timeout of around 5 minutes (~1/3 of the delta between the last logging and the job timeout) and seeing if that produces any errors that are debuggable
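[editor's note: a minimal sketch of the global per-test timeout being suggested, mirroring the common oslotest pattern of arming fixtures.Timeout from OS_TEST_TIMEOUT in a shared base class. This is not neutron's actual base class.]

```python
# Minimal sketch: read OS_TEST_TIMEOUT from the environment and arm a
# per-test timeout fixture in a common test base class.
import os

import fixtures
import testtools


class BaseTestCase(testtools.TestCase):
    def setUp(self):
        super().setUp()
        try:
            timeout = int(os.environ.get("OS_TEST_TIMEOUT", 0))
        except ValueError:
            timeout = 0
        if timeout > 0:
            # gentle=True raises TimeoutException inside the test, so the
            # traceback points at whatever the test was stuck on.
            self.useFixture(fixtures.Timeout(timeout, gentle=True))
```

With something like that in place, setting OS_TEST_TIMEOUT=300 in the tox testenv would give roughly the 5-minute guard clarkb suggests.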
16:06 *** dviroel is now known as dviroel|lunch
16:11 *** ysandeep is now known as ysandeep|out
17:09 *** akekane_ is now known as abhishekk
17:23 *** dviroel|lunch is now known as dviroel
19:34 *** dviroel is now known as dviroel|afk
21:10 *** rlandy is now known as rlandy|biab
21:33 *** rlandy|biab is now known as rlandy
22:15 *** rlandy is now known as rlandy|bbl
23:30 *** rlandy|bbl is now known as rlandy
23:41 *** dasm|ruck is now known as dasm|off
