Monday, 2020-04-27

*** tosky has quit IRC00:09
*** sgw has quit IRC00:16
*** DSpider has quit IRC00:19
openstackgerritIan Wienand proposed opendev/glean master: Add container build jobs
openstackgerritMerged opendev/system-config master: send zuul link to opendev zuul
openstackgerritMerged opendev/system-config master: Cron module wants strings
openstackgerritMerged openstack/diskimage-builder master: Add sibling container builds to experimental queue
*** rkukura has quit IRC02:13
*** rkukura has joined #opendev02:27
ianwmordred: there is something going on with puppet apply where it's somehow restoring back to an old change02:49
ianwremote_puppet_else.yaml.log.2020-04-27T01:24:05Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[]/File[]/content: content changed '{md5}9185a2797200c84814be8c05195800fa' to '{md5}c9a8216d842c5c83e6910eb41d4d91ee'02:49
ianwremote_puppet_else.yaml.log.2020-04-27T01:35:36Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[]/File[]/content: content changed '{md5}c9a8216d842c5c83e6910eb41d4d91ee' to '{md5}9185a2797200c84814be8c05195800fa'02:49
ianwthe 01:24 run updated it, then the 01:35 run un-updated it, i think02:50
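[editor's sketch] The flip ianw describes, where one run changes the file from checksum A to B and the next run changes it from B back to A, can be spotted mechanically in the puppet logs. The helper below is only an illustration: the function name find_reverts is hypothetical, it assumes the "content changed '{md5}...' to '{md5}...'" notice format quoted above, and it does not distinguish between resources.

```shell
# Hypothetical helper (not part of system-config): read puppet log lines on
# stdin and report any "content changed" notice whose new checksum equals a
# checksum that an earlier notice moved away from (A -> B, later B -> A).
find_reverts() {
  sed -n "s/.*content changed '\({md5}[0-9a-f]*\)' to '\({md5}[0-9a-f]*\)'.*/\1 \2/p" |
    awk '($2 in from) { print "reverted to " $2 } { from[$1] = 1 }'
}

# The two notices quoted in the channel, fed in chronological order.
find_reverts <<'EOF'
remote_puppet_else.yaml.log.2020-04-27T01:24:05Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[]/File[]/content: content changed '{md5}9185a2797200c84814be8c05195800fa' to '{md5}c9a8216d842c5c83e6910eb41d4d91ee'
remote_puppet_else.yaml.log.2020-04-27T01:35:36Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[]/File[]/content: content changed '{md5}c9a8216d842c5c83e6910eb41d4d91ee' to '{md5}9185a2797200c84814be8c05195800fa'
EOF
```

Because the sketch ignores the resource path, it would need a per-resource key before being pointed at a log that covers many files.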
ianwdeploy 723282,1 9 mins 16 secs 2020-04-27T01:23:4402:51
clarkbianw: I think that's a zuul bug that corvus found02:51
ianwopendev-prod-hourly master 9 mins 16 secs 2020-04-27T01:35:1602:51
clarkbit uses the change merged against master and that is racy02:52
ianwclarkb: hrm, i think it was the opendev-prod-hourly that has seemed to revert the change, that should have seen the new change?02:54
ianwthe hourly job checked out system-config master to 2020-04-27 01:35:45.184204 | | 2e2be9e6873ffe7dd07d84792b2bbef47e901f02 Merge "Fix zuul.conf jinja2 template"02:55
clarkbhrm maybe another bug of similar variety?02:56
clarkblike maybe deploy ran out of order so deploy hourly ran head^ ?02:56
ianwif i'm correct in calculating merged at 2020-04-26 23:42 ... so several hours before the hourly job02:58
ianwgoing to see if i can come up with a timeline in
ianwthe next run, running now, appears to have applied it03:32
*** factor has joined #opendev03:34
*** ykarel|away is now known as ykarel04:30
openstackgerritMerged zuul/zuul-jobs master: Update ensure-javascript-packages README
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv
*** ysandeep|away is now known as ysandeep05:12
*** jaicaa has quit IRC05:18
*** jaicaa has joined #opendev05:20
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv
openstackgerritIan Wienand proposed openstack/diskimage-builder master: [wip] plain nodes
*** dpawlik has joined #opendev05:56
AJaegerinfra-root, I just saw a promote job fail with timeout uploading to AFS, see
ianwAJaeger: hrm, weird; i just checked that dir, and even touched and rm'd a file there and it was ok06:00
ianw /afs/
openstackgerritMerged openstack/project-config master: Add Airship subproject documentation job
AJaegerianw: might be a temporary networking problem  ;(06:25
openstackgerritAndreas Jaeger proposed openstack/project-config master: Stop translation stable branches on projects without Dashboard
*** iurygregory has quit IRC07:09
*** iurygregory has joined #opendev07:10
*** DSpider has joined #opendev07:22
*** rpittau|afk is now known as rpittau07:22
*** tosky has joined #opendev07:26
*** sshnaidm|afk is now known as sshnaidm07:35
*** ysandeep is now known as ysandeep|lunch08:16
*** logan_ has joined #opendev08:31
*** logan- has quit IRC08:32
*** logan_ is now known as logan-08:35
*** ykarel is now known as ykarel|lunch08:44
hrwzuul runs all using ansible. how to force it to use py3 on zuul?09:02
hrw2020-04-24 12:47:53.223078 | primary |   "exception": "Traceback (most recent call last):\n  File \"/tmp/ansible_pip_payload_Ffk1eE/\", line 254, in <module>\n    from pkg_resources import Requirement\nImportError: No module named pkg_resources\n",09:05
hrw2020-04-24 12:47:53.223192 | primary |   "msg": "Failed to import the required Python library (setuptools) on debian-buster-arm64-linaro-us-0016157969's Python /usr/bin/python. Please read module documentation and install in the appropriate location"09:05
fricklerhrw: just set it like this?
hrwfrickler: thx09:08
*** ykarel|lunch is now known as ykarel09:36
*** ysandeep|lunch is now known as ysandeep09:53
*** ykarel is now known as ykarel|afak10:31
*** ykarel|afak is now known as ykarel|afk10:31
*** rpittau is now known as rpittau|bbl10:32
*** ykarel|afk is now known as ykarel11:31
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Use cached buildset_registry fact
donnydJust an FYI OpenEdge is undergoing maintenance - shouldn't affect the CI - but in case it does you will know why11:35
*** smcginnis has quit IRC11:40
*** DSpider has quit IRC11:40
*** smcginnis has joined #opendev11:41
*** DSpider has joined #opendev11:41
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add haskell tool stack test
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx
*** ykarel is now known as ykarel|afk12:38
*** rpittau|bbl is now known as rpittau12:49
*** ykarel|afk is now known as ykarel12:52
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: omit variable instead of ignoring errors
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: WIP: omit variable instead of ignoring errors
openstackgerritMonty Taylor proposed opendev/system-config master: Use gitea for gerrit gitweb links
openstackgerritMonty Taylor proposed opendev/base-jobs master: Define an ubuntu-focal nodeset
openstackgerritMonty Taylor proposed opendev/system-config master: Test zuul-executor on focal
openstackgerritMonty Taylor proposed opendev/system-config master: Test zuul-executor on focal
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json
corvusmordred: hrm, it doesn't look like we have a working cert for yet14:11
fungicorvus: mordred: ianw spotted that it was getting overwritten and tried to put together a timeline in
fungithough i guess that was the redirect url itself, not the cert14:15
fungifor the link on status.o.o14:16
fungioh, right, ianw spotted that the acme challenge cname hadn't been created for it so added that14:17
fungibut also noted that the server is still in the emergency disable list so changes to it aren't getting applied14:17
fungiand was reluctant to take it out of the emergency disable list with nobody else around14:18
fungicorvus: mordred: are we clear to take back out of the emergency disable list in that case? there's no comment in the file saying why we disabled it and now i can't remember14:19
corvusfungi: i think we need
corvusotherwise the next config change will kill geard14:20
fungigot it, reviewing14:20
fungiahh, yep, i remember discussing this one14:20
corvusso it seems like we can merge that, then take the scheduler out of emergency, then run the letsencrypt playbook?  then run the zuul playbook?14:21
fungiseems that way to me, i just approved it moments ago14:22
mordredyes - I agree with all of that14:29
fungirelated, corvus: mordred seems to have addressed your comment on 72304814:30
fungimordred: i had a question on 723048 about use of sighup there... is that just sending hangup to the scheduler pid, and if so shouldn't we use the rpc client instead?14:31
mordredfungi: oh - HUP is probably bad there - maybe we don't need to do anything other than having docker-compose shut down the container?14:32
mordredah - graceful stop in the old init script was USR114:33
fungiyeah, if the goal was to stop the scheduler, then hup is not the thing14:33
mordredyeah- lemme update14:33
corvuswe don't do any graceful stops of the scheduler at the moment, only hard stops14:33
corvusmordred: so i think we just want the scheduler to stop in the normal way14:34
fungialso usr1 seems unlikely to be something we would want to use anyway14:34
fungibecause it could take hours to finish14:34
corvusyeah that14:34
openstackgerritMonty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker
mordredoh - yeah? ok. me just takes it out14:34
fungithough maybe once we have distributed scheduler, it's basically instantaneous/hitless?14:34
openstackgerritMonty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker
mordredhow's that look?14:35
*** mlavalle has joined #opendev14:41
fungiclarkb: cacti indicates we had a hard swap event on lists.o.o (severe enough to cause a 15-minute snmp blackout) around 12:2014:46
*** iurygregory has quit IRC14:47
*** iurygregory has joined #opendev14:48
fungioom knocked out 9 python processes between 12:26:48 and 12:33:1214:49
fungiprobably earlier in fact, that event seems to have overrun the dmesg ring buffer14:50
fungi11 "Killed process" lines recorded to syslog between 12:27:01 and 12:33:3014:51
fungii guess the timestamps embedded in the kmesg events are off by a bit14:52
fungioh wow, even dstat was stuttering14:54
fungitoward the worst, it was only managing to record roughly one snapshot a minute14:55
clarkbfungi: did we see mailman qrunner process memory change upwards in that period?14:59
openstackgerritMerged opendev/system-config master: Run smart-reconfigure instead of HUP
clarkbalso we should cross check with that robot too maybe?14:59
fungii'm working to understand the fields recorded in the csv14:59
fungilooks like the last two fields are process details15:00
fungiahh, no, the final fields are ,"process              pid  cpu read write","process              pid  read write cpu","memory process","used","free"15:02
fungii guess those correspond to --top-cpu-adv --top-io-adv --top-mem-adv and so "memory process" is the field we care about there?15:02
clarkbApr 27 12:25:54 is when OOM killer was first invoked looks like15:03
clarkbfungi: ya I think memory process is the most important one15:03
clarkbthe others probably have useful info too like who was busy during the lead up period15:03
clarkbfungi: looks like that same bot is active around the OOM15:05
clarkbI kinda want to add a robots.txt that tells it to go away and see if we have a behavior change15:05
fungiso going into this timeframe, we had 12:20:00 13543 qrunner / 40660992%15:05
clarkbfungi: note the % is a bit weird. It's actually just bytes. So that's 40MB ish which isn't bad15:06
fungias of 12:25:32 16053 listinfo / 50327552%15:06
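[editor's sketch] As clarkb notes, the figure dstat prints with a trailing % in the "memory process" column is really a byte count. A small conversion makes the two readings quoted above easier to compare. The function name bytes_to_mib is hypothetical and the divisor assumes binary mebibytes (1048576 bytes).

```shell
# Hypothetical helper: convert a raw byte count to mebibytes with one
# decimal place.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.1f MiB\n", b / 1048576 }'
}

bytes_to_mib 40660992   # qrunner reading at 12:20:00
bytes_to_mib 50327552   # listinfo reading at 12:25:32
```

Both values come out well under 50 MiB, which is consistent with the reading that no single process was hoarding memory.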
fungiand kswapd0 was the most active cpu and i/o consumer15:07
clarkbfungi: what that is telling me is we don't have a single process which is loading up on memory.15:07
clarkbwhich makes me more suspicious of apache15:07
clarkbfungi: also we seem to be using mpm_worker and not mpm_event in apache15:09
clarkblikely a holdover from upgrading that server in place15:09
clarkbiirc mpm event is far more efficient memory wise because it doesn't fork for all the things?15:09
clarkbmaybe we should try switching that over too15:09
fungican't hurt15:09
fungianyway, i'm going to restart all the mailman sites... we talked about wanting a reboot of this server anyway, should i just go ahead and do that?15:10
fungiand then set the dstat collection back up (and rotate the old log)15:10
clarkbfungi: ya a reboot seems like it would at least help rule out older kernel bugs (if that is a possibility here)15:11
clarkbI seem to recall that xenial kernel of some variety didn't handle buffers and caches properly15:11
clarkband then we need to stop apache2, a2dismod mpm_worker, a2enmod mpm_event, start apache?15:12
clarkbmordred: maybe we should encode that into unit files then systemctl works and ansible can just ensure a service state?15:13
clarkb(I realize that will take a bit more work to get the systemd incantations correct, but our testing should help with that)15:14
fungilists.o.o is currently booted with linux 4.4.0-145-generic with an uptime of 380 days and will be booting linux 4.4.0-177-generic15:17
fungii've checked and apt reports no packages pending upgrade15:17
fungireboot underway15:17
fungitaking a while to come back up, probably either a pending host migration or just overdue fsck15:19
openstackgerritMerged opendev/base-jobs master: Define an ubuntu-focal nodeset
clarkbfungi: seems like that's pretty normal for us :/15:20
fungiwhen you go that long between reboots, yes15:20
fungiit came back15:21
fungi41 qrunner processes running according to ps15:21
clarkbI see a bunch of mailman processes. It looks happy15:21
fungiso seems like the sites all started back as expected15:21
clarkbfungi: are you wanting to do the apache thing? or should I plan to do that after breakfast? I'm happy either way, just don't want to step on toes15:23
fungi#status log rebooted for kernel update15:23
openstackstatusfungi: finished logging15:23
fungi#status log running `dstat -tcmndrylpg --tcp --top-cpu-adv --top-mem-adv --swap --output dstat-csv.log` in a root screen session on lists.o.o15:23
openstackstatusfungi: finished logging15:23
corvusmordred, fungi: i think we're ready to remove zuul from emergency and run some playbooks?15:23
fungiclarkb: i need to switch gears to do some openstack vmt stuff shortly, but can try to get to it later, or we can just observe first and see if the oom situation persists since the reboot15:24
openstackgerritMerged zuul/zuul-jobs master: hlint: add haskell source code suggestions job
corvusi think so, so i'll do that15:25
*** ysandeep is now known as ysandeep|away15:25
fungicorvus: i think so, 723107 merged ~25 minutes ago15:25
corvusrunning le playbook now15:25
fungioui, c'est bon15:27
corvusmordred, fungi, ianw:  lgtm now15:30
corvuslooks like i don't need to run the zuul service playbook15:30
clarkbfungi: ya I'm mostly just suspicious of apache right now given the qrunner sizes don't go up when we oom and we have an indexer bot running through apache at around that same period15:35
fungioh, me too. if you look back at the cacti graphs, once it's able to get snmp responses again the 5-minute load average is still >5015:37
fungiso likely lots and lots of processes15:37
clarkbfungi: did my process above look correct to you for using mpm event? I've also double checked that our other xenial hosts are using apache + mpm_event and not worker15:38
fungiwhich could be the mta or mailman handling a bunch of messages, but probably it's apache forking15:38
*** _mlavalle_1 has joined #opendev15:38
fungiclarkb: yeah, i guess the current puppet-mailman isn't picking an mpm for apache and like you say we've inherited a non-default one due to in-place upgrades15:39
fungiyour command sequence looks right to me15:40
clarkbya I think we've basically just relied on platform defaults. Unfortunately the platform default has stuck around too long in this case15:40
*** mlavalle has quit IRC15:40
clarkbinfra-root ^ I'd like to switch apache2 from mpm_worker to mpm_event on lists.o.o. Plan is stop apache2; a2dismod mpm_worker; a2enmod mpm_event ; start apache2. This gets it in line with our other apache servers. I'll do that shortly after some tea. Let me know if you'd like me to hold off15:41
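[editor's sketch] The plan above (stop apache2, a2dismod mpm_worker, a2enmod mpm_event, start apache2) could be scripted roughly as follows. This is a sketch under assumptions, not what was actually run on lists.o.o: the function names are hypothetical, systemctl is assumed as the service manager, and the apache2ctl configtest step is an extra safety check that is not part of the plan as stated. It defaults to a dry run that only prints the commands.

```shell
# Hypothetical wrapper: print the command in dry-run mode (the default),
# execute it only when DRY_RUN=0 is set explicitly.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Switch Apache from the worker MPM to the event MPM.
switch_mpm() {
  run systemctl stop apache2
  run a2dismod mpm_worker
  run a2enmod mpm_event
  run apache2ctl configtest   # added check, not in the plan quoted above
  run systemctl start apache2
}

switch_mpm
```

Running it as root with DRY_RUN=0 would perform the switch for real; afterwards `apache2ctl -M` lists the loaded modules, so the active MPM can be confirmed.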
corvusclarkb: ++15:42
fungithanks clarkb!15:42
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json
clarkbalright tea is consumed proceeding now15:49
clarkband done. website seems to respond to my browser15:50
*** ykarel is now known as ykarel|away15:50
clarkb#status log Updated to use Apache mpm_event instead of mpm_worker. mpm_worker was a holdover from doing in place upgrades of this server. All other Xenial hosts default to mpm_event.15:51
openstackstatusclarkb: finished logging15:51
clarkbfungi: then assuming the OOM persists tomorrow maybe we try a robots.txt and exclude this particular bot?15:52
*** sshnaidm is now known as sshnaidm|afk15:53
AJaegerdo we still use proposes to kill the only place that I could find...15:56
clarkbAJaeger: we do not15:56
clarkbAJaeger: do you also want to remove from that list?15:57
clarkbit's the line above the block you removed15:57
AJaegerclarkb: sure, can do...15:57
AJaegerI thought that was in use, so was not sure whether we need it...15:57
clarkbAJaeger: it exists as a redirect host on, but I don't think we need cacti data for it15:58
clarkb(since it is just a CNAME to static in dns)15:58
AJaegerAh, good15:58
openstackgerritAndreas Jaeger proposed opendev/system-config master: Remove git*
AJaegerclarkb: updated ^15:59
*** rpittau is now known as rpittau|afk16:07
redrobotWould love another set of eyes on this change:
clarkbcorvus: mordred ^ are we good to add new git repos or do zuul things still need updating?16:09
corvusclarkb: i think we're good16:09
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json
mordredclarkb: yeah- I think we're good16:13
clarkbcorvus: mordred did you see the thing from ianw yesterday about periodic deploy and on demand deploy undoing each other?16:15
fungiexample was in that etherpad i linked earlier16:15
clarkb appears to be where notes were taken16:15
corvusdid it perhaps enqueue the item before 1:23, thereby enqueuing the old ref?16:17
openstackgerritMerged opendev/system-config master: Rework zuul start/stop/restart playbooks for docker
*** iurygregory has quit IRC16:17
corvus(rather, not the old ref, but the state of the repo at that point in time)16:17
clarkblooks like hourly starts at the top of the hour with a 2 minute jitter16:18
mordredclarkb: yeah - I was thinking it might be what corvus said16:18
clarkbso ya it would've enqueued before the non hourly job ran at least. Not sure when the non hourly job enqueued though16:18
corvusclarkb: well the non-hourly job would have gotten that specific change16:18
clarkbshould the hourly jobs maybe not use the zuul provided ref? and always update?16:19
corvusso basically, we enqueued an hourly job and froze the repo state for it, then probably due to load didn't get around to running it for a while16:19
mordredmaybe - I believe the intent of the hourly jobs is "run with the tip of master when you run" as opposed to "run with the tip of master when you are enqueued" - so maybe putting in a flag we can set on the hourly pipeline invocation of the job that would cause the playbooks to do a pull from opendev first?16:20
mordred(the maybe there is in response to clarkb's "should the hourly jobs...")16:20
openstackgerritMerged openstack/project-config master: Revert "Disable ovn pypi jobs temporarily"
fungior should there be a way to tell zuul you want timer triggered jobs to have their heads resolved when started rather than when enqueued? that may be tough to pull off though16:22
*** hrw has quit IRC16:22
corvusyeah, that's an intentional design decision to ensure that all jobs in a buildset run with the same repo state16:22
*** elod has quit IRC16:22
*** hrw has joined #opendev16:22
mordredin a magical world it would be neat to be able to have a periodic pipeline that only triggers if there has been no corresponding activity in a different pipeline for X duration. I have no idea what that would look like, and would probably require v4 and required db16:23
corvusso i don't think changing zuul is appropriate here16:23
*** elod has joined #opendev16:23
fungii just get a little twitchy with jobs working around zuul's git handling, but maybe this is one of those circumstances where it's the better solution16:24
mordredcorvus: what do you think about a "pull latest from opendev" flag for the run-base playbook?16:24
corvusmaybe a pull in the job is the best workaround here -- other than minimizing what we actually need the hourly pipeline for16:24
corvus(eventually, it should just be for letsencrypt, right?)16:24
mordredcorvus: I think we mostly have hourly pipeline for things that are using images but that we don't have a way to trigger otherwise16:24
mordredso that we don't have to wait for a day to pick up a new zuul image or similar16:24
mordredbut I agree with the goal - it would be great to have only LE in there16:25
mordredI can work up a "pull from upstream" flag if we think that's an ok workaround for now16:25
corvussounds reasonable to me16:25
fungiyeah, it seems like the most straightforward solution at this point16:26
*** tobiash has quit IRC16:26
*** prometheanfire has quit IRC16:26
*** calcmandan has quit IRC16:26
*** noonedeadpunk has quit IRC16:26
*** jkt has quit IRC16:26
*** dirk has quit IRC16:26
*** AJaeger has quit IRC16:26
*** tobiash has joined #opendev16:32
*** prometheanfire has joined #opendev16:32
*** calcmandan has joined #opendev16:32
*** noonedeadpunk has joined #opendev16:32
*** jkt has joined #opendev16:32
*** dirk has joined #opendev16:32
*** AJaeger has joined #opendev16:32
fungiyoctozepto: fdegir: did your git problems with repos persist into today or did they mysteriously clear up?16:36
yoctozeptofungi: I didn't do much today regarding clone/pull operations so hard to tell; assume they did ;-)16:37
*** ChanServ has quit IRC16:42
fungiyoctozepto: thanks, hopefully it was just some temporary network problem somewhere out on the internet16:42
*** ChanServ has joined #opendev16:45
*** sets mode: +o ChanServ16:45
*** _mlavalle_1 has quit IRC17:09
*** mlavalle has joined #opendev17:11
openstackgerritMerged openstack/project-config master: Define stable cores for horizon plugins in neutron stadium
openstackgerritMerged openstack/project-config master: Add Portieris Armada app to StarlingX
*** tobiash has quit IRC17:26
*** prometheanfire has quit IRC17:26
*** calcmandan has quit IRC17:26
*** noonedeadpunk has quit IRC17:26
*** jkt has quit IRC17:26
*** dirk has quit IRC17:26
*** AJaeger has quit IRC17:26
*** tobiash has joined #opendev17:29
*** prometheanfire has joined #opendev17:29
*** calcmandan has joined #opendev17:29
*** noonedeadpunk has joined #opendev17:29
*** jkt has joined #opendev17:29
*** dirk has joined #opendev17:29
*** AJaeger has joined #opendev17:29
openstackgerritMerged openstack/project-config master: Add ansible role for managing Luna SA HSM
fdegirfungi: i noticed similar issues today as well so I had to switch to mirrors17:38
*** ChanServ has quit IRC17:39
fungifdegir: and this was cloning over https via ipv4?17:40
*** ChanServ has joined #opendev17:41
*** sets mode: +o ChanServ17:41
fdegirfungi: yes and i just started another set of clones manually right now and it's hanging - will probably timeout17:42
fdegirCloning into 'shade'...17:42
fdegirand just waits17:42
fungii'll switch to trying shade in that case. and see about forcing my testing to go on ipv4 instead of ipv617:43
fdegirfungi: as i noted yesterday, it could be another repo next time17:44
fdegirfatal: unable to access '': Failed to connect to port 443: Connection timed out17:44
fungilooks like my git client is new enough to support `git clone --ipv4 ...`17:44
fdegirfungi: testing the repos bifrost clones during its installation:
fdegirnow requirements hanging17:45
fungii've got a loop going on my workstation now like `while git clone --ipv4;do rm -rf shade;done`17:45
fdegirfungi: if it helps, i can keep this thing running and you can look at logs17:46
fdegiri can pass my public ip if it helps17:46
fungifdegir: yes, i can check our load balancer for any hits from your ip address, though if a connection failed to reach the load balancer that will be hard to spot17:46
fungiideally devstack is timestamping when it tries to clone17:47
fdegirfungi: we don't use devstack17:47
fungiahh, okay, the other problem report was from a devstack user17:48
fdegirfungi: yes - seeing that made me realize it may be an issue on gerrit side17:48
fdegiroriginally i thought i had issues but that bug report made me report as well17:48
fungiif you have a timestamp for when one of the failed clone commands was attempted i can hopefully work out whether any connections arrived at the load balancer from you at that time17:48
fungii have exact times for every request which reached the lb from you and what backend they were directed to17:49
fdegirfatal: unable to access '': Operation timed out after 300029 milliseconds with 0 out of 0 bytes received17:49
fungibut obviously if a connection attempt doesn't reach us that won't be logged at our end17:49
fdegiri don't have timestamps as we didn't enable timestamping on our jenkins17:50
fdegirand can't check the slaves since we use openstack single use slave17:51
fdegirCloning into 'python-ironicclient'...17:52
fdegirso it is totally random17:52
fungithese are the connections haproxy saw from your ip address:
fdegirneed to have dinner now17:53
fdegirwill be back later tonight17:53
fungicool, thanks!17:53
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json
fungilooks like those are all being directed to, though i may have trouble mapping them to specific requests as haproxy just operating as an osi layer 4 socket proxy isn't doing any transparent forwarding17:54
fungiso far my shade clone loop isn't encountering any issues, but i'll switch to hitting gitea04 directly to see if that has anything to do with it17:55
clarkbfungi: note that you will only hit a single backend using that url17:55
funginow trying this: while git clone --ipv4;do rm -rf shade;done17:56
clarkband ya mapping onto gitea side requests can be a pain17:57
*** factor has quit IRC17:58
clarkbfungi: gitea04 shows things like [E] Fail to serve RPC(upload-pack): exit status 128 - fatal: the remote end hung up unexpectedly17:58
fungifdegir: when you're back from dinner, it might be helpful if you could try with a simple reproducer like that from the network where you're seeing that, and then maybe also try from another location if you can, so we can tell whether it's specific to the location you're coming from. if it is, then we can compare traceroutes in both directions and possibly start to work on correlating where the problem might17:59
dpawlikhi. If we would like to switch in openstack/validations-common from testr to stestr, ( requirements-check CI job is raising error that stestr not found in lower-constraints. Is something else that I need to configure or just add stestr==3.0.1 to lower-constraints.txt ?17:59
fungiclarkb: yeah, that could be due to a number of reasons, sounds like typical premature socket termination17:59
fungidpawlik: i'm not sure, you may be better off asking in #openstack-requirements as it's probably more on topic there18:00
clarkbfungi: fdegir the other thing that might be useful is talking to a backend (or all 8 backends) directly18:00
dpawlikthank you fungi18:00
clarkbthey all have valid tls certs and are exposed publicly;18:00
fungiclarkb: yeah, that's what i'm suggesting, test in a loop to one backend directly so we can rule out the source hash directing some clients to good backends and others to bad18:01
fungithe while shell loop i pasted just above is exactly that18:01
clarkb shows spikes in failed tcp connection attempts which may be related18:02
clarkbasymmetric routing could cause that18:02
fungifwiw, i've been running that continuously for nearly 10 minutes, and it's taking roughly 10 seconds each iteration, so far no errors18:03
clarkbfungi: the other upside to directly connecting to the backends is we'll be able to filter logs for that more easily18:03
*** factor has joined #opendev18:04
fungialso the ip address fdegir provided me looks like it's in the kna3 pop18:06
*** mehakmittal has joined #opendev18:06
fungior at least that's the subdomain in reverse dns on the last named core router in my traceroutes to it, but there are a couple hops after that with no ptr records on their serial interfaces18:07
clarkb`while true; do for X in `seq 1 8` ; do echo $X ; rm -rf shade-gitea0$X && git clone https://gitea0$ shade-gitea0$X ; done ; done` I'm running that now just to see if I can trip it. note you should set -x it18:07
clarkber set -e18:07
clarkbfungi: we have a mirror in that location; we could try to reproduce from there18:08
fungimy thoughts as well18:08
funginote that fdegir was seeing it over ipv4, so you might want to add --ipv4 to your git clone command to reproduce faithfully18:08
clarkbfungi: I've only got ipv4 locally18:09
fungioh, :( for you18:09
clarkbif I want ipv6 I have to explicitly bounce through my ipv6 cloud node18:09
fungimy backwater cable provider has finally managed to do a decent job with v6 prefix delegation over dhcp618:10
clarkbfungi: my ISP just got bought (effective may 1st iirc). The new company is soliciting questions about the move and I asked if they planned to roll out ipv6. The answer I got did not give me confidence they even know what ipv6 is18:10
clarkbno errors yet in that for loop. I'm going to knead some bread now18:11
fungihave fun!18:11
fungilooks like our mirror server is in kna1 not kna3, but maybe they share the same core18:12
fungitesting from mirror01.kna1.citycloud.openstack.org18:14
fungiinteresting, that server routes outbound from an rfc-1918 address, presumably through a fip18:15
fungii've got a clone loop of shade from gitea04 underway on it now, seeing around 4s for each clone to complete18:16
fungiooh! i've hit it!!!18:17
fungithis definitely seems to be client location specific18:17
fungii'll reproduce again with a brief sleep between attempts and some timestamping18:18
*** mehakmittal has quit IRC18:20
*** mehakmittal has joined #opendev18:21
fungii've got this running in a root screen session on now: while :;do sleep 10;echo -n 'start ';date -Is;git clone;echo -n 'end ';date -Is;rm -rf shade;done18:22
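[editor's sketch] One variation on the timestamped loop above bounds each attempt with a timeout and logs the exit status, so a hung clone ends up as a line in the output instead of blocking the loop. This is an illustration rather than what fungi ran: timed_attempt is a hypothetical name, it relies on GNU coreutils timeout and date -Is, and the repository URL (elided in the log) has to be supplied by the reader.

```shell
# Hypothetical helper: run a command with a time limit in seconds, printing
# ISO 8601 start/end timestamps and the exit status. GNU timeout exits with
# status 124 when the limit is hit.
timed_attempt() {
  limit=$1
  shift
  echo "start $(date -Is)"
  status=0
  timeout "$limit" "$@" || status=$?
  echo "end $(date -Is) status=$status"
  return "$status"
}

# Harmless demonstration that does not touch the network.
timed_attempt 5 true
```

In a reproducer loop this might be used as `timed_attempt 300 git clone --ipv4 "$REPO_URL" shade; rm -rf shade; sleep 10`, where REPO_URL is a placeholder for the backend being tested. Any attempt ending with status=124 marks a time window to look up in the gitea and haproxy logs.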
fungithe spacing should make it easy to find in gitea's web log18:22
*** muskan has joined #opendev18:22
fungii also confirmed the timestamps on that server seem to be accurate18:23
fungian attempt to clone just now at 2020-04-27T18:23:35 seems to be hanging18:23
fungiyeah, still hanging, this is good!18:24
fungidoing `docker-compose logs|grep` as root on gitea04 now with pwd /etc/gitea-docker18:27
fungihopefully that's the correct thing18:27
fungiclone started at 18:25:53 is still hanging18:27
clarkbfungi: do you see it show up on the gitea side?18:28
fungialso started a ping from citycloud to to see if there's any obvious packet loss18:28
clarkbmy clone loop from home is still running successfully18:28
fungigitea-web_1  | [Macaron] 2020-04-27 18:23:20: Started POST /openstack/shade/git-upload-pack for
fungithat's the last recorded entry for
fungii wonder if the timestamps from gitea are accurate18:29
clarkbok so we get far enough to start the upload-pack but then packets maybe disappear? we can tcpdump those to see what is going on at lower level maybe?18:29
clarkbfungi: it records the start and end timestamps18:29
clarkbas separate entries18:29
fungithere was a clone from that address which started at 2020-04-27T18:23:35 and ended at 2020-04-27T18:25:4318:29
fungiand was successful18:30
clarkbfungi: also maybe have mirror.kna1 fetch resources from mirror.sjc1?18:30
clarkbfungi: and see if we can get it to fail doing more basic http requests18:30
fungino, my bad, that one timed out18:30
fungilast successful clone started 2020-04-27T18:23:18 and ended at 2020-04-27T18:23:2518:30
fungiso i think the connection is never established18:31
fungii'll switch to tcpdump next18:31
yoctozeptofungi: actually kolla CI had issues with opendev:         "    \"msg\": \"Failed to download remote objects and refs:  fatal: unable to access '': Failed to connect to port 443: Connection timed out\\n\"",18:32
yoctozepto Mon Apr 27 12:45:20 202018:32
clarkbyoctozepto: it's likely the same issue if it's some transatlantic routing problem (or similar)18:32
fungiyoctozepto: that (connection timed out) sounds like what we're seeing then18:32
clarkbbut also please don't talk to gitea in zuul jobs18:32
clarkbzuul should provide everything you need18:32
fungithe launchpad bug opened for devstack yesterday indicated a "connection refused" error18:32
fungiyoctozepto: but since the job did connect to opendev, can you let us know where that failure ran?18:33
clarkbthat job ran in kna1 too18:34
clarkbso ya likely the same issue18:34
*** iurygregory has joined #opendev18:34
fungiprovider: airship-kna118:34
fungistarting to suspect this may be a citynetwork issue18:35
clarkbI'm going to stop my local clones now that we have narrowed this down with an ability to debug18:35
clarkbmy local clones did not have any problem18:35
clarkbfungi: are you testing kna to all gitea backends or just 04?18:35
fungiclarkb: just gitea0418:36
clarkbfungi: might be worth checking if it is all 8  (if its a bitmask problem or something like that then some may work while others dont)18:36
fungiyeah, have definitely seen that in the past when you have flow-based distribution routing hashed on addresses and one of your cores is blackholing stuff18:37
fungiokay, i have tcpdump running in a root screen session on gitea04 streaming to stdout and filtering for the kna1 mirror's ip address18:38
yoctozeptoclarkb: ack, it's actually bifrost that talked to it and we have little control over it (it is to be deprecated and replaced by a kolla-containerised solution when time allows - hopefully soon)18:38
fungiassuming this is reproducible, mnaser may want to get in touch with the network folks at citynet18:39
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: Use 'block: ... always: ...' instead of ignore_errors
fungithey'll probably have a faster time of working out the connectivity issues18:39
fungitcpdump is definitely capturing packets on successful clone runs18:40
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: ensure-sphinx: use failed_when: false instead of ignore_errors: true
fungias soon as i snag another hung clone i'll be able to work out whether the tcp/syn ever arrived. if it does and a response is generated, i'll probably need to start up a similar tcpdump on the mirror server to see if the syn+ack ever arrives18:41
fungiokay, caught one18:42
fungistart 2020-04-27T18:41:46+00:0018:42
fungiand last packet to arrive at gitea04 was 18:42:01.640261 IP > (end of the previous completed clone)18:43
clarkbfungi: so not even getting the SYN18:44
fungithat's how it's looking to me18:44
fungimy 1k echo slow ping is just about to wrap up and i can get some icmp delivery stats18:44
fungi1000 packets transmitted, 1000 received, 0% packet loss, time 999929ms, rtt min/avg/max/mdev = 176.894/177.360/267.018/3.398 ms18:45
fungiso icmp doesn't seem impacted18:45
fungii could probably install hping or something to do syn/syn+ack pings but may be best if we just hand this off to mnaser and whoever we usually talk to at citycloud18:46
*** dpawlik has quit IRC18:47
fungithough first i guess we can try some connections to other places from citycloud if we want18:47
clarkbfungi: ya I think so. Especially since its the initial SYN disappearing18:47
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fo: Use 'block: ... always: ...' and failed_whne instead of ignore_errors
fungimaybe best to do an easier reproducer with nc or something18:47
clarkbor even just ping?18:47
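A simpler reproducer along the lines fungi suggests could be sketched in Python rather than nc. This is only an illustration, not what was run during the incident: the function name `probe` and its parameters are hypothetical, and the target host and port are left to the caller (the log mentions 3000/tcp on a gitea backend). It repeatedly performs a TCP handshake and counts how many attempts time out, which is the symptom a lost initial SYN would produce.

```python
import socket
import time


def probe(host, port, attempts=20, timeout=5.0, pause=0.0):
    """Open and close TCP connections, tallying the outcome of each attempt."""
    results = {"ok": 0, "timeout": 0, "error": 0}
    for _ in range(attempts):
        try:
            # create_connection performs the full TCP handshake; if it does
            # not complete within `timeout` seconds, socket.timeout is raised.
            with socket.create_connection((host, port), timeout=timeout):
                results["ok"] += 1
        except socket.timeout:
            results["timeout"] += 1
        except OSError:
            # Any other failure, e.g. connection refused or unreachable.
            results["error"] += 1
        time.sleep(pause)
    return results
```

Calling something like `probe(host, 3000, attempts=1000)` from the affected mirror and from an unaffected one would give comparable timeout counts. Unlike ICMP ping, every attempt creates a new flow, so it exercises whatever per-flow state (NAT, hashing) sits on the path.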
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: go: Use 'block: ... always: ...' and failed_when instead of ignore_errors
mnaserlet me try and ping people..18:48
clarkbfungi: fwiw tobias is usually who I email18:48
fungithanks mnaser! i know you know some folks there18:48
clarkband mnaser is always around :)18:48
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: ara-report: use failed_when: false instead of ignore_errors: true
fungiyeah, i don't usually see tobberydberg around in irc18:50
fungioh, he's actually in #openstack-infra at the moment18:51
fungibut anyway, sounds like maybe mnaser has this well in hand18:51
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: k8-logs: use failed_when: instead of ignore_errors:
mnaseranything in specific i can forward?18:51
mnaserit looks like hitting is timing out?18:51
fungimnaser: we're getting reports from users of citycloud (including ourselves) that a small percentage of tcp connections from kna to servers we have in your sjc location have their initial tcp/syn packet never make it18:53
fungithe result is "connection timed out" for some tcp sockets18:53
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: container-logs: use failed_when: instead of ignore_errors:
fungigenerally manifesting so far in `git clone` connections for the gitea load balancer (though we've reproduced it with direct connections to the backend as well)18:54
fungian example is in citycloud (a fip for stalls attempting to establish a socket to 3000/tcp18:55
fungimost connections attempts are fine, but sometimes the initial tcp/syn packet from never makes it to according to tcpdump listening on the destination18:56
mnaserfungi: wonderful, thank you, i handed that over18:56
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: tox: Use 'block: ... always: ...' instead of ignore_errors
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: ensure-sphinx: use failed_when: false instead of ignore_errors: true
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: go: Use 'block: ... always: ...' and failed_when instead of ignore_errors
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: ara-report: use failed_when: false instead of ignore_errors: true
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: fetch-subunit-output: use failed_when: instead of ignore_errors:
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: add-build-sshkey: use failed_when: instead of ignore_errors:
mnaseri dont have an ack from them but its 9pm-ish so18:57
fungiyep, it's likely not urgent18:57
fungifdegir: ^ to summarize, we think there's something going on between citycloud kna and vexxhost sjc, likely close to (or maybe even inside) the citycloud side of the connection18:58
fungimainly because i've so far been unable to reproduce from elsewhere, though i'll try to test from citycloud lon and ovh gra just to get some more transatlantic datapoints19:00
fdegirthanks for the debugging fungi19:04
fdegiri was thinking london but instead try frankfurt and stockholm regions19:05
fdegirplus the us one perhaps19:05
fungiwe conveniently already have servers in kna and lon which is why i tested those19:05
fungiso far i'm not able to reproduce from citycloud lon nor from ovh gra19:06
fungiso i have doubts it's a general transatlantic issue19:06
fungiwere your connections coming from directly-addressed servers, or through a (layer 3 or 4) nat?19:07
fungiall our systems in citycloud are behind fips, so that could be a common factor too19:07
fdegirsame as your systems19:07
fdegirwe are running in kna as well19:07
fungiyeah, so *could* just be their nat layer is overrun in that pop19:07
fungiand some new flows are getting dropped19:08
fungifdegir: if you have quota you can shift to one of their other pops, that might be a workaround for you19:09
openstackgerritMerged zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src true
fdegirfungi: i think we do and can try moving to london19:10
fungifdegir: if that solves it for you, that'll also be a useful datapoint for us19:10
fdegirfungi: this was really helpful as i was puzzled and searching opendev/openstack-infra maillists to see if there was a planned maintenance19:10
fungii'm putting together bidirectional traceroutes now to see if they're symmetrical19:11
fdegirfungi: will let you know when i do that but it may not happen tomorrow19:11
clarkbfungi: the nat on our mirrors is 1:119:12
fungiclarkb: yep, but it may very well be the same systems doing the binat and the overload pat19:13
clarkbbut I suppose if global tables are full that wont help much19:13
*** muskan has quit IRC19:14
fungiboth vexxhost and citynetwork seem to be peering with cogent and preferring them, though from kna3 the traceroute seems to go through citynetwork sto2/cogent sto03 peering, while on the way back from vexxhost packets are arriving at the cogent lon01/citynetwork lon1 peering and then traverse sto2 to kna319:17
fungiso basically symmetrical on the vexxhost end but somewhat asymmetric on the citynetwork end19:19
openstackgerritMerged zuul/zuul-jobs master: fetch-sphinx-tarball: Do not keep owner of archived files
fungitesting with our mirror in lon1, routing is (unsurprisingly) fully symmetrical at least to the pop level19:20
fungiso this suggests the problem is likely in citynetwork kna3 or sto219:20
fungior possibly cogent sto0319:21
fungigiven i can't reproduce the issue from lon, which is following basically the same routes through cogent's core19:22
*** mehakmittal has quit IRC19:22
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Set owner to executor user
mnaserfungi: seems like they were having some similar issues -- "As for traffic to regardless of which transit provider I push traffic through there is packetloss far out on the net. So they are most likely having issues (or their transit(s))"19:40
mnaserwe use cogent at sjc1 so i guess that theory may add up19:40
mnaserthere is other transit but, yeah.19:41
mnaseronly reported issue is "Welcome to the Cogent Communications status page.  Some customers may be seeing latency between Singapore and Hong Kong due to a submarine fiber issue. At this time there is no ETR. The ticket for this issue is HD11103479."19:42
mnaserfungi: can we have mtr running from kna1, that info may be useful to reach out to transit19:43
clarkbmnaser: do you want the pings too or is a simple traceroute sufficient?19:43
mnaserclarkb: traceroutes is usually what makes transit providers happy19:44
clarkbwell thats a first, traceroute isn't installed but mtr is19:46
clarkbthat 486 rtt to first router is rough19:51
openstackgerritMerged zuul/zuul-jobs master: tox: allow running default envlist in tox
openstackgerritMerged opendev/gerritlib master: Use ensure-* roles
fungiclarkb: an mtr in the other direction would probably also be good20:08
fungisometimes when you see a jump like that, it's the point of convergence for an asymmetric route where the return path is going through a significant latency increase somewhere else many hops out20:09
clarkbfungi: k let me install mtr on opendev lb20:09
clarkbit doesn't have traceroute either20:09
fungithough also since the hops after that one are lower latency, it could just be that router is under load and deprioritizing icmp messages20:10
funginot at all uncommon20:10
fungiespecially since it looks like it's probably their datacenter distribution layer20:11
clarkbfungi: mnaser the other direction20:16
clarkbI used traceroute there because I had to install either it or mtr and mtr has a million deps20:16
corvusclarkb: are we running an apache on zuul01 now?20:34
clarkbcorvus: system-config/playbooks/roles/zuul-web seems to imply we are but I haven't double checked yet20:36
clarkbyup seems that we are20:36
clarkbI'm thinking maybe we want to compress the javascript html and css resources20:36
corvusclarkb: it looks like apache is still configured to serve out of /opt/zuul-web-content20:37
corvuswhich is making me wonder if we're positive anything has changed?20:38
clarkbcorvus: the main reason I thought we had changed was the headers for main.js in my browser came from cherrypy20:40
clarkbit sets the server: header20:40
clarkbwe rewrite /.* to localhost:9000/.*20:41
clarkbalso it doesn't seem like the deflation of status.json is actually working. If I request it with accept-encoding: deflate set I get back plain text20:41
clarkbthis might need a bit more in depth debugging20:42
corvushrm, the timestamps on the apache config files are old though20:42
corvusclarkb: i don't see "/.* to localhost:9000/.*"20:43
clarkbcorvus: thats in the zuul role20:43
clarkbcorvus: 000-default.conf seems to be where we write that too20:43
clarkband  since it comes before the other files it wins? I think we should maybe remove the old files if they are no longer expected to be valid (to reduce confusion)20:44
corvusyes those are brand new20:44
corvusthis is a very confusing situation20:44
clarkbI agree20:44
corvusdiff 000-default.conf20:45
corvusthat seems to suggest we have indeed lost some features20:45
clarkbcorvus: if I'm reading it correctly I think a big change is going to cherrypy for all requests21:04
clarkbwhich I think is desirable, we wanted to stop consuming the js tarball, but maybe we need to figure out how to make that more efficient (better js compiles, compression, etc)21:04
clarkbcorvus: I think the /api/status caching is all wrong now that zuul's api has been redone too?21:05
corvusclarkb: yeah, i don't think anything has to be different than before; apache as a reverse proxy should be able to cache the data, it should be served by cherrypy with correct headers21:06
corvusso i guess we need to identify what we think is different or should be improved and see if we can improve the apache config to make that happen21:06
clarkbcorvus: for caching I think its just the path21:09
clarkbits /api/tenant/.*/status now iirc21:09
*** jrichard has joined #opendev21:11
clarkbtesting status retrieval in my browser it is coming back as gzip according to headers21:13
clarkbso the DEFLATE may be working with gzip and not deflate21:13
clarkbaha thats normal because apache21:14
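The behaviour clarkb describes (plain text for `Accept-Encoding: deflate`, gzip otherwise) matches Apache's mod_deflate, whose DEFLATE output filter emits gzip-encoded responses. A minimal way to check which encoding a server actually chose is sketched below; the helper name `content_encoding` is hypothetical and the URL is left to the caller.

```python
import urllib.request


def content_encoding(url, accept_encoding, timeout=10.0):
    """Return the Content-Encoding the server chose, or None if uncompressed."""
    req = urllib.request.Request(
        url, headers={"Accept-Encoding": accept_encoding}
    )
    # urllib does not decompress response bodies itself, so this header
    # reflects what the server actually sent on the wire.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Encoding")
```

Comparing the result for `"gzip"` against the result for `"deflate"` on the same status URL would show whether compression is enabled at all or only negotiated for gzip.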
clarkbthat gives me an idea for an improvement here one moment please21:14
clarkbcorvus: it doesn't look like cherrypy is setting content-type on static files it is serving21:16
clarkbcorvus: but if it were we could do something like: AddOutputFilterByType DEFLATE application/json text/css text/javascript application/javascript21:17
clarkbI'll go ahead and push ^ up as well as caching improvements then if cherrypy starts doing that we'll be ready for it21:17
jrichardMy change ( ) went in today to create the starlingx/portieris-armada-app repo, but I don't see it under .  Do I need to do something else to add the project there?21:17
clarkbjrichard: no, we've been having some issues with config management that we thought were addressed but that indicates it probably isn't yet21:18
clarkbjrichard: is the project in gerrit?21:19
clarkbyes looks like gitea and gerrit are happy so its just the zuul config reload that isn't firing properly21:19
corvuslooks like it ran manage-projects and puppet-else but not zuul21:19
clarkbmordred: corvus ^ fyi I know you were looking at that21:19
corvusi don't think i was looking at that but i can21:20
jrichardI do see it in gerrit.  Is there anything I can do now to get it added there?21:22
corvusclarkb, mordred: a cursory look makes me think that project-config is just configured to run remote-puppet-else and hasn't been updated to run service-zuul21:22
clarkbcorvus: I was assuming it was related to the sighup thing but I guess you think its earlier in the stack (not firing the job at all?)21:23
corvusclarkb: yeah, sighup should be fixed; i'll see about making a change to the job config21:26
corvusall of the job descriptions say "Run the playbook for the docker registry."21:27
corvusi feel like those could be more correct21:27
corvusclarkb: i'm really looking forward to your reorg patch21:28
clarkbcorvus: ya I'll need to resurrect that once the dust has settled on zuul and nodepool and codesearch and eavesdrop21:29
clarkbI think nodepool is the last remaining set of services?21:29
openstackgerritClark Boylan proposed opendev/system-config master: Improve zuul-web apache config
clarkbthats the first bit in making performance better I think21:29
redrobotHmm... I don't think Zuul is picking up this patch to a new repo?  Maybe I missed something? 🤔21:30
redrobotI had to add Zuul to reviewers manually21:30
redrobotbut I don' think that helped, hehe21:30
clarkbredrobot: its the same issue jrichard has but against a different new repo21:31
clarkbredrobot: we basically haven't signalled zuul to let it know there are new projects21:32
redrobotclarkb, gotcha.  Thanks!21:32
openstackgerritJames E. Blair proposed opendev/system-config master: Clean up some job descriptions
openstackgerritJames E. Blair proposed openstack/project-config master: Run the zuul service playbook on tenant changes
corvusclarkb, mordred: ^ i think that should fix the issue redrobot and jrichard observed21:44
clarkbcorvus: also fwiw I've read up on zuul's cherrypy static file serving and it should lookup mimetypes by file extension21:45
corvusit looks like the tenant config is in place, so i will manually run a smart-reconfigure21:45
clarkbcorvus: I think maybe having two .'s in the file extensions like we do with our js may confuse it? I need to set up a test for that21:45
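The kind of test clarkb mentions could start with a quick check of how an extension-based lookup treats a name with two dots. The sketch below uses Python's standard `mimetypes` module; that the server consults this module is an assumption, and the filename `main.abc123.js` is a made-up example of a hashed asset name.

```python
import mimetypes


def guess_content_type(filename):
    """Return the MIME type guessed from the filename, or None if unknown."""
    # guess_type looks only at the final extension (the text after the last
    # dot), so extra dots earlier in the name are part of the base name.
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


# Hypothetical hashed asset name with two dots.
print(guess_content_type("main.abc123.js"))
print(guess_content_type("main.abc123.css"))
```

The exact JavaScript type string varies between Python versions and system MIME tables, so no expected output is shown here. If both names resolve to a type, the double dot alone is unlikely to be the cause.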
mordredcorvus: ah - yeah - I think that looks solid21:45
mordredclarkb, corvus: related:
mordredthat will make sure service-zuul uses the zuul prepared copy of project-config21:47
clarkbcorvus: in I think we want puppet else and zuul21:48
mordred(which is a thing we added to other jobs after the initial zuul patch was written)21:48
mordredclarkb: why puppet else/21:48
clarkbmordred: nodepool for now21:48
clarkbI think it may be the last thing though21:48
corvusclarkb: i'm not following; we only ran puppet-else on changes to zuul/main.yaml21:49
*** DSpider has quit IRC21:49
mordredyeah - I think it's ok to wait for service-nodepool before triggering nodepool config changes on p-c changes21:49
ianwcorvus/mordred: thanks, i didn't consider the enqueue v runtime21:49
clarkboh I see ya ok21:49
clarkbI think the original code should've maybe been run more aggressively but if we weren't already then its fine21:50
corvusclarkb: are you suggesting we should run puppet-else on changes to nodepool/.* ?21:50
corvusclarkb: it looks like service-nodepool runs puppet on the old puppet servers21:51
corvusso i don't think we need puppet-else21:51
mordredoh good point21:51
clarkboh I didn't realize that had gotten split out already21:52
corvusjrichard, redrobot: you should be good to go now; you'll probably need to recheck those changes21:58
openstackgerritMerged openstack/project-config master: Run the zuul service playbook on tenant changes
clarkbI was mistaken about cherrypy not sending content-type. It seems that firefox forgets that info if working with a cached file22:07
clarkbbut forcing cache bypass shows that it does send the content-type22:07
openstackgerritClark Boylan proposed opendev/system-config master: Improve zuul-web apache config
clarkbinfra-root ^ I think that may make zuul a bit more responsive for users22:08
clarkbI need to pop out for a bike ride now. Back in a bit22:08
openstackgerritMerged opendev/system-config master: Clean up some job descriptions
*** jrichard has quit IRC22:35
openstackgerritMonty Taylor proposed opendev/system-config master: Test zuul-executor on focal
redrobotcorvus, awesome, thanks for the help!23:01
*** tosky has quit IRC23:02
openstackgerritClark Boylan proposed opendev/system-config master: Increase timeout on system-config-run-zuul
clarkbmy apache2 vhost change hit a timeout on that job so I'm bumping it23:41
clarkblooking at logs it seems to have been compiling openafs when it triggered the timeout23:41
