15:00:36 <anteaya> #startmeeting third-party
15:00:37 <openstack> Meeting started Mon Oct  5 15:00:36 2015 UTC and is due to finish in 60 minutes.  The chair is anteaya. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:38 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:41 <openstack> The meeting name has been set to 'third_party'
15:00:48 <anteaya> hello
15:00:56 <rfolco> o/
15:01:01 <anteaya> hello rfolco
15:01:32 <anteaya> rfolco: I'm not sure if I know you, what ci account is yours?
15:01:36 <mmedvede> hi anteaya
15:01:46 <rfolco> rfolco, pkvm ci
15:01:54 <anteaya> mmedvede: hello, how are you today?
15:02:14 <rfolco> anteaya, pkvm ci :)
15:02:16 <mmedvede> anteaya: I am well, thank you
15:02:21 <asselin__> o/
15:02:34 <mmedvede> anteaya: rfolco and I are on the same team
15:02:38 <anteaya> hi asselin__
15:02:44 <anteaya> mmedvede: glad to hear it
15:02:55 <asselin__> hi everyone
15:03:00 <anteaya> mmedvede: oh, I was just going to say I don't see pkvm ci listed: https://wiki.openstack.org/wiki/ThirdPartySystems
15:03:27 <mmedvede> anteaya:  IBMPowerKVMCI
15:03:33 <anteaya> oh
15:03:39 <mmedvede> abbreviations :)
15:03:48 <anteaya> yeah I wouldn't remember that abbreviation
15:03:53 <anteaya> so thanks
15:04:02 <anteaya> what shall we talk about today?
15:04:28 <anteaya> does anyone have anything they wish to discuss?
15:04:54 <aysyd> hi everyone
15:05:01 <anteaya> hello aysyd
15:05:30 <rfolco> anteaya, aysyd is in my team too
15:05:35 <anteaya> wonderful
15:05:43 <anteaya> your team has a great turnout today
15:05:48 <rfolco> lol
15:05:57 <anteaya> how is your ci operating?
15:06:18 <rfolco> in terms of?
15:06:29 <anteaya> is it working as expected?
15:06:30 <mmedvede> it is stable, but has some pypi problems at the moment
15:06:36 <mmedvede> can't get packages
15:06:39 <anteaya> glad it is stable
15:06:43 <anteaya> that is a problem
15:06:49 <anteaya> is it a new problem?
15:07:02 <mmedvede> started today
15:07:06 <anteaya> interesting
15:07:15 <mmedvede> but upstream is also failing lots
15:07:23 <anteaya> upstream what?
15:07:37 <mmedvede> upstream jenkins
15:07:56 <anteaya> do you have some urls of patches with jobs failing due to jenkins?
15:09:03 <mmedvede> anteaya: I do not believe it is due to jenkins. I meant that some of the jenkins jobs have also started failing, along with the other third-party CI systems
15:09:13 <anteaya> interesting
15:09:26 <anteaya> have you any logs that might point to what the issue is
15:09:47 <anteaya> if there is a problem that is causing jobs to fail I would be interested in finding out what that may be
15:10:09 <mmedvede> anteaya: looking at xenproject ci http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz: [Errno 104] Connection reset by peer
15:10:16 <mmedvede> during python package install
15:10:28 <rfolco> I don't know the specifics of today's problem, but in general pip packages and dependencies are changed without much care, so they break CI jobs
15:10:36 <mmedvede> same happens for us, e.g.  http://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/nova/86/230186/4/check/check-ibm-tempest-dsvm-full/14e85a6/devstacklog.txt.gz
15:11:08 <mmedvede> to me it looks like a load capacity problem
15:11:28 <mmedvede> asselin__: is your CI working fine?
15:12:18 * asselin_ checks
15:12:30 <hogepodge> o/
15:12:50 <anteaya> hi hogepodge, welcome
15:13:02 <anteaya> we are just looking at some pypi timeout issues
15:13:20 <asselin_> seems ok, a few random failures I need to check the details, but nothing major
15:14:08 <dstufft> Hi
15:14:12 <anteaya> hi welcome
15:14:36 <anteaya> so dstufft, this meeting is for operators of CI systems that aren't OpenStack's but report to it
15:14:38 <anteaya> https://wiki.openstack.org/wiki/ThirdPartySystems
15:15:13 <anteaya> and everyone, this is dstufft; he works on a lot of python packaging issues and may have some ability to evaluate whether pypi is experiencing load issues
15:15:22 <anteaya> mmedvede: can you share those log links again?
15:15:54 <mmedvede> sure
15:15:56 <anteaya> thanks
15:16:11 <mmedvede> #link IBM PowerKVM CI pypi timeout http://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/nova/86/230186/4/check/check-ibm-tempest-dsvm-full/14e85a6/devstacklog.txt.gz
15:16:33 <mmedvede> #link citrix-xenserver i http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz
15:17:14 <mmedvede> I might have mislabeled citrix
15:17:17 <anteaya> dstufft: have you enough context that your presence in this meeting makes sense to you yet?
15:17:21 <anteaya> #undo
15:17:22 <openstack> Removing item from minutes: <ircmeeting.items.Link object at 0x9c86ed0>
15:17:23 <dstufft> yea
15:17:27 <anteaya> dstufft: thanks
15:17:31 <anteaya> mmedvede: try again
15:18:39 <dstufft> Connection Reset by Peer would be coming from Fastly
15:18:47 <dstufft> Fastly's our CDN
15:18:54 <dstufft> there was an issue like this awhile ago...
15:18:54 <dstufft> sec
15:18:56 <anteaya> mmedvede: I removed the last link from the meeting minutes you can try the last link again
15:18:59 <anteaya> dstufft: thank you
15:19:32 <mmedvede> #link XenProject CI check  pypi timeout http://logs.openstack.xenproject.org/86/230186/4/check/dsvm-tempest-xen/4e0a686/logs/devstacklog.txt.gz
15:19:38 <anteaya> mmedvede: thank you
15:20:27 <dstufft> https://github.com/travis-ci/travis-ci/issues/2389
15:20:52 <anteaya> #link https://github.com/travis-ci/travis-ci/issues/2389
15:21:24 <anteaya> that says the issue was closed March 9th
15:21:35 <dstufft> Right, that was Travis-CI having a similar issue
15:21:44 <dstufft> they seemed to have resolved it by disabling ECN on their systems
15:22:19 * anteaya looks up ECN
15:22:35 <dstufft> From my memory, this problem seemed specific to a particular setup (e.g. it wasn't a widespread problem, but the people having the problem had it regularly)
15:22:52 <dstufft> and Fastly investigated it for a while and couldn't find anything on their end that seemed to be causing it
15:23:00 <anteaya> hmmmm
15:23:12 <anteaya> well it seems to be affecting at least two different operators
15:23:24 <dstufft> https://github.com/travis-ci/travis-ci/issues/2389#issuecomment-75292931 is the post where someone suggested the ECN option
15:23:41 <anteaya> and given the number of folks running systems vs the number of folks who talk to us, I would multiply that by at least 10
15:24:24 <anteaya> dstufft: is this what you mean by ecn? https://en.wikipedia.org/wiki/Explicit_Congestion_Notification
15:24:31 <dstufft> yea
15:24:32 <mmedvede> anteaya: looking at our internal scoreboard status, lots of third-parties are failing. Strangely, upstream jobs look fine, but I need to query logstash to make sure
15:24:52 <anteaya> #link ecn https://en.wikipedia.org/wiki/Explicit_Congestion_Notification
15:24:55 <anteaya> dstufft: thank you
15:25:03 <anteaya> mmedvede: hmmmm
15:25:03 <dstufft> I think it was particularly this bit "Rather than responding properly or ignoring the bits, some outdated or faulty network equipment has historically dropped or mangled packets that have ECN bits set."
15:25:23 <anteaya> does anyone know offhand (or can you look?) if you are using ecn?
15:25:46 <dstufft> Going by memory, I think the guess was that some hardware switch in between Travis and the Fastly POP was doing something bad with the ECN bits
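For reference, a minimal sketch of how an operator could check the ECN setting on a failing CI node, assuming a Linux host; disabling ECN is only a diagnostic step suggested by the Travis thread above, not a recommended permanent fix.

    # Show the current ECN setting (0 = disabled, 1 = request ECN on outgoing
    # connections, 2 = enable only when requested by the peer, the usual default)
    cat /proc/sys/net/ipv4/tcp_ecn

    # Temporarily disable ECN for testing, per the workaround in the Travis issue
    sudo sysctl -w net.ipv4.tcp_ecn=0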
15:26:05 <anteaya> mmedvede: yes, if you could query logstash with what you are seeing that would be great, thank you
15:26:17 <anteaya> dstufft: fair enough
15:26:31 <anteaya> I'm feeling that this connection reset by peer situation is new
15:26:35 <dstufft> It was never confirmed though, so it might be the problem
15:26:37 <dstufft> er
15:26:40 <dstufft> might not be the problem
15:26:41 <anteaya> mmedvede reported it started today
15:26:46 <dstufft> I'm happy to raise an issue with Fastly though
15:26:54 <anteaya> and I do think they have been running their ci for about a year
15:27:07 <anteaya> dstufft: that would be great, thank you
15:27:24 <anteaya> if they could take a look at what they are seeing from their end at the very least
15:28:21 <dstufft> are you able to curl -I pypi from that box and see what POP you're getting
15:28:22 <anteaya> so dstufft will speak with Fastly and mmedvede will look at logstash to see if upstream is seeing any of the same issues
15:28:32 <anteaya> mmedvede: can you do so?
15:28:38 <dstufft> it'll be in a header
15:28:39 <anteaya> curl -I pypi?
15:28:42 <dstufft> Served-By or so
15:28:46 <dstufft> curl -I https://pypi.python.org/
15:28:59 <dstufft> X-Served-By: cache-iad2138-IAD
15:29:02 <dstufft> like that
15:29:04 <anteaya> perhaps put it in paste?
15:29:11 <anteaya> unless it is one line
15:30:13 <dstufft> all the headers will be multiple lines, just that header will be one
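A compact version of the check dstufft describes, for operators who want to paste just the relevant header; the only assumption is that curl is available on the CI node.

    # HEAD request to PyPI; print only the header naming the Fastly POP that
    # served the request (POP names end in an airport code, e.g. IAD, DFW)
    curl -sI https://pypi.python.org/ | grep -i '^X-Served-By'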
15:30:55 <anteaya> does anyone have time to try that now?
15:30:57 <dstufft> FWIW https://github.com/pypa/pip/issues/2426 would make this not fail (or at least, not fail as much), I just haven't done it yet
15:31:58 <anteaya> dstufft: how can I help you have time to do that?
15:32:52 <dstufft> I've been working on Warehouse so I haven't been touching pip much, I can probably switch around to that sometime soon though
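As a stop-gap until that pip change lands, operators could raise pip's retry count; this is only a hedge against transient resets, it may not help if a download is cut off mid-stream, and it assumes pip 6.0 or newer (the requirements file name below is just a placeholder).

    # Bump pip's HTTP retry count (default is 5), either per invocation
    # or via the environment so other pip calls in the job pick it up
    pip install --retries 10 -r requirements.txt
    export PIP_RETRIES=10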
15:33:00 <anteaya> and it doesn't sound like any of our operators have time right now to look at their headers
15:33:04 <asselin__> X-Served-By: cache-iad2142-IAD, cache-dfw1834-DFW
15:33:11 <anteaya> asselin__: ah thank you
15:33:16 <dstufft> other option to mitigate is to run your own PyPI mirror nearby too
15:33:40 <asselin__> but we use our own pypi mirror now
15:33:44 <anteaya> dstufft: true but many of our smaller ci operators probably won't do that
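For operators who do decide to go the mirror route, the client side is just a pip configuration change; a minimal sketch, assuming a mirror has already been built with something like bandersnatch or a caching proxy such as devpi, and using a placeholder mirror URL.

    # Point pip (and therefore devstack jobs on this node) at a nearby mirror
    # instead of pypi.python.org; the URL is illustrative only
    mkdir -p ~/.pip
    cat > ~/.pip/pip.conf <<'EOF'
    [global]
    index-url = http://pypi-mirror.example.com/simple/
    EOF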
15:34:00 <dstufft> oh wait
15:34:00 <anteaya> asselin__: ah okay would make sense then you are isolated from this issue
15:34:04 * anteaya waits
15:34:12 <dstufft> is that log you sent me from your own mirror?
15:34:24 <anteaya> asselin__ didn't send any logs
15:34:29 <anteaya> mmedvede sent logs
15:34:35 <dstufft> oh
15:34:36 <dstufft> durr
15:34:39 <asselin__> no, that's the output from the curl command you posted above
15:34:40 <anteaya> mmedvede: are you using a pypi mirror?
15:34:49 <dstufft> stupid Textual made their usernames the same color
15:34:52 <anteaya> dstufft: no it is a good question
15:34:57 <anteaya> dstufft: ah yeah
15:35:15 <anteaya> okay so header information to you, elbow you some time to work on pip
15:35:28 <anteaya> anything else we should discuss on this topic?
15:35:32 <anteaya> right now?
15:35:38 <dstufft> the curl info would be most useful from the boxes that are getting failure
15:36:01 <mmedvede> Made a kibana query for "error: [Errno 104] Connection reset by peer" over the last 7 days; it shows a spike today.
15:36:05 <anteaya> dstufft: agreed, I'll work on getting that to you post meeting if it doesn't come up before the end of the meeting
15:36:10 <anteaya> mmedvede: okay thanks
15:36:17 <anteaya> mmedvede: can you curl for the pypi headers?
15:36:32 <anteaya> curl -I https://pypi.python.org/
15:36:40 <mmedvede> anteaya: I probably can. It is not 100% rate of failure though
15:36:47 <anteaya> the X-Served-By header
15:36:51 <anteaya> understood
15:37:02 <mmedvede> anteaya: not sure about mirrors, need to check
15:37:18 <rfolco> X-Served-By: cache-iad2145-IAD, cache-atl6226-ATL
15:37:40 <asselin__> I did see this issue today with our mirror. Never saw it before...perhaps it was being updated at the time? http://15.126.198.151/98/229998/2/check/lefthand-iscsi-driver-master-client-pip-vsa673-dsvm/c2a1270/logs/devstacklog.txt.gz#_2015-10-05_11_02_13_117
15:39:32 <anteaya> asselin__: you mean perhaps the pbr wheel was being updated at the time?
15:39:39 <anteaya> rfolco: thank you
15:40:08 <rfolco> anteaya, ATL means Atlanta and DFW means Dallas? just curious
15:40:12 <mmedvede> #link Kibana search for "connection reset" http://imgur.com/Y46OqHl
15:40:25 <dstufft> rfolco: yea
15:40:25 <asselin__> anteaya, yes
15:40:27 <anteaya> rfolco: I'd guess that too, I don't know for sure
15:40:39 <anteaya> asselin__: okay thank you, and I agree it is possible
15:40:44 <dstufft> https://www.fastly.com/network <- Fastly Locations
15:41:03 <dstufft> generally they use airport codes in their DC names
15:41:11 <anteaya> mmedvede: filter on build_status:failure as well, not all of those hits are failures
15:41:57 <anteaya> dstufft: so that header is the Fastly location header, showing which POP is being used by the box to hit pypi
15:42:33 <dstufft> anteaya: Yea, pypi.python.org is a GeoDNS name that routes to your closest Fastly POP, Fastly is running Varnish which connects back to the PyPI servers and caches the result
15:42:52 <mmedvede> anteaya: thank you. Added the filter. The same picture, only now 21 hits (probably 2 new just happened)
15:43:03 <dstufft> the Connection Reset By Peer is coming from between your computer and Fastly, if it was between Fastly and PyPI you'd get a 503 error instead
15:43:11 <anteaya> mmedvede: great, thanks
15:43:28 <anteaya> dstufft: good to know
15:43:52 <anteaya> okay so we will be interested to hear your response from fastly
15:44:03 <anteaya> if you could post to the infra mailing list that would be great
15:44:11 <anteaya> does that sound fair to everyone?
15:44:42 <anteaya> mmedvede: thanks for bringing this up
15:45:14 <anteaya> okay if I give hogepodge some airtime now?
15:45:30 <rfolco> +1
15:45:34 <anteaya> thank you
15:45:58 <anteaya> #topic openstack foundation trademark usage program for third party operators
15:46:16 <anteaya> hogepodge: care to share your thoughts on the current status of your work?
15:46:48 <hogepodge> We're starting out with cinder drivers, since the third-party testing is pretty solid for that project.
15:46:49 <anteaya> he might be afk at the moment
15:46:53 <anteaya> ah here we are
15:46:57 <anteaya> great
15:47:09 <asselin__> go cinder!
15:47:25 <hogepodge> Reaching out to companies that have existing drivers passing cinder-ci to get them started on the license program or update their current licenses.
15:47:33 <anteaya> wonderful
15:47:51 <anteaya> does anyone present have any questions for hogepodge?
15:48:35 <hogepodge> If we make good progress (I have a bit of a backlog of pre-summit work, so I'll know more later this afternoon) we're going to require all storage drivers be passing cinder ci to carry the OpenStack Compatible mark.
15:48:49 <anteaya> okay lovely
15:48:50 <asselin__> what does that mean "passing cinder ci"
15:48:57 <anteaya> great question
15:49:14 <hogepodge> Passing the set of tests that cinder requires to demonstrate driver-facing and user-facing apis
15:49:28 <asselin__> especially in the face of patches that may or may not work and intermittent failures
15:49:53 <asselin__> such as pypi, devstack changes, etc.
15:51:35 <hogepodge> We don't yank a license because of a failed test. At renewal we would if the driver has not been passing ci for some time. Same thing for initial license. We want commitment to quality and community standards.
15:52:09 <hogepodge> So we have discretion. Meaning if something breaks upstream we can be patient and understanding and work with both upstream and downstream devs to get things right.
15:53:33 <asselin__> also, which tests need to be required? Can any be skipped due to legitimate bugs external to the driver?
15:53:42 <hogepodge> After the summit we want to work with neutron team and vendors to come up with ci for network plugins, with the same idea of using community testing standards to drive the trademark program for network drivers
15:53:48 <dstufft> (Sorry to interject, Fastly got back to me. Asked if we could get a mtr from the failing machine to pypi.python.org)
15:53:54 <asselin__> I know there are quite a few encrypted volume bugs that affect drivers differently
15:54:01 <anteaya> dstufft: what is a mtr?
15:54:34 <dstufft> http://www.bitwizard.nl/mtr/
15:54:47 <dstufft> should be packaged in most distros
15:54:55 <anteaya> #link http://www.bitwizard.nl/mtr/
15:55:12 <anteaya> dstufft: thanks will work on getting this back to you post meeting
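For anyone gathering that data, a sketch of an mtr invocation that produces a pasteable report, assuming mtr is installed from the distro packages.

    # 100 probe cycles, numeric output, written to a file that can be attached
    # to the Fastly ticket; run from a node that has seen the connection resets
    mtr --report --report-cycles 100 --no-dns pypi.python.org > mtr-pypi.txt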
15:55:18 <anteaya> sorry hogepodge and asselin__
15:55:21 <hogepodge> asselin__: we'd work with vendors and the dev team on problems like that.
15:56:03 <fungi> asselin__: hogepodge: my understanding was that it was basically up to cinder to decide whether a driver was in sufficient shape for the mark (e.g. deciding what tests it needed to run, et cetera). is that accurate?
15:56:49 <anteaya> that puts even more pressure on the projects
15:57:05 <hogepodge> fungi: more or less, yes. We feel like the community knows best as to what makes a compatible driver. Cinder has a fairly well-defined set of apis that all need to be implemented, so that makes the job easier in a lot of ways
15:57:13 <anteaya> and the projects are under considerable strain from the drivers as it is
15:57:39 <anteaya> I would argue neutron buckled from the pressure, which is why they revoked mandatory testing
15:58:16 <anteaya> not strain from the operators that attend meetings and comply
15:58:22 <anteaya> but from the ones who don't
15:58:29 <anteaya> and then are upset about it
15:58:34 <fungi> agreed that does put additional responsibility on the projects. on the other hand having some body disconnected from the project deciding what should be tested (possibly in disagreement with what the project wants to see tested) could lead to different and arguably worse sorts of strain
15:58:57 <anteaya> oh as to what should be tested, the projects do know best
15:59:09 <asselin__> sounds like we should have a tool to help check
15:59:14 <anteaya> as to monitoring to say whether a given ci is running those tests?
15:59:15 <fungi> this seems like something projects should be able to opt into though, for sure
15:59:24 <anteaya> that is considerably more work
15:59:50 <anteaya> well if the foundation makes the testing mandatory to receive the mark, I don't see how anyone can opt either in or out
15:59:55 <fungi> true, though last cycle cinder did that work admirably, ripping out untested drivers right and left
16:00:02 <hogepodge> all of the conversations with neutron are preliminary. many members have expressed their concerns, but have also expressed that mark pressure could help bring the program back.
16:00:04 <anteaya> very much so
16:00:09 <hogepodge> if neutron said "nope", we'd respect that
16:00:31 <anteaya> I'm saying that cinder took one path and neutron another
16:00:42 <fungi> anteaya: i expect it's more that the mark won't exist if there's no interest from a project in helping shape and police it
16:00:45 <anteaya> and I fully understand why both projects made the decision they made
16:01:14 <anteaya> no interest has different definitions
16:01:26 <anteaya> interest in having things for marketing? very high interest
16:01:42 <anteaya> interest in doing the leg work so those things actually reflect value? very low
16:01:55 <anteaya> anyway
16:02:03 <hogepodge> paths can be different too
16:02:06 <anteaya> we won't reach consensus today
16:02:08 <hogepodge> not one size-fits-all
16:02:11 <anteaya> hogepodge: that is good to know
16:02:20 <anteaya> that might work best then
16:02:24 <openstack> rakhmerov: Error: Can't start another meeting, one is in progress.  Use #endmeeting first.
16:02:35 <anteaya> as I don't think what works for cinder will work for neutron
16:02:46 <anteaya> rakhmerov: in a meeting just finishing up
16:02:49 <rakhmerov> excuse me, we're supposed to have a meeting
16:02:53 <rakhmerov> yep, thnx
16:03:10 <anteaya> thanks all for your kind attendance and participation
16:03:14 <anteaya> see you next week
16:03:17 <anteaya> #endmeeting