Friday, 2018-11-30

*** flaper87 has quit IRC		00:09
*** flaper87 has joined #openstack-tc		00:14
*** tosky has quit IRC		00:27
clarkb	Ok I've got to run now. Please do ping if others are interested in discussing more. I think it is an important thing we want to sort out	00:49
*** dklyle has joined #openstack-tc		01:59
*** dklyle has quit IRC		02:05
*** mriedem_afk has quit IRC		02:19
*** ricolin has joined #openstack-tc		03:02
*** dklyle has joined #openstack-tc		03:11
*** whoami-rajat has joined #openstack-tc		03:12
*** dklyle has quit IRC		03:17
*** diablo_rojo has quit IRC		06:12
*** e0ne has joined #openstack-tc		06:32
*** flaper87 has quit IRC		06:32
*** e0ne has quit IRC		07:31
*** Luzi has joined #openstack-tc		07:38
*** tosky has joined #openstack-tc		08:42
*** jpich has joined #openstack-tc		08:53
*** e0ne has joined #openstack-tc		09:27
*** ricolin has quit IRC		10:44
*** cdent has joined #openstack-tc		10:59
*** dtantsur|mtg is now known as dtantsur|afk		11:00
*** cdent has quit IRC		11:20
*** cdent has joined #openstack-tc		11:46
*** Luzi has quit IRC		12:00
*** jaypipes is now known as leakypipes		12:44
*** e0ne has quit IRC		12:54
*** cdent has quit IRC		13:37
*** e0ne has joined #openstack-tc		13:37
*** whoami-rajat has quit IRC		13:49
*** EmilienM is now known as EvilienM		13:58
openstackgerrit	Sean McGinnis proposed openstack/governance master: Add stable:follows-policy for cinder deliverables https://review.openstack.org/621164	14:11
*** mriedem has joined #openstack-tc		14:19
*** jamesmcarthur has joined #openstack-tc		14:24
*** cdent has joined #openstack-tc		14:32
*** lbragstad is now known as elbragstad		14:36
*** whoami-rajat has joined #openstack-tc		14:38
dhellmann	clarkb : good topic, and thanks for not waiting for office hours to raise it	15:01
dhellmann	I'd like to include "gate stability" or "quality" somehow as a goal, but I'm struggling to come up with a way to quantify it in a per-team way so we can measure progress	15:03
dhellmann	I'm not sure asking a specific group of people to dedicate their time to debugging the issues is the right approach. Where would we find those people?	15:04
dhellmann	I do like the approach of incentivizing everyone to make their tests reliable by "rewarding" stable jobs with priority	15:04
dhellmann	the implementation details there may be tricky	15:05
cdent	that this [t F3u] was true in the past is part of why we have trouble now: we hope/think it is going to be other people that fix it. having it rotate and/or be "part time" is a nice idea but the amount of experience and expertise to do so is large, sadly	15:06
purplerbot	<clarkb> one (admittedly less practical idea) I had was to encourage a sort of "sdague/jogo/mtreinish" rotation. Basically have a group of people that can take on the tasks they did in the past, but be explicit that it shouldn't be a full time thing to help avoid burn out but also ensure more than one person knows what to do [2018-11-29 23:03:51.525961] [n F3u]	15:06
cdent	I recently put that word out internally that this specifically is a critical area and there were some warm rumblings in response, but I don't know if it will turn into anything real	15:07
dhellmann	we need to design the system so we don't need heroes to keep it running	15:09
smcginnis	++	15:09
cdent	yes	15:13
cdent	heroes are rare (and bad for health). When there are many people they are easier to find.	15:14
*** jamesmcarthur has quit IRC		15:33
fungi	time to apply all that behavioral psychology i learned at university, i guess	15:35
fungi	we can give users a lever that dispenses food pellets. also electrifying the cage floor is probably a viable tactic	15:36
* cdent re-reads walden two		15:37
dhellmann	heh	15:38
ttx	OH: "Heroes are bad for health"	15:43
*** dansmith is now known as SteelyDan		15:43
dims	ttx : LOL	15:45
*** jamesmcarthur has joined #openstack-tc		15:46
*** jamesmcarthur has quit IRC		15:49
*** jamesmcarthur has joined #openstack-tc		15:49
*** mriedem has quit IRC		15:50
mnaser	for example, it took me probably 20 minutes today to find out we were uselessly creating swap in OSA jobs because we didn't use any of it and i took 15-20 minutes to find out and push a fix to disable that behaviour	15:54
mnaser	clarkb / infra-core: do we have stats on the number of always-failing non voting jobs?	15:54
mnaser	i feel like those contribute a lot.	15:54
cdent	yeah, good point	15:54
fungi	what was the savings in job runtime from not creating a swapfile?	15:54
mnaser	fungi: on ovh, 15-18 minutes	15:56
fungi	wow	15:56
mnaser	i dont know if this was a one-off	15:56
fungi	i guess it wasn't being created sparse	15:56
cdent	that much? that's rather surprising	15:56
mnaser	we cant do sparse on certain os like centos 7	15:57
cdent	it makes it seem like $stuff is _way_ oversubscribed	15:57
fungi	sparse swapfiles should be nearly instantaneous to create, but you risk crashing the node hard if you use all the disk	15:57
mnaser	http://logs.openstack.org/36/619636/1/gate/openstack-ansible-deploy-aio_metal-ubuntu-bionic/72c540f/logs/ara-report/result/f6ed9f8a-419a-41b8-8d81-19d6e5aac6cc/	15:57
mnaser	sparse swapfiles dont work on xfs (which is centos-7)	15:58
fungi	on the other hand, if the job does use more memory than anticipated, without swap you'll be unable to debug it when the oom killer sacrifices something which makes the node no longer accessible	15:58
mnaser	fungi: i went over some of our numbers over successful jobs and we're far away from swapping	15:58
fungi	so there are always trade-offs	15:58
mnaser	like, some 4gb away from swapping..	15:58
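The sparse versus fully written distinction discussed above can be illustrated with a minimal Python sketch. It only shows file allocation, not swap setup (no mkswap or swapon), and the helper names create_sparse, create_dense, and allocated_bytes are hypothetical, not taken from the OSA jobs. It assumes a POSIX system where st_blocks is reported in 512-byte units.

```python
import os
import tempfile


def create_sparse(path, size):
    """Set the file length without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size)


def create_dense(path, size, chunk=1024 * 1024):
    """Write zeros for the whole length, similar to dd from /dev/zero."""
    zeros = bytes(chunk)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # force the data out, as a slow disk would feel it


def allocated_bytes(path):
    """Bytes actually allocated on disk (POSIX: st_blocks is in 512-byte units)."""
    return os.stat(path).st_blocks * 512


if __name__ == "__main__":
    size = 8 * 1024 * 1024  # small illustrative size, not a real swap size
    with tempfile.TemporaryDirectory() as d:
        sparse = os.path.join(d, "sparse.img")
        dense = os.path.join(d, "dense.img")
        create_sparse(sparse, size)
        create_dense(dense, size)
        for p in (sparse, dense):
            print(p, os.path.getsize(p), allocated_bytes(p))
```

Both files report the same length; on filesystems that support sparse files the first usually allocates far fewer blocks, which is why it is created almost instantly while the dense write is bound by disk throughput.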
mnaser	the other thing im struggling with right now with my ptl hat on is	15:59
fungi	but yeah, if you don't use most of the ram and we're using platforms which don't support sparse swap (or you need the additional filesystem space) then dropping it is certainly a good call	15:59
mnaser	contributions to implement things that need CI resources which are then not maintained by those who push them	15:59
fungi	sure. in general "contributions to implement things [...] which are then not maintained by those who push them" has been a perpetual problem in openstack	16:00
mnaser	the thing that bothers me is things like this	16:01
mnaser	http://zuul.openstack.org/builds?job_name=openstack-ansible-deploy-aio_distro_lxc-opensuse-150&job_name=openstack-ansible-deploy-aio_distro_lxc-opensuse-423&branch=master&branch=openstack-ansible-deploy-distro_ceph-opensuse-423&branch=openstack-ansible-deploy-distro_ceph-opensuse-150	16:01
mnaser	that's a lot of wasted CI resources	16:01
mnaser	and i'm really just wondering if we should come up with some policy that says if a job is non-voting and failing for N period of time, it will be removed.	16:02
* cdent is still stuck on it taking so long to do a dd?		16:02
fungi	cdent: slow disk	16:02
cmurphy	mnaser: there's nothing stopping you from creating that policy for your project	16:02
fungi	mnaser: we've certainly done that from time to time, but yes maybe a policy within openstack would be good there	16:02
cdent	fungi: isn't that something that ought to be investigated too	16:03
cdent	I have felt (since my dawn of openstack) that we are constantly in a state of oversubscription and it is _that_ which causes us so much pain	16:03
fungi	cdent: yes, it's something we can bring to the attention of the service provider, but i think they have us on cheaper storage there by choice	16:03
mnaser	cdent: hardware is expensive	16:04
mnaser	no one is writing a check for that hardware :)	16:05
mnaser	so there isn't exactly an SLA for donated infrastructure	16:05
mnaser	cmurphy: fungi that's true, but i would be more comfortable if it was an openstack-y thing rather than grumpy-mo-keeps-seeing-failing-jobs-and-has-no-time-to-fix-them-so-he-removed-them	16:06
cmurphy	it's not i'm-grumpy-so-i-removed-them it's "Our team's policy is to only keep running jobs that are consistently useful"	16:07
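The policy floated above (drop a job that is non-voting and has been failing for N period of time) could be checked with something like the following sketch. The Build record and the removal_candidates function are hypothetical illustrations; they are not the Zuul API, and the "SUCCESS" result string and the min_runs threshold are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Build:
    # Hypothetical record shape, for illustration only
    job_name: str
    voting: bool
    result: str  # assumed values such as "SUCCESS" or "FAILURE"
    end_time: datetime


def removal_candidates(builds, now, window_days, min_runs=1):
    """Return non-voting jobs with no successful run inside the window."""
    cutoff = now - timedelta(days=window_days)
    stats = {}  # job_name -> (runs, successes) within the window
    for b in builds:
        if b.voting or b.end_time < cutoff:
            continue
        runs, successes = stats.get(b.job_name, (0, 0))
        stats[b.job_name] = (runs + 1, successes + (b.result == "SUCCESS"))
    return sorted(
        name
        for name, (runs, successes) in stats.items()
        if runs >= min_runs and successes == 0
    )
```

The min_runs parameter is there so that a job which ran only once or twice in the window is not flagged on thin evidence; a team adopting such a policy would pick its own N and threshold.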
openstackgerrit	Lance Bragstad proposed openstack/governance master: Update charter to include PTL appointment https://review.openstack.org/620928	16:14
*** mriedem has joined #openstack-tc		16:21
fungi	mnaser: out of curiosity, did you happen to notice whether the slow swap creation was happening only in one of the two ovh regions? i've been trying to narrow down why we have 20x as many job timeouts in one as in the other even when we ran for nearly a week with the same max-servers in both	16:24
mnaser	fungi: i have not dug in that far into it to be honest	16:25
fungi	if filesystem access is waaay slower in one of them than the other, that could certainly explain it	16:25
fungi	no worries, that gives me something to test next	16:25
mnaser	yep, that could be a helpful next step	16:25
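Comparing the two regions as described above comes down to a timeout rate per region rather than a raw count, since the regions may not run the same number of jobs. A minimal sketch, assuming build results are available as (region, result) pairs and that timeouts are reported as "TIMED_OUT"; the function name and input shape are hypothetical.

```python
from collections import Counter


def timeout_rates(builds, timeout_result="TIMED_OUT"):
    """Map each region to the fraction of its builds that timed out."""
    total = Counter()
    timed_out = Counter()
    for region, result in builds:
        total[region] += 1
        if result == timeout_result:
            timed_out[region] += 1
    return {region: timed_out[region] / total[region] for region in total}
```

With equal job counts in both regions, as in the week with the same max-servers, the ratio of the two rates is the same as the ratio of the raw timeout counts.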
*** jamesmcarthur has quit IRC		16:38
*** jamesmcarthur has joined #openstack-tc		16:42
clarkb	mnaser: fwiw I think that quality and efficiency aren't exactly the same thing here. Yes we are inefficient, but separately we also seem to have regressions in quality which impact efficiency. Not running jobs that always fail will address efficiency positively and potentially quality negatively (because those jobs should pass if they test something useful)	16:43
*** whoami-rajat has quit IRC		16:48
*** whoami-rajat has joined #openstack-tc		16:55
*** jpich has quit IRC		17:00
*** mriedem is now known as mriedem_lunch		17:11
fungi	speaking of centos, looks like rhel 8 will still include python 2.7? https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8-beta/html/8.0_beta_release_notes/new-features#web_servers_databases_dynamic_languages_2	17:11
fungi	"Python 2.7 is available in the python2 package. However, Python 2 will have a shorter life cycle and its aim is to facilitate smoother transition to Python 3 for customers."	17:12
*** weshay is now known as he_hates_me		17:35
*** he_hates_me is now known as weshay		17:36
*** e0ne has quit IRC		17:45
*** openstackgerrit has quit IRC		17:51
*** jamesmcarthur has quit IRC		18:15
clarkb	fwiw I don't think a group of individuals should be the only people that care/act on quality concerns. But I also don't really see any change in behavior without something setting an example for others	18:16
scas	a rant on software quality from yesteryear is still relevant today https://queue.acm.org/detail.cfm?id=2349257	18:19
clarkb	another approach may be to set an expectation that teams have an "at least triage, but fixing is even better" day or week each milestone	18:28
clarkb	and don't prescribe activity directly. But instead use that as a reminder that we care about this stuff.	18:28
*** diablo_rojo has joined #openstack-tc		18:29
elbragstad	clarkb ++	18:31
*** jamesmcarthur has joined #openstack-tc		18:31
clarkb	I think in theory we've used the feature freeze and RC period for this sort of work, but it is hard to tell how effective that is as all the release projects get very quiet and the deployment project gets very busy	18:32
clarkb	(so as an outsider I don't have enough insight to know if those set aside periods are useful for this task)	18:33
*** logan- has joined #openstack-tc		18:36
scas	a bugbusting event does work in other long-lived open source projects, but it's the coordination that's always the unknown unknown	18:44
*** openstackgerrit has joined #openstack-tc		19:10
openstackgerrit	Doug Hellmann proposed openstack/governance master: clean up readme https://review.openstack.org/621270	19:10
*** mriedem_lunch is now known as mriedem		19:13
*** jamesmcarthur has quit IRC		19:20
*** jamesmcarthur has joined #openstack-tc		19:21
*** jamesmcarthur has quit IRC		19:28
*** jamesmcarthur has joined #openstack-tc		19:28
*** jamesmcarthur_ has joined #openstack-tc		19:29
*** jamesmcarthur has quit IRC		19:33
openstackgerrit	Doug Hellmann proposed openstack/governance master: add board working group data handling https://review.openstack.org/621277	19:42
*** whoami-rajat has quit IRC		20:08
*** jamesmcarthur_ has quit IRC		20:14
*** jamesmcarthur has joined #openstack-tc		20:15
*** jamesmcarthur has quit IRC		20:20
openstackgerrit	Jeremy Stanley proposed openstack/project-team-guide master: Document use of the openstack-discuss mailing list https://review.openstack.org/621284	20:38
fungi	trying to untangle the mentions of mailing lists in the governance-sigs repo, and having a hard time separating ideas people had about how sigs were going to work from how things actually shook out. for example, the bi-weekly newsletter/summary etherpad seems to have never actually been touched and i don't remember a single one ever going to any mailing list	20:47
fungi	mrhillsman: ttx: i think https://git.openstack.org/cgit/openstack/governance-sigs/tree/doc/source/index.rst#n55 might be due for removal from that document?	20:48
fungi	dhellmann: ^	20:48
fungi	happy to just rip that out while i'm making other edits	20:48
*** openstackgerrit has quit IRC		20:50
mrhillsman	++	20:50
fungi	seems it got overly-specific about process which wasn't actually in use yet	20:52
*** mriedem has quit IRC		22:54
*** cdent has quit IRC		23:07
*** tosky has quit IRC		23:42
