*** flaper87 has quit IRC | 00:09 | |
*** flaper87 has joined #openstack-tc | 00:14 | |
*** tosky has quit IRC | 00:27 | |
clarkb | Ok I've got to run now. Please do ping if others are interested in discussing more. I think it is an important thing we want to sort out | 00:49 |
---|---|---|
*** dklyle has joined #openstack-tc | 01:59 | |
*** dklyle has quit IRC | 02:05 | |
*** mriedem_afk has quit IRC | 02:19 | |
*** ricolin has joined #openstack-tc | 03:02 | |
*** dklyle has joined #openstack-tc | 03:11 | |
*** whoami-rajat has joined #openstack-tc | 03:12 | |
*** dklyle has quit IRC | 03:17 | |
*** diablo_rojo has quit IRC | 06:12 | |
*** e0ne has joined #openstack-tc | 06:32 | |
*** flaper87 has quit IRC | 06:32 | |
*** e0ne has quit IRC | 07:31 | |
*** Luzi has joined #openstack-tc | 07:38 | |
*** tosky has joined #openstack-tc | 08:42 | |
*** jpich has joined #openstack-tc | 08:53 | |
*** e0ne has joined #openstack-tc | 09:27 | |
*** ricolin has quit IRC | 10:44 | |
*** cdent has joined #openstack-tc | 10:59 | |
*** dtantsur|mtg is now known as dtantsur|afk | 11:00 | |
*** cdent has quit IRC | 11:20 | |
*** cdent has joined #openstack-tc | 11:46 | |
*** Luzi has quit IRC | 12:00 | |
*** jaypipes is now known as leakypipes | 12:44 | |
*** e0ne has quit IRC | 12:54 | |
*** cdent has quit IRC | 13:37 | |
*** e0ne has joined #openstack-tc | 13:37 | |
*** whoami-rajat has quit IRC | 13:49 | |
*** EmilienM is now known as EvilienM | 13:58 | |
openstackgerrit | Sean McGinnis proposed openstack/governance master: Add stable:follows-policy for cinder deliverables https://review.openstack.org/621164 | 14:11 |
*** mriedem has joined #openstack-tc | 14:19 | |
*** jamesmcarthur has joined #openstack-tc | 14:24 | |
*** cdent has joined #openstack-tc | 14:32 | |
*** lbragstad is now known as elbragstad | 14:36 | |
*** whoami-rajat has joined #openstack-tc | 14:38 | |
dhellmann | clarkb : good topic, and thanks for not waiting for office hours to raise it | 15:01 |
dhellmann | I'd like to include "gate stability" or "quality" somehow as a goal, but I'm struggling to come up with a way to quantify it in a per-team way so we can measure progress | 15:03 |
dhellmann | I'm not sure asking a specific group of people to dedicate their time to debugging the issues is the right approach. Where would we find those people? | 15:04 |
dhellmann | I do like the approach of incentivizing everyone to make their tests reliable by "rewarding" stable jobs with priority | 15:04 |
dhellmann | the implementation details there may be tricky | 15:05 |
cdent | that this [t F3u] was true in the past is part of why we have trouble now: we hope/think it is going to be other people that fix it. having it rotate and/or be "part time" is a nice idea but the amount of experience and expertise needed to do so is large, sadly | 15:06 |
purplerbot | <clarkb> one (admittedly less practical idea) I had was to encourage a sort of "sdague/jogo/mtreinish" rotation. Basically have a group of people that can take on the tasks they did in the past, but be explicit that it shouldn't be a full time thing to help avoid burn out but also ensure more than one person knows what to do [2018-11-29 23:03:51.525961] [n F3u] | 15:06 |
cdent | I recently put the word out internally that this specifically is a critical area and there were some warm rumblings in response, but I don't know if it will turn into anything real | 15:07 |
dhellmann | we need to design the system so we don't need heroes to keep it running | 15:09 |
smcginnis | ++ | 15:09 |
cdent | yes | 15:13 |
cdent | heroes are rare (and bad for health). When there are many people they are easier to find. | 15:14 |
*** jamesmcarthur has quit IRC | 15:33 | |
fungi | time to apply all that behavioral psychology i learned at university, i guess | 15:35 |
fungi | we can give users a lever that dispenses food pellets. also electrifying the cage floor is probably a viable tactic | 15:36 |
* cdent re-reads walden two | 15:37 | |
dhellmann | heh | 15:38 |
ttx | OH: "Heroes are bad for health" | 15:43 |
*** dansmith is now known as SteelyDan | 15:43 | |
dims | ttx : LOL | 15:45 |
*** jamesmcarthur has joined #openstack-tc | 15:46 | |
*** jamesmcarthur has quit IRC | 15:49 | |
*** jamesmcarthur has joined #openstack-tc | 15:49 | |
*** mriedem has quit IRC | 15:50 | |
mnaser | for example, it took me probably 20 minutes today to find out we were uselessly creating swap in OSA jobs because we didn't use any of it and i took 15-20 minutes to find out and push a fix to disable that behaviour | 15:54 |
mnaser | clarkb / infra-core: do we have stats on the number of always-failing non voting jobs? | 15:54 |
mnaser | i feel like those contribute a lot. | 15:54 |
cdent | yeah, good point | 15:54 |
fungi | what was the savings in job runtime from not creating a swapfile? | 15:54 |
mnaser | fungi: on ovh, 15-18 minutes | 15:56 |
fungi | wow | 15:56 |
mnaser | i dont know if this was a one-off | 15:56 |
fungi | i guess it wasn't being created sparse | 15:56 |
cdent | that much? that's rather surprising | 15:56 |
mnaser | we cant do sparse on certain os like centos 7 | 15:57 |
cdent | it makes it seem like $stuff is _way_ oversubscribed | 15:57 |
fungi | sparse swapfiles should be nearly instantaneous to create, but you risk crashing the node hard if you use all the disk | 15:57 |
mnaser | http://logs.openstack.org/36/619636/1/gate/openstack-ansible-deploy-aio_metal-ubuntu-bionic/72c540f/logs/ara-report/result/f6ed9f8a-419a-41b8-8d81-19d6e5aac6cc/ | 15:57 |
mnaser | sparse swapfiles dont work on xfs (which is centos-7) | 15:58 |
fungi | on the other hand, if the job does use more memory than anticipated, without swap you'll be unable to debug it when the oom killer sacrifices something which makes the node no longer accessible | 15:58 |
mnaser | fungi: i went over some of our numbers over successful jobs and we're far away from swapping | 15:58 |
fungi | so there are always trade-offs | 15:58 |
mnaser | like, some 4gb away from swapping.. | 15:58 |
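The sparse versus fully written trade-off discussed above can be illustrated with a small sketch. It is not the OSA job's actual swap setup (the log does not show that); it only compares setting a file's length without writing data against writing real zero bytes, the way a `dd` from `/dev/zero` would. The function names and the 64 MiB default are hypothetical, and `st_blocks` is assumed to be available (POSIX systems). Whether a sparse file can actually be enabled as swap depends on the platform, as noted in the conversation.

```python
import os
import tempfile
import time


def create_sparse(path, size):
    """Set the file length without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size)


def create_zero_filled(path, size, chunk=1024 * 1024):
    """Write real zero bytes, so the disk has to absorb every block."""
    zeros = bytes(chunk)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # force the data out so the timing includes disk I/O


def allocated_bytes(path):
    """Bytes actually allocated on disk (st_blocks counts 512-byte units)."""
    return os.stat(path).st_blocks * 512


def compare(size=64 * 1024 * 1024):
    """Create both kinds of file in a temp dir and report size, allocation, time."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, create in (("sparse", create_sparse),
                             ("zero_filled", create_zero_filled)):
            path = os.path.join(tmp, name)
            start = time.perf_counter()
            create(path, size)
            elapsed = time.perf_counter() - start
            results[name] = {
                "size": os.path.getsize(path),
                "allocated": allocated_bytes(path),
                "seconds": elapsed,
            }
    return results


if __name__ == "__main__":
    for name, info in compare().items():
        print(name, info)
```

On a filesystem that supports sparse files, the sparse variant typically reports far fewer allocated bytes and finishes almost immediately, while the zero-filled variant is bound by disk write speed. That matches the "slow disk" explanation for the long creation time, and the risk fungi mentions is the flip side: the sparse file's blocks are only claimed later, when the disk may already be full.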
mnaser | the other thing im struggling with right now with my ptl hat on is | 15:59 |
fungi | but yeah, if you don't use most of the ram and we're using platforms which don't support sparse swap (or you need the additional filesystem space) then dropping it is certainly a good call | 15:59 |
mnaser | contributions to implement things that need CI resources which are then not maintained by those who push them | 15:59 |
fungi | sure. in general "contributions to implement things [...] which are then not maintained by those who push them" has been a perpetual problem in openstack | 16:00 |
mnaser | the thing that bothers me is things like this | 16:01 |
mnaser | http://zuul.openstack.org/builds?job_name=openstack-ansible-deploy-aio_distro_lxc-opensuse-150&job_name=openstack-ansible-deploy-aio_distro_lxc-opensuse-423&branch=master&branch=openstack-ansible-deploy-distro_ceph-opensuse-423&branch=openstack-ansible-deploy-distro_ceph-opensuse-150 | 16:01 |
mnaser | that's a lot of wasted CI resources | 16:01 |
mnaser | and i'm really just wondering if we should come up with some policy that says if a job is non-voting and failing for N period of time, it will be removed. | 16:02 |
* cdent is still stuck on it taking so long to do a dd? | 16:02 | |
fungi | cdent: slow disk | 16:02 |
cmurphy | mnaser: there's nothing stopping you from creating that policy for your project | 16:02 |
fungi | mnaser: we've certainly done that from time to time, but yes maybe a policy within openstack would be good there | 16:02 |
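The "non-voting and failing for N period of time" idea can be sketched as a small filter over build records. This is only an illustration under stated assumptions: the `Build` record, its field names, and `removal_candidates` are hypothetical and not Zuul's API, and the rule chosen here (no `SUCCESS` inside the window, plus at least one build older than the window so brand-new jobs are not flagged) is one possible reading of the proposal, not an agreed policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Build:
    """Hypothetical build record; field names are assumptions for this sketch."""
    job_name: str
    voting: bool
    result: str          # "SUCCESS" counts as passing, anything else as failing
    end_time: datetime


def removal_candidates(builds, now, period):
    """Return names of non-voting jobs with no SUCCESS during the last `period`."""
    cutoff = now - period
    by_job = {}
    for build in builds:
        by_job.setdefault(build.job_name, []).append(build)

    candidates = []
    for name, job_builds in sorted(by_job.items()):
        # A job that votes anywhere is out of scope for this policy.
        if any(b.voting for b in job_builds):
            continue
        recent = [b for b in job_builds if b.end_time >= cutoff]
        # Require history older than the window so new jobs get a grace period.
        has_history = any(b.end_time < cutoff for b in job_builds)
        if recent and has_history and all(b.result != "SUCCESS" for b in recent):
            candidates.append(name)
    return candidates


if __name__ == "__main__":
    now = datetime(2018, 11, 30)
    builds = [
        Build("job-a", False, "FAILURE", datetime(2018, 10, 1)),
        Build("job-a", False, "FAILURE", datetime(2018, 11, 20)),
        Build("job-b", False, "FAILURE", datetime(2018, 10, 1)),
        Build("job-b", False, "SUCCESS", datetime(2018, 11, 15)),
    ]
    print(removal_candidates(builds, now, timedelta(days=30)))
```

A list like this would only be a starting point for a conversation with the job's owners. As clarkb points out later in the log, dropping a job that always fails helps efficiency but may hide a real quality problem if the job tests something useful.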
cdent | fungi: isn't that something that ought to be investigated too | 16:03 |
cdent | I have felt (since my dawn of openstack) that we are constantly in a state of oversubscription and it is _that_ which causes us so much pain | 16:03 |
fungi | cdent: yes, it's something we can bring to the attention of the service provider, but i think they have us on cheaper storage there by choice | 16:03 |
mnaser | cdent: hardware is expensive | 16:04 |
mnaser | no one is writing a check for that hardware :) | 16:05 |
mnaser | so there isn't exactly an SLA for donated infrastructure | 16:05 |
mnaser | cmurphy: fungi that's true, but i would be more comfortable if it was an openstack-y thing rather than grumpy-mo-keeps-seeing-failing-jobs-and-has-no-time-to-fix-them-so-he-removed-them | 16:06 |
cmurphy | it's not i'm-grumpy-so-i-removed-them it's "Our team's policy is to only keep running jobs that are consistently useful" | 16:07 |
openstackgerrit | Lance Bragstad proposed openstack/governance master: Update charter to include PTL appointment https://review.openstack.org/620928 | 16:14 |
*** mriedem has joined #openstack-tc | 16:21 | |
fungi | mnaser: out of curiosity, did you happen to notice whether the slow swap creation was happening only in one of the two ovh regions? i've been trying to narrow down why we have 20x as many job timeouts in one as in the other even when we ran for nearly a week with the same max-servers in both | 16:24 |
mnaser | fungi: i have not dug in that far into it to be honest | 16:25 |
fungi | if filesystem access is waaay slower in one of them than the other, that could certainly explain it | 16:25 |
fungi | no worries, that gives me something to test next | 16:25 |
mnaser | yep, that could be a helpful next step | 16:25 |
*** jamesmcarthur has quit IRC | 16:38 | |
*** jamesmcarthur has joined #openstack-tc | 16:42 | |
clarkb | mnaser: fwiw I think that quality and efficiency aren't exactly the same thing here. Yes we are inefficient, but separately we also seem to have regressions in quality which impact efficiency. Not running jobs that always fail will address efficiency positively and potentially quality negatively (because those jobs should pass if they test something useful) | 16:43 |
*** whoami-rajat has quit IRC | 16:48 | |
*** whoami-rajat has joined #openstack-tc | 16:55 | |
*** jpich has quit IRC | 17:00 | |
*** mriedem is now known as mriedem_lunch | 17:11 | |
fungi | speaking of centos, looks like rhel 8 *will* still include python 2.7? https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8-beta/html/8.0_beta_release_notes/new-features#web_servers_databases_dynamic_languages_2 | 17:11 |
fungi | "Python 2.7 is available in the python2 package. However, Python 2 will have a shorter life cycle and its aim is to facilitate smoother transition to Python 3 for customers." | 17:12 |
*** weshay is now known as he_hates_me | 17:35 | |
*** he_hates_me is now known as weshay | 17:36 | |
*** e0ne has quit IRC | 17:45 | |
*** openstackgerrit has quit IRC | 17:51 | |
*** jamesmcarthur has quit IRC | 18:15 | |
clarkb | fwiw I don't think a group of individuals should be the only people that care/act on quality concerns. But I also don't really see any change in behavior without something setting an example for others | 18:16 |
scas | a rant on software quality from yesteryear is still relevant today https://queue.acm.org/detail.cfm?id=2349257 | 18:19 |
clarkb | another approach may be to set an expectation that teams have an "at least triage, but fixing is even better" day or week each milestone | 18:28 |
clarkb | and don't prescribe activity directly. But instead use that as a reminder that we care about this stuff. | 18:28 |
*** diablo_rojo has joined #openstack-tc | 18:29 | |
elbragstad | clarkb ++ | 18:31 |
*** jamesmcarthur has joined #openstack-tc | 18:31 | |
clarkb | I think in theory we've used the feature freeze and RC period for this sort of work, but it is hard to tell how effective that is as all the release projects get very quiet and the deployment project gets very busy | 18:32 |
clarkb | (so as an outsider I don't have enough insight to know if those set aside periods are useful for this task) | 18:33 |
*** logan- has joined #openstack-tc | 18:36 | |
scas | a bugbusting event does work in other long-lived open source projects, but it's the coordination that's always the unknown unknown | 18:44 |
*** openstackgerrit has joined #openstack-tc | 19:10 | |
openstackgerrit | Doug Hellmann proposed openstack/governance master: clean up readme https://review.openstack.org/621270 | 19:10 |
*** mriedem_lunch is now known as mriedem | 19:13 | |
*** jamesmcarthur has quit IRC | 19:20 | |
*** jamesmcarthur has joined #openstack-tc | 19:21 | |
*** jamesmcarthur has quit IRC | 19:28 | |
*** jamesmcarthur has joined #openstack-tc | 19:28 | |
*** jamesmcarthur_ has joined #openstack-tc | 19:29 | |
*** jamesmcarthur has quit IRC | 19:33 | |
openstackgerrit | Doug Hellmann proposed openstack/governance master: add board working group data handling https://review.openstack.org/621277 | 19:42 |
*** whoami-rajat has quit IRC | 20:08 | |
*** jamesmcarthur_ has quit IRC | 20:14 | |
*** jamesmcarthur has joined #openstack-tc | 20:15 | |
*** jamesmcarthur has quit IRC | 20:20 | |
openstackgerrit | Jeremy Stanley proposed openstack/project-team-guide master: Document use of the openstack-discuss mailing list https://review.openstack.org/621284 | 20:38 |
fungi | trying to untangle the mentions of mailing lists in the governance-sigs repo, and having a hard time separating ideas people had about how sigs were going to work from how things actually shook out. for example, the bi-weekly newsletter/summary etherpad seems to have never actually been touched and i don't remember a single one ever going to any mailing list | 20:47 |
fungi | mrhillsman: ttx: i think https://git.openstack.org/cgit/openstack/governance-sigs/tree/doc/source/index.rst#n55 might be due for removal from that document? | 20:48 |
fungi | dhellmann: ^ | 20:48 |
fungi | happy to just rip that out while i'm making other edits | 20:48 |
*** openstackgerrit has quit IRC | 20:50 | |
mrhillsman | ++ | 20:50 |
fungi | seems it got overly-specific about process which wasn't actually in use yet | 20:52 |
*** mriedem has quit IRC | 22:54 | |
*** cdent has quit IRC | 23:07 | |
*** tosky has quit IRC | 23:42 |