Friday, 2019-09-13

fungiassuming he's talking about the 110bps kind you set the telephone handset onto00:00
clarkbI never had one myself but knew immediately what it was when I first saw a picture. All of a sudden those modem noises made so much sense to me :)00:00
paladoxheh00:00
fungihttps://en.wikipedia.org/wiki/Acoustic_coupler00:01
clarkbwe were always quite a few years behind on technology growing up so by the time the internet made it to the island via satellite connection 56kbps modems were the thing00:01
paladoxclarkb fungi https://www.gadgetspeak.com/aimg/565737-bt-home-hub-2-n.jpg00:01
fungiyeah, that looks a good deal more fancy than what i was thinking of00:02
paladoxclarkb super fast broadband only hit my area in 2012.00:02
fungii mean, it has buttons00:02
*** mattw4 has quit IRC00:02
fungithe phone i used my coupler with still used a rotary dial00:02
paladoxI remember that because they upgraded the green boxes outside i think.00:03
clarkbpalaodx: current speeds and rates back home https://www.fsmtc.fm/internet/adsl00:03
fungithen again, we didn't even get dtmf service where i lived until i was in my teens, so pushbutton phones were fairly irrelevant (though there were some that had a switch to go into "pulse dialing" mode)00:03
paladox:O00:03
clarkblooks like *Mbps for $220/month00:04
paladoxadsl is slow!00:04
clarkber 8Mbps00:04
paladoxit's expensive in the US00:04
clarkboh thats not the US00:04
clarkbin the US I pay $85/month for 100/100 Mbps symmetric connectivity00:04
paladoxoh00:04
clarkbI think I can actually get that up to 200/200 now for maybe the same price00:04
clarkbpaladox: I grew up on an island in the middle of the pacific00:05
paladoxoh00:05
fungiclarkb: it's probably sad then that my parents live in the continental usa and pay similar rates for similar speeds of dsl connection as on that chart00:05
clarkbfungi: ouch00:05
fungithey're lucky the phone company even started offering dsl at all. the only broadband option before that was hughes/directv satellite00:05
paladoxIn the UK, ofcom regulates the ISPs, which is awesome! Because then things don't become too expensive.00:05
fungithey've had dsl available for ~5 years now00:06
paladoxdsl only came to you 5 years ago?00:06
fungito where i grew up and my parents still live, i've moved out decades ago00:06
paladoxheh00:07
fungithe service where i live now is still crap but it's orders of magnitude faster00:07
paladoxwe got rid of dsl the year bt brough infinity to my area!00:07
paladox*brought00:07
clarkbI've got fiber to my home but the prices for gigabit are silly so I don't pay for it00:08
fungirural usa is woefully underserved by broadband providers00:08
paladoxThey've only just started rolling out gigabit here!00:08
clarkbmy ISP is actually getting acquired and I'm hoping that gigabit prices come down as a result00:08
paladoxBT offer gift cards if you're new to them.00:09
fungifastest speed of any broadband provider on the island here is 100mbps (charter/spectrum cable), which i consider not bad, but then again my standards are not high and my needs are not either00:09
paladoxfungi that's faster by 20-30mbps than what i have :)00:09
paladox80mbps00:09
paladoxthough they cannot market it as 80mbps anymore.00:10
fungii mean, i probably don't actually *get* 100mbps, i've never really tested it, and it seems to get overwhelmed during tourist season00:10
fungiplus they like to just turn it off at random during the week in the off-season to do maintenance, because they assume the island is basically vacant and nobody will notice00:11
paladoxheh00:11
paladoxbt offer a speed guarantee now.00:12
paladoxon certain plans.00:12
fungiguarantees are virtually nonexistent in the usa. for business use you can sometimes get an sla, but what that generally means is if you notice they're not meeting the agreement then you can ask for a pro-rated refund of lost service00:13
fungiand an sla on residential service is basically unheard of00:14
paladoxheh00:14
paladoxhttps://www.bt.com/broadband/deals/00:14
fungii should clarify, consumer protection is virtually nonexistent in the usa, and companies can usually get away with lying to you through their teeth with no real ramifications00:15
paladoxoh00:16
paladox:(00:16
fungi99% of customers won't realize they're being lied to, and the remaining 1% will spend more time and money fighting to get a refund than it's ultimately worth00:16
paladoxheh00:17
fungiresult: profit!00:17
* fungi is a bit of a cynic00:18
paladoxfungi we have an advertising regulator too!00:19
fungiwow00:19
paladoxfungi and consumers can complain :)00:20
fungii mean, we do have places to complain, but it's like the proverbial "complaints" sign on the paper shredder00:21
paladoxfungi https://www.asa.org.uk00:21
*** pkopec has joined #openstack-infra00:21
paladoxWe have ofgem (regulates energy companies), ofsted (regulates schools/further education), asa, ofcom.00:22
fungiwe have those things too, but generally the industries being regulated buy enough politicians to get their cronies appointed to run the regulatory bodies00:23
paladoxheh00:24
paladoxit's all independent here00:24
* paladox especially likes ofcom!00:25
donnydok I added some static routes00:25
paladoxit forced my mobile network to offer me unlimited tethering00:25
paladoxas i'm on a unlimited data plan00:25
fungipaladox: score!00:25
donnydshould keep the mirror traffic from going all the way up and down00:25
fungidonnyd: anything in particular we should keep an eye on? do we need to add any routes on the mirror itself?00:26
fungior is this all handled by the respective gateways still?00:26
donnydfungi: they should be handled in the openstack router, so nah00:26
fungiawesome. thanks!00:26
donnydI just dropped a static route in the router of each project00:27
fungiwill keep an eye out for any obvious connectivity issues between nodes and the mirror00:27
donnydhttps://www.irccloud.com/pastebin/bSBPNA4q/00:27
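For anyone retracing this later: a static route like the one described above can be pinned onto a Neutron router with the OpenStack SDK. This is only a minimal sketch; the cloud name, router name, destination and next hop below are made-up placeholders, not FN's real values.

    import openstack

    # connect using a clouds.yaml entry (cloud name is a placeholder)
    conn = openstack.connect(cloud='fortnebula')

    # find the tenant router and add a host route so mirror-bound traffic
    # stays on the network node instead of hairpinning through the edge
    router = conn.network.find_router('zuul-tenant-router')  # placeholder name
    conn.network.update_router(
        router,
        # note: this replaces any existing extra routes on the router
        routes=[{'destination': '203.0.113.10/32',  # mirror address (example)
                 'nexthop': '10.0.16.1'}],          # local gateway (example)
    )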
*** armax has quit IRC00:28
fungilooks good to me00:28
donnydso I still have a bit of work to do to solve non-mirror related traffic inbound and outbound00:28
donnydhowever the large majority of failures we were seeing were related to talking to the mirror00:29
donnydshort of ipv6 just not working correctly at all, i can't get a much more direct path to the mirror and something else is up00:29
fungifingers crossed, but this seems likely to help00:30
donnydyea I can imagine that taking out the extra hop will make things work a bit better... next on the list is to solve once and for all the retransmission issue00:31
donnydI have some ideas, but I really need to test this out and see where and why they are occurring00:31
donnydHopefully I have an answer by tomorrow00:32
openstackgerritIan Wienand proposed opendev/glean master: Update testing to opensuse 15  https://review.opendev.org/67951200:32
donnydjust for giggles, do you think you could fire up iperf3 in the mirror00:33
fungisure, just a sec while i get my notes00:34
donnydiperf3 -s00:36
fungiwell, more importantly making sure i open the right ports in iptables00:36
*** xenos76 has quit IRC00:36
donnydshould be tcp/5201 if I am not mistaken00:37
fungi5201/tcp00:37
fungilooks like00:37
fungiyeah00:37
*** rosmaita has quit IRC00:39
donnydthanks fungi00:40
fungiServer listening on 520100:42
fungigive it a whirl00:42
fungii've opened that port in iptables for both v6 and v6 traffic00:42
fungier, both v4 and v500:43
fungigah00:43
* fungi stops trying to type00:43
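A rough sketch of the throughput/retransmission check being set up here: with "iperf3 -s" listening on the mirror and 5201/tcp open, a single-stream client run can be driven and parsed like this (the hostname is a placeholder).

    import json
    import subprocess

    # run a single-stream TCP test against the server started with "iperf3 -s"
    result = subprocess.run(
        ['iperf3', '-c', 'mirror.example.org', '-J'],  # -J = JSON output
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    sent = report['end']['sum_sent']
    print('throughput: %.2f Gbit/s' % (sent['bits_per_second'] / 1e9))
    print('retransmits:', sent['retransmits'])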
*** Garyx_ has quit IRC00:49
*** rosmaita has joined #openstack-infra00:52
clarkbcmurphy: https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d32/681691/1/gate/cross-keystone-py35/d32391c/ that failed with a timeout do we need to bump the requirements side too?00:52
cmurphyclarkb: yeah we can but since requirements is frozen i figured it wasn't as high a priority00:53
*** gyee has quit IRC00:53
donnydfungi: ok it looks pretty good to me00:56
fungistill using it or should i button it back up now?00:56
donnydits def not hitting the edge and performance is expected00:56
donnydyou can take it back down00:57
*** pkopec has quit IRC00:57
fungithanks again, on it00:57
donnydthanks... i just wanted to validate that flows will run around the 10G mark00:57
donnydhttps://www.irccloud.com/pastebin/ho4EZJrb/00:57
donnydnot too bad for a single thread00:58
donnydI don't have dpdk running or any network accelerators, so that is pretty good00:58
donnydI just need to develop a load test to make sure that retransmissions aren't crazy... some are normal, but millions are not01:00
fungiokay, iperf3 stopped and uninstalled, temporary iptables rules deleted, all back to normal01:01
*** markvoelker has joined #openstack-infra01:01
*** calbers has quit IRC01:02
*** Garyx has joined #openstack-infra01:03
*** markvoelker has quit IRC01:05
donnydfungi: ok it looks to me like setting the mtu on my tunnel interface has significantly lowered the retransmissions01:09
openstackgerritDonny Davis proposed openstack/project-config master: Direct Mirror route + Path MTU fixed Reenable FN  https://review.opendev.org/68195101:11
fungigood deal01:13
donnydin ntop you can watch a connection and it seemed like every other packet was a retrans... now I see none01:15
clarkbdonnyd: that is after swapping in the neutron router for your edge router?01:15
donnydSo for ipv4 there it is now using an openstack router with a direct public connection01:17
donnydFor ipv6 traffic destined for the mirror there is a static route in the v6 Openstack router that points directly to the mirror01:17
donnydFor ipv6 traffic destined for not the mirror it still hits the edge, but this time with a proper MTU value01:17
donnydin v6 land, the connection goes as follows - test node -> Zuul tenant Openstack Router -> Mirror Tenant Openstack Router -> Mirror01:18
donnydIt used to go - test node -> Zuul tenant Openstack Router -> Edge Router ->  Mirror Tenant Openstack Router -> Mirror01:19
donnydSo now the connection never leaves the network node for mirror bound traffic01:19
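Side note on the MTU fix mentioned above: lowering a tunnel interface's MTU is a one-liner with iproute2, and the resulting path MTU can be sanity-checked with a don't-fragment ping. The interface name and sizes below are placeholders, not FN's actual settings.

    import subprocess

    # clamp the tunnel MTU so encapsulated packets aren't silently fragmented
    subprocess.run(['ip', 'link', 'set', 'dev', 'gre1', 'mtu', '1400'], check=True)

    # verify the path: -M do sets the don't-fragment bit, -s is the payload size
    # (payload + 28 bytes of IP/ICMP headers must fit within the path MTU)
    subprocess.run(['ping', '-c', '3', '-M', 'do', '-s', '1372', 'mirror.example.org'])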
donnydSo to fix v4 connection issues the Openstack router will relieve the pressure on the edge router NAT table, because before it was getting 100% of the load.. Now it just services non zuul tenant traffic01:21
donnydI am hoping we have a winner winner chicken dinner01:22
clarkbmakes sense01:22
*** mriedem_afk has quit IRC01:23
donnydIn all honesty it really makes no sense.. I have had the exact same edge router config doing 100X more... but it was mostly voip traffic, so maybe not as noisy01:23
fungithat's usually longer established sessions01:24
fungiso less churn on the nat table01:24
donnydyea, and I am thinking exactly that is the issue on the v4 side01:24
donnydits not expiring fast enough to keep up01:25
fungiit's usually rapid nat session turnover which burns you because of the cooldown before a port can be reallocated for another session01:25
donnydI can fix that for CI purposes... but I worry it will break lots of other things around here... and I would worry about affecting the other services and the way they currently function01:25
fungimakes sense01:26
donnydI can turn it down to kill the state in milliseconds.. but then it also may drop non-busy connections too01:26
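For context, the NAT-state expiry being discussed is governed by the kernel's conntrack timeouts. A sketch of inspecting (and, cautiously, lowering) the relevant knobs; the values here are illustrative examples, not a recommendation.

    # conntrack timeout sysctls that control how long NAT state lingers
    settings = {
        'net.netfilter.nf_conntrack_tcp_timeout_time_wait': 30,
        'net.netfilter.nf_conntrack_tcp_timeout_close_wait': 15,
    }
    for key, value in settings.items():
        path = '/proc/sys/' + key.replace('.', '/')
        with open(path) as f:
            print(key, '=', f.read().strip())
        # uncomment to actually apply the example value (requires root):
        # with open(path, 'w') as f:
        #     f.write(str(value))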
donnydso I am thinking this change will solve both the v4 issues we have seen and the v6 issues01:27
donnydWon't know for sure till a full load is back on... but even with a light load before it was 100% clear something was wrong01:27
fungii recall one customer years back who operated a mobile browsing gateway service for cellphone providers and had to do a ton of nat to map between disparate and conflicting private phone networks. the connection churn was so bad we needed almost 10x as many ports in the nat pool as their peak for concurrent sessions01:28
donnydIn the last 30 or so minutes there have been about 15 retransmissions01:28
clarkbfungi: this is why they all run ipv6 now01:28
donnydbefore it would have been 5-8K in the same time period01:28
fungiyup01:28
clarkbthe one place in this country you can get an ipv6 address is your cell phone network01:28
fungiclarkb: except if you want to tether to it, and then you end up with an ipv4 network masquerading behind your v6-routerphone01:30
donnydit really makes a lot of sense for cell phones.. they are so transient and v4 space is so limited01:30
fungiso silly01:30
fungithey could easily route a /64 behind every cellphone in the world and not even make a dent in v6 address consumption01:31
donnydSo here is a good example of the fix01:31
donnydhttps://usercontent.irccloud-cdn.com/file/kb4MEjSF/image.png01:31
*** dave-mccowan has quit IRC01:31
donnydThis is an inbound ssh connection and before the retransmit # would have been in the 500-1k mark by now01:32
fungiyeesh01:32
donnydAlso I hope all of this data is helpful to someone down the road... or at least for our own education01:33
fungiit's been very enlightening for me so far01:33
donnydthey could give everyone their own permanent /50 and still not even touch v601:33
fungiif you ever get an opportunity to summarize your work on this in an article or a conference talk, i'm certain it would be useful too01:34
fungiwell, it would probably be a /56 or a /48... easier to carve on octet boundaries for reverse dns zone delegation purposes01:35
donnydYea it would be good to write it all up once it actually works really well. I don't have big money gear doing any of this... anyone could do the same thing01:35
donnydyea that is true01:35
fungimtreinish had a great talk a few years back about his "closet cloud" and this could make an interesting sequel01:36
donnydIt has taken quite a lot of "test and tune" to  get this thing to perform close to what the big guys like vexxhost and ovh can do01:37
fungithough he didn't have nearly as many hosts01:37
fungiand was doing it on a very shoestring budget01:37
donnydI have 8 compute nodes / one controller / two swift / one cinder01:38
donnydalthough I am thinking with as little space as the swift service is taking, I may run it on the openstack infra so I can take advantage of the nvme drives01:39
fungiyeah, i think he said his was 4 compute nodes, but probably also lower-end servers (it was also longer ago)01:39
*** JorgeFranco has joined #openstack-infra01:39
fungii think they were used dell 1u rackmounts he got for a song on ebay01:40
donnydyea that's my play too01:40
donnydI wear people down on ebay and get them for nothing01:41
donnydeach of the compute nodes 4X cpus 128G ram I got for 200 bucks each01:41
fungiyeah, that's a remarkable deal01:42
donnydthe nvme stuff was a little on the pricey side.. but less than most people pay for a gaming rig01:42
donnyd24T of nvme is what I got in the compute nodes... any guesses how much I paid?01:42
fungii don't really know the going rate, i think i paid a few hundred last year for 2tb01:43
donnydjust for proof its actually fixed01:43
donnydhttps://usercontent.irccloud-cdn.com/file/Ix9FqH24/image.png01:43
donnyd1 more retransmission01:43
donnydbut that is acceptable01:43
fungiyeah, might as well be 0 compared to before01:44
donnydin the time I had it up there were a total of 43M01:44
donnydit was only up for a short while... so yea I would say its in much better shape01:45
donnyd1800 bucks for the nvme drives01:45
donnydor $1.3 per GB01:45
donnydnot too bad01:46
*** whoami-rajat has quit IRC01:47
openstackgerritMerged openstack/project-config master: Direct Mirror route + Path MTU fixed Reenable FN  https://review.opendev.org/68195101:48
funginot bad? that's crazy inexpensive01:48
*** dolpher has quit IRC01:49
donnydI just keep making offers until they take it... I usually get told no... but every now and again someone says eff it01:49
donnydMy favorite thing to do is when i make an offer for say 100 dollars on something and they come back with 120.. I send another offer for 9001:49
donnydLOL01:49
fungigenius01:50
fungiyou clearly paid attention to the haggling scene in life of brian01:50
donnydI have never seen it.. but now I am going to01:51
fungiit's a brief scene, but iconic01:51
fungibrian's in a hurry (on the run), doesn't want to haggle to buy a disguise at a market stall, and the purveyor proceeds to lay into him with a lesson on how to haggle01:52
fungione of eric idle's more memorable performances on the big screen01:54
*** tkajinam has quit IRC01:54
donnydI am going to have to check it out01:55
donnydI am going to stick around till some jobs are scheduled to make sure there are no/minor issues...01:56
fungithanks, i'm here to help with any reverts01:56
fungishould be up for a few hours still01:56
donnydI need to turn on skydive so people can introspect the traffic on FN01:57
donnydever seen it?01:58
funginot sure i have context. a network visualization app?02:01
fungiaha, http://skydive.network/02:02
fungilooks vaguely familiar02:02
fungithat might have made an appearance in a conference talk i saw, hard to remember. does certainly look neat02:03
*** apetrich has quit IRC02:10
*** tkajinam has joined #openstack-infra02:11
donnydIt lets you get into the SDN of Openstack and see what is going on02:22
donnydyou can even use the traffic generator to make sure a connection that should work does02:23
*** markvoelker has joined #openstack-infra02:26
*** hongbin has joined #openstack-infra02:32
*** roman_g has quit IRC02:34
*** tkajinam has quit IRC02:36
*** markvoelker has quit IRC02:36
openstackgerritMerged opendev/glean master: Update testing to opensuse 15  https://review.opendev.org/67951202:36
*** markvoelker has joined #openstack-infra02:38
*** markvoelker has quit IRC02:43
*** tkajinam has joined #openstack-infra02:56
*** dklyle has quit IRC02:57
*** dklyle has joined #openstack-infra03:00
donnydI have just about got it all up and running if you're curious to take a look03:01
donnydabout 5 more minutes and i should be done03:01
fungisure, still around03:02
*** rf0lc0 has quit IRC03:05
*** bobh has joined #openstack-infra03:08
donnydhttps://openstack.fortnebula.com:8082/topology03:14
donnydlmk if you can see this03:14
*** ramishra has joined #openstack-infra03:16
fungichecking03:16
fungiyep! it loads03:17
*** njohnston has quit IRC03:17
fungiand i can reorient the topology view/gravity and stuff just fine03:18
donnydso i will probably turn the auth back on in the near future.. but this could be real handy for troubleshooting the network issues03:19
donnydand all the data is saved in elastic03:19
fungi(firefox in debian/unstable with no fancy plugins besides some privacy extensions)03:19
fungiyeah, it looks like stuff like the generator isn't locked down, so you probably don't want it left open03:20
fungialso careful not to expose your elasticsearch api, like, ever03:20
donnydyea, elastic is no good for the interwebs03:21
donnydI'm sure there are bots constantly scanning for it03:22
fungiwe have one backending our wiki search which accidentally got exposed once during an in-place distro upgrade where iptables-persistent and puppet decided to fight it out over what files were symlinks so we ended up with a self-referential pair of symlinks for firewall rules... elasticsearch was compromised in minutes03:23
*** bobh has quit IRC03:25
donnydwoops03:26
fungiand the skydive tab i had open just threw up a sign in widget! wow that's reactive03:26
donnydSo lets say I needed to find some traffic.. I could just use the capture interface to find what I am looking for and get some metrics or something fancy like that03:27
donnydSo go click on the openstack bubble and then br-int03:30
donnydIf you want to see what's happening, you can click create capture, and then put the cursor in interface 1 and then click br-int again03:31
donnydand it will start capturing traffic on that bridge03:32
donnydand then head over to the flows tab and you can see traffic moving03:32
donnydand of course you can use a BPF filter to see something more specific03:33
*** hongbin has quit IRC03:34
donnydso its been a while and FN is still sitting at 10 nodes... should be 2003:34
*** factor has quit IRC03:35
fungilooks like it's still max-servers: 0 in the config on disk for nl0203:36
fungii'll see if we're having trouble puppeting it03:36
fungiSep 13 01:47:36 nl02 puppet-user[20819]: Applied catalog in 6.05 seconds03:37
fungiso it's been nearly 2 hours since the last pulse completed03:37
fungisomething has caused our configuration management updates to take in the neighborhood of 100+ minutes recently03:38
fungithat's no good03:38
fungiwill see if i can identify the cause of the slowdown03:39
fungiit was closer to 45 minutes up until the last few pulses03:40
fungitons of defunct ansible-playbook processes with a start time of 02:18z03:48
fungilooks like the only active connection is to storyboard-dev03:52
fungiand i'm timing out trying to ssh into it03:52
*** JorgeFranco has quit IRC03:54
donnydwell that is not good03:54
fungiyeah, pulling up console access for the instance03:54
fungihung kernel tasks spamming the virtual console03:56
fungigoing to reboot it03:57
*** rh-jelabarre has quit IRC03:58
fungiand now i can ssh into it again03:58
fungikilled the stuck ansible ssh process to that server and it's proceeding to configure servers again03:59
fungi#status log hard-rebooted storyboard-dev due to hung kernel tasks rendering server unresponsive, killed stuck ansible connection to it from bridge to get configuration management to resume a timely cadence04:00
openstackstatusfungi: finished logging04:00
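A small helper along these lines can spot the defunct ansible-playbook processes mentioned above; using psutil here is an assumption for illustration, not something the log says is installed on the bridge host.

    import datetime
    import psutil

    # list zombie/defunct ansible-playbook processes and when they started
    for proc in psutil.process_iter(['pid', 'name', 'status', 'create_time']):
        info = proc.info
        if info['name'] == 'ansible-playbook' and info['status'] == psutil.STATUS_ZOMBIE:
            started = datetime.datetime.utcfromtimestamp(info['create_time'])
            print(info['pid'], 'defunct since', started.isoformat() + 'Z')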
donnydSo I am working on that swift dashboard widget so we can see trends04:01
donnydi have something up there.. but no idea if that is what we are looking for04:01
fungineat. i think once i see nl02 pick up the fn max-servers change i'm knocking off for the night04:01
donnydyea me too, i need to get some zzs04:02
donnydthanks for the help fungi :)04:02
fungimy pleasure. and thanks for yours!04:02
donnydclarkb: https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?from=now%2FM&to=now&fullscreen&panelId=2604:04
*** ykarel|away has joined #openstack-infra04:06
donnydIt's pretty ugly.. but I am not entirely sure what you are looking for04:07
* donnyd pulls ripcord and jumps out the plane04:11
donnydI am headed out fungi04:11
*** eharney has joined #openstack-infra04:11
*** markvoelker has joined #openstack-infra04:16
*** markvoelker has quit IRC04:21
*** eharney has quit IRC04:27
*** kjackal has joined #openstack-infra04:31
*** exsdev has quit IRC04:41
*** exsdev has joined #openstack-infra04:43
*** ykarel|away has quit IRC04:43
*** pcaruana has joined #openstack-infra05:02
*** udesale has joined #openstack-infra05:08
*** ykarel|away has joined #openstack-infra05:13
*** jtomasek has joined #openstack-infra05:16
*** diablo_rojo has quit IRC05:30
*** tkajinam has quit IRC05:54
*** tkajinam has joined #openstack-infra06:03
*** xek has joined #openstack-infra06:03
*** ykarel|away is now known as ykarel06:03
*** rpittau|afk is now known as rpittau06:08
*** xek has quit IRC06:12
*** xek has joined #openstack-infra06:13
*** ralonsoh has joined #openstack-infra06:14
*** tkajinam_ has joined #openstack-infra06:19
*** tkajinam has quit IRC06:22
*** kjackal has quit IRC06:24
*** slaweq has joined #openstack-infra06:51
*** lpetrut has joined #openstack-infra06:54
*** diga has quit IRC06:55
*** trident has quit IRC06:55
*** xek has quit IRC06:56
*** pkopec has joined #openstack-infra06:56
*** pgaxatte has joined #openstack-infra06:57
*** kjackal has joined #openstack-infra07:00
*** whoami-rajat has joined #openstack-infra07:02
*** udesale has quit IRC07:04
*** lajoskatona has joined #openstack-infra07:06
*** jbadiapa has joined #openstack-infra07:06
*** trident has joined #openstack-infra07:07
*** Florian has joined #openstack-infra07:10
*** kaiokmo has quit IRC07:12
*** tosky has joined #openstack-infra07:12
*** tesseract has joined #openstack-infra07:15
lajoskatonafrickler: Hi, a question: Is there a way to give more than 8GB memory to ODL tempest executions? By dstat it seems that ODL (java.....) consumes all memory, and I suppose vm boot and other things are starving; this is why we have ugly timeouts.07:15
openstackgerritMerged opendev/system-config master: fedora mirror update : add sleep  https://review.opendev.org/68136707:15
evrardjpcorvus: I see your name in system-config/kubernetes/percona-xtradb-cluster. I wanted to hear about the stability of that helm chart, and if you tried other things :)07:17
*** tosky_ has joined #openstack-infra07:18
*** ralonsoh has quit IRC07:19
*** shachar has quit IRC07:20
*** ralonsoh has joined #openstack-infra07:21
*** tosky has quit IRC07:21
*** ramishra has quit IRC07:26
*** ramishra has joined #openstack-infra07:28
*** jaosorior has joined #openstack-infra07:35
*** prometheanfire has quit IRC07:36
*** gfidente has joined #openstack-infra07:37
*** prometheanfire has joined #openstack-infra07:38
*** tosky_ is now known as tosky07:38
*** jpena|off is now known as jpena07:40
AJaegerevrardjp: good morning, do you need https://review.opendev.org/673019 before train branches are created?07:44
*** kaiokmo has joined #openstack-infra07:45
openstackgerritMerged openstack/diskimage-builder master: Use x86 architeture specific grub2 packages for RHEL  https://review.opendev.org/68188907:52
openstackgerritMerged openstack/diskimage-builder master: Move doc related modules to doc/requirements.txt  https://review.opendev.org/62846607:52
*** apetrich has joined #openstack-infra07:52
*** ykarel is now known as ykarel|lunch07:56
*** priteau has joined #openstack-infra07:57
*** Tengu has quit IRC07:59
*** tkajinam_ has quit IRC08:01
*** Tengu has joined #openstack-infra08:03
*** gfidente has quit IRC08:06
*** dchen has quit IRC08:13
*** ccamacho has joined #openstack-infra08:14
*** ccamacho has quit IRC08:14
openstackgerritAndreas Jaeger proposed openstack/diskimage-builder master: Only install doc requirements if needed  https://review.opendev.org/68199108:17
*** ccamacho has joined #openstack-infra08:18
*** gfidente has joined #openstack-infra08:25
*** snapiri has joined #openstack-infra08:26
evrardjpgood morning08:29
evrardjpI will have a look08:29
evrardjpAJaeger: I think we agreed that we need to exercise more going that way. The problem was a chain of dependencies to be settled before this, and those aren't done yet. But I will update the minimum thing I have to do08:30
evrardjptoday08:31
*** roman_g has joined #openstack-infra08:34
evrardjpwow that's hardly english08:34
evrardjpI meant that 1) Dependencies have to be done first 2) we need to exercise more 3) so we concluded that there is no urgency afaik for this, yet we should do it.08:35
*** xenos76 has joined #openstack-infra08:37
*** e0ne has joined #openstack-infra08:38
*** gfidente has quit IRC08:39
AJaegerevrardjp: understood08:41
*** gfidente has joined #openstack-infra08:43
*** markvoelker has joined #openstack-infra08:44
*** exsdev0 has joined #openstack-infra08:44
*** derekh has joined #openstack-infra08:45
*** exsdev has quit IRC08:45
*** exsdev0 is now known as exsdev08:45
*** markvoelker has quit IRC08:49
*** ralonsoh has quit IRC08:51
*** ralonsoh has joined #openstack-infra08:55
*** jaosorior has quit IRC09:00
*** ykarel|lunch is now known as ykarel09:08
*** udesale has joined #openstack-infra09:15
*** Florian has quit IRC09:19
*** FlorianFa has joined #openstack-infra09:19
*** lajoskatona has quit IRC09:31
*** lajoskatona has joined #openstack-infra09:35
*** iurygregory has joined #openstack-infra09:43
*** ralonsoh has quit IRC09:48
*** ralonsoh has joined #openstack-infra09:48
*** ralonsoh has quit IRC09:50
*** ralonsoh has joined #openstack-infra09:53
*** udesale has quit IRC09:56
*** udesale has joined #openstack-infra09:57
*** tkajinam has joined #openstack-infra10:05
*** ociuhandu has joined #openstack-infra10:06
*** ociuhandu has quit IRC10:07
*** pcaruana has quit IRC10:11
*** shachar has joined #openstack-infra10:20
*** snapiri has quit IRC10:23
*** ociuhandu has joined #openstack-infra10:23
*** ociuhandu has quit IRC10:28
*** pgaxatte has quit IRC10:34
donnydlajoskatona: FN has expanded labels10:38
*** ociuhandu has joined #openstack-infra10:38
*** ociuhandu has quit IRC10:44
lajoskatonadonnyd:Hi, could you explain this please, I am not an expert of infra10:52
*** elod has quit IRC10:54
donnydYou can run an experimental job against the expanded labels10:54
*** elod has joined #openstack-infra10:55
donnydhttps://opendev.org/openstack/project-config/src/branch/master/nodepool/nl02.openstack.org.yaml#L35610:56
lajoskatonadonnyd: thanks, just to be on the same page.10:58
lajoskatonadonnyd: If I add nodepool: ubuntu-bionic-expanded, which as I see has 16G memory, zuul will build the VM accordingly?11:00
donnydNodepool will route your job to the appropriate provider that has that label11:01
lajoskatonadonnyd: ok, I check with it. Thanks again11:02
donnydJust a reminder, only do this for experimental purposes. Not all the providers have this label11:02
lajoskatonadonnyd: ok, I come back to infra if I have results11:05
*** jbadiapa has quit IRC11:08
openstackgerritDonny Davis proposed openstack/project-config master: Fixes for FN seem to have worked - scaling up  https://review.opendev.org/68202611:13
AJaegerfungi, frickler, can you confirm and want to +2A? ^11:14
AJaegerdonnyd: thanks11:14
donnydAJaeger: In testing on my end it all was working much mo betta... so hopefully this is it11:15
donnydwon't know 100% till it gets there in scale11:15
*** lpetrut has quit IRC11:20
*** iurygregory has quit IRC11:23
*** pcaruana has joined #openstack-infra11:24
*** calbers has joined #openstack-infra11:32
*** lucasagomes has joined #openstack-infra11:34
*** rh-jelabarre has joined #openstack-infra11:39
AJaegersure ;)11:39
*** sshnaidm|rover is now known as sshnaidm|off11:40
*** calbers has quit IRC11:40
*** udesale has quit IRC11:40
*** udesale has joined #openstack-infra11:41
*** ociuhandu has joined #openstack-infra11:41
*** lpetrut has joined #openstack-infra11:42
*** ociuhandu has quit IRC11:42
*** jpena is now known as jpena|lunch11:48
*** lpetrut has quit IRC11:59
*** lpetrut has joined #openstack-infra11:59
*** rkukura has quit IRC12:01
*** pgaxatte has joined #openstack-infra12:03
*** markvoelker has joined #openstack-infra12:04
*** apetrich has quit IRC12:08
*** goldyfruit has quit IRC12:12
*** apetrich has joined #openstack-infra12:16
*** ociuhandu has joined #openstack-infra12:22
*** tkajinam has quit IRC12:25
*** jaosorior has joined #openstack-infra12:27
*** Goneri has joined #openstack-infra12:32
*** ociuhandu has quit IRC12:32
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-tox-output: introduce zuul_use_fetch_output  https://review.opendev.org/68186412:35
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-subunit-output: introduce zuul_use_fetch_output  https://review.opendev.org/68188212:35
*** rlandy has joined #openstack-infra12:35
*** ociuhandu has joined #openstack-infra12:39
*** dave-mccowan has joined #openstack-infra12:44
*** rf0lc0 has joined #openstack-infra12:44
*** jbadiapa has joined #openstack-infra12:44
sean-k-mooneyfungi: so this is a dumb idea but we were talking about alternative ways of scheduling jobs yesterday12:44
sean-k-mooneyfor any given project, could we order its pending jobs by a kind of weighted average of the review priority field in gerrit and the length of time it has been in the pipeline, while still doing round robin across projects within the pipeline?12:46
*** lpetrut has quit IRC12:48
sean-k-mooneybasically score = (1 + review-priority-label) * time in pipeline12:48
sean-k-mooneythen within a pipeline, group by project and sort the project's jobs by score12:49
sean-k-mooneythat would allow project teams to set review priority to +2 on a change they want merged, to give it higher priority12:50
sean-k-mooneye.g. a job with a +2 priority that had been waiting for 1 hour would have the same score as a job with a 0 priority that had been waiting for 312:51
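Purely as an illustration of the scoring idea above (this is not how Zuul orders node requests today), the weighting works out like this:

    import time

    def score(review_priority, enqueued_at, now=None):
        """score = (1 + review-priority-label) * time in pipeline."""
        now = time.time() if now is None else now
        return (1 + review_priority) * (now - enqueued_at)

    # a +2 item waiting 1 hour scores the same as a 0-priority item waiting 3
    assert score(2, 0, now=3600) == score(0, 0, now=3 * 3600)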
pabelangersean-k-mooney: zuul has support for relative priority today, but it is based on the change queue a project is part of: https://zuul-ci.org/docs/zuul/admin/components.html#attr-scheduler.relative_priority12:51
pabelangerI want to say, if you are in the integrated queue, it is also weighted against the other projects in it too12:52
pabelangerthat works well, to let smaller projects access to resources12:52
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-tox-output: introduce zuul_use_fetch_output  https://review.opendev.org/68186412:52
sean-k-mooneyyep we want to keep that12:52
sean-k-mooneybut FF (feature freeze) is today/yesterday12:52
sean-k-mooneyand it would have been nice to say merge these patches first12:53
sean-k-mooneyto avoid conflict12:53
sean-k-mooneyand these other ones later12:53
*** lpetrut has joined #openstack-infra12:53
sean-k-mooneywe tried to do that by not approving things but that does not always work12:53
sean-k-mooneypabelanger: but ya it would effectively be like the relative_priority option, but instead of operating at the pipeline level it would operate on the individual patches.12:54
pabelangersean-k-mooney: yah, in this case, an option is to abandon or ask humans not to add new patches to the change queue, to allow the important ones to get nodesets first.  However, that is a manual step.  Long term, I've mostly found, more resources also fixes the issue :)12:54
*** EmilienM is now known as EvilienM12:55
sean-k-mooneypabelanger: yes throwing more cores (both human and machine) can help but both are limited resources12:55
AJaegerand a gate that is stable ;) So bug fixes instead of unstable features :)12:55
pabelangerAJaeger: +10012:56
sean-k-mooneyAJaeger: actually in this case specifically the issue has not been stability12:56
sean-k-mooneyit was more that we had 3 features that had trivial merge conflicts12:57
sean-k-mooneylike literally whitespace12:57
sean-k-mooneyso we ended up putting all 3 features into 1 chain of patches to avoid that12:57
sean-k-mooneythen a separate feature merged and put the chain into merge conflict12:57
*** ociuhandu has quit IRC12:58
AJaegersean-k-mooney: you can check *before* approving for conflicts, there's a tab in gerrit for it.12:58
sean-k-mooneyyes i know12:58
*** dave-mccowan has quit IRC12:58
AJaegersean-k-mooney: and I doubt that your proposal above would have helped with this specific case - or would it?12:58
sean-k-mooneyno12:58
sean-k-mooneywell it might12:59
sean-k-mooneyin that the big chain could have been marked as high priority12:59
sean-k-mooneyand the rest as not12:59
*** jpena|lunch is now known as jpena12:59
sean-k-mooneyso it may have changed the order of merging12:59
sean-k-mooneyanyway it was just a thought of something we could maybe try13:00
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-output-openshift: initial role  https://review.opendev.org/68204413:05
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-subunit-output: introduce zuul_use_fetch_output  https://review.opendev.org/68188213:05
sean-k-mooneyalso i probably should have put that in the zuul channel sorry13:05
fungii could see maybe working additional prioritization criteria into independent pipelines. it might also work on the inactive changes outside the window of a dependent pipeline, but reprioritizing changes which have already started builds would waste resources13:06
fungisince dependent pipelines are sequenced13:06
*** ociuhandu has joined #openstack-infra13:07
fungiadjusting the order of changes in them necessarily means discarding prior results or in progress builds13:07
sean-k-mooneyfungi: ya i was more thinking of just the initial ordering13:07
sean-k-mooneyonce they start their ordering would be fixed13:07
sean-k-mooneythis was mainly a suggestion for independent pipeline like check13:08
sean-k-mooneysince all patches have to go through check it also indirectly influences gate13:08
fungiwell, the initial ordering (until you exceed the active window) is determined by chronological events. there's no way for zuul to predict that you're going to approve something of higher priority in the future so it starts testing things as soon as it is able13:09
sean-k-mooneybut in a positive way by priorities what gets sent to gate13:09
sean-k-mooneyright but it currently round robins between queues of jobs per project within a pipeline right?13:09
pabelangerthere is also the issue, that nodepool is the one who handles node-requests, you could ask for something to be higher, but if that nodeset doesn't come online right away, the other patches will still get their nodesets first13:09
sean-k-mooneyand pulls the next job off the top of that queue13:09
sean-k-mooneyi'm suggesting ordering the queue not by time of submission13:10
sean-k-mooneybut by the score above, which is time in the queue * priority13:10
sean-k-mooneyif we do an insertion sort we only update that order once13:10
sean-k-mooneywhen each patch is submitted13:11
*** Tengu has quit IRC13:11
*** Tengu has joined #openstack-infra13:11
sean-k-mooneyit really would be an ordering of the job sets rather than individual jobs too13:12
pabelangerright, but today, even with relative priority, if you have A and B both asking for nodes from nodepool, we don't block B until A gets them; we send requests for both. And if A's nodeset fails due to a cloud issue, B will still run before A13:12
*** ociuhandu has quit IRC13:12
sean-k-mooneyas we want the set of jobs to run together13:12
fungiyeah, but inserting a change ahead of other changes of lower priority which are already being tested necessarily means discarding those builds13:12
sean-k-mooneyfungi: as i said i'm only suggesting doing this for changes that have not started being tested13:13
AJaegersean-k-mooney: Your approach might work in theory - but our tooling is not able t ohandle it without major changes. Your idea is to delay asking for nodes and recalculate priorities every time a new change is added. What we do today is calculate only when a new change is added - and not afterwards.13:13
AJaegerSo, that is also a different complexity and would be a far longer scheduling computation.13:14
openstackgerritMerged openstack/project-config master: Fixes for FN seem to have worked - scaling up  https://review.opendev.org/68202613:14
sean-k-mooneyAJaeger: ya i had not looked at the zuul code13:14
fungithat might also mean starting to defer, or unnecessarily redoing, merge tasks too13:14
sean-k-mooneybut it's just something that i thought of after our conversation last night.13:15
fungiwhich is why independent pipelines might be easier to tackle for something like that13:15
sean-k-mooneyi was not expecting immediate action but i was wondering how feasible it might be13:15
sean-k-mooneyfungi: isn't check an independent pipeline?13:15
*** ociuhandu has joined #openstack-infra13:16
fungiyes, we've been muddling that and discussions of "shared queues" though13:16
sean-k-mooneyah13:16
*** iurygregory has joined #openstack-infra13:16
fungiand activity windows13:16
pabelangerTBH: having some humans work to see why we have such a large rate of failure when launching nodes in nodepool, also helps this situation. EG: http://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1&fullscreen&panelId=17&from=now-7d&to=now I would spend a lot of time working to keep that at zero (when day job was working in openstack). That has a huge impact on how efficient jobs work in nodepool,13:16
pabelangergiven how limited resources are too13:17
sean-k-mooneyi used queue in a generic sense of things waiting in check that have not started13:17
sean-k-mooneynot in the zuul sense, which i should learn about sometime13:17
pabelangerhttp://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1&fullscreen&panelId=17&from=now-5y&to=now is another graph13:17
pabelangerthat could be cloud side, or make some issues with specific operating systems.13:18
pabelangermaybe*13:18
sean-k-mooney ya that is more of an operational task that never ends13:18
clarkbpabelanger: a large chunk of it is failed boots in vexxhost due to volume leaks and quota being accounted wrong13:18
*** mriedem has joined #openstack-infra13:18
sean-k-mooneyi have no doubts that if we had 0 failed node builds the gate would be fast13:19
clarkbits a known issue, shrews is looking into our side of it13:19
pabelangerclarkb: yah, that is something I am hoping to work on this / next week. It also hits us hard with NODE_FAILUREs13:19
*** whoami-rajat has quit IRC13:19
sean-k-mooney*faster13:19
AJaegerconfig-core, could you review https://review.opendev.org/681785 https://review.opendev.org/681276 https://review.opendev.org/#/c/681259/ https://review.opendev.org/681361 and https://review.opendev.org/680901 , please?13:19
AJaegernothing of that should impact feature freeze13:20
clarkbsean-k-mooney: if we order the queue by subjective priority everyone will set their change to be highest priority13:20
clarkbyou see the same thing happen in ticket queues and bug trackers13:20
sean-k-mooneyclarkb: i think that field is only settable by the core team13:20
clarkbit makes the field pretty useless when you tie it to expected processing time13:21
AJaegersean-k-mooney: I'm core, so I set it on my own changes ;)13:21
sean-k-mooneyhehe well that is your privilege13:21
Shrewsclarkb: the volume leak, I'm convinced until somebody else has proof otherwise, is not something caused by nodepool. what i'm looking into now is the side effects of that (image leaks caused by the volume remaining in use)13:22
AJaegersean-k-mooney: I think FF is special, let's not over-optimize for it... I think we can do more to get jobs stable all the time and that will help a lot in this time.13:22
sean-k-mooneybut the whole idea was to give cores an extra knob they could tweak on a per-patch basis beyond withholding a +w13:22
clarkbShrews: thanks and ya would not surprise me if it were a nova and or cinder problem13:22
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: DNM: test tox-py36 on openshift node  https://review.opendev.org/68204913:22
clarkbsean-k-mooney: to improve review visibility13:22
pabelangerclarkb: Shrews: Agree, but I do think adding a quota check, if enough volumes free is a good idea13:23
clarkb"if you want to review the important changes thid is where you find them"13:23
sean-k-mooneythid?13:24
clarkbsorry "this"13:24
sean-k-mooneythis13:24
sean-k-mooneyya that is what review priority was intended to be13:25
sean-k-mooneywe dont use it in nova currently but we might at some point13:25
clarkbbut ya I agree with AJaeger this is the time of the cycle where we feel the pain of our ci debt13:26
clarkbif we paid that down continuously we'd be in a much better place13:26
*** jaosorior has quit IRC13:26
pabelangertripleo used to (or maybe still) have this problem. They'd end up abandoning all the open patches, then only approve ones needed to fix tests. As crude as that is, it would allow them to order things, in current tooling13:26
clarkbrather than optimize the tools for "our code is flaky and can't pass CI", let's optimize for making good software; then when we dogfood that software we don't leak volumes preventing us from booting instances13:27
fungithe volume leak in vexxhost is looking likely to be nova and cinder pointing fingers over which should be responsible for making sure volume attachment records are sane when servers are deleted13:27
fungifixing that in openstack would directly improve our throughput13:27
pabelanger+113:27
*** goldyfruit has joined #openstack-infra13:28
fungialso known as "people run our software, and the bugs we haven't fixed have real-world consequences"13:28
smcginnisI haven't seen the full discussion. Why is it that nova isn't deleting the attachment but then trying to delete the volume?13:28
lajoskatonadonnyd: How can I reference these expanded labels from my job definition, like here: https://review.opendev.org/#/c/672925/19/.zuul.d/jobs.yaml@115 ?13:29
funginova doesn't seem to try to delete the volume. it sounds (from what mnaser is saying) like if there's an error then nova just logs it and continues on its merry way13:29
fungii should say, doesn't re-try to delete13:29
smcginnisI guess getting to the root of that error is the key. Cinder won't spontaneously decide to clean things up without someone telling it to.13:30
fungiyep, and nova probably shouldn't consider the instance truly deleted if it hasn't gotten confirmation from cinder13:30
fungibut... the devil is in the details, i'm sure13:31
fungipoint being, there are bugs in openstack which we haven't fixed, are deployed in service providers who are donating resources to the testing of openstack, and so we directly see the result of these deployed openstack bugs impact the testing throughput of new patches to openstack13:32
*** eharney has joined #openstack-infra13:33
fungiso it's not just testing-related bugs which impact our ability to test changes13:33
AJaegerit's also small things like using promote jobs that don't need nodes and don't rebuild artifacts ;) - that saves two or three nodes per merge currently (releasenotes, docs, api-ref)13:42
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-sphinx-tarball: introduce zuul_use_fetch_output  https://review.opendev.org/68187013:42
donnydlajoskatona: I'm out at the moment, be back in a bit13:44
lajoskatonadonnyd: ok, thanks, I go for my sons as well now :-)13:45
openstackgerritJeremy Stanley proposed opendev/git-review master: Support spaces in topic  https://review.opendev.org/68190613:46
clarkbAJaeger: also speeding up jobs helps a lot. This is why I spent a little time to show that devstack could save a bunch of time reworking its api interactions13:46
*** efried has joined #openstack-infra13:47
AJaegerindeed, clarkb ...13:47
AJaegercan we start an FAQ about scheduling? If everybody who wanted to discuss scheduling in the last 4 weeks had spent a day improving jobs, we would be in a much better situation ;)13:49
*** xenos76 has quit IRC13:50
*** aaronsheffield has joined #openstack-infra13:52
*** kjackal has quit IRC13:58
*** kaiokmo has quit IRC14:01
*** ykarel is now known as ykarel|afk14:01
fungiwell, it starts out as people complaining about scheduling because they feel the algorithm unfairly penalizes them, then others explain the scheduling and the hard choices which were made, and then it evolves into wanting to discuss optimizations to that scheduling14:02
*** ykarel|afk has quit IRC14:10
*** kaiokmo has joined #openstack-infra14:13
*** goldyfruit has quit IRC14:14
mriedemclarkb: fyi, i screwed up on that gate fix the other day and the fix for my fix is https://review.opendev.org/68202514:15
mriedemin case we need to promote14:15
mriedemthat's for http://status.openstack.org/elastic-recheck/#184361514:15
AJaegermriedem: should fungi enqueue directly to gate (not promote) - it's still in check...14:17
mriedemprobably? it hasn't gotten a node in check yet14:20
mriedemand since it's nova, it probably won't for 12+ hours14:20
fungican do14:20
fungijust a sec14:20
clarkbnote that isn't nova specific and currently it's ~5.5 hours14:21
*** hrw has joined #openstack-infra14:21
clarkbbut ya if it fixes gate issues lets get them enqueued14:21
hrwmorning14:21
hrwcan mirror.bhs1.ovh.openstack.org be forced to refresh ubuntu mirror?14:22
fungimriedem: it's enqueued now. we can also promote it to the front of the gate if that will help get things moving faster in the integrated gate queue14:22
*** whoami-rajat has joined #openstack-infra14:23
AJaegerhrw: all our mirrors are in sync - why do you need to update this specific one and not the others?14:23
fungihrw: all the mirrors should be in sync, they're sharing one network filesystem14:23
AJaegerhrw: and what is missing?14:23
clarkbhrw it should update every ~4 hours but looks like it got stuck a couple days ago14:23
mriedemfungi: that will reset the gate won't it?14:23
clarkbAJaeger: ^ I'm guessing it's not a specific mirror, just that it hasn't updated in a couple days14:23
fungimriedem: yes, so it's not a tradeoff to be taken lightly14:23
hrwINFO:kolla.common.utils.kolla-toolbox: python3-dev : Depends: libpython3-dev (= 3.6.7-1~18.04) but it is not going to be installed14:23
mriedemyeah let's just let it ride for now14:23
clarkblikely the afs release and stale lock issue14:23
hrwINFO:kolla.common.utils.kolla-toolbox:               Depends: python3.6-dev (>= 3.6.7-1~) but it is not going to be installed14:23
hrwAJaeger: CI gets stuck :(14:24
clarkbhrw: hrm the mirror should always be consistent when it is published. why is it not being installed?14:24
hrwAJaeger: from what I see ovh mirrors have older packages than ubuntu:bionic container14:24
hrwclarkb: container has newer package than mirror == installation fails14:25
clarkboh I see, you are updating a container built from another source14:25
hrwyes14:25
hrwwe fetch official container and then install packages in it.14:25
hrwworks unless run on ovh14:26
clarkbovh has the same packages as everyone else14:26
clarkbwe use a shared filesystem (afs) to host the data14:26
hrwok14:27
fungii'm assuming hrw looked at a failure which occurred in ovh, assumed it was an ovh-specific problem, and didn't realize it's affecting our entire mirror network14:27
hrwfungi: +114:27
fungiso for future reference, explaining the issue you're encountering and providing examples first rather than jumping to conclusions about what needs to be done helps avoid a lot of confusion14:27
hrwright, sorry about that14:28
*** panda|ruck has quit IRC14:28
hrwhttps://pastebin.com/1zEMBwbQ shows the issue14:28
*** panda has joined #openstack-infra14:29
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-sphinx-tarball: introduce zuul_use_fetch_output  https://review.opendev.org/68187014:29
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-translation-output: introduce zuul_use_fetch_output  https://review.opendev.org/68188714:29
hrwlibexpat1 from the container is newer than the one in the mirror. we want to install python3-dev and it refuses because of that. If I do the same in ubuntu:bionic locally with ubuntu mirrors then it works14:29
clarkbI think it hasn't updated since I manually updated it. I'm fairly certain I released the update server lock but can double check14:30
fungiright, so to clarify, the problem seems to stem from using images which are built from newer packages than those currently present in our mirror14:30
hrwfungi: yes14:30
fungiand yes, i expect clarkb is on the money that it's because mirror updates have hung14:30
fungiand this would be affecting all our mirror of ubuntu14:31
clarkb"The lock file '/afs/.openstack.org/mirror/ubuntu/db/lockfile' already exists."14:32
*** xek has joined #openstack-infra14:32
clarkbvos listvldb does not show it as being locked though14:33
clarkboh wait that is a reprepro lock I think14:33
fungiyep14:34
*** goldyfruit has joined #openstack-infra14:34
fungimaybe a previous reprepro run was terminated ungracefully and left that dangling?14:34
clarkbpossibly from the reboot that was done on that server?14:35
fungicreated Sep 10 18:3314:35
*** xek_ has joined #openstack-infra14:35
fungior last updated at least14:35
hrwfungi: fits mirror's age14:35
clarkbis that something we can simply rm then?14:35
fungiyeah, if it was rebooted mid-run for reprepro that would probably explain it14:35
fungishould be able to rm and rerun the script14:36
*** xek has quit IRC14:37
fungireboot   system boot  Tue Sep 10 16:51   still running      0.0.0.014:37
fungiaccording to last14:37
fungiso that was timestamped a couple hours *after* the reboot? strange14:38
fungiseems to correspond with the duration uptime reports, so i don't think the timestamp in wtmp is way off or anything14:39
clarkbfwiw I've grabbed the mirror update lock file and removed the reprepro lock file and will manually run the update now14:39
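For future reference, the manual recovery described here amounts to roughly the following; the lockfile path is taken from the log, while the pgrep guard is an assumption about how you would confirm reprepro isn't still running.

    import os
    import subprocess

    LOCK = '/afs/.openstack.org/mirror/ubuntu/db/lockfile'

    # only clear the lock if no reprepro process is actually running
    running = subprocess.run(['pgrep', '-x', 'reprepro'], capture_output=True)
    if running.returncode != 0 and os.path.exists(LOCK):
        os.remove(LOCK)
        print('removed stale reprepro lock; rerun the mirror update script')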
fungithanks14:39
*** goldyfruit_ has joined #openstack-infra14:42
hrwthank you14:42
*** rcernin has quit IRC14:42
*** e0ne has quit IRC14:43
*** lpetrut has quit IRC14:43
*** goldyfruit has quit IRC14:44
*** ociuhandu has quit IRC14:45
*** munimeha1 has joined #openstack-infra14:47
*** mattw4 has joined #openstack-infra14:48
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-javascript-content-tarball: introduce zuul_use_fetch_output  https://review.opendev.org/68190314:48
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-sphinx-output: introduce zuul_use_fetch_output  https://review.opendev.org/68190514:48
*** xenos76 has joined #openstack-infra14:49
*** rkukura has joined #openstack-infra14:51
clarkbAJaeger: re faq what I've tried to do in the past is send periodic updates to the mailing list explaining the current situation and why improving software reliability is likely to have the greatest impact14:56
*** ociuhandu has joined #openstack-infra14:59
*** armax has joined #openstack-infra15:02
clarkbunfortunately I'm not sure how effective that has been15:02
*** ociuhandu has quit IRC15:03
*** xenos76 has quit IRC15:03
hrwthanks for mirror refresh - installing python3-dev works so we can continue fixing errors shown us by CI15:06
clarkbhrw: fwiw its still running (the afs publishign step is where we are at so hopefulyl done soon)15:08
hrwclarkb: in worst case it will one more recheck.15:08
*** pgaxatte has quit IRC15:09
AJaegerclarkb: there have been questions/discussions about fail-fast, reverify - and also the requirement to run through check first. I was thinking of documenting these. And adding a note about stable gates...15:09
* AJaeger will be back later15:10
*** mattw4 has quit IRC15:11
*** ociuhandu has joined #openstack-infra15:13
*** pmatulis has joined #openstack-infra15:15
*** pmatulis has left #openstack-infra15:15
*** jamesmcarthur has joined #openstack-infra15:15
*** ykarel|afk has joined #openstack-infra15:16
*** ociuhandu has quit IRC15:18
*** tesseract has quit IRC15:25
*** igordc has joined #openstack-infra15:25
*** ociuhandu has joined #openstack-infra15:26
*** xenos76 has joined #openstack-infra15:26
*** igordc has quit IRC15:27
*** igordc has joined #openstack-infra15:27
*** jtomasek has quit IRC15:28
*** dayou has quit IRC15:28
*** udesale has quit IRC15:30
*** udesale has joined #openstack-infra15:31
*** weshay|ruck has quit IRC15:31
*** igordc has quit IRC15:32
*** ociuhandu has quit IRC15:33
*** rkukura has quit IRC15:33
*** dklyle has quit IRC15:35
fungiAJaeger: the choices we've made for scheduling algorithm are probably worth explaining too15:36
*** dklyle has joined #openstack-infra15:36
clarkbcaught one of those swift errors. corvus' hunch was correct it is the get_container that is failing15:39
clarkbhttp://paste.openstack.org/show/775777/15:39
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Retry container gets in upload-logs-swift  https://review.opendev.org/68209115:39
*** ykarel|afk is now known as ykarel|away15:39
clarkbcorvus: ^ thats an attempt at handling it15:39
*** dayou has joined #openstack-infra15:40
*** ociuhandu has joined #openstack-infra15:42
clarkbubuntu mirror update has completed successfully from what I can tell and there is no reprepro lock file remaining15:42
clarkbI've released all the locks I had held to do that15:42
clarkbhrw: ^ fyi15:46
*** ociuhandu has quit IRC15:46
*** gfidente is now known as gfidente|afk15:50
*** ramishra has quit IRC15:51
*** rmcallis has joined #openstack-infra15:55
corvusclarkb: i think you leed a lambda15:59
corvusand by leed i mean need15:59
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-javascript-content-tarball: introduce zuul_use_fetch_output  https://review.opendev.org/68190316:01
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-sphinx-output: introduce zuul_use_fetch_output  https://review.opendev.org/68190516:01
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: fetch-coverage-output: introduce zuul_use_fetch_output  https://review.opendev.org/68190416:01
openstackgerritClark Boylan proposed openstack/infra-manual master: Document why jobs queues are slow  https://review.opendev.org/68209816:01
clarkbcorvus: oh right16:01
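The retry-with-a-lambda pattern corvus is pointing at looks roughly like this; a generic sketch, not the actual change under review, and the get_container call at the end is only an illustrative placeholder.

    import time

    def retry(call, attempts=3, delay=1):
        """Invoke call(), retrying on any exception up to `attempts` times."""
        for attempt in range(attempts):
            try:
                return call()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    # wrapping the request in a lambda defers it, so retry() decides when and
    # how often it runs (cloud/container names are placeholders):
    # container = retry(lambda: cloud.get_container('zuul_logs'))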
clarkbAJaeger: fungi ^ thats a rough first draft (I didn't even build it locally, probably should've)16:01
*** calbers has joined #openstack-infra16:02
*** rlandy is now known as rlandy|brb16:02
hrwclarkb: thanks!16:02
clarkblooks like the mirror.ubuntu release is still causing stale cache problems https://e63daf97f32f2cebe260-b80014e5a7a6453c822c4dc8f22159da.ssl.cf1.rackcdn.com/681829/1/gate/openstack-tox-py37/fb0ef49/job-output.txt ?16:05
clarkbthat host is running openafs and not kafs iirc16:05
*** owalsh has quit IRC16:11
smcginnisHowdy infra folks. We got a tag-releases job failure. Looks like a mirror issue that I think I saw being discussed earlier.16:13
smcginnisShould we hold off on any releases? Or are things cleared up?16:13
smcginnisIf cleared up, can we get this reenqueued? https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c95/765061f2ad4b965059c33e2577547ac952344ee3/release-post/tag-releases/c95672e/job-output.txt16:13
clarkbsmcginnis: we can check the files directly to see if afs has caught up16:14
smcginnisThanks16:15
*** mattw4 has joined #openstack-infra16:15
clarkbI just have to remember where it lists those file sizes and hashes16:15
clarkbfungi: ^ you probably know off the top of your head16:16
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Retry container gets in upload-logs-swift  https://review.opendev.org/68209116:17
*** mriedem is now known as mriedem_afk16:19
clarkbI can confirm that the values are still not what is expected according to the job16:19
clarkbbut I think I also need to check the thing that gives us the expected values16:19
clarkbthe release file16:21
clarkbb6177c2e199dc2374792b3ad3df2611643a2d32211c88cff8cb44192864aba32 1033953 main/binary-amd64/Packages.gz is what the Release file says now for bionic-updates so I think afs has caught up16:22
*** rpittau is now known as rpittau|afk16:22
clarkbsmcginnis: ^ fyi16:22
*** kjackal has joined #openstack-infra16:22
clarkbthat matches what I get locally if I list the file size and sha256sum it16:22
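For reference, a rough sketch of that manual check: hash the mirrored file and compare its size and SHA-256 against the values published in the suite's Release file. The hash and size below are the ones quoted above; the path is illustrative, not the real AFS layout.

    import hashlib

    expected_sha256 = ("b6177c2e199dc2374792b3ad3df2611643a2d32211"
                       "c88cff8cb44192864aba32")
    expected_size = 1033953
    path = "/path/to/mirror/ubuntu/dists/bionic-updates/main/binary-amd64/Packages.gz"

    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
            size += len(chunk)

    print("size matches:", size == expected_size)
    print("sha256 matches:", sha.hexdigest() == expected_sha256)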
*** hrw has left #openstack-infra16:24
*** iurygregory has quit IRC16:24
*** eharney has quit IRC16:25
jrosserDid anything come of looking at the debian buster repos yesterday? exim packages failing to install....16:28
clarkbjrosser: yes fungi believes the issue is we need to stop disabling buster updates repo in jobs16:28
clarkbjrosser: our mirror isn't keeping exim4 for both buster and buster-updates because buster-updates supersedes buster. And the job is likely looking for the buster package aiui16:28
jrosserOk, so the setup of the node disables the updates repo?16:29
*** owalsh has joined #openstack-infra16:29
clarkbI'm not sure. I know when buster was first brought up the jobs failed because buster updates had no content. updates must've been disabled somewhere. Not sure if in the base jobs or your jobs16:30
jrosserFor this I just flipped the nodeset from stretch to buster16:31
jrosserWell actually I had to define a Mideast, there doesn’t seem to be one afaik16:31
jrosser*nodeset16:31
*** whoami-rajat has quit IRC16:32
jrosserSo it’s a job that would otherwise be fine on stretch16:32
fungiyeah, the problem was partly to do with reprepro: we configured it to start mirroring the buster-updates suite as soon as that appeared on debian's official mirrors (shortly after buster release day) but the suite sat empty until a few days ago. reprepro refuses to create empty suite indices, so our mirrors continued to lack a buster-updates suite and jobs which tried to pull package indices for it16:32
fungiwere broken16:32
AJaegerclarkb: thanks for the writeup!16:32
funginow there are packages in buster-updates, which brings a new problem... reprepro wants to only keep a certain number of versions of the same packages in the shared package pool16:33
openstackgerritDonny Davis proposed openstack/project-config master: FN networking issued have been solved  https://review.opendev.org/68210716:33
fungiso it has deleted the older buster release packages and is only keeping the ones for the buster-updates suite16:33
smcginnisclarkb: Thanks. Do you think it's safe to reenqueue that failed job then?16:34
*** owalsh has quit IRC16:34
donnydSo I have been monitoring the workload now that it's at 50%, and the retransmission issue with ipv6 is solved16:34
fungijrosser: clarkb: all that is to say, i think we need to figure out where the tuning knob is for number of versions of packages to keep in the pool and crank it up a bit16:34
clarkbfungi: or just enable buster updates in the jobs?16:35
fungibut also, if jobs switch to including buster-updates sources, that may fix things, yes16:35
clarkbfungi: we shouldn't need those older packages right? so lets leave them out16:35
jrosserpart of the motivation for changing to buster was we’ve just switched OSA to py3 and some stuff was failing with what looked like python3.5 bugs in stretch16:35
donnyd1% of packets are retransmitted which is a really good # for real world over a tunnel16:35
clarkbfungi: any idea where we fixed buster by disabling updates previously?16:35
fungiclarkb: i don't think "we" did, projects which got debian-buster jobs working likely wrote out their own sources.list files16:36
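A hedged sketch of the kind of check being talked through here: fetch the Packages.gz index for both suites from a mirror and see which exim4 versions each one actually lists. The mirror URL and package name are assumptions for illustration, not the real opendev mirror layout.

    import gzip
    import urllib.request

    MIRROR = "http://mirror.example.org/debian"  # hypothetical mirror base
    PACKAGE = "exim4-base"

    def versions_in_suite(suite):
        url = "%s/dists/%s/main/binary-amd64/Packages.gz" % (MIRROR, suite)
        with urllib.request.urlopen(url) as resp:
            text = gzip.decompress(resp.read()).decode("utf-8", "replace")
        versions, current = [], None
        for line in text.splitlines():
            if line.startswith("Package: "):
                current = line.split(": ", 1)[1]
            elif line.startswith("Version: ") and current == PACKAGE:
                versions.append(line.split(": ", 1)[1])
        return versions

    for suite in ("buster", "buster-updates"):
        print(suite, versions_in_suite(suite))

If a version only shows up under buster-updates, jobs that pull indices solely from the buster suite will not find it.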
clarkbjrosser: do you have the link to the failing job handy again? we can probably work backward from that to see what is configuring repos16:37
*** kopecmartin is now known as kopecmartin|off16:38
AJaegerinteresting, https://review.opendev.org/668238 is in the integrated gate and has a merge conflict - but it's the *first* keystone change in the queue. Why do we keep it in the queue at all and not remove it?16:38
jrosserclarkb: this is it https://review.opendev.org/#/c/681777/16:39
clarkbAJaeger: probably just a missing optimization16:40
*** derekh has quit IRC16:40
AJaegerjrosser: why not create debian-buster nodeset?16:40
*** whoami-rajat has joined #openstack-infra16:40
jrosserAJaeger: because I only just learned anything about this in the last couple of days :) I copied what kolla have done16:41
AJaegerjrosser: patch for https://opendev.org/opendev/base-jobs/src/branch/master/zuul.d/nodesets.yaml is welcome if that is something of wider scope16:41
*** rlandy|brb is now known as rlandy16:42
jrosserI can’t do that right now but will later if no one else gets there first16:42
*** kjackal has quit IRC16:43
*** gyee has joined #openstack-infra16:44
AJaegercmurphy: 668238 will not merge with just +W - you need +2 as well...16:45
*** markvoelker has quit IRC16:45
cmurphyAJaeger: TIL16:46
*** kjackal has joined #openstack-infra16:46
fungii find it annoying that rackspace has ceased mentioning instance names in incident tickets lately16:48
funginow they only tell you the instance uuid and you get to guess what region to grep to figure out which one they're talking about16:48
*** bnemec is now known as beekneemech16:48
*** markvoelker has joined #openstack-infra16:49
fungikdc04.openstack.org16:49
*** jpena is now known as jpena|off16:49
fungi(i ord)16:49
fungier, in ord16:49
*** cmurphy is now known as cmorpheus16:51
fungicmorpheus: is your nick a matrix reference or a sandman reference? (though i suppose "both" and "neither" could also work)16:52
cmorpheusfungi: i think both because i think the matrix uses it as a sandman reference16:53
clarkbfungi: jrosser that job actually does seem to update buster updates16:53
fungicmorpheus: oh, neat! i actually never picked up on that tie-in16:53
*** markvoelker has quit IRC16:53
fungibut now that i think about it...16:54
*** owalsh has joined #openstack-infra16:54
fungiclarkb: yeah, so could be "just" using buster-updates isn't going to solve the problem. digging deeper we're mirroring debian 10 (buster) update 1 package versions for those (and deleting the original buster release versions of them) but not including them in the buster-updates package lists. i wonder why reprepro mirrored them at all if they're not for the buster-updates suite16:55
*** piotrowskim has quit IRC16:56
openstackgerritJames E. Blair proposed zuul/zuul master: WIP: Support HTTP-only Gerrit  https://review.opendev.org/68193616:57
openstackgerritJames E. Blair proposed zuul/zuul master: Update gerrit pagination test fixtures  https://review.opendev.org/68211416:57
donnydfungi: whenever you get a chance I can help a little bit with getting jobs through the gate16:59
donnydlooks like everyone is pretty busy today in infra17:00
fungi#status log kdc04 in rax-ord rebooted at 15:46 for a hypervisor host problem, provider ticket 190913-ord-000047217:00
openstackstatusfungi: finished logging17:00
fungidonnyd: sure, i've been a bit scattered today, have a list of the change numbers?17:00
fungiinfra-root: anything in particular we need to check on kdc04 following an unexpected reboot?17:01
clarkbfungi: the afs/kerberos docs should have server rotation notes, probably confirm the services are running as per that ?17:01
donnydfungi https://review.opendev.org/#/c/682107/17:01
fungidonnyd: just the one?17:02
clarkbfungi: https://docs.openstack.org/infra/system-config/kerberos.html#no-service-outage-server-maintenance looks like we want to double check the db propagation is working on kdc0317:02
donnydyea, just scaling FN back up to near 100%17:02
clarkbI think that runs in a cron so you can simply check the logs for that cron job to see if it's happy post reboot17:02
fungidonnyd: i've approved it, thanks!17:02
fungiclarkb: thanks, i'll check it out17:03
*** diablo_rojo has joined #openstack-infra17:04
donnydthanks fungi17:04
*** xek_ has quit IRC17:08
donnydand I did some bad math before.. the number of retransmits is more around .05% - the average connection is about 500K packets deep and the number of retrans is about 350 packets (some a little less, some a little more)17:10
donnydI would call that resolved17:10
donnydon to the next17:10
*** ralonsoh has quit IRC17:11
fungiyep, that seems within reason17:12
corvusclarkb, ianw: i barely know what i'm talking about in https://review.opendev.org/681338 -- i was mostly just trying to capture what i thought was what we thought we should try next time we lose an afs fileserver since the recovery this time wasn't as smooth as we would have liked.  i really don't know how to improve that text (other than spelling 'release' correctly).  so can one of you either17:15
corvusimprove it yourself or merge it?17:15
openstackgerritMerged openstack/project-config master: FN networking issued have been solved  https://review.opendev.org/68210717:16
*** xenos76 has quit IRC17:16
clarkbcorvus: I'll take a stab at a new patchset17:16
clarkbinfra-root can you review the project rename stuff https://etherpad.openstack.org/p/project-renames-2019-09-1017:17
auristorcorvus: I must have missed something.  what type of "crash" occurred?   Did the volserver or fileserver process panic?  Did the host lose connectivity to the disks used for vice partitions?  did the host panic or other restart without a clean shutdown?  was the host machine lost and the vice partitions attached to a new machine?17:21
openstackgerritClark Boylan proposed opendev/system-config master: Add docs for recovering an OpenAFS fileserver  https://review.opendev.org/68133817:21
clarkbcorvus: ^ something like that maybe.17:21
clarkbauristor: we think a host provider live migration caused the host to spin out of control. It stopped responding to ssh and what we could get from snmp showed significant cpu usage and load on the host17:22
clarkbauristor: and ya one of the symptoms from the console of the host was complaints from the kernel about things being unresponsive for large periods of time (including the disks hosting the volumes iirc)17:23
auristorwas it responding to rxdebug and bos status?17:23
clarkbit was not responding to other afs commands (vos show volume type stuff)17:23
auristorvos commands would talk to the volserver17:24
auristorthe reason I ask is that the recovery procedure is dependent upon the type of crash and whether a clean shutdown could be performed.17:24
clarkbno clean shutdown could be performed we had to hard reboot17:24
clarkbvos examine $volume -localauth is what was failing17:25
auristorin that case volume transactions would not survive the restart and any volumes that were attached to the fileserver or volserver would be salvaged automatically the first time they are accessed17:25
openstackgerritJonathan Rosser proposed opendev/base-jobs master: Add nodeset for debian-buster  https://review.opendev.org/68211817:25
corvusthat's why i omitted the explicit salvage op from my instructions17:26
clarkbgotcha17:26
auristorthe cleanup that would be required is to identify any temporary clone volumes and manually salvage and zap them17:26
AJaegerclarkb, fungi, do we want a debian-buster nodeset? See 682118 ^17:27
corvusbut i don't know all the cases where a manual salvage would be required, eg ^17:27
clarkbauristor: is that different than `bos salvage -server $FILESERVER -partition a -showlog -orphans attach -forceDAFS` ?17:27
*** xenos76 has joined #openstack-infra17:27
*** priteau has quit IRC17:28
auristorvos listvol the vice partitions.  any volume ids that are unknown to the vl service are temporary clones17:28
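A rough sketch of that identification step, assuming the standard vos commands are on the path and keyed with -localauth. The output parsing is a guess at the numeric volume IDs rather than a tested implementation, and anything it flags would still want a manual look before salvaging or zapping.

    import re
    import subprocess

    SERVER = "afs01.example.org"  # hypothetical fileserver
    PARTITION = "a"

    def volume_ids(cmd):
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True).stdout
        # Volume IDs show up as long numeric tokens in vos output.
        return set(int(tok) for tok in re.findall(r"\b\d{9,}\b", out))

    on_disk = volume_ids(["vos", "listvol", "-server", SERVER,
                          "-partition", PARTITION, "-localauth"])
    in_vldb = volume_ids(["vos", "listvldb", "-server", SERVER, "-localauth"])

    for vol_id in sorted(on_disk - in_vldb):
        print("volume not in VLDB (possible temporary clone):", vol_id)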
auristorpartition salvages should not be used with a dafs server17:29
auristora dafs server cannot take a whole vice partition offline17:29
jrosserAJaeger: i didn't think it was very cool to just switch the debian-stable nodeset from stretch to buster underneath projects that are using it right now17:30
clarkbout of curiosity why does the flag exist then?17:30
auristorfor individual volume salvages17:30
auristorare you asking about the -forceDAFS or -partition flag?17:31
clarkbfwiw that comes from https://lists.openafs.org/pipermail/openafs-devel/2018-May/020493.html if you want to respond to that and say you can't do that17:31
clarkbauristor: the -forceDAFS flag17:31
clarkbyou just said you can't do dafs in this case so seems odd to have a flag that does it17:31
auristorthat is for individual volume salvages17:31
*** kjackal has quit IRC17:31
clarkbok I assume that would be a -volume instead of -partition flag then?17:31
auristorI said you shouldn't do a partition salvage with a dafs fileserver17:32
auristorthe bos command has no idea what type of fileserver it is17:32
clarkbok I'm confused. I was merely going off of ianw's comments and the thread he linked on the openafs mailing list17:35
clarkbif that's wrong (and we know that) can someone please say so on the review?17:35
*** udesale has quit IRC17:36
corvusclarkb: have we ever manually run the salvager?17:38
clarkbcorvus: I believe ianw did in order to restore functionality of the fedora mirror (per comments on that openafs thread) however I do not know that for certain17:39
fungiAJaeger: we already have debian-buster (and debian-buster-arm64) nodes, so if a nodeset is necessary for people to be able to run jobs on those then i suppose it's warranted?17:39
fungijrosser: i agree, replacing debian-stable without some confirmation that jobs can actually run on debian-buster nodes is probably disruptive17:40
jrosser I think having <>-codename and <>-stable at the same time is useful, folks can choose to stick or get moved forward17:42
openstackgerritJames E. Blair proposed opendev/system-config master: Add docs for recovering an OpenAFS fileserver  https://review.opendev.org/68133817:42
corvusclarkb, fungi, ianw: ^ let's just merge that so we at least have some docs that tell us what to make sure is still working, even if it doesn't tell us all the ways to fix all the things that may be broken.17:43
corvusthen when ianw is back, he can propose a change to include stuff about the salvager, and we can examine that very carefully at length17:43
*** goldyfruit_ has quit IRC17:46
*** goldyfruit_ has joined #openstack-infra17:49
auristorI've added some comments to the change in Gerrit17:49
*** markvoelker has joined #openstack-infra17:51
*** mriedem_afk is now known as mriedem17:51
openstackgerritClark Boylan proposed openstack/infra-manual master: Document why jobs queues are slow  https://review.opendev.org/68209817:51
clarkbcorvus: ^ addressed your comments17:51
clarkbauristor: thank you17:51
*** eharney has joined #openstack-infra17:51
clarkbauristor: responded to your question there17:54
openstackgerritDavid Shrewsbury proposed zuul/nodepool master: DNM: Demonstrate image leak  https://review.opendev.org/68212717:58
openstackgerritDavid Shrewsbury proposed zuul/nodepool master: DNM: Demonstrate image leak  https://review.opendev.org/68212717:59
Shrewsinfra-root: I believe 682127 ^^^ will verify the bug that is causing the leaked images in vexxhost because of the leaked voluems. The fix is https://review.opendev.org/681857 but I wanted to see the test fail before including the fix.18:01
Shrewsvolumes18:02
Shrewsi have no idea what a voluems is18:02
*** panda has quit IRC18:02
*** panda has joined #openstack-infra18:03
fungii think it's french18:04
clarkbShrews: neat bug18:05
clarkbthat won't fix the volume leaks but will fix any image leaks unrelated to volume leaks?18:06
clarkbor at least give us the ability to delete the image once the volume is otherwise deleted18:06
Shrewsclarkb: well, what I believe will happen is that we will now continue to retry deleting the image upload, but it will continue to fail until the leaked volume is removed18:07
Shrewsso yah, your 2nd comment18:07
Shrewsit's definitely related to volume leaks18:09
Shrewsbut i guess it could manifest in other ways (any error in deleting an upload)18:09
Shrewsat least the fix is easy. the test was surprisingly difficult18:10
*** nhicher has quit IRC18:15
*** nhicher has joined #openstack-infra18:15
openstackgerritMohammed Naser proposed zuul/nodepool master: k8s: make context optional  https://review.opendev.org/68213018:17
*** raschid has quit IRC18:26
*** jamesmcarthur has quit IRC18:26
openstackgerritMerged opendev/system-config master: Add docs for recovering an OpenAFS fileserver  https://review.opendev.org/68133818:27
*** prometheanfire has quit IRC18:30
*** prometheanfire has joined #openstack-infra18:31
clarkbcorvus: if you have a sec to rereview https://review.opendev.org/#/c/682098/2 AJaeger has +2'd it. Then maybe we can send an email to the openstack-discuss list pointing people to that section18:32
corvusclarkb: +318:41
clarkbI've cleaned up the two nodes I held recently to debug job retries (one was for ironic and the other for octavia)18:42
clarkbguilhermesp: do you still need that held node?18:42
openstackgerritMerged openstack/infra-manual master: Document why jobs queues are slow  https://review.opendev.org/68209818:45
clarkbcorvus: thanks18:47
AJaegerpublished: https://docs.openstack.org/infra/manual/testing.html18:50
clarkbcool I'll send an email to openstack-discuss now with pointers to that18:51
*** xek_ has joined #openstack-infra18:54
*** signed8bit has joined #openstack-infra18:55
*** lucasagomes has quit IRC18:59
*** goldyfruit___ has joined #openstack-infra18:59
*** goldyfruit_ has quit IRC19:02
*** markvoelker has quit IRC19:04
*** markvoelker has joined #openstack-infra19:06
mriedemclarkb: https://review.opendev.org/#/c/682133/ should help with gate fail #219:11
mriedemhttp://status.openstack.org/elastic-recheck/#181314719:12
mriedemjust fyi19:12
clarkbmriedem: are we reasonably confident that enqueuing it without check results won't result in failures?19:12
clarkbeg you ran pep8 and pyxy unittests locally?19:12
mriedempep8 yes and targeted unit and functional tests but not everything19:13
mriedemi can do that first19:13
clarkbcool let me know and I'll enqueue19:13
clarkbI smell lunch so I'll do that while I wait19:13
*** lpetrut has joined #openstack-infra19:14
*** lpetrut has quit IRC19:15
openstackgerritTristan Cacqueray proposed zuul/zuul master: DNM: show that mypy doesn't check type for caller  https://review.opendev.org/68214219:16
*** lpetrut has joined #openstack-infra19:16
*** markvoelker has quit IRC19:16
*** markvoelker has joined #openstack-infra19:17
guilhermespsorry for the delay clarkb . I will need probably just for today19:20
guilhermespif it is possible you schedule the clean up for tomorrow19:21
clarkbguilhermesp: no worries was just checking as I cleaned up my own holds19:21
guilhermespit was a good reminder btw, I did not touch it yesterday as I was planning to, gonna do it today19:23
*** kjackal has joined #openstack-infra19:27
mriedemclarkb: yeah https://review.opendev.org/#/c/682133/ and https://review.opendev.org/#/c/682025/ are good for promotion,19:50
mriedemthe former was promoted earlier today but failed due to the subunit parser bug19:50
clarkbok I'll do that momentarily19:50
mriedems/former/latter/19:51
clarkbmriedem: which one happens more often (do you know?) I can put that one ahead of the other to slightly improve our chances19:51
mriedemsubunit parser19:52
mriedemthe former19:52
*** slaweq has quit IRC19:52
mriedemsince your email i looked at http://status.openstack.org/elastic-recheck/#1686542 again and it looks like keystone and the openstack-tox-lower-constraints job there has the top hits19:53
johnsomDid something change in zuul that "zuul_work_dir" is no longer created automatically?19:53
clarkbok https://review.opendev.org/#/c/682133/ first19:53
clarkbmriedem: ya keystone has had timeout issues in their tests. cmorpheus has been working to bump the timeouts while they sort out the runtimes19:53
fungijohnsom: that sounds unlikely, but if you have an example failure maybe we can get to the bottom of it?19:53
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure - add support for git.tag.creation event  https://review.opendev.org/67993819:54
cmorpheusthe keystone timeouts have mostly gone away with the latest bump19:54
johnsomHere is the failure: https://zuul.opendev.org/t/openstack/build/9797918fd0e64899bfa6715b78179e5c/log/job-output.txt#46419:54
johnsomHere is the zuul config: https://github.com/openstack/octavia/blob/master/zuul.d/jobs.yaml#L11619:54
cmorpheusand we'll have a less bad fix once a few more changes get through19:54
fungiclarkb: mriedem: any idea if the ironic enospc aborts on rackspace are still ongoing?19:54
clarkbfungi: the change that dropped the disk sizes to 4GB merged19:54
clarkbfungi: at 1am pacific time today19:55
fungiahh, okay, and any idea if that was sufficient to stem the failures?19:55
mriedemfungi: news to me19:55
pabelangerjohnsom: octavia-lib is only pushed to disk19:55
clarkbI don't know how much that has helped their issues though it did seem to reduce the occurrence of the issue on that particular change at least19:55
pabelangernot octavia19:55
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure - add support for git.tag.creation event  https://review.opendev.org/67993819:55
clarkbso with a small data set it seemed to help but not fix?19:55
johnsomRight, octavia is not needed by the job19:55
pabelangerbut zuul_work_dir is hardcoded to it?19:55
pabelangercan you drop that19:56
clarkbmriedem: I've enqueued both changes and there are no other nova changes ahead in the gate so I don't think we need to promote them19:56
pabelangerand use zuul.project.src_dir19:56
fungimriedem: yeah, it wouldn't show up on e-r because the failure condition causes ansible to act like the network access died and so zuul retries the job. it can eventually result in retry_limit if the job gets unlucky and runs in the same provider thrice in a row19:56
johnsomI should say, it's not needed by the octavia-lib runs of  that job19:56
guilhermespall right clarkb you can clean the hold 23.253.245.79 :)19:57
clarkbguilhermesp: ok thanks19:57
pabelangerjohnsom: I believe you can remove zuul_work_dir from your tox job, and it should work as expected. Otherwise octavia-tox-py37-tips expects to run against only octavia19:57
johnsomHmm, ok, I will remove that and give it a try. It's just odd, as it was working ok.19:57
johnsomMaybe I can't share that definition between these two repos.19:58
*** lpetrut has quit IRC19:58
fungijohnsom: if octavia-lib changes had a depends-on declared to octavia changes i believe they'll end up with octavia checkouts there, could that be why it went unnoticed?19:59
clarkbcmorpheus: do keystone's unit tests run in parallel like nova and neutron? that is usually a major throughput bonus. If they do then great. Do you know why they are slow out of curiosity (eg disk io or network or cpu or ?)19:59
johnsomThat could be actually. Most have been that way recently19:59
johnsomfungi for the win!19:59
johnsomlol19:59
cmorpheusclarkb: we think they are slow because we've been adding hundreds of unit tests19:59
*** lucasagomes has joined #openstack-infra20:00
cmorpheusclarkb: although sometimes they still run in <40 minutes and sometimes they take nearly 90 which i can only attribute to getting unlucky with node selection20:00
cmorpheusthey do run in parallel i believe20:00
clarkbya depending on what resources you need some clouds are definitely slower than others20:00
fungiokay, i need to disappear to grab late lunch/early dinner while the tourists are otherwise occupied... i'll be back soon i hope20:01
clarkbjohnsom: fyi we should have worlddump data for future failures with weird networking; we can double check that to see if things are broken on the host https://review.opendev.org/#/c/680340/20:01
pabelangerclarkb: it might be worth looking at the hostid in the inventory file, we recently saw a pattern of slow compute nodes in the clouds, which the provider was able to fix.20:01
cmorpheushmm we only have 'stestr run {posargs}' and nothing about workers or parallelism in .stestr.conf20:01
johnsomclarkb Ok. Is fortnebula back in rotation? I haven't seen one of those failures recently20:02
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure - handle Pull Request tags (labels) metadata  https://review.opendev.org/68105020:02
clarkbjohnsom: I think at small scale. donnyd identified a retransmit problem with his router being overwhelmed and has redone the network architecture to put openstack router on dedicated instance20:02
clarkbjohnsom: and monitoring shows retransmits have significantly fallen off20:03
fungiclarkb: johnsom: it's back up to 90 (assuming the change i approved eventially merged)20:03
clarkbdonnyd: ^ you can probalby fill in johnsom better than I20:03
clarkbfungi: cool20:03
johnsomclarkb Ok cool. I will keep my eye out for runs.20:03
fungiyeah, says it merged at 17:1620:03
fungihttps://review.opendev.org/68210720:03
fungianyway, food. bbl20:04
clarkbcmorpheus: looks like they are in parallel. The little {X} prefix before your test names shows you which runner ran the test20:05
cmorpheusah20:05
*** panda has quit IRC20:05
clarkbto compare to nova though: Ran: 17497 tests in 461.6404 sec. vs Ran: 7089 tests in 2283.5612 sec.20:07
clarkbthere is a way to get a list of slow tests ( I'm surprised that stestr isn't showing that by default )20:07
* clarkb downloads the subunit file20:07
*** panda has joined #openstack-infra20:08
cmorpheuswe think the tests we added are inefficient because they use setUp() to recreate fixtures for each test, lbragstad was looking into optimizing it but hadn't been successful yet20:09
donnydis there an issue with FN again20:09
cmorpheushttps://review.opendev.org/66306520:09
*** igordc has joined #openstack-infra20:10
clarkbdonnyd: no, but johnsom had noticed dns issues on FN previously so was asking if FN is back in the rotation now that we have extra debugging info20:10
cmorpheussplitting out those slow tests reduces the unit test time by more than half https://review.opendev.org/68078820:10
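As a generic illustration of the setUp() cost being described (not keystone's actual code), the difference is roughly the one sketched below: an expensive fixture rebuilt for every test versus one built once per class, with the usual caveat that class-level state must not be mutated by individual tests.

    import time
    import unittest

    def build_expensive_fixture():
        time.sleep(0.1)  # stand-in for bootstrapping data, loading policies, etc.
        return {"ready": True}

    class RebuiltEveryTest(unittest.TestCase):
        def setUp(self):
            self.fixture = build_expensive_fixture()  # cost paid per test

        def test_something(self):
            self.assertTrue(self.fixture["ready"])

    class BuiltOncePerClass(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.fixture = build_expensive_fixture()  # cost paid once

        def setUp(self):
            self.per_test_state = {}  # only cheap per-test state here

        def test_something(self):
            self.assertTrue(self.fixture["ready"])

    if __name__ == "__main__":
        unittest.main()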
clarkbdonnyd: I think you tracked it to the likely culprit; I was just pointing out we have extra info for debugging that in the future to rule in or out test host related issues20:10
*** gfidente|afk has quit IRC20:11
donnydoh ok20:11
johnsomclarkb I found a grenade run at fortnebula that used IPv6 DNS successfully. So looking good so far.20:11
clarkbcmorpheus: https://gist.githubusercontent.com/cboylan/82f43a9c345100efcf155f82d936edf4/raw/9862d29614b9a541c35368971472f50e477d9782/keystone%2520unittest%2520runtimes I got that from this build https://zuul.opendev.org/t/openstack/build/05abcf573a4d402f9950176cd0fe813e20:13
clarkbcmorpheus: not sure if that helps20:13
cmorpheusclarkb: that confirms what we thought20:14
cmorpheusthe keystone.tests.unit.protection.v3 are super slow20:15
johnsomclarkb Yeah, multiple good runs. This is likely fixed.20:15
cmorpheusand there are a crapton of them20:15
AJaegerconfig-core, a couple of changes that should be safe to merge: could you review https://review.opendev.org/681785 https://review.opendev.org/681276 https://review.opendev.org/#/c/681259/ https://review.opendev.org/681361 and https://review.opendev.org/680901 , please?20:15
*** ykarel|away has quit IRC20:18
AJaegerthanks, clarkb20:20
clarkbAJaeger: I'm about to head out for a short bike ride so not going to review (or approve) the zuul jobs change as that likely needs more attention (though if it is a new role it should be safe)20:21
AJaegerclarkb: we have some more to review next week anyhow ;) Enjoy your ride!20:22
* AJaeger wishes everybody a great weekend20:22
*** diablo_rojo__ has joined #openstack-infra20:23
*** diablo_rojo has quit IRC20:25
paladoxcorvus fungi clarkb What do you think about https://imgur.com/MgF7zrp for the pg theme?20:26
clarkbpaladox: pg is the js framework that newer gerrit uses? I think it looks fine.20:27
paladoxyup20:28
clarkbcmorpheus: fwiw I used testr to get those numbers via testr load < subunit.file && testr slowest --all20:28
clarkbI have never really used stestr though I'm assuming it can do similar20:28
clarkbcmorpheus: that might be a good way to confirm you are making progress over time though20:28
cmorpheusclarkb: cool, good idea20:29
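A hedged sketch of doing the same thing directly from a subunit v2 file in Python, assuming the python-subunit and testtools libraries are installed; the file name below is just an example.

    import subunit
    import testtools

    def slowest_tests(path, top=10):
        starts, durations = {}, {}

        class Timer(testtools.StreamResult):
            def status(self, test_id=None, test_status=None,
                       timestamp=None, **kwargs):
                if not test_id or timestamp is None:
                    return
                if test_status == 'inprogress':
                    starts[test_id] = timestamp
                elif test_status in ('success', 'fail', 'skip', 'xfail',
                                     'uxsuccess') and test_id in starts:
                    durations[test_id] = (
                        timestamp - starts[test_id]).total_seconds()

        with open(path, 'rb') as stream:
            case = subunit.ByteStreamToStreamResult(
                stream, non_subunit_name='stdout')
            result = Timer()
            result.startTestRun()
            case.run(result)
            result.stopTestRun()

        return sorted(durations.items(), key=lambda kv: kv[1],
                      reverse=True)[:top]

    for name, seconds in slowest_tests("testrepository.subunit"):
        print("%8.2fs  %s" % (seconds, name))

Rerunning this against successive builds' subunit files is one way to track whether the slow tests are actually getting faster over time.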
openstackgerritMerged openstack/project-config master: Tighten formatting for new branch reno page title  https://review.opendev.org/68178520:30
openstackgerritMerged openstack/project-config master: Report openstack/election repo changes in IRC  https://review.opendev.org/68127620:30
openstackgerritMerged openstack/project-config master: End project gating for charm-neutron-api-genericswitch  https://review.opendev.org/68125920:30
openstackgerritJames E. Blair proposed zuul/zuul master: WIP: Support HTTP-only Gerrit  https://review.opendev.org/68193620:31
openstackgerritMerged openstack/project-config master: Update horizon grafana dashboard  https://review.opendev.org/68136120:33
corvuspaladox: :-D  that looks great!20:33
paladox:)20:33
paladoxi based it on our theme, but changed things so it incorporated opendev's logo.20:34
paladox(credits for our skin goes to Tyler Cipriani)20:34
corvuspaladox: i love the nod to history here -- the gerrit theme was openstack+wikimedia infra teams' first collaboration20:35
paladoxoh!20:35
paladoxheh20:35
openstackgerritTristan Cacqueray proposed zuul/zuul master: Add tox-py37 to Zuul check  https://review.opendev.org/68215820:36
paladoxI have the config to support gerrit 2.15-3.0 and the one for 3.1+ (https://gerrit-review.googlesource.com/c/gerrit/+/234734)20:36
corvuspaladox: before our two projects got our hands on it, this is what gerrit looked like: https://www.researchgate.net/profile/Ahmed_E_Hassan/publication/266657830/figure/fig1/AS:357844848267264@1462328268100/An-example-Gerrit-code-review.png  -- then openstack and wikimedia bounced some changes back and forth, and the result became the default theme for the remainder of the gwt era20:38
paladoxcorvus wow.20:39
paladoxdidn't know that :)20:39
paladoxcorvus https://phabricator.wikimedia.org/P910420:40
paladoxthat would be the config for you :)20:40
paladoxthat goes into gerrit-theme.html under static/20:41
*** whoami-rajat has quit IRC20:42
*** roman_g has quit IRC20:42
corvuspaladox: added to etherpad!20:42
paladox:)20:42
paladoxcorvus i have the gerrit 3.1 config too, just changing the things in it! :)20:42
*** nhicher has quit IRC20:42
*** nhicher has joined #openstack-infra20:43
paladoxcorvus https://phabricator.wikimedia.org/P910520:45
paladoxvery easy to define custom themes for users and allow them to enable it using localStorage!20:45
paladox*too20:45
corvuspaladox: i'm looking forward to 'dark' myself :)20:49
paladox:P20:49
corvusclarkb, fungi: you know that thing where we comment out all the check jobs to work on something?  i found a shorter version of that: rename the 'check' project-pipeline to experimental (assuming there isn't one already) then make a new check project-pipeline with the single job.20:50
corvusjust a handy zuul tip20:50
corvussee https://review.opendev.org/#/c/681936/3/.zuul.yaml20:50
*** openstackgerrit has quit IRC20:51
*** lbragstad has joined #openstack-infra20:51
mriedemclarkb: in true karmic fashion, the subunit parser fix in the gate failed b/c of the other functional test gate bug20:51
mriedemwhich failed for reverse reasons earlier20:52
corvusmriedem: want to make an omnibus patch or re-promote?  i can do the promote if you tell me what to do20:52
mriedemi'd prefer to not squash the fixes since i want to backport them as well20:53
mriedemcorvus: does "promote" mean ninja merge?20:53
mriedemor just put it in the gate and skip check?20:53
paladoxcorvus i'm looking forward to a cleaner look & also zuul integration under gerrit.w.org20:53
corvusmriedem: put it top of gate and skip check20:53
mriedemcorvus: does it need to finish its run first? it's in the gate now20:53
mriedemhttps://review.opendev.org/#/c/682133/ i mean20:53
corvusmriedem: what's the other patch?20:54
mriedemcorvus: https://review.opendev.org/#/c/682025/20:56
*** markvoelker has quit IRC20:57
corvusmriedem: we can use promote to force a gate-reset and reconstruct it in the order we want.  it's useful for flakey-test fixing patches because it lets us put the fixes at the top of the queue.  but it always resets the entire queue, so if i used it right now, we would lose the work done on the current head (679510).  if there's a chance that will merge (ie, we are not sure that it's just going to21:00
corvushit a bug right at the end), then we should probably let it do so.  let me give 3 options based on how you rate the urgency of these fixes:21:00
corvusmriedem: 1) most urgent: promote 682133 then 682025 immediately (will waste 679510 which will end up behind those).  2) somewhat urgent: wait 30m and promote 682133 and 682025 after 679510 finishes.  3) least urgent: dequeue 682133 and re-enqueue it at the back of the queue (least disruptive, but it'll be behind several other changes).21:02
corvusmriedem: (and, of course, we could do nothing.  always an option :)21:02
*** rh-jelabarre has quit IRC21:02
mriedem#2 would be nice21:02
mriedemnova has 3 relatively beefy series approved since yesterday that we're trying to get through21:03
mriedemand given it's nova it takes awhile to get a node21:03
*** pcaruana has quit IRC21:03
mriedemso i'm cool with having that stuff maybe merged by sunday night, but it would suck to be middle of next week and we're still rechecking21:03
corvusk.  i'll try to keep an eye out for 679510 clearing out and promote those 2 when it does21:04
mriedemgreat, thanks21:04
corvusmriedem: oh, one question is 682133 then 682025 the preferred order, or does it matter?21:04
mriedemyeah, based on e-r hits anyway21:04
mriedembut i'm not sure it matters too much since we've seen both fail today21:04
corvuscool.  neutron just failed so i'm doing it now.21:04
mriedemyeah that neutron patch failed hard https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_54d/679510/1/gate/neutron-functional/54dc546/testr_results.html.gz21:05
corvusmriedem: done21:06
corvushopefully we're luckier this time21:06
mriedemthanks again21:06
corvusnp21:06
paladoxcorvus some of our users are waiting for 2.16 since it adds the project dashboard to PolyGerrit and also the inline editor!21:07
corvuspaladox: i just used the inline editor a bit yesterday -- it's handy but takes getting used to (it's way too easy to forget to exit the editor -- and then after that you still have to remember to publish your drafts).  that's something i heard folks talking about at the hackathon but didn't quite grasp at the time until i tried it and fell right into that trap.  :)21:09
paladoxheh21:09
corvus(basically i was like: "i got this.  i totally know i need to publish my drafts before it's real"  but then didn't realize i needed to stop editing before publishing)21:09
paladoxyeh, though under PolyGerrit it saves anything you type to the local storage.21:09
paladoxso that you can just load up the file and press save21:10
paladoxthere was a bug until earlier this year that made it impossible to use the inline editor.21:10
paladoxthat bug was an easy fix, but took some time to find it :P21:10
*** slaweq has joined #openstack-infra21:11
*** rf0lc0 has quit IRC21:12
*** slaweq has quit IRC21:16
clarkbI use the inline editor to add depends on a lot21:19
clarkbparticularly useful if I know i don't have a project cloned yet but am doing some testing of cross project stuff21:19
*** KeithMnemonic has quit IRC21:19
*** diablo_rojo__ is now known as diablo_rojo21:20
clarkbcorvus: I did update  https://review.opendev.org/#/c/682091/ with the lambda21:20
*** kjackal has quit IRC21:28
*** lucasagomes has quit IRC21:30
*** openstackgerrit has joined #openstack-infra21:30
openstackgerritJames E. Blair proposed zuul/zuul master: Support HTTP-only Gerrit  https://review.opendev.org/68193621:30
*** efried is now known as efried_afk21:35
*** jbadiapa has quit IRC21:37
*** diablo_rojo has quit IRC21:44
*** xek_ has quit IRC21:46
*** xek has joined #openstack-infra21:47
*** xek has quit IRC21:49
*** xek has joined #openstack-infra21:49
*** mriedem has left #openstack-infra21:49
*** EvilienM is now known as EmilienM21:52
*** xek_ has joined #openstack-infra21:52
*** xek has quit IRC21:55
*** diablo_rojo has joined #openstack-infra21:59
*** munimeha1 has quit IRC22:03
*** mriedem has joined #openstack-infra22:04
*** markvoelker has joined #openstack-infra22:09
*** slaweq has joined #openstack-infra22:11
*** markvoelker has quit IRC22:14
*** slaweq has quit IRC22:16
openstackgerritClark Boylan proposed openstack/project-config master: Rename x/ansible-role-cloud-launcher -> opendev/  https://review.opendev.org/66253022:24
clarkbok that's the rebase to ensure the project rename changes are in the correct order22:24
clarkbok there is a bug in our gitea-rename-tasks.yaml playbook. It uses an ansible playbook/tasks list that no longer exists in order to create the orgs22:28
clarkbbecause we don't create any new orgs in this rename I think that is ok22:28
clarkbhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml#L24-L27 we can delete that task when we run it on monday but that should be fixed22:29
*** mriedem is now known as mriedem_afk22:30
clarkbcorvus: for https://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml#L28-L68 that section should be fine since it reimplements what was in the old task lists anyway right? basically the libification of these http requests didn't change behavior it just turned it into python to avoid the process exec cost22:30
*** rlandy has quit IRC22:38
*** Goneri has quit IRC22:41
*** pkopec has quit IRC22:46
*** goldyfruit___ has quit IRC22:49
corvusclarkb: that is a task list?22:55
corvuswe moved some project creation stuff to a module22:56
corvusbut that shouldn't have a bearing here?22:56
clarkbwell https://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml#L24-L27 imports a file that was deleted in that module move. But we don't need that bit. The other question was whether or not we added anything to the code that moved into python code22:56
clarkbI don't think we did so the ansible http tasks we use to rename should be fine (just slower than the python we could rewrite them in)22:57
*** xek_ has quit IRC22:59
*** mattw4 has quit IRC23:11
*** diablo_rojo has quit IRC23:12
*** tosky has quit IRC23:23
*** xenos76 has quit IRC23:25
fungicorvus: woah, mind blown. s/check/experimental/ is an awesome trick i would never have thought of23:26
fungipaladox: that theme looks nice and clean to me, though it seems like the change view and diff view pages are usually where gerrit theming really gets hairy23:31
*** gyee has quit IRC23:32
paladoxfungi looks like this https://imgur.com/a/ImGOuMe on the diff view23:32
*** gyee has joined #openstack-infra23:35
fungiahh, yeah that looks pretty good too23:40
*** gyee has quit IRC23:51
*** rcernin has joined #openstack-infra23:56
