Wednesday, 2017-07-26

00:04 *** esberglu has quit IRC
00:20 *** edmondsw has joined #openstack-powervm
00:25 *** edmondsw has quit IRC
00:29 *** thorst has joined #openstack-powervm
00:29 *** jwcroppe has joined #openstack-powervm
00:33 *** thorst has quit IRC
00:33 *** jwcroppe has quit IRC
00:33 *** jwcroppe has joined #openstack-powervm
00:40 *** esberglu has joined #openstack-powervm
00:45 *** esberglu has quit IRC
00:59 *** thorst has joined #openstack-powervm
01:07 *** thorst has quit IRC
01:09 *** svenkat has joined #openstack-powervm
02:08 *** edmondsw has joined #openstack-powervm
02:10 *** svenkat has quit IRC
02:13 *** edmondsw has quit IRC
03:57 *** edmondsw has joined #openstack-powervm
03:59 *** apearson has joined #openstack-powervm
04:01 *** edmondsw has quit IRC
04:04 *** apearson has quit IRC
04:04 *** apearson has joined #openstack-powervm
04:43 *** tjakobs has joined #openstack-powervm
04:49 *** chhavi has joined #openstack-powervm
04:53 *** esberglu has joined #openstack-powervm
04:58 *** esberglu has quit IRC
05:00 *** apearson has quit IRC
05:02 *** tjakobs_ has joined #openstack-powervm
05:04 *** tjakobs has quit IRC
05:04 *** thorst has joined #openstack-powervm
05:10 *** thorst has quit IRC
05:18 *** tjakobs_ has quit IRC
05:45 *** edmondsw has joined #openstack-powervm
05:49 *** edmondsw has quit IRC
06:29 *** thorst has joined #openstack-powervm
06:33 *** thorst has quit IRC
06:41 *** esberglu has joined #openstack-powervm
06:45 *** tjakobs_ has joined #openstack-powervm
06:46 *** esberglu has quit IRC
07:29 *** tjakobs_ has quit IRC
07:33 *** edmondsw has joined #openstack-powervm
07:37 *** edmondsw has quit IRC
08:29 *** esberglu has joined #openstack-powervm
08:30 *** thorst has joined #openstack-powervm
08:34 *** esberglu has quit IRC
08:34 *** thorst has quit IRC
09:07 *** k0da has joined #openstack-powervm
09:16 *** dwayne has quit IRC
09:21 *** dwayne_ has joined #openstack-powervm
09:22 *** edmondsw has joined #openstack-powervm
09:23 *** mdrabe has quit IRC
09:25 *** edmondsw has quit IRC
09:26 *** mdrabe has joined #openstack-powervm
09:58 *** thorst has joined #openstack-powervm
10:06 *** thorst has quit IRC
10:07 *** thorst has joined #openstack-powervm
10:11 *** thorst has quit IRC
10:17 *** esberglu has joined #openstack-powervm
10:22 *** esberglu has quit IRC
11:09 *** edmondsw has joined #openstack-powervm
11:13 *** edmondsw has quit IRC
11:42 *** svenkat has joined #openstack-powervm
11:50 *** smatzek has joined #openstack-powervm
12:05 *** esberglu has joined #openstack-powervm
12:08 *** esberglu_ has joined #openstack-powervm
12:10 *** esberglu has quit IRC
12:13 *** edmondsw has joined #openstack-powervm
12:16 *** thorst has joined #openstack-powervm
12:32 *** esberglu_ has quit IRC
12:56 *** esberglu_ has joined #openstack-powervm
12:58 *** esberglu_ has quit IRC
12:59 *** esberglu_ has joined #openstack-powervm
13:05 *** apearson has joined #openstack-powervm
13:08 *** jwcroppe has quit IRC
13:28 *** kylek3h has joined #openstack-powervm
13:35 <esberglu_> edmondsw: efried: Increased the discover hosts timeout for CI in 5609
13:35 <esberglu_> That should hopefully alleviate some of the CI failures
13:35 <esberglu_> Also made a new etherpad
13:35 <esberglu_> https://etherpad.openstack.org/p/powervm_tempest_failures
13:35 <esberglu_> For tracking failing tempest tests
13:52 *** jwcroppe has joined #openstack-powervm
13:59 *** apearson has quit IRC
14:27 *** apearson has joined #openstack-powervm
14:44 *** esberglu_ has quit IRC
14:44 *** esberglu has joined #openstack-powervm
14:55 <esberglu> efried: Looks like the marker LU uploads are freezing in CI
14:55 <esberglu> http://184.172.12.213/00/486700/2/check/nova-out-of-tree-pvm/fd429a4/
14:59 <efried> Am I a little frightened that we still pass 745 tests when that happens?
15:00 *** tjakobs_ has joined #openstack-powervm
15:00 <efried> esberglu which neo?
15:00 <efried> or, really, any neo in the cluster will do
15:01 <esberglu> efried: neo6
15:01 <esberglu> This is what is causing the really long CI runs that fail
15:01 <esberglu> Seems to be localized to the ssp cluster with neo 6, 7, 8, and 11
15:03 <esberglu> efried: Nvm, seeing it on other ssp groups as well
15:04 <efried> esberglu If you clear out the marker, does the world go back to sanity?
15:06 <esberglu> efried: I wonder if the vm cleaner scripts are broken
15:07 <efried> Were the vm cleaner scripts supposed to scrub marker LUs too?
15:08 <esberglu> I think post_stack_vm_cleaner.py does (or did?). I'm pulling up the script now
15:10 <efried> esberglu btw, I'm still not stacking.
15:11 <esberglu> efried: neo-os-ci/ci-ansible/roles/ci-management/templates/scripts/post_stack_vm_cleaner.py
15:11 <esberglu> I think the remove_backing_storage function there would also clear out marker lu's or no?
15:13 <esberglu> efried: With uwsgi or mod_wsgi
15:13 <efried> mod_wsgi.  With uwsgi, glance-api wouldn't start.  With mod_wsgi, it started, but failed on image creation.
15:14 <efried> Seems like a timing thing tbh, cause I can run the failing command successfully right after stack fails.
15:15 <esberglu> efried: You could just add a wait before that command to confirm
15:15 <esberglu> I think edmondsw's system was hitting that sometimes
15:15 <efried> esberglu remove_backing_storage looks like it'll only remove LUs associated with SCSI mappings on LPARs.  Which will never include marker LUs.
15:15 <efried> Oh, I wonder if I need to bust down my SMT level.  I don't think I did that.
15:16 <esberglu> efried: Oh yeah didn't think of that
15:16 <esberglu> Worth a shot
15:16 <efried> sho
15:17 <efried> remind me how?
15:17 <edmondsw> what is a marker lu?
15:17 <efried> edmondsw Long story, hold on a tick
15:17 <edmondsw> sure
15:17 <esberglu> sudo ppc64_cpu --smt=off
15:17 <efried> found it - sudo ppc64_cpu --smt=off
15:17 <efried> did it halfway through stacking, we'll see if it takes ;-)
15:18 <esberglu> edmondsw: Did you still want me to walk you through SSP setup at some point?
15:18 <efried> edmondsw We've got some clever code in our SSP disk driver that coordinates image uploads across threads, including on hosts that otherwise don't know about each other.
15:18 <edmondsw> esberglu yes... how long do you think that'll take?
15:19 <efried> edmondsw Do you have an IP.com account?
15:19 <edmondsw> efried doesn't ring a bell
15:20 <esberglu> edmondsw: I'd say 5-20 minutes. 30 max
15:20 <esberglu> *15-20
15:20 <efried> edmondsw emailed you
15:21 <edmondsw> efried tx
15:21 <edmondsw> esberglu doesn't sound too bad... you free now?
15:21 <esberglu> edmondsw: Sure
15:21 * efried listens in
15:24 <esberglu> 1) Log into the backing SAN gui
15:24 <edmondsw> check
15:25 <esberglu> 2) Go to the volumes section in the menu on the left of the screen (3rd one down) and select volumes by host
15:25 <edmondsw> check
15:26 <esberglu> 3) Find your system, should be neodev<neo#>
15:26 <esberglu> neodev5 IIRC for you
15:26 <edmondsw> yep, I'm there
15:27 <esberglu> 4) Create volume
15:27 <esberglu> thin provision, mdiskgrp0
15:27 <edmondsw> well, I'm at neodev5-1... there's also neodev5-2... does it matter?
15:27 <esberglu> edmondsw: Is this a single vios setup or dual?
15:28 <edmondsw> whatever the novalink installer gave me
15:28 *** apearson has quit IRC
15:28 <esberglu> IIRC the -2 is for when you have dual vioses. I've only done this for single, not sure if it makes a difference
15:28 <esberglu> thorst probably knows
15:29 <esberglu> Anyway now you need 2 volumes, 1 meta and 1 data
15:29 <edmondsw> size?
15:29 <esberglu> Typically I do 1G for meta and 250G for data per system
15:29 <efried> You need to assign the *same* volume to *all* VIOSes that need to participate in the cluster.
15:30 <esberglu> So the CI SSPs have 4 systems (1G meta, 4x250G data)
15:30 <efried> If you have dual VIOS, it'd be a really good idea to have both participating, because I believe we try to map from both, in case one breaks.
15:33 <edmondsw> esberglu efried I think I have dual
15:33 <esberglu> edmondsw: I'm only able to log into one of your vioses but can ping both vios ips (according to the neo hardware page)
15:33 <edmondsw> hmm
15:33 <esberglu> I haven't done much with dual vios, maybe that's normal
15:34 <esberglu> Well anyway we can just start with 5-1 and then add 5-2 if we confirm dual
15:35 *** k0da has quit IRC
15:36 <edmondsw> ok, I have meta and data volumes created and mapped to the host
15:36 <edmondsw> 5-1
15:37 <esberglu> edmondsw: Okay log into vios 1
15:37 <esberglu> And run lspv
15:37 <esberglu> You should see hdisk0 and hdisk1
15:37 *** apearson has joined #openstack-powervm
15:39 <edmondsw> pvmctl shows 2 vios but vios2 is "busy"
15:39 <thorst> hdisk0 and 1 may be the SAS drives.
15:39 <thorst> you have to do some inspection to figure out if SAS or FC
15:39 <thorst> mkvterm into vios2 to figure out why busy (I bet it didn't boot properly)
15:40 <esberglu> thorst: Yeah I'm just showing him that so when he runs lspv after he sees the difference
15:40 <thorst> esberglu: ahh, gotcha
15:41 <edmondsw> I see hdisk0 and 1
15:41 <esberglu> okay now run
15:41 <esberglu> oem_setup_env
15:41 <esberglu> then
15:41 <esberglu> cfgmgr
15:43 <edmondsw> done
15:43 <edmondsw> couple errors about can't find child device
15:43 <esberglu> edmondsw: Still running or complete?
15:43 <edmondsw> complete
15:43 <thorst> that's normal for FC ports that aren't plugged in
15:44 <edmondsw> coo
15:44 <thorst> most of the cards have multiple ports, only one is plugged in because I didn't have enough switches for everything
15:44 <thorst> and FC is...you know...expensive  :-)
15:44 <edmondsw> really ;)
15:44 <esberglu> thorst: The new disks should be showing up at this point though and they aren't
15:45 <thorst> so I came in half way through.  Is the zoning done?  The disks are out on the v7k?
15:45 <esberglu> thorst: Yep
15:45 <edmondsw> I created the disks on the SVC and said "create and map to host"
15:45 <edmondsw> meta and data
15:45 <thorst> k...sounds right.  Were there other disks on the host?
15:46 <edmondsw> no
15:46 <esberglu> thorst: Nope. There were a bunch of gpfs ones previously but I deleted all of those last week
15:46 <thorst> PM me the v7k IP you're using
15:46 <thorst> o, ok
15:46 <thorst> this is neo5.
15:46 <esberglu> yep
15:46 <thorst> hmm...well I know zoning works there.
15:46 <thorst> so this is not installed in SDN or SDE mode...straight traditional mode?
15:46 <edmondsw> right
15:47 <esberglu> thorst: Yeah I've done this for CI quite a bit now, but that was always single vios.
15:47 <thorst> try mapping the disks to the neo-dev-5-2
15:47 <esberglu> Would you just add the second vios to the cluster the same as you would add a second neo?
15:47 <thorst> sometimes vios1 maps to neo-dev-5-2 in the v7k...because it's really mapped to a card
15:48 <thorst> and how the cards get assigned to the VIOSes gets...odd
15:48 <thorst> esberglu: yep!
15:49 *** efried has quit IRC
15:49 <esberglu> edmondsw: To map to another host, just highlight the two disks and click the actions button
15:49 <esberglu> And there is a map to host option
15:49 <edmondsw> k
15:50 <edmondsw> done... so now run cfgmgr again?
15:50 <esberglu> Yep
15:50 <esberglu> If this worked and you run lspv again after you should see the new ones
15:52 <esberglu> lspv -size to see which is meta and which is data
15:52 <edmondsw> cfgmgr output is the same, but I do see hdisk2-3 now with lspv
15:52 <edmondsw> should I exit out of oem_setup_env?
15:53 <esberglu> yeah you can
15:53 <esberglu> Now all that's left is the cluster creation
15:54 <esberglu> cluster -create -clustername <clustername> -repopvs <meta_disk> -sp <sp_name> -sppvs <data_disk(s)>
15:54 <esberglu> clustername and sp_name are whatever you want to name them
15:54 <esberglu> ci_cluster and ci_ssp for CI
15:55 <edmondsw> running
15:56 <edmondsw> reports success
15:56 <esberglu> cluster -list
15:56 <esberglu> and
15:57 <esberglu> cluster -status
15:57 <esberglu> to make sure all is as expected
15:57 <edmondsw> lists and says ok
15:57 <esberglu> Perfect
15:57 <edmondsw> thanks!
15:57 <esberglu> Now once you get the 2nd vios figured out
15:57 <esberglu> - map to host in san (done already)
15:57 <esberglu> - run cfgmgr
esberglu- run "cluster -addnode -clustername -hostname <vios2hostname>"15:58
15:58 <edmondsw> on that note... thorst I did mkvterm and it was sitting at a login prompt. I put "padmin" and it is now periodically spewing    INIT: failed write of utmp entry: "          cons"
15:59 <edmondsw> never asked for password
15:59 <esberglu> I think that the addnode can be run from either vios, but you might have to be on the one you created the cluster with
15:59 <thorst> edmondsw: that looks like it installed wrong...or the disk it installed into went haywire
16:00 <edmondsw> thorst do I need to do a whole new novalink install, or how do I fix that?
16:02 *** efried has joined #openstack-powervm
16:03 <thorst> edmondsw: well, first, do you care?  If you do, then I'd just do a net install of the VIOS manually
16:03 <thorst> if it were me...
16:03 <thorst> but if you've never done that, it can be daunting
16:04 <thorst> and also, your novalink won't have redundancy (the NL installer makes things redundant by hosting itself out of a dual VIOS)
16:04 <edmondsw> maybe I don't care at the moment :)
16:04 <thorst> a reinstall is typically the best solution
16:04 <thorst> just gets you away from all the gork
16:04 <edmondsw> yeah, that's what I suspected
16:05 <thorst> I usually map FC devices to the hosts pre install
16:05 <thorst> have the installer install to those
16:05 <thorst> (FC is more reliable than those old SAS disks)
16:05 <thorst> then once install is done, add in the SSP disks
16:05 <thorst> and then we're good.
16:20 *** jwcroppe has quit IRC
17:03 *** miltonm has quit IRC
17:19 *** miltonm has joined #openstack-powervm
17:50 <esberglu> efried: I have some questions about the LU issues
17:50 <efried> Talk to me, Goose.
17:50 <esberglu> Okay you want to log into neo7 and run "pvmctl lu list"
17:50 <esberglu> You'll see there are tons of LUs
17:51 <esberglu> A 30G one for every instance in the ssp group
17:51 <esberglu> And then all of the marker LUs
17:53 <esberglu> I think that the marker LU's for each instance are getting cleaned up after the run, but does having that many around simultaneously have any implications
17:53 <efried> Well, we could run out of space in the SSP.
17:53 <efried> though that doesn't seem to be an issue here.
17:54 <efried> esberglu I don't actually see any markers
17:55 <esberglu> Wait what are the 0.1G LUs
17:55 <esberglu> Are markers only the ones that start with part
17:55 <efried> yuh
17:56 <esberglu> efried: There was one of those around when we were talking earlier and I deleted it
17:56 <efried> Those 0.1 ones might be reaaaalllly old, from when we were using (or trying to use) a 100MB image.
17:56 <efried> Want me to clean 'em up?
17:56 <esberglu> efried: I think they are for the instances that get spawned during the tempest tests
17:56 <esberglu> I'm pretty sure they are all from live instances
17:56 <efried> But why would we ever create boot disks of 100MB?
17:57 <esberglu> We use that 100MB all zeros image still I believe
17:57 <efried> for what?
17:59 <esberglu> For the default image that tempest uses for spawns
18:00 <efried> I didn't think we used that.
18:00 <efried> Anyway, want me to get rid of the ones that are listed as not in use?
18:00 <efried> (means they don't have a scsi mapping associated with them.)
18:02 *** k0da has joined #openstack-powervm
18:02 <esberglu> efried: Sure. We should add that to the periodic_vm_cleaner script long term
18:03 <efried> k, looks like this:
18:03 <efried> pvmctl lu list -d name type in_use --hide-label --field-delim ' ' | awk '$2 == "VirtualIO_Disk" && $3 == "False" {print $1}' | while read n; do pvmctl lu delete -i name=$n; done
18:03 <efried> Getting some failures.
18:03 <efried> Which probably means I'm trying to delete ones for runs that are actually happening.
18:04 <efried> I did not restrict deletion to the 0.1GB ones.  Mebbe I should have :)
18:09 <efried> so yeah, this is probably going to cause some CI failures
18:10 <esberglu> Eh oh well
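
A size-restricted variant of that loop, per the "Mebbe I should have" above, might look like the following. This is a sketch only: it assumes capacity is a valid display field for pvmctl lu list and that it prints in GB as "0.1"; both are assumptions worth checking against pvmctl lu list --help before running anything.

    # Delete only the not-in-use 0.1 GB LUs (capacity field/format assumed)
    pvmctl lu list -d name type in_use capacity --hide-label --field-delim ' ' \
        | awk '$2 == "VirtualIO_Disk" && $3 == "False" && $4 == "0.1" {print $1}' \
        | while read n; do pvmctl lu delete -i name="$n"; done
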
18:12 <esberglu> This doesn't explain the marker lu issue though. Remind me why having a bad marker LU around breaks things?
18:14 <esberglu> Is it because the upload hangs and then everything else just sits waiting for it to finish?
18:15 <esberglu> efried: Is there not a good way to check for an upload hanging?
18:16 <efried> Yeah, the marker LU is how the guy doing the upload tells everyone else it's doing the upload, and they a) shouldn't do an upload, b) need to wait for the upload to finish.
18:17 <efried> If that process (the one doing the upload) gets killed prematurely, the marker LU doesn't get deleted.
18:17 <efried> So everyone thinks there's still an upload going, and they wait forever.
18:18 <efried> esberglu In order to tell that happened, you would have to find the image LU corresponding to the marker and somehow figure out if bytes are still going to it.
18:18 <efried> Cause I don't know of a way to backtrace from an LU to figure out which process/host created it.
18:18 <esberglu> efried: So in this case it probably happened when the network was going haywire
18:18 <esberglu> Which explains why we've been seeing issues since then
18:19 <efried> The only other way would be to scour the compute logs on all the hosts looking for the creation of that marker LU.
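
In shell terms, the waiting side of that protocol boils down to polling for the marker, roughly as sketched below. The real logic lives in pypowervm (see the cluster_ssp link later in the log); IMAGE here is a hypothetical variable for the image LU name, and the marker pattern follows the 'part' + uuid-prefix naming described later in the conversation.

    IMAGE=...   # hypothetical: name of the image LU being uploaded
    # Wait while anyone advertises an in-progress upload of this image
    while pvmctl lu list -d name --hide-label | grep -q "^part.*${IMAGE}"; do
        sleep 30
    done
    # If the uploader died without removing its marker, this loop never
    # exits -- which is exactly the hang described above.
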
18:20 <efried> esberglu That LU scrub loop finished.  There's still a bunch of 0.1GB ones that are in use.  I think some of them may actually have duplicate names, which is funky.
18:21 <efried> yup
18:21 <efried> name=boot_pvm11_tempest_Del_96584eda,udid=278b4f23be6ca211e78003000040f2e95d28b49d00052918a1578db40261a7cd99
18:21 <efried> name=boot_pvm11_tempest_Del_96584eda,udid=278b4f23be6ca211e78003000040f2e95d7d974ba05a0214f53da7e1e553c8ffb3
18:21 <efried> Rare, but (clearly) not impossible.
18:21 <esberglu> efried: What if we had a cron job that recorded the marker lus. And if the same one marker LU is around for too long assume it's no good and delete it
18:22 <efried> that would be okay.
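
A minimal sketch of that cron job idea, assuming a two-run heuristic: any marker LU still present from the previous run is treated as stale. The state file path is made up, and the cron interval must exceed the longest legitimate upload, or this would kill live uploads (a marker legitimately re-created with the same name between runs would also be a false positive).

    #!/bin/sh
    # Reap marker LUs (names starting with "part") that survive two runs.
    STATE=/var/tmp/marker_lus.prev   # hypothetical state file
    pvmctl lu list -d name --hide-label | grep '^part' | sort > /var/tmp/marker_lus.cur
    if [ -f "$STATE" ]; then
        # comm -12: names present in both the previous and current snapshots,
        # i.e. markers at least one cron interval old
        comm -12 "$STATE" /var/tmp/marker_lus.cur | while read n; do
            pvmctl lu delete -i name="$n"
        done
    fi
    mv /var/tmp/marker_lus.cur "$STATE"
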
18:22 <esberglu> efried: Anything potentially bad that could come of the duplication?
18:23 <efried> nah, not really.  Pretty sure we use udids anywhere that matters.
18:23 <efried> I manually deleted both of those guys.
18:27 <efried> Those 100MB images are backed to one called base_os
18:27 <efried> Do we still have a base_os image that's 100MB??
18:27 <edmondsw> those dup names aren't for marker LUs, though, are they? Cause that would be bad
18:27 <edmondsw> (see, I read what you sent efried)
18:27 <efried> edmondsw No, they're for reglr images.
18:27 <edmondsw> coo
18:28 <esberglu> efried: We still use the 100mb zeros one
18:28 <edmondsw> didn't look like the example was a marker
18:28 <efried> A scintillating read, isn't it, edmondsw?
18:28 <edmondsw> I was so amped!
18:29 <edmondsw> what does a marker LU's name look like?
18:30 <efried> IIRC, it's the same as the image being uploaded, prefixed with 'part{udid[:8]}' or similar.
18:30 <efried> sorry, uuid, not udid
18:30 <efried> The uuid bit is used to break ties if multiple threads try to do the same upload at once.
18:31 <edmondsw> yep, that was in the doc
18:31 <edmondsw> the tie-breaking bit
18:31 <efried> Where, theoretically, you could get collisions.  And that could indeed be bad-ish.  But actually I think the result would just be more than one of the same image LU.
18:31 <efried> In any case, I'm willing to take that bug report.  Should be an almost-never kind of thing.
18:33 <edmondsw> yeah, I guess it would just mean 2 threads thinking they won the tiebreaker...
18:33 <efried> right.  But only if there's not a *third* thread with a lower uuid at the same time :)
18:33 <edmondsw> I was thinking it might mean no thread thought they won... so everyone would wait forever... but that seems doubtful
18:34 <edmondsw> I trust it was written better than that
18:34 <efried> edmondsw You can look at the code if you want to trace that case.  I'd be interested to know what happens there.
18:34 *** k0da has quit IRC
18:34 <edmondsw> nah, higher priorities atm
18:35 <efried> And "written better" is relative.  I doubt we spent much time on conditions like "what happens if the first 32 bits of two UUIDs collide".
18:35 <edmondsw> lol
18:35 <efried> And I know for sure that we explicitly discounted full UUID collisions.
18:35 <efried> ...as "sure, let it fail".
18:37 <efried> edmondsw https://github.com/powervm/pypowervm/blob/master/pypowervm/tasks/cluster_ssp.py#L155 in case priorities change, or you can't help yourself :)
18:37 <edmondsw> tempter...
18:39 <efried> edmondsw Yes, it looks like we'll actually upload the same image twice.
18:39 <efried> ...if the markers conflict, and they're the lowest-sorting.
18:40 <edmondsw> yep, just reached the same conclusion
18:41 <efried> Keep in mind that this image LU upload only happens once per SSP per image (which is kinda the whole point).  So that's rare to begin with.  And at that time, multiple threads - which are almost certainly on separate servers - would have to be trying to upload (presumably separate copies of) the same image at close enough to the same time that they both manage to create marker LUs.  And then, as if that wasn't rare enough, they would both have to generate UUIDs whose first 32 bits are the same.
19:11 *** chhavi has quit IRC
19:35 *** k0da has joined #openstack-powervm
19:45 *** miltonm has quit IRC
19:59 *** miltonm has joined #openstack-powervm
20:03 <edmondsw> efried can you put something about PCI Passthru on the PTG etherpad? https://etherpad.openstack.org/p/nova-ptg-queens
20:03 <efried> edmondsw What would I put there?
20:03 <edmondsw> think it's too early?
20:04 <edmondsw> I thought you might have a better sense for what to put there than I do yet
20:04 <efried> edmondsw Well, looking at the way this is just a dump right now, I wouldn't feel terrible about it.
20:04 <edmondsw> we can obviously edit as the PTG nears
20:05 <efried> Wonder if I should put it under Scheduler/Placement.
20:06 <efried> edmondsw Okay, done.  See L38-40.
20:06 <efried> yah
20:06 <edmondsw> tx
20:07 <efried> not yet sure if thorst is going
20:36 *** apearson has quit IRC
20:37 *** apearson has joined #openstack-powervm
20:37 *** apearson has quit IRC
20:40 *** smatzek has quit IRC
20:50 *** thorst has quit IRC
21:00 *** svenkat has quit IRC
21:05 *** esberglu has quit IRC
21:29 *** esberglu has joined #openstack-powervm
21:32 *** thorst has joined #openstack-powervm
21:37 *** thorst has quit IRC
21:54 *** kylek3h has quit IRC
21:55 *** kylek3h has joined #openstack-powervm
21:58 *** edmondsw has quit IRC
21:59 *** kylek3h has quit IRC
22:25 *** thorst has joined #openstack-powervm
22:26 *** thorst has quit IRC
22:34 *** esberglu has quit IRC
22:43 *** k0da has quit IRC
23:06 *** edmondsw has joined #openstack-powervm
23:11 *** edmondsw has quit IRC
23:22 *** tjakobs_ has quit IRC
23:23 *** esberglu has joined #openstack-powervm
23:27 *** thorst has joined #openstack-powervm
23:28 *** esberglu has quit IRC
23:32 *** thorst has quit IRC
23:35 *** efried is now known as efried_zzz
23:50 *** thorst has joined #openstack-powervm
23:50 *** thorst has quit IRC
