*** TxGirlGeek has quit IRC | 00:04 | |
*** TxGirlGeek has joined #starlingx | 00:07 | |
*** mpeters-wrs has joined #starlingx | 00:27 | |
*** byang has joined #starlingx | 00:48 | |
*** TxGirlGeek has quit IRC | 01:28 | |
*** sgw has quit IRC | 01:50 | |
*** mpeters-wrs has quit IRC | 02:05 | |
*** wangyi4 has joined #starlingx | 02:06 | |
*** sgw has joined #starlingx | 02:54 | |
*** mpeters-wrs has joined #starlingx | 03:09 | |
*** mpeters-wrs has quit IRC | 03:43 | |
*** mpeters-wrs has joined #starlingx | 04:02 | |
*** wangyi41 has joined #starlingx | 04:16 | |
*** cyan_ has joined #starlingx | 04:25 | |
*** rchurch_ has quit IRC | 04:34 | |
*** rchurch has joined #starlingx | 04:35 | |
*** mpeters-wrs has quit IRC | 04:36 | |
*** TxGirlGeek has joined #starlingx | 04:52 | |
*** mpeters-wrs has joined #starlingx | 05:00 | |
*** mpeters-wrs has quit IRC | 05:05 | |
*** TxGirlGeek has quit IRC | 07:05 | |
*** anran has joined #starlingx | 07:54 | |
* wangyi41 sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/QyderTfmvjBlanwZaLRCIlPG > | 08:09 | |
*** wangyi4 has quit IRC | 08:10 | |
wangyi41 | This task force includes @anran, @yan_chen and me. More people are welcome to join in. | 08:13
*** anran has quit IRC | 08:55 | |
*** sgw has quit IRC | 09:45 | |
*** byang has quit IRC | 12:09 | |
*** mpeters-wrs has joined #starlingx | 12:14 | |
*** ijolliffe has quit IRC | 12:52 | |
*** ijolliffe has joined #starlingx | 13:18 | |
*** mpeters-wrs has quit IRC | 13:50 | |
*** sgw has joined #starlingx | 14:07 | |
*** mpeters-wrs has joined #starlingx | 14:07 | |
sgw | Morning all | 14:08 |
*** mpeters-wrs has quit IRC | 14:09 | |
*** mpeters has joined #starlingx | 14:09 | |
*** billzvonar has joined #starlingx | 14:41 | |
sgw | slittle1: Morning | 14:45 |
ijolliffe | morning - thanks wangyi41 and team - I see 4 reviews posted for the hack-a-thon | 14:46
*** billzvonar has quit IRC | 14:46 | |
* sgw back in 90 or so | 15:02 | |
slittle1 | CENGN had trouble setting up the build container via stx-tools/Dockerfile ... | 15:07
slittle1 | RUN pip install python-subunit junitxml --upgrade && \ | 15:08 |
slittle1 | pip install tox --upgrade | 15:08 |
slittle1 | failed .... anyone else observing this? | 15:08 |
dpenney_ | A new version of more-itertools was just released a couple days ago, so maybe we constrain it to the older version for now: https://pypi.org/project/more-itertools/#history | 15:09 |
dpenney_ | maybe python3-specific code was added and it can't build for python2.7? | 15:11 |
dpenney_ | looks like the previously successful build was 20200111T023000Z, which used more-itertools 8.0.2 | 15:20 |
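A minimal sketch of the pin being discussed, assuming it goes into the same RUN step of stx-tools/Dockerfile and reuses the version from the last successful build noted above; depending on the pip version's upgrade strategy a constraints file may be needed instead, and the actual fix is in the review linked below:

    # Pre-pin more-itertools to the last known-good version before the upgrades
    RUN pip install more-itertools==8.0.2 && \
        pip install python-subunit junitxml --upgrade && \
        pip install tox --upgrade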
*** TxGirlGeek has joined #starlingx | 16:02 | |
dpenney_ | I've posted a review to resolve the build failure: https://review.opendev.org/702471 | 16:27 |
*** abailey has quit IRC | 16:29 | |
*** abailey has joined #starlingx | 16:30 | |
slittle1 | looks good | 16:30 |
*** jrichard has quit IRC | 16:35 | |
sgw | slittle1: you around? I am working on the layer build testing and having some issues | 16:54 |
*** mpeters has quit IRC | 16:56 | |
slittle1 | what are you seeing? | 17:20 |
*** mpeters-wrs has joined #starlingx | 17:20 | |
sgw | slittle1: First I tried a basic build of the compiler layer, downloaded and built the pkgs OK, then I switched to flock | 17:27
slittle1 | sgw: yes .... | 17:28 |
sgw | Now, I know I am not perfect so I started with download_mirror, but had forgotten to reset the repo/manifest | 17:28 |
sgw | There was an issue with the next steps of generate-cgcs-centos-repo and populate-downloads, so there needs to be some kind of error checking to verify the layers are now set up properly. | 17:29
slittle1 | to be clear .... are you using the same workspace as the former compiler layer build ? | 17:29 |
slittle1 | I've been using separate workspaces for each layer | 17:30 |
sgw | Yes, using the same workspace | 17:31 |
sgw | Also, I think the initial dir used to find the centos-mirror-tools/config should be relative to MY_REPO, not to the location of the scripts | 17:31
sgw | that way the scripts get copied to /usr/local/bin (in the Dockerfile) and everything else is relative to MY_REPO | 17:32 |
slittle1 | k | 17:32 |
slittle1 | stepping out for a few mins to heat lunch | 17:33 |
sgw | Ok, ping when back | 17:34 |
slittle1 | sgw: back | 17:41 |
sgw | that was a fast lunch, did you even have a chance to chew ;-) | 17:41 |
dpenney_ | the cengn build I kicked off got further, but looks like it fails on downloading the new kata container RPMs | 17:43 |
dpenney_ | http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20200114T172515Z/logs/ | 17:43 |
slittle1 | just heated.... still eating | 17:51 |
sgw | Got it. | 17:52 |
sgw | dpenney_: did Bart talk with you about the PBR stuff? The PRC folks are still doing the performance check | 17:53 |
bwensley | I did talk with him. | 17:54 |
bwensley | He seems OK with it, but I'll let him answer as well. :) | 17:54 |
slittle1 | meeting | 18:01 |
*** jrichard has joined #starlingx | 18:22 | |
*** mpeters-wrs has quit IRC | 18:52 | |
*** mpeters-wrs has joined #starlingx | 18:54 | |
*** mpeters-wrs has quit IRC | 18:58 | |
*** mpeters-wrs has joined #starlingx | 18:58 | |
dpenney_ | +1 :) | 19:21 |
slittle1 | I think we'll need to revert stx-tools:431885231ae41256188a7c32f0f5351c4455707b to fix the CENGN build. | 19:25 |
dpenney_ | that would require reverting maybe 8 commits, vs updating the versions of the rpms in question | 19:26 |
slittle1 | Looks like the kata repo was updated Dec 10, and the update that just merged was never updated to track the upstream change | 19:26
dpenney_ | they're all binary rpms, so we should be able to just update the LST file with the new versions, which I'm looking at now | 19:27 |
slittle1 | ok, I'll buy that | 19:27 |
dpenney_ | doing a test download now | 19:30 |
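For illustration only, the .lst update being described would look roughly like this, assuming the kata packages are listed by RPM filename in one of the centos-mirror-tools list files (the file name and version numbers here are hypothetical; the real change is in the review posted below):

    # centos-mirror-tools/rpms_centos3rdparties.lst (hypothetical entries)
    -kata-runtime-1.9.1-8.1.x86_64.rpm
    +kata-runtime-1.9.3-7.1.x86_64.rpm
    -kata-proxy-1.9.1-7.1.x86_64.rpm
    +kata-proxy-1.9.3-6.1.x86_64.rpm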
sgw | dpenney_: that +1 was to the PBR stuff? | 19:32 |
sgw | slittle1: I guess we assumed the kata folks had the right versions, clearly they were working with cached RPMs | 19:32
dpenney_ | yeah, I'm not concerned over the versioning impacting the update mechanism, as the versions would always be incrementing (or PBR and semver would be fundamentally broken) | 19:33 |
dpenney_ | I'll post a review shortly for the kata fix | 19:33 |
sgw | slittle1: so your working assumption for the layering is that each layer is a different workspace? Are you testing mirror download and generation based on that, or are you using the corporate mirror for your default /import location? In other words, is your mirror always fully populated? | 19:37
slittle1 | Trying to test both | 19:38 |
dpenney_ | review is posted: https://review.opendev.org/702506 | 19:38 |
sgw | I can start testing with the assumption of 1 workspace per layer, but will start with an empty mirror. (I actually pass the "output" directory for my mirror to avoid an extra copy) | 19:38
dpenney_ | once https://review.opendev.org/702506 is merged, I'll kick off a new CENGN build again | 19:42 |
sgw | dpenney_: so all you tested was that they exist and download, do we know if the functionality will change? | 20:00 |
dpenney_ | yeah, all I verified was that they could be downloaded to allow the build to proceed. Otherwise, we can revert all the kata updates and have them rebase | 20:01 |
sgw | dpenney_: back to the nvme follow-up: if I use the command you suggested, will that also address the storage config issues? I am not local to the machines right now; if I lock and delete the existing storage nodes, then I can't reboot them, so is there a suggested process? | 20:02
dpenney_ | I would expect storage config should be fine. Configure the rootfs/boot device, the node should install and discover resources, and populate the system database appropriately | 20:06 |
dpenney_ | and I think deleting the node triggers a wipedisk and reboot, but I could be mistaken | 20:07 |
sgw | dpenney_: system host-lock 4 | 20:16 |
sgw | Cannot lock a storage node when ceph pools are not empty and replication is lost. This may result in data loss. | 20:16 |
sgw | Is there a way to force the lock | 20:16 |
dpenney_ | rchurch, do you know the answer to sgw's question? | 20:20 |
rchurch | You should be able to force lock the storage host. If you don't care about the data in the cluster you can delete the pools. That will restore HEALTH_OK and you can then lock normally | 20:23 |
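A sketch of what that cleanup can look like from a controller shell, assuming the data really is disposable; the pool name is only an example, and deletion may also require mon_allow_pool_delete to be enabled:

    ceph health
    ceph osd pool ls
    # ceph requires the pool name twice plus the confirmation flag
    ceph osd pool delete kube-rbd kube-rbd --yes-i-really-really-mean-it
    # once the pools are cleared and health recovers, the normal lock should go through
    system host-lock 4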
dpenney_ | per Frank's email reply, I'll abandon my kata update and we can trigger a revert of the 8 kata updates | 20:24 |
sgw | rchurch: force lock? I don't see that in the help output | 20:26
rchurch | sgw: system host-lock [-f] <hostname or id> | 20:28
sgw | rchurch: where is that actually documented, it's not part of the system host-lock --help option | 20:31 |
dpenney_ | system help host-lock | 20:32 |
rchurch | Yep. That's what I did | 20:32 |
sgw | Ah, you guys know your tools; many tools use --help on the sub option (think git, although it works both ways) | 20:34
sgw | dpenney_ rchurch: so when I set up for personality=storage, do I have to define the storage disk also as nvme vs sd? | 20:35 |
sgw | the command that dpenney_ sent in email was just for rootfs and boot device | 20:36 |
dpenney_ | if your disks are nvme, yes | 20:36 |
dpenney_ | the disk configuration would be a post install step | 20:36 |
dpenney_ | for storage OSDs | 20:36 |
dpenney_ | the disks would get discovered and put in sysinv | 20:36 |
sgw | dpenney_: that's defined in the existing documentation when doing a standard install, I will try that again, I did change those OSD commands | 20:37
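For reference, a rough sketch of that flow on an NVMe-only node; the host id and device paths below are placeholders, not the exact command from the email:

    # point rootfs/boot at the nvme device before installing the node
    system host-update 4 personality=storage rootfs_device=/dev/nvme0n1 boot_device=/dev/nvme0n1
    # after install, the discovered disks should appear in sysinv
    system host-disk-list storage-0
    # post-install step: assign a discovered disk as an OSD
    system host-stor-add storage-0 <disk-uuid>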
*** rchurch has quit IRC | 20:39 | |
dpenney_ | Kata update reversions: https://review.opendev.org/#/q/status:open+projects:starlingx+branch:master+topic:kata | 20:40 |
*** rchurch has joined #starlingx | 20:40 | |
bwensley | Still need a core from tools and root. | 20:46 |
dpenney_ | sgw: can you review the two reversions for tools and root? | 20:46 |
*** mpeters-wrs has quit IRC | 20:47 | |
*** mpeters-wrs has joined #starlingx | 20:48 | |
sgw | dpenney_: sorry, stepped away to zap my lunch | 20:49
*** mpeters-wrs has quit IRC | 20:49 | |
sgw | looking | 20:49 |
*** mpeters-wrs has joined #starlingx | 20:49 | |
sgw | dpenney_: you're missing signed-off-by, but I will let it go this time | 20:50
dpenney_ | I just used the "revert" button in gerrit :) | 20:52 |
dpenney_ | didn't even notice the lack of signed-off-by ;) | 20:53 |
sgw | no worries | 20:55 |
sgw | guys, would there be any reason that host-disk-list would not list a second drive (short of it not existing)? The NUCs I ordered were all supposed to be the same with 2 disks. Is there a way to get into a compute or storage node via ssh before it's unlocked? | 21:37
dpenney_ | is it in a RAID config, maybe? | 21:43 |
dpenney_ | once it installs, you should be able to ssh in as sysadmin, using the original password - which will prompt for an immediate change | 21:43 |
*** mpeters-wrs has quit IRC | 21:46 | |
abailey | hackathon review for test_pvs uploaded: https://review.opendev.org/#/c/702537/ | 21:49 |
rchurch | The sysinv agent might not recognize the disk if the disk's major device number is not in the supported list. Check out VALID_MAJOR_LIST in sysinv/common/constants.py | 21:50 |
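A quick way to see the major numbers being referred to, assuming shell access on the node (NVMe block devices normally sit under major 259):

    # the MAJ:MIN column shows the major device number sysinv checks against VALID_MAJOR_LIST
    lsblk -d -o NAME,MAJ:MIN,SIZE,TYPE
    ls -l /dev/nvme*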
abailey | which failed zuul :( | 21:50 |
dpenney_ | kata reversions have merged and I've kicked a new cengn build | 21:51 |
sgw | dpenney_: Ah, I thought all nodes got the new admin password set up for the controllers; that worked, and apparently I got shorted on disks! I need to double-check the hardware when I am back to where they are | 21:53
sgw | dpenney_: 3 have 2 disks and 3 apparently only have 1 :-( | 21:54 |
dpenney_ | they do get the new password, but only after the first unlocking, when the puppet apply happens | 21:54 |
sgw | Ah, and those are locked because they are not fully provisioned yet, so will that screw with unlocking? | 21:55 |
dpenney_ | well that's unfortunate... I've seen cases where RAID config made two disks look like a single disk, I think | 21:55 |
dpenney_ | nope, you changing the password won't cause a problem | 21:55 |
sgw | no, I don't think it RAIDs the disks; like I said, I will double check | 21:55
* sgw BTW, to all, thanks for being here on IRC and helping out, this was way faster for me to diagnose and understand my issue | 21:56 | |
sgw | abailey: pylint gotcha! | 21:57 |
sgw | dpenney_: just to confirm, if I try to unlock only one storage node, it will fail, as it needs to talk to storage-1 before it will configure and provision properly, correct? | 21:59
rchurch | The unlock should succeed. The cluster will not be healthy until the other storage host is unlocked and any required data replication is done. | 22:01 |
sgw | Yes, it unlocked and is showing operational/disabled and availability/failed | 22:05 |
sgw | fm alarm-list shows that storage-0 experienced a configuration failure | 22:07 |
sgw | and a "service affecting failre" | 22:08 |
sgw | failure | 22:08 |
rchurch | Lock the storage host. Check the puppet logs on the storage host: sudo less -R /var/log/puppet/latest/puppet.log | 22:10 |
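If the puppet log is long, a generic filter can help surface the failure, e.g.:

    sudo grep -iE 'error|fail' /var/log/puppet/latest/puppet.log | less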
*** ijolliffe has quit IRC | 22:22 | |
sgw | More strangeness ensues: it decided to reboot itself as I was looking at the logs. It looks like it had /dev/disk/by-path/ nodes for a second nvme, but no /dev/nvme1n1 entry | 22:29
sgw | This might have to wait until Friday when I am back next to the hardware | 22:29 |
rchurch | FWIW, I've provisioned storage hosts with NVMe-only disks in the past, so I wouldn't expect a problem. Doesn't mean there isn't one, but it worked at one time. | 22:33
*** abailey has quit IRC | 22:37 | |
sgw | rchurch: thanks for that, helpful to know | 22:41 |
*** jrichard has quit IRC | 22:58 | |
*** jrichard has joined #starlingx | 23:08 | |
jrichard | hackathon: updated review for network api tests is up. https://review.opendev.org/702300 | 23:28 |
*** mpeters-wrs has joined #starlingx | 23:31 | |
*** mpeters-wrs has quit IRC | 23:48 |