*** pabelanger has joined #openstack-infra | 00:00 | |
pabelanger | fungi: mnaser: thinking out loud about a 20-patch stack series for nova, and have idea how it would look, but if zuul knows the 20 patches are submitted together, maybe the relative priority also does. So, if there is the 20-patch nova stack, and a 2nd and 3rd nova patch behind it, then nodepool allocates nodes to the 2nd and 3rd nova changes before the 4th patch in the nova 20 stack... if that makes sense | 00:06 |
clarkb | http://logs.openstack.org/12/615612/3/gate/neutron-grenade/bcd2c51/logs/grenade.sh.txt.gz#_2018-12-04_23_01_10_429 the grenade job failed on ovs/q-agt timing out | 00:06 |
pabelanger | s/idea/no idea/ | 00:06 |
pabelanger | that way, other users in nova also get feedback outside the mass patch series | 00:06 |
openstackgerrit | Clark Boylan proposed openstack-infra/opendev-website master: Add some initial content thoughtso https://review.openstack.org/622624 | 00:13 |
clarkb | that is super rough | 00:13 |
clarkb | and I intend to take faq/q&a content from the email we sent and incorporate it in | 00:13 |
clarkb | but figured getting a draft or even an outline going would probably help get the ball rolling | 00:14 |
*** kjackal has quit IRC | 00:14 | |
clarkb | I'm going to have to pop out soonish as a contractor is coming over to tell me how much a new wall costs but will search out any feedback when I am able | 00:14 |
clarkb | interesting, the tempest change failed on the same thing as grenade | 00:16 |
clarkb | mnaser: ^ we likely have a systemic bug there since the change was from glance and not neutron | 00:16 |
clarkb | hrm looks like both happened on bhs1 and the devstack run took about an hour before it failed | 00:18 |
clarkb | fungi: amorin ^ maybe there is some other issue affecting bhs1? | 00:18 |
*** jcoufal has joined #openstack-infra | 00:30 | |
*** gyee has quit IRC | 00:33 | |
clarkb | looking at dstat for the job there is a spike in writes to ~46MBps | 00:39 |
clarkb | but its pretty consistently closer to 1MBps for the job run | 00:39 |
clarkb | there is also a period of almost persistent 10% cpu wai which could be related (they do overlap somewhat) | 00:39 |
clarkb | possible we are still our own noisy neighbor here? | 00:40 |
clarkb | looking at the devstack log that roughly correlates with when neutron mysql migrations were run | 00:42 |
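For reference, the dstat data being read here is collected by the job with an invocation roughly like the following (a sketch; the exact flags and output path used by devstack may differ):

```sh
# Sample cpu, memory, net, disk, io, system, load, process and paging stats once
# per second with timestamps, and also write them out as CSV for later analysis.
dstat -tcmndrylpg --output /opt/stack/logs/dstat-csv.log 1
```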
clarkb | mordred: perhaps this is a silly idea, but can we run mysql in an unsafe mode where it is eatmydata-y? | 00:42 |
clarkb | if we are our own noisy neighbor that sort of thing may help | 00:43 |
clarkb | The other thing is whether or not kvm is waiting for these writes to succeed before completing them in the VM | 00:43 |
clarkb | we turned that off in infracloud to get more io throughput | 00:43 |
clarkb | amorin: ^ that last question is likely a question for you | 00:43 |
*** wolverineav has quit IRC | 00:45 | |
clarkb | ya then the second 10% ish cpu wai block lines up with nova db migrations | 00:46 |
pabelanger | clarkb: yah, I might end up trying eatmydata for some DIB testing I am doing, I'm seeing a large amount of failures now with ovh | 00:47 |
*** wolverineav has joined #openstack-infra | 00:48 | |
*** rlandy has quit IRC | 00:59 | |
*** jcoufal has quit IRC | 01:00 | |
*** sthussey has quit IRC | 01:01 | |
clarkb | I've launched clarkb-test nodes in vexxhost sjc1 and ovh bhs1 (and working on gra1). So far I've run sysbench on bhs1 and sjc1 and they look really similar | 01:05 |
clarkb | both about 40MB/sec and 2600 requests per second | 01:06 |
clarkb | makes me wonder if it's a bad hypervisor or a misconfigured hypervisor | 01:06 |
clarkb | so some subset of the jobs hit it there | 01:06 |
clarkb | ansible facts don't seem to capture if we are emulated or virt | 01:08 |
clarkb | but the machines I've logged into are definitely kvm according to systemd-detect-virt | 01:08 |
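A minimal way to confirm that from inside a guest (the checks clarkb is referring to; a sketch):

```sh
# Prints "kvm" for hardware-accelerated guests and "qemu" for plain TCG emulation.
systemd-detect-virt
# The hypervisor cpu flag is another quick signal that we are in a guest at all.
grep -c hypervisor /proc/cpuinfo
```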
clarkb | amorin: fungi: d73a3a4c-b31c-41a7-b8a3-226cdae0d558 and 7bb83b73-e279-4c85-af31-f7870e2714fc are two bhs1 instances that exhibited the weird slowness. Maybe we can zero in on the hypervisor(s) that ran those and see if there is something wrong there? virt not enabled (so using qemu emulation), slow disk, unhappy disk, etc | 01:11 |
clarkb | ba25071f-edaf-417d-8564-fab75676116e is my test instance that seems to be fine in that region if you need to compare (or check if it is running on the same hypervisor) | 01:12 |
*** jamesmcarthur has joined #openstack-infra | 01:13 | |
clarkb | gra1 actually has much slower random reads and writes to disk according to sysbench. About 1/10 the others | 01:15 |
clarkb | little more than that, but not much | 01:15 |
clarkb | but we don't see the issues in gra1 right? | 01:15 |
clarkb | and now I must go sort out dinner, fungi ^ you may want to try and duplicate my results. Feel free to use the clarkb-test instances in those regions if you do so | 01:16 |
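A hedged sketch of the kind of sysbench comparison described above (assuming the fileio test in random read/write mode; the exact invocation isn't shown in the log):

```sh
# Older sysbench releases want "--test=fileio" instead of the bare test name.
sysbench fileio --file-total-size=8G prepare
sysbench fileio --file-total-size=8G --file-test-mode=rndrw run
sysbench fileio --file-total-size=8G cleanup
```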
*** wolverineav has quit IRC | 01:24 | |
fungi | yeah, at least mining logstash for job timeouts i was seeing disproportionally far more in bhs1 than graphene | 01:24 |
fungi | er, than gra1 | 01:25 |
fungi | sorry graphene, my tab key is next to my 1 | 01:25 |
*** wolverineav has joined #openstack-infra | 01:25 | |
pabelanger | yes, I can confirm, IO heavy jobs on ovh-bhs1 are timing out here | 01:26 |
*** wolverineav has quit IRC | 01:30 | |
openstackgerrit | Merged openstack/os-testr master: Update the home-page URL https://review.openstack.org/622427 | 01:31 |
*** wolverineav has joined #openstack-infra | 01:31 | |
*** studarus has joined #openstack-infra | 01:34 | |
*** bobh has joined #openstack-infra | 01:35 | |
*** hongbin has joined #openstack-infra | 01:36 | |
fungi | maybe we let amorin reproduce/investigate with load on it | 01:36 |
mordred | clarkb: there's some stuff that could be done like eatmydata - also some my.cnf settings to turn down data durability - although it'll ultimately only usually affect fsync - if we're saturating throughput it wouldn't help a ton | 01:38 |
mordred | but it's totally worth trying eatmydata | 01:38 |
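A rough sketch of the two approaches mordred mentions, for illustration only (the config path, settings and wrapped command are assumptions, not what devstack actually configures):

```sh
# Option 1: eatmydata turns fsync()/fdatasync() into no-ops for whatever it wraps.
sudo apt-get install -y eatmydata
eatmydata some-io-heavy-command   # hypothetical command; wrapping mysqld itself
                                  # would mean prepending it to the service's ExecStart

# Option 2: relax mysql durability instead of disabling fsync entirely.
sudo tee /etc/mysql/conf.d/low-durability.cnf <<'EOF'
[mysqld]
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
EOF
sudo systemctl restart mysql
```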
*** jamesdenton has joined #openstack-infra | 01:46 | |
*** witek has quit IRC | 02:00 | |
*** witek has joined #openstack-infra | 02:00 | |
*** jamesmcarthur has quit IRC | 02:03 | |
*** sthussey has joined #openstack-infra | 02:04 | |
*** wolverineav has quit IRC | 02:07 | |
*** dklyle has quit IRC | 02:12 | |
*** dklyle has joined #openstack-infra | 02:13 | |
*** mrsoul has joined #openstack-infra | 02:14 | |
*** wolverineav has joined #openstack-infra | 02:14 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: tests/__init__.py: ssh-keygen -m PEM for bouncycastle https://review.openstack.org/622636 | 02:18 |
clarkb | fungi pabelanger more and more I'm suspecting a specific hypervisor | 02:20 |
clarkb | since sjc1 and bhs1 are basically the same for io in limited testing | 02:20 |
*** wolverineav has quit IRC | 02:21 | |
*** eernst has joined #openstack-infra | 02:25 | |
*** bobh has quit IRC | 02:33 | |
*** studarus has quit IRC | 02:35 | |
*** larainema has joined #openstack-infra | 02:37 | |
mnaser | fyi we do throttle iops per gb | 02:39 |
mnaser | 30 iops per gb so @ 80 gb volumes => 240 iops | 02:40 |
mnaser | er | 02:40 |
mnaser | 2400. | 02:40 |
mnaser | and 0.5MB/s per GB for SSD volumes, hence the 40MB/s | 02:41 |
mnaser | (nice to know our qos works well, lol) | 02:41 |
pabelanger | clarkb: I guess there is no way to track the hypervisor from guest OS | 02:43 |
*** psachin has joined #openstack-infra | 02:44 | |
mnaser | pabelanger, clarkb: https://review.openstack.org/#/c/577933/ there is and i tried :D | 02:50 |
mnaser | but you can check nova show and look at hostId from the API | 02:50 |
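For reference, reading that field via the CLI is a one-liner (a sketch; needs credentials for the tenant that owns the instance):

```sh
# hostId is an opaque per-tenant hash of the hypervisor, enough to tell whether
# two instances landed on the same host without revealing the host's name.
openstack server show <instance-uuid> -f value -c hostId
```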
*** imacdonn has quit IRC | 02:52 | |
*** imacdonn has joined #openstack-infra | 02:53 | |
*** jd_ has quit IRC | 02:54 | |
*** bhavikdbavishi has joined #openstack-infra | 02:56 | |
*** bhavikdbavishi has quit IRC | 03:01 | |
*** jd_ has joined #openstack-infra | 03:02 | |
*** hongbin has quit IRC | 03:07 | |
*** auristor has quit IRC | 03:08 | |
*** eernst has quit IRC | 03:09 | |
*** eernst has joined #openstack-infra | 03:09 | |
clarkb | ya unfortunately the nodes are gone by the time we can look | 03:11 |
clarkb | so either we log that with nodepool or ask the cloud | 03:12 |
openstackgerrit | melissaml proposed openstack/os-testr master: Change openstack-dev to openstack-discuss https://review.openstack.org/622698 | 03:14 |
*** apetrich has quit IRC | 03:15 | |
*** bobh has joined #openstack-infra | 03:18 | |
*** bhavikdbavishi has joined #openstack-infra | 03:18 | |
*** ykarel|away has joined #openstack-infra | 03:21 | |
*** graphene has quit IRC | 03:23 | |
*** ramishra has joined #openstack-infra | 03:23 | |
*** agopi has joined #openstack-infra | 03:23 | |
pabelanger | mnaser: it seems 577933 might just need documentation updates at this point, but not sure what other reviewers say | 03:32 |
*** auristor has joined #openstack-infra | 03:32 | |
pabelanger | agree it would be helpful from job POV to collect that info | 03:32 |
*** bobh has quit IRC | 03:36 | |
*** yamamoto has joined #openstack-infra | 03:38 | |
*** yamamoto has quit IRC | 03:42 | |
*** bobh has joined #openstack-infra | 03:45 | |
*** jamesmcarthur has joined #openstack-infra | 04:03 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 04:04 |
*** bobh has quit IRC | 04:04 | |
*** wolverineav has joined #openstack-infra | 04:08 | |
*** jamesmcarthur has quit IRC | 04:08 | |
*** wolverineav has quit IRC | 04:12 | |
*** wolverineav has joined #openstack-infra | 04:19 | |
*** hongbin has joined #openstack-infra | 04:24 | |
*** eernst has quit IRC | 04:26 | |
*** yamamoto has joined #openstack-infra | 04:32 | |
*** wolverineav has quit IRC | 04:37 | |
*** janki has joined #openstack-infra | 04:40 | |
*** sthussey has quit IRC | 04:51 | |
*** mordred has quit IRC | 04:57 | |
*** mordred has joined #openstack-infra | 04:57 | |
*** hongbin has quit IRC | 05:19 | |
*** yamamoto has quit IRC | 05:34 | |
openstackgerrit | Vieri proposed openstack/gertty master: Change openstack-dev to openstack-discuss https://review.openstack.org/622850 | 05:37 |
*** ahosam has joined #openstack-infra | 05:37 | |
*** stevebaker has quit IRC | 05:45 | |
*** dmellado has quit IRC | 05:46 | |
*** gouthamr has quit IRC | 05:46 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add cleanup routine to delete empty nodes https://review.openstack.org/622616 | 05:47 |
*** dmellado has joined #openstack-infra | 05:48 | |
*** diablo_rojo has quit IRC | 05:51 | |
*** stevebaker has joined #openstack-infra | 05:51 | |
*** gouthamr has joined #openstack-infra | 05:53 | |
*** yamamoto has joined #openstack-infra | 05:54 | |
*** dmellado has quit IRC | 05:55 | |
*** wolverineav has joined #openstack-infra | 05:56 | |
*** dmellado has joined #openstack-infra | 05:57 | |
*** stevebaker has quit IRC | 05:59 | |
*** wolverineav has quit IRC | 06:00 | |
*** dmellado has quit IRC | 06:02 | |
*** stevebaker has joined #openstack-infra | 06:06 | |
*** dmellado has joined #openstack-infra | 06:14 | |
*** gouthamr has quit IRC | 06:15 | |
*** stevebaker has quit IRC | 06:15 | |
*** gouthamr has joined #openstack-infra | 06:18 | |
*** stevebaker has joined #openstack-infra | 06:19 | |
*** ykarel|away has quit IRC | 06:21 | |
*** diablo_rojo has joined #openstack-infra | 06:24 | |
*** stevebaker has quit IRC | 06:25 | |
*** gouthamr has quit IRC | 06:29 | |
*** stevebaker has joined #openstack-infra | 06:29 | |
*** gouthamr has joined #openstack-infra | 06:32 | |
*** stevebaker has quit IRC | 06:41 | |
*** stevebaker has joined #openstack-infra | 06:42 | |
*** ykarel|away has joined #openstack-infra | 06:45 | |
*** stevebaker has quit IRC | 06:51 | |
*** ykarel|away is now known as ykarel | 06:54 | |
*** stevebaker has joined #openstack-infra | 06:55 | |
*** kjackal has joined #openstack-infra | 06:57 | |
*** gouthamr has quit IRC | 06:58 | |
*** ralonsoh has joined #openstack-infra | 06:58 | |
*** stevebaker has quit IRC | 07:03 | |
*** stevebaker has joined #openstack-infra | 07:06 | |
*** gouthamr has joined #openstack-infra | 07:07 | |
*** bhavikdbavishi has quit IRC | 07:09 | |
*** bhavikdbavishi1 has joined #openstack-infra | 07:09 | |
*** pcaruana has joined #openstack-infra | 07:10 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 07:12 | |
*** ahosam has quit IRC | 07:15 | |
*** stevebaker has quit IRC | 07:15 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 07:17 |
*** quiquell|off is now known as quiquell | 07:17 | |
*** stevebaker has joined #openstack-infra | 07:19 | |
*** takamatsu has joined #openstack-infra | 07:21 | |
*** stevebaker has quit IRC | 07:27 | |
*** florianf has joined #openstack-infra | 07:27 | |
*** stevebaker has joined #openstack-infra | 07:30 | |
*** yboaron has joined #openstack-infra | 07:33 | |
*** stevebaker has quit IRC | 07:37 | |
*** stevebaker has joined #openstack-infra | 07:40 | |
*** dpawlik has joined #openstack-infra | 07:40 | |
*** gouthamr has quit IRC | 07:40 | |
*** apetrich has joined #openstack-infra | 07:42 | |
*** gouthamr has joined #openstack-infra | 07:42 | |
*** kjackal has quit IRC | 07:45 | |
*** kjackal has joined #openstack-infra | 07:45 | |
*** stevebaker has quit IRC | 07:46 | |
*** slaweq has joined #openstack-infra | 07:48 | |
*** quiquell is now known as quiquell|brb | 07:48 | |
*** stevebaker has joined #openstack-infra | 07:50 | |
amorin | mordred: fungi clarkb I am on the host that was hosting d73a3a4c-b31c-41a7-b8a3-226cdae0d558 and 7bb83b73-e279-4c85-af31-f7870e2714fc | 07:54 |
amorin | it seems to be slower than others | 07:54 |
amorin | dd zero on host itself is slower than on another | 07:54 |
amorin | but there are some instances on it | 07:54 |
amorin | I will disable it to test without any load | 07:54 |
*** stevebaker has quit IRC | 07:55 | |
*** stevebaker has joined #openstack-infra | 08:00 | |
*** ykarel is now known as ykarel|lunch | 08:00 | |
openstackgerrit | Arnaud Morin proposed openstack-infra/project-config master: Reduce a little number of instances on BHS1 https://review.openstack.org/622876 | 08:02 |
*** ahosam has joined #openstack-infra | 08:02 | |
*** ginopc has joined #openstack-infra | 08:03 | |
*** e0ne has joined #openstack-infra | 08:05 | |
*** stevebaker has quit IRC | 08:05 | |
AJaeger | amorin: do you need this quickly? ^ | 08:07 |
AJaeger | infra-root, ^ | 08:08 |
*** stevebaker has joined #openstack-infra | 08:09 | |
amorin | AJaeger: nop | 08:11 |
amorin | it can wait until the afternoon | 08:11 |
amorin | (I am in europe tz) | 08:11 |
amorin | I know that most of the guys are in US | 08:11 |
amorin | it can wait | 08:11 |
frickler | amorin: just an idea: are you using deadline of cfq scheduler on the hypervisors? we found that cfq can cause issues in an iops throttled environment and that effect increased a lot from 4.13 to 4.15 kernel | 08:14 |
frickler | s/of/or/ | 08:14 |
amorin | I have no idea, but I can check | 08:14 |
*** stevebaker has quit IRC | 08:17 | |
*** florianf has quit IRC | 08:19 | |
*** priteau has joined #openstack-infra | 08:21 | |
*** stevebaker has joined #openstack-infra | 08:21 | |
amorin | frickler: we are using cfq on hypervisors | 08:21 |
*** quiquell|brb is now known as quiquell | 08:22 | |
frickler | amorin: o.k., so for our setup, we resolved the issue by changing to deadline. | 08:25 |
amorin | I was told that for SSD disks, it's better to use noop instead | 08:25 |
amorin | but I'm not an expert on that part | 08:26 |
amorin | I'll see if I can apply that on the whole aggregate for the OSF | 08:26 |
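For context, checking and switching the scheduler on a given block device is straightforward (a sketch; the change is not persistent across reboots unless also set in the host's boot or udev configuration):

```sh
# The bracketed entry is the active scheduler.
cat /sys/block/sda/queue/scheduler
echo deadline | sudo tee /sys/block/sda/queue/scheduler
```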
frickler | amorin: thx, I'll be offline for a bit, will check back later. your node reduction should merge any moment | 08:28 |
openstackgerrit | Merged openstack-infra/project-config master: Reduce a little number of instances on BHS1 https://review.openstack.org/622876 | 08:30 |
*** stevebaker has quit IRC | 08:30 | |
*** stevebaker has joined #openstack-infra | 08:31 | |
*** gfidente has joined #openstack-infra | 08:31 | |
*** hrubi has joined #openstack-infra | 08:33 | |
*** bhavikdbavishi has quit IRC | 08:33 | |
amorin | ok | 08:33 |
*** jpena|off is now known as jpena | 08:39 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 08:44 |
*** florianf_ has joined #openstack-infra | 08:44 | |
*** tosky has joined #openstack-infra | 08:50 | |
*** fresta has quit IRC | 08:50 | |
*** fresta has joined #openstack-infra | 08:51 | |
openstackgerrit | Natal Ngétal proposed openstack/diskimage-builder master: [Core] Change openstack-dev to openstack-discuss. https://review.openstack.org/622895 | 08:52 |
*** aojea has joined #openstack-infra | 08:54 | |
*** ykarel|lunch is now known as ykarel | 09:00 | |
*** shardy has joined #openstack-infra | 09:00 | |
*** verdurin has quit IRC | 09:04 | |
*** dpawlik has quit IRC | 09:04 | |
*** jpich has joined #openstack-infra | 09:05 | |
*** wolverineav has joined #openstack-infra | 09:05 | |
*** dpawlik has joined #openstack-infra | 09:05 | |
*** bhavikdbavishi has joined #openstack-infra | 09:06 | |
*** bhavikdbavishi has quit IRC | 09:07 | |
*** bhavikdbavishi has joined #openstack-infra | 09:07 | |
*** verdurin has joined #openstack-infra | 09:07 | |
*** wolverineav has quit IRC | 09:09 | |
*** ccamacho has joined #openstack-infra | 09:20 | |
*** zhangfei has joined #openstack-infra | 09:21 | |
*** xek has joined #openstack-infra | 09:21 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:30 |
openstackgerrit | Tobias Henkel proposed openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 09:32 |
*** zhangfei has quit IRC | 09:34 | |
*** rcernin has quit IRC | 09:38 | |
*** agopi is now known as agopi|brb | 09:40 | |
*** bhavikdbavishi has quit IRC | 09:40 | |
chandan_kumar | hello | 09:44 |
chandan_kumar | Is there a way to set basepython=python3 when a zuul job's run or post_run is called to run an ansible playbook? | 09:45 |
*** larainema has quit IRC | 09:47 | |
*** zhangfei has joined #openstack-infra | 09:47 | |
*** derekh has joined #openstack-infra | 09:49 | |
*** witek has quit IRC | 09:50 | |
*** zhangfei has quit IRC | 09:50 | |
*** zhangfei has joined #openstack-infra | 10:09 | |
*** diablo_rojo has quit IRC | 10:12 | |
amorin | AJaeger frickler, I configured the whole OSF aggregate to deadline instead of CFQ on hypervisors in BHS1, we'll see if it's giving better results on your side | 10:17 |
*** ahosam has quit IRC | 10:20 | |
*** quiquell is now known as quiquell|brb | 10:21 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: update status page layout based on screen size https://review.openstack.org/622010 | 10:22 |
*** agopi|brb is now known as agopi | 10:29 | |
sshnaidm | fungi, clarkb not sure what happened, but my email in gerrit turned out to be "unverified" and I can't submit patches | 10:42 |
*** alexchadin has joined #openstack-infra | 10:49 | |
*** dtantsur|afk is now known as dtantsur | 10:50 | |
*** quiquell|brb is now known as quiquell | 10:52 | |
stephenfin | Which repo configures the projects that the openstackgerrit IRC bot monitors? | 10:55 |
*** graphene has joined #openstack-infra | 10:57 | |
openstackgerrit | Merged openstack-infra/git-review master: Use six for cross python compatibility https://review.openstack.org/616688 | 10:57 |
*** bhavikdbavishi has joined #openstack-infra | 10:58 | |
*** bhavikdbavishi has quit IRC | 11:09 | |
*** bhavikdbavishi has joined #openstack-infra | 11:09 | |
*** xek has quit IRC | 11:12 | |
openstackgerrit | Merged openstack-infra/git-review master: Avoid UnicodeEncodeError on python 2 https://review.openstack.org/583535 | 11:15 |
*** bhavikdbavishi has quit IRC | 11:16 | |
*** bhavikdbavishi has joined #openstack-infra | 11:16 | |
*** bhavikdbavishi has quit IRC | 11:20 | |
*** zhangfei has quit IRC | 11:22 | |
*** xek has joined #openstack-infra | 11:26 | |
cmurphy | stephenfin: http://git.openstack.org/cgit/openstack-infra/project-config/tree/gerritbot/channels.yaml | 11:27 |
stephenfin | ta | 11:27 |
*** tpsilva has joined #openstack-infra | 11:31 | |
*** xek has quit IRC | 11:38 | |
fungi | sshnaidm: when was the last time it worked? could you have switched accounts when we deactivated one of your duplicate gerrit accounts a couple of weeks ago? | 11:50 |
*** ahosam has joined #openstack-infra | 11:50 | |
fungi | sshnaidm: is it your pullusum@ or einarum@ address? | 11:53 |
*** yamamoto has quit IRC | 11:53 | |
*** yamamoto has joined #openstack-infra | 11:53 | |
*** yamamoto has quit IRC | 11:57 | |
*** bhavikdbavishi has joined #openstack-infra | 12:02 | |
openstackgerrit | Merged openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema https://review.openstack.org/620838 | 12:04 |
sshnaidm | fungi, it started today; yesterday I submitted patches | 12:04 |
sshnaidm | fungi, I had another email there "sshnaidm@redhat.com" and set it as "preferred" | 12:05 |
alexchadin | hi, where can I find some grenade job examples which were adapted for zuul? | 12:05 |
openstackgerrit | Brendan proposed openstack-infra/zuul master: Fix urllib imports in Gerrit HTTP form auth code https://review.openstack.org/622942 | 12:05 |
sshnaidm | fungi, and now seems like gerrit removed it, I'm trying to add it again, but no verification mail so far.. | 12:05 |
fungi | sshnaidm: indeed, i can't find that address associated with any account in gerrit's database | 12:06 |
fungi | i'll see if i can tell when it sent verification e-mails for it | 12:06 |
fungi | sshnaidm: i see it sent messages to that address as recently as 11:00:48 utc, so barely an hour ago and at least 15 minutes after you pinged me in here | 12:11 |
fungi | was that when you attempted to re-add the address? | 12:11 |
*** pbourke has quit IRC | 12:14 | |
*** kashyap has joined #openstack-infra | 12:15 | |
sshnaidm | fungi, yeah, I think so | 12:25 |
*** jpena is now known as jpena|lunch | 12:25 | |
*** ahosam has quit IRC | 12:25 | |
sshnaidm | fungi, found this mail finally | 12:26 |
sshnaidm | fungi, yay! can submit patches again :) | 12:27 |
*** e0ne has quit IRC | 12:27 | |
*** e0ne has joined #openstack-infra | 12:29 | |
fungi | sshnaidm: great! glad nothing seems to be broken on our end anyway | 12:39 |
*** pbourke has joined #openstack-infra | 12:40 | |
kashyap | [OT] Some folks here might appreciate this: https://www.qemu-advent-calendar.org/ | 12:40 |
kashyap | If you're wondering WTH it is: | 12:41 |
kashyap | [quote] | 12:41 |
kashyap | The QEMU Advent Calendar 2018 features a QEMU disk image each day of December until Christmas. Each day a new package becomes available for download [...] The disk images contain interesting operating systems and software that run under the QEMU emulator. Some of them are well-known or not-so-well-known operating systems, old and new, others are custom demos and neat algorithms. | 12:41 |
kashyap | [/quote] | 12:41 |
fungi | i recall they've done that in previous years too | 12:41 |
fungi | it's neat | 12:41 |
kashyap | fungi: We didn't do it last year :-) | 12:41 |
kashyap | Last time was in 2016. (And before that was in 2014) | 12:41 |
fungi | in even years then? ;) | 12:42 |
kashyap | Heh, yeah | 12:45 |
kashyap | It's just too much damn work. | 12:45 |
kashyap | This year, I only part-volunteered; 2016, I spent lot more time on it. | 12:46 |
*** ramishra has quit IRC | 12:51 | |
*** yamamoto has joined #openstack-infra | 12:54 | |
*** kjackal has quit IRC | 12:54 | |
*** kjackal has joined #openstack-infra | 12:55 | |
*** bobh has joined #openstack-infra | 12:58 | |
*** kjackal has quit IRC | 12:59 | |
*** aojea has quit IRC | 13:01 | |
*** ykarel is now known as ykarel|afk | 13:03 | |
*** kjackal has joined #openstack-infra | 13:04 | |
*** ramishra has joined #openstack-infra | 13:05 | |
*** yamamoto has quit IRC | 13:06 | |
*** janki has quit IRC | 13:08 | |
*** ykarel|afk has quit IRC | 13:10 | |
*** bobh has quit IRC | 13:22 | |
*** udesale has joined #openstack-infra | 13:26 | |
*** jpena|lunch is now known as jpena | 13:33 | |
*** ykarel|afk has joined #openstack-infra | 13:35 | |
*** rlandy has joined #openstack-infra | 13:36 | |
*** ykarel|afk is now known as ykarel | 13:36 | |
*** dtantsur is now known as dtantsur|brb | 13:39 | |
mordred | morning fungi - how's things? | 13:40 |
*** dpawlik has quit IRC | 13:41 | |
*** dpawlik has joined #openstack-infra | 13:44 | |
fungi | i have contractors ripping out and rebuilding my previously-flooded downstairs entry (finally) | 13:44 |
fungi | it's sort of like my usual industrial music but with less shouting | 13:44 |
*** jcoufal has joined #openstack-infra | 13:44 | |
mordred | hrm. maybe you could figure out some way to get the contractors to yell more? | 13:45 |
*** agopi has quit IRC | 13:45 | |
fungi | i could flip this breaker back on, i suppose | 13:45 |
*** jcoufal_ has joined #openstack-infra | 13:46 | |
openstackgerrit | Monty Taylor proposed openstack-infra/system-config master: Add a script to generate the static inventory https://review.openstack.org/622964 | 13:47 |
mordred | fungi: that might just result in muffled thuds and then less noise | 13:48 |
fungi | fair point | 13:48 |
*** jcoufal has quit IRC | 13:48 | |
mordred | fungi: there's a stab at an inventory generation script. in writing the commit message for it, it occurred to me that in most cases it should be completely unneeded, as all somebody needs to do is add some ips to a yaml file after running launch-node | 13:49 |
*** dpawlik has quit IRC | 13:49 | |
mordred | fungi: maybe we should just have launch-node print a little yaml snippet that could be copy-pasted into the inventory file? | 13:49 |
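Something along these lines is what's being suggested launch-node could emit at the end of a run (a hypothetical snippet; the real inventory layout in system-config may differ):

```sh
# Hypothetical tail of launch-node output: a ready-to-paste inventory entry.
cat <<EOF
Add the following to the static inventory:

  new-server01.openstack.org:
    ansible_host: 203.0.113.10
EOF
```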
*** agopi has joined #openstack-infra | 13:49 | |
*** priteau has quit IRC | 13:50 | |
*** dpawlik has joined #openstack-infra | 13:50 | |
*** mriedem has joined #openstack-infra | 13:52 | |
fungi | not a bad idea. we do something similar for dns currently and will likely be needing to print a snippet to add to a commit to one of our zone repos soon | 13:53 |
fungi | maybe both can be wrapped up together | 13:54 |
fungi | (i mean, snippets for two commits since they go in different repos, but the same routine could spit out both) | 13:54 |
*** jamesmcarthur has joined #openstack-infra | 13:55 | |
fungi | that also gives us a start on having automation directly propose those patches in the future, should we wish it | 13:55 |
*** sthussey has joined #openstack-infra | 14:00 | |
*** kgiusti has joined #openstack-infra | 14:03 | |
mordred | ++ | 14:04 |
*** priteau has joined #openstack-infra | 14:05 | |
*** florianf_ is now known as florianf | 14:05 | |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for n-api/g-api startup timeout bug 1806912 https://review.openstack.org/622966 | 14:06 |
openstack | bug 1806912 in OpenStack-Gate "devstack timeout because n-api/g-api takes longer than 60 seconds to start" [Undecided,Confirmed] https://launchpad.net/bugs/1806912 | 14:06 |
*** sshnaidm has quit IRC | 14:06 | |
*** quiquell is now known as quiquell|off | 14:11 | |
*** sshnaidm has joined #openstack-infra | 14:14 | |
mordred | infra-root: I'm afk for the next few hours - giving a talk about zuul today | 14:24 |
fungi | mordred: ooh! g'luck! | 14:25 |
pabelanger | +1 | 14:25 |
mordred | thanks! this one will be fun - these humans have zero background in openstack at all, so it's a complete blank canvas (I'm sure I'll be completely forgetting some important context :) ) | 14:26 |
*** psachin has quit IRC | 14:27 | |
fungi | they have a background in ci systems at least? | 14:27 |
mordred | who knows! | 14:27 |
fungi | sounds like it'll be a blast | 14:28 |
fungi | "openstack: sort of like aws without selling your soul" | 14:29 |
mordred | wait - I didn't have to sell my soul? | 14:29 |
mordred | I knew I was doing something wrong | 14:29 |
fungi | i can get you a receipt | 14:29 |
*** janki has joined #openstack-infra | 14:33 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Add query for n-api/g-api startup timeout bug 1806912 https://review.openstack.org/622966 | 14:38 |
openstack | bug 1806912 in OpenStack-Gate "devstack timeout because n-api/g-api takes longer than 60 seconds to start" [Undecided,Confirmed] https://launchpad.net/bugs/1806912 | 14:38 |
mriedem | ovh-bhs1 nodes must be slow | 14:39 |
mriedem | seeing lots of slow-node related timeout failures in e-r on those nodes | 14:39 |
*** jamesmcarthur has quit IRC | 14:45 | |
*** boden has joined #openstack-infra | 14:48 | |
*** eharney has joined #openstack-infra | 14:48 | |
fungi | mriedem: yes, we think it's disk write performance, amorin is attempting to troubleshoot | 14:50 |
mriedem | ok | 14:50 |
fungi | it may be just some of the hosts in our dedicated aggregate in that region | 14:50 |
fungi | which is making it tough to pin down | 14:51 |
pabelanger | mriedem: fungi: last evening mnaser linked https://review.openstack.org/577933/ as maybe a way we could help track hostid from the guest VM when clarkb was trying to get more info from jobs. | 14:53 |
openstackgerrit | Stephen Finucane proposed openstack-infra/project-config master: Remove openstack/osc-placement from #openstack-nova https://review.openstack.org/622987 | 14:54 |
mriedem | pabelanger: no updates since june so i forgot about that one | 14:55 |
*** anteaya has joined #openstack-infra | 14:55 | |
mriedem | probably needs a helping hand | 14:55 |
fungi | yeah, with that we could collect the hostid along with other instance metadata and even expose it as a column in logstash/kibana once our providers start supporting it | 14:56 |
pabelanger | yah, would defer to mnaser, but looked like just needs doc updates | 14:56 |
*** zul has quit IRC | 14:57 | |
*** sshnaidm has quit IRC | 14:57 | |
fungi | getting nodepool to request it from the api and pass that all the way through zuul to ansible somehow is probably possible but seems like it would be rather complicated to instrument | 14:57 |
*** quiquell|off has quit IRC | 14:59 | |
pabelanger | fungi: interesting idea, maybe something to ask about in #zuul. | 14:59 |
openstackgerrit | Merged openstack-infra/system-config master: Retire the interop-wg mailing list https://review.openstack.org/619056 | 15:10 |
*** lpetrut has joined #openstack-infra | 15:12 | |
*** dtantsur|brb is now known as dtantsur | 15:13 | |
*** sshnaidm has joined #openstack-infra | 15:18 | |
*** hwoarang has joined #openstack-infra | 15:27 | |
logan- | reviews on https://review.openstack.org/#/q/starredby:logan2211%2540gmail.com+status:open+project:%255Eopenstack/openstack-ansible-.* will be appreciated | 15:30 |
logan- | er wrong channel, sorry | 15:30 |
*** jamesmcarthur has joined #openstack-infra | 15:33 | |
corvus | fungi, pabelanger, mnaser: looks like we get the hostId from nova already in nodepool, we just don't do anything with it; it would be pretty easy to plumb that through to zuul | 15:34 |
fungi | oh, really? | 15:34 |
openstackgerrit | melissaml proposed openstack/ansible-role-cloud-launcher master: Change openstack-dev to openstack-discuss https://review.openstack.org/623008 | 15:34 |
*** zul has joined #openstack-infra | 15:34 | |
fungi | corvus: like, by returning it through gearman? | 15:35 |
corvus | fungi: yeah, i *think* it should be there when we're done with server creation | 15:35 |
*** jamesmcarthur has quit IRC | 15:35 | |
corvus | fungi: via zookeeper (to scheduler) and gearman (to executor), yes | 15:35 |
*** jamesmcarthur has joined #openstack-infra | 15:35 | |
*** sshnaidm is now known as sshnaidm|afk | 15:36 | |
fungi | oh, right, i keep forgetting launcher<->scheduler communication is zk | 15:36 |
fungi | your definition of "pretty easy" differs a bit from mine ;) | 15:37 |
*** jesusaur has quit IRC | 15:37 | |
corvus | fungi: heh, it's 2 changes to 3 components, but it's basically just adding data to structures that already exist | 15:37 |
*** kashyap has left #openstack-infra | 15:38 | |
pabelanger | cool, so patches welcome then :) | 15:40 |
pabelanger | might look at it more this afternoon | 15:41 |
Linkid | hi | 15:41 |
*** jesusaur has joined #openstack-infra | 15:42 | |
Linkid | I don't know if I'm on the right channel | 15:42 |
Linkid | I would like some info about resources available at the OSF | 15:42 |
Linkid | like disk space | 15:43 |
Linkid | because I would like to suggest a tool for hosting sonething and I don't know if it is possible | 15:43 |
Linkid | (before I suggest something impossible) | 15:44 |
Linkid | *something | 15:44 |
fungi | Linkid: the openstack foundation doesn't maintain community services, but you've found the channel for a community of people who collaborate on providing services to people who work on building free/libre open source projects | 15:45 |
fungi | er, building services for | 15:46 |
fungi | Linkid: what tool are you thinking about? | 15:46 |
Linkid | fungi: I was thinking about installing peertube to host OpenStack Summit (among others) videos (in addition to Youtube) | 15:48 |
*** lpetrut has quit IRC | 15:49 | |
Linkid | and I would be happy to help :) | 15:49 |
openstackgerrit | Stephen Finucane proposed openstack-infra/project-config master: Add openstack/os-api-ref to #openstack-doc https://review.openstack.org/623013 | 15:49 |
Linkid | https://joinpeertube.org | 15:49 |
fungi | an interesting idea. i wasn't aware of peertube until just now | 15:50 |
fungi | curious what drove them to choose the agpl | 15:50 |
Linkid | this is a libre project maintained by a French orga, and it is a really good alternative to youtube :) | 15:50 |
fungi | looks like last year they switched their codebase from gplv3 to agplv3 but the commit doesn't indicate why | 15:51 |
Linkid | If there are enough resources, I would be happy to install it and to make scripts to add video metadata to it | 15:51 |
Linkid | they made a presentation at FOSDEM 2018, if you want | 15:52 |
Linkid | (but it was in french, if I remember it well) | 15:52 |
fungi | i think we'd likely need to get some sort of copyright agreement in place with the openstack foundation since i'm not sure what the copyright situation is with the summit recordings | 15:52 |
Linkid | ah | 15:53 |
fungi | but this is compelling since i discovered not long ago that people in mainland china can't watch our conference session recordings on youtube so there's already a need to host those videos in multiple places | 15:53 |
Linkid | I thought summit recordings were the property of the foundation (or CC0) | 15:53 |
Linkid | :) | 15:54 |
fungi | not to mention, it's always rubbed me the wrong way that we espouse free software at out conferences but then expect people who want to watch recordings of the sessions from them to do so through proprietary video hosting services | 15:55 |
fungi | er, at our conferences | 15:55 |
corvus | this sounds like a great idea; i agree we should clarify the licensing status (i really hope they are CC licensed, and if they aren't we should see about getting that changed) | 15:56 |
Linkid | fungi: yes, Framasoft (the orga which promotes peertube) thought the same ^^ | 15:57 |
Linkid | (about other conferences) | 15:57 |
corvus | it can be run in docker: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/docker.md | 15:57 |
Linkid | do you want me to write a mail for the suggestion ? | 15:57 |
Linkid | corvus: yep :) | 15:58 |
corvus | Linkid: assuming we get all the pre-requisites worked out, we're working on moving our infrastructure to be mostly driven by ansible + containers, so that's what most of the work to get it running would entail. | 15:58 |
fungi | Linkid: sure, starting a discussion on the openstack-infra@lists.openstack.org mailing list might be an easier way for us to also get osf staff who handle the event logistics and video production stuff involved | 15:59 |
*** jamesmcarthur has quit IRC | 15:59 | |
corvus | maybe this could be an opendev branded service? we could probably offload the current static video hosting we have for zuul-ci.org to it. | 16:00 |
pabelanger | TIL: https://joinpeertube.org/ looks very cool | 16:00 |
pabelanger | corvus: +1 | 16:00 |
Linkid | fungi: ok :). I'll do it in 2 hours, when I'll be able to write the mail, then :) | 16:01 |
fungi | Linkid: awesome, thanks! i'm digging into the copyright situation for summit session videos now, so hopefully will have a better answer about them by then | 16:01 |
*** slaweq has quit IRC | 16:02 | |
*** haleyb has joined #openstack-infra | 16:12 | |
openstackgerrit | Jeremy Stanley proposed openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 16:12 |
*** jamesmcarthur has joined #openstack-infra | 16:13 | |
*** graphene has quit IRC | 16:13 | |
anteaya | Linkid: are you affiliated with peertube? | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Set type for error'ed instances https://review.openstack.org/622101 | 16:14 |
openstackgerrit | Merged openstack-infra/nodepool master: Make estimatedNodepoolQuotaUsed more resilient https://review.openstack.org/622906 | 16:14 |
*** graphene has joined #openstack-infra | 16:15 | |
*** sshnaidm|afk has quit IRC | 16:15 | |
anteaya | Linkid: I'm wondering if the peertube folks want to stay with the use of the word 'spy' in these descriptions: https://framatube.org/about/peertube or would rather go with the word 'view'? | 16:15 |
*** jamesmcarthur has quit IRC | 16:18 | |
Linkid | anteaya: I have contacts with Framasoft (the orga which promotes peertube) | 16:18 |
Linkid | I could ask them, if you want | 16:18 |
*** pcaruana has quit IRC | 16:18 | |
fungi | i have a feeling the content on that page was translated, and the translator didn't realize that word could have negative connotations | 16:19 |
anteaya | fungi: that is my feeling as well | 16:20 |
*** bobh has joined #openstack-infra | 16:21 | |
fungi | i do wonder how well bittorrent protocol works through the great firewall of china (if at all) | 16:21 |
corvus | it's possible they used the word intentionally since it's describing a potentially privacy invading action -- learning which videos someone else is watching | 16:21 |
anteaya | Linkid: thank you, they can contact me in this channel or via pm using my nick or email me at mynick@mynick.info | 16:21 |
anteaya | corvus: agreed, I'm curious as well | 16:21 |
fungi | corvus: i thought that too at first, but had a hard time parsing the sentence to be sure | 16:21 |
anteaya | if it is intentional, then great, I'm just not sure | 16:22 |
anteaya | I do like their transparency | 16:22 |
corvus | the third use ("worst-case scenario") makes me lean towards the intentional interpretation | 16:22 |
corvus | anteaya: yes, they're very up-front about what it does and how it works | 16:23 |
fungi | yeah, if it's just missing some prepositions then i can see how it might have been intended that way | 16:23 |
anteaya | I like that, gives me a good feeling inside | 16:23 |
Linkid | yes, they do want transparency to show that this tool is great | 16:25 |
clarkb | cool those bhs1 test nodes were on the same hypervisor | 16:25 |
clarkb | fungi ^ any sense yet if amorin's changes have made bhs1 more reliable? | 16:25 |
clarkb | re videos confs like lca upload to youtube then separately host the videos in free format(s) on a browseable index | 16:26 |
fungi | clarkb: no, he's looking into potential performance issues in bhs1 again today | 16:26 |
clarkb | swift may make something like lca's setup easy for us | 16:26 |
Linkid | and they started a campaign some months ago to translate everything into english (and into other languages). Maybe there are still some mistakes | 16:26 |
clarkb | fungi ya looks like a scheduler change was applied | 16:27 |
clarkb | (reading sb) | 16:27 |
fungi | oh, right, i saw that. no idea but i'll take a dig into logstash | 16:28 |
openstackgerrit | Doug Hellmann proposed openstack-infra/project-config master: import git-os-job source repo https://review.openstack.org/623023 | 16:30 |
anteaya | Linkid: ah okay, thank you. I don't like to start off a relationship by assuming someone else is incorrect; I like to ask first, since sometimes I'm the one who has misunderstood | 16:30 |
*** janki has quit IRC | 16:30 | |
*** sshnaidm|afk has joined #openstack-infra | 16:30 | |
*** gyee has joined #openstack-infra | 16:34 | |
anteaya | looks like one thing framasoft does is aggregate free software tools, rebrand them as frama-* and host instances | 16:35 |
anteaya | their collaborative editing tool for instance is etherpad-lite | 16:36 |
anteaya | looks like an awesome service for end users looking to use free software via the browser | 16:37 |
anteaya | looks like their targets are schools and small businesses | 16:38 |
*** studarus has joined #openstack-infra | 16:39 | |
fungi | btw, i can't find content licensing or copyright details for summit session videos anywhere so i've asked some osf staff who ought to know (and am also recommending they publish information about that one way or another) | 16:41 |
anteaya | fungi: I'm surprised it hasn't come up before | 16:42 |
*** dpawlik has quit IRC | 16:42 | |
anteaya | Linkid: I'm looking for the code for the https://framacolibri.org/ service, so far I can't seem to find that | 16:42 |
anteaya | Linkid: it looks really awesome | 16:43 |
*** bobh has quit IRC | 16:43 | |
anteaya | so far it appears it is a combination of twitter, bug tracker and linkedin | 16:43 |
*** dpawlik has joined #openstack-infra | 16:44 | |
openstackgerrit | Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Query: [primary] Waiting for logger https://review.openstack.org/622210 | 16:45 |
*** udesale has quit IRC | 16:46 | |
*** udesale has joined #openstack-infra | 16:47 | |
ssbarnea|rover | mriedem: can you please help with open elastic-recheck CRs? https://review.openstack.org/#/q/project:openstack-infra/elastic-recheck+status:open | 16:51 |
ssbarnea|rover | clarkb: fungi : i found one log file from yesterday which I cannot search on logstash: http://logs.openstack.org/20/619520/12/check/tripleo-ci-centos-7-scenario004-standalone/0edbd40/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2018-12-04_20_08_20 | 16:56 |
*** jamesmcarthur has joined #openstack-infra | 16:57 | |
ssbarnea|rover | i searched for that auth.docker.io error line and got zero hits, but based on the file name it should have been indexed, right? | 16:57 |
ssbarnea|rover | i did not see any post failure, which makes me believe it should have been found. | 16:58 |
fungi | clarkb: amorin: frickler: not a huge sample size (35 occurrences registered in the past 6 hours) but the proportion of job timeouts occurring in ovh-bhs1 is roughly 2x what a random distribution would give based on its proportion of overall quota | 16:58 |
fungi | which isn't great, but also not nearly so bad as what we were seeing last week | 16:58 |
*** priteau has quit IRC | 16:59 | |
fungi | ovh-bhs1 accounts for 15% of our quota and is where we saw 34% of timeouts occur in the past 6 hours | 17:00 |
*** bhavikdbavishi has quit IRC | 17:02 | |
*** kjackal has quit IRC | 17:02 | |
*** bhavikdbavishi has joined #openstack-infra | 17:02 | |
fungi | on the other hand, ovh-gra1 accounts for only 8% of our quota and is where 20% of the timeouts occurred over the past 6 hours, so bhs1 is actually doing considerably better than gra1 in that regard | 17:03 |
mriedem | ssbarnea|rover: looking | 17:03 |
fungi | it'll take a while to accumulate a large enough sample size to be sure this is representative though | 17:03 |
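The over-representation fungi describes comes from dividing each region's share of timeouts by its share of quota, using the figures quoted above:

```sh
# bhs1: 34% of timeouts vs 15% of quota ~= 2.3x; gra1: 20% vs 8% ~= 2.5x
python3 -c 'print(0.34 / 0.15, 0.20 / 0.08)'
```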
*** yamamoto has joined #openstack-infra | 17:04 | |
clarkb | fungi: interesting. Not sure if you saw my notes last night but vexxhost sjc1 and bhs1 had roughly the same io throughput according to sysbench random read and writes benchmarking. That is what made me think there was maybe an unhappy hypervisor | 17:05 |
clarkb | and amorin seems to have confirmed that a couple of the unhappy jobs ran on the same hypervisor which was slower so at least some progress there | 17:05 |
*** jamesmcarthur has quit IRC | 17:05 | |
*** priteau has joined #openstack-infra | 17:06 | |
*** jamesmcarthur has joined #openstack-infra | 17:07 | |
*** slaweq has joined #openstack-infra | 17:08 | |
*** yamamoto has quit IRC | 17:08 | |
*** ginopc has quit IRC | 17:08 | |
*** priteau has quit IRC | 17:08 | |
*** priteau has joined #openstack-infra | 17:09 | |
*** rlandy is now known as rlandy|brb | 17:10 | |
Linkid | anteaya : yes, Framasoft promotes free software for people to use as alternatives :). | 17:11 |
*** shardy has quit IRC | 17:11 | |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Change openstack-dev to openstack-discuss https://review.openstack.org/622326 | 17:12 |
Linkid | And to show that they are alternatives, they offer access to those ones with their frama* services :) | 17:12 |
*** shardy has joined #openstack-infra | 17:13 | |
clarkb | ssbarnea|rover: re https://bugs.launchpad.net/openstack-gate/+bug/1806655 that should only happen if the logger daemon is stopped for some reason. One reason may be if your job reboots the test nodes (then you have to restart the logger daemon in the job) | 17:13 |
openstack | Launchpad bug 1806655 in OpenStack-Gate "Zuul console spam: [primary] Waiting for logger" [Undecided,New] | 17:13 |
clarkb | ssbarnea|rover: looking at logstash it appears to be a tripleo job specific issue? | 17:14 |
ssbarnea|rover | clarkb: ok. i have another case where some errors from mistral/api.log.txt.gz are not to be found on logstash. is this file indexed too or not? | 17:15 |
ssbarnea|rover | clarkb: yep, i suspect it is specific to tripleo. | 17:15 |
anteaya | Linkid: thanks, lots to read on the various links | 17:16 |
*** jpich has quit IRC | 17:17 | |
*** wolverineav has joined #openstack-infra | 17:17 | |
clarkb | ssbarnea|rover: https://git.openstack.org/cgit/openstack-infra/project-config/tree/roles/submit-logstash-jobs/defaults/main.yaml is the list of stuff we index. I don't think api.log matches anything there | 17:18 |
clarkb | hrm I was trying to catch up on email but now there is more than when I started | 17:19 |
*** slaweq has quit IRC | 17:20 | |
ssbarnea|rover | clarkb: thanks for the link. now i don't know what to do about the mistral log: it is huge and indexing the whole thing sounds like a pretty bad idea, but indexing errors and warnings from it does not sound like a bad idea to me. | 17:20 |
ssbarnea|rover | do we have a way to do selective indexing? | 17:20 |
clarkb | ssbarnea|rover: not with the current logstash config. It will index everything >=INFO level from the oslo.logging format logs | 17:21 |
*** slaweq has joined #openstack-infra | 17:21 | |
clarkb | ssbarnea|rover: in the past we have worked with various projects to improve their logging to make it more usable by logstash/us and consequently ops | 17:21 |
*** e0ne has quit IRC | 17:22 | |
ssbarnea|rover | ahh, that is ok 95% of the spam is debug so extra load should not be an issue. | 17:22 |
*** jamesmcarthur has quit IRC | 17:22 | |
clarkb | ssbarnea|rover: can you link me to an example file? I'm curious to see how big it is | 17:22 |
ssbarnea|rover | i am really glad to hear that we drop DEBUG lines | 17:22 |
mriedem | ssbarnea|rover: i've gone over several of your newer e-r queries and -1ed several of them | 17:22 |
clarkb | we actually rely on os-loganalyze for that iirc. so we should double check it is filtering mistral logs properly too (which I expect it isn't for a file named just api.log) | 17:23 |
ssbarnea|rover | mriedem: thanks, looking now at them. | 17:23 |
*** jamesmcarthur has joined #openstack-infra | 17:23 | |
*** zul has quit IRC | 17:23 | |
fungi | ssbarnea|rover: if we didn't drop debug level loglines our logstash retention would probably be more like a day instead of 10 | 17:24 |
*** bobh has joined #openstack-infra | 17:25 | |
fungi | and it takes a 6tb elasticsearch cluster just to house that much | 17:25 |
openstackgerrit | Merged openstack-infra/zuul master: Add instructions for reporting vulnerabilities https://review.openstack.org/554352 | 17:25 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Categorize missing /etc/heat/policy.json https://review.openstack.org/621128 | 17:27 |
clarkb | fungi: we would also likely take two days to index that one day of logs | 17:28 |
fungi | or need many times more indexing workers | 17:28 |
mnaser | clarkb: have you seen anything about systemd-python failing to install on centos 7.6? | 17:28 |
mnaser | "Cannot find libsystemd or libsystemd-journal" | 17:29 |
fungi | efried: nice plug for git-restack in your howto! | 17:29 |
clarkb | mnaser: nope that one is new to me | 17:30 |
mnaser | :( okay, i'm a bit lost on why it's happening | 17:30 |
mnaser | http://logs.openstack.org/51/620651/5/gate/openstack-ansible-deploy-aio_lxc-centos-7/01ca2fd/logs/openstack/aio1_keystone_container-c3bb8d22/python_venv_build.log.txt.gz | 17:30 |
fungi | efried: makes me wonder if git-restack should grow a --continue convenience option which just invokes git-rebase --continue | 17:31 |
mnaser | http://logs.openstack.org/51/620651/5/gate/openstack-ansible-deploy-aio_lxc-centos-7/01ca2fd/logs/ara-report/result/ad4ef8f9-31c4-4b5d-886d-b692fcd5e1d0/ and we install systemd-devel | 17:31 |
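A few checks that usually narrow this kind of failure down (a sketch assuming a CentOS 7 build host; not taken from the job logs):

```sh
# Confirm the headers and pkg-config metadata the build looks for are present.
rpm -q systemd-devel gcc pkgconfig
pkg-config --exists libsystemd && echo "libsystemd found"
# Re-run the build verbosely to see which probe fails.
pip install -v systemd-python
```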
openstackgerrit | Duc Truong proposed openstack-infra/irc-meetings master: Change Senlin meeting to different biweekly times https://review.openstack.org/623031 | 17:33 |
*** bobh has quit IRC | 17:34 | |
*** jmorgan1 has quit IRC | 17:36 | |
*** jmorgan1 has joined #openstack-infra | 17:40 | |
*** mriedem is now known as mriedem_away | 17:40 | |
*** bobh has joined #openstack-infra | 17:42 | |
*** rlandy|brb is now known as rlandy | 17:42 | |
*** jpena is now known as jpena|off | 17:42 | |
*** florianf is now known as florianf|afk | 17:43 | |
ssbarnea|rover | mriedem_away: clarkb : read last comment on https://review.openstack.org/#/c/621004/1 -- mainly the query to look for ssh failures is currently broken. | 17:44 |
openstackgerrit | James E. Blair proposed openstack-infra/infra-specs master: Add opendev Gerrit spec https://review.openstack.org/623033 | 17:45 |
corvus | clarkb, fungi, mordred, pabelanger, tobiash: ^ can you take a look at that? i'd like to get it into shape as quickly as possible | 17:46 |
ssbarnea|rover | because the ansible output changed with the improved json logging, there are lots of queries that may have broken because they assumed that two strings are on the same log line, which is no longer true. See https://github.com/openstack-infra/elastic-recheck/blob/master/queries/1721093.yaml | 17:47 |
anteaya | this looks like the catalog of services without the frama-* branding: https://chatons.org/en/find though the page is yet to be translated into english, I welcome corrections on my assumptions from French colleagues | 17:48 |
*** mrhillsman has quit IRC | 17:51 | |
*** mrhillsman has joined #openstack-infra | 17:51 | |
*** sthussey has quit IRC | 17:52 | |
*** jiapei has quit IRC | 17:52 | |
*** adrianreza has quit IRC | 17:52 | |
*** jiapei has joined #openstack-infra | 17:53 | |
clarkb | corvus: next on my list after getting through email backlog | 17:53 |
*** sthussey has joined #openstack-infra | 17:55 | |
*** adrianreza has joined #openstack-infra | 17:55 | |
*** jamesmcarthur has quit IRC | 17:55 | |
anteaya | looks like framasoft uses a feature on etherpads that allows them to be sorted by name or date, subscribed to, and made private or public: https://mypads.framapad.org/mypads/?/mypads/group/framalang-7l3ibkl0/view | 17:57 |
clarkb | anteaya: ya there are plugins to do that. For the private vs public you have to set up auth iirc. I'm more of the opinion that "etherpads" shouldn't be permanent records (since they can be changed after all) | 17:59 |
anteaya | ah, thank you | 18:00 |
*** bobh has quit IRC | 18:00 | |
fungi | an interesting correlation, it looks like our average job uses fairly close to 1 node-hour? zuul seems to be averaging ~1kjph at capacity and we have 1030 nodes in our aggregate quota | 18:01 |
anteaya | bnemec: I didn't know you had your own channel | 18:01 |
anteaya | you have your broadcast on in the background, you are very soothing | 18:01 |
fungi | bnn: the ben nemec network | 18:02 |
clarkb | hrm apparently I don't understand what "a reboot is required to complete this package upgrade" means | 18:02 |
clarkb | because ns2 is still complaining | 18:02 |
anteaya | fungi: I do recommend | 18:02 |
fungi | clarkb: yeah, i saw the cron e-mail again this morning | 18:02 |
* bnemec trademarks bnn :-) | 18:02 | |
fungi | clarkb: puzzling to say the least | 18:02 |
clarkb | fungi: is that a nice way of saying that unattended upgrades wants a human to do the upgrade? | 18:02 |
clarkb | fungi: also looking more closely it's mad about lxd and friends. Maybe we uninstall those | 18:03 |
clarkb | (I thought we were uninstalling those fwiw, so maybe start by double checking that) | 18:03 |
*** derekh has quit IRC | 18:03 | |
clarkb | corvus: did you see https://review.openstack.org/#/c/622624/ for website content? I'm hoping to pull that back up again and refine it after lunch today. So reviews there much appreciated as well | 18:05 |
fungi | clarkb: i didn't hold onto the cronspam from ns2 so not entirely sure to be honest | 18:06 |
fungi | clarkb: oh, these are new packages, that's why | 18:07 |
*** jamesmcarthur has joined #openstack-infra | 18:07 | |
corvus | clarkb: oh nice i'll review now | 18:07 |
fungi | see /var/log/dpkg.log for entries from today | 18:07 |
bnemec | anteaya: I don't think of it so much as a channel as a dumping ground for videos I want other people to see. :-) | 18:07 |
bnemec | There's also about a hundred videos of hurdle races, but they're not public because I'm not subjecting other people's kids to YouTube commenters. | 18:07 |
*** jamesmcarthur has quit IRC | 18:07 | |
clarkb | fungi: will do thanks | 18:07 |
bnemec | anteaya: But I'm glad you like it. | 18:07 |
fungi | clarkb: there was a new kernel (again) | 18:07 |
*** jamesmcarthur has joined #openstack-infra | 18:07 | |
clarkb | fungi: ah so the lxd being held back is just noise? | 18:07 |
fungi | i think so | 18:08 |
clarkb | why doesn't ns1 send the same spam? | 18:08 |
clarkb | or maybe my filters are broken | 18:08 |
fungi | likely not configured the same | 18:08 |
anteaya | bnemec: ha ha ha, thanks for the explanation | 18:09 |
*** dave-mccowan has joined #openstack-infra | 18:09 | |
anteaya | bnemec: and yes, I do like it, like background talk radio without the death | 18:10 |
bnemec | :-) | 18:10 |
anteaya | would listen to more ben radio anytime | 18:10 |
anteaya | and yeah, +1 for protecting the younglings | 18:11 |
fungi | clarkb: yeah, ns1 isn't auto-updating at all | 18:11 |
fungi | though unattended-upgrades is installed and the checksum on /etc/apt/apt.conf.d/50unattended-upgrades matches between them | 18:12 |
Shrews | infra-root: I'd like to restart the nodepool launchers to pick up some fixes. Any objections to doing that now? | 18:13 |
corvus | Shrews: ++ | 18:13 |
fungi | Shrews: sounds fine to me | 18:13 |
Shrews | ok. i'll begin the preparations for it | 18:14 |
corvus | i added infra-core to opendev-website-core | 18:14 |
tobiash | corvus: lgtm and ++ for not using cgit :) | 18:15 |
fungi | clarkb: 2018-12-05 06:50:20,794 ERROR Cache has broken packages, exiting | 18:17 |
fungi | seen in /var/log/unattended-upgrades/unattended-upgrades.log on ns1 | 18:17 |
clarkb | the one "feature" cgit has that few others seem to do is the ability to look at source files in a semi rendered state so that you can link to lines in eg rst files | 18:17 |
clarkb | fungi: huh | 18:17 |
clarkb | fungi: can we apt-get autoclean and see if that fixes it? | 18:18 |
Shrews | corvus: oh, your new addition for the DELETED node state is going to force a total shutdown i think | 18:18 |
Shrews | corvus: if we do 1-at-a-time, some launchers may see that as an invalid state and bork | 18:18 |
fungi | clarkb: https://bugs.debian.org/898607 | 18:18 |
openstack | Debian bug 898607 in unattended-upgrades "unattended-upgrades: send mail on all ERROR conditions" [Important,Fixed] | 18:18 |
fungi | clarkb: i think it's to do with the lxd packages | 18:20 |
clarkb | fungi: so maybe ensure those are removed and do an autoclean? | 18:20 |
*** ralonsoh has quit IRC | 18:21 | |
tobiash | Shrews: so we should add a release note before we do a release then | 18:21 |
Shrews | clarkb: i forget, is a pip install necessary for launcher upgrade? | 18:24 |
fungi | Shrews: is `pbr freeze` not listing the latest commit? | 18:28 |
clarkb | Shrews: puppet should do it for us | 18:28 |
clarkb | but ya running python3 $(which pbr) freeze should show you the sha1 | 18:28 |
clarkb | to confirm | 18:29 |
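To script the confirmation clarkb describes, a sketch like the following would work; it assumes pbr is installed in the same environment as nodepool and just picks out the nodepool line, whose comment carries the installed git sha.

    # Sketch of the check described above: run pbr's freeze command through
    # the interpreter and print the nodepool line, which carries the git sha.
    # Assumes pbr is installed in the same environment as nodepool.
    import shutil
    import subprocess

    pbr = shutil.which("pbr")
    output = subprocess.check_output(["python3", pbr, "freeze"], text=True)
    for line in output.splitlines():
        if line.startswith("nodepool"):
            print(line)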
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Don't install lxd on our servers https://review.openstack.org/623040 | 18:30 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Configure packages on ubuntu arm servers https://review.openstack.org/623041 | 18:30 |
clarkb | fungi: ianw ^ package fixes including for arm | 18:30 |
Shrews | fungi: clarkb: k, cool. i was doing 'pip freeze' which was less than helpful | 18:30 |
fungi | clarkb: aha, i'm noticing that one of the packages which is behind in updates on ns1 compared to ns2 is unattended-upgrades itself | 18:31 |
mnaser | does anyone know if new images have been uploaded to all clouds? | 18:31 |
mnaser | i'm still seeing centos 7.5 machines go up, even as of 2 hours ago | 18:32 |
fungi | clarkb: though the package changelog for the newer one doesn't seem to indicate it includes the fix for debian bug 898607 so i'm still at a loss | 18:32 |
openstack | Debian bug 898607 in unattended-upgrades "unattended-upgrades: send mail on all ERROR conditions" [Important,Fixed] http://bugs.debian.org/898607 | 18:32 |
frickler | corvus: seems https://review.openstack.org/621639 would need to be applied manually before the check can pass? | 18:33 |
clarkb | mnaser: looks like our image builds have been broken for centos for ~one week | 18:33 |
*** wolverineav has quit IRC | 18:34 | |
corvus | frickler: does that check not run as the openstackinfra user? | 18:34 |
mnaser | clarkb: can i help fix that? i'm noticing some servers get centos 7.6 and others with 7.5 | 18:34 |
Shrews | ok, stopping all launchers | 18:34 |
clarkb | mnaser: I don't think any of our test nodes should be 7.6, they will update to 7.6 when running though | 18:34 |
corvus | Shrews: you can do one at a time for slightly less disruption | 18:35 |
clarkb | mnaser: it appears to be systemic as only arm images on nb03 are successfully building right now? | 18:35 |
Shrews | corvus: see my previous query to you | 18:36 |
mnaser | clarkb: ok yeah it looks like now our containers are 7.6 and our hosts are 7.5 | 18:36 |
Shrews | corvus: new DELETED node state | 18:36 |
Shrews | infra-root: all nodepool launchers restarted now | 18:36 |
mnaser | clarkb: if there are any logs, i can look into them | 18:36 |
clarkb | mnaser: looks like we've got dib processes running since November 14 and November 28 on the two non-arm builders. So the pipeline jammed up there | 18:36 |
*** wolverineav has joined #openstack-infra | 18:37 | |
fungi | as in dib was hung and still trying to build something? | 18:37 |
openstackgerrit | Salvador Fuentes Garcia proposed openstack-infra/project-config master: kata-containers: re-enable Fedora job https://review.openstack.org/623043 | 18:37 |
clarkb | mnaser: https://nb02.openstack.org/ubuntu-bionic-0000000042.log and https://nb01.openstack.org/opensuse-423-0000000025.log | 18:37 |
frickler | corvus: I was assuming there is a difference between access to the channel and access to listing permissions from chanserv. but maybe permissions still aren't set up correctly in the first place, yes | 18:37 |
clarkb | fungi: yup exactly and those logs seem to line up with that | 18:37 |
corvus | Shrews: ah missed that | 18:37 |
corvus | frickler: the bot has access to the channel | 18:38 |
clarkb | in this case a root likely needs to strace to understand what is going on there | 18:38 |
fungi | clarkb: and not hung on the same image types either. nb01 is on opensuse and nb02 is on ubuntu | 18:38 |
corvus | frickler: my question is whether that check runs authenticated | 18:38 |
clarkb | because those logs just end as if they weren't running anymore but ps says otherwise | 18:38 |
clarkb | fungi: they are in different clouds too iirc | 18:38 |
*** priteau has quit IRC | 18:38 | |
mnaser | looks like no builds have been happening since the 27th? | 18:39 |
mnaser | https://nb02.openstack.org | 18:39 |
clarkb | mnaser: 28th | 18:39 |
clarkb | mnaser: I linked the last log file above | 18:39 |
clarkb | both disk-image-create processes are blocking on a wait4() so that's not super helpful | 18:39 |
fungi | clarkb: ps suggests the hung child process on nb02 is a child of `pip install os-testr` in service of the venv-os-testr element | 18:40 |
clarkb | in this case we may just want to take this as an opportunity to reboot the servers (apply package updates and clear out any system resources dib may have leaked) | 18:40 |
clarkb | fungi: yup that lines up with the log above too | 18:40 |
*** ccamacho has quit IRC | 18:40 | |
clarkb | 2018-11-28 00:35:02.023 | Downloading https://files.pythonhosted.org/packages/2a/fd/2a8b894ee3451704cf8525a6a94b87d5ba24747b7bbd3d2f7059189ad79f/stestr-2.1.1.tar.gz (104kB) is last logged line | 18:40 |
clarkb | that pip install is blocking on a read | 18:41 |
frickler | corvus: I was assuming it does, because it takes the nickname as parameter, but looking closer in fact it does seem to | 18:41 |
fungi | clarkb: though on nb01 it's hung running a git clone of openstack/charm-interface-barbican-secrets | 18:41 |
frickler | does not* | 18:41 |
fungi | so didn't even hang doing the same things | 18:41 |
*** wolverineav has quit IRC | 18:41 | |
clarkb | the git operation is blocking on a read to 0 | 18:42 |
clarkb | (stdin) | 18:42 |
*** wolverineav has joined #openstack-infra | 18:43 | |
corvus | frickler: any idea what i need to tell chanserv to make it so the script can run? | 18:43 |
clarkb | plenty of disk on that server too | 18:43 |
clarkb | possibly some condition was set up by an earlier build such that the subsequent ones were unhappy? weird though that it wouldn't be the same issue in both spots | 18:43 |
Shrews | #status log Nodepool launchers restarted and now running with commit ee8ca083a23d5684d62b6a9709f068c59d7383e0 | 18:44 |
openstackstatus | Shrews: finished logging | 18:44 |
frickler | corvus: "set #channel private off" might help | 18:45 |
*** dpawlik has quit IRC | 18:45 | |
fungi | clarkb: and the pip install child process on nb02 is blocking on a read from fd 5 | 18:45 |
clarkb | fungi: yup | 18:45 |
*** diablo_rojo has joined #openstack-infra | 18:45 | |
clarkb | fungi: I'm somewhat inclined to just reboot them both and watch them for similar behavior in the future | 18:45 |
*** dpawlik has joined #openstack-infra | 18:46 | |
fungi | i concur, there's not much more i can learn from these unless someone else wants to take a stab | 18:46 |
clarkb | I guess we can look at the build just prior to the most recent ones to see if there is anything obvious | 18:46 |
clarkb | https://nb02.openstack.org/debian-stretch-0000000037.log completed successfully on nb02 | 18:46 |
*** mriedem_away is now known as mriedem | 18:46 | |
*** manjeets_ is now known as manjeets | 18:47 | |
clarkb | https://nb01.openstack.org/opensuse-423-0000000024.log failed on nb01 so nothing clear there | 18:47 |
clarkb | fungi: do you want to reboot or should I? | 18:47 |
clarkb | I'll go ahead and do it | 18:49 |
*** shardy has quit IRC | 18:49 | |
fungi | ahh, yep sorry, trying to do too many things at once | 18:49 |
mriedem | ssbarnea|rover: replied on https://review.openstack.org/#/c/621004/ | 18:49 |
clarkb | mnaser: fungi https://nb01.openstack.org/ubuntu-trusty-0000000040.log is the running build on nb01 after a reboot | 18:50 |
ssbarnea|rover | mriedem: thanks. that was my plan, after getting confirmation. probably after this we should check existing queries for similar issues. | 18:50 |
*** ccamacho has joined #openstack-infra | 18:50 | |
clarkb | we can watch that to see if builds work now | 18:50 |
*** eharney has quit IRC | 18:52 | |
*** harlowja has joined #openstack-infra | 18:52 | |
clarkb | mnaser: fungi https://nb02.openstack.org/ubuntu-xenial-0000000038.log for nb02. Assuming those builds are now working, they should get around to doing centos7 in the near future | 18:52 |
clarkb | then we should keep our eyes open for new 7.6 failures | 18:52 |
mnaser | clarkb: great, thank you | 18:52 |
frickler | corvus: confirmed on a different channel that setting the private flag is what is hiding the access list | 18:53 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 18:53 |
corvus | frickler: flag removed | 18:56 |
corvus | clarkb, dhellmann, tobiash: great comments on 623033, thx; i replied to all | 19:00 |
*** ccamacho has quit IRC | 19:00 | |
*** wolverineav has quit IRC | 19:01 | |
*** betherly has joined #openstack-infra | 19:01 | |
*** eernst has joined #openstack-infra | 19:01 | |
corvus | dhellmann: also in my reply to you, i tried to both provide technical answers, and also my own feedback. hopefully the two can be disentangled as necessary. :) | 19:04 |
*** betherly has quit IRC | 19:06 | |
clarkb | ssbarnea|rover: seems like "2018-12-04 04:14:47,233 ERROR:dlrn:Known error building packages for openstack-tripleo-heat-templates, will retry later" may be the root cause of one of those waiting on logger failures. | 19:06 |
clarkb | ssbarnea|rover: dlrn probably shouldn't have a retry later option for CI? | 19:06 |
*** therve has left #openstack-infra | 19:07 | |
clarkb | looks like delorean failed to open a local repomd.xml file. Thats odd | 19:09 |
ssbarnea|rover | clarkb: please comment on https://bugs.launchpad.net/tripleo/+bug/1714202 ticket, so others can see it (and hopefully reply). | 19:10 |
openstack | Launchpad bug 1714202 in tripleo "DLRN builds fail intermittently (network errors)" [Medium,Incomplete] | 19:10 |
dhellmann | corvus : yep. I suspect a rename of a bunch of the cruft we have now to some neutral prefix to start combined with a lenient policy about using the openstack/ prefix later is going to be where we end up | 19:10 |
ssbarnea|rover | it's late here and i do not know all the details. to me it looks like an infra issue. | 19:11 |
clarkb | ssbarnea|rover: done | 19:12 |
clarkb | ssbarnea|rover: well in this case it's not the network because it's a local file. So either the file doesn't exist or the permissions don't allow reads or the filesystem is corrupt | 19:12 |
clarkb | we should rule out the first two things first :) | 19:12 |
ssbarnea|rover | clarkb: true, but corrupted filesystem counts as infra too. different cause. i am too tired now to look at it. | 19:13 |
clarkb | it does but it should be a far less common issue and the other two things are much easier to check. Just add a sudo ls -l there | 19:14 |
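As a sketch of those two quick checks (the path below is purely illustrative, not the real dlrn layout):

    # Hedged sketch of the two easy checks: does the file exist, and is it
    # readable? The repomd.xml path here is hypothetical.
    import os

    path = "/path/to/repodata/repomd.xml"  # hypothetical location
    if not os.path.exists(path):
        print("missing:", path)
    elif not os.access(path, os.R_OK):
        print("not readable:", path, oct(os.stat(path).st_mode))
    else:
        print("present and readable:", path)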
*** bhavikdbavishi has quit IRC | 19:15 | |
*** ykarel has quit IRC | 19:16 | |
*** wolverineav has joined #openstack-infra | 19:16 | |
*** wolverineav has quit IRC | 19:16 | |
*** wolverineav has joined #openstack-infra | 19:17 | |
clarkb | ssbarnea|rover: for the other url on your bug we do proxy cache centos buildlogs https://git.openstack.org/cgit/openstack-infra/system-config/tree/modules/openstack_project/templates/mirror.vhost.erb#n172 | 19:19 |
*** gfidente has quit IRC | 19:21 | |
*** kgiusti has left #openstack-infra | 19:29 | |
*** wolverineav has quit IRC | 19:29 | |
*** kgiusti has joined #openstack-infra | 19:30 | |
*** wolverineav has joined #openstack-infra | 19:31 | |
*** wolverineav has quit IRC | 19:39 | |
*** wolverineav has joined #openstack-infra | 19:43 | |
mnaser | clarkb: so far xenial built successfully, opensuse-423 building now | 19:45 |
mnaser | so good progress so far | 19:45 |
*** wolverineav has quit IRC | 19:46 | |
*** wolverineav has joined #openstack-infra | 19:46 | |
openstackgerrit | Merged openstack-infra/project-config master: Add #openstack-designate to accessbot https://review.openstack.org/621639 | 19:46 |
fungi | clarkb: okay, so a bit of spelunking i think reveals that the reason unattended-upgrades doesn't want to run is that the new versions of lxd and lxd-client want to also install libuv1 which was not previously installed, and u-a is configured to only upgrade packages which are already installed but not add any new packages | 19:48 |
clarkb | aha | 19:48 |
fungi | so as suspected, removing lxd and lxd-client should, i think, get it back on track | 19:48 |
fungi | (one exception to the no new packages rule is kernels) | 19:49 |
*** studarus has quit IRC | 19:53 | |
*** jamesmcarthur has quit IRC | 19:55 | |
fungi | clarkb: ahh, nope, lxd looks like it may be unrelated after all | 19:57 |
clarkb | http://logs.openstack.org/62/621562/2/gate/tripleo-ci-centos-7-containers-multinode/0fa88fd/logs/undercloud/var/log/extra/dstat.html.gz in that job from 18:57ish to 19:20ish it appears it's just validating heat yamls? | 19:57 |
*** jamesmcarthur has joined #openstack-infra | 19:57 | |
clarkb | at 19:14 tripleoclient times out on some websocket error (unfortunately its not clear what it is talking to from that traceback) | 19:58 |
clarkb | then at 19:20 the stack is actually deployed | 19:58 |
clarkb | this did run on a bhs1 node but looking at the dstat it actually looks a lot more healthy than the dstats from bhs1 I was looking at yesterday | 19:58 |
fungi | clarkb: because the unattended-upgrades log on both ns1 and ns2 complains that it's not going to upgrade those packages so either there's an actual different corrupt package in the cache and it's not specifying which one, or it's a behavior difference between the versions of unattended-upgrades on those two servers | 19:58 |
clarkb | fungi: time to try the autoclean? | 19:59 |
clarkb | reading that dstat disk writes spike to over 300mbps | 19:59 |
clarkb | and are in the 30 range while validating things | 19:59 |
fungi | just a `sudo apt clean` ought to wipe out the package cache, but i want to try and figure out what's corrupt first | 19:59 |
clarkb | what I do notice is that all the memory is used up (mostly) | 19:59 |
clarkb | and there are load spikes to 29 and 12 | 20:00 |
clarkb | and lots of paging | 20:00 |
clarkb | what this is telling me is that bhs1 itself isn't bad in that specific case, but rather the job is demanding quite a lot from the test node and things timed out (probably due to swapping?) | 20:01 |
clarkb | thinking about disk throughput lots of swapping could have a snowballing effect on disk performance | 20:02 |
clarkb | where tripleo jobs end up being the noisy neighbors spinning the disks a bunch (or at least moving electrons around on ssds) | 20:02 |
*** jaosorior has joined #openstack-infra | 20:03 | |
*** eharney has joined #openstack-infra | 20:05 | |
*** dtantsur is now known as dtantsur|afk | 20:06 | |
*** wolverineav has quit IRC | 20:06 | |
*** wolverineav has joined #openstack-infra | 20:07 | |
fungi | #status log removed lxd and lxd-client packages from ns1 and ns2.opendev.org, autoremoved, upgraded and rebooted | 20:09 |
openstackstatus | fungi: finished logging | 20:09 |
fungi | clarkb: interestingly, i had to manually start nsd on ns2 (per your earlier observation) but not ns1 | 20:09 |
clarkb | fungi: could be a race in the startup. The existing unit does allow for it to start after networking is fully up but doesn't require it | 20:11 |
clarkb | comparing dstat to a run of the same job in inap there are some differences. Inap has slightly more memory available to the VM | 20:16 |
clarkb | 8000MB vs 8096MB maybe? | 20:16 |
fungi | i wonder if ns2 consistently loses the race and ns1 consistently wins it, but don't feel like rebooting them even more to find out | 20:16 |
clarkb | fungi: ya my fix should solve it long term I think | 20:17 |
clarkb | the paging is actually worse by count in inap | 20:18 |
clarkb | implying that maybe that is the big difference (we are able to page in stuff that isn't cached due to no memory more effectively) | 20:18 |
*** wolverineav has quit IRC | 20:18 | |
clarkb | mwhahaha: ssbarnea|rover re ^ I would be curious if there are any easy hacks we can do to alleviate the memory pressure | 20:19 |
clarkb | does tripleo enable kernel same page merging? | 20:19 |
clarkb | that may help | 20:19 |
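For reference, KSM is toggled through sysfs; a rough sketch (run as root) looks like the following. KSM only merges pages that applications have marked MADV_MERGEABLE (qemu/kvm does this for guest memory), so it mainly helps nested-VM workloads, and the tuning value here is just an illustrative starting point.

    # Rough sketch of enabling kernel same-page merging via sysfs (root
    # required). Only pages marked MADV_MERGEABLE by the application are
    # merged; the pages_to_scan value below is illustrative.
    from pathlib import Path

    ksm = Path("/sys/kernel/mm/ksm")
    (ksm / "run").write_text("1\n")
    (ksm / "pages_to_scan").write_text("100\n")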
clarkb | perhaps we can also reduce the number of webservers if we have overprovisioned them? (have no idea if that is a thing, just remember it being something we did with devstack to reduce its overhead) | 20:20 |
clarkb | also this dstat data is great, thank you for adding that. I do notice it doesn't seem to be in the standalone job though? | 20:20 |
mwhahaha | should be | 20:20 |
mwhahaha | it uses the same ci bits | 20:20 |
mwhahaha | (dstat that is) | 20:20 |
clarkb | mwhahaha: hrm I can't find it in some semi random jobs I pulled out to check | 20:21 |
clarkb | http://logs.openstack.org/62/621562/1/check/tripleo-ci-centos-7-standalone/1c762fc/logs/undercloud/var/log/extra/ doesn't have a dstat file | 20:21 |
*** kjackal has joined #openstack-infra | 20:22 | |
mwhahaha | no looks like we don't run that part | 20:22 |
mwhahaha | probably because it's in the undercloud setup | 20:22 |
mwhahaha | which this is not, we can add it tho | 20:22 |
clarkb | I think that would be helpful especially if we do find its something like memory pressure causing the fallout, then we'll be able to measure if we've improved that or not | 20:23 |
*** kjackal has quit IRC | 20:23 | |
mwhahaha | the standalone is less memory intensive | 20:23 |
mwhahaha | by default it fits in like 6 or 7g | 20:23 |
clarkb | oh nice | 20:23 |
mwhahaha | but when tempest goes it might increase | 20:24 |
*** kjackal has joined #openstack-infra | 20:24 | |
clarkb | in the failure above we seem to hit the pressure just before running the overcloud stack create | 20:24 |
clarkb | and then it sticks around until the end of the job for the most part (it gets slightly better near the end) | 20:25 |
clarkb | in any case, that's my current thought on where these are having a sad. Memory pressure reduces IO throughput because lots of paging is happening (which itself spins the disks which could cause noisy neighbor issues) | 20:26 |
clarkb | reducing memory pressure should reduce the need for paging which should improve IO throughput and then hopefully things are happier | 20:26 |
*** jamesmcarthur has quit IRC | 20:28 | |
mwhahaha | it would seem that just running the undercloud with nothing else means all the ram is taken up without doing anything | 20:31 |
pabelanger | Shrews: thanks for restarting nodepool-launchers! | 20:32 |
*** Swami has joined #openstack-infra | 20:32 | |
mwhahaha | so the memory all starts getting gobbled up during step4 of the undercloud install which would be when we actually turn on the openstack services (other than keystone) | 20:33 |
pabelanger | interesting enought, I think we might need another zuul-executor or 2. Our executor queue backlog has been above 0 the last 6 hours: http://grafana.openstack.org/dashboard/db/zuul-status | 20:33 |
mwhahaha | so it's likely that it's the openstack service containers and the lack of sharedness from running python in containers | 20:33 |
pabelanger | enough* | 20:33 |
pabelanger | I am not really sure why that is however | 20:35 |
pabelanger | http://grafana.openstack.org/dashboard/db/zuul-status?from=now-90d&to=now | 20:38 |
pabelanger | we are starting more builds in a given period, since 2018-11-29 | 20:39 |
pabelanger | Which makes me think that the relative priority change has caused us to run more jobs in a given hour | 20:39 |
pabelanger | and more IO on executors, where they are running up against governors | 20:40 |
pabelanger | clarkb: corvus: tobiash: ^something that might be of interest when you have spare cycles | 20:40 |
corvus | pabelanger: that's a fascinating theory | 20:42 |
openstackgerrit | Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments https://review.openstack.org/607377 | 20:42 |
tobiash | pabelanger: in my eyes this makes sense as now the smaller less active projects have a higher relative priority to the big resource consumers. Often the big resource consumers have long running jobs that block the same resources for a long time while smaller projects typically have faster jobs. Thus in the end you can run more jobs per hour than before. | 20:43 |
corvus | makes sense | 20:43 |
tobiash | I see the same thing in our deployment too | 20:44 |
*** dpawlik has quit IRC | 20:44 | |
tobiash | the resource hogs are always the projects that have slow jobs | 20:44 |
fungi | but over the long term it's still the same number and type of jobs completed in the same amount of time | 20:44 |
*** jamesmcarthur has joined #openstack-infra | 20:44 | |
fungi | we're just changing the order in which they get started | 20:44 |
tobiash | yes, but during peak load the throughput is higher | 20:44 |
corvus | ...theoretically, we could actually end up taking *longer* to run jobs overall -- because right now we have node capacity sitting idle waiting for executors, whereas we did not before.... | 20:45 |
pabelanger | right | 20:45 |
fungi | ahh, and the projects with longer-running jobs which are also the projects with more contention for changes (due to their longer-running jobs) end up waiting until later into the window when activity has died off | 20:45 |
corvus | so we may need to add capacity not only to make things faster now (we will complete the jobs we're supposed to be running right now faster), but also to get our daily aggregate back to the overall time we had before. | 20:46 |
tobiash | I'd guess you need at least two more executors | 20:47 |
fungi | so basically with more executors we'd be able to sustain a higher jph throughput at peak and then have a lower job rate off-peak than otherwise | 20:47 |
tobiash | judging from these stats | 20:47 |
pabelanger | tobiash: yah, I would agree with 2 | 20:47 |
corvus | why 2? | 20:48 |
*** slaweq has quit IRC | 20:49 | |
fungi | if there's a metric for calculating how many executors we need that would be awesome to feed into our future elastic executor autoprovisioning ;) | 20:49 |
pabelanger | executors look to be capped at about 65 running jobs, but executor queue seems to be ~120 | 20:49 |
pabelanger | so, guessing that 2 would help bring that down | 20:49 |
pabelanger | but, yah, just a guess :) | 20:50 |
efried | fungi: catching up... | 20:51 |
efried | I realized later that I should probably not have put commit --amend in my instructions, as that can f ya when there's a merge conflict, and rebase --continue automatically does commit --amend when it's necessary (and not when it's not). | 20:51 |
efried | But yeah, if we had a git restack --continue that served both purposes, it would be simpler to explain to noobs, probably. | 20:51 |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add ze12.openstack.org https://review.openstack.org/623067 | 20:52 |
corvus | efried, fungi: oh, we talking about 'git restack'? can you catch me up? | 20:52 |
corvus | fungi, pabelanger: ^ that change will fail until the host exists, but if we're ready to add it, i can go ahead and create it. | 20:52 |
fungi | corvus: efried sent an awesome howto to the openstack-discuss ml about workflow for stacks of patches, and plugged git-restack heavily | 20:52 |
corvus | sweet, reading now | 20:53 |
tobiash | corvus: it's an educated guess. I see that there are times when all executors deregistered and 1 more would be a less than 10% increase while 2 more would probably give a little more headroom. | 20:53 |
tobiash | but as pabelanger said, it's just a guess | 20:53 |
corvus | tobiash, pabelanger: i agree, but also, i think the 'jobs starting' governor is a big part of this, so that may change things a bit. i lean toward adding 1 and re-evaluating for now... | 20:54 |
pabelanger | starting with one wfm | 20:55 |
pabelanger | I also agree jobs starting is related | 20:55 |
tobiash | corvus: that's fine, I just looked at these graphs already one or two weeks ago and thought that one more would make sense, so my first guess was two | 20:55 |
*** wolverineav has joined #openstack-infra | 20:56 | |
fungi | even with all the executor backoff, we're averaging roughly 1kjph which is pretty awesome | 20:57 |
corvus | tobiash, pabelanger: yeah, you're probably right. i mostly just don't want to delete servers :) | 20:57 |
corvus | efried, fungi: adding 'git restack --continue' makes sense to me | 20:57 |
pabelanger | corvus: ++ | 20:57 |
efried | corvus: Cool | 20:57 |
*** wolverineav has quit IRC | 20:58 | |
*** wolverineav has joined #openstack-infra | 20:58 | |
*** kjackal has quit IRC | 20:59 | |
*** dpawlik has joined #openstack-infra | 20:59 | |
*** dpawlik has quit IRC | 21:03 | |
*** jtomasek_ has quit IRC | 21:05 | |
*** rkukura_ has joined #openstack-infra | 21:08 | |
tobiash | corvus: ah, so you're targeting an exact fit | 21:11 |
*** rkukura has quit IRC | 21:11 | |
*** rkukura_ is now known as rkukura | 21:11 | |
*** jamesmcarthur has quit IRC | 21:11 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add an upgrade release note for schema change https://review.openstack.org/623046 | 21:17 |
fungi | gonna go find some food now that the workmen seem to be done here for the day; i should be back soon though | 21:25 |
*** jamesmcarthur has joined #openstack-infra | 21:28 | |
*** e0ne has joined #openstack-infra | 21:30 | |
*** sambetts|afk has quit IRC | 21:31 | |
*** priteau has joined #openstack-infra | 21:33 | |
*** yboaron has quit IRC | 21:36 | |
corvus | hrm, launch-node doesn't seem to be detecting that the server is up | 21:37 |
corvus | Shrews, mordred: ^ more potential fallout from openstacksdk? | 21:38 |
corvus | i believe launch-node uses the internal wait/timeout feature | 21:39 |
*** jcoufal_ has quit IRC | 21:42 | |
*** graphene has quit IRC | 21:42 | |
corvus | Shrews, mordred: yes, it seems so -- running launch-node with 0.19.0 works, 0.20.0 fails | 21:43 |
*** graphene has joined #openstack-infra | 21:44 | |
corvus | this is all i got: http://paste.openstack.org/show/736724/ | 21:45 |
*** jamesmcarthur has quit IRC | 21:46 | |
*** priteau has quit IRC | 21:50 | |
*** e0ne has quit IRC | 21:53 | |
corvus | how do i find the documentation for 0.19.0? | 21:57 |
*** kgiusti has left #openstack-infra | 21:58 | |
*** jamesmcarthur has joined #openstack-infra | 21:58 | |
corvus | the server is built, but dns.py isn't working because it says: AttributeError: 'Proxy' object has no attribute 'get_server' | 21:59 |
*** e0ne has joined #openstack-infra | 21:59 | |
corvus | but according to the docs under the 0.19 tag, the compute proxy is supposed to have a get_server method: http://git.openstack.org/cgit/openstack/openstacksdk/tree/doc/source/user/proxies/compute.rst?h=0.19.0 | 21:59 |
*** rcernin has joined #openstack-infra | 22:03 | |
corvus | okay, i have to go back to openstacksdk 0.17.2 for that to work | 22:03 |
mordred | corvus: crappit | 22:04 |
mordred | corvus: lemme look real quick and see if I can be more immediately helpfulk | 22:04 |
*** graphene has quit IRC | 22:04 | |
corvus | mordred: ok; i've just got past the immediate needs -- (i have a server and i have the dns commands). so at this point it's just what next to make launch-node.py and dns.py work again | 22:04 |
*** graphene has joined #openstack-infra | 22:05 | |
mordred | corvus: kk. I think that's likely a yukky break in 0.20.0 ... I'm kinda tempted to revert the patch that's causing all of this and then re-revert it with a bunch more testing since there's clearly some issues here | 22:05 |
*** betherly has joined #openstack-infra | 22:06 | |
corvus | mordred: ok. note two problems: the server wait (introduced in 0.20) and the server_get proxy issue (introduced in 0.18) | 22:07 |
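Roughly, the two code paths in question look like the sketch below; the cloud name, image, flavor and server name are placeholders, not what launch-node actually passes.

    # Sketch of the two openstacksdk calls being discussed; all values are
    # placeholders. create_server(wait=True) is the wait that hangs under
    # 0.20, and conn.compute.get_server() is the proxy call dns.py needs.
    import openstack

    conn = openstack.connect(cloud="example-cloud")  # hypothetical clouds.yaml entry

    server = conn.create_server(
        "example-server", image="bionic", flavor="8GB",
        wait=True, timeout=1800)

    server = conn.compute.get_server(server.id)
    print(server.addresses)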
*** jamesmcarthur has quit IRC | 22:07 | |
openstackgerrit | James E. Blair proposed openstack-infra/system-config master: Add ze12.openstack.org https://review.openstack.org/623067 | 22:08 |
corvus | clarkb, mordred, fungi, pabelanger: ^ that shuld be ready to merge | 22:08 |
*** e0ne has quit IRC | 22:08 | |
mordred | corvus: so .. I think the get_server issue is something else - which I think is something I need to come up with a better logging/error for | 22:09 |
pabelanger | corvus: +2 | 22:09 |
mordred | cloud.compute not having a get_server is more likely to mean it wasn't able to construct a compute Adapter for some reason - but if that's the case it's a TERRIBLE error message for it | 22:10 |
mordred | I *think* what it's saying is "this Proxy, which is a bare Proxy and not an openstack.compute.v2._proxy.Proxy, does not have a get_server method" | 22:11 |
*** betherly has quit IRC | 22:11 | |
mordred | but I'll dig in to both things | 22:11 |
corvus | mordred: i think so -- here's a repr() and dict() from when the error was happening: http://paste.openstack.org/show/736725/ | 22:11 |
corvus | that's a <openstack.compute.v2._proxy.Proxy object at 0x7f24784b94e0> when it's working | 22:12 |
clarkb | anyone want me to keep my three io test nodes around? if not I'll be deleting them shortly (one in bhs1, one in gra1 and one in sjc1) | 22:15 |
clarkb | mnaser: centos7 image just started building a few minutes ago | 22:16 |
clarkb | on nb01 | 22:16 |
*** efried is now known as efried_out_til_j | 22:17 | |
corvus | the "big project backlog" today is not so great. neutron and tht at 6h, nova at 3. | 22:17 |
*** efried_out_til_j is now known as efried_cya_jan | 22:17 | |
corvus | by "not so great" i mean not large | 22:17 |
mordred | corvus: yah. that's what it SHOULD be | 22:18 |
corvus | yeah, it seems like a reasonable amount of time. | 22:18 |
mordred | corvus: do you have any way to see what it is when it's broken without too much effort? | 22:18 |
clarkb | corvus: ya it's looking much better today than yesterday. Likely depending on whether or not long chains of changes are coming in for $project? at least that seemed to be the case with nova yesterday | 22:19 |
*** tpsilva has quit IRC | 22:19 | |
*** jamesmcarthur has joined #openstack-infra | 22:19 | |
corvus | mordred: i defer to clarkb on that :) | 22:20 |
clarkb | aroo? | 22:21 |
clarkb | what the Proxy value is when it doesn't work? | 22:21 |
bkero | Shush Agnew | 22:21 |
mordred | clarkb: yah - my hypothesis is that it's an openstack.proxy.Proxy when it doesn't work | 22:22 |
mordred | oh - wait. yes - of course it is | 22:22 |
corvus | mordred: oh ha i crossed streams | 22:22 |
mordred | that is a bug we actually already have a workaround for in keystoneauth | 22:22 |
corvus | i thought you were asking about backlogs | 22:23 |
corvus | mordred: yes, it's a compute_v2 proxy when it works | 22:23 |
mordred | https://review.openstack.org/#/c/621257/ | 22:23 |
*** bobh has joined #openstack-infra | 22:23 | |
mordred | that will fix the launch_node problem - it's the rackspace version discovery thing | 22:23 |
mordred | or - it'll fix the dns.py problem | 22:23 |
mordred | I do not yet know what the wait problem is | 22:23 |
mordred | I was hoping we could get the rate limiting patch landed - but I think we just need to cut a new keystoneauth with that fix in | 22:24 |
*** wolverineav has quit IRC | 22:24 | |
*** wolverineav has joined #openstack-infra | 22:24 | |
mordred | lbragstad, kmalloc: ^^ mind if I push up a ksa release patch? | 22:25 |
kmalloc | mordred: do eet | 22:25 |
lbragstad | sure | 22:25 |
kmalloc | mordred: also... | 22:25 |
*** udesale has quit IRC | 22:25 | |
kmalloc | mordred: +2 on rate limit | 22:25 |
*** eernst has quit IRC | 22:26 | |
kmalloc | mordred: as long as we commit to getting an in-ksa functional test beyond the SDK case. | 22:26 |
kmalloc | mordred: but the SDK case is a good start | 22:26 |
* kmalloc has to go get car... it's finally repaired! | 22:26 | |
mordred | kmalloc, lbragstad: remote: https://review.openstack.org/623090 Release 3.11.2 of keystoneauth | 22:27 |
mordred | kmalloc: sweet. and yes - actually, before we land it, let's get the sdk consume patch fixed up and green | 22:27 |
mordred | that way we can at least have that and see that it works | 22:27 |
kmalloc | mordred: ++ | 22:27 |
kmalloc | mordred: please mark the KSA -1 Workflow to hold until it's green or at least comment as much on it | 22:28 |
mordred | ++ | 22:28 |
kmalloc | but the code as of now, looks about right sans that functional test | 22:28 |
mordred | kmalloc: done | 22:28 |
mordred | woot | 22:28 |
*** boden has quit IRC | 22:30 | |
clarkb | fungi: interesting thing I've noticed about e-r. A bunch of bug graphs have holes. At first I was worried that we broke indexing, then I realized after opening up those bug graphs that the hole is when we disabled bhs1 | 22:32 |
clarkb | unfortunately I see that most of them are still recurring :/ | 22:33 |
clarkb | http://status.openstack.org/elastic-recheck/#1805176 http://status.openstack.org/elastic-recheck/#1763070 http://status.openstack.org/elastic-recheck/#1802640 http://status.openstack.org/elastic-recheck/#1745168 http://status.openstack.org/elastic-recheck/#1793364 http://status.openstack.org/elastic-recheck/#1806912 and probably others | 22:34 |
*** bobh has quit IRC | 22:34 | |
*** dave-mccowan has quit IRC | 22:35 | |
*** wolverineav has quit IRC | 22:36 | |
corvus | that's pretty standout | 22:36 |
*** bobh has joined #openstack-infra | 22:37 | |
*** e0ne has joined #openstack-infra | 22:37 | |
corvus | in a bit of irony, all of the jobs for the ze12 change have node assignments but are waiting for an executor to start | 22:38 |
*** e0ne has quit IRC | 22:38 | |
pabelanger | yah, I also noticed that | 22:39 |
pabelanger | something happened at 22:30 (gate reset maybe?) and now backlog in executor queue is growing again | 22:39 |
pabelanger | yah, I think integrated queue | 22:40 |
clarkb | pabelanger: ya tripleo had a couple failures, one was ntp sync and the other a tempest unittest failure | 22:40 |
clarkb | integrated queue may have also reset, not sure | 22:40 |
*** owalsh_ has joined #openstack-infra | 22:40 | |
clarkb | fwiw the tripleo ntp sync issue shows up pretty evenly distributed across the clouds so that is one I don't think is bhs1 related | 22:41 |
clarkb | fungi: ^ re bhs1 we think we ruled out cpu time? | 22:41 |
mordred | corvus: well - I haven't found the wait_for_server issue yet - but I have found that we're apparently not doing discovery cache properly and are re-running discovery needlessly :( | 22:41 |
clarkb | http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz is a recent example of a bhs1 is "slow" failure | 22:41 |
mordred | kmalloc: ^^ just fyi - I haven't diagnosed the issue yet | 22:41 |
clarkb | that was nova starting up taking 64 seconds which is longer than the devstack timeout allows for | 22:42 |
*** owalsh has quit IRC | 22:42 | |
*** bobh has quit IRC | 22:42 | |
clarkb | in that particular case there is only one period of cpu wai of ~10% for a couple seconds, otherwise it actually looks pretty happy overall. Low load, no paging, no memory pressure and so on | 22:44 |
clarkb | mriedem: if you look at http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz any hunches to what is slow there? its actually pretty slow (there is a gap) right at the start | 22:44 |
clarkb | mriedem: maybe if we can identify what is particularly slow there we can use that as a bread crumb to go further. (and maybe it is just disk IO though my test instance shows that is fine, so possibly more unhappy hypervisors?) | 22:45 |
mriedem | please hold dear caller, currently slitting wrists in nova | 22:46 |
clarkb | maybe I need to stay up late one of these evenings and try to catch amorin for paired debugging | 22:46 |
clarkb | mriedem: I'm sorry, I hope that it gets better | 22:46 |
*** wolverineav has joined #openstack-infra | 22:47 | |
pabelanger | ouch, another gate reset for integrated | 22:49 |
clarkb | the good news is our classification rate is reasonably high, so figuring this out should have a pretty big impact overall | 22:50 |
clarkb | the bad news is that means it is causing problems :/ | 22:50 |
*** jamesmcarthur has quit IRC | 22:51 | |
clarkb | pabelanger: not a bhs1 failure though | 22:51 |
*** wolverineav has quit IRC | 22:51 | |
pabelanger | yah, but it just spiked our executor backlog again, 273 waiting now. | 22:51 |
pabelanger | maybe we should enqueue 623067 to help get ahead of it | 22:52 |
clarkb | ya mostly just calling it out because addressing the oddity in bhs1 won't solve all the problems | 22:52 |
*** rcernin has quit IRC | 22:52 | |
mordred | corvus: so - unfortunately, I cannot reproduce create_server(wait=True) breaking - oh - except ... | 22:52 |
*** rcernin has joined #openstack-infra | 22:52 | |
mordred | corvus: that same bug would actually cause the wait loop to spin until it times out | 22:52 |
mordred | corvus: so - other than having provided you with absolutely atrocious error messages, this whole thing is fixed with that keystoneauth patch - which I have now submitted a release request for | 22:53 |
*** wolverineav has joined #openstack-infra | 22:53 | |
corvus | mordred: feel free to create ze13 if you want to test :) we can defer adding it into the cluster until we're sure we want it | 22:53 |
mordred | corvus: I created a mttest already ... and it worked (this is with the patched ksa) | 22:54 |
corvus | ok good | 22:54 |
clarkb | I'm going to compile a list of test nodes in the last 6 hours or so that seem to have failed in bhs1 where we have those gaps and maybe that can help debug if it is unhappy hypervisors | 22:54 |
mriedem | clarkb: well we appear to be doing this an ass ton | 22:56 |
mriedem | nova.scheduler.client.report._create_client | 22:56 |
mriedem | Dec 05 20:14:22.572362 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: DEBUG nova.compute.rpcapi [None req-87b2e9bd-e752-452f-9933-005e21b6841b None None] Not caching compute RPC version_cap, because min service_version is 0. Please ensure a nova-compute service has been started. Defaulting to current version. {{(pid=23463) _determine_version_cap /opt/stack/nova/nova/compute/rpcapi.py:397}} Dec 05 20:14:22.5 | 22:56 |
mriedem | 0 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: DEBUG oslo_concurrency.lockutils [None req-87b2e9bd-e752-452f-9933-005e21b6841b None None] Lock "placement_client" acquired by "nova.scheduler.client.report._create_client" :: waited 0.000s {{(pid=23463) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:327}} | 22:56 |
mriedem | yikes | 22:56 |
mriedem | rmine_version_cap called over 1k times on startup | 22:57 |
mriedem | *determine_version_cap | 22:57 |
mriedem | dansmith: ^ | 22:57 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: test_uploads_with_nondefault_rebase: fix git screen scraping https://review.openstack.org/623096 | 22:57 |
mriedem | my guess is, | 22:58 |
mriedem | nova.compute.api.API init per extension, per api worker | 22:58 |
dansmith | mriedem: instantiating a compute rpc somewhere that gets called a lot? | 22:58 |
mriedem | so at least 2 workers | 22:58 |
mriedem | ~50+ extensions | 22:58 |
*** bobh has joined #openstack-infra | 22:59 | |
dansmith | the value is cached per worker, and the warning is calling out a pretty bad sitch | 22:59 |
dansmith | I guess we could try to only log once per process or something | 22:59 |
mriedem | there just aren't any compute services started before the api in a devstack run | 22:59 |
mriedem | also... | 23:00 |
mriedem | this would be looking in cell0... | 23:00 |
mriedem | where we'll never have computes | 23:00 |
clarkb | mriedem: ok so devstack is detecting it as nova-api not having started, but the root cause is something deeper | 23:00 |
mriedem | devstack times out after 60 seconds | 23:00 |
clarkb | ya and it took 64 seconds to start | 23:00 |
mriedem | this is taking 64 | 23:00 |
mriedem | Dec 05 20:14:27.967716 ubuntu-xenial-ovh-bhs1-0000959981 devstack@n-api.service[23459]: WSGI app 0 (mountpoint='') ready in 64 seconds on interpreter 0x27018c0 pid: 23462 (default app) | 23:00 |
clarkb | mriedem: mostly I'm trying to characterize the slowness there so that we can debug it better | 23:01 |
clarkb | as it may be cloud provider side | 23:01 |
clarkb | the good news is (if we ignore the tripleo failures that show the gap because it's convenient) there are only a small number of failures since amorin made changes this morning | 23:01 |
mriedem | this is one of the slow ovh-bhs1 nodes | 23:01 |
dansmith | mriedem: ah, well, the cell0 thing is legit | 23:02 |
mriedem | so my point is, we're doing this db lookup api workers * extensions for something that will never return anything because | 23:02 |
mriedem | connection = mysql+pymysql://root:secretdatabase@127.0.0.1/nova_cell0?charset=utf8 | 23:02 |
*** jamesmcarthur has joined #openstack-infra | 23:02 | |
dansmith | yeah, that's fair | 23:02 |
dansmith | but | 23:02 |
mriedem | and we don't have nova-compute services in cell0 | 23:02 |
dansmith | it also means we probably have a bug there | 23:02 |
clarkb | mriedem: ok so its two things then, yes slow bhs1 node, but maybe it illustrates a cell0 bug? | 23:03 |
dansmith | because we should be scattering and gathering, even though that won't fix the time zero problem | 23:03 |
mriedem | dansmith: but if we cached? | 23:03 |
clarkb | (I'm mostly interested in figuring out the slow bhs1 nodes, but happy to help on the other thing too) | 23:03 |
mriedem | we're not caching b/c 0 | 23:03 |
fungi | clarkb: yes, amorin confirmed cpu oversubscription in bhs1 was 2:1 and analysis of logstash records after we halved max-servers there still showed 20x as many job timeouts as gra1 | 23:03 |
*** bobh has quit IRC | 23:03 | |
dansmith | mriedem: but we'll continue looking at cell0 I imagine | 23:03 |
mriedem | when n-api starts in devstack we won't have any computes, so this would just always be 0 on first startup | 23:03 |
mriedem | but yeah we are not iterating the cells | 23:04 |
mriedem | should be calling get_minimum_version_all_cells yeah? | 23:04 |
clarkb | fungi: cool fwiw looking at dstat there is significant cpu idle time when things time out so not surprising it isn't cpu (or at least doesn't appear to be) | 23:04 |
dansmith | we have service version get all cells | 23:04 |
dansmith | for a reason | 23:04 |
dansmith | yeah that | 23:04 |
dansmith | mriedem: we probably need a signal that none were found and avoid logging until there are computes that are old, not just missing | 23:05 |
dansmith | although that could be a legit misconfig too I guess | 23:05 |
*** wolverineav has quit IRC | 23:05 | |
dansmith | mriedem: if you file a bug and/or remind me tomorrow I could probably take a look at that | 23:05 |
*** jamesmcarthur has quit IRC | 23:05 | |
pabelanger | another gate reset, this time cinder and constraints job | 23:05 |
mriedem | dansmith: will do | 23:06 |
*** wolverineav has joined #openstack-infra | 23:12 | |
pabelanger | corvus: clarkb: we are now up to the same number of jobs running as are queued in the executor backlog, starting builds seems to be our hurdle right now. And with the last few gate resets in the last hour, executors cannot get ahead. At this rate, i think we are still another hour out for 623067 to land. | 23:12 |
corvus | pabelanger: seems likely. tripleo is about to reset again too. | 23:14 |
corvus | (that job at the top timed out) | 23:14 |
clarkb | infra-root I've started to summarize on https://etherpad.openstack.org/p/bhs1-test-node-slowness | 23:14 |
corvus | (it was on bhs1) | 23:14 |
clarkb | in the case of tripleo and bhs1 I think their jobs put a lot more demand on the test nodes than eg tempest | 23:15 |
*** larainema has joined #openstack-infra | 23:15 | |
clarkb | and tripleo may be our noisy neighbor there possibly compounded by some external slowness | 23:15 |
mriedem | clarkb: dansmith: https://bugs.launchpad.net/nova/+bug/1807044 | 23:16 |
openstack | Launchpad bug 1807044 in OpenStack Compute (nova) "nova-api startup does not scan cells looking for minimum nova-compute service version" [Medium,Confirmed] | 23:16 |
clarkb | from what I can tell looking at things that are not tripleo we have far fewer failures on bhs1 since amorin made changes this morning | 23:16 |
clarkb | we need more data to say for sure, but I think those changes did improve the situation for us. | 23:16 |
clarkb | mriedem: thank you | 23:17 |
clarkb | mwhahaha: ssbarnea|rover http://logs.openstack.org/99/616199/4/gate/tripleo-buildimage-overcloud-full-centos-7/c027629/job-output.txt.gz#_2018-12-05_22_56_12_383902 a case of broken ansible? | 23:21 |
mwhahaha | hmm that's odd | 23:22 |
mwhahaha | we set environment_type in job defs (i think), maybe it got missed | 23:22 |
corvus | it takes longer to start a big integration job for a change deep in the integrated gate, compounding the problem with the executors. | 23:23 |
mwhahaha | yea it inherits from tripleo-ci-base which does not have the environment_type set | 23:23 |
* mwhahaha fixes | 23:23 | |
corvus | clarkb, pabelanger: we also seem to be using swap a lot | 23:24 |
clarkb | corvus: ya particularly in the tripleo jobs I think. I noticed there is a ton of paging on some of those jobs | 23:25 |
*** rlandy is now known as rlandy|bbl | 23:25 | |
*** owalsh_ is now known as owalsh | 23:25 | |
corvus | clarkb: i mean on the executors | 23:25 |
clarkb | oh | 23:25 |
clarkb | interesting | 23:25 |
notmyname | I was going to try to make a patch for zuul's dashboard, but then I realized I don't know anything about react | 23:26 |
corvus | http://cacti.openstack.org/cacti/graph.php?local_graph_id=64005&rra_id=all | 23:26 |
corvus | notmyname: https://www.softwarefactory-project.io/react-for-python-developers.html | 23:26 |
clarkb | corvus: there is free memory available on that executor too | 23:26 |
mwhahaha | clarkb: hmm in poking at these base definitions, where is the 'multinode' base job defined? is that in zuul somewhere | 23:26 |
notmyname | corvus: you certainly had that link available quickly :-) | 23:27 |
mwhahaha | nm i found it in zuul-jobs/zuul.yaml | 23:27 |
corvus | notmyname: :) https://www.google.com/search?q=tristan+react+python | 23:27 |
*** wolverineav has quit IRC | 23:27 | |
corvus | at least, that works in my google bubble | 23:27 |
notmyname | heh | 23:27 |
notmyname | who's tristan? | 23:28 |
corvus | notmyname: tristanC, he wrote most of the current dashboard | 23:28 |
notmyname | ah, ok | 23:28 |
clarkb | mwhahaha: it is defined in openstack-infra/zuul-jobs/zuul.yaml | 23:29 |
clarkb | mwhahaha: oh you found it :) | 23:29 |
corvus | clarkb: yeah, and there are no memory hogs, so i'm kinda curious what's getting swapped out | 23:29 |
clarkb | corvus: ya top says ~4% for executor then a bunch of 1%ish ansible playbooks | 23:30 |
mordred | notmyname: https://zuul-ci.org/docs/zuul/developer/javascript.html also might be helpful, toolchain-wise | 23:30 |
clarkb | corvus: apparently /proc can tell us /me looks | 23:30 |
corvus | clarkb: oh, tell me about this magic when you get a chance :) | 23:31 |
*** mriedem has quit IRC | 23:31 | |
mwhahaha | clarkb: actually this is caused by https://review.openstack.org/#/c/618669/10/playbooks/tripleo-ci/post.yaml in the gate | 23:31 |
clarkb | corvus: VmSwap: 2663556 kB for the executor says /proc. Uh it's grep VmSwap in /proc/$pid/status | 23:31 |
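The same check, wrapped up so it can be pointed at any pid (e.g. the zuul-executor process):

    # The /proc check described above: report VmRSS and VmSwap for a pid.
    # Both are standard fields in /proc/<pid>/status on Linux.
    def mem_status(pid):
        fields = {}
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("VmRSS", "VmSwap"):
                    fields[key] = value.strip()
        return fields

    print(mem_status(12345))  # placeholder pid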
* fungi throws rip torn style glitter in the air and shouts "kernel magic!" | 23:32 | |
clarkb | mwhahaha: ah, so I got ahead of myself with the debugging. Though not sure if that change will fail on its own | 23:32 |
mwhahaha | it won't | 23:32 |
mwhahaha | but it didn't fail that job either | 23:32 |
clarkb | corvus: I think that's a big chunk of it there, odd that its resident size is comparatively tiny | 23:33 |
mwhahaha | unless i'm missing something, http://logs.openstack.org/99/616199/4/gate/tripleo-buildimage-overcloud-full-centos-7/c027629/job-output.txt.gz#_2018-12-05_22_56_13_434974 seems to show that the error was just ignored | 23:34 |
clarkb | mwhahaha: it resulted in a post-failure result for the job. I think the "normal" there means that we didn't short circuit or abandon the play, it finished "normally" with error | 23:34 |
mwhahaha | ok i'll kick the other change out of the gate | 23:35 |
corvus | clarkb: i don't see much of significance that has changed in the executor code over the past few months | 23:35 |
clarkb | corvus: I wonder if this is side effect of how python does dicts and data structures in general. basically it grows them and they never shrink (granted releasing memory back to OS is not easy outside of python either) | 23:36 |
clarkb | corvus: at some point the executor may have allocated a bunch of memory for "something" | 23:37 |
clarkb | and now it's not needed so we see the delta between rss and vmswap | 23:37 |
fungi | yes, it's not uncommon to see a python process gobbling lots of memory but with a small resident size | 23:37 |
pabelanger | oh, I didn't even think to check cacti.o.o. Yah, wonder what is going on there | 23:37 |
fungi | we likely need some sort of python memory profiler imported and logging sizes of (used and cleared) structures to know where it's all going | 23:38 |
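One concrete option for that is the stdlib tracemalloc module; a generic sketch (not wired into zuul) would be started inside the executor process and dumped periodically:

    # Generic tracemalloc sketch, not actual zuul instrumentation.
    import tracemalloc

    tracemalloc.start(25)   # keep 25 frames of traceback per allocation
    # ... let the workload run for a while ...
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)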
corvus | http://cacti.openstack.org/cacti/graph_view.php?action=preview&host_id=0&rows=100&columns=-1&graph_template_id=41&thumbnails=true&filter=ze | 23:38 |
clarkb | started ~week and a half ago? likely unrelated to the relative priority changes then | 23:39 |
clarkb | but they all started doing it together | 23:39 |
clarkb | (for the most part) | 23:39 |
fungi | ze02 is either getting overwhelmed or is suffering unrelated packet loss from cacti.o.o (ongoing rackspace ipv6 issues?) | 23:40 |
clarkb | fungi: it has hadthose issues in the past | 23:40 |
fungi | ~1.5 weeks ago is when a lot of our contributors decided their holiday vacation following the summit was over | 23:40 |
clarkb | maybe we should just rebuild it | 23:40 |
corvus | clarkb: the only restart i have in the logs is from nov 28. | 23:41 |
corvus | which seems right around the beginning of the increase | 23:41 |
corvus | that was 3.3.1.dev61 (whatever that is) | 23:42 |
corvus | i don't know what it was running before | 23:42 |
pabelanger | that's when we did the security fix, right? | 23:42 |
corvus | yep | 23:42 |
fungi | so could be that a change between the 2018-11-28 and the restart before that introduced a memory leak, or just a memory inefficiency maybe | 23:42 |
fungi | doesn't look especially leaky as it's not really climbing | 23:43 |
*** graphene has quit IRC | 23:43 | |
clarkb | corvus: 8715505e6d38c092257179b8a089a2a560df5e58 that should be 3.3.1.dev61 according to git describe | 23:44 |
clarkb | that was the restart for the security fix | 23:44 |
*** _alastor_ has quit IRC | 23:44 | |
corvus | clarkb: makes sense | 23:44 |
*** wolverineav has joined #openstack-infra | 23:45 | |
clarkb | looking at commits before that, the semaphore fix, the "python3 can't do unbuffered binary io" fix, and possibly "Include executor variables in initial setup call" stand out to me reading the log | 23:47 |
clarkb | ah we don't run the console logger thing under python3 currently so that binary io one should be a noop | 23:48 |
clarkb | however, that does make me wonder if the place where we might be allocating large buffers then not using them much of the time is the log streaming | 23:49 |
clarkb | that goes through the executor right? | 23:49 |
*** jamesmcarthur has joined #openstack-infra | 23:50 | |
*** bobh has joined #openstack-infra | 23:51 | |
corvus | clarkb: zuul_console runs on the remote node, so shouldn't affect things. yes, the executor has a subprocess which handles log streaming | 23:51 |
corvus | clarkb: it's not the one that's eating up all that swap | 23:51 |
corvus | it's an order of magnitude smaller | 23:51 |
corvus | on ze07: 2018-11-15 23:13:16,178 DEBUG zuul.Executor: Configured logging: 3.3.1.dev31 | 23:52 |
corvus | it must have rebooted or something | 23:53 |
corvus | its graphs don't look like the rest | 23:53 |
clarkb | b31b866fbca35939204bb69c447cd68412efa9ce dev31 is that commit I think | 23:54 |
*** jamesmcarthur has quit IRC | 23:54 | |
corvus | and 2018-11-06 19:28:34,195 DEBUG zuul.Executor: Configured logging: 3.3.1.dev15 | 23:55 |
corvus | on ze09 | 23:55 |
corvus | also has different-looking graphs | 23:55 |
clarkb | between 31 and 61 in the executor server file there are changes to job output parsing for the line-based commenting | 23:56 |
corvus | ze07's increased usage starts shortly after that restart | 23:56 |
clarkb | other than that it's just winrm fixes | 23:56 |
corvus | ze09 looks suspicious after its restart as well | 23:57 |
clarkb | actually maybe it's not per-line commenting. But it is something that iterates through the log file then modifies it if an error condition is met | 23:58 |
clarkb | 5e9f77326 is the commit that changes things in that part of the code | 23:59 |
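Purely as an illustration of the concern (not the actual zuul code): whether a job-output scan reads the whole file at once or streams it line by line makes a large difference to peak memory on multi-hundred-MB logs.

    # Illustrative only; not the zuul implementation. The first variant holds
    # the whole log in memory, the second streams it one line at a time.
    def scan_all_at_once(path):
        data = open(path).read()
        return [l for l in data.splitlines() if "ERROR" in l]

    def scan_streaming(path):
        with open(path) as f:
            return [l for l in f if "ERROR" in l]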