*** melwitt has joined #openstack-infra | 00:03 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 00:04 |
*** zz_ewindisch is now known as ewindisch | 00:08 | |
*** vipul-away is now known as vipul | 00:08 | |
*** CaptTofu has joined #openstack-infra | 00:09 | |
*** ewindisch is now known as zz_ewindisch | 00:10 | |
*** sarob has joined #openstack-infra | 00:11 | |
*** MarkAtwood has quit IRC | 00:13 | |
pabelanger | A few weeks / month ago somebody was suggesting a graphic rendering lib for rst docs... it wasn't graphviz but something else | 00:14 |
pabelanger | there was some talk about maybe using it for -infra documentation | 00:14 |
*** rnirmal has quit IRC | 00:14 | |
clarkb | pabelanger: I think it was hashar, but I forget what the lib was called | 00:15 |
pabelanger | clarkb: Ya, I thought it was hashar too | 00:15 |
zaro | clarkb: hey, i just got back. i'm just finishing up the gerrit testing, was gonna put it aside to start hacking on the scp-plugin tomorrow. | 00:16 |
clarkb | zaro: great, thanks | 00:16 |
pabelanger | http://blockdiag.com/ | 00:19 |
pabelanger | eavesdrop.o.o to the rescue | 00:19 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 00:21 |
*** wenlock has quit IRC | 00:21 | |
mattoliverau | I haven't read the email, and this is just me thinking out loud, but in regards to rate limiting how about doing something similar to TCP windowing: pick a low point that the queue will never be smaller than, say 20. Then every time a patch is merged increase the queue by X, say 1. Every time there needs to be a reset, be brutal, like halve the queue size, requeuing the stuff taken off in a high-priority | 00:22 |
mattoliverau | queue. This would mean when there are lots of resets the queue will be smaller and smaller, so less ref re-pointing and hopefully push through all the congestion. When working again the queue will continue to build up. | 00:22 |
mattoliverau | Again, i'm new, so just my 2 cents. | 00:22 |
clarkb | mattoliverau: yup that was my thinking | 00:22 |
clarkb | tcp slow start | 00:22 |
clarkb | it has its faults, you almost never hit peak efficiency, but it does work at protecting you | 00:22 |
mattoliverau | clarkb: Lol, missed your comment with the name slow start :) | 00:23 |
mattoliverau | clarkb: of course, but it's somewhere in between: better than a fixed queue length, but without the problem of a huge queue when zuul is needed most. | 00:23 |
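A minimal sketch of the windowing scheme being discussed here, assuming hypothetical parameters (a floor of 20, +1 per merged change, halve on reset); this illustrates the idea, it is not zuul code.

```python
# Sketch of the proposed AIMD-style window for the gate queue.
# Hypothetical parameters: never shrink below 20, grow by 1 per merged
# change, halve on a gate reset. Not zuul code.
class GateWindow:
    def __init__(self, floor=20, increase=1):
        self.floor = floor
        self.increase = increase
        self.size = floor

    def on_merge(self):
        # Additive increase: each successful merge lets one more change in.
        self.size += self.increase

    def on_reset(self):
        # Multiplicative decrease: halve on failure, but never drop below
        # the floor. Changes pushed out would go to a high-priority requeue
        # (not modelled here).
        self.size = max(self.floor, self.size // 2)

window = GateWindow()
for event in ["merge"] * 10 + ["reset"] + ["merge"] * 3:
    window.on_merge() if event == "merge" else window.on_reset()
    print(event, "->", window.size)
```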
*** vipul is now known as vipul-away | 00:24 | |
*** mrodden1 has quit IRC | 00:31 | |
*** vipul-away is now known as vipul | 00:34 | |
*** fifieldt has joined #openstack-infra | 00:36 | |
*** ok_delta has quit IRC | 00:37 | |
*** odyssey4me has quit IRC | 00:37 | |
*** sarob has quit IRC | 00:40 | |
*** wenlock has joined #openstack-infra | 00:40 | |
*** nati_uen_ has joined #openstack-infra | 00:40 | |
*** smurugesan has quit IRC | 00:40 | |
*** gokrokve has joined #openstack-infra | 00:41 | |
*** yamahata has quit IRC | 00:42 | |
*** michchap_ has quit IRC | 00:43 | |
*** michchap has joined #openstack-infra | 00:43 | |
*** nati_ueno has quit IRC | 00:43 | |
*** odyssey4me has joined #openstack-infra | 00:46 | |
clarkb | this is interesting: the "run handler sleeping" / "run handler awake" log messages haven't happened for 15 minutes, so that is what is starving us | 00:46 |
clarkb | something is spending a lot of time in the middle of that loop | 00:46 |
clarkb | sdague: ^ | 00:46 |
openstackgerrit | A change was merged to openstack-dev/hacking: Move hacking guide to root directory https://review.openstack.org/62132 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Cleanup HACKING.rst https://review.openstack.org/62133 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Re-Add section on assertRaises(Exception https://review.openstack.org/62134 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Turn Python3 section into a list https://review.openstack.org/62135 | 00:47 |
openstackgerrit | A change was merged to openstack-dev/hacking: Add Python3 deprecated assert* to HACKING.rst https://review.openstack.org/62136 | 00:47 |
*** mrodden has joined #openstack-infra | 00:48 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Moved homepage content to about page. https://review.openstack.org/67344 | 00:50 |
clarkb | sdague: I am digging through logs now to see if I can determine where it is starving itself | 00:50 |
*** hogepodge has quit IRC | 00:50 | |
*** harlowja is now known as harlowja_away | 00:50 | |
*** CaptTofu has quit IRC | 00:53 | |
*** melwitt has quit IRC | 00:58 | |
*** melwitt1 has joined #openstack-infra | 00:59 | |
clarkb | it looks like it takes that long to submit all of the gearman jobs after a gate reset | 00:59 |
*** melwitt1 has quit IRC | 01:03 | |
*** sarob has joined #openstack-infra | 01:05 | |
*** sarob has quit IRC | 01:05 | |
*** CaptTofu has joined #openstack-infra | 01:05 | |
*** sarob has joined #openstack-infra | 01:06 | |
*** dkranz has joined #openstack-infra | 01:07 | |
*** harlowja_away is now known as harlowja | 01:08 | |
*** melwitt has joined #openstack-infra | 01:08 | |
clarkb | the bulk of the time was spent resetting the gate | 01:08 |
clarkb | 2014-01-17 00:31:52,791 DEBUG zuul.DependentPipelineManager: Starting queue processor: gate | 01:08 |
clarkb | 2014-01-17 00:47:17,732 DEBUG zuul.DependentPipelineManager: Finished queue processor: gate (changed: True) | 01:08 |
*** sarob_ has joined #openstack-infra | 01:08 | |
clarkb | that is ~15 minutes of just dealing with a gate reset, which is bad considering how often the gate resets | 01:09 |
openstackgerrit | Eric Guo proposed a change to openstack/requirements: Have tox install via setup.py develop https://review.openstack.org/66549 | 01:09 |
mordred | clarkb: wow | 01:10 |
*** sarob has quit IRC | 01:10 | |
*** sarob has joined #openstack-infra | 01:12 | |
clarkb | it is taking 9-11 seconds to do git reset, git remote update, git reset --hard $BRANCH, git merge $patchset, then create a ref that zuul can advertise to the testers | 01:13 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Added apache license to footer https://review.openstack.org/67347 | 01:13 |
dkranz | Scrolling back, this might be a bad time to say this but I did a reverify with bug number on https://review.openstack.org/#/c/63934/ which closes the error-in-log-file hole. | 01:14 |
clarkb | 90*9 = ~13 minutes | 01:14 |
clarkb | so that accounts for the bulk of the reset time | 01:14 |
clarkb | knowing that, I think jeblair's farm-of-zuul-workers plan is a really good one | 01:14 |
*** sarob has quit IRC | 01:15 | |
clarkb | if we can distribute that work instead of doing it serially we should be able to get that number much smaller | 01:15 |
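Rough arithmetic behind the figures above; the per-change cost is the value quoted in the conversation, and the worker count is a hypothetical figure for comparison.

```python
# Rough reset-cost arithmetic: ~90 changes, each needing ~9 s of serial
# git work; worker count is a hypothetical figure for comparison.
changes, seconds_per_change = 90, 9
serial = changes * seconds_per_change          # 810 s, roughly 13.5 minutes
workers = 8                                    # hypothetical worker pool size
distributed = serial / workers                 # ~101 s, under 2 minutes
print(f"serial ~{serial / 60:.1f} min, distributed ~{distributed / 60:.1f} min")
```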
*** sarob_ has quit IRC | 01:15 | |
*** sarob has joined #openstack-infra | 01:15 | |
clarkb | now it is also possible that the git repos themselves are degrading and are usually faster | 01:16 |
clarkb | which isn't that far-fetched, as sdague indicated zuul had much better performance previously. Restarting zuul won't fix the problem, but clearing out the git repos or otherwise repairing them might | 01:16 |
*** sarob has quit IRC | 01:18 | |
clarkb | http://paste.openstack.org/show/61413/ I have filtered out everything but the git checkouts there. This shows the amount of time between each git checkout, which is roughly the amount of time it takes to do a checkout, reset, merge, etc. | 01:19 |
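A sketch of the filtering described here: take the timestamp off each checkout line in the zuul debug log and print the gap to the previous one. The sample lines are placeholders in the log format quoted earlier in the conversation ("2014-01-17 00:31:52,791 DEBUG ..."), not real zuul output.

```python
from datetime import datetime

def checkout_gaps(lines):
    # Yield the seconds between consecutive "checkout" lines.
    prev = None
    for line in lines:
        if "checkout" not in line.lower():
            continue
        ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            yield (ts - prev).total_seconds()
        prev = ts

sample = [
    "2014-01-17 00:32:01,100 DEBUG zuul.Repo: Checking out master",
    "2014-01-17 00:32:10,400 DEBUG zuul.Repo: Checking out master",
]
print(list(checkout_gaps(sample)))  # [9.3]
```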
mordred | clarkb: I wonder if git remote update is potentially too heavy of a hammer too. (although the farm of workers is better) | 01:20 |
clarkb | we might also try using a newer version of git on the zuul box | 01:20 |
clarkb | we can use https://launchpad.net/~git-core/+archive/ppa to get newer git on zuul.o.o | 01:21 |
*** zhiwei has joined #openstack-infra | 01:21 | |
*** sarob has joined #openstack-infra | 01:21 | |
clarkb | mordred: it may be | 01:21 |
mordred | clarkb: git fetch remotes/origin/$BRANCH ; git reset --hard FETCH_HEAD might do slightly less work | 01:21 |
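A hedged sketch of mordred's narrower-fetch idea using GitPython's command pass-through; the repo path and branch are placeholders, and whether this actually beats `git remote update` would need measuring.

```python
# Fetch just the branch we care about, then hard-reset to what was fetched,
# rather than updating every remote ref. Path and branch are placeholders.
import git

repo = git.Repo("/var/lib/zuul/git/openstack/nova")
repo.git.fetch("origin", "master")
repo.git.reset("--hard", "FETCH_HEAD")
```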
*** zhiwei has quit IRC | 01:22 | |
clarkb | mordred: there is a big time delta between updating repository and the next step | 01:22 |
* clarkb looks at that code | 01:22 | |
mordred | clarkb: as in, the remote update step is taking a long time? | 01:22 |
clarkb | ya | 01:23 |
*** sarob has quit IRC | 01:24 | |
*** sarob has joined #openstack-infra | 01:24 | |
clarkb | yup, looks like the vast majority of time is in the remote update step | 01:25 |
clarkb | it is happening in GitPython though. need to read up on it to see if we can make that smarter | 01:25 |
*** pcrews_ has quit IRC | 01:27 | |
*** melwitt has quit IRC | 01:27 | |
openstackgerrit | A change was merged to openstack-infra/config: Increase timeouts for jobs doing tempest runs https://review.openstack.org/66379 | 01:28 |
*** sarob has quit IRC | 01:29 | |
mordred | clarkb: will you point me to the part of the code you're looking at? | 01:29 |
clarkb | mordred: I am digging through zuul/merger.py. mergeChanges() is the function that seems to do the work | 01:30 |
clarkb | mordred: the repo update only happens once per project:branch relationship during a reset | 01:32 |
clarkb | so while it is costly when it happens it isn't the biggest cost. the git checkouts seem to be most painful | 01:32 |
mordred | really? | 01:32 |
clarkb | ya checkout happens for each change so * 90 | 01:32 |
*** xchu has joined #openstack-infra | 01:32 | |
clarkb | and takes about as much time as an update | 01:32 |
mordred | is that just because it's modifying a working tree? | 01:33 |
*** sdake has quit IRC | 01:33 | |
clarkb | oh possibly as git has to reflect the changes on disk | 01:34 |
mordred | can I make a REALLY stupid suggestion? | 01:34 |
mordred | what if we ran it under eatmydata? | 01:34 |
*** zhiwei has joined #openstack-infra | 01:35 | |
mattoliverau | Can you wrap the git checkout + other git reset stuff into some python thread so they can be done in parallel? That way it shouldn't be 90*9 | 01:35 |
clarkb | hmm that is an interesting question. my initial thought was "are you crazy?", my second thought is that may just be an incredible idea | 01:35 |
clarkb | mattoliverau: we can, that is what jeblair's make workers do the work idea gets at | 01:36 |
clarkb | mattoliverau: I think we will end up doing that regardless, but we need a short term solution | 01:36 |
*** afazekas has quit IRC | 01:36 | |
clarkb | mordred: eatmydata disables fsync? does that mean no data will ever get synced or it will sync whenever the OS feels like it? | 01:36 |
clarkb | mordred: my biggest concern now is that zuul relies on disk persistence to do graceful restarts | 01:37 |
clarkb | mordred: I am pretty sure that will break if we put zuul under eatmydata | 01:37 |
mattoliverau | so how about tmpfs then? no disk IO then, only ram. | 01:38 |
mordred | clarkb: hrm. good point | 01:38 |
mordred | yeah - tmpfs would be the next question - but I don't think we have the ram to handle all of the repo size | 01:38 |
mordred | I lied | 01:39 |
*** gokrokve has quit IRC | 01:39 | |
mordred | mordred@zuul:~$ sudo du -chs /var/lib/zuul/git/ | 01:39 |
mordred | 2.8G    /var/lib/zuul/git/ | 01:39 |
*** gokrokve has joined #openstack-infra | 01:40 | |
clarkb | tmpfs sounds like a great idea | 01:40 |
mattoliverau | so we may be able to put /var/lib/zuul/git under tmpfs and bypass disk altogether; if it doesn't work then it just means it isn't disk IO causing issues. | 01:40 |
clarkb | mordred: I think if we stop zuul, overlay a mount on /var/lib/zuul/git then start zuul it will just reclone everything | 01:40 |
*** pcrews has joined #openstack-infra | 01:41 | |
clarkb | currently git has about 4GB of cached and buffered memory | 01:41 |
*** thuc has quit IRC | 01:41 | |
clarkb | so a 2.8GB filesystem may eat into that in ways that are unhappy, though I bet a good chunk of that cache is for the git stuff | 01:41 |
*** thuc has joined #openstack-infra | 01:42 | |
mattoliverau | How much RAM is the system using for everything else? Is the server's RAM under-utilised? I guess I could just go check out cacti :) | 01:42 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all | 01:43 |
*** gokrokve has quit IRC | 01:44 | |
clarkb | zuul is 4GB virt, 1.3GB resident, geard is 1GB virt, 760MB resident | 01:44 |
clarkb | then recheckwatch, puppet, and apache processes hang around with about 100MB apiece | 01:44 |
mordred | we could bump it to 8G if putting the temp merge location in ram would be good, ya know? | 01:46 |
*** thuc has quit IRC | 01:46 | |
clarkb | it is 8GB now | 01:46 |
clarkb | it is an 8vcpu 8GB rackspace performance node | 01:47 |
mordred | meh | 01:47 |
mattoliverau | Yeah, the rest is mainly in cached memory. So the question is, how much will the kernel actually give back to us.. what is the real free figure. | 01:47 |
mattoliverau | clarkb: there was a talk at LCA about this during the sysadmin miniconf. | 01:48 |
clarkb | I missed it :( | 01:49 |
dstufft | mordred: clarkb fungi So pip 1.5.1rc1 and 1.11.1rc2 just dropped, if you're at all able to run them through the paces in the openstack infra to make sure we fixed all your issues that would be really really awesome | 01:49 |
*** yamahata has joined #openstack-infra | 01:50 | |
*** jp_at_hp has quit IRC | 01:51 | |
mattoliverau | if I remember correctly, we can check meminfo: cat /proc/meminfo |grep -i active | 01:51 |
clarkb | 4836276 kB | 01:52 |
mattoliverau | whatever the figure is for inactive should be currently what the kernel can dump (and thus give back at this point in time) | 01:52 |
mordred | dstufft: I'd love to - the gate is so slammed though I don't think we're likely able to run anything with a difference - but I'll see what I can cook up | 01:52 |
clarkb | mattoliverau: inactive is 2269672 kB | 01:52 |
*** wenlock has quit IRC | 01:52 | |
dstufft | mordred: ok I totally understand if you can't fwiw :) Mostly I want to avoid another upgrade apocalypse | 01:52 |
clarkb | oh there are several inactive categories, are they distinct or subsets? | 01:53 |
mattoliverau | clarkb: so if I am correct, we may only get about 2.2 G back. | 01:53 |
*** locke105 has joined #openstack-infra | 01:53 | |
clarkb | looks like they add up so we only need that value above | 01:53 |
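A small illustration of the check being discussed: read the aggregate Inactive figure from /proc/meminfo (Linux only), which is roughly what the kernel could hand back without touching swap.

```python
# Read the aggregate Inactive value (kB) from /proc/meminfo -- roughly
# what the kernel could give back without swapping. Linux only.
def inactive_kb():
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("Inactive:"):   # skips Inactive(anon)/(file)
                return int(line.split()[1])
    return None

print(inactive_kb(), "kB inactive")
```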
mordred | dstufft: well, we're blocking >=1.5 anyway - so I think we can test upgrading to 1.5.1 at our leisure | 01:53 |
* mordred is still excited for his new 1.5 overlord | 01:53 | |
mattoliverau | I might go find the talk in question, the videos are up and it only went for 10 minutes or so. | 01:54 |
clarkb | if we want to keep 8vcpu we can go up to a 30GB perf node | 01:55 |
clarkb | that will give us plenty of room for a ~16GB tmpfs | 01:55 |
mattoliverau | Yeah, that might be a good idea, that would give us room to grow. | 01:56 |
mordred | clarkb: use 30G perf nodes for all the things!!! | 01:56 |
clarkb | I think that is a not so crazy idea, but it is also late on thursday | 01:56 |
mattoliverau | http://is.gd/U9kBon | 01:56 |
clarkb | would be curious to get fungi's input | 01:56 |
mordred | clarkb: we should make the build farm use 30G perf nodes | 01:56 |
mattoliverau | the talk in question ^^ | 01:57 |
mattoliverau | I think | 01:57 |
mordred | can you imagine just how quickly pvo would show up and scold us? | 01:57 |
mattoliverau | mordred: lol | 01:57 |
clarkb | I also think newer git is worth a shot, the version of git we are running is pretty old | 01:57 |
mordred | clarkb: ++ | 01:57 |
mattoliverau | can't hurt, in theory the code has to be more efficient.. unless linus broke something :P | 02:00 |
clarkb | ya I am doing some quick unscientific tests locally | 02:01 |
mattoliverau | Lol the best kind of test ;) | 02:01 |
*** nosnos has joined #openstack-infra | 02:04 | |
*** senk has quit IRC | 02:04 | |
clarkb | git checkouts weren't any better. git clone was about 20 seconds faster for nova | 02:05 |
clarkb | also I just realized this is GitPython so it may be doing some stuff in python | 02:05 |
mattoliverau | that's true, makes it hard to determine the bottleneck. were those times from gitpython or git? | 02:06 |
dstufft | I think GitPython just shells out | 02:06 |
dstufft | but I might be thinking of a different project | 02:06 |
clarkb | dstufft: it does for some stuff and not for others iirc | 02:06 |
*** jhesketh__ has joined #openstack-infra | 02:06 | |
clarkb | also they use tabs in their source so now I don't want to read it | 02:06 |
dstufft | clarkb: I've learned to avoid reading other people's source code unless I really want to be caremad | 02:07 |
dstufft | (it's too late not to read my own :( ) | 02:07 |
dims | just peeked at gate queue, looks like it crept up to 104 | 02:08 |
clarkb | dstufft: it appears to shell out for checkout | 02:09 |
*** gyee is now known as gyee_nothere | 02:09 | |
*** adrian_otto has joined #openstack-infra | 02:10 | |
*** CaptTofu has quit IRC | 02:10 | |
*** gokrokve has joined #openstack-infra | 02:10 | |
dstufft | clarkb: also question, is this cloning stuff to run tests on it? | 02:10 |
adrian_otto | are our Zuul workers clogged up? I have 4 Solum gerrit reviews that have no votes on them from jenkins, dating back over the past ~4 hours. | 02:11 |
dstufft | e.g. is it a read only clone and are you or can you use a shallow clone to make it go faster? | 02:11 |
clarkb | adrian_otto: no zuul is clogged up | 02:11 |
clarkb | dstufft: we can't shallow clone for reasons. this is the repo zuul is using to build the refs that get tested | 02:11 |
clarkb | iirc it needs all the refs in order to build the zuul refs which a shallow clone won't give you | 02:12 |
adrian_otto | clarkb: no zuul is clogged, or no it is not? | 02:12 |
dstufft | clarkb: ok! | 02:12 |
dstufft | I don't know much about zuul so :( | 02:12 |
clarkb | adrian_otto: no, zuul is clogged | 02:12 |
clarkb | the workers themselves are fine | 02:12 |
adrian_otto | clarkb: ok, thanks | 02:12 |
*** slong- is now known as slong-afk | 02:15 | |
*** gothicmindfood has joined #openstack-infra | 02:15 | |
*** pballand has quit IRC | 02:21 | |
*** julim has joined #openstack-infra | 02:22 | |
*** yaguang has joined #openstack-infra | 02:23 | |
clarkb | adrian_otto: long story short is that the longer the gate queue gets the more time zuul spends resetting it (currently a full gate reset takes more than 15 minutes), and while it is doing that reset the zuul scheduler does nothing else. There are plans to make that better (farming the expensive git work out to workers to allow massive scale out, and we have been fiddling with using a tmpfs as the cost of | 02:23 |
clarkb | disk seems to hurt quite a bit) | 02:23 |
*** julim has quit IRC | 02:24 | |
*** portante_ is now known as portante | 02:24 | |
*** gothicmindfood has quit IRC | 02:27 | |
adrian_otto | clarkb: thanks for the detail. Can you help me understand what a gate reset is, and why it happens? | 02:29 |
clarkb | adrian_otto: the gate pipeline is where we test serialized changes in parallel. change A gets approved first and goes onto the head of the queue, then change B gets approved and gets added behind A. Instead of waiting for A to merge before testing B we test B with A assuming A will pass and merge | 02:30 |
clarkb | adrian_otto: when A does not pass and merge we have to retest B without A as the previous scenario is no longer valid | 02:31 |
clarkb | that is a gate reset. | 02:31 |
clarkb | when you have 102 changes in the pipeline something failing at the head of the queue means we have to cancel jobs for 101 changes, then completely rebuild the git refs to test 101 changes (the 102nd is removed as it failed) then restart all of the tests | 02:32 |
fungi | except in the current gate it's a plus b plus c plus... plus z and then repeat the alphabet several more times | 02:32 |
adrian_otto | ok, so that sounds like a definite design weakness in zuul | 02:32 |
clarkb | adrian_otto: its not a design weakness in zuul, it is a problem with speculative merging and testing | 02:33 |
adrian_otto | isn't that the key feature that makes zuul compelling? | 02:33 |
clarkb | yes | 02:33 |
clarkb | adrian_otto: in the best case you merge all 102 changes at one time and your time to test is O(1) | 02:34 |
fungi | adrian_otto: more to the point, consider the integrated projects to basically be one software project with more than a thousand developers approving a hundred changes a day and trying to make sure every change passes the entire integration test suite prior to letting it merge | 02:34 |
clarkb | when you are consistently failing that goes to O(n) | 02:34 |
clarkb | in the previous state you were in O(n) | 02:34 |
clarkb | so this is a win over the old state, but in the worst case is still bad | 02:34 |
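A worked example of the best-case/worst-case point just made, assuming a one-hour job wall time for the 102-change queue mentioned above; the numbers are illustrative only.

```python
# Illustrative best/worst case for a 102-change speculative queue,
# assuming a one-hour job wall time (numbers are made up for scale).
changes = 102
job_hours = 1.0
best_case = job_hours              # O(1): everything tested in parallel and passes
worst_case = changes * job_hours   # O(n): a failure at the head on every pass
print(f"best ~{best_case:.0f} h, worst ~{worst_case:.0f} h")
```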
adrian_otto | indeed | 02:35 |
fungi | the alternative, which a lot of projects settle for, is merge first, then test periodically and see if the published software is obviously broken, then try to bisect and hope you can narrow down which commit to revert | 02:35 |
adrian_otto | so might it make sense to use an admission control strategy? | 02:35 |
adrian_otto | so the queue is limited? | 02:36 |
clarkb | adrian_otto: see scrollback :) | 02:36 |
adrian_otto | that might speed up the reset case, at the cost of some concurrency in the best case | 02:36 |
*** nati_uen_ has quit IRC | 02:37 | |
clarkb | jeblair has historically been opposed to rate limiting the size of a zuul queue. I have argued for the feature in the past. I think something simple like TCP's slow start would help quite a bit | 02:37 |
adrian_otto | thanks for the additional detail! | 02:37 |
*** nati_ueno has joined #openstack-infra | 02:37 | |
clarkb | at LCA jeblair seemed to be more on board with adding something like that to zuul | 02:38 |
adrian_otto | you can still have a backlog that's not part of the active queue | 02:38 |
clarkb | yup | 02:38 |
mattoliverau | It was the sun and the Aussie beer ;P | 02:38 |
adrian_otto | and spoon feed the active queue so it remains a more optimal length | 02:39 |
*** yamahata has quit IRC | 02:39 | |
clarkb | adrian_otto: exactly, just like a tcp connection | 02:39 |
adrian_otto | yep | 02:39 |
clarkb | well tcp rarely if ever hits optimal state, but it is consistently not worst case | 02:39 |
*** dstanek has joined #openstack-infra | 02:46 | |
lifeless | clarkb: mmm | 02:46 |
lifeless | clarkb: you could argue that tcp is nothing but worst case :) | 02:46 |
clarkb | lifeless: maybe when your latency is NZ bad | 02:46 |
clarkb | :) | 02:46 |
StevenK | clarkb: Well, it's a sliding window, and also best effort. | 02:48 |
StevenK | clarkb: However, I agree with you -- I think checking a queue of 90-100 all the time is bong, and we should limit it to a window | 02:48 |
*** jishaom has joined #openstack-infra | 02:49 | |
*** odyssey4me has quit IRC | 02:53 | |
*** carl_baldwin has joined #openstack-infra | 02:54 | |
*** AaronGr is now known as AaronGr_Zzz | 02:57 | |
notmyname | adrian_otto: since you were asking about stuff, I threw together a quick graph for you http://not.mn/solum_gate_status.html | 03:01 |
notmyname | adrian_otto: if that's not the right jobs, let me know (or open a pull request--the repo link is at the bottom) | 03:02 |
*** odyssey4me has joined #openstack-infra | 03:02 | |
*** jhesketh__ has quit IRC | 03:02 | |
sdague | clarkb: so can you promote this now - https://review.openstack.org/#/c/65805/ ? | 03:02 |
*** rakhmerov has quit IRC | 03:02 | |
*** jhesketh_ has quit IRC | 03:03 | |
sdague | if the theory on load is correct, that should level things out a bunch | 03:03 |
*** jhesketh_ has joined #openstack-infra | 03:03 | |
clarkb | sdague: it has been promoted, should see it in a bit | 03:04 |
*** rossella_s has joined #openstack-infra | 03:05 | |
sdague | ok, just looked at the queue and it was still at the bottom | 03:05 |
sdague | but I guess we're just processing the events still? | 03:05 |
*** jhesketh__ has joined #openstack-infra | 03:05 | |
clarkb | ya the promotion takes ~15 minutes according to fungi | 03:05 |
clarkb | sdague see sb for long explanation for zuul slowness | 03:06 |
sdague | yep, just read it | 03:06 |
clarkb | i hunted it down. tldr really long gate is expensive | 03:06 |
sdague | clarkb: right, especially as it starves out the other events | 03:06 |
*** pballand has joined #openstack-infra | 03:06 | |
sdague | the tmpfs approach look promissing? | 03:06 |
*** yaguang has quit IRC | 03:09 | |
clarkb | ya, walking home now, was hoping to chat with fungi about that when I get back | 03:10 |
*** yaguang has joined #openstack-infra | 03:10 | |
sdague | cool | 03:10 |
*** HenryG has joined #openstack-infra | 03:10 | |
*** krotscheck has quit IRC | 03:10 | |
*** ArxCruz has quit IRC | 03:11 | |
*** zhiyan has joined #openstack-infra | 03:12 | |
sdague | so I guess the other question is if we're taking forever to reset with the change that we think will make this better, would it make sense to just dump the gate queue at this point? | 03:13 |
sdague | the d-g just popped to the head | 03:14 |
*** salv-orlando has joined #openstack-infra | 03:15 | |
notmyname | sdague: if stuff is getting promoted, then dumping the gate feels like something to do just to do something | 03:16 |
sdague | notmyname: sure, though given that we can't allocate devstack nodes to jobs until the gate reset finishes, it's still adding 15 minutes additional friction on each hit. Which while small, adds up. | 03:18 |
sdague | clarkb / fungi: looks like a bad py26 node - https://jenkins01.openstack.org/job/gate-nova-python26/17060/console | 03:19 |
notmyname | yes, but I'm working on getting a patch through for the past 12 hours, and I've got another dependency that's been over 50 hours in the gate with over 13 resets. an extra 15 minutes really isn't much | 03:19 |
sdague | it's not one extra 15 minutes, it's 15 * failing tests in gate (and right now there are at least 2 py 2.6 unit test failures that I see) | 03:21 |
fungi | i've taken centos6-1 offline | 03:22 |
fungi | thanks sdague | 03:22 |
sdague | 7 py26 unit tests fails... at least | 03:22 |
fungi | i'm also caught back up on scrollback since dinner now. i am a dismally slow reader | 03:22 |
sdague | yeh, about 40% of the gate jobs right now are in a fail state because of that py26 node | 03:23 |
sdague | zuul hasn't noticed yet because it's still processing the first promote | 03:23 |
fungi | i agree, in light of the performance breakdown, that saving the state of the pipelines and gracefully stopping zuul, mounting a suitably large tmpfs on /var/lib/zuul/git, starting zuul and restoring the changes would likely help performance | 03:24 |
*** rakhmerov has joined #openstack-infra | 03:25 | |
fungi | the +/- buffers/cache amount is a good bit larger than a du of that dir | 03:26 |
fungi | and zuul has a ton of swap for spillover if that ends up being an underestimate | 03:26 |
*** adrian_otto has quit IRC | 03:28 | |
*** dcramer_ has quit IRC | 03:28 | |
fungi | 4g tmpfs should be doable looking at the present state of the server | 03:29 |
mattoliverau | fungi: you need to check the active and inactive memory in meminfo to see how much the kernel will really give back to you; +/- buffers is a bit of a lie. | 03:30 |
mattoliverau | but yeah there is swap.. so long as it swaps out something it doesn't need again :) | 03:30 |
fungi | yeah, but some of what's currently resident is safe to page out | 03:31 |
fungi | more a question of how much | 03:31 |
*** pballand has quit IRC | 03:31 | |
*** nati_uen_ has joined #openstack-infra | 03:31 | |
fungi | active(anon) is under 3g | 03:33 |
mattoliverau | what is inactive | 03:33 |
fungi | there's a fair amount of active(file) but i anticipate that being git | 03:33 |
clarkb | fungi: I am willing to give tmpfs on current zuul a shot, we will need probably at least a 3GB filesystem | 03:33 |
fungi | inactive is about 1.5g | 03:33 |
clarkb | but only about 2 gb was inactive | 03:33 |
mattoliverau | you may get your 2G + whatever is actually free, and then everything else will be swapped. | 03:34 |
*** nati_ueno has quit IRC | 03:35 | |
mattoliverau | there should be an inactive, inactive(anon) and inactive(file). Use the first as it is the total. | 03:35 |
mattoliverau | but i don't have access to the server so I don't actually know what the current value is. | 03:35 |
*** jerryz has quit IRC | 03:35 | |
fungi | right, inactive is roughly 2g | 03:35 |
clarkb | fungi: I mentioned to mattoliverau earlier that we could go to a 30GB perf node to keep our vcpu count that will give us plenty of room for a massive tmpfs | 03:36 |
mattoliverau | So from my understanding, that is as much as the kernel can actually give you. | 03:36 |
fungi | clarkb: yeah, i'm hesitant since the downtime to swap nodes would be a bit greater. how long did it take you the other day? | 03:37 |
clarkb | it wasn't too bad, you basically prestage the node completely then do the swap. making sure firewall rules are correct everywhere was the biggest hurdle | 03:38 |
fungi | i'm rapidly running out of steam for the night but can probably squeeze in another hour or so | 03:38 |
clarkb | we can probably get it done in well under half an hour | 03:38 |
StevenK | mattoliverau: For a tmpfs? tmpfs are swapped-back | 03:38 |
clarkb | fungi: I don't think we should do anything tonight unless you really really want to | 03:38 |
StevenK | swap-backed, even | 03:39 |
mattoliverau | StevenK: tmpfs is just a ramdisk, so yes, it'll be swapped out.. in theory. | 03:39 |
clarkb | fungi: maybe fire off a 30GB node build tonight and plan for swap tomorrow? | 03:39 |
clarkb | fungi: or, put tmpfs in place on existing zuul and see what happens | 03:39 |
fungi | clarkb: i can get a new-new-zuul spinning up now. we'll hang our hopes on the tempest parallelism reduction to make some stability headway in the meantime | 03:41 |
clarkb | ++ I think that is path of most sanity | 03:41 |
*** weshay has quit IRC | 03:42 | |
clarkb | fungi: zuuls A record ttl is already 5 minutes so that is covered | 03:42 |
fungi | awesome | 03:42 |
clarkb | then tomorrow we grab the pipeline state, stop zuul, update dns, make sure firewalls update (which, the more I think of it, may not be a problem since most connections are to zuul so only zuul's firewall matters) and start zuul on the new server | 03:43 |
clarkb | if anything goes uber terrible we put old server back in use | 03:43 |
*** amotoki has joined #openstack-infra | 03:43 | |
mattoliverau | Sounds like a plan! And on that note then I'm going to go to lunch, ttyl. | 03:44 |
clarkb | I am no longer convinced new git will make much of an impact | 03:46 |
fungi | heh... "120 GB Performance" | 03:48 |
* fungi resists temptation | 03:49 | |
fungi | so we want 30 not 15? | 03:49 |
StevenK | fungi: And then put / on a tmpfs? :-P | 03:49 |
fungi | StevenK: bitcoins aplenty | 03:49 |
clarkb | fungi: 15 has 4vcpu | 03:50 |
fungi | oh weird | 03:50 |
clarkb | fungi: the current 8gb have 8vcpu | 03:50 |
clarkb | I think we should go 30 just to keep the vcpu value constant | 03:50 |
fungi | so new-zuul was non-performance? | 03:50 |
fungi | or 15g perf have fewer cpus than 8 and 30? | 03:51 |
clarkb | new zuul was performance, 8gb 8vcpu | 03:52 |
clarkb | but the flavors are weird, 8gb gives you 8vcpu but 15 gives you 4vcpu | 03:52 |
clarkb | double check that with nova flavor-list but pretty sure those were the values I saw earlier today | 03:52 |
fungi | you're right | 03:53 |
fungi | strange but true | 03:53 |
clarkb | the other nice thing about 30GB is we can make the tmpfs pretty large and not worry too much about it filling unexpectedly | 03:54 |
fungi | yup | 03:54 |
clarkb | eg 16GB :) | 03:55 |
sdague | they basically seem to have created a high memory set of perf nodes | 03:55 |
clarkb | on my way home I was also thinking that zuul could do a better job in its scheduler of handling more than one discrete item at once | 03:57 |
clarkb | at the very least it should be able to process different pipelines independently | 03:57 |
clarkb | the nice thing about the serial way it does things now is it makes it very predictable about the order jobs run in and so on | 03:58 |
clarkb | but gate being slow doesn't have to affect check for example | 03:58 |
clarkb | but I think making changes like that probably won't have large benefits when 99% of your time is waiting for a forked git process to do its thing | 03:58 |
*** coolsvap has joined #openstack-infra | 04:00 | |
fungi | also, you'd need multiple git workspaces to avoid collisions | 04:01 |
*** harlowja is now known as harlowja_away | 04:01 | |
clarkb | oh right good point | 04:01 |
fungi | don't want to be building two nova refs in one git clone at the same moment | 04:01 |
*** praneshp has quit IRC | 04:04 | |
*** sarob has joined #openstack-infra | 04:08 | |
*** sarob_ has joined #openstack-infra | 04:10 | |
*** CaptTofu has joined #openstack-infra | 04:11 | |
*** sarob has quit IRC | 04:13 | |
*** CaptTofu has quit IRC | 04:15 | |
*** sdake has joined #openstack-infra | 04:16 | |
mikal | zuul hates me | 04:17 |
notmyname | mikal: don't worry. zuul hates everybody today ;-) | 04:17 |
mikal | Yay! | 04:17 |
mikal | On the performance nodes front, there are two types | 04:18 |
fungi | clarkb: we haven't merged the change yet that autopartitions the secondary block device on these performance nodes, have we? | 04:18 |
mikal | Which might not be obvious from flavour list | 04:18 |
sdague | so it's not incredibly helpful for people to "reverify bug 123456789" - https://review.openstack.org/#/c/61714/2 | 04:18 |
sdague | because that patch can't pass right now, due to grizzly devstack issues | 04:18 |
mikal | OMG, who did that? | 04:18 |
*** _ruhe is now known as ruhe | 04:18 | |
mikal | Performance 1 has its biggest at 8 vcpus, 8gb ram | 04:19 |
mikal | Performance 2 has its biggest at 32 vcpus, 120gb ram | 04:19 |
*** coolsvap_away has joined #openstack-infra | 04:20 | |
*** coolsvap has quit IRC | 04:21 | |
*** coolsvap_away is now known as coolsvap | 04:21 | |
*** vkozhukalov has joined #openstack-infra | 04:21 | |
sdague | fungi: so given that the grizzly devstack issues are out there, could you kick out all the stable/havana patches in the queue? because they are all just time bombs | 04:22 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 04:22 | |
notmyname | is there a single job that is run for _every_ gate job that isn't run for check jobs? I'm looking for a graphite metric | 04:23 |
notmyname | eg maybe gate-grenade-dsvm | 04:23 |
fungi | sdague: i'm not sure how to "kick them out" aside from uploading trivial new patchsets to each of them | 04:24 |
clarkb | fungi: I think that is the only way | 04:24 |
sdague | fungi: yeh, that would be the only way | 04:24 |
fungi | but a 'zuul eject' command would make for a good future addition | 04:24 |
sdague | yeh | 04:24 |
sdague | notmyname: gate-tempest-dsvm-full is the best approximation of the gate | 04:25 |
notmyname | sdague: thanks | 04:25 |
sdague | however, it's dynamic | 04:25 |
sdague | so not exact | 04:25 |
notmyname | dynamic? | 04:25 |
sdague | the integrated queue is assembled based on overlapping jobs | 04:25 |
sdague | so if change A runs tests 1 2 3, and change B runs tests 3 4 5, and change C runs tests 5 6 7 | 04:26 |
sdague | they will be in a single queue | 04:26 |
sdague | even though A doesn't overlap with C | 04:26 |
clarkb | and only one job of that entire set needs to fail to create a reset | 04:26 |
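A sketch of how shared queues fall out of overlapping job sets, using sdague's A/B/C example above; the grouping code is illustrative, not zuul's implementation.

```python
# Changes share a queue if their job sets overlap, directly or transitively:
# A shares a job with B and B with C, so A, B and C land in one queue even
# though A and C have no jobs in common.
jobs = {"A": {1, 2, 3}, "B": {3, 4, 5}, "C": {5, 6, 7}}

queues = []  # each queue tracks its changes and the union of their jobs
for change, job_set in jobs.items():
    overlapping = [q for q in queues if q["jobs"] & job_set]
    merged = {"changes": {change}, "jobs": set(job_set)}
    for q in overlapping:
        merged["changes"] |= q["changes"]
        merged["jobs"] |= q["jobs"]
        queues.remove(q)
    queues.append(merged)

print([sorted(q["changes"]) for q in queues])  # [['A', 'B', 'C']]
```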
*** dcramer_ has joined #openstack-infra | 04:26 | |
clarkb | so I was thinking about this a bit more after LCA, and I think what I would like to do is expose the zuul logs more. There shouldn't be any privileged info in them so it should be safe to just logstash them or whatever, but it will give clear data on "this was a gate reset" and so on | 04:27 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Customise Bootstrap https://review.openstack.org/67337 | 04:28 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Moved homepage content to about page. https://review.openstack.org/67344 | 04:28 |
SergeyLukjanov | evening guys! | 04:28 |
clarkb | SergeyLukjanov: ohai | 04:28 |
SergeyLukjanov | clarkb, it'll be awesome to be able to read zuul logs :) | 04:28 |
clarkb | SergeyLukjanov: ya, I want to double check with jeblair to see if there are any known gotchas with that, but we can pipe it into the test log logstash too and get overlapping data | 04:29 |
*** rossella_s has quit IRC | 04:29 | |
mikal | clarkb: noting that if you turn on our swift reporter and debug logging, it logs the swift password | 04:29 |
clarkb | sdague: also did you see I diagnosed the missing console.html in logstash problem? zaro is going to work on a fix | 04:29 |
clarkb | mikal: is swift reporter a thing? | 04:30 |
clarkb | mikal: in any case we should sanitize that logging imo | 04:30 |
sdague | clarkb: cool | 04:30 |
mikal | clarkb: it is for us. I think it's meant to be for you in the future. | 04:30 |
mikal | clarkb: but perhaps I am mis-representing jhesketh__ and jblair's plan | 04:30 |
clarkb | sdague: what happens there is jenkins hasn't even touched the file on logs.o.o by the time logstash processes it, logstash gets a 404 and moves on | 04:31 |
sdague | great | 04:31 |
clarkb | sdague: so we will update the scp plugin to not finish the job until that file has at least been touched | 04:31 |
clarkb | should be a simple wait on a thread sync event | 04:31 |
sdague | cool | 04:32 |
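The scp plugin itself is Java, but the synchronization pattern clarkb describes looks roughly like this (illustrative Python with hypothetical names): job completion blocks on an event that the upload path sets once console.html has at least been touched.

```python
# Illustration of "wait on a thread sync event"; the real plugin is Java
# and these names are hypothetical.
import threading

console_touched = threading.Event()

def on_console_uploaded():
    # Called by the upload side once the file exists on the log server.
    console_touched.set()

def finish_job(timeout=60):
    # Block until the console log has been touched, or give up after a while.
    if not console_touched.wait(timeout):
        print("console.html never appeared; finishing anyway")
    print("job finished")
```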
*** chandankumar has joined #openstack-infra | 04:32 | |
clarkb | sdague: how is the neutron thing going? | 04:33 |
fungi | clarkb: yeah, we haven't approved https://review.openstack.org/63190 apparently, so that explains the lack of swap | 04:33 |
sdague | good, though slowed down by the current gate backup. The lower concurrency patch is looking promising in the gate right now though. | 04:34 |
clarkb | sdague: yeah | 04:34 |
clarkb | fungi: are you just going to manually add swap then? or should we merge 63190 and rebuild? | 04:34 |
fungi | i'll delete my first launch and build another with that patch added | 04:35 |
sdague | I need to go to bed, but I think if we kick out the stable branch changes in the gate the gate will empty by morning | 04:35 |
fungi | unless we want to merge it first | 04:35 |
clarkb | fungi: ok, I am reviewing that change now too | 04:35 |
clarkb | sdague: noted | 04:35 |
fungi | clarkb: thanks. we might as well approve it rather than continue manually applying it on every launch environment ;) | 04:35 |
sdague | that swift one that was reverify 123456789 is reset #2 in there, but that's because it can't pass | 04:35 |
sdague | there is a nova unit test fail as well, above it | 04:36 |
clarkb | sdague: is it just the two? I will look and figure it out don't stay up | 04:36 |
sdague | so that's what's failing now, though they are off to the side | 04:36 |
sdague | the rest is being computed | 04:36 |
sdague | however, stable/havana patches will fail on grenade | 04:36 |
sdague | because of grizzly | 04:36 |
sdague | so they take a while to show up after a reset | 04:37 |
clarkb | ok so all stable/havana changes should be kicked out | 04:37 |
sdague | yes | 04:37 |
clarkb | got it | 04:37 |
*** carl_baldwin has quit IRC | 04:37 | |
*** esker has quit IRC | 04:37 | |
sdague | chmouel and I were looking at stable grizzly devstack today, will do so again in the morning | 04:38 |
clarkb | ok | 04:38 |
sdague | I think it's fundamentally the pip 1.5 thing | 04:38 |
*** esker has joined #openstack-infra | 04:38 | |
sdague | anyway, bed time. Talk to you later. | 04:38 |
clarkb | we aren't using 1.5 anymore though right? or did we deal with that differently? | 04:38 |
clarkb | fungi: ^ | 04:38 |
clarkb | fungi: oh I remember, I asked for that change not to be in install_puppet.sh | 04:39 |
clarkb | fungi: because well it is doing something completely different and potentially harmful instead of simply installing puppet | 04:39 |
clarkb | fungi: why don't you use a local checkout and we can figure out how to deal with that properly when jeblair is back | 04:39 |
fungi | clarkb: ahh, right. you should note that in the review | 04:39 |
clarkb | yup sorry I didn't do that before, my bad | 04:40 |
openstackgerrit | A change was merged to openstack-infra/devstack-gate: Cut tempest concurrency in half https://review.openstack.org/65805 | 04:40 |
*** fifieldt has quit IRC | 04:41 | |
HenryG | In gerrit, is there a way to search for any reviews in progress that touch a particular file? | 04:42 |
*** fifieldt has joined #openstack-infra | 04:42 | |
*** emagana has joined #openstack-infra | 04:42 | |
jhesketh__ | clarkb: so (reading back...) I suggested on the infra mailing list that we run a zuul per pipeline to ease the load on the gate | 04:42 |
clarkb | jhesketh__: that won't ease the load on the gate but would help the other pipelines | 04:43 |
jhesketh__ | jblair didn't think it was necessary with the move to a performance node and also his future plan of sending git methods to workers | 04:43 |
notmyname | clarkb: fungi: thanks for the help with the CVE patch today | 04:43 |
clarkb | HenryG: if you have watched the projects and use the ssh query api then I think the answer is yes | 04:43 |
jhesketh__ | well zuul will be able to do its git magic faster if it doesn't have to fight other pipelines | 04:43 |
clarkb | jhesketh__: there is no fighting though, they are all dealt with serially | 04:44 |
clarkb | the problem is that the gate pipeline takes 15 minutes to handle a reset, and nothing else in zuul runs | 04:44 |
fungi | notmyname: of course, it's my pleasure | 04:44 |
clarkb | we need to make that faster, the worker idea should help there as it distributes the expensive git work across nodes | 04:44 |
jhesketh__ | clarkb: if zuul is pulling in a patch for nova in the check pipeline doesn't that block any merge it might be wanting to try on the gate pipeline? | 04:45 |
jhesketh__ | right okay | 04:45 |
clarkb | jhesketh__: not really because it will handle those one at a time | 04:45 |
clarkb | this interim idea is use tmpfs to speed up git operations | 04:45 |
fungi | jhesketh__: zuul's output is a constructed git ref, in the end, so the state of its work tree doesn't have to hang around. just a git object | 04:46 |
clarkb | as that requires no code changes and should help quite a bit | 04:46 |
jhesketh__ | clarkb: so it does block, it's just not significant? | 04:46 |
jhesketh__ | (the check pipeline that is) | 04:46 |
clarkb | jhesketh__: ya because the check pipeline work is once and done | 04:46 |
clarkb | ~10 seconds of work | 04:47 |
*** praneshp has joined #openstack-infra | 04:47 | |
jhesketh__ | sure, but if somebody commits a dozen patches at once that's still a delay | 04:47 |
clarkb | but for dependent pipelines it processes the entire queue before being done. which is ~10 seconds multiplied by the number of changes | 04:47 |
fungi | jhesketh__: it blocks, but insofar as it all blocks because git operations are not happening in parallel | 04:47 |
jhesketh__ | yep | 04:47 |
clarkb | jhesketh__: but it allows other work to happen between those changes | 04:47 |
HenryG | clarkb: yes I have "watched" the project (tempest, in this case). Do you have a ptr handy to the ssh query api for a noob to get started? | 04:48 |
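For reference, a gerrit ssh query along these lines can list open changes touching a file; the host, port, and which operators a given Gerrit version supports (e.g. `file:^regex`) are assumptions here, so treat this as a sketch rather than a verified recipe.

```python
# Hedged example: query Gerrit over ssh for open changes touching a file.
import json
import subprocess

cmd = [
    "ssh", "-p", "29418", "review.openstack.org",
    "gerrit", "query", "--format=JSON",
    "status:open", "project:openstack/tempest", "file:^.*run_tests.*",
]
out = subprocess.run(cmd, capture_output=True, text=True).stdout
for line in out.splitlines():
    change = json.loads(line)
    if "url" in change:  # the last line is a stats record without a url
        print(change["url"], change.get("subject"))
```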
clarkb | so the total work is 10*10 seconds but it doesn't starve the other queues | 04:48 |
jhesketh__ | sure, | 04:48 |
clarkb | with the gate it literally stops everything else for that 15 minute period | 04:48 |
mikal | I can assume that my stackforge approval from an hour ago isn't lost, right? | 04:48 |
mikal | Just slow? | 04:48 |
clarkb | mikal: yes just very very slow | 04:48 |
StevenK | 515 events, wheee | 04:48 |
clarkb | the compounding problem with the gate is on a failure it does all of the work again | 04:49 |
clarkb | then you fail and it does it all again | 04:49 |
clarkb | and on and on | 04:49 |
*** praneshp_ has joined #openstack-infra | 04:52 | |
clarkb | finding trivial patchset content is non trivial | 04:53 |
clarkb | fungi: just update commit message? | 04:53 |
sdague | clarkb: ok, not quite asleep yet | 04:53 |
* mikal promises not to approve anything for a while | 04:53 | |
sdague | but it looks like there are 6 - 8 stable/havana patches in the gate | 04:53 |
sdague | so if you nuke them now, I think the gate will clear out by morning | 04:53 |
clarkb | sdague: I found 5 | 04:53 |
sdague | lots of keystone with month old test results | 04:53 |
sdague | I went through and started -2ing a ton of stuff | 04:54 |
*** praneshp has quit IRC | 04:54 | |
*** praneshp_ is now known as praneshp | 04:54 | |
mikal | Oh, we still have that "old checks" problem? | 04:54 |
sdague | apparently, I have -2 on havana | 04:54 |
fungi | clarkb: yeah, update commit message will work | 04:54 |
sdague | mikal: yes | 04:54 |
mikal | Would it be meaningful to have that quick and dirty rechecker turned on | 04:54 |
StevenK | sdague: But that turns into an event, and zuul isn't really getting around to that ... | 04:54 |
sdague | mikal: probably | 04:54 |
mikal | I didn't do it because I was told that we'd have gerrit doing it soon | 04:54 |
sdague | StevenK: sure | 04:54 |
mikal | But if it would help, I'll get it done today | 04:54 |
sdague | however it will signal | 04:54 |
sdague | mikal: yes, it would be helpful, have it have a variable for # of days that we consider something stale | 04:55 |
sdague | that we could set in infra | 04:55 |
sdague | it would be awesome | 04:55 |
mikal | sdague: as in projects.yaml? | 04:55 |
* mikal pulls out that code and dusts it off | 04:55 | |
jhesketh__ | mikal: is this the turbo-hipster gerrit rechecker? | 04:55 |
mikal | jhesketh__: yeah | 04:55 |
sdague | mikal: wherever clarkb and fungi think it should live | 04:55 |
sdague | just want to make it configurable | 04:56 |
mikal | It will reduce the number of merge fails | 04:56 |
mikal | Well, what you get today is quick and dirty | 04:56 |
jhesketh__ | mikal: unless you set up turbo-hipster on infra the config will have to be in our cloud | 04:56 |
sdague | mikal: this actually isn't a merge fail problem | 04:56 |
jhesketh__ | well I guess you could hit a url for it | 04:56 |
mikal | And then we do something less shit sometime real soon | 04:56 |
sdague | it's the fact that tox or deps changed in a month | 04:56 |
sdague | so the passing results aren't valid at all | 04:56 |
clarkb | sdague: some of these do actually fail to merge | 04:56 |
clarkb | sdague: its fun... | 04:56 |
mikal | Yeah, so a recheck of checks older than a week would have covered this, right? | 04:56 |
sdague | clarkb: ok | 04:56 |
sdague | mikal: yes | 04:56 |
clarkb | sdague: I am pushing patchsets though to make it clear | 04:57 |
jhesketh__ | sdague: sure, so this code mikal whacked together is a turbo-hipster plugin.. so it'll probably not be configurable today if you want quick and dirty | 04:57 |
mikal | Ok, cool | 04:57 |
mikal | I shall do a thing | 04:57 |
mikal | jhesketh__: I think that's ok | 04:57 |
sdague | mikal: you are my hero :) | 04:57 |
mikal | We can make it suck less tomorrow | 04:57 |
jhesketh__ | mikal: oh yeah, I agree. Just letting others know | 04:57 |
mikal | I need theme music | 04:57 |
fungi | clearly i can't work on things and keep up with irc at the same time | 04:57 |
fungi | i'm sure you're all discussing exciting things | 04:58 |
mikal | LOL | 04:58 |
mikal | Just robots of doom | 04:58 |
mikal | jhesketh__: is testzuul free at the moment? | 04:59 |
mikal | jhesketh__: I might run this there | 04:59 |
jhesketh__ | mikal: go for it... I think it's in an okay state | 04:59 |
mikal | jhesketh__: cool | 04:59 |
clarkb | sdague: lol bugs are getting assigned to me because I am writing those patchsets :) | 04:59 |
fungi | clarkb: so, new-new-zuul is 2001:4800:7815:0101:3bc3:d7f6:ff04:e07f | 05:00 |
fungi | 15g tmpfs on the git dir | 05:00 |
fungi | zuul daemon seems to properly recreate the contents of that directory when it's started | 05:00 |
clarkb | fungi: noice | 05:00 |
fungi | i've also started the puppet agent on it | 05:00 |
clarkb | fungi: is it accepting jobs though? | 05:00 |
clarkb | oh I know where we need to update firewalls, on the jenkins masters | 05:01 |
clarkb | er wait no | 05:01 |
clarkb | we just need to make sure the jenkins masters connect to new new zuul's geard | 05:01 |
fungi | yeah. but i've stopped the zuul daemon again just to be safe | 05:01 |
clarkb | cool | 05:02 |
*** chandankumar has quit IRC | 05:03 | |
clarkb | fungi: so ya, I think we plan to do a switcheroo early tomorrow and see if tmpfs helps a bunch | 05:03 |
*** mrda has quit IRC | 05:03 | |
clarkb | I will attempt to wake up early | 05:03 |
*** resker has joined #openstack-infra | 05:03 | |
fungi | i'll be around and ready | 05:04 |
clarkb | sdague: I have killed two keystone changes and one swift, there appear to be 3 more changes | 05:04 |
clarkb | sdague: slowly getting through them | 05:04 |
notmyname | clarkb: https://review.openstack.org/#/c/67186/ and https://review.openstack.org/#/c/67187/ are backports for the CVE bug | 05:05 |
notmyname | for grizzly and havana | 05:06 |
clarkb | notmyname: ok, neither will pass the gate until grenade is working for grizzly and havana | 05:06 |
clarkb | notmyname: sdague and chmouel are working on that as a priority | 05:06 |
notmyname | clarkb: right. I just thought you were working on making sure those don't get into the queue. they were/are marked as approved | 05:07 |
*** esker has quit IRC | 05:07 | |
clarkb | notmyname: I didn't see them in the queue | 05:07 |
notmyname | ah ok | 05:07 |
*** ruhe is now known as _ruhe | 05:07 | |
*** krtaylor has joined #openstack-infra | 05:08 | |
clarkb | I think I got all of them according to a gerrit search | 05:09 |
clarkb | jhesketh__: going back to zuul slowness. I probably wasn't entirely clear, but in zuul's main loop it processes all results then processes events | 05:11 |
*** yamahata has joined #openstack-infra | 05:12 | |
clarkb | jhesketh__: results cause gate resets (if a job result was a fail); this causes zuul to cancel all jobs in the gate behind it, then remerge the new state of proposed git merging, then start jobs for all of those changes. That process takes 15 minutes or more with 90 changes in the queue | 05:12 |
fungi | right. pragmatic ordering since results have a chance of reducing the complexity | 05:12 |
clarkb | jhesketh__: that entire process is one iteration through the loop so no other results or events are processed during that time | 05:12 |
clarkb | jhesketh__: because of that zuul per pipeline won't fix the problem but it will decouple it from check and post and so on | 05:12 |
*** zz_ewindisch is now known as ewindisch | 05:13 | |
*** mrda has joined #openstack-infra | 05:13 | |
clarkb | jhesketh__: zuul per pipeline will still result in really slow gate processing. The way to fix that is to make git operations quicker. git worker nodes and git repos in tmpfs should make that better. And honestly after reading through the logs I think if we solve that problem then zuul per pipeline isn't necessary | 05:13 |
clarkb | we are literally spending minutes running git remote update and git checkout foo and git merge | 05:14 |
*** resker has quit IRC | 05:14 | |
jhesketh_ | clarkb: okay, thanks for the clarification, makes sense | 05:14 |
fungi | clarkb: it might also have the effect of interleaving workers between pipelines, unlike the broad swing we see now (gate resets, all pending check changes get workers, then attempts are made on the gate changes, repeat) | 05:15 |
clarkb | fungi: yup | 05:15 |
fungi | since there would be more than one gearman server for a jenkins master to listen to | 05:15 |
clarkb | jhesketh_: I do think another thing that would help, but would require massive rewrites of zuul, is to do everything in a non-blocking manner: fire off hundreds of git merges at once and wait for IO to happen. Using the git gearman workers approximates this but it could probably just be done in-process too | 05:16 |
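A sketch of the non-blocking idea, kept consistent with the workspace-collision point fungi raised earlier (one working tree can't build two refs at once): parallelize across projects, but keep each project's merges serial. Project names, paths, and timings are placeholders.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_ref(workspace, change):
    # Stand-in for zuul's checkout/merge/ref-creation sequence; a real
    # version would run git inside `workspace`.
    time.sleep(0.01)
    return f"{workspace} -> refs/zuul/{change}"

changes = [("openstack/nova", "Z1"), ("openstack/neutron", "Z2"),
           ("openstack/nova", "Z3"), ("openstack/cinder", "Z4")]

by_project = defaultdict(list)
for project, change in changes:
    by_project[project].append(change)

def prepare_project(project):
    # Serial within a project, parallel across projects.
    return [prepare_ref(f"/var/lib/zuul/git/{project}", c) for c in by_project[project]]

with ThreadPoolExecutor(max_workers=8) as pool:
    for refs in pool.map(prepare_project, by_project):
        print(refs)
```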
*** sarob_ has quit IRC | 05:18 | |
clarkb | lifeless: https://jenkins02.openstack.org/job/gate-neutron-python27/6117/console is that a limitation of testtools matchers? | 05:18 |
clarkb | jhesketh_: the whole situation has led me to drinking heavily | 05:18 |
*** amotoki_ has joined #openstack-infra | 05:18 | |
*** sarob has joined #openstack-infra | 05:18 | |
clarkb | jhesketh_: :) | 05:18 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:20 | |
lifeless | clarkb: no | 05:20 |
jhesketh_ | clarkb: heh, okay | 05:20 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:21 | |
fungi | clarkb: the whole situation has gotten in the way of my usual heavy drinking. opposite of the expected effect | 05:21 |
clarkb | fungi: I'm sorry, I found this IPA to help tremendously | 05:21 |
lifeless | the matcher api doesn't assume strings etc | 05:21 |
*** sarob_ has joined #openstack-infra | 05:21 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:21 | |
mikal | clarkb: is there a way to specify a wildcard project name in layout.yaml? | 05:21 |
fungi | clarkb: as long as it's a v6 ipa | 05:21 |
clarkb | lifeless: I didn't think so, but figured I would ask anyways | 05:21 |
mikal | i.e. I want this to match more than one project | 05:21 |
clarkb | mikal: no, but you can have templates that you apply to many projects | 05:21 |
mikal | But I still need to list the projects, right? | 05:22 |
clarkb | mikal: yup | 05:22 |
mikal | :( | 05:22 |
clarkb | mikal: actually wait | 05:22 |
clarkb | mikal: the thing that does event matching may do regexes everywhere /me examines code | 05:22 |
*** amotoki has quit IRC | 05:22 | |
*** sarob has quit IRC | 05:23 | |
clarkb | mikal: best I can tell project is a magical key and doesn't | 05:24 |
clarkb | sdague: russellb: fungi: the spice flows. I think that d-g change helped | 05:24 |
*** esker has joined #openstack-infra | 05:25 | |
fungi | clarkb: awesome. instead of ipa, i think i'm going to settle in for a nap | 05:25 |
*** sarob_ has quit IRC | 05:25 | |
fungi | maybe after the zuul upgrade tomorrow i'll actually find some time to start catching up on e-mail and code review | 05:26 |
mikal | fungi: better code review, or we'll kick you out of core! | 05:26 |
fungi | mikal: somehow i think my current code review stats would let me kick everyone else out | 05:26 |
mikal | LOL | 05:27 |
mikal | Project of one | 05:27 |
fungi | but that's holidays for you | 05:27 |
fungi | last month shouldn't really count | 05:27 |
clarkb | last month was a lie | 05:28 |
*** nicedice has quit IRC | 05:29 | |
fungi | but there *was* cake, at least | 05:29 |
clarkb | code review is high on list of things now that we seem to have a handle on gate badness | 05:29 |
clarkb | and by have a handle on I mean understand | 05:29 |
fungi | cower in ph33r of | 05:30 |
fungi | +++ATH | 05:31 |
fungi | NO CARRIER | 05:32 |
clarkb | fungi: is the zuul tmpfs in fstab? | 05:34 |
fungi | clarkb: yup | 05:34 |
clarkb | awesome, it occurred to me that a reboot may result in weird things if it wasn't | 05:35 |
fungi | none /var/lib/zuul/git tmpfs defaults,size=15G 0 0 | 05:35 |
fungi | what kinda sysadmin do you take me for? ;) | 05:35 |
clarkb | :P I am just double checking | 05:35 |
fungi | yeah, good to confirm that | 05:35 |
fungi | i just double-checked too because i'm running on fumes and no longer trust myself | 05:36 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1269940 https://review.openstack.org/67303 | 05:36 |
uvirtbot | Launchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/1269940 | 05:36 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1260311 https://review.openstack.org/67314 | 05:37 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 05:37 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add e-r query for bug 1266611 https://review.openstack.org/65344 | 05:37 |
uvirtbot | Launchpad bug 1266611 in nova "test_create_image_with_reboot fails with InstanceInvalidState in gate-nova-python*" [Undecided,New] https://launchpad.net/bugs/1266611 | 05:37 |
clarkb | fungi: I trust you | 05:38 |
*** odyssey4me has quit IRC | 05:38 | |
fungi | eh, i don't recommend it. counterindicated by my operating manual | 05:39 |
* fungi is covered in warning labels | 05:40 | |
clarkb | fungi: I have a thing at ~10am PST, will try to be up early maybe we can attempt zuul stuff around 8am PST | 05:40 |
fungi | sounds great | 05:40 |
clarkb | also watch the gate, it may merge a ton of things all at once over the next 10 minutes | 05:41 |
fungi | i saw | 05:41 |
fungi | though the longest-running changes have had a tendency to be the ones that fail, so it's always a major fake-out | 05:41 |
clarkb | :/ we did just increase test time by a non trivial factor | 05:42 |
fungi | plus, job run times are longer than jenkins expects now, so its estimates are a bit optimistic | 05:42 |
* clarkb hopes it is just that | 05:42 | |
clarkb | NNOOOOOO a job just failed | 05:42 |
clarkb | oh it was just a test timeout for grenade, let's bump that timeout too | 05:43 |
* clarkb proposes that change | 05:43 | |
fungi | i'll stick around to approve it if you propose | 05:43 |
openstackgerrit | Clark Boylan proposed a change to openstack-infra/config: Double grenade test timeouts https://review.openstack.org/67374 | 05:46 |
clarkb | fungi: ^ | 05:46 |
clarkb | with that in place I feel confident that the queue will move | 05:47 |
fungi | it's in | 05:47 |
clarkb | danke | 05:47 |
fungi | well, approved. will take time to get through the event queue | 05:47 |
clarkb | ya I figure we don't worry too much about that :) | 05:48 |
*** slong has joined #openstack-infra | 05:50 | |
*** slong-afk has quit IRC | 05:51 | |
*** HenryG has quit IRC | 05:52 | |
*** DinaBelova has joined #openstack-infra | 05:53 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:53 | |
*** pballand has joined #openstack-infra | 05:56 | |
clarkb | fungi: anyways don't stay up anymore, things should settle down overnight (I hope) and we can hit this with a big hammer tomorrow | 05:57 |
*** zhiwei has quit IRC | 05:59 | |
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/storyboard: Fixed doc build https://review.openstack.org/67376 | 06:02 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters https://review.openstack.org/67265 | 06:04 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id https://review.openstack.org/66036 | 06:04 |
*** reed has quit IRC | 06:07 | |
*** odyssey4me has joined #openstack-infra | 06:08 | |
*** CaptTofu has joined #openstack-infra | 06:12 | |
*** chandankumar has joined #openstack-infra | 06:15 | |
*** CaptTofu has quit IRC | 06:16 | |
*** pballand has quit IRC | 06:17 | |
*** praneshp is now known as praneshp_afk | 06:18 | |
*** denis_makogon has joined #openstack-infra | 06:26 | |
*** pelix has left #openstack-infra | 06:31 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 06:34 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 06:36 | |
*** afazekas_ has quit IRC | 06:37 | |
*** gokrokve has quit IRC | 06:38 | |
*** gokrokve has joined #openstack-infra | 06:38 | |
mikal | I think I just realized my approach wont work | 06:39 |
mikal | The extra text zuul puts in the review comment will stop the recheck from triggering | 06:39 |
clarkb | mikal: oh right, because the regex is very restrictive : | 06:40 |
mikal | Yeah | 06:41 |
mikal | I'm going to write a crappy daemon for now | 06:41 |
mikal | But its a shame I can't use zuul | 06:41 |
*** sHellUx has joined #openstack-infra | 06:41 | |
*** gokrokve has quit IRC | 06:42 | |
*** sHellUx has quit IRC | 06:42 | |
*** SergeyLukjanov_ has joined #openstack-infra | 06:44 | |
*** SergeyLukjanov_ has quit IRC | 06:45 | |
*** DinaBelova_ has joined #openstack-infra | 06:46 | |
*** vkozhukalov has quit IRC | 06:52 | |
*** ewindisch is now known as zz_ewindisch | 06:55 | |
*** DinaBelova has quit IRC | 06:56 | |
*** DinaBelova_ is now known as DinaBelova | 06:56 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 06:58 | |
*** DinaBelova is now known as DinaBelova_ | 06:58 | |
*** mrda has quit IRC | 07:01 | |
*** odyssey4me has quit IRC | 07:04 | |
*** yolanda has joined #openstack-infra | 07:07 | |
*** nati_uen_ has quit IRC | 07:11 | |
*** odyssey4me has joined #openstack-infra | 07:12 | |
*** afazekas_ has joined #openstack-infra | 07:25 | |
*** jcoufal has joined #openstack-infra | 07:27 | |
clarkb | anteaya: can you check if https://review.openstack.org/#/c/66490/ is just broken? it is flapping in the gate and I think the patch itself doesn't work | 07:33 |
clarkb | anteaya: and if so can you make sure someone proposes a new patchset to it to remove it from the gate if it is still in the gate when you see this? | 07:33 |
openstackgerrit | A change was merged to openstack-infra/config: Double grenade test timeouts https://review.openstack.org/67374 | 07:41 |
clarkb | oh good now I can go to bed | 07:42 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add gates for API projects and operations-guide https://review.openstack.org/67394 | 07:47 |
*** dizquierdo has joined #openstack-infra | 07:51 | |
*** jamielennox is now known as jamielennox|away | 07:54 | |
*** flaper87|afk is now known as flaper87 | 07:55 | |
*** DinaBelova_ is now known as DinaBelova | 07:58 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 07:58 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 08:01 | |
*** odyssey4me has quit IRC | 08:01 | |
*** fifieldt has quit IRC | 08:05 | |
*** fifieldt has joined #openstack-infra | 08:07 | |
*** odyssey4me has joined #openstack-infra | 08:09 | |
*** CaptTofu has joined #openstack-infra | 08:12 | |
*** bookwar has quit IRC | 08:14 | |
*** bookwar has joined #openstack-infra | 08:16 | |
*** CaptTofu has quit IRC | 08:17 | |
*** jcoufal has quit IRC | 08:21 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 08:24 | |
*** mancdaz_away is now known as mancdaz | 08:25 | |
*** mancdaz is now known as mancdaz_away | 08:25 | |
*** vkozhukalov has joined #openstack-infra | 08:28 | |
*** jcoufal has joined #openstack-infra | 08:31 | |
*** luqas has joined #openstack-infra | 08:32 | |
*** mancdaz_away is now known as mancdaz | 08:34 | |
*** coolsvap has quit IRC | 08:35 | |
*** coolsvap has joined #openstack-infra | 08:35 | |
*** odyssey4me has quit IRC | 08:36 | |
*** fifieldt has quit IRC | 08:37 | |
*** NikitaKonovalov has joined #openstack-infra | 08:42 | |
*** odyssey4me has joined #openstack-infra | 08:44 | |
*** dpyzhov has joined #openstack-infra | 08:47 | |
*** talluri has joined #openstack-infra | 08:48 | |
*** odyssey4me has quit IRC | 08:49 | |
*** mrmartin has joined #openstack-infra | 08:50 | |
*** ogelbukh has quit IRC | 08:55 | |
*** odyssey4me has joined #openstack-infra | 08:56 | |
*** hashar has joined #openstack-infra | 08:57 | |
*** lyle has joined #openstack-infra | 08:58 | |
*** mrmartin has quit IRC | 08:58 | |
*** david-lyle has quit IRC | 08:58 | |
*** emagana has quit IRC | 08:59 | |
*** mdenny has quit IRC | 09:01 | |
*** mdenny has joined #openstack-infra | 09:01 | |
*** vkozhukalov has quit IRC | 09:03 | |
*** mrmartin has joined #openstack-infra | 09:04 | |
*** mrmartin has quit IRC | 09:08 | |
*** kruskakli has quit IRC | 09:11 | |
*** fbo_away is now known as fbo | 09:12 | |
*** praneshp_afk has quit IRC | 09:12 | |
*** mrmartin has joined #openstack-infra | 09:13 | |
*** _ruhe is now known as ruhe | 09:17 | |
*** vkozhukalov has joined #openstack-infra | 09:18 | |
*** yassine has joined #openstack-infra | 09:20 | |
*** IvanBerezovskiy has joined #openstack-infra | 09:20 | |
*** JohanH has joined #openstack-infra | 09:21 | |
*** markmc has joined #openstack-infra | 09:22 | |
*** max_lobur_afk is now known as max_lobur | 09:23 | |
*** pblaho has joined #openstack-infra | 09:26 | |
JohanH | Hi, we are trying to get Zuul to work in our own project and we are running into an issue where we cannot get several concurrent gate checks to execute in parallel. The first job starts but all the other changes in the queue are skipped. Does anyone know what the problem might be? We would like to run as many parallel jobs as possible, utilizing all our jenkins slave workers | 09:28 |
*** luqas has quit IRC | 09:38 | |
*** ruhe is now known as ruhe_away | 09:41 | |
*** ruhe_away is now known as ruhe | 09:42 | |
*** denis_makogon has quit IRC | 09:44 | |
SergeyLukjanov | JohanH, which dependency manager are you using? | 09:45 |
SergeyLukjanov | JohanH, if you're setting up zuul for gerrit.o.o then you need to use the 'check' pipeline instead of 'gate', because zuul.o.o will do the merging instead of yours | 09:46 |
*** jooools has joined #openstack-infra | 09:47 | |
*** luqas has joined #openstack-infra | 09:47 | |
*** odyssey4me has quit IRC | 09:54 | |
*** yamahata has quit IRC | 09:56 | |
JohanH | Hi SergeyLukjanov, we are using the gate pipeline, so I guess that it is the dependent pipeline manager. According to the zuul documentation and the description for the DependentPipelineManager: In order to achieve parallel testing of changes, the dependent pipeline manager performs speculative execution on changes. It orders changes based on their entry into the pipeline. It begins testing all changes in parallel, assuming that each change ahead of it will pass its tests. If they all succeed, all the changes can be tested and merged in parallel. | 09:58 |
*** jishaom has quit IRC | 09:59 | |
flaper87 | fungi: any way I can ssh into a box running this test? http://logs.openstack.org/99/65499/4/check/gate-glance-python27/ff2cac8/nose_results.html | 09:59 |
JohanH | So, wouldn't it start testing the changes in parallel | 09:59 |
flaper87 | fungi: I've no idea what's going on there and tests pass in my box | 09:59 |
*** xchu has quit IRC | 09:59 | |
*** odyssey4me has joined #openstack-infra | 10:03 | |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 10:10 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 10:11 | |
*** dpyzhov has quit IRC | 10:11 | |
*** CaptTofu has joined #openstack-infra | 10:13 | |
*** jp_at_hp has joined #openstack-infra | 10:14 | |
*** CaptTofu has quit IRC | 10:18 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 10:21 | |
*** pblaho has quit IRC | 10:21 | |
*** rakhmerov has quit IRC | 10:22 | |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters https://review.openstack.org/67265 | 10:25 |
openstackgerrit | Guido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id https://review.openstack.org/66036 | 10:25 |
*** talluri has quit IRC | 10:29 | |
*** mrda has joined #openstack-infra | 10:29 | |
*** talluri has joined #openstack-infra | 10:30 | |
mikal | It is scary how often the stale recheck bot fires | 10:32 |
mikal | It's like... really common | 10:33 |
*** dpyzhov has joined #openstack-infra | 10:35 | |
*** jooools has quit IRC | 10:40 | |
openstackgerrit | SlickNik proposed a change to openstack-infra/config: Update devstack-gate jobs for Trove tempest tests https://review.openstack.org/65065 | 10:40 |
openstackgerrit | SlickNik proposed a change to openstack-infra/devstack-gate: Add Trove testing support https://review.openstack.org/65040 | 10:42 |
*** zhiyan has left #openstack-infra | 10:43 | |
SlickNik | ^^ jeblair / mordred / fungi / clarkb Please review when you get a chance. Thanks! | 10:43 |
mikal | clarkb: I have a simple bot which does rechecks, I'm not going to leave it running overnight though, as it scares me that it might recheck the world without permission | 10:44 |
mikal | Also, the check queue is pretty long at the moment | 10:44 |
*** jooools has joined #openstack-infra | 10:46 | |
*** vkozhukalov has quit IRC | 10:46 | |
*** nosnos has quit IRC | 10:53 | |
SergeyLukjanov | JohanH, it should start in parallel | 10:54 |
SergeyLukjanov | JohanH, do you have enough slaves? | 10:54 |
*** mrmartin has quit IRC | 10:54 | |
anteaya | mikal: thank you for holding off on the recheck bot | 10:55 |
anteaya | we would never climb out of the current situation | 10:55 |
anteaya | yay down to 64 events, progress | 10:55 |
anteaya | we started off yesterday with over 1000 events but never got below 600 by the end of my day yesterday | 10:56 |
mikal | anteaya: so, the thinking is a recheck is a lot cheaper than a gate merge flush | 10:58 |
mikal | So, we were hoping doing rechecks on ancient check runs would make the gate queue a bit less horrible | 10:58 |
*** vkozhukalov has joined #openstack-infra | 10:58 | |
*** yaguang has quit IRC | 10:58 | |
mikal | The bot only does a recheck if someone comments on a review with an ancient check, so it's also not a blanket thing | 10:58 |
mikal | But I will stop it overnight and keep an eye on it while it's running | 10:58 |
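As a rough illustration of the kind of bot mikal describes (not his actual code), the sketch below asks Gerrit over ssh for open changes and leaves a "recheck" comment on stale ones. It simplifies the trigger to an age check rather than watching for new comments, and the host, project, message text, and age cutoff are all assumptions.

```python
# Minimal sketch of a "stale check" recheck bot (illustrative only).
import json
import subprocess
import time

GERRIT = ["ssh", "-p", "29418", "review.openstack.org", "gerrit"]
MAX_AGE = 7 * 24 * 3600  # treat check results older than a week as "ancient" (assumption)

def open_changes(project):
    """Yield open changes for a project, including their current patch set."""
    out = subprocess.check_output(
        GERRIT + ["query", "--format=JSON", "--current-patch-set",
                  "status:open", "project:" + project]).decode()
    for line in out.splitlines():
        data = json.loads(line)
        if "id" in data:  # the trailing line is a stats record, skip it
            yield data

def recheck(change):
    """Leave a recheck comment on the current patch set of a change."""
    ref = "%s,%s" % (change["number"], change["currentPatchSet"]["number"])
    # the inner quotes survive the remote shell so the message stays one argument
    subprocess.check_call(GERRIT + ["review", "-m", "'recheck no bug'", ref])

if __name__ == "__main__":
    for change in open_changes("openstack/nova"):
        if time.time() - change["lastUpdated"] > MAX_AGE:
            recheck(change)
```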
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:00 | |
*** tma996 has joined #openstack-infra | 11:02 | |
*** talluri has quit IRC | 11:05 | |
*** amotoki has joined #openstack-infra | 11:05 | |
*** derekh has joined #openstack-infra | 11:05 | |
anteaya | mikal: hmmm okay, let's keep an eye on the number of events | 11:06 |
anteaya | if you have been running it on the system for the past 8 hours, it might partly explain the > 500 event decrease I see on the zuul status page | 11:07 |
*** amotoki_ has quit IRC | 11:07 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 11:07 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:08 | |
anteaya | clarkb: salv-orlando has beat me to it with a big -2 on 66490, thanks for alerting us and sorry for causing a problem | 11:09 |
kiall | So - Just noticed a change that merged yesterday https://review.openstack.org/#/c/67143/ never got pushed to github, but did make it to git.o.o .. | 11:09 |
kiall | I'm assuming the next merge will "fix" it .. But might be a problem | 11:09 |
*** NikitaKonovalov has quit IRC | 11:10 | |
*** rakhmerov has joined #openstack-infra | 11:10 | |
anteaya | he sniped it with a new patchset | 11:10 |
sdague | morning folks | 11:13 |
anteaya | morning sdague | 11:14 |
*** rakhmerov has quit IRC | 11:14 | |
anteaya | mikal: I just read part of the backscroll, clarkb and fungi were casting incantations last night and some of them seemed to be working | 11:15 |
anteaya | so that might be part of the source of the > 500 decrease in events | 11:15 |
sdague | yeh, jenkins is still blowing us up it looks like | 11:16 |
sdague | which actually seems to be the root cause of the problem right now | 11:16 |
anteaya | clarkb and fungi are planning a zuul upgrade at 11am this morning | 11:17 |
anteaya | all things being equal | 11:17 |
sdague | http://status.openstack.org/elastic-recheck/ - graphs 1, 2, and 3 are jenkins errors | 11:17 |
sdague | #2 isn't affecting us, but the others are | 11:17 |
anteaya | goodness we didn't fare well yesterday afternoon | 11:18 |
anteaya | grenade test timeouts have been doubled: https://review.openstack.org/#/c/67374/ | 11:18 |
anteaya | and I think there was another d-g change but I didn't get far enough back in the backscroll to id the url for it | 11:19 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 11:22 | |
*** ArxCruz has joined #openstack-infra | 11:22 | |
*** mrda has quit IRC | 11:26 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: only run on openstack gate projects https://review.openstack.org/67273 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: expose on channel when we timeout on logs https://review.openstack.org/66565 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: move to static LOG https://review.openstack.org/66564 | 11:27 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: create more sane logging for the er bot https://review.openstack.org/66435 | 11:27 |
anteaya | the timeout for tempest runs has also been increased: https://review.openstack.org/66379 | 11:30 |
anteaya | I think that was the other change I saw referenced | 11:30 |
*** vipul is now known as vipul-away | 11:31 | |
anteaya | mordred and clarkb: jog0 had done some evaluation of times using eatmydata yesterday and I believe the conclusion he and fungi had reached was that it was not a significant time savings | 11:32 |
anteaya | if I recall they were both rather disappointed by the outcome | 11:32 |
anteaya | ping jog0 for exact details as I might be incorrect in the application of what was being evaluated | 11:33 |
anteaya | it's early | 11:33 |
*** ruhe is now known as _ruhe | 11:33 | |
*** rfolco has joined #openstack-infra | 11:33 | |
*** NikitaKonovalov has joined #openstack-infra | 11:39 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: only run on openstack gate projects https://review.openstack.org/67273 | 11:40 |
*** DinaBelova is now known as DinaBelova_ | 11:41 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: create more sane logging for the er bot https://review.openstack.org/66435 | 11:41 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 11:41 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: move to static LOG https://review.openstack.org/66564 | 11:41 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: expose on channel when we timeout on logs https://review.openstack.org/66565 | 11:43 |
*** DinaBelova_ is now known as DinaBelova | 11:47 | |
*** smarcet has joined #openstack-infra | 11:51 | |
*** _ruhe is now known as ruhe | 11:52 | |
*** dpyzhov has quit IRC | 11:52 | |
*** dpyzhov has joined #openstack-infra | 11:53 | |
*** jcoufal has quit IRC | 11:56 | |
*** mrmartin has joined #openstack-infra | 11:59 | |
*** DinaBelova is now known as DinaBelova_ | 12:00 | |
*** vkozhukalov has quit IRC | 12:00 | |
*** hashar has quit IRC | 12:03 | |
*** dstanek has quit IRC | 12:06 | |
*** talluri has joined #openstack-infra | 12:10 | |
*** lcestari has joined #openstack-infra | 12:10 | |
*** rakhmerov has joined #openstack-infra | 12:11 | |
*** vkozhukalov has joined #openstack-infra | 12:12 | |
*** pblaho has joined #openstack-infra | 12:12 | |
*** CaptTofu has joined #openstack-infra | 12:14 | |
*** rakhmerov has quit IRC | 12:15 | |
dims | sdague, i had a suggestion in https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/3 | 12:15 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] | 12:15 |
dims | for the jenkins troubles | 12:15 |
sdague | sure | 12:16 |
sdague | honestly, that's suspiciously high to me | 12:16 |
sdague | I need to talk with fungi when he gets up | 12:16 |
dims | we are on 1.525 of jenkins | 12:16 |
sdague | because it might be one of the things that there is retry logic around, but we still count it as a fail | 12:17 |
*** vkozhukalov has quit IRC | 12:17 | |
sdague | which would totally skew things in graphite | 12:17 |
dims | y | 12:17 |
*** CaptTofu has quit IRC | 12:19 | |
*** dpyzhov has quit IRC | 12:19 | |
*** talluri has quit IRC | 12:21 | |
*** vkozhukalov has joined #openstack-infra | 12:32 | |
*** jcoufal has joined #openstack-infra | 12:33 | |
dims | sdague, bit more looking around and new recommendation on the version # for jenkins (https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/4) | 12:38 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] | 12:38 |
*** chandankumar has quit IRC | 12:42 | |
*** hashar has joined #openstack-infra | 12:43 | |
*** derekh has quit IRC | 12:46 | |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/elastic-recheck: Better query for bug 1260311 https://review.openstack.org/67446 | 12:49 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 12:49 |
*** dpyzhov has joined #openstack-infra | 12:51 | |
*** emagana has joined #openstack-infra | 12:52 | |
*** talluri has joined #openstack-infra | 12:53 | |
*** dstanek has joined #openstack-infra | 12:53 | |
*** CaptTofu has joined #openstack-infra | 12:55 | |
*** emagana has quit IRC | 12:56 | |
*** dstanek has quit IRC | 12:59 | |
*** salv-orlando has quit IRC | 13:02 | |
*** zz_ewindisch is now known as ewindisch | 13:02 | |
*** coolsvap has quit IRC | 13:09 | |
*** ewindisch is now known as zz_ewindisch | 13:09 | |
*** mrmartin has quit IRC | 13:09 | |
*** markmc has quit IRC | 13:11 | |
*** rakhmerov has joined #openstack-infra | 13:12 | |
*** zz_ewindisch is now known as ewindisch | 13:14 | |
*** rakhmerov has quit IRC | 13:16 | |
*** ewindisch is now known as zz_ewindisch | 13:18 | |
*** amotoki_ has joined #openstack-infra | 13:18 | |
*** amotoki has quit IRC | 13:20 | |
*** jcoufal has quit IRC | 13:21 | |
*** dizquierdo has quit IRC | 13:26 | |
*** mfink has quit IRC | 13:26 | |
*** dstanek has joined #openstack-infra | 13:29 | |
*** thomasem has joined #openstack-infra | 13:31 | |
*** hashar has quit IRC | 13:31 | |
chmouel | sdague: i was wondering if you were working on stable/grizzly issues as well? | 13:31 |
*** DinaBelova_ is now known as DinaBelova | 13:33 | |
*** dstanek has quit IRC | 13:34 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Better query for bug 1260311 https://review.openstack.org/67446 | 13:34 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 13:34 |
sdague | chmouel: trying to get your patch up now in a test env to try to help | 13:35 |
chmouel | sdague: i think there is a bit more than that, at least with euca2ools and boto being incompatible | 13:35 |
*** dims has quit IRC | 13:35 | |
sdague | chmouel: but we aren't running those anyway, right? | 13:36 |
chmouel | sdague: i think we still have failures in tempest.tests.boto.test_ec2_volumes.EC2VolumesTest.test_create_volume_from_snapshot | 13:37 |
*** pblaho has quit IRC | 13:37 | |
*** pblaho has joined #openstack-infra | 13:37 | |
chmouel | sdague: from https://review.openstack.org/#/c/67311/ | 13:37 |
chmouel | sdague: if i just rm -rf /usr/local/lib/**/*boto and rerun tempest it seems to work | 13:38 |
*** mfink has joined #openstack-infra | 13:38 | |
*** carl_baldwin has joined #openstack-infra | 13:40 | |
*** markmc has joined #openstack-infra | 13:40 | |
sdague | chmouel: so in that review I'm seeing volume failures unrelated to ec2 | 13:40 |
sdague | chmouel: http://logs.openstack.org/11/67311/2/check/check-tempest-dsvm-full/779c8f6/logs/screen-c-sch.txt.gz | 13:40 |
*** hashar has joined #openstack-infra | 13:41 | |
chmouel | sdague: oh yeah right, the ec2 runs but fails as you say due of the issue with cinder http://ep.chmouel.com:8080/Screenshots/2014-01-17__14-41-56.png | 13:42 |
*** nati_ueno has joined #openstack-infra | 13:42 | |
*** nati_ueno has quit IRC | 13:42 | |
russellb | so, based on the failure rates graph here, looks like failure rates are down a good bit today? http://status.openstack.org/elastic-recheck/ | 13:43 |
sdague | russellb: yes, I definitely think the concurrency reduction helped | 13:43 |
russellb | ok cool | 13:43 |
*** jcoufal-m has joined #openstack-infra | 13:43 | |
russellb | may take the weekend for the queues to recover a bit it seems | 13:44 |
*** nati_ueno has joined #openstack-infra | 13:44 | |
*** DinaBelova is now known as DinaBelova_ | 13:44 | |
sdague | yeh, there are still other kinds of fails going on, which we'll need to figure out | 13:44 |
*** julim has joined #openstack-infra | 13:44 | |
sdague | also need to get the word out that stable bits can't be put in the gate right now until we address the pip 1.5 issue on grizzly devstack | 13:44 |
*** jcoufal-m_ has joined #openstack-infra | 13:45 | |
sdague | which will kill a stable/havana change because of grenade | 13:45 |
*** jcoufal-m_ has quit IRC | 13:45 | |
*** jcoufal-m_ has joined #openstack-infra | 13:45 | |
sdague | chmouel: so the log for that run is confusing | 13:45 |
*** emagana has joined #openstack-infra | 13:45 | |
russellb | alrighty | 13:45 |
russellb | on to some other bugs then | 13:45 |
sdague | russellb: yep, and thanks for getting to the bottom of the load thing | 13:46 |
chmouel | sdague: yeah with my patch on my just rekicked test vm i definitely get netaddr updated properly: | 13:46 |
chmouel | ubuntu@devstack:~$ pip freeze|grep netaddr | 13:46 |
chmouel | Warning: cannot find svn location for distribute==0.6.24dev-r0 | 13:46 |
chmouel | netaddr==0.7.10 | 13:46 |
sdague | right, but something isn't right | 13:47 |
russellb | sdague: np | 13:47 |
fungi | mmm, dims is gone, but what he doesn't realize is that we're actually only on 1.525 for jenkins01, but we're also seeing the same java stack trace (the missing class master one) on jenkins02 which runs 1.543 | 13:47 |
sdague | chmouel: the fact that we pip install netaddr 6 times over the course of the console | 13:47 |
sdague | means pip keeps thinking there is a 0.7.5 to remove | 13:47 |
sdague | which is why cinder explodes | 13:47 |
*** dims has joined #openstack-infra | 13:48 | |
sdague | fungi: right, so we started classifying infra bugs in er yesterday (because our classification rate was down to 30%) | 13:48 |
*** nati_ueno has quit IRC | 13:48 | |
*** jcoufal-m has quit IRC | 13:49 | |
sdague | fungi: http://status.openstack.org/elastic-recheck/ - Bug 1260311 | 13:49 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 13:49 |
*** DinaBelova_ is now known as DinaBelova | 13:49 | |
sdague | it's the 3rd graph down | 13:49 |
sdague | of the er graphs | 13:49 |
sdague | it's so high and so frequent, I feel like we must be misunderstanding something | 13:49 |
chmouel | do we need to install python-netaddr from the packages first? | 13:50 |
*** emagana has quit IRC | 13:50 | |
*** emagana has joined #openstack-infra | 13:50 | |
fungi | right. it's like i was explaining to jog0, we can *either* have catchall buckets like "jenkins it breakybadz" or we can track specific problems, but please let's not try to use a bug with "we gots stack traces" to diagnose actual failures | 13:50 |
sdague | fungi: so, we can do it however you'd like to | 13:51 |
sdague | but realize that those are failure events in graphite | 13:51 |
sdague | so right now ~ 40% of graphite failures for gate jobs are infra | 13:51 |
sdague | for the last week | 13:51 |
*** salv-orlando has joined #openstack-infra | 13:52 | |
*** zul has quit IRC | 13:52 | |
fungi | well, i'm okay with catchall bucket bugs for that. and i'm fine with "jenkins stack trace" as an elastic-recheck pattern, but keep in mind that it's not going to assist much in diagnosing the underlying problem and the moment other devs start jumping in and trying to use the bug to that end, we're going to be running in circles chasing our tails | 13:52 |
*** dkliban has joined #openstack-infra | 13:52 | |
*** jcoufal-m_ has quit IRC | 13:53 | |
sdague | fungi: sure | 13:53 |
sdague | fungi: the point I'm trying to get at is: is that issue, which looks like a failure to launch at all, something that we already recover from? | 13:53 |
*** dcramer_ has quit IRC | 13:53 | |
fungi | that bug you linked has already collected stack trace details for two almost certainly unrelated issues, and dims was trying to use it to track down upstream bugs in jenkins. that's going to waste a lot of people's time | 13:53 |
*** yamahata has joined #openstack-infra | 13:53 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 13:54 | |
fungi | the *first* stack trace in that bug, from what we've seen, is the vm going missing between when it first talks to the jenkins master and when it gets assigned a job | 13:54 |
*** dprince has joined #openstack-infra | 13:55 | |
sdague | fungi: so we can work on getting these broken out, which is fine, this is a process | 13:55 |
fungi | the second stack trace in that bug is deeper in the slave agent, causing some manner of miscommunication with the master | 13:55 |
chmouel | it's a bit annoying that i can't reproduce on a clean precise vm :( tempest runs fine afterwards with my patch | 13:55 |
*** emagana has quit IRC | 13:55 | |
*** zul has joined #openstack-infra | 13:55 | |
*** emagana has joined #openstack-infra | 13:55 | |
fungi | sdague: we already had two separate bugs. i referred that comment back to the other bug | 13:56 |
*** markmc has quit IRC | 13:56 | |
sdague | fungi: ok, so we'll refine this. What I really want to know is are these gate resetting bugs, or are we actually autorecovering in zuul | 13:57 |
*** herndon_ has joined #openstack-infra | 13:57 | |
fungi | well, we have seen both those stack traces associated with job failures. that's not to say that they don't also appear when a job gets aborted/cancelled and we tear down the vm before jenkins is done processing the abort/cancellation | 13:58 |
sdague | fungi: so the current rates on those makes those the biggest cause of resets right now | 13:59 |
fungi | but i think in those cases we don't get logs into logstash, so if you're finding them there then these are likely jobs which did fail at some level | 13:59 |
sdague | fungi: this is datamining logstash | 13:59 |
fungi | right. that's what i figured | 13:59 |
sdague | so only if it gets to logstash, and is marked as FAILURE | 13:59 |
*** markmc has joined #openstack-infra | 13:59 | |
fungi | was there a job status of failure associated with those? | 13:59 |
sdague | build_status:FAILURE | 14:00 |
fungi | this keyboard is annoying me | 14:00 |
*** CaptTofu has quit IRC | 14:00 | |
*** CaptTofu has joined #openstack-infra | 14:00 | |
*** jcoufal has joined #openstack-infra | 14:01 | |
fungi | i do think it's probably not the biggest cause of actual gate resets though. the majority are going to be the one where the persistent slave is eaten by bug 1267364 and kills a lot of jobs at once, but we fix it by the time it's ejected one or two changes out of the gate (and the rest end up testing clean when the gate reset is done processing) | 14:02 |
uvirtbot | Launchpad bug 1267364 in openstack-ci "Recurrent jenkins slave agent failures" [Critical,In progress] https://launchpad.net/bugs/1267364 | 14:02 |
fungi | the continuing work to move our testing off persistent slaves is our current solution to that | 14:02 |
*** mfer has joined #openstack-infra | 14:03 | |
fungi | the incidence of it has gone way down in the past week from what i've seen (i've only had to offline one persistent slave in several days even under the heaviest load we've been seeing) | 14:03 |
fungi | it does still crop up for nonpersistent slaves, but they get torn down after impacting a single job rather than taking out dozens in a shooting-spree | 14:04 |
*** CaptTofu has quit IRC | 14:05 | |
*** annegent_ has joined #openstack-infra | 14:05 | |
*** smarcet has left #openstack-infra | 14:05 | |
sdague | fungi: http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiamF2YS5pby5JbnRlcnJ1cHRlZElPRXhjZXB0aW9uXCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIgIEFORCBtZXNzYWdlOlwiaHVkc29uLkxhdW5jaGVyJFJlbW90ZUxhdW5jaGVyLmxhdW5jaFwiIEFORCBidWlsZF9xdWV1ZTpnYXRlIiwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODk5Njc1MzI3MDl9 | 14:05 |
fungi | the combination of it mostly only cropping up when the jenkins masters are under heavy strain and the accompanying gate dynamics when we're under that sort of load make the ratio of full gate resets to individual job failures for that bug abnormally high (probably by orders of magnitude) | 14:06 |
IvanBerezovskiy | fungi, hi. Can I ask you a question about Cassandra and Hbase installation on CI nodes? | 14:06 |
sdague | 231 gate errors in the last 24 hrs | 14:06 |
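For context, the long logstash.openstack.org link above is a saved Kibana search; a count like "231 gate errors in the last 24 hrs" can be reproduced with a query along the following lines. This is only a sketch: the Elasticsearch endpoint path and the old-style "filtered" query form are assumptions, while the query string mirrors the one encoded in the link.

```python
# Count recent gate-queue hits for the hudson.Launcher stack-trace signature.
import json
import requests

QUERY = ('message:"java.io.InterruptedIOException" '
         'AND message:"hudson.Launcher$RemoteLauncher.launch" '
         'AND filename:"console.html" AND build_queue:"gate"')

body = {
    "query": {
        "filtered": {
            "query": {"query_string": {"query": QUERY}},
            "filter": {"range": {"@timestamp": {"gte": "now-24h"}}},
        }
    },
    "size": 0,  # we only want the hit count, not the documents
}

# endpoint path is an assumption about how the logstash site proxies Elasticsearch
resp = requests.post("http://logstash.openstack.org/elasticsearch/_search",
                     data=json.dumps(body))
print(resp.json()["hits"]["total"])
```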
fungi | IvanBerezovskiy: are you the one working to get it supported in ubuntu lts? | 14:06 |
*** jhesketh__ has quit IRC | 14:06 | |
fungi | sdague: how many were from centos6-1? | 14:06 |
*** markmcclain has joined #openstack-infra | 14:07 | |
fungi | that's the one which went wild last night while i was at dinner, and i had to put it down when i got back to the computer | 14:07 |
fungi | sdague: but i agree, we should take this as a sign to continue prioritizing a move to nonpersistent slaves for all non-privileged jobs | 14:09 |
sdague | sure | 14:09 |
sdague | 25 were tempest-dsvm-full | 14:09 |
sdague | so it's not just the unit test nodes | 14:09 |
fungi | good to know. those hopefully should have been only one job affected per slave experiencing that error | 14:10 |
dims | fungi, when you get a chance can i please have a stack trace from the 1.543 install for the JENKINS-19453 bug so i can try to match it to jenkins source to see if i can find something (per your comment #5 in bug 1260311) | 14:10 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 14:10 |
fungi | sdague: and that's for the stacktrace in 1267364, not the other one? | 14:10 |
fungi | dims: it's in the bug i linked | 14:10 |
*** prad has joined #openstack-infra | 14:11 | |
*** jaypipes has joined #openstack-infra | 14:11 | |
fungi | dims: oh, actually i guess it's not | 14:12 |
fungi | we only had them from the jenkins console, which expires out after 24 hours | 14:12 |
*** rakhmerov has joined #openstack-infra | 14:13 | |
*** NikitaKonovalov has quit IRC | 14:13 | |
sdague | fungi: http://logs.openstack.org/84/65184/4/gate/gate-tempest-dsvm-postgres-full/7c3f2bc/console.html is being classified as Bug 1260311 by jog0's query | 14:13 |
uvirtbot | Launchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/1260311 | 14:13 |
*** NikitaKonovalov has joined #openstack-infra | 14:13 | |
IvanBerezovskiy | fungi, as it was said here https://review.openstack.org/#/c/66884/ we can't use non-ubuntu mirrors, so i want to find another way to install these packages. My suggestion is to create a job for a single-use node like https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/files/jenkins_job_builder/config/storyboard.yaml . So it'll be a job with a shell script that'll install cassandra and hbase. What do you think? | 14:13 |
sdague | which, we can figure out if that's wrong | 14:13 |
*** yaguang has joined #openstack-infra | 14:14 | |
fungi | sdague: so there may be several different issues there | 14:14 |
jog0 | that query was taken straight from dims comment in the bug | 14:14 |
*** nati_ueno has joined #openstack-infra | 14:14 | |
sdague | fungi: sure, so we should narrow that out | 14:14 |
fungi | dims: the stacktrace we were seeing in both 1.525 and 1.543 is the java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins$MasterComputer one http://paste.openstack.org/show/60883/ | 14:15 |
sdague | I'm just concerned that we've had 40+ dsvm hits on that in the last 24hrs so resetting that way every 35 minutes seems very bad | 14:15 |
sdague | and could be a contributing factor to the zuul load | 14:15 |
fungi | sdague: right, this takes us back to "do we want a catchall bucket for people to recheck these against or should we have separate bugs for the different causes/events" | 14:15 |
sdague | fungi: what would you like? | 14:16 |
dims | fungi, the line numbers will be different between 1.525 and 1.543 - trying to figure out which stack trace came from which version | 14:16 |
*** jhesketh_ has quit IRC | 14:16 | |
fungi | IvanBerezovskiy: if it's for non-openstack jobs, that's fine. for openstack projects, all those jobs would fail any time that remote repository is unreachable/broken | 14:16 |
sdague | the er bug reporting is an art not a science, so we just want rules in there on how to categorize it. | 14:16 |
fungi | dims: ahh, i may not have captured the exact line numbers for one triggered from jenkins02 in that bug. we'd need to find a new slave exhibiting that failure from jenkins02 and get those details | 14:17 |
*** nati_ueno has quit IRC | 14:17 | |
*** rakhmerov has quit IRC | 14:17 | |
dims | thanks fungi i'll look for it as well | 14:18 |
sdague | fungi: is there better metadata in ES that we need to bin these? | 14:18 |
*** nati_ueno has joined #openstack-infra | 14:18 | |
jog0 | fungi: I am happy to split the bugs as you want | 14:18 |
*** zz_ewindisch is now known as ewindisch | 14:18 | |
jog0 | long as we are categorizing them under something I am happy | 14:18 |
fungi | sdague: i'm fine with catch-all bugs for elastic-recheck to use for infra problems, but we would still use separate infra bugs to work through the actual causes. in many cases, the bugs themselves will be solved before someone can add an accurate e-s pattern to match them | 14:19 |
ruhe | fungi: (on the topic started by IvanBerezovskiy), so the only option to test ceilometer backends, which aren't present in stable mirrors - is to get them (hbase and cassandra) supported in ubuntu lts? | 14:19 |
sdague | fungi: well that's not the case for at least 3 infra bugs right now | 14:19 |
jog0 | fungi: you won't like this query then: bug 1269940 | 14:19 |
uvirtbot | Launchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/1269940 | 14:19 |
fungi | ruhe: how do you expect people running plain ubuntu to test that on their own systems (particularly if they can't/won't install unvetted/insecure third-party packages)? | 14:20 |
*** sandywalsh has quit IRC | 14:20 | |
fungi | sdague: agreed. it ends up being the case for other infra bugs however | 14:21 |
*** rossella_s has joined #openstack-infra | 14:21 | |
fungi | jog0: i think it's like matching on "python traceback" | 14:21 |
jog0 | fungi: haha yup | 14:21 |
* dims realizes we need the jenkins01/02 info in logstash as well :) | 14:22 | |
jog0 | that is a catch all as a stop gap for classifying things | 14:22 |
fungi | "bug: we seem to be using python" | 14:22 |
*** amotoki_ has quit IRC | 14:22 | |
jog0 | so yes I agree its a really vague somewhat useless bug. so as we know more we can split the bug up | 14:23 |
ruhe | fungi: i understand your concern. the problem with these storage backends is that they only have vendor-managed repositories and no one wants to maintain them since they're complex software. i guess this topic should be discussed over email | 14:23 |
fungi | anyway, i need to step away for a few. i should learn not to start checking work e-mail and irc when i first wake up... it leads to me working half the morning from my bedroom and skipping breakfast as a result | 14:23 |
sdague | fungi: so I think that, given the windows of time where there aren't infra folks online, using er for real has value. Because bugs don't get fixed immediately | 14:23 |
*** yamahata has quit IRC | 14:23 | |
sdague | :) | 14:23 |
sdague | yeh, sorry about that | 14:23 |
*** yamahata has joined #openstack-infra | 14:23 | |
dims | fungi, :) | 14:23 |
chmouel | EmilienM: ping? | 14:24 |
EmilienM | chmouel: pong | 14:25 |
EmilienM | chmouel: here is good too, i usually talk about devstack on #openstack-qa though :-) | 14:25 |
*** dstanek has joined #openstack-infra | 14:25 | |
fungi | ruhe: i would argue that makes them immature software projects, and we should seek to help them improve that situation so that we *can* use them rather than just accepting that situation | 14:25 |
* fungi will bbiab | 14:25 | |
sdague | chmouel: yeh, lets take the grizzly devstack over to -qa | 14:26 |
EmilienM | chmouel: i was wondering about the cinder issue in devstack/havana; it's WIP by you and sdague, right? | 14:26 |
*** eharney has joined #openstack-infra | 14:28 | |
*** ryanpetrello has joined #openstack-infra | 14:29 | |
openstackgerrit | Nikita Konovalov proposed a change to openstack-infra/storyboard: Introducing basic REST API https://review.openstack.org/63118 | 14:30 |
*** herndon_ has quit IRC | 14:31 | |
*** nprivalova has joined #openstack-infra | 14:33 | |
*** sandywalsh has joined #openstack-infra | 14:33 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: add uncategorized failure generation code https://review.openstack.org/67267 | 14:35 |
*** mrmartin has joined #openstack-infra | 14:36 | |
*** pblaho has quit IRC | 14:36 | |
*** mrodden has quit IRC | 14:38 | |
*** dcramer_ has joined #openstack-infra | 14:39 | |
*** dansmith is now known as damnsmith | 14:40 | |
openstackgerrit | A change was merged to openstack-infra/reviewstats: Add --csv-rows option https://review.openstack.org/60115 | 14:42 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: add uncategorized failure generation code https://review.openstack.org/67267 | 14:42 |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 14:43 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 14:44 | |
openstackgerrit | Max Lobur proposed a change to openstack/requirements: Add futures library to global requirements https://review.openstack.org/66349 | 14:45 |
*** dizquierdo has joined #openstack-infra | 14:45 | |
openstackgerrit | Max Lobur proposed a change to openstack/requirements: Add futures library to global requirements https://review.openstack.org/66349 | 14:47 |
*** thuc has joined #openstack-infra | 14:49 | |
*** thuc_ has joined #openstack-infra | 14:49 | |
jog0 | was a bug filed for 'No distributions at all found for oslo.messaging>=1.2.0a11' ? | 14:50 |
jog0 | example: http://logs.openstack.org/82/64682/1/gate/gate-glance-pep8/f1dce31/console.html.gz | 14:50 |
*** beagles is now known as beagles_brb | 14:50 | |
*** mrodden has joined #openstack-infra | 14:51 | |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 14:51 |
jgriffith | EmilienM: Cinder issue in devstack/havana? | 14:51 |
*** fifieldt has joined #openstack-infra | 14:51 | |
EmilienM | jgriffith: yeah, the stuff you were talking about yesterday | 14:53 |
*** coolsvap has joined #openstack-infra | 14:53 | |
*** thuc has quit IRC | 14:53 | |
*** annegent_ has quit IRC | 14:53 | |
jgriffith | EmilienM: oh, but interesting it's only affecting Cinder now, which leads me to believe there's been a patch for other projects to address this? | 14:53 |
*** emagana_ has joined #openstack-infra | 14:54 | |
*** senk has joined #openstack-infra | 14:55 | |
*** russellb is now known as rustlebee | 14:55 | |
*** mrmartin has quit IRC | 14:55 | |
*** rakhmerov has joined #openstack-infra | 14:56 | |
*** jog0 is now known as flashgordon | 14:56 | |
*** oubiwann_ has joined #openstack-infra | 14:56 | |
*** emagana has quit IRC | 14:56 | |
*** marun has joined #openstack-infra | 14:57 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 14:57 | |
*** talluri has quit IRC | 14:57 | |
flashgordon | looks like this is the closest bug 1261253 | 14:59 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 14:59 |
*** dims is now known as dimsum | 15:00 | |
*** burt1 has joined #openstack-infra | 15:01 | |
*** Ajaeger has joined #openstack-infra | 15:01 | |
*** pblaho has joined #openstack-infra | 15:02 | |
fungi | aww, we lost jog0 now | 15:03 |
fungi | oh, wait, flashgordon | 15:03 |
fungi | flashgordon: the No distributions at all found for oslo.messaging>=1.2.0a11 is an interesting one | 15:03 |
fungi | flashgordon: that looks like pip 1.5 ignoring the -f | 15:04 |
fungi | i wish we had a pip --version and/or pip freeze at the end of that job | 15:05 |
*** talluri has joined #openstack-infra | 15:05 | |
*** esker has quit IRC | 15:06 | |
*** esker has joined #openstack-infra | 15:06 | |
*** esker has quit IRC | 15:06 | |
*** nicedice has joined #openstack-infra | 15:07 | |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Don't run non-voting gate-grenade-dsvm-neutron https://review.openstack.org/67485 | 15:08 |
flashgordon | fungi: casual nick friday in nova land | 15:08 |
flashgordon | sdague: ^ | 15:08 |
*** thedodd has joined #openstack-infra | 15:09 | |
flashgordon | fungi: logstash query message:"No distributions at all found for oslo.messaging>=1.2.0a11" AND filename:"console.html" | 15:09 |
flaper87 | fungi: another case where it fails in the gate and not locally: https://review.openstack.org/#/c/65499/ :( Do you think I can get access to one box? | 15:09 |
fungi | flashgordon: i got it, just slow. i've lost track of which days are which any more | 15:09 |
flaper87 | FWIW, I'm setting up an ubuntu saucy to test it too | 15:09 |
flashgordon | fungi: heh I am amazed you're still alive after this week | 15:09 |
*** nati_uen_ has joined #openstack-infra | 15:12 | |
*** jergerber has joined #openstack-infra | 15:12 | |
fungi | flaper87: which one? the py26 and py27 unit tests fail in entirely different ways (though also, no, can't really grant you access to the long-running 26 slave for infra policy reasons unless i completely tear down and replace it, and the 27 slave is a single-use node which was automatically deleted after it ran) | 15:12 |
flaper87 | fungi: py27 would've been enough. | 15:13 |
flaper87 | fungi: I'll set it up in my vm and see if I can replicate it | 15:14 |
*** nati_uen_ has quit IRC | 15:14 | |
dstufft | fungi: adding various --version invocations to things you're using is the best thing I learned from travis-ci tbh | 15:14 |
dstufft | it makes debugging things massively better | 15:14 |
*** nati_uen_ has joined #openstack-infra | 15:15 | |
*** nati_ueno has quit IRC | 15:15 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version https://review.openstack.org/67487 | 15:16 |
fungi | dstufft: yep, we do that in a lot of places | 15:17 |
fungi | just not ever enough places ;) | 15:17 |
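A small sketch of the "print your tool versions" habit dstufft describes, done from Python with pkg_resources; the package list is just an example of what a job might care about.

```python
# Log interpreter and key package versions at the start of a job run.
import sys
import pkg_resources

print("python %s" % sys.version.split()[0])
for name in ("pip", "setuptools", "oslo.messaging"):
    try:
        print("%s %s" % (name, pkg_resources.get_distribution(name).version))
    except pkg_resources.DistributionNotFound:
        print("%s (not installed)" % name)
```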
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template https://review.openstack.org/67489 | 15:19 |
*** emagana_ has quit IRC | 15:21 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3 https://review.openstack.org/67487 | 15:22 |
*** emagana has joined #openstack-infra | 15:22 | |
*** HenryG has joined #openstack-infra | 15:22 | |
flashgordon | fungi: what file do I touch to add branch name to logstash? | 15:23 |
flashgordon | re: master or stable/havana | 15:23 |
*** annegent_ has joined #openstack-infra | 15:24 | |
*** bookwar has left #openstack-infra | 15:24 | |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.2 https://review.openstack.org/67491 | 15:24 |
*** CaptTofu has joined #openstack-infra | 15:24 | |
*** rnirmal has joined #openstack-infra | 15:26 | |
fungi | flashgordon: do we not already index the zuul parameters for jobs in logstash? | 15:27 |
flashgordon | we have the build_ref | 15:27 |
*** gokrokve has joined #openstack-infra | 15:28 | |
*** IvanBerezovskiy has left #openstack-infra | 15:28 | |
*** annegent_ has quit IRC | 15:28 | |
flashgordon | would it be zuul_branch? | 15:29 |
fungi | flashgordon: i think that's probably what you want. remember in the context of our various integration tests there are multiple branches in play | 15:30 |
flashgordon | ohh nice zuul has docs | 15:30 |
fungi | and yes, zuul has very nice docs | 15:31 |
*** carl_baldwin has quit IRC | 15:31 | |
flashgordon | ' The target branch for the change that triggered this build | 15:31 |
flashgordon | fungi: if there is no zuul_change is there zuul_branch? | 15:32 |
*** carl_baldwin has joined #openstack-infra | 15:32 | |
clarkb | anteaya: salv-orlando: a -2 doesn't kick the change out of the gate. has a new patchset been pushed to it to kick it out of the gate? | 15:33 |
*** _NikitaKonovalov has joined #openstack-infra | 15:33 | |
clarkb | anteaya: salv-orlando: at this point it probably doesn't matter much as fungi and I are going to fork lift zuul and can simply not reverify that change | 15:33 |
fungi | flashgordon: i believe there is always a zuul_branch, yes (periodic bitrot jobs for example have no zuul_change but would still have a zuul_branch) | 15:34 |
*** mancdaz is now known as mancdaz_away | 15:34 | |
flashgordon | fungi: thanks | 15:34 |
*** kmartin has quit IRC | 15:34 | |
fungi | flashgordon: i'm going to double-check that though | 15:34 |
*** NikitaKonovalov has quit IRC | 15:34 | |
*** _NikitaKonovalov is now known as NikitaKonovalov | 15:34 | |
fungi | because now that i say it, i start to doubt myself | 15:34 |
flashgordon | heh thanks | 15:34 |
*** mancdaz_away is now known as mancdaz | 15:35 | |
*** kmartin has joined #openstack-infra | 15:35 | |
*** talluri has quit IRC | 15:35 | |
fungi | and that reminds me, some of the periodic jobs are still broken... need to track down where /opt/stack/new/devstack-gate/devstack-vm-gate.sh went: http://logs.openstack.org/periodic-qa/periodic-tempest-dsvm-all-havana/037442e/console.html | 15:36 |
*** marun has quit IRC | 15:36 | |
*** marun has joined #openstack-infra | 15:36 | |
*** jgrimm has joined #openstack-infra | 15:37 | |
*** annegent_ has joined #openstack-infra | 15:38 | |
mordred | morning fungi | 15:38 |
mordred | morning flashgordon clarkb | 15:38 |
fungi | morning mordred | 15:39 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Record build_branch in logstash https://review.openstack.org/67498 | 15:39 |
*** wenlock has joined #openstack-infra | 15:39 | |
flashgordon | fungi: ^ | 15:39 |
clarkb | morning | 15:39 |
flashgordon | sdague: ^ | 15:39 |
*** emagana has quit IRC | 15:39 | |
*** emagana has joined #openstack-infra | 15:40 | |
clarkb | fungi: I am mostly booted at this point and ready to do the zuul dance if you still think we should do that | 15:40 |
*** rcleere has joined #openstack-infra | 15:40 | |
*** davidhadas has joined #openstack-infra | 15:41 | |
fungi | dimsum: to your earlier question about identifying which jenkins master a job ran on, you can actually mine that out of the console log (though having it as a parameter would definitely be nice). the "Building remotely on" line hyperlinks to the appropriate jenkins master's webui | 15:41 |
fungi | clarkb: sure thing | 15:41 |
clarkb | zuul just merged a bunch of changes by the way. I think the d-g tempest concurrency change did have a drastic effect | 15:41 |
*** herndon has joined #openstack-infra | 15:41 | |
dimsum | fungi, y, just can't build a query that has the name of the jenkins host and snippet from hudson stack trace | 15:42 |
clarkb | https://jenkins02.openstack.org/job/gate-tempest-dsvm-full/6416/console seems to be a relatively common failure causing resets (but I haven't even looked at e-r just noticed that 404 is common to several test failures last night and this morning) | 15:42 |
*** esker has joined #openstack-infra | 15:43 | |
*** NikitaKonovalov is now known as NikitaKonovalov_ | 15:44 | |
*** bnemec is now known as beekneemech | 15:45 | |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 15:45 |
fungi | dimsum: in the meantime, the next rogue persistent slave i get failing jobs with that stack trace from jenkins02, i'll get the exact text including line numbers | 15:45 |
dimsum | fungi, cool | 15:45 |
openstackgerrit | Sergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3 https://review.openstack.org/67491 | 15:47 |
fungi | clarkb: so what's the zuul swap operation here? we snapshot the pipelines, kill zuul ungracefully, update a/aaaa records, copy over the queue dump to new-new-new-zuul, start the zuul service there, wait as necessary for the dns propagation, make sure jenkins masters are connecting to it, load the queue dumps and we're off to the races? | 15:48 |
*** carl_baldwin has quit IRC | 15:48 | |
*** carl_baldwin has joined #openstack-infra | 15:48 | |
clarkb | basically | 15:48 |
clarkb | we also need to check nodepool has connected to new zuul | 15:49 |
fungi | right, jenkins masters *and* nodepool | 15:49 |
fungi | good reminder | 15:49 |
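A rough sketch of the "snapshot the pipelines" step discussed here: read Zuul's status.json and dump every queued change with its pipeline, so the list can be fed to whatever re-enqueue mechanism is used once the new server is up. The URL and the exact JSON layout are assumptions based on the status page of this era.

```python
# Dump every change currently queued in Zuul, per pipeline, for later re-enqueueing.
import requests

status = requests.get("http://zuul.openstack.org/status.json").json()

snapshot = []
for pipeline in status.get("pipelines", []):
    for queue in pipeline.get("change_queues", []):
        for head in queue.get("heads", []):
            for item in head:
                snapshot.append((pipeline["name"], item.get("project"), item["id"]))

for pipeline_name, project, change in snapshot:
    # e.g. "gate openstack/nova 67374,1"; feed these to the re-enqueue step
    print("%s %s %s" % (pipeline_name, project, change))
```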
openstackgerrit | Ben Nemec proposed a change to openstack-dev/hacking: Enforce import group ordering https://review.openstack.org/54403 | 15:50 |
clarkb | 2 more changes can merge and there are a few check tests that can be reported but I am less concerned about the check tests | 15:51 |
*** JohanH has quit IRC | 15:51 | |
clarkb | but I think right around now is a decent time to do it as 2 changes will be merging and the gate is resetting otherwise | 15:52 |
*** senk has quit IRC | 15:52 | |
*** adrian_otto has joined #openstack-infra | 15:52 | |
fungi | oh, though the rather large queue lengths mean maybe we should gracefully stop it and wait for it to finish processing those? | 15:52 |
clarkb | fungi: that requires it fully processing everything in those queues which could take days | 15:53 |
clarkb | >_> | 15:53 |
fungi | we won't be able to copy over the event and result queues, right? | 15:53 |
clarkb | fungi: right | 15:53 |
fungi | it was down to 0/0 earlier | 15:53 |
clarkb | I suppose we can wait to see if those numbers fall shortly | 15:53 |
fungi | but it's started picking up now | 15:53 |
clarkb | it picked up during the last gate reset where the zuul main loop does nothing | 15:54 |
fungi | we caught a nova fail a couple changes from the head of the gate an hour or two ago and the delay that caused allowed the events/results to pile up | 15:54 |
fungi | yeah | 15:54 |
clarkb | normally that loop has a few iterations per second. during a gate reset it is one iteration every 15 or so minutes | 15:55 |
clarkb | another thing that occurred to me with back of napkin maths is that we only have enough slaves to run tests for ~64 changes concurrently | 15:55 |
salv-orlando | clarkb: I did first put a new patch set and then -2 it to ensure people did not approve it | 15:55 |
clarkb | salv-orlando: awesome, I missed that thanks | 15:55 |
clarkb | so we are battling the resets but also having only about 1/3 of the test resources we need to get out of the hole | 15:56 |
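A worked version of clarkb's back-of-napkin estimate; the node pool size, jobs-per-change count, and demand figure are stand-in assumptions chosen only to show the arithmetic, not measurements.

```python
# Back-of-napkin capacity estimate (all figures are illustrative assumptions).
nodes_available = 450    # devstack nodes nodepool can keep supplied at once
jobs_per_change = 7      # dsvm jobs each gate change needs concurrently

concurrent_changes = nodes_available // jobs_per_change
print(concurrent_changes)                       # ~64 changes under test at once

changes_needing_tests = 192                     # rough gate + check demand
print(concurrent_changes / float(changes_needing_tests))  # roughly 1/3 of what is needed
```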
*** jcoufal has quit IRC | 15:56 | |
*** adrian_otto has quit IRC | 15:56 | |
salv-orlando | But we've probably found out that all those unit test failures are related to an oslo change that went in yesterday | 15:56 |
clarkb | fungi: results queue is falling, under 100 now. I say we wait a handful of minutes to see if the events queue falls too | 15:56 |
*** pblaho has quit IRC | 15:56 | |
fungi | k | 15:57 |
mordred | fungi: from an hour ago, I would argue that it might also mean that distros haven't adapted to how some newer software operates and are trying to perpetuate a model that is more beneficial to their own processes than it is to solving today's problems | 15:57 |
fungi | clarkb: also i think 67186,1 and 67187,1 there are probably contributing to gate churn | 15:57 |
fungi | clarkb: since they're both stable branch changes | 15:58 |
clarkb | fungi: they would be then, we should omit them from the zuul reenqueue | 15:58 |
clarkb | fungi: oh other thing to do after we stop zuul, is to manually stop jobs in jenkinses so that nodepool can create new nodes | 15:58 |
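For the "manually stop jobs in jenkinses" step, a sketch of aborting in-progress builds of a single job through Jenkins' REST API so nodepool can recycle the nodes. The master URL, job name, and credentials are placeholders, and a real cleanup would loop over both masters and every job name (CSRF crumbs are ignored for brevity).

```python
# Abort any in-progress builds of one job on one Jenkins master.
import requests

MASTER = "https://jenkins02.openstack.org"
JOB = "gate-tempest-dsvm-full"
AUTH = ("admin", "secret")  # placeholder credentials

job = requests.get("%s/job/%s/api/json?tree=builds[number,building]" % (MASTER, JOB),
                   auth=AUTH).json()

for build in job["builds"]:
    if build["building"]:
        # POSTing to .../stop is the standard Jenkins "abort build" endpoint
        requests.post("%s/job/%s/%d/stop" % (MASTER, JOB, build["number"]), auth=AUTH)
        print("aborted %s #%d" % (JOB, build["number"]))
```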
*** annegent_ has quit IRC | 15:58 | |
clarkb | fungi: do you want to grab queue state, stop zuul, and update DNS while I kill jobs in jenkinses as quickly as I can? | 15:59 |
fungi | mordred: entirely possible, but in that case we need some serious reevaluation of our security support model | 15:59 |
mordred | fungi: I think we might need some serious reevaluation of our security support model | 16:00 |
clarkb | fungi: mordred: I am not seeing the context to security and distros | 16:00 |
clarkb | have a timestamp? | 16:00 |
mordred | because I'm not sure that the distro approach, which may involve staying on an old version of a piece of software that the otherwise very active upstream has stopped caring about, is the right thing to do | 16:00 |
fungi | clarkb: 14:13 utc | 16:00 |
*** dpyzhov has quit IRC | 16:01 | |
fungi | clarkb: our previous decisions not to install software from random third-party package repositories | 16:01 |
fungi | for testing official openstack projects | 16:01 |
mordred | cassandra has consistently not been a thing you really want to include in a distro - but I would not call it immature, even though I personally dislike many of their core devs | 16:01 |
clarkb | I see thanks | 16:01 |
mordred | they produce software intended for continuous deployment - and people who use it use it in those contexts - so manufacturing a 3-year stable release is just silly | 16:02 |
openstackgerrit | Tom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 16:03 |
*** yolanda has quit IRC | 16:03 | |
mordred | in fact - new thing from the CEO of redhat ... https://enterprisersproject.com/article/death-20 | 16:03 |
fungi | mordred: well, i didn't mean immature in a negative connotation. i meant the reasons free software is usually not packaged at all is one of 1. it's so new not enough people have interest in it yet, 2. it's not interesting in general or, 3. there are design issues with the software which make it too hard to package reliably/consistently | 16:03 |
mordred | talks about how even 6-month releases are getting to be too much | 16:03 |
*** nati_ueno has joined #openstack-infra | 16:03 | |
mordred | fungi: indeed. I'm mainly saying that I think that one of the design pieces of 3 might not be something you want to fix in some cases | 16:04 |
anteaya | clarkb: yes, salv-orlando submitted a new change to remove it from the gate, sorry I wasn't clear | 16:04 |
mordred | such as "the delivery model is intended for continual consumption" - which is actually more likely to be able to be dealt with at scale than a periodic release model | 16:04 |
anteaya | sorry salv-orlando already answered you | 16:05 |
clarkb | mordred: we should all start running arch | 16:05 |
fungi | mordred: i agree that some software stays well-tested enough that you can be reasonably assured of its reliability when drinking from the firehose. but there's also enough out there which still isn't, so the linux distributions play a useful role in shielding admins who don't want to discover yet another new software bug every morning when they get to work | 16:05 |
*** gokrokve has quit IRC | 16:05 | |
*** marun has quit IRC | 16:05 | |
mordred | fungi: totally | 16:05 |
mordred | I think that the distros can and do play a very useful role | 16:05 |
*** marun has joined #openstack-infra | 16:05 | |
*** gokrokve has joined #openstack-infra | 16:06 | |
*** nati_uen_ has quit IRC | 16:06 | |
clarkb | fungi: event queue isn't falling very quickly. I figure we give it a few more minutes but otherwise I feel like we should take a hatchet to it | 16:06 |
mordred | I'm just saying that a strict adherence to distro-packaged software may not necessarily be the right choice every time - which is a reversal of my traditional position | 16:06 |
mordred | fungi: I think that some things have changed in the high-volume/high-scale world and I don't think distro-world has caught up | 16:07 |
*** reed has joined #openstack-infra | 16:08 | |
clarkb | mordred: I agree, but I also think that projects need to provide something. eg a pip installable thing from pypi (we fail at this), because the 'put a jar file behind http without a sha1' approach that logstash takes, and our tarballs with similar problems, aren't very friendly | 16:08 |
fungi | mordred: well, i do agree, particularly since we're part of that ;) | 16:08 |
mordred | clarkb: +100 | 16:09 |
mordred | fungi: hehe | 16:09 |
mordred | I had the idea the other day that someone should upgrade apt-get so that it understood pip and mvn and npm and gem | 16:09 |
clarkb | fungi: I am going to step away for ~3 minutes then I say we go for it | 16:09 |
*** pasquier-s_ has quit IRC | 16:09 | |
mordred | so that you could perhaps do "apt-get install pip:python-novaclient" and it would do the right thing | 16:10 |
fungi | clarkb: sounds good. i need a quick coffee refill anyway | 16:10 |
clarkb | note we should reverify savanna changes first and omit stable/* changes | 16:10 |
fungi | mordred: you mean pip install apt:mysql-client | 16:11 |
*** gokrokve has quit IRC | 16:11 | |
fungi | ;) | 16:11 |
fungi | clarkb: which savanna changes? | 16:11 |
fungi | i probably missed them in scrollback | 16:11 |
*** vipul-away is now known as vipul | 16:12 | |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins host name to the logstash records https://review.openstack.org/67508 | 16:12 |
*** tangestani has joined #openstack-infra | 16:12 | |
dkranz | fungi: Any chance we can move https://review.openstack.org/#/c/63934/ (restoring fail on log errors) up in the queue? | 16:13 |
dkranz | I really don't want to see this fail because another log error crept in. | 16:13 |
fungi | dkranz: clarkb: let's move 63934 to the top of the gate list before we import it on the replacement zuul | 16:14 |
flashgordon | clarkb: btw only 5 hits on gate for the 404 issue you found | 16:14 |
*** afazekas_ has quit IRC | 16:14 | |
flashgordon | in last 7 days | 16:14 |
*** thuc has joined #openstack-infra | 16:14 | |
*** tangestani has quit IRC | 16:15 | |
fungi | flashgordon: for the console logs which got indexed anyway (scp plugin bug still lurking) | 16:15 |
flashgordon | fungi: ack, thats implied for everything | 16:15 |
flashgordon | 196 hits with check queue | 16:15 |
* fungi nods | 16:15 | |
* flashgordon files a bug | 16:15 | |
dimsum | Added a couple of reviews to grab the jenkins host name for logstash (https://review.openstack.org/#/c/67495/ https://review.openstack.org/#/c/67508/ ) | 16:15 |
* SergeyLukjanov triggered by savanna word used :) | 16:16 | |
fungi | dimsum: yep, saw those just now | 16:16 |
clarkb | fungi: in the gate queue | 16:16 |
clarkb | fungi: one last thing to check before we dive in, we should make sure that the zuul ref replication is disabled on new zuul and new new zuul | 16:17 |
clarkb | pretty sure jeblair dealt with that a week ago so all should be well | 16:17 |
*** thuc_ has quit IRC | 16:18 | |
fungi | clarkb: right, that was reverted in the zuul source. i'll check the clone on it | 16:18 |
clarkb | fungi: was it reverted in zuul source or just the config? | 16:18 |
*** thuc has quit IRC | 16:19 | |
fungi | oh... hrm | 16:19 |
fungi | right, it was the config | 16:19 |
*** lyle is now known as david-lyle | 16:19 | |
clarkb | I am logged into all 5 jenkins masters and ready to kill jobs | 16:20 |
clarkb | fungi: basically ready when you are | 16:20 |
fungi | i'm looking for the revert | 16:20 |
*** dizquierdo has quit IRC | 16:21 | |
*** anteaya is now known as tired | 16:21 | |
*** tired is now known as very_tired | 16:22 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: add bug metadata to graph list https://review.openstack.org/67510 | 16:22 |
clarkb | fungi: 0c8845494d308e8fedfd6e9890c5ea6cd2f85bdb | 16:22 |
clarkb | in config | 16:22 |
fungi | right, why couldn't i find that in the commit log? | 16:23 |
fungi | trying to do too many things at once | 16:23 |
clarkb | I did git log -p manifests/site.pp because I remembered it getting piped through there | 16:23 |
fungi | i don't see any reference to the git replication urls in zuul.conf on the new server | 16:24 |
fungi | hold on | 16:24 |
fungi | okay, sorry. local distraction | 16:25 |
fungi | so i missed why we need to reverify savanna changes if they're already in the gate | 16:26 |
*** esker has quit IRC | 16:26 | |
clarkb | fungi: isn't that how we restore the gate? | 16:26 |
*** BobBall is now known as BobBallAway | 16:26 | |
fungi | yeah, but don't we want to restore the whole gate, not just the savanna changes? | 16:26 |
*** vipul is now known as vipul-away | 16:27 | |
fungi | i'm clearly confused on some point | 16:27 |
clarkb | fungi: we do, just pointing out we want to reverify them first | 16:27 |
clarkb | so that their jobs queue up first as they are currently running | 16:27 |
fungi | oh, so they were causing some sort of disruption | 16:27 |
SergeyLukjanov | could I ask why savanna changes are so prio now? :) | 16:27 |
fungi | er, fixing some sort of disruption? | 16:27 |
clarkb | SergeyLukjanov: simply because they managed to run tests for half an hour and we are about to kill them | 16:27 |
clarkb | SergeyLukjanov: fungi: there is nothing special about those changes beyond their current position in the queue | 16:28 |
fungi | ahh, i see, you mean because they're in a different gate queue, so don't want to make them wait on available nodes | 16:28 |
clarkb | exactly | 16:28 |
*** Ajaeger has quit IRC | 16:29 | |
SergeyLukjanov | oh, see it too ;) | 16:29 |
SergeyLukjanov | thanks | 16:29 |
*** gyee_nothere has quit IRC | 16:29 | |
fungi | clarkb: and also prioritize 63934,3 so that we reduce the risk of more errors getting introduced before that merges | 16:29 |
clarkb | yup | 16:30 |
*** Ajaeger has joined #openstack-infra | 16:30 | |
clarkb | I am actually less worried about the stable/* jobs, I can push new patchsets to them in order to make an impression on the change approvers :) | 16:30 |
fungi | getting logged into rackspace and jenkins masters now | 16:30 |
*** gyee has joined #openstack-infra | 16:30 | |
clarkb | fungi: s/rackspace/nodepool/ ? | 16:31 |
*** marun has quit IRC | 16:31 | |
fungi | let's at least leave out 67186,1 and 67187,1 since we know about them and they're already relatively high up in the gate | 16:31 |
clarkb | fungi: k | 16:31 |
fungi | rackspace to make dns changes | 16:31 |
*** mrodden has quit IRC | 16:31 | |
clarkb | oh that | 16:31 |
*** vipul-away is now known as vipul | 16:31 | |
fungi | trying to reduce the zuul outage window as much as possible so we miss fewer patchset and approve events | 16:32 |
clarkb | ++ | 16:32 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 16:32 |
*** marun has joined #openstack-infra | 16:32 | |
*** krotscheck has joined #openstack-infra | 16:32 | |
clarkb | mordred: any chance you can statusbot us? | 16:33 |
*** MarkAtwood has joined #openstack-infra | 16:34 | |
fungi | the rackspace dns interface needs a filter | 16:34 |
clarkb | oh ya, otherwise it is tons of scrolling | 16:35 |
*** mrodden has joined #openstack-infra | 16:35 | |
fungi | okay, logged into the jenkins webuis, rackspace dashboard at the dns entries, cli on nodepool and both old and new zuul | 16:36 |
*** nati_uen_ has joined #openstack-infra | 16:36 | |
fungi | are you doing the zuul pipeline dump/restore, clarkb? | 16:36 |
clarkb | fungi: I thought you were :P I was going to kill jenkins jobs | 16:36 |
fungi | ahh, okay | 16:36 |
fungi | gimme a sec to refresh my memory on how that works | 16:37 |
clarkb | np, I believe the script is in zuul's tools dir | 16:37 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/config: Record short_build_uuid in logstash/ElasticSearch https://review.openstack.org/67516 | 16:37 |
*** nati_uen_ has quit IRC | 16:37 | |
*** markwash has quit IRC | 16:37 | |
fungi | there was also a ~root/zuul-changes2.py left over from the last round | 16:38 |
clarkb | flashgordon: re ^ I am pretty sure you can match on the short uuid | 16:38 |
*** nati_uen_ has joined #openstack-infra | 16:38 | |
flashgordon | clarkb: sample query? | 16:38 |
clarkb | flashgordon: just search for build_uuid:someshortuuid | 16:38 |
clarkb | notice the lack of quotes | 16:38 |
*** markwash has joined #openstack-infra | 16:39 | |
mfer | fungi is there a place I can "subscribe" to get an update of the openstack in an sdk name? i don't want to bug you but I'm so darn curious. | 16:39 |
clarkb | fungi: oh right, you want that one as it uses the zuul rpc cli | 16:39 |
*** nati_ueno has quit IRC | 16:39 | |
clarkb | fungi: but the old one will work using the reverifies too (if you give reverify a bug) | 16:40 |
*** Ajaeger has quit IRC | 16:40 | |
flashgordon | build_uuid:2123b9a | 16:40 |
flashgordon | vs: build_uuid:2123b9a6a1464d41864e8436d5bf4397 | 16:41 |
flashgordon | short has no hits | 16:41 |
flashgordon | clarkb: ^ | 16:41 |
clarkb | flashgordon: sorry you need build_uuid:2123b9a* | 16:41 |
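(For reference, the same short-uuid wildcard can be run against the logstash Elasticsearch backend directly. A minimal sketch using the elasticsearch Python client; the endpoint and index pattern are assumptions for illustration, not the real infra values.)

```python
# Minimal sketch, assuming a reachable Elasticsearch endpoint and a
# logstash-style index pattern; neither is the real infra configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

# Lucene query string; the trailing '*' lets the short uuid match the full
# build_uuid values stored in the indexed documents (note: no quotes).
result = es.search(index="logstash-*", q="build_uuid:2123b9a*", size=10)

for hit in result["hits"]["hits"]:
    src = hit["_source"]
    print("%s %s" % (src.get("build_uuid"), src.get("message")))
```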
*** SergeyLukjanov is now known as SergeyLukjanov_ | 16:42 | |
flashgordon | clarkb: sweet! | 16:42 |
flashgordon | thanks | 16:42 |
*** adrian_otto has joined #openstack-infra | 16:43 | |
flashgordon | clarkb: here is another one https://review.openstack.org/#/c/67498/ | 16:43 |
fungi | clarkb: okay, so it's... for pipeline in check gate post ; do python zuul-changes2.py http://zuul.openstack.org $pipeline > $pipeline.sh ; done | 16:43 |
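(A rough sketch of what a pipeline dump script like zuul-changes2.py does: walk zuul's status.json and emit one re-enqueue command per change. The JSON layout and the `zuul enqueue` options shown here are assumptions based on zuul of that era, not a copy of the actual script.)

```python
# Rough sketch only: the status.json structure and the zuul CLI flags are
# assumptions for illustration, not the real zuul-changes2.py.
import json
import sys
import urllib.request

url, pipeline = sys.argv[1], sys.argv[2]
with urllib.request.urlopen("%s/status.json" % url) as resp:
    status = json.loads(resp.read().decode())

for p in status["pipelines"]:
    if p["name"] != pipeline:
        continue
    for queue in p["change_queues"]:
        for head in queue["heads"]:
            for change in head:
                # prints something like:
                # zuul enqueue --trigger gerrit --pipeline gate \
                #   --project openstack/nova --change 12345,2
                print("zuul enqueue --trigger gerrit --pipeline %s "
                      "--project %s --change %s"
                      % (pipeline, change["project"], change["id"]))
```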
*** markmcclain has quit IRC | 16:44 | |
*** markmcclain has joined #openstack-infra | 16:44 | |
*** gothicmindfood has joined #openstack-infra | 16:44 | |
clarkb | fungi: k | 16:44 |
*** senk has joined #openstack-infra | 16:44 | |
*** AaronGr_Zzz is now known as AaronGr | 16:45 | |
*** adrian_otto has left #openstack-infra | 16:45 | |
fungi | oh, it won't dump post | 16:45 |
fungi | because those aren't changes | 16:45 |
clarkb | oh right, I think we can get away with that here | 16:45 |
fungi | looking through a sample real quick so i can confirm the reordering/filtering we want to do on the gate | 16:46 |
clarkb | flashgordon: that change is technically fine. question about why it is necessary though. A bug fingerprint should indicate a bug regardless of branch, and a false positive due to branch should itself be a bug correct? | 16:47 |
*** davidhadas_ has joined #openstack-infra | 16:47 | |
flashgordon | two fold | 16:47 |
flashgordon | one is its easier when digging through logstash | 16:47 |
*** davidhadas has quit IRC | 16:47 | |
flashgordon | and two, if we *know* a bug is stable only we can prevent false positives | 16:48 |
*** senk has quit IRC | 16:49 | |
clarkb | preventing false positives that way masks other bugs though | 16:49 |
*** markwash has quit IRC | 16:49 | |
*** DennyZhang has joined #openstack-infra | 16:49 | |
*** markmcclain has quit IRC | 16:50 | |
clarkb | fungi: how does the sample work? I suppose you can just comment out the lines for changes we want to ignore | 16:50 |
fungi | clarkb: yep, i'm getting the reordering into the final command line too though | 16:50 |
clarkb | oh right for savanna :) | 16:50 |
flashgordon | clarkb: it won't mask bugs it will leave them as unclassified | 16:50 |
clarkb | and dkranz's change | 16:50 |
fungi | and the error filtering fix | 16:50 |
*** senk has joined #openstack-infra | 16:51 | |
clarkb | flashgordon: if you didn't filter on branch it would match all branches | 16:51 |
*** cp16net is now known as goofy-nick-frida | 16:51 | |
flashgordon | either way, having the data makes understanding logstash data easier. | 16:51 |
flashgordon | before writing the fingerprint | 16:51 |
*** goofy-nick-frida is now known as goofy-nic-friday | 16:51 | |
fungi | okay, have it the way i want it. making sure i can copy it quickly now | 16:52 |
clarkb | flashgordon: gotcha | 16:54 |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins master name to the logstash records https://review.openstack.org/67508 | 16:54 |
*** NayanaD has joined #openstack-infra | 16:55 | |
*** NayanaD is now known as San_D | 16:55 | |
fungi | all set. so dumping the check/gate pipelines and immediately stopping zuul | 16:56 |
fungi | ready? | 16:56 |
clarkb | I am ready | 16:56 |
*** sgrasley has joined #openstack-infra | 16:56 | |
fungi | done | 16:57 |
fungi | updating dns now | 16:57 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard. https://review.openstack.org/67520 | 16:57 |
clarkb | ok killing jenkins jobs now | 16:57 |
*** coolsvap has quit IRC | 16:57 | |
*** tma996 has quit IRC | 16:58 | |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard. https://review.openstack.org/67520 | 16:58 |
krotscheck | My bad, sorry | 16:58 |
krotscheck | That one's good | 16:58 |
fungi | clarkb: whups. the aaaa record had a one-hour ttl on it | 16:58 |
fungi | i should have double-checked that last night | 16:59 |
clarkb | fungi: that'll teach me | 16:59 |
*** coolsvap has joined #openstack-infra | 16:59 | |
clarkb | :/ | 16:59 |
zaro | morning | 17:00 |
*** mancdaz is now known as mancdaz_away | 17:00 | |
*** vkozhukalov has quit IRC | 17:00 | |
fungi | clarkb: should nodepool get restarted to connect to the new zuul? | 17:01 |
clarkb | fungi: I believe the gear lib should do automatic reconnection | 17:01 |
fungi | and is it safe to start new zuul and reenqueue changes now even though the jenkins masters aren't connected to it yet? | 17:02 |
*** davidhadas_ has quit IRC | 17:02 | |
clarkb | jenkins masters cannot connect to it until it has started, the geard lib is embedded | 17:03 |
zaro | clarkb: you in today? | 17:03 |
clarkb | I think you need to wait for at least one master to advertise its job list before reenqueuing | 17:03 |
clarkb | zaro: after the zuul stuff is done I had planned on trying to make it in | 17:03 |
zaro | clarkb: office i mean | 17:03 |
clarkb | yes | 17:03 |
fungi | clarkb: more to the point, i meant is it okay to reenqueue changes before the jenkins masters are connecting to the new zuul. i assume so | 17:04 |
*** yaguang has quit IRC | 17:04 | |
clarkb | fungi: I don't think so | 17:04 |
*** markwash has joined #openstack-infra | 17:04 | |
clarkb | fungi: zuul may report those jobs as lost since gearman won't know how to run those jobs | 17:04 |
fungi | ahh, right, jobs won't be registered | 17:05 |
clarkb | so I think we start new new zuul, then get at least one master to connect to it, then reenqueue | 17:05 |
zaro | clarkb: Azher asked for a meeting to help him get setup with zuul and jjb today. didn't know if you were interested to be on the call. | 17:05 |
zaro | clarkb: meeting will be at 11am pst | 17:06 |
clarkb | zaro: we'll see... | 17:06 |
*** hashar has quit IRC | 17:07 | |
clarkb | fungi: ok, jenkins masters have had their jobs killed | 17:07 |
fungi | jenkins01 seems to have established sockets to the gearman port on new zuul's ipv4 address. that's a good sign | 17:09 |
clarkb | fungi: nodepool is connected to 162.242.150.96:4730 | 17:09 |
clarkb | which I think is new zuul | 17:09 |
fungi | yep, checking the other masters still, but good so far | 17:09 |
*** sarob has joined #openstack-infra | 17:10 | |
fungi | jenkins.o.o has no gearman connections according to netstat | 17:10 |
fungi | the other masters are connected to new zuul though | 17:10 |
clarkb | cool /me looks at jenkins.o.o | 17:10 |
*** hashar has joined #openstack-infra | 17:11 | |
clarkb | fungi: I am going to try disabling then enabling a job on that host as that kicks the gearman plugin | 17:11 |
fungi | k | 17:12 |
*** fifieldt has quit IRC | 17:12 | |
clarkb | that hasn't appeared to help | 17:13 |
*** sarob_ has joined #openstack-infra | 17:13 | |
clarkb | I lied I think it worked | 17:13 |
fungi | we could just restart jenkins service entirely | 17:13 |
clarkb | oh it is talking to old gearman | 17:13 |
clarkb | yeah lets do that | 17:13 |
fungi | netstat -nt|grep 4730 shows nothing on jenkins.o.o | 17:13 |
*** obondarev_ has joined #openstack-infra | 17:13 | |
fungi | stopping it now | 17:14 |
clarkb | fungi: jenkins log shows it trying to talk to the .88 address | 17:14 |
fungi | starting | 17:14 |
fungi | right, i suspected that was why there were no established sockets | 17:14 |
ttx | Why oh WHY is Gerrit asking me to rebase | 17:14 |
fungi | there it goes | 17:14 |
*** sarob has quit IRC | 17:14 | |
ttx | https://review.openstack.org/#/c/67422/ | 17:15 |
ttx | and I'm rebasing and it doesn't really help | 17:15 |
notmyname | gate status graphs for common gate jobs + several projects http://not.mn/all_gate_status.html | 17:15 |
fungi | i see 8 connections to the right gearman server now | 17:15 |
clarkb | ttx: hold on | 17:15 |
fungi | clarkb: ready for me to reenqueue all the things then? | 17:15 |
clarkb | fungi: I think we should try reenqueueing one thing first | 17:15 |
* ttx holds (and drinks more) | 17:15 | |
clarkb | fungi: see ttx's question | 17:15 |
*** thuc has joined #openstack-infra | 17:15 | |
fungi | clarkb: will do | 17:16 |
clarkb | fungi: because something seems off but that may just be that he got zuul when it had no workers | 17:16 |
*** thuc_ has joined #openstack-infra | 17:16 | |
fungi | enqueued 63934,3 into the gate | 17:17 |
*** jooools has quit IRC | 17:17 | |
fungi | clarkb: zuul hasn't cloned any repos in /var/lib/zuul/git yet | 17:18 |
clarkb | fungi: it should do that automagically | 17:18 |
fungi | git clone -v ssh://jenkins@review.openstack.org:29418/openstack/neutron /var/lib/zuul/git/openstack/neutron' returned exit status 128: Host key verification failed. | 17:18 |
clarkb | oh that :) | 17:18 |
fungi | i guess it's not puppeted? | 17:18 |
clarkb | apparently not | 17:18 |
fungi | what file(s) do i need? | 17:18 |
*** rakhmerov has quit IRC | 17:19 | |
fungi | i'll grab them from old zuul | 17:19 |
fungi | ahh, right, i can just accept the host key | 17:19 |
clarkb | fungi: it would be for the zuul users known hosts file | 17:19 |
fungi | added | 17:20 |
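(For the record, one way that step could be pre-seeded non-interactively, e.g. by a future puppet change, might look like the sketch below; the home directory path and the keyscan approach are assumptions, not what was actually done here.)

```python
# Sketch only: append review.openstack.org's gerrit ssh host key to the zuul
# user's known_hosts so the first ssh-based git clone does not fail host key
# verification. The home directory path is an assumption.
import subprocess

keys = subprocess.check_output(
    ["ssh-keyscan", "-p", "29418", "review.openstack.org"])
with open("/home/zuul/.ssh/known_hosts", "ab") as known_hosts:
    known_hosts.write(keys)
```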
fungi | should i restart zuul>? | 17:20 |
clarkb | no, just try reenqueueing that one change | 17:20 |
clarkb | zuul will do clones on the fly if necessary | 17:20 |
*** thuc has quit IRC | 17:20 | |
fungi | worked | 17:21 |
*** sarob_ has quit IRC | 17:21 | |
*** beagles_brb is now known as beagles | 17:21 | |
clarkb | but still failed to merge | 17:21 |
*** kiall has quit IRC | 17:21 | |
fungi | though i seem to have the old ipv6 address lodged deep within my browser | 17:21 |
*** sarob has joined #openstack-infra | 17:21 | |
fungi | it did check out the project for that change though | 17:21 |
clarkb | yup | 17:22 |
fungi | UnboundLocalError: local variable 'repo' referenced before assignment | 17:22 |
fungi | zuul bug? | 17:22 |
clarkb | yup must be | 17:23 |
fungi | i assume restarting zuul daemon is the best course of action for now? | 17:23 |
*** kiall has joined #openstack-infra | 17:24 | |
clarkb | ya why don't we do that | 17:24 |
*** jp_at_hp has quit IRC | 17:24 | |
clarkb | fungi: oh wait | 17:24 |
*** senk has quit IRC | 17:25 | |
clarkb | git doesn't know who we are | 17:25 |
*** yolanda has joined #openstack-infra | 17:25 | |
clarkb | that should really be puppeted. on old zuul the zuul user's gitconfig was set to set the name and email | 17:25 |
clarkb | we should do that by hand on new new zuul | 17:25 |
fungi | fixing | 17:25 |
*** sarob_ has joined #openstack-infra | 17:25 | |
clarkb | then document it needs puppeting | 17:25 |
*** gokrokve has joined #openstack-infra | 17:26 | |
openstackgerrit | Ruslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template https://review.openstack.org/67489 | 17:26 |
clarkb | looks like new zuul has a ~zuul/.gitconfig as well | 17:26 |
fungi | it does now | 17:26 |
clarkb | ok now try reenqueueing 63934 | 17:26 |
*** pballand has joined #openstack-infra | 17:27 | |
*** sarob__ has joined #openstack-infra | 17:27 | |
mordred | clarkb: ++ to puppeting | 17:28 |
*** sarob has quit IRC | 17:28 | |
clarkb | fungi: zuul is cloning all the things | 17:28 |
clarkb | which is something to note about using a tmpfs: if we don't prepopulate it, zuul startup will be a bit slower than before | 17:29 |
*** markwash has quit IRC | 17:29 | |
fungi | yeah, i expected that | 17:30 |
fungi | but that's just on reboot of the server | 17:30 |
clarkb | yup | 17:30 |
* mordred is excited about our new tmpfs overlord | 17:30 | |
*** sarob_ has quit IRC | 17:30 | |
fungi | hmmm, my bsd firewall here is segfaulting my shell | 17:30 |
clarkb | ok stuff is queueing | 17:30 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add noirc option to bot https://review.openstack.org/67525 | 17:31 |
fungi | i might not be long for this internet if my dhclient segfaults too | 17:31 |
fungi | ready for me to enqueue everything else then? | 17:31 |
sdague | cool, check queue going to refill automatically? | 17:31 |
clarkb | fungi: :( where did you stash your preserved queues? | 17:31 |
clarkb | sdague: yup, fungi grabbed check and gate queue state we just need to apply them now | 17:31 |
fungi | my homedir, though the first entry in the gate.sh is redundant now. fixing | 17:31 |
sdague | clarkb: do you have a list of promote bits from markmcclain | 17:31 |
ttx | fungi: any idea why i'm asked to rebase stuff ? | 17:32 |
clarkb | sdague: no I do not | 17:32 |
ttx | I rebased on HEAD and that doesn't work either | 17:32 |
*** nati_ueno has joined #openstack-infra | 17:32 | |
ttx | https://review.openstack.org/#/c/67422/ | 17:32 |
fungi | clarkb: ready for me to requeue all the things? | 17:32 |
clarkb | ttx: yes, we just moved zuul to new host with a tmpfs /var/lib/zuul/git to speed up the zuul git operations. When we did that we discovered that puppet did not configure git for zuul properly | 17:32 |
ttx | hah. | 17:32 |
clarkb | fungi: I think so | 17:32 |
clarkb | ttx: we fixed that by hand and have noted that we need to automate it, you should not be asked to rebase anymore | 17:33 |
ttx | clarkb: any ETA on fix ? Should I stay online for the next 5 min or come back in two hours ? | 17:33 |
clarkb | ttx: we just fixed it | 17:33 |
fungi | clarkb: it's running under a screen session for the root user now | 17:33 |
fungi | in case i disappear | 17:33 |
ttx | clarkb: hmm, but how do I push the change AGAIN | 17:33 |
fungi | ttx: recheck or reverify | 17:33 |
fungi | ttx: or reapprove | 17:34 |
clarkb | ttx: you shouldn't need to, the existing patchset should be fine | 17:34 |
clarkb | fungi: looks like python26 slaves/jobs are having trouble ;( | 17:34 |
*** fbo is now known as fbo_away | 17:34 | |
sdague | https://etherpad.openstack.org/p/montreal-code-sprint - under Parallel | 17:34 |
fungi | clarkb: i still can't see the new status page because of my resolver cache | 17:34 |
clarkb | fungi: I am going to disable then enable jobs on jenkins01 and 02 to rekick gearman | 17:34 |
ttx | clarkb: except Jenkins -2'ed it already. I reverified it. We'll see how it goes. Thanks! | 17:34 |
fungi | trying to call rndc flushname was how i discovered my firewall is in trouble | 17:34 |
sdague | but unfortunately markmcclain isn't here at the moment | 17:34 |
fungi | clarkb: okay | 17:35 |
sdague | I guess we'll just wait until he builds a list when he gets back | 17:35 |
fungi | clarkb: do we think gearman plugin didn't reconnect to zuul properly when we restarted the service? | 17:35 |
*** nati_uen_ has quit IRC | 17:35 | |
clarkb | fungi: I think there is a bug in gearman client where it doesn't register all of its jobs | 17:35 |
fungi | ahh | 17:36 |
clarkb | er gearman plugin not client | 17:36 |
*** nati_ueno has quit IRC | 17:36 | |
clarkb | fungi: you should edit your /etc/hosts :P to get zuul status | 17:36 |
fungi | clarkb: i'm going to | 17:36 |
*** nati_ueno has joined #openstack-infra | 17:36 | |
clarkb | fungi: https://jenkins01.openstack.org/job/gate-cinder-python27/5053/console | 17:36 |
clarkb | not sure why that is happening | 17:37 |
fungi | clarkb: oh, wait, i'm not resolving zuul incorrectly. the status page just seems to be broken for some reason | 17:37 |
fungi | oh, or maybe i am | 17:38 |
*** rakhmerov has joined #openstack-infra | 17:38 | |
clarkb | oh I wonder if the test slaves have the ipv6 address cached | 17:38 |
clarkb | I can fetch the ref that the gate-cinder-ypthon27 job failed to fetch | 17:38 |
*** rakhmerov has joined #openstack-infra | 17:38 | |
fungi | there we go. had to clear my browser cache too | 17:39 |
*** yassine has quit IRC | 17:39 | |
*** senk has joined #openstack-infra | 17:39 | |
*** yassine has joined #openstack-infra | 17:40 | |
fungi | clarkb: hmmm, you mean like maybe a local dnscache daemon on the slaves? | 17:40 |
clarkb | ya | 17:40 |
fungi | that might be a centos thing, agreed | 17:40 |
*** DennyZha` has joined #openstack-infra | 17:40 | |
clarkb | that is a python27 job | 17:40 |
*** DennyZhang has quit IRC | 17:41 | |
*** tjones has joined #openstack-infra | 17:41 | |
clarkb | I think we are mostly good now, just need to ride out the hiccups | 17:42 |
*** ruhe is now known as _ruhe | 17:42 | |
*** yassine has quit IRC | 17:42 | |
*** yassine has joined #openstack-infra | 17:43 | |
*** praneshp has joined #openstack-infra | 17:43 | |
*** yassine has quit IRC | 17:43 | |
*** hashar has quit IRC | 17:43 | |
clarkb | though the enqueue seems to not update zuul status? debug.log shows many jobs starting implying the enqueue is working but status doesn't reflect that for me | 17:44 |
fungi | maybe those are still in the event queue? | 17:44 |
clarkb | looks like the Run handler has only woken twice in the last 10 minutes, I think using the rpc to enqueue may act like a gate reset and hold everything up while it does its work | 17:45 |
*** yassine has joined #openstack-infra | 17:45 | |
*** yassine has quit IRC | 17:45 | |
*** sarob has joined #openstack-infra | 17:45 | |
*** DennyZha` has quit IRC | 17:45 | |
*** sarob has quit IRC | 17:45 | |
*** sarob has joined #openstack-infra | 17:45 | |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Added apache license to footer https://review.openstack.org/67347 | 17:46 |
*** mattray has joined #openstack-infra | 17:47 | |
*** yamahata has quit IRC | 17:48 | |
*** sarob__ has quit IRC | 17:48 | |
fungi | worth noting, the server is basically idle cpu-wise | 17:48 |
fungi | so this has to be network-related delays, right? | 17:48 |
clarkb | or the enqueue isn't doing what we expect | 17:49 |
fungi | 2014-01-17 17:49:27,357 INFO zuul.Gerrit: Updating information for 67333,4 | 17:50 |
*** sarob_ has joined #openstack-infra | 17:50 | |
fungi | maybe gerrit's getting firebombed | 17:50 |
*** talluri has joined #openstack-infra | 17:50 | |
mordred | load is fine on gerrit | 17:50 |
fungi | yep | 17:50 |
clarkb | http://paste.openstack.org/show/61460/ | 17:50 |
clarkb | I think gearman function registering is not working so well. I will enable disable on all jenkins masters | 17:51 |
fungi | okay | 17:51 |
*** harlowja_away is now known as harlowja | 17:51 | |
*** mrodden1 has joined #openstack-infra | 17:52 | |
clarkb | have done 1 2 3 and 4 doing jenkins.o.o now | 17:52 |
fungi | clarkb: you also have a thing of some kind in 8 minutes, right? if you need to stop, i can work through the rest of this | 17:52 |
*** mrodden has quit IRC | 17:52 | |
*** sarob has quit IRC | 17:52 | |
clarkb | fungi: well its a meeting thing. I should be able to give you a bit of time | 17:53 |
*** rnirmal has quit IRC | 17:53 | |
fungi | k | 17:53 |
clarkb | all jenkinses should have reregistered their gearman functions | 17:53 |
fungi | load on gerrit is spiking, so we did something | 17:53 |
clarkb | ok, going to watch tail -f /var/log/zuul/debug.log | grep ERROR | 17:54 |
fungi | same thing i'm doing | 17:54 |
*** sarob_ has quit IRC | 17:55 | |
*** sarob has joined #openstack-infra | 17:55 | |
clarkb | we seem to still be hitting ERROR zuul.Gearman: Exception while checking functions | 17:56 |
clarkb | for that same set_description job | 17:56 |
clarkb | zaro: any idea why that is happening? | 17:56 |
clarkb | Function set_description:jenkins01.openstack.org is not registered | 17:56 |
zaro | clarkb: i think i've got the scp-plugin patch ready. but i have a few meetings now, so will discuss with you after 1pm. | 17:57 |
*** jerryz has joined #openstack-infra | 17:57 | |
clarkb | zaro: sure, can you take a quick look at ^ | 17:57 |
*** jerryz has quit IRC | 17:57 | |
*** jerryz has joined #openstack-infra | 17:57 | |
zaro | clarkb: yeah. let me find that in the code during my meeting. | 17:57 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1261253 https://review.openstack.org/67539 | 17:58 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 17:58 |
*** sarob_ has joined #openstack-infra | 17:59 | |
*** yolanda has quit IRC | 17:59 | |
fungi | seen a couple timeout errors since... gate-tempest-dsvm-neutron-large-ops and gate-ceilometer-pep8 | 17:59 |
*** sarob has quit IRC | 18:00 | |
fungi | er, the jobs were probably unrelated | 18:01 |
fungi | Exception while checking functions | 18:01 |
*** sarob has joined #openstack-infra | 18:01 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 18:01 |
clarkb | fungi: ya, those exceptions seem to be timeout errors | 18:01 |
zaro | clarkb: is stop function registered? | 18:02 |
fungi | in connection.sendAdminRequest | 18:02 |
clarkb | zaro: fungi: I am not sure if stop function is regisered but /var/log/zuul/gearman-server.log shows errors around getting its status | 18:03 |
clarkb | zaro: fungi: that looks like a possible geard bug | 18:03 |
*** sarob__ has joined #openstack-infra | 18:04 | |
mordred | clarkb, fungi: I've been floating in and out - please ping me if I can be useful to your brains | 18:04 |
*** odyssey4me has quit IRC | 18:04 | |
clarkb | fungi: zaro: I think zuul slowness may be due to those timeouts, it is waiting and waiting and well waiting | 18:05 |
clarkb | should we possibly try restarting zuul to begin a new geard? | 18:05 |
*** sarob_ has quit IRC | 18:05 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 18:05 |
fungi | clarkb: i can do that and reenqueue it all again | 18:05 |
fungi | clarkb: should we include a brief wait for jenkins masters to reconnect to the gearman service? | 18:06 |
*** sarob has quit IRC | 18:06 | |
clarkb | fungi: yes, I think so | 18:06 |
fungi | okay, killing zuul now | 18:06 |
clarkb | fungi: well a wait before reenqueing | 18:06 |
fungi | yeah, that's what i meant | 18:06 |
clarkb | fungi: the gearman service is a child of the zuul service so you start them both with the zuul init script | 18:07 |
fungi | how long do you think is sane? | 18:07 |
clarkb | half a minute is probably plenty | 18:07 |
fungi | k | 18:07 |
clarkb | fungi: you can telnet localhost 4730 and run send status to the socket | 18:07 |
clarkb | that should return a giant list of everything ever | 18:07 |
fungi | right now it returns nothing | 18:08 |
clarkb | just 'status' returns nothing? | 18:08 |
fungi | oh you said run send status | 18:08 |
clarkb | gah my bad | 18:08 |
fungi | yeah, status returns a ton | 18:09 |
clarkb | the command is just 'status' | 18:09 |
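(The same check works without telnet; a small sketch that speaks the gearman admin protocol over a plain socket, using the host/port discussed above. The per-line format noted in the comment is from the gearman protocol docs.)

```python
# Minimal sketch: ask geard for its status over the admin protocol. Each
# response line is roughly "function<TAB>total<TAB>running<TAB>workers" and
# the listing ends with a line containing a single '.'.
import socket

sock = socket.create_connection(("localhost", 4730), timeout=10)
sock.sendall(b"status\n")

data = b""
while not data.endswith(b".\n"):
    chunk = sock.recv(4096)
    if not chunk:
        break
    data += chunk
sock.close()

for line in data.decode().splitlines():
    print(line)
```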
fungi | though it picked up a nova change in the check pipeline already and marked a gate-nova-python26 as lost | 18:09 |
clarkb | fungi: then before reenqueueing the world I think we try enqueueing one change again. and tail zuul/debug.log and zuul/gearman-server.log | 18:09 |
*** galstrom has joined #openstack-infra | 18:09 | |
clarkb | fungi: :( | 18:09 |
clarkb | fungi: I wonder if that means jenkins* but not jenkins01 and jenkins02 have registered their functions | 18:10 |
clarkb | as only 01 and 02 can run the python26 jobs | 18:10 |
fungi | well, i've reenqueued the devstack-gate change we had at the top before | 18:11 |
fungi | but it has no py26 jobs | 18:11 |
*** sarob__ has quit IRC | 18:11 | |
fungi | ERROR zuul.Gearman: Job <gear.Job 0x7fbe68147690 handle: None name: build:gate-trove-python27 unique: 247c5ef1806f4581ac54f8b7cb31e8b3> is not registered with Gearman | 18:11 |
clarkb | fungi: how does zuul/gearman-server.log look? are there any recent tracebacks for the stop job? | 18:11 |
*** sarob has joined #openstack-infra | 18:11 | |
clarkb | why is gearman so cranky | 18:12 |
*** pballand has quit IRC | 18:12 | |
fungi | 2014-01-17 18:06:23 [...] KeyError: 'stop:jenkins01.openstack.org' | 18:12 |
clarkb | so that is from before the restart correct? | 18:12 |
fungi | checking | 18:13 |
fungi | 18:06 was the start | 18:13 |
*** Ajaeger has joined #openstack-infra | 18:14 | |
fungi | ahh, stopped at 18:06:34 | 18:15 |
zaro | fungi: is that from jenkins gearman plugin? | 18:15 |
Ajaeger | What is a "LOST" failure for a gate? https://review.openstack.org/#/c/67493/ | 18:15 |
fungi | started at 18:06:49 | 18:15 |
fungi | Ajaeger: us | 18:15 |
fungi | so that keyerror was from before i stopped it | 18:15 |
Ajaeger | fungi: ok, I'll let you fix it ;) | 18:15 |
*** herndon has quit IRC | 18:16 | |
*** sarob has quit IRC | 18:16 | |
*** yamahata has joined #openstack-infra | 18:16 | |
clarkb | fungi: in that status listing does jenkins01 or jenkins02 show up at all? | 18:17 |
*** hogepodge has joined #openstack-infra | 18:17 | |
fungi | clarkb: i just tried reenqueuing a savanna change and got this in the log... | 18:18 |
fungi | 2014-01-17 18:16:52,521 WARNING zuul.Scheduler: Build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> not found by any queue manager | 18:18 |
clarkb | fungi: ya that is resulting in LOST builds | 18:18 |
fungi | ERROR zuul.DependentPipelineManager: Exception while canceling build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> for change <Change 0x7fbe60456410 66554,4> | 18:18 |
clarkb | it couldn't cancel it because there was no job I bet | 18:18 |
fungi | oh, wait, i need the non-cancel errors | 18:19 |
fungi | there | 18:19 |
fungi | 2014-01-17 18:16:52,401 ERROR zuul.Gearman: Job <gear.Job 0x7fbe602e1210 handle: None name: build:gate-savanna-python26 unique: 8363618cf6394cf4bfc5e2596c900e09> is not registered with Gearman | 18:19 |
clarkb | ya, that means the jenkins masters never registered that function with the geard daemon | 18:19 |
clarkb | fungi: perhaps look at jenkins logs on 01 and 02 to see if the gearman plugin is puking? | 18:20 |
salv-orlando | fungi, clarkb: sorry for the interruption - I assume it's not yet ok to start approving again patches? | 18:20 |
clarkb | salv-orlando: ya not quite yet, we have run into unexpected trouble with gearman | 18:20 |
fungi | and now status on port 4730 returns nothing | 18:20 |
salv-orlando | clarkb: will keep lurking waiting for a go-ahead | 18:20 |
clarkb | fungi: o_O how does the gearman-server.log look? | 18:21 |
fungi | 2014-01-17 18:20:54,214 ERROR gear.BaseClientServer: Exception in poll loop | 18:21 |
fungi | KeyError: 'stop:jenkins03.openstack.org' | 18:21 |
*** salv-orlando has quit IRC | 18:22 | |
*** marun has quit IRC | 18:22 | |
*** marun has joined #openstack-infra | 18:22 | |
fungi | quite a few, but all for jenkins02 | 18:22 |
fungi | er, for jenkins03 | 18:22 |
clarkb | fungi: out of curiosity how does the version of gear compare on new zuul and new new zuul | 18:23 |
fungi | oh crap, this is what you ran into last time | 18:23 |
clarkb | ya | 18:23 |
clarkb | so restarting it didn't help | 18:23 |
fungi | pip freeze says gear==0.5.0 | 18:23 |
*** herndon has joined #openstack-infra | 18:24 | |
fungi | same as on old zuul | 18:24 |
fungi | also, we have newer statsd on new zuul | 18:25 |
fungi | separate problem | 18:25 |
fungi | i've downgraded statsd while i'm thinking about it | 18:26 |
clarkb | next crazy idea, stop jenkinses, bring up one at a time in a relatively slow manner allowing each to register with gearman without thrash | 18:26 |
fungi | okay, doing | 18:26 |
fungi | sounds sane enough to me | 18:26 |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 18:27 | |
*** hogepodge has quit IRC | 18:28 | |
*** aude has joined #openstack-infra | 18:30 | |
clarkb | fungi: and check the gearman plugin versions are consistent across jenkinses, pretty sure jeblair ran into that though and made them consistent | 18:30 |
*** max_lobur is now known as max_lobur_afk | 18:30 | |
*** hogepodge has joined #openstack-infra | 18:30 | |
*** nati_uen_ has joined #openstack-infra | 18:31 | |
fungi | will do. also deleting offline slaves, including long-running ones, so they don't get brought back online when jenkins restarts. i'll note them here | 18:31 |
*** CaptTofu has quit IRC | 18:31 | |
*** smurugesan has joined #openstack-infra | 18:31 | |
*** kgriffs has joined #openstack-infra | 18:33 | |
*** nati_ueno has quit IRC | 18:33 | |
*** luqas has quit IRC | 18:35 | |
*** marun has quit IRC | 18:35 | |
*** marun has joined #openstack-infra | 18:35 | |
*** jaypipes has quit IRC | 18:36 | |
fungi | centos6-1, precise{1,11,13,17,19,21,27,29,3,37,39,7,14,16,34,38,4,40,8} | 18:37 |
clarkb | wow that is a lot of slaves | 18:37 |
*** hogepodge has quit IRC | 18:37 | |
clarkb | I spot checked gearman plugin versions and they all look consistent and are 0.0.4.2.ad75b7e | 18:37 |
fungi | yeah, jenkins masters have been so loaded they're failing out slaves right and left | 18:37 |
fungi | i'm still deleting offline nodepool nodes on 03 and 04, but i'll begin restarting jenkins services one at a time on the other masters | 18:39 |
clarkb | k | 18:39 |
clarkb | fungi: if you tail the jenkins.log for the masters as they come up you should see it registering gearman functions. you can use that to get a sense for what is being registered and how long it takes | 18:39 |
fungi | INFO: ---- Worker pypi.slave.openstack.org_exec-0 registering 184 functions | 18:43 |
fungi | clarkb: ^ that? | 18:43 |
clarkb | yeah | 18:43 |
clarkb | it should happen for all the workers and go on and on. the lists are fairly large, which is why I wonder if geard or the gearman plugin may not keep up | 18:43 |
fungi | so, status is still returning absolutely nothing from the gear socket on new zuul, fwiw | 18:43 |
clarkb | really | 18:43 |
fungi | a few minutes after starting jenkins on jenkins.o.o | 18:44 |
fungi | making me wonder if the geard is kaput | 18:44 |
clarkb | ya | 18:44 |
clarkb | oh I bet status fails due to that keyerror | 18:44 |
clarkb | and once that happens geard is kaput | 18:44 |
fungi | so stop jenkins.o.o again, restart zuul, then start jenkins again? | 18:45 |
clarkb | sure? | 18:45 |
*** lucasagomes has joined #openstack-infra | 18:46 | |
*** lucasagomes has left #openstack-infra | 18:46 | |
fungi | status is working now | 18:47 |
*** _ruhe is now known as ruhe | 18:48 | |
*** herndon has quit IRC | 18:50 | |
clarkb | ermagerd 67025 is running python26 job | 18:50 |
*** smarcet has joined #openstack-infra | 18:50 | |
clarkb | fungi: I wonder, could the reenqueue thing that speaks rpc be breaking zuul/geard because of some bug? | 18:51 |
fungi | clarkb: maybe. though i gathered that's how stuff was reenqueued on the last zuul too | 18:52 |
clarkb | fungi: k, probably worth retrying with the reenqueue rpc and if it fails *AGAIN* then fall back on reverify/recheck | 18:52 |
*** salv-orlando has joined #openstack-infra | 18:53 | |
fungi | yep, confirmed that all the jenkins masters are restarted and gear status is still responding | 18:53 |
*** vkozhukalov has joined #openstack-infra | 18:53 | |
*** markwash has joined #openstack-infra | 18:53 | |
clarkb | yay! | 18:54 |
fungi | reenqueued the savanna change which was bailing on us before | 18:54 |
clarkb | I think that is a real bug in geard, when the dust settles we should grab relevant logs, and submit a bug | 18:54 |
SergeyLukjanov | fungi, sorry for our naughty jobs :) | 18:55 |
fungi | reenqueued the devstack-gate change | 18:55 |
fungi | geard status is still fine | 18:55 |
*** markmcclain has joined #openstack-infra | 18:55 | |
fungi | SergeyLukjanov: your jobs were fine. our servers were not | 18:55 |
*** markmcclain has quit IRC | 18:55 | |
*** markmcclain has joined #openstack-infra | 18:56 | |
fungi | someone snuck a neutron change in, but it's looking fine too | 18:56 |
fungi | so far everything has workers and no "lost" results | 18:57 |
*** thuc_ has quit IRC | 18:57 | |
clarkb | fungi: yup looks good from my end too | 18:57 |
fungi | trying the mass reenqueue again now | 18:57 |
*** thuc has joined #openstack-infra | 18:57 | |
fungi | event queue is spiking, of course | 18:57 |
clarkb | fungi: pretty sure the registration and starting of jobs is racy in the zuul, geard, gearman-plugin stack and if you catch it just right it causes geard to crash | 18:57 |
*** markwash_ has joined #openstack-infra | 18:58 | |
fungi | reenqueue scripts have returned | 18:59 |
clarkb | nice | 18:59 |
fungi | zuul seems to be tearing through the event queue now | 18:59 |
clarkb | fungi: now, one last quick sanity check. if you grep for 'zuul.Repo' in the debug.log you will get timestamps for all of the git operations | 18:59 |
clarkb | it used to take 9-15 seconds per change, but tmpfs should make that faster | 19:00 |
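(A throwaway sketch of that measurement: pull the timestamps off consecutive zuul.Repo lines and print the gaps as a crude proxy for how long each git operation takes. The log path and timestamp format match the log lines pasted earlier; treating line-to-line gaps as durations is an assumption.)

```python
# Crude sketch: gaps between consecutive zuul.Repo log lines, used here as a
# stand-in for per-operation git timing. Assumes the standard zuul debug.log
# timestamp prefix ("2014-01-17 17:49:27,357 ...").
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
prev = None

with open("/var/log/zuul/debug.log") as log:
    for line in log:
        if "zuul.Repo" not in line:
            continue
        stamp = datetime.strptime(line[:23], FMT)
        if prev is not None:
            gap = (stamp - prev).total_seconds()
            print("%7.2fs  %s" % (gap, line.rstrip()))
        prev = stamp
```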
*** markwash has quit IRC | 19:00 | |
*** markwash_ is now known as markwash | 19:00 | |
fungi | load on zuul is not at all heavy thus far | 19:01 |
clarkb | just looking at the status we are up to 80 something changes in the gate pipeline and it only took a few minutes, much better than the 15-20 it took before | 19:02 |
fungi | gerrit's really not breaking a sweat either | 19:02 |
*** thuc has quit IRC | 19:02 | |
SergeyLukjanov | clarkb, have you already proved that problem is in IO? | 19:02 |
*** rfolco has quit IRC | 19:02 | |
*** azherkhna has joined #openstack-infra | 19:02 | |
clarkb | 'checking out master' is now a subsecond operation | 19:03 |
clarkb | SergeyLukjanov: 'proved'. preliminary results look very very good | 19:03 |
fungi | SergeyLukjanov: we suspect there was a lot of write delay/contention based on the system profiling stats, but i think we need to watch this go for a while under constant load to be certain it's improved significantly (we'll get that opportunity) | 19:03 |
*** galstrom is now known as galstrom_zzz | 19:04 | |
SergeyLukjanov | k, see it | 19:04 |
*** marun has quit IRC | 19:04 | |
*** marun has joined #openstack-infra | 19:05 | |
*** jaypipes has joined #openstack-infra | 19:05 | |
*** julim has quit IRC | 19:05 | |
*** julim has joined #openstack-infra | 19:06 | |
*** vipul is now known as vipul-away | 19:09 | |
fungi | #status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:09 |
*** vipul-away is now known as vipul | 19:09 | |
fungi | did we lose statusbot? | 19:09 |
clarkb | apparently | 19:09 |
fungi | yup. fixing | 19:09 |
*** openstackstatus has joined #openstack-infra | 19:11 | |
mordred | clarkb: nice! | 19:12 |
fungi | #status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:13 |
openstackstatus | NOTICE: zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired | 19:13 |
fungi | the event/result queues are back to trivial levels already, and enormous pipeline lengths are active | 19:14 |
fungi | statsd is still broken though, even though i downgraded the new zuul's statsd package to be the same as the old one's | 19:14 |
clarkb | fungi: is statsd erroring? | 19:14 |
fungi | good question | 19:15 |
clarkb | oh we just got our first gate reset | 19:15 |
clarkb | lets see how long it takes to clear | 19:15 |
fungi | and then snipe it out, because outdated sample config | 19:16 |
fungi | no, wait, i misread the log. wrong job entirely for that anyway | 19:16 |
clarkb | reset processed | 19:16 |
clarkb | in ~1.75 minutes? not bad :) | 19:17 |
*** pballand has joined #openstack-infra | 19:17 | |
zaro | clarkb: do you want to review scp-plugin on github? | 19:17 |
clarkb | zaro: I don't see a new pull request | 19:18 |
*** nati_ueno has joined #openstack-infra | 19:18 | |
*** nati_ueno has quit IRC | 19:18 | |
*** nati_ueno has joined #openstack-infra | 19:19 | |
*** CaptTofu has joined #openstack-infra | 19:19 | |
clarkb | zaro: I am going to head into the office around lunch, if you are in today we can go over it there | 19:20 |
*** yolanda has joined #openstack-infra | 19:20 | |
zaro | ok. i'll just wait for you. see you later. | 19:21 |
*** sarob has joined #openstack-infra | 19:22 | |
clarkb | fungi: I think I know the statsd problem | 19:22 |
clarkb | fungi: that is one place where the firewall rules on the remote end may need updating | 19:22 |
clarkb | fungi: if you start the iptables persistent service it should redig DNS records and update the ruleset | 19:22 |
*** nati_uen_ has quit IRC | 19:23 | |
fungi | right, it's updated by dns name! | 19:23 |
fungi | fixing | 19:23 |
*** tjones has quit IRC | 19:23 | |
fungi | wow the graphite server is running at a crawl too | 19:24 |
*** sarob has quit IRC | 19:27 | |
*** thuc has joined #openstack-infra | 19:27 | |
fungi | clarkb: good call. stats are updating again | 19:28 |
mordred | yay stats | 19:28 |
*** marun has quit IRC | 19:28 | |
*** marun has joined #openstack-infra | 19:29 | |
fungi | there's another gate reset | 19:29 |
fungi | BadRequest: Multiple possible networks found, use a Network ID to be more specific. (HTTP 400) | 19:30 |
*** tjones has joined #openstack-infra | 19:30 | |
fungi | oh, that's the one which snuck into the gate behind my one test reenqueue | 19:31 |
clarkb | hahahahaha | 19:31 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: add bug metadata to graph list https://review.openstack.org/67510 | 19:31 |
*** denis_makogon has joined #openstack-infra | 19:31 | |
*** tjones has quit IRC | 19:31 | |
fungi | looks like the last two patchsets were uploaded while zuul was offline, and then it was approved with no check results | 19:31 |
clarkb | well it has been taken care of now :) | 19:31 |
fungi | indeed | 19:31 |
clarkb | fungi: was statsd the last remaining major issue? | 19:32 |
*** tjones has joined #openstack-infra | 19:32 | |
fungi | clarkb: my home firewall is my next major remaining issue | 19:32 |
fungi | i worry when a 15-year-old sparc64 server starts randomly segfaulting running processes | 19:33 |
clarkb | notes from switchover, should puppet known_hosts file for zuul ssh, should puppet zuul .gitconfig, gearman-plugin + geard + zuul is not happy with registering our jobs and needs handholding currently (believe this is a bug in geard) | 19:33 |
*** azherkhna has quit IRC | 19:33 | |
clarkb | fungi: you know you can buy dirt cheap power sipping boxes that work as great routers right? | 19:33 |
fungi | server comes up with too-new statsd, need to reload firewall rules on graphite server | 19:33 |
fungi | clarkb: yes, i know this. i even have the hardware spec'd out and everything but... so little available free time lately | 19:34 |
clarkb | I am going to afk now and catch up on my morning. If no one beats me to it I will write bugs up for what we learned today | 19:34 |
fungi | clarkb: sounds good | 19:34 |
clarkb | also scp plugin, and lca expense reports | 19:34 |
*** vipul is now known as vipul-away | 19:35 | |
*** vipul-away is now known as vipul | 19:35 | |
*** vipul is now known as vipul-away | 19:35 | |
clarkb | fungi: when I get back you should just stop working for the rest of the afternoon | 19:36 |
clarkb | because EWHENDOYOUSLEEP? | 19:36 |
fungi | clarkb: that would be appreciated. i have the gf's folks in town visiting one more night and should at least pretend i enjoy their company | 19:36 |
*** mgagne has quit IRC | 19:37 | |
fungi | so will probably be disappearing for dinner again maybe 2300utc-ish | 19:37 |
clarkb | fungi: yup no worries. ok really afking now so that I am able to cover the afternoon | 19:37 |
sdague | fungi: puppet question ... | 19:37 |
fungi | sdague: sure | 19:37 |
sdague | so we're going to add another elastic recheck program that runs on cron | 19:38 |
* fungi nods | 19:38 | |
sdague | and what I'd also like to do is trigger these jobs after CD | 19:38 |
*** hogepodge has joined #openstack-infra | 19:39 | |
*** mattray has left #openstack-infra | 19:39 | |
sdague | because we might be landing a change, and we'd like to trigger that output | 19:39 |
sdague | but right now the cron jobs are defined on the status site | 19:40 |
*** sarob has joined #openstack-infra | 19:40 | |
sdague | which is done because the state dir is set there | 19:41 |
fungi | okay, so you want a script which is called from a cron entry and from an exec, and wrap them both in lockfile (or implement a locking mechanism within the script) presumably, then subscribe the exec to the vcsrepo object | 19:41 |
*** marun has quit IRC | 19:41 | |
*** oubiwann_ has quit IRC | 19:41 | |
fungi | am i answering the right question? | 19:41 |
*** marun has joined #openstack-infra | 19:41 | |
sdague | I think so | 19:41 |
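(A minimal sketch of the "locking mechanism within the script" option: both the cron entry and the puppet exec call one wrapper, and an flock makes the second invocation exit quietly. The lockfile path and the wrapped command are placeholders, not the real elastic-recheck job.)

```python
# Sketch only: flock-guarded wrapper so cron and a puppet exec can both call
# the same script without overlapping runs. Paths/commands are placeholders.
import fcntl
import subprocess
import sys

LOCKFILE = "/var/run/elastic-recheck-graphs.lock"  # assumed path

with open(LOCKFILE, "w") as lock:
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.exit(0)  # another run (cron or exec) already holds the lock
    # placeholder for the actual graph-generation command
    subprocess.check_call(["echo", "regenerate elastic-recheck graphs"])
```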
sdague | I am wondering if we could define the command as a var in the elastic_recheck/init.pp | 19:42 |
*** oubiwann_ has joined #openstack-infra | 19:42 | |
fungi | almost certainly | 19:42 |
sdague | can we get vars from one pp to another easily? | 19:42 |
sdague | you have a call example for something like that? | 19:42 |
*** sarob has quit IRC | 19:43 | |
fungi | oh, hrm... class scope lookup | 19:43 |
sdague | yeh | 19:43 |
*** sarob has joined #openstack-infra | 19:43 | |
fungi | i know how to do it in an erb template... | 19:43 |
dkranz | fungi: Grr. So the error log gate run is being bitten by https://bugs.launchpad.net/tempest/+bug/1260537 | 19:43 |
uvirtbot | Launchpad bug 1260537 in tempest "Generic catchall bug for non triaged bugs where a server doesn't reach it's required state" [High,Confirmed] | 19:43 |
fungi | trying to remember if i've seen it in a puppet manifest | 19:43 |
dkranz | fungi: Do I just do a reverify now or is some other action appropriate? | 19:43 |
fungi | dkranz: reverify once it dies (i can abort the remaining running jobs) and then when it gets into the queue i'll promote it | 19:44 |
dkranz | fungi: Will reverify kill the current failing build? | 19:44 |
dkranz | fungi: ok | 19:44 |
fungi | dkranz: nope, that's why i need to abort the jobs | 19:44 |
fungi | okay, it's out of the gate now. should be safe to reverify | 19:45 |
fungi | dkranz: ^ | 19:46 |
*** mrmartin has joined #openstack-infra | 19:46 | |
*** yassine has joined #openstack-infra | 19:46 | |
*** vipul-away is now known as vipul | 19:47 | |
dkranz | fungi: Thanks, I did the reverify. | 19:47 |
fungi | i see it | 19:47 |
fungi | promoting now | 19:47 |
*** reed has quit IRC | 19:47 | |
fungi | bam. there it is | 19:47 |
fungi | snappy, snappy new zuulie | 19:47 |
*** sarob has quit IRC | 19:48 | |
fungi | oh zuulie you nut | 19:48 |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 19:49 |
*** AaronGr is now known as aarongr_afk | 19:49 | |
*** vipul is now known as vipul-away | 19:50 | |
mrmartin | re | 19:50 |
*** denis_makogon_ has joined #openstack-infra | 19:52 | |
mrmartin | fungi: if you have 5 minutes today, please comment on this review request: https://review.openstack.org/#/c/67443/ This contains the gating / distro tarball task required for community portal. | 19:52 |
salv-orlando | I might be stating the obvious but since I still see a consistent number of failures in unit test jobs, perhaps there is a case for bumping up patches for bug 1270212 | 19:52 |
uvirtbot | Launchpad bug 1270212 in oslo "regression: multiple calls to Message.__mod__ trigger exceptions" [Critical,In progress] https://launchpad.net/bugs/1270212 | 19:52 |
*** pballand has quit IRC | 19:52 | |
openstackgerrit | Matthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot https://review.openstack.org/67540 | 19:52 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/elastic-recheck: fix css style to make page more readable https://review.openstack.org/67560 | 19:54 |
*** Ajaeger has quit IRC | 19:55 | |
*** yassine has quit IRC | 19:55 | |
*** smarcet has quit IRC | 19:55 | |
*** denis_makogon has quit IRC | 19:55 | |
*** kgriffs has left #openstack-infra | 19:56 | |
clarkb | salv-orlando: are there fixes for that change yet? | 19:57 |
clarkb | er for that bug | 19:57 |
fungi | sdague: i think you want http://docs.puppetlabs.com/puppet/2.7/reference/lang_scope.html#accessing-out-of-scope-variables | 19:57 |
clarkb | fungi: sdague: you can reference variables in manifests like $::somescope::innerscope::variablename | 19:57 |
sdague | ok | 19:58 |
clarkb | you do need to make sure you have previously included that class that defines the variable | 19:58 |
*** denis_makogon_ is now known as denis_makogon | 19:58 | |
sdague | cool | 19:58 |
fungi | mrmartin: i don't see a change 67443 at all. did you maybe experiment with gerrit's drafts option, or is that a typo? | 19:59 |
clarkb | zaro: I am on my way in now | 19:59 |
mrmartin | it was a draft :D | 19:59 |
mrmartin | how can I share this draft review with you? | 20:00 |
sdague | cool, i'll see if I can figure it out | 20:00 |
fungi | mrmartin: just set them work-in-progress in the future. drafts are implemented in gerrit in a fairly broken fashion | 20:00 |
*** markmc has quit IRC | 20:00 | |
mrmartin | good to know that. | 20:00 |
*** herndon has joined #openstack-infra | 20:00 | |
*** marun has quit IRC | 20:00 | |
fungi | mrmartin: in the interim, you can add me as a reviewer (just add "fungi" in the requested reviewers line) | 20:00 |
sdague | also - http://status.openstack.org/elastic-recheck/ - shift reload, and we have descriptions on bugs now | 20:00 |
*** marun has joined #openstack-infra | 20:00 | |
*** ruhe is now known as _ruhe | 20:01 | |
fungi | mrmartin: it will resolve it to my name and e-mail address when you do that | 20:01 |
mrmartin | fungi: I did it | 20:01 |
fungi | sdague: great! | 20:01 |
fungi | mrmartin: i can see it now | 20:01 |
clarkb | sdague: flashgordon: fwiw I think some of the jenkins errors will be false positives. When zuul aborts a job, occasionally that manifests as an uncaught exception (I forget which) and the job fails | 20:03 |
mrmartin | fungi: ok add as many comments as you can, so if anything missing, I can correct the patch. thnx! | 20:03 |
clarkb | but zuul aborting jobs is perfectly normal | 20:03 |
clarkb | that said the vast majority are likely slaves falling over and running tests to failure as quickly as they can | 20:03 |
fungi | mrmartin: will do | 20:03 |
*** mrodden1 is now known as mrodden | 20:04 | |
notmyname | ...and I thought 100+ jobs in the check queue yesterday were a lot | 20:05 |
*** galstrom_zzz is now known as galstrom | 20:05 | |
fungi | notmyname: yeah, i'm hoping they go far faster now that zuul is on an even bigger server and is doing all its git scratch work on tmpfs | 20:06 |
fungi | as of an hour ago | 20:07 |
*** nati_uen_ has joined #openstack-infra | 20:07 | |
clarkb | it definitely seems to have made the gate reset cost much lower | 20:07 |
clarkb | which was putting the brakes on everything | 20:07 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 20:08 | |
fungi | the event/result queue pileup is completely resolved | 20:08 |
clarkb | now we suffer from having about 1/3 to 1/4 of the test infra needed to run all of the tests | 20:08 |
*** nati_uen_ has quit IRC | 20:08 | |
notmyname | is that a matter of getting more workers in the nodepool? | 20:09 |
*** nati_uen_ has joined #openstack-infra | 20:09 | |
fungi | well, the nodepool capacity is driven somewhat by gate resets still, since a gate reset near the front of the gate will decimate the entire quota and all of those nodes will need to be rebuilt | 20:10 |
clarkb | notmyname: sort of. we need more cloud quota to do that and we have to be careful that adding more nodes doesn't make jenkins flakier | 20:10 |
clarkb | and we just saw geard get cranky... | 20:10 |
*** mfink has quit IRC | 20:10 | |
clarkb | for now I think we are better off working to make jenkins and geard happier then ramp up nodepool | 20:10 |
fungi | at our current aggregate quota i saw things moving fairly smoothly even with a modest reset rate when the gate was around 25-30 changes deep | 20:10 |
*** nati_ueno has quit IRC | 20:10 | |
fungi | once it got bigger than that, it got into a decimate/rebuild pendulum swing | 20:11 |
*** smarcet has joined #openstack-infra | 20:11 | |
fungi | which makes me think that if we do decide to arbitrarily limit the number of testable changes at the front of an integrated queue, the sweet spot is currently somewhere around there | 20:12 |
clarkb | fungi: I don't think we arbitrarily limit the number of testable changes, I think we let it scale a window based on performance | 20:13 |
fungi | clarkb: i agree that makes more sense | 20:13 |
*** hashar has joined #openstack-infra | 20:13 | |
fungi | i liked the slow-start/backoff idea, as much as i can like any pessimistic model for this | 20:13 |
clarkb | I don't think it will be too hard to implement either as zuul basically takes a list and iterates over it until done. we can slice that list first | 20:14 |
openstackgerrit | Davanum Srinivas (dims) proposed a change to openstack-infra/devstack-gate: Temporary HACK : Enable UCA https://review.openstack.org/67564 | 20:14 |
clarkb | the trickier bits will be in presenting it to users so that folks know they are in the queue but not being tested | 20:14 |
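A rough sketch of the windowing idea being discussed: additive increase on success, multiplicative decrease on a reset, applied as a slice over the change queue. This is illustrative only; the class name and the default parameters are assumptions for the example, not zuul's actual implementation.

    class WindowedChangeQueue(object):
        """Illustrative additive-increase/multiplicative-decrease window."""

        def __init__(self, floor=10, increase=1):
            self.changes = []         # every change enqueued for the gate
            self.floor = floor        # the window never shrinks below this
            self.increase = increase  # additive growth per merged change
            self.window = floor

        def active_changes(self):
            # Only the slice inside the window gets test jobs; the rest of
            # the queue is visible to users but idle, which is the
            # presentation problem mentioned above.
            return self.changes[:self.window]

        def on_change_merged(self):
            # Reward success by widening the window a little.
            self.window += self.increase

        def on_gate_reset(self):
            # Halve the window on a reset, but keep a floor so the gate
            # never stalls completely.
            self.window = max(self.floor, self.window // 2)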
clarkb | dimsum: re ^ do we expect libvirt to work now? | 20:15 |
*** marun has quit IRC | 20:15 | |
*** marun has joined #openstack-infra | 20:15 | |
sdague | clarkb: so if that's the case, realize that it's being reported as a FAILURE to ES and graphite | 20:15 |
sdague | which means it will make the gate look worse than it is | 20:16 |
sdague | when you run stats on it. So it would be good if those could be classified as a different status | 20:16 |
dimsum | clarkb, i have a vm with UCA and don't see the problems reported hence trying to run it in d-g | 20:16 |
clarkb | sdague agree but it is a jenkins limitation | 20:17 |
clarkb | sdague the way they implement job aborts is by raising an exception. if not caught cleanly you lose | 20:17 |
*** galstrom has left #openstack-infra | 20:17 | |
clarkb | dimsum: did you run nova unittests too? | 20:17 |
salv-orlando | clarkb: neutron fix is up for review. I can prepare patches for other projects if you're ok to bump them ahead of the queue | 20:18 |
salv-orlando | clarkb: neutron patch --> https://review.openstack.org/#/c/67537/ | 20:18 |
sdague | clarkb: so the abort job exception is a different exception | 20:18 |
*** tjones has quit IRC | 20:18 | |
fungi | clarkb: does jenkins report it as "FAILURE" state though in that case rather than "ABORT"? | 20:18 |
sdague | from what I can tell | 20:18 |
clarkb | fungi in some corner cases yes | 20:18 |
sdague | I've definitely seen ABORT | 20:18 |
clarkb | ya abort is the 99% case | 20:19 |
sdague | clarkb: right, that's one of the reasons I wanted to raise the question | 20:19 |
dimsum | clarkb, yep | 20:19 |
clarkb | but when jenkins doesn't cleanly catch the abort exception it looks like failure | 20:19 |
notmyname | I'm not sure who to direct this at, so I'm throwing it in here: | 20:19 |
fungi | dimsum: interesting idea. i was trying to test it myself using d-g on a vm, but our recent refactor moved some repos around from where my script/instructions expect them | 20:19 |
notmyname | I'm currently working on the Swift 1.12.0 release. I consider this somewhat of a test run for the gates for next week's i2 stuff. | 20:20 |
*** goofy-nic-friday is now known as cp16net | 20:20 | |
notmyname | my plan is to get the last patches through the gate for an RC (today or when stuff lands, whichever is last) | 20:20 |
notmyname | I'm currently looking at these patches: https://review.openstack.org/#/q/branch:master+AND+Approved%253D1+AND+status:open+AND+project:openstack/swift,n,z | 20:20 |
notmyname | other patches would be whatever else is approved today, including one for the release notes update | 20:21 |
notmyname | I don't think I need anything specific from -infra (beyond the hard work you're already doing). I wanted to give you a status update, especially because of the milestone next week (this is sort of a trial run, I'd think) | 20:22 |
fungi | notmyname: makes sense. as far as i know we're done with emergency disruptions. we spent this morning doing what we can to try to beef up gating performance/throughput in preparation for the bigger rush next week | 20:22 |
dimsum | fungi, don't know if this will work - https://review.openstack.org/#/c/67564/ - taking a shot | 20:23 |
fungi | dimsum: it looks like i would expect it to, but set that to wip because we won't actually put that change as it stands into production. we'd want to do that in nodepool prep scripts instead and/or in puppet configuration (but it may make for a worthwhile proof-of-concept) | 20:24 |
fungi | dimsum: the other place you could try testing it would be with a change to devstack (before it starts installing packages) | 20:26 |
dimsum | ah. right | 20:26 |
dimsum | will do | 20:26 |
fungi | but either way will probably work | 20:26 |
*** prad_ has joined #openstack-infra | 20:26 | |
*** salv-orlando has quit IRC | 20:28 | |
*** herndon has quit IRC | 20:28 | |
*** prad has quit IRC | 20:28 | |
*** prad_ is now known as prad | 20:28 | |
openstackgerrit | Evgeny Fadeev proposed a change to openstack-infra/askbot-theme: made launchpad importer read and write data separately https://review.openstack.org/67567 | 20:30 |
sdague | clarkb, fungi: easy change - gate status to dedicated page, so we can pull it off er - https://review.openstack.org/#/c/65700/ | 20:30 |
sdague | if anyone's up for walking away from fire :) | 20:30 |
*** DinaBelova is now known as DinaBelova_ | 20:36 | |
*** Ryan_Lane has quit IRC | 20:36 | |
*** Ryan_Lane has joined #openstack-infra | 20:36 | |
*** mrmartin has quit IRC | 20:36 | |
*** salv-orlando has joined #openstack-infra | 20:38 | |
*** herndon has joined #openstack-infra | 20:38 | |
*** yolanda has quit IRC | 20:38 | |
*** markwash has quit IRC | 20:39 | |
*** markwash has joined #openstack-infra | 20:41 | |
*** marun has quit IRC | 20:41 | |
*** marun has joined #openstack-infra | 20:41 | |
notmyname | wow. I am noticing that zuul is picking up approved changes _much_ more quickly now | 20:44 |
*** carl_baldwin has quit IRC | 20:46 | |
*** senk has quit IRC | 20:47 | |
*** carl_baldwin has joined #openstack-infra | 20:47 | |
*** markmcclain has quit IRC | 20:47 | |
*** vipul-away is now known as vipul | 20:47 | |
*** jaypipes has quit IRC | 20:48 | |
*** jaypipes_ has joined #openstack-infra | 20:48 | |
*** jaypipes_ has quit IRC | 20:48 | |
*** dprince has quit IRC | 20:49 | |
*** pballand has joined #openstack-infra | 20:49 | |
fungi | notmyname: that's thanks to the event queue no longer being backlogged | 20:49 |
rustlebee | queues are huge :) | 20:50 |
fungi | rustlebee: yep, i expect them to start dropping once the check pipeline catches up on worker assignments now | 20:51 |
rustlebee | cool | 20:52 |
fungi | rustlebee: without your awesome collapseypatch, my browser would have choked on the current status page i think | 20:52 |
rustlebee | heh | 20:52 |
openstackgerrit | Emilien Macchi proposed a change to openstack-infra/config: gerritbot: Add API doc git notifications on #openstack-doc https://review.openstack.org/67573 | 20:53 |
dimsum | rustlebee, ya, very handy! | 20:55 |
* rustlebee clicks expand all ... poor chrome | 20:56 | |
fungi | *boom* | 20:56 |
*** tjones has joined #openstack-infra | 20:58 | |
*** hashar has quit IRC | 20:59 | |
*** herndon has quit IRC | 20:59 | |
*** thomasem has quit IRC | 21:01 | |
*** marun has quit IRC | 21:01 | |
*** marun has joined #openstack-infra | 21:01 | |
*** nati_ueno has joined #openstack-infra | 21:03 | |
*** smarcet has left #openstack-infra | 21:05 | |
very_tired | rustlebee: yes thanks for the collapsy patch | 21:05 |
*** herndon has joined #openstack-infra | 21:06 | |
rustlebee | you're welcome :) | 21:06 |
rustlebee | it was fun. | 21:06 |
*** herndon has quit IRC | 21:06 | |
rustlebee | anything web related is out of my normal comfort zone | 21:06 |
very_tired | fungi: email alert, I just sent this: http://lists.openstack.org/pipermail/openstack-infra/2014-January/000661.html | 21:06 |
*** herndon has joined #openstack-infra | 21:07 | |
*** nati_uen_ has quit IRC | 21:07 | |
very_tired | will ping at 8pm and if they haven't responded, no voting for them | 21:07 |
*** herndon has quit IRC | 21:07 | |
very_tired | rustlebee: you did a nice job of it | 21:07 |
fungi | very_tired: sounds good. in the meantime, get some rest | 21:07 |
very_tired | fungi: :D | 21:07 |
very_tired | code sprint winding down | 21:07 |
very_tired | we have patches to gate | 21:07 |
*** herndon_ has joined #openstack-infra | 21:08 | |
clarkb | fungi: I am back at a different desk now | 21:08 |
very_tired | fungi: https://etherpad.openstack.org/p/montreal-code-sprint | 21:08 |
very_tired | under the to be promoted section | 21:08 |
fungi | very_tired: more stability fixes? | 21:08 |
sdague | fungi: yes, these should decrease load on the neutron side | 21:09 |
sdague | which should make it more likely to pass | 21:09 |
very_tired | still working on getting +A on all the neutron patches, marun is going through them | 21:09 |
very_tired | so is nati_ueno | 21:10 |
fungi | sdague: very_tired: if you could work up a preferred sequence, we can promote the whole batch. more stable gate means more faster gate | 21:10 |
fungi | need to know changenum,psnum | 21:11 |
very_tired | fungi mtreinish is double checking that now | 21:11 |
*** jgrimm has quit IRC | 21:12 | |
mtreinish | fungi: I just reordered the tempest test list | 21:13 |
fungi | clarkb: do you think we have a chance of being able to sanely quiesce zuul tomorrow for that project rename maintenance? | 21:13 |
*** oubiwann_ has quit IRC | 21:14 | |
*** nati_ueno has quit IRC | 21:14 | |
*** marun has quit IRC | 21:14 | |
*** marun has joined #openstack-infra | 21:14 | |
*** oubiwann_ has joined #openstack-infra | 21:15 | |
*** nati_ueno has joined #openstack-infra | 21:15 | |
very_tired | fungi: they responded to my email, so you might not need to do anything | 21:16 |
mikal | Morning | 21:16 |
very_tired | mikal: morning | 21:16 |
very_tired | happy saturday | 21:16 |
clarkb | fungi: maybe? but it is looking less likely | 21:16 |
*** nati_ueno has quit IRC | 21:17 | |
fungi | clarkb: i try to look at it as we're load-testing the new zuul ;) | 21:18 |
*** nati_ueno has joined #openstack-infra | 21:18 | |
*** oubiwann_ has quit IRC | 21:19 | |
mikal | very_tired: you're anteaya? | 21:19 |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide https://review.openstack.org/67481 | 21:20 |
openstackgerrit | Michael Krotscheck proposed a change to openstack-infra/storyboard-webclient: [WIP] Storyboard API Interface and basic project management https://review.openstack.org/67582 | 21:22 |
very_tired | mikal: I am | 21:23 |
mikal | very_tired: so, I don't think I caused the recheck backlog... The script didn't run for that long. | 21:24 |
fungi | mikal: oh, the thing to recheck stale patches? | 21:25 |
mikal | fungi: yeah | 21:25 |
fungi | anyway, no, the check volume is from us dumping the state of the zuul check and gate pipelines, moving to a bigger badder zuul and restoring them... so they all needed fresh workers and then new patchsets came in on top of that | 21:26 |
clarkb | but, bigger badder zuul is pretty awesome | 21:26 |
*** marun has quit IRC | 21:26 | |
*** marun has joined #openstack-infra | 21:27 | |
fungi | bigger badder zuul will eat your spleen for breakfast it's so awesome | 21:27 |
*** UtahDave has joined #openstack-infra | 21:28 | |
fungi | or at least, according to our design specs it has a taste for spleen. more testing required | 21:28 |
*** pcrews has quit IRC | 21:28 | |
mikal | So, are gate flushes still hurting us? | 21:28 |
*** krotscheck has quit IRC | 21:29 | |
*** pcrews has joined #openstack-infra | 21:29 | |
clarkb | mikal: yes in that they force us to retest stuff, no they don't cause zuul to stop for forever to process them | 21:29 |
fungi | mikal: they will still severely deplete our available job workers for prolonged periods | 21:29 |
*** NikitaKonovalov_ is now known as NikitaKonovalov | 21:30 | |
mikal | Ok, so I got to the point with my rechecker where it would run until it found something to recheck, recheck that, and then exit. I would then go and hand verify the recheck. I hadn't found any incorrect rechecks in a while. | 21:30 |
fungi | though apparently the neutron+qa testing/stability sprint has a stack of patches which they think will make a big improvement on reset frequency | 21:30 |
*** NikitaKonovalov is now known as NikitaKonovalov_ | 21:30 | |
mikal | I'm wondering if I should turn it back on this morning, or if the queues are so long I should just let it rest for a day | 21:31 |
mikal | The queues do look pretty long... | 21:31 |
*** vipul is now known as vipul-away | 21:31 | |
fungi | mikal: i see it as a tradeoff there. at least some of the more persistent gate resets we're getting are actually from stale changes getting approved after bit-rotting in review for too long | 21:32 |
mikal | fungi: I was surprised by how many stale checks there were last night | 21:32 |
fungi | so catching those early might help keep cores from approving them | 21:32 |
mikal | It was a non-trivial percentage of reviews | 21:32 |
mikal | Noting that sdague doesn't want checks on stable at the moment because of pip | 21:32 |
mikal | (wow, the nova check fail rate at the moment is really high) | 21:33 |
*** SumitNaiksatam has joined #openstack-infra | 21:37 | |
*** marun has quit IRC | 21:37 | |
*** marun has joined #openstack-infra | 21:38 | |
mordred | mikal: I, for one, support your rechecker | 21:38 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: fix css style to make page more readable https://review.openstack.org/67560 | 21:40 |
mikal | I just don't want to break the world with my well meaning flailing | 21:41 |
mikal | There's only so much kermit arms can do | 21:41 |
portante | mordred: do we run the devstack environments with GRO turned on (the generic receive off-load stuff)? | 21:41 |
portante | I am guessing it is not a concern, but just checking | 21:41 |
*** vipul-away is now known as vipul | 21:43 | |
*** aarongr_afk is now known as AaronGr | 21:45 | |
very_tired | heh, kermit arms | 21:45 |
fungi | the muppet geek in me knew exactly what he meant | 21:46 |
*** vipul is now known as vipul-away | 21:47 | |
*** rustlebee is now known as russellb | 21:52 | |
*** sdake has quit IRC | 21:52 | |
very_tired | fungi: the patches in the "to be promoted" section are all +A'd and in the order they need to go into the gate: https://etherpad.openstack.org/p/montreal-code-sprint | 21:52 |
very_tired | fungi: let me know if you need more | 21:52 |
*** beekneemech has quit IRC | 21:52 | |
very_tired | more as in more information, not more as in more work to do | 21:53 |
fungi | very_tired: they're separated by project... is that the order you want them in? (neutron block first, then that standalone "please also this" change, then the tempest changes)? | 21:53 |
*** herndon_ has quit IRC | 21:53 | |
*** derekh has joined #openstack-infra | 21:54 | |
fungi | looks like that standalone 67537 isn't approved anyway | 21:54 |
*** thedodd has quit IRC | 21:54 | |
*** carl_baldwin has quit IRC | 21:54 | |
very_tired | fungi: this one goes first please: Please also this: https://review.openstack.org/#/c/67537/ | 21:54 |
*** carl_baldwin has joined #openstack-infra | 21:55 | |
*** marun has quit IRC | 21:55 | |
*** sarob has joined #openstack-infra | 21:55 | |
very_tired | fungi: yes, salv-orlando is getting a +A on that, sorry I thought we were ready on our end | 21:55 |
fungi | no problem | 21:55 |
*** sarob has quit IRC | 21:55 | |
*** marun has joined #openstack-infra | 21:55 | |
*** sandywalsh has quit IRC | 21:56 | |
*** herndon_ has joined #openstack-infra | 21:56 | |
fungi | very_tired: though you may need to get clarkb's help on those. i'm about to disappear to go out for food | 21:57 |
clarkb | fungi: go disappear, I will be mostly here in a few minutes | 21:57 |
*** UtahDave has quit IRC | 21:57 | |
very_tired | fungi: happy food, I will work with clarkb | 21:58 |
very_tired | thanks | 21:58 |
fungi | very_tired: also, kudos to you and the attendees at the sprint--that's an impressive list of stability and debugging fixes | 21:58 |
very_tired | fungi thanks, it was very beneficial on many levels | 21:59 |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1261253 https://review.openstack.org/67539 | 21:59 |
uvirtbot | Launchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/1261253 | 21:59 |
very_tired | we had a good group here | 21:59 |
*** rnirmal has joined #openstack-infra | 22:02 | |
very_tired | clarkb: all the tempest patches in the "to be promoted" section can go in | 22:02 |
very_tired | https://etherpad.openstack.org/p/montreal-code-sprint | 22:02 |
*** tjones has left #openstack-infra | 22:03 | |
very_tired | we are waiting on a +A on 67537; once we have that, 67537 goes first, followed by the rest of the neutron block | 22:03 |
fungi | clarkb: though keep in mind that anything you promote now will mean all the remaining changes in the check queue which have accumulated since the last gate reset will also be serviced before zuul takes a crack at what's in the gate (including 63934,3 which we intentionally placed at the front) | 22:04 |
*** vipul-away is now known as vipul | 22:04 | |
clarkb | very_tired: I would like to do all of them at once as promotion requires a reset | 22:04 |
*** vipul is now known as vipul-away | 22:04 | |
fungi | clarkb: i agree that's probably the best choice | 22:04 |
*** melwitt has joined #openstack-infra | 22:05 | |
clarkb | very_tired: so once everything has been approved and queued ping me and we will promote | 22:05 |
very_tired | clarkb: will do | 22:05 |
very_tired | clarkb: good to go | 22:06 |
flashgordon | you guys ever see this bug: http://logs.openstack.org/21/65121/2/gate/gate-grenade-dsvm/efd816b/console.html | 22:06 |
flashgordon | SCPRepositoryPublisher aborted due to exception | 22:06 |
*** carl_baldwin has quit IRC | 22:06 | |
mordred | flashgordon: it means that java hates us | 22:07 |
*** carl_baldwin has joined #openstack-infra | 22:07 | |
flashgordon | mordred: yup | 22:08 |
flashgordon | but about to file a bug if we don't have one | 22:08 |
flashgordon | 263 hits in logstash | 22:08 |
*** sandywalsh has joined #openstack-infra | 22:08 | |
fungi | flashgordon: that log you linked doesn't seem to have been associated with a result posted to the associated change | 22:09 |
fungi | flashgordon: i wonder if that job got intentionally killed when a job on a change ahead of it failed in the gate | 22:09 |
*** jcooley_ has joined #openstack-infra | 22:10 | |
very_tired | clarkb: problem | 22:10 |
clarkb | very_tired: ? | 22:10 |
very_tired | https://review.openstack.org/#/c/67537/ never passed check | 22:10 |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 22:10 |
very_tired | so salv-orlando says go with the tempest block | 22:10 |
very_tired | let https://review.openstack.org/#/c/67537/ come back with check | 22:11 |
*** CaptTofu has quit IRC | 22:11 | |
fungi | flashgordon: there are only two grenade failures on the change for that log, and neither of them refer to that particular job run | 22:11 |
very_tired | and then if it does promote it and the rest of the neutron block | 22:11 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard: Fix the intial db migration https://review.openstack.org/67592 | 22:11 |
very_tired | does that sound reasonable? | 22:11 |
clarkb | very_tired: I am not doing two promotions | 22:11 |
*** gema has quit IRC | 22:11 | |
*** nati_uen_ has joined #openstack-infra | 22:11 | |
clarkb | promotions are very expensive | 22:11 |
flashgordon | fungi: hmm | 22:12 |
*** MarkAtwood has quit IRC | 22:13 | |
fungi | flashgordon: i'm guessing java.lang.InterruptedException is something akin to sigint | 22:13 |
very_tired | clarkb: I understand | 22:13 |
flashgordon | fungi: that makes sense | 22:14 |
fungi | flashgordon: "Thrown when a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity." | 22:14 |
fungi | (from oracle's language doc reference) | 22:14 |
*** med_ has quit IRC | 22:14 | |
*** UtahDave has joined #openstack-infra | 22:14 | |
flashgordon | fungi: that makes a lot of sense | 22:15 |
fungi | flashgordon: so i think you have a cancelled/aborted job there that jenkins reported as a failure | 22:15 |
flashgordon | yup | 22:15 |
fungi | because EJENKINS | 22:15 |
very_tired | clarkb: this is our fault and we will wear it | 22:15 |
flashgordon | so I will add an elastic-recheck fingerprint for that so we can ignore those and get better classification rate numbers | 22:15 |
*** CaptTofu has joined #openstack-infra | 22:16 | |
flashgordon | if that sounds good to you | 22:16 |
flashgordon | which means add a bug marked as resolved | 22:16 |
fungi | flashgordon: sounds like a good call | 22:16 |
*** nati_ueno has quit IRC | 22:16 | |
fungi | anyway, really disappearing for several hours starting now... back later for more fun | 22:16 |
clarkb | fungi: have fun | 22:17 |
*** mfer has quit IRC | 22:17 | |
*** reed has joined #openstack-infra | 22:18 | |
very_tired | fungi: enjoy | 22:18 |
*** thedodd has joined #openstack-infra | 22:19 | |
flashgordon | fungi: so this happens in the gate queue only which fits your hypothesis | 22:19 |
*** ewindisch is now known as zz_ewindisch | 22:19 | |
dimsum | flashgordon, i've seen many stack traces that finally end up in the wait interrupt at line hudson.remoting.Request.call(Request.java:146) | 22:20 |
flashgordon | dimsum: link? | 22:22 |
flashgordon | dimsum: I am using this query: message:"java.lang.InterruptedException" AND filename:"console.html" | 22:22 |
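As a side note, a small sketch of how that query's hit count could be checked against an Elasticsearch backend before turning it into an elastic-recheck fingerprint. The endpoint URL and index pattern here are placeholder assumptions, not the project's real configuration.

    import requests

    # Query string quoted above; URL and index pattern are placeholders.
    QUERY = 'message:"java.lang.InterruptedException" AND filename:"console.html"'
    ES_URL = 'http://localhost:9200/logstash-*/_count'

    def hit_count():
        body = {'query': {'query_string': {'query': QUERY}}}
        resp = requests.post(ES_URL, json=body)
        resp.raise_for_status()
        return resp.json()['count']

    if __name__ == '__main__':
        print(hit_count())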
*** salv-orlando has quit IRC | 22:22 | |
*** nati_uen_ has quit IRC | 22:22 | |
*** med_ has joined #openstack-infra | 22:23 | |
*** nati_ueno has joined #openstack-infra | 22:23 | |
*** vkozhukalov has quit IRC | 22:24 | |
*** eharney has quit IRC | 22:24 | |
flashgordon | fungi: https://bugs.launchpad.net/openstack-ci/+bug/1270309 | 22:24 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] | 22:24 |
flashgordon | can you triage that, I think won't fix makes sense but your call | 22:24 |
flashgordon | but something closed | 22:24 |
*** bnemec has joined #openstack-infra | 22:24 | |
*** rossella_s has quit IRC | 22:25 | |
*** carl_baldwin has quit IRC | 22:26 | |
*** carl_baldwin has joined #openstack-infra | 22:26 | |
dimsum | "hudson.remoting.Request.call(Request.java" | 22:28 |
very_tired | I'm out for the weekend and Monday, I expect to be online again on Tuesday | 22:28 |
*** sarob has joined #openstack-infra | 22:28 | |
mordred | have a great weekend very_tired | 22:28 |
very_tired | clarkb and fungi thanks for all your help | 22:28 |
very_tired | thanks | 22:28 |
*** marun has quit IRC | 22:28 | |
very_tired | :D | 22:28 |
*** very_tired is now known as anteaya | 22:28 | |
*** marun has joined #openstack-infra | 22:29 | |
*** gema has joined #openstack-infra | 22:30 | |
*** carl_baldwin has quit IRC | 22:30 | |
*** lcestari has quit IRC | 22:31 | |
flashgordon | fungi: I think it is a valid infra bug actually, these shouldn't be marked as failures | 22:32 |
*** obondarev_ has quit IRC | 22:32 | |
*** reed_ has joined #openstack-infra | 22:32 | |
*** emagana has quit IRC | 22:32 | |
flashgordon | dimsum: I think that is the same issue, that is part of the InterruptedException stacktrace | 22:32 |
flashgordon | dimsum: see https://bugs.launchpad.net/openstack-ci/+bug/1270309 | 22:33 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] | 22:33 |
*** nati_ueno has quit IRC | 22:33 | |
*** reed__ has joined #openstack-infra | 22:34 | |
notmyname | why would this change https://review.openstack.org/#/c/67538/ be marked as SKIPPED in zuul? | 22:34 |
notmyname | it's towards the bottom of the gate queue | 22:35 |
*** reed has quit IRC | 22:35 | |
*** reed__ has quit IRC | 22:35 | |
*** senk has joined #openstack-infra | 22:36 | |
*** reed_ has quit IRC | 22:37 | |
*** HenryG has quit IRC | 22:37 | |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1270309 https://review.openstack.org/67594 | 22:39 |
uvirtbot | Launchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] https://launchpad.net/bugs/1270309 | 22:39 |
clarkb | notmyname: probably a merge conflict, if you hover over the red bubble it will tell you | 22:41 |
openstackgerrit | Joe Gordon proposed a change to openstack-infra/elastic-recheck: Use short build_uuids in elasticSearch queries https://review.openstack.org/67596 | 22:45 |
zaro | clarkb: new scp plugin is on jenkins-dev.o.o | 22:45 |
*** ArxCruz has quit IRC | 22:49 | |
*** flashgordon is now known as jog0 | 22:50 | |
*** marun has quit IRC | 22:50 | |
*** mrda has joined #openstack-infra | 22:53 | |
*** dstanek has quit IRC | 22:56 | |
*** prad has quit IRC | 22:57 | |
*** mrda has quit IRC | 22:57 | |
*** thedodd has quit IRC | 22:59 | |
*** radix has joined #openstack-infra | 23:01 | |
radix | jenkins seems to be ignoring one of my patches, https://review.openstack.org/#/c/67006/3 , is there something wedged? | 23:01 |
radix | or is there something messed up with my patch because I've done something wrong, maybe | 23:02 |
*** rcleere has quit IRC | 23:02 | |
clarkb | radix: it is being rechecked | 23:02 |
radix | oh ok cool :) | 23:03 |
clarkb | radix: looks like it was a draft at one point though | 23:03 |
radix | yep, started out as one | 23:03 |
clarkb | drafts are evil and don't work at all in the CI systems | 23:03 |
clarkb | you can use Work in progress instead | 23:03 |
radix | well, I assumed jenkins would notice the first non-draft I posted | 23:03 |
clarkb | depends on how the non draft is posted | 23:03 |
*** dcramer_ has quit IRC | 23:04 | |
clarkb | if it is just published jenkins won't notice | 23:04 |
clarkb | if it is pushed as a fresh non-draft patchset jenkins should notice, and in that case jenkins may have missed it because we have been hitting zuul with a hammer to make it go quicker | 23:04 |
radix | ah, ok | 23:04 |
radix | yeah, I just pushed a new rev as a non-draft, so it was probably that | 23:04 |
*** zz_ewindisch is now known as ewindisch | 23:05 | |
radix | I'll point out that https://wiki.openstack.org/wiki/Gerrit_Workflow explains how to use drafts, and doesn't discourage them | 23:05 |
clarkb | gah | 23:06 |
* clarkb goes on a bug filing spree | 23:06 | |
radix | hehe :) | 23:06 |
clarkb | since the chances I get all of this done today are slim | 23:06 |
*** burt1 has quit IRC | 23:07 | |
*** sarob has quit IRC | 23:10 | |
*** sarob has joined #openstack-infra | 23:10 | |
openstackgerrit | A change was merged to openstack-infra/devstack-gate: comparison to stable/grizzly is not numeric https://review.openstack.org/63934 | 23:11 |
*** jergerber has quit IRC | 23:11 | |
*** thuc has quit IRC | 23:12 | |
*** thuc has joined #openstack-infra | 23:12 | |
sdague | yay, the non numeric patch finally landed! | 23:14 |
*** senk has quit IRC | 23:14 | |
*** reed__ has joined #openstack-infra | 23:14 | |
sdague | also, there is a fix for stable/grizzly devstack in the gate now | 23:14 |
sdague | no need to promote it, it's fine if it churns through the weekend | 23:14 |
sdague | but that should be handy | 23:14 |
*** sarob has quit IRC | 23:15 | |
*** thuc_ has joined #openstack-infra | 23:16 | |
clarkb | sdague: woot | 23:16 |
clarkb | sdague: what was the fix? | 23:16 |
sdague | https://review.openstack.org/#/c/67425/ | 23:16 |
*** markmcclain has joined #openstack-infra | 23:16 | |
sdague | basically, we were so wrapped up in the pip 1.5 thing, we forgot about the broken run-arounds on pip 1.4 | 23:16 |
sdague | that never got back ported | 23:17 |
clarkb | :( | 23:17 |
*** thuc has quit IRC | 23:17 | |
sdague | however, it passed | 23:17 |
sdague | so I think it will fix things | 23:17 |
sdague | chmouel has additional good backports and fixes for grizzly | 23:17 |
sdague | but that one should be sufficient to get stable/havana working | 23:17 |
*** soleblaze has quit IRC | 23:18 | |
*** markmcclain1 has joined #openstack-infra | 23:19 | |
clarkb | bugs 1270321 1270319 and 1270320 submitted to cover the stuff we ran into today | 23:19 |
uvirtbot | Launchpad bug 1270321 in openstack-ci "Puppet manifests for zuul install too new statsd." [Medium,Triaged] https://launchpad.net/bugs/1270321 | 23:19 |
clarkb | radix: I think I am just going to update the wiki now | 23:19 |
*** mrodden has quit IRC | 23:20 | |
radix | thanks :) | 23:20 |
*** denis_makogon has quit IRC | 23:20 | |
*** markmcclain has quit IRC | 23:21 | |
sdague | clarkb: was your hack to disable draft perms ever something that worked? | 23:21 |
*** herndon_ has quit IRC | 23:22 | |
*** soleblaze has joined #openstack-infra | 23:23 | |
*** CaptTofu has quit IRC | 23:24 | |
*** sarob has joined #openstack-infra | 23:24 | |
*** CaptTofu has joined #openstack-infra | 23:25 | |
*** reed__ has quit IRC | 23:25 | |
*** carl_baldwin has joined #openstack-infra | 23:27 | |
sdague | now that we did a zuul restart with the durable enqueue times in it - https://review.openstack.org/#/q/status:open+project:openstack-infra/config+branch:master+topic:status_ui,n,z could land any time, which displays enqueue duration in jobs | 23:28 |
sdague | and makes the merge conflict changes black, so they are easier to distinguish | 23:29 |
*** CaptTofu has quit IRC | 23:29 | |
clarkb | sdague: haven't had a chance to test that | 23:30 |
clarkb | zaro: is that something you can test on review-dev? disable push rights to refs/drafts/* for all projects | 23:31 |
*** jcooley_ has quit IRC | 23:32 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 23:35 |
sdague | clarkb: there is actually a url in the review that shows it in action | 23:35 |
clarkb | sdague: cool, I will take a look momentarily | 23:35 |
sdague | it's all just status ui on the zuul json | 23:35 |
sdague | so you can just gvfs-open it locally actually | 23:35 |
sdague | cd config/modules/openstack_project/files/status && gvfs-open index.html | 23:36 |
*** emagana has joined #openstack-infra | 23:38 | |
*** mfink has joined #openstack-infra | 23:39 | |
*** jcooley_ has joined #openstack-infra | 23:39 | |
mordred | sdague: looks good to me | 23:40 |
sdague | mordred: cool | 23:40 |
sdague | mordred: so the grizzly devstack thing | 23:40 |
mordred | yeah? | 23:40 |
sdague | apparently you pushed a fix for that in august | 23:40 |
sdague | which got lost | 23:40 |
sdague | and someone found it | 23:40 |
mordred | AWESOME | 23:40 |
sdague | https://review.openstack.org/#/c/67425/ | 23:40 |
sdague | why it only started screwing us now... I don't know | 23:41 |
mordred | so broken | 23:41 |
sdague | so anyway, once that gets through the gate, havana patches can land again | 23:42 |
sdague | I think | 23:42 |
*** vipul-away is now known as vipul | 23:42 | |
*** jcooley_ has quit IRC | 23:44 | |
*** boris-42 has quit IRC | 23:45 | |
*** rnirmal has quit IRC | 23:45 | |
mordred | sdague, clarkb: perhaps we should make some of the different colors different shapes too - for people with colorblindness | 23:46 |
zaro | clarkb, sdague : i'll give disabling drafts a try. | 23:46 |
clarkb | zaro: thank you | 23:46 |
*** salv-orlando has joined #openstack-infra | 23:47 | |
sdague | mordred: yeh, I think that would be good. Honestly, we should probably do the shape draws with svg anyway. | 23:47 |
sdague | maybe after turning status.js into templates I'll do that | 23:47 |
*** jerryz has quit IRC | 23:47 | |
*** obondarev_ has joined #openstack-infra | 23:48 | |
mordred | sdague: yeah. and on your plane - def look at the bower/grunt stuff for that - if we're going to get fancier, I think we should consider not just being files in the config repo | 23:49 |
sdague | yep, I'd be fine with that | 23:49 |
mordred | it also may be way overkill - which is why you should look at it and not me | 23:49 |
sdague | heh | 23:49 |
portante | sdague, mousing over the circle was a well hidden feature in zuul for me | 23:49 |
portante | thanks for pointing that out | 23:50 |
*** krotscheck has joined #openstack-infra | 23:50 | |
clarkb | ok fixing the wiki article finally :) | 23:51 |
*** markmcclain1 has quit IRC | 23:52 | |
*** flaper87 is now known as flaper87|afk | 23:53 | |
*** carl_baldwin has quit IRC | 23:53 | |
*** jerryz has joined #openstack-infra | 23:56 | |
clarkb | https://wiki.openstack.org/wiki/Gerrit_Workflow#Work_in_Progress how does that look? | 23:57 |
*** pballand has quit IRC | 23:57 |