tristanC | gundalow: on another topic, we'll upgrade the ansible.sf-project.io host to the latest version of software-factory soon (version 3.1) | 00:04 |
---|---|---|
tristanC | gundalow: in that version, the ansible*/zuul-config project will now host that file https://softwarefactory-project.io/cgit/config/tree/zuul/ansible_networking.yaml | 00:05 |
gundalow | Cool. Is there an email list I should subscribe to for planned upgrade/outage notices? | 00:05 |
tristanC | gundalow: we don't have such a mailing list yet, there shouldn't be any outages | 00:06 |
gundalow | :) | 00:06 |
tristanC | gundalow: though the upgrade will propose an update to the zuul-config project, so there will be a PR to accept to make the new version effective | 00:06 |
gundalow | cool, will keep an eye out for that. Thanks for the heads up | 00:07 |
tristanC | it seems like we could do that in a couple of weeks, one month tops | 00:07 |
sfbender | Paul Belanger created software-factory/sf-config master: Fix grafana graph for executor memory usage https://softwarefactory-project.io/r/12862 | 00:39 |
sfbender | Paul Belanger created software-factory/sf-config master: Add executor HDD usage to zuul-status graph https://softwarefactory-project.io/r/12863 | 00:46 |
sfbender | Paul Belanger created software-factory/sf-config master: Add max_servers metric to nodepool test nodes graph https://softwarefactory-project.io/r/12864 | 01:03 |
sfbender | Merged www.softwarefactory-project.io master: Add 3.0 release note for new sf-config and acme-tiny version https://softwarefactory-project.io/r/12826 | 01:06 |
tristanC | logan-: i published a new sf-config and acme-tiny package in the 3.0 release repository. This should fix the bug you reported, thanks for the feedback! ( release note is: http://www.softwarefactory-project.io/releases/3.0/ ) | 01:14 |
*** caphrim007 has joined #softwarefactory | 01:21 | |
*** caphrim007 has quit IRC | 01:26 | |
*** caphrim007 has joined #softwarefactory | 01:35 | |
*** Guest38444 has quit IRC | 02:04 | |
*** Guest38444 has joined #softwarefactory | 02:08 | |
sfbender | Tristan de Cacqueray created software-factory/sf-config master: zuul: install missing packages for config-check https://softwarefactory-project.io/r/12865 | 02:53 |
*** caphrim007_ has joined #softwarefactory | 03:01 | |
*** caphrim007 has quit IRC | 03:04 | |
sfbender | Tristan de Cacqueray created software-factory/sf-ci master: Switch back to base job since log-classify is now integrated https://softwarefactory-project.io/r/12866 | 03:04 |
logan- | awesome tristanC, thanks for the follow up. i'm interested to deploy 3.1 and try out a private config repo. is this the process I should be looking at using zuul_rpm_build.py? https://softwarefactory-project.io/docs/contributor/prepare_dev_environment.html | 03:32 |
tristanC | logan-: you could give the current 3.1 candidate a try by running this task: https://softwarefactory-project.io/paste/show/1128/ | 03:36 |
tristanC | then continue the update process as documented here: https://softwarefactory-project.io/docs/operator/upgrade.html | 03:36 |
tristanC | e.g. yum update sf-config && sfconfig --upgrade | 03:36 |
logan- | thanks! | 03:37 |
tristanC | though note that private config repo has not been tested, so there are still probably some issues with it | 03:38 |
tristanC | for example, we need a toggle to restrict the default acl here: https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-repos/templates/config/resources/_internal.yaml.j2#n59 | 03:39 |
tristanC | and this task also needs to be updated: https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-repos/tasks/fetch_config_repo.yml#n5 | 03:39 |
tristanC | (the current process is to fetch the config repo on every host to apply new config, and this assumes public access to the repo) | 03:39 |
tristanC | so to enable a private config repo, we'll have to set up the access key on every host managed by sfconfig | 03:40 |
tristanC | or we could change the logic and push the config repo content from the install-server instead of pulling | 03:40 |
logan- | yeah, similar to how prepare-workspace pushes the repos | 03:41 |
tristanC | basically, any task using config_public_location needs to be fixed | 03:43 |
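
A minimal sketch of the push-based approach discussed above, run from the install-server (the Ansible control node for sfconfig); the paths and play target are assumptions for illustration, not actual sf-config code:

```yaml
---
# Hypothetical sketch: push the config repo checkout from the install-server
# to each managed host, instead of having every host fetch it from
# config_public_location (which assumes public access to the repo).
- hosts: all
  tasks:
    - name: Push the config repo content to the host
      synchronize:
        src: /root/config/      # checkout on the install-server (assumed path)
        dest: /root/config/     # destination on the managed host (assumed path)
        delete: true
```
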
tristanC | logan-: also, even if we support a private config repo (e.g. in gerrit), zuul may still leak its content, e.g. config-check and config-update job logs will be visible on the zuul status page and in the builds history | 03:45 |
logan- | good point | 03:46 |
tristanC | logan-: that can also be parametrized, e.g. if the private config option (TBD) is set, then we could make the task no_log and keep the artifacts locally on the executor | 03:56 |
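
A minimal sketch of that toggle, assuming a hypothetical `config_private` option; the variable name and placeholder command are illustrative, not existing sf-config code:

```yaml
---
# Hypothetical sketch: suppress task output only when the (assumed)
# config_private option is enabled, so public deployments keep full job logs.
- hosts: localhost
  vars:
    config_private: false                  # assumed toggle (TBD)
  tasks:
    - name: Run the config check (placeholder for the real step)
      command: /bin/true                   # stands in for the actual config-check command
      no_log: "{{ config_private | bool }}"
```
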
tristanC | feel free to try the 3.1 candidate version though, it still adds many great new features :-) | 03:57 |
logan- | will do! | 04:04 |
sfbender | Tristan de Cacqueray created software-factory/sf-config master: nodepool: fix dib cache location https://softwarefactory-project.io/r/12868 | 06:08 |
*** nchakrab has joined #softwarefactory | 06:13 | |
sfbender | Tristan de Cacqueray created software-factory/sf-docs master: Add log-classify user documentation https://softwarefactory-project.io/r/12869 | 06:27 |
sfbender | Tristan de Cacqueray created logreduce master: Fix ARA report directory link to ara-report https://softwarefactory-project.io/r/12870 | 06:36 |
sfbender | Tristan de Cacqueray created logreduce master: Update zuul-jobs log-classify role https://softwarefactory-project.io/r/12871 | 06:36 |
sfbender | Merged logreduce master: Fix ARA report directory link to ara-report https://softwarefactory-project.io/r/12870 | 06:38 |
sfbender | Merged logreduce master: Update zuul-jobs log-classify role https://softwarefactory-project.io/r/12871 | 06:40 |
sfbender | Merged www.softwarefactory-project.io master: Add sprint 2018-26 https://softwarefactory-project.io/r/12804 | 07:01 |
*** Guest38444 has quit IRC | 07:28 | |
*** Guest38444 has joined #softwarefactory | 07:31 | |
*** jpena|off is now known as jpena | 08:04 | |
sfbender | Merged software-factory/sf-ci master: Switch back to base job since log-classify is now integrated https://softwarefactory-project.io/r/12866 | 09:52 |
*** jpena is now known as jpena|lunch | 10:59 | |
sfbender | Merged software-factory/sf-config master: nodepool: fix dib cache location https://softwarefactory-project.io/r/12868 | 11:18 |
sfbender | Merged software-factory/sf-config master: Fix grafana graph for executor memory usage https://softwarefactory-project.io/r/12862 | 11:28 |
sfbender | Merged software-factory/sf-config master: zuul: install missing packages for config-check https://softwarefactory-project.io/r/12865 | 11:35 |
sfbender | Merged software-factory/cauth master: cauth/repoxplorer: Harden in case of repoxplorer or elasticsearch down https://softwarefactory-project.io/r/12831 | 11:51 |
*** Guest38444 has quit IRC | 12:01 | |
sfbender | Fabien Boucher created software-factory/managesf master: managesf/configuration/repoxplorer: Fix in case tenant does not have default-connection https://softwarefactory-project.io/r/12878 | 12:02 |
*** Guest38444 has joined #softwarefactory | 12:10 | |
*** jpena|lunch is now known as jpena | 12:16 | |
sfbender | Fabien Boucher created software-factory/managesf master: managesf/configuration: handle the private attribute https://softwarefactory-project.io/r/12879 | 12:31 |
rcarrillocruz | folks, any issues with the oci server | 12:42 |
rcarrillocruz | seeing a lot of node job reschedules | 12:42 |
rcarrillocruz | just got a retry limit | 12:42 |
tristanC | rcarrillocruz: yes, though it doesn't seem related to oci, other jobs are also failing with retry limit with the dib nodeset | 12:44 |
rcarrillocruz | oki | 12:44 |
tristanC | rcarrillocruz: i haven't found the bottleneck yet, i'll have a look tomorrow | 12:46 |
tristanC | we are migrating rdoproject.org jobs over to sf-project.io zuul, this may be causing a scaling issue between zuul and nodepool, or maybe the executors are overloaded | 12:47 |
tristanC | e.g.: https://softwarefactory-project.io/grafana/d/000000001/zuul-status?panelId=43&fullscreen&orgId=1&from=now%2FM&to=now | 12:47 |
tristanC | pabelanger: that graph seems a bit odd https://softwarefactory-project.io/grafana/d/000000001/zuul-status?panelId=44&fullscreen&orgId=1&from=now%2FM&to=now, shouldn't the executor load be lower? | 12:49 |
tristanC | they only have 4 CPUs each | 12:50 |
tristanC | pabelanger: symptoms are jobs take a long time to start, and sometimes bail out with 'retry_limit' | 12:53 |
*** nchakrab_ has joined #softwarefactory | 12:57 | |
sfbender | Fabien Boucher created software-factory/sf-config master: cgit and hound config: take care of the private attribute https://softwarefactory-project.io/r/12880 | 12:57 |
*** nchakrab has quit IRC | 13:00 | |
tristanC | pabelanger: zuul.conf currently uses load_multiplier=2.5, i think we could lower this to 2 or even 1.5 | 13:06 |
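
For reference, that setting lives in the [executor] section of zuul.conf; a sketch of lowering it, with the file path assumed to be the default:

```ini
# /etc/zuul/zuul.conf (excerpt) - lower the governor threshold so a 4-CPU
# executor stops accepting new jobs earlier.
[executor]
load_multiplier=2.0
```
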
pabelanger | tristanC: if you look at the executors at https://softwarefactory-project.io/grafana/d/000000001/zuul-status?orgId=1 it will show whether they are accepting jobs; if they are not accepting jobs, no builds will start | 13:52 |
pabelanger | that is likely because of governor | 13:52 |
pabelanger | tristanC: starting build graph looks to be good too | 13:53 |
pabelanger | rcarrillocruz: have a log? | 13:58 |
tristanC | pabelanger: e.g. PS5 of https://softwarefactory-project.io/r/#/c/12763/ | 13:58 |
tristanC | pabelanger: i wonder if executors may accept a job but then fail to start the build. sometimes on the status page, console logs just stop with END OF STREAM, e.g.: https://softwarefactory-project.io/zuul/t/rdoproject.org/stream.html?uuid=934eac35fa994e049ed78484318e57fd&logfile=console.log | 14:00 |
pabelanger | tristanC: so, i don't have proof yet, but I think we are seeing poor IO on the zuul-executor, which could be taking too long to do merge operations before running the playbook; if that fails, i believe the job will be rescheduled by the scheduler | 14:01 |
pabelanger | tristanC: can you send a copy of the executor logs for the jobs above^ | 14:02 |
tristanC | or that yes | 14:02 |
pabelanger | we should be able to see a timeout in logs | 14:02 |
pabelanger | tristanC: do we have SSDs in these compute nodes? I believe we should look to mount /var/lib/zuul on SSD, not ceph, to help get better IO | 14:03 |
pabelanger | local disk vs network disk | 14:03 |
tristanC | pabelanger: indeed WARNING zuul.AnsibleJob: [build: a24ebac1a2744059b7692512e36405d5] Ansible timeout exceeded | 14:04 |
pabelanger | tristanC: was that pre-run? | 14:05 |
tristanC | pabelanger: here is the log of the retry_limit rcarrillocruz got: https://ansible.softwarefactory-project.io/paste/show/piAZSihmXcQixLtnFb2z/ | 14:09 |
tristanC | pabelanger: and here is a similar failure happening with a dib nodeset: https://softwarefactory-project.io/paste/show/Q5Jb0maoKsnMf2A2wBrQ/ | 14:11 |
pabelanger | Hmm, that timeout looks to be short | 14:13 |
pabelanger | and we don't seem to log it | 14:13 |
pabelanger | tristanC: -9 is abort | 14:16 |
pabelanger | tristanC: so zuul aborted the run for some reason | 14:16 |
pabelanger | tristanC: new patchset? | 14:16 |
*** nchakrab_ has quit IRC | 14:17 | |
pabelanger | I don't think it is the new hdd sensor, it should only stop jobs from running, not abort them | 14:17 |
tristanC | pabelanger: iirc those were reported as retry_limit, and there is a warning about an ansible timeout | 14:20 |
pabelanger | tristanC: did zuul-executor get restarted during that time? | 14:20 |
pabelanger | tristanC: the scheduler log should give more info on the retries too | 14:21 |
*** nchakrab has joined #softwarefactory | 14:21 | |
tristanC | pabelanger: scheduler logs for the second build is https://softwarefactory-project.io/paste/show/HmAeuHC0Xb2tRjYxZTkG/ | 14:23 |
tristanC | pabelanger: first build is https://ansible.softwarefactory-project.io/paste/show/gx6lceANyJ8964dwJeoe/ | 14:23 |
tristanC | i got to go now, i'll debug more tomorrow | 14:24 |
*** nchakrab has quit IRC | 14:40 | |
rcarrillocruz | folks, i need to debug a weird issue on vyos_config, within the context of a zuul job run | 15:53 |
rcarrillocruz | who can help me out to do an autohold and inject my pubkey? | 15:53 |
pabelanger | fbo: ^ | 15:54 |
pabelanger | rcarrillocruz: sorry, I don't have access myself | 15:54 |
rcarrillocruz | ah, nhicher is not around | 15:54 |
rcarrillocruz | :/ | 15:54 |
fbo | rcarrillocruz: yep | 15:55 |
pabelanger | rcarrillocruz: I think he's on PTO | 15:55 |
rcarrillocruz | fbo: https://github.com/ansible-network/cloud-vpn/pull/3 | 15:55 |
rcarrillocruz | let me know when i push a new patchset | 15:56 |
rcarrillocruz | so the hold is made | 15:56 |
rcarrillocruz | my keys: https://github.com/rcarrillocruz.keys | 15:56 |
rcarrillocruz | or a recheck rather, don't really have anything to change on the PR | 15:57 |
fbo | looks like I need a job name | 15:58 |
rcarrillocruz | cloud-vpn-aws-vyos-to-aws-vpn | 15:58 |
fbo | rcarrillocruz: ^ | 15:58 |
fbo | ok | 15:58 |
fbo | rcarrillocruz: ok let's recheck your change | 15:59 |
rcarrillocruz | done | 15:59 |
fbo | rcarrillocruz: the link to your pub key ? | 16:01 |
rcarrillocruz | any from the link i pasted earlier | 16:02 |
rcarrillocruz | https://github.com/rcarrillocruz.keys | 16:02 |
fbo | thanks | 16:03 |
rcarrillocruz | what's the IP, i don't think that's logged in the job log | 16:04 |
rcarrillocruz | or wait, i think i can get it on the nodes dashboard | 16:05 |
rcarrillocruz | bah, not | 16:06 |
fbo | rcarrillocruz: zuul@38.145.33.133 | 16:07 |
rcarrillocruz | thx mate | 16:07 |
rcarrillocruz | where is the workspace put these days | 16:08 |
rcarrillocruz | [zuul@host-10-0-0-11 ~]$ pwd | 16:08 |
rcarrillocruz | /home/zuul | 16:08 |
rcarrillocruz | [zuul@host-10-0-0-11 ~]$ ls | 16:08 |
rcarrillocruz | wait | 16:09 |
rcarrillocruz | i think you need to put the key on zuul-worker user | 16:09 |
fbo | Oh but I was unable to connect as zuul-worker, and it's zuul that was defined in the nodepool config | 16:10 |
fbo | for that image | 16:10 |
rcarrillocruz | thing is the workspace (per the job def) is checked out in the zuul-worker home folder | 16:10 |
rcarrillocruz | this is odd | 16:11 |
rcarrillocruz | [zuul@host-10-0-0-11 ~]$ cd /home | 16:11 |
rcarrillocruz | [zuul@host-10-0-0-11 home]$ ls | 16:11 |
fbo | rcarrillocruz: you can sudo -i, can't you? | 16:11 |
rcarrillocruz | zuul | 16:11 |
rcarrillocruz | i can sudo, but i don't see the zuul-worker home folder anywhere | 16:11 |
fbo | same here. | 16:12 |
fbo | Is the image correct? I mean I did nothing specific, just logged in to it | 16:12 |
rcarrillocruz | well, the image is a f27-oci | 16:12 |
rcarrillocruz | that's not managed by me | 16:12 |
rcarrillocruz | i think you may have given me access to a node that is not part of the job | 16:13 |
fbo | ah so that's not the right image then | 16:13 |
rcarrillocruz | https://github.com/ansible-network/zuul-config/blob/master/zuul.d/jobs.yaml | 16:13 |
rcarrillocruz | what i need to get is access to the node that is running the job (still) | 16:13 |
rcarrillocruz | if you did the autohold, it should not be deleted by nodepool after the job ends, right? | 16:14 |
rcarrillocruz | ok | 16:15 |
rcarrillocruz | so the node is | 16:15 |
rcarrillocruz | 0000046738 | 16:15 |
rcarrillocruz | per https://ansible.softwarefactory-project.io/zuul/nodes.html | 16:15 |
rcarrillocruz | what's the IP of that node | 16:16 |
rcarrillocruz | nodepool list should show it | 16:16 |
fbo | rcarrillocruz: this is a container, not sure I can give you access then | 16:16 |
fbo | for the autohold there isn't an option for specifying the image, so the node on hold should be the right one | 16:16 |
rcarrillocruz | i would assume containers do run an ssh daemon, and they don't have another access mechanism? | 16:16 |
rcarrillocruz | i mean, the zuul executor connects to the node | 16:17 |
rcarrillocruz | i'd be surprised if the container is accessed by the zuul executor by connecting to the host, then doing something like a primitive docker exec or the like | 16:17 |
rcarrillocruz | if you run in nodepool | 16:18 |
rcarrillocruz | nodepool list |grep 0000046738 | 16:18 |
rcarrillocruz | what does it show | 16:18 |
fbo | rcarrillocruz: ok so let's try zuul-worker@38.145.33.82 | 16:19 |
fbo | try port 34999 | 16:20 |
fbo | that's the one specified on nodepool list --detail | 16:21 |
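
A sketch of the operator-side commands used here; the tenant and project arguments are placeholders to adjust for the actual deployment:

```sh
# Hold the nodes of the next failing run of that job (reason text is illustrative)
zuul autohold --tenant <tenant> --project <org>/cloud-vpn \
    --job cloud-vpn-aws-vyos-to-aws-vpn --reason "debug vyos_config" --count 1

# After the recheck fails, look up the held node's connection details (user, host, port)
nodepool list --detail | grep 0000046738
```
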
rcarrillocruz | can ssh to 22, cannot to 34999 | 16:21 |
rcarrillocruz | otoh, i'm in but there's no checked-out project, i assume it maybe got deleted, dunno | 16:22 |
rcarrillocruz | will try to recreate from this | 16:22 |
fbo | rcarrillocruz: well that's not the way to do it. I removed your key from there as this was the main oci node. | 16:27 |
rcarrillocruz | so that's the host? | 16:27 |
rcarrillocruz | you know what | 16:28 |
rcarrillocruz | i'll change the node type | 16:28 |
rcarrillocruz | and try to recreate from a real fedora node | 16:28 |
rcarrillocruz | pabelanger: did you create any fedora dib nodes on our tenant? | 16:29 |
rcarrillocruz | i was off last week, unsure what you did there | 16:29 |
fbo | I succeeded in connecting on 34999 and was in the container | 16:29 |
rcarrillocruz | let me retry then | 16:29 |
rcarrillocruz | [ricky@ricky-laptop ~]$ ssh zuul-worker@38.145.33.82 -p 34999 | 16:30 |
rcarrillocruz | Received disconnect from 38.145.33.82 port 34999:2: Too many authentication failures | 16:30 |
rcarrillocruz | Authentication failed. | 16:30 |
rcarrillocruz | you may need to put my pubkey in the container's zuul-worker authorized keys | 16:30 |
fbo | oh ok you had network access to the port, cool | 16:30 |
rcarrillocruz | so the way it works apparently | 16:31 |
rcarrillocruz | 22 is for the host | 16:31 |
rcarrillocruz | then each container | 16:31 |
rcarrillocruz | it spawns an sshd process on 34999 | 16:31 |
fbo | retry | 16:31 |
rcarrillocruz | just so zuul-executor can connect to it | 16:31 |
rcarrillocruz | oci slaves, that is | 16:31 |
rcarrillocruz | i'm in now | 16:31 |
rcarrillocruz | \o/ | 16:31 |
rcarrillocruz | and the change is checked out there | 16:31 |
rcarrillocruz | thx fbo, i can debug now | 16:32 |
fbo | rcarrillocruz: sorry it took some time to figure out how to do that. | 16:32 |
rcarrillocruz | hey, you solved it ;-) | 16:33 |
*** jpena is now known as jpena|off | 17:03 | |
*** fbo is now known as fbo|off | 17:14 | |
sfbender | Fabien Boucher created software-factory/managesf master: wip - managesf/resources: add extra validation for the private attribute https://softwarefactory-project.io/r/12883 | 17:31 |
*** Guest38444 has quit IRC | 18:43 | |
*** Guest38444 has joined #softwarefactory | 18:46 | |
*** caphrim007_ has quit IRC | 19:00 | |
*** caphrim007 has joined #softwarefactory | 19:01 | |
sfbender | Merged software-factory/sf-config master: zuul: integrate log-classify post actions https://softwarefactory-project.io/r/12763 | 20:08 |
gundalow | Created a new branch `stable-2.5` and I've protected the branch in GitHub, though Zuul doesn't seem to be running: https://github.com/ansible-network/network-engine/pull/107 I can't see anything in the dashboard | 22:34 |
*** Guest38444 has quit IRC | 22:58 | |
*** Guest38444 has joined #softwarefactory | 23:13 |