*** dpawlik has joined #softwarefactory | 06:04 | |
*** dpawlik has quit IRC | 06:04 | |
*** dpawlik has joined #softwarefactory | 06:07 | |
*** dpawlik has quit IRC | 06:07 | |
*** dpawlik has joined #softwarefactory | 06:08 | |
*** apevec has joined #softwarefactory | 06:43 | |
*** sshnaidm|off is now known as sshnaidm | 07:33 | |
*** jpena|off is now known as jpena | 07:49 | |
*** brendangalloway has joined #softwarefactory | 11:29 | |
*** jpena is now known as jpena|lunch | 11:30 | |
*** rfolco has joined #softwarefactory | 12:04 | |
brendangalloway | tristanC: Last week I had a question about the ara-report folder not being visible. Is there a config setting I can change to make it visible again? Our devs don't like the change to their workflow | 12:13 |
brendangalloway | I'm also noticing an issue where the post-config job does not appear to be updating the nodepool.yaml file correctly. The playbook runs '/bin/managesf-configuration nodepool --output "/etc/nodepool/nodepool.yaml" --extra-launcher --hostname runc' which generates a file with an empty providers entry. If I manually run the utility without the | 12:15 |
brendangalloway | extra launcher the file appears to be generated correctly | 12:15 |
tristanC | the ara-report folder is no longer visible but it is still available, either by appending `/ara-report/` to the log_url, or by clicking the `Ara Report` link from the build result | 12:15 |
tristanC | brendangalloway: that post-config issue rings a bell, let me check | 12:16 |
brendangalloway | We found you could type the url back in manually, but having the link in the folder was a lot more convenient when debugging. If it's not possible we'll live, but it would be preferred if we could restore the previous behaviour somehow | 12:19 |
tristanC | brendangalloway: that's unfortunate. This is happening because we switched the ara-api to be a dedicated service so that it could run the python3 version (previously it was running under mod_wsgi in apache, which meant it had to be py2 on centos) | 12:21 |
tristanC | brendangalloway: and we couldn't find a way to instruct apache to perform a rewrite rule while keeping the folder available in the generated index | 12:22 |
brendangalloway | And lastly, I'm trying to set up a kubernetes cluster in preparation for runc being deprecated in the next release. The documentation on what needs to be done is a bit scattered though and I'm struggling to piece together exactly what has to be done. Does adding the hypervisor-k1s role to arch.yaml set up a kubernetes cluster on the specified | 12:22 |
brendangalloway | node, or just install the tools needed for nodepool to talk to the cluster defined in the kube_file in sfconfig.yaml? | 12:22 |
brendangalloway | ok, that is unfortunate. If there is some way to restore the link in the future it would be appreciated | 12:24 |
tristanC | brendangalloway: if you can setup a kubernetes and provide the kube_file that would be the best | 12:24 |
tristanC | brendangalloway: otherwise, using the k1s component will set up a fake kubernetes endpoint that will work for nodepool/zuul workloads, and it will work similarly to runc, e.g. nodepool will be auto configured and there will be a _managed_k1s provider added to the config repo | 12:25 |
tristanC | ftr, the code is currently available here: https://pagure.io/software-factory/k1s | 12:25 |
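A rough sketch of the managed-k1s option described above, assuming the usual arch.yaml inventory layout; the host names, IP addresses, and role lists here are placeholders, not taken from this deployment:

```yaml
# Hypothetical arch.yaml inventory sketch: adding the hypervisor-k1s role
# to a dedicated host so sfconfig auto-configures nodepool with the
# _managed_k1s provider mentioned above. Names and IPs are placeholders.
inventory:
  - name: main
    ip: 192.168.0.10
    roles:
      - install-server
      - nodepool-launcher
  - name: k1s01
    ip: 192.168.0.20
    roles:
      - hypervisor-k1s
```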
brendangalloway | Thanks. I have deployed a kubeadm cluster on another network in our openstack cluster and copied across the admin config file, updated sfconfig.yaml and run sfconfig --no-install | 12:27 |
brendangalloway | The operate nodepool docs refer to the provider defined in _local_hypervisor_k1s.yaml, but that only gets created when using the hypervisor-k1s role? Do I still need to define a provider, or can I simply refer to the one defined in the kube_file? If the latter, how do I do so? | 12:29 |
tristanC | brendangalloway: well the migration from runc to kubernetes remains to be defined and documented :) | 12:30 |
tristanC | brendangalloway: until then, you can setup a custom (not managed by sfconfig) provider, like this one: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/tests/fixtures/kubernetes.yaml#L10-L24 | 12:31 |
tristanC | the `context` attribute should match a context from the kube_file you provided | 12:31 |
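A minimal sketch of such a custom provider entry, modeled on the fixture linked above; the provider, pool, and label names are placeholders, and the context must match an entry in the kube_file given to sfconfig:

```yaml
# Sketch of a custom (non-managed) nodepool provider for an external
# kubernetes cluster. Provider, pool, and label names are hypothetical;
# the context must exist in the kube_file provided in sfconfig.yaml.
providers:
  - name: external-kube
    driver: kubernetes
    context: kubeadm-admin          # a context from the kube_file
    pools:
      - name: main
        labels:
          - name: pod-centos-7      # hypothetical label name
            type: pod
            image: docker.io/centos:7
```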
*** jpena|lunch is now known as jpena | 12:31 | |
brendangalloway | I'm not planning on migrating the existing runc jobs just yet (I see there's a stream of patches that are required to do so), but do want to see if we can use the kube ourselves before then. | 12:33 |
tristanC | brendangalloway: are you using the opendev.org/zuul/zuul-jobs project? | 12:35 |
brendangalloway | I think so - we're using the zuul-jobs that were provided as part of the software-factory install | 12:38 |
tristanC | brendangalloway: ok good, (we are still waiting for some roles to be accepted upstream to help with replacing runc with kubectl), and iirc those are included in the zuul-jobs copy shipped with the software-factory install | 12:40 |
brendangalloway | ok, I will not try to migrate to kubectl just yet. | 12:41 |
brendangalloway | Ok, thanks for all the help. I see the k1s driver provides a mechanism to store dockerfile definitions of images in the config repo itself. Is there a similar mechanism for managing external kubernetes clusters in the CI definitions, or would we have to have any custom containers defined in a repo and just refer to them in the provider | 12:47 |
brendangalloway | definition? | 12:47 |
tristanC | brendangalloway: about the post-config job failing to produce a valid nodepool configuration, i can't find a fix. But looking at the code, i think there may have been an issue with the ansible fact cache mechanism | 12:47 |
tristanC | brendangalloway: in particular, the `--extra-launcher` argument is added if `ansible_hostname` is not the `first_launcher` ( from https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-nodepool/tasks/update.yml#n26 ) | 12:48 |
brendangalloway | and are there any config requirements for docker images to be used as zuul workers? I see the centos-7 dockerfile example provided for k1s performs the equivalent of the zuul-worker dib element. Do we have to do something similar for external pods? | 12:49 |
tristanC | brendangalloway: and first-launcher is set to be the `name` of the host in the arch file ( from https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/sfconfig/arch.py#n89 ) | 12:49 |
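A rough sketch of the conditional being described, not the actual sf-config update.yml task; `first_launcher` comes from the arch file and `ansible_hostname` from the cached facts:

```yaml
# Rough sketch of the logic in question (hypothetical, not the real task):
# --extra-launcher should only be passed when the host generating the
# configuration is not the first_launcher named in the arch file.
- name: Generate the nodepool configuration
  command: >
    /bin/managesf-configuration nodepool
    --output /etc/nodepool/nodepool.yaml
    --hostname {{ ansible_hostname }}
    {{ '--extra-launcher' if ansible_hostname != first_launcher else '' }}
```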
tristanC | brendangalloway: thus could you share the output of `grep ^first_launcher /var/lib/software-factory/ansible/group_vars/all.yaml` and `ansible -m setup runc | grep ansible_hostname` | 12:51 |
tristanC | and perhaps dropping the file in `/var/lib/software-factory/ansible/facts/` would help fix that issue? | 12:52 |
brendangalloway | I'm guessing this is a problem: | 12:52 |
brendangalloway | ansible -m setup runc | grep ansible_hostname → [WARNING]: Could not match supplied host pattern, ignoring: runc / [WARNING]: No hosts matched, nothing to do | 12:52 |
brendangalloway | first launcher is 'main' | 12:53 |
tristanC | brendangalloway: about managing custom container images, this is currently specific to k1s. We are working on a generalized solution named `zuul-images-jobs`, but that is a lot of work | 12:53 |
tristanC | brendangalloway: i would recommend you manage the image manually at first | 12:54 |
tristanC | e.g. either publish them to a public registry, or push them directly to the host running the kubernetes cluster | 12:54 |
brendangalloway | tristanc: ok, and we would need to prep the image with the zuul-worker steps? We have a private docker repo on site for our containerised components | 12:55 |
brendangalloway | by dropping the file, do you mean I should delete the runc file in that folder? | 12:56 |
tristanC | brendangalloway: it depends on your job, but if you use the upstream jobs like `tox`, there are some RUN statements that help with that, in particular: https://softwarefactory-project.io/cgit/software-factory/sf-config/tree/ansible/roles/sf-repos/files/config/containers/centos-7/Dockerfile#n29 | 12:57 |
tristanC | and installing tools like python3-devel, rsync and such | 12:57 |
tristanC | otherwise any image (with at least util-linux or busybox, and python) should work | 12:58 |
brendangalloway | Don't all worker nodes require a zuul login, zuul-minimum packages and sudo permission for the executor to use them? We've encountered lots of problems with static nodes not being set up in exactly the manner that zuul expects | 13:00 |
brendangalloway | and are there public images that have already been configured as zuul workers? | 13:00 |
tristanC | for kubernetes it is different as ansible will use the `kubectl` connection plugin, e.g. it runs `kubectl exec` commands, thus there is no need for a login or existing user | 13:00 |
tristanC | but some zuul-jobs will perform a `revoke-sudo` task, and this will fail if sudo is not configured, as suggested by the two RUN statements from the `centos-7/Dockerfile#n29` link above | 13:03 |
brendangalloway | ok, so what needs to be installed will depend on the job | 13:03 |
tristanC | brendangalloway: yes. And the `centos-7/Dockerfile` should provide something equivalent to the default runc-centos label | 13:04 |
brendangalloway | and if we want to implement any jobs that use/inherit the fetch output role, we need to use the new role as in the patches at https://review.opendev.org/#/q/topic:zuul-jobs-with-kubectl ? | 13:05 |
tristanC | that is correct, any roles that perform synchronize are going to fail otherwise | 13:07 |
brendangalloway | specifically synchronise to the worker? | 13:08 |
brendangalloway | not synchronise in general, for example copying the log files across to the elk node | 13:09 |
tristanC | brendangalloway: the roles that use synchronize to fetch artifact files from the worker to the executor need to be adapted to copy the files to the ~/zuul-output directory and let the base job fetch them | 13:09 |
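A hypothetical sketch of that adaptation: instead of synchronizing artifacts back to the executor, the job copies them into ~/zuul-output on the worker and lets the base job's fetch-output role collect them; the artifact path below is a placeholder:

```yaml
# Hypothetical tasks: copy artifacts into ~/zuul-output on the worker so
# the base job can fetch them, instead of using synchronize from the pod.
- name: Ensure the zuul-output logs directory exists
  file:
    path: "{{ ansible_user_dir }}/zuul-output/logs"
    state: directory

- name: Copy job logs into zuul-output
  copy:
    src: "{{ ansible_user_dir }}/workspace/job.log"   # placeholder artifact
    dest: "{{ ansible_user_dir }}/zuul-output/logs/"
    remote_src: true
```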
tristanC | brendangalloway: about the config-update job failure, i guess you were able to fix it manually in /usr/share/sf-config/ansible/roles/sf-nodepool/tasks/update.yml ? | 13:10 |
brendangalloway | Ok - I think the only places we use that at the moment are in the runc containers | 13:10 |
tristanC | brendangalloway: iiuc that failure, you have a host in the arch file named `main`, and another one named `runc`? | 13:10 |
brendangalloway | I fixed it by running the bin by hand without the --extra-launcher flag and then restarting nodepool | 13:11 |
brendangalloway | yes, the runc containers are on a separate host to the executor | 13:11 |
tristanC | brendangalloway: that may be reverted once the config-update job runs again; until we understand the issue and have a fix, you'd better remove the `--extra-launcher` argument from the `sf-nodepool/tasks/update.yml` file | 13:12 |
brendangalloway | I will do so. Now that you mention that - when running sfconfig --upgrade I had to edit the timeout command in /usr/share/sf-config/ansible/roles/sf-elasticsearch/tasks/postconf.yml +42 | 13:17 |
brendangalloway | what is the correct way to submit bug reports or similar for issues like that? | 13:18 |
tristanC | brendangalloway: on this page: https://tree.taiga.io/project/morucci-software-factory/issues | 13:19 |
brendangalloway | tristanC: Thank you so much for all the help. I think that is everything I needed to know | 13:22 |
tristanC | brendangalloway: you're welcome, thank you for the feedback! | 13:29 |
tristanC | brendangalloway: so we are looking at the `extra-launcher` issue, and it seems like ansible may be using incorrect facts. Could you check the `ansible_hostname` values in `/var/lib/software-factory/ansible/facts` and see if they are consistent? If not, i think dropping the file should be enough to prevent that failure, but we'll add some checks to at least detect when this is happening | 13:30 |
tristanC | turns out we just had a similar issue in another deployment, and the `ansible_hostname` from the fact is not the same as the hostname of the host, resulting in that incorrect `extra-launcher` argument being set | 13:31 |
brendangalloway | would hostname vs FQDN be a big enough difference? | 13:32 |
brendangalloway | the facts file has the hostname as runc, the arch file has it as runc.domain | 13:33 |
tristanC | we would like the ansible_hostname to match what is in the arch file, that is just the name, without the domain | 13:33 |
brendangalloway | so I would need to remove the domains in the arch file? | 13:34 |
tristanC | brendangalloway: i guess you are not running the nodepool-launcher on the runc host, thus i suspect the "main" host has an `ansible_hostname` fact that refers to runc | 13:34 |
brendangalloway | yes nodepool launcher is on main.ci, runc containers are running on runc.ci | 13:35 |
tristanC | brendangalloway: changing the runc host name from the arch file shouldn't be required | 13:35 |
brendangalloway | So I must change it in the facts file? | 13:36 |
tristanC | brendangalloway: in our case, we found that `grep ansible_hostname /var/lib/software-factory/ansible/facts/main-host.org.name` shows `nodepool-builder`, instead of `main-host` | 13:36 |
tristanC | iirc, removing the fact file should ensure that ansible_hostname is correct, but we are looking for the why, and how to prevent that :) | 13:38 |
brendangalloway | !! | 13:41 |
openstack | brendangalloway: Error: "!" is not a valid command. | 13:41 |
brendangalloway | [root@main.domain facts]# grep ansible_hostname * → builder.domain: "ansible_hostname": "builder", elk.domain: "ansible_hostname": "elk", main.domain: "ansible_hostname": "runc", merger.domain: "ansible_hostname": "merger", runc.domain: "ansible_hostname": "runc" | 13:42 |
brendangalloway | so the hostname being runc in the main file is the issue? | 13:42 |
tristanC | brendangalloway: yes | 13:42 |
tristanC | that confused the nodepool.yaml generation logic, resulting in an empty provider list | 13:43 |
brendangalloway | any idea how that file ended up being wrong? | 13:43 |
brendangalloway | I assume I should change that and revert the change to the upgrade task? | 13:43 |
tristanC | that's the issue we are trying to understand | 13:44 |
tristanC | once this is fixed, the upgrade task bandaid can be reverted | 13:44 |
brendangalloway | Is there other information I can provide that would help you understand the issue? | 13:45 |
tristanC | that's ok thank you, we are debugging an affected setup | 13:45 |
brendangalloway | ok, let me know if I can assist | 13:46 |
sfbender | Daniel Pawlik created software-factory/managesf master: DNM - Added external-project parameter to compute repo by hound https://softwarefactory-project.io/r/18204 | 14:56 |
sfbender | Tristan de Cacqueray created software-factory/sf-config master: sfconfig: add an update facts task https://softwarefactory-project.io/r/18205 | 14:56 |
tristanC | brendangalloway: https://softwarefactory-project.io/r/18205 seems to be a solution for the invalid fact ansible_hostname | 14:57 |
*** dpawlik has quit IRC | 15:59 | |
*** jpena is now known as jpena|off | 17:04 | |
*** sshnaidm is now known as sshnaidm|afk | 18:07 | |
sfbender | Merged www.softwarefactory-project.io master: Add previous sprints summaries https://softwarefactory-project.io/r/18069 | 18:10 |
*** brendangalloway has quit IRC | 19:20 | |
*** rfolco has quit IRC | 21:27 | |
*** rfolco has joined #softwarefactory | 22:03 |