elodilles | hi infra team o/ in the relmgmt team we have a task for this week saying 'Notify the Infrastructure team to generate an artifact signing key (but not replace the current one yet), and begin the attestation process' | 10:25 |
---|---|---|
elodilles | so please start to work on this ^^^ & let us know (@ #openstack-releases) if we can help with anything | 10:26 |
fungi | elodilles: thanks for the reminder! i'll try to work on that today if i get a few minutes | 12:41 |
elodilles | fungi: cool, thanks! | 12:57 |
damiandabrowski | hey everyone, we (openstack-ansible) started getting DISK_FULL errors on rocky9 jobs yesterday, can I ask for suggestions on what to do about this? | 15:24 |
damiandabrowski | https://zuul.opendev.org/t/openstack/builds?result=DISK_FULL&skip=0 | 15:24 |
fungi | damiandabrowski: disk_full results are zuul's way of saying the build used too much disk space on the executor. usually it means the job retrieved too much log data or other files from the job nodes at the end of the job, though it can also be caused by other activities that use up executor space | 15:29 |
fungi | damiandabrowski: unfortunately zuul terminates the build abruptly to prevent it from consuming more disk on the executor, so there is no diagnostic data that can be uploaded. do you have any idea what might have changed in those jobs in the past 24-48 hours? | 15:32 |
fungi | looks like it's specifically the openstack-ansible-upgrade-aio_metal-rockylinux-9 job | 15:32 |
fungi | damiandabrowski: looks like it's not happening every time either. all the disk_full results are for stable/2023.1 branch builds, but there are some successful stable/2023.1 runs too | 15:34 |
damiandabrowski | fungi: hmm, i have no clue what could have changed something in this job, noonedeadpunk maybe you have some idea? | 15:35 |
fungi | another thing to look at would be the job duration. it appears the disk_full results ran about as long as, or longer than, successful builds, which tells me the extra disk is being consumed at the end of the builds, so likely when logs are being fetched from the job nodes | 15:36 |
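For context, fungi's duration comparison can be reproduced against Zuul's public builds API. The sketch below is not from the discussion; the endpoint path and the `duration`/`result` fields are assumptions based on the standard Zuul REST API, so verify them against the deployment before relying on the numbers.

```python
# Rough sketch: compare average durations of recent SUCCESS vs DISK_FULL builds
# of the job fungi names above, via the (assumed) Zuul builds REST endpoint.
import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
JOB = "openstack-ansible-upgrade-aio_metal-rockylinux-9"

def mean_duration(result):
    """Average duration (seconds) of recent builds with the given result."""
    builds = requests.get(
        API,
        params={"job_name": JOB, "result": result, "limit": 20},
        timeout=30,
    ).json()
    durations = [b["duration"] for b in builds if b.get("duration")]
    return sum(durations) / len(durations) if durations else None

for result in ("SUCCESS", "DISK_FULL"):
    print(result, mean_duration(result))
```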
noonedeadpunk | disk_full is a zuul-executor issue? | 15:36 |
fungi | noonedeadpunk: not a zuul executor issue, but a zuul executor safeguard | 15:37 |
noonedeadpunk | or is it full on the nodepool node? | 15:37 |
fungi | if the build tries to use too much disk on the executor, the executor abruptly ends the build in order to protect available disk for other builds running at the same time | 15:37 |
noonedeadpunk | I'm just trying to understand when this potentially happens - somewhere in the POST step, or is it smth in the job itself that fills up the node's disk | 15:38 |
noonedeadpunk | as in the case of POST - I'm really not sure how to debug that.... | 15:38 |
fungi | noonedeadpunk: as i said, the build duration for success and disk_full builds is about the same, which suggests it's happening near or at the end of the job, probably when pulling logs from the nodes | 15:38 |
noonedeadpunk | yeah, but it's hard to analyze what could take so much disk space when there are no logs and it's the logs themselves causing the issue... | 15:39 |
fungi | a random example off the top of my head would be if some process on a job node, let's say neutron, started generating massive amounts of error messages and the logs grew huge. then when the executor is asked to retrieve them it can discover that they'll require too much space and end the job rather than risking filling up its own disk | 15:40 |
noonedeadpunk | yeah, I totally got how it happens... I'm just not sure how to deal with that | 15:41 |
fungi | i would encourage you not to just give up. for example, have you looked to see how much total space the logs pulled from a successful build of that job on the same branch in the last day consume? how about a failed one that didn't result in disk_full? | 15:41 |
fungi | maybe these jobs are collecting tons of log data and it's often just under the threshold | 15:41 |
fungi | the problem can be approached the same way as job timeouts | 15:42 |
fungi | just because you don't get logs when a build times out doesn't mean it's impossible to figure out what's causing it to sometimes do that | 15:42 |
noonedeadpunk | I'm fetching logs from successful builds right now to check | 15:43 |
noonedeadpunk | though I'm not sure what to consider big enough to cause problems | 15:43 |
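One hedged way to answer "how big is big enough to cause problems" is to total up what a build actually published. The sketch below assumes the job publishes a `zuul-manifest.json` at the root of its log URL and that its file entries carry a `size` field; both are assumptions about the log-collection roles in use, so adapt as needed.

```python
# Minimal sketch: sum the sizes of all files a build's log manifest lists,
# assuming a zuul-manifest.json exists at the root of the build's log_url.
import requests

def collected_log_bytes(log_url):
    """Total the sizes of all files listed in a build's zuul-manifest.json."""
    manifest = requests.get(log_url.rstrip("/") + "/zuul-manifest.json",
                            timeout=30).json()

    def walk(entries):
        total = 0
        for entry in entries:
            if entry.get("children"):          # directory node
                total += walk(entry["children"])
            else:
                total += entry.get("size") or 0
        return total

    return walk(manifest.get("tree", []))

# Illustrative usage only; the URL here is a placeholder, not a real build.
print(collected_log_bytes("https://storage.example.org/logs/some-build/") / 1e9, "GB")
```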
fungi | i'm going to look up what the threshold is in the executor's disk governor, but i do know it's large enough that jobs rarely hit it | 15:43 |
* noonedeadpunk is seeing that error for the first time | 15:44 |
fungi | it'll take me a few minutes because i honestly have no idea if it's configurable and if so whether we configure it or rely on its built-in default threshold | 15:45 |
noonedeadpunk | I guess wget has found at least one nasty log file... /var/log/messages is already like 350Mb and it's still fetching | 15:46 |
fungi | ouch | 15:46 |
fungi | noonedeadpunk: https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-executor.disk_limit_per_job | 15:46 |
noonedeadpunk | I wonder if there're plenty of OOMs or smth... | 15:47 |
noonedeadpunk | I'm pretty sure that limit is non-default, based on what I see from a successful job | 15:47 |
noonedeadpunk | this file alone is 520MB | 15:48 |
fungi | https://opendev.org/opendev/system-config/src/commit/e8a274e/playbooks/roles/zuul/templates/zuul.conf.j2#L47 | 15:48 |
fungi | noonedeadpunk: wget may be uncompressing it locally? | 15:48 |
fungi | though it does appear we set it to ~5gb | 15:49 |
noonedeadpunk | yeah, likely it is... | 15:49 |
fungi | so anyway, something during log/artifact collection is going over 5gb of local storage on the executor for one build | 15:50 |
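For reference, the limit fungi is describing lives in the executor section of zuul.conf (the zuul.conf.j2 template linked above). A minimal sketch follows; the option takes a value in MiB, and 5000 here is an assumed approximation of the "~5gb" mentioned above rather than the exact number in the template.

```ini
# executor disk accounting; value in MiB
# 5000 is an assumed approximation, check the linked template for the real setting
[executor]
disk_limit_per_job = 5000
```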
noonedeadpunk | I'd need to check when it gets compressed | 15:50 |
noonedeadpunk | Before sending to executor or after... | 15:50 |
clarkb | during/after. It's a monitor check, not a fs quota | 15:50 |
noonedeadpunk | As I thought, it's after | 15:50 |
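To illustrate clarkb's point that this is a monitor check rather than a filesystem quota, here is a simplified, hypothetical sketch of periodic disk accounting: usage is sampled and the build is aborted once it crosses the limit, instead of the filesystem refusing writes. Zuul's actual implementation differs in its details.

```python
# Hypothetical illustration of a "monitor check, not a fs quota":
# periodically measure a build directory and abort the build if it gets too big.
import os
import time

def dir_usage_bytes(path):
    """Walk a build directory and total the size of every file in it."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished between listing and stat; skip it
    return total

def monitor(build_dir, limit_bytes, abort_build, interval=60):
    """Poll the build dir; call abort_build() once it grows past the limit."""
    while True:
        if dir_usage_bytes(build_dir) > limit_bytes:
            abort_build()
            return
        time.sleep(interval)
```

The practical consequence, as fungi noted earlier, is that the build is ended abruptly when the check trips, so nothing collected up to that point gets uploaded.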
noonedeadpunk | ok, gotcha, thanks for answers as usual! | 15:51 |
fungi | yw | 15:52 |
fungi | noonedeadpunk: also obviously, trimming the size of data collected from builds will speed up your job completion times, speed up retrieval time for people diagnosing failures, and waste less space in our donors' swift services | 15:54 |
fungi | as well as put less load on our executors | 15:54 |
fungi | so whatever you can do there within reason is beneficial to all of us | 15:54 |