| @mnasiadka:matrix.org | apache on static02 is probably down - connection refused... | 08:19 |
|---|---|---|
| @gthiemonge:matrix.org | I get connection refused on https://releases.openstack.org/ | 08:20 |
| @mnasiadka:matrix.org | releases, docs, tarballs are affected | 08:26 |
| @tobias-urdin:matrix.org | ^same | 08:40 |
| @harbott.osism.tech:regio.chat | `Mar 13 01:45:54 static02 systemd[1]: apache2.service: Failed with result 'oom-kill'.` | 09:33 |
| @harbott.osism.tech:regio.chat | restarted now; maybe the WAF stuff took too much memory? also we should likely look into having the systemd service do automatic restarts? | 09:34 |
| @mnasiadka:matrix.org | In theory there's Restart=on-abort in the systemd unit file, but it won't work with oom-kill - so we would need to switch to Restart=on-abnormal | 09:55 |
| @tobias-urdin:matrix.org | looks like it died again | 13:17 |
| @blasseye:matrix.org | Yes.. | 13:23 |
| @garyx:matrix.org | It's dead, Jim | 13:26 |
| @jim:acmegating.com | the spike happens very quickly; it's not a gradual buildup | 13:28 |
| @jim:acmegating.com | restarted again just to try to increase the uptime, but i don't expect it to last. the url patterns in the log look like the crazy ones from our botnet. | 13:35 |
| @fungicide:matrix.org | we're not maxxing out nf_conntrack_count at least | 13:50 |
| @fungicide:matrix.org | definitely lots of apache worker slots in use though there are still plenty waiting/unassigned for the moment | 13:51 |
| @fungicide:matrix.org | dmesg indicates the oom killer reaped an apache2 process at 01:43:45 and again at 13:15:13 | 13:55 |
| @fungicide:matrix.org | i guess it didn't just take out a worker process, but the parent supervisor both times? | 13:55 |
| @jim:acmegating.com | seems that way. i've been watching top, and everything looks sensible so far. lots of workers using a modest amount each, nothing bigger than that. | 13:59 |
| @clarkb:matrix.org | prior to this round of sadness we did increase the total number of allowed workers/threads for apache2 | 14:48 |
| @clarkb:matrix.org | it's possible that was "safe" before with the request characteristics at the time, but is not safe with the changes we've made or due to changes in request patterns? | 14:48 |
| @clarkb:matrix.org | Looks like the service has been running for about an hour now and memory usage is fine. We are not at the server limit either, but it isn't a low server count. But as corvus points out, it seems to spike, so maybe something very specific triggers it | 14:52 |
| @fungicide:matrix.org | well, we maxxed out workers last week at the same threshold and didn't trigger oom | 14:53 |
| @fungicide:matrix.org | so maybe subsequent adjustments have caused it to use more ram per worker on average now | 14:53 |
| @clarkb:matrix.org | mnasiadka: not sure how your day is going, but I'm now looking at static again. I could probably go either way on whether we sync up now. Let me know if you have a strong preference | 15:02 |
| @clarkb:matrix.org | about 27% of traffic to docs.openstack.org prior to the 01:44-ish OOM was either a 301 or 302 redirect, and about 7% was 403 rejections | 15:13 |
| @clarkb:matrix.org | of course there are other vhosts too. I'm just trying to look at it from a high level and see if anything particularly problematic stands out | 15:13 |
| @clarkb:matrix.org | docs.openstack.org is about 74% of the traffic around this time period | 15:14 |
| @clarkb:matrix.org | s/traffic/requests total/ | 15:14 |
| @fungicide:matrix.org | yeah, i'm not surprised since most of the problem paths we saw were for /developer/.* and that prefix is redirected to /.* | 15:41 |
| @fungicide:matrix.org | i assume you were only looking at https access since http is also automatically redirected | 15:41 |
| @fungicide:matrix.org | and for that matter, most paths like /ironic/something get redirected to /ironic/latest/something | 15:42 |
| @clarkb:matrix.org | I think http and https are combined log files | 15:43 |
| -@gerrit:opendev.org- | Clark Boylan proposed: [opendev/system-config] 980473: Disable modsecurity response body access in two static vhosts https://review.opendev.org/c/opendev/system-config/+/980473 | 15:51 |
| -@gerrit:opendev.org- | Zuul merged on behalf of Clark Boylan: [opendev/system-config] 980473: Disable modsecurity response body access in two static vhosts https://review.opendev.org/c/opendev/system-config/+/980473 | 16:50 |
| @clarkb:matrix.org | deployment reports success and I can see some of the apache worker processes have already rotated out for the reload | 16:55 |
| @mnasiadka:matrix.org | Clark: unfortunately I've got an unplanned family adventure, can we do that on Monday? | 17:04 |
| @clarkb:matrix.org | mnasiadka: yup! enjoy your weekend | 17:05 |
| @clarkb:matrix.org | about 1/3 of the apache workers have still not aged out and been replaced. But memory usage remains at steady sustainable levels | 17:23 |
| -@gerrit:opendev.org- | Jeremy Stanley proposed wip: [zuul/zuul-jobs] 980499: Add and test an ensure-validate-pyproject role https://review.opendev.org/c/zuul/zuul-jobs/+/980499 | 17:24 |
| -@gerrit:opendev.org- | Jeremy Stanley proposed: [openstack/project-config] 980500: Run validate-pyproject during package checks https://review.opendev.org/c/openstack/project-config/+/980500 | 17:27 |
| @fungicide:matrix.org | heading out for a late lunch, back shortly | 17:43 |
| -@gerrit:opendev.org- | Jeremy Stanley marked as active: [zuul/zuul-jobs] 980499: Add and test an ensure-validate-pyproject role https://review.opendev.org/c/zuul/zuul-jobs/+/980499 | 19:26 |
| @clarkb:matrix.org | Looking ahead to next week. After syncing up with mnasiadka on Monday I'm thinking that may be a good time to upgrade ansible on bridge? | 19:56 |
| @fungicide:matrix.org | count me in | 20:00 |
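A few annotated sketches of the configuration and commands discussed above. First, the Restart= change mentioned at 09:34-09:55: a minimal sketch of a systemd drop-in, assuming the distro's stock apache2.service; the drop-in filename and RestartSec value are illustrative.

```ini
# /etc/systemd/system/apache2.service.d/restart.conf (hypothetical drop-in)
[Service]
# Per the discussion above, Restart=on-abort does not trigger on the
# 'oom-kill' result, while on-abnormal does.
Restart=on-abnormal
RestartSec=5
```

A drop-in like this would need a `systemctl daemon-reload` before it takes effect.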
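The checks fungi walks through at 13:50-13:55 correspond roughly to commands like these (a sketch; the last one assumes mod_status is enabled and server-status is reachable locally):

```sh
# conntrack usage: compare the current count against the max
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# OOM-killer activity with human-readable timestamps
dmesg -T | grep -i -E 'out of memory|oom'

# busy vs. waiting/unassigned apache worker slots (requires mod_status)
curl -s http://localhost/server-status?auto | grep -E '^(Busy|Idle)Workers'
```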
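The worker/thread limits discussed at 14:48-14:53 are the event MPM's knobs; this shows only their shape, not the actual values on static02:

```apache
<IfModule mpm_event_module>
    ServerLimit          8
    ThreadsPerChild      64
    # Must not exceed ServerLimit * ThreadsPerChild. Raising this raises
    # the potential memory footprint, which is what matters for OOM here.
    MaxRequestWorkers    512
</IfModule>
```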
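The status-code breakdown at 15:13 can be reproduced with a single awk pass over the vhost's access log; the log path is hypothetical, and field 9 assumes the common/combined log format:

```sh
awk '{count[$9]++} END {for (s in count) print s, count[s]}' \
    /var/log/apache2/docs.openstack.org_access.log | sort -k2 -rn
```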
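The redirects fungi lists at 15:41-15:42 have roughly this shape in Apache terms; this is illustrative only, the real rules live in the vhost templates in system-config:

```apache
# http -> https, so plain-http hits show up as redirects in the logs
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^/(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]

# strip the legacy /developer/ prefix
RedirectMatch 301 ^/developer/(.*)$ /$1
```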
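Finally, "disable modsecurity response body access" (change 980473, merged at 16:50) typically amounts to one ModSecurity directive per affected vhost; with it off, ModSecurity no longer buffers response bodies in memory for inspection, which cuts per-request memory use:

```apache
SecResponseBodyAccess Off
```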