*** yamamoto has joined #openstack-infra | 00:26 | |
*** rlandy has quit IRC | 00:30 | |
*** tjgresha has joined #openstack-infra | 00:49 | |
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being quickly restarted to apply a new security patch | 00:56 | |
*** gyee has quit IRC | 01:09 | |
*** tjgresha has quit IRC | 01:30 | |
*** __ministry1 has joined #openstack-infra | 01:38 | |
*** rcernin has quit IRC | 01:44 | |
*** dviroel has quit IRC | 02:04 | |
*** zzzeek has quit IRC | 02:07 | |
*** rcernin has joined #openstack-infra | 02:09 | |
*** zzzeek has joined #openstack-infra | 02:10 | |
*** tjgresha has joined #openstack-infra | 02:13 | |
*** zzzeek has quit IRC | 02:24 | |
*** zzzeek has joined #openstack-infra | 02:25 | |
*** tjgresha has quit IRC | 02:30 | |
*** hamalq has quit IRC | 02:36 | |
*** yonglihe has joined #openstack-infra | 02:47 | |
*** zzzeek has quit IRC | 02:48 | |
*** zzzeek has joined #openstack-infra | 02:52 | |
*** yamamoto has quit IRC | 03:26 | |
*** yamamoto_ has joined #openstack-infra | 03:26 | |
*** irclogbot_0 has quit IRC | 03:27 | |
*** dchen has quit IRC | 03:34 | |
*** dchen has joined #openstack-infra | 03:34 | |
*** ykarel has joined #openstack-infra | 04:11 | |
*** ricolin_ has joined #openstack-infra | 04:13 | |
*** ykarel_ has joined #openstack-infra | 04:15 | |
*** ykarel has quit IRC | 04:17 | |
*** jfan has quit IRC | 04:26 | |
*** yamamoto has joined #openstack-infra | 04:26 | |
*** yamamoto_ has quit IRC | 04:29 | |
*** ramishra has quit IRC | 04:40 | |
*** irclogbot_0 has joined #openstack-infra | 04:41 | |
*** irclogbot_0 has quit IRC | 04:54 | |
*** irclogbot_2 has joined #openstack-infra | 04:58 | |
*** ramishra has joined #openstack-infra | 05:04 | |
*** vishalmanchanda has joined #openstack-infra | 05:07 | |
*** ociuhandu has joined #openstack-infra | 05:07 | |
*** ociuhandu has quit IRC | 05:12 | |
*** ykarel_ is now known as ykarel | 05:16 | |
*** ricolin_ has quit IRC | 05:35 | |
*** ricolin has joined #openstack-infra | 05:39 | |
*** priteau has quit IRC | 05:47 | |
*** ykarel_ has joined #openstack-infra | 05:51 | |
*** ykarel has quit IRC | 05:53 | |
*** ykarel_ is now known as ykarel | 06:22 | |
*** lbragstad_ has joined #openstack-infra | 06:24 | |
*** lbragstad has quit IRC | 06:24 | |
*** ysandeep|away is now known as ysandeep | 06:43 | |
*** ralonsoh has joined #openstack-infra | 06:49 | |
*** slaweq has joined #openstack-infra | 06:55 | |
*** jcapitao has joined #openstack-infra | 07:00 | |
*** sboyron has joined #openstack-infra | 07:02 | |
*** amoralej|off is now known as amoralej | 07:05 | |
*** sboyron_ has joined #openstack-infra | 07:19 | |
*** sboyron has quit IRC | 07:22 | |
*** eolivare has joined #openstack-infra | 07:32 | |
*** rcernin has quit IRC | 07:37 | |
*** slaweq has quit IRC | 07:40 | |
*** slaweq has joined #openstack-infra | 07:42 | |
*** xek has joined #openstack-infra | 07:48 | |
*** ralonsoh has quit IRC | 07:54 | |
*** dklyle has quit IRC | 07:59 | |
*** ralonsoh has joined #openstack-infra | 08:01 | |
*** ralonsoh has quit IRC | 08:03 | |
*** ralonsoh has joined #openstack-infra | 08:05 | |
*** dchen has quit IRC | 08:11 | |
*** hashar has joined #openstack-infra | 08:13 | |
*** rcernin has joined #openstack-infra | 08:14 | |
*** rpittau|afk is now known as rpittau | 08:25 | |
*** andrewbonney has joined #openstack-infra | 08:27 | |
*** rcernin has quit IRC | 08:31 | |
*** dtantsur|afk is now known as dtantsur | 08:36 | |
*** gfidente has joined #openstack-infra | 08:44 | |
*** kopecmartin has quit IRC | 08:48 | |
*** kopecmartin has joined #openstack-infra | 08:50 | |
*** lxkong has quit IRC | 08:52 | |
*** lxkong has joined #openstack-infra | 08:53 | |
*** lxkong has quit IRC | 08:53 | |
*** lxkong has joined #openstack-infra | 08:54 | |
*** jpena|off is now known as jpena | 08:56 | |
*** priteau has joined #openstack-infra | 08:57 | |
*** rcernin has joined #openstack-infra | 09:00 | |
*** tosky has joined #openstack-infra | 09:02 | |
*** lucasagomes has joined #openstack-infra | 09:06 | |
*** hberaud has joined #openstack-infra | 09:13 | |
*** rcernin has quit IRC | 09:18 | |
*** rcernin has joined #openstack-infra | 09:23 | |
*** ociuhandu has joined #openstack-infra | 09:29 | |
*** d34dh0r53 has quit IRC | 09:39 | |
*** derekh has joined #openstack-infra | 09:39 | |
*** d34dh0r53 has joined #openstack-infra | 09:39 | |
*** ociuhandu has quit IRC | 09:44 | |
*** ociuhandu has joined #openstack-infra | 09:44 | |
*** d34dh0r53 has quit IRC | 09:48 | |
*** d34dh0r53 has joined #openstack-infra | 09:49 | |
*** rcernin has quit IRC | 10:08 | |
*** rcernin has joined #openstack-infra | 10:19 | |
*** tosky has quit IRC | 10:33 | |
*** tosky has joined #openstack-infra | 10:34 | |
*** rcernin has quit IRC | 11:13 | |
*** zbr1 has joined #openstack-infra | 11:14 | |
*** dviroel has joined #openstack-infra | 11:15 | |
*** zbr has quit IRC | 11:16 | |
*** zbr1 is now known as zbr | 11:16 | |
*** rcernin has joined #openstack-infra | 11:35 | |
*** gfidente has quit IRC | 11:35 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 11:38 | |
*** ysandeep is now known as ysandeep|afk | 11:44 | |
*** gfidente has joined #openstack-infra | 11:47 | |
*** lpetrut has joined #openstack-infra | 11:51 | |
*** rcernin has quit IRC | 12:04 | |
*** jcapitao is now known as jcapitao_lunch | 12:08 | |
*** iurygregory_ has joined #openstack-infra | 12:09 | |
*** iurygregory has quit IRC | 12:09 | |
*** piotrowskim has joined #openstack-infra | 12:12 | |
*** ysandeep|afk is now known as ysandeep | 12:14 | |
*** yamamoto has quit IRC | 12:15 | |
noonedeadpunk | fungi: hi! returning to the citycloud question. there's some mess in the ticket I created. Is the floating IP https://opendev.org/opendev/system-config/src/branch/master/inventory/base/hosts.yaml#L532-L538 still assigned to the mirror inside your project? | 12:19 |
noonedeadpunk | which should be https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_clouds.yaml.j2#L155-L164 right? | 12:20 |
noonedeadpunk | can you also get network id or vm id so folks could double check that we're looking at the right thing... | 12:21 |
*** eolivare_ has joined #openstack-infra | 12:24 | |
*** eolivare has quit IRC | 12:26 | |
*** rlandy has joined #openstack-infra | 12:26 | |
frickler | noonedeadpunk: the old floating IP currently isn't being used by us anymore. there are new ones listed here along with the IDs involved: http://paste.openstack.org/show/xkQX8wH09PsR4JzP7fpr/ | 12:29 |
*** hashar is now known as hasharLunch | 12:29 | |
noonedeadpunk | aha, gotcha | 12:30 |
*** yamamoto has joined #openstack-infra | 12:31 | |
*** jpena is now known as jpena|lunch | 12:36 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 12:36 | |
*** ysandeep is now known as ysandeep|mtg | 12:37 | |
*** yamamoto has quit IRC | 12:39 | |
*** tbachman has quit IRC | 12:44 | |
*** eolivare_ has quit IRC | 12:46 | |
*** tbachman has joined #openstack-infra | 12:48 | |
*** yamamoto has joined #openstack-infra | 12:49 | |
*** yamamoto has quit IRC | 12:50 | |
frickler | noonedeadpunk: if I look at the router, I see a completely different address there, not sure if that is as designed or may be part of the issue: {"subnet_id": "0cff86a9-a33a-4550-b2ee-f2c909dee4d2", "ip_address": "77.81.6.17"} | 12:58 |
*** amoralej is now known as amoralej|lunch | 12:59 | |
*** iurygregory_ is now known as iurygregory | 13:04 | |
*** yamamoto has joined #openstack-infra | 13:07 | |
*** yamamoto has quit IRC | 13:07 | |
*** yamamoto has joined #openstack-infra | 13:08 | |
*** Tengu has quit IRC | 13:09 | |
*** Tengu has joined #openstack-infra | 13:10 | |
*** Tengu has quit IRC | 13:10 | |
*** Tengu has joined #openstack-infra | 13:10 | |
*** Tengu has quit IRC | 13:10 | |
*** Tengu has joined #openstack-infra | 13:11 | |
*** Tengu has quit IRC | 13:11 | |
*** yamamoto has quit IRC | 13:12 | |
*** jcapitao_lunch is now known as jcapitao | 13:14 | |
*** Tengu has joined #openstack-infra | 13:18 | |
*** hasharLunch is now known as hashar | 13:18 | |
*** eolivare_ has joined #openstack-infra | 13:18 | |
noonedeadpunk | frickler: yeah folks have moved the router. can you check if this has solved the issue? | 13:23 |
*** jpena|lunch is now known as jpena | 13:25 | |
noonedeadpunk | at least they're reachable for me now | 13:26 |
frickler | noonedeadpunk: I can ping both, too, and log into mirror1, so this seems fine again, thanks for your help | 13:50 |
frickler | infra-root: I don't have time today to do the followup work of changing the address everywhere, maybe one of you can do that? also not sure what the idea with the second mirror based on focal was? seems it doesn't have all users deployed, likely due to lack of connectivity? | 13:52 |
*** tbachman has quit IRC | 13:52 | |
*** ociuhandu has quit IRC | 13:53 | |
*** tbachman has joined #openstack-infra | 13:55 | |
dansmith | clarkb: are you able to generate me an updated paste of the percentage of gate resources used by each of the projects? now that neutron has dropped the tripleo jobs I'm curious what the new numbers are | 13:58 |
*** amoralej|lunch is now known as amoralej | 14:02 | |
*** ociuhandu has joined #openstack-infra | 14:10 | |
*** sreejithp has joined #openstack-infra | 14:14 | |
openstackgerrit | Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689 | 14:21 |
openstackgerrit | Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689 | 14:24 |
*** ociuhandu has quit IRC | 14:30 | |
*** ociuhandu has joined #openstack-infra | 14:31 | |
*** ociuhandu has quit IRC | 14:36 | |
*** __ministry1 has quit IRC | 14:38 | |
*** ociuhandu has joined #openstack-infra | 14:48 | |
*** ociuhandu has quit IRC | 14:58 | |
*** dwalt has joined #openstack-infra | 15:01 | |
*** slaweq has quit IRC | 15:14 | |
*** slaweq has joined #openstack-infra | 15:14 | |
*** aarents has quit IRC | 15:16 | |
*** ociuhandu has joined #openstack-infra | 15:18 | |
*** ociuhandu has quit IRC | 15:22 | |
*** ociuhandu has joined #openstack-infra | 15:23 | |
*** ociuhandu has quit IRC | 15:23 | |
*** ociuhandu has joined #openstack-infra | 15:25 | |
fungi | noonedeadpunk: frickler: yes, after my reboots didn't work, ianw tried replacing the floating ip with a new one (so new address, not updated in dns yet), and when that didn't work he tried to boot a new instance there which also didn't work, but if it had he was considering using it as an opportunity to get the mirror server upgraded | 15:26 |
fungi | i can work on correcting dns and booting nodes there again for now, to get things back in operation, since i think that's currently the only region supplying some specific node types | 15:27 |
*** hashar is now known as hasharAway | 15:27 | |
*** lpetrut has quit IRC | 15:27 | |
*** ociuhandu has quit IRC | 15:31 | |
*** ociuhandu has joined #openstack-infra | 15:31 | |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Switch to using v3-standard-8 flavors https://review.opendev.org/c/openstack/project-config/+/773710 | 15:33 |
*** ysandeep|mtg is now known as ysandeep | 15:41 | |
*** ociuhandu has quit IRC | 15:41 | |
*** gfidente has quit IRC | 15:47 | |
*** gfidente has joined #openstack-infra | 15:49 | |
*** ysandeep is now known as ysandeep|away | 15:54 | |
openstackgerrit | Merged openstack/project-config master: Switch to using v3-standard-8 flavors https://review.opendev.org/c/openstack/project-config/+/773710 | 15:56 |
clarkb | dansmith: yes I can regenerate that after meetings today | 15:58 |
dansmith | clarkb: thanks | 15:58 |
*** dklyle has joined #openstack-infra | 15:58 | |
*** dklyle has quit IRC | 15:59 | |
*** david-lyle has joined #openstack-infra | 15:59 | |
*** amoralej is now known as amoralej|off | 16:00 | |
*** jamesmcarthur has joined #openstack-infra | 16:02 | |
*** jamesmcarthur has quit IRC | 16:03 | |
*** jamesmcarthur has joined #openstack-infra | 16:03 | |
*** hasharAway is now known as hashar | 16:04 | |
*** yamamoto has joined #openstack-infra | 16:17 | |
*** ykarel has quit IRC | 16:20 | |
*** lbragstad_ is now known as lbragstad | 16:21 | |
*** david-lyle is now known as dklyle | 16:24 | |
*** yamamoto has quit IRC | 16:26 | |
dansmith | mnaser: do you know if the gerrit instance is running in one of the same flavors that is io-restricted? | 16:28 |
mnaser | dansmith: gerrit runs at rax afaik | 16:28 |
dansmith | okay | 16:29 |
*** ociuhandu has joined #openstack-infra | 16:34 | |
*** ociuhandu has quit IRC | 16:35 | |
*** ociuhandu has joined #openstack-infra | 16:35 | |
*** jcapitao has quit IRC | 16:52 | |
*** jamesmcarthur has quit IRC | 16:52 | |
*** jamesmcarthur has joined #openstack-infra | 16:54 | |
*** jamesmcarthur has quit IRC | 16:57 | |
fungi | and is on a 64gb ram 16vcpu instance with the data on a cinder-attached ssd volume | 16:59 |
*** jamesmcarthur has joined #openstack-infra | 17:00 | |
*** lucasagomes has quit IRC | 17:04 | |
*** ociuhandu has quit IRC | 17:06 | |
*** ociuhandu has joined #openstack-infra | 17:07 | |
*** ociuhandu has quit IRC | 17:13 | |
*** zbr1 has joined #openstack-infra | 17:16 | |
clarkb | dansmith: http://paste.openstack.org/show/MIfg7ByqwceE1rFgu8gw/ | 17:17 |
dansmith | clarkb: wow, no real change there | 17:17 |
*** zbr has quit IRC | 17:18 | |
*** zbr1 is now known as zbr | 17:18 | |
*** ociuhandu has joined #openstack-infra | 17:19 | |
clarkb | dansmith: each report covers a month of logs so we may not see shifts until we roll enough logs over. If we really need it I can modify the script to look at say only the last 7 days instead | 17:24 |
*** ociuhandu has quit IRC | 17:25 | |
*** ociuhandu has joined #openstack-infra | 17:36 | |
dansmith | clarkb: okay, I was on a call when I looked, I see yeah it goes back to like beginning of jan, so fair enough | 17:38 |
*** rlandy is now known as rlandy|biab | 17:40 | |
*** ociuhandu has quit IRC | 17:40 | |
*** d34dh0r53 has quit IRC | 17:42 | |
*** ralonsoh has quit IRC | 17:45 | |
*** jamesmcarthur has quit IRC | 17:45 | |
*** d34dh0r53 has joined #openstack-infra | 17:50 | |
clarkb | dansmith: ~last week http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/ | 17:50 |
dansmith | clarkb: ah thanks, sorry you had to do that, | 17:51 |
dansmith | but that's what I expected to see.. the tripleo number swell to fill the void left by neutron | 17:51 |
clarkb | like a seesaw | 17:52 |
*** gfidente is now known as gfidente|afk | 17:52 | |
dansmith | frustrating | 17:52 |
*** ociuhandu has joined #openstack-infra | 17:52 | |
*** jpena is now known as jpena|off | 17:55 | |
fungi | it's not unexpected. basically the long queue times create backpressure reducing the amount of activity below what it would be if unbounded, so shrinking one source allows the others to naturally expand into the void it leaves | 17:55 |
fungi | when people see things are merging more readily, they're likely to approve more changes than they would if things were backed up more and taking forever | 17:57 |
fungi | also when developers are getting feedback from builds more quickly, they can iterate faster and push new revisions with greater frequency | 17:58 |
*** jamesmcarthur has joined #openstack-infra | 17:59 | |
*** ociuhandu has quit IRC | 17:59 | |
*** rcernin has joined #openstack-infra | 17:59 | |
dansmith | fungi: all the numbers have to add up to 100%, so obviously things have to swell | 18:00 |
dansmith | without normalizing this against number of commits it's kinda guessing anyway | 18:00 |
clarkb | ya the original purpose of the script was to determine if the fear that all these new projects were killing the zuul queues was valid | 18:02 |
*** ociuhandu has joined #openstack-infra | 18:02 | |
clarkb | it needs work if we want to improve it to do more detailed analysis | 18:02 |
dansmith | yeah | 18:03 |
fungi | sort of, if you look at a full week the system is not running at 100% for the duration. it does catch up here and there, mostly on weekends, but the amount of time it spends caught up is where the total work volume becomes apparent | 18:03 |
dansmith | I think we pretty much need to set some goals for individual job runtime, and maximum number of heavy jobs per commit | 18:03 |
dansmith | and try to get projects to move things to periodic or experimental outside of that | 18:03 |
*** derekh has quit IRC | 18:04 | |
*** rcernin has quit IRC | 18:04 | |
fungi | but beyond the work volume the system manages to handle in a given week, i'm suggesting there are psychological and logistical effects at play too, that reductions in work volume elsewhere will also be pitted against | 18:04 |
dansmith | the tripleo guys asked today for what is "reasonable usage" so we should probably just try to define that | 18:05 |
clarkb | dansmith: In the past I've suggested that using OSA and kolla as comparisons as similar deployment projects may be a valid way of defining things | 18:05 |
dansmith | yeah, | 18:05 |
*** eolivare_ has quit IRC | 18:05 | |
dansmith | I was going to say something like figure out how long a tempest run takes, plus the devstack setup time plus some slack, and use that | 18:06 |
dansmith | maybe kolla is a better metric | 18:06 |
fungi | average node-hours per change merged might be a good metric, because efficient developing and reviewing practices can lead to fewer revisions, more resilient/robust jobs mean fewer rechecks, and so on | 18:06 |
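fungi's proposed metric is simple arithmetic; a minimal sketch of how it might be computed (hypothetical function and numbers, not real gate data):

```python
def node_hours_per_merged_change(node_hours_used: float, changes_merged: int) -> float:
    """Average node-hours consumed per change that merged.

    Lower is better: fewer revisions, fewer rechecks, and more
    resilient jobs all push this number down.
    """
    if changes_merged <= 0:
        raise ValueError("no merged changes in the sample window")
    return node_hours_used / changes_merged

# Hypothetical example: 1200 node-hours spent while 150 changes merged
print(node_hours_per_merged_change(1200, 150))  # -> 8.0
```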
*** ociuhandu has quit IRC | 18:06 | |
zbr | fungi: do you know that https://opendev.org/opendev/system-config/src/branch/master/playbooks/apply-package-updates.yaml#L1 is a syntax error? | 18:06 |
dansmith | fungi: sure, just like if tripleo is 50% of all the changes in openstack and using 40% of the gate, then that's not as bad as we think it is | 18:07 |
fungi | if we're going to have a measuring stick, i'd like it to be one which encourages good practices and efficient use of system, because that's where perverse incentives will lead people | 18:07 |
zbr | using a variable in hosts is not quite something that makes ansible happy | 18:07 |
fungi | zbr: hah, that's amusing | 18:07 |
clarkb | I'm guessing that was intended to be used with -e target=foo ? | 18:08 |
zbr | it does pass if you define the variable but | 18:08 |
zbr | ansible-playbook --syntax-check playbooks/apply-package-updates.yaml | 18:08 |
zbr | is still a failure | 18:08 |
clarkb | I mean if its only ever used in that manner then is it a problem? | 18:08 |
zbr | the funny bit is that it's not used like this, it's used with --limit, which is the better way of doing it | 18:08 |
clarkb | zbr: it is used exactly the way I describe it | 18:09 |
clarkb | in launch node | 18:09 |
clarkb | with -e target=foo | 18:09 |
zbr | found it while testing the new linter, which does syntax checking using ansible. | 18:09 |
zbr | workaround: - hosts: "{{ target | default([]) }}" | 18:10 |
clarkb | zbr: will it fail if target is defined with -e target? | 18:10 |
clarkb | if not then it isn't a syntax error as used | 18:10 |
zbr | clarkb: tell this to syntax check. imho a playbook that crashes when not given some magic inputs is still a syntax error. | 18:11 |
zbr | it is easy to add a localhost: fail if not defined.... to avoid it. | 18:11 |
clarkb | well the whole point is it should only run against the remote, we don't want it to run against localhost | 18:12 |
clarkb | and no that wouldn't be a syntax error, it would be a runtime error due to invalid inputs | 18:12 |
zbr | clarkb: the test for undefined would run on localhost, not the task. | 18:12 |
zbr | i can show you. give me few minutes. | 18:12 |
fungi | it sounds like you're basically saying the syntax checker is broken | 18:13 |
clarkb | zbr: well I'm asking if there is actually anything to fix here | 18:13 |
clarkb | can we save 15 minutes and accept that it is correct as used and move on? | 18:13 |
fungi | because it lacks context for how the playbook is invoked | 18:13 |
zbr | imho it's not broken, it detects code that is not well written. it's like writing python code and assuming a variable is defined, without checking for it. | 18:14 |
zbr | yes, as used it works, but that does not make the code good. | 18:14 |
clarkb | zbr: functions can have required arguments | 18:14 |
clarkb | if you want to think of the playbook in that way target is a required parameter | 18:15 |
*** ociuhandu has joined #openstack-infra | 18:15 | |
clarkb | not providing it is an error just as calling a python function without the required arguments | 18:15 |
*** ociuhandu has quit IRC | 18:15 | |
*** ociuhandu has joined #openstack-infra | 18:15 | |
zbr | see https://review.opendev.org/c/opendev/system-config/+/773782/1/playbooks/apply-package-updates.yaml | 18:17 |
clarkb | zbr: but why is that better? | 18:17 |
clarkb | the end result is the same in both cases, an error because target is not defined | 18:17 |
clarkb | but one requires you to write a ton of error checking code | 18:18 |
zbr | because it avoids a crash | 18:18 |
zbr | try to compare them with python code, usually any piece of encapsulated code should check its inputs, it's just good practice. | 18:18 |
clarkb | most python functions assume that their required arguments are provided | 18:19 |
clarkb | and it is up to the caller to get it right, just as is the case here | 18:19 |
zbr | in that particular case the benefit is minor, but think about other playbooks that may use lots of vars that are needed or not. | 18:19 |
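The two patterns being debated above can be sketched side by side (a hypothetical minimal playbook illustrating only the `hosts:` handling, not the actual contents of apply-package-updates.yaml):

```yaml
# As used in system-config: target is a required input, supplied at
# invocation time, e.g.: ansible-playbook playbook.yaml -e target=foo
- hosts: "{{ target }}"
  tasks: []

# zbr's guarded variant: falls back to an empty host list so that
# `ansible-playbook --syntax-check playbook.yaml` passes without -e
- hosts: "{{ target | default([]) }}"
  tasks: []
```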
*** jamesmcarthur has quit IRC | 18:24 | |
*** ociuhandu has quit IRC | 18:27 | |
*** ociuhandu has joined #openstack-infra | 18:30 | |
*** ociuhandu has quit IRC | 18:30 | |
clarkb | zbr: so the linter is running ansible-lint's syntax checker and the syntax check expects hosts: to always be defined even though it is valid to have a variable there? | 18:31 |
clarkb | (I'm looking at the change and trying to understand the concern within that context) | 18:31 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 18:34 | |
*** dtantsur is now known as dtantsur|afk | 18:35 | |
*** rpittau is now known as rpittau|afk | 18:39 | |
*** hashar is now known as hasharDinner | 18:41 | |
*** tdasilva has joined #openstack-infra | 18:44 | |
zbr | ansible syntax check expects to be able to resolve hosts, if it fails it will give a syntax error. | 18:47 |
clarkb | zbr: does that mean it needs an inventory file too? | 18:47 |
zbr | nope | 18:47 |
zbr | putting hosts: dskfndlgnlf is perfectly valid, but using jinja2 can produce an error. depends on how you write it. | 18:49 |
fungi | definitely sounds broken then | 18:50 |
fungi | if it doesn't care that the hosts value has any meaning, then it should just ignore if it contains variable expansion | 18:51 |
zbr | the same kind of broken as writing a python function that receives an argument and not checking that its type is ok | 18:51 |
zbr | imho, it is quite good that they did it like this. | 18:51 |
fungi | or perhaps ansible-lint needs to be fed whatever variables ansible itself would be supplied on invocation | 18:51 |
*** jamesmcarthur has joined #openstack-infra | 18:52 | |
fungi | well, this is more like complaining that a python function requires an argument, without knowing whether the caller will supply that argument | 18:52 |
zbr | you can take the linter out of this debate, now it's between you and ansible-playbook --syntax-check, something that's already used on many repos. | 18:53 |
fungi | got it, either way, the idea is that you should avoid valid constructs if they're hard to test/evaluate out of context | 18:54 |
fungi | there are reasonable points on both sides | 18:54 |
fungi | is a checker which lacks context a suitable tool to use in every situation? is it worth the effort to alter a correctly working implementation to make it easier to check for correctness? | 18:56 |
*** diablo_rojo_phon has joined #openstack-infra | 18:56 | |
fungi | where "correctness" may also be someone's opinion | 18:56 |
zbr | fungi: take a look at https://docs.ansible.com/ansible/latest/dev_guide/developing_collections.html -- and see playbooks/tasks/ -- i am a bit surprised to see that I need to explain why mixing tasks and playbooks inside a folder is a bad idea. | 19:00 |
zbr | and this has nothing to do with collections, it's about how the code is laid out. | 19:01 |
fungi | zbr: it may not be a good idea, but it also may not be worth the time it takes to debate, review and improve if it's already working | 19:02 |
fungi | it might be worth avoiding doing the same thing in the future, sure | 19:02 |
*** sboyron_ has quit IRC | 19:03 | |
zbr | fungi: tbh: system-config is in very good shape by ansible standards, i would refrain from naming other more messy cases i have to deal with ;) | 19:03 |
*** jamesmcarthur has quit IRC | 19:03 | |
zbr | and that mixing of tasks/vars/playbooks is quite a common mistake, but now the linter complains about it. | 19:04 |
zbr | in fact we can blame ansible a little bit for that with the generic "include" that was deprecated as being so confusing. | 19:05 |
zbr | i've seen people wondering why they cannot include a playbook from inside a tasks file, again and again. | 19:05 |
zbr | FYI, the filetype detection uses patterns from https://github.com/ansible-community/ansible-lint/blob/master/src/ansiblelint/config.py#L5-L20 -- the list is not hardcoded and subject to change based on feedback. | 19:07 |
*** jamesmcarthur has joined #openstack-infra | 19:15 | |
*** andrewbonney has quit IRC | 19:15 | |
*** rlandy|biab is now known as rlandy | 19:35 | |
*** hasharDinner is now known as hashar | 19:39 | |
openstackgerrit | Merged openstack/project-config master: Revert "Temporarily stop booting nodes in citycloud-kna1" https://review.opendev.org/c/openstack/project-config/+/773240 | 19:40 |
*** tdasilva_ has joined #openstack-infra | 19:44 | |
*** tdasilva has quit IRC | 19:46 | |
dansmith | clarkb: fungi: do you know why this job was paused for 3ish hours? https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160 | 19:55 |
clarkb | dansmith: yes, that is the tripleo job that builds all their container images, then it sits around serving them for the child jobs | 19:56 |
clarkb | I think the breakdown is something like an hour of building and 2 hours of serving | 19:56 |
fungi | dansmith: looks like it's running a server which other builds are pulling content from | 19:56 |
fungi | so it has to pause until those builds complete | 19:56 |
clarkb | it could potentially stop sooner, though I'm not sure if zuul makes that easy (wait for all child jobs to say "we have the data you are serving you can go away now") | 19:57 |
clarkb | ya it starts at ~11:30 then pauses at ~12:28 after building the images | 19:59 |
clarkb | then the remaining ~ 2 hours is spent serving those images to the downstream consuming jobs | 19:59 |
*** rcernin has joined #openstack-infra | 20:00 | |
*** tdasilva_ has quit IRC | 20:03 | |
*** tdasilva_ has joined #openstack-infra | 20:03 | |
*** rcernin has quit IRC | 20:04 | |
*** jamesmcarthur has quit IRC | 20:05 | |
*** jamesmcarthur has joined #openstack-infra | 20:06 | |
*** jamesmcarthur has quit IRC | 20:07 | |
*** jamesmcarthur has joined #openstack-infra | 20:12 | |
*** yamamoto has joined #openstack-infra | 20:23 | |
dansmith | fungi: clarkb: sorry for the delayed response.. so there are other jobs running that use that worker or something, so there's just no output during that time? | 20:24 |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Move bindep to opendev tenant https://review.opendev.org/c/openstack/project-config/+/773793 | 20:24 |
dansmith | we were wondering if that was zuul pausing a job for a reschedule or something like that | 20:24 |
*** rcernin has joined #openstack-infra | 20:25 | |
fungi | dansmith: correct, the "job" actually starts a server which serves content to other builds running as part of the same buildset | 20:27 |
*** yamamoto has quit IRC | 20:27 | |
dansmith | fungi: okay buildset is zuulv3 lingo that does not mean "multiple nodes" right? | 20:27 |
fungi | when a ref (e.g. a change) is enqueued into a pipeline, builds for each of the selected jobs are started. that collection of builds is a buildset | 20:28 |
fungi | they get reported together once all builds within the buildset complete | 20:29 |
dansmith | oh, so one job can serve stuff to another job? | 20:29 |
dansmith | like, the jobs depend on each other? | 20:29 |
fungi | so a buildset might be the set of linters, unit tests and functional test jobs which ran | 20:29 |
fungi | they can depend on each other, yes | 20:29 |
fungi | and can even interact | 20:30 |
dansmith | okay, I've never known such a thing, other than multinode jobs | 20:30 |
fungi | we started doing it initially with our container image testing workflow, where one job sets up a registry server and then other jobs depending on it build and push images into that registry and then yet still other jobs can pull those images and exercise or publish them to a durable location | 20:31 |
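The registry workflow fungi describes relies on declaring job dependencies within one buildset; a hedged sketch of what the project-pipeline config might look like (hypothetical job names, but `dependencies` is the standard Zuul v3 mechanism):

```yaml
- project:
    check:
      jobs:
        - run-registry
        - build-images:
            dependencies:
              - run-registry
        - test-images:
            dependencies:
              - build-images
```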
dansmith | certainly that ends up with workers waiting around for another worker to get to the usable point right? | 20:32 |
fungi | correct | 20:33 |
fungi | which, depending on how the jobs are written and what they need to do, could just be a few minutes | 20:33 |
dansmith | is there anything easy to grep for to figure out how long a worker waited? | 20:33 |
dansmith | and once a worker has completed its job in a buildset, presumably it goes on to do something else, we don't need a py27 worker hanging around until the end of the devstack worker just because it's the same buildset... | 20:34 |
dansmith | the reason I ask about the grep'able thing is just curious if there is a way to spot inefficient configurations where one worker ends up waiting 45 minutes for another to get to a usable place | 20:35 |
fungi | i'd have to get much more familiar with the pausing mechanism, i'm not sure if there's visible evidence of it in the task output | 20:36 |
dansmith | okay | 20:36 |
clarkb | the job paused/job resumed lines that you linked are produced by the zuul_return pause thing iirc | 20:37 |
*** jamesmcarthur has quit IRC | 20:40 | |
dansmith | right, so the job has some way of entering a "while true: sleep" loop so it can serve, yeah? | 20:41 |
dansmith | and presumably the dependent job needs to poll for readiness or be told by zuul that the other job is at the sync point so it can start using it, right? | 20:42 |
clarkb | dansmith: yes, zuul provides an ansible module called: zuul_return which allows a job to provide state back to the scheduler | 20:42 |
clarkb | I think in this case zuul won't start the children jobs until the parent either exits successfully or paused so it is quite simple | 20:43 |
clarkb | and the parent won't stop after being paused until the children are all done | 20:43 |
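The pause clarkb describes is triggered from inside the job via `zuul_return`; a minimal sketch (hypothetical play, but `zuul.pause` is Zuul's documented interface for this):

```yaml
- hosts: localhost
  tasks:
    - name: Pause this build; child jobs run now, and this build resumes when they finish
      zuul_return:
        data:
          zuul:
            pause: true
```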
dansmith | oh jeez | 20:43 |
dansmith | so this thing might be 45 minutes in, hit the pause, and then we start building the thing that is going to need this, which could take 45 minutes on its own such that we're sitting idle for that long? | 20:44 |
clarkb | dansmith: yes, though in this case its about 60 minutes then 120 minutes | 20:44 |
dansmith | 120 minutes for the dependent child thing to build | 20:45 |
dansmith | ? | 20:45 |
clarkb | yes | 20:45 |
dansmith | uh | 20:45 |
dansmith | am I missing how that's not a super bad waste of resources? | 20:45 |
clarkb | oh actually it might be 180, how have they managed that? I guess paused jobs aren't subject to timeouts in normal ways | 20:46 |
clarkb | dansmith: well its intended use is to avoid needing to perform duplicate work in many jobs | 20:46 |
dansmith | sure, it's a tool, but in this case, it could be working against us it seems like | 20:46 |
clarkb | basically that first hour is performed once rather than say 5 times in jobs that are all multinode. So we save 4 hours in my contrived example | 20:46 |
clarkb | but ya it is possible to set it up such that we don't win in the final tally | 20:47 |
dansmith | clarkb: but did I read you right that you can tell that it took 120 minutes to build the child worker and all the time we were sitting idle? | 20:47 |
dansmith | and if so, can you show me how to figure that math? | 20:48 |
clarkb | dansmith: https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160-4161 shows you the time paused (it was actually almost 3 hours not two) I assumed 2 hours because we have a 3 hour job timeout and it had already spent an hour building images at that point. But I think zuul must not do timeouts in paused jobs in a normal way | 20:48 |
clarkb | dansmith: look at the timestamps on the left side of the text there | 20:48 |
dansmith | clarkb: right, I thought those times between the paused and resumed were when this node was busy serving images | 20:49 |
dansmith | are you saying that's all idle wait time? | 20:49 |
clarkb | well it is idle from Zuul's perspective. | 20:49 |
clarkb | logging for any active period while idle from zuul's perspective will depend on the job itself | 20:50 |
dansmith | heh, okay sure, I'm just wondering how to connect this to the thing that is dependent on it, to figure out if this thing is sitting around longer than it needs | 20:50 |
dansmith | but maybe the answer is "it's totally dependent on the config of the job" | 20:50 |
fungi | keep in mind that while it's one node waiting and serving content to other nodes for several hours, all that time there are at least several multi-node jobs *running* and using the content it's serving, so that one node is a fairly small percentage of what's in use | 20:50 |
dansmith | fungi: well, that's what I was trying to understand, | 20:51 |
clarkb | dansmith: I don't know where tripleo logs their "idle" workload | 20:51 |
dansmith | I thought clarkb was saying we don't even start to build the child jobs until this job gets here | 20:51 |
dansmith | i.e. not parallelizing the builds of the parent and children | 20:51 |
clarkb | dansmith: correct | 20:51 |
fungi | and yeah, as clarkb points out, it's not necessarily "idle" in the usual sense, it's not running job tasks but it's doing something (serving content to nodes for other running builds) | 20:52 |
dansmith | clarkb: okay but you don't know how much of that three hours was build vs serving | 20:52 |
clarkb | dansmith: I know the build was 1 hour, that all happened before the pause. The serving all happens during the 3 hour pause | 20:52 |
dansmith | I gotcha, I thought it was asserted that the time between those two markers was just the waiting for build | 20:52 |
dansmith | clarkb: yeah, I'm talking about the building of the things that depend on this | 20:52 |
fungi | the three hours pause was the amount of time it took the other builds which say they rely on that to complete | 20:53 |
dansmith | right, okay got it | 20:53 |
fungi | once they were done, that build serving the content for them resumed, cleaned up and finished | 20:53 |
dansmith | so that could be two hours of not using this and one hour of using it, or much worse or much better | 20:53 |
clarkb | what this setup has done is avoid needing all of those child jobs to spend an hour doing image builds. So we save roughly 1 hour * num_child_jobs | 20:54 |
dansmith | clarkb: presumably yes I get that | 20:54 |
fungi | minus the node which is doing the serving of course | 20:54 |
clarkb | dansmith: yes, and if all it is doing is serving docker images it is possible that those get pulled like 20 minutes into the pause and then the idle node goes properly idle | 20:54 |
clarkb | the zuul pause mechanism isn't rich enough to say we're done early you can go away now | 20:55 |
dansmith | clarkb: right, but it consumes a worker for the full period until those jobs (which were done with this thing in 20 minutes) have finished three hours later | 20:55 |
dansmith | yeah | 20:55 |
fungi | if that node spends too much time sitting around because the jobs which pulled images from it in their first five minutes take hours to compete, then it's possible that we still end up using more node-hours overall than if each job had done redundant activity | 20:55 |
clarkb | dansmith: right it can likely be optimized further, but I believe this is still an improvement on the simple alternative | 20:55 |
clarkb | particularly for tripleo which has long image builds | 20:55 |
dansmith | clarkb: yeah, I'm sure in a lot of cases it is | 20:56 |
dansmith | I'm just trying to understand what we're looking at | 20:56 |
clarkb | (and many multinode jobs that need the images) | 20:56 |
dansmith | it also seems like the kind of thing that could easily be done for convenience when it's just lines in a yaml file, but without realizing the impact | 20:56 |
dansmith | presumably the three hours could also be the time it takes to build three children, only the last of which actually needs this, all of which are serialized | 20:57 |
fungi | napkin math, a 4-hour single-node content serving job (1 hour creating the content + 3 hours serving it) which is providing content to three two-node jobs which run for up to three hours saves us 2 node-hours | 20:57 |
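fungi's napkin math above can be checked with a quick sketch. The numbers come straight from the example in the discussion (1 hour building, 3 hours of consuming jobs, three two-node child jobs); they are illustrative, not measured values:

```python
# Node-hour comparison: one shared "build and serve" parent job versus
# every child job redundantly doing its own 1-hour build.
BUILD_H = 1        # hours the parent spends building images
CHILD_H = 3        # hours each child job runs once images are available
CHILD_JOBS = 3     # number of dependent multinode jobs
NODES_PER_CHILD = 2

# Shared: one node builds, then sits paused/serving for CHILD_H hours,
# while the child jobs run without doing their own builds.
shared = (BUILD_H + CHILD_H) + CHILD_JOBS * NODES_PER_CHILD * CHILD_H

# Redundant: no serving node, but every child job is longer by BUILD_H.
redundant = CHILD_JOBS * NODES_PER_CHILD * (BUILD_H + CHILD_H)

print(shared, redundant, redundant - shared)  # 22 24 2
```

The 2 node-hour saving matches the napkin math; shrinking `CHILD_JOBS` or `NODES_PER_CHILD` quickly flips the sign, which is the break-even point discussed below.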
clarkb | dansmith: yes, if the child jobs don't actually need those resources then we've optimized for a non-existent problem and likely made things worse | 20:57 |
dansmith | I'm really not trying to say this is not a gain, it totally is, I'm just saying I can see writing a job with dependencies and not realize I have created a 4-node serialization that didn't necessarily need to happen, or for which there is a better optimization | 20:58 |
dansmith | like, if you don't understand all these nuances | 20:58 |
dansmith | one way to maybe spot that is if you know there's not three hours of work that depends on this worker, that'd be a sign that maybe you've created a monster | 20:59 |
clarkb | yup I think that is a possibility. My understanding of the tripleo situation is that they do actually need those images, but it is possible there are better ways to get them (like quay or something) | 20:59 |
dansmith | clarkb: well, right, in the tripleo case it probably is perfect for that they need, but if a four hour job seems longer than necessary, then it might help point to somewhere that you've done something bad | 21:00 |
fungi | right, my napkin math example shows the break-even point is probably if you have at least 5-6 nodes occupied with redundant work (if it's only 4, then the separating it out and serving it costs you more than it nets you), but that will depend to a great extent on the durations | 21:02 |
dansmith | fungi: I was going to say "or rearranging the dependencies might be more efficient" | 21:02 |
fungi | yeah | 21:02 |
dansmith | I dunno how zuul reserves workers, so maybe not, | 21:02 |
dansmith | but it would be helpful to be able to visualize that somehow | 21:03 |
clarkb | I think it is fair to say that depending on the situation this tool can make things better or worse from a throughput perspective. A lot of that will depend on the actual job workload. It is also a good indication that something might be worth reviewing if it takes a very long time | 21:03 |
fungi | workers don't really get reserved, they're satisfied from an available pool | 21:03 |
clarkb | https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 can help visualize some of that (there are also nodepool specific dashboards there as well) | 21:04 |
fungi | so zuul puts out a node request for a given build and waits for a nodepool launcher to fulfil that request from available resources | 21:04 |
dansmith | fungi: I would never *dare* to suggest a zuul feature, but presumably there'd also be an optimization where you have a named sync point and zuul can build both in parallel until the point at which they need to both be at a certain state of readiness, | 21:05 |
dansmith | plus the "okay I'm done with you now" thing clarkb mentioned | 21:05 |
dansmith | but I'd much rather hear a job author say "I'd like to be able to do X but can't" for one of those situations | 21:05 |
fungi | which gets a little complicated since dependent builds need their node requests fulfilled from the same nodepool region | 21:05 |
fungi | dansmith: yeah, that sounds like a potentially useful evolution of the job dependency handling, i don't know what would be involved in implementing it | 21:08 |
dansmith | yeah, I'm saying I wouldn't even consider it until some job optimizer claims no more can be done without something like that :) | 21:08 |
fungi | the "i'm done with you now" mechanism could be essentially the same as the sync point mechanism | 21:10 |
fungi | traffic control, in a more general sense | 21:10 |
clarkb | another aspect here is that you may be paused and waiting some time for the child jobs to start due to contention or cloud flakiness | 21:11 |
fungi | everyone stop here, when this point is reached you go but you stay, et cetera | 21:11 |
clarkb | we should be able to measure that without any job changes, but I'm not sure zuul/nodepool expose that info | 21:11 |
clarkb | having something like a "time waiting for this to boot" in graphite would be nice though | 21:11 |
clarkb | corvus: ^ do you know if that is something we already expose? | 21:11 |
clarkb | we capture a boot time but I'm pretty sure that clock starts once we believe we've got free quota available, so it would ignore the time waiting for quota to become available? | 21:12 |
clarkb | https://grafana.opendev.org/d/4JjHXp2Gk/nodepool?orgId=1 time to ready is the boot time I'm thinking of | 21:13 |
clarkb | a zuul level time from node request being sent to filled is what I think I'm talking about as being useful | 21:14 |
*** thiago__ has joined #openstack-infra | 21:14 | |
*** vishalmanchanda has quit IRC | 21:15 | |
*** tdasilva_ has quit IRC | 21:17 | |
*** thiago__ has quit IRC | 21:18 | |
*** thiago__ has joined #openstack-infra | 21:19 | |
*** tbachman has quit IRC | 21:19 | |
*** dwalt has quit IRC | 21:21 | |
*** rcernin has quit IRC | 21:22 | |
*** tbachman has joined #openstack-infra | 21:26 | |
*** gfidente|afk has quit IRC | 21:33 | |
*** thiago__ has quit IRC | 21:33 | |
*** thiago__ has joined #openstack-infra | 21:33 | |
*** hashar has quit IRC | 21:40 | |
*** jamesmcarthur has joined #openstack-infra | 21:41 | |
*** rcernin has joined #openstack-infra | 21:52 | |
*** ociuhandu has joined #openstack-infra | 22:01 | |
*** thiago__ is now known as tdasilva | 22:04 | |
*** ociuhandu has quit IRC | 22:05 | |
*** rcernin has quit IRC | 22:08 | |
*** rcernin has joined #openstack-infra | 22:09 | |
*** yamamoto has joined #openstack-infra | 22:12 | |
openstackgerrit | Merged openstack/project-config master: Move bindep to opendev tenant https://review.opendev.org/c/openstack/project-config/+/773793 | 22:14 |
corvus | ohai, reading | 22:19 |
*** xek has quit IRC | 22:25 | |
corvus | clarkb: node request timing is sent to graphite under zuul.nodepool.requests.fulfilled and zuul.nodepool.requests.fulfilled.label.$LABEL | 22:29 |
corvus | so you can get stats on how long, say, a "centos7" node request takes to fill with the second, or any node request takes to fill with the first | 22:30 |
clarkb | corvus: oh cool | 22:32 |
clarkb | and that is the time from request sent to fulfilled from zuul's perspective | 22:32 |
clarkb | dansmith: ^ so I think you can use that to answer (on average or typical cases) how long the nodes will be "booting" while the parent job is paused | 22:33 |
*** jamesmcarthur has quit IRC | 22:49 | |
*** slaweq has quit IRC | 22:55 | |
*** tdasilva_ has joined #openstack-infra | 22:57 | |
*** tdasilva has quit IRC | 22:59 | |
*** yamamoto has quit IRC | 23:03 | |
*** yamamoto_ has joined #openstack-infra | 23:03 | |
*** JayF has quit IRC | 23:13 | |
dansmith | clarkb: sorry I got distracted.. I think getting it from graphite won't actually answer the real question I have, which is "for what percentage of this 4h was this thing useful to children" | 23:13 |
dansmith | because the answer is really in the job and how it's used | 23:13 |
dansmith | so I should just pick apart a couple runs and compare timestamps I think | 23:13 |
dansmith | because even if it's 30m to find the next node, the one image pull or whatever it does, before 2h of idle time is what I really want to know :) | 23:14 |
clarkb | dansmith: ya if you found a representative sample the timestamps in the logs should give you that info too (when did each job start and end and how does that compare with the pause time) | 23:14 |
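The "compare timestamps" approach dansmith and clarkb settle on can be sketched with a few lines of Python. This is a rough illustration, not an official Zuul tool; the `paused_duration` helper and the sample lines are hypothetical, and the exact `job-output.txt` timestamp prefix may differ slightly between deployments:

```python
from datetime import datetime

def paused_duration(log_lines):
    """Return the pause duration in hours, using the zuul-style
    'YYYY-MM-DD HH:MM:SS.ffffff | message' line prefix."""
    stamps = {}
    for line in log_lines:
        for marker in ("Job paused", "Job resumed"):
            if marker in line:
                ts = line.split(" | ")[0].strip()
                stamps[marker] = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    delta = stamps["Job resumed"] - stamps["Job paused"]
    return delta.total_seconds() / 3600.0

# Hypothetical sample lines in the job-output.txt format:
sample = [
    "2021-02-03 16:29:09.002028 | Job paused",
    "2021-02-03 19:20:22.441183 | Job resumed",
]
print(round(paused_duration(sample), 2))  # prints 2.85
```

Comparing that pause window against the start/end timestamps of the child jobs (from their own logs) gives the "useful vs. idle" percentage dansmith is after.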
dansmith | aye | 23:15 |
dansmith | someone that knows how that job works may also tell me "oh it's pulling images all the damn time" | 23:15 |
*** JayF has joined #openstack-infra | 23:17 | |
*** thiago__ has joined #openstack-infra | 23:18 | |
*** tdasilva_ has quit IRC | 23:20 | |
*** tdasilva_ has joined #openstack-infra | 23:20 | |
*** thiago__ has quit IRC | 23:23 | |
*** calbers has quit IRC | 23:28 | |
*** dchen has joined #openstack-infra | 23:32 | |
*** calbers has joined #openstack-infra | 23:37 | |
openstackgerrit | Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689 | 23:53 |
*** rlandy has quit IRC | 23:55 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!