Wednesday, 2020-09-23

oneswig#startmeeting scientific-sig11:03
Meeting started Wed Sep 23 11:03:55 2020 UTC and is due to finish in 60 minutes.
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.11:03
*** openstack changes topic to " (Meeting topic: scientific-sig)"11:03
openstackThe meeting name has been set to 'scientific_sig'11:03
oneswigapologies for lateness11:04
*** ociuhandu has joined #openstack-meeting11:05
belmoreirahi oneswig11:11
oneswigHi belmoreira, how's things?11:11
oneswigpriteau was mentioning the discussion in the large scale sig this morning11:11
belmoreiragood, and with you?11:12
oneswigI have been negligent in my attention to the Scientific SIG :-(11:12
oneswigOtherwise things are going well I'd say11:12
belmoreirayes, we had the large scale sig meeting this morning and the discussion for the next goals and PTG11:12
*** e0ne has joined #openstack-meeting11:12
belmoreiragood to hear that11:13
oneswigWhat are the pain points you are having with scaling?11:14
*** yamamoto has joined #openstack-meeting11:15
belmoreirawell, I think we hit everything possible to hit during the last years... for now until we can physically add more nodes we should be good11:17
oneswigPierre mentioned you were describing limitations with cells because they don't help with network scaling - is that new or something you've been fighting from the beginning?11:20
belmoreirait's more related with neutron scalability, not nova or cells design11:20
belmoreiralast year we split the infrastructure in 2 regions. Currently we have 3 regions11:21
*** raildo has joined #openstack-meeting11:25
*** yamamoto has quit IRC11:25
belmoreiraI think it will be interesting to share it11:26
oneswigHow does storage link between the 3 regions, do you have to share storage?11:27
jandershi oneswig belmoreira11:36
jandersbelmoreira do you have multiple cells per region, do I get that right?11:36
*** b1airo has joined #openstack-meeting11:36
oneswigHi janders, good to see you11:36
jandersoneswig good to see you too. Apologies for joining late - team meeting clash.11:36
jandershey b1airo!11:37
b1airomy late excuse is more about beer...11:37
jandersb1airo important! :)11:37
oneswigHi b1airo, very important.  Back home?11:38
oneswig#chair b1airo11:38
openstackCurrent chairs: b1airo oneswig11:38
b1airooh totally janders , certainly higher priority than meetings anyways :-P11:38
jandersI look forward to times when we can combine both again11:38
jandersas it should be11:38
b1airono oneswig , still up north hanging out with the NIWA crew11:38
belmoreirajanders yes, we have multiple cells11:39
b1airohow many now belmoreira ?11:39
belmoreirain 3 regions we have more than 70 cells in total11:39
jandersbelmoreira do any of your cells span regions, or is each cell contained within one region?11:40
belmoreiraeach cell has a maximum of 250 nodes11:40
belmoreirajanders cells are per region11:40
jandersbelmoreira which aspect of scalability do cells help with the most in your experience?11:40
b1airoare you still following the same, err, "disposable" cell controllers model? :-)11:40
belmoreiraeach cell has its own rabbit infrastructure11:41
*** rfolco|ruck has joined #openstack-meeting11:41
b1airo(i vaguely recall you are running your cell controllers within the prod cloud itself... 🐢)11:41
belmoreiraand it's a good failure domain, in case of issues things are contained11:41
belmoreirab1airo :) yes, all our controller plane runs inside the cloud itself (inception)11:42
jandersbelmoreira nice! :)11:43
b1airoagree on that - spanning regions (a user facing construct) across cells (a backend scalability and failure domain concern) seems like a questionable idea11:43
jandersbelmoreira does this architecture pose a challenge in case of a need of a full-system shutdown?11:43
belmoreirayou mean a shutdown in the data centre :)11:44
jandersbelmoreira I really like it, just wonder what extra measures are needed to prevent losing the "starter motor"11:44
jandersbelmoreira yeah11:44
belmoreirayes, sure... if that happens we need to understand what needs to be available first11:45
b1airoi guess maybe the api top-level needs to come up first, followed by compute "cell0" (i guess that must be a thing in this architecture?)11:46
belmoreirabut is not a big issue, because instance start doesn't need the control plane11:46
jandersbelmoreira right!11:46
jandersbelmoreira do you have dedicated compute nodes for infra services, so that they are separate from user workloads and easy to identify?11:47
belmoreirab1airo yes, if we really need APIs from the beginning, but if a disaster happens APIs availability will be probably the last11:47
b1airoha, good point! so "cell0" is really just select instance startup directly on compute nodes?11:47
jandersdo the infra instances have networking statically configured?11:48
janders(cause I suppose DHCP services may not be available yet)11:48
b1airowas coming to the networking question too :-)11:49
jandersor is the inception arch cell-specific, with neutron being independent of this?11:49
belmoreirab1airo in a case of a disaster we will probably force instance start per compute node11:49
belmoreiraand worry to have the DBs up11:49
belmoreirajanders we use DHCP, but it's a separate infrastructure... yes, it needs to be up11:51
jandersbelmoreira makes sense11:51
belmoreirajanders users and infra instances share the same infrastructure11:52
jandersbelmoreira no noisy neighbour issues?11:52
belmoreiraonly compute instances have their dedicated cells/regions11:52
belmoreirajanders yes, sometimes... we usually live migrate noisy neighbours to less busy compute nodes11:53
*** martial_ has joined #openstack-meeting11:53
martial_Late (still getting kids ready)11:54
oneswigHi martial_, morning11:54
jandersbelmoreira it's awesome to hear about your architecture and your experiences with it, thanks for sharing!11:55
oneswigjanders: how's things with you?11:55
jandersoneswig good, thank you for asking! :)11:55
belmoreirajanders np11:56
janderssomething I've been looking at most recently that might be useful for the SIG is potentially introducing NVMe-aware cleaning to Ironic11:56
janders(think trim/discard/... functionality)11:56
oneswigjanders: I've seen key rotation used for SATA SSDs, does that also apply here?11:57
oneswigThe discard idea is good though!11:57
jandersyeah some of the "secure" deletion options leverage manipulating crypto keys11:58
janderswhat's supported really varies but I hope we can find enough common ground11:58
jandershow are things at your end oneswig? What are you guys up to these days?11:59
oneswigjanders: too much to describe in our final minute :-(11:59
oneswigI think finally we are back to having rather too much fun.11:59
jandersoneswig true! poor timing on my behalf. Next time!11:59
oneswigUntil next time.  I promise to come better prepared.12:00
jandershave a good one all12:00
oneswigtime to close12:00
jandersand thanks again for sharing super interesting stuff belmoreira12:00
*** openstack changes topic to "OpenStack Meetings ||"12:00
oneswigthanks all12:00
openstackMeeting ended Wed Sep 23 12:00:38 2020 UTC.  Information about MeetBot at . (v 0.1.4)12:00
liuyulong#startmeeting neutron_l314:00
Meeting started Wed Sep 23 14:00:47 2020 UTC and is due to finish in 60 minutes.
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.14:00
*** openstack changes topic to " (Meeting topic: neutron_l3)"14:00
openstackThe meeting name has been set to 'neutron_l3'14:00
*** yamamoto has quit IRC14:02
liuyulongNo announcements from me today, so maybe we can go directly to the Bug section to cut the meeting time.14:03
liuyulongOK, no objection, : )14:04
liuyulong#topic Bugs14:04
*** openstack changes topic to "Bugs (Meeting topic: neutron_l3)"14:05
liuyulongralonsoh, hi14:05
liuyulongThese are the bug lists from our deputy.14:05
liuyulongFirst one14:05
openstackLaunchpad bug 1895950 in neutron "keepalived can't perform failover if the l3 agent is down" [Medium,Won't fix]14:06
slaweqI don't understand why You marked it as won't fix14:06
liuyulongI replied to this last week, the L3 agent should be alive during the HA router state change.14:07
slaweqIMO if we move bringing the interfaces up to the neutron-state-change-monitor process it should work14:07
liuyulongAfter the patch
slaweqbut that is regression introduced by this patch14:07
slaweqisn't it?14:07
slaweqsmall but IMHO still regression14:08
*** thgcorrea has joined #openstack-meeting14:08
liuyulongBecause the running state-agent process can not do that work if you do not re-spawn it.14:09
slaweqisn't it respawned if You restart L3 agent?14:09
liuyulongI'm not sure, but from my experience, the state change process will run as it is.14:10
ralonsohif the keepalived-state-change process is running, it is not rebooted14:11
ralonsohbut if reload_cfg is enabled, then we'll send SIGHUP14:11
ralonsoh(reload_cfg is false when restarting l3 agent)14:12
ralonsohso no, we don't restart it14:12
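The distinction discussed here (SIGHUP causes a running process to re-read its options, rather than the process being respawned) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; `StateChangeDaemon`, `load_options` and `install` are hypothetical names and are not taken from the neutron code base.

```python
import os
import signal


class StateChangeDaemon:
    """Hypothetical stand-in for a long-running monitor process."""

    def __init__(self, load_options):
        # load_options is an assumed callable that returns a dict of options.
        self._load_options = load_options
        self.options = load_options()
        self.pid_at_start = os.getpid()
        self.reload_count = 0

    def handle_sighup(self, signum, frame):
        # Re-read the options in place. The Python process keeps running,
        # so any code loaded at start-up is still the old code.
        self.options = self._load_options()
        self.reload_count += 1


def install(daemon):
    # SIGHUP is POSIX-only; this must be called from the main thread.
    signal.signal(signal.SIGHUP, daemon.handle_sighup)
```

Under these assumptions a SIGHUP can pick up changed option values, but it cannot pick up new behaviour added by a code patch, which is why the already-running process would have to be re-spawned for that.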
slaweqok, so maybe we can add bringing interfaces to be up/down to the state-change process and keep it in l3 agent for 1 cycle14:12
liuyulongIt reloads the config options, not the python process.14:12
slaweqlater remove it from the l3 agent14:12
slaweqor maybe 2 cycles14:12
slaweqand add e.g. release note about that14:12
*** martial_ has quit IRC14:12
ralonsohone question: if the l3 agent is down, how will this host become master?14:13
slaweqkeepalived can still be running14:13
slaweqand it can failover14:13
slaweqbut l3 agent will not bring interfaces up on new master node14:13
ralonsohyeah, that was my question14:13
liuyulongThe DB state updating still needs L3 agent alive.14:13
slaweqI know that14:13
slaweqbut still IMO would be better to have working dataplane even in case when L3 agent is down for some reason14:14
liuyulongActually the L3 agent must run during HA router failover, it is designed this way. (not by me, but it is) : )14:14
slaweqliuyulong: before Your patch even?14:15
liuyulongNo, I mean HA state change workflow has something related to L3 agent. It needs L3 agent to do some work.14:15
liuyulongNot the gateway, but something like RA, DB state, config state and so on.14:16
liuyulongBut, it's fine to add the gateway UP action to the state-change process.14:17
liuyulongI'm fine with it.14:17
slaweqok, lets keep this bug as won't fix for now14:17
slaweqand maybe check/update docs to be clear about that there14:18
*** apetrich has quit IRC14:19
liuyulongSorry, bad connection14:21
openstackLaunchpad bug 1894843 in neutron "[dvr_snat] Router update deletes rfp interface from qrouter even when VM port is present on this host" [Medium,New]14:21
liuyulongI have no idea why "dvr_snat" is set on every hypervisor. Should it be "dvr"?14:23
ralonsohdvr_snat should be only on network controllers14:24
slaweqwe are using dvr_snat e.g. in our gates14:24
liuyulongL3 agent in "dvr_snat" with mixed compute service does not work fine from my personal experiences.14:24
slaweqand that can possibly cause some failures in dvr multinode jobs maybe14:24
slaweq(idk for sure but just guessing)14:24
liuyulongIMO, this should be documented well, users should not deploy their cloud like this.14:25
liuyulongIMO, there is not much agent mode checking for "dvr_snat" during the router processing.14:26
liuyulongWe have consensus that the "dvr_snat" is for those centralized network node (functions) which can not be distributed.14:27
liuyulongSo, my advice for this bug/user is to change the config options.14:29
liuyulongThe final cloud deployment should cover three scenarios:14:30
liuyulong1. the compute nodes have access to the external network (internet), so the compute nodes set the L3 agent mode to "dvr".14:30
liuyulong2. the compute nodes cannot reach the Internet, so set the agent mode to "dvr_no_external"14:31
liuyulong3. centralized network nodes should run on dedicated physical hosts, and the L3 agent mode is "dvr_snat".14:32
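The three scenarios above can be summarised as a small decision function. This is only a sketch of the advice as given in the meeting: the function names and the two boolean parameters are hypothetical, and the rendered fragment assumes the `agent_mode` option lives in the `[DEFAULT]` section of the L3 agent configuration file.

```python
import configparser
import io


def l3_agent_mode(is_network_node, has_external_access):
    """Pick an L3 agent mode following the three scenarios listed above."""
    if is_network_node:
        # Dedicated, centralized network node.
        return "dvr_snat"
    # Compute node: depends on whether it can reach the external network.
    return "dvr" if has_external_access else "dvr_no_external"


def render_l3_agent_ini(mode):
    """Render a minimal config fragment that sets the chosen mode."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = {"agent_mode": mode}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_l3_agent_ini(l3_agent_mode(False, True)))
```

Note that in this sketch a compute node is never given "dvr_snat", which matches the recommendation made for the bug reporter.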
liuyulongOK, no more bugs from me14:35
*** b1airo has quit IRC14:35
liuyulongOK, let's move on14:36
liuyulong#topic On demand agenda14:36
*** openstack changes topic to "On demand agenda (Meeting topic: neutron_l3)"14:36
ralonsohnothing from me14:38
openstackliuyulong: Error: Could not gather data from Launchpad for bug #1895972 ( The error has been logged14:38
liuyulongAnother gap is filling... Congrats!14:39
ralonsohthis feature is ongoing but yes!14:39
liuyulongThere is C work involved, so it is one example of a fullstack development process for an OVN feature.14:41
liuyulongThe Python work has not started.14:42
liuyulongslaweq, hi, I've replied to the comments.14:43
liuyulongI've tested it from my local devstack environment for a while.14:43
ralonsohand what is happening with
ralonsohsuperseded by yours, I think so14:44
liuyulongI cannot say I covered every case, but those I noticed and experienced.14:44
liuyulongralonsoh, yep, it has 2 closes bugs.14:44
slaweqliuyulong: ok, I will check that14:46
liuyulongBut with some deep thinking, after these flow refactors or redirects (and some other work), IMO the entire flow structure may get a chance to be redesigned someday.14:47
liuyulongIt could be a long story. Just forget it. : )14:48
ZhuXiaoYuOh, I wonder why is not approved too14:48
*** dklyle has joined #openstack-meeting14:48
ZhuXiaoYuwould you give an explanation?14:50
ZhuXiaoYuI will tell Li YaJie later14:50
liuyulongPlease take a look at the inline comments in gerrit, and the meeting LOG here. : )14:51
liuyulongOK, no more talks from me now.14:51
liuyulongI will leave 1 or 2 mins here.14:51
ZhuXiaoYumy patch for ecmp14:52
ZhuXiaoYuI really hope it can be 'merged'14:53
liuyulongIt's feature freeze now, IMO, it should be moved to next dev cycle.14:55
liuyulongWait...14:55
when is the next dev cycle?14:57
liuyulongIf this was not in the V-3 list, it will not be merged for now.14:57
liuyulongSorry, I cannot open the for now.14:58
slaweqZhuXiaoYu: yes, we are in the RC-1 week now14:58
slaweqso we can merge this patch after rc-1 is released and the stable/victoria branch has been created14:58
ZhuXiaoYugot it, really thx for telling me that, it's helpful14:59
liuyulongI will start another round review of the spec this week.14:59
liuyulongTime is up.15:00
liuyulongThank you guys.15:00
*** openstack changes topic to "OpenStack Meetings ||"15:00
openstackMeeting ended Wed Sep 23 15:00:26 2020 UTC.  Information about MeetBot at . (v 0.1.4)15:00
timburke#startmeeting swift21:00
Meeting started Wed Sep 23 21:00:25 2020 UTC and is due to finish in 60 minutes.
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.21:00
*** openstack changes topic to " (Meeting topic: swift)"21:00
openstackThe meeting name has been set to 'swift'21:00
timburkewho's here for the swift meeting?21:00
*** patchbot has joined #openstack-meeting21:01
timburkemaybe it's just you and me21:02
kota_oh yeah21:02
timburkeagenda's at
*** tdasilva has joined #openstack-meeting21:02
timburkemain thing i wanted to mention was some follow up on the state of our gate21:03
timburke#topic busted gates21:03
*** openstack changes topic to "busted gates (Meeting topic: swift)"21:03
kota_oh, it looks like a bunch of branches were broken...21:04
timburkei *think* swift's ussuri gate is fixed now -- at any rate, i stopped seeing emails about docs failure21:04
timburkeassuming it's moving, i'll try to get some stable releases out for ussuri and train this week21:05
timburkeswift client's gate is better now! the fix landed after the deadline to branch for victoria, though, so i might need to reach out to the stable team to sort out how best to fix that one21:06
timburke(the fix involved some requirements changes, so i worry a little that a simple backport may not be great)21:06
kota_i see21:06
timburkei discovered pyeclib's gate was broken after seeing p 74462321:07
patchbot - pyeclib - [goal] Migrate testing to ubuntu focal (ABANDONED) - 4 patch sets21:07
timburkep 753472 fixed it, but disabled the two jobs we had to test against tip-of-master libec21:08
patchbot - pyeclib - Fix gate (MERGED) - 1 patch set21:08
claygfocal is gunna be so great - i'm sure I'll try upgrading to it at some point21:08
claygkota_: i snuck in 😁21:09
timburkeat some point we should dig into how those fail, but they're both such low-volume repos that i'm fairly certain they still work well together21:10
timburkewhile i was looking at pyeclib, i also pushed in p 753421 to test against py38 on focal and py36 on centos821:10
patchbot - pyeclib - Update gate jobs (MERGED) - 4 patch sets21:10
kota_libec-pyeclib-unit said `/bin/bash: line 17: tox: command not found` :(21:11
kota_at p 75347221:11
patchbot - pyeclib - Fix gate (MERGED) - 1 patch set21:11
kota_no p 74462321:11
patchbot - pyeclib - [goal] Migrate testing to ubuntu focal (ABANDONED) - 4 patch sets21:11
timburkei love how snappy pyeclib's jobs are -- at 2-4 mins per job, i feel like we can add more target platforms all day long!21:12
kota_sounds good21:12
timburkebut all of this reminded me that i should check on the state of libec's gate; will report back next week21:14
timburkethat's all i've got for the gate stuff; any questions or comments?21:14
kota_nothing so far. thanks for your effort to keep the gate to work.21:15
*** eharney has joined #openstack-meeting21:16
claygtimburke: 👏21:16
timburkeall right, i've just got one other topic on my mind lately21:16
timburke#topic hung proxy servers21:16
*** openstack changes topic to "hung proxy servers (Meeting topic: swift)"21:16
timburkethere have been two distinct issues that came up recently and are somewhat related21:17
timburkeone is
openstackLaunchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress]21:18
*** mlavalle has joined #openstack-meeting21:20
timburkethe nitty-gritty is in the bug, but the summary is that while we're down in logging, garbage collection may cause us to try to grab the same (non-reentrant) lock twice in the same (green)thread21:20
timburkethe other is
timburkewhere eventlet sees that there's an fd ready to read, but then doesn't wake anyone up to read it21:21
timburkegood news is that the second one is already merged (and tagged!) following -- thanks for cleaning it up clayg!21:23
claygtight poll loop keeps asking for the same fd, and it says it's ready - but it just keeps polling21:23
timburkethe first one has a patch at p 75259321:24
patchbot - swift - Replace threading._active_limbo_lock with a re-ent... - 3 patch sets21:24
timburkei think both of these issues can affect other services, it's just acutely bad on proxies21:24
timburkeas much as anything, i just wanted to raise awareness in case anyone else sees similar issues, and maybe see if i could get someone to look at the swift patch ;-)21:25
claygdoes lp bug #1895739 only affect py3?21:28
openstackLaunchpad bug 1895739 in OpenStack Object Storage (swift) "Proxy server sometimes deadlocks while logging client disconnect" [Undecided,In progress]
timburkei've only *observed it* on py3 -- and i'm not sure why :-(21:28
timburkelooking at py2's code, it seems like it *could* happen there, too... but again, i've not actually seen it21:28
timburkemaybe there was some change in GC algo?21:29
claygwhat kind of lock *is* _active_limbo_lock in cpython?  does eventlet patch it by default?21:29
timburkei still haven't found a good way to reliably reproduce the problem, either :-(21:29
timburkeclayg, so in cpython it's a pretty low-level lock -- uses as i recall21:31
timburkeeventlet *does* patch it; it gets replaced with a Semaphore21:32
timburkewhich seems like a reasonable replacement given the semantics21:33
timburkei tried to go over some of the weirdness that leads to this in the bug -- it's not really clear to me whether we're to blame, eventlet's to blame, or cpython's to blame :-/21:35
timburkeswapping out for our own reentrant lock seems like the most-reasonable approach, though, especially since it's already getting patched21:35
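The difference between a non-reentrant and a reentrant lock, which is the core of the proposed fix, can be shown with a short sketch. It uses the plain standard-library `threading` primitives rather than eventlet-patched ones or Swift's own lock, so it only illustrates the general idea; `try_nested_acquire` is a hypothetical helper.

```python
import threading


def try_nested_acquire(lock):
    """Hold the lock, then try to take it again from the same thread.

    The second attempt is non-blocking, so a non-reentrant lock reports
    failure here instead of deadlocking the way a blocking call would.
    """
    with lock:
        got_it = lock.acquire(blocking=False)
        if got_it:
            # Undo the nested acquire; the outer "with" releases the first.
            lock.release()
        return got_it


if __name__ == "__main__":
    print("plain Lock, nested acquire succeeded:",
          try_nested_acquire(threading.Lock()))
    print("RLock, nested acquire succeeded:",
          try_nested_acquire(threading.RLock()))
```

With a blocking acquire, the first case is the situation described in the bug: code that already holds the lock is interrupted (there, by garbage collection during logging) and ends up asking for the same lock again in the same thread.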
timburkeclayg, since you've already put some effort into thinking about eventlet and our PipeMutex, mind taking a look this week?21:36
claygi'm sure it's fine - but without a repro it's hard to say exactly21:37
*** baojg has joined #openstack-meeting21:37
timburkeall right, that's all i've got planned21:38
timburke#topic open discussion21:38
*** openstack changes topic to "open discussion (Meeting topic: swift)"21:38
timburkewhat else should we talk about this week?21:38
claygare we still stalled out on pyeclib?21:38
timburkepyeclib's good now, afaik -- maybe you're thinking of p 738959 though?21:39
patchbot - liberasurecode - Be willing to write fragments with legacy crc - 2 patch sets21:39
timburkei still haven't circled back on it -- i'm coming around to wanting to at least treat set-to-the-empty-string the same as unset, but beyond that i'm not sure21:41
timburkei think my main question is: which falsey values should we look for?21:44
timburkekota_, clayg any thoughts there? keeping in mind that the check'll have to be written in C21:46
claygi like 0 and 1 for true and false in C21:47
kota_clayg: agree. plus empty value seems False.21:47
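One reading of the answer above is: unset, the empty string and "0" all count as false, and anything else counts as true. The real check would be written in C inside the library; the Python sketch below only pins down that truth table, and both the function name and the variable name `EXAMPLE_FLAG` are hypothetical.

```python
def flag_enabled(environ, name):
    """Interpret an environment-style setting as a boolean.

    Assumed rules (one possible reading of the discussion):
      - unset            -> False
      - empty string     -> False (treated the same as unset)
      - "0"              -> False
      - any other value  -> True
    """
    value = environ.get(name)
    if value is None or value == "":
        return False
    return value != "0"


if __name__ == "__main__":
    for env in ({}, {"EXAMPLE_FLAG": ""}, {"EXAMPLE_FLAG": "0"},
                {"EXAMPLE_FLAG": "1"}):
        print(env, flag_enabled(env, "EXAMPLE_FLAG"))
```

Whether other strings such as "false" or "no" should also be treated as false was left open in the meeting; this sketch treats them as true.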
clayganyone have any idea why making a request that uses acl's results in the env getting copied?  p 75277021:48
patchbot - swift - Log error processing manifest as ServerError - 1 patch set21:48
timburkeok, i'll code that up this week21:48
claygwe end up losing the storage policy index from the req.environ as well21:48
*** rfolco|ruck has quit IRC21:55
*** yonglihe has quit IRC21:56
*** yonglihe has joined #openstack-meeting21:56
timburkei have no idea. sorry. went looking21:57
timburkei'll see about digging into it more on the patch, though21:57
timburkeall right, i think that'll do it21:58
timburkethank you all for coming, and thank you for working on swift!21:58
*** openstack changes topic to "OpenStack Meetings ||"21:58
openstackMeeting ended Wed Sep 23 21:58:27 2020 UTC.  Information about MeetBot at . (v 0.1.4)21:58
