Tuesday, 2021-03-02

clarkbI've identified a setup where the user has one active account and one inactive account and I know they are not using either. I've selected this user to test retiring the inactive account, then delete the external ids from that account as it should be safe to do so00:03
clarkbthen tomorrow maybe we can have someone like smcginnis volunteer to be a test case where one of the accounts is used00:03
openstackgerritIan Wienand proposed opendev/system-config master: [wip] handle zuul-summary-results as .jar / per-project config  https://review.opendev.org/c/opendev/system-config/+/77811600:10
*** osmanlicilegi has joined #opendev00:15
*** osmanlicilegi has quit IRC00:15
*** osmanlicilegi has joined #opendev00:16
*** gibi has joined #opendev00:17
*** tosky has quit IRC00:17
clarkbalright the two step cleanup of first removing the preferred email from refs/users/xy/abxy:account.config then doing a curl to post a delete for the corresponding external id has been completed. I'm going to run a consistency check against prod now to see if that makes the consistency issue go away ( we expect it to, this should confirm it )00:19
clarkbnow i just have to figure out how to run the consistency check again00:19
clarkbthe consistency check shows that account is no longer present in the conflicts list00:28
clarkbI also had it check for preferred emails missing just in case and that looks clean too00:28
clarkbI think that means the process of "retiring" accounts then removing their external ids should be viable here00:29
clarkbI need to switch gears to getting tomorrow's meeting agenda out then start dinner, but maybe if smcginnis sees this note we can test this same process on accounts where at least one is being used00:29
ianwit looks like writing out /etc/openafs/server/rxkad.keytab is breaking with some sort of interaction between ansible/!!binary in yaml/focal/??? ... :/00:33
*** hamalq has quit IRC00:41
clarkbthere are about 35 accounts that we can clean up directly in this manner I think00:42
clarkband then another few hundred that we'll probably clean up in this way due to inactivity00:42
clarkbthen we can improve our analysis to handle the resulting set00:42
clarkbavass: you are also in this bucket so maybe we will use you as the test scenario :)00:43
ianwok, we've clearly never tried writing out the rxkad keys; i guess we've never deployed a new server.  i think i just need to remove the !!binary, that was from before we base64 encoded them00:50
ianw(or realised that !!binary got turned into a utf-8 string IIRC)00:50
clarkband agenda is sent. Sorry for getting that out a bit late today00:56
openstackgerritIan Wienand proposed opendev/system-config master: Remove afs-admin group  https://review.opendev.org/c/opendev/system-config/+/77812000:57
*** iurygregory has quit IRC01:07
*** mlavalle has quit IRC01:14
*** hemanth_n has joined #opendev01:25
*** iurygregory has joined #opendev01:30
auristorthree vlservers running.   afsdb02 is the coordinator01:32
auristorsame for ptservers01:33
ianwauristor: oh good ...  it looked like it was working  :)01:34
ianw#status log afsdb03.openstack.org now online, SRV records added01:35
openstackstatusianw: finished logging01:35
ianwi think it bodes well for upgrade-in-place for the afs01/afs02 servers01:37
auristornow that you have HA, you can take the risk01:38
auristorI'm not seeing the SRV record changes yet.01:39
auristorare openstack.org clients configured with dbservers from a local CellServDB or via DNS?01:39
auristorif CellServDB, the clients will not learn about afsdb03 until a new version is distributed and the clients are restarted01:42
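For clients that discover db servers via DNS rather than a shipped CellServDB, the lookup relies on SRV records named with leading underscores. A minimal sketch of what such records could look like, assuming the RFC 5864 service names and the usual AFS ports; the TTL, priority and weight values and the srv_record helper are illustrative assumptions, not the actual zone contents:

```python
# Service names and ports as registered for AFS (RFC 5864); the
# helper itself is hypothetical and only formats zone-file text.
AFS_SRV_PORTS = {"afs3-vlserver": 7003, "afs3-prserver": 7002}


def srv_record(service, cell, target, ttl=3600, priority=0, weight=0):
    """Format one SRV line; note the underscores before service and protocol."""
    port = AFS_SRV_PORTS[service]
    return f"_{service}._udp.{cell}. {ttl} IN SRV {priority} {weight} {port} {target}."


records = [
    srv_record(service, "openstack.org", "afsdb03.openstack.org")
    for service in AFS_SRV_PORTS
]
for record in records:
    print(record)
```

Writing the labels with "-" in place of "_" produces a name that clients never query, which is an easy mistake to make by hand.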
*** LowKey has quit IRC01:44
*** LowKey has joined #opendev01:45
ianwauristor: yeah, i've corrected the -/_ typo in the srv records now :)02:47
ianwi guess we ship a full CellServDB02:55
auristordid you add a PTR record for afsdb03.openstack.org ?03:09
auristorI see the SRV records03:10
ianwhrm i have to log in separately to add reverse dns03:11
openstackgerritIan Wienand proposed opendev/system-config master: openafs-client: trim CellServDB  https://review.opendev.org/c/opendev/system-config/+/77812703:37
auristorPTR record is visible03:58
*** ykarel|away has joined #opendev04:09
*** ykarel|away is now known as ykarel04:14
*** whoami-rajat_ has joined #opendev05:25
*** auristor has quit IRC05:35
*** brinzhang_ has quit IRC06:01
*** brinzhang_ has joined #opendev06:01
*** lbragstad_ has joined #opendev06:03
*** auristor has joined #opendev06:05
*** marios has joined #opendev06:05
*** lbragstad has quit IRC06:06
openstackgerritMartin Kopec proposed opendev/system-config master: SSH access to refstack for kopecmartin  https://review.opendev.org/c/opendev/system-config/+/77809006:44
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI  https://review.opendev.org/c/opendev/system-config/+/77629206:47
*** whoami-rajat_ is now known as whoami-rajat06:54
*** sboyron has joined #opendev06:54
*** ralonsoh has joined #opendev06:55
*** slaweq has joined #opendev07:08
*** lpetrut has joined #opendev07:27
*** ykarel_ has joined #opendev07:27
*** ykarel has quit IRC07:30
*** eolivare has joined #opendev07:44
*** ykarel_ is now known as ykarel07:51
*** andrewbonney has joined #opendev07:58
*** fressi has joined #opendev08:13
*** rpittau|afk is now known as rpittau08:22
avassclarkb: heh, yeah sure go ahead :)08:24
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI  https://review.opendev.org/c/opendev/system-config/+/77629208:32
*** tosky has joined #opendev08:35
*** zimmerry has quit IRC08:52
*** hashar has joined #opendev08:56
*** jpena|off is now known as jpena08:57
*** toomer has joined #opendev08:59
*** brinzhang0 has joined #opendev09:15
*** brinzhang_ has quit IRC09:18
openstackgerritMerged opendev/irc-meetings master: Remove usused Public Cloud SIG meeting slot  https://review.opendev.org/c/opendev/irc-meetings/+/77719109:18
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI  https://review.opendev.org/c/opendev/system-config/+/77629209:22
*** zoharm has joined #opendev10:06
*** LowKey has quit IRC10:51
*** klonn has joined #opendev12:26
*** jpena is now known as jpena|lunch12:31
*** klonn has quit IRC12:46
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: trigger image upload  https://review.opendev.org/c/opendev/system-config/+/77818712:53
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI  https://review.opendev.org/c/opendev/system-config/+/77629212:54
sshnaidmis gerrit slow today for me only?13:22
*** jpena|lunch is now known as jpena13:33
*** lbragstad_ is now known as lbragstad13:41
*** mlavalle has joined #opendev13:59
*** hemanth_n has quit IRC14:04
*** fdegir has joined #opendev14:19
ykarelfungi, hi14:50
ykarelfungi, can u check lists.openstack.org/pipermail/openstack-discuss/2021-March/020804.html14:51
ykareli missed ur mail long ago :)14:51
fungisure, just a few, getting ready to dial into the board meeting14:51
ykarelack take ur time14:51
ykareli cleaned up open reviews on that, there are still some open but those can also be closed if we get confirmation on branch deletion14:52
ykarellike https://review.opendev.org/c/openstack/instack-undercloud/+/77737014:52
smcginnisclarkb: Still need to test something?14:57
*** roman_g has joined #opendev15:04
roman_gHello team. Sorry to bother you. Still with CityCloud/Kna1. Is there any chance you keep NODE_FAILURE logs from 2021-02-16 - 2021-02-25? Could have a look?15:09
fungiroman_g: i'll see what i can put together15:09
fungiunfortunately the backtraces are multiline and not prefixed, so if we want those included too it's not just a simple grep15:10
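One way around unprefixed multiline backtraces is to attach every continuation line to the preceding timestamped line before filtering. A rough sketch, assuming each log record starts with a "YYYY-MM-DD HH:MM:SS" timestamp; the sample lines and the function names are made up for illustration and are not the real launcher log format:

```python
import re

# Assumption: a new record begins with a date and time stamp.
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_records(lines):
    """Group continuation lines (e.g. tracebacks) under their first line."""
    records = []
    for line in lines:
        if STAMP.match(line) or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records


def matching_records(lines, needle):
    """Return whole records in which any line contains needle."""
    return [r for r in group_records(lines) if any(needle in l for l in r)]


sample = [
    "2021-02-20 10:00:00,123 ERROR launcher: Launch failed",
    "Traceback (most recent call last):",
    "  ConnectionTimeoutException: Timeout waiting for connection",
    "2021-02-20 10:00:01,456 DEBUG launcher: unrelated entry",
]
hits = matching_records(sample, "ConnectionTimeoutException")
```

Here hits holds a single record of three lines, so the error line and its traceback stay together.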
roman_gThe CityCloud also says that they saw "Image is not raw format" errors, probably from us. Image ID is c0e9524d-506c-467b-bed5-5fcdec309280.15:11
roman_gMight be also something to look at.15:12
roman_gThank you, fungi.15:12
fungiroman_g: oldest logs we still have are from 2021-02-20, will that suffice?15:13
*** ykarel has quit IRC15:17
fungiroman_g: interestingly i don't think we're seeing the "no valid host" errors we had been seeing (still trying to confirm i'm not just looking at it wrong), but seeing quite a few "Unable to find public IP of server" errors15:18
openstackgerritMerged openstack/project-config master: Add tripleo-ci-health-queries project  https://review.opendev.org/c/openstack/project-config/+/77799115:18
fungimore on some days than others, something like 10-30 a day15:19
fungidepending on the day15:19
roman_gDoes it seem to be the reason why instances were not launching? No more IPs left in pool/quota?15:20
funginothing so straightforward, the metadata returned from the api is missing a public ip address15:20
fungialso sorry i'm trying to pay attention to the foundation board of directors meeting15:21
fungiso slightly distracted15:21
fungilooks like we're also seeing errors when attempting to delete floating ips there15:22
fungialso timeouts waiting for server deletion calls to complete15:24
fungithe image uuid you mentioned earlier isn't in our image list currently, so may have been from a truncated upload or something. did they say when they saw that?15:27
roman_g2021-02-16 23:11:4715:27
fungiahh, yeah that's too far back for us to still have logged15:27
toomerroman_g: It's good to know that CityCloud come back to you on those issues.15:28
roman_gfungi list of NODE_FAILUREs https://paste.ubuntu.com/p/9MHZ4Wx7PC/15:30
roman_gUTC time15:30
roman_gThat's from Zuul15:30
clarkbsmcginnis: yes, you have two gerrit accounts. One is inactive and I'd like to make it even more inactive then have you confirm your active (and currently used) account remains happy15:31
fungiroman_g: thanks, i can try to trace some of those back but it will take a bit, since i'll need to map them to node requests and from there to actual api calls15:31
clarkbsmcginnis: I don't expect any issues since the other account has been marked inactive and shouldn't be used, but if we can have you confirm that after we do some work that would be excellent15:31
smcginnisclarkb: Sure, that should be fine. Board meeting right now, but I can verify my good account is still good afterwards.15:32
roman_gfungi thank you. Can I get back to you with this tomorrow? Save logs somewhere, please.15:32
clarkbsmcginnis: ya I've got to finish booting my morning too so will be a bit before I get to it15:32
fungiroman_g: yes, i'm working on filtering the logs down first, the launcher which handles that cloud also connects to lots of others so i need to untangle the debug log entries from it15:33
clarkbroman_g: fungi  note https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_clouds.yaml.j2#L158 is what will determine what file format we upload as15:33
clarkbI don't know if that profile sets it to raw or qcow2 etc. We can override it if necessary but should also fix that in sdk if necessary15:34
*** lpetrut has quit IRC15:34
fungiyeah, it will be buried somewhere in openstacksdk's defaults i think15:34
fungibut since they just mentioned one error about that 14 days ago, i expect it was a failed upload15:35
clarkbhttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/config/vendors/citycloud.json doesn't seem to set it at all15:35
*** openstackgerrit has quit IRC15:35
fungiwow, one compressed scheduler debug log is going to take 15 minutes for me to retrieve to my workstation to analyze15:53
clarkbinfra-root I'm going to run the "retire-user.sh" script against smcginnis' already inactive account16:00
clarkbthen I'll remove the external ids for that account that conflict with the email address16:00
smcginnisDo I get a pension?16:00
clarkbsmcginnis: unfortunately not16:01
clarkbsmcginnis: fwiw I've called it "retire" because it's more forceful than simply setting the account inactive. Apparently gerrit will still check all sorts of things on inactive accounts you'd prefer it didn't and just treat the entire account as inactive instead :(16:01
clarkbcool that first step is done. Now to sort out the next step and remove the external ids associated with that account16:03
clarkbsmcginnis: ok I'm all done on my side. If you want to double check that logging out and into gerrit etc are still working as normal that would be good16:08
clarkbI only modified the inactive account so don't expect any problems, but trying to be careful here16:08
smcginnisclarkb: ack, will test things out in a bit and report back.16:10
fungiugh, the one day worth of scheduler debug log i'm trying to hunt in is ~80m lines long16:14
fungieven just slicing an hour out of it is taking a while16:17
fungiokay, that's better, only 5m lines16:20
fungistill kinda insane that we're logging 1k lines per second on average over a 24 hour period from just this one service16:21
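The rough rate can be checked from the figures quoted above, taking "~80m lines" as 80,000,000 (an approximation, since the exact count is not given):

```python
# Approximate figures from the discussion; the line count is rounded.
lines_per_day = 80_000_000
seconds_per_day = 24 * 60 * 60

lines_per_second = lines_per_day / seconds_per_day
print(round(lines_per_second))
```

That works out to a little under 1k lines per second on average.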
clarkbinfra-root there are about 35 more accounts similar to smcginnis' where at least one of the accounts is already marked inactive. I'd like to run the retire-user.sh script against those. Then figure out how to automate the external id deletion via the api (so far I've just done curl and manual json edits)16:29
clarkbany objections to me running retire-user.sh against those now then sorting out the external id cleanups?16:30
fungino objection, sounds great--thanks!16:30
clarkboh I should note that smcginnis PM'd me; I gave some account details to check and they said things looked fine16:31
clarkbgreat, I'll proceed then16:31
fungiyeah, i figured there had been some coordination16:31
*** dviroel has quit IRC16:34
avassfungi: yeah those are fun to dig through :)16:36
fungitrying to map a NODE_FAILURE result back to a node request16:39
*** sshnaidm is now known as sshnaidm|afk17:00
*** marios is now known as marios|out17:01
*** zoharm has quit IRC17:04
*** marios|out has quit IRC17:06
roman_gfungi Big Data processing :)17:13
fungiindeed. i mapped one node_failure report to an event id, now seeing if i can find the node requests associated with it17:14
fungiluckily the triggering event id only appears in a little over 10k lines, so shouldn't be too hard17:15
*** openstackgerrit has joined #opendev17:16
openstackgerritMerged opendev/git-review master: Don't test rebasing with unstaged changes  https://review.opendev.org/c/opendev/git-review/+/77745617:16
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies  https://review.opendev.org/c/opendev/system-config/+/77784617:19
clarkbinfra-root ^ that ps adds a script that reads a list of emails and accounts for which to delete external ids for that email under that account17:20
clarkbif that looks good to you I'll run it against the batch of accounts I retired17:20
avassfungi: regarding that, I'm not sure but I think the node request is part of the message and not the logged data like build and event_id is. Which I guess you don't really notice but when logging that as a key:value store to splunk or I guess elastic it would probably be nice if that was more searchable17:21
avasssame problem with buildset in the executor I believe. might take a look at that some day17:21
fungiavass: yeah, i've found the list of node requests submitted for a particular event and am cross-checking against the list of node requests completed17:22
fungithen i'll at least know the node request to track down17:22
*** lpetrut has joined #opendev17:28
openstackgerritSorin Sbârnea proposed opendev/git-review master: Fixes test system dependencies  https://review.opendev.org/c/opendev/git-review/+/77269917:32
fungihuh, so the node request corresponding to the node_failure result does get logged as completed, but with "Node None"17:33
fungiso i guess that's the pattern to look for17:33
avassfungi: is that the problem that corvus was looking at with dependent changes or change queue a week or so ago?17:34
fungialso i now know the request id so can actually find that in the launcher log17:34
fungiavass: nah, i think maybe it's just how this gets logged. i'm still digging for a cause17:34
funginot far enough along to have any theory as of yet17:35
fungiroman_g: okay, on the -2 for patchset 3 of https://review.opendev.org/776956 the node request was declined "because node type(s) [ubuntu-bionic-expanded] not available"17:37
funginow to see if i can figure out why that was17:37
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies  https://review.opendev.org/c/opendev/system-config/+/77784617:38
clarkbinfra-root ^ added another check to the bulk external id remover. Now it should only do that for accounts which are inactive in gerrit17:38
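The kind of guard described here could look roughly like the following. This is only a sketch and not the script under review: it assumes account details shaped like Gerrit's AccountInfo JSON, where an "inactive" field is present and true only for inactive accounts, and the function names and sample data are hypothetical:

```python
def is_inactive(account_info):
    """True only when the account is explicitly marked inactive."""
    return account_info.get("inactive") is True


def filter_deletions(pairs, accounts):
    """Split (email, account_id) pairs into allowed and skipped lists.

    accounts maps account id to an AccountInfo-like dict. Unknown
    accounts are treated as active, so they are skipped.
    """
    allowed, skipped = [], []
    for email, account_id in pairs:
        info = accounts.get(account_id, {})
        target = allowed if is_inactive(info) else skipped
        target.append((email, account_id))
    return allowed, skipped


accounts = {
    1001: {"_account_id": 1001, "inactive": True},
    1002: {"_account_id": 1002},
}
pairs = [("a@example.com", 1001), ("a@example.com", 1002), ("c@example.com", 9999)]
allowed, skipped = filter_deletions(pairs, accounts)
```

Defaulting to "skip" for anything not provably inactive keeps a bad input file from touching an active account.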
roman_gfungi "not available" is quite strange. It doesn't say why is it not available.17:40
zbrcan gerrit 2.13 work with java 11 (at least for testing purposes)17:40
fungiroman_g: it probably does, but i haven't gotten that far yet17:40
clarkbzbr: I don't think so17:40
clarkbzbr: they only just added java 11 support in 3.2 aiui17:40
zbrthanks! i was trying to compute the test deps :p17:41
clarkbfungi: the kna1 provider is the only provider providing that label currently. If you grep the request id in nl02's launch debug logs it should tell you why17:41
fungiroman_g: nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to on port 2217:41
clarkbmy hunch is it failed to boot 3 times in a row17:41
clarkbsomething something cloud redundancy and this is why we push so hard on the subject17:42
fungiroman_g: so looks like we gave up waiting for the booted node to respond on port 22 to gather the host keys from it17:42
*** rpittau is now known as rpittau|afk17:42
roman_gAhha... Now we need to understand if it's the same problem with other launch failing attempts, right?17:43
fungiroman_g: and then tried two more times, with new nodes each time, before we gave up17:43
roman_g3 times in a row, or with a delay between tries?17:43
clarkbroman_g: the "wait for boot/ssh/etc" is effectively a delay between tries17:43
clarkbbut I Think that is the only real delay in play there17:44
fungiyeah, basically for this particular example, the one provider which offers that node type accepted the node request, tried three times to boot a node to fulfil it, each time it waited some minutes for ssh to become responsive, then gave up and booted another. once it failed that way three times in a row, it told zuul it could not fulfil the node request (by returning a "completed" result for the request with17:44
fungia node of None)17:44
fungionce the server instance is reported ready by the nova api, the launcher starts trying to connect via ssh to gather host keys from it, and that's roughly 120 seconds17:46
fungiso it looks like the nodes weren't responding on their ssh ports for at least 120 seconds after they were reported to be in a ready state by the cloud. possible they were just very slow booting, i suppose, or that there was some network connectivity problem reaching them17:47
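The behaviour described above (boot, wait for ssh, give up after three failed boots) can be sketched as a small retry loop. This is an illustration and not nodepool's actual code; boot and wait_for_ssh are hypothetical callables supplied by the caller, and the attempt count of three comes from the discussion:

```python
def launch_with_retries(boot, wait_for_ssh, attempts=3):
    """Try to get a reachable server, booting a fresh one per attempt.

    boot() returns a server id; wait_for_ssh(server_id) returns True if
    port 22 answered before the timeout. Returns (server_id, failed_ids),
    with server_id None when every attempt failed.
    """
    failed = []
    for _ in range(attempts):
        server_id = boot()
        if wait_for_ssh(server_id):
            return server_id, failed
        failed.append(server_id)
    return None, failed


# Example: every boot times out, so the request ends with no node.
fake_ids = iter(["uuid-1", "uuid-2", "uuid-3", "uuid-4"])
node, unreachable = launch_with_retries(lambda: next(fake_ids), lambda s: False)
print(node, unreachable)
```

A request that ends with a node of None therefore hides three unreachable instances, and successful requests may hide one or two more.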
clarkbthose servers use FIPs. Could be anything from neutron's NAT to glean17:47
fungiyeah, like their floating ips weren't forwarding, or maybe the hypervisor hosts to which they were scheduled weren't connected to the network17:48
fungianyway, from this i can get a list of times and server instance uuids we can give to citynetwork17:48
fungithe fact that we try three times to get a reachable node before giving up suggests the ones reported as node_failure results are going to be the tip of the iceberg, since lots more requests probably succeeded on the second or third try17:49
fungithough parsing these out of the launcher log will still be a bit of a challenge since the data (instance uuid and failure exception) is logged on different lines without any common key to tie them together17:52
openstackgerritMerged opendev/git-review master: Switch to default Sphinx theme  https://review.opendev.org/c/opendev/git-review/+/77782517:52
openstackgerritMerged opendev/git-review master: Overhaul Python package metadata and OpenDev URLs  https://review.opendev.org/c/opendev/git-review/+/77782617:53
*** jpena is now known as jpena|off18:02
clarkbzbr: do you know what these tripleo ci ruck rover gerrit accounts are used for? they show up in our list of problem accounts18:02
clarkbthey don't appear to have done any reviews or code pushes in the last year so are currently in our bucket of "disable and move on". But I wonder if there is some sort of interesting usage pattern there we need to account for instead18:03
*** lpetrut has quit IRC18:07
clarkbmwhahaha weshay|ruck ^ you may know as well18:08
weshay|ruckclarkb, hrm.. let me check our scripts... I suspect it's just pulling readonly info... that's all I can think of.. Mind if I go dig for a bit?18:10
clarkbweshay|ruck: ya it isn't a rush. We're currently dealing with the accounts that we're much more confident in modifying :)18:10
clarkbweshay|ruck: I'm just waiting for some reviews though so started skimming the list of ones we're less sure of and that stood out to me as a possible case where we aren't capturing a valid use case properly18:11
weshay|ruckk.. we may be using that account in more than the one place I'm thinking of18:12
clarkbif you can describe how it is used we can try to better account for it in our auditing of the accounts and subsequent classification of them18:13
clarkbcurrently we're looking at reviews and pushed changes in the last year as a starting point18:14
roman_gfungi, I would appreciate a list of times and server instance uuids, so that we can forward it to citynetwork18:19
zbrclarkb: sadly, even with latest changes running the test on fedora seems to generate lots of failure. looking at htop, i believe it attempts to run one gerrit for each test.18:19
roman_gWe are getting issues quite often, and they are usually slow to respond. Still need to find a root cause, and get it resolved on their side.18:20
fungiroman_g: yep, nearly there. i've got ~700 in the past 10 days, just massaging the analysis to present the time the boot call returned, time we saw the instance go ready and time we gave up waiting for ssh to be reachable18:20
zbrweshay|ruck: ^18:21
*** hashar has quit IRC18:21
roman_gfungi save your method to a script somewhere. Might be needed again later.18:23
openstackgerritClark Boylan proposed opendev/system-config master: Remove ze01.openstack.org  https://review.opendev.org/c/opendev/system-config/+/77822718:23
roman_gfungi I'd love CityCloud to respond in realtime like you are doing, so that next time they could debug failing instance launch attempts on the go18:23
clarkbzbr: yes, the current git-review runs a gerrit per test, but out of a prepped work dir18:24
clarkbroman_g: you may want to consider using our normally sized instances then you can run across all the clouds18:24
clarkbroman_g: that will give you far more redundancy18:24
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/778227 is the next step in zuul-executor replacements. Basically want to get the now unused server completely out of the way then start replacing 2-3 at a time18:25
roman_gclarkb I know. I would love to. Devs say they strictly need possibility to launch a few bare nested VMs of a significant size and with a good performance, and boot them, and join into a cluster and execute workloads on those nested VMs. ¯\_(ツ)_/¯18:27
*** ralonsoh has quit IRC18:29
roman_gclarkb we have moved all tests to standard 8GB RAM instances, and only have left one end-to-end test on Gate with 16GB instance with nested VMs. It gets executed on merge only.18:29
clarkbroman_g: sure, usually those tasks can be broken up such that you could for example run a cluster and boot a VM now you know that works. Next you test your workload that you want to run in VMs without the overhead of nested virt (or lack of virt). And that gives you very similar amounts of coverage18:30
clarkb(there are other factors to consider like how are devs expected to run or reproduce tests locally if the tests require significant resources, testable software tends to get tested more often and as a consequence is more robust, etc)18:35
clarkbI'm not going to claim we're perfect at it either :) just pointing out that as a goal it tends to be a good one18:36
*** andrewbonney has quit IRC18:38
*** hamalq has joined #opendev18:38
fungiroman_g: this is the basic list by the way, though only has the times the nova api responded with details for the instance it was booting: http://paste.openstack.org/show/803152/18:42
fungii'm still fumbling to get the times the api reported the instance was ready and the times we gave up waiting to be able to ssh into it18:43
fungier, fumbling to include them (i've got the patterns to grep on, just integrating them into the loop is finicky)18:43
clarkbfungi: the timeout value is a constant and should be pretty consistent18:44
fungiyep, just figured i'd include it because i can, the reported ready time is more important to include though18:45
roman_gfungi that is a list of instances which were not reachable upon launch?18:45
fungiroman_g: yes18:47
fungiand the dates/times the boot requests for them were made18:48
roman_gWe don't get a lot of these kinds of failures now, do we? https://grafana.opendev.org/d/QQzTp6EGz/nodepool-airship-citycloud?orgId=118:51
fungiroman_g: nope, last one i saw in the log was a few days ago18:52
fungiso maybe they've done something to address that issue18:52
clarkbnote those graphs will be for all labels so can be hard to see something label specific, but there are no errors marked there at all18:52
fungithe problem may not be label-specific either. the other labels can be tried in additional providers18:53
fungithis particular label though, if that provider can't boot it, then that's it18:53
clarkbgood point18:54
clarkbit is a label specific user facing impact but may have happened for all the labels there18:54
roman_gFrom the logs here http://paste.openstack.org/show/803152/ I would guess that this is for all labels (i.e. including 8GB), as we don't launch that many 16/32GB instances.18:55
roman_gI'm quite surprised to see so many instances with problems with networking (instance launched but not accessible).18:56
roman_gAny additional Information you would want to pass to CityCloud?18:56
fungiroman_g: well, also remember we retry three times, so only a fraction of those are final tries18:57
roman_gWhat would help them to debug it on their side?18:57
fungiroman_g: that was just a preliminary list, i'm trying to combine the ready times and also when we gave up waiting18:57
roman_gfungi OK.18:57
fungii doubt there's much more besides instance uuids they'd need, since their logs will tell them the hosts where those booted, the ip addresses assigned to them, et cetera18:58
fungii'm including the times mostly to give them an idea for what vintage of logs they should be looking in18:59
*** toomer has quit IRC19:08
*** eolivare has quit IRC19:33
openstackgerritJeremy Stanley proposed opendev/git-review master: Adjust categories for some release notes  https://review.opendev.org/c/opendev/git-review/+/77825719:52
clarkbianw: fungi: if you can look over the external id remove script in https://review.opendev.org/c/opendev/system-config/+/777846 that would be great. I think my greatest concern is somehow accidentally removing external ids from an active account if I get the input file wrong20:01
clarkbthere are guards against that in the code, but let me know if you think they can be stronger20:02
clarkbI need to eat lunch now, but will try to sit back down again to work through those (and the script if it needs updates)20:03
fungiroman_g: http://paste.openstack.org/show/803159/ and http://paste.openstack.org/show/803160/20:10
fungii gave up trying to cram the detail into one line, and had to split it since the full set was more than the paste server would allow20:11
fungiroman_g: also that's strictly limited to nodes we made three attempts to boot. turns out the way we log nodes which succeed on the second or third try makes it hard to confirm with certainty that it was for that particular provider20:12
fungiroman_g: there are a handful where instead of timing out waiting for ssh, the boot request itself failed. there were only a handful, but also we don't get much useful info back when it happens either:20:16
fungiopenstack.exceptions.SDKException: Error in creating the server (no further information available)20:16
fungibasically the nova api just reports an "error" status instead of the instance going to "active"20:17
openstackgerritMerged opendev/system-config master: refstack: trigger image upload  https://review.opendev.org/c/opendev/system-config/+/77818720:17
fungiroman_g: those cases you can recognize because we never start gathering host keys, there's just a ~10-minute delay between waiting for server and failure20:18
fungii'm guessing that's a timeout in nova on their end20:19
fungior something else nova's waiting on20:19
*** hashar has joined #opendev20:30
openstackgerritMerged opendev/git-review master: Adjust categories for some release notes  https://review.opendev.org/c/opendev/git-review/+/77825720:38
*** dviroel has joined #opendev20:55
openstackgerritJames E. Blair proposed opendev/system-config master: Use upstream jitsi-meet web image  https://review.opendev.org/c/opendev/system-config/+/77830821:00
corvusclarkb, fungi: ^21:01
*** bhagyashri|rover has quit IRC21:06
*** bhagyashris has joined #opendev21:07
*** sboyron has quit IRC21:30
clarkbianw: thanks for the review. I think it was mostly because handling the output for account details is a bit more consistent with the other output (its all json)21:33
clarkbianw: but could update it to use that other url request instead.21:34
ianwthat was my only thought, i mean otherwise it lgtm21:34
clarkbthe current issue is I've discovered with the help of fungi that I think the : in external.ids:delete is being percent encoded by requests and gerrit wants a literal :21:34
clarkbso I'm about to go deep into the bowels of requests to see how I can have it stop doing that21:35
clarkb(is it bad that I considered forking out to curl?)21:35
clarkbcorvus: thanks! looks like zuul isn't happy with it for some reason but the change on quick skim lgtm21:36
clarkbanyway, into the python requests depths21:36
clarkbianw: fungi helped me find another bug too where we can have non unique keys as email or account id so I don't want to flatten them into a dict and instead just iterate over them one by one21:37
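The non-unique key problem is easy to reproduce. A small sketch, assuming an input file of whitespace-separated "email account_id" lines; that format, the sample values and read_pairs are assumptions for illustration, not the real tool's input:

```python
def read_pairs(lines):
    """Parse 'email account_id' lines into a list of tuples, keeping duplicates."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        email, account_id = line.split()
        pairs.append((email, int(account_id)))
    return pairs


sample = [
    "a@example.com 1001",
    "a@example.com 1002",
    "b@example.com 1002",
]
pairs = read_pairs(sample)

# Flattening into a dict silently drops an entry either way round.
by_email = dict(pairs)
by_account = {account_id: email for email, account_id in pairs}
print(len(pairs), len(by_email), len(by_account))
```

Iterating over pairs one by one keeps all three deletions, while either dict keeps only two.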
clarkbI'll get things pushed up once I figure out the : thing21:37
clarkbhttps://github.com/psf/requests/issues/1454#issuecomment-20832874 this comment says I get to create request objects from scratch :)21:44
*** brinzhang0 has quit IRC21:47
clarkbhrm doing testing against a local "server" maybe this isn't the issue. I see a literal : after all21:51
clarkbnow I'm really confused because apache logs for gerrit when we were trying this also seem to show we're getting the : as expected21:58
clarkbgerrit is returning "unauthorized" and I thought it was because the url I was hitting was just not valid21:58
clarkbearlier today I did smcginnis' external id removals using curl to the same POST location21:58
clarkband I was authorized then21:58
clarkbI can also fetch external id info which has a similar level of authorization aiui21:59
clarkbthat makes me think that authentication isn't the problem21:59
clarkbI guess I'm going to test this via curl again to double check22:00
clarkbconfirmed using curl continues to work22:06
clarkbhahahahaha I think I figured it out22:06
clarkbI wasn't passing auth info to the post :(22:06
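For reference, a minimal sketch of building such a delete request with credentials attached, using only the standard library rather than requests. It assumes Gerrit's documented "external.ids:delete" account endpoint under the authenticated "/a/" prefix with HTTP basic auth; the host, account id, external id and credentials are placeholders, and the request is only constructed here, never sent:

```python
import base64
import json
import urllib.request


def build_delete_request(base_url, account_id, external_ids, user, password):
    """Build (but do not send) a POST deleting external ids from an account."""
    # The ':' in the path is kept literal; nothing here percent-encodes it.
    url = f"{base_url}/a/accounts/{account_id}/external.ids:delete"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(external_ids).encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Easy to forget on the POST even when the GETs carry it.
            "Authorization": f"Basic {token}",
        },
    )


req = build_delete_request(
    "https://gerrit.example.org",  # placeholder host
    12345,                         # placeholder account id
    ["mailto:old@example.com"],    # placeholder external id
    "admin",
    "secret",
)
print(req.get_method(), req.full_url)
```

Inspecting the built request before sending it makes a missing Authorization header visible without having to guess from an "unauthorized" response.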
hasharclarkb: but you figured it out! :-]22:09
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies  https://review.opendev.org/c/opendev/system-config/+/77784622:10
clarkbhashar: ya, but also I think that is a good indication I should pick this up tomorrow22:10
clarkbI've been working with fungi on this a bit and so fungi has bits paged in, I'll see if fungi wants to be an extension to my brain later and continue, otherwise pick it up in the morning when hopefully my brain is working better22:11
hasharoh fighting with Gerrit emails and external-ids :\22:12
hasharI feel your pain, we had a few issues with it in a past version of Gerrit22:12
clarkbhashar: yup, we have ~640 accounts with email conflicts. Getting through the first set of 35 fixes now22:12
clarkber it's actually 640 emails with account conflicts22:13
clarkbthe total number of accounts is at least twice that22:13
hasharouch :-\22:13
clarkbhashar: trying to whittle it down bit by bit22:13
clarkbanyway ianw fungi https://review.opendev.org/c/opendev/system-config/+/777846 has the auth passed in now as well as the other improvements I mentioned22:13
clarkbI'm going to switch gears shortly and get some other stuff done that having a tired brain is less important for22:14
roman_gfungi Thank you for additional logs! I will email them to CityCloud tomorrow. Will get back to you if I hear anything from them.22:14
hasharclarkb: +1 on using the Gerrit web API. We went messing up directly with the All-Users.git which was not that great. Then we only had to fix a few accounts22:17
hasharclarkb: good luck with less important stuff!22:17
clarkbhashar: ya the nice thing about the api is we can make changes a few accounts at a time without downtime22:18
*** roman_g has quit IRC22:19
clarkbhashar: which means we can make the changes in incremental steps as we are comfortable with them which I like22:19
clarkbhashar: we also think we've found a bug around openid and users changing their email addresses in gerrit. I still need to report that :/22:21
hasharclarkb: :\  at least if you write a nicely detailed report upstream is usually fast to react.  Though for openid I don't know how well maintained it is22:28
openstackgerritMerged opendev/system-config master: Remove afs-admin group  https://review.opendev.org/c/opendev/system-config/+/77812022:40
openstackgerritMerged opendev/system-config master: openafs-client: trim CellServDB  https://review.opendev.org/c/opendev/system-config/+/77812722:57
clarkbthe stevedore caching is creating a sha256 hash of all the entrypoint files in egg info and distinfo dirs and their mtimes22:58
clarkbgiven the significant number of the resulting sha256 key'd cache files we have I'm wondering if it isn't doing that in a stable manner22:58
*** gnuoy has quit IRC23:04
ianwclarkb: yeah i was just having a poke23:06
ianwif you run them through json_pp23:06
ianw"path_values" : [23:06
ianw                                                              >       [23:06
ianw                                                              >          "/tmp/ansible_os_keypair_payload_qstnh5m0/ansible_os23:06
ianwi think that's the reason the hash is unstable23:07
clarkboh it's not supposed to try and cache things in tmp because ansible does that22:07
clarkbmaybe our stevedore is too old then?23:07
ianwyeah, but only on 3.3 which i doubt anything has pulled in23:07
clarkbfwiw the glob.iglob seems stable (though the sort order isn't alnum)23:07
ianwyep; it's definitely the cloud launcher bits.  that's failing on openedge, i thought we had removed that?23:08
clarkbmaybe we failed to remove it from the cloud launcher23:08
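[editor's note] A minimal sketch of why a randomly named temp directory would make a content-derived cache key unstable, as suspected above. This is not stevedore's actual implementation: `cache_key` is a hypothetical stand-in that hashes a sorted list of (path, mtime) pairs, and every path and mtime below is invented.

```python
import hashlib
import json


def cache_key(path_values):
    """Return a sha256 hex digest over sorted (path, mtime) pairs."""
    payload = json.dumps(sorted(path_values)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical entry point file that is identical on every run.
stable = [("/usr/lib/python3/site-packages/foo.dist-info/entry_points.txt", 1614700000.0)]

# Two runs that differ only in the random suffix of a temp payload directory.
run_a = stable + [("/tmp/ansible_payload_aaaa/ansible_os", 1614700001.0)]
run_b = stable + [("/tmp/ansible_payload_bbbb/ansible_os", 1614700001.0)]

same_inputs_match = cache_key(stable) == cache_key(list(stable))
tmp_paths_differ = cache_key(run_a) != cache_key(run_b)
```

Under these assumptions every run with a fresh temp directory yields a new key, and therefore a new cache file, which would be consistent with the large number of cache files observed.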
*** hashar has quit IRC23:10
fungiafter going back over the release notes for git-review i don't think there's a significant need to announce the release candidate and ask people to test: https://docs.opendev.org/opendev/git-review/latest/releasenotes.html (i did also previously go through the list of changes and make sure everything relevant had a release note)23:16
fungibetween having personally used it a bit, and the extensive functional regression testing we do for it in zuul, it should at least be nominally usable23:17
fungithe major version bump is mostly about things we removed (especially python 2.7 support)23:18
clarkbfungi: if you've been using it then ya I think you are probably good23:19
clarkbpeople can always downgrade if necessary23:19
clarkband we can always do a 2.0.1 if necessary as well23:19
fungiyep, and probably will ;)23:20
openstackgerritIan Wienand proposed opendev/system-config master: [wip] handle zuul-summary-results as .jar / per-project config  https://review.opendev.org/c/opendev/system-config/+/77811623:39
