12:59:28 #startmeeting opendev-maint 12:59:29 Meeting started Fri Nov 20 12:59:28 2020 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot. 12:59:30 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 12:59:32 The meeting name has been set to 'opendev_maint' 13:01:23 #status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly two hours from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html 13:01:23 fungi: sending notice 13:04:35 fungi: finished sending notice 13:59:34 #status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly one hour from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html 13:59:34 fungi: sending notice 14:02:45 fungi: finished sending notice 14:25:39 morning! 14:28:24 fungi: I think I'll go ahead and put gerrit and zuul in the emregency file now. 14:31:24 and thats done. Please double check I got all the hostnames correct (digits and openstack vs opendev etc) 14:31:47 and when y ou're done with that do you think we should do the belts and suspenders route of disabling the ssh keys for zuul there too? 14:38:58 fungi: also do you want to start a root screen on review? maybe slightly wider than normal :P 14:40:09 done, `screen -x 123851` 14:41:03 and attached 14:41:03 not sure what disabling zuul's ssh keys will accomplish, can you elaborate? 14:41:25 it will prevent zuul jobs from ssh'ing into bridge and making unexpected changes to the system should something "odd" happen 14:41:42 I think gerrit being down will effectively prevent that even if zuul managed to turn back on again though 14:41:43 oh, there, i guess we can 14:41:52 i thought you meant its ssh key into gerrit 14:42:05 sorry no ~zuul on bridge 14:44:06 sure, i can do that 14:44:21 fungi: just move aside authorized_keys is probably easiest? 14:45:05 as ~zuul on bridge i did `mv .ssh/{,disabled_}authorized_keys` 14:45:47 fungi: can you double check the emergency file contents too (just making sure we've got this correct on both sides then that way if one doesn't work as expected we've got a backup) 14:45:59 my biggest concern is mixing up a digit eg 01 instead of 02 and openstack and opendev in hostnames 14:46:03 I think I got it right though 14:46:41 hostnames in the emergency file look correct, yes, was just checking that 14:47:25 thanks 14:47:41 I've just updated the maintenance file that apache will serve from the copy in my homedir 14:47:42 checked them against our inventory in system-config 14:53:04 I plan to make my first cup of tea during the first gc pass :) 14:58:03 yeah, i'm switching computers now and will get more coffee once that's underway 14:58:45 fungi: I've edited the vhost file on review. When you're at the other computer I think we check that then restart apache at 1500? 14:58:54 then we can start turning off gerrit and zuul 15:00:43 lgtm 15:00:49 ready for me to reload apache? 15:00:53 my clock says 1500 now I think so 15:01:04 done 15:01:17 maintenance page appears for me 15:01:20 me too 15:01:46 next we can stop zuul and gerrit. I don't think the order matters too much 15:01:48 status notice or status alert? 
wondering if we want to leave people's irc topics altered all weekend given there's also a maintenance page up 15:02:05 ya lets not change the topics 15:02:17 if we get too many questions we can flip to topic swapping 15:02:45 #status notice The Gerrit service at review.opendev.org is offline for a weekend upgrade maintenance, updates will be provided once it's available again: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html 15:02:45 fungi: sending notice 15:03:21 if you get the gerrit docker compose down I'll do zuul 15:03:25 i guess we should save queues in zuul? 15:03:31 eh 15:03:35 and restore at the end of the maintenance? or no? 15:03:42 I guess we can? 15:04:00 I hadn't planned on it 15:04:13 given the long period of time between states I wasn't entirely sure if we wanted to do that 15:04:21 i guess don't worry about it. we can include messaging reminding people to recheck changes with no zuul feedback on thenm 15:04:40 gerrit is down now 15:05:57 i'll comment out crontab entries on gerrit next 15:05:58 fungi: finished sending notice 15:06:16 `sudo ansible-playbook -v -f 50 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_stop.yaml` <- is what I'll run on bridge to stop zuul 15:06:36 actually I'll start a root screen there and run it there without the sudo 15:07:18 i've got one going on bridge now 15:07:24 if you just want to join it 15:07:32 oh I just started one too. I'll join yours 15:07:53 ahh, okay, screen -list didn't show any yet when i created this one, sorry 15:08:26 hahaha we put them in the emergency file so the playbook doesn't work 15:08:32 I'll manually stop them 15:08:46 oh right 15:08:49 heh 15:09:48 scheduler and web are done. Now to do a for loop for the mergers and executors 15:11:25 i'll double-check the gerrit.config per step 1.6 15:12:45 clarkb: could probably still do "ansible -m shell ze*"; or edit the playbook to remove !disabled 15:12:54 serverId, enableSignedPush, and change.move are still in there, though you did check them after we restarted gerrit earlier in the week too 15:12:54 but i bet you already started the loop 15:13:43 yup looping should be done now if anyone wants to check 15:13:59 i'll go ahead and start the db dump per step 1.7.1, estimated time is 10 minutes 15:13:59 fungi: ya I expected that one to be fine after our test but didn't remove it as it seemed like a good sanity check 15:14:39 mysqldump command is currently underway in the root screen session on review.o.o 15:15:20 in parallel i'll start the rsync update for our 2.13 ~gerrit2 backup in a second screen window 15:15:45 fungi: we don't want to start that until the db dump is done? 15:15:53 that way the db dump is copied properly too 15:16:02 oh, fair, since we're dumping into the homedir 15:16:05 yeah, i'll wait 15:16:25 i guess we could have dumped into the /mnt/2020-11-20_backups volume instead 15:16:48 oh good point 15:16:53 oh well 15:22:10 it'll be finished any minute now anyway, based on my earlier measurements 15:23:45 mysqldump seems to have completed fine 15:23:53 ya I think we can rsync now 15:23:58 1.7gb compressed 15:24:18 is that size in line with our other backups? 
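The checkpoint being taken here (the step 1.7.1 database dump plus the ~gerrit2 rsync) boils down to two commands along these lines. This is only a sketch: the credentials file, dump filename, and rsync destination directory are assumptions, with only the reviewdb name, the homedir, and the /mnt/2020-11-20_backups volume coming from the log.

    # dump reviewdb from the trove instance into gerrit2's homedir (as done here),
    # or directly onto the backup volume as discussed above
    time mysqldump --defaults-file=/home/gerrit2/.my.cnf --single-transaction reviewdb \
        | gzip > /home/gerrit2/2.13-backup-$(date +%s).sql.gz

    # refresh the standing 2.13 copy of ~gerrit2 after the dump finishes, so the
    # dump itself is captured too (flags and destination path are illustrative)
    time rsync -a /home/gerrit2/ /mnt/2020-11-20_backups/gerrit2-2.13/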
15:24:32 rsync update is underway now, i'll compare backup sizes in a second window 15:24:40 yes it is 15:24:50 I checked outside of the screen 15:25:26 yeah, they're all roughly 1.7gb except teh old 2.13-backup-1505853185.sql.gz from 2017 15:25:35 which we probably no longer need 15:26:03 in theory this rsync should be less than 5 minutes 15:26:39 though could be longer because of the db dump(s)/logrotate i suppose 15:27:27 even if it was a full sync we'd still be on track for our estimated time target 15:27:55 yeah, fresh rsync starting with nothing took ~25 minutes 15:34:44 I think the gerrit caches and git dirs change a fair bit over time 15:34:51 in addition to the db and log cycling 15:35:06 and it's done 15:35:15 yeah periodic git gc probably didn't help either 15:36:05 anybody want to double-check anything before we start the aggressive git gc (step 2.1)? 15:36:25 echo $? otherwise no I can't think of anything 15:37:15 yeah, i don't normally expect rsync to silently fail 15:37:29 but it exited 0 15:37:32 yup lgtm 15:37:36 I think we can gc now 15:37:42 i have the gc staged in the screen session now 15:37:51 and it's running 15:38:05 after the gc we can spot check that everything is still owned by gerrit2 15:38:08 estimates time at this step is 40 minutes, so you can go get your tea 15:38:20 yup I'm gonna go start the kettle now. thanks 15:38:31 i don't see any obvious errors streaming by anyway 15:38:56 keeping timing notes on the etherpad too because I'm curious to see how close the estimates particularly for today are 15:39:46 good call, and yeah that's more or less why i left the time commands in most of these 16:01:15 probably ~15 minutes remaining 16:01:54 I'm back fwiw just monitoring over tea and toast 16:13:08 estimated 5 minutes remaining on this step 16:13:09 it is down to 2 repos 16:13:23 of course one of them is nova :) 16:13:41 the other is presumably either neutron or openstack-manuals 16:13:46 it was airshipctl 16:13:51 oh 16:13:52 wow 16:13:52 I think it comes down to how find and xargs sort 16:14:10 I think openstack manuals was the third to last 16:14:19 looks like we're down to just nova now 16:15:50 here's hoping these rebuilt gerrit images which we haven't tested upgrading with are still fine 16:16:21 I'm not too worried about that, I did a bunch of local testing with our images over the last few months and the images moved over time and were always fine 16:17:00 yeah, the functional exercises we put them through should suffice for catching egregious problems with them, at the very least 16:17:25 then ya we also put them through the fake prod marathons 16:20:02 before we proceed to the next step it appears that the track upstream cron fired? 16:20:11 fungi: did that one get disabled too? 16:20:12 and done 16:20:25 i thought i disabled them both, checking 16:21:05 oh... it's under root's crontab not gerrit2's 16:21:24 we should disable that cron then kill the running container for it 16:22:12 I think the command is kill 16:22:15 like that? or is it docker kill? 
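A minimal sketch of the sanity checks being talked through here, assuming the dumps live in gerrit2's homedir as above:

    echo $?                        # confirm the rsync (or dump) really exited 0
    ls -lh /home/gerrit2/*.sql.gz  # recent dumps should all be roughly the same ~1.7GB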
16:22:17 to line up with ps 16:22:27 yup 16:22:30 okay, it's done 16:22:54 we should keep an eye on those things because they use the explicit docker image iirc 16:23:06 the change updates the docker image version in hiera which will apply to all those scripts 16:23:22 granted they don't really run gerrit things just jeepyb in gerrit so its probably fine for them to use the old iamge accidentally 16:23:37 the only remaining cronjobs for root are bup, mysqldump, and borg(x2) 16:23:38 ok I think we can proceed? 16:24:09 and confirmed, the cronjobs for gerrit2 are both disabled still 16:24:20 we were going to check ownership on files in the git tree 16:24:25 ++ 16:25:18 everything looks like it's still gerrit2, even stuff with timestamps in the past hour 16:25:19 that spot check looks good to me 16:25:45 so i think we're safe (but also we change user to gerrit2 in our gc commands so it shouldn't be a problem any longer) 16:26:09 ya just a dobule check since we had problems with that on -test before we udpated the gc commands 16:26:13 I think its fine and we can proceed 16:26:27 does that look right? 16:26:52 yup updated to opendevorg/gerrit:2.14 16:27:03 on both entries in the docker compose file 16:27:03 okay, will pull with it now 16:27:27 how do we list them before running with them? 16:27:42 docker image list 16:27:51 i need to make myself a cheatsheet for container stuff, clearly 16:28:22 opendevorg/gerrit 2.14 39de77c2c8e9 22 hours ago 676MB 16:28:33 that seems right 16:28:37 yup 16:28:59 ready to init? 16:29:07 I guess so :) 16:29:14 and it's running 16:37:21 around now is when we would expect this one to finish, but also this was the one with the least consistent timing 16:37:36 taking longer than our estimate, yeah 16:37:52 we theorized its due to hashing the http passwds 16:38:01 and the input for that has changed a bit recently 16:38:07 (but maybe we also need entrpoy? I dunno) 16:38:08 should be far fewer of those now though 16:39:22 it seems pretty idle 16:39:42 ya top isn't showing it be busy 16:40:08 the first time we ran it it took just under 30 minutes 16:40:33 could also be that the server instance or volume or (more likely?) trove instance we used on review-test performed better for some reason 16:41:02 the idleness of the server suggests to me that maybe this is the trove instance being sluggish 16:41:39 | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query | 716 | copy to tmp table | ALTER TABLE change_messages ADD real_author INT | 16:41:51 | Id | User | Host | db | Command | Time | State | Info | 16:41:54 ^ column headers 16:42:02 ah ok so it is the db side then? 16:42:03 fungi: so yeah, looks like 16:42:11 yep that's "show full processlist" 16:42:15 in mysql 16:42:15 yeah - sounds like maybe the old db is tuned/sized differently 16:42:33 or just on an old host or something 16:42:38 * fungi blames mordred since he created the trove instance for review-test ;) 16:42:53 totally fair :) 16:42:55 this is one reason why we allocated tons of extra time :) 16:43:02 s/blames/thanks/ 16:43:11 as long as we can explain it (and sounds like we have) I'm happy 16:43:39 though its a bit disappointing we're investing in the db when we're gonna discard it shortly :) 16:44:02 right? 16:44:18 i'll just take it as an opportunity to catch up on e-mail in another terminal 16:46:09 there should be a word for blame/thanks 16:46:47 the germans probably have one 16:47:04 mordred: _____ you very much for setting up that trove instance! 
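The process list output being quoted above comes from a query like the following against the trove instance; the hostname placeholder and credential handling are illustrative, not taken from the log.

    mysql -h <trove-host> -u gerrit2 -p -e 'SHOW FULL PROCESSLIST;'

The Time column (in seconds) and the "copy to tmp table" / "rename result table" states are what reveal that, on this older MySQL, each ALTER TABLE is rebuilding the whole table rather than running as an online operation.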
16:47:06 deutsche has all sorts of awesome words english is missing 16:48:02 schadendanke perhaps? (me making up new words) 16:48:48 doch (the positive answer to a negative question) is in my opinion the greatest example of potentially solvable vagueness in english 16:48:58 yup 16:49:05 omg i need that in my life 16:49:25 it fills the "no, yes it is" 16:49:30 role 16:49:42 somehow english, while a germanic language, decided to just punt on that 16:49:48 yup 16:49:58 I blame the normans 16:50:10 | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query | 1227 | rename result table | ALTER TABLE change_messages ADD real_author INT | 16:50:20 mordred: sshhhh, ttx might be listening 16:50:26 changed from "copy" to "rename" sounds like progress 16:50:44 | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query | 5 | copy to tmp table | ALTER TABLE patch_comments ADD real_author INT | 16:50:48 new table 16:51:13 i wonder what the relative sizes of those 2 tables are 16:51:28 also - in newer mysql that should be able to be an online operation 16:51:36 but apparently not in the version we're running 16:52:13 so it's doing the alter by making a new table with the new column added, copying all the data to the new table and deleting the old 16:52:16 yay 16:52:18 ya our mysql is old. we used old mysql on review-test and it was fine so I dind't think we should need to upgrade first 16:52:23 maybe the mysql version for the review-test trove instance was newer than for review? 16:52:30 fungi: I'm 99% sure I checked that 16:52:32 and they matched 16:52:34 ahh, so that did get checked 16:52:39 but maybe I misread the rax web ui or something 16:52:45 maybe they both did the copy and hte new one is just on better hypervisor 16:53:20 or the dump/src process optimizes the disk layout a lot compared to a long-running server 16:53:39 I'm trying to identify which schema makes this change btu the way they do migrations doesn't make that easy for all cases 16:53:58 they guice inject db specific migrations from somewhere 16:54:00 I can't find the somewhere 16:54:31 anyway its proceeding I'll chill 16:54:31 fungi: yeah - that's also potentially the case 16:54:37 clarkb: they guice inject *everything* 16:54:57 I don't think the notedb conversion will be very affected by that either since its all db reads 16:55:12 so hopeflly the very long portion of the upgrade continues to just be long and not longer 16:55:29 oof, it also looks like they're doing one-at-a-time 16:55:34 | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query | 15 | copy to tmp table | ALTER TABLE patch_comments ADD unresolved CHAR(1) DEFAULT 'N' NOT NULL CHECK (unresolved IN ('Y','N')) | 16:55:39 second update to same table 16:55:57 which, to be fair, is the way we usually do it too 16:56:09 but now i feel more pressure to do upgrade rollups :) 16:56:19 yah - to both 16:56:25 "we" being zuul/nodepool? 16:56:42 er, i guess not nodepool as it doesn't use an rdbms 17:04:15 ya still having no luck figuring out where the Schema_13X.java files map to actual sql stuff 17:04:32 I wonder if it automagic based on their table defs somewhere 17:04:37 fungi: yes (also openstack) 17:05:51 I'm just trying to figure out what sort of progress we're making relative to the stack of schema migrations. 
Unfortunately it prints out all the ones it will do at the beginning then does them so you don't get that insight 17:05:58 i would not be surprised if these schema migrations aren't somehow generated at runtime 17:06:04 corvus: I think nova decided to do rollups when releases are cut - so if you upgrade from icehouse to juno it would be a rollup, but if you're doing CD between icehouse and juno it would be a bunch of individual ones 17:06:41 which seems sane - I'm not sure how that would map into zuul - but maybe something to consider in the v4/v5 boundaries 17:07:18 mordred: ++ 17:07:45 yay! 17:07:58 it's doing the data migrations now 17:08:32 ok cool 17:08:39 looks like it's coming in around 40 minutes? 17:08:40 seems like things may be slower but not catastrophically so 17:08:51 (instead of 8) 17:09:04 142 is the hashing schema change iirc 17:11:32 yup confirmed that one has content in the schema version java file because they hash java side 17:19:14 corvus: is it doing interesting db things at the moment? I wonder if it is also doing some sort of table update for the hashed data 17:19:22 rather than just inserting records 17:19:23 looks like there's a borg backup underway, that could also be stealing some processor time... though currently the server is still not all that busy 17:19:32 ya I think it must be busy with mysql again 17:19:54 db schema upgrades are the boringest 17:20:38 also note that we had originally thought that the notedb conversion would run overnight. Based on how long this is taking that may be the case again, but we've already buitl in that buffer so I don't think we need to rollback or anything like that yet 17:21:09 just need to be patient I guess (something I am terrible at accomplishing) 17:21:31 clarkb: "UPDATE account_external_ids SET" 17:21:47 that looks like what we expect, yeah 17:21:55 then some personal info; it's doing lots of those individually 17:21:55 yup 17:22:18 db.accountExternalIds().upsert(newIds); <- is the code that should line up to 17:22:28 oh you know what 17:22:34 yeah this is the stage where we believe it's replacing plaintext rest api passwords with bcrypt2 hashes 17:22:48 its updating every account even if they didn't have a hashed password 17:22:53 yes 17:22:57 i just caught it doing one :) 17:23:04 List newIds = db.accountExternalIds().all().toList(); 17:23:14 password='bcrypt:... 17:23:17 rather than finding the ones with a password and only updating them 17:23:29 I guess that explains why this is slow 17:23:29 is it hashing null for 99.9% of the accounts? 17:23:40 no it only hashes if the previous value was not null 17:23:41 or just skipping them once it realizes they have no password? 17:23:49 but it is still upserting them back again 17:23:52 rather than skipping them 17:23:54 it's doing an update to set them to null 17:23:58 ahh, okay that's better than, you know, the other thing 17:24:08 (which mysql may optimize out, but it'll at least have to go through the parser and lookup) 17:24:29 corvus: do you see sequential ids? if so that may give us a sense for how long this will take. I think we have ~36k ids 17:24:39 ids seem random 17:25:00 may be sorted by username though: it's at "mt.." 17:25:11 now p.. 17:25:27 so maybe ~halfway 17:26:03 hah, i saw 'username:rms...' 
and started, then moved the rest of the window in view to see 'username:rmstar' 17:26:36 mysql is idle 17:26:53 and done 17:26:54 it reports done on the java side 17:27:07 exited 0 17:27:24 yup from what we can see it lgtm 17:27:36 anything we need to check before proceeding with 2.15? 17:27:46 I think we can proceed and just accept these will be slower. Then expect notedb to run overnight again 17:27:53 57m11.729s was the reported runtime 17:28:01 ya I put that on the etherpad 17:28:38 updated compose file for 2.15, shall i pull? 17:28:51 yes please pull 17:29:20 opendevorg/gerrit 2.15 bfef80bd754d 23 hours ago 678MB 17:29:26 looks right 17:29:29 yup 17:29:46 ready to init 2.15? 17:29:52 I'm ready 17:30:04 it's running 17:31:34 schema 144 is the writing to external ids in all users 17:31:48 143 is opaque due to guice 17:32:01 anyway I shall continue to practice patience 17:32:14 * fungi finds a glass full of opaque juice 17:33:13 the java is very busy on 144 17:33:20 (as expected given its writing back to git) 17:34:15 huh, it's doing a git gc now 17:34:24 only on all-users 17:34:24 of all-users i guess 17:34:26 ya 17:34:27 busy busy javas 17:34:45 you still need it for everything else to speed up the reindexing aiui 17:35:19 sure 17:38:20 this one's running long too, compared to our estimate 17:38:46 but i have a feeling we're still going to wind up on schedule when we get to the checkpoint 17:39:44 151 migrates groups into notedb I think 17:40:09 we baked in lots of scotty factor 17:40:58 ya I think it "helps" that there was no way we thought we'd get everything done in one ~10 hour period. So once we assume an overnight being able to slot a very slow process in there makes for a lot of wiggle room 17:42:10 mordred: you've just reminded me that mandalorian has a new episode today. I know what I'm doing during the notedb conversion 17:42:21 busy busy jawas 17:42:38 haha. I'm waiting until the whole season is out 17:42:48 and done 17:42:56 just under 13 minutes 17:43:04 12m47.295s 17:43:24 anybody want to check anything before i work on the 2.16 upgrade? 17:43:34 I don't think so 17:44:02 proceeding 17:44:31 good to pull images? 17:44:34 2.16 lgtm I think you should pull 17:44:56 opendevorg/gerrit 2.16 aacb1fac66de 24 hours ago 681MB 17:44:59 also looks right 17:45:02 yup 17:45:14 ready to init 2.16? 17:45:29 ++ 17:45:36 running 17:45:57 time estimate is 7 minutes, no idea how accurate that will end up being 17:46:57 * mordred is excited 17:47:18 after this we have another aggressive git gc followed by an offline reindex, then we'll checkpoint the db and homedir in preparation for the notedb migration 17:47:53 this theoretically gives us a functional 2.16 pre-notedb state we can roll back to in a pinch 17:47:54 then depending on what time it is we'll do 3.0, 3.1, and 3.2 this evening or tomorrow 17:48:05 yup 17:49:10 sort of related, I feel like notedb is sort of a misleading name. 
None of the db stuff lives in what git notes thinks are notes as far as I can tell 17:49:12 its just special refs 17:49:26 this had me very confused when I first started looking at the upgrade 17:50:22 yeah, i expect that was an early name which stuck around long after they decided using actual git notes for it was suboptimal 17:53:53 i think we'll make up some of the lost time in our over-estimate of the checkpoint steps 17:54:54 glad we weren't late starting 17:56:14 ++ I never want to wake up early but having the extra couple of hours tends to be good for buffering ime 17:56:39 happy to anchor the early hours while your tea and toast kick in 17:57:17 in exchange, it's your responsibility to take up my slack later when my beer starts to kick in 17:58:33 ha 17:59:37 sporadic java process cpu consumption at this stage 18:01:54 migration 168 and 170 are opaque due to guice. 169 is more group notedb stuff 18:02:07 not sure which one we are on now as things scrolled by 18:02:17 oh did it just finish? 18:02:27 oh interesting 18:02:39 the migrations are done but now it is reindexing? 18:02:45 no, i was scrolling back in the screen buffer to get a feel for where we are 18:03:03 it's been at "Index projects in version 4 is ready" for a while 18:03:15 ya worrying about whati t may be doing since it said 170 was done right? 18:03:19 though maybe it's logging 18:03:32 yeah, it got through the db migrations 18:03:45 and started an offline reindex apparenrly 18:03:56 there it goes 18:03:58 done finally 18:03:59 ya that was expected for projects and accounts and groups 18:04:06 because accountsa nd groups and project stuff go into notedb but not changes 18:04:18 18m19.111s 18:04:29 yup etherpad updated 18:04:39 exit code is zero I think we can reindex 18:04:43 ready to do a full aggressive git gc now? 18:04:48 er sorry not reindex 18:04:50 gc 18:04:57 getting ahead of myself 18:05:00 yup 18:05:06 okay, running 18:05:21 41 minutes estimated 18:05:33 the next reindex is a full reindex because we've done skip level upgrades 18:05:40 with no intermediate online reindexing 18:05:42 should be a reasonably accurate estimate since no trove interaction 18:06:49 and we did one prior to the upgrades which was close in time too 18:07:34 yup 18:30:28 one thing the delete plugin lets you do which I didn't manage to have time to test is to archive repos 18:30:46 it will be nice to test that a bit more for all of the dead repos we've got and see if that improves things like reindexing 18:40:28 down to nova and all users now 18:41:55 yup 18:48:25 done 18:48:32 looks happy 18:48:35 time for the reindex now? 18:48:39 anything we should check before starting the offline reindex? 18:48:50 I don't think so. UNless you want to check file perms again 18:48:53 we want to stick with 16 threads? 18:49:06 yes 18:49:14 I think so anyway 18:49:25 file perms look okay still 18:49:31 one of the things brought up on the gerrit mailing list is that thread for these things use memory and if you overdo the threads you oom 18:49:39 so sticking with what we know shouldn't oom seems like a good idea 18:49:48 its 24 threads on the notedb conversion but 16 on reindexing 18:49:48 yeah, i'm fine with sticking with the count we tested with 18:50:01 okay, it's running 18:50:12 estimates time to completion is 35 minutes 18:51:39 gc time was ~43 minutes so close to our estimate. 
i didn't catch the actual time output 18:52:19 oh I didn't look, I should've 18:52:25 for those watching the screen session, the java exceptions are about broken changes which are expected 18:52:46 ya we reproduced the unhappy changes on 2.13 prod 18:52:52 its just that newer gerrit complains more 18:53:08 stems from some very old/early history lingering in the db 19:01:29 it is about a quarter of the way through now so on track for ~40 minutes 19:01:47 fairly close to our estimate in that case 19:14:09 just over 50% now 19:32:45 just crossed 90% 19:37:39 down to the last hundred or so changes to index now 19:37:54 and done 19:38:05 ~48minutes 19:38:24 47m51.046s yeah 19:38:34 2.16 db dump now? 19:38:43 yup, ready for me to start it? 19:38:49 yes I am 19:39:02 and it's running 19:39:37 then we backup again, then start the notedb offline transition 19:39:43 such excite 19:42:30 it all over my screen 19:42:35 (literally) 19:43:30 o/ 19:43:36 sounds like it's going well 19:43:44 ianw: slower than expected but no major issues otherwise 19:43:52 * fungi hands everything off to ianw 19:44:02 [just kidding!] 19:44:12 we're at our 2.16 checkpoint step. backing up the db then copying gerrit2 homedir aside 19:44:22 the next step after the checkpoint is to run the offline notedb migration 19:44:29 * ianw recovers after heart attack 19:44:34 yeah, i think we're basically on schedule, thanks to minor miracles of planning 19:44:58 whcih is good beacuse I'm getting hungry for lunch and notedb migration step is perfect time for that :) 19:45:42 other than the trove instance being slower than what we benchmarked with review-test, it's been basically uneventful. no major issues, just tests of patience 19:45:44 clarkb: one good thing about being in .au is the madolorian comes out at 8pm 19:45:52 ianw: hacks 19:46:08 * fungi relocates to a different hemisphere 19:47:08 i hear there are plenty of island nations on that side of the equator which would be entirely compatible with my lifestyle 19:48:42 internet connectivity tends to be the biggest issue 19:48:59 i can tolerate packet loss and latency 19:49:06 okay, db dump is done 19:49:10 rsync next 19:49:33 ready to run? 19:49:44 let me check the filesize 19:50:04 still 1.7gb lgmt 19:50:07 I think you can run the rsync now 19:50:09 oh, good call, thanks 19:50:15 running 19:50:49 the 10 minute estimate there is very loose. could be more like 20, who knows 19:50:57 we'll find out :) 19:51:04 if it's accurate, puts us right on schedule 20:01:16 and done! 20:01:22 10m56.072s 20:01:26 reasonably close 20:01:28 \o/ 20:01:33 only one minute late 20:01:49 hopefully not 10% late 20:02:19 well one minut against the estimated 10 minutes but also ~20:00UTCwas when I guessed we would start the notedb transition 20:02:34 okay, notedb migration 20:03:10 anything we need to check now, or ready to start? 20:03:37 just that the command has the -Xmx value which it does and the threads are overridden and they are. I can't think of anything else to check since we aren't starting 2.16 and interacting with it 20:03:43 I think we are ready to start notedb migration 20:04:03 okay, running 20:04:17 eta for this is 4.5 hours 20:04:32 no idea if it will be slower, but seems likely? 
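For reference, the offline migration being started here (step 4.1) is Gerrit's migrate-to-note-db program. The following is only a sketch: the log confirms the -Xmx value is raised and the thread count overridden to 24, but the exact wrapper, heap size, and option spellings used on review are not captured, so treat them as assumptions.

    # hedged sketch of step 4.1; site path matches the git gc command used elsewhere,
    # heap size and --threads option name are assumptions
    sudo -H -u gerrit2 java -Xmx20g -jar /home/gerrit2/review_site/bin/gerrit.war \
        migrate-to-note-db -d /home/gerrit2/review_site --threads 24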
20:05:06 that will put us at 00:35 utc at the earliest 20:05:19 we should check it periodically too just to be sure it hasn't bailed out 20:05:24 i can probably start planning friday dinner now 20:05:35 ++ I'm going to work on lunch as well 20:05:47 also the docs say this process is resumable should we need to do that 20:05:50 I don't think we tested that though 20:06:01 is this screen logging to a file? 20:06:05 yeah, it always worked in the tests i observed 20:06:16 ianw: no 20:06:36 i can try to ask screen to start recording if you think that would be helpful 20:06:56 might be worth a ctrl-a h if you like, ... just in case 20:07:20 what does that do? 20:07:29 (I suspect I'll learn something new about screen) 20:07:31 actually it's a captial-H 20:07:31 done. ~root/hardcopy.0 should have it 20:07:47 clarkb: just keeps a file of what's going on 20:07:52 okay, ~root/screenlog.0 now 20:08:02 TIL 20:08:08 alright I'm going to find lunch now then will check in again 20:08:45 it's mostly something i've accidentally hit in the past and then later had to delete, though i appreciate the potential usefulness 20:35:52 for folks who haven't followed closely, this is the "overnight" step, though if it completes at the 4.5 hour estimate (don't count on it) i should still be around to try to complete the upgrades 20:36:40 the git gc which follows it is estimated at 1.5 hours as well though, will will be getting well into my evening at that point 20:37:21 ya as noted on the etherpad I kind of expected we'd finish with the gc then resume tomorroe 20:37:34 that gc is longer becauseit packs all the notedb stuff 20:37:42 if both steps finish on schedule somehow, i should still be on hand to drive the rest assuming we don't want to break until tomorrow 20:38:04 ya I can be around to push further if you're still around 20:38:11 the upgrade steps after the git gc should be fast 20:38:41 the real risk is that we turn things back on and then there are major unforeseen problems while most of us are done for the day 20:38:43 clarkb, fungi: etherpad link? 20:38:55 https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan 20:39:10 #link https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan 20:39:33 ya I dont think weturn it on even if we get to that point 20:39:38 ooh, thanks for remembering meetbot is listening! 20:39:47 because we'll want to be around for that 20:40:50 i definitely don't want to feel like i've left a mess for others to clean up, so am all for still not starting services up again until some point where everyone's around and well-rested 20:41:10 we might be able to get through the 3.2 upgrade tonight and let it sit there until tomorrow 20:41:32 that seems like the ideal, yes 20:41:41 like stop at 5.17 20:41:55 sgtm 20:42:20 (i totally read that as "stop at procedure five decimal one seven") 20:43:32 ya I think that would be best. 20:43:54 fun fact: this notedb transion is running with the "make it faster" changes too 20:44:01 s/transion/migration/ 20:44:08 i couldn't even turn on the kitchen tap without filling out a twenty-seven b stroke six, bloody paperwork 20:44:18 I got really excited about those changes tehn realized we were already testing with it 21:17:09 hrm the output indicates we may be closer to finishing than I would've expected? 
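The screen keystrokes mentioned here, plus the equivalent way to get logging from the start of a session (session name illustrative); these are key bindings rather than commands, so most of this is comments:

    # inside an attached screen session:
    #   C-a h   writes a one-off snapshot of the window to hardcopy.<n>
    #   C-a H   toggles continuous logging of the window to screenlog.<n>
    # or start a session with logging enabled from the outset:
    screen -L -S gerrit-upgrade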
21:17:29 Total number of rebuilt changes 757000/760025 (99.6%) 21:17:30 i'm not falling for it 21:18:02 ya its possible there is multiple passes to this or someting 21:18:16 the log says its switching primary to notedb now 21:18:50 I will continue to wait patiently but act optimistic 21:21:20 oh ya it is a multipass thing 21:21:25 I remember now that it will do another reindex 21:21:39 built in to the migrator 21:21:50 got my hopes up :) 21:22:59 [2020-11-20 21:21:59,798] [RebuildChange-15] WARN com.google.gerrit.server.notedb.PrimaryStorageMigrator : Change 89432 previously failed to rebuild; skipping primary storage migration 21:23:03 that is the causeof the tracback we see 21:23:13 (this was expected for a number of changes in the 10-20 range) 21:40:07 don't know why kswapd0 is so busy 21:40:13 ya was just going to mention that 21:40:28 we're holding steady at ~500mb swap use and have ~36gb memory available 21:40:48 but free memory is only ~600mb 21:40:49 i've seen this before and a dop_caches sometimes helps 21:41:10 dop_caches? 21:41:11 echo 3 > /proc/sys/vm/drop_caches 21:42:50 dope caches dude 21:43:09 "This is a non-destructive operation and will only free things that are completely unused. Dirty objects will continue to be in use until written out to disk and are not freeable. If you run "sync" first to flush them out to disk, these drop operations will tend to free more memory. " says the internet 21:43:40 * fungi goes back to applying heat to comestible wares 21:43:42 do we want to clear the caches? 21:44:24 presumably gerrit/java/jvm will just reread what it needs back itno the kernel caches when it needs them? 21:44:29 whether or not that will be a problem I don't know 21:44:45 i guess that might avoid having the vmm write out unused pages to disk because more ram is avail? 21:44:57 yeah, this has no affect on userspace 21:45:03 well, other than temporal 21:45:04 except indirectly 21:45:57 (iow, if we're not using caches because sizeof(git repos)>sizeof(ram) and it's just churning, then this could help avoid it making bad decisions; but we'd probably have to do it multiple times.) 21:46:20 (if we are using caches, then it'll just slow us down while it rebuilds) 21:47:02 2019-08-27 21:47:11 * look into afs server performace; drop_caches to stop kswapd0, 21:47:11 monitor 21:47:37 that was where i saw it going crazy before 21:48:07 ianw, clarkb: i think with near zero iowait and low cpu usage i would not be inclined to drop caches 21:48:15 the buff/cache value is going up slowly as the free value goes down slowly. But swap usage is stable 21:48:28 corvus: that makes sense to me 21:48:52 could this be accounting for the cpu time spent performing cache reads? 21:49:24 I'm not sure I understand the question 21:49:38 i don't know what actions are actually accounted for under kswapd 21:50:03 https://www.suse.com/support/kb/doc/?id=000018698 ; we shoudl check the zones 21:50:10 so i'm wondering if gerrit performing a bunch of io and receiving cache hits might cause cpu usage under kswapd 21:50:50 ianw: If the system is under memory pressure, it can cause the kswapd scanning the available memory over and over again. This has the effect of kswapd using a lot of CPU cycles. 21:50:53 that sounds plausible 21:53:06 my biggest concern is that the "free" number continue to fall slowly 21:53:41 do we think the cache value may fall on its own if we start to lose even more "free" space? 
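The cache dropping being weighed here (and, per the discussion that follows, ultimately not run) is just this pair of commands; it only discards clean pagecache, dentries, and inodes, which is why syncing first frees more:

    sync                                        # write out dirty pages so more cache becomes freeable
    echo 3 | sudo tee /proc/sys/vm/drop_caches  # 3 = pagecache plus dentries and inodes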
21:53:49 clarkb: i think that's the immediate cause for kswapd running, but there's plenty of available memory because of the caches 21:54:06 corvus: I see, so once we actually need memory it should start to use what is available? 21:54:41 ah yup free just went back up to 554 21:54:47 from below 500 (thats MB) 21:54:57 so ya I think your hypothesis matches what we observe 21:54:59 clarkb: yeah; i expect free to stay relatively low (until java exits) 21:55:43 but it won't cause real memory pressure because the caches will be reduced to make more room. 21:55:58 in that case the safest thing may be to just let it run? 21:56:51 i think so; if we were seeing cpu or io pressure i would be more inclined to intervene, but atm it may be working as designed. no idea if we're benefitting from the caches on this workload, but i don't think it's hurting us. 21:57:29 the behavior just changed 21:57:39 (all java cpu no kswapd) 21:58:24 it switched to gc'ing all users 21:58:27 then I think it does a reindex 21:59:44 yeah i think that dropping caches is a way to short-circuit kswapd0's scan basically, which has now finished 21:59:47 this is all included in the tool (we've manualyl done it in other context too, just clarifying that it is choosing these things) 22:06:03 also with most of this going on in a memory-preallocated jvm, it's not clear how much fiddling with virtual memory distribution within the underlying operating system will really help 22:06:39 fungi: that 20GB is spoken for though aiui 22:07:08 which is about 1/3 of our available memory 22:07:18 (we should have plenty of extra) 22:08:57 I think this gc is single threaded. When we run the xargs each gc gets 16 threads and we do 16 of them 22:09:11 which explains why this is so much slower. I wonder if jgit gc isn't multithreaded 22:10:44 kids are out of school now. I may go watch the mandalorian now if others are paying attention 22:10:51 I'll keep an eye on irc but not the screen session 22:12:21 I just got overruled, great british bakeoff is happening 22:19:56 i feel for you 22:20:07 back to food-related tasks for now as well 22:39:36 * fungi find noel fielding to be the only redeeming factor for the bakeoff 22:40:19 i've never seen a bakeoff, but i did recently acquire a pet colony of lactobacillus sanfranciscensis 22:40:40 * fungi keeps wishing julian barratt would appear and then it would turn into a new season of mighty boosh 22:42:12 i have descendants of lactobacillus newenglandensis living in the back of my fridge which come out periodically to make more sandwich bread 22:43:36 fungi: i await the discovery of lactobacillus fungensis. that won't be confusing at all. 22:45:03 it would be a symbiosis 22:45:08 apropos the Mandalorian, the plant he's trying to reach is Corvus 22:45:23 the blackbird! 22:47:31 * clarkb checked between baking challenges, it is on to reindexing now 22:47:36 iirc the reindexing is the last step of the process 22:48:09 it is slower than the reindexing we just did. 
I think beacuse we just added a ton of refs and haven't gc'd but not sure of that 22:51:04 ianw: wow; indeed i was looking up https://en.wikipedia.org/wiki/Corvus_(constellation) to suggest where the authors may have gotten the idea to name a planet that and google's first autocomplete suggestion was "corvus star wars" 22:51:14 my son's obsessions are Gaiman's Norse Mythology, with odin's ravens, and the Mandalorian, who is going to Corvus, and I have a corvus at work 22:52:52 ianw: you have corvids circling around you 22:53:29 (actually he's obsessed with thor comics, but i told him he had to read the book before i'd start buying https://www.darkhorse.com/Comics/3005-354/Norse-Mythology-1 :) 22:55:51 wow the radio 4 adaptation looks fun: https://en.wikipedia.org/wiki/Norse_Mythology_(book) 22:57:11 you'll have to enlighten me on gaiman's take on norse mythology, i read all his sandman comics (and some side series) back when they were in print, but he was mostly focused on greek mythology at the time 22:58:06 clearly civilization has moved on whilst i've been dreaming 22:59:20 i think i have most of sandman still in mylar bags with acid-free backing boards 23:01:07 delirium was my favorite character, though she was also sort of a tank girl rip-off 23:04:56 fungi: it's a very easy read book, a few chuckles 23:05:38 you would probably enjoy https://www.audible.com.au/pd/The-Sandman-Audiobook/B086WR6FG8 23:06:15 https://www.abc.net.au/radio/programs/conversations/neil-gaiman-norse-mythology/12503632 is a really good listen on the background to the book 23:09:01 now i'm wondering if there's a connection with dream's raven "matthew" 23:11:24 oof only to 5% now. I wonder if this reindex will expand that 4.5 hour estimate 23:12:07 * clarkb keeps saying to himself "we only need to do this once so its ok" 23:12:32 follow it up with "so long as it's done when i wake up we're not behind schedule" 23:12:36 it is all out on the cpu 23:12:49 we have 16 cpus and our load average is 16 23:12:56 ya its definitely doing its best 23:12:59 sounds idyllic 23:19:34 ideally we get to run the gc today too, I can probably manage to hit the up arrow key a few times in the screen and start that if its too late for fungi :) 23:19:45 but ya as long as thats done before tomorrow we're still doing well 23:19:48 s/before/by/ 23:20:25 yeah, if this ends on schedule i should have no trouble initiating the git gc, but... 23:25:45 if this pace keeps up its actually on track for ~10 hours from now? thats rough napkin math, so I may be completely off 23:26:02 also if I remember correctly it does the biggest projects first then the smaller ones so maybe the pace will pick up as it gets further 23:26:12 since the smaller projects will have fewer changes (and thus refs) impacting reindexing 23:26:21 anyway its only once and should be done by tomorrow :) 00:05:47 counting off time to index 200 changes it does seem to be slowly getting quicker 00:05:59 but that might not be a wide enough sample to check 00:34:12 ~now is when we expected it to be done. It is not done if anyone is wondering. Still slow but maybe slowly getting quicker. I'll keep an eye on it 00:34:33 fungi: corvus: I'll aim to be back around about 15:00 tomorrow as well 00:34:40 but we'll see how I do 00:37:12 ~10k changes in ~17 minutes 00:37:21 not great 00:37:50 but also watching it like this may not be great for my health. I'm gonna take a break 01:29:20 I've discovered that there may actually have been a flag to tell the migrator to not reindex. 
That would have allowed us to do the gc'ing first then manually reindex. But at this point sticking to what we've tested is our best bet I think even if it takes all night 01:29:35 ++ 01:29:44 plan the dive and dive the plan 01:29:56 are you mordred now? 01:30:21 i, um, used to have a long daily commute by train and read pulp adventure novels 01:30:27 ha 01:30:54 for anyone following along I don't erally expect this to finish before I go to bed so that I can kick off the gc 01:31:40 I'll still check on it, but probably try and return tomorrow at 15:00 UTC. Assuming it exits 0 I think fungi you can probably go ahead and start the gc? but wait on others before doing the next steps. Or if you'd prefer to wait for me to be awake I'm cool with that too 01:32:18 clarkb: it's probably going to be fungi that hits the button; but in case i (or someone else) happens to be around first... it's .... 01:32:31 sorry what step? 01:32:51 currently 4.3: time find /home/gerrit2/review_site/git/ -type d -name "*.git" -print0 | xargs -t -0 -P 16 -n 1 -IGITDIR sudo -H -u gerrit2 git --git-dir="GITDIR" gc --aggressive 01:33:09 please run echo $? when this current command finishes so we can confirm it exits 0 01:33:22 during testing we discovered that gerrit commands don't always tell you they have errored when they error :? 01:33:40 so so its echo $? then if 0 step 4.3 from a couple lines above 01:34:21 clarkb: so 4.1 (migrate-to-notedb) that's running now; then 4.2 when that finishes, and if it's zero and nothing seems to be on fire, 4.3 (gc). right? 01:34:39 correct 01:34:51 clarkb: can i 'strikethrough' the steps done on the etherpad? 01:34:58 corvus: yes I think that is fine 01:35:17 done (and i bolded 4.1) 01:35:53 clarkb: have a good evening! 01:36:05 I'll try! :) dinenr then the mandalorian I hope 01:50:18 i just caught up, had two episodes to get through 01:50:36 and yeah, this looks like it's taking a while 01:51:13 i'm planning to fire off the git gc when i wake up, assuming the reindex is even done by then 03:47:28 Reindexing changes: project-slices: 29% (785/2697), 30% (235273/760363) (-) fyi 04:47:34 just crossed 300k 04:54:42 also I've learned that one of the things the wikimedia changes does is shuffle the project "slices" They are supposed to be broken down into smaller chunks to prevent a single repo from dominating the cost like nova 04:55:04 however, that element of randomness may explain why we see times that vary so much ? at least contribute to it 04:56:24 I haven't done as much testing as wikimedia did, but I would be really surprised if it is faster to skip around like that. it seems like you want to keep things warm in the cache 04:56:37 eg do all of nova, then do all of neutron and so on 05:01:24 "It does mean that reindexing after invalidating the DiffSummary cache will be expensive" another tidbit from the source (I wonder if we're in that situation perhaps induced by the notedb migration? 05:09:31 oh neat they also split up slices based on changeid/number not actual ref count 05:09:50 so if you've got lots of changes with lots of refs (patchsets) in certain projects those won't be balanced well 05:11:19 they also use mod to split them up so change 1 and 2 go in different slices and 101 and 102 go in different slices if moddiny by 2. When you probably want them to be in the same slice due to git tree state cache warmth? Anyway thats probably enough java for me tonight. 
There is likely quite a bit of room for improvement in the reindexer to be more deterministic and less reliant on luck 05:17:00 oh and when we tested we would typically start gerrit at 2.16 and maybe that populates the DiffSummary caches? We didn't want to do that this time ebcause to interact with it we'd have to drop our web notice. It would be funny if not starting on 2.16 without notedb was the problem 05:19:24 o/ is there an etherpad with the steps that are occurring and what was done / left to do for those curious people who want to watch from the sidelines ? 05:19:32 (aka me) 05:19:45 mnaser: https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan the bolded item is the one we're on 05:20:46 mnaser: we are currently doing the last part of the notedb migration which is a full reindex (which is going slower than expected but we also planned for this long task to happen during the between days period) 05:21:11 when this is done we git gc all the repos to pack up the notedb contents (makes things faster), then upgrade to 3.0, 3.1, 3.2 and reindex again 05:22:32 Cool! So it sounds like the major migration is done 05:22:55 the actual data migration part is ya. Now its a bucnh of house keeping around that (reindex and gc) 05:23:14 I’d argue that the actual migration into notedb is the trickier bit, indexing is indexing 05:23:17 Awesome 05:24:18 So I assume from now on, Gerrit will no longer use a database server 05:24:33 It will be using purely notedb I guess? 05:24:35 unfortunately that is a bad assumption :P 05:24:42 the accountPatchReviewDb remains in mysql 05:24:52 its the single table database that tracks when you have reviewed a file 05:25:08 but ya one of the changes I have proposed and WIP'd is one to remove the main db configuration from the gerrit config 05:25:46 we'll actually do that cleanup after we're settled on the new version as its ok to have the old db config in place. gerrit 3.2 will just ignore it 05:26:07 Oh I see 05:26:40 So in a way however the database is not that important, you’d just lose track of what patches you reviewed if that db is lost? 05:26:51 what files you have reviewed 05:26:59 the change votes are in notedb 05:27:10 you know when you look at a file diff and it gives you a checkmark on that file? 05:27:18 oh yes 05:27:18 thats all that database is doing is tracking those checkmarks next to files for you 05:27:26 and ya its not super critical 05:28:44 replication to gitea will also take a bit once this is all done as all that notedb state will be replicated for changes 05:28:45 and I guess in terms of scale there’s a few other deployments who have ran at our scale or even bigger :p 05:29:19 oh ouch, that will add a lot of additional data that is replicated across every gitea system 05:29:21 ya I haven't checked recently. I think gerrithub may be similar? But they didn't really exist until notedb was a thing? I may msiremember that. 
I know they were a driving force for it because it meant they could store stuff in github iirc 05:29:42 mnaser: ya the problem is refs/changes/12345/45/meta is where it goes 05:30:05 so you can't replicate the patchets without the notedb content (since git ref spec doesn't allow you to exclude things like that as far as I can tell) 05:30:16 I don't expect it will cause many issues once we get the initial sync done 05:30:25 that will just take some time (in testing it was like 1.5 days) 05:30:30 Looks like gerrithub is in the 500000s of changes 05:30:55 And I think we’re in the 700k’s 05:31:05 760363 05:31:23 we're watching a slow count up to that number on the reindex right now 05:31:37 Doesnt Google have a big installation too? 05:32:05 there is the gerrit gerrit, chrome, and android 05:32:11 however, google doesn't really run gerrit 05:32:21 they use dependency injection to replace a bunch of stuff aiui 05:32:52 so that it ties into their proprietary internal distributed filesystems and databses and indexers etc 05:33:09 The chrome one is at 2.5m wow heh 05:33:23 Oh I see so they’re probably not running notedb 05:33:30 we discovered this the hard way when we did an upgrade once and jgit just didn't work 05:33:43 it turned out that jgit was fine talking to their filesystem/storage/whatever it was but not to a posix fs 05:34:02 and so no one caught it until an open source deployment upgraded 05:34:04 (us) 05:34:18 ouch 06:25:01 i think they're using notedb, but the git data store isn't what mere mortals use 06:25:35 Reindexing changes: project-slices: 49% (1345/2697), 51% (390766/760363) (/) | 06:26:05 that's a timestamped progress status before i go to bed 10:21:28 Reindexing changes: project-slices: 74% (2021/2697), 77% (587125/760363) (-) 10:21:45 25% in ~ 4 hours 10:22:26 that puts it at about 14:00UTC to finish 12:21:06 yeah, awake again and it's claiming around 88% complete now 12:27:12 Reindexing changes: project-slices: 87% (2373/2697), 89% (679414/760363) 13:44:24 99%! 13:46:49 once this wraps up, assuming it looks good, i'll start the git gc and then i need to run out to the hardware store to pick up an order for some tools 14:11:52 1086m41.925s 14:12:10 that's 18h6m42s 14:12:23 exited 0 14:13:05 i've pulled the gc command back up and will start it momentarily 14:13:24 just need to switch computers to double-check our notes 14:18:59 okay, looking good and i've updated our notes to indicate which step we're on, gc is running now 14:19:45 thanks. I'm very slowly waking up but maybe I can take it easy for another hour or teo now 14:19:46 estimated time to completion is 1.25 hours so hopefully done before 16:00 14:20:05 the previous gc times were failry accurate if a few minutes fast iirc 14:20:45 other than the final offline reindex, all the other steps should go quickly 14:21:14 at least up until we start gerrit again, and then there's the replication which will probably take ages 14:21:47 and the long tail of fixing things which are broken (some of which we know about, some of which we likely don't yet) 14:22:39 anyway, not seeing any obvious errors stream by, so i'll take this opportunity to go pick up my order and be back in plenty of time for the rest of the upgrade 14:22:54 thanks again 14:25:07 o/ 15:21:24 okay, i'm back. if the gc finishes at 1.25 hours then that'll be ~12 minutes from now 15:22:16 judging by the cinder runtime when I checked about 5 minutes ago I think it will be longer but not significantly so. 
All the expensive repos seem to be processing at this point 15:24:34 nova, cinder, horizon, manuals 15:24:42 oh and neutron 15:33:50 nova is the only one running now 15:39:56 80m54.544s 15:40:02 exited 0 15:40:26 about 6 minutes logner than estimated much better. 15:41:15 okay, ready for the next pull? 15:41:29 yes that loosk good to me 15:41:59 opendevorg/gerrit 3.0 fbd02764262c 46 hours ago 679MB 15:42:10 that looks about right 15:42:33 if you're ready to run the init I am 15:42:37 running 15:42:54 in testing this was near instantaneous 15:43:08 0m12.344s and exited 0 15:43:13 no error messages 15:43:54 ready for me to work on 3.1 or want to check anything? 15:44:10 I don't think there is anything other than the exit code to check 15:44:19 lets do 3.1. This init doesn't do any schema updates 15:44:46 and pulling 15:45:05 opendevorg/gerrit 3.1 eae7770f89d6 46 hours ago 681MB 15:45:08 lgtm 15:45:33 ready to init with 3.1? 15:45:39 I think so. Can't think of anything else to check first 15:45:46 underway 15:45:59 and done 15:46:01 0m11.280s 15:46:07 exited 0 15:46:44 ready to pull 3.2? 15:46:46 yup 15:47:03 opendevorg/gerrit 3.2 6fdfe303e8df 46 hours ago 681MB 15:47:05 that image lgtm 15:47:26 I think we can do the reindex 15:47:32 running 15:47:32 er no sorry I keep getting ahead of myself 15:47:34 the init 15:47:45 yeah, the init is what i'm running, sorry 15:47:49 the command you have queued looks right :) 15:47:56 okay, running now 15:48:23 0m13.628s and exited 0 15:48:32 *now* it's time to reindex 15:48:46 yup and the command you have up for that lgtm 15:48:57 okay, starting it now 15:49:08 eta 41 minutes 15:49:36 then we start gerrit and begin unwinding things 15:50:41 or 18 hours :/ 15:50:47 yeah, ugh 15:51:54 well, we're already at 1% done so hopefully not 18 hours 15:52:12 ya this is going much quicker just counting off progress at 20 second intervals 15:52:22 we were doing about 200 changes per 20 second interval last night. This just did like 4k 15:52:37 I think the gc'ing helps tremendously 15:53:51 for the unwinding it would be good for others to maybe look over what I've written down again and just sanity check it. I think my biggest concern at this point is any interaction between our ci/cd and gitea replication lag 15:54:18 I believe in cd we pull from gerrit and not gitea so that isn't an issue but I've got us explicitly replicating our infra repos first to mitigate that 15:55:08 as another sanity check our disk utilization has gone up about 5GB since the gc which is what we expected based on testing 15:55:14 93GB -> 98GB on that fs 15:55:31 the unpacked state was about 110GB iirc 15:57:12 already up to 10% much much quicker this time 15:59:40 those exceptions in the screen scrollback are expected (small number of corrupted changes) 15:59:52 gerrit needs its coffee 16:00:33 i'm estimating ~18:00 for completion of this step 16:00:58 oh the rate seems to have just significantly improved 16:01:46 and my math was wrong 16:01:59 corvus needs coffee too? 16:02:14 maybe ~17:00? 16:02:45 ya about another hour by my math 16:03:01 it took ~14 minutes to get to 20% so another 4 blocks of 14 minutes 16:05:13 i have to run an errand; i probably won't be back until after this completes, but i'll check in when i get back and see if there's unexpected issues i can help with 16:17:12 thanks! 16:26:26 it is up to 61% now 16:27:35 I guess the trick with the notedb migration would've been to somehow stop that process prior to reindexing, then garbage collect, then reindex manually. 
Reading the code there is a --reindex flag but it isn't clear to me if you can negate that somehow. Anyway we shouldn't need to do this again so not worth thinking about too much anymore 16:28:24 fungi: not to get ahead of myself, but do you think we should block port 29418 and leave the apache message in place when we first start gerrit? then check that logs indicate it is happy before opening things up? 16:28:50 I did have us starting gerrit before updating apache to check logs but realize that port 29418 would still be accessible 16:37:55 yeah, wouldn't hurt to temporarily remove public access to that port initially, but obviously we shouldn't start up anything which would need access either (like zuul) 16:38:37 i can edit the firewall rules temporarily now to do that. i'll use a second window in that screen session 16:40:17 ya there are a number of things I think we should do before starting zuul in the etherpad 16:40:56 and done 16:40:58 thanks 16:41:18 I'm putting together a list of scripts to update to use the 3.2 image on review.o.o now since it occurred to me that we run manage-project type things periodically iirc 16:41:19 iptables -nL and ip6tables -nL now report no allow rule for 29418 16:41:33 and we don't want them to use the old image (ist actually probably ok for them to use the old image since its the same version of jeepby but I don't want to count on that 16:41:37 (i left the overflow reject rules for 29418 in there for now) 16:42:26 90% 16:43:23 docker-compose.yaml, /usr/local/bin/manage-projects, /usr/local/bin/track-upstream seem to be the files using that variable when I grep in sytem-config 16:43:46 docker-compose is already edited but we should update the other two before starting zuul (I've made a note in the etherpad too) 16:48:20 done in 59 minutes 16:48:25 59m13.719s exited 0 16:48:27 yep 16:48:54 okay, and 29418 is currently blocked so in theory we can start gerrit and check its service logs for obvious signs of distress 16:49:01 yup I think that is our next step 16:49:09 docker-compose up -d 16:49:20 ready? 16:49:23 I guess so 16:50:01 Gerrit Code Review 3.2.5-1-g49f2331755-dirty ready 16:50:23 that plugin manager exception is expected. I believe it is because we don't enable the oplugin manager in our config but have the plugin installed 16:50:44 something to add to the to do list to remove or enable i guess 16:50:48 ya 16:51:30 before we open things up I should add my gerrit admin ssh key. But I think you've had more experience with doing those things so maybe you want to do the force submit of the change if it still looks good to you as well as kick off replication for system-config and project-config? 16:51:39 we want to force merge first then replicate I think 16:51:47 also before we go further let me reread the etherpad notes :) 16:52:13 are you going to be able to do those things without 29418 open? 16:52:25 no I'm saying lets just be ready for that when we open it 16:52:30 oh, sure 16:52:56 before we open things though why don't we fix /usr/local/bin/manage-projects and /usr/local/bin/track-upstream ? 16:53:04 we need to change the image version in those scripts to 3.2 16:53:10 once 29418 is open i can add your openid account to project bootstrappers temporarily so you can add verify +2 and call submit 16:53:49 fungi: do you want to do the script fix in the screen or should I just do them off screen then you can confirm on screen? 16:53:52 do we have a change to update /usr/local/bin/manage-projects and /usr/local/bin/track-upstream already? 
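A sketch of the checks around this first start, assuming the commands are run from the directory holding the gerrit docker-compose.yaml; the compose service name is an assumption.

    sudo iptables -nL | grep 29418    # no allow rule should remain for the SSH API port
    sudo ip6tables -nL | grep 29418
    docker-compose up -d
    docker-compose logs -f gerrit     # wait for the "Gerrit Code Review ... ready" banner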
16:54:17 they're not going to get called until we reenable the crontabs 16:54:19 fungi: yes, the change which we force merge sets gerrit_container_image in ansible vars and that is used in docker-compose and the two scripts 16:54:26 ahh, okay 16:54:28 fungi: manage-projects is called by zuul periodically iirc 16:54:38 so once zuul is up it may try it 16:54:43 well, ansible is still disabled for the server too 16:54:48 oh good point 16:55:02 well I think we should fix it anyway since it's a good sanity check 16:55:13 sure, i can edit those manually for now 16:55:19 my concern in particular is a race between the config management updates and the manage-project updates 16:55:22 I don't know that they always go in order 16:56:38 lgty? 16:56:40 those edits lgtm thanks 16:57:38 ok give me a minute to get situated with auth things then I guess we can turn it on and force merge the config mgmt change then replicate 16:59:39 alright i've got keys loaded and have my totp token 17:00:14 cool, so open 29418 first or undo the maintenance page in apache first? 17:00:19 I think lets undo apache first 17:00:51 does that look correct? 17:00:58 yes, but we also want to remove the /p blocks too 17:01:23 like that? 17:01:26 yup 17:01:48 ready for me to reload apache2? 17:02:09 let me just double check zuul isn't running somehow 17:02:16 k 17:02:30 ps shows no zuul processes on zuul01 17:02:44 I guess we continue unless you can think of anything else 17:02:52 nope, nothing comes to mind 17:03:00 and it's up 17:03:16 i get the webui 17:03:41 signing in 17:04:08 I'm signed in 17:04:30 as my regular user. Did you want to review https://review.opendev.org/c/opendev/system-config/+/762895/1 and maybe be the one to force merge it? 17:04:36 doesn't look like anyone else has voted on it yet 17:04:54 yeah, signed in as my normal user too 17:05:01 firing up gertty 17:05:20 seems to be syncing okay 17:05:24 I'm removing my WIP on that change now 17:06:45 should remember to remind gertty users that now they need to add "auth-type: basic" to their configs 17:09:43 worth noting you actually wanted me to review https://review.opendev.org/c/opendev/system-config/+/762895/2 not /1 17:09:55 took a bit to realize i was looking at an old patch there 17:10:15 oh sorry that's what it redirected me to from my link in the etherpad 17:10:31 because etherpad had the /1 too 17:12:02 no worries, i've voted +2 on it 17:12:11 fungi: ok do you want to submit it or do you want me to? 17:12:24 i can do it, just a sec 17:12:36 you'll need to add the +2 verified too 17:12:41 yep 17:12:49 and workflow +1 obviously 17:13:39 once that force merges I want to see if replication for system-config replicates everything or just that ref 17:13:56 but generally replicate system-config and project-config next I think 17:16:35 fatal: "762895" no such change 17:16:48 d'oh 17:16:56 i was doing it to review-test ;) 17:17:02 * fungi curses his command history 17:17:37 need to open 29418 on review.o.o for this 17:17:46 are we good with that?
17:17:49 yes I am 17:17:59 also you still need the verified +2 (I assume your admin accounts will do that) 17:18:11 it will 17:18:36 fungi: note that rules.v4 is the file now iirc 17:19:14 and if we missed actually blocking 29418 on ipv4 then oh well at this point :) it seems fine 17:19:14 yeah, i'm just keeping rules consistent with it until we confirm and clean up the cruft 17:19:18 kk 17:19:23 i edited all three 17:19:30 gotcha 17:20:35 okay, it's merged and i've removed membership for my admin account from project bootstrappers 17:20:53 now lets see what is being replicated 17:21:34 nothing in the queue so did it only replicate that ref? /me looks at gitea 17:22:04 https://opendev.org/opendev/system-config/commit/2197f11a0f27da3f9bd1c009c84107dc09559f6e yes only that ref 17:22:31 neat 17:22:40 i suppose we need to manually trigger a full replication 17:22:41 what I think that means is we could not replicate anything and let it catch up over time? 17:22:57 ya or we manually replicate. I still think we manually replicate system-config and project-config first though 17:23:15 i can trigger replication for system-config first 17:23:15 probably ripping off this bandaid is the best option to ensure we have plenty of disk on the giteas 17:23:22 fungi: ++ that would be great 17:23:44 triggered 17:24:36 there are already 4 new changes too 17:25:14 hrm system-config is done replicating? that took suspiciously little time 17:27:37 I see the refs on gitea01 though 17:27:52 I wonder if part of the reason we were slow replicating in testing was network bandwidth 17:28:01 could be... 17:28:09 i can trigger project-config next 17:28:12 ++ 17:28:26 done 17:28:42 this might be overoptimization: but we may also want to do nova, neutron, cinder, horizon, openstack-manuals so that we can run the gitea gc after they are done 17:28:44 i assumed we were talking openstack/project-config not opendev/project-config in this case 17:28:47 since they should be the biggest repos 17:28:51 fungi: correct 17:29:03 sure i can do nova next and see what happens 17:29:41 wow it says project-config is done 17:30:14 tailing replication_log in the screen session is probably not useful. it lags waaaay behind because of how verbose the log is 17:30:50 spot checking project-config on gitea01 shows that it seems to have worked too 17:30:57 I see refs/changes/xy/abcxy/meta 17:31:34 but ya lets work through that list I posted just above, check disk usage on gitea01 and run gc on all the giteas if it looks like we expanded disk use a lot 17:31:57 then when we're happy with that trigger full replication then start looking at zuul I guess 17:32:18 the replication log is really, really busy though, are you sure it's not actively replicating everything? 17:32:35 fungi: gerrit show-queue -w says no 17:32:42 strange 17:32:54 if I start a new tail on replication_log it's quiet 17:33:02 I think that's just screen and ssh buffering with large amounts of text? 17:33:32 previously it was very noisy but now it seems to have quiesced, yeah 17:33:58 okay, i'll do nova now 17:34:06 ++ 17:34:09 and it should be running 17:34:37 I see it in the show queue 17:34:43 yeah 17:36:30 I see disk use slowly increasing on gitea01 so it seems to be doing things 17:40:07 status notice The Gerrit service on review.opendev.org is accepting connections but is still in the process of post-upgrade sanity checks and data replication, so Zuul will not see any changes uploaded or rechecked at this time; we will provide additional updates when all services are restored.
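(The per-project replication triggers here go through the replication plugin's ssh command; a sketch, with "admin" standing in for an account that holds the Start Replication capability.)

```
# Queue replication for one project, optionally blocking until it has
# finished (--wait), then watch Gerrit's work queue drain.
ssh -p 29418 admin@review.opendev.org replication start opendev/system-config --wait
ssh -p 29418 admin@review.opendev.org gerrit show-queue -w
```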
17:40:16 something like that ^? 17:40:20 sounds good to me 17:41:16 nova replication is done according to show queue and disk use increased by about a gig so ya I think doing some of these big ones first, gc'ing then doing everything is a good idea 17:41:19 #status notice The Gerrit service on review.opendev.org is accepting connections but is still in the process of post-upgrade sanity checks and data replication, so Zuul will not see any changes uploaded or rechecked at this time; we will provide additional updates when all services are restored. 17:41:19 fungi: sending notice 17:42:04 okay, i'll do openstack-manuals next 17:42:09 ++ 17:42:15 and it's running 17:42:54 and honestly at the rate these have gone I think we should start global replication, benchmark it, then see if we can wait a bit before starting zuul since it seems quick. If benchmarks say it will be all day then nevermind 17:43:33 sure 17:43:49 since that will rule out any out of sync unexpectedness 17:44:24 manuals is done 17:44:30 fungi: finished sending notice 17:44:43 neutron next? 17:44:43 I think you can just enqueue the others in the list and let gerrit figure out ordering 17:45:02 I would just tell it to do neutron cinder and horizon now 17:45:43 yup, was just finding your original list in scrollback 17:46:02 that list is based on which things were slow to gc which implies more data/more refs 17:46:04 triggered all three 17:47:48 horizon is done, neutron and cinder still running 17:48:58 * mnaser is playing around gerrit right now 17:49:21 just be aware zuul is still offline 17:49:31 fungi: yep! i'm just trying to see if the gerrit functionality itself seems to be okay 17:49:43 thanks, appreciated! 17:49:47 i am noticing a few things, none are critical of course, but "oh, interesting" type of things 17:49:58 mnaser: ya I expect a lot of that :) 17:50:00 sure, i'm going to hate the new ui for a while i'm sure 17:50:08 i.e. anything except verified/code-review/workflow are under this thing called "Other labels" 17:50:08 polygerrit adds a bunch of new excellent features and some not so great things 17:50:18 so roll call votes in governance are under "Other labels" 17:50:46 backport candidate patches seem to be affected too, not a big deal but maybe good for us to know how it decides what's other and what's not 17:50:48 but where we were was a dead end so we're ripping the bandaid off and going to try and work upstream and with plugins etc to make stuff better 17:50:57 mnaser: have a link to a change so we can see that? 17:51:11 sure -- https://review.opendev.org/c/openstack/governance/+/760917 17:51:38 also i'm noticing that the gitweb links are broken, probably worth working on a proper link to gitea to replace those anyway 17:51:56 you can see rollcall-vote is under other labels, so is code-review in there (but i guess maybe that's cause code-review doesn't mean anything for merging inside openstack/governance) 17:52:32 might be a good time to start a post-upgrade notes etherpad where we can collect lists of things which have changed people might ask about, and things we know are broken which will either be fixed or removed 17:52:45 yeah, i can start putting a few things in there too 17:53:08 ++ 17:53:08 some other minor things are the ordering of code review comments 17:53:15 it seems to be verified, code-review then workflow 17:53:34 I think it was that way before?
17:53:41 I've already forgotten 17:53:49 i remember you would see code-review, verified, workflow in the list 17:53:55 zuul always came in the middle, workflow was always at the end 17:54:08 (in the display of votes at least) 17:54:21 fungi: ok those replications are done and we're using 4gb extra disk. I'll trigger the gc cron on all of the giteas now? any other repos you think we should replicate first? 17:54:28 * corvus checking in 17:54:58 corvus: tl;dr is gerrit is up and seems ok so far. replication is much quicker than anticipated. We are manually triggering replication for "large" repos so that we can gc on the giteas to pack back down again then start global replication 17:55:13 after that we'll be looking at zuul 17:55:42 i've started a pad here https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes 17:55:44 ++ 17:55:47 mnaser: ^ 17:56:09 fungi: cool i'll fill those out 17:56:22 clarkb: i agree, git gc on gitea next 17:56:26 corvus: fungi any other repos we should manually replicate? We have done system-config project-config nova neutron cinder horizon and openstack manuals 17:56:31 then we can do a full replication 17:56:48 can't think of any others 17:56:49 fungi: k will give corvus a minute to bring up any other repos that may be worth doing that to then I can do the gitea gc'ing 17:56:55 cool I'll work on gitea gc'ing now 17:57:01 just to avoid overrunning the fs with all of them at once 17:57:12 thanks! 17:57:43 something i remember broke last time we did an update was all the bp topic links from specs 17:57:46 i just tested one and it's working just fine 17:57:59 specifically: https://review.opendev.org/#/q/topic:bp/action-event-fault-details from https://blueprints.launchpad.net/nova/+spec/action-event-fault-details as an example 17:58:25 oops 17:58:28 i found our first broken 17:59:24 Directly linked changes are redirecting to an incorrect port, Example: https://review.opendev.org/712697 => Location: https://review.opendev.org:80/c/openstack/nova/+/712697/ 17:59:28 i added that to the etherpad 17:59:59 i remember fixing that inside our gerrit installation actually, let me find 18:00:26 that could be related to the thing fungi linked about after the bug fixing this week 18:00:32 mnaser: that may be a known issue, at least wmf and eclipse ran into it and filed bugs 18:00:37 if i remember right, we did this: `listenUrl = proxy-https://*:8080/` 18:00:43 or maybe that was for https redirection stuff 18:00:50 apparently we can fiddle the proxy settings in apache if it's the same issue 18:00:53 * fungi checks notes 18:01:06 all 8 giteas are gc'ing now 18:02:01 mnaser: can you see if it looks like https://bugs.chromium.org/p/gerrit/issues/detail?id=13701 18:02:10 using /c/number works fwiw 18:02:16 that may be an easy workaround for now if necessary 18:02:35 if so the solution is supposedly "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}" in our vhost config 18:03:17 clarkb: well, the /# links are supposed to be "permalinks" so i don't think "use /c" is an easy solution (the problem is existing links point there) 18:03:18 that makes sense 18:03:27 corvus: yup we should fix it 18:03:54 fungi: x-forward-proto makes sense to me 18:03:55 "X-Forwarded-Proto is now required because of underlying upgrade of the Jetty library, when Gerrit is accessed through an HTTP(/S) reverse-proxy." 18:03:57 I think I have figured out why replication timing is so much better.
it's because we're not replicating all the actual git content now 18:04:02 indeed, so yes, that does all make sense 18:04:26 anyone writing an x-forwarded-proto change? 18:04:42 I'm not 18:04:58 looks like i am :) 18:05:03 in fact I need to find something to drink. back shortly 18:05:17 * mnaser keeps looking 18:05:25 i kind of want to ninja the fix in first just to make sure it works 18:05:42 corvus: please feel free to hand-patch it into the config first 18:05:50 k will do both 18:06:17 i agree the change isn't much good if the fix turns out to be incorrect for our deployment for some reason 18:06:32 i'll add the "you'll need a new version of git-review" to "what's changed" 18:06:36 as i guess that might come up 18:08:37 mnaser: redirect look good now? 18:08:59 corvus: yes! working in my browser and curl shows the right path too 18:09:12 yay 18:10:29 seems like gerritbot is not posting changes 18:10:33 mnaser: i think it's git-review>=1.26 18:10:35 i am not sure if that's cause it's turned off or 18:10:50 it probably needs to be restarted now that the event stream is accessible 18:10:56 i'll do that now 18:11:13 corvus: fwiw if you're looking at the vhost I think there may be old cruft in there we should clean up. I always get lost when looking though 18:11:15 fungi: i see the 1.27.0 release notes have: "Update default gerrit namespace for newer gerrit. According to Gerrit documentation for 2.15.3, refs/for/'branch' should be used when pushing changes to Gerrit instead of refs/publish/'branch'." -- is it not that change? 18:11:19 remote: https://review.opendev.org/c/opendev/system-config/+/763577 Add X-Forwarded-Proto to gerrit apache config [NEW] 18:11:43 gerritbot has been restarted 18:11:44 corvus: fungi ^ should we force merge that one too? 18:11:54 clarkb: i see comments related to upgrade i will address them 18:12:13 corvus: well the upgrade things should be handled 18:12:18 well look at that, i can now post emojis in my changes without a 500 18:12:18 :P 18:12:21 as part of the earlier force merge 18:12:31 mnaser: thanks, yeah 1.27 sounds right, i was going from memory 18:13:19 clarkb: oh, er, what do you want me to do? 18:13:35 clarkb: i agree that the TODO lines have been removed in system-config master 18:13:37 I'm more thinking about what I think is old gitweb config. I don't think it needs doing now. I just mean someone that groks apache better than me should look at that vhost and audit it 18:13:48 clarkb: i have manually removed them from the live apache config 18:13:51 as there may be a few cleanups we can do 18:13:53 corvus: thanks 18:13:58 but they were already commented out, so that should all be a noop 18:14:21 is the "links" part in the gerrit change display something that is customizable by the deploy (where gitweb currently is listed?). if so, probably would be neat if we added a "zuul builds" link which went to a prefiltered zuul build search using the changeid! 18:14:30 the gitea gc's are still going. The cron only does one repo at a time 18:15:00 mnaser: you can probably write a plugin for that 18:15:15 ok i see, so the gitweb link comes from a plugin 18:15:25 mnaser: gitweb is built in but gitiles is a plugin 18:15:28 aiui 18:15:49 https://review.opendev.org/q/hashtag:%22dnm%22+(status:open%20OR%20status:merged) tags stuff working pretty neatly too 18:15:58 but i feel like we should consider replacing that with a link to gitea anyway if we can 18:16:02 mnaser: https://review.opendev.org/Documentation/dev-plugins.html#links-to-external-tools may be relevant?
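(The X-Forwarded-Proto fix hand-patched above and proposed in 763577 boils down to one header directive in the review vhost; a minimal sketch, assuming mod_headers is loaded and that the directive sits alongside the existing proxy rules.)

```
# Added inside the review.opendev.org https vhost (exact placement assumed):
#   RequestHeader set X-Forwarded-Proto "https"
# or, matching the form quoted from the upstream bug:
#   RequestHeader set X-Forwarded-Proto expr=%{REQUEST_SCHEME}
# then syntax-check and reload:
sudo apache2ctl configtest
sudo systemctl reload apache2
```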
18:16:11 looks like we'd need to do a tiny plugin 18:16:32 ou, that's pretty cool and seems like it would be quite straightforward too 18:17:04 not sure if that's the right interface to put it in the 'links' section 18:17:14 but seems pretty close to that 18:17:29 could incorporate that into the zuul plugin 18:18:00 fungi: https://review.opendev.org/c/opendev/system-config/+/763577 lgtm if you want to review that one and force merge it too? 18:18:03 speaking of which https://gerrit.googlesource.com/plugins/zuul/ 18:18:16 https://review.opendev.org/c/openstack/project-config/+/763576 seems to work pretty well too for a WIP change that is accessible :) 18:18:20 also https://gerrit.googlesource.com/plugins/zuul-status/ 18:18:55 btw gertty has half-implemented support for hashtags 18:19:00 i will be motivated to finish it now :) 18:19:08 mnaser: ya one followup we can look at doing is removing workflow -1 18:19:53 it seems like i see some 3pci still reporting to cinder, so they're probably 'just fine' 18:20:52 763577 is merged 18:20:58 it looks like you can mark a change as private, which i guess can be useful 18:20:58 yup and gerritbot reported it 18:21:09 mnaser: hrm I think we should actually disable that 18:21:11 indeed it did 18:21:17 yeah i remember it was disabled before 18:21:27 well, "drafts" were disabled 18:21:28 mnaser: I don't want people assuming "private" is really "private" until we can check it 18:21:34 ya private is a newer thing iirc 18:21:44 i wonder if you can enable it per project too, or for specific users 18:21:44 but gerrit removed drafts and replaced them with two features, private changes and work in progress status 18:21:55 would be really nice for embargo'd security changes 18:22:10 "Do not use private changes for making security fixes (see pitfalls below)" 18:22:13 no it won't be :P 18:22:20 aha 18:22:21 this is why I don't want it enabled if we can disable it 18:22:31 drafts was a honeypot and private will likely be too 18:22:34 i'll add it to "what's changed" for now 18:23:01 https://gerrit-review.googlesource.com/Documentation/intro-user.html#private-changes that quote is from there 18:23:37 we can set change.disablePrivateChanges to true 18:23:40 yeah, it's an attractive nuisance 18:23:45 i agree we should disable it 18:23:45 in gerrit.config 18:23:53 i can write that change now 18:23:58 thanks 18:24:32 i moved my test change back to public in case that causes some issue about disabling it with a private change already there 18:25:03 mnaser: thanks, though I expect it's fine. usually that stuff gets enforced when you push 18:25:11 similar to how we disabled drafts, the old drafts were fine 18:25:16 giteas are about done gc'ing 18:25:32 what change topic were we using for upgrade-related changes? 18:26:21 oh that's quite cool -- if you look at a change diff and click on "show blame", it shows the blame and you can click to go to the original change behind it 18:26:33 nifty 18:26:36 gitea06 only has 19GB free disk. I'm going to look at that as it's much lower than the others 18:26:57 fungi: I was using gerrit-upgrade-prep 18:27:05 thanks 18:27:07 and have a couple of wip changes there that we should land once we're properly settled 18:27:49 fungi: I wonder if we shouldn't manually apply that change and force merge it too 18:28:13 it'll need a service restart 18:28:19 ya 18:28:26 probably not is the best time for those?
18:28:28 s/not/now/ 18:28:34 since we're telling people it's not ready yet 18:30:53 change is 763578 18:31:00 i'll hand edit the config now and restart gerrit 18:32:05 i have the line added in the screen session if you want to double-check 18:32:22 screen looks correct 18:32:30 it looks like a change owner can set an assignee for their change 18:32:34 I'm still trying to sort out gitea06 disk 18:33:13 i'm not too sure what an assignee really .. means 18:33:53 the gitea web container is using 20gb of disk in /var/lib/docker/containers 18:33:58 that may be to support workflows where reviewers are auto-assigned 18:34:02 mnaser: ^ 18:34:06 which should be separate from the bind mounted stuff which is where we expect data to go 18:34:21 okay, restarting the service 18:34:29 mnaser: i'm also not sure who's supposed to check the "resolved" box for comments. the author or the reviewer? 18:34:34 I expect that if I restart gitea on 06 that will clean up after itself 18:34:47 mnaser: we'll have some more cultural things to figure out 18:34:56 but maybe I should exec into it first and figure out where the disk is used 18:35:10 corvus: yep, as the assignee of a change seems to be a 1:1 mapping too 18:35:29 clarkb: i'd probably see why it ran away with so much disk space in the first place out of my curiosity :) 18:36:33 it is the log file 18:37:06 I'll compress a copy into my homedir then down up the container? 18:37:23 wfm 18:37:30 hmmmmmmm 18:37:38 you can change your full display name inside gerrit right now 18:37:46 mnaser: you always could 18:38:02 oh, i thought you could change the formatting 18:38:09 some people would stick their irc nicks in there or put away messages 18:38:10 nope, always was allowed 18:38:15 ah got it 18:38:28 but now away messages are unnecessary, because... 18:38:37 you can set your status! 18:38:42 indeed 18:39:10 actually what's changed around the name is that it has a separate "display name" and "full name" 18:39:22 you can change them both 18:39:33 used to just be a full name 18:40:16 unrelated but 18:40:29 the static link url to the CLA is well, ancient 18:40:30 https://review.opendev.org/static/cla.html 18:40:44 "you agree that OpenStack, LLC may assign the LLC Contribution Agreement along with all its rights and obligations under the LLC Contribution License Agreement to the Project Manager." 18:40:53 mnaser: technically still accurate 18:41:03 openstack, llc? :p 18:41:42 mnaser: yep 18:41:47 hrm using xz because I don't want a 2GB gzip file 18:41:49 but this is slow 18:41:59 fungi: has gerrit restarted? 18:42:03 well IANAL but if it works, it works 18:42:08 mnaser: section #9 contains the previous icla 18:42:17 because lawyers 18:42:27 it's an icla within an icla 18:42:31 clarkb: yes 18:42:45 in theory private changes should no longer appear as an option 18:43:26 fungi: cool the change for that lgtm. I +2'd it if you want to force merge it too 18:43:56 once I've got gitea06 in a good spot I think we're ready to start replicating more things 18:44:08 I'll give it say 5 minutes on the xz but if that isn't done switch to gz? 18:44:27 mnaser: the short answer is that agreeing to the new license agreement carries a clause saying you agree that contributions previously made under the old agreement can be assumed to be under the new agreement, and part of doing that is specifying a copy of the old agreement 18:45:00 i'll merge the private disable change now 18:46:02 clarkb: care to add a workflow +1?
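(The knob being flipped here is a single gerrit.config setting; a sketch of the hand edit, with the site path a guess. A service restart is what actually makes it take effect.)

```
# gerrit.config is git-config format, so git config can read and write it.
SITE=/home/gerrit2/review_site   # assumed site path
git config -f "$SITE/etc/gerrit.config" change.disablePrivateChanges true
git config -f "$SITE/etc/gerrit.config" --get change.disablePrivateChanges
```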
18:46:18 done 18:46:20 thanks 18:46:40 and now it's merged and i've removed my admin account from project bootstrappers again 18:46:47 thanks for taking care of that 18:46:51 np 18:47:35 do we have an 'opendev' plugin in ue? 18:48:01 i was researching on how to add the opendev logo and replace 'Gerrit' by 'OpenDev', found out it was possible by writing a style plugin 18:48:10 i've found the one used by chromium -- https://chromium.googlesource.com/infra/gerrit-plugins/chromium-style/+/refs/heads/master 18:48:11 nope, though that raises the question whether we'd want an aio plugin for all our stuff or separate single-purpose plugins 18:48:27 what is ue? 18:48:42 i assumed he meant "use" 18:48:57 ah yes, in use indeed 18:49:49 mnaser: no what we came to realize was that if we tried to get every single thing like that done before we did the notedb transition in particular we'd just be making it harder and harder as more changes land 18:50:01 clarkb: oh yes, of course, i agree :) 18:50:06 instead it felt prudent to upgrade, then figure out what we need to change as we're able to roll ahead with eg the 3.3 release 18:50:16 that comes out next week, maybe we'll upgrade week after 18:50:25 by the way, funny thing 18:50:36 in that plugin `if (window.location.host.includes("chromium-review")) {` 18:50:39 `} else if (window.location.host.includes("chrome-internal-review")) {` 18:51:00 https://chrome-internal-review.googlesource.com/ i wonder where this little guy goes :) 18:51:47 behind a firewall/vpn you can't reach, no doubt 18:52:15 it's likely full of googlicious goodness 18:52:48 ok it's been more than 5 minutes and xz is still going. I'm going to stop it and see how big a gzip is 18:53:21 xz takes a lot more memory/cpu to compress than gzip 18:53:40 so not surprising 18:53:50 gz will probably still make it nearly as small 18:54:15 fungi: I went with xz to start because compressing journald logs is significantly better with it than gzip 18:54:21 that is why the devstack jobs use xz for that purpose 18:54:28 like an order of magnitude 18:55:02 woah really? 18:55:20 ya 18:55:25 i rarely see xz get that much of an advantage over gz. maybe 25% 18:55:39 order of magnitude is impressive indeed 18:55:39 it's like 30MB xz and 200MB gzip iirc 18:56:02 rest of the gerrit looks pretty good to me so far in terms of functionality at this point, i'll come try 'break' things again once zuul is back up :) 18:56:08 i guess it's on super repetitive stuff 18:56:19 * mnaser goes for a walk 18:56:20 gl! 18:56:22 thanks again mnaser! 18:56:41 i'm going to need to break in an hour to light the grill and start cooking dinner 18:57:14 corvus: ^ if you're still around any thoughts on the zuul startup process I have on the etherpad? 18:57:31 fungi: ya lunch here is in about an hour and I barely ate breakfast so should have something too 18:58:57 ok gzip is done. took 18GB down to 1.2GB so it's probably going to give us more than enough space. I'm stopping gitea06 now using the safer process in the playbook 19:00:29 yup 35GB available now which I think is plenty 19:00:56 fungi: corvus I think we are ready to trigger global replication now. Gitea01 has the least free disk at 27GB but our git repo growth was about 15GB so I expect that to be plenty 19:01:12 fungi: ^ do you want to trigger that if you agree we're good? 19:01:18 sounds good, i can trigger it as soon as you're ready 19:01:58 I guess I'm as ready as I will be.
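(For the runaway gitea container log, the tradeoff above is roughly: xz compresses very repetitive text far better but at much higher CPU cost, so gzip won for an 18GB one-off. A sketch, with the container log path reduced to a placeholder.)

```
# CONTAINER_LOG stands in for the json-file log under
# /var/lib/docker/containers/<container id>/ on gitea06.
CONTAINER_LOG=/var/lib/docker/containers/CHANGEME/CHANGEME-json.log
gzip -c "$CONTAINER_LOG" > ~/gitea-web-json.log.gz   # 18GB down to ~1.2GB here
```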
gitea06 is up now 19:03:04 i've done `replication start --all --now` 19:03:16 I see things getting queued up in show-queue 19:05:00 it doesn't seem to load the queue items as quickly as before 19:05:05 the number is still climbing 19:06:02 heh it's stream-events showing the replication scheduled events for everything 19:08:43 peaked at just over 17k events in the queue 19:08:48 number is falling now (slowly) 19:10:09 I'm going to remove the digest auth option from all our zuul config files as the default is basic 19:10:25 this is required before we start zuul back up again, but I will wait on zuul startup until we've got eyeballs 19:12:31 looks like it may only be necessary on the scheduler? the others have it but no corresponding secret. I'll do the others for completeness 19:15:45 sounds right 19:16:09 just under 16k events now so whatever that comes out to for replicating 19:16:21 only the scheduler performs privileged actions on gerrit, the other services just pull refs (at least in our deployment) 19:16:34 clarkb: looking re zuul 19:17:44 clarkb: 6.4.1 and 6.4.2? 19:17:56 corvus: ya 19:17:58 clarkb: i think 6.4.2 is done already, right? 19:18:10 yup and 6.4.1 is done as of 30 seconds ago 19:18:26 I guess the question for you is do you think we should start zuul now or wait or do other things first? 19:18:35 zuul can't ssh into bridge to run ansible right now 19:18:50 so we should be able to bring it up, have it run normal ci jobs, be happy with it then work to reenable cd? 19:19:07 clarkb: sgtm. i can't think of a reason to delay 19:19:54 looks like zuul_start.yaml starts the scheduler, then web, then mergers, then executors 19:20:05 do we want to hack up a playbook to not exclude disabled or do it more manually? 19:20:38 clarkb: i'd just hack out disabled then run that 19:21:08 ok I think it has to be in the same dir as what we run out of because it includes other roles? 19:21:16 I guess that's fine because nothing is updating system-config on bridge right now 19:22:01 are we planning on relying on ansible to undo the commented-out cronjobs or should we manually uncomment them (and when)? 19:22:36 fungi: I was going to rely on ansible 19:22:42 track-upstream isn't super critical 19:23:11 actually lets uncomment them because the gc'ing and the log cleanup is good to have 19:23:23 we can probably do that now? 19:23:41 corvus: fungi: I've got an edited zuul start playbook in the root screen on bridge 19:23:53 that is a vim buffer if you want to take a look at that before we run it 19:23:54 okay, i'll uncomment the cronjobs now 19:24:18 playbook in bridge root screen lgtm 19:24:19 down to 14.7k replication tasks now 19:24:24 clarkb: lgtm 19:24:33 clarkb: remember -f 20 :) 19:24:38 corvus: ++ 19:24:55 or 50 is fine :) 19:25:00 heh, 50 it is 19:25:01 -f lots 19:25:13 that command was in the scrollback so easy to modify 19:25:13 * fungi fasts fireball 19:25:17 does that command look good to yall? 19:25:19 er, casts 19:25:27 yeah, looks fine 19:25:30 ++ 19:25:31 ok running it 19:26:07 success!
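(The hacked-up start amounts to copying the normal start playbook, dropping the !disabled host exclusion so the emergency-listed zuul hosts still match, and running it with a wide fork count. The copy-and-edit shown is an assumption about how the edit was made; the playbook path mirrors the stop command quoted earlier.)

```
# In the root screen on bridge:
cd /home/zuul/src/opendev.org/opendev/system-config/playbooks
cp zuul_start.yaml zuul_start_all.yaml
# hand-edit zuul_start_all.yaml to drop the ':!disabled' host exclusion
# (it needs to stay in this directory so its role includes still resolve),
# then run it against everything with plenty of forks:
ansible-playbook -v -f 50 zuul_start_all.yaml
```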
19:26:21 looks happy 19:26:27 now to see what the running service is like 19:26:43 executors are deleting stale dirs 19:27:20 2020-11-21 19:25:55,459 DEBUG zuul.Repo: Updating repository /var/lib/zuul/git/opendev.org/inaugust/src.sh 19:27:25 crontabs edited in root screen session on review.o.o if anyone wants to double-check those 19:27:34 that is not going as quickly as i would expect 19:28:07 i wonder if zuul is going to have to pull a lot of new refs 19:28:21 oh okay, things are moving now 19:28:31 i think we might have been stuck at branch iteration longer than i expected 19:28:46 ie, the delay wasn't git, but rather the rest api querying branches 19:29:15 cat jobs are proceeding 19:29:29 this takes about 5-10 minutes typically iirc 19:31:22 i'm seeing a number of errors in gertty 19:31:26 I moved my temporary playbook into my homedir to avoid any trouble that may cause system-config syncing when we get there 19:31:54 i have no reason to think they are on the gerrit side; more likely minor api tweaks 19:32:13 zuul is running jobs in the openstack tenant 19:32:28 https://review.opendev.org/763599 for that change 19:32:54 down to 13.3k replication tasks 19:32:54 corvus: gertty isn't logging any errors for me... did you change your auth from digest to basic? 19:33:34 fungi: oh, not yet; that's not the error i'm getting but maybe it's a secondary effect 19:33:36 2020-11-21 19:31:30,509 WARNING zuul.ConfigLoader: Zuul encountered an error while accessing the repo x/ansible-role-bindep. The error was: 19:33:36 invalid literal for int() with base 16: 'l la' 19:33:48 zuul logged that error for a handful of repos ^ 19:33:51 corvus: I think I saw that scroll by in the zuul scheduler debug 19:34:00 yeah 19:34:38 should I be digging into that or are you investigating? 19:34:44 corvus: yeah, the error i remember gertty throwing when i had the wrong auth type was opaque to say the least 19:35:01 i don't recall seeing that before, therefore i don't know if it could be upgrade related. but it doesn't seem like it should be -- that's in-repo content over the git protocol, so i don't think anything should be different. but i dunno. 19:35:03 i've put a reminder in the post-upgrade etherpad for gertty users to update their configs 19:35:25 corvus: oh I see this is us talking git not api 19:36:06 three jobs have succeeded, but the other jobs on that change will take a while to run so will be a while before we see zuul comment back 19:37:03 fatal: https://review.opendev.org/x/ansible-role-bindep/info/refs not valid: is this a git repository? 19:37:19 that would explain the proximate cause of the zuul error 19:38:53 info/refs/ is there and file level permissions look ok 19:39:18 ansible-role-bindep doesn't show up in the error_log 19:39:44 i can clone it over ssh 19:39:51 is there a problem with "x/" repos and http? 19:40:44 x/ranger reproduces (just a random one I remembered was in x/) 19:41:00 I wonder if this is a permissions issue perhaps related to the bug that got mitigated? 19:41:25 just for 'x/' though?
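(The quickest way to see what is being described: ssh clones of the x/ repos work, while the smart-HTTP ref advertisement for anything under x/ comes back as the PolyGerrit app instead of a git response. USERNAME below is a placeholder.)

```
# Works: git over ssh does not go through the web app's routing.
git clone ssh://USERNAME@review.opendev.org:29418/x/ansible-role-bindep
# Fails: the info/refs advertisement is answered with HTML, so git bails.
curl -sI 'https://review.opendev.org/x/ansible-role-bindep/info/refs?service=git-upload-pack' | head -n 5
git clone https://review.opendev.org/x/ansible-role-bindep
```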
19:41:58 review-test reproduces fwiw 19:43:50 if you search for changes in those repos you can see them 19:43:54 in the web ui I mean 19:44:25 if i curl info/refs for x repos, i get the gerrit web app 19:45:12 i'm a little worried there's some kind of routing thing in gerrit that assumes any one-letter path component is not a repo 19:45:19 oh fun 19:45:25 yikes 19:45:28 no basis for that other than observed behavior 19:46:05 i'm going to start looking at gerrit source code 19:46:10 ok 19:48:07 down to 11.1k replication tasks and things look good on gitea01 disk wise 19:49:04 it's x/ 19:49:12 corvus: java/com/google/gerrit/httpd/raw/StaticModule.java 19:49:44 it serves something related to polygerrit judging by the path names 19:49:47 s/path/variable/ 19:49:55 clarkb: thx 19:52:34 clarkb: poly gerrit extension plugins? 19:53:00 ya the docs talk about #/x//settings 19:53:12 and /x/pluginname/*screenname* 19:55:09 do we need to start talking about renaming them? 19:56:06 I did test a rename and if you move the project in gerrit's git dir everything seems to be fine except for project watches config 19:56:17 you can do an online reindex too 19:56:26 or maybe this is something to pull luca in on 19:56:47 i think a surprise project rename might be disruptive 19:57:11 agreed 19:58:12 grepping logs, i'm not seeing any currently legit access for /x/* 19:58:20 (other than attempted clones) 19:58:30 there are some requests for fonts: /x/fonts/roboto/Roboto-Bold.ttf 19:58:58 but i'm not sure those are actually returning fonts (i think they may just return the app) 19:59:55 thinking out loud here. I wonder if we can convince the gerrit http server to check for x/repo first then fallback to x/else 20:00:16 clarkb: i think long term if gerrit wants to own x/ we can't have it 20:00:31 ya agreed, I figure something like that would be so we can schedule a rename, not today 20:00:49 but short term, i'm wondering if, since it doesn't seem like our gerrit is using x/ right now, we can rebuild it without that exclusion then work on a rename plan 20:01:27 i'm around to review a gerrit patch, though getting started grilling 20:01:43 (if we're right about x/ being used for plugins, then it'll become an issue as we add polygerrit plugins) 20:01:58 i assume we'll want to start a thread on repo-discuss noting that polygerrit has made some repository names impossible. that seems like a bug they would be interested in fixing 20:02:20 fungi: i assume they'll fix it with a doc change saying 'don't use these' 20:02:28 corvus: ya maybe we can add a sed to the jobs to comment that out on the 3.2 branch which will rebuild the image then pull that and use it? 20:02:28 just like /p/ and /c/ are unavailable 20:02:47 clarkb: sounds good 20:03:03 corvus: do you want to write that change or should I? 20:03:12 if you can't use repositories whose names start with c/ or p/ or x/ but gerrit doesn't prevent you from creating them, that sounds like a bug 20:03:17 clarkb: you if you're available 20:03:17 also I think we should trim down the images so it's just 3.2 on that change 20:03:21 ok working on that now 20:03:25 for not properly separating api paths from git project paths 20:03:40 fungi: perhaps gerrit does prevent creation; we should check that 20:04:35 i imagine we should just no longer allow single-char in the initial path component of project names to be safe for the future 20:05:38 ++ 20:05:54 or is there a more correct path prefix we should switch to using to access git repositories?
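(The workaround that ends up in 763600 is to comment out the one PolyGerrit route in StaticModule.java while building the 3.2 image; a hedged sketch of the kind of sed involved, since the exact invocation and where in the image build it runs are not shown here.)

```
# Run against Gerrit's source tree checked out for the image build;
# turns the `"/x/*",` path entry into a comment.
sed -i -e 's|^\(\s*\)"/x/\*",|\1//"/x/*",|' \
  java/com/google/gerrit/httpd/raw/StaticModule.java
grep -n '/x/\*' java/com/google/gerrit/httpd/raw/StaticModule.java
```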
20:06:24 I always have to spend 10 minutes figuring out how we build the gerrit wars in these jobs 20:06:33 fungi: the download urls are rooted at / 20:06:36 I checked that as I wondered too 20:06:51 and aren't configurable? 20:07:28 because that seems like it would be a relatively minor fix... deprecate the / routing for project names and add a new prefix 20:08:02 and instruct users to migrate to the new prefix and then eventually remove the download routing at / in a later release 20:14:20 clarkb: i came to the same conclusion 20:14:41 i mean, /p/ *used* to work :/ 20:15:36 remote: https://review.opendev.org/c/opendev/system-config/+/763600 Handle x/ prefix projects on gerrit 3.2 20:15:56 I figure we can pull that image onto review-test and test out there first, then if that looks ok do it to prod 20:16:11 and I'll update my change so that we can land it 20:16:16 clarkb: ++ 20:16:27 clarkb: what needs to be updated? 20:16:51 corvus: stuff around which jobs to run I think 20:17:16 corvus: I removed 2.13 - 3.1 since they aren't necessary to get that image 20:17:47 o/ ... well done everyone! 20:17:52 clarkb: can't we land that? 20:17:52 since luca reached out when he saw our upgrade was in progress and suggested we should let him know if we hit any snags, is this something we should give him a heads up about? 20:17:52 if we add them back in I need to make the sed branch specific. If we don't add them back in then I need to squash it into fungi's use regular stable branches change I think 20:18:02 fungi: yes 20:18:16 corvus: yes I think I need to update the system-config-run dependency maybe? 20:18:16 i think we should send an email saying we found this issue and our proposed solution and see if he thinks it's ok 20:18:29 corvus: I'm sorry these jobs always confuse me 20:18:42 I'm basically just saying that we need to review the job updates carefully if we land this 20:18:45 i've got the grill starting so i'm happy to throw a quick e-mail out there pointing to our workaround and asking for suggestions 20:18:52 fungi: go for it 20:22:15 5.8k on the replication 20:24:25 gitea01 is down to 18GB free. Should have plenty for the remaining replication 20:25:01 I'm going to find some food while I wait for zuul to build that image 20:31:02 reply sent to luca, seems like my patio is experiencing unnecessary levels of packet loss so i'm less responsive than i might otherwise be at the moment 20:32:48 my ansible is bad 20:32:51 fixing 20:32:55 the new UI is so much faster, very pleasant for us high latency users 20:34:47 new ps has been pushed 20:39:57 infra-root. I added myself to project bootstrappers and admins on review-test. Then went to /plugins/ which returns a json doc of plugins 20:40:10 the index_url for each plugin we have is listed there and they all start with plugins/ not x/ 20:40:19 (just another data point towards the safety of this change) 20:41:29 I think to get that document you have to be in the admins group 20:41:36 could probably get it via the rest api instead too 20:41:37 clarkb: yeah, if i'm following correctly x/ might be used by polygerrit plugins to serve certain resources 20:43:24 corvus: hrm are any of the plugins we have polygerrit plugins? I assume that some are like the codemirror-editor and download-commands? 20:45:16 clarkb: no idea 20:46:23 clarkb: ansible parse error again 20:47:53 k, can someone look at it really quickly?
I feel like my brain isn't working 20:47:57 clarkb: will do 20:48:14 poking at codemirror editor on review-test with ff dev tools it self hosts its static contents looks like 20:49:55 I think I see it: shell needs to be a list 20:50:00 yes i'm on it 20:50:09 k 20:50:11 - "/x/*", 20:50:11 + //"/x/*", 20:50:17 clarkb: that's the intended change, yeah? 20:50:20 corvus: yes 20:50:30 i'm validating it makes it all the way through ansible unscathed 20:50:33 it comments out that line with the /x/* in it 20:51:18 clarkb: pushed 20:51:38 i figured i'd double check the whole thing to save us any more round trips 20:51:38 thanks 20:51:42 ++ 20:56:15 ~600 replication tasks now 20:58:27 once this is built, pulled and restarted, do we need to restart the executors and mergers as well? 20:59:32 it's running the bazelisk build now 21:00:16 fungi: you want to respond to luca? 21:00:19 and file the bug? 21:00:41 replication is done. I'm going to do another round of gc'ing on the giteas 21:01:09 oh, cool, he already replied. yeah i can do that immediately after dinner 21:01:15 fungi: specifically I think the bit that was missing in the email was that it's cloning repos 21:02:09 yes 21:03:40 giteas are gc'ing now 21:13:42 build finished 21:13:49 docker://insecure-ci-registry.opendev.org:5000/opendevorg/gerrit:f76ab6a8900f40718c6cd8a57596e3fc_3.2 21:14:12 cool I'll get that on review-test momentarily 21:14:26 i'm also running it locally for fun 21:14:55 or will, when it downloads, in a few minutes 21:15:57 note review-test's LE cert expired a few days ago and we decided to leave it be 21:16:54 cloning x/ranger from review-test works now 21:17:22 \o/ 21:18:07 https://review-test.opendev.org/x/fonts/fonts/robotomono/RobotoMono-Regular.ttf is a 404 21:19:06 clarkb: but it's also not a real thing on prod 21:19:12 ya I guess not 21:19:17 I just wanted to see what it does there 21:20:08 clarkb: want me to update your patch with the system-config-run change? 21:20:27 corvus: that would be swell 21:20:38 then I think it should be landable? 21:22:19 clarkb: actually... maybe we should make this 2 changes 21:22:32 corvus: I'm good with that too 21:22:37 just the sed then a cleanup? 21:22:53 yep 21:22:58 wfm 21:23:03 i'll take care of that 21:23:15 corvus: remember you need to check the branch if you do that 21:23:15 clarkb: meanwhile, we have a built image -- want to go ahead and run it on prod? 21:23:20 or have 3.2 use a different playbook 21:23:30 clarkb: how about we invert the order? 21:23:36 corvus: that also works 21:23:38 remove old stuff, then the x/ change 21:23:41 ++ 21:23:42 will be easy to revert 21:23:53 for prod any concern that this may break something else?
or are we willing to find out the hard way :) 21:24:12 clarkb: i think we've done the testing we can 21:24:15 ok 21:24:20 I'll do this in the screen fwiw 21:24:30 i'm not worried about it breaking anything in a way we can't roll back 21:26:46 gerrit is starting back up again on prod 21:28:05 hrm the change screen isn't loading for me though I thought I tested that on review-test too 21:28:08 oh there it goes 21:28:10 I just need patience 21:28:59 I can clone ranger from prod via https now too 21:30:05 remote: https://review.opendev.org/c/opendev/system-config/+/763616 Remove container image builds for old gerrit versions [NEW] 21:30:06 remote: https://review.opendev.org/c/opendev/system-config/+/763600 Handle x/ prefix projects on gerrit 3.2 21:30:12 clarkb: i think we should do a full-reconfigure in zuul 21:30:14 i'll do that 21:30:15 oh I should go restart gerritbot now that I restarted gerrit 21:30:17 corvus: ++ 21:30:56 gerritbot has been restarted 21:31:17 i have more work to do on those image build changes; on it 21:34:19 btw zuul commented a -1 on https://review.opendev.org/c/openstack/os-brick/+/763599/ which was the first change that started running zuul jobs. That aspect of things looks good 21:34:37 clarkb, fungi, ianw: remote: https://review.opendev.org/c/openstack/project-config/+/763617 Remove old gerrit image jobs from jeepyb [NEW] 21:35:36 +2 21:36:08 cat jobs are running 21:37:37 corvus: one small thing on https://review.opendev.org/c/opendev/system-config/+/763616 21:38:28 I'm happy to fix the issue on ^ if you want to roll forward instead 21:38:33 er I mean fix it in a follow on 21:38:38 clarkb: i'll respin 21:38:41 ok 21:40:12 clarkb: respin done 21:40:34 2020-11-21 21:37:15,977 INFO zuul.Scheduler: Full reconfiguration complete (duration: 379.767 seconds) 21:40:48 and no more of those errors? 21:41:06 was review.o.o restarted with the fix? i guess so, my tests to reproduce the error don't fail 21:41:13 what was the error message on attempting to clone? 21:41:23 sorry, just now catching up since dinner's done 21:41:37 fungi: heh, lemme see if i have a terminal open with the error :) 21:41:43 back to nominal levels of packet loss again and can test things suitably 21:41:57 thanks! 21:42:17 working up the reply to luca now 21:42:22 another thing I notice is that gitweb doesn't work but gitiles seems to 21:42:30 I think we should just stop using gitweb maybe and have it be gitiles 21:42:36 that isn't super urgent though 21:42:43 then we can add in gitea when we sort that out 21:42:49 fungi: i don't, sorry :( 21:43:05 19:37 < corvus> fatal: https://review.opendev.org/x/ansible-role-bindep/info/refs not valid: is this a git repository? 21:43:11 fungi: but i pasted that ^ 21:43:16 that was about it 21:44:00 clarkb: confirmed, no new 'invalid literal' errors from zuul 21:45:17 +2 from me on corvus' image stack 21:46:00 +2 from me on clarkb's image stack 21:46:26 zuul still can't ssh into bridge (I think that is a good thing), once we've got these issues settled I figured we would use https://review.opendev.org/c/opendev/system-config/+/757161 this change as the canary for that? 21:46:49 my family has pointed out to me that I am yet to shower today though, so now might be time for me to take a break. 21:46:57 is there anything else you'd like me to do before I pop out for a bit?
21:48:02 nope, go become less offensive to your family ;) 21:48:05 i think now's a good break time 21:48:11 fungi: maybe you can include a diff for luca as well: http://paste.openstack.org/show/qz6zQ6a3jkRVluxebh8l/ 21:48:13 fungi: can you +3 https://review.opendev.org/763617 ? 21:48:14 corvus: thanks! i'll try to work with that 21:48:26 yeah, will review 21:49:03 and approved 21:49:18 giteas are still gc'ing but free disk space is going up so we should be more than good there 21:49:20 and now break time 21:49:58 I've also removed my normal user from privileged groups on review-test 21:50:05 as I am done testing there for now 21:55:07 i've re-replied to luca, will start putting the bug report together shortly 21:55:34 any other urgent upgrade-related tasks need my attention first? 21:56:11 fungi: i don't think so. i'm about to +w the remaining image stack 21:57:22 err, there's another error 21:59:42 clarkb, fungi: can you +3 https://review.opendev.org/763616 ? 22:00:18 missed an update for the infra-prod jobs to trigger on 3.2 builds 22:01:17 yup, taking a look now 22:02:17 current status: we need to merge https://review.opendev.org/763616 and https://review.opendev.org/763600 then the repos will match the image we're running in production. then we can proceed with enabling cd. aside from that, i think there's no known issues in prod and we're just waiting for replication to finish. 22:02:26 i've approved 763616 now 22:02:45 cool, then i'm going to afk for another errand 22:03:22 infra-root: just a highlight ping for what i think is the current status (a couple lines up ^) as i think we're all on break while waiting for tasks to complete 22:04:08 awesome, thanks again! 22:44:40 https://bugs.chromium.org/p/gerrit/issues/detail?id=13721 22:45:03 if anyone feels inclined, please clarify mistakes or omissions therein 22:46:06 I'll take a look in a few. 22:47:20 gitea01 has finished gc'ing and has 22gb free which should be plenty for now 22:48:19 the others all have more free disk too 22:48:22 and are done as well 22:48:28 I think that means all the replication related activities are done 22:49:35 fungi: the bug looks good to me 22:50:42 I'm going to start drafting a "its up, this is what we've discovered, this is where we go from here" type email in etherpad 22:52:04 thanks! don't forget to incorporate notes from https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes as appropriate 22:52:32 ya was going to link to that I think 22:55:07 https://etherpad.opendev.org/p/rNXB-vJe8IUeFnOKFVs8 is what I'm drafting 22:57:41 fungi: no idea if it helps but i think x/ was introduced @ https://gerrit.googlesource.com/gerrit/+/153d46c367965cd7782a3ac86212c07b298eaca8 22:58:33 actually no, more to dig 22:59:19 the file was moved at some point which makes it difficult to go back in time with 22:59:26 I ended up doing a git log -p and grepping for it and giving up 23:00:07 https://gerrit.googlesource.com/gerrit/+/7cadbc0c0c64b47204cf0de293b7c68814774652 23:00:31 + serve("/x/*").with(PolyGerritUiIndexServlet.class); 23:00:44 that is really the first instance. i wonder if it's not really necessary and just been pulled along since 23:01:42 ianw: the docs hint at it but could still be dead code 23:02:44 .. at least it's in a "add x/ this is a really important path never remove" type change i guess :) 23:04:05 not in or in? 
23:04:46 https://etherpad.opendev.org/p/rNXB-vJe8IUeFnOKFVs8 ok I think that's largely put together at this point 23:13:48 clarkb: minor suggestion on maybe something that explains the x/ thing at a high level but enough for people to understand 23:14:36 ianw: something like that? 23:15:09 yeah, i think so; feel like it explains how both want to "own" the /x endpoint 23:15:24 namespace, whatever :) 23:19:44 oh shoot, I think there is a minor but not super important issue with https://review.opendev.org/763600 it doesn't update the dockerfile so we won't promote the image 23:19:59 corvus: ^ maybe that's something we can figure out manually or just push up another change that does a noop dockerfile edit? 23:20:34 double check me on all that first though 23:22:57 also I'm starting to feel the exhaustion roll in. If others want to drive things and get cd rolling again I'll do my best to help, otherwise, tomorrow morning might be good 23:23:50 ya I think the promote jobs for the 3.2 docker image tagging didn't run 23:23:57 I'll push up a noop job now to get that rolling 23:26:03 remote: https://review.opendev.org/c/opendev/system-config/+/763618 Noop change to promote docker image build 23:26:49 i've got to run out, but i can get to the CD stuff early my tomorrow? i don't think we need it before then? 23:27:23 ya I don't think it's super urgent unless others really want their sunday back. I'm just wiped out 23:27:39 fungi: corvus ^ fyi. Also any thoughts on that email? should I send that nowish? 23:28:18 infra-root Note that https://review.opendev.org/c/opendev/system-config/+/763618 or something like it should land before we start doing cd again 23:29:23 ok, that's the new image with the x/ fix right? 23:29:32 yes 23:29:41 i.e. we don't want to CD deploy the old image 23:29:49 we actually just built it when corvus' changes landed but because we didn't modify files that the promote jobs match we didn't promote it 23:29:58 we could also do an out of band promote via docker directly if we want 23:30:11 763618 should also take care of it since the dockerfile is modified 23:30:53 ok, i have to head out but will check back later 23:30:59 ianw: o/ 23:37:45 clarkb: sorry, stepped away for a bit, reading draft e-mail now 23:41:01 made a couple of minor edits but lgtm in general 23:42:13 cool I'll wait a bit to see if corvus is able to take a look then send that out 23:42:30 fungi: and maybe a corresponding #status notice 23:42:50 I'm taking a break now though. The tired hit me hard in the last little bit 23:43:21 yup, a status notice at the same time that e-mail gets sent would make sense 23:46:19 fungi: did you see 763618 too? 23:47:43 likely not if you're asking 23:48:16 approvidado 23:48:22 reading scrollback 23:50:24 clarkb: email lgtm 23:50:34 cool I'll send that out momentarily 23:55:04 how about this for the notice #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. However, we are still working through things and there may be additional service restarts during out upgrad window which ends at 01:00 November 23. 23:55:25 clarkb: s/out upgrad/our upgrade/ 23:55:32 I can also add "See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details" 23:55:54 how about this for the notice #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes.
However, we are still working through things and there may be additional service restarts during our upgrade window which ends at 01:00 November 23. See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for 23:55:56 more details 23:56:23 is that just short enough if I drop my prefix? 23:57:24 maybe squeeze it down a bit so it fits in a single notice 23:57:33 i think statusbot will truncate it otherwise 23:57:43 like #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. However, we are still working through things and there may be additional service restarts during our upgrade window ending 01:00UTC November 23. See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details 23:57:56 or, rather, statusbot doesn't know to, so the irc server ends up discarding the rest 23:58:40 looks good. hopefully that's short enough 23:58:55 I can trim it a bit more but I'll just go ahead and send it with that trimming 23:59:23 #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. We are still working through things and there may be additional service restarts during our upgrade window ending 01:00UTC November 23. http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details 23:59:23 clarkb: sending notice 00:00:21 looks like it fit! 00:00:46 I think it's still channel dependent because the channel name goes at the beginning of the message but ya seems like for the channels I'm in it is good 00:02:03 once 763618 lands and promotes the image I think we're in a good spot to turn on cd again, but more and more I'm feeling like that is a tomorrow morning thing 00:02:07 will others be around for that? 00:02:14 sounded like ianw would be around AU morning 00:07:07 i will be around circa 13-14z 00:08:05 i probably won't be around tomorrow 00:08:56 my concern with doing it today is if we turn it on and don't notice problems because they happen at 0600 utc or whatever 00:09:02 so probably tomorrow is best? 00:09:22 yeah i agree 00:09:38 i don't expect to have major additional issues crop up which we'll be unable to deal with on the spot 01:05:10 ok I think the promote has happened 01:05:44 https://review.opendev.org/c/opendev/system-config/+/763618/ build succeeded deploy pipeline 01:40:05 excellent 01:40:14 i'm starting to fade though 02:20:07 Congrats on the Gerrit upgrade!!! 02:22:45 The post upgrade etherpad doesn't look horrible 02:26:24 the x/ conflict is probably the big thing 02:32:18 thanks mordred! 02:56:35 Btw ... If anybody needs an unwind Netflix ... We Are The Champions is amazing. We watched the hair episode last night. I promise you've never seen anything like it 14:23:06 no signs of trouble this morning, i'm around whenever folks are ready to try reenabling ansible 15:02:25 fungi: I think our first step is to confirm our newly promoted 3.2 tagged image is happy on review-test, then roll that out to prod 15:03:15 then in my sleep I figured a staged rollout of cd would probably be good: put the ssh keys back but keep review and zuul in the emergency file and see what jobs do, then remove zuul from emergency file and see what jobs do, then remove review and land the html cleanup change I wrote and see how that does? 15:03:25 I think for the first two steps the periodic jobs should give us decent coverage 15:04:56 mordred: I've watched the first two episodes.
the cheese wheel racing is amazing 15:07:23 the image we're currently running from is hand-built? or fetched from check pipeline? 15:07:27 david ostrovsky has congratulated us on the gerrit mailing list. Also asks if we have any feedback. I guess following up there with the url paths thing might be good as well as questions about if you can make the notedb step go faster by somehow manually gc'ing then manually reindexing 15:07:58 fungi: review-test should be running the check pipeline image 15:08:06 the docker compose file should reflect that 15:08:19 but we've got a fix of some sort in place in production right? 15:08:19 fungi: and prod is in the same boat iirc 15:09:05 ahh, okay, yep looks like it's also running the image built in the check pipeline then 15:09:10 fungi: the fix is that https://review.opendev.org/c/opendev/system-config/+/763618 landed and promoted our workaround as the 3.2 tag in docker hub. Which means we can switch back to using the opendevorg/gerrit:3.2 image on both servers 15:09:24 I think we should do review-test first and just quickly double check that git clone still works, then do the same with prod 15:09:25 right, i'll get it swapped out on review-test now 15:10:55 opendevorg/gerrit 3.2 3391de1cd0b2 15 hours ago 681MB 15:11:07 that's what review-test is in the process of starting on now 15:18:11 fungi: I can clone ranger from review-test 15:18:16 via https 15:23:20 yup, same. perfect 15:23:38 shall i similarly edit the docker-compose.yaml on review.o.o in that case? 15:23:50 yes I think we should go ahead and get as many of these restarts in on prod during our window as we can 15:24:19 edits made, see screen window 15:24:32 do i need to down before i pull, or can i pull first? 15:24:50 you can pull first 15:25:00 sorry I'm not on the screen yet, but I think it will be fine since you just did it on -test 15:25:23 opendevorg/gerrit 3.2 3391de1cd0b2 15 hours ago 681MB 15:25:28 that's what's pulled 15:25:32 and it matches -test 15:25:35 shall i down and up -d? 15:25:42 ++ 15:25:53 done 15:29:50 one thing that occurred to me is we should double check our container shutdown process is still valid. I figured an easy way to do that was to grab the deb packages they publish and read the init script but I can find where the actual packages are 15:30:25 `git clone https://review.opendev.org/x/ranger` is still working for me 15:30:46 *I can't find where 15:31:33 which package? docker? docker-compose? 15:31:37 nevermind, found them. deb.gerritforge.com is only older stuff, bionic.gerritforge.com has newer things 15:31:47 fungi: the "native packages" that luca publishes http://bionic.gerritforge.com/dists/gerrit/contrib/binary-amd64/gerrit-3.2.5.1-1.noarch.deb 15:32:02 since I assume that will have a systemd unit or init file that we can see how stop is done 15:32:31 our current stop is based on the 2.13 provided init script. actually I wonder if 3.2 provides one too 15:33:32 ah yup it does 15:33:45 resources/com/google/gerrit/pgm/init/gerrit.sh and that still shows sig hup so I think we're good 15:33:54 oh, got it 15:34:03 thought you were talking about docker tooling packages 15:34:45 no just more generally.
Our docker-compose config should send a sighup to stop gerrit's container 15:34:50 which it looks like is correct 15:35:00 *is still correct 15:58:34 remote: https://review.opendev.org/c/opendev/system-config/+/763656 Update gerrit docker image to java 11 15:58:39 I think that is a later thing so will mark it WIP 15:58:44 also gerritbot didn't report that :/ 15:58:49 oh right we just restarted :) 15:59:02 I'm restarting gerritbot now 16:00:20 also git review gives a nice error message when you try to push with too old a git review to new gerrit 16:00:25 seems like we need to restart gerritbot any time we restart gerrit these days 16:03:30 ok I won't say I feel ready, but I'm probably as ready as I will be :) what do you think of my staged plan to get zuul cd happening again? 16:06:20 it seems sound, i'm up for it 16:06:58 opendev-prod-hourly jobs are the ones that we'd expect to run and those run at the top of the hour. So if we move authorized_keys back in place then we should be able to monitor at 17:00UTC? 16:07:24 then if we're happy with the results of that we remove zuul from emergency and wait for the hourly prod jobs at 18:00UTC 16:07:29 (zuul is in that list) 16:10:08 fungi: I put a commented out mv command in the bridge screen to put keys back in place, can you check it? 16:10:40 yep, that looks adequate 16:10:58 ok I guess we wait for 17:00 then? 16:14:00 was ansible globally disabled, and have we taken things back out of the emergency disable list? 16:14:47 looks like /home/zuul/DISABLE-ANSIBLE does not exist on bridge at least 16:14:48 ansible was not globally disabled with the DISABLE-ANSIBLE file and the hosts are all still in the emergency disable list 16:14:58 we used the more forceful "you cannot ssh at all" disable method 16:15:34 cool, so in theory the 1700z deploy will skip the stuff we still have disabled in the emergency list 16:15:56 yup, then after that if we're happy with the results we take the zuul hosts out of emergency and let the next hourly pulse run on them 16:16:13 then if we're happy with that we remove review and then land my html cleanup change 16:16:42 review isn't part of the hourly jobs so we need something else to trigger a job on it (it is on the daily periodic jobs though so we should ensure we run jobs against it before ~0600 or put it back in the emergency file) 16:19:21 fungi: one upside to doing the ssh disable is that the jobs fail quicker in zuul 16:19:31 which we wanted because we knew that things would be off for a long period of time 16:19:43 when you write the disable ansible file the jobs will poll it and see if it goes away before their timeout 16:20:02 during typical operation ^ is better because it's a short window where you want to pause rather than a full stop 16:21:50 https://etherpad.opendev.org/p/E3ixAAviIQ1F_1-gzuq_ is the gerrit mailing list email from david. I figure we should respond. fungi not sure if you're subscribed? but seems like we should write up an email and bring up the x/ conflict? 16:22:48 i'm not subscribed, but happy if someone mentions that bug to get some additional visibility/input 16:32:26 fungi: I drafted a response in that etherpad, have a moment to take a look? 16:33:16 yep, just a sec 16:35:27 I think they were impressed we were able to incorporate a jgit fix from yesterday too :) 16:35:31 something something zuul 16:37:03 yep, reply lgtm, thanks!
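(For reference, a rough sketch of the docker-compose.yaml shape being discussed above: pinning back to the promoted opendevorg/gerrit:3.2 tag and letting compose stop the container with SIGHUP, which gerrit's own init script treats as a stop request. The service name, volume path and network mode are illustrative assumptions rather than copies of the real file; the 5m grace period matches the value mentioned later in the log.)

    version: '2'
    services:
      gerrit:
        # promoted 3.2 tag instead of the hand-pulled check-pipeline image
        image: opendevorg/gerrit:3.2
        network_mode: host                         # assumption
        volumes:
          - /home/gerrit2/review_site:/var/gerrit  # assumption
        stop_signal: SIGHUP                        # gerrit's init script stops on HUP
        stop_grace_period: 5m                      # give the shutdown hook time to drain

With something like that in place, the restart sequence used in the log is just `docker-compose pull`, `docker-compose down` and `docker-compose up -d` from the directory containing the file.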
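(The staged re-enable of CD described above also boils down to a few manual steps on bridge; a sketch, where the emergency file path is an assumption based on context rather than the exact path used:)

    # stage 1: let zuul ssh into bridge again (reverses the earlier move), then
    # watch the 17:00 UTC opendev-prod-hourly pulse with all hosts still disabled
    sudo -u zuul mv ~zuul/.ssh/{disabled_,}authorized_keys

    # stage 2: if that looks good, drop the zuul hosts from the emergency disable
    # list and watch the 18:00 UTC pulse (file path is an assumption)
    sudo vim /etc/ansible/hosts/emergency.yaml

    # stage 3: finally drop review.opendev.org from the list and approve the
    # gerrit html/config cleanup changes so infra-prod-service-review runs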
16:38:53 i've got something to which i must attend briefly, but will be back to check the hourly deploy run 16:41:11 response sent 17:01:02 infra-prod-install-ansible is running 17:01:28 as well as publish-irc-meetings (that one maybe didn't rely on bridge though?) 17:03:24 infra-prod-install-ansible reports success 17:03:30 now it is running service-bridge 17:06:49 service-bridge claims success now too 17:07:52 cloud-launcher is running now 17:12:46 fungi: are you back? 17:12:49 yup 17:12:55 I've checked that system-config updated in zuul's homedir 17:12:57 looking good so far 17:13:05 but now am trying to figure out where the hell project-config is synced to/from 17:13:19 /opt/project-config is what system-config ansible vars seem to show but that seems old as dirt on bridge 17:13:37 that makes me think that it isn't actually where we sync from, but I'm having a really hard time understanding it 17:13:39 i thought it put one in ~zuul 17:14:16 ok that one looks more up to date 17:14:24 but I still can't tell from our config management what is used 17:14:24 from friday, yah 17:14:39 (also it's a project creation from friday... maybe we should've stopped those for a bit) 17:15:09 well, it won't run manage-projects yet 17:15:20 because of review.o.o still being in emergency 17:15:40 ya 17:15:45 but yeah once we reenable that, we should check the outcome of manage-projects runs 17:15:56 I think I figured it out 17:16:02 /opt/project-config is the remote path but not the bridge path 17:16:12 the bridge path is /home/zuul/src/../project-config 17:17:56 fungi: looking at timestamps there is a good chance that project is already created /me checks 17:18:20 https://opendev.org/openstack/charm-magpie 17:18:44 and they are in gerrit too, ok one less thing to worry about until we're happy with the state of the world 17:19:05 fungi: nodepool's job is next and I think that one may be expected to fail due to the issues on the builders that ianw was debugging. Not sure if they have been fixed yet 17:19:13 just a heads up that a failure there is probably sane 17:20:46 I suspect that our hourly jobs take longer than an hour to complete 17:22:56 huh cloud launcher failed, I wonder if it is trying to talk to a cloud that isn't available anymore (that is usually why it fails) 17:23:21 fungi: it just occurred to me that the jeepyb scripts that talk to the db likely won't fail until we remove the db config from the gerrit config 17:23:41 fungi: and there is potential there for welcome message to spam new users created on 3.2 because it won't see them on the 2.16 db 17:24:09 I don't think that is urgent (we can apologise a lot) but it's in the stack of changes to do that cleanup anyway. Then those scripts should start failing on the ini file lookups 17:26:16 mmm, maybe if they have only one recorded change in the old db, yes 17:26:56 i think it would need them to exist in the old db but have only one change present 17:27:03 i need to look back at the logic there 17:27:46 also we can just edit playbooks/roles/gerrit/files/hooks/patchset-created to drop welcome-message? 17:27:53 easily 17:28:00 the other one was the blueprint update? 17:28:05 bug update 17:29:19 looks like bug and blueprint both 17:29:29 confirmed that nodepool failed 17:29:35 registry running now 17:29:51 fungi: so ya maybe we get a change in that simply comments out those scripts in the various hook scripts for now?
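(A sketch of the sort of hook edit being floated here: commenting the reviewdb-dependent jeepyb calls out of the patchset-created wrapper until the db config cleanup lands. The wrapper contents and install paths below are assumptions for illustration, not the real hook, and exactly which scripts needed disabling was still being worked out at this point in the discussion.)

    #!/bin/bash
    # playbooks/roles/gerrit/files/hooks/patchset-created (illustrative sketch only)
    # Temporarily disable the jeepyb calls that still read the old reviewdb:
    # /usr/local/bin/welcome-message "$@"
    # /usr/local/bin/update-bug "$@"
    # /usr/local/bin/update-blueprint "$@"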
17:30:03 then that can land before or after the db config cleanup 17:30:06 yeah, looking at welcome_message.py the query is looking for changes in the db matching that account id, so it won't fire for actually new users post upgrade, but will i think continue to fire for existing users who only had one change in before the upgrade 17:30:17 got it 17:30:32 registry just succeeded. zuul is running now and it should noop succeed 17:30:42 i think update_bug.py will still half-work, it will just fail to reassign bugs 17:31:05 fungi: but will it raise an exception early because the ini file doesn't have the keys it is looking for anymore? 17:31:51 I think it will 17:31:53 oh, did we remove the db details from the config? 17:32:03 fungi: not yet, that is one of the changes to land though 17:32:22 fungi: https://review.opendev.org/c/opendev/system-config/+/757162 17:32:22 got it 17:32:39 so yeah i guess we can strip them out until someone has time to address those two scripts 17:32:50 i'll amend 757162 with that i guess 17:33:03 wfm 17:33:11 note its parent is the html cleanup 17:33:20 which is also not yet merged 17:33:30 yeah, i'm keeping it in the stack 17:33:56 zuul "succeeded" 17:34:14 fungi: also it's three scripts 17:34:23 welcome message, and update bug and update blueprint 17:36:17 update blueprint doesn't rely on the db though 17:36:36 it does 17:36:50 huh, i wonder why. okay i'll take another look at that one too 17:37:46 select subject, topic from changes where change_key=%s 17:38:01 yeesh, okay so it's using the db to look up changes 17:38:24 ya rest api should be fine for that anonymously too 17:38:29 I'm adding notes to the etherpad 17:38:59 and yeah, the find_specs() function performing that query is called unconditionally in update_blueprint.py so it'll break entirely 17:40:20 and eavesdrop succeeded. Now puppet else is starting 17:40:59 update_bug.py is also called from two other hook scripts, i'll double-check whether those modes are expected to work at all 17:43:26 looks like the others are safe to stay, update_bug.py is only connecting to the db within set_in_progress() which is only called within a conditional checking args.hook == "patchset-created" 17:44:23 fungi: where does it do the ini file lookups? 17:44:33 because it will raise on those when the keys are removed from the file 17:44:47 (it's less about where it connects and more where it finds the config) 17:48:28 puppetry is still running according to the log I'm tailing 17:48:46 I don't expect this to finish before 18:00, but should I go ahead and remove the zuul nodes from the emergency file anyway now since things seem to be working? 17:48:48 fungi: ^ 17:49:59 ini file is parsed in jeepyb.gerritdb.connect() which isn't called outside the check for patchset-created 17:50:26 sorry, digging in jeepyb internals 17:50:49 what's the desire to take zuul servers out of emergency in the middle of a deploy run? 17:51:03 oh, just in case it finishes before the top of the hour? 17:51:06 that this deploy run is racing the next cron iteration 17:51:07 yup 17:51:18 if we wait we might have to skip to 19:00 though this may end up happening anyway 17:51:32 oh wait puppet is done it says 17:51:39 is it likely to decide to deploy things to zuul servers in this run if they're taken out of emergency early? 17:51:42 ahh, then go for it 17:51:48 ok will do it in the screen 17:52:55 and done, can you double check the contents of the emergency file really quickly just to make sure I didn't do anything obviously wrong?
17:54:29 emergency file in the bridge screen lgtm 17:55:00 gerritbot is still silent. did it get restarted? 17:55:45 ya I thought I restarted it 17:55:50 last started at 15:59 17:56:12 gerrit restart was 15:25 17:56:30 but for some reason it didn't echo when i pushed updates for your system-config stack 17:56:35 checking gerritbot's logs 17:58:03 Nov 22 16:59:23 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 16:59:23,196 ERROR gerrit.GerritWatcher: Exception consuming ssh event stream: 17:58:23 (from syslog) 17:58:28 neat 17:59:04 I thought it was supposed to try and reconnect 17:59:14 looks like the json module failed to parse an event 17:59:24 and that crashed gerritbot 17:59:38 probably sufficient for now to ignore json parse failures there? 17:59:46 basically go "well I don't understand, oh well" 18:00:34 this is being parsed from within gerritlib, the exception was raised from there 18:00:49 so we'll likely need to fix this in gerritlib and tag a new release 18:01:17 side note: zuul doesn't use gerritlib, maybe there is something to be learned in zuul's implementation 18:05:00 http://paste.openstack.org/show/800291/ 18:05:05 that's the full traceback 18:05:35 are we getting empty events 18:06:02 maybe the fix is to wrap that in if l : data = json.loads(l) 18:07:03 and maybe catch json decode errors that happen anyway and reconnect or something like that 18:11:18 fungi: whats with the docker package insepection in the review screen? 18:11:28 https://review.opendev.org/c/opendev/gerritlib/+/763658 Log JSON decoding errors rather than aborting 18:12:25 clarkb: when you were first suggesting we needed to look at shutdown routines in some unspecified package you couldn't find, i thought you meant the docker package so i was tracking down where we'd installed it from 18:12:36 gotcha 18:14:47 fungi: looks like gerritbot side will also need to be updated to handle None events 18:14:58 I think we can land the gerritbot change first and be backward compatible 18:15:05 then land gerritlib and release it 18:16:24 clarkb: i don't see where gerritbot needs updating. _read() is just trying to parse a line from the stream and then either explicitly returning None early or returning None implicitly after enqueuing any event it found 18:17:15 i figured the return value wasn't being used since that method wasn't previously explicitly returning at all 18:17:18 fungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L303-L346 specifically line 307 assumes a dict 18:17:52 oh I see 18:17:55 you're short circuiting 18:17:59 nevermind you're right 18:18:44 yeah, the _read() in gerritbot is being passed contents from the queue, not return values from gerritlib's _read() 18:18:52 left a suggestion but +2'd it 18:18:56 yup 18:24:52 nodepool is running now, then registry then zuul 18:25:09 on the zuul side of things we expect it to noop the config because I already removed the digest auth option from zuul's config files 18:26:51 okay, gerritbot is theoretically running now with 763658 hand applied 18:27:30 we should be able to check its logs for "Cannot parse data from Gerrit event stream:" 18:27:50 and see what the data is 18:27:55 exactly 18:28:13 infra-prod-service-zuul is starting nowish 18:29:42 oh another thing I noticed is that we do fetch from gitea for our periodic jobs when syncing project-config 18:30:07 this didn't end up being a problem because replication went quickly and we replicated project-config first, but we should keep that in mind for the future. 
It isn't always gerrit state 18:30:14 (maybe a good followup change is to switch it) 18:30:14 good thing we waited for the replication to finish yeah 18:33:36 that is in the sync-project-config role 18:33:48 it has a flag to run off of master that we set on the periodic jobs 18:37:22 looks like we're pulling new zuul images (not surprising) 18:40:36 it succeeded and ansible logs show that zuul.conf is changed: false which is what we wanted to see \o/ 18:42:13 infra-root I think we are ready for reviews on https://review.opendev.org/c/opendev/system-config/+/757161 since zuul looked good. if this change looks good to you maybe don't approve it just yet as we have to remove review.o.o from the emergency file for it to take effect 18:42:37 also note that we'll have to manually clean up those html/js/css files as the change doesn't rm them. But the change does update gerrit.config so we'll see if it does the right thing there 19:11:13 it's just dawned on me that 763658 isn't going to log in production, i don't think, because that's being emitted by gerritlib and we'd need python logging set up to write to a gerritlib debug log? 19:11:36 depends on what the default log setup is I think 19:11:42 mmm 19:11:44 I don't know how that service sets up logging 19:12:33 you could edit your update to always log the event at debug level and see if you get those 19:12:43 if you don't then more digging is required 19:13:32 gerritbot itself is logging info level and above to syslog at least 19:16:30 ahh, okay, https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerritbot/files/logging.config#L14-L17 seems to be mapped into the container and logs gerritlib debug level and higher to stdout 19:16:37 if i'm reading that correctly 19:17:07 oh, but then the console handler overrides the level to info i guess? 19:17:17 so i'd need to tweak that presumably 19:18:04 ya or update your thing to log at a higher level 19:18:11 I suggested that on the change earlier 19:18:23 warn was what I suggested 19:19:26 yeah, which i agree with if the content it can't parse identifies some buggy behavior somewhere 19:19:55 anyway, in the short term i've restarted it set to debug level logging in the console handler 19:20:08 sounds good 19:22:10 fungi: what do you think about 757161? should we proceed with that? 19:23:00 sure, i've approved it just now 19:24:47 we need to edit emergency.yaml as well 19:24:53 oh, yep, doing 19:25:30 removed review.o.o from the emergency list just now in the screen on bridge.o.o 19:25:53 lgtm 19:32:39 fungi: I'm going to make a copy of gerrit.config in my homedir so that we can easily diff the result after these changes land 19:34:36 sounds good 19:35:34 if this change lands ok then I think we land the next one, then restart gerrit and ensure it is happy with those updates?
We can also make these updates on review-test really quickly and restart it 19:35:42 why don't I do that since I'm paranoid and this will make me feel better 19:40:08 yeah, not a terrible idea 19:40:12 go for it 19:40:51 I did both of the changes against review-test manually and it is restarting now 19:40:52 i'm poking more stuff into https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes i've thought of, and updated stuff there we've since resolved 19:41:28 I didn't do the hooks updates though since testing those is more painful 19:42:20 oh, one procedural point which sprang to mind, once all remaining upgrade steps are completed and we're satisfied with the result and endmeeting here, we can include the link to the maintenance meeting log in the conclusion announcement 19:42:51 that may be a nice bit of additional transparency for folks 19:42:58 ++ 19:43:15 review-test seems happy to me and no errors in the log (except the expected one from plugin manager plugin) 19:43:40 i guess that should go on the what's broken list so we don't forget to dig into it 19:43:44 adding 19:44:22 fungi: I'm 99% sure it's because you have to explicitly enable that plugin in config in addition to installing it 19:44:49 but we aren't enabling remote plugin management so it breaks. But ya we can test if enabling remote plugin management fixes review-test's error log error 19:45:18 i just added it as a reminder to either enable or remove that 19:45:34 not a high priority, but would rather not forget 19:45:44 ++ 19:46:02 zuul says we are at least half an hour from it merging the change to update commentlinks on review 19:47:23 ianw: when your monday starts, I was going to ask if you could maybe do a quick check of the review backups to ensure that all the shuffling hasn't made anything sad 19:49:21 related to ^ we'll want to clean up the old reviewdb when we're satisfied with things so that only the accountPatchReviewDb is backed up 19:49:26 should cut down on backup sizes 19:52:11 yeah, though we should preserve the pre-upgrade mysqldump for a while "just in case" 19:52:46 ++ 19:57:56 i added some "end maintenance" communication steps to the upgrade plan pad 19:58:44 fungi: that list lgtm 20:20:31 the change that should trigger infra-prod-service-review is about to merge 20:21:21 hrm I think that decided it didn't need to run the deploy job :/ 20:22:12 ya ok our files list seems wrong for that job :/ 20:22:32 or wait now playbooks/roles/gerrit is in there 20:23:02 Unable to freeze job graph: Job system-config-promote-image-gerrit-2.13 not defined is the error 20:23:26 I see the issue 20:26:07 remote: https://review.opendev.org/c/opendev/system-config/+/763663 Fix the infra-prod-service-review image dependency 20:26:21 fungi: ^ gerritbot didn't report that or the earlier merge that failed to run the job I expected. Did you catch things in logs? 20:26:46 looking 20:27:05 it didn't log the "Cannot parse ..." message at least 20:27:13 seeing if it's failed in some other way 20:32:29 I'm not sure if merging https://review.opendev.org/c/opendev/system-config/+/763663 will trigger the infra-prod-service-review job (I think it may since we are updating that job). If it doesn't then I guess we can land the db cleanup change? 20:33:37 so here's the new gerritbot traceback :/ 20:33:40 http://paste.openstack.org/show/800294/ 20:34:39 fungi: it's a bug in your change 20:34:46 you should be printing line not data 20:34:52 because data doesn't get assigned if you fall into the traceback 20:35:05 d'oh, yep!
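(For context, the shape of the gerritlib fix being discussed: wrap the json.loads() of each line read from the gerrit event stream, skip empty reads, and log the offending line instead of letting the exception kill the watcher thread. This is a simplified sketch of the idea behind 763658 with the data-versus-line bug corrected, not the actual patch; the function and variable names here are assumptions.)

    # Simplified sketch of a GerritWatcher-style read loop (not the real gerritlib code)
    import json
    import logging

    log = logging.getLogger("gerrit.GerritWatcher")

    def read_event(stream, queue):
        line = stream.readline()
        if not line:
            return  # EOF / empty read; caller decides whether to reconnect
        try:
            data = json.loads(line)
        except ValueError:
            # Log the raw line (not "data", which was never assigned) and carry on
            # rather than letting the exception abort the event stream consumer.
            log.exception("Cannot parse data from Gerrit event stream:\n%r", line)
            return
        queue.put(data)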
20:36:19 okay, switched that log from data to l 20:36:22 will update the change 20:36:50 fungi: note it's line not l 20:36:55 at least looking at your change 20:37:06 well, it's "l" in production, it'll be "line" in my change 20:37:19 we're running the latest release of gerritlib in that container, not the master branch tip 20:37:24 I see 20:37:28 pycodestyle mandated that get "fixed" 20:38:00 of course 20:38:50 but the fact that we were tripping that code path indicates we're seeing more occurrences of unparseable events in the stream at least 20:38:58 ya 20:39:33 can you review https://review.opendev.org/c/opendev/system-config/+/763663 ? 20:39:45 zuul should be done check testing it in about 7 minutes 20:40:11 fungi: I wonder if there is a new event type that isn't json 20:40:19 and we've just got to ignore it or parse it differently 20:40:39 I guess we should find out soon enough 20:41:15 aha, yep, good catch 20:41:19 on 763663 20:42:13 also ansible seems to have undone my log level edit in the gerritbot logging config so i restarted again with it reset 20:42:32 fungi: it will do that hourly as eavesdrop is in the hourly cron jobs 20:42:38 fungi: maybe put eavesdrop in emergency? 20:42:50 yeah, i suppose i can do that 20:43:50 done 20:46:51 https://review.opendev.org/c/opendev/system-config/+/757162/ is the next change to land if infra-prod-service-review doesn't run after my fix (it's not clear to me if the fix will trigger the job due to file matchers and zuul behavior) 20:55:58 clarkb will do 20:59:17 fungi: fwiw `grep GerritWatcher -A 20 debug.log` in /var/log/zuul on zuul01 doesn't show anything like that. It does show when we restart gerrit and connectivity is lost 21:00:01 just trying to catch up ... gerritbot not listening? 21:00:08 ianw: it's having trouble decoding events at times 21:00:23 grepping JSONDecodeError in that debug.log for zuul shows it happens once? 21:00:54 and then it tries to reconnect. I think that may line up with a service restart 21:01:09 15:25:46,161 <- iirc that is when we restarted to get on the newly promoted image 21:01:15 so no real key indicator yet 21:01:53 ianw: we've started reenabling cd too, having trouble getting infra-prod-service-review to run due to job deps which should be fixed by https://review.opendev.org/c/opendev/system-config/+/763663 not sure if that change landing will run the job though 21:02:27 once that job does run and we're happy with the result I think we're good from the cd perspective 21:03:17 fungi: ya in the zuul example of this problem it seems that zuul gets a short read because we restarted the service. That then fails to decode because it's incomplete json. Then it fails a few times after that trying to reconnect 21:04:18 ok sorry i'm about 40 minutes away from being 100% here 21:04:34 ianw: no worries 21:11:04 waiting for system-config-legacy-logstash-filters to start 21:11:48 kolla runs 40 check jobs 21:12:15 29 are non voting 21:12:20 I'm not sure this is how we imagined this would work 21:23:00 "OR, more simply, just check the User-Agent and serve the all the HTTP incoming requests for Git repositories if the client user agent is Git."
I like this idea from luca 21:26:41 clarkb: yeah, no idea if it's a short read or what, though we're not restarting gerrit when it happens 21:27:04 though that could explain why it was crashing when we'd restart gerrit 21:27:15 ya 21:27:24 i hadn't looked into that failure mode 21:27:29 that makes me wonder if sighup isn't happening or isn't as graceful as we hope 21:27:41 you'd expect gerrit to flush connections and close them on a graceful stop 21:27:47 that might be a question for luca 21:28:26 we can probably test that by manually doing a sighup to the process and observing its behavior 21:28:32 rather than relying on docker-compose to do it 21:28:39 then we at least know gerrit got the signal 21:30:51 or maybe we need a longer stop_grace_period value in docker-compose 21:30:58 though it's already 5m and we stop faster than that 21:33:33 this system-config-legacy-logstash-filters job ended up on the airship cloud and it's super slow :/ 21:34:49 slightly worried it might time out 21:36:31 fungi: I put a kill command (commented out) in the review-test screen if we want to try and manually stop the gerrit process that way and see if it goes away quickly like we see with docker-compose down 21:37:01 checking 21:38:04 if https://review.opendev.org/c/opendev/system-config/+/763663 fails I'm gonna break for lunch/rest 21:38:09 while it rechecks 21:38:50 clarkb: yeah, that looks like the proper child process 21:39:17 k I guess I should go ahead and run it and see what happens 21:40:00 it stopped almost immediately 21:40:19 that's "good" i guess. means our docker compose file is unlikely to be broken 21:40:48 I wonder if that means that gerrit no longer handles sighup 21:41:11 may be worth double-checking the error_log to see if it did log a graceful stop 21:45:09 wow I think it may have finished just before the timeout 21:45:15 the job I mean 21:45:41 seems like we ought to consider putting it on a diet rsn 21:46:19 hey! my stuff is logging 21:46:56 looks like it's getting a bunch of empty strings on read 21:47:07 Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 21:46:20,320 DEBUG gerrit.GerritWatcher: Cannot parse data from Gerrit event stream: 21:47:10 Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: '' 21:47:35 it's seriously spamming the log 21:47:50 fungi: maybe just filter those out then? 21:47:50 i wonder if this is some pathological behavior from it getting disconnected and not noticing 21:47:55 oh maybe? 21:48:19 but yeah, i'll add a "if line" or similar to skip empty reads 21:48:21 ok I don't think https://review.opendev.org/c/opendev/system-config/+/763663 was able to trigger the deploy job 21:48:23 and see what happens 21:48:53 it's having trouble stopping the gerritbot container even 21:50:14 okay, it's restarted with that conditional wrapping everything after the readline() 21:50:16 interestingly zuul doesn't appear to have that problem 21:50:27 could it be a paramiko issue? 21:50:33 maybe compare paramiko between zuul and gerritbot 21:51:34 infra-root do we want to land https://review.opendev.org/c/opendev/system-config/+/757162 to try and get the infra-prod-service-review deploy job to run now? Or would we prefer a more noopy change to get the change that previously merged to run?
21:52:00 I don't think enqueue will work because the issue was present on the change that merged earlier so enqueuing will just abort 21:52:37 i'm in the middle of dinner now so can look soonish or will happily defer to others' judgement there 21:52:49 fungi: related to that I'm fading fast I think I need a meal 21:52:53 * clarkb tracks one down 21:53:19 I think the risk with 757162 is that it adds more changes to apply with infra-prod-service-review rather than the more simple 757161 21:53:44 I'll push up a noopy change then go get food 21:55:42 remote: https://review.opendev.org/c/opendev/system-config/+/763665 Change to trigger infra-prod-service-review 21:56:00 and now I'm taking a break 21:56:55 i'm surprised gerrit would want to do the UA matching; but we could do something like the google approach and move git to a separate hostname, but then we do the UA switching with a 301 in apache? 21:57:40 ianw: well it's luca not google 21:58:11 I'm not quite sure how just a separate hostname helps 21:58:12 i haven't seen anyone reply or comment on my bug report yet 21:58:24 because you need to apply gerrit acls and auth 21:58:54 fungi: ya most of the discussion is on the mailing list I'm hoping they poke at the bug when the work week resumes 21:59:02 is there some sort of side discussion going on? 21:59:04 oh 21:59:58 process: mention to luca and he asks me to file a bug. i do that and discussion of it happens somewhere other than the bug or my e-mail 22:00:29 indeed 22:01:35 so just to be clear: if there has been some discussion of the bug report i filed, i have seen none of it 22:02:10 i'll happily weigh in on discussion within the bug report 22:02:58 I mentioned both the issue and the bug itself on my response to the mailing list and they are now discussing it on the mailing list not the bug 22:03:26 i'll just continue to assume my input on the topic is not useful in that case 22:04:25 my hunch is it's more that on sunday it's easy to couch quarterback the mailing list but not the bug tracker 22:05:22 fair enough, i'll be at the ready to reply with bug comments once the professionals are back on the field 22:06:57 yeah, i see it as just floating a few ideas; but fundamentally you let people call their projects anything, have people access repos via / and use some bits for their UI. seems like a choose two situation 22:07:47 clarkb: https://review.opendev.org/c/opendev/system-config/+/757162 seems ok to me? 22:12:31 ianw: ya I expect it's fine. it's more that we've force merged a number of changes as well as merged 757161 at this point and none of those have run yet 22:12:39 ianw: so I'm thinking keep the delta down as much as possible may be nice 22:13:05 ianw: but if you're able to keep an eye on things I'm also ok with merging 757162 22:13:13 I'm "around".
Eating soup 22:13:53 fwiw looking at the code it seems that gerrit does properly install a java runtime shutdown hook 22:14:03 not sure if that hook is sufficient to gracefully stop connections though 22:14:17 clarkb: yeah, i'm around and can watch 22:14:34 ianw: that change still has my WIP on it but feel free to remove that and approve if the change itself looks good after a review 22:15:13 ianw: I also put a copy of gerrit.config and secure.config in ~clarkb/gerrit_config_backups on review to aid in checking of diffs after stuff runs 22:15:30 clarkb: i'm not sure i can remove your wip now 22:15:35 oh because we don't admin 22:15:40 ok give me a minute 22:16:30 WIP removed 22:16:35 (but I didn't approve) 22:17:40 i'll watch that 22:44:27 TASK [sync-project-config : Sync project-config repo] ************************** seems to be failing on nb01 & nb02 22:44:34 :/ 22:45:26 are the disks full again? 22:45:34 we put project-config on /opt too 22:45:52 /dev/mapper/main-main 1007G 1007G 0 100% /opt 22:45:57 clarkb: jinx :) 22:46:12 ok, it looks like i'm debugging that properly today now :) 22:48:06 gerritbot is still parsing events for the moment 22:53:06 time check, we've got just over two hours until our maintenance is officially slated to end 22:53:30 system-config-run-review (2. attempt) ... unfortunately i missed what caused the first attempt to fail 22:53:45 this is on the gate job for https://review.opendev.org/c/opendev/system-config/+/757162/ 22:54:18 fungi: yup I'm hopeful we'll get ^ to deploy and we restart one more time 22:54:38 i'll go poke at the zuul logs to make sure it was an infra error, not something with the job 22:55:20 but once that restart is done and we're happy with things I think we call it done 22:58:11 it's merging now 22:58:27 excellent 22:58:34 infra-prod-service-review is queued 22:59:59 and ansible is running 23:01:14 and it's done 23:01:22 the only thing I didn't quite expect was it restarted apache2 23:01:31 so maybe the edits we made to the vhost config didn't quite line up 23:01:42 I'm going to compare diffs and look at files and stuff now 23:03:18 gerrit.config looks "ok". We are not quoting the same way as gerrit I don't think so a lot of the comment links have "changes" 23:03:24 I think those are fine 23:03:51 secure.config looks good 23:04:11 we should probably try to normalize them in git though 23:04:16 ++ 23:04:18 docker-compose.yaml lgtm 23:04:43 the track-upstream and manage-projects scripts lgtm 23:05:45 patchset-created lgtm 23:06:35 the apache vhost lgtm it's got the request header and no /p/ redirection 23:07:34 ok I think the only other thing to do is delete/move aside the files that 757161 stops managing 23:07:42 I'll move them into my homedir then we can restart? 23:09:50 sgtm 23:10:09 on hand for the gerrit container restart once you've moved those files away 23:10:55 files are moved 23:11:23 fungi: do you want me to do the down up -d or will you do it? 23:11:31 happy to do it 23:11:31 (not sure if on hand meant around for it or doing the typing) 23:11:34 k go for it 23:12:03 downed and upped in the root screen session on review.o.o 23:12:08 yup saw it happen 23:13:56 seems to be up now. I can view changes 23:14:08 fungi: you may need to convince gerritbot to be happy?
or maybe not after your changes 23:14:58 on the upgrade etherpad everything but item 7 is struck through 23:16:10 I'll abandon my noopy change 23:16:13 looking 23:16:34 Nov 22 23:11:46 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 23:11:46,598 DEBUG paramiko.transport: EOF in transport thread 23:16:53 that seems likely to indicate it lost the connection and didn't reconnect? 23:17:31 i've restarted the gerritbot container now 23:17:58 it's getting and logging events 23:18:12 cool 23:18:21 fungi: you haven't happened to have drafted the content for item 7 have you? 23:18:55 nope, but i could 23:19:19 re the config I actually wonder if what is happening is we quote things in our ansible and old gerrit removed them but new gerrit does not remove them 23:19:33 because the config has been static since 2.13 except for hand edits 23:19:43 fungi: that would be great 23:19:59 i'll start a draft in a pad 23:20:07 zuul is seeing events too because a horizon change just entered the gate 23:33:19 fungi: I've detached from the screen on review and bridge 23:33:33 I don't think they have to go away anytime soon but I think I'm done with them myself 23:33:34 cool 23:33:54 also detached on review-test 23:37:27 started the announce ml post draft here: https://etherpad.opendev.org/p/nzYm6eWfCr1mSf0Dis4B 23:37:36 i'm positive it's missing stuff 23:37:40 looking 23:38:50 fungi: I made a couple small edits 23:38:52 lgtm otherwise 00:04:56 ianw: fungi should we endmeeting in here, send that email out, and the status notice? 00:06:43 if you've got nothing else for me to do but monitor, ++ 00:06:47 i think so, unless anybody can think of anything else we need to do first 00:06:56 I can't 00:07:00 and I'm ready for a 12 hour nap 00:07:09 i hear ya 00:08:34 is the http password a suggestion or a requirement? 00:08:54 for normal users I think just a suggestion. But for infra-root we should all do that 00:08:55 unlike before, you only get one look at it now 00:09:53 yup, if you dismiss the window prematurely, you need to generate anotehr 00:09:57 another 00:10:47 fungi: I don't think you need to endmeeting because it's been longer than an hour but maybe you should for symmetry? 00:10:49 i'd probably s/mostly functional/functional/ just to not make it sound like we're worried about anything 00:14:18 fungi: and were you planning to send that out? /me is fading fast so wants to ensure this gets done :) 00:15:19 yeah, i can send it momentarily 00:15:45 don't forget you wanted the meeting log which may need an endmeeting first? 00:15:54 yup 00:28:37 okay, any objections to me doing endmeeting now? 00:28:52 no 00:29:15 in that case, all followup discussion should happen in #opendev 00:29:18 see you all there 00:29:23 o/ 00:29:24 #endmeeting