SharePoint Upgrade From Hell

This is going to be a wall of text. And 99% of the people I know aren’t even interested. But I’m writing this on behalf of every other SharePoint admin out there who is unfortunate enough to discover just how easy SharePoint is to break!

Little background: I’ve been working with SharePoint since about 2005. Not that long for some, but long enough to know that after a few years of use a SharePoint farm picks up a few quirks and it’s a good idea to upgrade it. And you never upgrade an existing farm – you always start with a fresh one and import all the data! Now one of my jobs (!) is managing a 30k user corporate SharePoint – a business-critical solution, since all documentation is in there. And not only that, our entire BI solution is in there as well, complete with “PowerPivot” and “Reporting Services for SharePoint”… No pressure!

So now it was time to upgrade it from SP2013/SQL2014 to SP2016/SQL2016, including all BI solutions. We’d gone through a “dev” environment, a “test” environment and even a “preprod” environment, and everything went surprisingly well. There were of course the usual glitches getting the BI features to work (and the S2S cert trust that is required for Excel with data source connection files now that Excel Services moved out of SP to OOS!). But anyway, the preprod farm was so great that the plan was to take it into production. Our BI team didn’t see a big problem doing that in an afternoon on a weekday, whereas for me the biggest problem was the 1.5TB of data that needed to be shuffled and upgraded. And “even the best laid plans”, you know. I also knew that one of the biggest issues was the network infrastructure, which for a global company is so complex that the best way forward was to swap the IP addresses of the servers so we wouldn’t have to change DNS or static IP routes anywhere – we’d just solve it at the load balancer level. So I managed to get a whole Saturday from the business to have SharePoint offline, but no more. After all, all documentation is in there!

That Saturday was last Saturday, April 14th. I got up at 4am to start shuffling the data. By 7 that was done and I started upgrading the databases with the normal “mount-spcontentdatabase”. Here was my first mistake (in hindsight): I had already written a script to do this, but more on that later. By 10 everything was loaded and upgraded, and I proceeded to swap the IP addresses around, change them in the load balancer, and then go through my long list of checks that normal user SP functionality worked while our BI team updated all of their things.
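
For context: mounting is a one-liner per database, so the script was essentially a loop around it – something along these lines, with placeholder names (not my actual script):

```powershell
# Bare-bones version of the idea: attach and upgrade each 2013 content DB
# in the new 2016 farm (database names, SQL alias and URL are placeholders)
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$webApp    = "https://portal.contoso.com"
$dbServer  = "SQLALIAS"
$databases = "WSS_Content_01", "WSS_Content_02"   # ...and so on

foreach ($db in $databases) {
    # Mount-SPContentDatabase attaches the database and upgrades its schema
    Mount-SPContentDatabase -Name $db -DatabaseServer $dbServer -WebApplication $webApp
}
```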

After lunch we had a “go/no-go” meeting and everything looked good. I also noticed at this point that I had a case to create a new SharePoint site for a project – something I actually hadn’t tested, since that’s not “normal user SP functionality”. And that’s when the shit hit the fan! What I had missed, thanks to my scripting, was that one of the content databases had failed to upgrade and was now corrupt, and when I wanted to create a new site SharePoint put it in that database since it was the “least used” – hence the error. “No problem, plan a) I’ll just delete this database”, right? Nope, SharePoint wouldn’t have it, because the database wasn’t attached since it was corrupt. Yet I could see the sites in that database listed with get-spsite?
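
For anyone taking notes, two checks would have caught this before the go/no-go. First: after mounting, compare what should be attached with what actually is, and flag anything still needing an upgrade. Second: a new site collection lands in whichever attached database in the web app has the most room left, so a suspect database should be capped before anyone creates sites. Roughly, with the same placeholder names as above:

```powershell
# (same placeholder names as the mount loop above)
$webApp    = "https://portal.contoso.com"
$databases = "WSS_Content_01", "WSS_Content_02"

# 1) Anything missing or not fully upgraded?
$attached = (Get-SPWebApplication $webApp).ContentDatabases

$databases | Where-Object { $_ -notin $attached.Name } |
    ForEach-Object { Write-Warning "$_ is not attached - the mount/upgrade failed!" }

$attached | Where-Object { $_.NeedsUpgrade } |
    ForEach-Object { Write-Warning "$($_.Name) still needs an upgrade!" }

# 2) Keep new site collections out of a suspect database by capping it at its
#    current site count ("WSS_Content_Suspect" is a placeholder)
$db = $attached | Where-Object { $_.Name -eq "WSS_Content_Suspect" }
Set-SPContentDatabase -Identity $db -MaxSiteCount $db.CurrentSiteCount -WarningSiteCount 0
```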

I tried a few things but couldn’t recover, so I decided on plan b): remove the web app, create a new one and re-import/re-mount this corrupt DB – all the other DBs were already upgraded successfully, so not a big operation. Well, SharePoint wouldn’t have that either: it couldn’t dismount this database because it was corrupt, so I couldn’t remove the web app! I was completely stuck with a broken web app that I couldn’t remove because of a content database that wasn’t even mounted?!
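
For reference, the blunt instruments you’d normally reach for in that spot look roughly like the lines below (the database name is a placeholder) – in my case SharePoint refused, because the corrupt database was never properly attached in the first place:

```powershell
# Option 1: dismount detaches the content DB from the farm but leaves it in SQL
# ("WSS_Content_Corrupt" is a placeholder name)
Dismount-SPContentDatabase "WSS_Content_Corrupt" -Confirm:$false

# Option 2: the bigger hammer when it won't dismount cleanly (careful - unlike
# the Central Admin "remove" option, this one can take the SQL database with it)
# Remove-SPContentDatabase -Identity "WSS_Content_Corrupt" -Force -Confirm:$false
```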

So, plan c): rename the web app with the corrupt database and give it a nonsense URL so I could create a new web app with the proper URL. That seemed to work, but when I tried importing a fresh backup of this content DB it didn’t import any of the site collections! Digging around, I could see that the sites in the broken web app, now under the nonsense URL, still had the original URL! SharePoint couldn’t update them because… there was no content DB attached to them! I dug around in SharePoint Manager (which was designed for 2013, I know) but it kept crashing when I clicked any of the sites in the broken web app.

So there I was with a broken web app, with a corrupt content DB, with sites occupying the URL I needed for our proper web app. I came to the conclusion that the config DB was pretty much fucked at this point, and by now we were at 2pm. The best option available to me was opening a Microsoft Premier Support case at severity A. I’m pretty sure if I had gone for that they would have looked at it, made the same determination as me and said “since this is a farm not yet in production, I’d say the best way forward is to recreate the farm”. During that time our BI would be on SharePoint 2016 but the “big” web app on 2013, on separate IP addresses! God knows how the network would handle that, and getting the engineers in India to change firewall routes in less than a week wasn’t that likely. Because rebuilding a new farm in production would take at least a week, right?…

I cleared it with my supervisor that this was indeed the best way to solve it NOW! All other options led to some unknown hellhole, and going back was always a possibility no matter what.

I got a green light and a Red Bull at about 3pm …

  • SP Product Config Wizard to detach all servers from the farm
  • delete all databases from SQL except the (successfully) upgraded content DBs
  • thank myself for having saved all of the “AutoSPInstaller” response files
  • create a dummy web app to upgrade the corrupt DB (no way I’m doing that in the proper web app again!)
  • eat the food my awesome wife brought me
  • recreate all web apps
  • restore and re-mount all content DBs, about 1.5TB of them (see the sketch below)
  • upgrade the service apps
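
The restore/re-mount step was basically the morning’s loop again, just with the already-upgraded databases spread over the recreated web apps – and this time with a proper check at the end. Roughly this, with placeholder URLs and database names:

```powershell
# Re-mount the already-upgraded content DBs into the recreated web apps
# (URLs and database names are placeholders; the real mapping is much longer)
$map = @{
    "https://portal.contoso.com" = "WSS_Content_01", "WSS_Content_02"
    "https://bi.contoso.com"     = "WSS_Content_BI"
}

foreach ($url in $map.Keys) {
    foreach ($db in $map[$url]) {
        Mount-SPContentDatabase -Name $db -DatabaseServer "SQLALIAS" -WebApplication $url -ErrorAction Stop
    }
}

# And this time, verify nothing still needs an upgrade before declaring victory
Get-SPWebApplication | ForEach-Object { $_.ContentDatabases } |
    Where-Object { $_.NeedsUpgrade } |
    Select-Object Name
```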

Basically I had done at least a week’s work in 7 hours, and all in a production environment!

The “Done!” mail went out at 9pm! Now I’m not one to brag, but any SharePoint admin must be impressed by that! Hell, even Scotty would be proud! I spent a few hours on Sunday cleaning up the mess and sorting out the BI issues (since this was a new farm, a lot of BI configuration was lost), but by Sunday 6pm everything was fully operational and I promptly went to bed and slept like a baby. And one of the first things to hit me on Monday morning was “why is Managed Metadata empty?” – because yeah, in my haste I had forgotten that little thing.
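
For the record, getting Managed Metadata back isn’t a big deal as long as the old service database survived the weekend – you recreate the service application on top of it, roughly like this (names, DB and app pool are placeholders):

```powershell
# Recreate the Managed Metadata service application on top of the restored
# service database (service app name, DB name and app pool are placeholders)
$appPool = Get-SPServiceApplicationPool "SharePoint Service Applications"

$mms = New-SPMetadataServiceApplication -Name "Managed Metadata Service" `
           -ApplicationPool $appPool `
           -DatabaseName "SP_ManagedMetadata" `
           -DatabaseServer "SQLALIAS"

New-SPMetadataServiceApplicationProxy -Name "Managed Metadata Service Proxy" `
           -ServiceApplication $mms `
           -DefaultProxyGroup
```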

How was your weekend?