The Big Tenant-to-Tenant Migration

As you may know, I worked for the Nordic part of the Thomas Cook Group. I was the O365 admin for a tenant with over 30,000 user accounts. We ran the Azure AD Connect service for the entire group and had just moved to pass-through authentication (PTA) with Seamless SSO. It was a royal pain sometimes to work in such a large company, where even a minor change could take weeks to implement and get approved by everyone involved.

As you may also know, Thomas Cook Group filed for bankruptcy in October last year. There was no advance warning or anything about what was going to happen next. But for our part, we realised that we would (if the company survived) most likely be moving our Nordic business to a new O365 tenant, so we began planning for that. Over the next few months a lot of stuff happened. The Nordic part of the group was sold off and started its own company, NLTG, and the old group was shutting down all parts of its business, except the German part, which was backed by its government and so survived (more on that later).

When we got back after the Christmas break we were given a clear order to evacuate the tenant before the end of February. Since we were a separate company and legal entity we were no longer entitled to share the old tenant which, even though it makes sense, pretty much lit a torch under our asses to get this done now. We realised it wouldn’t be a pretty or smooth operation; as I recall saying, “this is going to take a sledgehammer, not a scalpel!”. Fortunately I’m very used to sledgehammering my way to results. Yep, thinking back to that SharePoint upgrade that was all over the place!

So there we were, less than 8 weeks to pull off a migration with 3,000 users, 5,500 mailboxes, 10TB of SharePoint data, 8TB of OneDrive data and 12TB of Exchange data. And this is how it went…

Identities : The building block of any good tenant is the identities. When we first planned for the migration our plan was to have a new on-prem AD that would be fed by.. well that’s irrelevant since there was no time for that. The only way forward was to use our existing on-prem AD. But the problem was that MSFT doesn’t support syncing your on-prem identities to two tenants. Why? I have no idea – I fully get how you wouldn’t want that in a production environment (since the UPN domain can only be valid in one tenant) but for a migration like this it would have solved a lot of headaches if we were allowed to do it like that. But nope, and since we really wanted to stay in a configuration Microsoft would support, that option was out. We also had to retain our e-mail domains since we’re heavily dependent on the brand, which is almost as Swedish as Ikea, at least in Sweden. So that presented us with the first big problem – pre-populating the new tenant with 3,000 user objects so we could start copying the data, and then, when it was time to migrate, playing around with the UPN domains so the matching would work. The first step was creating the 3,000 users as cloud-only “onmicrosoft” accounts. This was done using PowerShell to export as much info on the users as possible (including “usagelocation” and “preferredlanguages” since we’re an org with offices from Thailand to Mexico!) and then PowerShell to recreate the users as closely as possible. Another step we had to take here was setting up a filter in Azure AD Connect that would sync each user to only one of the tenants, depending on the value of an extensionattribute. That way we could make sure no user was synced to both tenants at any time. It did take a lot of tinkering to get that logic working, but fortunately Microsoft have documented how to do attribute filtering, so thanks for that.
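
To make the pre-population step a bit more concrete, here is a rough sketch of the idea in Python. It is only an illustration and not the PowerShell we actually ran: the column names, the “newtenant” marker value and the function names are all made up for the example. It turns an exported user list into the cloud-only “onmicrosoft” accounts to create, and shows the kind of either/or decision the sync filter makes based on the extensionattribute.

```python
import csv
import io

# Placeholder tenant name, as used in the text; not a real tenant.
NEW_TENANT_DOMAIN = "newtenant.onmicrosoft.com"


def cloud_only_row(row):
    """Map one exported user to the cloud-only account to pre-create."""
    local_part = row["UserPrincipalName"].split("@", 1)[0]
    return {
        "SourceUPN": row["UserPrincipalName"],
        "UserPrincipalName": f"{local_part}@{NEW_TENANT_DOMAIN}",
        "DisplayName": row["DisplayName"],
        "UsageLocation": row["UsageLocation"],
        "PreferredLanguage": row["PreferredLanguage"],
    }


def build_accounts(exported_csv):
    """Build the account list and refuse to continue on UPN collisions.

    With several e-mail domains, anna@domain.se and anna@domain.no would
    both become anna@newtenant.onmicrosoft.com, so we check for that.
    """
    accounts = []
    seen = {}
    for row in csv.DictReader(io.StringIO(exported_csv)):
        account = cloud_only_row(row)
        key = account["UserPrincipalName"].lower()
        if key in seen:
            raise ValueError(
                f"UPN collision: {account['SourceUPN']} and {seen[key]}"
            )
        seen[key] = account["SourceUPN"]
        accounts.append(account)
    return accounts


def sync_target(extension_value):
    """Decide which tenant a user syncs to (hypothetical marker value).

    A user is in scope for exactly one tenant, never both.
    """
    return "new" if extension_value.strip().lower() == "newtenant" else "old"
```

The collision check is the part worth stealing: flattening around ten domains into a single “onmicrosoft” domain is exactly where two different people can end up with the same account name.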

Authentication: Remember how I said we’d just gone over to PTA for the old tenant? Well this little thing meant that as long as users were logging in to the old tenant (which we knew the German company would) we couldn’t use PTA with Seamless SSO for our users, since the Seamless SSO part is all based around a computer object in the AD forest with a Kerberos key that’s tied to the tenant! So if we set it up for our new tenant that would change the key on the computer object and they wouldn’t be able to log in anymore! To solve this we did a “quick and dirty” setup of a temporary AD FS for our users to use, based on domain. This was a surprisingly easy thing to do in Windows Server 2019 but it was an added “gotcha!” of this entire scenario!

SharePoint : The first problem with SharePoint was to determine which sites were relevant to keep and which weren’t. Our entire SharePoint was well over 20TB so we had to make sure to only copy over sites we knew were relevant to the Nordics business. But there’s no way of determining that without going through all the underlying permissions and groups to determine if “our” users are working on the site or not. It’s not like you can ask SharePoint to “give me all the sites that any user with the UPN domain @domain.se is working on”. Or maybe there is, I just didn’t have the knowledge to write that PowerShell at the time. Once that was done we used ShareGate to migrate all the SharePoint data. The biggest fear was that it wouldn’t be able to match the old identities with the new ones, but it did! I’m pretty sure it went by “DisplayName” to match them but we’re just very, very thankful it did because that would have been a mess to sort out. The biggest issue I had with ShareGate was how unpredictable it was when it came to doing incremental copies, which were run through PowerShell. We split the work across 4 different servers with about 80 sites per server. Sometimes it could complete them all in 2 hours, sometimes it took 8 hours for one server, sometimes longer. During the weekend of the actual move it took well over 12 hours to complete, which caused me a bit of unnecessary stress.
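
For what it’s worth, once you have the permissions exported the selection itself is not much logic. Below is a small Python sketch of it, under assumptions: the input is a flat list of (site URL, member UPN) pairs that you have already dumped from somewhere, the site sizes are known, and all the names are invented for the example. It is not what we ran and it has nothing to do with ShareGate’s own queueing. The second function spreads sites over a number of servers, biggest first, always onto the server with the least work so far.

```python
def relevant_sites(memberships, domains):
    """Return the site URLs that have at least one member in `domains`.

    `memberships` is an iterable of (site_url, member_upn) pairs.
    """
    wanted = {d.lower() for d in domains}
    hits = set()
    for site_url, upn in memberships:
        if upn.rsplit("@", 1)[-1].lower() in wanted:
            hits.add(site_url)
    return sorted(hits)


def split_into_queues(sites_with_size, queue_count):
    """Greedy split: largest site first, onto the currently lightest queue.

    `sites_with_size` is an iterable of (site_url, size) pairs.
    """
    queues = [[] for _ in range(queue_count)]
    loads = [0] * queue_count
    ordered = sorted(sites_with_size, key=lambda s: s[1], reverse=True)
    for site_url, size in ordered:
        lightest = loads.index(min(loads))
        queues[lightest].append(site_url)
        loads[lightest] += size
    return queues
```

Balancing by size would not have cured the unpredictable run times, but it at least avoids putting all the heavy sites on the same server by accident.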

OneDrive : Since we already had a pretty nice “masterlist” of users that we would be migrating, it was pretty easy to set up a CSV file to map “Old OneDrive -> New OneDrive” that we then used ShareGate to copy. That went pretty nicely, although there were some instances of data not being copied over, so we had to sort that out after the switch when people were missing a few files. Other than that the issue was the same as above – it was very unpredictable and I had to mess around with the queues on the weekend of the switch. We had one incident of a user’s OneDrive being almost empty, but looking back at the old OneDrive it was empty too. So our theory there is that his OneDrive client must’ve been paused, so we had to send that computer to the lab for data recovery – but that’s not ShareGate’s fault one bit!
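
A mapping file like that can be generated straight from the master list. The Python sketch below shows the idea under a few assumptions: the OneDrive URL follows the usual pattern where every character in the UPN that isn’t a letter or digit becomes an underscore (check this against your real URLs, there are exceptions), the new OneDrive belongs to the cloud-only “onmicrosoft” account, and the column names “SourceSite” and “DestinationSite” are simply ones I picked for the example.

```python
import csv
import io
import re


def onedrive_url(upn, tenant):
    """Guess a user's OneDrive URL from the UPN (assumed usual pattern)."""
    personal = re.sub(r"[^A-Za-z0-9]", "_", upn.lower())
    return f"https://{tenant}-my.sharepoint.com/personal/{personal}"


def mapping_csv(upns, old_tenant, new_tenant):
    """Build the 'Old OneDrive -> New OneDrive' mapping as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["SourceSite", "DestinationSite"])  # hypothetical headers
    for upn in upns:
        local_part = upn.split("@", 1)[0]
        new_upn = f"{local_part}@{new_tenant}.onmicrosoft.com"
        writer.writerow(
            [onedrive_url(upn, old_tenant), onedrive_url(new_upn, new_tenant)]
        )
    return out.getvalue()
```

Calling mapping_csv(["anna@domain.se"], "oldtenant", "newtenant") gives a header row plus one row pairing the old and new personal site for that user.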

Exchange : Oh joy! I was in charge of the Nordics business moving from on-prem to Online 3 years ago so I wasn’t looking forward to another move at all. After a quick check around for what tool to use (with our extremely limited budget – our company had gone bankrupt and we were still getting back on our feet!), we ended up with CodeTwo, which was by far the cheapest alternative. But as the saying goes, “you get what you pay for”, and in this instance we paid for software to move data from Mailbox A in Tenant X to Mailbox A in Tenant Y. And it did that job without much of an issue. There were still a lot of things to sort out around the move (like transport rules, conference rooms) but the big issue was just moving all the data. The biggest issue I had with that software was that it didn’t have a CSV import function when moving tenant -> tenant! When moving on-prem -> tenant that wasn’t an issue, but tenant -> tenant, well, the only way to enter a mailbox was to actually manually enter a mailbox! So we spent days entering 5,500 mailboxes and matching them with their new mailbox. A simple CSV import would’ve saved us days of work on this. My next issue with the software came when we were up to about 800 mailboxes per server on 7 servers, which really slowed the UI down. At the end it was so slow that when you started a queue for an incremental copy the UI would stop responding and you didn’t even know it was working until it was done and it just popped alive again.

Teams: Now Teams was the most interesting bit. Because Teams is built on so many technologies, it was difficult to do a proper Teams migration. No matter how far we looked we just couldn’t find a tool that would migrate Teams with the channels/chats and also take the entire underlying SharePoint site! If you had other document libraries or data on the SharePoint site, then that was lost if you migrated the Team. But if you migrated the SharePoint site you lost the data in Teams that wasn’t in the default document library! So we made the choice of migrating the SharePoint sites, since no one should have posted anything business critical in a chat in a channel in Teams. Fortunately ShareGate comes with the ability to recreate O365 groups, so all the groups got recreated and we only had to make the ones that were Teams into Teams manually; that was it. But there was a bit of “unexpected behaviour” from ShareGate when it came to legacy sites (that were migrated from on-prem) that now had an O365 group: it simply wouldn’t recognise them as O365 groups or O365 Group sites and created them as legacy sites in the new tenant regardless. But that was easy enough to handle afterwards.

Licenses: This was another headache but fortunately not mine! Since our old license agreement with Microsoft was tied to our old company we couldn’t use that. And since our company was brand new we had no credit score anywhere, so Microsoft couldn’t just hand us 3,000 licenses and hope we’d pay. After a lot of back and forth we managed to get the licenses in place well enough to start the migration and begin copying all that data. But there was still the matter of a support contract with Microsoft. There were a lot of options floating around to try different support alternatives, but in the end we agreed on a premier support deal with Microsoft. Even though the paperwork got sorted and we were told on Friday January 31st that everything was done and we now had premier support with MSFT, it turned out that, like a lot of things in O365, sometimes it can take a day or two for the wheels to turn, and you’ll see how critical this became for us.

Additional headache: One headache we had was that we’re not only running a normal business, we’re also running an airline. And the pilots must be able to check their e-mail for any notices and warnings from the aviation authorities before takeoff. This may include stuff like “this aircraft model isn’t airworthy so don’t fly this aircraft model” and “Iran just shot down a civilian aircraft, avoid their airspace”. Things like that are absolutely critical for the pilots to check for, so saying “e-mail will be down for a day” is completely unacceptable from that perspective. And we were supposed to retain all the e-mail domains, and a domain can only exist in one tenant at a time. So we had to figure out a way to handle this and move their accounts and e-mail domain as quickly as possible to avoid any flight delays because their e-mail isn’t working. (spoiler – their e-mail was down for 90 minutes)

The plan: So the best plan we came up with was to start an incremental copy of all the SharePoint/OneDrive data first thing on the morning of Saturday February 1st. Then at about 18:00 CET we’d set automatic forwarding on everyone’s mailbox in the old tenant that would forward every mail to their new mailbox with the “onmicrosoft” address. That way we were guaranteed no mail would go missing in case of bad timing. Then we’d do an incremental copy of all mailboxes. We had done this in plenty of tests and it would only take 2 hours, so we planned to start with the first and most important domain, the one for our airline, at 21:00 CET, then when that was done continue with the largest domain we had (with about 800 users) and work our way through our list of about 10 domains.
The switch consisted of a lot of steps since we weren’t allowed to sync an on-prem object to two tenants.

  • The first step was to change the UPN domain of the users on-prem to newtenant.onmicrosoft.com and let that sync to the old tenant. Since that domain didn’t exist in the old tenant, that resulted in the user being given an oldtenant.onmicrosoft.com UPN domain, which was crucial since we knew we would end up having to restore users from the recycle bin, which would be problematic if they still had their old UPN domain which was no longer in the tenant.
  • The second step was removing them from sync in the old Azure AD Connect sync and changing the extensionattribute so it would sync to the new tenant. This resulted in all users being put into the recycle bin in the old tenant, and in the new tenant it would match everyone properly as long as the UPN matched perfectly for on-prem and in the new tenant. They were then automatically converted to “synced from on-prem” users in Azure AD.
  • Thirdly we removed the domain from the old tenant and added it to the new tenant. Even though this is a straightforward process when you’ve made sure all objects for that domain are changed so the domain isn’t in use, I feared this step the most since I’ve previously had a lot of issues removing a domain like this. Then of course we’d have to set that domain to be federated so it would use AD FS.
  • Lastly we would change the UPN of the user back to their original UPN on-prem and let that sync to the new tenant which now had the new domain and everything was set.
    When we did this with our test domains (of about 20 users each) this entire process took an hour so we felt pretty comfortable we’d be done at about 3-4 on Sunday morning and then we’d get some sleep before the users woke up to check their phones only to see the “error signing you in” and they’d start calling.
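
To keep the four steps straight in your head, here is the same sequence written out as a small Python sketch that just tracks what one user’s UPN looks like at each stage. It doesn’t talk to Azure AD or AD at all, the tenant names are the placeholders from the list above, and the `Stage` fields are my own labels for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    step: str
    onprem_upn: str        # UPN on the on-prem AD object
    synced_to: str         # which tenant the sync filter points at
    old_tenant_state: str  # UPN in the old tenant, or "recycle bin"
    new_tenant_upn: str    # UPN of the account in the new tenant


def switch_stages(original_upn, old_tenant, new_tenant):
    """Walk one user through the four switch steps described above."""
    local_part = original_upn.split("@", 1)[0]
    temp = f"{local_part}@{new_tenant}.onmicrosoft.com"
    fallback = f"{local_part}@{old_tenant}.onmicrosoft.com"
    return [
        # 1: the new UPN domain is unknown to the old tenant, so the
        #    user falls back to the old tenant's onmicrosoft domain.
        Stage("1 change on-prem UPN", temp, "old", fallback, temp),
        # 2: the filter flips; the old tenant soft-deletes the user and
        #    the new tenant matches the cloud-only account on the UPN.
        Stage("2 flip sync filter", temp, "new", "recycle bin", temp),
        # 3: the e-mail domain moves between tenants; UPNs are untouched.
        Stage("3 move domain", temp, "new", "recycle bin", temp),
        # 4: the original UPN comes back and syncs to the new tenant.
        Stage("4 restore UPN", original_upn, "new", "recycle bin", original_upn),
    ]
```

Printing the stages for a handful of test users before the real switch is a cheap way to double-check that the temporary UPN really matches the cloud-only account name character for character.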

But… “no plan of operations extends with any certainty beyond the first contact with the main hostile force”.

How it played out: I woke up early on Saturday (at about 5) to start the incremental copy of all the SharePoint / OneDrive data. Unfortunately ShareGate was a bit unpredictable in its behaviour so I had to move sites around in the queues to make it before 18:00, but make it I did. Then I ran the PowerShell to set the automatic forwarding and started the incremental copy of the mail. The team (4 engineers, 1 external SME/contractor and the project manager) met up at the office at about 20:00 in the evening for pizza and a last “go/no-go” check for everything. At 21:00 I started with our airline domain, and by 22:30 it was all done: every user had the proper UPN, license and login, and everything was good to go. And that’s when it started – the operations team in our airline said they couldn’t access their e-mail in the Outlook app on their phones or computers. We had of course verified that it worked through the O365 portal so we knew everything worked. After troubleshooting this for about an hour we decided to log a Severity A case with Microsoft (at 23:30); one of us would work on this case and the rest would continue working with the other domains. That work came to a halt for one of our largest domains, which couldn’t be removed from the old tenant. No user had it in their UPN, no recipient used the domain, nothing. But the domain never got deleted, it was stuck in “pending”. So another Severity A case to Microsoft (at about 00:30) and we proceeded with the next domain. At about 02:00 in the morning that domain did eventually go away by itself and we thought everything was good, when our airline operations team (whose responsibility it is to keep the planes flying, so I have the utmost respect for them and their challenges!) wanted us to do a rollback and try again at a later date. We spent about an hour arguing with them that a rollback wouldn’t solve this issue and that we didn’t have time to try again next week since we had to evacuate the old tenant.
Another argument was that this was a client issue: the mail was accessible through the web and we could get Microsoft to solve the client issue afterwards. Fortunately we were able to convince them to proceed, but by now it was 03:00 in the morning, I had been working for 22 hours straight and I had no energy left in me, so I tried sleeping for a bit. After 2 hours I woke up to cheers because the Outlook clients in our airline had started to work, so the biggest issue we had was solved and we could keep on with the remaining few domains.
At about lunchtime on Sunday we were all done with all domains and users and started the clean-up job of on-prem systems no one knew about that had an EWS connection configured to the old tenant that no longer worked etc., and that continued for days.

So where was Microsoft in this? As I mentioned, our premier support deal with them got activated on the day before the switch. But that hadn’t replicated to all systems and instances of those systems in Microsoft, so it was a big challenge even to get them to accept a SevA case from us. But we had two cases that managed to register as SevA cases with them during this switch and they weren’t helping us with either of them. The first case was regarding Outlook clients no longer being able to connect. Many blogs on many sites on the Internet say “when moving to a new tenant this may take a few hours”. In our case we were already up to hours, and when creating new users we were able to connect to them immediately, but not to the ones that had been switched, and we didn’t see a reason why. This started to resolve itself after about 6 hours. And it wasn’t thanks to Microsoft doing anything on their side, because they called me at about 5:30 on the Sunday morning to say “sorry but we still haven’t been able to find any engineer to work on this case”. The other case we had with them was regarding the domain that wasn’t getting deleted. They called back on that issue too, after it was resolved, to ask us to verify the domain name because according to what they were seeing the domain was no longer in the old tenant, so they obviously hadn’t done anything on their end in that case either.

Lessons learned:

  • PowerShell and CSV files rule! If we didn’t have the proper master files for data information this would have been a lot more difficult.
  • Switching over 1,000 mailboxes from one tenant to another actually does take up to 6 hours for all of Exchange Online to know what hit it so the clients can connect again.
  • Azure AD Connect is very powerful and “smart” in how it matches users.
  • Information, user communication and support are vital for this! In our case we started informing people right away that it was coming, and we had staffed up extra support on the Monday to get our business up and running after this big switch, and that was really needed.
  • You can get away with buying cheaper “off the shelf” products rather than more expensive products, but expect to have to work around their flaws and shortcomings. Do you want to pay twice the amount for a more expensive solution or sacrifice a few days’ work for your staff for manual work?
  • Test-test-test and test again just to be sure.
