Moving data from legacy systems into Azure SQL

First, some background

At the start of data projects, we often ask:

“Where’s the data coming from?”, and,
“Where’s the data going to reside?”

For a recent client, we needed to move their main business databases from services with a lot of technical debt to a new group of Azure SQL databases. Each database wasn’t big (75Gb of data and 350 million records) but it was important and it had to be 100% reconciled - so there was no margin for error.

We examined the features of two free Microsoft tools (“SQL Server Migration Assistant” & “Database Compare Utility”) and evaluated how to use them to achieve these tasks:

migrate the schema in a consistent repeatable way
migrate the data in a consistent repeatable way
validate the schema
reconcile the data on a row-by-column basis

We took a look at the features of each tool to figure out if we could achieve each of the above tasks.

SQL Server Migration Assistant

The first tool, used to migrate the schema and the data was “Sql Server Migration Assistant” (SSMA). This comes in different flavours depending on your source database and works with Access, DB2, MySQL, Oracle, and SAP ASE. It works with various (Microsoft) targets too from SQL Server 2012 through to Azure Synapes Analytics.

With SSMA we can connect to the source database, read the schema(s) we’re interested in, and run a report to see what can (and cannot) be automatically migrated. Time-wise the report took about one hour for a schema with about 3,500 objects in it.

On the DDL side, it’s very accurate and in real-world scenarios migrated around 99.9% of the objects. Some objects (such as computed columns or functional indexes) tend to get reported for refactoring, and further care is needed for areas where constraints may conflict. An example of this is that primary keys and unique constraints don’t work quite the same for all source systems but, in general, SSMA does a lot of the heavy lifting at this stage.

On the programmatic side (functions, stored procedures, and packages etc) the automatic conversion rate is a bit lower at 90% so more refactoring will need to be done. This is mainly around the use of internal system functions and parameter handling. The advantage of SSMA is that these issues all get reported to you so you’re aware of what needs to be worked on. It also gives a conservative estimate of how much effort SSMA thinks it will take a person to refactor the problem areas. As a general guide if your application code is mainly external to the database, SSMA is great for the schema migration - as well as for migration scoping. You also get options for data type changes, including whether a date should be a timestamp and checking if a number is an integer.

At this point, the tool can take the source schema, convert it based on the report, and apply it to your target database. If you prefer to control this, you can generate scripts so this can be checked and updated if required (maybe with those indexes you refactored) before being applied.

The final option is to migrate the data from source to target. There are various options for this too – how many threads to use, how many records per batch in terms of your commitment point. You can alter the SQL being used for the inserts so it’s inserted in primary key order, which then allows for data migration restart options. All of this can be driven by a command line if you don’t like the GUI. Any schema datatype changes are picked up by the data migration functionality.

By using this tool with a Premium6 level Azure SQL Server instance, we were able to migrate the data (75Gb, 350 million rows) in under three hours.

So now we have our data in the Azure SQL database – how do we know it’s correct?

Database Compare Utility

This was our best friend! It’s not a pretty tool because you’re down in the weeds with *.ini files and using xls files for configuration options, but it has three compelling reasons to use it:

Speed - we were able to do a record by record comparison between the source and the target of 350 million records in less than 2 hours.
Detail – when it finds an issue you get the record (described as a primary key reference), and the column that is mismatched with the data being logged in the output.
Support - we received great support from Microsoft on this product, directly from the developers who found a bug, and shipped a new version within 48 hours.

Some downsides

If the table does not have a primary key, then it only does a row count check. To solve this you can give it a pseudo primary key in the xls file which then does the full row by column comparison.
For large tables it can hit memory constraints – but again you can put predicates into that xls file to split the large tables up into manageable sizes. We were able to process tables up to 7 million rows without issue.

To keep the techies happy, the tool does not compare the actual data for each row – that would be too much data flying across the network. What it does is calculate the MD5 hash for the connected columns and then compare the hash values, so it’s based on the primary key and 35 bytes per row. By using native MD5HASH functions and reducing network traffic, it makes the utility fast without compromising accuracy.