Incremental Load With Azure Data Factory (ADF)
Last updated on 18th Dec 2024

Incremental load in Azure Data Factory (ADF) transfers only new or changed records from an on-premises SQL Server to an Azure SQL Database, rather than copying the entire table on every run.

What is a Watermark?
A watermark is a column in the source table that holds the last-updated timestamp or an incrementing key. It identifies the new or updated records that need to be loaded into the target data store. Using a watermark makes data loading efficient: it reduces the amount of data transferred and processed, and the smaller loads in turn make the data pipelines faster.
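As a minimal sketch, the delta query driven by a watermark looks like the following. It assumes a source table `dbo.student` with an `updateDate` watermark column (the table and column names are illustrative, matching the example used later in this article):

```sql
-- Fetch only the rows that changed since the last recorded watermark.
-- @lastWatermark holds the value saved after the previous pipeline run;
-- @currentWatermark is the new high-water mark captured at the start of this run.
SELECT studentId, studentName, updateDate
FROM dbo.student
WHERE updateDate > @lastWatermark
  AND updateDate <= @currentWatermark;
```

Bounding the query on both sides keeps a run deterministic: rows inserted while the copy is in progress are picked up by the next run instead of being partially copied.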
Workflow for Using a Watermark:
Let's understand the step-by-step implementation process for the incremental loading of data.
Step 1: Create and Populate the Table On-Premises
First, create a database on your local SQL Server. Inside that database, create a table named 'dbo.student', insert three records into it, and verify that the data is correct. This data will later be copied into a table named 'Student' in an Azure SQL Database, with the 'updateDate' column of the source table serving as the watermark that determines which records need to be copied or updated.
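A sketch of the on-premises setup follows. The column list is an assumption for illustration; only `studentId` and `updateDate` are named in the article, so adapt the rest to your schema:

```sql
-- On-premises SQL Server: create the source table and seed it with three rows.
CREATE TABLE dbo.student (
    studentId   INT IDENTITY(1,1) PRIMARY KEY,
    studentName VARCHAR(100) NOT NULL,             -- illustrative column
    updateDate  DATETIME NOT NULL DEFAULT GETDATE() -- watermark column
);

INSERT INTO dbo.student (studentName)
VALUES ('Alice'), ('Bob'), ('Carol');

-- Verify that the three records are present.
SELECT studentId, studentName, updateDate FROM dbo.student;
```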
Step 2: Create and Populate the Tables in Azure
In this step, create an Azure SQL Database through the Azure portal, then connect to it with SSMS. After connecting, create a table named Student with the same structure as the Student table on the on-premises SQL Server. Its studentId column is not an IDENTITY column; it stores the studentId values coming from the source table.
Next, create another table called stgStudent with the same structure as Student. It serves as a staging table before data is loaded into the Student table. Finally, create a table named WaterMark to record the current watermark value for each source table being loaded incrementally.
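The three Azure-side tables described above can be sketched as follows, again assuming the illustrative `studentName` column from the earlier sketch:

```sql
-- Azure SQL Database: target table, staging table, and watermark table.

CREATE TABLE dbo.Student (
    studentId   INT PRIMARY KEY,      -- NOT an IDENTITY; holds source studentId values
    studentName VARCHAR(100) NOT NULL,
    updateDate  DATETIME NOT NULL
);

-- Staging table: same structure as Student, loaded first on each run.
CREATE TABLE dbo.stgStudent (
    studentId   INT,
    studentName VARCHAR(100),
    updateDate  DATETIME
);

-- One watermark row per source table being loaded incrementally.
CREATE TABLE dbo.WaterMark (
    TableName      VARCHAR(100) PRIMARY KEY,
    WatermarkValue DATETIME NOT NULL
);

-- Seed with an early date so the first run copies everything.
INSERT INTO dbo.WaterMark (TableName, WatermarkValue)
VALUES ('dbo.student', '1900-01-01');
```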
Step 3: Create a Self-Hosted Integration Runtime
Next, create an ADF resource from the Azure portal. Open the ADF resource, go to the Manage tab, and create a new self-hosted integration runtime. The Integration Runtime (IR) is the compute infrastructure ADF uses for data flows, data movement, and SSIS package execution. A self-hosted IR is required to move data from the on-premises SQL Server to Azure SQL.
Step 4: Create the Azure Integration Runtime
An Azure Integration Runtime (IR) is required for copying data between cloud data stores. For this, you can accept the default options and set up the runtime with the name azureIR2.
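With both runtimes in place, the pipeline built on top of them typically issues the queries sketched below: two Lookup activities read the old and new watermark, a Copy activity lands the delta into the staging table, and a final step merges staged rows and advances the watermark. This is a hedged sketch of the SQL involved, not the exact ADF expressions (in ADF the parameters are wired up through activity outputs rather than T-SQL variables):

```sql
-- 1. Lookup: read the old watermark for the source table.
SELECT WatermarkValue
FROM dbo.WaterMark
WHERE TableName = 'dbo.student';

-- 2. Lookup: compute the new watermark from the source table.
SELECT MAX(updateDate) AS NewWatermark
FROM dbo.student;

-- 3. After the Copy activity lands the delta into dbo.stgStudent,
--    merge staged rows into the target table.
MERGE dbo.Student AS tgt
USING dbo.stgStudent AS src
    ON tgt.studentId = src.studentId
WHEN MATCHED THEN
    UPDATE SET tgt.studentName = src.studentName,
               tgt.updateDate  = src.updateDate
WHEN NOT MATCHED THEN
    INSERT (studentId, studentName, updateDate)
    VALUES (src.studentId, src.studentName, src.updateDate);

-- 4. Advance the stored watermark so the next run starts from here.
UPDATE dbo.WaterMark
SET WatermarkValue = @NewWatermark
WHERE TableName = 'dbo.student';
```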