Azure Data Factory Tutorial

Are you ready to learn one of the most exciting tutorials of recent times? Your wait is over. This Azure Data Factory tutorial explains the essential concepts and functionalities of the service. As we know, we generate a large amount of data every day from multiple sources; to secure that data and maintain its accuracy, Azure Data Factory helps users overcome the hassle. Azure Data Factory comes with many advanced features, such as data pipelines, Azure SQL Server integration, and the Azure portal. So let’s get an in-depth knowledge of Azure Data Factory concepts.

Introduction to Azure Data Factory:

Azure Data Factory is a cloud-based data integration service that helps users create data-driven workflows for orchestrating and automating data movement in the cloud. Azure Data Factory lets users create and schedule these workflows around their data stores. It is used with many application tools, such as business intelligence (BI), data visualization, and analytics processes. The main functionalities of Azure Data Factory include storing large amounts of data, analyzing data, moving data through factory pipelines, publishing data, and supporting data visualization.

IMAGE 

Azure Data Factory is developed by Microsoft Corporation, one of the top software companies in the world.

Architecture of Azure Data Factory:

The architecture shows how Azure Data Factory works and what its components are. Let’s look at the complete picture of the Azure Data Factory architecture:

IMAGE

In this architecture, Azure Data Factory performs tasks such as automating data pipelines, incrementally loading data, integrating data, and loading binary data such as geospatial data and images. The Azure Data Factory architecture consists of the following components:

  • Data Sources: There are two types of data sources: an on-premises SQL Server and external data. The on-premises SQL Server stores data in a SQL Server database; it simulates on-premises data and is deployed through a server script. External data integrates multiple data sources from the data warehouse; this data can later be used for insights, especially when analyzing sales growth.
  • Data ingestion and storage: Azure Data Factory works with three storage and processing services: Blob Storage, Azure Synapse, and the data factory itself. Blob Storage stages the data before it is loaded into Azure Synapse. Azure Synapse is a distributed system designed for analytics over large volumes of data and supports massively parallel processing (MPP). The data factory manages the orchestration, data automation, and data transformation.
  • Data analysis and reporting: The analysis and reporting layer is a fully managed service that offers data modelling and analytical capabilities. To perform the analytics and reporting, it is usually paired with a business intelligence (BI) tool such as Power BI.
  • Data Authentication: Azure Active Directory (AD) is the authentication service that lets users connect to the analytical services. Importantly, Data Factory also uses Active Directory to authenticate, via a service principal or Managed Identity.
  • Data Pipeline: A data pipeline is a logical grouping of activities that coordinates tasks such as data loading and data transformation. A parent pipeline can run a sequence of many child pipelines, and each child pipeline loads data into one or more data warehouse tables; a minimal sketch of such a parent pipeline follows this list.
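
Below is a minimal sketch of what such a parent pipeline definition can look like, written as the JSON payload (in Python dict form) that Data Factory accepts; the pipeline and child pipeline names are hypothetical placeholders.

```python
# A sketch (not the full ADF schema) of a parent pipeline that runs two
# hypothetical child pipelines in sequence via Execute Pipeline activities.
parent_pipeline = {
    "name": "ParentLoadPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "RunStageLoad",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "StageLoadPipeline",  # child pipeline
                                 "type": "PipelineReference"},
                    "waitOnCompletion": True,  # block until the child finishes
                },
            },
            {
                "name": "RunWarehouseLoad",
                "type": "ExecutePipeline",
                # dependsOn chains this activity after the first one succeeds
                "dependsOn": [{"activity": "RunStageLoad",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "pipeline": {"referenceName": "WarehouseLoadPipeline",
                                 "type": "PipelineReference"},
                    "waitOnCompletion": True,
                },
            },
        ]
    },
}
```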

Features of Azure Data Factory:

The following features illustrate the functionality of Azure Data Factory. They also support collecting data, transferring data, and storing data. Let’s see them in depth:

  • Collect and connect data: Data in Azure Data Factory is copied through pipelines; it can be transferred to cloud-based data stores, and the same data can be moved from on-premises SQL servers.
  • Publishing the collected data: Azure Data Factory helps users publish structured, analyzed, and visualized data. It also helps monitor the data already stored in the Azure warehouse.
  • Data monitoring: Microsoft PowerShell and the Azure portal can be used to monitor the data moving through a pipeline. Because Azure Data Factory is built around data-driven workflows, users can move and transfer data quickly. Note that Azure Data Factory never works as a single process; it consists of many tools and components that perform many tasks.
  • Pipeline: A pipeline is a unit of work performed by a logical group of activities. An Azure data factory can have a single pipeline or multiple pipelines. The essential tasks include data transmission, analysis, and storage.
  • Activities: Activities represent the data processing steps within a pipeline. They include copying, transferring, and moving data.
  • Data sets in Azure Data Factory: Data sets are data structures that represent the data in the data stores.

  • Linked services: A linked service acts as a bridge that connects Azure Data Factory to external resources. There are two types of linked services: compute resources and data stores.
  • Triggers: A trigger determines when a pipeline run is kicked off, for example on a schedule or in response to an event; a sample schedule trigger follows this list.
  • Control flows: A control flow is the orchestration of activities within a pipeline; activities can be chained together in a sequence, like a thread or chain.
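
As an illustration of triggers, here is a minimal sketch of a schedule trigger definition, again as the JSON payload in Python dict form; the trigger name, start time, and referenced pipeline are hypothetical placeholders.

```python
# A sketch of a schedule trigger that kicks off a pipeline every hour.
schedule_trigger = {
    "name": "HourlyTrigger",  # hypothetical trigger name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",                 # run every hour
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z", # placeholder start time
                "timeZone": "UTC",
            }
        },
        # the pipeline(s) this trigger starts
        "pipelines": [
            {
                "pipelineReference": {"referenceName": "ParentLoadPipeline",
                                      "type": "PipelineReference"},
                "parameters": {},  # pipeline parameters, if any
            }
        ],
    },
}
```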

Azure Data Factory overview:

IMAGE 

In this section, we look at how the Azure Data Factory components work together. First, you create a data pipeline to execute one or more data activities. If an activity moves or transforms data, you need to define the input and output data formats. After that, you connect the data sources and other services through linked services. With integration runtimes, you can also specify the infrastructure and the location where the data is processed. Once you create a pipeline, you can attach triggers so that it runs automatically at specific times or in response to events. If you don’t want automatic pipelines, you can also create pipelines from pre-defined or customized templates. A minimal programmatic sketch follows.
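
To make this concrete, here is a minimal sketch of creating and running a pipeline with the Python SDK, assuming the azure-identity and azure-mgmt-datafactory packages. The subscription ID is a placeholder, and the resource group and factory names reuse the ones created later in this tutorial.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

subscription_id = "<your-subscription-id>"  # placeholder
rg_name = "ADFGetStartedRG"                 # resource group used in this tutorial
df_name = "GetStartedDF"                    # data factory used in this tutorial

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) a trivial pipeline containing a single Wait activity.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)]
)
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)

# Kick off a run on demand (a trigger can automate this instead).
run = adf_client.pipelines.create_run(rg_name, df_name, "DemoPipeline", parameters={})
print(run.run_id)
```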

Azure Data Factory visual tools:

The Azure Data Factory visual tools help increase productivity and efficiency. They enable Azure Data Factory to work with both the latest and legacy systems. The following are the essential tools:

  • Author and Monitor: The Author and Monitor experience helps increase productivity when building pipelines and helps applications run quickly.
  • More data connectors: These connectors let you reach data on on-premises servers or in the cloud. The following are the sub-features of the data connectors:

  • Notebook activity: Ingests data at scale from more than 70 on-premises or cloud data sources, and helps prepare or transform the ingested data in Azure Databricks through a Notebook activity.

  • Filter activity: Filters the data ingested from the 70+ on-premises and cloud data sources.

  • Execute SSIS Package: Executes SSIS packages on the Azure-SSIS integration runtime in the data factory.

  • Lookup activity: Supports retrieving a data set from more than 70 data sources.

  • Azure Key Vault integration: Azure Key Vault is a kind of linked service; it refers to secrets stored in the key vault so the data factory can use them.
  • Iterative development and debugging: These visual tools are mainly used to debug ETL or ELT pipelines. They support testing and debugging pipeline applications, and you can place breakpoints to debug specific portions of a pipeline.
  • View/test/run status: Once a task finishes, you can preview or view the task, test it, and check its run status (a sketch of checking run status programmatically follows this list).
  • Clone pipelines and activities: After building a pipeline, the user can clone the entire pipeline or any part of the pipeline canvas. Cloning creates an identical copy of the activities, including their settings.
  • New Resource Explorer: This tool helps users expand or collapse all the resources (data sets and pipelines). You can also resize the Resource Explorer by dragging its edge in the browser.
  • View or edit pipeline code: Users can view and edit the JSON for Data Factory pipelines. Once the changes are done, you can publish them.
  • View pending changes: Azure Data Factory shows the pending add, edit, and delete changes to triggers, data sets, linked services, and integration runtimes, along with the number of pending changes still to be published to the Data Factory service.
  • Import data resources: This tool is used to import data resources from repositories such as VSTS or a Git repository.
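
As referenced in the View/test/run status item above, here is a minimal sketch of polling a pipeline run’s status with the Python SDK, reusing the adf_client, resource group, factory name, and run object from the earlier sketch.

```python
import time

# Poll until the run leaves the Queued/InProgress states.
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
    print("Status:", pipeline_run.status)  # e.g. InProgress, Succeeded, Failed
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)  # poll every 15 seconds
```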

Lookups in Azure Data Factory:

Lookups in Azure Data Factory are similar to Copy activities: a Lookup retrieves data through a source data set, but unlike a Copy activity it has no sink data set. The main purpose of a lookup is to retrieve configuration values or small reference data sets.

The following table shows a few example rows from such a lookup database:

Source system | Source file name | Source file extension | Is active
LEGO | colors | csv | true
LEGO | inventories | csv | true
LEGO | inventory_dataparts | csv | true
LEGO | inventory_datasets | csv | false
LEGO | part_categories | csv | true
LEGO | part_relationships | csv | false
LEGO | parts | csv | true
LEGO | sets | csv | true
LEGO | themes | csv | true

The same metadata as a CSV file looks like this (in the IsActive column, 1 indicates true and 0 indicates false):

```
SourceSystem,SourceFileName,SourceFileExtension,IsActive
lego,colors,csv,1
lego,inventories,csv,1
lego,inventory_dataparts,csv,1
lego,inventory_datasets,csv,0
lego,part_categories,csv,1
lego,part_relationships,csv,0
lego,parts,csv,1
lego,sets,csv,1
lego,themes,csv,1
```
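
Here is a minimal sketch of a Lookup activity that reads this metadata file, written as the JSON payload in Python dict form; the data set name LookupMetadataCsv is a hypothetical data set pointing at the CSV above.

```python
# A sketch of a Lookup activity that returns every row of the metadata CSV.
lookup_activity = {
    "name": "LookupSourceFiles",
    "type": "Lookup",
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},  # CSV source
        "dataset": {"referenceName": "LookupMetadataCsv",  # hypothetical dataset
                    "type": "DatasetReference"},
        "firstRowOnly": False,  # return every row, not just the first
    },
}
# Succeeding activities can reference the returned rows via the expression
# @activity('LookupSourceFiles').output.value
```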


Pipelines and data activities in Azure Data Factory:

As I said earlier, an Azure data factory can consist of one or more pipelines. A pipeline is a logical grouping of activities that connects them together to perform a specific task. The following are important data activities performed by an Azure Data Factory pipeline:

  • Cleaning log data
  • Ingesting stored data
  • Mapping data flows
  • Analyzing the log data flow

An Azure Data Factory pipeline also allows users to manage the data activities individually or as a group. The activities in a pipeline define the actions to be performed on your data. There are three types of data activities available in Azure Data Factory:

  1. Data movement activities
  2. Data transformation activities
  3. Control activities.

IMAGE 

Let’s get into the full details of these three types of data activities.

1. Data movement activities:

Data movement activities copy data from multiple data sources and store it in a sink data store. The list below shows important examples of supported data stores, grouped by category (a sample Copy activity follows this list):

  • Azure -> Azure Blob storage, Cognitive Search, Cosmos DB (SQL API), Cosmos DB's API for MongoDB, Data Explorer, Data Lake Storage Gen1, Database for MariaDB, MySQL, PostgreSQL, File Storage, and managed instances
  • Database -> Amazon Redshift, DB2, Drill, Greenplum, HBase, Hive, Apache Impala, Informix, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, SAP Business Warehouse
  • NoSQL -> Cassandra, Couchbase, and MongoDB
  • Files -> Amazon S3, File System, FTP, Google Cloud Storage, HDFS, SFTP
  • Generic -> Generic HTTP, Generic OData, Generic ODBC, and Generic REST
  • Services -> Amazon Marketplace Web Service, Common Data Service, Concur, Dynamics 365, Dynamics CRM, Google AdWords, HubSpot, and Jira
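
As referenced above, here is a minimal sketch of a Copy activity definition, as the JSON payload in Python dict form; the data set names are hypothetical, and DelimitedTextSource/AzureSqlSink are just one common source/sink pairing.

```python
# A sketch of a Copy activity that moves CSV data from blob storage to Azure SQL.
copy_activity = {
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs":  [{"referenceName": "InputBlobDataset",  # hypothetical dataset
                 "type": "DatasetReference"}],
    "outputs": [{"referenceName": "OutputSqlDataset",  # hypothetical dataset
                 "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},  # read CSV from blob storage
        "sink":   {"type": "AzureSqlSink"},         # write rows to Azure SQL
    },
}
```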

2. Data transformation activities:

An Azure Data Factory pipeline supports the following important transformation activities, each listed with the compute environment it runs on (a sample notebook activity follows this list):

  • Data Flow: Azure Databricks clusters managed by Azure Data Factory
  • Azure Function activity: Azure Functions
  • Hive: HDInsight (Hadoop)
  • Pig: HDInsight (Hadoop)
  • MapReduce: HDInsight (Hadoop)
  • Hadoop Streaming: HDInsight (Hadoop)
  • Machine Learning activities (Batch Execution and Update Resource): Azure VM
  • Stored Procedure: Azure SQL, Azure Synapse Analytics (formerly SQL Data Warehouse), or any SQL Server
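
As one concrete example from this list, here is a sketch of a Databricks Notebook transformation activity in the same dict form; the linked service name, notebook path, and parameters are hypothetical placeholders.

```python
# A sketch of a transformation step that runs a Databricks notebook.
notebook_activity = {
    "name": "TransformRawData",
    "type": "DatabricksNotebook",
    "linkedServiceName": {"referenceName": "AzureDatabricksLinkedService",  # hypothetical
                          "type": "LinkedServiceReference"},
    "typeProperties": {
        "notebookPath": "/Shared/transform-raw-data",  # notebook in the workspace
        "baseParameters": {"inputPath": "raw/", "outputPath": "curated/"},
    },
}
```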

3. Control Flow Activities:

The following are the important control flow activities (a combined example follows this list):

  • Append Variable: Adds a value to an existing array variable.
  • Execute Pipeline: Lets a Data Factory pipeline invoke another pipeline.
  • Filter: Applies a filter expression to an input array.
  • ForEach: Defines a repeating control flow in a pipeline; it iterates over a data collection and executes the inner activities in a loop.
  • Get Metadata: Retrieves the metadata of any data in Azure Data Factory.
  • If Condition: Branches based on a condition that evaluates to true or false, providing the same functionality an if statement provides in programming languages.
  • Lookup: Reads or looks up a record, table name, or value from an external data source; the output can be referenced by succeeding activities.
  • Set Variable: Sets the value of an existing variable.
  • Until: Implements a do-until loop similar to the do-until structure in programming languages; it executes a set of activities in a loop until the condition associated with the activity evaluates to true.
  • Validation: Ensures a pipeline continues execution only if a reference dataset exists.
  • Wait: Pauses the pipeline for a specified period before it continues with subsequent activities.
  • Web: Calls a custom REST endpoint from a Data Factory pipeline.
  • Webhook: Calls an endpoint and passes a callback URL; the pipeline run waits for the callback before proceeding.
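
To show how these control flow activities combine, here is a sketch of a ForEach activity that iterates over the rows returned by the earlier Lookup activity and copies each source file; all activity, data set, and parameter names are hypothetical.

```python
# A sketch of a ForEach loop driven by the output of LookupSourceFiles.
foreach_activity = {
    "name": "ForEachSourceFile",
    "type": "ForEach",
    "dependsOn": [{"activity": "LookupSourceFiles",
                   "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        # iterate over the rows returned by the Lookup activity
        "items": {"value": "@activity('LookupSourceFiles').output.value",
                  "type": "Expression"},
        "activities": [
            {
                "name": "CopyOneFile",
                "type": "Copy",
                "inputs":  [{"referenceName": "SourceFileDataset",  # hypothetical
                             "type": "DatasetReference",
                             # pass the current row's file name to the dataset
                             "parameters": {"fileName": "@item().SourceFileName"}}],
                "outputs": [{"referenceName": "StagingTableDataset",  # hypothetical
                             "type": "DatasetReference"}],
                "typeProperties": {"source": {"type": "DelimitedTextSource"},
                                   "sink": {"type": "AzureSqlSink"}},
            }
        ],
    },
}
```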

Creating an Azure data factory using the Azure portal:

The following steps explain how to create an Azure data factory using the Azure portal.


Steps:

1. First, sign in to the Azure portal.

2. Now choose New -> Data + Analytics -> Data Factory, as shown in the image below.

IMAGE

3. Go to the New data factory blade; under Name, enter GetStartedDF.

IMAGE

4. Under the Subscription section, select your Azure subscription; you are about to create your first Azure data factory.

5. Now choose an existing resource group, or create a new one. For example, create a resource group named ADFGetStartedRG.

6. Go to the Location option and select a region for the data factory. The drop-down list shows only the regions supported by Azure Data Factory.

7. Now select the Pin to dashboard check box.

8. Select the Create button.

9. On the Azure dashboard, you will see the status Deploying Data Factory.

IMAGE

10. After the new data factory has been created, you will see the Data factory page, which contains many important details; a programmatic equivalent is sketched below the image.

IMAGE
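
If you prefer code over the portal, the same data factory can be created programmatically. This is a minimal sketch assuming the azure-identity and azure-mgmt-datafactory packages, reusing the names from the steps above; the subscription ID is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"  # placeholder
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

df = adf_client.factories.create_or_update(
    "ADFGetStartedRG",           # resource group from step 5
    "GetStartedDF",              # factory name from step 3
    Factory(location="eastus"),  # pick a supported region (step 6)
)
print(df.provisioning_state)     # "Succeeded" once deployment completes
```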

Why is Azure Data Factory so important?

The following are the important advantages of Azure Data Factory:

  • Supports organizations in migrating virtual machines from private clouds to Azure.
  • Helps connect easily with multiple existing Azure subscriptions.
  • Lets you create and manage Azure virtual machines from a single console.
  • Backs up Azure workloads for public cloud-based enterprises.
  • Supports users in deallocating Azure virtual machines, which saves time.
  • Many Azure Data Factory features also support other Microsoft applications.

Conclusion:
In this blog, I have tried my best to explain the basic to advanced features of Azure Data Factory and their benefits. As we know, it is not an easy task to maintain and control data generation, so the Azure Data Factory tool helps us overcome this hassle and also supports many Microsoft applications. I hope this blog helps a few of you learn and implement the features of Azure Data Factory.
