The Azure Data Lake Store is a storage solution to manage files for big data analytical workloads. The definition of a data lake according to Wikipedia, “is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. The idea of data lake is to have a single store of all data in the enterprise ranging from raw data (which implies exact copy of source system data) to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning.”
To read more, see the MS documentation: Overview of Azure Data Lake Store.
In summarizing the documentation’s overview, here are some of the key capabilities for starting out.
- Hadoop compatible
- Virtually unlimited storage
- Performance for analytical processing
- User- and role-based security
- Data encryption
- Store any data format
In its simplest form, it is a hierarchical file system of folders and files. You run your analytical processing scripts pointing to a set of folders or files.
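To make "pointing to a set of folders or files" concrete, here is a minimal Python sketch that builds the `adl://` URIs a Hadoop-compatible job would typically be handed. The account name rkdatalake and the MyData folder come from this walkthrough; the helper name `adl_uri` is hypothetical, and placing MvcWeb.log inside MyData is an assumption made only for illustration.

```python
import posixpath

# Host suffix used by Azure Data Lake Store accounts.
ADLS_HOST_SUFFIX = "azuredatalakestore.net"


def adl_uri(account: str, *parts: str) -> str:
    """Build an adl:// URI for a folder or file in a Data Lake Store account."""
    # Paths in the store are POSIX-style: join the segments under the root
    # and drop any trailing slash.
    path = posixpath.normpath(posixpath.join("/", *parts))
    return f"adl://{account}.{ADLS_HOST_SUFFIX}{path}"


# Assumed layout: a MyData folder with one log file inside it.
folder = adl_uri("rkdatalake", "MyData")
log_file = adl_uri("rkdatalake", "MyData", "MvcWeb.log")
print(folder)
print(log_file)
```

A processing script would then be given either the folder URI (to read everything under it) or the single file URI.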
The following is how I created the Azure Data Lake Store. For the MS documentation, see Get started with Azure Data Lake Store using the Azure Portal.
- New Data Lake Store
A. Encryption Settings. I chose the more sophisticated option of creating a master encryption key in an existing Azure Key Vault so that I own the key. For details, read Data protection.
- Click Create button
- Confirm provisioning of Azure Data Lake
- The Overview section normally shows details of the data lake service. However, it is prompting for further action: grant the data lake store account RN_rkdatalake access to the key vault.
A. Click on the orange bar to set it up.
B. Click on the Grant Permissions button to grant the RN-rkdatalake account permissions.
- Let’s go back to the rkdatalake blade and take a tour of some of the unique settings.
A. Encryption settings
The master encryption key is located and managed in my Key Vault named rkEntKeyVault. The data lake store account RN_rkdatalake only has access to the key vault in order to encrypt the stored data.
B. Firewall. As a security best practice, it is recommended to enable the firewall. Firewall rules are based on a client IP address or an IP address range.
C. Pricing. For a developer scenario, pay-as-you-go should be quite fine. For me, this option has not been expensive at all, as I usually work with only several GBs of data anyway. Currently, it is 0.039 USD per GB, which is still pennies.
Other monthly plans are also available.
D. Data Explorer. This is more of a tool to explore the file system in Data Lake Store. You can create folders, upload files and manage permissions.
File preview of MvcWeb.log
You can assign permissions to a folder or file. Here I am managing permissions on the MyData folder I had created. Click Add to add a user or group from Azure Active Directory that should have access to this folder.
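Two of the settings toured above are easy to sanity-check locally before touching the portal: whether a client address would fall inside a firewall range, and what a given volume of data costs at the quoted pay-as-you-go rate. The following is only a rough Python sketch. The helper names `ip_allowed` and `storage_cost_usd` and the sample addresses (taken from documentation-reserved ranges) are hypothetical, the rate is the 0.039 USD per GB figure quoted in this post, and the real firewall is enforced by Azure, not by this code.

```python
import ipaddress


def ip_allowed(client_ip, rules):
    """Return True if client_ip lies inside any (start, end) range, inclusive."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ipaddress.ip_address(start) <= ip <= ipaddress.ip_address(end)
        for start, end in rules
    )


def storage_cost_usd(gigabytes, usd_per_gb=0.039):
    """Estimate the pay-as-you-go charge at the rate quoted in this post."""
    return gigabytes * usd_per_gb


# Hypothetical rules: one single client address and one address range.
rules = [
    ("203.0.113.10", "203.0.113.10"),
    ("198.51.100.0", "198.51.100.255"),
]

print(ip_allowed("198.51.100.42", rules))  # address inside the second range
print(ip_allowed("192.0.2.1", rules))      # address outside both ranges
print(f"{storage_cost_usd(10):.2f} USD for 10 GB")
```

Expressing a single address as a range whose start and end are equal keeps the check to one code path for both kinds of rule.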
Creating the data lake store sets the foundation for analytical processing. You may begin to upload large amounts of data into their respective folders. Examples include IoT sensor data, tweets, .csv exports from relational databases, log files, images, videos or documents. It is up to the processing application, such as Azure Data Lake Analytics U-SQL or Hadoop applications, to process the data, using a set of libraries and applying your custom logic. To be specific about which open source applications can work with Azure Data Lake Store, read Open Source Big Data applications that work with Azure Data Lake Store. Essentially, to my current understanding, only Azure’s HDInsight works with it, and not any other cloud or on-premises Hadoop platform.
Next, we will look at PowerShell and Options to Upload Data to Azure Data Lake.