Last summer Microsoft rebranded the Azure Kusto query engine as Azure Data Explorer. While it does not support fully elastic scaling, it at least allows scaling a cluster up and out via an API or the Azure portal to adapt to different workloads. It also offers Parquet support out of the box, which made me spend some time looking into it.
With the heavy use of Apache Parquet datasets within my team at Blue Yonder, we are always looking for managed, scalable/elastic query engines on flat files besides the usual suspects like Drill, Hive, Presto or Impala.
For the following tests I deployed an Azure Data Explorer cluster with two Standard_D14_v2 instances, each with 16 vCores, 112 GiB RAM, 800 GiB SSD storage and the network bandwidth class "extremely high" (which corresponds to 8 NICs).
Data Preparation NY Taxi Dataset
As in the blog post on understanding Parquet predicate pushdown, we use the NY Taxi dataset for the tests because it has a reasonable size and some nice properties: it covers different data types and includes some messy data (like all real-world data engineering problems).
We can convert the CSV files to Parquet with pandas and pyarrow.
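A minimal sketch of such a conversion, assuming each CSV file has a header row and fits into memory. The function names and the example file name are placeholders, and messy columns may need explicit dtypes before pyarrow accepts them:

```python
from pathlib import Path


def parquet_path(csv_path):
    """Return the target path: same file name with a .parquet extension."""
    return Path(csv_path).with_suffix(".parquet")


def csv_to_parquet(csv_path):
    """Read one CSV file with pandas and write it as Parquet via pyarrow."""
    # Imported here so the path helper above stays usable without pandas installed.
    import pandas as pd

    target = parquet_path(csv_path)
    df = pd.read_csv(csv_path)
    # index=False avoids writing the pandas RangeIndex as an extra column.
    df.to_parquet(target, engine="pyarrow", index=False)
    return target


# Example file name (an assumption, following the public NY Taxi naming scheme).
print(parquet_path("yellow_tripdata_2018-01.csv"))
```

Calling csv_to_parquet once per monthly file produces one Parquet file per CSV file.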
Each CSV file is about 700 MiB, each Parquet file about 180 MiB, and each file holds about 10 million rows.
Data Ingestion
Azure Data Explorer supports control and query commands to interact with the cluster. Kusto control commands always start with a dot and are used to manage the service, query information about it, and explore, create and alter tables. The primary query language is the Kusto Query Language, but a subset of T-SQL is also supported.
The table schema definition supports a number of scalar data types:
Type | Storage Type (internal name) |
---|---|
bool | I8 |
datetime | DateTime |
guid | UniqueId |
int | I32 |
long | I64 |
real | R64 |
string | StringBuffer |
timespan | TimeSpan |
decimal | Decimal |
To create a table for the NY Taxi dataset we can use the .create table control command with the table name, the column names and the corresponding data types.
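As a sketch, the command text can be generated from a list of column definitions, for example in Python. The table name NYTaxi and the column subset below are assumptions for illustration, not the full schema used in the tests:

```python
def create_table_command(table, columns):
    """Build a Kusto `.create table` control command from (name, type) pairs."""
    column_list = ", ".join(f"{name}:{dtype}" for name, dtype in columns)
    return f".create table {table} ({column_list})"


# Illustrative subset of the NY Taxi columns; names and types are assumptions.
columns = [
    ("VendorID", "int"),
    ("tpep_pickup_datetime", "datetime"),
    ("passenger_count", "int"),
    ("trip_distance", "real"),
    ("total_amount", "real"),
]

command = create_table_command("NYTaxi", columns)
print(command)
```

The resulting string is what gets pasted into the query window or sent through a client; the types come from the scalar type table above.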
Ingest data into the Azure Data Explorer
The .ingest into table command can read data from Azure Blob Storage or Azure Data Lake Storage and import it into the cluster. This means the data is ingested and stored locally for better performance. Authentication is done with Azure SAS tokens.
Importing one month of CSV data takes about 110 seconds. As a reference, parsing the same CSV file with pandas.read_csv takes about 19 seconds.
One should always use obfuscated strings (the h in front of the string values) to ensure that the SAS token is never recorded or logged.
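A sketch of how such an ingest command could be assembled, again as a small Python helper. The table name and the blob URL with its angle-bracket placeholders are assumptions; the helper accepts one or several source URLs and an optional explicit format:

```python
def ingest_command(table, urls, data_format=None):
    """Build a Kusto `.ingest into table` command with obfuscated source strings."""
    # The h prefix marks each string literal as obfuscated, so the SAS token
    # contained in the URL is not recorded or logged.
    sources = ", ".join(f"h'{url}'" for url in urls)
    command = f".ingest into table {table} ({sources})"
    if data_format is not None:
        # Optional: state the format explicitly instead of relying on the extension.
        command += f" with (format='{data_format}')"
    return command


# Placeholder URL; <account>, <container> and <sas-token> must be filled in.
url = "https://<account>.blob.core.windows.net/<container>/yellow_tripdata_2018-01.csv?<sas-token>"
print(ingest_command("NYTaxi", [url], data_format="csv"))
```

Keeping the URL in a variable and only ever emitting it inside an h'...' literal makes it harder to leak the token by accident.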
Ingesting Parquet data from Azure Blob Storage uses a similar command and determines the file format from the file extension. Besides CSV and Parquet, several more data formats such as JSON, JSON Lines, ORC and Avro are supported. According to the documentation it is also possible to specify the format explicitly by appending with (format='parquet').
Loading the data from Parquet took only 30 seconds, which is already a nice speedup. One can also pass multiple Parquet files from the blob store to load the data in one run, but I did not see a performance improvement (i.e. the total was no better than the single-file duration times the number of files, which I interpret as no parallel import happening).
Once the data is ingested, one can nicely query it with Azure Data Explorer, either in the Kusto Query Language or in T-SQL.
Query External Tables
Loading the data into the cluster gives the best performance, but often one just wants to run an ad hoc query on Parquet data in the blob storage. External tables support exactly this scenario, and this time using multiple files/partitioning did help to speed up the query.
Querying external data looks similar but has the benefit that one does not have to load the data into the cluster. In a follow-up post I'll do some performance benchmarks. Based on my first experiences, it seems that the query engine is aware of some of the Parquet properties like columnar storage and predicate pushdown, because queries return results faster than loading the full data from the blob storage (with the 30mb/s limit) would take.
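To make that comparison concrete, here is a rough back-of-the-envelope calculation. It assumes the limit means 30 megabytes per second and uses the approximate file size of 180 MiB from above; both the interpretation of the unit and the helper name are assumptions:

```python
MIB = 1024 ** 2  # bytes per MiB


def full_download_seconds(size_mib, limit_mb_per_s):
    """Seconds needed to transfer size_mib MiB at limit_mb_per_s (decimal MB/s)."""
    return size_mib * MIB / (limit_mb_per_s * 10 ** 6)


# One monthly Parquet file of about 180 MiB at an assumed 30 MB/s.
per_file = full_download_seconds(180, 30)
print(f"{per_file:.1f} s per file")
```

Under these assumptions, a query that touches one file and returns in clearly less than about six seconds cannot have transferred the whole file.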
Export Data
For the sake of completeness, I'll just show an example of how to export data from the cluster back to a Parquet file in Azure Blob Storage.
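A sketch of such an export, assembled as a Python helper around the .export control command. The container URL with its placeholders, the table name NYTaxi and the example query are assumptions:

```python
def export_command(container_url, query, data_format="parquet"):
    """Build a Kusto `.export` command that writes query results to blob storage."""
    # The target is an obfuscated string (h prefix) because it carries the SAS token.
    return f".export to {data_format} (h'{container_url}') <| {query}"


# Placeholder container URL; <account>, <container> and <sas-token> must be filled in.
target = "https://<account>.blob.core.windows.net/<container>?<sas-token>"
print(export_command(target, "NYTaxi | take 1000"))
```

Everything after the <| operator is a regular query, so the export can be restricted to a filtered or aggregated subset of the table.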