When testing Azure Data Lake (ADL) to Azure Data Warehouse (ADW) file ingestion, this error kept coming up on various external table SELECTs. The confusing part was that the ADL contained only parquet files, and the only external file format defined was also for parquet. So where was an RCFile error coming from? The bottom line, in this particular engagement, was that this error is actually a truncation error.
Things to Verify:
Supporting t-sql Scripts
If you are new to PolyBase and external tables in SQL Server environments, here are three t-sql scripts that serve as supporting references for the error resolution given above.
Example CREATE EXTERNAL DATA SOURCE t-sql script:
CREATE EXTERNAL DATA SOURCE [MyDataSourceName]
WITH (TYPE = HADOOP,
LOCATION = N'adl://MyDataLakeName.azuredatalakestore.net',
CREDENTIAL = [MyCredential])
Example CREATE EXTERNAL FILE FORMAT t-sql script:
CREATE EXTERNAL FILE FORMAT [MyFileFormatName]
WITH (FORMAT_TYPE = PARQUET
, DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec')
Example CREATE EXTERNAL TABLE t-sql script:
BEGIN TRY DROP EXTERNAL TABLE [ext].[MyExternalTableName] END TRY BEGIN CATCH END CATCH
CREATE EXTERNAL TABLE [ext].[MyExternalTableName]
(
[ColumnName1] bigint NULL
,[ColumnName2] nvarchar(4000) NULL -- if this value is too small, you will get the conversion error
,[ColumnName3] bit NULL
,[ColumnName4] datetime NULL
,[ADLcheckSum] nvarchar(64) NULL -- if this value is too small, you will get the conversion error
,[ADFIngestionId] nvarchar(64) NULL -- if this value is too small, you will get the conversion error
)
WITH (DATA_SOURCE = [MyDataSourceName]
, LOCATION = N'/Folder1/Folder2/'
, FILE_FORMAT = [MyFileFormatName]
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0)
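Since the fix comes down to declared column widths, it can help to list what the external table currently declares before widening anything. The following is only a sketch, assuming the [ext].[MyExternalTableName] table from the example above and that the standard catalog views (sys.external_tables, sys.columns, sys.types) are available in your environment. Note that max_length is reported in bytes, so an nvarchar(64) column shows as 128.

```sql
-- Sketch: list the declared type and width of each column on one external table.
-- The schema and table names are the placeholder names used in this post.
SELECT s.name        AS SchemaName
     , t.name        AS TableName
     , c.name        AS ColumnName
     , ty.name       AS TypeName
     , c.max_length  AS MaxLengthBytes -- bytes, not characters; nvarchar uses 2 bytes per character
FROM sys.external_tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.columns AS c ON c.object_id = t.object_id
JOIN sys.types   AS ty ON ty.user_type_id = c.user_type_id
WHERE s.name = N'ext'
  AND t.name = N'MyExternalTableName'
ORDER BY c.column_id;
```

If a string column looks narrower than the data you expect in the parquet files, drop and recreate the external table with a wider definition and rerun the failing SELECT.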
Querying across cloud databases is supported in Azure through elastic queries (in preview). You can read more about that here, but I thought a good talking point would be to briefly compare elastic query to PolyBase. You can read about PolyBase here.
Note: At the writing of this blog post, an Azure Data Warehouse could not serve as a "principal" in an elastic query, but it can be the "secondary".
These two Azure features have similar setup. They both require ...
PolyBase is about linking to unstructured data, not another database. That is truly the short version of the matter. On both principal servers shown above the t-sql syntax is the same: SELECT ColumnName FROM externalSchemaName.TableName. It is not evident from the query which feature you are using, elastic query or PolyBase. Although you can JOIN an internal and an external table together, this might fall under the heading "I can, but I won't". It really depends on the size of your tables. I personally do not feel that UNION ALL poses the same performance risk.
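To make the comparison concrete, here is a rough sketch of the elastic query side, to set next to the PolyBase scripts earlier in this post. The names MyElasticDataSource, MyRemoteServer, MyRemoteDb, MyRemoteTable and dbo.MyLocalTable are hypothetical placeholders, and I am assuming a database scoped credential called MyCredential already exists. The visible difference is in the data source (TYPE = RDBMS instead of TYPE = HADOOP) and in the external table options, not in the SELECT.

```sql
-- Sketch only: all object names below are illustrative placeholders.
-- Elastic query points at another database (TYPE = RDBMS),
-- whereas the PolyBase example points at files (TYPE = HADOOP).
CREATE EXTERNAL DATA SOURCE [MyElasticDataSource]
WITH (TYPE = RDBMS,
      LOCATION = N'MyRemoteServer.database.windows.net',
      DATABASE_NAME = N'MyRemoteDb',
      CREDENTIAL = [MyCredential]);

-- The external table maps to a table in the remote database, not to a folder.
CREATE EXTERNAL TABLE [ext].[MyRemoteTable]
(
    [ColumnName] nvarchar(100) NULL
)
WITH (DATA_SOURCE = [MyElasticDataSource],
      SCHEMA_NAME = N'dbo',
      OBJECT_NAME = N'MyRemoteTable');

-- From here on the query looks the same as it would against a PolyBase table.
-- UNION ALL of a local and an external table:
SELECT [ColumnName] FROM [dbo].[MyLocalTable]
UNION ALL
SELECT [ColumnName] FROM [ext].[MyRemoteTable];
```

Whether a JOIN in place of the UNION ALL is acceptable will depend on how much remote data has to be pulled across, so test with realistic table sizes before committing to it.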
Conclusion: All said, elastic query is a really nice Azure feature that can solve data migration problems and make sharing reference data easy. It surely is not a replacement for ETL; all things in moderation, my friend! There remains a solid need for SSIS or ADFv2. For every Azure offering there is an appropriate implementation place.