ADF V2 Issue With File Extension After Decompressing Files

Posted on 20th April 2018 (updated 16th December 2019) by Ben Jarvis

On a current client project we are taking files from an on-prem file server and uploading them to Azure Blob Storage using ADF V2. The files are compressed on-prem using GZip compression and need to be decompressed before they are placed in blob storage where some other processes will pick them up.

ADF V2 natively supports decompression of files, as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed, so 1234_567.csv.gz would become 1234_567.csv. However, I’ve noticed that this doesn’t happen in all cases.

In our particular case the file names and extensions of the source files are all uppercase, and when ADF uploads them it doesn’t alter the file extension. For example, if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

If I upload 1234_567.csv.gz the functionality works correctly and I get 1234_567.csv in blob storage. This means that the file extension replacement is case sensitive when it should be case insensitive.
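The difference can be illustrated with a small Python sketch. This is not ADF's actual implementation, which I haven't seen; the function names `strip_gz_case_sensitive` and `strip_gz_case_insensitive` are hypothetical and simply contrast the behaviour I observed with the behaviour I expected.

```python
GZ_SUFFIX = ".gz"


def strip_gz_case_sensitive(name: str) -> str:
    """Mimics the observed behaviour: only a lowercase '.gz' is removed."""
    if name.endswith(GZ_SUFFIX):
        return name[: -len(GZ_SUFFIX)]
    return name


def strip_gz_case_insensitive(name: str) -> str:
    """The expected behaviour: '.gz', '.GZ', '.Gz' and so on are all removed."""
    if name[-len(GZ_SUFFIX):].lower() == GZ_SUFFIX:
        return name[: -len(GZ_SUFFIX)]
    return name


if __name__ == "__main__":
    for source in ("1234_567.csv.gz", "1234_567.CSV.GZ"):
        print(
            source,
            "->",
            strip_gz_case_sensitive(source),
            "|",
            strip_gz_case_insensitive(source),
        )
```

With the case-sensitive version the uppercase name comes back unchanged, which matches what I see in blob storage; the case-insensitive version handles both names.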

This bug isn’t a major issue for us, as the file is still decompressed and we can change the extension when we process the file further. However, it’s something that stumped me for a while.
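Because the content is already decompressed, the downstream fix is only a rename. Here is a minimal sketch of that step in Python, assuming the blob names have already been listed by whatever process picks the files up. `plan_renames` is a hypothetical helper and 9999_000.CSV.GZ is a made-up file name; the rename itself (a copy and delete in blob storage) is deliberately left out.

```python
import re

# Matches a trailing ".gz" in any letter case.
_GZ_EXTENSION = re.compile(r"\.gz\Z", re.IGNORECASE)


def plan_renames(blob_names):
    """Return {current_name: corrected_name} for names still ending in .gz."""
    renames = {}
    for name in blob_names:
        corrected = _GZ_EXTENSION.sub("", name)
        if corrected != name:
            renames[name] = corrected
    return renames


if __name__ == "__main__":
    # Names that already have the right extension are left out of the plan.
    print(plan_renames(["1234_567.CSV.GZ", "1234_567.csv", "9999_000.CSV.GZ"]))
```

Building the mapping first, rather than renaming as you go, makes it easy to log or review what will change before touching anything in storage.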

I’ve raised a bug at https://feedback.azure.com/forums/270578-data-factory/suggestions/34012312--bug-file-name-isn-t-changed-when-decompressing-f to get this fixed, so please vote and I’ll update the post once the issue has been resolved.

Ben Jarvis

Ben is the CTO at Adatis and leads technical excellence, working across the full Azure stack to architect and develop bleeding-edge Azure solutions for clients across industries. His varied IT career has spanned infrastructure, application development, and complex data engineering.