What do open file formats in data analytics, Aussie Rules football, and Gaelic Football have to do with each other? Sort of nothing – it's just an analogy that'll make sense after a few more paragraphs. Hang in there!
If you work in data and AI, you're likely familiar with data lakes, lakehouses, and powerful analytical engines like Microsoft Fabric, Snowflake, and Databricks. But it's important to recognise that the backbone of all this lies in data storage using open file formats. These formats support efficient data management and play a critical role in ensuring data interoperability, future-proofing, and optimal performance and scale.
In my opinion, their biggest advantage is their open nature – they allow for greater flexibility and reduce dependency on a single technology. I've worked on too many programs where a customer signed up for a data ecosystem built on Oracle, SQL Server or another platform, only to later switch for various reasons. That switch often means a significant and costly migration effort to rework and remediate the data infrastructure, the logic and the data itself. Open file formats help mitigate such challenges, saving time and money in the long run while also delivering performance, cost, flexibility and scale benefits.
So, let's look at open file formats and why you should care if you are working through a data strategy or data transformation, or simply looking to get started with something modern.
What are open file formats?
They are standardised, publicly documented file formats whose development is driven by community-based enhancements, which keeps them continuously improving and reliable. They are free from proprietary restrictions, which means data can be stored, accessed and processed across different software and hardware platforms without vendor lock-in. Examples include Apache Parquet and Apache Iceberg, both of which are designed to optimise data storage and processing for large-scale analytics.
Imagine data stored in your data lake, accessed by a data engineer using Snowflake for transformation, merging, and enhancements, then loading the outputs back into the data lake. Meanwhile, a data scientist prefers Databricks to read the same data for predictive workloads, again loading the outputs back into the data lake. Finally, a data analyst using Fabric reads either or both outputs for data analysis and crafting visual reports for business outcomes.
These data workers can therefore collaborate using the technologies of their choice, or technologies selected for their particular strengths. This does, of course, require a consistent framework, architecture and guardrails for these data workers to leverage and follow – but that is a story for another day.
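To make that scenario a little more concrete, here is a minimal PySpark sketch of the shared-lake pattern – one engine reads raw Parquet from the lake, curates it, and writes the result back as Parquet for any other engine to pick up. The lake paths and column names are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-roundtrip").getOrCreate()

# Read raw Parquet files straight off the lake (path is illustrative only)
orders = spark.read.parquet("s3://my-lake/raw/orders/")

# Curate: keep completed orders and aggregate spend per customer
customer_spend = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"))
)

# Write the curated output back to the lake as Parquet,
# where Snowflake, Databricks or Fabric could all read it
customer_spend.write.mode("overwrite").parquet("s3://my-lake/curated/customer_spend/")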
Let's use an analogy
Aussie Rules and Gaelic Football are quite similar, but with some fundamental differences. Aussie Rules features four posts without a crossbar or net, while Gaelic Football has two posts with a crossbar and net. The pitch dimensions and tackling rules also differ significantly. These differences mean that while some players might switch codes, the sports remain distinct due to their unique rules and equipment.
Open file formats in data analytics are like having a universal set of rules and equipment that works for both sports. If Aussie Rules and Gaelic Football shared more of their ball, pitch and rules (i.e., the "open file formats"), we could expect greater international uptake and an easier transition between the two sports (i.e., "better data analytics"). Open file formats create a common standard (i.e., "the ball, pitch and rules") that enables seamless collaboration and better performance and scale across different tools and platforms, simplifying data management and enhancing efficiency.
[Image: how AI interpreted my analogy :D]
In addition to the obvious advantages of more technology options and freedom from lock-in, the other significant advantages are performance and scale.
Let’s look at two primary open file formats
Apache Parquet and Iceberg
Apache Parquet is a columnar storage file format optimised for efficiency in big data processing. Designed for use with data processing frameworks like Apache Spark, Hive, and Hadoop, Parquet allows for efficient data storage and retrieval by storing data in columns instead of rows. This format significantly reduces the amount of data read during queries, improving performance and reducing I/O operations. Companies like Twitter and Cloudera have been significant contributors to Parquet.
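As a small, hypothetical illustration using the pyarrow library, you can write a Parquet file and inspect its footer to see the columnar layout that makes those savings possible:

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny, made-up table
events = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["AU", "IE", "AU", "IE"],
    "score": [0.9, 0.7, 0.8, 0.6],
})

# Data is laid out column by column rather than row by row
pq.write_table(events, "events.parquet")

# The footer describes row groups and per-column chunks (with min/max statistics),
# which is what lets engines read only the columns and rows they need
metadata = pq.ParquetFile("events.parquet").metadata
print(metadata.num_rows, metadata.num_row_groups, metadata.num_columns)
print(metadata.row_group(0).column(1).statistics)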
Apache Iceberg is an open table format for large analytic datasets. It was designed to manage petabyte-scale datasets on distributed storage like Amazon S3, Azure Blob Storage, and Google Cloud Storage. Iceberg addresses limitations in existing table formats, offering features such as hidden partitioning, schema evolution, and atomic operations, which make it highly suitable for modern data analytics workloads. Companies like Netflix, Apple, and LinkedIn have been significant contributors to Iceberg.
Netflix, Apple, LinkedIn, Twitter and Cloudera not only contribute to the development of these formats but also integrate them into their own data processing and analytics platforms, ensuring the formats remain robust and continue to improve.
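To give a feel for Iceberg in practice, here is a rough sketch using Spark SQL. It assumes a Spark session already configured with the Iceberg runtime and a catalog named "lake" – all names here are hypothetical. Note how the table is partitioned by a transform of the timestamp column, so queries never need to reference a separate partition column (the hidden partitioning mentioned above).

spark.sql("""
    CREATE TABLE lake.analytics.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Writes are atomic: readers see either the whole insert or none of it
spark.sql("""
    INSERT INTO lake.analytics.events
    VALUES (1, 42, TIMESTAMP '2024-05-01 10:15:00', 'page_view')
""")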
Advantages
Let's dive deeper into the advantages of a data analytics strategy and architecture built around open file formats.
1. Interoperability
These formats are designed to work across various data processing and storage systems, enabling seamless integration and data sharing. For instance, Apache Parquet is widely supported by data processing frameworks like Apache Spark, Hive and Hadoop, and Apache Iceberg tables can be read and written by many of those same engines. This means your data teams can use the best tool for the job without worrying about data format restrictions.
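A trivial way to see this locally: the same events.parquet file from the earlier sketch can be read by completely different Python tools with no conversion step. pandas and DuckDB are used here only as examples – Spark, Trino or an external table in Snowflake could read the same file just as easily.

import pandas as pd
import duckdb

# pandas reads the file for quick exploration
df = pd.read_parquet("events.parquet")
print(df.head())

# DuckDB queries the very same file with SQL, no import or conversion required
avg_scores = duckdb.sql(
    "SELECT country, avg(score) AS avg_score FROM 'events.parquet' GROUP BY country"
).df()
print(avg_scores)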
2. Future-Proofing
Using open file formats helps protect against vendor lock-in, ensuring that data remains accessible and usable over time. Proprietary formats can become obsolete if the vendor ceases support or changes the format specifications. In contrast, open formats like Parquet and Iceberg are maintained by the community, ensuring their longevity and adaptability to future technological advancements.
3. Efficiency
Open file formats are often optimised for performance, providing efficient storage and fast read/write operations. Parquet, for example, is a columnar format that significantly reduces the amount of data read during queries, improving query performance and reducing I/O costs. This efficiency is particularly important in big data environments where large volumes of data need to be processed quickly and cost-effectively.
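Continuing the small pyarrow example from earlier, the sketch below requests only two columns plus a filter; the reader then skips the column chunks and row groups it does not need, which is exactly where the I/O savings come from.

import pyarrow.parquet as pq

# Only the requested column chunks are read, and row groups whose
# min/max statistics rule out country = 'AU' are skipped entirely
au_scores = pq.read_table(
    "events.parquet",
    columns=["user_id", "score"],
    filters=[("country", "=", "AU")],
)
print(au_scores.to_pandas())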
Iceberg also offers performance benefits by supporting hidden partitioning and advanced data layouts. This allows for more efficient data skipping and pruning, which can drastically speed up query execution times.
4. Community Support
Open file formats benefit from broad community support and continuous improvements. The open-source nature of formats like Parquet and Iceberg means that they are constantly being enhanced by contributions from developers worldwide. This ongoing development ensures that these formats remain robust, secure, and up-to-date with the latest advancements in data technology.
5. Transparency
Transparency is another significant advantage of open file formats. The specifications of these formats are publicly available, which promotes trust and reliability in data handling processes. Users can understand exactly how their data is stored and processed, which is crucial for maintaining data integrity and compliance with data governance policies.
6. AI
AI models rely on high-quality, consistent data. Open file formats provide robust mechanisms for maintaining data integrity, such as schema evolution and atomic operations in Iceberg. This ensures that AI models are trained on accurate and reliable data, leading to better performance and more trustworthy results.
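As a final hedged sketch (continuing the hypothetical Iceberg table from earlier, with the Iceberg runtime and SQL extensions enabled in Spark), schema evolution is a metadata-only change – columns can be added or renamed without rewriting the underlying data files, so downstream training pipelines keep reading consistent data:

# Add a new column; existing rows simply return NULL for it
spark.sql("ALTER TABLE lake.analytics.events ADD COLUMNS (device_type STRING)")

# Rename a column; Iceberg tracks columns by ID, so existing data files stay valid
spark.sql("ALTER TABLE lake.analytics.events RENAME COLUMN payload TO event_payload")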
Conclusion
Open file formats are a vital component of modern data analytics strategies, enabling organisations to manage their data more effectively and extract valuable insights with greater ease and reliability. Currently, Snowflake supports Iceberg through external data storage configurations, whereas Databricks and Fabric have leaned towards the Parquet regime (including Delta Lake, which stores its data as Parquet files), but both are now extending their offerings with support for Iceberg.
It seems the strategic choice data professionals face will become less about whether they prefer Fabric, Snowflake or Databricks, and more about whether they adopt Parquet or Iceberg.
Of course, competition between the technology vendors will still be there; in future they will just need to showcase how well they work across these open file formats (the proverbial pitch, ball and rules) and what benefits they bring to the various data professional personas.