Skip to main content

Command Palette

Search for a command to run...

Rethinking Metadata Indexing in Analytical Data Systems

Published
β€’4 min read
Rethinking Metadata Indexing in Analytical Data Systems

Metadata Is the Real Index: How Lakehouse Systems Avoid Scanning the World

Today I want to talk about something that separates toy systems from production-grade data infrastructure:

Metadata.

Specifically, how modern lakehouse systems use metadata to filter and search massive datasets without scanning everything.


The Real Problem

Let’s say we have file-level statistics like this:

// (file path, min, max)
// "2024-01-01/0001.parquet", "aaa", "app"
// "2024-01-01/0002.parquet", "baa", "cat"
// "2024-01-01/0003.parquet", "bob", "dad"
// "2024-01-01/0004.parquet", "eel", "fit"
// "2024-01-01/0005.parquet", "goo", "hop"
// "2024-01-01/0006.parquet", "ink", "lit"
// "2024-01-01/0007.parquet", "pop", "sit"
// "2024-01-01/0008.parquet", "run", "zoo"
// "2024-01-02/0001.parquet", "aaa", "app"
// "2024-01-02/0002.parquet", "baa", "cat"
// "2024-01-03/0001.parquet", "www", "yaa"
// "2024-01-04/0001.parquet", "ybb", "zzz"

And we want to answer:

Input:
<"2024-01-01", "website">
<"2024-01-04", "youth">
<"2024-01-01", "buzz">

Output:
<"2024-01-01", "website"> β†’ ["2024-01-01/0008.parquet"]
<"2024-01-04", "youth">   β†’ ["2024-01-04/0001.parquet"]

The question is simple:

Given a partition (date) and a key, which files could contain the key?


Level 1 Thinking: Just Loop

Someone with 6 months of coding experience would:

  • Filter by date

  • Loop through files

  • Check if min <= key <= max

It works.

But does it scale?

No.

Because real systems don’t have 12 files.
They have millions.


You sort the files by min, then:

  • lower_bound(key)

  • Traverse until min > key

Better.

This works well in memory.

But we’re still thinking like this is a LeetCode problem.


Level 3 Thinking: Interval Trees

Now we build an augmented BST:

  • Each node stores (min, max)

  • Each node maintains subtree_max

  • While searching, we prune entire subtrees

Elegant.

Efficient.

Beautiful.

If everything fits in RAM.


The Reality: Metadata Lives on Disk

In real lakehouse systems:

  • Metadata doesn’t live in memory.

  • It lives on disk (S3, HDFS, object storage).

  • It must survive crashes.

  • It must support concurrent writers.

  • It must support snapshot isolation.

And here’s the critical constraint:

Sequential reads are dramatically faster than random reads.

On spinning disks: 100Γ— difference.
On SSD/NVMe: still significantly faster.
On object storage: even more dramatic due to request overhead.

An interval tree optimized for RAM becomes a liability on disk because it requires random access.


What Systems Like Apache Hudi Actually Do

Apache Hudi doesn’t use an interval tree.

Instead, it does something much more production-friendly:

It stores metadata as a table.

More specifically:

  • Hudi maintains a Metadata Table

  • The Metadata Table is itself a Hudi table

  • It is append-only

  • It supports snapshot isolation

  • It stores:

    • File listings

    • Column statistics (min/max)

    • Bloom filters

    • Partition stats

This design choice matters.

Instead of building a pointer-heavy in-memory structure, Hudi:

  • Writes metadata sequentially

  • Reads metadata sequentially

  • Uses copy-on-write or merge-on-read for efficient snapshots

  • Leverages column pruning and predicate pushdown

  • 🧱 Apache Hudi β€” FULL STRUCTURE (DATA + METADATA)

      <base-path>/                           ← HUDI DATA TABLE ROOT
      β”œβ”€β”€ .hoodie/                           ← DATA TABLE METADATA
      β”‚   β”œβ”€β”€ hoodie.properties
      β”‚   β”œβ”€β”€ timeline/
      β”‚   β”‚   β”œβ”€β”€ 20240101120000.commit
      β”‚   β”‚   β”œβ”€β”€ 20240102120000.commit
      β”‚   β”‚   └── ...
      β”‚   └── auxiliary/
      β”‚
      β”œβ”€β”€ 2024-01-01/                        ← PARTITION (DATA)
      β”‚   β”œβ”€β”€ fileId1_0.parquet              ← BASE FILE (DATA)
      β”‚   β”œβ”€β”€ fileId2_0.parquet
      β”‚   └── ...
      β”‚
      β”œβ”€β”€ 2024-01-02/
      β”‚   β”œβ”€β”€ fileId3_0.parquet
      β”‚   └── ...
      β”‚
      β”œβ”€β”€ 2024-01-03/
      β”‚   └── fileId4_0.parquet
      β”‚
      └── .hoodie/metadata/                  ← HUDI METADATA TABLE ROOT (ALSO A HUDI TABLE)
          β”œβ”€β”€ .hoodie/                       ← METADATA TABLE METADATA
          β”‚   β”œβ”€β”€ hoodie.properties
          β”‚   └── timeline/
          β”‚       β”œβ”€β”€ 20240101120000.commit
          β”‚       └── ...
          β”‚
          β”œβ”€β”€ files/                         ← PARTITION: FILE LISTING
          β”‚   β”œβ”€β”€ fg_files_0.parquet
          β”‚   └── ...
          β”‚
          β”œβ”€β”€ column_stats/                  ← PARTITION: COLUMN STATISTICS
          β”‚   β”œβ”€β”€ fg_colstats_0.parquet
          β”‚   └── ...
          β”‚
          β”œβ”€β”€ bloom_filters/                 ← PARTITION: BLOOM FILTERS
          β”‚   β”œβ”€β”€ fg_bloom_0.parquet
          β”‚   └── ...
          β”‚
          └── record_index/                  ← PARTITION: RECORD β†’ FILE MAP
              β”œβ”€β”€ fg_record_index_0.parquet
              └── ...
    

    And this scales well.

In other words:

It optimizes for disk reality, not algorithm elegance.


Why This Scales

Because:

  • Sequential writes β†’ high throughput

  • Sequential reads β†’ predictable IO

  • Snapshot isolation β†’ consistent queries

  • Append-only logs β†’ simpler concurrency model

Spark doesn’t see raw files blindly.

Hudi controls visibility through:

  • Commit timelines

  • Snapshots

  • Metadata filtering

The query engine (Spark, Flink, etc.) asks:

β€œWhich files should I read?”

And Hudi answers:

β€œOnly these.”


The Real Insight

The interesting part isn’t interval trees vs binary search.

The real insight is this:

In distributed systems, IO patterns dominate algorithmic elegance.