Rethinking Metadata Indexing in Analytical Data Systems

Metadata Is the Real Index: How Lakehouse Systems Avoid Scanning the World

Today I want to talk about something that separates toy systems from production-grade data infrastructure:

Metadata.

Specifically, how modern lakehouse systems use metadata to filter and search massive datasets without scanning everything.

The Real Problem

Let’s say we have file-level statistics like this:

// (file path, min, max)
// "2024-01-01/0001.parquet", "aaa", "app"
// "2024-01-01/0002.parquet", "baa", "cat"
// "2024-01-01/0003.parquet", "bob", "dad"
// "2024-01-01/0004.parquet", "eel", "fit"
// "2024-01-01/0005.parquet", "goo", "hop"
// "2024-01-01/0006.parquet", "ink", "lit"
// "2024-01-01/0007.parquet", "pop", "sit"
// "2024-01-01/0008.parquet", "run", "zoo"
// "2024-01-02/0001.parquet", "aaa", "app"
// "2024-01-02/0002.parquet", "baa", "cat"
// "2024-01-03/0001.parquet", "www", "yaa"
// "2024-01-04/0001.parquet", "ybb", "zzz"

And we want to answer:

Input:
<"2024-01-01", "website">
<"2024-01-04", "youth">
<"2024-01-01", "buzz">

Output:
<"2024-01-01", "website"> → ["2024-01-01/0008.parquet"]
<"2024-01-04", "youth">   → ["2024-01-04/0001.parquet"]

The question is simple:

Given a partition (date) and a key, which files could contain the key?

Level 1 Thinking: Just Loop

Someone with 6 months of coding experience would:

Filter by date
Loop through files
Check if min <= key <= max

It works.

But does it scale?

No.

Because real systems don’t have 12 files.
They have millions.

Level 2 Thinking: Binary Search

You sort the files by min, then:

lower_bound(key)
Traverse until min > key

Better.

This works well in memory.

But we’re still thinking like this is a LeetCode problem.

Level 3 Thinking: Interval Trees

Now we build an augmented BST:

Each node stores (min, max)
Each node maintains subtree_max
While searching, we prune entire subtrees

Elegant.

Efficient.

Beautiful.

If everything fits in RAM.

The Reality: Metadata Lives on Disk

In real lakehouse systems:

Metadata doesn’t live in memory.
It lives on disk (S3, HDFS, object storage).
It must survive crashes.
It must support concurrent writers.
It must support snapshot isolation.

And here’s the critical constraint:

Sequential reads are dramatically faster than random reads.

On spinning disks: 100× difference.
On SSD/NVMe: still significantly faster.
On object storage: even more dramatic due to request overhead.

An interval tree optimized for RAM becomes a liability on disk because it requires random access.

What Systems Like Apache Hudi Actually Do

Apache Hudi doesn’t use an interval tree.

Instead, it does something much more production-friendly:

It stores metadata as a table.

More specifically:

Hudi maintains a Metadata Table
The Metadata Table is itself a Hudi table
It is append-only
It supports snapshot isolation
It stores:
- File listings
- Column statistics (min/max)
- Bloom filters
- Partition stats

This design choice matters.

Instead of building a pointer-heavy in-memory structure, Hudi:

Writes metadata sequentially
Reads metadata sequentially
Uses copy-on-write or merge-on-read for efficient snapshots
Leverages column pruning and predicate pushdown

🧱 Apache Hudi — FULL STRUCTURE (DATA + METADATA)

  <base-path>/                           ← HUDI DATA TABLE ROOT
  ├── .hoodie/                           ← DATA TABLE METADATA
  │   ├── hoodie.properties
  │   ├── timeline/
  │   │   ├── 20240101120000.commit
  │   │   ├── 20240102120000.commit
  │   │   └── ...
  │   └── auxiliary/
  │
  ├── 2024-01-01/                        ← PARTITION (DATA)
  │   ├── fileId1_0.parquet              ← BASE FILE (DATA)
  │   ├── fileId2_0.parquet
  │   └── ...
  │
  ├── 2024-01-02/
  │   ├── fileId3_0.parquet
  │   └── ...
  │
  ├── 2024-01-03/
  │   └── fileId4_0.parquet
  │
  └── .hoodie/metadata/                  ← HUDI METADATA TABLE ROOT (ALSO A HUDI TABLE)
      ├── .hoodie/                       ← METADATA TABLE METADATA
      │   ├── hoodie.properties
      │   └── timeline/
      │       ├── 20240101120000.commit
      │       └── ...
      │
      ├── files/                         ← PARTITION: FILE LISTING
      │   ├── fg_files_0.parquet
      │   └── ...
      │
      ├── column_stats/                  ← PARTITION: COLUMN STATISTICS
      │   ├── fg_colstats_0.parquet
      │   └── ...
      │
      ├── bloom_filters/                 ← PARTITION: BLOOM FILTERS
      │   ├── fg_bloom_0.parquet
      │   └── ...
      │
      └── record_index/                  ← PARTITION: RECORD → FILE MAP
          ├── fg_record_index_0.parquet
          └── ...

And this scales well.

In other words:

It optimizes for disk reality, not algorithm elegance.

Why This Scales

Because:

Sequential writes → high throughput
Sequential reads → predictable IO
Snapshot isolation → consistent queries
Append-only logs → simpler concurrency model

Spark doesn’t see raw files blindly.

Hudi controls visibility through:

Commit timelines
Snapshots
Metadata filtering

The query engine (Spark, Flink, etc.) asks:

“Which files should I read?”

And Hudi answers:

“Only these.”

The Real Insight

The interesting part isn’t interval trees vs binary search.

The real insight is this:

In distributed systems, IO patterns dominate algorithmic elegance.

Rethinking Metadata Indexing in Analytical Data Systems

Metadata Is the Real Index: How Lakehouse Systems Avoid Scanning the World

The Real Problem

Level 1 Thinking: Just Loop

Level 2 Thinking: Binary Search

Level 3 Thinking: Interval Trees

The Reality: Metadata Lives on Disk

What Systems Like Apache Hudi Actually Do

🧱 Apache Hudi — FULL STRUCTURE (DATA + METADATA)

Why This Scales

The Real Insight

Comments

More from this blog

Zigzagged Deletes: MongoDB TTL Regression Caused by a WiredTiger Cursor Bug

Building Orion: A C++23 Distributed Task Runtime Inspired by Ray

Modern Storage Engines

Hacking Secret Hitler: From Bayesian Math to AI Agents

Command Palette

Metadata Is the Real Index: How Lakehouse Systems Avoid Scanning the World

The Real Problem

Level 1 Thinking: Just Loop

Level 2 Thinking: Binary Search

Level 3 Thinking: Interval Trees

The Reality: Metadata Lives on Disk

What Systems Like Apache Hudi Actually Do

🧱 Apache Hudi — FULL STRUCTURE (DATA + METADATA)

Why This Scales

The Real Insight

Comments

More from this blog