Rethinking Metadata Indexing in Analytical Data Systems

Metadata Is the Real Index: How Lakehouse Systems Avoid Scanning the World
Today I want to talk about something that separates toy systems from production-grade data infrastructure:
Metadata.
Specifically, how modern lakehouse systems use metadata to filter and search massive datasets without scanning everything.
The Real Problem
Letβs say we have file-level statistics like this:
// (file path, min, max)
// "2024-01-01/0001.parquet", "aaa", "app"
// "2024-01-01/0002.parquet", "baa", "cat"
// "2024-01-01/0003.parquet", "bob", "dad"
// "2024-01-01/0004.parquet", "eel", "fit"
// "2024-01-01/0005.parquet", "goo", "hop"
// "2024-01-01/0006.parquet", "ink", "lit"
// "2024-01-01/0007.parquet", "pop", "sit"
// "2024-01-01/0008.parquet", "run", "zoo"
// "2024-01-02/0001.parquet", "aaa", "app"
// "2024-01-02/0002.parquet", "baa", "cat"
// "2024-01-03/0001.parquet", "www", "yaa"
// "2024-01-04/0001.parquet", "ybb", "zzz"
And we want to answer:
Input:
<"2024-01-01", "website">
<"2024-01-04", "youth">
<"2024-01-01", "buzz">
Output:
<"2024-01-01", "website"> β ["2024-01-01/0008.parquet"]
<"2024-01-04", "youth"> β ["2024-01-04/0001.parquet"]
The question is simple:
Given a partition (date) and a key, which files could contain the key?
Level 1 Thinking: Just Loop
Someone with 6 months of coding experience would:
Filter by date
Loop through files
Check if
min <= key <= max
It works.
But does it scale?
No.
Because real systems donβt have 12 files.
They have millions.
Level 2 Thinking: Binary Search
You sort the files by min, then:
lower_bound(key)Traverse until
min > key
Better.
This works well in memory.
But weβre still thinking like this is a LeetCode problem.
Level 3 Thinking: Interval Trees
Now we build an augmented BST:
Each node stores
(min, max)Each node maintains
subtree_maxWhile searching, we prune entire subtrees
Elegant.
Efficient.
Beautiful.
If everything fits in RAM.
The Reality: Metadata Lives on Disk
In real lakehouse systems:
Metadata doesnβt live in memory.
It lives on disk (S3, HDFS, object storage).
It must survive crashes.
It must support concurrent writers.
It must support snapshot isolation.
And hereβs the critical constraint:
Sequential reads are dramatically faster than random reads.
On spinning disks: 100Γ difference.
On SSD/NVMe: still significantly faster.
On object storage: even more dramatic due to request overhead.
An interval tree optimized for RAM becomes a liability on disk because it requires random access.
What Systems Like Apache Hudi Actually Do
Apache Hudi doesnβt use an interval tree.
Instead, it does something much more production-friendly:
It stores metadata as a table.
More specifically:
Hudi maintains a Metadata Table
The Metadata Table is itself a Hudi table
It is append-only
It supports snapshot isolation
It stores:
File listings
Column statistics (min/max)
Bloom filters
Partition stats
This design choice matters.
Instead of building a pointer-heavy in-memory structure, Hudi:
Writes metadata sequentially
Reads metadata sequentially
Uses copy-on-write or merge-on-read for efficient snapshots
Leverages column pruning and predicate pushdown
π§± Apache Hudi β FULL STRUCTURE (DATA + METADATA)
<base-path>/ β HUDI DATA TABLE ROOT βββ .hoodie/ β DATA TABLE METADATA β βββ hoodie.properties β βββ timeline/ β β βββ 20240101120000.commit β β βββ 20240102120000.commit β β βββ ... β βββ auxiliary/ β βββ 2024-01-01/ β PARTITION (DATA) β βββ fileId1_0.parquet β BASE FILE (DATA) β βββ fileId2_0.parquet β βββ ... β βββ 2024-01-02/ β βββ fileId3_0.parquet β βββ ... β βββ 2024-01-03/ β βββ fileId4_0.parquet β βββ .hoodie/metadata/ β HUDI METADATA TABLE ROOT (ALSO A HUDI TABLE) βββ .hoodie/ β METADATA TABLE METADATA β βββ hoodie.properties β βββ timeline/ β βββ 20240101120000.commit β βββ ... β βββ files/ β PARTITION: FILE LISTING β βββ fg_files_0.parquet β βββ ... β βββ column_stats/ β PARTITION: COLUMN STATISTICS β βββ fg_colstats_0.parquet β βββ ... β βββ bloom_filters/ β PARTITION: BLOOM FILTERS β βββ fg_bloom_0.parquet β βββ ... β βββ record_index/ β PARTITION: RECORD β FILE MAP βββ fg_record_index_0.parquet βββ ...
And this scales well.
In other words:
It optimizes for disk reality, not algorithm elegance.
Why This Scales
Because:
Sequential writes β high throughput
Sequential reads β predictable IO
Snapshot isolation β consistent queries
Append-only logs β simpler concurrency model
Spark doesnβt see raw files blindly.
Hudi controls visibility through:
Commit timelines
Snapshots
Metadata filtering
The query engine (Spark, Flink, etc.) asks:
βWhich files should I read?β
And Hudi answers:
βOnly these.β
The Real Insight
The interesting part isnβt interval trees vs binary search.
The real insight is this:
In distributed systems, IO patterns dominate algorithmic elegance.



