TLDR Data 2026-06-11
Fable Evals Performance ✅, Airbnb’s Evolved Data Architecture 🏘️, PostgreSQL Differential Privacy 🎭
Inside QuestDB's Query Engine: Tracing Three Queries (8 minute read)
QuestDB's time-series query engine appears tuple-at-a-time externally, but internally mixes vectorized execution, SIMD C++ kernels, Java batch processing, JIT filtering, and frame-based parallelism. Small SQL changes can shift execution paths, affecting group-by, filtering, and aggregation performance.
Scaling Beyond One: How Airbnb Evolved Its Data Architecture for a Multi-product World (9 minute read)
Airbnb evolved its offline data architecture for a multi-product world with a flexible modeling framework that balances shared consistency with domain-specific needs. Its three principles are no hybrid models, consistent identifier naming, and clear namespaces so teams can separate product-specific models from cross-cutting monolithic ones.
Parenting Iceberg and Lance with Gravitino: The Reality Behind Unified Lakehouse Architectures (8 minute read)
Apache Gravitino can govern Iceberg tables and Lance multimodal datasets through one metadata layer, RBAC model, and audit surface. Iceberg commits through the catalog, while Lance uses a two-step object-storage flow, with gotchas around config rewrites, jars, enum casing, and client drift.
Dagster price increase 10x insane, don't ever use them (Reddit Thread)
Dagster's managed-pricing jump has triggered backlash, pushing smaller users toward self-hosting, Airflow, Prefect, or simpler cron-style setups, even as they continue to value Dagster OSS.
We had to build new evals for Fable (8 minute read)
Claude Fable 5 is a major step up for complex data analysis, scoring roughly 10-15% better than recent frontier models on Hex's evals and excelling at messy, long-horizon tasks that require judgment, clear assumptions, and cross-checking semantic models against raw data.
Why Metadata Has to Be Mutation-Friendly (10 minute read)
In high-update lakehouses, metadata becomes a high-mutation system. Apache Hudi's Merge-On-Read Metadata Table handles this with append-first writes and deferred compaction, reducing write cost and supporting scalable indexing more efficiently than Copy-On-Write designs.
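The append-first, compact-later idea can be illustrated with a toy in-memory table. This is a minimal sketch of the general merge-on-read pattern, not Apache Hudi's API or file layout: the class name `MergeOnReadTable`, the `compact_every` threshold, and the `records_rewritten` counter are all hypothetical names chosen for illustration.

```python
class MergeOnReadTable:
    """Toy merge-on-read key-value table (illustrative only, not Hudi's API)."""

    def __init__(self, compact_every=4):
        self.base = {}                  # compacted snapshot of the table
        self.logs = []                  # append-only delta logs, oldest first
        self.compact_every = compact_every
        self.records_rewritten = 0      # rough proxy for rewrite cost

    def write(self, updates):
        # Append-first: a write only adds a small delta log.
        # The base snapshot is not touched here.
        self.logs.append(dict(updates))
        if len(self.logs) >= self.compact_every:
            self.compact()

    def read(self, key):
        # Merge on read: the newest delta log wins, then fall back to base.
        for log in reversed(self.logs):
            if key in log:
                return log[key]
        return self.base.get(key)

    def compact(self):
        # Deferred compaction: fold all delta logs into a new base snapshot.
        merged = dict(self.base)
        for log in self.logs:
            merged.update(log)
        self.records_rewritten += len(merged)
        self.base = merged
        self.logs = []
```

In this sketch a write costs only the size of its delta, and the full rewrite happens once per `compact_every` writes. A copy-on-write version would pay that rewrite on every write. Deletes and concurrent writers are left out to keep the example short.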
When Event Time Meets Reality: Lessons from Building Billing on Apache Flink (12 minute read)
While building its usage-based billing pipeline, Gorgias ran into overlapping windows and incorrect aggregations during historical reprocessing, because internal repartitioning and uneven operator behavior broke event-time guarantees. The team mitigated this by aligning keys across pipeline steps and applying conditional extra delays only during replays.
Introducing Loon: A New Storage Engine for Vector Data That Never Stops Changing (19 minute read)
Vector datasets evolve through backfills, embedding versions, and mixed workloads, not just vector columns. Loon, behind Milvus 3.0 beta and Zilliz Vector Lakebase, uses hybrid file formats, row-ID alignment, and versioned manifests so scalars, vectors, and object references can update independently with less rewriting.
Introducing Streamling: Performant and Extensible Data Streaming Runtime (7 minute read)
Streamling is an open-source Rust, Arrow, and DataFusion streaming runtime for transactional workloads rather than heavy analytics. It runs mostly single-node stateless pipelines with Kafka, Postgres, ClickHouse, HTTP enrichment, TypeScript/WASM transforms, plugins, checkpointing, and effectively-once delivery.
PostgreSQL Anonymizer 3.1: Introducing Local Differential Privacy (2 minute read)
PostgreSQL Anonymizer 3.1 expands masking for PII and sensitive data with six strategies: substitution, randomization, pseudonymization, shuffling, noise addition, and generalization. It also adds Local Differential Privacy via GRRM, which gives formal privacy guarantees for survey and categorical data, with the privacy level controlled by epsilon.
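To show how epsilon controls privacy for categorical answers, here is a minimal sketch of k-ary (generalized) randomized response, a standard local differential privacy mechanism. Whether it matches the extension's GRRM implementation is an assumption. This is plain Python rather than the extension's SQL interface, and the function names `grr_perturb` and `grr_estimate` are hypothetical.

```python
import math
import random
from collections import Counter


def grr_perturb(value, domain, epsilon, rng=random):
    """Keep the true value with probability p, else report another category."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    # Otherwise pick uniformly among the k - 1 other categories.
    others = [v for v in domain if v != value]
    return rng.choice(others)


def grr_estimate(reports, domain, epsilon):
    """Unbiased frequency estimates from perturbed reports (needs epsilon > 0)."""
    k = len(domain)
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)      # probability of reporting the true category
    q = 1.0 / (e + k - 1)    # probability of reporting one specific other category
    counts = Counter(reports)
    # Invert the expected observed share: f_obs = f * p + (1 - f) * q
    return {v: (counts[v] / n - q) / (p - q) for v in domain}
```

A smaller epsilon pushes p toward 1/k, so each individual report reveals less and the aggregate estimates get noisier. A larger epsilon does the opposite.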
DataAgents: How we turned 9 months of analysis into 10 days (6 minute read)
Capital One's DataAgent pattern cut cloud dormancy analysis across about 350 AWS, Azure, and GCP resource types from 6-9 months to 10 days. It combines asset data, AI-generated Spark SQL, confidence scoring, false-positive checks, and human validation to find high-confidence savings opportunities.
Scaling Zero Copy from 1 Trillion to 120 Trillion Rows with File Federation (5 minute read)
Zero Copy at Salesforce Data 360 evolved from Query Federation to Iceberg File Federation to support AI workloads across distributed enterprise data without centralizing it. The new architecture reduces cross-system compute overhead, preserves governance through temporary catalog-based access, and is driven by the need for real-time AI across major data platforms.
Curated deep dives, tools and trends in big data, data science and data engineering 📊