TLDR Data 2026-05-07
Netflix’s ML Metadata Graph 🧬, Inside DuckDB’s Speed 🦆, Searchable S3 Storage 🔎
Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph (14 minute read)
Netflix's Model Lifecycle Graph is a centralized Metadata Service (MDS) that connects fragmented ML assets (models, features, pipelines, datasets, and experiments) across the entire company into a single, queryable graph. By ingesting real-time events, normalizing them with a unified URI-based model, enriching relationships, and storing them in Datomic + Elasticsearch, Netflix enables easy discovery, lineage tracking, impact analysis, and cross-domain reuse of models.
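To make the idea concrete, here is a minimal Python sketch of the URI-based normalization step the post describes; the mds:// scheme, field names, and relation labels are illustrative assumptions, not Netflix's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: every ML asset (model, feature, pipeline, dataset,
# experiment) becomes a node keyed by a canonical URI, and relationships
# become typed edges between URIs. Names below are illustrative only.
@dataclass
class AssetNode:
    uri: str                                  # e.g. "mds://model/recs-ranker/v42"
    kind: str                                 # "model" | "feature" | "pipeline" | ...
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: str        # URI of this asset
    dst: str        # URI of a related asset
    relation: str   # e.g. "trained_on", "produced_by"

def normalize_event(event: dict) -> tuple[AssetNode, list[Edge]]:
    """Turn a raw lifecycle event into a node plus lineage edges."""
    node = AssetNode(
        uri=f"mds://{event['type']}/{event['name']}/{event['version']}",
        kind=event["type"],
        properties=event.get("metadata", {}),
    )
    edges = [
        Edge(src=node.uri, dst=parent, relation=event.get("relation", "derived_from"))
        for parent in event.get("parents", [])
    ]
    return node, edges
```

Once assets share one URI scheme, lineage and impact-analysis queries reduce to graph traversals over these edges.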
DuckDB Internals: Why is DuckDB Fast? (17 minute read)
DuckDB is fast because it runs in-process, avoids server/client data movement, and combines columnar storage, query optimization, predicate pushdown, vectorized execution, and row-group pruning to scan only the data it needs. This post explains how DuckDB turns SQL into an executable plan and why its storage and Parquet-reading model make analytics feel unusually fast on a single machine.
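A small example of what that looks like from Python: DuckDB runs in-process and, for a query like the one below, decodes only the referenced Parquet columns and prunes row groups whose min/max statistics cannot match the filter. The file path is a placeholder.

```python
import duckdb

# In-process: no server, no client/server data movement; the query runs
# inside this Python process against Parquet files on disk.
con = duckdb.connect()

# Only two columns are ever read, and the WHERE clause is pushed down so
# row groups whose statistics can't match are skipped entirely.
query = """
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('orders/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
    GROUP BY customer_id
"""

# EXPLAIN shows the physical plan DuckDB compiles the SQL into.
print(con.sql("EXPLAIN " + query))
print(con.sql(query).df().head())
```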
Building Self-Healing Data Pipelines at Halodoc (9 minute read)
Build targeted self-healing layers for recurring pipeline failures: CDC auto-restarts with safe checkpoint rewind, source-vs-lake consistency checks, size-aware mini-batching, Spark retry memory scaling, warehouse lock cleanup using query watermarks, and dependency-aware backfills. The design pattern is: alert first, validate eligibility, recover safely, measure impact. Results included CDC recovery dropping from 45+ min to <5 min and backfill setup from 4-8 h to <15 min.
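A rough Python sketch of that alert, validate, recover, measure loop; all helpers here are illustrative stand-ins rather than Halodoc's actual code.

```python
import logging
import time

log = logging.getLogger("self_healing")

def is_known_failure(error: Exception) -> bool:
    # Placeholder eligibility check: in practice this matches specific
    # connector error codes that are known to be safe to auto-recover.
    return isinstance(error, ConnectionError)

def rewind_to_safe_checkpoint(task_id: str) -> str:
    return f"{task_id}-last-committed-offset"   # stand-in checkpoint id

def restart_cdc_task(task_id: str, from_checkpoint: str) -> None:
    log.info("Restarting %s from %s", task_id, from_checkpoint)

def handle_cdc_failure(task_id: str, error: Exception) -> None:
    log.error("CDC task %s failed: %s", task_id, error)      # 1. alert first

    if not is_known_failure(error):                           # 2. validate eligibility
        log.warning("Unknown failure type, leaving to on-call")
        return

    started = time.monotonic()
    checkpoint = rewind_to_safe_checkpoint(task_id)           # 3. recover safely
    restart_cdc_task(task_id, from_checkpoint=checkpoint)

    minutes = (time.monotonic() - started) / 60               # 4. measure impact
    log.info("Recovered %s in %.1f min", task_id, minutes)
```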
From SSH to REST: A Security-Driven Modernization of Slack's EMR Data Pipelines (15 minute read)
Slack modernized its data pipelines by migrating over 700 SSH-based operators on AWS EMR to a secure REST-based architecture with zero downtime across 8 regions. Its team replaced direct SSH access with Quarry, their internal REST job submission gateway, and used YARN's Distributed Shell to run arbitrary commands while gaining proper resource management, reliable tracking, clean cancellation, and server-side lifecycle handling.
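For the flavor of the change, here is a hypothetical sketch of an operator submitting through a REST gateway instead of SSH. Quarry is Slack-internal, so the endpoint paths, payload fields, and status values below are assumptions about the general shape of such a flow, not Quarry's real API.

```python
import time
import requests

GATEWAY = "https://quarry.internal.example/api/v1"   # hypothetical URL

def submit_and_wait(command: list[str], cluster: str) -> str:
    # Submit the job; the gateway runs it via YARN Distributed Shell
    # server-side instead of the client SSH-ing into the cluster.
    resp = requests.post(
        f"{GATEWAY}/jobs",
        json={"cluster": cluster, "command": command},
        timeout=30,
    )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    # Server-side lifecycle: the gateway tracks YARN application state,
    # so the client only polls (and could cancel with a DELETE).
    while True:
        state = requests.get(f"{GATEWAY}/jobs/{job_id}", timeout=30).json()["state"]
        if state in {"SUCCEEDED", "FAILED", "KILLED"}:
            return state
        time.sleep(15)
```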
Can Agents Replace the Search Stack? (6 minute read)
A lightweight LLM agent, given basic retrieval tools (BM25 and/or embeddings), can outperform complex search backends and reranking pipelines, simplifying the search architecture. In experiments on Amazon ESCI data, agentic setups delivered big gains (NDCG from ~0.29 baseline to 0.41-0.45), with agents intelligently rewriting queries, exploring, and evaluating results.
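A minimal sketch of the agentic-retrieval loop using the rank-bm25 package as the retrieval tool; the rewrite and relevance-judging steps are placeholders for LLM calls, and the corpus is toy data rather than ESCI.

```python
from rank_bm25 import BM25Okapi   # pip install rank-bm25

corpus = ["red running shoes", "wireless earbuds", "trail running shoe men"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def retrieve(query: str, k: int = 3) -> list[str]:
    # Plain BM25 retrieval tool exposed to the agent.
    return bm25.get_top_n(query.split(), corpus, n=k)

def looks_relevant(query: str, hits: list[str]) -> bool:
    # Placeholder for an LLM judging the results.
    return any(word in hit for hit in hits for word in query.split())

def rewrite(query: str, hits: list[str]) -> str:
    # Placeholder for an LLM-generated reformulation of the query.
    return query + " shoes"

def agent_search(user_query: str, max_rounds: int = 3) -> list[str]:
    # The agent rewrites, retrieves, inspects, and retries until the
    # results look good or the budget runs out.
    query = user_query
    hits: list[str] = []
    for _ in range(max_rounds):
        hits = retrieve(query)
        if looks_relevant(user_query, hits):
            return hits
        query = rewrite(user_query, hits)
    return hits

print(agent_search("running"))
```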
Beyond the hype: The enterprise AI architecture we actually need (7 minute read)
Enterprise AI is moving toward a federated stack: native AI inside systems of record like SAP, Salesforce, Workday, and ServiceNow; sovereign private models hosted on internal infrastructure; curated data lakes; and AI analytics layers that can federate queries across domains. Agent orchestration sits on top, with full traceability, timestamps, and auditability to satisfy compliance demands such as the EU AI Act. Two missing capabilities: a trusted marketplace for external agents using verifiable identities, and an employee intelligence layer that embeds AI into workspaces so users can query operational data without switching tools.
We're Missing Data: The Other Half of AI Transformation (6 minute read)
AI in data and engineering orgs is overfocused on tools and underinvested in the operating model needed to absorb them. Technical gains from coding agents, eval infra, and internal assistants are real, but without redesigning management, career ladders, team composition, trust mechanics, and communication norms, productivity typically rises for about 6 months and then plateaus. AI transformation is multiplicative, not additive: fund both the technical stack and the operating stack, or the investment will underdeliver.
An open lakehouse — any engine, your cloud or ours (Sponsor)
Apache Iceberg + Parquet under the hood. Query it from Trino, Spark, or DuckDB over one governance layer. SaaS or self-install in your own Azure tenant. Databasin One writes schema-aware SQL with Claude + GPT — built in, not an add-on SKU. Per-minute billing, no commits. TLDR readers get $100 to start.
See how we compare to the warehouse guys
How We Accelerated Transpilation by Compiling SQLGlot with mypyc (8 minute read)
Fivetran dramatically accelerated SQLGlot (the popular pure-Python SQL parser, transpiler, and optimizer) by compiling it with mypyc, a tool that turns well-typed Python code into fast C extensions. They ship the compiled version as an optional package that delivers ~5x faster parsing, ~2.5x faster SQL generation, and 2-2.5x faster optimization, while keeping the original pure-Python version as the default for maximum compatibility.
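SQLGlot's public API stays the same whichever build you install, so the code paths that get faster (parsing and generation) look like ordinary SQLGlot usage; the dialects below are arbitrary.

```python
import sqlglot

sql = "SELECT user_id, COUNT(*) AS n FROM events WHERE ts > '2025-01-01' GROUP BY 1"

# Parse once (the ~5x-faster path in the compiled build)...
ast = sqlglot.parse_one(sql, read="duckdb")

# ...then generate dialect-specific SQL (the ~2.5x-faster path).
print(ast.sql(dialect="snowflake"))

# Or transpile in one call.
print(sqlglot.transpile(sql, read="duckdb", write="bigquery")[0])
```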
Integrating AI Into Apache Kafka Architectures: Patterns and Best Practices (11 minute read)
When integrating LLMs with Apache Kafka, use Kafka strictly as a durable event backbone and keep all model inference outside the broker. Use one of three main inference patterns (external RPC, embedded models like ONNX/TFLite, or sidecar), and follow best practices for topic design (raw-events → enriched-context → model-outputs), replayability, dead-letter queues, idempotency, and cost/latency/governance considerations.
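A minimal sketch of the external-RPC pattern under that topic layout, using kafka-python; the broker address, topic names, and inference URL are placeholders, and a real deployment would add schema validation and retry policies.

```python
import json
import requests
from kafka import KafkaConsumer, KafkaProducer   # pip install kafka-python

# Consume enriched context, call a model endpoint outside the broker,
# publish results to model-outputs, and route failures to a dead-letter topic.
consumer = KafkaConsumer(
    "enriched-context",
    bootstrap_servers="localhost:9092",
    group_id="llm-enricher",
    enable_auto_commit=False,                 # commit only after we produced a result
    value_deserializer=lambda b: json.loads(b),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

for msg in consumer:
    key = msg.key.decode() if msg.key else None
    try:
        resp = requests.post("http://inference.internal/v1/score",   # hypothetical endpoint
                             json=msg.value, timeout=10)
        resp.raise_for_status()
        producer.send("model-outputs", {"key": key, "result": resp.json()})
    except Exception as exc:
        producer.send("model-outputs-dlq", {"key": key, "event": msg.value, "error": str(exc)})
    consumer.commit()    # at-least-once delivery; downstream must dedupe on a stable key
```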
S3 is the perfect place to store data, until you try to search it (11 minute read)
Firn is an open-source API for fast vector and full-text search on S3-backed data, using Lance plus caching to make repeated queries extremely fast. It's useful for teams that want searchable object storage without the cost or complexity of running OpenSearch.
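Firn's own API isn't shown here, but the underlying idea can be sketched with LanceDB, which reads and writes Lance tables directly in S3; the bucket path, schema, and vectors below are placeholders.

```python
import lancedb   # pip install lancedb

# Lance tables live directly in object storage; no search cluster to run.
db = lancedb.connect("s3://my-bucket/search-data")

table = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "object storage search", "vector": [0.1, 0.9]},
        {"id": 2, "text": "opensearch alternative", "vector": [0.8, 0.2]},
    ],
)

# Nearest-neighbour query against the S3-backed table; repeated queries
# get fast once hot data is cached locally, which is Firn's main trick.
print(table.search([0.1, 0.8]).limit(2).to_list())
```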
Redis Array Type: Short Story of a Long Development (3 minute read)
Redis Array is a proposed new data type, currently under review in a pull request, that natively supports numerical indexing as part of its semantics. It combines efficient sparse and dense representations with automatic internal reshaping for optimal memory usage and performance, making it well suited to ring buffers, large indexed collections, and storing documents or files that need fast access, scanning, and search.
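A purely conceptual Python sketch of the sparse/dense dual representation with automatic reshaping; this is not Redis code and does not use the proposed command syntax.

```python
class AdaptiveArray:
    """Indexed array that stays sparse (dict) while mostly empty and
    reshapes to dense (list) storage once it fills up. Threshold is
    illustrative, not the value chosen in the Redis proposal."""

    DENSE_THRESHOLD = 0.5

    def __init__(self):
        self._sparse: dict[int, object] | None = {}
        self._dense: list | None = None
        self._max_index = -1

    def set(self, index: int, value) -> None:
        self._max_index = max(self._max_index, index)
        if self._dense is not None:
            if index >= len(self._dense):
                self._dense.extend([None] * (index + 1 - len(self._dense)))
            self._dense[index] = value
            return
        self._sparse[index] = value
        # Automatic internal reshaping: switch to dense storage once the
        # array is full enough that a flat list is cheaper than a dict.
        if len(self._sparse) / (self._max_index + 1) >= self.DENSE_THRESHOLD:
            self._dense = [None] * (self._max_index + 1)
            for i, v in self._sparse.items():
                self._dense[i] = v
            self._sparse = None

    def get(self, index: int):
        if self._dense is not None:
            return self._dense[index] if index < len(self._dense) else None
        return self._sparse.get(index)
```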
Implementing Statistical Guardrails for Non-Deterministic Agents (5 minute read)
Statistical guardrails add an automated safety layer for non-deterministic agents: semantic drift detection using cosine-distance z-scores against a safe baseline embedding, and confidence thresholding using Shannon entropy over token probabilities.
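A minimal sketch of both checks with NumPy, assuming you already have embeddings for a safe baseline corpus and per-token probability distributions from the model; the thresholds and function names are illustrative, not taken from the article.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_zscore(response_emb: np.ndarray, baseline_embs: np.ndarray) -> float:
    # Distances of baseline samples to their centroid define "normal";
    # the response's distance is scored as a z-score against that.
    centroid = baseline_embs.mean(axis=0)
    dists = np.array([cosine_distance(e, centroid) for e in baseline_embs])
    d = cosine_distance(response_emb, centroid)
    return (d - dists.mean()) / (dists.std() + 1e-9)

def mean_token_entropy(token_probs: list[np.ndarray]) -> float:
    # Shannon entropy of each token's probability distribution, averaged;
    # high entropy means the model was unsure and the answer should be gated.
    ents = [-(p * np.log2(p + 1e-12)).sum() for p in token_probs]
    return float(np.mean(ents))

def guardrail(response_emb, baseline_embs, token_probs,
              z_max: float = 3.0, entropy_max: float = 2.5) -> bool:
    """Return True if the response passes both statistical checks."""
    return (drift_zscore(response_emb, baseline_embs) <= z_max
            and mean_token_entropy(token_probs) <= entropy_max)
```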
Curated deep dives, tools and trends in big data, data science and data engineering 📊
Join 400,000 readers for one daily email