TLDR Data 2026-06-01
Meta’s Index-as-Model 🔦, Zero Dollar Analytics 💸, Context for Data Agents 🧠
A live data app for $0: DuckDB, Astro, and no BI tool (8 minute read)
A $0 data app can work well when the stack is simple: open data, DuckDB transforms, Astro/Leaflet/SVG for the interface, GitHub Actions for refreshes, and existing static hosting. AI-assisted coding makes bespoke, on-brand data products cheaper and more flexible than BI tools when you do not need governance, shared metrics, or heavy analytics workflows.
SilverTorch: Index as Model — A New Retrieval Paradigm for Recommendation Systems (7 minute read)
SilverTorch is Meta's new retrieval system for recommendation engines like feeds and Reels. It introduces the “Index as Model” paradigm, turning the retrieval pipeline, including user embedding, ANN search, eligibility filtering, neural reranking, and multi-task scoring, into a unified PyTorch model. The system runs end-to-end on GPUs using Bloom filters and fused Int8 ANN kernels.
The Postgres Developer's Guide to Vector Index Tradeoffs (11 minute read)
Vector search in Postgres becomes an index-design problem once tables reach millions of vectors, filters enter the query path, and recall/latency tradeoffs start affecting product quality. Exact search is best for small datasets and recall baselines. HNSW is the default read-heavy ANN choice when data fits in memory, IVFFlat reduces memory and maintenance costs at the expense of more tuning, and StreamingDiskANN via pgvectorscale targets large indexes that outgrow RAM. Hybrid search with BM25 plus vectors in Postgres improves recall by combining semantic matching with keyword relevance.
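One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The summary above does not say which fusion method the guide uses, so treat the sketch below as an assumption: the function name, the constant k=60, and the document ids are all illustrative, and the two input lists stand in for results already fetched from a BM25 query and a vector query.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both keyword and vector search rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results, best match first.
bm25_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc1', 'doc3', 'doc5', 'doc7']
```

Documents found by both searches (doc1, doc3) outrank those found by only one, which is the recall benefit hybrid search is after.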
Enabling Data Intelligence: Data Profiling Framework at Halodoc (10 minute read)
Halodoc built an Airflow-native data profiling framework to replace repeated ad hoc SQL profiling across hundreds of tables and multiple systems. It combines column-level profiling, join intelligence, and source-table analysis, running compute in Redshift or Athena and isolating each table in Kubernetes pods with run_id-based, idempotent staging writes. The result is a self-serve, searchable view of data quality and table relationships.
SQLite is All You Need for Durable Workflows (4 minute read)
Durable AI workflows can use local SQLite plus Litestream backups instead of heavier orchestration or database infrastructure. The result is simple, cheap, inspectable state for agents. When you need high availability or shared scalability, Postgres still fits better.
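A minimal sketch of the idea, assuming a hypothetical `steps` table and `run_step` helper (neither comes from the article): each completed step is recorded in SQLite, so a restarted workflow replays stored results instead of redoing work. Litestream runs as a separate process that replicates the database file and is not shown here. The demo uses an in-memory database; a real workflow would pass a file path.

```python
import json
import sqlite3

def open_store(path):
    """Open the workflow store and create the (hypothetical) steps table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " workflow_id TEXT NOT NULL,"
        " step TEXT NOT NULL,"
        " result TEXT NOT NULL,"
        " PRIMARY KEY (workflow_id, step))"
    )
    return conn

def run_step(conn, workflow_id, step, fn):
    """Run fn once per (workflow_id, step); later calls return the saved result."""
    row = conn.execute(
        "SELECT result FROM steps WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # step already completed: skip the work
    result = fn()
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO steps (workflow_id, step, result) VALUES (?, ?, ?)",
            (workflow_id, step, json.dumps(result)),
        )
    return result

calls = []

def fetch():
    calls.append("fetch")
    return {"rows": 3}

conn = open_store(":memory:")  # use a file path for state that survives restarts
first = run_step(conn, "wf-1", "fetch", fetch)
second = run_step(conn, "wf-1", "fetch", fetch)  # replayed from SQLite
```

Because the state is an ordinary table, it can be inspected with any SQLite client while debugging an agent run.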
Event-Driven vs. Polling Architectures for Agent Triggers (11 minute read)
Agent trigger architecture should be designed around delivery contracts, not a simplistic webhook-vs-polling choice. Webhooks are usually at-least-once, unordered, and best-effort. Polling can blow through rate limits. CDC and message buses offer stronger replay and durability, but still require idempotent handling. Mature agent systems typically combine fast-path events, reconciliation polling or replay, structural idempotency keys, and durable runtimes so long-running agents can survive duplicates, missed events, retries, and external waits.
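The idempotent-handling point can be sketched as follows. This is an illustration under stated assumptions, not the article's design: the `IdempotentHandler` class, the event fields, and the choice of (source, id) as the key are all hypothetical, and the in-memory set stands in for the durable store a real system would need.

```python
class IdempotentHandler:
    """Run an action at most once per event, however many times it is delivered."""

    def __init__(self, action):
        self._action = action
        self._seen = set()  # assumption: a durable store in a real system

    def handle(self, event):
        # The key is derived from the event's own structure, so a webhook
        # delivery and a later polling pass produce the same key.
        key = (event["source"], event["id"])
        if key in self._seen:
            return False  # duplicate delivery: ignore
        self._action(event)
        self._seen.add(key)
        return True

triggered = []
handler = IdempotentHandler(triggered.append)

webhook = {"source": "crm", "id": "evt-42", "payload": "updated"}
handler.handle(webhook)  # fast-path webhook
handler.handle(webhook)  # at-least-once redelivery, skipped
# An event the webhook missed, picked up by reconciliation polling.
handler.handle({"source": "crm", "id": "evt-43", "payload": "created"})
```

Marking the key as seen only after the action succeeds means a crash mid-action leads to a retry rather than a lost event, which is why the action itself should also tolerate being repeated.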
MOR Isn't a Storage Optimization. It's an Architectural Shift (11 minute read)
Instead of synchronously rewriting entire files on every mutation (Copy-On-Write), MOR (Merge-On-Read) appends changes to log files and defers the expensive merge/compaction work to a background process, effectively time-shifting optimization from write time to a separate, controllable schedule. This design better supports high-frequency streaming updates and CDC workloads, though it introduces tradeoffs in read amplification and compaction management.
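A toy model of the write path described above, with a dict standing in for the base file and a list standing in for the delta log. The `MorTable` class and its method names are hypothetical; real table formats add file layout, indexing, and concurrency control that this sketch leaves out.

```python
class MorTable:
    """Toy Merge-On-Read table: cheap appends, merge deferred to read or compaction."""

    def __init__(self, base):
        self.base = dict(base)  # stands in for the compacted base file
        self.log = []           # stands in for append-only delta log files

    def upsert(self, key, value):
        # Write path: append only; the base is never rewritten here.
        self.log.append((key, value))

    def read(self):
        # Read path: merge base with pending deltas (the read amplification cost).
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def compact(self):
        # Background job: fold the log into a new base on its own schedule.
        self.base = self.read()
        self.log = []

table = MorTable({"a": 1, "b": 2})
table.upsert("b", 20)
table.upsert("c", 3)

before = table.read()     # merged view while deltas are still pending
pending = len(table.log)  # 2 deltas waiting for compaction
table.compact()
after = table.read()      # same data, now served from the base alone
```

A Copy-On-Write table would do the equivalent of `compact()` inside every `upsert`, which is the synchronous rewrite cost MOR moves off the write path.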
ktx (GitHub Repo)
ktx is a local context layer that helps data agents query warehouses more accurately by combining approved metrics, join logic, warehouse metadata, and company knowledge into one searchable surface. It is aimed at teams that want Claude, Codex, Cursor, or other agents to reuse trusted definitions instead of inventing SQL from scratch.
Apache Iceberg 1.11.0 Adds registerView: Closing a Catalog Migration Gap (4 minute read)
Apache Iceberg 1.11.0 adds ‘registerView’, a metadata-preserving migration primitive that lets catalogs register existing Iceberg views from metadata files instead of recreating them from SQL. The release also adds a dedicated REST Catalog endpoint enabling cleaner authorization, capability signaling, and backward compatibility. This closes a migration gap for catalog-to-catalog moves, DR workflows, blue-green catalog upgrades, and tools like the Apache Polaris Iceberg Catalog Migrator.
Introducing CostBench: an Open Benchmark for Data Warehouse Cost-performance (5 minute read)
CostBench is ClickHouse's new open-source benchmark for evaluating cloud data warehouses on price-performance (how much performance you get per dollar) rather than raw speed alone. It tests both query performance and data ingestion across realistic analytical workloads on ClickHouse Cloud, Snowflake, Databricks, BigQuery, and Redshift.
How we built a lab to evaluate data agents (22 minute read)
Hex built Shoebox, an internal eval “lab bench” for data agents, so teams can compare candidate runs against stable production baselines and judge improvements across prompts, models, memory, search, and workspace context. They also created Shorelane Commerce, a realistic fake business with messy warehouse data, because simple text-to-SQL benchmarks do not reflect the ambiguity, context, and data debt real analytics agents must handle.
The best of CPDP 2026 (14 minute read)
Computers, Privacy, and Data Protection 2026 highlighted the regulatory pressure points shaping data governance and AI: age-gating, biometric age verification, health data, children's digital rights, AI chatbot privacy, and the widening gap between formal compliance and real-world enforcement. Panels emphasized concrete risks like biometric processing limits, 230 million weekly health-related ChatGPT queries, and the need for PETs, transparency, and stronger controls over platform work, content moderation, and generative AI use.
Curated deep dives, tools and trends in big data, data science and data engineering 📊