TLDR Data 2026-06-15
Databricks’ Agent Orchestrator 🕹️, Ecosystems Beat Models 🔁, LinkedIn’s Search Brain 🔍
Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant (6 minute read)
Spotify built Vedder, an AI data assistant for 2,100+ users across 177 clusters, to move beyond schema-only RAG across 70,000 datasets. Domain experts curate each cluster with datasets, vetted question-SQL pairs, and business docs. Only 12.5% of mined query pairs were accepted, so health scoring tracks drift, validity, coverage, and reproducibility to keep context reliable.
How Feldera Works: A True Incremental View Maintenance Engine (3 minute read)
Feldera treats streams as incremental SQL views, using DBSP to propagate deltas instead of recomputing joins and aggregations. Inserts, deletes, and updates become Z-set changes, so only affected rows are updated. The result is batch-SQL-like semantics for continuous pipelines with lower CPU, less memory pressure, and predictable latency.
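To make the Z-set idea concrete, here is a minimal Python sketch. It is not Feldera's implementation or API. It assumes a Z-set can be modeled as a mapping from row to integer weight (+1 for an insert, -1 for a delete, with an update expressed as one of each), and it maintains a grouped COUNT view by applying only the delta. The names `zset_add`, `count_delta`, and `apply_count_delta` are hypothetical.

```python
def zset_add(a, b):
    """Add two Z-sets (row -> integer weight), dropping rows whose weight reaches zero."""
    out = dict(a)
    for row, weight in b.items():
        new_weight = out.get(row, 0) + weight
        if new_weight == 0:
            out.pop(row, None)
        else:
            out[row] = new_weight
    return out


def count_delta(delta):
    """Incremental GROUP BY key, COUNT(*): turn an input delta into per-key count changes."""
    change = {}
    for (key, _value), weight in delta.items():
        change[key] = change.get(key, 0) + weight
    return {k: c for k, c in change.items() if c != 0}


def apply_count_delta(view, delta):
    """Update the materialized counts, touching only the keys that appear in the delta."""
    for key, c in count_delta(delta).items():
        new_count = view.get(key, 0) + c
        if new_count == 0:
            view.pop(key, None)
        else:
            view[key] = new_count
    return view
```

Moving a row from one group to another is then a two-entry delta, such as `{("eu", "order-2"): -1, ("us", "order-2"): 1}`, and only the `eu` and `us` counts are recomputed.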
Semantic Search for AI Agents at Scale: Retrieval and Ranking for LinkedIn's Hiring Assistant (15 minute read)
LinkedIn built MUSE (Member Understanding Semantic Embeddings) to power semantic search inside Hiring Assistant. It uses a dual-tower Matryoshka embedding model trained on millions of high-quality labels from an LLM Teacher grounded in product policy, combining embedding-based retrieval with a downstream engagement-optimized ranker.
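For readers unfamiliar with Matryoshka embeddings, the general idea is that a prefix of the full vector is trained to remain usable on its own, so retrieval can trade accuracy for speed by truncating. The sketch below illustrates only that general technique under simple assumptions (plain Python lists, cosine similarity). It does not describe MUSE itself, and the function names are hypothetical.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale the prefix to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]


def cosine(a, b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, candidates, dims, k):
    """Rank candidates by cosine similarity using only the first `dims` dimensions."""
    q = truncate_and_normalize(query, dims)
    scored = [
        (cosine(q, truncate_and_normalize(vec, dims)), name)
        for name, vec in candidates.items()
    ]
    scored.sort(reverse=True)
    return [name for _score, name in scored[:k]]
```

A common pattern is to retrieve with a short prefix and pass the shortlist to a more expensive ranker, which matches the retrieval-then-ranking split described above only at a high level.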
The Mythical Agent-Month (10 minute read)
AI coding agents reduce coding labor, but not the hardest parts of software: design judgment, scope control, testing, and maintainability. They reduce accidental complexity, but can create technical debt, architectural drift, and bloated codebases at machine speed. The edge shifts to experts who can steer the model, say no, and keep systems production-ready.
The Bill Arrives: How to Manage Agentic AI Costs at Scale (17 minute read)
Uber's AI budget blowout shows agentic AI is a task-economics problem, not a token-pricing problem. Claude Code adoption hit 84% across 5,000 engineers, exhausting the annual AI budget by mid-April. With spend hidden in re-sent context, retrieval, orchestration, governance, and retries, teams need to measure value per task, control context, and build stateful agent infrastructure.
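The point about re-sent context can be shown with simple arithmetic. The sketch below assumes a stateless agent loop in which every turn re-sends the system prompt plus the full history so far. The function names and all token counts are illustrative assumptions, not figures from the article.

```python
def resent_input_tokens(system_tokens, tokens_per_turn, turns):
    """Total input tokens when each turn re-sends the whole accumulated context."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # the context grows by one turn's worth
        total += history            # and the full context is billed again
    return total


def once_only_input_tokens(system_tokens, tokens_per_turn, turns):
    """Baseline: every token is sent exactly once (for example, with retained state)."""
    return system_tokens + tokens_per_turn * turns
```

With a 2,000-token system prompt, 500 new tokens per turn, and 20 turns, the re-sent total is 145,000 input tokens against 12,000 for the once-only baseline. The gap grows roughly with the square of the number of turns, which is why per-task measurement can be more informative than per-token price.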
A frontier without an ecosystem is not stable (4 minute read)
Companies need to compound human expertise and AI capability, not just rely on the best model. By owning their workflows, evals, and institutional knowledge, firms can keep improving while avoiding a future where all value flows to a few frontier models.
Join renowned data strategist Doug Laney and Matia CEO Benjamin Segal for a discussion on the future of the data stack. (Sponsor)
Your data stack is held together with duct tape. You know it. Your team knows it.
On June 24th, Matia CEO Benjamin Segal and Doug Laney, author of Infonomics and Data Juice, are doing a live fireside on what comes next.
Register now →
Introducing Flights: Agent-Native Ingest in MotherDuck (4 minute read)
Flights is MotherDuck's new agent-native data pipeline feature that lets AI agents easily build, run, and schedule ingestion and transformation workloads using a secure, general-purpose Python runtime. It has native support for dlt pipelines, direct DuckDB execution, logging, scheduling, and versioning. Agents can create Flights via MCP server, SQL table functions, or the UI.
Introducing Omnigent: A Meta-Harness to Combine, Control, and Share Your Agents (7 minute read)
Omnigent is an open-source Databricks meta-harness that makes agents like Claude Code, Codex, Pi, and custom agents work together through one shared layer. It helps teams compose agents, add security and cost controls, share live sessions, and keep workflows portable as tools change.
Apache DataFusion 54.0.0 Released (7 minute read)
Apache DataFusion 54.0.0 adds major SQL upgrades, including LATERAL joins, SQL lambda functions for arrays, a new Arrow-based Avro reader, and spill-to-disk for memory-heavy nested loop joins. Performance also jumps, with near-unique LEFT/FULL sort-merge joins up to 20–50× faster and repartition-heavy operations improving by up to 50%.
Linux Foundation Announces OpenSharing Project to Standardize AI Asset and Data Exchange (4 minute read)
Databricks has handed the Delta Sharing protocol over to the Linux Foundation. OpenSharing extends Delta Sharing to AI models, agent skills, and unstructured data across clouds and platforms. It adds standard APIs for discovery, authorization, and access, with support for existing Delta Sharing recipients plus Apache Iceberg/REST Catalog clients. The project aims to replace proprietary marketplaces with a single standard for enterprise AI asset distribution.
The Hidden Cost of ai_parse_document in Production (10 minute read)
Databricks' ai_parse_document + ai_query can turn messy PDFs into structured JSON in a few SQL lines, but the challenge is reliability at scale. Every rerun reopens parsing and LLM costs, corrected documents can create duplicates, and even temperature 0 still produces non-deterministic outputs that undermine auditability. A pipeline design with checkpoints, versioned prompts, and deduplication reduces reprocessing cost and improves reproducibility. Deterministic parsers like OpenDataLoader PDF are more appropriate when document templates are consistent.
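The checkpoint-and-deduplicate design can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the article's pipeline and not a Databricks API: `parse_fn` is a hypothetical stand-in for the expensive parse-plus-LLM step, results are keyed by content hash and prompt version, and a corrected document replaces the earlier version for the same `doc_id` instead of creating a duplicate.

```python
import hashlib


class ParseCheckpoint:
    """Cache expensive parse results and keep one current version per document."""

    def __init__(self, parse_fn, prompt_version):
        self.parse_fn = parse_fn            # hypothetical expensive call
        self.prompt_version = prompt_version
        self.results = {}                   # (content_hash, prompt_version) -> output
        self.latest = {}                    # doc_id -> key of the current version
        self.calls = 0                      # how many times parse_fn actually ran

    def process(self, doc_id, content):
        """Parse `content` (bytes) unless this exact content and prompt were already done."""
        key = (hashlib.sha256(content).hexdigest(), self.prompt_version)
        if key not in self.results:
            self.calls += 1
            self.results[key] = self.parse_fn(content)
        self.latest[doc_id] = key           # a corrected upload supersedes the old one
        return self.results[key]

    def current(self):
        """Return one result per doc_id, so reruns and corrections do not duplicate rows."""
        return {doc_id: self.results[key] for doc_id, key in self.latest.items()}
```

Because the stored output is reused rather than regenerated, a rerun returns the same result for the same input and prompt version, which sidesteps the non-determinism of calling the model again. A production version would persist `results` and `latest` in a table instead of memory.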
Curated deep dives, tools and trends in big data, data science and data engineering 📊