TLDR Data 2026-06-04
dbt Core v2 Alpha, Cart Prediction with LLMs, Ray vs Daft
Your Cart Has a Story. Here's How We Learned to Read It (7 minute read)
Zepto built a Cart Contextual Model that treats shopping carts as "sentences" and uses a Transformer-based masked language model (MLM) to infer user intent in real time as items are added. By training on historical cart patterns with temporal, geographical, and product signals plus inverse-frequency masking to handle long-tail items, the model predicts what else the user will likely buy.
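To make the inverse-frequency masking idea concrete, here is a minimal sketch. It assumes the simplest possible reading: each item's chance of being masked is proportional to 1 / (its count across carts), so rare items become prediction targets more often. The function names, the `[MASK]` token, and the toy carts are hypothetical; the summary does not describe Zepto's actual weighting scheme.

```python
import random
from collections import Counter


def masking_weights(carts):
    """Return a weight per item proportional to 1 / its frequency across all carts."""
    counts = Counter(item for cart in carts for item in cart)
    return {item: 1.0 / n for item, n in counts.items()}


def mask_one_item(cart, weights, rng, mask_token="[MASK]"):
    """Mask one position in the cart, favouring rarer items.

    Returns (masked_cart, target_item). The target is what an MLM
    would be trained to predict from the remaining items.
    """
    position_weights = [weights[item] for item in cart]
    idx = rng.choices(range(len(cart)), weights=position_weights, k=1)[0]
    masked = list(cart)
    target = masked[idx]
    masked[idx] = mask_token
    return masked, target


# Toy data (made up for illustration): "saffron" is the long-tail item.
carts = [
    ["milk", "bread", "eggs"],
    ["milk", "bread", "saffron"],
    ["milk", "eggs", "butter"],
]
weights = masking_weights(carts)
masked_cart, target = mask_one_item(carts[1], weights, random.Random(0))
```

In this toy data "saffron" appears once and "milk" three times, so "saffron" is three times as likely as "milk" to be chosen as the masked target in a cart that contains both.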
Vector Search in Manticore Search: A Deep Dive (28 minute read)
Manticore Search argues vector search should be tuned like a production retrieval system, not treated as a default embedding feature. It recommends aligning similarity metrics with models, tuning HNSW for recall, latency, and memory, and using batching, chunk optimization, and physical backups to keep indexes consistent.
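The point about aligning the similarity metric with the model can be illustrated with a small, library-free sketch. The vectors below are made up, and the code does not use Manticore's API. It only shows that dot product and cosine similarity can rank the same documents differently when the embeddings are not normalized.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical 2-d "embeddings": one points almost the same way as the
# query but is short, the other is long but points 45 degrees away.
query = [1.0, 0.0]
docs = {"short_aligned": [0.9, 0.1], "long_offaxis": [5.0, 5.0]}

best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
```

Here the dot product prefers the long vector while cosine prefers the well-aligned one, so an index configured with a metric the embedding model was not trained for can return a different top result.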
A field journal on Ray Data and Daft for multimodal data lake (14 minute read)
In a side-by-side run of 8 production-like use cases, Ray Data was selected over Daft, primarily for its stability and resilience at scale (especially in complex async LLM inference). The write-up still credits Daft with ergonomic native multimodal primitives and cleaner code for many operations.
Debunking 8 data layout myths: why Liquid Clustering outperforms partitioning (11 minute read)
Databricks debunks 8 common myths about data layout, arguing that Liquid Clustering is superior to traditional Hive-style partitioning for modern lakehouses. Unlike rigid partitioning, Liquid Clustering dynamically organizes data using clustering keys that can evolve over time, supports row-level concurrency and metadata-only operations, and works across open table formats.
The Rise of Multi-Query Engines (7 minute read)
AI agents are creating more small, bursty data queries, making single-warehouse costs harder to manage. Multi-engine routing cuts cost by sending each query to the best engine while keeping familiar workflows.
Diving deep into Redis's new array data type (25 minute read)
Redis Array is a brand-new native data type (introduced in Redis 8.8) designed for constant-time positional access by index, filling a long-standing gap in Redis where position/index itself carries semantic meaning. It efficiently supports both dense and extremely sparse arrays using a hierarchical group-based structure, allowing fast random access, range queries, ring-buffer semantics, pattern matching across sparse data, and fixed memory usage.
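As a rough intuition for how a group-based layout can serve both dense and very sparse arrays, here is a minimal two-level sketch: indexes are split into fixed-size groups, and a group is allocated only once something is written into it. This is an illustration under assumed names (`SparseArray`, `group_size`), not Redis's actual implementation or command set, which the summary does not detail.

```python
class SparseArray:
    """Index-addressable array that allocates only the groups that hold values."""

    def __init__(self, group_size=1024):
        self.group_size = group_size
        self.groups = {}  # group number -> list of `group_size` slots

    def set(self, index, value):
        """Store a value at a position, allocating its group on first write."""
        group_no, offset = divmod(index, self.group_size)
        group = self.groups.setdefault(group_no, [None] * self.group_size)
        group[offset] = value

    def get(self, index):
        """Return the value at a position, or None if nothing was written there."""
        group_no, offset = divmod(index, self.group_size)
        group = self.groups.get(group_no)
        return None if group is None else group[offset]

    def range(self, start, stop):
        """Return values for positions start..stop-1, with None for unset slots."""
        return [self.get(i) for i in range(start, stop)]


arr = SparseArray(group_size=4)
arr.set(2, "a")
arr.set(1_000_000, "b")  # far-away index: only one extra group is allocated
```

Lookups cost one dictionary access plus one list access regardless of the index, and the two writes above allocate two groups rather than a million slots.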
Routing Multiple Query Engines with Iceberg (18 minute read)
QueryFlux is an open-source, Rust-based SQL routing proxy that directs queries across multiple query engines (Trino, Spark, DuckDB, Snowflake, Athena, Flink, etc.) that share the same Iceberg tables. It handles protocol translation, dialect conversion via SQLGlot, cost-aware routing, concurrency control, and health-based failover.
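The combination of cost-aware routing and health-based failover can be sketched in a few lines. This is not QueryFlux's code or configuration: the `Engine` fields, the `route` function, and every cost and size figure below are made-up assumptions chosen only to show the decision logic.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    cost_per_gb: float   # hypothetical cost per GB scanned
    max_scan_gb: float   # hypothetical largest scan this engine should take
    healthy: bool = True


def route(scan_gb, engines):
    """Pick the cheapest healthy engine able to handle the estimated scan size."""
    candidates = [e for e in engines if e.healthy and scan_gb <= e.max_scan_gb]
    if not candidates:
        raise RuntimeError("no healthy engine can serve this query")
    return min(candidates, key=lambda e: e.cost_per_gb * scan_gb)


# Illustrative numbers only; real costs depend on the deployment.
engines = [
    Engine("duckdb", cost_per_gb=0.0, max_scan_gb=5),
    Engine("trino", cost_per_gb=0.5, max_scan_gb=500),
    Engine("snowflake", cost_per_gb=2.0, max_scan_gb=10_000),
]

small = route(1, engines).name      # small scan fits the cheapest engine
large = route(200, engines).name    # too big for duckdb, trino is cheaper
engines[1].healthy = False          # simulate a failed health check on trino
failover = route(200, engines).name
```

With these assumed numbers a 1 GB query goes to the local engine, a 200 GB query goes to the cheaper cluster engine, and the same query falls back to the remaining engine once the preferred one is marked unhealthy.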
dbt Core v2 is here: still open source, now rebuilt for what's next (9 minute read)
dbt Core v2.0 alpha makes the Fusion engine's Rust-based runtime open source under Apache 2.0, unifying Core and Fusion around a shared foundation with faster parsing, Parquet artifacts, better local docs, simpler installs, and a tighter language spec. Fusion remains the recommended free CLI for most users, while Core v2 serves teams that need fully open source code or custom OSS builds.
ingestr (GitHub Repo)
ingestr is a CLI ELT tool for moving data from many databases and SaaS apps into warehouses or storage with simple flags, no backend or custom code required. It supports incremental loads, easy install, and broad connector coverage.
OpenTelemetry Launches "Blueprints" Initiative to Simplify Enterprise Observability Adoption (3 minute read)
OpenTelemetry launched "Blueprints" to simplify observability with standard patterns and reference implementations for Kubernetes, infrastructure, apps, and centralized telemetry platforms.
MongoDB and Stored Procedures (10 minute read)
MongoDB can run low-latency transactional logic without stored procedures by combining ACID transactions, bulkWrite, validation, indexes, and pipeline updates. This is demonstrated through an example that processes payments with card checks, vendor checks, limits, duplicate prevention, and ledger writes.
Authorization for AI agents: What to build before the EU AI Act deadline (6 minute read)
Lays out identity, policy, and audit patterns teams need to externally enforce least-privilege agent calls under upcoming regulation.
Pluto 1.0 Release (12 minute read)
Pluto 1.0 marks the Julia notebook environment as stable, with major improvements to reproducibility, reactivity, sharing, accessibility, education, docs, and editor tools.
dltHub AI Workbench data quality toolkit: schema-aware checks that route their own fixes (4 minute read)
Preview adds persistent, metadata-driven data-quality decorators that fail fast and auto-route remediation in dlt pipelines.