TLDR Data 2026-06-08
Anthropic’s Automated Analytics 🔍, PostgreSQL 19 Beta 🐘, McKinney on Agentic Engineering 🛠️
How Anthropic enables self-service data analytics with Claude (5 minute read)
Anthropic argues that accurate self-service analytics with LLMs is mostly a context, governance, and verification problem, not a SQL generation problem: teams need canonical datasets, strong metadata, semantic-layer-first workflows, maintained skills, and curated sources of truth. Their biggest gains came from reducing ambiguity, preventing staleness, improving retrieval, and validating continuously through offline evals, ablations, provenance, and correction loops.
Dynamic Repartitioning for Time Series Workloads (11 minute read)
Netflix built dynamic partition splitting in Cassandra to handle wide partitions in high-volume time-series workloads like viewing history, metrics, and events. Rather than relying on static buckets or manual fixes, the system detects hot or oversized partitions at runtime and automatically splits them into smaller pieces while preserving query compatibility and data consistency.
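To make the idea of runtime splitting concrete, here is a toy in-memory sketch. It is not Netflix's implementation and does not use Cassandra; the class name `SplittingStore`, the `max_rows` threshold, and the modulo bucketing scheme are all assumptions made for illustration. It shows only the general pattern: detect an oversized partition on write, redistribute its rows across more buckets, and keep reads unchanged by fanning out over every bucket for the key.

```python
from collections import defaultdict


class SplittingStore:
    """Toy store that splits a key's partition once it grows too large (hypothetical)."""

    def __init__(self, max_rows=4):
        self.max_rows = max_rows                # assumed size threshold per partition
        self.partitions = defaultdict(list)     # (key, bucket) -> [(ts, value)]
        self.buckets = defaultdict(lambda: 1)   # key -> current number of buckets

    def write(self, key, ts, value):
        bucket = ts % self.buckets[key]
        rows = self.partitions[(key, bucket)]
        rows.append((ts, value))
        if len(rows) > self.max_rows:           # oversized partition detected at runtime
            self._split(key)

    def _split(self, key):
        # Collect every row for the key, double the bucket count, and redistribute.
        old_count = self.buckets[key]
        rows = []
        for bucket in range(old_count):
            rows.extend(self.partitions.pop((key, bucket), []))
        self.buckets[key] = old_count * 2
        for ts, value in rows:
            self.partitions[(key, ts % self.buckets[key])].append((ts, value))

    def read(self, key):
        # Callers still read by key alone; the fan-out over buckets is hidden.
        rows = []
        for bucket in range(self.buckets[key]):
            rows.extend(self.partitions.get((key, bucket), []))
        return sorted(rows)


store = SplittingStore(max_rows=4)
for ts in range(10):
    store.write("user1", ts, f"v{ts}")
print(store.buckets["user1"], len(store.read("user1")))
```

A real system would also have to handle concurrent writers and keep data consistent while a split is in progress, which this sketch ignores.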
The Join-Aware Materialized View Query Rewrite Gap (4 minute read)
Join-aware materialized views make star-schema BI faster by keeping fact-to-dimension joins available for query rewrite, whereas single-table MVs miss the dimension attributes that dashboards group by. StarRocks, BigQuery, Redshift, and Oracle support this directly. Databricks has experimental Metric Views, while Snowflake leaves the capability split across MVs and Dynamic Tables.
Ground truth is a process, not a dataset (4 minute read)
Ground truth is a process, not a static dataset. For complex AI report fact-checking, Amazon's audit-then-score protocol lets AI challenge benchmark labels with evidence. A human auditor reviews disputes and updates the ground truth when warranted, lifting expert accuracy to 90.9%.
Vibe Coding Is Dangerous, Agentic Engineering Isn't (15 minute read)
Wes McKinney argues that “vibe coding” is dangerous when people one-shot prompts, skip review, and ship blindly, but “agentic engineering” can work when humans stay deeply involved in specs, architecture, testing, review, and deciding what not to build. His workflow treats AI as an accelerator, not a replacement for engineering judgment, using tools like Superpowers, Roborev, tests, token tracking, and strict maintenance habits to keep agents accountable and useful over time.
Structure vs. Concept (9 minute read)
Taxonomies organize business concepts for humans, while ontologies define classes, properties, constraints, and rules. Vector retrieval works best with rich taxonomy text; reasoning needs ontology axioms. Keep them linked but separate, so business users can curate concepts while data models stay logically precise.
Mozilla Data Collective - Your Models Are Only as Good as the Datasets You Train On (Sponsor)
Build for global growth with language datasets that help you go to new markets faster.
Mozilla Data Collective offers 600+ documented datasets across 300+ languages, helping companies reach new customers and strengthen multilingual AI capabilities with consented, traceable datasets.
Browse and Download Free Datasets
What is Apache Arrow Flight? (8 minute read)
Apache Arrow Flight uses Arrow and gRPC to move large columnar datasets quickly with zero-copy transfer. Servers stream Arrow RecordBatches directly, can parallelize reads across endpoints, and mostly serve as infrastructure for custom high-performance data services.
PostgreSQL 19 Beta 1 Released! (5 minute read)
PostgreSQL 19 Beta 1 is available for real-world testing before GA. Major updates include autoscaling async I/O, parallel autovacuum, faster foreign-key inserts, SQL/PGQ graph queries, better observability, restart-free logical replication, SNI-based TLS certificates, online checksum toggling, LZ4 as the default TOAST compression, and removal of RADIUS auth.
A Deep Dive into Calibration of Language Models: Platt Scaling, Isotonic Regression, Temperature Scaling (7 minute read)
Temperature Scaling is the simplest method for LLM calibration. Platt Scaling is data-efficient and fast but often too coarse. Isotonic Regression is the most flexible and accurate when plenty of calibration data is available, though it risks overfitting on small sets. For LLMs, evaluate calibration with Expected Calibration Error (ECE), reliability diagrams, and the Brier score.
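As a rough illustration of the simplest of these methods and two of the metrics, the sketch below implements temperature scaling, ECE, and the Brier score in plain Python. It is not taken from the linked article: the function names, the grid of candidate temperatures, the 10 equal-width bins, and the sample data are assumptions chosen for the example. The temperature is fitted by a simple grid search that minimizes negative log-likelihood on held-out logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(l, temperature)[y]) for l, y in zip(logits_list, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature from an assumed grid (0.5 to 5.0) that minimizes NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits_list, labels, t))


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / len(confidences) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between predicted confidence and the 0/1 outcome."""
    return sum((c - ok) ** 2 for c, ok in zip(confidences, correct)) / len(confidences)


# Made-up overconfident model: identical logits, but only 7 of 10 predictions are right.
logits = [[4.0, 0.0]] * 10
labels = [0] * 7 + [1] * 3
t = fit_temperature(logits, labels)
correct = [1 if y == 0 else 0 for y in labels]
before = expected_calibration_error([softmax(l)[0] for l in logits], correct)
after = expected_calibration_error([softmax(l, t)[0] for l in logits], correct)
print(t, before, after)
```

Grid search keeps the example free of dependencies; in practice the single temperature parameter is usually fitted with a gradient-based optimizer on a held-out calibration set.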
Your Obsidian Vault Can Now Run SQL (and Your Agent Can Read It) (5 minute read)
A new DuckDB + MotherDuck plugin lets users run SQL blocks inside notes, query local files or cloud tables, and freeze results back into markdown tables with daily/weekly refresh scheduling.
The Tableau Exodus Has Begun (4 minute read)
Executives are cutting Tableau because BI feels too expensive and undervalued, not necessarily because another tool is better. The smart response is to preserve critical BI-only metrics, consider cheaper or consolidated platforms, and use the migration to rethink BI's value in an AI-first world.
Curated deep dives, tools and trends in big data, data science and data engineering 📊