MotherDuck

MotherDuck

Serverless analytics platform powered by DuckDB

Overview

MotherDuck is a serverless analytics platform that combines the speed of DuckDB with cloud convenience. This guide will walk you through connecting your Hypermode agent to MotherDuck, enabling powerful data analytics, complex queries, and insights generation using both your own data and MotherDuck’s rich sample datasets.

Prerequisites

Before connecting MotherDuck to Hypermode, you’ll need:
  1. A MotherDuck account
  2. MotherDuck API token for authentication
  3. A Hypermode workspace

Setting up MotherDuck

Step 1: Create your MotherDuck account

If you haven’t already, sign up for a MotherDuck account to access serverless analytics capabilities.

Step 2: Access your workspace

Once logged in, you’ll see your MotherDuck workspace with access to:
  • Your personal databases
  • Sample datasets (NYC Taxi, TPC-H, etc.)
  • Query editor and analytics tools
MotherDuck workspace

Step 3: Generate an API token

Create an API token for secure access:
  1. Navigate to SettingsAccess Tokens in your MotherDuck dashboard
  2. Click Create Token
  3. Give your token a descriptive name (for example, “Hypermode Integration”)
  4. Copy the generated token immediately
Create API token
Store your API token securely! You won’t be able to see it again after creation, and it provides full access to your MotherDuck workspace.
Create API token

Creating your MotherDuck agent

Step 1: Create a new agent

From the Hypermode interface, create a new agent manually:
  1. Click the agent dropdown menu
  2. Select “Create new Agent”
Navigate to create agent

Step 2: Configure agent settings

Use these recommended settings for your MotherDuck agent:
  • Agent Name: AnalyticsAgent
  • Agent Title: Connects to MotherDuck Analytics
  • Description: AnalyticsAgent performs data analysis and generates insights
  • Instructions: You have a connection to MotherDuck and various other developer tools to perform advanced analytics, generate insights, and answer data questions. You can query large datasets, perform aggregations, and create data visualizations. You have access to sample datasets including NYC Taxi data, TPC-H benchmark data, and more.
  • Model: GPT-4.1
Create agent modal

Step 3: View your agent profile

Once created, navigate to your agent’s settings page to see the profile: Agent profile

Connecting to MotherDuck

Step 1: Add the MotherDuck connection

Navigate to the Connections tab and add MotherDuck:
  1. Click “Add connection”
  2. Select “MotherDuck” from the dropdown
Add MotherDuck connection

Step 2: Configure credentials

Enter your MotherDuck credentials:
  • API Token: Your MotherDuck API token created in the previous step
MotherDuck connection modal

Verifying the MotherDuck connection

Step 1: Test basic connectivity

Start a new thread and test the connection:
What databases are available in my MotherDuck workspace?
Expected result Your agent will show the available databases:
  • md_information_schema - System metadata
  • my_db - Your personal database
  • sample_data - Rich sample datasets
Test connection

Step 2: Explore sample datasets

Test access to the sample datasets:
What sample datasets are available? Can you show me the schemas and table structures?
Expected result Your agent will discover these sample datasets:
  • NYC data (sample_data.nyc): taxi trips, service requests, rideshare data
  • Hacker News (sample_data.hn): 3.8M posts and comments
  • Movies (sample_data.kaggle): 41K movies with embeddings
  • WHO Air Quality (sample_data.who): Global ambient air quality data
  • Stack Overflow Survey (sample_data.stackoverflow_survey): Developer survey results

Step 3: Query sample data

Test with a simple query on the NYC taxi dataset:
Show me 3 sample taxi trips from the NYC dataset with their basic details
Expected result Your agent will return something like:
VendorID: 2, Pickup: 2022-11-04 23:13:01, Dropoff: 2022-11-04 23:25:38
Trip Distance: 1.78 miles, Total Amount: $16.56, Passengers: 4

Exploring built-in sample datasets

NYC taxi dataset (3.2M records)

The NYC taxi dataset contains detailed trip information with 19 fields including timestamps, locations, fares, and payment details:
Analyze NYC taxi usage patterns by hour of day - when are the busiest times?
Real results from the data:
  • Peak hours: 12-6 PM (157K-223K trips per hour)
  • Quietest time: 3-4 AM (23K trips)
  • Highest average fares: Early morning 4-6 AM ($27-30) due to airport trips
  • Lowest average fares: Mid-morning 9-11 AM (~$20-21)

Hacker News dataset (3.8M posts)

Complete Hacker News data including stories, comments, scores, and timestamps:
What are the highest-scoring Hacker News stories of all time?
Real top stories include:
  1. “Mechanical Watch” - 4,298 points by todsacerdoti
  2. “Google Search Is Dying” - 3,636 points by dbrereton
  3. “My First Impressions of Web3” - 3,393 points by natdempk
  4. Major news: “Queen Elizabeth II has died” - 2,827 points

Movies dataset (41K movies with embeddings)

Rich movie data with titles, overviews, and pre-computed embeddings for similarity search:
Find movies similar to "Toy Story" using semantic search
Dataset includes:
  • Movie titles and plot overviews
  • 512-dimensional embeddings for semantic similarity
  • Popular films from Toy Story to modern releases

WHO air quality dataset

Global ambient air quality measurements with PM2.5, PM10, and NO2 concentrations:
Which cities have the highest air pollution levels?
Features:
  • PM2.5 and PM10 particulate matter concentrations
  • NO2 (nitrogen dioxide) levels
  • Geographic coordinates for mapping
  • Multi-year temporal data

Setting up your data environment

Step 1: Understand the data structure

Get familiar with the complete sample dataset structure:
Give me a detailed overview of all sample datasets including row counts, key fields, and potential analysis opportunities
Your agent will discover:
  • NYC Taxi: 3.2M trips with fare, location, and time data
  • Hacker News: 3.8M posts with scores, authors, and content
  • Movies: 41K films with embeddings for ML applications
  • Air Quality: WHO data with pollution measurements globally

Step 2: Create your own database (Optional)

You can create custom databases alongside the sample data:
Create a new database called 'my_analytics' for custom analysis

Step 3: Update agent instructions

Enhance your agent’s capabilities with specific dataset knowledge:
You are connected to MotherDuck with access to these verified sample datasets:

**NYC Data (sample_data.nyc.taxi)**: 3,252,717 taxi trip records from 2022
- Fields: pickup/dropoff times, locations (PULocationID/DOLocationID), fares, distances
- Great for: Time series analysis, location-based insights, revenue analysis

**Hacker News (sample_data.hn.hacker_news)**: 3,866,740 posts and comments
- Fields: title, score, author, timestamp, type (story/comment)
- Great for: Content analysis, trending topics, user behavior

**Movies (sample_data.kaggle.movies)**: 41,371 movies with embeddings
- Fields: title, overview, 512-dimensional embeddings
- Great for: Semantic similarity, recommendation systems

**WHO Air Quality (sample_data.who.ambient_air_quality)**: Global pollution data
- Fields: PM2.5/PM10/NO2 concentrations, coordinates, population
- Great for: Environmental analysis, geographic patterns

Always use the full table names with schema prefixes (e.g., sample_data.nyc.taxi).

Testing analytical operations

Test 1: Time series analysis with real data

Analyze actual taxi usage patterns:
Show me NYC taxi demand by hour with average fares - what patterns do you see?
Real insights from the data:
  • Clear daily patterns with rush hour peaks
  • Early morning premium pricing (airport runs)
  • Lowest activity 2-4 AM, highest 5-7 PM
  • Weekend vs weekday patterns visible in the data

Test 2: Content analysis

Explore Hacker News trending topics:
What topics generate the highest engagement on Hacker News based on story titles and scores?
Real findings:
  • Tech criticism (“Google Search Is Dying”) scores highly
  • Major news events (Queen Elizabeth, Musk/Twitter) get massive engagement
  • Technical deep-dives (“Mechanical Watch”) resonate with the community
HN Analysis

Test 3: Geospatial analysis potential

While location IDs need lookup tables, the data structure supports location analysis:
Analyze pickup and dropoff location patterns in the taxi data

Test 4: Semantic similarity with embeddings

Leverage the pre-computed movie embeddings:
Using the movie embeddings, find films similar to popular titles
The movies dataset includes 512-dimensional embeddings perfect for similarity analysis and recommendation systems.

What you can do

With your MotherDuck connection established, your agent can:
  • Perform complex analytics on millions of records with DuckDB’s speed
  • Generate business insights from real-world datasets
  • Run time series analysis on temporal data (taxi trips, HN posts)
  • Execute semantic search using pre-computed embeddings
  • Analyze geographic patterns with coordinate data
  • Process text data from movie overviews and HN content
  • Handle big data efficiently with columnar processing
  • Create statistical summaries and trend analysis
  • Integrate with other tools for comprehensive workflows

Advanced analytical capabilities with real examples

Window functions and analytical SQL

Calculate running totals and moving averages for taxi revenue by day using the actual NYC data

Text analysis on real content

Analyze Hacker News story titles to identify trending topics and sentiment patterns
Use the movie embeddings to build a recommendation system - find movies similar to "Toy Story"
Movie Similarity Movie Similarity

Environmental data analysis

Identify the most polluted cities globally using WHO air quality measurements

Best practices for the sample datasets

  1. NYC Taxi Data: Always filter by date ranges for performance; use appropriate aggregations for time-based analysis
  2. Hacker News: Consider story vs comment types; use score thresholds for quality filtering
  3. Movies: Leverage embeddings for similarity; combine with text analysis of overviews
  4. Air Quality: Account for different measurement standards; filter by data quality indicators

Sample analytical workflows with real data

Peak demand prediction

Using historical taxi patterns, predict optimal driver deployment times
Real insight The data shows 4-6 AM has highest fares but lowest volume - perfect for premium positioning.

Community engagement analysis

Identify what content types perform best on Hacker News
Real finding Technical deep-dives and industry criticism generate highest engagement scores.

Environmental health correlation

Correlate air quality data with population density to identify at-risk areas

Content recommendation engine

Build a semantic movie recommendation system using the pre-computed embeddings

Advanced analytics examples

Seasonal trend analysis

Identify seasonal patterns in taxi usage and predict demand for different times of year

A/B testing analysis

Design and analyze A/B tests using statistical functions to determine significance of results

Predictive analytics preparation

Create features and prepare datasets for machine learning models to predict taxi demand or customer behavior

Troubleshooting

Common query issues with sample data

  1. Table not found errors: Always use full schema names like sample_data.nyc.taxi
  2. Performance with large datasets: Use LIMIT and appropriate WHERE clauses
  3. Memory constraints: The taxi dataset has 3.2M rows - be mindful of result sizes
  4. Embedding queries: Movie embeddings are 512-dimensional arrays - use appropriate similarity functions

Query performance optimization

  1. Filter early: NYC taxi data benefits from date/time filtering
  2. Use appropriate indexes: MotherDuck optimizes automatically but be mindful of query patterns
  3. Batch large operations: For analysis across millions of HN posts, consider sampling strategies
The sample datasets are production-ready for analysis and perfect for learning advanced SQL patterns. Start with the NYC taxi data for time-series practice, use Hacker News for text analysis, leverage movie embeddings for ML applications, and explore WHO data for geospatial analysis.
The combination of real data scale (millions of records) with advanced DuckDB features makes MotherDuck perfect for production-grade analytics learning and development.

Learn more

Combine MotherDuck with other Hypermode connections to create powerful data-driven workflows. For example, use Slack to share analytical insights, GitHub to version control your analytical queries, or Stripe to combine payment data with other business metrics for comprehensive reporting.