DeepSeek is pushing DuckDB beyond its single-node roots with smallpond, a new, simple approach to distributed compute. But does it solve the scalability challenge—or introduce new trade-offs?
I added experimental S3 support here: https://github.com/definite-app/smallpond
Awesome Mike! Will def check this out
I don't fully follow when you say: "Unlike systems like Spark or Daft that can distribute work at the query execution level (breaking down individual operations like joins or aggregations), smallpond operates at a higher level. It distributes entire partitions to workers, and each worker processes its entire partition using DuckDB."
In Spark too, a partition is processed by a core, right?
Yes, Spark also processes partitions with individual cores, but the key difference is how work is distributed WITHIN a partition.
- Spark breaks queries into stages and tasks, distributing operations like joins and aggregations across multiple workers—even within a partition.
- smallpond, on the other hand, assigns entire partitions to separate DuckDB instances running in Ray tasks, where each instance processes its partition independently without breaking queries into smaller tasks.
So while both use partitioning, Spark distributes work at a finer level, while smallpond keeps execution isolated per partition.
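To make that concrete, here's a minimal sketch of the smallpond flow, modeled on the example in the smallpond README (the file names and partition count are made up):

```python
import smallpond

# Start a smallpond session (this spins up Ray under the hood).
sp = smallpond.init()

# Hash-partition the input; each partition becomes one independent unit of work.
df = sp.read_parquet("prices.parquet")  # hypothetical input file
df = df.repartition(3, hash_by="ticker")

# partial_sql runs this DuckDB query once per partition, each in its own Ray
# task; {0} is replaced by that partition's data. There is no coordination
# between partitions while the query runs.
df = sp.partial_sql(
    "SELECT ticker, min(price) AS lo, max(price) AS hi FROM {0} GROUP BY ticker",
    df,
)
df.write_parquet("prices_summary/")  # hypothetical output directory
```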
Hope that clarifies!
Wouldn't this mean that smallpond isn't meant for complex and dynamic operations (e.g. ones that involve a shuffle)?
That's true. If you have an analytical workload that typically requires joins across partitions, that would be really slow in that case.
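For a sense of why, here's a hedged sketch (assuming partial_sql accepts multiple input DataFrames as {0}, {1}; the file names, join key, and partition count are made up): both sides of a join have to be hash-repartitioned on the key first, a full shuffle materialized through storage before any per-partition DuckDB join can run.

```python
import smallpond

sp = smallpond.init()

# Both inputs must be hash-repartitioned on the join key first: a full
# shuffle written through the filesystem between stages.
orders = sp.read_parquet("orders.parquet").repartition(8, hash_by="customer_id")
users = sp.read_parquet("users.parquet").repartition(8, hash_by="customer_id")

# Only once the partitions are aligned can each DuckDB instance join
# its own slice independently of the others.
joined = sp.partial_sql(
    "SELECT o.*, u.name FROM {0} o JOIN {1} u ON o.customer_id = u.customer_id",
    orders,
    users,
)
joined.write_parquet("joined/")
```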
Is there a way to visualize the DAG at the moment?
Is there a way for DeepSeek to connect to and interrogate a DuckDB database?
You can try a local chat app with DeepSeek as the LLM and an MCP server such as https://github.com/motherduckdb/mcp-server-motherduck. Then when you ask "hey, what's my top SKU for the past 2 months?", the chat app will list the "tools" available locally, such as 'run_sql' or 'list_tables', and the LLM guesses which tool to call (say, list tables first), then calls a series of tools to fetch the schema, generate SQL, and run it, and finally presents the result to you.
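Roughly, the loop the chat app runs looks like this toy Python sketch; the database file, the sales table, and the hard-coded SQL are stand-ins for what the LLM would discover and generate:

```python
import duckdb

con = duckdb.connect("analytics.duckdb")  # hypothetical local database

# Two toy "tools", mimicking what an MCP server like
# mcp-server-motherduck exposes to the chat app.
TOOLS = {
    "list_tables": lambda: con.sql("SHOW TABLES").fetchall(),
    "run_sql": lambda query: con.sql(query).fetchall(),
}

def answer(question: str):
    # 1. The chat app sends the question plus the tool list to the LLM.
    # 2. The LLM picks a tool, e.g. list_tables, to discover the schema.
    schema = TOOLS["list_tables"]()
    # 3. With the schema in context, the LLM writes SQL for the question
    #    (hard-coded here as a stand-in for the generated query)...
    sql = (
        "SELECT sku, sum(quantity) AS units FROM sales "
        "WHERE sold_at >= now() - INTERVAL 2 MONTH "
        "GROUP BY sku ORDER BY units DESC LIMIT 1"
    )
    # 4. ...and the app runs it via run_sql and hands the rows back to
    #    the LLM to phrase the final answer.
    return TOOLS["run_sql"](sql)
```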
Would you be able to help me set it up in more detail? I am not so proficient in this matter.
A bit of self-promotion: I am building an MCP server for Kafka and our product. You may check the demo at https://www.linkedin.com/feed/update/urn:li:activity:7298966083804282880/ and the repo at https://github.com/jovezhong/mcp-timeplus. You can follow the README to understand how to configure a chat app (such as Claude or 5ire) to start MCP servers (which are usually built in Node.js or Python, so you need tools like npx and uvx locally). Just search for 'duckdb' and 'mcp' as keywords; there are a few sample tools available, nothing official, and you cannot expect those tools to give perfect answers. In the end, the best way to work with your data is SQL, period. Text2SQL is a fancy idea, but I'd rather write SQL directly.