DeepSeek is pushing DuckDB beyond its single-node roots with smallpond, a new, simple approach to distributed compute. But does it solve the scalability challenge—or introduce new trade-offs?
I added experimental S3 support here: https://github.com/definite-app/smallpond
Awesome Mike! Will def check this out
I don't fully follow when you say: "Unlike systems like Spark or Daft that can distribute work at the query execution level (breaking down individual operations like joins or aggregations), smallpond operates at a higher level. It distributes entire partitions to workers, and each worker processes its entire partition using DuckDB."
In Spark too, a partition is processed by a core, right?
Yes, Spark also processes partitions with individual cores, but the key difference is how work is distributed WITHIN a partition.
- Spark breaks queries into stages and tasks, distributing operations like joins and aggregations across multiple workers—even within a partition.
- smallpond, on the other hand, assigns entire partitions to separate DuckDB instances running in Ray tasks, where each instance processes its partition independently without breaking queries into smaller tasks.
So while both use partitioning, Spark distributes work at a finer level, while smallpond keeps execution isolated per partition.
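To make that concrete, here's a minimal sketch of the smallpond flow, modeled on the example in the smallpond README (the file names and partition count are made up):

```python
import smallpond

# Start a smallpond session (this spins up Ray under the hood).
sp = smallpond.init()

# Hash-partition the input; each partition becomes one independent unit of work.
df = sp.read_parquet("prices.parquet")  # hypothetical input file
df = df.repartition(3, hash_by="ticker")

# partial_sql runs this DuckDB query once per partition, each in its own Ray
# task; {0} is replaced by that partition's data. There is no coordination
# between partitions while the query runs.
df = sp.partial_sql(
    "SELECT ticker, min(price) AS lo, max(price) AS hi FROM {0} GROUP BY ticker",
    df,
)
df.write_parquet("prices_summary/")  # hypothetical output directory
```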
Hope that clarifies!
Wouldn't this mean that smallpond isn't meant for complex and dynamic operations (e.g. ones that involve a shuffle)?
That's true. If you have an analytical workload that typically requires joins across partitions, that would be really slow in that case.
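For a sense of why, here's a hedged sketch (assuming partial_sql accepts multiple input DataFrames as {0}, {1}; the file names, join key, and partition count are made up): both sides of a join have to be hash-repartitioned on the key first, a full shuffle materialized through storage before any per-partition DuckDB join can run.

```python
import smallpond

sp = smallpond.init()

# Both inputs must be hash-repartitioned on the join key first: a full
# shuffle written through the filesystem between stages.
orders = sp.read_parquet("orders.parquet").repartition(8, hash_by="customer_id")
users = sp.read_parquet("users.parquet").repartition(8, hash_by="customer_id")

# Only once the partitions are aligned can each DuckDB instance join
# its own slice independently of the others.
joined = sp.partial_sql(
    "SELECT o.*, u.name FROM {0} o JOIN {1} u ON o.customer_id = u.customer_id",
    orders,
    users,
)
joined.write_parquet("joined/")
```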
Is there a way to visualize the DAG at the moment?
Is there a way for DeepSeek to connect to and interrogate a DuckDB database?
You can try a local chat app with DeepSeek as the LLM and an MCP server such as https://github.com/motherduckdb/mcp-server-motherduck. Then when you ask "hey, what's my top SKU for the past 2 months?", the chat app will list the "tools" available locally, such as 'run_sql' or 'list_tables', and the LLM guesses which tool to call (say, list tables first), then calls a series of tools to fetch the schema, generate SQL, and run it, and finally presents the result to you.
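Roughly, the loop the chat app runs looks like this toy Python sketch; the database file, the sales table, and the hard-coded SQL are stand-ins for what the LLM would discover and generate:

```python
import duckdb

con = duckdb.connect("analytics.duckdb")  # hypothetical local database

# Two toy "tools", mimicking what an MCP server like
# mcp-server-motherduck exposes to the chat app.
TOOLS = {
    "list_tables": lambda: con.sql("SHOW TABLES").fetchall(),
    "run_sql": lambda query: con.sql(query).fetchall(),
}

def answer(question: str):
    # 1. The chat app sends the question plus the tool list to the LLM.
    # 2. The LLM picks a tool, e.g. list_tables, to discover the schema.
    schema = TOOLS["list_tables"]()
    # 3. With the schema in context, the LLM writes SQL for the question
    #    (hard-coded here as a stand-in for the generated query)...
    sql = (
        "SELECT sku, sum(quantity) AS units FROM sales "
        "WHERE sold_at >= now() - INTERVAL 2 MONTH "
        "GROUP BY sku ORDER BY units DESC LIMIT 1"
    )
    # 4. ...and the app runs it via run_sql and hands the rows back to
    #    the LLM to phrase the final answer.
    return TOOLS["run_sql"](sql)
```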
Would you be able to help me set it up in more detail? I am not so proficient in this matter.
A bit of self-promotion: I am building an MCP server for Kafka and our product. You may check the demo at https://www.linkedin.com/feed/update/urn:li:activity:7298966083804282880/ and the repo at https://github.com/jovezhong/mcp-timeplus. You can follow the README to understand how to configure a chat app (such as Claude or 5ire) to start MCP servers (which are usually built in Node.js or Python, so you need tools like npx and uvx locally). Just search for 'duckdb' and 'mcp' as keywords; there are a few sample tools available, nothing official, and you cannot expect those tools to give perfect answers. In the end, the best way to work with your data is SQL, period. Text2SQL is a fancy idea, but I'd rather write SQL directly.