We were working on a project with a particular request: serve up data to ~12,000 users. Using Snowflake as the backend for this would likely be fairly costly just to read data that wasn't changing all that frequently. We wanted another way to house the data without incurring those costs, but also without dramatically changing a codebase that typically connects directly to Snowflake.
The first thought was to export a series of CSVs and load them as Pandas DataFrames. For use cases where the code simply loads a whole Snowflake table into Pandas, this would work; however, for cases where we needed to run more complex queries, it wouldn't.
Then we spotted a couple of intriguing posts from the data community. Let's dive in and take a look at why SQLite and DuckDB might be fantastic alternatives to direct connections or primitive CSV-based workflows.
Just about everyone has used SQLite at some point in their data journey. It's often one of the first data sources people are introduced to, either in school or in online courses. It has a great SQL interface as well as extensive support built into Python via the sqlite3 package. In the following post, X goes over some mind-bending facts about SQLite that help show just how powerful it is. This is not just a "dumb" database format.
A more recent database to gain popularity is DuckDB. DuckDB has a lot of similarities to SQLite, and for the use case above, either one will work perfectly. Here are some of the overall benefits of DuckDB.
While either database format would work, DuckDB might be the better choice thanks to its modern design and its extensive support for the modern data ecosystem. This article gives a wonderful, much more detailed overview of DuckDB. The table below summarizes how the two compare.
| Feature | SQLite | DuckDB |
| --- | --- | --- |
| Description | Embedded SQL database for transactional workloads | Embedded SQL database optimized for analytical workloads |
| Use Case | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| Storage Model | Row-oriented | Column-oriented |
| Performance Focus | Fast for small transactions and single-row operations | High performance for analytical queries and large datasets |
| Parallel Processing | Single-threaded | Supports multi-threaded query execution |
| SQL Features | Full SQL support for transactional operations | Full SQL support with advanced analytics functions |
| Data Format Support | Primarily operates on its own database files | Directly reads modern data formats (e.g., Parquet, CSV, JSON) |
| Integration with Tools | Limited integration with modern analytics tools | Seamlessly integrates with Python, R, Julia, Pandas, and Arrow |
| ACID Compliance | Fully ACID compliant | Fully ACID compliant |
| Resource Usage | Minimal memory and disk usage | Optimized for high performance with somewhat higher memory use |
| License | Open source (Public Domain) | Open source (MIT License) |
| Target Audience | Developers building lightweight apps with transactional needs | Data scientists, analysts, and engineers working on data analytics |
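To make the "Data Format Support" row concrete, here's a minimal sketch of DuckDB querying a Parquet file in place and handing the result back as a Pandas DataFrame. The `sales.parquet` file and its columns are placeholders, not data from this project.

```python
import duckdb

# In-memory connection; pass a file path instead to persist the database.
con = duckdb.connect()

# DuckDB can query a Parquet (or CSV/JSON) file directly -- no load step.
# "sales.parquet" and its columns are placeholders for your own data.
df = con.execute(
    "SELECT region, SUM(amount) AS total FROM 'sales.parquet' GROUP BY region"
).df()

print(df.head())
```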
Before we get started: the scripts are fairly long, so if you want to check them out in their entirety, head over to GitHub for the SQLite or DuckDB version and take a look.
The scripts start off by fetching the schemas from Snowflake and creating corresponding tables in the local database. Because neither SQLite nor DuckDB supports all of the same data types as Snowflake, the script maps each schema value to the closest supported type. While these aren't perfect matches, in the majority of cases the difference will be transparent to your use case.
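The full mapping lives in the linked scripts; as a rough illustration of the idea (the dictionary and helper below are a simplified stand-in, not the repository's actual code), Snowflake's richer type names get collapsed into the handful of storage classes SQLite understands:

```python
# Illustrative only -- the real mapping lives in the linked scripts.
# Snowflake reports types like NUMBER, VARCHAR, TIMESTAMP_NTZ; SQLite only
# has a few storage classes, so everything collapses into those.
SNOWFLAKE_TO_SQLITE = {
    "NUMBER": "NUMERIC",
    "FLOAT": "REAL",
    "VARCHAR": "TEXT",
    "TEXT": "TEXT",
    "BOOLEAN": "INTEGER",    # SQLite has no native boolean type
    "DATE": "TEXT",          # stored as ISO-8601 strings
    "TIMESTAMP_NTZ": "TEXT",
    "VARIANT": "TEXT",       # semi-structured values serialized as JSON strings
}


def map_snowflake_type(sf_type: str) -> str:
    """Return a SQLite column type for a Snowflake type, defaulting to TEXT."""
    base = sf_type.split("(")[0].upper()  # e.g. "VARCHAR(255)" -> "VARCHAR"
    return SNOWFLAKE_TO_SQLITE.get(base, "TEXT")
```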
Next, the script fetches the data from Snowflake, converts any retrieved values to a matching data type where needed, and then inserts the rows into the local database.
```python
def main():
    """
    Main function to orchestrate the data transfer from Snowflake to SQLite.
    """
    snowflake_connection = snowflake_connection_helper()
    print("Snowflake connection established.")
    try:
        for table_name in table_names:
            print(f"Processing table: {table_name}")
            # Fetch Snowflake schema
            schema = fetch_table_schema(snowflake_connection, table_name)
            # Create table in SQLite/DuckDB
            create_table_in_sqlite(table_name, schema)
            # Fetch data from Snowflake
            data, _ = fetch_data_from_snowflake(snowflake_connection, table_name)
            # Convert data types based on schema
            converted_data = convert_data_for_sqlite(data, schema)
            # Write to SQLite/DuckDB
            write_data_to_sqlite(converted_data, table_name)
            print(f"Data successfully written for table: {table_name}")
    finally:
        snowflake_connection.close()
        print("Snowflake connection closed.")
```
Note: There is a really useful function called `snowflake_connection_helper` available in this repository that will show you how to effectively use RSA key pairs for Snowflake authentication via your Streamlit `secrets.toml` file.
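For context, a key-pair-based helper generally looks something like the sketch below; the `secrets.toml` section and key names here are assumptions, so check the repository for the actual implementation.

```python
import snowflake.connector
import streamlit as st
from cryptography.hazmat.primitives import serialization


def snowflake_connection_helper():
    """Open a Snowflake connection using an RSA key pair stored in Streamlit secrets.

    A sketch only -- the secret names ("snowflake", "private_key", etc.) are
    assumptions; see the repository for the real helper.
    """
    creds = dict(st.secrets["snowflake"])

    # The private key is kept as a PEM string in secrets.toml and must be
    # converted to DER bytes before handing it to the connector.
    private_key = serialization.load_pem_private_key(
        creds["private_key"].encode(), password=None
    )
    private_key_der = private_key.private_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    )

    return snowflake.connector.connect(
        account=creds["account"],
        user=creds["user"],
        private_key=private_key_der,
        warehouse=creds.get("warehouse"),
        database=creds.get("database"),
        schema=creds.get("schema"),
    )
```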
Since the script is simply performing an `INSERT` query to add data from Snowflake to the local databases, it's easy to write a series of scripts that append newly added Snowflake data to the local databases through `SELECT` queries that filter on an incremental key such as a timestamp. Depending on your data, incrementally loading your databases instead of re-running the above script can save time and resources when keeping your local tables up to date after their initial creation.
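Here is a rough sketch of that incremental pattern, reusing the helpers and connection from the script above; the `ORDERS` table, the `UPDATED_AT` column, and the local file name are all placeholders:

```python
import sqlite3

# Find the high-water mark already present in the local database.
# "local_data.db", "ORDERS", and "UPDATED_AT" are placeholders.
local = sqlite3.connect("local_data.db")
last_loaded_at = local.execute("SELECT MAX(UPDATED_AT) FROM ORDERS").fetchone()[0]

# Pull only the rows added since the last load (the Snowflake connector
# uses pyformat-style parameter binding by default).
cursor = snowflake_connection.cursor()
cursor.execute(
    "SELECT * FROM ORDERS WHERE UPDATED_AT > %(last_loaded_at)s",
    {"last_loaded_at": last_loaded_at},
)
new_rows = cursor.fetchall()

# Convert and append to the existing local table instead of rebuilding it.
converted_rows = convert_data_for_sqlite(new_rows, schema)
write_data_to_sqlite(converted_rows, "ORDERS")
```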
To see this in action, we created a small test harness in Streamlit that helps us visualize each of the three data sources (Snowflake, SQLite, and DuckDB). It takes a single query and shows how that one query can be used across all three data sources with no changes to the query or to the code that displays the DataFrame or the chart.
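The heart of such a harness is small: choose a backend, run the same query, and hand the resulting DataFrame to Streamlit. The sketch below shows the pattern rather than the actual app; the query, file names, and column names are placeholders, and `snowflake_connection_helper` is the helper mentioned earlier.

```python
import sqlite3

import duckdb
import pandas as pd
import streamlit as st

# Placeholder query and column names -- swap in your own.
QUERY = "SELECT REGION, SUM(AMOUNT) AS TOTAL FROM ORDERS GROUP BY REGION"

backend = st.radio("Data source", ["Snowflake", "SQLite", "DuckDB"])

# The query is identical for every backend; only the connection differs.
if backend == "Snowflake":
    conn = snowflake_connection_helper()        # helper described above
    df = pd.read_sql(QUERY, conn)
elif backend == "SQLite":
    conn = sqlite3.connect("local_data.db")     # placeholder file name
    df = pd.read_sql(QUERY, conn)
else:
    conn = duckdb.connect("local_data.duckdb")  # placeholder file name
    df = conn.execute(QUERY).df()

# The display code never changes, no matter where the data came from.
st.dataframe(df)
st.bar_chart(df, x="REGION", y="TOTAL")
```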
In conclusion, leveraging local databases like SQLite and DuckDB offers a cost-effective and efficient solution for managing data without the heavy expenses associated with cloud-based systems like Snowflake. By utilizing these embedded databases, we can maintain the flexibility and power of SQL queries while significantly reducing operational costs. The ability to seamlessly integrate with modern data tools and support for complex analytical workloads makes DuckDB particularly appealing for data scientists and analysts. Meanwhile, SQLite's widespread use and simplicity make it a reliable choice for transactional needs. Ultimately, the choice between these two depends on the specific requirements of your project, but both provide robust alternatives to traditional data management approaches. As demonstrated, transitioning from Snowflake to a local database can be streamlined with the right scripts and tools, ensuring that your data remains accessible and manageable without breaking the bank.