We were working on a project with a particular request: serve up data to ~12,000 users. Using Snowflake as the backend for this would likely be fairly costly just to read data that wasn't changing all that frequently. We wanted another way to house the data without incurring those costs, but also without dramatically changing a codebase that typically connects directly to Snowflake.
The first thought was to export a series of CSVs and load them as Pandas DataFrames. For use cases where the code simply loads a whole Snowflake table into Pandas, this would work; however, for cases where we needed to run more complex queries, it wouldn't.
Then we spotted a couple of intriguing posts from the data community. Let's dive in and take a look at why SQLite and DuckDB might be fantastic alternatives to direct connections or primitive CSV-based workflows.
Just about everyone has used SQLite at some point in their data journey. It's often one of the first data sources people are introduced to, either in school or in online courses. It has a great SQL interface as well as extensive support built into Python via the sqlite3 package. In the following post, X goes over some mind-bending facts about SQLite that help show just how powerful it is. This is not just a "dumb" database format.
A more recent database to gain popularity is DuckDB. DuckDB has a lot of similarities to SQLite, and for the use case above, either one will work perfectly. Here are some of the overall benefits of DuckDB.
While either database format would work, DuckDB might be the better choice thanks to its modern design and its extensive support for the modern data ecosystem. This article gives a wonderful, much more detailed overview of DuckDB. The table below summarizes how the two compare.
| Feature | SQLite | DuckDB |
| --- | --- | --- |
| Description | Embedded SQL database for transactional workloads | Embedded SQL database optimized for analytical workloads |
| Use Case | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| Storage Model | Row-oriented | Column-oriented |
| Performance Focus | Fast for small transactions and single-row operations | High performance for analytical queries and large datasets |
| Parallel Processing | Single-threaded | Supports multi-threaded query execution |
| SQL Features | Full SQL support for transactional operations | Full SQL support with advanced analytics functions |
| Data Format Support | Primarily operates on its own database files | Directly reads modern data formats (e.g., Parquet, CSV, JSON) |
| Integration with Tools | Limited integration with modern analytics tools | Seamlessly integrates with Python, R, Julia, Pandas, and Arrow |
| ACID Compliance | Fully ACID compliant | Fully ACID compliant |
| Resource Usage | Minimal memory and disk usage | Optimized for high performance with somewhat higher memory use |
| License | Open source (Public Domain) | Open source (MIT License) |
| Target Audience | Developers building lightweight apps with transactional needs | Data scientists, analysts, and engineers working on data analytics |
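To make the "Data Format Support" row concrete, here's a minimal sketch of DuckDB querying a Parquet file in place and handing the result back as a Pandas DataFrame. The `sales.parquet` file and its columns are placeholders, not data from this project.

```python
import duckdb

# In-memory connection; pass a file path instead to persist the database.
con = duckdb.connect()

# DuckDB can query a Parquet (or CSV/JSON) file directly -- no load step.
# "sales.parquet" and its columns are placeholders for your own data.
df = con.execute(
    "SELECT region, SUM(amount) AS total FROM 'sales.parquet' GROUP BY region"
).df()

print(df.head())
```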
Before we get started: the scripts are fairly long, so if you want to check them out in their entirety, head over to GitHub for the SQLite or DuckDB version and take a look.
The scripts start off by fetching the schemas from Snowflake and creating corresponding tables in the local database. Because neither SQLite nor DuckDB supports all of the same data types as Snowflake, the script maps each schema value to the closest supported type. While these aren't perfect matches, in the majority of cases the difference will be transparent to your use case.
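The full mapping lives in the linked scripts; as a rough illustration of the idea (the dictionary and helper below are a simplified stand-in, not the repository's actual code), Snowflake's richer type names get collapsed into the handful of storage classes SQLite understands:

```python
# Illustrative only -- the real mapping lives in the linked scripts.
# Snowflake reports types like NUMBER, VARCHAR, TIMESTAMP_NTZ; SQLite only
# has a few storage classes, so everything collapses into those.
SNOWFLAKE_TO_SQLITE = {
    "NUMBER": "NUMERIC",
    "FLOAT": "REAL",
    "VARCHAR": "TEXT",
    "TEXT": "TEXT",
    "BOOLEAN": "INTEGER",    # SQLite has no native boolean type
    "DATE": "TEXT",          # stored as ISO-8601 strings
    "TIMESTAMP_NTZ": "TEXT",
    "VARIANT": "TEXT",       # semi-structured values serialized as JSON strings
}


def map_snowflake_type(sf_type: str) -> str:
    """Return a SQLite column type for a Snowflake type, defaulting to TEXT."""
    base = sf_type.split("(")[0].upper()  # e.g. "VARCHAR(255)" -> "VARCHAR"
    return SNOWFLAKE_TO_SQLITE.get(base, "TEXT")
```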
Next, the script fetches the data from Snowflake, converts any retrieved values to a matching data type where needed, and then inserts the rows into the local database.
```python
def main():
    """
    Main function to orchestrate the data transfer from Snowflake to SQLite.
    """
    snowflake_connection = snowflake_connection_helper()
    print("Snowflake connection established.")
    try:
        for table_name in table_names:
            print(f"Processing table: {table_name}")
            # Fetch Snowflake schema
            schema = fetch_table_schema(snowflake_connection, table_name)
            # Create table in SQLite/DuckDB
            create_table_in_sqlite(table_name, schema)
            # Fetch data from Snowflake
            data, _ = fetch_data_from_snowflake(snowflake_connection, table_name)
            # Convert data types based on schema
            converted_data = convert_data_for_sqlite(data, schema)
            # Write to SQLite/DuckDB
            write_data_to_sqlite(converted_data, table_name)
            print(f"Data successfully written for table: {table_name}")
    finally:
        snowflake_connection.close()
        print("Snowflake connection closed.")
```
Note: There is a really useful function called `snowflake_connection_helper` available in this repository that will show you how to effectively use RSA key pairs for Snowflake authentication via your Streamlit `secrets.toml` file.
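For context, a key-pair-based helper generally looks something like the sketch below; the `secrets.toml` section and key names here are assumptions, so check the repository for the actual implementation.

```python
import snowflake.connector
import streamlit as st
from cryptography.hazmat.primitives import serialization


def snowflake_connection_helper():
    """Open a Snowflake connection using an RSA key pair stored in Streamlit secrets.

    A sketch only -- the secret names ("snowflake", "private_key", etc.) are
    assumptions; see the repository for the real helper.
    """
    creds = dict(st.secrets["snowflake"])

    # The private key is kept as a PEM string in secrets.toml and must be
    # converted to DER bytes before handing it to the connector.
    private_key = serialization.load_pem_private_key(
        creds["private_key"].encode(), password=None
    )
    private_key_der = private_key.private_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    )

    return snowflake.connector.connect(
        account=creds["account"],
        user=creds["user"],
        private_key=private_key_der,
        warehouse=creds.get("warehouse"),
        database=creds.get("database"),
        schema=creds.get("schema"),
    )
```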
Since the script is simply performing an `INSERT` query to add data from Snowflake to the local databases, it's easy to write a series of scripts that append newly added Snowflake data to the local databases through `SELECT` queries that filter on an incremental key such as a timestamp. Depending on your data, incrementally loading your databases instead of re-running the above script can save time and resources when keeping your local tables up to date after their initial creation.
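Here is a rough sketch of that incremental pattern, reusing the helpers and connection from the script above; the `ORDERS` table, the `UPDATED_AT` column, and the local file name are all placeholders:

```python
import sqlite3

# Find the high-water mark already present in the local database.
# "local_data.db", "ORDERS", and "UPDATED_AT" are placeholders.
local = sqlite3.connect("local_data.db")
last_loaded_at = local.execute("SELECT MAX(UPDATED_AT) FROM ORDERS").fetchone()[0]

# Pull only the rows added since the last load (the Snowflake connector
# uses pyformat-style parameter binding by default).
cursor = snowflake_connection.cursor()
cursor.execute(
    "SELECT * FROM ORDERS WHERE UPDATED_AT > %(last_loaded_at)s",
    {"last_loaded_at": last_loaded_at},
)
new_rows = cursor.fetchall()

# Convert and append to the existing local table instead of rebuilding it.
converted_rows = convert_data_for_sqlite(new_rows, schema)
write_data_to_sqlite(converted_rows, "ORDERS")
```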
To see this in action, we created a small test harness in Streamlit that helps us visualize each of the three data sources (Snowflake, SQLite, and DuckDB). It takes a single query and shows how that one query can be used across all three data sources with no changes to the query or to the code that displays the DataFrame or the chart.
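The heart of such a harness is small: choose a backend, run the same query, and hand the resulting DataFrame to Streamlit. The sketch below shows the pattern rather than the actual app; the query, file names, and column names are placeholders, and `snowflake_connection_helper` is the helper mentioned earlier.

```python
import sqlite3

import duckdb
import pandas as pd
import streamlit as st

# Placeholder query and column names -- swap in your own.
QUERY = "SELECT REGION, SUM(AMOUNT) AS TOTAL FROM ORDERS GROUP BY REGION"

backend = st.radio("Data source", ["Snowflake", "SQLite", "DuckDB"])

# The query is identical for every backend; only the connection differs.
if backend == "Snowflake":
    conn = snowflake_connection_helper()        # helper described above
    df = pd.read_sql(QUERY, conn)
elif backend == "SQLite":
    conn = sqlite3.connect("local_data.db")     # placeholder file name
    df = pd.read_sql(QUERY, conn)
else:
    conn = duckdb.connect("local_data.duckdb")  # placeholder file name
    df = conn.execute(QUERY).df()

# The display code never changes, no matter where the data came from.
st.dataframe(df)
st.bar_chart(df, x="REGION", y="TOTAL")
```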
In conclusion, leveraging local databases like SQLite and DuckDB offers a cost-effective and efficient solution for managing data without the heavy expenses associated with cloud-based systems like Snowflake. By utilizing these embedded databases, we can maintain the flexibility and power of SQL queries while significantly reducing operational costs. The ability to seamlessly integrate with modern data tools and support for complex analytical workloads makes DuckDB particularly appealing for data scientists and analysts. Meanwhile, SQLite's widespread use and simplicity make it a reliable choice for transactional needs. Ultimately, the choice between these two depends on the specific requirements of your project, but both provide robust alternatives to traditional data management approaches. As demonstrated, transitioning from Snowflake to a local database can be streamlined with the right scripts and tools, ensuring that your data remains accessible and manageable without breaking the bank.