
How to Migrate from Snowflake to Databricks: A Zero-Downtime Guide for Data Engineers

Ahmad Ali
18-19 Min Read Time

Recent benchmarks show ETL workloads can cost up to 9x more on Snowflake than Databricks Lakehouse, with ETL typically accounting for 50% or more of total data platform costs.

 

If you're thinking about moving from Snowflake to Databricks, you're not alone. Many data teams are making the shift to take advantage of Databricks' unified data intelligence platform for data engineering, analytics, and machine learning.

 

When we compare the two platforms, Snowflake is a powerful data warehouse. But Databricks gives you more flexibility if you're working with big data, running real-time analytics, or building AI models. It's especially attractive if you want to unify your data, analytics, and machine learning workflows in one place.

 

That said, migrating isn't just about moving data. It's about making sure your systems work just as well, if not better, after the move. In this post, I'll walk you through everything about making this migration smooth and stress-free, with particular attention to maintaining operations throughout the transition.

 

Preparation

Before you touch any data, take a step back and look at your current setup in Snowflake.

 

Here's what I recommend:

Audit your Snowflake environment

Make a comprehensive inventory of all your assets. Run the SHOW TABLES, SHOW VIEWS, SHOW PROCEDURES, and SHOW ROLES commands to document your entire environment. Use the Snowflake account_usage views to identify query patterns, usage metrics, and performance bottlenecks that might need special attention during migration.

SELECT query_text, database_name, schema_name, 
      execution_status, execution_time/1000 as execution_time_sec
FROM snowflake.account_usage.query_history
WHERE execution_time > 60000 -- queries that took more than 60 seconds
ORDER BY execution_time DESC
LIMIT 100;

Identify key workloads

Not everything needs to move on Day 1. Start with high-priority tables or jobs that benefit most from Databricks' processing power. Analyze your warehouse usage patterns to understand peak load times and critical ETL windows:

SELECT warehouse_name, 
      SUM(credits_used) as total_credits,
      COUNT(DISTINCT query_id) as query_count
FROM snowflake.account_usage.warehouse_metering_history
GROUP BY 1
ORDER BY 2 DESC;

Evaluate dependencies

Look at what tools are connected to Snowflake, like BI dashboards, data pipelines, etc. Use the ACCESS_HISTORY view to see who and what is connecting to your data:

SELECT ah.user_name, ah.direct_objects_accessed, qh.query_text
FROM snowflake.account_usage.access_history ah
JOIN snowflake.account_usage.query_history qh
  ON ah.query_id = qh.query_id
WHERE ah.query_start_time > DATEADD(month, -1, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC;

Choose your migration approach

You can go for a phased migration (move in chunks) or a full cutover. If you ask me, phased is safer for maintaining zero downtime. It lets you test along the way and gradually shift workloads while maintaining operational continuity.
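If you go phased, it helps to put the platform decision behind a single switch so you can move one workload at a time while everything else keeps reading from Snowflake. Here's a minimal sketch of that routing idea; the plan dictionary, helper names, and connection options are illustrative assumptions, not a prescribed design:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Which platform currently "owns" each workload; flip entries wave by wave
CUTOVER_PLAN = {
    "daily_sales_etl": "databricks",   # wave 1: migrated and validated
    "finance_reporting": "snowflake",  # wave 2: still on Snowflake
}

SNOWFLAKE_OPTIONS = {  # placeholder connection details
    "sfUrl": "your_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "***",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "REPORTING_WH",
}

def read_table(workload: str, table: str):
    """Route a read to whichever platform currently owns the workload."""
    if CUTOVER_PLAN.get(workload, "snowflake") == "databricks":
        return spark.read.table(table)  # Delta table already migrated
    return (spark.read.format("snowflake")  # Snowflake Spark connector
            .options(**SNOWFLAKE_OPTIONS)
            .option("dbtable", table)
            .load())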

 

Migration Process

Now that you're prepared, let's dive into the actual steps.

1. Extract Data from Snowflake

The first step is to get your data out of Snowflake. 

 

Here's a more precise approach:

 

Use Snowflake's COPY INTO <location> command to export data to cloud storage like S3 or Azure Blob. For optimal performance with Parquet files, specify the format and compression:

COPY INTO 's3://your-bucket/path/to/data/'
FROM my_database.my_schema.my_table
FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY)
HEADER = TRUE
MAX_FILE_SIZE = 128000000; -- 128MB files for optimal performance

While Parquet is generally the preferred format for performance, recent benchmarks show that Snowflake's new vectorized Parquet scanner delivers roughly 80% faster performance than previous Parquet testing. With that enhancement, Parquet becomes the most efficient unload format when the vectorized scanner option is enabled.

 

For large tables, consider partitioning your data exports for more efficient parallel processing:

COPY INTO 's3://your-bucket/path/to/data/'
FROM my_database.my_schema.my_table
PARTITION BY ('date=' || TO_VARCHAR(date_column)) -- the partition expression must evaluate to a string
FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY);

For incremental processing, capture the last modified timestamp to enable incremental migration:

COPY INTO 's3://your-bucket/path/to/incremental/'
FROM (SELECT * FROM my_table WHERE last_modified_ts > 
    (SELECT MAX(last_processed_ts) FROM migration_control_table))
FILE_FORMAT = (TYPE = PARQUET);

Remember that when unloading data into multiple files (the default, with SINGLE = FALSE), each file gets a unique name such as data_00_1_0.parquet. You can include a prefix in the target path to keep these files organized.

2. Transform the Data

Before loading the data into Databricks, check if you need to make any changes to ensure compatibility and optimization.

Data type mappings: Snowflake and Databricks don't use identical data types. Here's a quick reference for common mappings:

Snowflake     →  Databricks/Spark
NUMBER        →  DECIMAL
FLOAT         →  FLOAT
VARCHAR       →  STRING
DATE          →  DATE
TIMESTAMP     →  TIMESTAMP
VARIANT       →  STRING (or MAP/STRUCT in specific cases)
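
As a rough sketch of how these mappings might be applied during transformation (the column names here are hypothetical), you can cast columns explicitly and parse VARIANT payloads that were exported as JSON strings:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DecimalType, MapType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://your-bucket/path/to/data/")

converted = (df
    # NUMBER -> DECIMAL with explicit precision and scale
    .withColumn("order_amount", col("order_amount").cast(DecimalType(18, 2)))
    # VARCHAR -> STRING
    .withColumn("customer_name", col("customer_name").cast(StringType()))
    # VARIANT exported as a JSON string: parse into a MAP (or a STRUCT if the shape is fixed)
    .withColumn("attributes", from_json(col("attributes"), MapType(StringType(), StringType())))
)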

 

Optimization for Delta Lake: Consider partitioning strategies appropriate for Delta Lake, which might differ from your Snowflake partitioning.

# PySpark example for partitioning in Delta Lake
df.write.format("delta") \
  .partitionBy("year", "month") \
  .option("overwriteSchema", "true") \
  .mode("overwrite") \
  .save("/path/to/delta-table")

Schema evolution handling: If you need to modify schemas during migration, Delta Lake offers more flexible schema evolution than Parquet alone:

# Enable schema evolution when writing
df.write.format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save("/path/to/delta-table")

3. Load into Databricks

Now comes the fun part - loading the data into your new system. Based on the latest documentation and real-world practices, here are the most effective approaches:

 

Set up a Databricks workspace and cluster configuration optimized for data loading:

{
  "num_workers": 8,
  "spark_conf": {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
    "spark.sql.files.maxPartitionBytes": "134217728",
    "spark.sql.shuffle.partitions": "400"
  },
  "node_type_id": "i3.4xlarge"
}

For direct migration from Parquet, use CONVERT TO DELTA, which is significantly more efficient than rewriting the data:

# This is more efficient than reading and rewriting
spark.sql("CONVERT TO DELTA parquet.`s3://my-bucket/parquet-data`")

For larger datasets, consider using the CLONE feature which offers both shallow and deep options:

# Shallow clone (creates pointers without copying data)
spark.sql("CREATE TABLE target_table SHALLOW CLONE parquet.`/path/to/data`")

# Deep clone (copies all data to new location)
spark.sql("CREATE TABLE target_table CLONE parquet.`/path/to/data`")

For ongoing, incremental data loads, set up Auto Loader for change data capture:

from pyspark.sql.functions import current_timestamp

# Configure Auto Loader for continuous ingestion
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "parquet") \
  .option("cloudFiles.schemaLocation", "/tmp/schema") \
  .load("s3://my-bucket/new-data/")
  
# Write to Delta table with tracking
df.withColumn("ingestion_time", current_timestamp()) \
  .writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/checkpoint") \
  .start("/path/to/delta-table")

Delta Lake gives you versioning, ACID transactions, and better performance - features essential for enterprise-grade data systems. You will find these features to be game-changers for teams previously struggling with data consistency issues in their pipelines.
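
To see the versioning in practice once a table is loaded, here's a quick sketch (the path is carried over from the examples above, and it assumes an active SparkSession on Databricks):

# Inspect the table's transaction log: version, timestamp, and operation per commit
spark.sql("DESCRIBE HISTORY delta.`/path/to/delta-table`").show(truncate=False)

# Time travel: read an earlier version for validation or rollback checks
previous = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/path/to/delta-table")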

 

Tools and Resources

You don't have to do this all by hand. Here are tools and real-world benchmarks that can accelerate your migration:

 

Fivetran or Matillion: Great for automated ETL from Snowflake to Databricks. With Fivetran, you can simply change the destination from Snowflake to Databricks and run a new sync, which is particularly useful for maintaining parallel systems during migration.

 

Databricks Partner Connect: Helps you discover connectors and integrations with third-party tools already in use with your Snowflake instance.

 

Databricks CLI: Lets you automate things like cluster creation and job scheduling. Here's an example for automating Delta table optimizations:

databricks jobs create --json '{
  "name": "Optimize Delta Tables",
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [{
    "task_key": "optimize",
    "notebook_task": {
      "notebook_path": "/path/to/optimize_script"
    }
  }]
}'
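
The optimize notebook itself isn't shown in this post; a minimal sketch of what it might run looks like this (the table list and Z-order column are placeholders, and spark is the session provided by the Databricks notebook):

# Hypothetical body of /path/to/optimize_script
tables = ["gold.customers", "silver.raw_events"]  # placeholder table list

for t in tables:
    # Compact small files and co-locate data on a frequently filtered column
    spark.sql(f"OPTIMIZE {t} ZORDER BY (id)")
    # Clean up files no longer referenced by the table (default 7-day retention)
    spark.sql(f"VACUUM {t}")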

dbt: If you're using dbt with Snowflake, adapting to Databricks is straightforward. Update your profiles.yml:

# In profiles.yml (both blocks sit under your profile's outputs section)

# Before (Snowflake)
snowflake:
  type: snowflake
  account: your_account

# After (Databricks)
databricks:
  type: databricks
  host: your-workspace.cloud.databricks.com
  http_path: /sql/1.0/warehouses/your_warehouse_id
  token: dapi123456789
  catalog: main
  schema: default

Performance benchmarks: In recent testing at enterprise scale, the vectorized Parquet scanner in Databricks showed a 3-4x performance improvement over standard Parquet processing. This becomes particularly relevant for large analytical queries spanning terabytes of data.

 

Challenges and Solutions

Let's be honest, no migration is completely smooth. Here are the most common issues people face with zero-downtime migrations and how you can solve them:

Data type mismatches

Snowflake and Databricks don't always use the same types. Beyond simple casting, consider using a schema registry or validation framework to ensure consistency:

# Schema validation pattern
from pyspark.sql.types import StructType, StructField, LongType, StringType

expected_schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), True)
])

# Validate against expected schema
if df.schema != expected_schema:
    # Log the difference and apply transformation
    diff = set(df.schema.fields) - set(expected_schema.fields)
    print(f"Schema mismatch detected: {diff}")
    # Apply transformations to align schemas (see the sketch below)
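
One way to finish that last step is to coerce the DataFrame toward the expected schema, casting columns that exist and adding missing nullable columns as NULLs. A minimal sketch, assuming the expected_schema defined above:

from pyspark.sql.functions import col, lit

def align_to_schema(df, expected_schema):
    """Cast existing columns to the expected types; add missing nullable columns as NULL."""
    aligned = df
    for field in expected_schema.fields:
        if field.name in df.columns:
            aligned = aligned.withColumn(field.name, col(field.name).cast(field.dataType))
        elif field.nullable:
            aligned = aligned.withColumn(field.name, lit(None).cast(field.dataType))
        else:
            raise ValueError(f"Required column is missing: {field.name}")
    # Keep only the expected columns, in the expected order
    return aligned.select([f.name for f in expected_schema.fields])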

Large volumes of data

For petabyte-scale migrations, I recommend a combination of parallel extraction from Snowflake with multiple warehouses, partitioned unloading, and then parallel ingestion into Databricks:

-- In Snowflake: Create multiple warehouses for parallel extraction
CREATE WAREHOUSE extract_wh_1 WITH WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE extract_wh_2 WITH WAREHOUSE_SIZE = 'LARGE';

-- Then run multiple COPY operations in parallel sessions
-- Session 1
USE WAREHOUSE extract_wh_1;
COPY INTO 's3://bucket/part1/' FROM
    (SELECT * FROM huge_table WHERE date_partition BETWEEN '2020-01-01' AND '2020-06-30')
    FILE_FORMAT = (TYPE = PARQUET);

-- Session 2
USE WAREHOUSE extract_wh_2;
COPY INTO 's3://bucket/part2/' FROM
    (SELECT * FROM huge_table WHERE date_partition BETWEEN '2020-07-01' AND '2020-12-31')
    FILE_FORMAT = (TYPE = PARQUET);

Security and permissions

You'll need to recreate your access control rules in Databricks. Use Unity Catalog to manage data governance:

-- In Databricks SQL (Unity Catalog)
GRANT SELECT ON TABLE gold.customers TO role_analysts;
GRANT MODIFY ON TABLE silver.raw_events TO role_eng;

-- For column-level security, expose a restricted view (or use column masks) rather than column lists in GRANT
CREATE VIEW gold.customers_marketing AS
  SELECT id, name, city FROM gold.customers;
GRANT SELECT ON TABLE gold.customers_marketing TO role_marketing;

Zero-downtime migration strategy

Maintain dual-write pipelines during migration to ensure both systems have current data:

# Example dual-write pattern in Python
def save_to_both_platforms(data):
    # Write to Snowflake
    snowflake_conn.write(data)
    
    # Write to Databricks
    spark.createDataFrame(data).write.format("delta").mode("append").save("/path/to/delta")
    
    # Verify consistency between platforms
    snowflake_count = snowflake_conn.query("SELECT COUNT(*) FROM table")
    databricks_count = spark.sql("SELECT COUNT(*) FROM delta.`/path/to/delta`").collect()[0][0]
    
    assert abs(snowflake_count - databricks_count) < 10, "Data drift detected between platforms"

Test everything in a staging environment before going live. Set up automated reconciliation checks between Snowflake and Databricks during the transition to catch any inconsistencies early.
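
As a sketch of what such a reconciliation check could look like, the snippet below compares a row count and a simple checksum per table from a Databricks notebook; the connection details, table list, and the id key column are assumptions you'd replace with your own:

import snowflake.connector

TABLES = ["customers", "orders"]  # placeholder list of migrated tables

sf = snowflake.connector.connect(
    account="your_account", user="etl_user", password="***",
    warehouse="REPORTING_WH", database="ANALYTICS", schema="PUBLIC",
)

def reconcile(table: str) -> bool:
    # Row count plus the sum of a numeric key as a cheap checksum on the Snowflake side
    cur = sf.cursor()
    cur.execute(f"SELECT COUNT(*), SUM(id) FROM {table}")
    sf_count, sf_sum = cur.fetchone()

    # Same aggregates on the Databricks side (catalog and schema names are assumed)
    row = spark.sql(f"SELECT COUNT(*) AS c, SUM(id) AS s FROM main.default.{table}").collect()[0]
    return sf_count == row["c"] and sf_sum == row["s"]

for t in TABLES:
    print(t, "OK" if reconcile(t) else "MISMATCH")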

 

Cost Comparison and Final Thoughts

Switching from Snowflake to Databricks isn't just a technical migration. It's a shift in how your team handles data, and often, a significant cost-saving opportunity.

 

In terms of pricing models, Databricks offers a pay-as-you-go approach with no up-front costs, while Snowflake operates on a credit-based system with different rates for Standard ($2.00/credit), Enterprise ($3.00/credit), and Business Critical ($4.00/credit) editions.
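
As a back-of-the-envelope illustration of the credit model (the credit volume below is hypothetical), monthly Snowflake compute cost is simply credits consumed times the edition's rate:

# Hypothetical example: 8,000 credits per month on the Enterprise edition
credits_per_month = 8_000
rate_per_credit = 3.00  # Enterprise rate quoted above
snowflake_compute = credits_per_month * rate_per_credit
print(f"Estimated Snowflake compute: ${snowflake_compute:,.0f}/month")  # $24,000/month

# Databricks compute is billed per DBU; rates vary by workload type and cloud,
# so plug in your own projected DBU usage and negotiated rate for the comparison.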

 

For a real-world comparison, consider a client with the following profile:

 

  • 5TB of data across 200+ tables
  • 10-15 concurrent ETL jobs
  • 30+ BI dashboard users

 

Their monthly costs dropped from approximately $25,000 on Snowflake to $15,000 on Databricks after optimization, representing a 40% reduction. The savings came primarily from:

 

  1. More efficient compute utilization (Databricks' cluster management)
  2. Reduced storage costs with Delta Lake's compression capabilities
  3. Elimination of separate ML tooling costs by utilizing Databricks' unified platform

 

With the right prep and a careful rollout, you'll end up with a system that's faster, smarter, better suited for modern data workflows, and often significantly less expensive. Start small, test often, document everything, and implement reconciliation checks during the transition period.

 

Whether you're moving for cost savings, performance, or better ML support, Databricks can be a game-changer when implemented with careful planning and technical precision.

 

Since Databricks is a single platform, you only need a small development team with Databricks expertise; all of your data ETL, ML, AI, Gen AI, and BI needs can be handled by that one team. There's no need to hire multiple teams to manage these use cases across multiple platforms.

 

Databricks also helps eliminate data silos through unified governance, rather than storing data and solving different use cases on separate platforms.

 

Looking for expert guidance or a turnkey migration? Explore our Databricks services to see how we can accelerate your journey.
