Recent benchmarks show ETL workloads can cost up to 9x more on Snowflake than Databricks Lakehouse, with ETL typically accounting for 50% or more of total data platform costs.
If you're thinking about moving from Snowflake to Databricks, you're not alone. Many data teams are making the shift to take advantage of Databricks' unified Data Intelligence Platform for data engineering, analytics, and machine learning.
When we compare the two platforms, Snowflake is a powerful data warehouse. But Databricks gives you more flexibility if you're working with big data, running real-time analytics, or building AI models. It's especially attractive if you want to unify your data, analytics, and machine learning workflows in one place.
That said, migrating isn't just about moving data. It's about making sure your systems work just as well, if not better, after the move. In this post, I'll walk you through everything about making this migration smooth and stress-free, with particular attention to maintaining operations throughout the transition.
Preparation
Before you touch any data, take a step back and look at your current setup in Snowflake.
Here's what I recommend:
Audit your Snowflake environment
Make a comprehensive inventory of all your assets. Run SHOW TABLES, SHOW VIEWS, SHOW PROCEDURES, and SHOW ROLES commands to document your entire environment. Use the Snowflake account_usage views to identify query patterns, usage metrics, and performance bottlenecks that might need special attention during migration.
SELECT query_text, database_name, schema_name,
execution_status, execution_time/1000 as execution_time_sec
FROM snowflake.account_usage.query_history
WHERE execution_time > 60000 -- queries that took more than 60 seconds
ORDER BY execution_time DESC
LIMIT 100;
Identify key workloads
Not everything needs to move on Day 1. Start with high-priority tables or jobs that benefit most from Databricks' processing power. Analyze your warehouse usage patterns to understand peak load times and critical ETL windows:
SELECT warehouse_name,
SUM(credits_used) as total_credits,
COUNT(DISTINCT query_id) as query_count
FROM snowflake.account_usage.warehouse_metering_history
GROUP BY 1
ORDER BY 2 DESC;
Evaluate dependencies
Look at what tools are connected to Snowflake, like BI dashboards, data pipelines, etc. Use the ACCESS_HISTORY view to see who and what is connecting to your data:
SELECT ah.user_name,
       obj.value:objectName::STRING AS object_name,
       ah.query_start_time
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) obj
WHERE ah.query_start_time > DATEADD(month, -1, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC;
Choose your migration approach
You can go for a phased migration (move in chunks) or a full cutover. If you ask me, phased is safer for maintaining zero downtime. It lets you test along the way and gradually shift workloads while maintaining operational continuity.
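If you go phased, it helps to keep a simple control table that records what has moved and when. Here's a minimal sketch in Snowflake (the table and column names are assumptions; the incremental COPY example later in this post reads from a table like this):
-- Hypothetical control table for tracking a phased migration
CREATE TABLE IF NOT EXISTS migration_control_table (
    object_name        STRING,        -- table or pipeline being migrated
    migration_phase    STRING,        -- e.g. 'pilot', 'phase_2', 'cutover'
    status             STRING,        -- 'pending', 'in_progress', 'migrated', 'validated'
    last_processed_ts  TIMESTAMP_NTZ, -- high-water mark for incremental extracts
    updated_at         TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);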
Migration Process
Now that you're prepared, let's dive into the actual steps.
1. Extract Data from Snowflake
The first step is to get your data out of Snowflake.
Here's a more precise approach:
Use Snowflake's COPY INTO <location> command to export data to cloud storage like S3 or Azure Blob. For optimal performance with Parquet files, specify the format and compression:
COPY INTO 's3://your-bucket/path/to/data/'
FROM my_database.my_schema.my_table
FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY)
HEADER = TRUE
MAX_FILE_SIZE = 128000000; -- 128MB files for optimal performance
Parquet is generally the preferred format for performance, and recent benchmarks show that Snowflake's newer vectorized Parquet scanner runs roughly 80% faster than earlier Parquet tests, making Parquet the most efficient export format when the vectorized scanner option is enabled.
For large tables, consider partitioning your data exports for more efficient parallel processing:
COPY INTO 's3://your-bucket/path/to/data/'
FROM my_database.my_schema.my_table
PARTITION BY (TO_VARCHAR(date_column))
FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY);
For incremental processing, capture the last modified timestamp to enable incremental migration:
COPY INTO 's3://your-bucket/path/to/incremental/'
FROM (SELECT * FROM my_table WHERE last_modified_ts >
(SELECT MAX(last_processed_ts) FROM migration_control_table))
FILE_FORMAT = (TYPE = PARQUET);
Remember that when unloading data into multiple files (the default with SINGLE=FALSE), each file will have a unique name pattern like data_00_1_0.parquet. You can use a prefix in your specification to better organize these files.
2. Transform the Data
Before loading the data into Databricks, check if you need to make any changes to ensure compatibility and optimization.
Data type mappings: Snowflake and Databricks don't use identical data types. Here's a quick reference for common mappings:
Snowflake    →  Databricks/Spark
NUMBER       →  DECIMAL
FLOAT        →  FLOAT
VARCHAR      →  STRING
DATE         →  DATE
TIMESTAMP    →  TIMESTAMP
VARIANT      →  STRING (or MAP/STRUCT in specific cases)
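To apply these mappings in code after reading the exported Parquet files, a minimal PySpark sketch might look like the following (the column names amount, customer_name, and attributes are placeholders):
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DecimalType, MapType, StringType

# Read the files exported from Snowflake
df = spark.read.parquet("s3://your-bucket/path/to/data/")

converted = (
    df
    .withColumn("amount", col("amount").cast(DecimalType(38, 9)))      # NUMBER -> DECIMAL
    .withColumn("customer_name", col("customer_name").cast("string"))  # VARCHAR -> STRING
    # VARIANT columns arrive as JSON strings; parse them if you need structure
    .withColumn("attributes", from_json(col("attributes"), MapType(StringType(), StringType())))
)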
Optimization for Delta Lake: Consider partitioning strategies appropriate for Delta Lake, which might differ from your Snowflake partitioning.
# PySpark example for partitioning in Delta Lake
df.write.format("delta") \
.partitionBy("year", "month") \
.option("overwriteSchema", "true") \
.mode("overwrite") \
.save("/path/to/delta-table")
Schema evolution handling: If you need to modify schemas during migration, Delta Lake offers more flexible schema evolution than Parquet alone:
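For example, here's a minimal sketch of appending a batch whose schema has gained a column, assuming the Delta table from the partitioning example above (new_df is a placeholder for the incoming DataFrame):
# mergeSchema lets Delta add the new column instead of rejecting the write
new_df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/path/to/delta-table")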
3. Load the Data into Databricks
Now comes the fun part - loading the data into your new system. Based on current documentation and real-world practice, here are the most effective approaches:
Set up a Databricks workspace and cluster configuration optimized for data loading:
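As a rough sketch, a load-oriented cluster can be created with the Databricks CLI (the runtime version, node type, and autoscale range are placeholders, and exact flags vary between CLI versions):
# Hypothetical cluster spec; adjust node type and runtime to your cloud and workload
databricks clusters create --json '{
  "cluster_name": "migration-loader",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}'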
For direct migration from Parquet, use CONVERT TO DELTA which is significantly more efficient than rewriting the data:
# This is more efficient than reading and rewriting
spark.sql("CONVERT TO DELTA parquet.`s3://my-bucket/parquet-data`")
For larger datasets, consider using the CLONE feature which offers both shallow and deep options:
# Shallow clone (creates pointers without copying data)
spark.sql("CREATE TABLE target_table SHALLOW CLONE parquet.`/path/to/data`")
# Deep clone (copies all data to new location)
spark.sql("CREATE TABLE target_table CLONE parquet.`/path/to/data`")
For ongoing, incremental data loads, set up Auto Loader to continuously pick up newly landed files:
from pyspark.sql.functions import current_timestamp
# Configure Auto Loader for continuous ingestion
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "parquet") \
.option("cloudFiles.schemaLocation", "/tmp/schema") \
.load("s3://my-bucket/new-data/")
# Write to Delta table with tracking
df.withColumn("ingestion_time", current_timestamp()) \
.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/checkpoint") \
.start("/path/to/delta-table")
Delta Lake gives you versioning, ACID transactions, and better performance - features essential for enterprise-grade data systems. They tend to be game-changers for teams that previously struggled with data consistency issues in their pipelines.
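As a quick illustration of the versioning piece, table history and time travel work out of the box on the Delta paths used above:
-- Inspect the table's commit history
DESCRIBE HISTORY delta.`/path/to/delta-table`;

-- Query the table as it looked at an earlier version
SELECT * FROM delta.`/path/to/delta-table` VERSION AS OF 1;

-- Roll back if a load goes wrong
RESTORE TABLE delta.`/path/to/delta-table` TO VERSION AS OF 1;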
Tools and Resources
You don't have to do this all by hand. Here are tools and real-world benchmarks that can accelerate your migration:
Fivetran or Matillion: Great for automated ETL from Snowflake to Databricks. With Fivetran, you can simply change the destination from Snowflake to Databricks and run a new sync, which is particularly useful for maintaining parallel systems during migration.
Databricks Partner Connect: Helps you discover connectors and integrations with third-party tools already in use with your Snowflake instance.
Databricks CLI: Lets you automate things like cluster creation and job scheduling. Here's an example for automating Delta table optimizations:
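One hedged sketch (the job name, schedule, cluster ID, and notebook path are placeholders, and flags vary slightly between CLI versions) is to schedule a nightly maintenance job whose notebook runs statements like OPTIMIZE gold.customers ZORDER BY (id) and VACUUM gold.customers:
databricks jobs create --json '{
  "name": "nightly-delta-maintenance",
  "schedule": { "quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC" },
  "tasks": [{
    "task_key": "optimize_delta",
    "existing_cluster_id": "<your-cluster-id>",
    "notebook_task": { "notebook_path": "/Repos/maintenance/optimize_delta_tables" }
  }]
}'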
dbt: If you're using dbt with Snowflake, adapting to Databricks is straightforward. Update your profiles.yml; note that the dbt-databricks adapter also needs the http_path of your SQL warehouse or cluster (a placeholder is shown below):
# Before (Snowflake)
snowflake:
  type: snowflake
  account: your_account

# After (Databricks)
databricks:
  type: databricks
  host: your-workspace.cloud.databricks.com
  http_path: /sql/1.0/warehouses/your_warehouse_id
  token: dapi123456789
  catalog: main
  schema: default
Performance benchmarks: In recent testing at enterprise scale, the vectorized Parquet scanner in Databricks showed a 3-4x performance improvement over standard Parquet processing. This becomes particularly relevant for large analytical queries spanning terabytes of data.
Challenges and Solutions
Let's be honest, no migration is completely smooth. Here are the most common issues people face with zero-downtime migrations and how you can solve them:
Data type mismatches
Snowflake and Databricks don't always use the same types. Beyond simple casting, consider using a schema registry or validation framework to ensure consistency:
# Schema validation pattern
from pyspark.sql.types import StructType, StructField, LongType, StringType

expected_schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), True)
])

# Validate against expected schema
if df.schema != expected_schema:
    # Log the difference, then apply transformations to align the schemas
    diff = set(df.schema.fields) - set(expected_schema.fields)
    print(f"Schema mismatch detected: {diff}")
    # e.g. cast, rename, or drop columns here before writing to Delta
Large volumes of data
For petabyte-scale migrations, I recommend a combination of parallel extraction from Snowflake with multiple warehouses, partitioned unloading, and then parallel ingestion into Databricks:
-- In Snowflake: Create multiple warehouses for parallel extraction
CREATE WAREHOUSE extract_wh_1 WITH WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE extract_wh_2 WITH WAREHOUSE_SIZE = 'LARGE';
-- Then run multiple COPY operations in parallel sessions
-- Session 1
USE WAREHOUSE extract_wh_1;
COPY INTO 's3://bucket/part1/' FROM
(SELECT * FROM huge_table WHERE date_partition BETWEEN '2020-01-01' AND '2020-06-30')
FILE_FORMAT = (TYPE = PARQUET);
-- Session 2
USE WAREHOUSE extract_wh_2;
COPY INTO 's3://bucket/part2/' FROM
(SELECT * FROM huge_table WHERE date_partition BETWEEN '2020-07-01' AND '2020-12-31')
FILE_FORMAT = (TYPE = PARQUET);
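On the Databricks side, a minimal sketch for picking up both exported prefixes and landing them in a single Delta table (paths match the extraction commands above):
# Read both partition ranges that were unloaded in parallel
df = spark.read.parquet("s3://bucket/part1/", "s3://bucket/part2/")

# Land them in one partitioned Delta table
df.write.format("delta") \
    .partitionBy("date_partition") \
    .mode("overwrite") \
    .save("/path/to/delta/huge_table")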
Security and permissions
You'll need to recreate your access control rules in Databricks. Use Unity Catalog to manage data governance:
-- In Databricks SQL
GRANT SELECT ON TABLE gold.customers TO role_analysts;
GRANT MODIFY ON TABLE silver.raw_events TO role_eng;
-- Unity Catalog grants are object-level; for column-level access, expose a restricted view
CREATE VIEW gold.customers_marketing AS
    SELECT id, name, city FROM gold.customers;
GRANT SELECT ON TABLE gold.customers_marketing TO role_marketing;
Zero-downtime migration strategy
Maintain dual-write pipelines during migration to ensure both systems have current data:
# Example dual-write pattern in Python (snowflake_conn is a placeholder for your Snowflake client)
def save_to_both_platforms(data):
    # Write to Snowflake
    snowflake_conn.write(data)
    # Write to Databricks
    spark.createDataFrame(data).write.format("delta").mode("append").save("/path/to/delta")

# Verify consistency between platforms (assumes the client returns a scalar count)
snowflake_count = snowflake_conn.query("SELECT COUNT(*) FROM table")
databricks_count = spark.sql("SELECT COUNT(*) FROM delta.`/path/to/delta`").collect()[0][0]
assert abs(snowflake_count - databricks_count) < 10, "Data drift detected between platforms"
Test everything in a staging environment before going live. Set up automated reconciliation checks between Snowflake and Databricks during the transition to catch any inconsistencies early.
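One simple hedged pattern for those checks (reusing the placeholder snowflake_conn wrapper from the dual-write example, and assuming the tables keep the same names on both sides) is a row-count sweep:
# Hypothetical reconciliation sweep across migrated tables
tables_to_check = ["gold.customers", "silver.raw_events"]

for table in tables_to_check:
    # Placeholder call: however your Snowflake client returns a scalar count
    sf_count = snowflake_conn.query(f"SELECT COUNT(*) FROM {table}")
    db_count = spark.sql(f"SELECT COUNT(*) FROM {table}").collect()[0][0]
    if sf_count != db_count:
        print(f"Mismatch in {table}: Snowflake={sf_count}, Databricks={db_count}")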
Cost Comparison and Final Thoughts
Switching from Snowflake to Databricks isn't just a technical migration. It's a shift in how your team handles data, and often, a significant cost-saving opportunity.
In terms of pricing models, Databricks offers a pay-as-you-go approach with no up-front costs, while Snowflake operates on a credit-based system with different rates for Standard ($2.00/credit), Enterprise ($3.00/credit), and Business Critical ($4.00/credit) editions.
For a real-world comparison, consider a client with the following profile:
5TB of data across 200+ tables
10-15 concurrent ETL jobs
30+ BI dashboard users
Their monthly costs dropped from approximately $25,000 on Snowflake to $15,000 on Databricks after optimization, representing a 40% reduction. The savings came primarily from:
More efficient compute utilization (Databricks' cluster management)
Reduced storage costs with Delta Lake's compression capabilities
Elimination of separate ML tooling costs by utilizing Databricks' unified platform
With the right prep and a careful rollout, you'll end up with a system that's faster, smarter, better suited for modern data workflows, and often significantly less expensive. Start small, test often, document everything, and implement reconciliation checks during the transition period.
Whether you're moving for cost savings, performance, or better ML support, Databricks can be a game-changer when implemented with careful planning and technical precision.
Because Databricks is a single platform, you only need a lean development team with Databricks expertise; ETL, ML, AI, GenAI, and BI workloads can all be handled by one small team instead of hiring separate teams to manage these use cases across multiple platforms.
Databricks also helps eliminate data silos through built-in governance, rather than storing data and solving different use cases on separate platforms.
Looking for expert guidance or a turnkey migration? Explore our Databricks services to see how we can accelerate your journey.