Optimizing AWS Glue ETL Performance: Achieving High-Efficiency Data Transformations Using Apache Iceberg

Muhammad Usama, posted on June 12, 2026

4-5 Min Read Time

This technical deep dive explores the architectural evolution from a legacy direct-connect ETL model to a modern, scalable data lakehouse leveraging Amazon S3 and Apache Iceberg for optimized performance and reduced operational overhead.

Introduction

Many data architectures begin with a straightforward pattern: extracting data from production OLTP databases, applying transformations via AWS Glue, and loading the results into a downstream reporting database. While effective for small datasets, this approach runs into scalability constraints as data velocity and volume increase.

In our experience, we reached a critical inflection point where ETL runtimes degraded significantly. The direct-query model introduced unsustainable read pressure on production systems, impacting both user experience and data freshness.

To address these challenges, we decoupled our extraction and transformation layers by implementing a robust data lake layer based on the Apache Iceberg table format, hosted on Amazon S3.

The Original Architecture

The legacy architecture relied on AWS Glue jobs establishing direct JDBC connections to production MySQL instances. While this simplified early development, it created tight coupling between analytical workloads and operational database performance.
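
The legacy read pattern can be sketched roughly as follows. This is a minimal illustration and not the original job code: the host, database, table, credentials, and the helper names `jdbc_read_options` and `read_from_production` are hypothetical, and the MySQL Connector/J driver class is assumed. The point is that every Glue job needing a table issued its own read like this against production.

```python
def jdbc_read_options(host, database, table, user, password):
    """Build Spark JDBC reader options for a MySQL table (all values hypothetical)."""
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumes MySQL Connector/J is available on the job's classpath.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def read_from_production(spark, options):
    """Read one table over JDBC.

    `spark` is an existing SparkSession, for example the one a Glue job provides.
    Each call opens its own connection and scans the production table.
    """
    return spark.read.format("jdbc").options(**options).load()
```

With several workflows calling `read_from_production` for the same table, production serves the same scan once per workflow.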

The Problems We Faced

Performance Bottlenecks

High-frequency ETL queries on massive tables saturated database I/O, leading to several critical issues:

Extended ETL Runtimes: Jobs consistently exceeded 7–8 hours due to throttling and concurrency issues.
Redundant Extraction: Multiple concurrent workflows queried identical datasets, multiplying resource consumption.
Systemic Latency: High I/O wait times on production instances threatened application stability.

Operational Overhead

Pipeline fragility required extensive manual intervention, typically 3–5 hours of engineering time each day, to resolve:

Frequent Glue job failures caused by timeouts
Cascading workflow delays

Scalability Challenges

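The architecture lacked a dedicated storage layer between extraction and transformation workloads, so every additional transformation job added another set of direct reads against the production database.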

Phase 1 Solution: The "Load Once, Transform Many" Pattern

The core objective of remediation was to decouple the high-pressure analytical workloads from the production OLTP database. The legacy model suffered from "tight coupling," where every ETL job performed direct JDBC reads, causing severe resource contention and extended runtimes (7–8 hours).

We introduced a Data Lakehouse layer using Apache Iceberg on Amazon S3. This shifted the architecture from a direct-connect model to an ingestion-first model.

Key Technical Enhancements:

Decoupling via Centralized Ingestion: By introducing a dedicated Glue "Ingest" job, the production database is queried only once to extract the full dataset. This data is then persisted as Apache Iceberg tables on S3, which becomes the single source of truth for all downstream transformation jobs.
Performance Optimization: Subsequent ETL workflows no longer query the production database. They operate exclusively against the S3 Iceberg tables. This isolation eliminated the I/O saturation on the MySQL production instance, drastically improving application stability and data freshness.
Full-Refresh Strategy: Given the project time constraints, the initial implementation performed a full table reload on each run. While not the most efficient approach in terms of storage and bandwidth, it provided immediate stability by offloading read pressure from the production database.
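
The three points above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not the actual job: the catalog name `glue_catalog`, the warehouse path, and the helper names are hypothetical, and the session is assumed to be configured for Iceberg with the AWS Glue Data Catalog using the settings shown.

```python
# Spark settings commonly used to expose Iceberg tables through the Glue Data Catalog.
# The catalog name and warehouse path are placeholders.
ICEBERG_CATALOG_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse": "s3://example-bucket/warehouse/",
}


def iceberg_table_name(database, table, catalog="glue_catalog"):
    """Return the fully qualified Iceberg table identifier."""
    return f"{catalog}.{database}.{table}"


def ingest_full_refresh(spark, jdbc_options, target_table):
    """Query production once, then replace the Iceberg table with the fresh extract."""
    source_df = spark.read.format("jdbc").options(**jdbc_options).load()
    # createOrReplace commits the reload as a new snapshot of the table.
    source_df.writeTo(target_table).using("iceberg").createOrReplace()


def read_for_transform(spark, target_table):
    """Downstream jobs read the Iceberg table on S3 instead of the production database."""
    return spark.table(target_table)
```

Only `ingest_full_refresh` touches the source database; any number of transformation jobs can then call `read_for_transform` without adding load to production.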

Results and Impact

Operational database: Now protected from heavy analytical traffic. It serves only the initial ingestion job, significantly reducing I/O wait times.
Glue Ingest Job: Acts as the gateway. It executes the extraction process and manages the write-to-Iceberg operation.
Storage Layer (S3 + Apache Iceberg): This is the "Data Lake" tier. Apache Iceberg provides ACID compliance and schema evolution support, which ensures that even though tables are reloaded/overwritten during Phase 1, the downstream ETL jobs can read consistent snapshots of the data.
Transformation Layer (Glue): Decoupled from the source, these jobs now read from the Iceberg tables on S3. Since S3 I/O is typically faster and more scalable than direct MySQL JDBC reads, job runtimes dropped from 7–8 hours to roughly 2.5–3 hours.
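
As an illustration of the consistent-snapshot behavior, a long-running transformation job could pin its reads to a single snapshot so that a reload during the run does not change what it sees. The sketch below is an assumption-laden example, not the pipeline's code: the helper names are hypothetical, and it assumes Spark 3.3 or later with Iceberg, where the `snapshots` metadata table and `VERSION AS OF` are available.

```python
def latest_snapshot_query(table):
    """SQL that returns the most recently committed snapshot of an Iceberg table."""
    return (
        f"SELECT snapshot_id FROM {table}.snapshots "
        "ORDER BY committed_at DESC LIMIT 1"
    )


def pinned_read_query(table, snapshot_id):
    """SQL that reads the table as of one specific snapshot."""
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def read_latest_snapshot(spark, table):
    """Resolve the current snapshot once, then read the table pinned to it."""
    snapshot_id = spark.sql(latest_snapshot_query(table)).first()["snapshot_id"]
    return spark.sql(pinned_read_query(table, snapshot_id))
```

A plain `spark.table(...)` read also resolves to one snapshot when the query is planned; pinning explicitly is mainly useful when a job reads the same table several times and needs every read to agree.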

What’s Next: Phase 2

Phase 1 established the foundation for a scalable lakehouse architecture. Phase 2 focuses on incremental processing and advanced optimization.

Incremental loading using watermarking and CDC
Upsert and merge support inside Iceberg tables
Iceberg table partitioning and schema evolution
Incremental writing to target database
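
One possible shape for the first two items, watermark-based extraction followed by an Iceberg `MERGE INTO`, is sketched below. Phase 2 is not described in detail here, so everything in this example is an assumption: the `updated_at` watermark column, the `id` key, the view name `updates`, and the helper names are all hypothetical.

```python
def incremental_dbtable(table, watermark_column, last_watermark):
    """Build a JDBC `dbtable` subquery that extracts only rows changed since the last run.

    `last_watermark` is expected to come from trusted job state, not user input.
    """
    return f"(SELECT * FROM {table} WHERE {watermark_column} > '{last_watermark}') AS incr"


def merge_statement(target, source_view, key_columns):
    """Build an Iceberg MERGE INTO statement that upserts changed rows by key."""
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} t USING {source_view} s ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def incremental_upsert(spark, jdbc_options, source_table, target_table, last_watermark):
    """Read changed rows from the source and merge them into the Iceberg table."""
    options = dict(jdbc_options)
    options["dbtable"] = incremental_dbtable(source_table, "updated_at", last_watermark)
    changed = spark.read.format("jdbc").options(**options).load()
    changed.createOrReplaceTempView("updates")
    spark.sql(merge_statement(target_table, "updates", ["id"]))
```

A watermark on a modification timestamp does not capture rows deleted at the source, which is one reason to pair it with CDC.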
