When Scale Meets Complexity: The 125PB Challenge
- Data lake size: 125 petabytes
- Ingestion rate: 6 million events per second (average)
- Largest single table: 5.6 petabytes
- Processing model: Daily batch ingests via Apache Airflow
Managing an enterprise-scale ETL pipeline presents unique challenges that go beyond traditional data engineering problems. Our Apache Airflow-orchestrated system was responsible for creating thousands of datasets with complex lineage relationships, but we faced a critical issue: the daily SLO targets for ETL process completion were consistently missed.
The root cause of our performance problems wasn't technical infrastructure limitations. It was a human factor.
The code and lineage for each dataset were produced by users of varying data engineering skill levels across the organization. This democratized approach to data transformation led to several systemic problems:
- Dataset duplication across teams without awareness of existing solutions
- Incorrect dependency relationships causing unnecessary bottlenecks
- Suboptimal SQL code consuming excessive compute resources and time
- Insufficient testing, leading to frequent job failures
- Poor data architecture creating serial processing chains where parallel execution was possible
We needed a tool that would analyze the dynamically generated daily processing DAG, identify these problems as users submitted their SQL transform code to the ETL system, and resolve the SLO misses. The solution would need to be automated, repeatable, and intelligent enough to handle optimization decisions at scale.
Graph Theory Meets Data Engineering
The first step in solving this optimization challenge was understanding the existing data architecture, and for that, we needed data about the data. The ETL system's PostgreSQL database contained comprehensive metadata about each dataset, including schema definitions, parent-child lineage relationships, average runtime statistics, ownership information, and the actual SQL transformation code. This metadata was accessible through the ETL system's Django APIs, providing a perfect foundation for MLOps integration.
Since a Directed Acyclic Graph (DAG) is fundamentally a graph theory concept, I leveraged Python's networkx library to create a graph representation of our ETL pipeline. This approach allowed for detailed analysis of the pipeline's structure:
- Nodes: Individual datasets
- Edges: Dependencies with attributes including runtime, failure rates, domain classification, ownership, and calculated weights
- Edge case handling: Self-joins managed by creating duplicate nodes with single edges back to originals
Interestingly, the ETL DAG wasn't strictly acyclic, because some datasets performed self-joins, which would introduce self-loops. This was easily handled by creating duplicate nodes with single edges back to the originals, maintaining the acyclic graph structure while preserving the logical relationships.
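The construction above can be sketched as follows. This is a minimal illustration, not the production code: the metadata field names (`name`, `runtime`, `failure_rate`, `parents`) and the `__self` suffix for duplicate nodes are assumptions.

```python
import networkx as nx

def build_lineage_graph(datasets):
    """Build a directed lineage graph from ETL metadata records.

    `datasets` is assumed to be a list of dicts pulled from the ETL
    system's metadata API, each naming its parent datasets. A parent
    equal to the dataset itself represents a self-join.
    """
    g = nx.DiGraph()
    for ds in datasets:
        g.add_node(ds["name"], runtime=ds["runtime"],
                   failure_rate=ds["failure_rate"])
    for ds in datasets:
        for parent in ds["parents"]:
            if parent == ds["name"]:
                # Self-join: a self-loop would break acyclicity, so add
                # a duplicate node with a single edge back to the original.
                dup = f"{parent}__self"
                g.add_node(dup, duplicate_of=parent)
                g.add_edge(dup, ds["name"])
            else:
                g.add_edge(parent, ds["name"])
    return g
```

The duplicate-node trick keeps every networkx DAG algorithm (topological sort, path enumeration, centrality) usable without special-casing self-referencing datasets.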
Micro vs. Macro Optimization Strategies
Once the graph was created, I developed two complementary optimization approaches to address different types of performance bottlenecks.
Micro-optimization focused on identifying individual datasets requiring optimization. To identify potential targets, I employed a variety of graph centrality algorithms in networkx, including closeness centrality (datasets with short paths to all other nodes), betweenness centrality (datasets frequently appearing in shortest paths between other nodes), and Katz centrality (measuring influence based on walks of all lengths).
These metrics revealed critical datasets that, due to their high number of dependent child datasets or cross-organizational importance (such as datasets shared between advertising and finance teams), became prime candidates for individual optimization efforts.
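A sketch of how these centrality measures can be combined into a candidate list is below. The choice of measures matches the text, but the `top_n` cutoff and the union-of-top-scorers heuristic are illustrative assumptions, not the exact production thresholds.

```python
import networkx as nx

def centrality_candidates(g, top_n=5):
    """Rank datasets by several centrality measures and return the
    union of top scorers as micro-optimization candidates."""
    measures = {
        "closeness": nx.closeness_centrality(g),
        "betweenness": nx.betweenness_centrality(g),
        # Katz converges for any alpha on a DAG (nilpotent adjacency matrix)
        "katz": nx.katz_centrality(g, alpha=0.01, max_iter=5000),
    }
    candidates = set()
    for scores in measures.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        candidates.update(ranked[:top_n])
    return candidates
```

Taking the union rather than the intersection casts a wider net: a dataset that scores highly on any one measure (say, a hub between two organizations) is worth a look even if the other measures miss it.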
Macro-optimization addressed system-wide performance by identifying the most problematic execution paths. I mapped complete dependency chains from all final/terminal datasets (those with no children) back to all of their source datasets, then created a synthetic weight metric combining execution time, failure rates, node centrality, and degree measurements.
By analyzing the "heaviest" dataset paths using cumulative calculated weights, I could cross-reference paths to find common offending nodes that contributed substantially to the overall ETL DAG execution time. The most effective approach was looking for nodes in the pool of heaviest dataset paths with limited degrees, which identified several large serially executing chains of datasets.
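The heaviest-path analysis can be sketched like this. The synthetic weighting formula below is an illustrative stand-in for the production metric; only the ingredients (runtime, failure rate, centrality, degree) come from the text.

```python
import networkx as nx
from itertools import product

def heaviest_paths(g, top_n=3):
    """Score every source-to-terminal dependency chain by a synthetic
    per-node weight and return the heaviest chains."""
    betweenness = nx.betweenness_centrality(g)

    def node_weight(n):
        data = g.nodes[n]
        # Illustrative combination: long-running, failure-prone, central,
        # highly connected nodes all inflate a path's weight.
        return (data.get("runtime", 0)
                * (1 + data.get("failure_rate", 0))
                * (1 + betweenness[n])
                * (1 + g.degree(n)))

    sources = [n for n in g if g.in_degree(n) == 0]
    terminals = [n for n in g if g.out_degree(n) == 0]
    scored = []
    for s, t in product(sources, terminals):
        for path in nx.all_simple_paths(g, s, t):
            scored.append((sum(node_weight(n) for n in path), path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]
```

Cross-referencing the returned paths for shared low-degree nodes is what surfaces the long serial chains: a node that appears in many heavy paths but has few edges of its own is a choke point rather than a legitimate fan-out.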
AI-Powered Automation at Scale
Given the scale of the optimization challenge (thousands of datasets requiring deep analysis and potential modification), manual intervention was impractical. The solution needed to be automated, repeatable, and intelligent enough to handle the nuanced decisions required for both micro and macro optimizations.
I implemented a multi-agent system using AWS Bedrock with Claude 4.5 as the underlying LLM, designed with specialized agents for different optimization types. The micro-optimization agent focused on SQL query optimization and individual dataset performance improvements, while the macro-optimization agent concentrated on data architecture decisions and could delegate SQL optimization tasks to the micro agent as needed.
Initially, I tested using a single model with a more elaborate prompt that attempted to handle both optimization types. However, I quickly discovered that it was more effective to utilize a dedicated model for each optimization type with tighter, more focused prompts for each specific domain. This approach provided better results and more consistent optimization quality.
- Micro-optimization agent: Focused on SQL query optimization and individual dataset performance improvements
- Macro-optimization agent: Concentrated on data architecture decisions and could delegate SQL optimization tasks to the micro agent as needed
- Agentic routing: Ensured tasks reached the appropriate specialist agent
Agentic routing ensured tasks reached the appropriate specialist agent, creating a single entry point that would allow additional functionality to be added later without requiring new endpoints. For example, I could easily tie in the newly created Data Quality GenAI product to auto-create quality expectations for the dataset by adding it to the agentic routing.
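In production this routing decision is made by the Bedrock-managed supervisor, but the idea can be sketched in plain Python: one entry point, one classification step, and new specialists added without new endpoints. The keyword heuristic and agent names below are purely illustrative assumptions.

```python
def route_task(task: dict) -> str:
    """Sketch of agentic routing: classify a task and dispatch it to
    the appropriate specialist agent. A real supervisor would make this
    decision with an LLM, not keywords."""
    text = task.get("description", "").lower()
    # New capabilities slot in as extra routes, no new endpoint required.
    if "data quality" in text:
        return "data-quality-agent"
    macro_signals = ("dependency chain", "serial", "restructure", "architecture")
    if any(signal in text for signal in macro_signals):
        return "macro-optimizer"
    return "micro-optimizer"
```

The single entry point is what makes later additions, like the Data Quality GenAI integration mentioned above, a routing change rather than an API change.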
Several critical guardrails were implemented to ensure safe and compliant operations. SOX-compliant datasets were marked as immutable. Table schemas were treated as inviolable contracts that could not be modified. The system had full access to dataset metadata and system documentation through RAG integration (AWS Kendra), MCP, and API calls.
Most importantly, all changes were implemented as QA versions requiring manual approval before production deployment. While thorough testing was conducted, the potential for automated SQL changes to trigger petabyte-scale data backfills made manual approval a necessary safety measure.
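The guardrail-and-QA flow can be sketched as a pre-flight check that every proposed change passes through. Field names (`sox_compliant`, `schema`) and the `create_qa_version` callable are assumptions for illustration, not the production schema.

```python
def apply_guardrails(dataset_meta: dict, proposed_change: dict) -> list:
    """Return the list of guardrail violations for a proposed change."""
    violations = []
    if dataset_meta.get("sox_compliant"):
        violations.append("SOX-compliant dataset is immutable")
    if proposed_change.get("schema") not in (None, dataset_meta.get("schema")):
        violations.append("table schema is an inviolable contract")
    return violations

def submit_change(dataset_meta, proposed_change, create_qa_version):
    """Changes that pass guardrails become QA versions awaiting manual
    approval; nothing is deployed straight to production."""
    violations = apply_guardrails(dataset_meta, proposed_change)
    if violations:
        return {"status": "rejected", "violations": violations}
    return {"status": "pending_approval",
            "qa_version": create_qa_version(proposed_change)}
```

Keeping the guardrails as a separate, deterministic layer means the LLM agents never have to be trusted to enforce compliance themselves.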
Production Results: 77% Performance Improvement
- Identified a 14-step serial execution chain causing major delays
- Achieved 77% performance improvement on the critical bottleneck through combined SQL optimization and dataset restructuring
- Reduced daily ETL processing window by 4 hours
- Enabled parallel processing where previously serial execution was required
The macro-optimization implementation was remarkably successful on its first production pass. The system identified a 14-step serial execution chain that was a major source of time delays in our daily ETL processing window.
Fixing this chain required a combination of optimizing individual SQL transformation logic and strategically breaking up datasets to enable parallelization and recombination. The result was a 77% performance improvement that reduced the daily ETL processing window by 4 hours.
The micro-optimization implementation was similarly successful, though it highlighted the need for shared memory between the two services: some macro targets were also micro targets, so an "air traffic control" coordination layer will be needed to keep the two methods from overlapping on the same datasets.
This success demonstrated the effectiveness of combining graph theory analysis with AI-powered optimization. The system could identify bottlenecks that would have been extremely difficult to spot manually across thousands of interconnected datasets, then generate targeted solutions that addressed both the immediate performance issues and the underlying architectural problems.
Infrastructure and LLMOps Architecture
Building a production-ready AI optimization system at enterprise scale required careful consideration of infrastructure, observability, and operational practices. The original POC and development were done locally using LangChain/LangGraph, but we chose a Bedrock-native solution for productionization because of the simplicity and low-code nature of the offering.
The architecture needed to handle real-time event processing, model orchestration, and seamless integration with existing ETL systems while maintaining security and compliance standards.
- EventBridge: Captures events for published SQL changes to datasets and daily dynamically generated DAG updates
- Step Functions: Orchestrates event handling from EventBridge triggers with native Bedrock integration for routing decisions
- Bedrock Integration: Evaluates routing decisions and instantiates the correct optimizer model with appropriate prompts
- Bedrock Managed Supervisor: Handles agent-to-agent query routing when macro optimization tasks need micro optimization execution
- Lambda Functions: Processes the optimizer outputs and integrates with the ETL system APIs to create QA dataset versions
- DynamoDB: Provides persistent storage for optimization metadata and system state
The event-driven architecture ensures that optimization processes trigger automatically when datasets are modified or when the daily DAG generation completes. Step Functions provide native Bedrock integration, eliminating the need for custom API handling while maintaining robust error handling and retry logic.
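The Lambda stage of this flow can be sketched as below. The event shape and the `create_qa_version` callable (wrapping the ETL system's Django API) are assumptions for illustration; the real handler deals with the actual Step Functions state payload.

```python
import json

def handler(event, context, create_qa_version=None):
    """Sketch of the Lambda that consumes optimizer output from the
    Step Functions state and creates a QA dataset version via the ETL
    system's API. Injecting `create_qa_version` keeps the API client
    swappable (and testable)."""
    optimization = event["optimizer_output"]  # produced by the Bedrock agent step
    payload = {
        "dataset": optimization["dataset"],
        "sql": optimization["optimized_sql"],
        "notes": optimization.get("rationale", ""),
    }
    qa_id = create_qa_version(payload)  # POST to the ETL Django API
    return {
        "statusCode": 200,
        "body": json.dumps({"qa_version": qa_id,
                            "status": "pending_approval"}),
    }
```

Because the handler only ever creates QA versions, a bug here cannot push an unreviewed change to production; the manual-approval gate described earlier sits downstream of this function.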
All infrastructure is defined as code using AWS CDK, enabling version-controlled, repeatable deployments across environments. CI/CD pipelines manage the complete infrastructure lifecycle, including IAM roles, security policies, and resource configurations.
Observability is critical for LLMOps, so CloudTrail provides comprehensive API call tracing for audit and debugging purposes, while CloudWatch captures operational metrics including model invocation rates, processing times, and error rates. These metrics are surfaced through Grafana dashboards, providing real-time visibility into system performance and optimization effectiveness.
The architecture supports horizontal scaling through serverless components, automatically handling varying workloads as the number of datasets and optimization requests fluctuates throughout the day. This approach ensures cost efficiency while maintaining the responsiveness needed for production ETL operations.
Key Insights for Enterprise Data Engineering
This project revealed several important principles for scaling data engineering operations in enterprise environments.
Graph theory applications: Complex data lineage problems can be effectively modeled and analyzed using traditional computer science algorithms. The combination of centrality measures and path analysis provided insights that would have been impossible to achieve through manual inspection.
Multi-agent AI systems: Specialized agents can handle different aspects of optimization more effectively than monolithic approaches. The separation between micro and macro optimization allowed each agent to focus on its domain of expertise while maintaining coordination through the routing system.
Safety-first automation: Even with sophisticated AI, critical safeguards and human oversight remain essential for production systems. The QA versioning approach provided the confidence needed to deploy automated optimizations in a high-stakes environment.
Holistic optimization: Combining micro and macro optimization strategies yields better results than focusing on individual components alone. The serial execution chain that provided our biggest win would never have been identified through individual dataset optimization.
The democratization of data transformation tools creates new challenges that require systematic solutions. As organizations scale their data operations, the combination of graph analysis and AI-powered optimization provides a path forward for managing complexity while maintaining performance and reliability.
This case study represents a real-world application of AI and graph theory to solve enterprise-scale data engineering challenges. The combination of traditional computer science algorithms with modern AI capabilities offers a powerful approach to managing complex, large-scale data systems that will only become more relevant as data volumes continue to grow.