By Vishnu Vasanth on 18 May, 2024
The e6data Promise
Popular distributed compute engines like Spark, Databricks SQL (including the new vectorized Photon engine), and Trino / Presto (including Starburst) share some key commonalities:
1. They have a Coordinator (in Trino / Presto / Starburst) or a Driver (in Spark / Databricks) whose design is monolithic, stateful, and VM-centric.
2. They follow a centralised, static approach to distributed processing (i.e. task scheduling and coordination) and execution flow management (i.e. control flow and data flow).
The responsibilities of this centralised, non-scalable component include:
• All parts of a query's lifecycle, except for task execution.
• All aspects of distributed processing, such as scheduling tasks on nodes and coordinating between them.
• All aspects of infrastructure management, such as cluster load and health monitoring.
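To make the pattern concrete, here is a deliberately simplified Python sketch of a monolithic, stateful coordinator of the kind described above. It is illustrative only: the class and method names are hypothetical and do not correspond to Spark or Trino internals.

```python
from itertools import cycle

class Executor:
    """Stand-in for an executor node; it only runs tasks and answers health checks."""
    def run(self, task):
        print(f"running {task}")

    def heartbeat(self):
        return "OK"

class Coordinator:
    """One stateful process that owns every query's lifecycle except task execution."""
    def __init__(self, executors):
        self.executors = executors    # fixed, VM-centric set of workers
        self.query_state = {}         # state of ALL queries, in-progress and queued

    def submit(self, query_id, tasks):
        # Parsing, planning, and queueing happen in this same process...
        self.query_state[query_id] = {"tasks": tasks, "status": "QUEUED"}
        self.schedule(query_id)

    def schedule(self, query_id):
        # ...as does scheduling: tasks are pushed to executors up front,
        # based on a snapshot of cluster load taken before execution starts.
        for task, executor in zip(self.query_state[query_id]["tasks"],
                                  cycle(self.executors)):
            executor.run(task)
        self.query_state[query_id]["status"] = "RUNNING"

    def monitor_cluster(self):
        # Load and health monitoring compete for the same CPU and memory
        # as planning and scheduling, for every query in the system.
        return [e.heartbeat() for e in self.executors]

coordinator = Coordinator([Executor(), Executor()])
coordinator.submit("q1", ["scan", "filter", "aggregate"])
```

If this single process crashes, query_state, and with it every query in flight, is lost. That is exactly the failure mode described next.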
Data platform and data engineering teams operating at scale have found that such an approach has key challenges:
The Coordinator / Driver becomes a Single Point Of Failure (SPOF)
Problem with the status quo: Under high concurrency or complex query load, the central driver/coordinator node becomes the bottleneck and single point of failure.
This central component holds the state of all queries. Any coordinator crash therefore impacts all queries in the system, both in-progress and queued.
The various activities performed in this non-scalable central component compete for scarce resources. This is particularly pronounced under various forms of system load (such as high concurrency or complex queries) and scale (a large number of executors). The result is degraded performance for all queries in the system, which in turn may lead to failures in the central coordinator.
• The approach to distributed processing is static and centrally controlled: the coordinator allocates distributable tasks to individual executor nodes and coordinates the dependencies between those tasks. These decisions are made statically, prior to execution (i.e. based on an up-front view of cluster load). Such decisions are impossible to get right in a complex distributed system, since they cannot account for incorrect estimates caused by changes in load due to concurrency and/or query complexity, unequal load distribution due to data skew, and so on (see the sketch after this list).
• Another challenge arises from the centrally managed execution flow, which follows a pull-based implementation. Signalling for both the control flow and the data flow originates from the central component, i.e. the Coordinator / Driver. As outlined earlier, this central component is often overwhelmed under load, concurrency, and cluster scale. Under such conditions, we commonly find executors at low utilisation levels despite degradation in query latencies and throughput.
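The following minimal sketch (illustrative only; the task costs and executor names are made up) shows why a static, up-front mapping struggles with skew and cannot use capacity added after planning:

```python
# Illustrative only: a static, round-robin plan decided before execution.
task_costs = [1, 1, 1, 1, 10]                  # seconds; the 10s task is a skewed partition
executors_at_plan_time = ["exec-1", "exec-2"]

plan = {e: [] for e in executors_at_plan_time}
for i, cost in enumerate(task_costs):
    plan[executors_at_plan_time[i % len(executors_at_plan_time)]].append(cost)

# Query latency is gated by the most loaded executor, however idle the rest are,
# and an "exec-3" added by scale-out mid-query gets nothing: the mapping is fixed.
print({e: sum(costs) for e, costs in plan.items()})   # {'exec-1': 12, 'exec-2': 2}
```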
Several applications today require deterministic p90 / p95 / p99 latencies to be guaranteed. Under a static approach to distributed processing, newly added executors (the result of executor scale-out) do not benefit in-progress queries, since the mapping of distributable tasks to nodes is decided prior to the start of execution. Such an approach therefore makes it impossible to offer guarantees around latency and throughput, despite having the ability to scale out.
The bulk of our gains (e.g., in performance, efficiency, and scalability) is informed by the challenges we saw in the design and implementation of distributed compute engines like Spark and Trino / Presto.
Recognizing the challenges of a central, stateful component like the Coordinator / Driver in Presto / Spark, we opted for an architecture that is disaggregated, stateless, and Kubernetes-native. This lets us deploy and scale each component independently.
e6data Architecture: Disaggregated, Stateless with no Single Point of Failure
• Disaggregated: Each lightweight service serves a single purpose (or as close to it as possible).
• Stateless: Granular scalability is a first-order requirement, which meant making every component stateless wherever possible.
• Kubernetes-native: There is no tight coupling of components to physical nodes (as is the case with the Coordinator / Driver). Each service has independent resources and can be deployed and scaled independently of the others.
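As a rough illustration of the stateless principle (a conceptual sketch with hypothetical names, not our actual services), per-query state lives in a shared store rather than inside any one service instance, so replicas are interchangeable and can be scaled independently:

```python
class SharedStateStore:
    """Stand-in for an external store (e.g. a distributed KV store)."""
    def __init__(self):
        self._state = {}

    def put(self, query_id, state):
        self._state[query_id] = state

    def get(self, query_id):
        return self._state.get(query_id)


class PlannerReplica:
    """Single-purpose, stateless service: it keeps no query state of its own."""
    def __init__(self, store):
        self.store = store

    def plan(self, query_id, sql):
        tasks = [f"{query_id}-task-{i}" for i in range(4)]   # placeholder planning
        self.store.put(query_id, {"sql": sql, "tasks": tasks, "status": "PLANNED"})


store = SharedStateStore()
PlannerReplica(store).plan("q1", "SELECT 1")

# If the replica above disappears, any other replica can read the same state
# and continue, so there is no single stateful process whose crash loses queries.
print(store.get("q1")["status"])   # PLANNED
```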
e6data Distributed Processing: Decentralized + Dynamic approach to Task Scheduling & Coordination
• Decentralised approach to distributed processing: Decisions about which task runs on which executor resource (an executor pod, in our case) are made in a decentralised manner. As an executor pod has available resources, it picks up tasks from a global queue and commences execution (a simplified sketch of this pull-based model appears at the end of this post). This method also removes the need for centralized coordination between tasks. An added benefit of such an approach is that it is far less susceptible to data skew, leading to uniform utilization across executor resources. This, in turn, avoids idle time and under-utilization of executors.
• Dynamic approach to distributed processing: In contrast to the static (i.e. prior to execution) approach of allocating tasks to executor resources, our approach makes these decisions dynamically, on the fly. It benefits from a far more reliable view of the live state of executor load, and therefore removes the possibility of incorrect resource allocations leading to poor cluster utilization.
We consistently engage with demanding customers on cutting-edge use cases, pushing the boundaries of innovation. Our Lighthouse Customer Program is one such opportunity: a tailored program for enterprises with large, challenging use cases. You can read more about it here.
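To close, here is a deliberately simplified sketch of the decentralised, pull-based scheduling model described above (illustrative only: a thread stands in for an executor pod, and a plain in-process queue stands in for the global task queue). Each pod pulls a task only when it has free capacity, so a pod that is busy with a skewed, expensive task simply stops pulling while the others keep draining the queue.

```python
import queue
import threading
import time

# A global queue of runnable tasks; costs in seconds, with one skewed task.
global_task_queue = queue.Queue()
for cost in [0.05, 0.05, 0.05, 0.05, 0.5]:
    global_task_queue.put(cost)

def executor_pod(name, completed):
    done = 0
    while True:
        try:
            # Pull work only when this pod has free capacity (i.e. right now).
            cost = global_task_queue.get_nowait()
        except queue.Empty:
            break                      # nothing left to do; exit cleanly
        time.sleep(cost)               # simulate executing the task
        done += 1
    completed[name] = done

completed = {}
pods = [threading.Thread(target=executor_pod, args=(f"pod-{i}", completed))
        for i in range(3)]
for p in pods:
    p.start()
for p in pods:
    p.join()

# No central component assigned any task, yet the load spreads itself evenly.
print(completed)
```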