Publication (PhD Thesis)

Automatic Runtime Performance Optimization Methodology Applied to Dataflow-based Hyperspectral Imaging Cancer Detection Algorithms

Lazcano López, Raquel
Nowadays, the increasing complexity of algorithms, the growth of the volumes of information to process and the tightening of performance requirements have contributed to the need for multicore architectures and parallel processing. These challenges have widened the gap between application development productivity and platform usage complexity. In the last few years, many methodologies have been proposed to close this productivity gap, such as the Y-chart strategy, dataflow-based design and automatic parallelization techniques. The Y-chart strategy isolates the application from the architecture, modeling each one separately and merging them afterwards under a list of user-defined constraints; dataflow-based design models applications using dataflow Models of Computation (MoCs), which are specifically well suited for exposing their structural parallelism; automatic parallelization techniques parallelize applications with very little effort on the developer's side. The combination of the first two resulted in a methodology where dataflow-based applications are modeled as a set of tasks (or actors) exchanging pieces of information (or tokens); these tasks can be automatically deployed onto multicore platforms, mapped and scheduled so that they make use of all the available resources of the target architecture. Nonetheless, this process is still arduous for developers: since the computational load of actors may vary considerably, conceiving an efficient mapping and scheduling becomes an iterative process. Moreover, these actors are the functional units of the models and are thus treated as black boxes (primitives of the system), so balancing their internal computational load has remained an open issue.
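The actor/token model described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not PREESM's actual API: actors are black-box functions attached to FIFO channels, and an actor fires only when enough tokens are available on its inputs.

```python
# Minimal sketch of a dataflow Model of Computation (hypothetical names,
# not PREESM's real interface): actors consume tokens from input FIFOs
# and produce tokens on output FIFOs each time they fire.
from collections import deque

class Fifo:
    def __init__(self):
        self.tokens = deque()
    def push(self, t): self.tokens.append(t)
    def pop(self): return self.tokens.popleft()
    def count(self): return len(self.tokens)

class Actor:
    def __init__(self, name, inputs, outputs, rate, fire):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        self.rate = rate          # tokens consumed per input at each firing
        self.fire = fire          # black-box computation (the "primitive")
    def ready(self):
        return all(f.count() >= self.rate for f in self.inputs)
    def step(self):
        args = [[f.pop() for _ in range(self.rate)] for f in self.inputs]
        for out in self.outputs:
            for t in self.fire(*args):
                out.push(t)

# A source FIFO feeding a "doubler" actor
src_out = Fifo()
dbl_out = Fifo()
for v in (1, 2, 3):
    src_out.push(v)
doubler = Actor("dbl", [src_out], [dbl_out], rate=1,
                fire=lambda xs: [2 * x for x in xs])
while doubler.ready():
    doubler.step()
print(list(dbl_out.tokens))  # [2, 4, 6]
```

A mapper/scheduler such as PREESM's can then assign ready actors to different cores, since firings with no token dependency between them are independent.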
To tackle this situation, this PhD aims at devising a methodology to further optimize application throughput by combining these three approaches, making it usable with a wide range of applications and simplifying the process of parallelizing and deploying them onto multicore architectures. To do so, the methodology proposed in this document combines a Y-chart dataflow-based design with the optimization potential offered by the polyhedral model, a mathematical framework that applies aggressive loop transformations to maximize application performance. As there is nowadays a trend to consider reactiveness to runtime changes a requirement for many applications, the enhanced methodology must ensure that this requirement is fulfilled. Hence, two frameworks are selected as cornerstones of this methodology: the Parallel Real-time Embedded Executives Scheduling Method (PREESM), together with its dynamic counterpart, the Synchronous Parameterized and Interfaced Dataflow Embedded Runtime (SPiDER), from the Y-chart dataflow-based domain, and the Automatic Speculative Polyhedral Loop Optimizer (Apollo) from the polyhedral one. These frameworks are chosen because they guarantee this runtime reactiveness: PREESM thanks to SPiDER, and Apollo because it is one of the few tools in the state of the art that successfully extends the polyhedral model to runtime. The combination of the dataflow and polyhedral domains creates, in turn, a broad spectrum of new opportunities: the characteristics intrinsic to dataflow MoCs can be embedded within Apollo to (i) make it dataflow-compliant, and (ii) explore the new optimization possibilities these characteristics enable, which were out of reach until now.
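To make the polyhedral model concrete, the sketch below shows one classic polyhedral-style transformation, loop tiling: the iteration domain {(i, j) : 0 ≤ i, j < N} is traversed block by block instead of row by row, which improves data locality without changing the set of executed iterations. This is a generic illustration, not Apollo's internal machinery.

```python
# Loop tiling sketch: the tiled nest visits exactly the same (i, j)
# iterations as the original nest, only in a block-by-block order.
N, T = 8, 4                       # problem size and tile size (illustrative)
A = [[0] * N for _ in range(N)]
B = [[0] * N for _ in range(N)]

# Original loop nest
for i in range(N):
    for j in range(N):
        A[i][j] = i * N + j

# Tiled loop nest over TxT blocks of the same iteration domain
for ii in range(0, N, T):
    for jj in range(0, N, T):
        for i in range(ii, min(ii + T, N)):
            for j in range(jj, min(jj + T, N)):
                B[i][j] = i * N + j

print(A == B)  # True: same iterations, different schedule
```

A polyhedral optimizer chooses and legality-checks such reorderings automatically; Apollo does so speculatively at runtime, when the loop bounds and memory accesses are not statically analyzable.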
As a result, Apollo is extended to include these two features, providing its runtime manager with a multiversioning system to evaluate several optimizing transformations and choose the best one depending on the application context, which in turn is defined by features such as the volume of data to process or the target architecture. After modifying and extending the automatic optimization framework with these new functionalities, it needs to be immersed into the dataflow design environment so as to join their parallelization capabilities. Hence, Apollo has been embedded within both PREESM and SPiDER, assessing how to balance the parallelization potential of both frameworks to make the most of them. By doing so, new optimizations are enabled inside actors, exploiting intra-actor optimization opportunities together with inter-actor parallelism, i.e., data parallelism. Since it is desirable for the methodology to be applicable to a broad range of applications, a heterogeneous benchmark suite is devised to validate it with applications from different domains. This suite is built by combining some of the benchmark suites most traditionally used in both the polyhedral and dataflow domains. From the polyhedral domain, PolyBench and ApolloBench are chosen: the former because it is the most widely used benchmark suite in the community, and the latter because it was the one used by Apollo's developers to originally validate the tool, and it contains several features not represented within PolyBench. From the dataflow domain, due to the lack of uniformity in dataflow benchmark suites, the benchmarks chosen are three of the applications most widely used by the PREESM and SPiDER developers when testing new functionalities.
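The multiversioning idea can be sketched as follows. This is a simplified, hypothetical model (function and variable names are illustrative, not Apollo's interface): several variants of the same kernel coexist, a short profiling phase times each one on the current context, and the winner is reused for all subsequent iterations, amortizing the initial overhead.

```python
# Multiversioning sketch: profile each kernel version once on the
# current data, then reuse the fastest one for the remaining iterations.
import time

def v_naive(xs):                    # version 1: explicit loop
    total = 0
    for x in xs:
        total += x * x
    return total

def v_builtin(xs):                  # version 2: generator + builtin sum
    return sum(x * x for x in xs)

VERSIONS = [v_naive, v_builtin]

def pick_best(versions, sample):
    best, best_t = None, float("inf")
    for v in versions:
        t0 = time.perf_counter()
        v(sample)                    # one profiled run per version
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = v, dt
    return best

data = list(range(1000))
kernel = pick_best(VERSIONS, data)   # initial overhead, paid once
results = [kernel(data) for _ in range(5)]  # amortized over iterations
```

Because dataflow applications fire the same actors repeatedly on streams of tokens, this one-off profiling cost is recovered over the many subsequent firings, which is precisely the amortization argument made in the evaluation below.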
The results obtained from the evaluation of the proposed methodology with the polyhedral benchmarks show that, as expected, the extended version of Apollo outperforms the original one by far for applications simulating a dataflow-based behavior, providing particularly good results for small problem sizes. When compared to Clang, the results demonstrate that the throughput is also improved, especially for the largest problem sizes, where maximum speedups of 59× and 161× are achieved for PolyBench and ApolloBench, respectively. This study also yields interesting results regarding the transformation selected for each case, proving that the transformation has a direct impact on the execution time: depending on the benchmark, problem size or target architecture, the same transformation may provide very good or very bad results, hence illustrating the benefits of a multiversioning system. This study also proves that using such a system comes at the cost of a non-negligible initial overhead, but that this overhead can be canceled out thanks to the iterative nature of dataflow applications. Additionally, the main objective of the evaluation of the methodology with the dataflow benchmarks is to assess the advantages and disadvantages of assigning the parallelization capabilities to one framework or the other. The three applications selected prove to be a good representation of this problem: one of them obtains better results when PREESM parallelizes (9× with PREESM, 4× with Apollo and 6× with the combination, always compared to Clang); another only obtains speedups when Apollo is in charge (around 5×); finally, the last one is much faster when the optimizations come from both frameworks (14× with PREESM, 75× with Apollo and almost 100× with the combination).
The latter is also used to functionally validate the changes added within SPiDER, proving that the proposed methodology successfully detects runtime changes and reconfigures the behavior of the multiversioning system accordingly, with speedups similar to those obtained statically (close to 100×). These two studies ensure the correctness of the proposed methodology, using to that end benchmark suites widely applied in the related domains; nonetheless, in order to guarantee that the proposed methodology fulfills the objectives for which it is envisioned and to characterize its applicability, a real-life application is implemented using this methodology, considering it as a real use case. This application has been extracted from a complex algorithm used to locate human brain tumor boundaries in real time during surgical procedures: specifically, the part of the algorithm selected to form this use case is the supervised classification stage of the hybrid classification algorithm specifically designed for this task, as it is the most time-consuming part of the algorithm. The results obtained show that, as expected, the largest speedups are reached when combining the optimization potential of both tools, with speedups close to 4× when compared to the sequential version. These results also prove to be competitive with those extracted from the state of the art, showing similar execution times while using fewer Processing Elements (PEs) and applying automatic parallelization techniques instead of manually exploiting the parallelism, which in turn involves a non-negligible increase in production times.
In conclusion, the methodology proposed in this PhD has proven successful in merging the parallelization potential of dataflow-based specifications with the optimization capabilities of automatic parallelization approaches, either exploiting the advantages of each domain separately when the other fails to increase the throughput, or profiting from both potentials in situations compatible with both worlds.
Type of Publication:
PhD Thesis
Eduardo Juárez and César Sanz