Moore’s Law needs a hug. The days of stuffing transistors onto tiny silicon computer chips are numbered, and their life rafts, hardware accelerators, come with a price.
When programming an accelerator (a process in which applications offload certain tasks to system hardware, specifically to speed those tasks up), you have to build a whole new layer of supporting software. Hardware accelerators can run certain tasks orders of magnitude faster than CPUs, but they cannot be used out of the box. Software has to use the accelerator’s instructions efficiently and remain compatible with the rest of the application system. This translates into a lot of engineering work that then has to be maintained for every new chip you compile code for, in any programming language.
Now, scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created a new programming language called “Exo” for writing high-performance code on hardware accelerators. Exo helps low-level performance engineers transform very simple programs that specify what they want to compute into very complex programs that do the same thing as the specification, but much faster, by using these special accelerator chips. For example, engineers can use Exo to turn a simple matrix multiplication into a more complex program that runs orders of magnitude faster on these accelerators.
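The relationship between a simple specification and a restructured program that computes the same thing can be sketched in plain Python. This is not Exo syntax, and the names `matmul_spec`, `matmul_tiled`, and `tile` are illustrative assumptions; loop tiling is used here only as one familiar example of such a restructuring, and this Python version is not itself faster.

```python
def matmul_spec(A, B):
    """Simple specification: C[i][j] is the sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, restructured to walk the matrices block by block."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Work inside one tile; min() handles sizes that are not
                # a multiple of the tile width.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

# The restructured program must agree with the specification.
assert matmul_spec(A, B) == matmul_tiled(A, B)
```

On real accelerators, the restructured version would also be mapped onto hardware-specific instructions, which is where the speedup comes from; the sketch only shows that the two forms stay equivalent.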
Unlike other programming languages and compilers, Exo is built around a concept called “Exocompilation.” “Traditionally, a lot of research has focused on automating the optimization process for specific hardware,” says Yuka Ikarashi, a PhD student in electrical engineering and computer science, a CSAIL affiliate, and lead author of the new paper on Exo. “This is great for most programmers, but for performance engineers, the compiler frequently gets in the way. Because the compiler’s optimization is automatic, there’s no good way to fix it when it does the wrong thing and gives you 45 percent efficiency instead of 90 percent.”
With Exocompilation, the performance engineer is back in the driver’s seat. Responsibility for choosing which optimizations to apply, when, and in what order is externalized from the compiler and handed back to the performance engineer. This way, they don’t have to spend time fighting the compiler on the one hand, or doing everything manually on the other. At the same time, Exo takes responsibility for ensuring that all of these optimizations are correct. As a result, performance engineers can spend their time improving performance, rather than debugging complex, optimized code.
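The division of labor described above, where the engineer picks the rewrites and the system checks each one, can be illustrated with a toy model. Everything here is a hypothetical sketch and not Exo’s actual API: `Loop`, `split`, and `total_iterations` are made-up names, and the only correctness check is that a split must not change the number of iterations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    """Toy loop nest: an ordered tuple of (variable name, extent) pairs."""
    dims: tuple


def split(loop, var, factor):
    """Rewrite loop `var` into an outer/inner pair.

    The rewrite is refused if it would change what the loop nest computes,
    which in this toy model means the extent must be divisible by `factor`.
    """
    new_dims = []
    found = False
    for name, extent in loop.dims:
        if name == var:
            found = True
            if extent % factor != 0:
                raise ValueError(
                    f"cannot split {var}: {extent} is not divisible by {factor}"
                )
            new_dims.append((name + "_outer", extent // factor))
            new_dims.append((name + "_inner", factor))
        else:
            new_dims.append((name, extent))
    if not found:
        raise ValueError(f"no loop named {var}")
    return Loop(tuple(new_dims))


def total_iterations(loop):
    """Number of times the innermost body runs."""
    count = 1
    for _, extent in loop.dims:
        count *= extent
    return count


# The engineer decides which rewrites to apply and in what order.
spec = Loop((("i", 8), ("j", 6)))
scheduled = split(spec, "i", 4)
scheduled = split(scheduled, "j", 3)
assert total_iterations(scheduled) == total_iterations(spec)

# A rewrite that would break equivalence is rejected instead of applied.
try:
    split(spec, "j", 4)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

The point of the design is that each step is small and checked on its own, so a long sequence of engineer-chosen rewrites stays equivalent to the original program.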
“The Exo language is a compiler that is parameterized over the hardware it targets; the same compiler can adapt to many different hardware accelerators,” says Adrian Sampson, assistant professor in the Department of Computer Science at Cornell University. “Instead of writing a bunch of messy C++ code to compile for a new accelerator, Exo gives you an abstract, uniform way to write down the ‘shape’ of the hardware you want to target. You can then reuse the existing Exo compiler to adapt to that new description instead of writing something completely new from scratch. The potential impact of work like this is enormous: If hardware innovators could stop worrying about the cost of developing a new compiler for every new hardware idea, they could try out more ideas. The industry could break its reliance on legacy hardware that succeeds only because of its ecosystem and despite its inefficiencies.”
The highest-performance computer chips produced today, such as Google’s TPU, Apple’s Neural Engine, or NVIDIA’s Tensor Cores, power scientific computing and machine learning applications by accelerating something called “critical subroutines,” kernels, or high-performance computing (HPC) subroutines.
Confusing jargon aside, these programs are essential. For example, something called Basic Linear Algebra Subprograms (BLAS) is a “library,” or collection, of such subroutines dedicated to linear algebra operations, and it enables many machine learning tasks such as neural networks, as well as weather forecasting, cloud computing, and drug discovery. (BLAS is so important that it won Jack Dongarra the Turing Award in 2021.) However, these new chips, which take hundreds of engineers to design, are only as good as these HPC software libraries allow.
For now, however, this kind of performance optimization is still done by hand to ensure that every last compute cycle on these chips is used. HPC subroutines regularly run at 90 percent or more of theoretical peak performance, and hardware engineers go to great lengths to add an extra 5 or 10 percent of speed to these theoretical peaks. So if the software isn’t aggressively optimized, all that hard work goes to waste, which is exactly what Exo helps avoid.
Another important part of Exocompilation is that performance engineers can describe the new chips they want to optimize for without having to modify the compiler. Traditionally, the definition of the hardware interface has been maintained by compiler developers, but with most of these new accelerators, the hardware interface is proprietary. Companies must maintain their own copy (fork) of an entire traditional compiler, modified to support their particular chip. This requires hiring teams of compiler developers in addition to performance engineers.
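One way to picture a hardware description that lives outside the compiler is as ordinary data passed into a generic code generator. The sketch below is purely illustrative and does not reflect how Exo represents hardware: the `Instruction` class, the `accel_vec_add4` C function, and `lower_vector_add` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """User-supplied description of one accelerator instruction (hypothetical)."""
    name: str
    c_template: str  # C text to emit for one call
    width: int       # elements processed per call


# Defined by the chip vendor or performance engineer, not by the code generator.
VEC_ADD4 = Instruction(
    name="vec_add4",
    c_template="accel_vec_add4({dst}, {a}, {b});",
    width=4,
)


def lower_vector_add(n, instr, dst="dst", a="a", b="b"):
    """Generic generator: emits calls to whatever instruction it is given.

    It contains no chip-specific knowledge; it only checks that the
    instruction's width evenly covers the n elements to be added.
    """
    if n % instr.width != 0:
        raise ValueError(f"{n} elements cannot be covered by width {instr.width}")
    lines = []
    for offset in range(0, n, instr.width):
        lines.append(
            instr.c_template.format(
                dst=f"{dst} + {offset}", a=f"{a} + {offset}", b=f"{b} + {offset}"
            )
        )
    return "\n".join(lines)


generated = lower_vector_add(8, VEC_ADD4)
print(generated)
```

Supporting a different chip in this toy setup means writing a new `Instruction` value, while `lower_vector_add` stays unchanged, which mirrors the separation between open-source compiler and proprietary hardware code.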
“In Exo, we instead externalize the definition of hardware-specific backends from the exocompiler. This gives us a better separation between Exo, which is an open-source project, and hardware-specific code, which is often proprietary. We have shown that we can use Exo to quickly write code that performs as well as Intel’s hand-optimized Math Kernel Library. We are actively working with engineers and researchers at several companies,” says Gilbert Bernstein, a postdoc at the University of California at Berkeley.
The future of Exo entails exploring a more productive scheduling meta-language and extending its semantics to support parallel programming models, so that it can be applied to even more accelerators, including GPUs.
Ikarashi and Bernstein wrote this paper with Alex Reinking and Hasan Genc, both PhD students at UC Berkeley, and MIT Assistant Professor Jonathan Ragan-Kelley.
This work was supported in part by the Applications Driving Architectures (ADA) center, one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by the Defense Advanced Research Projects Agency. Ikarashi is supported by the Funai Overseas Scholarship, the Masason Foundation, and the Great Educator Scholarship. The team presented the work at the ACM SIGPLAN Conference on Programming Language Design and Implementation 2022.