| Field | Value | Date |
|---|---|---|
| author | Razvan Lupusoru <razvan.lupusoru@gmail.com> | 2024-01-09 07:33:11 -0800 |
| committer | GitHub <noreply@github.com> | 2024-01-09 07:33:11 -0800 |
| commit | ab4af25d5dfaecf01e6c6e94dc79e7304321c376 | |
| tree | 892c5313e52d2465a36732f89cdbb38aa2af4b39 | |
| parent | 0242d27dc89ff19e331ae4945933cdb360c7d4cf | |

[acc] OpenACC dialect design philosophy and details (#75548)
This document captures the design philosophy of the acc dialect. It also
shares the rationale behind the design and implementation of various
operations - and ties that back to the dialect design goals.
Co-authored-by: Valentin Clement <clementval@gmail.com>
Co-authored-by: Slava Zakharin <szakharin@nvidia.com>
| Mode | File | Lines changed |
|---|---|---|
| -rwxr-xr-x | mlir/docs/Dialects/OpenACC.md | 449 |
| -rw-r--r-- | mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td | 8 |

2 files changed, 450 insertions, 7 deletions
diff --git a/mlir/docs/Dialects/OpenACC.md b/mlir/docs/Dialects/OpenACC.md
new file mode 100755
index 000000000000..da7d4be07e3e
--- /dev/null
+++ b/mlir/docs/Dialects/OpenACC.md
@@ -0,0 +1,449 @@
+The `acc` dialect is an MLIR dialect for representing the OpenACC
+programming model. OpenACC is a standardized directive-based model which
+is used with C, C++, and Fortran to enable programmers to expose
+parallelism in their code. The descriptive approach used by OpenACC
+allows targeting of parallel multicore and accelerator targets like GPUs
+by giving the compiler the freedom of how to parallelize for specific
+architectures. OpenACC also provides the ability to optimize the
+parallelism through increasingly more prescriptive clauses.
+
+This dialect models the constructs from the
+[OpenACC 3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf).
+
+This document describes the design of the OpenACC dialect in MLIR. It
+lists and explains design goals and design choices along with their
+rationale. It also describes specifics with regards to acc dialect
+operations, types, and attributes.
+
+[TOC]
+
+## Dialect Design Goals
+
+* Needs to have complete representation of the OpenACC language.
+  - A frontend requires this in order to properly generate a
+    representation of possible `acc` pragmas in MLIR. Additionally,
+    this dialect is expected to be further lowered when materializing
+    its semantics. Without a complete representation, a frontend might
+    choose a lower abstraction (such as a direct runtime call) - but this
+    would impact the ability to do analysis and optimizations on the
+    dialect.
+* Allow representation at the same semantic level as the OpenACC
+language while having capability to represent nuances of the source
+language semantics (such as Fortran descriptors) in an agnostic manner.
+  - Using abstractions that closely model the OpenACC language
+    simplifies frontend implementation. It also allows for easier
+    debugging of the IR. However, sometimes source language specific
+    behavior is needed when materializing OpenACC. In these cases, such
+    as privatization of C++ objects with default constructor, the
+    frontend fills in the `recipe` along with the `private` operation
+    which can be packaged neatly with the `acc` dialect operations.
+* Be able to regenerate the semantic equivalent of the user pragmas from
+the dialect (including bounds, names, clauses, modifiers, etc).
+  - This is a strong measure of making sure that the dialect is not
+    lossy in semantics. It also provides the capability to generate
+    appropriate and useful debug information outside of the frontend.
+* Be dialect agnostic so that it can be used and coexist with other
+dialects including but not limited to `hlfir`, `fir`, `llvm`, `cir`.
+  - Directive-based models such as OpenACC are always used with a
+    source language, so the `acc` dialect coexisting with other
+    dialect(s) is necessary by construction. Through proper
+    abstractions, neither the `acc` dialect nor the source language
+    dialect should have dependencies on each other; where needed,
+    interfaces should be used to ensure `acc` dialect can verify
+    expected properties.
+* The dialect must allow dataflow to be modeled accurately and
+performantly using MLIR's existing facilities.
+  - Appropriate dataflow modeling is important for analyses and IR
+    reasoning - even something as simple as walking the uses. Therefore
+    operations, like data operations, are expected to generate results
+    which can be used in modeling behavior. For example, consider an
+    `acc copyin` clause. After the `acc.copyin` operation, a pointer
+    which lives on the device should be distinguishable from one that lives
+    in host memory.
+* Be friendly to MLIR optimization passes by implementing common
+interfaces.
+  - Interfaces, such as `MemoryEffects`, are the key way MLIR
+    transformations and analyses are designed to interact with the IR.
+    In order for the operations in the `acc` dialect to be optimizable
+    (either directly or even indirectly by not blocking optimizations
+    of nested IR), implementing relevant common interfaces is needed.
+
+The design philosophy of the acc dialect is one where the design goals
+are adhered to. Current and planned operations, attributes, types must
+adhere to the design goals.
+
+## Operation Categories
+
+The OpenACC dialect includes both high-level operations (which retain
+the same semantic meaning as their OpenACC language equivalent),
+intermediate-level operations (which are used to decompose clauses
+from constructs), and low-level operations (to encode specifics
+associated with source language in a generic way).
+
+The high-level operations list contains the following OpenACC language
+constructs and their corresponding operations:
+* `acc parallel` → `acc.parallel`
+* `acc kernels` → `acc.kernels`
+* `acc serial` → `acc.serial`
+* `acc data` → `acc.data`
+* `acc loop` → `acc.loop`
+* `acc enter data` → `acc.enter_data`
+* `acc exit data` → `acc.exit_data`
+* `acc host_data` → `acc.host_data`
+* `acc init` → `acc.init`
+* `acc shutdown` → `acc.shutdown`
+* `acc update` → `acc.update`
+* `acc set` → `acc.set`
+* `acc wait` → `acc.wait`
+* `acc atomic read` → `acc.atomic.read`
+* `acc atomic write` → `acc.atomic.write`
+* `acc atomic update` → `acc.atomic.update`
+* `acc atomic capture` → `acc.atomic.capture`
+
+This second group contains operations which are used to represent
+either decomposed constructs or clauses for more accurate modeling:
+* `acc routine` → `acc.routine` + `acc.routine_info` attribute
+* `acc declare` → `acc.declare_enter` + `acc.declare_exit` or
+`acc.declare`
+* `acc {construct} copyin` → `acc.copyin` (before region) +
+`acc.delete` (after region)
+* `acc {construct} copy` → `acc.copyin` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} copyout` → `acc.create` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} attach` → `acc.attach` (before region) +
+`acc.detach` (after region)
+* `acc {construct} create` → `acc.create` (before region) +
+`acc.delete` (after region)
+* `acc {construct} present` → `acc.present` (before region) +
+`acc.delete` (after region)
+* `acc {construct} no_create` → `acc.nocreate` (before region) +
+`acc.delete` (after region)
+* `acc {construct} deviceptr` → `acc.deviceptr`
+* `acc {construct} private` → `acc.private`
+* `acc {construct} firstprivate` → `acc.firstprivate`
+* `acc {construct} reduction` → `acc.reduction`
+* `acc cache` → `acc.cache`
+* `acc update device` → `acc.update_device`
+* `acc update host` → `acc.update_host`
+* `acc host_data use_device` → `acc.use_device`
+* `acc declare device_resident` → `acc.declare_device_resident`
+* `acc declare link` → `acc.declare_link`
+* `acc exit data delete` → `acc.delete` (with `structured` flag as
+false)
+* `acc exit data detach` → `acc.detach` (with `structured` flag as
+false)
+* `acc {construct} {data_clause}(var[lb:ub])` → `acc.bounds`
+
+The low-level operations are:
+* `acc.private.recipe`
+* `acc.reduction.recipe`
+* `acc.firstprivate.recipe`
+* `acc.global_ctor`
+* `acc.global_dtor`
+* `acc.yield`
+* `acc.terminator`
+The semantics of the low-level operations, and the reasoning behind them,
+are further explained in the sections below.
+
+### Data Operations
+
+#### Data Clause Decomposition
+The data clauses are decomposed from their constructs for better
+dataflow modeling in MLIR. There are multiple reasons for this which
+are consistent with the dialect goals:
+* Correctly represents dataflow. Data clauses have different effects
+at entry to region and at exit from region.
+* Friendlier to add attributes such as `MemoryEffects` to a single
+operation. This can better reflect semantics (like the fact that an
+`acc.copyin` operation only reads host memory).
+* Operations can be moved or optimized individually (eg `CSE`).
+* Easier to keep track of debug information. Line location can point to
+the text representing the data clause instead of the construct.
+Additionally, attributes can be used to keep track of variable names in
+clauses without having to walk the IR tree in an attempt to recover the
+information (this makes the acc dialect more agnostic with regards to what
+other dialect it is used with).
+* Clear operation ordering since all data operations are on the same
+list.
+
+Each of the `acc` dialect data operations represents either the
+entry or the exit portion of the data action specification. Thus,
+`acc.copyin` represents the semantics defined in section
+`2.7.7 copyin clause` whose wording starts with
+`At entry to a region`. The decomposed exit operation `acc.delete`
+represents the second part of that section, whose wording starts with
+`At exit from the region`. The `delete` action may be performed
+after checking and updating the relevant reference counters.
+
+The `acc` data operations, even when decomposed, retain their original
+data clause in a `dataClause` field of the operation so that this
+information can be recovered during debugging. For example, `acc copy`
+does not translate to `acc.copy` operation, but instead to `acc.copyin`
+for entry and `acc.copyout` for exit. Both the decomposed operations
+hold a `dataClause` field that specifies this was an `acc copy`.
+
+The link between the decomposed entry and exit operations is the SSA
+value produced by the entry operation. Namely, it is the `accPtr` result
+which is used both in the `dataOperands` of the operation used for the
+construct and in the `accPtr` operand of the exit operation.
+
+#### Bounds
+
+OpenACC data clauses allow the use of bounds specifiers as per
+`2.7.1 Data Specification in Data Clauses`. However, array dimensions
+for the data are not always required in the clause if the source
+language's type system captures this information - the user can just
+specify the variable name in the data clause. So the `acc.bounds`
+operation is an important piece to ensure uniform representation of both
+explicit user set dimensions and implicit type-based dimensions. It
+contains several key features to allow properly encoding sizes in a
+manner flexible and agnostic to the source language's dialect:
+* Multi-dimensional arrays can be represented by using multiple ordered
+`acc.bounds` operations.
+* Bounds are required to be zero-normalized. This works well with the
+`PointerLikeType` requirement in data clauses - since a lowerbound of 0
+means looking at data at the zero offset from pointer. This requirement
+also works well in ensuring the `acc` dialect is agnostic to source
+language dialect since it prevents ambiguity such as the case of Fortran
+arrays where the lower bound is not a fixed value.
+* If the source dialect does not encode the dimensions in the type (eg
+`!fir.array<?x?xi32>`) but instead encodes it in some other way (such as
+through descriptors), then the frontend must fill in the `acc.bounds`
+operands with appropriate information (such as loads from descriptor).
+The `acc.bounds` operation also permits lossy source dialect, such
+as if the frontend uses aggressive pointer decay and cannot represent
+the dimensions in the type system (eg using `!llvm.ptr` for arrays).
+Both of these aspects show the `acc.bounds` operation's flexibility to
+allow the representation to be agnostic since the `acc` dialect is not
+expected to be able to understand how to extract dimension information
+from the types of the source dialect.
+* The OpenACC specification allows either extent or upperbound in the
+data clause depending on whether it is Fortran or C and C++. The
+`acc.bounds` operation is rich enough to accept either or both - for
+convenience in lowering to the dialect and for ability to precisely
+capture the meaning from the clause.
+* The stride, either in units or bytes, can also be captured in the
+`acc.bounds` operation. This is also an important part to be able to
+accept a source language's arrays without forcing the frontend to
+normalize them in some way. For example, consider a case where in a
+parent function, a whole array is mapped to device. Then only a view of
+a non-1 stride is passed to child function (eg Fortran array slice with
+non-1 stride). A `copy` operation of this data in child should be able
+to avoid remapping this array. If instead the operation required
+normalizing the array (such as making it contiguous), then unexpected
+disjoint mapping of the same host data would be error-prone since it
+would result in multiple mappings to device.
+
+#### Counters
+
+The data operations also maintain semantics described in the OpenACC
+specification related to runtime counters. More specifically, consider
+the specification of the entry portion of `acc copyin` in section 2.7.7:
+```
+At entry to a region, the structured reference counter is used. On an
+enter data directive, the dynamic reference counter is used.
+- If var is present and is not a null pointer, a present increment
+action with the appropriate reference counter is performed.
+- If var is not present, a copyin action with the appropriate reference
+counter is performed.
+- If var is a pointer reference, an attach action is performed.
+```
+The `acc.copyin` operation includes these semantics, including those
+related to attach, which is specified through the `varPtrPtr` operand.
+The `structured` flag on the operation is important since the
+`structured reference counter` should be used when the flag is true; and
+the `dynamic reference counter` should be used when it is false.
+
+At exit from structured regions (`acc data`, `acc kernels`), the
+`acc copyin` clause is decomposed to `acc.delete` (with the
+`structured` flag as true). The semantics of the `acc.delete` are
+also consistent with the OpenACC specification noted for the exit
+portion of the `acc copyin` clause:
+```
+At exit from the region:
+- If the structured reference counter for var is zero, no action is
+taken.
+- Otherwise, a detach action is performed if var is a pointer reference,
+and a present decrement action with the structured reference counter is
+performed if var is not a null pointer. If both structured and dynamic
+reference counters are zero, a delete action is performed.
+```
+
+### Types
+
+There are a few acc dialect type categories to describe:
+* type of acc data clause operation input `varPtr`
+  - The type of `varPtr` must be pointer-like. This is done by
+    attaching the `PointerLikeType` interface to the appropriate MLIR
+    type. Although the memory/storage concept is a lower-level abstraction,
+    it is useful because the OpenACC model distinguishes between host
+    and device memory explicitly - and the mapping between the two is
+    done through pointers. Thus, by explicitly requiring it in the
+    dialect, the appropriate language frontend must create storage or
+    use a type that satisfies the mapping constraint.
+* type of result of acc data clause operations
+  - The type of the acc data clause operation is exactly the same as
+    `varPtr`. This was done intentionally instead of introducing an
+    `acc.ref/ptr` type so that IR compatibility and the dialect's
+    existing strong type checking can be maintained. This is needed
+    since the `acc` dialect must live within another dialect whose type
+    system is unknown to it. The only constraint is that the appropriate
+    dialect type must use the `PointerLikeType` interface.
+* type of decomposed clauses
+  - Decomposed clauses, such as `acc.bounds` and `acc.declare_enter`
+    produce types to allow their results to be used only in specific
+    operations.
+
+### Recipes
+
+Recipes are a generic way to express source language specific semantics.
+
+There are currently two categories of recipes, but the recipe concept
+can be extended for any additional low-level information that needs
+to be captured for successful lowering of OpenACC. The two categories
+are:
+* recipes used in the context of privatization associated with a
+construct
+* recipes used in the context of additional specification of data
+semantics
+
+The intention of the recipes is to specify how materialization of
+action, such as privatization, should be done when the semantics
+of the action needs to be interpreted and lowered, such as before generating
+LLVM dialect.
+
+The recipes used for privatization provide a source-language independent
+way of specifying the creation of a local variable of that type. This
+means using the appropriate `alloca` instruction and being able to
+specify default initialization or default constructor.
+
+### Routine
+
+The routine directive is used to note that a procedure should be made
+available for the accelerator in a way that is consistent with its
+modifiers, such as those that describe the parallelism. In the acc
+dialect, an acc routine is represented through two joint pieces - an
+attribute and an operation:
+* The `acc.routine` operation is simply a specifier which notes which
+symbol (or string) the acc routine is needed for, along with parallelism
+associated. This defines a symbol that can be referenced in an attribute.
+* The `acc.routine_info` attribute is an attribute used on the source
+dialect specific operation which specifies one or multiple `acc.routine`
+symbols. Typically, this is attached to `func.func` which either
+provides the declaration (in case of externals) or provides the
+actual body of the acc routine in the dialect that the source language
+was translated to.
+
+### Declare
+
+OpenACC `declare` is a mechanism which declares a definition of a global
+or a local to be accessible to accelerator with an implicit lifetime
+as that of the scope where it was declared. Thus, `declare` semantics
+are represented through multiple operations and attributes:
+* `acc.declare` - This is a structured operation which contains an
+MLIR region and can be used in a similar manner as `acc.data` to specify
+an implicit data region with specific procedure lifetime. This is
+typically used inside `func.func` after variable declarations.
+* `acc.declare_enter` - This is an unstructured operation which is
+used as a decomposed form of `acc declare`. It effectively allows the
+entry operation to exist in a scope different than the exit operation.
+It can also be used along with `acc.declare_exit` which consumes its token
+to define a scoped region without using MLIR region. This operation is
+also used in `acc.global_ctor`.
+* `acc.declare_exit` - The matching equivalent of `acc.declare_enter`
+except that it specifies exit semantics. This operation is typically
+used inside a `func.func` at the exit points or with `acc.global_dtor`.
+* `acc.global_ctor` - Lives at the same level as source dialect globals
+and is used to specify data actions to be done at program entry. This
+is used in conjunction with source dialect globals whose lifetime is
+not just a single procedure.
+* `acc.global_dtor` - Defines the exit data actions that should be done
+at program exit. Typically used to revert the actions of
+`acc.global_ctor`.
+
+The attributes:
+* `acc.declare` - This is a facility for easier determination of
+variables which are `acc declare`'d. This attribute is used on
+operations producing globals and on operations producing locals such as
+dialect specific `alloca`'s. Having this attribute is required in order
+to appear in a data mapping operation associated with any of the
+`acc.declare*` operations.
+* `acc.declare_action` - Since the OpenACC specification allows
+declaration of variables that have yet to be allocated, this attribute
+is used at the allocation and deallocation points. More specifically,
+this attribute captures symbols of functions to be called to perform
+an action either pre-allocate, post-allocate, pre-deallocate, or
+post-deallocate. Calls to these functions should be materialized when
+lowering OpenACC semantics to ensure proper data actions are done
+after the allocation/deallocation.
+
+## OpenACC Transforms and Analyses
+
+The design goal for the `acc` dialect is to be friendly to MLIR
+optimization passes including CSE and LICM. Additionally, since it is
+designed to recover original clauses, it makes late verification and
+analysis possible in the MLIR framework outside of the frontend.
+
+This section describes a few MLIR-level passes for which the `acc`
+dialect design should be friendly. This section is currently
+solely outlining the possibilities intended by the design and not
+necessarily existing passes.
+
+### Verification
+
+Since the OpenACC dialect is not lossy with regards to its
+representation, it is possible to do OpenACC language semantic checking
+at the MLIR-level. What follows is a list of various semantic checks
+needed.
+
+This first list is required to be done in the frontend because the `acc`
+dialect operations must be valid when constructed:
+* Ensure that only listed clauses are allowed for each directive.
+* Ensure that only listed modifiers are allowed for each clause.
+
+However, the following are semantic checks that can be done at the
+MLIR-level (either in a separate pass or as part of the operation
+verifier):
+* Validate the requirements of each modifier (eg `num_gangs`
+may need a positive integer).
+* Ensure valid clause nesting.
+* Validate clause restrictions which cannot appear with others.
+* Validate that no conflicting clauses are used on variables.
+
+Note that some of these checks can be even more precise when done at the
+MLIR level because optimizations like inlining and constant propagation
+expose detail that wouldn't have been visible in the frontend.
+
+### Implicit Data Attributes
+
+The OpenACC specification includes a section on `2.6.2 Variables with
+Implicitly Determined Data Attributes`. What this section describes are
+the data actions that should be applied to a variable for which
+the user did not specify a data action. The action depends on the
+construct being used and also on the default clause. However, the point
+to note here is that variables which are live-in into the acc region
+must employ some data mapping so the data can be passed to accelerator.
+
+One possible optimization that affects which data attributes are needed is
+`Scalar Replacement of Aggregates (SROA)`. The `acc` dialect should
+not prevent this from happening on the source dialect.
+
+Because it is intended to be possible to apply optimizations across an
+`acc` region, the analysis/transformation pass that applies the implicit
+data attributes should be run as late as possible - ideally right before
+any outlining process which uses the `acc` region body to create an
+accelerator procedure. It is expected that existing MLIR facilities,
+such as `mlir::Liveness`, will work for the `acc` region and thus can be
+used to perform this analysis.
+
+### Redundant Clause Elimination
+
+The data operations are modeled in a way where data entry operations
+look like loads and data exit operations look like stores. Thus, these
+operations are intended to be optimized in the following ways:
+* Be able to eliminate redundant operations such as when an `acc.copyin`
+dominates another.
+* Be able to hoist/sink such operations out of loops.
+
+[include "Dialects/OpenACCDialect.md"]
diff --git a/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td b/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
index 60e2ccfa18b6..2f7dfb2751c9 100644
--- a/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
+++ b/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
@@ -7,6 +7,7 @@
 // =============================================================================
 //
 // Defines MLIR OpenACC dialect.
+// See [`OpenACC Dialect Documentation`](Dialects/OpenACC.md) for more details.
 //
 //===----------------------------------------------------------------------===//
 
@@ -17,13 +18,6 @@ include "mlir/IR/AttrTypeBase.td"
 
 def OpenACC_Dialect : Dialect {
   let name = "acc";
-
-  let summary = "An OpenACC dialect for MLIR.";
-
-  let description = [{
-    This dialect models the construct from the OpenACC 3.3 directive language.
-  }];
-
   let useDefaultAttributePrinterParser = 1;
   let useDefaultTypePrinterParser = 1;
   let cppNamespace = "::mlir::acc";
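The clause-to-operation table under Operation Categories in the new `OpenACC.md` can be read as a lookup from a data clause to its entry and exit operations. The C++ sketch below encodes that table for the structured-construct case and keeps the original clause alongside, mirroring how each decomposed operation retains its `dataClause`. `Decomposition` and `decompose` are hypothetical names used only for illustration; they are not part of the MLIR `acc` dialect API. Clauses such as `deviceptr`, `private`, `firstprivate`, and `reduction` map to a single operation, so no exit operation is returned for them.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical result of decomposing one data clause on a structured construct.
struct Decomposition {
  std::string entryOp;                // operation placed before the region
  std::optional<std::string> exitOp;  // operation placed after the region, if any
  std::string dataClause;             // original clause, kept for debugging
};

// Looks up the entry/exit operations listed in the decomposition table.
// Returns std::nullopt for clause names that are not in the table.
std::optional<Decomposition> decompose(const std::string &clause) {
  // clause -> (entry op, exit op); an empty exit name means "no exit op".
  static const std::map<std::string, std::pair<std::string, std::string>> table = {
      {"copyin", {"acc.copyin", "acc.delete"}},
      {"copy", {"acc.copyin", "acc.copyout"}},
      {"copyout", {"acc.create", "acc.copyout"}},
      {"attach", {"acc.attach", "acc.detach"}},
      {"create", {"acc.create", "acc.delete"}},
      {"present", {"acc.present", "acc.delete"}},
      {"no_create", {"acc.nocreate", "acc.delete"}},
      {"deviceptr", {"acc.deviceptr", ""}},
      {"private", {"acc.private", ""}},
      {"firstprivate", {"acc.firstprivate", ""}},
      {"reduction", {"acc.reduction", ""}},
  };
  auto it = table.find(clause);
  if (it == table.end()) return std::nullopt;
  Decomposition d{it->second.first, std::nullopt, clause};
  if (!it->second.second.empty()) d.exitOp = it->second.second;
  return d;
}
```

For example, `decompose("copy")` yields `acc.copyin` before the region and `acc.copyout` after it, both tagged with the original `copy` clause, which is exactly the case the document uses to motivate the `dataClause` field.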
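To illustrate the zero-normalization requirement from the Bounds section, the following sketch converts a Fortran array section (inclusive bounds, relative to an arbitrary declared lower bound) and a C/C++ section written as `[start:length]` into the same zero-based lowerbound/upperbound/extent triple. `NormalizedBounds`, `fromFortranSection`, and `fromCSection` are hypothetical; they do not reproduce the actual operand list of `acc.bounds`, and strides are deliberately left out to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical zero-normalized bounds; all values are element offsets from
// the start of the mapped base, independent of the source language.
struct NormalizedBounds {
  int64_t lowerbound;  // offset of the first element
  int64_t upperbound;  // offset of the last element (inclusive)
  int64_t extent;      // number of elements
};

// Fortran section a(first:last) of an array declared as a(declLb:...).
// Fortran bounds are inclusive and relative to the declared lower bound,
// which defaults to 1 but may be any integer.
NormalizedBounds fromFortranSection(int64_t declLb, int64_t first, int64_t last) {
  const int64_t lb = first - declLb;
  const int64_t ub = last - declLb;
  return {lb, ub, ub - lb + 1};
}

// C/C++ section a[start:length]: start is already a zero-based index and the
// second value is a length, so the upper bound has to be derived.
NormalizedBounds fromCSection(int64_t start, int64_t length) {
  return {start, start + length - 1, length};
}
```

With `real :: a(5:14)` and `copyin(a(7:9))` in Fortran, and `float a[10]` with `copyin(a[2:3])` in C, both helpers produce lowerbound 2, upperbound 4, extent 3, which is the language-agnostic form the section argues for.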
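The Counters section states that `acc.copyin` and `acc.delete` follow the reference-counter rules quoted from section 2.7.7, with the `structured` flag choosing between the structured and dynamic counters. The C++ sketch below is a hypothetical model of just those rules: the `PresentTable` class and its methods are illustrative names, not MLIR or OpenACC runtime APIs. It ignores attach/detach, null pointers, and the `finalize` clause, and for the dynamic counter it assumes the same decrement-then-delete behavior as the quoted structured-exit text.

```cpp
#include <cassert>
#include <map>
#include <string>

// Per-variable reference counters, as described in the specification.
struct Counters {
  int structured = 0;  // structured reference counter
  int dynamic = 0;     // dynamic reference counter
};

// Hypothetical model of the device present table.
class PresentTable {
public:
  // Entry portion of `acc copyin` (what `acc.copyin` models).
  // `structured` mirrors the operation's `structured` flag.
  void copyinEntry(const std::string &var, bool structured) {
    auto it = table_.find(var);
    if (it == table_.end()) {
      // Not present: copyin action (allocate and copy host -> device).
      ++copiesToDevice_;
      it = table_.emplace(var, Counters{}).first;
    }
    // Present increment, or the initial count of 1 for a new mapping.
    ++(structured ? it->second.structured : it->second.dynamic);
  }

  // Exit portion (what `acc.delete` models); detach and `finalize` ignored.
  void deleteExit(const std::string &var, bool structured) {
    auto it = table_.find(var);
    if (it == table_.end()) return;  // not present: no action
    int &counter = structured ? it->second.structured : it->second.dynamic;
    if (counter == 0) return;        // selected counter is zero: no action
    --counter;                       // present decrement
    if (it->second.structured == 0 && it->second.dynamic == 0)
      table_.erase(it);              // both counters zero: delete action
  }

  bool isPresent(const std::string &var) const { return table_.count(var) != 0; }
  int copiesToDevice() const { return copiesToDevice_; }

private:
  std::map<std::string, Counters> table_;
  int copiesToDevice_ = 0;
};
```

In this model, an `enter data copyin(a)` followed by an `acc data copyin(a)` region copies `a` only once, and `a` stays present after the region ends because the dynamic counter is still nonzero; only the later `exit data delete(a)` removes it.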
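The Implicit Data Attributes section relies on identifying variables that are live-in to an `acc` region. The toy below computes the upward-exposed uses of a single-block region body: names that are read before any write inside the region. It works on named variables rather than SSA values, so it only approximates what `mlir::Liveness` would report; `ToyOp` and `liveIns` are hypothetical names. Which implicit clause each live-in should receive under section 2.6.2 is not modeled.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy straight-line region body: each op reads some variables, then writes some.
struct ToyOp {
  std::vector<std::string> uses;
  std::vector<std::string> defs;
};

// Upward-exposed uses of a single-block region: variables read before any
// definition inside the region. These are the region's live-ins, i.e. the
// variables that would need some (implicit) data mapping.
std::set<std::string> liveIns(const std::vector<ToyOp> &body) {
  std::set<std::string> defined;
  std::set<std::string> result;
  for (const ToyOp &op : body) {
    for (const std::string &u : op.uses)
      if (defined.count(u) == 0) result.insert(u);  // read before any local write
    for (const std::string &d : op.defs) defined.insert(d);
  }
  return result;
}
```

Running this late, after optimizations such as SROA have rewritten the body, is what the section recommends: the set of live-ins can shrink once aggregates are split or temporaries are introduced inside the region.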
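The Redundant Clause Elimination section treats entry operations like loads and exit operations like stores. What follows is a toy, CSE-style sketch under strong assumptions: straight-line code (so earlier operations dominate later ones), properly nested structured regions, and a hypothetical `DataOp` record in which an exit operation names the SSA value (`accPtr`) of the entry operation it closes. An entry operation is dropped when an identical entry on the same variable is still live. Its paired exit is dropped as well, so reference counts stay balanced, and uses of the removed value are redirected through `replacements`. This is an illustration of the idea, not the dialect's pass.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy straight-line IR. For Entry ops `value` is the produced result id;
// for Exit ops it is the id of the entry result (accPtr) being consumed.
struct DataOp {
  enum Kind { Entry, Exit } kind;
  std::string name;  // e.g. "acc.copyin", "acc.delete"
  std::string var;   // host variable the op refers to
  int value;
};

// Removes entry ops dominated by an identical, still-live entry on the same
// variable, together with the exit ops paired with them. `replacements`
// maps each removed entry result to the dominating result that replaces it.
std::vector<DataOp> eliminateNestedRedundant(const std::vector<DataOp> &ops,
                                             std::map<int, int> &replacements) {
  std::vector<DataOp> kept;
  std::map<std::pair<std::string, std::string>, int> live;  // (name, var) -> result
  std::set<int> removed;                                    // removed entry results
  for (const DataOp &op : ops) {
    if (op.kind == DataOp::Entry) {
      auto key = std::make_pair(op.name, op.var);
      auto hit = live.find(key);
      if (hit != live.end()) {
        replacements[op.value] = hit->second;  // reuse the dominating value
        removed.insert(op.value);
        continue;
      }
      live[key] = op.value;
      kept.push_back(op);
    } else {
      if (removed.count(op.value) != 0) continue;  // exit of a removed entry
      // This exit closes its entry, so that entry is no longer live.
      for (auto it = live.begin(); it != live.end();) {
        if (it->second == op.value) it = live.erase(it);
        else ++it;
      }
      kept.push_back(op);
    }
  }
  return kept;
}
```

The nested pair `copyin a` / `copyin a` / `delete` / `delete` collapses to the outer pair, while two sequential, non-overlapping regions are left untouched because the first entry is no longer live when the second one starts.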