[acc] OpenACC dialect design philosophy and details (#75548)

This document captures the design philosophy of the acc dialect. It also shares the rationale behind the design and implementation of various operations - and ties that back to the dialect design goals.

Co-authored-by: Valentin Clement <clementval@gmail.com>
Co-authored-by: Slava Zakharin <szakharin@nvidia.com>
author: Razvan Lupusoru <razvan.lupusoru@gmail.com> 2024-01-09 07:33:11 -0800
committer: GitHub <noreply@github.com> 2024-01-09 07:33:11 -0800
commit: ab4af25d5dfaecf01e6c6e94dc79e7304321c376 (patch)
tree: 892c5313e52d2465a36732f89cdbb38aa2af4b39
parent: 0242d27dc89ff19e331ae4945933cdb360c7d4cf (diff)
2 files changed, 450 insertions, 7 deletions
diff --git a/mlir/docs/Dialects/OpenACC.md b/mlir/docs/Dialects/OpenACC.md
new file mode 100755
index 000000000000..da7d4be07e3e
--- /dev/null
+++ b/mlir/docs/Dialects/OpenACC.md
@@ -0,0 +1,449 @@
+The `acc` dialect is an MLIR dialect for representing the OpenACC
+programming model. OpenACC is a standardized directive-based model which
+is used with C, C++, and Fortran to enable programmers to expose
+parallelism in their code. The descriptive approach used by OpenACC
+allows targeting of parallel multicore and accelerator targets like GPUs
+by giving the compiler freedom in how to parallelize for specific
+architectures. OpenACC also provides the ability to optimize the
+parallelism through increasingly prescriptive clauses.
+
+This dialect models the constructs from the
+[OpenACC 3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf).
+
+This document describes the design of the OpenACC dialect in MLIR. It
+lists and explains design goals and design choices along with their
+rationale. It also describes specifics with regard to `acc` dialect
+operations, types, and attributes.
+
+[TOC]
+
+## Dialect Design Goals
+
+* Needs to have a complete representation of the OpenACC language.
+	- A frontend requires this in order to properly generate a
+	representation of possible `acc` pragmas in MLIR. Additionally,
+	this dialect is expected to be further lowered when materializing
+	its semantics. Without a complete representation, a frontend might
+	choose a lower abstraction (such as direct runtime calls) - but this
+	would impact the ability to do analysis and optimizations on the
+	dialect.
+* Allow representation at the same semantic level as the OpenACC
+language while being able to represent nuances of the source
+language semantics (such as Fortran descriptors) in an agnostic manner.
+	- Using abstractions that closely model the OpenACC language
+	simplifies frontend implementation. It also allows for easier
+	debugging of the IR. However, sometimes source language specific
+	behavior is needed when materializing OpenACC. In these cases, such
+	as privatization of C++ objects with a default constructor, the
+	frontend fills in the `recipe` along with the `private` operation
+	which can be packaged neatly with the `acc` dialect operations.
+* Be able to regenerate the semantic equivalent of the user pragmas from
+the dialect (including bounds, names, clauses, modifiers, etc).
+	- This is a strong measure of making sure that the dialect is not
+	lossy in semantics. It also makes it possible to generate
+	appropriate and useful debug information outside of the frontend.
+* Be dialect agnostic so that it can be used and coexist with other
+dialects including but not limited to `hlfir`, `fir`, `llvm`, `cir`.
+	- Directive-based models such as OpenACC are always used with a
+	source language, so the `acc` dialect coexisting with other
+	dialect(s) is necessary by construction. Through proper
+	abstractions, neither the `acc` dialect nor the source language
+	dialect should have dependencies on each other; where needed,
+	interfaces should be used to ensure `acc` dialect can verify
+	expected properties.
+* The dialect must allow dataflow to be modeled accurately and
+performantly using MLIR's existing facilities.
+	- Appropriate dataflow modeling is important for analyses and IR
+	reasoning - even something as simple as walking the uses. Therefore
+	operations, like data operations, are expected to generate results
+	which can be used in modeling behavior. For example, consider an
+	`acc copyin` clause. After the `acc.copyin` operation, a pointer
+	which lives on the device should be distinguishable from one that
+	lives in host memory.
+* Be friendly to MLIR optimization passes by implementing common
+interfaces.
+	- Interfaces, such as `MemoryEffects`, are the key way MLIR
+	transformations and analyses are designed to interact with the IR.
+	In order for the operations in the `acc` dialect to be optimizable
+	(either directly or even indirectly by not blocking optimizations
+	of nested IR), implementing relevant common interfaces is needed.
+
+The design philosophy of the `acc` dialect is to adhere to these design
+goals. All current and planned operations, attributes, and types must
+satisfy them.
+
+## Operation Categories
+
+The OpenACC dialect includes three categories of operations: high-level
+operations (which retain the same semantic meaning as their OpenACC
+language equivalent), intermediate-level operations (which are used to
+decompose clauses from constructs), and low-level operations (to encode
+specifics associated with the source language in a generic way).
+
+The high-level operations list contains the following OpenACC language
+constructs and their corresponding operations:
+* `acc parallel` &rarr; `acc.parallel`
+* `acc kernels` &rarr; `acc.kernels`
+* `acc serial` &rarr; `acc.serial`
+* `acc data` &rarr; `acc.data`
+* `acc loop` &rarr; `acc.loop`
+* `acc enter data` &rarr; `acc.enter_data`
+* `acc exit data` &rarr; `acc.exit_data`
+* `acc host_data` &rarr; `acc.host_data`
+* `acc init` &rarr; `acc.init`
+* `acc shutdown` &rarr; `acc.shutdown`
+* `acc update` &rarr; `acc.update`
+* `acc set` &rarr; `acc.set`
+* `acc wait` &rarr; `acc.wait`
+* `acc atomic read` &rarr; `acc.atomic.read`
+* `acc atomic write` &rarr; `acc.atomic.write`
+* `acc atomic update` &rarr; `acc.atomic.update`
+* `acc atomic capture` &rarr; `acc.atomic.capture`
+
+This second group contains operations which are used to represent
+either decomposed constructs or clauses for more accurate modeling:
+* `acc routine` &rarr; `acc.routine` + `acc.routine_info` attribute
+* `acc declare` &rarr; `acc.declare_enter` + `acc.declare_exit` or
+`acc.declare`
+* `acc {construct} copyin` &rarr; `acc.copyin` (before region) +
+`acc.delete` (after region)
+* `acc {construct} copy` &rarr; `acc.copyin` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} copyout` &rarr; `acc.create` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} attach` &rarr; `acc.attach` (before region) +
+`acc.detach` (after region)
+* `acc {construct} create` &rarr; `acc.create` (before region) +
+`acc.delete` (after region)
+* `acc {construct} present` &rarr; `acc.present` (before region) +
+`acc.delete` (after region)
+* `acc {construct} no_create` &rarr; `acc.nocreate` (before region) +
+`acc.delete` (after region)
+* `acc {construct} deviceptr` &rarr; `acc.deviceptr`
+* `acc {construct} private` &rarr; `acc.private`
+* `acc {construct} firstprivate` &rarr; `acc.firstprivate`
+* `acc {construct} reduction` &rarr; `acc.reduction`
+* `acc cache` &rarr; `acc.cache`
+* `acc update device` &rarr; `acc.update_device`
+* `acc update host` &rarr; `acc.update_host`
+* `acc host_data use_device` &rarr; `acc.use_device`
+* `acc declare device_resident` &rarr; `acc.declare_device_resident`
+* `acc declare link` &rarr; `acc.declare_link`
+* `acc exit data delete` &rarr; `acc.delete` (with `structured` flag as
+false)
+* `acc exit data detach` &rarr; `acc.detach` (with `structured` flag as
+false)
+* `acc {construct} {data_clause}(var[lb:ub])` &rarr; `acc.bounds`
+
+The low-level operations are:
+* `acc.private.recipe`
+* `acc.reduction.recipe`
+* `acc.firstprivate.recipe`
+* `acc.global_ctor`
+* `acc.global_dtor`
+* `acc.yield`
+* `acc.terminator`
+
+The semantics and rationale of the low-level operations are explained below.
+
+### Data Operations
+
+#### Data Clause Decomposition
+The data clauses are decomposed from their constructs for better
+dataflow modeling in MLIR. There are multiple reasons for this which
+are consistent with the dialect goals:
+* Correctly represents dataflow. Data clauses have different effects
+at entry to region and at exit from region.
+* It is easier to attach interfaces such as `MemoryEffects` to a single
+operation. This can better reflect semantics (like the fact that an
+`acc.copyin` operation only reads host memory).
+* Operations can be moved or optimized individually (e.g. `CSE`).
+* Easier to keep track of debug information. Line location can point to
+the text representing the data clause instead of the construct.
+Additionally, attributes can be used to keep track of variable names in
+clauses without having to walk the IR tree in an attempt to recover the
+information (this makes the `acc` dialect more agnostic with regard to
+the other dialect it is used with).
+* Clear operation ordering since all data operations are on the same
+list.
+
+Each of the `acc` dialect data operations represents either the
+entry or the exit portion of the data action specification. Thus,
+`acc.copyin` represents the semantics defined in section
+`2.7.7 copyin clause` whose wording starts with
+`At entry to a region`. The decomposed exit operation `acc.delete`
+represents the second part of that section, whose wording starts with
+`At exit from the region`. The `delete` action may be performed
+after checking and updating of the relevant reference counters noted.
+
+The `acc` data operations, even when decomposed, retain their original
+data clause in the `dataClause` attribute so that this information can
+be recovered during debugging. For example, `acc copy` does not
+translate to an `acc.copy` operation, but instead to `acc.copyin`
+for entry and `acc.copyout` for exit. Both of the decomposed operations
+hold a `dataClause` field that specifies this was an `acc copy`.
+
+The link between the decomposed entry and exit operations is the SSA
+value produced by the entry operation. Namely, it is the `accPtr` result
+which is used both in the `dataOperands` of the operation used for the
+construct and in the `accPtr` operand of the exit operation.
+
+#### Bounds
+
+OpenACC data clauses allow the use of bounds specifiers as per
+`2.7.1 Data Specification in Data Clauses`. However, array dimensions
+for the data are not always required in the clause if the source
+language's type system captures this information - the user can just
+specify the variable name in the data clause. So the `acc.bounds`
+operation is an important piece to ensure uniform representation of both
+explicit user set dimensions and implicit type-based dimensions. It
+contains several key features to allow properly encoding sizes in a
+manner flexible and agnostic to the source language's dialect:
+* Multi-dimensional arrays can be represented by using multiple ordered
+`acc.bounds` operations.
+* Bounds are required to be zero-normalized. This works well with the
+`PointerLikeType` requirement in data clauses - since a lowerbound of 0
+means looking at data at the zero offset from pointer. This requirement
+also works well in ensuring the `acc` dialect is agnostic to source
+language dialect since it prevents ambiguity such as the case of Fortran
+arrays where the lower bound is not a fixed value.
+* If the source dialect does not encode the dimensions in the type (e.g.
+`!fir.array<?x?xi32>`) but instead encodes it in some other way (such as
+through descriptors), then the frontend must fill in the `acc.bounds`
+operands with appropriate information (such as loads from descriptor).
+The `acc.bounds` operation also accommodates a lossy source dialect,
+such as when the frontend uses aggressive pointer decay and cannot
+represent the dimensions in the type system (e.g. using `!llvm.ptr` for
+arrays). Both of these aspects show the flexibility of the `acc.bounds`
+operation, which allows the representation to be agnostic since the
+`acc` dialect is not expected to understand how to extract dimension
+information from the types of the source dialect.
+* The OpenACC specification allows either extent or upperbound in the
+data clause depending on whether it is Fortran or C and C++. The
+`acc.bounds` operation is rich enough to accept either or both - for
+convenience in lowering to the dialect and for ability to precisely
+capture the meaning from the clause.
+* The stride, either in units or bytes, can also be captured in the
+`acc.bounds` operation. This is important for accepting a source
+language's arrays without forcing the frontend to normalize them in
+some way. For example, consider a case where a whole array is mapped to
+the device in a parent function, and only a view with a non-1 stride is
+passed to a child function (e.g. a Fortran array slice with non-1
+stride). A `copy` operation of this data in the child should be able
+to avoid remapping this array. If instead the operation required
+normalizing the array (such as making it contiguous), the result would
+be an error-prone disjoint mapping of the same host data, since it
+would produce multiple mappings to the device.
+
+#### Counters
+
+The data operations also maintain semantics described in the OpenACC
+specification related to runtime counters. More specifically, consider
+the specification of the entry portion of `acc copyin` in section 2.7.7:
+```
+At entry to a region, the structured reference counter is used. On an
+enter data directive, the dynamic reference counter is used.
+- If var is present and is not a null pointer, a present increment
+action with the appropriate reference counter is performed.
+- If var is not present, a copyin action with the appropriate reference
+counter is performed.
+- If var is a pointer reference, an attach action is performed.
+```
+The `acc.copyin` operation includes these semantics, including those
+related to attach, which is specified through the `varPtrPtr` operand.
+The `structured` flag on the operation is important since the
+`structured reference counter` should be used when the flag is true; and
+the `dynamic reference counter` should be used when it is false.
+
+At exit from structured regions (`acc data`, `acc kernels`), the
+`acc copyin` clause is decomposed to `acc.delete` (with the
+`structured` flag as true). The semantics of `acc.delete` are
+also consistent with the OpenACC specification noted for the exit
+portion of the `acc copyin` clause:
+```
+At exit from the region:
+- If the structured reference counter for var is zero, no action is
+taken.
+- Otherwise, a detach action is performed if var is a pointer reference,
+and a present decrement action with the structured reference counter is
+performed if var is not a null pointer. If both structured and dynamic
+reference counters are zero, a delete action is performed.
+```
+
+### Types
+
+There are a few acc dialect type categories to describe:
+* type of acc data clause operation input `varPtr`
+	- The type of `varPtr` must be pointer-like. This is done by
+	attaching the `PointerLikeType` interface to the appropriate MLIR
+	type. Although the memory/storage concept is a lower-level
+	abstraction, it is useful because the OpenACC model distinguishes
+	between host and device memory explicitly - and the mapping between
+	the two is done through pointers. Thus, by explicitly requiring it
+	in the dialect, the appropriate language frontend must create
+	storage or use a type that satisfies the mapping constraint.
+* type of result of acc data clause operations
+	- The result type of an acc data clause operation is exactly the
+	same as the type of `varPtr`. This was done intentionally instead
+	of introducing an `acc.ref/ptr` type so that IR compatibility and
+	the dialect's existing strong type checking can be maintained. This
+	is needed since the `acc` dialect must live within another dialect
+	whose type system is unknown to it. The only constraint is that the
+	appropriate dialect type must use the `PointerLikeType` interface.
+* type of decomposed clauses
+	- Decomposed clauses, such as `acc.bounds` and `acc.declare_enter`,
+	produce types to allow their results to be used only in specific
+	operations.
+
+### Recipes
+
+Recipes are a generic way to express source language specific semantics.
+
+There are currently two categories of recipes, but the recipe concept
+can be extended for any additional low-level information that needs
+to be captured for successful lowering of OpenACC. The two categories
+are:
+* recipes used in the context of privatization associated with a
+construct
+* recipes used in the context of additional specification of data
+semantics
+
+The intention of the recipes is to specify how materialization of an
+action, such as privatization, should be done when the semantics
+of the action need to be interpreted and lowered, such as before
+generating LLVM dialect.
+
+The recipes used for privatization provide a source-language independent
+way of specifying the creation of a local variable of that type. This
+means using the appropriate `alloca` instruction and being able to
+specify default initialization or default constructor.
+
+### Routine
+
+The routine directive is used to note that a procedure should be made
+available for the accelerator in a way that is consistent with its
+modifiers, such as those that describe the parallelism. In the acc
+dialect, an acc routine is represented through two joint pieces - an
+attribute and an operation:
+* The `acc.routine` operation is simply a specifier which notes which
+symbol (or string) the acc routine is needed for, along with the
+associated parallelism. It defines a symbol that can be referenced in
+the attribute.
+* The `acc.routine_info` attribute is used on the source dialect
+specific operation and specifies one or multiple `acc.routine` symbols.
+Typically, this is attached to `func.func` which either provides the
+declaration (in case of externals) or provides the actual body of the
+acc routine in the dialect that the source language was translated to.
+
+### Declare
+
+OpenACC `declare` is a mechanism which makes the definition of a global
+or a local accessible to the accelerator, with an implicit lifetime
+matching that of the scope it was declared in. Thus, `declare` semantics
+are represented through multiple operations and attributes:
+* `acc.declare` - This is a structured operation which contains an
+MLIR region and can be used in a similar manner to `acc.data` to specify
+an implicit data region with specific procedure lifetime. This is
+typically used inside `func.func` after variable declarations.
+* `acc.declare_enter` - This is an unstructured operation which is
+used as a decomposed form of `acc declare`. It effectively allows the
+entry operation to exist in a scope different than the exit operation.
+It can also be used along with `acc.declare_exit`, which consumes its
+token, to define a scoped region without using an MLIR region. This
+operation is also used in `acc.global_ctor`.
+* `acc.declare_exit` - The matching equivalent of `acc.declare_enter`
+except that it specifies exit semantics. This operation is typically
+used inside a `func.func` at the exit points or with `acc.global_dtor`.
+* `acc.global_ctor` - Lives at the same level as source dialect globals
+and is used to specify data actions to be done at program entry. This
+is used in conjunction with source dialect globals whose lifetime is
+not just a single procedure.
+* `acc.global_dtor` - Defines the exit data actions that should be done
+at program exit. Typically used to revert the actions of
+`acc.global_ctor`.
+
+The attributes:
+* `acc.declare` - This is a facility for easier determination of
+variables which are `acc declare`'d. This attribute is used on
+operations producing globals and on operations producing locals such as
+dialect specific `alloca`'s. Having this attribute is required in order
+to appear in a data mapping operation associated with any of the
+`acc.declare*` operations.
+* `acc.declare_action` - Since the OpenACC specification allows
+declaration of variables that have yet to be allocated, this attribute
+is used at the allocation and deallocation points. More specifically,
+this attribute captures symbols of functions to be called to perform
+an action either pre-allocate, post-allocate, pre-deallocate, or
+post-deallocate. Calls to these functions should be materialized when
+lowering OpenACC semantics to ensure proper data actions are done
+after the allocation/deallocation.
+
+## OpenACC Transforms and Analyses
+
+The design goal for the `acc` dialect is to be friendly to MLIR
+optimization passes including CSE and LICM. Additionally, since it is
+designed to recover original clauses, it makes late verification and
+analysis possible in the MLIR framework outside of the frontend.
+
+This section describes a few MLIR-level passes that the `acc` dialect
+design should be friendly to. It currently only outlines the
+possibilities intended by the design and not necessarily existing
+passes.
+
+### Verification
+
+Since the OpenACC dialect is not lossy with regards to its
+representation, it is possible to do OpenACC language semantic checking
+at the MLIR-level. What follows is a list of various semantic checks
+needed.
+
+This first list is required to be done in the frontend because the `acc`
+dialect operations must be valid when constructed:
+* Ensure that only listed clauses are allowed for each directive.
+* Ensure that only listed modifiers are allowed for each clause.
+
+However, the following are semantic checks that can be done at the
+MLIR-level (either in a separate pass or as part of the operation
+verifier):
+* Apply the validity checks that each modifier needs (e.g. `num_gangs`
+may need a positive integer).
+* Ensure valid clause nesting.
+* Validate restrictions on clauses which cannot appear with others.
+* Validate that no conflicting clauses are used on variables.
+
+Note that some of these checks can be even more precise when done at the
+MLIR level because optimizations like inlining and constant propagation
+expose detail that wouldn't have been visible in the frontend.
+
+### Implicit Data Attributes
+
+The OpenACC specification includes a section on `2.6.2 Variables with
+Implicitly Determined Data Attributes`. This section describes the data
+actions that should be applied to a variable for which the user did not
+specify a data action. The action depends on the construct being used
+and also on the default clause. However, the point to note here is that
+variables which are live-in to the acc region must employ some data
+mapping so the data can be passed to the accelerator.
+
+One possible optimization that affects the data attributes needed is
+`Scalar Replacement of Aggregates (SROA)`. The `acc` dialect should
+not prevent this from happening on the source dialect.
+
+Because it is intended to be possible to apply optimizations across an
+`acc` region, the analysis/transformation pass that applies the implicit
+data attributes should be run as late as possible - ideally right before
+any outlining process which uses the `acc` region body to create an
+accelerator procedure. It is expected that existing MLIR facilities,
+such as `mlir::Liveness`, will work for the `acc` region and thus can be
+used to perform this analysis.
+
+### Redundant Clause Elimination
+
+The data operations are modeled in a way where data entry operations
+look like loads and data exit operations look like stores. Thus these
+operations are intended to be optimized in the following ways:
+* Be able to eliminate redundant operations such as when an `acc.copyin`
+dominates another.
+* Be able to hoist/sink such operations out of loops.
+
+[include "Dialects/OpenACCDialect.md"]
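As an illustration of the counter semantics quoted in the Counters section of the added `OpenACC.md`, the Python sketch below models the entry half of `acc copyin` (what `acc.copyin` represents) and the exit half (what `acc.delete` represents). It is a toy model under stated assumptions, not the dialect's lowering and not a real OpenACC runtime API: the names `DeviceRuntime`, `Counters`, `copyin`, and `delete` are hypothetical, null pointers and attach/detach actions are ignored, and the dynamic-counter exit path is assumed to mirror the structured one quoted in the document.

```python
from dataclasses import dataclass, field


@dataclass
class Counters:
    structured: int = 0  # used when the `structured` flag is true
    dynamic: int = 0     # used when the `structured` flag is false


@dataclass
class DeviceRuntime:
    """Hypothetical present table: variable name -> reference counters."""
    present: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # actions performed, in order

    def copyin(self, var: str, structured: bool) -> None:
        """Entry semantics: present increment if mapped, else copyin."""
        if var in self.present:
            self.log.append(f"present_increment {var}")
        else:
            self.present[var] = Counters()
            self.log.append(f"copyin {var}")
        counters = self.present[var]
        if structured:
            counters.structured += 1
        else:
            counters.dynamic += 1

    def delete(self, var: str, structured: bool) -> None:
        """Exit semantics: decrement, then delete once both counters are 0."""
        counters = self.present.get(var)
        if counters is None:
            return  # not present: the relevant counter is zero, no action
        current = counters.structured if structured else counters.dynamic
        if current == 0:
            return  # counter is zero: no action is taken
        if structured:
            counters.structured -= 1
        else:
            counters.dynamic -= 1  # assumption: mirrors the structured path
        self.log.append(f"present_decrement {var}")
        if counters.structured == 0 and counters.dynamic == 0:
            del self.present[var]
            self.log.append(f"delete {var}")


rt = DeviceRuntime()
rt.copyin("a", structured=True)  # outer structured region with copyin(a)
rt.copyin("a", structured=True)  # nested structured region with copyin(a)
rt.delete("a", structured=True)  # exit from the nested region
rt.delete("a", structured=True)  # exit from the outer region
print(rt.log)
```

In this model the nested region only bumps the counter, and the device copy is released at the exit of the outermost region, which is the behavior the `structured` flag on `acc.copyin` and `acc.delete` is meant to preserve.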
diff --git a/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td b/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
index 60e2ccfa18b6..2f7dfb2751c9 100644
--- a/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
+++ b/mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td
@@ -7,6 +7,7 @@
 // =============================================================================
 //
 // Defines MLIR OpenACC dialect.
+// See [`OpenACC Dialect Documentation`](Dialects/OpenACC.md) for more details.
 //
 //===----------------------------------------------------------------------===//
 
@@ -17,13 +18,6 @@ include "mlir/IR/AttrTypeBase.td"
 
 def OpenACC_Dialect : Dialect {
   let name = "acc";
-
-  let summary = "An OpenACC dialect for MLIR.";
-
-  let description = [{
-    This dialect models the construct from the OpenACC 3.3 directive language.
-  }];
-
   let useDefaultAttributePrinterParser = 1;
   let useDefaultTypePrinterParser = 1;
   let cppNamespace = "::mlir::acc";
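The zero-normalization requirement from the Bounds section of the added `OpenACC.md` can be sketched in Python as follows. This is only an arithmetic illustration of what a frontend would compute before filling in `acc.bounds` operands; the `Bounds` tuple and the helper names `from_fortran_section` and `from_c_section` are hypothetical and are not part of MLIR or the `acc` dialect. It assumes a Fortran section is written `a(lb:ub)` with an inclusive upper bound, a C or C++ section is written `a[start:length]`, and strides are left out.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    lowerbound: int  # zero-based offset of the first element
    upperbound: int  # zero-based offset of the last element (inclusive)
    extent: int      # number of elements covered


def from_fortran_section(declared_lb: int, lb: int, ub: int) -> Bounds:
    """Zero-normalize the section lb:ub of a dimension declared as
    declared_lb:<something>, so the result no longer depends on the
    array's declared lower bound."""
    if ub < lb:
        raise ValueError("empty or reversed section")
    zero_lb = lb - declared_lb
    zero_ub = ub - declared_lb
    return Bounds(zero_lb, zero_ub, zero_ub - zero_lb + 1)


def from_c_section(start: int, length: int) -> Bounds:
    """C arrays are already zero-based; the clause gives start and length."""
    if length <= 0:
        raise ValueError("section must contain at least one element")
    return Bounds(start, start + length - 1, length)


# Fortran: real :: a(-5:5) with copyin(a(-2:3))
fortran = from_fortran_section(declared_lb=-5, lb=-2, ub=3)
# C: float a[11] with copyin(a[3:6])
c_style = from_c_section(start=3, length=6)
print(fortran, c_style, fortran == c_style)
```

Both clauses describe the same six elements, and after normalization they carry identical lowerbound, upperbound, and extent values, which is why the document can say that `acc.bounds` accepts either the extent or the upper bound.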