
.. raw:: html

  <style type="text/css">
    .none { background-color: #FFCCCC }
    .partial { background-color: #FFFF99 }
    .good { background-color: #CCFF99 }
  </style>

.. role:: none
.. role:: partial
.. role:: good

.. contents::
   :local:

==================
OpenMP Support
==================

Clang fully supports OpenMP 4.5. Clang supports offloading to X86_64, AArch64,
and PPC64[LE], and has `basic support for Cuda devices`_.

Standalone directives
=====================

* #pragma omp [for] simd: :good:`Complete`.

* #pragma omp declare simd: :partial:`Partial`.  Clang supports parsing,
  semantic analysis, and generation of special attributes for the X86 target,
  but the LLVM pass for vectorization is still missing.

* #pragma omp taskloop [simd]: :good:`Complete`.

* #pragma omp target [enter|exit] data: :good:`Complete`.

* #pragma omp target update: :good:`Complete`.

* #pragma omp target: :good:`Complete`.

* #pragma omp declare target: :good:`Complete`.

* #pragma omp teams: :good:`Complete`.

* #pragma omp distribute [simd]: :good:`Complete`.

* #pragma omp distribute parallel for [simd]: :good:`Complete`.
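
As a minimal sketch of two of the directives listed above, the following
combines `declare simd` with the `simd` construct. The function names
`scale_add` and `saxpy` are hypothetical and chosen only for illustration. The
code also compiles and gives the same result when OpenMP is disabled, because
the pragmas are then ignored.

```c
#include <stddef.h>

/* Request vector variants of this function in addition to the scalar one. */
#pragma omp declare simd
float scale_add(float a, float x, float y) {
  return a * x + y;
}

/* Computes y[i] = a * x[i] + y[i]; the loop is marked for SIMD execution. */
void saxpy(size_t n, float a, const float *x, float *y) {
  #pragma omp simd
  for (size_t i = 0; i < n; ++i)
    y[i] = scale_add(a, x[i], y[i]);
}
```

Because `declare simd` is only partially supported, the call to `scale_add`
inside the loop may still be executed as a scalar call.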

Combined directives
===================

* #pragma omp parallel for simd: :good:`Complete`.

* #pragma omp target parallel: :good:`Complete`.

* #pragma omp target parallel for [simd]: :good:`Complete`.

* #pragma omp target simd: :good:`Complete`.

* #pragma omp target teams: :good:`Complete`.

* #pragma omp teams distribute [simd]: :good:`Complete`.

* #pragma omp target teams distribute [simd]: :good:`Complete`.

* #pragma omp teams distribute parallel for [simd]: :good:`Complete`.

* #pragma omp target teams distribute parallel for [simd]: :good:`Complete`.
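
For illustration, a combined directive can offload a whole loop in one line.
The sketch below is a hypothetical `vec_add` function, not code from Clang
itself. The `map` clauses are one reasonable choice for this loop; when no
device is available the region runs on the host.

```c
/* Element-wise vector addition, offloaded with a single combined directive. */
void vec_add(int n, const float *a, const float *b, float *c) {
  /* Copy the inputs to the device and the result back to the host. */
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}
```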

Clang does not support any constructs/updates from OpenMP 5.0 except
for `reduction`-based clauses in the `task` and `target`-based directives.
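
The following sketch assumes that the reduction-based clauses mentioned above
include `task_reduction` on `taskgroup` and `in_reduction` on `task`. The
function name `sum_with_tasks` is hypothetical.

```c
/* Sums the array with one task per element; each task joins the reduction. */
int sum_with_tasks(const int *v, int n) {
  int sum = 0;
  #pragma omp parallel
  #pragma omp single
  #pragma omp taskgroup task_reduction(+: sum)
  {
    for (int i = 0; i < n; ++i) {
      /* Each task accumulates into its own private copy of sum. */
      #pragma omp task in_reduction(+: sum) firstprivate(i)
      sum += v[i];
    }
  }
  /* The private copies are combined at the end of the taskgroup. */
  return sum;
}
```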

In addition, the LLVM OpenMP runtime `libomp` supports the OpenMP Tools
Interface (OMPT) on x86, x86_64, AArch64, and PPC64 on Linux, Windows, and macOS.

.. _basic support for Cuda devices:

Cuda devices support
====================

Directives execution modes
--------------------------

Clang code generation for target regions supports two modes: the SPMD and
non-SPMD modes. Clang chooses one of these two modes automatically based on the
way directives and clauses on those directives are used. The SPMD mode uses a
simplified set of runtime functions, which increases performance at the cost of
not supporting some OpenMP features. The non-SPMD mode is the most generic mode
and supports all currently available OpenMP features. The compiler always
attempts to use the SPMD mode wherever possible. SPMD mode will not be used if:

   - The target region contains an `if()` clause that refers to a `parallel`
     directive.

   - The target region contains a `parallel` directive with a `num_threads()`
     clause.

   - The target region contains user code (other than OpenMP-specific
     directives) in between the `target` and the `parallel` directives.
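
The conditions above can be illustrated with two hypothetical functions. Based
on the rules listed, the first region is a candidate for the SPMD mode, while
the second is expected to fall back to the non-SPMD mode. The actual choice is
made by the compiler and is not visible in the source code.

```c
/* Candidate for SPMD mode: target and parallel are tightly nested in one
   combined directive, with no num_threads() and no code in between. */
void fill_spmd(int n, int *out) {
  #pragma omp target teams distribute parallel for map(from: out[0:n])
  for (int i = 0; i < n; ++i)
    out[i] = 2 * i;
}

/* Expected to use non-SPMD mode: there is user code between target and
   parallel, and the parallel directive carries a num_threads() clause. */
void fill_generic(int n, int *out) {
  #pragma omp target map(from: out[0:n])
  {
    int base = 1; /* user code between the target and parallel directives */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; ++i)
      out[i] = base + 2 * i;
  }
}
```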

Data-sharing modes
------------------

Clang supports two data-sharing models for Cuda devices: `Generic` and `Cuda`
modes. The default mode is `Generic`. `Cuda` mode can give additional
performance and can be activated using the `-fopenmp-cuda-mode` flag. In
`Generic` mode all local variables that can be shared in the parallel regions
are stored in the global memory. In `Cuda` mode local variables are not shared
between the threads, and it is the user's responsibility to share the required
data between the threads in the parallel regions.
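
As a sketch of the kind of code affected by this choice, the hypothetical
function below declares a local variable in the target region and shares it
with the threads of the inner parallel region. This pattern relies on the
sharing described for `Generic` mode; under `-fopenmp-cuda-mode` such sharing
is not done automatically, so the code would have to be restructured.

```c
/* Counts the even elements of v. */
int count_even(int n, const int *v) {
  int result = 0;
  #pragma omp target map(to: v[0:n]) map(tofrom: result)
  {
    int local_count = 0; /* local to the target region, shared below */
    #pragma omp parallel for shared(local_count)
    for (int i = 0; i < n; ++i) {
      if (v[i] % 2 == 0) {
        /* All threads update the same shared variable. */
        #pragma omp atomic
        local_count += 1;
      }
    }
    result = local_count;
  }
  return result;
}
```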

Collapsed loop nest counter
---------------------------

When using the `collapse` clause on a loop nest, the default behaviour is to
automatically extend the representation of the loop counter to 64 bits for
the cases where the sizes of the collapsed loops are not known at compile
time. To prevent this conservative choice and use at most 32 bits,
compile your program with the `-fopenmp-optimistic-collapse` flag.
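
To see why the default is conservative, consider the sketch below. Both
function names are hypothetical. The loop bounds are only known at run time,
and the collapsed iteration count is their product, which can exceed what a
32-bit counter can represent even when each bound fits in an `int`.

```c
#include <stdint.h>

/* Total number of iterations of the collapsed nest, computed in 64 bits. */
int64_t collapsed_trip_count(int rows, int cols) {
  return (int64_t)rows * (int64_t)cols;
}

/* Multiplies every element of a rows x cols matrix by factor. */
void scale_matrix(int rows, int cols, float *m, float factor) {
  /* rows and cols are run-time values, so the compiler cannot bound
     rows * cols at compile time. */
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      m[(int64_t)i * cols + j] *= factor;
}
```

For example, `rows = cols = 100000` gives 10,000,000,000 iterations, which
does not fit in 32 bits. `-fopenmp-optimistic-collapse` is therefore only
appropriate when the product of the loop sizes is known to stay within the
32-bit range.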


Features not supported or with limited support for Cuda devices
---------------------------------------------------------------

- Cancellation constructs are not supported.

- Doacross loop nest is not supported.

- User-defined reductions are supported only for trivial types.

- Nested parallelism: inner parallel regions are executed sequentially.

- Static linking of libraries containing device code is not supported yet.

- Automatic translation of math functions in target regions to device-specific
  math functions is not implemented yet.

- Debug information for OpenMP target regions is not supported yet.
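
As an illustration of the user-defined reduction item above, the following
sketch declares a reduction over a plain struct of two `int` members, which is
assumed here to count as a trivial type. The names `range_t`, `range_merge`,
`range_union`, and `find_range` are hypothetical.

```c
#include <limits.h>

typedef struct {
  int lo;
  int hi;
} range_t;

/* Combines two ranges into the smallest range covering both. */
range_t range_merge(range_t a, range_t b) {
  range_t r;
  r.lo = a.lo < b.lo ? a.lo : b.lo;
  r.hi = a.hi > b.hi ? a.hi : b.hi;
  return r;
}

/* User-defined reduction; private copies start as the empty range. */
#pragma omp declare reduction(range_union : range_t : \
    omp_out = range_merge(omp_out, omp_in)) \
    initializer(omp_priv = (range_t){INT_MAX, INT_MIN})

/* Returns the minimum and maximum of v. */
range_t find_range(int n, const int *v) {
  range_t r = {INT_MAX, INT_MIN};
  #pragma omp target parallel for map(to: v[0:n]) map(tofrom: r) \
      reduction(range_union : r)
  for (int i = 0; i < n; ++i) {
    if (v[i] < r.lo) r.lo = v[i];
    if (v[i] > r.hi) r.hi = v[i];
  }
  return r;
}
```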