1c65381846 Jeff*0001 .. _sarch:
                0002 
                0003 Software Architecture
                0004 *********************
                0005 
4c8caab8bd Jeff*0006 This chapter focuses on describing the **WRAPPER** environment within
                0007 which both the core numerics and the pluggable packages operate. The
                0008 description presented here is intended to be a detailed exposition and
                0009 contains significant background material, as well as advanced details on
                0010 working with the WRAPPER. The tutorial examples in this manual (see
                0011 :numref:`chap_modelExamples`) contain more
                0012 succinct, step-by-step instructions on running basic numerical
                0013 experiments, of various types, both sequentially and in parallel. For
                0014 many projects, simply starting from an example code and adapting it to
                0015 suit a particular situation will be all that is required. The first part
                0016 of this chapter discusses the MITgcm architecture at an abstract level.
In the second part of the chapter we describe practical details of the
                0018 MITgcm implementation and the current tools and operating system features
                0019 that are employed.
                0020 
                0021 Overall architectural goals
                0022 ===========================
                0023 
                0024 Broadly, the goals of the software architecture employed in MITgcm are
                0025 three-fold:
                0026 
                0027 -  To be able to study a very broad range of interesting and
                0028    challenging rotating fluids problems;
                0029 
                0030 -  The model code should be readily targeted to a wide range of
                0031    platforms; and
                0032 
                0033 -  On any given platform, performance should be
                0034    comparable to an implementation developed and specialized
                0035    specifically for that platform.
                0036 
                0037 These points are summarized in :numref:`mitgcm_goals`,
                0038 which conveys the goals of the MITgcm design. The goals lead to a
                0039 software architecture which at the broadest level can be viewed as
                0040 consisting of:
                0041 
                0042 #. A core set of numerical and support code. This is discussed in detail
                0043    in :numref:`discret_algorithm`.
                0044 
                0045 #. A scheme for supporting optional “pluggable” **packages** (containing
                0046    for example mixed-layer schemes, biogeochemical schemes, atmospheric
                0047    physics). These packages are used both to overlay alternate dynamics
                0048    and to introduce specialized physical content onto the core numerical
                0049    code. An overview of the package scheme is given at the start of
                0050    :numref:`packagesI`.
                0051 
                0052 #. A support framework called WRAPPER (Wrappable Application
                0053    Parallel Programming Environment Resource), within which the core
                0054    numerics and pluggable packages operate.
                0055 
                0056  .. figure:: figs/mitgcm_goals.*
                0057     :width: 70%
                0058     :align: center
                0059     :alt: span of mitgcm goals
                0060     :name: mitgcm_goals
                0061 
                0062     The MITgcm architecture is designed to allow simulation of a wide range of physical problems on a wide range of hardware. The computational resource requirements of the applications targeted range from around 10\ :sup:`7` bytes ( :math:`\approx` 10 megabytes) of memory to 10\ :sup:`11` bytes ( :math:`\approx` 100 gigabytes). Arithmetic operation counts for the applications of interest range from 10\ :sup:`9` floating point operations to more than 10\ :sup:`17` floating point operations.
                0063 
                0064 
                0065 This chapter focuses on describing the WRAPPER environment under
                0066 which both the core numerics and the pluggable packages function. The
                0067 description presented here is intended to be a detailed exposition and
                0068 contains significant background material, as well as advanced details on
                0069 working with the WRAPPER. The “Getting Started” chapter of this manual
                0070 (:numref:`chap_getting_started`) contains more succinct, step-by-step
                0071 instructions on running basic numerical experiments both sequentially
                0072 and in parallel. For many projects simply starting from an example code
                0073 and adapting it to suit a particular situation will be all that is
                0074 required.
                0075 
af61fa61c7 Jeff*0076 .. _wrapper:
                0077 
4c8caab8bd Jeff*0078 WRAPPER
                0079 =======
                0080 
                0081 A significant element of the software architecture utilized in MITgcm is
                0082 a software superstructure and substructure collectively called the
                0083 WRAPPER (Wrappable Application Parallel Programming Environment
                0084 Resource). All numerical and support code in MITgcm is written to “fit”
                0085 within the WRAPPER infrastructure. Writing code to fit within the
                0086 WRAPPER means that coding has to follow certain, relatively
                0087 straightforward, rules and conventions (these are discussed further in
                0088 :numref:`specify_decomp`).
                0089 
                0090 The approach taken by the WRAPPER is illustrated in :numref:`fit_in_wrapper`,
                0091 which shows how the WRAPPER serves to insulate
                0092 code that fits within it from architectural differences between hardware
                0093 platforms and operating systems. This allows numerical code to be easily
                0094 retargeted.
                0095 
                0096  .. figure:: figs/fit_in_wrapper.png
                0097     :width: 70%
                0098     :align: center
                0099     :alt: schematic of a wrapper
                0100     :name: fit_in_wrapper
                0101 
                0102     Numerical code is written to fit within a software support infrastructure called WRAPPER. The WRAPPER is portable and can be specialized for a wide range of specific target hardware and programming environments, without impacting numerical code that fits within the WRAPPER. Codes that fit within the WRAPPER can generally be made to run as fast on a particular platform as codes specially optimized for that platform.
                0103 
                0104 .. _target_hardware:
                0105 
                0106 Target hardware
                0107 ---------------
                0108 
The WRAPPER is designed to target as broad a range of computer systems
as possible. The original development of the WRAPPER took place on
                0111 a multi-processor, CRAY Y-MP system. On that system, numerical code
                0112 performance and scaling under the WRAPPER was in excess of that of an
                0113 implementation that was tightly bound to the CRAY system’s proprietary
                0114 multi-tasking and micro-tasking approach. Later developments have been
                0115 carried out on uniprocessor and multiprocessor Sun systems with both
                0116 uniform memory access (UMA) and non-uniform memory access (NUMA)
                0117 designs. Significant work has also been undertaken on x86 cluster
                0118 systems, Alpha processor based clustered SMP systems, and on
                0119 cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
                0120 systems. The MITgcm code, operating within the WRAPPER, is also
                0121 routinely used on large scale MPP systems (for example, Cray T3E and IBM
                0122 SP systems). In all cases, numerical code, operating within the WRAPPER,
                0123 performs and scales very competitively with equivalent numerical code
                0124 that has been modified to contain native optimizations for a particular
system (see Hoe et al. 1999) :cite:`hoe:99`.
                0126 
                0127 Supporting hardware neutrality
                0128 ------------------------------
                0129 
                0130 The different systems mentioned in :numref:`target_hardware` can be
                0131 categorized in many different ways. For example, one common distinction
                0132 is between shared-memory parallel systems (SMP and PVP) and distributed
                0133 memory parallel systems (for example x86 clusters and large MPP
                0134 systems). This is one example of a difference between compute platforms
                0135 that can impact an application. Another common distinction is between
                0136 vector processing systems with highly specialized CPUs and memory
                0137 subsystems and commodity microprocessor based systems. There are
                0138 numerous other differences, especially in relation to how parallel
                0139 execution is supported. To capture the essential differences between
                0140 different platforms the WRAPPER uses a *machine model*.
                0141 
                0142 WRAPPER machine model
                0143 ---------------------
                0144 
                0145 Applications using the WRAPPER are not written to target just one
                0146 particular machine (for example an IBM SP2) or just one particular
                0147 family or class of machines (for example Parallel Vector Processor
                0148 Systems). Instead the WRAPPER provides applications with an abstract
                0149 *machine model*. The machine model is very general; however, it can
                0150 easily be specialized to fit, in a computationally efficient manner, any
                0151 computer architecture currently available to the scientific computing
                0152 community.
                0153 
                0154 Machine model parallelism
                0155 -------------------------
                0156 
                0157 Codes operating under the WRAPPER target an abstract machine that is
                0158 assumed to consist of one or more logical processors that can compute
                0159 concurrently. Computational work is divided among the logical processors
                0160 by allocating “ownership” to each processor of a certain set (or sets)
                0161 of calculations. Each set of calculations owned by a particular
                0162 processor is associated with a specific region of the physical space
                0163 that is being simulated, and only one processor will be associated with each
                0164 such region (domain decomposition).
                0165 
                0166 In a strict sense the logical processors over which work is divided do
                0167 not need to correspond to physical processors. It is perfectly possible
                0168 to execute a configuration decomposed for multiple logical processors on
                0169 a single physical processor. This helps ensure that numerical code that
                0170 is written to fit within the WRAPPER will parallelize with no additional
                0171 effort. It is also useful for debugging purposes. Generally, however,
                0172 the computational domain will be subdivided over multiple logical
                0173 processors in order to then bind those logical processors to physical
                0174 processor resources that can compute in parallel.
                0175 
37736c7850 Jeff*0176 .. _tile_description:
                0177 
4c8caab8bd Jeff*0178 Tiles
                0179 ~~~~~
                0180 
                0181 Computationally, the data structures (e.g., arrays, scalar variables,
                0182 etc.) that hold the simulated state are associated with each region of
                0183 physical space and are allocated to a particular logical processor. We
                0184 refer to these data structures as being **owned** by the processor to
                0185 which their associated region of physical space has been allocated.
                0186 Individual regions that are allocated to processors are called
                0187 **tiles**. A processor can own more than one tile. :numref:`domain_decomp`
                0188 shows a physical domain being mapped to a set of
                0189 logical processors, with each processor owning a single region of the
                0190 domain (a single tile). Except for periods of communication and
                0191 coordination, each processor computes autonomously, working only with
                0192 data from the tile that the processor owns. If instead multiple
tiles were allotted to a single processor, each of these tiles would be computed
independently of the other allotted tiles, in a sequential fashion.
                0195 
                0196  .. figure:: figs/domain_decomp.png
                0197     :width: 70%
                0198     :align: center
                0199     :alt: domain decomposition
                0200     :name: domain_decomp
                0201 
                0202     The WRAPPER provides support for one and two dimensional decompositions of grid-point domains. The figure shows a hypothetical domain of total size :math:`N_{x}N_{y}N_{z}`. This hypothetical domain is decomposed in two-dimensions along the :math:`N_{x}` and :math:`N_{y}` directions. The resulting tiles are owned by different processors. The owning processors perform the arithmetic operations associated with a tile. Although not illustrated here, a single processor can own several tiles. Whenever a processor wishes to transfer data between tiles or communicate with other processors it calls a WRAPPER supplied function.
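
In code, this ownership is reflected in the way model fields are declared.
The sketch below is illustrative only (the field name ``theta`` and the
parameter values are invented for this example); in MITgcm the parameters
``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx`` and ``nSy`` come from
:filelink:`SIZE.h <model/inc/SIZE.h>` and are described in
:numref:`specify_decomp`.

.. code-block:: fortran

    C     Illustrative declaration: each tile stores its interior
    C     (sNx x sNy points) plus an overlap region of width OLx, OLy.
    C     The two trailing dimensions index the tiles owned by this process.
          INTEGER sNx, sNy, OLx, OLy, nSx, nSy
          PARAMETER ( sNx = 45, sNy = 20, OLx = 3, OLy = 3 )
          PARAMETER ( nSx = 2, nSy = 2 )
          REAL*8 theta(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)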
                0203 
                0204 Tile layout
                0205 ~~~~~~~~~~~
                0206 
                0207 Tiles consist of an interior region and an overlap region. The overlap
                0208 region of a tile corresponds to the interior region of an adjacent tile.
                0209 In :numref:`tiled-world` each tile would own the region within the
                0210 black square and hold duplicate information for overlap regions
                0211 extending into the tiles to the north, south, east and west. During
                0212 computational phases a processor will reference data in an overlap
                0213 region whenever it requires values that lie outside the domain it owns.
                0214 Periodically processors will make calls to WRAPPER functions to
                0215 communicate data between tiles, in order to keep the overlap regions up
                0216 to date (see :numref:`comm_primitives`). The WRAPPER
                0217 functions can use a variety of different mechanisms to communicate data
                0218 between tiles.
                0219 
                0220  .. figure:: figs/tiled-world.png
                0221     :width: 70%
                0222     :align: center
                0223     :alt: global earth subdivided into tiles
                0224     :name: tiled-world
                0225 
    A global grid subdivided into tiles. Tiles contain an interior region and an overlap region. Overlap regions are periodically updated from neighboring tiles.
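
To make the use of the overlap region concrete, the fragment below (an
illustrative sketch, not MITgcm source) applies a five-point average to the
interior of every tile owned by a thread; the values read at ``i-1``, ``i+1``,
``j-1`` and ``j+1`` come from the overlap region whenever ``i`` or ``j`` lies
on a tile edge. The loop structure and array shapes follow the conventions
described in :numref:`specify_decomp`, and the final call stands for one of
the WRAPPER exchange routines (the variant name ``EXCH_XY_RL`` is shown
purely as an example).

.. code-block:: fortran

    C     Illustrative sketch: smooth phi into phiSm on the interior of each
    C     tile owned by this thread, reading neighbor values from the overlap.
          REAL*8 phi  (1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
          REAL*8 phiSm(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)

          DO bj=myByLo(myThid),myByHi(myThid)
           DO bi=myBxLo(myThid),myBxHi(myThid)
            DO j=1,sNy
             DO i=1,sNx
              phiSm(i,j,bi,bj) = 0.25*(
         &        phi(i-1,j,bi,bj) + phi(i+1,j,bi,bj)
         &      + phi(i,j-1,bi,bj) + phi(i,j+1,bi,bj) )
             ENDDO
            ENDDO
           ENDDO
          ENDDO

    C     Bring the overlap regions of phiSm up to date before it is used
    C     as input to another stencil operation.
          CALL EXCH_XY_RL( phiSm, myThid )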
                0227 
                0228 
                0229 Communication mechanisms
                0230 ------------------------
                0231 
                0232 Logical processors are assumed to be able to exchange information
                0233 between tiles (and between each other) using at least one of two possible
                0234 mechanisms, shared memory or distributed memory communication. 
                0235 The WRAPPER assumes that communication will use one of these two styles.
                0236 The underlying hardware and operating system support
                0237 for the style used is not specified and can vary from system to system.
                0238 
                0239 .. _shared_mem_comm:
                0240 
                0241 Shared memory communication
                0242 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
                0243 
                0244 Under this mode of communication, data transfers are assumed to be possible using direct addressing of
                0245 regions of memory.  In the WRAPPER shared memory communication model,
                0246 simple writes to an array can be made to be visible to other CPUs at 
                0247 the application code level. So, as shown below, if one CPU (CPU1) writes
                0248 the value 8 to element 3 of array a, then other CPUs (here, CPU2) will be
                0249 able to see the value 8 when they read from a(3). This provides a very low
                0250 latency and high bandwidth communication mechanism. Thus, in this way one CPU 
                0251 can communicate information to another CPU by assigning a particular value to a particular memory
                0252 location. 
                0253 
                0254 ::
                0255 
                0256 
                0257              CPU1                    |        CPU2
                0258              ====                    |        ====
                0259                                      |
                0260            a(3) = 8                  |        WHILE ( a(3) .NE. 8 ) 
                0261                                      |         WAIT
                0262                                      |        END WHILE
                0263                                      |
                0264 
                0265 
Under shared memory communication, independent CPUs operate on the exact
                0267 same global address space at the application level. This is the model of
                0268 memory access that is supported at the basic system design level in
                0269 “shared-memory” systems such as PVP systems, SMP systems, and on
                0270 distributed shared memory systems (e.g., SGI Origin, SGI Altix, and some
                0271 AMD Opteron systems). On such systems the WRAPPER will generally use
                0272 simple read and write statements to access directly application data
                0273 structures when communicating between CPUs.
                0274 
In a system where assignment statements map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication. In such cases multiple CPUs
can communicate simply by reading and writing to agreed
                0279 locations and following a few basic rules. The latency of this sort of
                0280 communication is generally not that much higher than the hardware
                0281 latency of other memory accesses on the system. The bandwidth available
                0282 between CPUs communicating in this way can be close to the bandwidth of
the system's main-memory interconnect. This can make this method of
                0284 communication very efficient provided it is used appropriately.
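
The fragment below is not WRAPPER code; it is a self-contained sketch of the
two-CPU example above using OpenMP threads (an assumption made purely for
illustration). One thread writes directly into a shared array and a second
thread spins until the value becomes visible. The explicit ``FLUSH``
directives are an example of the kind of memory-consistency operation
discussed in the next subsection.

.. code-block:: fortran

    C     Illustrative only (not part of MITgcm): shared memory communication
    C     between two OpenMP threads through a shared array.
          PROGRAM SHARED_DEMO
          IMPLICIT NONE
          INTEGER a(5)
          INTEGER myId
          INTEGER OMP_GET_THREAD_NUM
          EXTERNAL OMP_GET_THREAD_NUM
          a(3) = 0
    !$OMP PARALLEL SHARED(a) PRIVATE(myId) NUM_THREADS(2)
          myId = OMP_GET_THREAD_NUM()
          IF ( myId .EQ. 0 ) THEN
    C       "CPU1": a simple write into the shared array
            a(3) = 8
    !$OMP   FLUSH(a)
          ELSE
    C       "CPU2": schematic spin-wait until the write becomes visible
            DO WHILE ( a(3) .NE. 8 )
    !$OMP     FLUSH(a)
            ENDDO
          ENDIF
    !$OMP END PARALLEL
          PRINT *, 'a(3) = ', a(3)
          END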
                0285 
                0286 Memory consistency
                0287 ##################
                0288 
                0289 When using shared memory communication between multiple processors, the
                0290 WRAPPER level shields user applications from certain counter-intuitive
                0291 system behaviors. In particular, one issue the WRAPPER layer must deal
with is a system's memory model. In general, the order of reads and writes
                0293 expressed by the textual order of an application code may not be the
                0294 ordering of instructions executed by the processor performing the
                0295 application. The processor performing the application instructions will
                0296 always operate so that, for the application instructions the processor
                0297 is executing, any reordering is not apparent. However,
                0298 machines are often designed so that reordering of instructions is not
hidden from other processors. This means that, in general, even
                0300 on a shared memory system two processors can observe inconsistent memory
                0301 values.
                0302 
                0303 The issue of memory consistency between multiple processors is discussed
                0304 at length in many computer science papers. From a practical point of
                0305 view, in order to deal with this issue, shared memory machines all
                0306 provide some mechanism to enforce memory consistency when it is needed.
                0307 The exact mechanism employed will vary between systems. For
                0308 communication using shared memory, the WRAPPER provides a place to
                0309 invoke the appropriate mechanism to ensure memory consistency for a
                0310 particular platform.
                0311 
                0312 Cache effects and false sharing
                0313 ###############################
                0314 
                0315 Shared-memory machines often have local-to-processor memory caches which
                0316 contain mirrored copies of main memory. Automatic cache-coherence
                0317 protocols are used to maintain consistency between caches on different
                0318 processors. These cache-coherence protocols typically enforce
                0319 consistency between regions of memory with large granularity (typically
                0320 128 or 256 byte chunks). The coherency protocols employed can be
                0321 expensive relative to other memory accesses and so care is taken in the
                0322 WRAPPER (by padding synchronization structures appropriately) to avoid
                0323 unnecessary coherence traffic.
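
The fragment below sketches the kind of padding involved. It is illustrative
only (the variable names and the assumed 128-byte cache line are not taken
from the MITgcm source): per-thread partial sums are spaced one cache line
apart so that updates made by different threads never touch the same
coherence unit.

.. code-block:: fortran

    C     Illustrative only: pad a per-thread accumulator so that each
    C     thread's entry occupies its own cache line (assumed here to be
    C     128 bytes, i.e. 16 REAL*8 words), avoiding false sharing.
          INTEGER cacheLineWords, maxThreads
          PARAMETER ( cacheLineWords = 16, maxThreads = 32 )
          REAL*8 partialSum( cacheLineWords, maxThreads )
    C     Each thread updates only partialSum(1,myThid); elements 2 to
    C     cacheLineWords of the first dimension are never referenced and
    C     act purely as padding.
          partialSum(1,myThid) = partialSum(1,myThid) + localSum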
                0324 
                0325 Operating system support for shared memory
                0326 ##########################################
                0327 
                0328 Applications running under multiple threads within a single process can
                0329 use shared memory communication. In this case *all* the memory locations
                0330 in an application are potentially visible to all the compute threads.
                0331 Multiple threads operating within a single process is the standard
                0332 mechanism for supporting shared memory that the WRAPPER utilizes.
                0333 Configuring and launching code to run in multi-threaded mode on specific
                0334 platforms is discussed in :numref:`multi-thread_exe`.
                0335 However, on many systems, potentially very efficient mechanisms for
                0336 using shared memory communication between multiple processes (in
                0337 contrast to multiple threads within a single process) also exist. In
                0338 most cases this works by making a limited region of memory shared
                0339 between processes. The MMAP and IPC facilities in UNIX systems provide
                0340 this capability as do vendor specific tools like LAPI and IMC.
                0341 Extensions exist for the WRAPPER that allow these mechanisms to be used
                0342 for shared memory communication. However, these mechanisms are not
                0343 distributed with the default WRAPPER sources, because of their
                0344 proprietary nature.
                0345 
                0346 .. _distributed_mem_comm:
                0347 
                0348 Distributed memory communication
                0349 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                0350 
                0351 Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned by and visible to
                0353 another CPU. Instead a communication library must be used, as
                0354 illustrated below. If one CPU (here, CPU1) writes the value 8 to element 3 of array a, 
then at least one of CPU1 and CPU2 will need to call a
                0356 function in the API of the communication library to communicate data
                0357 from a tile that it owns to a tile that another CPU owns. By default
                0358 the WRAPPER binds to the MPI communication library
                0359 for this style of communication (see https://computing.llnl.gov/tutorials/mpi/ 
                0360 for more information about the MPI Standard).
                0361 
                0362 ::
                0363 
                0364 
                0365              CPU1                    |        CPU2
                0366              ====                    |        ====
                0367                                      |
                0368            a(3) = 8                  |        WHILE ( a(3) .NE. 8 )
                0369            CALL SEND( CPU2,a(3) )    |         CALL RECV( CPU1, a(3) )
                0370                                      |        END WHILE
                0371                                      |
                0372 
                0373 
                0374 Many parallel systems are not constructed in a way where it is possible
                0375 or practical for an application to use shared memory for communication.
                0376 For cluster systems consisting of individual computers connected by
                0377 a fast network, there is no notion of shared memory at
                0378 the system level. For this sort of system the WRAPPER provides support
                0379 for communication based on a bespoke communication library. 
                0380 The default communication library used is MPI. It is relatively
                0381 straightforward to implement bindings to optimized platform specific
                0382 communication libraries. For example the work described in
                0383 Hoe et al. (1999) :cite:`hoe:99` substituted standard MPI communication
                0384 for a highly optimized library.
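
As a concrete illustration of this style, the self-contained fragment below
(a minimal sketch, not MITgcm code) expresses the CPU1/CPU2 example above
using standard MPI library calls; in MITgcm itself such calls are hidden
inside the WRAPPER communication routines.

.. code-block:: fortran

    C     Illustrative only: process 0 sends one element of array a to
    C     process 1 through the MPI communication library.
    C     Run with at least two MPI processes.
          PROGRAM DISTRIBUTED_DEMO
          IMPLICIT NONE
          INCLUDE 'mpif.h'
          INTEGER a(5)
          INTEGER rank, ierr
          INTEGER status(MPI_STATUS_SIZE)
          CALL MPI_INIT( ierr )
          CALL MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
          IF ( rank .EQ. 0 ) THEN
    C       "CPU1": assign the value, then send it explicitly
            a(3) = 8
            CALL MPI_SEND( a(3), 1, MPI_INTEGER, 1, 0,
         &                 MPI_COMM_WORLD, ierr )
          ELSE IF ( rank .EQ. 1 ) THEN
    C       "CPU2": the value only arrives through a matching receive
            CALL MPI_RECV( a(3), 1, MPI_INTEGER, 0, 0,
         &                 MPI_COMM_WORLD, status, ierr )
            PRINT *, 'received a(3) = ', a(3)
          ENDIF
          CALL MPI_FINALIZE( ierr )
          END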
                0385 
                0386 .. _comm_primitives:
                0387 
                0388 Communication primitives
                0389 ------------------------
                0390 
                0391 Optimized communication support is assumed to be potentially available
                0392 for a small number of communication operations. It is also assumed that
                0393 communication performance optimizations can be achieved by optimizing a
                0394 small number of communication primitives. Three optimizable primitives
                0395 are provided by the WRAPPER.
                0396 
                0397 
                0398  .. figure:: figs/comm-prim.png
                0399     :width: 70%
                0400     :align: center
                0401     :alt: global sum and exchange comm primitives
                0402     :name: comm-prim
                0403 
                0404     Three performance critical parallel primitives are provided by the WRAPPER. These primitives are always used to communicate data between tiles. The figure shows four tiles. The curved arrows indicate exchange primitives which transfer data between the overlap regions at tile edges and interior regions for nearest-neighbor tiles. The straight arrows symbolize global sum operations which connect all tiles. The global sum operation provides both a key arithmetic primitive and can serve as a synchronization primitive. A third barrier primitive is also provided, which behaves much like the global sum primitive.
                0405 
                0406 
                0407 -  **EXCHANGE** This operation is used to transfer data between interior
                0408    and overlap regions of neighboring tiles. A number of different forms
                0409    of this operation are supported. These different forms handle:
                0410 
                0411    -  Data type differences. Sixty-four bit and thirty-two bit fields
                0412       may be handled separately.
                0413 
                0414    -  Bindings to different communication methods. Exchange primitives
                0415       select between using shared memory or distributed memory
                0416       communication.
                0417 
                0418    -  Transformation operations required when transporting data between
                0419       different grid regions. Transferring data between faces of a
                0420       cube-sphere grid, for example, involves a rotation of vector
                0421       components.
                0422 
                0423    -  Forward and reverse mode computations. Derivative calculations
                0424       require tangent linear and adjoint forms of the exchange
                0425       primitives.
                0426 
                0427 -  **GLOBAL SUM** The global sum operation is a central arithmetic
                0428    operation for the pressure inversion phase of the MITgcm algorithm.
                0429    For certain configurations, scaling can be highly sensitive to the
                0430    performance of the global sum primitive. This operation is a
                0431    collective operation involving all tiles of the simulated domain.
                0432    Different forms of the global sum primitive exist for handling:
                0433 
                0434    -  Data type differences. Sixty-four bit and thirty-two bit fields
                0435       may be handled separately.
                0436 
   -  Bindings to different communication methods. Global sum primitives
      select between using shared memory or distributed memory
      communication.

   -  Forward and reverse mode computations. Derivative calculations
      require tangent linear and adjoint forms of the global sum
      primitives.
                0444 
                0445 -  **BARRIER** The WRAPPER provides a global synchronization function
                0446    called barrier. This is used to synchronize computations over all
                0447    tiles. The **BARRIER** and **GLOBAL SUM** primitives have much in
                0448    common and in some cases use the same underlying code.
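
Schematically, these three primitives appear in application code as simple
subroutine calls, as in the fragment below. ``GLOBAL_SUM_R8`` is the form
used in the code extract later in this chapter; the barrier and exchange
routine names are shown purely as examples of the calling pattern (the
exchange routine is one of several variants).

.. code-block:: fortran

    C     Bring the overlap regions of a tiled field up to date:
          CALL EXCH_XY_RL( phi, myThid )
    C     Form a global sum over all tiles (every thread and process
    C     participates):
          CALL GLOBAL_SUM_R8( sumPhi, myThid )
    C     Synchronize all threads and processes:
          CALL BARRIER( myThid )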
                0449 
                0450 Memory architecture
                0451 -------------------
                0452 
The WRAPPER machine model aims to target efficiently both systems with
                0454 highly pipelined memory architectures and systems with deep memory
                0455 hierarchies that favor memory reuse. This is achieved by supporting a
                0456 flexible tiling strategy as shown in :numref:`tiling_detail`.
                0457 Within a CPU, computations are carried out sequentially on each tile in
                0458 turn. By reshaping tiles according to the target platform it is possible
                0459 to automatically tune code to improve memory performance. On a vector
                0460 machine a given domain might be subdivided into a few long, thin
                0461 regions. On a commodity microprocessor based system, however, the same
region could be simulated using many smaller sub-domains.
                0463 
                0464  .. figure:: figs/tiling_detail.png
                0465     :width: 70%
                0466     :align: center
                0467     :alt: tiling strategy in WRAPPER
                0468     :name: tiling_detail
                0469 
                0470     The tiling strategy that the WRAPPER supports allows tiles to be shaped to suit the underlying system memory architecture. Compact tiles that lead to greater memory reuse can be used on cache based systems (upper half of figure) with deep memory hierarchies, whereas long tiles with large inner loops can be used to exploit vector systems having highly pipelined memory systems.
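
For example, expressed in terms of the :filelink:`SIZE.h <model/inc/SIZE.h>`
parameters described in :numref:`specify_decomp`, a hypothetical
:math:`256 \times 128` point domain could be tiled in either of the two
alternative ways sketched below (the numbers are purely illustrative and
other settings are omitted):

.. code-block:: fortran

    C     Long, thin tiles with long inner loops, suited to a vector system:
             PARAMETER ( sNx = 256, sNy =  8, nSx = 1, nSy = 16 )
    C     Small, compact tiles with greater memory reuse, suited to a
    C     cache-based system:
             PARAMETER ( sNx =  32, sNy = 32, nSx = 8, nSy =  4 )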
                0471 
                0472 Summary
                0473 -------
                0474 
                0475 Following the discussion above, the machine model that the WRAPPER
                0476 presents to an application has the following characteristics:
                0477 
                0478 -  The machine consists of one or more logical processors.
                0479 
                0480 -  Each processor operates on tiles that it owns.
                0481 
                0482 -  A processor may own more than one tile.
                0483 
                0484 -  Processors may compute concurrently.
                0485 
                0486 -  Exchange of information between tiles is handled by the machine
                0487    (WRAPPER) not by the application.
                0488 
                0489 Behind the scenes this allows the WRAPPER to adapt the machine model
                0490 functions to exploit hardware on which:
                0491 
                0492 -  Processors may be able to communicate very efficiently with each
                0493    other using shared memory.
                0494 
                0495 -  An alternative communication mechanism based on a relatively simple
                0496    interprocess communication API may be required.
                0497 
                0498 -  Shared memory may not necessarily obey sequential consistency,
                0499    however some mechanism will exist for enforcing memory consistency.
                0500 
                0501 -  Memory consistency that is enforced at the hardware level may be
                0502    expensive. Unnecessary triggering of consistency protocols should be
                0503    avoided.
                0504 
                0505 -  Memory access patterns may need to be either repetitive or highly
                0506    pipelined for optimum hardware performance.
                0507 
                0508 This generic model, summarized in :numref:`tiles_and_wrapper`, captures the essential hardware ingredients of almost
                0509 all successful scientific computer systems designed in the last 50
                0510 years.
                0511 
                0512  .. figure:: figs/tiles_and_wrapper.png
                0513     :width: 85%
                0514     :align: center
                0515     :alt: summary figure tiles and wrapper
                0516     :name: tiles_and_wrapper
                0517 
                0518     Summary of the WRAPPER machine model.
                0519 
4bad209eba Jeff*0520 .. _using_wrapper:
                0521 
4c8caab8bd Jeff*0522 Using the WRAPPER
                0523 =================
                0524 
                0525 In order to support maximum portability the WRAPPER is implemented
                0526 primarily in sequential Fortran 77. At a practical level the key steps
                0527 provided by the WRAPPER are:
                0528 
                0529 #. specifying how a domain will be decomposed
                0530 
#. starting a code in either sequential or parallel modes of operation
                0532 
                0533 #. controlling communication between tiles and between concurrently
                0534    computing CPUs.
                0535 
                0536 This section describes the details of each of these operations.
:numref:`specify_decomp` explains how the decomposition (or
composition) of a domain is expressed. :numref:`starting_code`
                0539 describes practical details of running codes
                0540 in various different parallel modes on contemporary computer systems.
                0541 :numref:`controlling_comm` explains the internal
                0542 information that the WRAPPER uses to control how information is
                0543 communicated between tiles.
                0544 
                0545 .. _specify_decomp:
                0546 
                0547 Specifying a domain decomposition
                0548 ---------------------------------
                0549 
                0550 At its heart, much of the WRAPPER works only in terms of a collection
                0551 of tiles which are interconnected to each other. This is also true of
                0552 application code operating within the WRAPPER. Application code is
                0553 written as a series of compute operations, each of which operates on a
                0554 single tile. If application code needs to perform operations involving
                0555 data associated with another tile, it uses a WRAPPER function to
                0556 obtain that data. The specification of how a global domain is
                0557 constructed from tiles or alternatively how a global domain is
                0558 decomposed into tiles is made in the file :filelink:`SIZE.h <model/inc/SIZE.h>`. This file defines
                0559 the following parameters:
                0560 
                0561 .. admonition:: File: :filelink:`model/inc/SIZE.h`
                0562   :class: note
                0563 
    | Parameter: :varlink:`sNx`, :varlink:`sNy`
                0565     | Parameter: :varlink:`OLx`, :varlink:`OLy`
                0566     | Parameter: :varlink:`nSx`, :varlink:`nSy`
                0567     | Parameter: :varlink:`nPx`, :varlink:`nPy`
                0568 
                0569 
                0570 Together these parameters define a tiling decomposition of the style
shown in :numref:`size_h`. The parameters ``sNx`` and ``sNy``
                0572 define the size of an individual tile. The parameters ``OLx`` and ``OLy``
                0573 define the maximum size of the overlap extent. This must be set to the
                0574 maximum width of the computation stencil that the numerical code
                0575 finite-difference operations require between overlap region updates.
                0576 The maximum overlap required by any of the operations in the MITgcm
                0577 code distributed at this time is four grid points (some of the higher-order advection schemes
                0578 require a large overlap region). Code modifications and enhancements that involve adding wide
                0579 finite-difference stencils may require increasing ``OLx`` and ``OLy``.
Setting ``OLx`` and ``OLy`` to too large a value will decrease code
performance (because redundant computations will be performed);
however, it will not cause any other problems.
                0583 
                0584  .. figure:: figs/size_h.png
                0585     :width: 80%
                0586     :align: center
                0587     :alt: explanation of SIZE.h domain decomposition
                0588     :name: size_h
                0589 
                0590     The three level domain decomposition hierarchy employed by the WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated to a single process. Multiple processes can exist, each with multiple tiles. Tiles within a process can be spread over multiple compute threads.
                0591 
                0592 The parameters ``nSx`` and ``nSy`` specify the number of tiles that will be
                0593 created within a single process. Each of these tiles will have internal
                0594 dimensions of ``sNx`` and ``sNy``. If, when the code is executed, these
                0595 tiles are allocated to different threads of a process that are then
                0596 bound to different physical processors (see the multi-threaded
                0597 execution discussion in :numref:`starting_code`), then
                0598 computation will be performed concurrently on each tile. However, it is
                0599 also possible to run the same decomposition within a process running a
                0600 single thread on a single processor. In this case the tiles will be
                0601 computed over sequentially. If the decomposition is run in a single
                0602 process running multiple threads but attached to a single physical
                0603 processor, then, in general, the computation for different tiles will be
                0604 interleaved by system level software. This too is a valid mode of
                0605 operation.
                0606 
                0607 The parameters ``sNx``, ``sNy``, ``OLx``, ``OLy``, 
                0608 ``nSx`` and ``nSy`` are used extensively
                0609 by numerical code. The settings of ``sNx``, ``sNy``, ``OLx``, and ``OLy`` are used to
                0610 form the loop ranges for many numerical calculations and to provide
                0611 dimensions for arrays holding numerical state. The ``nSx`` and ``nSy`` are
                0612 used in conjunction with the thread number parameter ``myThid``. Much of
                0613 the numerical code operating within the WRAPPER takes the form:
                0614 
                0615 .. code-block:: fortran
                0616 
                0617           DO bj=myByLo(myThid),myByHi(myThid)
                0618            DO bi=myBxLo(myThid),myBxHi(myThid)
                0619               :
                0620               a block of computations ranging 
                0621               over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
                0622               :
                0623            ENDDO
                0624           ENDDO
                0625 
                0626           communication code to sum a number or maybe update
                0627           tile overlap regions
                0628 
                0629           DO bj=myByLo(myThid),myByHi(myThid)
                0630            DO bi=myBxLo(myThid),myBxHi(myThid)
                0631               :
                0632               another block of computations ranging 
                0633               over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
                0634               :
                0635            ENDDO
                0636           ENDDO
                0637 
                0638 The variables ``myBxLo(myThid)``, ``myBxHi(myThid)``, ``myByLo(myThid)`` and
                0639 ``myByHi(myThid)`` set the bounds of the loops in ``bi`` and ``bj`` in this
                0640 schematic. These variables specify the subset of the tiles in the range
``1, nSx`` and ``1, nSy`` that the logical processor bound to thread
                0642 number ``myThid`` owns. The thread number variable ``myThid`` ranges from 1
                0643 to the total number of threads requested at execution time. For each
                0644 value of ``myThid`` the loop scheme above will step sequentially through
                0645 the tiles owned by that thread. However, different threads will have
                0646 different ranges of tiles assigned to them, so that separate threads can
                0647 compute iterations of the ``bi``, ``bj`` loop concurrently. Within a ``bi``,
                0648 ``bj`` loop, computation is performed concurrently over as many processes
                0649 and threads as there are physical processors available to compute.
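
For illustration, assuming tiles are divided evenly among threads, a
decomposition with ``nSx = 4``, ``nSy = 1`` run with two threads might give
the thread ranges

::

       myBxLo(1) = 1,  myBxHi(1) = 2,  myByLo(1) = 1,  myByHi(1) = 1
       myBxLo(2) = 3,  myBxHi(2) = 4,  myByLo(2) = 1,  myByHi(2) = 1

so that thread 1 steps through tiles ``bi = 1, 2`` while thread 2 concurrently
steps through tiles ``bi = 3, 4``.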
                0650 
An exception to the use of ``bi`` and ``bj`` in loops arises in the
                0652 exchange routines used when the :ref:`exch2 package <sub_phys_pkg_exch2>` is used with the cubed
                0653 sphere. In this case ``bj`` is generally set to 1 and the loop runs from
                0654 ``1, bi``. Within the loop ``bi`` is used to retrieve the tile number,
                0655 which is then used to reference exchange parameters.
                0656 
                0657 The amount of computation that can be embedded in a single loop over ``bi``
                0658 and ``bj`` varies for different parts of the MITgcm algorithm.
Consider a code extract from the 2-D implicit elliptic solver:

.. code-block:: fortran
4c8caab8bd Jeff*0662 
                0663           REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
                0664           REAL*8  err
                0665               :
                0666               :
                0667             other computations
                0668               :
                0669               :
                0670           err = 0.
                0671           DO bj=myByLo(myThid),myByHi(myThid)
                0672            DO bi=myBxLo(myThid),myBxHi(myThid)
                0673             DO J=1,sNy
                0674              DO I=1,sNx
                0675                err = err + cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
                0676              ENDDO
                0677             ENDDO
                0678            ENDDO
                0679           ENDDO
                0680 
                0681           CALL GLOBAL_SUM_R8( err   , myThid )
                0682           err = SQRT(err)
                0683 
                0684 
                0685 
                0686 This portion of the code computes the :math:`L_2`\ -Norm
                0687 of a vector whose elements are held in the array ``cg2d_r``, writing the
                0688 final result to scalar variable ``err``. Notice that under the WRAPPER,
arrays such as ``cg2d_r`` have two extra trailing dimensions. These
rightmost indices are tile indexes. Different threads within a single process
                0691 operate on different ranges of tile index, as controlled by the settings
                0692 of ``myByLo(myThid)``, ``myByHi(myThid)``, ``myBxLo(myThid)`` and
                0693 ``myBxHi(myThid)``. Because the :math:`L_2`\ -Norm
                0694 requires a global reduction, the ``bi``, ``bj`` loop above only contains one
                0695 statement. This computation phase is then followed by a communication
                0696 phase in which all threads and processes must participate. However, in
other areas of the MITgcm code, entire subsections of code are within a
                0698 single ``bi``, ``bj`` loop. For example the evaluation of all the momentum
                0699 equation prognostic terms (see :filelink:`dynamics.F <model/src/dynamics.F>`) is within a single
                0700 ``bi``, ``bj`` loop.
                0701 
                0702 The final decomposition parameters are ``nPx`` and ``nPy``. These parameters
                0703 are used to indicate to the WRAPPER level how many processes (each with
                0704 ``nSx``\ :math:`\times`\ ``nSy`` tiles) will be used for this simulation.
                0705 This information is needed during initialization and during I/O phases.
                0706 However, unlike the variables ``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx`` and ``nSy`` the
                0707 values of ``nPx`` and ``nPy`` are absent from the core numerical and support
                0708 code.
                0709 
                0710 Examples of :filelink:`SIZE.h <model/inc/SIZE.h>` specifications
a7275ae7c1 Oliv*0711 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4c8caab8bd Jeff*0712 
The following different :filelink:`SIZE.h <model/inc/SIZE.h>` parameter settings illustrate how to
                0714 interpret the values of ``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx``, ``nSy``, ``nPx`` and ``nPy``.
                0715 
                0716 #. ::
                0717 
                0718              PARAMETER (
                0719             &           sNx =  90,
                0720             &           sNy =  40,
                0721             &           OLx =   3,
                0722             &           OLy =   3,
                0723             &           nSx =   1,
                0724             &           nSy =   1,
                0725             &           nPx =   1,
                0726             &           nPy =   1)
                0727 
                0728    This sets up a single tile with *x*-dimension of ninety grid points,
                0729    *y*-dimension of forty grid points, and *x* and *y* overlaps of three grid
                0730    points each.
                0731 
                0732 #. ::
                0733 
                0734              PARAMETER (
                0735             &           sNx =  45,
                0736             &           sNy =  20,
                0737             &           OLx =   3,
                0738             &           OLy =   3,
                0739             &           nSx =   1,
                0740             &           nSy =   1,
                0741             &           nPx =   2,
                0742             &           nPy =   2)
                0743 
                0744    This sets up tiles with *x*-dimension of forty-five grid points,
                0745    *y*-dimension of twenty grid points, and *x* and *y* overlaps of three grid
                0746    points each. There are four tiles allocated to four separate
                0747    processes (``nPx=2, nPy=2``) and arranged so that the global domain size
                0748    is again ninety grid points in *x* and forty grid points in *y*. In
                0749    general the formula for global grid size (held in model variables
                0750    ``Nx`` and ``Ny``) is
                0751 
                0752    ::
                0753 
                0754                         Nx  = sNx*nSx*nPx
                0755                         Ny  = sNy*nSy*nPy
                0756 
                0757 #. ::
                0758 
                0759              PARAMETER (
                0760             &           sNx =  90,
                0761             &           sNy =  10,
                0762             &           OLx =   3,
                0763             &           OLy =   3,
                0764             &           nSx =   1,
                0765             &           nSy =   2,
                0766             &           nPx =   1,
                0767             &           nPy =   2)
                0768 
                0769    This sets up tiles with *x*-dimension of ninety grid points,
                0770    *y*-dimension of ten grid points, and *x* and *y* overlaps of three grid
                0771    points each. There are four tiles allocated to two separate processes
                0772    (``nPy=2``) each of which has two separate sub-domains ``nSy=2``. The
                0773    global domain size is again ninety grid points in *x* and forty grid
                0774    points in *y*. The two sub-domains in each process will be computed
                0775    sequentially if they are given to a single thread within a single
                0776    process. Alternatively if the code is invoked with multiple threads
                0777    per process the two domains in y may be computed concurrently.
                0778 
                0779 #. ::
                0780 
                0781              PARAMETER (
                0782             &           sNx =  32,
                0783             &           sNy =  32,
                0784             &           OLx =   3,
                0785             &           OLy =   3,
                0786             &           nSx =   6,
                0787             &           nSy =   1,
                0788             &           nPx =   1,
                0789             &           nPy =   1)
                0790 
                0791    This sets up tiles with *x*-dimension of thirty-two grid points,
                0792    *y*-dimension of thirty-two grid points, and *x* and *y* overlaps of three
                0793    grid points each. There are six tiles allocated to six separate
                0794    logical processors (``nSx=6``). This set of values can be used for a
                0795    cube sphere calculation. Each tile of size :math:`32 \times 32`
                0796    represents a face of the cube. Initializing the tile connectivity
   correctly (see :numref:`cubed_sphere_comm`) allows the
                0798    rotations associated with moving between the six cube faces to be
                0799    embedded within the tile-tile communication code.
                0800 
                0801 .. _starting_code:
                0802 
                0803 Starting the code
                0804 -----------------
                0805 
                0806 When code is started under the WRAPPER, execution begins in a main
                0807 routine :filelink:`eesupp/src/main.F` that is owned by the WRAPPER. Control is
                0808 transferred to the application through a routine called
                0809 :filelink:`model/src/the_model_main.F` once the WRAPPER has initialized correctly and has
                0810 created the necessary variables to support subsequent calls to
                0811 communication routines by the application code. The main stages of the WRAPPER startup calling
                0812 sequence are as follows: 
                0813 
                0814 ::
                0815 
                0816 
                0817            MAIN  
                0818            |
                0819            |--EEBOOT               :: WRAPPER initialization
                0820            |  |
           |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
                0822            |  |                       allow basic I/O.
                0823            |  |-- EEINTRO_MSG      :: Write startup greeting.
                0824            |  |
                0825            |  |-- EESET_PARMS      :: Set WRAPPER parameters
                0826            |  |
                0827            |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
                0828            |  |
                0829            |  |-- INI_PROCS        :: Associate processes with grid regions.
                0830            |  |
                0831            |  |-- INI_THREADING_ENVIRONMENT   :: Associate threads with grid regions.
                0832            |       |
                0833            |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile 
                0834            |                                     :: communication data structures
                0835            |
                0836            |
                0837            |--CHECK_THREADS    :: Validate multiple thread start up.
                0838            |
                0839            |--THE_MODEL_MAIN   :: Numerical code top-level driver routine
                0840 
The steps above precede the transfer of control to application code, which occurs in the procedure :filelink:`the_model_main.F <model/src/the_model_main.F>`.
                0842 
                0843 .. _multi-thread_exe:
                0844 
                0845 Multi-threaded execution
                0846 ~~~~~~~~~~~~~~~~~~~~~~~~
                0847 
Prior to transferring control to the procedure :filelink:`the_model_main.F <model/src/the_model_main.F>`,
the WRAPPER may cause several coarse-grained threads to be initialized.
The routine :filelink:`the_model_main.F <model/src/the_model_main.F>` is called once for each thread and is
passed a single stack argument which is the thread number, stored in
the variable :varlink:`myThid`. In addition to specifying a decomposition with
multiple tiles per process (see :numref:`specify_decomp`), configuring and starting a code to
run using multiple threads requires the following steps.
                0855 
                0856 Compilation
                0857 ###########
                0858 
First, the code must be compiled with the multi-threading directives in
the file :filelink:`eesupp/src/main.F` active and with compiler flags
that request multi-threading support. The header files
:filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h` and :filelink:`eesupp/inc/MAIN_PDIRECTIVES2.h` contain directives
compatible with compilers for Sun, Compaq, SGI, Hewlett-Packard SMP
systems and CRAY PVP systems. These directives are activated by the
compile-time flags ``-DTARGET_SUN``, ``-DTARGET_DEC``,
``-DTARGET_SGI``, ``-DTARGET_HP`` or ``-DTARGET_CRAY_VECTOR``
respectively. Compiler options for invoking multi-threaded compilation
vary from system to system and from compiler to compiler, and
will be described in the individual compiler documentation. For the
Sun Fortran compiler the following options are needed to
compile multi-threaded code correctly:
                0872 
                0873 ::
                0874 
                0875          -stackvar -explicitpar -vpara -noautopar
                0876 
                0877 These options are specific to the Sun compiler. Other compilers will use
                0878 different syntax that will be described in their documentation. The
                0879 effect of these options is as follows:
                0880 
#. **-stackvar** Causes all local variables to be allocated in stack
   storage. This is necessary to ensure that local variables are
   private to their thread. Note that when using this option it may be
   necessary to override the default limit on stack size that the
   operating system assigns to a process. This can normally be done by
   raising the command shell’s stack-size limit (see the sketch just
   after this list). However, on some systems changing this limit will
   require privileged administrator access to modify system parameters.
                0889 
                0890 #. **-explicitpar** Requests that multiple threads be spawned in
                0891    response to explicit directives in the application code. These
                0892    directives are inserted with syntax appropriate to the particular
                0893    target platform when, for example, the ``-DTARGET_SUN`` flag is
                0894    selected.
                0895 
                0896 #. **-vpara** This causes the compiler to describe the multi-threaded
                0897    configuration it is creating. This is not required but it can be
                0898    useful when troubleshooting.
                0899 
                0900 #. **-noautopar** This inhibits any automatic multi-threaded
                0901    parallelization the compiler may otherwise generate.
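
As a minimal sketch of the stack-size adjustment mentioned under
**-stackvar** above (the exact command, and whether an unlimited stack
is permitted, depend on the shell and the operating system):

::

     # sh/bash-style shells: raise or remove the per-process stack limit
     ulimit -s unlimited

     # csh/tcsh-style shells
     limit stacksize unlimited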
                0902 
                0903 An example of valid settings for the ``eedata`` file for a domain with two
                0904 subdomains in *y* and running with two threads is shown below
                0905 
                0906 ::
                0907 
                0908      nTx=1,nTy=2
                0909 
                0910 This set of values will cause computations to stay within a single
                0911 thread when moving across the ``nSx`` sub-domains. In the *y*-direction,
                0912 however, sub-domains will be split equally between two threads.
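
Written out as a complete ``eedata`` namelist (following the file format
shown under Runtime input parameters below), these example settings
would read as follows; this assumes ``nSy`` is a multiple of two so that
the thread split is exact:

::

    # Example "eedata" file for two threads in y
    # Lines beginning "#" are comments
     &EEPARMS
     nTx=1,
     nTy=2,
     &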
                0913 
                0914 Despite its appealing programming model, multi-threaded execution
                0915 remains less common than multi-process execution (described in :numref:`multi-process_exe`). 
                0916 One major reason for
                0917 this is that many system libraries are still not “thread-safe”. This
                0918 means that, for example, on some systems it is not safe to call system
                0919 routines to perform I/O when running in multi-threaded mode (except,
                0920 perhaps, in a limited set of circumstances). Another reason is that
                0921 support for multi-threaded programming models varies between systems.
                0922 
                0923 .. _multi-process_exe:
                0924 
                0925 Multi-process execution
                0926 ~~~~~~~~~~~~~~~~~~~~~~~
                0927 
                0928 Multi-process execution is more ubiquitous than multi-threaded execution.
                0929 In order to run code in a
                0930 multi-process configuration, a decomposition specification (see
                0931 :numref:`specify_decomp`) is given (in which at least one
                0932 of the parameters ``nPx`` or ``nPy`` will be greater than one). Then, as
                0933 for multi-threaded operation, appropriate compile time and run time
                0934 steps must be taken.
                0935 
                0936 Compilation
                0937 ###########
                0938 
Multi-process execution under the WRAPPER assumes that portable
MPI libraries are available for controlling the start-up of multiple
processes. MPI is not strictly required for the performance-critical
communication itself, although it is usually used for that as well.
However, in order to simplify the task of controlling and coordinating
the start-up of a large number (hundreds and possibly even thousands)
of copies of the same program, MPI is used. The calls to the MPI multi-process
                0946 startup routines must be activated at compile time. Currently MPI
                0947 libraries are invoked by specifying the appropriate options file with
                0948 the ``-of`` flag when running the :filelink:`genmake2 <tools/genmake2>`
                0949 script, which generates the
                0950 Makefile for compiling and linking MITgcm. (Previously this was done
                0951 by setting the ``ALLOW_USE_MPI`` and ``ALWAYS_USE_MPI`` flags in the
                0952 :filelink:`CPP_EEOPTIONS.h </eesupp/inc/CPP_EEOPTIONS.h>` file.) More
                0953 detailed information about the use of
                0954 :filelink:`genmake2 <tools/genmake2>` for specifying local compiler
                0955 flags is located in :numref:`genmake2_desc`.
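
As an illustrative sketch only (the options file named below is a
placeholder; choose the file from :filelink:`tools/build_options` that
matches your platform and compiler), an MPI-enabled build might be
configured and compiled with:

::

    ../tools/genmake2 -mods ../code -mpi \
        -of ../tools/build_options/linux_amd64_gfortran
    make depend
    make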
                0956 
                0957 Execution
                0958 #########
                0959 
The mechanics of starting a program in multi-process mode under MPI are
                0961 not standardized. Documentation associated with the distribution of MPI
                0962 installed on a system will describe how to start a program using that
                0963 distribution. For the open-source `MPICH <https://www.mpich.org/>`_
                0964 system, the MITgcm program can
                0965 be started using a command such as
                0966 
                0967 ::
                0968 
                0969     mpirun -np 64 -machinefile mf ./mitgcmuv
                0970 
                0971 In this example the text ``-np 64`` specifies the number of processes
                0972 that will be created. The numeric value 64 must be equal to (or greater than) the
                0973 product of the processor grid settings of ``nPx`` and ``nPy`` in the file
                0974 :filelink:`SIZE.h <model/inc/SIZE.h>`. The option ``-machinefile mf``
                0975 specifies that a text file called ``mf``
                0976 will be read to get a list of processor names on which the sixty-four
                0977 processes will execute. The syntax of this file is specified by the
                0978 MPI distribution.
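
For illustration only, a :filelink:`SIZE.h <model/inc/SIZE.h>` processor
grid consistent with the sixty-four processes requested above could set
(any factorization with ``nPx*nPy`` equal to 64 would serve):

::

            &           nPx =   8,
            &           nPy =   8)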
                0979 
                0980 Environment variables
                0981 ~~~~~~~~~~~~~~~~~~~~~
                0982 
                0983 On some systems multi-threaded execution also requires the setting of a
                0984 special environment variable. On many machines this variable is called
                0985 ``PARALLEL`` and its values should be set to the number of parallel threads
                0986 required. Generally the help or manual pages associated with the
                0987 multi-threaded compiler on a machine will explain how to set the
                0988 required environment variables.
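
As a sketch (the variable name ``PARALLEL`` is the common case described
above, but both the name and the syntax are system dependent), requesting
two threads from the shell might look like:

::

    # sh/bash-style shells
    export PARALLEL=2

    # csh/tcsh-style shells
    setenv PARALLEL 2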
                0989 
                0990 Runtime input parameters
                0991 ~~~~~~~~~~~~~~~~~~~~~~~~
                0992 
                0993 Finally the file ``eedata``
                0994 needs to be configured to indicate the number
                0995 of threads to be used in the *x* and *y* directions:
                0996 
                0997 ::
                0998 
                0999     # Example "eedata" file
                1000     # Lines beginning "#" are comments
                1001     # nTx - No. threads per process in X
                1002     # nTy - No. threads per process in Y
                1003      &EEPARMS
                1004      nTx=1,
                1005      nTy=1,
                1006      &
                1007 
                1008 
                1009 The product of ``nTx`` and ``nTy`` must be equal to the number of threads
                1010 spawned, i.e., the setting of the environment variable ``PARALLEL``. The value
                1011 of ``nTx`` must subdivide the number of sub-domains in *x* (``nSx``) exactly.
                1012 The value of ``nTy`` must subdivide the number of sub-domains in *y* (``nSy``)
                1013 exactly. The multi-process startup of the MITgcm executable ``mitgcmuv`` is
                1014 controlled by the routines :filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>`
                1015 and :filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. The
                1016 first routine performs basic steps required to make sure each process is
                1017 started and has a textual output stream associated with it. By default
                1018 two output files are opened for each process with names ``STDOUT.NNNN``
and ``STDERR.NNNN``. The *NNNN* part of the name is filled in with
                1020 the process number so that process number 0 will create output files
                1021 ``STDOUT.0000`` and ``STDERR.0000``, process number 1 will create output
                1022 files ``STDOUT.0001`` and ``STDERR.0001``, etc. These files are used for
                1023 reporting status and configuration information and for reporting error
                1024 conditions on a process-by-process basis. The :filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>`
                1025 procedure also sets the variables :varlink:`myProcId` and :varlink:`MPI_COMM_MODEL`.
                1026 These variables are related to processor identification and are used
                1027 later in the routine :filelink:`ini_procs.F <eesupp/src/ini_procs.F>` to allocate tiles to processes.
                1028 
                1029 Allocation of processes to tiles is controlled by the routine
                1030 :filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. For each process this routine sets the variables
                1031 :varlink:`myXGlobalLo` and :varlink:`myYGlobalLo`. These variables specify, in index
space, the coordinates of the south-west corner of the
south-westernmost tile owned by this process. The
                1034 variables :varlink:`pidW`, :varlink:`pidE`, :varlink:`pidS` and :varlink:`pidN` are also set in this
                1035 routine. These are used to identify processes holding tiles to the
                1036 west, east, south and north of a given process. These values are
                1037 stored in global storage in the header file :filelink:`EESUPPORT.h <eesupp/inc/EESUPPORT.h>` for use by
                1038 communication routines. The above does not hold when the :ref:`exch2 package <sub_phys_pkg_exch2>`
                1039 is used. The :ref:`exch2 package <sub_phys_pkg_exch2>` sets its own parameters to specify the global
                1040 indices of tiles and their relationships to each other. See the
                1041 documentation on the :ref:`exch2 package <sub_phys_pkg_exch2>` for details.
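
For the standard (non-exch2) decomposition described above, the origin
indices :varlink:`myXGlobalLo` and :varlink:`myYGlobalLo` follow
directly from the tile and sub-domain sizes. Schematically (``iPx`` and
``iPy`` below are illustrative names for the process position, not the
exact variables used in :filelink:`ini_procs.F <eesupp/src/ini_procs.F>`):

::

    C     Illustrative sketch only: index-space origin of the
    C     south-westernmost tile owned by the process at position
    C     (iPx,iPy), with 1 <= iPx <= nPx and 1 <= iPy <= nPy
          myXGlobalLo = 1 + (iPx-1)*sNx*nSx
          myYGlobalLo = 1 + (iPy-1)*sNy*nSy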
                1042 
                1043 .. _controlling_comm:
                1044 
                1045 Controlling communication
                1046 -------------------------
                1047 
                1048 The WRAPPER maintains internal information that is used for
                1049 communication operations and can be customized for different
                1050 platforms. This section describes the information that is held and used.
                1051 
                1052 1. **Tile-tile connectivity information** For each tile the WRAPPER sets
                1053    a flag that sets the tile number to the north, south, east and west
                1054    of that tile. This number is unique over all tiles in a
                1055    configuration. Except when using the cubed sphere and 
                1056    the :ref:`exch2 package <sub_phys_pkg_exch2>`,
                1057    the number is held in the variables :varlink:`tileNo` (this holds
   the tile's own number), :varlink:`tileNoN`, :varlink:`tileNoS`,
                1059    :varlink:`tileNoE` and :varlink:`tileNoW`.
                1060    A parameter is also stored with each tile that specifies the type of
                1061    communication that is used between tiles. This information is held in
                1062    the variables :varlink:`tileCommModeN`, :varlink:`tileCommModeS`,
                1063    :varlink:`tileCommModeE` and
                1064    :varlink:`tileCommModeW`. This latter set of variables can take one of the
                1065    following values ``COMM_NONE``, ``COMM_MSG``, ``COMM_PUT`` and
                1066    ``COMM_GET``. A value of ``COMM_NONE`` is used to indicate that a tile
                1067    has no neighbor to communicate with on a particular face. A value of
                1068    ``COMM_MSG`` is used to indicate that some form of distributed memory
                1069    communication is required to communicate between these tile faces
                1070    (see :numref:`distributed_mem_comm`). A value of
                1071    ``COMM_PUT`` or ``COMM_GET`` is used to indicate forms of shared memory
                1072    communication (see :numref:`shared_mem_comm`). The
                1073    ``COMM_PUT`` value indicates that a CPU should communicate by writing
                1074    to data structures owned by another CPU. A ``COMM_GET`` value
                1075    indicates that a CPU should communicate by reading from data
                1076    structures owned by another CPU. These flags affect the behavior of
                1077    the WRAPPER exchange primitive (see :numref:`comm-prim`). The routine
                1078    :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
   is responsible for setting the communication mode values for each
   tile; a schematic sketch of how an exchange routine uses these flags
   is given at the end of this list.
                1081 
                1082    When using the cubed sphere configuration with the :ref:`exch2 package <sub_phys_pkg_exch2>`, the
                1083    relationships between tiles and their communication methods are set
                1084    by the :ref:`exch2 package <sub_phys_pkg_exch2>` and stored in different variables. 
                1085    See the :ref:`exch2 package <sub_phys_pkg_exch2>`
                1086    documentation for details.
                1087 
                1088    | 
                1089 
                1090 2. **MP directives** The WRAPPER transfers control to numerical
                1091    application code through the routine
                1092    :filelink:`the_model_main.F <model/src/the_model_main.F>`. This routine
                1093    is called in a way that allows for it to be invoked by several
                1094    threads. Support for this is based on either multi-processing (MP)
                1095    compiler directives or specific calls to multi-threading libraries
                1096    (e.g., POSIX threads). Most commercially available Fortran compilers
                1097    support the generation of code to spawn multiple threads through some
                1098    form of compiler directives. Compiler directives are generally more
                1099    convenient than writing code to explicitly spawn threads. On
                1100    some systems, compiler directives may be the only method available.
                1101    The WRAPPER is distributed with template MP directives for a number
                1102    of systems.
                1103 
                1104    These directives are inserted into the code just before and after the
                1105    transfer of control to numerical algorithm code through the routine
                1106    :filelink:`the_model_main.F <model/src/the_model_main.F>`. An example of
                1107    the code that performs this process for a Silicon Graphics system is as follows:
                1108 
                1109    ::
                1110 
                1111      C--
                1112      C--  Parallel directives for MIPS Pro Fortran compiler
                1113      C--
                1114      C      Parallel compiler directives for SGI with IRIX
                1115      C$PAR  PARALLEL DO
                1116      C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
                1117      C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
                1118      C
                1119            DO I=1,nThreads
                1120              myThid = I
                1121         
                1122      C--     Invoke nThreads instances of the numerical model
                1123              CALL THE_MODEL_MAIN(myThid)
                1124       
                1125            ENDDO
                1126 
                1127    Prior to transferring control to the procedure
                1128    :filelink:`the_model_main.F <model/src/the_model_main.F>` the 
                1129    WRAPPER may use MP directives to spawn multiple threads.  This code 
                1130    is extracted from the files :filelink:`main.F <eesupp/src/main.F>` and
                1131    :filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h`. The variable
                1132    :varlink:`nThreads` specifies how many instances of the routine
                1133    :filelink:`the_model_main.F <model/src/the_model_main.F>` will be created.
                1134    The value of :varlink:`nThreads` is set in the routine
                1135    :filelink:`ini_threading_environment.F <eesupp/src/ini_threading_environment.F>`.
   The value is set equal to the product of the parameters
                1137    :varlink:`nTx` and :varlink:`nTy` that are read from the file
                1138    ``eedata``. If the value of :varlink:`nThreads` is inconsistent with the number
                1139    of threads requested from the operating system (for example by using
                1140    an environment variable as described in :numref:`multi-thread_exe`)
                1141    then usually an error will be
                1142    reported by the routine :filelink:`check_threads.F <eesupp/src/check_threads.F>`.
                1143 
                1144    | 
                1145 
                1146 3. **memsync flags** As discussed in :numref:`shared_mem_comm`,
   a low-level system function may be needed to force memory consistency
                1148    on some shared memory systems. The routine
                1149    :filelink:`memsync.F <eesupp/src/memsync.F>` is used for
                1150    this purpose. This routine should not need modifying and the
                1151    information below is only provided for completeness. A logical
                1152    parameter :varlink:`exchNeedsMemSync` set in the routine
                1153    :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
                1154    controls whether the :filelink:`memsync.F <eesupp/src/memsync.F>`
                1155    primitive is called. In general this routine is only used for
                1156    multi-threaded execution. The code that goes into the :filelink:`memsync.F <eesupp/src/memsync.F>`
                1157    routine is specific to the compiler and processor used. In some
                1158    cases, it must be written using a short code snippet of assembly
                1159    language. For an Ultra Sparc system the following code snippet is
                1160    used
                1161 
                1162    ::
                1163 
                1164        asm("membar #LoadStore|#StoreStore");
                1165 
                1166    For an Alpha based system the equivalent code reads
                1167 
                1168    ::
                1169 
                1170        asm("mb");
                1171 
                1172    while on an x86 system the following code is required
                1173 
                1174    ::
                1175 
                1176        asm("lock; addl $0,0(%%esp)": : :"memory")
                1177 
                1178 #. **Cache line size** As discussed in :numref:`shared_mem_comm`,
                1179    multi-threaded codes
                1180    explicitly avoid penalties associated with excessive coherence
                1181    traffic on an SMP system. To do this the shared memory data
                1182    structures used by the :filelink:`global_sum.F <eesupp/src/global_sum.F>`,
                1183    :filelink:`global_max.F <eesupp/src/global_max.F>` and
                1184    :filelink:`barrier.F <eesupp/src/barrier.F>`
                1185    routines are padded. The variables that control the padding are set
                1186    in the header file :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. 
                1187    These variables are called :varlink:`cacheLineSize`, :varlink:`lShare1`,
                1188    :varlink:`lShare4` and :varlink:`lShare8`. The default
                1189    values should not normally need changing.
                1190 
                1191    | 
                1192 
                1193 #. **\_BARRIER** This is a CPP macro that is expanded to a call to a
                1194    routine which synchronizes all the logical processors running under
                1195    the WRAPPER. Using a macro here preserves flexibility to insert a
                1196    specialized call in-line into application code. By default this
                1197    resolves to calling the procedure :filelink:`barrier.F <eesupp/src/barrier.F>`.
                1198    The default setting for the ``_BARRIER`` macro is given in the
                1199    file :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
                1200 
                1201    | 
                1202 
                1203 #. **\_GSUM** This is a CPP macro that is expanded to a call to a
                1204    routine which sums up a floating point number over all the logical
                1205    processors running under the WRAPPER. Using a macro here provides
                1206    extra flexibility to insert a specialized call in-line into
                1207    application code. By default this resolves to calling the procedure
                1208    ``GLOBAL_SUM_R8()`` for 64-bit floating point operands or
   ``GLOBAL_SUM_R4()`` for 32-bit floating point operands
                1210    (located in file :filelink:`global_sum.F <eesupp/src/global_sum.F>`). The default
                1211    setting for the ``_GSUM`` macro is given in the file
                1212    :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
                1213    The ``_GSUM`` macro is a performance critical operation, especially for
                1214    large processor count, small tile size configurations. The custom
                1215    communication example discussed in :numref:`jam_example` shows
                1216    how the macro is used to invoke a custom global sum routine for a
                1217    specific set of hardware.
                1218 
                1219    | 
                1220 
                1221 #. **\_EXCH** The ``_EXCH`` CPP macro is used to update tile overlap
                1222    regions. It is qualified by a suffix indicating whether overlap
   updates are for two-dimensional (``_EXCH_XY``) or three-dimensional
                1224    (``_EXCH_XYZ``) physical fields and whether fields are 32-bit floating
                1225    point (``_EXCH_XY_R4``, ``_EXCH_XYZ_R4``) or 64-bit floating point
                1226    (``_EXCH_XY_R8``, ``_EXCH_XYZ_R8``). The macro mappings are defined in
                1227    the header file :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
                1228    As with ``_GSUM``, the ``_EXCH``
                1229    operation plays a crucial role in scaling to small tile, large
                1230    logical and physical processor count configurations. The example in
                1231    :numref:`jam_example` discusses defining an optimized and
   specialized form of the ``_EXCH`` operation.
                1233 
                1234    The ``_EXCH`` operation is also central to supporting grids such as the
                1235    cube-sphere grid. In this class of grid a rotation may be required
                1236    between tiles. Aligning the coordinate requiring rotation with the
                1237    tile decomposition allows the coordinate transformation to be
                1238    embedded within a custom form of the ``_EXCH`` primitive. In these cases
                1239    ``_EXCH`` is mapped to exch2 routines, as detailed in the :ref:`exch2 package <sub_phys_pkg_exch2>`
                1240    documentation.
                1241 
                1242    | 
                1243 
                1244 #. **Reverse Mode** The communication primitives ``_EXCH`` and ``_GSUM`` both
   employ hand-written adjoint (or reverse mode) forms. These
                1246    reverse mode forms can be found in the source code directory
                1247    :filelink:`pkg/autodiff`. For the global sum primitive the reverse mode form
                1248    calls are to ``GLOBAL_ADSUM_R4()`` and ``GLOBAL_ADSUM_R8()`` (located in 
                1249    file :filelink:`global_sum_ad.F <pkg/autodiff/global_sum_ad.F>`). The reverse
                1250    mode form of the exchange primitives are found in routines prefixed
                1251    ``ADEXCH``. The exchange routines make calls to the same low-level
                1252    communication primitives as the forward mode operations. However, the
                1253    routine argument :varlink:`theSimulationMode` is set to the value
                1254    ``REVERSE_SIMULATION``. This signifies to the low-level routines that
                1255    the adjoint forms of the appropriate communication operation should
                1256    be performed.
                1257 
                1258    | 
                1259 
                1260 #. **MAX_NO_THREADS** The variable :varlink:`MAX_NO_THREADS` is used to
                1261    indicate the maximum number of OS threads that a code will use. This
                1262    value defaults to thirty-two and is set in the file
                1263    :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. For
                1264    single threaded execution it can be reduced to one if required. The
                1265    value is largely private to the WRAPPER and application code will not
                1266    normally reference the value, except in the following scenario.
                1267 
   For certain physical parameterization schemes it is necessary to have
   a substantial number of work arrays. Where these arrays are allocated
   in static storage (for example, COMMON blocks), multi-threaded execution
   will require multiple instances of the COMMON block data. This can be
                1272    achieved using a Fortran 90 module construct. However, if this
                1273    mechanism is unavailable then the work arrays can be extended with
                1274    dimensions using the tile dimensioning scheme of :varlink:`nSx` and :varlink:`nSy` (as
                1275    described in :numref:`specify_decomp`). However, if
                1276    the configuration being specified involves many more tiles than OS
                1277    threads then it can save memory resources to reduce the variable
                1278    :varlink:`MAX_NO_THREADS` to be equal to the actual number of threads that
                1279    will be used and to declare the physical parameterization work arrays
a7275ae7c1 Oliv*1280    with a single :varlink:`MAX_NO_THREADS` extra dimension. An example of this
4c8caab8bd Jeff*1281    is given in the verification experiment :filelink:`verification/aim.5l_cs`. Here the
a7275ae7c1 Oliv*1282    default setting of :varlink:`MAX_NO_THREADS` is altered to
4c8caab8bd Jeff*1283 
                1284    ::
                1285 
                1286              INTEGER MAX_NO_THREADS
                1287              PARAMETER ( MAX_NO_THREADS =    6 )
                1288 
                1289    and several work arrays for storing intermediate calculations are
                1290    created with declarations of the form.
                1291 
                1292    ::
                1293 
                1294              common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
                1295 
                1296    This declaration scheme is not used widely, because most global data
                1297    is used for permanent, not temporary, storage of state information. In
                1298    the case of permanent state information this approach cannot be used
                1299    because there has to be enough storage allocated for all tiles.
                1300    However, the technique can sometimes be a useful scheme for reducing
                1301    memory requirements in complex physical parameterizations.
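
Returning to the tile connectivity flags described in item 1 above, the
following schematic sketch (not the actual WRAPPER source) shows how an
exchange routine can select a communication method for, say, the western
face of tile (``bi``, ``bj``):

::

    C     Schematic sketch only: choose how to fill the western
    C     overlap region of tile (bi,bj)
          IF     ( tileCommModeW(bi,bj) .EQ. COMM_NONE ) THEN
    C        No neighbor on this face; nothing to exchange
          ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_MSG  ) THEN
    C        Distributed memory: exchange messages for the overlap
          ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_PUT  ) THEN
    C        Shared memory: write into the neighboring tile's buffer
          ELSEIF ( tileCommModeW(bi,bj) .EQ. COMM_GET  ) THEN
    C        Shared memory: read from the neighboring tile's storage
          ENDIF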
                1302 
                1303 Specializing the Communication Code
                1304 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                1305 
The isolation of performance-critical communication primitives and the
subdivision of the simulation domain into tiles are powerful tools.
                1308 Here we show how it can be used to improve application performance and
                1309 how it can be used to adapt to new gridding approaches.
                1310 
                1311 .. _jam_example:
                1312 
                1313 JAM example
                1314 ~~~~~~~~~~~
                1315 
                1316 On some platforms a big performance boost can be obtained by binding the
                1317 communication routines ``_EXCH`` and ``_GSUM`` to specialized native
                1318 libraries (for example, the shmem library on CRAY T3E systems). The
                1319 ``LETS_MAKE_JAM`` CPP flag is used as an illustration of a specialized
                1320 communication configuration that substitutes for standard, portable
                1321 forms of ``_EXCH`` and ``_GSUM``. It affects three source files
                1322 :filelink:`eeboot.F <eesupp/src/eeboot.F>`, :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`
and :filelink:`cg2d.F </model/src/cg2d.F>`. When the flag is defined it
has the following effects:
                1325 
                1326 -  An extra phase is included at boot time to initialize the custom
                1327    communications library (see ini_jam.F).
                1328 
                1329 -  The ``_GSUM`` and ``_EXCH`` macro definitions are replaced with calls
   to custom routines (see gsum_jam.F and exch_jam.F).
                1331 
-  A highly specialized form of the exchange operator (optimized for
                1333    overlap regions of width one) is substituted into the elliptic solver
                1334    routine :filelink:`cg2d.F </model/src/cg2d.F>`.
                1335 
                1336 Developing specialized code for other libraries follows a similar
                1337 pattern.
                1338 
                1339 .. _cubed_sphere_comm:
                1340 
                1341 Cube sphere communication
                1342 ~~~~~~~~~~~~~~~~~~~~~~~~~
                1343 
                1344 Actual ``_EXCH`` routine code is generated automatically from a series of
                1345 template files, for example
                1346 :filelink:`exch2_rx1_cube.template </pkg/exch2/exch2_rx1_cube.template>`.
                1347 This is done to allow a
                1348 large number of variations of the exchange process to be maintained. One
                1349 set of variations supports the cube sphere grid. Support for a cube
                1350 sphere grid in MITgcm is based on having each face of the cube as a
                1351 separate tile or tiles. The exchange routines are then able to absorb
                1352 much of the detailed rotation and reorientation required when moving
                1353 around the cube grid. The set of ``_EXCH`` routines that contain the word
                1354 cube in their name perform these transformations. They are invoked when
                1355 the run-time logical parameter :varlink:`useCubedSphereExchange` is
                1356 set ``.TRUE.``. To
facilitate the transformations on a staggered C-grid, exchange
operations are defined separately for vector and scalar quantities, and
for grid-centered, grid-face and grid-corner quantities.
                1360 Three sets of exchange routines are defined. Routines with names of the
                1361 form ``exch2_rx`` are used to exchange cell centered scalar quantities.
                1362 Routines with names of the form ``exch2_uv_rx`` are used to exchange
                1363 vector quantities located at the C-grid velocity points. The vector
quantities exchanged by the ``exch2_uv_rx`` routines can either be signed
(for example, velocity components) or unsigned (for example, grid-cell
separations). Routines with names of the form ``exch2_z_rx`` are used to
exchange quantities at the C-grid vorticity point locations.
                1368 
                1369 MITgcm execution under WRAPPER
                1370 ==============================
                1371 
                1372 Fitting together the WRAPPER elements, package elements and MITgcm core
                1373 equation elements of the source code produces the calling sequence shown below.
                1374 
                1375 Annotated call tree for MITgcm and WRAPPER
                1376 ------------------------------------------
                1377 
                1378 WRAPPER layer.
                1379 
                1380 ::
                1381 
                1382 
                1383            MAIN  
                1384            |
                1385            |--EEBOOT               :: WRAPPER initialization
                1386            |  |
           |  |-- EEBOOT_MINIMAL   :: Minimal startup. Just enough to
                1388            |  |                       allow basic I/O.
                1389            |  |-- EEINTRO_MSG      :: Write startup greeting.
                1390            |  |
                1391            |  |-- EESET_PARMS      :: Set WRAPPER parameters
                1392            |  |
                1393            |  |-- EEWRITE_EEENV    :: Print WRAPPER parameter settings
                1394            |  |
                1395            |  |-- INI_PROCS        :: Associate processes with grid regions.
                1396            |  |
                1397            |  |-- INI_THREADING_ENVIRONMENT   :: Associate threads with grid regions.
                1398            |       |
                1399            |       |--INI_COMMUNICATION_PATTERNS :: Initialize between tile 
                1400            |                                     :: communication data structures
                1401            |
                1402            |
                1403            |--CHECK_THREADS    :: Validate multiple thread start up.
                1404            |
                1405            |--THE_MODEL_MAIN   :: Numerical code top-level driver routine
                1406 
                1407 Core equations plus packages.
                1408 
1a0ceabba8 Timo*1409 .. _model_main_call_tree:
                1410 
232737bee5 timo*1411 .. literalinclude:: ../../model/src/the_model_main.F
2a7411a401 timo*1412     :start-at: C Invocation from WRAPPER level...
                1413     :end-at: C    |                 :: events.
4c8caab8bd Jeff*1414 
                1415 
                1416 Measuring and Characterizing Performance
                1417 ----------------------------------------
                1418 
                1419 TO BE DONE (CNH)
                1420 
                1421 Estimating Resource Requirements
                1422 --------------------------------
                1423 
                1424 TO BE DONE (CNH)
                1425 
                1426 Atlantic 1/6 degree example
                1427 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
                1428 
                1429 Dry Run testing
                1430 ~~~~~~~~~~~~~~~
                1431 
                1432 Adjoint Resource Requirements
                1433 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                1434 
                1435 State Estimation Environment Resources
                1436 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~