.. _sarch:

Software Architecture
*********************

This chapter focuses on describing the **WRAPPER** environment within
which both the core numerics and the pluggable packages operate. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details on
working with the WRAPPER. The tutorial examples in this manual (see
:numref:`chap_modelExamples`) contain more
succinct, step-by-step instructions on running basic numerical
experiments, of various types, both sequentially and in parallel. For
many projects, simply starting from an example code and adapting it to
suit a particular situation will be all that is required. The first part
of this chapter discusses the MITgcm architecture at an abstract level.
In the second part of the chapter we describe practical details of the
MITgcm implementation and the current tools and operating system features
that are employed.

Overall architectural goals
===========================

Broadly, the goals of the software architecture employed in MITgcm are
three-fold:

- To be able to study a very broad range of interesting and
  challenging rotating fluids problems;

- The model code should be readily targeted to a wide range of
  platforms; and

- On any given platform, performance should be
  comparable to an implementation developed and specialized
  specifically for that platform.

These points are summarized in :numref:`mitgcm_goals`,
which conveys the goals of the MITgcm design. The goals lead to a
software architecture which at the broadest level can be viewed as
consisting of:

#. A core set of numerical and support code. This is discussed in detail
   in :numref:`discret_algorithm`.

#. A scheme for supporting optional “pluggable” **packages** (containing
   for example mixed-layer schemes, biogeochemical schemes, atmospheric
   physics). These packages are used both to overlay alternate dynamics
   and to introduce specialized physical content onto the core numerical
   code. An overview of the package scheme is given at the start of
   :numref:`packagesI`.

#. A support framework called WRAPPER (Wrappable Application
   Parallel Programming Environment Resource), within which the core
   numerics and pluggable packages operate.

.. figure:: figs/mitgcm_goals.*
   :width: 70%
   :align: center
   :alt: span of mitgcm goals
   :name: mitgcm_goals

   The MITgcm architecture is designed to allow simulation of a wide range of physical problems on a wide range of hardware. The computational resource requirements of the applications targeted range from around 10\ :sup:`7` bytes ( :math:`\approx` 10 megabytes) of memory to 10\ :sup:`11` bytes ( :math:`\approx` 100 gigabytes). Arithmetic operation counts for the applications of interest range from 10\ :sup:`9` floating point operations to more than 10\ :sup:`17` floating point operations.

This chapter focuses on describing the WRAPPER environment under
which both the core numerics and the pluggable packages function. The
description presented here is intended to be a detailed exposition and
contains significant background material, as well as advanced details on
working with the WRAPPER. The “Getting Started” chapter of this manual
(:numref:`chap_getting_started`) contains more succinct, step-by-step
instructions on running basic numerical experiments both sequentially
and in parallel. For many projects simply starting from an example code
and adapting it to suit a particular situation will be all that is
required.

.. _wrapper:

WRAPPER
=======

A significant element of the software architecture utilized in MITgcm is
a software superstructure and substructure collectively called the
WRAPPER (Wrappable Application Parallel Programming Environment
Resource). All numerical and support code in MITgcm is written to “fit”
within the WRAPPER infrastructure. Writing code to fit within the
WRAPPER means that coding has to follow certain, relatively
straightforward, rules and conventions (these are discussed further in
:numref:`specify_decomp`).

The approach taken by the WRAPPER is illustrated in :numref:`fit_in_wrapper`,
which shows how the WRAPPER serves to insulate
code that fits within it from architectural differences between hardware
platforms and operating systems. This allows numerical code to be easily
retargeted.

.. figure:: figs/fit_in_wrapper.png
   :width: 70%
   :align: center
   :alt: schematic of a wrapper
   :name: fit_in_wrapper

   Numerical code is written to fit within a software support infrastructure called WRAPPER. The WRAPPER is portable and can be specialized for a wide range of specific target hardware and programming environments, without impacting numerical code that fits within the WRAPPER. Codes that fit within the WRAPPER can generally be made to run as fast on a particular platform as codes specially optimized for that platform.

.. _target_hardware:

Target hardware
---------------

The WRAPPER is designed to target as broad as possible a range of
computer systems. The original development of the WRAPPER took place on
a multi-processor, CRAY Y-MP system. On that system, numerical code
performance and scaling under the WRAPPER was in excess of that of an
implementation that was tightly bound to the CRAY system’s proprietary
multi-tasking and micro-tasking approach. Later developments have been
carried out on uniprocessor and multiprocessor Sun systems with both
uniform memory access (UMA) and non-uniform memory access (NUMA)
designs. Significant work has also been undertaken on x86 cluster
systems, Alpha processor based clustered SMP systems, and on
cache-coherent NUMA (CC-NUMA) systems such as Silicon Graphics Altix
systems. The MITgcm code, operating within the WRAPPER, is also
routinely used on large scale MPP systems (for example, Cray T3E and IBM
SP systems). In all cases, numerical code, operating within the WRAPPER,
performs and scales very competitively with equivalent numerical code
that has been modified to contain native optimizations for a particular
system (see Hoe et al. 1999) :cite:`hoe:99` .

Supporting hardware neutrality
------------------------------

The different systems mentioned in :numref:`target_hardware` can be
categorized in many different ways. For example, one common distinction
is between shared-memory parallel systems (SMP and PVP) and distributed
memory parallel systems (for example x86 clusters and large MPP
systems). This is one example of a difference between compute platforms
that can impact an application. Another common distinction is between
vector processing systems with highly specialized CPUs and memory
subsystems and commodity microprocessor based systems. There are
numerous other differences, especially in relation to how parallel
execution is supported. To capture the essential differences between
different platforms the WRAPPER uses a *machine model*.

WRAPPER machine model
---------------------

Applications using the WRAPPER are not written to target just one
particular machine (for example an IBM SP2) or just one particular
family or class of machines (for example Parallel Vector Processor
Systems). Instead the WRAPPER provides applications with an abstract
*machine model*. The machine model is very general; however, it can
easily be specialized to fit, in a computationally efficient manner, any
computer architecture currently available to the scientific computing
community.

Machine model parallelism
-------------------------

Codes operating under the WRAPPER target an abstract machine that is
assumed to consist of one or more logical processors that can compute
concurrently. Computational work is divided among the logical processors
by allocating “ownership” to each processor of a certain set (or sets)
of calculations. Each set of calculations owned by a particular
processor is associated with a specific region of the physical space
that is being simulated, and only one processor will be associated with each
such region (domain decomposition).

In a strict sense the logical processors over which work is divided do
not need to correspond to physical processors. It is perfectly possible
to execute a configuration decomposed for multiple logical processors on
a single physical processor. This helps ensure that numerical code that
is written to fit within the WRAPPER will parallelize with no additional
effort. It is also useful for debugging purposes. Generally, however,
the computational domain will be subdivided over multiple logical
processors in order to then bind those logical processors to physical
processor resources that can compute in parallel.

.. _tile_description:

Tiles
~~~~~

Computationally, the data structures (e.g., arrays, scalar variables,
etc.) that hold the simulated state are associated with each region of
physical space and are allocated to a particular logical processor. We
refer to these data structures as being **owned** by the processor to
which their associated region of physical space has been allocated.
Individual regions that are allocated to processors are called
**tiles**. A processor can own more than one tile. :numref:`domain_decomp`
shows a physical domain being mapped to a set of
logical processors, with each processor owning a single region of the
domain (a single tile). Except for periods of communication and
coordination, each processor computes autonomously, working only with
data from the tile that the processor owns. If instead multiple
tiles were allotted to a single processor, each of these tiles would be
computed independently of the other allotted tiles, in a sequential fashion.

.. figure:: figs/domain_decomp.png
   :width: 70%
   :align: center
   :alt: domain decomposition
   :name: domain_decomp

   The WRAPPER provides support for one and two dimensional decompositions of grid-point domains. The figure shows a hypothetical domain of total size :math:`N_{x}N_{y}N_{z}`. This hypothetical domain is decomposed in two-dimensions along the :math:`N_{x}` and :math:`N_{y}` directions. The resulting tiles are owned by different processors. The owning processors perform the arithmetic operations associated with a tile. Although not illustrated here, a single processor can own several tiles. Whenever a processor wishes to transfer data between tiles or communicate with other processors it calls a WRAPPER supplied function.

Tile layout
~~~~~~~~~~~

Tiles consist of an interior region and an overlap region. The overlap
region of a tile corresponds to the interior region of an adjacent tile.
In :numref:`tiled-world` each tile would own the region within the
black square and hold duplicate information for overlap regions
extending into the tiles to the north, south, east and west. During
computational phases a processor will reference data in an overlap
region whenever it requires values that lie outside the domain it owns.
Periodically processors will make calls to WRAPPER functions to
communicate data between tiles, in order to keep the overlap regions up
to date (see :numref:`comm_primitives`). The WRAPPER
functions can use a variety of different mechanisms to communicate data
between tiles.
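
Schematically, a compute phase on one tile reads interior and overlap
points, and an exchange call then brings the overlap regions back up to
date. The fragment below is an illustrative sketch only (the field name
``phi``/``phiNew`` is hypothetical); the exchange is shown using a
WRAPPER routine of the kind described in :numref:`comm_primitives`, here
the two-dimensional form ``EXCH_XY_RL``.

.. code-block:: fortran

   C     Compute phase for one tile (bi,bj): a five-point average that
   C     reads one overlap point on each side of the tile interior.
         DO j=1,sNy
          DO i=1,sNx
           phiNew(i,j,bi,bj) = 0.25*(
        &      phi(i-1,j,bi,bj) + phi(i+1,j,bi,bj)
        &    + phi(i,j-1,bi,bj) + phi(i,j+1,bi,bj) )
          ENDDO
         ENDDO
   C     Communication phase: refresh the overlap regions of phiNew
   C     from the interiors of the neighboring tiles.
         CALL EXCH_XY_RL( phiNew, myThid )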

.. figure:: figs/tiled-world.png
   :width: 70%
   :align: center
   :alt: global earth subdivided into tiles
   :name: tiled-world

   A global grid subdivided into tiles. Tiles contain an interior region and an overlap region. Overlap regions are periodically updated from neighboring tiles.


Communication mechanisms
------------------------

Logical processors are assumed to be able to exchange information
between tiles (and between each other) using at least one of two possible
mechanisms, shared memory or distributed memory communication.
The WRAPPER assumes that communication will use one of these two styles.
The underlying hardware and operating system support
for the style used is not specified and can vary from system to system.

.. _shared_mem_comm:

Shared memory communication
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under this mode of communication, data transfers are assumed to be possible using direct addressing of
regions of memory. In the WRAPPER shared memory communication model,
simple writes to an array can be made to be visible to other CPUs at
the application code level. So, as shown below, if one CPU (CPU1) writes
the value 8 to element 3 of array a, then other CPUs (here, CPU2) will be
able to see the value 8 when they read from a(3). This provides a very low
latency and high bandwidth communication mechanism. Thus, in this way one CPU
can communicate information to another CPU by assigning a particular value to a particular memory
location.

::

             CPU1                    |        CPU2
             ====                    |        ====
                                     |
        a(3) = 8                     |    WHILE ( a(3) .NE. 8 )
                                     |     WAIT
                                     |    END WHILE
                                     |

Under shared memory communication, independent CPUs are operating on the exact
same global address space at the application level. This is the model of
memory access that is supported at the basic system design level in
“shared-memory” systems such as PVP systems, SMP systems, and on
distributed shared memory systems (e.g., SGI Origin, SGI Altix, and some
AMD Opteron systems). On such systems the WRAPPER will generally use
simple read and write statements to access directly application data
structures when communicating between CPUs.

In a system where assignment statements map directly to hardware instructions that
transport data between CPU and memory banks, this can be a very
efficient mechanism for communication. In such cases multiple CPUs
can communicate simply by reading and writing to agreed
locations and following a few basic rules. The latency of this sort of
communication is generally not that much higher than the hardware
latency of other memory accesses on the system. The bandwidth available
between CPUs communicating in this way can be close to the bandwidth of
the system’s main-memory interconnect. This can make this method of
communication very efficient provided it is used appropriately.
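
The stand-alone program below illustrates this style of communication
between two threads of a single process. It is a sketch, not MITgcm
code, and uses OpenMP directives purely as one widely available way of
obtaining a shared address space; the WRAPPER itself relies on the
multi-threading mechanisms described in :numref:`multi-thread_exe`. The
``FLUSH`` directive plays the role of the memory-consistency step
discussed in the next subsection.

.. code-block:: fortran

         PROGRAM SHARED_FLAG
   C     Sketch of shared-memory communication: thread 0 writes a(3)
   C     and thread 1 spins until the value becomes visible.
         IMPLICIT NONE
         INTEGER OMP_GET_THREAD_NUM
         EXTERNAL OMP_GET_THREAD_NUM
         INTEGER a(8)
         INTEGER myThid
         a(3) = 0
   C$OMP PARALLEL PRIVATE(myThid) SHARED(a) NUM_THREADS(2)
         myThid = OMP_GET_THREAD_NUM()
         IF ( myThid .EQ. 0 ) THEN
   C       "CPU1": write the value the other thread is waiting for
           a(3) = 8
   C$OMP   FLUSH(a)
         ELSE
   C       "CPU2": spin until the write becomes visible
      10   CONTINUE
   C$OMP   FLUSH(a)
           IF ( a(3) .NE. 8 ) GOTO 10
           PRINT *, 'thread', myThid, ' sees a(3) =', a(3)
         ENDIF
   C$OMP END PARALLEL
         END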

Memory consistency
##################

When using shared memory communication between multiple processors, the
WRAPPER level shields user applications from certain counter-intuitive
system behaviors. In particular, one issue the WRAPPER layer must deal
with is the system’s memory model. In general the order of reads and writes
expressed by the textual order of an application code may not be the
ordering of instructions executed by the processor performing the
application. The processor performing the application instructions will
always operate so that, for the application instructions the processor
is executing, any reordering is not apparent. However,
machines are often designed so that reordering of instructions is not
hidden from other processors. This means that, in general, even
on a shared memory system two processors can observe inconsistent memory
values.

The issue of memory consistency between multiple processors is discussed
at length in many computer science papers. From a practical point of
view, in order to deal with this issue, shared memory machines all
provide some mechanism to enforce memory consistency when it is needed.
The exact mechanism employed will vary between systems. For
communication using shared memory, the WRAPPER provides a place to
invoke the appropriate mechanism to ensure memory consistency for a
particular platform.

Cache effects and false sharing
###############################

Shared-memory machines often have local-to-processor memory caches which
contain mirrored copies of main memory. Automatic cache-coherence
protocols are used to maintain consistency between caches on different
processors. These cache-coherence protocols typically enforce
consistency between regions of memory with large granularity (typically
128 or 256 byte chunks). The coherency protocols employed can be
expensive relative to other memory accesses and so care is taken in the
WRAPPER (by padding synchronization structures appropriately) to avoid
unnecessary coherence traffic.
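
The padding idea can be sketched as follows (this is an illustration
with an assumed cache-line size and array names, not the actual WRAPPER
data layout): per-thread slots of a shared structure are spaced out so
that no two threads ever update the same cache line.

.. code-block:: fortran

   C     Sketch: pad per-thread slots of a shared array out to an
   C     assumed cache-line size so that concurrent updates by
   C     different threads never share a cache line ("false sharing").
         INTEGER assumedCacheLineSize, maxThreads, padWords
         PARAMETER ( assumedCacheLineSize = 256, maxThreads = 8 )
         PARAMETER ( padWords = assumedCacheLineSize/8 )
         REAL*8 partialSum( padWords, maxThreads )
   C     Thread myThid accumulates only into partialSum(1,myThid);
   C     elements 2..padWords of each column are padding.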

Operating system support for shared memory
##########################################

Applications running under multiple threads within a single process can
use shared memory communication. In this case *all* the memory locations
in an application are potentially visible to all the compute threads.
Multiple threads operating within a single process is the standard
mechanism for supporting shared memory that the WRAPPER utilizes.
Configuring and launching code to run in multi-threaded mode on specific
platforms is discussed in :numref:`multi-thread_exe`.
However, on many systems, potentially very efficient mechanisms for
using shared memory communication between multiple processes (in
contrast to multiple threads within a single process) also exist. In
most cases this works by making a limited region of memory shared
between processes. The MMAP and IPC facilities in UNIX systems provide
this capability as do vendor specific tools like LAPI and IMC.
Extensions exist for the WRAPPER that allow these mechanisms to be used
for shared memory communication. However, these mechanisms are not
distributed with the default WRAPPER sources, because of their
proprietary nature.

.. _distributed_mem_comm:

Distributed memory communication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under this mode of communication there is no mechanism, at the application code level,
for directly addressing regions of memory owned and visible to
another CPU. Instead a communication library must be used, as
illustrated below. If one CPU (here, CPU1) writes the value 8 to element 3 of array a,
then at least one of CPU1 and/or CPU2 will need to call a
function in the API of the communication library to communicate data
from a tile that it owns to a tile that another CPU owns. By default
the WRAPPER binds to the MPI communication library
for this style of communication (see https://computing.llnl.gov/tutorials/mpi/
for more information about the MPI Standard).

::

             CPU1                        |        CPU2
             ====                        |        ====
                                         |
        a(3) = 8                         |    WHILE ( a(3) .NE. 8 )
        CALL SEND( CPU2,a(3) )           |     CALL RECV( CPU1, a(3) )
                                         |    END WHILE
                                         |

Many parallel systems are not constructed in a way where it is possible
or practical for an application to use shared memory for communication.
For cluster systems consisting of individual computers connected by
a fast network, there is no notion of shared memory at
the system level. For this sort of system the WRAPPER provides support
for communication based on a bespoke communication library.
The default communication library used is MPI. It is relatively
straightforward to implement bindings to optimized platform specific
communication libraries. For example the work described in
Hoe et al. (1999) :cite:`hoe:99` replaced standard MPI communication
with a highly optimized library.
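
A self-contained sketch of this style of communication, written with
the standard Fortran MPI bindings (illustration only, not actual
WRAPPER code), is:

.. code-block:: fortran

         PROGRAM DISTRIBUTED_FLAG
   C     Sketch of distributed-memory communication: rank 0 sets
   C     a(3) = 8 and sends it; rank 1 receives it into its own copy
   C     of the array.  Run with at least two MPI processes.
         IMPLICIT NONE
         INCLUDE 'mpif.h'
         INTEGER a(8)
         INTEGER rank, ierr
         INTEGER status(MPI_STATUS_SIZE)
         CALL MPI_INIT( ierr )
         CALL MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
         IF ( rank .EQ. 0 ) THEN
           a(3) = 8
           CALL MPI_SEND( a(3), 1, MPI_INTEGER, 1, 0,
        &                 MPI_COMM_WORLD, ierr )
         ELSEIF ( rank .EQ. 1 ) THEN
           CALL MPI_RECV( a(3), 1, MPI_INTEGER, 0, 0,
        &                 MPI_COMM_WORLD, status, ierr )
           PRINT *, 'rank 1 received a(3) =', a(3)
         ENDIF
         CALL MPI_FINALIZE( ierr )
         END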

.. _comm_primitives:

Communication primitives
------------------------

Optimized communication support is assumed to be potentially available
for a small number of communication operations. It is also assumed that
communication performance optimizations can be achieved by optimizing a
small number of communication primitives. Three optimizable primitives
are provided by the WRAPPER.

.. figure:: figs/comm-prim.png
   :width: 70%
   :align: center
   :alt: global sum and exchange comm primitives
   :name: comm-prim

   Three performance critical parallel primitives are provided by the WRAPPER. These primitives are always used to communicate data between tiles. The figure shows four tiles. The curved arrows indicate exchange primitives which transfer data between the overlap regions at tile edges and interior regions for nearest-neighbor tiles. The straight arrows symbolize global sum operations which connect all tiles. The global sum operation provides both a key arithmetic primitive and can serve as a synchronization primitive. A third barrier primitive is also provided, which behaves much like the global sum primitive.

- **EXCHANGE** This operation is used to transfer data between interior
  and overlap regions of neighboring tiles. A number of different forms
  of this operation are supported. These different forms handle:

  - Data type differences. Sixty-four bit and thirty-two bit fields
    may be handled separately.

  - Bindings to different communication methods. Exchange primitives
    select between using shared memory or distributed memory
    communication.

  - Transformation operations required when transporting data between
    different grid regions. Transferring data between faces of a
    cube-sphere grid, for example, involves a rotation of vector
    components.

  - Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.

- **GLOBAL SUM** The global sum operation is a central arithmetic
  operation for the pressure inversion phase of the MITgcm algorithm.
  For certain configurations, scaling can be highly sensitive to the
  performance of the global sum primitive. This operation is a
  collective operation involving all tiles of the simulated domain.
  Different forms of the global sum primitive exist for handling:

  - Data type differences. Sixty-four bit and thirty-two bit fields
    may be handled separately.

  - Bindings to different communication methods. Exchange primitives
    select between using shared memory or distributed memory
    communication.

  - Forward and reverse mode computations. Derivative calculations
    require tangent linear and adjoint forms of the exchange
    primitives.

- **BARRIER** The WRAPPER provides a global synchronization function
  called barrier. This is used to synchronize computations over all
  tiles. The **BARRIER** and **GLOBAL SUM** primitives have much in
  common and in some cases use the same underlying code.

Memory architecture
-------------------

The WRAPPER machine model is aimed to target efficient systems with
highly pipelined memory architectures and systems with deep memory
hierarchies that favor memory reuse. This is achieved by supporting a
flexible tiling strategy as shown in :numref:`tiling_detail`.
Within a CPU, computations are carried out sequentially on each tile in
turn. By reshaping tiles according to the target platform it is possible
to automatically tune code to improve memory performance. On a vector
machine a given domain might be subdivided into a few long, thin
regions. On a commodity microprocessor based system, however, the same
region could be simulated using many more, smaller sub-domains.

.. figure:: figs/tiling_detail.png
   :width: 70%
   :align: center
   :alt: tiling strategy in WRAPPER
   :name: tiling_detail

   The tiling strategy that the WRAPPER supports allows tiles to be shaped to suit the underlying system memory architecture. Compact tiles that lead to greater memory reuse can be used on cache based systems (upper half of figure) with deep memory hierarchies, whereas long tiles with large inner loops can be used to exploit vector systems having highly pipelined memory systems.

Summary
-------

Following the discussion above, the machine model that the WRAPPER
presents to an application has the following characteristics:

- The machine consists of one or more logical processors.

- Each processor operates on tiles that it owns.

- A processor may own more than one tile.

- Processors may compute concurrently.

- Exchange of information between tiles is handled by the machine
  (WRAPPER) not by the application.

Behind the scenes this allows the WRAPPER to adapt the machine model
functions to exploit hardware on which:

- Processors may be able to communicate very efficiently with each
  other using shared memory.

- An alternative communication mechanism based on a relatively simple
  interprocess communication API may be required.

- Shared memory may not necessarily obey sequential consistency;
  however some mechanism will exist for enforcing memory consistency.

- Memory consistency that is enforced at the hardware level may be
  expensive. Unnecessary triggering of consistency protocols should be
  avoided.

- Memory access patterns may need to be either repetitive or highly
  pipelined for optimum hardware performance.

This generic model, summarized in :numref:`tiles_and_wrapper`, captures the essential hardware ingredients of almost
all successful scientific computer systems designed in the last 50
years.

.. figure:: figs/tiles_and_wrapper.png
   :width: 85%
   :align: center
   :alt: summary figure tiles and wrapper
   :name: tiles_and_wrapper

   Summary of the WRAPPER machine model.

.. _using_wrapper:

Using the WRAPPER
=================

In order to support maximum portability the WRAPPER is implemented
primarily in sequential Fortran 77. At a practical level the key steps
provided by the WRAPPER are:

#. specifying how a domain will be decomposed

#. starting a code in either sequential or parallel modes of operations

#. controlling communication between tiles and between concurrently
   computing CPUs.

This section describes the details of each of these operations.
:numref:`specify_decomp` explains how the way a
domain is decomposed (or composed) is expressed. :numref:`starting_code`
describes practical details of running codes
in various different parallel modes on contemporary computer systems.
:numref:`controlling_comm` explains the internal
information that the WRAPPER uses to control how information is
communicated between tiles.

.. _specify_decomp:

Specifying a domain decomposition
---------------------------------

At its heart, much of the WRAPPER works only in terms of a collection
of tiles which are interconnected to each other. This is also true of
application code operating within the WRAPPER. Application code is
written as a series of compute operations, each of which operates on a
single tile. If application code needs to perform operations involving
data associated with another tile, it uses a WRAPPER function to
obtain that data. The specification of how a global domain is
constructed from tiles or alternatively how a global domain is
decomposed into tiles is made in the file :filelink:`SIZE.h <model/inc/SIZE.h>`. This file defines
the following parameters:

.. admonition:: File: :filelink:`model/inc/SIZE.h`
   :class: note

   | Parameter: :varlink:`sNx`, :varlink:`sNy`
   | Parameter: :varlink:`OLx`, :varlink:`OLy`
   | Parameter: :varlink:`nSx`, :varlink:`nSy`
   | Parameter: :varlink:`nPx`, :varlink:`nPy`

Together these parameters define a tiling decomposition of the style
shown in :numref:`size_h`. The parameters ``sNx`` and ``sNy``
define the size of an individual tile. The parameters ``OLx`` and ``OLy``
define the maximum size of the overlap extent. This must be set to the
maximum width of the computation stencil that the numerical code
finite-difference operations require between overlap region updates.
The maximum overlap required by any of the operations in the MITgcm
code distributed at this time is four grid points (some of the higher-order advection schemes
require a larger overlap region). Code modifications and enhancements that involve adding wide
finite-difference stencils may require increasing ``OLx`` and ``OLy``.
Setting ``OLx`` and ``OLy`` to too large a value will decrease code
performance (because redundant computations will be performed),
however it will not cause any other problems.
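
To make the role of these parameters concrete, the fragment below (a
simplified, stand-alone sketch; the field name and parameter values are
illustrative rather than taken from any particular configuration) shows
how a per-tile model field is typically dimensioned using them, with
``OLx``/``OLy`` overlap points on each side of the interior and trailing
``nSx``, ``nSy`` tile indices.

.. code-block:: fortran

         PROGRAM TILE_ARRAY_SHAPE
   C     Sketch of how the SIZE.h parameters shape a per-tile field.
         IMPLICIT NONE
         INTEGER sNx, sNy, OLx, OLy, nSx, nSy, Nr
         PARAMETER ( sNx = 30, sNy = 20, OLx = 3, OLy = 3,
        &            nSx = 2,  nSy = 2,  Nr  = 15 )
   C     Interior points plus an OLx/OLy-wide overlap, for each tile
         REAL*8 theta( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr, nSx, nSy )
         PRINT *, 'interior points per tile      :', sNx*sNy
         PRINT *, 'points per tile incl. overlap :',
        &         (sNx+2*OLx)*(sNy+2*OLy)
         END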

.. figure:: figs/size_h.png
   :width: 80%
   :align: center
   :alt: explanation of SIZE.h domain decomposition
   :name: size_h

   The three level domain decomposition hierarchy employed by the WRAPPER. A domain is composed of tiles. Multiple tiles can be allocated to a single process. Multiple processes can exist, each with multiple tiles. Tiles within a process can be spread over multiple compute threads.

The parameters ``nSx`` and ``nSy`` specify the number of tiles that will be
created within a single process. Each of these tiles will have internal
dimensions of ``sNx`` and ``sNy``. If, when the code is executed, these
tiles are allocated to different threads of a process that are then
bound to different physical processors (see the multi-threaded
execution discussion in :numref:`starting_code`), then
computation will be performed concurrently on each tile. However, it is
also possible to run the same decomposition within a process running a
single thread on a single processor. In this case the tiles will be
computed sequentially. If the decomposition is run in a single
process running multiple threads but attached to a single physical
processor, then, in general, the computation for different tiles will be
interleaved by system level software. This too is a valid mode of
operation.

The parameters ``sNx``, ``sNy``, ``OLx``, ``OLy``,
``nSx`` and ``nSy`` are used extensively
by numerical code. The settings of ``sNx``, ``sNy``, ``OLx``, and ``OLy`` are used to
form the loop ranges for many numerical calculations and to provide
dimensions for arrays holding numerical state. The ``nSx`` and ``nSy`` are
used in conjunction with the thread number parameter ``myThid``. Much of
the numerical code operating within the WRAPPER takes the form:

.. code-block:: fortran

         DO bj=myByLo(myThid),myByHi(myThid)
          DO bi=myBxLo(myThid),myBxHi(myThid)
             :
             a block of computations ranging
             over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
             :
          ENDDO
         ENDDO

         communication code to sum a number or maybe update
         tile overlap regions

         DO bj=myByLo(myThid),myByHi(myThid)
          DO bi=myBxLo(myThid),myBxHi(myThid)
             :
             another block of computations ranging
             over 1,sNx +/- OLx and 1,sNy +/- OLy grid points
             :
          ENDDO
         ENDDO

The variables ``myBxLo(myThid)``, ``myBxHi(myThid)``, ``myByLo(myThid)`` and
``myByHi(myThid)`` set the bounds of the loops in ``bi`` and ``bj`` in this
schematic. These variables specify the subset of the tiles in the range
``1, nSx`` and ``1, nSy`` that the logical processor bound to thread
number ``myThid`` owns. The thread number variable ``myThid`` ranges from 1
to the total number of threads requested at execution time. For each
value of ``myThid`` the loop scheme above will step sequentially through
the tiles owned by that thread. However, different threads will have
different ranges of tiles assigned to them, so that separate threads can
compute iterations of the ``bi``, ``bj`` loop concurrently. Within a ``bi``,
``bj`` loop, computation is performed concurrently over as many processes
and threads as there are physical processors available to compute.
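
As a worked illustration of this mapping (assuming a simple block
assignment of tiles to threads; the actual assignment is made in
:filelink:`ini_threading_environment.F <eesupp/src/ini_threading_environment.F>`),
the stand-alone program below prints the ``bi``, ``bj`` ranges that each
of four threads would own for ``nSx=4``, ``nSy=2``, ``nTx=2``, ``nTy=2``.

.. code-block:: fortran

         PROGRAM THREAD_TILE_RANGES
   C     Illustrative block decomposition of nSx x nSy tiles over
   C     nTx x nTy threads; not the actual WRAPPER code.
         IMPLICIT NONE
         INTEGER nSx, nSy, nTx, nTy
         PARAMETER ( nSx = 4, nSy = 2, nTx = 2, nTy = 2 )
         INTEGER iTx, iTy, myThid
         INTEGER bxLo, bxHi, byLo, byHi
         DO iTy=1,nTy
          DO iTx=1,nTx
           myThid = iTx + (iTy-1)*nTx
           bxLo = 1 + (iTx-1)*(nSx/nTx)
           bxHi = iTx*(nSx/nTx)
           byLo = 1 + (iTy-1)*(nSy/nTy)
           byHi = iTy*(nSy/nTy)
           PRINT *, 'thread', myThid, ' owns bi=', bxLo, ':', bxHi,
        &           '  bj=', byLo, ':', byHi
          ENDDO
         ENDDO
         END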

An exception to the use of ``bi`` and ``bj`` in loops arises in the
exchange routines used when the :ref:`exch2 package <sub_phys_pkg_exch2>` is used with the cubed
sphere. In this case ``bj`` is generally set to 1 and the loop runs from
``1, bi``. Within the loop ``bi`` is used to retrieve the tile number,
which is then used to reference exchange parameters.

The amount of computation that can be embedded in a single loop over ``bi``
and ``bj`` varies for different parts of the MITgcm algorithm.
Consider a code extract from the 2-D implicit elliptic solver:

.. code-block:: fortran

         REAL*8  cg2d_r(1-OLx:sNx+OLx,1-OLy:sNy+OLy,nSx,nSy)
         REAL*8  err
             :
             :
           other computations
             :
             :
         err = 0.
         DO bj=myByLo(myThid),myByHi(myThid)
          DO bi=myBxLo(myThid),myBxHi(myThid)
           DO J=1,sNy
            DO I=1,sNx
             err = err + cg2d_r(I,J,bi,bj)*cg2d_r(I,J,bi,bj)
            ENDDO
           ENDDO
          ENDDO
         ENDDO

         CALL GLOBAL_SUM_R8( err , myThid )
         err = SQRT(err)

This portion of the code computes the :math:`L_2`\ -Norm
of a vector whose elements are held in the array ``cg2d_r``, writing the
final result to scalar variable ``err``. Notice that under the WRAPPER,
arrays such as ``cg2d_r`` have two extra trailing dimensions. These
rightmost indices are tile indexes. Different threads within a single process
operate on different ranges of tile index, as controlled by the settings
of ``myByLo(myThid)``, ``myByHi(myThid)``, ``myBxLo(myThid)`` and
``myBxHi(myThid)``. Because the :math:`L_2`\ -Norm
requires a global reduction, the ``bi``, ``bj`` loop above only contains one
statement. This computation phase is then followed by a communication
phase in which all threads and processes must participate. However, in
other areas of the MITgcm code, entire subsections of code are within a
single ``bi``, ``bj`` loop. For example the evaluation of all the momentum
equation prognostic terms (see :filelink:`dynamics.F <model/src/dynamics.F>`) is within a single
``bi``, ``bj`` loop.
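
In a distributed-memory configuration, the global reduction performed by
``GLOBAL_SUM_R8`` can be thought of as an all-reduce across processes
(plus a sum over the threads within each process). The routine below is
a minimal sketch of such a collective for the one-thread-per-process
case; it is an illustration only and omits the thread handling of the
real :filelink:`global_sum.F <eesupp/src/global_sum.F>`.

.. code-block:: fortran

         SUBROUTINE SKETCH_GLOBAL_SUM_R8( sumPhi, myThid )
   C     Illustrative stand-in for the WRAPPER global sum: combine the
   C     local partial sum from every process into a single total that
   C     is returned to all of them.
         IMPLICIT NONE
         INCLUDE 'mpif.h'
         REAL*8 sumPhi
         INTEGER myThid
         REAL*8 localSum, globalSum
         INTEGER ierr
         localSum = sumPhi
         CALL MPI_ALLREDUCE( localSum, globalSum, 1,
        &                    MPI_DOUBLE_PRECISION, MPI_SUM,
        &                    MPI_COMM_WORLD, ierr )
         sumPhi = globalSum
         RETURN
         END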

The final decomposition parameters are ``nPx`` and ``nPy``. These parameters
are used to indicate to the WRAPPER level how many processes (each with
``nSx``\ :math:`\times`\ ``nSy`` tiles) will be used for this simulation.
This information is needed during initialization and during I/O phases.
However, unlike the variables ``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx`` and ``nSy`` the
values of ``nPx`` and ``nPy`` are absent from the core numerical and support
code.

Examples of :filelink:`SIZE.h <model/inc/SIZE.h>` specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following different :filelink:`SIZE.h <model/inc/SIZE.h>` parameter settings illustrate how to
interpret the values of ``sNx``, ``sNy``, ``OLx``, ``OLy``, ``nSx``, ``nSy``, ``nPx`` and ``nPy``.

#. ::

       PARAMETER (
      &           sNx =  90,
      &           sNy =  40,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   1,
      &           nPx =   1,
      &           nPy =   1)

   This sets up a single tile with *x*-dimension of ninety grid points,
   *y*-dimension of forty grid points, and *x* and *y* overlaps of three grid
   points each.

#. ::

       PARAMETER (
      &           sNx =  45,
      &           sNy =  20,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   1,
      &           nPx =   2,
      &           nPy =   2)

   This sets up tiles with *x*-dimension of forty-five grid points,
   *y*-dimension of twenty grid points, and *x* and *y* overlaps of three grid
   points each. There are four tiles allocated to four separate
   processes (``nPx=2, nPy=2``) and arranged so that the global domain size
   is again ninety grid points in *x* and forty grid points in *y*. In
   general the formula for global grid size (held in model variables
   ``Nx`` and ``Ny``) is

   ::

      Nx = sNx*nSx*nPx
      Ny = sNy*nSy*nPy

#. ::

       PARAMETER (
      &           sNx =  90,
      &           sNy =  10,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   1,
      &           nSy =   2,
      &           nPx =   1,
      &           nPy =   2)

   This sets up tiles with *x*-dimension of ninety grid points,
   *y*-dimension of ten grid points, and *x* and *y* overlaps of three grid
   points each. There are four tiles allocated to two separate processes
   (``nPy=2``), each of which has two separate sub-domains (``nSy=2``). The
   global domain size is again ninety grid points in *x* and forty grid
   points in *y*. The two sub-domains in each process will be computed
   sequentially if they are given to a single thread within a single
   process. Alternatively if the code is invoked with multiple threads
   per process the two domains in *y* may be computed concurrently.

#. ::

       PARAMETER (
      &           sNx =  32,
      &           sNy =  32,
      &           OLx =   3,
      &           OLy =   3,
      &           nSx =   6,
      &           nSy =   1,
      &           nPx =   1,
      &           nPy =   1)

   This sets up tiles with *x*-dimension of thirty-two grid points,
   *y*-dimension of thirty-two grid points, and *x* and *y* overlaps of three
   grid points each. There are six tiles allocated to six separate
   logical processors (``nSx=6``). This set of values can be used for a
   cube sphere calculation. Each tile of size :math:`32 \times 32`
   represents a face of the cube. Initializing the tile connectivity
   correctly (see :numref:`cubed_sphere_comm`) allows the
   rotations associated with moving between the six cube faces to be
   embedded within the tile-tile communication code.

.. _starting_code:

Starting the code
-----------------

When code is started under the WRAPPER, execution begins in a main
routine :filelink:`eesupp/src/main.F` that is owned by the WRAPPER. Control is
transferred to the application through a routine called
:filelink:`model/src/the_model_main.F` once the WRAPPER has initialized correctly and has
created the necessary variables to support subsequent calls to
communication routines by the application code. The main stages of the WRAPPER startup calling
sequence are as follows:

::

       MAIN
       |
       |--EEBOOT :: WRAPPER initialization
       |  |
       |  |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
       |  |                     allow basic I/O.
       |  |-- EEINTRO_MSG    :: Write startup greeting.
       |  |
       |  |-- EESET_PARMS    :: Set WRAPPER parameters
       |  |
       |  |-- EEWRITE_EEENV  :: Print WRAPPER parameter settings
       |  |
       |  |-- INI_PROCS      :: Associate processes with grid regions.
       |  |
       |  |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
       |  |
       |  |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
       |  |                             :: communication data structures
       |
       |
       |--CHECK_THREADS :: Validate multiple thread start up.
       |
       |--THE_MODEL_MAIN :: Numerical code top-level driver routine

The steps above precede the transfer of control to application code, which occurs in the procedure :filelink:`the_model_main.F <model/src/the_model_main.F>`.

.. _multi-thread_exe:

Multi-threaded execution
~~~~~~~~~~~~~~~~~~~~~~~~

Prior to transferring control to the procedure :filelink:`the_model_main.F <model/src/the_model_main.F>`
the WRAPPER may cause several coarse grain threads to be initialized.
The routine :filelink:`the_model_main.F <model/src/the_model_main.F>` is called once for each thread and is
passed a single stack argument which is the thread number, stored in
the variable :varlink:`myThid`. In addition to specifying a decomposition with
multiple tiles per process (see :numref:`specify_decomp`), configuring and starting a code to
run using multiple threads requires the following steps.

Compilation
###########

First the code must be compiled with appropriate multi-threading
directives active in the file :filelink:`eesupp/src/main.F` and with appropriate compiler
flags to request multi-threading support. The header files
:filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h` and :filelink:`eesupp/inc/MAIN_PDIRECTIVES2.h` contain directives
compatible with compilers for Sun, Compaq, SGI, Hewlett-Packard SMP
systems and CRAY PVP systems. These directives can be activated by using
compile time directives ``-DTARGET_SUN``, ``-DTARGET_DEC``,
``-DTARGET_SGI``, ``-DTARGET_HP`` or ``-DTARGET_CRAY_VECTOR``
respectively. Compiler options for invoking multi-threaded compilation
vary from system to system and from compiler to compiler. The options
will be described in the individual compiler documentation. For the
Fortran compiler from Sun the following options are needed to correctly
compile multi-threaded code

::

   -stackvar -explicitpar -vpara -noautopar

These options are specific to the Sun compiler. Other compilers will use
different syntax that will be described in their documentation. The
effect of these options is as follows:

#. **-stackvar** Causes all local variables to be allocated in stack
   storage. This is necessary for local variables to ensure that they
   are private to their thread. Note, when using this option it may be
   necessary to override the default limit on stack-size that the
   operating system assigns to a process. This can normally be done by
   changing the settings of the command shell’s ``stack-size`` limit.
   However, on some systems changing this limit will require
   privileged administrator access to modify system parameters.

#. **-explicitpar** Requests that multiple threads be spawned in
   response to explicit directives in the application code. These
   directives are inserted with syntax appropriate to the particular
   target platform when, for example, the ``-DTARGET_SUN`` flag is
   selected.

#. **-vpara** This causes the compiler to describe the multi-threaded
   configuration it is creating. This is not required but it can be
   useful when troubleshooting.

#. **-noautopar** This inhibits any automatic multi-threaded
   parallelization the compiler may otherwise generate.

An example of valid settings for the ``eedata`` file for a domain with two
subdomains in *y* and running with two threads is shown below

::

   nTx=1,nTy=2

This set of values will cause computations to stay within a single
thread when moving across the ``nSx`` sub-domains. In the *y*-direction,
however, sub-domains will be split equally between two threads.

Despite its appealing programming model, multi-threaded execution
remains less common than multi-process execution (described in :numref:`multi-process_exe`).
One major reason for
this is that many system libraries are still not “thread-safe”. This
means that, for example, on some systems it is not safe to call system
routines to perform I/O when running in multi-threaded mode (except,
perhaps, in a limited set of circumstances). Another reason is that
support for multi-threaded programming models varies between systems.

.. _multi-process_exe:

Multi-process execution
~~~~~~~~~~~~~~~~~~~~~~~

Multi-process execution is more ubiquitous than multi-threaded execution.
In order to run code in a
multi-process configuration, a decomposition specification (see
:numref:`specify_decomp`) is given (in which at least one
of the parameters ``nPx`` or ``nPy`` will be greater than one). Then, as
for multi-threaded operation, appropriate compile time and run time
steps must be taken.

Compilation
###########

Multi-process execution under the WRAPPER assumes that portable
MPI libraries are available for controlling the start-up of multiple
processes. The MPI libraries are not required, although they are
usually used, for performance critical communication. However, in
order to simplify the task of controlling and coordinating the start
up of a large number (hundreds and possibly even thousands) of copies
of the same program, MPI is used. The calls to the MPI multi-process
startup routines must be activated at compile time. Currently MPI
libraries are invoked by specifying the appropriate options file with
the ``-of`` flag when running the :filelink:`genmake2 <tools/genmake2>`
script, which generates the
Makefile for compiling and linking MITgcm. (Previously this was done
by setting the ``ALLOW_USE_MPI`` and ``ALWAYS_USE_MPI`` flags in the
:filelink:`CPP_EEOPTIONS.h </eesupp/inc/CPP_EEOPTIONS.h>` file.) More
detailed information about the use of
:filelink:`genmake2 <tools/genmake2>` for specifying local compiler
flags is located in :numref:`genmake2_desc`.

Execution
#########

The mechanics of starting a program in multi-process mode under MPI is
not standardized. Documentation associated with the distribution of MPI
installed on a system will describe how to start a program using that
distribution. For the open-source `MPICH <https://www.mpich.org/>`_
system, the MITgcm program can
be started using a command such as

::

   mpirun -np 64 -machinefile mf ./mitgcmuv

In this example the text ``-np 64`` specifies the number of processes
that will be created. The numeric value 64 must be equal to (or greater than) the
product of the processor grid settings of ``nPx`` and ``nPy`` in the file
:filelink:`SIZE.h <model/inc/SIZE.h>`. The option ``-machinefile mf``
specifies that a text file called ``mf``
will be read to get a list of processor names on which the sixty-four
processes will execute. The syntax of this file is specified by the
MPI distribution.
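
For illustration only (the host names are hypothetical and the exact
syntax depends on the MPI distribution), a machinefile for MPICH listing
four compute nodes with sixteen processes each might look like:

::

   node001:16
   node002:16
   node003:16
   node004:16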

Environment variables
~~~~~~~~~~~~~~~~~~~~~

On some systems multi-threaded execution also requires the setting of a
special environment variable. On many machines this variable is called
``PARALLEL`` and its values should be set to the number of parallel threads
required. Generally the help or manual pages associated with the
multi-threaded compiler on a machine will explain how to set the
required environment variables.
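
For example, to request two threads the variable might be set as
follows (assuming that ``PARALLEL`` is indeed the variable your system
uses):

::

   # csh/tcsh shells
   setenv PARALLEL 2

   # sh/bash shells
   export PARALLEL=2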

Runtime input parameters
~~~~~~~~~~~~~~~~~~~~~~~~

Finally the file ``eedata``
needs to be configured to indicate the number
of threads to be used in the *x* and *y* directions:

::

   # Example "eedata" file
   # Lines beginning "#" are comments
   # nTx - No. threads per process in X
   # nTy - No. threads per process in Y
    &EEPARMS
    nTx=1,
    nTy=1,
    &

The product of ``nTx`` and ``nTy`` must be equal to the number of threads
spawned, i.e., the setting of the environment variable ``PARALLEL``. The value
of ``nTx`` must subdivide the number of sub-domains in *x* (``nSx``) exactly.
The value of ``nTy`` must subdivide the number of sub-domains in *y* (``nSy``)
exactly. The multi-process startup of the MITgcm executable ``mitgcmuv`` is
controlled by the routines :filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>`
and :filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. The
first routine performs basic steps required to make sure each process is
started and has a textual output stream associated with it. By default
two output files are opened for each process with names ``STDOUT.NNNN``
and ``STDERR.NNNN``. The *NNNN* part of the name is filled in with
the process number so that process number 0 will create output files
``STDOUT.0000`` and ``STDERR.0000``, process number 1 will create output
files ``STDOUT.0001`` and ``STDERR.0001``, etc. These files are used for
reporting status and configuration information and for reporting error
conditions on a process-by-process basis. The :filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>`
procedure also sets the variables :varlink:`myProcId` and :varlink:`MPI_COMM_MODEL`.
These variables are related to processor identification and are used
later in the routine :filelink:`ini_procs.F <eesupp/src/ini_procs.F>` to allocate tiles to processes.
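
The stand-alone sketch below (not the real
:filelink:`eeboot_minimal.F <eesupp/src/eeboot_minimal.F>`, whose details
differ) illustrates the kind of per-process start-up just described:
each process queries its MPI rank and opens its own ``STDOUT.NNNN`` file
for status output.

.. code-block:: fortran

         PROGRAM BOOT_SKETCH
   C     Illustrative per-process start-up: query the process number
   C     and open a process-specific STDOUT.NNNN output file.
         IMPLICIT NONE
         INCLUDE 'mpif.h'
         INTEGER myProcId, ierr
         CHARACTER*(11) fName
         CALL MPI_INIT( ierr )
         CALL MPI_COMM_RANK( MPI_COMM_WORLD, myProcId, ierr )
         WRITE(fName,'(A,I4.4)') 'STDOUT.', myProcId
         OPEN( UNIT=71, FILE=fName, STATUS='UNKNOWN' )
         WRITE(71,*) '// Process number ', myProcId, ' started'
         CLOSE(71)
         CALL MPI_FINALIZE( ierr )
         END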

Allocation of processes to tiles is controlled by the routine
:filelink:`ini_procs.F <eesupp/src/ini_procs.F>`. For each process this routine sets the variables
:varlink:`myXGlobalLo` and :varlink:`myYGlobalLo`. These variables specify, in index
space, the coordinates of the southernmost and westernmost corner of
the southernmost and westernmost tile owned by this process. The
variables :varlink:`pidW`, :varlink:`pidE`, :varlink:`pidS` and :varlink:`pidN` are also set in this
routine. These are used to identify processes holding tiles to the
west, east, south and north of a given process. These values are
stored in global storage in the header file :filelink:`EESUPPORT.h <eesupp/inc/EESUPPORT.h>` for use by
communication routines. The above does not hold when the :ref:`exch2 package <sub_phys_pkg_exch2>`
is used. The :ref:`exch2 package <sub_phys_pkg_exch2>` sets its own parameters to specify the global
indices of tiles and their relationships to each other. See the
documentation on the :ref:`exch2 package <sub_phys_pkg_exch2>` for details.

.. _controlling_comm:

Controlling communication
-------------------------

The WRAPPER maintains internal information that is used for
communication operations and can be customized for different
platforms. This section describes the information that is held and used.

1. **Tile-tile connectivity information** For each tile the WRAPPER
   records the tile number of each of the tiles to the north, south, east
   and west of that tile. This number is unique over all tiles in a
   configuration. Except when using the cubed sphere and
   the :ref:`exch2 package <sub_phys_pkg_exch2>`,
   the number is held in the variables :varlink:`tileNo` (this holds
   the tile’s own number), :varlink:`tileNoN`, :varlink:`tileNoS`,
   :varlink:`tileNoE` and :varlink:`tileNoW`.
   A parameter is also stored with each tile that specifies the type of
   communication that is used between tiles. This information is held in
   the variables :varlink:`tileCommModeN`, :varlink:`tileCommModeS`,
   :varlink:`tileCommModeE` and
   :varlink:`tileCommModeW`. This latter set of variables can take one of the
   following values ``COMM_NONE``, ``COMM_MSG``, ``COMM_PUT`` and
   ``COMM_GET``. A value of ``COMM_NONE`` is used to indicate that a tile
   has no neighbor to communicate with on a particular face. A value of
   ``COMM_MSG`` is used to indicate that some form of distributed memory
   communication is required to communicate between these tile faces
   (see :numref:`distributed_mem_comm`). A value of
   ``COMM_PUT`` or ``COMM_GET`` is used to indicate forms of shared memory
   communication (see :numref:`shared_mem_comm`). The
   ``COMM_PUT`` value indicates that a CPU should communicate by writing
   to data structures owned by another CPU. A ``COMM_GET`` value
   indicates that a CPU should communicate by reading from data
   structures owned by another CPU. These flags affect the behavior of
   the WRAPPER exchange primitive (see :numref:`comm-prim`). The routine
   :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
   is responsible for setting the
   communication mode values for each tile.

   When using the cubed sphere configuration with the :ref:`exch2 package <sub_phys_pkg_exch2>`, the
   relationships between tiles and their communication methods are set
   by the :ref:`exch2 package <sub_phys_pkg_exch2>` and stored in different variables.
   See the :ref:`exch2 package <sub_phys_pkg_exch2>`
   documentation for details.

   |

2. **MP directives** The WRAPPER transfers control to numerical
   application code through the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>`. This routine
   is called in a way that allows for it to be invoked by several
   threads. Support for this is based on either multi-processing (MP)
   compiler directives or specific calls to multi-threading libraries
   (e.g., POSIX threads). Most commercially available Fortran compilers
   support the generation of code to spawn multiple threads through some
   form of compiler directives. Compiler directives are generally more
   convenient than writing code to explicitly spawn threads. On
   some systems, compiler directives may be the only method available.
   The WRAPPER is distributed with template MP directives for a number
   of systems.

   These directives are inserted into the code just before and after the
   transfer of control to numerical algorithm code through the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>`. An example of
   the code that performs this process for a Silicon Graphics system is as follows:

   ::

      C--
      C--  Parallel directives for MIPS Pro Fortran compiler
      C--
      C      Parallel compiler directives for SGI with IRIX
      C$PAR  PARALLEL DO
      C$PAR&  CHUNK=1,MP_SCHEDTYPE=INTERLEAVE,
      C$PAR&  SHARE(nThreads),LOCAL(myThid,I)
      C
            DO I=1,nThreads
              myThid = I

      C--     Invoke nThreads instances of the numerical model
              CALL THE_MODEL_MAIN(myThid)

            ENDDO

   Prior to transferring control to the procedure
   :filelink:`the_model_main.F <model/src/the_model_main.F>` the
   WRAPPER may use MP directives to spawn multiple threads. This code
   is extracted from the files :filelink:`main.F <eesupp/src/main.F>` and
   :filelink:`eesupp/inc/MAIN_PDIRECTIVES1.h`. The variable
   :varlink:`nThreads` specifies how many instances of the routine
   :filelink:`the_model_main.F <model/src/the_model_main.F>` will be created.
   The value of :varlink:`nThreads` is set in the routine
   :filelink:`ini_threading_environment.F <eesupp/src/ini_threading_environment.F>`.
   The value is set equal to the product of the parameters
   :varlink:`nTx` and :varlink:`nTy` that are read from the file
   ``eedata``. If the value of :varlink:`nThreads` is inconsistent with the number
   of threads requested from the operating system (for example by using
   an environment variable as described in :numref:`multi-thread_exe`)
   then usually an error will be
   reported by the routine :filelink:`check_threads.F <eesupp/src/check_threads.F>`.

   |
1145
1146 3. **memsync flags** As discussed in :numref:`shared_mem_comm`,
1147 a low-level system function may be need to force memory consistency
1148 on some shared memory systems. The routine
1149 :filelink:`memsync.F <eesupp/src/memsync.F>` is used for
1150 this purpose. This routine should not need modifying and the
1151 information below is only provided for completeness. A logical
1152 parameter :varlink:`exchNeedsMemSync` set in the routine
1153 :filelink:`ini_communication_patterns.F <eesupp/src/ini_communication_patterns.F>`
1154 controls whether the :filelink:`memsync.F <eesupp/src/memsync.F>`
1155 primitive is called. In general this routine is only used for
1156 multi-threaded execution. The code that goes into the :filelink:`memsync.F <eesupp/src/memsync.F>`
1157 routine is specific to the compiler and processor used. In some
1158 cases, it must be written using a short snippet of assembly
1159 language. For an UltraSPARC system the following code is
1160 used
1161
1162 ::
1163
1164 asm("membar #LoadStore|#StoreStore");
1165
1166 For an Alpha-based system the equivalent code reads
1167
1168 ::
1169
1170 asm("mb");
1171
1172 while on an x86 system the following code is required
1173
1174 ::
1175
1176 asm("lock; addl $0,0(%%esp)": : :"memory");
1177
1178 #. **Cache line size** As discussed in :numref:`shared_mem_comm`,
1179 multi-threaded codes
1180 explicitly avoid penalties associated with excessive coherence
1181 traffic on an SMP system. To do this the shared memory data
1182 structures used by the :filelink:`global_sum.F <eesupp/src/global_sum.F>`,
1183 :filelink:`global_max.F <eesupp/src/global_max.F>` and
1184 :filelink:`barrier.F <eesupp/src/barrier.F>`
1185 routines are padded. The variables that control the padding are set
1186 in the header file :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`.
1187 These variables are called :varlink:`cacheLineSize`, :varlink:`lShare1`,
1188 :varlink:`lShare4` and :varlink:`lShare8`. The default
1189 values should not normally need changing.
1190
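   The sketch below illustrates the idea; the declaration is illustrative
   only (the actual shared data structures are declared inside the
   routines listed above), but it shows how padding the first dimension
   by :varlink:`lShare8` 8-byte words keeps the slots written by different
   threads on different cache lines.

   ::

         C     Illustrative sketch only -- not the actual WRAPPER code.
         C     Thread myThid writes element (1,myThid); the lShare8 padding
         C     ensures no two threads update the same cache line.
               Real*8 sharedSum( lShare8, MAX_NO_THREADS )
               COMMON / SKETCH_SHARED_SUM / sharedSum
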
1191 |
1192
1193 #. **\_BARRIER** This is a CPP macro that is expanded to a call to a
1194 routine which synchronizes all the logical processors running under
1195 the WRAPPER. Using a macro here preserves flexibility to insert a
1196 specialized call in-line into application code. By default this
1197 resolves to calling the procedure :filelink:`barrier.F <eesupp/src/barrier.F>`.
1198 The default setting for the ``_BARRIER`` macro is given in the
1199 file :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
1200
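   A typical multi-threaded usage pattern is sketched below; the two
   subroutine names are placeholders, not WRAPPER routines. Each thread
   finishes its tile-local work, all threads synchronize at the barrier,
   and only then does any thread read results produced by other threads.

   ::

         C     Sketch only; the called routines are hypothetical.
               CALL UPDATE_MY_TILES( myThid )
         C     No thread proceeds past this point until all threads reach it.
               _BARRIER
               CALL READ_NEIGHBOR_RESULTS( myThid )
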
1201 |
1202
1203 #. **\_GSUM** This is a CPP macro that is expanded to a call to a
1204 routine which sums up a floating point number over all the logical
1205 processors running under the WRAPPER. Using a macro here provides
1206 extra flexibility to insert a specialized call in-line into
1207 application code. By default this resolves to calling the procedure
1208 ``GLOBAL_SUM_R8()`` for 64-bit floating point operands or
1209 ``GLOBAL_SUM_R4()`` for 32-bit floating point operands
1210 (located in file :filelink:`global_sum.F <eesupp/src/global_sum.F>`). The default
1211 setting for the ``_GSUM`` macro is given in the file
1212 :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
1213 The ``_GSUM`` macro is a performance-critical operation, especially for
1214 large processor count, small tile size configurations. The custom
1215 communication example discussed in :numref:`jam_example` shows
1216 how the macro is used to invoke a custom global sum routine for a
1217 specific set of hardware.
1218
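   A schematic example of how the macro is used in application code is
   given below. The variable names are illustrative and the exact macro
   argument list is the one defined in
   :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`; the point is
   that the calling code is written once, in terms of ``_GSUM``, while the
   build selects the routine that actually performs the sum.

   ::

         C     Sketch only: combine per-thread/per-process partial sums
         C     into a single global value (argument list illustrative).
               Real*8 sumPhi
               sumPhi = myTilePartialSum
               _GSUM( sumPhi, myThid )
         C     sumPhi now holds the sum over all tiles, threads and processes.
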
1219 |
1220
1221 #. **\_EXCH** The ``_EXCH`` CPP macro is used to update tile overlap
1222 regions. It is qualified by a suffix indicating whether overlap
1223 updates are for two-dimensional (``_EXCH_XY``) or three-dimensional
1224 (``_EXCH_XYZ``) physical fields and whether fields are 32-bit floating
1225 point (``_EXCH_XY_R4``, ``_EXCH_XYZ_R4``) or 64-bit floating point
1226 (``_EXCH_XY_R8``, ``_EXCH_XYZ_R8``). The macro mappings are defined in
1227 the header file :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.
1228 As with ``_GSUM``, the ``_EXCH``
1229 operation plays a crucial role in scaling to small tile, large
1230 logical and physical processor count configurations. The example in
1231 :numref:`jam_example` discusses defining an optimized and
1232 specialized form of the ``_EXCH`` operation.
1233
1234 The ``_EXCH`` operation is also central to supporting grids such as the
1235 cube-sphere grid. In this class of grid a rotation may be required
1236 between tiles. Aligning the coordinate requiring rotation with the
1237 tile decomposition allows the coordinate transformation to be
1238 embedded within a custom form of the ``_EXCH`` primitive. In these cases
1239 ``_EXCH`` is mapped to exch2 routines, as detailed in the :ref:`exch2 package <sub_phys_pkg_exch2>`
1240 documentation.
1241
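   Schematically, the macro appears in application code immediately after
   a field has been updated tile by tile, so that the overlap regions are
   refreshed before any computation that reads them. The routine and field
   names below are illustrative, and the exact macro argument list is the
   one defined in :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`.

   ::

         C     Sketch only: update a three-dimensional 64-bit field on
         C     every tile, then refresh its overlap regions.
               CALL UPDATE_THETA_TILES( theta, myThid )
               _EXCH_XYZ_R8( theta, myThid )
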
1242 |
1243
1244 #. **Reverse Mode** The communication primitives ``_EXCH`` and ``_GSUM`` both
1245 employ hand-written adjoint (or reverse mode) forms. These
1246 reverse mode forms can be found in the source code directory
1247 :filelink:`pkg/autodiff`. For the global sum primitive the reverse mode form
1248 calls are to ``GLOBAL_ADSUM_R4()`` and ``GLOBAL_ADSUM_R8()`` (located in
1249 file :filelink:`global_sum_ad.F <pkg/autodiff/global_sum_ad.F>`). The reverse
1250 mode form of the exchange primitives are found in routines prefixed
1251 ``ADEXCH``. The exchange routines make calls to the same low-level
1252 communication primitives as the forward mode operations. However, the
1253 routine argument :varlink:`theSimulationMode` is set to the value
1254 ``REVERSE_SIMULATION``. This signifies to the low-level routines that
1255 the adjoint forms of the appropriate communication operation should
1256 be performed.
1257
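   Schematically, the low-level code selects between the two behaviors
   with a test of the following form. The routine names in the sketch are
   placeholders; only :varlink:`theSimulationMode` and
   ``REVERSE_SIMULATION`` are taken from the description above.

   ::

         C     Placeholder sketch of the mode test in a low-level routine.
               IF ( theSimulationMode .EQ. REVERSE_SIMULATION ) THEN
         C       Apply the adjoint form of the overlap update.
                 CALL OVERLAP_UPDATE_ADJOINT( fld, myThid )
               ELSE
         C       Apply the forward-mode overlap update.
                 CALL OVERLAP_UPDATE_FORWARD( fld, myThid )
               ENDIF
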
1258 |
1259
1260 #. **MAX_NO_THREADS** The variable :varlink:`MAX_NO_THREADS` is used to
1261 indicate the maximum number of OS threads that a code will use. This
1262 value defaults to thirty-two and is set in the file
1263 :filelink:`EEPARAMS.h <eesupp/inc/EEPARAMS.h>`. For
1264 single-threaded execution it can be reduced to one if required. The
1265 value is largely private to the WRAPPER and application code will not
1266 normally reference the value, except in the following scenario.
1267
1268 For certain physical parameterization schemes it is necessary to have
1269 a substantial number of work arrays. Where these arrays are held in
1270 static storage (for example COMMON blocks), multi-threaded execution
1271 will require multiple instances of the COMMON block data. This can be
1272 achieved using a Fortran 90 module construct. However, if this
1273 mechanism is unavailable then the work arrays can be extended with
1274 dimensions using the tile dimensioning scheme of :varlink:`nSx` and :varlink:`nSy` (as
1275 described in :numref:`specify_decomp`). However, if
1276 the configuration being specified involves many more tiles than OS
1277 threads then it can save memory resources to reduce the variable
1278 :varlink:`MAX_NO_THREADS` to be equal to the actual number of threads that
1279 will be used and to declare the physical parameterization work arrays
1280 with a single :varlink:`MAX_NO_THREADS` extra dimension. An example of this
1281 is given in the verification experiment :filelink:`verification/aim.5l_cs`. Here the
1282 default setting of :varlink:`MAX_NO_THREADS` is altered to
1283
1284 ::
1285
1286 INTEGER MAX_NO_THREADS
1287 PARAMETER ( MAX_NO_THREADS = 6 )
1288
1289 and several work arrays for storing intermediate calculations are
1290 created with declarations of the form:
1291
1292 ::
1293
1294 common /FORCIN/ sst1(ngp,MAX_NO_THREADS)
1295
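   Each thread then addresses only its own slice of such a work array
   through its thread index, along the lines of the following sketch (the
   loop body is purely illustrative):

   ::

         C     Sketch: thread myThid touches only column myThid of sst1.
               DO i = 1, ngp
                 sst1(i,myThid) = 0. _d 0
               ENDDO
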
1296 This declaration scheme is not used widely, because most global data
1297 is used for permanent, not temporary, storage of state information. In
1298 the case of permanent state information this approach cannot be used
1299 because there has to be enough storage allocated for all tiles.
1300 However, the technique can sometimes be a useful scheme for reducing
1301 memory requirements in complex physical parameterizations.
1302
1303 Specializing the Communication Code
1304 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1305
1306 The isolation of performance-critical communication primitives and the
1307 subdivision of the simulation domain into tiles together form a powerful
1308 combination. Here we show how they can be used to improve application
1309 performance and to adapt to new gridding approaches.
1310
1311 .. _jam_example:
1312
1313 JAM example
1314 ~~~~~~~~~~~
1315
1316 On some platforms a big performance boost can be obtained by binding the
1317 communication routines ``_EXCH`` and ``_GSUM`` to specialized native
1318 libraries (for example, the shmem library on CRAY T3E systems). The
1319 ``LETS_MAKE_JAM`` CPP flag is used as an illustration of a specialized
1320 communication configuration that substitutes for standard, portable
1321 forms of ``_EXCH`` and ``_GSUM``. It affects three source files:
1322 :filelink:`eeboot.F <eesupp/src/eeboot.F>`, :filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`
1323 and :filelink:`cg2d.F </model/src/cg2d.F>`. When the flag is defined it
1324 has the following effects:
1325
1326 - An extra phase is included at boot time to initialize the custom
1327 communications library (see ``ini_jam.F``).
1328
1329 - The ``_GSUM`` and ``_EXCH`` macro definitions are replaced with calls
1330   to custom routines (see ``gsum_jam.F`` and ``exch_jam.F``), as sketched below.
1331
1332 - A highly specialized form of the exchange operator (optimized for
1333 overlap regions of width one) is substituted into the elliptic solver
1334 routine :filelink:`cg2d.F </model/src/cg2d.F>`.
1335
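The kind of substitution involved can be sketched as below. The fragment
is illustrative rather than a copy of
:filelink:`CPP_EEMACROS.h <eesupp/inc/CPP_EEMACROS.h>`, and the routine
name and argument list shown are assumptions made for the purpose of the
example.

::

    #ifdef LETS_MAKE_JAM
    #define _GSUM( a, myThid ) CALL GSUM_JAM( a, myThid )
    #else
    #define _GSUM( a, myThid ) CALL GLOBAL_SUM_R8( a, myThid )
    #endif
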
1336 Developing specialized code for other libraries follows a similar
1337 pattern.
1338
1339 .. _cubed_sphere_comm:
1340
1341 Cube sphere communication
1342 ~~~~~~~~~~~~~~~~~~~~~~~~~
1343
1344 Actual ``_EXCH`` routine code is generated automatically from a series of
1345 template files, for example
1346 :filelink:`exch2_rx1_cube.template </pkg/exch2/exch2_rx1_cube.template>`.
1347 This is done to allow a
1348 large number of variations of the exchange process to be maintained. One
1349 set of variations supports the cube sphere grid. Support for a cube
1350 sphere grid in MITgcm is based on having each face of the cube as a
1351 separate tile or tiles. The exchange routines are then able to absorb
1352 much of the detailed rotation and reorientation required when moving
1353 around the cube grid. The set of ``_EXCH`` routines that contain the word
1354 cube in their name perform these transformations. They are invoked when
1355 the run-time logical parameter :varlink:`useCubedSphereExchange` is
1356 set ``.TRUE.``. To
1357 facilitate the transformations on a staggered C-grid, exchange
1358 operations are defined separately for both vector and scalar quantities
1359 and for grid-centered, grid-face, and grid-corner quantities.
1360 Three sets of exchange routines are defined. Routines with names of the
1361 form ``exch2_rx`` are used to exchange cell-centered scalar quantities.
1362 Routines with names of the form ``exch2_uv_rx`` are used to exchange
1363 vector quantities located at the C-grid velocity points. The vector
1364 quantities exchanged by the ``exch2_uv_rx`` routines can either be signed
1365 (for example velocity components) or unsigned (for example grid-cell
1366 separations). Routines with names of the form ``exch2_z_rx`` are used to
1367 exchange quantities at the C-grid vorticity point locations.
1368
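For example, a cubed-sphere verification experiment activates these
routines with a line of the following form in its ``eedata`` file
(a sketch; other entries in the namelist are omitted):

::

     &EEPARMS
     useCubedSphereExchange=.TRUE.,
     &
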
1369 MITgcm execution under WRAPPER
1370 ==============================
1371
1372 Fitting together the WRAPPER elements, package elements and MITgcm core
1373 equation elements of the source code produces the calling sequence shown below.
1374
1375 Annotated call tree for MITgcm and WRAPPER
1376 ------------------------------------------
1377
1378 WRAPPER layer.
1379
1380 ::
1381
1382
1383 MAIN
1384 |
1385 |--EEBOOT :: WRAPPER initialization
1386 | |
1387 | |-- EEBOOT_MINIMAL :: Minimal startup. Just enough to
1388 | | allow basic I/O.
1389 | |-- EEINTRO_MSG :: Write startup greeting.
1390 | |
1391 | |-- EESET_PARMS :: Set WRAPPER parameters
1392 | |
1393 | |-- EEWRITE_EEENV :: Print WRAPPER parameter settings
1394 | |
1395 | |-- INI_PROCS :: Associate processes with grid regions.
1396 | |
1397 | |-- INI_THREADING_ENVIRONMENT :: Associate threads with grid regions.
1398 | |
1399 | |--INI_COMMUNICATION_PATTERNS :: Initialize between tile
1400 | :: communication data structures
1401 |
1402 |
1403 |--CHECK_THREADS :: Validate multiple thread start up.
1404 |
1405 |--THE_MODEL_MAIN :: Numerical code top-level driver routine
1406
1407 Core equations plus packages.
1408
1409 .. _model_main_call_tree:
1410
1411 .. literalinclude:: ../../model/src/the_model_main.F
1412     :start-at: C Invocation from WRAPPER level...
1413     :end-at: C | :: events.
1414
1415
1416 Measuring and Characterizing Performance
1417 ----------------------------------------
1418
1419 TO BE DONE (CNH)
1420
1421 Estimating Resource Requirements
1422 --------------------------------
1423
1424 TO BE DONE (CNH)
1425
1426 Atlantic 1/6 degree example
1427 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1428
1429 Dry Run testing
1430 ~~~~~~~~~~~~~~~
1431
1432 Adjoint Resource Requirements
1433 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1434
1435 State Estimation Environment Resources
1436 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~