Back to home page

MITgcm

 
 

    


File indexing completed on 2023-05-28 05:10:54 UTC

view on githubraw file Latest commit b4daa243 on 2023-05-28 03:53:22 UTC
b4daa24319 Shre*0001 /*
                0002 ##########################################################
                0003 # This file is part of the AdjoinableMPI library         #
                0004 # released under the MIT License.                        #
                0005 # The full COPYRIGHT notice can be found in the top      #
                0006 # level directory of the AdjoinableMPI distribution.     #
                0007 ########################################################## 
                0008 */
                0009 #ifndef _AMPI_AMPI_H_
                0010 #define _AMPI_AMPI_H_
                0011 
                0012 /**
                0013  * \file
                0014  * \ingroup UserInterfaceHeaders
                0015  * One-stop header file for all AD-tool-independent AMPI routines; this is the file to replace mpi.h in the user code.
                0016  */
                0017 
                0018 /**
                0019  * \defgroup UserInterfaceHeaders User-Interface header files
                0020  * This set contains all the header files with declarations relevant to the user; header files not listed in this group
                0021  * are internal to AdjoinableMPI or relate to support to be provided by a given AD tool.
                0022  */
                0023 
                0024 /**
                0025  * \defgroup UserInterfaceDeclarations User-Interface declarations
                0026  * This set contains all declarations relevant to the user; anything in the source files not listed in this group
                0027  * is internal to AdjoinableMPI or relates to support to be provided by a given AD tool.
                0028  */
                0029 
                0030 /** \mainpage 
                0031  * The Adjoinable MPI (AMPI) library provides a modified set if MPI subroutines
                0032  * that are constructed such that an adjoint in the context of algorithmic
                0033  * differentiation (AD) can be computed. The library is designed to be supported
                0034  * by a variety of AD tools and to enable also the computation of  (higher-order)
                0035  * forward derivatives.
                0036  * \authors <b>Laurent Hasco&euml;t</b> 
                0037  * (currently at INRIA Sophia-Antipolis; <a href="http://fr.linkedin.com/pub/laurent-hascoët/86/821/a04">LinkedIn</a> - <a href="mailto:Laurent.Hascoet@sophia.inria.fr?subject=AMPI">e-mail</a>)
                0038  * \authors <b>Michel Schanen</b> 
                0039  * (currently at RWTH Aachen; <a href="http://www.stce.rwth-aachen.de/people/Michel.Schanen.html">home page</a> - <a href="mailto:schanen@stce.rwth-aachen.de?subject=AMPI">e-mail</a>)
                0040  * \authors <b>Jean Utke</b> 
                0041  * (until March 2014 at Argonne National Laboratory; <a href="http://www.linkedin.com/pub/jean-utke/5/645/7a">LinkedIn</a> - <a href="mailto:utkej1@gmail.com?subject=AMPI">e-mail</a>)
                0042  *
                0043  * Contributions informing the approach implemented in AMPI were made by the co-authors of&nbsp;\cite Utke2009TAM  <b>P. Heimbach, C. Hill, U. Naummann</b>. 
                0044  * 
                0045  * Significant contributions were made by <b>Anton Bovin</b> (summer student at Argonne National Laboratory in 2013;<a href="http://www.linkedin.com/pub/anton-bovin/86/b1b/847">LinkedIn</a>).
                0046  *
                0047  * <b>Please refer to the \ref UserGuide for information regarding the use of the library in a given application.</b>
                0048  *
                0049  * Information regarding the library design, library internal functionality and the interfaces of methods to
                0050  * be supported by a given AD tool are given in \ref LibraryDevelopmentGuide
                0051  *
                0052  * \section links Links to Ressources
                0053  * 
                0054  *  - <a href="https://trac.mcs.anl.gov/projects/AdjoinableMPI/wiki">TRAC  page</a> for bug and feature tracking, links to presentations
                0055  *  - <a href="http://mercurial.mcs.anl.gov/ad/AdjoinableMPI/">mercurial repository</a> for source code and change history
                0056  *  - <a href="http://www.mcs.anl.gov/~utke/AdjoinableMPI/regression/tests.shtml">regression tests</a> 
                0057  *
                0058  */
                0059 
                0060 /**
                0061  * \page UserGuide User Guide
                0062  * \tableofcontents
                0063  * \section Introduction
                0064  * 
                0065  * The Adjoinable MPI (AMPI) library provides a modified set of MPI subroutines
                0066  * that are constructed such that:
                0067  *  - an adjoint in the context of algorithmic differentiation (AD) can be computed,
                0068  *  - it can be supported by a variety of AD tools,
                0069  *  - it enable also the computation of  (higher-order) forward derivatives,
                0070  *  - it provides an implementation for a straight pass-through to MPI such that the switch to AMPI can be made permanent
                0071  * without forcing compile dependencies on any AD tool.
                0072  *
                0073  * There are principal recipes for the construction of the adjoint of 
                0074  * a given communication, see \cite Utke2009TAM . 
                0075  * The practical implementation of these recipes, however, faces the following 
                0076  * challenges.
                0077  *  - the target language may prevent some implementation options
                0078  *   - exposing an MPI_Request augmented with extra information as a structured type (not supported by Fortran 77)
                0079  *   - passing an array of buffers (of different length), e.g. to \ref AMPI_Waitall, as an additional argument to  (not supported in any Fortran version)
                0080  *  - the AD tool implementation could be based on 
                0081  *   - operator overloading
                0082  *    - original data and (forward) derivatives co-located (e.g. Rapsodia,dco)
                0083  *    - original data and (forward) derivatives referenced (e.g. Adol-C)
                0084  *   - source transformation
                0085  *    - association by address (e.g. OpenAD)
                0086  *    - association by name (e.g. Tapenade)
                0087  * 
                0088  * The above choices imply certain consequences on the complexity for implementing  
                0089  * the adjoint (and forward derivative) action and this could imply differences in the AMPI design.
                0090  * However, from a user's perspective it is a clear advantage to present a <b>single, AD tool implementation independent
                0091  * AMPI library</b> such that switching AD tools is not hindered by AMPI while also promoting a common understanding of the
                0092  * differentiation through MPI calls.
                0093  * We assume the reader is familiar with MPI and AD concepts.
                0094  *
                0095  * \section sources Getting the library sources
                0096  * 
                0097  * The sources can be accessed through the <a href="http://mercurial.mcs.anl.gov/ad/AdjoinableMPI/">AdjoinableMPI mercurial repository</a>. Bug tracking, feature requests
                0098  * etc. are done via <a href="http://trac.mcs.anl.gov/projects/AdjoinableMPI">trac</a>.
                0099  * In the following we assume the sources are cloned (cf <a href="http://mercurial.selenic.com/">mercurial web site</a> for details about mercurial)
                0100  * into a directory `AdjoinableMPI` by invoking
                0101  * \code
                0102  * hg clone http://mercurial.mcs.anl.gov/ad/AdjoinableMPI
                0103  * \endcode
                0104  * 
                0105  * \section configure Library - Configure,  Build, and Install
                0106  * 
                0107  * Configuration, build, and install follows the typical GNU autotools chain. Go to the source directory
                0108  * \code
                0109  * cd AdjoinableMPI
                0110  * \endcode
                0111  * If the sources were obtained from the mercurial repository, then one first needs to run the autotools via invoking
                0112  * \code
                0113  * ./autogen.sh
                0114  * \endcode
                0115  * In the typical `autoconf` fashion invoke
                0116  * \code
                0117  *  configure --prefix=<installation directory> ...
                0118  * \endcode
                0119  * in or outside the source tree.
                0120  * The AD tool supporting AMPI should provide information which detailed AMPI
                0121  * configure settings are required if any.
                0122  * Build the libaries with
                0123  * \code
                0124  * make
                0125  * \endcode
                0126  * Optionally, before installing, one can do a sanity check by running:  `make check` .
                0127  *
                0128  * To install the header files and compiled libraries follow with
                0129  * \code
                0130  *  make install
                0131  * \endcode
                0132  * after which in the installation directory one should find under <tt>\<installation directory\></tt> the following.
                0133  *  - header files: see also  \ref dirStruct
                0134  *  - libraries:
                0135  *    - libampiPlainC - for pass through to MPI, no AD functionality
                0136  *    - libampiCommon - implementation of AD functionality shared between all AD tools supporting AMPI
                0137  *    - libampiBookkeeping - implementation of AD functionality needed by some AD tools (see the AD tool documentation)
                0138  *    - libampiTape - implementation of AD functionality needed by some AD tools (see the AD tool documentation)
                0139  *
                0140  * Note, the following libraries are AMPI internal:
                0141  *  - libampiADtoolStubsOO - stubs for operator overloading AD tools not needed by the user
                0142  *  - libampiADtoolStubsST - stubs for source transformation AD tools not needed by the user
                0143  *
                0144  * \section mpiToAmpi Switching from MPI to Adjoinable MPI
                0145  *
                0146  * For a given MPI-parallelized source code the user will replace all calls to MPI_... routines with the respective  AMPI_...
                0147  * equivalent provided in \ref UserInterfaceDeclarations.
                0148  * To include the declarations replace
                0149  *  - in C/C++: includes of <tt>mpi.h</tt> with
                0150  *  \code
                0151  *  #include <ampi/ampi.h>
                0152  *  \endcode
                0153  *  - in Fortran: includes of <tt>mpif.h</tt> with
                0154  *  \code
                0155  *  #include <ampi/ampif.h>
                0156  *  \endcode
                0157  *
                0158  * respectively.
                0159  *
                0160  * Because in many cases certain MPI calls (e.g. for initialization and finalization) take place outside the scope of
                0161  * the original computation and its AD-derivatives and therefore do not themselves become part of the AD process,
                0162  * see the explanations in \ref differentiableSection.
                0163  * Each routine in this documentation lists to the changes to the parameters
                0164  * relative to the MPI standard. These changes impact parameters specifying
                0165  *  - MPI_Datatype parameters, see \ref datatypes
                0166  *  - MPI_Request parameters, see \ref requests
                0167  *
                0168  * Some routines require new parameters specifying the pairing two-sided communications, see \ref pairings.
                0169  * Similarly to the various approaches (preprocessing, templating, using <tt>typedef</tt>)
                0170  * employed to effect a change to an active type for overloading-based AD tools, this switch
                0171  * from MPI to AMPI routines should be done as a one-time effort.
                0172  * Because  AMPI provides an implementation for a straight pass-through to MPI it is possible to make this switch
                0173  * permanent and retain builds that are completely independent of any AD tool and use AMPI as a thin wrapper library to AMPI.
                0174  *
                0175  * \section appCompile Application - compile and link
                0176  *
                0177  * After the switch described in \ref mpiToAmpi is done, the application should be recompiled with the include path addition
                0178  * \code
                0179  * -I<installation directory>/include
                0180  * \endcode
                0181  * and linked with the link path extension
                0182  * \code 
                0183  * -L<installation directory>/lib[64]
                0184  * \endcode 
                0185  * Note, the name of the subdirectory (lib or lib64 ) depends on the system;
                0186  * the appropriate set of libraries, see \ref configure; the optional ones in square brackets depend on the AD tool:
                0187  * \code
                0188  * -libampicommon [ -libampiBookkeeping -lampiTape ]
                0189  * \endcode 
                0190  * <b>OR</b> if instead of differentiation by AD a straight pass-through to MPI is desired, then
                0191  * \code
                0192  * -libampiPlainC
                0193  * \endcode
                0194  * instead.
                0195  * 
                0196  * \section dirStruct Directory and File Structure
                0197  * All locations discussed below are relative to the top level source directory. 
                0198  * The top level header file to be included in place of the usual  "mpi.h" is located in  
                0199  * ampi/ampi.h
                0200  *
                0201  * It references the header files in <tt>ampi/userIF</tt> , see also \ref UserInterfaceHeaders which are organized to contain
                0202  *  - unmodified pass through to MPI in <tt>ampi/userIF/passThrough.h</tt> which exists to give the extent of the original MPI we cover  
                0203  *  - variants of routines that in principle need adjoint logic but happen to be called outside of the code section that is adjoined and therefore 
                0204  *    are not transformed / not traced (NT) in  <tt>ampi/userIF/nt.h</tt>
                0205  *  - routines that are modified from the original MPI counterparts because their behavior in the reverse sweep differs from their behavior in the 
                0206  *    forward sweep and they also may have a modified signatyre; in <tt>ampi/userIF/modified.h</tt>
                0207  *  - routines that are specific for some variants of source transformation (ST) approaches in <tt>ampi/userIF/st.h</tt>; 
                0208  *    while these impose a larger burden for moving from MPI to AMPI on the user, they also enable a wider variety of transformations 
                0209  *    currently supported by the tools; we anticipate that the ST specific versions may become obsolete as the source transformation tools evolve to 
                0210  *    support all transformations via the routines in <tt>ampi/userIF/modified.h</tt> 
                0211  *
                0212  * Additional header files contain enumerations used as arguments to AMPI routines. All declarations that are part of the user
                0213  * interface are grouped in \ref UserInterfaceDeclarations. All other declarations in header files in the library are not to be used directly in the user code.
                0214  * 
                0215  * A library that simply passes through all AMPI calls to their MPI counterparts for a test compilation and execution without any involvement of 
                0216  * and AD tool is implemented in the source files in the <tt>PlainC</tt> directory.
                0217  * 
                0218  * \section differentiableSection Using subroutine variants NT vs non-NT relative to the differentiable section
                0219  * 
                0220  * The typical assumption of a program to be differentiated is that there is some top level routine <tt>head</tt> which does the numerical computation 
                0221  * and communication which is called from some main <tt>driver</tt> routine. The <tt>driver</tt> routine would have to be manually adjusted to initiate 
                0222  * the derivative computation, retrieve, and use the derivative values.
                0223  * Therefore only <tt>head</tt> and everything it references would be <em>adjoined</em> while <tt>driver</tt> would not. Typically, the <tt>driver</tt>
                0224  * routine also includes the basic setup and teardown with MPI_Init and MPI_Finalize and consequently these calls (for consistency) should be replaced 
                0225  * with their AMPI "no trace/transformation"  (NT) counterparts \ref AMPI_Init_NT and \ref AMPI_Finalize_NT. 
                0226  * The same approach should be taken for all resource allocations/deallocations (e.g. \ref AMPI_Buffer_attach_NT and \ref AMPI_Buffer_detach_NT) 
                0227  * that can exist in the scope enclosing the adjointed section alleviating 
                0228  * the need for the AD tool implementation to tackle them. 
                0229  * For cases where these routines have to be called within the adjointed code section the variants without the <tt>_NT</tt> suffix will ensure the
                0230  * correct adjoint behavior.
                0231  * 
                0232  * \section general General Assumptions on types and Communication Patterns
                0233  *
                0234  * \subsection datatypes Datatype consistency
                0235  * 
                0236  * Because the MPI standard passes buffers as <tt>void*</tt>  (aka choice) the information about the type of
                0237  * the buffer and in particular the distinction between active  and passive data (in the AD sense) must be
                0238  * conveyed via the <tt>datatype</tt> parameters and be consistent with the type of the buffer. To indicate buffers of
                0239  * active type the library predefines the following
                0240  * - for C/C++
                0241  *   - \ref AMPI_ADOUBLE  as the active variant of the passive MPI_DOUBLE
                0242  *   - \ref AMPI_AFLOAT as the active variant of the passive MPI_FLOAT
                0243  * - for Fortran
                0244  *   - \ref AMPI_ADOUBLE_PRECISION as the active variant of the passive MPI_DOUBLE_PRECISION
                0245  *   - \ref AMPI_AREAL as the active variant of the passive MPI_REAL
                0246  *
                0247  * Passive buffers can be used as parameters to the AMPI interfaces with respective passive data type values.
                0248  *
                0249  * \subsection requests Request Type
                0250  *
                0251  * Because additional information has to be attached to the MPI_Request instances  used in nonblocking communications, there
                0252  * is an expanded data structure to hold this information. Even though in some contexts (F77) this structure cannot be exposed
                0253  * to the user code the general approach is to declare variables that are to hold requests as \ref AMPI_Request (instead of
                0254  * MPI_Request).
                0255  *
                0256  * \subsection pairings Pairings
                0257  *
                0258  * Following the explanations in \cite Utke2009TAM it is clear that context information about the 
                0259  * communication pattern, that is the pairing of MPI calls, is needed to achieve 
                0260  * -# correct adjoints, i.e. correct send and receive end points and deadlock free
                0261  * -# if possible retain the efficiency advantages present in the original MPI communication for the adjoint.
                0262  *
                0263  * In AMPI pairings are conveyed via additional <tt>pairedWith</tt> parameters which may be set to \ref AMPI_PairedWith enumeration values , see e.g. \ref AMPI_Send or \ref AMPI_Recv.
                0264  * The need to convey the pairing imposes restrictions because in a given code the pairing may not be static.
                0265  * For a example a given <tt>MPI_Recv</tt> may be paired with 
                0266  * \code{.cpp}
                0267  * if (doBufferedSends)  
                0268  *   MPI_Bsend(...); 
                0269  * else  
                0270  *   MPI_Ssend(...);
                0271  * \endcode 
                0272  *
                0273  * but the AD tool has to decide on the send mode once the reverse sweep needs to adjoin the orginal <tt>MPI_Recv</tt>.  
                0274  * Tracing such information in a global data structure is not scalable and piggybacking the send type onto the message 
                0275  * so it can be traced on the receiving side is conceivable but not trivial and currently not implemented. 
                0276  * 
                0277  * \restriction Pairing of send and receive modes must be static.
                0278  *
                0279  * Note that this does not prevent the use of wild cards for source, or tag.
                0280  *
                0281  * \section examples Examples
                0282  * A set of examples organized to illustrate the uses of AMPI together with setups for AD tools that also serve as
                0283  * regression tests are collected in  `AdjoinableMPIexamples` that can be obtained similarly to the AMPI sources themselves
                0284  * by cloning
                0285  *\code
                0286  * hg clone http://mercurial.mcs.anl.gov/ad/AdjoinableMPIexamples
                0287  * \endcode
                0288  * The daily regression tests based on these examples report the results on the page linked via the main page of this documentation.  
                0289  *
                0290  */
                0291 
                0292 /**
                0293  * \page LibraryDevelopmentGuide Library Development Guide
                0294  * \tableofcontents
                0295  * \section naming Naming Conventions - Code Organization
                0296  * Directories and libraries are organized as follows:
                0297  *  - user interface header files, see  \ref dirStruct; should not contain anything else (e.g. no internal helper functions)
                0298  *  - `PlainC` :  pass through to MPI implementations of the user interface; no reference to ADTOOL interfaces; to be renamed
                0299  *  - `Tape` : sequential access storage mechanism default implementation (implemented as doubly linked list) to enable forward/reverse
                0300  *  reading; may not reference ADTOOL or AMPI symbols/types; may reference MPI
                0301  *  - `Bookkeeping` : random access storage for AMPI_Requests (but possibly also other objects that could be opaque)
                0302  *  - `Common` : the AD enabled workhorse; here we have all the common functionality for MPI differentiation;
                0303  *
                0304  * Symbol prefixes:
                0305  *  - `AMPI_` to be used for anything in MPI replacing the `MPI_` prefix; not to be used for symbols outside of the user interface
                0306  *  - `TAPE_AMPI_` to be used for the `Tape` sequential access storage mechanism declared in ampi/tape/support.h
                0307  *  - `BK_AMPI_`:  `Bookkeeping`  random access storage mechanism declared in ampi/bookkeeping/support.h
                0308  *  - `ADTOOL_AMPI_` to be
                0309  *
                0310  *
                0311  *
                0312  * \section nonblocking Nonblocking Communication and Fortran Compatibility
                0313  * 
                0314  * A central concern is the handling of non-blocking sends and receives in combination with their respective completion,
                0315  * e.g. wait,  waitall, test. 
                0316  * Taking as an example 
                0317  * \code{.cpp}
                0318  * MPI_Irecv(&b,...,&r);
                0319  * // some other code in between 
                0320  * MPI_Wait(&r,MPI_STATUS_IGNORE); 
                0321  * \endcode
                0322  * The adjoint action for <tt>MPI_Wait</tt> will have to be the <tt>MPI_Isend</tt> of the adjoint data associated with 
                0323  * the data in buffer <tt>b</tt>. 
                0324  * The original <tt>MPI_Wait</tt> does not have any of the parameters required for the send and in particular it does not 
                0325  * have the buffer. The latter, however, is crucial in particular in a source transformation context because, absent a correct syntactic 
                0326  * representation for the buffer at the <tt>MPI_Wait</tt> call site one has to map the address <tt>&b</tt> valid during the forward 
                0327  * sweep to the address of the associated adjoint buffer during the reverse sweep. 
                0328  * In some circumstances, e.g. when the buffer refers to stack variable and the reversal mode follows a strict <em>joint</em> scheme 
                0329  * where one does not leave the stack frame of a given subroutine until the reverse sweep has completed, it is possible to predetermine 
                0330  * the address of the respective adjoint buffer even in the source transformation context.  
                0331  * In the general case, e.g. allowing for <em>split</em> mode reversal 
                0332  * or  dynamic memory deallocation before the adjoint sweep commences such predetermination 
                0333  * requires a more elaborate mapping algorithm. 
                0334  * This mapping is subject of ongoing research and currently not supported. 
                0335  * 
                0336  * On the other hand, for operator overloading based tools, the mapping to a reverse sweep address space is an integral part of the 
                0337  * tool because there the reverse sweep is executed as interpretation of  a trace of the execution that is entirely separate from the original program 
                0338  * address space. Therefore all addresses have to be mapped to the new adjoint address space to begin with and no association to some 
                0339  * adjoint program variable is needed. Instead, the buffer address can be conveyed via the request parameter (and AMPI-userIF bookkeeping) 
                0340  * to the <tt>MPI_Wait</tt> call site, traced there and is then recoverable during the reverse sweep.  
                0341  * Nevertheless, to allow a common interface this version of the AMPI library has the buffer as an additional argument to in the source-transformation-specific \ref AMPI_Wait_ST 
                0342  * variant of \ref AMPI_Wait.  
                0343  * In later editions, when source transformation tools can fully support the address mapping, the  of the AMPI library the \ref AMPI_Wait_ST variant  may be dropped.  
                0344  * 
                0345  * Similarly to conveying the buffer address via userIF bookkeeping associated with the request being passed, all other information such as source or destination, tag, 
                0346  * data type, or the distinction if a request originated with a send or receive  will be part of the augmented information attached to the request and be subject to the trace and recovery as the buffer address itself. 
                0347  * In the source transformation context, for cases in which parameter values such as source, destination, or tag are constants or loop indices the question could be asked if these values couldn't be easily recovered in
                0348  * the generated adjoint code without having to store them. 
                0349  * Such recovery following a TBR-like approach would, however, require exposing the augmented request instance as a structured data type to the TBR analysis in the languages other than Fortran77. 
                0350  * This necessitates the introduction of the \ref AMPI_Request, which in Fotran77 still maps to just an integer address. 
                0351  * The switching between these variants is done via  configure flags, see \ref configure.
                0352  * 
                0353  * \section bookkeeping Bookkeeping of Requests
                0354  * 
                0355  * As mentioned in \ref nonblocking the target language may prevent the augmented request from being used directly.  
                0356  * In such cases the augmented information has to be kept internal to the library, that is we do some bookkeeping to convey the necessary information between the nonblocking sends or receives and
                0357  * the and respective completion calls. Currently the bookkeeping has a very simple implementation as a doubly-linked list implying linear search costs which is acceptable only as long as the 
                0358  * number of icomplete nonblocking operations per process remains moderate. 
                0359  *
                0360  * Whenever internal handles are used to keep trace (or correspondence) of a given internal object
                0361  * between two distant locations in the source code (e.g. file identifier to keep trace of an opened/read/closed file,
                0362  * or address to keep trace of a malloc/used/freed dynamic memory, or request ID to keep trace of a Isend/Wait...)
                0363  * we may have to arrange the same correspondence during the backward sweep.
                0364  * Keeping the internal identifier in the AD stack is not sufficient because there is no guarantee that
                0365  * the mechanism in the backward sweep will use the same values for the internal handle.
                0366  * The bookkeeping we use to solve this problem goes as follows:
                0367  * - standard TBR mechanism makes sure that variables that are needed in the BW sweep and are overwritten
                0368  *    are pushed onto the AD stack before they are overwritten
                0369  * - At the end of its life in the forward sweep, the FW handle is pushed in the AD stack
                0370  * - At the beginning of its backward life, we obtain a BW handle, we pop the FW handle,
                0371  *    and we keep the pair of those in a table (if an adjoint handle is created too, we keep the triplet).
                0372  * - When a variable is popped from the AD stack, and it is an internal handle,
                0373  *    the popped handle is re-based using the said table.
                0374  *
                0375  * Simple workaround for the "request" case:
                0376  * This method doesn't rely on TBR.
                0377  * - Push the FW request upon acquisition (e.g. just after the Isend)
                0378  * - Push the FW request upon release (e.g. just before the Wait)
                0379  * - Pop the FW request upon adjoint of release, and get the BW request from the adjoint of release
                0380  * - Add the BW request into the bookkeeping, with the FW request as a key.
                0381  * - Upon adjoint of acquisition, pop the FW request, lookup in the bookkeeping to get the BW request.
                0382  *
                0383  * \section bundling  Tangent-linear mode bundling the derivatives or shadowing the communication
                0384  * A central question for the implementation of tangent-linear mode becomes
                0385  * whether to bundle the original buffer <tt>b</tt> with the  derivative <tt>b_d</tt> as  pair and communicate the pair
                0386  * or to send separate messages for the derivatives.
                0387  * - shadowing messages avoid the bundling/unbundling if <tt>b</tt> and <tt>b_d</tt>
                0388  * are already given as separate entities as is the case in association by name, see \ref Introduction.
                0389  * - for one-sided passive communications there is no hook to do the bundling/unbundling on the target side; therefore
                0390  * it would be inherently impossible to achieve semantically correct behavior with any bundling/unbundling scheme.
                0391  * The example here is a case where a put on the origin side and subsequent computations on the target side are synchronized
                0392  * via a barrier which by itself does not have any obvious link to the target window by which one could trigger an unbundling.
                0393  * - the bundling operation itself may incur nontrivial overhead for large buffers
                0394  *
                0395  * An earlier argument against message shadowing was the difficulty of correctly associating message pairs while using wildcards.
                0396  * This association can, however, be ensured when a the shadowing message for the <tt>b_d</tt> is received on a communicator
                0397  * <tt>comm_d</tt> that duplicates the original communicator <tt>comm</tt> and uses the
                0398  * actual src and tag values obtained from the receive of the shadowed message as in the following example:
                0399  *
                0400  * \code{.cpp}
                0401  * if ( myRank==1) {
                0402  *   send(x,...,0,tag1,comm); // send of the original data
                0403  *   send(x_d,...,0,tag1,comm_d); // shadowing send of the derivatives
                0404  * else if ( myRank==2) {
                0405  *   send(y,...,0,tag2,comm);
                0406  *   send(y_d,...,0,tag2,comm_d);
                0407  * else if ( myRank==0) {
                0408  *   do {
                0409  *      recv(t,...,ANY_SOURCE, ANY_TAG,comm,&status); // recv of the original data
                0410  *      recv(t_d,...,status.SOURCE,status.TAG,comm_d,STATUS_IGNORE); // shadowing recv with wildcards disambiguated
                0411  *      z+=t; // original operation
                0412  *      z_d+=t_d; // corresponding derivative operation
                0413  *   }
                0414  * }
                0415  * \endcode
                0416  *
                0417  * This same approach can be applied to (user-defined) reduction operations, see \ref reduction, in that the binomial
                0418  * tree traversal for the reduction is shadowed in the same way and a user defined operation with derivatives can be invoked
                0419  * by passing the derivatives as separate arguments.
                0420  *
                0421  * The above approach is to be taken by any tool in which <tt>b</tt> and <tt>b_d</tt> are not already paired in consecutive
                0422  * memory such as association by name as in Tapenade or by implementation choice such as forward interpreters in Adol-C where
                0423  * the 0-th order Taylor coefficients live in a separate array from the first-  and higher-order Taylor coefficients.
                0424  * Tools with association by address (OpenAD, Rapsodia) would have the data already given in paired form and therefore not
                0425  * need messsage shadowing but communicate the paired data.
                0426  *
                0427  * \section badOptions Rejected design options
                0428  * About MPI_Types and the "active" boolean:
                0429  * One cannot get away with just an "active" boolean to indicate the structure of
                0430  * the MPI_Type of the bundle. Since the MPI_Type definition of the bundle type
                0431  * has to be done anyway in the differentiated application code, and is passed
                0432  * to the communication call, the AMPI communication implementation will
                0433  * check this bundle MPI_Type to discover activity and trace/not trace accordingly.
                0434  *
                0435  * For the operator overloading, the tool needs to supply the active MPI types
                0436  * for the built-in MPI_datatypes and using the active types, one can achieve
                0437  * type conformance between the buffer and the type parameter passed.
                0438  *
                0439  * \section onesided One-Sided Active Targets
                0440  * Idea - use an <tt>AMPI_Win</tt> instance (similar to the \ref AMPI_Request ) to attach more 
                0441  * information about the things that are applied to the window and completed on the fence; 
                0442  * we execute/trace/collect-for-later-execution operations on the window in the following fashion
                0443  * 
                0444  * forward: 
                0445  * - MPI_Get record op/args on the window (buffer called 'x')  
                0446  * - MPI_Put/MPI_Accumulate(z,,...): record op/args on the window; during forward: replace with MPI_Get of the remote target value into temporary 't' ; postpone to the  fence;
                0447  * 
                0448  * upon hitting a fence in the forward sweep:
                0449  * 1. put all ops on the stack 
                0450  * 2. run the fence 
                0451  * 3. for earch accum/put:
                0452  * 3.1:  push 't'
                0453  * 3.2: do the postponed accumulate/put
                0454  * 4. run a fence for 3.2
                0455  * 5. for each accum*: 
                0456  * 5.1 get accumlation result 'r'
                0457  * 6. run a fence for 5.1 
                0458  * 7. for each accum*:
                0459  * 7.1 push 'r'
                0460  * 
                0461  * for the adjoint of a fence :
                0462  * 0. for each operation on the window coming from the previous fence: 
                0463  * 0.1 op isa GET then x_bar=0.0
                0464  * 0.2 op isa PUT/accum= then x_bar+=t21
                0465  * 0.3 op isa accum+ then x_bar+=t22
                0466  * 1. run a fence
                0467  * 2. pop  op from the stack and put onto adjoint window
                0468  * 2.1 op isa PUT/accum=: then   GET('t21')
                0469  * 2.2 op isa accum+; then get('t22') from adjoint target 
                0470  * 2.3 op isa accum*, then pop('r'),  GET('t23') from adjoint target
                0471  * 3. run a fence
                0472  * 4. for each op on the adjoint window
                0473  * 4.1 op isa GET, then accum+ into remote
                0474  * 4.2 op isa PUT/accum: pop(t); accu(t,'=') to the value in the target
                0475  * 4.3 op isa PUT/accum=; then acc(0.0,'=') to adjoint target
                0476  * 4.4 op isa accum*: then accumulate( r*t23/t,'=', to the target) AND do z_bar+=r*t23/z  (this is the old local z ); 
                0477  * 
                0478  * \section derived Handling of Derived Types
                0479  * (Written mostly in the context of ADOL-C.) MPI allows the user to create typemaps for arbitrary structures in terms of a block
                0480  * count and arrays of block lengths, block types, and displacements. For sending an array of active variables, we could get by with
                0481  * a pointer to their value array; in the case of a struct, we may want to send an arbitrary collection of data as well as some active
                0482  * variables which we'll need to "dereference". If a struct contains active data, we must manually pack it into a new array because
                0483  * -# the original datamap alignment is destroyed when we convert active data to real values
                0484  * -# we would like to send completely contiguous messages
                0485  * 
                0486  * When received, the struct is unpacked again.
                0487  * 
                0488  * When the user calls the \ref AMPI_Type_create_struct_NT wrapper with a datamap, the map is stored in a structure of type
                0489  * \ref derivedTypeData; the wrapper also generates an internal typemap that describes the packed data. The packed typemap is used
                0490  * whenver a derived type is sent and received; it's also used in conjunction with the user-provided map to pack and unpack data.
                0491  * This typemap is invisible to the user, so the creation of derived datatypes is accomplished entirely with calls to the
                0492  * \ref AMPI_Type_create_struct and \ref AMPI_Type_commit_NT wrappers.
                0493  * 
                0494  * \image html dtype_illustration.png
                0495  * \image latex dtype_illustration.png
                0496  * 
                0497  * AMPI currently supports sending structs with active elements and structs with embedded structs. Packing is called recursively.
                0498  * Functions implemented are \ref AMPI_Type_create_struct_NT and \ref AMPI_Type_contiguous_NT. A wrapper for _Type_vector can't be
                0499  * implemented now because the point of that function is to send noncontiguous data and, for simplicity and efficiency, we're assuming
                0500  * that the active variables we're sending are contiguous.
                0501  * 
                0502  * Worth noting: if we have multiple active variables in a struct and we want to send an array of these structs, we have to send every
                0503  * active element to ensure that our contiguity checks don't assert false.
                0504  * 
                0505  * \section reduction Reduction operations
                0506  * 
                0507  * Since operator overloading can't enter MPI routines, other AMPI functions extract the double values from active variables,
                0508  * transfer those, and have explicit adjoint code that replaces the automated transformation. This is possible because we know the
                0509  * partial derivative of the result. For reductions, we can also do this with built-in reduction ops (e.g., sum, product). But
                0510  * we can't do this for user-defined ops because we don't know the partial derivative of the result.
                0511  * 
                0512  * (Again explained in the context of ADOL-C.) So we have to make the tracing machinery enter the Reduce and perform taping every
                0513  * time the reduction op is applied. As it turns out, MPICH implements Reduce for derived types as a binary tree of Send/Recv pairs,
                0514  * so we can make our own Reduce by replicating the code with AMPI_Send/Recv functions. (Note that derived types are necessarily
                0515  * reduced with user-defined ops because MPI doesn't know how to accumulate them with its built-in ops.) So AMPI_Reduce is implemented
                0516  * for derived types as the aforementioned binary tree with active temporaries used between steps for applying the reduction op.
                0517  * See \ref AMPI_Op_create_NT.
                0518  * 
                0519  * 
                0520  */
                0521 
                0522 
                0523 
                0524 
                0525 #include <mpi.h>
                0526 #if defined(__cplusplus)
                0527 extern "C" {
                0528 #endif
                0529 
                0530 #include "ampi/userIF/passThrough.h"
                0531 #include "ampi/userIF/nt.h"
                0532 #include "ampi/userIF/modified.h"
                0533 #include "ampi/userIF/st.h"
                0534 
                0535 #include "ampi/libCommon/modified.h"
                0536 
                0537 #if defined(__cplusplus)
                0538 }
                0539 #endif
                0540 
                0541 #endif