    Improve predefined pack/unpack performance using mpich methods. · 30639166
    Austen Lauria authored
    For the original mpich implementation, see:
    https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h
    
    Small testcase to demonstrate the performance difference:
    https://gist.github.com/markalle/9f92e9facbd71136bcfb9f0e0305a1da
    % mpicc -o x packperf_nc.c
    % mpirun -np 1 ./x
    
    Before:
    > pack dtbig    :    943    863    863    862    862  (avg 879)  usec
    > unpack dtbig  :    919    955    917    917    917  (avg 925)  usec
    > pack dtsmall  :    810    810    810    831    810  (avg 814)  usec
    > unpack dtsmall:    947    954    996    941    969  (avg 962)  usec
    After:
    > pack dtbig    :    205    124    120    118    118  (avg 137)  usec
    > unpack dtbig  :    122    120    120    120    120  (avg 120)  usec
    > pack dtsmall  :    133    122    122    121    121  (avg 124)  usec
    > unpack dtsmall:    124    124    123    123    123  (avg 123)  usec
    
    Making lots of small memcpy() calls was slower than the mpich code,
    which uses blocks of array assignments.
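
To make the contrast concrete, here is a minimal sketch of the two copy styles for a strided vector of int32_t. The function names and the unroll factor of four are assumptions chosen for illustration; this is not the code from veccpy.h, and it makes no performance claim of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one small memcpy() per item of a strided vector.
 * stride is measured in elements, not bytes. */
void pack_memcpy(int32_t *dst, const int32_t *src, size_t count, size_t stride)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        memcpy(&dst[i], &src[i * stride], sizeof(int32_t));
    }
}

/* Unrolled variant: typed assignments, four items per iteration,
 * followed by a remainder loop for the last count % 4 items. */
void pack_unrolled(int32_t *dst, const int32_t *src, size_t count, size_t stride)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        dst[i]     = src[i * stride];
        dst[i + 1] = src[(i + 1) * stride];
        dst[i + 2] = src[(i + 2) * stride];
        dst[i + 3] = src[(i + 3) * stride];
    }
    for (; i < count; i++) {
        dst[i] = src[i * stride];
    }
}
```

Both functions produce the same packed buffer; only the way each item is moved differs.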
    
    Notes about what changed:
    * at the top-level mpi/c/pack.c and unpack.c it now sometimes
      turns (count, dtype) into (1, newdtype) with a newdtype made
      by MPI_Type_contiguous(count, dtype).  This is because the lower
      level pack/unpack always iterates over description elements and
      when it sees (count,dtype) there's no possibility of a single
      description element describing the whole data.
    
      I'm triggering that code only when the count is >=250 and
      the type is non-contiguous.  It likely only needs to be triggered
      if the datatype has a single element such that the
      element.count * element.extent == dtype.extent but that would
      be more code to detect.
    * in opal_datatype_internal.h I moved the macros around a little so
      I could reuse them in the new unrolled array assignments code.
      That way I don't have to figure out that INT4 is int32_t, because
      those macros already have that info.  The diff probably looks
      large but there isn't that much going on there.
    * in opal_datatype_pack/unpack.h there's an extra section to call
      the mpich vector copying code for a description element if
      it's a certain size, and continue with the regular code if
      the mpich call rejects it (due to not recognizing the element.id,
      or due to it being cuda memory, or due to alignment).
    * the new opal_datatype_pack_unpack_predefined.h is largely copied
      from mpich.  The macros boil down to unrolled array assignments.
      I recycled the opal_datatype_internal.h macros to get the values
      for TYPE.  That way I don't have to figure out whether
      SHORT_FLOAT_COMPLEX is short float _Complex or opal_short_float_complex_t
      or unavailable, for example.
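
The top-level trigger described in the first note can be sketched as a pair of predicates. Everything here is hypothetical: the structs are a simplified stand-in for the real datatype description, and treating "size equals extent" as the contiguity test is an assumption made for illustration. Only the threshold of 250 and the count * extent comparison come from the notes above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description element: count items of a given extent. */
struct desc_elem {
    size_t count;
    size_t extent;   /* bytes spanned by one item */
};

/* Hypothetical, simplified datatype view; not Open MPI's real structs. */
struct dtype_view {
    size_t size;     /* bytes of actual data */
    size_t extent;   /* bytes spanned in memory, gaps included */
    size_t nelems;   /* number of description elements */
    const struct desc_elem *elems;
};

#define WRAP_MIN_COUNT 250   /* threshold quoted in the notes */

/* Heuristic as described: large count and a non-contiguous type.
 * Assumption: a type is contiguous when size == extent. */
bool should_wrap(size_t count, const struct dtype_view *t)
{
    assert(t != NULL);
    bool contiguous = (t->size == t->extent);
    return count >= WRAP_MIN_COUNT && !contiguous;
}

/* Narrower condition the notes suggest would likely suffice:
 * a single element whose count * extent covers the whole type. */
bool single_elem_spans_extent(const struct dtype_view *t)
{
    assert(t != NULL);
    return t->nelems == 1 &&
           t->elems[0].count * t->elems[0].extent == t->extent;
}
```

When should_wrap() holds, the caller would build the new type with MPI_Type_contiguous(count, dtype) and pack (1, newdtype) instead of (count, dtype).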
    
    Extra notes about the new pack/unpack routine:
    * For checking cuda memory I didn't check every item in the vector,
      only the first and possibly the last, since I don't think individual
      description elements should be spanning gpu and system memory.
    * I didn't use the unaligned-stride code from mpich, instead just
      rejecting anything unaligned.
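
The reject-and-fall-back behavior can be sketched as follows. The element ids, function names, and signatures are hypothetical and chosen for illustration; the cuda check is not modeled. The fast path returns nonzero for an unrecognized id or for unaligned addresses or strides, and the caller then continues with a generic byte copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element ids, for illustration only. */
enum elem_id { ELEM_INT4, ELEM_FLOAT8, ELEM_OPAQUE };

/* True when p is a multiple of align. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Typed strided copy; stride is in bytes. */
#define COPY_STRIDED(TYPE, dst, src, count, stride)                          \
    do {                                                                     \
        TYPE *out_ = (TYPE *)(void *)(dst);                                  \
        for (size_t i_ = 0; i_ < (count); i_++) {                            \
            out_[i_] = *(const TYPE *)(const void *)((src) + i_ * (stride)); \
        }                                                                    \
    } while (0)

/* Fast path: returns 0 when handled, nonzero when the input is rejected. */
int pack_fast(enum elem_id id, unsigned char *dst, const unsigned char *src,
              size_t count, size_t stride)
{
    size_t align;
    switch (id) {
    case ELEM_INT4:   align = _Alignof(int32_t); break;
    case ELEM_FLOAT8: align = _Alignof(double);  break;
    default:          return 1;   /* unrecognized id: reject */
    }
    if (!is_aligned(src, align) || !is_aligned(dst, align) ||
        stride % align != 0) {
        return 1;                 /* unaligned: reject instead of handling it */
    }
    if (id == ELEM_INT4) {
        COPY_STRIDED(int32_t, dst, src, count, stride);
    } else {
        COPY_STRIDED(double, dst, src, count, stride);
    }
    return 0;
}

/* Regular path: one memcpy() of elem_size bytes per item. */
void pack_generic(unsigned char *dst, const unsigned char *src,
                  size_t count, size_t stride, size_t elem_size)
{
    for (size_t i = 0; i < count; i++) {
        memcpy(dst + i * elem_size, src + i * stride, elem_size);
    }
}

/* Try the fast path first and continue with the regular code on rejection. */
void pack_elem(enum elem_id id, unsigned char *dst, const unsigned char *src,
               size_t count, size_t stride, size_t elem_size)
{
    assert(dst != NULL && src != NULL);
    if (pack_fast(id, dst, src, count, stride) != 0) {
        pack_generic(dst, src, count, stride, elem_size);
    }
}
```

Rejecting unaligned input keeps the typed assignments well defined; the generic path handles everything the fast path declines.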
    
    Licensing:
    The file the code came from,
    https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h
    says
    > /*
    >  *  (C) 2001 by Argonne National Laboratory.
    >  *      See COPYRIGHT in top-level directory.
    >  */
    And I pasted the above-mentioned COPYRIGHT at the top of
    opal_datatype_pack_unpack_predefined.h.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>