    Improve predefined pack/unpack performance using mpich methods. · 30639166
    Austen Lauria authored
    For the original mpich implementation, see:
    Small testcase to demonstrate the performance difference:
    % mpicc -o x packperf_nc.c
    % mpirun -np 1 ./x
    Before:
    > pack dtbig    :    943    863    863    862    862  (avg 879)  usec
    > unpack dtbig  :    919    955    917    917    917  (avg 925)  usec
    > pack dtsmall  :    810    810    810    831    810  (avg 814)  usec
    > unpack dtsmall:    947    954    996    941    969  (avg 962)  usec
    After:
    > pack dtbig    :    205    124    120    118    118  (avg 137)  usec
    > unpack dtbig  :    122    120    120    120    120  (avg 120)  usec
    > pack dtsmall  :    133    122    122    121    121  (avg 124)  usec
    > unpack dtsmall:    124    124    123    123    123  (avg 123)  usec
    Having lots of small memcpy() calls was slower than the mpich code,
    which uses blocks of array assignments.
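To make the comparison concrete, here is a minimal sketch of the two copy styles for a strided vector of int32_t blocks. It is not the mpich or Open MPI code: the names `pack_memcpy` and `pack_assign` and the fixed block length of 4 are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: `count` blocks of BLOCKLEN int32_t values,
 * with consecutive blocks `stride` elements apart in the source. */
enum { BLOCKLEN = 4 };

/* Style 1: one small memcpy() per block. */
static void pack_memcpy(int32_t *dst, const int32_t *src,
                        size_t count, size_t stride)
{
    for (size_t i = 0; i < count; i++) {
        memcpy(dst, src, BLOCKLEN * sizeof(int32_t));
        dst += BLOCKLEN;
        src += stride;
    }
}

/* Style 2: the same copy written as unrolled, typed array assignments. */
static void pack_assign(int32_t *dst, const int32_t *src,
                        size_t count, size_t stride)
{
    for (size_t i = 0; i < count; i++) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += BLOCKLEN;
        src += stride;
    }
}
```

Both functions produce the same packed buffer; any speed difference depends on the compiler and platform, so the timings above remain the only measured data.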
    Notes about what changed:
    * At the top level, mpi/c/pack.c and unpack.c now sometimes turn
      (count, dtype) into (1, newdtype), with a newdtype made
      by MPI_Type_contiguous(count, dtype).  This is because the lower-
      level pack/unpack always iterates over description elements, and
      when it sees (count, dtype) there's no possibility of a single
      description element describing the whole data.
      I'm triggering that code only when the count is >= 250 and
      the type is non-contiguous.  It likely only needs to be triggered
      if the datatype has a single element such that
      element.count * element.extent == dtype.extent, but that would
      take more code to detect.
    * In opal_datatype_internal.h I moved the macros around a little so
      I could reuse them in the new unrolled array assignment code.
      That way I don't have to figure out that INT4 is int32_t, because
      those macros already have that info.  The diff probably looks
      large, but there isn't that much going on there.
    * In opal_datatype_pack/unpack.h there's an extra section that calls
      the mpich vector copying code for a description element if
      it's a certain size, and continues with the regular code if
      the mpich call rejects it (due to not recognizing the type,
      or due to it being CUDA memory, or due to alignment).
    * The new opal_datatype_pack_unpack_predefined.h is largely copied
      from mpich.  The macros boil down to unrolled array assignments.
      I recycled the opal_datatype_internal.h macros to get the values
      for TYPE.  That way I don't have to figure out whether
      SHORT_FLOAT_COMPLEX is short float _Complex or
      opal_short_float_complex_t or unavailable, for example.
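The try-then-fall-back pattern from the notes above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of opal_datatype_pack_unpack_predefined.h: every `sketch_`/`SKETCH_` name, the two-element block length, and the type ids are hypothetical, and the real code gets its TYPE values from the opal_datatype_internal.h macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type ids, standing in for the real predefined type ids. */
enum sketch_type { SKETCH_INT4, SKETCH_FLOAT8, SKETCH_UNKNOWN };

/* Generate one strided copy routine per C type.  Each block holds two
 * elements and is copied with unrolled array assignments. */
#define SKETCH_DEFINE_COPY(NAME, TYPE) \
    static void NAME(void *dst, const void *src, size_t count, size_t stride) \
    { \
        TYPE *d = (TYPE *) dst; \
        const TYPE *s = (const TYPE *) src; \
        for (size_t i = 0; i < count; i++) { \
            d[0] = s[0]; \
            d[1] = s[1]; \
            d += 2; \
            s += stride; \
        } \
    }

SKETCH_DEFINE_COPY(sketch_copy_int4, int32_t)
SKETCH_DEFINE_COPY(sketch_copy_float8, double)

/* Returns 1 if the fast path handled the element, 0 if it rejected it. */
static int sketch_pack_predefined(enum sketch_type id, void *dst,
                                  const void *src, size_t count, size_t stride)
{
    switch (id) {
    case SKETCH_INT4:
        sketch_copy_int4(dst, src, count, stride);
        return 1;
    case SKETCH_FLOAT8:
        sketch_copy_float8(dst, src, count, stride);
        return 1;
    default:
        return 0; /* unrecognized type: let the caller fall back */
    }
}

/* Caller: try the fast path first, otherwise continue with the regular
 * per-block memcpy().  `stride` is in elements of size `elem_size`. */
static void sketch_pack(enum sketch_type id, size_t elem_size, void *dst,
                        const void *src, size_t count, size_t stride)
{
    if (sketch_pack_predefined(id, dst, src, count, stride)) {
        return;
    }
    for (size_t i = 0; i < count; i++) {
        memcpy((char *) dst + i * 2 * elem_size,
               (const char *) src + i * stride * elem_size,
               2 * elem_size);
    }
}
```

Returning a status instead of failing keeps the regular code as the single source of truth for anything the fast path does not understand.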
    Extra notes about the new pack/unpack routine:
    * For checking CUDA memory I didn't check every item in the vector,
      only the first and possibly the last, since I don't think individual
      description elements should be spanning GPU and system memory.
    * I didn't use the unaligned-stride code from mpich, instead just
      rejecting anything unaligned.
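The two rejection checks described above might look roughly like this. It is a sketch under assumptions, not the actual routine: the `sketch_` names are hypothetical, and the device-memory predicate is a stand-in that treats addresses at or above a fake boundary as device memory, where the real code would query the CUDA support layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if both the base address and the byte stride are multiples
 * of the element alignment; anything else is rejected. */
static int sketch_is_aligned(const void *base, ptrdiff_t stride_bytes,
                             size_t align)
{
    if ((uintptr_t) base % align != 0) {
        return 0;
    }
    if (stride_bytes % (ptrdiff_t) align != 0) {
        return 0;
    }
    return 1;
}

/* Stand-in predicate for illustration only: addresses at or above
 * sketch_device_start count as device memory. */
static const char *sketch_device_start = NULL;

static int sketch_is_device(const void *ptr)
{
    return sketch_device_start != NULL
           && (const char *) ptr >= sketch_device_start;
}

/* Check only the first block and, when there is more than one, the last
 * block, rather than every item in the vector. */
static int sketch_touches_device(const char *base, size_t count,
                                 ptrdiff_t stride_bytes)
{
    if (count == 0) {
        return 0;
    }
    if (sketch_is_device(base)) {
        return 1;
    }
    if (count > 1
        && sketch_is_device(base + (ptrdiff_t) (count - 1) * stride_bytes)) {
        return 1;
    }
    return 0;
}
```

Checking only the endpoints relies on the assumption stated above, that a single description element does not span GPU and system memory.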
    where the code came from says
    > /*
    >  *  (C) 2001 by Argonne National Laboratory.
    >  *      See COPYRIGHT in top-level directory.
    >  */
    And I pasted the above mentioned COPYRIGHT at the top of
    Signed-off-by: Mark Allen <>