
Improve predefined pack/unpack performance using mpich methods. · 30639166

Austen Lauria authored Jul 14, 2020

For the original mpich implementation, see:
https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h

Small testcase to demonstrate the performance difference:
https://gist.github.com/markalle/9f92e9facbd71136bcfb9f0e0305a1da
% mpicc -o x packperf_nc.c
% mpirun -np 1 ./x

Before:
> pack dtbig    :    943    863    863    862    862  (avg 879)  usec
> unpack dtbig  :    919    955    917    917    917  (avg 925)  usec
> pack dtsmall  :    810    810    810    831    810  (avg 814)  usec
> unpack dtsmall:    947    954    996    941    969  (avg 962)  usec
After:
> pack dtbig    :    205    124    120    118    118  (avg 137)  usec
> unpack dtbig  :    122    120    120    120    120  (avg 120)  usec
> pack dtsmall  :    133    122    122    121    121  (avg 124)  usec
> unpack dtsmall:    124    124    123    123    123  (avg 123)  usec

Having lots of small memcpy() calls was slower than the mpich code, which
uses blocks of array assignments.
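
As a rough illustration of that difference (a minimal sketch, not the
code from this commit or from mpich), the two approaches can be compared
on a strided vector of int32_t. The function names and the choice of a
block length of 4 are assumptions made for the example; strides here are
counted in elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one small memcpy() per block of the strided source.
 * `stride` is the distance between block starts, in elements. */
static void pack_memcpy(int32_t *dst, const int32_t *src,
                        size_t nblocks, size_t blocklen, size_t stride)
{
    assert(stride >= blocklen);
    for (size_t i = 0; i < nblocks; i++) {
        memcpy(dst, src, blocklen * sizeof(int32_t));
        dst += blocklen;
        src += stride;
    }
}

/* Unrolled variant for a fixed block length of 4: each block is copied
 * with plain typed assignments instead of a library call. */
static void pack_unrolled4(int32_t *dst, const int32_t *src,
                           size_t nblocks, size_t stride)
{
    assert(stride >= 4);
    for (size_t i = 0; i < nblocks; i++) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += stride;
    }
}
```

Both functions produce the same packed buffer; only the per-block copy
mechanism differs, which is where the timings above come from.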

Notes about what changed:
* at the top-level mpi/c/pack.c and unpack.c it now sometimes
  turns (count, dtype) into (1, newdtype) with a newdtype made
  by MPI_Type_contiguous(count, dtype).  This is because the lower
  level pack/unpack always iterates over description elements and
  when it sees (count,dtype) there's no possibility of a single
  description element describing the whole data.

  I'm triggering that code only when the count is >=250 and
  the type is non-contiguous.  It likely only needs to be triggered
  if the datatype has a single element such that the
  element.count * element.extent == dtype.extent but that would
  be more code to detect.
* in opal_datatype_internal.h I moved the macros around a little so
  I could reuse them in the new unrolled array assignments code.
  That way I don't have to figure out that INT4 is int32_t, because
  those macros already have that info.  The diff probably looks
  large but there isn't that much going on there.
* in opal_datatype_pack/unpack.h there's an extra section to call
  the mpich vector copying code for a description element if
  it's a certain size, and continue with the regular code if
  the mpich call rejects it (due to not recognizing the element.id,
  or due to it being cuda memory, or due to alignment)
* the new opal_datatype_pack_unpack_predefined.h is largely copied
  from mpich.  The macros boil down to unrolled array assignments.
  I recycled the opal_datatype_internal.h macros to get the values
  for TYPE.  That way I don't have to figure out whether
  SHORT_FLOAT_COMPLEX is short float _Complex or opal_short_float_complex_t
  or unavailable, for example.
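
The "try the fast path, fall back if it rejects" flow described in the
notes above can be sketched as follows. This is an illustration only:
the enum, the macro, and the function names are hypothetical and are
not the identifiers used in opal_datatype_pack.h or in mpich. The
alignment test uses sizeof(TYPE), which may be stricter than the
platform actually requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element ids standing in for the real datatype ids. */
enum elem_id { ELEM_INT4, ELEM_FLOAT8, ELEM_OTHER };

/* Copy `count` blocks of `blocklen` TYPE elements whose starts are
 * `stride` bytes apart, using typed assignments. */
#define COPY_FROM_VEC(TYPE, dst, src, count, blocklen, stride)            \
    do {                                                                  \
        TYPE *d_ = (TYPE *)(dst);                                         \
        const char *s_ = (const char *)(src);                             \
        for (size_t i_ = 0; i_ < (count); i_++) {                         \
            const TYPE *b_ = (const TYPE *)s_;                            \
            for (size_t j_ = 0; j_ < (blocklen); j_++) d_[j_] = b_[j_];   \
            d_ += (blocklen);                                             \
            s_ += (stride);                                               \
        }                                                                 \
    } while (0)

static int aligned_to(const void *dst, const void *src, size_t stride, size_t a)
{
    return (uintptr_t)dst % a == 0 && (uintptr_t)src % a == 0 && stride % a == 0;
}

/* Returns 0 if the element was packed, -1 if it was rejected
 * (unrecognized id or unaligned data). */
static int try_pack_predefined(enum elem_id id, void *dst, const void *src,
                               size_t count, size_t blocklen, size_t stride)
{
    switch (id) {
    case ELEM_INT4:
        if (!aligned_to(dst, src, stride, sizeof(int32_t))) return -1;
        COPY_FROM_VEC(int32_t, dst, src, count, blocklen, stride);
        return 0;
    case ELEM_FLOAT8:
        if (!aligned_to(dst, src, stride, sizeof(double))) return -1;
        COPY_FROM_VEC(double, dst, src, count, blocklen, stride);
        return 0;
    default:
        return -1;
    }
}

/* Caller: use the fast path when accepted, else the regular memcpy loop. */
static void pack_element(enum elem_id id, size_t elem_size, void *dst,
                         const void *src, size_t count, size_t blocklen,
                         size_t stride)
{
    assert(elem_size > 0);
    if (try_pack_predefined(id, dst, src, count, blocklen, stride) == 0)
        return;
    char *d = dst;
    const char *s = src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d, s, blocklen * elem_size);
        d += blocklen * elem_size;
        s += stride;
    }
}
```

The point of returning a status instead of failing is that the regular
code path stays the single source of truth for anything unusual.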

Extra notes about the new pack/unpack routine:
* For checking cuda memory I didn't check every item in the vector,
  only the first and possibly the last, since I don't think individual
  description elements should be spanning gpu and system memory.
* I didn't use the unaligned-stride code from mpich; instead I just
  reject anything unaligned.
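
A minimal sketch of the "first and possibly the last" check, assuming a
caller-supplied predicate that reports whether an address is device
memory. The predicate type, the function name, and the fake arena used
as a stand-in for GPU memory are all hypothetical; the real code uses
the library's own cuda check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate: nonzero if `p` points at device memory. */
typedef int (*is_device_fn)(const void *p);

/* Returns nonzero if the vector should be rejected: only the first
 * block and, when there is more than one, the last block are tested. */
static int vec_touches_device(const void *base, size_t count,
                              size_t stride_bytes, is_device_fn is_device)
{
    assert(is_device != NULL);
    if (count == 0) return 0;
    if (is_device(base)) return 1;
    if (count > 1) {
        const char *last = (const char *)base + (count - 1) * stride_bytes;
        return is_device(last) != 0;
    }
    return 0;
}

/* Stand-in for a real query: the upper half of `arena` plays the role
 * of device memory so the check can be exercised without a GPU. */
static char arena[128];

static int fake_is_device(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)(arena + 64);
    uintptr_t hi = (uintptr_t)(arena + 128);
    return a >= lo && a < hi;
}
```

Checking two addresses instead of every block keeps the cost constant
in the block count, at the price of assuming a single description
element never straddles both kinds of memory.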

Licensing:
The file the code came from,
https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h
says
> /*
>  *  (C) 2001 by Argonne National Laboratory.
>  *      See COPYRIGHT in top-level directory.
>  */
And I pasted the above-mentioned COPYRIGHT at the top of
opal_datatype_pack_unpack_predefined.h

Signed-off-by: Mark Allen <markalle@us.ibm.com>
