Austen Lauria authored
For the original mpich implementation, see:
https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h

Small testcase to demonstrate the performance difference:
https://gist.github.com/markalle/9f92e9facbd71136bcfb9f0e0305a1da

% mpicc -o x packperf_nc.c
% mpirun -np 1 ./x

Before:
> pack dtbig     : 943 863 863 862 862 (avg 879) usec
> unpack dtbig   : 919 955 917 917 917 (avg 925) usec
> pack dtsmall   : 810 810 810 831 810 (avg 814) usec
> unpack dtsmall : 947 954 996 941 969 (avg 962) usec

After:
> pack dtbig     : 205 124 120 118 118 (avg 137) usec
> unpack dtbig   : 122 120 120 120 120 (avg 120) usec
> pack dtsmall   : 133 122 122 121 121 (avg 124) usec
> unpack dtsmall : 124 124 123 123 123 (avg 123) usec

Having lots of small memcpy() calls was slower than the mpich code, which
uses blocks of array assignments.

Notes about what changed:

* at the top level, mpi/c/pack.c and unpack.c now sometimes turn
  (count, dtype) into (1, newdtype), with a newdtype made by
  MPI_Type_contiguous(count, dtype). This is because the lower-level
  pack/unpack always iterates over description elements, and when it
  sees (count, dtype) there's no possibility of a single description
  element describing the whole data. I'm triggering that code only when
  the count is >= 250 and the type is non-contiguous. It likely only
  needs to be triggered if the datatype has a single element such that
  element.count * element.extent == dtype.extent, but that would be
  more code to detect.

* in opal_datatype_internal.h I moved the macros around a little so I
  could reuse them in the new unrolled array assignments code. That way
  I don't have to figure out that INT4 is int32_t, because those macros
  already have that info. The diff probably looks large, but there
  isn't that much going on there.

* in opal_datatype_pack/unpack.h there's an extra section that calls
  the mpich vector copying code for a description element if it's a
  certain size, and continues with the regular code if the mpich call
  rejects it (due to not recognizing the element.id, or due to it being
  cuda memory, or due to alignment).

* the new opal_datatype_pack_unpack_predefined.h is largely copied from
  mpich. The macros boil down to unrolled array assignments. I recycled
  the opal_datatype_internal.h macros to get the values for TYPE. That
  way I don't have to figure out whether SHORT_FLOAT_COMPLEX is
  short float _Complex or opal_short_float_complex_t or unavailable,
  for example.

Extra notes about the new pack/unpack routine:

* For checking cuda memory I didn't check every item in the vector,
  only the first and possibly the last, since I don't think individual
  description elements should be spanning gpu and system memory.

* I didn't use the unaligned-stride code from mpich, instead just
  rejecting anything unaligned.

Licensing:
https://github.com/pmodels/mpich/blob/9ab5fd06af2a648bf24214f0d9cff0ee77ee3e7d/src/mpi/datatype/veccpy.h
where the code came from says
> /*
>  * (C) 2001 by Argonne National Laboratory.
>  *     See COPYRIGHT in top-level directory.
>  */
And I pasted the above-mentioned COPYRIGHT at the top of
opal_datatype_pack_unpack_predefined.h

Signed-off-by: Mark Allen <markalle@us.ibm.com>
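
Illustration of the "unrolled array assignments" idea described above. This
is only a minimal sketch, not the code in
opal_datatype_pack_unpack_predefined.h or mpich's veccpy.h. The names
pack_vector_int32 and pack_vector_memcpy are hypothetical, the unroll
factor of 4 is an arbitrary choice, and only int32_t is handled. It
rejects unaligned input so that a caller could fall back to the regular
per-block memcpy() path, in the spirit of the notes above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI or mpich function).
 * Packs `count` blocks of `blocklen` int32_t values into contiguous `dst`.
 * Consecutive blocks start `stride_bytes` apart in `src`.
 * Returns 0 on success, or -1 if the addresses or the stride are not
 * int32_t-aligned, in which case the caller should use another path. */
static int pack_vector_int32(void *dst, const void *src, size_t count,
                             size_t blocklen, size_t stride_bytes)
{
    assert(dst != NULL && src != NULL);
    if ((((uintptr_t)dst | (uintptr_t)src | stride_bytes)
         % sizeof(int32_t)) != 0) {
        return -1; /* reject unaligned input instead of handling it */
    }

    int32_t *d = dst;
    const char *s = src;
    for (size_t i = 0; i < count; i++) {
        const int32_t *b = (const int32_t *)s;
        size_t j = 0;
        /* Unrolled body: four plain assignments per iteration. */
        for (; j + 4 <= blocklen; j += 4) {
            d[0] = b[j];
            d[1] = b[j + 1];
            d[2] = b[j + 2];
            d[3] = b[j + 3];
            d += 4;
        }
        /* Remainder of the block, one element at a time. */
        for (; j < blocklen; j++) {
            *d++ = b[j];
        }
        s += stride_bytes;
    }
    return 0;
}

/* Hypothetical fallback: one memcpy() per block, i.e. the "lots of
 * small memcpy()" pattern that the unrolled path is meant to avoid. */
static void pack_vector_memcpy(void *dst, const void *src, size_t count,
                               size_t block_bytes, size_t stride_bytes)
{
    char *d = dst;
    const char *s = src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d, s, block_bytes);
        d += block_bytes;
        s += stride_bytes;
    }
}
```

A caller would try pack_vector_int32() first and use
pack_vector_memcpy() only when it returns -1. Both produce the same
packed bytes for aligned input.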
30639166