Conversation
|
tests/gds_kernel_latency, brdw0/1, cuda9.0, driver 384.81, using Tesla P100: No Flusher CPU Flusher + 4 usec NIC Flusher + 8 usec |
|
hpgmg_async, brdw0/1, cuda9.0, driver 384.81, using Tesla P100, 2 processes:
|
|
@e-ago does GDS_FLUSHER_SERVICE=0 (no flusher) imply GDS_GPU_HAS_FLUSHER=1, i.e. using CUDA 9.1 (broken but still adding some overhead) internal flusher or nothing at all? |
|
@drossetti no. If GDS_GPU_HAS_FLUSHER is set to 1, then GDS_FLUSHER_SERVICE is ignored. On the contrary, GDS_FLUSHER_SERVICE=0 doesn’t imply GDS_GPU_HAS_FLUSHER=1. |
| // move flush to last wait in the whole batch | ||
| if (n_waits && no_network_descs_after_entry(n_descs, descs, last_wait)) { | ||
| gds_dbg("optimizing FLUSH to last wait i=%zu\n", last_wait); | ||
| move_flush = true; |
There was a problem hiding this comment.
who is setting move_flush=true in the 'GPU support native flusher' case?
src/flusher.hpp
Outdated
| #define GDS_FLUSHER_PORT 1 | ||
| #define GDS_FLUSHER_QKEY 0 //0x11111111 | ||
|
|
||
| #define CUDA_CHECK(stmt) \ |
There was a problem hiding this comment.
are you using the CUDA RT API? this is a big decision...
There was a problem hiding this comment.
removed, it was an oversight
src/gdsync.cpp
Outdated
| gqp->recv_cq.curr_offset = 0; | ||
|
|
||
| gds_dbg("created gds_qp=%p\n", gqp); | ||
| if(!(flags & GDS_CREATE_QP_FLUSHER)) |
src/gdsync.cpp
Outdated
| } | ||
| qp_attr->send_cq = tx_cq; | ||
| gds_dbg("created send_cq=%p\n", qp_attr->send_cq); | ||
| if(!(flags & GDS_CREATE_QP_FLUSHER)) |
There was a problem hiding this comment.
looks like you are using a negated logic for the flusher flag...
please refactor into a bool local var
src/gdsync.cpp
Outdated
| param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH; | ||
|
|
||
| //No longer supported since CUDA 9.1 | ||
| //if (need_flush) param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH; |
There was a problem hiding this comment.
not true, we have to query via ::CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH
src/flusher.hpp
Outdated
| #include "archutils.h" | ||
|
|
||
| #define GDS_FLUSHER_TYPE_CPU 1 | ||
| #define GDS_FLUSHER_TYPE_NIC 2 |
There was a problem hiding this comment.
i'd rather have enum here
|
|
||
| #define GDS_FLUSHER_OP_CPU 2 | ||
| #define GDS_FLUSHER_OP_NIC 5 | ||
|
|
There was a problem hiding this comment.
Those constants are not related: they represent the number of ops required by NIC or CPU flusher
…f define. local bool variable during qp creation. if CUDA_VERSION >= 9020 then query CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH in case of native flusher
|
@drossetti I've pushed some changes:
|
drossetti
left a comment
There was a problem hiding this comment.
the flusher is a big chunk of code.
I suggest to move to a more object oriented design and split the implementation in different .cpp files.
besides please reuse the memory allocation/registration functions already present in libgdsync
|
|
||
| #define GDS_FLUSHER_OP_CPU 2 | ||
| #define GDS_FLUSHER_OP_NIC 5 | ||
|
|
| else | ||
| return false; | ||
| } | ||
| #define CHECK_FLUSHER_SERVICE() \ |
| } | ||
|
|
||
| static inline bool gds_flusher_service_active() { | ||
| if(gds_use_flusher == GDS_FLUSHER_CPU || gds_use_flusher == GDS_FLUSHER_NIC) |
There was a problem hiding this comment.
should not also check if flusher_thread!=NULL ? or wait for the thread to set some volatile flag signaling its livelihood ?
|
|
||
| #define ROUND_TO(V,PS) ((((V) + (PS) - 1)/(PS)) * (PS)) | ||
|
|
||
| bool gds_use_native_flusher() |
There was a problem hiding this comment.
this API reflect the a choice which has been made earlier, while its name implies an order to use the native flusher...
could be renamed as gds_is_native_flusher() or similar
| static gds_flusher_buf flack_d; | ||
| static int flusher_value=0; | ||
| static pthread_t flusher_thread; | ||
| static int gds_use_flusher = -1; |
There was a problem hiding this comment.
I don't like the current stateful C API.
There should be a way to create a singleton object, the flusher, using an object factory.
flusher should be an abstract base class. derived classes are specializations.
And functions should be methods of that class.
| } | ||
|
|
||
| static int gds_flusher_pin_buffer(gds_flusher_buf * fl_mem, size_t req_size, int type_mem) | ||
| { |
There was a problem hiding this comment.
why do you need a new memory allocation/registration function? why not using/extending those already here?
| gds_dbg("created gds_qp=%p\n", gqp); | ||
| if(!is_qp_flusher) | ||
| { | ||
| if(gds_flusher_init(pd, context, gpu_id)) |
There was a problem hiding this comment.
gds_flusher_init() should return a flusher object which is stored in gds_qp.
you should convince the reviewer that there is value in abstracting the native flusher inside , or to simply special case in gdsync.c
|
The flusher implementation for the moment is in PR #51 |
There are 3 types of flusher: GPU native, CPU thread and NIC. It is possible to specify which one must be used by means of 2 env vars:
All the GDS_FLUSHER_SERVICE values have been tested with tests/gds_kernel_latency; here there is a report of the outputs with performance and the list of params posted in case of a wait operation.
Tested on ivy2/3 with cuda_20171220_23307802-inline-weak-membar-perf.
Note: GDR on ivy2/3 has poor performance.
In order to evaluate real performance, we should test the flusher on real-world applications using Async.
GDS_FLUSHER_SERVICE=0
GDS_FLUSHER_SERVICE=1 (CPU Thread) + 16 usec
GDS_FLUSHER_SERVICE=2 (NIC) + 20 usec