Flusher step 2: CPU and NIC options by e-ago · Pull Request #32 · gpudirect/libgdsync

e-ago · 2018-01-17T15:26:21Z

There are three types of flusher: GPU native, CPU thread, and NIC. Two environment variables select which one is used:

GDS_GPU_HAS_FLUSHER: 1 enables the GPU native flusher (the flusher service is then ignored), 0 disables it. Since CUDA 9.1 it must always be 0.
GDS_FLUSHER_SERVICE:
- 0 : No flusher service (default)
- 1 : CPU thread flusher service
- 2 : NIC flusher service

All GDS_FLUSHER_SERVICE values have been tested with tests/gds_kernel_latency. The outputs below report the performance and the list of params posted for a wait operation.
Tested on ivy2/3 with cuda_20171220_23307802-inline-weak-membar-perf.
Note: GDR on ivy2/3 has poor performance.

In order to evaluate real performance, we should test the flusher on real-world applications using Async.

GDS_FLUSHER_SERVICE=0

[12893] GDS WARN  gds_post_ops() poll params
[12893] GDS INFO  gds_dump_params() param[0]:
[12893] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x7ffe8fb8afa0 value:00000000 flags:00000000

testing....
[1] batch 2: posted 20 sequences
pre-posting took 2301.00 usec
[0] 2048000 bytes in 0.04 seconds = 416.41 Mbit/sec
[0] 1000 iters in 0.04 seconds = 39.35 usec/iter
[1] 2048000 bytes in 0.04 seconds = 416.08 Mbit/sec
[1] 1000 iters in 0.04 seconds = 39.38 usec/iter

GDS_FLUSHER_SERVICE=1 (CPU Thread) + 16 usec

[12926] GDS WARN  gds_post_ops() poll params
[12926] GDS INFO  gds_dump_params() param[0]:
[12926] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x7ffc224f85f0 value:00000000 flags:00000000
[12926] GDS INFO  gds_dump_params() param[1]:
[12926] GDS INFO  gds_dump_param() WRITE32 addr:0x204a80000 alias:0x1 value:000003e7 flags:00000001
[12926] GDS INFO  gds_dump_params() param[2]:
[12926] GDS INFO  gds_dump_param() WAIT32 addr:0x23046d0000 alias:0x7ffc22513df0 value:000003e7 flags:00000001

testing....
[1] batch 2: posted 20 sequences
pre-posting took 2470.00 usec
[0] 2048000 bytes in 0.06 seconds = 293.14 Mbit/sec
[0] 1000 iters in 0.06 seconds = 55.89 usec/iter
[1] 2048000 bytes in 0.06 seconds = 292.90 Mbit/sec
[1] 1000 iters in 0.06 seconds = 55.94 usec/iter

GDS_FLUSHER_SERVICE=2 (NIC) + 20 usec

[12961] GDS WARN  gds_post_ops() poll params
[12961] GDS INFO  gds_dump_params() param[0]:
[12961] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x204a60a04 value:00000000 flags:00000000
[12961] GDS INFO  gds_dump_params() param[1]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x23046c0000 alias:(nil) value:000003e7 flags:00000001
[12961] GDS INFO  gds_dump_params() param[2]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a40104 alias:0x7f7df36a3300 value:e7030000 flags:00000001
[12961] GDS INFO  gds_dump_params() param[3]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a60b00 alias:0x7fffe63c8600 value:08e60300 flags:00000000
[12961] GDS INFO  gds_dump_params() param[4]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a60b04 alias:0x1 value:036d1400 flags:00000001
[12961] GDS INFO  gds_dump_params() param[5]:
[12961] GDS INFO  gds_dump_param() WAIT32 addr:0x23046d0000 alias:0x7fffe63e3e00 value:000003e7 flags:00000001

[1] batch 2: posted 20 sequences
pre-posting took 2556.00 usec
[0] 2048000 bytes in 0.06 seconds = 275.28 Mbit/sec
[0] 1000 iters in 0.06 seconds = 59.52 usec/iter
[1] 2048000 bytes in 0.06 seconds = 275.09 Mbit/sec
[1] 1000 iters in 0.06 seconds = 59.56 usec/iter

e-ago · 2018-01-25T12:12:21Z

tests/gds_kernel_latency, brdw0/1, cuda9.0, driver 384.81, using Tesla P100:

No Flusher

iters=1000 tx/rx_depth=1024

testing....
pre-posting took 1024.00 usec
[0] 2048000 bytes in 0.02 seconds = 744.56 Mbit/sec
[0] 1000 iters in 0.02 seconds = 22.00 usec/iter
[1] 2048000 bytes in 0.02 seconds = 743.54 Mbit/sec
[1] 1000 iters in 0.02 seconds = 22.03 usec/iter

CPU Flusher + 4 usec

pre-posting took 1400.00 usec
[0] 2048000 bytes in 0.03 seconds = 613.86 Mbit/sec
[0] 1000 iters in 0.03 seconds = 26.69 usec/iter
[1] 2048000 bytes in 0.03 seconds = 612.81 Mbit/sec
[1] 1000 iters in 0.03 seconds = 26.74 usec/iter

NIC Flusher + 8 usec

pre-posting took 1427.00 usec
[0] 2048000 bytes in 0.03 seconds = 540.98 Mbit/sec
[0] 1000 iters in 0.03 seconds = 30.29 usec/iter
[1] 2048000 bytes in 0.03 seconds = 539.94 Mbit/sec
[1] 1000 iters in 0.03 seconds = 30.34 usec/iter
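
The per-iteration overhead quoted in the section titles can be re-derived from the usec/iter figures. A minimal C++ sketch of that arithmetic follows; the helper names are illustrative and not part of libgdsync.

```cpp
#include <cassert>
#include <cstdio>

// Absolute overhead of a flusher, in usec per iteration, relative to the
// no-flusher baseline. Hypothetical helper, not part of libgdsync.
constexpr double overhead_usec(double baseline_usec, double flusher_usec) {
    return flusher_usec - baseline_usec;
}

// The same overhead expressed as a percentage of the baseline.
constexpr double overhead_pct(double baseline_usec, double flusher_usec) {
    return 100.0 * (flusher_usec - baseline_usec) / baseline_usec;
}

// Print the overhead for one configuration.
inline void report(const char *name, double baseline_usec, double flusher_usec) {
    assert(baseline_usec > 0.0);
    std::printf("%-12s +%.2f usec/iter (+%.1f%%)\n", name,
                overhead_usec(baseline_usec, flusher_usec),
                overhead_pct(baseline_usec, flusher_usec));
}
```

With the rank-0 numbers of this run, `report("CPU flusher", 22.00, 26.69)` gives roughly +4.7 usec/iter and `report("NIC flusher", 22.00, 30.29)` roughly +8.3 usec/iter.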

e-ago · 2018-01-25T14:46:15Z

hpgmg_async, brdw0/1, cuda9.0, driver 384.81, using Tesla P100, 2 processes:

flusher	size	gain vs. no flusher	time vs. no flusher (sec)
CPU	4	-11.54%	+0.0003402
CPU	5	-11.92%	+0.0009222
CPU	6	-9.61%	+0.0020278
CPU	7	-5.33%	+0.0031404
NIC	4	-13.39%	+0.0003948
NIC	5	-14.39%	+0.0011136
NIC	6	-11.49%	+0.0024238
NIC	7	-5.56%	+0.0032774

drossetti · 2018-01-27T01:29:56Z

@e-ago does GDS_FLUSHER_SERVICE=0 (no flusher) imply GDS_GPU_HAS_FLUSHER=1, i.e. using CUDA 9.1 (broken but still adding some overhead) internal flusher or nothing at all?

e-ago · 2018-01-29T22:05:58Z

@drossetti no. If GDS_GPU_HAS_FLUSHER is set to 1, then GDS_FLUSHER_SERVICE is ignored. Conversely, GDS_FLUSHER_SERVICE=0 doesn't imply GDS_GPU_HAS_FLUSHER=1.
That is, if GDS_FLUSHER_SERVICE=0 and GDS_GPU_HAS_FLUSHER=0, there is no flusher at all.
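
The precedence between the two variables could be sketched as below. This is not the PR's code: the enum and function names are made up for illustration. Treating an unset GDS_GPU_HAS_FLUSHER as 0, and any unknown GDS_FLUSHER_SERVICE value as "no flusher", are assumptions.

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative names; the thread does not show the internal representation.
enum class Flusher { None, Native, CpuThread, Nic };

// Parse an env-var string as an int; nullptr (variable unset) yields dflt.
inline int env_int(const char *value, int dflt) {
    return value ? std::atoi(value) : dflt;
}

// Resolve the effective flusher from the two variables:
// GDS_GPU_HAS_FLUSHER=1 wins, otherwise GDS_FLUSHER_SERVICE selects.
inline Flusher resolve_flusher(const char *gpu_has_flusher,
                               const char *flusher_service) {
    if (env_int(gpu_has_flusher, 0) == 1)
        return Flusher::Native;          // service setting is ignored
    switch (env_int(flusher_service, 0)) {
    case 1:  return Flusher::CpuThread;
    case 2:  return Flusher::Nic;
    default: return Flusher::None;       // 0 (default) or unknown value
    }
}
```

A caller would pass `std::getenv("GDS_GPU_HAS_FLUSHER")` and `std::getenv("GDS_FLUSHER_SERVICE")` as the two arguments.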

drossetti · 2018-01-27T01:06:49Z

src/apis.cpp

        // move flush to last wait in the whole batch
        if (n_waits && no_network_descs_after_entry(n_descs, descs, last_wait)) {
                gds_dbg("optimizing FLUSH to last wait i=%zu\n", last_wait);
-                move_flush = true;


who is setting move_flush=true in the 'GPU support native flusher' case?

drossetti · 2018-01-27T01:08:04Z

src/flusher.hpp

+#define GDS_FLUSHER_PORT 1
+#define GDS_FLUSHER_QKEY 0 //0x11111111
+
+#define CUDA_CHECK(stmt)                                \


are you using the CUDA RT API? this is a big decision...

removed, it was an oversight

drossetti · 2018-01-27T01:38:13Z

src/gdsync.cpp

        gqp->recv_cq.curr_offset = 0;

-        gds_dbg("created gds_qp=%p\n", gqp);
+        if(!(flags & GDS_CREATE_QP_FLUSHER))


drossetti · 2018-01-27T01:39:30Z

src/gdsync.cpp

        }
+        qp_attr->send_cq = tx_cq;
+        gds_dbg("created send_cq=%p\n", qp_attr->send_cq);
+        if(!(flags & GDS_CREATE_QP_FLUSHER))


looks like you are using a negated logic for the flusher flag...
please refactor into a bool local var
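
A minimal sketch of the requested refactoring, assuming a placeholder flag value (the real GDS_CREATE_QP_FLUSHER bit is not shown in this thread) and a hypothetical helper standing in for the QP-creation path:

```cpp
#include <cassert>

// Stand-in for GDS_CREATE_QP_FLUSHER; the actual bit value is an assumption.
constexpr int kCreateQpFlusher = 1 << 3;

// Hypothetical stand-in for the QP-creation path: returns true when the
// regular (non-flusher) setup steps should run.
inline bool needs_regular_qp_setup(int flags) {
    // Name the condition once, in positive form, instead of repeating
    // !(flags & GDS_CREATE_QP_FLUSHER) at every use site.
    const bool is_qp_flusher = (flags & kCreateQpFlusher) != 0;
    return !is_qp_flusher;
}
```

Every later branch can then test `is_qp_flusher` directly, which keeps the negation in one visible place.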

drossetti · 2018-01-27T01:43:54Z

src/gdsync.cpp

-                param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH;
+
+        //No longer supported since CUDA 9.1
+        //if (need_flush) param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH;


not true, we have to query via ::CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH
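
One way to structure such a check, sketched without depending on the CUDA headers. The function name is hypothetical, and `query_device_attr` is a hypothetical callable standing in for the driver query of the attribute named above. The 9020 threshold is the one used by the follow-up commit in this PR. Reporting "not supported" for older versions is an assumption of this sketch only.

```cpp
#include <cassert>
#include <functional>

// Decide whether the wait-with-flush flag may be used.
// cuda_version mirrors the CUDA_VERSION macro (9020 means CUDA 9.2).
// query_device_attr returns nonzero when the device reports support.
inline bool can_use_wait_value_flush(
        int cuda_version, const std::function<int()> &query_device_attr) {
    if (cuda_version < 9020)
        return false;  // assumption: do not query on older toolkits
    return query_device_attr() != 0;
}
```

In real code the callable would wrap a `cuDeviceGetAttribute()` call for the device in use.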

drossetti · 2018-02-01T21:08:55Z

src/flusher.hpp

+#include "archutils.h"
+
+#define GDS_FLUSHER_TYPE_CPU 1
+#define GDS_FLUSHER_TYPE_NIC 2


i'd rather have enum here

drossetti · 2018-02-01T21:09:09Z

src/flusher.hpp

+
+#define GDS_FLUSHER_OP_CPU 2
+#define GDS_FLUSHER_OP_NIC 5
+


enum here too

Those constants are not related: they represent the number of ops required by NIC or CPU flusher
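
To illustrate what these counts mean: the values 2 and 5 match the extra params visible in the gds_dump_params() output earlier in this thread (3 params in total with the CPU flusher, 6 with the NIC flusher, 1 without). A small sketch with illustrative names:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative enum, not the PR's type.
enum class FlusherKind { None, Cpu, Nic };

// Extra stream operations appended after a wait, per flusher kind.
// 2 and 5 are the values of GDS_FLUSHER_OP_CPU and GDS_FLUSHER_OP_NIC.
inline std::size_t flusher_extra_ops(FlusherKind kind) {
    switch (kind) {
    case FlusherKind::Cpu:  return 2;
    case FlusherKind::Nic:  return 5;
    case FlusherKind::None: return 0;
    }
    return 0;
}

// Params to reserve for a single wait: the wait itself plus flusher ops.
inline std::size_t params_for_wait(FlusherKind kind) {
    return 1 + flusher_extra_ops(kind);
}
```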

e-ago · 2018-02-05T15:38:05Z

@drossetti I've pushed some changes:

- flusher env vars (GDS_GPU_HAS_FLUSHER and GDS_FLUSHER_SERVICE) merged into a single one (GDS_FLUSHER_TYPE)
- enum with 4 different flusher types: GDS_FLUSHER_NONE=0, GDS_FLUSHER_NATIVE, GDS_FLUSHER_CPU, GDS_FLUSHER_NIC
- in case of GDS_FLUSHER_NATIVE, move_flush reintroduced with CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH check (CUDA_VERSION >= 9020)
- local bool variable during qp creation
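
For reference, the merged variable could be parsed roughly as follows. The enumerator names and order come from the list above. The enum type name, the function name, the numeric encoding of GDS_FLUSHER_TYPE, and the fallback to GDS_FLUSHER_NONE on bad input are all assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Enumerators as listed in this comment; the type name is a guess.
enum gds_flusher_type {
    GDS_FLUSHER_NONE = 0,
    GDS_FLUSHER_NATIVE,
    GDS_FLUSHER_CPU,
    GDS_FLUSHER_NIC
};

// Map the string value of GDS_FLUSHER_TYPE to the enum.
// Unset, non-numeric, or out-of-range values fall back to NONE.
inline gds_flusher_type parse_flusher_type(const char *value) {
    if (!value || *value == '\0')
        return GDS_FLUSHER_NONE;
    char *end = nullptr;
    const long v = std::strtol(value, &end, 10);
    if (*end != '\0' || v < GDS_FLUSHER_NONE || v > GDS_FLUSHER_NIC)
        return GDS_FLUSHER_NONE;
    return static_cast<gds_flusher_type>(v);
}
```

Typical use would be `parse_flusher_type(std::getenv("GDS_FLUSHER_TYPE"))`.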

drossetti

The flusher is a big chunk of code.
I suggest moving to a more object-oriented design and splitting the implementation across different .cpp files.
Besides, please reuse the memory allocation/registration functions already present in libgdsync.

drossetti · 2018-02-15T01:18:18Z

src/flusher.hpp

+
+#define GDS_FLUSHER_OP_CPU 2
+#define GDS_FLUSHER_OP_NIC 5
+


drossetti · 2018-02-15T01:23:24Z

src/flusher.cpp

+    else
+        return false;
+}
+#define CHECK_FLUSHER_SERVICE()                                                 \


why a macro?

drossetti · 2018-02-15T01:24:16Z

src/flusher.cpp

+}
+
+static inline bool gds_flusher_service_active() {
+    if(gds_use_flusher == GDS_FLUSHER_CPU || gds_use_flusher == GDS_FLUSHER_NIC)


Shouldn't this also check that flusher_thread != NULL, or wait for the thread to set some volatile flag signaling that it is alive?
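
A sketch of the second option, with hypothetical names. It uses std::atomic rather than a volatile flag, since volatile alone does not give inter-thread ordering in C++. It also assumes that only the CPU flusher depends on a host thread; the thread does not say whether the NIC flusher needs one.

```cpp
#include <atomic>
#include <cassert>

enum gds_flusher_type {
    GDS_FLUSHER_NONE = 0, GDS_FLUSHER_NATIVE, GDS_FLUSHER_CPU, GDS_FLUSHER_NIC
};

// Hypothetical state shared with the flusher service thread.
struct flusher_service_state {
    std::atomic<bool> alive{false};  // set by the thread once it is running
};

// Called by the service thread when it starts (true) and exits (false).
inline void flusher_service_mark_alive(flusher_service_state &s, bool alive) {
    s.alive.store(alive, std::memory_order_release);
}

// CPU flusher: active only once the thread has reported in.
// NIC flusher: treated as active when selected (assumption: no host thread).
inline bool flusher_service_active(gds_flusher_type type,
                                   const flusher_service_state &s) {
    if (type == GDS_FLUSHER_CPU)
        return s.alive.load(std::memory_order_acquire);
    return type == GDS_FLUSHER_NIC;
}
```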

drossetti · 2018-02-15T01:26:41Z

src/flusher.cpp

+
+#define ROUND_TO(V,PS) ((((V) + (PS) - 1)/(PS)) * (PS))
+
+bool gds_use_native_flusher()


This API reflects a choice that was made earlier, while its name implies an order to use the native flusher.
It could be renamed gds_is_native_flusher() or similar.

drossetti · 2018-02-15T01:29:35Z

src/flusher.cpp

+static gds_flusher_buf flack_d;
+static int flusher_value=0;
+static pthread_t flusher_thread;
+static int gds_use_flusher = -1;


I don't like the current stateful C API.
There should be a way to create a singleton object, the flusher, using an object factory.
flusher should be an abstract base class. derived classes are specializations.
And functions should be methods of that class.
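
A rough shape of that design, as a sketch only. All class and function names are hypothetical. Returning no object for the "none" and "native" types is an assumption, since whether the native flusher belongs in this abstraction is left open in this review.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class FlusherType { None, Native, Cpu, Nic };

// Abstract base class: one interface, one specialization per flusher kind.
class Flusher {
public:
    virtual ~Flusher() = default;
    virtual std::string name() const = 0;
    // Number of extra stream ops appended after a wait (2 and 5 in the PR).
    virtual int extra_ops() const = 0;
};

class CpuFlusher final : public Flusher {
public:
    std::string name() const override { return "cpu"; }
    int extra_ops() const override { return 2; }
};

class NicFlusher final : public Flusher {
public:
    std::string name() const override { return "nic"; }
    int extra_ops() const override { return 5; }
};

// Factory: returns nullptr when no service object is needed.
inline std::unique_ptr<Flusher> make_flusher(FlusherType t) {
    switch (t) {
    case FlusherType::Cpu: return std::make_unique<CpuFlusher>();
    case FlusherType::Nic: return std::make_unique<NicFlusher>();
    default:               return nullptr;
    }
}

// Singleton accessor: the object is created on the first call only,
// so the type passed to later calls is ignored.
inline Flusher *flusher_instance(FlusherType t) {
    static std::unique_ptr<Flusher> instance = make_flusher(t);
    return instance.get();
}
```

The module-level statics in the diff (flusher_value, flusher_thread, and so on) would then become members of the derived classes.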

drossetti · 2018-02-15T01:36:31Z

src/flusher.cpp

+}
+
+static int gds_flusher_pin_buffer(gds_flusher_buf * fl_mem, size_t req_size, int type_mem)
+{


Why do you need a new memory allocation/registration function? Why not use or extend those already here?

drossetti · 2018-02-15T02:06:39Z

src/gdsync.cpp

-        gds_dbg("created gds_qp=%p\n", gqp);
+        if(!is_qp_flusher)
+        {
+                if(gds_flusher_init(pd, context, gpu_id))


gds_flusher_init() should return a flusher object which is stored in gds_qp.
you should convince the reviewer that there is value in abstracting the native flusher inside , or to simply special case in gdsync.c

e-ago · 2018-04-12T12:30:06Z

The flusher implementation for the moment is in PR #51

e-ago added 5 commits January 16, 2018 18:40

Initial commit flusher

71ec1fe

flusher inlcude, nvtx removed

82b11e1

CU_STREAM_WAIT_VALUE_FLUSH no longer supported in CUDA 9.x. Closes #31

31229e2

Commented dump wait params useful for debug

56a4e7a

CU_STREAM_WAIT_VALUE_FLUSH code restored commented as a reminder

b00b350

drossetti requested changes Feb 1, 2018

flusher env vars merged in a single one. flusher types enum instead of define. local bool variable during qp creation. if CUDA_VERSION >= 9020 then query CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH in case of native flusher

f3c4cf1

enum gds_flusher_type renamed

d058537

drossetti requested changes Feb 15, 2018

drossetti reviewed Feb 15, 2018

