diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..62519cd
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,6 @@
+_site
+.sass-cache
+.jekyll-cache
+.jekyll-metadata
+vendor
+.bundle
diff --git a/404.html b/404.html
new file mode 100644
index 0000000..086a5c9
--- /dev/null
+++ b/404.html
@@ -0,0 +1,25 @@
+---
+permalink: /404.html
+layout: default
+---
Page not found :(
The requested page could not be found.

Finally I've decided to start blogging!
-
For some time I've been procrastinating about starting to write down my thoughts, experiments & experiences --- giving myself the excuse of a busy schedule. But being unable to do something due to lack of time is most probably a sign of time mismanagement.
-
-
So I started planning how I'm going to spend my week --- just like sprint planning, where on each Saturday evening I sketch out the things I'm going to do that week, with day-level granularity. After following this ritual for the last 5 months, I've discovered it's not only easy to accommodate lots of tasks in 86,400 seconds, but also to accomplish 100% of them if planned properly. I see this planning as an art, where there's nothing optimal; rather it's a gradual progression towards betterment, where failure is a friendly teacher ✅.
-
-
When I started following this ritual, I used to sketch the whole plan in my head. After a few weeks I decided to use pencil & paper, and was able to plan better --- I was accomplishing >90% of what I planned.
-
-
During this period, I taught myself how to wake up early in the morning. All it takes is keeping the alarm out of the bed's reach, so that I'm forced to get out of bed to stop the alarm, rather than snoozing it. In the last 5 months, I've worked on 3 large-scale production-grade open-source software systems. Maybe some other day I'll talk about them. Reading more blogs, listening to podcasts and reading books are among the things I've become better at doing. I've seen improvement in the way I manage my tasks at the workplace --- completing more good quality tasks in less time. Empowered by better time management, I've seen improvement in my relationship with other human beings & nature ( it's important 😉 ).
-
-
But the more important part is understanding that scheduling is an art --- collecting feedback every day, accumulating it over the week & sitting down on Saturday evening to make a better plan --- is an infinite loop.

I've noticed getting inside the infinite loop is hard, but once you're in, it's beautiful here 🥰. Speaking from my heart: as I'm writing this, I feel writing is indeed impactful.
-
-
- Have a great time !
-
Last week I implemented multiple variants of the highly parallelizable cryptographic hash function BLAKE3 using SYCL, and today I'd like to present the collective understanding I gained while implementing & benchmarking BLAKE3, targeting heterogeneous accelerator platform(s). The BLAKE3 cryptographic hash function lends itself well to data parallel execution environments like SYCL/ OpenCL. Speaking from a high level design point of view, it consists of the following two steps.

1. Split the input byte array into 1024 -bytes wide chunks and compress each chunk independently, producing one output chaining value ( a leaf node ) per chunk.
2. Construct a Binary Merkle Tree over those leaf nodes, pairwise compressing chaining values until the root of the tree is reached.
-
The root of the tree ( 32 -bytes wide ) is the desired cryptographic hash of the input byte array. Both of these steps are good candidates for data parallelism. Note, step-1 produces N -many leaf nodes of the Binary Merkle Tree, which are used for finding the root of the Merkle Tree in step-2, i.e. step-2 is data dependent on step-1.
-
-
In this document, I'll be working with an input byte array of length M -bytes such that M = N * 1024, where N = 2^i, i = {1, 2, 3 ...}. That means after execution of step-1 of BLAKE3, I should have a power-of-2 count of leaf nodes ( = N ), which will be used for computing the root of a fully balanced Binary Merkle Tree. This will simplify both explanation & implementation. I'll walk you through the following two techniques of implementing BLAKE3.

1. approach_1 : each SYCL work-item compresses a single 1024 -bytes chunk.
2. approach_2 : each SYCL work-item compresses multiple ( 2/ 4/ 8/ 16 ) chunks together, SIMD style.

Let me start with the first approach, which is simpler.
-
-
Let us assume I've an 8KB input, which I take as a byte array ( say const sycl::uchar * ) and split into 8 equal sized chunks. Now each of these 1024 -bytes wide chunks can be compressed in parallel. For doing so, I'll dispatch 8 work-items, with work-group size W ( W <= 8 && 8 % W == 0 ), where each work-item executes the compress( ... ) function, consuming a 1024 -bytes input message into the hash state. Once all these 8 work-items complete their execution, each of them outputs a 64 -bytes chaining value ( which is actually the BLAKE3 hash state matrix of that chunk ), from which the first 32 -bytes are taken as the output chaining value of that chunk. These output chaining values are used as leaf nodes of the Binary Merkle Tree, which I'm about to construct.
-
-
- In final step of computation, I construct a Binary Merkle Tree from N ( = 8 ) output chaining values.
- As Binary Merkle Tree is a hierarchical structure, I need to dispatch multiple rounds of kernels, respecting data dependency.
- To be more specific, in this case 3 ( = log2(N), where N = 8 ) rounds will be required.
- In first dispatch round, I'll dispatch 4 work-items, who will read ( total ) 4 consecutive pairs of output chaining values and interpret each pair of chaining values as left and
- right child of ( to be computed ) parent node, placed right next to each other ( in ltr order ), as depicted below.
-
- Computing parent node involves compressing a pair of chaining values, while setting some flags denoting that parent chaining value will be output of compress( ... ), - where each chaining value is of 32 -bytes, - making total of 64 -bytes input to compress( ... ). After completion of this dispatch round, we should have 4 parent nodes, who live just above leaf nodes. - In next dispatch round, I've to ask for 2 work-items, each will compress two consecutive chaining values ( which were computed during last round ) - and produce total 2 parent nodes, who live just below root of the tree ( to be computed in next dispatch round ). -
-
In the final round, it suffices to dispatch just a single task which takes 64 -bytes of input ( read two chaining values, which are the two immediate children of the root of the tree; these were computed during the last round --- thus the data dependency ) and produces the 32 -bytes output chaining value which is the root of the Merkle Tree. This root is our desired BLAKE3 hash. Also note, before the root of the tree can be computed, the flag denoting that the output of this compress( ... ) invocation is the root chaining value of the BLAKE3 Merkle Tree needs to be set.
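For illustration only, the round-by-round merging described above can be sketched on the host side like this ( a minimal Go sketch, not the actual SYCL kernels; compressParent is a hypothetical stand-in for compress( ... ) invoked with the parent/ root flags ):

```go
// merkleRoot halves the node count in every round: "work-item" i merges
// nodes 2i and 2i+1 into one parent, until only the root remains.
func merkleRoot(leaves [][32]byte,
	compressParent func(left, right [32]byte, isRoot bool) [32]byte) [32]byte {

	nodes := leaves
	for len(nodes) > 1 {
		parents := make([][32]byte, len(nodes)/2)
		for i := range parents {
			// the ROOT flag only matters in the very last round
			parents[i] = compressParent(nodes[2*i], nodes[2*i+1], len(nodes) == 2)
		}
		nodes = parents
	}
	return nodes[0]
}
```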
-
-
- A pictorial demonstration might be helpful at this moment.
-
- Empowered with this high level knowledge of algorithmic construction of BLAKE3, it's good time to dive into often mentioned compress( ... ) - function. Simply speaking compression starts with 32 -bytes input chaining value and 64 -bytes input message, it consumes whole message into BLAKE3 hash state - ( in multiple rounds, while also employing message permutation, using predefined indexing tricks ) - and produces output of 64 -bytes, which is nothing but hash state after consuming whole input message inside it. First 32 -bytes of output is taken as chaining value - which is either used as input to next stage of computation or as final root chaining value i.e. BLAKE3 digest. Let's emphasize on BLAKE3 hash state. -
-- BLAKE3 hash state is 64 -bytes wide; as BLAKE3 word size is 32 -bit, hash state can be represented using an array of 16 elements where each element is 32 -bit wide - unsigned integer i.e. sycl::uint. When compressing 64 -bytes message, BLAKE3 consumes input message in 7 rounds, while at end of each round ( except last one ) - permutes 64 -bytes message in a predefined way. At end of applying 7 rounds, it takes first 32 -bytes of hash state, which has now consumed permuted variants of 64 -bytes input - message, as output chaining value. Each round of BLAKE3 compression consists of bit wise manipulation of 32 -bit wide hash state words. -
Note the indices passed as arguments to the g( ... ) function from blake3_round( ... ): the first four g( ... ) invocations mix the first eight message words with the hash state column-wise, while the last four g( ... ) invocations mix the remaining eight message words with the hash state diagonally. It's possible to reduce the four vertical mixing invocations into a single function call, where all four columns are mixed in parallel, if I represent the hash state as an array of 4 vectors ( a SYCL intrinsic ), where each vector is of type sycl::uint4, as shown below. With this new representation of the hash state, diagonal mixing also enjoys a boost, where all four diagonals of the hash state matrix can be mixed in parallel.
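For readers who want the indexing spelled out, here's a scalar sketch of g( ... ) and blake3_round( ... ) in Go ( written per my reading of the BLAKE3 specification; the actual implementation is vectorized SYCL C++, so treat this as reference pseudocode rather than the kernel itself ):

```go
import "math/bits"

// g is the BLAKE3 quarter-round: it mixes two message words mx, my into the
// four state words indexed by a, b, c, d ( rotation constants 16, 12, 8, 7 ).
func g(state *[16]uint32, a, b, c, d int, mx, my uint32) {
	state[a] = state[a] + state[b] + mx
	state[d] = bits.RotateLeft32(state[d]^state[a], -16)
	state[c] = state[c] + state[d]
	state[b] = bits.RotateLeft32(state[b]^state[c], -12)
	state[a] = state[a] + state[b] + my
	state[d] = bits.RotateLeft32(state[d]^state[a], -8)
	state[c] = state[c] + state[d]
	state[b] = bits.RotateLeft32(state[b]^state[c], -7)
}

// blake3Round applies one round: four column mixes consuming message words
// 0..7, followed by four diagonal mixes consuming message words 8..15.
func blake3Round(state *[16]uint32, m *[16]uint32) {
	g(state, 0, 4, 8, 12, m[0], m[1])
	g(state, 1, 5, 9, 13, m[2], m[3])
	g(state, 2, 6, 10, 14, m[4], m[5])
	g(state, 3, 7, 11, 15, m[6], m[7])

	g(state, 0, 5, 10, 15, m[8], m[9])
	g(state, 1, 6, 11, 12, m[10], m[11])
	g(state, 2, 7, 8, 13, m[12], m[13])
	g(state, 3, 4, 9, 14, m[14], m[15])
}
```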
-- With this new representation of hash state column-wise mixing looks like below. -
-
- But keeping hash state as 4x4 matrix comes with its own requirement, where it needs to be diagonalised such that each diagonal of 4x4 matrix is now in same column, before
- diagonal mixing can be applied. After diagonal mixing, hash state needs to get back to its original form, which calls for undoing the diagonalisation previously performed.
- Diagonalisation involves rotating each of four vectors leftwards by row index of respective vector i.e. {0, 1, 2, 3} in 4x4 state matrix. Note, vector lane rotation doesn't rotate
- each lane content ( sycl::uint ), instead it rotates whole vector by N ( < 4, because each row has 4 lanes ) places.
- I make use of vector swizzle operators provided by SYCL vector intrinsic API for rotating vector.
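The SYCL implementation does these rotations with sycl::uint4 swizzles; as a plain scalar sketch of the same idea ( illustrative only ):

```go
// rotateLanes returns the row rotated left by n lanes; lane contents are untouched.
func rotateLanes(row [4]uint32, n int) [4]uint32 {
	var out [4]uint32
	for i := 0; i < 4; i++ {
		out[i] = row[(i+n)%4]
	}
	return out
}

// diagonalize rotates row i of the 4x4 state left by i lanes, lining each
// diagonal up in a single column; undiagonalize rotates the rows back.
func diagonalize(state *[4][4]uint32) {
	for i := 1; i < 4; i++ {
		state[i] = rotateLanes(state[i], i)
	}
}

func undiagonalize(state *[4][4]uint32) {
	for i := 1; i < 4; i++ {
		state[i] = rotateLanes(state[i], 4-i)
	}
}
```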
-
-
- A pictorial depiction looks like below.
-
- Note the color coding, which shows how diagonalisation helps in bringing each of 4 diagonals of 4x4 hash state matrix in same column. This makes applying diagonal mixing - much easier ( and faster ) on 128 -bit vectors i.e. sycl::uint4. -
-- Following code snippet can perform diagonal mixing on four 128 -bit vectors i.e. sycl::uint4[4]. This will consume last 8 message words of total 64 -bytes - input message into hash state. Note, other than which message words are consumed, diagonal mixing is just same as column-wise mixing, because - we've arranged columns to be so. This means, in implementation both of these mixings can be replaced using preprocessor directives or other compile-time - code generation means. -
-- After diagonal mixing diagonalisation will be undone, rotating 4x4 state matrix vectors rightwards by row index of respective vector i.e. {0, 1, 2, 3} in 4x4 state matrix. Following code snippet should bring back - hash state is desired form, preparing it for next round. -
During compression of the 64 -bytes message input, after each round of mixing, the sixteen message words ( 16 x 4 = 64 -bytes total message ) are permuted in the following manner, and the permuted output is used as the input message words of the next round.
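Expressed as code ( the permutation constant below is taken from my reading of the BLAKE3 specification; a small Go sketch, not the SYCL kernel ):

```go
// msgPermutation is the fixed BLAKE3 message word permutation applied between rounds.
var msgPermutation = [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}

// permuteWords rearranges the sixteen 32-bit message words for the next round.
func permuteWords(m *[16]uint32) {
	var out [16]uint32
	for i, idx := range msgPermutation {
		out[i] = m[idx]
	}
	*m = out
}
```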
Now let us go back to chunk compression, where we had a 1024 -bytes input ( as sycl::uchar * ); each compress( ... ) invocation takes 64 contiguous bytes to mix with the 4x4 hash state matrix, which means we've to iterate 16 times for processing the whole chunk. Each of these 64 -bytes pieces is called a block, and 16 of them make a chunk. For the first block in a chunk, a predefined constant input chaining value is used, but all subsequent 15 blocks use the previous block's 32 -bytes output chaining value as their input chaining value. Note, the BLAKE3 hash state is 64 -bytes and the input chaining value is 32 -bytes, so the remaining 32 -bytes of the hash state ( i.e. the last two rows of the 4x4 hash state matrix ) come from predefined constants & other parameters passed to compress( ... ), which include flags denoting whether this block is the first/ last of this chunk or whether the output chaining value will be a parent/ root node of the BLAKE3 Merkle Tree, the block length, the chunk index etc.. After applying compress( ... ), the first 32 -bytes of the hash state are taken as the output chaining value, to be used as the input chaining value of the next block in the same chunk. Below is a pictorial demonstration for ease of understanding.
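Alongside that picture, the per-chunk loop can be summarised with the following sketch ( Go, for illustration; compressFn stands in for the compress( ... ) routine above, and the flag values are the CHUNK_START/ CHUNK_END bits from the BLAKE3 specification ):

```go
// Flag bits marking the first and the last block of a chunk, per the BLAKE3 spec.
const (
	chunkStart uint32 = 1 << 0
	chunkEnd   uint32 = 1 << 1
)

// compressChunk consumes one 1024-byte chunk, 64 bytes ( one block ) at a time.
// iv is the predefined constant input chaining value used for the first block;
// every later block chains on the previous block's output chaining value.
func compressChunk(chunk []byte, chunkIndex uint64, iv [8]uint32,
	compressFn func(cv [8]uint32, block []byte, counter uint64, blockLen, flags uint32) [8]uint32) [8]uint32 {

	cv := iv
	for block := 0; block < 16; block++ {
		flags := uint32(0)
		if block == 0 {
			flags |= chunkStart
		}
		if block == 15 {
			flags |= chunkEnd
		}
		cv = compressFn(cv, chunk[block*64:(block+1)*64], chunkIndex, 64, flags)
	}
	return cv // output chaining value of this chunk, a Merkle Tree leaf
}
```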
-- Last block of each chunk produces 32 -bytes output chaining value, which is considered to be leaf node of Binary Merkle Tree. After all chunks are compressed & we've - N -many output chaining values, interpreted as leaf nodes of Binary Merkle Tree. Usual parallel Binary Merkle Tree construction algorithm can be applied. Note, for - merging two consecutive chaining values ( at leaf level ) into single parent chaining value, compress( ... ) function is invoked with a predefined constant input - chaining value, along with two consecutive chaining values being interpreted as 64 -bytes input message. Some input flags are passed to denote that its output chaining value will be a parent node. - While computing root chaining value ( read target digest of input byte array ), two immediate child nodes just below root are compressed - into single chaining value, while passing some flags to denote this is root node being computed. -
-
- It's good time to give second approach a go, where each SYCL work-item compresses more than one chunk.
-
-
- Remember in first approach of parallel BLAKE3, I used a 4x4 matrix of 64 -bytes for representing hash state, when compressing each block of total 1024 -bytes wide chunk.
- But this time, hash state is represented using a 16 x N matrix, where N = {2, 4, 8, 16} and ith row of state matrix holds N -many different chunk's hash state
- word at index i; so N -many different chunk's hash states are represented in N -many columns of 16 x N shared state matrix. That means, with N = 4, sixteen 128 -bit
- vectors will be used for representing whole hash state of 4 chunks. Programmatically I'd like to represent it using following syntax.
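The original post shows the SYCL declaration ( sixteen sycl::uint4 values for N = 4 ) at this point; a rough, hypothetical equivalent of that memory layout, written in Go just to make the indexing explicit:

```go
// lanes is the number of chunks compressed together by one work-item ( N = 4 here ).
const lanes = 4

// state[i][j] holds word i of chunk j's hash state, i.e. every chunk's hash
// state occupies one column of this 16 x lanes matrix.
var state [16][lanes]uint32
```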
-
With N ( = 4 ) chunks being compressed together, each SYCL work-item mixes a total of 4096 -bytes of input message into the hash state, each 1024 -bytes chunk split into 16 blocks, each 64 -bytes wide. Sixteen rounds are required for compressing those 4096 -bytes. In each round, the i-th block of all N chunks is compressed together. Note the color coding used in the following demonstration, where I attempt to show you how message words ( 32 -bit wide ) from each block are chosen to construct 128 -bit vectors ( using sycl::uint4 ), which are used during column-wise and diagonal mixing. I'd also like you to note that there are no diagonalisation and undiagonalisation steps required in this SIMD style mixing, because each chunk's hash state is actually a 16 word vector, which is a column of the 16 x N state matrix. After the first block is processed, which consumes 64 -bytes of message from each of the four chunks ( i.e. the first block of each chunk ), the output chaining values of the four chunks are prepared by taking the top 8 x N portion of the state matrix, while the lower 8 x N portion ( read the last 8 rows ) is dropped. This should produce a 32 -bytes output chaining value for each chunk, which will be used as the input chaining value for that respective chunk when processing block i+1 of all N ( = 4 ) chunks.
-
-
- After all sixteen blocks from all chunks are compressed into hash state, 16 x N state matrix is truncated to 8 x N matrix ( by dropping last 8 rows ),
- which holds N -many output chaining values of N -many chunks. These N -many output chaining values are considered as N -many leaf nodes of BLAKE3 Merkle Tree,
- which will be constructed once all work-items complete compressing N -many chunks each.
-
-
- Binary Merklization algorithm doesn't anyhow change in second approach.
-
- Note, when N = 2, sixteen 64 -bit wide SIMD registers are used for representing hash state of two chunks, which are compressed in parallel. Similarly, for N = {4, 8, 16}
- sixteen {128, 256, 512} -bit registers ( respectively ) will be used for representing hash state of N chunks. On modern CPUs which support avx512* instructions
- 512 -bit vectors can help boosting this style of leveraging arbitrary many SIMD lanes.
-
-
- For understanding opportunities of using SIMD for parallelizing BLAKE3 on relatively large input byte arrays, I suggest you take a look at BLAKE3
- specification's
- section 5.3.
-
As you've now a better understanding of the two aforementioned approaches for computing the BLAKE3 hash, I'd like to present you with benchmark results. In the following tables, you'll see I'm taking a random input of N -bytes; transferring the whole input to the accelerator's accompanying memory; invoking the BLAKE3 kernel with the on-device data pointer; waiting for all computation steps to complete; and finally transferring the 32 -bytes digest ( which is the output chaining value of the root node of the Binary Merkle Tree in BLAKE3 ) to preallocated memory on the host. Note, all these numbers represent mean values obtained after executing the same kernel with the same input size/ arguments K ( = 8 ) -many times.
BLAKE3 Hash using approach_1

| Input Size | Accelerator | Kernel Execution Time | Host -> Device Tx Time | Host <- Device Tx Time |
| --- | --- | --- | --- | --- |
| 64 MB | Tesla V100-SXM2-16GB | 844.598250 us | 6.166145 ms | 6.973250 us |
| 64 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 6.239875 ms | 9.797500 ms | 2.525625 us |
| 64 MB | Intel(R) Iris(R) Xe MAX Graphics | 4.974242 ms | 17.749401 ms | 1.319500 us |
| 128 MB | Tesla V100-SXM2-16GB | 1.800964 ms | 12.269974 ms | 7.080000 us |
| 128 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 8.187520 ms | 20.664062 ms | 1.242000 us |
| 128 MB | Intel(R) Iris(R) Xe MAX Graphics | 9.812348 ms | 35.475108 ms | 1.319500 us |
| 256 MB | Tesla V100-SXM2-16GB | 3.267731 ms | 24.462952 ms | 6.805500 us |
| 256 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 8.853032 ms | 32.455801 ms | 1.047125 us |
| 256 MB | Intel(R) Iris(R) Xe MAX Graphics | 19.465823 ms | 70.886068 ms | 1.293500 us |
| 512 MB | Tesla V100-SXM2-16GB | 5.998047 ms | 48.833740 ms | 6.713750 us |
| 512 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 14.807205 ms | 48.242437 ms | 1.063000 us |
| 512 MB | Intel(R) Iris(R) Xe MAX Graphics | 39.271700 ms | 141.716997 ms | 1.313000 us |
| 1024 MB | Tesla V100-SXM2-16GB | 11.915527 ms | 97.573730 ms | 8.423000 us |
| 1024 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 22.864140 ms | 79.047688 ms | 1.088500 us |
| 1024 MB | Intel(R) Iris(R) Xe MAX Graphics | 77.556440 ms | 283.341799 ms | 1.534000 us |
In the above table, you should see the three accelerators I targeted for benchmarking BLAKE3 SIMD approach_1, where two of them are GPUs from two different vendors and one is a 64 -core CPU from Intel. You'll notice Nvidia's Tesla V100 GPU performs best on all input sizes. Being a 64 -core CPU, in the majority of cases the Xeon performs better compared to Intel's Iris Xe MAX GPU. Note, when the input size is 64 MB, the Intel GPU performs a little better than the 64 -core CPU. But as the input size increases, kernel execution time on the Intel GPU starts to increase quickly, though on the Intel CPU the execution times for the 64 MB, 128 MB and 256 MB input sizes are pretty close to each other.
-
-
As the output size is constant ( read 32 -bytes ), the device to host data transfer cost is not much of a concern. But since the input data size is variable, the host to device input data transfer cost can help us answer whether it's worth transferring a large byte array to an accelerator for computing the BLAKE3 hash.
-
-
Comparing between multiple accelerators ( with the same input size ), it shows that as the input data size increases, the host to device data transfer cost increases quickly for the GPUs ( even surpassing the input data transfer cost on the CPU for the same size ), which makes sense because those accelerators are connected to the host over a PCIe bus. When comparing the input data transfer cost of Nvidia's GPU and Intel's CPU, I see that below 512 MB of input the cost was lower for the GPU, but at 512 MB both of them take around the same time. For both GPUs from the two different vendors, I see their host to device data transfer cost increases linearly as the input size is doubled, because both of them are connected to the host CPU using PCIe, which doesn't have high bandwidth. Due to these relatively high input data transfer costs, it may not always be beneficial to use this accelerated BLAKE3 implementation, where data needs to be explicitly transferred to the accelerator's local DRAM --- it may end up defeating the whole purpose of speeding up. Just to make it more evident, notice in the above table that for a 1 GB input size on the Nvidia Tesla V100 GPU, the input transfer is ~8x costlier than the actual computation of the BLAKE3 hash.
-
-
- Lastly I'd like to draw your attention to device to host data transfer cost ( transferring 32 -bytes digest back to host ), where
- you should notice, on Nvidia's Tesla V100 GPU it's ~(6 - 7)x more expensive to transfer 32 -bytes ( over PCIe ) to host, when compared to
- Intel's GPU.
-
- In approach_2 of computing BLAKE3, I'm compressing {2, 4, 8, 16} chunks together & in following table I present kernel execution time and host <-> device data transfer - cost, by timing SYCL events obtained as result of submitting respective commands on SYCL queue, which has profiling enabled. -
BLAKE3 Hash using approach_2, compressing {2, 4, 8, 16} chunks together

| Input Size | Accelerator | SIMD Width | Kernel Execution Time | Host -> Device Tx Time | Host <- Device Tx Time |
| --- | --- | --- | --- | --- | --- |
| 64 MB | Tesla V100-SXM2-16GB | 64 -bit | 1.016358 ms | 6.172363 ms | 7.568375 us |
| 64 MB | Tesla V100-SXM2-16GB | 128 -bit | 923.828375 us | 6.168457 ms | 7.323875 us |
| 64 MB | Tesla V100-SXM2-16GB | 256 -bit | 1.318848 ms | 6.168945 ms | 7.812625 us |
| 64 MB | Tesla V100-SXM2-16GB | 512 -bit | 2.055176 ms | 6.176270 ms | 10.254000 us |
| 64 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 64 -bit | 7.193866 ms | 13.689200 ms | 4.531500 us |
| 64 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 128 -bit | 6.739462 ms | 14.008103 ms | 2.967625 us |
| 64 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 256 -bit | 7.261953 ms | 14.829467 ms | 2.978000 us |
| 64 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 512 -bit | 11.546031 ms | 13.229008 ms | 1.385125 us |
| 64 MB | Intel(R) Iris(R) Xe MAX Graphics | 64 -bit | 3.106389 ms | 17.748458 ms | 1.365000 us |
| 64 MB | Intel(R) Iris(R) Xe MAX Graphics | 128 -bit | 28.628951 ms | 17.749823 ms | 1.332500 us |
| 64 MB | Intel(R) Iris(R) Xe MAX Graphics | 256 -bit | 56.188691 ms | 17.748861 ms | 1.326000 us |
| 64 MB | Intel(R) Iris(R) Xe MAX Graphics | 512 -bit | 105.559818 ms | 17.749823 ms | 1.365000 us |
| 256 MB | Tesla V100-SXM2-16GB | 64 -bit | 3.539550 ms | 24.455078 ms | 7.080250 us |
| 256 MB | Tesla V100-SXM2-16GB | 128 -bit | 4.190674 ms | 24.442871 ms | 7.080000 us |
| 256 MB | Tesla V100-SXM2-16GB | 256 -bit | 5.203370 ms | 24.459961 ms | 7.568250 us |
| 256 MB | Tesla V100-SXM2-16GB | 512 -bit | 13.925293 ms | 24.453369 ms | 7.568375 us |
| 256 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 64 -bit | 10.928828 ms | 33.915237 ms | 967.625000 ns |
| 256 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 128 -bit | 8.854166 ms | 32.901272 ms | 976.500000 ns |
| 256 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 256 -bit | 10.290331 ms | 33.643110 ms | 1.030125 us |
| 256 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 512 -bit | 18.790299 ms | 33.722702 ms | 966.125000 ns |
| 256 MB | Intel(R) Iris(R) Xe MAX Graphics | 64 -bit | 11.941254 ms | 70.892191 ms | 1.326000 us |
| 256 MB | Intel(R) Iris(R) Xe MAX Graphics | 128 -bit | 110.007846 ms | 70.894857 ms | 1.339000 us |
| 256 MB | Intel(R) Iris(R) Xe MAX Graphics | 256 -bit | 245.655891 ms | 70.883748 ms | 1.391000 us |
| 256 MB | Intel(R) Iris(R) Xe MAX Graphics | 512 -bit | 475.246200 ms | 70.886621 ms | 1.332500 us |
| 1024 MB | Tesla V100-SXM2-16GB | 64 -bit | 11.715087 ms | 97.482910 ms | 9.765625 us |
| 1024 MB | Tesla V100-SXM2-16GB | 128 -bit | 12.184326 ms | 97.552734 ms | 8.300750 us |
| 1024 MB | Tesla V100-SXM2-16GB | 256 -bit | 18.732911 ms | 97.577148 ms | 7.812250 us |
| 1024 MB | Tesla V100-SXM2-16GB | 512 -bit | 52.898436 ms | 97.524170 ms | 8.056625 us |
| 1024 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 64 -bit | 35.084335 ms | 76.414034 ms | 903.625000 ns |
| 1024 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 128 -bit | 25.805447 ms | 79.800968 ms | 1.052625 us |
| 1024 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 256 -bit | 28.765474 ms | 80.076494 ms | 1.017875 us |
| 1024 MB | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz | 512 -bit | 61.698307 ms | 75.633475 ms | 1.107500 us |
| 1024 MB | Intel(R) Iris(R) Xe MAX Graphics | 64 -bit | 47.515533 ms | 283.342299 ms | 1.319500 us |
| 1024 MB | Intel(R) Iris(R) Xe MAX Graphics | 128 -bit | 431.324205 ms | 283.344665 ms | 1.306500 us |
| 1024 MB | Intel(R) Iris(R) Xe MAX Graphics | 256 -bit | 938.669251 ms | 283.345946 ms | 2.164500 us |
| 1024 MB | Intel(R) Iris(R) Xe MAX Graphics | 512 -bit | 1.843786 s | 283.342267 ms | 2.164500 us |
In the above table, the host <-> device data transfer cost is not something I'm interested in; instead I want you to notice how changing the SIMD lane count from 2 to 16 ( by doubling ) affects kernel execution time on a given accelerator with a given input size. For example, let us zoom into the rows where the input size is 256 MB: on the Tesla V100 GPU, as the number of SIMD lanes used is doubled ( i.e. the #-of chunks being compressed together by each SYCL work-item is doubled ), the kernel execution time roughly doubles. Now notice how kernel execution performs on the Intel Iris Xe MAX GPU for a 256 MB input, using both approach_{1, 2} ( with SIMD lanes {2, 4, 8, 16} ). With approach_1, kernel execution takes ~19 ms, but when 2 chunks are compressed together ( using approach_2 ), resources are better utilized, resulting in an improved kernel execution time of ~12 ms. But kernel execution becomes ~9x costlier as soon as 4 chunks are compressed together, because now each work-item compresses 4 chunks in parallel, while representing the hash state using sixteen 128 -bit wide vectors. Thus each work-item ends up using too many resources ( read register file entries for keeping the hash state ), which results in register spilling --- portions of the hash state are now placed in far distant and high latency global memory. Accessing global memory involves going through multiple memory hierarchies i.e. global memory and two levels of caches etc., each of which adds its own latency. Beyond compressing 4 chunks together, as I keep increasing the #-of chunks to compress together ( say 8/ 16 ), the execution time also keeps doubling. Notice, for a 1024 MB input size on the Intel Iris Xe MAX GPU, when 2 chunks are compressed together the kernel execution time is ~47 ms, but when 4 chunks are compressed together the execution time suffers heavily --- increasing ~9x ! This trend follows, as kernel execution time is doubled and quadrupled ( compared to compressing two chunks together ) when 8 and 16 chunks are compressed in parallel, respectively.
-
-
Now let us explore whether a similar kind of behaviour is visible when compressing multiple chunks together on the Intel CPU. Let me begin with a 64 MB input: when only a single chunk is compressed by each SYCL work-item ( read approach_1 ), kernel execution time is ~6 ms. As I increase the #-of chunks being compressed together, the execution time stays almost the same until we reach SIMD_LANE_COUNT = 16. Due to the limited number of 512 -bit registers, not many work-items can execute at a time, which is reflected when sixteen 512 -bit registers are used for representing the hash state, compressing 16 chunks in parallel. When I look at both the 256 MB & 1024 MB input sizes, I see a similar pattern, where compressing 4 chunks together performs better than compressing {2, 8, 16} chunks together. When 4 chunks are compressed together, sixteen 128 -bit vectors are used by each SYCL work-item for representing the hash state. Now there are 32 of these 512 -bit register files on each core of this machine, and sixteen 128 -bit vectors fit into four 512 -bit registers, which means on each core 8 SYCL work-items should be able to proceed in parallel without any register spilling.
-
- With these benchmark results in mind, I'll say, approach_1 of BLAKE3, where each SYCL work-item compresses single chunk performs much better ( almost always ) compared - to approach_2, where it's possible to compress 2/ 4/ 8/ 16 chunks in parallel, by each SYCL work-item. -
-
- For sake of reproducibility and future reference, I keep blake3's SYCL implementation here.
- Here I've presented benchmark results of only three accelerators, but you can find more of them for different input sizes here.
- You may be interested in BLAKE3 implementation approach_{1, 2}, which you can find in this file.
-
-
Note, this implementation might be helpful when hashing large input sizes, but whether it really pays off when the host -> device data transfer cost is much higher than the kernel execution time is an important question.
-
-
If you happen to be interested in using BLAKE3 for constructing a Merkle Tree, you may check out my other project. Binary Merklization using BLAKE3 is much easier, as it requires me to hash only a 64 -bytes input and produce a 32 -bytes digest, which is an intermediate node of the Merkle Tree. This kind of hashing is called 2-to-1 hashing, where two BLAKE3 digests are concatenated & hashed. You may notice, 2-to-1 hashing is simpler ( and cheaper ) because I need to compress only one chunk, which has only one block in itself. And that chunk is the root node of BLAKE3's internal Merkle Tree, meaning BLAKE3's internal Merklization requirement is not a requirement anymore. I need to make just a single call to the compress( ... ) function with proper flags to denote that this is the only node in BLAKE3's internal Merkle Tree --- compute-wise cheap ! So for generating all intermediate nodes of a Binary Merkle Tree when N -many ( read N is a power of 2 ) leaf nodes are provided ( read each leaf node is a BLAKE3 digest ), using the BLAKE3 hash, the above linked implementation can come in handy. Benchmark details of the Binary Merklization implementation using BLAKE3 for different input sizes on different accelerator platforms can be found here. While we're at it, it can be interesting to take a look at ( and compare ) the Binary Merklization implementation using the Rescue Prime hash, which I wrote a few weeks ago. It's clearly visible how performant BLAKE3 is compared to Rescue Prime, when the Binary Merklization benchmarks are compared !
-
- Have a great time ! -
-
-
Some time ago I started working on pub0sub - Fast, Light-weight, Ordered Pub/Sub System --- built on top of async I/O, leveraging the power of the kernel event loop.

The main idea behind it was to write a piece of software ( along with an SDK ) which can be used for publishing arbitrary length binary messages to N-many topics; subscribing to N-many topics --- listening for messages published on each of them; and last but not least, one powerful Pub/Sub Hub ( i.e. Router ) which easily solves C10K by leveraging the power of async I/O.
-
-
- The aforementioned problem statement is solved, which is why I decided to update problem statement.
- Now it looks like pub0sub - Distributed, Fast, Light-weight, Ordered Pub/Sub
- System --- solving
- C1M easily while leveraging power of kernel event loop & p2p networking.
-
-
By making pub0sub distributed, I get to handle 1M concurrent connections, where nodes form a mesh network for chatting about topic interest(s) & forwarding messages when needed --- a collaborative effort among peers. I chose to use libp2p for networking purposes, for being so modular --- enabling easy horizontal scalability, while taking care of stream multiplexing, security, peer-discovery etc.
-
-
- Here I propose primary design of system !
-
- Multiple pub0sub nodes can discover & connect to each other using DHT - ( distributed hash table ) - powered peer discovery mechanism, built right into libp2p - and eventually form a mesh network. If network has N participant(s), each participant is going to - maintain - connection with other N-1 peer(s), where N > 0. These participants of p2p network are going to chat - with each - other over bi-directional stream. Things nodes need to talk about 👇 -
| Operation | Interpretation |
| --- | --- |
| Topic subscription | Letting peers know of interest in some topics |
| Topic subscription ACK | Peer saying it has noted it down & will forward published messages if it sees any |
| Topic unsubscription | Announcing it is not interested in those topics anymore |
| Topic unsubscription ACK | Peer saying it has removed the entry & will no longer forward published messages |
| Published message forwarding | Passing a published message to an interested peer |
| Periodic heartbeat | Network health check |
- As each of aforementioned operations require to pass different message formats, I'm going to define - respective wire formats. But before I get into wire format, writing to/ reading from stream - I'd like to spend some time in going through high level overview of network operation. -
-
Say two nodes form a cluster --- one node has a topic_1 subscriber connected to it, while the other one has a publisher connected to it, willing to publish messages on topic_1. After the first node finds out it has one subscriber interested in messages from topic_1, it asks its peer 0hub node to inform it whenever that peer sees any message targeted at topic_1. The publisher then sends a publish intent to the network, which triggers an event saying the network has received some message on topic_1 for which the first node has an interested subscriber. The two nodes chat over the p2p network, resulting in message forwarding, which enables the first node to deliver the message published on topic_1 to its subscriber.
-
-
- When noticed carefully, network follows certain protocols
-
- Let's take another scenario. -
-
- Continuing previous scenario, after sometime subscriber doesn't anymore want to receive
- messages published on topic_1, so it sends unsubscription intent to network. As a result
- of it, respective 0hub node decides to broadcast same to network, because it found it doesn't have any other
- subscribers who're interested in messages of topic_1. All peers who kept record of this node
- being interested in topic_1, updates their respective interest table, ensuring when in future
- it receives message published on topic_1, it won't forward to first peer.
-
-
This way of showing interest in topics when a peer has some subscribers to feed, or announcing lack of interest when all subscribers of a certain topic unsubscribe, allows the network to pass published messages only when needed, eventually consuming less bandwidth. I call it Lazy Pushing.
-
- With more peers, network interaction may look like 👇 from high level, where 0hub nodes - form p2p mesh network, other participants are mere clients. -
-- Say, one subscriber shows interest in receiving messages from {topic_1, topic_2, topic_3} - but the 0hub node it's connected to doesn't have any publisher - of any of those topics. As soon as 0hub node learns it has subscriber - to feed messages of - {topic_1, topic_2, topic_3}, following protocol it announces that intent to other peers. - Each of other peers record it & as soon as they receive any message published on any of these topics - they forward those to respective peers. -
-- I'll now spend some time in specifying wire-format of messages exchanged between peers. -
-- Each message exchanged between peers over p2p layer, needs to have two parts -
-- Just by reading header part receiver must be able to understand two things -
-- First question can be answered by checking very first byte of message. Each operation - is denoted by unique opcode. There're 255 possible opcodes, though only 6 of them are in use as of - now. -
| Interpretation | Opcode |
| --- | --- |
| Heartbeat | 1 |
| Topic subscription | 2 |
| Topic subscription ACK | 3 |
| Topic unsubscription | 4 |
| Topic unsubscription ACK | 5 |
| Message Forward | 6 |
By reading the next 4 bytes of the header, the receiver understands how many more bytes it should read from the stream so that it can successfully deserialise the message, depending upon the opcode. Each opcode denotes a different message wire-format, resulting in the invocation of different deserialisation logic upon reception.
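A minimal sketch of that header read in Go ( illustrative; the byte order of the length field is my assumption, not something the proposal pins down ):

```go
import (
	"encoding/binary"
	"io"
)

// readHeader reads the fixed 5-byte header: a 1-byte opcode followed by a
// 4-byte body length, telling the receiver how many more bytes to consume.
func readHeader(r io.Reader) (opcode byte, bodyLen uint32, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return 0, 0, err
	}
	return hdr[0], binary.BigEndian.Uint32(hdr[1:]), nil
}
```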
-
-
- Above is a high level wire-format, which is applicable for each of messages. But I'd like
- to define how BODY of message is serialised/ deserialised for different opcodes. Starting
- with
- how it looks like when announcing interest in listening to some topics.
-
- Note, above image is nothing but magnification of message BODY when opcode ∈ {2, 4}.
- Requirement
- is peer needs to announce it wants to receive all messages published on topics, because it has
- some subscribers interested in those. Receiver side when reading from stream, knows how many
- bytes it needs to read from stream for completely consuming BODY.
-
-
It starts by reading the first 1-byte, which encodes how many of the next bytes it should read to form one meaningful topic name. Now it has either consumed all bytes of the BODY or some of them are left. If some are left, it'll again consume 1-byte, carrying the instruction for figuring out the next topic name. This way, it'll keep reading until it has exhausted all bytes of the BODY. By the end it must have successfully constructed a structured object in the respective environment, containing the topics some peer wants to get notified of.
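That decode loop, sketched in Go ( illustrative only; the real SDK types differ ):

```go
import "errors"

// readTopics decodes the BODY of a subscription/ unsubscription message
// ( opcode 2 or 4 ): a sequence of ( 1-byte length, topic name ) pairs,
// consumed until the BODY is exhausted.
func readTopics(body []byte) ([]string, error) {
	var topics []string
	for len(body) > 0 {
		l := int(body[0])
		if l > len(body)-1 {
			return nil, errors.New("malformed topic list")
		}
		topics = append(topics, string(body[1:1+l]))
		body = body[1+l:]
	}
	return topics, nil
}
```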
-
-
- Similar wire-format is followed for serialising BODY when announcing lack of interest in some
- topics.
-
-
Both of the aforementioned opcodes expect to hear back with ACK messages i.e. opcode ∈ {3, 5}, where the BODY can be encoded by putting a binary value denoting success/ failure. These are expected to be received within a stipulated time window after the interest ( opcode 2 ) / lack of interest ( opcode 4 ) message is sent to a peer. If not received, for opcode 2, the message will be resent up to N-times. If still not received, it results in connection termination with that peer for not following the protocol.
-
-
- But if peer is waiting for ACK of message with opcode 4, it doesn't resend, because of being low
- priority. Of course
- it might result into network wasting some bandwidth for passing some published message which could have been
- avoided. If any forwarded message
- from any topic to which peer is not interested in, is received even after lack of interest message
- was broadcast ( opcode 4 )
- it can be ignored by receiver. Receiving peer also sends another message to respective peer with
- opcode 4, stating it's
- not interested in these topics --- just like repeating self. This is done so that next time network can save some bandwidth.
-
-
- Finally I'll cover how to serialise/ deserialise forwarded message to/ from stream.
-
Start by reading the first byte of the BODY, which encodes how many topics this message is targeted to. A message can be targeted to 255 topics at max. The receiver now knows how many topics it should be reading from the stream. So it starts by reading the next 1 byte, encoding the first topic's byte length; it then knows how many of the next bytes to read for figuring out the first topic name. It has just read one topic name. Similarly it'll continue reading more topic names until all are read off. After N topic names are read, it'll read 4 bytes, encoding how many of the next bytes it needs to read for extracting the actual message content.
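A compact Go sketch of that deserialisation ( again illustrative; byte order assumed ):

```go
import (
	"encoding/binary"
	"errors"
)

// readForwarded decodes the BODY of a forwarded message ( opcode 6 ): a 1-byte
// topic count, ( 1-byte length, name ) per topic, then a 4-byte data length
// followed by the payload.
func readForwarded(body []byte) (topics []string, data []byte, err error) {
	if len(body) < 1 {
		return nil, nil, errors.New("short body")
	}
	count := int(body[0])
	body = body[1:]
	for i := 0; i < count; i++ {
		if len(body) < 1 || int(body[0]) > len(body)-1 {
			return nil, nil, errors.New("malformed topic name")
		}
		l := int(body[0])
		topics = append(topics, string(body[1:1+l]))
		body = body[1+l:]
	}
	if len(body) < 4 {
		return nil, nil, errors.New("short body")
	}
	dataLen := int(binary.BigEndian.Uint32(body[:4]))
	body = body[4:]
	if dataLen != len(body) {
		return nil, nil, errors.New("data length mismatch")
	}
	return topics, body, nil
}
```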
-
-
- Eventually it'll reach end, constructing structured data by consuming stream. This is how forwarded
- messages
- are recovered from stream by some peer who showed interest in getting notified when some message is
- published
- on topics of interest. After getting structured data, receipient 0hub
- node can
- send message to subscribers connected to it directly, interested in any of topics this message is
- published on.
-
-
- Reader may notice, a slight difference in encoding variable number topic list, between
- previous two diagrams.
- When encoding to be forwarded message ( opcode 6 ), peer encodes topic count in first 1-byte of
- BODY part of message.
- This is required, otherwise during deserialisation receiver won't be able to understand where in
- stream it should stop
- reading topic names & start reading 4-byte lengthy actual message content's length field.
-
-
- But same is not required for message sent with opcode ∈ {2, 4}, because there's nothing more to read
- after topic name list
- and receiver already knows length of BODY part of message, so it knows how long to read from
- stream.
-
- Let's go through one example
-
-
Say a 0hub peer wants to announce its interest in messages published on topic_1, topic_2. The serialised message for this operation looks like
-
| Message Part | Field Name | Field Byte Length | Field Value |
| --- | --- | --- | --- |
| Header | Opcode | 1 | 2 |
| Header | Body Length | 4 | 16 |
| Body | Topic-1 Length | 1 | 7 |
| Body | Topic-1 Name | 7 | topic_1 |
| Body | Topic-2 Length | 1 | 7 |
| Body | Topic-2 Name | 7 | topic_2 |
21 bytes of data are to be sent to each peer, resulting in (N-1) * 21 bytes of data broadcast in total, where N > 0 & N is the #-of participants in the mesh network.
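The serialiser producing those 21 bytes could be sketched like this ( illustrative Go; the actual SDK may differ, and the byte order of the length field is assumed ):

```go
import "encoding/binary"

// encodeTopicIntent serialises a subscription ( opcode 2 ) or unsubscription
// ( opcode 4 ) message for the given topics, matching the example above.
func encodeTopicIntent(opcode byte, topics []string) []byte {
	var body []byte
	for _, t := range topics {
		body = append(body, byte(len(t)))
		body = append(body, t...)
	}

	var lenField [4]byte
	binary.BigEndian.PutUint32(lenField[:], uint32(len(body)))

	msg := append([]byte{opcode}, lenField[:]...)
	return append(msg, body...)
}

// encodeTopicIntent(2, []string{"topic_1", "topic_2"}) yields 5 + 16 = 21 bytes.
```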
-
-
- Similarly by following aforementioned example, message of lack of interest to topics ( opcode 4 )
- can be published on network.
-
-
- Finally I'll go through one last example showing serialisation of to be forwarded message i.e.
- opcode 6.
- Assuming this message is published on topic_1, topic_2 & content of message is
- hello.
-
| Message Part | Field Name | Field Byte Length | Field Value |
| --- | --- | --- | --- |
| Header | Opcode | 1 | 6 |
| Header | Body Length | 4 | 26 |
| Body | Topic Count | 1 | 2 |
| Body | Topic-1 Length | 1 | 7 |
| Body | Topic-1 Name | 7 | topic_1 |
| Body | Topic-2 Length | 1 | 7 |
| Body | Topic-2 Name | 7 | topic_2 |
| Body | Data Length | 4 | 5 |
| Body | Data | 5 | hello |
This results in sending 31 bytes of data to each of those peers which showed interest in topic_1, topic_2 --- not to all N-1 remaining participants of the mesh network. Lazy Pushing at work.
-
- Peers need to periodically send heartbeat messages for checking health of long-lived network
- connections
- to other peers. Opcode 1 is reserved for this purpose, where BODY of message ∈ {ping, pong}.
-
-
- If reader has covered whole proposal, they probably understand this is by no means a final version
- of design.
- Improvements like not forming strongly connected mesh helps in reducing huge bandwidth cost --- can
- be taken into consideration to further
- enhance protocol. Message authentication can be added so that peers only accept connection request
- from other peers
- who are trusted, when such setup is desired.
-
-
- Existing pub0sub implementation is here.
-
- Your feedback will be invaluable. Have a great time ! -
-
A few weeks back I started working on pub0sub - Fast, Light-weight, Ordered Pub/Sub System, leveraging the power of the kernel event loop, addressing C10K while running on a consumer grade machine.

pub0sub can easily handle > 10k concurrent connections even on a consumer grade machine, because it doesn't follow the conventional way of writing TCP servers in Go. Generally, one go-routine accepts TCP connections & spawns a new go-routine for handling each connection throughout its lifetime. This way, if the objective is to handle > 10k concurrent connections, there are > 10k go-routines. The Go scheduler needs to perform expensive context switching for running go-routines on the underlying OS threads. For > 10k go-routines the cost of context switching is pretty high, while no useful work gets accomplished. Also the stack memory requirement for > 10k go-routines is not negligible.
-
-
Avoiding the aforementioned path helps in discovering another potential way, where I can ask the kernel event loop to watch file descriptors of interest & only inform me when some action needs to be taken. At any time either of two completion events can happen on a socket { READ, WRITE } --- pre-scheduled reading or writing on the socket has been completed, giving an opportunity to act on it & schedule the next operation. There are no more > 10k go-routines, rather only 2 go-routines --- one used for listening & accepting TCP connections; another for watching & responding to I/O events.
-
-
- It'll be perfectly okay to
- add more watcher go-routines ( static, done at system startup phase ), each managing its own kernel event loop and watching some
- delegated sockets. But in that case newly accepted connections
- need to be fairly distributed among all event watching loops otherwise some of them becomes hotspot, resulting into
- performance degradation. Some watcher does more socket watching, some does less. Even it's possible some topics
- in pub/sub are popular and all subscribers interested in those topics needs to be distributed across
- available watchers. For orchestrating N-sockets on available M-watchers, where N >>> M, I need to keep additional
- state information in listener go-routine. It can also be further explored whether dynamic watcher adding/ removing in runtime as per system state
- brings any improvement or not.
-
-
- Back to current implementation, as a result, lesser time spent in context switching more time spent doing actual work. As soon as new connection
- is accepted, it's delegated to watcher. On the other hand watcher waits for I/O completion events &
- as soon as some of them are available to act on; it starts looping over them one-by-one ---
- processing each & scheduling next action on socket.
-
This model of writing TCP servers is performant, but brings in some complexities. Previously I could manage each connection's whole life-cycle in its own go-routine --- separation of concern was well respected. As a result the implementation was easier to reason about.
-
-
- Say in first model, server has two clients --- each being managed in its own go-routine. Pub/Sub Hub is waiting to
- read message from respective sockets, where each message has two parts
-
| Part | Size ( in bytes ) | Purpose |
| --- | --- | --- |
| Envelope | 5 | Keeps OPCODE, BodyLength |
| Body | N | Keeps BodyLength-bytes actual data |
For one client, its go-routine may be reading the envelope, while for another one the body is being read after envelope reading is done. I'd like to highlight that messages are separated into two parts because it helps in determining the length of the variable sized body, where the envelope length is fixed at 5-bytes.
-
- But attaining same behaviour when delegating reading from/ writing to sockets to watcher is little more involved.
-
-
- For reading message envelope, request is issued; watcher informs when envelope is read. Then envelope is deserialised to
- figure out how many more bytes to read from socket for consuming message body. N-bytes body reading is
- again delegated to watcher, which informs as soon as it's done. Now if there're M-clients connected at this moment
- each of them may be reading any possible part of message. There could be also different kinds of messages --- where OPCODE is
- encoded in envelope along with body length. Reading handler function needs to remember where it left off
- last time & what exactly it was doing then, so that it can keep processing later part of message.
- This calls for additional state keeping --- resulting into more memory allocation than first model.
-
-
- It's like when first time envelope is read, reading handler function understands what's intention of client
- and how many more bytes it needs to read from socket to construct message. It puts an entry in ongoing reading
- table, indexed by socket, along with OPCODE so that appropriate deserialisation
- handler can be invoked when body reading will be completed; issues N-bytes body reading request; moves on to next event processing step.
- After sometime when body reading is done on this socket, watcher informs, socket is looked up in ongoing reading table
- to understand what has happened till now & what to do next with body. And finally intended action is taken
- on received message and socket entry from ongoing reading table is removed.
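A minimal sketch of what such an entry in the ongoing reading table might hold ( field names are illustrative, not the actual pub0sub types ):

```go
// readState remembers, per socket, where the reader left off between
// asynchronous completion events.
type readState struct {
	opcode    byte   // opcode announced by the already-read envelope
	remaining uint32 // body bytes still expected on this socket
}

// ongoing is the "ongoing reading table", keyed by the connection's file
// descriptor, consulted whenever the watcher reports a completed read.
var ongoing = make(map[int]*readState)
```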
-
- In pub0sub there're two kinds of clients i.e. { publisher, subscriber }. Each - of them interact with Pub/Sub server i.e. 0hub with different intention - resulting into different message format. Opcode helps read handler understand how to deserialise - message body or what kind of actions to take on deserialised, structured message - & how to eventually respond back to client. -
-
- With all these pieces Pub/Sub Hub implementation 0hub shows quite
- promising performance. I've tested it with 16k concurrent connections on consumer grade machine. I believe
- if it's tested in containerised environment where virtual overlay networking can be easily used
- and more ports ( = more clients ) in total are available, 0hub
- will break its own record. Some other day I'd like to run that experiment.
-
-
Recently I started noticing issues with long lived TCP connections --- resulting in abnormal connection termination. I suspect this is because long lived TCP connections might stay idle for a long time if publishers are not publishing often or subscribers have subscribed to topics that receive infrequent updates. To address this situation I plan to add periodic heartbeat message passing between client & hub. Heartbeat messages will be 5-bytes --- only an envelope, no body. Having no body means the last 4-bytes of the message envelope will be holding 0, and only the first byte contains the opcode --- 10 for ping, 11 for pong. For maintaining backward compatibility the envelope size can't be changed, which is why I'm wasting 4 bytes in the envelope by only storing 0.
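In other words ( illustrative Go, matching the layout just described ):

```go
// Heartbeat messages are envelope-only: the opcode in the first byte
// ( 10 = PING, 11 = PONG ) and a zeroed 4-byte body-length field.
var (
	pingFrame = []byte{10, 0, 0, 0, 0}
	pongFrame = []byte{11, 0, 0, 0, 0}
)
```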
-
Every 30 seconds 0hub sends a PING ( opcode = 10 ) message to all connected publishers & subscribers and expects to hear back with a PONG ( opcode = 11 ) message. For all those who responded back, their next health check is scheduled at t+30. Those who didn't respond back will be pinged again up to 3 times at max, each after a 30 second delay. If they still don't respond back, the hub terminates the connection with them while cleaning up all resources associated with the respective client.
-
-
- This will hopefully help hub in maintaining connection & related resources only for healthy & active clients, while enabling it
- in estimating how many subscribers to receive published message on a topic with better precision.
- This estimation calculation will pose a challenge during implementing distributed version of pub0sub --- which I'll
- face very soon.
-
-
- These PING/ PONG messages are simply an overhead, though unavoidable, consuming bandwidth overtime. But I can probably
- reduce #-of health checking done. When hub and client has recently communicated for sake of their
- usual business procedure, it's quite evident connection is active --- health check can be avoided. Idea is to only do
- health check when hub hasn't heard from client for some time.
-
-
- If some active publisher sends message publish intent every < 30s, it can avoid explicit health check cost. On other hand subscribers
- listening to active topic i.e. frequent update receiving topic, can avoid
- health check because connection issues will be caught when attempting to push update. That's why I call
- health check LAZY.
- As each health check message & response wastes 4-bytes for sake of backward compatibility, it's better
- to keep its usage as low as possible.
-
-
- Another way I'm looking at --- it's possible to send PING ( opcode = 10 ) message from hub to client
- of only 1-byte length i.e. OPCODE part of whole message, but when client responds back with PONG ( opcode = 11 )
- then need to send only envelope i.e. 5-bytes, as proposed 👆. This way communication pattern becomes
- somewhat asymmetric, but helps in saving 4-bytes, resulting into health check round-trip with 6-bytes
- instead of previous 10-bytes.
-
-
- Current version of pub0sub is here.
- I'd love to get feedback and have a great time !
-
For the last few months I've been working at the TCP level more often than I generally do. During this period I designed and implemented a few systems where multiple participants talk to each other over TCP while following a custom application level protocol. I learned that most TCP applications written in ( specifically ) Golang can be done in a slightly different way, so that applications don't end up spawning one go-routine per accepted connection --- resulting in thousands of active go-routines when talking to thousands of concurrent peers. Rather than handling each peer in its own go-routine, proactively attempting to read from the socket & spending most of its time in blocked mode, keeping only one socket watcher go-routine which is responsible for reporting any READ/ WRITE completion event happening on any of the delegated sockets --- consumes way less resources. It excels at reducing the scope of context switching by bringing the possible go-routine count to a minimum. As a result of it, the Golang scheduler only needs to manage a few go-routines now, where previously the scheduler had to orchestrate thousands of go-routines on N system threads. I ran some experiments and the result was promising --- TCP servers were able to easily handle 100k concurrent connections when following the second approach.
-- Following 3 different approaches, I develop key-value database where clients can send read/ write requests - over TCP. I challenge each implementation with 100k concurrent connections - and collect statistics of their performance, resource consumption, execution trace etc. under load; - all running on consumer-grade machines in containerised environment i.e. Docker. -
The application I develop is quite simple but it captures the essence of a TCP application. It's a remote ( not necessarily geographically ) in-memory KV database, to which clients connect over TCP & maintain that connection throughout their life time. During their life time they perform either of two possible operations in a randomised manner.

1. READ : look up the VALUE associated with some KEY.
2. WRITE : associate a VALUE with some KEY.

In both of the cases clients expect to hear back from the server. In the response frame the VALUE associated with the KEY is returned. For a WRITE request, the VALUE in the response frame must be equal to what was sent in the request frame. On the server side all reading/ writing is done in a concurrent safe manner --- by acquiring mutex locks. Only for a write request is the exclusive r/w lock held i.e. the critical section of code; otherwise a normal read-only lock is held --- allowing fulfilment of multiple READ requests concurrently.
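On the server side that locking discipline boils down to something like this ( a minimal Go sketch, not the benchmarked implementation ):

```go
import "sync"

// store is the in-memory KV database: WRITE requests take the exclusive lock
// ( the critical section ), READ requests share the read lock, so many READs
// can be served concurrently.
type store struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func newStore() *store {
	return &store{data: make(map[string][]byte)}
}

func (s *store) read(key string) []byte {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.data[key]
}

func (s *store) write(key string, value []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}
```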
-
- For performing desired operations, clients send structured data frames
- over TCP; server extracts that out from socket; performs action as specified
- in message envelope i.e. opcode field; responds back with response frame.
-
-
Each message sent over the wire has two parts, where the envelope carries the operation kind i.e. {READ, WRITE} & how many more bytes the server needs to read from the stream to construct a structured message. Clients always expect to receive only one kind of frame in response.
-
For a READ frame, sent when a client is interested in looking up the VALUE associated with a KEY, the body just holds the key, preceded by the key length in a 1 byte field. Notice, the body length field in the envelope is 2 bytes, allowing at max 65535 bytes of body, but only 256 bytes can actually be written in the body due to the key length field in the body being 1 byte. This is done intentionally for keeping the illustration simple.

Practically, the max READ frame size over the wire will be
-
| Field | Max Theoretical Size ( in bytes ) | Max Practical Size ( in bytes ) |
| --- | --- | --- |
| Envelope | 3 | 3 |
| Body | 65535 | 256 |
| Total | 65538 | 259 |
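For concreteness, building such a READ frame on the client side might look like this ( illustrative Go; the opcode value and byte order are assumptions ):

```go
import "encoding/binary"

// opRead is an illustrative opcode value; the actual constant lives in the
// linked implementation.
const opRead = 1

// encodeReadFrame builds a READ request: a 3-byte envelope ( 1-byte opcode,
// 2-byte body length ) followed by a body of 1-byte key length plus the key.
func encodeReadFrame(key string) []byte {
	body := append([]byte{byte(len(key))}, key...)

	frame := make([]byte, 3, 3+len(body))
	frame[0] = opRead
	binary.BigEndian.PutUint16(frame[1:3], uint16(len(body)))
	return append(frame, body...)
}
```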
- WRITE frame carries little more data, which is sent when client is interested
- in associating VALUE with some KEY, because it carries both key, value & each of them are preceded
- with respective length in 1 byte field. Same scene here, practically WRITE frame's body will be at max 512 bytes
- though it's allowed to be at max 65535 bytes theoretically, as written in body length field in stream.
-
-
- Limits WRITE frame size will be
-
| Field | Max Theoretical Size ( in bytes ) | Max Practical Size ( in bytes ) |
| --- | --- | --- |
| Envelope | 3 | 3 |
| Body | 65535 | 512 |
| Total | 65538 | 515 |
- In response of READ/ WRITE request client expects to receive one RESPONSE frame, where VALUE - associated with KEY is encoded, where length of VALUE precedes it, encoded 1 byte --- signaling - client how many more bytes to read from stream to construct response. Good thing about - response frame, it doesn't waste any space, just allows sending 255 bytes VALUE at max. -
| Field | Max Theoretical Size ( in bytes ) | Max Practical Size ( in bytes ) |
| --- | --- | --- |
| Envelope | 2 | 2 |
| Body | 255 | 255 |
| Total | 257 | 257 |
- Now I'd like to spend some time in specifying how each of 3 approaches work.
- For ease of addressing, I'll refer to them from now on as {v1 => 1, v2 => 2, v3 => 3}.
-
-
- Model v1 is the popular way of writing TCP servers in Go, where one listener go-routine
- keeps listening on a host:port; it accepts connections & spawns a new go-routine for handling
- each connection throughout its life time. This model respects separation of concerns well & operations
- happening on a socket are easier to reason about due to the clean structure. But one thing to notice: each go-routine
- kept alive for handling a concurrent connection spends a lot of its time in blocked state --- proactively waiting to read
- from its socket.
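-
- A minimal sketch of that skeleton is below --- the listener accepts, one handler go-routine per connection;
- the address is illustrative & writing the RESPONSE frame back is elided.
-

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:7000") // illustrative host:port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := lis.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		// one go-routine per accepted connection, alive for its whole life time
		go handleConnection(conn)
	}
}

func handleConnection(conn net.Conn) {
	defer conn.Close()
	env := make([]byte, 3) // envelope: opcode + 2-byte body length
	for {
		// blocking reads --- this is where the go-routine spends most of its
		// time, waiting for the peer's next frame
		if _, err := io.ReadFull(conn, env); err != nil {
			return
		}
		body := make([]byte, binary.BigEndian.Uint16(env[1:]))
		if _, err := io.ReadFull(conn, body); err != nil {
			return
		}
		// act on the KV store & write a RESPONSE frame back ( elided )
	}
}
```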
-
- Model v2 is slightly different from v1: rather than spawning
- one go-routine per accepted connection, all accepted connections are delegated to
- one watcher go-routine, which runs one kernel event loop and learns about READ/ WRITE
- completion events on the sockets being watched. Every now and then the event loop informs the
- watcher go-routine of READ/ WRITE completion events, providing it with the opportunity to
- take action on the accomplished task and schedule the next operation on the socket asynchronously.
-
-
- This mode of operation has some similarity with libuv --- which powers NodeJS's event loop.
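-
- A rough, Linux-only sketch of that shape is below, driving epoll directly via golang.org/x/sys/unix.
- It's only meant to illustrate the single-watcher structure --- the actual implementation may well sit on a
- cross-platform event-loop library ( kqueue would be needed on MacOS ), and frame parsing is elided.
-

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// raw listening socket, so accepted fds can be handed straight to epoll
	lfd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatal(err)
	}
	addr := &unix.SockaddrInet4{Port: 7000, Addr: [4]byte{127, 0, 0, 1}}
	if err := unix.Bind(lfd, addr); err != nil {
		log.Fatal(err)
	}
	if err := unix.Listen(lfd, 1024); err != nil {
		log.Fatal(err)
	}

	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		log.Fatal(err)
	}

	// listener go-routine: accept & delegate every socket to the one watcher
	go func() {
		for {
			cfd, _, err := unix.Accept(lfd)
			if err != nil {
				continue
			}
			ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(cfd)}
			unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, cfd, &ev)
		}
	}()

	// watcher: one kernel event loop for all delegated sockets
	events := make([]unix.EpollEvent, 128)
	buf := make([]byte, 64*1024)
	for {
		n, err := unix.EpollWait(epfd, events, -1)
		if err != nil {
			continue // e.g. EINTR
		}
		for i := 0; i < n; i++ {
			fd := int(events[i].Fd)
			m, err := unix.Read(fd, buf)
			if err != nil || m == 0 {
				unix.EpollCtl(epfd, unix.EPOLL_CTL_DEL, fd, nil)
				unix.Close(fd)
				continue
			}
			// parse buf[:m] as a frame, act on the store,
			// schedule the RESPONSE write ( elided )
		}
	}
}
```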
-
- I'd call model v3 a generic version of model v2, where N watcher go-routines
- run N kernel event loops and each accepted connection is delegated to one of these
- available watchers for the rest of its life time. Whenever a READ/ WRITE completion event occurs on
- some socket, the event loop notifies the respective watcher go-routine, which invokes the handle{READ,WRITE} method
- to take action on the completed event and schedule the next operation on the socket, to be completed asynchronously.
-
-
- Using this model calls for a socket orchestration technique --- connections should be fairly
- distributed among all available watcher go-routines. The goal of orchestration is to avoid creating hot-spots i.e.
- some watcher go-routine managing lots of sockets while another manages only a few, which defeats the whole purpose
- of model v3. One naive orchestration technique is modular arithmetic, where the
- M-th accepted connection is delegated to the ( M % N )-th watcher go-routine, where M > 0, N > 0, N = #-of watcher go-routines.
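-
- In Go that naive scheme is a one-liner ( a channel per watcher is just one illustrative way of delegating ):
-

```go
package orchestrator

import "net"

// delegate hands the m-th accepted connection to the ( m % n )-th watcher
// go-routine, each of which receives its sockets over its own channel.
func delegate(watchers []chan net.Conn, m int, conn net.Conn) {
	watchers[m%len(watchers)] <- conn
}
```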
-
-
- One problem I see with this scheme: assuming peer connections are generally long-lived,
- some watcher might end up managing all the long-lived peers, while some other watcher go-routine,
- which happened to receive the sockets that were unfortunately not long-lived, will manage only a few --- creating an
- imbalance in socket watching delegation i.e. a hotspot, resulting in bad performance.
- What I think can be done: rather than blindly orchestrating sockets using the naive round-robin technique,
- it's better to keep a feedback loop from the watcher go-routines, so that they can inform the
- listener go-routine of their current status i.e. how many delegated sockets are they managing
- now ? how many of them are active in terms of READ/ WRITE operation frequency --- a rolling average
- over a finite timespan ? etc., allowing the listener go-routine to make a more informed decision before it
- delegates an accepted connection to some watcher. This brings in management flexibility.
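-
- A hedged sketch of such feedback-driven delegation ( here simply "pick the least-loaded watcher";
- the Watcher type & its load metric are illustrative, not the actual implementation ):
-

```go
package orchestrator

import (
	"net"
	"sync/atomic"
)

// Watcher is an illustrative handle the listener keeps per watcher go-routine.
// Load is the feedback signal --- e.g. sockets currently managed, or a rolling
// average of READ/ WRITE activity --- updated by the watcher itself.
type Watcher struct {
	Conns chan net.Conn
	Load  int64
}

// pickWatcher replaces blind round-robin: delegate to the least-loaded watcher.
func pickWatcher(watchers []*Watcher) *Watcher {
	best := watchers[0]
	for _, w := range watchers[1:] {
		if atomic.LoadInt64(&w.Load) < atomic.LoadInt64(&best.Load) {
			best = w
		}
	}
	return best
}

func delegateLeastLoaded(watchers []*Watcher, conn net.Conn) {
	w := pickWatcher(watchers)
	atomic.AddInt64(&w.Load, 1) // the watcher decrements this when the socket closes
	w.Conns <- conn
}
```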
-
- It's time to run these models on a real machine and collect statistics. I've prepared parallel
- benchmarking testcases, where in each round of benchmarking one client connects to the TCP server
- and sends two frames in order. The first frame is a read request for some KEY; the client waits for the response and consumes
- it ( if some other client has already set a VALUE for that KEY ); then it sends a write request
- with a KEY, VALUE pair and waits for the response, expecting to see the VALUE in the response matching what it sent
- in the write request. Each benchmark is performed 8 times, to get average statistics.
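-
- Roughly, such a testcase can be shaped like the sketch below ( the frame exchange itself is elided and
- the address is illustrative ); repeating each benchmark 8 times is then a matter of go test -bench . -count 8.
-

```go
package bench

import (
	"net"
	"testing"
)

// BenchmarkReadWrite mirrors one benchmarking round: connect, send a READ
// frame for some KEY, consume the response, then send a WRITE frame and
// verify the echoed VALUE.
func BenchmarkReadWrite(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		conn, err := net.Dial("tcp", "127.0.0.1:7000")
		if err != nil {
			b.Error(err)
			return
		}
		defer conn.Close()
		for pb.Next() {
			// 1. READ request for KEY, wait for & consume the response ( elided )
			// 2. WRITE request with KEY, VALUE, check the echoed VALUE ( elided )
		}
	})
}
```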
-
-
- I do a parallel benchmark of model v1 on two machines running GNU/Linux & MacOS,
- where each round takes ~34k ns on GNU/Linux, but it's relatively on the higher side
- when run on MacOS --- ~45k ns.
-
-
- For model v2, MacOS takes less time for each round than it took in model v1. But that's not
- true for GNU/Linux --- rather the round time almost doubled.
-
-
- In case of model v3, GNU/Linux and MacOS both keep their trends
- intact --- the average benchmark round completion time keeps increasing on the former
- and keeps decreasing on the latter.
-
-
- Now I plan to stress test the 3 models on both the GNU/Linux & MacOS platforms with 8k
- concurrent connections, where each client connects to the TCP server, sends
- read & write requests in order, and waits for the respective response
- in both cases.
-
-
- When model v1 is stress tested, it completes a lot faster on GNU/Linux, given
- that machine enjoys the benefit of a faster CPU.
-
-
- With 8k concurrent connections model v2 takes almost the same time to complete
- on both the GNU/Linux & MacOS platforms.
-
-
- The time required for completing stress testing with model v3
- is almost unchanged on MacOS, but on GNU/Linux it increases slightly.
-
-
- Go's trace tool is helpful in getting deeper insight into what happened
- while the program was running. So I collect program execution traces when running the
- test cases. These are collected on a MacOS machine with an Intel i5 CPU @ 2.4GHz.
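-
- For reference, a trace can be captured either by passing -trace trace.out to go test, or
- programmatically as in the little sketch below; either way, go tool trace trace.out opens the result.
-

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// everything between Start & Stop lands in the trace, which can then be
	// inspected with go tool trace ( go-routine states, scheduler wait etc. )
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	time.Sleep(100 * time.Millisecond) // stand-in for the actual workload
}
```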
-
-
- While looking at how the major go-routines spent their time in model v1,
- I found that the listener go-routine, which accepts connections and spawns a new go-routine
- for handling each one, spends a major portion of its time in blocked state --- which is understandable
- because it's waiting for new connections to arrive.
-
-
- If I now look at how the spawned connection handler go-routines spent their time,
- I see they have also spent most of their time waiting for network IO. This also
- makes sense given the fact that, in model v1, each connection is handled in its
- own go-routine, resulting in each of those go-routines proactively waiting
- to read from its socket --- waiting for a network IO event.
-
-
- Next I look at model v2's execution trace. It has two major go-routines i.e. {listener, watcher}.
- The listener does the same job in all 3 models --- wait for an incoming connection; accept it; prepare the connection
- handling phase ( different in each model ); keep looping --- which is why network IO based blocking is evident in its trace.
-
-
- When I look at model v2's watcher go-routine trace, it doesn't spend any time waiting
- for network IO --- it makes a blocking call where it waits for the accumulation of a few READ/ WRITE
- completion events from the underlying kernel event loop; as soon as that returns, it starts looping over them and takes the necessary actions.
- This single function is the equivalent of what the N-many handleConnection go-routines do in model v1.
- When the scheduler wait column is checked in each of these traces, it's easy to see that every go-routine
- spawned needs to be scheduled onto underlying OS threads to actually run, and scheduling is not cheap
- when there are 100k go-routines.
-
-
- At last I take a look at the trace of model v3, where I run 4 watcher go-routines,
- each managing a subset of the accepted connections. The listener go-routine's trace is similar
- to what I found in the other models.
-
-
- The downside of having more go-routines is scheduling cost --- here I run 4 watcher go-routines
- with 4 different kernel event loops, with subsets of sockets delegated to them, resulting
- in more time spent in the scheduler wait stage. Also notice that though there are 4 watcher go-routines
- ready to do their job, not all are being used. That's because during the test
- in which the trace is collected, only one connection request is sent from the client side, resulting
- in only one socket being managed by one of the available watcher go-routines.
-
-
- Finally it's time for the 100k concurrent connection challenge.
-
-
- The problem I face is how to run 100k clients on my machine. A
- network port identifier is 16 bits, which allows me to run 65536 ( 1 << 16 )
- clients at max. Leaving out the lower 1024 port numbers, I still need ~40k more clients. It's all happening
- because I've only one IP address i.e. 127.0.0.1. I can make use of some virtual
- networking technique, where I get a different subnet and multiple virtual machines are
- allocated IP addresses from that subnet. Each of those virtual machines runs one copy of the client
- application; to be more specific, each of them runs N ( < 65536 ) clients. This way
- I can easily get to the 100k client target.
-
-
- I choose Docker for its virtual networking capability, where each client container runs 16k clients;
- only 6 such containers, together hitting one server container i.e. {v1, v2, v3}_server, are
- enough to simulate a ~100k concurrent connection scenario.
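-
- Each client container then just runs a simulator along the lines of the sketch below ( the count, port
- & container name are illustrative; the per-connection READ/ WRITE loop is elided ):
-

```go
package main

import (
	"log"
	"net"
	"sync"
)

const clientsPerContainer = 16_000

func main() {
	var wg sync.WaitGroup
	for i := 0; i < clientsPerContainer; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// the server container is reachable by its name on the Docker network
			conn, err := net.Dial("tcp", "v1_server:7000")
			if err != nil {
				log.Printf("dial: %v", err)
				return
			}
			defer conn.Close()
			// keep the connection alive & randomly interleave READ/ WRITE
			// requests, consuming each RESPONSE frame ( elided )
		}()
	}
	wg.Wait()
}
```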
-
-
- I start with model v1 --- in total 7 containers to run, one for the server, the others for clients.
- I see CPU usage of ~60%, which suddenly moves to ~100%. The memory usage is due to the high number of key value lookups
- happening concurrently.
-
- Similarly models v2 and v3 are simulated, where one TCP server manages
- ~100k concurrent connections, each client attempting to randomly read/ write some
- randomly generated key, with the respective value returned back to them in response.
-
-
- I notice, for model v1, the PID count of the v1_server container i.e. the TCP server
- is 33, denoting 33 OS threads created in this containerised environment --- handling
- 100k active go-routines requires lots of underlying OS threads, a sign of context
- switching. Now I look at the same field for models v{2, 3}: they require ~11 OS threads
- for serving 100k concurrent connections --- seemingly saving some context switching cost.
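-
- The same numbers can also be watched from inside the server process --- a small, hedged sketch follows
- ( the threadcreate profile counts threads ever created, so it's an upper bound on live OS threads ):
-

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	for {
		// OS threads created so far vs currently live go-routines
		fmt.Printf("os threads created: %d, go-routines: %d\n",
			pprof.Lookup("threadcreate").Count(),
			runtime.NumGoroutine())
		time.Sleep(5 * time.Second)
	}
}
```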
-
- I note, each model is capable of handling 100k concurrent connections in the simulated
- environment. Each of these models has its own benefits & downsides: with model v1
- the program structure is easier to understand, and it's the natural & familiar way of
- writing TCP applications in Golang; with model v2, the cost of context switching
- can be reduced by drastically cutting the #-of active go-routines, but it's no silver bullet. On the other hand model v3, which is a generic
- version of model v2, is able to leverage the power of more than one event loop, each managing a subset of the accepted
- connections --- a sharded architecture, resulting in less mutex lock contention, given the orchestration
- technique fits well.
-
-
- For almost all standard TCP applications, model v1 is a good fit; model v2 or model v3 ( with a better orchestrator )
- can be used when extreme performance is required, while paying a relatively lower cost.
-
-
- I keep the implementations powering these findings in this repository
- for future reference.
-
- I plan to put these models through the C1M challenge ( i.e. managing 1M concurrent connections ) ---
- some other day I'll talk about it. Have a great time !
-