
Track Performance Tuning for v1.0


Performance test design (TODO: add a diagram)

  • Latency orders are sent and measured while background orders are also being sent, e.g. 2 latency orders per second on top of 200 background orders per second.
  • For latency orders, the matching app records the start/end time (System.nanoTime) of each phase.
  • A FIX test tool (my own code) is used to send orders (both latency orders and background orders).
  • The web monitor shows the current test status (aka test summary), e.g. start time, end time (current time), total orders (both background and latency), latency order count, and the latency data and chart for the most recent 200 latency orders. Angular is used to show the data; c3chart is used for the chart.
  • Latency statistics are calculated in Python with the pandas module; just the basic DataFrame.describe (with multiple percentiles) is used.
  • GC stop time is also parsed on every web monitor refresh (test summary), in Java code (see the sketch after this list). TODO: use Python rather than Java.
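A minimal sketch of what the GC stop-time parsing could look like, assuming the GC log format shown later on this page (the "Total time for which application threads were stopped" lines); the class and method names here are illustrative, not the actual project code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser: sums the "application threads were stopped" seconds from log/GC.txt.
public class GcStoppedTimeParser {

    private static final Pattern STOPPED_LINE = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");

    public static double totalStoppedSeconds(String gcLogPath) throws IOException {
        double total = 0.0;
        for (String line : Files.readAllLines(Paths.get(gcLogPath))) {
            Matcher m = STOPPED_LINE.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("total stopped seconds: " + totalStoppedSeconds("log/GC.txt"));
    }
}
```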

note: jmeter is NOT used anymore, because the web interface is too slow for sending trades (TODO: why so slow?). The FIX test tool is used instead (including related scripts - run_multi_scenarios.sh).

Findings

  1. match_us is very stable under any load (see the Dec 26, 2017 results and later)

  2. There is not much difference between Disruptor_YieldingWaitStrategy and Disruptor_BusySpinWaitStrategy. The Disruptor documentation says YieldingWaitStrategy is for low latency and BusySpinWaitStrategy is for ultra low latency. We can just use BusySpinWaitStrategy for performance tuning, since they use similar resources (mainly CPU). (See the Dec 26, 2017 results.)

  3. Tried several improvements in different areas and gained much better performance. Good performance at 5000 orders/second:

  • CPU - save CPU by avoiding toString while logging: logger.debug("xyz:{}", obj) rather than logger.debug("xyz:{}", obj.toString()) (see the logging sketch after this list).
  • Disk IO - the FIX client test tool log level is set to warn. The tool runs on the same box as the engine, and I want to minimize its impact on the engine. FIX storage on the client tool was already disabled previously; it is not required for our test, since we are verifying the performance of the matching engine.
  • Network IO - slow FIX client consumption caused the sending side (engine FIX) to block. Multi-threaded clients are now used.
  • Memory - stop saving latency orders to the OMS. The OMS is not key in this phase. In the future, we will persist the OMS in some way, or split the OMS into a separate process (rather than inside the engine).
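On the logging point above: with SLF4J-style parameterized logging the argument's toString() is only evaluated when the level is actually enabled. A small illustrative sketch (the logger setup is an assumption for illustration, not taken from the project):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingCostExample {
    private static final Logger logger = LoggerFactory.getLogger(LoggingCostExample.class);

    void onOrder(Object order) {
        // Preferred: toString() runs only if DEBUG is actually enabled.
        logger.debug("received order:{}", order);

        // Wasteful: the string is built on every call, even when DEBUG is off.
        logger.debug("received order:" + order.toString());
    }
}
```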
  4. By checking the GC logs, minor GC is too frequent (every second). The young generation is too small (150M), while we have very few objects in the old generation. Increasing the young generation size improved the performance data. These two articles are very helpful for understanding GC and basic tuning: http://mp.weixin.qq.com/s/t1Cx1n6irN1RWG8HQyHU2w https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications BTW: why/who uses 1G of the young generation, cleaned on every minor GC? Easy to find out by checking with the jvisualvm sampler.

  5. Notes for performance tuning

  • Don't rush to change the code. Profile and analyze the current report first. If you have ideas, note them down.
  • I stopped myself from continuously changing the code for performance (because I have many ideas on how to tune it). Analyze the output, and evaluate your ideas.
  • The visualvm CPU sampling and profiling do not work for such an ultra-low-latency test, because: 1) calculating how long a method uses the CPU over a period is meaningless here - we want to track the latency of each message processing; 2) the sampling rate (every 50 ms?) is too coarse; 3) the profiling instrumentation itself causes a performance impact.
  • Warm-up is helpful in some cases (why and how? See JVM JIT behavior). Evaluate and try it; a sketch follows this list.
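The usual reasoning behind warm-up: HotSpot interprets code first and only JIT-compiles hot paths after enough invocations, so the first orders see much higher latency. A hedged sketch of a warm-up phase before the measured run; MatchingEngineFacade and its methods are hypothetical placeholders, not the engine's real API:

```java
// Hypothetical warm-up: drive the hot path many times before the measured run,
// so the JIT has compiled it and the first measured orders are not interpreter-slow.
public class WarmUpExample {
    static final int WARM_UP_ITERATIONS = 100_000;

    static void warmUp(MatchingEngineFacade engine) {
        for (int i = 0; i < WARM_UP_ITERATIONS; i++) {
            engine.matchOneOrder(engine.newThrowawayOrder());
        }
    }

    // Stand-in interface used only for this illustration.
    interface MatchingEngineFacade {
        Object newThrowawayOrder();
        void matchOneOrder(Object order);
    }
}
```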

Jan 7, 2018 - add e2e measurement (release v1.0 is defined as of today)

  • note: server-side FIX storage is disabled for better e2e latency, to avoid disk IO on the critical path. It can be added back if needed by a simple configuration change (in the QFJ configuration file). For an ultra-low-latency app, it is recommended to rely on the OMS (or the post-trade system) to handle possible data loss (and data re-sync) with the client.
  • note: the internal snapshot request interval is extended from 1 second to 3600 seconds during performance tests. This interval will be changed to around 15 minutes once MarketDataService is used. For the web GUI, the 1-second rate is still required until MDS is ready.
  • note: v1.1 is introduced. The inputQ is removed by reusing the QFJ single acceptor thread. See the diagram here: https://raw.githubusercontent.com/wiki/baoyingwang/OrderBook/attachments/ME_1.1/ME_1.1_architecture_20180107.png (maybe update with new pictures later)

Current performance - Disruptor_BusySpinWaitStrategy_bg5000perSec_lt60perMin_duration600sec_20180107_102049 (more links follow this diagram)

Disruptor_BusySpinWaitStrategy_bg5000perSec_lt60perMin_duration600sec_20180107_102049_latency_overall.png

images - linux vm (rate 50/500/1000/3000/5000 with BlockingQ or Disruptor as inputQ):

images - win7

test result data - win7:

https://raw.githubusercontent.com/wiki/baoyingwang/OrderBook/attachments/ME_1.0/ME_1.0_test.result_20180107.zips_win.zip

Jan 4, 2018 - compare Windows and "Linux VM on Windows"; warm up with a high rate (3000/s for 30 seconds, with 60 latency orders/s)

Test cases are running. Notes:

  • biased locking is already disabled (-XX:-UseBiasedLocking), because we found many application stops for revoking biased locks
  • the safepoint interval is extended to 300 seconds (-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=300000), to avoid unnecessary application thread stops (we found them on the console)
  • it now also runs on Windows (git bash is required)

Dec 31, 2017 - happy new year - findings: 1) FIX-related threads allocate too many temporary objects at 5000 orders/s; 2) frequent "application stopped" events at 50 orders/s

  1. At 5000 orders/s, the FIX-related threads (session.Send at 44M/s and the NIO Processors at 15+15M/s) generate too many temporary objects in the young generation. More object detail is in the jvisualvm output (sampler - memory). It is related to FIX message marshalling/un-marshalling. TODO: try using FIX to convey JSON, to reduce FIX dictionary translations.

  2. At 50 orders/s, the "application stopped" events are very frequent. After adding "-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1", we can see that some are RevokeBias and others are "no vm operation". I haven't counted how many of each.

arguments="--queue_type ${queue_type} --strategy ${strategy} --symbols USDJPY --queue_size 65536"
VisualVMOptions="-Dcom.sun.management.jmxremote.port=3333  -Dcom.sun.management.jmxremote.ssl=false  -Dcom.sun.management.jmxremote.authenticate=false"
GCPrintOptions="-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:log/GC.txt"
JVMOptions="${VisualVMOptions} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx3072M -Xms3072M ${GCPrintOptions}"
java $JVMOptions -jar  ${jarfile} ${arguments} &
155.250: RevokeBias                       [      40          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
155.251: RevokeBias                       [      40          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
156.183: RevokeBias                       [      40          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
158.183: no vm operation                  [      40          1              0    ]      [     1     0     1     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
161.188: no vm operation                  [      40          0              0    ]      [     0     0     0     0     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
165.190: no vm operation                  [      40          0              0    ]      [     0     0     2     0     0    ]  0

Found a useful article about the application stop time: https://www.lmax.com/blog/staff-blogs/2015/08/05/jvm-guaranteed-safepoints/ After disabling biased locking (TODO: why? And why not?), my application stops occur about every 1 second. That is consistent with the GuaranteedSafepointInterval default of 1000 ms (java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal 2>&1 | grep Safepoint).

TODO: should I disable biased locking (-XX:-UseBiasedLocking)? Why and why not? TODO: should I increase GuaranteedSafepointInterval (-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=300000), e.g. to 300000 (300 seconds)? Why and why not? See the link above and read more. (It is GuaranteedSafepointInterval, not GuaranteedSafepointTime - the flag name in the link is wrong for Java 8.) The performance data looks rather good after adding -XX:-UseBiasedLocking -XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=300000. The test is ongoing; I need some time to determine whether these should be applied.

Dec 29, 2017 - a little better with a bigger young generation size, because of fewer minor GCs

| stat  | young=150M inputQ_us | match_us | outputQ_us | young=1400M inputQ_us | match_us | outputQ_us |
|-------|---------------------:|---------:|-----------:|----------------------:|---------:|-----------:|
| count | 1198 | 1198 | 1198 | 1196 | 1196 | 1196 |
| mean  | 174.4 | 1.71 | 81.72 | 162.22 | 0.7 | 49.49 |
| std   | 596.39 | 21.05 | 327.68 | 985.80 | 3.92 | 218.53 |
| min   | 0.483 | 0.171 | 2.172 | 0.834 | 0.157 | 1.63 |
| 25%   | 8.24975 | 0.56 | 9.92 | 5.08 | 0.35 | 5.715 |
| 50%   | 20.031 | 0.812 | 17.7755 | 14.472 | 0.446 | 10.4905 |
| 75%   | 69.7145 | 1.24075 | 37.4235 | 33.2265 | 0.565 | 24.48425 |
| 90%   | 385.0197 | 1.445 | 128.7143 | 155.6775 | 0.8265 | 57.2885 |
| 95%   | 740.0291 | 1.6326 | 308.6618 | 528.15925 | 1.0495 | 145.5985 |
| 99%   | 2861.886 | 2.23918 | 1156.13663 | 3477.3479 | 2.2233 | 715.37735 |
| max   | 9273.4 | 710.382 | 6689.444 | 22391.588 | 129.804 | 3194.357 |

note: the 'pickFrom' prefix is removed from the headers to keep the table readable without a horizontal scrollbar.

Before the change: minor GC every second, with -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx3072M -Xms3072M (note: default new/old generation ratio of 2).

2017-12-29T02:38:44.187+0000: 611.086: [GC (Allocation Failure) 2017-12-29T02:38:44.187+0000: 611.086: [ParNew: 139255K->3547K(153344K), 0.0229920 secs] 628263K->492555K(3128704K), 0.0230535 secs] [Times: user=0.04 sys=0.00, real=0.02 secs] 
2017-12-29T02:38:44.211+0000: 611.110: Total time for which application threads were stopped: 0.0239491 seconds, Stopping threads took: 0.0000664 seconds
2017-12-29T02:38:45.221+0000: 612.119: [GC (Allocation Failure) 2017-12-29T02:38:45.221+0000: 612.119: [ParNew: 139867K->2912K(153344K), 0.0259835 secs] 628875K->491920K(3128704K), 0.0260498 secs] [Times: user=0.03 sys=0.00, real=0.03 secs] 
2017-12-29T02:38:45.247+0000: 612.145: Total time for which application threads were stopped: 0.0312989 seconds, Stopping threads took: 0.0050753 seconds
2017-12-29T02:38:46.177+0000: 613.075: [GC (Allocation Failure) 2017-12-29T02:38:46.177+0000: 613.075: [ParNew: 139232K->3204K(153344K), 0.0161876 secs] 628240K->492213K(3128704K), 0.0162461 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2017-12-29T02:38:46.193+0000: 613.092: Total time for which application threads were stopped: 0.0164855 seconds, Stopping threads took: 0.0000510 seconds
2017-12-29T02:38:47.178+0000: 614.077: [GC (Allocation Failure) 2017-12-29T02:38:47.178+0000: 614.077: [ParNew: 139524K->2971K(153344K), 0.0155237 secs] 628533K->491980K(3128704K), 0.0155831 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2017-12-29T02:38:47.194+0000: 614.093: Total time for which application threads were stopped: 0.0157783 seconds, Stopping threads took: 0.0000319 seconds
2017-12-29T02:38:48.178+0000: 615.077: [GC (Allocation Failure) 2017-12-29T02:38:48.178+0000: 615.077: [ParNew: 139291K->2868K(153344K), 0.0174023 secs] 628300K->491878K(3128704K), 0.0174582 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2017-12-29T02:38:48.195+0000: 615.094: Total time for which application threads were stopped: 0.0176585 seconds, Stopping threads took: 0.0000294 seconds
2017-12-29T02:38:49.177+0000: 616.076: [GC (Allocation Failure) 2017-12-29T02:38:49.177+0000: 616.076: [ParNew: 139188K->2841K(153344K), 0.0160903 secs] 628198K->491852K(3128704K), 0.0175685 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2017-12-29T02:38:49.195+0000: 616.094: Total time for which application threads were stopped: 0.0178737 seconds, Stopping threads took: 0.0000374 seconds
2017-12-29T02:38:50.177+0000: 617.076: [GC (Allocation Failure) 2017-12-29T02:38:50.177+0000: 617.076: [ParNew: 139161K->2884K(153344K), 0.0157959 secs] 628172K->491895K(3128704K), 0.0158495 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 

--BlockingQueue_X_bg5000perSec_20171229_022833_duration600Sec.GC.txt

After the change: minor GC every ~10 seconds with a bigger young generation (1415616K), using -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx3072M -Xms3072M.

2017-12-29T08:45:05.656+0000: 576.504: [GC (Allocation Failure) 2017-12-29T08:45:05.656+0000: 576.504: [ParNew: 1261305K->2911K(1415616K), 0.0137893 secs] 1290985K->32620K(2988480K), 0.0138477 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2017-12-29T08:45:05.670+0000: 576.518: Total time for which application threads were stopped: 0.0141363 seconds, Stopping threads took: 0.0000715 seconds
2017-12-29T08:45:14.841+0000: 585.689: [GC (Allocation Failure) 2017-12-29T08:45:14.841+0000: 585.689: [ParNew: 1261279K->2912K(1415616K), 0.0135910 secs] 1290988K->32624K(2988480K), 0.0136448 secs] [Times: user=0.02 sys=0.01, real=0.02 secs] 
2017-12-29T08:45:14.855+0000: 585.703: Total time for which application threads were stopped: 0.0138769 seconds, Stopping threads took: 0.0000689 seconds
2017-12-29T08:45:24.003+0000: 594.851: [GC (Allocation Failure) 2017-12-29T08:45:24.003+0000: 594.851: [ParNew: 1261280K->2923K(1415616K), 0.0141921 secs] 1290992K->32639K(2988480K), 0.0142494 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2017-12-29T08:45:24.017+0000: 594.865: Total time for which application threads were stopped: 0.0144154 seconds, Stopping threads took: 0.0000326 seconds
2017-12-29T08:45:33.226+0000: 604.074: [GC (Allocation Failure) 2017-12-29T08:45:33.226+0000: 604.074: [ParNew: 1261291K->2907K(1415616K), 0.0132034 secs] 1291007K->32627K(2988480K), 0.0132546 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
2017-12-29T08:45:33.240+0000: 604.088: Total time for which application threads were stopped: 0.0134510 seconds, Stopping threads took: 0.0000267 seconds
2017-12-29T08:45:42.401+0000: 613.249: [GC (Allocation Failure) 2017-12-29T08:45:42.401+0000: 613.249: [ParNew: 1261275K->2952K(1415616K), 0.0114524 secs] 1290995K->32675K(2988480K), 0.0115035 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2017-12-29T08:45:42.413+0000: 613.261: Total time for which application threads were stopped: 0.0160465 seconds, Stopping threads took: 0.0043239 seconds

--BlockingQueue_X_bg5000perSec_20171229_083529_duration600Sec.GC.txt

Question: why/who is generating ~1G of objects within 10 seconds? Is that normal? TODO: use visualvm to find the related thread (sampler -> memory -> per-thread allocation -> delta). Update: visualvm shows it is the "QFJ Message Processor" thread that allocates the huge amount of memory. This is the thread which invokes the callback of my own logic.

Dec 28, 2017 - much better throughput after several important changes

  1. Bug fix in the test tool's rate calculation for rates bigger than 200; previously the actual rate was about 4000 per second instead of the intended 500.
  2. The OMS now ignores latency orders too; memory usage is lower.
  3. The test tool uses multiple threads to avoid possible TCP blocking under heavy traffic (see the sketch after this list).
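A rough sketch of the multi-threaded idea in item 3, assuming one sender task per FIX client session; ClientSession and sendOrder are hypothetical placeholders, not the test tool's actual API:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedSender {

    // Hypothetical client abstraction used only for this illustration.
    interface ClientSession {
        void sendOrder();
    }

    // One thread per client session, so one slow or blocked TCP connection
    // does not stall the other senders.
    static void sendInParallel(List<ClientSession> sessions, int ordersPerSession)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(sessions.size());
        for (ClientSession session : sessions) {
            pool.submit(() -> {
                for (int i = 0; i < ordersPerSession; i++) {
                    session.sendOrder();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
```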

See the test result on much higher load - 5000 orders per second BlockingQueue_X_bg5000perSec_20171228_080412_duration600Sec_latency_overall.png Disruptor_BusySpinWaitStrategy_bg5000perSec_20171228_073938_duration600Sec_latency_overall.png

Dec 27, 2017 - better throughput - investigation on 500 msg/second

The different strategies give similar test results, so we just check one of them. Dumped the heap while re-running and found that:

  • The OMS used too much memory (200M+). TODO: the OMS should ignore latency orders too; right now it only ignores background orders. One instance of "baoying.orderbook.app.SimpleOMSEngine" loaded by "sun.misc.Launcher$AppClassLoader @ 0xcaf702a8" occupies 233,014,992 (30.12%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]" loaded by "".

  • The FIX IO used too much memory (150M + 130M). I suspect it is a slow-consumer issue: the FIX test tool clients cannot handle such a high message volume. TODO: re-test after improving the performance of the FIX test tool clients, e.g. switch to multiple threads, improve the logging, or both. The test tool client already has FIX storage disabled. One instance of "org.apache.mina.transport.socket.nio.NioSocketSession" loaded by "sun.misc.Launcher$AppClassLoader @ 0xcaf702a8" occupies 158,170,400 (20.45%) bytes. Another instance of "org.apache.mina.transport.socket.nio.NioSocketSession" loaded by "sun.misc.Launcher$AppClassLoader @ 0xcaf702a8" occupies 132,223,776 (17.09%) bytes.

  • The executor (single thread) of the event bus queued too many tasks. Since the FIX IO was blocked, the tasks (which invoke FIX IO) could not proceed, so they all queued up here. We still need to focus on FIX IO. One instance of "java.util.concurrent.ThreadPoolExecutor" loaded by "" occupies 154,390,240 (19.96%) bytes. The instance is referenced by com.google.common.eventbus.Subscriber$SynchronizedSubscriber @ 0xcc6ff340, loaded by "sun.misc.Launcher$AppClassLoader @ 0xcaf702a8".

  • The engine cannot keep up with messages after 2 minutes. You can see many metric changes at that time (around 07:44 - Blocking Q, 500 msg), e.g. in the vmstat memory cache (a practical guide to the Linux vmstat command: https://www.cnblogs.com/ggjucheng/archive/2012/01/05/2312625.html).

BlockingQueue_X_bg500perSec_20171226_074117_duration600Sec_latency_overall.png

Dec 26, 2017 - Merry Christmas - in the past week, most of the time went into setting up the test harness

btw - performance with a single CPU is very, very bad (my VM was initially assigned a single CPU)

Test System Basic Info - It is a linux VM on win 7 64bit

gc Name 1:ParNew
gc Name 1 Pool 1:Par Eden Space
gc Name 1 Pool 2:Par Survivor Space
gc Name 2:ConcurrentMarkSweep
gc Name 2 Pool 1:Par Eden Space
gc Name 2 Pool 2:Par Survivor Space
gc Name 2 Pool 3:CMS Old Gen
os Arch:amd64
os AvailableProcessors:2
os Name:Linux
os Version:4.4.0-104-generic
runtime Name:7716@BaoyingUbuntu
runtime SpecName:Java Virtual Machine Specification
runtime SpecVendor:Oracle Corporation
runtime SpecVersion:1.8
runtime VmName:Java HotSpot(TM) 64-Bit Server VM
runtime VmVendor:Oracle Corporation
runtime VmVersion:25.101-b13

Summary

  1. match_us is very stable under any load
  2. pickFromOutputQ_us increases dramatically at a load of 500 orders per second. It is related to 1) the thread model: a single thread on the ER bus both updates the local cache (OMS) and sends to the external FIX tool; this can be mitigated by a dedicated thread for sending orders to the external FIX tool; 2) much more logging is printed during high load; 3) memory usage keeps increasing, because the messages cannot be consumed in time
  3. There is not much difference between Disruptor_YieldingWaitStrategy and Disruptor_BusySpinWaitStrategy. The Disruptor documentation says YieldingWaitStrategy is for low latency and BusySpinWaitStrategy is for ultra low latency. We can just use BusySpinWaitStrategy for performance tuning, since they use similar resources (mainly CPU)

Test result data is here: 20171226_test_result_data.zip (note: it is probably not easy to understand if you haven't read the code or this wiki).

50 per second on diff strategies

BlockingQueue_X_bg50perSec_20171226_063657_duration600Sec_latency_overall.png Disruptor_BusySpinWaitStrategy_bg50perSec_20171226_061723_duration300Sec_latency_overall.png Disruptor_SleepingWaitStrategy_bg50perSec_20171226_063025_duration300Sec_latency_overall.png Disruptor_YieldingWaitStrategy_bg50perSec_20171226_062354_duration300Sec_latency_overall.png

200 per second on diff strategies

BlockingQueue_X_bg200perSec_20171226_070806_duration600Sec_latency_overall.png Disruptor_BusySpinWaitStrategy_bg200perSec_20171226_064828_duration300Sec_latency_overall.png Disruptor_SleepingWaitStrategy_bg200perSec_20171226_070134_duration300Sec_latency_overall.png Disruptor_YieldingWaitStrategy_bg200perSec_20171226_065500_duration300Sec_latency_overall.png

500 per second on diff strategies(the system cannot handle this load. You can see it from the pickFromOutputQ_us sub-plot)

Diagrams are removed, since the test tool had a bug in its rate calculation. The actual rate was 4000+ at the time, rather than the expected 500.

Dec 18, 2017 - inputQ: LMAX DISRUPTOR (busy-spin, CPU-hungry); outputQ: guava async event bus

| type  | put2InputQ_us | pickFromInputQ_us | match_us | pickFromOutputQ_us |
|-------|--------------:|------------------:|---------:|-------------------:|
| count | 449.0 | 449.0 | 449.0 | 449.0 |
| mean  | 0.626 | 268.639 | 1.603 | 194.84 |
| std   | 0.5526 | 3131.712 | 2.628 | 550.56 |
| min   | 0.0 | 0.365 | 0.364 | 2.187 |
| 25%   | 0.365 | 4.376 | 1.094 | 13.856 |
| 50%   | 0.365 | 5.469 | 1.458 | 17.138 |
| 75%   | 0.729 | 6.928 | 1.823 | 125.074 |
| 90%   | 0.73 | 116.0304 | 2.188 | 564.0354 |
| 95%   | 1.094 | 548.4284 | 2.917 | 737.3152 |
| 99%   | 2.917 | 825.12152 | 5.13416 | 1923.83112 |
| max   | 6.563 | 50235.161 | 53.967 | 8923.994 |

Dec 17, 2017 - use LMAX DISRUPTOR, since picking from the Q takes more time

LMAX DISRUPTOR - https://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started

Replaced only pickFromInputQ with the DISRUPTOR (+BusySpinWaitStrategy). We can see the performance is improved significantly (see the pickFromInputQ_us column); a wiring sketch follows the notes below.

  • note: but CPU usage is high (around 50% of my 4-core CPU - Intel i7-6600U 2.60GHz)
  • note: performance is similar to ArrayBlockingQueue if the default strategy (SleepingWaitStrategy) is used
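For reference, wiring an input queue with the Disruptor and BusySpinWaitStrategy looks roughly like the sketch below, assuming Disruptor 3.x; the OrderEvent type and the handler body are illustrative stand-ins, not the project's actual classes.

```java
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorInputQueueSketch {

    // Illustrative event carrying the raw order; the real engine uses its own event type.
    public static class OrderEvent {
        Object order;
    }

    public static void main(String[] args) {
        Disruptor<OrderEvent> disruptor = new Disruptor<>(
                OrderEvent::new,
                65536,                          // ring buffer size, must be a power of 2
                DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE,            // assuming a single publishing thread
                new BusySpinWaitStrategy());    // busy-spins the consumer: lowest latency, one core pinned

        // Consumer side: this is where the matching logic would be invoked.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("match order: " + event.order));

        RingBuffer<OrderEvent> ringBuffer = disruptor.start();

        // Producer side: publish an incoming order into the ring buffer.
        ringBuffer.publishEvent((event, sequence, order) -> event.order = order, "order-1");
    }
}
```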

| type  | put2InputQ_us | pickFromInputQ_us | match_us | pickFromOutputQ_us |
|-------|--------------:|------------------:|---------:|-------------------:|
| count | 121.0 | 121.0 | 121.0 | 121.0 |
| mean  | 0.66 | 18.777 | 2.1426 | 225.59499173553718 |
| std   | 0.29 | 81.192 | 0.438 | 1693.880477760186 |
| min   | 0.364 | 4.74 | 1.094 | 8.022 |
| 25%   | 0.365 | 5.835 | 1.823 | 10.939 |
| 50%   | 0.729 | 6.563 | 2.188 | 12.398 |
| 75%   | 0.729 | 7.293 | 2.188 | 14.585 |
| 90%   | 1.094 | 8.386 | 2.553 | 138.924 |
| 95%   | 1.094 | 8.751 | 2.917 | 634.457 |
| 99%   | 1.459 | 514.2748 | 3.282 | 996.827399999999 |
| max   | 1.824 | 654.877 | 3.646 | 18588.879 |

note: the first 50 latency samples are removed from this data to avoid the warm-up impact (which did have an impact - to be investigated).

Test Data

{
	"start_time":"2017-12-17T02:40:27.120Z",
	"rate_per_second":"4.52",
	"order_count":2203,
	"end_time":"2017-12-17T02:48:35.018Z",
	"duration_in_second":487,
	"latency_data_count":173
}

YieldingWaitStrategy - for low latency systems. It took more CPU - around 60~70% on my Win 7 box (higher than BusySpinWaitStrategy's 50%), but performance is not as good as BusySpinWaitStrategy.

| type  | put2InputQ_us | pickFromInputQ_us | match_us | pickFromOutputQ_us |
|-------|--------------:|------------------:|---------:|-------------------:|
| count | 207.0 | 207.0 | 207.0 | 207.0 |
| mean  | 0.6817 | 72.244 | 1.8214 | 241.23057487922708 |
| std   | 0.951 | 277.021 | 0.5125 | 452.59852395855114 |
| min   | 0.0 | 1.094 | 0.364 | 2.187 |
| 25%   | 0.365 | 5.104 | 1.459 | 11.668 |
| 50%   | 0.729 | 5.47 | 1.823 | 13.491 |
| 75%   | 0.729 | 6.563 | 2.188 | 490.9755 |
| 90%   | 0.73 | 91.085 | 2.188 | 661.0024000000001 |
| 95%   | 0.9841 | 538.4867 | 2.552 | 757.9946 |
| 99%   | 2.14426 | 1258.52312 | 2.917 | 1778.4871799999992 |
| max   | 11.303 | 2984.139 | 4.376 | 3777.211 |

BTW: other enhancements

Dec 14,15,16, 2017 -

Glad to have two days of vacation, plus the weekend, to tune this engine.

  1. only record latency (nanoTime) for latency orders
  2. refactor to return a result tuple, rather than filling output parameters of the match method (see the sketch after this list)
  3. add timing for picking an order from the input Q
  4. only calculate latency on the taker side (by recording the related timestamps in the filled execution report), since the maker side may sit in the book for a long time
  5. Python scripts to calculate the latency statistics
  6. Java code for parsing the GC output
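A rough sketch of the "return a tuple/result object" shape from item 2; the signature and types are illustrative stand-ins, not the project's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative shape: match() returns its outputs instead of mutating caller-supplied lists.
public class MatchResultSketch {

    static class MatchResult {
        final List<Object> executionReports = new ArrayList<>();
        final List<Object> orderBookDeltas  = new ArrayList<>();
    }

    // Stand-in signature; the real method also takes the executing order and the two books.
    MatchResult match(Object executingOrder, Object offerBook, Object bidBook) {
        MatchResult result = new MatchResult();
        // ... matching logic would append execution reports and book deltas here ...
        return result;
    }
}
```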

Got the results below:

  1. picking from the Q (both input and output) takes most of the time, while the matching itself is fast (avg 5.3 us, 95% 12 us)
  2. CPU usage is very low, since LinkedBlockingQueue is used (similar test results with ArrayBlockingQueue, which is expected; it is not used because I don't want to introduce capacity-overflow handling logic in this phase)
  3. cannot identify an obvious relationship between GC (and app thread stops) and the latency spikes in Q picking

| type  | put2InputQ_us | pickFromInputQ_us | match_us | pickFromOutputQ_us |
|-------|--------------:|------------------:|---------:|-------------------:|
| count | 478.0 | 478.0 | 478.0 | 478.0 |
| mean  | 1.261725941422594 | 748.6533933054393 | 5.363443514644351 | 221.17526569037656 |
| std   | 1.3463244342974796 | 1932.4899826558315 | 6.229908042497636 | 1024.58946151024 |
| min   | 0.0 | 7.293 | 0.365 | 7.657 |
| 25%   | 0.729 | 54.42100000000001 | 2.917 | 26.345 |
| 50%   | 1.093 | 70.738 | 4.741 | 32.817 |
| 75%   | 1.094 | 737.46525 | 5.47 | 63.9015 |
| 90%   | 2.188 | 1559.7446 | 6.928000000000001 | 424.6493000000001 |
| 95%   | 4.74 | 2685.687749999995 | 12.086749999999991 | 827.6025499999997 |
| 99%   | 6.564 | 11369.258260000042 | 20.054229999999997 | 2907.322100000022 |
| max   | 7.658 | 17678.031 | 117.411 | 17108.478 |

Test Data

{
	"start_time":"2017-12-16T13:21:32.485Z",
	"rate_per_second":"4.37",
	"un-purged_latency_data_count":10,
	"order_count":6166,
	"end_time":"2017-12-16T13:45:02.681Z",
	"duration_in_second":1410,
	"latency_data_count":479 (this is wrong, it includes header un-expectedly)
}

Dec 13, 2017

architecture

  1. the matching engine receives order requests from a RESTful API (via Spring Boot)
  2. jmeter is used to simulate banks, populate the order book, and trigger the latency data dump
  3. several measured columns - an order flows as below (see the sketch after this list):
     • recvEpochMS - the time the order arrives at the web server
     • put2ReqQ - time from arriving at the web server to being put onto the matching engine's input request Q
     • pickANDmatch - time from being put onto the input Q to match done
     • addANDpickOutput - time from match done, to being put onto the output Q, to being picked up by the consuming thread
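A minimal sketch of how the per-phase timestamps described above could be captured on an order as it flows through; the field and class names are illustrative, not the project's actual classes.

```java
// Illustrative holder for the measurement points described above; the idea is to
// record System.nanoTime() on latency orders at each phase boundary and derive
// the per-phase durations in microseconds from the differences.
public class OrderLatencyStamps {
    long recvEpochMS;            // wall-clock arrival time at the web server
    long recvNano;               // nanoTime at arrival
    long put2ReqQNano;           // nanoTime after the order is put onto the input request Q
    long matchDoneNano;          // nanoTime when matching completes
    long pickedFromOutputQNano;  // nanoTime when the consuming thread picks up the result

    double put2ReqQ_us()         { return (put2ReqQNano - recvNano) / 1_000.0; }
    double pickANDmatch_us()     { return (matchDoneNano - put2ReqQNano) / 1_000.0; }
    double addANDpickOutput_us() { return (pickedFromOutputQNano - matchDoneNano) / 1_000.0; }
}
```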

test case

  1. background orders - 120 orders/minute - 40 banks (20 sending bids, and 20 sending offers)
  2. latency orders - 2 orders/minute - 5 banks sending bid orders - latency is only calculated on these orders
  3. the order book is populated in advance as a setup step, so that both background orders and latency orders fill in a single execution

TODO

  • split pickANDmatch into 2 parts: 1) put to input Q -> pick from input Q, 2) match
  • check whether there is any relationship with minor GC (or full GC)
  • try it on Linux (I only have a Linux VM, on Windows 7 VirtualBox. What's the difference between Linux in a VM and Linux on bare hardware?)
  • try different sending rates
  • evaluate whether Reactor or DISRUPTOR should be used
  • compare with other libraries
  • record CPU usage - see the JVM troubleshooting summary on how to do that with Linux commands. I suspect the large latencies were caused by GC.
  • is it true that a 32-bit JVM is faster than a 64-bit one? Try it.

Test Result

  1. build a fat jar: gradle shadowJar
  2. start java -jar BaoyingOrderBookFat-2017-12-13_202535.636-all.jar
  3. run jmeter - OrderBook\src\test\resources\jmeter\OrderBook.jmx
  4. check test result ~\log\LatencyData_OverallStart_20171213_075645.7Z.csv, ~\log\LatencySummary_OverallStart_20171213_075645.7Z.json.txt
unit: microseconds (1/1,000,000 of a second). I will use better names in a later test report.
>>> df.describe()  
        recvEpochMS    put2ReqQ   pickANDmatch  addANDpickOutput
count  2.640000e+02  264.000000     264.000000        264.000000
mean   1.513153e+12    1.708663    2857.073818        440.690269
std    4.577423e+05    1.438593   28920.917469       2525.597672
min    1.513152e+12    0.000000       8.752000          7.658000
25%    1.513152e+12    0.730000      51.053000         21.150750
50%    1.513153e+12    1.459000      94.265500         37.195500
75%    1.513153e+12    2.188000    1248.337750         90.071500
max    1.513153e+12   12.034000  467725.726000      30824.003000

Note - pure matching performance is 170~200 nanoseconds on average, and each System.nanoTime() call takes about 12 nanoseconds on average. My code reads the nano time many times. Not high priority, but it should be noted.

It took about 170~200 nanoseconds on average per match. See baoying.orderbook.MatchingEnginePerfTest:

int loopCount = 1000_000;
// output collectors are reused across iterations, so their allocation cost is not measured
List<MatchingEnginOutputMessageFlag> execReportsAsResult = new ArrayList<>();
List<OrderBookDelta> orderbookDeltasAsResult = new ArrayList<>();
long start = System.nanoTime();
for (int i = 0; i < loopCount; i++) {
	// match one order against the existing books; note this also calls System.nanoTime() once per order
	_exchange.match(new ExecutingOrder(b_160_1mio, System.nanoTime()), offerBook, bidBook, execReportsAsResult, orderbookDeltasAsResult);
}
long end = System.nanoTime();
System.out.println("avg nano second per matching:" + (end - start) / loopCount);