*****************************************************
Processor: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (8 CPUs)
System Manufacturer: Apple Inc.
System Model: iMac15,1
Operating System: Windows 8 Pro 64-bit (6.2, Build 9200) (9200.win8_gdr.151230-0600)
Card name: AMD RADEON R9 M290X
Manufacturer: Advanced Micro Devices, Inc.
Chip type: AMD Radeon Graphics Processor (0x6810)
DAC type: Internal DAC(400MHz)
Device Type: Full Device
Display Memory: 5840 MB
Dedicated Memory: 2030 MB
Shared Memory: 3810 MB
*****************************************************
!!!
!!! IMPORTANT: All CompV results use hand-written assembler code (no intrinsics) and TBBMalloc (preferred) or TCMalloc.
!!!
=== RGB24 -> Grayscale (1282, 720) (#10k times) ===
CompV: 449.ms(full optiz)
OpenCV: 3086.ms(full optiz)
Remark: We're 680% (almost #7 times) faster.
For OpenCV we're using CV_RGB2GRAY.
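For reference, a minimal sketch (not the actual benchmark code) of how the OpenCV side of this test can be reproduced; the input Mat is a placeholder. The other single-step conversions below (CV_YUV2RGB_I420, CV_RGB2HSV) only differ in the cvtColor conversion code:
  #include <opencv2/imgproc.hpp>
  void bench_rgb_to_gray() {
      cv::Mat rgb(720, 1282, CV_8UC3); // placeholder 1282x720 RGB24 frame
      cv::Mat gray;
      for (int i = 0; i < 10000; ++i) {
          cv::cvtColor(rgb, gray, cv::COLOR_RGB2GRAY); // CV_RGB2GRAY in the legacy C API
      }
  }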
=== YUV420P -> RGB24 (1282, 720) (#10k times) ===
CompV: 968.ms(full optiz)
OpenCV: 5919.ms(full optiz)
Remark: We're 611% (#6 times) faster.
For OpenCV we're using CV_YUV2RGB_I420.
=== RGB24 -> HSV (1282, 720) (#10k times) ===
CompV: 2137.ms(full optiz)
OpenCV: 15709.ms(full optiz)
Remark: We're 735% (#7 times) faster.
For OpenCV we're using CV_RGB2HSV.
=== YUV420P -> HSV (1282, 720) (#10k times) ===
CompV: 3045.ms(full optiz)
OpenCV: 32920.ms(full optiz)
Remark: We're 1080% (almost #11 times) faster.
For OpenCV you need to convert to BGR and then to HSV (#2 steps: COLOR_YUV2BGR_I420 then CV_BGR2HSV).
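A minimal sketch of that two-step OpenCV conversion (the I420 Mat is a placeholder; in OpenCV an I420 frame is passed as a single 8-bit plane of height*3/2 rows):
  #include <opencv2/imgproc.hpp>
  void yuv420p_to_hsv() {
      cv::Mat i420(720 * 3 / 2, 1282, CV_8UC1); // placeholder planar YUV420 frame
      cv::Mat bgr, hsv;
      cv::cvtColor(i420, bgr, cv::COLOR_YUV2BGR_I420); // step 1: YUV420P -> BGR
      cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);       // step 2: BGR -> HSV (CV_BGR2HSV in the C API)
  }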
=== Image Split BGR/RGB (1282, 720) (#10k times) ===
CompV: 694.ms(full optiz)
OpenCV: 8210.ms(full optiz)
Remark: We're 1182% (almost #12 times) faster.
=== Histogram Building on grayscale 8u image (1282, 720) (#10k times) ===
CompV: 1073.ms(full optiz)
OpenCV: 4114.ms(full optiz)
Remark: We're 383% (almost #4 times) faster.
=== Histogram Equalization on grayscale 8u image (1282, 720) (#10k times) ===
CompV: 2675.ms (full optiz)
OpenCV: 4311.ms (full optiz)
Remark: We're 161% (almost #2 times) faster.
OpenCV is MT'ing both histogram generation and equalization. We're faster but not by enough.
We need more optiz in CompV. For now this is low priority because it's not used in any commercial product.
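For reference, a sketch of what the OpenCV side of the two histogram tests above could look like (placeholder input; the exact calcHist setup is an assumption, presumably cv::calcHist and cv::equalizeHist are the calls being timed):
  #include <opencv2/imgproc.hpp>
  void histogram_tests() {
      cv::Mat gray(720, 1282, CV_8UC1), hist, equalized; // placeholder grayscale frame
      int histSize = 256;                                 // one bin per 8u value
      float range[] = { 0.f, 256.f };
      const float* ranges[] = { range };
      int channels[] = { 0 };
      cv::calcHist(&gray, 1, channels, cv::Mat(), hist, 1, &histSize, ranges); // histogram building
      cv::equalizeHist(gray, equalized);                                       // histogram equalization
  }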
=== Integral and Square integral on grayscale 8u image (1282, 720) (#1k times) ===
CompV: 1832.ms (No optiz)
OpenCV: 8221.ms (full optiz)
Remark:
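For reference, the OpenCV side presumably maps to cv::integral computing both the integral and the squared integral in one pass; a minimal sketch:
  #include <opencv2/imgproc.hpp>
  void integral_images() {
      cv::Mat gray(720, 1282, CV_8UC1); // placeholder grayscale frame
      cv::Mat sum, sqsum;
      cv::integral(gray, sum, sqsum, CV_32S, CV_64F); // integral + square integral
  }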
=== Wolf binarization on grayscale 8u image (1282, 720), window (41, 41) (#1k times) ===
CompV: 8721.ms (partial optiz)
OpenCV: 21683.ms (full optiz)
Remark: The integral computation is not optimized yet.
The Wolf implementation used for the OpenCV side is based on https://github.com/chriswolfvision/local_adaptive_binarization
=== FAST9_16, Nonmax=true, threshold=20, maxfeatures=-1 (1282, 720) (#10k times) ===
CompV: 3358.ms(mt,avx2), 4730.ms(mt,sse2), 8447.ms(st,avx2), 11261.ms(st,sse2)
OpenCV: 10270.ms(full optiz)
Remark: We're 305% (#3 times) faster.
The way we write to the shared strengths buffer is slow, but it's required for best multithreading support.
This is why we're better than OpenCV by far. OpenCV uses SSE2 and MT.
The above numbers mean we can run feature detection on an HD image at 2977 fps while OpenCV runs at 973 fps.
We're using #8 threads; with more threads (e.g. #16) we can go as high as 5000 fps.
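For reference, a sketch of the OpenCV side under the parameters listed in the header (FAST9_16 corresponds to TYPE_9_16; maxfeatures=-1 means no retention limit, so no post-filtering is applied here):
  #include <opencv2/features2d.hpp>
  #include <vector>
  void fast9_16_detect() {
      cv::Mat gray(720, 1282, CV_8UC1); // placeholder grayscale frame
      std::vector<cv::KeyPoint> keypoints;
      cv::FAST(gray, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true,
               cv::FastFeatureDetector::TYPE_9_16);
  }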
=== Image scaling (Bilinear) from (1282, 720) to (1064, 597), (#10k times) ===
CompV: 1474.ms(mt,avx2), 1866.ms(mt,sse41), 2645.ms(mt,c++), 5435.ms(st,avx2), 6708.ms(st,sse41), 9470.ms(st,c++)
OpenCV: 4148.ms(full optiz)
Remark: We're 281% (almost #3 times) faster.
OpenCV uses SSE2 and multithreading, and we're faster even with plain C++ (without SSE or AVX).
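A sketch of the OpenCV side of this scaling test (placeholder input, assumed 8u grayscale; the Bicubic test below would only swap cv::INTER_LINEAR for cv::INTER_CUBIC):
  #include <opencv2/imgproc.hpp>
  void scale_bilinear() {
      cv::Mat src(720, 1282, CV_8UC1), dst; // placeholder frame
      cv::resize(src, dst, cv::Size(1064, 597), 0, 0, cv::INTER_LINEAR);
  }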
=== Image scaling (Bicubic) from (1282, 720) to (1064, 597), (#10k times) ===
CompV: Shame!!
OpenCV: 6671.ms(full optiz)
Remark:
Our Bicubic implementation uses more computations (more multiplications) than what OpenCV is doing.
We get better output quality than OpenCV: its Bicubic implementation produces images with very poor quality at small sizes.
OpenCV uses SSE2 and multithreading.
OpenCV uses a fixed-point implementation while we use a floating-point implementation.
=== Image rotation (Bilinear) from (1282, 720) (45°), (#1k times) ===
CompV: 1540.ms(mt,avx2)
OpenCV: 6854.ms(full optiz)
Remark:
OpenCV uses SSE2 and multithreading.
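A sketch of how the 45° bilinear rotation can be reproduced on the OpenCV side (rotating around the image center and keeping the source size are assumptions, the header only states the angle):
  #include <opencv2/imgproc.hpp>
  void rotate_bilinear_45() {
      cv::Mat src(720, 1282, CV_8UC1), dst;                     // placeholder frame
      cv::Point2f center(src.cols / 2.f, src.rows / 2.f);
      cv::Mat rot = cv::getRotationMatrix2D(center, 45.0, 1.0); // 45°, no scaling
      cv::warpAffine(src, dst, rot, src.size(), cv::INTER_LINEAR);
  }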
=== Image rotation (Bicubic) from (1282, 720) (45°), (#10k times) ===
CompV: Shame!!
OpenCV: 74464.ms(full optiz)
Remark:
OpenCV uses a fixed-point implementation while we use a floating-point implementation.
=== Gaussian Blur (kernel size = 7, Sigma = 2.0) (1282, 720) (#10k times) ===
CompV(float): 2419.ms(mt,avx2,fma3), 2822.ms(mt,avx2), 5476.ms(mt,sse2), 7874.ms(st,avx2,fma3), 9980.ms(st,avx2), 17527.ms (st,sse2), 23004.ms(mt,c++), 79309.ms(st,c++)
CompV(Fixed): 1367.ms(mt,avx2), 2815.ms(mt,sse2), 4736.ms(st,avx2), 8824.ms(st,sse2), 19741.ms(mt,c++), 70983.ms(st,c++)
OpenCV(float): 16186.ms(full optiz)
Remark: We're 1184% (almost #12 times) faster when using our fixed-point implementation and 669% (almost #7 times) faster with the floating-point implementation.
https://github.com/DoubangoTelecom/compv/issues/118
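The OpenCV(float) row above presumably corresponds to a call like the following (placeholder input):
  #include <opencv2/imgproc.hpp>
  void gaussian_blur_7x7() {
      cv::Mat src(720, 1282, CV_8UC1), dst; // placeholder grayscale frame
      cv::GaussianBlur(src, dst, cv::Size(7, 7), /*sigmaX=*/2.0, /*sigmaY=*/2.0);
  }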
=== Sobel3x3 (gradx, grady, scaling, adding) (1282, 720) (#10k times) ===
CompV: 7476.ms(full optiz), 127286.ms(st, c++)
OpenCV: 46231.ms(full optiz)
Remark: We're 618% (#6 times) faster.
The Sobel kernels are pre-defined (0, 1 and 2 values), which means we can compute the convolution without multiplications (add and sub only).
For now the Sobel convolution is not used in commercial products, so changing the implementation is not urgent.
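A sketch of an OpenCV pipeline matching the "gradx, grady, scaling, adding" steps named in the header (the 0.5/0.5 blending weights are an assumption):
  #include <opencv2/imgproc.hpp>
  void sobel3x3() {
      cv::Mat gray(720, 1282, CV_8UC1), gx, gy, absx, absy, result; // placeholder frame
      cv::Sobel(gray, gx, CV_16S, 1, 0, 3);               // gradx, 3x3 kernel
      cv::Sobel(gray, gy, CV_16S, 0, 1, 3);               // grady, 3x3 kernel
      cv::convertScaleAbs(gx, absx);                      // scaling back to 8u
      cv::convertScaleAbs(gy, absy);
      cv::addWeighted(absx, 0.5, absy, 0.5, 0.0, result); // adding
  }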
=== Adaptive Thresholding (ADAPTIVE_THRESH_MEAN_C, CV_THRESH_BINARY_INV, MaxVal = 255, BlockSize = 5, C = 8) (1282, 720) (#10k times) ===
CompV: 3551.ms(full optiz), 56589.ms(st, c++)
OpenCV: 14441.ms(full optiz)
Remark: We're 406% (#4 times) faster.
OpenCV uses SSE and MT for the filtering, but we're #4 times faster thanks to our hand-written AVX code and smart context switching in our thread pools.
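The parameters in the header map directly onto cv::adaptiveThreshold; a minimal sketch with a placeholder input:
  #include <opencv2/imgproc.hpp>
  void adaptive_threshold() {
      cv::Mat gray(720, 1282, CV_8UC1), bin; // placeholder grayscale frame
      cv::adaptiveThreshold(gray, bin, /*maxValue=*/255, cv::ADAPTIVE_THRESH_MEAN_C,
                            cv::THRESH_BINARY_INV, /*blockSize=*/5, /*C=*/8);
  }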
=== Otsu Thresholding (1282, 720) (#10k times) ===
CompV: 1253.ms(full optiz), 8723.ms(st, c++)
OpenCV: 6254.ms (full optiz)
Remark: We're 500% (#5 times) faster.
OpenCV uses SSE2 and MT for global thresholding but we're #5 times faster.
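On the OpenCV side, Otsu thresholding presumably means the THRESH_OTSU flag of cv::threshold; a minimal sketch:
  #include <opencv2/imgproc.hpp>
  void otsu_threshold() {
      cv::Mat gray(720, 1282, CV_8UC1), bin; // placeholder grayscale frame
      // The threshold argument (0) is ignored; Otsu computes the global threshold itself.
      double otsuThresh = cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
      (void)otsuThresh;
  }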
=== BruteForce matcher (KNN = 2, crosscheck = false, norm = hamming) (200, 258) (#1k times) ===
CompV: 260.ms(mt,asm,sse42), 370.ms(mt,c++,sse42), 837.ms(st,asm,sse42), 1108.ms(st,c++,sse42), 6060.ms (st,c++)
OpenCV: 7384.ms(full optiz)
Remark: We're 2840% (#28 times) faster.
Thanks to the popcnt SSE42 instruction, multi-threading, cache blocking and many other optiz.
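A sketch of the OpenCV matcher setup under the listed parameters (interpreting "(200, 258)" as the shape of the 8u descriptor matrices is an assumption):
  #include <opencv2/features2d.hpp>
  #include <vector>
  void bruteforce_knn2_hamming() {
      cv::Mat queryDesc(200, 258, CV_8U), trainDesc(200, 258, CV_8U); // placeholder binary descriptors
      std::vector<std::vector<cv::DMatch>> matches;
      cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/false);
      matcher.knnMatch(queryDesc, trainDesc, matches, /*k=*/2);
  }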
=== Canny edge detector (Kernel size = 3, tLow=59, tHigh=119, gradient L1 compare) (1282, 720) (#10k times) ===
CompV: 14903.ms(full optiz)
OpenCV: 63434.ms(full optiz)
Remark: We're 425% (#4 times) faster.
The tLow and tHigh values are very important for the speed comparison (high values -> less time spent in hysteresis and NMS).
The test image contains many edges, which is why it's unusually slow. Canny on an HD image should take ~1.ms (which means ~1k fps).
Both CompV and OpenCV are using the L1 gradient.
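The header parameters map onto cv::Canny as follows (placeholder input; L2gradient=false selects the L1 gradient compare):
  #include <opencv2/imgproc.hpp>
  void canny_l1() {
      cv::Mat gray(720, 1282, CV_8UC1), edges; // placeholder grayscale frame
      cv::Canny(gray, edges, /*tLow=*/59, /*tHigh=*/119, /*apertureSize=*/3, /*L2gradient=*/false);
  }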
=== HoughLines (Canny: Kernel size = 3, tLow=59, tHigh=119, rho = 1.0, theta = pi/180, threshold = 100) (1282, 720) (#1k times) ===
CompV(SHT): 10367.ms(full optiz)
CompV(KHT): 1413.ms(full optiz)
OpenCV: 25219.ms(full optiz)
Remark: We're 1784% (almost #18 times) faster.
Test done with a dense image (many non-zero pixels). In a normal use case our KHT implementation should work at ~1k fps while OpenCV works at ~40 fps.
The SHT implementation is not used in commercial products for now and needs more optimizations.
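A sketch of the OpenCV side (Canny followed by the standard Hough transform; CompV's KHT/SHT variants have no direct OpenCV equivalent):
  #include <opencv2/imgproc.hpp>
  #include <vector>
  void houghlines_standard() {
      cv::Mat gray(720, 1282, CV_8UC1), edges; // placeholder grayscale frame
      std::vector<cv::Vec2f> lines;
      cv::Canny(gray, edges, 59, 119, 3, false);                                // same Canny params as above
      cv::HoughLines(edges, lines, /*rho=*/1.0, /*theta=*/CV_PI / 180.0, /*threshold=*/100);
  }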
=== MorphOp Erode (Strel: 3x3 MORPH_CROSS) (1285, 1285) (8u array and strel) (#10k times) ===
CompV: 449.ms(full optiz)
OpenCV: 4700.ms(full optiz)
Remark: We're 1046% (#10 times) faster.
OpenCV's morphOps are multi-threaded and use SSE (vecOp). Used code: morphologyEx(MORPH_ERODE).
Using Cross because it's not a separable filter and requires more processing.
https://en.wikipedia.org/wiki/Erosion_(morphology)
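The morphologyEx(MORPH_ERODE) call referenced above, sketched with the 3x3 cross structuring element (placeholder input; the Close test below is identical apart from cv::MORPH_CLOSE):
  #include <opencv2/imgproc.hpp>
  void morph_erode_cross3x3() {
      cv::Mat src(1285, 1285, CV_8UC1), dst; // placeholder 8u array
      cv::Mat strel = cv::getStructuringElement(cv::MORPH_CROSS, cv::Size(3, 3));
      cv::morphologyEx(src, dst, cv::MORPH_ERODE, strel);
  }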
=== MorphOp Close (Strel: 3x3 MORPH_CROSS) (1285, 1285) (8u array and strel) (#10k times) ===
CompV: 794.ms(full optiz)
OpenCV: 10075.ms(full optiz)
Remark: We're 1268% (almost #13 times) faster.
OpenCV's morphOps are multi-threaded and use SSE (vecOp). Used code: morphologyEx(MORPH_CLOSE).
Using Cross because it's not a separable filter and requires more processing.
https://en.wikipedia.org/wiki/Closing_(morphology)
=== Connected Component Labeling *Processing* (text_1122x1182_white.png) (#10k times) ===
CompV(PLSL): 2973.ms(full optiz)
OpenCV: 66416.ms(full optiz)
Remark: We're 2233% (#22 times) faster.
This is a high-density document full of text, chosen to stress the function.
We're using Parallel Light Speed Labeling (https://hal.archives-ouvertes.fr/hal-01361188/document) and we're not extracting the points but
a run-length (LEA) table (little data). This makes feature computation (centroid, perimeter, bounding boxes...) insanely fast.
=== Connected Component Labeling *Extracting bounding boxes* (text_1122x1182_white.png) (#10k times) ===
CompV(PLSL): 956.ms(full optiz)
OpenCV: 2627.ms(full optiz)
Remark: We're 274% (almost #3 times) faster.
Our implementation is single-threaded. We must provide an MT implementation to improve performance.
OpenCV uses SSE42.
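For the OpenCV side, extracting labels plus bounding boxes presumably goes through cv::connectedComponentsWithStats (this is a guess; the exact call used in the test is not stated). A sketch, with a placeholder binarized page standing in for text_1122x1182_white.png:
  #include <opencv2/imgproc.hpp>
  #include <vector>
  void ccl_bounding_boxes() {
      cv::Mat bin(1182, 1122, CV_8UC1), labels, stats, centroids; // placeholder binarized page
      std::vector<cv::Rect> boxes;
      int n = cv::connectedComponentsWithStats(bin, labels, stats, centroids, 8, CV_32S);
      for (int i = 1; i < n; ++i) { // label 0 is the background
          boxes.emplace_back(stats.at<int>(i, cv::CC_STAT_LEFT),  stats.at<int>(i, cv::CC_STAT_TOP),
                             stats.at<int>(i, cv::CC_STAT_WIDTH), stats.at<int>(i, cv::CC_STAT_HEIGHT));
      }
  }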
=== MSER (ocr2_1122x1182_gray.yuv, delta=2, min_area=(0.0055 * 0.0055), max_area=(0.8 * 0.15), max_variation=0.3, min_diversity=0.2, connectivity=8) (#1k times) ===
CompV(LMSER): 27072.ms(full optiz)
OpenCV: 651744.ms(full optiz)
[1] https://github.com/idiap/mser: 49779.ms(full optiz)
Remark: We're 2407% (#24 times) faster than OpenCV and 183% faster than [1].
This is a larger-than-HD image with many regions and low params to stress the implementation.
Please note that the comparison against [1] is useless as they don't extract the points (moments only). Extracting all the points (what we and OpenCV are doing)
is the bottleneck.
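A sketch of the OpenCV MSER configuration matching the header (cv::MSER takes absolute pixel areas and has no explicit connectivity parameter, so the relative min/max areas are scaled by the image area here; that mapping is an assumption):
  #include <opencv2/features2d.hpp>
  #include <vector>
  void mser_detect() {
      cv::Mat gray(1182, 1122, CV_8UC1); // placeholder for ocr2_1122x1182_gray.yuv
      const double area = double(gray.cols) * double(gray.rows);
      cv::Ptr<cv::MSER> mser = cv::MSER::create(
          /*delta=*/2,
          /*min_area=*/int(0.0055 * 0.0055 * area),
          /*max_area=*/int(0.8 * 0.15 * area),
          /*max_variation=*/0.3,
          /*min_diversity=*/0.2);
      std::vector<std::vector<cv::Point> > regions;
      std::vector<cv::Rect> bboxes;
      mser->detectRegions(gray, regions, bboxes); // extracts all region points + bounding boxes
  }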
=== HOG (input=equirectangular_1282x720.jpg, winSize=inputSize, blockSize=(8,8), blockStride=(4,4), cellSize=(8,8), nBins=9, norm=L2Hyst, gammaCorrection=false, nlevels=64) (1282, 720) (#1k times) ===
CompV: 5198.ms(full optiz)
OpenCV: 29356.ms(full optiz)
Remark: We're 564% (almost #6 times) faster.
OpenCV uses SSE2 and pre-computed indices. To be fair we created the HOG descriptor outside of the loop to make sure OpenCV reuses the computed indices.
We tested the HOG descriptor with many products and machine-learning algorithms (SVM, Gradient Boosting...) and we always get better results with our implementation (versus OpenCV).
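A sketch of the OpenCV HOG setup described above, with the descriptor constructed once outside the timing loop (OpenCV requires (winSize - blockSize) to be a multiple of blockStride, so the window width is rounded down to 1280 here; that adjustment is an assumption):
  #include <opencv2/objdetect.hpp>
  #include <vector>
  void hog_compute() {
      cv::Mat gray(720, 1282, CV_8UC1); // placeholder for equirectangular_1282x720.jpg
      cv::HOGDescriptor hog(cv::Size(1280, 720),            // winSize ~ input size
                            cv::Size(8, 8), cv::Size(4, 4), // blockSize, blockStride
                            cv::Size(8, 8), 9,              // cellSize, nBins
                            1, -1.0, cv::HOGDescriptor::L2Hys, 0.2,
                            /*gammaCorrection=*/false, /*nlevels=*/64);
      std::vector<float> descriptors;
      for (int i = 0; i < 1000; ++i) {
          hog.compute(gray, descriptors); // #1k iterations as in the test, descriptor reused
      }
  }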