Difference between revisions of "Performance:OpenCV:BoofCV"

From BoofCV
 
(31 intermediate revisions by the same user not shown)
Line 1: Line 1:


The following is a comparison of similar algorithms in BoofCV and OpenCV for speed. Ten different algorithms were tuned to produce similar results and then run on three different architectures, desktop computer running AMD64, Raspberry PI 3B+, and ODROID XU4. Algorithms covered go from low level image processing (Gaussian blur) to mid-level features (SIFT).
The following is a comparison of similar algorithms in BoofCV and OpenCV for speed. Ten different algorithms were tuned to produce similar results and then run on three different architectures, desktop computer running on a Core i7-6700, Raspberry PI 3B+, and ODROID XU4. Algorithms covered go from low level image processing (Gaussian blur) to mid-level features (SIFT).


= Introduction =
= Introduction =
Line 6: Line 6:
It’s been a while since the last runtime comparison was done between [[Performance:OpenCV:BoofCV:2011|BoofCV and OpenCV in 2011]]. Back then I thought that neural networks (NN) were essentially worthless and C/C++ was the dominant language. Now NN dominate the field and almost everyone use Python. The main event which prompted this benchmark to be done again was concurrency (a.k.a. threads) being added to BoofCV.
It’s been a while since the last runtime comparison was done between [[Performance:OpenCV:BoofCV:2011|BoofCV and OpenCV in 2011]]. Back then I thought that neural networks (NN) were essentially worthless and C/C++ was the dominant language. Now NN dominate the field and almost everyone use Python. The main event which prompted this benchmark to be done again was concurrency (a.k.a. threads) being added to BoofCV.


The goal of this benchmark is to replicate the speed that an “average” user can expect when using either of these libraries for low level to mid-level image processing/computer vision routines. This will cover image convolution up to feature detectors. NN and machine learning in general is not covered since neither library is particularly good in that domain. It is assumed that the average user will install the library using the easiest possible method, cut and paste example code, and do minimal optimiations. For OpenCV this means “pip install opencv-python” and for BoofCV using pre-built jars on Maven Central.
The goal of this benchmark is to replicate the speed that an “average” user can expect when using either of these libraries for low level to mid-level image processing/computer vision routines. This will cover image convolution up to feature detectors. NN and machine learning are not included. It is assumed that the average user will install the library using the easiest possible method, cut and paste example code, and do some simple optimizations. For OpenCV this means “pip install opencv-python” and for BoofCV using pre-built jars on Maven Central. If memory or data structures can be easily recycled then they are.


While this approach sounds easy enough it proved to be impossible to follow 100% and exceptions were made. Another issue is that none of the algorithms were implemented the same. In fact, only three of them have a chance of producing nearly identical results; Gaussian blur, Sobel, and Histogram. The others have known major differences. For example, BoofCV’s Canny implementation forces you to blur the image while OpenCV doesn’t. BoofCV’s SURF implementation produces significantly better features than OpenCV’s [1]. The default settings in each library can produce drastically different results. As a result, tuning criteria are clearly stated and followed in an attempt to produce comparable output.
While this approach sounds easy enough it proved to be impossible to follow 100% and exceptions were made, discussed below. Another issue is that none of the algorithms were implemented the same. In fact, only three of them have a chance of producing nearly identical results; Gaussian blur, Sobel, and Histogram. The others have known major differences. For example, BoofCV’s Canny implementation forces you to blur the image while OpenCV doesn’t. BoofCV’s SURF implementation produces significantly better features than OpenCV’s ([[Performance:SURF|SURF Benchmark]]). The default settings in each library can produce drastically different results. Thus tuning criteria are clearly stated and followed in an attempt to produce comparable output.


Source code used in this benchmark can be found below. To replicate the results please carefully read the instructions. Especially for architectures with ARM processors, it took about 3 attempts (or 8 hrs) to get a good build of OpenCV running on Raspberry PI.
To replicate the results please carefully read the instructions on this page and in the source code. Especially for architectures with ARM processors, it took about 3 attempts (or 8 hrs) to get a good build of OpenCV running on Raspberry PI. Suggestions for improving the fairness of this comparison are welcomed.


<center>Benchmark Source Code:<br>https://github.com/lessthanoptimal/SpeedTestCV</center>
<center>Benchmark Source Code:<br>https://github.com/lessthanoptimal/SpeedTestCV</center>
Line 22: Line 22:
|-
|-
|| OpenCV || 4.0.1
|| OpenCV || 4.0.1
|}
{| class="wikitable"
|-
! Device !! CPU !! Cores !! RAM !! OS
|-
|| Desktop || Core i7-6700 || 4 || 32 GB || Ubuntu 18.04.2 @ 64bit
|-
|| [https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/ Raspberry PI 3B+] || Cortex-A53 ||  4 || 1 GB || Raspbian 9.4 @ 32bit
|-
|| [https://wiki.odroid.com/odroid-xu4/odroid-xu4 ODROID XU4] || Cortex-A15 and A7 || 4+4 || 2 GB || Ubuntu 16.04.4 @ 32bit
|}
|}
</center>
</center>
Line 28: Line 38:
To cite this article use the following:
To cite this article use the following:
<pre>@misc{BoofVsOpenCV,
<pre>@misc{BoofVsOpenCV,
   title = {Performance of OpenCV vs BoofCV 2019},
   title = {Performance of OpenCV vs BoofCV: March 2019},
   howpublished = {\url{https://boofcv.org/index.php?title=Performance:OpenCV:BoofCV}},
   howpublished = {\url{https://boofcv.org/index.php?title=Performance:OpenCV:BoofCV}},
   author = {Peter Abeles},
   author = {Peter Abeles},
Line 34: Line 44:
}</pre>
}</pre>


= Functions and Tuning =
= Algorithms, Tuning, and Exceptions =


<center>
<center>
Line 53: Line 63:
|-
|-
|| Binary Contour || External Only. 4-Connect Rule. Find around 1,100,000 points
|| Binary Contour || External Only. 4-Connect Rule. Find around 1,100,000 points
|-
|| Good Features || Shi-Tomasi corners. Unweighted. Radius = 10 pixels. 3,000 Features
|-
|-
|| Hough Line || Polar Variant. Resolutions: angle = 1 degree, range = 5 pixels. Detect 500 lines.
|| Hough Line || Polar Variant. Resolutions: angle = 1 degree, range = 5 pixels. Detect 500 lines.
|-
|-
|| SIFT || Detect and Describe. 5 Octaves. 10,000 Features
|| SIFT || Detect and Describe. 5 Octaves. 3 Scales. No doubling of first octave. 10,000 Features
|-
|-
|| SURF || Detect and Describe. 4 Octaves. 4 Scales. 10,000 Features
|| SURF || Detect and Describe. 4 Octaves. 4 Scales. 10,000 Features
Line 62: Line 74:
</center>
</center>


Two images were used in these test. The first image was 3648 x 2736 pixels of a chessboard pattern and a wood background. The second was a local mean thresholded version of the first for use in binary operators. The operation parameters were selected to be reasonable and remove potential biases. One factor that determines how fast a feature detector + descriptor run are the number of features detected. OpenCV's SIFT implementation appears to have its default settings selected for speed. While BoofCV's SIFT has defaults designed to replicate the stability found in the original paper. For SIFT the number of octaves was selected to more closely match the SIFT paper.
Two images were used in these test. The first image was 3648 x 2736 pixels of a [https://github.com/lessthanoptimal/SpeedTestCV/blob/master/data/chessboard_large.jpg chessboard pattern with a wood background] and was processed as an 8-bit gray scale image. The second was a binary version of the just mentioned image for use by binary operators. This ensured that the binary operators had the same initial input. Tuning parameters and tuning goals mentioned above were selected based on common use cases and to remove potential biases. As an example, one factor that determines how fast a feature detector + descriptor run are the number of features detected since each detected feature must be described.


As previously mentioned, tuning these two libraries to produce similar results is a very difficult if not impossible problem. An attempt was made to be fair. See in code comments for specific details.
As previously mentioned, tuning these two libraries to produce similar results is a very difficult if not impossible problem. An attempt was made to be fair. See in code comments for specific details for why values were selected. The best way to ensure that two implementations are "equivalent" is to apply them to the same task and measure their performance. That approach is very labor intensive and often impossible due to difference in quality between two implementations, see [[Performance:SURF|the SURF Benchmark]] as an example, and was not done here.


== Exceptions to the Rules ==
== Exceptions to the Rules ==


SIFT and SURF are covered by patents (or were, SIFT’s just expired this month) and not included in the pip package. That means you need to build it from scratch. Major issues were found on ARM architectures where there was no version of OpenCV 4 that could be easily installed and the default JVM included lack optimizations for ARM. The build settings for OpenCV are included below. An attempt was made to find the best settings and different websites had different recommendations.
SIFT and SURF are covered by patents (SIFT's parent expires in 2020) and not included in the pip package. That means you need to build OpenCV from scratch. Thus, on Desktop, SIFT and SURF are running code custom built for the desktop's architecture breaking the "average user" rule. Major issues were found on ARM architectures where there was no version of OpenCV 4 that could be easily installed and for BoofCV, the default JVM included lacked optimizations for ARM making it run very slow!
 
The build settings for OpenCV on ARM are included below. An attempt was made to find the best settings and different websites had different recommendations. I picked one which explicitly enabled CPU specific optimizations.
 
<pre>cmake -D CMAKE_BUILD_TYPE=RELEASE    -D CMAKE_INSTALL_PREFIX=/usr/local    -D INSTALL_PYTHON_EXAMPLES=ON    -D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib-4.0.1/modules    -D ENABLE_NEON=ON    -D ENABLE_VFPV3=ON    -D WITH_FFMPEG=ON    -D WITH_GSTREAMER=ON    -D BUILD_EXAMPLES=ON -D OPENCV_ENABLE_NONFREE=ON ..</pre>
 
Instructions for installing the JVM used on ARM architectures:
 
[http://hirt.se/blog/?p=1116 jdk11.0.2-linux-arm32-vfp-hflt]
 
While outside of the scope of this benchmark, building OpenCV on your specific architecture does provide significant performance boost for some operations. Gaussian blur ran about 2x faster on Desktop when custom built.


= Results =
= Results =


Results are shown below for Intel Core i7, Odroid XU4, and Raspberry PI 3B+. Click on the arrow to change which results you are viewing.  
Results are shown below for Intel Core i7, Odroid XU4, and Raspberry PI 3B+. Click on the arrow to change which results you are viewing.  
OpenCV does very well in the Gaussian Blur test due to its hand crafted SIMD instructions being multi-threaded. For other low level SIMD friendly operations the speed difference isn't as great between Java and the C code, so it tends to come down to threading. SURF isn't a very SIMD friendly algorith and the major difference in performance come from algorithmic details. The main surprise is SIFT, which should have crushed BoofCV because the most computationally expensive part is apply Gaussian blur. OpenCV's SIFT implementation took a speed it when it was tuned to more closely approximate the original algorithm.
Results between architectures are more consistent than I thought they would be. OpenCV on desktop used the generic version contained in pypy (except for SIFT and SURF) while OpenCV for ARM architectures had been custom built for each architecture. Winners and near ties are essentially identical, except for mean-threshold on ODROID. SIFT was unable to finish computing on ARM processors. Lack of memory is the suspect. BoofCV's SIFT implementation avoids the need to save each layer at the same time.


<gallery mode="slideshow">
<gallery mode="slideshow">
Line 83: Line 101:
File:Boof_vs_opencv_rpi3BP_2019.png
File:Boof_vs_opencv_rpi3BP_2019.png
</gallery>
</gallery>
Results between architectures are more consistent than it was thought they would be. OpenCV on desktop used the generic version contained in pypy (except for SIFT and SURF) while OpenCV for ARM architectures had been custom built for each architecture. Winners and near ties are effectively the same. OpenCV's SIFT was unable to finish computing on ARM processors, threw out of memory error or just died. OpenCV's SIFT code has not been inspect to root cause this problem, but BoofCV's implementation was designed to recycle images as much as possible.
For low level image processing routines there is less room for implementation variability and results are easier to explain. If OpenCV was optimized to the greatest extent possible, it should output perform BoofCV in low level operations which are array heavy by about 2x to 4x, based on past experience. This is because hand crafted architecture specific code or GCC will typically generate more efficient SIMD instructions than JVM. In practice code is rarely optimized to this extent as is shown by OpenCV. An example of what this level of optimization can achieve is seen with Gaussian blur where OpenCV has hand crafted SIMD instructions and a concurrent implementation and runs 3x faster than BoofCV's own concurrent implementation. Despite all of OpenCV's apparent advantages BoofCV out performs OpenCV's Sobel, histogram, mean threshold implementations is due to a mixture of this code lacking the refinement of Gaussian blur and BoofCV's code being concurrent. It's worth noting that both libraries have spotty concurrent coverage. BoofCV's dominating performance for "good features" was unexpected is likely caused by a superior implementation in combination with BoofCV's code being concurrent.
For high level operations, implementation details matter more and data structures tend to be sparse, partially negating the compiler advantage of OpenCV. Algorithms are also more complex making explaining performance differences much more difficult. This can be most clearly seen with SURF, where BoofCV was 4x faster and produced more stable features. The main surprise is SIFT, which should have crushed BoofCV because the most computationally expensive part is applying Gaussian blur many times. OpenCV has a large algorithmic advantage with Canny because BoofCV requires Gaussian blur while OpenCV does not. Both BoofCV and OpenCV lack concurrent implementations of outer contour tracing and the most probable explanation is that BoofCV's algorithm is simply faster. The same applies to hough polar.
To help illustrate the points above, here is a table showing single thread performance for select operations on the desktop Core i7 computer.  Note how in some cases relative performance changes and in other not. It's rare for users to turn off threading which is why single thread performance isn't discussed in more detail.
<center>
{| class=wikitable
|+ Single thread performance on Desktop i7 for select operations. Milliseconds
! Operation || BoofCV || OpenCV
|-
| Gaussian Blur || 144 || 74
|-
| Mean Threshold || 78 || 16
|-
| Good Features || 172 || 282
|-
| Outer Contour ||  47 || 85
|}
</center>


= Conclusions =
= Conclusions =


In this benchmark, BoofCV out performed OpenCV in 8 out of 10 benchmarks on desktop and 7 out of 10 on ARM processors. OpenCV does better in low level image processing routines where hand optimized SIMD instructions were injected into the code. For high level operations performance starts to tilt even more so in BoofCV’s direction. The improvement in performance in low level operation with BoofCV is likely due to the addition of concurrent implementations.
Two computer vision libraries, BoofCV and OpenCV, were compared against each other for speed using a small subset of commonly used computer vision operations. BoofCV was the top performer in 6 out of 10, there was a tie in 2 operations, and OpenCV did best in 2 operations. Tests were performed on desktop and embedded platforms with similar results across the board. Both libraries were given the same input and turned to produce similar output.
 
= End Comment =


The last time I published this benchmark I was a bit surprised at the lack of reading comprehension exhibited by academic paper authors. The results were clearly split down the middle, yet most people somehow concluded that OpenCV was the clear winner! The real answer to which library is faster/better is “it depends”. If you ignore language preference then the following would be true. Is your problem heavy in pure image convolution? OpenCV is best for you! Do you want to use a fast and stable QR code detector ([[Performance:QrCode|see these results]]) then BoofCV is for you.
Explaining the reason for the differences is difficult due the two libraries having very different architectures. For low level array heavy operations OpenCV has a higher theoretical performance limit than BoofCV due to its ability to include code tailored to specific architectures and GCC generating more effective SIMD instructions than the JVM. As is often the case, due to level of effort, it appears that OpenCV only came close to achieving this theoretical performance with Gaussian blur and not the other operations tested. BoofCV out performed OpenCV in other low level operations and this can sometimes be explained by BoofCV having better concurrent coverage (i.e. comparable single thread performance) and/or a more efficient implementations (i.e. better single thread performance). For high level operations data structures tend to be sparse, partially negating the SIMD performance advantage of a C/C++ implementation.

Latest revision as of 14:59, 18 May 2019

The following is a comparison of similar algorithms in BoofCV and OpenCV for speed. Ten different algorithms were tuned to produce similar results and then run on three different architectures, desktop computer running on a Core i7-6700, Raspberry PI 3B+, and ODROID XU4. Algorithms covered go from low level image processing (Gaussian blur) to mid-level features (SIFT).

Introduction

It’s been a while since the last runtime comparison was done between BoofCV and OpenCV in 2011. Back then I thought that neural networks (NN) were essentially worthless and C/C++ was the dominant language. Now NN dominate the field and almost everyone uses Python. The main event which prompted this benchmark to be done again was concurrency (a.k.a. threads) being added to BoofCV.

The goal of this benchmark is to replicate the speed that an “average” user can expect when using either of these libraries for low level to mid-level image processing/computer vision routines. This will cover image convolution up to feature detectors. NN and machine learning are not included. It is assumed that the average user will install the library using the easiest possible method, cut and paste example code, and do some simple optimizations. For OpenCV this means “pip install opencv-python” and for BoofCV using pre-built jars on Maven Central. If memory or data structures can be easily recycled then they are.

While this approach sounds easy enough, it proved to be impossible to follow 100% and exceptions were made, as discussed below. Another issue is that none of the algorithms were implemented the same way. In fact, only three of them have a chance of producing nearly identical results: Gaussian blur, Sobel, and Histogram. The others have known major differences. For example, BoofCV’s Canny implementation forces you to blur the image while OpenCV’s doesn’t. BoofCV’s SURF implementation produces significantly better features than OpenCV’s (SURF Benchmark). The default settings in each library can produce drastically different results. Thus tuning criteria are clearly stated and followed in an attempt to produce comparable output.

To replicate the results please carefully read the instructions on this page and in the source code. This is especially important for architectures with ARM processors: it took about 3 attempts (or 8 hrs) to get a good build of OpenCV running on Raspberry PI. Suggestions for improving the fairness of this comparison are welcomed.

Benchmark Source Code:
https://github.com/lessthanoptimal/SpeedTestCV
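{| class="wikitable"
|-
! Library !! Version
|-
|| BoofCV || 0.33.1
|-
|| OpenCV || 4.0.1
|}
{| class="wikitable"
|-
! Device !! CPU !! Cores !! RAM !! OS
|-
|| Desktop || Core i7-6700 || 4 || 32 GB || Ubuntu 18.04.2 @ 64bit
|-
|| Raspberry PI 3B+ || Cortex-A53 || 4 || 1 GB || Raspbian 9.4 @ 32bit
|-
|| ODROID XU4 || Cortex-A15 and A7 || 4+4 || 2 GB || Ubuntu 16.04.4 @ 32bit
|}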


To cite this article use the following:

@misc{BoofVsOpenCV,
  title = {Performance of OpenCV vs BoofCV: March 2019},
  howpublished = {\url{https://boofcv.org/index.php?title=Performance:OpenCV:BoofCV}},
  author = {Peter Abeles},
  originalyear = {03.22.2019}
}

Algorithms, Tuning, and Exceptions

{| class="wikitable"
|-
! Operation !! Tuning Target
|-
|| Gaussian Blur || Radius = 5
|-
|| Sobel Gradient || 3x3 Kernel
|-
|| Local Mean Thresholding || Radius = 5
|-
|| Image Histogram ||
|-
|| Canny Edge || Output edge pixel chains. ~550,000 unique pixels in chains
|-
|| Binary Contour || External Only. 4-Connect Rule. Find around 1,100,000 points
|-
|| Good Features || Shi-Tomasi corners. Unweighted. Radius = 10 pixels. 3,000 Features
|-
|| Hough Line || Polar Variant. Resolutions: angle = 1 degree, range = 5 pixels. Detect 500 lines.
|-
|| SIFT || Detect and Describe. 5 Octaves. 3 Scales. No doubling of first octave. 10,000 Features
|-
|| SURF || Detect and Describe. 4 Octaves. 4 Scales. 10,000 Features
|}
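
The blur and threshold targets above are stated as a radius, while OpenCV's Python API expects a full kernel width. The sketch below shows that conversion. The helper names are hypothetical, and the sigma formula is the default that OpenCV documents for `getGaussianKernel` when sigma is left non-positive; it is not necessarily the value used in the benchmark code.

```python
def radius_to_width(radius: int) -> int:
    """Convert a kernel radius to the odd kernel width it implies."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return 2 * radius + 1


def opencv_default_sigma(ksize: int) -> float:
    """Sigma that OpenCV documents as the default for a given kernel width."""
    return 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8


# Radius = 5 from the tuning table corresponds to an 11 pixel wide kernel.
width = radius_to_width(5)
sigma = opencv_default_sigma(width)
print(f"kernel width = {width}, default sigma = {sigma:.2f}")
```

Checking that both libraries are given the same kernel size and sigma is one way to confirm that a blur comparison is like for like.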

Two images were used in these tests. The first image was 3648 x 2736 pixels of a chessboard pattern with a wood background and was processed as an 8-bit gray scale image. The second was a binary version of the first image for use by binary operators. This ensured that the binary operators had the same initial input. The tuning parameters and tuning goals mentioned above were selected based on common use cases and to remove potential biases. As an example, one factor that determines how fast a feature detector + descriptor runs is the number of features detected, since each detected feature must be described.

As previously mentioned, tuning these two libraries to produce similar results is a very difficult if not impossible problem. An attempt was made to be fair. See the comments in the code for specific details on why values were selected. The best way to ensure that two implementations are "equivalent" is to apply them to the same task and measure their performance. That approach is very labor intensive and often impossible due to differences in quality between two implementations (see the SURF Benchmark as an example) and was not done here.

Exceptions to the Rules

SIFT and SURF are covered by patents (SIFT's patent expires in 2020) and are not included in the pip package. That means you need to build OpenCV from scratch. Thus, on Desktop, SIFT and SURF are running code custom built for the desktop's architecture, breaking the "average user" rule. Major issues were found on ARM architectures, where there was no version of OpenCV 4 that could be easily installed and, for BoofCV, the default JVM included lacked optimizations for ARM, making it run very slowly!

The build settings for OpenCV on ARM are included below. An attempt was made to find the best settings and different websites had different recommendations. I picked one which explicitly enabled CPU specific optimizations.

cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D INSTALL_PYTHON_EXAMPLES=ON \
      -D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib-4.0.1/modules \
      -D ENABLE_NEON=ON \
      -D ENABLE_VFPV3=ON \
      -D WITH_FFMPEG=ON \
      -D WITH_GSTREAMER=ON \
      -D BUILD_EXAMPLES=ON \
      -D OPENCV_ENABLE_NONFREE=ON ..

Instructions for installing the JVM used on ARM architectures:

jdk11.0.2-linux-arm32-vfp-hflt (http://hirt.se/blog/?p=1116)

While outside of the scope of this benchmark, building OpenCV on your specific architecture does provide a significant performance boost for some operations. Gaussian blur ran about 2x faster on Desktop when custom built.

Results

Results are shown below for Intel Core i7, Odroid XU4, and Raspberry PI 3B+. Click on the arrow to change which results you are viewing.

Results between architectures are more consistent than expected. OpenCV on desktop used the generic version from PyPI (except for SIFT and SURF) while OpenCV for ARM architectures was custom built for each architecture. Winners and near ties are effectively the same. OpenCV's SIFT was unable to finish computing on ARM processors; it either threw an out of memory error or just died. OpenCV's SIFT code has not been inspected to find the root cause of this problem, but BoofCV's implementation was designed to recycle images as much as possible.
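
As a rough sanity check on the out of memory observation, the sketch below estimates how much memory a SIFT scale space would need if every layer were held in memory at once. It is only a back-of-envelope estimate under assumptions that are not taken from either library's code: 32-bit float pixels, each octave at half the resolution of the previous one, and the layout from the SIFT paper of `scales + 3` Gaussian images plus `scales + 2` difference-of-Gaussian images per octave. The function name is hypothetical.

```python
def sift_pyramid_bytes(width, height, octaves, scales, bytes_per_pixel=4):
    """Estimate bytes needed to hold a full SIFT scale space at once.

    Assumes (scales + 3) Gaussian and (scales + 2) DoG images per octave,
    with the image size halved between octaves.
    """
    images_per_octave = (scales + 3) + (scales + 2)
    total = 0
    for _ in range(octaves):
        total += width * height * images_per_octave * bytes_per_pixel
        width, height = width // 2, height // 2
    return total


# Benchmark image size and SIFT tuning target: 5 octaves, 3 scales.
estimate = sift_pyramid_bytes(3648, 2736, octaves=5, scales=3)
print(f"approximate scale space size: {estimate / 2**20:.0f} MiB")
```

Under these assumptions the scale space alone is on the order of half a gigabyte, a large share of the 1 GB available on the Raspberry PI 3B+. That is consistent with a memory problem but does not confirm it.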

For low level image processing routines there is less room for implementation variability and results are easier to explain. If OpenCV was optimized to the greatest extent possible, it should outperform BoofCV in low level operations which are array heavy by about 2x to 4x, based on past experience. This is because hand crafted architecture specific code or GCC will typically generate more efficient SIMD instructions than the JVM. In practice code is rarely optimized to this extent, as is shown by OpenCV. An example of what this level of optimization can achieve is seen with Gaussian blur, where OpenCV has hand crafted SIMD instructions and a concurrent implementation and runs 3x faster than BoofCV's own concurrent implementation. Despite all of OpenCV's apparent advantages, BoofCV outperforms OpenCV's Sobel, histogram, and mean threshold implementations. This is due to a mixture of that code lacking the refinement of Gaussian blur and BoofCV's code being concurrent. It's worth noting that both libraries have spotty concurrent coverage. BoofCV's dominating performance for "good features" was unexpected and is likely caused by a superior implementation in combination with BoofCV's code being concurrent.

For high level operations, implementation details matter more and data structures tend to be sparse, partially negating the compiler advantage of OpenCV. Algorithms are also more complex, making performance differences much more difficult to explain. This can be most clearly seen with SURF, where BoofCV was 4x faster and produced more stable features. The main surprise is SIFT, where OpenCV should have crushed BoofCV because the most computationally expensive part is applying Gaussian blur many times. OpenCV has a large algorithmic advantage with Canny because BoofCV requires Gaussian blur while OpenCV does not. Both BoofCV and OpenCV lack concurrent implementations of outer contour tracing and the most probable explanation is that BoofCV's algorithm is simply faster. The same applies to Hough polar.

To help illustrate the points above, here is a table showing single thread performance for select operations on the desktop Core i7 computer. Note how in some cases relative performance changes and in others it does not. It's rare for users to turn off threading, which is why single thread performance isn't discussed in more detail.

{| class="wikitable"
|+ Single thread performance on Desktop i7 for select operations. Milliseconds
! Operation !! BoofCV !! OpenCV
|-
| Gaussian Blur || 144 || 74
|-
| Mean Threshold || 78 || 16
|-
| Good Features || 172 || 282
|-
| Outer Contour || 47 || 85
|}
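
To make the single thread numbers easier to compare, the short sketch below turns each pair of timings from the table into a "which library is faster, and by how much" ratio. The timings are copied from the table; the function and variable names are hypothetical.

```python
# Single thread timings in milliseconds: (BoofCV, OpenCV), from the table above.
single_thread_ms = {
    "Gaussian Blur": (144, 74),
    "Mean Threshold": (78, 16),
    "Good Features": (172, 282),
    "Outer Contour": (47, 85),
}


def faster_library(boofcv_ms, opencv_ms):
    """Return (winner, ratio), where ratio = slower time / faster time."""
    if boofcv_ms == opencv_ms:
        return "tie", 1.0
    if boofcv_ms < opencv_ms:
        return "BoofCV", opencv_ms / boofcv_ms
    return "OpenCV", boofcv_ms / opencv_ms


for operation, (boofcv_ms, opencv_ms) in single_thread_ms.items():
    winner, ratio = faster_library(boofcv_ms, opencv_ms)
    print(f"{operation}: {winner} is {ratio:.1f}x faster")
```

Lower times are better, so the ratio is always the slower time divided by the faster one.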

Conclusions

Two computer vision libraries, BoofCV and OpenCV, were compared against each other for speed using a small subset of commonly used computer vision operations. BoofCV was the top performer in 6 out of 10 operations, there was a tie in 2 operations, and OpenCV did best in 2 operations. Tests were performed on desktop and embedded platforms with similar results across the board. Both libraries were given the same input and tuned to produce similar output.

Explaining the reason for the differences is difficult due to the two libraries having very different architectures. For low level array heavy operations, OpenCV has a higher theoretical performance limit than BoofCV due to its ability to include code tailored to specific architectures and GCC generating more effective SIMD instructions than the JVM. As is often the case, due to the level of effort required, it appears that OpenCV only came close to achieving this theoretical performance with Gaussian blur and not the other operations tested. BoofCV outperformed OpenCV in other low level operations and this can sometimes be explained by BoofCV having better concurrent coverage (i.e. comparable single thread performance) and/or a more efficient implementation (i.e. better single thread performance). For high level operations data structures tend to be sparse, partially negating the SIMD performance advantage of a C/C++ implementation.