Concurrency

From BoofCV

Concurrent algorithms were introduced into BoofCV in version 0.33. Concurrent or multi-threaded algorithms are algorithms which can solve problems with more than one processes running in parallel. This can result in very large speed improvements. On my desktop, each physical CPU core results in a speed up of about 1x for highly paralizable algorithms such as image convolution. Since my system has 4-cores that means a 4x speed up is typical. Not all algorithms have been made concurrent and not all algorithms can be made fully parallel. That means in a typical computer vision algorithm you can't expect a 4x speed up.

By default concurrent algorithms are turned on. If you wish to turn them off that is simple enough, see below.

BoofConcurrency.USE_CONCURRENT = false;

It's highly recommended that you do this at or near the very first line in your program and leave it alone. In theory you can change it back to true if you wish, but not all algorithms will respect that change.

By default the maximum number of threads created is defined by ```Runtime.getRuntime().availableProcessors()```. This is typically the total number of hyperthreads and not cores. It's possible the manually specify the number of threads:

BoofConcurrency.setMaxThreads(5);

That will set the maximum number of threads to 5. Again, it's not a great idea to mess with this value after the program has launched unless you really know what you're doing.

Other Issues:

  1. Some algorithms produce equivalent but non-deterministic output
  2. Concurrent algorithms are never as "bug free" as serial algorithms.

An algorithm like image convolution should produce identical results. However, a feature detector might return the same features but in a different order. Your code shouldn't depend on the order a feature is returned but often it is.

Multi-threaded algorithms are extremely difficult to write correctly and robustly. Bugs might only happen when a 1 in a million timing coincidence occur or only manifest in faster or slow machine. For this reason all concurrent algorithms in BoofCV have an untainted equivalent serial algorithm which makes no calls to threads at all. If you think there might be a bug you can switch to the serial version, then start debugging and preparing a bug report.

JMH Benchmark Results

Benchmark                       (concurrent)  (width)  Mode  Cnt     Score     Error  Units
BenchmarkBlurImageOps.gaussian          true     5000  avgt   10    84.100 ±   6.310  ms/op
BenchmarkBlurImageOps.gaussian         false     5000  avgt   10   328.401 ±   4.105  ms/op
BenchmarkBlurImageOps.mean              true     5000  avgt   10    32.907 ±   0.695  ms/op
BenchmarkBlurImageOps.mean             false     5000  avgt   10   127.657 ±   0.623  ms/op
BenchmarkBlurImageOps.median            true     5000  avgt   10   322.588 ±   6.403  ms/op
BenchmarkBlurImageOps.median           false     5000  avgt   10  1142.279 ± 120.221  ms/op

The table above shows actual performance with and without threads for different image blur operations. My system has 4 cores and 8 hyperthreads. As a result the typical speedup is around 4x. When the number of concurrent threads exceeds the number of cores the gains go down significantly.

Writing Your Own Concurrent Algorithms

All build in concurrent algorithms use a single thread pool. Specifically, they use a single ForkJoinPool. You can access this pool for your use as it's a static variable.

BoofConcurrency.pool

However, it's recommended that you stick to using function provided in BoofCV for parallelizing for loops. Most of the time you will want to use BoofConcurrency.loopFor(). Here's an example of how it can be used. Below is function which sets the value of all pixels less than 200 to 0.

for (int y = 0; y < input.height; y++) {
	int index = input.startIndex + y*input.stride;
	for (int x = 0; x < input.width; x++, index++ ) {
		if( input.data[index] < 200 )
			input.data[index] = 0;
	}
}

The concurrent version looks very similar:

BoofConcurrency.loopFor(0, input.height, y -> {
	int index = input.startIndex + y*input.stride;
	for (int x = 0; x < input.width; x++, index++ ) {
		if( input.data[index] < 200 )
			input.data[index] = 0;
	}
});

Only one line needs to be changed! There are also other function in BoofConcurrency like the loopBlock variants. These are used when you want to run a block of values all in the same thread.

BoofCV Innards

Internally BoofCV follows the following code practices:

  • Always first optimize single threaded algorithms fully first
  • All algorithms must have a single thread version which contain no references to any threaded code
  • Threaded/concurrent algorithms are tested for correctness against the equivalent single threaded code

These code practices assume that multi-threaded can never be fully trusted. I've noticed a strong correlation between people who believe tutorials that claim to make concurrency easy and bug free, and how buggy their code is. By following the procedure above several subtle bugs were discovered and fixed in BoofCV.

Automatically generated code is used heavily in concurrent algorithms in BoofCV. This is done by adding code comments with hints in them that tell a code generator how to make the algorithm parallel. An example of this is shown below:

//CONCURRENT_BELOW BoofConcurrency.loopFor(0, input.height, y -> {
for (int y = 0; y < input.height; y++) {
	int index = input.startIndex + y*input.stride;
	for (int x = 0; x < input.width; x++, index++ ) {
		if( input.data[index] < 200 )
			input.data[index] = 0;
	}
}
//CONCURRENT_ABOVE });

The //CONCURRENT comments indicate that either the line below or above it needs to be replaced with the remaining text. See AutocodeConcurrentApp for more details on this system.