Parallel Reduction in Compute Shaders

This is a post inspired by a conversation with @kazik at the latest Creative Coding Jam in Berlin.
Here’s the context: suppose you have a data structure on the GPU (like a texture, a shader buffer object, etc.) and want to find the minimum, or the maximum, or another “reduction” of the values it contains; you also want to avoid passing the data back to CPU, because this would highly affect the real-time performance of your application. You might then think “I’m going to use a compute shader for that!”. You are on the right path, but you soon realize something: compute shaders (and shaders in general) are very good for parallel tasks, i.e. problems which can be solved by repeating the same task in isolated threads performing computations concurrently. The key words here are isolated and concurrent: a thread does not know anything about what another thread is computing and doesn’t always have a way to know when the computation has ended. Indeed, threads in a compute shader can only sync among each other if they belong to the same workgroup. “Can’t I fit all my data in a single workgroup, then?”, unfortunately in most cases one can’t.
Alright, we have no way to sync all the threads, so we will have data race conditions when writing and reading data, what to do now? We can use a technique called parallel reduction: namely, we can perform the reduction in each workgroup, which we can sync, save back the output, and repeat until the whole data structure fits in a single workgroup. The last point is very much related to how much memory your GPU has, since it will affect the workgroup size, and in turn the shared memory size (which is usually quite small).
Here’s an implementation of parallel reduction used to compute the minimum value in a buffer of floats (think of it as the lowest brightness of an image, if it helps).
To check it is working, we can display two circles with gray value obtained on the GPU via parallel reduction, and on the CPU via the inbuilt kotlin’s min function

import org.intellij.lang.annotations.Language
import org.openrndr.application
import org.openrndr.color.ColorRGBa
import org.openrndr.draw.*
import org.openrndr.math.Vector2
import kotlin.math.pow

fun main() = application {
    configure {
        width = 1000
        height = 1000
    program {

        val workersX = 16
        val n = workersX.toDouble().pow(4.0).toInt() // buffer size must be a power of workers group size
        val buffer = shaderStorageBuffer(shaderStorageFormat {
            member("minimum", BufferMemberType.FLOAT, n)

        //Create the buffer values;
        val buff = List(n) {
            Math.random() + 0.45
        //Compute CPU minimum;
        val minimumValue = buff.min()

        buffer.put {
            buff.forEach {

        val computeCode = """
            #version 430
            layout(local_size_x = ${workersX}, local_size_y = 1) in;
            uniform int computeWidth;
            uniform int recursionStep;
            layout (std430, binding = 0) buffer buffValues
                float values[];
            shared float[${workersX}] localBuff;
            float reduction(float[${workersX}] arr){
            // Compute minimum;
                float m = arr[0];
                for (int i=0; i < arr.length(); i++){
                    if (arr[i] < m){
                        m = arr[i];
                return m;
            void main(){
                  const int id = int(gl_LocalInvocationID.x);
                  const int k = int(gl_WorkGroupID.x);
                  int stride = int(pow(${workersX}, recursionStep));
                  int globalId = (id + k * ${workersX}) * stride;
                  localBuff[id] = values[globalId];
                  barrier(); // Forces threads in the SAME group to be in sync;
                  float minimum = reduction(localBuff); // Compute  minimum (or the reduction implemented);
                  // Writes on the local index thread;
                  if (id == 0){
                    values[k * ${workersX} * stride] = minimum;
        val computeReduce = ComputeShader.fromCode(computeCode, "reduction")
        computeReduce.buffer("buffValues", buffer)

        extend {
            // The copy should happen through a compute shader,
            // here done on CPU for debugging's sake.
            buffer.put {
                buff.forEach {

            // Compute reduction by
            val numSteps = n / workersX
            var dispatchNum = numSteps
            var s = 0
            while(dispatchNum >= 1){
                computeReduce.uniform("recursionStep", s)
                computeReduce.uniform("computeWidth", dispatchNum)
                computeReduce.execute(dispatchNum, 1)
                dispatchNum /= workersX
                s += 1

            drawer.fill = ColorRGBa.WHITE
            drawer.stroke = null
            drawer.shadeStyle = shadeStyle {
                fragmentTransform = """
                    x_fill.rgb = vec3(b_buff.minimum[0]);
                buffer("buff", buffer) //That's how you can access a buffer in the graphics pipeline :)
   - Vector2.UNIT_X * 50.0,100.0)
            drawer.shadeStyle = shadeStyle {
                fragmentTransform = """
                    x_fill.rgb = vec3(p_min);
                parameter("min", minimumValue)
  + Vector2.UNIT_X * 50.0,100.0)

You’ll see that the two disks have exactly the same gray value. :slight_smile:
The very same approach can be used for other reductions, like maximum or average, and also for other data structure like textures. In the latter case, one could work with bi-dimensional workgroup, copying a square of the texture in the local shared memory, and write back a single pixel per workgroup.