# Parallel Reduction in Compute Shaders

This is a post inspired by a conversation with @kazik at the latest Creative Coding Jam in Berlin.
Alright, so we have no way to sync all the threads on the GPU, which means we would hit data races when reading and writing the data. What can we do? We can use a technique called parallel reduction: we perform the reduction within each workgroup (whose threads we *can* sync), write the partial results back, and repeat until the remaining data fits in a single workgroup. That last point is very much related to how much memory your GPU has, since it constrains the workgroup size and, in turn, the shared memory size (which is usually quite small).
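Before moving to the GPU, the scheme can be prototyped on the CPU. The sketch below (plain Kotlin, no OPENRNDR; `reduceMin` is just an illustrative name) mimics the pass/stride arithmetic used in the shader further down: each pass collapses groups of `workers` elements in place, with a stride that grows by a factor of `workers` per pass.

```kotlin
import kotlin.math.pow

// Simulate the multi-pass reduction on the CPU: in every pass each
// "workgroup" of `workers` elements collapses to a single minimum,
// written back at the strided position of its first element.
fun reduceMin(values: FloatArray, workers: Int): Float {
    var dispatchNum = values.size / workers
    var step = 0
    while (dispatchNum >= 1) {
        val stride = workers.toDouble().pow(step).toInt()
        for (k in 0 until dispatchNum) {          // one iteration = one workgroup
            var m = values[k * workers * stride]
            for (id in 1 until workers) {         // serial stand-in for the shared-memory loop
                m = minOf(m, values[(id + k * workers) * stride])
            }
            values[k * workers * stride] = m      // "thread 0" writes the result back
        }
        dispatchNum /= workers
        step += 1
    }
    return values[0]                              // the final minimum lands at index 0
}
```

With `workers = 16` and `n = 16⁴` as in the post, this runs four passes (4096, 256, 16 and 1 "workgroups") and leaves the minimum at index 0.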
Here’s an implementation of parallel reduction used to compute the minimum value in a buffer of floats (think of it as the lowest brightness of an image, if that helps).
To check that it works, we can draw two circles whose gray value is the minimum computed on the GPU via parallel reduction and on the CPU via Kotlin’s built-in `min` function.

```kotlin
import org.intellij.lang.annotations.Language
import org.openrndr.application
import org.openrndr.color.ColorRGBa
import org.openrndr.draw.*
import org.openrndr.math.Vector2
import kotlin.math.pow

fun main() = application {
    configure {
        width = 1000
        height = 1000
    }
    program {

        val workersX = 16
        val n = workersX.toDouble().pow(4.0).toInt() // buffer size must be a power of the workgroup size

        val buffer = shaderStorageBuffer(shaderStorageFormat {
            member("minimum", BufferMemberType.FLOAT, n)
        })

        // Create the buffer values;
        val buff = List(n) {
            Math.random() + 0.45
        }

        // Compute the CPU minimum;
        val minimumValue = buff.min()

        buffer.put {
            buff.forEach {
                write(it.toFloat())
            }
        }

        @Language("GLSL")
        val computeCode = """
            #version 430

            layout(local_size_x = ${workersX}, local_size_y = 1) in;

            uniform int computeWidth;
            uniform int recursionStep;

            layout (std430, binding = 0) buffer buffValues
            {
                float values[];
            };

            shared float localBuff[${workersX}];

            float reduction(float arr[${workersX}]) {
                // Compute the minimum;
                float m = arr[0];
                for (int i = 0; i < arr.length(); i++) {
                    if (arr[i] < m) {
                        m = arr[i];
                    }
                }
                return m;
            }

            void main() {
                const int id = int(gl_LocalInvocationID.x);
                const int k = int(gl_WorkGroupID.x);

                int stride = int(pow(${workersX}, recursionStep));
                int globalId = (id + k * ${workersX}) * stride;
                localBuff[id] = values[globalId];
                barrier(); // forces threads in the SAME workgroup to be in sync;

                float minimum = reduction(localBuff); // compute the minimum (or whichever reduction is implemented);

                // Only the first thread writes the workgroup's result back;
                if (id == 0) {
                    values[k * ${workersX} * stride] = minimum;
                }
            }
        """.trimIndent()

        val computeReduce = ComputeShader.fromCode(computeCode, "reduce")
        computeReduce.buffer("buffValues", buffer)

        extend {
            // The copy should happen through a compute shader;
            // here it is done on the CPU for debugging's sake.
            buffer.put {
                buff.forEach {
                    write(it.toFloat())
                }
            }

            // Compute the reduction by repeatedly dispatching the shader,
            // growing the stride until a single value remains.
            val numSteps = n / workersX
            var dispatchNum = numSteps
            var s = 0
            while (dispatchNum >= 1) {
                computeReduce.uniform("recursionStep", s)
                computeReduce.uniform("computeWidth", dispatchNum)
                computeReduce.execute(dispatchNum, 1)
                dispatchNum /= workersX
                s += 1
            }

            drawer.fill = ColorRGBa.WHITE
            drawer.stroke = null
            drawer.shadeStyle = shadeStyle {
                fragmentTransform = """
                    x_fill.rgb = vec3(b_buff.minimum[0]);
                """.trimIndent()
                buffer("buff", buffer) // that's how you can access a buffer in the graphics pipeline :)
            }
            drawer.circle(drawer.bounds.center - Vector2.UNIT_X * 50.0, 100.0)
            drawer.shadeStyle = shadeStyle {
                fragmentTransform = """
                    x_fill.rgb = vec3(p_min);
                """.trimIndent()
                parameter("min", minimumValue)
            }
            drawer.circle(drawer.bounds.center + Vector2.UNIT_X * 50.0, 100.0)
        }
    }
}
```

You’ll see that the two disks have exactly the same gray value.
The very same approach can be used for other reductions, like maximum or average, and also for other data structures like textures. In the latter case, one could work with two-dimensional workgroups, copying a square of the texture into local shared memory and writing back a single pixel per workgroup.
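As a rough sketch of that 2D idea (plain Kotlin again, no GPU; `reducePass2D` and `tile` are made-up names for illustration): one pass collapses each `tile × tile` square of a grayscale "texture" into a single pixel of a smaller output.

```kotlin
// Hypothetical sketch of one 2D reduction pass: every tile x tile square
// of the input "texture" (a row-major float array) collapses into a single
// output pixel holding that square's minimum, shrinking each side by `tile`.
fun reducePass2D(input: FloatArray, width: Int, height: Int, tile: Int): FloatArray {
    val outW = width / tile
    val outH = height / tile
    val output = FloatArray(outW * outH)
    for (gy in 0 until outH) {                // one (gx, gy) pair = one workgroup
        for (gx in 0 until outW) {
            var m = Float.POSITIVE_INFINITY
            for (y in 0 until tile) {         // serial stand-in for the shared-memory step
                for (x in 0 until tile) {
                    m = minOf(m, input[(gy * tile + y) * width + (gx * tile + x)])
                }
            }
            output[gy * outW + gx] = m        // one pixel written per workgroup
        }
    }
    return output
}
```

Applying the pass repeatedly until a 1×1 output remains yields the image minimum; swapping `minOf` for `maxOf`, or accumulating a sum, turns it into a maximum or average reduction.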
