Over at the Parallel for All blog, Mark Harris writes that shared memory is a powerful feature for writing well-optimized CUDA code. Access to shared memory is much faster than global memory access ...