[ELI5] How a graphics card works and the real world effects of the various specs

Also, what makes GPUs different from CPUs is that they do all of this in parallel.

An instruction on a CPU performs one operation on one piece of data. There can be many instructions in flight at the same time as long as they work on different data, and an "instruction" can bundle several logical steps like "get two pieces of data, add them together, and store the result in a third place", but fundamentally it's a sequence of individual operations. This is known as a SISD processor - Single Instruction, Single Data.
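To make that concrete, here's roughly what the SISD picture looks like as a plain host-side loop (just an illustrative sketch, not any particular CPU's code):

```cuda
// Plain C loop on the CPU: each instruction works on one pair of values
// at a time. Any parallelism comes from the CPU quietly overlapping
// independent instructions, not from the code itself.
void add_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];   // "get two pieces of data, add them, store the result"
}
```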

GPUs have to process a much greater number of operations, so they execute instructions in parallel. You might have 4 processors ("SMX engines"), each of which operates 8 banks ("warps") of 32 "threads". The total of this would be 1024 "CUDA cores", "stream processors", or "compute elements" (everyone uses different terminology). These aren't like cores on your CPU, however - each "bank" of threads operates its 32 cores in lockstep, performing the same instruction on 32 different pieces of data at once. It is totally impossible for cores within the same warp to execute different instructions. The only thing that can be done is to "mask" cores off, so that instead of performing the instruction they perform a NOP instead. This means that if half of the threads in a warp are doing one thing, and the other half are doing another, it takes twice as long (run code branch A with half the cores masked off, then run code branch B with the other cores). See the divergence sketch below.
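Here's a rough CUDA sketch of that divergence problem (the kernel and names are made up for illustration):

```cuda
// Minimal sketch of warp divergence: threads in the same 32-thread warp
// cannot run different instructions at the same time, so when a branch
// splits the warp, the hardware masks half the lanes off and runs the
// two branches one after the other.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Even and odd indices land in the same warp, so this branch splits
    // it: branch A runs with the odd lanes masked off (doing NOPs), then
    // branch B runs with the even lanes masked off.
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   // branch A
    else
        out[i] = in[i] + 1.0f;   // branch B
}
```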

Each SMX engine has its own independent memory subsystem, scheduling queue, etc. The memory subsystems are very elaborate. Since there are so many cores running, it's impossible to feed them from DDR3 memory. Instead, you need GDDR5 memory. If DDR3 were a sports car, GDDR5 would be a semi truck: the bandwidth is roughly 10x higher, but the latency is also roughly 10x longer.

As a result, GPUs have a very different programming strategy. You tend to launch many more threads than you have cores; the threads issue their memory accesses and then sleep until the data arrives. When all 32 threads of a warp have their data, the warp is scheduled for execution. In general it's often worth it to cache data in the VERY LIMITED on-chip memory, or to re-calculate data, rather than waiting for a trip to the global (GDDR5) main memory.
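A minimal sketch of that strategy in CUDA (kernel and names invented for the example):

```cuda
// Launch far more threads than there are cores, so that while some
// warps are stalled waiting on global (GDDR5) memory, the scheduler can
// run warps whose data has already arrived.
__global__ void scale(const float *in, float *out, int n)
{
    // Stage each block's values in the very limited on-chip shared
    // memory; this pays off in real kernels where threads reuse each
    // other's data instead of going back out to global memory.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];   // global load; the warp sleeps until the data arrives
    __syncthreads();                 // every thread in the block reaches this barrier

    if (i < n)
        out[i] = tile[threadIdx.x] * 0.5f;
}
```

Launching this for a million elements would be something like `scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);` - roughly 4096 blocks of 256 threads, about a thousand threads per core on a 1024-core chip, which is exactly the oversubscription that hides the GDDR5 latency.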

Another important trick is that the memory subsystem attempts to consolidate memory requests as much as possible. If all 32 threads want sequential pieces of data, or the same piece of data, that can be handled as a single request. If you deviate from this access pattern you start to overload the memory controller and performance takes a nosedive. The controller also attempts to cache data on-chip, space permitting.
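To illustrate, here are two hypothetical kernels: one whose requests the memory controller can consolidate into a few wide transactions, and one whose requests it can't.

```cuda
// Coalesced: the 32 threads of a warp read 32 consecutive floats, so
// the whole warp can be served with one wide memory request.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // neighbouring threads touch neighbouring addresses
}

// Strided: each thread jumps `stride` elements apart, so one warp turns
// into many separate transactions and performance takes a nosedive.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}
```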

This approach works really well for graphics processing, since you are usually doing matrix transformations on sequential pieces of data. There's also a bunch of other special hardware that exploits spatial locality in 2D and 3D, assists in applying textures, etc. It can be good or bad for general-purpose computing. Some things benefit from the extra bandwidth more than they are hurt by the latency; databases, for example, seem to work pretty well on GPUs because they are bandwidth-bound and can be structured so their data access patterns fit the GPU model. Total memory capacity is always a problem though - even a Titan only has 12GB of memory, which just isn't a lot compared to a server with 256GB.
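For a sense of why graphics maps onto this so naturally, here's an illustrative vertex-transform kernel (names invented for the example): every thread applies the same 4x4 matrix to one vertex, and consecutive threads read consecutive vertices - exactly the regular, sequential access pattern the hardware above is built for.

```cuda
struct Vec4 { float x, y, z, w; };

// One thread per vertex, all executing the same matrix multiply.
// Thread i reads vertex i, so the loads and stores are fully coalesced.
__global__ void transform_vertices(const float *m, const Vec4 *in, Vec4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Vec4 v = in[i];
    out[i].x = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
    out[i].y = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
    out[i].z = m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w;
    out[i].w = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
}
```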
