In a processor (CPU) driven architecture, all your code runs in a continuously marching 'fetch-decode-execute' loop. That is, execution is entirely sequential, even with branches/calls and features like look-ahead fetching. With the advent of multi-core architectures, execution is parallel, but only to the extent of multiple sequential 'fetch-decode-execute' loops. Remember that even in multi-core architectures, there is still only one bus (in some cases, two) to external memory such as RAM and hard disk, making execution somewhat sequential whenever external memory fetches are involved. For instance, if you have a processor with two cores and two threads per core, as in current Intel architectures, the effective parallelism is only about 2.5x instead of the theoretical 4x. Formally, this follows a law called Amdahl's law, which defines the effective parallelization possible when a task contains a non-parallelizable section. In multi-core processors, the primary bottleneck is the external memory bus connected to RAM, which forces all parallel executions in the cores to queue up for sequencing. No amount of coding optimization can make this memory bus work in parallel, independently for each core. This story does not change much even if you go to massively parallel architectures such as GPUs, because all GPU cores ultimately share a memory bus too.
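Amdahl's law can be stated as S = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the task and n is the number of parallel execution units. A minimal Python sketch of the formula (the numbers are illustrative, not benchmarks of any specific processor):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Effective speedup of a task whose fraction `p` is
    parallelizable, run on `n` parallel execution units."""
    return 1.0 / ((1.0 - p) + p / n)

# With 4 hardware threads and only 80% of the work parallelizable,
# the effective speedup is 2.5x, not the theoretical 4x.
print(amdahl_speedup(0.8, 4))  # 2.5
# Even at 90% parallelizable, 4 threads give only about 3.08x.
print(amdahl_speedup(0.9, 4))
```

Note that a parallelizable fraction of about 80% is what would produce the roughly 2.5x figure mentioned above for a 4-thread processor.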
This is not the case for FPGAs and ASICs, where the entire hardware definition is left to the coder. You will be operating at the lowest level: gates such as CMOS in the case of ASICs, or RTL definitions in the case of both FPGAs and ASICs. FPGAs are still a bit higher level than ASICs, with fundamental building blocks being flip-flops (FF), look-up tables (LUT) and RAMs. So, basically, there is no underlying 'fetch-decode-execute' cycle in these cases. All you have is an almost raw digital hardware definition which simply runs synchronously with external clock sources, or asynchronously based on events.
Now, how does that make it better than processors? Well, you can define your memory elements and logic, all in one go, as required for your task, and you are no longer forced to share any fundamental resources such as memory, making your architectures truly parallel and independent. Does that mean you do not need any shared memories? That entirely depends on your task. If your task needs some handshaking between different parallel sections of the design, those will be the points of sharing, reducing overall parallelism to some extent; but at least that sharing comes as a requirement of your task and not of your hardware.
There are several technical publications and articles online which compare FPGAs with GPUs and CPUs –
- Comparing Performance and Energy Efficiency of FPGAs and GPUs for High Productivity Computing
- GPU vs FPGA: The Battle For AI Hardware Rages On
- Why use an FPGA instead of a CPU or GPU?
- Computing Performance Benchmarks among CPU, GPU, and FPGA
- DSP FPGA CPU GPU
- A gentle introduction to hardware accelerated data processing
It is clear from the above that FPGAs bring significant performance improvements over CPUs/GPUs in several cases. But they miss out on the following fundamental aspects, which are important to understand –
- For a heavy computing task, an FPGA/ASIC requires a much lower clock (a few hundred MHz) than CPUs/GPUs (a few GHz). This is very important because overall power consumption is a function of clock frequency.
- An FPGA/ASIC is made for your task. So it contains (or has enabled) only the hardware required for your task definition and nothing else. This is unlike a CPU/GPU, which has all its components always available and enabled (except maybe some peripherals). This also has a huge impact on overall power consumption.
- You can make your FPGA/ASIC design such that, in most cases, Amdahl's law does not have to apply; that is, keep no common resources between parallel threads of execution.
- FPGA/ASIC offers very definite control over execution time, making low-latency, real-time execution predictably possible.
- If your task is I/O bound and/or involves lots of decision making (that is, branches), an FPGA/ASIC may not give any benefit over CPUs. In some cases, it may perform worse than CPUs.
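The clock-frequency point above can be made concrete with the standard first-order model of CMOS dynamic (switching) power, P ≈ α·C·V²·f. The sketch below is purely illustrative; the activity factor, capacitance, voltage, and frequency values are made-up assumptions, not measurements of any real device:

```python
def dynamic_power(alpha: float, c_farads: float, v_volts: float, f_hz: float) -> float:
    """First-order CMOS dynamic power model: P = alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hz

# Hypothetical numbers: identical activity factor, switched capacitance
# and supply voltage, but an FPGA design clocked at 200 MHz vs a CPU at 2 GHz.
p_fpga = dynamic_power(0.5, 1e-9, 1.0, 200e6)
p_cpu = dynamic_power(0.5, 1e-9, 1.0, 2e9)
print(p_cpu / p_fpga)  # frequency alone accounts for a 10x difference
```

This ignores static (leakage) power and the fact that voltage usually scales down with frequency, which makes the real-world gap even larger, but it shows why a few-hundred-MHz FPGA design can be far more power-efficient than a few-GHz processor doing the same work.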
In summary, I would say that –
- For time-critical and highly computationally intensive tasks, FPGAs will always fare better than CPUs/GPUs, even though the development effort is higher.
- If your task is mostly decision making (including things like DMA transaction initiation, etc.) or is meant for interactivity with end users, CPUs are best.
- For computational tasks, GPUs are better than CPUs. But FPGAs outrank them in several aspects, such as no need for shared memory and complete control over the hardware definition, making GPUs neither a particularly good choice for computational resources nor a good choice for decision making. However, if you are constrained in resources and only have a CPU alongside a graphics processor, the GPU in that graphics processor is better than the CPU for computational work. But if you are not resource constrained, FPGAs are better. If you look at the market dynamics as of now, Altera, which was a pure FPGA company, is now part of Intel, while Xilinx, also a pure FPGA company, is heavily promoting its Zynq architecture, which is nothing but an FPGA alongside ARM cores. This suggests that FPGAs will start appearing as a hardware resource in your day-to-day devices, such as laptops and mobiles, in the near future.
| Task | CPU | GPU | FPGA |
|---|---|---|---|
| Computation and/or real-time | ✖ | ? | ✔ |
| Non-real-time decision making | ✔ | ✖ | ✖ |
Watch out for more posts related to FPGAs and ASICs on this blog. I intend to discuss HDL in my next post.