This thesis presents the design of a programmable, high-performance, many-core processing unit and its FPGA implementation. To achieve a high operating frequency, the processing unit relies largely on the arithmetic and logical capabilities of the DSP blocks found in Xilinx Virtex-5 and Virtex-6 FPGAs. On the programming side, the aim is to support the services required by OpenCL, the most appropriate language and standard for many-core architectures and parallel programming. The constraints and approach of the OpenCL standard are therefore the main consideration in the architecture design.
The first part of the thesis briefly describes the structure and operation of a general central processing unit (CPU) and the architectural solutions that serve to improve its performance. It then introduces the various parallel computing architectures.
The next section describes in detail two high-performance general-purpose graphics processing units that lead the field of parallel computing: the AMD Cypress and the Nvidia Fermi.
The next chapter presents the OpenCL language and standard. It explains the aim and general guidelines of the standard, then discusses in detail the models the standard defines (platform model, execution model, memory model, programming model) and the rules and restrictions relevant to developing the required runtime environment.
The next part covers the architectural design of the processing unit. It analyzes the OpenCL models and the standard, defines an architecture based on that analysis, and briefly describes the main parts and specifications of the architecture.
The following section presents the design of the arithmetic logic unit (ALU). It begins with a detailed analysis of the DSP blocks of the Virtex-5 and Virtex-6 FPGAs (capabilities, configuration options). The ALU design is then examined with respect to several factors, such as operating frequency, delay, and the types of arithmetic and logical operations.
The subsequent chapters present the parts of the architecture in more detail: the register file, thread scheduler, memories, and control interface. Several alternative implementations of each part are considered, together with their advantages and disadvantages, and the implementation of the preferred alternative is described in more detail.
A separate section deals with the instruction set.
The last chapter outlines opportunities for further development.