Graphics processing unit

A graphics processing unit (GPU), also occasionally called visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard or—in certain CPUs—on the CPU die.

The term GPU was popularized by Nvidia in 1999, who marketed the GeForce 256 as "the world's first 'GPU', or Graphics Processing Unit, a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that are capable of processing a minimum of 10 million polygons per second". Rival ATI Technologies coined the term visual processing unit or VPU with the release of the Radeon 9700 in 2002.

However, both cards were predated by Rendition's Hercules Thriller Conspiracy card, which combined Rendition's Verite graphics chip with Fujitsu's FXG-1 "Pinolite" T&L chip into a single chipset in 1997, though the card's release was eventually cancelled. In turn, arcade games (often using multiple chips) had featured similar capabilities years before home systems, such as Namco's Magic Edge Hornet Simulator in 1993 and Sega's Model 3 in 1996.

Arcade machines (1970s-1980s)
The use of dedicated graphics processors for video games originates from arcade game system boards. The first dedicated graphics chip for a video game was the Fujitsu MB14241 video shifter, used to process the graphics for Taito and Midway arcade games of the 1970s, including hits such as Gun Fight in 1975 and Space Invaders in 1978. Dedicated graphics processors gained prominence during the golden age of arcade video games (1978 to mid-1980s), when arcade system board manufacturers such as Sega and Namco began producing their own custom graphics processors, most notably for systems that produced 2.5D sprite-scaling graphics, such as the Sega VCO Object hardware (which also supported stereoscopic 3D) during 1981-1982 and the 16-bit Namco Pole Position hardware during 1982-1983. The best-known use of custom graphics processors in the 1980s was Sega's powerful Super Scaler graphics boards, which used several custom chipsets to produce advanced 2.5D sprite-scaling graphics for various 1980s Sega arcade hits, such as Hang-On and Space Harrier in 1985, OutRun in 1986, and After Burner and Thunder Blade in 1987. In 1988, the Namco System 21 introduced the use of custom graphics processors for 3D polygon graphics. For a list of graphics processors used in arcade games from the late 1970s to the 1990s, see Evolution of arcade video game hardware.

In the arcades, specialized video hardware for sprite-based pseudo-3D graphics began appearing in the early 1980s. In 1981, the Sega VCO Object hardware introduced sprite-scaling with full-color graphics. In 1982, the Sega Zaxxon hardware produced scrolling graphics in an isometric perspective. The Namco Pole Position system in 1982 used several custom graphics chips to produce colorful sprite-scaling background and foreground objects. This culminated in Sega's Super Scaler graphics hardware, which produced the most advanced pseudo-3D graphics of the 1980s, for systems such as the Sega Space Harrier (1985), Sega OutRun (1986) and X Board (1987).

Home systems (1980s)
On personal computers, the use of dedicated graphics chips began gaining prominence in the early to mid-1980s. NEC released the SGP (Super Graphic Processor) for the NEC PC-8801 in 1981 and the NEC µPD7220 GDC (Graphics Display Controller) for the NEC PC-9801 in 1982.

The NEC µPD7220 GDC (Graphics Display Controller), developed during 1979-1981, is a video interface controller capable of drawing lines, circles, arcs, and character graphics to a bitmapped display. It was the first major implementation of a graphics display controller as a single Large Scale Integration (LSI) integrated circuit chip, enabling the design of low-cost, high-performance video graphics cards such as those from Number Nine Visual Technology. In the 1980s it became one of the best known of the chips that were later called graphics processing units. It came standard with the NEC PC-9801, APC III, Tulip System-1 and Epson QX-10.

Intel licensed the NEC µPD7220 design and called it the 82720 graphics display controller. Announced in 1982, it was the first of what would become a long line of Intel graphics processing units. In 1983, Intel made the iSBX 275 Video Graphics Controller Multimodule Board, for industrial systems based on the Multibus standard. The card was based on the Intel 82720 Graphics Display Controller, and accelerated the drawing of lines, arcs, rectangles, and character bitmaps. Loading of the framebuffer was also accelerated by means of DMA. The board was intended for use with Intel's line of Multibus industrial single-board computer plugin cards.

Some of the first personal computers to come standard with a GPU in the early 1980s include the µPD7220 based computers mentioned above, as well as TMS9918 based computers such as the TI-99/4A, MSX and Sega SC-3000. Video game consoles also began using graphics coprocessors in the early 1980s, such as the ColecoVision and Sega SG-1000's TMS9918, the Atari 5200's ANTIC, and the Nintendo Entertainment System's Picture Processing Unit.

Released in 1985, the Commodore Amiga was one of the first personal computers to come standard with a GPU. The GPU supported line draw, area fill, and included a type of stream processor called a blitter which accelerated the movement, manipulation, and combination of multiple arbitrary bitmaps. Also included was a coprocessor with its own (primitive) instruction set capable of directly invoking a sequence of graphics operations without CPU intervention. Prior to this and for quite some time after, many other personal computer systems instead used their main, general-purpose CPU to handle almost every aspect of drawing the display, short of generating the final video signal.

In 1986, Texas Instruments released the TMS34010, the first microprocessor with on-chip graphics capabilities. It could run general-purpose code, but it had a very graphics-oriented instruction set. In 1990-1991, this chip would become the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards.

In 1987, the IBM 8514 graphics system was released as one of the first video cards for IBM PC compatibles to implement fixed-function 2D primitives in electronic hardware.

1987 also saw the release of the Sharp X68000 computer, which featured the most advanced home computer GPU chipset of the 1980s, a custom Sharp chipset called CYNTHIA, which was capable of producing arcade-quality 2D sprite graphics, so much so that the X68000 served as the development machine for Capcom's CP System arcade hardware.

1990s


In 1991, S3 Graphics introduced the S3 86C911, which its designers named after the Porsche 911 as an implication of the performance increase it promised. The 86C911 spawned a host of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. By this time, fixed-function Windows accelerators had surpassed expensive general-purpose graphics coprocessors in Windows performance, and these coprocessors faded away from the PC market.

Throughout the 1990s, 2D GUI acceleration continued to evolve. As manufacturing capabilities improved, so did the level of integration of graphics chips. Additional application programming interfaces (APIs) arrived for a variety of tasks, such as Microsoft's WinG graphics library for Windows 3.x, and their later DirectDraw interface for hardware acceleration of 2D games within Windows 95 and later.

By the early 1990s, arcade game manufacturers were increasingly using custom GPU processors dedicated to producing real-time 3D polygon graphics. The most powerful custom GPU processors of the 1990s were used for arcade machines such as the Namco System 21 (1988), Sega Model 1 (1992), Namco System 22 (1993), Sega Model 2 (1993), Sega Model 3 (1996), Namco System 23 (1997), and Sega Naomi (1998).

By the mid-1990s, CPU-assisted real-time 3D graphics were becoming increasingly common in console and computer games, which led to an increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-marketed home 3D graphics hardware can be found in fifth generation video game consoles such as the Sega Saturn, Sony PlayStation and Nintendo 64.

In the PC world, the first 3D graphics card for a home computer was NEC's PC-FXGA, released for their PC-98 platform in 1995, which could produce 3D graphics surpassing the PlayStation console and rivaling the Nintendo 64 in terms of polygon rendering performance. The first 3D graphics cards for IBM-compatible PCs soon followed in early 1996: Creative Labs' 3D Blaster, NVIDIA's NV1, and particularly NEC's PowerVR. While the 3D Blaster and NV1 (whose first supported game was the PlayStation port Toshinden) were unable to rival the PlayStation, the PowerVR surpassed the PlayStation and even approached arcade-quality graphics, as shown by a near arcade-quality PowerVR demo of Namco's Rave Racer (though this PC port was later cancelled). Similarly, the NV1 card received PC ports of the Sega titles Virtua Fighter Remix and Virtua Cop, which surpassed the Saturn versions but could not rival the arcade originals. In late 1996, 3dfx launched the Voodoo line, which rivaled the PowerVR in quality and would soon become the most popular line of PC graphics cards of the late 1990s.

Other notable failed first tries for low-cost 3D graphics chips included the S3 ViRGE, ATI Rage, and Matrox Mystique. These chips were essentially previous-generation 2D accelerators with 3D features bolted on. Many were even pin-compatible with the earlier-generation chips for ease of implementation and minimal cost. Initially, performance 3D graphics were possible only with discrete boards dedicated to accelerating 3D functions (and lacking 2D GUI acceleration entirely), such as the 3dfx Voodoo. However, as manufacturing technology continued to progress, video, 2D GUI acceleration and 3D functionality were all integrated into one chip. NEC's PowerVR and Rendition's Verite chipsets were the first to do this well enough to be worthy of note.

OpenGL appeared in the early '90s as a professional graphics API, but originally suffered from performance issues which allowed the Glide API to step in and become a dominant force on the PC in the late '90s. However, these issues were quickly overcome and the Glide API fell by the wayside. Software implementations of OpenGL were common during this time, although the influence of OpenGL eventually led to widespread hardware support. Over time, a parity emerged between features offered in hardware and those offered in OpenGL. DirectX became popular among Windows game developers during the late 90s. Unlike OpenGL, Microsoft insisted on providing strict one-to-one support of hardware. The approach made DirectX less popular as a standalone graphics API initially, since many GPUs provided their own specific features, which existing OpenGL applications were already able to benefit from, leaving DirectX often one generation behind. (See: Comparison of OpenGL and Direct3D).

Over time, Microsoft began to work more closely with hardware developers, and started to target the releases of DirectX to coincide with those of the supporting graphics hardware. Direct3D 5.0 was the first version of the burgeoning API to gain widespread adoption in the gaming market, and it competed directly with many more-hardware-specific, often proprietary graphics libraries, while OpenGL maintained a strong following. Direct3D 7.0 introduced support for hardware-accelerated transform and lighting (T&L) for Direct3D, while OpenGL had this capability already exposed from its inception. 3D accelerator cards moved beyond being just simple rasterizers to add another significant hardware stage to the 3D rendering pipeline.

The first hardware to feature what later became known as T&L was Namco's Magic Edge Hornet Simulator arcade game system in 1993. Fujitsu's FXG-1 "Pinolite" geometry processor (based on the Fujitsu TGPx4 chipset used in the Sega Model 2C arcade system in 1996) was later released in 1997 and pioneered consumer hardware support for T&L, making arcade-quality 3D graphics possible on a PC. Rendition soon utilized the Fujitsu FXG-1 for their Hercules Thriller Conspiracy, which was to be the first consumer GPU graphics card featuring T&L, but its release was eventually cancelled. 

Later, the Nvidia GeForce 256 (also known as NV10) popularized hardware-accelerated T&L for the consumer-level card, though professional 3D cards already had this capability. Hardware transform and lighting, both already existing features of OpenGL, came to consumer-level hardware in the late '90s and set the precedent for later pixel shader and vertex shader units which were far more flexible and programmable.
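The two stages that hardware T&L took over from the CPU can be illustrated with a minimal Python sketch. It is not the algorithm of any particular chip: the helper names (mat_vec, rotation_z, translation, lambert, transform_and_light) are hypothetical, the lighting model is assumed to be simple Lambertian diffuse shading, and the model matrix is assumed to contain only rotation and translation so that normals can be transformed by the same matrix.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def mat_mul(a, b):
    """Multiply two 4x4 matrices (a applied after b)."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def rotation_z(angle):
    """Rotation about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def translation(tx, ty, tz):
    """Translation by (tx, ty, tz) in homogeneous coordinates."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def lambert(normal, light_dir):
    """Diffuse intensity: clamped dot product of unit normal and unit light direction."""
    n, l = normalize(normal), normalize(light_dir)
    return max(0.0, sum(a * b for a, b in zip(n, l)))

def transform_and_light(position, normal, model, light_dir):
    """Transform one vertex into world space and compute its diffuse light intensity."""
    world_pos = mat_vec(model, position + [1.0])[:3]
    # w = 0 makes the normal ignore translation; valid only for rigid transforms.
    world_normal = mat_vec(model, normal + [0.0])[:3]
    return world_pos, lambert(world_normal, light_dir)
```

For example, with model = mat_mul(translation(1, 0, 0), rotation_z(math.pi / 2)), the vertex at [1.0, 0.0, 0.0] is rotated and then translated to approximately (1, 1, 0), and a normal pointing along +x ends up facing a light placed along +y. A T&L unit performs this kind of work for every vertex of every frame.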

2000 to 2005
With the advent of the OpenGL API and similar functionality in DirectX, GPUs added programmable shading to their capabilities. Each pixel could now be processed by a short program that could include additional image textures as inputs, and each geometric vertex could likewise be processed by a short program before it was projected onto the screen. Nvidia was first to produce a chip capable of programmable shading, the GeForce 3 (code named NV20). By October 2002, with the introduction of the ATI Radeon 9700 (also known as R300), the world's first Direct3D 9.0 accelerator, pixel and vertex shaders could implement looping and lengthy floating point math, and in general were quickly becoming as flexible as CPUs, and orders of magnitude faster for image-array operations. Pixel shading is often used for things like bump mapping, which adds texture, to make an object look shiny, dull, rough, or even round or extruded.
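The idea of a short program that runs once per pixel and reads a texture can be sketched in plain Python. This is an illustration only, not shader code for any real API: the function names (sample, pixel_shader, render), the nearest-neighbour lookup and the single brightness parameter are all assumptions made for the example.

```python
def sample(texture, u, v):
    """Nearest-neighbour texture lookup; u and v are expected in [0, 1)."""
    height, width = len(texture), len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]

def pixel_shader(u, v, texture, brightness):
    """The per-pixel program: fetch a texel and scale it by a brightness factor."""
    r, g, b = sample(texture, u, v)
    return tuple(min(255, int(c * brightness)) for c in (r, g, b))

def render(width, height, texture, brightness):
    """Run the shader once for every pixel centre of a width x height image."""
    return [[pixel_shader((x + 0.5) / width, (y + 0.5) / height, texture, brightness)
             for x in range(width)]
            for y in range(height)]
```

On a GPU the loop in render does not exist as such: the hardware invokes the shader for many pixels at once, which is why each invocation must depend only on its own inputs.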

2006 to present
With the introduction of the GeForce 8 series and its then-new generic stream processing unit, GPUs became more generalized computing devices. Today, parallel GPUs have begun making computational inroads against the CPU, and a subfield of research, dubbed GPU Computing or GPGPU for General Purpose Computing on GPU, has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction and even stock options pricing determination. Nvidia's CUDA platform was the earliest widely adopted programming model for GPU computing. More recently OpenCL has become broadly supported. OpenCL is an open standard defined by the Khronos Group which allows for the development of code for both GPUs and CPUs with an emphasis on portability. OpenCL solutions are supported by Intel, AMD, Nvidia, and ARM, and according to a recent report by Evans Data, OpenCL is the GPGPU development platform most widely used by developers in both the US and Asia Pacific.

GPU companies
Many companies have produced GPUs under a number of brand names. In 2008, Intel, Nvidia and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% market share respectively. However, those numbers include Intel's integrated graphics solutions as GPUs. Not counting those numbers, Nvidia and ATI control nearly 100% of the market as of 2008. In addition, S3 Graphics (owned by VIA Technologies) and Matrox produce GPUs.

Computational functions
Modern GPUs use most of their transistors to do calculations related to 3D computer graphics. They were initially used to accelerate the memory-intensive work of texture mapping and rendering polygons, later adding units to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems. Recent developments in GPUs include support for programmable shaders which can manipulate vertices and textures with many of the same operations supported by CPUs, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces. Because most of these computations involve matrix and vector operations, engineers and scientists have increasingly studied the use of GPUs for non-graphical calculations. One example of non-graphical use is the generation of Bitcoins, where the GPU is used to compute large numbers of cryptographic hashes.
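The hash search mentioned above can be sketched as follows. This is a toy illustration rather than a real miner: Bitcoin does use double SHA-256, but the header bytes, the difficulty_bits parameter, the byte order used for the comparison and the function names here are simplifying assumptions made for the example.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as Bitcoin does for block hashing."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, difficulty_bits: int, max_nonce: int = 1_000_000):
    """Try nonces until the hash, read as an integer, falls below the target.

    Returns (nonce, digest), or None if no nonce below max_nonce qualifies.
    """
    target = 1 << (256 - difficulty_bits)  # more difficulty bits -> smaller target
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None
```

Every nonce can be tested independently of every other, so the loop body maps naturally onto thousands of GPU threads, each trying a different nonce.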

In addition to the 3D hardware, today's GPUs include basic 2D acceleration and framebuffer capabilities (usually with a VGA compatibility mode). Newer cards such as the AMD/ATI HD5000-HD7000 series lack dedicated 2D acceleration altogether; it has to be emulated by the 3D hardware.

GPU accelerated video decoding
Most GPUs made since 1995 support the YUV color space and hardware overlays, important for digital video playback, and many GPUs made since 2000 also support MPEG primitives such as motion compensation and iDCT. This process of hardware accelerated video decoding, where portions of the video decoding process and video post-processing are offloaded to the GPU hardware, is commonly referred to as "GPU accelerated video decoding", "GPU assisted video decoding", "GPU hardware accelerated video decoding" or "GPU hardware assisted video decoding".
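Displaying YUV video requires converting each sample to RGB, which is the per-pixel work that YUV-capable overlay hardware removes from the CPU. A minimal Python sketch of that conversion follows; it assumes 8-bit full-range values and the BT.601 coefficients used by JFIF, which is only one of several conventions in use, and the function names are hypothetical.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))

def yuv_to_rgb(y, u, v):
    """Convert one full-range 8-bit YUV (YCbCr) sample to 8-bit RGB."""
    d = u - 128  # blue-difference chroma, centred on zero
    e = v - 128  # red-difference chroma, centred on zero
    r = y + 1.402 * e
    g = y - 0.344136 * d - 0.714136 * e
    b = y + 1.772 * d
    return clamp8(r), clamp8(g), clamp8(b)
```

When both chroma components are 128 the result is a neutral grey with all three channels equal to the luma value, which is a quick sanity check for any implementation.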

More recent graphics cards even decode high-definition video on the card, offloading the central processing unit. The most common APIs for GPU accelerated video decoding are DxVA for the Microsoft Windows operating system, and VDPAU, VAAPI, XvMC, and XvBA for Linux and UNIX-based operating systems. All except XvMC are capable of decoding videos encoded with MPEG-1, MPEG-2, MPEG-4 ASP (MPEG-4 Part 2), MPEG-4 AVC (H.264 / DivX 6), VC-1, WMV3/WMV9, Xvid / OpenDivX (DivX 4), and DivX 5 codecs, while XvMC is only capable of decoding MPEG-1 and MPEG-2.

Video decoding processes that can be accelerated
The video decoding processes that can be accelerated by modern GPU hardware are:
 * Motion compensation (mocomp)
 * Inverse discrete cosine transform (iDCT)
 * Inverse telecine 3:2 and 2:2 pull-down correction
 * Inverse modified discrete cosine transform (iMDCT)
 * In-loop deblocking filter
 * Intra-frame prediction
 * Inverse quantization (IQ)
 * Variable-length decoding (VLD), more commonly known as slice-level acceleration
 * Spatial-temporal deinterlacing and automatic interlace/progressive source detection
 * Bitstream processing (Context-adaptive variable-length coding/Context-adaptive binary arithmetic coding) and perfect pixel positioning.
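Two of the steps listed above, inverse quantization and the inverse discrete cosine transform, can be sketched in a few lines of Python. This is a one-dimensional illustration under stated assumptions: it uses the orthonormal DCT-II and its inverse, and a single uniform quantizer step, whereas real codecs work on two-dimensional blocks with per-coefficient quantization matrices and integer arithmetic. The function names are hypothetical.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for coefficient k of an n-point transform."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(samples):
    """Forward orthonormal DCT-II, included only so the inverse can be checked."""
    n = len(samples)
    return [_scale(k, n) * sum(samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]

def idct(coeffs):
    """Inverse transform: rebuild the samples from their DCT coefficients."""
    n = len(coeffs)
    return [sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]

def dequantize(levels, step):
    """Inverse quantization with one uniform step size (a simplification)."""
    return [level * step for level in levels]
```

Each output sample of idct is an independent weighted sum of the coefficients, so a decoder can compute all samples of all blocks in parallel, which is what makes the step a good fit for GPU hardware.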

Dedicated graphics cards
The GPUs of the most powerful class typically interface with the motherboard by means of an expansion slot such as PCI Express (PCIe) or Accelerated Graphics Port (AGP) and can usually be replaced or upgraded with relative ease, assuming the motherboard is capable of supporting the upgrade. A few graphics cards still use Peripheral Component Interconnect (PCI) slots, but their bandwidth is so limited that they are generally used only when a PCIe or AGP slot is not available.

A dedicated GPU is not necessarily removable, nor does it necessarily interface with the motherboard in a standard fashion. The term "dedicated" refers to the fact that dedicated graphics cards have RAM that is dedicated to the card's use, not to the fact that most dedicated GPUs are removable. Dedicated GPUs for portable computers are most commonly interfaced through a non-standard and often proprietary slot due to size and weight constraints. Such ports may still be considered PCIe or AGP in terms of their logical host interface, even if they are not physically interchangeable with their counterparts.

Technologies such as SLI by Nvidia and CrossFire by ATI allow multiple GPUs to be used to draw a single image, increasing the processing power available for graphics.

Integrated graphics solutions
Integrated graphics solutions, shared graphics solutions, or integrated graphics processors (IGP) utilize a portion of a computer's system RAM rather than dedicated graphics memory. Most are integrated into the motherboard, though exceptions include AMD's IGPs that use dedicated sideport memory on certain motherboards, and APUs, where they are integrated with the CPU die. Computers with integrated graphics account for 90% of all PC shipments. These solutions are less costly to implement than dedicated graphics solutions, but tend to be less capable. Historically, integrated solutions were often considered unfit to play 3D games or run graphically intensive programs but could run less intensive programs such as Adobe Flash. Examples of such IGPs would be offerings from SiS and VIA circa 2004. However, modern integrated graphics processors such as the AMD Accelerated Processing Unit and Intel HD Graphics are more than capable of handling 2D graphics or low-stress 3D graphics. Their performance has improved to the point that they can match or exceed older dedicated graphics cards, although they remain far less capable than current-generation dedicated GPUs.

As a GPU is extremely memory intensive, an integrated solution may find itself competing with the CPU for the already relatively slow system RAM, as it has minimal or no dedicated video memory. IGPs can have up to 29.856 GB/s of memory bandwidth from system RAM, whereas a dedicated graphics card can have up to 264 GB/s of bandwidth over its own memory bus. Older integrated graphics chipsets lacked hardware transform and lighting, but newer ones include it.
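Peak memory bandwidth is the product of bus width, transfer rate and channel count. The Python sketch below shows the arithmetic; the two configurations are assumptions chosen because they reproduce the figures quoted above (a dual-channel 64-bit bus at 1866 MT/s and a single 384-bit bus at 5.5 GT/s), not hardware named by the source, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second * channels / 1e9

# Assumed system RAM configuration: two 64-bit channels at 1866 million transfers/s.
system_ram = peak_bandwidth_gb_s(64, 1866e6, channels=2)

# Assumed graphics card configuration: one 384-bit bus at 5.5 billion transfers/s.
graphics_card = peak_bandwidth_gb_s(384, 5.5e9)
```

Under these assumptions the dedicated card has roughly nine times the peak bandwidth, and the integrated GPU must additionally share its figure with the CPU.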

Hybrid solutions
This newer class of GPUs competes with integrated graphics in the low-end desktop and notebook markets. The most common implementations of this are ATI's HyperMemory and Nvidia's TurboCache. Hybrid graphics cards are somewhat more expensive than integrated graphics, but much less expensive than dedicated graphics cards. These share memory with the system and have a small dedicated memory cache, to make up for the high latency of the system RAM. Technologies within PCI Express can make this possible. While these solutions are sometimes advertised as having as much as 768MB of RAM, this refers to how much can be shared with the system memory.

Stream Processing and General Purpose GPUs (GPGPU)
It is becoming increasingly common to use a general purpose graphics processing unit as a modified form of stream processor. This concept turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power, as opposed to being hard wired solely to do graphical operations. In certain applications requiring massive vector operations, this can yield several orders of magnitude higher performance than a conventional CPU. The two largest discrete (see "Dedicated graphics cards" above) GPU designers, ATI and Nvidia, are beginning to pursue this approach with an array of applications. Both Nvidia and ATI have teamed with Stanford University to create a GPU-based client for the Folding@home distributed computing project, for protein folding calculations. In certain circumstances the GPU calculates forty times faster than the conventional CPUs traditionally used by such applications.

GPGPU can be used for many types of embarrassingly parallel tasks including ray tracing, computational fluid dynamics and weather modelling. GPUs are generally suited to high-throughput computations that exhibit data parallelism, which can exploit the wide vector width of the GPU's SIMD architecture.

Furthermore, GPU-based high performance computers are starting to play a significant role in large-scale modelling. Three of the 10 most powerful supercomputers in the world take advantage of GPU acceleration.

NVIDIA cards support API extensions to the C programming language such as CUDA ("Compute Unified Device Architecture") and OpenCL. CUDA is specifically for NVIDIA GPUs whilst OpenCL is designed to work across a multitude of architectures including GPU, CPU and DSP (using vendor specific SDKs). These technologies allow specified functions (kernels) from a normal C program to run on the GPU's stream processors. This makes C programs capable of taking advantage of a GPU's ability to operate on large matrices in parallel, while still making use of the CPU when appropriate. CUDA is also the first API to allow CPU-based applications to access directly the resources of a GPU for more general purpose computing without the limitations of using a graphics API.
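The kernel model described above can be emulated in plain Python to show its shape. The sketch below is not CUDA or OpenCL code and calls no GPU API: launch is a hypothetical, sequential stand-in for a parallel kernel launch, and the block and thread terminology is borrowed loosely from CUDA for illustration.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Kernel body: each thread computes one element of a*x + y."""
    if thread_id < len(x):  # guard threads that fall past the end of the data
        out[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, num_blocks, threads_per_block, *args):
    """Sequential stand-in for a launch; a GPU would run these threads in parallel."""
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            kernel(block * threads_per_block + thread, *args)

def saxpy(a, x, y, threads_per_block=4):
    """Host-side code: allocate the output, size the grid and launch the kernel."""
    out = [0.0] * len(x)
    num_blocks = (len(x) + threads_per_block - 1) // threads_per_block  # round up
    launch(saxpy_kernel, num_blocks, threads_per_block, a, x, y, out)
    return out
```

The split mirrors the text: the host program stays ordinary sequential code, while the kernel is written from the point of view of a single thread identified only by its index.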

Since 2005 there has been interest in using the performance offered by GPUs for evolutionary computation in general, and for accelerating the fitness evaluation in genetic programming in particular. Most approaches compile linear or tree programs on the host PC and transfer the executable to the GPU to be run. Typically the performance advantage is only obtained by running the single active program simultaneously on many example problems in parallel, using the GPU's SIMD architecture. However, substantial acceleration can also be obtained by not compiling the programs, and instead transferring them to the GPU to be interpreted there. Acceleration can then be obtained by interpreting multiple programs simultaneously, by running multiple example problems simultaneously, or by a combination of both. A modern GPU (e.g. a GeForce 8800 GTX or later) can readily interpret hundreds of thousands of very small programs simultaneously.
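The interpreted approach can be sketched in Python as follows. This is an illustration under stated assumptions rather than any published system: candidate programs are assumed to be short reverse Polish expressions over a single variable x, fitness is assumed to be the summed absolute error over the example cases, and all names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(program, x):
    """Evaluate a reverse Polish program on one input value x."""
    stack = []
    for token in program:
        if token == "x":
            stack.append(x)
        elif token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))  # any other token is a numeric constant
    return stack.pop()

def fitness(program, cases):
    """Summed absolute error over (input, target) pairs; lower is better."""
    return sum(abs(interpret(program, x) - target) for x, target in cases)

def evaluate_population(programs, cases):
    """Score every program. On a GPU each (program, case) pair could be one thread."""
    return [fitness(program, cases) for program in programs]
```

Because the same interpreter runs for every program and every case, the two nested loops implied by evaluate_population are independent of each other, which is the property that lets a GPU interpret many small programs at once.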

Hardware

 * Comparison of AMD graphics processing units
 * Comparison of Nvidia graphics processing units
 * Comparison of Intel graphics processing units
 * Intel GMA
 * Larrabee
 * Nvidia PureVideo - the bit-stream technology from Nvidia used in their graphics chips to accelerate video decoding in GPU hardware with DXVA.
 * UVD (Unified Video Decoder) - the video decoding bit-stream technology from ATI Technologies to support hardware (GPU) decoding with DXVA.

APIs

 * DirectX Video Acceleration (DxVA) API for the Microsoft Windows operating system
 * Video Acceleration API (VA API)
 * VDPAU (Video Decode and Presentation API for Unix)
 * X-Video Bitstream Acceleration (XvBA), the X11 equivalent of DXVA for MPEG-2, H.264, and VC-1
 * X-Video Motion Compensation, the X11 equivalent for MPEG-2 video codec only

Applications

 * GPU cluster
 * Mathematica includes built-in support for CUDA and OpenCL GPU execution
 * MATLAB acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server, as well as 3rd party packages like Jacket.
 * Molecular modeling on GPU
 * Bitcoin mining