
Chap#8

General GPU Concepts

Q: What is a GPU?
A: A GPU (Graphics Processing Unit) is a specialized processor built for parallel processing. Originally designed for graphics, it is now widely used for general-purpose computation, offloading compute-intensive tasks from the CPU by executing many operations simultaneously.


Q: Why are GPUs good for parallel operations?
A: GPUs excel at parallel operations because they are built with a massive number of smaller, simpler execution units (cores). This architecture allows them to perform many calculations simultaneously.


Q: What is the main architectural difference between a CPU and a GPU regarding cores?
A: CPUs have a few powerful cores optimized for serial tasks and low latency. GPUs have hundreds or thousands of smaller cores designed for high-throughput parallel processing of many simpler tasks.


Q: How do modern GPUs (like NVIDIA's) achieve high performance with many cores?
A: Modern GPUs achieve high performance through a large number of cores, each capable of handling multiple threads concurrently. They are highly optimized for parallel floating-point operations on large datasets.


Q: What is the basic idea behind GPU evolution for increased parallelism?
A: The core idea is to use a large quantity (hundreds or thousands) of simpler processing units. These units execute the same instruction simultaneously on different pieces of data (SIMD paradigm).


Q: Besides graphics, what other field uses GPUs extensively?
A: High-Performance Computing (HPC) extensively uses GPUs. Their massive parallel processing capabilities are ideal for complex scientific simulations and data analysis.

GPU Components & Architecture

Q: List 6 main components of a GPU.
A: Six main components are the Graphics Processor (the core engine), the Frame Buffer (memory holding the image sent to the display), dedicated high-speed Video Memory, the Graphics BIOS, the Display connectors, and the Computer (bus) connector.

Q: What are the key components of GPU cores, and how are they structured for efficiency?
A: GPU cores replicate slimmed-down versions of CPU elements such as the Fetch/Decode unit, the ALU (for calculations), and registers (for local data). To save chip area, the Fetch/Decode logic is often shared across multiple ALUs. A GPU is built from Compute Units (CUs), each containing several Processing Elements (PEs), which are the basic execution units.


GPU Memory Hierarchy

Q: Name the five main memory regions accessible from a single work item on a GPU.
A: The five main memory regions are Registers (fastest, per work-item), Local Memory (shared within a work-group), Texture Memory (optimized for spatial locality), Constant Memory (cached, read-only for kernels), and Global Memory (largest, accessible by host & device).


Q: What are Registers in the GPU memory hierarchy?
A: Registers are the fastest, smallest, and most immediate level of memory on a GPU. Each work-item has its own private set of dedicated registers for very quick data access.


Q: What is Global Memory on a GPU, and what are its characteristics?
A: Global Memory is the largest memory space on the GPU, accessible by both the GPU and the host (CPU). It offers high bandwidth but has higher latency compared to other on-chip memories.


Q: What is Constant Memory, and what are its special properties?
A: Constant Memory is a read-only memory region for kernels, optimized for data that doesn't change during kernel execution. It's cached and supports efficient broadcasting of values to many work-items.


Q: What is Local Memory in the GPU hierarchy?
A: Local Memory (also called shared memory) is a small, fast on-chip memory shared among work-items within the same work-group. It allows for efficient data sharing and communication between these work-items.


Q: When is Texture Memory beneficial?
A: Texture Memory is useful when nearby data is accessed together, like in images. It has specialized caching and addressing modes that can reduce memory traffic and improve performance.

GPU Programming Concepts (General & OpenCL)

Q: What is the "host" and "device" in GPU computing?
A: In GPU computing, the "host" is typically the main CPU and its memory system. The "device" refers to the GPU (or other accelerator) and its dedicated memory.


Q: What is a "Work-Item" in GPU computing?
A: A Work-Item is the most basic unit of execution in GPU computing, representing a single thread. Many work-items execute the same kernel code in parallel on different data.


Q: What is a "Work-Group"?
A: A Work-Group is a collection of work-items that are scheduled to run concurrently on a single Compute Unit. Work-items within a work-group can cooperate using shared local memory and synchronization.


Q: What is OpenCL?
A: OpenCL (Open Computing Language) is an open standard framework for writing programs that can execute across heterogeneous platforms. It allows developers to use CPUs, GPUs, DSPs, and FPGAs for parallel computing.


Q: What is an OpenCL Kernel?
A: An OpenCL Kernel is a function written in a C-based language that executes on an OpenCL device (e.g., GPU). Many instances of this kernel (work-items) run in parallel to process data.
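A minimal kernel might look like the sketch below, written in OpenCL C. The name `vec_add` and its parameters are illustrative; kernel source is compiled for and launched on the device by host code, so this is not a standalone program.

```c
// Minimal OpenCL C kernel sketch (illustrative names): each work-item
// adds one pair of elements. All work-items run this same code; they
// differ only in the index returned by get_global_id().
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out)
{
    size_t i = get_global_id(0);  // this work-item's position in the NDRange
    out[i] = a[i] + b[i];         // same code, different data per work-item
}
```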


Q: When writing OpenCL kernels, why is it important to specify memory address spaces like __global or __local?
A: Specifying address spaces like __global or __local is crucial because it tells the compiler where data resides (e.g., large off-chip memory vs. fast on-chip shared memory). This directly impacts performance and data accessibility for work-items.
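A kernel sketch showing the qualifiers side by side (names are illustrative; an OpenCL runtime is needed to build and launch it):

```c
// Sketch: address-space qualifiers in an OpenCL C kernel.
// __global  = large off-chip memory, visible to all work-items
// __constant = cached, read-only data broadcast to all work-items
// __local   = fast on-chip memory shared within one work-group
// no qualifier = __private (typically held in registers)
__kernel void scale_and_share(__global const float *in,
                              __constant float *factor,
                              __local float *tile,
                              __global float *out)
{
    size_t gid = get_global_id(0);   // position in the whole NDRange
    size_t lid = get_local_id(0);    // position within the work-group

    float x = in[gid] * factor[0];   // x lives in private memory (registers)
    tile[lid] = x;                   // stage it in fast local memory

    barrier(CLK_LOCAL_MEM_FENCE);    // wait until the whole group has written

    // Neighbouring work-items can now read each other's values cheaply.
    size_t next = (lid + 1) % get_local_size(0);
    out[gid] = tile[lid] + tile[next];
}
```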


Q: What is a "heterogeneous system" in the context of OpenCL?
A: A heterogeneous system in OpenCL consists of a host (CPU) connected to one or more OpenCL compute devices. These devices can be of different types, like GPUs, FPGAs, or DSPs.


Q: What is the general role of "Host Code" in OpenCL?
A: Host code, running on the CPU, manages the overall OpenCL application. It sets up devices, compiles kernels, manages memory transfers between host and device, and enqueues kernels for execution on the device.


Q: List 3 key steps the host code must perform to execute an OpenCL kernel.
A: 1. Discover and initialize OpenCL devices and create a context. 2. Compile the kernel source code into a program object. 3. Create memory buffers, transfer data to the device, set kernel arguments, and enqueue the kernel for execution.
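The three steps can be sketched as host code in C. Error checking is omitted, a kernel named "vec_add" is assumed to exist in the source string, and the OpenCL headers and a runtime are required, so treat this as an outline rather than a complete program.

```c
#include <CL/cl.h>

/* Outline of the host-side sequence (no error checking; "vec_add"
   and the kernel source string src are assumed to exist). */
void run_kernel(const char *src, const float *a, float *out, size_t n)
{
    /* Step 1: discover a device and create a context + queue. */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    /* Step 2: compile the kernel source into a program object. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    /* Step 3: create buffers, transfer data, set args, enqueue. */
    cl_mem d_a   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, n * sizeof(float), a, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL);
}
```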


Q: What is "occupancy" on a GPU, and why is it important?
A: Occupancy is the ratio of active work-groups to the maximum possible work-groups per compute unit. High occupancy is vital for performance as it helps hide memory latency by keeping the GPU's processing elements busy with other ready work.





MCQs




What does NDRange (N-Dimensional Range) define in OpenCL?
Answer: Total number of work-items executing a kernel


Which OpenCL function is used to discover available devices?
Answer: clGetDeviceIDs()


Which function retrieves detailed information about an OpenCL device?
Answer: clGetDeviceInfo()


What is the purpose of clCreateContext() in OpenCL?
Answer: To create an execution environment for OpenCL objects


What does clBuildProgram() do in OpenCL?
Answer: Compiles and links kernel source code into a program object


Which OpenCL function is commonly used to transfer data from host to device memory?
Answer: clEnqueueWriteBuffer()


How does an OpenCL kernel work-item know which data to process?
Answer: It queries its global ID using get_global_id()
