



### **GPU Memory II**

- Memory Hardware and Bank Conflict



## **CUDA Device Memory Space: Review**

### Each thread can:

- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
- The host can R/W global, constant, and texture memories





# **Parallel Memory Sharing**





#### **Hardware Overview**





### **Register File**

#### Register File (RF)

- ➢ 32 KB
- Provides 4 operands/clock
- Texture pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read /write RF





## **Programmer View of Register File**

# There are 8192 registers in each SM in G80

- Registers are dynamically partitioned across all Blocks assigned to the SM
- Once assigned to a Block, the register is NOT accessible by threads in other Blocks
- Each thread in the same Block only access registers assigned to itself





## **Matrix Multiplication Example**

- If each Block has 16X16 threads and each thread uses 10 registers, how many thread can run on each SM?
  - Each Block requires 10\*256 = 2560 registers
  - > 8192 = **3** \* 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases the use of registers by 1?
  - > Each Block now requires 11\*256 = 2816 registers
  - > 8192 < 2816 \*3
  - Only two Blocks can run on an SM, 1/3 reduction of parallelism!!!



## **More on Dynamic Partitioning**

- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
    - This allows for finer grain threading than traditional CPU threading models.
  - The compiler can tradeoff between instruction -level parallelism and thread level parallelism

Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



### ILP vs. TLP Example

Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, global loads have 200 cycles

> 3 Blocks can run on each SM

- If a Compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  - Only two can run on each SM
  - However, one only needs 200/(8\*4) = 7 Warps to tolerate the memory latency
  - Two Blocks have 16 Warps. The performance can be actually higher!

Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



### Constant

- Immediate address constants
- Indexed address constants
- Constants stored in DRAM, and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a Warp
  - Extremely efficient way of accessing a value that is common for all threads in a Block!





## **Shared Memory**

- Each Multi-processor has 16 KB of Shared Memory
  - > 16 banks of 32bit words
  - > Will discuss about accessing pattern later

#### Visible to all threads in a thread block

read and write access





## **Matrix Multiplication Example**

#### Explore Tile-based implementation with Shared Memory.

#### Question:

How is shared memory organized?

What are the issues when accessing shared memory?



```
roinia
                         GPU Memory II
      Invent the Future
 Tiled Matrix Multiplication Kernel --
         void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  global
1.
    shared float Mds[TILE WIDTH][TILE WIDTH];
    shared float Nds[TILE WIDTH][TILE WIDTH];
   int bx = blockIdx.x; int by = blockIdx.y;
3.
   int tx = threadIdx.x; int ty = threadIdx.y;
4.
  int Row = by * TILE WIDTH + ty;
5.
  int Col = bx * TILE_WIDTH + tx;
6.
    float Pvalue = 0;
7.
     for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre>
8.
     // Coolaborative loading of Md and Nd tiles into shared memory
       Mds[ty][tx] = Md[Row*Width + (m*TILE WIDTH + tx)];
9.
10.
       Nds[ty][tx] = Nd[Col + (m*TILE WIDTH + ty)*Width];
11.
         syncthreads();
12.
       for (int k = 0; k < TILE WIDTH; ++k) {
13.
       Pvalue += Mds[ty][k] * Nds[k][tx];
14.
        Synchthreads();
15.
       }
16.
       Pd[Row*Width+Col] = Pvalue;
}
```



### **Matrix Multiplication Shared Memory Usage**

- Each Block requires 2\* BLOCK\_SIZE <sup>2</sup> \* 4 bytes of shared memory storage
  - For BLOCK\_SIZE = 16, each BLOCK requires 2KB, up to 8 Blocks can fit into the Shared Memory of an SM
  - Since each SM can only take 768 threads, each SM can only take 3 Blocks of 256 threads each
  - Occupancy is not limited by Shared memory

Copyright  $\ensuremath{\textcircled{O}}$  2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



# **Shared Memory** Organization

- Parallel Memory Architecture:
  - Memory is divided into banks
  - Essential to achieve high bandwidth

#### Each bank can service one address per cyc

- A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized



Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes







Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



### How addresses map to banks in CUDA

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks

#### G80 has 16 banks

- > So bank = address % 16
- Same as the size of a half-warp
  - No bank conflicts between different half-warps, only within a single half-warp

Copyright  $\textcircled{\sc c}$  2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



## **Share Memory Performance**

Shared memory is as fast as registers if there are no bank conflicts

#### The fast case:

- If all threads of a half-warp access different banks, there is no bank conflict
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)

#### > The slow case:

- Bank Conflict: multiple threads in the same half-warp access the same bank
- Must serialize the accesses
- Cost = max # of simultaneous accesses to a single bank

Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes





## **Data types and bank conflicts**

> This has no conflicts if type of shared is 32-bits:

foo = shared[baseIndex + threadIdx.x]









## **Structs and Bank Conflicts**

Struct assignments compile into as many memory accesses as there are struct members:

```
Thread 0
     struct vector { float x, y, z; };
                                                                Thread 1
     struct myType {
                                                                Thread 2
                                                                Thread 3
         float f;
                                                                hread 4
         int c;
                                                               Thread 5
                                                                Thread 6
     };
                                                                Thread 7
      shared struct vector vectors[64];
      shared struct myType myTypes[64];
                                                               Thread 15
    This has no bank conflicts for vector; struct size is 3 words
\geq
         3 accesses per thread, contiguous banks (no common factor with 16)
     \geq
     struct vector v = vectors[baseIndex + threadIdx.x];
    This has 2-way bank conflicts for my Type; (2 accesses per thread)
\geq
     struct myType m = myTypes[baseIndex + threadIdx.x];
```

Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



### **Common Array Bank Conflict Patterns 1D**

- Each thread loads 2 elements into shared memory:
  - 2-way-interleaved loads result in
     2-way bank conflicts:

```
int tid = threadIdx.x;
shared[2*tid] = global[2*tid];
shared[2*tid+1] = global[2*tid+1];
```

- This makes sense for traditional CPU threads, locality in cache line usage and reduced sharing traffic.
  - Not in shared memory usage where there is no cache line effects but banking effects



Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



## **A Better Array Access Pattern**

Each thread loads one element in every consecutive group of blockDim elements.

```
shared[tid] = global[tid];
shared[tid + blockDim.x] =
```

```
global[tid + blockDim.x];
```





# **Common Bank Conflict Patterns (2D)**

- Operating on 2D array of floats in shared memory
  - e.g. image processing

#### Example: 16x16 block

- Each thread processes a row
- So threads in a block access the elements in each column simultaneously (example: row 1 in purple)
- > 16-way bank conflicts: rows all start at bank 0

#### Solution 1) pad the rows

- Add one float to the end of each row
- Solution 2) transpose before processing
  - Suffer bank conflicts during transpose







#### **Does Matrix Multiplication Incur Shared Memory Bank Conflicts?**



Copyright  $\ensuremath{\mathbb{C}}$  2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes

27



Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes









Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes



Copyright © 2010 by Yong Cao, Referencing UIUC ECE498AL Course Notes