
[CUDA] Glossary of Terms

Source: http://icysword.blog.me/140202059108


SP (Streaming Processor): the core unit that performs computation on the GPU. It contains registers, an FPU for floating-point operations, an ALU for integer operations, and an LSU for data loads/stores (move, cmp). In CUDA, each SP runs 4 threads.

SFU (Special Function Unit): handles special operations such as sin, cos, reciprocals, square roots, and graphics interpolation.

SM (Streaming Multiprocessor): composed of 8 SPs, 2 SFUs, shared memory, and cache. In CUDA, warps and blocks execute on an SM. (The 8-SP figure describes the original Tesla architecture; later architectures pack more cores per SM.)

The GTX 770, for example, has 8 SMs x 192 cores = 1536 CUDA cores.

Warp: the group of threads processed together by one SM (8 SPs x 4 threads = 32 threads).

gridDim (x,y,z): dimensions of the grid, as specified at kernel launch
blockDim (x,y,z): number of threads in each dimension of a block
blockIdx (x,y,z): index of the current block within the grid
threadIdx (x,y,z): index of the current thread within its block
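
These combine to give each thread a unique global index. A minimal sketch (the kernel name and buffer are illustrative):

__global__ void writeIndex( int* out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D thread index
    if (i < n)
        out[i] = i;
}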

 

Source for the material below: http://web.eecs.utk.edu/~mgates3/docs/cuda.html

 

CUDA syntax

Source code is in .cu files, which contain a mixture of host (CPU) and device (GPU) code.

Declaring functions

__global__       declares a kernel, which is called on the host and executed on the device
__device__       declares a device function, which is called and executed on the device
__host__         declares a host function, which is called and executed on the host
__noinline__     avoids inlining
__forceinline__  forces inlining
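
A sketch of the first three qualifiers together (names and block size are illustrative):

__device__ float square( float x ) { return x * x; }  // device-only helper

__global__ void applySquare( float* data, int n )     // kernel: host-callable, device-run
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square( data[i] );
}

__host__ void launch( float* data, int n )            // ordinary host function
{
    applySquare<<< (n + 255) / 256, 256 >>>( data, n );
}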

Declaring variables

__device__    declares a device variable in global memory, accessible from all threads, with lifetime of the application
__constant__  declares a device variable in constant memory, accessible from all threads, with lifetime of the application
__shared__    declares a device variable in a block's shared memory, accessible from all threads within the block, with lifetime of the block
__restrict__  standard C qualifier asserting that pointers are not aliased
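
A sketch combining these qualifiers (assumes 256 threads per block; names are illustrative):

__constant__ float coeff[16];   // constant memory, filled from host via cudaMemcpyToSymbol
__device__   float scaleLog;    // global-memory variable, lives for the whole application

__global__ void scale( const float* __restrict__ in, float* __restrict__ out )
{
    __shared__ float tile[256];                     // one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * coeff[0];
}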

Types

Most routines return an error code of type cudaError_t.

Vector types

char1, uchar1, short1, ushort1, int1, uint1, long1, ulong1, float1
char2, uchar2, short2, ushort2, int2, uint2, long2, ulong2, float2
char3, uchar3, short3, ushort3, int3, uint3, long3, ulong3, float3
char4, uchar4, short4, ushort4, int4, uint4, long4, ulong4, float4

longlong1, ulonglong1, double1
longlong2, ulonglong2, double2

dim3

Components are accessible as variable.x, variable.y, variable.z, variable.w. 
Constructor is make_<type>( x, ... ), for example:

float2 xx = make_float2( 1., 2. );

dim3 can take 1, 2, or 3 arguments:

dim3 blocks1D( 5       );
dim3 blocks2D( 5, 5    );
dim3 blocks3D( 5, 5, 5 );

Pre-defined variables

dim3  gridDim    dimensions of grid
dim3  blockDim   dimensions of block
uint3 blockIdx   block index within grid
uint3 threadIdx  thread index within block
int   warpSize   number of threads in warp

Kernel invocation

__global__ void kernel( ... ) { ... }

dim3 blocks( nx, ny, nz );           // cuda 1.x has 1D and 2D grids, cuda 2.x adds 3D grids
dim3 threadsPerBlock( mx, my, mz );  // cuda 1.x has 1D, 2D, and 3D blocks

kernel<<< blocks, threadsPerBlock >>>( ... );

Thread management

__threadfence_block();   wait until memory accesses are visible to block
__threadfence();         wait until memory accesses are visible to block and device
__threadfence_system();  wait until memory accesses are visible to block, device, and host (2.x)
__syncthreads();         wait until all threads reach sync
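
A common use of __syncthreads() is a block-level reduction, where each phase must finish before the next begins. A sketch (assumes blockDim.x is a power of two, at most 256):

__global__ void blockSum( const float* in, float* out )
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = in[i];
    __syncthreads();                          // all loads visible before reducing
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                      // finish this phase before the next
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];               // one partial sum per block
}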

Memory management

float* pointer;  // host variable that will hold a device address
cudaMalloc( &pointer, size );
cudaFree( pointer );

// direction is one of cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
cudaMemcpy( dst_pointer, src_pointer, size, direction );

__constant__ float dev_data[n];
float host_data[n];
cudaMemcpyToSymbol  ( dev_data,  host_data, sizeof(host_data) );  // dev_data  = host_data
cudaMemcpyFromSymbol( host_data, dev_data,  sizeof(host_data) );  // host_data = dev_data
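
Putting the calls above together, a typical host-side round trip looks like this (a minimal sketch; error checks omitted):

int n = 1024;
size_t size = n * sizeof(float);
float* host = (float*) malloc( size );
float* dev;
cudaMalloc( &dev, size );
cudaMemcpy( dev, host, size, cudaMemcpyHostToDevice );
// ... launch kernels that read and write dev ...
cudaMemcpy( host, dev, size, cudaMemcpyDeviceToHost );
cudaFree( dev );
free( host );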

Also, malloc and free work inside a kernel (2.x), but memory allocated in a kernel must be deallocated in a kernel (not the host). It can be freed in a different kernel, though.
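
A sketch of in-kernel allocation (2.x; the device heap size can be raised with cudaDeviceSetLimit( cudaLimitMallocHeapSize, bytes ) if allocations fail):

__global__ void scratch()
{
    int* buf = (int*) malloc( 64 * sizeof(int) );  // allocated on the device heap
    if (buf != NULL) {
        buf[0] = threadIdx.x;
        free( buf );                               // freed in a kernel, per the rule above
    }
}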

Atomic functions

old = atomicAdd ( &addr, value );  // old = *addr;  *addr += value
old = atomicSub ( &addr, value );  // old = *addr;  *addr -= value
old = atomicExch( &addr, value );  // old = *addr;  *addr  = value

old = atomicMin ( &addr, value );  // old = *addr;  *addr = min( old, value )
old = atomicMax ( &addr, value );  // old = *addr;  *addr = max( old, value )

// increment up to value, then reset to 0  
// decrement down to 0, then reset to value
old = atomicInc ( &addr, value );  // old = *addr;  *addr = ((old >= value) ? 0 : old+1)
old = atomicDec ( &addr, value );  // old = *addr;  *addr = (((old == 0) or (old > value)) ? value : old-1)

old = atomicAnd ( &addr, value );  // old = *addr;  *addr &= value
old = atomicOr  ( &addr, value );  // old = *addr;  *addr |= value
old = atomicXor ( &addr, value );  // old = *addr;  *addr ^= value

// compare-and-swap
old = atomicCAS ( &addr, compare, value );  // old = *addr;  *addr = ((old == compare) ? value : old)
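
Atomics let many threads update the same location safely. A sketch using atomicAdd for a histogram (NBINS and the names are illustrative):

#define NBINS 64

__global__ void histogram( const int* data, int n, unsigned int* bins )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd( &bins[ data[i] % NBINS ], 1u );  // concurrent increments don't race
}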

Warp vote

int __all   ( predicate );
int __any   ( predicate );
int __ballot( predicate );  // nth thread sets nth bit to predicate
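
A sketch counting how many threads in each warp satisfy a predicate (__ballot requires 2.x; __popc counts set bits):

__global__ void countPositive( const int* data, int* warpCounts )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int mask = __ballot( data[i] > 0 );   // one bit per thread in the warp
    if (threadIdx.x % warpSize == 0)
        warpCounts[i / warpSize] = __popc( mask ); // lane 0 writes the warp's count
}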

Timer

wall clock cycle counter

clock_t clock();
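
A sketch timing a section of device code in cycles (not seconds; the counter is per multiprocessor, so only differences taken by one thread are meaningful):

__global__ void timed( float* data, clock_t* elapsed )
{
    clock_t start = clock();
    data[threadIdx.x] *= 2.0f;        // work being timed (illustrative)
    clock_t stop = clock();
    if (threadIdx.x == 0)
        *elapsed = stop - start;
}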

Texture

Texture fetches can also return float2 or float4, depending on texRef.

// integer index
float tex1Dfetch( texRef, ix );

// float index
float tex1D( texRef, x       );
float tex2D( texRef, x, y    );
float tex3D( texRef, x, y, z );

float tex1DLayered( texRef, x    );
float tex2DLayered( texRef, x, y );
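
A sketch using the texture reference API of this document's era (texRef must be declared at file scope and bound on the host before launch; names are illustrative):

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void readTex( float* out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch( texRef, i );   // integer-indexed fetch
}

// host side, before launching:
// cudaBindTexture( NULL, texRef, dev_ptr, n * sizeof(float) );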

Low-level Driver API

#include <cuda.h>

CUdevice dev;
CUdevprop properties;
char name[n];
int major, minor;
size_t bytes;

cuInit( 0 );  // takes flags for future use
cuDeviceGetCount         ( &cnt );
cuDeviceGet              ( &dev, index );
cuDeviceGetName          ( name, sizeof(name), dev );
cuDeviceComputeCapability( &major, &minor,     dev );
cuDeviceTotalMem         ( &bytes,             dev );
cuDeviceGetProperties    ( &properties,        dev );  // max threads, etc.

cuBLAS

Matrices are column-major. Indices are 1-based; this affects result of i<t>amax and i<t>amin.

#include <cublas_v2.h>

cublasHandle_t handle;
cudaStream_t   stream;

cublasCreate( &handle );
cublasDestroy( handle );
cublasGetVersion( handle, &version );
cublasSetStream( handle,  stream );
cublasGetStream( handle, &stream );
cublasSetPointerMode( handle,  mode );
cublasGetPointerMode( handle, &mode );

// copy x => y
cublasSetVector     ( n, elemSize, x_src_host, incx, y_dst_dev,  incy );
cublasGetVector     ( n, elemSize, x_src_dev,  incx, y_dst_host, incy );
cublasSetVectorAsync( n, elemSize, x_src_host, incx, y_dst_dev,  incy, stream );
cublasGetVectorAsync( n, elemSize, x_src_dev,  incx, y_dst_host, incy, stream );

// copy A => B
cublasSetMatrix     ( rows, cols, elemSize, A_src_host, lda, B_dst_dev,  ldb );
cublasGetMatrix     ( rows, cols, elemSize, A_src_dev,  lda, B_dst_host, ldb );
cublasSetMatrixAsync( rows, cols, elemSize, A_src_host, lda, B_dst_dev,  ldb, stream );
cublasGetMatrixAsync( rows, cols, elemSize, A_src_dev,  lda, B_dst_host, ldb, stream );

Constants

argument  constant                    description (Fortran letter)

trans     CUBLAS_OP_N                 non-transposed ('N')
          CUBLAS_OP_T                 transposed ('T')
          CUBLAS_OP_C                 conjugate transposed ('C')

uplo      CUBLAS_FILL_MODE_LOWER      lower part filled ('L')
          CUBLAS_FILL_MODE_UPPER      upper part filled ('U')

side      CUBLAS_SIDE_LEFT            matrix on left ('L')
          CUBLAS_SIDE_RIGHT           matrix on right ('R')

mode      CUBLAS_POINTER_MODE_HOST    alpha and beta scalars passed on host
          CUBLAS_POINTER_MODE_DEVICE  alpha and beta scalars passed on device

BLAS functions have a cublas prefix, and the first letter of the usual BLAS function name is capitalized. Arguments are the same as in standard BLAS, with these exceptions:

  • All functions add handle as first argument.
  • All functions return cublasStatus_t error code.
  • Constants alpha and beta are passed by pointer. All other scalars (n, incx, etc.) are passed by value.
  • Functions that return a value, such as ddot, add result as last argument, and save value to result.
  • Constants are given in table above, instead of using characters.

Examples:

cublasDdot ( handle, n, x, incx, y, incy, &result );  // result = ddot( n, x, incx, y, incy );
cublasDaxpy( handle, n, &alpha, x, incx, y, incy );   // daxpy( n, alpha, x, incx, y, incy );
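
A fuller sketch of the ddot call (a fragment: assumes x_dev and y_dev are device arrays of n doubles, already filled; error checks omitted):

#include <cublas_v2.h>

cublasHandle_t handle;
double result;

cublasCreate( &handle );
cublasDdot( handle, n, x_dev, 1, y_dev, 1, &result );  // with host pointer mode,
                                                       // result arrives on the host
cublasDestroy( handle );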

Compiler

nvcc, often found in /usr/local/cuda/bin

Defines __CUDACC__

Flags common with cc

Short flag    Long flag                 Output or description
-c            --compile                 .o object file
-E            --preprocess              on standard output
-M            --generate-dependencies   on standard output
-o file       --output-file file
-I directory  --include-path directory  header search path
-L directory  --library-path directory  library search path
-l lib        --library lib             link with library
-lib                                    generate library
-shared                                 generate shared library
-pg           --profile                 for gprof
-g level      --debug level
-G            --device-debug
-O level      --optimize level
 
Undocumented (but in sample makefiles):
-m64          compile x86_64 host CPU code

Flags specific to nvcc

-v                          list compilation commands as they are executed
-dryrun                     list compilation commands, without executing
-keep                       saves intermediate files (e.g., pre-processed) for debugging
-clean                      removes output files (with same exact compiler options)
-arch=<compute_xy>          generate PTX for capability x.y
-code=<sm_xy>               generate binary for capability x.y, by default same as -arch
-gencode arch=...,code=...  same as -arch and -code, but may be repeated

Arguments for -arch and -code

It makes most sense (to me) to give -arch a virtual architecture and -code a real architecture, though both flags accept both virtual and real architectures (at times).

       Virtual architecture  Real architecture  Features
Tesla  compute_10            sm_10              basic features
       compute_11            sm_11              + atomic memory ops on global memory
       compute_12            sm_12              + atomic memory ops on shared memory
                                                + vote instructions
       compute_13            sm_13              + double precision
Fermi  compute_20            sm_20              + Fermi
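
For example (illustrative file names):

nvcc -arch=compute_13 -code=sm_13 saxpy.cu -o saxpy

nvcc -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 saxpy.cu -o saxpy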
