Conversion from block dimensions to warps in CUDA



I'm a little confused about how blocks of arbitrary dimensions are mapped to warps of size 32.

I have read, and experienced first hand, that making the inner dimension of a block a multiple of 32 improves performance.

Say I create a block with dimensions 16x16. Can a single warp contain threads from two different y values, e.g. y = 1 and y = 2?
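
For concreteness, this is the kind of launch I mean (the printf kernel is just my attempt to probe the mapping, using the row-major thread linearization from the CUDA Programming Guide):

    #include <cstdio>

    // A 16x16 block; the first thread of each row prints the warp its
    // linear ID falls into, assuming tid = y * blockDim.x + x.
    __global__ void whichWarp()
    {
        int tid  = threadIdx.y * blockDim.x + threadIdx.x; // linear thread ID
        int warp = tid / warpSize;                         // warp index within the block
        if (threadIdx.x == 0)
            printf("row y=%d starts in warp %d\n", (int)threadIdx.y, warp);
    }

    int main()
    {
        dim3 block(16, 16);        // 256 threads = 8 warps
        whichWarp<<<1, block>>>();
        cudaDeviceSynchronize();   // flush device-side printf
        return 0;
    }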

And why does having an inner dimension of 32 improve performance, even though in both cases there are technically enough threads scheduled to fill the warps?

Your biggest question has already been answered in the existing posts about warps and threads, in particular How are CUDA threads divided into warps?, so this answer focuses on the why.
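
Applying the rule from that answer to your 16x16 example: threads are linearized in row-major order (tid = x + y * Dx) and consecutive IDs are grouped into warps of 32. A host-side sketch of that arithmetic (no GPU required):

    #include <cstdio>

    // Linearize a 16x16 block in row-major order, tid = x + y * Dx,
    // and group consecutive IDs into warps of 32.
    int main()
    {
        const int Dx = 16, Dy = 16, WARP = 32;
        for (int y = 0; y < Dy; ++y) {
            int firstTid = 0 + y * Dx;                 // tid of thread (x = 0, y)
            printf("row y=%2d -> warp %d\n", y, firstTid / WARP);
        }
        return 0;
    }
    // Prints: rows 0 and 1 in warp 0, rows 2 and 3 in warp 1, ...
    // so yes: with a 16x16 block, every warp spans two y values.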

The block size in CUDA is always a multiple of the warp size. The warp size is implementation defined, and the number 32 is related to shared memory organization, data access patterns, and data flow control [1].
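
Because it is implementation defined, the value is exposed as a device property rather than something you should hard-code. A minimal query, using the standard cudaGetDeviceProperties API:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Query the warp size instead of hard-coding it; it is a device
    // property precisely because it is implementation defined.
    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);         // properties of device 0
        printf("warp size: %d\n", prop.warpSize); // 32 on every GPU to date
        return 0;
    }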

So, a block size that is a multiple of 32 does not improve performance by itself; it means that all of the scheduled threads are used for something. Note that whether they are used for something useful depends on what the threads within the block actually do.

A block size that is not a multiple of 32 is rounded up to the nearest multiple, even though you requested fewer threads. See the GPU Optimization Fundamentals presentation by Cliff Woolley of the NVIDIA Developer Technology Group, which has interesting hints on performance.
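
A sketch of that rounding, using a 16x15 block as an example of my own:

    #include <cstdio>

    // The hardware allocates whole warps: ceil(threadsPerBlock / 32).
    // A 16x15 block (240 threads) still occupies 8 warps, so 16 lanes
    // in the last warp are wasted.
    int main()
    {
        int threadsPerBlock = 16 * 15;              // 240, not a multiple of 32
        int warps = (threadsPerBlock + 31) / 32;    // rounds up to 8
        printf("%d threads -> %d warps (%d idle lanes)\n",
               threadsPerBlock, warps, warps * 32 - threadsPerBlock);
        return 0;
    }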

In addition, memory operations and instructions are executed per warp, so this number matters a great deal. I think the reason why it is 32, and not 16 or 64, is undocumented; just remember the warp size as "the answer to the ultimate question of life, the universe, and everything" [2].
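
To make the per-warp memory point concrete, here is a sketch of my own (these kernels are not from the references): when the 32 lanes of a warp read 32 consecutive floats, the loads coalesce into a few transactions; a strided pattern scatters the accesses within each warp and multiplies the number of transactions.

    // Both kernels copy n floats; only the access pattern differs.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                  // lane k -> element k: coalesced
    }

    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];   // scattered within each warp
    }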

[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Elsevier, 2010.

[2] Douglas Adams. The Hitchhiker's Guide to the Galaxy.

