Given that:
*A GPU contains multiple SIMD processors.
*Each SIMD processor contains multiple lanes.
*Each SIMD processor is assigned a single thread block (by the thread block scheduler)
The question is which one of these two alternatives is correct:
-Alt1 (parallel execution of threads): Each lane runs a single thread from the thread block -> to execute completely, each thread takes as many clock cycles as there are elements in the vector that it reads from/writes to.
-Alt2 ("sequential-alternating" execution of threads): Each thread occupies all lanes of a single SIMD processor -> each thread takes round_up(<nr_of_elements_in_the_vector>/<nr_of_lanes_per_SIMD_processor>) clock cycles to finish execution (not necessarily consecutive) -> the thread scheduler (in each SIMD processor) alternates between different threads, even if a thread has not yet finished all its cycles. So threads don't execute in parallel.
(PS. Alt1 is what I understood from the GPU class/slides; Alt2 is what I understood from the book)
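
To make the difference concrete, here is a minimal Python sketch of the per-thread cycle count that each alternative implies. The function names and the example sizes (a 32-element vector, 16 lanes per SIMD processor) are my own assumptions for illustration, and the model deliberately ignores memory latency and scheduling overhead:

```python
def cycles_alt1(n_elements: int) -> int:
    """Alt1: a thread sits on one lane and handles one element per clock cycle."""
    return n_elements


def cycles_alt2(n_elements: int, lanes_per_simd: int) -> int:
    """Alt2: a thread uses all lanes, so it needs round_up(n_elements / lanes) cycles."""
    # Integer ceiling division, equivalent to round_up(n_elements / lanes_per_simd).
    return (n_elements + lanes_per_simd - 1) // lanes_per_simd


if __name__ == "__main__":
    n_elements = 32       # assumed vector length
    lanes_per_simd = 16   # assumed number of lanes per SIMD processor
    print(cycles_alt1(n_elements))                  # 32
    print(cycles_alt2(n_elements, lanes_per_simd))  # 2
```

So under these assumed sizes, a single thread would need 32 cycles under Alt1 but only 2 (possibly non-consecutive) cycles under Alt2.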