


Question

1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4.   for (int i = 0; i < n; i += 4)
5.     for (int j = 0; j < n; j++) {
6.       __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
7.       for (int k = 0; k < n; k++)
8.         c0 = _mm256_add_pd(c0, /* c0 += A[i][k] * B[k][j] */
9.                _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
10.                 _mm256_broadcast_sd(B+k+j*n)));
11.       _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
12.     }
13. }


In line 10, what is the intrinsic function "_mm256_broadcast_sd(B+k+j*n)" trying to do?

Explanation / Answer

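Line 10 is the last part of one statement that spans lines 8 to 10:

c0 = _mm256_add_pd(c0, _mm256_mul_pd(_mm256_load_pd(A+i+k*n), _mm256_broadcast_sd(B+k+j*n)));

which computes c0 += A[i][k] * B[k][j] for four consecutive values of i at once.

To do this, "_mm256_load_pd()" loads four adjacent elements of A. Because the index expression A+i+k*n stores the matrix column by column, these are rows i to i+3 of column k. All four must be multiplied by the same single element of B, so "_mm256_broadcast_sd()" is used to fill a whole 256-bit register with that element.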
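The effect of that statement on the four lanes of c0 can be written out in plain C. This is only a minimal sketch and not the code from the question: the helper name dgemm_step_scalar is hypothetical, and it assumes the column-major layout implied by the index expressions (element [i][k] of A at A + i + k*n).

```c
#include <assert.h>

/* Hypothetical scalar model of lines 8-10 for one value of k:
 *   c[r] += A[i+r][k] * B[k][j]   for r = 0..3
 * The array c stands in for the four lanes of c0.
 * Assumes column-major storage, as the indexing A+i+k*n implies. */
void dgemm_step_scalar(int n, int i, int j, int k,
                       const double *A, const double *B, double c[4])
{
    assert(i + 3 < n);               /* four rows must fit in the matrix */

    double b = B[k + j * n];         /* the single element that gets broadcast */
    for (int r = 0; r < 4; r++)      /* one iteration per vector lane */
        c[r] += A[(i + r) + k * n] * b;
}
```

Every lane uses a different element of A but the same value b, which is why the vector version needs four copies of B[k][j] in one register.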

The intrinsic function "_mm256_broadcast_sd(B+k+j*n)" reads the one double-precision scalar stored at address B+k+j*n, that is B[k][j], and makes four identical copies of it, one in each 64-bit lane of a 256-bit register.
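The broadcast can also be tried in isolation. The sketch below is illustrative only: the helper name scale4_by_broadcast is hypothetical, and it uses the unaligned _mm256_loadu_pd and _mm256_storeu_pd so that it does not depend on the 32-byte alignment that _mm256_load_pd requires. The intrinsics are used only when the compiler has AVX enabled (for example with -mavx on GCC or Clang); otherwise a plain loop models the same result.

```c
#include <assert.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[r] = in[r] * (*scalar) for r = 0..3. */
void scale4_by_broadcast(const double *in, const double *scalar, double *out)
{
    assert(in && scalar && out);
#if defined(__AVX__)
    __m256d b = _mm256_broadcast_sd(scalar);     /* b = {s, s, s, s} */
    __m256d a = _mm256_loadu_pd(in);             /* four doubles, any alignment */
    _mm256_storeu_pd(out, _mm256_mul_pd(a, b));  /* lane-wise product */
#else
    /* Scalar fallback: same arithmetic, one lane at a time. */
    for (int r = 0; r < 4; r++)
        out[r] = in[r] * *scalar;
#endif
}
```

Calling it with in = {1.0, 2.0, 3.0, 4.0} and a scalar of 2.5 multiplies each of the four inputs by the same 2.5, just as each of the four loaded elements of A is multiplied by the same B[k][j] in the question.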

For any clarification, please leave a comment.