Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

int sum1(int A[N][N], int B[N][N]) { int i, k, sum = 0; for (i = 0; i < N; i++)

ID: 3769191 • Letter: I

Question

int sum1(int A[N][N], int B[N][N]) {

int i, k, sum = 0;

for (i = 0; i < N; i++)

for (k = 0; k < N; k++)

sum += A[i][k] + B[k][i]; return sum;

}

TLB Behavior Assume that we tile the two arrays A and B above, with square tiles of the same size for both, and we have a fully-associative, 128-entry TLB and 8KB pages. What tile size for arrays A and B (measured in number of array elements) will maximize use of the TLB and minimize TLB misses, if the sum1 code above were to be tiled?

Assume that the array elements are 4B integers, that N*4 is much greater than the page size, and that you can ignore any caches for this question

Explanation / Answer

• This transformation modifies the iteration space of a loop nest by introducing loop levels to divide the iteration space in tiles. It is a multi-dimensional generalization of strip-mining. It is generally used to improve memory reuse, but can also improve processor, register, translation-lookaside buffer (TLB), or page locality. It is also called loop blocking. • [0263] The size of the tiles of the iteration space is chosen such that the data needed in each tile fits into the cache memory, thus reducing the cache misses. In the case of coarse-grain computers, the size of the tiles can also be chosen such that the number of parallel operations of the loop body matches the number of processors of the computer. • [0264] Example of loop tiling: for(i=0; i<N; i++) for(ii=0; ii<N; ii = ii+16) for(j=0; j<N; j++) for(jj=0; jj<N; jj = jj+16) a[i][j] = b[j][i]; for(i=ii; i< min(ii+15,N); i++) for(j=jj; j< min(jj+15,N); j++) a[i][j] = b[j][i]; Loop Interchange • [0265] This transformation interchanges loop levels of a nest in order to change data dependences. It can: • o enable vectorization by interchanging an independent loop with a dependent loop, or o improve vectorization by pushing the independent loop with the largest range further inside, or o deduce the stride, or o increase the number of loop-invariant expressions in the inner-loop, or o improve parallel performance by moving an independent loop outside of a loop nest to increase the granularity of each iteration and reduce the number of barrier synchronizations. • [0271] Example of loop interchange: for(i=0; i<N; i++) for(j=0; j<N; j++) for(j=0;j<N; j++) for(i=0; i<N; i++) a[i] = a[i] + b[i][j]; a[i] = a[i] + b[i][j]; Loop-Coalescing/Collapsing • [0272] This transformation combines a loop nest into a single loop. It can improve the scheduling of the loop, and also reduces the loop overhead. Collapsing is a simpler version of coalescing in which the number of dimensions of arrays is reduced as well. Collapsing reduces the overhead of nested loops and multi-dimensional arrays. Collapsing can be applied to loop nests that iterate over memory with a constant stride, otherwise loop coalescing is a better approach. It can be used to make vectorizing profitable by increasing the iteration range of the innermost loop. • [0273] Example of loop coalescing: for(i=0; i<N; i++) for(k=0; k<N*M; k++){ for(j=0;j<M; j++) i = ((k1)/m)*m + 1; a[i][j] = a[i][j] + c; j = ((T1)%m) + 1; a[i][j] = a[i][j] + c; } Loop Fusion • [0274] This transformation, also called loop jamming or loop merging, merges 2 successive loops. It reduces loop overhead, increases instruction-level parallelism, improves register, cache, or page locality, and improves the load balance of parallel loops. Alignment can be taken into account by introducing conditional instructions to take care of dependences. • [0275] Example of loop fusion: for(i=0; i<N; i++) for(i=0; i<N; i++) { a[i] = b[i] + c; a[i] = b[i] + c; for(i=0; i<N; i++) d[i] = e[i] + c; d[i] = e[i] + c; } Loop Distribution • [0276] This transformation, also called loop fission, allows splitting a loop in several pieces in case the loop body is too big, or because of dependences. The iteration space of the new loops is the same as the iteration space of the original loop. Loop spreading is a more sophisticated distribution. • [0277] Example of loop distribution: for(i=0; i<N; i++) { for(i=0; i<N; i++) a[i] = b[i] + c; a[i] = b[i] + c; d[i] = e[i] + c; for(i=0;i<N; i++) } d[i] = e[i] +c; Loop Unrolling/Unroll-and-Jam • [0278] This transformation replicates the original loop body in order to get a larger one. A loop can be unrolled partially or completely. It is used to get more opportunity for parallelization by making the loop body bigger, it also improves register, or cache usage and reduces loop overhead. Unrolling the outer loop followed by merging the induced inner loops is referred to as unroll-and-jam. • [0279] Example of loop unrolling: for(i=0; i<N; i++) for(i=0; i<N; i = i+2{ a[i] = b[i] + c a[i] = b[i] + c; a[i+1] = b[i+1] + c; } if ((N1)%2) == 1) a[N1] = b[N1] + c; Loop Alignment • [0280] This optimization transforms the code to achieve aligned array accesses in the loop body. The application of loop alignment transforms loop-carried dependences into loop-independent dependences, which allows extracting more parallelism from a loop. It uses a combination of other transformations, like loop peeling or introduces conditional statements. Loop alignment can be used in conjunction with loop fusion to align the array accesses in both loop nests. In the example below, all accesses to array a become aligned. • [0281] Example of loop alignment: for(i=2;i <= N;i++) { for(i=1; i<=N; i++) { a[i] = b[i] + c[i]; If (i>1) a[i] = b[i] + c[i]; d[i] = a[i1] * 2; if (i<N) d[i+1] = a[i] * 2; e[i] = a[i1] + d[i+1]; if (i<N) e[i+1] = a[i] + d[i+2]; } } Loop Skewing • [0282] This transformation is used to enable parallelization of a loop nest. It is useful in combination with loop interchange. It is performed by adding the outer loop index multiplied by a skew factor, f, to the bounds of the inner loop variable, and then subtracting the same quantity from every use of the inner loop variable inside the loop. • [0283] Example of loop skewing: for(i=1; i <= N; i++) for(i=1; i <= N; i++) for(j=1;j <= N; j++) for(j=i+1;j <= i+N; j++) a[i] = a[i+j] + c; a[i] = a[j] + c; Loop Peeling • [0284] This transformation removes a small number of starting or closing iterations of a loop to avoid dependences in the loop body. These removed iterations are executed separately. It can be used for matching the iteration control of adjacent loops to enable loop fusion. • [0285] Example of loop peeling: for(i=0; i<=N; i++) a[0][N] = a[0][N] + a[N][N]; a[i][N] = a[0][N] + a[N][N]; for(i=1;i<=N1; i++) a[i][N] = a[0][N] + a[N][N]; a[N][N] = a[0][N] + a[N][N]; Loop Splitting • [0286] This transformation cuts the iteration space in pieces by creating other loop nests. It is also called Index Set Splitting, and is generally used because of dependences that prevent parallelization. The iteration space of the new loops is a subset of the original one. It can be seen as a generalization of loop peeling. • [0287] Example of loop splitting: for(i=0; i<=N; i++) for(i=0;i<(N+1)/2; i++) a[i] = a[Ni+1] + c; a[i] = a[Ni+1] + c; for(i= (N+1)/2;i <= N;i++) a[i] = a[NI+1] + c; Node Splitting • [0288] This transformation splits a statement in pieces. It is used to break dependence cycles in the dependence graph due to the too high granularity of the nodes, thus enabling vectorization of the statements. • [0289] Example of node splitting: for(i=0;i < N;i++) { for(i = 0,i < N;i++) { b[i] = a[i] + c[i] * d[i] ; t1[i] = c[i] * d[i]; a[i+1] = b[i] * (d[i] c[i]); t2[i] = d[i] c[i]; } b[i] = a[i] + t1[i]; a[i+1] = b[i] * t2[i]; • o in more details, after procedure inlining, we obtain: for(i=0; i<N;i++) { for(j=1; j<9;j=j++) if (k>0) a[i][j] = a[i][j1] b[i+1][j1]; else d[j] = c[j] + 2; } d[i] = d[i+1] + 2; } for(i=0; i<N;i++) a[i][i] = b[i] + 3; • [0319] After loop unswitching, we obtain: if (k > 0) for(i=0; i<N;i++) { for(j=1; j<9;j=j++) a[i][j] = a[i][j1] b[i+1][j1]; d[i] = d[i+1] + 2; } else for(i=0; i<N;i++) { for(j=1; j<9;j=j++) d[j] = c[j] + 2; d[i] = d[i+1] + 2; } for(i=0; i<N;i++) a[i][i] = b[i] + 3; • [0320] After loop normalization, we obtain: if (k > 0) for(i=0; i<N;i++) { for(j=0; j<8;j=j++) a[i][j+1] = a[i][j] b[i+1][j]; d[i] = d[i+1] + 2; } else for(i=0; i<N;i++) { for(j=0; j<8;j=j++) d[j] = c[j+1] + 2; d[i] = d[i+1] + 2; } for(i=0; i<N;i++) a[i][i] = b[i] + 3; • [0321] After loop distribution and loop fusion, we obtain: if (k > 0) for(i=0; i<N;i++) for(j=0; j<8;j=j++) a[i][j+1] = a[i][j] b[i+1][j]; else for(i=0; i<N;i++) for(j=0; j<8;j=j++) d[j] = c[j+1] + 2; for(i=0; i<N;i++) { d[i] = d[i+1] + 2; a[i][i] = b[i] + 3; } • [0322] After loop interchange, we obtain: if (k > 0) for(j=0; j<8;j=j++) for(i=0; i<N;i++) a[i][j+1] = a[i][j] b[i+1][j]; else for(i=0; i<N;i++) for(j=0; j<8;j=j++) d[j] = c[j+1] + 2; for(i=0; i<N;i++) { d[i] = d[i+1] + 2; a[i][i] = b[i] + 3; }