OpenCores

Why open processors are so much slower than commercial ones?

Why open processors are so much slower than commercial ones? by Unknown on Aug 14, 2004
On Thursday 15 July 2004 16:05, Justin Young wrote:
> Thank you for pointing that out, Austin. You are right, what I meant by markup was that it goes faster! I worded that quite badly! Sorry everyone!
>
> > Actually, this, in my experience, is not true. It is rare, these days, that you can hand compile code that runs faster than a compiler can, especially code of any size. Compilers have come a very long way over the past few decades.
>
> You are right, but I recently compared a matrix multiply program (hence my reference earlier) assembled by hand with the gcc output, and the hand-compiled code ran a lot faster. But do note, it is only a small program.

I have tried many ways to code a matrix multiplication in C. For the worst case (512*512 matrices), the difference between implementations is a factor of 25...

    /* K is the tile size; its definition was not included in the post.
       The loop bounds require n to be a multiple of K and of 128, and K
       to be a multiple of 4. C is accumulated into with +=, so it must
       start zeroed to obtain the plain product. */
    void mulMatrix11(const unsigned int n, const int A[n][n],
                     const int B[n][n], int C[n][n])
    {
        register int tmp6, tmp7, tmp8, tmp9;
        register unsigned int i, j, ii, kk, jjj, kkk;

        for (ii = 0; ii < n; ii = ii + K)
            for (kk = 0; kk < n; kk = kk + K)
                for (i = ii; i < ii + K; i = i + 1) {
                    for (kkk = kk; kkk < kk + K; kkk += 4) {
                        /* x86 prefetch hint for upcoming elements of row i of A */
                        __asm__ __volatile__ ("prefetchnta 16(%0) " : : "S" (&(A[i][kkk+4])));
                        /* keep four consecutive values of A in registers */
                        tmp6 = A[i][kkk];
                        tmp7 = A[i][kkk+1];
                        tmp8 = A[i][kkk+2];
                        tmp9 = A[i][kkk+3];
                        for (j = 0; j < n; j = j + 128) {
                            /* prefetch hints for the next chunk of four rows of B */
                            __asm__ __volatile__ ("prefetchnta 512(%0)" : : "D" (&(B[kkk][j+128])));
                            __asm__ __volatile__ ("prefetchnta 512(%0)" : : "S" (&(B[kkk+1][j+128])));
                            __asm__ __volatile__ ("prefetchnta 512(%0)" : : "D" (&(B[kkk+2][j+128])));
                            __asm__ __volatile__ ("prefetchnta 512(%0)" : : "S" (&(B[kkk+3][j+128])));
                            /* inner loop unrolled by two columns */
                            for (jjj = j; jjj < j + 128; jjj += 2) {
                                C[i][jjj]   += tmp6 * B[kkk][jjj]     + tmp7 * B[kkk+1][jjj]
                                             + tmp8 * B[kkk+2][jjj]   + tmp9 * B[kkk+3][jjj];
                                C[i][jjj+1] += tmp6 * B[kkk][jjj+1]   + tmp7 * B[kkk+1][jjj+1]
                                             + tmp8 * B[kkk+2][jjj+1] + tmp9 * B[kkk+3][jjj+1];
                            }
                        }
                    }
                }
    }

The code above is between 45% (small matrices) and 400% faster than the following version in the average case.
    void mulMatrix1(int n, int A[n][n], int B[n][n], int C[n][n])
    {
        int i, j, k;

        for (i = 0; i < n; i = i + 1)
            for (j = 0; j < n; j = j + 1)
                for (k = 0; k < n; k = k + 1)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
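For anyone who wants to experiment with the two loop orders without x86 inline assembly, here is a minimal sketch. It is not the code from the post: the function names, the flat row-major layout, the `tile` parameter and the test data are all assumptions made for illustration, and `__builtin_prefetch` is a GCC/Clang builtin used here in place of `prefetchnta`. It only checks that a tiled loop nest computes the same result as the plain triple loop; it does not reproduce the timings quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Plain triple loop: C += A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const int *A, const int *B, int *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Tiled variant (hypothetical name): same arithmetic, but i and k are
   walked in tile x tile blocks and the innermost loop runs along rows of
   B and C, so memory is read sequentially. n must be a multiple of tile. */
void matmul_tiled(size_t n, size_t tile, const int *A, const int *B, int *C)
{
    assert(tile > 0 && n % tile == 0);
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t i = ii; i < ii + tile; i++)
                for (size_t k = kk; k < kk + tile; k++) {
                    const int a = A[i * n + k];   /* hoisted, reused for the whole row */
                    const int *b_row = &B[k * n];
                    int *c_row = &C[i * n];
#if defined(__GNUC__)
                    /* Hint only: ask for the next row of B. Prefetching
                       does not fault, and the hint may be ignored. */
                    __builtin_prefetch(b_row + n);
#endif
                    for (size_t j = 0; j < n; j++)
                        c_row[j] += a * b_row[j];
                }
}

/* Returns 1 if both versions agree on small test data, 0 if they differ,
   -1 if allocation fails. Values are kept small to avoid int overflow. */
int tiled_matches_naive(size_t n, size_t tile)
{
    int *A  = calloc(n * n, sizeof *A);
    int *B  = calloc(n * n, sizeof *B);
    int *C1 = calloc(n * n, sizeof *C1);   /* zeroed accumulators */
    int *C2 = calloc(n * n, sizeof *C2);
    int result = -1;

    if (A && B && C1 && C2) {
        for (size_t x = 0; x < n * n; x++) {
            A[x] = (int)((x * 7 + 3) % 11) - 5;
            B[x] = (int)((x * 5 + 1) % 13) - 6;
        }
        matmul_naive(n, A, B, C1);
        matmul_tiled(n, tile, A, B, C2);
        result = memcmp(C1, C2, n * n * sizeof *C1) == 0;
    }
    free(A); free(B); free(C1); free(C2);
    return result;
}
```

Calling `tiled_matches_naive(64, 16)` from a `main` of your own should return 1. To compare speed, time the two functions on larger matrices and try several tile sizes; the best value depends on the cache sizes of the machine.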


© copyright 1999-2025 OpenCores.org, equivalent to Oliscience, all rights reserved. OpenCores®, registered trademark.