Application Note

Using MMX[tm] Instructions in a Fast iDCT Algorithm for MPEG Decoding

Disclaimer

Information in this document is provided in connection with Intel
products. No license under any patent or copyright is granted expressly
or implied by this publication. Intel assumes no liability whatsoever,
including infringement of any patent or copyright, for sale and use of
Intel products except as provided in Intel's Terms and Conditions of Sale
for such products.

Intel retains the right to make changes to these specifications at any
time, without notice. Microcomputer Products may have minor variations to
their specifications known as errata.

MPEG is an international standard for audio and video compression and
decompression promoted by ISO. Implementations of MPEG CODECs or MPEG
enabled platforms may require licenses from various entities, including
Intel Corporation. Intel makes no representation as to the need for
licenses from any entity. No licenses, either express, implied or by
estoppel are granted herein. For information on licensing Intel patents,
please contact Intel.

1.0. INTRODUCTION
   * 1.1. MPEG Compression Method
2.0. iDCT ALGORITHM
   * 2.1. Selecting a Fast iDCT Algorithm
   * 2.2. AAN Algorithm
3.0. AAN ALGORITHM IMPLEMENTED WITH MMX[tm] INSTRUCTIONS
   * 3.1. iDCT Routine Interface
   * 3.2. Optimization Considerations
4.0. PERFORMANCE GAINS
5.0. REFERENCES
6.0. TWO-DIMENSIONAL IDCT CODE LISTING

1.0. INTRODUCTION

The Intel Architecture (IA) media extensions include single-instruction,
multiple-data (SIMD) instructions. This application note presents examples
of code that exploit these instructions. Specifically, this document
describes an implementation of a two-dimensional (2D) inverse Discrete
Cosine Transform (iDCT) using MMX[tm] instructions.
This transformation is widely used in image compression algorithms, most
notably in the Joint Photographic Experts Group (JPEG) and Moving Picture
Experts Group (MPEG) standards.

This document focuses on one iDCT algorithm that provides efficient MPEG
decoding. The implementation of this algorithm in MMX code, listed in
Section 6.0, can be used "as is" according to the guidelines presented in
Section 3.1. However, many iDCT algorithms exist. The reader is encouraged
to consider the alternative ideas and issues presented in this document,
since they have implications for other iDCT algorithms.

1.1. MPEG Compression Method

The MPEG compression method has two phases, encoding and decoding.
Multimedia images are first encoded, then transmitted, and finally decoded
back into a sequence of images by the MPEG player.

The encoding process follows these steps:

1. The input data is transformed using a Discrete Cosine Transform (DCT).
2. The resulting DCT coefficients are quantized.
3. The quantized coefficients are packed efficiently using Huffman tables.

During the decoding process, the MPEG player reverses these steps by
unpacking the coefficients, dequantizing them, and applying a 2D iDCT. To
achieve a high number of frames per second, this decoding process must be
very fast. This document concentrates on the iDCT component of the decoding
process.

2.0. iDCT ALGORITHM

The 8x8 two-dimensional iDCT used in MPEG decompression is defined as:

    f(x,y) = \sum_{u=0}^{7} \sum_{v=0}^{7} \alpha(u)\,\alpha(v)\,F(u,v)
             \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

where the normalization factors \alpha(u) and \alpha(v) are defined as:

    \alpha(\omega) = \frac{1}{\sqrt{8}} = \frac{1}{2\sqrt{2}}   for \omega = 0
    \alpha(\omega) = \frac{1}{2}                                for \omega > 0

where \omega is either u or v and F(u,v) is the coefficient matrix.

Evaluating this equation directly takes 64 multiplications and 63
additions for each element of f(x,y), for a total of 4096 multiplications
and 4032 additions.

The above equation is equivalent to a summation over v, followed by a
summation over u, as follows:

    f(x,y) = \sum_{u=0}^{7} \alpha(u) \cos\frac{(2x+1)u\pi}{16}
             \left[ \sum_{v=0}^{7} \alpha(v)\,F(u,v)
                    \cos\frac{(2y+1)v\pi}{16} \right]

This equation is equivalent to applying a one-dimensional (1D) iDCT to
each of the eight columns of F(u,v) and then applying a 1D iDCT to each of
the eight rows of the result. Reversing this order by applying a 1D iDCT
on the rows first, and then on the columns, gives the same result.

2.1. Selecting a Fast iDCT Algorithm

Many algorithms have been proposed for efficient calculation of the 2D
iDCT. Some algorithms are based on efficient 1D iDCTs [4]; some rely on
direct analysis of the two-dimensional nature of iDCT [2]; and others
combine a 2D prescale with a very efficient 1D iDCT [1]. Some algorithms
even take into account the zero coefficients used in the MPEG bit streams
to construct statistically efficient iDCT algorithms [3]. Most fast 1D DCT
and iDCT algorithms are variants of Lee's Fast DCT algorithm [6], or are
based on variants of Winograd's FFT [5].

The following algorithms were evaluated for implementation using MMX
instructions:

* Statistic algorithm [3]
* Feig's two-dimensional algorithm [2]
* The LLM algorithm [4]
* The AAN algorithm [1]

In general, statistic algorithms inspect the input data and execute
conditional paths in the control flow. The overhead caused by the data
inspection and the jump instructions would be too expensive when compared
to the speed achievable by other algorithms implemented with MMX
instructions.

Although Feig's 2D algorithm [2] reduces the multiplication count, the
cost of multiplication in MMX technology is small, so this reduction is
not critical.
Also, the irregular memory-access pattern of this algorithm is not
conducive to efficient implementation using the MMX instruction set.

Both the LLM and AAN algorithms were implemented in MMX assembly code. The
LLM algorithm was implemented using accumulation in 32-bit elements, while
the AAN algorithm used accumulation in 16-bit elements. The resulting LLM
implementation was more accurate, conforming to the iDCT IEEE standard [7].
However, the AAN implementation was much faster. The AAN implementation is
presented in this document.

2.2. AAN Algorithm

The AAN algorithm is a one-dimensional, prescaled DCT/iDCT algorithm.
First, the eight input coefficients are prescaled by eight prescale values,
which requires eight multiplications. Then the algorithm is applied to the
prescaled coefficients, which requires five multiplications and 29
additions for the transform. Although the 1D AAN algorithm is less
efficient than other 1D algorithms (for example, LLM), when applied to the
two-dimensional 8x8 iDCT the prescale is folded into a single pass over the
matrix: the algorithm takes only 64 multiplications for the prescale, and
80 multiplications and 464 additions for the sixteen 1D transforms.

3.0. AAN ALGORITHM IMPLEMENTED WITH MMX[tm] INSTRUCTIONS

The AAN implementation described in this document uses 16-bit data
elements, so that four variables can be processed in parallel using packed
words. MMX instructions that operate on packed words read or store four
words contiguously in memory. So, for an 8x8 matrix, half a row can be read
or stored at one time. If the 1D iDCT is applied to the columns of an 8x8
matrix, MMX instructions can operate on four columns at a time. Applying
the 1D iDCT to the rows of the matrix is more involved and less efficient.

The AAN algorithm is performed in four steps:

1. Prescale the input coefficients.
2. Perform an iDCT on the columns of the matrix.
3. Transpose the matrix.
4. Perform a second iDCT on the columns of the transposed matrix, which is
   equivalent to performing an iDCT on the rows of the original matrix.

These steps result in a transposed matrix, which would have to be
transposed again to obtain the final result. To prevent this extra step,
the input matrix should be transposed initially. The cost of transposing
the input is negligible, since the matrix is constructed from the Zig-Zag
scan [9]. Therefore, the actual implementation follows these steps:

1. Prescale the input coefficients of the transposed input matrix.
2. Perform an iDCT on the columns of the transposed input matrix, which is
   equivalent to performing an iDCT on the rows of the final matrix.
3. Transpose the matrix.
4. Perform a second iDCT on the columns of the final matrix.

Another consideration is the limited width of the data elements in the MMX
registers. All the iDCT algorithms mentioned in Section 2.1 are defined
mathematically without regard to the size of the accumulator or registers.
To ensure adequate precision for operations on 16-bit data elements, the
algorithm was analyzed carefully and appropriate precision was assigned
during every intermediate stage of the calculation.

Precision was controlled using packed shift instructions, which shift all
data elements in a register by the same amount. Shift right instructions
were used to prevent overflow of the most significant bit in each
intermediate step of the calculation.

3.1. iDCT Routine Interface

The iDCT routine is called from within an assembly module. The routine gets
a pointer to an 8x8 matrix of 16-bit elements; the pointer is located in
the ESI register. The matrix should be aligned on an 8-byte boundary. Each
data element is actually a 12-bit quantity that is left-adjusted, meaning
that the four least significant bits equal zero. The input matrix should be
transposed; otherwise, the output matrix must be transposed.
The value 0.5 must be added to the DC value, input matrix element [0][0].

The routine returns its result in the same memory array. Therefore, if the
original input operands are needed (for example, in test mode), they must
be copied before the call to the iDCT routine.

3.2. Optimization Considerations

Efficient use of MMX instructions depends on exploiting parallelism in an
algorithm. Parallelism in the 2D iDCT was approached from two directions,
as illustrated in Figure 1:

* Within a single 8x8 iDCT
* Between four 8x8 iDCTs

In the first approach, data is accessed by rows within the matrix. In the
second approach, data from the four matrices is interleaved to enable
efficient use of the MMX instructions.

Figure 1. Single iDCT vs. Four iDCTs
[Image]

The advantage of the single-iDCT approach is that the interface to an MPEG
player is simpler. Its disadvantages are:

* The matrix must be transposed in order to operate on several rows in
  parallel.
* To prevent overflow, packed shift instructions must be used. Since the
  accuracy of the four data elements in a given register varies, the shift
  count is determined by the worst case among the four elements. This
  results in an extra loss of accuracy.

The advantages of the four-iDCTs approach are:

* Matrix transposition is not required.
* To prevent overflow, packed shift instructions must still be used.
  However, since all the data elements in a register have the same
  accuracy, there is no extra loss of accuracy to accommodate the worst
  case among four elements.

The disadvantage of the four-iDCTs approach is that, to take advantage of
the MMX instructions, the input data from the four matrices must first be
interleaved (see Figure 1). Then, after the transform, the resulting data
must be restored to four 8x8 matrices.

Because of its simplicity, the single-iDCT approach was chosen. Instruction
scheduling was done manually to ensure optimal performance.

Register use was carefully considered as well. In most cases, intermediate
results were kept in registers; temporary storage to memory was needed in
only a few cases.
For example, consider the implementation of the matrix transpose. The basic
operation is the transpose of a 4x4 block of elements [8], as illustrated
in Figure 2.

Figure 2. Matrix Transpose Operation
[Image]

The 8x8 iDCT requires four of these operations. The sequence of these
operations was carefully chosen to save memory stores and loads. First, M4
was transposed, followed by M3. These two results were immediately used to
perform the iDCT on the last four columns. Similarly, M2 was transposed,
followed by M1. These results were used for the iDCT on the first four
columns.

The detailed steps of the implementation are:

1. Prescale: 16 packed multiplies
2. Column 0: even part
3. Column 0: odd part
4. Column 0: output butterfly
5. Column 1: even part
6. Column 1: odd part
7. Column 1: output butterfly
8. Transpose: M4 part
9. Transpose: M3 part
10. Column 1: even part (after transpose)
11. Column 1: odd part (after transpose)
12. Column 1: output butterfly (after transform)
13. Transpose: M2 parts
14. Transpose: M1 parts
15. Column 0: odd part (after transpose)
16. Column 0: even part (after transpose)
17. Column 0: output butterfly (after transform)
18. Cleanup

where:

* Column 0 represents the first four columns, and Column 1 the last four.
* Even part calculates the part of the iDCT that uses even-indexed
  elements.
* Odd part calculates the part of the iDCT that uses odd-indexed
  elements.
* Output butterfly generates the 1D iDCT using the results of the even
  and odd parts.

During the rescheduling process, instructions from one block were moved to
the previous block whenever an empty slot could be filled. This reordering
is marked by comments in the code listed in Section 6.0.

4.0. PERFORMANCE GAINS

The cycle count for this implementation of the AAN algorithm, using MMX
instructions, is 240 clocks. A direct comparison of this implementation
with a scalar implementation of the AAN algorithm would be misleading,
since the AAN algorithm is not the fastest scalar implementation of an
iDCT.
However, the implementation presented here is estimated to be 3 to 3.5
times faster than the general performance of scalar iDCT algorithms.

5.0. REFERENCES

1. Arai, Y., T. Agui, and M. Nakajima (1988). A Fast DCT-SQ Scheme for
   Images, Trans. IEICE, 71, pp. 1095-1097.
2. Feig, E., and S. Winograd (1992). Fast Algorithms for Discrete Cosine
   Transform, IEEE Trans. Signal Proc., 40, pp. 2174-2193.
3. Hung, A. C., and T. H.-Y. Meng (1994). A Fast Statistical Inverse
   Discrete Cosine Transform for Image Compression, SPIE/IS&T Electronic
   Imaging, 2187, pp. 196-205.
4. Loeffler, C., A. Ligtenberg, and G. S. Moschytz (1989). Practical Fast
   1D DCT Algorithm with Eleven Multiplications, Proc. ICASSP 1989, pp.
   988-991.
5. Winograd, S. (1976). On Computing the Discrete Fourier Transform, IBM
   Res. Rep., RC-6291.
6. Lee, B. A New Algorithm to Compute the Discrete Cosine Transform, IEEE
   Trans. Signal Proc., Dec/84, pp. 1243-1245.
7. IEEE standard specification for the implementation of 8x8 iDCT, IEEE
   Std 1180-1990.
8. MPEG standard, Coding of Moving Pictures, ISO/IEC DIS 11172.

6.0. TWO-DIMENSIONAL IDCT CODE LISTING

; esi - input and output data pointer
; the input data is transposed and each 16 bit element in the 8x8 matrix
; is left aligned:
; for example in 11...1110000 format
; If the iDCT is of I macroblock then 0.5 needs to be added to the
; DC Component
; (element[0][0] of the matrix)

.nolist
include iammx.inc                   ; IAMMX Emulator Macros
MMWORD  TEXTEQU <DWORD>
.list

.586
.model flat

_DATA SEGMENT PARA PUBLIC USE32 'DATA'
x0005000200010001   DQ 0005000200010001h
x0040000000000000   DQ 40000000000000h
x5a825a825a825a82   DW 5a82h, 5a82h, 5a82h, 5a82h   ; 23170
x539f539f539f539f   DW 539fh, 539fh, 539fh, 539fh   ; 21407
x4546454645464546   DW 4546h, 4546h, 4546h, 4546h   ; 17734
x61f861f861f861f8   DW 61f8h, 61f8h, 61f8h, 61f8h   ; 25080
scratch1            DQ 0
scratch3            DQ 0
scratch5            DQ 0
scratch7            DQ 0
; for debug only
x0                  DQ 0
preSC   DW 16384, 22725, 21407, 19266, 16384, 12873, 8867, 4520
        DW 22725, 31521, 29692, 26722, 22725, 17855, 12299, 6270
        DW 21407, 29692, 27969, 25172, 21407, 16819, 11585, 5906
        DW 19266, 26722, 25172, 22654, 19266, 15137, 10426, 5315
        DW 16384, 22725, 21407, 19266, 16384, 12873, 8867, 4520
        DW 12873, 17855, 16819, 15137, 25746, 20228, 13933, 7103
        DW 17734, 24598, 23170, 20853, 17734, 13933, 9597, 4892
        DW 18081, 25080, 23624, 21261, 18081, 14206, 9785, 4988
_DATA ENDS

_TEXT SEGMENT PARA PUBLIC USE32 'CODE'

COMMENT ^
void idct8x8aan (int16 *src_result);
^
public _idct8x8aan
_idct8x8aan proc near
    push ebp
    lea ecx, [preSC]
    mov ebp, esp
    push esi
    mov esi, DWORD PTR [ebp+8]              ; source
    ;slot

    ; column 0: even part
    ; use V4, V12, V0, V8 to produce V22..V25
    movq mm0, mmword ptr [ecx+8*12]         ; maybe the first mul can be done together
                                            ; with the dequantization in iHuff module ?
    ;slot
    pmulhw mm0, mmword ptr [esi+8*12]       ; V12
    ;slot
    movq mm1, mmword ptr [ecx+8*4]
    ;slot
    pmulhw mm1, mmword ptr [esi+8*4]        ; V4
    ;slot
    movq mm3, mmword ptr [ecx+8*0]
    psraw mm0, 1                            ; t64=t66
    pmulhw mm3, mmword ptr [esi+8*0]        ; V0
    ;slot
    movq mm5, mmword ptr [ecx+8*8]          ; duplicate V4
    movq mm2, mm1                           ; added 11/1/96
    pmulhw mm5, mmword ptr [esi+8*8]        ; V8
    psubsw mm1, mm0                         ; V16
    pmulhw mm1, mmword ptr x5a825a825a825a82 ; 23170 ->V18
    paddsw mm2, mm0                         ; V17
    movq mm0, mm2                           ; duplicate V17
    psraw mm2, 1                            ; t75=t82
    psraw mm0, 2                            ; t72
    movq mm4, mm3                           ; duplicate V0
    paddsw mm3, mm5                         ; V19
    psubsw mm4, mm5                         ; V20 ;mm5 free
    ;moved from the block below
    movq mm7, mmword ptr [ecx+8*10]
    psraw mm3, 1                            ; t74=t81
    movq mm6, mm3                           ; duplicate t74=t81
    psraw mm4, 2                            ; t77=t79
    psubsw mm1, mm0                         ; V21 ; mm0 free
    paddsw mm3, mm2                         ; V22
    movq mm5, mm1                           ; duplicate V21
    paddsw mm1, mm4                         ; V23
    movq mmword ptr [esi+8*4], mm3          ; V22
    psubsw mm4, mm5                         ; V24; mm5 free
    movq mmword ptr [esi+8*12], mm1         ; V23
    psubsw mm6, mm2                         ; V25; mm2 free
    movq mmword ptr [esi+8*0], mm4          ; V24
    ;slot
    ; keep mm6 alive all along the next block
    ;movq mmword ptr [esi+8*8], mm6         ; V25

    ; column 0: odd part
    ; use V2, V6, V10, V14 to produce V31, V39, V40, V41
    ;moved above
    ;movq mm7, mmword ptr [ecx+8*10]
    pmulhw mm7, mmword ptr [esi+8*10]       ; V10
    ;slot
    movq mm0, mmword ptr [ecx+8*6]
    ;slot
    pmulhw mm0, mmword ptr [esi+8*6]        ; V6
    ;slot
    movq mm5, mmword ptr [ecx+8*2]
    movq mm3, mm7                           ; duplicate V10
    pmulhw mm5, mmword ptr [esi+8*2]        ; V2
    ;slot
    movq mm4, mmword ptr [ecx+8*14]
    psubsw mm7, mm0                         ; V26
    pmulhw mm4, mmword ptr [esi+8*14]       ; V14
    paddsw mm3, mm0                         ; V29 ; free mm0
    movq mm1, mm7                           ; duplicate V26
    psraw mm3, 1                            ; t91=t94
    pmulhw mm7, mmword ptr x539f539f539f539f ; V33
    psraw mm1, 1                            ; t96
    movq mm0, mm5                           ; duplicate V2
    psraw mm4, 2                            ; t85=t87
    paddsw mm5, mm4                         ; V27
    psubsw mm0, mm4                         ; V28 ; free mm4
    movq mm2, mm0                           ; duplicate V28
    psraw mm5, 1                            ; t90=t93
    pmulhw mm0, mmword ptr x4546454645464546 ; V35
    psraw mm2, 1                            ; t97
    movq mm4, mm5                           ; duplicate t90=t93
    psubsw mm1, mm2                         ; V32 ; free mm2
    pmulhw mm1, mmword ptr x61f861f861f861f8 ; V36
    psllw mm7, 1                            ; t107
    paddsw mm5, mm3                         ; V31
    psubsw mm4, mm3                         ; V30 ; free mm3
    pmulhw mm4, mmword ptr x5a825a825a825a82 ; V34
    nop                                     ;slot
    psubsw mm0, mm1                         ; V38
    psubsw mm1, mm7                         ; V37 ; free mm7
    psllw mm1, 1                            ; t114
    ;move from the next block
    movq mm3, mm6                           ; duplicate V25
    ;move from the next block
    movq mm7, mmword ptr [esi+8*4]          ; V22
    psllw mm0, 1                            ; t110
    psubsw mm0, mm5                         ; V39 (mm5 still needed for next block)
    psllw mm4, 2                            ; t112
    ;move from the next block
    movq mm2, mmword ptr [esi+8*12]         ; V23
    psubsw mm4, mm0                         ; V40
    paddsw mm1, mm4                         ; V41; free mm0
    ;move from the next block
    psllw mm2, 1                            ; t117=t125

    ; column 0: output butterfly
    ;move above
    ;movq mm3, mm6                          ; duplicate V25
    ;movq mm7, mmword ptr [esi+8*4]         ; V22
    ;movq mm2, mmword ptr [esi+8*12]        ; V23
    ;psllw mm2, 1                           ; t117=t125
    psubsw mm6, mm1                         ; tm6
    paddsw mm3, mm1                         ; tm8; free mm1
    movq mm1, mm7                           ; duplicate V22
    paddsw mm7, mm5                         ; tm0
    movq mmword ptr [esi+8*8], mm3          ; tm8; free mm3
    psubsw mm1, mm5                         ; tm14; free mm5
    movq mmword ptr [esi+8*6], mm6          ; tm6; free mm6
    movq mm3, mm2                           ; duplicate t117=t125
    movq mm6, mmword ptr [esi+8*0]          ; V24
    paddsw mm2, mm0                         ; tm2
    movq mmword ptr [esi+8*0], mm7          ; tm0; free mm7
    psubsw mm3, mm0                         ; tm12; free mm0
    movq mmword ptr [esi+8*14], mm1         ; tm14; free mm1
    psllw mm6, 1                            ; t119=t123
    movq mmword ptr [esi+8*2], mm2          ; tm2; free mm2
    movq mm0, mm6                           ; duplicate t119=t123
    movq mmword ptr [esi+8*12], mm3         ; tm12; free mm3
    paddsw mm6, mm4                         ; tm4
    ;moved from next block
    movq mm1, mmword ptr [ecx+8*5]
    psubsw mm0, mm4                         ; tm10; free mm4
    ;moved from next block
    pmulhw mm1, mmword ptr [esi+8*5]        ; V5
    ;slot
    movq mmword ptr [esi+8*4], mm6          ; tm4; free mm6
    ;slot
    movq mmword ptr [esi+8*10], mm0         ; tm10; free mm0
    ;slot

    ; column 1: even part
    ; use V5, V13, V1, V9 to produce V56..V59
    ;moved to prev block
    ;movq mm1, mmword ptr [ecx+8*5]
    ;pmulhw mm1, mmword ptr [esi+8*5]       ; V5
    movq mm7, mmword ptr [ecx+8*13]
    psllw mm1, 1                            ; t128=t130
    pmulhw mm7, mmword ptr [esi+8*13]       ; V13
    movq mm2, mm1                           ; duplicate t128=t130
    movq mm3, mmword ptr [ecx+8*1]
    ;slot
    pmulhw mm3, mmword ptr [esi+8*1]        ; V1
    ;slot
    movq mm5, mmword ptr [ecx+8*9]
    psubsw mm1, mm7                         ; V50
    pmulhw mm5, mmword ptr [esi+8*9]        ; V9
    paddsw mm2, mm7                         ; V51
    pmulhw mm1, mmword ptr x5a825a825a825a82 ; 23170 ->V52
    movq mm6, mm2                           ; duplicate V51
    psraw mm2, 1                            ; t138=t144
    movq mm4, mm3                           ; duplicate V1
    psraw mm6, 2                            ; t136
    paddsw mm3, mm5                         ; V53
    psubsw mm4, mm5                         ; V54 ;mm5 free
    movq mm7, mm3                           ; duplicate V53
    ;moved from next block
    movq mm0, mmword ptr [ecx+8*11]
    psraw mm4, 1                            ; t140=t142
    psubsw mm1, mm6                         ; V55 ; mm6 free
    paddsw mm3, mm2                         ; V56
    movq mm5, mm4                           ; duplicate t140=t142
    paddsw mm4, mm1                         ; V57
    movq mmword ptr [esi+8*5], mm3          ; V56
    psubsw mm5, mm1                         ; V58; mm1 free
    movq mmword ptr [esi+8*13], mm4         ; V57
    psubsw mm7, mm2                         ; V59; mm2 free
    movq mmword ptr [esi+8*9], mm5          ; V58
    ;slot
    ; keep mm7 alive all along the next block
    ;movq mmword ptr [esi+8*1], mm7         ; V59
    ;moved above
    ;movq mm0, mmword ptr [ecx+8*11]
    pmulhw mm0, mmword ptr [esi+8*11]       ; V11
    ;slot
    movq mm6, mmword ptr [ecx+8*7]
    ;slot
    pmulhw mm6, mmword ptr [esi+8*7]        ; V7
    ;slot
    movq mm4, mmword ptr [ecx+8*15]
    movq mm3, mm0                           ; duplicate V11
    pmulhw mm4, mmword ptr [esi+8*15]       ; V15
    ;slot
    movq mm5, mmword ptr [ecx+8*3]
    psllw mm6, 1                            ; t146=t152
    pmulhw mm5, mmword ptr [esi+8*3]        ; V3
    paddsw mm0, mm6                         ; V63
    ; note that V15 computation has a correction step:
    ; this is a 'magic' constant that rebiases the results to be closer to the expected result
    ; this magic constant can be refined to reduce the error even more
    ; by doing the correction step in a later stage when the number is actually multiplied by 16
    paddw mm4, mmword ptr x0005000200010001
    psubsw mm3, mm6                         ; V60 ; free mm6
    psraw mm0, 1                            ; t154=t156
    movq mm1, mm3                           ; duplicate V60
    pmulhw mm1, mmword ptr x539f539f539f539f ; V67
    movq mm6, mm5                           ; duplicate V3
    psraw mm4, 2                            ; t148=t150
    ;slot
    paddsw mm5, mm4                         ; V61
    psubsw mm6, mm4                         ; V62 ; free mm4
    movq mm4, mm5                           ; duplicate V61
    psllw mm1, 1                            ; t169
    paddsw mm5, mm0                         ; V65 -> result
    psubsw mm4, mm0                         ; V64 ; free mm0
    pmulhw mm4, mmword ptr x5a825a825a825a82 ; V68
    psraw mm3, 1                            ; t158
    psubsw mm3, mm6                         ; V66
    movq mm2, mm5                           ; duplicate V65
    pmulhw mm3, mmword ptr x61f861f861f861f8 ; V70
    psllw mm6, 1                            ; t165
    pmulhw mm6, mmword ptr x4546454645464546 ; V69
    psraw mm2, 1                            ; t172
    ;moved from next block
    movq mm0, mmword ptr [esi+8*5]          ; V56
    psllw mm4, 1                            ; t174
    ;moved from next block
    psraw mm0, 1                            ; t177=t188
    nop                                     ; slot
    psubsw mm6, mm3                         ; V72
    psubsw mm3, mm1                         ; V71 ; free mm1
    psubsw mm6, mm2                         ; V73 ; free mm2
    ;moved from next block
    psraw mm5, 1                            ; t178=t189
    psubsw mm4, mm6                         ; V74
    ;moved from next block
    movq mm1, mm0                           ; duplicate t177=t188
    paddsw mm3, mm4                         ; V75
    ;moved from next block
    paddsw mm0, mm5                         ; tm1

    ;location
    ; 5 - V56
    ; 13 - V57
    ; 9 - V58
    ; X - V59, mm7
    ; X - V65, mm5
    ; X - V73, mm6
    ; X - V74, mm4
    ; X - V75, mm3
    ; free mm0, mm1 & mm2
    ;move above
    ;movq mm0, mmword ptr [esi+8*5]         ; V56
    ;psllw mm0, 1                           ; t177=t188 ! new !!
    ;psllw mm5, 1                           ; t178=t189 ! new !!
    ;movq mm1, mm0                          ; duplicate t177=t188
    ;paddsw mm0, mm5                        ; tm1
    movq mm2, mmword ptr [esi+8*13]         ; V57
    psubsw mm1, mm5                         ; tm15; free mm5
    movq mmword ptr [esi+8*1], mm0          ; tm1; free mm0
    psraw mm7, 1                            ; t182=t184 ! new !!
    ;save the store as used directly in the transpose
    ;movq mmword ptr [esi+8*15], mm1        ; tm15; free mm1
    movq mm5, mm7                           ; duplicate t182=t184
    psubsw mm7, mm3                         ; tm7
    paddsw mm5, mm3                         ; tm9; free mm3
    ;slot
    movq mm0, mmword ptr [esi+8*9]          ; V58
    movq mm3, mm2                           ; duplicate V57
    movq mmword ptr [esi+8*7], mm7          ; tm7; free mm7
    psubsw mm3, mm6                         ; tm13
    paddsw mm2, mm6                         ; tm3 ; free mm6
    ; moved up from the transpose
    movq mm7, mm3
    ; moved up from the transpose
    punpcklwd mm3, mm1
    movq mm6, mm0                           ; duplicate V58
    movq mmword ptr [esi+8*3], mm2          ; tm3; free mm2
    paddsw mm0, mm4                         ; tm5
    psubsw mm6, mm4                         ; tm11; free mm4
    ; moved up from the transpose
    punpckhwd mm7, mm1
    movq mmword ptr [esi+8*5], mm0          ; tm5; free mm0
    ; moved up from the transpose
    movq mm2, mm5

    ; transpose - M4 part
    ;  ---------       ---------
    ; | M1 | M2 |     | M1'| M3'|
    ;  ---------  -->  ---------
    ; | M3 | M4 |     | M2'| M4'|
    ;  ---------       ---------
    ; Two alternatives: use full mmword approach so the following code can be
    ; scheduled before the transpose is done without stores, or use the faster
    ; half mmword stores (when possible)
    movdf dword ptr [esi+8*9+4], mm3        ; MS part of tmt9
    punpcklwd mm5, mm6
    movdf dword ptr [esi+8*13+4], mm7       ; MS part of tmt13
    punpckhwd mm2, mm6
    movdf dword ptr [esi+8*9], mm5          ; LS part of tmt9
    punpckhdq mm5, mm3                      ; free mm3
    movdf dword ptr [esi+8*13], mm2         ; LS part of tmt13
    punpckhdq mm2, mm7                      ; free mm7
    ; moved up from the M3 transpose
    movq mm0, mmword ptr [esi+8*8]
    ;slot
    ; moved up from the M3 transpose
    movq mm1, mmword ptr [esi+8*10]
    ; moved up from the M3 transpose
    movq mm3, mm0
    ; shuffle the rest of the data, and write it with 2 mmword writes
    movq mmword ptr [esi+8*11], mm5         ; tmt11
    ; moved up from the M3 transpose
    punpcklwd mm0, mm1
    movq mmword ptr [esi+8*15], mm2         ; tmt15
    ; moved up from the M3 transpose
    punpckhwd mm3, mm1

    ; transpose - M3 part
    ; moved up to previous code section
    ;movq mm0, mmword ptr [esi+8*8]
    ;movq mm1, mmword ptr [esi+8*10]
    ;movq mm3, mm0
    ;punpcklwd mm0, mm1
    ;punpckhwd mm3, mm1
    movq mm6, mmword ptr [esi+8*12]
    ;slot
    movq mm4, mmword ptr [esi+8*14]
    movq mm2, mm6
    ; shuffle the data and write out the lower parts of the transposed in 4 dwords
    punpcklwd mm6, mm4
    movq mm1, mm0
    punpckhdq mm1, mm6
    movq mm7, mm3
    punpckhwd mm2, mm4                      ; free mm4
    ;slot
    punpckldq mm0, mm6                      ; free mm6
    ;slot
    ;moved from next block
    movq mm4, mmword ptr [esi+8*13]         ; tmt13
    punpckldq mm3, mm2
    punpckhdq mm7, mm2                      ; free mm2
    ;moved from next block
    movq mm5, mm3                           ; duplicate tmt5

    ; column 1: even part (after transpose)
    ;moved above
    ;movq mm5, mm3                          ; duplicate tmt5
    ;movq mm4, mmword ptr [esi+8*13]        ; tmt13
    psubsw mm3, mm4                         ; V134
    ;slot
    pmulhw mm3, mmword ptr x5a825a825a825a82 ; 23170 ->V136
    ;slot
    movq mm6, mmword ptr [esi+8*9]          ; tmt9
    paddsw mm5, mm4                         ; V135 ; mm4 free
    movq mm4, mm0                           ; duplicate tmt1
    paddsw mm0, mm6                         ; V137
    psubsw mm4, mm6                         ; V138 ; mm6 free
    psllw mm3, 2                            ; t290
    psubsw mm3, mm5                         ; V139
    movq mm6, mm0                           ; duplicate V137
    paddsw mm0, mm5                         ; V140
    movq mm2, mm4                           ; duplicate V138
    paddsw mm2, mm3                         ; V141
    psubsw mm4, mm3                         ; V142 ; mm3 free
    movq mmword ptr [esi+8*9], mm0          ; V140
    psubsw mm6, mm5                         ; V143 ; mm5 free
    ;moved from next block
    movq mm0, mmword ptr [esi+8*11]         ; tmt11
    ;slot
    movq mmword ptr [esi+8*13], mm2         ; V141
    ;moved from next block
    movq mm2, mm0                           ; duplicate tmt11

    ; column 1: odd part (after transpose)
    ;moved up to the prev block
    ;movq mm0, mmword ptr [esi+8*11]        ; tmt11
    ;movq mm2, mm0                          ; duplicate tmt11
    movq mm5, mmword ptr [esi+8*15]         ; tmt15
    psubsw mm0, mm7                         ; V144
    movq mm3, mm0                           ; duplicate V144
    paddsw mm2, mm7                         ; V147 ; free mm7
    pmulhw mm0, mmword ptr x539f539f539f539f ; 21407-> V151
    movq mm7, mm1                           ; duplicate tmt3
    paddsw mm7, mm5                         ; V145
    psubsw mm1, mm5                         ; V146 ; free mm5
    psubsw mm3, mm1                         ; V150
    movq mm5, mm7                           ; duplicate V145
    pmulhw mm1, mmword ptr x4546454645464546 ; 17734-> V153
    psubsw mm5, mm2                         ; V148
    pmulhw mm3, mmword ptr x61f861f861f861f8 ; 25080-> V154
    psllw mm0, 2                            ; t311
    pmulhw mm5, mmword ptr x5a825a825a825a82 ; 23170-> V152
    paddsw mm7, mm2                         ; V149 ; free mm2
    psllw mm1, 1                            ; t313
    nop                                     ; slot
    ;without the nop above - freeze here for one clock
    ;the nop cleans the mess a little bit
    movq mm2, mm3                           ; duplicate V154
    psubsw mm3, mm0                         ; V155 ; free mm0
    psubsw mm1, mm2                         ; V156 ; free mm2
    ;moved from the next block
    movq mm2, mm6                           ; duplicate V143
    ;moved from the next block
    movq mm0, mmword ptr [esi+8*13]         ; V141
    psllw mm1, 1                            ; t315
    psubsw mm1, mm7                         ; V157 (keep V149)
    psllw mm5, 2                            ; t317
    psubsw mm5, mm1                         ; V158
    psllw mm3, 1                            ; t319
    paddsw mm3, mm5                         ; V159
    ;slot

    ; column 1: output butterfly (after transform)
    ;moved to the prev block
    ;movq mm2, mm6                          ; duplicate V143
    ;movq mm0, mmword ptr [esi+8*13]        ; V141
    psubsw mm2, mm3                         ; V163
    paddsw mm6, mm3                         ; V164 ; free mm3
    movq mm3, mm4                           ; duplicate V142
    psubsw mm4, mm5                         ; V165 ; free mm5
    movq mmword ptr scratch7, mm2           ; out7
    psraw mm6, 4
    psraw mm4, 4
    paddsw mm3, mm5                         ; V162
    movq mm2, mmword ptr [esi+8*9]          ; V140
    movq mm5, mm0                           ; duplicate V141
    ;in order not to percolate this line up, we read [esi+8*9] very near to this location
    movq mmword ptr [esi+8*9], mm6          ; out9
    paddsw mm0, mm1                         ; V161
    movq mmword ptr scratch5, mm3           ; out5
    psubsw mm5, mm1                         ; V166 ; free mm1
    movq mmword ptr [esi+8*11], mm4         ; out11
    psraw mm5, 4
    movq mmword ptr scratch3, mm0           ; out3
    movq mm4, mm2                           ; duplicate V140
    movq mmword ptr [esi+8*13], mm5         ; out13
    paddsw mm2, mm7                         ; V160
    ;moved from the next block
    movq mm0, mmword ptr [esi+8*1]
    psubsw mm4, mm7                         ; V167 ; free mm7
    ;moved from the next block
    movq mm7, mmword ptr [esi+8*3]
    psraw mm4, 4
    movq mmword ptr scratch1, mm2           ; out1
    ;moved from the next block
    movq mm1, mm0
    movq mmword ptr [esi+8*15], mm4         ; out15
    ;moved from the next block
    punpcklwd mm0, mm7

    ; transpose - M2 parts
    ;moved up to the prev block
    ;movq mm0, mmword ptr [esi+8*1]
    ;movq mm7, mmword ptr [esi+8*3]
    ;movq mm1, mm0
    ;punpcklwd mm0, mm7
    movq mm5, mmword ptr [esi+8*5]
    punpckhwd mm1, mm7
    movq mm4, mmword ptr [esi+8*7]
    movq mm3, mm5
    ; shuffle the data and write out the lower parts of the transposed in 4 dwords
    movdf dword ptr [esi+8*8], mm0          ; LS part of tmt8
    punpcklwd mm5, mm4
    movdf dword ptr [esi+8*12], mm1         ; LS part of tmt12
    punpckhwd mm3, mm4
    movdf dword ptr [esi+8*8+4], mm5        ; MS part of tmt8
    punpckhdq mm0, mm5                      ; tmt10
    movdf dword ptr [esi+8*12+4], mm3       ; MS part of tmt12
    punpckhdq mm1, mm3                      ; tmt14

    ; transpose - M1 parts
    movq mm7, mmword ptr [esi]
    ;slot
    movq mm2, mmword ptr [esi+8*2]
    movq mm6, mm7
    movq mm5, mmword ptr [esi+8*4]
    punpcklwd mm7, mm2
    movq mm4, mmword ptr [esi+8*6]
    punpckhwd mm6, mm2                      ; free mm2
    movq mm3, mm5
    punpcklwd mm5, mm4
    punpckhwd mm3, mm4                      ; free mm4
    movq mm2, mm7
    movq mm4, mm6
    punpckldq mm7, mm5                      ; tmt0
    punpckhdq mm2, mm5                      ; tmt2 ; free mm5
    ;slot
    ; shuffle the rest of the data, and write it with 2 mmword writes
    punpckldq mm6, mm3                      ; tmt4
    ;move from next block
    movq mm5, mm2                           ; duplicate tmt2
    punpckhdq mm4, mm3                      ; tmt6 ; free mm3
    ;move from next block
    movq mm3, mm0                           ; duplicate tmt10

    ; column 0: odd part (after transpose)
    ;moved up to prev block
    ;movq mm3, mm0                          ; duplicate tmt10
    ;movq mm5, mm2                          ; duplicate tmt2
    psubsw mm0, mm4                         ; V110
    paddsw mm3, mm4                         ; V113 ; free mm4
    movq mm4, mm0                           ; duplicate V110
    paddsw mm2, mm1                         ; V111
    pmulhw mm0, mmword ptr x539f539f539f539f ; 21407-> V117
    psubsw mm5, mm1                         ; V112 ; free mm1
    psubsw mm4, mm5                         ; V116
    movq mm1, mm2                           ; duplicate V111
    pmulhw mm5, mmword ptr x4546454645464546 ; 17734-> V119
    psubsw mm2, mm3                         ; V114
    pmulhw mm4, mmword ptr x61f861f861f861f8 ; 25080-> V120
    paddsw mm1, mm3                         ; V115 ; free mm3
    pmulhw mm2, mmword ptr x5a825a825a825a82 ; 23170-> V118
    psllw mm0, 2                            ; t266
    movq mmword ptr [esi+8*0], mm1          ; save V115
    psllw mm5, 1                            ; t268
    psubsw mm5, mm4                         ; V122
    psubsw mm4, mm0                         ; V121 ; free mm0
    psllw mm5, 1                            ; t270
    ;slot
    psubsw mm5, mm1                         ; V123 ; free mm1
    psllw mm2, 2                            ; t272
    psubsw mm2, mm5                         ; V124 (keep V123)
    psllw mm4, 1                            ; t274
    movq mmword ptr [esi+8*2], mm5          ; save V123 ; free mm5
    paddsw mm4, mm2                         ; V125 (keep V124)

    ; column 0: even part (after transpose)
    movq mm0, mmword ptr [esi+8*12]         ; tmt12
    movq mm3, mm6                           ; duplicate tmt4
    psubsw mm6, mm0                         ; V100
    paddsw mm3, mm0                         ; V101 ; free mm0
    pmulhw mm6, mmword ptr x5a825a825a825a82 ; 23170 ->V102
    movq mm5, mm7                           ; duplicate tmt0
    movq mm1, mmword ptr [esi+8*8]          ; tmt8
    ;slot
    paddsw mm7, mm1                         ; V103
    psubsw mm5, mm1                         ; V104 ; free mm1
    movq mm0, mm7                           ; duplicate V103
    psllw mm6, 2                            ; t245
    paddsw mm7, mm3                         ; V106
    movq mm1, mm5                           ; duplicate V104
    psubsw mm6, mm3                         ; V105
    psubsw mm0, mm3                         ; V109; free mm3
    paddsw mm5, mm6                         ; V107
    psubsw mm1, mm6                         ; V108 ; free mm6

    ; column 0: output butterfly (after transform)
    movq mm3, mm1                           ; duplicate V108
    paddsw mm1, mm2                         ; out4
    psraw mm1, 4
    psubsw mm3, mm2                         ; out10 ; free mm2
    psraw mm3, 4
    movq mm6, mm0                           ; duplicate V109
    movq mmword ptr [esi+8*4], mm1          ; out4 ; free mm1
    psubsw mm0, mm4                         ; out6
    movq mmword ptr [esi+8*10], mm3         ; out10 ; free mm3
    psraw mm0, 4
    paddsw mm6, mm4                         ; out8 ; free mm4
    movq mm1, mm7                           ; duplicate V106
    movq mmword ptr [esi+8*6], mm0          ; out6 ; free mm0
    psraw mm6, 4
    movq mm4, mmword ptr [esi+8*0]          ; V115
    ;slot
    movq mmword ptr [esi+8*8], mm6          ; out8 ; free mm6
    movq mm2, mm5                           ; duplicate V107
    movq mm3, mmword ptr [esi+8*2]          ; V123
    paddsw mm7, mm4                         ; out0
    ;moved up from next block
    movq mm0, mmword ptr scratch3
    psraw mm7, 4
    ;moved up from next block
    movq mm6, mmword ptr scratch5
    psubsw mm1, mm4                         ; out14 ; free mm4
    paddsw mm5, mm3                         ; out2
    psraw mm1, 4
    movq mmword ptr [esi], mm7              ; out0 ; free mm7
    psraw mm5, 4
    movq mmword ptr [esi+8*14], mm1         ; out14 ; free mm1
    psubsw mm2, mm3                         ; out12 ; free mm3
    movq mmword ptr [esi+8*2], mm5          ; out2 ; free mm5
    psraw mm2, 4
    ;moved up to the prev block
    movq mm4, mmword ptr scratch7
    ;moved up to the prev block
    psraw mm0, 4
    movq mmword ptr [esi+8*12], mm2         ; out12 ; free mm2
    ;moved up to the prev block
    psraw mm6, 4

    ;move back the data to its correct place
    ;moved up to the prev block
    ;movq mm0, mmword ptr scratch3
    ;movq mm6, mmword ptr scratch5
    ;movq mm4, mmword ptr scratch7
    ;psraw mm0, 4
    ;psraw mm6, 4
    movq mm1, mmword ptr scratch1
    psraw mm4, 4
    movq mmword ptr [esi+8*3], mm0          ; out3
    psraw mm1, 4
    movq mmword ptr [esi+8*5], mm6          ; out5
    ;slot
    movq mmword ptr [esi+8*7], mm4          ; out7
    ;slot
    movq mmword ptr [esi+8*1], mm1          ; out1
    ;slot
    emms
    pop esi
    pop ebp
    ret 0
_idct8x8aan ENDP
_TEXT ENDS
END
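
When checking the output of the listing above, it may help to have a slow
but simple reference for the 2D iDCT defined in Section 2.0. The following
C code is a minimal sketch of that definition, evaluated as a summation
over v followed by a summation over u. It is not part of the original
application note: the function name idct8x8_ref, the array names F (input
coefficients, indexed [u][v]) and f (output, indexed [x][y]), and the use
of double-precision arithmetic are all assumptions made for illustration,
so its results are not bit-exact with the 16-bit MMX code.

```c
/* cos(k*pi/16) for k = 0..8, tabulated so the sketch needs no math library. */
static const double COS_PI_16[9] = {
    1.0,
    0.9807852804032304,
    0.9238795325112867,
    0.8314696123025452,
    0.7071067811865476,
    0.5555702330196022,
    0.3826834323650898,
    0.1950903220161283,
    0.0
};

/* cos(m*pi/16) for a non-negative integer m, using the period of 32 steps
 * and the symmetries cos(2*pi - t) = cos(t) and cos(pi - t) = -cos(t). */
static double cos_pi16(int m)
{
    m %= 32;
    if (m > 16)
        m = 32 - m;
    if (m > 8)
        return -COS_PI_16[16 - m];
    return COS_PI_16[m];
}

/* Normalization factor from Section 2.0: 1/sqrt(8) for index 0, else 1/2. */
static double alpha(int w)
{
    return (w == 0) ? 0.3535533905932738 : 0.5;
}

/* Hypothetical reference routine: F[u][v] in, f[x][y] out. */
void idct8x8_ref(double F[8][8], double f[8][8])
{
    double tmp[8][8];

    /* Pass 1: for each u, a 1D iDCT along v (the inner summation). */
    for (int u = 0; u < 8; u++) {
        for (int y = 0; y < 8; y++) {
            double s = 0.0;
            for (int v = 0; v < 8; v++)
                s += alpha(v) * F[u][v] * cos_pi16((2 * y + 1) * v);
            tmp[u][y] = s;
        }
    }

    /* Pass 2: for each y, a 1D iDCT along u (the outer summation). */
    for (int x = 0; x < 8; x++) {
        for (int y = 0; y < 8; y++) {
            double s = 0.0;
            for (int u = 0; u < 8; u++)
                s += alpha(u) * tmp[u][y] * cos_pi16((2 * x + 1) * u);
            f[x][y] = s;
        }
    }
}
```

The two passes use 2 x 8 x 8 x 8 = 1024 multiply-accumulate steps on the
coefficients instead of the 4096 multiplications of the direct double sum,
which is the row-column saving that Section 2.0 describes. A block whose
only nonzero coefficient is F[0][0] = 8 should come back as a constant
block of ones, since alpha(0) squared is 1/8.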

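
The listing in Section 6.0 leans on a handful of packed-word operations:
pmulhw for the constant multiplies, paddsw for saturating adds, and the
punpck family for the 4x4 transposes of Section 3.2. The C code below is
a scalar sketch of how those operations behave on four 16-bit words, and
of how four unpack pairs produce a 4x4 transpose in the same order as the
"transpose - M1 parts" block. The type mm4 and every emu_ function name
are hypothetical, introduced here only for illustration; the code models
the instruction semantics and is not a substitute for the MMX routine.

```c
#include <stdint.h>

/* One MMX register viewed as four packed 16-bit words; w[0] is the least
 * significant word. Illustrative type, not from the application note. */
typedef struct { int16_t w[4]; } mm4;

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* pmulhw: signed 16x16 multiply, keeping the high 16 bits of each product.
 * Assumes >> on a negative int32_t is an arithmetic shift, as it is on
 * common compilers. */
mm4 emu_pmulhw(mm4 a, mm4 b)
{
    mm4 r;
    for (int i = 0; i < 4; i++) {
        int32_t p = (int32_t)a.w[i] * (int32_t)b.w[i];
        r.w[i] = (int16_t)(p >> 16);
    }
    return r;
}

/* paddsw: signed add with saturation instead of wrap-around. */
mm4 emu_paddsw(mm4 a, mm4 b)
{
    mm4 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = sat16((int32_t)a.w[i] + (int32_t)b.w[i]);
    return r;
}

/* punpcklwd / punpckhwd: interleave the low or high word pairs. */
mm4 emu_punpcklwd(mm4 a, mm4 b)
{
    mm4 r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
    return r;
}

mm4 emu_punpckhwd(mm4 a, mm4 b)
{
    mm4 r = {{ a.w[2], b.w[2], a.w[3], b.w[3] }};
    return r;
}

/* punpckldq / punpckhdq: interleave the low or high doublewords. */
mm4 emu_punpckldq(mm4 a, mm4 b)
{
    mm4 r = {{ a.w[0], a.w[1], b.w[0], b.w[1] }};
    return r;
}

mm4 emu_punpckhdq(mm4 a, mm4 b)
{
    mm4 r = {{ a.w[2], a.w[3], b.w[2], b.w[3] }};
    return r;
}

/* 4x4 transpose: in[i] is row i, out[j] receives column j. */
void emu_transpose4x4(const mm4 in[4], mm4 out[4])
{
    mm4 t0 = emu_punpcklwd(in[0], in[1]);   /* r00 r10 r01 r11 */
    mm4 t1 = emu_punpckhwd(in[0], in[1]);   /* r02 r12 r03 r13 */
    mm4 t2 = emu_punpcklwd(in[2], in[3]);   /* r20 r30 r21 r31 */
    mm4 t3 = emu_punpckhwd(in[2], in[3]);   /* r22 r32 r23 r33 */

    out[0] = emu_punpckldq(t0, t2);         /* column 0 */
    out[1] = emu_punpckhdq(t0, t2);         /* column 1 */
    out[2] = emu_punpckldq(t1, t3);         /* column 2 */
    out[3] = emu_punpckhdq(t1, t3);         /* column 3 */
}
```

Because pmulhw keeps only the high word, multiplying by a constant such as
23170 (approximately 2^15 x cos(pi/4)) returns the product scaled down by
a further factor of two. That is one plausible reading of why packed shifts
follow many of the multiplies in the listing; the exact shift counts there
also reflect the per-stage precision choices described in Section 3.0.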